São Paulo's public digital infrastructure has a problem that went largely unnoticed because it grew so quietly. Across municipal servers, cultural repositories and the city's sprawling tech sector, duplicate image files now account for a measurable and costly share of total storage consumption, a finding that is reshaping procurement decisions and, in some cases, forcing emergency audits.
The issue matters right now because the Prefeitura de São Paulo, under Mayor Ricardo Nunes, is mid-cycle on a digital transformation drive that budgeted R$480 million over three fiscal years for cloud migration and infrastructure modernisation. Storage inefficiency directly erodes the value of that investment. When duplicate images pile up inside government systems — the kind generated by repeated document scans at Poupatempo service centres, redundant photo uploads on the Nota Fiscal Paulistana portal, or duplicated satellite images inside the Geosampa urban mapping platform — the city pays twice: once to store the file and again, in processing overhead, every time a system indexes it.
The Scale of the Problem in Concrete Numbers
Industry benchmarks from cloud storage audits conducted in comparable Latin American public-sector environments suggest that between 20 and 35 percent of the files in unmanaged image repositories are duplicates or near-duplicates. Apply even the conservative end of that range to São Paulo's context and the figures become significant. Geosampa alone, maintained by the city's Secretaria Municipal de Urbanismo e Licenciamento, holds aerial and satellite imagery covering 1,521 square kilometres of municipal territory. Repeated capture cycles, format conversions and inter-departmental sharing mean the same tile can exist in three or four versions without any automated deduplication catching it.
Private sector actors in the city's tech ecosystem have begun publishing internal findings. Cubo Itaú, the startup hub on Avenida Brigadeiro Faria Lima in Itaim Bibi, hosted a storage optimisation roundtable in May 2026 where participating firms reported median duplicate-image rates of 28 percent across product teams that had never run a formal deduplication audit. One fintech operating out of the hub described recovering 14 terabytes of usable storage in a single weekend after running an open-source perceptual hashing tool across its asset library — the equivalent of eliminating roughly R$3,200 in monthly cloud costs at standard São Paulo data-centre pricing.
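The fintech's figures imply a unit price that readers can check against their own bills. The short calculation below is only a back-of-envelope sketch: it assumes decimal units (1 TB = 1,000 GB) and a flat per-gigabyte rate, neither of which the roundtable report specifies, and the variable names are illustrative.

```python
# Figures quoted at the roundtable
recovered_tb = 14          # terabytes of storage recovered
monthly_saving_brl = 3200  # approximate monthly saving in reais

# Assumption: decimal units, 1 TB = 1,000 GB
recovered_gb = recovered_tb * 1000

# Implied flat storage price per gigabyte per month
price_per_gb_month = monthly_saving_brl / recovered_gb

# The same saving projected over twelve months
annual_saving_brl = monthly_saving_brl * 12

print(f"Implied price: R${price_per_gb_month:.2f} per GB per month")
print(f"Annualised saving: R${annual_saving_brl:,}")
```

Under those assumptions the implied rate is about R$0.23 per gigabyte per month, or R$38,400 a year for that one asset library.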
Cultural institutions face the same arithmetic. The Pinacoteca do Estado, on Praça da Luz in the Luz neighbourhood, digitised more than 10,000 works across successive scanning projects between 2018 and 2024, each campaign generating master files, web-resolution copies and archival TIFF backups. Without a unified deduplication protocol linking those campaigns, staff have reported manually searching for canonical image versions when preparing loans or licensing requests — a workflow inefficiency with a real labour cost attached to it.
What Deduplication Actually Requires — and What Comes Next
Automated duplicate detection relies on one of two approaches: exact-hash matching, which catches byte-for-byte copies and runs cheaply, or perceptual hashing, which identifies images that look the same but differ in file format, resolution or compression. The second method is more computationally expensive but essential for any institution, such as a government archive or a media organisation, where the same photograph arrives via multiple channels and gets saved in different formats along the way.
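The difference between the two approaches is easiest to see side by side. The sketch below is illustrative only: it uses SHA-256 for the exact hash and a simple "average hash" as a stand-in for perceptual hashing, applied to a toy 4x4 grid of grayscale values. Real tools decode the image file and shrink it to a small grayscale grid first; that step is assumed here, and the function names are hypothetical.

```python
import hashlib


def exact_hash(data: bytes) -> str:
    """SHA-256 digest: matches only byte-for-byte identical content."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels: list[list[int]]) -> int:
    """One bit per pixel, set when the pixel is brighter than the grid mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def to_bytes(pixels: list[list[int]]) -> bytes:
    """Flatten the grid into raw bytes (values must be 0-255)."""
    return bytes(p for row in pixels for p in row)


# Toy grayscale "image": dark on the left, bright on the right
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# Slightly brighter copy, standing in for a recompressed or re-exported file
recompressed = [[min(255, p + 3) for p in row] for row in original]
# Mirrored copy: genuinely different content
mirrored = [row[::-1] for row in original]

same_bytes = exact_hash(to_bytes(original)) == exact_hash(to_bytes(recompressed))
near_distance = hamming(average_hash(original), average_hash(recompressed))
far_distance = hamming(average_hash(original), average_hash(mirrored))

print(same_bytes)     # False: the exact hash misses the altered copy
print(near_distance)  # 0: the perceptual hash treats it as the same image
print(far_distance)   # 16: every bit differs for the mirrored grid
```

In practice a perceptual match is declared when the Hamming distance falls under a chosen threshold, and picking that threshold is a trade-off between missed duplicates and false positives.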
The Secretaria Municipal de Inovação e Tecnologia has not yet published a formal deduplication policy, though procurement documents filed through the city's Diário Oficial in early 2026 referenced storage rationalisation as a secondary objective inside a broader data governance tender. Tech advocates working with the Instituto Cidade Democrática, based in Pinheiros, have been pushing for the city to mandate open deduplication standards as a condition of new public cloud contracts.
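For organisations that cannot wait on municipal timelines, the practical path is straightforward: run a hashing audit before any new cloud migration, not after. The open-source rmlint and fdupes utilities are well-tested and widely documented, but both detect exact duplicates only; catching near-duplicates takes a perceptual hashing library such as pHash or the Python ImageHash package. PhotoDNA, though also a perceptual hasher, is a proprietary Microsoft technology licensed mainly for detecting illegal content rather than for storage audits. Budget even half a working week of a junior developer's time on this before signing a storage expansion contract, and the return is almost always measurable within the first billing cycle.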
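A pre-migration audit of this kind can start very small. The following is a minimal sketch, not a replacement for the dedicated tools: it walks a directory, groups image files by size, hashes only the files whose size collides, and reports how many bytes the redundant copies occupy. The extension list, the `audit` function name and the choice of SHA-256 are assumptions made for illustration, and the script covers exact duplicates only.

```python
import hashlib
import os
from collections import defaultdict

# Assumed set of image extensions; adjust to the repository being audited
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp", ".webp"}


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large TIFFs do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(root):
    """Return (duplicate groups, reclaimable bytes) for image files under root."""
    # Pass 1: group by size, since files of different sizes cannot be identical
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file
    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            groups[(size, file_digest(path))].append(path)

    duplicates = {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
    # Every copy beyond the first in a group is reclaimable space
    reclaimable = sum(size * (len(paths) - 1) for (size, _), paths in duplicates.items())
    return duplicates, reclaimable
```

Calling `audit("/path/to/assets")` returns the groups of identical files and the total bytes that could be freed by keeping one copy of each. The script only reports; deciding which copy is canonical remains a human decision.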