At least 34 percent of image files held across São Paulo's municipal digital infrastructure are duplicates — identical or near-identical copies stored under different filenames, consuming server space and slowing public records retrieval. That figure comes from an internal systems review conducted by the city's Secretaria Municipal de Inovação e Tecnologia, which manages digital services for a metropolitan population of roughly 22 million people.
The problem sounds mundane. It isn't. São Paulo's government digitised decades of urban planning maps, flood-zone photographs, and infrastructure documentation during a push that accelerated between 2019 and 2023. The rush to upload went ahead with almost no deduplication protocol in place. The result: bloated repositories, slower query times on public-facing platforms, and, in some documented cases, conflicting versions of the same flood-risk map being served to different departments simultaneously. The consequences are obvious for a city that lost over R$2.1 billion to flood damage in the summer of 2024 alone.
What the Data Actually Shows
The Secretaria's review, which covered servers at the Centro de Processamento de Dados do Município on Rua São Bento and at a secondary facility in Lapa, found approximately 4.7 petabytes of image data stored across both sites as of March 2026. Of that, the audit flagged roughly 1.6 petabytes as recoverable through automated deduplication, equivalent to eliminating around R$18 million a year in storage costs the city currently pays to a private cloud contractor.
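The audit figures can be checked against the headline rate with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The variable names are illustrative, and the cost per petabyte is an implied figure that assumes the R$18 million relates to the recoverable 1.6 petabytes; it is not a number published in the review.

```python
# Figures as reported in the Secretaria's review (rounded in the source).
total_pb = 4.7            # image data across both sites, in petabytes
recoverable_pb = 1.6      # flagged as recoverable through deduplication
annual_saving_brl_m = 18  # R$ million per year, as quoted

# Share of stored image data flagged as recoverable.
recoverable_share = recoverable_pb / total_pb
print(f"Recoverable share: {recoverable_share:.1%}")  # 34.0%

# Implied storage cost per recoverable petabyte per year (assumption:
# the R$18 million figure refers to the recoverable volume only).
implied_cost_per_pb = annual_saving_brl_m / recoverable_pb
print(f"Implied cost: R${implied_cost_per_pb:.2f} million per PB per year")
```

The recoverable share by volume lines up with the 34 percent duplicate rate, although a share measured in bytes and a share measured in file counts need not match exactly.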
The duplication rate is not uniform. Departments dealing with visual documentation — Urban Planning, Civil Defence, and the transport authority SPTrans — showed duplication rates above 40 percent in some subdirectories. The city's health secretariat, which implemented a dedicated image management system in January 2023, had a duplication rate closer to 11 percent, suggesting that protocol, not budget, is the decisive variable.
Private-sector researchers at the Instituto de Pesquisas Tecnológicas, headquartered on the Cidade Universitária campus near the Pinheiros district, published a related analysis in May 2026 examining how duplicate imagery degrades machine-learning models trained on urban datasets. Their finding: a training set with a 30 percent duplication rate produced object-recognition accuracy roughly 18 percentage points lower than a cleaned dataset of the same nominal size. For a city deploying computer-vision tools to monitor traffic on Marginal Tietê and identify illegal dumping in Parelheiros, that degradation is operationally significant.
Why Fixing It Is Harder Than It Sounds
Automated deduplication tools exist and are widely deployed elsewhere: the federal Tribunal de Contas da União began mandating deduplication audits for federal agencies in 2022. Municipal governments, however, operate under different procurement rules, and São Paulo's IT contracts historically bundled storage and software together in ways that gave vendors little commercial incentive to flag redundancy.
The Ricardo Nunes administration has not yet published a formal remediation timeline. The Secretaria de Inovação identified three possible approaches in its internal review: full automated hash-matching (estimated cost R$4.2 million, implementation window 18 months), a hybrid human-machine audit prioritising the Civil Defence and Urban Planning directories first (R$7.8 million, 24 months), or a phased renegotiation of cloud contracts to include vendor-managed deduplication as a contractual obligation going forward.
For journalists, researchers, and civic-tech developers working with public datasets — many of whom access São Paulo's open-data portal at dados.prefeitura.sp.gov.br — the practical advice is straightforward: treat any image dataset downloaded before June 2026 as potentially contaminated with duplicates and run a local hash-check before using files for analysis. Tools like ExifTool and Python's imagehash library handle this in minutes on a standard laptop. The numbers underlying São Paulo's digital governance are only as reliable as the files used to generate them, and right now those files need a clean-up.
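For readers who want to run that local check, the following is a minimal sketch that uses only Python's standard library. It groups files by SHA-256 digest, so it catches byte-identical copies stored under different filenames. It does not catch near-identical copies (resized or recompressed images), which would need a perceptual hash such as those provided by imagehash. The function names, the list of file extensions, and the folder layout are assumptions made for illustration, not part of any city tooling.

```python
import hashlib
import sys
from collections import defaultdict
from pathlib import Path

# Extensions treated as images; an illustrative list, adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp"}


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group image files under `root` whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            groups[sha256_of(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]


if __name__ == "__main__":
    # Usage: python find_duplicates.py /path/to/downloaded/images
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for group in find_exact_duplicates(root):
        print(" == ".join(str(p) for p in group))
```

Each printed line lists files with identical contents. Keeping one file per group and setting the rest aside before analysis removes exact duplicates without altering any image.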