São Paulo's public digital archives have an estimated redundancy rate of 40 percent in their photographic holdings, meaning roughly two in every five images stored across municipal systems are near-identical copies of files already on the servers. That figure, drawn from a 2025 audit conducted by the Secretaria Municipal de Inovação e Tecnologia, was quietly circulated among city departments earlier this year and has begun reshaping how institutions from Vila Madalena to the Centro Histórico think about data storage budgets.
The timing matters. The city's digital infrastructure has been under pressure since Mayor Ricardo Nunes's administration committed in late 2024 to digitising roughly 1.2 million physical records held by the Arquivo Histórico Municipal, located on Rua Cantareira in the city centre. Scanning campaigns generate duplicate images almost by design — operators rescan unclear pages, software creates automatic backup copies, and multiple departments sometimes commission independent scans of the same document. Without automated deduplication pipelines, those redundant files accumulate silently.
What the Storage Bills Actually Look Like
Cloud storage is not free. São Paulo's municipal government spends roughly R$18 million a year on data storage contracts, according to procurement records published on the Portal da Transparência do Município de São Paulo. Analysts who work with public-sector IT procurement say duplicate image files, which can run between 8 and 25 megabytes each for high-resolution scans, represent one of the most straightforward categories of recoverable waste in any large institutional archive. The São Paulo State Government's Casa Civil has separately identified image deduplication as a priority line item in its 2026 digital governance roadmap, published in March of this year.
Private institutions are grappling with the same problem at scale. The Instituto Moreira Salles, which operates a photography archive and cultural centre on Avenida Paulista, manages a collection of more than 8 million images, including historical photographs of São Paulo dating to the late 19th century. Institutions of that size typically see duplication rates between 15 and 30 percent when ingesting large donated collections, according to published benchmarks from the International Council on Archives. At those rates, the IMS archive could be carrying between 1.2 million and 2.4 million redundant files — each occupying server space and complicating search and retrieval.
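The arithmetic behind that range can be checked in a few lines. The sketch below is illustrative only: the collection size and duplication rates come from the figures above, while applying the 8 to 25 megabyte per-scan range to this particular archive is an assumption made here to show how a storage estimate would follow, not a reported figure. The function names are hypothetical.

```python
# Figures quoted in the article.
COLLECTION_SIZE = 8_000_000        # images in the collection
DUPLICATION_PERCENT = (15, 30)     # benchmark duplication range, in percent

# ASSUMPTION: the per-file size range quoted for high-resolution scans
# is applied to this archive purely for illustration.
SCAN_SIZE_MB = (8, 25)


def redundant_files(total_images: int, percent: int) -> int:
    """Number of files that are duplicates at a given duplication rate."""
    return total_images * percent // 100


def wasted_terabytes(files: int, size_mb: int) -> float:
    """Storage occupied by duplicates, in decimal terabytes (1 TB = 1,000,000 MB)."""
    return files * size_mb / 1_000_000


low_files = redundant_files(COLLECTION_SIZE, DUPLICATION_PERCENT[0])
high_files = redundant_files(COLLECTION_SIZE, DUPLICATION_PERCENT[1])

# Best case: fewest duplicates at the smallest file size.
# Worst case: most duplicates at the largest file size.
low_tb = wasted_terabytes(low_files, SCAN_SIZE_MB[0])
high_tb = wasted_terabytes(high_files, SCAN_SIZE_MB[1])

print(f"Redundant files: {low_files:,} to {high_files:,}")
print(f"Illustrative wasted storage: {low_tb:.1f} TB to {high_tb:.1f} TB")
```

The wide gap between the two storage figures shows why an audit matters: the cost of duplication depends as much on average file size as on the duplication rate itself.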
The city's journalism ecosystem is feeling the strain too. Newsrooms clustered around the Consolação and Jardins neighbourhoods have accelerated their own digitisation efforts, driven partly by the loss of physical print archives in flooding events that have repeatedly hit basement-level storage facilities across the city. Flood damage to ground-floor and subterranean archives in buildings near the Rua da Consolação and along the Marginal Tietê corridor has made digital backup feel urgent, but that urgency has also meant rushed ingestion processes that skip deduplication steps.
What the Fix Involves — and What It Costs
Automated deduplication tools use perceptual hashing algorithms to compare images and flag near-identical files even when file names, formats or metadata differ. Software licences for enterprise-grade tools range from R$12,000 to R$90,000 per year depending on archive size, with open-source alternatives available at no licensing cost but requiring in-house technical staff to implement. The Universidade de São Paulo's Instituto de Matemática e Estatística, on Rua do Matão in the Cidade Universitária campus, has published research on low-cost hashing pipelines specifically adapted to Portuguese-language archival workflows — work that several municipal departments have cited in internal planning documents.
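To make the idea of perceptual hashing concrete, here is a minimal sketch of one common variant, the difference hash. It is not the method used by any tool or study mentioned above. It assumes each image has already been decoded, converted to grayscale and shrunk to a small grid of brightness values (a real pipeline would do that with an imaging library); the function names and the distance threshold are hypothetical choices for illustration.

```python
def dhash(pixels: list[list[int]]) -> int:
    """Difference hash: one bit per pair of horizontally adjacent pixels.

    A bit is 1 when the left pixel is brighter than its right neighbour,
    so the hash captures the brightness gradient rather than exact values.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions at which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(grid_a, grid_b, max_distance: int = 4) -> bool:
    """Flag two images as near-identical when their hashes differ in few bits.

    max_distance is an illustrative threshold, not a recommended value.
    """
    return hamming_distance(dhash(grid_a), dhash(grid_b)) <= max_distance


# Toy 4x5 grayscale grids standing in for downscaled scans.
original = [
    [10, 40, 90, 60, 20],
    [15, 45, 95, 65, 25],
    [200, 150, 100, 50, 0],
    [0, 50, 100, 150, 200],
]
# A rescan of the same page, uniformly brighter by 10 levels.
rescan = [[min(255, p + 10) for p in row] for row in original]
# An unrelated page with different brightness gradients.
different = [
    [90, 60, 20, 40, 80],
    [95, 65, 25, 45, 85],
    [0, 50, 100, 150, 200],
    [200, 150, 100, 50, 0],
]

print(is_near_duplicate(original, rescan))     # same gradients, so flagged
print(is_near_duplicate(original, different))  # gradients differ, not flagged
```

Because the hash depends on relative brightness between neighbouring pixels rather than on file bytes, a brighter rescan or a copy saved in another format can still match, which is what lets these tools ignore differences in file names and metadata.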
For institutions still sitting on the problem, the practical next step is a storage audit before the end of the third quarter. Municipal departments that complete deduplication before the 2027 budget cycle begins in October stand a better chance of reallocating recovered storage capacity rather than requesting new procurement. The Secretaria Municipal de Inovação e Tecnologia has indicated it plans to publish deduplication guidelines for city-affiliated organisations by September 2026. For private archives and newsrooms, the calculus is simpler: every month of inaction adds to a storage bill that compounds as collections grow, and the images do not delete themselves.