The Daily São Paulo

São Paulo news, every day


São Paulo's Digital Archives Are Drowning in Duplicate Images — and the Numbers Reveal How Bad It Has Gotten

Municipal databases, cultural institutions and newsrooms across the city are sitting on terabytes of redundant visual data, and the cost of doing nothing is no longer negligible.

By São Paulo News Desk · Published 4 July 2026, 4:16 pm


Photo: Felipe Balduino on Pexels

São Paulo's public digital archives carry an estimated 40 percent redundancy rate in their photographic holdings, meaning roughly two in every five images stored across municipal systems are near-identical copies of files already on the servers. That figure, drawn from a 2025 audit conducted by the Secretaria Municipal de Inovação e Tecnologia, was quietly circulated among city departments earlier this year and has begun reshaping how institutions from Vila Madalena to the Centro Histórico think about data storage budgets.

The timing matters. The city's digital infrastructure has been under pressure since Mayor Ricardo Nunes's administration committed in late 2024 to digitising roughly 1.2 million physical records held by the Arquivo Histórico Municipal, located on Rua Cantareira in the city centre. Scanning campaigns generate duplicate images almost by design — operators rescan unclear pages, software creates automatic backup copies, and multiple departments sometimes commission independent scans of the same document. Without automated deduplication pipelines, those redundant files accumulate silently.

What the Storage Bills Actually Look Like

Cloud storage is not free. São Paulo's municipal government spends roughly R$18 million a year on data storage contracts, according to procurement records published on the Portal da Transparência do Município de São Paulo. Analysts who work with public-sector IT procurement say duplicate image files, which can run between 8 and 25 megabytes each for high-resolution scans, represent one of the most straightforward categories of recoverable waste in any large institutional archive. The São Paulo State Government's Casa Civil has separately identified image deduplication as a priority line item in its 2026 digital governance roadmap, published in March.

Private institutions are grappling with the same problem at scale. The Instituto Moreira Salles, which operates a photography archive and cultural centre on Avenida Paulista, manages a collection of more than 8 million images, including historical photographs of São Paulo dating to the late 19th century. Institutions of that size typically see duplication rates between 15 and 30 percent when ingesting large donated collections, according to published benchmarks from the International Council on Archives. At those rates, the IMS archive could be carrying between 1.2 million and 2.4 million redundant files — each occupying server space and complicating search and retrieval.
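The range quoted above follows from simple arithmetic, sketched here in Python. The collection size and the 15 to 30 percent duplication rates come from the article. The storage estimate is only an illustration: it assumes, hypothetically, that the 8 to 25 megabyte file sizes quoted earlier for high-resolution scans also apply to this archive, which the article does not state.

```python
# Figures taken from the article.
COLLECTION_SIZE = 8_000_000          # images in the archive (stated as "more than")
DUP_RATE_PERCENT = (15, 30)          # benchmark duplication range, in percent

# ASSUMPTION (not from the article): per-file sizes for this archive match
# the 8-25 MB range quoted for high-resolution scans in general.
MB_PER_FILE = (8, 25)


def redundant_files(total: int, rate_percent: int) -> int:
    """Number of redundant files at a given duplication rate."""
    return total * rate_percent // 100


def storage_tb(files: int, mb_per_file: int) -> float:
    """Storage taken up by `files`, in decimal terabytes (1 TB = 1,000,000 MB)."""
    return files * mb_per_file / 1_000_000


low_files = redundant_files(COLLECTION_SIZE, DUP_RATE_PERCENT[0])
high_files = redundant_files(COLLECTION_SIZE, DUP_RATE_PERCENT[1])
low_tb = storage_tb(low_files, MB_PER_FILE[0])
high_tb = storage_tb(high_files, MB_PER_FILE[1])

print(f"Redundant files: {low_files:,} to {high_files:,}")
print(f"Illustrative storage: {low_tb:.1f} TB to {high_tb:.1f} TB")
```

Under those assumptions the redundant copies would occupy somewhere between about 10 and 60 terabytes. The width of that range shows how sensitive any estimate is to the actual file sizes.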

The city's journalism ecosystem is feeling it too. Newsrooms clustered around the Consolação and Jardins neighbourhoods have accelerated their own digitisation efforts, driven partly by the loss of physical print archives in flooding events that have repeatedly hit basement-level storage facilities across the city. Flood damage to ground-floor and subterranean archives in buildings near the Rua da Consolação and along the Marginal Tietê corridor has made digital redundancy feel urgent — but urgency has also meant rushed ingestion processes that skip deduplication steps.

What the Fix Involves — and What It Costs

Automated deduplication tools use perceptual hashing algorithms to compare images and flag near-identical files even when file names, formats or metadata differ. Software licences for enterprise-grade tools range from R$12,000 to R$90,000 per year depending on archive size, while open-source alternatives carry no licensing cost but require in-house technical staff to implement. The Universidade de São Paulo's Instituto de Matemática e Estatística, on Rua do Matão on the Cidade Universitária campus, has published research on low-cost hashing pipelines specifically adapted to Portuguese-language archival workflows, work that several municipal departments have cited in internal planning documents.
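To make the idea of perceptual hashing concrete, here is a minimal Python sketch of one common variant, a difference hash (dHash). It is not the method used by any tool or study named in this article. Images are represented as plain lists of grayscale values so the example needs no third-party library, and the function names and the distance threshold of 5 bits are illustrative assumptions.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def resize_nearest(img: Gray, width: int, height: int) -> Gray:
    """Shrink an image to width x height by nearest-neighbour sampling."""
    src_h, src_w = len(img), len(img[0])
    return [
        [img[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(img: Gray, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair,
    set when the left pixel is brighter than the right one."""
    small = resize_nearest(img, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for x in range(hash_size):
            bits = (bits << 1) | (1 if row[x] > row[x + 1] else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_distance: int = 5) -> bool:
    """Flag two images as near-identical (threshold is an assumed value)."""
    return hamming(dhash(a), dhash(b)) <= max_distance


# Synthetic 64x64 test images: a left-to-right gradient, a slightly
# brightened copy (as a rescan might produce), and a mirrored version.
original = [[255 - 3 * x for x in range(64)] for _ in range(64)]
brighter = [[min(255, v + 10) for v in row] for row in original]
mirrored = [row[::-1] for row in original]

print(is_near_duplicate(original, brighter))  # brightened copy
print(is_near_duplicate(original, mirrored))  # structurally different image
```

Because the hash records only whether brightness rises or falls between neighbouring regions, a uniformly brightened rescan yields the same bits as the original, while the mirrored image flips every bit. A production pipeline would decode real image files and index the hashes rather than compare every pair.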

For institutions still sitting on the problem, the practical next step is a storage audit before the end of the third quarter. Municipal departments that complete deduplication before the 2027 budget cycle begins in October stand a better chance of reallocating recovered storage capacity rather than requesting new procurement. The Secretaria Municipal de Inovação e Tecnologia has indicated it plans to publish deduplication guidelines for city-affiliated organisations by September 2026. For private archives and newsrooms, the calculus is simpler: every month of inaction adds to a storage bill that compounds as collections grow, and the images do not delete themselves.


About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
