The problem was hiding in plain sight. Across São Paulo's network of municipal archives, cultural institutions, and fast-growing tech companies, the same photographs, graphics, and scanned documents were being stored — again and again — under different file names, in different folders, on different servers. Nobody agreed on how bad it had become until custodians at the Arquivo Histórico Municipal, on Rua Antônio de Godói in the city centre, began a systematic audit in early 2025 and found that nearly a third of their digitised holdings had at least one exact or near-exact duplicate consuming storage space.
That finding landed at a moment when the conversation about digital infrastructure had already grown urgent. The Lula federal government's push to expand broadband access across the country and Mayor Ricardo Nunes's ongoing Smart City initiative for São Paulo meant institutions were under pressure to migrate more records online, faster. More data flowing in meant the underlying chaos was compounding every quarter. Duplicate images are not a cosmetic nuisance — they inflate storage costs, slow search systems, produce conflicting versions of official documents, and in some legal and journalistic contexts, create genuine questions about the integrity of a record.
A Problem Built Over Two Decades of Rushed Digitisation
The roots run back to the early 2000s, when municipal agencies and private newsrooms along Avenida Paulista began scanning physical archives with whatever equipment and software they had available. There was no unified taxonomy. The Biblioteca Municipal Mário de Andrade, on Rua da Consolação, digitised its photographic collection under one naming convention; the Museu da Imagem e do Som, in Higienópolis, used another. Commercial stock-image agencies that grew rapidly during São Paulo's tech boom of the 2010s layered their own internal systems on top, many built by startups that no longer exist.
By 2020, when remote work accelerated the demand for digital assets, the redundancy problem had metastasised. A 2024 industry survey by the Brazilian Association of Digital Content Management — ABGD, based in São Paulo — estimated that large Brazilian organisations were spending an average of R$180,000 annually on storage costs attributable to duplicate or near-duplicate digital files. For smaller cultural institutions operating on tight municipal budgets, even a fraction of that figure is material. The ABGD survey covered 140 organisations across Brazil, with respondents concentrated in São Paulo and Rio de Janeiro.
The technical challenge is not simply spotting identical copies. Modern duplicate-image detection has to handle images that have been resized, recompressed, watermarked, or colour-corrected after the original was saved. Byte-for-byte comparison misses these near-duplicates, so engineers turn to perceptual hashing, which fingerprints what an image looks like rather than how its file is encoded. Open-source tools capable of this kind of analysis have existed since at least 2015, but adoption inside public institutions was slow. Procurement cycles are long, IT teams are stretched, and for years the problem was treated as an administrative annoyance rather than a financial liability.
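For readers who want to see the idea in miniature, here is a rough sketch of one of the simplest perceptual hashes, commonly called an average hash. It is an illustration only: the article's sources did not say which algorithm or tool they used, and the function names, the 8x8 grid size, and the synthetic test images below are all assumptions. The sketch works on plain grids of grayscale values so it runs without any image library.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale values, 0-255


def shrink(pixels: Gray, size: int = 8) -> List[List[float]]:
    """Downscale to a size x size grid by averaging rectangular blocks."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels: Gray, size: int = 8) -> int:
    """One bit per cell: 1 if the cell is brighter than the grid's mean."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; small values suggest similar images."""
    return bin(a ^ b).count("1")


# Synthetic stand-ins for real scans (hypothetical data):
horizontal = [[x * 8 for x in range(32)] for _ in range(32)]   # left-to-right gradient
resized = [[x * 16 for x in range(16)] for _ in range(16)]     # same picture at half size
vertical = [[y * 8 for _ in range(32)] for y in range(32)]     # a different picture

print(hamming(average_hash(horizontal), average_hash(resized)))   # 0
print(hamming(average_hash(horizontal), average_hash(vertical)))  # 32
```

The resized copy produces the same 64-bit fingerprint as the original, while the unrelated image differs in half of its bits. Production tools generally use more robust variants and tune the distance threshold against real collections.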
The Audit That Changed the Conversation
The Arquivo Histórico Municipal's 2025 audit changed the institutional calculus. Running a perceptual-hash analysis across roughly 2.3 million digitised images, the team flagged approximately 690,000 files for review, about 30 percent of the total. That rate was high enough to prompt the Secretaria Municipal de Cultura to commission a broader review of the city's other holdings. The broader review, contracted to a local technology firm in the Berrini tech corridor, was completed in March 2026 and confirmed the pattern was systemic, not isolated.
The practical consequences are already being felt. Institutions that want to share collections with the São Paulo state digital heritage portal — launched under a federal Culture Ministry programme in late 2024 — are required to submit de-duplicated, consistently tagged files. Those that cannot meet the standard are effectively locked out of the collaboration, losing both visibility and potential co-funding.
For organisations still working through their own backlogs, archivists recommend starting with the highest-traffic collections — press photographs, event images, and scanned official documents — where duplicates cause the most downstream confusion. Tools built on perceptual hashing can be run in batches on existing infrastructure without a full system overhaul. The harder work, as the Arquivo Histórico Municipal found, is the human review that follows: a machine can flag candidates, but a person still has to decide which version is the canonical one. Getting that decision right, and recording it consistently, is what the next phase of São Paulo's digital archive reform will have to solve.
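The split between machine flagging and human judgement can be sketched in a few lines. This is a hypothetical outline, not the archive's actual workflow: the file names, the distance threshold of 5 bits, and the `ReviewDecision` record are assumptions made for illustration, and the pairwise comparison suits only small batches.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


def group_candidates(hashes: Dict[str, int], max_distance: int = 5) -> List[List[str]]:
    """Cluster files whose hashes are within max_distance bits of each other.

    Compares every pair, so it is only practical for modest batch sizes.
    """
    parent = {name: name for name in hashes}

    def find(name: str) -> str:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(hashes, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in hashes:
        groups.setdefault(find(name), []).append(name)
    # Only clusters with more than one file need a reviewer's attention.
    return [sorted(g) for g in groups.values() if len(g) > 1]


@dataclass(frozen=True)
class ReviewDecision:
    group: Tuple[str, ...]
    canonical: str
    reviewer: str


def record_decision(group: List[str], canonical: str, reviewer: str) -> ReviewDecision:
    """Store a human choice; the machine never picks the canonical file itself."""
    if canonical not in group:
        raise ValueError(f"{canonical!r} is not in the flagged group")
    return ReviewDecision(tuple(group), canonical, reviewer)


# Hypothetical hashes for three scans:
hashes = {
    "scan_0001.tif": 0b11110000,
    "scan_0001_web.jpg": 0b11110001,
    "scan_0002.tif": 0b00001111,
}
for group in group_candidates(hashes):
    print(group)  # ['scan_0001.tif', 'scan_0001_web.jpg']
```

Keeping the decision as an explicit record, with the reviewer named, is one way to make the choice of canonical version consistent and auditable later.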