São Paulo's public institutions are sitting on a problem that nobody has wanted to name out loud: their digital archives are riddled with duplicate images, and the technical, legal and budgetary decisions about what to do next are now impossible to defer. The issue has moved from the back offices of IT departments to the desks of administrators at organisations including the Arquivo Público do Estado de São Paulo and the Secretaria Municipal de Cultura, where digitisation drives launched between 2022 and 2024 have left behind overlapping, redundant files that consume server space, confuse researchers and complicate any serious attempt at long-term preservation.
The timing matters. The federal government's push under the Lula administration to consolidate public data infrastructure, including a national open-data framework that municipal bodies are expected to align with by the end of 2026, means São Paulo cannot keep kicking the can down the road. Institutions that fail to clean up their digital collections before the compliance deadline risk losing access to federal co-funding for future digitisation work. That is not a hypothetical. It is the kind of budget exposure that gets the attention of mayors.
What the Problem Actually Looks Like on the Ground
Walk into the reading room at the Arquivo Público do Estado, on Rua Voluntários da Pátria in Santana, and the archivists will tell you the problem is structural. Digitisation projects run by different contractors at different times used different file-naming conventions, different resolution standards and different metadata schemas. The result is that the same photograph — say, an aerial view of Paulista Avenue from the 1970s — might exist in three or four versions across separate servers, none of them flagged as a duplicate, each consuming storage and each potentially carrying conflicting catalogue information.
At the Centro Cultural São Paulo, on Rua Vergueiro in Liberdade, curators managing the institution's visual collection face a related but distinct version of the same challenge. Duplicates there have accumulated partly because exhibition loans require high-resolution copies that often get saved back into the main archive without a proper reconciliation step. The institution has not yet committed publicly to a replacement or consolidation timeline.
The technical options are not simple. Perceptual hashing, a technique that reduces each image to a compact fingerprint so that visually similar images can be matched even when file names, formats or resolutions differ, can flag probable duplicates, but it requires human review before anything is deleted. That review costs money and staff time. A 2024 report by the Fundação Getúlio Vargas on digital preservation practices across Brazilian public institutions found that fewer than 30 percent of state-level archives had a formal deduplication policy in place. São Paulo state was not among the minority that did.
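For readers who want to see what that flagging step involves, here is a minimal sketch of one common perceptual-hash variant, the average hash. It is an illustration only, not the method any of the institutions named here is known to use. It assumes each image has already been decoded into a grid of grayscale values (0 to 255) at least 8 pixels on each side; the function names and the distance threshold of 5 are hypothetical choices made for this example.

```python
from itertools import combinations

HASH_SIZE = 8  # an 8x8 grid gives a 64-bit fingerprint


def shrink(pixels, size=HASH_SIZE):
    """Downscale a grayscale image (list of rows of 0-255 values) to
    size x size by averaging blocks. Assumes both dimensions >= size."""
    height, width = len(pixels), len(pixels[0])
    grid = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        row = []
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid


def average_hash(pixels):
    """Return an integer fingerprint: one bit per cell of the shrunken
    image, set when the cell is brighter than the overall mean."""
    cells = [value for row in shrink(pixels) for value in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def probable_duplicates(images, max_distance=5):
    """Compare every pair of images and return (name_a, name_b, distance)
    for pairs whose fingerprints differ by at most max_distance bits.
    max_distance=5 is an assumed threshold, not a recommended standard."""
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    flagged = []
    for a, b in combinations(sorted(hashes), 2):
        distance = hamming(hashes[a], hashes[b])
        if distance <= max_distance:
            flagged.append((a, b, distance))
    return flagged
```

The output is a list of candidate pairs for an archivist to inspect, not a deletion list. Comparing every pair also becomes slow on large collections, which is one reason the tooling and the review hours both carry a cost.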
The Decisions That Cannot Wait
Three choices are now on the table, and whoever makes them — whether Mayor Ricardo Nunes's administration acts at the municipal level or defers to state coordination — will shape what São Paulo's public digital memory looks like for decades.
First: which files get designated as the authoritative master copy? This sounds bureaucratic, but it determines which version of a historical image survives. Second: what happens to the duplicates — permanent deletion, cold storage, or transfer to a third-party repository such as the Instituto Moreira Salles, which already holds one of Brazil's largest photographic collections and has its own digitisation infrastructure in São Paulo? Third: who pays for the automated scanning tools, the human review hours, and the metadata standardisation work? Federal digitisation grants through the Programa Nacional de Apoio à Pesquisa cover some costs, but the application window for the next funding cycle closes in September 2026.
Institutions that move quickly stand to gain more than just a cleaner hard drive. A properly deduplicated and catalogued archive becomes searchable in ways that attract researchers, journalists and the growing number of tech companies in São Paulo's Vila Olímpia and Faria Lima corridor that are building AI training datasets and need high-quality, legally clear image libraries. That commercial interest, unlikely as it might have seemed five years ago, is now part of the conversation in rooms where archivists and IT procurement officers sit together.
The September 2026 federal funding deadline is the first hard checkpoint. After that, institutions without a documented deduplication plan will face an uncomfortable audit cycle heading into 2027. The options are on the table. The window is narrowing.