The Daily São Paulo


São Paulo's Digital Archives Face a Reckoning Over Duplicate Images: The Key Decisions Ahead

As city institutions scramble to modernise vast photo and document collections, the question of how to identify, remove or replace duplicate images is becoming a genuine governance headache.

By São Paulo News Desk · Published 4 July 2026, 4:00 pm


Photo: Giovanna Kamimura on Pexels

São Paulo's public institutions are sitting on a problem that nobody has wanted to name out loud: their digital archives are riddled with duplicate images, and the technical, legal and budgetary decisions about what to do next are now impossible to defer. The issue has moved from the back offices of IT departments to the desks of administrators at organisations including the Arquivo Público do Estado de São Paulo and the Secretaria Municipal de Cultura, where digitisation drives launched between 2022 and 2024 have left behind overlapping, redundant files that consume server space, confuse researchers and complicate any serious attempt at long-term preservation.

The timing matters. The federal government's push under the Lula administration to consolidate public data infrastructure — including a national open-data framework that municipal bodies are expected to align with by the end of 2026 — means São Paulo cannot keep kicking this down the road. Institutions that fail to clean up their digital collections before the compliance deadline risk losing access to federal co-funding for future digitisation work. That is not a hypothetical. It is the kind of budget exposure that gets the attention of mayors.

What the Problem Actually Looks Like on the Ground

Walk into the reading room at the Arquivo Público do Estado, on Rua Voluntários da Pátria in Santana, and the archivists will tell you the problem is structural. Digitisation projects run by different contractors at different times used different file-naming conventions, different resolution standards and different metadata schemas. The result is that the same photograph — say, an aerial view of Paulista Avenue from the 1970s — might exist in three or four versions across separate servers, none of them flagged as a duplicate, each consuming storage and each potentially carrying conflicting catalogue information.

At the Centro Cultural São Paulo, on Rua Vergueiro in Liberdade, curators managing the institution's visual collection face a related but distinct version of the same challenge. Duplicates there have accumulated partly because exhibition loans require high-resolution copies that often get saved back into the main archive without a proper reconciliation step. The institution has not yet committed publicly to a replacement or consolidation timeline.

The technical options are not simple. Perceptual hashing, a technique that computes a compact fingerprint from an image's visual content, can flag probable duplicates even when file names, resolutions or compression settings differ, but it requires human review before anything is deleted. That review costs money and staff time. A 2024 report by the Fundação Getúlio Vargas on digital preservation practices across Brazilian public institutions found that fewer than 30 percent of state-level archives had a formal deduplication policy in place. São Paulo state was not among those that did.
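To make the idea concrete, here is a minimal sketch of one simple form of perceptual hashing, the average hash, written in plain Python. It is an illustration only, not the method any of the institutions named here is known to use. It assumes each image has already been decoded into a grid of greyscale brightness values, and the function names and the `max_distance` threshold are hypothetical choices made for the example. The output is a list of candidate pairs for a human to review; nothing is deleted.

```python
from itertools import combinations


def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a greyscale image.

    `pixels` is a list of rows of 0-255 brightness values (assumed to be
    already decoded from the image file).
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to a size x size grid by averaging blocks of pixels.
    for r in range(size):
        r0 = r * height // size
        r1 = max((r + 1) * height // size, r0 + 1)
        for c in range(size):
            c0 = c * width // size
            c1 = max((c + 1) * width // size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")


def flag_probable_duplicates(images, max_distance=5):
    """Return (name_a, name_b, distance) for pairs that look alike.

    `images` maps a file name to its pixel grid. `max_distance` is an
    illustrative threshold, not a recommended value. The result is a
    review list for a human; this function never deletes anything.
    """
    hashes = {name: average_hash(grid) for name, grid in images.items()}
    flagged = []
    for name_a, name_b in combinations(sorted(hashes), 2):
        distance = hamming_distance(hashes[name_a], hashes[name_b])
        if distance <= max_distance:
            flagged.append((name_a, name_b, distance))
    return flagged
```

Because the fingerprint is computed from a shrunken copy of the image, a high-resolution scan and a smaller copy of the same photograph should land at or near a distance of zero, whatever their file names. Production tools typically use more robust variants, and the threshold would need tuning against a sample that archivists have already checked by hand.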

The Decisions That Cannot Wait

Three choices are now on the table, and whoever makes them — whether Mayor Ricardo Nunes's administration acts at the municipal level or defers to state coordination — will shape what São Paulo's public digital memory looks like for decades.

First: which files get designated as the authoritative master copy? This sounds bureaucratic, but it determines which version of a historical image survives. Second: what happens to the duplicates — permanent deletion, cold storage, or transfer to a third-party repository such as the Instituto Moreira Salles, which already holds one of Brazil's largest photographic collections and has its own digitisation infrastructure in São Paulo? Third: who pays for the automated scanning tools, the human review hours, and the metadata standardisation work? Federal digitisation grants through the Programa Nacional de Apoio à Pesquisa cover some costs, but the application window for the next funding cycle closes in September 2026.

Institutions that move quickly stand to gain more than just a cleaner hard drive. A properly deduplicated and catalogued archive becomes searchable in ways that attract researchers, journalists and the growing number of tech companies in São Paulo's Vila Olímpia and Faria Lima corridor that are building AI training datasets and need high-quality, legally clear image libraries. That commercial interest, unlikely as it might have seemed five years ago, is now part of the conversation in rooms where archivists and IT procurement officers sit together.

The September 2026 close of the federal funding window is the first hard checkpoint. After that, institutions without a documented deduplication plan will face an uncomfortable audit cycle heading into 2027. The decisions are available. The window is narrowing.


About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
