The Daily São Paulo

How São Paulo's Public Archives Ended Up Full of the Same Photo Twice — and What's Being Done About It

A chronic duplication problem inside the city's digital image repositories has been building for years, and now administrators are scrambling to clean house.

By São Paulo News Desk · Published 4 July 2026, 4:06 pm

Photo: Rafael Rodrigues on Pexels

São Paulo's municipal digital archive holds hundreds of thousands of photographs documenting everything from the 1966 flooding of the Tietê River floodplain to last year's Carnaval rehearsals on Avenida São João. The problem: a growing share of those images appears more than once, sometimes dozens of times, clogging servers, distorting search results, and costing taxpayers money in unnecessary storage contracts.

The duplication crisis did not happen overnight. It accumulated across at least three separate waves of digitisation — the first in the mid-2000s under the city's Arquivo Histórico Municipal, the second tied to the 2010 expansion of the Acervo Digital da Prefeitura, and the third driven by a pandemic-era rush in 2020 and 2021, when cultural institutions across the Bom Retiro and Consolação districts bulk-uploaded physical collections with minimal quality control. Each migration imported its predecessor's errors and added new ones.

Where the Problem Starts

Duplicate image replacement — the technical process of identifying redundant files, selecting a canonical version, and systematically removing or redirecting the copies — sounds simple. In practice, it requires consistent file-naming conventions, embedded metadata standards, and a centralised repository with enough computational resources to run perceptual-hash comparisons across large datasets. São Paulo's municipal systems, split across at least four separate platforms including the Secretaria Municipal de Cultura's own portal on Rua Sete de Abril, have historically lacked all three.
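To make the three steps concrete, here is a minimal Python sketch of the select-and-redirect stage. It is an illustration only, not the city's actual workflow: the `Scan` record, the `AHM-` accession numbers, the `fingerprint` field, and the rule for choosing the canonical copy (largest pixel count, then lowest accession number) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    accession: str    # hypothetical accession number
    fingerprint: str  # any duplicate-detection key, e.g. a hash of the image
    width: int
    height: int


def build_redirects(scans):
    """Map each redundant accession number to its group's canonical one."""
    # Step 1: identify redundant files by grouping on the fingerprint.
    groups = defaultdict(list)
    for scan in scans:
        groups[scan.fingerprint].append(scan)

    redirects = {}
    for members in groups.values():
        # Step 2: pick a canonical version. Assumed rule: most pixels wins,
        # ties broken by the lowest accession number.
        ranked = sorted(members, key=lambda s: (-s.width * s.height, s.accession))
        canonical = ranked[0]
        # Step 3: redirect every other copy to the canonical record.
        for duplicate in ranked[1:]:
            redirects[duplicate.accession] = canonical.accession
    return redirects


scans = [
    Scan("AHM-0001", "fp-a", 800, 600),
    Scan("AHM-0002", "fp-a", 1600, 1200),
    Scan("AHM-0003", "fp-b", 1024, 768),
    Scan("AHM-0004", "fp-a", 1600, 1200),
]
redirects = build_redirects(scans)
print(redirects)
```

In this sketch the copies are redirected rather than deleted, so an older citation of a duplicate accession number can still be resolved to the surviving record.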

The Instituto Moreira Salles, which maintains one of Brazil's most respected photographic collections from its São Paulo hub near Avenida Paulista, adopted an open-source deduplication pipeline in 2019 and publicly documented a reduction of redundant files in its own archive. That experience became an informal reference point for city administrators trying to understand the scale of their own problem, though the municipal archive's situation differs significantly in governance and funding structure.

Brazil passed the Lei de Acesso à Informação — Law 12.527 — back in November 2011, establishing the legal framework that requires public bodies to maintain accessible, accurate digital records. More than a decade later, municipal archivists in São Paulo are still working to meet its practical demands. A 2024 federal audit of digital preservation compliance across Brazilian state capitals, published by the Tribunal de Contas da União, found that the majority of municipalities surveyed lacked a formal deduplication policy for visual media. São Paulo was among those flagged for incomplete metadata standards.

The Fix Is Technically Possible — But Slow

Earlier this year, the Secretaria Municipal de Gestão began a pilot program in partnership with the Pontifícia Universidade Católica de São Paulo to apply machine-learning-assisted image comparison tools to a subset of the Arquivo Histórico Municipal's collection, which is physically housed on Rua Carneiro Leão in the Ipiranga neighbourhood. The pilot covers roughly 40,000 images dating from 1950 to 1980 — a manageable slice of a collection that runs to well over a million items.

The methodology matters here. Simple file-hash comparison catches exact binary duplicates — the same file saved twice. Perceptual hashing goes further, catching images that are visually near-identical even when file formats, compression levels, or resolutions differ. The latter technique is essential for an archive like São Paulo's, where the same negative was often scanned multiple times at different resolutions by different departments with no record-keeping to connect the scans.
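The difference between the two approaches can be shown in a few lines of Python. This is a rough sketch under stated assumptions, not the pilot's actual tooling: images are stood in for by small grayscale grids of integers, the perceptual hash is a simple average hash (one of several perceptual-hash variants), and every function name is hypothetical.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Exact-match fingerprint: changes if even one byte differs."""
    return hashlib.sha256(data).hexdigest()


def shrink(pixels, size=8):
    """Downscale a grayscale grid to size x size by averaging blocks.

    Assumes the grid's height and width are multiples of `size`.
    """
    block_h = len(pixels) // size
    block_w = len(pixels[0]) // size
    small = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [
                pixels[y][x]
                for y in range(r * block_h, (r + 1) * block_h)
                for x in range(c * block_w, (c + 1) * block_w)
            ]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels, size=8):
    """Average hash: one bit per cell, set when the cell is brighter than the mean."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def to_bytes(pixels) -> bytes:
    return bytes(v for row in pixels for v in row)


# The same synthetic "negative" scanned at two resolutions, plus an unrelated image.
scan_low = [[(x + y) * 8 for x in range(16)] for y in range(16)]
scan_high = [[scan_low[y // 2][x // 2] for x in range(32)] for y in range(32)]
inverted = [[255 - v for v in row] for row in scan_low]

print(file_digest(to_bytes(scan_low)) == file_digest(to_bytes(scan_high)))
print(hamming(average_hash(scan_low), average_hash(scan_high)))
print(hamming(average_hash(scan_low), average_hash(inverted)))
```

The two scans have different byte content, so their file hashes do not match, yet their average hashes are separated by a Hamming distance of zero. The unrelated image lands far away. A real pipeline would decode actual image files and choose a distance threshold below which two scans are treated as the same picture; that threshold is a tuning decision this sketch does not address.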

For journalists, researchers, and students who rely on the Acervo Digital for historical reporting and academic work, the practical consequence of uncleaned duplicates is wasted time: searches return the same image in multiple entries, provenance data conflicts between versions, and licensing status — particularly for images with uncertain copyright — becomes impossible to verify cleanly.

The Secretaria Municipal de Cultura has indicated it plans to roll out a city-wide deduplication policy across all municipal digital collections before the end of 2026, though no formal decree had been published at the time of publication. Researchers working with the archive in the interim should cross-reference any image retrieved from the Acervo Digital against the Arquivo Histórico Municipal's standalone search portal, where metadata tends to be more complete, and note the file accession number to avoid citing a duplicate version in published work.

About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
