The Daily São Paulo


How São Paulo's Digital Archives Became a Graveyard of Duplicate Images — and Why Fixing It Took a Decade

A slow accumulation of copied, re-uploaded and mis-catalogued photographs across city systems has finally forced a reckoning with how public institutions manage visual records.

By São Paulo News Desk · Published 4 July 2026, 4:06 pm

The problem was hiding in plain sight. Across the servers of the Prefeitura de São Paulo, the Arquivo Histórico Municipal on Rua Aureliano Coutinho, and dozens of linked cultural portals, hundreds of thousands of duplicate images had piled up — the same photograph of Avenida Paulista in 1970 stored under six different file names, the same flood documentation from the Tietê river basin duplicated across three separate departments. A coordinated audit completed in June 2026 put a rough count on the damage: more than 340,000 redundant image files identified across municipal digital repositories, according to the Secretaria Municipal de Inovação e Tecnologia.

This matters now because the city is in the middle of a R$180 million digitisation push, announced in late 2024 as part of the Programa São Paulo Inteligente, intended to make decades of urban planning records, flood-response maps and cultural heritage photographs accessible to the public through a single open portal. Pouring new content into systems already bloated with duplicates would have compounded the error. The audit was the prerequisite nobody wanted to run.

How the Backlog Built Up

The roots of the problem trace back to the early 2000s, when individual secretarias began digitising their paper archives independently, without a shared metadata standard. The Secretaria do Verde e do Meio Ambiente was scanning tree-canopy surveys. The Secretaria de Habitação was photographing favela boundaries in Paraisópolis and Heliópolis. The Museu da Cidade, housed at the Parque do Estado in Ipiranga, was cataloguing historical prints. Each institution used its own file-naming convention, its own storage platform and, crucially, its own definition of what counted as a duplicate.

When the city attempted its first centralisation effort in 2013, during the Plano Diretor revision then under way, technicians migrated files wholesale from legacy systems into a new server cluster without deduplicating them. The result was institutional copy-paste: originals and copies sitting side by side, indistinguishable to the software because the metadata fields were inconsistent. A photograph of the 1966 flooding on Avenida do Estado might carry four different capture dates depending on which department had scanned it and when.

The practice accelerated after 2020, when pandemic-era remote work pushed civil servants to share image files over email and WhatsApp before uploading them to official repositories. Internal communications reviewed as part of the 2026 audit showed that re-uploads were commonplace because staff could not search existing holdings reliably. If you could not find it, you uploaded it again.

The Technical and Political Fix Now on the Table

The Secretaria Municipal de Inovação e Tecnologia awarded a contract in March 2026 to a São Paulo-based technology company to deploy perceptual hashing tools across the municipal server infrastructure. Such software generates a compact fingerprint for each image from its visual content rather than its file name, so that visually similar images produce similar fingerprints. The contract value, published in the Diário Oficial da Cidade on 14 March 2026, was R$4.2 million, covering a 14-month implementation window.
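
To make the idea of a visual fingerprint concrete, here is a minimal sketch of one simple perceptual hashing technique, the average hash. The city has not said which algorithm its contractor uses, so this is an illustration only. It assumes each image has already been decoded, converted to grayscale and shrunk to a small thumbnail, and the function names and sample pixel values are hypothetical.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixel values, 0-255 (assumed pre-shrunk thumbnail)


def average_hash(pixels: Grid) -> int:
    """Return an integer whose bits mark the pixels brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # Append one bit per pixel: 1 if brighter than average, else 0.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical 4x4 thumbnails: a scan, a slightly brighter re-scan of the
# same photograph, and an unrelated checkerboard image.
scan = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
rescan = [[min(255, value + 8) for value in row] for row in scan]
unrelated = [
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
]

print(hamming_distance(average_hash(scan), average_hash(rescan)))     # 0
print(hamming_distance(average_hash(scan), average_hash(unrelated)))  # 8
```

The brighter re-scan produces the same fingerprint as the original even though every pixel value, and therefore the file itself, is different. The unrelated image differs in half of its 16 bits. Production tools typically use larger thumbnails and more robust hashes, but the principle is the same.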

The tool works by comparing visual fingerprints and flagging pairs above a similarity threshold. Human archivists at the Arquivo Histórico then adjudicate which version is authoritative, usually the highest-resolution scan with the most complete provenance metadata, before the duplicates are moved to a quarantine folder rather than deleted outright. That distinction matters: deletion would be irreversible, and in some cases a so-called duplicate carries a marginally different annotation that has archival value.
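
The workflow described above can be sketched roughly as follows. This is not the contractor's implementation: the 64-bit fingerprint length, the 0.90 threshold, the record fields and all function names are assumptions made for illustration. The scoring rule only suggests a version for an archivist to confirm, mirroring the "usually" in the description.

```python
from dataclasses import dataclass, field
from itertools import combinations
from pathlib import Path
import shutil
from typing import Dict, List, Tuple

HASH_BITS = 64    # assumed fingerprint length
THRESHOLD = 0.90  # assumed similarity cut-off, not a figure from the audit


@dataclass
class ImageRecord:
    path: Path
    fingerprint: int
    width: int
    height: int
    metadata: Dict[str, str] = field(default_factory=dict)


def similarity(a: int, b: int, bits: int = HASH_BITS) -> float:
    """Share of fingerprint bits that two images have in common."""
    return 1 - bin(a ^ b).count("1") / bits


def flag_pairs(records: List[ImageRecord],
               threshold: float = THRESHOLD) -> List[Tuple[ImageRecord, ImageRecord]]:
    """Return every pair whose similarity reaches the threshold."""
    return [
        (x, y)
        for x, y in combinations(records, 2)
        if similarity(x.fingerprint, y.fingerprint) >= threshold
    ]


def suggest_authoritative(x: ImageRecord, y: ImageRecord) -> ImageRecord:
    """Prefer the higher-resolution scan, then the fuller metadata.

    Only a suggestion: a human archivist makes the final call.
    """
    def score(record: ImageRecord) -> Tuple[int, int]:
        filled = sum(1 for value in record.metadata.values() if value)
        return (record.width * record.height, filled)

    return max((x, y), key=score)


def quarantine(record: ImageRecord, quarantine_dir: Path) -> Path:
    """Move a duplicate aside instead of deleting it, so the step is reversible."""
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    target = quarantine_dir / record.path.name
    shutil.move(str(record.path), str(target))
    return target


# Hypothetical records: two near-identical scans and one unrelated image.
full = 0xFFFFFFFFFFFFFFFF
master = ImageRecord(Path("paulista_1970.tif"), full, 4000, 3000,
                     {"source": "AHM", "date": ""})
copy = ImageRecord(Path("IMG_0042.jpg"), full ^ 0b111, 1200, 900,
                   {"source": "AHM", "date": "1970"})
other = ImageRecord(Path("tiete_flood.jpg"), 0, 2000, 1500)

for a, b in flag_pairs([master, copy, other]):
    keep = suggest_authoritative(a, b)
    print(f"{a.path} ~ {b.path}: suggest keeping {keep.path}")
```

Comparing every pair, as `combinations` does here, is fine for a small example but grows quadratically. A repository of several hundred thousand files would need an index or a bucketing step to stay practical.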

For residents and researchers who use the Biblioteca Mário de Andrade on Rua da Consolação, the practical upshot should arrive by mid-2027, when the cleaned repository is scheduled to go live through the São Paulo Aberto data portal. Searches for historical images of specific neighbourhoods, such as Bom Retiro, Pinheiros and the old Brás industrial corridor, should return coherent, non-repetitive results for the first time.

The harder lesson is institutional. Technology alone does not prevent the problem from recurring. The city's new digital governance framework, still being drafted by the Controladoria Geral, is expected to mandate a single metadata standard for all image uploads from January 2027 onward. Whether individual secretarias comply will depend on enforcement — and on whether the next administration, after the 2028 municipal elections, decides the framework is worth keeping.

About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
