The Daily São Paulo

São Paulo news, every day

How São Paulo's Digital Archives Became a Battlefield of Duplicate Images — and Why Fixing It Took This Long

The story of how years of rushed digitisation, bureaucratic inertia, and underfunded municipal IT left the city's public records riddled with duplicate images that now demand a systematic fix.

By São Paulo News Desk · Published 4 July 2026, 4:16 pm

Photo: Th2city Santana on Pexels

For years, anyone searching the digital archives of the Arquivo Histórico Municipal de São Paulo on Rua Voluntários da Pátria, in Santana, would pull up the same historic photograph of Paulista Avenue two, three, sometimes four times under different catalogue numbers. Nobody at the city level was counting the duplicates. Nobody had been paid to.

That changed in late 2025, when a joint audit by the Secretaria Municipal de Inovação e Tecnologia and the Tribunal de Contas do Município found that duplicate image files were consuming an estimated 34 percent of allocated storage on the city's cloud infrastructure contract — a contract worth approximately R$47 million per year. The finding prompted a formal directive, issued in March 2026, requiring all municipal bodies to submit duplicate-image-replacement plans before the end of the third quarter.

How the Problem Accumulated Over Two Decades

The roots go back to the early 2000s, when the city of São Paulo began digitising physical records in earnest under successive administrations. Each secretariat largely digitised on its own schedule, using different scanning software, different metadata standards, and different naming conventions. A photograph of the Vale do Anhangabaú taken in 1962, for instance, might be catalogued as a TIFF under one department and a compressed JPEG under another — both uploaded to shared servers without any cross-reference check.

The problem compounded when the city migrated its infrastructure at least three times between 2008 and 2019, according to internal procurement records available through the Portal da Transparência da Prefeitura de São Paulo. Each migration pulled existing file trees into new environments, and automated scripts that were supposed to flag duplicates were either poorly configured or simply not run. Batches of images from the Museu da Cidade de São Paulo and from the successor bodies of the Empresa Municipal de Urbanização (EMURB) were folded into the same repositories with no deduplication layer.

Staff shortages deepened the backlog. The city's Coordenadoria de Tecnologia da Informação, which oversees the central database environment, was operating with roughly 60 percent of its recommended technical headcount as recently as 2023, according to a report published by the Controladoria Geral do Município in January 2024. Temporary contracts, common under multiple mayoral administrations, meant that institutional memory about which archives had already been cleaned — and which had not — walked out the door with each contract expiry.

What the 2026 Directive Actually Requires

The March 2026 directive mandates that municipal bodies use a defined hash-based comparison protocol — essentially a fingerprinting method that identifies identical or near-identical files regardless of filename — to flag candidates for removal or consolidation. Secretariats must then designate a named technical lead responsible for signing off on each replacement decision, ensuring that a genuinely distinct image is not deleted because a metadata error made it look like a duplicate.
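To illustrate what hash-based comparison means in practice, here is a minimal Python sketch. It is not the city's protocol, whose details the directive summary does not spell out; the choice of SHA-256, the function names, and the folder layout are all assumptions made for illustration. The sketch groups files by a fingerprint of their bytes and lists groups of two or more as candidates for human review. It deletes nothing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so large scans do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_candidates(root: Path) -> list[list[Path]]:
    """Group files under root by content hash; return groups of two or more.

    Filenames and catalogue numbers play no part in the comparison.
    The result is a review list only; nothing is removed.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

A fingerprint of this kind matches only byte-for-byte identical files. The same photograph stored as a TIFF in one department and as a compressed JPEG in another would produce two different digests, so catching near-identical copies would need a separate perceptual comparison, which this sketch does not attempt.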

The Biblioteca Mário de Andrade, on Rua da Consolação in the city centre, is listed as one of the pilot sites for the programme. Staff there have already begun working through a backlog of roughly 12,000 flagged image pairs in its digital newspaper collection. The São Paulo State Institute of Technological Research (IPT), based on the Cidade Universitária campus, has been contracted to validate the hashing methodology before wider rollout.

For residents and researchers, the practical impact is real. Slower search results, broken links when a duplicate is removed without a redirect, and the risk of accidental deletion of genuinely unique images are all live concerns that the secretariat's framework is designed, if not yet proven, to address.

Secretariats have until 30 September 2026 to submit their first compliance reports. Bodies that miss the deadline face a freeze on new storage procurement requests, a financial lever that city officials are betting will concentrate minds in a way that previous guidance never did.

About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.