The Daily São Paulo

São Paulo news, every day

São Paulo Tackles Duplicate Image Plague in City Archives — and It's Ahead of Some Rivals

As municipal digitisation programmes swell, São Paulo's archivists are wrestling with a problem that has tripped up Mexico City, Bogotá and Lagos: thousands of redundant scanned images clogging public databases.

By São Paulo News Desk · Published 4 July 2026, 3:43 pm

Photo: Ariadne Barroso on Pexels

São Paulo's city hall confirmed this week that its ongoing digitisation push has produced a side effect nobody budgeted for: an estimated 340,000 duplicate image files in the Arquivo Histórico Municipal on Rua Cantareira. The duplicates eat server space and slow public access to documents that range from 19th-century land registries to 1970s infrastructure blueprints. The discovery came during a mid-year audit tied to the Prefeitura's broader open-data initiative, which Mayor Ricardo Nunes' administration relaunched in March 2026 under the banner of Dados Abertos SP.

The problem matters because São Paulo is not alone. Cities pushing hard on digital governance, Mexico City, Bogotá, Lagos and Warsaw among them, have all run into the same bottleneck as scanning volumes outpace data-management capacity. The difference is how aggressively each city is moving to fix it, and how much of the burden each is placing on residents who depend on those archives for property disputes, genealogy research and environmental-licence checks.

What São Paulo Is Doing Differently

The Prefeitura contracted the Instituto de Pesquisas Tecnológicas, the state-linked technology body based in Cidade Universitária, to run deduplication software across the Arquivo Municipal's holdings starting in April 2026. The process uses perceptual hashing (a technique that identifies visually identical or near-identical images even when the files differ in name, resolution or encoding) rather than a simple byte-by-byte comparison, which would miss the thousands of files rescanned at slightly different resolutions over the past decade.
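
For readers curious how the two approaches differ, the sketch below contrasts a byte-level digest with an average hash, one simple form of perceptual hashing. It is illustrative only: the article does not say which algorithm or software the Instituto de Pesquisas Tecnológicas uses, and the function names, the synthetic "scans" and the idea of a small distance threshold are all assumptions made for the example.

```python
import hashlib


def average_hash(pixels, hash_size=8):
    """64-bit average hash of a grayscale image given as rows of 0-255 ints.

    Assumes the image is at least hash_size pixels wide and tall.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to a hash_size x hash_size grid of block averages.
    for gy in range(hash_size):
        y0, y1 = gy * height // hash_size, (gy + 1) * height // hash_size
        for gx in range(hash_size):
            x0, x1 = gx * width // hash_size, (gx + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    # Each bit records whether a grid cell is brighter than the grid's mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def byte_digest(pixels):
    """SHA-256 of the raw pixel bytes, standing in for a byte-by-byte check."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


def upscale(pixels, factor):
    """Repeat every pixel factor times in both directions (a crude 'rescan')."""
    return [[v for v in row for _ in range(factor)]
            for row in pixels for _ in range(factor)]


# Hypothetical stand-ins for scans: a 64x64 page, the same page rescanned
# at double resolution, and a different page.
original = [[2 * x + y for x in range(64)] for y in range(64)]
rescan = upscale(original, 2)
other_page = [[x + 2 * y for x in range(64)] for y in range(64)]

print("Byte digests match:", byte_digest(original) == byte_digest(rescan))
print("Hash distance, rescan:", hamming_distance(average_hash(original), average_hash(rescan)))
print("Hash distance, other page:", hamming_distance(average_hash(original), average_hash(other_page)))
```

In this toy case the byte digests of the page and its rescan differ, while their average hashes sit at or near zero distance and the unrelated page lands much further away. A real pipeline would decode actual image files, choose a distance threshold empirically, and likely route borderline matches to human review.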

That technical choice puts São Paulo ahead of Bogotá, where the Archivo de Bogotá launched a similar audit in late 2024 but relied on manual review teams that have cleared only a fraction of the backlog, according to reporting by Colombian outlet El Tiempo. Mexico City's CDMX Digital programme, which began digitising land records in the Benito Juárez borough in 2022, has publicly acknowledged duplication rates above 20 percent in some collections, without yet deploying automated tools at scale.

Warsaw, by contrast, moved in 2023 to require all municipal scanning contractors to submit deduplication certificates alongside deliverables — a procurement rule that São Paulo's Controladoria Geral do Município is now studying as a model for contracts issued after January 2027.

The Cost of Getting It Wrong

Storage is the obvious drain. The Arquivo Municipal currently holds roughly 4.2 terabytes of image data, and city IT documents reviewed for this article — posted to the Dados Abertos SP portal — show that cloud-storage costs for the archive rose 38 percent between 2023 and 2025. Eliminating confirmed duplicates could recover an estimated 900 gigabytes, enough to absorb two additional years of scanning at current rates without a new procurement tender.
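
The figures in this paragraph imply two ratios the article does not state outright. The short calculation below derives them, assuming decimal units (1 terabyte = 1,000 gigabytes) and reading "two additional years" as an upper bound on annual growth; the variable names are illustrative.

```python
ARCHIVE_GB = 4200      # "roughly 4.2 terabytes", assuming 1 TB = 1,000 GB
RECOVERABLE_GB = 900   # estimated space freed by removing confirmed duplicates
YEARS_ABSORBED = 2     # years of scanning the freed space is said to cover

# Fraction of current image storage taken up by confirmed duplicates.
duplicate_share = RECOVERABLE_GB / ARCHIVE_GB

# If 900 GB covers two years of scanning, annual growth is at most this much.
implied_annual_growth_gb = RECOVERABLE_GB / YEARS_ABSORBED

print(f"Share of archive recoverable: {duplicate_share:.1%}")
print(f"Implied scanning volume per year: up to {implied_annual_growth_gb:.0f} GB")
```

On those assumptions, confirmed duplicates account for roughly a fifth of the archive's image storage, and new scanning adds no more than about 450 gigabytes a year.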

The slower-burning problem is search quality. Residents filing property queries at the Poupatempo unit on Avenida do Estado, which handles roughly 2,400 document requests a month, regularly receive multi-page result sets bloated with identical images returned under different catalogue numbers. Staff there have been logging complaints since at least mid-2025, and the issue appears in internal service-quality reports that the Prefeitura released under a freedom-of-information request in May 2026.

Lagos, for comparison, shelved a 2023 deduplication project midway through after the vendor changed, leaving the Lagos State Archives in Alausa with unresolved redundancies that have stalled several high-profile land-title disputes in the Lekki corridor. The cost of that delay — in litigation, in administrative hours, in public trust — is a cautionary tale São Paulo officials have cited in internal planning memos.

The Instituto de Pesquisas Tecnológicas team is expected to deliver a first-phase report to the Secretaria Municipal de Gestão by 15 August 2026. If the timeline holds, residents should see improved search results on the Arquivo Municipal's public portal before the end of the third quarter. The Controladoria has indicated it will publish the deduplication methodology as open-source code, which would let other Brazilian municipalities (including Recife and Fortaleza, both running comparable digitisation drives) replicate the approach without starting from scratch.

About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
