The Daily São Paulo

São Paulo news, every day


How São Paulo's Digital Archives Ended Up Full of the Same Photo Twice — and What's Being Done About It

Duplicate images have quietly swamped public and private databases across the city for years; now the cleanup is forcing a reckoning with how São Paulo manages its visual records.

By São Paulo News Desk · Published 4 July 2026, 4:06 pm


Photo: Kaique Rocha on Pexels

The problem sounds mundane until you dig into the numbers. São Paulo's municipal government, cultural institutions, and private tech platforms are collectively sitting on millions of duplicate digital images — identical or near-identical files stored redundantly across servers, clogging storage infrastructure and undermining the reliability of public archives. The push to replace and consolidate those duplicate images is now accelerating, driven by a combination of storage costs, a 2025 federal open-data directive from Brasília, and pressure from São Paulo's own tech sector.

This matters right now because the city is midway through a broader digital governance overhaul under Mayor Ricardo Nunes's administration. Several municipal departments have been migrating legacy records to a centralised cloud system since early 2025, and the migration process exposed just how severe the duplication problem had become. Files that were scanned, re-scanned, and uploaded again across decades of disconnected IT systems had multiplied across directories with no automated deduplication running.

How the Duplication Built Up Over Two Decades

São Paulo's archival chaos did not happen overnight. The city's Arquivo Histórico Municipal, based on Rua Cantareira in the Centro district, began digitising physical documents and photographs in the early 2000s. At the same time, individual secretariats — housing, urban planning, infrastructure — were building their own parallel image libraries with no shared metadata standards. By the time the Secretaria Municipal de Inovação e Tecnologia began auditing those systems in 2024, engineers found the same photographic assets catalogued under multiple file names, formats, and directories. The Pinacoteca do Estado, on Praça da Luz, faced a parallel challenge when it integrated donated collections from private estates without a unified ingestion protocol.

The private sector compounded the issue. São Paulo's startup ecosystem — home to more than 20 unicorn-valued companies, many clustered around Faria Lima Avenue and the Vila Olímpia neighbourhood — depends heavily on image-recognition tools and machine-learning training datasets. Several of those datasets were discovered to contain hundreds of thousands of duplicate images, which skew algorithmic outputs and inflate storage invoices. One estimate cited in a 2025 report by Abstartups, the national startup association, put unnecessary cloud storage spending from duplicate data across Brazilian tech firms at hundreds of millions of reais annually, though the precise São Paulo figure was not broken out separately.

The Replacement Process — and Why It's Complicated

Replacing a duplicate image is not simply a matter of deleting one copy. When the same file exists under different metadata tags (different dates, different rights classifications, different resolution labels), archivists must verify which version is canonical before removing the others. The Arquivo Histórico Municipal has been running a manual verification programme since March 2025, with a team of three archivists cross-referencing physical originals against digital surrogates. By the team's own public account, progress has been slow.
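To make the verification problem concrete, here is a minimal sketch in Python of how duplicate records with disagreeing metadata might be flagged for human review. It is an illustration only, not the archive's actual tooling: the record layout, the field names, the sample file names and the `metadata_conflicts` helper are all assumptions, and the `fingerprint` value is assumed to come from some earlier duplicate-detection step.

```python
from collections import defaultdict


def metadata_conflicts(records):
    """Group records by fingerprint and report fields whose values disagree.

    Each record is a dict with a "fingerprint" key plus metadata fields
    (hypothetical layout). Returns {fingerprint: {field: distinct values}}
    for groups of two or more records with at least one disagreement.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["fingerprint"]].append(record)

    report = {}
    for fingerprint, group in groups.items():
        if len(group) < 2:
            continue  # a single copy has nothing to conflict with
        fields = {key for rec in group for key in rec if key != "fingerprint"}
        differing = {}
        for field in sorted(fields):
            values = {str(rec.get(field)) for rec in group}
            if len(values) > 1:
                differing[field] = sorted(values)
        if differing:
            report[fingerprint] = differing
    return report


# Hypothetical sample data: two copies of one photo, dated differently.
records = [
    {"fingerprint": "a1", "file": "praca_1952.tif", "date": "1952", "rights": "public"},
    {"fingerprint": "a1", "file": "img_0043.jpg", "date": "1953", "rights": "public"},
    {"fingerprint": "b2", "file": "viaduto.tif", "date": "1960", "rights": "restricted"},
]
report = metadata_conflicts(records)
```

A script like this only surfaces the disagreement. Deciding whether 1952 or 1953 is the correct date still requires someone to check the physical original.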

The federal government's Lei de Governo Aberto framework, updated in late 2024, added urgency. Public bodies are now required to publish machine-readable image metadata as part of open-data obligations, which means mislabelled or duplicated entries become publicly visible errors rather than internal IT headaches. For São Paulo's secretariats, that transparency requirement turned a storage problem into a reputational one.

Technology vendors operating in the city have moved faster than government. Several startups with offices in the Berrini corridor have deployed perceptual hashing tools to automate duplicate detection across client databases. The software derives a compact fingerprint from each image's visual content rather than from its file name or format, so re-scans and re-encoded copies of the same picture yield matching or near-matching fingerprints. The approach compresses what was a months-long manual audit into days.
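For readers curious how perceptual hashing works, the sketch below shows one of the simplest variants, often called an average hash. It is not the method used by any vendor named in this article. It assumes the image has already been decoded into a grid of grayscale brightness values, and the function names and the tiny synthetic test images are illustrative.

```python
from statistics import mean


def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a grayscale image.

    `pixels` is a list of rows of brightness values (0-255). Height and
    width must each be at least `size`.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        for c in range(size):
            # Average the block of pixels that maps onto grid cell (r, c).
            rows = range(r * height // size, (r + 1) * height // size)
            cols = range(c * width // size, (c + 1) * width // size)
            cells.append(mean(pixels[y][x] for y in rows for x in cols))
    threshold = mean(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if brighter than the overall average.
        bits = (bits << 1) | int(value > threshold)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


# Synthetic 16x16 images: a left-to-right gradient, a slightly brighter
# copy of it, and an unrelated top-to-bottom gradient.
original = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[min(255, v + 10) for v in row] for row in original]
different = [[y * 16] * 16 for y in range(16)]

same_photo = hamming_distance(average_hash(original), average_hash(brighter))
other_photo = hamming_distance(average_hash(original), average_hash(different))
```

A small distance between two fingerprints suggests the same picture, and a large one suggests different pictures. Where to draw the line is a tuning decision, and production tools generally use more robust hashes than this one.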

For individuals and organisations dealing with their own duplicate-image backlogs, the practical lesson from São Paulo's experience is straightforward: implement a deduplication check at the point of file ingestion, not after a library has grown unmanaged for years. Retroactive cleanup is expensive. The Arquivo Histórico Municipal is expected to complete its first full audit round by the end of the third quarter of 2026, with a public report on findings slated for October. How that report lands will shape whether other city departments accelerate or delay their own image-replacement programmes.
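As a rough illustration of what a check at the point of ingestion can look like, the Python sketch below refuses a file whose bytes have already been stored. The `IngestIndex` class and the file names are hypothetical, and the technique shown (a SHA-256 digest of the file contents) catches only byte-for-byte copies. Re-scans or re-encoded versions of the same picture would need a perceptual comparison instead.

```python
import hashlib


class IngestIndex:
    """Tracks content digests of files already accepted into a library."""

    def __init__(self):
        self._seen = {}  # SHA-256 hex digest -> name of the first file stored

    def ingest(self, name, data):
        """Return None if `data` is new, else the name of the existing copy."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is None:
            self._seen[digest] = name
        return existing


# Hypothetical uploads: the second file is a byte-identical copy of the first.
index = IngestIndex()
scan = b"example image bytes"
first = index.ingest("scan_001.tif", scan)
second = index.ingest("scan_001_copy.tif", scan)
third = index.ingest("scan_002.tif", b"other image bytes")
```

Here `first` and `third` come back as `None`, meaning the files were accepted, while `second` returns the name of the copy already on record so the upload can be rejected or linked to it.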

About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
