The Daily São Paulo

São Paulo's Digital Archive Problem: The Hidden Cost of Duplicate Images Clogging City Systems

Municipal databases, newsrooms and e-commerce platforms across the city are burning storage budgets on redundant image files — and new data shows the scale of the waste.

By São Paulo News Desk · Published 4 July 2026, 4:16 pm

Photo: Siewert, Thomas A. / Public domain (Wikimedia Commons)

São Paulo's public and private sector organisations are sitting on hundreds of terabytes of redundant visual data. Duplicate image files — the same photograph stored two, three, sometimes a dozen times under different file names — are costing local institutions real money, and a reckoning with that inefficiency is now underway across the city's tech and administrative sectors.

The issue matters now because São Paulo's municipal digital infrastructure has expanded rapidly since the Prefeitura launched its Programa SP Digital, which pushed dozens of secretariats to digitise records and move workflows online. More data, stored faster and with less governance, produces more duplication. Storage costs in Brazilian data centres rose sharply following the real's depreciation cycle in 2024, turning every unnecessary gigabyte into a measurable budget item.

What the Numbers Actually Show

Industry benchmarks from the data management sector suggest that between 20 and 30 percent of files stored in large unstructured repositories are exact or near-exact duplicates. For an organisation running a 500-terabyte image archive, a size common among São Paulo's larger media groups, e-commerce operators and government bodies, that represents between 100 and 150 terabytes of storage that delivers zero informational value. At current pricing for enterprise cloud storage on Brazilian-hosted platforms, which runs at roughly R$0.08 to R$0.12 per gigabyte per month depending on redundancy tier, the annual cost of that dead weight at a single large institution works out to roughly R$96,000 to R$216,000.
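The arithmetic behind an estimate like this can be checked in a few lines. The sketch below is illustrative only: it takes the archive size, duplicate share and price range quoted above, and it assumes decimal units (1 TB = 1,000 GB) and a flat per-gigabyte price. The function name is hypothetical.

```python
# Assumption: decimal units, 1 TB = 1,000 GB. Providers that bill in binary
# units (1,024 GB per TB) would produce a figure about 2.4 percent higher.
GB_PER_TB = 1_000


def annual_duplicate_cost(archive_tb, duplicate_share, price_per_gb_month):
    """Annual cost in reais of storing the duplicated share of an archive."""
    duplicate_gb = archive_tb * duplicate_share * GB_PER_TB
    return duplicate_gb * price_per_gb_month * 12


# Low end: 20 percent duplicates at R$0.08; high end: 30 percent at R$0.12.
low = annual_duplicate_cost(500, 0.20, 0.08)
high = annual_duplicate_cost(500, 0.30, 0.12)
print(f"R${low:,.0f} to R${high:,.0f} per year")
```

Changing the three inputs is enough to rerun the estimate for a different archive size or pricing tier.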

The Secretaria Municipal de Inovação e Tecnologia, headquartered near the Anhangabaú district, has flagged image deduplication as a priority within its 2026 data governance review. The city's Arquivo Histórico Municipal on Rua Gaioso also faces the challenge in a different form: digitisation campaigns that ran through partner institutions produced multiple scanned versions of the same physical document, compounding catalogue confusion alongside storage overhead.

For the private sector, the problem is sharpest in e-commerce. The Vila Olímpia and Berrini corridor, home to Brazilian operations of major retail platforms and logistics companies, has seen engineering teams dedicate entire sprints to deduplication pipelines. One common pattern: product image libraries accumulated over years of supplier uploads, where the same SKU photograph arrives from three different sources in three slightly different resolutions, each stored as a distinct file because no automated duplicate check, whether an exact content hash or a perceptual one that tolerates resizing, was in place at ingestion.
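A minimal sketch of what a hash check at ingestion could look like follows. It is not drawn from any of the companies mentioned: the `IngestionIndex` class, its in-memory dictionary and the file names are all hypothetical, and a production system would presumably keep the index in a database.

```python
import hashlib


class IngestionIndex:
    """Hypothetical in-memory index of content hashes seen at upload time."""

    def __init__(self):
        self._by_digest = {}  # SHA-256 hex digest -> name of first stored file

    def ingest(self, name, data):
        """Return the name of the stored copy; record `name` only if the bytes are new."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._by_digest.get(digest)
        if existing is not None:
            return existing  # byte-identical upload: point at the stored copy
        self._by_digest[digest] = name
        return name


index = IngestionIndex()
first = index.ingest("supplier_a/sku123.jpg", b"\xff\xd8 example image bytes")
second = index.ingest("supplier_b/sku123_copy.jpg", b"\xff\xd8 example image bytes")
print(first, second)  # both calls resolve to the first stored name
```

A cryptographic hash such as SHA-256 only matches byte-identical files. The same photograph saved at a different resolution produces a different digest, so resized copies would need a perceptual comparison on top of this check.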

Deduplication Is Not Simple — and São Paulo Knows It

The technical fix sounds straightforward: run a perceptual hash algorithm across an image library, flag near-duplicates, delete or merge them. The operational reality is harder. Images that look identical may carry different metadata, licensing information or alt-text relevant to accessibility compliance under Brazil's Lei Brasileira de Inclusão, which requires digital platforms to maintain descriptive data for users with visual impairments. Deleting the wrong version of a file can strip that compliance layer.
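To make the idea concrete, here is a small sketch of one simple perceptual hash, the average hash, written in plain Python. It assumes each image has already been decoded into a grayscale grid (a list of rows of 0 to 255 values) at least 8 pixels on each side. The function names, the 5-bit threshold and the sample file names are illustrative assumptions. The sketch only flags candidate pairs for review and deletes nothing, since the copies may carry different metadata or alt-text.

```python
def average_hash(pixels, size=8):
    """Average hash of a grayscale image given as a list of rows of 0-255 ints.

    The image is reduced to size x size block means; each block becomes a 1 bit
    if it is brighter than the mean of all blocks, otherwise a 0 bit.
    """
    height, width = len(pixels), len(pixels[0])
    blocks = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            cells = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            blocks.append(sum(cells) / len(cells))
    mean = sum(blocks) / len(blocks)
    bits = 0
    for value in blocks:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def flag_near_duplicates(images, max_distance=5):
    """Return name pairs whose hashes differ by at most max_distance bits."""
    hashes = {name: average_hash(px) for name, px in images.items()}
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming_distance(hashes[a], hashes[b]) <= max_distance
    ]


def gradient(n):
    """Synthetic n x n test image: a left-to-right brightness ramp."""
    return [[x * 255 // (n - 1) for x in range(n)] for _ in range(n)]


images = {
    "sku_large.jpg": gradient(32),
    "sku_small.jpg": gradient(16),  # same picture at half the resolution
    "other.jpg": [[255 - v for v in row] for row in gradient(32)],  # inverted
}
flagged = flag_near_duplicates(images)
print(flagged)
```

Here the two resolutions of the same ramp hash identically and are flagged as a pair, while the inverted image is not. Real photographs are noisier, so the threshold would need tuning against a reviewed sample.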

The Fundação Getúlio Vargas campus in Itaim Bibi has been researching automated governance frameworks for exactly this kind of systematic data cleaning, looking at how municipal and corporate archives can apply deduplication without losing legally or historically significant file attributes. The work feeds into broader conversations about data stewardship that the federal Lula administration has been pushing through the Autoridade Nacional de Proteção de Dados, which oversees how organisations handle personal data.

For organisations starting this process now, the practical path involves three steps that data engineers across the city's tech ecosystem broadly agree on: conduct a full storage audit with file fingerprinting before touching anything; establish a retention policy that distinguishes master files from derivatives; and build ingestion-stage deduplication so the problem stops compounding going forward. Several São Paulo-based startups operating out of the Distrito Inovação cluster in Lapa are selling tooling aimed at exactly that pipeline — a sign the market has noticed the scale of the problem even when institutions have not yet moved to fix it.
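The first of those steps, a read-only audit with file fingerprinting, can be sketched as follows. This is an assumption-laden illustration rather than any vendor's tooling: it uses SHA-256 as the fingerprint, so it finds byte-identical files only, and the function names and demo files are hypothetical. It reports duplicate groups and reclaimable bytes without deleting anything, leaving the choice of master file to the retention policy.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root):
    """Group files under root by fingerprint; keep groups with two or more files."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[fingerprint(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def wasted_bytes(duplicate_groups):
    """Bytes that would be freed if each group kept a single copy."""
    return sum(
        paths[0].stat().st_size * (len(paths) - 1)
        for paths in duplicate_groups.values()
    )


# Demo on a throwaway directory with two identical files and one unique file.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"same image bytes"
    Path(tmp, "a.jpg").write_bytes(payload)
    Path(tmp, "b.jpg").write_bytes(payload)
    Path(tmp, "c.jpg").write_bytes(b"a different image")
    duplicates = audit(tmp)
    duplicate_names = sorted(sorted(p.name for p in paths) for paths in duplicates.values())
    reclaimable = wasted_bytes(duplicates)

print(duplicate_names, reclaimable)
```

Running the audit before any cleanup gives a baseline figure for the waste and a list of groups that a retention policy can then resolve.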

The cost of inaction keeps climbing. Storage is not free, governance failures attract regulatory scrutiny, and digital archives that nobody trusts become archives nobody uses. São Paulo's institutions have the technical talent to fix this. The data says they also have an increasingly expensive reason to start.

About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.