The Daily São Paulo

São Paulo news, every day

São Paulo's Digital Archives Are Drowning in Duplicate Images — and the Numbers Tell a Damaging Story

Municipal databases, newsrooms, and tech platforms across the city are losing time and money to redundant visual data, and a new wave of automated detection tools is putting hard figures on the problem for the first time.

By São Paulo News Desk · Published 4 July 2026, 4:11 pm

Photo: Willian Santos on Pexels

São Paulo's public and private institutions are sitting on image libraries bloated by duplicate files, a problem estimated to cost the city's digital sector tens of millions of reais a year in excess storage, slower systems, and wasted editorial labour. The scale became clearer this year as platforms and municipal agencies began deploying automated deduplication software, and the results have surprised administrators accustomed to treating the issue as minor housekeeping.

The timing matters. The Prefeitura de São Paulo, under Mayor Ricardo Nunes, launched the Programa São Paulo Inteligente to digitise urban services and consolidate municipal data infrastructure. As agencies pushed records online, filling portals with photographs of public works, flood incidents in Zona Leste, and infrastructure projects along Avenida 23 de Maio, the volume of redundant image files grew in parallel. Deduplication is no longer only a tech-industry concern; it is a municipal governance problem.

The Numbers Behind the Redundancy

Industry benchmarks used by data engineers in Brazil suggest that large unmanaged image repositories typically contain between 20 and 40 percent duplicate or near-duplicate files. For a mid-sized São Paulo newsroom or a state agency running a digital asset management system with 500,000 images, that range translates to between 100,000 and 200,000 redundant files consuming server space and degrading search performance. Cloud storage in Brazil, billed through providers such as the AWS São Paulo Region and Google Cloud's Osasco data centre hub, runs at roughly R$0.10 to R$0.23 per gigabyte per month depending on tier. At those rates, a library carrying 2 terabytes (about 2,000 gigabytes) of duplicate image data costs an organisation roughly R$2,400 to R$5,500 a year in pure redundant storage, before factoring in bandwidth or processing overhead.
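The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above; the function names are illustrative, and the conversion of 1 terabyte to 1,000 gigabytes is an assumption (providers may bill on slightly different units).

```python
# Rough check of the duplicate-count and storage-cost figures quoted above.
# Assumption: 1 TB is treated as 1,000 GB; function names are illustrative.

GB_PER_TB = 1_000


def duplicate_range(total_images, low_share=0.20, high_share=0.40):
    """Return the (low, high) number of duplicate files for a library."""
    return round(total_images * low_share), round(total_images * high_share)


def annual_storage_cost(gigabytes, price_per_gb_month):
    """Return the yearly cost in reais of storing `gigabytes` at a monthly rate."""
    return gigabytes * price_per_gb_month * 12


low_dupes, high_dupes = duplicate_range(500_000)
duplicate_gb = 2 * GB_PER_TB
cheapest = annual_storage_cost(duplicate_gb, 0.10)
priciest = annual_storage_cost(duplicate_gb, 0.23)

print(f"Duplicate files: {low_dupes:,} to {high_dupes:,}")
print(f"Yearly cost of 2 TB of duplicates: R${cheapest:,.0f} to R${priciest:,.0f}")
```

Under these assumptions, 2 terabytes of duplicates works out to about R$2,400 a year at the cheapest quoted rate and about R$5,520 at the most expensive.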

The problem compounds in journalistic and government contexts because duplicate images often carry inconsistent metadata — different captions, contradictory dates, mismatched location tags. The Instituto Brasileiro de Informação em Ciência e Tecnologia, known as IBICT, has documented metadata inconsistency as one of the top barriers to effective digital preservation across Brazilian public institutions, though specific deduplication loss figures remain scattered across agency-level audits rather than a single national count.

In São Paulo's tech ecosystem, concentrated around Avenida Faria Lima and the Vila Olímpia corridor, several startups have built deduplication tooling on top of perceptual hashing algorithms, methods that detect visually identical or near-identical images even when file names and formats differ. One approach, called dHash or difference hashing, shrinks each image to a small greyscale grid, records whether each pixel is brighter or darker than its neighbour, and packs the results into a 64-bit fingerprint. Two images whose fingerprints differ in fewer than roughly 10 of those 64 bits, a measure known as the Hamming distance, are flagged as probable duplicates. The method can process thousands of images per minute on standard hardware, making bulk library audits feasible for the first time without manual review.
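A minimal sketch of the difference-hashing idea follows. It is not any vendor's implementation: it assumes the image has already been reduced to a greyscale grid of 8 rows by 9 columns (which yields 8 × 8 = 64 comparisons), and the choice to set a bit when the right-hand pixel is brighter is one convention among several that work as long as it is applied consistently.

```python
# Illustrative difference hash (dHash) on a pre-shrunk greyscale grid.
# Assumption: the caller has already downscaled the image to 8 rows x 9 columns
# of brightness values; real tools do this step with an imaging library.

ROWS, COLS = 8, 9


def dhash(gray_rows):
    """Pack 64 left-to-right brightness comparisons into one integer."""
    if len(gray_rows) != ROWS or any(len(row) != COLS for row in gray_rows):
        raise ValueError("expected an 8 x 9 grid of greyscale values")
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            # One bit per adjacent pair: 1 if brightness rises, else 0.
            bits = (bits << 1) | (1 if right > left else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two fingerprints differ."""
    return bin(hash_a ^ hash_b).count("1")


# A synthetic left-to-right brightness ramp stands in for a photograph.
ramp = [[col * 10 for col in range(COLS)] for _ in range(ROWS)]
brighter = [[value + 5 for value in row] for row in ramp]  # uniform brightening
edited = [row[:] for row in ramp]
edited[0][4] = 0                                           # one altered pixel
mirrored = [row[::-1] for row in ramp]                     # a different picture

for label, grid in [("brighter", brighter), ("edited", edited), ("mirrored", mirrored)]:
    print(label, hamming_distance(dhash(ramp), dhash(grid)))
```

Because the hash records only whether brightness rises or falls between neighbours, a uniformly brightened copy produces the same fingerprint, a small local edit flips a bit or two, and an unrelated picture lands far above the 10-bit threshold.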

What Organisations Can Do Now

The practical path forward for São Paulo institutions starts with an audit. Free and open-source tools, including scripts built on the Python library ImageHash, which is available on GitHub, allow IT teams to run a first-pass duplication scan without procurement costs. Organisations using the state government's Rede Governo system or municipal portals managed through the Secretaria Municipal de Inovação e Tecnologia can request technical guidance on deduplication workflows through those bodies' digital governance desks.
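A first-pass scan of the kind described above might look like the following sketch. It assumes the Pillow and ImageHash packages are installed, and the function names, the file-extension list, and the threshold of 10 bits are illustrative choices rather than a recommended configuration. The pairwise comparison is quadratic in the number of images, so it suits a sample audit more than a full production library.

```python
# Sketch of a first-pass duplicate audit using the ImageHash library.
# Assumptions: Pillow and ImageHash are installed; names and the extension
# list are illustrative; the 10-bit threshold follows the figure quoted above.
from itertools import combinations
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def hash_file(path):
    """Return the 64-bit dHash of one image file as an integer."""
    # Imported here so the comparison logic below works without these packages.
    from PIL import Image
    import imagehash

    with Image.open(path) as image:
        # str() of an ImageHash is its hexadecimal form.
        return int(str(imagehash.dhash(image)), 16)


def hash_library(root):
    """Hash every image file found under `root`, keyed by file path."""
    return {
        str(path): hash_file(path)
        for path in sorted(Path(root).rglob("*"))
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    }


def find_probable_duplicates(hashes, threshold=10):
    """Return (file_a, file_b, distance) for pairs closer than `threshold` bits."""
    pairs = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2):
        distance = bin(hash_a ^ hash_b).count("1")  # Hamming distance
        if distance < threshold:
            pairs.append((name_a, name_b, distance))
    return pairs


# Example use against a hypothetical folder:
#   for a, b, d in find_probable_duplicates(hash_library("photo_archive")):
#       print(f"{a} ~ {b} (differs in {d} bits)")
```

Flagged pairs are candidates only; a person should still confirm them before anything is deleted, particularly where captions or rights records differ between copies.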

For Paulistano newsrooms and content platforms, the editorial case is as strong as the financial one. Duplicate images clog content management systems, slow page load times — a measurable SEO disadvantage — and create legal exposure when the same licensed photograph appears multiple times under different rights records. Getty Images and Agência Estado, two major image suppliers active in the Brazilian market, both flag licensing compliance as a downstream risk of poor asset management.

The deduplication software market is not standing still. Vendors are adding AI-assisted near-duplicate detection that catches images cropped or colour-corrected from the same source — a step beyond simple hash matching. For São Paulo's institutions, the message from the data is straightforward: the cost of inaction compounds every month a duplicate file sits on a server in Osasco or a municipal data centre off Avenida Paulista, and the tools to fix it have never been cheaper or faster.


About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
