The Daily São Paulo

São Paulo news, every day

São Paulo's Digital Archives Are Drowning in Duplicate Images — and the Numbers Show Why It's Getting Worse

Municipal databases, newsrooms, and tech startups across the city are sitting on billions of redundant files, costing storage budgets and slowing AI pipelines.

By São Paulo News Desk · Published 4 July 2026, 3:45 pm

Photo: Gezer Amorim on Pexels

São Paulo's public and private sectors are generating duplicate digital images at a rate that their storage infrastructure was never designed to absorb. A survey conducted earlier this year by the Instituto de Pesquisas Tecnológicas, the state-backed technology research body headquartered on the University of São Paulo campus in Cidade Universitária, found that duplicate image files account for between 28 and 34 percent of total storage load across the municipal government's document management systems. For a city running more than 40 active secretariat databases, that figure translates into tens of petabytes of redundant data.

The finding lands at a fraught moment. Mayor Ricardo Nunes' administration is midway through a R$1.2 billion digital transformation program called SP Digital, which is meant to modernize city services from licensing to health records. Bloated archives clogged with duplicate files undercut the efficiency gains the program promises. Every redundant scan of a Habitação document, and every duplicate photo from a Subprefeitura field inspection that sits unsorted in a server rack, represents wasted compute and cloud spend.

What the Data Actually Shows

The duplication problem is not unique to government. Across the Paulista Avenue corridor — home to dozens of media companies and tech firms between Consolação and Trianon-Masp — digital asset managers report similar ratios. Industry estimates from ABERT, the Brazilian Association of Radio and Television Broadcasters, put the average broadcast archive duplication rate at around 22 percent nationwide, though internal figures from São Paulo-based operations tend to run higher because of the volume of live event coverage the city generates.

Storage costs matter here. Cloud storage pricing from the three largest providers operating data centers in the Tamboré and Barueri tech corridor, where Google, Amazon Web Services, and local player Locaweb maintain infrastructure, currently sits at roughly R$0.09 to R$0.14 per gigabyte per month for standard-tier object storage, depending on the contract tier and egress terms. A mid-size newsroom or municipal secretariat holding 500 terabytes of image archives and carrying a 30 percent duplication rate is effectively paying for 150 terabytes of files it does not need. At about R$0.105 per gigabyte per month, near the middle of that range, that is approximately R$189,000 in wasted annual cloud spend before redundancy and backup multipliers.
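The arithmetic behind that estimate can be checked with a short sketch. It assumes decimal units (1 terabyte = 1,000 gigabytes) and treats R$0.105 per gigabyte per month as the mid-range price, an assumption inferred from the R$189,000 figure rather than a number quoted by any provider. The function name is illustrative.

```python
# Back-of-the-envelope cost of storing duplicate images, using the article's figures.
# Assumption: decimal units (1 TB = 1,000 GB).
GB_PER_TB = 1_000


def wasted_annual_spend(archive_tb, duplicate_rate, price_per_gb_month):
    """Annual cost, in reais, of the duplicated share of an archive."""
    duplicate_gb = archive_tb * duplicate_rate * GB_PER_TB
    return duplicate_gb * price_per_gb_month * 12


# Low and high ends of the quoted price range, plus the assumed mid-range price.
for label, price in [("low", 0.09), ("assumed mid", 0.105), ("high", 0.14)]:
    cost = wasted_annual_spend(500, 0.30, price)
    print(f"{label:>11}: R${cost:,.0f} per year")
```

Across the full quoted price range, the same 150 terabytes of duplicates would cost between about R$162,000 and R$252,000 a year.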

Startups in the Vila Olímpia and Faria Lima fintech-and-tech cluster have started treating duplicate image detection as a product category in its own right. At least four São Paulo companies — including image recognition firm Neuralmind, which spun out of Unicamp research in Campinas — are building or have already deployed deduplication tooling that uses perceptual hashing and convolutional neural network fingerprinting to identify near-identical images even when file names, metadata, or compression levels differ. Perceptual hashing can process roughly 10,000 image comparisons per second on commodity GPU hardware, making a full audit of a 500-terabyte archive achievable over a weekend rather than weeks of manual review.
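To illustrate the idea behind perceptual hashing, here is a minimal sketch of one common variant, the difference hash. It is not a description of what Neuralmind or any other company named here actually runs. It assumes an image library has already decoded each file and shrunk it to an 8-by-9 grid of grayscale values, and the sample pixel data is synthetic.

```python
# Difference hash (dHash) sketch over a grayscale pixel grid.
# Assumption: decoding and resizing to 8 rows x 9 columns of 0-255 values
# has already been done by an image library; that step is omitted here.

def dhash(gray):
    """Return a 64-bit integer: one bit per left/right neighbour comparison."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# Hypothetical sample data standing in for a photo.
original = [[(x * 37 + y * 53) % 200 for x in range(9)] for y in range(8)]
# A uniformly brightened copy: different bytes, same visual structure.
brighter = [[v + 20 for v in row] for row in original]
# A tonally inverted image: every brightness gradient is reversed.
inverted = [[255 - v for v in row] for row in original]

print(hamming(dhash(original), dhash(brighter)))  # 0: treated as a duplicate
print(hamming(dhash(original), dhash(inverted)))  # 64: clearly a different image
```

In practice two images are flagged as near-duplicates when the Hamming distance falls below a small threshold. The right threshold depends on the archive and has to be tuned against known samples.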

Why Cleaning Up Is Harder Than It Sounds

The technical fix is mature. The organizational problem is not. Municipal systems in São Paulo were built across incompatible procurement cycles, with some secretariats still running on-premises servers bought under contracts signed before 2015. When a Subprefeitura in Penha uses a different content management system than another Subprefeitura elsewhere in the city, and both feed into the same SP Digital integration layer, the deduplication logic has to work across mismatched schemas, inconsistent EXIF metadata, and files renamed at multiple points in the chain. That fragmentation is what drives duplication rates above what any single well-managed system would accumulate.

For organizations looking to audit their own image libraries, the practical starting point is generating a hash-based inventory before touching any files. Tools such as the open-source rdfind, or commercial alternatives from Locaweb's managed services catalog, can produce a full duplicate map without deleting anything. The rule of thumb among data engineers working in the Faria Lima corridor: never run automated deletion on an image archive without a verified cold backup, and never trust filename matching alone, because content-based fingerprinting is the only method that catches recompressed or slightly cropped duplicates. For the SP Digital program to deliver on its efficiency promises, that kind of systematic audit needs to happen before the next contract cycle begins in early 2027, not after.
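A hash-based inventory of the kind described above can be sketched in a few lines. This is an illustrative, read-only example built on Python's standard library, not a substitute for rdfind. The function name and the list of file extensions are assumptions. It only finds byte-identical files, so recompressed or cropped copies would still need perceptual fingerprinting.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_map(root, extensions=(".jpg", ".jpeg", ".png", ".tif", ".tiff")):
    """Group image files under `root` by the SHA-256 of their bytes.

    Read-only: nothing is moved, renamed, or deleted.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            digest = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large scans do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            groups[digest.hexdigest()].append(path)
    # Keep only digests shared by two or more files.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Calling `duplicate_map` on an archive directory returns a dictionary that maps each content hash to the files sharing it. That map can be reviewed, and checked against a backup, before anyone decides what to remove.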

About this article

Published by The Daily São Paulo

This article was produced by The Daily São Paulo editorial desk and covers news in São Paulo. See our editorial standards for how we use AI.
