São Paulo's public and private sector organisations are sitting on hundreds of terabytes of redundant visual data. Duplicate image files — the same photograph stored two, three, sometimes a dozen times under different file names — are costing local institutions real money, and a reckoning with that inefficiency is now underway across the city's tech and administrative sectors.
The issue matters now because São Paulo's municipal digital infrastructure has expanded rapidly since the Prefeitura launched its Programa SP Digital initiative, which pushed dozens of secretariats to digitise records and move workflows online. More data, stored faster, with less governance, produces more duplication. Storage costs in Brazilian data centres rose sharply following the real's depreciation cycle in 2024, making every unnecessary gigabyte a measurable budget line.
What the Numbers Actually Show
Industry benchmarks from the data management sector suggest that between 20 and 30 percent of files stored in large unstructured repositories are exact or near-exact duplicates. For an organisation running a 500-terabyte image archive (common among São Paulo's larger media groups, e-commerce operators and government bodies), that represents between 100 and 150 terabytes of storage that delivers zero informational value. At current pricing for enterprise cloud storage on Brazilian-hosted platforms, which runs at roughly R$0.08 to R$0.12 per gigabyte per month depending on redundancy tier, the raw storage charge for that dead weight works out to roughly R$96,000 to R$216,000 a year for a single large institution.
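The arithmetic behind the raw storage line can be sketched in a few lines. This is a back-of-envelope illustration only: it uses the archive size, duplicate share and per-gigabyte prices quoted above, assumes decimal terabytes (1 TB = 1,000 GB), and the function name is made up for the example. It covers the per-gigabyte storage charge alone and leaves out backup copies, replication, data transfer and staff time.

```python
GB_PER_TB = 1_000  # assumption: decimal terabytes, as storage is commonly billed


def annual_duplicate_cost(archive_tb, duplicate_share, price_per_gb_month):
    """Yearly storage charge, in reais, for the duplicated share of an archive."""
    duplicate_gb = archive_tb * GB_PER_TB * duplicate_share
    return duplicate_gb * price_per_gb_month * 12


# Low end: 20 percent duplicates at R$0.08; high end: 30 percent at R$0.12.
low = annual_duplicate_cost(500, 0.20, 0.08)
high = annual_duplicate_cost(500, 0.30, 0.12)
print(f"R${low:,.0f} to R${high:,.0f} per year")
```

On those inputs the per-gigabyte charge alone comes to between about R$96,000 and R$216,000 a year; an institution's real figure depends on how many replicas and backups of each duplicate it also pays for.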
The Secretaria Municipal de Inovação e Tecnologia, headquartered near the Anhangabaú district, has flagged image deduplication as a priority within its 2026 data governance review. The city's Arquivo Histórico Municipal on Rua Gaioso also faces the challenge in a different form: digitisation campaigns that ran through partner institutions produced multiple scanned versions of the same physical document, compounding catalogue confusion alongside storage overhead.
For the private sector, the problem is sharpest in e-commerce. The Vila Olímpia and Berrini corridor, home to Brazilian operations of major retail platforms and logistics companies, has seen engineering teams dedicate entire sprints to deduplication pipelines. One common pattern: product image libraries accumulated over years of supplier uploads, where the same SKU photograph arrives from three different sources in three slightly different resolutions, each stored as a distinct file because no automated duplicate check, whether an exact content hash or a perceptual one, was in place at ingestion.
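What an ingestion-stage check might look like, in its simplest form, is sketched below. The class name, the in-memory dictionary and the supplier paths are all hypothetical; a production pipeline would keep the index in a database. Note the limit of the technique: a SHA-256 content hash catches only byte-identical uploads, so the same photograph at a different resolution would still pass through and needs a perceptual comparison on top.

```python
import hashlib


class IngestionIndex:
    """Hypothetical in-memory index from content hash to the first stored path."""

    def __init__(self):
        self._by_digest = {}

    def ingest(self, path, data):
        """Return the path that should represent this content in the library."""
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first path seen for a digest and ignores later ones.
        return self._by_digest.setdefault(digest, path)


index = IngestionIndex()
first = index.ingest("supplier_a/sku123.jpg", b"\xff\xd8 fake jpeg bytes")
second = index.ingest("supplier_b/IMG_0042.jpg", b"\xff\xd8 fake jpeg bytes")
third = index.ingest("supplier_c/sku123_small.jpg", b"\xff\xd8 other bytes")

print(second)  # the second upload resolves to the first supplier's file
```

Returning the existing path, rather than rejecting the upload, lets the caller link the new SKU record to the stored file without writing a second copy.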
Deduplication Is Not Simple — and São Paulo Knows It
The technical fix sounds straightforward: run a perceptual hash algorithm across an image library, flag near-duplicates, delete or merge them. The operational reality is harder. Images that look identical may carry different metadata, licensing information or alt-text relevant to accessibility compliance under Brazil's Lei Brasileira de Inclusão, which requires digital platforms to maintain descriptive data for users with visual impairments. Deleting the wrong version of a file can strip that compliance layer.
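For readers unfamiliar with perceptual hashing, the sketch below shows one common variant, a difference hash, written against plain grayscale pixel grids so that it runs without an imaging library. It is an illustration under stated assumptions, not a description of any tool named in this article: the function names are invented, decoding real image files to grayscale is left out, and the synthetic gradients stand in for photographs. The code only measures similarity; what to do with a flagged pair, including which copy's metadata to keep, is a separate decision.

```python
def shrink(pixels, width, height):
    """Nearest-neighbour resize of a 2D grayscale grid (a list of rows)."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(pixels, size=8):
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    small = shrink(pixels, size + 1, size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Synthetic stand-ins: a left-to-right fade, a half-resolution copy of it,
# and an unrelated image that fades in the opposite direction.
original = [[255 - x for x in range(90)] for _ in range(80)]
half_size = shrink(original, 45, 40)
different = [[x for x in range(90)] for _ in range(80)]

print(hamming(dhash(original), dhash(half_size)))  # 0: flagged as a near-duplicate
print(hamming(dhash(original), dhash(different)))  # 64: every bit differs
```

In practice a pair is flagged when the distance falls below a threshold that each team has to tune on its own library; flagged pairs would then go to a review step rather than straight to deletion.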
The Fundação Getúlio Vargas campus in Itaim Bibi has been researching automated governance frameworks for exactly this kind of data cleaning, looking at how municipal and corporate archives can apply deduplication without losing legally or historically significant file attributes. The work feeds into broader conversations about data stewardship that the federal Lula administration has been pushing through the Autoridade Nacional de Proteção de Dados, which oversees how organisations handle personal data under the Lei Geral de Proteção de Dados.
For organisations starting this process now, the practical path involves three steps that data engineers across the city's tech ecosystem broadly agree on: conduct a full storage audit with file fingerprinting before touching anything; establish a retention policy that distinguishes master files from derivatives; and build ingestion-stage deduplication so the problem stops compounding going forward. Several São Paulo-based startups operating out of the Distrito Inovação cluster in Lapa are selling tooling aimed at exactly that pipeline — a sign the market has noticed the scale of the problem even when institutions have not yet moved to fix it.
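The first of those steps, an audit that fingerprints files before anything is touched, can be sketched roughly as follows. The function names and the report layout are assumptions made for illustration, and the rule that the alphabetically first path becomes the master is only a placeholder for a real retention policy, which would weigh resolution, metadata and licensing. The sketch reports what it finds and deletes nothing.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large images do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root):
    """Group files under root by content hash; report only, never delete."""
    groups = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            groups[fingerprint(path)].append(path)
    report = []
    for paths in groups.values():
        if len(paths) > 1:
            paths.sort()  # placeholder policy: first path alphabetically is the master
            wasted = os.path.getsize(paths[0]) * (len(paths) - 1)
            report.append(
                {"master": paths[0], "duplicates": paths[1:], "reclaimable_bytes": wasted}
            )
    return report


# Tiny demonstration on a throwaway directory with one duplicated file.
with tempfile.TemporaryDirectory() as root:
    for name, data in [("a.jpg", b"same"), ("copy_of_a.jpg", b"same"), ("b.jpg", b"other")]:
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(data)
    report = audit(root)

print(report[0]["reclaimable_bytes"])  # 4 bytes held by the redundant copy
```

Because the audit only groups exact copies, its reclaimable total is a floor: near-duplicates at other resolutions would have to be found with a perceptual comparison and added on top.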
The cost of inaction keeps climbing. Storage is not free, governance failures attract regulatory scrutiny, and digital archives that nobody trusts become archives nobody uses. São Paulo's institutions have the technical talent to fix this. The data says they also have an increasingly expensive reason to start.