São Paulo's municipal digital infrastructure is carrying dead weight. An internal review of public image repositories maintained by the Secretaria Municipal de Gestão — the city body responsible for administrative digitisation — found that duplicate image files account for a significant share of active storage load across civic platforms, a problem that IT managers and archivists say has compounded quietly for years. The numbers, while not yet published in full, have prompted urgent discussion inside the Prefeitura de São Paulo about what a proper deduplication programme would cost and how long it would take to run.
The timing matters. Mayor Ricardo Nunes has staked part of his 2025-2026 administrative agenda on accelerating São Paulo's smart-city credentials, including the expansion of the SP156 citizen services portal and the digitisation of urban planning records held at the Arquivo Histórico Municipal, located on Praça Coronel Fernando Prestes in the Bom Retiro neighbourhood. Bloated, poorly indexed image stores undermine both goals: they slow load times, inflate cloud contract costs, and make records harder to search.
What the Data Actually Shows
Globally, studies of large institutional image libraries consistently find that between 20 and 40 percent of stored files are exact or near-exact duplicates. For a city the scale of São Paulo, which manages records across 32 subprefeituras, 96 districts, and dozens of secretarias, that figure translates into potentially millions of redundant files. Cloud storage contracts with hyperscalers in Brazil, including those run through agreements with providers operating out of data centres in Tamboré, in Barueri on the metropolitan fringe, typically bill by the gigabyte per month. Cutting a repository's footprint by even 25 percent can produce material savings across a multi-year contract cycle.
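The arithmetic behind that claim can be sketched in a few lines. Every figure below is a hypothetical placeholder, not a number from the city's review: the repository size, the price per gigabyte-month, and the contract length are all assumptions chosen only to show how the calculation works.

```python
def contract_savings(total_gb, duplicate_share, price_per_gb_month, months):
    """Estimate storage spend avoided by removing duplicates over a contract.

    All inputs are illustrative assumptions; real contracts may add tiered
    pricing, egress fees, or minimum commitments that this ignores.
    """
    if not 0 <= duplicate_share <= 1:
        raise ValueError("duplicate_share must be between 0 and 1")
    removed_gb = total_gb * duplicate_share
    return removed_gb * price_per_gb_month * months


# Hypothetical inputs: a 40,000 GB repository, 25 percent duplicates,
# R$ 0.12 per GB per month, and a 36-month contract.
saving = contract_savings(40_000, 0.25, 0.12, 36)
print(f"Estimated saving: R$ {saving:,.2f}")  # Estimated saving: R$ 43,200.00
```

The saving scales linearly with each input, so a published baseline audit would let anyone substitute the real repository size and contract price.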
The problem is not unique to government. The Instituto Moreira Salles, which maintains one of Brazil's most important photographic archives at its Paulista Avenue branch, has spoken publicly in past years about the challenge of managing digitised collections at scale. Deduplication is standard practice in serious archival work, but it requires both the right software tooling and, critically, human review, because not every visually similar image is truly redundant. Two photographs taken one second apart from the same camera position may carry different metadata, different legal rights status, or different contextual value for researchers.
For São Paulo's civic tech ecosystem, the issue connects to a broader set of infrastructure questions. The city's Rede São Paulo Livre, the municipal free Wi-Fi programme with access points spread across districts including Brás, Bom Retiro, and the Centro Histórico, depends on back-end systems that serve content efficiently. When the platforms those systems support are bloated with duplicate assets, the user-facing slowdowns are real, even if invisible to most residents queuing at an SP156 terminal on Viaduto do Chá.
The Path to Cleaner Repositories
Deduplication at institutional scale is not a one-weekend fix. Standard industry methodology runs in three phases: automated hash-matching to identify binary-identical files, perceptual hashing to catch visually similar but technically distinct images, and then human curatorial review before any file is permanently removed. For a repository of several hundred thousand images — a conservative estimate for São Paulo's combined municipal holdings — that process can take six to eighteen months depending on staffing and tooling investment.
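The first two phases can be illustrated with a minimal sketch. It is not the city's tooling or any specific product: it assumes file contents are already in memory as bytes, and it stands in for perceptual hashing with a simple "average hash" computed over small grayscale grids (lists of brightness values), which a real pipeline would produce by downscaling each image with an imaging library. The function names and the distance threshold are hypothetical.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(files):
    """Phase 1: group names whose bytes are identical, via SHA-256 digests.

    `files` maps a file name to its raw bytes (an assumption for this sketch).
    """
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def average_hash(pixels):
    """Phase 2: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Count the positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def near_duplicate_candidates(thumbnails, max_distance=1):
    """Return name pairs whose hashes are close; these go to human review.

    Nothing is deleted here: phase 3 remains a curatorial decision.
    """
    hashes = {name: average_hash(grid) for name, grid in thumbnails.items()}
    names = sorted(hashes)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming(hashes[first], hashes[second]) <= max_distance:
                pairs.append((first, second))
    return pairs


# Hypothetical sample data for illustration only.
files = {"a.jpg": b"abc", "b.jpg": b"abc", "c.jpg": b"xyz"}
thumbnails = {
    "x": [[10, 200], [10, 200]],
    "y": [[12, 198], [11, 205]],  # slightly different exposure of "x"
    "z": [[200, 10], [200, 10]],  # mirrored brightness pattern
}
print(exact_duplicate_groups(files))
print(near_duplicate_candidates(thumbnails))
```

The pairwise comparison in the second phase grows quadratically with the number of images, so a repository of several hundred thousand files would likely need an indexing strategy rather than this brute-force loop. The output of both functions is a review queue, which is why the third phase depends on staffing.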
Software costs vary widely. Open-source tools such as dupeGuru handle basic deduplication at no licence cost, while enterprise-grade platforms built for government archival workflows can run into the tens of thousands of reais annually. The Fundação Seade, the state statistics agency headquartered in the Bela Vista district, has invested in data quality infrastructure in recent years and represents a model of how São Paulo-based institutions can approach systematic digital housekeeping.
Civic tech advocates and archivists argue the first step is the simplest: publish a baseline audit. Without a public accounting of how many images the city holds, how many are duplicates, and what the storage bill actually is, there is no pressure for improvement and no benchmark against which progress can be measured. The Arquivo Histórico Municipal's ongoing digitisation drive, which has accelerated since 2023, makes getting those numbers right more urgent with every passing month.