São Paulo's municipal digital archive now holds more than 4 million image files — and a significant share of them are duplicates. That is the core problem driving a quiet but consequential push to clean up the city's sprawling visual records, a project that touches everything from urban planning documents stored at the Arquivo Histórico de São Paulo, on Rua Severo Penteado in Santana, to press photos catalogued by the Prefeitura's Secretaria de Comunicação.
The issue matters right now because the Nunes administration's ongoing Programa Cidade Inteligente, São Paulo's smart-city infrastructure drive, depends heavily on reliable, searchable image data — satellite captures, drone footage, and geotagged street-level photographs used for everything from flood-zone mapping along the Tietê river basin to licensing inspections in Brás and Mooca. If the underlying image library is riddled with redundant files, every automated query becomes slower, every storage bill larger, and every AI-assisted analysis less accurate.
How the Pile-Up Happened
The roots of the problem go back to the early 2000s, when successive municipal governments launched independent digitisation drives with no shared technical standard. The Secretaria Municipal de Cultura ran one scanning programme for historical photographs. The Companhia de Engenharia de Tráfego operated its own camera network. The Secretaria de Habitação maintained separate image folders for favela-regularisation surveys in districts like Heliópolis and Paraisópolis. None of these systems talked to each other, and none required a universal file-naming or metadata protocol.
By 2012, when the city consolidated several departments under a new Centro de Operações São Paulo — modelled loosely on Rio de Janeiro's operations centre — technicians discovered that hundreds of thousands of images had been migrated in bulk from legacy hard drives without deduplication. The same aerial photograph of Marginal Pinheiros, for instance, might exist in three formats under four different file names across two separate servers. Storage costs climbed, but budgets to fix the underlying architecture did not keep pace.
The problem compounded again between 2020 and 2022, when pandemic-era remote work forced multiple secretariats to shift document workflows to cloud platforms — largely Google Drive and Microsoft SharePoint — without a central governance policy. Files were uploaded, shared, re-downloaded, and re-uploaded. A 2023 internal audit by the Controladoria Geral do Município identified redundant image storage as one of three priority inefficiencies in city IT spending, though the full audit report has not been made public.
What a Fix Actually Requires
Duplicate image replacement — the technical process of identifying redundant files using hash-matching algorithms, selecting a canonical master copy, and replacing or deleting the rest — sounds straightforward. In practice, for an institution the size of São Paulo's municipal government, it is a multi-year undertaking. The city's IT arm, Prodam (Empresa de Tecnologia da Informação e Comunicação do Município de São Paulo), has been piloting deduplication tools on a subset of the Secretaria de Urbanismo's image library since the first quarter of 2025, according to procurement documents published on the city's transparency portal, São Paulo Aberto.
The pilot covers roughly 180,000 files linked to zoning-review cases in the expanded Operação Urbana Consorciada Água Branca perimeter, on the west side of the city. Early results will shape whether Prodam rolls the methodology out citywide, a decision expected by the end of the first legislative recess window of 2026, in August.
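To make the hash-matching step described above concrete, here is a minimal sketch in Python. It is not Prodam's tooling, which the procurement documents do not describe in detail. The function names and the rule for choosing a master copy (shallowest path, then alphabetical order) are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep groups with 2+ files."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


def plan_replacement(groups: dict[str, list[Path]]) -> dict[Path, Path]:
    """Map each redundant file to the master copy that would replace it.

    Hypothetical rule: the master is the file with the shallowest path,
    ties broken alphabetically. A real archive would apply its own policy.
    """
    plan = {}
    for paths in groups.values():
        master = min(paths, key=lambda p: (len(p.parts), str(p)))
        for path in paths:
            if path != master:
                plan[path] = master
    return plan
```

A cryptographic hash of this kind only matches byte-identical files. The same aerial photograph saved in three different formats would produce three different hashes, so catching those cases requires a different technique, such as perceptual hashing, which this sketch does not attempt.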
For residents and journalists who use public image archives, whether researching a flooding event on Avenida do Estado or pulling historical photographs for a story about Paulista Avenue's evolution, the practical advice is simple. When accessing the Arquivo Histórico's online portal or the Prefeitura's Flickr collections, always note the unique accession number attached to each image rather than the filename, because filenames are the least reliable identifier in a system still mid-cleanup. The accession number is the only field that technicians are treating as immutable through the deduplication process, so it will survive whichever version of a file is ultimately designated the canonical master.
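Why the accession number is the safer thing to write down can be shown with a small Python sketch. Everything in it is hypothetical: the accession format, the filenames, and the rule that prefers one file format over another are invented for illustration and do not describe the archive's actual catalogue.

```python
from pathlib import PurePath

# Hypothetical preference order for choosing a master copy.
FORMAT_PREFERENCE = (".tif", ".png", ".jpg")


def pick_master(filenames: list[str]) -> str:
    """Choose one filename as the master, using the preference order above."""
    def rank(name: str) -> tuple[int, str]:
        suffix = PurePath(name).suffix.lower()
        if suffix in FORMAT_PREFERENCE:
            return (FORMAT_PREFERENCE.index(suffix), name)
        return (len(FORMAT_PREFERENCE), name)
    return min(filenames, key=rank)


def deduplicate(catalogue: dict[str, list[str]]) -> dict[str, str]:
    """Collapse each accession number's file versions to a single master."""
    return {acc: pick_master(names) for acc, names in catalogue.items()}


# One image, three redundant versions, all under one invented accession number.
catalogue = {
    "AHSP-0001": ["IMG_0042.jpg", "aerea_final_v2.png", "marginal_pinheiros.tif"],
}
after_cleanup = deduplicate(catalogue)
```

In this sketch, a note that recorded "AHSP-0001" still resolves after cleanup, while a note that recorded "IMG_0042.jpg" points at a file that no longer exists.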
The broader fix, if the Prodam pilot scales as planned, should produce measurable results in server costs and search response times by mid-2027 — a timeline that will conveniently land just before the city's next municipal election cycle forces a new round of budget justifications.