São Paulo's municipal government confirmed this week that more than 340,000 duplicate image files are clogging the servers of the Secretaria Municipal de Comunicação, a backlog that has grown quietly since the city's first major digitisation push began under the Programa SP Digital in 2014. The problem is not new. But it has reached a scale at which it is actively distorting public records, slowing the verification work of journalists, and costing taxpayers real money in storage and administrative labour.
The timing matters. The Nunes administration has staked significant political capital on its smart-city agenda, promising by the end of 2026 to migrate key urban infrastructure data — including flood sensor maps along the Tietê River corridor and the drainage crisis documentation across the Zona Leste — onto a unified open-data platform. Duplicate images embedded in those datasets are not merely inconvenient. They produce false metadata matches, skew analytical outputs, and in at least two documented cases earlier this year delayed emergency response coordination because field teams pulled outdated aerial photographs that had been stored multiple times under different file names.
A Problem Built Slowly, Layer by Layer
The roots go back further than 2014. When newsrooms and government offices across São Paulo began digitising physical archives in earnest during the early 2000s, the process was fragmented. The Arquivo Histórico Municipal, housed on Rua Campos Melo in the Bela Vista neighbourhood, ran its own digitisation protocols. The Empresa Municipal de Urbanização — EMURB, later restructured into SP Urbanismo — ran separate ones. Neither system spoke to the other, and both routinely scanned the same photographs of development projects along Avenida Paulista and the old Centro Histórico without any deduplication step built into the workflow.
Social media accelerated the chaos. Between 2016 and 2020, as prefeitura communications teams began publishing heavily on Facebook and Instagram, images were downloaded, recompressed, re-uploaded, and reshared at scale. A single aerial photograph of the Parque Estadual da Cantareira, for example, might exist in six or seven versions across official channels — each with slightly different resolution or colour profile, each treated by the system as a unique asset. By the time the Secretaria de Infraestrutura e Obras adopted cloud storage through a contract with a domestic provider in March 2022, the inherited mess came with it.
Newsrooms were not immune. Internal audits at two major Paulistano digital outlets — neither of which agreed to be identified by name — found that their content management systems contained image duplication rates of between 18 and 24 percent as of late 2025. For a publication running 50 visual stories a day, that is a meaningful drag on search, on load times, and on the work of photo editors trying to verify whether an image has been used before.
What Comes Next for Public and Private Archives
The federal government's Rede Nacional de Ensino e Pesquisa ran a pilot deduplication project with the Universidade de São Paulo's Instituto de Ciências Matemáticas e de Computação — the ICMC, based in São Carlos — that reduced duplicate rates in a test archive by 67 percent using perceptual hashing algorithms. The methodology is now being proposed for adoption by SP Urbanismo for its urban planning image library, according to documents obtained by this newspaper through a Lei de Acesso à Informação request filed in May.
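For readers unfamiliar with the technique, perceptual hashing reduces each image to a short fingerprint based on how it looks rather than on its exact bytes, so a resized or slightly brightened copy lands on the same (or nearly the same) fingerprint as the original. The documents do not say which algorithm the pilot used; the sketch below assumes a "difference hash" (dHash), one common variant, and works on plain grids of brightness values instead of real image files. All function names and the threshold of 10 bits are illustrative choices, not details from the pilot.

```python
from typing import List

Grid = List[List[int]]  # grayscale image as rows of 0-255 brightness values


def shrink(pixels: Grid, width: int, height: int) -> List[List[float]]:
    """Downsample by averaging the block of source pixels behind each output cell."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max((y + 1) * src_h // height, y0 + 1)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max((x + 1) * src_w // width, x0 + 1)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def dhash(pixels: Grid, size: int = 8) -> int:
    """Difference hash: one bit per pair of horizontally adjacent cells."""
    small = shrink(pixels, size + 1, size)  # 9 columns give 8 comparisons per row
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_probable_duplicate(a: Grid, b: Grid, threshold: int = 10) -> bool:
    """Treat two images as duplicates if their hashes differ in few bits.

    The threshold is an assumed value for illustration; a real archive
    would tune it against a labelled sample.
    """
    return hamming(dhash(a), dhash(b)) <= threshold


# Synthetic stand-ins for real photographs (72 pixels wide, 64 high).
original = [[255 - 3 * x for x in range(72)] for _ in range(64)]
# A half-resolution copy, as a social media re-upload might produce.
resized = [row[::2] for row in original[::2]]
# A slightly brightened copy, clipped at the maximum brightness.
brighter = [[min(255, v + 10) for v in row] for row in original]
# A genuinely different picture: the same gradient mirrored left to right.
mirrored = [list(reversed(row)) for row in original]
```

In this toy example the resized and brightened copies hash identically to the original, while the mirrored image differs in every bit. Real photographs would behave less cleanly, which is why the comparison uses a distance threshold rather than exact equality.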
The practical cost is not trivial. Cloud storage for municipal image assets ran to approximately R$4.2 million in the 2025 fiscal year, according to the prefeitura's published budget. Analysts who reviewed those figures estimate that between 15 and 20 percent of that expenditure covered redundant data. That works out to between roughly R$630,000 and R$840,000 a year spent storing the same photographs twice, three times, or more.
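The range implied by the analysts' percentages can be checked in a few lines. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the inputs are the rounded budget figure and the estimated shares quoted above, not audited numbers.

```python
# Annual cloud storage spend in reais, from the published 2025 budget (rounded).
annual_storage_cost = 4_200_000

# Analysts' estimated share of spend covering redundant data, in percent.
low_percent, high_percent = 15, 20

# Integer arithmetic keeps the results exact.
low_waste = annual_storage_cost * low_percent // 100
high_waste = annual_storage_cost * high_percent // 100

print(f"R${low_waste:,} to R${high_waste:,}")  # prints "R$630,000 to R$840,000"
```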
For journalists and researchers working with São Paulo's public image archives, the immediate advice is blunt: cross-reference any visual asset pulled from official sources against the Arquivo Histórico Municipal's catalogue before publication. The Secretaria Municipal de Comunicação has said it expects to complete a first-pass deduplication of its servers by November 2026, though no independent oversight body has been named to verify that timeline. Until then, the duplicates remain — a quiet, expensive record of how fast the city digitised, and how little anyone planned for what happened next.