São Paulo's municipal archive holds more than 40 million digitised documents, and a significant share of them appear more than once. That is the core finding of an internal review completed earlier this year by the Secretaria Municipal de Gestão, which identified duplicate image files as a growing burden on storage budgets, search accuracy and the public transparency portals used by residents across the city's 96 subprefectures.
The problem is more than a matter of bureaucratic tidiness. Requests filed under the Lei de Acesso à Informação, of which São Paulo processes more each year than any other Brazilian municipality, return cluttered results when duplicate scans of the same permit, contract or urban-planning map sit uncleaned in the system. Lawyers working near the Fórum João Mendes, in the Centro Histórico neighbourhood, have complained for years that licence searches for properties in districts like Pinheiros and Vila Madalena pull up redundant PDFs that slow down due-diligence work on transactions worth millions of reais.
What São Paulo Is Actually Doing
Since March 2026, the Secretaria has been running a pilot deduplication programme across three municipal departments — Habitação, Obras and Finanças — using a combination of perceptual hashing and metadata cross-referencing. The pilot covers roughly 1.2 million document images, according to materials circulated at a public hearing held at the Câmara Municipal on Viaduto Jacareí in May. A full rollout is pencilled in for the first quarter of 2027, contingent on budget approval in the next Plano Plurianual cycle.
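For readers unfamiliar with the technique, the combination of perceptual hashing and metadata cross-referencing can be illustrated with a minimal Python sketch. It is not the Secretaria's or Totvs's implementation: the hearing materials do not name a hashing algorithm, so the sketch assumes the simple "average hash", and the record fields (`doc_id`, `department`, `permit_number`) and the distance threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Grayscale values (0-255) of an image already shrunk to a small grid.
Pixels = List[List[int]]


def average_hash(pixels: Pixels) -> int:
    """Average hash: one bit per pixel, set when the pixel is at or above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


@dataclass
class Scan:
    # Hypothetical record layout, for illustration only.
    doc_id: str
    department: str
    permit_number: str
    pixels: Pixels


def likely_duplicates(a: Scan, b: Scan, max_distance: int = 2) -> bool:
    """Flag a pair only when the metadata matches AND the images look alike."""
    same_metadata = a.permit_number == b.permit_number
    distance = hamming(average_hash(a.pixels), average_hash(b.pixels))
    return same_metadata and distance <= max_distance


# A 4x4 "scan", a slightly brighter rescan of it, and an unrelated image.
page = [[10, 10, 200, 200], [10, 10, 200, 200],
        [200, 200, 10, 10], [200, 200, 10, 10]]
brighter = [[p + 5 for p in row] for row in page]
stripes = [[200, 10, 200, 10] for _ in range(4)]

original = Scan("A1", "Obras", "P-001", page)
rescan = Scan("A2", "Obras", "P-001", brighter)
other_image = Scan("A3", "Obras", "P-001", stripes)
other_permit = Scan("A4", "Obras", "P-002", page)
```

In this sketch the brighter rescan is flagged as a duplicate of the original because both hashes are identical, while the striped image and the scan filed under a different permit number are not. Requiring both signals to agree is a conservative choice: it lowers the risk of merging two distinct documents that merely look alike.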
The tool being tested is not proprietary. The Secretaria contracted the São Paulo-based technology company Totvs to adapt an open-source deduplication stack for the city's Oracle-based document management environment. Totvs, headquartered in Bom Retiro, already handles payroll and fiscal systems for several state-level agencies in Brazil, which gave it an advantage in the procurement process over international bidders.
Funding is the immediate constraint. The pilot was budgeted at R$4.2 million for the 18-month testing phase. Scaling to the full 40-million-document archive would, by the Secretaria's own projections presented at the May hearing, require between R$28 million and R$35 million, a sum for which there is not yet a dedicated line in the 2026 municipal budget under Mayor Ricardo Nunes.
How That Compares to Other Cities
Bogotá completed a similar deduplication exercise across its Archivo de Bogotá in 2024, clearing an estimated 6 million redundant images from a 22-million-document base. The Colombian capital used funding from the Inter-American Development Bank's digital-government programme and finished the work in 14 months. Tokyo's Bureau of General Affairs began automated deduplication of its municipal land-registry scans in 2021 and now runs quarterly audits to keep duplicates from re-accumulating.
Mexico City, whose Archivo Histórico sits in the Palacio de Lecumberri, completed a two-year deduplication project in partnership with the Universidad Nacional Autónoma de México in 2023. Officials there reported a 31 percent reduction in storage load and a measurable improvement in keyword-search response times on public portals — concrete benchmarks that São Paulo's Secretaria has cited internally as a target model, according to the May hearing materials.
Where São Paulo differs from its comparators is scale. Its document archive is larger than Bogotá's and Mexico City's combined, and its incoming digitisation rate, driven by court filings, building permits and social-programme enrolments, runs at an estimated 800,000 new scans per month. That intake pace means that even a successful deduplication sweep risks being outrun by new redundant uploads if the city does not simultaneously reform how documents are scanned and ingested at source.
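The arithmetic behind that risk is simple to sketch. The monthly intake of 800,000 scans is the estimate cited above; the share of new scans that turn out to be duplicates has not been published, so the 10 percent used in the example below is purely a hypothetical input.

```python
MONTHLY_INTAKE = 800_000  # estimated new scans per month


def total_intake(months: int) -> int:
    """Scans added to the archive over a given number of months."""
    return MONTHLY_INTAKE * months


def projected_duplicates(months: int, duplicate_percent: int) -> int:
    """New duplicates over the period, for an ASSUMED duplicate share in percent."""
    return total_intake(months) * duplicate_percent // 100


one_year_intake = total_intake(12)                    # 9,600,000 scans
one_year_duplicates = projected_duplicates(12, 10)    # 960,000 at a hypothetical 10%
```

Under that assumed share, a single year of unreformed intake would add close to a million fresh duplicates to whatever a one-off sweep had cleared.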
The next public checkpoint is a progress report the Secretaria is scheduled to present to the Câmara's technology committee in September. If the pilot results are strong enough, budget negotiators could accelerate the timeline. If the R$35 million figure sticks without a dedicated funding line, the broader rollout will almost certainly slip into 2028. By then, at the current intake of roughly 800,000 scans a month, the archive will have taken in at least another 10 million images, each of them a potential new duplicate.