As part of our Building Resilient Infrastructure through Dialogue, Growth, and Exchange (BRIDGE) project, Invest in Open Infrastructure (IOI) is pleased to release the first outcome of our research: a landscape scan of the challenges and strategies for bridging AI companies and open curated collections. 


An image of a suspension bridge over water, fading into mist.
Photo by Modestas Urbonas on Unsplash

Curated collections, such as digital archives, open access journals, scientific data repositories, preprint servers, and knowledge graphs, form part of the digital commons: the shared pool of open resources made freely accessible online. Used regularly by researchers, the public, and commercial entities, this digital commons benefits every sector of society. Maintained largely by academic institutions, nonprofits, governments, and volunteer communities, these curated collections represent a public good built on decades of labour, public and private funding, and an ethos of open knowledge sharing. Their sustainability is inseparable from the sustainability of open science and democratic access to information.

That infrastructure is now under strain. The rapid expansion of AI development has turned open curated collections into an increasingly valuable source of training data. Automated bots now generate traffic that, in some cases, exceeds human visits, overwhelming servers, inflating bandwidth costs, and triggering outages. Meanwhile, a surge of AI-generated content submissions threatens to swamp editorial and curation workflows at repositories that rely on community contributions.

To map this landscape, the report explores the following key themes:

  • The strain on infrastructure: How AI bot traffic is overwhelming servers, inflating costs, and triggering service disruptions at open curated collections.
  • The limitations of current approaches: Why technical, legal, and market-based mechanisms face significant challenges in protecting the commons at scale.
  • The risks of defensive restrictions: How access controls intended to protect collections may paradoxically accelerate data consolidation among well-resourced corporations.
  • The tension of voluntary compliance: Many current approaches rely on voluntary compliance, with legal repercussions as the enforcement mechanism. The report asks what else could encourage cooperative behaviour, and whether enlightened self-interest can succeed where legal and technical frameworks have fallen short.
  • The possibilities of the commons: AI companies and collection stewards both have much to gain from exploring the commons as a shared space for investment and cooperation. 

The report points toward a promising, if demanding, path: commons-based governance grounded in reciprocal norms and shared interests across sectors. AI companies have concrete reasons to want the commons to survive. The loss of reliable data sources reduces the quality and diversity of training data; an increasingly walled-off web raises legal and regulatory risks; growing public frustration with extractive practices creates reputational pressure. Investment in the commons, properly framed, is investment in the quality of AI itself.

Yet a central tension remains unresolved. Whether enlightened self-interest will prove more effective than the legal and technical mechanisms that have already fallen short is an open question. Answering it requires engaging the curators and consumers of open collections as stewards of the digital commons, and co-creating partnership models that could align open knowledge strategies with commercial demand.

We're eager to hear from interested parties and continue the conversation. If you'd like to be part of further discussions on this topic, please contact us at research [at] investinopen [dot] org.