Surfacing Shared Incentives for Curated Collections and AI Tools

As part of our research for the BRIDGE project, we recently published a landscape review of the current state of interactions between AI companies and curated collections.

One thread we found is that the extant literature nearly uniformly advocates commons-based approaches as alternatives or complements to market-based strategies for open curated collections. The commons approach emphasizes collective coordination, social norm cultivation, and institutional trust-building rather than monetization of individual collections. Unlike licensing deals struck between AI firms and individual publishers, commons-based approaches seek to preserve the open, shared character of knowledge resources while ensuring that the entities profiting most from them contribute meaningfully in return. This perspective positions small collections not as individual vendors but as stewards of interconnected knowledge resources requiring collective protection.

Dual bridges spanning the scenic Salt River Canyon in Arizona with striated rock faces and desert grasses. — Photo by Andreas Staver.

The success of a commons-based approach assumes that AI companies do have incentives to engage with open collections stewards—incentives not predicated solely on legal risk. These incentives include:

Ensure high-quality data sources remain online. Excessive bot traffic and scraping threaten to push some small collections offline, as many lack the resources to "continue adding more servers, deploying more sophisticated firewalls, and hiring more operations engineers in perpetuity" (Weinberg, 2025; Grant, 2025). If key data sources go dark, AI companies lose access to the very content that makes their models useful.

Encourage competition and discourage consolidation. Investment in the commons enables everyone, not only the wealthiest corporations, to build and refine models that lead to innovation. Even larger companies share a broad interest in encouraging innovation in the sector that they can benefit from in the future. In its announcement of its support for Harvard’s Institutional Data Initiative, Microsoft cited the motivation to grow “a vibrant, competitive AI economy” by expanding access to the data resources needed to build LLMs (Davis, 2024). Openly licensed datasets can encourage competition and offer smaller players a way in. Adoption of the Model Context Protocol (or MCP, initially developed by Anthropic and later donated to the Linux Foundation) by the major AI market players is an example of industry-wide cooperation in this vein.

Retain scraping access. The relationship between AI companies and the broader web is showing signs of strain, and the consequences of ignoring this dynamic are already visible. A closing off of the web in response to AI crawlers, especially through blunt approaches that do not distinguish them from other machines, is affecting crawling for legitimate and widely accepted purposes, such as archiving and research. As of December 2025, around 5.6M websites had blocked OpenAI's GPTBot, a nearly 70% increase over the previous six months (Claburn, 2025). As scraping restrictions increase and more organizations adopt brute-force approaches to thwarting bots, AI companies risk losing access to important sources of high-quality, novel training data. Investing in the health of the commons, and in norms that distinguish reasonable use from indiscriminate extraction, can help arrest this trend.

Sustain high-quality, diverse training data. Well-stewarded commons provide not just volume but also the metadata, documentation, and quality control that make training data more valuable. Companies may see value in building and sustaining resources that provide them with access to high-value, unique, or novel datasets. Some analyses have speculated that the open web will become polluted by low-quality, machine-generated content, making curated collections increasingly valuable data sources. Noroozian et al. (2025) write that AI model developers should have a vested interest in making curated collections data "identifiable, visible, and discoverable" in order to avoid "'model collapse' or increasingly more repetitive, biased, and less capable AI" caused by the growing presence of synthetic data across the web. Looking further ahead, Woahn (2026) predicts that "The next improvements in model capability will come from: highly specialized domain corpora; well-structured technical datasets; targeted refreshes rather than massive new ingestions; data with deep internal organization, not broad volume."

Foster ongoing human contributions to the open web. The commons is sustained by the continuous labour and ingenuity of human creators. Without new approaches for providing permission, credit, and compensation, these creators have diminishing incentives to openly share their work, and AI models lose access to original content (Chan et al., 2023; Huang & Siddarth, 2023). Borgman and Groth (2025) argue that scholars participate in a gifting economy in which they volunteer labour (such as sharing data) "with the expectation that these gifts create indebtedness, encourage reciprocity, and enhance reputations." To build trust among scholars and collections stewards, AI companies may need to more visibly and concretely adopt the norms of a gifting economy, for example, by ensuring proper attribution.

Create a positive public image and consumer trust. Beyond practical considerations, AI model developers may have a reputational incentive to demonstrate a commitment to "ethical" or "responsible" AI, including appropriate data-harvesting practices. The non-profit Fairly Trained, for example, was launched to certify AI model developers and products that adhere to standards for their training data (Knibbs, 2024). AI model developers also have an interest in providing consumers with reliable information from robust sources to increase adoption of and engagement with their platforms.

Mitigate regulatory and legal risks. Finally, the legal landscape surrounding AI and data use remains in flux. Depending on the outcomes of several lawsuits and pending legislation, AI model developers may need to fundamentally alter how they harvest data. If they cannot rely on fair use justifications for scraping copyrighted data, for example, they will be increasingly reliant on openly licensed and public domain data. The strongest incentive for change could be future government policy that regulates the use of openly available data, for example, by strengthening creator opt-outs.

Despite these shared incentives, the challenge lies in bridging the gap between curated collections stewards and AI companies and developing the sociotechnical infrastructure needed to facilitate cross-sector engagement. Trust between commons communities and AI companies is severely eroded. Open source developer communities have expressed "deep frustration with what they view as AI companies' predatory behaviour toward open source infrastructure," undermining the relationship-building these approaches require (Edwards, 2025). Philosophically, there's tension between ideals of openness and the need for protection. The "open with thoughtfulness" paradigm (Metz, 2025) requires continuous judgment calls that may fragment the commons into incompatible governance zones.


Addressing these challenges will require deliberate effort on multiple fronts. The commons needs norms, governance frameworks, and contribution models developed with input from a range of stakeholders, including AI companies and technology platforms, as well as researchers, creators, and open curated collections stewards. The BRIDGE project is ongoing and represents IOI’s current efforts to identify and foster relationships between these currently disparate groups, and to pilot new approaches to encourage reciprocity that benefits all stakeholders in the AI economy.

This post is an excerpt from our longer work, "Sustaining the Commons in the AI Economy: A Landscape Scan of Challenges and Strategies for Bridging AI Companies and Open Curated Collections."
