The 2025 edition of IOI’s State of Open Infrastructure (to be released in late May 2025) includes our third foray into the wild world of collecting and analysing grant awards made to support open infrastructure (OI). This is the second of a pair of posts describing some of the challenges in collecting, preparing and analysing the funding data we collected. The first focused on data collection issues, while this one focuses on the challenges of cleaning up, interpreting, and analysing that data.

To set the stage, we briefly describe our data collection methods. We focused on funder-reported and centrally reported data as the sources of record, adding several new funders to the list we used in last year’s report. We harvested data directly from funders’ websites when it was available, from OpenAIRE, from USAspending.gov, and from 360giving. In the case of one funder, we reviewed their US Internal Revenue Service 990 forms for grants disbursed to one particular OI. We then searched the description, title, and recipient of each award against a predefined list of search terms (names of OIs, name variants, and in some cases, associated organizations).

Our search focused on the more than 70 OIs that had complete entries in IOI’s Infra Finder as of 14 January 2025, so that we could draw on the additional information Infra Finder provides about each OI in our analysis of funding. We interpreted a match in any of these fields to indicate that an award was of plausible interest, and then reviewed all the plausible awards to determine which ones were relevant.
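As a rough illustration of that first, automated step, here is a minimal sketch in Python, assuming the harvested awards sit in a pandas DataFrame with title, description, and recipient columns. The column names and the example terms are hypothetical stand-ins, not our actual pipeline or term list.

```python
import re
import pandas as pd

# Illustrative term list: each OI maps to its name and known variants.
# These entries (and the column names below) are placeholders only.
SEARCH_TERMS = {
    "Dryad": ["Dryad"],
    "Open Journal Systems": ["Open Journal Systems", "OJS"],
}

def find_plausible_awards(awards: pd.DataFrame) -> pd.DataFrame:
    """Flag awards whose title, description, or recipient mentions an OI term."""
    text = (
        awards["title"].fillna("") + " "
        + awards["description"].fillna("") + " "
        + awards["recipient"].fillna("")
    )
    matches = []
    for oi, terms in SEARCH_TERMS.items():
        # Word-boundary matching reduces, but does not eliminate, false positives
        # for OI names that are also common words.
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
        )
        hits = awards[text.str.contains(pattern)]
        matches.append(hits.assign(matched_oi=oi))
    return pd.concat(matches, ignore_index=True)
```

Anything flagged this way was only “of plausible interest”; relevance was still decided by manual review.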

Data preparation challenges

Finding and classifying relevant awards

There is no straightforward way to surface all grant awards related to all open infrastructure efforts. Indeed, the community struggles even to define what we mean by “open infrastructure.” Ultimately, to search any collection of data, you need words, or ideally a taxonomy, to craft a search strategy. Lacking universally agreed-upon terms for open infrastructure, we opted to use the names of OIs, some name variants, and occasionally, the name of the host organization. We focused on OIs included in Infra Finder to provide a bounded collection that also enabled us to potentially integrate additional information from Infra Finder into our analysis. By definition and by design, that means we have a fairly focused sample of awards related to open infrastructure in our final dataset.

Even with these constraints, detecting awards of plausible interest was non-trivial. Cumulatively, we harvested data for over 7 million awards and projects. After performing our search queries, we were left with 1,444 awards of plausible interest, and following manual review, just 641 awards that were deemed relevant to the 70 OIs on our list. There are two situations where this strategy potentially comes up short. First, if the search terms did not appear in the recipient, title, or description, we obviously did not detect an award as being of plausible interest. Some organizations are successful in attracting grant funding that supports, for example, research and development more generally, which may also be integrated into an OI at that organization, but this was not always clear in the limited award data we harvested. The Internet Archive and the Public Knowledge Project are two such examples: we might have guessed that an award was related to technical development that would benefit the Internet Archive’s Wayback Machine, or related to one of the Public Knowledge Project’s publishing platforms, but unless these OIs were explicitly named, these would only have been guesses. Second, OIs with names that are common words, parts of common words, or otherwise not particularly distinctive could be (and usually were) swamped by false positives and therefore became nearly impossible to detect.

Classifying awards based on the limited information available in the title and description was another challenge. In classifying awards, we differentiated between awards that constituted direct support to an OI, and those that did not provide direct support but did demonstrate use or other impact of an OI in research and scholarship (we term these “adjacent” awards). This was easier said than done, as award titles and descriptions were sparse at best (and sometimes missing entirely), leaving quite a bit of room for interpretation. In addition, with more, and more varied, OIs in this year’s dataset, we found that the classification scheme we developed last year required some adjustments, and we reclassified all of the awards collected in both years.

De-duplicating awards

Some awards of interest appeared in more than one data source, and some appeared more than once within the same data source. It was an easy decision to exclude records that were identical; it was trickier when some fields were the same and others were not. When we observed differences in fields, we generally did not identify them as duplicates and included all of the awards. In particular, if the awards had different funder-assigned identifiers and differed in one or more other fields (usually award dates, PI, institution, or amount), we treated them as distinct awards. As one example, collaborative awards with multiple recipients often had identical or highly similar titles and the same dates, but different recipients and amounts, and these were treated as distinct awards.
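In code terms, the rule amounts to dropping only records that agree on every field we compared. A minimal sketch, with hypothetical column names standing in for our actual schema:

```python
import pandas as pd

# Hypothetical field names; a duplicate must agree on every one of these.
KEY_FIELDS = ["funder", "award_id", "title", "recipient",
              "pi", "institution", "start_date", "amount"]

def deduplicate(awards: pd.DataFrame) -> pd.DataFrame:
    """Drop only records that are identical on all key fields.

    Awards with different funder-assigned identifiers that also differ in at
    least one other field (dates, PI, institution, amount) survive as distinct
    records, as do collaborative awards that share a title and dates but have
    different recipients and amounts.
    """
    return awards.drop_duplicates(subset=KEY_FIELDS, keep="first")
```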

Multiple recipients, multiple funders

Finally, we had a non-trivial number of awards that named more than one OI. We seldom, if ever, had information about how the award was divided among its recipients. Our choices were, essentially, to attribute the entire award amount to each named OI (thus overestimating award totals for OIs); to make some assumption about how the award was divided (perhaps evenly among the named recipients, a flawed assumption made even more problematic by the fact that there may have been additional recipients we did not detect); or to simply exclude the amounts for multi-recipient awards from the total for an OI. We opted for the last: we excluded these amounts when reporting funding for individual OIs, because we did not want to over-report the support they might receive.
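A minimal sketch of that exclusion, assuming one row per (award, matched OI) pair and hypothetical column names:

```python
import pandas as pd

def per_oi_totals(awards: pd.DataFrame) -> pd.Series:
    """Sum award amounts per OI, leaving out awards that name more than one OI."""
    # Count how many distinct OIs each award was matched to.
    oi_counts = awards.groupby("award_id")["matched_oi"].nunique()
    single_oi_ids = oi_counts[oi_counts == 1].index

    # Only single-OI awards contribute to an individual OI's reported total.
    single = awards[awards["award_id"].isin(single_oi_ids)]
    return single.groupby("matched_oi")["amount"].sum()
```

Awards naming a single OI contribute to that OI’s total; awards naming several OIs are simply held out of the per-OI figures.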

We also observed in our unfiltered data some awards that were made jointly by multiple funders. None of these awards appear in our final dataset, but they would present challenges similar to those posed by multi-recipient awards.

What would make it better?

We would love for our dataset to be much broader in scope and much larger. As we mentioned in our previous post, we would like to see the principles and practices outlined in the Barcelona Declaration on Open Research Information widely embraced and implemented with rigour. There are specific ways in which widespread adoption of the Barcelona Declaration would facilitate analyses such as ours. Standardized sources of data would likely require less effort to gather, assemble and understand. The use of persistent identifiers to associate awards with open infrastructures (and funders, and recipient individuals and organizations, and probably more) would alleviate some of the search and de-duplication problems we described. And depositing grant award metadata in open repositories and transfer systems, with permissive licensing, would support the discovery and reuse of this information. Invest in Open Infrastructure is a supporter of the Barcelona Declaration, and we encourage others to support and adopt its principles.

Posted by Gail Steinhart