The 2025 edition of IOI’s State of Open Infrastructure (to be released in late May 2025) includes our third foray into the wild world of collecting and analyzing grant awards made to support open infrastructure (OI). In our initial effort in 2022, we described some of the challenges we encountered in Funding open infrastructure: A survey of available data sources. Having expanded our grant funding data collection to include more funders, more OIs, and awards that support the use or ancillary development of OI, we think the time is right to revisit and expand upon some of these challenges. We will do so in a pair of posts: first on data collection issues (this post), and second on challenges in cleaning up, interpreting, and analyzing the data we collect.
To set the stage, we briefly describe our data collection methods. We focused on funder-reported and centrally reported data as the sources of record, adding several new funders to our list from last year’s report. We harvested data directly from funders’ websites when available, as well as from OpenAIRE, USAspending.gov, and 360Giving. In the case of one funder, we reviewed their US Internal Revenue Service Form 990 filings for grants disbursed to one particular OI. We then searched the compiled data using a predefined list of search terms (names of OIs, name variants, and, in some cases, associated organizations), matching against the description, title, and recipient of each award.
Our search focused on the more than 70 OIs that had complete entries in IOI’s Infra Finder by 14 January 2025, so that we could take advantage of the additional information Infra Finder provides about each OI in our analysis of funding. We interpreted a match in any of these fields to indicate that an award was of plausible interest, and then reviewed all plausible awards to determine which were relevant.
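To make the matching step concrete, here is a minimal sketch of the kind of term matching described above. It assumes award records are represented as simple dicts; the search terms and field names shown here are illustrative assumptions, not our actual term list or schema, and anything flagged this way still goes through manual review.

```python
import re

# Hypothetical term list: each OI maps to its name, name variants, and
# (sometimes) associated organizations. Illustrative entries only.
SEARCH_TERMS = {
    "arXiv": ["arxiv"],
    "DSpace": ["dspace", "lyrasis"],
}

# Award fields we match against (assumed field names).
FIELDS = ("title", "description", "recipient")

def plausible_matches(award: dict) -> set[str]:
    """Return the OIs whose search terms appear in any searched field."""
    text = " ".join(str(award.get(f) or "") for f in FIELDS).lower()
    hits = set()
    for oi, terms in SEARCH_TERMS.items():
        if any(re.search(r"\b" + re.escape(t.lower()) + r"\b", text) for t in terms):
            hits.add(oi)
    return hits

award = {
    "title": "Sustaining arXiv as shared infrastructure",
    "description": "Operating support for the arXiv preprint server.",
    "recipient": "Cornell University",
}
print(plausible_matches(award))  # {'arXiv'}
```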
General issues
The most common general issues we encountered were hard-to-use documentation (or a complete lack of it), missing data values, and apparent inconsistencies in the availability of data from year to year and from source to source.
Documentation
Our preferred data sources were centrally reported data, especially those hosted by the funders. Harvesting data directly from funders’ websites often seemed straightforward, but the data were rarely (if ever) documented. That meant we didn’t know what we didn’t know, although we did not encounter obvious problems or questions beyond the exceptions noted later in this post. Documentation for centralized data sources was generally available but could be labyrinthine.
Missing data values
Our ideal funding dataset would include some fields that simply were not present for every award from every source. Dates were a prime example: most (but not all) awards indicated, at the very least, a year. We presumed that this was either the year the award was made or the year in which work was meant to start. Of course, we didn’t really know, but in such cases we assumed the error wasn’t likely to be large (plus or minus one year seemed like a safe bet). Still, these errors have a way of multiplying. For example, one downstream effect of missing date information was that we couldn’t perform meaningful currency conversions. A more common problem with respect to dates was the lack of any indication of duration or end date; a multi-million dollar award made over several years has a potentially very different impact than a one-year award of the same amount.

Another common problem was the lack of an award abstract or description, text that provides additional detail we could use to search for and identify awards of interest. Finally, some funders included unique identifiers in their award data. These were incredibly useful for deduplicating awards, even when the identifiers were internal to the funder rather than a more standardized identifier such as a DOI, but they were not always present.
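A minimal sketch of why those identifiers matter for deduplication follows. The field names (award_id, doi, funder, recipient, amount, year) are assumptions for illustration, not our actual schema: when a funder-supplied identifier or DOI is present, deduplication is a simple key lookup; when it is absent, we are left relying on a fuzzier composite of fields that may themselves be missing.

```python
from typing import Iterable

def dedupe_awards(awards: Iterable[dict]) -> list[dict]:
    """Drop duplicate award records, preferring funder-supplied identifiers."""
    seen = set()
    unique = []
    for a in awards:
        # Prefer the funder's own identifier (or a DOI) when one is present...
        key = a.get("award_id") or a.get("doi")
        if key is None:
            # ...otherwise fall back to a composite key built from fields
            # that may themselves be incomplete (e.g. a missing year).
            key = (a.get("funder"), a.get("recipient"), a.get("amount"), a.get("year"))
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique
```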
Now you see it, now you don’t
In a couple of cases, we discovered that we could harvest more awards from a data source (a funder website or API, or a central source) than the total count of awards displayed on that source’s own website, and we didn’t always know how to interpret the difference. This was true for the Institute of Museum and Library Services (IMLS; 22,467 awards harvested in November 2024 versus 22,425 listed on their website on 28 March 2025) and for Wellcome as reported by 360Giving (20,856 awards harvested in November 2024 versus 19,331 listed on the 360Giving website on 28 March 2025). We also observed a much lower total number of grant awards reported on the Wellcome website than we were able to harvest from 360Giving (Wellcome’s website listed 4,724 as of 28 March 2025). Possibly Wellcome lists only active awards on their website, but we didn’t know; this again points to the need for better standard practices in this area to help users interpret the numbers they are seeing.
Source-specific issues
Individual data sources presented their own challenges, and we describe a few of these below (if only briefly). The most significant challenge is, of course, when a funder simply does not make their data available at all. The Simons Foundation, for example, does not provide this data in any open or accessible way. In these cases, we tried to use other channels. US-based non-profit organizations are required to report the grants they make to the US Internal Revenue Service on Form 990, and we did review 990s filed by the Simons Foundation to identify awards made to arXiv, but this is a cumbersome process that doesn’t scale.
USAspending.gov
We were very hopeful that USAspending.gov (the “official source of (US) government spending data”) would provide a standardized, one-stop shop for US federal award data. We found instead that some details available from individual funders were not present in USAspending.gov data (particularly grant abstracts or descriptions), and that fields were not mapped correctly or consistently. That said, some US federal funders do not make their data readily available elsewhere, and we were at least able to obtain some information from USAspending.gov, or to search for it. This was the case for the National Aeronautics and Space Administration (NASA), for which we found a handful of relevant awards, and for the Defense Advanced Research Projects Agency (DARPA) and the Library of Congress, for which we found no relevant awards.
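The normalization work this implies might look like the following sketch: mapping a source’s columns onto an internal schema and flagging records that arrive without a usable description. The source column names on the left are assumptions for illustration, not USAspending.gov’s actual export headers, and the internal field names are likewise hypothetical.

```python
# Assumed source column names (left) mapped to our internal field names (right).
# Actual USAspending.gov export headers may differ.
FIELD_MAP = {
    "recipient_name": "recipient",
    "total_obligated_amount": "amount",
    "period_of_performance_start_date": "start_date",
    "award_description": "description",
}

def normalize(row: dict) -> dict:
    """Map one exported row onto our internal schema, keeping gaps explicit."""
    record = {ours: row.get(theirs) for theirs, ours in FIELD_MAP.items()}
    # Records with no usable description can't be searched by abstract text,
    # so flag them for manual review instead of silently dropping them.
    record["needs_review"] = not (record.get("description") or "").strip()
    return record
```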
DFG
The German Research Foundation (DFG) supports some projects directly related to OI, but does not report award amounts. Additionally, their funding information seems to be organized by “project” rather than by individual award, with one entry per project, which might reflect more than one award to an OI.
Crossref Grants Linking System
We have watched with great interest Crossref’s Grants Linking System (GLS), an effort to support the assignment of DOIs to grants and the linking of grants to their resulting outputs, and we were initially optimistic about using it to obtain award data for the Simons Foundation. Unfortunately for our purposes, the Foundation has registered only a handful of grants, all in their life sciences program area, none of which appeared relevant to our work. We certainly hope the number of participating funders will grow and that they will provide complete metadata for as many of their awards as possible.
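For readers who want to check GLS coverage for themselves, registered grant records are retrievable through Crossref’s public REST API as works of type “grant”. The sketch below shows one way to pull a batch of such records matching a query term; the query term, the summary we print, and the helper name are illustrative choices, not an endorsed workflow.

```python
import requests

def find_grant_records(term: str, rows: int = 20) -> list[dict]:
    """Fetch Crossref-registered grant records whose metadata matches a term."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"filter": "type:grant", "query": term, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["message"]["items"]

for item in find_grant_records("arXiv"):
    # Each item is a grant record; the DOI is the persistent identifier
    # assigned through grant registration.
    print(item.get("DOI"))
```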
What would make it better?
We would like to see the principles and practices outlined in the Barcelona Declaration on Open Research Information widely embraced and implemented with rigour. A number of the declaration’s specific recommendations would support work such as ours. Standard and open protocols and interfaces would support access to the information. The use of persistent identifiers for grant awards, organizations, individuals, and projects would facilitate the collection, interpretation, and analysis of grant award data. Deposit of grant award metadata in open repositories and transfer systems, with permissive licensing, would support the discovery and reuse of this information. Invest in Open Infrastructure is a supporter of the Barcelona Declaration, and we encourage others to support and adopt its principles.