Data on Datasets: Quantifying US federally funded records in DataCite

Author: Eric Schares

Introduction

As part of IOI’s investigation into ‘reasonable costs’ for public access to United States federally funded research and scientific data, we recently completed a project to look more closely at the datasets that acknowledge federal funding.

Data is explicitly included in the 2022 “Nelson memo” from the US Office of Science and Technology Policy (“make publications and their supporting data resulting from federally funded research publicly accessible…”).

The prevalence of academic journal articles with federal funding has been generally well-studied (Schares, 2022), and many databases and infrastructure providers provide the ability to search for federally funded research articles. Datasets, however, are relatively new to the public access requirements and are less well understood*. This project looked into metadata from DataCite to see if we could gain a better understanding of how many datasets acknowledge US federal funding using existing resources.

Methods

DataCite is a not-for-profit organization that mints Digital Object Identifiers (DOIs) and connects research outputs, with a focus on datasets. Their REST API provides programmatic access to the structured metadata, and we used Python to query and extract the number of records that met various criteria.

DataCite’s technical support was very helpful, suggesting the metadata fields most appropriate for this analysis and ways to improve the queries. The data presented in this post were collected on March 22, 2024. Because DataCite registers new DOIs continuously, running the same queries today will likely return higher numbers.

To establish a baseline, we looked at the number of datasets registered in DataCite overall.

Of the 54,748,035 DOIs registered in DataCite (as of March 22, 2024), 16,716,816 or 30.5% are classified as datasets. This information can be found via the DataCite API using this link: https://api.datacite.org/dois?resource-type-id=dataset

Next, we asked how many datasets have anything provided in the fundingReferences metadata field that would indicate an acknowledgment of funding of any sort. This field is optional when registering for a DOI, so not all records have something here. Some are empty because there is no funding, and some are empty because the funding wasn’t specified. We found 974,334 datasets with something in this field, which is 5.8% of all datasets, or 1.8% of all DOIs in DataCite. To find this information in the API, use this link: https://api.datacite.org/dois?resource-type-id=dataset&query=fundingReferences:*
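The two count queries above can also be run from Python. What follows is a minimal sketch, not the code used for this post. It assumes the DataCite response carries the summary count under meta.total, and it adds a page[size]=1 parameter (our assumption, not part of the links above) to keep the response small. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.datacite.org/dois"


def build_count_url(resource_type="dataset", query=None):
    """Build a DataCite query URL like the browser links in the text."""
    # page[size]=1 is an assumption to keep the payload small; we only
    # need the summary count, not the records themselves.
    params = {"resource-type-id": resource_type, "page[size]": 1}
    if query:
        params["query"] = query
    return f"{BASE_URL}?{urlencode(params)}"


def total_from_payload(payload):
    """Read the summary count, assuming it sits under meta.total."""
    return payload["meta"]["total"]


def fetch_total(url):
    """Call the API and return the count (makes a live network request)."""
    with urlopen(url, timeout=30) as response:
        return total_from_payload(json.load(response))


if __name__ == "__main__":
    print(build_count_url())
    print(build_count_url(query="fundingReferences:*"))
```

Calling fetch_total() on either URL should return the current count, which will differ from the March 22, 2024 figures reported here.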

The most difficult part of our question was now at hand: how many datasets acknowledge a US federal funding agency? Up to this point, the API queries above run easily in a browser and return a summary number. To go further, we need to know how many of the almost 1 million datasets (974,334) name a US federal funder in their fundingReferences field, which means testing each record individually. We also restricted the years of interest to 2019-2023, which reduced the number of records to test from almost 1 million to about 755,000.
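Testing every record means paging through the full result set rather than reading one summary number. The sketch below shows one way to do that, assuming the API's cursor-style pagination, where each response carries a links.next URL until the results run out, and assuming funder entries sit under attributes.fundingReferences. The helper names and the injected fetch_json function are hypothetical, and how the 2019-2023 restriction is applied is left to the first URL passed in.

```python
def iter_records(first_url, fetch_json):
    """Yield every record, following links.next until it is absent.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON payload. Injecting it keeps this sketch testable offline.
    """
    url = first_url
    while url:
        payload = fetch_json(url)
        yield from payload.get("data", [])
        url = payload.get("links", {}).get("next")


def funders_of(record):
    """Return the list of funder entries for one record (may be empty)."""
    return record.get("attributes", {}).get("fundingReferences") or []


if __name__ == "__main__":
    # Two fake pages stand in for live API responses.
    fake_pages = {
        "page-1": {
            "data": [{"id": "10.1/a", "attributes": {
                "fundingReferences": [{"funderName": "National Science Foundation"}]}}],
            "links": {"next": "page-2"},
        },
        "page-2": {"data": [{"id": "10.1/b", "attributes": {}}], "links": {}},
    }
    for rec in iter_records("page-1", fake_pages.__getitem__):
        print(rec["id"], funders_of(rec))
```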

ROR ID matching

We are interested in both who the acknowledged funder is and where they sit in any hierarchy. Within the fundingReferences field, organizations depositing metadata can provide a persistent identifier (PID) to describe the funder. DataCite supports the identifiers ISNI, GRID, ROR, Crossref funder ID, and others.

We chose to use ROR IDs as our common identifier to categorize funders. The Research Organization Registry (ROR) is “a global, community-led registry of open persistent identifiers for research organizations,” and has emerged as the standard PID for over 100,000 funders, organizations, research centers, and more (https://ror.org). ROR provides useful API endpoints to map from other identifier standards to a ROR ID. However, some datasets do not include any persistent identifier in the funding field, just a free-text funder name. For these cases, ROR also has an API endpoint that matches a free-text name to a ROR ID.
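As an illustration of the free-text matching step, here is a minimal sketch built on ROR's affiliation matching. It assumes the response lists candidates under items, each with a chosen flag and an organization record, which is how we understand the ROR API to behave. The unversioned URL path and the function names are assumptions, and newer API versions may need a version prefix in the path.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the ROR documentation for the current version path.
ROR_API = "https://api.ror.org/organizations"


def build_affiliation_url(funder_name):
    """Build the matching URL for a free-text funder name."""
    return f"{ROR_API}?{urlencode({'affiliation': funder_name})}"


def chosen_ror_id(payload):
    """Return the ID of the candidate ROR marks as chosen, or None."""
    for item in payload.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None


if __name__ == "__main__":
    print(build_affiliation_url("National Science Foundation"))
```

Returning None when no candidate is marked as chosen keeps uncertain matches out of the results instead of guessing.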

US Federal Funders

Once funders were matched and standardized to a ROR ID, it was time to see if a funder was a US federal granting agency or not. I created a recursive function that took a given ROR ID and investigated its “parent” relationship by following the hierarchical chain to the top level. If it ended at the US Government, the ROR was marked as a US federal funder; if not, the code backed up and tried a different parent organization branching path (if applicable). The final result was then saved in a column marked True or False.
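The recursive check can be sketched as follows. This is an illustration of the idea rather than the original code: get_parents is a hypothetical callable that returns the parent IDs of an organization (in practice it would read the "parent" relationships from a ROR record), and root_id stands for whichever ROR ID represents the US Government. The identifiers in the toy hierarchy are made up.

```python
def is_under_root(ror_id, root_id, get_parents, _seen=None):
    """Return True if following parent links from ror_id reaches root_id.

    Each parent branch is tried in turn, so an organization with several
    parents is marked True if any one of them leads to the root.
    """
    seen = set() if _seen is None else _seen
    if ror_id == root_id:
        return True
    if ror_id in seen:  # already explored, also guards against cycles
        return False
    seen.add(ror_id)
    return any(
        is_under_root(parent, root_id, get_parents, seen)
        for parent in get_parents(ror_id)
    )


if __name__ == "__main__":
    # Toy hierarchy with made-up identifiers.
    parents = {
        "lab": ["university", "office"],
        "office": ["department"],
        "department": ["us-gov"],
        "university": [],
    }
    lookup = lambda rid: parents.get(rid, [])
    print(is_under_root("lab", "us-gov", lookup))         # True
    print(is_under_root("university", "us-gov", lookup))  # False
```

In the toy example, "lab" first tries the "university" branch, which dead-ends, then backs up and succeeds through "office".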

For all the ROR ID conversion, string matching, and US federal funder identification, the code kept track of what had previously run and remembered the result. A Python dictionary was created and updated each time a new ROR ID was analyzed. This improved processing time by eliminating the need to retest the same ROR ID over and over; once the program knows that the National Science Foundation has ROR ID 021nxhr62 and is indeed a US federal funder, it does not need to test it again when encountering it on subsequent rows.
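A dictionary cache of this kind can be sketched in a few lines. The wrapper below is a hypothetical illustration, not the original code: it takes any slow classification function and stores each result under its ROR ID, so repeated IDs are answered from the dictionary.

```python
def make_cached(classify):
    """Wrap a slow classify(ror_id) function with a dictionary cache."""
    cache = {}

    def cached(ror_id):
        if ror_id not in cache:
            cache[ror_id] = classify(ror_id)  # the slow path runs once per ID
        return cache[ror_id]

    cached.cache = cache  # exposed so the stored results can be inspected
    return cached


if __name__ == "__main__":
    calls = []

    def slow_classify(ror_id):
        # Stand-in for the API lookups and hierarchy walk.
        calls.append(ror_id)
        return ror_id == "021nxhr62"

    is_federal = make_cached(slow_classify)
    for row in ["021nxhr62", "021nxhr62", "021nxhr62"]:
        is_federal(row)
    print(len(calls))  # 1
```

The standard library's functools.lru_cache offers similar behavior, but an explicit dictionary has the advantage that it can be saved and reloaded between runs.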

Funneling down the results

Figure 1. Slide presented by IOI’s Gail Steinhart at the 2024 IASSIST/CARTO conference in Nova Scotia.**

We can see that of the DOIs tested from 2019-2023, 346,428 had a US federal funder acknowledged in the fundingReferences field. This is 0.6% of all DOIs in DataCite, and about 2% of all datasets in the registry (Figure 1).
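The percentages in this funnel can be reproduced from the counts reported in this post. The short sketch below only redoes the arithmetic; the variable names are ours.

```python
# Counts reported in this post (collected March 22, 2024).
ALL_DOIS = 54_748_035
ALL_DATASETS = 16_716_816
WITH_FUNDING = 974_334
US_FEDERAL_2019_2023 = 346_428


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print("datasets / all DOIs:      ", pct(ALL_DATASETS, ALL_DOIS))              # 30.5
    print("with funding / datasets:  ", pct(WITH_FUNDING, ALL_DATASETS))          # 5.8
    print("with funding / all DOIs:  ", pct(WITH_FUNDING, ALL_DOIS))              # 1.8
    print("US federal / all DOIs:    ", pct(US_FEDERAL_2019_2023, ALL_DOIS))      # 0.6
    print("US federal / all datasets:", pct(US_FEDERAL_2019_2023, ALL_DATASETS))  # 2.1
```

Note that the US federal count covers only 2019-2023, while both denominators cover the whole registry, so these shares are lower bounds.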

Further, one funder dominates these results. The Environmental Molecular Sciences Laboratory (EMSL) is a Department of Energy, Office of Science facility at Pacific Northwest National Laboratory in Richland, Washington. It minted over 300,000 DOIs, an overwhelming majority (86%) of these results.

After the EMSL, the NSF (17,647), Office of the Secretary of Energy (8,460), and NASA (8,305) were the funders found most often (Figure 2). It should be noted that the agency was counted based on how it was provided, and no further reunification of sub-agencies, Directorates, or centers was done.

Figure 2. Stacked bar chart of top 10 US federal agencies, excluding EMSL. Bar segments colored by year deposited.

Recommendations

We echo calls from DataCite and many others to take full advantage of available fields to improve metadata for datasets. Analyses like the one presented in this post can only search fields that are exposed by the API endpoint, so we are only able to return records that actually populate those fields with values. It is very likely there are some datasets in DataCite that do result from US federal funding, but they are invisible because they don’t have data in the fields we are searching. Adding ROR IDs into the fundingReferences field is not only a best practice for providing acknowledgement to a funder, but also helps with meta-analyses like this one.

References

* Johnston, L. R., Mohr, A. H., Herndon, J., Taylor, S., Carlson, J. R., Ge, L., Moore, J., Petters, J., Kozlowski, W., & Vitale, C. H. (2024). Seek and you may (not) find: A multi-institutional analysis of where research data are shared. PLOS ONE, 19(4), e0302426. https://doi.org/10.1371/journal.pone.0302426

** Steinhart, G., Schares, E., & Skinner, K. (2024). Navigating the future of data sharing: The impact and cost of expanded public access requirements. IASSIST & CARTO 2024, Halifax, Nova Scotia, Canada. https://doi.org/10.5281/zenodo.11263223

Posted Jul 31, 2024 by Invest In Open Infrastructure
