Guidelines for OSTP Data Access Plan

In February 2013, the Executive Office of the President's Office of Science and Technology Policy published a memo entitled "Increasing Access to the Results of Federally Funded Scientific Research," which directs funding agencies with an annual R&D budget over $100 million to develop a public access plan for disseminating the results of their research. ICPSR strongly supports this memorandum and feels it will "promote re-use of scientific data, maximize the return on investments in data collection, and prevent the loss of thousands of potentially valuable datasets."

ICPSR currently partners with several federal agencies to help them fulfill the mandate to provide open access to results of federally funded research. These agencies leverage ICPSR's capacity to curate, preserve, and disseminate data efficiently.

To help these and other federal agencies develop their public access plans, ICPSR is providing guidance on how to meet the requirements laid out in the memo. In the sections below, we give an overview of each requirement, explain why it matters, and outline the key issues to consider when formulating plans. We also provide a glossary of specialized terms.

We stress that standards and guidelines for many of the requirements currently exist. We also stress that existing specialized, long-lived, and sustainable repositories can mediate between the needs of scientific disciplines and data preservation requirements.

Please contact us for more information about our work or these guidelines.

Update: CENDI is collecting information on Federal Agency plans and guidance for implementation of Public Access, including the effective dates and scope. Additionally, SPARC offers an integrated resource for tracking, comparing, and understanding U.S. federal agencies' article and data sharing policies.

Elements of a Public Access Plan for Scientific Data

Maximize access

"Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds"

Description

Increasing access to research data prevents the duplication of effort, provides accountability and verification of research results, and increases opportunities for innovation and collaboration.

Finding and accessing data in repositories requires descriptive metadata ("data about data") in standard, machine-actionable form. Metadata help search engines find and catalog data, and they enable researchers to perform detailed searches across data collections and understand their context. In the social sciences, for instance, the Data Documentation Initiative (DDI) is an established international standard for the description of data.

For an inventory of metadata standards across scientific disciplines, see the Digital Curation Centre website.
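
As a rough illustration of what "machine-actionable" means, a descriptive record can be stored as structured fields that software can search. The Python sketch below is only an assumption about how such a record might look: the field names are hypothetical rather than DDI elements, and the study described is invented for the example.

```python
import json

# Hypothetical descriptive metadata record.
# Field names are illustrative only; they are NOT DDI elements,
# and the study itself is invented for this example.
record = {
    "title": "Example Household Survey, 2012",
    "creator": "Example Research Team",
    "date": "2013",
    "keywords": ["households", "income", "survey"],
    "abstract": "Invented study used only to illustrate structured metadata.",
}


def matches(record, term):
    """Return True if the term appears in the title, abstract, or keywords."""
    term = term.lower()
    haystack = [record["title"], record["abstract"], *record["keywords"]]
    return any(term in text.lower() for text in haystack)


# Structured fields can be serialized for exchange and searched by software.
print(json.dumps(record, indent=2))
print(matches(record, "income"))
```

Because the record is structured rather than free text, a catalog or search engine can index each field separately, which is what makes detailed searches across collections possible.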

Access involves not just finding data, but also knowing how to use and interpret the data. Incomplete, incorrect, or messy data limit use and reuse. Proprietary or obsolete data formats can be unreadable or limit access. Repositories "curate," or enhance, data to make them complete, self-explanatory, and usable for future researchers. This includes adding descriptive labels, correcting coding errors, gathering documentation, and standardizing the final versions of files. Curation is crucial to maximizing access.

For guidelines on preparing and curating data for archiving, see ICPSR's Guide to Social Science Data Preparation and Archiving and the UK Data Archive's Managing and Sharing Data guide.

Issues to Consider

  • Which descriptive metadata standards will your agency use to help researchers discover data? Adequate metadata to describe collections and facilitate discovery are essential; otherwise, data are difficult to find and understand.
  • Which curation standards will your agency promote so data are accurate and useful? Incomplete or messy collections are not as useful or valuable as curated data.

"NASA commits to the full and open sharing of Earth science data obtained from NASA Earth observing satellites, sub-orbital platforms and field campaigns with all users as soon as such data become available."
-- NASA Data & Information Policy

"CDC believes that public health and scientific advancement are best served when data are released to, or shared with, other public health agencies, academic researchers, and appropriate private researchers in an open, timely, and appropriate way. The interests of the public—which include timely releases of data for further analysis—transcends whatever claim scientists may believe they have to ownership of data acquired or generated using federal funds."
-- CDC/ATSDR Policy on Releasing and Sharing Data

"Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data"
-- National Institutes of Health Data Sharing Policy

Protect confidentiality and privacy

"...protecting confidentiality and personal privacy"

Description

A growing number of studies include sensitive and confidential data. Stringent protections must be in place to guard and provide access to these data. Robust methods, such as those promoted by the American Statistical Association, are in place for evaluating and treating disclosure risks, and repositories can offer infrastructure, including virtual and physical data enclaves, for protecting and safely sharing confidential data.

For more information, see ICPSR's Confidentiality page or DataONE's Identify data sensitivity page.

Issues to Consider

  • Who will be responsible for reviewing and treating data for confidentiality issues? Data security is of utmost legal and ethical concern.
  • How will disclosure review and treatment impact the future use and reuse of the data? While steps to anonymize data are necessary, these must be done in consideration of the impact they will have on future use.

"Any release or sharing of public health data will acknowledge that (1) data systems are built on trust between the individuals who provide personal data and the agencies that collect those data and (2) that CDC will respect the privacy rights of individuals and others who provide personal or proprietary data. All release/sharing must be consistent with the confidentiality assurances under which the data were collected or obtained."
-- CDC/ATSDR Policy on Releasing and Sharing Data

"Prior to sharing, data should be redacted to strip all identifiers, and effective strategies should be adopted to minimize risks of unauthorized disclosure of personal identifiers."
-- National Institutes of Health Data Sharing Policy

"The rights and privacy of individuals who participate in HHS-sponsored research must be protected at all times. Thus, data intended for broader use should be free of identifiers that would permit linkages to individual research participants and variables that could lead to deductive disclosure of the identity of individual subjects."
-- Department of Health & Human Services Grants Policy Statement

Preserve intellectual property rights and commercial interests

"...recognizing proprietary interests, business confidential information, and intellectual property rights and avoiding significant negative impact on intellectual property rights, innovation, and U.S. competitiveness"

Description

Original research may be both commercially valuable and proprietary. There are several approaches to managing these interests, including tailoring copyright and patent licenses (for example, through Creative Commons licenses) and imposing an embargo period or otherwise delaying dissemination. Ultimately, though, all proprietary and personal interests should be considered with an eye toward eventual public access.

Issues to Consider

  • Which licensing options are optimal for your research community? Will your agency require all data, for instance, to be copyright free? Or will data producers be able to choose freely according to their needs and desires?
  • How can commercial interests be reconciled with the mandate of providing open access to data?

"NOAA recognizes that the investigators who collected the data have a legitimate interest in benefiting from their investment of time and effort. NOAA continues to expect that the initial investigators may benefit from being the first user of the data, but not from prolonged or indefinite exclusive use."
-- NOAA Data Sharing Policy for Grants and Cooperative Agreements

"While NASA will require that the data that embodies trade secrets or comprises commercial or financial information which is privileged or confidential generated by the Recipient be delivered to NASA for dissemination to employees of NASA, of JPL, and of appropriate support contractor personnel, such data marked with a suitable notice or legend will be protected for the 2-year period of exclusivity set forth in paragraph D.3 of this clause."
-- NASA Data Rights and Related Issues

"If the final data would not be amenable to sharing, e.g., proprietary data, the SBC should explain that in the application."
-- Department of Health & Human Services Grants Policy Statement

"Any restrictions on data sharing due to cofunding arrangements should be discussed in the data-sharing plan section of an application and will be considered by program staff. While NIH understands that an institution's desire to exercise its intellectual property rights may justify a need to delay disclosure of research findings, a delay of 30 to 60 days is generally viewed as a reasonable period for such activity."
-- National Institutes of Health Data Sharing Policy

"Restrictions can be imposed because... releasing the data would risk disclosing proprietary or confidential information"
-- CDC/ATSDR Policy on Releasing and Sharing Data

Balance demands of long-term preservation and access

"...preserving the balance between the relative value of long-term preservation and access and the associated cost and administrative burden"

Description

Digital preservation is the proactive and ongoing management of digital content to lengthen its lifespan and guard against loss from physical deterioration, format obsolescence, and hardware and software failure. Preserving digital data requires much more than storing files on a server or desktop. At the same time, not all data are worth preserving indefinitely; less valuable or easily reproducible data may be preserved for shorter periods -- perhaps five to ten years, depending on the scientific discipline.

Selection and appraisal guidelines that make it clear what to save or discard ensure that the limited resources available for long-term preservation and access are spent wisely. Selection criteria consider factors like availability, confidentiality, copyright, quality, file format, and financial commitment.

For an example of selection and appraisal guidelines, see the National Archives and Records Administration (NARA) Appraisal Policy.

Long-term costs and administrative burdens are essential to consider when selecting data for preservation. The University of California Curation Center has "developed an analytic framework for modeling the full economic costs of preservation," including an interactive spreadsheet. The Keeping Research Data Safe project, funded by the Joint Information Systems Committee (JISC) in the UK, also produced tools and methodologies for "assessing the costs and benefits of curation and preservation of research data."

Issues to Consider

  • Which data will your agency target and select for long-term preservation? Not all data may fit the agency's scope and goals. What are the criteria for selecting which data to preserve?
  • How long will data be preserved and made accessible? Data are costly to preserve for the long term, and not all data must be preserved in perpetuity.
  • What are the long-term preservation costs to make research data available? Understanding the actual financial costs will sharpen selection and retention policies and decisions.

No examples of this from the policies surveyed

Use of data management plans

"Ensure that all extramural researchers receiving Federal grants and contracts for scientific research and intramural researchers develop data management plans, as appropriate, describing how they will provide for long-term preservation of, and access to, scientific data in digital formats resulting from federally funded research, or explaining why long-term preservation and access cannot be justified"

Description

Data management plans provide opportunities for researchers to manage and curate their data more actively from project inception to completion. Careful planning helps ensure quality data products when projects are completed. Recommended components of a plan include descriptions of the nature and scale of the data collection, file format types, metadata standards used, and any intellectual property or confidentiality concerns that exist.

For more information about data management plans, see ICPSR's Guidelines for Effective Data Management Plans, the Digital Curation Centre's Data Management Plans page, and MIT's Data Management Plans page.

Issues to Consider

  • Should your agency mandate which elements must be included in the data management plan? Specific disciplines may have different standards related to data management planning, and certain elements may be more relevant to some research than others. However, having standard formats for data management planning may ease an agency's evaluation of plans.
  • What resources can be provided to educate and aid researchers in the writing of effective data management plans? Data management planning is a relatively new area, so many researchers may not be familiar with what should be included in an effective plan.

"Proposals must include a supplementary document of no more than two pages labeled "Data Management Plan". This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results"
-- NSF Grant Proposal Guide

"All NASA Earth science missions, projects, and grants and cooperative agreements shall include data management plans to facilitate the implementation of these data principles."
-- NASA Data & Information Policy

Include cost of data management in funding proposals

"Allow the inclusion of appropriate costs for data management and access in proposals for Federal funding for scientific research"

Description

Data management services carry real costs, ranging from personnel to storage to software. Estimating and planning for these costs at the beginning of a project ensures long-term investment in the research data. Just as maintenance costs are routinely built into physical infrastructure development, so too should data management costs be built into data development. Long-term access to data requires durable institutions that plan on a scale of decades and even generations.

For guidance on costs to include when creating funding proposals, see DataONE's Provide budget information for your data management plan page and the UK Data Archive's Costing Tool: Data Management Planning guide.

Issues to Consider

  • How will your agency determine what constitutes reasonable costs for data management? Whether this is a dollar amount or a percentage of the total project budget, having a cap or an expected range for the cost of data management will provide more context for researchers and aid in the evaluation of plans.
  • What funding models are appropriate for supporting long-term data management? Although funding may be limited to the duration of the grant period, data management is not a one-time cost. How can long-term preservation and access be built into proposals?
  • What existing, additional, or new funding tied to proposals will support access and preservation of data?

"Costs of documenting, preparing, publishing, disseminating and sharing research findings and supporting material are allowable charges against the grant."
-- NSF Grant Proposal Guide

"The costs of sharing or archiving data may be included in the amount of funds requested in applications for first-time or continuation funds."
-- CDC/ATSDR Policy on Releasing and Sharing Data

Evaluate data management plans

"Ensure appropriate evaluation of the merits of submitted data management plans"

Description

Data management plans give insight into researchers' intentions for their data both during and after the research project. Plans help researchers prepare for working with and preserving data, help repositories get ready to accession data and provide access, and help agencies understand community needs for archiving and access. Evaluation helps refine plans so they are realistic and attainable.

Issues to Consider

  • What standards will be used to determine whether a data management plan is sufficient for a proposed research project? Because data management plans will be specific to the given research project and/or scientific community, can they be assessed based on a standard set of criteria?
  • How will the agency handle data management plans that fall short of criteria? If a data management plan does not meet expected quality standards, agencies should determine how to respond.
  • How will the merits of a data management plan be considered alongside other factors in the evaluation process? Agencies should determine the weighting of this aspect of the funding application against other portions of the proposal.

"The Data Management Plan will be reviewed as an integral part of the proposal, coming under Intellectual Merit or Broader Impacts or both, as appropriate for the scientific community of relevance."
-- NSF Grant Proposal Guide

"CDC reviewers must check whether applications for CDC funds include mechanisms for, and costs of, sharing data."
-- CDC/ATSDR Policy on Releasing and Sharing Data

Ensure researcher compliance with data management plans

"Include mechanisms to ensure that intramural and extramural researchers comply with data management plans and policies"

Description

If data management plans are to be a standard component of funding applications, funding recipients should be held accountable for deviations from the originally stated plans. As a benefit, monitoring and ensuring compliance should increase the quality of data deposited in repositories and ultimately made available to the public.

Issues to Consider

  • How will funders determine what constitutes compliance with data management plans? Since data management is an ongoing process, establishing clear criteria for assessing compliance will be crucial. Does deposit in a trustworthy digital repository qualify as compliant? Is the quality of the final, curated data product also assessed when judging compliance?
  • What penalties will be in place for researchers who do not comply with data management plans? Will non-compliant researchers be penalized when applying for future funding? Will current funding be withheld for non-compliance?
  • What deadlines will be set for researcher compliance with data management plans? Will compliance be measured during just the lifetime of the funding or extend for a period of time after project funding ends?

"Awardees who fail to release data in a timely fashion will be subject to procedures normally used to address lack of performance (e.g., reduction in funding, restriction of funds, or grant termination)."
-- CDC/ATSDR Policy on Releasing and Sharing Data

"...if an application describes a data-sharing plan, NIH expects that plan to be enacted... In the case of noncompliance (depending on its severity and duration) NIH can take various actions to protect the Federal Government's interests."
-- National Institutes of Health Data Sharing Policy

"Failing to share environmental data and information in accordance with the submitted Data/Information Sharing Plan may lead to disallowed costs and be considered by NOAA when making future award decisions."
-- NOAA Data Sharing Policy for Grants and Cooperative Agreements

Promote public deposit of data

"Promote the deposit of data in publicly accessible databases, where appropriate and available"

Description

Public deposit of data helps to ensure the long-term accessibility and preservation of the data. It removes the burden of ongoing maintenance and care from the researcher and provides a stable system to which data can be entrusted. Centralized databases also provide more comprehensive and discoverable resources in one place. Data hosted in repositories are indexed by major search engines and are widely accessible to the public.

Many sustainable online repositories are now available to host and archive research data. These may include discipline-specific repositories, archives administered by funding agencies, or institutional repositories.

Data producers need to trust that the data they archive will be properly stored and shared, rather than lost, corrupted, or neglected. Data consumers need to trust that the data they receive are the original, unaltered version saved by the producer. The Open Archival Information System (OAIS) Reference Model, the Trusted Digital Repository (TDR) Checklist (ISO/DIS 16363), and the Data Seal of Approval are standards that guide repositories in documenting and verifying that they are organizationally, procedurally, and technologically sound as data custodians.
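
One widely used way to check that a file is unaltered is to compare checksums recorded at deposit with checksums computed on retrieval. This technique is offered here only as an illustration; it is our assumption, not a procedure quoted from the standards named above. The Python sketch below uses the standard library's hashlib module, and the file contents and function names are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(received: bytes, expected_checksum: str) -> bool:
    """Compare the checksum of the received bytes with the recorded one."""
    return sha256_of(received) == expected_checksum


# Hypothetical file contents, invented for this example.
original = b"id,income\n1,42000\n"
altered = b"id,income\n1,99999\n"

# The repository records a checksum when the producer deposits the file.
deposited_checksum = sha256_of(original)

# A consumer can later verify the copy they receive.
print(is_unaltered(original, deposited_checksum))
print(is_unaltered(altered, deposited_checksum))
```

A matching checksum shows only that the bytes are unchanged since deposit; organizational and procedural soundness still has to be documented through the kinds of standards described above.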

Issues to Consider

  • Which research data repository will your agency use or recommend to store and disseminate data? There are many repositories available, although not all provide the same services, target similar disciplines, or are set up for long-term, trusted preservation and access.
  • How will your agency ensure that selected repositories are trustworthy, secure, and long-lived? Standards exist to gauge whether repositories can be trusted to store and disseminate valuable research data.
  • How will publicly deposited data be promoted by the agency?

"NOAA facilities that archive data and make the data openly available should be considered first for the disposition of the data."
-- NOAA Data Sharing Policy for Grants and Cooperative Agreements

"NASA promotes the full and open sharing of all data with the research and applications communities, private industry, academia, and the general public."
-- NASA Data Rights and Related Issues

"Investigators are expected to submit unique biological information, such as DNA sequences or crystallographic coordinates, to the appropriate data banks so that they can be made available to the broad scientific community."
-- Department of Health & Human Services Grants Policy Statement

Private-sector cooperation to improve access

"Encourage cooperation with the private sector to improve data access and compatibility, including through the formation of public-private partnerships with foundations and other research funding organizations"

Description

Since data stewardship can be such a costly and technologically demanding proposition, partnerships with other data stewards and producers can provide opportunities for innovation and collaboration. Cooperation between funding agencies and the private sector can take a number of forms. From collaborating with service providers (such as publishers or web services companies) to develop tools and services, to pooling resources with foundations and private funding organizations, these relationships can result in benefits for all parties involved. Two examples of partnerships between repositories and private-sector companies are Flickr Commons and Google Books; while these projects may differ from those undertaken in the preservation and dissemination of scientific data, they are a useful reference point for understanding the benefits and risks involved.

Issues to Consider

  • What funding structures will be in place to ensure that all organizations involved are benefiting from the partnership? For the partnership to be successful, all parties must ensure that the terms of the agreement are clearly laid out. With well-articulated responsibilities and desired outcomes, all partners may benefit.
  • Will the partnership require any rights to be transferred to the private organization? If the partner requires that copyright be transferred to that organization, access restrictions on the content may result, and the collaboration may go against the ideals of providing free public access to the datasets.
  • How does private-sector cooperation affect access restrictions and intellectual property concerns? If there are confidential or proprietary data involved in the collaborative project, special attention must be paid to protecting these data.

No examples of this from the policies surveyed

Mechanisms for identification & attribution of data

"Develop approaches for identifying and providing appropriate attribution to scientific data sets that are made available under the plan"

Description

Properly citing data encourages the replication of scientific results, improves research standards, guarantees persistent reference, and gives proper credit to data producers.

Citing data is straightforward. Each citation must include the basic elements that allow a unique dataset to be identified over time: title, author, date, version, and persistent identifier, such as a Digital Object Identifier (DOI), a Uniform Resource Name (URN), or a Handle System identifier. Some academic journals, such as the American Sociological Review, have already adopted a set of standards for citing data. DataCite, an international consortium, strives to improve and support data citation.

For more information, see ICPSR's Data Citations page, IASSIST's Quick Guide to Data Citation, the Digital Curation Centre's guide How to Cite Datasets and Link to Publications, or DataONE's Provide identifier for dataset used page.
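
The basic citation elements (title, author, date, version, and persistent identifier) can be assembled mechanically. The Python sketch below is a minimal illustration: the function name, the ordering, and the punctuation are our assumptions rather than any journal's or repository's required style, and the study and identifier are placeholders.

```python
def format_data_citation(author, date, title, version, identifier):
    """Assemble the five basic citation elements into one string.

    The ordering and punctuation are illustrative assumptions,
    not a prescribed citation style.
    """
    return f"{author} ({date}). {title}. Version {version}. {identifier}"


citation = format_data_citation(
    author="Example Research Team",      # invented for this example
    date="2013",
    title="Example Household Survey, 2012",
    version="1",
    identifier="https://doi.org/10.xxxx/example",  # placeholder, not a real DOI
)
print(citation)
```

Keeping the version and the persistent identifier as required arguments reflects the point above: without them, a reader cannot be sure which release of a dataset was used.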

Issues to Consider

  • How can funders encourage consistent citation methods for data? This is also the responsibility of researchers, secondary data users, professional organizations, librarians, and others. The basic elements of data citation are clear.

"Data archives will include easily accessible information about the data holdings, including quality assessments, supporting relevant information, and guidance for locating and obtaining data."
-- NASA Data & Information Policy

Data stewardship workforce development

"In coordination with other agencies and the private sector, support training, education, and workforce development related to scientific data management, analysis, storage, preservation, and stewardship"

Description

As stakeholders in the research data lifecycle, funding agencies should ensure that those who work with research data at every stage are trained in the necessary skills and aware of their responsibilities. Training both data producers and data stewards in the appropriate methods for managing, curating, and preserving research data will help ensure the ongoing accessibility of the research.

The National Science Board emphasized the importance of data stewardship training and development in a 2005 report titled Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century: "Data scientists materially determine the quality of the data collections that now play a vital role in research. Their role is new, so it is crucial that the professional career of data scientist be defined and recognized so that it will attract the best and brightest."

Recent data stewardship workforce development in the United States has included:

ICPSR hosts data stewardship courses as part of its Summer Program in Quantitative Methods of Social Research. These include:

  • Curating and Managing Research Data for Re-Use
  • Assessing and Mitigating Disclosure Risk: Essentials for Social Science
  • Providing Social Science Data Services: Strategies for Design and Operation

Issues to Consider

  • How can staff be trained in these new competencies and roles relating to digital stewardship? Digital curation may be a minor part of any single staff member's responsibilities, so training should contextualize these activities in terms of broader objectives and processes.
  • What partnerships (with universities, data repositories, etc.) can support the development of these programs?
  • How can agencies create and foster a culture that is supportive of data stewardship and curation? Without cultural buy-in, scientists may hesitate to fully participate.

"...a $2 million award for a research training group in big data will support training for undergraduates, graduates and postdoctoral fellows to use statistical, graphical and visualization techniques for complex data"
-- "NSF Leads Federal Efforts in Big Data"

Long-term support for repository development

"Provide for the assessment of long-term needs for the preservation of scientific data in fields that the agency supports and outline options for developing and sustaining repositories for scientific data in digital formats, taking into account the efforts of public and private sector entities"

Description

We advocate long-term funding for specialized, long-lived, trustworthy, and sustainable repositories that can mediate between the needs of scientific disciplines and data preservation requirements. As digital data management becomes an increasingly important part of scientific research, funding agencies must contribute to the developing ecosystem of services and technologies that support access to and preservation of data.

As we noted in a position paper in January 2013, "Long-term access to data requires durable institutions that plan on a scale of decades and even generations. Such planning is difficult when grant cycles are of limited duration, and proposed projects are rated for innovation and transformation but not for reliability or permanence."

Issues to Consider

  • Who will bear the costs of documenting and preserving all of the data collections? Will the funding agencies fully support all costs? The researchers? The consumers?
  • How can limited resources for data archiving be focused on data with the highest value for secondary analysis? Long-term preservation may constrain resources and require attention to be prioritized across and within data collections.
  • How can cost models be developed to support future preservation costs? Preservation costs can be difficult to predict. Although no cost model is guaranteed to predict the future financial requirements of a repository project, cost models can help agencies prepare for the long-term nature of this significant investment.

Acknowledgements
We thank Emily Reynolds and Gavin Strassel, both students at the University of Michigan School of Information, for contributing to the development of this page.