Framework for Creating a Data Management Plan
This framework can be used as an outline in assembling data management plans to accompany grant applications. Note that some funders set page limits for data management plans; NSF, for example, limits plans to two pages.
Elements of a Data Management Plan
This list of elements is informed by a gap analysis that ICPSR conducted of existing recommendations for data management plans and other forms of guidance made available for researchers generating data. The result of the gap analysis was a comparison of existing forms of guidance. Elements that are highly recommended for inclusion in effective data management plans are noted.
See our bibliography for additional readings germane to the elements of a data management plan.
Data Description (Recommended)
Provide a brief description of the information to be gathered: the nature, scope, and scale of the data that will be generated or collected.
Why this is important
A good description of the data to be collected will help reviewers
understand the characteristics of the data, their relationship to
existing data, and any disclosure risks that may apply.
Example 1:
This project will produce public-use nationally representative survey
data for the United States covering Americans' social backgrounds,
enduring political predispositions, social and political values,
perceptions and evaluations of groups and candidates, opinions on
questions of public policy, and participation in political life.
Example 2:
This project will generate data designed to study the prevalence and
correlates of DSM-III-R psychiatric disorders and patterns and
correlates of service utilization for these disorders in a nationally
representative sample of over 8,000 respondents. The sensitive nature of
these data will require that the data be released through a
restricted-use contract.
Access and Sharing (Recommended)
Indicate how you intend to archive and share your data and why you have chosen that particular option. Possible mechanisms for archiving and sharing include:
Domain repository like ICPSR (social science)
Self-dissemination through a dedicated website that the research team will create and maintain. If this option is chosen, it is recommended that the data producer arrange for eventual archiving of the data after the self-dissemination period terminates and specify the schedule for data sharing in the grant application.
Preservation with delayed dissemination. Under such an agreement the data producer makes an arrangement with a public data repository for archival preservation of the data with dissemination to occur at a later date, usually within a year.
Institutional repositories. Institutional repositories at academic institutions have the goal of preserving and making available some portion of the academic work of their students, faculty, and staff. Note that not all IRs have the capacity to accept and curate data.
Sharing data helps to advance science and to maximize the research investment. A recent paper reported that when data are shared through an archive, research productivity increases and many times more publications result than when data are not shared.
Protecting research participants and guarding against disclosure of identities are essential norms in scientific research. Data producers should make every effort to provide effective informed consent statements to respondents, to de-identify data before deposit when necessary, and to communicate to the archive any additional concerns about confidentiality. (See Ethics and Privacy below.)
With respect to timeliness of data deposit, archival experience has demonstrated that the durability of the data increases and the cost of processing and preservation decreases when data deposits are timely. It is important that data be deposited while the producers are still familiar with the dataset and able to transfer their knowledge fully to the archive.
Example 1:
The research data from this project will be deposited with [repository]
to ensure that the research community has long-term access to the data.
Example 2:
The project team will create a dedicated website to manage and
distribute the data because the audience for the data is small and has a
tradition of interacting as a community. The site will be established
using a content management system like Drupal or Joomla so that data
users can participate in adding site content over time, making the site
self-sustaining. The site will be available at a .org location. For
preservation, we will supply periodic copies of the data to
[repository]. That repository will be the ultimate home for the data.
Example 3:
The research data from this project will be deposited with [repository]
to ensure that the research community has long-term access to the data.
The data will be under embargo for one year while the investigators
complete their analyses.
Example 4:
The research data from this project will be deposited with the institutional repository on the grantees' campus.
Will your data be free of direct and indirect identifiers? If not, how will you share your restricted data? Will special terms of use be required?
Example 5:
The data generated by this project will not pose a disclosure risk. All
data will be de-identified before posting to the website established by
the principal investigators.
Example 6:
This project will generate data linked to administrative records, so the
data will be distributed through a restricted data use agreement
managed by [repository]. Through this mechanism, users will apply to use
these files, create data security plans, and agree to other access
controls.
Example 7:
Because the data generated will cover sensitive topics, it is expected
that the data will be deposited with [repository] and distributed
through the secure data enclave mechanism, requiring researchers to
visit the enclave to access the data under secure conditions.
Indicate when the data will be made available to others.
Example 8:
The research data from this project will be deposited with [repository]
before the end of the project so that any issues surrounding the
usability of the data can be resolved.
Example 9:
The data will be deposited with [repository] but not disseminated for
one year to give the investigators time to publish their findings.
Metadata (Recommended)
What types of metadata will you produce to support the data? Will a metadata standard be used?
Why this is important
Good descriptive metadata are essential to effective data use. Metadata
are often the only form of communication between the secondary analyst
and the data producer, so they must be comprehensive and provide all of
the needed information for accurate analysis.
Structured or tagged metadata, like the XML format of the Data
Documentation Initiative (DDI) standard, are optimal because the XML
offers flexibility in display and is also preservation-ready and
machine-actionable.
Example 1:
Metadata will be tagged in XML using the Data Documentation Initiative
(DDI) format. The codebook will contain information on study design,
sampling methodology, fieldwork, variable-level detail, and all
information necessary for a secondary analyst to use the data accurately
and effectively.
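As a rough illustration of tagged metadata of this kind, the Python sketch below builds a small XML codebook with the standard library. The element names (codeBook, stdyDscr, dataDscr, var, labl, catgry, catValu) follow DDI Codebook naming as we understand it, but the structure is heavily simplified, omits the DDI namespace, and has not been validated against the DDI schema. The study title, the variable VOTED, and its codes are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_codebook(title, variables):
    """Return a simplified, DDI-style codebook as an XML string.

    `variables` is a list of dicts with keys "name", "label", and
    "categories" (a dict mapping code -> label). All are illustrative.
    """
    codebook = ET.Element("codeBook")

    # Study-level description: here only the title.
    study = ET.SubElement(codebook, "stdyDscr")
    citation = ET.SubElement(study, "citation")
    title_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(title_stmt, "titl").text = title

    # Variable-level description: name, label, and coded categories.
    data_dscr = ET.SubElement(codebook, "dataDscr")
    for var in variables:
        var_el = ET.SubElement(data_dscr, "var", name=var["name"])
        ET.SubElement(var_el, "labl").text = var["label"]
        for code, label in var["categories"].items():
            cat = ET.SubElement(var_el, "catgry")
            ET.SubElement(cat, "catValu").text = str(code)
            ET.SubElement(cat, "labl").text = label

    return ET.tostring(codebook, encoding="unicode")


xml_text = build_codebook(
    "Hypothetical Survey of Political Participation",
    [{"name": "VOTED", "label": "Voted in last election",
      "categories": {1: "Yes", 2: "No", 9: "Missing"}}],
)
print(xml_text)
```

A production codebook would carry far more detail (sampling, fieldwork, file descriptions) and should be checked against the published DDI schema before deposit.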
Example 2:
The clinical data collected from this project will be documented using CDISC metadata standards.
Intellectual Property Rights (Recommended)
Who will hold intellectual property rights for the data and other information created by the project?
Will these rights be transferred to another organization for data distribution and archiving? Will any copyrighted material (e.g., instruments or scales) be used? If so, how will the project obtain permission to use the materials and disseminate them?
Why this is important
In order to disseminate data, archives need a clear statement from the
data producer of who owns the data. The principal investigator's
university is usually considered to be the holder of the intellectual
property rights for data the PI generates. Many archives do not ask for a
transfer of rights but instead just request permission to preserve and
distribute the data. Copyright may also come into play if copyrighted
instruments are used to collect data. In these cases, data producers
should initiate discussions with archives in advance of data deposit.
Example 1:
The principal investigators on the project and their institutions will
hold the intellectual property rights for the research data they
generate.
Example 2:
The principal investigators on the project and their institutions will
hold the intellectual property rights for the research data they
generate but will grant redistribution rights to [repository] for
purposes of data sharing.
Example 3:
The data gathered will use a copyrighted instrument for some questions. A
reproduction of the instrument will be provided to [repository] as
documentation for the data deposited with the intention that the
instrument be distributed under "fair use" to permit data sharing, but
it may not be redisseminated by users.
Ethics and Privacy (Recommended)
If applicable, how will you handle informed consent with respect to communicating to respondents that the information they provide will remain confidential when data are shared or made available for secondary analysis?
Why this is important
Protection of human subjects is a fundamental tenet of research and an
important ethical obligation for everyone involved in research projects.
Disclosure of identities when privacy has been promised could result in
lower participation rates and a negative impact on science.
Example 1:
For this project, informed consent statements will use language that
will not prohibit the data from being shared with the research
community.
Example 2:
The following language will be used in the informed consent: The
information in this study will only be used in ways that will not reveal
who you are. You will not be identified in any publication from this
study or in any data files shared with other researchers. Your
participation in this study is confidential. Federal or state laws may
require us to show information to university or government officials [or
sponsors], who are responsible for monitoring the safety of this study.
Example 3:
The following language will be used for video data: Participants in this
study will be video recorded. The videos will be made available
through the Web for others to use. However, all users will be required
to use the videos for research purposes only and will not be allowed to
share the information from the videos with others.
If applicable, what are your plans to obtain IRB approval?
Example 4:
For this project, the principal investigators will request expedited IRB
review compliant with procedures established by the [University] campus
IRB. Research activities envisioned present no more than minimal risk
to human subjects.
Example 5:
Because the project involves more than minimal risk to human subjects,
the project will undergo full IRB review, as required by federal
regulations.
Are there legal constraints (e.g., HIPAA) on sharing data?
Example 6:
The proposed medical records research falls under the HIPAA Privacy
Rule. Consequently, the investigators will provide documentation that an
alteration or waiver of research participants' authorization for
use/disclosure of information about them for research purposes has been
approved by an IRB or a Privacy Board.
If applicable, how will you manage disclosure risk in the data to be shared and archived?
Example 7:
During data analysis, the data will be accessible only by certified
members of the project team. The research project will remove any direct
identifiers in the data before deposit with [repository].
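One way to picture the removal of direct identifiers is the short Python sketch below. It is only an illustration under stated assumptions: the field names in DIRECT_IDENTIFIERS and the sample records are hypothetical, and each project would need to define its own list.

```python
# Hypothetical list of direct identifiers; a real project would define its own.
DIRECT_IDENTIFIERS = {"name", "street_address", "phone", "email", "ssn"}


def remove_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without direct-identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Made-up records standing in for raw project data.
raw = [
    {"case_id": 101, "name": "Jane Doe", "phone": "555-0100", "age": 34},
    {"case_id": 102, "name": "John Roe", "phone": "555-0101", "age": 51},
]
deposit_ready = remove_direct_identifiers(raw)
print(deposit_ready)
```

Dropping direct identifiers does not address indirect identifiers (for example, rare combinations of age, occupation, and location), which usually call for a separate disclosure review.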
Format (Recommended)
Specify the anticipated submission, distribution, and preservation formats for the data and related files (note that these formats may be the same).
Why this is important
Depositing data and documentation in formats preferred for archiving can
make the processing and release of data faster and more efficient.
Preservation formats should be platform-independent and non-proprietary to ensure that they will be usable in the future.
Example 1:
Quantitative survey data files generated will be processed and submitted
to the [repository] as SPSS system files with DDI XML documentation.
The data will be distributed in several widely used formats, including
ASCII, tab-delimited (for use with Excel), SAS, SPSS, and Stata.
Documentation will be provided as PDF. Data will be stored as ASCII
along with setup files for the statistical software packages.
Documentation will be preserved using XML and PDF/A.
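To make the idea of a platform-independent, non-proprietary format concrete, here is a minimal Python sketch that writes records as tab-delimited ASCII. The function name, variable names, and sample rows are hypothetical, and the sketch does not reproduce any repository's actual processing.

```python
import csv
import io


def to_tab_delimited_ascii(records, fieldnames):
    """Serialize records as tab-delimited text and encode it as ASCII bytes.

    Encoding with "ascii" raises UnicodeEncodeError if any value contains
    a non-ASCII character, which flags it before the file is written.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue().encode("ascii")


# Made-up rows standing in for a processed survey file.
rows = [
    {"CASEID": 1, "AGE": 34, "VOTED": 1},
    {"CASEID": 2, "AGE": 51, "VOTED": 2},
]
payload = to_tab_delimited_ascii(rows, ["CASEID", "AGE", "VOTED"])
print(payload.decode("ascii"))
```

A file like this can be opened by Excel and read by SAS, SPSS, or Stata once a matching setup file describes the columns.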
Example 2:
Digital video data files generated will be processed and submitted to the [repository] in MPEG-4 (.mp4) format.
Example 3:
Digital image data will be processed and submitted to the [repository] in TIFF version 6 uncompressed (.tif) format.
Example 4:
Geospatial data will be processed and submitted to the [repository] as
an ESRI Shapefile (essential: .shp, .shx, .dbf; optional: .prj, .sbx,
.sbn).
Example 5:
Textual data will be processed and submitted to the [repository] as plain text data, ASCII (.txt).
Archiving and Preservation (Recommended)
How will you ensure that data are preserved for the long term?
Why this is important
Digital data need to be actively managed over time to ensure that they
will always be available and usable. This is important in order to
preserve and protect our shared scientific heritage as technologies
change. Preservation of digital information is widely considered to
require more constant and ongoing attention than preservation of other
media. Depositing data resources with a trusted digital archive can
ensure that they are curated and handled according to good practices in
digital preservation.
Example 1:
By depositing data with [repository], our project will ensure that the
research data are migrated to new formats, platforms, and storage media
as required by good practice.
Example 2:
In addition to distributing the data from a project website, we will
place a copy of the data into [repository] so that best practices in
digital preservation safeguard the files for long-term use.
Storage and Backup (Recommended)
How and where will you store copies of your research files to ensure their safety? How many copies will you maintain and how will you keep them synchronized?
Why this is important
Digital data are fragile, and the best practice for protecting them is to store multiple copies in multiple locations.
Example 1:
[Repository] will place a master copy of each digital file (i.e.,
research data files, documentation, and other related files) in Archival
Storage, with several copies stored at designated locations and
synchronized with the master through the Storage Resource Broker.
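Whatever tool keeps the copies synchronized, a simple way to confirm that they still match the master is to compare checksums. The Python sketch below is an assumption-laden illustration, not a description of the Storage Resource Broker or of any repository's procedures: the function names are hypothetical, and temporary files stand in for real storage locations.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def out_of_sync_copies(master, copies):
    """Return the copies whose contents differ from the master file."""
    expected = sha256_of(master)
    return [copy for copy in copies if sha256_of(copy) != expected]


# Demonstration with temporary files standing in for storage locations.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master.dat"
    good = Path(tmp) / "copy_a.dat"
    stale = Path(tmp) / "copy_b.dat"
    master.write_bytes(b"survey data v2")
    good.write_bytes(b"survey data v2")
    stale.write_bytes(b"survey data v1")
    mismatched = [p.name for p in out_of_sync_copies(master, [good, stale])]

print(mismatched)
```

Running a check like this on a schedule, and recording the results, gives early warning when a copy has been corrupted or has fallen behind the master.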
Security (Recommended)
How will you ensure that the data are secure?
Why this is important
Security for digital information is important over the data life cycle.
Raw research data may include direct identifiers or links to direct
identifiers and should be well-protected during collection, cleaning,
and editing. Processed data may or may not contain disclosure risk and
should be secured in keeping with the level of disclosure risk inherent
in the data. Secure work and storage environments may include access
restrictions (e.g., passwords), encryption, power supply backup, and
virus and intruder protection.
Example 1:
The data will be processed and managed in a secure non-networked environment using virtual desktop technology.
Example 2:
The data files from this study will be managed, processed, and stored in
a secure environment (e.g., lockable computer systems with passwords,
firewall system in place, power surge protection, virus/malicious
intruder protection) and by controlling access to digital files with
encryption and/or password protection. De-identified files will be
deposited with [repository], whose security policy has been written
according to best practices.
Responsibility (Recommended)
Who will act as the responsible steward for the data throughout the data life cycle?
Why this is important
Typically, data are owned by the institution awarded a federal grant, and
the principal investigator oversees the research data (collection and
management of data) throughout the project period. It is important to
describe any atypical circumstances. For example, if there is more than
one principal investigator, the division of responsibilities for the
data should be described.
Example 1:
The project will assign a qualified data manager certified in disclosure
risk management to act as steward for the data while they are being
collected, processed, and analyzed.
Example 2:
All research data collected as part of this project are owned by the
University. The Principal Investigator of this project will take
responsibility for the collection, management, and sharing of the
research data.
Existing Data (Recommended)
Are there existing data with a focus similar to the data that will be produced? If so, list what they are and explain why it is important to collect new data.
Why this is important
This is important to include in a data management plan when the value of
a new data collection comes from its relationship to existing data
sources.
Example 1:
Few datasets exist that focus on this population in the United States
and how their attitudes toward assimilation differ from those of others.
The primary resource on this population, [give dataset title here], is
inadequate because...
Example 2:
Data have been collected on this topic previously (for example: [add
examples]). The data collected as part of this project reflect the
current time period and historical context. It is possible that several
of these datasets, including the data collected here, could be combined
to better understand how social processes have unfolded over time.
Selection and Retention Periods (Recommended)
Indicate how data will be selected for archiving, how long the data will be held, and what your plans are for eventual transition or termination of the data collection in the future.
Why this is important
Not all data need to be preserved in perpetuity, so thinking through the
proper retention period for the data is important, in particular when
there are reasons the data will not be preserved permanently.
Example 1:
Our project will generate a large volume of data, some of which may not
be appropriate for sharing because they involve a small sample that is
not representative. The investigators will work with staff of the
[repository] to determine what to archive and how long the deposited
data should be retained.
Example 2:
Our research project will generate data from a large national sample.
These data will be retained by [repository] as part of its permanent
collection.
Audience (Recommended)
Describe the audience for the data you will produce.
Why this is important
The audience for the data may influence how the data are managed and
shared—for example, when audiences beyond the academic community may use the research data.
Example 1:
The data to be produced will be of interest to demographers studying
family formation practices in early adulthood across different racial
and ethnic groups.
Example 2:
In addition to the research community, we expect these data will be used by practitioners and policymakers.
Data Organization
Indicate how the data will be managed during the project, with information about version control, naming conventions, etc.
Why this is important
It is important to describe situations in which research data are in
some way atypical with respect to how they will be organized. For
example, some data collections are dynamically changing and version
control is central to how the data will be used and understood by the
scientific community.
Example 1:
Data will be stored in a version control system (CVS) and checked in and
out for purposes of versioning. Variables will use a standardized naming
convention consisting of a prefix-root-suffix system. Separate files
will be managed for the two kinds of records produced: one file for
respondents and another file for children, with merging routines
specified.
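A naming convention and merging routine of the kind described in this example might look like the following Python sketch. Everything specific here is an assumption made for illustration: the wave prefix, the R/C suffix, the RESP_ID key, and the function names do not come from any actual project.

```python
import re

# Hypothetical convention: prefix = wave (W1, W2, ...), root = 2-6 capital
# letters naming the concept, suffix = R (respondent) or C (child).
NAME_PATTERN = re.compile(r"^(W\d+)_([A-Z]{2,6})_([RC])$")


def make_name(wave, root, record_type):
    """Build a variable name and check it against the convention."""
    name = f"W{wave}_{root.upper()}_{record_type.upper()}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def merge_children(respondents, children, key="RESP_ID"):
    """Attach each child record to its respondent's record via a shared key.

    Raises KeyError if a child refers to a respondent that is not present.
    """
    by_id = {r[key]: {**r, "children": []} for r in respondents}
    for child in children:
        by_id[child[key]]["children"].append(child)
    return list(by_id.values())


# One respondent file row and one child file row, linked by RESP_ID.
respondents = [{"RESP_ID": 1, make_name(1, "inc", "r"): 42000}]
children = [{"RESP_ID": 1, make_name(1, "age", "c"): 7}]
merged = merge_children(respondents, children)
print(merged)
```

Validating names at creation time keeps the convention enforceable across versions, and a shared key makes the respondent-child merge reproducible by secondary analysts.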
Quality Assurance
Specify how you will ensure that the data meet quality assurance standards.
Why this is important
Producing data of high quality is essential to the advancement of
science, and every effort should be taken to be transparent with respect
to data quality measures undertaken across the data life cycle.
Example 1:
Quality assurance measures will comply with the standards, guidelines,
and procedures established by the World Health Organization.
Example 2:
For quantitative data files, the [repository] ensures that missing data
codes are defined, that actual data values fall within the range of
expected values, and that the data are free from wild codes. Processed
data files are reviewed by a supervisory staff member before release.
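The checks named in this example (defined missing data codes, values within the expected range, no wild codes) can be sketched in a few lines of Python. The codebook below, including the variable names, ranges, and missing codes, is entirely hypothetical, and the sketch is not a description of any repository's actual review process.

```python
# Hypothetical codebook: valid range and declared missing-data codes per variable.
CODEBOOK = {
    "AGE": {"valid_range": (18, 99), "missing_codes": {-9}},
    "VOTED": {"valid_range": (1, 2), "missing_codes": {8, 9}},
}


def find_wild_codes(records, codebook=CODEBOOK):
    """Return (row_index, variable, value) for each undocumented value.

    A value is accepted if it lies within the valid range or is a declared
    missing-data code. Absent values (None) are flagged as well.
    """
    problems = []
    for index, record in enumerate(records):
        for variable, rules in codebook.items():
            value = record.get(variable)
            low, high = rules["valid_range"]
            in_range = value is not None and low <= value <= high
            if not in_range and value not in rules["missing_codes"]:
                problems.append((index, variable, value))
    return problems


data = [
    {"AGE": 34, "VOTED": 1},
    {"AGE": -9, "VOTED": 9},   # declared missing codes: acceptable
    {"AGE": 140, "VOTED": 3},  # out of range and undeclared: wild codes
]
issues = find_wild_codes(data)
print(issues)
```

An empty result means every value is either in range or a documented missing code. Anything listed should be resolved or documented before the supervisory review.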
Budget
How will the costs for creating data and documentation suitable for archiving be paid?
Why this is important
Archiving data to ensure that data will be available and usable in the long
term costs money, and this needs to be recognized. Many funding agencies,
including NSF, permit investigators to include a line item for archiving in
the grant application budget.
Example 1:
Staff time has been allocated in the proposed budget to cover the costs
of preparing data and documentation for archiving. The [repository] has
estimated that its additional cost to archive the data is [insert dollar
amount]. This fee appears in the budget for this application as well.
Legal Requirements
Indicate whether any legal requirements apply to archiving and sharing your data.
Why this is important
Some data have legal restrictions that impact data sharing—for example,
data covered by HIPAA, proprietary data, and data collected through the
use of copyrighted data collection instruments. How these issues might
impact data sharing should be described fully in the data management
plan.
Example 1:
The proposed medical records research falls under the HIPAA Privacy
Rule. Consequently, the investigators will provide documentation that an
alteration or waiver of research participants' authorization for
use/disclosure of information about them for research purposes has been
approved by an IRB or a Privacy Board.