5 Data Management Plan

Figure 5.1: Data management plan in the research project life cycle

5.1 History and purpose

Since 2013, and even earlier for the National Science Foundation (NSF), most federal agencies that education researchers work with have required a data management plan (DMP) as part of a funding application (Holdren 2013). While these plans focus mostly on the future outcome of data sharing, the data management plan is also a means of ensuring that researchers thoughtfully plan a research study that will produce data that can be shared with confidence and that are free from errors, uncertainty, and violations of confidentiality. President Obama’s May 2013 Executive Order declared that “the default state of new and modernized government information resources shall be open and machine readable” (The White House 2013). In August 2022, the Office of Science and Technology Policy (OSTP) strengthened its data sharing policy by issuing a memorandum stating that all federal agencies must update their public access policies no later than December 31, 2025, to make federally funded publications and their supporting data accessible to the public with no embargo on their release (Nelson 2022). Even sooner than this, organizations like the National Institutes of Health (NIH) mandated that, beginning in January 2023, grant applicants must submit a plan for both managing and sharing project data (National Institutes of Health 2023c). The National Science Foundation also released version 2.0 of its public access plan in February 2023, describing how the agency plans to ensure that all scientific data funded by the NSF and associated with peer-reviewed publications are publicly shared (National Science Foundation 2023).

Note

In the last year, agencies have begun revising the phrase “data management plan” to include the word “sharing” to better represent the shifting emphasis on sharing publicly funded data. As an example, NIH now uses the term Data Management and Sharing (DMS) Plan 8, while the Institute of Education Sciences (IES) has chosen to use the term Data Sharing and Management Plan (DSMP) 9. For the sake of simplicity, the term DMP is used throughout this book to generally represent these plans, no matter the precise name, across all federal agencies.

5.1.1 Why are DMPs important?

Funding agencies see DMPs as important for maximizing the scientific output of their investments and for increasing transparency. Mandating data sharing for federally funded projects brings many benefits, including accelerated discovery, greater collaboration, and increased trust among data creators and users. Beyond the benefits that funders see, there are intrinsic benefits to writing a data management plan. Having to plan thoughtfully, and to be transparent about that plan, leads to better data management. Knowing that you will eventually share your data and documentation with others outside of your team can motivate you to organize your data management practices in a way that produces data you trust to share with the outside world (Center for Open Science 2023). Even if a funder does not require a DMP, writing one should always be the first step of your planning process. Although brief, this document serves as the foundation for all future planning and provides your team with a shared understanding of data management expectations.

5.2 What is it?

Typically, a data management plan is a supplemental 2–5 page document, submitted with your grant application, that contains high-level decisions about how you plan to collect, store, manage, and share your research data products. For most funders these DMPs are not part of the scoring process, but they are reviewed by a panel or program officer. Some funders may provide feedback or ask for revisions if they believe your plan and/or your budget and associated costs are not adequate. Although this document is usually submitted to your funder, it should be considered a living document to be updated as plans change throughout a study.

5.2.1 What to include?

What to include in a DMP varies somewhat across funding agencies, and the landscape of requirements is currently evolving. You should check each funding agency’s site for its specific DMP requirements when submitting a proposal. With that said, there are generally 10 common categories covered in a data management plan (Center for Open Science 2023; Gonzales, Carson, and Holmes 2022; ICPSR 2020; Michener 2015), which we will review below.

  1. Description of data to be shared (See Chapters 11, 12, 14, 16)
    • What is the source of data? (e.g., surveys, assessments, observations, extant data)
    • How will data be cleaned and curated prior to data sharing?
    • What will the level of aggregation be? (e.g., item-level, summary data, metadata only)
      • Datasets from a project may need to be shared in different ways due to legal, ethical, or technical reasons.
    • Will both raw and clean data be shared?
    • What is the expected number of files? What is the expected number of rows/cases in each file?
  2. Format of data to be shared (See Chapters 14 and 16)
    • Will data be in an electronic format?
    • Will it be provided in a non-proprietary format? (e.g., CSV)
    • Will more than one format be provided? (e.g., SPSS and CSV)
    • Are there any tools needed to manipulate or reproduce shared data? (e.g., software, code)
      • Provide details for those tools. (e.g., how they can be accessed, version number, required operating system)
  3. Documentation to be shared (See Chapters 8 and 16)
    • What documentation will you share?
      • Consider project-level, dataset-level, and variable-level documentation.
    • What format will your documentation be in? (e.g., XML, CSV, PDF)
  4. Standards (See Chapters 8 and 11)
    • Do you plan to use any standards for things such as metadata, data collection (e.g., common data elements), or data formatting?
  5. Data preservation (See Chapter 16)
    • Where will data be archived for public sharing?
      • Many agencies are now requiring applicants to name a specific data repository in this section.
    • What are the desirable characteristics of the repository? 10 (e.g., unique persistent identifiers assigned to data, metadata collected, records provenance, licensing options)
    • When will you deposit your study data in the repository and for how long will data remain accessible?
    • How will you enable discoverability and reuse of data?
  6. Access, distribution, or reuse considerations (See Chapters 4 and 16)
    • Are there any legal, technical, or ethical factors affecting reuse, access, or distribution of your data?
    • Will any data be restricted?
    • Are access controls required (e.g., a data use agreement, data enclave)?
  7. Protection of privacy and confidentiality (See Chapters 4, 14, and 16)
    • Do participants sign informed consent agreements? Does the consent communicate how participant data are expected to be used and shared?
    • How will you prevent disclosure of personally identifiable information when you share data?
  8. Data security (See Chapter 13)
    • How will security and integrity of data be maintained during a project? (e.g., consider data storage, access, backup, and transfer)
  9. Roles and responsibilities (See Chapter 7)
    • What are the staff roles in management and preservation of data?
    • Who ensures accessibility, reliability, and quality of data?
    • Is there a plan if a core team member leaves the project or institution?
  10. Pre-registration
    • Where and when will you pre-register your study?

Again, the specifics of what should be included in each category will vary by funder. Here are sites to visit to learn more about the DMP requirements for four common federal education research funding agencies.

  • Institute of Education Sciences 11 12
  • National Institutes of Health 13
  • National Institute of Justice 14
  • National Science Foundation 15

5.3 Creating a data sources catalog

In preparation for writing your DMP, it can be helpful to create a data sources catalog that lets you see at a glance what data you are collecting, what the sensitivity level of each source is, and how data will be collected, managed, stored, and shared (Filip 2023). This type of catalog can not only help you write your DMP but can also serve as an excellent planning or discussion tool throughout your entire project.

In setting up this rectangular formatted document, each row represents a unique data collection effort, and each column (or field) represents information about that effort. Some fields you can add to this catalog include:

  • Source information
    • Instrument (e.g., survey, assessment)
    • Record level (i.e., whom the instrument collects data about)
    • Who completes the instrument (e.g., rater, participant)
    • Measures included in the instrument
  • Collection and capture method
  • Data collection waves (i.e., how often will you collect this data source)
  • Planned number and size of data files for each source (e.g., two student assessment files (T1, T2), with ~500 rows per file)
  • PII included
  • Sensitivity level based on your institution’s policies
  • Data storage and access plan
  • Data ownership
  • How confidentiality will be secured
  • Data sharing method

Figure 5.2 is a simplified example of building this catalog for a hypothetical study. Ultimately, each data source in your catalog, multiplied by the number of cohorts and/or waves in which it is collected, will give you an estimate of the final number of distinct data files at the end of your study. In Figure 5.2, if we only collected data for one year, we would end up with six datasets at the end of our study: three teacher-level files and three student-level files. In Chapter 16, we will discuss whether to share these as separate datasets or as larger files combined by unit of analysis (e.g., combined student-level file, combined teacher-level file).

Figure 5.2: Example data sources catalog
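
The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the catalog rows, the field names (`instrument`, `record_level`, `waves`, `contains_pii`), and the helper functions are hypothetical and are not taken from Figure 5.2. A real catalog would carry more of the fields listed earlier, such as sensitivity level and data sharing method.

```python
import csv
import io
from collections import Counter

# Hypothetical catalog: one row per data collection effort.
# All values are invented for illustration.
CATALOG = [
    {"instrument": "student assessment", "record_level": "student", "waves": 2, "contains_pii": True},
    {"instrument": "student survey", "record_level": "student", "waves": 1, "contains_pii": True},
    {"instrument": "teacher survey", "record_level": "teacher", "waves": 2, "contains_pii": True},
    {"instrument": "classroom observation", "record_level": "teacher", "waves": 1, "contains_pii": False},
]


def estimate_file_counts(catalog, cohorts=1):
    """Estimate the number of distinct data files per record level.

    Each data source contributes (waves x cohorts) files.
    """
    counts = Counter()
    for row in catalog:
        counts[row["record_level"]] += row["waves"] * cohorts
    return dict(counts)


def catalog_to_csv(catalog):
    """Serialize the catalog as CSV text (one row per data source)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(catalog[0].keys()))
    writer.writeheader()
    writer.writerows(catalog)
    return buffer.getvalue()


if __name__ == "__main__":
    per_level = estimate_file_counts(CATALOG, cohorts=1)
    print(per_level)                # {'student': 3, 'teacher': 3}
    print(sum(per_level.values()))  # 6
    print(catalog_to_csv(CATALOG))
```

Keeping the catalog in a rectangular, non-proprietary form such as CSV means the same file can be opened in a spreadsheet for team discussion and read by a script when you need quick estimates for your DMP.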

5.4 Getting help

Since DMPs are written before a project is funded, and therefore before additional staff members may be hired, oftentimes the investigators developing the grant proposal are the ones who write the DMP. However, when constructing your DMP it is well worth your time to enlist help. If you have an existing data manager or data team, you will most certainly want to consult with them when writing your plan to ensure your decisions are feasible. If you work for a university system, your research data librarians are also excellent resources with a wealth of knowledge about writing comprehensive data management plans. Also, if you plan to share your final data with a repository or institutional archive you will want to contact their team when writing your plan as well. The repository may have its own requirements for how and when data must be shared, and it is helpful to outline those guidelines in your data management plan at the time of submission. Last, you may want to obtain the help of your colleagues. Your colleagues have likely written DMPs before and many people are willing to share their plans as a way to help others better understand what to include.

As mentioned before, your DMP is a living document, and you can always update your plan during your project or after its completion. It may be helpful to keep in contact with your program officer regarding any potential changes throughout your project.

If you are looking for guidance in writing a DMP, a variety of generic DMP templates for different federal agencies are available, as well as actual copies of submitted DMPs that some researchers graciously make publicly available for example purposes. Furthermore, the DMPTool (https://dmptool.org), a free, open-source, online application, allows users to create and share data management plans using pre-defined funding agency templates.

Templates and resources

| Source | Resource |
|---|---|
| DMPTool | Templates organized by funding agencies 16 |
| Figshare | DMP prompts specific to depositing data with Figshare 17 |
| Hao Ye, et al. | NIH DMS Plan checklist 18 |
| Harvard Longwood Medical Area RDM Working Group | Annotated DMP template 19 |
| ICPSR | NIH DMS Plan template with specific recommendations for depositing data with ICPSR 20 |
| NIH | Sample DMS Plan for human survey data 21 |
| Sara Hart | A submitted DMP that is publicly available for example purposes 22 |
| UMN Libraries | Submitted DMP examples from University of Minnesota researchers 23 |

5.5 Budgeting

Effective data management requires a significant investment. Funding agencies acknowledge that there are costs associated with implementing your data management plan and allow you to explain these costs in your budget narrative. Costs associated with the entire data life cycle should be considered (Cruse 2011), and should include things such as personnel expenses, as well as fees for tools and services. Make sure to review your funder’s documentation for information about allowable costs and time frame for incurring costs. Allowable costs might include things such as (National Institutes of Health 2023a; UK Data Service 2023):

  • Infrastructure or tools required to organize, document, or store data
  • Curating and de-identifying data
  • Developing data documentation
  • Depositing data for long-term sharing in a repository

It can be difficult to estimate the cost of everything involved in managing data. The necessary dollar amount will vary depending on the size of your project, the expertise needed, and your specific data management plan. Recommendations suggest setting aside anywhere from 5–30% of your budget for data stewardship (Mons 2020; J. H. Reynolds et al. 2014). Luckily, a few organizations have developed resources to aid in estimating these costs. Exercise caution when using these tools, though; they may not account for every cost, which could lead you to underestimate your costs (Michigan State University 2023).
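
To make the 5–30% recommendation concrete, the short Python sketch below converts it into a dollar range. The $500,000 total budget and the function name are hypothetical, chosen only for illustration. The result is a rough planning range, not a substitute for itemizing personnel, tools, and repository fees in your budget narrative.

```python
def stewardship_budget_range(total_budget, low_pct=5, high_pct=30):
    """Return (low, high) dollar amounts for data stewardship.

    Percentages are whole numbers; the defaults follow the 5-30%
    range cited in the text.
    """
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    return total_budget * low_pct / 100, total_budget * high_pct / 100


if __name__ == "__main__":
    # Hypothetical total project budget of $500,000.
    low, high = stewardship_budget_range(500_000)
    print(f"${low:,.0f} to ${high:,.0f}")  # $25,000 to $150,000
```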

Resources

| Source | Resource |
|---|---|
| National Institute of Mental Health Data Archive | NDA Data Submission Cost Estimation Tool 24 |
| UK Data Service | Data management costing tool and checklist 25 |
| University of Twente | Estimating RDM costs review list 26 |
| Utrecht University | Estimating the costs of data management review list 27 |