4 Human Subjects Data

In addition to understanding how to organize data, we also need a foundational understanding of the types of data we may collect. In the field of education research, we are often working with data collected from human subjects. Along with collecting data from people comes the responsibility to secure that data. Data from humans may contain identifiable information, increasing the risk that participants can be identified in a dataset. Human subjects data sometimes also contains information on sensitive topics such as mental health, drug use, or criminal behavior, further increasing risks if participants are identified. Before beginning your project, it is important to assess the type of data you will be collecting and understand the protections that will need to be in place to secure your data. This chapter will briefly review the types of human subjects data you may work with as well as any regulations, organizations, policies, or agreements that may impact how you need to secure your data.

4.1 Identifiability of a dataset

When working with human subjects, there are two types of identifiers you may collect in your study: direct and indirect (see Table 4.1). Direct identifiers are unique to an individual and can be used to identify a participant. Indirect identifiers are not necessarily unique to a particular individual, but if combined with other information they could be used to identify a participant (Kopper, Sautmann, and Turitto 2023a).

Table 4.1: Examples of direct and indirect identifiers
Direct Identifiers	Indirect Identifiers
Name	Age
Initials	Race
Address	Ethnicity
Phone number	Income
Email address	Education level
Social security number	Gender
IP Address	Occupation
ID numbers (student ID, state ID)	Date of birth
License numbers	ZIP Code
Account numbers	Special education services
	Data collection date
	Verbatim responses

A term often used when discussing identifiable information is personally identifiable information (PII). This term broadly refers to information that can be used to identify a participant. There is no agreed-upon list of the fields that count as PII, but it generally includes both the direct and indirect types of information shown in Table 4.1.
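To illustrate how indirect identifiers can become identifying in combination, the following is a minimal sketch that counts how many records share each combination of indirect identifier values and flags combinations that occur fewer than a chosen number of times. The function name `risky_combinations`, the field names, the sample records, and the `min_group_size` threshold are all hypothetical and used only for illustration. This is not a complete disclosure risk assessment.

```python
from collections import Counter


def risky_combinations(records, indirect_fields, min_group_size=2):
    """Return value combinations shared by fewer than min_group_size records.

    records: list of dicts, one per participant (hypothetical structure).
    indirect_fields: names of the indirect identifier fields to combine.
    """
    counts = Counter(
        tuple(record[field] for field in indirect_fields) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < min_group_size}


# Hypothetical example data: no single field is unique to one person,
# but the third record is the only one with its combination of values.
records = [
    {"age": 14, "gender": "F", "zip_code": "63101"},
    {"age": 14, "gender": "F", "zip_code": "63101"},
    {"age": 15, "gender": "M", "zip_code": "63101"},
]

flagged = risky_combinations(records, ["age", "gender", "zip_code"])
print(flagged)  # {(15, 'M', '63101'): 1}
```

A combination that appears only once does not prove a participant can be identified, but it is a signal that those fields deserve a closer look before the data is shared.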

When collecting data and creating datasets, you will be working with one or more of these four types of data files (UNC Office of Human Research Ethics 2020):

Identifiable: Data includes personally identifiable information. It is common for your raw research study data to be identifiable.
Coded: In this type of data file, PII has been removed or distorted and names are replaced with a code (i.e., a unique participant identifier). The only way to link the data back to an individual is through that code. The identifying code file (linking key) is stored separately from the research data (see Chapter 10). Coded data is typically the type of file you create after cleaning your raw study data.
De-identified: In this type of file, identifying information has been removed or distorted and the data can no longer be reassociated with the underlying individual (the linking key no longer exists). This is typically what you create when publicly sharing your research study data.
Anonymous: In an anonymous dataset, no identifying information is ever collected and so there should be little to no risk of identifying a specific participant.
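As a rough illustration of the difference between an identifiable file and a coded file, the sketch below replaces names with randomly generated codes and returns the linking key as a separate object, so the two can be saved in different locations. The function `assign_codes`, the field names `name` and `study_id`, and the sample rows are hypothetical. The sketch also assumes each name refers to exactly one participant, which will not hold in every study.

```python
import secrets


def assign_codes(raw_rows, name_field="name", id_field="study_id"):
    """Split identifiable rows into coded rows and a separate linking key.

    Assumes (for illustration only) that each distinct name is one participant.
    """
    name_to_code = {}
    used_codes = set()
    coded_rows = []
    for row in raw_rows:
        name = row[name_field]
        if name not in name_to_code:
            code = secrets.token_hex(4)  # random 8-character hex code
            while code in used_codes:  # regenerate on the rare collision
                code = secrets.token_hex(4)
            used_codes.add(code)
            name_to_code[name] = code
        # Copy every field except the name, then attach the study code.
        coded = {key: value for key, value in row.items() if key != name_field}
        coded[id_field] = name_to_code[name]
        coded_rows.append(coded)
    linking_key = [
        {id_field: code, name_field: name} for name, code in name_to_code.items()
    ]
    return coded_rows, linking_key


# Hypothetical raw data with two participants and three records.
raw_rows = [
    {"name": "Participant One", "wave": 1, "score": 12},
    {"name": "Participant Two", "wave": 1, "score": 15},
    {"name": "Participant One", "wave": 2, "score": 14},
]

coded_rows, linking_key = assign_codes(raw_rows)
```

Here `coded_rows` would be stored with the research data, while `linking_key` would be stored separately with restricted access. Destroying the linking key is one part of moving from a coded file toward a de-identified one, although any remaining indirect identifiers still need to be handled.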

4.2 Data classification

Data is often classified based on the level of sensitivity (Filip 2023; Macquarie University 2023; University of Michigan 2023). These levels of sensitivity dictate how the data can be collected, stored, and shared, as well as what the response should be to any data breach. The names of these levels, the number of levels, what each level includes, and the rules applied to each level all vary by institution. Despite this variation, here is a general summary of how information may be categorized.

Low sensitivity: This data is considered to have no or low risk if disclosed. This typically includes de-identified and anonymous data that does not contain highly sensitive information.
Moderate sensitivity: This data is considered to have moderate risk if disclosed, meaning it could adversely affect people. This data may include identifiable information or information that could allow participants to be re-identified within the data itself or using an external source. This data is typically required to be kept confidential by law or other agreements and should be protected against unauthorized access.
High sensitivity: This data should be under the most stringent security and could cause great harm if disclosed. This data includes PII or information that could allow participants to be re-identified, as well as private or highly sensitive information (e.g., illegal behaviors, medical records), and is typically required to be kept confidential by law or other agreements. This data should be protected against unauthorized access.

It is important to review your institution’s data classification levels, or data sensitivity levels, to determine how your specific institution classifies data. These rules may come from an information technology department, an institutional review board (IRB), or a combination of both. Note that different data collection efforts in the same project can be classified in different ways.

4.3 Human subjects data oversight

When working with human subjects data, there are laws, policies, departments, and agreements that may impact how you collect and manage that data. Below we will review some of the most commonly encountered oversight in education research.

4.3.1 Regulations and laws

FERPA: The Family Educational Rights and Privacy Act (FERPA) is a federal law protecting the privacy of student education records. The law applies to elementary and secondary schools, as well as post-secondary institutions which receive federal funds from the Department of Education. FERPA provides a list of personally identifiable information often contained in education records⁵.
HIPAA: The Health Insurance Portability and Accountability Act (HIPAA) provides federal protection for the privacy of protected health information (PHI) collected by covered entities serving patients. The HIPAA Privacy Rule provides a list of 18 identifiers that should be protected⁶.
Common Rule: In 1991 the Federal Policy for the Protection of Human Subjects was published, establishing core procedures for human subjects protections. The policy, 45 CFR part 46 (Office for Human Research Protections 2016), included four subparts. Subpart A, known as the “Common Rule”, provided a set of protections for human subjects research including informed consent, review by an IRB, and compliance monitoring (National Institute of Justice 2007; Office for Human Research Protections 2009). In 2018 the Common Rule was revised in order to better protect research participants and to reduce administrative burden (Office for Human Research Protections 2018; U.S. Department of Health and Human Services 2018).

4.3.2 Institutions and departments

IRB: An Institutional Review Board (IRB) is a formal organization designated to review and monitor human subjects research and ensure that the welfare, rights, and privacy of participants are maintained throughout the project (Oregon State University 2012). In particular, the IRB is concerned with three ethical principles established in the Belmont Report (The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research 1979): respect for persons (i.e., protecting the autonomy of participants), beneficence (i.e., minimizing harm and maximizing good), and justice (i.e., fair distribution of burdens and benefits) (Duru and Sautmann 2023; Gaddy and Scott 2020). When conducting human subjects research, it is important to review your local IRB’s policies and procedures to determine if your study requires IRB approval.
IT department: Institutional information technology (IT) departments often vet data collection, transfer, and storage tools and are the authority on what tools are approved for research use. They may also be your source for determining classification levels for data security.
Office of research or sponsored programs: Institutions often have an administrative body that serves as a signatory authority and can help negotiate terms and conditions for certain types of agreements (Washington University in St. Louis 2023).

4.3.3 External permission

External permission: When planning to conduct research in schools, many districts require researchers to submit requests for research⁷. The requirements for these requests vary by district but they often include an application or proposal outlining research plans, as well as other supporting documents (e.g., copies of data collection instruments, IRB approval, agreement forms). This submission is then typically reviewed by a committee for possible approval. A similar research or data permission process may also be required when requesting access to non-public data sources such as statewide longitudinal data systems. See Figure 12.6 for an example of what these request processes might look like.

4.3.4 Agreements

Informed consent/assent: Often required by an IRB, consent involves informing a participant of what data will be collected for your research study and how it will be handled and used, as well as obtaining a participant’s voluntary agreement to participate in your study. If your study involves participants under the age of 18, you may also be required to obtain a participant assent form, in addition to a parent/guardian consent form.
DUA: A data use agreement (DUA), also sometimes referred to as a data sharing agreement (DSA), is a contractual agreement that provides the terms and conditions for sharing data. DUAs are commonly written for data sharing when partnering with school districts or state agencies. As an example, a DUA may include the terms for sharing, working with, and storing education records data. However, DUAs can be used to provide guidance for outgoing data as well (i.e., a researcher is sharing their original data with an agency). DUAs can be standalone documents or may be incorporated into other documents such as a memorandum of understanding (MOU).
NDA: Non-disclosure agreements (NDAs), sometimes synonymous with confidentiality agreements, restrict the use of proprietary or confidential information (University of Washington 2023) and are legally enforceable agreements. These may be required when partnering with districts or other agencies.

4.3.5 Funders

Funding agencies: Along with requiring data management plans, funding agencies may have their own data protection procedures and may require applicants to submit additional documents agreeing to specific guidelines or outlining their security plans for human subjects data.

4.4 Protecting human subjects data

Throughout the remaining chapters of this book, we will review ways to keep identifiable human subjects data secure in each phase of the research life cycle. With that said, below is a quick review of some of the most important things to remember if you are collecting data that contain PII.

In most situations it will be important to get consent to collect identifiers. Consult with your local IRB to determine what is required. See Section 11.2.5 for more information.
Collect as few identifiers as possible. Only collect what is necessary. See Section 11.2.1 for more information.
Follow rules laid out in applicable laws, policies, and agreements when collecting, storing, and sharing data. This includes, but is not limited to, using approved tools for data collection, capture, and storage, assigning appropriate data access levels, and transmitting data using approved methods. See Chapters 11, 12, 13, and 15 for more information.
Remove names in data and replace them with codes (i.e., unique study identifiers). See Sections 10.4 and 14.3.1 for more information.
Fully de-identify data before publicly sharing it. See Section 16.2.3.4 for more information.
Use data sharing agreements and controlled access as needed when publicly sharing data. See Section 16.2.1 for more information.
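As one small illustration of the de-identification step in the list above, the sketch below drops a set of direct identifier fields and coarsens a date of birth to a birth year. The field names, the `DIRECT_IDENTIFIERS` set, and the `deidentify` function are hypothetical, and the date is assumed to be stored as a `YYYY-MM-DD` string. Real de-identification usually involves more than this (for example, reviewing verbatim responses and rare combinations of indirect identifiers), so treat this as a starting point rather than a complete procedure.

```python
# Hypothetical set of direct identifier field names for this example.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "address", "student_id"}


def deidentify(rows, direct_fields=DIRECT_IDENTIFIERS):
    """Drop direct identifiers and coarsen date of birth to birth year."""
    cleaned = []
    for row in rows:
        out = {key: value for key, value in row.items() if key not in direct_fields}
        if "date_of_birth" in out:
            # Assumes an ISO-style "YYYY-MM-DD" string; keep only the year.
            out["birth_year"] = out.pop("date_of_birth")[:4]
        cleaned.append(out)
    return cleaned


# Hypothetical example row.
rows = [
    {
        "name": "Participant One",
        "email": "p1@example.org",
        "date_of_birth": "2010-05-17",
        "score": 88,
    }
]

public_rows = deidentify(rows)
print(public_rows)  # [{'score': 88, 'birth_year': '2010'}]
```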
