17 Glossary

Term Other Terms Definition
Anonymous data NA Identifying information was never collected. This data cannot be linked across time or measures.
Aggregated data NA Individual data that is summarized at a group level.
Append NA Stacking datasets on top of each other (matching variables).
Archive NA The transfer of data to a facility, such as a repository, that preserves and stores data long-term.
Attrition NA The loss of study units from the sample, often seen in longitudinal studies.
Clean data processed data Raw data that has been manipulated or modified for the purposes of correcting and clarifying information.
Cohort NA A group of participants recruited into the study at the same time.
Coded data pseudonymized data, indirectly identifiable, confidential data Personally identifiable information (PII) has been removed and names are replaced with a code. The only way to link the data back to an individual is through that code. The identifying code file (linking key) is stored separately from the research data.
Confidential data NA This data is protected from unauthorized disclosure. This data either contains personally identifiable information, or can still be linked back to an individual through other means (e.g., identifiable data or coded data).
Confidentiality NA Confidentiality concerns data, ensuring participants agree to how their private and identifiable information will be managed and disseminated.
Control business as usual (BAU) The individual or group does not receive the intervention.
Cross-sectional NA Data is collected on participants at a single time point.
Data research data The recorded factual material commonly accepted in the scientific community as necessary to validate research findings. (OMB Circular A-110)
Data type measurement unit, variable format, variable class A classification that specifies what types of values are contained in a variable and what kinds of operations can be performed on that variable. Examples of types include numeric, character, logical, or datetime.
Database relational database An organized collection of related data stored in tables that can be linked together by a common identifier.
Dataset data set, dataframe, spreadsheet, rectangular data, tabular data A structured collection of data usually stored in tabular form. A research study usually produces one final dataset per entity/unit (e.g., teacher dataset, student dataset).
De-identified data anonymized data Identifying information has been removed or distorted and the data can no longer be re-associated with the underlying individual (the linking key no longer exists).
Derived data NA Data created through transformations of existing data (e.g., mean scores).
Direct identifiers NA These variables are unique to an individual and can be used to directly identify a participant (e.g., name, email address).
Directory file structure, file tree A cataloging structure for files and folders on your computer.
Experimental data NA Data collected from a study where researchers randomly introduce an intervention and study the effects.
Extant data secondary data, administrative data Existing data generated/collected by external organizations at an earlier point in time (e.g., school records).
FERPA NA The Family Educational Rights and Privacy Act is a federal law governing the disclosure of personally identifiable information in education records (e.g., name, address, DOB). The law applies to all public elementary and secondary schools, as well as post-secondary institutions.
File formats NA Education research data is typically collected in one of three file formats: text (.txt, .pdf, .docx), tabular (.xlsx, .csv, .sav), and multimedia (.mpeg, .wav).
Human subject NA The Common Rule (45 CFR 46) definition of a human subject is a living individual about whom an investigator conducting research obtains: 1) data through intervention or interaction with the individual, or 2) identifiable private information.
HIPAA NA The Health Insurance Portability and Accountability Act is a federal law covering the protection of sensitive health information.
Identifiable data NA Data that includes personally identifiable information.
Indirect identifiers NA These variables do not alone identify a particular individual (e.g., ethnicity, gender), but if combined with other information or if category numbers are small, they could be used to identify a participant.
Instrument NA A mechanism designed to collect original data (e.g., observation form, questionnaire, assessment).
Limited data set NA Under the HIPAA Privacy Rule, a limited data set is one in which 16 of the 18 HIPAA protected identifiers have been removed. Age, dates, and city/state/ZIP code can remain. A limited data set may be disclosed to external parties without authorization for specified purposes, and often a data use agreement is required. This data set is not considered de-identified and must be safeguarded against unauthorized access.
Longitudinal NA Data is collected on participants over a period of time.
Measure NA In this book, I use the term measure broadly to refer to a collection of items used to measure an outcome (e.g., an existing scale, an existing academic assessment).
Merge join, link Combining datasets in a side-by-side manner (matching on one or more unique identifiers).
Missing data NA Occurs when there is no data stored in a variable for a particular observation/respondent.
Observational data NA Data collected from a study where researchers are observing the effect of an intervention without manipulating who is exposed to the intervention.
Participant database study roster, master list, master key, linking key, tracking database This database, or spreadsheet, includes any identifiable information on your participants as well as their assigned study ID. It is your only means of linking your confidential research study data to a participant’s true identity. It is also used to track data collected across time and measures as well as participant attrition.
Path file path A string of characters used to locate files in your directory system.
Personally identifiable information PII, personal data This includes direct identifiers (e.g., name and email), as well as indirect identifiers that, if combined with other variables or if in small enough numbers, could identify a participant (e.g., full birthdate and place of birth).
Primary data original data Firsthand data that is generated/collected by the research team as part of the research study.
Privacy NA Privacy concerns people, ensuring they are given control over access to themselves and their information.
Private data NA Data that is highly restricted and typically not publicly shared, or is shared only with limited access (e.g., passwords, illegal behaviors, medical records, financial information).
Protected health information PHI The HIPAA Privacy Rule provides protections for 18 identifiers held by covered entities providing health care services.
Qualitative data NA Non-numeric data typically made up of text, images, video, or other artifacts.
Quantitative data NA Numerical data that can be analyzed with statistical methods.
Randomized controlled trial RCT A study design that randomly assigns participants to a control or treatment condition. In education research you often hear about two types of RCTs. The first is the Individual-Level Randomized Controlled Trial (I-RCT), in which individuals (such as students) are randomized directly to the treatment or control group. The second is the Cluster Randomized Controlled Trial (C-RCT), sometimes also called group-randomized, in which clusters of students (such as classrooms) are randomized.
Raw data primary, untouched Unprocessed data collected directly from a source.
Replicable NA Being able to produce the same results if the same procedures are used with different materials.
Reproducible NA Being able to produce the same results using the same materials and procedures.
Research NA The Common Rule (45 CFR 46) definition of research is a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge.
Restricted-use data non-public data A dataset that cannot be publicly released due to containing sensitive information or a combination of variables that could enable identification. These data require controlled access conditions and may be shared through data use agreements or other application processes.
Safe harbor method NA Under the HIPAA Privacy Rule, there are two methods of de-identification. The Safe Harbor method allows covered entities to treat data as de-identified if all 18 PHI variables are removed.
Sensitive data protected data An umbrella term that encompasses proprietary, ethical, contractual, or private information that should be protected from unwarranted disclosure. There are varying levels of data sensitivity.
Scale NA Similar to the term “measure”, this is a collection of items used to measure an outcome. However, I typically use this term to more specifically refer to questionnaires that have had psychometric properties assessed. Scales may also be made up of subscales (i.e., groupings of items).
Simulation data NA Data generated through imitations of a real-world process using computer models.
Standardization NA Developing a set of agreed-upon technical standards and applying them within and across all research projects.
Study NA A single funded research project resulting in one or more datasets to be used to answer a research question.
Subject case, participant, site, record A person or place that participates in research and has one or more pieces of data collected on them.
Syntax code, program Programming statements written in a text editor. The statements are machine-readable instructions processed by your computer.
Tool NA A means used to collect data using an instrument (e.g., a paper form, an online survey platform).
Treatment experiment The individual or group receives the intervention.
Unique participant identifier study ID, site ID, unique identifier (UID), subject ID, participant code, record ID This is a numeric or alphanumeric identifier that is unique to every participant or site in order to create confidential and de-identified data. These identifiers allow researchers to link data across time or measures.
Variable column, field, question, data element Any phenomenon you are collecting information on/trying to measure. These variables will make up columns in your datasets or databases.
Variable name header A shortened symbolic name given to the variable in your data to represent the information it contains.
Wave time period, time point, event, session Intervals of data collection over time.
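
The difference between the Append and Merge entries above is easiest to see on a small example. The following is a minimal sketch in plain Python, not a recommended workflow: the datasets, the `stu_id` identifier, and the helper functions `append` and `merge` are all hypothetical names invented for illustration, and the merge shown keeps only rows that match in both datasets (an inner join), which is one of several possible choices.

```python
# Hypothetical toy data: each dataset is a list of rows (dicts).
# "stu_id" stands in for a unique participant identifier.
wave1 = [
    {"stu_id": 101, "score": 12},
    {"stu_id": 102, "score": 15},
]
wave2 = [
    {"stu_id": 101, "score": 14},
    {"stu_id": 103, "score": 9},
]
demographics = [
    {"stu_id": 101, "grade": 4},
    {"stu_id": 102, "grade": 5},
]


def append(top, bottom):
    """Stack two datasets on top of each other; variable names must match."""
    if top and bottom and set(top[0]) != set(bottom[0]):
        raise ValueError("datasets must have matching variables to append")
    return [dict(row) for row in top] + [dict(row) for row in bottom]


def merge(left, right, key):
    """Combine two datasets side by side, matching rows on a unique key.

    Only rows whose key appears in both datasets are kept (inner join).
    """
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key} in right dataset: {row[key]}")
        lookup[row[key]] = row
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


stacked = append(wave1, wave2)                     # more rows, same columns
joined = merge(wave1, demographics, key="stu_id")  # same rows, more columns

print(len(stacked))
print(joined)
```

Appending grows the dataset downward (more observations on the same variables), while merging grows it sideways (more variables on the same participants), which is why a merge depends on a unique identifier and an append depends on matching variable names.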