18 Glossary


| Term | Other terms | Definition |
| --- | --- | --- |
| Anonymous data | NA | Identifying information was never collected. This data cannot be linked across time or measures. |
| Aggregated data | NA | Individual data that is summarized at a group level. |
| Append | vertical join, join columns | Stacking datasets on top of each other (matching variables). |
| Archive | NA | The transfer of data to a facility, such as a repository, that preserves and stores data long-term. |
| Attrition | NA | The loss of study units from the sample, often seen in longitudinal studies. |
| Clean data | processed data | Raw data that has been manipulated or modified for the purposes of correcting and clarifying information. |
| Cohort | NA | A group of participants recruited into the study at the same time. |
| Coded data | pseudonymized data, indirectly identifiable, confidential data | Personally identifiable information (PII) has been removed and names are replaced with a code. The only way to link the data back to an individual is through that code. The identifying code file (linking key) is stored separately from the research data. |
| Confidential data | NA | This data is protected from unauthorized disclosure. This data either contains personally identifiable information or can still be linked back to an individual through other means (e.g., identifiable data or coded data). |
| Confidentiality | NA | Confidentiality concerns data, ensuring participants agree to how their private and identifiable information will be managed and disseminated. |
| Control | business as usual (BAU) | The individual or group does not receive the intervention. |
| Cross-sectional | NA | Data is collected on participants at a single time point. |
| Data | research data | The recorded factual material commonly accepted in the scientific community as necessary to validate research findings. (OMB Circular A-110) |
| Data repository | data archive | A storage location for researchers to deposit data and supporting materials associated with their research. |
| Data structure | NA | A way of organizing data to allow for more efficient processing and storage. In particular, repeated measures data can be structured in either long or wide format. |
| Data type | measurement unit, variable format, variable class | A classification that specifies what types of values are contained in a variable and what kinds of operations can be performed on that variable. Examples of types include numeric, character, logical, or datetime. |
| Database | relational database | An organized collection of related data stored in tables that can be linked together by a common identifier. |
| Database design | database schema, data modeling | A collection of decisions regarding how tables, or datasets, will be organized and related to one another. |
| Dataset | data set, data frame, spreadsheet, rectangular data, tabular data, table | A structured collection of data usually stored in tabular form. A research study usually produces one final dataset per entity/unit (e.g., teacher dataset, student dataset). |
| De-identified data | anonymized data | Identifying information has been removed or distorted and the data can no longer be re-associated with the underlying individual (the linking key no longer exists). |
| Derived data | calculated values | Data created through transformations of existing data (e.g., mean scores). |
| Direct identifiers | NA | These variables are unique to an individual and can be used to directly identify a participant (e.g., name, email address). |
| Directory | file structure, file tree | A cataloging structure for files and folders on your computer. |
| Disclosure risk | NA | The risk of re-identifying a participant and the harm that may come from that disclosure. |
| Experimental data | NA | Data collected from a study where researchers randomly introduce an intervention and study the effects. |
| Extant data | secondary data, administrative data, third-party data | Existing data generated/collected by external organizations at an earlier point in time (e.g., school records). |
| FERPA | NA | The Family Educational Rights and Privacy Act is a federal law governing the disclosure of personally identifiable information in education records (e.g., name, address, DOB). The law applies to educational agencies and institutions that receive funding from the U.S. Department of Education, which includes nearly all public elementary and secondary schools and most post-secondary institutions. |
| File format | file type, file extension | A way that information is encoded for storage on a computer. There are both proprietary (e.g., SPSS, XLSX) and non-proprietary formats (e.g., CSV, TXT). |
| Foreign key | NA | One or more variables in a table whose values match the unique values (the primary key) in another table. |
| Human subject | NA | The Common Rule (45 CFR 46) definition of a human subject is a living individual about whom an investigator conducting research obtains: 1) data through intervention or interaction with the individual, or 2) identifiable private information. |
| HIPAA | NA | The Health Insurance Portability and Accountability Act is a federal law covering the protection of sensitive health information. |
| Identifiable data | NA | Data that includes personally identifiable information. |
| Indirect identifiers | quasi-identifiers | These variables do not on their own identify a particular individual (e.g., ethnicity, gender), but, if combined with other information, they could be used to identify a participant. |
| Instrument | NA | A mechanism designed to collect original data (e.g., observation form, questionnaire, assessment). |
| Limited data set | NA | Under the HIPAA Privacy Rule, a limited data set is one in which 16 of the 18 HIPAA protected identifiers have been removed. Age, dates, and city/state/ZIP Code can remain. A limited data set may be disclosed to external parties without authorization for specified purposes, and a data use agreement is often required. This dataset is not considered de-identified and must be safeguarded against unauthorized access. |
| Longitudinal data | repeated measures | The same information is collected from the same subjects at multiple time points. |
| Measure | scale | In this book, I use the term “measure” broadly to refer to a collection of items used to measure an outcome (e.g., an existing scale, an existing academic assessment). |
| Merge | horizontal join, join rows, link | Combining datasets together in a side-by-side manner (matching on one or more unique identifiers). |
| Missing data | NA | Occurs when there is no data stored in a variable for a particular observation/respondent. |
| Normalize | NA | In this book, the term “normalize” is used to refer to returning a value to its normal, or expected, state. |
| Observational data | NA | Data collected from a study where researchers are observing the effect of an intervention without manipulating who is exposed to the intervention. |
| Original data | primary data | First-hand data that are generated/collected by the research team as part of the research study. |
| Participant database | study roster, master list, master key, linking key, tracking database | This database, or spreadsheet, includes any identifiable information on your participants as well as their assigned study ID. It is your only means of linking your confidential research study data to a participant’s true identity. It is also used to track data collected across time and measures as well as participant attrition. |
| Path | file path | A string of characters used to locate files in your directory system. |
| Personally identifiable information | PII, personal data | This includes direct identifiers (e.g., name and email), as well as indirect identifiers that, if combined with other variables or if in small enough numbers, could identify a participant (e.g., full birthdate and place of birth). |
| Primary key | NA | One or more variables that uniquely define rows in your data. |
| Privacy | NA | Privacy concerns people, ensuring they are given control over access to themselves and their information. |
| Private data | NA | Data that is highly restricted and is typically either not publicly shared or shared only with limited access (e.g., passwords, illegal behaviors, medical records, financial information). |
| Protected health information | PHI | The HIPAA Privacy Rule provides protections for 18 identifiers held by covered entities providing health care services. |
| Qualitative data | NA | Non-numeric data typically made up of text, images, video, or other artifacts. |
| Quantitative data | NA | Numerical data that can be analyzed with statistical methods. |
| Randomized controlled trial | RCT | A study design that randomly assigns participants to a control or treatment condition. In education research you often hear about two types of RCTs. The first is the Individual-Level Randomized Controlled Trial (I-RCT), in which individuals (such as students) are randomized directly to the treatment or control group. The second is the Cluster Randomized Controlled Trial (C-RCT), sometimes also called group-randomized, in which clusters of students (such as classrooms) are randomized. |
| Raw data | primary, untouched | Unprocessed data collected directly from a source. |
| Replicable | NA | Being able to produce the same results if the same procedures are used with different materials. |
| Reproducible | NA | Being able to produce the same results using the same materials and procedures. |
| Research | NA | The Common Rule (45 CFR 46) definition of research is a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge. |
| Restricted-use data | non-public data, controlled data, managed access data | A dataset that cannot be publicly released because it contains sensitive information or a combination of variables that could enable identification. These data require controlled access conditions and may be shared through data use agreements or other application processes. |
| Safe harbor method | NA | Under the HIPAA Privacy Rule, there are two methods of de-identification: Safe Harbor and Expert Determination. The Safe Harbor method allows covered entities to treat data as de-identified if all 18 PHI variables are removed. |
| Sensitive data | protected data | An umbrella term that encompasses proprietary, ethical, contractual, or private information that should be protected from unwarranted disclosure. There are varying levels of data sensitivity. |
| Standardize | NA | Developing and implementing a set of consistent procedures. |
| Study | NA | A single funded research project resulting in one or more datasets to be used to answer a research question. |
| Subject | case, participant, site, record | A person or place that participates in research and has one or more pieces of data collected on them. |
| Syntax | code, program, script | Programming statements written in a text editor. The statements are machine-readable instructions processed by your computer. |
| Tool | platform | A means used to collect data using an instrument (e.g., a paper form, an online survey platform). |
| Treatment | NA | The individual or group receives the intervention. |
| Unique participant identifier | study ID, site ID, unique identifier (UID), subject ID, participant code, record ID | This is a unique numeric or alphanumeric identifier, assigned to every participant or site, and used to create confidential and de-identified data. These identifiers allow researchers to link data across time or measures. |
| Variable | column, field, question, data element | Any phenomenon you are collecting information on/trying to measure. These variables will make up columns in your datasets or databases. |
| Variable name | header | A shortened symbolic name given to a variable in your data to represent the information it contains. |
| Wave | time period, time point, event, session | Intervals of data collection over time. |
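
The difference between the "Append" and "Merge" entries, and the roles of the "Primary key" and "Foreign key" entries, may be easier to see in a small example. The following is a minimal sketch in plain Python, not a method prescribed by this book. The datasets, the variable names (`stu_id`, `tch_id`, `score`, `grade`), and the helper functions `append` and `merge` are all hypothetical and exist only for illustration.

```python
# Hypothetical records; all variable names and values are illustrative only.
wave1 = [
    {"stu_id": 101, "tch_id": 1, "score": 12},
    {"stu_id": 102, "tch_id": 2, "score": 15},
]
wave1_late = [
    {"stu_id": 103, "tch_id": 1, "score": 9},
]
teachers = [
    {"tch_id": 1, "grade": 3},
    {"tch_id": 2, "grade": 4},
]


def append(top, bottom):
    """Stack two datasets vertically; both must contain the same variables."""
    if top and bottom and set(top[0]) != set(bottom[0]):
        raise ValueError("datasets must have matching variables to append")
    return [dict(row) for row in top + bottom]


def merge(left, right, key):
    """Join `right` onto `left` side by side, matching rows on `key`.

    `key` must be a primary key in `right` (unique per row).
    In `left` it acts as a foreign key pointing at `right`.
    """
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key} value in right dataset: {row[key]}")
        lookup[row[key]] = row
    # Rows in `left` with no match in `right` are kept unchanged.
    return [{**row, **lookup.get(row[key], {})} for row in left]


# Append: more rows, same variables.
students = append(wave1, wave1_late)

# Merge: same rows, more variables (teacher grade level added to each student).
linked = merge(students, teachers, key="tch_id")
```

Here `tch_id` is the primary key of the teacher dataset and a foreign key in the student dataset, which is why several students can share one teacher while each teacher appears only once.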
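
The "Data structure" entry mentions that repeated measures data can be stored in long or wide format. As a rough illustration, the sketch below reshapes a small long-format dataset (one row per participant per wave) into wide format (one row per participant). The data, the variable names, the `score_w1` style of naming, and the helper `long_to_wide` are assumptions made for this example.

```python
# Hypothetical long-format data: one row per participant per wave.
long_rows = [
    {"stu_id": 101, "wave": 1, "score": 12},
    {"stu_id": 101, "wave": 2, "score": 14},
    {"stu_id": 102, "wave": 1, "score": 15},
    {"stu_id": 102, "wave": 2, "score": 17},
]


def long_to_wide(rows, id_var, wave_var, value_var):
    """Reshape long data so each participant has one row and one column per wave."""
    wide = {}
    for row in rows:
        # Create the participant's wide record the first time the ID is seen.
        record = wide.setdefault(row[id_var], {id_var: row[id_var]})
        # Wave number becomes part of the variable name, e.g. score_w1.
        record[f"{value_var}_w{row[wave_var]}"] = row[value_var]
    return list(wide.values())


wide_rows = long_to_wide(long_rows, "stu_id", "wave", "score")
for record in wide_rows:
    print(record)
```

In the long version the primary key is the combination of `stu_id` and `wave`; in the wide version `stu_id` alone identifies each row.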
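
The "Coded data", "Participant database", and "Unique participant identifier" entries describe replacing direct identifiers with a study ID and keeping the linking key apart from the research data. The sketch below shows that idea under simple assumptions: the records are invented, sequential IDs starting at 1001 are an arbitrary choice, and `code_data` is a hypothetical helper. In practice the linking key would be saved to a separate, secured location.

```python
# Hypothetical identifiable data; names, emails, and scores are made up.
identifiable = [
    {"name": "Student A", "email": "a@example.org", "score": 12},
    {"name": "Student B", "email": "b@example.org", "score": 15},
]
DIRECT_IDENTIFIERS = ("name", "email")


def code_data(rows, direct_identifiers, start_id=1001):
    """Split identifiable data into coded research data and a linking key."""
    coded, linking_key = [], []
    for offset, row in enumerate(rows):
        study_id = start_id + offset  # assumption: simple sequential study IDs
        # The linking key holds the study ID alongside the direct identifiers.
        linking_key.append(
            {"study_id": study_id, **{var: row[var] for var in direct_identifiers}}
        )
        # The coded data holds the study ID and everything except the identifiers.
        coded.append(
            {
                "study_id": study_id,
                **{var: val for var, val in row.items() if var not in direct_identifiers},
            }
        )
    return coded, linking_key


coded, linking_key = code_data(identifiable, DIRECT_IDENTIFIERS)
```

Only someone with access to `linking_key` can connect a row of `coded` back to a person. If the linking key is destroyed and no other identifying information remains, the data would fall under the "De-identified data" entry instead.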