18 Glossary

Term	Other Terms	Definition
Anonymous data	NA	Identifying information was never collected. This data cannot be linked across time or measures.
Aggregated data	NA	Individual data that is summarized at a group level.
Append	vertical join, join columns, union	Stacking datasets on top of each other (matching variables).
Archive	NA	The transfer of data to a facility, such as a repository, that preserves and stores data long-term.
Attrition	NA	The loss of study units from the sample, often seen in longitudinal studies.
Clean data	processed data	Raw data that has been manipulated or modified for the purposes of correcting and clarifying information.
Cohort	NA	A group of participants recruited into the study at the same time.
Coded data	pseudonymized data, indirectly identifiable, confidential data	Personally identifiable information (PII) has been removed and names are replaced with a code. The only way to link the data back to an individual is through that code. The identifying code file (linking key) is stored separately from the research data.
Confidential data	NA	This data is protected from unauthorized disclosure. This data either contains personally identifiable information or can still be linked back to an individual through other means (e.g., identifiable data or coded data).
Confidentiality	NA	Confidentiality concerns data, ensuring participants agree to how their private and identifiable information will be managed and disseminated.
Control	business as usual (BAU)	The individual or group does not receive the intervention.
Cross-sectional	NA	Data is collected on participants at a single time point.
Data	research data	The recorded factual material commonly accepted in the scientific community as necessary to validate research findings. (OMB Circular A-110)
Data repository	data archive	A storage location for researchers to deposit data and supporting materials associated with their research.
Data structure	NA	A way of organizing data to allow for more efficient processing and storage. In particular, repeated measures data can be structured in either long or wide format.
Data type	measurement unit, variable format, variable class, variable type	A classification that specifies what types of values are contained in a variable and what kinds of operations can be performed on that variable. Examples of types include numeric, character, logical, or datetime.
Database	relational database	An organized collection of related data stored in tables that can be linked together by a common identifier.
Database design	database schema, data modeling	A collection of decisions regarding how tables, or datasets, will be organized and related to one another.
Dataset	data set, data frame, spreadsheet, rectangular data, tabular data, table	A structured collection of data usually stored in tabular form. A research study may produce one final dataset per entity/unit (e.g., teacher dataset, student dataset).
De-identified data	anonymized data	Identifying information has been removed or distorted and the data can no longer be re-associated with the underlying individual (the linking key no longer exists).
Derived data	calculated values	Data created through transformations of existing data (e.g., mean scores).
Direct identifiers	NA	These variables are unique to an individual and can be used to directly identify a participant (e.g., name, email address).
Directory	file structure, file tree, folder structure	A cataloging structure for files and folders on your computer.
Disclosure risk	NA	The risk of re-identifying a participant and the harm that may come from that disclosure.
Extant data	secondary data, administrative data, third-party data, external data	Existing data generated/collected by external organizations at an earlier point in time (e.g., school records).
FERPA	NA	The Family Educational Rights and Privacy Act is a federal law governing the disclosure of personally identifiable information in education records (e.g., name, address, DOB). The law applies to educational agencies and institutions that receive funding from the U.S. Department of Education, which includes virtually all public elementary and secondary schools as well as most post-secondary institutions.
File format	file type, file extension	A way that information is encoded for storage on a computer. There are both proprietary (e.g., SPSS, XLSX) and non-proprietary formats (e.g., CSV, TXT).
Foreign key	NA	One or more variables in a table that refer to the primary key of another table, allowing rows in the two tables to be linked.
Human subject	NA	The Common Rule (45 CFR 46) definition of a human subject is a living individual about whom an investigator conducting research obtains: 1) Data through intervention or interaction with the individual, or 2) identifiable private information.
HIPAA	NA	The Health Insurance Portability and Accountability Act is a federal law covering the protection of sensitive health information.
Identifiable data	NA	Data that includes personally identifiable information.
Indirect identifiers	quasi-identifiers	These variables do not alone identify a particular individual (e.g., ethnicity, gender), but, if combined with other information, they could be used to identify a participant.
Instrument	NA	A mechanism designed to collect original data (e.g., observation form, questionnaire, assessment).
Longitudinal data	repeated measures	The same information is collected from the same subjects at multiple time points.
Measure	scale	In this book, I use the term “measure” broadly to refer to a collection of items used to measure an outcome (e.g., an existing scale, an existing academic assessment).
Merge	horizontal join, join rows, link	Combining datasets together in a side-by-side manner (matching on one or more unique identifiers).
Metadata	NA	Data providing details about other data.
Missing data	NA	Occurs when there is no data stored in a variable for a particular observation/respondent.
Normalize	NA	In this book, the term “normalize” is used to refer to returning a value to its normal, or expected, state.
Original data	primary data	First-hand data that are generated/collected by the research team as part of the research study.
Participant database	study roster, master list, master key, linking key, tracking database	This database, or spreadsheet, includes any identifiable information on your participants as well as their assigned study ID. It is your only means of linking your confidential research study data to a participant’s true identity. It is also used to track data collected across time and measures as well as participant attrition.
Path	file path	A string of characters used to locate files in your directory system.
Persistent identifier	PID	A unique and enduring digital reference to an object, contributor, or organization. A DOI (digital object identifier) is a type of PID specific to digital objects.
Personally identifiable information	PII, personal data	This includes direct identifiers (e.g., name and email), as well as indirect identifiers that, if combined with other variables or if in small enough numbers, could identify a participant (e.g., full birthdate and place of birth).
Primary key	NA	One or more variables that uniquely define rows in your data.
Privacy	NA	Privacy concerns people, ensuring they are given control over access to themselves and their information.
Private data	NA	Data that is highly restricted and typically not publicly shared, or is shared only with limited access (e.g., passwords, illegal behaviors, medical records, financial information).
Protected health information	PHI	The HIPAA Privacy Rule provides protections for 18 identifiers held by covered entities providing health care services.
Qualitative data	NA	Non-numeric data typically made up of text, images, video, or other artifacts.
Quantitative data	NA	Numerical data that can be analyzed with statistical methods.
Randomized controlled trial	RCT	A study design that randomly assigns participants, or groups of participants, to a control or treatment condition.
Raw data	primary, untouched	Unprocessed data collected directly from a source.
Replicable	NA	Being able to produce the same results if the same procedures are used with different materials.
Reproducible	NA	Being able to produce the same results using the same materials and procedures.
Research	NA	The Common Rule (45 CFR 46) definition of research is a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge.
Restricted-use data	non-public data, controlled data, managed access data	A dataset that cannot be publicly released due to containing sensitive information or a combination of variables that could enable identification. These data require controlled access conditions and may be shared through data use agreements or other application processes.
Safe harbor method	NA	Under the HIPAA Privacy Rule, there are two methods of de-identification. The safe harbor method allows covered entities to treat data as de-identified if all 18 PHI variables are removed.
Sensitive data	protected data	An umbrella term that encompasses proprietary, ethical, contractual, or private information that should be protected from unwarranted disclosure. There are varying levels of data sensitivity.
Standardize	NA	Developing and implementing a set of consistent procedures.
Study	NA	A single funded research project resulting in one or more datasets to be used to answer a research question.
Subject	case, participant, site, record	A person or place that participates in research and has one or more pieces of data collected on them.
Syntax	code, program, script	Programming statements written in a text editor. The statements are machine-readable instructions processed by your computer.
Tool	platform	A means used to collect data using an instrument (e.g., a paper form, an online survey platform).
Treatment	NA	The individual or group receives the intervention.
Unique participant identifier	study ID, site ID, unique identifier (UID), subject ID, participant code, record ID	This is a unique numeric or alphanumeric identifier, assigned to every participant or site, and used to create confidential and de-identified data. These identifiers allow researchers to link data across time or measures.
Variable	column, field, question, data element	Any phenomenon you are collecting information on/trying to measure. These variables will make up columns in your datasets or databases.
Variable name	header	A shortened symbolic name given to the variable in your data to represent the information it contains.
Wave	time period, time point, event, session	Intervals of data collection over time.
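
The difference between the "Append" and "Merge" entries, and the role a primary key plays in a merge, may be easier to see with a small example. The following is a minimal sketch in plain Python, not a recommended workflow: the datasets, the variable names (`stu_id`, `score`, `grade`), and the helper functions `append_datasets` and `merge_datasets` are all hypothetical and exist only for illustration.

```python
# Hypothetical toy datasets; every name and value here is made up.
wave1 = [
    {"stu_id": 101, "score": 12},
    {"stu_id": 102, "score": 15},
]
wave2 = [
    {"stu_id": 101, "score": 14},
    {"stu_id": 103, "score": 11},
]
demographics = [
    {"stu_id": 101, "grade": 3},
    {"stu_id": 102, "grade": 4},
]


def append_datasets(first, second):
    """Append: stack the rows of two datasets that share the same variables."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("variables do not match")
    return first + second


def merge_datasets(left, right, key):
    """Merge: add the columns of `right` to `left`, matching rows on `key`.

    `key` must act as a primary key in `right` (no duplicate values).
    Rows in `left` without a match keep only their own columns.
    """
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"duplicate key value: {row[key]}")
        lookup[row[key]] = row
    return [{**row, **lookup.get(row[key], {})} for row in left]


# Appending adds rows: 2 + 2 rows, still 2 variables.
stacked = append_datasets(wave1, wave2)

# Merging adds columns: still 2 rows, now 3 variables.
linked = merge_datasets(wave1, demographics, key="stu_id")
```

In this sketch, `stu_id` acts as the primary key of `demographics` and as a foreign key in `wave1`. The duplicate check in `merge_datasets` is one simple way to confirm that the key uniquely defines rows before linking.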
