Anonymous data
NA
Identifying information was never collected. This data cannot be linked across time or measures.
Aggregated data
NA
Individual data that is summarized at a group level.
Append
vertical join, join rows, union
Stacking datasets on top of each other (matching variables).
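As a rough illustration of appending, the sketch below stacks two small datasets that share the same variables. It uses plain Python lists of dictionaries; the variable names (`stu_id`, `wave`, `score`), the values, and the helper `append_datasets` are all made up for this example and are not part of any particular tool.

```python
# Two hypothetical waves of a student dataset with the same variables.
wave1 = [
    {"stu_id": 101, "wave": 1, "score": 12},
    {"stu_id": 102, "wave": 1, "score": 15},
]
wave2 = [
    {"stu_id": 101, "wave": 2, "score": 14},
    {"stu_id": 102, "wave": 2, "score": 18},
]


def append_datasets(top, bottom):
    """Stack `bottom` under `top` after checking that the variables match."""
    expected_vars = set(top[0])
    for row in top + bottom:
        if set(row) != expected_vars:
            raise ValueError("Variables do not match across datasets")
    return top + bottom


# The number of rows grows; the set of variables stays the same.
stacked = append_datasets(wave1, wave2)
```

The check on variable names is one possible safeguard, assumed here because appending only makes sense when the variables match.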
Archive
NA
The transfer of data to a facility, such as a repository, that preserves and stores data long-term.
Attrition |
NA |
The loss of study units from the sample, often seen in longitudinal studies. |
Clean data |
processed data |
Raw data that has been manipulated or modified for the purposes of correcting and clarifying information. |
Cohort |
NA |
A group of participants recruited into the study at the same time. |
Coded data |
pseudonymized data, indirectly identifiable, confidential data |
Personally identifiable information (PII) has been removed and names are replaced with a code. The only way to link the data back to an individual is through that code. The identifying code file (linking key) is stored separately from the research data.
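A minimal sketch of how coded data and a linking key relate is shown below. The names, the `study_id` variable, the starting value 1001, and the helper `code_data` are hypothetical, and sequential IDs are used only for simplicity; a real project would choose its own ID scheme and would store the linking key in a separate, secured location.

```python
# Hypothetical identifiable data (names and scores are made up).
identifiable = [
    {"name": "Ana Lopez", "score": 12},
    {"name": "Ben Smith", "score": 15},
]


def code_data(rows, id_var="study_id", start=1001):
    """Replace each name with a code; return the coded data and the linking key."""
    coded, linking_key = [], []
    for offset, row in enumerate(rows):
        study_id = start + offset
        # The linking key is the only place where code and name appear together.
        linking_key.append({id_var: study_id, "name": row["name"]})
        # The coded data keeps every variable except the name.
        rest = {k: v for k, v in row.items() if k != "name"}
        coded.append({id_var: study_id, **rest})
    return coded, linking_key


coded, linking_key = code_data(identifiable)
```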
Confidential data |
NA |
This data is protected from unauthorized disclosure. This data either contains personally identifiable information or can still be linked back to an individual through other means (e.g., identifiable data or coded data). |
Confidentiality |
NA |
Confidentiality concerns data, ensuring participants agree to how their private and identifiable information will be managed and disseminated. |
Control |
business as usual (BAU) |
The individual or group does not receive the intervention. |
Cross-sectional |
NA |
Data is collected on participants at a single time point.
Data |
research data |
The recorded factual material commonly accepted in the scientific community as necessary to validate research findings. (OMB Circular A-110) |
Data repository |
data archive |
A storage location for researchers to deposit data and supporting materials associated with their research. |
Data structure |
NA |
A way of organizing data to allow for more efficient processing and storage. In particular, repeated measures data can be structured in either long or wide format. |
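To make the long versus wide distinction concrete, here is a small sketch that reshapes repeated measures data from wide format (one row per subject, one column per time point) to long format (one row per subject per time point). The column naming pattern `score_w1`, `score_w2` and the helper `wide_to_long` are assumptions for illustration only.

```python
# Hypothetical wide data: one row per student, one score column per wave.
wide = [
    {"stu_id": 101, "score_w1": 12, "score_w2": 14},
    {"stu_id": 102, "score_w1": 15, "score_w2": 18},
]


def wide_to_long(rows, id_var, prefix, value_name):
    """Return one row per subject per wave for columns named `<prefix><wave>`."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.startswith(prefix):
                wave = int(name[len(prefix):])
                long_rows.append({id_var: row[id_var], "wave": wave, value_name: value})
    return long_rows


long_data = wide_to_long(wide, id_var="stu_id", prefix="score_w", value_name="score")
```

Here the two wide rows become four long rows, each identified by the combination of `stu_id` and `wave`.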
Data type |
measurement unit, variable format, variable class, variable type |
A classification that specifies what types of values are contained in a variable and what kinds of operations can be performed on that variable. Examples of types include numeric, character, logical, or datetime. |
Database |
relational database |
An organized collection of related data stored in tables that can be linked together by a common identifier. |
Database design |
database schema, data modeling |
A collection of decisions regarding how tables, or datasets, will be organized and related to one another. |
Dataset |
data set, data frame, spreadsheet, rectangular data, tabular data, table |
A structured collection of data usually stored in tabular form. A research study may produce one final dataset per entity/unit (e.g., teacher dataset, student dataset). |
De-identified data |
anonymized data |
Identifying information has been removed or distorted and the data can no longer be re-associated with the underlying individual (the linking key no longer exists). |
Derived data |
calculated values |
Data created through transformations of existing data (e.g., mean scores). |
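As one possible example of derived data, the sketch below adds a mean score computed from three existing item variables. The item names, values, and the new variable name `scale_mean` are hypothetical.

```python
from statistics import mean

# Hypothetical item-level data for a three-item measure.
rows = [
    {"stu_id": 101, "item1": 3, "item2": 4, "item3": 5},
    {"stu_id": 102, "item1": 2, "item2": 2, "item3": 5},
]
items = ["item1", "item2", "item3"]

# The derived variable is calculated from existing variables, not collected directly.
derived = [{**row, "scale_mean": mean(row[i] for i in items)} for row in rows]
```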
Direct identifiers |
NA |
These variables are unique to an individual and can be used to directly identify a participant (e.g., name, email address). |
Directory |
file structure, file tree, folder structure |
A cataloging structure for files and folders on your computer. |
Disclosure risk |
NA |
The risk of re-identifying a participant and the harm that may come from that disclosure. |
Extant data |
secondary data, administrative data, third-party data, external data |
Existing data generated/collected by external organizations at an earlier point in time (e.g., school records). |
FERPA |
NA |
The Family Educational Rights and Privacy Act is a federal law governing the disclosure of personally identifiable information in education records (e.g., name, address, DOB). The law applies to all public elementary and secondary schools, as well as post-secondary institutions. |
File format |
file type, file extension |
A way that information is encoded for storage on a computer. There are both proprietary (e.g., SPSS, XLSX) and non-proprietary formats (e.g., CSV, TXT). |
Foreign key |
NA |
One or more variables in a table that correspond to the primary key (unique values) of another table.
Human subject |
NA |
The Common Rule (45 CFR 46) definition of a human subject is a living individual about whom an investigator conducting research obtains: 1) data through intervention or interaction with the individual, or 2) identifiable private information.
HIPAA |
NA |
The Health Insurance Portability and Accountability Act is a federal law covering the protection of sensitive health information. |
Identifiable data |
NA |
Data that includes personally identifiable information. |
Indirect identifiers |
quasi-identifiers |
These variables do not alone identify a particular individual (e.g., ethnicity, gender), but, if combined with other information, they could be used to identify a participant.
Instrument |
NA |
A mechanism designed to collect original data (e.g., observation form, questionnaire, assessment).
Longitudinal data |
repeated measures |
The same information is collected from the same subjects at multiple time points. |
Measure |
scale |
In this book, I use the term “measure” broadly to refer to a collection of items used to measure an outcome (e.g., an existing scale, an existing academic assessment). |
Merge |
horizontal join, join columns, link
Combining datasets together in a side-by-side manner (matching on one or more unique identifiers). |
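The sketch below illustrates a merge on a single unique identifier using plain Python. It implements a left join specifically (every row of the first dataset is kept), which is only one of several join types; the datasets, the `stu_id` variable, and the helper `merge_datasets` are made up for illustration.

```python
# Two hypothetical datasets that share the identifier `stu_id`.
students = [
    {"stu_id": 101, "grade": 3},
    {"stu_id": 102, "grade": 4},
]
scores = [
    {"stu_id": 101, "score": 12},
    {"stu_id": 102, "score": 15},
]


def merge_datasets(left, right, key):
    """Left join: add `right`'s variables to each `left` row with the same key."""
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"Duplicate {key} in right dataset: {row[key]}")
        lookup[row[key]] = row
    # Left rows with no match are kept but gain no new variables.
    return [{**row, **lookup.get(row[key], {})} for row in left]


# The number of variables grows; the number of rows stays the same.
merged = merge_datasets(students, scores, key="stu_id")
```

Raising an error on duplicate identifiers is a design choice assumed here, since a side-by-side merge relies on the identifier being unique.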
Metadata |
NA |
Data providing details about other data. |
Missing data |
NA |
Occurs when there is no data stored in a variable for a particular observation/respondent. |
Normalize |
NA |
In this book, the term “normalize” is used to refer to returning a value to its normal, or expected, state.
Original data |
primary data |
First-hand data that are generated/collected by the research team as part of the research study. |
Participant database |
study roster, master list, master key, linking key, tracking database |
This database, or spreadsheet, includes any identifiable information on your participants as well as their assigned study ID. It is your only means of linking your confidential research study data to a participant’s true identity. It is also used to track data collected across time and measures as well as participant attrition. |
Path |
file path |
A string of characters used to locate files in your directory system. |
Persistent identifier |
PID |
A unique and enduring digital reference to an object, contributor, or organization. A DOI (digital object identifier) is a type of PID specific to digital objects. |
Personally identifiable information |
PII, personal data |
This includes direct identifiers (e.g., name and email), as well as indirect identifiers that, if combined with other variables or if in small enough numbers, could identify a participant (e.g., full birthdate and place of birth). |
Primary key |
NA |
One or more variables that uniquely define rows in your data.
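One way to check the primary key and foreign key definitions in this glossary against actual tables is sketched below. The tables, the variable names (`tch_id`, `stu_id`), and both helper functions are hypothetical and shown only to illustrate the idea.

```python
# Hypothetical teacher and student tables linked by `tch_id`.
teachers = [{"tch_id": 1, "school": "A"}, {"tch_id": 2, "school": "B"}]
students = [
    {"stu_id": 101, "tch_id": 1},
    {"stu_id": 102, "tch_id": 1},
    {"stu_id": 103, "tch_id": 2},
]


def is_primary_key(rows, key_vars):
    """True if the combination of `key_vars` is unique across all rows."""
    values = [tuple(row[v] for v in key_vars) for row in rows]
    return len(values) == len(set(values))


def orphaned_foreign_keys(child, parent, key):
    """Rows in `child` whose `key` value has no match in `parent`."""
    parent_values = {row[key] for row in parent}
    return [row for row in child if row[key] not in parent_values]


teacher_key_ok = is_primary_key(teachers, ["tch_id"])      # unique per teacher
student_fk_as_pk = is_primary_key(students, ["tch_id"])    # repeats across students
orphans = orphaned_foreign_keys(students, teachers, "tch_id")
```

In this example `tch_id` works as a primary key in the teacher table and as a foreign key in the student table, where its values are allowed to repeat.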
Privacy |
NA |
Privacy concerns people, ensuring they are given control over access to themselves and their information.
Private data |
NA |
Data that is highly restricted and typically not publicly shared, or is shared only with limited access (e.g., passwords, illegal behaviors, medical records, financial information).
Protected health information |
PHI |
The HIPAA Privacy Rule provides protections for 18 identifiers held by covered entities providing health care services. |
Qualitative data |
NA |
Non-numeric data typically made up of text, images, video, or other artifacts. |
Quantitative data |
NA |
Numerical data that can be analyzed with statistical methods. |
Randomized controlled trial |
RCT |
A study design that randomly assigns participants, or groups of participants, to a control or treatment condition. |
Raw data |
primary, untouched |
Unprocessed data collected directly from a source. |
Replicable |
NA |
Being able to produce the same results if the same procedures are used with different materials. |
Reproducible |
NA |
Being able to produce the same results using the same materials and procedures. |
Research |
NA |
The Common Rule (45 CFR 46) definition of research is a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge. |
Restricted-use data |
non-public data, controlled data, managed access data |
A dataset that cannot be publicly released because it contains sensitive information or a combination of variables that could enable identification. These data require controlled access conditions and may be shared through data use agreements or other application processes.
Safe harbor method |
NA |
Under the HIPAA Privacy Rule, there are two methods of de-identification. The safe harbor method allows covered entities to treat data as de-identified if all 18 PHI variables are removed. |
Sensitive data |
protected data |
An umbrella term that encompasses proprietary, ethical, contractual, or private information that should be protected from unwarranted disclosure. There are varying levels of data sensitivity. |
Standardize |
NA |
Developing and implementing a set of consistent procedures.
Study |
NA |
A single funded research project resulting in one or more datasets to be used to answer a research question. |
Subject |
case, participant, site, record |
A person or place that participates in research and has one or more pieces of data collected on them.
Syntax |
code, program, script |
Programming statements written in a text editor. The statements are machine-readable instructions processed by your computer. |
Tool |
platform |
A means used to collect data using an instrument (e.g., a paper form, an online survey platform).
Treatment |
NA |
The individual or group receives the intervention. |
Unique participant identifier |
study ID, site ID, unique identifier (UID), subject ID, participant code, record ID |
This is a unique numeric or alphanumeric identifier, assigned to every participant or site, and used to create confidential and de-identified data. These identifiers allow researchers to link data across time or measures.
Variable |
column, field, question, data element |
Any phenomenon you are collecting information on/trying to measure. These variables will make up columns in your datasets or databases. |
Variable name |
header |
A shortened symbolic name given to a variable in your data to represent the information it contains.
Wave |
time period, time point, event, session |
Intervals of data collection over time. |