3 Data Organization

Before jumping into the project life cycle, we need to have a basic understanding of what data looks like. Understanding the basics of data organization helps us to make informed decisions throughout the life cycle that will result in clear, analyzable information.

3.1 Basics of a dataset

In education research, data is often collected internally by a team using an instrument such as a questionnaire, an observation form, an interview guide, or an assessment. However, data may also be collected by external entities, such as districts, states, or other agencies.

Those data come in many forms (e.g., video, transcripts, documents, data files), represented as text, numbers, or multimedia (USGS 2023). In the world of quantitative education research, we are often working with digital data in the form of a dataset, a structured collection of data. A dataset is organized in a rectangular format which allows the information to be machine-readable. Rectangular, also called tabular, datasets are made up of columns and rows (see Figure 3.1).

Basic format of a dataset.

Figure 3.1: Basic format of a dataset.

3.1.1 Columns

The columns in your dataset will consist of the following types of variables:

  • Variables you collect
    • These are variables collected from an instrument or external source.
  • Variables you create
    • These may be indicators you create (e.g., cohort, treatment, time).
    • Or they me be variables derived for summary purposes (e.g., means, sum scores).
  • Identifier variables
    • You must also include values that uniquely identify subjects in your data (e.g., a student unique identifier).
    • See Section 10.4 for more information on creating unique identifier variables.

3.1.1.1 Column attributes

The columns, or variables, in your dataset also have the following attributes:

  1. Variable names
    • A variable name is the short representation of the information contained in a column.
    • Variable names must be unique. No variable name in a dataset can repeat. We will talk more about variable naming when we discuss style guides in Chapter 9.
  2. Variable types
    • A variable’s type determines allowable values for a variable, the operations that can be performed on the variable, and how the values are stored.
    • Example types include numeric, character (also called text or string), or date. Types can also be more narrowly defined as needed (e.g., continuous, categorical).
  3. Variable values
    • Variable values refer to the information contained in each column. Every variable has pre-determined allowable values.
    • Examples of setting allowable values for different types of variables include:
      • Categorical character variable: “yes” | “no”
      • Numeric variable: 1–25
      • Date variable: 2023-08-01 to 2023-12-15
      • Free text character variable: any value is allowed
    • Anything outside of your expected values or ranges is considered an error.
  4. Variable labels
    • A variable label is the human readable description of what a variable represents.
    • This may be a label that you, as the variable creator, assigns (e.g., “Treatment condition”) or it may be the actual wording of an item (e.g., “Do you enjoy pizza?”).

3.1.2 Rows

The rows in your dataset are aligned with subjects (also called records or cases) in your data. Subjects in your dataset may be students, teachers, schools, locations, and so forth. The unique subject identifier variable mentioned in Section 3.1.1, will denote which row belongs to which subject.

3.1.3 Cells

The cells are the observations associated with each case in your data. Cells are made up of key/value pairs, created at the intersection of a column and a row (see Figure 3.2). Consider an example where we collect a survey (also called a questionnaire) from students. In this dataset, each row is made up of a unique student in our study, each column is an item from the survey, and each cell contains a value that corresponds to that row/column pair (i.e., that participant and that question).

Representation of a cell value.

Figure 3.2: Representation of a cell value.

3.2 Dataset organization rules

In order for your dataset to be machine-readable and analyzable, it should adhere to a set of organizational rules (Broman and Woo 2018; Wickham 2014).

  1. The first rule is that data should make a rectangle (Figure 3.3). The first row of your data should be your variable names (only use one row for this). The remaining data should be made up of values in cells.
A comparison of non-rectangular and rectangular data.

Figure 3.3: A comparison of non-rectangular and rectangular data.

  1. Column values should be consistent (Figure 3.4). Both humans and machines have difficulty categorizing information that is not measured, coded, or formatted consistently.
    • For text categorical values, use controlled vocabularies and keep consistent spelling, case, and spacing.
    • For date values, keep the format consistent.
    • For numeric values, measure in consistent units and keep consistent decimal places.
A comparison of inconsistent and uniform variable values.

Figure 3.4: A comparison of inconsistent and uniform variable values.

  1. Columns should adhere to your expected variable type (Figure 3.5).
    • For example, if you have a numeric variable, such as age, but you add a cell value that is text, your variable no longer adheres to your variable type. Machines will now read this variable type as character.
A comparison of variables adhering and not adhering to a data type.

Figure 3.5: A comparison of variables adhering and not adhering to a data type.

  1. A variable should only collect one piece of information (Figure 3.6). This allows you to more easily work with your variables.
    • For example, rather than combining the number of incidents and the number of enrolled students in the same variable, separate this information into two variables. This allows you to aggregate information as needed (e.g., calculate an incident rate).
A comparison of two things being measured in one variable and two things being measured across two variables.

Figure 3.6: A comparison of two things being measured in one variable and two things being measured across two variables.

  1. All cell values should be explicit (Figure 3.7). This means all cells that are not missing values should be filled with a value.
    • Consider why a cell value is empty
      • If a value is actually missing, you can either leave those cells blank or fill them with your pre-determined missing values (e.g., -99). See Section 9.5.1 for ideas on coding missing values.
      • If a cell is left empty because it is implied to be the same value as above, the cells should be filled with the actual data.
      • If an empty cell is implied to be 0, fill the cells with an actual 0.
A comparison of variables with empty cells and variables with not empty cells.

Figure 3.7: A comparison of variables with empty cells and variables with not empty cells.

  1. All variables should be explicit (Figure 3.8). No variables should be implied using color coding.
    • If you want to indicate information, add an indicator variable to do this rather than cell coloring.
A comparison of information being indicated through cell color and information being provided in an indicator variable.

Figure 3.8: A comparison of information being indicated through cell color and information being provided in an indicator variable.