10 Data Tracking
Figure 10.1: Tracking in the research project life cycle
During your project you will want to be able to answer both progress and summary questions about your recruitment and data collection activities.
- How many participants consented to be in our study? How many have we lost during our study and why?
- How much progress have we made in this cycle of data collection? How much data do we have left to collect?
- How many forms did we collect each cycle and why are we missing data for some forms?
Questions like these will arise many times throughout your study for both your own project coordination purposes, as well as for external progress reporting and publication purposes. Yet, how will you answer these questions? Will you dig through papers, search through emails, and download in-progress data, each time you need to answer a question about the status of your project activities? A better solution is to track all project activities in a participant tracking database.
A participant tracking database is an essential component of both project management and data management. This database contains all study participants, their relevant study information, as well as tracking information about their completion of project milestones. This database has two underlying purposes.
- To serve as a roster of study participants and a “master key” (Pacific University Oregon 2014) that houses both identifying participant information as well as assigned unique study identifiers.
- To aid in project coordination and reporting, tracking the movement of participants as well as completion of milestones, throughout a study.
This database is considered your single source of truth (SSOT) concerning everything that happened throughout the duration of your project. Any time a participant consents to participate, drops from the study, changes their name, completes a data collection measure, is provided a payment, or moves locations, a project coordinator, or other designated team member, updates the information in this one location. Tracking administrative information in this one database, rather than across disparate spreadsheets, emails, and papers, ensures that you always have one definitive source to refer to when seeking answers about your sample and your project activities.
Note
I want to reiterate this single source of truth concept. Information is often coming in from multiple sources (e.g., data collectors in the field, emails to project coordinators from teachers, conversations with administrators). It is important to train your team that all relevant contact information that is gleaned (e.g., name change, new email, moved out of district) must be updated in the participant tracking database alone. If people track this information in other sources, such as their own personal spreadsheets, there is no longer a single source of truth, there are multiple sources of truth. This makes it very difficult to keep track of what is going on in a project. Whether a single person is designated to update information in this database, or multiple, make sure team members know either how to update information or who to contact to update information.
10.1 Benefits
A thorough and complete participant database that is updated regularly is beneficial for the following reasons:
- Protecting participant confidentiality
- Assigning unique study identifiers (i.e., codes) that are only linked to a participant’s true identity within this one database is necessary for maintaining participant confidentiality. This database is stored in a restricted secure location (see Chapter 13), separate from where the identifiable and coded study datasets are stored, and is typically destroyed at a period of time after a project’s completion.
- Project coordination and record keeping
- This database can be used as a customer relation management (CRM) tool, storing all participant contact information, as well as tracking correspondence. It can also be used as a project coordination tool, storing scheduling information that is useful for planning activities such as data collection.
- Integrating this database into your daily workflow allows your team to easily report the status of data collection activities (e.g., as of today we have completed 124 out of 150 assessments). Furthermore, checking and tracking incoming data daily, compared to after data collection is complete, reduces the likelihood of missing data.
- Last, thorough tracking allows you to explain missing data in reports and publications (e.g., teacher 1234 went on maternity leave).
- Sample rostering
- At any time you can pull a study roster from this database that accurately reflects a participant’s current status. The tracking information contained in this tool also aids in the creation of documentation including the flow of participants in your CONSORT diagram.
- Data cleaning
- As part of your data cleaning process, all raw dataset sample sizes should compared against what is reported as complete in your participant database to ensure that no participants are missing from your final datasets
- Furthermore, this database can be used for de-identifying data. If data is collected with identifiers such as name, a roster from the tracking database can be used to merge in unique study identifiers so that name can be removed. A similar process can be used to merge in other assigned variables contained in the database such as treatment or cohort.
10.2 Building your database
While the tracking phase appears after collection in Figure 10.1, it is most beneficial to build this database before you begin recruiting participants, typically during the same time that you are building your data collection tools, in the “Create Instruments” phase. This way, as your team recruits participants, you can record information such as name, consent status, and any other necessary identifying contact information in the participant database and begin assigning participants study IDs. Depending on your database system, you may even be able to scan and upload copies of your consent forms into the database.
While a project coordinator can build this database, it can be helpful to consult with a data manager, or someone with relational database expertise, when creating this system. This ensures that your system is set up efficiently and comprehensively.
This database may be a standalone structure, used only for tracking and anonymization purposes, or it may be integrated as part of your larger study system, where all study data is collected and/or entered as well.
10.2.1 Relational databases
Before we discuss how to build this database, it is helpful to have a basic understanding of the benefits of relational databases. A relational database organizes information into tables, made up of records (rows) and fields (columns), and tables can be related through keys (see Section 3.3.1) (Bourgeois 2014; Chen 2022). Using a relational database to track participant information, compared to disparate, non-connected spreadsheets, has many benefits including reducing data entry errors and improving efficiency.
There are three general steps for building a relational database.
- Create tables made up of fields (i.e., variables)
- Choose one or more primary key fields to uniquely identify rows in those tables. These keys should not change at any point. Typically these keys are your assigned unique study IDs.
- Create relationships between tables through both primary and foreign keys
We can also further refine our database through normalization, structuring our database according to normal form rules (Bourgeois 2014; Nguyen 2017; The Nobles 2020) to reduce redundancy and improve data integrity. Going in to more detail about normalization is outside of the scope of this book and building a database that follows all the normal form rules requires specific expertise, which most teams may not have. So with that said, it is completely acceptable to build a database that is not perfectly optimized but that works well for your team! The most important thing to consider when building a relational database is to not duplicate information across tables. Any one field should only need to be updated in one location, never more than one.
Resources
Source | Resource |
---|---|
Omar Elgabry | A series of posts on database fundamentals 67 |
Let’s compare a very simple example of building a tracking database using a relational model and a non-relational model.
10.2.1.1 Relational model
In Figure 10.2 we have three entities we need to track in our database—schools, teachers, and students. We built a very simple database with one table for each entity. Within each table we added fields that we need to collect on these subjects. We have also set up our tables to include primary keys (which uniquely identify rows in each table) and foreign keys (which includes values that correspond to the primary key of another table). Our keys are all unique study identifiers that we have assigned to our study participants.
Figure 10.2: Participant database built using a relational model
We can see here that across each table we have no duplicated information. The student table only contains student-level information, the teacher table only contains teacher-level information, and the school table only contains school-level information. This is a huge time saver. Imagine if a teacher’s last name changes. Rather than updating that name in multiple places, we now only update it once, in the teacher table. If we want to see a table with both student and teacher information, we can simply query our database to create a new table. In some programs, this type of querying may be a simple point and click option, in other programs it may require someone to write some simple queries that can then be used at any time by any user.
Say for example, we needed to pull a roster of students for each teacher. We could easily create and run a query, such as this SQL query that joins the student and teacher tables from Figure 10.2 by tch_id
and then pulls the relevant teacher and student information from both tables, seen in Table 10.1.
SELECT t_l_name, t_f_name, s_l_name, s_f_name, grade
FROM Student INNER JOIN Teacher ON Student.tch_id = Teacher.tch_id
ORDER BY t_l_name, t_f_name, s_l_name, s_f_name
t_l_name | t_f_name | s_l_name | s_f_name | grade |
---|---|---|---|---|
Clark | Jana | Arnold | Darnell | 2 |
Clark | Jana | Watts | Irene | 2 |
Ramirez | Bill | Dixon | Ernesto | 4 |
Ramirez | Bill | Gibson | Emma | 4 |
Ramirez | Bill | Webster | Grant | 4 |
Depending on the design of your study and the structure of the database model, writing these queries can become more complicated. Again, this is where you want to strike a balance between creating a structure that reduces inefficiencies in data entry but also isn’t too complicated to query based on the expertise of your team.
10.2.1.2 Non-relational model
Now imagine that we built a non-relational database, such as three tabs in an Excel spreadsheet, to track our participant information (see Figure 10.3). Since we are unable to set up a system that links these tables together, we need to enter redundant information into each table (such as teacher or school name) in order to see that information within each table without having to flip back and forth across tables to find the information we need. For example, we now have to enter repeating teacher and school names in the student table, and if any teacher names change, we will need to update it in both the teacher table and in the student table for every student associated with that teacher. This requires more entry time and creates the opportunity for more data entry errors (Borer et al. 2009).
Figure 10.3: Participant database built in using a non-relational model with duplicated variables denoted by rectangles
Note
If your study includes a variety of related entities, tracked over waves of time, a relational database will be very helpful to build. If however, you are only tracking one entity (e.g., just students) for one wave of data collection, then a database might be overkill and a simple spreadsheet will work just fine.
10.2.2 Designing the database
Before you can begin to design your database, you will need to think through the following pieces of information.
- Do you want to use a relational table design?
- How many tables do you want to construct?
- Consider entities (e.g., student, teacher, school)
- Consider purpose (e.g., enrollment info, wave 1 data collection tracking, wave 2 data collection tracking)
- What fields do you want to include in each table?
- If using a relational table design, what fields will you use to relate tables?
Once you make decisions regarding these questions, you can begin to design your database schema. It can be helpful to visualize your database model during this process. In Figure 10.4 I am designing a database model for a scenario where I will only be collecting information from teachers and schools, over two waves of data collection.
Figure 10.4: Example participant database model using two separate tables for tracking across waves
I have designed this database model in this way:
- I have four tables total
- Two tables (teacher info and school info tables) have information that should be fairly constant based on my project assumptions (name, email, consent, one time documents received)
- If at any time this information changes (e.g., withdraw status, new last name, new contact person), I would update that information in the appropriate table and make a note of when and why the change occurred in my
notes
field
- If at any time this information changes (e.g., withdraw status, new last name, new contact person), I would update that information in the appropriate table and make a note of when and why the change occurred in my
- Two tables are for my longitudinal information
- This is where I will track my data collection activities each wave, as well as any information that may change each wave, again based on the assumptions of my project. For example, I may put grade level in my longitudinal tables if I collect data across years because I assume it’s possible that teachers may switch grade levels. As another example, if this scenario included student-level data collection, you could imagine that we may put teacher ID into our student longitudinal tables, knowing that students move into different classrooms across years.
- Two tables (teacher info and school info tables) have information that should be fairly constant based on my project assumptions (name, email, consent, one time documents received)
- I have connected my tables through primary and foreign keys (
tch_id
andsch_id
) - With information separated into four tables, I can also now limit access as needed (e.g., only allow data entry staff access to the de-identified tables, or restricting entry to only the current wave of data preventing accidental overwriting of existing data)
The model in Figure 10.4 is absolutely not the only way you can design your tables. There may be more efficient or more appropriate ways to design this database, but again as long as you are not duplicating information, build what works for you. As an example of a potentially more efficient way to structure this database, I could combine all waves of data collection into one table and create a concatenated primary key that uses both tch_id
and wave
to uniquely identify rows since tch_id
would now be duplicated for each wave of data collection (see Figure 10.5).
Figure 10.5: Example participant database model using one table to track data across waves
While these examples are for a fairly simple scenario, you can hopefully see how you might extrapolate this model to more entities and more waves of data collection, as well as how you might modify it to better meet the needs of your specific project.
Note
If your study involves anonymous data collection, you will no longer be able to track data associated with any specific individual. However, it is still helpful to create some form of a tracking system. Creating a simplified database, with tables based on your sites for instance (school table, district table) allows you to still track your project management and data collection efforts (e.g., number of student surveys received per school per wave, payment sent to school).
10.2.3 Choosing fields
As you design your database model, you will also need to choose what fields to include in each table. The fields you choose to include will be dependent on your particular study design. While your participant tracking database may be the same database you enter all of your study data, for the purposes of this chapter we are only considering fields that are relevant for project coordination and participant de-identification. We are not concerned with fields that are collected as part of your data collection measures (i.e., survey items). You can consider your participant tracking database as an internal database that is only used for coordination, summary, and linking purposes. This is not a database where you would export data for external data sharing.
Below are ideas of field you may consider adding to your database. Depending on the design and assumptions of your study, some of these may be collected once, others may be collected more than once, longitudinally.
Ideas of fields to collect:
- Study IDs (primary and foreign keys for a relational database)
- Names (participants and sites)
- Contact information
- Information relevant to project coordination (grade level, class periods, block schedules)
- Other necessary linking identifiers (double IDs, district/school IDs)
- Information helpful for data collection scheduling (blocks, class times)
- Consent/assent status
- Inclusion/exclusion criteria status
- Enrollment status
- Randomization (treatment/control)
- Grouping information (cohort)
- Summary information not already accounted (# of consents sent out, # of students in class, # of teachers in school)
- Administrative data status (W-9 received, MOUs received)
- Movement/withdraw status
- Data collection status (unique fields for each instrument)
- Incentive status (gift cards sent out)
- Notes
- Reasons for changes (for example changes in name, email)
- Reasons for movement/withdraw
- Communication with participants
- Reasons for missing data
- Errors in data
10.2.3.1 Structuring fields
As you choose your fields you also need to make some decisions about how you will structure those fields.
- Set data types for your fields (e.g., character, integer, date)
- Restrict entry values to only allowable data types to reduce errors
- Set allowable values and ranges
- For example, a categorical status field may only allow “complete”, “partially complete” or “incomplete”
- Do not lump separate pieces of information together in a field
- For example separate out
first name
andlast name
into two fields
- For example separate out
- Name your fields according to the variable naming rules we discussed in Chapter 9
10.2.4 Choosing a tool
There are many criteria to consider when choosing a tool to build your database in.
- Choose a tool that is customizable to your needs
- Can you build a relational table structure?
- Can you export files? Can you connect to the database via application programming interfaces (APIs)?
- Can you query data?
- Choose a tool that is user-friendly
- You don’t want a tool with a steep learning curve for users
- If you are running a project across multiple sites, consider the accessibility of the tool
- For example, you may want a tool that is cloud-based so that all site coordinators can access it
- You may also want to make sure multiple users can access it at the same time
- Choose a tool that is interoperable
- For instance, some tools may have difficulties running on certain operating systems
- Consider cost and licensing
- There are many free tools, but they may not provide all of the functionality you want
- What products do you already have access to (i.e., your institution has a license for)?
- Consider security
- Which tools are approved by your institution to protect the sensitivity level of this data (See Chapter 4)?
- Can you limit access to the entire database? To specific tables?
- If multiple people are entering data, you may want to restrict access/editing capabilities for some tables
- Protect data loss
- Can you backup the system?
- Can you protect against overwriting data?
- Can you keep versions of the database in case a mistake is ever made and you need to go back to an older version?
- Data quality protection
- Can you set up data quality constraints (e.g., restrict input values/types)?
There are many tool options you can choose from. A sampling of those options are below. These tools represent a wide range from the criteria in this section. Take some time to review your options to see which one best meets your needs.
- Microsoft Access
- Microsoft Excel
- Quickbase
- Airtable
- REDCap
- Claris FileMaker
- Google Sheets and Google Forms
- Forms that feed into a relational database, maintained using a SQL (structured query language) database engine such as SQLite, MySQL, or PostgreSQL
10.3 Entering data
Your last consideration when building your database will be, how do you want your team to enter data into your database? There are many ways to enter data including manually entering data, importing data, integrating your data collection platform and your tracking database, or even scanning forms using QR codes. While some of those options may work great for your project, here we are going to talk about the two simplest and most common options: manually entering data into a spreadsheet view, and manually entering data into a form.
10.3.1 Entering data in a spreadsheet view
Your first option is to manually enter data in a spreadsheet format for each participant in a row (see Figure 10.6). This would be the most common (or only) option when using tools such as Microsoft Excel or Google Sheets. However, you can also use this option when entering into other database tools such as Microsoft Access. There are both pros and cons to this method.
- Pros: This is the quickest and easiest method. It also allows you to view all the data holistically.
- Cons: This method can lead to errors if someone enters data on the wrong row/record.
Figure 10.6: Example spreadsheet view data entry
10.3.2 Entering data in a form
Your second option is to create a form that is linked to your tables (see Figure 10.7). As you enter data in your forms, it automatically populates your tables with the information. This option is possible in many systems including Microsoft Access, FileMaker, REDCap, and even Google Forms which populates into Google Sheets.
- Pros: This method reduces data entry errors as you are only working on one participant form at a time
- Cons: Takes some time, and possibly expertise, to set up the data entry forms
Figure 10.7: Example form view data entry
Note
If your participant tracking database is separate from your data collection tools, all information will need to be entered by your team using one of the ways mentioned in Section 10.3. However, if your participant tracking tool is also your data collection/data capture tool (such as those who collect data using REDCap), fields such as data collection status (e.g., survey completed) may not need to be manually entered. Rather they may be automated to populate as “complete” once a participant submits their responses in the data collection tool.
10.4 Creating unique identifiers
One of the most important parts of keeping this participant tracking database is assigning unique participant identifiers. As soon as participants are entered into your database, a unique study ID should be assigned. If confidentiality was promised to schools or districts, you will also want to assign identifiers to sites as well. Assigning these identifiers is an important part of protecting the privacy of human participants. When publicly sharing your study data, all personally identifying information will be removed and these identifiers (i.e., codes), are what will allow you to uniquely identify and link participants in your data.
Participant unique identifiers are numeric or alphanumeric values and typically range from 2-10 digits. While there are several ways participant identifiers can be assigned (e.g., created by participants themselves, assigned by your data collection software), most commonly, the research team assigns these identifiers to participants.
Before assigning identifiers, it can be very helpful to develop an ID schema during your planning phase, and document that schema in an SOP (see Section 8.2.7). In developing that schema, there are several best practices to consider.
- Participants must keep this same identifier for the entire project.
- This even applies in circumstances where a participant has the opportunity to be re-recruited into your study (as seen in Figure 10.8). The participant still keeps the same ID throughout the study. In these cases, you will use a combination of variables to identify the unique instances of that participant (e.g.,
stu_id
andcohort
). - Having a static participant ID allows you to track the flow of each participant through your study and provides the added benefit of helping to measure dosage.
- This even applies in circumstances where a participant has the opportunity to be re-recruited into your study (as seen in Figure 10.8). The participant still keeps the same ID throughout the study. In these cases, you will use a combination of variables to identify the unique instances of that participant (e.g.,
Figure 10.8: Example of keeping participant IDs for the entire study
- Participant identifiers must be unique within and across entities
- For example, no duplicating IDs within students or across teachers and schools
- Not duplicating within entities is imperative to maintain uniqueness of records, while not duplicating across reduces confusion about who a form belongs to and reduces potential errors
- The identifier should be randomly assigned and be completely distinct from any personal information to protect confidentiality.
- Do not sort by identifying information (e.g., names, date of birth) and then assign IDs in sequential order
- Do not group by identifying information (e.g., grade level, teacher) and then assign IDs in sequential order
- Do not include identifying information (e.g., initials) as part of an identifier
- Do not embed project information into the ID that has the potential to change
- Some researchers prefer to embed a project-level ID or acronym into a participant ID to help with tracking of information, especially when running multiple studies using identical forms across studies. This is absolutely okay because it is assumed this information never changes.
- However, embedding information such as wave or session into an identifier variable guarantees that your identifiers will not remain constant. This information should be added to your dataset in other ways (i.e., either as as its own variable or concatenated to variable names)
- Embedding information such as teacher IDs, school IDs, treatment, or cohort also has the potential to cause problems. In longitudinal studies, depending on the study design, it is possible that students move to other study teachers, teachers move to other study schools, or participants get re-recruited into other cohorts. Any of these issues would cause problems if this information was embedded into an ID because the ID would no longer reflect accurate information and would require IDs to be changed, breaking best practice #1. Again, these additional identifiers can be tracked as separate variables (e.g.,
stu_id
,tch_id
,sch_id
,cohort
,treatment
,wave
) and added to forms and datasets as needed
- Last, while less important during the data tracking phase, in your study datasets these identifiers should be stored as character variables. Even if an ID variable is all numbers, it should be stored as character type. This helps prevent people from inappropriately working with these values (i.e., taking a mean of an ID variable).
stu_id | tch_id | sch_id |
---|---|---|
12000 – 1300 | 2000 – 3000 | 300 – 500 |
Note
The only time you will not assign unique identifiers is when you collect anonymous data. In this situation you will not be able to assign identifiers since you will not know who participants are. However, it is still possible to assign identifiers to known entities such as school sites if anonymity is required.