The results of educational research studies are only as accurate as the data used to produce them.
- Aleata Hubbard (2017)
In 2013, without knowing that the term research data management existed, I accepted a position with a prevention science research center. My job was to coordinate the collection and management of data for federally funded randomized controlled trial efficacy studies taking place in K-12 schools, along with a team of investigators, other research staff, part-time data collectors, and graduate students. While I had some experience analyzing and working with education data, e.g., ECLS-K, I had no experience running research grants, collecting original data, or managing research data, but I was excited to learn.
In my time in that position, I learned to plan, schedule, and track data collection activities, create data collection and capture tools, organize and document data inputs, and produce usable data outputs. Yet I didn’t learn to do those things through any formal training. There were no books, courses, or workshops that I learned from. I learned from colleagues and a large amount of trial and error. Since then, as I have met more investigators, data managers, and project coordinators in education research, I realize this is a common method for learning data management—mentoring and “winging it”. And while learning data management through these informal methods helps us get by, the ramifications of this unstandardized system are felt by both the project team and future data users.
Research data management is becoming more complicated. We are collecting more data, in sometimes very novel ways, and using more complex technologies, all while increasing the visibility of our work with the push for data sharing and open science practices (Briney 2015; Nelson 2022). Ad hoc data management practices may have worked for us in the past, but now others need to understand our processes as well, requiring researchers to be more thoughtful in planning their data management routines.
To implement thoughtful and standardized data management practices, researchers need training. Yet there is a clear lack of data management training in higher education. In a survey of 274 psychology researchers, Borghi and Van Gulick (2021) found that only 33% of respondents learned data management from college-level coursework, while 64% learned from collaborators, and 52% learned from self-education. In their survey of 202 education researchers (principal investigators and co-principal investigators), Ceviren and Logan (2022) found that over 60% of respondents reported having no formal training in data management, yet across eight different data management practices, respondents were responsible for data management activities anywhere from 25-50% of the time. Similarly, in a survey of 150 graduate students in a school of education, when asked if they needed more training in research data management, the average overall score on a scale from 1 to 100 was 80, while the overall confidence in managing data score was only 40 (Zhou, Xu, and Kogut 2023). Furthermore, of the training that does exist, usually provided through university library systems, most material is either discipline agnostic or STEM focused, leaving a gap in training on how to apply skills to the field of education, which has unique issues, particularly around working with human subjects data (Nichols Hess and Thielen 2017).
Without training, resources and formal support systems are the next best option for learning best practices. Within university systems, in addition to providing periodic training, research data librarians provide data management planning consultation for researchers and their teams. There is also a wealth of existing research data management resources written for broad audiences which I will reference in this book. However, while education researchers are starting to put out some excellent resources (Neild, Robinson, and Agufa 2022; T. Reynolds, Schatschneider, and Logan 2022), I still find there is a dearth of practical guides for researchers to refer to when building a data management workflow in the field of education, especially those working on large-scale longitudinal research grants where there are many moving pieces. Researchers are often collecting data in real-world environments, such as school systems, and keeping that data secure and reliable in a deliberate and orderly way can be overwhelming.
Last, unfortunately, while other fields of research, such as psychology, appear to be banding together to develop standards around how to structure and document data (Kline 2018), the field of education has yet to develop shared rules for things such as data documentation or data formats. This lack of standards leads to inconsistencies in the quality and usability of data products across the field (Borghi and Van Gulick 2022).
A lack of training in data management practices and an absence of agreed-upon standards in the field of education have consequences. Implementing subpar and inconsistent data management practices, while typically only resulting in frustration and time lost, also has the potential to be devastating, resulting in analyzing erroneous data or even unusable or lost data. In a review of 1,082 retracted publications indexed in PubMed from 2013-2016, authors found that 32% of retractions were due to data management errors (Campos-Varela and Ruano-Raviña 2019). In a 2013 study surveying 360 graduate students about their data management practices, 14% of students indicated they had to recollect data that had been previously collected because they could not find a file or the file had been corrupted, while 17% of students said they had lost a file and been unable to recollect it (Doucette and Fyfe 2013). In their study of 488 researchers who had published in a psychology journal between 2010 and 2018, Kovacs et al. (2021) asked respondents about their data management mistakes and found that the most serious mistakes reported led to a range of consequences including time loss, frustration, and even erroneous conclusions.
Poor data management can even prevent researchers from implementing other good open science practices. In waves 1 and 2 of the Open Scholarship Survey being collected by the Center for Open Science, the team has found that of the education researchers surveyed who are currently not publicly sharing their research data, approximately 15% mentioned “being nervous about mistakes” as a reason for not sharing (Beaudry et al. 2022). Similarly, when surveying 780 researchers in the field of psychology, researchers found that 38% of respondents agreed that a “fear of discovery of errors in the data” posed a barrier to data sharing (Houtkoop et al. 2018).
The well-known replication crisis is another reason to be concerned with data management. Failure to implement practices such as quality documentation or standardization is one of many reasons that studies are difficult to reproduce. In one study, across 1,500 researchers surveyed, more than 70% had tried and failed to reproduce another researcher’s study (Baker 2016).
While the field of education may not have agreed-upon guidelines for data management, there are still practices that are proven to result in more secure, reproducible, and reliable data. My hope is that this book can be a foundation to help researchers think through how to build a quality, standardized data management workflow that works for their team and their projects. As suggested in the title of this book, this content is designed to specifically help teams navigate the complicated workflows associated with large-scale research, such as randomized controlled trial studies, but ultimately these practices are applicable to any research project, no matter the scale.
If this is your first time opening this book, I recommend reading it from cover to cover. Much of the information in later chapters builds on content from earlier chapters. With that said, once you have an understanding of what is contained in each chapter, this book is absolutely meant to become a handbook to be referenced as needed when you are ready to start planning a specific phase of your project.
This book begins, like many other books in this subject area, by describing the research life cycle and how data management fits within the larger picture. The remaining chapters are then organized by each phase of the life cycle, with examples of best practices provided for each phase. Each chapter also discusses whether you should implement those practices and how to integrate them into your workflow.
Links to templates, checklists, and example documents are provided throughout this book. If you prefer clickable links, you can view the online, open access version of this book at https://datamgmtinedresearch.com/.
It is important to also point out what this book will not cover. This book is intended to be tool agnostic and provide suggestions that anyone can use, no matter what tools you work with, especially when it comes to data cleaning. Therefore, while I might mention options of tools you can use for different tasks, I will not advocate for any specific tools.
There are also no specific coding practices or syntax included in this book. In many ways I feel that the actual “data cleaning” phase of data management is the easiest phase to implement, as long as you implement good practices up until that point. Because of that, this book introduces practices in all phases leading up to data cleaning that will prepare your data for minimal cleaning. With that said, I do provide examples of what I would expect to see in a data cleaning process; I just do not provide steps for any specific software system. That is beyond the scope of this book.
This book will also not talk about analysis or preparing data for analysis through means such as data imputation, removal of legitimate outliers, or calculating analysis-specific variables. This book is written from the perspective of a data manager, for whom the end goal of data management is to build datasets for general data sharing. This means we will cover practices that keep data in its most complete and true, but still usable, form, so that any future researcher can analyze it in the way that works best for them.
Last, I want to acknowledge that education research studies, while all happening under a similar umbrella of study, are each unique in their design and requirements. It would be impossible for me to provide examples throughout this book that are applicable to every type of project a reader may encounter. Instead, I have done my best to provide examples that I think are generally relatable to a wide audience of researchers, in hopes that you can then extrapolate those examples to your own specific work.
This book is for anyone involved in a research study involving original data collection. In particular, this book focuses on quantitative data, typically collected from human participants, although many of the practices covered could apply to other types of data as well. This book also applies to any team member, ranging from investigators, to data managers, to project staff, to students, to contractual data collectors. The contents of this book are useful for anyone who may have a part in planning, collecting, or organizing research study data.
Planning and implementing new data management practices on top of planning the implementation of your entire research grant can feel overwhelming. However, the idea of this book is to find the practices that work for you and your team and implement them consistently. For some teams that may look like implementing just a few of the suggestions mentioned; for others it may involve implementing all of them. Improving your data management workflow is a process and it becomes easier over time as those practices become part of your normal routine. At some point you may even find that you enjoy working on data management processes as you start to see the benefits of their implementation!