
Box 5.1 Summary: Cleaning and processing research data
After being acquired, data must be structured for analysis in accordance with the research design, as laid out in the data linkage tables and the data flowcharts discussed in chapter 3. This process entails the following tasks.
1. Tidy the data. Many data sets do not have an unambiguous identifier as received, and the rows in the data set often do not match the units of observation specified by the research plan and data linkage table. Preparing the data for analysis requires two steps:
• Determine the unique identifier for each unit of observation in the data.
• Transform the data so that the desired unit of observation uniquely identifies rows in each data set.
2. Validate data quality. Data completeness and quality should be validated upon receipt to ensure that the information is an accurate representation of the characteristics and individuals it is supposed to describe. This process entails three steps:
• Check that the data are complete—that is, that all the observations in the desired sample were received.
• Make sure that data points are consistent across variables and data sets.
• Explore the distribution of key variables to identify outliers and other unexpected patterns.
3. De-identify, correct, and annotate the data. After the data have been processed, the information must be archived, published, or both. Before publication, it is necessary to ensure that the processed version is highly accurate and appropriately protects the privacy of individuals:
• De-identify the data in accordance with best practices and relevant privacy regulations.
• Correct data points that are identified as being in error compared to ground reality.
• Recode, document, and annotate data sets so that all of the content will be fully interpretable by future users, whether or not they were involved in the acquisition process.
Key responsibilities for task team leaders and principal investigators
• Determine the units of observation needed for experimental design and supervise the development of appropriate unique identifiers.
• Indicate priorities for quality checks, including key indicators and reference values.
• Provide guidance on how to resolve all issues identified in data processing, cleaning, and preparation.
• Publish or archive the prepared data set.
Key responsibilities for research assistants
• Develop code, data, and documentation linking data sets with the data map and study design, and tidy all data sets to correspond to the required units of observation.
• Manage data quality checks, and communicate issues clearly to task team leaders, principal investigators, data producers, and field teams.
• Inspect each variable, recoding and annotating as required. Prepare the data set for publication by de-identifying data, correcting field errors, and documenting the data.
Key resources
• The iefieldkit Stata package, a suite of commands to enable reproducible data cleaning and processing:
– Explanation at https://dimewiki.worldbank.org/iefieldkit
– Code at https://github.com/worldbank/iefieldkit
• The ietoolkit Stata package, a suite of commands to enable reproducible data management and analysis:
– Explanation at https://dimewiki.worldbank.org/ietoolkit
– Code at https://github.com/worldbank/ietoolkit
• DIME Analytics Continuing Education Session on tidying data at https://osf.io/p4e8u/
• De-identification article on the DIME Wiki at https://dimewiki.worldbank.org/De-identification
Data tables are data that are structured into rows and columns. They are also called tabular data sets or rectangular data. By contrast, nonrectangular data types include written text, NoSQL files, social graph databases, and files such as images or documents.
The unit of observation is the type of entity that is described by a given data set. In tidy data sets, each row should represent a distinct entity of that type. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Unit_of_Observation.
Making data “tidy”
The first step in creating an analysis data set is to understand the data acquired and use this understanding to translate the data into an intuitive format. This section discusses what steps may be needed to make sure that each row in a data table represents one observation. Getting to such a format may be harder than expected, and the unit of observation may be ambiguous in many data sets. This section presents the tidy data format, which is the ideal format for handling tabular data. Tidying data is the first step in data cleaning; quality assurance is best done using tidied data, because the relationship between row and unit of observation is as simple as possible. In practice, tidying and quality monitoring should proceed simultaneously as data are received.
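To make the idea concrete, consider a hypothetical household survey delivered in wide format, with one row per household and member-level variables spread across numbered columns. The following minimal Stata sketch reshapes it so that each row represents one household member; all names (hhid, age_, edu_, the file names) are illustrative placeholders, not part of any real project.

* Hypothetical wide data: one row per household (hhid), with
* member-level variables in numbered columns (age_1, age_2, ..., edu_1, ...)
use "household_wide.dta", clear

* Reshape so that each row is one household member: the tidy
* unit of observation for member-level analysis
reshape long age_ edu_, i(hhid) j(member)
rename (age_ edu_) (age edu)

* Drop the empty rows created for households with fewer members
drop if missing(age) & missing(edu)

* Confirm that household ID and member number now uniquely identify rows
isid hhid member

A wide layout is not wrong in itself, but the tidy long layout means that every subsequent cleaning and quality-check step operates on exactly one unit of observation per row.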
This book uses the term “original data” to refer to the data in the state in which the information was first acquired by the research team. In other sources, the terms “original data” or “raw data” may be used to refer to the corrected and compiled data set created from received information, which this book calls “clean data”—that is, data that have been processed to remove errors and duplicates, that have been transformed to the correct level of observation, and that include complete metadata such as labels and documentation. This distinction applies to data provided by partners as well as to original data collected by the research team.
A unique identifier is a variable or combination of variables that distinguishes each entity described in a data set at that level of observation (for example, person, household) with a distinct value. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/ID_Variable_Properties.
The data linkage table is the component of a data map that lists all the data sets in a particular project and explains how they are linked to each other. For more details and an example, see the DIME Wiki at https://dimewiki.worldbank.org/Data_Linkage_Table.
A master data set is the component of a data map that lists all individual units for a given level of observation in a project. For more details and an example, see the DIME Wiki at https://dimewiki.worldbank.org/Master_Data_Set.
A project identifier (ID) is a research design variable used consistently throughout a project to identify observations. For each level of observation, the corresponding project ID variable must uniquely and fully identify all observations in the project. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/ID_Variable_Properties.
Establishing a unique identifier
Before starting to tidy a data set, it is necessary to understand the unit of observation that the data represent and to determine which variable or set of variables is the unique identifier for each observation. As discussed in chapter 3, the unique identifier will be used to link observations in this data set to data in other data sources according to the data linkage table, and it must be listed in the master data set.
Ensuring that observations are uniquely and fully identified is arguably the most important step in data cleaning because the ability to tidy the data and link them to any other data sets depends on it. It is possible that the variables expected to identify the data uniquely contain either missing or duplicate values in the original data. It is also possible that a data set does not include a unique identifier or that the original unique identifier is not a suitable project identifier (ID). Suitable project IDs should not, for example, involve long strings that are difficult to work with, such as names, or be known outside of the research team.
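In Stata, this check takes one line, and the built-in duplicates commands help inspect any failures. A minimal sketch, assuming a hypothetical identifier variable hhid in the acquired data:

* Assert that hhid uniquely and fully identifies the observations;
* isid exits with an error if hhid is ever missing or duplicated
isid hhid

* If the assertion fails, quantify and inspect the problem cases
duplicates report hhid
duplicates tag hhid, generate(dup)
list hhid if dup > 0 | missing(hhid)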
In such cases, cleaning begins by adding a project ID to the acquired data. If a project ID already exists for this unit of observation, then it should be merged carefully from the master data set to the acquired data using other identifying information. (In R and some other languages, this operation is called a “data set join”; this book uses the term “merge.”) If a project ID does not exist, then it is necessary to generate one, add it to the master data set, and then merge it back into the original data. Although digital survey tools create unique identifiers for each data submission, these identifiers are not the same as having a unique ID variable for each observation in the sample, because the same observation can have multiple submissions.
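As a sketch of that merge in Stata, assuming a hypothetical master data set master_households.dta that stores the project ID hhid alongside identifying fields name and village used for matching:

* Attach the project ID from the master data set to the acquired
* data, matching on other identifying information
use "acquired_data.dta", clear
merge 1:1 name village using "master_households.dta", keepusing(hhid)

* Review match results before proceeding (_merge == 3 means matched)
tabulate _merge
assert _merge != 1   // every acquired record should appear in the master
drop if _merge == 2  // master records with no data in this acquisition
drop _merge

In practice, matching on fields such as names often requires fuzzy or manual reconciliation; the exact-match merge above is only the simplest case.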
The DIME Analytics team created an automated workflow to identify, correct, and document duplicated entries in the unique identifier using the ieduplicates and iecompdup Stata commands. One advantage of using ieduplicates to correct duplicate entries is that it creates a duplicates report, which records each correction made and documents the reason for it. Whether using this command or not, it is important to keep a record of all cases of duplicate IDs encountered and how they were resolved (see box 5.2 for an explanation of how a unique identifier was established for the Demand for Safe Spaces project).
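In outline, that workflow looks like the following sketch. The ID variable, file path, and the uniquevars() value (here key, the submission ID created by the survey software) are placeholders; consult the iefieldkit documentation for the exact syntax and options.

* Flag duplicate values of the project ID and create an Excel
* report in which corrections are entered and their reasons documented
ieduplicates hhid using "duplicates_report.xlsx", uniquevars(key)

* For one duplicated ID value, compare the submissions variable
* by variable to see where they differ
iecompdup hhid, id(1234)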