
5.3 Tidying data: A case study from the Demand for Safe Spaces project

BOX 5.3 TIDYING DATA: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

The unit of observation in an original data set does not always match the relevant unit of analysis for a study. One of the first steps in data processing is therefore to create data sets at the desired unit of analysis. In the case of the crowdsourced ride data used in the Demand for Safe Spaces project, study participants were asked to complete three tasks in each metro trip: one before boarding the train (check-in task), one during the ride (ride task), and one after leaving the train (check-out task). The raw data sets contain one task per row, so each unit of analysis, a metro trip, was described by three rows.

To create a data set at the trip level, the research team took two steps, outlined in the data flowchart (for an example of how data flowcharts can be created, see box 3.3 in chapter 3). First, three separate data sets were created, one for each task, containing only the variables and observations created during that task. Then the trip-level data set was created by combining the variables in the data tables for each task at the level of the individual trip (identified by the session variable).

The following code shows an example of the ride task script, which keeps only the ride task rows and columns from the raw data set.

/****************************************************************************************
    Load data set and keep ride variables
****************************************************************************************/

use "${dt_raw}/baseline_raw_deidentified.dta", clear

* Keep only entries that refer to ride task
keep if inlist(spectranslated, "Regular Car", "Women Only Car")

* Sort observations
isid user_uuid session, sort

* Keep only questions answered during this task
* (all others will be missing for these observations)
dropmiss, force

The script then encodes categorical variables and saves a tidy ride task data set:

/****************************************************************************************
    Clean up and save
****************************************************************************************/

iecodebook apply using "${doc_rider}/baseline-study/codebooks/ride.xlsx", drop
order user_uuid session RI_pa - RI_police_present CI_top_car RI_look_pink ///
    RI_look_mixed RI_crowd_rate RI_men_present

* Optimize memory and save data
compress
save "${dt_int}/baseline_ride.dta", replace



The same procedure was repeated for the check-in and check-out tasks. Each of these tidy data sets was saved with a very descriptive name, indicating the wave of data collection and the task included in the data set. For the complete script, visit the GitHub repository at https://git.io/Jtgqj.
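
The second step, combining the three tidy task-level data tables into a single trip-level data set, is not reproduced in the box. The sketch below illustrates the general idea, assuming that the check-in and check-out scripts save files named analogously to baseline_ride.dta; the file names, merge keys, and the assumption that every trip has all three tasks are illustrative rather than taken from the project code.

* Illustrative sketch: combine the three tidy task-level data sets into a
* trip-level data set (file names other than baseline_ride.dta are assumed)
use "${dt_int}/baseline_checkin.dta", clear

merge 1:1 user_uuid session using "${dt_int}/baseline_ride.dta", ///
    assert(match) nogenerate
merge 1:1 user_uuid session using "${dt_int}/baseline_checkout.dta", ///
    assert(match) nogenerate

* Optimize memory and save trip-level data
compress
save "${dt_int}/baseline_trip.dta", replace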

In the household data set example, the household-level data table is the main table. This means that there must be a master data set for households. (The project may also have a master data set for household members if it is important for the research, but having one is not strictly required.) The household data set would then be stored in a folder called, for example, baseline-hh-survey. That folder would contain both the household-level data table with the same name as the folder, for example, baseline-hh-survey.csv, and the household member–level data, named in the same format but with a suffix, for example, baseline-hh-survey-hhmember.csv.

The tidying process gets more complex as the number of nested groups increases: the steps of identifying the unit of observation of each variable and reshaping the separated data tables need to be repeated multiple times. However, the larger the number of nested groups in a data set, the greater the efficiency gains from working with tidy rather than untidy data. Cleaning and analyzing wide data sets, in particular, is a repetitive and error-prone process.
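
As a hypothetical illustration of one such reshaping step, the sketch below splits a household table in which roster variables repeat in wide format (age_1, age_2, ..., employed_1, employed_2, ...) into a tidy member-level table. All file and variable names here are invented for the example.

* Hypothetical example: separate a wide household member roster into a
* tidy member-level data table identified by hh_id and member_id
use "${dt_raw}/baseline_hh_wide.dta", clear

keep hh_id age_* employed_*
reshape long age_ employed_, i(hh_id) j(member_id)
rename (age_ employed_) (age employed)

* Drop roster slots that do not correspond to an actual member
drop if missing(age) & missing(employed)

isid hh_id member_id, sort
save "${dt_int}/baseline_hh_member.dta", replace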

The next step of data cleaning, data quality monitoring, may involve comparing different units of observation. Aggregating subunits to compare them to a higher unit is much easier with tidy data, which is why tidying data is the first step in the data cleaning workflow. When collecting primary data, it is possible to start writing the tidying code even before the data are acquired, because the exact format in which the data will be received is known in advance. Preparing the data for analysis, the last task in this chapter, is much simpler when tidying has been done.
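
For instance, a tidy member-level table can be collapsed to the household level and compared with the household-level table, as in the hypothetical sketch below; the variable and file names, including the reported household size hh_size, are again illustrative.

* Hypothetical example: aggregate the member-level table and compare it
* with the household-level table
use "${dt_int}/baseline_hh_member.dta", clear
collapse (count) n_members_listed = member_id, by(hh_id)

merge 1:1 hh_id using "${dt_int}/baseline_hh.dta", assert(match) nogenerate

* Flag households whose reported size disagrees with the roster count
generate flag_size_mismatch = (n_members_listed != hh_size)
tabulate flag_size_mismatch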

Implementing data quality checks

Data quality assurance checks, or simply data quality checks, are the set of processes put in place to detect incorrect data points due to survey programming errors, data entry mistakes, misrepresentation, and other issues. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Monitoring_Data_Quality.

Whether receiving data from a partner or collecting data directly, it is important to make sure that the data faithfully reflect realities on the ground. It is necessary to examine carefully any data collected through surveys or received from partners. Reviewing original data will inevitably reveal errors, ambiguities, and data entry mistakes, such as typos and inconsistent values. The key aspects to keep in mind are the completeness, consistency, and distribution of data (Andrade et al. 2021). Data quality assurance checks should be performed as soon as the data are acquired. When data are being collected and transferred to the team in real time, conducting high-frequency checks is recommended. Primary data require paying extra attention to quality checks, because data entry by humans is susceptible to errors, and the research team is the only line of defense between data issues and data analysis. Chapter 4 discusses survey-specific quality monitoring protocols.

High-frequency data quality checks (HFCs) are data quality checks run in real time during data collection so that any issues can be addressed while the data collection is still ongoing. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/High_Frequency_Checks.

Duplicate observations are instances in which two or more rows of data are identified by the same value of the ID variable or in which two or more rows unintentionally represent the same respondent. They can be created by situations such as data entry mistakes in the ID variable or repeated surveys or submissions. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Duplicates_and_Survey_Logs.

Data quality checks should carefully inspect key treatment and outcome variables to ensure that the data quality of core study variables is uniformly high and that additional effort is centered where it is most important. Checks should be run every time data are received to flag irregularities in acquisition progress, in sample completeness, or in response quality. The faster issues are identified, the more likely they are to be solved. Once the field team has left a survey area or high-frequency data have been deleted from a server, it may be impossible to verify whether data points are correct. Even if the research team is not receiving data in real time, the data owners may become less knowledgeable about the data, or less responsive to the research team's queries, as time goes by. ipacheck is a very useful Stata command that automates some of these tasks, regardless of the data source.
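
The sketch below shows the kind of simple check that can be automated with standard Stata commands: tracking missingness of a key outcome variable by submission date. The file and variable names are hypothetical, and the example does not reproduce the ipacheck workflow.

* Hypothetical high-frequency check: missingness of a key outcome by date
use "${dt_raw}/survey_submissions.dta", clear

generate byte miss_outcome = missing(outcome_key)
tabstat miss_outcome, by(submission_date) statistics(mean count)

* List individual submissions to follow up on
sort submission_date submission_id
list submission_id enumerator_id if miss_outcome, sepby(submission_date)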

It is important to check continuously that the observations in the data match the intended sample. In surveys, electronic survey software often provides case management features through which sampled units are assigned directly to individual enumerators. For data received from partners, such as administrative data, this assignment may be harder to validate. In these cases, cross-referencing with other data sources can help to ensure completeness. It is often the case that the data as originally acquired include duplicate observations or missing entries, which may occur because of typos, failed submissions to data servers, or other mistakes. Issues with data transmission often result in missing observations, particularly when large data sets are being transferred or when data are being collected in locations with limited internet connection. Keeping a record of what data were submitted and comparing it to the data received as soon as transmission is complete reduces the risk of noticing that data are missing only when it is no longer possible to recover the information.
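
A minimal sketch of a duplicate check using built-in Stata commands follows; the ID variable and file path are illustrative.

* Hypothetical example: confirm that the ID variable uniquely identifies
* observations, and inspect any duplicates before resolving them
use "${dt_raw}/survey_submissions.dta", clear

duplicates report respondent_id
duplicates tag respondent_id, generate(dup_flag)
sort respondent_id
list respondent_id submission_date if dup_flag > 0, sepby(respondent_id)

* After resolving duplicates, confirm that the ID uniquely identifies rows
isid respondent_id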

Once data completeness has been confirmed, observed units must be validated against the expected sample: this process is as straightforward as merging the sample list with the data received and checking for mismatches. Reporting errors and duplicate observations in real time allows for efficient corrections. ieduplicates provides a workflow for resolving duplicate entries with the data provider. For surveys, it is also important to track the progress of data collection to monitor attrition, so that it is known early on if a change in protocols or additional tracking is needed (for an example, see Özler et al. 2016). It is also necessary to check survey completion rates and sample compliance by surveyors and survey teams, to compare data missingness across administrative regions, and to identify any clusters that may be providing data of suspect quality.
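
A minimal sketch of this sample validation step with the built-in merge command is shown below; the file and variable names are hypothetical, and the sketch does not reproduce the ieduplicates workflow.

* Hypothetical example: validate received observations against the
* expected sample and list mismatches for follow-up
use "${dt_raw}/survey_submissions.dta", clear

merge 1:1 respondent_id using "${dt_raw}/sample_list.dta"

* _merge == 1: received but not in the sample list (unexpected units)
* _merge == 2: sampled but no data received yet (incomplete submissions)
list respondent_id if _merge == 1
list respondent_id if _merge == 2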

Quality checks should also include checks of the quality and consistency of responses. For example, it is important to check whether the values of each variable fall within their expected ranges and whether responses to related questions contradict one another.
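
A hypothetical sketch of such response checks is shown below; the variables, expected range, and consistency rule are invented for illustration.

* Hypothetical example: flag out-of-range values and contradictory answers
use "${dt_raw}/survey_submissions.dta", clear

* Respondent age expected to be between 18 and 99
generate byte flag_age   = !missing(age) & !inrange(age, 18, 99)

* A respondent reported as employed should report positive working hours
generate byte flag_hours = (employed == 1 & hours_worked <= 0)

list respondent_id age employed hours_worked if flag_age | flag_hours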
