
completed, these variables may be removed from the data set. In fact, starting from a minimal set of variables and adding new ones as they are cleaned can make the data easier to handle. Using commands such as compress in Stata so that the data are always stored in the most efficient format helps to ensure that the cleaned data set file does not get too big to handle.
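The following is a minimal sketch of this approach (the variable and file names are purely illustrative): only the variables needed for cleaning are kept, and compress is run before saving so that each variable is stored in its smallest suitable type.

* Keep only the variables needed for cleaning (names are illustrative)
use "raw_data.dta", clear
keep hh_id survey_date income_raw consumption_raw

* ... cleaning steps that create new, cleaned variables go here ...

* Store each variable in the most efficient type available
compress

* Save the cleaned data set
save "cleaned_data.dta", replace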
Although all of these tasks are key to making the data easy to use, implementing them can be quite repetitive and create convoluted scripts. The iecodebook command suite, part of the iefieldkit Stata package, is designed to make some of the most tedious components of this process more efficient. It also creates a self-documenting workflow, so the data-cleaning documentation is created alongside the code, with no extra steps (see box 5.7 for a description of how iecodebook was used in the Demand for Safe Spaces project). In R, the Tidyverse (https://www.tidyverse.org) packages provide a consistent and useful grammar for performing the same tasks and can be used in a similar workflow.
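As a rough sketch of how this workflow fits together (the file names are illustrative and command options are omitted), iecodebook template exports an Excel codebook describing the data in memory; after the research team fills in the corrected names, labels, and recodes by hand, iecodebook apply executes every change listed in the sheet:

* Install iefieldkit, which includes iecodebook (one-time setup)
ssc install iefieldkit

* Export an Excel codebook template describing the data currently in memory
use "raw_data.dta", clear
iecodebook template using "codebook_cleaning.xlsx"

* Once the codebook has been filled in by hand, apply the renames,
* variable labels, value labels, and recodes documented in the sheet
iecodebook apply using "codebook_cleaning.xlsx"

The filled-in spreadsheet then doubles as a record of every change made to the raw data, which is what makes the workflow self-documenting.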
BOX 5.7 RECODING AND ANNOTATING DATA: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT
The Demand for Safe Spaces team relied mostly on the iecodebook command for this part of the data-cleaning process. The screenshot below shows the iecodebook form used to clean the crowdsourced ride data. This process was repeated for each of the data collection tasks in the app.
Column B contains the corrected variable labels, column D indicates the value labels to be used for categorical variables, and column I recodes the underlying numbers in those variables. The differences between columns E and A indicate changes to variable names. Typically, it is strongly recommended not to rename variables at the cleaning stage, because it is important to maintain correspondence with the original data set. However, that was not possible in this case, because the same question had inconsistent variable names across multiple transfers of the data from the technology firm managing the mobile application. In fact, this is one of the two cleaning tasks that
could not be performed directly through iecodebook (the other was transforming string variables to a categorical format for increased efficiency). The following code shows a few examples of how these cleaning tasks were carried out directly in the script:
* Encode crowd rate
encode ride_crowd_rate, gen(RI_crowd_rate)

* Reconcile different names for compliance variable
replace ride_men_present = approx_percent_men if missing(ride_men_present)

* Encode compliance variable
encode ride_men_present, gen(RI_men_present)

* Did you look in the cars before you made your choice?
* Turn into dummy from string
foreach var in sv_choice_pink sv_choice_regular {
    gen `var'_ = (`var' == "Sim") if (!missing(`var') & `var' != "NA")
}
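In the final step, the string responses are turned into dummy variables that equal 1 when the answer is "Sim" ("yes" in Portuguese), 0 otherwise, and missing when the original response is empty or recorded as "NA"; the encode commands, by contrast, convert string ratings into labeled categorical variables.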
To document the contents of the original data, the team published supplemental materials on GitHub, including the description of tasks shown in the app. All of the codebooks and Excel sheets used by the code to clean and correct data were also included in the documentation folder of the reproducibility package.
For the complete do-file for cleaning the ride task, visit the GitHub repository at https://git.io/Jtgqj. For the corresponding codebook, visit the GitHub repository at https://git.io/JtgNS.
Documenting data cleaning

Data documentation is the process of systematically recording information related to research data work. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Data_Documentation.
Throughout the data-cleaning process, extensive inputs are often needed from the people responsible for data collection. Sometimes this is the research team, but often it is someone else. For example, it could be a survey team, a government ministry responsible for administrative data systems (for an example, see Fernandes, Hillberry, and Alcántara 2015), or a technology firm that generates remote-sensing data. Regardless of who originally collected the data, it is necessary to acquire and organize all documentation describing how the data were generated. The type of documentation available depends on how the data were collected. For original data collection, it should include field protocols, data collection manuals, survey instruments, supervisor notes, and data quality monitoring reports. For secondary data, the same type of information is useful but often not available unless the data source is a