made directly to the original data set. Instead, any corrections must be made as part of data cleaning, applied through code, and saved to a new data set (see box 5.6 for a discussion of how data corrections were made for the Demand for Safe Spaces project).
BOX 5.6 CORRECTING DATA POINTS: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT
Most of the issues that the Demand for Safe Spaces team identified in the raw crowdsourced data during data quality assurance were related to incorrect station and line identifiers. Two steps were taken to address this issue. The first was to correct data points. The second was to document the corrections made. The correct values for the line and station identifiers, as well as notes on how they were identified, were saved in a data set called station_correction.dta. The team used the command merge to replace the values in the raw data in memory (called the “master data” in merge) with the station_correction.dta data (called the “using data” in merge). The merge used the following options:
• update replace was used to update values in the “master data” with values from the same variable in the “using data.”
• keepusing(user_station) was used to keep only the user_station variable from the “using data.”
• assert(master match_update) was used to confirm that all observations were either only in the “master data” or were in both the “master data” and the “using data” and that the values were updated with the values in the “using data.” This quality assurance check was important to ensure that data were merged as expected.
To document the final contents of the original data, the team published supplemental materials on GitHub as well as on the World Bank Microdata Catalog.

    * There was a problem with the line option for one of the stations.
    * This fixes it:
    * -----------------------------------------------------------------------

    merge 1:1 obs_uuid                                                  ///
        using "${doc_rider}/compliance-pilot/station_corrections.dta", ///
        update replace                                                  ///
        keepusing(user_station)                                         ///
        assert(master match_update)                                     ///
        nogen

For the complete script, visit the GitHub repository at https://git.io/Jt2ZC.
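
As a rough illustration of how the merge options in box 5.6 behave, the sketch below runs the same command on a small made-up data set. The observation IDs, station names, and temporary files are hypothetical and are not taken from the project; only the variable names obs_uuid and user_station and the merge options come from the box. The sketch assumes that the master value being corrected is missing, because Stata codes the update of a missing master value as match_update, whereas a conflict between two nonmissing values is coded match_conflict and would fail this assert() check.

```stata
* Hypothetical toy example: IDs, station names, and tempfiles are
* illustrative only and do not come from the project data.

* "Master data": three observations, one with a missing station
clear
input str4 obs_uuid str10 user_station
"a1" ""
"a2" "Maracana"
"a3" "Estacio"
end
tempfile raw
save "`raw'"

* "Using data": the correction file, holding only the observation to fix
clear
input str4 obs_uuid str10 user_station
"a1" "Central"
end
tempfile corrections
save "`corrections'"

* Apply the correction with the same options as in the box
use "`raw'", clear
merge 1:1 obs_uuid using "`corrections'", ///
    update replace                        /// take values from the using data
    keepusing(user_station)               /// bring in only this variable
    assert(master match_update)           /// error out on any other result
    nogen

* a2 and a3 exist only in the master data and are left unchanged;
* a1 is matched and its missing station is filled from the using data
list obs_uuid user_station
```

If the correction file contained an ID that is absent from the master data, or a value identical to the one already in memory, the assert() option would stop the do-file with an error, which is the kind of safeguard the box describes.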
DEVELOPMENT RESEARCH IN PRACTICE: THE DIME ANALYTICS DATA HANDBOOK