
5.6 Correcting data points: A case study from the Demand for Safe Spaces project

Corrections should never be made directly to the original data set. Instead, any corrections must be made as part of data cleaning, applied through code, and saved to a new data set (see box 5.6 for a discussion of how data corrections were made for the Demand for Safe Spaces project).

BOX 5.6 CORRECTING DATA POINTS: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

Most of the issues that the Demand for Safe Spaces team identified in the raw crowdsourced data during data quality assurance were related to incorrect station and line identifiers. Two steps were taken to address this issue. The first was to correct data points. The second was to document the corrections made.

The correct values for the line and station identifiers, as well as notes on how they were identified, were saved in a data set called station_corrections.dta. The team used the command merge to replace the values in the raw data in memory (called the “master data” in merge) with the station_corrections.dta data (called the “using data” in merge).

The team used the following options, for these reasons:

• update replace was used to update values in the “master data” with values from the same variable in the “using data.”

• keepusing(user_station) was used to keep only the user_station variable from the “using data.”

• assert(master match_update) was used to confirm that all observations were either only in the “master data” or were in both the “master data” and the “using data” and that the values were updated with the values in the “using data.” This quality assurance check was important to ensure that the data were merged as expected.

To document the final contents of the original data, the team published supplemental materials on GitHub as well as on the World Bank Microdata Catalog.

* There was a problem with the line option for one of the stations.
* This fixes it:
* ------------------------------------------------------------------------

merge 1:1 obs_uuid ///
    using "${doc_rider}/compliance-pilot/station_corrections.dta", ///
    update replace ///
    keepusing(user_station) ///
    assert(master match_update) ///
    nogen

For the complete script, visit the GitHub repository at https://git.io/Jt2ZC.

In statistical software, categorical variables are stored as numeric integers, each representing one category. Value labels or levels are the names assigned to each category in a categorical variable in Stata and R, respectively. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Data_Cleaning.

Survey codes are values that are used as placeholders in survey questions to indicate types of outcomes other than responses to the question, such as refusal to answer. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Data_Cleaning.

Variable labels are short descriptors of the information contained in a variable in statistical software. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Data_Cleaning.

Recoding and annotating data

The clean data set is the starting point of data analysis. It is manipulated extensively to construct analysis indicators, so it must be easy to process using statistical software. To make the analysis process smoother, the data set should contain all of the information needed to interact with it, so that anyone opening it, even for the first time, does not have to go back and forth between the data set and its accompanying documentation.

Often, data sets are not imported into statistical software in the most efficient format. The most common example is string (text) variables: categorical variables and open-ended responses are often read as strings. However, variables in this format cannot be used for quantitative analysis. Therefore, categorical variables must be transformed into other formats, such as factors in R and labeled integers in Stata. Additionally, open-ended responses stored as strings usually have a high risk of including identifying information, so cleaning them requires extra attention. The choice names in categorical variables (called value labels in Stata and levels in R) should be accurate, concise, and linked directly to the data collection instrument. Adding choice names to categorical variables makes it easier to understand the data and reduces the risk that small errors will make their way into the analysis stage.
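As a minimal Stata sketch of this step, assuming a raw string variable named transport_mode_str with a small set of fixed categories (the variable and label names here are hypothetical, not taken from the project):

* encode converts a string categorical variable into a labeled integer,
* assigning codes in alphabetical order of the distinct strings.
encode transport_mode_str, generate(transport_mode) label(transport_mode_lbl)

* When codes must match the data collection instrument exactly, define the
* value label explicitly and attach it to the numeric variable instead.
label define yesno_lbl 0 "No" 1 "Yes"
label values consented yesno_lbl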

In survey data, it is common for nonresponse categories such as “don’t know” and “declined to answer” to be represented by arbitrary survey codes. The presence of these values would bias the analysis, because they do not represent actual observations of an attribute. They need to be turned into missing values. However, the fact that a respondent did not know how to answer a question is also useful information that would be lost by simply omitting all information. In Stata, this information can be elegantly preserved using extended missing values.
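In Stata, a minimal sketch of this recoding, assuming the hypothetical survey codes -88 (“declined to answer”) and -99 (“don’t know”) in a variable called income:

* Replace the survey codes with extended missing values (.r and .d), which
* are excluded from calculations like the regular missing value (.) but
* still record why the observation is missing.
recode income (-88 = .r) (-99 = .d)

* Extended missing values can also carry value labels for documentation.
label define income_lbl .r "Declined to answer" .d "Don't know"
label values income income_lbl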

The clean data set should be kept as similar to the original data set as possible, particularly with regard to variable names: keeping them consistent with the original data set makes data processing and construction more transparent. Unfortunately, not all variable names are informative. In such cases, one important piece of documentation makes the data easier to handle: the variable dictionary. When a data collection instrument (for example, a questionnaire) is available, it is often the best dictionary to use. But, even in these cases, going back and forth between files can be inefficient, so annotating variables in a data set is extremely useful. Variable labels must always be present in a clean data set. Labels should include a short and clear description of the variable. A lengthier description, which may include, for example, the exact wording of a question, may be added through variable notes in Stata or using data frame attributes in R.
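A short Stata sketch of both levels of annotation, using a hypothetical income variable and question wording:

* Variable labels hold a short, clear description (up to 80 characters).
label variable income "Monthly household income, in local currency"

* Longer documentation, such as the exact question wording, fits in notes.
notes income: Exact wording of Q3.4: What was the total income of all household members last month?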

Finally, any information that is not relevant for analysis may be removed from the data set. In primary data, it is common to collect information for quality monitoring purposes, such as notes, duration fields, and surveyor IDs. Once the quality monitoring phase is concluded, these fields can be dropped from the clean data set.
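For instance, a minimal Stata sketch of this last step, using hypothetical names for typical quality-monitoring fields:

* Drop fields that were collected only for quality monitoring
* (the variable names below are hypothetical).
drop enumerator_id submission_notes duration_*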
