BOX 5.2 ESTABLISHING A UNIQUE IDENTIFIER: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

All data sets have a unit of observation, and the first columns of each data set should uniquely identify which unit is being observed. In the Demand for Safe Spaces project, as should be the case in all projects, the first few lines of code that imported each original data set immediately ensured that this was true and applied any corrections from the field needed to fix errors related to uniqueness. The code segment below was used to import the crowdsourced ride data; it used the ieduplicates command to remove duplicate values of the uniquely identifying variable in the data set. The screenshot of the corresponding ieduplicates report shows how the command documents and resolves duplicate identifiers in data collection. After applying the corrections, the code confirms that the data are uniquely identified by the rider and ride identifiers and saves the data set in an optimized format.

// Import to Stata format ============================================================
import delimited using "${encrypt}/Baseline/07112016/Contributions 07112016", ///
    delim(",")         ///
    bindquotes(strict) ///
    varnames(1)        ///
    clear

* There are two duplicated values for obs_uid, each with two submissions.
* All four entries are demographic surveys from the same user, who seems to
* have submitted the data twice, each time creating two entries.
* Possibly a connectivity issue
ieduplicates obs_uid using "${doc_rider}/baseline-study/raw-duplicates.xlsx", ///
    uniquevars(v1) ///
    keepvars(created submitted started)

* Verify unique identifier, sort, optimize storage,
* remove blank entries and save data
isid user_uuid obs_uid, sort
compress
dropmiss, force
save "${encrypt}/baseline_raw.dta", replace
To access this code in do-file format, visit the GitHub repository at https://github.com/worldbank/dime-data-handbook/tree/main/code.
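For readers without the iefieldkit package that provides ieduplicates, a similar duplicate check can be approximated with built-in Stata commands. The sketch below is illustrative only: it assumes a hypothetical data set whose intended identifier is a variable named id, and, unlike ieduplicates, it does not generate the spreadsheet report used to document and apply corrections.

* Minimal sketch of flagging and verifying duplicates with built-in commands,
* assuming a hypothetical identifier variable named id
duplicates report id                 // count how many values of id are duplicated
duplicates tag id, generate(dup_id)  // flag every observation with a duplicated id
browse if dup_id > 0                 // inspect flagged rows before deciding on corrections

* after corrections have been applied
isid id, sort                        // stop with an error if id is still not unique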