Development Research in Practice


iecodebook is a Stata command to document and execute repetitive data-cleaning tasks such as renaming, recoding, and labeling variables; to create complete codebooks for data sets; and to harmonize and append data sets containing similar variables. It is part of the iefieldkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/iecodebook.

interact directly with the identified data. If the data tidying has resulted in multiple data tables, each table needs to be de-identified separately, but the workflow will be the same for all of them.

During the initial round of de-identification, data sets must be stripped of directly identifying information. Doing so requires identifying all of the variables that contain such information. For primary data collection, flagging all potentially identifying variables when the research team designs the survey instrument simplifies the initial de-identification process. If that was not done, or if the original data were received by another means, a few tools can help to flag variables with directly identifying data. The Abdul Latif Jameel Poverty Action Lab (J-PAL) PII-scan and Innovations for Poverty Action (IPA) PII_detection scan variable names and labels for common string patterns associated with identifying information. The sdcMicro package lists variables that uniquely identify observations, but its more refined method and need for higher processing capacity make it better suited for final de-identification (Benschop and Welch, n.d.). The iefieldkit command iecodebook lists all variables in a data set and exports an Excel sheet that makes it easy to select which variables to keep or drop.

It is necessary to assess the resulting list of variables that contain PII against the analysis plan, asking for each variable, "Will this variable be needed for the analysis?" If not, the variable should be removed from the de-identified data set. It is preferable to be conservative and remove all identifying information at this stage. It is always possible to include additional variables from the original data set if deemed necessary later. However, it is not possible to go back in time and drop a PII variable that was leaked (see box 5.5 for an example of how de-identification was implemented for the Demand for Safe Spaces project).
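The pattern-based flagging described above can be illustrated with a minimal Python sketch. It is not the implementation of PII-scan, PII_detection, or iecodebook; the pattern list, the function name `flag_pii_candidates`, and the example variables are all hypothetical, chosen only to show how scanning variable names and labels for common strings might work.

```python
import re

# Illustrative, non-exhaustive patterns (an assumption, not taken from
# PII-scan or PII_detection, which use their own curated lists).
PII_PATTERNS = [r"name", r"phone", r"address", r"email", r"gps", r"birth"]


def flag_pii_candidates(variables):
    """Return the variable names whose name or label matches any pattern.

    `variables` maps each variable name to its label.
    """
    regex = re.compile("|".join(PII_PATTERNS), flags=re.IGNORECASE)
    flagged = []
    for var_name, label in variables.items():
        if regex.search(var_name) or regex.search(label):
            flagged.append(var_name)
    return flagged


# Hypothetical variable names and labels, for illustration only.
example_variables = {
    "phone_number": "Respondent contact",
    "enumeratorname": "Enumerator",
    "q12": "Respondent's home address",
    "ride_satisfaction": "Satisfaction with ride (1-5)",
}

flagged = flag_pii_candidates(example_variables)
print(flagged)  # ['phone_number', 'enumeratorname', 'q12']
```

A scan of this kind only produces candidates. Each flagged variable still has to be reviewed by hand against the analysis plan, and variables with uninformative names and labels will be missed.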

BOX 5.5 IMPLEMENTING DE-IDENTIFICATION: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

The Demand for Safe Spaces team used the iecodebook command to drop identifying information from the data sets as they were imported. Additionally, before data sets were published, the labels indicating line and station names were removed from them, leaving only the masked number for the underlying category. This was done so that it would not be possible to reconstruct individuals’ commuting habits directly from the public data.

The code fragment below shows an example of the initial de-identification when the data were imported. The full data set was saved in the folder for confidential data (using World Bank OneDrive accounts), and a short codebook listing variable names, but not their contents, was saved elsewhere. iecodebook was then used with the drop option to remove confidential information from the data set before it was saved in a shared Dropbox folder. The specific variables removed in this operation contained information about the data collection team that was not needed after quality checks were implemented (deviceid, subscriberid, simid, devicephonenum, username, enumerator, enumeratorname) and the phone numbers of survey respondents (phone_number).
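The team's own Stata code is not reproduced in this excerpt. As a rough illustration of the same two steps (listing variable names without their contents, then dropping the identifying variables), a minimal Python sketch might look like the following. The function names and the example record are hypothetical; only the list of dropped variables comes from the text above.

```python
# Variables named in box 5.5 as removed during initial de-identification.
VARS_TO_DROP = {
    "deviceid", "subscriberid", "simid", "devicephonenum",
    "username", "enumerator", "enumeratorname", "phone_number",
}


def codebook_names(records):
    """List variable names only (no contents), in first-seen order."""
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    return names


def deidentify(records, vars_to_drop=VARS_TO_DROP):
    """Return copies of the records without the identifying variables."""
    return [
        {key: value for key, value in record.items() if key not in vars_to_drop}
        for record in records
    ]


# Hypothetical record; the values are made up for illustration.
raw = [{"id": 1, "phone_number": "000-0000", "enumerator": 7, "ride_rating": 4}]

names = codebook_names(raw)  # variable names only, safe to share
clean = deidentify(raw)      # the original `raw` list is left unchanged

print(names)  # ['id', 'phone_number', 'enumerator', 'ride_rating']
print(clean)  # [{'id': 1, 'ride_rating': 4}]
```

Returning copies rather than modifying the records in place mirrors the workflow in the box: the full data set stays in confidential storage, and only the de-identified copy is saved to the shared folder.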


DEVELOPMENT RESEARCH IN PRACTICE: THE DIME ANALYTICS DATA HANDBOOK


Development Research in Practice by World Bank Publications