
6.4 Documenting variable construction: A case study from the Demand for Safe Spaces project

also be able to reproduce those steps and recreate the constructed variables. Therefore, documentation is an output of construction as important as the code and data, and it is good practice for papers to have an accompanying data appendix listing the analysis variables and their definitions.

The development of construction documentation provides a good opportunity for the team to have a wider discussion about creating protocols for defining variables: such protocols guarantee that indicators are defined consistently across projects. A detailed account of how variables are created is needed and will be implemented in the code, but comments are also needed explaining in human language what is being done and why. This step is crucial both to prevent mistakes and to guarantee transparency. To make sure that these comments can be navigated more easily, it is wise to start writing a variable dictionary as soon as the team begins thinking about making changes to the data (for an example, see Jones et al. 2019). The variable dictionary can be saved in an Excel spreadsheet, a Word document, or even a plain-text file. Whatever format it takes, it should carefully record how specific variables have been transformed, combined, recoded, or rescaled. Whenever relevant, the documentation should point to specific scripts to indicate where the definitions are being implemented in code.
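As an illustration of how a dictionary entry can map to code, the sketch below constructs a dummy variable like the "Young (18–25 years old)" indicator documented in box 6.4. The file path and variable names are hypothetical; the point is that the comment and the variable label repeat, in human language, exactly what the dictionary says the code should do.

```stata
* Hypothetical construction script: each block mirrors one entry
* in the variable dictionary (here, "Young (18-25 years old)")
use "data/intermediate/demographic_survey.dta", clear

* = 1 if rider was between 18 and 25 years old when answering the survey
generate young = (age >= 18 & age <= 25) if !missing(age)
label variable young "Young (18-25 years old)"
```

Keeping one commented block per dictionary entry makes it straightforward to check the code against the documentation line by line.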

The iecodebook export subcommand is a good way to ensure that the project has easy-to-read documentation. When all final indicators have been created, it can be used to list all variables in the data set in an Excel sheet. The variable definitions can be added to that file to create a concise metadata document. This step provides a good opportunity to review the notes and make sure that the code is implementing exactly what is described in the documentation (see box 6.4 for an example of variable construction documentation).
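A minimal sketch of this workflow, assuming the constructed data set and output paths shown here, might look as follows:

```stata
* Export a variable list of the constructed data set to Excel;
* definitions can then be added by hand to create a metadata document
use "data/constructed/analysis_data.dta", clear
iecodebook export using "documentation/codebook.xlsx", replace
```

The resulting spreadsheet lists every variable in the data set, ready for the team to fill in and review definitions.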

BOX 6.4 DOCUMENTING VARIABLE CONSTRUCTION: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

In an appendix to the working paper, the Demand for Safe Spaces team documented the definition of every variable used to produce the outputs presented in the paper:

Variable definitions for rider audit demographic survey

Age: Median age in years of the rider's age category when the rider answered the demographic survey

Employed: = 1 if the rider had a part-time or full-time job when answering the demographic survey

High self-reported socio-economic status: = 1 if the rider reported being a member of classes A or B

Low education (middle school or less): = 1 if the highest degree the rider had obtained when answering the demographic survey was middle school or lower

Number of Supervia rides in a typical week: Number of times the rider would normally ride the Supervia in a typical week during which that rider is not taking any app rides

Single: = 1 if the rider was not married when answering the demographic survey

Years of schooling: Number of years equivalent to the rider's highest level of education when answering the demographic survey

Young (18–25 years old): = 1 if the rider was between 18 and 25 years old when answering the demographic survey

Appendix B in the working paper presents a set of tables with variable definitions in nontechnical language, including variables collected through surveys, variables constructed for analysis, and research variables assigned by random processes.

For the full set of tables, see appendix B of the working paper at https://openknowledge.worldbank.org/handle/10986/33853.

Writing analysis code

After data have been cleaned and indicators constructed, it is time to start analyzing the data. Many resources deal with data analysis and statistical methods, such as R for Data Science (Wickham and Grolemund 2017); A Gentle Introduction to Stata (Acock 2018); Mostly Harmless Econometrics (Angrist and Pischke 2008); Mastering ‘Metrics (Angrist and Pischke 2014); and Causal Inference: The Mixtape (Cunningham 2021). The discussion here focuses on how to structure code and files for data analysis, not how to conduct specific analyses.

Organizing analysis code

The analysis usually starts with a process called exploratory data analysis, which is when researchers begin to look for patterns in the data, create descriptive graphs and tables, and try different statistical tests to understand the results. It progresses into final analysis when the team starts to decide which are the “main results”— those that will make it into a research output. Code and code outputs for exploratory analysis are different from those for final analysis. During the exploratory stage, the temptation is to write lots of analysis into one big, impressive, start-to-finish script. Although this is fine when writing the research stream of consciousness into code, it leads to poor practices in the final code, such as not clearing the workspace and not loading a fresh data set before each analysis task.
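For example, a final analysis script might follow a pattern like the one below, with one output per script; the paths and variable names are illustrative, not taken from the project:

```stata
* Final analysis script: start from a fresh workspace and load
* a fresh copy of the constructed data before the analysis task
clear all
use "data/constructed/analysis_data.dta", clear

* One main result per script (illustrative regression)
regress outcome treatment, vce(robust)
```

Because each script clears the workspace and reloads the data, no result depends on leftover objects from an earlier exploratory session.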

To avoid mistakes, it is important to take time to organize the code that will be kept—that is, the final analysis code. The result is a curated set of polished scripts that will be part of a reproducibility package. A well-organized analysis script starts with a completely fresh workspace
