
Box 6.6 Organizing analysis code: A case study from the Demand for Safe Spaces project
The Demand for Safe Spaces team defined the control variables in globals in the master analysis script. Doing so guaranteed that control variables were used consistently across regressions. It also provided an easy way to update control variables consistently across all regressions when needed. In an analysis script, a regression that includes all demographic controls would then be expressed as regress y x ${demographics}.
    /****************************************************************************************
    * Set control variables
    ****************************************************************************************/

    global star star (* .1 ** .05 *** .01)
    global demographics d_lowed d_young d_single d_employed d_highses
    global interactionvars pink_highcompliance mixed_highcompliance ///
                           pink_lowcompliance mixed_lowcompliance
    global interactionvars_oc pos_highcompliance zero_highcompliance ///
                              pos_lowcompliance zero_lowcompliance
    global wellbeing CO_concern CO_feel_level CO_happy CO_sad ///
                     CO_tense CO_relaxed CO_frustrated CO_satisfied ///
                     CO_feel_compare

    * Balance variables (Table 1)
    global balancevars1 d_employed age_year educ_year ride_frequency ///
                        home_rate_allcrime home_rate_violent ///
                        home_rate_theft grope_pink_cont grope_mixed_cont ///
                        comments_pink_cont comments_mixed_cont
    global balancevars2 usual_car_cont nocomp_30_cont nocomp_65_cont ///
                        fullcomp_30_cont fullcomp_65_cont

    * Other adjustment margins (Table A7)
    global adjustind CI_wait_time_min d_against_traffic CO_switch ///
                     RI_spot CI_time_AM CI_time_PM
For the complete master do-file from which this code is excerpted, visit the GitHub repository at https://git.io/JtgeT.
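The way a global keeps controls consistent across specifications can be sketched with a minimal example. This is not the project's code: it uses Stata's built-in auto data set as a stand-in for the analysis data, and the global name carcontrols is hypothetical.

```stata
* Minimal sketch on Stata's built-in auto data (not the project data).
* The global name "carcontrols" is hypothetical and chosen for illustration.
sysuse auto, clear

* Define the set of controls once, as would be done in a master script
global carcontrols weight length foreign

* Each specification that references the global uses the same controls
regress price mpg ${carcontrols}
regress price headroom ${carcontrols}
```

Changing the definition of the global in one place would then change the controls in both regressions the next time the script is run.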
Creating this setup entails having an effective data management system, including file naming, organization, and version control. Just as for the analysis data sets, each of the individual analysis files needs to have a descriptive name. File names such as spatial-diff-in-diff.do, matching-villages.R, and summary-statistics.py are clear indicators of what each file is doing and make it easy to find code quickly. If the script files will be ordered numerically to correspond to exhibits as they appear in a paper or report, such numbering should be done closer to publication, because script files will be reordered often during data analysis.
iegraph is a Stata command that generates graphs directly from results of regression specifications commonly used in impact evaluation. It is part of the ietoolkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/iegraph.
iekdensity is a Stata command that generates plots of the distribution of a variable by treatment group. It is part of the ietoolkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/iekdensity.
Visualizing data
Data visualization is increasingly popular and is becoming a field of expertise in its own right (Healy 2018; Wilke 2019). Although the same principles for coding exploratory and final data analysis apply to visualizations, creating them is usually more involved than the process of running an estimation routine and exporting numerical results into a table. Some of the difficulty of creating good visualizations of data is due to the difficulty of writing code to create them. The amount of customization necessary to create a nice graph can result in quite intricate commands.
Making a visually compelling graph is hard enough without having to go through many rounds of searching and reading help files to understand the graphical options syntax of a particular software package. Although getting each specific element of a graph to look exactly as intended can be hard, the solution to such problems is usually a single well-written search away, and it is best to leave these details to the very last. The trickier and more immediate problem of creating graphical outputs is getting the data into the right format. Although both Stata and R have plotting functions that graph summary statistics, a good rule of thumb for more complex visualizations is to ensure that each observation in the data set corresponds to one data point in the desired visualization. This task may seem simple, but it often requires the use of aggregation and reshaping operations discussed earlier in this chapter.
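The rule of thumb of one observation per plotted data point can be illustrated with a short sketch. It assumes Stata's built-in auto data set as a stand-in for project data; the variable names mean_price and n_cars are hypothetical.

```stata
* Minimal sketch on Stata's built-in auto data (not the project data).
* "mean_price" and "n_cars" are hypothetical variable names.
sysuse auto, clear

* Aggregate so that each observation corresponds to one bar in the graph:
* one row per origin group, holding the group mean and the group size
collapse (mean) mean_price = price (count) n_cars = price, by(foreign)
list

* Plot the aggregated values as they are, one bar per observation
graph bar (asis) mean_price, over(foreign) ytitle("Mean price (USD)")
```

Because the data are already in the shape of the graph, the plotting command itself stays short, and the aggregation step can be inspected on its own before any formatting is added.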
On the basis of DIME’s accumulated experience creating visualizations for impact evaluations, the DIME Analytics team has developed a few resources to facilitate this workflow. First of all, DIME Analytics maintains easily searchable data visualization libraries for both Stata (https://worldbank.github.io/stata-visual-library) and R (https://worldbank.github.io/r-econ-visual-library). These libraries feature curated data visualization examples, along with source code and example data sets, that provide a good sense of what data should look like before code is written to create a visualization. (For more tools and links to other data visualization resources, see the DIME Wiki at https://dimewiki.worldbank.org/Data_visualization.)
The ietoolkit package also contains two commands to automate common impact evaluation graphs: iegraph plots the values of coefficients for treatment dummies, and iekdensity displays the distribution of an outcome variable across groups and adds the treatment effect as a note. (For more on how to install and use commands from ietoolkit, see the DIME Wiki at https://dimewiki.worldbank.org/ietoolkit.) To create a uniform style for all data visualizations across a project, setting common formatting settings in the master script is recommended (see box 6.7 for an example of this process from the Demand for Safe Spaces project).
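One possible way to set common formatting in a master script is sketched below. It is an illustration under stated assumptions, not the project's actual settings: the global name plot_opts, the chosen scheme, and the options stored in the global are all hypothetical, and the built-in auto data set stands in for project data.

```stata
* Minimal sketch of shared graph formatting; not the project's settings.
* The global name "plot_opts" and the options it holds are hypothetical.

* Use one graph scheme for every graph created in the session
set scheme s1color

* Store formatting options once so that every graph can reuse them
global plot_opts graphregion(color(white)) ylabel(, angle(horizontal))

* Graphs created in different scripts can then share the same look
sysuse auto, clear
scatter price mpg, ${plot_opts}
histogram price, ${plot_opts}
```

Keeping these settings in one place means that a change in style requires editing only the master script, in the same way that the globals for control variables centralize changes to regression specifications.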