
and, for each output it creates, explicitly loads data before analyzing them. This setup encourages data manipulation to be done earlier in the workflow (that is, in separate cleaning and construction scripts). It also prevents the common problem of having analysis scripts that depend on other analysis scripts being run before them. Such dependencies tend to require manual instructions so that all necessary chunks of code are run in the right order. Coding each task so that it is completely independent of all other code, except for the master script, is recommended. It is possible to go so far as to code every output in a separate script, but the key is to make sure that it is clear which data sets are used for each output and which code chunks implement each piece of analysis (see box 6.5 for an example of an analysis script structured like this).
BOX 6.5 WRITING ANALYSIS CODE: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT
The Demand for Safe Spaces team split the analysis scripts into one script per output and reloaded the analysis data at the start of each script. This process ensured that each final exhibit could be generated independently of the others, directly from the constructed analysis data. No variables were constructed in the analysis scripts: the only transformations performed were subsetting the data or aggregating them to a higher unit of observation, which guaranteed that the same data were used across all analysis scripts. The following is an example of a short analysis do-file:
/****************************************************************************************
*          Demand for "Safe Spaces": Avoiding Harassment and Stigma
*****************************************************************************************

    OUTLINE:    PART 1: Load data
                PART 2: Run regressions
                PART 3: Export table

    REQUIRES:   ${dt_final}/platform_survey_constructed.dta
    CREATES:    ${out_tables}/priming.tex

    WRITTEN BY: Luiza Andrade

*****************************************************************************************
*   PART 1: Load data
****************************************************************************************/

    use "${dt_final}/platform_survey_constructed.dta", clear

/****************************************************************************************
*   PART 2: Run regressions
****************************************************************************************/

    reg scorereputation i.q_group, robust
    est sto priming1
    sum scorereputation
    estadd scalar mean `r(mean)'

    reg scoresecurity i.q_group, robust
    est sto priming2

    sum scoresecurity
    estadd scalar mean `r(mean)'

/****************************************************************************************
*   PART 3: Export table
****************************************************************************************/

    esttab priming1 priming2                       ///
        using "${out_tables}/priming.tex",         ///
        ${star}                                    ///
        tex se replace label                       ///
        nomtitles nonotes                          ///
        drop(1.q_group)                            ///
        b(%9.3f) se(%9.3f)                         ///
        scalar("mean Sample mean")                 ///
        posthead("\hline \\[-1.8ex]")              ///
        postfoot("\hline\hline \end{tabular}")
For the complete do-file, and to see how the regression results were exported to a table, visit the GitHub repository at https://git.io/JtgOk.
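To see how per-output scripts like the one in box 6.5 fit together, the following is a minimal sketch of a master do-file that could run them. The folder paths, script names, and significance levels shown here are illustrative stand-ins, not the project's actual code; the point is that the master script defines the shared globals once and then runs each per-output script, which can be executed in any order because each one loads its own data.

    * Master do-file (illustrative sketch; paths and script names are hypothetical)

    * Folder globals referenced by every analysis script
    global dt_final   "data/final"
    global out_tables "outputs/tables"

    * Shared esttab settings, such as the significance stars stored in ${star}
    global star "star(* 0.10 ** 0.05 *** 0.01)"

    * Each analysis script loads its own data, so the order of execution does not matter
    do "code/analysis/priming.do"
    do "code/analysis/another_output.do"

Because ${star} is defined only once, changing the significance levels in the master script updates every table that uses it.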
There is nothing wrong with code files being short and simple. In fact, analysis scripts should be as simple as possible, so that whoever reads them can focus on the concepts, not the coding. Research questions and statistical decisions should be stated explicitly in the code through comments, and their implementation should be easy to identify from the way the code is written. Such decisions include, for example, how standard errors are clustered, how the sample is defined, and which variables are controlled for. If the team is working with multiple analysis data sets, the name of each data set should describe the sample and unit of observation it contains. As decisions about model specification are made, the team can encode them as functions and globals (or objects) in the master script and use them across analysis scripts. The use of functions and globals helps to ensure that specifications are consistent throughout the analysis. It also makes the code more dynamic, because specifications and results can be updated through the master script without changing every analysis script (see box 6.6 for an example of this from the Demand for Safe Spaces project).
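As a hypothetical sketch of that pattern (the control-variable and standard-error globals below are invented for illustration and are not the project's actual specification), a specification defined once in the master script can be referenced by every regression:

    * In the master do-file: specification choices made once, in one place
    global controls   "age income i.education"   // hypothetical control set
    global se_options "vce(robust)"              // standard-error choice

    * In any analysis do-file: every model reuses the same specification
    use "${dt_final}/platform_survey_constructed.dta", clear
    reg scoresecurity i.q_group ${controls}, ${se_options}

Updating ${controls} or ${se_options} in the master script then propagates the change to every analysis script that uses them, without editing each one.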