6.7 Visualizing data: A case study from the Demand for Safe Spaces project

BOX 6.7 VISUALIZING DATA: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

The Demand for Safe Spaces team defined graph settings as globals in the master analysis script. Using globals created a uniform visual style for all graphs produced by the project. These globals were then used throughout the project when creating graphs, as in the following command: twoway (bar cum x, color(${col_aux_light})) (lpoly y x, color(${col_mixedcar})) (lpoly z x, color(${col_womencar})), ${plot_options}.

/*************************************************************************
* Set plot options
*************************************************************************/

set scheme s2color

global grlabsize      4
global col_mixedcar   `" "18 148 144" "'
global col_womencar   purple
global col_aux_bold   gs6
global col_aux_light  gs12
global col_highlight  cranberry
global col_box        gs15
global plot_options   graphregion(color(white))   ///
                      bgcolor(white)              ///
                      ylab(, glcolor(${col_box})) ///
                      xlab(, noticks)
global lab_womencar   Reserved space
global lab_mixedcar   Public space

For the complete do-file, visit the GitHub repository at https://git.io/JtgeT.

Creating reproducible tables and graphs

Many outputs are created during the course of a project, including both raw outputs, such as tables and graphs, and final products, such as presentations, papers, and reports. During exploratory analysis, the team will consider different approaches to answer research questions and present answers. Although it is best to be transparent about the different specifications tried and tests performed, only a few will ultimately be considered "main results." These results will be exported from the statistical software; that is, they will be saved as tables and figures in file formats that the team can interact with more easily. For example, saving graphs as image files allows the team to review them quickly and to add them as exhibits to other documents. When these code outputs are first being created, it is necessary to agree on where to store them, what software and formats to use, and how to keep track of them. This discussion will save time and effort on two fronts: less time will be spent formatting and polishing tables and graphs that will not make their way into final research products, and it will be easier to remember the paths the team has already taken and avoid doing the same thing twice. This section addresses key elements to keep in mind when making workflow decisions and outputting results.

Managing outputs

Decisions about storage of outputs are limited by technical constraints and dependent on file format. Plain-text file formats like .tex and .csv can be managed through version-control systems like Git, as discussed in chapter 2. Binary outputs like Excel spreadsheets, .pdf files, PowerPoint presentations, or Word documents, by contrast, should be kept in a synced folder. Exporting all raw outputs as plain-text files, which can be done through all statistical software, facilitates the identification of changes in results. When code is rerun from the master script, the outputs will be overwritten, and any changes (for example, in coefficients or numbers of observations) will be flagged automatically. Tracking changes to binary files is more cumbersome, although there may be exceptions, depending on the version-control client used. GitHub Desktop, for example, can display changes in common binary image formats such as .png files in an accessible manner.
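A minimal Stata sketch of exporting a raw output as a version-controllable plain-text file. The file paths are illustrative, and the esttab command assumes the community-contributed estout package is installed:

```stata
* Illustrative example: export a regression table as a plain-text .tex file
* (assumes the estout package: ssc install estout)
sysuse auto, clear

regress price mpg weight

* The resulting .tex file is plain text, so Git flags any change in
* coefficients or observation counts when the code is rerun
esttab using "outputs/price_model.tex", replace se label booktabs
```

Because the output is plain text, rerunning the master script and inspecting the Git diff is enough to spot unintended changes in results.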

Knowing how code outputs will be used supports decisions regarding the best format for exporting them. It is often possible to export figures in different formats, such as .eps, .png, .pdf, or .jpg. However, the decision between using Office software such as Word and PowerPoint versus LaTeX and other plain-text formats may influence how the code is written, because this choice often necessitates the use of a particular command.
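As a sketch of how this choice plays out in Stata, the same graph can be exported in a raster format for quick review or Office documents and in a vector format for LaTeX (file names here are illustrative):

```stata
* Illustrative example: export one graph in multiple formats
sysuse auto, clear
scatter price mpg

graph export "outputs/price_mpg.png", replace width(2000)  // for review, Word, PowerPoint
graph export "outputs/price_mpg.eps", replace              // for LaTeX documents
```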

Outputs generally need to be updated frequently, and anyone who has tried to recreate a result after a few months probably knows that it can be hard to remember where the code that created it was saved. File-naming conventions and code organization, including easily searchable filenames and comments, play a key role in avoiding the need to rewrite scripts again and again. Maintaining one folder for final analysis scripts and another for draft code or exploratory analysis is recommended. The latter contains pieces of code that are kept for reference but have not been polished or refined for use in research products.

Once an output presents a result in the clearest manner possible, the corresponding script should be renamed and moved to the final analysis folder. It is typically desirable to link the names of outputs and scripts: for example, the script factor-analysis.do creates the graph factor-analysis.eps, and so on. Output creation should also be documented in the master script that runs the code: a few lines of comments before the line that runs each analysis script should list the data sets and functions that script requires and describe all outputs it creates (see box 6.8 for how this was done in the Demand for Safe Spaces project).
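One way such documentation might look in a master script, using hypothetical globals and file paths for illustration:

```stata
* Illustrative master-script excerpt; the globals ${code}, ${data},
* and ${out} are assumed to be defined earlier in the master script
*-------------------------------------------------------------------
* factor-analysis.do
*   Requires: ${data}/platform_survey_constructed.dta
*   Creates : ${out}/factor-analysis.eps
*-------------------------------------------------------------------
do "${code}/factor-analysis.do"
```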
