
BOX 6.8 MANAGING OUTPUTS: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT
It is important to document which data sets are required as inputs to each script and which data sets or output files each script creates. The Demand for Safe Spaces team documented this information both in the header of each script and in a comment in the master do-file where the script was called.
The following is the header of an analysis script called response.do, which requires the file platform_survey_constructed.dta and generates the file response.tex. Having this information in the header allows people reading the code to check that they have access to all of the necessary files before trying to run the script.
/*****************************************************************************************
*             Demand for "Safe Spaces": Avoiding Harassment and Stigma                  *
*****************************************************************************************

    OUTLINE:  PART 1: Load data
              PART 2: Run regressions
              PART 3: Export table

    REQUIRES: ${dt_final}/platform_survey_constructed.dta
    CREATES:  ${out_tables}/response.tex

****************************************************************************************/
To provide an overview of the different subscripts involved in the project, this information was copied into the master do-file where the script above is called, and the same was done for all of the scripts called from that master, as follows:
* Appendix tables =======================================================================

*****************************************************************************************
* Table A1: Sample size description                                                     *
*---------------------------------------------------------------------------------------*
* REQUIRES: ${dt_final}/pooled_rider_audit_constructed.dta                              *
*           ${dt_final}/platform_survey_constructed.dta                                 *
* CREATES:  ${out_tables}/sample_table.tex                                              *
*****************************************************************************************

do "${do_tables}/sample_table.do"

*****************************************************************************************
* Table A3: Correlation between platform observations data and rider reports            *
*---------------------------------------------------------------------------------------*
* REQUIRES: ${dt_final}/pooled_rider_audit_constructed.dta                              *
* CREATES:  ${out_tables}/mappingridercorr.tex                                          *
*****************************************************************************************

do "${do_tables}/mappingridercorr.do"

*****************************************************************************************
* Table A4: Response to platform survey and IAT                                         *
*---------------------------------------------------------------------------------------*
* REQUIRES: ${dt_final}/platform_survey_constructed.dta                                 *
* CREATES:  ${out_tables}/response.tex                                                  *
*****************************************************************************************

do "${do_tables}/response.do"
For the complete analysis script, visit the GitHub repository at https://git.io/JtgYB. For the master do-file, visit the GitHub repository at https://git.io/JtgY6.
Exporting analysis outputs
As discussed briefly in the previous section, it is not necessary to export each and every table and graph created during exploratory analysis. Most statistical software allows results to be viewed interactively, and doing so is often preferred at this stage. Final analysis scripts, in contrast, must export outputs that are ready to be included in a paper or report. No manual edits, including formatting, should be necessary after final outputs are exported. Manual edits are difficult to reproduce; the less they are used, the more reproducible the output is. Writing code to implement a small formatting adjustment in a final output may seem unnecessary, but making changes to the output is inevitable, and completely automating each output will always save time by the end of the project. By contrast, it is important not to spend much time formatting tables and graphs until it has been decided which ones will be included in research products; see Andrade, Daniels, and Kondylis (2020) for details and workflow recommendations. Polishing final outputs can be a time-consuming process and should be done as few times as possible.
It cannot be stressed too much: do not set up a workflow that requires copying and pasting results. Copying results from Excel to Word is error-prone and inefficient. Copying results from a software console is even more inefficient and totally unnecessary. The amount of work needed in a copy-paste workflow increases rapidly with the number of tables and figures included in a research output and so do the chances of having the wrong version of a result in a paper or report.
Numerous commands are available for exporting outputs from both R and Stata. For exporting tables, Stata 17 includes more advanced built-in capabilities. Some currently popular user-written commands are estout (Jann 2005), outreg2 (Wada 2014), and outwrite (Daniels 2019). In R, popular tools include stargazer (Hlavac 2015), huxtable (Hugh-Jones 2021), and ggsave (part of ggplot2; Wickham 2016). They allow for a wide variety of output formats. Using formats that are accessible and, whenever possible, lightweight is recommended. Accessible means that other people can open them easily. For figures in Stata, accessibility means always using graph export to save images as .jpg, .png, .pdf, and so forth, instead of graph save, which creates a .gph file that can only be opened by Stata. Some publications require “lossless” .tif or .eps files, which are created by specifying the desired extension. Whichever format is used, the file extension must always be specified explicitly.

iebaltab is a Stata command that generates balance tables in both Excel and .tex. It is part of the ietoolkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/iebaltab.

ieddtab is a Stata command that generates tables from difference-in-differences regressions in both Excel and .tex. It is part of the ietoolkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/ieddtab.

Dynamic documents are files that include direct references to exported materials and update them automatically in the output. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Dynamic_documents.
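As a minimal sketch of the graph export recommendation, a figure can be saved in both a raster and a vector format simply by specifying the extension explicitly (the ${out_figures} global and the file names here are hypothetical):

```stata
* Create a simple graph from a built-in example data set
sysuse auto, clear
scatter price mpg

* Export with explicit extensions; graph export infers the format from them
graph export "${out_figures}/price_mpg.png", replace width(2000)  // raster, e.g., for slides
graph export "${out_figures}/price_mpg.pdf", replace              // vector, for papers
```

Because the format is inferred from the extension, the same command also produces .eps or .tif files when a publication requires them.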
There are fewer options for formatting table files. Given the recommendation to use dynamic documents, which are discussed in more detail both in the next section and in chapter 7, exporting tables to .tex is preferred. Excel .xlsx files and .csv files are also commonly used, but they often require the extra step of copying the tables into the final output. The ietoolkit package includes two commands to export formatted tables, automating the creation of common outputs and saving time for research; for instructions and details, see the DIME Wiki at https://dimewiki.worldbank.org/ietoolkit. The iebaltab command creates and exports balance tables to Excel or LaTeX, and the ieddtab command does the same for difference-in-differences regressions.
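As a hedged sketch, iebaltab might be called as follows to export a balance table straight to .tex (the variable names and output path are hypothetical; see the DIME Wiki for the full option list):

```stata
* Requires the ietoolkit package: ssc install ietoolkit
iebaltab age income education,            ///
    grpvar(treatment)                     /// variable defining the treatment arms
    savetex("${out_tables}/balance.tex")  /// export directly to LaTeX
    replace
```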
If it is necessary to create a table with a very specific format that is not automated by any known command, the table can be written manually (using Stata’s filewrite and R’s cat, for example). Manually writing the file often makes it possible to write a cleaner script that focuses on the econometrics, not on complicated commands to create and append intermediate matrixes. Final outputs should be easy to read and understand with only the information they contain. Labels and notes should include all of the relevant information that is not otherwise visible in the graphical output. Examples of information that should be included in labels and notes are sample descriptions, units of observation, units of measurement, and variable definitions. For a checklist with best practices for generating informative and easy-to-read tables, see the DIME Wiki at https://dimewiki.worldbank.org/Checklist:_Submit_Table.
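To illustrate the manual approach, the following sketch writes a small .tex table using Stata’s file commands (the data set, the computed statistic, and the output path are all hypothetical):

```stata
* Compute a statistic to report
sysuse auto, clear
summarize price
local mean_price = r(mean)

* Write a minimal LaTeX table line by line
tempname fh
file open `fh' using "${out_tables}/custom_table.tex", write replace
file write `fh' "\begin{tabular}{lc}" _n
file write `fh' "\hline" _n
file write `fh' "Outcome & Mean \\" _n
file write `fh' "Price (USD) & " %9.2f (`mean_price') " \\" _n
file write `fh' "\hline" _n
file write `fh' "\end{tabular}" _n
file close `fh'
```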
Increasing efficiency of analysis with dynamic documents
It is strongly recommended to create final products using software that allows for direct linkage to raw outputs, so that final products are updated in the paper or presentation every time changes are made to the raw outputs. Files that have this feature are called dynamic documents. Dynamic documents are a broad class of tools that enable a streamlined, reproducible workflow. The term “dynamic” can refer to any document-creation technology that allows the inclusion of explicitly encoded links to output files. Whenever outputs are updated and a dynamic document is reloaded or recompiled, it will automatically include all changes made to all outputs without any additional intervention from the user. This is not possible in tools like Microsoft Office, although tools and add-ons can produce similar functionality. In Word, by default, each object has to be copied and pasted individually whenever tables, graphs, or other inputs have to be updated. This workflow becomes more complex as the number of inputs grows, increasing the likelihood that mistakes will be made or updates will be missed. Dynamic documents prevent this from happening by managing the compilation of documents and the inclusion of inputs in a single integrated process, so that copying and pasting can be skipped altogether.
Conducting dynamic exploratory analysis
If all team members working on a dynamic document are comfortable using the same statistical software, built-in dynamic document engines are a good option for conducting exploratory analysis. These tools can be used to write both text (often in Markdown; see https://www.markdownguide.org) and code in the script, and the output is usually a .pdf or .html file including code, text, and outputs. These kinds of complex dynamic document tools are typically best used by team members working most closely with code and can be great for creating exploratory analysis reports or paper appendixes that include large chunks of code and dynamically created graphs and tables. RMarkdown (.Rmd) is the most widely adopted solution in R; see https://rmarkdown.rstudio.com. Stata offers a built-in command for dynamic documents, dyndoc, and user-written commands are also available, such as markstat (Rodriguez 2017), markdoc (Haghish 2016), webdoc (Jann 2017), and texdoc (Jann 2016). The advantage of these tools in comparison with LaTeX is that they create full documents from within statistical software scripts, so the task of running the code and compiling the document is reduced to a single step.
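As a minimal sketch of this single-step workflow (the source file name and its contents are hypothetical), a dyndoc source file mixes Markdown text with Stata code between dynamic tags, and one command compiles text, code, and results together:

```stata
* Contents of a hypothetical source file, report.txt:
*   # Exploratory analysis
*   The regression below is re-run on the latest data each time this compiles.
*   <<dd_do>>
*   sysuse auto, clear
*   regress price mpg
*   <</dd_do>>

* Compile the document, including code and results, in one step:
dyndoc "report.txt", saving("report.html") replace
```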
Documents called “notebooks” (such as Jupyter Notebook; see https://jupyter.org) work similarly, because they also include the underlying code that created the results in the document. These tools are usually appropriate for short or informal documents because users who are not familiar with them find it difficult to edit the content, and they often do not offer formatting options as extensive as those in Word. Other simple tools for dynamic documents do not require direct operation of the underlying code or software, simply access to the updated outputs. For example, Dropbox Paper is a free online writing tool that can be linked to files in Dropbox, which are updated automatically anytime the file is replaced. These tools have limited functionality in terms of version control and formatting and should never include any references to confidential data, but they do offer extensive features for collaboration and can be useful for working on informal outputs. Markdown files on GitHub can provide similar functionality through the browser and are version-controlled. However, as with other Markdown options, the need to learn a new syntax may discourage take-up among team members who do not work extensively with GitHub.
Whatever software is used, what matters is that a self-updating process is implemented for tables and figures. The recommendations given here are best practices, but each team has to find out what works for it. If a team has decided to use Microsoft Office, for example, there are still a few options for avoiding a copy-and-paste workflow. The easiest solution may be for the less code-savvy members of the team to develop the text of the final output pointing to exhibits that are not included inline. If all figures and tables are presented at the end of the file, whoever is developing the code can export them into a Word document using Markdown or simply produce a separate .pdf file for tables and figures, so at least this part of the manuscript can be updated quickly when the results change. Finally, statistical programming languages can often export directly to binary formats: for example, the putexcel and putdocx commands in Stata can update results in Office documents while preserving formatting.
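As one hedged sketch of this direct-export approach, putexcel can write estimation results straight into spreadsheet cells (the workbook path and model are hypothetical):

```stata
* Estimate a simple model on a built-in example data set
sysuse auto, clear
regress price mpg

* Point putexcel at a workbook, then write labeled results into cells
putexcel set "${out_tables}/results.xlsx", replace
putexcel A1 = "Coefficient on mpg"
putexcel B1 = _b[mpg], nformat(number_d2)
putexcel A2 = "Observations"
putexcel B2 = e(N)
```

Re-running the script overwrites the cells, so the workbook always reflects the latest estimates without manual copying.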
Using LaTeX for dynamic research outputs
Although formatted-text software such as Word and PowerPoint is still prevalent, researchers are increasingly choosing to prepare final outputs like documents and presentations using LaTeX, a document preparation and typesetting system with a unique code syntax. Despite LaTeX’s significant learning curve, its enormous flexibility in terms of operation, collaboration, output formatting, and styling makes it DIME’s preferred choice for most large technical outputs. In fact, LaTeX operates behind the scenes of many other dynamic document tools. Therefore, researchers should learn LaTeX as soon as possible; DIME Analytics has developed training materials and resources available on GitHub at https://github.com/worldbank/DIME-LaTeX-Templates.
The main advantage of using LaTeX is that it updates outputs every time the document is compiled, while still allowing for text to be added and formatted extensively to publication-quality standards. Additionally, because of its popularity in the academic community, the cost of entry for a team is often relatively low. Because .tex files are plain text, they can be version-controlled using Git. Creating documents in LaTeX using an integrated writing environment such as TeXstudio, TeXmaker, or LyX is great for outputs that focus mainly on text but also include figures and tables that may be updated. It is good for adding small chunks of code into an output. Finally, some publishers make custom LaTeX templates available or accept manuscripts as raw .tex files, so research outputs can be formatted more easily into custom layouts.
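A minimal sketch of this workflow: the document below references exported files by name (response.tex is the table exported in the master do-file above; the figure name is hypothetical), so recompiling picks up any re-exported results with no copying and pasting:

```latex
\documentclass{article}
\usepackage{graphicx}
\usepackage{booktabs}
\begin{document}

\begin{table}[ht]
  \centering
  \caption{Response to platform survey}
  % Re-created by response.do; updated automatically on each compile
  \input{response.tex}
\end{table}

\begin{figure}[ht]
  \centering
  % Exported by graph export; hypothetical file name
  \includegraphics[width=0.8\textwidth]{price_mpg.pdf}
\end{figure}

\end{document}
```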
Looking ahead
This chapter discussed the steps needed to create analysis data sets and outputs from original data. Combining the observed variables of interest