
7.1 Summary: Publishing reproducible research outputs
Whether writing a policy brief or academic article or producing some other kind of research product, it is important to create three final outputs that are ready for public release (or internal archiving if not public).
1. The data publication package. If the researcher holds the rights to distribute data that have been collected or obtained, this information should be made available to the public as soon as feasible. This release should
• Contain all nonidentifying variables and observations originally collected in a widely accessible format, with a data codebook describing all variables and values;
• Contain original documentation about the collection of the data, such as a survey questionnaire, API script, or data license;
• Be modified or masked only to correct errors and to protect the privacy of people described in the data; and
• Be appropriately archived and licensed, with clear terms of use.
2. The research reproducibility package. Either researchers or their organization will typically have the rights to distribute the code for data analysis, even if access to the data is restricted. This package should
• Contain all code required to derive analysis data from the published data;
• Contain all code required to reproduce research outputs from analysis data;
• Contain a README file with documentation on the use and structure of the code; and
• Be appropriately archived and licensed, with clear terms of use.
3. The written research product(s). These products should be
• Written and maintained as a dynamic document, such as a LaTeX file;
• Linked to the locations of all code outputs in the code directory;
• Recompiled with all final figures, tables, and other code outputs before release; and
• Authored, licensed, and published in accordance with the policies of the organization or publisher.
Key responsibilities for task team leaders and principal investigators
• Oversee the production of outputs, and know where to obtain legal or technical support if needed.
• Have original legal documentation available for all data.
• Understand the team’s rights and responsibilities regarding data, code, and research publication.
• Decide among potential publication locations and processes for code, data, and written materials.
• Verify that replication material runs and replicates the outputs in the written research product(s) exactly.
Key responsibilities for research assistants
• Rework code, data, and documentation to meet the specific technical requirements of archives or publishers.
• Manage the production process for collaborative documents, including technical administration.
• Integrate comments or feedback, and support proofreading, translation, typesetting, and other tasks.
Key resources
• Published data sets in the DIME Microdata Catalog at https://microdata.worldbank.org/index.php/catalog/dime/about
• Access to DIME LaTeX resources and exercises at https://github.com/worldbank/DIME-LaTeX-Templates
• DIME Research Reproducibility Standards at https://github.com/worldbank/dime-standards
• Template README for social science replication packages at https://doi.org/10.5281/zenodo.4319999
Publishing research papers and reports
Development research is increasingly a collaborative effort. This trend reflects changes in the economics discipline overall: the number of sole-authored research outputs is decreasing, and the majority of recent papers in top journals have three or more authors (Kuld and O’Hagan 2017). As a consequence, documents typically pass back and forth between several writers before they are ready for publication or release. As in all other stages of the research process, effective collaboration requires the adoption of tools and practices that enable version control and simultaneous contributions. This book, for example, was written in dynamic document formats (LaTeX and Markdown) and managed on GitHub. All the versions and the history of changes can be viewed at https://github.com/worldbank/dime-data-handbook. As outlined in chapter 6, dynamic documents are a way to simplify writing workflows: updates to code outputs that appear in these documents, such as tables and figures, can be passed into the final research output with a single click, rather than being copied and pasted or otherwise handled individually. Managing the writing process in this way improves organization and reduces error, greatly lowering the risk that materials will be compiled with out-of-date results or that completed work will be lost or duplicated.
Using LaTeX for written documents
As discussed in chapter 6, LaTeX is currently the most widely used software for dynamically managing formal manuscripts and policy outputs. It is also becoming more popular for shorter documents, such as policy briefs, with the proliferation of skills and templates for these kinds of products. LaTeX uses explicit references to the file path of each code output (such as tables and figures), which are reloaded from these locations every time the final document is compiled. This is not possible by default in, for example, Microsoft Word. There, you have to copy and paste each object whenever tables, graphs, or other inputs are updated. As time goes on, it becomes increasingly likely that a mistake will be made or something will be missed. In LaTeX, instead of writing in a “what-you-see-is-what-you-get” mode as is done in Word, writing is done in plain text in a .tex file, interlaced with coded instructions formatting the document and linking to exhibits (similar to HTML). LaTeX manages tables and figures dynamically and includes commands for simple markup such as font styles, paragraph formatting, section headers, and the like. It includes special controls for footnotes and endnotes, mathematical notation, and bibliography preparation. It also allows publishers to apply global styles and templates to written material, reformatting entire documents in a house style with only a few keystrokes.
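To make the idea of file-path references concrete, the following is a minimal sketch rather than a prescribed template. The file name summary_stats.tex, its placeholder contents, and the captions are hypothetical; the figure uses example-image, a stand-in graphic that is assumed to be available through the mwe package in the local TeX distribution.

```latex
% Stand-in for a table fragment that analysis code would normally export.
% The file name and the placeholder values are hypothetical.
\begin{filecontents}{summary_stats.tex}
\begin{tabular}{lrr}
Variable & Mean & N \\
\hline
Placeholder A & 0.00 & 0 \\
\end{tabular}
\end{filecontents}

\documentclass{article}
\usepackage{graphicx} % provides \includegraphics

\begin{document}

\begin{table}
  \centering
  \caption{Summary statistics (placeholder)}
  % Read from disk at every compile, so re-exporting the file updates the table.
  \input{summary_stats.tex}
\end{table}

\begin{figure}
  \centering
  % example-image is assumed to be installed with the mwe package;
  % replace it with the path to a figure exported by the analysis code.
  \includegraphics[width=0.6\textwidth]{example-image}
  \caption{Placeholder figure}
\end{figure}

\end{document}
```

In a real project the filecontents block would be dropped, and the paths passed to \input and \includegraphics would point to wherever the analysis code writes its outputs.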
Although LaTeX can produce complex formatting, such formatting is rarely needed for academic publishing because academic manuscripts are usually reformatted according to the style of the publisher. (Researchers creating policy briefs and other self-produced documents may desire extensive typesetting and investments in custom templates and formatting.) In academia at least, it is rarely worth the investment to go beyond basic LaTeX tools: the title page, sections and subsections, figures and tables, mathematical equations, bolding and italics, footnotes and endnotes, and, last but not least, references and citations. Many of these functionalities, including dynamic updating of some outputs, can be achieved in Microsoft Word through the use of plugins and careful workflows. If it is possible to maintain such a workflow, then this approach is acceptable, but moving toward the adoption of LaTeX is recommended when possible.
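As an illustration of how little markup these basic tools require, here is a minimal sketch of a manuscript skeleton. The title, author, section names, and the regression equation are placeholders chosen for illustration, not content from any particular paper.

```latex
\documentclass{article}

\title{Placeholder Title}
\author{Placeholder Author}
\date{\today}

\begin{document}

\maketitle % builds the title block from \title, \author, and \date

\section{Introduction}
Plain text with \textbf{bold} and \textit{italic} words.\footnote{A footnote, numbered automatically.}

\subsection{Model}
% A generic linear model, used only to show equation numbering and referencing.
\begin{equation}
  y_i = \alpha + \beta x_i + \varepsilon_i
  \label{eq:model}
\end{equation}
Equation~\ref{eq:model} is numbered and cross-referenced automatically.

\end{document}
```

Because the numbering of sections, footnotes, and equations is generated at compile time, reordering material does not require renumbering anything by hand. The cross-reference typically resolves after the document has been compiled twice.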
One of the most important tools available in LaTeX is the BibTeX citation and bibliography manager (Kopka and Daly 1995). BibTeX keeps all of the references that might be used in a .bib file and then references them using a simple command typed directly in the document. Specifically, LaTeX inserts references in text using the \cite command. Once this is written, LaTeX automatically pulls all the citations into text and creates a complete bibliography based on the citations used to compile the document. The system makes it possible to specify exactly how references should be displayed in text (for example, as superscripts or as inline references) as well as how the bibliography should be styled and in what order (such as Chicago, Modern Language Association, Harvard, or other common styles). The same principles that apply to figures and tables are therefore applied here: references are changed in one place (the .bib file), and then everywhere they are used they are updated consistently with a single process. BibTeX is used so widely that it is natively integrated in Google Scholar. Because different publishers have different requirements, it is quite useful to be able to adapt this and other formatting very quickly, including through publisher-supplied templates where available.
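The citation workflow described above can be sketched as follows. This is a minimal example under stated assumptions: the file name references.bib, the citation key, and every field of the entry are hypothetical placeholders, and the plain bibliography style is only one of many possible choices.

```latex
% Hypothetical placeholder entry; a real project keeps its entries in its own .bib file.
\begin{filecontents}{references.bib}
@article{author2020example,
  author  = {Author, Example},
  title   = {A Placeholder Title},
  journal = {Journal of Placeholder Studies},
  year    = {2020},
  volume  = {1},
  pages   = {1--10}
}
\end{filecontents}

\documentclass{article}

\begin{document}

% \cite inserts the in-text reference for the given key.
Earlier work addresses this question \cite{author2020example}.

% Changing this one line restyles every citation and the whole bibliography.
\bibliographystyle{plain}
% Only entries actually cited in the text appear in the compiled bibliography.
\bibliography{references}

\end{document}
```

With the classic BibTeX toolchain, the document is typically compiled by running pdflatex, then bibtex, then pdflatex twice more, so that the citations and the bibliography resolve.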
Because it follows a standard code format, LaTeX has one more useful trick: the ability to convert raw documents into Word or several other formats using utilities such as pandoc, a free and open-source document converter (https://pandoc.org). Even though conversion to Word is required for some academic publishers and can even be preferred for some policy outputs, using LaTeX to prepare these products is still recommended. Exporting to Word should be done only at the final stage, when submitting materials. A .csl file (https://github.com/citation-style-language/styles), which styles the citations in a document, can also be applied automatically in this process so references follow the style of nearly any journal desired. Therefore, even if it is necessary to provide .docx versions or track-change versions of materials to others, these versions can be created effortlessly from a LaTeX document, using external tools like Word’s compare feature to generate integrated track-change versions when needed.
Getting started with LaTeX as a team
Although starting to use LaTeX may be challenging, it offers valuable control over the writing process. Because it is written in a plain-text file format, .tex can be version-controlled using Git. Contributions and version histories can be managed using the same system recommended for data work. DIME Analytics has created a variety of templates and resources that can be adapted to different needs, available at https://github.com/worldbank/DIME-LaTeX-Templates. Integrated editing and compiling tools like TeXstudio (https://www.texstudio.org) and atom-latex (https://atom.io/packages/atom-latex) offer the most flexibility to work with LaTeX in teams.
Although ultimately worth the effort, setting up LaTeX environments locally is not always simple, particularly for researchers who are new to working with plain-text code and file management. LaTeX requires all formatting to be done in its special code language and is not always informative when something has been done wrong. This situation can be off-putting very quickly for people who simply want to begin writing, and those who are not used to programming may find it difficult to acquire the necessary knowledge.
Cloud-based implementations of LaTeX can make it easier for the team to use LaTeX without all members having to invest in new skills or set up matching software environments; they can be particularly useful for first forays into LaTeX writing. One example of cloud-based implementation is Overleaf (https://www.overleaf.com). Most such sites