
6.1 Summary: Constructing and analyzing research data

Moving from raw data to the final data sets used for analysis almost always requires combining and transforming variables into the relevant indicators and indexes. These constructed variables are then used to create analytical outputs, ideally using a dynamic document workflow. Construction and analysis involve three main steps:

1. Construct variables and purpose-built data sets. The process of transforming observed data points into abstract or aggregate variables and analyzing them properly requires guidance from theory and is unique to each study. However, it should always follow these protocols:

• Maintain separate construction and analysis scripts, and put the appropriate code in the corresponding script, even if they are being developed or executed simultaneously.
• Merge, append, or otherwise combine data from different sources or units of observation, and transform data to appropriate levels of observation or aggregation.
• Create purpose-built analytical data sets, name and save them appropriately, and use them for the corresponding analytical tasks, rather than building a single analytical data set.
• Carefully document each of these steps in plain language so that the rationale behind each research decision is clear for any consumer of research.

2. Generate and export exploratory and final outputs. Tables and figures are the most common types of analytical outputs. All outputs must be well organized and fully replicable. When creating outputs, the following tasks are required:

• Name exploratory outputs descriptively, and store them in easily viewed formats.
• Store final outputs separately from exploratory outputs, and export them using publication-quality formats.
• Version-control all code required to produce all outputs from analysis data.
• Archive code when analyses or outputs are not used, with documentation for later recovery.

3. Set up an efficient workflow for outputs. Efficient workflow means the following:

• Exploratory analyses are immediately accessible, ideally created with dynamic documents, and can be reproduced by executing a single script.
• Code and outputs are version-controlled so it is easy to track where changes originated.
• Final figures, tables, and other code outputs are exported from the statistical software fully formatted, and the final document is generated in an automated manner, so that no manual workflow is needed to update documents when changes are made to outputs.

Key responsibilities for task team leaders and principal investigators

• Provide the theoretical framework for and supervise the production of analytical data sets and outputs, reviewing statistical calculations and code functionality.
• Approve the final list of analytical data sets and their accompanying documentation.
• Provide rapid review and feedback for exploratory analyses.
• Advise on file format and design requirements for final outputs, including dynamic documents.



Key responsibilities for research assistants

• Implement variable construction and analytical processes through code.
• Manage and document data sets so that other team members can understand them easily.
• Flag ambiguities, concerns, or gaps in translation from theoretical framework to code and data.
• Draft and organize exploratory outputs for rapid review by management.
• Maintain release-ready code and organize output with version control so that current versions of outputs are always accessible and final outputs are easily extracted from unused materials.

Key resources

• DIME’s Research Assistant Onboarding Course, for technical sessions on best practices:
  – See variable construction at https://osf.io/k4tr6
  – See analysis at https://osf.io/82t5e

• Visual libraries containing well-styled, reproducible graphs in an easily browsable format:
  – Stata Visual Library at https://worldbank.github.io/stata-visual-library
  – R Econ Visual Library at https://worldbank.github.io/r-econ-visual-library

• Andrade, Daniels, and Kondylis (2020), which discusses best practices and links to code demonstrations of how to export tables from Stata, at https://blogs.worldbank.org/impactevaluations/nice-and-fast-tables-stata

The unit of observation is the type of entity that is described by a given data set. In tidy data sets, each row should represent a distinct entity of that type. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Unit_of_Observation.

Creating analysis data sets

This chapter assumes that the analysis is starting from one or multiple well-documented tidy data sets (Wickham and Grolemund 2017). It also assumes that these data sets have gone through quality checks and have incorporated any corrections needed (see chapter 5). The next step is to construct the variables that will be used for analysis—that is, to transform the cleaned data into analysis data. In rare cases, data might be ready for analysis as acquired, but in most cases the information will need to be prepared by integrating different data sets and creating derived variables (dummies, indexes, and interactions, to name a few; for an example, see Adjognon, van Soest, and Guthoff 2019). The derived indicators to be constructed should be planned during research design, with the preanalysis plan serving as a guide. During variable construction, data will typically be reshaped, merged, and aggregated to change the level of the data points from the unit of observation in the original data set(s) to the unit of analysis.
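To make this concrete, here is a minimal Stata sketch of a few typical construction steps, assuming a hypothetical cleaned household data set with placeholder variables such as income_total, hh_size, asset ownership dummies, treatment, and female_head; none of the file or variable names refer to a real project.

```stata
* Minimal construction sketch (hypothetical file and variable names)
use "data/cleaned/household_clean.dta", clear

* Derived dummy: household falls below a hypothetical poverty threshold
gen poor = (income_total < 1.90 * 365 * hh_size) if !missing(income_total)
label variable poor "Household below poverty threshold (1/0)"

* Simple additive index built from asset ownership dummies
egen asset_index = rowtotal(owns_radio owns_bicycle owns_phone)
label variable asset_index "Number of assets owned (0-3)"

* Interaction between treatment assignment and a baseline characteristic
gen treat_female = treatment * female_head
label variable treat_female "Treatment x female household head"

* Save the constructed data with a purpose-specific name
save "data/constructed/household_constructed.dta", replace
```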

Variable construction is the process of transforming cleaned data into analysis data by creating the derived indicators that will be analyzed. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/Variable_Construction.

Each analysis data set is built to answer a specific research question. Because the required subsamples and units of observation often vary for different pieces of the analysis, it will be necessary to create purpose-built analysis data sets for each one. In most cases, it is not good practice to try to create a single “one-size-fits-all” analysis data set. For a concrete example of what this means, think of an agricultural intervention that was randomized across villages and affected only certain plots within each village. The research team may want to run household-level regressions on income, test for plot-level productivity gains, and check whether village characteristics are balanced. Having a separate data set for each of these three pieces of analysis will result in cleaner, more efficient, and less error-prone analytical code than starting from a single analysis data set and transforming it repeatedly.
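A hedged sketch of what this might look like in Stata, starting from a hypothetical plot-level cleaned data set that contains plot_income, treatment, hh_id, and village_id (all placeholder names):

```stata
* Hypothetical plot-level cleaned data: one row per plot, identified by plot, household, and village
use "data/cleaned/plots_clean.dta", clear

* 1. Plot-level analysis data: plot productivity outcomes saved as their own purpose-built data set
save "data/analysis/plot_productivity.dta", replace

* 2. Household-level analysis data: aggregate plot income to the household
collapse (sum) income = plot_income (first) treatment, by(village_id hh_id)
save "data/analysis/household_income.dta", replace

* 3. Village-level analysis data: aggregate household values to village means for balance checks
collapse (mean) income treatment, by(village_id)
save "data/analysis/village_balance.dta", replace
```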

Organizing data analysis workflows

Variable construction follows data cleaning and should be treated as a separate task for two reasons. First, doing so helps to differentiate correction of errors (necessary for all data uses) from creation of derived indicators (necessary only for specific analyses). Second, it helps to ensure that variables are defined consistently across data sets. For example, take a project that has a baseline survey and an endline survey. Unless the two data collection instruments are exactly the same, which is preferable but rare, the data cleaning for each of these rounds will require different steps and will therefore need to be done separately. However, the analysis indicators must be constructed in the same way for both rounds so that they are exactly comparable. Doing this all correctly will therefore require at least two separate cleaning scripts and a unified construction script. Maintaining only one construction script guarantees that, if changes are made for observations from one data set, they will also be made for the other.
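One possible shape for such a workflow is sketched below with hypothetical file and variable names: the round-specific cleaning happens elsewhere, while a single construction script loops over both rounds so that every indicator is defined in exactly one place.

```stata
* construction.do -- defines analysis indicators identically for baseline and endline
* (hypothetical file and variable names)
foreach round in baseline endline {

    use "data/cleaned/`round'_clean.dta", clear

    * Indicator definitions live only in this script, so the two rounds stay comparable
    gen income_pc = income_total / hh_size
    label variable income_pc "Household income per capita"

    egen food_insecure = rowmax(skipped_meal went_hungry)
    label variable food_insecure "Any food insecurity reported (1/0)"

    save "data/constructed/`round'_constructed.dta", replace
}
```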

In the research workflow, variable construction precedes data analysis, because derived variables need to be created before they can be analyzed. In practice, however, during data analysis, it is common to revisit construction scripts continuously and to explore various subsets and transformations of the data. Even if construction and analysis tasks are done concurrently, they should always be coded in separate scripts. If every script that creates a table starts by loading a data set, reorganizing it into subsets, and manipulating variables, any edits to these construction tasks need to be replicated in all analysis scripts. Doing this work separately in each analysis script increases the chances that at least one script will end up with a different sample or variable definition. Coding all variable construction and data transformation in a unified script, separate from the analysis code, prevents such problems and ensures consistency across different outputs.
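In practice, this separation often takes the form of a main script that runs cleaning, construction, and analysis scripts in sequence, so that a change to a variable definition is made once and propagates to every output. The sketch below uses placeholder script names only.

```stata
* main.do -- minimal sketch of a project main script (placeholder script names)
* Each task lives in its own script; analysis scripts only ever load constructed data.

do "code/cleaning/clean_baseline.do"    // corrections only, no derived indicators
do "code/cleaning/clean_endline.do"

do "code/construction/construction.do"  // all derived indicators, defined once

do "code/analysis/regressions.do"       // loads constructed data, exports tables
do "code/analysis/figures.do"           // loads constructed data, exports figures
```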

Integrating multiple data sources

To create the analysis data set, it is typically necessary to combine information from different data sources. Data sources can be combined by adding more observations, called “appending,” or by adding more variables, called “merging.” These operations are also commonly referred to as “data joins.” As discussed in chapter 3, any process of combining data sets should be documented using data flowcharts, and different data sources should be combined only in accordance with the data linkage table. For example, administrative data may be merged with survey data to include demographic information in the analysis, geographic information may be integrated to include location-specific controls, or baseline and endline data may be appended to create a panel data set. To understand how to perform such operations, it is necessary to consider the unit of observation and the identifying variables for each data set.

A data flowchart is the component of a data map that lists how the data sets acquired for the project are intended to be combined to create the data sets used for analysis. For more details and an example, see the DIME Wiki at https://dimewiki.worldbank.org/Data_Flow_Charts.

The data linkage table is the component of a data map that lists all the data sets in a particular project and explains how they are linked to each other. For more details and an example, see the DIME Wiki at https://dimewiki.worldbank.org/Data_Linkage_Table.

iecodebook is a Stata command to document and execute repetitive data-cleaning tasks such as renaming, recoding, and labeling variables; to create codebooks for data sets; and to harmonize and append data sets containing similar variables. It is part of the iefieldkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/iecodebook.

Appending data sets is the simplest approach because the resulting data set always includes all rows and all columns from each data set involved. In addition to combining data sources from multiple rounds of data acquisition, appends are often used to combine data on the same unit of observation from multiple study contexts, such as different regions or countries, when the different tables to be combined include the same variables but not the same instances of the unit of observation. Most statistical software requires identical variable names across all data sets appended, so that data points measuring the same attribute are placed in a single column in the resulting combined data set. A common source of error in appending data sets is the use of different units of measurement or different codes for categories in the same variables across the data sets. Examples include measuring weights in kilograms and grams, measuring values in different local currencies, and defining the underlying codes in categorical variables differently. These differences must be resolved before appending data sets. The iecodebook append command in the iefieldkit package was designed to facilitate this process.
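The sketch below illustrates the kind of harmonization that must happen before appending, using hypothetical regional files in which crop weight was recorded in different units; iecodebook append can automate much of this work, but the underlying logic is the same.

```stata
* Hypothetical example: region B recorded crop weight in grams, region A in kilograms
use "data/cleaned/region_b_clean.dta", clear
replace crop_weight = crop_weight / 1000        // convert grams to kilograms
label variable crop_weight "Crop weight (kg)"
tempfile region_b
save `region_b'

use "data/cleaned/region_a_clean.dta", clear    // already recorded in kilograms
append using `region_b'

* Inspect categorical codes across sources before moving on
tab region crop_type, missing
save "data/constructed/crop_weights_pooled.dta", replace
```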

Merges are more complex operations than appends, with more opportunities for errors that result in incorrect data points. This is because merges do not necessarily retain all the rows and columns of the data sets being combined and are usually not intended to. Merges can also add or overwrite data in existing rows and columns. Whichever statistical software is being used, it is useful to take the time to read through the help file of merge commands to understand their options and outputs. When writing the code to implement merge operations, a few steps can help to avoid mistakes.

The first step is to write pseudocode to understand which types of observations from each data set are expected to be matched and which are expected to be unmatched, as well as the reasons for these patterns. When possible, it is best to predetermine exactly which and how many matched and unmatched observations should result from the merge, especially for merges that combine data from different levels of observation. The best tools for understanding this step are the three components of the data map discussed in chapter 3. The second step is to think carefully about whether the intention is to keep matched and unmatched observations from one or both data sets or to keep only matching observations. The final step is to run the code to merge the data sets, compare the outcome to the expectations, add comments to explain any exceptions, and write checks into the code so that it fails immediately if the merge results ever stop matching those expectations.
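A minimal sketch of these steps in Stata, with hypothetical file names and an arbitrary expected sample size, might look like the following; the assert lines encode the expectations so that the script stops immediately if the merge results change.

```stata
* Merge village-level administrative data onto the household survey (hypothetical files)
use "data/cleaned/household_survey.dta", clear

merge m:1 village_id using "data/cleaned/village_admin.dta"

* Expectation: every household matches a village record; some villages appear
* only in the administrative data (for example, villages that were not surveyed)
assert _merge != 1          // no household should be left unmatched
drop if _merge == 2         // drop villages with no surveyed households
assert _N == 1200           // hypothetical expected number of households
drop _merge

save "data/constructed/household_with_admin.dta", replace
```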
