
Whenever data sets are combined, it is good practice to add validation code so the script will return an error if unexpected results show up in future runs.
Paying close attention to merge results is necessary to avoid unintentional changes to the data. Two issues that require careful scrutiny are missing values and dropped observations. This process entails reading about how each command treats missing observations: Are unmatched observations dropped, or are they kept with missing values? Whenever possible, automated checks should be added to the script to throw an error if the result differs from what is expected; if this step is skipped, changes in the outcome may appear after running large chunks of code without ever being flagged. In addition, any changes in the number of observations need to be documented in comments, including explanations for why they happen. If a subset of the data is being created by keeping only matched observations, it is helpful to document why observations differ across data sets and why the team is interested only in those that match. The same applies when new observations are added from the merged data set.
Some merges of data with different units of observation are more conceptually complex. Examples include overlaying road location data with household data using a spatial match; combining school administrative data, such as attendance records and test scores, with student demographic characteristics from a survey; or linking a data set of infrastructure access points, such as water pumps or schools, with a data set of household locations. In these cases, a key contribution of the research is figuring out a useful way to combine the data sets. Because the conceptual constructs that link observations from the two data sources are important and can take many possible forms, it is especially important to ensure that the data integration is documented extensively and separately from other construction tasks (see box 6.2 for an example of merges followed by automated tests from the Demand for Safe Spaces project).
BOX 6.2 INTEGRATING MULTIPLE DATA SOURCES: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT
The research team received the raw crowdsourced data acquired for the Demand for Safe Spaces study at a different level of observation from the one relevant for analysis. The unit of analysis was a ride, but each trip was represented in the crowdsourced data set by three rows: one for questions answered before boarding the train, one for those answered during the trip, and one for those answered after leaving the train. The Tidying data example in box 5.3 explains how the team created an intermediate data set for each of these three tasks. To create the ride-level data set, the team combined the individual task data sets. The following code shows how the team ensured that all observations had merged as expected, using two different approaches depending on what was expected.
/****************************************************************************************
* Merge ride tasks
****************************************************************************************/

use "${dt_int}/compliance_pilot_ci.dta", clear
merge 1:1 session using "${dt_int}/compliance_pilot_ride.dta", assert(3) nogen
merge 1:1 session using "${dt_int}/compliance_pilot_co.dta", assert(3) nogen
The first code chunk shows the quality assurance protocol used when the team expected all observations to exist in all data sets, so that each merge would have only matched observations. To test that this was the case, the team used the option assert(3). When two data sets are merged in Stata without updating information, each observation is assigned the merge code 1, 2, or 3. A merge code of 1 means that the observation existed only in the data set in memory (called the “master data”), 2 means that it existed only in the other data set (called the “using data”), and 3 means that it existed in both. The option assert(3) tests that all observations existed in both data sets and were assigned code 3.
When merging observations that do not match perfectly, the quality assurance protocol requires the research assistant to document the reasons for mismatches. Stata’s merge result code is recorded by default in a variable named _merge. The Demand for Safe Spaces team used this variable to count the number of unique riders in each match group and used the command assert to throw an error if the number of observations in any category changed, ensuring that the outcome remained stable across repeated runs of the code.
/****************************************************************************************
* Merge demographic survey
****************************************************************************************/

merge m:1 user_uuid using "${dt_int}/compliance_pilot_demographic.dta"

* 3 users have rides data, but no demo
unique user_uuid if _merge == 1
assert r(unique) == 3

* 49 users have demo data, but no rides: these are dropped
unique user_uuid if _merge == 2
assert r(unique) == 49
drop if _merge == 2

* 185 users have ride & demo data
unique user_uuid if _merge == 3
assert r(unique) == 185
For the complete do-file, visit the GitHub repository at https://git.io/JtgYf.
Dummy variables are categorical variables with exactly two mutually exclusive values, where a value of 1 represents the presence of a characteristic and 0 represents its absence. Common types include yes/no questions, true/false questions, and binary characteristics such as being below the poverty line. This structure allows dummy variables to be used in regressions, summary statistics, and other statistical functions without further transformation.

Creating analysis variables
After assembling variables from different sources into a single working data set with the desired raw information and observations, it is time to create the derived indicators of interest for analysis. Before constructing new indicators, it is important to check and double-check the units, scales, and value assignments of each variable that will be used. This step is where the knowledge of the data and the documentation developed during cleaning are used the most. The first step is to check that all categorical variables have the same value assignment, such that labels and levels have the same correspondence across variables that use the same options. For example, 0 may be coded as “No” and 1 as “Yes” in one question, whereas in another question the same answers are coded as 1 and 2. (Coding binary questions either as 1 and 0 or as TRUE and FALSE is recommended, so that they can be used numerically as frequencies in means and as dummy variables in regressions. This recommendation often implies recoding categorical variables like gender to create new binary variables like woman.) Second, any numeric variables being compared or combined need to be converted to compatible scales or units of measure: it is impossible to add 1 hectare and 2 acres and get a meaningful number. New derived variables should be given functional names, and the data set should be ordered so that related variables remain together. If the statistical software allows it, attaching notes to each newly constructed variable makes the data set even more user-friendly.
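As an illustration, the following minimal Stata sketch standardizes a yes/no question coded as 1 and 2 into a 0/1 dummy and converts units before variables are combined. The variable names (insured_yn, area_acres) are hypothetical, not from any particular project.

* Hypothetical example: recode a yes/no question stored as 1 = Yes, 2 = No
* into a 0/1 dummy, keeping the original variable unchanged
assert inlist(insured_yn, 1, 2, .)   // fail loudly if unexpected codes appear
generate insured = .
replace  insured = 1 if insured_yn == 1
replace  insured = 0 if insured_yn == 2
label define yesno 0 "No" 1 "Yes"
label values insured yesno

* Hypothetical example: convert a plot area recorded in acres to hectares
* (1 acre = 0.4047 hectares) so it is comparable with other area variables
generate area_ha = area_acres * 0.4047
label variable area_ha "Plot area (hectares, converted from acres)"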
At this point, it is necessary to decide how to handle any outliers or unusual values identified during data cleaning. How to treat outliers is a research question. There are multiple possible approaches, and the best choice for a particular case will depend on the objectives of the analysis. Whatever the team decides, the decision and how it was made should be noted explicitly. Results can be sensitive to the treatment of outliers; keeping both the original and the new modified values for the variable in the data set will make it possible to test how much the modification affects the outputs. All of these points also apply to the imputation of missing values and other distributional patterns. As a general rule, original data should never be overwritten or deleted during the construction process, and derived indicators, including handling of outliers, should always be created with new variable names.
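For example, one common approach is winsorizing extreme values. The minimal sketch below follows the rule of keeping the original values intact; the variable income and the 99th percentile cutoff are hypothetical illustrations, not recommendations.

* Hypothetical example: winsorize income at the 99th percentile, creating a
* new variable so the original values are preserved for sensitivity checks
summarize income, detail
local p99 = r(p99)
generate income_w99 = income
replace  income_w99 = `p99' if income > `p99' & !missing(income)
label variable income_w99 "Income, winsorized at the 99th percentile"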
Two features of data create additional complexities when constructing indicators: research designs with multiple units of observation and analysis, and research designs with repeated observations of the same units over time. When research involves different units of observation, creating analysis data sets will probably mean combining variables measured at these different levels. To make sure that constructed variables are consistent across data sets, each indicator should be constructed in the data set corresponding to its unit of observation.
Once indicators are constructed at each level of observation, they may be either merged directly or first aggregated and then merged with data containing different units of analysis. Take the example of a project that acquired data at both the student and teacher levels. To analyze the performance of students on a test while controlling for teacher characteristics, the teacher-level indicators would be assigned to all students in the corresponding class. Conversely, to include average student test scores in the analysis data set containing teacher-level variables, the analysis would start from the student-level data, the test scores of all students taught by the same teacher would be averaged (using commands like collapse in Stata or summarise from R’s dplyr package), and this teacher-level aggregate measure would be merged onto the original teacher data set. While performing such operations, two tasks are important: documenting the correspondence between identifying variables at different levels in the data linkage table, and applying all of the steps outlined in the previous section, because these operations inevitably involve merges.
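A minimal Stata sketch of the teacher example might look as follows; the file and variable names are hypothetical.

* Hypothetical example: aggregate student test scores to the teacher level,
* then merge the aggregate measure onto the teacher-level data set
use "student_level.dta", clear
collapse (mean) avg_test_score = test_score, by(teacher_id)

tempfile teacher_scores
save `teacher_scores'

use "teacher_level.dta", clear
* assert(3) assumes every teacher appears in both data sets; adjust as needed
merge 1:1 teacher_id using `teacher_scores', assert(3) nogen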
Finally, variable construction with combined data sets requires additional attention. It is common to construct derived indicators soon after receiving each data set. However, constructing variables separately for each data set increases the risk of using different definitions or samples in each of them. Having a well-established definition for each constructed variable helps to prevent this mistake, but the best way to guarantee consistency is to create the indicators for all data sets in the same script, after the original data sets have been combined.
The most common example is panel data with multiple rounds of data collection at different times. Say, for example, that some analysis variables were constructed immediately after an initial round of data collection and that the same variables later need to be constructed for a subsequent round. When a new round of data is received, best practice is first to create a cleaned panel data set, ignoring the previously constructed version of the initial round, and then to construct the derived indicators using the panel as input. The DIME Analytics team created the iecodebook append subcommand in the Stata package iefieldkit to make it easier to reconcile and append data into this type of cleaned panel data set; the command also works well for similar data collected in different contexts (for instructions and details, see the DIME Wiki at https://dimewiki.worldbank.org/iecodebook).
This harmonization and appending process is done by completing an Excel spreadsheet codebook to indicate which changes in names, value assignments, and value labels should be made so the data are consistent across rounds or settings (Bjärkefur, Andrade, and Daniels 2020). Doing so creates helpful documentation about the appending process. Once the data sets have been harmonized and appended, it is necessary to adapt the construction script so that it can be used on the appended data set. In addition to preventing inconsistencies and documenting the work, this process also saves time and provides an opportunity for the team to review the original code (see box 6.3 for an example of variable construction using a combined data set).
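A minimal sketch of this workflow follows; the file paths, globals, codebook name, and survey labels are hypothetical, and the full syntax is documented on the DIME Wiki page cited above.

* Hypothetical example: harmonize and append two rounds of panel data.
* On the first run, iecodebook writes a template Excel codebook; after the
* team fills in the renames, recodes, and label changes, rerunning the same
* command applies them and appends the rounds into one panel data set.
iecodebook append                          ///
    "${dt_int}/baseline_cleaned.dta"       ///
    "${dt_int}/endline_cleaned.dta"        ///
    using "${doc}/panel_codebook.xlsx",    ///
    clear surveys(Baseline Endline)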