validation code so the script will return an error if unexpected results show up in future runs.

Paying close attention to merge results is necessary to avoid unintentional changes to the data. Two issues that require careful scrutiny are missing values and dropped observations. This process entails reading about how each command treats missing observations: Are unmatched observations dropped, or are they kept with missing values? Whenever possible, automated checks should be added to the script to throw an error message if the result is different from what is expected; if this step is skipped, changes in the outcome may appear after running large chunks of code, and these changes will not be flagged. In addition, any changes in the number of observations in the data need to be documented in the comments, including explanations for why they are happening. If subsets of the data are being created, keeping only matched observations, it is helpful to document the reason why the observations differ across data sets, as well as why the team is interested only in observations that match. The same applies when adding new observations from the merged data set.

Some merges of data with different units of observation are more conceptually complex. Examples include overlaying road location data with household data using a spatial match; combining school administrative data, such as attendance records and test scores, with student demographic characteristics from a survey; or linking a data set of infrastructure access points, such as water pumps or schools, with a data set of household locations. In these cases, a key contribution of the research is figuring out a useful way to combine the data sets. Because the conceptual constructs that link observations from the two data sources are important and can take many possible forms, it is especially important to ensure that the data integration is documented extensively and separately from other construction tasks (see box 6.2 for an example of merges followed by automated tests from the Demand for Safe Spaces project).
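As a minimal illustration of what such automated checks can look like, the sketch below validates a merge in Stata. The data set names (household.dta, rainfall.dta), the identifier hh_id, and the expected number of unmatched observations are hypothetical placeholders, not values from any particular project:

    * Hypothetical sketch: file names, hh_id, and the expected count of 25
    * unmatched observations stand in for a project's actual values
    use "household.dta", clear
    isid hh_id                             // the merge key must uniquely identify observations

    merge 1:1 hh_id using "rainfall.dta"

    * No observation should come only from the using data set
    assert inlist(_merge, 1, 3)

    * Document the known number of unmatched observations and fail if it changes
    count if _merge == 1
    assert r(N) == 25

    drop _merge

Because assert halts the script with an error when its condition is false, any future change in the merge results is flagged immediately instead of propagating silently through later code.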
BOX 6.2 INTEGRATING MULTIPLE DATA SOURCES: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

The research team received the raw crowdsourced data acquired for the Demand for Safe Spaces study at a different level of observation than the one relevant for analysis. The unit of analysis was a ride, and each trip was represented in the crowdsourced data set by three rows: one for questions answered before boarding the train, one for those answered during the trip, and one for those answered after leaving the train. The Tidying data example in box 5.3 explains how the team created three intermediate data sets for each of these tasks. To create the ride-level data set, the team combined the individual task data sets. The following code shows how the team verified that all observations had merged as expected, using two different approaches depending on what was expected.
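The sketch below is not the project's actual code; it illustrates the two kinds of checks described, assuming hypothetical file and variable names (platform.dta, ride.dta, exit.dta, ride_id) in place of the real ones:

    * Hypothetical sketch, not the project's actual code: ride_id and the
    * file names stand in for the real identifiers and task data sets
    use "platform.dta", clear

    * Approach 1: every platform observation must match a ride-task
    * observation, so require a perfect match and stop with an error otherwise
    merge 1:1 ride_id using "ride.dta", assert(match) nogenerate

    * Approach 2: some rides may legitimately be missing the exit task, but no
    * observation should appear only in the using data; fail if that happens
    merge 1:1 ride_id using "exit.dta", assert(master match)
    count if _merge == 1                   // document how many rides lack the exit task
    drop _merge

The first check treats any unmatched observation as an error in itself; the second allows documented non-matches from the master data while still failing if observations appear only in the using data.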