

BOX 5.2 ESTABLISHING A UNIQUE IDENTIFIER: A CASE STUDY FROM THE DEMAND FOR SAFE SPACES PROJECT

All data sets have a unit of observation, and the first columns of each data set should uniquely identify which unit is being observed. In the Demand for Safe Spaces project, as should be the case in all projects, the first few lines of code that imported each original data set immediately ensured that this was true and applied any corrections from the field needed to fix errors related to uniqueness.

The code segment below was used to import the crowdsourced ride data; it uses the ieduplicates command to remove duplicate values of the uniquely identifying variable in the data set. The report that ieduplicates produces documents how duplicate identifiers were identified and resolved during data collection. After applying the corrections, the code confirms that the data are uniquely identified by the rider and ride identifiers, sorts the data, optimizes storage, removes blank entries, and saves the data set.

// Import to Stata format ============================================================

import delimited using "${encrypt}/Baseline/07112016/Contributions 07112016", ///
    delim(",") ///
    bindquotes(strict) ///
    varnames(1) ///
    clear

* There are two duplicated values for obs_uid, each with two submissions.
* All four entries are demographic surveys from the same user, who seems to
* have submitted the data twice, each time creating two entries.
* Possibly a connectivity issue
ieduplicates obs_uid using "${doc_rider}/baseline-study/raw-duplicates.xlsx", ///
    uniquevars(v1) ///
    keepvars(created submitted started)

* Verify unique identifier, sort, optimize storage,
* remove blank entries and save data
isid user_uuid obs_uid, sort
compress
dropmiss, force
save "${encrypt}/baseline_raw.dta", replace

To access this code in do-file format, visit the GitHub repository at https://github.com/worldbank/dime-data-handbook/tree/main/code.

ieduplicates is a Stata command to identify duplicate values in ID variables. It is part of the iefieldkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/ieduplicates.

iecompdup is a Stata command to compare duplicate entries and understand why they were created. It is part of the iefieldkit package. For more details, see the DIME Wiki at https://dimewiki.worldbank.org/iecompdup.
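For example, if ieduplicates flags duplicated values of obs_uid, as in the code above, iecompdup can be used to inspect how the duplicated entries differ before deciding which one to keep. A minimal sketch, in which the ID value is a hypothetical placeholder:

* Compare all entries that share one duplicated value of obs_uid.
* The ID value below stands in for a value flagged by ieduplicates.
iecompdup obs_uid, id(12345)

The command reports which variables are identical across the duplicated entries and which differ, information that supports the corrections documented in the ieduplicates report.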

A variable is the collection of all data points that measure the same attribute for each observation.

An observation is the collection of all data points that measure attributes for the same instance of the unit of observation in the data table.

Wide format refers to a data table in which the data points for a single variable are stored in multiple columns, one for each subunit. In contrast, long format refers to a data table in which a subunit is represented in one row and values representing its parent unit are repeated for each subunit.

Tidying data

Although data can be acquired in all shapes and sizes, they are most commonly received as one or multiple data tables. These data tables can organize information in multiple ways, and not all of them result in easy-to-handle data sets. Fortunately, a vast literature on database management has identified the format that makes interacting with the data as easy as possible. While this is called normalization in database management, data in this format are called tidy in data science. A data table is tidy when each column represents one variable, each row represents one observation, and all variables in it have the same unit of observation. Every other format is untidy. This distinction may seem trivial, but data, and original survey data in particular, are rarely received in a tidy format.

The most common case of untidy data acquired in development research is a data set with multiple units of observation stored in the same data table. When a row includes multiple nested units of observation, the unique identifier cannot identify every observation in that row. Survey data containing nested units of observation are typically imported from survey platforms in wide format. Wide-format data could have, for instance, one column for a household-level variable (for example, owns_fridge) and a few columns for household member–level variables (for example, sex_1, sex_2). Original data are often saved in this format because it is an efficient way to transfer the data: adding different levels of observation to the same data table allows data to be transferred in a single file. However, doing so encourages the widespread practice of interacting with data in wide format, which is often inefficient and error-prone.
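To make the contrast concrete, a wide-format table for this example might look like the following (hh_id and member_id are illustrative identifier names, and the values are invented):

hh_id   owns_fridge   sex_1   sex_2
101     1             F       M
102     0             F       .

A tidy version splits this into a household-level table (hh_id, owns_fridge) and a member-level table with one row per household member:

hh_id   member_id   sex
101     1           F
101     2           M
102     1           F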

To understand how dealing with wide data can be complicated, imagine that the project needs to calculate the share of women in each household. In a wide data table, it is necessary first to create variables counting the number of women and the total number of household members and then to calculate the share; otherwise, the data have to be transformed to a different format. In a tidy data table, however, in which each row is a household member, it is possible to aggregate the share of women by household without any additional steps and then to merge the result with the household-level data table. Tidy data tables are also easier to clean, because each attribute corresponds to a single column that needs to be checked only once, and each column corresponds directly to one question in the questionnaire. Finally, summary statistics and distributions are much simpler to generate from tidy data tables.
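A minimal sketch of the tidy approach in Stata, with hypothetical file and variable names, assuming one row per household member in household_members.dta and one row per household in households.dta:

* Compute the share of women per household from the tidy
* member-level data table
use "household_members.dta", clear
generate female = (sex == "F")
collapse (mean) share_female = female, by(hh_id)

* Attach the result to the household-level data table
merge 1:1 hh_id using "households.dta"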

As mentioned earlier, there are unlimited ways for data to be untidy; wide format is only one of those ways. Another example is a data table containing both information on transactions and information on the firms involved in each transaction. In this case, the firm-level information is repeated for all transactions in which a given firm is involved. Analyzing firm data in this format gives more weight to firms that conducted more transactions, which may not be consistent with the research design.

Reshape means to transform a data table in such a way that the unit of observation it represents changes.

The basic process behind tidying a data table is simple: first, identify all of the variables that were measured at the same level of observation; second, create separate data tables for each level of observation; and third, reshape the data and remove duplicate rows until each data table is uniquely and fully identified by the unique identifier that corresponds to its unit of observation. Reshaping data tables is one of the most intricate tasks in data cleaning. It is necessary to be very familiar with commands such as reshape in Stata and pivot in R. It is also necessary to ensure that identifying variables are consistent across data tables, so they can always be linked. Reshaping is the type of transformation referred to in the example of how to calculate the share of women in a wide data set. The important difference is that in a tidy workflow, instead of reshaping the data for each operation, each such transformation is done once during cleaning, making all subsequent operations much easier.
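A minimal sketch of this process in Stata, using the wide household example from above (the file names are hypothetical):

* Split the wide table into household- and member-level tables
use "survey_wide.dta", clear

preserve
    * Household-level variables form their own tidy data table
    keep hh_id owns_fridge
    save "households.dta", replace
restore

* Member-level columns become one row per household member
keep hh_id sex_1 sex_2
reshape long sex_, i(hh_id) j(member_id)
rename sex_ sex
drop if missing(sex)    // drop rows for members that do not exist
save "household_members.dta", replace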

In the earlier household survey example, household-level variables are stored in one tidy data table, and household-member variables are reshaped and stored in a separate, member-level, tidy data table, which also contains the household ID for each individual. The household ID is intentionally duplicated in the household-member data table to allow one or several household members to be linked to the same household data. The unique identifier for the household member–level data will be either a single household member ID or a combination of household ID and household member ID. In the transaction data example, the tidying process creates one transaction-level data table, containing variables indicating the ID of all firms involved, and one firm-level data table, with a single entry for each firm. Then, firm-level analysis can be done easily by calculating appropriate statistics in the transactions data table (in Stata, often through collapse) and then merging or joining those results with the firm data table.
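In Stata, that firm-level analysis might look like the following sketch, with hypothetical file and variable names and assuming, for simplicity, that each transaction row carries a single firm_id:

* Aggregate the transaction-level data table to one row per firm
use "transactions.dta", clear
collapse (count) n_transactions = transaction_id ///
         (sum)   total_value    = value, by(firm_id)

* Combine the firm-level statistics with the firm data table
merge 1:1 firm_id using "firms.dta"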

In a tidy workflow, the clean data set contains one or more tidy data tables (see box 5.3 for an example of how data sets were tidied in the Demand for Safe Spaces project). In both examples in the preceding paragraphs, the clean data set is made up of two tidy data tables. There must be a clear way to connect each tidy data table to a master data set and thereby also to all other data sets. To implement this connection, one data table is designated as the main data table, and that data table’s unit of observation is the main unit of observation of the data set. It is important that the main unit of observation correspond directly to a master data set and be listed in the data linkage table. There must be an unambiguous way to merge all other data tables in the data set with the main data table. This process makes it possible to link all data points in all of the project’s data sets to each other. Saving each data set as a folder of data tables, rather than as a single file, is recommended: the main data table shares the same name as the folder, and the names of all other data tables start with the same name, but are suffixed with the unit of observation for that data table.
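For example, the household survey data set described in this section might be stored as follows (the names are illustrative):

household-survey/
    household-survey.dta            (main data table, household level)
    household-survey-member.dta     (household member level)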
