BOX 5.1 SUMMARY: CLEANING AND PROCESSING RESEARCH DATA

After being acquired, data must be structured for analysis in accordance with the research design, as laid out in the data linkage tables and the data flowcharts discussed in chapter 3. This process entails the following tasks.

1. Tidy the data. Many data sets do not have an unambiguous identifier as received, and the rows in the data set often do not match the units of observation specified by the research plan and data linkage table. Preparing the data for analysis requires two steps:
• Determine the unique identifier for each unit of observation in the data.
• Transform the data so that the desired unit of observation uniquely identifies rows in each data set.

2. Validate data quality. Data completeness and quality should be validated upon receipt to ensure that the information is an accurate representation of the characteristics and individuals it is supposed to describe. This process entails three steps:
• Check that the data are complete; that is, that all the observations in the desired sample were received.
• Make sure that data points are consistent across variables and data sets.
• Explore the distribution of key variables to identify outliers and other unexpected patterns.

3. De-identify, correct, and annotate the data. After the data have been processed and de-identified, the information must be archived, published, or both. Before publication, it is necessary to ensure that the processed version is highly accurate and appropriately protects the privacy of individuals:
• De-identify the data in accordance with best practices and relevant privacy regulations.
• Correct data points that are identified as being in error compared to ground reality.
• Recode, document, and annotate data sets so that all of the content will be fully interpretable by future users, whether or not they were involved in the acquisition process.
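As an illustration of the identifier and validation checks listed above, the following is a minimal sketch in Python using only the standard library. It is not code prescribed by this box: the variable names `hhid` and `income`, the sample rows, the list of expected IDs, and the cutoff of three median absolute deviations are all hypothetical choices made for the example.

```python
from collections import Counter
from statistics import median


def duplicate_ids(rows, id_var):
    """Return ID values that appear in more than one row."""
    counts = Counter(row[id_var] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_ids(rows, id_var, expected_ids):
    """Return expected IDs that are absent from the received data."""
    received = {row[id_var] for row in rows}
    return sorted(set(expected_ids) - received)


def flag_outliers(rows, var, k=3.0):
    """Return rows whose value is more than k median absolute
    deviations (MADs) away from the median of `var`."""
    values = [row[var] for row in rows if row[var] is not None]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against; flag nothing
    return [
        row for row in rows
        if row[var] is not None and abs(row[var] - center) > k * mad
    ]


# Hypothetical household data as received (illustrative values only).
rows = [
    {"hhid": "H01", "income": 120.0},
    {"hhid": "H02", "income": 135.0},
    {"hhid": "H02", "income": 135.0},   # duplicated identifier
    {"hhid": "H04", "income": 110.0},
    {"hhid": "H05", "income": 9000.0},  # implausibly large value
]
expected = ["H01", "H02", "H03", "H04", "H05"]  # hypothetical sample list

dups = duplicate_ids(rows, "hhid")
absent = missing_ids(rows, "hhid", expected)
outliers = flag_outliers(rows, "income")

print("Duplicate IDs:", dups)
print("Missing IDs:", absent)
print("Outlier IDs:", [row["hhid"] for row in outliers])
```

A flagged value is a prompt for review rather than proof of an error: whether to correct, keep, or document it is a decision for the research team.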

Key responsibilities for task team leaders and principal investigators
• Determine the units of observation needed for experimental design and supervise the development of appropriate unique identifiers.
• Indicate priorities for quality checks, including key indicators and reference values.
• Provide guidance on how to resolve all issues identified in data processing, cleaning, and preparation.
• Publish or archive the prepared data set.

Key responsibilities for research assistants
• Develop code, data, and documentation linking data sets with the data map and study design, and tidy all data sets to correspond to the required units of observation.

(Box continues on next page)
DEVELOPMENT RESEARCH IN PRACTICE: THE DIME ANALYTICS DATA HANDBOOK