
6.1 Examples of Big Data
How to Harness the Power of Data and Inference | 343
keep costs low, as part of keeping the administrative costs of eligibility determination reasonable. Economizing on data collection can imply partial measurement, infrequent updates, use of proxies, and/or making use of other existing data whenever useful and feasible.
“Big data” are generated to serve other purposes but may also be useful for eligibility assessment and so have the allure of a “free lunch,” and their higher frequency of updating means that they may be useful for making eligibility determination much more dynamic. Box 6.1 summarizes different sources of big data, although the division between those collected by the government and those by commercial firms can be blurred; private
BOX 6.1
Examples of Big Data
Data Collected by or under Contract to the Government
• Administrative data
  ° From sources related to tax collection (on wages, land, vehicles, businesses, and the like)
  ° From sources related to civil processes (registrations for births, marriages, divorces, deaths, residency, voter registration, and military service)
  ° From sources related to service delivery (receipt of any government-provided social protection programs, contribution to social or health insurance schemes, border crossings, and possibly kilowatt hours of energy used if power companies are state run)
• Remote sensing data are collected remotely from the household or surrounding locale, usually by satellite, aircraft, or drone.
  ° Satellite imagery captures images of highly geographically disaggregated areas (lighting at night, land use such as density or features of construction, or caliber of vegetative cover).
  ° Climate data such as on temperature, rainfall, windspeed, and water speed.
  ° Global Positioning System (GPS) data on access/distance from a particular location to different facilities.
344 | Revisiting Targeting in Social Assistance
Data Generated by Households but Held by Commercial Firms
• Mobile phone data. Call detail records that record the frequency of texts and length of calls, as well as the frequency or size of top-ups to data plans.
• Phone-based location data. Where people go and when, how long it takes them to get there, and how long they stay there.
• Social media data. People’s account information on their (self-reported) age, sex, and education; data on the type (and quality) of their device and connectivity; data on their social networks; and data on who they follow, like, or retweet.
• Commercial financial transactions. Use of mobile money or credit cards and commercial banking information on income, assets, and debt from mortgage and loan applications (some of this private credit information is reported to the central bank or a public credit bureau, meaning the government may have some access).
satellites collect remote sensing data, while in some countries mobile operators may be state owned. Because the data are generated for purposes other than welfare determination, they may be proxies rather than measurements per se, and they may require some inference for use. Nonetheless, many enticing research studies show strong correlations between welfare as mapped by the new sources of big data and more traditional interview-based measures.
An important dimension of big data is who owns them and who may use them for what purpose. Governments using data for eligibility decisions for social protection programs must have access to such data. Ownership is a start, although data privacy laws may prevent or bureaucratic silos may impede sharing data between government agencies or functions. Therefore, a regulatory reform of greater or lesser weight may be needed to make even government-owned data available for use in social protection. Normally a program applicant must give consent as part of the application process for the agency to access and use data from other functions or agencies (social security contributions, property tax registers, and so forth).
A strong regulatory environment is needed to enable reuse of public sector data for eligibility determination. World Development Report 2021: Data for Better Lives identifies the elements of such a good practice regulatory environment (World Bank 2021b). These include various legislation governing
open data and access to information, data classification policies, interoperability between government agencies, and licensing arrangements. However, although high-income countries have made significant progress, having adopted around two-thirds of such practices, progress is slower in less developed countries, suggesting that reuse of public sector data to help determine beneficiary eligibility may still take time. Moreover, the content and coverage of such data may limit their ability to observe welfare at the lower end of the welfare distribution.
A different regulatory or payments problem occurs when using household-generated data owned by private firms. The issues of ownership, privacy, and aggregation across different commercial sources will need to be sorted out in ways that are technically feasible, economically practical, and politically acceptable. This is new ground, and as yet it is hard to predict how fast or extensively governments will be able to harvest and use such data for social protection purposes. Such developments may initially be uneven across different sources of data and between countries.
So far, data from e-commerce are not much observed by governments, but because of their parallels with traditional commerce, it might be culturally acceptable to think of building that capacity. If brick-and-mortar stores pay taxes, the analogy to sellers on electronic platforms is clear, although the details of how to regulate or tax them can be quite complex. Governments also maintain a greater margin of control over private sector data transactions subject to competition or consumer protection laws (World Bank 2021b), which may make them more accessible than other private sector big data. Moreover, governments may in some cases directly observe the transactions if they control the digital currency; for example, the Chinese central bank recently piloted an electronic Chinese yuan (eCNY).1 Where and when such data become observable to the government, it is likely that the principal driver will be the tax revenue that the government could generate. If in the process of taxing electronic financial flows, the government converts some part of the data to government-owned data, then the improved observability of welfare may be available for benefit determination as well as taxation. This could be revolutionary in helping to observe the welfare of the informal sector and distinguish welfare across the gradients therein, although where household domestic and business accounts are intertwined, this approach would not completely solve the problem. Observing the purchases by a person who works informally as a day laborer, as domestic staff, or in a larger firm but off the books might yield a good idea of their welfare. But a petty trader may look fairly well-off if the purchases of her stock are taken to measure her welfare; the desired concept would be purchases for private use or the perhaps small profit that the trading yields.
The case for routine government access to household-generated data such as call detail records (CDRs), internet search data, and social media posts is less obvious, other than for selected law enforcement purposes. However, in a growing body of work, some examples of which are provided later in the chapter, such commercial data have been made available to researchers. Those instances show some interesting analyses of patterns of welfare and demonstrate the technical potential to use such data for eligibility determination. Whether it will be deemed culturally acceptable is a discussion that is beginning to unfold, and there may be various practical issues as well.
Some progress has been made in making private sector data available for public purposes through legislation, to support evidence-based policy making and promote innovation and competition (World Bank 2021b). For example, some countries have legislation requiring sharing of private sector data of public interest (OECD 2019), such as from utilities and transportation. A particularly relevant example is France’s Law for a Digital Republic, which was enacted in 2016 (French Republic 2016; OECD 2019). This law includes provisions mandating the sharing of private sector data according to open standards for the creation of public interest data sets, which cover data from delegated public services or data that are relevant for targeting welfare payments or constructing national statistics (World Bank 2021b, 214, fn 117).
World Development Report 2021 identifies other approaches to allowing the use of private sector data, including promoting open licensing, data portability, and data partnerships. Open licensing encourages private data owners to invest in mechanisms that provide access to proprietary data in return for control and financial rewards, although this is rare in lower-income countries. Data portability allows individuals to facilitate the transfer of data about themselves between parties. This prevents locking in consumer data and fosters competition between companies. In practice, it means the right to receive a copy of the data from a data collector, the right to transmit the data to another data collector, and the right to request a transfer from one data collector to another. Retaining the same phone number when changing operators is a simple example of data portability. Allowing the transfer of financial data on credit and debit card use, personal loans, and mortgages, as is now done in Australia, is a more complex example. Alternatively, data partnerships are contractual agreements between two businesses or a business and government under a public-private partnership. An example of this is Waze, which provides traffic mobility data to cities and other public organizations for traffic management, emergency response, and other mobility-based projects.2
Nonetheless, progress toward the regulatory environment needed to facilitate this voluntary provision of private sector data for public use lags
that of public sector data. The World Development Report 2021 identifies a regulatory framework for enabling reuse of private sector data, which corresponds to that discussed for public sector data (World Bank 2021b). This includes ID authentication, data portability, and voluntary licensing of access to data. On these dimensions, countries have adopted less than 20 percent of good practices, suggesting that the reuse of private sector data for eligibility determination is further off than that of public sector data.
Moreover, it remains to be seen whether the tech giants who own much of these data would be willing to license them given the likely greater returns to maintaining a data monopoly. Google facilitates 90 percent of all internet searches,3 which in turn may lead to advertisers paying Google higher prices (Scott Morton and Dinielli 2020). Moreover, much of the value of private sector big data comes from their sheer size. The far greater number of web pages indexed by Google compared with most other search engines means that Google provides better results and thus continues to be popular. The value of social media data comes from the networks they reveal. The larger these networks, the more the data can reveal and the more valuable they become. Maintaining a near-monopoly on such huge networks of data, as Facebook does, likely has more value than licensing them to other users.
Another important facet of data is the unit of observation. Eligibility for many social protection programs is determined at the household level and for some at the individual level, but neither is the usual unit of observation for much of big data.
Remote sensing data can be associated with increasingly small geographic areas, but they are still fundamentally about “area” and as such are conceptually different from a household or individual. The mapping between the two is sometimes straightforward if people or households have GPS coordinates recorded somewhere or geomappable addresses. But even such links are not perfect. At present, sensing data are often available on a grid that is much larger than a household; a square kilometer counts as fine resolution. Even where the resolution is finer, in urban areas, many people can live in a single multistory building. In rural areas, livelihoods can depend on flows from multiple plots in different locations or a mix of farm and off-farm employment. A satellite picture may show that a given field is lush or withered or that the plot is lit or dark at night, but the picture does not show the name of the household head or all the other elements of the household’s welfare.
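The area-versus-household mismatch can be illustrated with a minimal Python sketch. It assumes a simple latitude/longitude grid with cells of 0.01 degree (roughly one kilometer near the equator); the function names, household coordinates, and night-light value are all hypothetical and are not taken from any study cited in this chapter.

```python
import math

def grid_cell(lat, lon, cell_deg=0.01):
    """Return the (row, col) index of the grid cell that contains a point.

    Assumption: a plain lat/lon grid; 0.01 degree is about 1 km near the equator.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def attach_area_value(households, cell_values, cell_deg=0.01):
    """Give each household the value recorded for its grid cell (None if no value)."""
    return {
        hh_id: cell_values.get(grid_cell(lat, lon, cell_deg))
        for hh_id, (lat, lon) in households.items()
    }

# Hypothetical households with recorded GPS coordinates.
households = {
    "hh_a": (-1.2921, 36.8219),
    "hh_b": (-1.2950, 36.8250),  # a different household in the same cell as hh_a
    "hh_c": (-1.3540, 36.9040),  # falls in a cell with no sensing value
}

# Hypothetical night-light intensity for a single cell.
cell_values = {grid_cell(-1.2921, 36.8219): 42.0}

linked = attach_area_value(households, cell_values)
```

In this sketch `hh_a` and `hh_b` receive the same area value even though their welfare may differ, which is the limitation described above: the sensing data describe the cell, not the household.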
The unit of observation issue also arises in the emerging use of CDR-based PMT to approximate the welfare of households. A problem with this approach is the difficulty of linking phones to individuals and households. First, in many developing countries, poor people often use prepaid subscriber identification module (SIM) cards instead of more costly postpaid
accounts. In many instances, these SIMs are not registered to a particular person (although they are meant to be in theory). Consequently, even if CDR patterns predict a poor person, that person may not be known to the mobile operator or the government. Moreover, people in developing countries often have multiple SIMs from different operators to take advantage of cheaper calls to numbers within different networks or at different times of the day. Usually, these SIMs cannot be matched to the same person, meaning the modeling cannot take their aggregate phone usage into account. Second, many programs are aimed at poor households and not poor individuals. If multiple household members have phones and their usage patterns predict eligibility, the household could end up receiving multiple benefits, or at the least, eligibility determination at the household level becomes difficult. Conversely, in particularly poor places, the same phone is used by multiple households, which could confound the models. Over time and with greater training data for machine learning models, it may be possible to resolve some of these issues (for example, household members may be identified by their colocation in the household at night, patterns of communication between each other, and so forth), but the issues raise an additional level of complexity in modeling. They also raise issues of incentives, that is, about how households might alter their use of SIMs or phones.
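The idea of identifying household members from nighttime colocation and mutual communication can be sketched as follows. This is a minimal illustration, not a method described in the source: the record layouts, the choice of night hours, the function names, and the rule that two SIMs belong together only if they share a modal night location and have called each other are all assumptions. In practice a location identifier such as a cell tower covers many households, so real applications would need much more evidence.

```python
from collections import Counter, defaultdict

# Assumption: "night" is 22:00 to 05:59 local time.
NIGHT_HOURS = set(range(22, 24)) | set(range(0, 6))

def home_location(pings):
    """Return {sim_id: most frequent night location}.

    pings: iterable of (sim_id, hour, location_id) tuples (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for sim, hour, loc in pings:
        if hour in NIGHT_HOURS:
            counts[sim][loc] += 1
    return {sim: c.most_common(1)[0][0] for sim, c in counts.items()}

def candidate_households(pings, calls):
    """Group SIMs that share a night location AND have called each other."""
    home = home_location(pings)
    parent = {sim: sim for sim in home}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in calls:
        if a in home and b in home and home[a] == home[b]:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for sim in home:
        groups[find(sim)].add(sim)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical data: A and B sleep at c1 and call each other; C also sleeps
# at c1 but never calls them; D sleeps elsewhere.
pings = [("A", 23, "c1"), ("A", 2, "c1"), ("A", 14, "c9"),
         ("B", 1, "c1"), ("B", 3, "c1"), ("C", 23, "c1"), ("D", 0, "c2")]
calls = [("A", "B"), ("A", "D")]
groups = candidate_households(pings, calls)
```

Here `groups` places A and B in one candidate household and leaves C and D on their own. The sketch also shows why incentives matter: a household that splits its usage across unlinked SIMs would not be grouped.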
Using big data requires ways to merge data from one source or database with another. This may require a lot of technical work, but success is increasingly possible, due both to advances in the data ecosystem and increased computing power. A frequent key in data matching between separate administrative records is a foundational ID number. As the whole Identification for Development (ID4D) agenda discussed in chapter 4 advances, such easy mergers will become more feasible in the coming years. In the meantime, where foundational IDs are not part of the data sets or have limited coverage among the vulnerable population, algorithms that match on a combination of keys, such as name, age, sex, address or GPS coordinate, phone number, and other identifiers, can sometimes work well enough. However, the algorithms take significant computing power when the number of individuals to be matched is large, and there will still be some failures or mismatches that will require manual processes. An address or, even more precisely, a GPS coordinate can merge a household’s location with sensing data, although GPS coordinates are usually available only for households where data have been collected in the home in recent years, or where there is a good address system that the government has geomapped.
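Matching on a combination of keys when no foundational ID is available can be sketched in Python using only the standard library. The field names, the 0.85 similarity threshold, and the choice to compare only records that agree on sex and birth year (a step often called blocking, which limits the number of comparisons) are illustrative assumptions, not a procedure specified in the source.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

def name_similarity(a, b):
    """Similarity between two names, from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_records(registry, applicants, threshold=0.85):
    """Match applicants to registry records on sex, birth year, and fuzzy name.

    Returns (matched, manual_review): matched is a list of
    (applicant_id, registry_id) pairs; manual_review lists applicant ids
    with no sufficiently similar candidate.
    """
    # Blocking: index registry records by the keys that must agree exactly.
    blocks = {}
    for rec in registry:
        blocks.setdefault((rec["sex"], rec["birth_year"]), []).append(rec)

    matched, manual_review = [], []
    for app in applicants:
        candidates = blocks.get((app["sex"], app["birth_year"]), [])
        scored = [(name_similarity(app["name"], c["name"]), c) for c in candidates]
        best = max(scored, key=lambda s: s[0], default=None)
        if best is not None and best[0] >= threshold:
            matched.append((app["id"], best[1]["id"]))
        else:
            manual_review.append(app["id"])
    return matched, manual_review

# Hypothetical records from two separate databases.
registry = [
    {"id": "R1", "name": "Maria Gonzalez", "sex": "F", "birth_year": 1985},
    {"id": "R2", "name": "John Smith", "sex": "M", "birth_year": 1990},
]
applicants = [
    {"id": "A1", "name": "maria  gonzales", "sex": "F", "birth_year": 1985},
    {"id": "A2", "name": "Peter Jones", "sex": "M", "birth_year": 1990},
    {"id": "A3", "name": "Ana Lopez", "sex": "F", "birth_year": 1970},
]
matched, manual_review = match_records(registry, applicants)
```

In this example the misspelled applicant A1 is linked to R1, while A2 and A3 are routed to manual review, mirroring the point that some failures or mismatches still require manual processes. Exact agreement on birth year is a strong assumption; recording errors in that field would cause missed matches.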
The complexity of big data has also led to the use of more sophisticated modeling techniques—or machine learning—to understand them. Machine learning algorithms take various forms, which are described later in the chapter. They allow objects to be classified (for example, roof type, paved