
How to Harness the Power of Data and Inference | 447
the logisticians who plan for good delivery systems; the social workers or community development officers who build bridges between ministries, communities, and individuals; and the advocates, organizers, technocrats, and elected officials who build consensus for good social protection policy.
Targeting methods evolve. Although the suite of methods carries the same names as it did two decades ago, the practice and potential of each method are changing as new data, new technology, and new capacities and expectations push them to evolve.
Notes
1. https://www.nytimes.com/2021/03/01/technology/china-national-digital-currency.html
2. See Waze (2018, 2020), https://www.waze.com/wazeforcities/casestudies.
3. https://www.businessinsider.com/how-google-retains-more-than-90-of-market-share-2018-4.
4. The most popular area-level poverty map is that of Fay and Herriot (1979). The Elbers, Lanjouw, and Lanjouw (2003) methodology has since been supplanted in many cases by the empirical best/Bayes approach of Molina and Rao (2010). This chapter focuses on poverty maps developed with big data and does not review the different econometric models currently used in traditional poverty maps. For a good review of the models, see Molina, Corral, and Nguyen (2021). For implementation of the models in Stata, see Corral Rodas, Molina, and Nguyen (2021).
5. Head et al. (2017) use the same transfer learning procedure as Jean et al. (2016) with a different set of countries and an expanded set of human development indicators at the level of household sample survey clusters. Steele et al. (2017) use satellite imagery and CDRs to predict regional poverty and wealth measures. The authors employ hierarchical Bayesian geostatistical models for the prediction task. The models serve as a way of averaging or smoothing across a spatial area while also accounting for uncertainty.
6. Several papers demonstrate a correlation between phone activity (as measured using CDR) and regional economic activity. Toole et al. (2015) use CDR to predict employment. Schmid et al. (2017) predict literacy using CDR. Deville et al. (2014) generate population estimates using CDR, and Blumenstock and Eagle (2012), Dong et al. (2014), and Schmid et al. (2017) use CDR to predict demographic characteristics at the subnational level. Eagle, Macy, and Claxton (2010) explain regional economic rankings using social network diversity as measured using CDR. Frias-Martinez and Virseda (2012) also show that regional economic indicators can be predicted by CDR activity. Mao et al. (2013) and Smith-Clarke, Mashhadi, and Capra (2014) present similar results for Côte d’Ivoire. Pokhriyal and Jacques (2017) predict regional poverty in Senegal using a combination of CDR and satellite imagery. Njuguna and McSharry (2017) also use CDR and satellite imagery to predict the regional Multidimensional Poverty Index in Rwanda using a least absolute shrinkage
and selection operator (Lasso) model. Blondel, Decuyper, and Krings (2015) provide a survey of CDR-based research.
7. The authors implement a deterministic finite automaton to generate 5,088 covariates. They then use the elastic net algorithm to predict the principal component analysis wealth index for each survey respondent.
8. Lain (2018) used data taken from Google Maps and Trafi to show that it takes residents of an average Jakarta neighborhood around 40 minutes to reach a regional public hospital using only public transport, but that for some neighborhoods the travel time is almost two hours. Similarly, it costs Jakartans Rp 4,000 (US$0.40) to reach a major hospital at the median, but this rises to Rp 20,000 (US$2) for some neighborhoods. Accessing the three top-ranked senior high schools in each municipal district takes 107 minutes and costs Rp 16,000 (US$1.60) for some Jakarta residents, despite taking just 35 minutes and Rp 3,500 (US$0.35) for the median neighborhood. Additionally, the times taken to reach the top-ranked high schools are positively correlated with neighborhood-level poverty rates, consistent with the idea that richer households are located in areas that offer better access to good schools, but that do not necessarily offer improved access to other facilities.
9. Facebook data include the number of users, sex, age, reported education type, type of operating system (iOS, Android, or other), expense of phone, and type of connection (for example, 2G, 3G, and so forth).
10. See Coady et al. (2021). The authors estimate the share of means-tested programs among the following types of programs: unemployment, social exclusion, housing, and family and children’s programs.
11. See OECD (2013).
12. Urban Institute 2021, quoted in https://fortunly.com/statistics/welfare-statistics/#gref.
13. See Moffitt (2015).
14. Moffitt (2015, 7) explains that although the patchwork of means-tested programs in the United States appears as a “crazy-quilt assortment of programs with different structures and recipient groups, rather than following from some single rational design for assistance for the poor of all types,” it does reflect the preferences of voters. For example, most programs are in-kind in nature (for medical care, food consumption and nutritional assistance, housing, and early childhood education) and, when cash is provided, it does not cover all low-income households, but only certain deserving categories such as workers (Earned Income Tax Credit), the aged, and the disabled (Supplemental Security Income).
15. See https://www.missoc.org/missoc-database/comparative-tables/.
16. Even in advanced economies, administrative data quality is still an issue. To address this issue, in 2014, the United States enacted the Digital Accountability and Transparency Act, which assesses and compares the completeness, timeliness, quality, and accuracy of federal spending data that agencies submit and the implementation and use of data standards. An agency’s data quality is considered good if the completeness, timeliness, and accuracy of the information is at least 80 percent. The latest Government Accountability Office report on this topic finds that only 88 percent of the audited agencies achieve this standard
(GAO 2020). Improvements in data quality could be partly tackled by interoperability and integration, but discrepancies in nonsalary income, profits from partnerships, and capital gains can remain.
17. http://mds.gov.br/area-de-imprensa/noticias/2014/setembro/beneficiarios-do-bolsa-familia-podem-trabalhar-com-carteira-assinada; https://economia.uol.com.br/guia-de-economia/bolsa-familia-o-que-e-quem-tem-direito-qual-valor.htm; https://www.youtube.com/watch?v=2jEaQQ01asY.
18. World Bank (2019a).
19. Interoperability is the ability of a system to share information with other systems using common standards.
20. Data integration combines data from different sources and provides users a unified view of these data.
21. Broad-based categorical eligibility is a policy in which households may become categorically eligible for SNAP because they qualify for a noncash TANF or state maintenance of effort funded benefit. Many states implement broad-based categorical eligibility; the programs that confer it, the asset limit of the TANF/maintenance of effort program, and the gross income limit of that program vary across states. Broad-based categorical eligibility cannot limit eligibility: households that are not eligible for the program that confers categorical eligibility may apply for and receive SNAP under regular program rules. Under regular program rules, SNAP households with elderly or disabled members do not need to meet the gross income limit but must meet the net income limit.
22. The threshold formula is set as follows: €(189.66 + 132.76 × A + 94.83 × B), where A is the total number of other adults, and B is the total number of children. For example: for a single-person household, the threshold is €189.66 (US$213.10); for a household with two adults and two children, the threshold is €512.08 (US$575.37) (189.66 + 132.76 + 94.83 × 2); and for a household with three adults and one child, it is €550.01 (US$617.99) (189.66 + 132.76 × 2 + 94.83).
23. The possession of certain kinds of luxury assets such as boats, airplanes, luxury cars, and real estate valued over €150,000 (US$168,540) also acts as a disqualifier filter. This amount is obtained by the following formula: total taxable value may not exceed €90,000 (US$101,124) for the first individual, which is increased by €15,000 (US$16,853.97) for each additional household member, with an overall maximum threshold for each recipient unit of €150,000 (US$168,540).
24. For example: South Africa’s Older Persons Grant, Cabo Verde’s Minimum Social Pension, and China’s Dibao program, which now have stricter verification procedures.
25. Educational level of individuals and school enrollment database.
26. The Social Household Registry is the only registry that collects information on household composition, the assistance unit of the registry. This is self-reported by the head of the household. The legal definition of the household is a group of persons who share the same address, roof, and financial resources. Unlike in the United States or the European Union, income taxes in Chile are only individual; hence, information on the family or household is not collected by the tax authorities.
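The threshold formula in note 22 can be turned into a quick check of the worked examples (a minimal sketch; the function name is an illustration, not part of the source):

```python
def eligibility_threshold(other_adults: int, children: int) -> float:
    """Monthly threshold in euros (note 22): a base amount for the first
    adult, plus increments per additional adult and per child."""
    return round(189.66 + 132.76 * other_adults + 94.83 * children, 2)

# Worked examples from note 22:
print(eligibility_threshold(0, 0))  # single-person household -> 189.66
print(eligibility_threshold(1, 2))  # two adults, two children -> 512.08
print(eligibility_threshold(2, 1))  # three adults, one child -> 550.01
```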
27. The self-employed get a Social Integration Program (PIS) number, which is equivalent to the labor card number. At any time, if the self-employed individual enters formal employment, the PIS number gets converted to a labor card number.
28. About US$500 million as of February 2020.
29. This included 30.3 million social insurance (pensions) beneficiaries, 18.6 million social assistance beneficiaries, and 6.7 million labor market and activation benefit beneficiaries.
30. After 2018, the Ndihma Ekonomike was switched to PMT.
31. The Ministry of Family and Social Services is in the process of assessing new targeting methods and household needs, particularly those associated with recent economic shocks and increases in poverty.
32. Such data limitations could reduce the accuracy and even feasibility of means testing as well.
33. The Poverty Scorecards developed by Mark Schreiner of Microfinance Risk Management, L.L.C., are a PMT that uses only 10 easy-to-verify pieces of information to help target the benefits of microcredit services. The Poverty Scorecard weights are estimated using a logistic regression, but the main feature is that program staff can compute poverty scores in real time, using paper-based or electronic data collection. However, Diamond et al. (2016) show that using different estimation models with more variables outperforms the nationally calculated simple Poverty Scorecard in terms of bias and variance, highlighting the fundamental trade-off between simplicity of use and accuracy.
34. See Ward et al. (2010) for more details on the targeting precision of the method.
35. According to the Law of August 2020.
36. See World Bank and UNICEF (2020). The vulnerability score in Armenia weights the following household circumstances: the socioeconomic group of each individual in the family, the number of family members with reduced work capacity, area of residence, housing conditions, possession of a vehicle, entrepreneurial activities, conspicuous expenditures (for example, real estate purchases, foreign exchange transactions, electricity, and natural gas), family income, and assessment of the family’s living conditions by the social worker.
37. World Bank (2019a).
38. See Areias et al. (forthcoming).
39. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4054530/.
40. An alternative is to produce multiple models, optimized for different program coverage levels. However, this takes significantly more work and can be confusing for policy makers.
41. See del Ninno and Mills (2015) and Brown, Ravallion, and van de Walle (2016).
42. The number of districts and provinces has changed from time to time; this is the number that was pertinent when the work was done.
43. In theory, all possible variable combinations could be tried to determine which model specification performs best. This approach (see James et al. 2013), called “best subset selection,” assesses the best model with one variable, two variables, three variables, and so on up to p variables. It reduces selection
among the 2^p possibilities to p + 1 candidate models by selecting the best model for each exact number of variables k, where k = 1, 2, …, p, and uses cross-validated prediction errors to choose among these best models. The residual sum of squares or R² will always indicate that the model with all the variables performs best, but this only shows that the model describes the training data best, not necessarily how well it will perform out of sample. The problem with best subset selection is that it is very computationally intensive for any significant number of potential variables. Forward stepwise selection begins with a model with no variables and assesses all the models with a single variable included to see which improves cross-validated prediction error the most over the null model. The variable that led to the best single-variable model is retained. Then all possible combinations of that variable with one other variable are assessed. The variables from the two-variable model with the best cross-validated prediction error are then kept as the basis for a three-variable model, and so forth until no further improvements are obtained by adding new variables. Backward stepwise selection begins with a model with all the variables included and subtracts one variable at a time, again looking for the model that has the best cross-validated prediction errors, and continues subtracting variables until no further improvements are made.
44. These are called penalized regressions or shrinkage methods; see Areias et al. (forthcoming) for a full review.
45. Lasso is used as a variable selection device, and a less biased post-Lasso estimator is used to calculate the final PMT weights: OLS is then used to regress log consumption on the Lasso-selected variables and derive the final scoring coefficients.
46. However, there are differences between explanatory and predictive modeling, which Shmueli (2010) highlights. First, Shmueli indicates that predictive models have higher predictive accuracy than explanatory statistical models. Second, Shmueli highlights that predictive models (1) aim to look for association between X (the covariates) and Y (the dependent variable); (2) do not require direct interpretability of the relationship between X and Y; (3) take a forward-looking approach instead of testing existing hypotheses; and (4) jointly reduce bias (the result of misspecification of the model) and estimation variance (the result of using a sample). Addressing these points in predictive models translates into a different approach for selecting the covariates. The main criteria for selecting the set of covariates are the quality of the association between them and the dependent variable, as well as preexisting knowledge of correlation/association that does not necessarily come from the data set but from other studies or local knowledge. This procedure is quite different in explanatory models, where researchers (1) keep only significant variables in the model, (2) must address multicollinearity, (3) must have clear/independent control variables, and (4) must minimize endogeneity to address causality.
47. See Elbers, Lanjouw, and Lanjouw (2003).
48. Few expenditure surveys collect such variables, which often have very good discriminating power and can be verified as part of the presentation of required documents during enrollment.
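The forward stepwise routine sketched in note 43 can be illustrated in a few lines (synthetic data; for brevity this sketch scores candidate models on a single holdout split rather than full cross-validation):

```python
import numpy as np

def holdout_mse(X, y, cols, split=0.7):
    """Fit OLS with an intercept on a training split using only `cols`;
    return mean squared error on the held-out validation split."""
    cut = int(len(y) * split)
    Xtr = np.column_stack([np.ones(cut)] + [X[:cut, c] for c in cols])
    Xva = np.column_stack([np.ones(len(y) - cut)] + [X[cut:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xtr, y[:cut], rcond=None)
    return float(np.mean((y[cut:] - Xva @ beta) ** 2))

def forward_stepwise(X, y):
    """Start from the null model; at each step add the variable that most
    improves held-out error; stop when no addition helps (note 43)."""
    selected = []
    remaining = list(range(X.shape[1]))
    best = holdout_mse(X, y, selected)  # null (intercept-only) model
    while remaining:
        scores = {c: holdout_mse(X, y, selected + [c]) for c in remaining}
        c_best = min(scores, key=scores.get)
        if scores[c_best] >= best:
            break
        best = scores[c_best]
        selected.append(c_best)
        remaining.remove(c_best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.5, size=400)
print(forward_stepwise(X, y))  # should include columns 1 and 4
```

Backward stepwise is the mirror image: start from the full model and drop the variable whose removal most improves the validation error.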
49. Sample size issues can be addressed by pooling multiple rounds of survey data if they are frequent (for example, annual). In Indonesia, the greatest improvement in predictive accuracy came from moving from one year of data used for 65 provincial urban-rural models to using three years of pooled data, which allowed the reliable estimation of district-level models (around 515 models).
50. See Stock and Watson (2003).
51. Model selection in predictive modeling is not based on explanatory power assessed using R²-type metrics and the statistical significance of overall F-type statistics. The researcher can retain covariates that are statistically insignificant if the variables are important for the prediction. Predictive power for predictive models is measured by their capacity to predict the event using new data (Geisser 1975; Stone 1974) or, carefully, using the same data. Usually, researchers focus on extracting a holdout (a subsample from the same data) or pseudo-samples. In the targeting context, beyond measuring whether the average prediction and errors are acceptable overall, researchers must analyze the predictive power of the model for certain marginalized groups or groups that must be targeted by the method. For example, good predictive power for the average income or poverty levels in a region does not guarantee that the same power would generate acceptable errors for households with elderly members living alone, small households, or female-headed households. PMT uses statistical models for prediction. This implies significant differences in use compared with statistical models for identifying causal relationships. Causal models relate the dependent variable (the variable being predicted) to a set of independent (or explanatory) variables. They assume that the explanatory covariates are unrelated to each other and that the dependent variable does not have any reverse causal relationship to any of the covariates. With PMT, inference is not causal but about association. Hence, the strong underlying assumptions needed to determine causality are nonexistent or incorporated in a less formal way. Consequently, the best PMT model is not the one with high explanatory power or R², but the one with high predictive power, which can be quite different.
52. SISBEN I started its data collection in 1995, SISBEN II in 2005, SISBEN III in 2011, and SISBEN IV in 2017. SISBEN III was the main database used to allocate services and benefits until March 2020, when SISBEN IV data started to be used.
53. See Hillis et al. (2013).
54. The pilot has been in full operation since the end of 2009. The pilot program was planned and designed in 2008. Targeting, enrollment, management information systems development, and all other preparations were developed in 2009. The first cash transfer payment was delivered to beneficiaries in November–December 2009. The first phase of enrollment covered almost 2,500 households. In August 2011, it reached 4,998 households in 40 villages in three districts in two provinces: Chamwino in Dodoma, and Bagamoyo and Kibaha in Pwani. The targeting method combined geographic targeting, CBT, and PMT.
55. See Bourguignon, Ferreira, and Leite (2008) and Datt and Ravallion (1991).
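Note 51's distinction between explanatory fit and predictive power can be demonstrated with a deliberately overfit regression (a synthetic-data sketch; the split sizes are chosen to exaggerate the effect, not to mirror any actual PMT model):

```python
import numpy as np

rng = np.random.default_rng(42)

# 50 candidate proxies, only 2 of which actually relate to welfare.
X = rng.normal(size=(300, 50))
y = 1.2 * X[:, 0] - 0.8 * X[:, 3] + rng.normal(size=300)

def fit_predict(X_fit, y_fit, X_new):
    """OLS with intercept: fit on one sample, predict on another."""
    A = np.column_stack([np.ones(len(y_fit)), X_fit])
    beta, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

train, test = slice(0, 55), slice(55, 300)
in_sample = r2(y[train], fit_predict(X[train], y[train], X[train]))
out_sample = r2(y[test], fit_predict(X[train], y[train], X[test]))

# In-sample R2 is near 1 because 50 variables can "explain" 55 observations,
# but holdout R2 collapses: explanatory fit is not predictive power.
print(f"in-sample R2: {in_sample:.2f}, holdout R2: {out_sample:.2f}")
```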
56. This is a common test used in poverty map methodology for comparing household survey and census data distributions. See Elbers, Lanjouw, and Lanjouw (2003).
57. Most household surveys combine stratified sampling with cluster sampling, without replacement.
58. For example, the predictive power of any inference test drops quickly as intracluster correlations increase.
59. A third source of prediction error is error variance, which is irreducible noise that cannot be eliminated through modeling.
60. The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a data set with replacement to create many simulated samples. The jackknife (or leave-one-out) method is an alternative resampling method to the bootstrap, based on sequentially deleting one observation from the data set to estimate the precision of the estimator.
61. See Areias et al. (forthcoming) for discussion of a special case of k-fold validation called leave-one-out cross-validation (LOOCV), where k is set to n, the number of observations in the data.
62. The challenge is that with cross-sectional data, household welfare before and after a shock is not observed. Instead, the impact of a shock on changes in consumption is inferred from differences in consumption of otherwise observationally equivalent households.
63. The Listahanan, or National Household Targeting System for Poverty Reduction, is an information management system that identifies who and where the poor are in the Philippines.
64. The models used are (1) current: the current formula (the default); (2) recalibrate: recalibrate the coefficients of the current model with the Household Socio-Economic Survey 2018 data; (3) new: regression models with more and/or different variables than the current model; (4) stepwise: use stepwise regressions to streamline variable selection in the new model; (5) Lasso: use the least absolute shrinkage and selection operator (Lasso) algorithm for variable selection in the new model; (6) Lasso-int: use the Lasso algorithm for variable selection in the new model and extend it to include variable interactions; (7) random forest: use the random forest algorithm on the variables for the new model to predict welfare (this algorithm is nonlinear and does not result in simple coefficients for each of the variables); and (8) Nnet: use the neural network algorithm on the variables for the new model to predict welfare (this algorithm is also nonlinear and does not result in simple coefficients for each of the variables).
65. The machine learning models used are Ridge, Lasso, elastic net, random forest, gradient boosting, and two blended models.
66. In the Costa Rican Household Poverty Level Kaggle Competition, 616 teams tried to build the most accurate poverty classifier; all the top performers with public code used feature engineering and tree-based models (such as XGB and LightGBM). See https://www.kaggle.com/c/costa-rican-household-poverty-prediction/overview.
67. Quantile regression with cross-validated quantile.
68. Targeting measures are discussed in detail in chapter 7. For readers already familiar with inclusion and exclusion errors, Precision is 1 – Inclusion Error (or the
percentage of predicted eligible who are truly eligible); Recall is 1 – Exclusion Error (or the percentage of those truly eligible who are correctly predicted so). F1 is then the harmonic mean of Precision and Recall, or 2 × ([Recall × Precision] / [Recall + Precision]). In this sense, it is used as a balanced average of measures analogous to inclusion and exclusion errors. F2 is the same as F1 but with twice the weight on Recall (that is, exclusion error carries more weight). The Matthews correlation coefficient (MCC) is the geometric mean of Informedness and Markedness, where Informedness is Recall + Specificity – 1, with Specificity being the percentage of truly ineligible households correctly predicted so (or TN / [TN + FP]), and Markedness is Precision + Inverse Precision – 1, with Inverse Precision being the percentage of those predicted not eligible who are truly not eligible (or TN / [FN + TN]). Effectively, MCC is a scale-invariant balance of inclusion and exclusion errors.
69. Areias et al. (forthcoming); McBride and Nichols (2016); Ohlenburg (2020b); Shrestha (2020).
70. Gradient boosted quantile regression trees tend to outperform most other algorithms under both implementation modalities (line and quota), but the traditional logistic regression with a simple Lasso variable selection model also proves robustly useful.
71. k-nearest neighbors.
72. The choice between gradient boosted quantile regression trees and logistic regression with simple Lasso variable selection may depend on other considerations.
The former allows for more predictors than observations (which is quite feasible with big data), meaning a greater range of models can be explored, whereas the latter does not. Logistic models also require significant data preprocessing and are not robust to predictor noise, while boosted models do not require preprocessing and are robust. Logistic models are, however, easier to interpret than boosted models, do not require optimization or choice of tuning parameters, and are computationally much less intensive.
73. Areias et al. (forthcoming) implement both a threshold approach, where only households with scores below a certain threshold are deemed eligible, regardless of how many there are, and a quota approach, where the scores are only used to rank households and a fixed quota (tied to, say, the official poverty rate or the program budget) is deemed eligible.
74. Personal communication.
75. A similar approach is being implemented in Panama (https://www.gacetaoficial.gob.pa/pdfTemp/29163/GacetaNo_29163_20201126.pdf) and Honduras (https://www.iadb.org/es/noticias/honduras-mejora-las-condiciones-de-vida-de-los-hogares-mas-pobres-con-apoyo-del-bid).
76. See Steele et al. (2017) and https://blogs.worldbank.org/opendata/using-big-data-and-machine-learning-locate-poor-nigeria.
77. As COVID-19 cases increase, the government of Bangladesh, with the support of Yale Y-RISE, is working with machine learning modelers to extrapolate trends in mobile usage data to identify those who need to be targeted with cash support.
78. Blumenstock, Cadamuro, and On (2015); Head et al. (2017); and Jean et al. (2016) use Demographic and Health Survey data to train their models. The targeting performance for all methods in Aiken et al. (2021) improves when
moving from a PMT-based dependent variable to a true consumption variable, suggesting that the use of proxies for training models introduces additional noise (although all methods face this issue, not just ones based on big data).
79. See, for example, global mobility trends over COVID-19: https://ourworldindata.org/covid-google-mobility-trends.
80. See, for example, Beegle et al. (2012) on the impact of truncated consumption modules.
81. See Montjoye, Gambs, and Blondel (2018) and Zhang, Chen, and Zhong (2016).
82. Camacho and Conover (2011) find that in Colombia’s traditional PMT, there is anecdotal evidence of people moving or hiding their assets or borrowing and lending children, and there is evidence of manipulation of scores.
83. The Participatory Wealth Ranking begins with a communitywide meeting convened by the facilitation team. After discussing the meaning and understanding of poverty in the local context, the people draw a map of all the households in the village and fill a card with the name of each household. Three reference groups are then formed in each ranking section, that is, the hamlet. McCord (2013) highlights that the Participatory Wealth Ranking literature shows that it delivers a robust and broadly noncontentious ranking based on a community’s own understanding of poverty considered from multiple dimensions.
84. The number of Ubudehe categories has evolved over time. At the start of the Ubudehe categorization system in 2002, there were six categories. These were revised to four categories in 2014. Currently, the government is in the process of introducing a five-category system. In the current four Ubudehe categories, the first category is designated for the poorest people in society, while the fourth category is for the wealthiest members of society. More specifically: category 1: very poor and vulnerable citizens who are homeless and unable to feed themselves without assistance; category 2: citizens who are able to afford some form of rented or low-class owned accommodation, but who are not gainfully employed and can only afford to eat once or twice a day; category 3: citizens who are gainfully employed or even employers of labor (this category includes small farmers who have moved beyond subsistence farming, or owners of small and medium-scale enterprises); and category 4: citizens classified under this category are chief executive officers of large businesses, employees who have full-time employment with organizations, industries, or companies, government employees, owners of lock-up shops or markets, and owners of commercial transport or trucks (Government of Rwanda 2015; MINALOC 2015).
85. The index was constructed using factor analysis on a set of variables such as region, type of dwelling, house ownership, poverty level, consumption expenditure, type of residency, current dwelling value, water source, light source, amount paid for electricity in the past four weeks, primary source of cooking fuel, and type of toilet.
86. i2i Dime and World Bank (2017).
87. Area Executive Committees are mostly government frontline staff, also known as extension workers at the community/traditional authority level, for example, community development assistants; health surveillance assistants; and