6.8 Machine Learning Models That Are Commonly Used

There are various approaches in the machine learning literature. This section highlights four categories of models that are commonly used: robust models, penalized regression models, nonlinear models, and tree-based models. These categories rely on different optimization algorithms and can be contrasted with traditional PMT models. Box 6.8 presents a brief, high-level overview of the differences in modeling philosophy for each category, rather than an explanation of the differences between individual algorithms or the procedures they use to optimize their objective functions. A fuller treatment is provided in Areias et al. (forthcoming), based on James et al. (2013) and Kuhn and Johnson (2018).

BOX 6.8

Machine Learning Models That Are Commonly Used

Robust Models

Linear models (ordinary least squares [OLS] regressions commonly used in proxy means testing) can be overly influenced by outliers. If these outliers violate the assumptions on which linear models are based, the resulting performance of the linear model can be poor. Robust models are “robust” (less sensitive) to outliers. While they are computationally more intensive, robust models have less restrictive assumptions and thus can perform better across a wider range of data.a The main algorithms are robust linear regression and various quantile regressions with different methods for variable and quantile selection.
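
As a concrete illustration, a minimal sketch (synthetic data, not from the original text) fits OLS alongside two robust alternatives, a Huber robust linear regression and a median (quantile) regression, on data with injected outliers:

```python
# Illustrative sketch: compare OLS with robust alternatives on data
# containing outliers. All data and settings are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, QuantileRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                        # stand-in household proxies
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=500)
y[:10] += 20                                         # a handful of extreme outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)                   # robust linear regression
median = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X, y)  # quantile regression

print("OLS:   ", ols.coef_.round(2))                 # pulled toward the outliers
print("Huber: ", huber.coef_.round(2))               # close to the true coefficients
print("Median:", median.coef_.round(2))
```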

Penalized Regressions

Penalized regression models use shrinkage methods: they shrink the OLS coefficient estimates toward zero by introducing a penalty term on the coefficients. The objective of shrinkage methods is to reduce the variance of the models significantly with only a small increase in bias, thus reducing overall model error. This is particularly valuable when multicollinearity among the explanatory variables is high. The basic approach is the same as OLS but introduces a penalty on the size of each coefficient (the degree to which a variable explains the outcome of interest). If a variable does not significantly improve the model, its role in the model is reduced. The main algorithms are Lasso regression, Ridge regression, and elastic net regression.
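
To make the penalty concrete, the standard ridge and lasso objectives (textbook formulations, as in James et al. 2013; notation added here) can be written as

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2},$$

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,$$

where the tuning parameter $\lambda \ge 0$ controls the strength of the shrinkage: at $\lambda = 0$ both reduce to OLS, and as $\lambda$ grows, the coefficients on uninformative variables shrink toward zero (exactly to zero under the lasso, which thereby performs variable selection). The elastic net combines the two penalty terms.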

Nonlinear Models

Most of the models that are used for targeting applications are linear models. These models can largely be adapted to nonlinear relationships in the data—for example, by adding polynomial terms (age and age squared)—but this requires knowing the nature of the nonlinearities in advance. The various nonlinear models each work in different ways, but they are related in that the analyst does not need to know the nature of the nonlinearities in the data ex ante. The main algorithms are multivariate adaptive regression splines, k-nearest neighbors, neural networks, and support vector machines.
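
A minimal sketch of three of these learners with scikit-learn defaults follows (multivariate adaptive regression splines is omitted because it is not part of scikit-learn); the data and settings are illustrative, and none of the models is told the form of the nonlinearity:

```python
# Illustrative sketch: nonlinear learners fit a nonlinearity that is
# never specified by the analyst. Synthetic data, default settings.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

models = {
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=10),
    "support vector machine": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}
for name, model in models.items():
    print(name, round(model.fit(X, y).score(X, y), 3))  # in-sample R^2, for illustration
```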

Tree-Based Models

Tree-based models are a subclass of nonlinear models that differ from the other categories of models considered so far in that they use if-then statements to partition the data, rather than a regression. At their most basic, tree-based models split the data into two partitions, based on whether each observation’s value on an explanatory variable is above or below a certain cut point. For example, is the household size more or less than four? Within each split, a further split is made based on the observation’s value on a second explanatory variable and a second accompanying cut point. This continues until each branch of the tree reaches a terminal node, at which point the remaining observations at that node receive a predicted outcome, which could be as simple as a fixed value or the mean value of outcomes at this node from the training data, up to a regression-based prediction using all the explanatory variables. The main algorithms are all ensemble approaches:b random forest, random forest with quantile loss, gradient boosted regression trees, and gradient boosted quantile regression trees.
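
A minimal sketch (synthetic data, illustrative settings, not from the original text) of a single regression tree and two of the ensemble approaches follows; the quantile variant is gradient boosting with a quantile (pinball) loss, and printing the fitted tree shows the readable if-then splits:

```python
# Illustrative sketch: a regression tree and tree ensembles on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                        # stand-in household proxies
y = (X[:, 0] > 0) * 2.0 + X[:, 1] + rng.normal(scale=0.2, size=500)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # if-then splits on cut points
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
gbm_q50 = GradientBoostingRegressor(loss="quantile", alpha=0.5).fit(X, y)

# The single tree is a set of human-readable if-then rules:
print(export_text(tree, feature_names=["x0", "x1", "x2", "x3"]))
```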

a. Areias et al. (forthcoming) look at five robust models: (1) robust linear regression with principal components, (2) quantile regression, (3) quantile regression with cross-validated quantile, (4) quantile regression with Akaike Information Criterion variable selection, and (5) quantile regression with Akaike Information Criterion variable selection and cross-validated quantile.

b. Tree-based models are popular for three main reasons: (1) they generate conditions that are highly interpretable and easy to implement; (2) they do not require specifying the relationship between the explanatory and outcome variables ahead of modeling; and (3) they handle missing data and implicitly conduct variable selection. At the same time, two well-known weaknesses are (1) model instability, and (2) predictive performance that can be beaten by other approaches. Ensemble methods have been developed to address these issues. This chapter looks at both basic regression trees and ensemble approaches, which build on them.

Kuhn and Johnson (2018) summarize the different models and algorithms and their characteristics along various dimensions. As Areias et al. (forthcoming) further summarize, “there is no one ML [machine learning] model which is uniformly better than the others; the applicability of a technique is dependent on the type of data being analyzed, the needs of the analyst and the context of how the model will be used.” Of particular note, the traditional PMT models (linear and logistic regressions) are easily interpretable and computationally easy, with no tuning parameters, but they require significant preprocessing, have no automatic variable selection, and are not robust to predictor noise.

Does Machine Learning Improve Prediction Accuracy over Traditional PMT Regressions?

A few studies compare machine learning models with traditional regressions when using standard household survey data. The Areias et al. (forthcoming) exercise uses data from 17 Sub-Saharan African harmonized household surveys (100 subsamples each) to test 19 algorithms, evaluating them on four metrics, making it the most systematic assessment to date. Shrestha (2020) uses the Mongolia Household Socio-Economic Survey 2018 to compare the current PMT model prediction with seven algorithms,64 estimating them on training data—a 70 percent random subset of the Household Socio-Economic Survey 2018 data—and comparing their performance on the remaining 30 percent of test data not used for estimation. McBride and Nichols (2016) run a similar exercise with national household survey data from Bolivia, Malawi, and Timor-Leste, using the United States Agency for International Development poverty assessment tool and base data to compare the gains in accuracy of machine learning against PMT. Ohlenburg (2020b) compares seven machine learning models65 with OLS and stepwise OLS for Indonesia.66
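
A minimal sketch of the 70/30 evaluation design described above (not the authors’ code; the data and model are illustrative stand-ins):

```python
# Illustrative sketch: fit on a random 70 percent subset, score on the
# held-out 30 percent that plays no role in estimation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # stand-in survey proxies
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)           # any PMT model or ML algorithm
rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"held-out RMSE: {rmse:.3f}")
```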

The first result is that an algorithm’s performance depends on (1) the targeting metric it is optimizing and by which it is assessed, and (2) the type of targeting problem it is asked to solve. Areias et al. (forthcoming) show that algorithms perform differently on different metrics. For example, consider an algorithm in the robust class.67 This algorithm performs poorly on a standard machine learning metric (root mean square error) but is one of the best relative performers on a targeting measure that favors lower exclusion error (F2).68 At the same time, it performs poorly on another targeting measure that favors a balanced lower inclusion-exclusion error (Matthews correlation coefficient [MCC]; see chapter 7). This suggests that an algorithm’s performance depends on the policy maker’s objectives; if she wanted to reduce inclusion and exclusion errors in equal measure, she would not choose this robust algorithm. Conversely, if she cared most about exclusion error, she might consider it. Similarly, the decision between using household scores with a poverty line versus a poverty quota approach for eligibility has implications for the choice of algorithm. The same robust algorithm is one of the best performers on F1 and F2 measures when the line approach is used, but it is one of the worst when the quota approach is used. Some of the other models reviewed showed similar swings in results. McBride and Nichols (2016) also find that the differences between machine learning and traditional approaches depend on the metric being used; on one metric, machine learning approaches are better and on another, they are worse.
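
To make the interplay of metric and problem type concrete, the sketch below (synthetic data and an illustrative 25 percent line, none of it from the studies above) converts one set of predicted welfare scores into eligibility under both a line rule and a quota rule and then scores each rule on F2 and MCC, alongside RMSE on the underlying prediction:

```python
# Illustrative sketch: the same predictions can rank differently depending
# on the eligibility rule and the metric used to assess them.
import numpy as np
from sklearn.metrics import fbeta_score, matthews_corrcoef, mean_squared_error

rng = np.random.default_rng(1)
welfare_true = rng.lognormal(mean=0.0, sigma=0.5, size=1000)       # true consumption
welfare_pred = welfare_true * rng.lognormal(sigma=0.3, size=1000)  # noisy prediction

poverty_line = np.quantile(welfare_true, 0.25)                     # illustrative line
poor_true = welfare_true < poverty_line

eligible_line = welfare_pred < poverty_line                        # line rule
eligible_quota = welfare_pred <= np.quantile(welfare_pred, 0.25)   # quota rule (bottom 25%)

rmse = np.sqrt(mean_squared_error(welfare_true, welfare_pred))
for name, eligible in [("line", eligible_line), ("quota", eligible_quota)]:
    f2 = fbeta_score(poor_true, eligible, beta=2)    # weights exclusion error more heavily
    mcc = matthews_corrcoef(poor_true, eligible)     # balances both error types
    print(f"{name}: F2={f2:.3f}  MCC={mcc:.3f}  (prediction RMSE={rmse:.3f})")
```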

Moreover, this small literature69 finds that the gains in precision compared with traditional PMT regressions are marginal at best. The studies show that even when there are clearly preferred algorithms on statistical grounds, the differences in performance are generally not important in magnitude. Areias et al. (forthcoming) find that most of the differences in model performance are less than 5 percent. They also highlight that in their study, the two best performers across the line and quota approaches are a machine learning model and a traditional regression,70 while another machine learning model is clearly inferior across all performance measures and modes of implementation,71 although there are nonperformance-based considerations as well.72 Shrestha (2020) concludes that nonlinear machine learning algorithms do not necessarily improve targeting performance compared with a simple PMT recalibration in his study using Mongolia’s most recent household survey data. Machine learning models failed to improve on PMT performance across the country’s sublocations and algorithms, except in one case where the improvement was on the order of 2 percent. Moreover, the analysis shows that no single algorithm consistently outperformed the others. Ohlenburg (2020b) notes that in Indonesia, although some of the machine learning models perform better than the traditional PMT, their gain over a traditional regression is surprisingly limited considering that PMT should play to the strengths of machine learning methods.

Although the results are from a limited range of countries so far, they provide important considerations for the use of machine learning estimation in other settings. As Areias et al. (forthcoming) indicate, performance can vary across different dimensions, including the choice of targeting metric and the policy maker’s objective, program size, and scoring implementation approach. Does a policy maker prioritize reducing exclusion error or more balanced inclusion and exclusion errors? Is she trying to use the same score to determine eligibility for programs of very different sizes? Is a threshold or quota approach being used?73 Regardless of whether the precise results of this work extend to other settings, they indicate that there is not necessarily a universally best algorithm and that, where an optimal algorithm does exist, it will be context specific. Whether an exhaustive search for the optimal algorithm is justified by the often limited real-world (if not statistical) differences between model outcomes is an important consideration for policy makers.

The ability to optimize machine learning algorithms for different measures of performance means that one approach may be one of the best performers on one metric and one of the worst performers on another metric, making it more important for policy makers to specify their preferred measures. Policy makers may also have different sensitivities to different targeting errors depending on the program and country. Some may be most worried about exclusion errors, prioritizing minimizing the number of poor who are mistakenly excluded from the program at the expense of including more nonpoor than they might under an alternative scoring approach. Others may be more worried about inclusion errors, as high-profile mistakes, such as including the mayor’s spouse or a well-known businessperson, undermine the credibility of the targeting system. Yet others may prefer a balanced approach to minimizing inclusion and exclusion errors. Areias et al. (forthcoming) provide a few examples of machine learning models that are among the most accurate on one performance measure (such as errors of inclusion or exclusion) and one of the least accurate on another. This may not be a common occurrence, but it underscores the need for comprehensive evaluation of all models against policy makers’ objectives and intended uses.

Transparency is a concern with PMT, but it is accentuated with machine learning models. Even a single traditional PMT model is often subject to the criticism that it is a “black box,” both because the statistical approach can be difficult for a general audience to understand and because the exact scoring formula is often kept confidential to avoid households gaming the scoring system. Such criticisms could become louder when that black box is not a simple OLS or related regression but a neural network with scoring criteria that are very difficult to interpret and whose derivation is even more difficult to explain; often the modelers themselves do not know how the algorithm arrived at its final result. In contrast, the key advantage of the traditional PMT models is that a good model allows the researcher to read the coefficients and tell a story about who the beneficiaries of the program would be.

However, there are limitations to studies that only apply machine learning models to standard survey data. The systematic assessment of models in Areias et al. (forthcoming) is only a static assessment based on static surveys. It applies different PMT models and machine learning algorithms to the standard household cross-sectional survey data used to predict whether a household is below an income or consumption threshold. The conclusion is that machine learning does not offer significant improvements over traditional PMT with existing data; it says nothing about machine learning improvements with new data. McBride and Nichols (2016) indicate that the conservative gains in accuracy from machine learning methods seem to be due to data limitations. Ohlenburg (2020b, 25) notes that a key limitation of his Indonesian study is that “the relative similarity of results between OLS-based approaches and various machine learning methods might imply that lack of additional information in the data is a greater bottleneck than variable or method selection.” He also notes that no attempt was made to do feature engineering, which is discussed next.

Machine learning models could benefit from the dynamic data collection of the social registry (box 6.7). Given the data-hungry nature of machine learning approaches, they may offer significantly more promise than traditional regression-based models. For example, as the amount of information on households in a country’s social registry expands, the information on each household changes over time. As more households enter the registry, machine learning algorithms, particularly unsupervised and deep learning approaches, may be able to improve performance by incorporating the new information. In other words, it is assumed that machine learning systems improve as more data are available or, according to Ohlenburg (2020a, 10), “as they learn, much like people, they should also keep learning as they keep performing a task, as people do.”
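
One hedged way to probe this “more data” logic (synthetic data and an illustrative model, not from the sources above) is a learning curve, which traces out-of-sample error as the training set grows; a curve that keeps falling suggests that data, rather than the algorithm, is the binding constraint:

```python
# Illustrative sketch: out-of-sample error as a function of training set size.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = X[:, 0] * (X[:, 1] > 0) + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=2000)

sizes, _, test_scores = learning_curve(
    GradientBoostingRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_root_mean_squared_error",
)
for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"n={n:4d}  out-of-sample RMSE={-s:.3f}")
```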

Noriega-Campero et al. (2020) go beyond the earlier studies to incorporate verified income data from the social registry and use feature engineering to develop new variables from the existing survey data. The paper documents the development of machine learning–based PMT models for use in the social registries in Costa Rica and Colombia; they were adopted for implementation in the former but not the latter. In particular, this real-world example improves on earlier work by making more data available to the models. First, the study pools multiple years of surveys to increase the number of observations, as was done in Indonesia to facilitate district-level models, providing samples of 22,000 in Costa Rica and 462,000 in Colombia. Such an approach would also benefit traditional regression-based PMT models but should only be used in countries where surveys are conducted close together so that the relationship between income/consumption and proxies is relatively stable. In many developing countries, surveys are five years or more apart; thus, pooling is not advised in such cases (this approach was not possible in Areias et al. [forthcoming]). Second, the study matches the survey data to administrative data on verified household income (in a sense, making this closer to HMT than PMT), which leads to a significant improvement in both traditional and machine learning models.74 Finally, the study derives new variables from the existing survey data, beyond the expert-selected variables used in standard models:

Second, statistical features, including means, modes and entropies for all individual-level variables of household members, such as age, gender, and education. Lastly, deep features, generated by a recursive neural network that condenses information of the individual-level features into a one-dimensional encoding—a technique akin to the AI subfield of multiple instance learning (MIL). (Noriega-Campero et al. 2020, 243)
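
A minimal sketch of the statistical-features step described in the quote, assuming a person-level table with a household identifier; all column names and values are hypothetical:

```python
# Illustrative sketch: collapse individual-level variables to household
# level via means, modes, and entropies, as in the quoted description.
import pandas as pd
from scipy.stats import entropy

people = pd.DataFrame({
    "hh_id": [1, 1, 1, 2, 2],
    "age":   [34, 31, 6, 58, 60],
    "educ":  ["secondary", "primary", "none", "primary", "primary"],
})

features = people.groupby("hh_id").agg(
    age_mean=("age", "mean"),
    educ_mode=("educ", lambda s: s.mode().iat[0]),
    educ_entropy=("educ", lambda s: entropy(s.value_counts(normalize=True))),
)
print(features)
```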

Noriega-Campero et al. (2020) find significant improvements in prediction accuracy, which are largely derived from incorporating new data and variables; traditional models also show considerable improvements with the new data. Noriega-Campero et al. simulate targeted programs with coverage equal to the national poverty rate (22 percent in Costa Rica, 28 percent in Colombia).75 The simulations indicate improvements of 20 and 26 percent for Costa Rica and Colombia, respectively, when comparing a traditional regression model (quantile linear regression) with standard survey variables to the best performing machine learning model (gradient boosting) with statistical and deep features. As figure 6.6 indicates, it is the inclusion of the statistical and deep features that drives the majority of the improvement. Gradient boosting with expert (standard) variables shows only a modest reduction in errors compared with quantile linear regression with expert variables. Quantile linear regression with the new variables outperforms gradient boosting with the standard variables and can provide over half of the improvement that gradient boosting with the new variables can.

Better survey and administrative data are likely more important than machine learning–driven feature generation and estimation in improving PMT accuracy. Despite the results of the paper, Noriega-Campero and his associated organization, PROSPERIA (https://www.prosperia.ai), consider feature engineering and algorithm choice to be less important in improving PMT prediction than having updated survey data and a wider range of quality variables. They suggest a ranking of factors, from most to least important, that would improve PMT (PROSPERIA 2021):

1. Updated survey data
2. Quality and quantity of usable variables (whether from surveys or administrative sources)
3. Feature generation
4. Algorithm or model choice
5. Combination of 3 and 4

Combining difficult-to-explain machine learning models with deep features risks putting a black box on top of another black box; intuitive visualization tools can help mitigate this. The section on traditional PMT discussed the concern that PMT is a black box for policy makers and the public. The use of machine learning models that do not produce everyday variables with intuitive scoring (as in table 6.3) risks doubling down on this opacity. However, visualization tools have been developed that can help policy
