
BOX 6.5

PMTs Are a Predictive Model Exercise, Not a Causal Effect One

In general, statistical models are powerful tools for causal explanation, prediction, and description of data. A great deal of research and econometric analysis uses statistical modeling to test causal claims. In a regression, the causal claim follows a simple structure in which each of the covariates (called independent variables) is assumed to have a causal influence (the regression coefficient) on the dependent variable (for example, income or consumption per capita). Such models rest on the assumptions that the covariates do not causally influence one another and that there is no reverse causal influence from the dependent variable to any of the covariates.

In the targeting world, most of the time, similar models are used for their ability to predict what income or consumption would be when it is not measured. The inference is not causal but rather about association. Hence, the strong underlying assumptions needed to establish causality are either not required or are incorporated only informally. Consequently, the best model is not the one with the highest explanatory power or R2, but the one with the highest predictive power, which is quite different.

Shmueli (2010) highlights the main differences between explanatory and predictive modeling. First, predictive modeling tends to have higher predictive accuracy than explanatory statistical models. Second, predictive models aim at (1) looking for association between the x (covariates) and y (dependent variable), (2) not requiring direct interpretability of the relationship between x and y, (3) taking a forward-looking approach instead of testing an existing set of hypotheses, and (4) jointly reducing bias (the result of misspecification of the model) and estimation variance (the result of using a sample). Addressing these points in predictive models translates into a different approach for selecting the covariates.

While building a model for proxy means testing (PMT), the aim is to find correlations and associations rather than to look for causal structure, endogeneity, or reverse causality. The main criteria for selecting the set of covariates are the quality of the association between them and the dependent variable, as well as preexisting knowledge of correlation/association that does not necessarily come from the data set but from other studies or local knowledge.a This procedure is different from explanatory models, where researchers must (1) keep only significant variables in the model, (2) address multicollinearity, (3) have clear/independent control variables, and (4) minimize endogeneity to address causality.

Finally, model selection in predictive modeling is not based on explanatory power, as assessed through R2-type metrics and the statistical significance of overall F-type statistics. The researcher can retain covariates that are statistically insignificant if they are important for prediction.b The predictive power of models is measured by their capacity to predict an event using new data (Geisser 1975; Stone 1974) or, with care, the same data; usually researchers extract a holdout sample (a subsample of the same data) or pseudo-samples. In the targeting context, beyond measuring whether the average prediction and errors are acceptable overall, researchers must analyze the predictive power of the model for marginalized groups or groups that may be of interest for social policy. For example, good predictive power for average income or poverty levels in region x does not guarantee that the same model would generate acceptable errors for households with elderly living alone, small households, or female-headed households.
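To make the holdout idea concrete, the following sketch (not taken from the box or its sources) fits a simple linear PMT on a training split and checks prediction errors on held-out households, overall and for a policy-relevant subgroup. All column names (log_pcexp, elderly_alone, and the proxies) and the file name are illustrative assumptions.

```python
# Minimal sketch, assuming a cleaned household survey with the columns named
# below (all illustrative); shows holdout-based assessment of predictive power,
# overall and for a subgroup of interest.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("household_survey.csv")              # hypothetical survey extract
proxies = ["rooms", "has_fridge", "hh_size", "head_educ_years"]

train, test = train_test_split(df, test_size=0.3, random_state=0)
model = LinearRegression().fit(train[proxies], train["log_pcexp"])

test = test.assign(pred=model.predict(test[proxies]))
test["abs_err"] = (test["pred"] - test["log_pcexp"]).abs()

# Average error overall vs. for a marginalized group of policy interest
print("mean abs. error, all households:", round(test["abs_err"].mean(), 3))
print("mean abs. error, elderly living alone:",
      round(test.loc[test["elderly_alone"] == 1, "abs_err"].mean(), 3))
```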

The role of the researcher differs between traditional PMT and PMT that uses machine learning. Traditional PMT may be thought of as a study, not a technology: it requires good data mining and skilled researchers to make decisions, as it is important to understand historical data as well as existing external data to find patterns and behaviors. PMT using machine learning, in contrast, is a methodology for prediction that often uses artificial intelligence techniques. With these models, the algorithms are often given data and asked to process them without a predetermined set of rules and regulations. The systems adapt and learn as new data are added, without being directly programmed, and without continuously addressing the discontinuity of loss functions.c Machine learning is data driven, and the problem to be solved needs to be precisely described to find the right algorithm; once an algorithm is calibrated for a particular prediction problem, it cannot be used for a different one without being reestimated, so the role of the researcher is important upfront, in problem specification.

a. For example, in work on a PMT for a given country in 2008/09, the addition of the interaction between possession of camels and region x would improve the predictive power significantly, although the interaction was not significant according to the p-value. The rationale behind it was the local knowledge of the team that highlighted the main difference in welfare, which was not observed in the data due to the small sample size. Given the dryness of the area, subsistence farmers in region x owned camels for trading limited goods.
b. Shmueli (2010) explains that in medical research, a variable for smoking habits is often present in models for health conditions, whether it is statistically significant or not, and that sometimes exclusion of significant variables improves predictive performance.
c. See chapter 8 for a fuller explanation of using machine learning for targeting.


the incidence of beneficiaries in the current program and the simulated PMT model. The comparison showed that such a move would reduce the program’s inclusion and exclusion errors. In other countries, the ad hoc formulae have passed the test of time. This is the case in Armenia, where eligibility for the two flagship antipoverty programs, the Family Benefit and Social Benefit programs, is based on a complex vulnerability score.36 The incidence of program beneficiaries has been compared with an analytically derived PMT model, but the improvement in targeting accuracy was not considered to be enough to warrant a reform of the eligibility criteria and associated changes in the delivery system. The Kosovo Social Assistance Scheme shares a similar story.37

As PMT is an inference-based method, it contains statistical error, which makes it controversial. For example, Brown, Ravallion, and van de Walle (2016) simulate the performance of various PMT methods using data from nine African countries. They show how PMT helps filter out the nonpoor but excludes many poor people. Kidd (2011) and Kidd, Gelders, and Bailey-Athias (2017) stress that PMT has in-built design (statistical) errors and static instruments as well as some inevitable level of implementation errors. They argue that since errors can be high and the methods not well understood by applicants or communities, PMT can be perceived as arbitrary or a lottery. The works cited in this paragraph are among the more strident in tone in criticizing PMT, but everyone engaged in PMT work acknowledges the inherent imperfection in having to rely on statistical modeling rather than more precise measurement of welfare. But PMTs are not meant to be used where means testing or HMT is feasible. Many authors who reject PMT recommend geographic or demographic targeting as simple metrics of welfare, which can be thought of conceptually as single-variable PMTs and thus even less accurate, although they are simpler and more transparent. The prior chapters consider whether PMT is a suitable choice in a given setting and how to reduce implementation error, which are topics that are not rehashed in this chapter. The focus here is on how to reduce statistical error, although it cannot be eliminated.

This section reviews the main steps for developing PMT and the traditional data and models that underlie it. PMT was first developed four decades ago and, while the basic policy problem to be solved is the same, the how-tos have evolved over time. The section is broken into several subsections: model choice, whether to use multiple models, how to choose explanatory variables, how to update the model and the social registry, and data-related limitations. The discussion focuses on what have until relatively recently been the standard approaches to PMT, which are referred to as "traditional." A subsequent section considers how the advent of big data and machine learning might improve PMT.


Traditional PMT Models

PMT is fundamentally an inference problem: it estimates unobservable household income or consumption from observable proxies that correlate well with it. This inference can be done on different bases. This subsection briefly surveys the traditional approaches, which rely on classical regression methods; more recent machine learning algorithms and big data–based proxies are examined later.

The PMT models with the lowest data requirements use principal component analysis. This method was made popular by Filmer and Pritchett (2001) and does not require a household survey with income or consumption, as all other PMT methods do. It identifies linear combinations of the variables measured in household surveys that maximize the variation across households in characteristics correlated with welfare. Usually, only the first principal component identified by this method is kept as a proxy for household welfare. Principal component analysis has significant limitations—the resulting score can be used to rank households but is difficult to interpret, standard inequality measures such as the Gini index cannot be calculated from it, and it cannot be compared with principal component scores from other models—but it obviates the need for income or consumption surveys. This method is still used—it is the basis for the household welfare proxy in the widespread Demographic and Health Surveys—but given improvements in the availability of household surveys with income or consumption, a range of more accurate regression techniques is available for determining beneficiary eligibility.
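As a rough illustration of the principal component approach, the sketch below builds a one-component asset index from binary asset indicators. The indicator names and file are hypothetical, and this is a simplification in the spirit of Filmer and Pritchett (2001), not their exact procedure.

```python
# Minimal sketch of a PCA-based welfare proxy: keep only the first principal
# component of standardized asset indicators and use it to rank households.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

assets = pd.read_csv("assets.csv")                     # hypothetical asset module
cols = ["has_tv", "has_fridge", "has_motorbike", "improved_water", "finished_floor"]

z = StandardScaler().fit_transform(assets[cols])       # standardize 0/1 indicators
assets["wealth_index"] = PCA(n_components=1).fit_transform(z)[:, 0]

# The score supports ranking (e.g., quintiles) but has no monetary interpretation.
assets["wealth_quintile"] = pd.qcut(assets["wealth_index"], 5, labels=False)
```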

Logit or probit models regress the binary status of a household (poor or nonpoor; eligible or not eligible) on the explanatory variables. Ordinary least squares (OLS) models regress household income or consumption on the explanatory variables. Both are popular in practice because they are more easily understood than principal component analysis and produce either a direct probability of a household being poor (logit/probit) or a direct income or consumption measure (OLS). The main advantage of OLS is that the model coefficients are not generated relative to a fixed poverty line. With binary models, different versions must be estimated for different cutoff points (for example, at both the poverty and extreme poverty lines), so there is less flexibility to use the same scoring for different programs with different eligibility thresholds. Other models, such as generalized least squares and nonparametric models, are also used.38
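A hedged sketch of the two workhorse specifications follows: an OLS model of log consumption per capita (one score, usable with any threshold) and a logit model of poverty status (tied to the line used to define the binary outcome). The variable and file names are illustrative assumptions.

```python
# Minimal sketch contrasting OLS and logit PMT specifications (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("household_survey.csv")               # hypothetical survey extract
df["log_pcexp"] = np.log(df["pcexp"])
df["poor"] = (df["pcexp"] < df["poverty_line"]).astype(int)

rhs = "rooms + has_fridge + hh_size + head_educ_years"
ols = smf.ols(f"log_pcexp ~ {rhs}", data=df).fit()      # continuous welfare score
logit = smf.logit(f"poor ~ {rhs}", data=df).fit()       # binary, poverty-line specific

df["pmt_score"] = ols.predict(df)      # predicted log welfare; compare to any cutoff
df["prob_poor"] = logit.predict(df)    # predicted probability of being poor
```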

The main weakness of OLS is that it assumes that the association between the explanatory and dependent variables is the same at all levels of the distribution. That is, it assumes that the regression coefficients are constant across the population. Cook and Manning (2013)39 highlight the need to think beyond the mean, as the correlations between welfare and certain household characteristics may differ for those at the bottom, middle, or top of the distribution. OLS models for targeting would then be inappropriate if the mean is not a good representation of the poor. In other words, if PMT attempts to identify those at the bottom of the distribution (the poor), why not estimate the model so that the coefficients reflect the average association for that income range while still estimating continuous welfare? Different approaches can be incorporated into OLS models for PMT, such as adding interactions, to address this limitation.40

Koenker and Bassett's (1978) quantile regression method allows estimation of the coefficients in a direct and transparent way at the poverty rate or proposed eligibility threshold, which better matches the policy problem of targeting. A feature of quantile regressions is that they focus on limiting errors at the bottom end of the expenditure distribution by ensuring that the formula effectively models the expenditures of the poorest households, without regard to imprecision in the formula above the threshold needed for the program. This is an advantage if policy makers care more about exclusion error than inclusion error. A challenge is setting the right quantile. In practice, the first and second deciles or the first quartile are often used as these represent the program population of interest. Del Ninno and Mills (2015) suggest that when the poverty threshold lies far from the mean, or when between-cluster correlation is high, it may be more relevant to estimate PMT weights using quantile regressions. Brown, Ravallion, and van de Walle (2016) show that for nine African countries, quantile regression performs better than traditional models in most cases. In other words, quantile regression may be more appropriate for countries with high levels of poverty and low inequality, where the population average is not a good representation of the poorest; when much of the distribution is similarly poor, PMT models struggle to distinguish between more finely grained degrees of need.
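The sketch below fits the same kind of formula at the 20th percentile with statsmodels' QuantReg. The choice of quantile, the variables, and the file name are illustrative assumptions rather than values recommended by the text.

```python
# Minimal sketch of a quantile-regression PMT fitted at the 20th percentile,
# so the coefficients reflect the bottom of the welfare distribution.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("household_survey.csv")               # hypothetical survey extract
df["log_pcexp"] = np.log(df["pcexp"])

qreg = smf.quantreg("log_pcexp ~ rooms + has_fridge + hh_size + head_educ_years",
                    data=df)
fit_q20 = qreg.fit(q=0.20)         # quantile chosen near the program's target group
print(fit_q20.params)
```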

Two alternative methods to quantile regression that aim to give more weight to the poor are poverty-weighted least squares and truncated regression. The first allows various weighting schemes (such as weighting observations below the poverty line equally but giving zero weight to those above the line), while for the second, the sample is truncated for certain ranges (for example, the nonpoor). In practice, however, these approaches do not necessarily improve the models because some variables lose variation or drop out in the restricted sample, which reduces accuracy. Quantile regression is considered more effective in certain contexts41 because it weights different portions of the sample to generate the coefficient estimates, thus increasing the power to detect differences in the upper and lower tails.
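One possible reading of poverty-weighted least squares is sketched below, using the equal-weight-below-the-line, zero-weight-above-the-line scheme mentioned in the text; the implementation details and column names are assumptions, not a prescribed formula.

```python
# Minimal sketch of poverty-weighted least squares: households below the poverty
# line receive full weight; those above receive zero weight and so drop out of
# the fit. Column and file names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("household_survey.csv")
y = np.log(df["pcexp"])
X = sm.add_constant(df[["rooms", "has_fridge", "hh_size", "head_educ_years"]])
w = np.where(df["pcexp"] < df["poverty_line"], 1.0, 0.0)   # one weighting scheme

pwls = sm.WLS(y, X, weights=w).fit()
print(pwls.params)
```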


With the advent of greater computational power, machine learning algorithms are being used to estimate PMT, drawing on parametric and nonparametric models that are more computationally intense (for example, nonlinear models and tree-based models). A full treatment of machine learning algorithms is presented in the next section, but the discussion below on regional models, choice of variables, updates, and so forth is valid for both traditional PMT and machine learning models.

Multiple Regional Models

An important policy consideration is whether to use a single national model or multiple regional models that are fine-tuned for different purposes. Different models for different regions are common. The values of different proxies can vary from place to place due to differences in such things as climate and preferences. For example, in rural areas, having livestock may indicate prosperity, but in urban areas, not having livestock does not necessarily mean that a household is poor. An air conditioner may be a marker of prosperity in a location with very hot summers but not somewhere with milder temperatures. Thus, most countries use different models for different parts of the country. The extent to which this can be done depends on how much disaggregation the survey data allow. The most extreme example is Indonesia, where 500 different models were developed, one for each district. This was made possible by pooling the large annual surveys from consecutive years; in simulations, the biggest gains came from pooling three years of data. With each additional disaggregation—from a single national model, to urban and rural, to provincial, to districts—accuracy improved, and the move from 71 provincial models42 to 500 district models showed the greatest improvement (Lange et al. 2016). However, it is not necessary to go to this extreme for regional models to be more accurate. Much will depend on what the data (sample sizes) permit, but analysts should consider what kinds of splits are pertinent and feasible. Is urban/rural a better split than models that distinguish by political/administrative unit (for example, state or department)? Does distinguishing between metropolises, mid-size cities, and smaller urban townships yield improvements in urban models? Do rural models improve if they are broken into major agroecological zones (for example, mountains versus plains, or desert or jungle versus moderate climates)?

A less common approach is to use multiple models that are fine-tuned for different program thresholds. If scores are being used to select households for multiple programs of different sizes, then a policy maker could use a single score for each household and a different eligibility threshold for each program based on program size, which might be determined by the available budget or predetermined program objectives. Indeed, this approach is currently taken in many countries using PMT. A policy maker could also choose to use a different model for each program on the basis that models can be optimized to target different parts of the income or consumption distribution, so a set of program-specific models can be more accurate for each program than a single one-score-fits-all approach used for all programs. Nonetheless, using multiple scoring models requires considerably more time and effort to develop and can be difficult to communicate to policy makers and the public. Whether any improved accuracy warrants these complications is a trade-off to be assessed.

Choosing the Explanatory Variables

The choice of explanatory variables is key for modeling. It can be attractive to use many, or even all, of the potentially suitable variables at once, without further processing or deeper thinking about how the variables interact with one another. This can result in overfitting, where the model is very good at predicting the survey data but not very good at predicting new data (which is how the model will be applied). Some practitioners use a stepwise approach that involves introducing (forward stepwise) or eliminating (backward stepwise) variables one at a time and using a statistical test to determine whether the model fit is improved.43 This is a computationally efficient way of assessing model effectiveness, which improves upon assessing all possible variable combinations. However, a stepwise approach may not produce the best model. As James et al. (2013) note, if the best one-variable model uses variable A while the best two-variable model uses variables B and C, then the best two-variable model will not be assessed (variable A is selected in the first stage, and only variable A + variable B and variable A + variable C will be assessed next). Alternative approaches from machine learning prevent overfitting due to the inclusion of too many variables, and these approaches have started to be incorporated into traditional PMT development. A common method includes a penalty for complexity in the model.44 This results in simpler models and tends to improve how well the models perform on new data. Common machine learning "penalized algorithms" include the least absolute shrinkage and selection operator (Lasso) and Ridge regression. Ridge regularizes the model to prevent overfitting, while Lasso both regularizes the model and facilitates variable selection. Lasso has been used in several countries; for example, the recently developed PMT models in Iraq use Lasso,45 as does the new poverty map being developed for Jordan.
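As an illustration of the penalized approach, the sketch below runs a cross-validated Lasso on standardized covariates with scikit-learn and reports which variables survive the shrinkage. The variable list and file name are hypothetical, and this is not the specification used in Iraq or Jordan.

```python
# Minimal sketch of Lasso-based variable selection for a PMT, with the penalty
# chosen by cross-validation; coefficients shrunk exactly to zero drop out.
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("household_survey.csv")               # hypothetical survey extract
X = df[["rooms", "has_fridge", "has_tv", "hh_size", "head_educ_years", "owns_land"]]
y = np.log(df["pcexp"])

pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)).fit(X, y)
coefs = pd.Series(pipe.named_steps["lassocv"].coef_, index=X.columns)
print(coefs[coefs != 0])                                # variables retained
```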

Regardless of the process for incorporating variables in a model, various standard data analyses can be implemented. Before choosing the covariates, it is good practice to run basic data analysis to reproduce the official poverty and inequality statistics (which are usually constructed by the national statistical office). This step is important as social programs are assessed on their capacity to mitigate poverty. Ensuring that the welfare metric is correctly built, using the proper consumption or income aggregate as well as equivalence scales and regional price adjustments, if any, is a precondition for having a good model. Poverty assessments and other such studies can be important sources to inform the analyst about the potential covariates that should be used in the statistical models.
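A minimal sketch of this check follows, under the assumption that the survey carries a household weight, household size, a per capita consumption aggregate, and an official poverty line (all column and file names are illustrative).

```python
# Reproduce the official (weighted) poverty headcount as a data-quality check
# before any modeling; the result should match the published national figure.
import numpy as np
import pandas as pd

df = pd.read_csv("household_survey.csv")
poor = (df["pcexp"] < df["poverty_line"]).astype(float)
person_weights = df["hh_weight"] * df["hh_size"]        # individual-level weighting

headcount = np.average(poor, weights=person_weights)
print(f"Estimated poverty headcount: {headcount:.1%}")
```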

Another good practice for selecting an initial set of explanatory variables is to start with traditional exploratory data analysis of the variables that would be easily observable when an applicant is filling in program forms. A researcher must first "read" the data to understand their strengths and limitations and clean them to deal with missing observations and outliers. Examining the frequency of responses (low-frequency categories can bias the predictors, as they become noise in the model) and using the sampling design and sampling weights to reproduce core statistics published by national statistical offices helps guarantee proper data manipulation later. Data visualization through scatterplots, histograms, box plots, normal plots, and the like can also be employed. Such analyses are at the core of any cause-and-effect analysis, and the approach should not be different for predictive modeling.46

Once the exploratory analysis is completed, further analysis that is more tailored to the traditional modeling is necessary. This stage comprises grouping variables into blocks and then analyzing each group separately, looking at variable correlations, including correlation with the dependent income/consumption variable. New variables can be created—such as "acceptable lighting material for the household"—based on the number of responses in different categories and their correlation with the dependent variable. For example, electricity access with an own meter or a community meter can be grouped as "desired"; electricity without a meter, oil, kerosene, or gas as "acceptable"; and candles and other sources as "unacceptable."
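The recoding described above might be implemented as follows; the category labels follow the example in the text, while the code, the response codes, and the column names are illustrative assumptions.

```python
# Minimal sketch of collapsing a detailed lighting-source variable into the
# "desired" / "acceptable" / "unacceptable" groups described in the text.
import pandas as pd

lighting_map = {
    "electricity_own_meter":       "desired",
    "electricity_community_meter": "desired",
    "electricity_no_meter":        "acceptable",
    "oil":                         "acceptable",
    "kerosene":                    "acceptable",
    "gas":                         "acceptable",
    "candle":                      "unacceptable",
    "other":                       "unacceptable",
}

df = pd.read_csv("household_survey.csv")                # hypothetical survey extract
df["lighting_group"] = df["lighting_source"].map(lighting_map)
```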

In addition, interactions of location indicators with other variables are sometimes used to increase the predictive power of the model and lower exclusion and inclusion errors. The inclusion of location indicators may create separate thresholds and separate PMT weights for each location. The trade-off between capturing location-specific circumstances and maintaining a common threshold for all beneficiaries needs to be addressed explicitly as part of program policy. Nevertheless, the use of local-level indicators and the estimation of different thresholds based on local poverty lines have provided better results in Honduras, Kenya, and Mexico, as well as the West Bank and Gaza, to name a few. One way to avoid this issue is to use real income or consumption measures in the regression, adjusting before modeling for differences in the cost of living across locations. Then the location-specific indicators capture local aspects that relate to the real standard of living rather than differences in prices. Generally, this step should be done even when location-specific indicators are not being used.
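A minimal sketch of that adjustment follows, assuming a regional price index is available; the index values, region labels, and column names are purely illustrative.

```python
# Deflate nominal per capita consumption by a regional price index before
# modeling, so location dummies capture real differences rather than prices.
import pandas as pd

price_index = {"capital": 1.15, "other_urban": 1.05, "rural": 0.95}   # hypothetical

df = pd.read_csv("household_survey.csv")
df["real_pcexp"] = df["pcexp"] / df["region_type"].map(price_index)
```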

In addition, the increasing amount of ancillary data coming from big data and the modernization of administrative data systems increase the ease with which local-level variables can be introduced in the model, which helps reduce model bias and variance (as residual location effects can greatly reduce the precision of the welfare estimates).47 These variables are fixed at the enumeration area level; therefore, to incorporate them directly into the model, the data analyst may need to work closely with the national statistical office because enumeration area codes are often not available in the public version of the data made available to researchers. If the system that is used to code geographical areas in the ancillary data is different from the survey enumeration areas, a concordance between the two systems must be built, and the modeler will need to assess the trade-off between the required time and model improvements. In addition, machine learning can produce new variables from within the existing survey data, which can also improve accuracy; this is discussed in the following section.
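A sketch of how such a merge might look in practice, assuming a concordance file that maps survey enumeration-area codes to the coding system used in the ancillary data (all file and column names are hypothetical):

```python
# Attach enumeration-area-level ancillary variables to the survey through a
# concordance table linking the two geographic coding systems.
import pandas as pd

survey = pd.read_csv("household_survey.csv")            # contains survey_ea codes
concordance = pd.read_csv("ea_concordance.csv")         # survey_ea <-> admin_ea
ancillary = pd.read_csv("ea_ancillary.csv")             # admin_ea-level variables

survey = (survey
          .merge(concordance, on="survey_ea", how="left")
          .merge(ancillary, on="admin_ea", how="left"))
```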

The following list describes a nonexhaustive example of how to create groups:
• Characteristics of the housing: main source of water, main source of cooking fuel, main source of lighting fuel, material of the walls, main toilet facility, material of the roof, number of rooms, room density, and expenditures on utilities48
• Durables: possession of satellite TV, vehicles, motorcycles, boats, and refrigerators
• Land and livestock: ownership of land, usage of land for agriculture, possession of livestock, and type of livestock
• Characteristics of the household head: gender, age, literacy, educational level, occupation, and disabilities
• Characteristics of other household members: share of adults working, average educational level, and number of adults with disabilities
• Household size and type of family: elderly living alone or couple, missing generation, nuclear family, number of children ages 0–5, number of children ages 6–14, number of youth ages 15–24, number of adults ages 25–59, and number of elderly ages 60+ (see the sketch after this list)
• Location and other local-level development variables: urban, region, province, and enumeration-level aggregate information from other sources such as ancillary data to improve the precision of the measure of welfare.
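As referenced in the household size and type of family item, the sketch below derives age-band counts from an individual-level roster. The age bands follow the list above, while the file and column names are illustrative.

```python
# Build household-composition covariates (member counts by age band) from a
# person-level roster and merge them back to the household file.
import pandas as pd

roster = pd.read_csv("individual_roster.csv")           # one row per household member
bands = pd.cut(roster["age"],
               bins=[0, 5, 14, 24, 59, 200],
               labels=["age0_5", "age6_14", "age15_24", "age25_59", "age60p"],
               include_lowest=True)

composition = pd.crosstab(roster["hh_id"], bands)       # counts per household
households = pd.read_csv("household_survey.csv").merge(
    composition, left_on="hh_id", right_index=True, how="left")
```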

It is recommended that for each group, researchers use the following steps to better understand the data and the problem they are trying to address:


• Step 1: Assessing the dependent and covariate variables. This step involves assessing proportions and means through descriptive statistics to identify the center, spread, and shape of each distribution. Frequency tables, bar charts, pie charts, and histograms allow identification of the mode (most common response) as well as categories with low response. In addition to simple descriptive statistics, some inferential methods are used to identify confidence intervals and perform significance tests.
• Step 2: Assessing the correlation and associations of variables. This step involves using descriptive statistics such as cross-tabulations to estimate conditional proportions, correlations, analysis of variance, simple regressions, contingency tables, paired differences, chi-square tables, nonparametric tests, and so forth, to understand the main correlations of variables to feed the predictive model.
• Step 3: Assessing multiple correlations of variables. This step involves assessing correlation in different ways. For multicollinearity, the variance inflation factor can be used, which measures the strength of correlation among the explanatory variables in a regression model (see the sketch after this list). To test the stability of the means and variances across variables, factorial analysis of variance can be used. Where necessary, methods such as principal component analysis or other data compression methods can be employed to construct variables that reduce sampling variance.
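The variance inflation factor check mentioned in step 3 could look like the following sketch; statsmodels provides the function, while the covariates and the rule-of-thumb threshold are illustrative assumptions.

```python
# Compute variance inflation factors for a candidate set of covariates; very
# large values signal multicollinearity worth addressing.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("household_survey.csv")
X = sm.add_constant(df[["rooms", "has_fridge", "hh_size", "head_educ_years"]])

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop("const"))        # values above roughly 10 are a common warning sign
```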

Once the variables are assessed, the policy makers must move to the selection of models. In traditional PMT, applying steps 4 and 5 is recommended to select a final set of explanatory variables and a model. For machine learning, the main step is to run the different models and let the computer select the best model (see chapter 8).
• Step 4: Selecting the best model. This step involves assessing the best approach for modeling based on the distribution of the dependent variable. If the dependent variable is binary, a logit/probit model is more appropriate, but for continuous variables, the researcher must choose between OLS, quantile regression, and the other models discussed earlier. Generally, when running PMT on a continuous dependent variable such as income or consumption, which is highly skewed, a logarithmic transformation of the variable is used. The choice of the model involves running additional checks, such as residual analysis, to check the shape of the distribution and see how unusual observations affect the estimates. In addition, depending on the data analysis in the steps above, the researcher may decide to run different models for different geographical areas due to the representativeness and heterogeneity of the information per area. That is, different models per region, such as metropolitan, other urban, and rural areas, may be preferable to a unique model if the observable characteristics within each group are different; a fixed-effects specification may not be the solution for controlling for regional effects (see Elbers, Lanjouw, and Lanjouw 2003). However, the number of models is generally constrained by the representativeness of the survey sample and the trade-off between the time to construct multiple models and the improvement in predictive accuracy.49

Once step 4 is completed for each group, steps 3 and 4 can be repeated after grouping all the variables selected from each group. At this stage, new variables can be created by adding interactions to the model, while ancillary data at the lowest geographic level (such as enumeration areas), calculated from the census or obtained from ancillary data sources, can be added to the model specification to capture small area heterogeneity and improve prediction.

All the tests in steps 1 to 3 are important for measuring multicollinearity and instability of the coefficients caused by large variance. When high multicollinearity is present, the confidence intervals for the coefficients tend to be very wide and the t-statistics tend to be very small. The coefficients must be larger to be statistically significant; it will be harder to reject the null hypothesis when multicollinearity is present. Detecting high multicollinearity is important, and there are several warning signals. Most importantly, dropping variables should not generate large changes in a coefficient; model stability means that seemingly innocuous changes will not produce big shifts. Finally, model selection implies that the set of explanatory variables is composed of independent, unrelated groups; analysis of variance allows testing for statistically significant differences between the groups and for whether certain variables with large variance bring noise to the estimates.
• Step 5: Testing the null hypotheses that there is no specification error in the selected model and that the residuals are homoscedastic, using the model estimated in step 4. The first test is also known as an omitted variable test,50 which tests the assumption that the error term and the covariates are not correlated. The second can be done using the Breusch-Pagan test, which checks whether the residual variance changes with lower or higher values of the covariates (see the sketch below).
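A sketch of the step 5 diagnostics using statsmodels follows; a Ramsey RESET-type test stands in for the omitted variable test, and the Breusch-Pagan test checks homoscedasticity. The model formula and data are illustrative assumptions.

```python
# Run a RESET-type specification test and a Breusch-Pagan heteroscedasticity
# test on the fitted PMT regression (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan, linear_reset

df = pd.read_csv("household_survey.csv")
df["log_pcexp"] = np.log(df["pcexp"])
fit = smf.ols("log_pcexp ~ rooms + has_fridge + hh_size + head_educ_years",
              data=df).fit()

reset = linear_reset(fit, power=2, use_f=True)                  # specification test
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)

print("RESET p-value:        ", reset.pvalue)
print("Breusch-Pagan p-value:", lm_pvalue)
```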

The final step, which is also applied for machine learning, is the simulation of the model performance for different cutoff points and special groups:
• Step 6: Simulating coverage, performance, and predictive power for special groups. Using the indicators presented in chapter 7 (coverage, distribution of beneficiaries, exclusion error or undercoverage, inclusion error or leakage, or benefit incidence) across the income or consumption distribution and for particular groups, such as urban/rural, female heads of households, and households with elderly living alone, the researcher can simulate the performance of the model against different thresholds (for
