
A LASSO-Regression Approach
with their dropout rate, extra time to graduate, formal employment, or wages? If so, how strong is the association?
The empirical strategy used to estimate these contributions or associations is described in box 4.2, and the results are summarized in figures 4.4 to 4.7. The values in the figures correspond only to the variables that showed a relevant correlation with the corresponding outcome (that is, the figures include only the associations that are statistically significant at the 10 percent level or less). The magnitudes of the coefficients are comparable within each figure, but they have different interpretations across figures, depending on the outcome under analysis. To bring the estimations closer to a value-added approach, the main specifications include controls for student, program, and HEI characteristics, as described in box 4.2.
Academic Performance and Determinants of SCP Quality
Dropout Rate
Figure 4.4 summarizes the quality determinants associated with the dropout rate. The estimations show four determinants associated with lower dropout rates. The first one is related to the curriculum: programs with a fixed curriculum are more likely to have lower dropout rates. This result is in line with the literature from the United States, which finds evidence that programs with a completely
Box 4.2 Estimating the Contributions of the Quality Determinants to Academic and Labor Market Outcomes: A LASSO-Regression Approach
For estimating the contributions of the programs’ practices and inputs to academic and labor market outcomes, the World Bank Short-Cycle Program Survey (WBSCPS) has the advantage of providing a large set of explanatory variables that can be assessed as potential quality determinants. However, the large number of explanatory variables poses two challenges. The first is selecting the “right” set of explanatory variables. On the one hand, using too few controls or the wrong ones may create omitted variable bias. On the other hand, using too many may lead to model overfitting. The second challenge is that the sample sizes in some countries are small. For instance, there are only 80 SCPs in the Dominican Republic. Since there might be more variables than observations, the model might not be identified.
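The identification problem can be illustrated with a small simulation. This is only a sketch: the 80 observations match the Dominican Republic example, but the count of 120 candidate variables and the simulated data are hypothetical. With more columns than rows, the design matrix cannot have full column rank, so different coefficient vectors fit the data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)

n_programs = 80     # SCPs in the Dominican Republic (from the text)
n_variables = 120   # hypothetical number of candidate explanatory variables

X = rng.normal(size=(n_programs, n_variables))
y = rng.normal(size=n_programs)

# The rank of X cannot exceed the number of observations, so X'X is
# singular and the least-squares problem has no unique solution.
rank = np.linalg.matrix_rank(X)

# One least-squares solution (the minimum-norm one) ...
beta_a = np.linalg.lstsq(X, y, rcond=None)[0]

# ... and a second one, obtained by adding a direction that X maps to zero.
null_direction = np.linalg.svd(X)[2][-1]
beta_b = beta_a + null_direction

same_fit = np.allclose(X @ beta_a, X @ beta_b)
same_coefficients = np.allclose(beta_a, beta_b)
print(rank, same_fit, same_coefficients)
```

Both coefficient vectors produce the same fitted values, which is why a selection or shrinkage step is needed before the parameters can be interpreted.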
The first challenge could be addressed by creating indexes within each of the five categories of determinants by using statistical techniques for data reduction, such as factor or principal components analyses. However, this technique requires interval-level data, a requirement not met by some of the survey variables. Moreover, the types of variables (interval level or dummy variables) would vary within each determinant, which would preclude the use of these techniques.
Hence, to address the challenges of selecting explanatory variables and potential underidentification or nonidentification of the model, the parameters of interest are estimated using the Least Absolute Shrinkage and Selection Operator (LASSO) technique. This technique is used in the literature to estimate parameters in linear models with several controls, with the aim of improving model fit. Intuitively, LASSO throws out the variables that contribute little (or nothing) to the fit.
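For reference, the standard LASSO estimator for a linear model can be written as below. This is the generic textbook form rather than the box's exact specification, and the notation is assumed here: $x_i$ stacks all candidate regressors for observation $i$, $p$ is their number, and $\lambda \ge 0$ is the penalty parameter.

```latex
\hat{\alpha}(\lambda) = \arg\min_{\alpha_0,\,\alpha}\;
  \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \alpha_0 - x_i'\alpha \right)^2
  + \lambda \sum_{k=1}^{p} \left| \alpha_k \right|
```

Because the penalty is on absolute values, a large enough $\lambda$ sets some coefficients exactly to zero, which is the sense in which variables are "thrown out."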
A two-stage process is followed. The first stage uses an adaptive LASSO methodology and estimates the following model for each outcome of interest:
$$
y_{jc} = \alpha_0 + \sum_{d=1}^{6} \alpha_1' Q_{jc}^{d} + \alpha_2' C_{jc} + \phi_c + \epsilon_{jc}, \tag{B4.2.1}
$$
where $y_{jc}$ represents the average academic (dropout rates and extra time to graduate) or labor market (formal employment and wages) outcome of interest for graduates from SCP $j$ in country $c$. $Q_{jc}^{d}$ is a vector that includes all the variables within each of the six quality determinant categories.
$C_{jc}$ is a vector of control variables at the program and higher education institution (HEI) level. These controls are program or HEI characteristics that do not constitute a quality determinant, such as the number of years the HEI has been operating and program age, among others. Some of these characteristics (such as whether the HEI is for profit, public or private, or a university) are “fixed” in the first stage. That is, LASSO is “asked” to keep them as controls for the first and second stages. Other characteristics, including HEI age, number of branches, and student characteristics, are not fixed. In other words, they can be kept or dropped by the LASSO procedure.
The vector of coefficients $\alpha_1$ corresponds to the associations between the outcome and each quality determinant. Similarly, the coefficients $\alpha_2$ indicate the correlations between the control variables and the outcome. For the cross-country estimations, country fixed effects $\phi_c$ are included. Finally, $\epsilon_{jc}$ is the error term. In all the pooled (all-country) and country-specific models, clustered standard errors are estimated at the HEI level.
Among all the quality determinants included in $Q_{jc}^{d}$, LASSO calculates a “penalty” parameter that determines the set of variables that minimizes the out-of-sample mean squared error of the estimations. In this sense, LASSO conducts a data-driven selection of the set of determinants, $Q^{*}$, that provides the best fit to the data.
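The first stage can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure used in the report: it uses a plain (not adaptive) LASSO solved by coordinate descent, chooses the penalty by five-fold cross-validation, and keeps "fixed" controls by giving them a penalty factor of zero. The data, the penalty grid, and all function names are hypothetical. An adaptive variant would replace the unit penalty factors with weights based on an initial estimate.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, penalty_factor, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_k pf_k * |b_k|.
    Expects centered X and y, so no intercept is estimated."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for k in range(p):
            resid += X[:, k] * b[k]            # remove variable k's contribution
            rho = X[:, k] @ resid / n
            b[k] = soft_threshold(rho, lam * penalty_factor[k]) / col_sq[k]
            resid -= X[:, k] * b[k]            # add the updated contribution back
    return b

def cv_mse(X, y, lam, penalty_factor, n_folds=5, seed=0):
    """Average out-of-sample mean squared error across folds."""
    folds = np.random.default_rng(seed).permutation(len(y)) % n_folds
    errors = []
    for f in range(n_folds):
        train, test = folds != f, folds == f
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        b = lasso_cd(X[train] - mu_x, y[train] - mu_y, lam, penalty_factor)
        pred = mu_y + (X[test] - mu_x) @ b
        errors.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errors)

# Simulated data (hypothetical): 30 candidate variables, only three matter.
rng = np.random.default_rng(1)
n, p = 200, 30
X = rng.normal(size=(n, p))
true_b = np.zeros(p)
true_b[[0, 1, 2]] = [1.5, -1.0, 0.8]
y = X @ true_b + rng.normal(size=n)

# The last two columns play the role of "fixed" controls: never penalized.
penalty_factor = np.ones(p)
penalty_factor[-2:] = 0.0

# Pick the penalty with the lowest cross-validated error, then refit.
lams = np.logspace(-3, 0, 12)
scores = [cv_mse(X, y, lam, penalty_factor) for lam in lams]
best_lam = lams[int(np.argmin(scores))]
b_hat = lasso_cd(X - X.mean(axis=0), y - y.mean(), best_lam, penalty_factor)
selected = np.flatnonzero(b_hat != 0)
print(best_lam, selected)
```

The indices in `selected` are the analogue of $Q^{*}$ plus the fixed controls. The simulated columns already have comparable scale; with real survey data the variables would normally be standardized before the penalty is applied.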
In the second stage, each outcome of interest yjc is regressed on the set of selected determinants and the following equation is estimated:
$$
y_{jc} = \beta_0 + \sum_{d=1}^{6} \beta_1' Q_{jc}^{d*} + \beta_2' N_{jc} + \gamma_c + \omega_{jc}, \tag{B4.2.2}
$$
where $N_{jc}$ is a vector of control variables at the program and HEI level that are fixed by LASSO during the first stage and kept for the second stage. In this second stage, $\gamma_c$ corresponds to country fixed effects, and $\omega_{jc}$ is the error term. The rest of the variables are defined as previously.
Equation B4.2.2 is estimated using ordinary least squares for the dropout rate, extra time to graduate, and wages, and probit is used for formal employment. The estimated parameters of interest are in vector $\beta_1$, which reflects the association between the quality determinants and outcomes in the sample. As in the first stage, in all the cross-country and country-specific models, clustered standard errors are estimated at the HEI level.
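The second stage for a continuous outcome can be sketched as an ordinary least squares regression with standard errors clustered at the HEI level. The example below is an illustration only: the data are simulated, the names (`ols_cluster`, `Q_star`, `hei`) are hypothetical, and the sandwich estimator omits the small-sample corrections that statistical packages usually apply. The probit case for formal employment is not shown.

```python
import numpy as np

def ols_cluster(X, y, groups):
    """OLS coefficients with cluster-robust (sandwich) standard errors.
    No small-sample correction is applied."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        idx = groups == g
        score = X[idx].T @ resid[idx]   # sum of x_i * u_i within cluster g
        meat += np.outer(score, score)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Simulated data (hypothetical): 60 HEIs with 5 programs each, 3 countries.
rng = np.random.default_rng(2)
n_hei, per_hei = 60, 5
hei = np.repeat(np.arange(n_hei), per_hei)   # HEI identifier per program
country = hei % 3                            # each HEI sits in one country
n = hei.size

Q_star = rng.normal(size=(n, 3))             # stand-in for selected determinants
hei_shock = rng.normal(scale=0.5, size=n_hei)[hei]   # error shared within an HEI
y = (0.2 + Q_star @ np.array([0.5, -0.3, 0.0])
     + 0.1 * country + hei_shock + rng.normal(size=n))

# Country fixed effects as dummies, with country 0 as the omitted category.
country_dummies = (country[:, None] == np.array([1, 2])).astype(float)
X = np.column_stack([np.ones(n), Q_star, country_dummies])

beta, se = ols_cluster(X, y, hei)
print(np.round(beta, 3), np.round(se, 3))
```

Entries 1 to 3 of `beta` play the role of $\beta_1$. Clustering matters here because programs in the same HEI share an error component, so treating them as independent would typically understate the uncertainty.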