Volume 14 / Number 1 / 2018


Methodology


Editors: Jost Reinecke, José L. Padilla
Managing Editors: Andreas Pöge, Luis-Manuel Lozano


European Journal of Research Methods for the Behavioral and Social Sciences Official Organ of the European Association of Methodology


Contents

Editorial

Trends and Challenges for Methodology ... and Changes in the Editorial Team in 2018! José-Luis Padilla and Jost Reinecke

1

Original Articles

Multiple Imputation by Predictive Mean Matching When Sample Size Is Small Kristian Kleinke

3

Methodology (2018), 14(1)

Strategies for Increasing the Accuracy of Interviewer Observations of Respondent Features: Evidence From the US National Survey of Family Growth Brady T. West and Frauke Kreuter

16

Estimating a Three-Level Latent Variable Regression Model With Cross-Classified Multiple Membership Data Audrey J. Leroux and S. Natasha Beretvas

30

© 2018 Hogrefe Publishing

Editorial

Trends and Challenges for Methodology ... and Changes in the Editorial Team in 2018!

José-Luis Padilla¹ and Jost Reinecke²

¹ Department of Methodology of Behavioral Sciences, University of Granada, Spain
² Faculty of Sociology, University of Bielefeld, Germany

In September 2017, Peter Lugtig and Charlotte Rietbergen (Utrecht University) stepped down from the editorial team. Andreas Pöge and Jost Reinecke took up their editorial responsibilities as Managing Editor and Co-Editor, respectively, aiming to maintain the journal's high standards. We, the current editors, want to thank Charlotte Rietbergen and Peter Lugtig for their work in advancing Methodology as a journal for discussing and solving methodological problems in empirical social science. According to the Journal Citation Reports, the 2016 impact factor of 1.143 places Methodology in the top quartile of the "social sciences, mathematical methods" category. We intend to use this Editorial to lay out future plans for both the content of the journal and how we want to streamline the editorial process for authors.

As a result of our successful collaboration with Hogrefe Publishing, the online editorial system Editorial Manager® was established for Methodology in October 2017. Authors, reviewers, managing editors, and editors, as well as the production staff at Hogrefe Publishing, are now using the platform to further improve the high-quality standards of the journal. The Editorial Manager system allows the editorial process to be accelerated and submissions to Methodology to be tracked easily. We are convinced that this online editorial system is a key element for a quicker and more transparent editorial process.

As a professor of quantitative methods of empirical social research at the Faculty of Sociology, Bielefeld University (Germany), Jost Reinecke works mainly on multivariate techniques of longitudinal data analysis. He teaches courses in structural equation modeling, multiple imputation, and mixture models applied to criminological panel data. His current methodological research focuses on growth curve and growth mixture models and the development of techniques related to multiple imputation of missing data in complex survey designs. His current substantive research focuses on the longitudinal development of adolescents'

delinquent behavior and on the application of situational action theory in panel designs.

As was discussed in earlier editorials (Lugtig & Balluerka, 2015; Padilla & Lugtig, 2016), there are more similarities than differences in methods and practices across the social and behavioral sciences. It will be a hallmark of the journal to publish manuscripts by researchers and professionals from different academic and professional fields who want to disseminate their methodological contributions through Methodology. We intend to continue the journal's editorial policy outlined in the earlier editorial by Padilla and Lugtig (2016). Methodology will keep welcoming articles addressing current methodological challenges from new "ecological contexts" and "new behaviors": advances in social computing, social networks, new data collection technologies, and new methods of data analysis. The journal will publish articles that bring forward qualitative and quantitative methods for applied and basic research in the social sciences. The methodological core and the quality of the articles are the only criteria we use in the editorial process. Methodology aims to advance substantive knowledge in the social sciences by improving methods and practice.

We also rely on the past and future support of the European Association of Methodology (EAM) board and its members to improve access to and dissemination of the published papers. Together with the EAM board, we are working on future developments for Methodology, all of them aimed at increasing the impact of the journal in social science fields. Finally, there cannot be a high-quality journal without authors, readers, and reviewers. We will do our best to improve communication with them to maintain the standards of quality reached by Methodology since its inception 14 years ago.

References Lugtig, P., & Balluerka, N. (2015). Methodology turns 10. Methodology, 11, 1–2. https://doi.org/10.1027/1614-2241/a000092

Methodology (2018), 14(1), 1–2 https://doi.org/10.1027/1614-2241/a000144



Padilla, J. L., & Lugtig, P. (2016). Trends and challenges for Methodology. Methodology, 12, 73–74. https://doi.org/10.1027/1614-2241/a000109

José-Luis Padilla
CIMCYC. Mind, Brain & Behavior Research Center
Department of Methodology of Behavioral Sciences
University of Granada
18071 Granada
Spain
methodologyjournal@ugr.es

Jost Reinecke
Faculty of Sociology
University of Bielefeld
33501 Bielefeld
Germany
methodologyjournal@uni-bielefeld.de



Original Article

Multiple Imputation by Predictive Mean Matching When Sample Size Is Small Kristian Kleinke Department of Psychology, Bielefeld University, Germany

Abstract: Predictive mean matching (PMM) is a state-of-the-art hot deck multiple imputation (MI) procedure. The quality of its results depends, inter alia, on the availability of suitable donor cases. Applying PMM in the small-sample scenarios often found in psychological or medical research could be problematic, as there might not be many (or any) suitable donor cases in the data set. So far, there has not been any systematic research examining the performance of PMM when sample size is small. The present study evaluated PMM in various multiple regression scenarios, in which sample size, missing data percentages, the size of the regression coefficients, and PMM's donor selection strategy were systematically varied. Results show that PMM could be used in most scenarios; however, results depended on the donor selection strategy: overall, PMM using either automatic distance-aided selection of donors (Gaffert, Meinfelder, & Bosch, 2016) or the nearest neighbor produced the best results. Keywords: missing data, multiple imputation, predictive mean matching, small samples

Introduction and Overview

Since Rubin (1987) laid out the theoretical foundation of multiple imputation (MI) and Schafer (1997a, 1997b) published software to impute incomplete data, the approach has enjoyed ever-increasing popularity and is nowadays one of the standard methods to handle missing data (Schafer & Graham, 2002): MI routines are now implemented in all major statistical packages, supporting a wide range of missing data scenarios and models, including both fully parametric methods (e.g., Schafer, 1997a, 1997b) and more robust procedures like predictive mean matching (PMM), which is the default imputation technique for continuous data in the MI software mice in R (van Buuren & Groothuis-Oudshoorn, 2011) and in some other packages. An overview of available MI procedures and packages is given in Horton and Kleinman (2007) and at www.multiple-imputation.com.

While MI was originally developed for handling missing data in large public-use data files, for example, from surveys and censuses (Rubin, 1987), the practical use of MI has shifted over the years: today, many practitioners also impute much smaller data sets, including data from psychological or medical research, where sample sizes are often quite small. Although small sample size adjustments

have been made to the MI framework (Barnard & Rubin, 1999) and implemented, for example, in SAS PROC MIANALYZE, in Stata, and also in mice, systematic research evaluating MI's performance in such settings is very scarce (e.g., Graham & Schafer, 1999). The present study helps to fill this gap and evaluated the performance of multiple imputation in settings where sample sizes ranged from N = 20 to N = 100.

Additionally, using predictive mean matching (PMM) as the default imputation technique might be problematic when sample size is small: classical PMM approaches impute an observed value whose prediction from a linear regression model is among a set of k values (the so-called donor pool) that are closest to the value predicted for the missing one. When sample size is small, the number of suitable donors could also be small. Setting the size of the donor pool k too large might result in the selection of inadequate donors, implausible imputations, and, as a consequence, biased inferences. Choosing a very small donor pool, on the other hand, might result in one single donor being chosen again and again. This could lead to an increased correlation of the m imputations, a too small between-imputation variance component, underestimated standard errors, and the benefits of creating multiple imputations over using a single imputation might ultimately be



lost (see also the discussion of the bias-variance tradeoff in Schenker & Taylor, 1996). The aim of the present study was (a) to evaluate whether PMM is able to produce sufficiently accurate parameter estimates and standard errors when sample size is small, (b) to explore what size of the donor pool yields the best tradeoff between unbiased parameters and adequate standard errors in various small sample size scenarios, and (c) to explore whether more flexible donor selection strategies, like the automatic distance-based selection of donors proposed by Gaffert, Meinfelder, and Bosch (2016), work better in that regard.

Theoretical Background

The basic idea of MI is (a) to fill in each missing value m > 1 times by different values that are equally plausible under the specified imputation model, (b) to analyze the m completed data sets separately by standard complete data procedures (e.g., regression analysis), and (c) to combine the m sets of parameter estimates into a single overall set of results using Rubin's (1987) formula. The variability between the m imputations is supposed to reflect the additional uncertainty in parameter estimation due to missing data in an adequate way. If done "properly," MI usually yields both widely unbiased parameter estimates and adequate standard errors (cf. Schafer & Graham, 2002). For a definition of "properness" in that regard, see Rubin (1987, Chap. 4). One of the standard techniques to create the m imputations is PMM.
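The combining step (c) can be sketched in a few lines. The following Python sketch (an illustration, not the mice implementation; the function name is mine) pools m point estimates and their squared standard errors into the overall estimate and total variance according to Rubin's (1987) rules:

```python
import math

def pool_rubin(estimates, variances):
    """Pool m point estimates and their within-imputation variances
    (squared standard errors) via Rubin's (1987) rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                                # pooled point estimate
    ubar = sum(variances) / m                                # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    t = ubar + (1 + 1 / m) * b                               # total variance
    return qbar, math.sqrt(t)                                # estimate and pooled SE

# Pool a slope estimated on m = 3 completed data sets:
est, se = pool_rubin([0.52, 0.48, 0.50], [0.01, 0.01, 0.01])
```

The term (1 + 1/m)·B is the between-imputation component that inflates the total variance to reflect the extra uncertainty due to missing data.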

Predictive Mean Matching

The idea of matching predicted means in the context of missing data imputation was first mentioned in Rubin (1986, 1987), Rubin and Schenker (1986), and Little (1988). PMM counts among the hot deck imputation procedures (Andridge & Little, 2010). The basic principle of hot deck methods is to find one suitable donor value from an observed case that is in some regard "similar" to the missing case. PMM matches potential donors and donees via the closeness of predicted means. For each potential donor case, the fitted value (based on some meaningful regression model) is calculated and compared to the value predicted for the incomplete case. Classical PMM approaches draw one case from a pool of k cases whose predicted values are closest to the one predicted for the missing case. The observed value of this donor case is then used to fill in the missing one. Further donor selection strategies will be discussed in the following sections.
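The matching step can be made concrete with a minimal sketch (this is not the actual mice code; the function and variable names are mine, and the fitted values are assumed to be computed beforehand from some regression model):

```python
import random

def pmm_impute(y_obs, yhat_obs, yhat_mis, k=5, rng=random):
    """Classical PMM: for each incomplete case, rank the observed cases
    by the distance between predicted means, draw one donor at random
    from the k nearest, and impute that donor's *observed* value."""
    imputed = []
    for pred in yhat_mis:
        # indices of observed cases, sorted by closeness of predicted means
        order = sorted(range(len(y_obs)), key=lambda i: abs(yhat_obs[i] - pred))
        donor = rng.choice(order[:k])  # donor pool: k closest predicted means
        imputed.append(y_obs[donor])
    return imputed

# With k = 1, the nearest neighbor's observed value is always imputed:
pmm_impute([12.0, 14.5, 9.8], [12.1, 14.4, 10.0], [14.3], k=1)  # -> [14.5]
```

Because the imputed value is always an actually observed one, the imputations stay on the scale and within the range of the data.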

Methodology (2018), 14(1), 3–15 https://doi.org/10.1027/1614-2241/a000141

For an overview and discussions of different implementations of PMM in various software packages and their default settings, see, for example, Allison (2015), Morris, White, and Royston (2014), and Gaffert et al. (2016). For a detailed description of the PMM algorithm in mice, which I used in this paper, see van Buuren (2012), Algorithm 3.3.

Advantages and Disadvantages of Predictive Mean Matching

One of the major advantages of PMM is its robustness: in comparison to fully parametric procedures like, for example, Schafer's (1997a) NORM approach, PMM is less sensitive to model misspecifications, including nonlinear associations, heteroscedasticity, and deviations from normality (Morris et al., 2014; Schenker & Taylor, 1996; Vink, Frank, Pannekoek, & van Buuren, 2014; Yu, Burton, & Rivero-Arias, 2007). This is because the parametric linear model is only used for matching the incomplete case to a potential donor. By imputing an actual observed value, PMM is usually able to preserve the original distribution of the data quite well, even when the assumptions of the underlying linear regression model are violated. Furthermore, PMM always imputes a "valid" value, meaning that the imputed value will fit the respective scale of measurement and will always be within the range of the observed values, which makes rounding unnecessary. Rounding following normal model multiple imputation can be problematic (cf. Horton, Lipsitz, & Parzen, 2003). The practice of imputing actual observed values, however, could also have disadvantages. Performance of PMM might be poor in situations where no or only a few suitable donor cases can be found. Van Buuren (2012) summarizes the potential drawbacks of PMM as follows: PMM "cannot be used to extrapolate beyond the range of the data, or to interpolate within the range of the data if the data at the interior are sparse. Also, it may not perform well with small datasets" (p. 74). From a theoretical point of view, it is plausible that PMM might perform poorly when sample size is small: for a given missing data percentage, a smaller sample implies fewer suitable donors. In consequence, selecting inadequate observed values could introduce bias. This means that, for a constant sample size N, bias should increase with increasing missing data percentage.
Likewise, for a constant missing data percentage pmis, bias should increase with decreasing sample size. Furthermore, in addition to missing data percentage, results are also likely to depend on the donor selection strategy, as will be outlined in the next sections. Currently, there are no evaluation studies that



systematically tested if, or under what conditions, PMM can be used when sample size is small. For practitioners, it is important to know whether the PMM technique fails in general when sample size is small, or whether only certain settings lead to biased inferences in certain scenarios.

The Size of the Donor Pool in Classical PMM and Its Effects on Statistical Inferences

The default size of the donor pool varies across the statistical packages. The MI procedure in SAS and the current version of mice in R use k = 5 as default. Some older versions of mice used k = 3, while Solas and ice in Stata sample from a pool of k = 10 donors by default (cf. Allison, 2015; Morris et al., 2014). For sufficiently large samples with N ≥ 100, Schenker and Taylor (1996) found only small differences in performance between using k = 3 and k = 10 donors, meaning that practitioners typically do not have to adjust the size of the donor pool to obtain reasonable results. When sample size is small, however, the size of the donor pool might have a greater effect on the accuracy of both parameter estimates and standard errors. To obtain unbiased parameter estimates, decreasing the size of the donor pool might become increasingly important the smaller the sample is. On the other hand, as already mentioned, downsizing the donor pool might come at the cost of underestimated standard errors.

Newer Adaptive and Distance-Based Donor Selection Strategies

To solve the problem of determining which number of donors might overall work best in a given scenario, and to overcome some of the shortcomings of classical PMM listed above, Schenker and Taylor (1996) proposed an "adaptive technique that chooses the number of possible donors case-by-case based on the density of complete cases in the neighborhood of the incomplete case in question" (p. 430). Unfortunately, their approach has never been implemented in any of the major statistical packages and has hardly been used in practice. A more recent solution, where one donor is selected with probability inversely proportional to its distance from the respective incomplete case, has been proposed by Siddique and Belin (2008). The procedure is available as a SAS macro called MIDAS (Multiple Imputation using Distance-aided Selection of donors). It uses an approximate Bayesian bootstrap to introduce between-imputation variability. For a detailed description of the implementation



in SAS, see Siddique and Harel (2009). While their solution overcomes the problem of specifying the number of donors, their formula (Siddique & Belin, 2008, Equation 1) also includes a parameter κ, which can best be described as a closeness parameter that adjusts the probability of selection. This parameter must be set by the user. Setting κ = 0 means that each donor has the same selection probability (which comes down to a simple random hot deck). With κ → ∞, the nearest neighbor is always chosen. In their simulation study, Siddique and Belin (2008) found that a closeness parameter of κ = 3 produced reasonable results. They, however, only examined scenarios where N ≥ 100. No recommendations are yet available for small sample size scenarios. Gaffert et al. (2016) published a "touched-up" version of the MIDAS macro for R, which they call midastouch. One of the main differences to the solution by Siddique and Belin (2008) is that the user can, but does not necessarily have to, specify the closeness parameter. In the default settings, the R function automatically sets the parameter according to the R² of the imputation model based on the full set of donors. Additionally, the function uses a correction for the total variance as originally suggested by Parzen, Lipsitz, and Fitzmaurice (2005). They use this correction because the approximate Bayesian bootstrap used by Siddique and Belin (2008) yields unbiased estimates only as the number of observed values Nobs → ∞. For a more detailed discussion, and for a full list of differences between MIDAS and midastouch, see Gaffert et al. (2016). While using such automatic distance-based donor selection procedures might be preferable from a practitioner's point of view, as it overcomes the problem of specifying the size of the donor pool, little is yet known about the quality of these procedures, especially in small sample size scenarios.
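The distance-based selection idea can be illustrated with a simplified sketch. This is my own reduction of the principle behind Siddique and Belin's (2008) Equation 1, not their procedure: the small eps constant is my addition to avoid division by zero, and the approximate Bayesian bootstrap they use for between-imputation variability is omitted here.

```python
import random

def distance_weighted_donor(y_obs, yhat_obs, pred_mis, closeness=3.0,
                            eps=1e-8, rng=random):
    """Draw one donor with probability inversely proportional to its
    predicted-mean distance from the incomplete case, raised to the
    closeness power: closeness = 0 gives a simple random hot deck,
    while a large closeness picks the nearest neighbor almost surely."""
    w = [(1.0 / (abs(yh - pred_mis) + eps)) ** closeness for yh in yhat_obs]
    total = sum(w)
    # draw one donor index according to the selection probabilities w/total
    r, cum = rng.random() * total, 0.0
    for i, wi in enumerate(w):
        cum += wi
        if r <= cum:
            return y_obs[i]
    return y_obs[-1]
```

With a very large closeness parameter, the weight of the nearest case dominates, which reproduces nearest neighbor PMM as the limiting case.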

Research Questions and Hypotheses

In summary, the present study had several aims: (1) to test if, overall, PMM yields accurate statistical inferences in a variety of small sample size scenarios; (2) to explore if (and how) the size of the donor pool needs to be adjusted to produce acceptable parameter estimates and standard errors in a given scenario. Bias in parameter estimates was hypothesized to increase with decreasing sample size, increasing missing data percentage, and increasing size of the donor pool, while bias in standard error estimates was expected to increase with decreasing size of the donor pool; and (3) to test if automatic distance-based donor selection (midastouch) produces overall better results than using a donor pool of constant size k.




Method

Design of the Study

Overview
The Monte Carlo study was designed as a 4 × 5 × 3 × 5 factorial experiment regarding a multiple regression scenario in which data in the dependent variable were missing. The manipulated factors were sample size, the missing data percentage, the regression weights of the predictors (and thus the R² of the regression model), and finally the way in which PMM identified a donor case. I first describe the data generation process and the factors in the study, then the quality criteria used to evaluate the performance of PMM multiple imputation in these scenarios.

Data Generation
The data generation process was similar to the one used by Schenker and Taylor (1996). I used a variant of their baseline model, with x1 and x2 being the predictors in the regression model and y being the dependent variable. x1 and x2 were independently distributed as N(5, 1). The model for y given x1 and x2 was linear and homoscedastic. Values yi were obtained by

yi = 10 + β1·x1i + β2·x2i + ei,  (1)

where ei ~ N(0, 1).

Experimental Conditions
The first factor I manipulated was sample size, with N ∈ {20, 30, 50, 100}. Secondly, I varied the missing data percentages, with pmis ∈ {10%, 20%, 30%, 40%, 50%}. How missing data were introduced is described in the next paragraph. Thirdly, the β-weights in Equation (1) were varied and set to β1 = β2 ∈ {0.2, 0.5, 1}, which yielded average R²-values of .13, .36, and .67, indicating small, medium, and large effect sizes, respectively. Finally, I compared classical PMM, as implemented in the function mice.impute.pmm from R package mice version 2.22 (van Buuren & Groothuis-Oudshoorn, 2011), using a donor pool of constant size k with k ∈ {1, 3, 5, 10}, against the automatic distance-based donor selection variant midastouch version 1.3 (Gaffert et al., 2016).¹ This resulted in a total of 300 experimental conditions. Each condition was replicated 1,000 times.

Introduction of Missing Data
Analogous to the study by Schenker and Taylor (1996), values were deleted only from the dependent variable y,

while the predictors remained completely observed. While Schenker and Taylor (1996) examined MCAR missingness – that is, missing completely at random (Rubin, 1976) – missing data in this study followed a MAR (missing at random) mechanism (Rubin, 1976). Missingness in y depended on predictor x1. I subsequently refer to x1 as the "cause of missingness." The probability pmis,i for each yi to be missing was determined by a logit model:

pmis,i = invlogit(−6 + x1i).  (2)

Depending on the experimental condition, 10%, 20%, 30%, 40%, or 50% of observations in y were selected from all y-values with probabilities pmis,i, and their values were deleted. The intercept of −6 in Equation (2) was chosen to get a wide range of selection probabilities, with low selection probabilities for the major part of the sample. It was thus mostly cases with large x1-values that had a high chance of having a missing value in y. The minimum observed missingness probability was 0.33%, the maximum 98.72%. The median was 27.1%.

Data Imputation
Missing data were imputed using the R package mice (van Buuren & Groothuis-Oudshoorn, 2011). The number of imputations m was set equal to the respective missing data percentage.² If, for example, 50% of the data in y were missing in a condition, m = 50 imputations of each missing value were created. The complete variables x1 and x2 were used to predict missing data in y. Depending on the condition, either the function mice.impute.pmm with k ∈ {1, 3, 5, 10} or the function mice.impute.midastouch was used to create the imputations.

Data Analysis
The linear model in Equation (1) was fitted to the m completed data sets from each replication in each condition, and results were combined using Rubin's formula for MI inference (Rubin, 1987), with degrees of freedom calculated by the Barnard and Rubin (1999) correction formula.
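To make the design concrete, here is a minimal Python sketch of one replication's data generation and missingness-probability step. It illustrates the design described above and is not the original simulation code (which used R); the helper names are mine.

```python
import math
import random

def invlogit(z):
    """Inverse logit, mapping the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def generate_sample(n, beta, rng):
    """One sample from the study design: x1, x2 ~ N(5, 1) independently,
    y = 10 + beta*x1 + beta*x2 + e with e ~ N(0, 1), and MAR missingness
    probabilities for y from the logit model p_mis,i = invlogit(-6 + x1i)."""
    x1 = [rng.gauss(5, 1) for _ in range(n)]
    x2 = [rng.gauss(5, 1) for _ in range(n)]
    y = [10 + beta * a + beta * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    p_mis = [invlogit(-6 + a) for a in x1]  # large x1 -> y more likely missing
    return x1, x2, y, p_mis
```

In the study, the required percentage of y-values (10%–50%) was then deleted by sampling cases with these probabilities, so missingness in y depends on the fully observed x1 (MAR).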

Quality Criteria

I evaluated PMM's performance in terms of estimation accuracy and estimation consistency regarding the marginal mean, precision of standard error estimates, and in terms of how well the procedure was able to preserve the original distribution of y (cf. Schenker & Taylor, 1996).

¹ R, mice, and midastouch are available from https://cran.r-project.org.
² Research by Bodner (2008) suggests that the quality of statistical inferences could be improved by using more than the formerly standard m = 5 imputations. Bodner proposed a rather complex procedure to determine m based on the estimated fraction of missing information λ. I used a simpler proxy and set m equal to the observed missing data percentage in y.




A measure for estimation accuracy is the average distance of the estimate from the true parameter – that is, bias, which is defined as Q̂ − Q, where Q is the population quantity and Q̂ is the average parameter estimate across the 1,000 replications. I report relative bias, defined as (Q̂ − Q)/Q × 100%, which makes the quantity independent of the respective scale of measurement. Note that there are no commonly agreed-on criteria for when bias or relative bias is "significant." I follow Forero and Maydeu-Olivares (2009), who deem absolute values of less than 10% parameter bias acceptable. Secondly, consistency in parameter estimation is reflected in the variance of the estimates across the 1,000 replications. A small variance signifies consistently good estimates across the replicated samples. A hybrid measure that reflects both accuracy and consistency in parameter estimation is the root mean square error (RMSE), defined as RMSE = √(bias² + variance²), which signifies the typical distance between the estimate and the true value. Note that there are no definite criteria as to when either the variance of the parameter estimates or the RMSE is unacceptably large. Obviously, one would like both quantities to be small, signifying a precise and consistent inference (cf. Rubin, 1996). Thirdly, I report coverage rates (CR), a hybrid measure that reflects both bias in parameter estimates and bias in standard error estimates. CR is defined as the percentage of 95% confidence intervals that include the true parameter. A coverage rate close to 95% indicates that the standard error estimates are large enough, so that the true parameter is inside the interval most of the time. Schafer and Graham (2002) deem rates below 90% – double the nominal error rate – as seriously low.
Undercoverage may result from large biases (so that the interval is too far to the left or to the right to cover the true parameter), from underestimated standard errors, or from a combination of both factors. Finally, with regard to how PMM was able to preserve the original distribution of y, I looked at the percentages of the respective sample with observations greater than the “true” 5th, 25th, 50th, 75th, and 95th percentiles of y (cf. Schenker & Taylor, 1996). Values should obviously be close to 95%, 75%, 50%, 25%, and 5%, respectively.
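These criteria are straightforward to compute across replications. The sketch below is my own helper following the definitions above (with RMSE as √(bias² + variance²), as stated, and the sample variance across replications):

```python
def quality_criteria(estimates, intervals, true_q):
    """Relative bias (%), variance across replications, RMSE, and the
    coverage rate (%) of the replicated 95% confidence intervals."""
    r = len(estimates)
    qbar = sum(estimates) / r
    bias = qbar - true_q
    rel_bias = bias / true_q * 100.0
    var = sum((q - qbar) ** 2 for q in estimates) / (r - 1)   # sample variance
    rmse = (bias ** 2 + var ** 2) ** 0.5
    # fraction of intervals (lo, hi) that contain the true parameter
    coverage = 100.0 * sum(lo <= true_q <= hi for lo, hi in intervals) / r
    return rel_bias, var, rmse, coverage
```

For instance, two replications with estimates 9.0 and 11.0 around a true value of 10.0 have zero bias but a nonzero variance, so RMSE picks up the inconsistency that bias alone would miss.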

Results

Complete Data Results

Firstly, to test that the simulation of the data worked well, I computed parameter estimates and quality statistics based on the complete data (i.e., before any missing data were


introduced). Like Schenker and Taylor (1996), I focused on results regarding the marginal mean and the distribution of y. These values are displayed in Table 1. As can be seen there, the simulation worked well, and the parameter estimates of the marginal mean were close to the specified population quantity. Furthermore, coverage was around 95%, and the estimated quantiles of the simulated data were very close to the theoretical quantiles. Note that the variance estimates (and thus also the RMSE estimates) of the marginal mean were larger on average the smaller the sample size was. Variance ranged from 0.51 (N = 100) to 3.26 (N = 20). RMSE was approximately 1.8 when N = 20; 1.4 when N = 30; 1.1 when N = 50; and around 0.7 when N = 100. Naturally, statistical inferences become more stable and more precise when based on larger rather than smaller samples. We need to bear this in mind when discussing results of the different PMM conditions regarding variance and RMSE later on.

Fractions of Missing Information

Secondly, to convey an idea of the extent of the missing data problems simulated in this study, I list the estimated average fractions of missing information λ (cf. Schafer, 1997a) regarding the marginal mean, which quantify the level of uncertainty about the imputed values and the impact missing data have on (a) the respective parameter estimate and (b) the performance of statistical tests: the average λ was .09 when pmis = 10%; .17 when pmis = 20%; .25 when pmis = 30%; .32 when pmis = 40%; and .39 when pmis = 50%. λ ranges between 0 and 1. Generally, the larger λ is, the more biased results could be. I now present results regarding how well the respective PMM variants were able to estimate the marginal mean, its standard error, and the quantiles of y in the simulated scenarios.

Relative Bias

Results regarding relative bias in the estimates of the marginal mean are summarized in Figure 1. Generally, biases got larger the larger the donor pool was, the more data had to be imputed, and the larger the β-weights were. Larger β-weights signified a "stronger" missing data mechanism, with stronger referring to the relationships between the cause of missingness x1, y, and missingness in y (see Equations 1 and 2). Furthermore, biases generally increased with decreasing sample size. Overall, nearest neighbor PMM produced the best results. When N ≥ 50, all estimates were sufficiently accurate. When N = 30, bias was found only in the most extreme condition, where β = 1 and 50% of the data had



Table 1. Complete data estimates

N    β    %BIAS  VAR   RMSE  CR     % > P5  % > P25  % > P50  % > P75  % > P95
20   0.2  0.31   3.24  1.80  95.50  94.86   75.38    50.48    25.13    4.84
20   0.5  0.42   3.26  1.81  94.80  94.85   75.31    50.43    25.14    5.14
20   1.0  0.01   3.09  1.76  95.20  95.03   75.29    49.81    25.15    5.03
30   0.2  0.00   1.83  1.35  95.80  95.01   74.71    49.85    25.11    4.92
30   0.5  0.27   1.93  1.39  95.50  95.10   75.08    49.68    25.40    5.16
30   1.0  0.06   1.93  1.39  95.20  95.09   75.09    50.02    25.23    5.05
50   0.2  0.47   1.15  1.07  94.40  95.07   74.83    49.84    24.91    5.00
50   0.5  0.05   1.11  1.05  94.50  95.13   75.18    50.06    25.16    4.98
50   1.0  0.14   1.13  1.06  95.50  95.20   75.11    50.21    24.95    5.04
100  0.2  0.07   0.55  0.74  94.70  94.90   75.01    49.94    24.92    4.95
100  0.5  0.17   0.58  0.76  94.50  94.95   75.00    49.83    24.96    5.00
100  1.0  0.28   0.51  0.72  95.30  95.07   75.22    50.00    24.82    4.93

Notes. N = sample size; β = parameters β1 and β2 in the regression model, which were set to be equal; %BIAS = relative bias of the marginal mean, defined as BIAS/Q × 100%, where Q is the set population quantity and BIAS is defined as Q̂ − Q, where Q̂ is the average estimate across the 1,000 replications; VAR = variance of the estimates across the replicated samples; RMSE = root mean squared error of the marginal mean; CR = coverage rate, that is, the percentage of 95% confidence intervals that include the "true" parameter. The columns "% > P5"–"% > P95" denote the percentage of the sample with y-values larger than the respective percentile.

to be filled in. Here, the marginal mean was overestimated by 11.4%. Furthermore, even when N = 20, setting k = 1 produced largely accurate estimates; only when β = 1 and pmis ≥ 40% could severe biases be observed. Secondly, sampling from a pool of k = 3 or k = 5 donors also yielded low biases in many scenarios. Especially when the model R2 was small to moderate (i.e., β ∈ {.2, .5}) and less than about 30%–40% of the data were missing, results were generally accurate enough. The overall largest biases were found in the conditions where k = 10. Especially when the missingness mechanism got stronger, more values in y were missing, and the sample size was 50 or less, downsizing the donor pool improved estimation accuracy quite noticeably: for example, when N = 20, β = 0.5, and pmis = 50%, setting k = 10 yielded a bias of 25.6%. Generating the imputations from a donor pool of size k = 5 decreased bias to 16.65%. Setting k = 3 produced a bias of 13.56%. In comparison, nearest neighbor PMM yielded the lowest bias in this condition of 9.42%. Finally, the automatic donor selection procedure midastouch also yielded sufficiently accurate estimates in most scenarios. Biases in fact were highly similar to those obtained by sampling from a donor pool of size k = 3. When N = 100, estimates were generally unbiased. When N = 50, a bias of 10.6% was found in the most extreme condition, where β = 1 and 50% of the data in y were missing. When N = 30, large biases were only found in the more extreme scenarios: with 50% missing data in y when β = 0.5, or with pmis ≥ 40% when β = 1. In summary, classical PMM produced sufficiently accurate estimates most of the time – given that reasonable settings regarding donor selection were applied. Decreasing

the size of the donor pool became more important as the sample size decreased, especially when more data were missing and the missing data mechanism was stronger. Furthermore, automatic distance-aided selection of donors also yielded sufficiently accurate estimates in many scenarios. Results were by and large comparable to using a donor pool of size k = 3.
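To make the role of the donor pool concrete, the following sketch shows a single predictive mean matching fill-in pass. It is illustrative only (the function name and details are mine): a proper MI implementation such as mice additionally draws the regression parameters from their posterior before matching, which this sketch omits.

```python
import numpy as np

def pmm_impute(y, X, k=5, rng=None):
    """One predictive mean matching fill-in pass (illustrative sketch).

    y : outcome vector with np.nan marking missing values
    X : predictor matrix (one row per case)
    k : size of the donor pool (k = 1 is nearest neighbor PMM)
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    obs = ~np.isnan(y)
    # OLS fit on the observed cases (a proper MI step would draw beta
    # from its posterior instead of using the point estimate)
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta                             # predicted means, all cases
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        # distance between the recipient's and the donors' predicted means
        d = np.abs(pred[obs] - pred[i])
        donors = np.argsort(d)[:k]              # the k closest observed cases
        y_imp[i] = rng.choice(y[obs][donors])   # donate an observed value
    return y_imp
```

With k = 1 the recipient always receives the value of the single closest observed case; larger k admits increasingly "dissimilar" donors, which is the trade-off the results above describe.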

Variance of the Parameter Estimates and Root Mean Square Error

RMSE estimates are given in Figure 2; the variance of the parameter estimates across the 1,000 replications is presented in Figure 3. RMSE estimates depended on both the β-weights (with higher values leading to larger RMSE estimates) and the missing data percentages (with larger percentages leading to larger RMSE estimates). Furthermore, RMSE estimates increased when the size of the donor pool increased, but noticeably only in those conditions where the model R2 was medium to large and more than about 20%–30% of the data had to be imputed. This was mainly due to the fact that both bias (see Figure 1) and variance (see Figure 3) increased with increasing missing data percentages and increasing size of β. On the other hand, variance estimates usually were "better" (i.e., lower) with larger donor pools: for example, when N = 100, the largest variance estimate was 1.48 when k = 1; 1.29 when k = 3; 1.21 when k = 5; and 1.11 when k = 10. In comparison, midastouch here yielded a largest estimate of 1.32. In comparison to the complete data estimates (Table 1), PMM produced similar Monte Carlo variance estimates only in some conditions, especially when the missing data


Figure 1. Percent parameter bias of the marginal mean. %BIAS is the relative bias of the marginal mean, defined as BIAS/Q × 100%, where Q is the set population quantity and BIAS is defined as Q̂ − Q, where Q̂ is the average estimate across the 1,000 replications. N is the sample size; %NA is the missing data percentage; k refers to the donor selection strategy.

percentage was small to moderate and the R2 of the regression model was also small to moderate. Especially when the missingness mechanism was strong, missingness percentages were high, and the donor pool was small, PMM estimates were not as consistently good across the replications as one could have wished for.
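The evaluation measures used throughout these results (%BIAS, VAR, RMSE, and CR; see the notes to Table 1) can be computed from a set of replication estimates as in the following illustrative sketch (the function name is mine):

```python
import numpy as np

def mc_metrics(estimates, ses, Q, z=1.96):
    """Monte Carlo summary metrics across simulation replications.

    estimates : point estimates of the marginal mean, one per replication
    ses       : the corresponding standard errors
    Q         : the set population quantity (the "true" value)
    """
    estimates = np.asarray(estimates, dtype=float)
    ses = np.asarray(ses, dtype=float)
    Q_bar = estimates.mean()                       # average estimate across replications
    pct_bias = (Q_bar - Q) / Q * 100.0             # %BIAS = BIAS / Q * 100%
    var = estimates.var(ddof=1)                    # VAR across the replicated samples
    rmse = np.sqrt(np.mean((estimates - Q) ** 2))  # RMSE of the marginal mean
    lo, hi = estimates - z * ses, estimates + z * ses
    cr = np.mean((lo <= Q) & (Q <= hi)) * 100.0    # CR: % of 95% CIs covering Q
    return {"%BIAS": pct_bias, "VAR": var, "RMSE": rmse, "CR": cr}
```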

Coverage Rates

So far, results have shown that overall, small donor pools produced more accurate estimates than large donor pools. However, as already mentioned, estimation accuracy might come at the cost of underestimated standard errors (cf. Schenker & Taylor, 1996). Coverage rates – as a hybrid measure that reflects both the adequacy of parameter

estimates and their standard errors – help to determine which PMM settings produce an acceptable trade-off between unbiased parameter estimates and unbiased standard errors. These values are summarized in Figure 4. Firstly, it can be seen that overall, automatic, distance-based donor selection (midastouch) produced the best results in terms of coverage. Though coverage was generally somewhat lower the larger β (and thus the strength of the missing data mechanism) was, nearly all coverage rates lay above 90%. Suboptimal coverage rates were found in only two conditions: when 50% of the data were missing, β = 1, and N = 30, coverage was 89.9%; and when N = 50, β = 1, and pmis = 50%, coverage was 88.7%. The average coverage rate obtained by midastouch across all conditions was 94.0%.
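The standard errors underlying these coverage rates come from pooling the m completed-data analyses with Rubin's (1987) rules. A minimal sketch of that pooling step (illustrative; it uses a normal reference for brevity, whereas small-sample analyses would use a t reference with the Barnard and Rubin, 1999, degrees of freedom):

```python
import numpy as np

def pool_rubin(q_hat, u_hat):
    """Pool m completed-data estimates with Rubin's rules (sketch).

    q_hat : point estimates from the m imputed data sets
    u_hat : their squared standard errors (within-imputation variances)
    """
    q_hat = np.asarray(q_hat, dtype=float)
    u_hat = np.asarray(u_hat, dtype=float)
    m = len(q_hat)
    q_bar = q_hat.mean()            # pooled point estimate
    w = u_hat.mean()                # within-imputation variance
    b = q_hat.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / m) * b         # total variance
    se = np.sqrt(t)
    # normal approximation; mice uses a t reference with adjusted df
    return {"est": q_bar, "se": se, "ci": (q_bar - 1.96 * se, q_bar + 1.96 * se)}
```

A small donor pool shrinks the between-imputation variance b; coverage stays adequate only as long as the resulting total variance t is not too small relative to the remaining bias.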


Figure 2. Root mean squared error (RMSE) of the marginal mean. N is the sample size; %NA is the missing data percentage; k refers to the donor selection strategy.

Secondly, nearest neighbor PMM also produced acceptable results in many scenarios. However, when k = 1, the average coverage rate across all conditions of 91.2% was noticeably lower than the one obtained by midastouch. Significant undercoverage was found in the more extreme conditions, when 40% or 50% of the data were missing and β was either 0.5 or 1, regardless of the sample size. It seems that while nearest neighbor PMM produced more accurate parameter estimates in general, midastouch produced a better overall trade-off between accurate parameter estimates and accurate standard errors. Thirdly, generating the imputations from a pool of three nearest neighbors also produced acceptable results in many scenarios. However, again, coverage rates were on average lower in comparison to midastouch. The average

coverage rate across all conditions was 91.9%. Undercoverage was found in some of the more extreme scenarios, where β ∈ {0.5, 1} and pmis ≥ 40%. With an average coverage rate of 91.7% across all scenarios, mice's default setting of sampling from a fixed pool of five donors also yielded mostly acceptable results. Note, however, that the drop in coverage was more pronounced in the extreme scenarios in comparison to using k = 1, k = 3, or the midastouch procedure. This effect got even stronger when the size of the donor pool was increased to k = 10: here, the average coverage rate across all conditions dropped to 90.3%. All in all, results were best using either automatic distance-based donor selection or when imputations were generated from a small donor pool.


Figure 3. Variance (VAR) of the estimates of the marginal mean across the 1,000 replications. N is the sample size; %NA is the missing data percentage; k refers to the donor selection strategy.

Estimation of Quantiles

Finally, I examined how well the different PMM settings fared in preserving the distribution of y. Figure 5 displays the percentages of the sample with values larger than the respective percentile, separately for the four sample size conditions. Results were averaged across all β-weight conditions. Each panel in Figure 5 covers the range ±5% of the respective percentile; the gray shaded areas therein denote the range ±2% of the respective percentile. Note that all complete data estimates were within ±1% of the respective percentile (cf. Table 1). PMM should not perform noticeably worse. I defined "noticeably worse" as producing estimates outside the interval ±2% of the respective percentile. Note that, again, there are

no definite criteria as to when bias in the estimates of percentiles should be regarded as significant. In general, estimation precision increased with increasing sample size. Most of the time, estimates by classical PMM were reasonable when N ≥ 50. When N = 100, all estimates were acceptable. When N = 50, only sampling from a large pool of 10 donors yielded biased estimates of some percentiles, when 40% or more of the data were missing. However, the smaller the sample became, the smaller the donor pool had to be to obtain accurate estimates: when N = 30, nearest neighbor PMM or sampling from a pool of k = 3 donors yielded overall acceptable estimates of the respective quantiles, whereas generating the imputations from larger donor pools yielded biased estimates of some quantiles. Biases here ranged from 2.07% to 4.95%.
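The percentile-preservation check behind Table 1 and Figure 5 amounts to computing the share of the (completed) sample lying above each reference percentile; if the distribution of y is preserved, the share above the pth percentile should be close to (100 − p)%. An illustrative sketch (the study compares against known population percentiles; for self-containment this sketch takes the cutoffs from the data themselves):

```python
import numpy as np

def pct_above_percentiles(y, percentiles=(5, 25, 50, 75, 95)):
    """Percentage of the sample with values larger than each percentile cutoff.

    Cutoffs are computed from `y` here for illustration; in the simulation
    study they would be the known population percentiles.
    """
    y = np.asarray(y, dtype=float)
    cuts = np.percentile(y, percentiles)
    return {f"% > P{p}": float(np.mean(y > c) * 100.0)
            for p, c in zip(percentiles, cuts)}
```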


Figure 4. Coverage rates of the marginal mean. CR is the coverage rate of the marginal mean, that is, the fraction of its 95% confidence intervals that include the “true” parameter; N is the sample size; %NA is the missing data percentage; k refers to the donor selection strategy.

When N = 20, setting k = 1 or k = 3 again produced the most accurate estimates among the fixed donor pool conditions. Generally, biases increased both with increasing missing data percentage and with increasing size of the donor pool. In comparison, midastouch worked well when N ≥ 30. When N = 30, midastouch produced only one suboptimal estimate, of the 25th percentile, when 50% of the data were missing; the estimate here was 2.47% off. Separate results for the three β-weight conditions are not shown in Figure 5; estimates, however, were good, and the results of the different PMM variants were hardly discernible when up to 30% of the data were missing. When more data had to be imputed, downsizing the donor pool improved the accuracy of the results the more noticeably, the stronger the missingness mechanism became. In this case, the best results were obtained by nearest neighbor PMM.

Discussion

The performance of PMM multiple imputation has been evaluated in various scenarios in which sample size, missing data percentage, the β-weights in the regression model, and the donor selection strategy were systematically varied. Performance has been evaluated in terms of estimation accuracy and consistency regarding estimates of the marginal mean and corresponding standard errors, and in terms of how well the procedure was able to preserve the original distribution of the variable. Firstly, both classical PMM and the automatic donor selection variant midastouch (Gaffert et al., 2016) yielded accurate statistical inferences in many scenarios. The findings of this study thus do not corroborate the general caveat that PMM might not be an option for small data sets.


Figure 5. Estimation accuracy of quantiles (averaged over β-weight conditions). The panels display the respective percentage of the sample with values larger than the respective percentile. P5–P95 denote the respective percentiles; %NA is the missing data percentage; k refers to the donor selection strategy.

Secondly, as expected, the magnitude of the observed biases depended not only on the interplay of the size of the donor pool, the missing data percentage, and the sample size, but also on the size of the regression coefficients in the data-generating model. That the size of the regression coefficients had an effect on statistical inferences is because missingness probabilities in this study depended on one of the predictors; as a consequence, larger β-weights also corresponded to "stronger" MAR mechanisms. Stronger in this context means that the relationships between the predictor, the dependent variable, and missingness in the dependent variable became stronger (see Equations 1 and 2). With increasing size of the regression coefficients, more values from the upper half and the tail of the distribution of y were deleted. This required

the missing data procedure to make at least some extrapolations beyond the range predicted by the remaining observed cases – something PMM cannot do (cf. van Buuren, 2012, Chap. 3). As statistical inferences are based on both observed and imputed values, this effect obviously increased with increasing missing data percentage. Moreover, the effect that biases got stronger with increasing size of the β-coefficients was further magnified by an increasing size of the donor pool and by a decreasing sample size. This is because a decreasing sample size means that fewer suitable donors are available in the sample. Furthermore, sampling from a large donor pool (in relation to sample size) increases the chance that rather "dissimilar" cases will also be selected, which in turn increases the chance of obtaining biased estimates.
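A mechanism of this kind can be mimicked in a small simulation. The sketch below is illustrative only: the study's actual data-generating and missingness models are its Equations 1 and 2, and a logistic missingness model on x1 merely stands in for them here. Larger β ties x1 more strongly to both y and missingness, i.e., a "stronger" MAR mechanism that preferentially deletes upper-tail values of y.

```python
import numpy as np

def simulate_mar(n, beta, pmis, rng):
    """Generate (y, x1, x2) and impose MAR missingness on y via x1 (sketch)."""
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    y = 10 + beta * x1 + beta * x2 + rng.standard_normal(n)
    # missingness probability increases with x1; the intercept is tuned so
    # the expected missing-data rate is roughly `pmis`
    logits = np.log(pmis / (1 - pmis)) + 2 * beta * x1
    miss = rng.random(n) < 1 / (1 + np.exp(-logits))
    y_obs = y.copy()
    y_obs[miss] = np.nan               # cases with large x1 go missing more often
    return y_obs, x1, x2, miss
```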


Consequently, nearest neighbor PMM produced the overall lowest biases. Additionally, coverage rates were sufficiently large most of the time. This implies that the standard error estimates were still large enough and that the decrease in between-imputation variance due to a small donor pool did not have an overly large detrimental effect. Furthermore, setting k = 3 or k = 5 also produced acceptable coverage rates in many scenarios. Undercoverage was mainly found in the more extreme scenarios, when about 30% or more of the data were missing. Though biases were usually somewhat larger when k = 3 or k = 5 in comparison to using the nearest neighbor (see Figure 1), it appears that these biases were buffered by sufficiently large confidence intervals, resulting in adequate coverage when the missing data problem was not too severe. As many statistical packages use either k = 3 or k = 5 as the default (cf. Allison, 2015), these are important findings for practitioners, who naturally want to focus on data analysis without having to reflect too much on which PMM settings would be most appropriate in a given scenario. The worst overall results were found when k = 10. Thirdly, the automatic distance-based donor selection procedure midastouch (Gaffert et al., 2016) also yielded good results. While the overall lowest biases were obtained by using nearest neighbor PMM, midastouch produced the overall highest coverage rates. It appears that the touched-up version of the MIDAS approach (Siddique & Belin, 2008) yielded more appropriate standard error estimates in comparison to classical nearest neighbor PMM. Finally, one finding that at first glance seemed counterintuitive was that the variance of the estimates across the 1,000 replications tended to decrease with an increasing size of the donor pool.
One possible explanation for this finding is that choosing a large k in relation to sample size could have produced a central tendency trend toward a biased marginal mean estimate. Gaffert et al. (2016), for example, stated that "if the distributions of the donors and recipients are roughly comparable then a large k will increase the probability for the donors closer to the center to give their value to the recipients closer to the bounds. That inevitably decreases the variance of y" (p. 6; see also the Appendix in Gaffert et al., 2016). This in consequence could also have decreased the variance of the estimates across the replications. To see if a central tendency trend might be a plausible explanation here, we need to have a closer look at some of the results. For example, when N = 20, β = .2, pmis = 30%, and k = 10, the average estimate of the marginal mean was Q̂ = 10.48, the variance of the estimates across the replications was VAR = 2.80, and the interquartile range (IQR) was 2.21, with a total range of between 3.90 and 17.67. In comparison, when only k = 1 donor was used in the same scenario, the average estimate of 10.21 was closer to the true value of 10;

however, its variance estimate was larger (4.36), and the IQR of 2.70 and the total range of between 1.60 and 20.00 were also larger. Similar results were found in other conditions. It appears that using a large donor pool indeed yielded estimates that were more tightly centered around a biased marginal mean. Future research should explore this effect further. All in all, the results suggest that PMM can be used for missing data imputation in small data sets when a reasonable donor selection strategy is applied. However, the results also imply that practitioners should try to obtain larger samples: increasing the sample size by even 10 or 20 participants increased the accuracy of statistical inferences quite noticeably.

Limitations

No single simulation study can cover all relevant aspects of interest; focusing on some aspects makes it necessary to disregard others. The present study, for example, considered only a very basic model with two predictors. Future research could systematically vary the number of predictors in the model, their relationships with the dependent variable, and how well they predict missingness. Relationships could additionally be more complex, including interactions and higher-order terms. Furthermore, in this study, only the dependent variable contained missing data, while the predictors remained completely observed. Future research could address more complex missing data scenarios with missingness on both sides of the equation. Also, the model in this study was homoscedastic – and all distributional assumptions were met. While some studies have already tested the robustness of PMM toward violations of distributional assumptions (e.g., Kleinke, 2017; Yu et al., 2007), future research should look into this in greater detail in the context of small sample sizes. Finally, Monte Carlo simulations are naturally artificial. Future simulations could also be based on empirical data sets and would therefore be more realistic.

References

Allison, P. D. (2015, March 5). Imputation by predictive mean matching: Promise & peril. Retrieved from http://statisticalhorizons.com/predictive-mean-matching
Andridge, R. R., & Little, R. J. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78, 40–64. https://doi.org/10.1111/j.1751-5823.2010.00103.x
Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86, 948–955. https://doi.org/10.1093/biomet/86.4.948
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15, 651–675. https://doi.org/10.1080/10705510802339072


Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275–299. https://doi.org/10.1037/a0015825
Gaffert, P., Meinfelder, F., & Bosch, V. (2016). midastouch: Towards an MI-proper predictive mean matching. Discussion paper. Retrieved from https://www.uni-bamberg.de/fileadmin/uni/fakultaeten/sowilehrstuehle/statistik/Personen/Dateien_Florian/properPMM.pdf
Graham, J. W., & Schafer, J. L. (1999). On the performance of multiple imputation for multivariate data with small sample size. In R. Hoyle (Ed.), Statistical strategies for small sample research (pp. 1–29). Thousand Oaks, CA: Sage.
Horton, N. J., & Kleinman, K. P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician, 61, 79–90. https://doi.org/10.1198/000313007X172556
Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57, 229–232. https://doi.org/10.1198/0003130032314
Kleinke, K. (2017). Multiple imputation under violated distributional assumptions – A systematic evaluation of the assumed robustness of predictive mean matching. Journal of Educational and Behavioral Statistics, 42, 371–404. https://doi.org/10.3102/1076998616687084
Little, R. J. A. (1988). Missing-data adjustments in large surveys. Journal of Business & Economic Statistics, 6, 287–296. https://doi.org/10.1080/07350015.1988.10509663
Morris, T. P., White, I. R., & Royston, P. (2014). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology, 14, 75–87. https://doi.org/10.1186/1471-2288-14-75
Parzen, M., Lipsitz, S. R., & Fitzmaurice, G. M. (2005). A note on reducing the bias of the approximate Bayesian bootstrap imputation variance estimator. Biometrika, 92, 971–974. https://doi.org/10.1093/biomet/92.4.971
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. https://doi.org/10.1093/biomet/63.3.581
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics, 4, 87–94. https://doi.org/10.1080/07350015.1986.10509497
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473–489. https://doi.org/10.1080/01621459.1996.10476908
Rubin, D. B., & Schenker, N. (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366–374. https://doi.org/10.1080/01621459.1986.10478280
Schafer, J. L. (1997a). Analysis of incomplete multivariate data. London, UK: Chapman & Hall.
Schafer, J. L. (1997b). Imputation of missing covariates under a general linear mixed model (Technical Report 97–10). University


Park, PA: Pennsylvania State University, The Methodology Center.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177. https://doi.org/10.1037/1082-989X.7.2.147
Schenker, N., & Taylor, J. M. (1996). Partially parametric techniques for multiple imputation. Computational Statistics & Data Analysis, 22, 425–446. https://doi.org/10.1016/0167-9473(95)00057-7
Siddique, J., & Belin, T. R. (2008). Multiple imputation using an iterative hot-deck with distance-based donor selection. Statistics in Medicine, 27, 83–102. https://doi.org/10.1002/sim.3001
Siddique, J., & Harel, O. (2009). MIDAS: A SAS macro for multiple imputation using distance-aided selection of donors. Journal of Statistical Software, 29, 1–18. https://doi.org/10.18637/jss.v029.i09
van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–67. https://doi.org/10.18637/jss.v045.i03
Vink, G., Frank, L. E., Pannekoek, J., & van Buuren, S. (2014). Predictive mean matching imputation of semicontinuous variables. Statistica Neerlandica, 68, 61–90. https://doi.org/10.1111/stan.12023
Yu, L. M., Burton, A., & Rivero-Arias, O. (2007). Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research, 16, 243–258. https://doi.org/10.1177/0962280206074464

Received August 25, 2015
Revision received August 11, 2016
Accepted September 18, 2017
Published online April 23, 2018

Kristian Kleinke
Department of Psychology
Bielefeld University
Postfach 10 01 31
33501 Bielefeld
Germany
kristian.kleinke@uni-bielefeld.de

Kristian Kleinke is a postdoctoral researcher at the University of Bielefeld. His primary research interests are missing data and multiple imputation. He focuses on imputation solutions for complex data structures like panel data, and “non-normal” missing data problems, that is, when convenient distributional assumptions of standard MI procedures are violated.


Original Article

Strategies for Increasing the Accuracy of Interviewer Observations of Respondent Features: Evidence From the US National Survey of Family Growth

Brady T. West¹ and Frauke Kreuter²,³,⁴

¹ Institute for Social Research, University of Michigan-Ann Arbor, MI, USA
² Joint Program in Survey Methodology, University of Maryland-College Park, MD, USA
³ School of Social Science, University of Mannheim, Germany
⁴ Statistical Methods Group, Institute for Employment Research, Nuremberg, Germany

Abstract: Because survey response rates are consistently declining worldwide, survey researchers strive to obtain as much auxiliary information on sampled units as possible. Surveys using in-person interviewing often request that interviewers collect observations on key features of all sampled units, given that interviewers are the eyes and ears of the survey organization. Unfortunately, these observations are prone to error, which decreases the effectiveness of nonresponse adjustments based on the observations. No studies have investigated the strategies being used by interviewers tasked with making these observations, or examined whether certain strategies improve observation accuracy. This study is the first to examine the associations of observational strategies used by survey interviewers with the accuracy of observations collected by those interviewers. A qualitative analysis followed by multilevel models of observation accuracy shows that focusing on relevant correlates of the feature being observed and considering a diversity of cues are associated with increased observation accuracy. Keywords: interviewer observations, auxiliary variables, survey paradata, multilevel modeling, interviewer effects

Methodology (2018), 14(1), 16–29. https://doi.org/10.1027/1614-2241/a000142

Interviewers in "face-to-face" surveys are often tasked with observing key features of sampled units, given that the interviewers are the eyes and ears of the survey organization. Because response rates in surveys of all formats have been consistently declining worldwide (Baruch & Holtom, 2008; Biener, Garrett, Gilpin, Roman, & Currivan, 2004; Brick & Williams, 2013; Cull, O'Connor, Sharp, & Tang, 2005; Curtin, Presser, & Singer, 2005; de Leeuw & de Heer, 2002; Tolonen et al., 2006; Williams & Brick, 2017), interviewers are asked to do this in an effort to obtain as much relevant auxiliary information on all sampled units as possible. The survey methodology literature has clearly established that auxiliary variables used for nonresponse adjustment of survey estimates need to be related to both the survey variables of interest and response propensity (Beaumont, 2005; Bethlehem, 2002; Groves, 2006; Kreuter et al., 2010; Lessler & Kalsbeek, 1992; Little & Vartivarian, 2005), so that the adjustments will reduce both the bias and the variance in the ultimate survey estimates. Survey organizations will therefore request that

interviewers attempt to collect observations on auxiliary variables having these optimal properties (in theory) for both respondents and nonrespondents, and then evaluate the potential of the observations for nonresponse adjustment purposes (e.g., West, 2013a). Unfortunately, interviewer observations can be prone to error. The various conceptualizations of total survey error (TSE) that have been published over the years (Groves & Lyberg, 2010) consistently acknowledge the problem of nonresponse bias that can arise in surveys. Errors in estimation are less often acknowledged as a key part of TSE (Biemer, 2010; Deming, 1944). From a TSE perspective that also considers errors in estimation, errors in interviewer observations may lead to nonresponse adjustments that introduce more bias in survey estimates than was present before the adjustments (Lessler & Kalsbeek, 1992; Stefanski & Carroll, 1985; West, 2013a, 2013b). This underscores the need for interviewer observations to be of sufficient accuracy, so that nonresponse adjustments based in part on the observations will do an


effective job of repairing nonresponse bias. Researchers have started to consider design strategies for improving observation accuracy as a result (West & Kreuter, 2015). More generally, the survey methodology literature has recently begun to examine the non-negligible error properties of these observations (Campanelli, Sturgis, & Purdon, 1997; Casas-Cordero, Kreuter, Wang, & Babey, 2013; Groves, Wagner, & Peytcheva, 2007; McCulloch, Kreuter, & Calvano, 2010; Pickering, Thomas, & Lynn, 2003; Sinibaldi, Durrant, & Kreuter, 2013; Tipping & Sinibaldi, 2010; West, 2013a; West & Kreuter, 2013, 2015; West, Kreuter, & Trappmann, 2014). Several recent studies have suggested that interviewers working on the same survey and receiving the same general training will vary substantially in terms of the accuracy of these observations, even when controlling for relevant respondent-, interviewer-, and area-level covariates (Sinibaldi et al., 2013; West & Kreuter, 2013, 2015; West et al., 2014). So why might interviewers working on the same study, and having received the same training, vary substantially in terms of observation accuracy (even after adjustment for covariates that might influence the accuracy)? No previous studies have assessed the possibility that interviewers may be using different strategies to collect these observations (i.e., looking for different cues that might serve as indicators of a particular feature being observed), and considered whether different strategies in a given context may lead to more or less accurate observations. 
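Between-interviewer variation in accuracy of the kind described here can be quantified roughly by decomposing a binary accuracy indicator into within- and between-interviewer variance components. The sketch below is an illustrative method-of-moments alternative to the multilevel models actually used in this study; all names are mine.

```python
import numpy as np

def interviewer_variance_components(acc, interviewer):
    """One-way variance decomposition of a binary accuracy indicator (sketch).

    acc         : 0/1 accuracy of each recorded observation
    interviewer : interviewer identifier for each observation
    Returns the between-interviewer variance component and the
    intraclass correlation (share of variance between interviewers).
    """
    acc = np.asarray(acc, dtype=float)
    interviewer = np.asarray(interviewer)
    ids = np.unique(interviewer)
    k, n = len(ids), len(acc)
    grand = acc.mean()
    groups = [acc[interviewer == i] for i in ids]
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    # effective group size for (possibly) unbalanced data
    n0 = (n - sum(len(g) ** 2 for g in groups) / n) / (k - 1)
    var_between = max((ms_between - ms_within) / n0, 0.0)
    denom = var_between + ms_within
    icc = var_between / denom if denom > 0 else 0.0
    return var_between, icc
```

A nonzero between-interviewer component, after accounting for covariates, is what motivates looking at the strategies individual interviewers use.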
Because survey organizations may use these interviewer observations for nonresponse adjustment purposes, and interviewers have repeatedly been demonstrated to vary substantially in terms of the accuracy of the observations, this study sought to (1) explore the variability in the strategies that interviewers use to collect these observations, and (2) understand whether there is a relationship between the strategy used and observation accuracy. The identification of strategies that are associated with more accurate observations has clear implications for improving interviewer training, and the overall improvements in accuracy that may arise from more standardized training on this process may in turn improve the effectiveness of nonresponse adjustments based in part on these observations. Why might interviewers vary in terms of the observational strategies that they use? In the absence of standardized training on the observation process, survey interviewers might utilize different "native" observational strategies, based on varying prior expectations of how the social world around them is organized (e.g., "a respondent who is currently in a romantic relationship will have features X, Y, and Z, so I will only look for features X, Y, and Z when making this judgment..."; Funder, 1987, 1995; Manderson & Aaby, 1992a, 1992b; McCall, 1984; Tversky & Kahneman, 1974). However, empirical evidence


of this assumption is largely lacking in the methodological literature. Do interviewers really vary in terms of the cues that they are looking for when attempting to record these types of observations? From a psychological perspective, interviewer perceptions may be influenced by specific environmental and behavioral cues (Forgas & Brown, 1977). The literature cited above suggests that interviewers with different backgrounds and expectations may be influenced in different ways by different types of cues (both when recording observations and survey responses), meaning that only certain types of cues will influence the observations that they record. In the context of the present study, these more influential cues would define the "strategies" employed by interviewers when recording their observations. No studies to date have considered whether these naturally varying strategies may ultimately lead to variability among interviewers in observation accuracy. What observational strategies might be associated with increased accuracy? Some interviewers may resort to considering features of the areas/environments in which they are working in the absence of any household-specific cues. This could be helpful if the interviewers are familiar with the areas and the areas are fairly homogeneous in terms of the feature(s) being observed. But if areas tend to be more heterogeneous, interviewers may incorrectly apply expectations that all households in that area will have the same features, or assume that if several households have been similar, the next will also have similar features (Babbie, 2001; Das, 1983; Harris, Jerome, & Fawcett, 1997; Manderson & Aaby, 1992a, 1992b; Millen, 2000; Repp, Nieminen, Olinger, & Brusca, 1998; Seidler, 1974; Tversky & Kahneman, 1974).
For other interviewers, the observational task may be quite difficult (e.g., inability to access a locked apartment building, working in crowded urban areas, etc.), leading to a failure to pick up on important external visual cues and subsequent guessing or "going on hunches." In these situations, observations would be expected to have reduced accuracy (Feldman, Hyman, & Hart, 1951; Funder, 1987; Graham, 1984; Jones, Riggs, & Quattrone, 1979; Kazdin, 1977; Most, Scholl, Clifford, & Simons, 2005; Simons & Jensen, 2009). Other interviewers may attempt to pick up on several different relevant predictors of a feature being observed, as well as specific features of a given household or respondent. Strategies reflecting diversity in the cues used (depending on the context) and an ability to detect specific features of a given respondent would be expected to produce increased accuracy (Funder, 1995; Kazdin, 1977; West & Kreuter, 2015; West et al., 2014). The social psychology literature also suggests that observations based on first impressions in the presence of limited information will tend to have increased accuracy (Ambady, Hallahan, & Conner, 1999; Patterson & Stockbridge, 1998). Whether or not these theoretical expectations related to accurate observational strategies are borne out in the survey interviewing context remains an open research question.

This article presents an examination of observational strategies that are associated with the accuracy of a key interviewer judgment in the US National Survey of Family Growth (NSFG). NSFG interviewers (all of whom are female) first attempt to conduct a screening interview with an adult informant within a randomly sampled housing unit. The primary purpose of the screening interview is to determine whether the sampled housing unit contains a person between the ages of 15 and 49 (the target population for the NSFG). Upon identification of all eligible persons within a household, one age-eligible person is selected at random for the main (face-to-face) NSFG interview (possibly at a later date that is convenient for the selected respondent). Only about 77% of these selected individuals ultimately participate in the main interview, and this number has continued to decline gradually over time. Beginning with the seventh cycle of the NSFG (June 2006–June 2010; Lepkowski, Mosher, Davis, Groves, & Van Hoewyk, 2010), interviewers were tasked with judging, immediately after completion of the screening interview, whether the randomly selected age-eligible person was currently in a sexually active relationship with a member of the opposite sex.

Why might the NSFG interviewers be asked to record this type of subjective judgment after each screening interview, given that this is not a directly observable feature of potential respondents? Unlike other types of auxiliary information commonly recorded by interviewers (e.g., contact observations, socio-demographics, housing unit features, etc.), this NSFG-specific judgment has a strong association with a number of key NSFG variables, along with the propensity of persons screened in the NSFG to respond to the main survey request (West, 2013a).
These judgments therefore define an ideal auxiliary variable for nonresponse adjustment purposes (Little & Vartivarian, 2005). The NSFG in part focuses on several key variables related to sexual behavior and activity, and an indication from the interviewers as to whether potential respondents (selected from the initial screening interview) are currently in a sexually active relationship provides useful auxiliary information for persons who do not eventually respond to the main NSFG interview. NSFG staff can use this information (along with other auxiliary variables on the sampling frame) to predict key outcomes related to being in sexual relationships for main interview nonrespondents. These types of "tailored" interviewer observations, which survey managers can specifically design as potential correlates of key survey variables of interest, have also been shown to be more effective for nonresponse adjustments than other data sources, including linked information from commercial data sources (Sinibaldi, Trappmann, & Kreuter, 2014). These interviewer judgments could also be validated against actual respondent reports of sexual activity collected in the main NSFG interview, which is important for assessing their accuracy.

In addition to being asked to record these judgments after completing screening interviews, NSFG interviewers in the last two quarters of Cycle 7 data collection were also asked to provide open-ended justifications for why they made their judgments and what cues they noted when recording them. This study sought to leverage this qualitative information to assess the interviewer-specific observational strategies evident in these justifications, along with the amount of variability among interviewers in judgment accuracy that was explained by these strategies. Given that no prior research has examined the observational strategies used by field interviewers tasked with making these types of judgments, the present study aimed to answer the following two research questions:
1. Do NSFG interviewers tend to fall into distinct "strategy" clusters based on the justifications used for their sexual activity judgments?
2. Are certain observational strategies associated with increased accuracy of the sexual activity judgments?

Data

Coding of Open-Ended Justifications

In the last two quarters of data collection for the NSFG (Cycle 7), 45 interviewers were asked to record (on laptop applications) open-ended justifications for their post-screener judgments of perceived current sexual activity for selected persons (see Figure 1). The interviewers were trained to provide the justifications immediately after the judgments were made, and the interviewers could not proceed with main interview tasks until the judgments and their justifications had been entered (along with all other observations from the screening interview). This means that all judgments and justifications were recorded prior to the main NSFG interview, and there were no missing data for the judgments or justifications. The interviewers were not prompted for specific justifications or limited in any way (e.g., in justification length), and interviewer training sessions on this process did not suggest any specific strategies to use when recording the judgments. The interviewers were simply told to make their best judgments based on what they had seen and/or heard, opening the door for the use of the aforementioned "native" observational strategies.



Figure 1. NSFG interviewers entered open-ended justifications for their sexual activity judgments into the box labeled “Rsex rel with opposite sex partner.”

In total, the 45 interviewers provided 3,992 open-ended justifications of widely varying lengths during these two quarters of data collection. Two examples of real recorded justifications follow:
1. "He works and goes to school and lives here with his twin – I do not think he could have someone over as the carpet is all taken up and it smells badly of dog poo." A justification for a judgment of not currently sexually active, reflecting the use of cues describing features of the housing unit.
2. "He has a tattoo, 'Carol', over his heart." A justification for a judgment of currently sexually active, reflecting an ability to pick up on the respondent's appearance during the screening interview.

The 3,992 justifications were coded on 13 different indicator variables (1 = mentioned in justification, 0 = not mentioned), with all indicators coded for each justification:
– Living arrangement (living with spouse, parents, etc.)
– Relationship status (mention of spouse, partner, etc.)
– Age
– Housing unit characteristics (presence of children, cleanliness, etc.)
– Appearance (references to physical appearance, ethnicity, etc.)
– Neighborhood characteristics
– Shyness
– Going on hunches/guessing (indication of a gut feeling, or not being sure)
– Incorrect (an incorrect judgment was entered in hindsight: after recording the judgment and having it saved in the system, the interviewer realized while writing the justification that an incorrect judgment had been entered, and mentioned this in the justification; NSFG interviewers cannot go back and change their prior entries)
– Conservative (a conservative or strict household/parents)
– Health (reference to health or physical disability)
– Personality (reference to the person's personality)
– Occupation (reference to the person's occupation)

In addition, the number of words used for each justification was coded as a proxy for the effort dedicated to the observational task. For example, the first justification given above was coded as having 35 words, and assigned "1" for living arrangement, housing unit characteristics, and occupation, and "0" for all other indicators. All coding of the justifications was performed twice with the assistance of an undergraduate research assistant. The inter-rater reliability of the codes was quite high; of the 3,992 coded justifications, the percentages of codes on a given indicator that did not agree between the two coders ranged from 0.32% (Shyness) to 7.53% (Housing unit characteristics), meaning that agreement exceeded 92% for all of the coded indicators. Discrepancies in coding were detected using PROC COMPARE in the SAS software, and any discrepancies in coding or word counts were discussed and resolved. The percentage of justifications falling into each of these 13 categories was then computed for each interviewer, along with the mean word count for the interviewer.
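The per-indicator agreement check described above can be sketched as follows. This is a minimal illustration: `codes_a`, `codes_b`, and the indicator names are invented stand-ins for the two coders' 0/1 codes, not the actual NSFG coding files, and the check itself was originally done with PROC COMPARE in SAS.

```python
# Percent disagreement between two coders on binary indicator codes.
# codes_a and codes_b are hypothetical example data, not the NSFG records.

def disagreement_rates(codes_a, codes_b, indicators):
    """Return {indicator: % of justifications on which the coders disagree}."""
    n = len(codes_a)
    rates = {}
    for ind in indicators:
        mismatches = sum(1 for a, b in zip(codes_a, codes_b) if a[ind] != b[ind])
        rates[ind] = 100.0 * mismatches / n
    return rates

indicators = ["living_arrangement", "relationship_status", "shyness"]
codes_a = [
    {"living_arrangement": 1, "relationship_status": 0, "shyness": 0},
    {"living_arrangement": 0, "relationship_status": 1, "shyness": 0},
    {"living_arrangement": 1, "relationship_status": 1, "shyness": 0},
    {"living_arrangement": 0, "relationship_status": 0, "shyness": 1},
]
codes_b = [
    {"living_arrangement": 1, "relationship_status": 0, "shyness": 0},
    {"living_arrangement": 1, "relationship_status": 1, "shyness": 0},
    {"living_arrangement": 1, "relationship_status": 1, "shyness": 0},
    {"living_arrangement": 0, "relationship_status": 0, "shyness": 1},
]
rates = disagreement_rates(codes_a, codes_b, indicators)
# living_arrangement: coders disagree on 1 of 4 justifications -> 25.0%
```

A rate of 7.53% on an indicator, as reported for Housing unit characteristics, thus corresponds to 92.47% agreement.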

Measures of Observation Accuracy

To compute dependent variables measuring observation accuracy for each interviewer, the total number of judgments of sexual activity, the total number of sexually active respondents (based on respondent reports of at least one sexual partner in the past year, from completed main interviews), the total number of sexually inactive respondents, and the total number of discordant judgments (i.e., judgments inconsistent with survey reports) were determined for each of the 45 interviewers. From these measures, we computed the overall gross difference rate (i.e., the proportion of judgments that were incorrect), the false positive rate (i.e., the proportion of sexually inactive respondents who were judged to be sexually active), and the false negative rate (i.e., the proportion of sexually active respondents who were judged to not be sexually active) for each interviewer. For purposes of this study, respondent reports of current sexual activity (i.e., at least one opposite-sex partner in the past year) were assumed to be true. Two interviewers had insufficient information available for the false positive rate analyses, because all of their main interview respondents reported being sexually active. An additional two interviewers were found to be outliers in the subsequent cluster analyses and were removed from further analysis. Our models of observation accuracy were thus based on 43 interviewers and 3,044 observations (for gross difference rates), 43 interviewers and 2,347 observations (for false negative rates), or 41 interviewers and 697 observations (for false positive rates).

Covariates

We also extracted a number of covariates describing features of the areas where an interviewer was assigned to work, motivated by prior studies examining correlates of observation accuracy (Sinibaldi et al., 2013; West & Kreuter, 2013, 2015). These included the percentage of the interviewer's assigned households in urban areas (i.e., Metropolitan Statistical Areas, or MSAs), the percentage of households in areas with access problems (e.g., gated communities), the percentage of households in primarily residential areas, the percentage of households in areas with evidence of non-English speakers or Spanish speakers, the percentage of households in areas with safety concerns, the percentage of households in multiunit buildings, the percentage of households with physical impediments (e.g., security gates), and the percentage of households where females were the selected respondents. We describe the roles of these covariates in our models of accuracy below.

Analytic Approach

To address our first research question, we performed an exploratory cluster analysis to determine whether distinct groups of interviewers existed in terms of the percentages of justifications falling into each category and the effort spent on the observational task. To do so, the 13 percentages and the mean word counts for the 45 interviewers who completed main interviews were initially standardized. An agglomerative hierarchical clustering approach was then applied (Everitt, Landau, Leese, & Stahl, 2011), using squared Euclidean distances based on the 14 standardized variables as distance measures between interviewers and Ward's (1963) minimum within-cluster variance method to define the clusters. This approach was selected for its established superiority in identifying known clusters when using continuous measures (Punj & Stewart, 1983). We examined descriptive statistics for the derived clusters, and then determined conceptual labels for the clusters by comparing the distributions of the percentages and means between them using the nonparametric, independent-samples Kruskal-Wallis H test. Next, to address our second research question, we fit a sequence of two multilevel logistic regression models to each of the three dependent accuracy measures (the gross difference rates, false positive rates, and false negative rates). The first model included random interviewer effects, capturing between-interviewer variability (and within-interviewer correlation) in a given accuracy indicator, and fixed effects of the P = 9 covariates described above, to account for the effects of these area-level features on the different accuracy measures. This initial model is shown in Equation (1):

ln[ϕ_i / (1 − ϕ_i)] = β_0 + Σ_{p=1}^{P} β_p x_{pi} + u_{0i},   (1)
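The agglomerative procedure described above (standardized inputs, squared Euclidean distances, Ward's minimum-variance merging) can be sketched in miniature. The data below are synthetic, and the naive O(n³) implementation is an illustration of the method, not the exact software routine the authors used.

```python
# Naive agglomerative clustering with Ward's minimum-variance criterion,
# on z-standardized inputs. Synthetic data; a sketch of the method only.
import math

def standardize(X):
    """z-score each column (as the 14 cluster-analysis inputs were)."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]

def ward_merge_cost(ca, cb, X):
    """Increase in total within-cluster SSE from merging clusters ca and cb."""
    na, nb = len(ca), len(cb)
    cen_a = [sum(X[i][j] for i in ca) / na for j in range(len(X[0]))]
    cen_b = [sum(X[i][j] for i in cb) / nb for j in range(len(X[0]))]
    d2 = sum((a - b) ** 2 for a, b in zip(cen_a, cen_b))  # squared Euclidean
    return na * nb / (na + nb) * d2

def ward_clusters(X, k):
    """Greedily merge the cheapest pair until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: ward_merge_cost(clusters[ab[0]], clusters[ab[1]], X),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# two obvious groups of "interviewers" measured on two inputs
data = standardize([[45, 6], [47, 7], [44, 5], [10, 20], [12, 22], [9, 19]])
groups = sorted(sorted(c) for c in ward_clusters(data, 2))
# rows 0-2 and rows 3-5 form the two clusters
```

Cutting the resulting merge tree at a chosen (rescaled) distance, as done with the dendrograms in Figures 2 and 3, corresponds to stopping the merging at a chosen number of clusters k.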


Table 1. Descriptive statistics for the variables used in the cluster analysis

                                      M       SD      Minimum  Maximum
Percentage of justifications mentioning:
  Relationship status                43.25    13.70    17.86    75.00
  Age                                32.67    23.21     0.00    88.24
  Living arrangement                 24.53    19.27     0.00    88.76
  Housing unit characteristics       23.75    13.18     0.00    62.50
  Guessing/going on hunches          12.10    17.62     0.00    82.14
  Appearance                          6.86     8.22     0.00    33.00
  Occupation                          4.43     5.82     0.00    25.00
  Personality                         3.88     4.70     0.00    17.82
  Health                              3.32     7.89     0.00    43.14
  Neighborhood characteristics        3.16     8.77     0.00    55.41
  Conservative                        1.69     2.90     0.00    12.50
  Incorrect                           1.01     1.81     0.00     8.70
  Shyness                             0.39     0.91     0.00     4.00
Mean word count                       6.32     4.03     1.90    27.92

Note. n = 45 interviewers.

Table 2. Descriptive statistics for the interviewer-level error rates

                         M       SD      Minimum  Maximum
Gross difference rate   0.206   0.079    0.000    0.439
False positive rate     0.537   0.029    0.000    1.000
False negative rate     0.116   0.012    0.000    0.500

Note. n = 45 interviewers, with the exception of the false positive rates (n = 43).
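As a concrete illustration of the three accuracy measures, the sketch below computes the gross difference, false positive, and false negative rates for one interviewer from (judgment, report) pairs. The data are hypothetical, not NSFG records, and respondent reports are treated as truth, as in the study.

```python
# Gross difference, false positive, and false negative rates for one interviewer.
# judgments/reports are hypothetical 0/1 pairs (1 = sexually active), not NSFG data.

def accuracy_rates(judgments, reports):
    """Rates defined against respondent reports, which are treated as truth."""
    pairs = list(zip(judgments, reports))
    n = len(pairs)
    active = [(j, r) for j, r in pairs if r == 1]      # reported sexually active
    inactive = [(j, r) for j, r in pairs if r == 0]    # reported inactive
    gross = sum(1 for j, r in pairs if j != r) / n
    false_pos = sum(1 for j, r in inactive if j == 1) / len(inactive)
    false_neg = sum(1 for j, r in active if j == 0) / len(active)
    return gross, false_pos, false_neg

judgments = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
reports   = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
gross, fp, fn = accuracy_rates(judgments, reports)
# one false positive among 3 inactive reports, one false negative among 7 active
```

Note that, as for the two interviewers excluded from the false positive analyses, the false positive rate is undefined for an interviewer whose respondents all reported being sexually active.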

where ϕ_i is the probability of an incorrect judgment (or false positive, or false negative) for interviewer i, x_{pi} is the value of covariate p for interviewer i, and u_{0i} ~ N(0, τ²), with τ² the variance of the random interviewer effects. Next, given an initial estimate of the variance of the random interviewer effects, we added fixed effects of the specific interviewer "strategy" clusters identified from the cluster analysis to the initial model (omitting one cluster as a reference category):

ln[ϕ_i / (1 − ϕ_i)] = β_0 + Σ_{p=1}^{P} β_p x_{pi} + Σ_{c=1}^{C−1} λ_c I_i[cluster = c] + u_{0i},   (2)

where ϕ_i and x_{pi} are defined as in Equation (1), I_i[cluster = c] = 1 if interviewer i is in cluster c and 0 otherwise, and u_{0i} ~ N(0, τ²), with τ² the variance of the random interviewer effects.

We then computed the percentage of variance in the random interviewer effects explained by the inclusion of the fixed effects of the strategy clusters in Equation (2). In each model, the variance of the random interviewer effects was tested against zero using a likelihood ratio test based on a mixture of chi-square distributions (Zhang & Lin, 2008). Models (1) and (2) were fitted using the GLIMMIX procedure in the SAS software (Version 9.4), specifically using a Laplace approximation for estimation purposes (Kim, Choi, & Emery, 2013).
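The "percentage of variance explained" comparison can be reproduced directly from the interviewer variance components (τ²) reported in Table 4; the three (base, with clusters) pairs below are those estimates for the gross difference, false positive, and false negative models.

```python
# Percent reduction in the random-interviewer-effect variance after adding
# the fixed strategy-cluster effects (Equation 2) to the base model (Equation 1).

def pct_variance_explained(tau2_base, tau2_full):
    return 100.0 * (tau2_base - tau2_full) / tau2_base

# (tau^2 base model, tau^2 including strategy clusters), from Table 4
models = {
    "gross difference": (0.058, 0.037),
    "false positive":   (0.904, 0.769),
    "false negative":   (0.732, 0.412),
}
reductions = {m: round(pct_variance_explained(b, f), 1)
              for m, (b, f) in models.items()}
# -> {'gross difference': 36.2, 'false positive': 14.9, 'false negative': 43.7}
```

These values match the reductions of 36.2%, about 15%, and nearly 44% reported in the Results.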

Results

We first consider descriptive summaries of the variables computed for all 45 interviewers. Descriptive statistics for the interviewer-specific percentages and mean word counts are shown in Table 1. A large amount of variability among the 45 interviewers is evident in terms of the justification strategies and the average number of words used for the justifications. Table 2 presents descriptive statistics for the three dependent variables computed for each interviewer. Interviewers made correct judgments of current sexual activity 79.4% of the time, with a minimum of 0% errors (i.e., some interviewers were correct all the time) and a maximum of 43.9% errors. There were more false positive judgments than false negative judgments, suggesting that interviewers tended to err on the side of assuming sexual activity (West & Kreuter, 2015).

Figure 2. Dendrogram showing results of the initial cluster analysis, with evidence of two outliers (Interviewers 36 and 40).

The initial cluster analysis provided evidence of two interviewers that could be considered outliers (Figure 2), with one interviewer citing neighborhood features in 55.41% of her justifications (the next highest percentage being 17.12%) and another citing health reasons in 43.14% of her justifications (the next highest percentage being 27.50%). After dropping these two interviewers, the second cluster analysis presented evidence of four unique clusters of interviewers based on scaled distances between the clusters (Figure 3); that is, there were

Figure 3. Dendrogram showing results of the second cluster analysis, with evidence of four distinct groups of interviewers (based on rescaled cluster distances greater than 10).


Table 3. Descriptive statistics for interviewer-level justification tendencies and mean word counts within four distinct clusters of interviewers (n = 43 total)

                                 Cluster 1       Cluster 2       Cluster 3       Cluster 4       Kruskal-Wallis χ²(df), p-value
Number of interviewers           20              7               11              5

Percentage of justifications mentioning:
Relationship status              45.90 (11.10)   41.87 (7.04)    47.76 (18.43)   25.63 (5.85)    10.30(3), p = .016
Age                              23.02 (14.94)   49.96 (20.99)   21.03 (20.43)   57.52 (12.52)   17.20(3), p = .001
Living arrangement               37.62 (18.36)   20.96 (12.33)   11.11 (8.38)     2.06 (2.35)    26.10(3), p < .001
Housing unit characteristics     28.60 (13.25)   25.11 (9.33)    13.04 (10.81)   26.23 (14.42)    9.00(3), p = .029
Guess/gut feelings                5.95 (8.41)     3.96 (6.18)    33.93 (22.14)    1.55 (2.99)    17.00(3), p = .001
Appearance                        5.62 (4.01)    20.60 (9.67)     1.90 (3.89)     0.39 (0.54)    24.10(3), p < .001
Occupation                        5.59 (6.13)     5.73 (4.41)     1.33 (2.09)     0.00 (0.00)    14.70(3), p = .002
Personality                       3.85 (3.20)     8.94 (6.33)     1.09 (1.24)     0.27 (0.61)    17.00(3), p = .001
Health                            1.29 (1.61)     7.22 (10.32)    2.23 (4.62)     0.27 (0.61)     6.41(3), p = .093
Neighborhood characteristics      2.78 (4.71)     3.81 (3.44)     0.38 (1.08)     0.00 (0.00)     9.50(3), p = .023
Conservative                      2.68 (3.70)     1.75 (2.86)     0.59 (0.76)     0.00 (0.00)     5.94(3), p = .115
Incorrect                         1.29 (1.74)     0.81 (1.16)     0.24 (0.54)     2.29 (3.77)     4.06(3), p = .255
Shyness                           0.25 (0.55)     0.98 (1.41)     0.05 (0.16)     0.00 (0.00)     7.89(3), p = .048
Mean word count                   6.32 (2.48)     7.01 (1.58)     4.17 (1.31)     4.96 (2.29)    11.90(3), p = .008

Gross difference rate             0.247           0.191           0.168           0.171
False positive rate               0.413           0.536           0.515           0.795
False negative rate               0.196           0.087           0.070           0.006

Notes. Cells contain mean (SD). The highest mean for each input variable is in boldface.
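The between-cluster comparisons in Table 3 use the Kruskal-Wallis H test; a minimal implementation is sketched below. It assumes no tied values (so omits the tie correction), and the groups are synthetic, not the NSFG clusters.

```python
# Kruskal-Wallis H statistic for comparing a measure across interviewer clusters.
# Minimal sketch: no tie correction; groups are synthetic example data.

def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum_j n_j * (Rbar_j - (N+1)/2)^2 over groups j."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # assumes no ties
    n_total = len(pooled)
    h = 0.0
    for g in groups:
        mean_rank = sum(rank[v] for v in g) / len(g)
        h += len(g) * (mean_rank - (n_total + 1) / 2) ** 2
    return 12.0 / (n_total * (n_total + 1)) * h

# three small "clusters" with clearly different locations
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
# ranks are 1..9 exactly; mean ranks 2, 5, 8 -> H = 12/90 * (27 + 0 + 27) = 7.2
```

Under the null hypothesis of identical distributions, H is referred to a χ² distribution with (number of groups − 1) degrees of freedom, which is the χ²(3) shown for the four-cluster comparisons in Table 3.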

in fact distinct groups of interviewers based on their justification tendencies. Descriptive statistics on the 14 variables for each cluster are shown in Table 3. The results in Table 3 suggest that the first cluster of interviewers is largely defined by a tendency to notice living arrangements and housing unit characteristics. The second cluster is largely defined by references to appearance and personality, and a relatively large word count; interviewers in this cluster appeared to use the widest diversity of cues, including references to age and relationship status. The third cluster is primarily defined by references to relationship status and guesses/gut feelings, while the fourth cluster focuses primarily on age, occasionally referring to relationship status and household characteristics but hardly anything else.

We now consider differences among these clusters in terms of the various accuracy measures for the sexual activity judgments. Table 4 presents estimates of the fixed effects and variance components for the two multilevel logistic regression models fitted to each dependent accuracy measure. The observational "strategy" clusters describing interviewers based on their justification tendencies explain significant portions of the unexplained variance among interviewers in observation accuracy. In the case of overall gross difference rates, Clusters 2 (use of a diversity of cues) and 3 (focus on relationship status, or "gut feelings") have significantly reduced odds [odds ratio for Cluster 2 = exp(−0.40) = 0.67, or 33% lower odds, and odds ratio for Cluster 3 = exp(−0.35) = 0.70, or 30% lower odds, respectively] of making an error relative to Cluster 1. Furthermore, 36.2% of the unexplained variance among interviewers (when adjusting for the fixed effects of the area-level covariates) is accounted for by these "strategy" clusters. In the case of false positive rates, Cluster 4 (focus primarily on age) had substantially increased odds of making a false positive error [odds ratio for Cluster 4 = exp(1.70) = 5.47, or 447% higher odds], and the fixed strategy cluster effects explain about 15% of the unexplained variance in accuracy among interviewers. In the case of false negative rates, Clusters 2 and 3 once again have significantly reduced odds of making a false negative error relative to Cluster 1 [odds ratio for Cluster 2 = exp(−0.87) = 0.42, or 58% lower odds, and odds ratio for Cluster 3 = exp(−0.99) = 0.37, or 63% lower odds, respectively]. We also note that Cluster 4 has reduced odds of making a false negative error, but such errors occurred too rarely for this effect to be estimated reliably (see Table 3). Collectively, the fixed cluster effects were able to explain nearly 44% of the unexplained variance in false negative rates among interviewers.
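The conversions from logit-scale estimates to odds ratios quoted above can be checked directly; the coefficients below are the cluster effects reported for the three models (Cluster 1 is the reference category).

```python
# Converting logit-scale cluster effects to odds ratios and percent changes in odds.
import math

def odds_ratio(beta):
    return math.exp(beta)

def pct_change_in_odds(beta):
    """Negative = lower odds of an error than the reference cluster."""
    return 100.0 * (math.exp(beta) - 1.0)

# cluster coefficients reported in the text (reference: Cluster 1)
betas = {
    "Cluster 2, gross difference": -0.40,  # OR 0.67, ~33% lower odds
    "Cluster 3, gross difference": -0.35,  # OR 0.70, ~30% lower odds
    "Cluster 4, false positive":    1.70,  # OR 5.47, ~447% higher odds
    "Cluster 2, false negative":   -0.87,  # OR 0.42, ~58% lower odds
    "Cluster 3, false negative":   -0.99,  # OR 0.37, ~63% lower odds
}
ors = {k: round(odds_ratio(b), 2) for k, b in betas.items()}
```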

Table 4. Parameter estimates (logit scale) in multilevel logistic regression models indicating the relationships of selected covariates and interviewer "strategy" cluster membership with gross difference rates, false positive rates, and false negative rates (n = 43 interviewers for Models 1 and 3; n = 41 for Model 2)

Each model was fitted twice: a base model including the intercept and the nine area-level covariates (% urban, % access problems, % residential, % non-English, % Spanish, % safety concerns, % many units, % physical impediments, % female), and a full model adding fixed effects of the strategy clusters, with Cluster 1 (living arrangement/housing unit features) as the reference category. Key estimates from the full models:

                                          Model 1: gross      Model 2: false      Model 3: false
                                          difference rates    positive rates      negative rates
Cluster 2: diversity of cues              −0.40 (0.17)**                          −0.87 (0.39)**
Cluster 3: relationship status/hunches    −0.35 (0.15)**                          −0.99 (0.36)**
Cluster 4: primarily age                                       1.70 (0.72)**      −5.91 (4.07)
Interviewer variance (base model)          0.058*** (0.029)    0.904*** (0.333)    0.732*** (0.243)
Interviewer variance (with clusters)       0.037*** (0.022)    0.769*** (0.289)    0.412*** (0.149)
% reduction in interviewer variance        36.2%               14.9%               43.7%

Notes. Cells contain estimate (SE). "–" indicates reference category. ***p < .01, **p < .05, *p < .10.


We therefore find consistent evidence of the observational strategies influencing the error properties of the sexual activity judgments in a significant manner, as described below:
– Focusing primarily on relationship status and gut feelings was associated with the highest accuracy;
– Judging based primarily on age resulted in systematic false positives;
– Considering a diversity of cues, including appearance, personality, and other external features, also improved accuracy;
– Focusing on living arrangement and housing unit features was detrimental to accuracy and resulted in systematic false negatives.

What might cause the remaining unexplained variance in accuracy among interviewers? All of the NSFG interviewers were female, and the vast majority were white, married, and had previous NSFG experience. For as many interviewers as possible (42 of 45), we extracted their overall years of interviewing experience, their age, and the number of children that they had from a voluntary interviewer survey (3 of the 45 interviewers chose not to participate in the voluntary survey). NSFG managers examined these 42 values for each variable and reported no evidence of measurement error given their knowledge of these particular interviewers. In exploratory analyses, we added fixed effects of these interviewer-level covariates to the "full" models in Table 4. For the gross difference rates, we found that interviewers with more children had significantly reduced odds of making an error overall, with the effects of the strategy clusters remaining the same, and the interviewer variance component was reduced to the point where it was no longer significantly greater than zero. For the false positive rates, we found that older interviewers had significantly reduced odds of making a false positive error, again further reducing the variance component.
Finally, for the false negative rates, we found the older interviewers had significantly increased odds of making a false negative error; older interviewers appeared to err in the direction of no sexual activity. Relevant interviewer features therefore did seem to explain additional variance in accuracy, which is consistent with the existing literature (Sinibaldi et al., 2013; West & Kreuter, 2013, 2015) and could have training implications depending on the type of observation being collected.

Discussion

This study has demonstrated that the collection and analysis of open-ended justifications for the observations that field interviewers are often asked to record while conducting face-to-face surveys are feasible in practice. The analyses provide interesting insights into the observational strategies used by NSFG interviewers who make more accurate judgments regarding a specific respondent characteristic. With regard to our first research question, we found evidence of distinct clusters of interviewers based on the justifications that they tended to use for their judgments. This finding suggests that, with only minimal guidance and training on the observation process provided by NSFG staff, different interviewers did in fact use different observational strategies in the field when recording these types of judgments, consistent with our theoretical expectations. This finding certainly needs to be replicated in other survey contexts to further understand this phenomenon. With regard to our second research question, the four clusters of interviewers identified based on the observational strategies evident in their justifications were found to vary significantly in terms of the error properties of this specific interviewer observation, when adjusting for the effects of other area- and interviewer-level covariates on accuracy. This finding suggests that variance in error rates on these types of observations may in fact be a function of varying observational strategies being employed in the field, which has important implications for standardized training on the process of recording interviewer observations (e.g., Dahlhamer, 2012). Our results were mainly consistent with theoretical expectations, in that a focus on highly relevant cues (Funder, 1995) and more diversity in the cues used (Kazdin, 1977) were found to improve the error properties of the observations.
Slightly contrary to theoretical expectations was the finding that a combination of focusing on a highly relevant cue (mention of relationship status during the screening interview) and guessing or "going on hunches" (expected to result in lower observation accuracy) produced favorable error properties. A reasonable suggestion for practice would thus be to first try to identify highly relevant correlates of a feature being observed, and then go with general impressions or best guesses if those correlates are not readily available. Identification of these highly relevant features for a particular survey will require replications of this study, or at least discussions with interviewers who are found to produce highly accurate observations in a given survey. Either way, we hope that the methods described in this study will be used by other survey researchers to better understand effective observational strategies in other survey contexts.

Importantly, the different strategies used by interviewers to record the observations could also be reflective of the approaches that they take to recording the actual survey measurements in the interview. For example, an interviewer who, based on her background and expectations, uses age cues exclusively to justify her current sexual activity judgments may ultimately make comments or express opinions about risky sexual behavior in older or younger people during the actual interview. This could lead respondents to answer questions about current sexual activity in different (and possibly error-prone) ways, depending on statements made by the interviewer. A review of the literature on measurement of sexual behaviors in surveys (Fenton, Johnson, McManus, & Erens, 2001) suggests that interviewer gender can influence self-reports of sexual behavior (not relevant in the NSFG, given that all interviewers are female), but also that establishment of rapport between the interviewer and the respondent can lead to reports of more frequent (and possibly exaggerated) sexual behavior. Interviewers may become more conversational and go off on tangents when rapport has been established, increasing the probability of interviewers communicating (either verbally or nonverbally) their expectations regarding the topic at hand (West & Blom, 2017), and this is when their "native" observational strategies might play a role in affecting measurement. If this were the case in practice, comparing the "strategy clusters" in terms of observation accuracy may be misleading, but this requires future research (possibly taking advantage of computer audio-recorded interviewing, or CARI, technologies for recording the survey interviews). Alternatively, if older NSFG respondents tend to provide more socially desirable responses about sexual activity with respect to their age (Fenton et al., 2001), then for the group of interviewers that tended to use age exclusively in their justifications (Cluster 4 in this study), we may be estimating the error rates associated with this strategy incorrectly.
For example, if (1) respondents between the ages of 20 and 49 tend to report being sexually active when they actually are not, (2) an interviewer tends to judge current sexual activity based on age (e.g., older respondents are more likely to be sexually active), and (3) the social desirability bias is larger for older individuals than younger individuals, what seem like correct judgments for older individuals may actually be incorrect at higher rates. This speculation would require additional research as well, and each of these possibilities speaks to the importance of having good validation data for establishing the accuracy of the interviewer judgments. Setting aside these possibilities of a link between the observational strategy used and the accuracy of the respondent reports, the findings of this study have direct practical implications for future interviewer training. At a minimum, NSFG interviewers could be provided with verbal guidance about strategies to avoid (e.g., focusing on living arrangement) and strategies to employ (e.g., using a diversity of cues and attempting to detect some hint of relationship status in the screening interview) when recording these judgments. More generally, NSFG interviewers could watch


brief videos of “staged” hypothetical screening interviews, where particular cues are mentioned (e.g., “My boyfriend isn’t home right now…”), and then be asked more generally about what judgments they might record and why. Thinking even more broadly about other surveys and other types of interviewer observations, some survey programs are starting to incorporate practice sessions for recording interviewer observations based on real photographs of housing units and neighborhoods into interviewer training (e.g., Dahlhamer, 2012; Stähli, 2010). Importantly, these sessions and the “correct” responses for what observations to record should be based on the type of empirical evidence generated in the present study. There are several opportunities for future research in this area. First, the interviewer judgments in this study were only recorded for housing units where a screening interview was successfully completed (about 91% of housing units sampled for the NSFG). In general, the literature in this area would benefit from more case studies discussing the collection and analysis of “tailored” observations that are correlated with key survey variables and response propensity for all sampled units in a given survey. More evidence of such observations contributing to effective nonresponse adjustments would provide empirical support for the continued use of this practice. Second, the observations could only be validated using information provided by survey respondents. The possibility that observations on nonresponding units may have suffered from reduced quality could not be considered in this study, and future studies would need to consider alternative sources of validation data to investigate this possibility further.
Third, privacy regulations affecting data collection practices in Europe may eventually limit the ability of survey organizations to collect interviewer observations. A lawyer we consulted indicated that interviewer observations do not conflict with current European privacy laws, but it will be interesting to see whether future debates about privacy protection make it more difficult to collect these types of observations, both in Europe and in other countries.
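The social-desirability scenario conjectured earlier in this discussion can be made concrete with a small simulation. The snippet below is purely illustrative (all prevalence and bias values are assumed for the example, not taken from the NSFG): it shows how, when inactive older respondents over-report being active and an interviewer judges activity from age alone, accuracy measured against respondent reports can overstate accuracy measured against the truth.

```python
import random

random.seed(1)

def simulate(n=100_000):
    """Return {group: (apparent_accuracy, true_accuracy)} for an
    age-only judgment strategy under assumed reporting bias."""
    tallies = {"younger": [0, 0, 0], "older": [0, 0, 0]}  # [n, vs_report, vs_truth]
    for _ in range(n):
        older = random.random() < 0.5
        group = "older" if older else "younger"
        # Assumed true activity rates by age group (illustrative values).
        active = random.random() < (0.55 if older else 0.70)
        # Assumption (3) in the text: inactive respondents over-report
        # being active, more so when older (social desirability).
        reported = True if active else random.random() < (0.30 if older else 0.10)
        # Assumption (2): the interviewer judges "active" from age alone.
        judged = older
        t = tallies[group]
        t[0] += 1
        t[1] += judged == reported  # accuracy against validation reports
        t[2] += judged == active    # accuracy against the truth
    return {g: (t[1] / t[0], t[2] / t[0]) for g, t in tallies.items()}

rates = simulate()
for group, (apparent, true) in rates.items():
    print(f"{group}: apparent={apparent:.3f} true={true:.3f}")
```

With these assumed values, the apparent accuracy for older respondents exceeds their true accuracy, which is exactly the pattern of misleading error-rate estimates described above.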

Acknowledgments
Funding for this research was provided by NIH Grant #1R03-HD-075979-01-A1. The National Survey of Family Growth (NSFG) is carried out under a contract with the CDC’s National Center for Health Statistics, Contract #200-2010-33976. Funding comes from several agencies, including NCHS, NICHD, CDC, OPA, and ACF. The views expressed here do not represent those of NCHS or the other funding agencies. We are indebted to Ziming Liao from the University of Michigan Undergraduate Research Opportunity Program (UROP) for his contributions to this work.

Methodology (2018), 14(1), 16–29


References
Ambady, N., Hallahan, M., & Conner, B. (1999). Accuracy of judgments of sexual orientation from thin slices of behavior. Journal of Personality and Social Psychology, 77, 538–547. https://doi.org/10.1037/0022-3514.77.3.538
Babbie, E. R. (2001). The practice of social research (9th ed.). Belmont, CA: Wadsworth/Thomson Learning.
Baruch, Y., & Holtom, B. C. (2008). Survey response rate levels and trends in organizational research. Human Relations, 61, 1139–1160. https://doi.org/10.1177/0018726708094863
Beaumont, J.-F. (2005). On the use of data collection process information for the treatment of unit nonresponse through weight adjustment. Survey Methodology, 31, 227–231.
Bethlehem, J. G. (2002). Weighting nonresponse adjustments based on auxiliary information. In R. Groves, D. Dillman, J. Eltinge, & R. Little (Eds.), Survey nonresponse (pp. 275–287). New York, NY: Wiley.
Biemer, P. P. (2010). Total survey error: Design, implementation, and evaluation. Public Opinion Quarterly, 74, 817–848. https://doi.org/10.1093/poq/nfq058
Biener, L., Garrett, C. A., Gilpin, E. A., Roman, A. M., & Currivan, D. B. (2004). Consequences of declining survey response rates for smoking prevalence estimates. American Journal of Preventative Medicine, 27, 254–257. https://doi.org/10.1016/j.amepre.2004.05.006
Brick, J. M., & Williams, D. (2013). Explaining rising nonresponse rates in cross-sectional surveys. The Annals of the American Academy of Political and Social Science, 645, 36–59. https://doi.org/10.1177/0002716212456834
Campanelli, P., Sturgis, P., & Purdon, S. (1997). Can you hear me knocking: An investigation into the impact of interviewers on survey response rates. London, UK: SCPR.
Casas-Cordero, C., Kreuter, F., Wang, Y., & Babey, S. (2013). Assessing the measurement error properties of interviewer observations of neighbourhood characteristics. Journal of the Royal Statistical Society (Series A), 176, 227–250. https://doi.org/10.1111/j.1467-985X.2012.01065.x
Cull, W. L., O’Connor, K. G., Sharp, S., & Tang, S. S. (2005). Response rates and response bias for 50 surveys of pediatricians. Health Services Research, 40, 213–226. https://doi.org/10.1111/j.1475-6773.2005.00350.x
Curtin, R., Presser, S., & Singer, E. (2005). Changes in telephone survey nonresponse over the past quarter century. Public Opinion Quarterly, 69, 87–98. https://doi.org/10.1093/poq/nfi002
Dahlhamer, J. (2012). New observation questions. Presentation at the 2013 NHIS Centralized Refresher Training and Conference, December 3–6, 2012, Hyattsville, MD. Available from the author (fzd2@cdc.gov)
Das, T. H. (1983). Qualitative research in organizational behaviour. Journal of Management Studies, 20, 301–314. https://doi.org/10.1111/j.1467-6486.1983.tb00209.x
de Leeuw, E., & de Heer, W. (2002). Trends in household survey nonresponse: A longitudinal and international comparison. In R. Groves, D. Dillman, J. Eltinge, & R. Little (Eds.), Survey nonresponse (pp. 41–54). New York, NY: Wiley.
Deming, W. E. (1944). On errors in surveys. American Sociological Review, 9, 359–369.
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). Series in Probability and Statistics. New York, NY: Wiley.
Feldman, J. J., Hyman, H., & Hart, C. W. (1951). A field study of interviewer effects on the quality of survey data. Public Opinion Quarterly, 15, 734–761. https://doi.org/10.1086/266357


Fenton, K. A., Johnson, A. M., McManus, S., & Erens, B. (2001). Measuring sexual behavior: Methodological challenges in survey research. Sexually Transmitted Infections, 77, 84–92. https://doi.org/10.1136/sti.77.2.84
Forgas, J. P., & Brown, L. B. (1977). Environmental and behavioral cues in the perception of social encounters: An exploratory study. The American Journal of Psychology, 90, 635–644. https://doi.org/10.2307/1421737
Funder, D. C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin, 101, 75–90. https://doi.org/10.1037/0033-2909.101.1.75
Funder, D. C. (1995). On the accuracy of personality judgment: A realistic approach. Psychological Review, 102, 652–670. https://doi.org/10.1037/0033-295X.102.4.652
Graham, R. J. (1984). Anthropology and O.R.: The place of observation in management science process. The Journal of the Operational Research Society, 35, 527–536. https://doi.org/10.2307/2581799
Groves, R. M. (2006). Nonresponse rates and nonresponse bias in household surveys. Public Opinion Quarterly, 70, 646–675.
Groves, R. M., & Lyberg, L. (2010). Total survey error: Past, present, and future. Public Opinion Quarterly, 74, 849–879. https://doi.org/10.1093/poq/nfq065
Groves, R. M., Wagner, J., & Peytcheva, E. (2007). Use of interviewer judgments about attributes of selected respondents in post-survey adjustments for unit nonresponse: An illustration with the National Survey of Family Growth. Proceedings of the Section on Survey Research Methods, Joint Statistical Meetings, Salt Lake City, UT.
Harris, K. J., Jerome, N. W., & Fawcett, S. B. (1997). Rapid assessment procedures: A review and critique. Human Organization, 56, 375–378. https://doi.org/10.17730/humo.56.3.w525025611458003
Jones, E. E., Riggs, J. M., & Quattrone, G. (1979). Observer bias in the attitude attribution paradigm: Effect of time and information order. Journal of Personality and Social Psychology, 37, 1230–1238.
Kazdin, A. E. (1977). Artifact, bias, and complexity of assessment: The ABCs of reliability. Journal of Applied Behavior Analysis, 10, 141–150. https://doi.org/10.1901/jaba.1977.10-141
Kim, Y., Choi, Y.-K., & Emery, S. (2013). Logistic regression with multiple random effects: A simulation study of estimation methods and statistical packages. The American Statistician, 67, 171–182. https://doi.org/10.1080/00031305.2013.817357
Kreuter, F., Olson, K., Wagner, J., Yan, T., Ezzati-Rice, T., Casas-Cordero, C., . . . Raghunathan, T. E. (2010). Using proxy measures of survey outcomes to adjust for survey nonresponse. Journal of the Royal Statistical Society (Series A), 173, 389–407. https://doi.org/10.1111/j.1467-985X.2009.00621.x
Lepkowski, J. M., Mosher, W. D., Davis, K. E., Groves, R. M., & Van Hoewyk, J. (2010). The 2006–2010 National Survey of Family Growth: Sample design and analysis of a continuous survey. National Center for Health Statistics. Vital and Health Statistics, 2, 1–36.
Lessler, J., & Kalsbeek, W. (1992). Nonresponse: Dealing with the problem. In Nonsampling errors in surveys (pp. 161–233). New York, NY: Wiley-Interscience.
Little, R. J., & Vartivarian, S. (2005). Does weighting for nonresponse increase the variance of survey means? Survey Methodology, 31, 161–168.
Manderson, L., & Aaby, P. (1992a). An epidemic in the field? Rapid assessment procedures and health research. Social Science Medicine, 35, 839–850. https://doi.org/10.1016/0277-9536(92)90098-B
Manderson, L., & Aaby, P. (1992b). Can rapid anthropological procedures be applied to tropical diseases? Health Policy and Planning, 7, 46–55. https://doi.org/10.1093/heapol/7.1.46


McCall, G. J. (1984). Systematic field observation. Annual Review of Sociology, 10, 263–282.
McCulloch, S. K., Kreuter, F., & Calvano, S. (2010, May 14). Interviewer observed vs. reported respondent gender: Implications on measurement error. Paper presented at the 2010 Annual Meeting of the American Association for Public Opinion Research, Chicago, IL.
Millen, D. R. (2000). Rapid ethnography: Time deepening strategies for HCI field research. In ACM (Ed.), Proceedings on DIS00: Designing Interactive Systems: Processes, Practices, Methods, and Techniques (pp. 280–286). Brooklyn, NY: ACM.
Most, S. B., Scholl, B. J., Clifford, E. R., & Simons, D. J. (2005). What you see is what you get: Sustained inattentional blindness and the capture of awareness. Psychological Review, 112, 217–242. https://doi.org/10.1037/0033-295X.112.1.217
Patterson, M. L., & Stockbridge, E. (1998). Effects of cognitive demand and judgment strategy on person perception accuracy. Journal of Nonverbal Behavior, 22, 253–263. https://doi.org/10.1023/A:1022996522793
Pickering, K., Thomas, R., & Lynn, P. (2003, July). Testing the shadow sample approach for the English house condition survey. Prepared for the Office of the Deputy Prime Minister by the National Centre for Social Research, London, UK.
Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research, 20, 134–148. https://doi.org/10.2307/3151680
Repp, A. C., Nieminen, G. S., Olinger, E., & Brusca, R. (1988). Direct observation: Factors affecting the accuracy of observers. Exceptional Children, 55, 29–36. https://doi.org/10.1177/001440298805500103
Seidler, J. (1974). On using informants: A technique for collecting quantitative data and controlling measurement error in organization analysis. American Sociological Review, 39, 816–831.
Simons, D. J., & Jensen, M. S. (2009). The effects of individual differences and task difficulty on inattentional blindness. Psychonomic Bulletin and Review, 16, 398–403. https://doi.org/10.3758/PBR.16.2.398
Sinibaldi, J., Durrant, G. B., & Kreuter, F. (2013). Evaluating the measurement error of interviewer observed paradata. Public Opinion Quarterly, 77, 173–193. https://doi.org/10.1093/poq/nfs062
Sinibaldi, J., Trappmann, M., & Kreuter, F. (2014). Which is the better investment for nonresponse adjustment: Purchasing commercial auxiliary data or collecting interviewer observations? Public Opinion Quarterly, 78, 440–473. https://doi.org/10.1093/poq/nfu003
Stähli, M. E. (2010). Examples and experiences from the Swiss interviewer training on observable data (neighborhood characteristics) for ESS 2010 (R5). Paper presented at the NC Meeting, Mannheim, Germany, March 31–April 1, 2011.
Stefanski, L. A., & Carroll, R. J. (1985). Covariate measurement error in logistic regression. The Annals of Statistics, 13, 1335–1351. https://doi.org/10.1214/aos/1176349741
Tipping, S., & Sinibaldi, J. (2010, June 15). Examining the trade off between sampling and targeted non-response error in a targeted non-response follow-up. Paper presented at the 2010 International Total Survey Error Workshop, Stowe, VT.
Tolonen, H., Helakorpi, S., Talala, K., Helasoja, V., Martelin, T., & Prattala, R. (2006). 25-year trends and socio-demographic differences in response rates: Finnish adult health behaviour survey. European Journal of Epidemiology, 21, 409–415. https://doi.org/10.1007/s10654-006-9019-8
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131. https://doi.org/10.1126/science.185.4157.1124



Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244.
West, B. T. (2013a). An examination of the quality and utility of interviewer observations in the National Survey of Family Growth (NSFG). Journal of the Royal Statistical Society (Series A), 176, 211–225. https://doi.org/10.1111/j.1467-985X.2012.01038.x
West, B. T. (2013b). The effects of error in paradata on weighting class adjustments: A simulation study. In F. Kreuter (Ed.), Improving surveys with paradata: Making use of survey process information (pp. 361–388). Hoboken, NJ: Wiley.
West, B. T., & Blom, A. G. (2017). Explaining interviewer effects: A research synthesis. Journal of Survey Statistics and Methodology, 5, 175–211. https://doi.org/10.1093/jssam/smw024
West, B. T., & Kreuter, F. (2013). Factors impacting the accuracy of interviewer observations: Evidence from the National Survey of Family Growth (NSFG). Public Opinion Quarterly, 77, 522–548. https://doi.org/10.1093/poq/nft016
West, B. T., & Kreuter, F. (2015). A practical technique for improving the accuracy of interviewer observations of respondent characteristics. Field Methods, 27, 144–162. https://doi.org/10.1177/1525822X14549429
West, B. T., Kreuter, F., & Trappmann, M. (2014). Is the collection of interviewer observations worthwhile in an economic panel survey? New evidence from the German Labor Market and Social Security (PASS) study. Journal of Survey Statistics and Methodology, 2, 159–181. https://doi.org/10.1093/jssam/smu002
Williams, D., & Brick, J. M. (2017). Trends in U.S. face-to-face household survey nonresponse and level of effort. Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smx019
Zhang, D., & Lin, X. (2008). Variance component testing in generalized linear mixed models for longitudinal/clustered data and other related topics. In D. B. Dunson (Ed.), Random effect and latent variable model selection (pp. 19–36). New York, NY: Springer. https://doi.org/10.1007/978-0-387-76721-5_2

Received February 22, 2017
Revision received May 24, 2017
Accepted September 26, 2017
Published online April 23, 2018

Brady T. West
Survey Methodology Program (SMP)
Survey Research Center (SRC)
Institute for Social Research (ISR)
University of Michigan-Ann Arbor
Ann Arbor, MI 48109
USA
bwest@umich.edu

Brady T. West is a Research Associate Professor in the Survey Methodology Program, located within the Survey Research Center of the Institute for Social Research at the University of Michigan-Ann Arbor, and also in the Joint Program in Survey Methodology at the University of Maryland-College Park.

Frauke Kreuter is a Professor in the Joint Program in Survey Methodology at the University of Maryland-College Park, Professor of Statistics and Methodology at the University of Mannheim, and Head of the Statistical Methods Research Department at the Institute for Employment Research (IAB) in Nürnberg, Germany.


Original Article

Estimating a Three-Level Latent Variable Regression Model With Cross-Classified Multiple Membership Data

Audrey J. Leroux (Department of Educational Policy Studies, Georgia State University, Atlanta, GA, USA) and S. Natasha Beretvas (Department of Educational Psychology, The University of Texas at Austin, Austin, TX, USA)

Methodology (2018), 14(1), 30–44. https://doi.org/10.1027/1614-2241/a000143

Abstract: The current study proposes a new model, termed the cross-classified multiple membership latent variable regression (CCMM-LVR) model, which extends the three-level latent variable regression (HM3-LVR) model to cross-classified multiple membership data, for example, data with student mobility across schools. The HM3-LVR model is beneficial for testing more flexible hypotheses about growth trajectory parameters and handles pure clustering of participants within higher-level (level-3) units. However, it involves the assumption that students remain in the same cluster (school) throughout the time period of interest. The CCMM-LVR model appropriately models participants’ changing clusters over time. The impact of ignoring mobility in the real data was investigated by comparing parameter estimates, standard error estimates, and model fit indices from the model that appropriately handles the cross-classified multiple membership structure (CCMM-LVR) with results obtained when this structure was ignored (HM3-LVR).

Keywords: multilevel modeling, multiple membership, cross-classified, growth, mobility, MCMC estimation

Individual change has been studied for many years within the context of multilevel modeling, particularly in the educational context where, for example, studies have assessed students’ rate of growth in reading comprehension (Bryk & Raudenbush, 1987; Seltzer, Frank, & Bryk, 1994), student trajectories in math achievement (Bryk & Raudenbush, 1987), as well as teacher-reported student aggressiveness over time within an intervention program (Muthén & Curran, 1997). These are just examples within the education context, but many other fields of applied social and behavioral science research also employ growth curve models (GCMs) to test hypotheses about growth over time (Francis, Fletcher, Stuebing, Davidson, & Thompson, 1991; Horney, Osgood, & Marshall, 1995; Huttenlocher, Haight, Bryk, Seltzer, & Lyons, 1991; Raudenbush & Chan, 1993).

Three-Level Latent Variable Regression Modeling

Growth curve modeling can be used to model growth trajectory parameters and their covariances, whereas the

use of latent variable regression (LVR) modeling in the GCM context extends this notion by allowing modeling of, for example, the prediction of an individual’s growth rate parameter by the individual’s initial status parameter. The rationale behind this type of growth analysis is to study the expected differences in growth rates holding constant initial status. In educational research using longitudinal data in particular, it can be of interest to take into account the level of, or variation in, student achievement at initial status (i.e., the start of the study period). In addition, modeling the expected change in growth rates, given a one-unit change in initial status, provides an additional set of research questions of interest in longitudinal studies. For more details and illustrative examples, refer to Choi and Seltzer (2010) and Seltzer, Choi, and Thum (2003). The three-level LVR (HM3-LVR) model handles the dependence of individuals clustered within organizations (such as schools, classrooms, etc.). The LVR coefficient that designates the effect of initial status on growth within the organizations can be modeled as varying across organizations. Assessment of this variation permits evaluation of organizational differences in the LVR coefficient


as well as assessment of factors that might influence the effect of initial status on growth. Following the same notation as Choi and Seltzer (2010), the formulation for level 1 is

Y_{tij} = \pi_{0ij} + \pi_{1ij}\,\mathrm{TIME}_{tij} + e_{tij},   (1)

where for individual i within organization j, Y_{tij} is the observed score at time t, \pi_{0ij} is the intercept parameter, \pi_{1ij} is the slope parameter, and TIME_{tij} is the value of the time variable at occasion t. The errors, e_{tij}, are typically assumed to be independent and normally distributed with a mean of zero and constant variance σ². The formulation for level 2 is

\pi_{0ij} = \beta_{00j} + r_{0ij}
\pi_{1ij} = \beta_{10j} + B_{wj}(\pi_{0ij} - \beta_{00j}) + r_{1ij},   (2)

where β00j is the mean initial status (i.e., when TIMEtij equals zero) across individuals within organization j, β10j is the mean growth rate for organization j for an individual at the mean on initial status, and Bwj is the LVR coefficient that represents the change in the growth rate for one unit increase in initial status within organization j. This LVR coefficient is termed the within-organization initial status on growth effect. The random effects are assumed normally distributed with means of zero and variances τπ00 and τπ11 for r0ij and r1ij, respectively, and Cov(r0ij, r1ij) = 0. τπ00 is the variance of initial status within the organizational units and τπ11 is the variance in growth rates remaining after taking into account differences in initial status within the organizations. The level-3 baseline unconditional LVR model is

\beta_{00j} = \gamma_{000} + u_{00j}
\beta_{10j} = \gamma_{100} + B_b(\beta_{00j} - \gamma_{000}) + u_{10j}
B_{wj} = B_{w0} + B_{w1}(\beta_{00j} - \gamma_{000}) + u_{B_w j},   (3)

with γ000 representing the overall mean initial status across individuals and organizations, γ100 is the overall mean growth rate across organizations for organization j at the grand mean on initial status, Bb is an LVR coefficient that represents the change in growth rate for one unit increase in mean initial status across organizations, Bw 0 is another LVR coefficient that is the effect of initial status on growth for organization j at the grand mean on initial status, and Bw 1 is the change in the effect of initial status on growth for a one unit increase in mean initial status for organization j. The three random effects are assumed multivariate normally distributed with means of zero and a 3 by 3 covariance matrix Tu, which is

T_u = \begin{bmatrix} \tau_{\beta00} & 0 & 0 \\ 0 & \tau_{\beta10} & \tau_{\beta10,B_w} \\ 0 & \tau_{\beta10,B_w} & \tau_{B_w} \end{bmatrix},   (4)

where τβ00 is the variance in initial status among the organizational units, τβ10 is the variance in growth rates remaining between organizational units after taking into account organization mean initial status, and τBw is the variance in within-organization initial status on growth effects remaining among the organizational units after taking into account organization mean initial status. Note also that the Cov(u00j, u10j) = 0 and Cov(u00j ; uBwj ) = 0 because β00j is used as a predictor of β10j (explaining the covariance term, τβ00,β10) and of Bwj (explaining the covariance term, τβ00,Bw). The HM3-LVR is essentially a re-parameterization of the three-level GCM, because constraining the LVR coefficients (i.e., Bb, Bw 0, and Bw 1) to zero and the withinorganization LVR residual variance (i.e., τBw) to zero results in the three-level GCM. However, a more general model is achieved by freely estimating these parameters which sheds additional light on individual growth and allows research questions to be assessed that cannot be answered using the GCM. In particular, the LVR version of the GCM does not model a covariance between the initial measurement and growth over time (at both the individual and organization level) but instead reformulates that association to allow testing of hypotheses that the intercept predicts the slope. In addition, the LVR model can be used to test whether the relationship between the intercept and slope varies across organizations which cannot be tested using the GCM. As is described and demonstrated later, covariates describing the organizations can also be included in the LVR model to assess how they explain variability across organizations in the relationship between the intercept and slope.
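To make the data-generating process of Equations 1–3 concrete, the following short simulation draws the level-3, level-2, and level-1 components in turn. All numeric values are illustrative assumptions chosen for this sketch, not estimates reported in the article:

```python
import random

random.seed(42)

# Illustrative population values (assumptions, not estimates).
GAMMA_000, GAMMA_100 = 50.0, 2.0             # grand mean initial status and growth
B_B, B_W0, B_W1 = 0.10, 0.05, 0.02           # LVR coefficients
TAU_B00, TAU_B10, TAU_BW = 4.0, 0.25, 0.01   # level-3 variances
TAU_P00, TAU_P11 = 9.0, 0.50                 # level-2 variances
SIGMA2 = 1.0                                 # level-1 variance

def simulate(n_orgs=100, n_per_org=30, n_times=4):
    """Generate (org, person, time, y) records from the HM3-LVR equations."""
    rows = []
    for j in range(n_orgs):
        # Level 3 (Equation 3): organization-level parameters.
        beta00 = GAMMA_000 + random.gauss(0, TAU_B00 ** 0.5)
        beta10 = GAMMA_100 + B_B * (beta00 - GAMMA_000) + random.gauss(0, TAU_B10 ** 0.5)
        bw = B_W0 + B_W1 * (beta00 - GAMMA_000) + random.gauss(0, TAU_BW ** 0.5)
        for i in range(n_per_org):
            # Level 2 (Equation 2): individual growth parameters, with the
            # slope regressed on the individual's initial status deviation.
            pi0 = beta00 + random.gauss(0, TAU_P00 ** 0.5)
            pi1 = beta10 + bw * (pi0 - beta00) + random.gauss(0, TAU_P11 ** 0.5)
            for t in range(n_times):
                # Level 1 (Equation 1): observed score at occasion t.
                y = pi0 + pi1 * t + random.gauss(0, SIGMA2 ** 0.5)
                rows.append((j, i, t, y))
    return rows

data = simulate()
print(len(data), "records; first record:", data[0])
```

In practice, such a model would be fitted to data of this shape with, for example, MCMC estimation, as the article's keywords indicate.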

Growth Curve Modeling With Mobile Individuals

The three-level growth modeling technique previously discussed applies to a purely hierarchical data structure, where measurement occasions are assumed nested within individuals who are themselves nested within a single organization for the entire duration of the study. In reality, this purely clustered data structure may not always hold, especially in educational studies where students can move to different schools or classrooms over time. According to Ihrke and Faber’s (2012) geographical mobility report, 38.5% of people aged 5–17 years moved within those years. More specifically, 25% of people between the ages 5 and


17 years relocated within the same county. From 2013 to 2014, 11% of people between the ages 5 and 17 years moved, with 69% of those moves occurring within the same county (US Census Bureau, 2015). A report by the US Government Accounting Office (2010) found that 13% of students changed schools four or more times between kindergarten and 8th grade, and 11.5% of schools had high rates of mobility. Many longitudinal examples exist outside of education research that also engender participant mobility, such as residential mobility when individuals change areas of residence or neighborhoods over time (see Leyland & Næss, 2009), when handling longitudinal patient data where patients are changing doctors, nurses, or hospitals over time, and in organizational research when individuals change departments or working groups over time. A GCM termed the cross-classified multiple membership growth curve model (CCMM-GCM) was introduced by Grady and Beretvas (2010) that was designed to handle mobility across clustering units. The model is intended for researchers interested in research questions for which the intercept is interpreted as the initial status and thus the Time variable in the model is coded with a zero at the initial measurement occasion (see Equation 1). Nontrivial modifications of Grady and Beretvas’s model can be used for scenarios in which the researcher is interested in testing research questions using the intercept parameter to represent the predicted outcome at a time other than the initial measurement occasion. In this study, however, we are focusing on research questions in which the relationship between the outcome at the initial measurement occasion and growth in the outcome over time is of interest. 
The CCMM-GCM is a combination of cross-classified and multiple membership random effects models because individuals are cross-classified by their first organization and the subsequent organization or set of organizations attended (which results in the possible multiple membership portion). The cross-classified component is required because, at the initial status, the individual has only been affiliated with the first organization; therefore, not all organizations should be modeled as contributing to an individual’s outcome at the first measurement occasion. Using schools as a particular example of an organization, under the CCMM-GCM, a student’s intercept can be modeled as varying across and influenced by the first school attended, and a student’s growth rate can be modeled as varying across the set of schools attended across the duration of the study. Therefore, in the model, the school’s effect on the slope incorporates residuals for all of the schools attended by a student. For more detailed examples and illustrations on growth curve modeling with mobile individuals, see Grady and Beretvas (2010) and Luo and Kwok (2012).
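In a multiple membership specification such as the CCMM-GCM, each subsequent school a student attends contributes to the growth-rate residual in proportion to a membership weight, and a student's weights sum to one. The helper below is a minimal sketch of one common convention (weights proportional to the number of measurement occasions spent in each subsequent school). The function name, the choice to treat occasions after the first as the subsequent set, and the collapsing of the published time-specific weights into overall proportions are all assumptions made for this illustration:

```python
from collections import Counter

def mm_weights(schools_by_occasion):
    """Return (j1, weights) for one student.

    schools_by_occasion lists the school attended at each measurement
    occasion, e.g. ["A", "A", "B", "C"]. j1 is the first school attended;
    weights cover the subsequent set {j2} (here taken to be the schools at
    occasions after the first), proportional to occasions attended there,
    and summing to 1.
    """
    j1 = schools_by_occasion[0]
    subsequent = schools_by_occasion[1:]
    if not subsequent:  # a student observed only once has no subsequent set
        return j1, {}
    counts = Counter(subsequent)
    total = len(subsequent)
    weights = {school: n / total for school, n in counts.items()}
    return j1, weights

j1, w = mm_weights(["A", "A", "B", "C"])
print(j1, w)  # first school "A"; the three weights sum to 1
```

Other weighting schemes (e.g., equal weights over the subsequent set) are also used in the multiple membership literature; the proportional-time convention is shown only because it makes the sum-to-one constraint explicit.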

Consequences of Ignoring Mobility

Results from Grady (2010) comparing the CCMM-GCM to the HM3-GCM that recognized only the first school attended indicated that ignoring the multiple membership data structure led to inaccurate parameter estimates for the between-schools variance in growth rates. The conclusions associated with results from the HM3-GCM would mislead researchers because the between-schools variance in growth rates will be reallocated to the between-first-schools variance in growth rates. This means that the individual’s growth rate would be modeled as only having been affected by the first school attended. Other research on cross-classified and/or multiple membership data structures has also demonstrated that incorrect model specification in the presence of participant mobility can negatively impact parameter estimates. Previous simulation studies have shown that model misspecification can lead to inaccurate estimates of between-organizations variance components and standard errors of the fixed effects (Chung & Beretvas, 2012; Grady, 2010; Luo & Kwok, 2009, 2012; Meyers & Beretvas, 2006).

Latent Variable Regression Modeling With Mobile Individuals

Mobile individuals are encountered frequently in longitudinal studies, especially in educational research as well as in organizational research and the medical, social, and behavioral sciences. The CCMM-GCM provides a GCM that handles mobility, although it only allows the growth parameters (e.g., the intercept and slope) to covary, and that relationship (the covariance) cannot be modeled as varying across clusters. The current study extends the benefits of the CCMM-GCM to offer a more flexible parameterization and introduces the cross-classified multiple membership latent variable regression (CCMM-LVR) model, which appropriately handles mobility while also allowing modeling of differences in growth rates as a function of initial status, allowing the intercept-slope relationship to vary across clusters. Using the same formulation as in Grady and Beretvas (2010), as well as the example of schools as the relevant organizational cluster, the level-1 equation for the newly proposed CCMM-LVR model is

$$Y_{ti(j_1,\{j_2\})} = \pi_{0i(j_1,\{j_2\})} + \pi_{1i(j_1,\{j_2\})}\,\mathrm{TIME}_{ti(j_1,\{j_2\})} + e_{ti(j_1,\{j_2\})}, \qquad (5)$$

where j1 represents the first school attended, {j2} represents the subsequent set of schools attended, and the parentheses signify cross-classification between the first and subsequent set of schools. The level-2 formulation of the baseline unconditional CCMM-LVR model is


$$\begin{cases}
\pi_{0i(j_1,\{j_2\})} = \beta_{00(j_1,\{j_2\})} + r_{0i(j_1,\{j_2\})}\\
\pi_{1i(j_1,\{j_2\})} = \beta_{10(j_1,\{j_2\})} + B_{w(j_1,\{j_2\})}\left(\pi_{0i(j_1,\{j_2\})} - \beta_{00(j_1,\{j_2\})}\right) + r_{1i(j_1,\{j_2\})}
\end{cases} \qquad (6)$$

and at level 3 the model is

$$\begin{cases}
\beta_{00(j_1,\{j_2\})} = \gamma_{0000} + u_{00j_1}\\
\beta_{10(j_1,\{j_2\})} = \gamma_{1000} + B_b\left(\beta_{00(j_1,\{j_2\})} - \gamma_{0000}\right) + u_{10j_1} + \sum_{h\in\{j_2\}} w_{tih}\,u_{100h}\\
B_{w(j_1,\{j_2\})} = B_{w0} + B_{w1}\left(\beta_{00(j_1,\{j_2\})} - \gamma_{0000}\right) + u_{Bwj_1}
\end{cases} \qquad (7)$$

where γ0000 now represents the overall mean initial status across individuals and first schools, γ1000 is the mean growth rate across first and subsequent schools for first school j1 at the grand mean on initial status, Bb is now the LVR coefficient that captures the change in growth rate for a one-unit increase in first school j1 mean initial status across first schools, Bw_0 is the effect of initial status on growth for first school j1 at the grand mean on initial status, and Bw_1 is the change in the effect of initial status on growth for a one-unit increase in mean initial status for first school j1. The weight wtih is assigned to each individual i who attended school h at each time point t, and the sum of the weights for each individual must equal one to capture the proportional contribution of each of the set of subsequent (to the first) schools attended by each individual i.

The level-1 errors e_{ti(j1,{j2})} are assumed normally distributed with a mean of zero and variance σ². The level-2 random effects are assumed normally distributed with means of zero and variances τr00 and τr11 for r_{0i(j1,{j2})} and r_{1i(j1,{j2})}, respectively, and Cov(r_{0i(j1,{j2})}, r_{1i(j1,{j2})}) = 0. The four level-3 random effects are assumed multivariate normally distributed with means of zero and a 4 × 4 covariance matrix Tu, which is defined as

$$T_u = \begin{bmatrix}
\tau_{u_{j_1}00} & & & \\
0 & \tau_{u_{j_1}11} & & \\
0 & \tau_{u_{j_1}11,Bw} & \tau_{u_{j_1}Bw} & \\
0 & 0 & 0 & \tau_{u_{\{j_2\}}11}
\end{bmatrix}, \qquad (8)$$

where τuj1_00 is the variance of initial status among the first schools, τuj1_11 is the variance in growth rates remaining among the first schools after taking into account the first school's mean initial status, τu{j2}_11 is the variance in growth rates remaining among the set of subsequent schools attended after taking into account the first school's mean initial status, and τuj1_Bw is the variance in within-first-school initial-status-on-growth effects remaining between the first school units after taking into account the first school's mean initial status. Note that there are then two variance components estimated for the slope parameter's level-3 random effects, capturing the variability in slope residuals as a function of the first school attended separately from the variability having to do with the set of subsequent schools attended. Under the GCM, only a single variance component capturing organization (here, school) variability is typically assumed. Therefore, this CCMM model provides some added flexibility (in addition to the benefits of the LVR formulation).

The weights that are used in Equation 7 refer to the random effects for the set of subsequent schools and do not include the first school, which has its own variance component. Differences in the scale of the two variances (for first and for subsequent schools) allow a form of weighting for the first school's random effects. The multiple membership model assumed for mobility across subsequent schools entails the typical assumption of a single variance component common across those schools. Instead of using a multiple membership model for the subsequent schools, an even more complicated model that includes a cross-classification factor for the school attended at each time point could be used to allow unique variances for the schools' random effects at each time point. However, if there is mobility within time points (e.g., within each academic year), then this model would need a cross-classified factor for each move. In addition, this would require estimation of additional random effects variance parameters, which further complicates model estimation.
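To make the covariance structure in Equation 8 concrete, the sketch below assembles the level-3 draws in Python using the baseline unconditional CCMM-LVR variance estimates later reported in Table 3. The covariance τuj1_11,Bw is treated as zero purely for illustration, since no estimate for it is reported in the tables; with that placeholder the four effects are independent, so each can be drawn from a univariate normal.

```python
import math
import random

# Variance estimates from Table 3 (baseline unconditional CCMM-LVR model).
TAU = {
    "u_00j1": 383.101,   # intercept variance among first schools
    "u_10j1": 16.439,    # slope variance among first schools
    "u_Bwj1": 0.004,     # Bw variance among first schools
    "u_100h": 21.761,    # slope variance among subsequent schools
}

def draw_level3_effects(n_units, rng):
    """Draw the four level-3 random effects of Equation 8 for n_units schools.

    The covariance tau_{u_j1 11,Bw} is set to zero here (a placeholder, since
    no estimate is reported), so the effects are independent and each is drawn
    from a univariate normal with sd = sqrt(tau)."""
    return [
        {name: rng.gauss(0.0, math.sqrt(tau)) for name, tau in TAU.items()}
        for _ in range(n_units)
    ]

rng = random.Random(0)
effects = draw_level3_effects(74, rng)   # 74 schools, as in the STAR sample
print(len(effects), sorted(effects[0]))
```

This is only a data-generating sketch of the random effects distribution, not the authors' estimation code.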
Given the complexity of the model in Equations 5–7 and the level-3 random effects covariance matrix (see Equation 8), a real dataset is used to demonstrate interpretation of the CCMM-LVR model's parameters as well as the impact of recognizing versus ignoring participants' mobility.

Method
This study used a large-scale, real longitudinal dataset involving student mobility to investigate the differences in parameter and standard error estimates, as well as model fit, between two models: the HM3-LVR and the CCMM-LVR. The HM3-LVR model ignores student mobility by modeling only the first school students attended, while the CCMM-LVR model handles the multiple membership data structure.

Data
The data used for the analysis are from the Student/Teacher Achievement Ratio (STAR) project conducted from 1985 to 1989 (Achilles et al., 2008), a longitudinal study conducted in the state of Tennessee. The dataset is structured such that measurement occasions are nested within students, who are in turn nested within schools, and students switched schools throughout the duration of data collection. For students who entered the study in the fall of 1985 in kindergarten, the dataset consists of a total of 6,325 students and 79 schools. There are four measurement occasions, with students tested at the end of each year from spring of kindergarten through spring of 3rd grade. Students without school identifiers at each measurement occasion were removed, which left 3,083 students from 76 schools in the dataset. Students attending two schools that did not participate for the duration of the study were also removed, resulting in 3,011 students and 74 schools. The STAR project only collected data from kindergarteners through 3rd graders within these 74 schools, even if a student moved away from these schools, which resulted in an average per-school sample size of 41 (range: 10 to 85 students).
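As a quick arithmetic check on the sample figures above (simple division, nothing more):

```python
# Final analytic counts reported for the STAR data.
students, schools = 3011, 74
avg = students / schools
print(round(avg))   # matches the reported average per-school sample size of 41
```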

Measures
The outcome was student achievement in math, with scores based on a norm-referenced measure, the Stanford Achievement Test (Psychological Corporation, 1983), which was scaled using an item response theory model. Students who had scores for at least one of the measurement occasions were included in the analysis, which led to the removal of only one student and left a total of 3,010 students.

Level-2 and Level-3 Predictors
The level-2 (student-level) predictor included in the models is the number of years the student was in a small classroom (YRS_SMALL_ij and YRS_SMALL_{i(j1,{j2})}), which ranged from 0 to 4 years. Given that some students moved from small to large classrooms and vice versa, this predictor represented a form of "dosage variable" for the intervention. The level-3 (school-level) predictor incorporated was school urbanicity, a dichotomous variable (INNER_CITY_j, INNER_CITY_{j1}, and INNER_CITY_{{j2}}) with a value of one for inner city schools and zero for non-inner city schools. None of the 3,010 students in the sample were missing predictor values at level 2 or level 3.

Student Mobility
There were 125 (4.2%) students considered mobile from the sample of 3,010 students. Of those mobile students, 40 (32.0%) changed schools only between the first and second measurement occasions, 77 (61.6%) changed schools solely between the second and third measurement occasions, and 7 (5.6%) switched schools twice, between the first and second time points as well as between the second and third time points. One student changed schools between every measurement occasion.
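The mobility breakdown above can be verified directly; the counts come from the paragraph, and the computed percentages match those reported:

```python
# Mobility counts reported for the STAR sample (3,010 students).
moves = {"between occasions 1 and 2": 40,
         "between occasions 2 and 3": 77,
         "twice (1-2 and 2-3)": 7,
         "between every occasion": 1}
total_mobile = sum(moves.values())
print(total_mobile)                                   # 125 mobile students
print(round(100 * total_mobile / 3010, 1))            # 4.2 (% of the sample)
print(round(100 * moves["between occasions 2 and 3"] / total_mobile, 1))  # 61.6
```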

Analyses
The baseline unconditional HM3-LVR model fit to the data was exactly the same as in Equations 1, 2, and 3 for levels 1, 2, and 3, respectively. The TIMEtij variable was assigned values of 0, 1, 2, and 3 for the kindergarten, first, second, and third grade measurement occasions, respectively, so that the intercept takes on the meaning of initial status. In addition, this model ignored any school changes made by students and used their first school attended as the school identifier for all four measurement occasions. The first school was chosen because, in research studies, especially those using randomized controlled trials or cluster randomized trials, school identifier information will be known at the initial measurement occasion, although identifiers for schools will typically be missing for mobile students whose outcome scores might be missing at later time points. The conditional HM3-LVR model was fit to the data using Equation 1 for level 1; for level 2 the equation is

$$\begin{cases}
\pi_{0ij} = \beta_{00j} + \beta_{01j}\left(\mathrm{YRS\_SMALL}_{ij} - \overline{\mathrm{YRS\_SMALL}}_{..}\right) + r_{0ij}\\
\pi_{1ij} = \beta_{10j} + B_{wj}\left(\pi_{0ij} - \beta_{00j}\right) + \beta_{11j}\left(\mathrm{YRS\_SMALL}_{ij} - \overline{\mathrm{YRS\_SMALL}}_{..}\right) + r_{1ij}
\end{cases} \qquad (9)$$

with the student-level predictor grand mean centered, and at level 3 it is

$$\begin{cases}
\beta_{00j} = \gamma_{000} + \gamma_{001}\,\mathrm{INNER\_CITY}_j + u_{00j}\\
\beta_{01j} = \gamma_{010}\\
\beta_{10j} = \gamma_{100} + B_b\left(\beta_{00j} - \gamma_{000}\right) + \gamma_{101}\,\mathrm{INNER\_CITY}_j + u_{10j}\\
\beta_{11j} = \gamma_{110}\\
B_{wj} = B_{w0} + B_{w1}\left(\beta_{00j} - \gamma_{000}\right) + B_{w2}\,\mathrm{INNER\_CITY}_j + u_{Bwj}
\end{cases} \qquad (10)$$

Covariates can be included at level 1 similarly to a conventional GCM. In addition, if substantial variability across clusters were found in the slopes for the level-2 predictor, then a level-3 predictor could be included in the model for those slopes (β01j and β11j) to evaluate a cross-level interaction. The baseline unconditional CCMM-LVR model that handles mobility was fit to the data using Equations 5–7.
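To illustrate the latent variable regression idea in Equations 9 and 10 (the latent intercept predicting the slope), the following sketch simulates level-2 growth parameters for one school. All numeric values here are illustrative assumptions, not estimates from the paper.

```python
import math
import random

def simulate_students(n, beta00, beta10, bw, tau_r00, tau_r11, rng):
    """Simulate (pi0, pi1) pairs for one school under the LVR level-2 model:
    pi0 = beta00 + r0 and pi1 = beta10 + bw * (pi0 - beta00) + r1."""
    students = []
    for _ in range(n):
        r0 = rng.gauss(0.0, math.sqrt(tau_r00))
        r1 = rng.gauss(0.0, math.sqrt(tau_r11))
        pi0 = beta00 + r0
        pi1 = beta10 + bw * (pi0 - beta00) + r1
        students.append((pi0, pi1))
    return students

rng = random.Random(42)
# Illustrative values only: school mean initial status 500, mean growth 43,
# and a negative within-school initial-status-on-growth effect.
students = simulate_students(2000, beta00=500.0, beta10=43.0, bw=-0.02,
                             tau_r00=880.0, tau_r11=15.0, rng=rng)
# With bw < 0, students who start higher should tend to grow more slowly.
low = [p1 for p0, p1 in students if p0 < 500.0]
high = [p1 for p0, p1 in students if p0 >= 500.0]
print(sum(low) / len(low) > sum(high) / len(high))
```

The comparison at the end shows the defining LVR behavior: a negative Bw makes the average slope of low-starting students exceed that of high-starting students.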


The conditional CCMM-LVR model that was used to estimate the parameters and standard errors is the same as Equation 5 for level 1; for level 2 it is

$$\begin{cases}
\pi_{0i(j_1,\{j_2\})} = \beta_{00(j_1,\{j_2\})} + \beta_{01(j_1,\{j_2\})}\left(\mathrm{YRS\_SMALL}_{i(j_1,\{j_2\})} - \overline{\mathrm{YRS\_SMALL}}_{..}\right) + r_{0i(j_1,\{j_2\})}\\
\pi_{1i(j_1,\{j_2\})} = \beta_{10(j_1,\{j_2\})} + B_{w(j_1,\{j_2\})}\left(\pi_{0i(j_1,\{j_2\})} - \beta_{00(j_1,\{j_2\})}\right) + \beta_{11(j_1,\{j_2\})}\left(\mathrm{YRS\_SMALL}_{i(j_1,\{j_2\})} - \overline{\mathrm{YRS\_SMALL}}_{..}\right) + r_{1i(j_1,\{j_2\})}
\end{cases} \qquad (11)$$

and at level 3 the model is

$$\begin{cases}
\beta_{00(j_1,\{j_2\})} = \gamma_{0000} + \gamma_{0010}\,\mathrm{INNER\_CITY}_{j_1} + u_{00j_1}\\
\beta_{01(j_1,\{j_2\})} = \gamma_{0100}\\
\beta_{10(j_1,\{j_2\})} = \gamma_{1000} + B_b\left(\beta_{00(j_1,\{j_2\})} - \gamma_{0000}\right) + \gamma_{1010}\,\mathrm{INNER\_CITY}_{j_1} + u_{10j_1} + \sum_{h\in\{j_2\}} w_{tih}\left(\gamma_{1001}\,\mathrm{INNER\_CITY}_h + u_{100h}\right)\\
\beta_{11(j_1,\{j_2\})} = \gamma_{1100}\\
B_{w(j_1,\{j_2\})} = B_{w0} + B_{w1}\left(\beta_{00(j_1,\{j_2\})} - \gamma_{0000}\right) + B_{w2}\,\mathrm{INNER\_CITY}_{j_1} + u_{Bwj_1}
\end{cases} \qquad (12)$$

Once again, level-1 covariates could be included in the model in the same manner as for a typical GCM. Similarly to the HM3-LVR model, if significant variation existed across initial clusters in the slopes for the level-2 predictor, then the level-3 predictor associated with the first cluster could be included in the model for those slopes (β01(j1,{j2}) and β11(j1,{j2})) to evaluate a cross-level interaction. In addition, if the impact of the level-2 predictor on the growth rate significantly varied across subsequent clusters (β11(j1,{j2})), then the weighted level-3 predictor associated with the subsequent clusters could be incorporated into that equation.

The weights that were used for both the baseline unconditional and conditional CCMM-LVR models were based on how long a student was a member of a school at the second through fourth time points. If a student did not change schools, or their subsequent school remained the same from the second through fourth measurement occasions, then the single weight assigned to the one subsequent school's residual was a value of 1. If a student changed schools between the second and third measurement occasions and remained at the school for the fourth measurement occasion, then a weight of 1/3 was assigned to the first subsequent school and a weight of 2/3 to the second subsequent school (attended at the third and fourth occasions). If a student changed schools between each time point, a weight of 1/3 was associated with each of the three subsequent schools attended. As emphasized earlier, note that the seeming weight of one for the initial school is not on the same scale as the set of weights used with the subsequent schools. In addition, the resulting variance estimates for the initial versus subsequent schools' random effects also contribute to the operational weight for the effect associated with each school.

All models were fit using R software (version 3.2.1; R Core Team, 2015) with the package R2jags (version 0.5-6; Su & Yajima, 2015), which is the R interface to the Just Another Gibbs Sampler (JAGS) MCMC software (version 3.4.0; Plummer, 2013). The JAGS code for the unconditional and conditional CCMM-LVR models is provided in Appendices A and B, respectively. The prior set for all fixed effects parameters was a normal distribution with a mean of zero and variance of 100,000. The priors were set to the inverse-Pareto(1, 0.0001) distribution for the scalar variance components and the inverse-Wishart distribution for the variance-covariance matrix associated with the β10j and Bwj level-3 equations, as recommended based on the simulation in Choi and Seltzer (2010). To determine the burn-in period and number of iterations needed for convergence, trace and autocorrelation function plots were examined. The examination supported use of a single chain with a burn-in period of 10,000 iterations and an additional 50,000 iterations, for a total of 60,000 iterations. The deviance information criterion (DIC) was used to compare the models' fit, with smaller values indicating better fit (Spiegelhalter, Best, Carlin, & van der Linde, 2002). The DIC fit index is defined as

$$\mathrm{DIC} = \bar{D} + p_D, \qquad (13)$$

where $\bar{D}$ is the posterior mean deviance and $p_D$ is the effective number of parameters in the model.
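The weighting rule described above can be sketched as a small helper that maps a student's school at the second through fourth occasions to multiple membership weights. This is a sketch of the rule as stated in the text, not the authors' code; the school labels are hypothetical.

```python
from collections import Counter

def membership_weights(schools_at_occ_2_to_4):
    """Equal weight per occasion: each of the three post-baseline occasions
    contributes 1/3 to the school attended at that occasion, so the weights
    over the set of subsequent schools sum to one."""
    counts = Counter(schools_at_occ_2_to_4)
    n = len(schools_at_occ_2_to_4)  # three occasions in the STAR design
    return {school: count / n for school, count in counts.items()}

print(membership_weights(["A", "A", "A"]))   # no move: single school, weight 1.0
print(membership_weights(["A", "B", "B"]))   # move between occasions 2 and 3: 1/3 and 2/3
print(membership_weights(["A", "B", "C"]))   # move at every occasion: 1/3 each
```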

Results
Descriptive statistics are provided in Table 1 for the math achievement scores at each of the four measurement occasions and for the level-2 (student-level) and level-3 (school-level) predictors, number of years in a small classroom and school type, respectively, from the real data sample.

Table 1. Descriptive statistics for STAR data

                                 Variable name   M        SD      N
  Outcome
    Math achievement at Time 1   Y1ij            497.63   44.59   2,846
    Math achievement at Time 2   Y2ij            542.85   40.16   2,982
    Math achievement at Time 3   Y3ij            591.28   44.13   2,903
    Math achievement at Time 4   Y4ij            626.95   39.91   2,862
  Level-2 variable
    Years in small classes       YRS_SMALLij       1.43    1.78   3,010

                                 Variable name   Percentage       N
  Level-3 variable
    Inner city school            INNER_CITYj     20.27%           15
    Non-inner city school                        79.73%           59

Table 2. Fixed effects parameter and SE estimates for baseline unconditional CCMM-LVR and HM3-LVR models

                                     CCMM-LVR                       HM3-LVR
  Parameter                  Coeff.   Est.       (SE)       Coeff.  Est.       (SE)
  Model for intercept
    Grand mean               γ0000    497.825    (2.537)    γ000    497.839    (2.354)
  Model for slope
    Grand mean               γ1000     43.496    (0.998)    γ100     43.455    (0.937)
    School mean init. status Bb        -0.278    (0.042)    Bb       -0.248    (0.040)
  Model for Bw
    Grand mean               Bw_0      -0.007    (0.013)    Bw_0     -0.008    (0.012)
    School mean init. status Bw_1      -0.0003   (0.001)    Bw_1     -0.0003   (0.001)
  DIC                                 118,019.7                     119,157.0

Notes. CCMM-LVR = cross-classified multiple membership latent variable regression; HM3-LVR = three-level latent variable regression; Coeff. = coefficient; Est. = parameter estimate; DIC = deviance information criterion.

Baseline Unconditional Fixed Effects
From Table 2, the grand mean of the initial status (intercept) is 497.825 for the CCMM-LVR model, and the grand mean of the growth (slope) is 43.496. The Bb coefficient is negative, which indicates that the growth rate for a school with a higher mean initial status will be lower than the growth rate for a school with a lower mean initial status. To demonstrate visually, consider three hypothetical schools, where the initial status of School 1 is two SDs (39.15 points, calculated from Table 3) below the grand mean initial status, School 2 is at the grand mean initial status, and School 3 is two SDs above the grand mean initial status. Expected school growth rates are calculated using the grand mean growth rate, γ1000 (43.496), and the between-schools effect of initial status on growth, Bb (-0.278). Therefore, the expected growth rate for students in School 1 would be 54.37 points per grade [43.496 + (-0.278 × -39.15)], for School 2 it would be 43.50 points per grade, and for School 3 it would be 32.62 points per grade. Figure 1 displays the expected growth rates for the three schools, depicting the slightly negative relationship between school mean initial status and school mean growth rate.

To help visualize the expected growth rates within schools, consider three hypothetical students from each of the previous three hypothetical schools who are, respectively, two SDs (59.39 points, calculated from Table 3) below their school's mean initial status, at their school's mean initial status, and two SDs above their school's mean initial status. The expected growth trajectories within a school are based on the growth (43.496), Bb (-0.278), Bw_0 (-0.007), and Bw_1 (-0.00025) parameter estimate values. Figure 2 displays the expected growth trajectories for the three students within each of the three schools. As can be seen in the figure, the expected growth rates increase very slightly as the students' initial statuses increase within School 1. For School 2 and School 3, the students' expected growth rates decrease as the values for initial status increase. For Student I within School 3, for example, the expected growth rate is calculated by adding the school's expected growth rate (32.6) to the value from the model for Bw [(-0.007 × -59.39) + (-0.00025 × 39.15 × -59.39) = 1.0] to obtain 32.6 + 1.0 = 33.6.
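The worked example above can be reproduced in a few lines from the Table 2 and Table 3 estimates; small second-decimal differences from the values quoted in the text are rounding artifacts.

```python
import math

# Estimates from Tables 2 and 3 (baseline unconditional CCMM-LVR model).
gamma_1000 = 43.496    # grand mean growth rate
b_b = -0.278           # between-schools effect of initial status on growth
b_w0 = -0.007          # within-school initial-status-on-growth effect at grand mean
b_w1 = -0.00025        # change in that effect per point of school mean initial status

two_sd_school = 2 * math.sqrt(383.101)    # about 39.15 points (intercept var., 1st schools)
two_sd_student = 2 * math.sqrt(881.876)   # about 59.39 points (intercept var., students)

def school_growth(dev_school):
    """Expected growth rate for a school whose mean initial status deviates
    dev_school points from the grand mean initial status."""
    return gamma_1000 + b_b * dev_school

def student_growth(dev_school, dev_student):
    """Expected growth rate for a student deviating dev_student points from the
    school mean, in a school deviating dev_school points from the grand mean."""
    bw = b_w0 + b_w1 * dev_school
    return school_growth(dev_school) + bw * dev_student

# Schools 1-3 (two SDs below, at, and two SDs above the grand mean), and
# Student I in School 3 (two SDs below that school's mean initial status).
print(round(school_growth(-two_sd_school), 2))
print(round(school_growth(0.0), 2))
print(round(school_growth(two_sd_school), 2))
print(round(student_growth(two_sd_school, -two_sd_student), 1))
```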


Table 3. Random effects parameter and SE estimates for baseline unconditional CCMM-LVR and HM3-LVR models

                                     CCMM-LVR                        HM3-LVR
  Parameter                  Coeff.      Est.      (SE)       Coeff.  Est.      (SE)
  Level-1 variance between
    Measures                 σ²          609.465   (11.108)   σ²      609.236   (11.411)
  Intercept variance between
    Students                 τr00        881.876   (34.536)   τr00    885.096   (36.607)
    1st schools              τuj1_00     383.101   (70.560)   τu00    384.069   (72.432)
  Slope variance between
    Students                 τr11         14.647   (4.057)    τr11     15.571   (4.116)
    1st schools              τuj1_11      16.439   (6.956)    τu11     37.605   (7.395)
    Subsequent schools       τu{j2}_11    21.761   (8.515)    –       –         –
  Bw variance between
    1st schools              τuj1_Bw       0.004   (0.001)    τuBw      0.003   (0.001)

Notes. – = not applicable; CCMM-LVR = cross-classified multiple membership latent variable regression; HM3-LVR = three-level latent variable regression; Coeff. = coefficient; Est. = parameter estimate.

Figure 1. Expected growth trajectories for Schools 1–3 with mean initial status values that are two SDs below the grand mean intercept, at the grand mean intercept, and two SDs above the grand mean intercept, respectively.

Figure 2. Expected growth trajectories for three students within Schools 1–3 with initial status values that are two SDs below their school's mean initial status, at their school's mean initial status, and two SDs above their school's mean initial status, respectively.

For the baseline unconditional fixed effects parameters in Table 2, the two models' estimates were similar. The fixed effects SE estimates also revealed similarities between the two types of baseline unconditional models.

Baseline Unconditional Random Effects
In Table 3, very similar random effects parameter and SE estimates were found between the baseline unconditional CCMM-LVR and HM3-LVR models, except in the estimates of the between-first-schools slope variance, τuj1_11 and τu11, with values of 16.439 versus 37.605 for the CCMM-LVR and HM3-LVR models, respectively. The between-first-schools slope variance estimate, τuj1_11, under the CCMM-LVR model was less than half of the estimate, τu11, under the HM3-LVR model, but the associated SE estimates seemed more similar (6.956 and 7.395). The difference in the parameter estimates, τuj1_11 and τu11, corresponds to the between-subsequent-schools slope variance estimate, τu{j2}_11 (21.761).
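The correspondence noted above can be checked arithmetically: the single HM3-LVR between-first-schools slope variance roughly absorbs the sum of the two CCMM-LVR slope variance components.

```python
# Slope variance components from Table 3 (baseline unconditional models).
ccmm_first = 16.439    # tau_uj1_11, CCMM-LVR: between 1st schools
ccmm_subseq = 21.761   # tau_u{j2}_11, CCMM-LVR: between subsequent schools
hm3_first = 37.605     # tau_u11, HM3-LVR: the only level-3 slope variance

# The sum of the two CCMM-LVR components is close to the HM3-LVR estimate.
print(round(ccmm_first + ccmm_subseq, 3))  # 38.2, close to 37.605
```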

Conditional Fixed Effects
The parameter and SE estimates were mostly similar across the conditional CCMM-LVR and HM3-LVR models in Table 4. A substantial difference was found in the parameter and SE estimates of the effect of the first school's type (inner city or not) on the slope. The HM3-LVR model resulted in a much stronger parameter estimate (γ101 = -7.463) of the effect of the first school's urbanicity on the slope, with a smaller SE estimate (1.812), as compared with the CCMM-LVR model's estimates (γ1010 = 1.281, SE(γ1010) = 3.064). The difference in the parameter estimates is reflected in the parameter estimate, γ1001, for the effect of the weighted average of the subsequent schools' type on the slope in the CCMM-LVR model (-9.553).

Table 4. Fixed effects parameter and SE estimates for conditional CCMM-LVR and HM3-LVR models that include Level-2 and Level-3 predictors

                                     CCMM-LVR                       HM3-LVR
  Parameter                  Coeff.   Est.       (SE)       Coeff.  Est.       (SE)
  Model for intercept
    Grand mean               γ0000    501.476    (2.517)    γ000    501.411    (2.532)
    YRS_SMALL                γ0100      2.998    (0.373)    γ010      2.975    (0.393)
    Sch1_INNER_CITY          γ0010     18.760    (5.429)    γ001     18.616    (5.338)
  Model for slope
    Grand mean               γ1000     43.859    (1.105)    γ100     43.821    (1.070)
    School mean init. status Bb        -0.338    (0.039)    Bb       -0.310    (0.041)
    YRS_SMALL                γ1100     -0.467    (0.133)    γ110     -0.464    (0.132)
    Sch1_INNER_CITY          γ1010      1.281    (3.064)    γ101     -7.463    (1.812)
    SubSch_INNER_CITY        γ1001     -9.553    (3.014)    –        –          –
  Model for Bw
    Grand mean               Bw_0       0.007    (0.013)    Bw_0      0.006    (0.014)
    School mean init. status Bw_1      -0.001    (0.001)    Bw_1     -0.001    (0.001)
    Sch1_INNER_CITY          Bw_2      -0.065    (0.027)    Bw_2     -0.059    (0.028)
  DIC                                 117,597.6                     118,195.3

Notes. – = not applicable; CCMM-LVR = cross-classified multiple membership latent variable regression; HM3-LVR = three-level latent variable regression; Coeff. = coefficient; Est. = parameter estimate; YRS_SMALL = number of years student was in small classes; Sch1_INNER_CITY = first-school type; SubSch_INNER_CITY = weighted average of subsequent-school type; DIC = deviance information criterion.

Conditional Random Effects
The pattern of results for the random effects variance component estimates for the conditional models in Table 5 was generally similar to that of the results from the baseline unconditional models. The parameter estimates of the between-first-schools slope variance, τuj1_11 and τu11, once again revealed some large differences, with values of 7.696 and 28.783 for the CCMM-LVR and HM3-LVR models, respectively. The between-first-schools slope variance estimate, τuj1_11, under the CCMM-LVR model was less than one-third the size of the HM3-LVR model's estimate, τu11, while the associated SE estimate was smaller for the CCMM-LVR model (4.250 vs. 5.820). The difference between the parameter estimates, τuj1_11 and τu11, was reflected in the parameter estimate τu{j2}_11 (23.318).

Fit Index The DIC values were much lower for the CCMM-LVR model than for the HM3-LVR model as indicated in Tables 2 and 4. The more complex conditional models also resulted in much lower DIC values than the baseline unconditional models.
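The DIC in Equation 13 can be sketched from MCMC output: with posterior draws of the deviance and the deviance evaluated at the posterior parameter means, pD = D̄ − D(θ̄) and DIC = D̄ + pD (Spiegelhalter et al., 2002). The numbers below are made up for illustration; they are not the paper's chains.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """DIC = Dbar + pD, with pD = Dbar - D(theta_bar) (Spiegelhalter et al., 2002)."""
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d

# Hypothetical deviance draws from an MCMC chain and the deviance at the
# posterior parameter means (illustrative values only).
draws = [118030.0, 118010.0, 118020.0, 118020.0]
print(dic(draws, deviance_at_posterior_mean=117980.0))  # 118060.0
```

As in the comparisons above, the model with the smaller DIC is preferred.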

Table 5. Random effects parameter and SE estimates for conditional CCMM-LVR and HM3-LVR models that include Level-2 and Level-3 predictors

                                     CCMM-LVR                        HM3-LVR
  Parameter                  Coeff.      Est.      (SE)       Coeff.  Est.      (SE)
  Level-1 variance between
    Measures                 σ²          608.119   (11.153)   σ²      608.456   (11.071)
  Intercept variance between
    Students                 τr00        857.894   (34.786)   τr00    856.237   (35.259)
    1st schools              τuj1_00     330.882   (62.252)   τu00    334.470   (65.429)
  Slope variance between
    Students                 τr11         14.607   (3.941)    τr11     15.425   (3.829)
    1st schools              τuj1_11       7.696   (4.250)    τu11     28.783   (5.820)
    Subsequent schools       τu{j2}_11    23.318   (6.643)    –       –         –
  Bw variance between
    1st schools              τuj1_Bw       0.003   (0.001)    τuBw      0.003   (0.001)

Notes. – = not applicable; CCMM-LVR = cross-classified multiple membership latent variable regression; HM3-LVR = three-level latent variable regression; Coeff. = coefficient; Est. = parameter estimate.

Discussion
The purpose of this study was to introduce the CCMM-LVR model and to demonstrate interpretation and use of its parameters. In addition, parameter estimates for the newly proposed CCMM-LVR model were compared to corresponding parameters in the HM3-LVR model when applied to a longitudinal dataset that included student mobility. The HM3-LVR model ignores mobility by modeling only one of the multiple clusters associated with some participants, while the CCMM-LVR model handles the student mobility found in multiple membership data structures. The HM3-LVR model is a useful extension to the typical HM3-GCM because it can model the LVR coefficients as varying across clusters, examine interactions between participant and/or cluster characteristics and the initial-status-on-growth effect, and control for differences in initial status among participants and among clusters. However, the HM3-LVR model cannot handle the participant mobility that is typically encountered in large-scale longitudinal studies, which is why the CCMM-LVR model was proposed in this study.

The results revealed similarities and differences in parameter estimate values and model fit between the two estimating models. For the fixed effects parameters estimated using the two conditional models, the effect of first-school type on the slope (γ1010 and γ101) differed between the models. The HM3-LVR model's parameter estimate for the effect of first-school type on the slope, γ101, was stronger and significantly different from zero, while the parameter estimate, γ1010, was weaker and not statistically significant under the CCMM-LVR model. However, the parameter estimate for the effect of the subsequent-school type on the slope, γ1001, was stronger than γ101 and statistically significant, indicating that attending subsequent inner city schools has a more negative effect on a student's growth than the first school's being inner city. The HM3-LVR model in the current study also seems to capture the sum of the effects of the first- and subsequent-schools' type on the slope, whereas the CCMM-LVR model breaks down the effect of urbanicity on the slope into its subcomponents (i.e., first school and subsequent schools).

For the random effects variance component estimates from both the baseline unconditional and conditional models in the real data analysis, estimates of the between-first-schools slope variance, τuj1_11 and τu11, were substantially smaller for the CCMM-LVR model than for the HM3-LVR model. This difference demonstrates how the two models parameterized the between-schools slope variance: the HM3-LVR model only estimated a single level-3 slope variance (the between-first-schools slope variance, τu11), while the CCMM-LVR model partitioned that variance into the between-first-schools slope variance (τuj1_11) and the between-subsequent-schools slope variance (τu{j2}_11). The substantially larger values found for the between-first-schools slope variance from the HM3-LVR models in the real data analysis match what previous research has revealed will happen to cluster-level variance component estimates when comparing a multiple membership model to a typical multilevel model that ignores the multiple membership data structure (Chung & Beretvas, 2012; Grady, 2010; Grady & Beretvas, 2010; Luo & Kwok, 2009, 2012; Meyers & Beretvas, 2006).

The results indicated that model fit was substantially better with the CCMM-LVR model for both the baseline unconditional and conditional models. This pattern matches that found in Grady and Beretvas's (2010) study and implies that the better fit is likely due to the CCMM-LVR model accounting for students attending multiple schools, even despite the very small percentage of mobile students in the dataset being analyzed.

Limitations
This study used a linear growth model; other possible growth forms should also be evaluated and provide natural extensions of the model presented in this study. In addition, applied researchers might be interested in using the intercept to capture the expected value of the outcome at a measurement occasion other than initial status and in exploring its relationship with growth. For such scenarios, the current model would need substantial revision to ensure that the cluster-level effects across which the intercept might vary are relevant to that occasion. For example, if the occasion of interest is the midpoint measurement, say the second of three, then the set of clusters associated with the individual at the first and second occasions might be modeled as influencing the intercept, thereby changing the model in Equation 8. In addition, the specification of the LVR part of the model (in which the intercept predicts the slope) would also need to be carefully considered. The CCMM-LVR model can be tailored to match the research question of interest, although use of a different coding for the Time variable will change the resulting mean and covariance structures substantially, and additional future methodological work should explore the relevant models' complexities and their estimation.

Homogeneous variances across level-3 units were assumed here, whereas this assumption was not made in the Choi and Seltzer (2010) HM3-LVR real data analysis. Future research could explore the validity of this homogeneous variances assumption in real-world data, as well as the robustness of model estimates when the assumption is misaligned with the true underlying structure of the data. Another issue that was not tackled in this study, and that is inherent to multiple membership data structures, is that the cluster (here, school) identifiers are frequently missing for those who have changed clusters. Future methodological research is needed to investigate ways of handling missing identifiers in multiple membership data structures.
The benefits already mentioned, namely the addition of the LVR coefficients to explicitly model the prediction of the slope parameter using the intercept parameter at two levels of the model, and the ability to assess variability in that relationship as a function of individual and cluster characteristics, substantiate the contribution of the CCMM-LVR model over the CCMM-GCM. In addition, there is another fine distinction from the model upon which the CCMM-LVR is built that helps further validate its benefits over those of the HM3-LVR model. In terms of the parameterization of the model proposed in this study, in a conventional GCM the slope is modeled as affected by the single cluster (school) affiliated with the individual (student) being measured over time. Using the educational example, the assumption is that the single school's effect on a student outcome is unchanging across time. In the model introduced here, the effect of the first school is allowed to contribute to the slope separately from the effect of the set of subsequent schools. Thus, the random effects covariance structure parameterized using the CCMM-LVR model is that much more flexible than that of the HM3-LVR model. As discussed in an earlier section, this flexibility could be further expanded to allow unique effects at each time point by using a distinct cross-classified factor for the school attended at each time point rather than by using the multiple membership model for the effects of the set of subsequent schools. However, this introduces several added complications, including model estimation and specification, selection of the meaningful time points for cluster change (and cross-classified factor specification) and, under a strict cross-classified random effects model (CCREM), the overly restrictive assumption that a school's effect at each time point is completely independent of its effect at any other time point. There are alternative extensions of the CCREM that allow correlated random effects (see, e.g., Leyland & Næss, 2009), although these are beyond the scope of the current paper.

Conclusions

When conducting a typical LVR analysis using a higher-level clustering unit, such as schools, neighborhoods, departments, or hospitals, the assumption is made that the participant (or student) remains in the same higher-level cluster for the duration of the analysis. However, there are many scenarios in which participants change contexts over time, which results in a multiple membership data structure. This study extended the HM3-LVR model for researchers interested in handling multiple membership data structures. The results for the conditional fixed effects suggest that including a cluster-level predictor associated with first and subsequent clusters can reveal differences in the magnitude of that predictor's effect on the slope and in its associated standard error. Therefore, practitioners examining results from an HM3-LVR model that includes school-level characteristics should be careful when interpreting the effects of school-level predictors on student achievement growth. In addition, school-specific residuals were not assessed in this study, but the HM3-LVR model results revealed much larger between-first-clusters slope variance, which suggests that the school-specific growth residuals might also be inaccurate under the HM3-LVR model.

Other approaches exist (besides the CCMM-GCM and now the CCMM-LVR model) to handle individual mobility across clusters in longitudinal settings, such as the cross-classified GCM in which each individual changes organizational clusters at each measurement occasion (Luo & Kwok, 2012; Raudenbush & Bryk, 2002). This cross-classified GCM approach is what the well-known Tennessee Value-Added Assessment System (TVAAS) model is based on. An issue arises with this type of GCM with cross-classified individuals because the unique effect of each organization is assumed to be the same across measurement occasions (Luo & Kwok, 2012; Raudenbush & Bryk, 2002).

A. J. Leroux & S. N. Beretvas, Cross-Classified Multiple Membership LVR Model

Raudenbush and Bryk (2002) do present a modification to the cross-classified GCM that allows for cumulative organizational effects, using a dummy-coded variable associated with each organization and time point for the organizational random effects (and possibly organizational characteristics) in the level-2 intercept model. However, their data are yearly assessments nested within students nested within teachers, so every student has a different teacher each year. This again means that different organizational random effects are assumed at each time point, and no organizational random effects are incorporated into the level-2 growth rate model. Additional value-added models have also been suggested (e.g., Lockwood, McCaffrey, Mariano, & Setodji, 2007; Mariano, McCaffrey, & Lockwood, 2010) that directly handle multiple membership data structures, although they include neither LVR coefficients nor a direct assessment of the relationship between growth model parameters and individual and cluster characteristics. Future research should explore how well residuals are recovered with the CCMM-LVR model suggested in this study, because the model could be used to address value-added modeling questions focused on, for example, school effects in longitudinal data systems.

In summary, future research should continue to focus on finding ways to best handle, and assess the impact of ignoring, mobility across clusters. This work has provided a first exploration of an extension to the flexible HM3-LVR model that could provide the foundation for future research intended to identify optimal solutions for handling the multiple membership data structures that are so common in applied educational and social science research.
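The cumulative dummy-coding idea attributed above to Raudenbush and Bryk (2002) can be sketched as a small design-matrix construction: once a student has attended an organization, that organization's indicator stays switched on at all later occasions, so its random effect keeps contributing. The helper function and the school labels below are hypothetical illustrations, not code from the cited source.

```python
def cumulative_design(history, schools):
    """Row t gets a 1 for every school attended at or before occasion t,
    so each school's effect continues to accumulate after the student
    leaves (a cumulative-effects coding; labels are illustrative)."""
    rows = []
    seen = set()
    for s in history:
        seen.add(s)
        rows.append([1 if sch in seen else 0 for sch in schools])
    return rows

# Three occasions: school A, then a move to school B, then B again.
X = cumulative_design(["A", "B", "B"], schools=["A", "B", "C"])
# X == [[1, 0, 0], [1, 1, 0], [1, 1, 0]]
```

Contrast this with a strict cross-classified coding, where only the school attended at occasion t would be switched on in row t.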

References

Achilles, C. M., Bain, H. P., Bellott, F., Boyd-Zaharias, J., Finn, J., Folger, J., . . . Word, E. (2008). Tennessee's student teacher achievement ratio (STAR) project [Data file]. Retrieved from http://hdl.handle.net/1902.1/10766
Bryk, A. S., & Raudenbush, S. W. (1987). Application of hierarchical linear models to assessing change. Psychological Bulletin, 101, 147–158. https://doi.org/10.1037/0033-2909.101.1.147
Choi, K., & Seltzer, M. (2010). Modeling heterogeneity in relationships between initial status and rates of change: Treating latent variable regression coefficients as random coefficients in a three-level hierarchical model. Journal of Educational and Behavioral Statistics, 35, 54–91. https://doi.org/10.3102/1076998609337138
Chung, H., & Beretvas, S. N. (2012). The impact of ignoring multiple membership data structures in multilevel models. British Journal of Mathematical and Statistical Psychology, 65, 185–200. https://doi.org/10.1111/j.2044-8317.2011.02023.x
Francis, D. J., Fletcher, J. M., Stuebing, K. K., Davidson, K. C., & Thompson, N. M. (1991). Analysis of change: Modeling individual growth. Journal of Consulting and Clinical Psychology, 59, 27–37. https://doi.org/10.1037/0022-006X.59.1.27
Grady, M. W. (2010). Modeling achievement in the presence of student mobility: A growth curve model for multiple membership data (Unpublished doctoral dissertation). The University of Texas at Austin, Austin, TX.
Grady, M. W., & Beretvas, S. N. (2010). Incorporating student mobility in achievement growth modeling: A cross-classified multiple membership growth curve model. Multivariate Behavioral Research, 45, 393–419. https://doi.org/10.1080/00273171.2010.483390
Horney, J., Osgood, D. W., & Marshall, I. H. (1995). Criminal careers in the short-term: Intra-individual variability in crime and its relation to local life circumstances. American Sociological Review, 60, 655–673.
Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M., & Lyons, T. (1991). Early vocabulary growth: Relation to language input and gender. Developmental Psychology, 27, 236–248. https://doi.org/10.1037/0012-1649.27.2.236
Ihrke, D. K., & Faber, C. S. (2012). Geographical mobility: 2005 to 2010 (Current Population Reports, P20-567). Washington, DC: US Census Bureau.
Leyland, A. H., & Næss, Ø. (2009). The effect of area of residence over the life course on subsequent mortality. Journal of the Royal Statistical Society: Series A, 172, 555–578.
Lockwood, J. R., McCaffrey, D. F., Mariano, L. T., & Setodji, C. (2007). Bayesian methods for scalable multivariate value-added assessment. Journal of Educational and Behavioral Statistics, 32, 125–150. https://doi.org/10.3102/1076998606298039
Luo, W., & Kwok, O. (2009). The impacts of ignoring a crossed factor in analyzing cross-classified data. Multivariate Behavioral Research, 44, 182–212. https://doi.org/10.1080/00273170902794214
Luo, W., & Kwok, O. (2012). The consequences of ignoring individuals' mobility in multilevel growth models: A Monte Carlo study. Journal of Educational and Behavioral Statistics, 36, 31–56. https://doi.org/10.3102/1076998610394366
Mariano, L. T., McCaffrey, D. F., & Lockwood, J. R. (2010). A model for teacher effects from longitudinal data without assuming vertical scaling. Journal of Educational and Behavioral Statistics, 35, 253–279.
Meyers, J. L., & Beretvas, S. N. (2006). The impact of inappropriate modeling of cross-classified data structures. Multivariate Behavioral Research, 41, 473–497. https://doi.org/10.1207/s15327906mbr4104_3
Muthén, B. O., & Curran, P. J. (1997). General longitudinal modeling of individual differences in experimental designs: A latent variable framework for analysis and power estimation. Psychological Methods, 2, 371–402. https://doi.org/10.1037/1082-989X.2.4.371
Plummer, M. (2013). JAGS: Just Another Gibbs Sampler (Version 3.4.0) [Computer software]. Retrieved from http://mcmc-jags.sourceforge.net/
Psychological Corporation, Harcourt Brace Jovanovich. (1983). Stanford Achievement Test (7th ed.). San Diego, CA: Author.
R Core Team. (2015). R: A language and environment for statistical computing (Version 3.2.1) [Computer software]. Retrieved from http://www.R-project.org
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Raudenbush, S. W., & Chan, W. (1993). Application of a hierarchical linear model to the study of adolescent deviance in an overlapping cohort design. Journal of Consulting and Clinical Psychology, 61, 941–951. https://doi.org/10.1037/0022-006X.61.6.941
Seltzer, M., Choi, K., & Thum, Y. (2003). Examining relationships between where students start and how rapidly they progress: Using new developments in growth modeling to gain insight into the distribution of achievement within schools. Education Evaluation and Policy Analysis, 25, 263–286. https://doi.org/10.3102/01623737025003263
Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of conclusions about growth in student achievement to choice of metric. Educational Evaluation and Policy Analysis, 16, 41–49. https://doi.org/10.3102/01623737016001041
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 583–639. https://doi.org/10.1111/1467-9868.00353
Su, Y., & Yajima, M. (2015). Package "R2jags". Retrieved from http://cran.r-project.org/web/packages/R2jags/index.html
US Census Bureau. (2015). Geographic mobility: 2013 to 2014. Retrieved from http://www.census.gov/hhes/migration/data/cps/cps2014.html
US Government Accounting Office. (2010). K-12 education: Many challenges arise in educating students who change schools frequently (GAO Publication No. 11-40). Washington, DC: US Government Printing Office.

Audrey J. Leroux
Department of Educational Policy Studies
Georgia State University
P.O. Box 3977
Atlanta, GA 30302-3977
USA
aleroux@gsu.edu

Audrey J. Leroux is an Assistant Professor of Research, Measurement, and Statistics at Georgia State University. Her research is focused on evaluating innovative models and procedures within item exposure controls and stopping rules in computerized adaptive testing, as well as in extensions to conventional multilevel modeling that handle individual mobility.

S. Natasha Beretvas is a Professor in the Quantitative Methods program at The University of Texas at Austin. Her research is currently focused on assessing meta-analytic techniques as well as examining extensions to the multilevel model that are intended to handle student mobility and other sources of data structure complexities.

Received July 22, 2015 Revision received October 14, 2016 Accepted August 28, 2017 Published online April 23, 2018

Appendix A

JAGS Code for the Unconditional CCMM-LVR Model

## T = total number of time-points; I = total number of students;
## J = total number of initial schools; K = total number of subsequent schools ##
model {
  ## Level-1 Model ##
  for (t in 1:T) {
    MATH[t] ~ dnorm(mu[t], tauinv_e)
    mu[t] <- pi0[STU[t]] + pi1[STU[t]] * TIME[t]
  }
  ## Level-2 Model ##
  for (i in 1:I) {
    pi0[i] ~ dnorm(beta00[SCH[i]], tauinv_r0)
    pi1[i] ~ dnorm(stu_growth[i], tauinv_r1)
    stu_growth[i] <- beta10[SCH[i], 1] + beta10[SCH[i], 2] * (pi0[i] - beta00[SCH[i]]) +
      WGT2[i] * u1J2[SCH2[i]] + WGT3[i] * u1J2[SCH3[i]] + WGT4[i] * u1J2[SCH4[i]]
  }
  ## Level-3 Model ##
  for (j in 1:J) {
    beta00[j] ~ dnorm(gamma0000, tauinv_u0)
    beta10[j, 1:2] ~ dmnorm(sch_growth[j, 1:2], tauinv_u1[1:2, 1:2])
    sch_growth[j, 1] <- gamma1000 + Bb * (beta00[j] - gamma0000)
    sch_growth[j, 2] <- Bw0 + Bw1 * (beta00[j] - gamma0000)
  }


  ## Level-4 Model ##
  for (k in 1:K) {
    u1J2[k] ~ dnorm(0, tauinv_u1J2)
  }
  ## Priors for Fixed Effects ##
  gamma0000 ~ dnorm(0, 0.00001)
  gamma1000 ~ dnorm(0, 0.00001)
  Bb ~ dnorm(0, 0.00001)
  Bw0 ~ dnorm(0, 0.00001)
  Bw1 ~ dnorm(0, 0.00001)
  ## Priors for Variance Components ##
  tauinv_e ~ dpar(1, 0.0001)
  tau_e <- 1/tauinv_e
  tauinv_r0 ~ dpar(1, 0.0001)
  tauinv_r1 ~ dpar(1, 0.0001)
  tau_r0 <- 1/tauinv_r0
  tau_r1 <- 1/tauinv_r1
  tauinv_u0 ~ dpar(1, 0.0001)
  tau_u0 <- 1/tauinv_u0
  tauinv_u1[1:2, 1:2] ~ dwish(S[1:2, 1:2], 3)
  S[1,1] <- 32.348
  S[2,2] <- 0.012
  S[1,2] <- 0.000
  S[2,1] <- 0.000
  tau_u1[1:2, 1:2] <- inverse(tauinv_u1[,])
  tauinv_u1J2 ~ dpar(1, 0.0001)
  tau_u1J2 <- 1/tauinv_u1J2
}
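The level-2 growth equation in the unconditional model above can be traced numerically. The Python sketch below evaluates the mean of pi1[i] for one student: the first school's slope intercept, plus the within-school LVR term, plus the weighted subsequent-school effects. The function name and every numeric value are made-up illustrations, not estimates from the study.

```python
def stu_growth_mean(beta10, beta00, pi0, wgt, u1J2_effects):
    """Mean of the level-2 slope in the unconditional CCMM-LVR model:
    beta10[SCH,1] + beta10[SCH,2] * (pi0 - beta00[SCH])            # LVR part
    + sum of WGT * u1J2 over the subsequent schools (multiple membership)."""
    lvr = beta10[0] + beta10[1] * (pi0 - beta00)
    mm = sum(w * u for w, u in zip(wgt, u1J2_effects))
    return lvr + mm

# Made-up numbers: first-school slope intercept 4.0, LVR slope 0.5, a student
# intercept 2.0 points above the school mean, half weight on each of two
# later schools whose slope effects are 0.2 and -0.4.
g = stu_growth_mean(beta10=(4.0, 0.5), beta00=50.0, pi0=52.0,
                    wgt=(0.5, 0.5, 0.0), u1J2_effects=(0.2, -0.4, 0.0))
# g == 4.0 + 0.5 * 2.0 + (0.1 - 0.2) == 4.9
```

A student with zero weights on all subsequent-school slots (a non-mover) reduces to the ordinary HM3-LVR slope expression, which is exactly the nesting relationship between the two models.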

Appendix B

JAGS Code for the Conditional CCMM-LVR Model

## T = total number of time-points; I = total number of students;
## J = total number of initial schools; K = total number of subsequent schools ##
model {
  ## Level-1 Model ##
  for (t in 1:T) {
    MATH[t] ~ dnorm(mu[t], tauinv_e)
    mu[t] <- pi0[STU[t]] + pi1[STU[t]] * TIME[t]
  }
  ## Level-2 Model ##
  for (i in 1:I) {
    pi0[i] ~ dnorm(stu_start[i], tauinv_r0)
    pi1[i] ~ dnorm(stu_growth[i], tauinv_r1)
    stu_start[i] <- beta00[SCH[i]] + gamma0100 * YRS_SMALL[i]
    stu_growth[i] <- beta10[SCH[i], 1] + beta10[SCH[i], 2] * (pi0[i] - beta00[SCH[i]]) +
      gamma1100 * YRS_SMALL[i] + WGT2[i] * u1J2[SCH2[i]] +
      WGT3[i] * u1J2[SCH3[i]] + WGT4[i] * u1J2[SCH4[i]] +
      gamma1001 * INNER_CITY_SUB[i]


  }
  ## Level-3 Model ##
  for (j in 1:J) {
    beta00[j] ~ dnorm(sch_start[j], tauinv_u0)
    beta10[j, 1:2] ~ dmnorm(sch_growth[j, 1:2], tauinv_u1[1:2, 1:2])
    sch_start[j] <- gamma0000 + gamma0010 * INNER_CITY[j]
    sch_growth[j, 1] <- gamma1000 + Bb * (beta00[j] - gamma0000) + gamma1010 * INNER_CITY[j]
    sch_growth[j, 2] <- Bw0 + Bw1 * (beta00[j] - gamma0000) + Bw2 * INNER_CITY[j]
  }
  ## Level-4 Model ##
  for (k in 1:K) {
    u1J2[k] ~ dnorm(0, tauinv_u1J2)
  }
  ## Priors for Fixed Effects ##
  gamma0000 ~ dnorm(0, 0.00001)
  gamma1000 ~ dnorm(0, 0.00001)
  gamma0100 ~ dnorm(0, 0.00001)
  gamma1100 ~ dnorm(0, 0.00001)
  gamma0010 ~ dnorm(0, 0.00001)
  gamma1010 ~ dnorm(0, 0.00001)
  gamma1001 ~ dnorm(0, 0.00001)
  Bb ~ dnorm(0, 0.00001)
  Bw0 ~ dnorm(0, 0.00001)
  Bw1 ~ dnorm(0, 0.00001)
  Bw2 ~ dnorm(0, 0.00001)
  ## Priors for Variance Components ##
  tauinv_e ~ dpar(1, 0.0001)
  tau_e <- 1/tauinv_e
  tauinv_r0 ~ dpar(1, 0.0001)
  tauinv_r1 ~ dpar(1, 0.0001)
  tau_r0 <- 1/tauinv_r0
  tau_r1 <- 1/tauinv_r1
  tauinv_u0 ~ dpar(1, 0.0001)
  tau_u0 <- 1/tauinv_u0
  tauinv_u1[1:2, 1:2] ~ dwish(S[1:2, 1:2], 3)
  S[1,1] <- 19.412
  S[2,2] <- 0.008
  S[1,2] <- 0.000
  S[2,1] <- 0.000
  tau_u1[1:2, 1:2] <- inverse(tauinv_u1[,])
  tauinv_u1J2 ~ dpar(1, 0.0001)
  tau_u1J2 <- 1/tauinv_u1J2
}


Instructions to Authors

Methodology is the official organ of the European Association of Methodology. This association is a union of methodologists working in different areas of the social and behavioral sciences (e.g., psychology, sociology, economics, educational and political sciences). The aim of the journal is to present a platform for an interdisciplinary exchange of methodological research and applications in the different fields. The journal is open to new methodological approaches, review articles, software information, and instructional papers that can be used in teaching. Three main disciplines are covered: data analysis, research methodology, and psychometrics. Articles published in the journal should be accessible not only to methodologists but also to more applied researchers in the different disciplines. Methodology publishes the following types of articles: Original Articles.

Manuscript Submission

All manuscripts should in the first instance be submitted electronically at http://www.editorialmanager.com/methodology. Detailed instructions to authors are provided at http://www.hogrefe.com/periodicals/journal-of-individual-differences/advice-for-authors/

Copyright Agreement

By submitting an article, the author confirms and guarantees on behalf of him-/herself and any coauthors that the manuscript has not been submitted or published elsewhere, and that he or she holds all copyright in and titles to the submitted contribution, including any figures, photographs, line drawings, plans, maps, sketches, tables, and electronic supplementary material (ESM), and that the article and its contents do not infringe in any way on the rights of third parties. ESM will be published online as received from the author(s) without any conversion, testing, or reformatting, and will not be checked for typographical errors or functionality. The author indemnifies and holds harmless the publisher from any third-party claims.
The author agrees, upon acceptance of the article for publication, to transfer to the publisher the exclusive right to reproduce and distribute the article and its contents, both physically and in nonphysical, electronic, or other form, in the journal to which it has been submitted and in other independent publications, with no limitations on the number of copies or on the form or the extent of distribution. These rights are transferred for the duration of copyright as defined by international law. Furthermore, the author transfers to the publisher the following exclusive rights to the article and its contents:

1. The rights to produce advance copies, reprints, or offprints of the article, in full or in part, to undertake or allow translations into other languages, to distribute other forms or modified versions of the article, and to produce and distribute summaries or abstracts.
2. The rights to microfilm and microfiche editions or similar, to the use of the article and its contents in videotext, teletext, and similar systems, to recordings or reproduction using other media, digital or analog, including electronic, magnetic, and optical media, and in multimedia form, as well as for public broadcasting in radio, television, or other forms of broadcast.
3. The rights to store the article and its contents in machine-readable or electronic form on all media (such as computer disks, compact disks, magnetic tape), to store the article and its contents in online databases belonging to the publisher or third parties for viewing or downloading by third parties, and to present or reproduce the article or its contents on visual display screens, monitors, and similar devices, either directly or via data transmission.
4. The rights to reproduce and distribute the article and its contents by all other means, including photomechanical and similar processes (such as photocopying or facsimile), and as part of so-called document delivery services.
5. The right to transfer any or all rights mentioned in this agreement, as well as rights retained by the relevant copyright clearing centers, including royalty rights, to third parties.

Online Rights for Journal Articles

Guidelines on authors' rights to archive electronic versions of their manuscripts online are given in the document "Guidelines on sharing and use of articles in Hogrefe journals" on the journal's web page at www.hogrefe.com/j/med

November 2017