Modeling Student Growth Using Multilevel Mixture Item Response Theory

Hong Jiao and Robert Lissitz
University of Maryland

Presentation at the 2012 MARCES Conference, October 2012
Thanks to Yong Luo, Chao Xie, and Ming Li for feedback.

Outline of presentation
• Value-added modeling
• Multilevel IRT models
• Mixture IRT models
• Direct modeling of students' growth parameters in multilevel mixture IRT models
• Simulation for direct modeling of growth in IRT models
• Future explorations

Value-added modeling
• VAM intends to estimate the effect of educational inputs on student outcomes or student achievement as measured by standardized tests (McCaffrey et al., 2003).
• Accurate estimation of students' achievement is very important, as high-stakes decisions are associated with the use of such scores.
• All value-added models estimate the growth associated with schools and/or teachers.
• To measure growth, some models control for students' prior achievement (FL commissioned paper by AIR).

Complexity of the VAM
• How prior achievements are accounted for
• How value-added scores of school and teacher effects are estimated
• Assumptions about the sustainability of school and teacher effects
• Value-added models can be grouped into two major classes (AIR): typical learning path models and covariate adjustment models

Typical learning path models: longitudinal mixed-effects models
• Each student is assumed to have a typical learning path.
• Schools and teachers can alter this learning path relative to the state mean, a conditional average.
• There is no direct control of prior achievement.
• With more data points, a student's propensity to achieve can be estimated with more accuracy; with each passing year, a student's typical learning path can be estimated with increased precision.
• Different learning path models make different assumptions about how teachers and schools can impact a student's propensity to achieve.

Different learning path models
• In Sanders' Tennessee Value-Added Assessment System (TVAAS) model, teacher effects are assumed to have a permanent impact on students.
• McCaffrey and Lockwood (2008) relaxed this assumption and let the data dictate the extent to which teacher effects decay over time.
• Kane et al. (2008) found that teacher effects appear to dissipate over the course of about two years in an experiment in Los Angeles.
• A small simulation sketch contrasting permanent and decaying teacher effects follows.
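As a minimal illustration of the learning path idea (not part of the original study; all values below are hypothetical), the sketch simulates one student's linear learning path and deflects it each year by a teacher effect whose persistence is governed by alpha: alpha = 1 mimics the TVAAS-style permanence assumption, while alpha < 1 lets effects decay, as in McCaffrey and Lockwood (2008).

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: a linear "typical learning path" plus one new
# teacher effect per year; alpha controls how long teacher effects persist.
years = 5
intercept, slope = 0.0, 0.4                         # student-specific path
teacher_effects = rng.normal(0.0, 0.2, size=years)  # one teacher per year

def achievement(alpha):
    """Yearly scores when past teacher effects decay at rate alpha."""
    scores, carried = [], 0.0
    for t in range(years):
        carried = alpha * carried + teacher_effects[t]  # accumulated effects
        scores.append(intercept + slope * t + carried)
    return np.round(scores, 2)

print("permanent effects (alpha = 1.0):", achievement(1.0))  # TVAAS-style
print("decaying effects  (alpha = 0.5):", achievement(0.5))
```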
Covariate adjustment models
• Direct control of prior student scores: prior test scores are included as predictors in the model.
• Teacher effects can be treated as either fixed or random.
• To obtain unbiased estimates, covariate adjustment models must account for the measurement error introduced by the inclusion of the model predictors (students' prior achievement).

Covariate adjustment models
• Two frequently used methods for accounting for measurement error in regression analysis:
  – Direct modeling of the error, as in structural equation models or errors-in-variables regression.
  – An instrumental variable approach, using one or more variables that are assumed to influence the current year score, but not prior year scores, to statistically purge the measurement error from the prior year scores.
• A small numerical sketch of the underlying problem follows.
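To see why this matters, the hypothetical sketch below (made-up variances, not the study's data) shows the classical attenuation bias: ordinary least squares shrinks the coefficient on a pretest that is measured with error, which is the bias that errors-in-variables models or instrumental variables are intended to remove.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True model: posttest = 0.8 * (error-free prior achievement) + noise.
true_prior = rng.normal(0.0, 1.0, n)
post = 0.8 * true_prior + rng.normal(0.0, 0.5, n)

# The observed pretest carries measurement error (variance 0.36).
noisy_prior = true_prior + rng.normal(0.0, 0.6, n)

def ols_slope(x, y):
    return np.polyfit(x, y, 1)[0]  # simple-regression slope

print("b with error-free prior:", round(ols_slope(true_prior, post), 3))
print("b with observed prior  :", round(ols_slope(noisy_prior, post), 3))
# Classical attenuation: plim(b_hat) = 0.8 * 1 / (1 + 0.36) ~= 0.588
```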
Statistical controls for contextual factors
• Students are not randomly assigned to districts, schools, and classes.
• Selection occurs through parent selection of schools and teachers, teacher selection of schools, subjects, and sections, and principal discretion in assigning certain students to certain teachers.
• These selection factors can cause significant biases.
• Unbiased estimation of teacher value-added requires controlling for the factors that influence both the selection of students into particular classes and current year test scores.

Statistical controls for contextual factors
• Many value-added models assume that only students' prior test scores are relevant to students' posttest scores.
• Other models incorporate controls for additional variables that might influence selection and outcomes.

Statistical controls for contextual factors
• Empirical evidence is mixed on the extent to which student characteristics other than score histories remain correlated with test scores after controlling for prior test scores.
• Some studies found that controlling for student-level characteristics makes little if any significant difference in model estimates (Ballou, Sanders, & Wright, 2004; McCaffrey et al., 2004). This is consistent with the view that durable student characteristics associated with race, income, and other factors are already reflected in prior test scores, such that controlling for prior test scores controls for any relevant impact of the factors proxied by the measured characteristics.

Statistical controls for contextual factors
• In contrast, when student factors are aggregated to the school or classroom level, they sometimes reveal a significant residual effect (Raudenbush, 2004; Ballou, Sanders, & Wright, 2004). School or classroom characteristics may explain additional variance in students' posttest scores beyond the individual characteristics accounted for by prior test scores.
• Alternatively, true teacher effectiveness may genuinely vary with student characteristics, in which case the correlated variation in estimated teacher value-added is not a consequence of uncontrolled selection bias but a reflection of true differences in teacher effectiveness.

Durability of teacher effects
• Typical learning path models require an assumption about the durability of the impact of teachers on a student's learning path.
  – Sanders' Tennessee Value-Added Assessment System assumes that teacher effects have a permanent impact on students.
  – McCaffrey and Lockwood (2008) let the data dictate the extent to which teacher effects decay over time.
  – Kane et al. (2008) found that teacher effects appeared to dissipate over the course of about two years in an experiment in Los Angeles.

Durability of teacher effects
• Covariate models make no assumption about the durability of teacher effects: they explicitly establish expectations based on prior achievement by including prior test scores as covariates, rather than the abstract "propensity to achieve" estimated in learning path models.

Unit of measurement for student achievement
• The Colorado growth model (Betebenner, 2008) uses entirely normative in-state percentile ranks.
• It does not rely on a potentially flawed vertical scale, but it provides only normative criteria.
• Students' growth is examined relative to their peers rather than as absolute growth in their own learning.

Dependent variable in growth modeling
• The majority of models use interval measures of students' scaled test scores.
• Students' percentile ranks within grade have also been used as the dependent variable in some models.

Correction of biased estimates of teacher effects in VAM
• Selection effects include parent selection of schools and teachers; teacher selection of schools, subjects, and sections; and principal discretion in assigning certain students to certain teachers.
• Selection effects can be mitigated when the model includes factors that are not accounted for by pretest scores and are associated with posttest scores after controlling for pretest scores.

Issues arising from the use of achievement test scores as an outcome measure
• Testing is infrequent (once a year).
• Tests sample from the topics related to achievement.
• The scale for measuring achievement is not predetermined by the nature of achievement but is chosen by the test developer.
• Changes to the timing of tests, the weight given to alternative topics, or the scaling of the test could change conclusions about the relative achievement, or growth in achievement, across classes of students.

Potential problems in value-added models
• Linking errors could be conflated with teacher effects.
• The equal-interval property of the scale across grades is questionable.
• Ceiling effects at higher grades may lead to smaller estimated learning gains than in grades in the middle of the scale.
• Measurement error causes estimated treatment effects to be confounded with group means of prior achievement (Lockwood, 2012).

Covariate adjusted models (McCaffrey et al., 2003)

S_t = m_t + b \cdot S_{t-1} + T_t + \epsilon_t

where S_t is the student's score at time t, m_t is a student-specific mean, S_{t-1} is the student's score at time t-1, T_t is the teacher effect, and \epsilon_t is the error term, assumed to be normally distributed and independent of the other terms.

Gain score models (McCaffrey et al., 2003)

S_t - S_{t-1} = m_t + T_t + \epsilon_t

with the same definitions as above. The gain score model can be viewed as a special case of the covariate adjusted model in which b, the coefficient of prior achievement, equals 1.

Teacher effect estimate in VAM (Luo, Jiao, & Van Wie, 2012)
• Two-step process: in most value-added modeling, student achievement scores are estimated before entering the model for estimating teacher or school effects.
• Students' achievement scores are estimated first based on a certain item response theory (IRT) model, most often a unidimensional IRT model.

Issues with the two-step process (Luo, Jiao, & Van Wie, 2012)
• Standard IRT models are used operationally to measure students' achievement scores.
• Non-random assignment of students into schools and classes causes local person dependence due to the nesting structure (Reckase, 2009; Jiao et al., 2012).
• Measurement precision might be affected.
• Parameter estimates may be biased due to the reduced effective sample size (Cochran, 1977; Cyr & Davies, 2005; Kish, 1965).
• Ultimately, the accuracy of estimated teacher and school effects may also be affected.

Outcome variables in VAM
• Standardized test scores contain intrinsic measurement error.
• A possible solution is to use multilevel item response theory (IRT) models: simultaneous modeling of students' achievement, teacher effects, and school effects using item response data as the input, with latent ability estimated simultaneously with other model parameters such as item parameters and teacher and school random effects (Van Wie, Luo, & Jiao, 2012; Luo, Jiao, & Van Wie, 2012).

Four-level model in the traditional Rasch model format (Van Wie, Luo, & Jiao, 2012)

p_{jmsi} = \frac{1}{1 + \exp[-(\theta_j + T_m + S_s - b_i)]}

Four-level model in the traditional 3PL IRT model format (Luo, Jiao, & Van Wie, 2012)

p_{jmsi} = c_i + \frac{1 - c_i}{1 + \exp[-(\theta_j + T_m + S_s - b_i)]}

where \theta_j is the ability of student j, T_m is the effect of teacher m, S_s is the effect of school s, and b_i and c_i are the difficulty and pseudo-guessing parameters of item i. A data-generation sketch for the four-level Rasch model follows.
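As a sketch of how data arise under the four-level Rasch model (the sample sizes and variance components below are invented for illustration and are not the study's design), the following simulates students nested within teachers nested within schools and generates item responses from the model above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: 5 schools x 4 teachers x 25 students, 30 items.
n_schools, teachers_per_school, students_per_teacher, n_items = 5, 4, 25, 30

b = rng.normal(0.0, 1.0, n_items)                           # item difficulties
S = rng.normal(0.0, 0.3, n_schools)                         # school effects
T = rng.normal(0.0, 0.3, (n_schools, teachers_per_school))  # teacher effects

responses = []
for s in range(n_schools):
    for m in range(teachers_per_school):
        theta = rng.normal(0.0, 1.0, students_per_teacher)  # student abilities
        # Four-level Rasch: logit = theta_j + T_m + S_s - b_i
        logit = theta[:, None] + T[s, m] + S[s] - b[None, :]
        p = 1.0 / (1.0 + np.exp(-logit))
        responses.append(rng.binomial(1, p))                # 0/1 item responses

X = np.vstack(responses)  # (students x items) response matrix
print("response matrix:", X.shape, "proportion correct:", X.mean().round(3))
```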
Multilevel IRT framework
• Model parameter estimation for the four-level IRT models covers item parameters, student ability, teacher effects, and school effects.
  – Rasch: HLM7, PROC GLIMMIX, MCMC
  – 2PL: MCMC
  – 3PL: MCMC

Teacher effect and school effect computation
• In the two-step process, the teacher effect is computed as the average of the scores of the nested students, and the school effect is computed as the average of the teacher effects within the school. This is analogous to the status model.
• In the four-level IRT model, student ability, teacher effects, and school effects are simultaneously estimated.

Findings
• Except for RMSE in teacher effect parameter estimation, the four-level 3PL IRT model performs significantly better than the two-level 3PL IRT model.
• Especially noticeable is the considerable improvement in teacher effect parameter estimation.

Improvement of teacher effect estimates
• The improvement is especially noticeable when teacher effects and school effects are of medium size.
• The improvement diminishes as teacher effects and school effects decrease.

Further improvement
• Because the change score is ultimately used in evaluating teacher and school effects in several value-added models, we explored direct estimation of the change score by including prior achievement scores in the IRT modeling.
• An IRT model formulation for the growth score is presented and model parameter estimation is explored.
• A multilevel formulation is presented.
• A mixture IRT version including the growth score is presented and model parameter estimation is discussed.

Possible models
• Rasch model with direct modeling of a growth parameter \delta_j:

P_{ji}(x = 1 \mid b_i, \theta_j, \delta_j) = \frac{1}{1 + \exp[-(\theta_j + \delta_j - b_i)]}

• Multilevel Rasch model with direct modeling of a growth parameter:

p_{jmsi} = \frac{1}{1 + \exp[-(\theta_j + \delta_j + T_m + S_s - b_i)]}

Possible models
• Multilevel Rasch mixture model with direct modeling of a growth parameter, with no latent classes at the teacher and school levels:

p_{jmsic} = \frac{1}{1 + \exp[-(\theta_{jc} + \delta_{jc} + T_m + S_s - b_{ic})]}

• Multilevel Rasch mixture model with direct modeling of a growth parameter, with latent classes at the teacher and school levels:

p_{jmsic} = \frac{1}{1 + \exp[-(\theta_{jc} + \delta_{jc} + T_{mc} + S_{sc} - b_{ic})]}

Simulation study
• 30 items and 1,000 examinees were simulated, with
  b_i ~ N(0, 1), \theta_j ~ N(0, 1), \delta_j ~ N(0, 0.5),
under the model

P_{ji}(x = 1 \mid b_i, \theta_j, \delta_j) = \frac{1}{1 + \exp[-(\theta_j + \delta_j - b_i)]}

Model parameter estimation
• The Markov chain Monte Carlo (MCMC) method implemented in OpenBUGS 3.0.7 was used, with

x_{ji} ~ Bernoulli(p_{ji}),  P_{ji}(x = 1 \mid b_i, \theta_j, \delta_j) = \frac{1}{1 + \exp[-(\theta_j + \delta_j - b_i)]}

• Priors: b_i ~ dnorm(0, 10), \delta_j ~ N(0, 2); \theta_j is a known parameter, the prior test score.
• Two MCMC chains were used; initial values were generated by the program.
• A minimal Metropolis sketch of this setup appears below.
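As a minimal stand-in for the OpenBUGS run (a single-chain Metropolis sketch, simplified for illustration: item difficulties are treated as known and only one examinee's growth parameter is sampled; \theta_j is fixed at its known prior-score value, and the prior variance of 2 follows the N(0, 2) notation above), the following recovers \delta_j from simulated responses.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate one examinee under the growth Rasch model with known difficulties.
n_items = 30
b = rng.normal(0.0, 1.0, n_items)               # known item difficulties
theta_j, delta_true = 0.2, 0.5                  # hypothetical examinee
p = 1.0 / (1.0 + np.exp(-(theta_j + delta_true - b)))
x = rng.binomial(1, p)                          # simulated item responses

def log_post(delta):
    """Bernoulli log-likelihood plus N(0, 2) prior (variance 2) on delta."""
    eta = theta_j + delta - b
    loglik = np.sum(x * eta - np.log1p(np.exp(eta)))
    return loglik - delta**2 / (2.0 * 2.0)

delta, draws = 0.0, []
for it in range(20_000):
    proposal = delta + rng.normal(0.0, 0.3)     # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(delta):
        delta = proposal
    if it >= 10_000:                            # discard burn-in, then monitor
        draws.append(delta)

print("posterior mean of delta:", round(float(np.mean(draws)), 3))
```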
Convergence check
• Multiple criteria were used to check convergence.
• The required number of iterations to reach equilibrium varied across models.
• Number of burn-in iterations: 40,000.
• Model parameter inferences were based on 10,000 monitoring iterations for each chain, for a total of 20,000 samples.

Growth parameter estimates: descriptive statistics (N = 1,000 for all variables)

Variable       Minimum     Maximum    Mean        Std. deviation
dtheta         -1.426000   1.818000   -.00315860  .485358632
dtheta_true    -1.346726   1.459898   -.01686716  .488377290
dif_theta      -1.513055   2.036097   .00835295   .543427499

Correlations among growth parameter estimates (N = 1,000)
• r(dtheta, dtheta_true) = .745, significant at the 0.01 level (two-tailed).
• r(dtheta_true, dif_theta) = .616, significant at the 0.01 level (two-tailed).

Future research
• A multilevel IRT model for direct estimation of the growth (change) scores.
• A mixture multilevel IRT model for direct estimation of the growth (change) scores.
• A constrained version of the model is possible by setting the growth (change) scores to non-negative values.
• Extensions to other IRT models, such as the 2PL, 3PL-c, 3PL-d, and 4P IRT models, and the mixture versions of these models.
• Replications and simulation of more study conditions.
• Model fit indices for selecting among competing models should be investigated under more extensive study conditions.

Thank you!