University of Maryland

IRT Models to Assess Change Across Repeated Measurements
James S. Roberts, Georgia Institute of Technology
Qianli Ma, University of Maryland

Many Thanks!!!
• Thanks Bob.
• Thanks to Mayank Seksaria, Vallerie Ellis, Dan Graham, Yi Cao, and Yunyun Dai for their assistance at various stages of this project.
• Thanks to the Project MATCH Coordinating Center at the University of Connecticut for sharing their data.

Situations in Which Repeated Measures IRT Models Are Useful
• Each respondent receives the same test multiple times
  – Typical pretest, posttest, follow-up, treatment studies
• Each respondent receives alternate forms of a comparable test with common items across forms (or across pairs of forms)
  – More elaborate repeated measures designs that control for memory effects
• Each respondent receives alternate forms that are not comparable (in difficulty) but have some common items
  – Vertical measurement situations
    • ECLS, some school testing programs
• Each of these situations involves a set of common items across (successive pairs of) administered tests
  – 100% common items = same form
  – Less than 100% common items = alternate forms

Typical Approaches to Repeated Measures Data in IRT
• Calibrate responses from each administration separately
  – Ignores correlation of the latent trait across test administrations
• Calibrate responses from each administration simultaneously, allowing for different prior distributions at each administration
  – Still ignores correlation
• Multidimensional approaches
  – Andersen (1985)
  – Reckase and Martineau (2004)
  – Estimate theta at each testing occasion simultaneously
    • Does incorporate correlation across testing occasions
    • Does not really assess change in the latent variable

An Alternative IRT Approach
• Embretson’s (1991) Multidimensional Rasch Model for Learning and Change (MRMLC)
  – Developed to measure change in a latent trait across repeatedly measured items that are scored as binary variables

$$P(X_{ji(t)} = 1 \mid \theta^*_{j1}, \ldots, \theta^*_{jT}) = \frac{\exp\left(\sum_{q=1}^{t} \theta^*_{jq} - b_{i(t)}\right)}{1 + \exp\left(\sum_{q=1}^{t} \theta^*_{jq} - b_{i(t)}\right)}$$

Where:
$\theta^*_{j1} = \theta_{j1}$ is the baseline (time 1) level of the latent trait for the jth respondent
$\theta^*_{j2} = \theta_{j2} - \theta_{j1}$ is the change in the level of the latent trait from time 1 to time 2 for the jth respondent
$\theta^*_{jt} = \theta_{jt} - \theta_{j(t-1)}$ is the change in the level of the latent trait from time t − 1 to time t for the jth respondent, with t = 2, …, T
$b_{i(t)}$ is the difficulty of the ith item nested within test administration t

There must be common items across test form administrations, and the difficulty is assumed constant for a given common item. This maintains the metric across forms.

• This model parameterizes the latent trait scores for each individual as an initial trait level followed by T − 1 latent change scores
  – It is multivariate in the sense that each individual has T latent trait scores
    • However, each of these scores relates to positions on a single unidimensional continuum

Note that:

$$\theta_{jt} = \sum_{q=1}^{t} \theta^*_{jq}$$

So the latent trait level for the jth individual at time t (i.e., the composite trait at time t) is the sum of the initial level and all the latent change scores up to time t.

• Along with estimates of the aforementioned parameters, one also obtains estimates of the latent variable means and the correlation matrix for these latent variables:

$$\mu_{\theta^*_1}, \mu_{\theta^*_2}, \ldots, \mu_{\theta^*_T} \qquad R = \begin{bmatrix} r_{11} & r_{21} & \cdots & r_{T1} \\ r_{12} & r_{22} & \cdots & r_{T2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1T} & r_{2T} & \cdots & r_{TT} \end{bmatrix}$$

Advantages of the Multidimensional IRT Approach to Change
• Traditional benefits of IRT models that fit the data
  – Sample invariant interpretation of item parameters
  – Item invariant interpretation of person parameters
  – Index of precision at the individual level
• Advantages to measuring change with this multidimensional IRT approach
  – Parameterizing change as an additional dimension in an IRT model eliminates the reliability paradox associated with observed change scores in classical test theory
    • Higher correlations between pretest and posttest lead to less reliable observed change scores
    • The precision of IRT measures of latent change does not depend on pretest to posttest correlations
    • Small changes in observed scores may have a different meaning when the initial observed score is extreme rather than more moderate
  – Because the relationship between the expected test score and the latent trait is nonlinear, an IRT model allows for this relationship

Further Generalization of the Basic Model
• One can easily extend the MRMLC to more general situations
  – Allow for graded (polytomous) responses
    Wang, Wilson & Adams (1998)
    Wang & Chyi-In (2004)
• We have generalized the basic model further in this project by allowing items to vary in their discrimination capability
  – Form a similar model of change using Muraki’s (1991) generalized partial credit model

$$P(X_{ji(t)} = x \mid \theta^*_{j1}, \ldots, \theta^*_{jT}) = \frac{\exp\left[\sum_{k=0}^{x} \alpha_{i(t)} \left(\sum_{q=1}^{t} \theta^*_{jq} - \tau_{i(t)k}\right)\right]}{\sum_{w=0}^{M_i} \exp\left[\sum_{k=0}^{w} \alpha_{i(t)} \left(\sum_{q=1}^{t} \theta^*_{jq} - \tau_{i(t)k}\right)\right]}$$

Where:
$\theta^*_{j1} = \theta_{j1}$ is the baseline level of the latent trait for the jth respondent
$\theta^*_{j2} = \theta_{j2} - \theta_{j1}$ is the change in the level of the latent trait from baseline to time 2 for the jth respondent
$\theta^*_{jt} = \theta_{jt} - \theta_{j(t-1)}$ is the change in the level of the latent trait from time t − 1 to time t for the jth respondent, with t = 2, …, T
$\tau_{i(t)k}$ is the kth step difficulty parameter for the ith item on test administration t
$\alpha_{i(t)}$ is the discrimination parameter for the ith item on test administration t

Again, these item parameters are held constant for common items on successive test administrations.

• Also get means and correlations for the latent variables:

$$\mu_{\theta^*_1}, \mu_{\theta^*_2}, \ldots, \mu_{\theta^*_T} \qquad R = \begin{bmatrix} r_{11} & r_{21} & \cdots & r_{T1} \\ r_{12} & r_{22} & \cdots & r_{T2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1T} & r_{2T} & \cdots & r_{TT} \end{bmatrix}$$

Example 1: Beck Depression Inventory
• 21 self-report items designed to measure depression
  – Two items were clearly not appropriate for a cumulative IRT model
    • Appetite loss and weight loss
  – Remaining items relate to:
    • Sadness, discouragement, failure, dissatisfaction, guilt, persecution, disappointment, blame, suicide, crying, irritation, interest in others, decisiveness, attractiveness appraisal, ability to work, ability to sleep, tiring, worry, sexual interest
  – Four response categories per item
    • Graded item responses coded as 0 to 3
  – Higher item scores are indicative of more severe symptoms
  – 1322 subjects in an alcohol treatment clinical trial
  – Responses from baseline, end of the 3-month alcoholism treatment period, and 9-month follow-up

• Dimensionality Assessment

  Time        Eigenvalue Ratio
  Baseline    7.01 / 1.32
  3 Months    7.72 / 1.23
  9 Months    7.83 / 1.39

• Classical Test Theory Statistics

  Time        Mean Score   s.d.   Coefficient alpha
  Baseline    9.52         7.94   .90
  3 Months    6.75         7.29   .90
  9 Months    6.94         7.45   .91

• Classical Test Theory Statistics (cont.)

  Time        ITC Range    Mean Obs.   Obs. Range
  Baseline    (.34, .64)   .50         (.12, .76)
  3 Months    (.20, .72)   .36         (.11, .53)
  9 Months    (.36, .71)   .37         (.13, .53)

  Classification   Baseline   3 Mo.    9 Mo.
  No Depression    56.2%      71.4%    69.1%
  Mild             29.5%      19.7%    20.9%
  Moderate         10.8%       6.3%     7.9%
  Severe            3.5%       2.6%     2.1%

• Parameter Estimation
  – Markov Chain Monte Carlo estimation with WinBUGS
    MVN($\mu$, S) prior for $\theta^*_{jt}$
    N(0, 4) prior for $\tau_{i(t)1}$
    LN(0, .25) prior for $\alpha_{i(t)}$
  – Estimation requires two constraints on a common item
    Set one step difficulty parameter and one discrimination parameter to constant values

• Item Parameter Estimates

  Range          Mean
  (1.37, 2.38)   1.82
  (.43, 2.73)    1.62

• Test Characteristic Curve (for Composite Theta at Time t)
• Test Information Function (for Composite Theta at Time t)

• Estimated Person Distribution Hyperparameters

  Latent variable                                    $\hat{\mu}_{\theta^*_{jt}}$   $\hat{\sigma}_{\theta^*_{jt}}$
  Baseline                                            .362                          .861
  Change from Baseline to Tx End (3 Months)          -.525                          .856
  Change from Tx End to Follow-up (3 to 9 Months)     .002                          .829

• Estimated Correlation Among Person Parameters

$$\hat{R} = \begin{bmatrix} 1 & .18 & .09 \\ .18 & 1 & .34 \\ .09 & .34 & 1 \end{bmatrix}$$

EAP Person Estimates of Latent Baseline Level and Change

Example 2: Simulated Multiple Forms Design
• Two assessment periods with a 20-item form administered at each testing period
  – Four items are common across test forms
  – Item parameters sampled from 3-category items from the 1998 NAEP Technical Report

• True Item Parameters

  Parameter         Form     Range           Mean
  Step difficulty   Form 1   (-1.01, 1.74)   .11
  Step difficulty   Form 2   (-1.01, 1.70)   .50
  Discrimination    Form 1   (.56, 1.23)     .90
  Discrimination    Form 2   (.56, 1.57)     1.00

• Person parameters at Time 1 and change at Time 2 were sampled from a bivariate normal distribution with r = -.243

$$\theta^*_{j1} \sim N(0, 1) \qquad \theta^*_{j2} \sim N(.5, 1.0625)$$

• 2000 simulees

• Estimated Item Parameters (true values in parentheses)

  Parameter         Form     Range                          Mean
  Step difficulty   Form 1   (-.99, 1.74) ((-1.01, 1.74))   .17 (.11)
  Step difficulty   Form 2   (-.99, 1.87) ((-1.01, 1.70))   .61 (.50)
  Discrimination    Form 1   (.53, 1.15) ((.56, 1.23))      .85 (.90)
  Discrimination    Form 2   (.53, 1.43) ((.56, 1.57))      .96 (1.00)

• Test Characteristic Curves (for Composite Theta at Time t)
• Test Information Functions (for Composite Theta at Time t)

• Estimated Person Distribution Hyperparameters (true values in parentheses)

  Latent variable                 $\hat{\mu}_{\theta^*_{jt}}$   $\hat{\sigma}_{\theta^*_{jt}}$
  Time 1                          .07 (.00)                     1.08 (1.00)
  Change from Time 1 to Time 2    .54 (.50)                     1.10 (1.03)

• Estimated Correlation Among Person Parameters

$$\hat{R} = \begin{bmatrix} 1 & -.30 \\ -.30 & 1 \end{bmatrix} \qquad (\text{true } r = -.243)$$

EAP Person Estimates of Latent Baseline Level and Change

Next Steps
• Recovery simulations
  – In progress; so far, so good
• Want to try this out with real student proficiency data
  – Do you have any to share? james.roberts@psych.gatech.edu
• Want to investigate alternative estimation strategies for the new model
  – WinBUGS is really slow
  – NLMIXED would probably be quite slow too
  – MMAP should work well, but will require a lot of effort to develop a general program

The Sprout Model
• The assessment is p-dimensional at baseline
• Individuals change along the p dimensions, but q new dimensions “sprout” out across time
  – Individuals change along the new dimensions as well
• Could look at change on all dimensions or project onto some subset of dimensions
• Similar to work that Reckase and Martineau (2004) have done with MIRT
  – Strategies differ in how change is parameterized
  – Sprout model emphasizes change over repeated measurements of the same respondents rather than vertical scaling of cross-sectional groups
• Potential problems
  – Identification
  – Data demands required for reasonable parameter recovery

Summary
• The multidimensional IRT approach to change has the advantages of other IRT models and can alleviate some problematic aspects of measuring change from a traditional classical test theory perspective
• The model presented here is quite general and can be applied to a variety of testing situations
• It leads to some very intuitive multi-trait generalizations
  – The practicality of implementing these generalizations remains to be seen
    • We are hopeful

Thanks!
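The response probabilities of the generalized partial credit change model presented earlier can be sketched in a few lines of Python. This is only an illustration, not the WinBUGS code used in the project. The function names, the symbols `alpha` (discrimination) and `tau` (step difficulties), and the example step values are assumptions made for the sketch, and the k = 0 term of the sum is set to zero because it cancels between numerator and denominator. With two categories and a discrimination of 1, the same function gives the MRMLC success probability.

```python
import math


def gpcm_change_probs(theta_star, alpha, tau):
    """Category probabilities for one item at occasion t = len(theta_star).

    theta_star : baseline level followed by the latent change scores up to t
    alpha      : item discrimination (assumed symbol; constant for a common item)
    tau        : step difficulties for steps 1..M (assumed symbol);
                 the k = 0 term is fixed at 0 because it cancels
    """
    theta_t = sum(theta_star)  # composite trait at occasion t
    z = [0.0]                  # cumulative logits, one per response category
    for step in tau:
        z.append(z[-1] + alpha * (theta_t - step))
    top = max(z)               # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in z]
    total = sum(weights)
    return [w / total for w in weights]


def mrmlc_prob(theta_star, b):
    """MRMLC probability of a correct response: binary item, alpha = 1."""
    return gpcm_change_probs(theta_star, alpha=1.0, tau=[b])[1]


if __name__ == "__main__":
    # Hypothetical respondent: baseline of .362 and a change of -.525,
    # answering a hypothetical 4-category item at the second occasion.
    probs = gpcm_change_probs([0.362, -0.525], alpha=1.82, tau=[0.5, 1.5, 2.5])
    print([round(p, 3) for p in probs])
    print(round(mrmlc_prob([0.0, 0.5], b=0.5), 3))
```

Passing the baseline and change scores as a list keeps the composite trait explicit: scoring the same item at a later occasion only means appending another change score.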
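The reliability paradox mentioned under the advantages of the IRT approach can be illustrated with the standard classical test theory formula for the reliability of a difference score. The sketch below is a minimal illustration: the function name is hypothetical, the reliabilities of .90 echo the coefficient alpha values reported for the BDI, and the pretest to posttest correlations of .5 and .8 are made-up values chosen only to show the effect.

```python
def difference_score_reliability(rel_x, rel_y, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical test theory reliability of the observed difference Y - X.

    rel_x, rel_y : reliabilities of the pretest and posttest scores
    r_xy         : observed pretest to posttest correlation
    sd_x, sd_y   : observed standard deviations of the two scores
    """
    covariance_term = 2.0 * r_xy * sd_x * sd_y
    true_variance = sd_x ** 2 * rel_x + sd_y ** 2 * rel_y - covariance_term
    observed_variance = sd_x ** 2 + sd_y ** 2 - covariance_term
    return true_variance / observed_variance


if __name__ == "__main__":
    # Hypothetical correlations; the .90 reliabilities mirror the BDI alphas.
    for r_xy in (0.5, 0.8):
        rel_d = difference_score_reliability(0.90, 0.90, r_xy)
        print(f"r_xy = {r_xy:.1f} -> reliability of change = {rel_d:.2f}")
```

With equal variances and both reliabilities at .90, raising the pretest to posttest correlation from .5 to .8 drops the reliability of the observed change score from .80 to .50, which is the pattern the slide describes.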
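As a worked check on the simulated design in Example 2, the distribution of the composite trait at Time 2 follows from the identity that the composite is the baseline level plus the change score. The derivation below assumes that the second argument of N(.5, 1.0625) is a variance, which agrees with the true standard deviation of 1.03 reported for the change score, and it uses the stated correlation of r = -.243.

```latex
\begin{align*}
E(\theta_{j2}) &= E(\theta^*_{j1}) + E(\theta^*_{j2}) = 0 + .5 = .5 \\
\operatorname{Var}(\theta_{j2})
  &= \operatorname{Var}(\theta^*_{j1}) + \operatorname{Var}(\theta^*_{j2})
     + 2\, r\, \sigma_{\theta^*_{j1}} \sigma_{\theta^*_{j2}} \\
  &= 1 + 1.0625 + 2(-.243)(1)\sqrt{1.0625} \\
  &\approx 2.0625 - 0.5010 = 1.5615 \\
\sigma_{\theta_{j2}} &\approx \sqrt{1.5615} \approx 1.25
\end{align*}
```

Under these assumptions the composite trait at Time 2 has a mean of .5 and a standard deviation of about 1.25, so the negative correlation between baseline and change keeps the spread of the composite well below what independent components would give (a variance of 2.0625).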