Identification of Misfit Item Using IRT Models
Dr Muhammad Naveed Khalid

Outline
1. Item Response Theory
2. Model Fit
3. Fit Procedures
4. Issues and Limitations
5. Lagrange Multiplier (LM) Test
6. Simulation Design
7. Results
8. Conclusions

Item Response Theory
Item response theory (IRT), also known as latent trait theory, strong true score theory, or modern mental test theory, is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables.
Some well-documented advantages over classical test theory (CTT) are:
1) Invariance of item and ability estimates
2) Computer adaptive testing
3) Equating
4) Development of item banks
5) Reliability

Model Fit
IRT models are based on a number of explicit assumptions.
Unidimensionality: The items of a test should measure only one ability, trait, or construct.
DIF (MI): The item responses can be described by the same parameters in all sub-populations, that is, measurement invariance (MI) holds and there is no differential item functioning (DIF).
ICC: The shape of the item response function (item characteristic curve), which describes the relation between the latent variable and the observable responses to items, is invariant.
Local independence: Responses to different items are independent given the value of the latent trait.
Speededness: The score-oriented perspective focuses on the effect of speededness on examinees' test scores, while the fairness-oriented perspective focuses on the degree to which speededness adversely affects some examinees relative to others.

Consequences of Misfit
Yen (1981) and Wainer & Thissen (1987) have shown that inadequate model-data fit has adverse consequences such as:
1) Biased ability estimates
2) Unfair ranks
3) Wrongly equated scores
4) Reduced validity

Fit Procedures
The fit of item response theory models can be evaluated by computing residuals and the associated test statistics.
Chi-Square Statistics
Tests of the discrepancy between the observed and expected frequencies:
Pearson-type item-fit indices (Yen, 1984; Bock, 1972).
Likelihood-ratio-based item-fit indices (McKinley & Mills, 1985).

Issues and Limitations
Glas and Suarez Falcon (2003) note that the standard theory for chi-square statistics does not hold in the IRT context because the observations on which the statistics are based do not have a multinomial or Poisson distribution.
Glas and Suarez Falcon (2003) have also criticized these procedures for failing to take into account the stochastic nature of the item parameter estimates.
Orlando and Thissen (2000) argued that because the observed proportions correct are based on model-dependent trait estimates, the degrees of freedom may not be as claimed.

Issues and Limitations (continued)
The statistics have very high power in large samples, so even negligible misfit is flagged as significant.
They lose their validity when the model is grossly violated.
They do not directly reveal the impact of the model violation for the envisioned application.
They do not provide diagnostic information.

Lagrange Multiplier (LM) Test
Glas (1999) proposed the LM test for the evaluation of model fit.
LM tests are used for testing a restricted model against a more general alternative.
The LM test is based on the first-order partial derivatives of the log-likelihood function of the general model, evaluated at the maximum likelihood estimates of the restricted model.
Consider a null hypothesis about a model with parameters $\phi_0$. This model is a special case of a general model with parameters $\phi$, in which $\phi_0' = (\phi_{01}', c')$. The test statistic is

$$LM(c) = h(c)' \, W^{-1} \, h(c),$$

where $h(c)$ is the vector of first-order derivatives with respect to the parameters fixed at $c$, and $W$ is its covariance matrix. Under the null hypothesis, $LM(c)$ has an asymptotic chi-square distribution with degrees of freedom equal to the number of parameters fixed at $c$.

LM Item Fit Statistics
DIF:
$$P_i(\theta_n) = \frac{\exp(\alpha_i(\theta_n - \beta_i) + y_n \delta_i)}{1 + \exp(\alpha_i(\theta_n - \beta_i) + y_n \delta_i)}$$
where $y_n$ is a group indicator.
Null model: $\delta_i = 0$. Alternative model: $\delta_i \neq 0$.

LOC:
$$P(X_{ni} = 1, X_{nl} = 1 \mid \theta_n, \delta_{il}) = \frac{\exp(\alpha_i(\theta_n - \beta_i + \theta_n - \beta_l + \delta_{il}))}{1 + \exp(\alpha_i(\theta_n - \beta_i + \theta_n - \beta_l + \delta_{il}))}$$
Null model: $\delta_{il} = 0$. Alternative model: $\delta_{il} \neq 0$.

ICC:
$$P(X_{ni} = 1 \mid \theta_n, \delta_{ig}) = \frac{\exp(\alpha_i(\theta_n + \delta_{ig} - \beta_i))}{1 + \exp(\alpha_i(\theta_n + \delta_{ig} - \beta_i))}$$
Null model: $\delta_{ig} = 0$. Alternative model: $\delta_{ig} \neq 0$.

Simulation Design
The 1PL, 2PL, and 3PL models were used for data generation and calibration.
Test lengths of K = 10, 20, and 40 items and sample sizes of N = 100, 400, and 1000 examinees were used.
Item difficulty and discrimination parameters were drawn from standard normal and log-normal distributions, respectively.
Ability parameters were drawn from a standard normal distribution.
The effect size δ (degree of misfit) was set to 0.5 or 1.0.
The proportion of misfitting items in each test varied from 10% to 40%.
A nominal significance level of 5% was used.
100 replications were carried out in each condition of the study.

Results
Power and Type I error rate by test length (K), effect size (δ), and sample size (N) under the Rasch model: items with MI misfit
(10% and 20% give the percentage of misfitting items in the test.)
------------------------------------------------------------
                      Power         Type I error rate
 K    δ      N      10%    20%        10%    20%
------------------------------------------------------------
10   0.5    100    0.42   0.32       0.07   0.07
10   0.5    400    0.89   0.71       0.06   0.07
10   0.5   1000    0.99   0.99       0.05   0.06
10   1.0    100    0.84   0.76       0.11   0.15
10   1.0    400    1.00   1.00       0.07   0.08
10   1.0   1000    1.00   1.00       0.07   0.09
20   0.5    100    0.50   0.42       0.05   0.07
20   0.5    400    0.89   0.83       0.06   0.06
20   0.5   1000    1.00   0.99       0.05   0.08
20   1.0    100    0.95   0.86       0.05   0.09
20   1.0    400    1.00   1.00       0.06   0.07
20   1.0   1000    1.00   1.00       0.08   0.09
40   0.5    100    0.48   0.47       0.11   0.14
40   0.5    400    0.89   0.88       0.06   0.07
40   0.5   1000    1.00   1.00       0.06   0.08
40   1.0    100    0.95   0.92       0.11   0.12
40   1.0    400    1.00   1.00       0.06   0.07
40   1.0   1000    1.00   1.00       0.06   0.07
------------------------------------------------------------

Power and Type I error rate by test length (K), effect size (δ), and sample size (N) under the Rasch model: items with LOC misfit
(10% and 20% give the percentage of misfitting items in the test.)
------------------------------------------------------------
                      Power         Type I error rate
 K    δ      N      10%    20%        10%    20%
------------------------------------------------------------
10   0.5    100    0.22   0.23       0.06   0.06
10   0.5    400    0.60   0.47       0.06   0.08
10   0.5   1000    0.96   0.85       0.06   0.08
10   1.0    100    0.71   0.52       0.10   0.12
10   1.0    400    1.00   0.90       0.08   0.10
10   1.0   1000    1.00   0.97       0.05   0.07
20   0.5    100    0.54   0.35       0.06   0.07
20   0.5    400    0.89   0.75       0.08   0.09
20   0.5   1000    1.00   0.97       0.05   0.06
20   1.0    100    0.86   0.83       0.09   0.09
20   1.0    400    1.00   0.99       0.09   0.10
20   1.0   1000    1.00   1.00       0.06   0.08
40   0.5    100    0.34   0.36       0.05   0.05
40   0.5    400    0.48   0.44       0.06   0.05
40   0.5   1000    0.98   0.86       0.05   0.06
40   1.0    100    0.49   0.52       0.06   0.07
40   1.0    400    1.00   1.00       0.06   0.06
40   1.0   1000    1.00   1.00       0.05   0.06
------------------------------------------------------------

Power and Type I error rate by test length (K), effect size (δ), and sample size (N) under the Rasch model: items with ICC misfit
(10% and 20% give the percentage of misfitting items in the test.)
------------------------------------------------------------
                      Power         Type I error rate
 K    δ      N      10%    20%        10%    20%
------------------------------------------------------------
10   0.5    100    0.55   0.40       0.05   0.06
10   0.5    400    0.53   0.48       0.06   0.07
10   0.5   1000    0.77   0.61       0.05   0.05
10   1.0    100    0.97   0.77       0.05   0.06
10   1.0    400    0.97   0.78       0.05   0.05
10   1.0   1000    1.00   0.98       0.05   0.06
20   0.5    100    0.65   0.61       0.05   0.05
20   0.5    400    0.88   0.71       0.05   0.05
20   0.5   1000    0.81   0.68       0.05   0.06
20   1.0    100    0.97   0.90       0.06   0.06
20   1.0    400    0.95   0.87       0.05   0.05
20   1.0   1000    1.00   0.98       0.05   0.06
40   0.5    100    0.47   0.40       0.05   0.06
40   0.5    400    0.98   0.92       0.06   0.07
40   0.5   1000    1.00   0.96       0.05   0.05
40   1.0    100    0.53   0.49       0.06   0.06
40   1.0    400    0.99   0.96       0.05   0.06
40   1.0   1000    1.00   1.00       0.05   0.05
------------------------------------------------------------

An Empirical Example
Lagrange Multiplier Tests (DIF) for Rasch Model
(Obs and Exp are the observed and expected values in the focal and reference groups.)
----------------------------------------------------------------------
                                Focal group    Reference group   Abs.
 No.  Item      LM   df  Prob    Obs    Exp      Obs    Exp      Dif.
----------------------------------------------------------------------
  1   Item1    3.26   1  0.07   0.73   0.76     0.76   0.74     0.02
  2   Item2    0.63   1  0.43   0.95   0.95     0.95   0.95     0.01
  3   Item3    0.91   1  0.34   0.75   0.76     0.74   0.73     0.01
  4   Item4    1.48   1  0.22   0.78   0.80     0.79   0.77     0.01
  5   Item5    1.60   1  0.21   0.81   0.83     0.82   0.81     0.02
  6   Item6    0.00   1  0.96   0.76   0.76     0.73   0.73     0.00
  7   Item7    0.15   1  0.70   0.72   0.71     0.69   0.70     0.01
  8   Item8    0.37   1  0.54   0.91   0.90     0.88   0.88     0.01
  9   Item9    0.45   1  0.50   0.91   0.90     0.89   0.89     0.01
 10   Item10   0.02   1  0.90   0.83   0.83     0.81   0.81     0.00
 11   Item11   0.11   1  0.74   0.88   0.88     0.86   0.86     0.00
 12   Item12   4.68   1  0.03   0.91   0.89     0.86   0.88     0.02
 13   Item13   0.38   1  0.54   0.60   0.59     0.53   0.53     0.01
 14   Item14   0.08   1  0.77   0.60   0.61     0.58   0.58     0.00
 15   Item15   3.89   1  0.05   0.79   0.76     0.71   0.74     0.03
 16   Item16   0.14   1  0.71   0.70   0.69     0.64   0.65     0.00
 17   Item17   0.72   1  0.40   0.62   0.61     0.54   0.55     0.01
 18   Item18   0.53   1  0.47   0.49   0.50     0.47   0.46     0.01
 19   Item19   0.15   1  0.69   0.84   0.84     0.81   0.82     0.00
 20   Item20   0.04   1  0.83   0.74   0.74     0.71   0.71     0.00
 21   Item21   0.05   1  0.82   0.87   0.87     0.85   0.85     0.00
 22   Item22   1.72   1  0.19   0.79   0.80     0.78   0.77     0.01
 23   Item23   0.38   1  0.54   0.87   0.88     0.87   0.86     0.01
 24   Item24   2.77   1  0.10   0.85   0.87     0.87   0.85     0.02
 25   Item25   0.37   1  0.54   0.95   0.95     0.94   0.94     0.00
 26   Item26   2.47   1  0.12   0.65   0.63     0.56   0.58     0.02
 27   Item27   0.39   1  0.53   0.68   0.67     0.62   0.63     0.01
 28   Item28   0.02   1  0.89   0.51   0.52     0.48   0.47     0.00
 29   Item29   0.55   1  0.46   0.69   0.68     0.63   0.64     0.01
 30   Item30   0.12   1  0.73   0.71   0.71     0.69   0.68     0.00
----------------------------------------------------------------------

Conclusions
1. The fit statistics have a known asymptotic null distribution.
2. The fit statistics have sound statistical properties in terms of power and Type I error rates.
3. Detection rates increase in the order LM (MI), LM (LI), LM (ICC).
4. Power increases in the order 1PL, 2PL, 3PL.
5. These fit indices also provide a measure of effect size, which has the practical advantage of gauging the severity of misfit.
6. The performance of these indices deteriorates only slightly when a large proportion of the items misfit.
7. Sample size, test length, and degree of misfit are factors that influence Type I error rates and power.

Thank you for your kind attention. Questions?
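
Supplementary sketch: the data-generation step of the simulation design described earlier could look roughly like the following. This is a minimal illustration under stated assumptions, not the code used in the study: it covers only the Rasch model with DIF, it places the DIF on the first items of the test, it assigns examinees alternately to the reference and focal groups, and the function names (`rasch_prob`, `simulate_dif_data`, `proportion_correct`) are hypothetical. It uses only the Python standard library and does not compute the LM statistic itself.

```python
import math
import random


def rasch_prob(theta, beta, shift=0.0):
    """P(X = 1) under the Rasch model, with an optional DIF shift on the logit."""
    return 1.0 / (1.0 + math.exp(-(theta - beta + shift)))


def simulate_dif_data(n_persons=400, n_items=10, delta=0.5,
                      prop_misfit=0.1, seed=1):
    """Generate 0/1 responses; the first items carry DIF for the focal group."""
    rng = random.Random(seed)
    n_misfit = round(prop_misfit * n_items)
    # Item difficulties drawn from a standard normal distribution.
    betas = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    data, groups = [], []
    for n in range(n_persons):
        y = n % 2  # assumption: alternate reference (0) and focal (1) members
        theta = rng.gauss(0.0, 1.0)  # ability from a standard normal
        row = []
        for i, beta in enumerate(betas):
            # Only misfitting items receive the shift, and only in the focal group.
            shift = delta * y if i < n_misfit else 0.0
            p = rasch_prob(theta, beta, shift)
            row.append(1 if rng.random() < p else 0)
        data.append(row)
        groups.append(y)
    return data, groups, betas


def proportion_correct(data, groups, item, group):
    """Observed proportion correct on one item within one group."""
    scores = [row[item] for row, y in zip(data, groups) if y == group]
    return sum(scores) / len(scores)


data, groups, betas = simulate_dif_data()
for g in (0, 1):
    print("group", g, "item 1:", round(proportion_correct(data, groups, 0, g), 2))
```

Comparing the observed proportions correct of the two groups on the DIF item gives a rough descriptive counterpart to the Obs columns of the empirical example; a full replication would repeat the generation 100 times per condition and apply the LM test to each data set.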