Using Response Times to Improve Parameter Estimation for Speeded Test Items

James A. Wollack
Vincent Woo
University of Wisconsin-Madison

Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA, April 2009

April 16, 2009

RUNNING HEAD: Parameter Estimation Using Response Times

Using Response Times to Improve Parameter Estimation for Speeded Test Items

Test speededness effects refer to the impact on test performance (at both the item and person level) of administering a test under time constraints that are irrelevant to the construct being measured. These artificial time constraints affect examinees differently. As examinees run out of time, some will resort to random guessing on the remaining items; others will adjust the way they budget their time throughout the test, so that they attend to all items but move through the test more quickly than is comfortable. Regardless of examinees' individual tendencies when answering questions under time pressure, when examinees are speeded, their performance on items at the end of the test often suffers. As a result, ability estimates for speeded examinees are often underestimated (van der Linden, Scrams, & Schnipke, 1999), and item difficulty parameters for end-of-test items are often overestimated (Douglas, Kim, Habing, & Gao, 1998; Oshima, 1994).

Much of the work on psychometric models that account for speed grew out of psychological studies of reaction time, in which participants were presented with straightforward tasks and asked to respond as quickly as possible (Luce, 1986; Maris, 1993). Within the context of educational tests, in which students' achievement, rather than processing speed, is of primary concern, models of test speededness have focused on improving the estimation of person and/or item parameters.

By and large, three separate approaches have been used to model test speededness. The first approach uses response-time (RT) data from computerized assessments. Schnipke and Scrams (1997) used a multi-class mixture model to identify examinees who switch from solution behavior to hurried behavior for items late in the test, as evidenced by a change in their RT relative to what is expected. A modification of the Schnipke and Scrams mixture approach was proposed by Wise and colleagues (Wise & DeMars, 2006; Wise & Kong, 2005) within the context of examinee effort, but the model is also appropriate for detecting test speededness.

The second approach uses item response data to identify examinees whose performance on end-of-test items is significantly worse than it was for items earlier in the test. Most of this research has been conducted within the context of latent class mixture models (Bolt, Cohen, & Wollack, 2002; Yamamoto, 1987; Yamamoto & Everson, 1997), in which a latent class of nonspeeded examinees is distinguished from one or more latent classes of examinees exhibiting speeded tendencies on the end-of-test items. Recently, an alternative approach has been developed in which two continuous examinee-specific parameters are used to model the point at which speededness first occurs and the rate at which the examinee's performance deteriorates (Goegebeur, De Boeck, Wollack, & Cohen, 2008).

The third approach is to simultaneously model item response and RT data. Over the past decade, as computer-based testing has become increasingly popular, so too have these more sophisticated models.
Wang and Hanson (2005) developed a four-parameter logistic response time model that incorporates into the logit a quantity equal to −ρidj/tij, the negative product of a slowness parameter for person i (ρi) and a slowness parameter for item j (dj), divided by the RT of person i on item j (tij). van der Linden (2007) developed a hierarchical model, in which item response and RT models are specified at level 1 for each combination of item and person, and the second level of the model specifies the relationships between the parameters in the first level. This model was expanded by Klein Entink, Kuhn, Hornke, and Fox (2009) to cognitive tests with rule-based items, for purposes of understanding the relationships between response time, item accuracy, and specific cognitive components of items.

The vast majority of the research on models of test speededness has concentrated on either understanding the relationship between item performance and time, or on improving the quality of ability estimation. Within educational testing, using a model to purify estimates of examinee ability may be appropriate for low-stakes tests or when group-level data are of primary concern (e.g., for No Child Left Behind); however, it is not entirely clear that it is legal to decontaminate examinees' ability estimates if those test scores will be used to inform high-stakes decisions. Under Title I of the Civil Rights Act of 1991 (1991), it is illegal to use different cut-scores for different manifest groups of test takers (e.g., males and females). Though speededness models have not been subjected to litigation, it is unclear whether it would be deemed permissible to score exams in different ways for examinees in different latent classes or with different distributions of RT parameters.

Our attention in this study is on using speededness models to improve the estimation of item parameters. As a result, we have chosen to concentrate on the latent class mixture modeling approach to account for speededness. The latent class approach is conceptually appealing for educational tests because the purpose in these models is to find a set of item parameter estimates (for a nonspeeded latent class) that most closely reflects how the end-of-test items would perform were the test entirely nonspeeded. Within the context of high-stakes educational tests, it is presumably these purified parameter estimates that would be applied to all examinees for purposes of ability estimation (Wollack, Cohen, & Wells, 2003). In addition, several of the studies involving mixture models have concentrated squarely on item parameter recovery (Bolt et al., 2002; Mroch & Bolt, 2006; Mroch, Bolt, & Wollack, 2005), so the qualities of mixture models for item parameter estimation are reasonably well understood. In general, mixture models have been found to greatly reduce the amount of bias in item parameters caused by test speededness. However, when the end-of-test items are very difficult, mixture models can struggle to accurately distinguish the classes, thereby resulting in item parameter estimates for the nonspeeded class that are still partially confounded by test speededness.

In this study, two mixture model approaches, one based only on item response data and a second that also incorporates RT data, are compared with respect to item parameter recovery for different types of simulated speededness.
The expectation is that considering RT data in conjunction with item response data will serve to improve the quality of the item parameter estimates and provide a more accurate picture of the extent of test speededness.

Mixture Models Used in this Study

Mixture Rasch Model

Bolt et al. (2002) used a 2-class mixture Rasch model (MRM) to estimate item difficulty parameters for two distinct latent classes of examinees, with constraints imposed so that examinees in one class were able to complete the entire exam comfortably and examinees in the second class, a speeded class, were not. Under this model, P(u = 1 | g, θg), the probability of an examinee answering item i correctly given membership in latent class g, is given by

P(u = 1 | g, θg) = exp(θg − βig) / [1 + exp(θg − βig)],   (1)

where g indexes the latent class (g = 1, 2), θg denotes the examinee's latent ability within latent class g, and βig denotes the difficulty of item i within latent class g.

To distinguish the two latent classes, two sets of constraints are imposed. Equality constraints across classes are imposed on the first k item difficulties (i.e., βi1 = βi2 for i = 1, . . . , k), where k identifies the last item for which one is confident that speed is not an issue. Ordinal constraints across classes are applied to the last m items (i.e., βi1 > βi2 for i = n − m + 1, . . . , n), where n − m + 1 identifies the first end-of-test item where speededness is suspected. Items k + 1 through n − m are either left unconstrained or are excluded from the model for purposes of estimating the nonspeeded-class difficulties for the end-of-test items.

Mixture Rasch Model with Response Time Components

Meyer (2008) extended the Bolt et al. (2002) MRM to utilize information from response times to help distinguish the latent classes. Under the mixture Rasch model with response time components (MRM-RT), item responses and RTs are assumed to be locally independent, given class membership. As a result, item responses are used to estimate the item difficulty parameters, βig, and examinee ability parameters, θg, but are not used to estimate the median or standard deviation of the RT distributions. Similarly, RTs are used to estimate RT parameters, but not item or examinee parameters. Both RTs and item responses are used to determine latent class membership, as well as the proportion of examinees in each latent class, πg.

Under the MRM-RT, P(u = 1 | g, θg) is specified by (1). However, the MRM-RT differs from the MRM in that a mixture RT distribution is also specified, in which the parameters of the distribution vary by class. Any of a number of different distributions could be used to describe RTs (Klein Entink et al., 2009); however, the lognormal is most commonly used for educational tests and has generally been found to fit RT distributions rather well (Schnipke & Scrams, 1997; van der Linden, 2006). Meyer (2008) used a mixture lognormal distribution to model RT, and we have done similarly in this study. The mixture lognormal density, f(tij | g, mig, σig), is given by

f(tij | g, mig, σig) = [1 / (tij σig √(2π))] exp{−[ln(tij) − mig]² / (2σig²)},   (2)

where tij denotes the RT for examinee j on item i, mig denotes the median log response time for item i within latent class g, and σig denotes the standard deviation of log response time for item i within latent class g. To distinguish the latent classes in a way consistent with a speededness hypothesis, Meyer (2008) imposed the constraint that mi2 > mi1. Given these constraints, class 1 would be the speeded class and class 2 would be the nonspeeded class.
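To make the two likelihood components concrete, the short sketch below evaluates the class-specific Rasch probability in Equation (1) and the lognormal RT density in Equation (2). This is only an illustrative implementation of the formulas as written above; the function names and example values are ours and are not part of the estimation code used in the study.

import numpy as np

def rasch_prob(theta, beta_ig):
    # Equation (1): probability of a correct response for an examinee with
    # ability theta in latent class g on an item with difficulty beta_ig.
    return np.exp(theta - beta_ig) / (1.0 + np.exp(theta - beta_ig))

def lognormal_rt_density(t_ij, m_ig, sigma_ig):
    # Equation (2): lognormal density of response time t_ij, given the
    # class-specific median log RT (m_ig) and SD of log RT (sigma_ig).
    return (1.0 / (t_ij * sigma_ig * np.sqrt(2.0 * np.pi))
            * np.exp(-(np.log(t_ij) - m_ig) ** 2 / (2.0 * sigma_ig ** 2)))

# Illustrative values (not from the study): a moderately able examinee on an
# average-difficulty item, and an RT of 60 on an item whose nonspeeded
# median log RT is 4.0 with SD 0.5.
print(rasch_prob(0.5, 0.0))                 # approximately .62
print(lognormal_rt_density(60.0, 4.0, 0.5))

Under the MRM-RT, these two pieces are multiplied within a class (reflecting local independence of responses and RTs given class membership) and then mixed across classes with weights πg.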
Purpose of Study

Both Bolt et al. (2002) and Meyer (2008) found that their models offered significant improvements over the (single-class) Rasch model with respect to both parameter estimation and model fit. This suggests that the mixture approach, both with and without RT data, offers an improvement upon the Rasch model, which does not account for test speededness. However, no attempt has been made to compare the two mixture approaches to see whether, and to what degree, item parameter estimation under the MRM can be improved by attending to RT data. This study represents an initial attempt to explore this issue.

Method

Data Simulation

Data were simulated under the MRM-RT, using the computer program WinBUGS (Spiegelhalter, Thomas, & Best, 2000). All replications included N = 3,000 examinees and n = 24 items. This simulation followed the model of including only items that could either safely be assumed to be free of speededness (i.e., beginning-of-test items) or be known to include speededness (i.e., end-of-test items). Performance on middle-of-test items, which might be partially speeded for some examinees, was not simulated. Therefore, the same generating parameters were used for the first 16 items, whereas separate parameters were generated for the two latent classes for the remaining 8 items, so that βi1 < βi2.

Generating values of item and RT parameters were informed by previous research. Based on the magnitude of contamination in item parameter estimates for end-of-test items found in empirical studies of latent class speededness (Bolt et al., 2002; Mroch, Bolt, & Wollack, 2005), item difficulty parameters for the nonspeeded class (class 1) and speeded class (class 2) were generated so that βi1 ~ N(0, 1) and βi2 = βi1 + Yi, where Yi ~ N(3.5, 0.66²). Generating pairs of RT medians and standard deviations for the end-of-test items were selected from Schnipke and Scrams (1997). For the nonspeeded class, RT parameters (mi1 and σi1) were randomly selected from the estimated solution-behavior RT values for a 25-item test shown in Table 1 of their article, under the common-guessing mixture model (see Note 1). For the speeded class, mi2 and σi2 were taken to be ln(9.73) = 2.275 and 1.26, respectively, the values estimated by Schnipke and Scrams for the guessing-behavior distribution. The nonspeeded-class RT distributions for the 8 end-of-test items, along with the common RT distribution for the speeded class, are shown in Figure 1. Additionally, parameter values for μθg, the average ability level for class g, were fixed such that μθ1 = 1.082 and μθ2 = −0.305, and mixing proportions were fixed so that π1 = .80 and π2 = .20. Generating values of μθg and πg came from an analysis of English placement data analyzed under the MRM (Wollack et al., 2003).

Insert Figure 1 About Here

Item Ordering

Two separate item orders were considered in this study. In the first condition, the 24 randomly generated item difficulty parameters for the nonspeeded class were randomly assigned to their locations on the test. In the second condition, the 24 item difficulty parameters were ordered from easy to hard. It was expected that with hard items at the end of the test, the MRM might struggle to distinguish poor performance due to speededness from that due to the items being difficult. The generating item difficulty values are shown in Table 1 for the two item ordering conditions.
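As a rough sketch of the generating model just described, the following code simulates one replication with the stated design: N = 3,000 examinees, 24 items with class-specific difficulties for the last 8, lognormal RTs on those 8 items, mixing proportions of .80/.20, and class mean abilities of 1.082 and −0.305. The data for the study were generated in WinBUGS; this Python version only illustrates the structure, and the nonspeeded-class RT parameters below are placeholders rather than the Schnipke and Scrams (1997) values actually used.

import numpy as np

rng = np.random.default_rng(1)
N, n_items, n_speeded = 3000, 24, 8

# Class 1 = nonspeeded (pi1 = .80), class 2 = speeded (pi2 = .20).
g = rng.choice([1, 2], size=N, p=[0.80, 0.20])
theta = np.where(g == 1, rng.normal(1.082, 1.0, N), rng.normal(-0.305, 1.0, N))

# Item difficulties: common across classes for the first 16 items; for the
# last 8, the speeded-class values are shifted upward by Y ~ N(3.5, 0.66^2).
beta1 = rng.normal(0.0, 1.0, n_items)
beta2 = beta1.copy()
beta2[-n_speeded:] += rng.normal(3.5, 0.66, n_speeded)
beta = np.where((g == 1)[:, None], beta1, beta2)             # N x 24

# Item responses from the class-specific Rasch model (Equation 1).
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta)))
u = rng.binomial(1, p)

# RTs for the 8 end-of-test items, lognormal with class-specific parameters.
# Nonspeeded values are hypothetical placeholders; speeded values as above.
m_ns, s_ns = np.full(n_speeded, 4.0), np.full(n_speeded, 0.6)
m_sp, s_sp = 2.275, 1.26
m = np.where((g == 1)[:, None], m_ns, m_sp)
s = np.where((g == 1)[:, None], s_ns, s_sp)
rt = rng.lognormal(mean=m, sigma=s)                          # N x 8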
Because this is just a pilot study, only five replications were generated in each condition.

Insert Table 1 About Here

Estimation Models

Three different models were fit to the data: the Rasch model, the MRM, and the MRM-RT. By examining all three of these models on the same datasets, it is possible to better understand the added benefit of (a) adopting a multi-class perspective (through comparison of the Rasch model and MRM), and (b) incorporating RT (through comparison of the MRM and MRM-RT). Parameters for all three models were estimated using a Markov chain Monte Carlo (MCMC) algorithm (Gilks, Richardson, & Spiegelhalter, 1996; Patz & Junker, 1999a, 1999b), as implemented in WinBUGS (Spiegelhalter et al., 2000). Based on several pilot runs, a burn-in of the first 1,000 iterations was determined to be appropriate for the different models. A minimum of 5,000 iterations were sampled after burn-in. The average sampled value across all iterations after burn-in was taken as the parameter estimate.

For purposes of estimating the Rasch model, the standard normal was used as the prior distribution for both the βi and θj parameters. The following were the prior distributions for the MRM:

βi1 ~ Normal(0, 1)
βi2 ~ Normal(0, 1), such that βi1 < βi2
θj ~ Normal(μθg, 1)
μθg ~ Normal(0, 1)

In addition to utilizing all the prior distributions from the MRM, the MRM-RT further imposed the following prior distributions on the RT parameters:

mi1 ~ Normal(4.30, 1.04²)
mi2 ~ Normal(2.30, 0.76²)
σi1 ~ Normal(4.00, 1.00²)
σi2 ~ Normal(0.65, 0.40²)

The πg parameters were fixed at .80 and .20, for g = 1 and g = 2, respectively, in both the MRM and MRM-RT. In this study, although the priors placed on the mig used different mean values to reflect that response times for the nonspeeded group will generally be longer than for the speeded group, we did not impose the constraint that mi1 > mi2, as was done in Meyer (2008). Instead, in the interest of better understanding the added benefit of incorporating RT into the MRM, we chose to distinguish the latent classes by placing constraints on the item parameters (i.e., βi1 < βi2). A copy of the WinBUGS code used to estimate the MRM-RT can be found in the Appendix.

Evaluative Measures

To assess the quality of item parameter recovery, root mean square errors (RMSEs) and biases were computed between the generating parameters and their estimates for all models. Prior to assessing RMSEs and biases, the item parameter estimates were linked to the metric of the generating parameters using the test characteristic curve method (Stocking & Lord, 1983), as implemented in the EQUATE program (Baker, Al-Karni, & Al-Dosary, 1991). The linking was performed using all 24 items. Because two different sets of item parameters exist for the MRM and MRM-RT (one each for the speeded and nonspeeded classes), only the parameter estimates from the nonspeeded class (class 1) were used for both models, because it is these estimates that were assumed to be purified of the contaminating effects of test speededness. In the interest of focusing more clearly on the quality of item parameter recovery for end-of-test items in speeded tests, RMSEs and biases were computed only across the 8 end-of-test items.

In addition, both the AIC (Akaike, 1974) and BIC (Schwarz, 1978) information criteria were estimated for all models in each replication to help determine which model provides the best fit for the data.
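The evaluative measures themselves are straightforward to compute. The sketch below shows RMSE and bias over the 8 end-of-test items, assuming the estimates have already been placed on the generating metric via the Stocking and Lord (1983) linking, along with the standard AIC and BIC formulas. The arrays, offsets, and log-likelihood value are hypothetical and serve only to illustrate the calculations.

import numpy as np

def end_of_test_recovery(beta_true, beta_hat, n_speeded=8):
    # RMSE and bias over the last n_speeded items, computed on estimates
    # that have already been linked to the metric of the generating values.
    diff = beta_hat[-n_speeded:] - beta_true[-n_speeded:]
    return np.sqrt(np.mean(diff ** 2)), np.mean(diff)

def aic_bic(log_lik, n_params, n_obs):
    # Standard formulas: AIC = -2 log L + 2k;  BIC = -2 log L + k log N.
    return (-2.0 * log_lik + 2.0 * n_params,
            -2.0 * log_lik + n_params * np.log(n_obs))

# Hypothetical example: estimates still contaminated by speededness sit above
# the generating (nonspeeded) values, giving positive bias. The "true" values
# are loosely based on the difficulty-ordered class 1 entries in Table 1.
beta_true = np.array([0.38, 0.68, 0.68, 1.06, 1.12, 1.33, 1.85, 2.31])
beta_hat = beta_true + np.array([0.7, 0.5, 0.6, 0.4, 0.6, 0.5, 0.7, 0.6])
print(end_of_test_recovery(beta_true, beta_hat))   # RMSE and bias both near 0.58
print(aic_bic(log_lik=-40000.0, n_params=48, n_obs=3000))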
Results

Biases and RMSEs for item difficulty parameters of the 8 end-of-test items, averaged across replications and items, are provided in Table 2 for all models and both item ordering conditions. In addition, Table 2 includes the proportion of replications in which each model was selected as the most appropriate, based on the AIC and BIC indices.

Insert Table 2 About Here

Although the number of replications is very small, two patterns seem clear. First, both the MRM and the MRM-RT offer a significant improvement upon the Rasch model in terms of item parameter recovery for end-of-test items in speeded tests. In both conditions, RMSEs and biases for the Rasch model were substantially higher than for either of the two mixture models. Also, the Rasch model was never selected as the most appropriate model by either of the two selection indices.

The second pattern that emerged is that, given the nature of these data simulations, there was very little difference between the MRM and the MRM-RT. RMSEs and biases for the two models were virtually identical when the items were randomly ordered, and were uniformly very small for both models. When the items were ordered by difficulty, RMSEs and biases were somewhat larger, suggesting that recovery was more variable and that it was not possible to remove all the bias due to speededness. RMSEs and biases appeared slightly smaller for the MRM-RT, but the difference was very small and not reliable, in light of the small number of replications. Despite the marginally better RMSEs and biases in the difficulty-ordered condition, the selection indices did not identify the MRM-RT as the more appropriate model. The AIC selected each of the two mixture models roughly half the time; however, the BIC selected the MRM in all conditions. As has been observed by others (Lin & Dayton, 1997; Kang & Cohen, 2007), the AIC and BIC can lead to different results. The BIC is particularly sensitive to model complexity and tends to select models with fewer parameters than does the AIC. In the present study, the selection indices, and particularly the BIC, appear to indicate that whatever slight advantage may have been afforded by the MRM-RT in terms of reduced RMSE and bias was more than offset by the cost associated with using a more complex and more heavily parameterized model. It is worth noting, however, that both selection indices identified the Rasch model as the least appropriate for all replications.

Discussion

This study investigated item parameter recovery under speeded conditions for end-of-test items using three different item response theory models. One of these models, the MRM, has repeatedly been demonstrated to be more appropriate than the Rasch model when the test is speeded. Particular interest here was in studying the extent to which recovery in the MRM could be improved by extending the MRM to utilize RTs (in addition to item responses) in determining class membership.

The results of this study suggested that the MRM-RT, as implemented here, did not offer a significant improvement upon the MRM in either of the conditions studied. This result is somewhat surprising, in light of previous research detailing the utility of RT data in accounting for test speededness. The three most logical explanations for the absence of an effect are as follows:
1. the MRM works so well even without RT information that there is very little room for improvement;
2. the nature of the data simulation resulted in the MRM recovering the underlying parameters very well, thereby leaving little room for the MRM-RT to improve estimation; and
3. the nature of the constraints and prior distributions forced the two mixture models to produce highly similar solutions.

In reality, it is likely that all three of these potential explanations contribute to a certain extent. The MRM has been shown to be quite effective at improving item parameter estimates for end-of-test items (Bolt et al., 2002; Mroch et al., 2005; Wollack et al., 2003); however, other models have been shown to slightly outperform the MRM in other contexts (Mroch et al., 2005), so it seems unlikely that the MRM works so well that it cannot be improved.

Model appropriateness and parameter recovery studies are only as good as the simulation design used to generate the datasets. In this study, a mixture IRT model was used to simulate the test speededness. Although mixture models have proven useful as a tool for analyzing speeded datasets, and the generating parameters for this simulation study were selected from observed values in previous research, it is still not altogether clear that generating data from a mixture model produces datasets that resemble actual speeded datasets. What does seem clear, however, is that fitting the same model that was used to generate the data is a formula for maximizing overall fit and minimizing biases and RMSEs. In this case, because the MRM-RT was used to generate the data and the MRM is nested within that model, the conditions were ripe for the MRM to be very successful at recovering item parameters. To the extent that this is true, it is unlikely that another model could be found to improve the recovery. The comparison of the MRM and MRM-RT would be enhanced by using an alternative RT and item response speededness model to simulate the data, such as the model proposed by Thissen (1983; which was used as the generating model in Wang & Hanson, 2005), or that by van der Linden (2007).

Finally, the constraints and prior distributions imposed in estimating the MRM and MRM-RT were quite restrictive and may well have contributed to the solutions being so similar for the two models. In particular, both models fixed the mixing proportions at their population values. Fixing mixing proportions helps the models considerably, because accurate item parameter estimation is contingent on correctly identifying examinees' class membership. In the MRM, the biggest challenge is correctly identifying whether an examinee's performance has changed enough from the beginning to the end of the test to warrant classifying the examinee into the speeded class. This is particularly difficult when items are ordered by difficulty, because performance for all examinees, speeded and nonspeeded, will appear to deteriorate at the end of the test. When the mixing proportions are fixed, the software knows what percentage of examinees belong in each class, regardless of the magnitude of the differences in performance between the beginning and end of the test. It is entirely possible that fixing these proportions at their population values dramatically improved classification accuracy in the MRM, thereby also improving the accuracy of the item parameter estimates for the nonspeeded class. Furthermore, the nature of the constraints also forced the mixing proportions to be the same for both the MRM and MRM-RT.
Had those parameters been free to vary, it is likely that they would have differed somewhat between the two models, thereby guaranteeing that examinees were classified differently under the two models and increasing the likelihood of the item parameter recovery differing as well. Moving forward, it will be useful to explore the quality of recovery when mixing proportions are estimated. Similarly, this study utilized reasonably informative prior distributions on the RT medians and standard deviations, and it will be worthwhile to investigate the properties of the MRM-RT when those priors are made less informative.

Notes

1. The generating RT parameters for the 8 end-of-test items corresponded to items 19, 25, 16, 7, 1, 22, 21, and 8, respectively, from Schnipke and Scrams (1997).

Appendix

WinBUGS code for the MRM-RT

model {
  for (j in 1:NE){
    for (k in 1:NI){
      num[j,k] <- exp(theta[j]-b[gmem[j],k])
      p[j,k] <- num[j,k]/(1+num[j,k])
      r[j,k] ~ dbern(p[j,k])
    }
    theta[j] ~ dnorm(mut[gmem[j]],1.)
    gmem[j] ~ dcat(pi[1:G])
  }
  for (k in 1:16){
    beta[1,k] ~ dnorm(0.0,1.0)
    beta[2,k] <- beta[1,k]
  }
  for (k in 17:24){
    beta[1,k] ~ dnorm(0.0,1.0)
    beta[2,k] ~ dnorm(0.0,1.0)I(beta[1,k],)
  }
  # The second argument of dnorm is the precision (the inverse of the variance)
  for (k in 1:NI){
    rtm[1,k] ~ dnorm(4.30,0.93)
    rtm[2,k] ~ dnorm(2.30,1.74)
    rts[1,k] ~ dnorm(4.00,1.00)
    rts[2,k] ~ dnorm(0.65,6.15)
  }
  for (j in 1:NE){
    for (k in 1:8){
      rt[j,k] ~ dlnorm(rtm[gmem[j],k],rts[gmem[j],k])
    }
  }
  for (j in 1:24){
    b[1,j] <- beta[1,j]-mean(beta[1,1:24])
    b[2,j] <- beta[2,j]-mean(beta[2,1:24])
  }
  pi[1] <- .8
  pi[2] <- .2
  mut[1] ~ dnorm(0.,1.)
  mut[2] ~ dnorm(0.,1.)
}

list(NE=3000, NI=24, G=2,
  r = structure(.Data = c(
    1.0,1.0,1.0,1.0,1.0, 1.0,1.0,1.0,1.0,1.0, 0.0,1.0,1.0,1.0,1.0,
    0.0,1.0,1.0,1.0,1.0, 1.0,0.0,0.0,1.0,1.0, 1.0,1.0,1.0,1.0,0.0,
    . . . . . . . . .
    1.0,0.0,0.0,0.0,1.0), .Dim = c(3000,24)),
  rt = structure(.Data = c(
    103.97,54.04,70.86,78.02,119.05, 41.06,48.87,128.16,155.58,98.30,
    27.49,105.43,67.94,188.35,46.66, 43.14,47.37,96.23,76.94,75.33,
    . . . . . . . . .
    35.25,75.46,53.05,148.87,50.31), .Dim = c(3000,8)))

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.

Baker, F. B., Al-Karni, A., & Al-Dosary, I. M. (1991). EQUATE: A computer program for the test characteristic curve method of IRT equating. Applied Psychological Measurement, 50, 529-549.

Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331-348.

Civil Rights Act of 1991, Pub. L. No. 102-166, §106 (1991).

Douglas, J., Kim, H. R., Habing, B., & Gao, F. (1998). Investigating local dependence with conditional covariance functions. Journal of Educational and Behavioral Statistics, 23, 129-151.

Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.

Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31, 331-358.

Klein Entink, R. H., Kuhn, J. T., Hornke, L. F., & Fox, J.-P. (2009). Evaluating cognitive theory: A joint modeling approach using responses and response times. Psychological Methods, 14, 54-75.
Lin, T. H., & Dayton, C. M. (1997). Model selection information criteria for non-nested latent class models. Journal of Educational and Behavioral Statistics, 22(3), 249-264.

Luce, R. D. (1986). Response times: Their roles in inferring elementary mental organization. Oxford, UK: Oxford University Press.

Maris, E. (1993). Additive and multiplicative models for gamma distributed random variables, and their application as psychometric models for response times. Psychometrika, 58, 445-469.

Meyer, J. P. (2008, March). A mixture Rasch model with item response-time components. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.

Mroch, A. A., & Bolt, D. M. (2006, April). An IRT-based response likelihood approach for addressing test speededness. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Mroch, A. A., Bolt, D. M., & Wollack, J. A. (2005, April). A new multi-class mixture Rasch model for test speededness. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.

Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response theory. Journal of Educational Measurement, 31, 200-219.

Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146-178.

Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342-366.

Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34, 213-232.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Spiegelhalter, D. J., Thomas, A., & Best, N. G. (2000). WinBUGS version 1.3 [Computer program]. Robinson Way, Cambridge CB2 2SR, UK: Institute of Public Health, Medical Research Council Biostatistics Unit.

Stocking, M., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 207-210.

van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181-204.

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287-308.

van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23, 195-210.

Wang, T., & Hanson, B. A. (2005). Development and calibration of an item response model that incorporates response time. Applied Psychological Measurement, 29, 323-339.

Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43, 19-38.

Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 16, 163-183.

Wollack, J. A., Cohen, A. S., & Wells, C. S. (2003). A method for maintaining scale stability in the presence of test speededness. Journal of Educational Measurement, 40, 307-330.

Yamamoto, K. (1987). A model that combines IRT and latent class models. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.
Yamamoto, K., & Everson, H. (1997). Modeling the effects of test length and test time on parameter estimation using the HYBRID model. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 89-98). New York: Waxmann.

Table 1
Generating Item Difficulty Parameters for Two Item Ordering Conditions

                        Random Ordering         Difficulty Ordering
Item Location         Class 1     Class 2       Class 1     Class 2
      1                -0.573      -0.573        -1.926      -1.926
      2                -0.505      -0.505        -1.198      -1.198
      3                 0.682       0.682        -1.118      -1.118
      4                -0.782      -0.782        -1.080      -1.080
      5                 0.682       0.682        -0.782      -0.782
      6                -1.926      -1.926        -0.704      -0.704
      7                -0.149      -0.149        -0.618      -0.618
      8                 0.094       0.094        -0.573      -0.573
      9                 0.255       0.255        -0.563      -0.563
     10                 2.306       2.306        -0.505      -0.505
     11                 1.055       1.055        -0.447      -0.447
     12                -0.563      -0.563        -0.377      -0.377
     13                -1.080      -1.080        -0.285      -0.285
     14                -0.447      -0.447        -0.149      -0.149
     15                -1.118      -1.118         0.094       0.094
     16                -0.285      -0.285         0.255       0.255
     17                 1.848       4.521         0.382       3.985
     18                -1.198       2.600         0.682       4.473
     19                 1.166       4.983         0.682       5.055
     20                -0.618       2.960         1.055       5.014
     21                -0.704       3.385         1.116       4.254
     22                 1.330       3.702         1.330       5.157
     23                 0.382       3.790         1.848       4.219
     24                -0.377       4.124         2.306       5.202

Table 2
Summary Statistics for Item Parameter Recovery and Model Selection

                       Random Ordering                    Difficulty Ordering
Model         RMSE     Bias      AIC      BIC       RMSE     Bias      AIC      BIC
Rasch        0.643    0.603    0.000    0.000      0.570    0.562    0.000    0.000
MRM          0.078   -0.014    0.600    1.000      0.125    0.118    0.400    1.000
MRM-RT       0.075   -0.018    0.400    0.000      0.118    0.111    0.600    0.000

Note. RMSE and bias are averaged across the 8 end-of-test items and the five replications; the AIC and BIC columns give the proportion of replications in which each model was selected as the most appropriate.

Figure Caption

Figure 1. Nonspeeded and speeded class RT distributions for end-of-test items. [The figure plots the eight nonspeeded-class lognormal RT densities (NS1 through NS8) and the common speeded-class density (SP) against response time, 0 to 200, with densities ranging from 0 to roughly 0.025.]