Wollack, J. A., & Woo, V. (2009, April).

Using Response Times to Improve Parameter Estimation for Speeded Test Items
James A. Wollack
Vincent Woo
University of Wisconsin-Madison
Paper presented at the annual meeting of the National Council on Measurement in Education,
San Diego, CA, April, 2009
April 16, 2009
RUNNING HEAD: Parameter Estimation Using Response Times
Using Response Times to Improve Parameter Estimation for Speeded Test Items
Test speededness effects refer to the impact on test performance (at both the item and person
level) of administering a test under time constraints that are irrelevant to the construct being
measured. The artificial time constraints will affect examinees differently. As examinees run
out of time, some will resort to random guessing on the remaining items; others will adjust the
way they are budgeting their time throughout the test, so that they attend to all items, but move
through the test more quickly than is comfortable. Regardless of examinees’ individual
tendencies when answering questions under time pressure, when examinees are speeded, their
performance on items at the end of the test often suffers. As a result, ability estimates for
speeded examinees are often under-estimated (van der Linden, Scrams, & Schnipke, 1999), and
item difficulty parameters for end-of-test items are often over-estimated (Douglas, Kim, Habing,
& Gao, 1998; Oshima, 1994).
Much of the work on psychometric models that account for speed grew out of
psychological studies of reaction time, in which participants were presented with
straightforward tasks and asked to respond as quickly as possible (Luce, 1986; Maris, 1993).
Within the context of educational tests, in which students’ achievement, rather than processing
speed, is of primary concern, models of test speededness have focused on improving the
estimation of person and/or item parameters.
By and large, three separate approaches have been used to model test speededness. The first
approach uses response-time (RT) data from computerized assessments. Schnipke and Scrams
(1997) used a multi-class mixture model to identify examinees who switch from solution
behavior to hurried behavior for items late in the test, as evidenced by a change in their RT
relative to what is expected. A modification of the Schnipke and Scrams mixture approach is
proposed by Wise and colleagues (Wise & DeMars, 2006; Wise & Kong, 2005) within the
context of examinee effort, but the model is also appropriate for detecting test speededness.
The second approach uses item response data to identify examinees whose performance on
end-of-test items is significantly worse than it was for items earlier in the test. Most of this
research has been within the context of latent class mixture models (Bolt, Cohen, & Wollack,
2002; Yamamoto, 1987; Yamamoto & Everson, 1997), in which a latent class of nonspeeded
examinees is distinguished from one or more latent classes of examinees exhibiting speeded
tendencies on the end-of-test items. Recently, an alternative approach has been developed, in
which two continuous examinee-specific parameters are used to model the point at which
speededness first occurs and the rate at which the examinee’s performance deteriorates
(Goegebeur, DeBoeck, Wollack, & Cohen, 2008).
The third approach is to simultaneously model item response and RT data. Over the past
decade, as computer-based testing has become increasingly popular, so too have these more
sophisticated models. Wang and Hanson (2005) developed a four-parameter logistic response
time model that incorporates into the logit a quantity equal to −ρi dj / tij, the negative product of a
slowness parameter for person i (ρi) and a slowness parameter for item j (dj), divided by the RT
of person i on item j (tij). van der Linden (2007) developed a hierarchical model, in which item
response and RT models are specified at level 1 for each combination of item and person, and the
second level of the model specifies the relationships between the parameters in the first level.
This model was expanded by Klein Entink, Kuhn, Hornke, and Fox (2009) to cognitive tests with
rule-based items, for purposes of understanding the relationships between response time, item
accuracy, and specific cognitive components of items.
The vast majority of the research on models of test speededness has concentrated on either
understanding the relationship between item performance and time, or on improving the quality
of ability estimation. Within educational testing, using a model to purify estimates of examinee
ability may be appropriate for low-stakes tests or when group-level data are of primary concern
(e.g., for No Child Left Behind); however, it is not entirely clear that it is legal to decontaminate
examinees’ ability estimates if those test scores will be used to inform high-stakes decisions.
Under Title I of the Civil Rights Act of 1991 (1991), it is illegal to use different cut-scores for
different manifest groups of test takers (e.g., males and females). Though speededness models
have not been subjected to litigation, it is unclear whether it would be deemed permissible to
score exams in different ways for examinees in different latent classes or with different
distributions of RT parameters.
Our attention in this study is on using speededness models to improve the estimation of item
parameters. As a result, we have chosen to concentrate on the latent class mixture modeling
approach to account for speededness. The latent class approach is conceptually appealing for
educational tests because the purpose in these models is to find a set of item parameter estimates
(for a nonspeeded latent class) that most closely reflects how the end-of-test items would
perform were the test entirely nonspeeded. Within the context of high-stakes educational tests, it
is presumably these purified parameter estimates that would be applied to all examinees, for
purposes of ability estimation (Wollack, Cohen, & Wells, 2003). In addition, several of the
studies involving mixture models have concentrated squarely on item parameter recovery (Bolt
et al., 2002; Mroch & Bolt, 2006; Mroch, Bolt, & Wollack, 2005), so the qualities of mixture
models for item parameter estimation are reasonably well understood. In general, mixture
models have been found to greatly reduce the amount of bias in item parameters caused by test
speededness. However, when the end-of-test items are very difficult, mixture models can
struggle to accurately distinguish the classes, thereby resulting in item parameter estimates for
the nonspeeded class that are still partially confounded by test speededness.
In this study, two mixture model approaches—one based only on item response data and a
second that also incorporates RT data—are compared with respect to item parameter recovery,
for different types of simulated speededness. The expectation is that considering RT data in
conjunction with item response data will serve to improve the quality of the item parameter
estimates and provide a more accurate picture of the extent of test speededness.
Mixture Models Used in this Study
Mixture Rasch Model
Bolt et al. (2002) used a 2-class mixture Rasch model (MRM) to estimate item difficulty
parameters for two distinct latent classes of examinees, with constraints imposed so that
examinees in one class were able to complete the entire exam comfortably and examinees in the
second class, a speeded class, were not. Under this model, P(u=1|g, θg), the probability of an
examinee answering an item correctly, given that they are in latent class g, is given by
P(u=1|g, θg) = exp(θg − βig) / [1 + exp(θg − βig)],
(1)
where
g indexes latent class, g = 1, 2
θg denotes examinee’s latent ability within latent class g, and
βig denotes difficulty of item i within latent class g.
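As a minimal sketch of Equation (1), the following Python function evaluates the class-specific Rasch probability. The function name, the ability value, and the choice of a single illustrative end-of-test item are assumptions made here for illustration; the two difficulty values are borrowed from the kind of nonspeeded and speeded difficulties reported in Table 1.

```python
import math


def rasch_prob(theta: float, beta: float) -> float:
    """Equation (1): P(u = 1 | g, theta_g) for an item with class-specific difficulty beta."""
    # exp(x) / (1 + exp(x)) is rewritten as 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


# Illustrative values only: one examinee ability and one end-of-test item whose
# difficulty is much higher in the speeded class than in the nonspeeded class.
theta = 0.5
beta_nonspeeded = 0.382
beta_speeded = 3.790

p_nonspeeded = rasch_prob(theta, beta_nonspeeded)
p_speeded = rasch_prob(theta, beta_speeded)
print(f"nonspeeded: {p_nonspeeded:.3f}, speeded: {p_speeded:.3f}")
```

For the same ability, the larger speeded-class difficulty translates directly into a much lower probability of a correct response, which is how the model represents deteriorating end-of-test performance.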
To distinguish the two latent classes, two sets of constraints are imposed. Equality constraints
across classes are imposed on the first k item difficulties (e.g., βi1 = βi2 for i = 1, . . . , k), where k
identifies the last item for which one is confident that speed is not an issue. Ordinal constraints
across classes are applied to the last m items (e.g., βi1 > βi2 for i = (n − m + 1), . . . , n), where (n
− m + 1) identifies the first end-of-test item where speededness is suspected. Items k + 1
through n − m are either left unconstrained, or are excluded from the model for purposes of
estimating the nonspeeded class difficulties for the end-of-test items.
Mixture Rasch Model with Response Time Components
Meyer (2008) extended the Bolt et al. (2002) MRM to utilize information from response
times to help distinguish the latent classes. Under the mixture Rasch model with response time
components (MRM-RT), item responses and RT are assumed to be locally independent, given
class membership. As a result, item responses are used to estimate item difficulty parameters,
βig, and examinee ability parameters, θg, but are not used to estimate the median or standard
deviation of RT distributions. Similarly, RT is used to estimate RT parameters, but not item or
examinee parameters. Both RT and item responses are used to determine latent class
membership, as well as the proportion of examinees in each latent class, πg.
Under the MRM-RT, P(u=1|g, θg) is specified by (1). However, the MRM-RT differs from
the MRM in that a mixture RT distribution is also specified, in which the parameters of the
distribution vary by class. Any of a number of different distributions could be used to describe
RTs (Klein Entink et al., 2009); however, the lognormal is most commonly used for educational
tests, and has been found to generally fit RT distributions rather well (Schnipke & Scrams, 1997;
van der Linden, 2006). Meyer (2008) used a mixture lognormal distribution to model RT, and
we have done similarly in this study. The mixture lognormal distribution, f(tij | g, mig, σig), is
given by
$$f(t_{ij} \mid g, m_{ig}, \sigma_{ig}) = \frac{1}{t_{ij}\,\sigma_{ig}\sqrt{2\pi}} \exp\left\{ -\frac{\left[\ln(t_{ij}) - m_{ig}\right]^{2}}{2\sigma_{ig}^{2}} \right\},$$
(2)
where
tij denotes the RT for examinee j to item i,
mig denotes median log response time for item i within latent class g, and
σig denotes standard deviation of log response time for item i within latent class g.
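The lognormal density in Equation (2) can be sketched in a few lines of Python. The function names are illustrative assumptions; the parameter values are the speeded-class values quoted later in the paper (m = ln 9.73 and σ = 1.26). The cumulative distribution function is included only to check that exp(m) is the median of the RT distribution.

```python
import math


def lognormal_pdf(t: float, m: float, sigma: float) -> float:
    """Equation (2): lognormal density of a response time t > 0.

    m and sigma are the location and standard deviation on the log-RT scale.
    """
    z = (math.log(t) - m) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


def lognormal_cdf(t: float, m: float, sigma: float) -> float:
    """Cumulative probability that a lognormal response time is at most t."""
    return 0.5 * (1.0 + math.erf((math.log(t) - m) / (sigma * math.sqrt(2.0))))


# Speeded-class values quoted in the Data Simulation section.
m_speeded = math.log(9.73)
sigma_speeded = 1.26

# exp(m) is the median RT: half of the distribution lies below it.
median_rt = math.exp(m_speeded)
print(round(median_rt, 2), round(lognormal_cdf(median_rt, m_speeded, sigma_speeded), 3))
```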
To distinguish the latent classes in a way consistent with a speededness hypothesis, Meyer
(2008) imposed the constraint that mi2 > mi1. Given this constraint, class 1 would be the
speeded class and class 2 would be the nonspeeded class.
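To illustrate how item responses and RTs can jointly inform class membership under local independence, the sketch below applies Bayes' rule to a single examinee. It is not the estimation routine used in the paper: it treats ability and all item and RT parameters as known, and the function names, the ability value of 0, and the nonspeeded RT parameters (4.30 and 0.50) are assumptions chosen for illustration. Class index 0 is taken to be nonspeeded and class index 1 speeded.

```python
import math


def log_rasch(u: int, theta: float, beta: float) -> float:
    """Log-probability of response u (1 = correct, 0 = incorrect) under Equation (1)."""
    eta = theta - beta
    return u * eta - math.log1p(math.exp(eta))


def log_lognormal(t: float, m: float, sigma: float) -> float:
    """Log of the lognormal density in Equation (2)."""
    z = (math.log(t) - m) / sigma
    return -0.5 * z * z - math.log(t * sigma * math.sqrt(2.0 * math.pi))


def class_posterior(responses, rts, theta, betas, ms, sigmas, pis):
    """Posterior class probabilities, given responses and RTs on the same items.

    betas[g][i], ms[g][i], and sigmas[g][i] hold the class-g parameters of item i.
    """
    log_post = []
    for g, pi_g in enumerate(pis):
        lp = math.log(pi_g)
        for i, (u, t) in enumerate(zip(responses, rts)):
            lp += log_rasch(u, theta, betas[g][i])
            lp += log_lognormal(t, ms[g][i], sigmas[g][i])
        log_post.append(lp)
    top = max(log_post)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Two end-of-test items; index 0 = nonspeeded class, index 1 = speeded class.
betas = [[0.382, -0.377], [3.790, 4.124]]
ms = [[4.30, 4.30], [math.log(9.73), math.log(9.73)]]  # 4.30 is a placeholder
sigmas = [[0.50, 0.50], [1.26, 1.26]]                  # 0.50 is a placeholder
pis = [0.80, 0.20]

fast_and_wrong = class_posterior([0, 0], [5.0, 6.0], 0.0, betas, ms, sigmas, pis)
slow_and_right = class_posterior([1, 1], [80.0, 70.0], 0.0, betas, ms, sigmas, pis)
```

An examinee who answers quickly and incorrectly is assigned almost entirely to the speeded class, whereas one who answers slowly and correctly is assigned to the nonspeeded class, even though the prior mixing proportions favor the nonspeeded class in both cases.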
Purpose of Study
Both Bolt et al. (2002) and Meyer (2008) found that their models offered significant
improvements over the (single-class) Rasch model with respect to both parameter estimation and
model fit. This suggests that the mixture approach, both with and without RT data, offers an
improvement upon the Rasch model, which does not account for test speededness. However, no
attempt has been made to compare the two mixture approaches to see whether, and to what
degree, item parameter estimation under the MRM can be improved by attending to RT data.
This study represents an initial attempt to explore this issue.
Method
Data Simulation
Data were simulated under the MRM-RT, using the computer program WinBugs
(Spiegelhalter, Thomas, & Best, 2000). All replications included N=3,000 examinees and n=24
items. This simulation followed the model of including only items that could either safely be
assumed to be free of speededness (i.e., beginning-of-test items) or be known to include
speededness (i.e., end-of-test items). Performance on middle-of-test items which might be
partially speeded for some examinees was not simulated. Therefore, the same generating
parameters were used for the first 16 items, whereas separate parameters were generated for the
two latent classes for the remaining 8 items, so that βi1 < βi2.
Generating values of item and RT parameters were informed by previous research. Based on
the magnitude of contamination in item parameter estimates for end-of-test items found in
empirical studies of latent class speededness (Bolt et al., 2002; Mroch, Bolt, & Wollack, 2005),
item difficulty parameters for the nonspeeded class (class 1) and speeded class (class 2) were
generated so that βi1 ~ N(0, 1) and βi2 = βi1 + Yi, where Yi ~ N(3.5, 0.66²). Generating RT
median-standard deviation pairs for the end-of-test items were selected from Schnipke and
Scrams (1997). For the nonspeeded class, RT parameters (mi1 and σi1) were randomly selected
from the estimated solution-behavior RT values for a 25-item test shown in Table 1 of their
article, under the common-guessing mixture model (see Note 1). For the speeded class, mi2 and σi2 were
taken to be ln(9.73) = 2.275 and 1.26, respectively, the values estimated by Schnipke and
Scrams for the guessing behavior distribution. The nonspeeded class RT distributions for the 8
end-of-test items, along with the common RT distribution for the speeded class, are shown in
Figure 1.
Additionally, parameter values for μ(θ)g, the average ability levels for class g, were fixed
such that μ(θ)1 = 1.082 and μ(θ)2 = -0.305, and mixing proportions were fixed so that π1 = .80
and π2 = .20. Generating values of μ(θ)g and πg came from an analysis of English placement data
analyzed under the MRM (Wollack, et al., 2003).
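The generating design described above can be sketched in Python as follows. This is an illustrative re-implementation, not the WinBUGS routine used in the study. Several quantities are assumptions: the function names, the seed, the within-class ability standard deviation of 1, the reading of 0.66 as a standard deviation, and the nonspeeded RT parameters (4.30 and 0.50), which stand in for the item-specific values drawn from Schnipke and Scrams (1997).

```python
import math
import random


def simulate_mrm_rt(n_examinees=3000, n_items=24, n_end=8, seed=2009):
    """Simulate item responses and end-of-test RTs under a two-class MRM-RT."""
    rng = random.Random(seed)
    first_end = n_items - n_end  # index of the first end-of-test item

    # Class 1 (nonspeeded) difficulties; class 2 (speeded) differs only at the end.
    beta1 = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    beta2 = [b + rng.gauss(3.5, 0.66) if i >= first_end else b  # 0.66 read as an SD
             for i, b in enumerate(beta1)]

    # Log-scale RT parameters. Speeded-class values are those reported in the
    # text; nonspeeded-class values are HYPOTHETICAL placeholders.
    rt_m = {1: 4.30, 2: math.log(9.73)}
    rt_s = {1: 0.50, 2: 1.26}
    mu_theta = {1: 1.082, 2: -0.305}

    classes, responses, rts = [], [], []
    for _ in range(n_examinees):
        g = 1 if rng.random() < 0.80 else 2      # mixing proportions .80 / .20
        theta = rng.gauss(mu_theta[g], 1.0)      # within-class SD of 1 is assumed
        beta = beta1 if g == 1 else beta2
        responses.append([int(rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))))
                          for b in beta])
        rts.append([rng.lognormvariate(rt_m[g], rt_s[g]) for _ in range(n_end)])
        classes.append(g)
    return {"beta1": beta1, "beta2": beta2, "classes": classes,
            "responses": responses, "rts": rts}


def end_accuracy(data, g, n_end=8):
    """Proportion correct on the end-of-test items for examinees in class g."""
    rows = [r[-n_end:] for r, c in zip(data["responses"], data["classes"]) if c == g]
    return sum(sum(row) for row in rows) / (len(rows) * n_end)


data = simulate_mrm_rt()
print(end_accuracy(data, 1), end_accuracy(data, 2))
```

Comparing end-of-test accuracy across the two simulated classes gives a quick check that the speeded class performs markedly worse on the last 8 items, as the design intends.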
Insert Figure 1 About Here
Item Ordering
Two separate item orders were considered in this study. In the first condition, the 24
randomly-generated item difficulty parameters for the nonspeeded class were randomly assigned
to their location on the test. In the second condition, the 24 item difficulty parameters were
ordered from easy to hard. It was expected that with hard items at the end of the test, the MRM
might struggle to distinguish poor behavior due to speededness from that due to the items being
difficult. The generating item difficulty values are shown in Table 1 for the two item ordering
conditions.
Because this is just a pilot study, only five replications were generated in each condition.
Insert Table 1 About Here
Estimation Models
Three different models were fit to the data: the Rasch model, the MRM, and the MRM-RT.
By examining all three of these models on the same datasets, it is possible to better understand
the added benefit of (a) adopting a multi-class perspective (through comparison of the Rasch
model and MRM), and (b) incorporating RT (through comparison of the MRM and MRM-RT).
Parameters for all three models were estimated using a Markov chain Monte Carlo algorithm
(MCMC; Gilks, Richardson, & Spiegelhalter, 1996; Patz & Junker, 1999a, 1999b), as
implemented in WinBugs (Spiegelhalter et al., 2000). Based on several pilot runs, a burn-in of
the first 1,000 iterations was determined to be appropriate for the different models. A minimum
of 5,000 iterations were sampled after burn-in. The average sampled value across all iterations
after burn-in was taken as the parameter estimate.
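The point estimate described above is simply the mean of the retained draws. A minimal Python sketch follows; the function name and the toy chain are assumptions made for illustration, and the chain is not real MCMC output.

```python
def posterior_mean(chain, burn_in=1000):
    """Average the sampled values that remain after discarding the burn-in."""
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("chain must be longer than the burn-in")
    return sum(kept) / len(kept)


# Toy chain: 1,000 burn-in draws far from the target, followed by 5,000 draws
# alternating around 0.5.
toy_chain = [3.0] * 1000 + [0.4, 0.6] * 2500
estimate = posterior_mean(toy_chain)
print(round(estimate, 3))
```

Discarding the first 1,000 iterations keeps the early, unconverged draws from pulling the estimate toward the starting values.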
For purposes of estimating the Rasch model, standard normal prior distributions were used
for both the βi and θj parameters. The following were the prior distributions for the
MRM:
βi1 ~ Normal (0, 1)
βi2 ~ Normal (0, 1), such that βi1 < βi2
θj ~ Normal (μ(θ)g, 1)
μ(θ)g ~ Normal (0, 1)
In addition to utilizing all the prior distributions from the MRM, the MRM-RT further
imposed the following prior distributions on RT parameters:
mi1 ~ Normal (4.30, 1.04²)
mi2 ~ Normal (2.30, 0.76²)
σi1 ~ Normal (4.00, 1.00²)
σi2 ~ Normal (0.65, 0.40²)
πg parameters were fixed at .80 and .20, for g=1 and g=2, respectively, in both the MRM and
MRM-RT.
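WinBUGS parameterizes the normal distribution by its precision (the inverse of the variance) rather than its standard deviation. The sketch below converts the RT priors to precisions, reading the second argument of each prior above as a squared standard deviation of 1.04, 0.76, 1.00, and 0.40; that reading, along with the function and dictionary names, is an assumption made here.

```python
def precision_from_sd(sd: float) -> float:
    """Precision (inverse variance) corresponding to a standard deviation."""
    return 1.0 / (sd * sd)


# Prior standard deviations for the RT parameters, as read from the priors above.
prior_sds = {"m_i1": 1.04, "m_i2": 0.76, "sigma_i1": 1.00, "sigma_i2": 0.40}
prior_precisions = {name: round(precision_from_sd(sd), 2)
                    for name, sd in prior_sds.items()}
print(prior_precisions)
```

The resulting precisions are close to, though not identical to, the values 0.93, 1.74, 1.00, and 6.15 that appear in the Appendix code.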
In this study, although the priors placed on the mig used different mean values to reflect that
the response times for the nonspeeded group will generally be longer than for the speeded group,
we did not impose the constraint that mi1 > mi2, as was done in Meyer (2008). Instead, in the
interest of better understanding the added benefit of incorporating RT into the MRM, we chose
to distinguish the latent classes by placing constraints on the item parameters (i.e., βi1 < βi2). A
copy of the WinBugs code used to estimate the MRM-RT can be found in the Appendix.
Evaluative Measures
To assess the quality of item parameter recovery, root mean square errors (RMSE) and biases
were computed between generating parameters and their estimates for all models. Prior to
assessing RMSEs and biases, the item parameter estimates were linked to the metric of the
generating parameters using the test characteristic curve method (Stocking & Lord, 1983), as
implemented in the EQUATE program (Baker, Al-Karni, & Al-Dosary, 1991). The linking was
performed using all 24 items. Because two different sets of item parameters exist for the MRM
and MRM-RT (one each for the speeded and nonspeeded classes), only the parameter estimates
from the nonspeeded class (class 1) were used for these two models, because
it is these estimates that were assumed to be purified of the contaminating effects of test
speededness. In the interest of focusing more clearly on the quality of item parameter recovery
for end-of-test items in speeded tests, RMSEs and biases were computed only across the 8 end-of-test items.
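The two recovery measures can be sketched as follows for a single replication. The sign convention (estimate minus generating value) and the function names are assumptions; the generating values are the class 1 difficulties of the 8 end-of-test items under random ordering in Table 1, and the estimates are hypothetical values constructed for illustration by shifting each generating value by 0.1.

```python
import math


def bias(estimates, generating):
    """Mean signed error, taken here as estimate minus generating value."""
    return sum(e - g for e, g in zip(estimates, generating)) / len(generating)


def rmse(estimates, generating):
    """Root mean square error between estimates and generating values."""
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(estimates, generating))
                     / len(generating))


# Class 1 generating difficulties for items 17-24 (random ordering, Table 1).
generating = [1.848, -1.198, 1.166, -0.618, -0.704, 1.330, 0.382, -0.377]
# HYPOTHETICAL estimates: every difficulty over-estimated by 0.1.
estimates = [g + 0.1 for g in generating]

print(round(bias(estimates, generating), 3), round(rmse(estimates, generating), 3))
```

When every item is off by the same amount, bias and RMSE coincide; RMSE exceeds the absolute bias whenever the errors vary across items.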
In addition, both the AIC (Akaike, 1974) and BIC (Schwarz, 1978) information criteria were
estimated for all models in each replication to help determine which model provides the best fit
for the data.
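For reference, the standard definitions of the two criteria are AIC = −2 log L + 2k and BIC = −2 log L + k ln(N), where k is the number of estimated parameters and N is the sample size. The sketch below applies them to two hypothetical models; the log-likelihoods and parameter counts are invented for illustration and are not results from this study, which does not report how the criteria were computed from the MCMC output.

```python
import math


def aic(log_lik: float, k: int) -> float:
    """Akaike information criterion; smaller values indicate the preferred model."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian information criterion; the penalty per parameter grows with ln(n)."""
    return -2.0 * log_lik + k * math.log(n)


n_examinees = 3000
# HYPOTHETICAL fits: the larger model fits slightly better but has more parameters.
simpler = {"log_lik": -40000.0, "k": 34}
larger = {"log_lik": -39980.0, "k": 50}

aic_prefers_larger = (aic(larger["log_lik"], larger["k"])
                      < aic(simpler["log_lik"], simpler["k"]))
bic_prefers_simpler = (bic(simpler["log_lik"], simpler["k"], n_examinees)
                       < bic(larger["log_lik"], larger["k"], n_examinees))
print(aic_prefers_larger, bic_prefers_simpler)
```

With N = 3,000, each additional parameter costs about 8 points under the BIC but only 2 under the AIC, so a modest gain in fit can be enough for the AIC to favor the larger model while the BIC still favors the simpler one.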
Results
Biases and RMSEs for item difficulty parameters of the 8 end-of-test items, averaged across
replications and items, are provided in Table 2 for all models and both item ordering conditions.
In addition, Table 2 includes the proportion of replications in which each model was selected as
the most appropriate, based on the AIC and BIC indices.
Insert Table 2 About Here
Although the number of replications is very small, two patterns seem clear. First, both the
MRM and the MRM-RT offer a significant improvement upon the Rasch model in terms of item
parameter recovery for end-of-test items in speeded tests. In both conditions, RMSEs and biases for
the Rasch model were substantially higher than in either of the two mixture models. Also, the
Rasch model was never selected as the most appropriate model by either of the two selection
indices.
The second pattern that emerged is that, given the nature of these data simulations, there was
very little difference between the MRM and the MRM-RT. RMSEs and biases for the two
models were virtually identical when the items were randomly ordered, and were uniformly very
small for both models. When the items were ordered by difficulty, RMSEs and biases were
somewhat larger, suggesting that recovery was more variable and it was not possible to remove
all the bias due to speededness. RMSEs and biases appeared slightly smaller for the MRM-RT,
but the difference was very small and not reliable, in light of the small number of replications.
Despite the marginally better RMSEs and biases in the difficulty-ordered condition, the selection
indices did not identify the MRM-RT as the more appropriate model. The AIC selected each of
the two mixture models roughly half the time; however, the BIC selected the MRM in every
replication of both conditions. As has been observed by others (Lin & Dayton, 1997; Kang & Cohen, 2007), the
AIC and BIC can lead to different results. The BIC is particularly sensitive to model complexity,
and tends to select models with fewer parameters than does the AIC. In the present study, the
selection indices, and particularly the BIC, appear to indicate that whatever slight advantage may
have been afforded by the MRM-RT in terms of reduced RMSE and bias was more than offset
by the cost associated with using a more complex and heavily parameterized model. It is worth
noting, however, that both selection indices identified the Rasch model as the least appropriate
for all replications.
Discussion
This study investigated item parameter recovery under speeded conditions for end-of-test
items using three different item response theory models. One of these models, the MRM, has
repeatedly been shown to be more appropriate than the Rasch model when the test is
speeded. Particular interest here was in studying the extent to which recovery in the MRM could
be improved by extending the MRM to utilize RTs (in addition to item responses) in determining
class membership.
The results of this study suggested that the MRM-RT, as implemented here, did not offer a
significant improvement upon the MRM in either of the conditions studied. This result is
somewhat surprising, in light of previous research detailing the utility of RT data in accounting
for test speededness. The three most logical explanations for the absence of an effect are as
follows:
1. the MRM works so well even without RT information that there is very little room for
improvement,
2. the nature of the data simulation resulted in the MRM recovering the underlying
parameters very well, thereby leaving little room for the MRM-RT to improve estimation,
and
3. the nature of the constraints and prior distributions forced the two mixture models to
produce highly similar solutions.
In reality, it is likely that all three of these potential explanations are contributing to a certain
extent. The MRM has been shown to be quite effective at improving item parameter estimates
for end-of-test items (Bolt et al., 2002; Mroch et al., 2005; Wollack et al., 2003); however, other
models have been shown to slightly outperform the MRM in other contexts (Mroch et al., 2005),
so it seems unlikely that the MRM works so well that it cannot be improved.
Model appropriateness and parameter recovery studies are only as good as the simulation
design used to generate the datasets. In this study, a mixture IRT model was used to simulate the
test speededness. Although mixture models have proven useful as a tool for analyzing speeded
datasets and the generating parameters for this simulation study were selected from observed
values in previous research, it is still not altogether clear that generating data from a mixture
model produces datasets that resemble actual speeded datasets. What does seem clear, however,
is that fitting data with the same model used to generate them is a formula for maximizing
overall fit and minimizing biases and RMSEs. In this case, because the MRM-RT was used
to generate the data and the MRM is nested within that model, the conditions were ripe for the
MRM to be very successful at recovering item parameters. To the extent that this is true, it is
unlikely that another model could be found to improve the recovery. The comparison of the
MRM and MRM-RT would be enhanced by using an alternative RT and item response
speededness model to simulate the data, such as the model proposed by Thissen (1983; which
was used as the generating model in Wang & Hanson, 2005), or that by van der Linden (2007).
Finally, the variable constraints and prior distributions imposed in executing the MRM and
MRM-RT were quite restrictive and may well have contributed to the solutions being so similar
for the two models. In particular, both models fixed the mixing proportions at their population
values. Fixing mixing proportions helps the models considerably, because accurate item
parameter estimation is contingent on correctly identifying examinees’ class membership. In the
MRM, the biggest challenge is correctly identifying whether examinees’ performance has
changed enough from the beginning to the end of the test to warrant classifying the examinee
into the speeded class. This is particularly difficult when items are ordered by difficulty, because
performance for all examinees, speeded and nonspeeded, will appear to deteriorate at the end of
the test. When the mixing proportions are fixed, the software knows what percentage of
examinees belong in each class, regardless of the magnitude of the differences in performance
between the beginning and end of the test. It is entirely possible that fixing these proportions to
their population values dramatically improved classification accuracy in the MRM, thereby also
improving the accuracy of the item parameter estimates for the nonspeeded class. Furthermore,
the nature of the constraints also forced the mixing proportions to be the same for both the MRM
and MRM-RT. Had those parameters been free to vary, it is likely that they would have differed
somewhat between the two models, thereby guaranteeing that examinees were classified
differently under the two models and increasing the likelihood of the item parameter recovery
differing, as well. Moving forward, it will be useful to explore the quality of recovery when
mixing proportions are estimated. Similarly, this study utilized reasonably informative prior
distributions on the RT medians and standard deviations, and it will be worthwhile to investigate
the properties of the MRM-RT when those priors are made less informative.
Notes
1. The generating RT parameters for the 8 end-of-test items corresponded to items 19, 25, 16, 7, 1,
22, 21, and 8, respectively, from Schnipke and Scrams (1997).
Appendix
WinBugs code for the MRM-RT
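model
{
   # Item response model: Rasch probability with class-specific difficulties
   for (j in 1:NE){
      for (k in 1:NI){
         num[j,k] <- exp(theta[j] - b[gmem[j],k])
         p[j,k] <- num[j,k] / (1 + num[j,k])
         r[j,k] ~ dbern(p[j,k])
      }
      theta[j] ~ dnorm(mut[gmem[j]], 1.)
      gmem[j] ~ dcat(pi[1:G])
   }
   # Items 1-16: difficulties constrained to be equal across classes
   for (k in 1:16){
      beta[1,k] ~ dnorm(0.0, 1.0)
      beta[2,k] <- beta[1,k]
   }
   # Items 17-24: class 2 difficulty bounded below by the class 1 difficulty
   for (k in 17:24){
      beta[1,k] ~ dnorm(0.0, 1.0)
      beta[2,k] ~ dnorm(0.0, 1.0) I(beta[1,k], )
   }
   # The second argument of dnorm is a precision (the inverse of the variance)
   for (k in 1:NI) {
      rtm[1,k] ~ dnorm(4.30, 0.93)
      rtm[2,k] ~ dnorm(2.30, 1.74)
      rts[1,k] ~ dnorm(4.00, 1.00)
      rts[2,k] ~ dnorm(0.65, 6.15)
   }
   # Response times for the 8 end-of-test items
   # (dlnorm, like dnorm, takes a precision as its second argument)
   for (j in 1:NE){
      for (k in 1:8){
         rt[j,k] ~ dlnorm(rtm[gmem[j],k], rts[gmem[j],k])
      }
   }
   # Center the difficulties within each class
   for (j in 1:24){
      b[1,j] <- beta[1,j] - mean(beta[1,1:24])
      b[2,j] <- beta[2,j] - mean(beta[2,1:24])
   }
   # Mixing proportions are fixed rather than estimated
   pi[1] <- .8
   pi[2] <- .2
   mut[1] ~ dnorm(0., 1.)
   mut[2] ~ dnorm(0., 1.)
}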
list(NE=3000, NI=24, G=2,
r = structure(.Data = c(
1.0,1.0,1.0,1.0,1.0,
1.0,1.0,1.0,1.0,1.0,
0.0,1.0,1.0,1.0,1.0,
0.0,1.0,1.0,1.0,1.0,
1.0,0.0,0.0,1.0,1.0,
1.0,1.0,1.0,1.0,0.0
.
.
.
.
.
.
.
.
.
1.0,0.0,0.0,0.0,1.0),.Dim = c(3000,24)),
rt = structure(.Data = c(
103.97,54.04,70.86,78.02,119.05,
41.06,48.87,128.16,155.58,98.30,
27.49,105.43,67.94,188.35,46.66,
43.14,47.37,96.23,76.94,75.33,
.
.
.
.
.
.
.
.
.
35.25,75.46,53.05,148.87,50.31),
.Dim = c(3000,8)))
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19(6), 716-723.
Baker, F. B., Al-Karni, A., & Al-Dosary, I. M. (1991). EQUATE: A computer program for
the test characteristic curve method of IRT equating. Applied Psychological Measurement, 50,
529-549.
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under
conditions of test speededness: Application of a mixture Rasch model with ordinal constraints.
Journal of Educational Measurement, 39, 331-348.
Civil Rights Act of 1991, Pub. L. No. 102-166, §106 (1991).
Douglas, J., Kim, H. R., Habing, B., & Gao, F. (1998). Investigating local dependence with
conditional covariance functions. Journal of Educational and Behavioral Statistics, 23, 129-151.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J., Eds. (1996). Markov chain Monte Carlo
in practice. London: Chapman & Hall.
Kang, T. & Cohen, A. S., (2007). IRT model selection methods for dichotomous items.
Applied Psychological Measurement, 31, 331-358.
Klein Entink, R. H., Kuhn, J. T., Hornke, L. F., & Fox, J.-P. (2009). Evaluating cognitive
theory: A joint modeling approach using responses and response times. Psychological Methods,
14, 54-75.
Lin, T. H., & Dayton, C. M. (1997). Model selection information criteria for non-nested
latent class models. Journal of Educational and Behavioral Statistics, 22(3), 249-264.
Luce, R. D. (1986). Response times: Their roles in inferring elementary mental
organization. Oxford, UK: Oxford University Press.
Maris, E. (1993). Additive and multiplicative models for gamma distributed random
variables, and their application as psychometric models for response times. Psychometrika, 58,
445-469.
Meyer, J. P. (2008, March). A mixture Rasch model with item response-time components.
Paper presented at the annual meeting of the National Council on Measurement in Education,
New York, NY.
Mroch, A. A. & Bolt, D. M. (2006, April). An IRT-based response likelihood approach for
addressing test speededness. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Francisco, CA.
Mroch, A. A., Bolt, D. M., & Wollack, J. A. (2005, April). A new multi-class mixture Rasch
model for test speededness. Paper presented at the annual meeting of the National Council on
Measurement in Education, Montreal, Canada.
Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response
theory. Journal of Educational Measurement, 31, 200 – 219.
Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte
Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24,
146-178.
Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT:
Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral
Statistics, 24, 342-366.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state
mixture model: A new method of measuring speededness. Journal of Educational Measurement,
34, 213-232.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Spiegelhalter, D. J., Thomas, A., & Best, N. G. (2000). WinBUGS version 1.3 [Computer
program]. Robinson Way, Cambridge CB2 2SR, UK: Institute of Public Health, Medical
Research Council Biostatistics Unit.
Stocking, M., & Lord, F. M. (1983). Developing a common metric in item response theory.
Applied Psychological Measurement, 7, 207-210.
van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal
of Educational and Behavioral Statistics, 31, 181-204.
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on
test items. Psychometrika, 72, 287-308.
van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time
constraints to control for differential speededness in computerized adaptive testing. Applied
Psychological Measurement, 23, 195-210.
Wang, T., & Hanson, B. A. (2005). Development and calibration of an item response model
that incorporates response time. Applied Psychological Measurement, 29, 323-339.
Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated
IRT model. Journal of Educational Measurement, 43, 19-38.
Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee
motivation in computer-based tests. Applied Measurement in Education, 16, 163-183.
Wollack, J. A., Cohen, A. S., & Wells, C. S. (2003). A method for maintaining scale stability
in the presence of test speededness. Journal of Educational Measurement, 40, 307-330.
Yamamoto, K. (1987). A model that combines IRT and latent class models. Unpublished
doctoral dissertation. University of Illinois, Champaign – Urbana.
Yamamoto, K. & Everson, H. (1997). Modeling the effects of test length and test time on
parameter estimation using the HYBRID model. In J. Rost and R. Langeheine (Eds.):
Applications of Latent Trait and Latent Class Models in the Social Sciences (pp. 89 – 98). New
York: Waxmann.
Table 1
Generating Item Difficulty Parameters for Two Item Ordering Conditions

Item        Random Ordering       Difficulty Ordering
Location    Class 1    Class 2    Class 1    Class 2
 1          -0.573     -0.573     -1.926     -1.926
 2          -0.505     -0.505     -1.198     -1.198
 3           0.682      0.682     -1.118     -1.118
 4          -0.782     -0.782     -1.080     -1.080
 5           0.682      0.682     -0.782     -0.782
 6          -1.926     -1.926     -0.704     -0.704
 7          -0.149     -0.149     -0.618     -0.618
 8           0.094      0.094     -0.573     -0.573
 9           0.255      0.255     -0.563     -0.563
10           2.306      2.306     -0.505     -0.505
11           1.055      1.055     -0.447     -0.447
12          -0.563     -0.563     -0.377     -0.377
13          -1.080     -1.080     -0.285     -0.285
14          -0.447     -0.447     -0.149     -0.149
15          -1.118     -1.118      0.094      0.094
16          -0.285     -0.285      0.255      0.255
17           1.848      4.521      0.382      3.985
18          -1.198      2.600      0.682      4.473
19           1.166      4.983      0.682      5.055
20          -0.618      2.960      1.055      5.014
21          -0.704      3.385      1.116      4.254
22           1.330      3.702      1.330      5.157
23           0.382      3.790      1.848      4.219
24          -0.377      4.124      2.306      5.202
Table 2
Summary Statistics

           Random Ordering                     Difficulty Ordering
Model      RMSE     Bias      AIC      BIC     RMSE     Bias     AIC      BIC
Rasch      0.643    0.603     0.000    0.000   0.570    0.562    0.000    0.000
MRM        0.078    −0.014    0.600    1.000   0.125    0.118    0.400    1.000
MRM-RT     0.075    −0.018    0.400    0.000   0.118    0.111    0.600    0.000

Note. AIC and BIC columns give the proportion of replications in which each model was selected
as the most appropriate.
Figure Captions
Figure 1. Nonspeeded and speeded class RT distributions for end-of-test items
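[Figure 1 image not reproduced. The plot shows RT density curves for the eight nonspeeded-class
end-of-test items (labeled NS1 through NS8) and the common speeded-class distribution (labeled SP),
with the horizontal axis running from 0 to 200 and the vertical axis from 0 to 0.025.]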