PSYCHOMETRIKA, VOL. 55, NO. 2, 195-215, JUNE 1990

MODELING ITEM RESPONSES WHEN DIFFERENT SUBJECTS EMPLOY DIFFERENT SOLUTION STRATEGIES

ROBERT J. MISLEVY
EDUCATIONAL TESTING SERVICE

NORMAN VERHELST
CITO (NATIONAL INSTITUTE FOR EDUCATIONAL MEASUREMENT), ARNHEM, THE NETHERLANDS

A model is presented for item responses when different subjects employ different strategies, but only responses, not choice of strategy, can be observed. Using substantive theory to differentiate the likelihoods of response vectors under a fixed set of strategies, we model response probabilities in terms of item parameters for each strategy, proportions of subjects employing each strategy, and distributions of subject proficiency within strategies. The probabilities that an individual subject employed the various strategies can then be obtained, along with a conditional estimate of proficiency under each. A conceptual example discusses response strategies for spatial rotation tasks, and a numerical example resolves a population of subjects into subpopulations of valid responders and random guessers.

Key words: differential strategies, EM algorithm, item response theory, linear logistic test model, mixture models.

Introduction

The standard models of item response theory (IRT), such as the 1-, 2-, and 3-parameter normal and logistic models, characterize subjects in terms of their propensities to make correct responses. Consequently, subject parameter estimates are strongly related to simple percent-correct scores (adjusted for the average item difficulties, if different subjects are presented different items). Item parameters characterize the regression of a correct response (x = 1 as opposed to 0) on this overall propensity toward correctness. These models lend themselves well to tests in which all subjects employ the same strategy to solve the items. Comparisons among estimates of subjects' ability parameters are meaningful comparisons of their degrees of success in implementing the strategy.
Item parameters reflect the number or complexity of the operations needed to solve a given item (Fischer, 1973).

The same models can prove less satisfactory when different subjects employ different strategies. The validity of using scores that convey little more than percent-correct to compare subjects who have used different strategies must first be called into question. And item parameters keyed only to a generalized propensity toward correctness will not reveal how a particular kind of item might be easy for subjects who follow one line of attack, but difficult for those who follow another.

Extensions of IRT to multiple strategies have several potential uses. In psychology, such models would provide a rigorous framework for testing alternative theories about cognitive processing (e.g., Carter, Pazak, & Kail, 1983). In education, estimates of how students solve problems could be more valuable than how many they solve, for the purposes of diagnosis, remediation, and curriculum revision (Messick, 1984). And even when a standard IRT model would provide reasonable summaries and meaningful comparisons for most subjects, an extended model allowing for departures along predetermined lines, such as random responding, would reduce estimation biases for the parameters in the standard model.

The first author's work was supported by Contract No. N00014-85-K-0683, project designation NR 150-539, from the Cognitive Science Program, Cognitive and Neural Sciences Division, Office of Naval Research. We are grateful to Murray Aitkin, Isaac Bejar, Neil Dorans, Norman Frederiksen, and Marilyn Wingersky for their comments and suggestions, and to Alison Gooding, Maxine Kingston, Donna Lembeck, Joling Liang, and Kentaro Yamamoto for their assistance with Example 2. Requests for reprints should be sent to Robert J. Mislevy, Educational Testing Service, Princeton, NJ 08541.

0033-3123/90/0600-9534 $00.75/0 © 1990 The Psychometric Society
In contrast to standard IRT models, and, for that matter, to the true score models of classical test theory, a model that accommodates alternative strategies must begin with explicit statements about the processes by which subjects arrive at their answers. For example, items may be characterized in terms of the nature, number, and complexity of the operations required to solve them under each strategy that is posited. As we shall see, the psychological theory that must underlie an attempt to model mixed strategies will probably be lacking for most conventional tests. The most profitable uses of the methods we propose will be in applications in which a relatively strong psychological or educational theory has been or can be developed, and the practical decision-making problem concerns strategy usage.

The recent psychometric literature contains a few implementations of these ideas. Tatsuoka (1983) has studied performance on mathematics items in terms of the application of correct and incorrect rules, locating response vectors in a two-dimensional space, where the first dimension is an estimated ability parameter from a standard IRT model and the second is an index of lack of fit from that model. Paulson (1985), analyzing similar data but with fewer rules, uses latent class models to relate the probability of correct responses on an item to the features it exhibits and the rules that subjects might be following to solve it. Yamamoto (1987) combines aspects of both of these models, positing subpopulations of IRT respondents and of nonscalable respondents associated with particular expected response patterns. Samejima (1983) and Embretson (1985) offer models for alternative strategies in situations where subtask results can be observed in addition to the overall correctness or incorrectness of an item.
The present paper describes a family of multiple-strategy IRT models that apply when each subject belongs to one of a number of exhaustive and mutually exclusive classes that correspond to item-solving strategies, and the responses from all subjects in a given class accord with a standard IRT model. It is further assumed that for each item, its parameters under the IRT model for each strategy class can be related to known features of the item through psychological or substantive theory. The following section gives a general description of the model. A conceptual example illustrates some of the key ideas. A two-stage estimation procedure is then presented. The first stage estimates structural parameters: basic parameters for test items, proportions of subjects following each strategy, and proficiency distributions within each. The second stage estimates posterior distributions for individual subjects: the probability that they belong to each strategy class, and a conditional distribution of their ability corresponding to each class. A numerical example resolves subjects into classes of valid responders and random guessers. The final section discusses prospects of the approach for educational and psychological testing.

The Response Model

This section lays out the basic structure of a mixture of constrained item response models. Discussion will be limited to dichotomous items for convenience, but extensions to polytomous, continuous, and vector-valued observations are straightforward. We begin by briefly reviewing the general form of an IRT model. The probability of response x_{ij} (1 if correct, 0 if not) from subject i to item j is given by an IRT model as

\[
p(x_{ij} \mid \theta_i, \beta_j) = [f(\theta_i, \beta_j)]^{x_{ij}} \, [1 - f(\theta_i, \beta_j)]^{1 - x_{ij}}, \tag{1}
\]

where θ_i and β_j are (possibly vector-valued) parameters associated with subject i and item j respectively, and f is a known, twice-differentiable function whose range is the unit interval.
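As a concrete sketch of (1), the minimal Python below computes the response probability for a two-parameter logistic choice of f. The particular f is an assumption for illustration; the model above only requires a known function whose range is the unit interval.

```python
import math

def f_2pl(theta, a, b):
    """A two-parameter logistic choice of f: P(correct | theta) with
    discrimination a and difficulty b (illustrative assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_response(x, theta, a, b):
    """Equation (1): f^x * (1 - f)^(1 - x) for a dichotomous response x."""
    p = f_2pl(theta, a, b)
    return p if x == 1 else 1.0 - p
```

For any theta, the probabilities of the two possible responses sum to one, as (1) requires.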
Under the usual IRT assumption of local independence, the conditional probability of the response pattern x_i = (x_{i1}, ..., x_{in}) of subject i to n items is the product of n expressions like (1):

\[
p(x_i \mid \theta_i, \beta) = \prod_{j=1}^{n} p(x_{ij} \mid \theta_i, \beta_j).
\]

It may be possible to express item parameters as functions of some smaller number of more basic parameters α = (α_1, ..., α_M) that reflect the effects of M salient characteristics of items; that is, β_j = β_j(α). An important example of this type is the linear logistic test model (LLTM; Fischer, 1973; Scheiblechner, 1972). Under the LLTM, the item response function is the one-parameter logistic (Rasch) model, or

\[
p(x_{ij} \mid \theta_i, \beta_j(\alpha)) = \frac{\exp\,[x_{ij}(\theta_i - \beta_j)]}{1 + \exp\,(\theta_i - \beta_j)},
\]

and the model for item parameters is linear:

\[
\beta_j = \sum_{m=1}^{M} Q_{jm}\,\alpha_m = Q_j'\,\alpha.
\]

The elements of α are contributions to item difficulty associated with the M characteristics of items, presumably related to the number or nature of processes required to solve them. The elements of the known vector Q_j indicate the extent to which item j exhibits each characteristic. (To isolate the indeterminacy of origin in the LLTM, Fischer wrote β = Q'α + c1, where 1' = (1, ..., 1) and c is an arbitrary constant. This is subsumed in the form used in this paper by incorporating 1 into Q and c into α. The indeterminacy can be resolved by enforcing a constraint such as Σβ_j = 0 or E(θ) = 0.) Fischer (1973), for example, modeled the difficulty of the items in a calculus test in terms of the number of times an item requires the application of each of seven differentiation rules; Q_{jm} was the number of times that rule m must be employed to solve item j.

Consider now a set of items that may be answered by means of K different strategies. It need not be the case that all are equally effective, nor even that all generally lead to correct responses. Not all strategies need be available to all subjects. We make the following assumptions:

1.
Each subject is applying the same one of these strategies for all the items in the set. (The final section discusses how to relax this assumption to handle strategy-switching.)

2. The responses of a subject are observed but the strategy the subject has employed is not.

3. The responses of subjects following strategy k conform to an item response model of a known form.

4. Substantive theory associates the observable features of items with the probabilities of success for members of each strategy class. The relationships may be known fully, or only partially, as when the Q matrices in LLTM-type models are known but the basic parameters are not.

Define a K-dimensional subject parameter φ to indicate strategy usage, letting the k-th element in φ_i take the value one if subject i follows strategy k, and zero if not. Extending the notation introduced above, we write the probability of response pattern x_i, conditional on the subject parameters φ_i and θ_i, as

\[
p(x_i \mid \phi_i, \theta_i, \alpha) = \prod_k \Bigl\{ \prod_j [f_k(\theta_{ik}, \beta_{jk})]^{x_{ij}} \, [1 - f_k(\theta_{ik}, \beta_{jk})]^{1 - x_{ij}} \Bigr\}^{\phi_{ik}}, \tag{2}
\]

where β_{jk} ≡ β_{jk}(α) gives the item parameter(s) for item j under strategy k and θ_i = (θ_{i1}, ..., θ_{iK}) gives the proficiencies of subject i under the K strategies. For brevity, let β_k ≡ (β_{1k}(α), ..., β_{nk}(α)) denote the vector of parameters of all n items with respect to strategy k. It will be natural in certain applications to partition basic parameters for items in accordance with strategy classes; that is, α = (α_1, ..., α_K). When the strategies can be defined through K versions of the LLTM, as in Example 1 below, the differential expectations of success on item j under the various strategies are conveyed by the K different vectors Q_{jk}, k = 1, ..., K, that relate the item to each of the strategies:

\[
\beta_{jk} = \sum_m Q_{jkm}\,\alpha_{km} = Q_{jk}\,\alpha_k.
\]

Here the item difficulty parameter for item j under strategy k is a weighted sum of elements in α_k, the basic parameter vector associated with strategy k.
The weights Q_{jkm} indicate the degree to which each of the features m, as relevant under strategy k, is present in item j.

Example 1: Alternative Strategies for Spatial Tasks

The items of some tests intended to measure spatial visualization ability admit solution by nonspatial analytic strategies (French, 1965; Lohman, 1979; Pellegrino, Mumaw, & Shute, 1985). Consider items in which subjects are shown a drawing of a three-dimensional target object, and asked whether a stimulus drawing could be the same object after rotation in the plane of the picture. In addition to rotation, one or more key features of the stimulus may differ from those of the target. A subject can solve the item either by rotating the target mentally the required degree and recognizing the match (Strategy 1), or by employing analytic reasoning to detect feature matches without performing rotation (Strategy 2). Consider further a hypothetical three-item test comprised of such items. Each item will be characterized by (a) rotational displacement of 60, 120, or 180 degrees, and by (b) the number of features that must be matched. Table 1 lists the characteristics of the items in the hypothetical test.

TABLE 1
Item Features

Item   Rotational displacement   Salient features
1      60 degrees                3
2      120 degrees               2
3      180 degrees               1

Each subject i is characterized by two vectors. In the first, φ_i ≡ (φ_{i1}, φ_{i2}), φ_{ik} takes the value 1 if subject i employs strategy k and 0 if not. In the second, θ_i = (θ_{i1}, θ_{i2}), θ_{ik} characterizes the proficiency of subject i if the subject employs strategy k. Only one of the elements of θ_i is involved in producing subject i's responses, but we do not know which one.
Suppose that for subjects employing a rotational strategy, probability of success is given by the Rasch model:

\[
P(x_{ij} \mid \theta_{i1}, \beta_{j1}, \phi_{i1} = 1) = \frac{\exp\,[x_{ij}(\theta_{i1} - \beta_{j1})]}{1 + \exp\,(\theta_{i1} - \beta_{j1})}.
\]

Here θ_{i1} is the proficiency of subject i at solving tasks by means of the rotational strategy, and β_{j1} is the difficulty of item j under the rotational strategy. It is well established that the time required to solve mental rotation tasks is often linearly related to rotational displacement (e.g., Cooper & Shepard, 1973). To an approximation, so are log-odds of success (Tapley & Bryden, 1977). For the sake of the example, suppose that under the rotational strategy, item parameters take the following form:

\[
\beta_{j1} = Q_{j11}\,\alpha_{11} + \alpha_{12},
\]

where Q_{j11} encodes the rotational displacement of item j (1 for 60 degrees, 2 for 120 degrees, and 3 for 180 degrees), α_{11} is the incremental increase in difficulty for each increment in rotation, and α_{12} is a constant term, with Q_{j12} ≡ 1 implicit for all j. If α_{11} = 1 and α_{12} = -2, the item parameters β_{j1} that are in effect under Strategy 1 would be as shown in the second column of Table 2. (A subsequent section shows how to estimate these quantities.)

A Rasch model will also be assumed for subjects employing Strategy 2, the analytic strategy, but here the item parameters depend on the number of features that must be matched:

\[
\beta_{j2} = Q_{j21}\,\alpha_{21} + \alpha_{22},
\]

where Q_{j21} is the number of salient features, α_{21} is the incremental contribution to item difficulty of an additional feature, α_{22} is a constant term, and Q_{j22} ≡ 1 implicitly for all items. If α_{21} = 1.5 and α_{22} = -2.5, the item parameters in effect under Strategy 2 are the values in the third column of Table 2.

TABLE 2
Item Difficulty Parameters

Item   Strategy 1   Strategy 2
1      -1.0         2.0
2      0.0          0.5
3      1.0          -1.0

Note that the test has been constructed so that items that are relatively hard under one strategy are relatively easy under the other.
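The LLTM arithmetic behind Table 2 is easy to verify. A small Python sketch, using the Q entries and α values assumed above:

```python
# Q entries and basic parameters from the hypothetical example above.
Q1 = [1, 2, 3]        # rotation steps for items 1-3 (60, 120, 180 degrees)
Q2 = [3, 2, 1]        # salient features for items 1-3
alpha1 = (1.0, -2.0)  # (alpha_11, alpha_12): slope and constant, Strategy 1
alpha2 = (1.5, -2.5)  # (alpha_21, alpha_22): slope and constant, Strategy 2

# beta_jk = Q_jk1 * alpha_k1 + alpha_k2
beta1 = [q * alpha1[0] + alpha1[1] for q in Q1]  # Table 2, Strategy 1 column
beta2 = [q * alpha2[0] + alpha2[1] for q in Q2]  # Table 2, Strategy 2 column
```

Here beta1 reproduces (-1.0, 0.0, 1.0) and beta2 reproduces (2.0, 0.5, -1.0), the two columns of Table 2.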
Inferring strategy choice from observed response patterns is possible only if at least some patterns are more likely under some strategies than others. The response pattern (011), for example, has a correct answer to an item that is easy under Strategy 2 but hard under Strategy 1, and an incorrect answer to an item that is hard under Strategy 2 but easy under Strategy 1. Figure 1 plots the likelihood functions for x = (011) under both strategies; that is, p[x = (011) | θ_k, φ_k = 1] for k = 1, 2 as a function of θ_1 and of θ_2, respectively. The maximum of the likelihood under Strategy 2 is about eight times higher than the maximum under Strategy 1.

We can draw inferences about individual subjects if we know the proportions of people who choose each strategy, or π_k = P(φ_k = 1), and the distributions of proficiency among those using each strategy class, or g_k(θ_k) = p(θ_k | φ_k = 1). We shall discuss in the sequel how to estimate these quantities, but for now, assume some illustrative values, and examine the implications that follow for measuring individuals. Suppose that (i) θ_1 and θ_2 both follow standard normal distributions among the subjects that employ Strategies 1 and 2, respectively, and (ii) three times as many subjects use Strategy 1 as use Strategy 2; that is, π_1 = 3/4 and π_2 = 1/4. This joint prior distribution, illustrated in Figure 2, can be written as

\[
p(\theta_k = \theta,\ \phi_k = 1) = \pi_k\, g_k(\theta).
\]

Combining this prior via Bayes theorem with the likelihood induced by an observed response pattern x produces a joint posterior density for φ_k and θ_k given φ_k = 1, for k = 1, 2:

\[
p(\theta_k = \theta,\ \phi_k = 1 \mid x, \pi, \alpha) \propto p[x \mid \phi_k = 1,\ \theta_k = \theta,\ \beta_k(\alpha)]\,\pi_k\, g_k(\theta),
\]

where

\[
p[x \mid \phi_k = 1,\ \theta_k = \theta,\ \beta_k(\alpha)] = \prod_j \frac{\exp\,\{x_j[\theta - \beta_{jk}(\alpha)]\}}{1 + \exp\,[\theta - \beta_{jk}(\alpha)]}. \tag{3}
\]
FIGURE 1. Likelihood functions for x = (011) under Strategy 1 and Strategy 2.

The constant of proportionality required to normalize (3) is the reciprocal of the marginalization of the right side, namely

\[
p(x \mid \pi, \alpha) = \sum_k \pi_k \int p[x \mid \phi_k = 1,\ \theta_k = \theta,\ \beta_k(\alpha)]\, g_k(\theta)\, d\theta.
\]

The posterior distribution induced by x = (011) is shown in Figure 3. Marginalizing with respect to θ_k amounts to summing the area under the curve for strategy k, and gives the posterior probability that φ_k = 1, that is, that the subject has employed strategy k. The resulting values for this response pattern are P(φ_1 = 1 | x = (011)) = .28 and P(φ_2 = 1 | x = (011)) = .72. The prior probabilities that favored Strategy 1 have been revised substantially to favor Strategy 2. The conditional posterior for θ_1 given φ_1 = 1 has a mean and standard deviation of about .32 and .80. Corresponding values for the distribution of θ_2 given φ_2 = 1 are .50 and .81. We note in passing that these posteriors are not generally normal distributions, even though the priors were. The posteriors can in fact be quite skewed, as occurs with response patterns that have all correct or incorrect answers.

FIGURE 2. Prior distribution.

Parameter Estimation

Example 1 showed how to draw inferences about the strategy usage and proficiencies of subjects if the basic parameters for items and the strategy population proportions and distributions were known. In practice these quantities will have to be estimated. Sometimes it will be possible to obtain calibrating samples of subjects whose strategy usage is known with certainty.
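The posterior strategy probabilities reported in Example 1 for x = (011) can be checked numerically. The sketch below integrates the Rasch likelihood (3) against the standard normal priors on a simple grid; the grid quadrature is an assumption of convenience, not the method used in the paper.

```python
import math

beta = {1: [-1.0, 0.0, 1.0], 2: [2.0, 0.5, -1.0]}  # Table 2
pi = {1: 0.75, 2: 0.25}                            # prior strategy proportions
x = (0, 1, 1)

def likelihood(theta, b):
    """Rasch likelihood of pattern x at proficiency theta, as in (3)."""
    p = 1.0
    for xj, bj in zip(x, b):
        pj = 1.0 / (1.0 + math.exp(-(theta - bj)))
        p *= pj if xj else 1.0 - pj
    return p

def marginal(k, lo=-6.0, hi=6.0, step=0.01):
    """Integrate likelihood * standard normal density over a grid."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for i in range(n + 1):
        t = lo + i * step
        total += likelihood(t, beta[k]) * math.exp(-t * t / 2.0)
    return total * step / math.sqrt(2.0 * math.pi)

joint = {k: pi[k] * marginal(k) for k in (1, 2)}
post = {k: joint[k] / sum(joint.values()) for k in (1, 2)}
# post[1] and post[2] come out near the .28 and .72 reported for this pattern
```

The prior odds of 3:1 for Strategy 1 are overturned by the likelihood, which favors Strategy 2 for this pattern.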
In these cases, standard IRT methodology can be used to estimate the structural parameters of the problem: the basic parameters α for items, the proportions π_k of subjects employing each strategy, and the parameters of the distributions g_k(θ) of subjects employing each strategy. The item parameters need not even be described in terms of salient item features in this situation; that is, β's need not be modeled in terms of α's. In other cases, obtaining certain knowledge of subjects' strategy usage will not be convenient or even possible. It then becomes necessary to estimate α's, π's, and g_k's directly from the responses of subjects from an unknown mixture of strategy classes, as discussed in the following section. In either case, the resulting estimates of these structural parameters can be treated as known true values so that "empirical Bayes" inferences can be drawn about individual examinees; additional details on this second stage of estimation are presented subsequently.

FIGURE 3. Posterior distribution.

Estimating Structural Parameters in the Mixture Case

Equation (2) gives the conditional probability of the response vector x given θ and φ, or p(x | θ, φ, α). Consider a population in which strategies are employed in proportions π_k and within-strategy proficiencies have densities g_k(θ_k | η_k), characterized by possibly unknown parameters η_k, among the subjects using them. The marginal probability of x, or the probability of observing x from a subject selected at random from the mixed population, is

\[
p(x \mid \xi) = \sum_k \pi_k \int p(x \mid \theta_k,\ \phi_k = 1,\ \alpha)\, g_k(\theta_k \mid \eta_k)\, d\theta_k. \tag{4}
\]
Let ξ denote the vector of all structural parameters, (α, π, η). The likelihood for ξ induced by observing the response vectors X ≡ (x_1, ..., x_N) of N subjects is the product over subjects of terms like (4). Maximum likelihood estimates for ξ are obtained by maximizing the likelihood function, or, equivalently, the logarithm of it. Bayes modal estimates can be obtained by similar numerical procedures after multiplying the likelihood by a prior distribution for ξ. The log likelihood is

\[
\Lambda = \sum_{i=1}^{N} \log p(x_i \mid \xi) = \sum_i \log \sum_k \pi_k \int p(x_i \mid \theta_k,\ \phi_k = 1,\ \alpha)\, g_k(\theta_k \mid \eta_k)\, d\theta_k. \tag{5}
\]

Note that the proficiency distributions g_k appearing in (4) and (5) pertain only to the subjects who use the strategies. It may be the case that (θ_1, ..., θ_K) has a joint distribution in the population at large, but because we assume that only one component is involved in the response process of any individual, only the margins of this joint distribution are estimable, just as would be the case if each subject's strategy class were known.

Let S be the vector of first derivatives and H the matrix of second derivatives of Λ with respect to ξ. Under regularity conditions, the maximum likelihood estimates solve the likelihood equation S = 0, and a large-sample approximation of the matrix of estimation errors is given by the negative inverse of H evaluated at the estimate. A standard numerical approach to solving likelihood equations is to use some variation of Newton's method. Newton-Raphson iterations, for example, improve a provisional estimate ξ⁰ by adding the correction term -H⁻¹S, evaluated at ξ⁰. Fletcher-Powell iterations avoid computing and inverting H by using an approximation of H⁻¹ that is built up from changes in S from one cycle to the next. These solutions have the advantage of rapid convergence if starting values are reasonable; often fewer than 10 iterations are necessary.
S and H can be difficult to work with, however, and all parameters must usually be dealt with simultaneously because the off-diagonal elements in H needn't be zero. For these reasons, computationally simpler but slower-converging solutions based on Dempster, Laird, and Rubin's (1977) EM algorithm are more typically employed in mixture problems (Titterington, Smith, & Makov, 1985). The solution described below for the present problem uses discrete representations for the g_k's, so the relatively simple "finite mixtures" case of EM obtains (see section 4.3 of Dempster, Laird, & Rubin).

Suppose that for each k, subject proficiency under strategy k can take only the L_k values Θ_{k1}, ..., Θ_{kL_k}. The density g_k is thus characterized by these points of support and by weights associated with each, g_k(Θ_{kl} | η_k). Define the subject variable ψ_i = (ψ_{i11}, ..., ψ_{iKL_K}), a vector of length L_1 + ... + L_K of indicator variables: the element ψ_{ikl} is 1 if the proficiency of subject i under strategy k is Θ_{kl} and 0 if not. There are a total of K ones in ψ_i, one for each strategy, though again, only the one associated with the strategy used by subject i played a role in producing x_i.

The feature of the actual log likelihood (5) that makes it difficult to solve is that we do not observe subjects' φ and θ values (or equivalently, φ and ψ values). In Dempster, Laird, and Rubin's terminology (1977), this is an "incomplete data" problem. Obtaining MLEs for ξ would be much simpler if these values were observed along with x's, in a corresponding "complete data" problem. Were this the case, the log-likelihood would be

\[
\Lambda^* = \sum_i \sum_k \phi_{ik} \sum_l \psi_{ikl} \log p[x_i \mid \theta_k = \Theta_{kl},\ \phi_k = 1,\ \beta_k(\alpha)]
 + \sum_i \sum_k \phi_{ik} \sum_l \psi_{ikl} \log g_k(\Theta_{kl} \mid \eta_k)
 + \sum_i \sum_k \phi_{ik} \log \pi_k. \tag{6}
\]

The basic parameter for items, α, appears only in the first term on the right, so maximizing with respect to α must address that term only.
This amounts to estimating item parameters when subject abilities are known. When α consists of distinct subvectors for each strategy, each subvector could be estimated separately using data from only the subjects in the corresponding strategy group. The subpopulation parameters η appear in only the second term, separating them from α and π in ML estimation. They too lead to smaller separate subproblems if η consists of distinct subvectors for each strategy. The population proportions π appear in only the last term. Unless they are further constrained, their ML estimates are simply observed proportions. The values of Θ may be either specified a priori, as in Mislevy (1986), or estimated from the data, as in de Leeuw and Verhelst (1986). In the latter case, they are additional elements of ξ. Their likelihood equations have contributions from both the first and second terms of (6), but the equations for these points of support under strategy k involve data from only those subjects using strategy k. Their cross second derivatives with points corresponding to other strategies are zero, although their cross derivatives with elements of α and η that are involved with the same strategy are generally not.

The M-step of an EM solution requires solving a maximization problem of exactly the type of (6), with one exception: the unobserved values of each φ_i and ψ_i are replaced by their conditional expectations given x_i and provisional estimates of ξ, say ξ⁰. The E-step calculates these conditional expectations as follows. Denote by I_{ikl} the following term in the marginal likelihood associated with subject i, strategy k, and proficiency value Θ_{kl} within strategy k:

\[
I_{ikl} = p[x_i \mid \theta_k = \Theta_{kl},\ \phi_k = 1,\ \beta_k(\alpha)]\, g_k(\Theta_{kl} \mid \eta_k)\, \pi_k,
\]

and let I⁰_{ikl} be a provisional estimate obtained using ξ⁰ rather than ξ.
Provisional conditional expectations of the parameters of individual subjects are then obtained as

\[
\hat\psi_{ikl} = E(\psi_{ikl} \mid x_i,\ \phi_{ik} = 1,\ \xi = \xi^0) = \frac{I^0_{ikl}}{\sum_{l'} I^0_{ikl'}}, \tag{7}
\]

and

\[
\hat\phi_{ik} = E(\phi_{ik} \mid x_i,\ \xi = \xi^0) = \frac{\sum_{l'} I^0_{ikl'}}{\sum_{k'} \sum_{l'} I^0_{ik'l'}}. \tag{8}
\]

The EM formulation makes it clear how each subject contributes to the estimation of the parameters in all strategy classes, even though only one of them was relevant to the production of any one subject's responses. Each subject's data contribute to the estimation for each strategy class in proportion to the probability that that strategy was the one the subject employed, given the observed response pattern.

Although the EM solution has the advantages of conceptual and computational simplicity, it may converge slowly. Its rate of convergence depends on how well x determines subjects' θ and φ values. This in turn is determined by how greatly the relative likelihoods of response patterns differ from one strategy class to another. Accelerating procedures such as those described by Ramsay (1975) and Louis (1982) can be used to hasten convergence of the EM solution. The closer the starting values are to the MLEs, the better, not just for quicker convergence, but also for better chances of convergence to the global maximum. There is no guarantee against local maxima in such mixture problems. One general approach to obtaining initial values is to sort subjects according to their most plausible strategy choices, and estimate π's, α's, and g's as if these temporary assignments were correct.

Empirical Bayes Inference for Individual Subjects

If the structural parameters ξ are accurately estimated, the posterior density of the parameters of subject i is approximated by

\[
p(\theta_{ik} = \theta,\ \phi_{ik} = 1 \mid x_i,\ \hat\xi) \propto p[x_i \mid \phi_k = 1,\ \theta,\ \beta_k(\hat\alpha)]\, \hat\pi_k\, g_k(\theta \mid \hat\eta_k),
\]

where the reciprocal of the normalizing constant is obtained by integrating the expression on the right over θ within each k, then summing over k.
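For the discrete case, the E-step quantities I_{ikl} and the expectations (7) and (8) can be sketched for a single subject as follows. A Rasch form is assumed within every class purely for illustration; any known response model per class would do.

```python
import math

def rasch_lik(x, theta, beta):
    """Likelihood of response vector x under the Rasch model at theta."""
    p = 1.0
    for xj, bj in zip(x, beta):
        pj = 1.0 / (1.0 + math.exp(-(theta - bj)))
        p *= pj if xj else 1.0 - pj
    return p

def e_step(x, betas, points, weights, pis):
    """One subject's E-step: I_kl = p(x | Theta_kl, class k) g_k(Theta_kl) pi_k,
    then phi_k as in (8) and, within each class, psi_kl as in (7)."""
    I = [[rasch_lik(x, t, betas[k]) * w * pis[k]
          for t, w in zip(points[k], weights[k])]
         for k in range(len(pis))]
    total = sum(sum(row) for row in I)
    phi = [sum(row) / total for row in I]              # eq. (8)
    psi = [[v / sum(row) for v in row] for row in I]   # eq. (7)
    return phi, psi
```

By construction, phi sums to one over classes, and each psi row sums to one over the points of support for its class.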
The posterior probability that subject i used strategy k is approximated by

\[
P(\phi_{ik} = 1 \mid x_i,\ \hat\xi) = \int p(\theta_{ik} = \theta,\ \phi_{ik} = 1 \mid x_i,\ \hat\xi)\, d\theta.
\]

A subject's posterior conditional mean for a given strategy class is approximated by

\[
\bar\theta_{ik} = \frac{\int \theta\, p(\theta_{ik} = \theta,\ \phi_{ik} = 1 \mid x_i,\ \hat\xi)\, d\theta}{P(\phi_{ik} = 1 \mid x_i,\ \hat\xi)},
\]

with an analogous expression for the posterior conditional variance. If discrete representations have been employed for the g_k's, approximations based on (7) and (8) can be used.

Empirical Bayes estimates more closely approximate true Bayes estimates as structural parameters are more precisely estimated, so they are most suitable with large samples of subjects. Tsutakawa and Soltys's (1988) study of IRT ability estimation suggests that when ξ is not accurately determined in the present problem, point estimates of θ and φ may not be far off but their accuracy will be overstated. A full Bayesian solution, which takes uncertainty about ξ into account, may be preferred in those circumstances. The interested reader is referred to Tsutakawa and Soltys for an illustration of one suitable approximation and references to others.

Example 2: A Mixture of Valid Responders and Random Guessers

The Rasch model can sometimes provide a reasonably good summary of subjects' responses on multiple-choice tests, especially if the items are fairly easy for the subjects, or if hard items have attractive distractors that usually lead low proficiency subjects to incorrect responses. But if some unmotivated subjects simply respond at random to all items, their responses will bias the estimation of the item parameters that would pertain to the majority of the subjects. In this example we consider a two-class model, under which a subject responds either in accordance with the Rasch model or guesses totally at random. For subjects in the latter class, probabilities of correct response are fixed at the reciprocals of the numbers of response alternatives to the items.
Using the procedures described above, it is possible to free estimates of the item parameters that pertain to the valid responders from biases due to random guessers, even though it is not known with certainty who the guessers are, and to estimate the proportions of valid responders and random guessers in the sample. The marginal probability of response pattern x in this situation is the two-class mixture

\[
p(x_i \mid \xi) = \sum_{k=1}^{2} p(x_i \mid \phi_k = 1,\ \xi)\, \pi_k,
\]

where Strategy 1 corresponds to the Rasch model and Strategy 2 corresponds to random guessing. The composition of ξ can be described as follows. It includes first the strategy proportions π_1 and π_2. For the Rasch class, the basic parameters α_1 are item difficulty parameters b_j for j = 1, ..., n. Suppose the distribution g_1 of proficiencies of subjects following the Rasch model is discrete, with L points of support Θ = (Θ_1, ..., Θ_L) and associated weights ω = (ω_1, ..., ω_L). The marginal probability of response pattern x under Strategy 1 is thus

\[
p(x \mid \phi_1 = 1,\ \alpha_1,\ \Theta,\ \omega) = \sum_l \omega_l \prod_j \frac{\exp\,[x_j(\Theta_l - b_j)]}{1 + \exp\,(\Theta_l - b_j)}.
\]

Under the random guessing strategy, the basic parameters α_2 are the probabilities c_j of responding correctly to each item j. If we assume these probabilities take the values of the reciprocals of the numbers of item choices, they are known constants. All subjects following this strategy are assumed to have the same probabilities of correct response on a given item, so no distribution g_2 is required. For such subjects, the probability of response pattern x is simply

\[
p(x \mid \phi_2 = 1,\ \alpha_2) = \prod_j c_j^{x_j} (1 - c_j)^{1 - x_j}.
\]

The data for the following numerical example were gathered in a field trial of a test designed to measure the reading proficiencies of college students. Since their performances had no bearing on their academic records, the subjects had no external reason to try their best.
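The two class marginals just given combine directly into the mixture probability of a pattern. A sketch under assumed values (the particular difficulties, support points, and weights below are illustrative, not estimates from these data):

```python
import math

def p_rasch_class(x, b, points, weights):
    """Marginal probability of pattern x for the Rasch class: a sum over the
    discrete proficiency distribution (points Theta_l, weights omega_l)."""
    total = 0.0
    for theta, w in zip(points, weights):
        p = 1.0
        for xj, bj in zip(x, b):
            pj = 1.0 / (1.0 + math.exp(-(theta - bj)))
            p *= pj if xj else 1.0 - pj
        total += w * p
    return total

def p_guess_class(x, c=0.25):
    """Probability of pattern x under random guessing on four-choice items."""
    p = 1.0
    for xj in x:
        p *= c if xj else 1.0 - c
    return p

def p_mixture(x, b, points, weights, pi1):
    """Two-class marginal: pi1 * (Rasch class) + (1 - pi1) * (guessing class)."""
    return (pi1 * p_rasch_class(x, b, points, weights)
            + (1.0 - pi1) * p_guess_class(x))
```

Because each class's probabilities sum to one over all response patterns, so does the mixture, for any pi1 between zero and one.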
And indeed, proctors reported that a few subjects appeared to be marking the answer sheet without having opened their test booklets. From a total sample of about 2180 subjects, 1906 provided responses to all the items. This example concerns their right/wrong responses to twelve four-choice items. Our objective is neither to provide a complete analysis of these particular data nor to promote the use of the two-class model, but to offer a numerical illustration of a simple mixed-strategies model. We focus our attention on a comparison of results from the Rasch IRT model and the two-class model described above, adding a few comments concerning the three-parameter logistic IRT model. The two-class model is a special case of both our mixed-strategies model and Yamamoto's (1987) hybrid model for IRT and latent class responders. The estimates presented here were obtained with a computer program written by Yamamoto, using the EM estimation discussed in a preceding section. For both the Rasch and the two-class solutions, a ten-point discrete characterization of g_1 was employed, over equally-spaced points from \Theta_1 = -4 to \Theta_{10} = +4. The Rasch solution comprised item parameters and estimated population weights at each point of support. The two-class solution also had these parameters, which now pertained just to subjects following the Rasch model, plus an additional parameter for the proportion of subjects in the Rasch class, as opposed to the hypothesized random-guessing class. Probabilities of correct response were fixed at .25 for the random-guessing class. Resulting values of -2 log \lambda for the Rasch model and the two-class model are 2752 and 2606 respectively. The chi-square approximation for -2 log \lambda is not trustworthy because most realized response patterns were observed only once or twice, but the degrees of freedom would be 942 and 941.
The difference in chi-squares is also not to be taken seriously as a chi-square, since the Rasch model is obtained as a boundary solution of the two-class model, with \pi_1 = 1. Nevertheless, improving the chi-square by 146 with a single additional parameter must be considered worthwhile. In some respects the solutions do not differ radically. The two-class solution gives an MLE of .9548 for \pi_1, so it is estimated that fewer than five percent of the subjects are guessing at random. The item parameter estimates are related monotonically, as seen in Table 3, but the differences are meaningful: distances between items are quite similar under the two models at the hard end of the scale, but become increasingly spread out as the items become easier. This is because differences between Rasch item parameter estimates are nearly equivalent to differences between logits of the items' proportions of correct response, and if the two-class model is correct, differences in proportions-correct are attenuated by the constant noise of the random responders. A simple example makes this clear. The observed proportions of correct response to Items 1 and 2 are .863 and .809; the corresponding logits are -1.844 and -1.443, for a difference of .401. Under the two-class model, however, the observed proportion correct is the weighted average of the Rasch proportion correct and the guessing rate, the expected value of which is .25; the weights are the proportions of subjects in each class. Thus

p_{obs} = p_{Rasch} \pi_1 + .25 \pi_2 .

Substituting in the estimated values .9548 and .0452 for \hat{\pi}_1 and \hat{\pi}_2 and solving for p_{Rasch}, we can approximate Rasch proportions correct as .934 and .829. The corresponding logits are -2.650 and -1.579, for a difference of 1.071. Perhaps more interesting are implications for measuring individuals.
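The back-solving step just described can be written as a short function. The guessing rate of .25 and the sample values in the usage note are assumptions for illustration; the paper's own figures come from the full EM solution rather than this simple inversion.

```python
import math

def rasch_proportion(p_obs, pi_rasch, guess_rate=0.25):
    """Invert p_obs = p_rasch * pi_rasch + guess_rate * pi_guess
    (with pi_guess = 1 - pi_rasch) to recover the Rasch-class proportion correct."""
    return (p_obs - guess_rate * (1.0 - pi_rasch)) / pi_rasch

def logit(p):
    """Log-odds of a proportion; Rasch difficulty estimates track the negatives
    of the item logits, as in the comparison above."""
    return math.log(p / (1.0 - p))
```

For example, mixing a Rasch-class proportion correct of .90 with a 5% guessing class gives an observed proportion of .90 × .95 + .25 × .05 = .8675, and rasch_proportion(.8675, .95) recovers the original .90.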
Total score is a sufficient statistic for ability under the Rasch model, so everyone with the same score receives the same ability estimate whenever the Rasch model is assumed to hold. In the Rasch model solution this is always the case. In the two-class solution, everyone with the same score receives the same ability estimate conditional on membership in the Rasch class--but the posterior probabilities of belonging to the Rasch class can vary considerably among subjects with the same score. This variation depends on just which items they answered correctly, a feature of their data that has no relevance for ability estimation under the Rasch model. (As we shall discuss below, however, these considerations do play a role in thoughtful applications of the Rasch model.)

TABLE 3
Item Parameter Estimates from the Rasch and Two-Class Models

                 Item Difficulties
Item    Rasch Model*   Two-Class**   Difference
  1       -2.473         -3.270         .797
  2       -1.907         -2.391         .484
  3       -1.794         -2.231         .437
  4       -1.102         -1.283         .181
  5        -.541          -.673         .132
  6        -.241          -.326         .085
  7         .021          -.046         .067
  8         .503           .486         .017
  9         .824           .787         .037
 10         .912           .880         .032
 11         .978           .943         .035
 12        1.276          1.218         .058
Slope       .431           .411

*Scale set by standardizing the subject distribution.
**Scale set by standardizing the estimated Rasch-class distribution.

Table 4 illustrates this phenomenon. Subjects A, B, and C all have scores of three, the expected score of a random guesser, and their posterior means for \theta are the same; but their posterior probabilities of being a guesser differ substantially--about .1, .3, and .6. Using ability estimates to compare these subjects with subjects who have high scores and high probabilities of being Rasch responders would seem warranted for Subject A, questionable for Subject B, and clearly inappropriate for Subject C--a consideration that should be taken into account in, say, assigning educational treatments.
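The pattern-dependence illustrated in Table 4 can be reproduced in a few lines of Python. The difficulties, proficiency grid, and class proportions below are illustrative assumptions, not the fitted values; the point is only that, at a fixed score, correct answers concentrated on hard items raise the posterior probability of guessing.

```python
import math

def posterior_guesser(x, b, points, weights, c, pi):
    """P(guessing class | pattern x) by Bayes' rule under the two-class mixture.
    b: Rasch item difficulties; (points, weights): discrete proficiency
    distribution; c: known guessing rates; pi: (P(Rasch), P(guessing))."""
    # Marginal likelihood of x for a Rasch responder, averaged over the grid
    like_rasch = 0.0
    for theta, w in zip(points, weights):
        like = 1.0
        for xj, bj in zip(x, b):
            p = 1.0 / (1.0 + math.exp(-(theta - bj)))
            like *= p if xj == 1 else 1.0 - p
        like_rasch += w * like
    # Likelihood of x for a random guesser
    like_guess = 1.0
    for xj, cj in zip(x, c):
        like_guess *= cj if xj == 1 else 1.0 - cj
    return pi[1] * like_guess / (pi[0] * like_rasch + pi[1] * like_guess)
```

With items ordered easy to hard, a score of two earned on the two easiest items yields a smaller posterior probability of guessing than the same score earned on the two hardest items, since the guessing likelihood is the same for both patterns but the Rasch likelihood is not.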
An alternative treatment of multiple-choice data uses the three-parameter logistic (3PL) IRT model, or

P(x_j = 1 \mid \theta, a_j, b_j, c_j) = c_j + (1 - c_j) \frac{\exp [a_j(\theta - b_j)]}{1 + \exp [a_j(\theta - b_j)]} .

TABLE 4
Posterior Probabilities and Ability Estimates for Selected Response Patterns

                                Two-Class Model         3-Parameter Model
     Response Pattern   Rasch     Rasch      Prob.
     (Easy -> Hard)   Estimate  Estimate   Guesser     Estimate     PSD*
A    110000000001       -1.028    -1.174      .091       -1.164     .575
B    010010001000       -1.028    -1.174      .314       -1.263     .611
C    010000001010       -1.028    -1.174      .587       -1.451     .610

*PSD = Posterior Standard Deviation

The 3PL can be obtained from the Rasch model by adding a slope parameter a_j and a lower-asymptote parameter c_j for each item. The lower asymptote is related to the possibility of chance success, as it allows even low-ability subjects a chance of responding correctly. We fit the Rasch model and the 3PL to the data described above with Mislevy and Bock's (1983) BILOG program. The Rasch solution was essentially the same as the one obtained with Yamamoto's program. The 3PL solution employed BILOG's mild default Bayesian prior distributions for a and c parameters, and obtained the item parameter estimates shown in Table 5. The difference between the Rasch model and the 3PL in terms of -2 log \lambda was 130, similar to that obtained when going from the Rasch model to the two-class solution, but at the cost of 23 additional parameters rather than just one! Unlike the Rasch model, the 3PL can assign different proficiency estimates to response patterns with the same score. The posterior mean \theta's for Subjects A, B, and C appear in Table 4. For these subjects, higher 3PL ability estimates are associated with better conformance to the model; the 3PL effectively discounted their unexpected correct responses to hard items. The difference among the estimates, however, is trivial in comparison to their precision.
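The 3PL response function above can be sketched in one line of Python; the parameter values in the usage note are arbitrary illustrations, not the Table 5 estimates.

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic item response function:
    chance level c plus (1 - c) times a two-parameter logistic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

At theta = b the curve passes through c + (1 - c)/2, and as theta decreases it flattens toward the lower asymptote c rather than toward zero, which is how the model accommodates chance success by low-ability subjects.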
Unlike the two-class model's differential posterior probabilities of being a guesser, these slightly different 3PL estimates indicate merely a slight difference in degree of proficiency rather than a profound difference in kind. These results call to mind current work in IRT on subject fit statistics or "caution indices" (e.g., see Chapter 4 of Hulin, Drasgow, & Parsons, 1983). The idea is to detect aberrant response patterns that can signal atypical patterns of knowledge, misunderstanding of directions, or idiosyncratic effects of item content with particular subjects. A high value on a caution index suggests that the subject's score may not be suitable for comparing that subject's performance with those of other subjects who gave more typical response patterns. But even if a subject's score is not used subsequently, the subject's aberrant response pattern may have distorted the estimation of IRT item parameters and thereby contaminated comparisons among subjects whose responses did accord with the model.

TABLE 5
Item Parameter Estimates from the Three-Parameter Model

Item      a         b         c
  1      .745    -1.522     .209
  2      .723    -1.201     .176
  3     1.002     -.933     .189
  4     1.087     -.409     .207
  5      .688     -.026     .171
  6      .907      .332     .203
  7      .979      .625     .259
  8      .529     1.361     .256
  9      .965     1.216     .199
 10      .721     1.992     .282
 11      .800     1.604     .233
 12     1.079     1.642     .258

This concern leads to a sometimes controversial practice often carried out in practical applications of the Rasch model with multiple-choice tests: deleting poorly fitting subjects from item parameter estimation. A similarity between this procedure and our two-class model is that both procedures partition the data into a block where the simple Rasch model fits well and a block in which it does not. A difference is that the partitioning effected by subject trimming is explicit and based on observed variables; with the two-class model, it is implicit and based on a latent variable.
Each subject's data are effectively trimmed in proportion to the weight of the evidence that they do not fit the Rasch model.

Discussion

Theories about the processes by which subjects attempt to solve test items play no formal role in standard test theory, including conventional item response theory. Only a data matrix of correct and incorrect responses is addressed, and items and subjects are parameterized strictly on the basis of propensities toward correct response. When all that is desired is a simple comparison of subjects in terms of a general propensity of this nature, IRT models suffice and in fact offer many advantages over classical true-score test theory. Situations for which standard IRT models prove less satisfactory involve a need either to better understand the cognitive processes that underlie item response, or to employ theories about such processes to provide more precise or more valid measurement. Extensions of item response theory in this direction are exemplified by the linear logistic test model (Fischer, 1973; Scheiblechner, 1972), Embretson's (1985) multicomponent models, Samejima's (1983) model for multiple strategies, and Tatsuoka's (1983) "rule space" analyses. The approach offered in this paper concerns situations in which different persons may choose different strategies from a number of known alternatives, but overall proficiencies provide meaningful comparisons among persons employing the same strategy. We suppose that strategy choice is not directly observed but can be inferred, though without certainty, from response patterns on theoretical bases. Assuming that substantive theory allows us to differentiate our expectations about response patterns under different strategies, and that a subject applies the same strategy on all items, it is possible to estimate the parameters of IRT models for each strategy.
It is further possible to calculate the probabilities that a given subject has employed each of the alternative strategies, and to estimate the subject's proficiency under each possible strategy. Assuming that a subject uses the same strategy on all items may not suffice for all applications, since switching strategies from one item to another is sometimes an option (Kyllonen, Lohman, & Snow, 1984). In a technical sense, our approach can allow for strategy-switching by incorporating additional strategy classes that are combinations of different strategies for different items. Based on Just and Carpenter's (1985) finding that subjects sometimes apply whichever strategy is easier for a given problem, we might define three strategy classes for items like those in Example 1:

1. Always apply the rotational strategy;
2. Always apply the analytic strategy;
3. Apply whichever strategy is better suited to an item.

If items were constructed to run from easy to hard under the rotational strategy and from hard to easy under the analytic strategy, subjects using the third "mixed" strategy would find them easy, then harder, then easier again. There are limitations to how far these ideas can be pressed in applications with binary data. A simulation study reported in Mislevy and Verhelst (1987) showed that a Rasch model fit a four-item test acceptably well with a sample of 1200 subjects when the true model was the Rasch/guessing mixture. In one way or another, more information would be needed to attain a sharper distinction between strategy classes and, correspondingly, more power to differentiate among competing models for the data. One source of information is more binary items. Fifty items rather than four, including some that are very hard under the Rasch strategy, would do.
A different source of information available in other settings would be to draw from richer observational possibilities, such as levels of correctness, response latencies, eye-fixation patterns, or choices of response alternatives that are differentially attractive under different strategies. Differentiating the likelihood of response patterns under different strategies is the key to successful applications of the approach. The items in the test must be constructed to maximize strategy differences, as by including items that are hard under one strategy but easy under another. Most tests in current use with standard test theory are not constructed with this purpose in mind; indeed, they are constructed so as to minimize differentiation among strategies, since it lowers the reliability of overall-propensity scores. When strategy class decisions are of interest, useful information is less likely to be obtained with an existing conventional test than with a newly constructed test that highlights differential patterns related to strategy choice. In addition to the applications used in the preceding examples, a number of other current topics in educational and psychological research are amenable to expression in terms of mixtures of IRT models. We conclude by mentioning three.

Hierarchical development. Wilson's (1984, 1989) "saltus" model (Latin for "leap") extends the Rasch model to developmental patterns in which capabilities increase in discrete stages, by including stage parameters as well as abilities for persons, and stage parameters as well as difficulties for items. Examples would include Piaget's (1960) innate developmental stages and Gagné's (1962) learned acquisition of rules.

[Figure 4. Saltus example: three stages, common offset. Item difficulty scales, running from hard to easy, for Items 1-6 at the highest, middle, and lowest stages.]
Suppose that K stages are ordered in terms of increasing and cumulative competence. In our notation, \phi would indicate the stage membership of a subject. In the highest stage, item responses follow a Rasch model with parameters \beta_j. Rasch models fit lower stages as well, but the item parameters are offset by amounts that depend on the stage at which the item can first be solved. Our basic parameters \alpha would correspond to the item parameters for the highest stage and the offset parameters for particular item types at particular lower stages. Figure 4 gives a simple illustration in which items associated with higher stages have an additional increment of difficulty for subjects at lower stages. In applications such as Siegler's (1981) balance beam tasks, subjects at selected lower stages tend to answer certain types of higher-stage items correctly, albeit through flawed reasoning. In these cases, the offset serves to give easier item difficulty parameters to those items in those stages.

Mental models for problem solving. In introducing their experimental study of mental models for electricity, Gentner and Gentner (1983) state:

Analogical comparisons with simple or familiar systems often occur in people's descriptions of complex systems, sometimes as explicit analogical models, and sometimes as implicit analogies, in which the person seems to borrow structure from the base domain without knowing it. Phrases like "current being routed along a conductor" and "stopping the flow" of electricity are examples (p. 99).

Mental models are important as a pedagogical device and as a guide to problem-solving. Inferring which models a person is using, based on a knowledge of how conceivable analogues help or hinder the solution of certain types of problems, provides a guide to subsequent training. In Gentner and Gentner's experiment, the problems concerned simple electrical circuits with series and parallel combinations of resistors and batteries.
Popular analogies for electricity are flowing waters (Strategy 1) and "teeming crowds" of people entering a stadium through a few narrow turnstiles (Strategy 2). The water-flow analogy facilitates battery problems, but does not help with resistor problems; indeed, it suggests an incorrect solution for the current in circuits with parallel resistors. The teeming-crowd analogy facilitates problems on the combination of resistors, but is not informative about combinations of batteries. If a Rasch model holds for items within strategies, Gentner and Gentner's hypotheses correspond to constraints on the order of item difficulties under the two strategies. If each item type were replicated a sufficient number of times, it would be possible to make strong inferences about which model a particular subject was using, in order to plan subsequent instruction.

Changes in intelligence over age. An important topic in the field of human development is whether, and how, intelligence changes as people age (Birren, Cunningham, & Yamamoto, 1983). Macrae (no date) identifies a weakness of most studies that employ psychometric tests to measure aging effects: total scores fail to reflect important differences in the strategies different subjects bring to bear on the items they are presented. Total score differences among age and educational-background groups on Raven's matrices test were not significant in the study she reports. But analyses of subjects' introspective reports on how they solved items revealed that those with academically oriented backgrounds were much more likely to have used the preferred "algorithmic" strategy over a "holistic" strategy than those with vocationally oriented backgrounds. Since the use of algorithmic strategies was found to increase probabilities of success differentially on distinct item types, this study would be amenable to IRT mixture modeling.
Inferences could then be drawn about problem-solving approaches without resorting to more expensive and possibly unreliable introspective evidence. The key here, as in any other application in which a mixture of strategies is to be resolved, is to develop tasks that offer different amounts of resistance to subjects using different strategies.

References

Birren, J. E., Cunningham, W. R., & Yamamoto, K. (1983). Psychology of adult development and aging. Annual Review of Psychology, 34, 543-575.
Carter, P., Pazak, B., & Kail, R. (1983). Algorithms for processing spatial information. Journal of Experimental Child Psychology, 36, 284-304.
Cooper, L. A., & Shepard, R. N. (1973). Chronometric studies of the rotation of mental images. In W. G. Chase (Ed.), Visual information processing (pp. 76-176). Orlando, FL: Academic Press.
de Leeuw, J., & Verhelst, N. (1986). Maximum likelihood estimation in generalized Rasch models. Journal of Educational Statistics, 11, 183-196.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Embretson, S. E. (1985). Multicomponent latent trait models for test design. In S. E. Embretson (Ed.), Test design: Developments in psychology and psychometrics (pp. 195-218). Orlando, FL: Academic Press.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 36, 359-374.
French, J. W. (1965). The relationship of problem-solving styles to the factor composition of tests. Educational and Psychological Measurement, 25, 9-28.
Gagné, R. M. (1962). The acquisition of knowledge. Psychological Review, 69, 355-365.
Gentner, D., & Gentner, D. R. (1983). Flowing waters or teeming crowds: Mental models of electricity. In D. Gentner & A. L. Stevens (Eds.), Mental models (pp. 99-129). Hillsdale, NJ: Erlbaum.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Just, M. A., & Carpenter, P. A. (1985). Cognitive coordinate systems: Accounts of mental rotation and individual differences in spatial ability. Psychological Review, 92, 137-172.
Kyllonen, P. C., Lohman, D. F., & Snow, R. E. (1984). Effects of aptitudes, strategy training, and task facets on spatial task performance. Journal of Educational Psychology, 76, 130-145.
Lohman, D. F. (1979). Spatial ability: A review and reanalysis of the correlational literature (Technical Report No. 8). Stanford, CA: Stanford University, Department of Education, Aptitude Research Project.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44, 226-233.
Macrae, K. S. (n.d.). Strategies underlying psychometric test responses in young and middle-aged adults of varying educational background. Unpublished manuscript, La Trobe University, Australia.
Messick, S. (1984). The psychology of educational measurement. Journal of Educational Measurement, 21, 215-237.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Mislevy, R. J., & Bock, R. D. (1983). BILOG: Item analysis and test scoring with binary logistic models. Mooresville, IN: Scientific Software.
Mislevy, R. J., & Verhelst, N. (1987). Modeling item responses when different subjects employ different solution strategies (Research Report RR-87-47-ONR). Princeton, NJ: Educational Testing Service.
Paulson, J. (1985). Latent class representation of systematic patterns in test responses (ONR Technical Report). Portland, OR: Portland State University.
Pellegrino, J. W., Mumaw, R. J., & Shute, V. J. (1985). Analysis of spatial aptitude and expertise. In S. E. Embretson (Ed.), Test design: Developments in psychology and psychometrics (pp. 45-76). Orlando, FL: Academic Press.
Piaget, J. (1960). The general problems of the psychological development of the child. In J. M. Tanner & B. Inhelder (Eds.), Discussions on child development: Vol. 4 (pp. 3-27). The fourth meeting of the World Health Organization Study Group on the Psychobiological Development of the Child, Geneva, 1956.
Ramsay, J. O. (1975). Solving implicit equations in psychometric data analysis. Psychometrika, 40, 361-372.
Samejima, F. (1983). A latent trait model for differential strategies in cognitive processes (Technical Report ONR/RR-83-1). Knoxville, TN: University of Tennessee.
Scheiblechner, H. (1972). Das Lernen und Lösen komplexer Denkaufgaben [The studying and solving of complex conceptual problems]. Zeitschrift für Experimentelle und Angewandte Psychologie, 19, 476-506.
Siegler, R. S. (1981). Developmental sequences within and between concepts. Monographs of the Society for Research in Child Development, 46 (2, Serial No. 189).
Tapley, S. M., & Bryden, M. P. (1977). An investigation of sex differences in spatial ability: Mental rotation of three-dimensional objects. Canadian Journal of Psychology, 31, 122-130.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345-354.
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Chichester: Wiley & Sons.
Tsutakawa, R. K., & Soltys, M. J. (1988). Approximation for Bayesian ability estimation. Journal of Educational Statistics, 13, 117-130.
Wilson, M. R. (1984). A psychometric model of hierarchical development. Unpublished doctoral dissertation, University of Chicago.
Wilson, M. R. (1989). Saltus: A psychometric model of discontinuity in cognitive development. Psychological Bulletin, 105, 276-289.
Yamamoto, K. (1987). A hybrid model for item responses. Unpublished doctoral dissertation, University of Illinois.

Manuscript received 10/26/87
Final version received 5/15/89