ON LOCAL HOMOGENEITY AND STOCHASTICALLY ORDERED MIXED RASCH MODELS

SVEND KREINER
Dept. of Biostatistics, University of Copenhagen, Denmark

MOGENS HANSEN and CARSTEN ROSENBERG HANSEN
The Danish University of Education, Denmark

This paper is concerned with Rasch models for cognitive development, assuming that development proceeds through a hierarchical set of stages characterized by increasing scores on cognitive tests and by changes in the perceived relative difficulties of items. Two approaches to item analysis under these assumptions are considered. First, the mixed Rasch models of Rost (1990) and Rost and Davier (1995), which assume that the relative difficulties of items differ in two or more latent classes corresponding to different stages of development. Second, the techniques for stepwise analysis of local homogeneity suggested by Kreiner et al. (1990), which identify score intervals within which item responses appear to be homogeneous even though a Rasch model was rejected in the first place. It will be shown that the two approaches together lead to a family of stochastically ordered mixed Rasch models. The techniques are illustrated by an analysis of a subset of items from a new cognitive test, CHIPS (Hansen et al., 1992).

Keywords: Rasch models, mixed Rasch models, local homogeneity, cognitive tests, the CHIPS test

Development of cognitive abilities and other latent traits is often assumed to involve qualitative shifts in problem-solving processes and therefore also in the perceived relative difficulties of items in cognitive and educational tests. At the same time, the overall probabilities of obtaining high scores on cognitive and educational tests are expected to increase when a person develops from a lower to a higher level. In populations containing persons at different levels of development one may therefore expect Rasch models and other simple types of item response models to fail for sets of items intended to measure a specific ability.
If the assumption that different developmental levels are characterized by higher or lower summary scores is correct, one may, however, expect Rasch models to fit locally for persons with high scores and for persons with lower scores. This was the idea behind the analysis of local homogeneity suggested by Kreiner et al. (1990), who determined diagnostic cut-points for a score summarizing responses to items measuring paranoia. The family of mixed Rasch models suggested by Rost (1990) and discussed by Rost and Davier (1995) provides another, and in some ways more satisfactory, approach to the modeling of qualitative shifts in problem solving. The mixed Rasch model assumes that several different latent classes exist, each with its own Rasch model for a specific set of items. In the type of application considered in this paper it will be assumed that the latent classes correspond to different developmental stages, and that the different sets of item parameters in the latent classes reflect different perceptions of the relative difficulties of items at each of these stages. The purpose of this paper is to illustrate how the two approaches may strengthen and support each other. Section 2 recapitulates a few well-known facts from the theory of Rasch models. Section 3 introduces the mixed Rasch models and discusses some of the problems that must be addressed in item analysis by these models. Section 4 describes the analysis of local homogeneity and discusses how the results of this type of analysis may be strengthened by an analysis by mixed Rasch models. Section 5 briefly describes a new cognitive test, CHIPS, and the theory of cognitive development underlying this test. Data collected for validation of CHIPS are presented and analyzed in Section 6.

The Rasch model

Let Y = (Y1,...,Yk) be a vector of item responses for a test with k items.
The line of reasoning in this paper will be concerned with the family of Rasch models where the raw score, S = Σi Yi, is a sufficient statistic for a specific latent trait or ability, θ. To simplify the discussion it will be assumed that items are dichotomous, coded 0 and 1 for incorrect and correct responses. The arguments may, with only minor technical problems, be extended to Rasch models for polytomous ordinal items like the partial credit model of Masters (1982) and the rating scale model of Andrich (1978). The Rasch model for locally independent dichotomous items assumes that the conditional distribution of item responses given θ is equal to

P(Y1 = y1, …, Yk = yk | θ) = exp(sθ − Σi yiβi) / Πi (1 + exp(θ − βi))   (1)

where β1, …, βk are the usual item parameters and s = Σi yi. To estimate the item parameters one may use either marginal maximum likelihood estimates, assuming that θ follows some known distribution (e.g. the normal distribution), or conditional maximum likelihood estimates based on the conditional distribution of item responses given the sufficient score. In the applications considered here, where we will focus on mixture distributions across a number of latent classes, the assumption that the marginal distribution of θ is normal is inappropriate. We will therefore only consider techniques for conditional inference. Let Is = {(y1,…,yk) | yi ∈ {0,1}, Σi yi = s} be the set of item response patterns for which the score is equal to s. The conditional distribution of the raw score given θ is equal to

P(S = s | θ) = γs exp(sθ) / Σt γt exp(tθ)   (2)

where (γs)s=0,…,k are symmetric polynomials given by

γs = Σ(y1,…,yk)∈Is exp(−Σi yiβi)   (3)

The conditional distribution of item responses given the total score is easily seen to be

P(Y1 = y1, …, Yk = yk | S = s) = exp(−Σi yiβi) / γs   (4)

Some notational conventions have to be introduced: suppose that item responses (yvi), v = 1,…,n; i = 1,…,k, have been observed for n persons.
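As a numerical sketch of (2)-(4): the symmetric polynomials γs can be computed by a simple recursion over items, after which the conditional score distribution and the conditional pattern probabilities follow directly. The item parameters used below are arbitrary illustrative values, not estimates from this paper.

```python
import math

def gamma_polys(betas):
    """Symmetric polynomials of eq. (3): gamma_s is the sum, over all
    response patterns with score s, of exp(-sum_i y_i * beta_i).
    Computed by the standard one-item-at-a-time recursion."""
    eps = [math.exp(-b) for b in betas]
    g = [1.0]                      # zero items: gamma_0 = 1
    for e in eps:
        new = [0.0] * (len(g) + 1)
        for s, gs in enumerate(g):
            new[s] += gs           # current item answered 0
            new[s + 1] += gs * e   # current item answered 1
        g = new
    return g                       # g[s] = gamma_s for s = 0..k

def score_dist(betas, theta):
    """Conditional score distribution P(S = s | theta) of eq. (2)."""
    g = gamma_polys(betas)
    w = [g[s] * math.exp(s * theta) for s in range(len(g))]
    z = sum(w)
    return [x / z for x in w]

def pattern_prob_given_score(betas, y):
    """Conditional pattern probability P(Y = y | S = s) of eq. (4)."""
    s = sum(y)
    num = math.exp(-sum(yi * bi for yi, bi in zip(y, betas)))
    return num / gamma_polys(betas)[s]
```

For a small test with three items, the pattern probabilities within a score group sum to one, as they must by (3) and (4).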
Conditional inference and analysis of local homogeneity require that persons are classified into score groups. We will here be concerned both with elementary score groups, Gs = {v | Σi yvi = s}, and with compounded score groups defined by score intervals, H[s,t] = Gs ∪ Gs+1 ∪ … ∪ Gt. The two extreme score groups, G0 and Gk, are non-informative for the analysis of item parameters. In what follows we will therefore only be concerned with score groups defined by subsets of the informative score range, H[s,t] ⊆ H[1,k−1].

Andersen's (1973) conditional likelihood ratio test partitions the informative score range into disjoint score intervals and compares maximum likelihood estimates in the different score groups with maximum likelihood estimates calculated for all informative score groups. Assume that the informative score range is partitioned into m disjoint score intervals by m−1 thresholds, 0 = t0 < t1 < t2 < … < tm−1 < tm = k−1, such that H[1,k−1] = H[1,t1] ∪ H[t1+1,t2] ∪ … ∪ H[tm−1+1,k−1]. It is assumed that all score groups are non-empty. The conditional likelihood ratio test of Andersen (1973a,b) compares conditional estimates of item parameters in the m distinct score groups to the estimates of the parameters for all informative scores. Let L(β̂[s,t]; Y[s,t]) be the conditional likelihood function evaluated for the data, Y[s,t], of the [s,t] score group using conditional maximum likelihood estimates, β̂[s,t], obtained from the same score group. Andersen's conditional likelihood ratio test is then equal to

G²(1,t1,…,tm−1,k−1) = 2 [ Σi=1..m ln L(β̂[ti−1+1,ti]; Y[ti−1+1,ti]) − ln L(β̂[1,k−1]; Y[1,k−1]) ]   (5)

where t0 = 0 and tm = k−1. Andersen (1973a,b) shows that G²(1,t1,…,tm−1,k−1) is approximately χ² distributed with (m−1)(k−1) degrees of freedom. We refer to G²(1,t1,…,tm−1,k−1) as a test of global homogeneity of item parameters.
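The statistic (5) is simple to assemble once the maximized conditional log-likelihoods of the separate score groups and of the pooled informative score range are available; a minimal sketch (the log-likelihood values in the usage example are hypothetical, not taken from the paper):

```python
def andersen_lr_test(loglik_groups, loglik_total, k):
    """Andersen's conditional likelihood ratio test, eq. (5).

    loglik_groups: maximized conditional log-likelihoods, one per score interval
    loglik_total:  maximized conditional log-likelihood for all informative scores
    k:             number of items

    Returns the test statistic G2 and its degrees of freedom (m-1)(k-1)."""
    m = len(loglik_groups)
    g2 = 2.0 * (sum(loglik_groups) - loglik_total)
    return g2, (m - 1) * (k - 1)
```

With two score groups and 15 items, `andersen_lr_test([-100.0, -120.0], -225.0, 15)` gives a statistic of 10.0 on 14 degrees of freedom, to be referred to the χ² distribution.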
The test for global homogeneity generalizes without problems to tests for local homogeneity comparing item parameters in score groups that do not add up to the complete set of informative scores. We consider here a partition of a score interval H[a,b] ⊆ H[1,k−1] into m disjoint score intervals by m−1 thresholds, a ≤ t1 < t2 < … < tm−1 < b, and a conditional likelihood ratio test, G²(a,t1,…,tm−1,b), of homogeneity of item parameters within the [a,b] score interval, where G²(a,t1,…,tm−1,b) is defined as in (5) for the intervals [a,t1], [t1+1,t2], …, [tm−1+1,b]. It is shown in Andersen (1973a, Corollary 4.5, p. 127) that the arguments proving the asymptotic χ² distribution of the test of global homogeneity extend without problems to the test for local homogeneity. G²(a,t1,…,tm−1,b) is therefore also approximately χ² distributed with (m−1)(k−1) degrees of freedom.

The mixed Rasch model

The mixed Rasch model extends the Rasch model (1) by adding a latent class variable, φ, on which the item parameters are assumed to depend,

P(Y1 = y1, …, Yk = yk | θ, φ) = exp(sθ − Σi yiβφi) / Πi (1 + exp(θ − βφi))   (6)

In addition to the conditional probabilities (6), a mixed Rasch model also contains marginal probabilities – class sizes – πφ for each class, and conditional densities of the latent trait in each class, f(θ|φ). Rost and Davier (1995) give details on item analysis by mixed Rasch models. Inference in these models is conditional in the sense that item parameters are estimated from the conditional distribution of item responses given the raw score,

P(Y1 = y1, …, Yk = yk | S = s, φ) = exp(−Σi yiβφi) / γs(φ)   (7)

where (γs(φ))s=0,…,k are the symmetric functions of the item parameters in the φ'th class. Instead of imposing assumptions on the latent trait distribution in each class, Rost and Davier (1995) use a two-parameter power series model for the conditional distribution of the score in each class, first discussed by Leunbach (1976),

P(S = s | φ) = exp(δφs + ωφs²) / K(δφ, ωφ)   (8)

where K(δφ, ωφ) is a normalizing constant needed to ensure that the probabilities add up to 1.
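The score distribution (8) is easy to evaluate numerically, since the normalizing constant is just the sum of the unnormalized terms over the score range; a short sketch, with δ and ω set to arbitrary illustrative values:

```python
import math

def leunbach_score_dist(k, delta, omega):
    """Two-parameter power series model of eq. (8): P(S = s | class) is
    proportional to exp(delta*s + omega*s^2), normalized over s = 0..k
    by the constant K(delta, omega)."""
    w = [math.exp(delta * s + omega * s * s) for s in range(k + 1)]
    K = sum(w)                      # normalizing constant K(delta, omega)
    return [x / K for x in w]
```

For 15 items, `leunbach_score_dist(15, 0.5, -0.1)` returns a proper distribution over the 16 possible scores; a negative ω produces a unimodal, bell-like score distribution.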
Item parameters, class sizes, and score distribution parameters, δφ and ωφ, can be estimated by an extended EM algorithm implemented by Davier (1994). The classification problem of assigning persons to latent classes may be approached from a Bayesian point of view. Given estimates of class sizes, item parameters and parameters of the conditional score distributions, posterior probabilities can be estimated by

P(φ | Y1 = y1, …, Yk = yk) = P(Y1 = y1, …, Yk = yk, φ) / P(Y1 = y1, …, Yk = yk)
  = πφ P(Y1 = y1, …, Yk = yk | φ) / P(Y1 = y1, …, Yk = yk)
  = πφ P(Y1 = y1, …, Yk = yk | S = s, φ) P(S = s | φ) / P(Y1 = y1, …, Yk = yk)   (9)

The posterior class probabilities depend on the marginal distribution P(Y1 = y1, …, Yk = yk). Posterior probabilities are needed if one wants to estimate the latent class to which a specific person belongs. For this purpose the denominator of (9) can be disregarded. We estimate φ by the class with the largest posterior probability given an observed vector of item responses, that is by the class, φ̂, satisfying

P(Y1 = y1, …, Yk = yk | S = s, φ̂) P(S = s | φ̂) πφ̂ ≥ P(Y1 = y1, …, Yk = yk | S = s, φ) P(S = s | φ) πφ   (10)

for all φ. Equation (10) shows that the Bayesian posterior estimate of the latent class depends both on the distribution of scores and on the conditional distribution of item responses given scores in the different latent classes. It also shows that the distribution of the latent trait in the different classes affects the posterior estimates only through the conditional score distributions, since the conditional distribution of item responses given score and class, P(Y1, …, Yk | S, φ), does not depend on the latent trait. In the next section we will consider the possibility of estimating latent classes given nothing but the raw scores. The posterior estimates would in this case be

P(φ | S = s) = P(S = s, φ) / P(S = s) = πφ P(S = s | φ) / P(S = s)   (11)

with the estimate, φ̂, of the latent class satisfying

P(S = s | φ̂) πφ̂ ≥ P(S = s | φ) πφ   (12)

for all φ.
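Since the denominator of (9) is just the sum of the class-wise numerators, the posterior probabilities and the MAP rule (10) can be sketched directly; the inputs in the usage example are made-up numbers, not estimates from the CHIPS data:

```python
def posterior_classes(pi, p_y_given_s, p_s):
    """Posterior class probabilities of eq. (9).  The denominator P(y)
    is the sum of the per-class numerators and therefore cancels.

    pi:          class sizes, one per class
    p_y_given_s: P(y | S=s, class) for the observed pattern y, per class
    p_s:         P(S=s | class) for the observed score s, per class"""
    num = [w * py * ps for w, py, ps in zip(pi, p_y_given_s, p_s)]
    z = sum(num)
    return [x / z for x in num]

def map_class(pi, p_y_given_s, p_s):
    """MAP estimate of eq. (10): the class with the largest numerator."""
    post = posterior_classes(pi, p_y_given_s, p_s)
    return max(range(len(post)), key=post.__getitem__)
```

For instance, with class sizes (0.4, 0.6), pattern probabilities (0.02, 0.05) and score probabilities (0.1, 0.3), the second class has the larger posterior probability and is chosen by the MAP rule.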
From a practical point of view (12) is a more attractive estimate than (10) because it does not require estimates of the conditional probabilities, P(Y1 = y1, …, Yk = yk | S = s, φ). In general, however, (12) will be an inferior estimate with too high a risk of misclassification, and it is therefore not worthy of serious consideration. The one exception to this rule may be the case of stochastically ordered classes defined in the next section in connection with the analysis of local homogeneity.

Analysis of local homogeneity

Analysis of local homogeneity is based on the same assumption of latent classes as the mixed Rasch model, but adds the assumption that classes differ in terms of their expected summary scores. A simple two-class model can illustrate the idea. Assume that the items contain problems that require a certain strategy for a proper solution. If we assume that one latent class, φ1, consists of persons who are not able to use this strategy while the other class, φ2, consists of persons who master the strategy, then the two groups will be characterized by two different conditions. In the first group, scores will be relatively low and items will appear to be more or less equally difficult because the means to solve the problems are not readily available. In the second group, items will generally be easier, scores will be higher, and the difference between easy and difficult items will be more apparent. In terms of score distributions and item parameters we therefore expect the following to hold:

1) Two stochastically ordered latent classes, φ1 and φ2, with different item parameters will exist. The two classes will be stochastically ordered in the sense that scores are relatively low in the first class and relatively higher in the second. The precise definition of stochastically ordered latent Rasch classes will be given below.
2) The class with low scores, φ1, will be characterized by fairly small item parameter contrasts, |βφ,i − βφ,j|, while the other class will have somewhat larger contrasts. If the item parameters are constrained to sum to zero in each of the two classes, this requirement can be expressed as the requirement that the dispersion of item parameters, √(Σi βφ,i²), is smaller in the latent class with low scores than in the class with high scores.

Condition 1) leads to the definition of stochastically ordered mixed Rasch models given below. Condition 2) is an additional requirement for stochastically ordered Rasch models for developmental abilities.

Let P1(s) and P2(s) be the probabilities of the score distributions in the two latent classes of a mixed Rasch model, P1(s) = P(S=s|φ1) and P2(s) = P(S=s|φ2). We define stochastically ordered mixed Rasch models with higher scores in the second class in the following way:

Definition 1. Let λ(s) be the ratio of the probabilities of scores in the two classes of a mixed Rasch model, λ(s) = P2(s)/P1(s). The two classes are stochastically ordered if λ(s) is an increasing function of s.

Let π2|s be the posterior probability that a person belongs to φ2 given an observed score equal to s. It follows from Definition 1 and Bayes' theorem that π2|s is an increasing function of s if and only if the two classes are stochastically ordered, because π2|s depends on the marginal class sizes π1 and π2 in the following way,

π2|s = P2(s)π2 / (P1(s)π1 + P2(s)π2) = λ(s)π2 / (π1 + λ(s)π2)   (13)

Equation (13) shows that highly skewed score distributions will generate data where persons with high scores almost exclusively come from the second class, while persons with low scores belong to the first class.
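Definition 1 and equation (13) can be checked mechanically for any pair of estimated score distributions; a small sketch with invented distributions P1 and P2 (not the CHIPS estimates):

```python
def is_stochastically_ordered(p1, p2):
    """Definition 1: the classes are stochastically ordered if
    lambda(s) = P2(s)/P1(s) is increasing in s."""
    lam = [b / a for a, b in zip(p1, p2)]
    return all(x < y for x, y in zip(lam, lam[1:]))

def posterior_class2(p1, p2, pi1, pi2):
    """pi_{2|s} of eq. (13), evaluated for every score s."""
    return [p2[s] * pi2 / (p1[s] * pi1 + p2[s] * pi2)
            for s in range(len(p1))]
```

With a decreasing P1 and an increasing P2, λ(s) increases and, as (13) implies, the posterior probability of the second class increases with the score.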
Extreme class sizes, on the other hand, will tend to generate data where persons at one end of the score range consist of persons from one class, while persons at the other end will tend to be a mixture of persons from both classes. This means that one may expect little or no evidence against homogeneity of item parameters estimated in extreme score groups, as long as one only compares score groups belonging to the same end of the score range. This leads to the following definition of local homogeneity.

Definition 2. A set of k dichotomous items is locally homogeneous if the score range, [1,k−1], partitions into r score intervals, [1,s1], [s1+1,s2], …, [sr−1+1,k−1], where at least one score interval contains more than one score value and such that further partitioning of the score intervals does not disclose evidence against homogeneity within the r score intervals.

Definition 3. Consider a partition, [1,s1], [s1+1,s2], …, [sr−1+1,k−1], of the score range, [1,k−1], into r score intervals such that item responses appear to be homogeneous within score intervals. If concatenation of adjacent score intervals results in score intervals where item responses appear to be heterogeneous, then we refer to the set of score intervals as a maximally locally homogeneous (MLH) set of score intervals.

Note that local homogeneity is defined in terms of empirical findings during item analysis by Rasch models, and that these findings may suggest, but do not in themselves define, a stochastically ordered mixed Rasch model. Note also that the definition permits both situations where homogeneity is found in just one extreme score interval with heterogeneity elsewhere, and situations where homogeneity is found in two extreme score intervals with heterogeneity in between. The purpose of the analysis of local homogeneity discussed in this paper is to identify a set of MLH score intervals when evidence against the Rasch model has surfaced.
Analysis of local homogeneity should thus be a protected procedure in the sense that it should only be performed when initial tests of the Rasch model have rejected the model, and only if the possibility of a stochastically ordered mixed Rasch model makes sense. Under these conditions the following stepwise procedure is suggested:

Initialization: Select an initial set of m0 disjoint score intervals, J(0) = {J1(0), J2(0), …, Jm0(0)}, such that [1, k−1] = ∪i Ji(0). Ideally the initial score intervals should be the elementary score groups, but larger score intervals may be warranted if the number of cases in some score groups is too small for estimation of item parameters to make sense.

Iterative step: Compare item parameters in adjacent score groups by Andersen's conditional likelihood ratio test. If all test statistics are significant, the search for local homogeneity stops. If some test statistics are insignificant, two or more homogeneous score intervals are merged into larger score intervals, followed by a new iterative step.

Several procedures for joining score intervals can be considered. Let Gi²(t) be the conditional likelihood ratio test for the comparison of score intervals i and i+1 during the t'th step of the procedure. One procedure would be to merge score intervals i and i+1 if Gi²(t) is insignificant and smaller than all other test statistics calculated during the t'th step. The score intervals for the (t+1)'st step will then be

Jj(t+1) = Jj(t) for j = 1, …, i−1
Ji(t+1) = Ji(t) ∪ Ji+1(t)
Jj(t+1) = Jj+1(t) for j = i+1, …, mt+1

where mt+1 = mt − 1. In consideration of the fact that we are aiming at stochastically ordered mixed Rasch models, the above procedure should take second place to a procedure where the extreme score intervals, J1(t) and J2(t), or Jmt−1(t) and Jmt(t), are joined first, before merging of midrange intervals is considered.
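A simplified sketch of one iterative step, assuming a user-supplied function that returns the p-value of Andersen's test for two adjacent intervals (since all adjacent comparisons have the same degrees of freedom, merging the pair with the largest p-value is equivalent to merging the pair with the smallest G²); the refinement of joining extreme intervals before midrange intervals is left out:

```python
def merge_step(intervals, p_value, alpha=0.05):
    """One iterative step of the suggested stepwise procedure.

    intervals: list of (low, high) score intervals, in increasing order
    p_value:   stand-in for Andersen's conditional LR test; returns the
               p-value for the comparison of two adjacent intervals
    Merges the adjacent pair with the least evidence of heterogeneity
    (largest p-value).  Returns the new interval list, or None when every
    adjacent comparison is significant, which stops the search."""
    tests = [(p_value(intervals[i], intervals[i + 1]), i)
             for i in range(len(intervals) - 1)]
    p, i = max(tests)
    if p <= alpha:
        return None                       # all comparisons significant: stop
    merged = (intervals[i][0], intervals[i + 1][1])
    return intervals[:i] + [merged] + intervals[i + 2:]
```

With a toy p-value function that only judges intervals starting at score 6 or above as homogeneous, the step merges the two upper intervals and then stops, mimicking the behaviour described above.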
The result of a search for local homogeneity agrees with a stochastically ordered mixed Rasch model with two latent classes if the end result is two homogeneous score intervals at the extreme ends of the score range. The results are particularly favorable if the end result consists of two locally homogeneous score intervals with J1 ∪ J2 = [1, k−1]. When this happens, the upper limit of the first interval may be used as a diagnostic cut-point distinguishing persons from one latent class from persons from the other. In most cases things will not turn out so conveniently. One or more score intervals between the two extreme intervals may exist, comprising a mixture of persons from both classes, where classification in practice is doubtful or close to impossible. Even when the end result appears to be unambiguous, the risk of misclassification should be recognized. In order both to check the adequacy of a stochastically ordered latent class model and to evaluate the risk of misclassification, the analysis of local homogeneity should be followed by an analysis by mixed Rasch models, which provides not only a check of the adequacy of the model but also proper estimates of the class sizes and of the risk of misclassification if cut-points separating homogeneous score intervals are used for classification.

CHIPS – a test of cognitive development

CHIPS (Children's Problem Solving) is an instrument for measurement of cognitive function meant for children from six to twelve years of age. CHIPS is based on the assumption that cognitive function changes qualitatively during a person's lifetime. It provides some possibility for evaluating the level of cognitive function in quantitative terms. The main purpose of CHIPS is, however, to classify pupils according to stage of cognitive development.
The theory of cognitive development behind CHIPS defines cognitive function in terms of the extent to which the person is able to draw on three qualitatively different types of strategies for solving abstract problems. Three stages of cognitive development are recognized: global, analytic/synthetic (A/S) and comprehensive. At the global stage children register likeness more than anything else. At the analytic/synthetic stage the child is able to cope with both likenesses and differences and to synthesize them into wholes. At these two stages of cognitive development the child is not yet ready to deal with abstract mental images, but has to see or handle the physical objects. At the comprehensive stage the child is capable of finding and using abstract principles and rules when this is required for problem solving.

CHIPS consists of three types of items:
1. 11 G-items requiring global cognition.
2. 14 A/S-items requiring analytic/synthetic cognition.
3. 15 C-items requiring comprehensive cognition.

We will here only be concerned with the final 15 C-items and the question of whether or not responses to these items confirm the hypothesis of two latent classes distinguishing between pupils who are able and pupils who are not able to use the comprehensive strategies required for finding a proper solution to the problems represented by these items. We refer to pupils who are able to use comprehensive strategies as pupils at the C-level of cognitive development. Figure 1 shows four C-items from CHIPS.

Local homogeneity of CHIPS subscales

The data for this example originated in the third study undertaken to develop and validate the CHIPS test. The two previous studies were of roughly the same size. Apart from the fact that a few items were revised or replaced, the results from the two first studies corresponded very closely to the results from the third study presented here.
Data were collected from three different sources:
1) 454 pupils receiving ordinary education in the second grade of the Danish primary school.
2) 382 pupils receiving ordinary education in the fifth grade of the Danish primary school.
3) 78 pupils receiving special education in the fifth grade of the Danish primary school.

The score distributions in the three subpopulations are shown in Table 1.

Table 1
Relative CHIPS C-score distributions (percent) in three different subpopulations

C-score   2nd grade   5th grade   5th grade - special education
0         13.0        3.1         16.7
1         13.7        3.4         11.5
2         10.8        5.2         20.5
3         11.0        6.0         10.3
4         12.1        8.6         6.4
5         9.0         8.4         7.7
6         9.3         13.6        12.8
7         10.4        11.8        3.8
8         4.0         13.9        3.8
9         3.5         8.9         5.1
10        2.0         8.9         1.3
11        1.1         5.2
12        0.2         2.4
13                    0.5
n         454         382         78
mean      3.9         6.5         3.4

Analysis of local homogeneity

CHIPS is meant as a measure of cognitive function that can be used to study the development of cognitive function over time. It is therefore important that there is no differential item functioning depending on when measurements are taken. In addition to the usual tests of homogeneity of responses across score groups, we therefore also test for homogeneity of responses across the three populations sampled for this study. The initial check of the Rasch model for dichotomous items clearly rejects the model. Item parameters appear to be heterogeneous both across score groups (G² = 519.2, df = 140, p < 0.0005) and across subpopulations (G² = 71.1, df = 28, p < 0.0005).

Figure 1. Four C-items.

The analysis of local homogeneity is summarized in Table 2. Pairwise comparisons of adjacent score groups suggest that heterogeneity is located in the lower score range, whereas homogeneity seems to be acceptable in the high score range. The stepwise procedure starts by accepting that item parameters are the same in score group 10 and score group 11-14.
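As a small arithmetic check, the means reported in Table 1 can be recomputed from the relative score distributions; the frequency vectors below are copied from the two ordinary-education columns of Table 1:

```python
def mean_from_percents(freqs):
    """Mean score from a relative (percent) score distribution; the
    percents need not sum to exactly 100 because of rounding, so we
    normalize by their actual sum."""
    total = sum(freqs)
    return sum(s * f for s, f in enumerate(freqs)) / total

# Relative C-score distributions (percent), scores 0, 1, 2, ... upwards
grade2 = [13.0, 13.7, 10.8, 11.0, 12.1, 9.0, 9.3, 10.4, 4.0, 3.5, 2.0, 1.1, 0.2]
grade5 = [3.1, 3.4, 5.2, 6.0, 8.6, 8.4, 13.6, 11.8, 13.9, 8.9, 8.9, 5.2, 2.4, 0.5]
```

The recomputed means round to 3.9 and 6.5, matching the last row of Table 1.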
In the next step these two score groups are combined and compared to score group 9, once again with no evidence of heterogeneity of item parameters across score groups. Concatenation of score groups in the higher score range continues until score group 6-14 is reached, but stops there because the test of homogeneous item parameters in score group 5 and score group 6-14 is strongly rejected. Most test results comparing score groups in the lower score range are significant. Multiple testing considerations suggest, first, that homogeneity of score groups 1 and 2 should nevertheless be accepted and, second, that homogeneity of score groups 4 and 5 should also not be rejected. The final result is then a statement of local homogeneity: item responses appear to be homogeneous for scores from 6 to 14. Below 6, the picture is less clear. If homogeneity is present at all, it seems to be located in fairly small score intervals. The low score range has to be partitioned into at least three intervals: 1-2, 3 and 4-5. The disparity of score groups in the low score range, however, poses no special problems for the hypothesis of a stochastically ordered mixed Rasch model. From the point of view of the hypothesis of a comprehensive stage of cognitive development, the wide score interval from 6 to 14 would consist of persons at the comprehensive stage, whereas the score groups below six could be a mixture of children at all three stages of development. The analysis of local homogeneity thus suggests that qualitatively different problem solving processes for those with high and those with low scores were responsible for the rejection of the Rasch model in the first place. Four heterogeneous score groups were identified by the analysis. The distribution of pupils across these score groups is shown in Table 3. Note that the pupils with extreme and therefore non-informative scores have been included in this table.

Table 2
Analysis of local homogeneity of C-items from CHIPS.
Comparison of                                        G²      df    p
Subpopulations (DIF)                                 71.1    28    <0.0005
Global comparison of score groups
  1,2,3,4,5,6,7,8,9,10,11-14                         519.2   140   <0.0005
Comparisons of adjacent score groups
  1 and 2                                            25.7    14    0.028
  2 and 3                                            23.4    14    0.054
  3 and 4                                            32.2    14    0.004
  4 and 5                                            26.9    14    0.020
  5 and 6                                            28.7    14    0.012
  6 and 7                                            14.1    14    0.442
  7 and 8                                            16.9    14    0.263
  8 and 9                                            17.9    14    0.210
  9 and 10                                           7.5     14    0.913
  10 and 11-14                                       15.5    14    0.348
Tests for combined score groups in the high score range
  9 and 10-14                                        9.6     14    0.793
  8 and 9-14                                         16.6    14    0.280
  7 and 8-14                                         25.8    14    0.028
  6 and 7-14                                         11.9    14    0.615
  5 and 6-14                                         46.3    14    <0.0005
Tests for combined score groups in the low score range
  1-2 and 3                                          38.7    14    <0.0005
  3 and 4-5                                          48.4    14    <0.0005
  4-5 and 6-14                                       46.3    14    <0.0005

Table 3
Pupils distributed across four heterogeneous score groups.

Score   Count   Frequency
0-2     253     .277
3       81      .089
4-5     172     .188
6-15    408     .446

Although the demarcation between low and high scores is not as unambiguous as one could hope for, the results of the analysis of local homogeneity in general, and the fairly broad interval of high scores in particular, appear to confirm the existence of a fairly large class of pupils who are able to work with the strategy required to solve the C-items. Whether the presence of three narrow score intervals at the lower end of the score range is a result of general heterogeneity among pupils not able to use the optimal strategy, or whether it is caused by a mixing of pupils mastering and not mastering the C-strategy, is a question that cannot be answered by this analysis. We would be inclined to interpret the results in the following way:
- Pupils with a score of 6-15 are almost surely pupils who are able to use the C-strategies.
- Pupils with a score of 0-2 are almost surely pupils at a level of cognitive development where the strategies for solving the 15 items are not within reach. They must be at either a global or an analytic/synthetic level of development.
The score groups 3-5 consist of a mixture of pupils at the C-level and pupils at lower levels. Score group 3 probably has a majority of pupils below C-level, while score groups 4-5 have a higher frequency of pupils at C-level.

Analysis of differential item functioning

The analysis so far has only dealt with the problem of heterogeneity among score groups. The evidence of DIF suggested by the heterogeneity of item responses in the three populations also has to be addressed. To examine whether or not DIF is present for children with a high score, we use Andersen's conditional likelihood ratio test of the hypothesis that item parameters estimated for children with a total score of 6 or higher are the same in the three subpopulations. The test is clearly insignificant (G² = 31.4, df = 28, p = 0.199). The procedure for analysis of local homogeneity has thus identified a population where DIF does not appear to be a problem.

Guessing

CHIPS is a multiple choice test with one correct response and five distracters for each item. Random guessing could therefore be one type of problem solving behavior generating responses that violate the assumptions underlying conventional Rasch models. Random guessing may, on the other hand, also provide a convenient starting point for an analysis of local homogeneity. If we assume that the problem solving behavior is purely random, it follows that the score on the 15 C-items will be binomially distributed, Bin(15, 1/6). In this distribution the probability of a score of six or more is only 2.7%, and of seven or more only 0.7%. A naïve, but nevertheless natural, argument would therefore be to hypothesize that the majority of the children scoring seven or more are able to use the strategy required for solving the type of problems represented by the C-items. This does not guarantee homogeneity of item parameters in the score groups above six, but it implies that these score groups define a natural initial hypothesis for a test of local homogeneity.
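The binomial tail probabilities under purely random guessing are easily verified:

```python
from math import comb

def binom_tail(n, p, s_min):
    """P(S >= s_min) when S ~ Bin(n, p)."""
    return sum(comb(n, s) * p**s * (1 - p)**(n - s)
               for s in range(s_min, n + 1))

# Score under purely random guessing on 15 six-option items: Bin(15, 1/6)
p_six_or_more = binom_tail(15, 1 / 6, 6)    # about 0.027
p_seven_or_more = binom_tail(15, 1 / 6, 7)  # about 0.007
```

So a pupil guessing at random on all 15 C-items has only about a 2.7% chance of reaching a score of 6, and about 0.7% of reaching 7.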
Classification based on this argument alone is, however, inadequate. Qualitative interviews of children indicate that purely random guessing is very rare. Guessing appears, of course, but almost exclusively among a subset of distracters sharing some specific features that appear attractive to the child but are nevertheless insufficient as cues pointing to the correct solution of an item. If this observation is valid, it follows that children who are not able to solve the problems in the C-items tend to get a lower score than expected under purely random guessing, such that local homogeneity may be obtained also in score groups with fewer than seven correct responses.

Mixed Rasch models

The analysis of local homogeneity leaves a number of questions unanswered:
1. Does a proper analysis by mixed Rasch models confirm the results?
2. Are there two or more latent classes?
3. Are they stochastically ordered, and is 6 a reasonable cut-point?
4. What are the class sizes?
5. What is the risk of misclassification using the cut-points suggested by the analysis of local homogeneity?

The number of latent classes for a mixed model is a difficult question that cannot be answered by the type of significance tests otherwise used to evaluate models. Instead, one has to use information criteria to investigate whether the larger number of parameters in a model with many classes gives a better description of the complexities in the data than a model with fewer classes and fewer parameters. Table 4 shows the Bayesian information criterion (BIC) suggested by Schwarz (1978) for models with 1 to 4 latent classes. BIC reaches its smallest value for two classes, indicating that a two-class model provides the best description of the data.
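BIC itself is simply −2 ln L plus the number of free parameters times ln n; a sketch with hypothetical log-likelihoods and parameter counts (only the sample size n = 914 is taken from the study; the BIC values of Table 4 are not reproduced here):

```python
import math

def bic(loglik, n_params, n):
    """Schwarz's Bayesian information criterion: -2 ln L + p * ln(n).
    Smaller values indicate a better trade-off between fit and complexity."""
    return -2.0 * loglik + n_params * math.log(n)

# Hypothetical maximized log-likelihoods and parameter counts for models
# with 1 and 2 latent classes; n = 914 pupils as in the CHIPS study.
n = 914
candidates = {1: bic(-6800.0, 16, n), 2: bic(-6600.0, 35, n)}
best = min(candidates, key=candidates.get)
```

In this made-up comparison the large gain in log-likelihood outweighs the penalty for the extra parameters, so the two-class model is selected.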
Table 4
Bayesian information criteria and class sizes for the Rasch model for dichotomous items and for three mixed Rasch models

Number of classes       1          2          3          4
BIC                 13680.00   13352.81   13385.16   13462.89
Size of Class 1       1.000       .393       .382       .367
Size of Class 2                   .607       .368       .232
Size of Class 3                              .250       .207
Size of Class 4                                         .193

The information criteria may support the idea of two latent classes but tell nothing about whether or not these classes are stochastically ordered. To get an impression of this we must look at the estimated score distributions in each of the two classes, P(S|1) and P(S|2), where 1 and 2 refer to Class 1 and Class 2 of Table 5. The score distributions are shown in Figure 2 and Table 5. Relatively low scores characterize the first class, while the second class has comparatively high scores. This is exactly as expected according to the hypothesis of a comprehensive stage of development required for solving the problems in the C-items of CHIPS. According to this hypothesis the second class is interpreted as pupils who have reached this level, while pupils in Class 1 are at either the Global or the Analytic/Synthetic stage of cognitive development. We notice also that the frequencies of scores higher than or equal to five are higher in Class 2 than in Class 1, whereas the opposite relation holds for scores below five. This agrees closely with the results obtained during the analysis of local homogeneity. The frequencies of scores below six in Class 2 also seem to be in accordance with the interpretation of score groups 3 – 5 as being composed of a mixture of children at the C-level and children at lower levels. We return to the question of estimation of the latent classes and the risk of misclassification below. The two latent classes are stochastically ordered if λ(s) = P(S=s|1)/P(S=s|2) is a monotone (decreasing) function of s. The values of λ(s) in Table 5 show that this requirement is met.
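The monotonicity requirement can be verified numerically from the estimated score distributions; a small sketch using the rounded estimates of Table 5 (the recomputed ratios therefore differ slightly in the last decimals from the λ(s) column of the table):

```python
# Estimated score distributions from Table 5 (scores 0..15, rounded)
p_class1 = [.2339, .2346, .2017, .1486, .0939, .0509, .0236, .0094,
            .0032, .0009, .0002, .0001, .0000, .0000, .0000, .0000]
p_class2 = [.0029, .0088, .0222, .0471, .0838, .1253, .1575, .1662,
            .1473, .1098, .0687, .0361, .0160, .0059, .0018, .0005]

# Likelihood ratio lambda(s); stochastic ordering of the two classes
# requires this sequence to be decreasing in s.
lam = [p1 / p2 for p1, p2 in zip(p_class1, p_class2)]
is_decreasing = all(a >= b for a, b in zip(lam, lam[1:]))
```

The ratio crosses 1 between scores 4 and 5, which is where the score-based classification of the two classes turns over.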
The second requirement for a stochastically ordered mixed Rasch model of cognitive development is that the dispersion of the item parameters in Class 2, with pupils at the C-level, is larger than in the class with inferior scores. That this requirement is satisfied can be seen in Table 6 and Figure 3. Table 6 also includes estimates of item parameters in score group 6 – 14 obtained during the analysis of local homogeneity. These estimates correspond closely to those obtained for Class 2.

Figure 2. Score distributions in the two latent classes of the mixed Rasch model for the C-items of CHIPS.

Table 5
Estimated score distributions in the two latent classes of the mixed Rasch model for the C-items of CHIPS.

Score   P(S|1)   P(S|2)   λ(s) = P(S|1)/P(S|2)   λ(s)/(λ(s)+1)
  0     .2339    .0029         79.6243              .9876
  1     .2346    .0088         26.6072              .9638
  2     .2017    .0222          9.0718              .9007
  3     .1486    .0471          3.1559              .7594
  4     .0939    .0838          1.1202              .5284
  5     .0509    .1253           .4057              .2886
  6     .0236    .1575           .1499              .1304
  7     .0094    .1662           .0565              .0535
  8     .0032    .1473           .0217              .0213
  9     .0009    .1098           .0085              .0085
 10     .0002    .0687           .0034              .0034
 11     .0001    .0361           .0014              .0014
 12     .0000    .0160           .0006              .0006
 13     .0000    .0059           .0002              .0002
 14     .0000    .0018           .0001              .0001
 15     .0000    .0005           .0000              .0000

Table 6
Estimated item parameters in the two latent classes of the mixed Rasch model for the C-items of CHIPS.

Item    Class 1   Class 2   Score 6-14
C26      -0.75     -1.73      -1.74
C27      -0.56     -1.68      -1.57
C28      +0.64     -0.26      -0.25
C29      -0.10     -3.08      -2.75
C30      +0.69     -2.05      -1.88
C31      +1.15     +2.96      +3.02
C32      +2.45     +2.93      +2.97
C33      -0.41     +0.07      -0.08
C34      -0.38     -0.09      -0.23
C35      -1.65     -0.41      -0.49
C36      -0.55     -0.25      -0.39
C37      -0.82     -0.31      -0.38
C38      +0.29     +0.82      +0.74
C39      -0.66     +0.90      +0.92
C40      +0.67     +2.17      +2.10
Dispersion 3.75      6.59

Figure 3. Scatter plot of item parameters in the two latent classes (parameter estimates, Class 1 against Class 2).
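The close correspondence between the Class 2 estimates and the score group 6 – 14 estimates can be summarized by the largest absolute difference over the items; a small sketch using the values of Table 6:

```python
# Item parameter estimates from Table 6 (items C26..C40)
class2     = [-1.73, -1.68, -0.26, -3.08, -2.05, 2.96, 2.93, 0.07,
              -0.09, -0.41, -0.25, -0.31, 0.82, 0.90, 2.17]
score_6_14 = [-1.74, -1.57, -0.25, -2.75, -1.88, 3.02, 2.97, -0.08,
              -0.23, -0.49, -0.39, -0.38, 0.74, 0.92, 2.10]

max_abs_diff = max(abs(a - b) for a, b in zip(class2, score_6_14))
print(round(max_abs_diff, 2))  # 0.33 (item C29)
```

No item differs by more than about a third of a logit between the two computationally independent analyses.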
Estimating stochastically ordered latent classes

Equation (12) gives the justification for score-based posterior estimation of stochastically ordered latent classes. Write 1 and 2 for Class 1 and Class 2, with class sizes π1 and π2. For a given score, s, we choose Class 2 as the estimate of the latent class if

    λ(s) = P(S=s|1)/P(S=s|2) ≤ π2/π1.

If λ(s) is a decreasing function of s, it follows that Class 2 will also be the estimate for all scores higher than s. Estimating the latent class is, in other words, a question of selecting the smallest score value, s0, for which (12) applies:

    s0 = min { s : λ(s) ≤ π2/π1 },

after which the estimate is Class 1 if s < s0 and Class 2 if s ≥ s0.

The cut-points depend on the class sizes. In the CHIPS study the class sizes are π1 = 0.393 and π2 = 0.607. The values of λ(s) in Table 5 should therefore be compared to 0.607/0.393 = 1.544, implying that not only cases with a score of 6 or higher but also cases from score group 4 – 5 should be classified as pupils at the comprehensive level. If the class sizes are completely unknown, a cut-point corresponding to π1 = π2 = 0.5 would be more appropriate. In this case a cut-point of 5 should be used instead of 4. Cut-points for other sizes of Class 2 depend on the values of λ(s)/(λ(s)+1) shown in Table 5, since λ(s) ≤ π2/π1 is equivalent to λ(s)/(λ(s)+1) ≤ π2. If π2 > 0.9876, the estimate should always be Class 2. If λ(s)/(λ(s)+1) ≤ π2 < λ(s−1)/(λ(s−1)+1) for some s = 1,...,k−1, the estimate of the latent class should be Class 2 whenever the score is equal to or higher than s. Finally, the estimate should always be Class 1 if π2 < 0.0006.

The risk of misclassification

The unrestricted mixed Rasch model estimates the size of the class of pupils at the C-level to 0.607. The strict cut-point suggested by the analysis of local homogeneity estimates the proportion of pupils definitely at the C-level to 0.446, which is updated to 0.634 for the stochastically ordered mixed Rasch model.
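The cut-point rule and the resulting misclassification risks of the score-based classification can be computed directly; a minimal sketch (function names are illustrative, the score distributions are the rounded estimates of Table 5):

```python
# Estimated score distributions from Table 5 (scores 0..15, rounded)
p_class1 = [.2339, .2346, .2017, .1486, .0939, .0509, .0236, .0094,
            .0032, .0009, .0002, .0001, .0000, .0000, .0000, .0000]
p_class2 = [.0029, .0088, .0222, .0471, .0838, .1253, .1575, .1662,
            .1473, .1098, .0687, .0361, .0160, .0059, .0018, .0005]

def cut_point(pi2: float) -> int:
    """Smallest score s0 with lambda(s0) <= pi2/pi1; scores >= s0 estimate Class 2."""
    pi1 = 1.0 - pi2
    for s, (p1, p2) in enumerate(zip(p_class1, p_class2)):
        if p2 > 0 and p1 / p2 <= pi2 / pi1:
            return s
    return len(p_class1)  # pi2 so small that Class 2 is never chosen

s0 = cut_point(0.607)   # estimated class sizes: cut-point 4
s0_eq = cut_point(0.5)  # equal class sizes: cut-point 5

# Misclassification risks of the score-based rule (cf. Table 7):
risk_into_c1 = sum(p_class2[:s0])  # Class 2 pupil scores below the cut-point
risk_into_c2 = sum(p_class1[s0:])  # Class 1 pupil scores at or above it
```

With the estimated class sizes the cut-point is 4, and the two tail sums reproduce the risks of about .081 and .182 reported for the stochastically ordered model in Table 7.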
While there thus seems to be fairly good agreement between the unrestricted and restricted estimates of the class sizes, the risks of misclassification differ and are somewhat higher for the second method because the item responses are ignored. This is illustrated in Table 7.

Table 7
Risk of misclassification by unrestricted and stochastically ordered mixed Rasch models

Method             Class   Risk of misclassification
Unrestricted         1             .043
                     2             .067
Stochastic order     1             .081
                     2             .182

A large part of the misclassification by the stochastically ordered mixed Rasch model appears for persons with scores between 3 and 5. Taking this into consideration, the cautious estimates following the analysis of local homogeneity may be preferable. Out of the 277 pupils classified as definitely not at the C-level and the 446 classified as definitely at the C-level, the unrestricted mixed Rasch model estimates that 2.4 % and 3.0 %, respectively, will be misclassified.

Discussion

Mixed Rasch models

The mixed Rasch model is an attractive model for the kinds of problems addressed in this paper, but a few cautionary remarks concerning the use of these models may nevertheless be in order. Notice first that the conditional distributions of the person parameter θ given the latent class are only determined up to an additive constant, due to the unidentifiability of the item parameters. Item parameters of Rasch models are in most cases identified by assuming that the sum of the item parameters is equal to zero. As this has to be done for each of the latent classes separately, it follows that neither person parameters nor item parameters from different classes are directly comparable. Comparison of persons and items from different classes must therefore be based on observed person scores and item mean scores. Second, the term "latent" is usually taken to indicate an unobservable characteristic of persons (Vogt, 1993).
When we try to fit a mixed Rasch model to data we should recognize that evidence suggesting such a model may arise for reasons that have nothing to do with latent classes in the strict sense. Structures comparable to those described by mixed Rasch models may, for instance, be generated not only by unobservable variables but also by unobserved or unused variables. While the difference between unobservable and unused variables is small in purely technical terms, the difference in substantive terms may be crucial. If the failure of the conventional Rasch model to give an adequate description of the item responses is caused by unobserved or unused variables, the population heterogeneity implied by equation (6) should be recognized as a problem of differential item functioning (DIF) rather than as a problem of mixed heterogeneous populations. Instead of estimating latent classes, these problems should be solved either by observing the unobserved or by using the unused variables. Mixed Rasch models should therefore never be used indiscriminately, but only in situations where substantive arguments suggest that latent classes actually might exist. If used in a purely exploratory way, analysis by mixed Rasch models may lead to ad hoc interpretations of heterogeneity where other and simpler interpretations would be more appropriate. From a more practical point of view, mixed Rasch modelling suffers from not providing the user with simple operational diagnostic procedures that can be used to identify, or at least estimate, the class that a specific person belongs to. This problem can be, and has been, solved in several different ways within the framework of mixed Rasch models, but the solution will invariably involve specialized software that will not be available in general.
Local homogeneity

The strategy for analysis of local homogeneity suggested in this paper may be described as a step-up procedure for pairwise comparison of score groups. Several other step-down and step-up procedures have been suggested (Hochberg and Tamhane, 1987). They all suffer from multiple testing problems to which no completely satisfactory solution has been offered. Sequentially rejective Bonferroni procedures make use of α/r, where r equals the maximum possible number of true hypotheses given that a specific hypothesis is false. In the example discussed in this paper the procedure gives fairly unambiguous results, but in other situations this may not always be the case.

On the validity of CHIPS

The main purpose of this study was to validate the hypothesis of a stochastically ordered mixed Rasch model distinguishing between pupils at the comprehensive level of cognitive development and pupils at lower levels. The results seem to confirm this hypothesis, as the estimate of the unrestricted mixed Rasch model reproduced all the main features of such a model. We feel that the results provide strong confirmation of the hypothesis, because two computationally very different analyses lead to basically the same results. The difference in the dispersion of the item parameters in the two classes also supports the hypothesis underlying the C-subscale of CHIPS. We note that the same kind of confirmation was obtained by analyses of the G- and A/S-subscales. These results will be published elsewhere, together with a more exhaustive discussion of the theory of cognitive development on which CHIPS is grounded.

Estimation of latent classes

For a mixed model of cognitive development to be of practical use, simple procedures for estimating the level of development of specific pupils are required.
The classification based on estimates of posterior probabilities under the unrestricted model uses more of the available information for this purpose than classification procedures based on raw scores and must therefore be expected to provide the best estimates. Estimates of this kind are however impractical for clinical use, as they presume that specialized software is available to the school psychologist administering the test. Instead we have suggested procedures based on cut-points partitioning the score range into intervals indicative of different latent classes. Two procedures were discussed, one based on the stochastically ordered mixed Rasch model and another utilizing the analysis of local homogeneity. The risk of misclassification is acceptable for the stochastically ordered mixed Rasch model, but somewhat higher than for the unrestricted model. The estimate by local homogeneity provides a cautious and, to our mind, more honest estimate, distinguishing between cases where classification is uncertain and cases where classification appears to be reliable. Once item analysis has confirmed the adequacy of the stochastically ordered mixed Rasch model, we therefore tend to prefer this approach over classification based on posterior estimates of classes given raw scores. Note also that posterior estimates depend not only on the class sizes but also on the estimates of the score distributions in the different classes (10). In this connection it is important to remember that the score distribution models (8) that are often used as part of an analysis by mixed Rasch models are arbitrary. They do provide conveniently smoothed estimates of regular score distributions, but we have no arguments implying that these models are anything more than models of convenience.
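The dependence of the posterior class estimates on class sizes and within-class score distributions follows directly from Bayes' rule; a sketch under the Table 5 estimates (the function name is illustrative, the default class size is the one estimated in the CHIPS study):

```python
# Estimated score distributions from Table 5 (scores 0..15, rounded)
p_class1 = [.2339, .2346, .2017, .1486, .0939, .0509, .0236, .0094,
            .0032, .0009, .0002, .0001, .0000, .0000, .0000, .0000]
p_class2 = [.0029, .0088, .0222, .0471, .0838, .1253, .1575, .1662,
            .1473, .1098, .0687, .0361, .0160, .0059, .0018, .0005]

def posterior_class2(s: int, pi2: float = 0.607) -> float:
    """P(Class 2 | S = s) by Bayes' rule under the two-class mixture."""
    pi1 = 1.0 - pi2
    num = pi2 * p_class2[s]
    return num / (pi1 * p_class1[s] + num)

# With the CHIPS class sizes a score of 6 gives a posterior C-level
# probability above 0.9, while a score of 3 stays below 0.5. Changing
# pi2 shifts these values, which is why class sizes should be
# re-assessed before the rule is applied to a differently composed
# population.
```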
If one wants to estimate posterior class probabilities for populations of pupils composed differently from the one studied here, one may have to adjust both the expected sizes of the classes and the expected score distributions within classes before class estimates are determined. We mention finally that the risk of misclassification may be reduced in two different ways if estimates of the latent classes for the G- and A/S-subscales are calculated at the same time as the posterior C-probabilities. The first is based on the observation that the G and A/S levels have to be reached before the C-level. If estimates of the posterior G- and A/S-probabilities indicate that the pupil is at the G-level, then the pupil cannot be at the C-level. The second way to improve estimates of the cognitive level is through analysis of distracters. Pupils at different levels select different types of distracters when they cannot solve a specific problem. Taking this into account reduces the risk of misclassification not only for estimates based on posterior probabilities given scores, but also for estimates taking complete response patterns into consideration. The analysis of distracters will also be published elsewhere.

References

Andersen, E.B. (1973a). Conditional Inference and Models for Measuring. Copenhagen: Mentalhygiejnisk Forskningsinstitut.
Andersen, E.B. (1973b). A goodness of fit test for the Rasch model. Psychometrika, 38, 123-140.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 47, 105-113.
Davier, M. von (1994). WINMIRA: A Windows Program for Analyses with the Rasch Model, with the Latent Class Model, and with the Mixed Rasch Model. Kiel: Institute for Science Education.
Hansen, M., Kreiner, S. and Hansen, C.R. (1992). CHIPS – Children's Problem Solving. Manual. Copenhagen: Dansk psykologisk forlag.
Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. New York: John Wiley & Sons.
Kreiner, S., Simonsen, E.
and Mogensen, J. (1990). Validation of a Personality Inventory Scale: The MCMI P-Scale (Paranoia). Journal of Personality Disorders, 4, 303-311.
Leunbach, G. (1976). A probabilistic measurement model for assessing whether two tests measure the same personal factor. Technical report 1976.19. Copenhagen: The Danish Institute of Educational Research.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Rost, J. and Davier, M. von (1995). Mixture distribution Rasch models. In Fischer, G. and Molenaar, I. (eds.): Rasch Models: Foundations, Recent Developments and Applications, 257-268. New York: Springer Verlag.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-466.
Vogt, W.P. (1993). Dictionary of Statistics and Methodology: A Nontechnical Guide for the Social Sciences. Newbury Park: Sage Publications.