ON LOCAL HOMOGENEITY AND STOCHASTICALLY
ORDERED MIXED RASCH MODELS
SVEND KREINER
Dept. of Biostatistics, University of Copenhagen, Denmark
MOGENS HANSEN and CARSTEN ROSENBERG HANSEN
The Danish University of Education, Denmark
This paper is concerned with Rasch models for cognitive development assuming that development proceeds through a hierarchical set of stages characterized by increasing scores on cognitive tests and changes in perceived relative difficulties of items. Two approaches to item analysis under these assumptions are considered: first, the mixed Rasch models of Rost (1990) and Rost and Davier (1995), which assume that the relative difficulties of items are different in two or more latent classes corresponding to different stages of development; second, techniques for stepwise analysis of local homogeneity suggested by Kreiner et al. (1990), which identify score intervals within which item responses appear to be homogeneous even though a Rasch model was rejected in the first place. It will be shown that the two approaches together lead to a family of stochastically ordered mixed Rasch models. The techniques are illustrated by an analysis of a subset of items from a new cognitive test, CHIPS (Hansen et al., 1992).
Keywords: Rasch models, mixed Rasch models, local homogeneity, cognitive tests, the CHIPS test
Development of cognitive abilities and other latent traits is often assumed to involve qualitative
shifts in problem solving processes and therefore also in relative perceived degrees of difficulty of
items in cognitive and educational tests. At the same time overall probabilities of obtaining high
scores on cognitive and educational tests are expected to increase when a person develops from a
lower to a higher level.
In populations containing persons at different levels of development one may therefore expect
Rasch models and other simple types of item response models to fail for sets of items intended to
measure a specific ability. If the assumption that different developmental levels are characterized by higher or lower summary scores is correct, one may, however, expect Rasch models to fit locally for persons with high scores and persons with lower scores. This was the idea behind the analysis of local homogeneity suggested by Kreiner et al. (1990), who determined diagnostic cut points for a score summarizing responses to items measuring paranoia.
The family of mixed Rasch models suggested by Rost (1990) and discussed by Rost and Davier (1995) provides another and in some ways more satisfactory approach to modeling of
qualitative shifts in problem solving. The mixed Rasch model makes the assumption that several
different latent classes exist each with its own Rasch model for a specific set of items. In the type of
application considered in this paper it will be assumed that the latent classes correspond to different
developmental stages, and that the different sets of item parameters in each of the latent classes
reflect different perceptions of relative degrees of difficulty of items at each of these stages.
The purpose of this paper is to illustrate how the two different approaches may strengthen and
support each other. Section 2 recapitulates a few well-known facts from the theory of Rasch
models. Section 3 introduces the mixed Rasch models and discusses some of the problems that must
be addressed in item analysis by these models. Section 4 describes analysis of local homogeneity
and discusses how results of this type of analysis may be strengthened by an analysis by mixed
Rasch models. Section 5 briefly describes a new cognitive test, CHIPS, and the theory of cognitive
development underlying this test. Data collected for validation of CHIPS is presented and analyzed
in Section 6.
The Rasch model
Let Y = (Y1,...,Yk) be a vector of item responses for a test with k items. The line of reasoning in this paper will be concerned with the family of Rasch models where the raw score, S = Σ_i Y_i, is a sufficient statistic for a specific latent trait or ability, θ. To simplify the discussion it will be assumed that items are dichotomous, coded 0 and 1 for incorrect and correct responses. The arguments may, with only minor technical problems, be extended to Rasch models for polytomous ordinal items like the partial credit model of Masters (1982) and the rating scale model of Andrich (1978).
The Rasch model for locally independent dichotomous items assumes that the conditional distribution of item responses given θ is equal to

P(Y_1 = y_1, ..., Y_k = y_k | θ) = exp(θs + Σ_i y_i σ_i) / Π_i (1 + exp(θ + σ_i))    (1)

where σ_1, ..., σ_k are the usual item parameters and s = Σ_i y_i.
To estimate the item parameters one may use either marginal maximum likelihood estimation, assuming that θ follows some known distribution (e.g. the normal distribution), or conditional maximum likelihood estimation based on the conditional distribution of item responses given the sufficient score. In the applications considered here, where we focus on mixture distributions across a number of latent classes, the assumption that the marginal distribution of θ is normal is inappropriate. We will therefore only consider techniques for conditional inference.
Let I_s = {(y_1,...,y_k) | y_i ∈ {0,1}, Σ_i y_i = s} be the set of item response vectors for which the score is equal to s. The conditional distribution of the raw score given θ is equal to

P(S = s | θ) = exp(θs) γ_s / Σ_t exp(θt) γ_t    (2)

where (γ_s)_{s=0,...,k} are symmetric polynomials given by

γ_s = Σ_{(y_1,...,y_k) ∈ I_s} exp(Σ_i y_i σ_i)    (3)
The conditional distribution of item responses given the total score is easily seen to be

P(Y_1 = y_1, ..., Y_k = y_k | S = s) = exp(Σ_i y_i σ_i) / γ_s    (4)
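The symmetric polynomials (3) need not be computed by brute-force enumeration; a standard recursion adds one item at a time. A minimal sketch (parameter values are illustrative only):

```python
import math

def gamma_polys(sigma):
    """Symmetric polynomials gamma_s of (3), via the standard recursion:
    add items one at a time; each new item either raises the score by one
    (contributing a factor exp(sigma_i)) or leaves it unchanged."""
    g = [1.0]  # gamma_0 for an empty set of items
    for sig in sigma:
        e = math.exp(sig)
        new = [0.0] * (len(g) + 1)
        for s, val in enumerate(g):
            new[s] += val          # item answered incorrectly
            new[s + 1] += val * e  # item answered correctly
        g = new
    return g

def p_pattern_given_score(y, sigma):
    """Conditional probability (4) of a response pattern given its raw score."""
    g = gamma_polys(sigma)
    return math.exp(sum(yi * si for yi, si in zip(y, sigma))) / g[sum(y)]
```

Summing (4) over all patterns with the same raw score gives 1, which provides a simple correctness check for an implementation.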
Some notational conventions have to be introduced. Suppose that item responses (y_νi), ν = 1,...,n, i = 1,...,k, have been observed for n persons. Conditional inference and analysis of local homogeneity require that persons are classified into score groups. We will here be concerned with both elementary score groups, G_s = {ν | Σ_i y_νi = s}, and compounded score groups defined by score intervals, H_[s,t] = G_s ∪ G_{s+1} ∪ ... ∪ G_t. The two extreme score groups, G_0 and G_k, are non-informative for analysis of item parameters. In what follows, we will therefore only be concerned with score groups defined by subsets of the informative score range, H_[s,t] ⊆ H_[1,k-1].
Andersen's (1973) conditional likelihood ratio test partitions the informative score range into disjoint score intervals and compares maximum likelihood estimates in different score groups with maximum likelihood estimates calculated for all informative score groups.

Assume that the informative score range is partitioned into m disjoint score intervals by m-1 thresholds, 1 < t_1 < t_2 < ... < t_{m-1} < k-1, such that H_[1,k-1] = H_[1,t_1] ∪ H_[t_1+1,t_2] ∪ ... ∪ H_[t_{m-1}+1,k-1]. It is assumed that all score groups are non-empty. The conditional likelihood ratio test of Andersen (1973a,b) compares conditional estimates of item parameters in the m distinct score groups to the estimates of the parameters for all informative scores. Let L(σ̂_[s,t]; Y_[s,t]) be the conditional likelihood function evaluated for the data, Y_[s,t], of the [s,t] score group using conditional maximum likelihood estimates, σ̂_[s,t], obtained from the same score group. Andersen's conditional likelihood ratio test is then equal to

G²_{1,t_1,...,t_{m-1},k-1} = 2 ( Σ_{i=1}^m ln L(σ̂_{H_i}; Y_{H_i}) - ln L(σ̂_[1,k-1]; Y_[1,k-1]) )    (5)

where H_1 = H_[1,t_1] and H_i = H_[t_{i-1}+1,t_i] for i = 2,...,m, with t_m = k-1.
Andersen (1973a,b) shows that G²_{1,t_1,...,t_{m-1},k-1} is approximately χ² distributed with (m-1)(k-1) degrees of freedom. We refer to G²_{1,t_1,...,t_{m-1},k-1} as a test of global homogeneity of item parameters. The test for global homogeneity generalizes without problems to tests for local homogeneity comparing item parameters in score groups not adding up to the complete set of informative scores. We consider here a partition of a score interval H_[a,b] ⊆ H_[1,k-1] into m disjoint score intervals by m-1 thresholds, a ≤ t_1 < t_2 < ... < t_{m-1} < b, and a conditional likelihood ratio test, G²_{a,t_1,...,t_{m-1},b}, of homogeneity of item parameters within the [a,b] score interval, where G²_{a,t_1,...,t_{m-1},b} is defined as in (5) with H_1 = H_[a,t_1] and H_m = H_[t_{m-1}+1,b]. It is shown in Andersen (1973a, Corollary 4.5, p. 127) that the arguments proving the asymptotic χ² distribution of the test of global homogeneity extend without problems to the test for local homogeneity. G²_{a,t_1,...,t_{m-1},b} is therefore also approximately χ² distributed with (m-1)(k-1) degrees of freedom.
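Once the conditional log-likelihoods have been maximized within each score group and over the pooled informative range, the test statistic (5) and its degrees of freedom are trivial to compute. A minimal sketch; the CML estimation itself is not shown, and the log-likelihood values in the usage example are hypothetical:

```python
def andersen_clr(loglik_groups, loglik_overall, k, m):
    """Conditional likelihood ratio statistic (5): twice the difference
    between the sum of CML log-likelihoods maximized separately in the
    m score groups and the CML log-likelihood maximized over all
    informative scores. Approximately chi-square distributed with
    (m - 1)*(k - 1) degrees of freedom for k dichotomous items."""
    g2 = 2.0 * (sum(loglik_groups) - loglik_overall)
    df = (m - 1) * (k - 1)
    return g2, df

# hypothetical log-likelihoods for two score groups and the pooled fit
g2, df = andersen_clr([-100.0, -95.0], -200.0, k=15, m=2)  # g2 = 10.0, df = 14
```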
The mixed Rasch model
The mixed Rasch model extends the Rasch model (1) by adding a latent class variable, φ, on which the item parameters are assumed to depend,

P(Y_1 = y_1, ..., Y_k = y_k | θ, φ) = exp(θs + Σ_i y_i σ_{φ,i}) / Π_i (1 + exp(θ + σ_{φ,i}))    (6)
In addition to the conditional probabilities (6), a mixed Rasch model also contains marginal probabilities (class sizes), π_φ, for each class and conditional densities of the latent trait in each class, f(θ | φ).
Rost and Davier (1995) give details on item analysis by mixed Rasch models. Inference in these models is conditional in the sense that item parameters are estimated from the conditional distribution of item responses given the raw score,

P(Y_1 = y_1, ..., Y_k = y_k | S = s, φ) = exp(Σ_i y_i σ_{φ,i}) / γ_s(φ)    (7)
where (s())s=0,..,k are the symmetrical functions of the parameters in the ’th class. Instead of
imposing assumptions on the latent trait distribution in each class Rost and Davier (1995) use a
two-parameter power series model for the conditional distribution of the score in each class first
discussed by Leunbach (1976),
P( S  s| ) 
exp(   s   s2 )
 (   , )
(8)
where (,) is a normalizing constant needed to insure that probabilities add up to 1.
Item parameters, class sizes, π_φ, and score distribution parameters, τ_φ and δ_φ, can be estimated by an extended EM algorithm implemented by Davier (1994). The classification problem of assigning persons to latent classes may be approached from a Bayesian point of view. Given estimates of class sizes, item parameters and parameters of the conditional score distributions, posterior probabilities can be estimated by
P( |Y1  y1 ,..,Yk  yk ) 
P(Y1  y1 ,..,Yk  yk , )
P(Y1  y1 ,..,Yk  yk )
(9)
P(Y1  y1 ,..,Yk  yk | )  P(Y1  y1 ,..,Yk  yk | S  s, ) P( S  s| ) 


P(Y1  y1 ,..,Yk  yk )
P(Y1  y1 ,..,Yk  yk )
The posterior class probabilities depend on the marginal distribution P(Y_1 = y_1, ..., Y_k = y_k). Posterior probabilities are needed if one wants to estimate the latent class to which a specific person belongs. For this purpose the denominator of (9) can be disregarded. We estimate φ by the class with the largest posterior probability given an observed vector of item responses, that is, by the class, φ̂, satisfying

P(Y_1 = y_1, ..., Y_k = y_k | S = s, φ̂) P(S = s | φ̂) π_φ̂ ≥ P(Y_1 = y_1, ..., Y_k = y_k | S = s, φ) P(S = s | φ) π_φ    (10)

for all φ.
Equation (10) shows that the Bayesian posterior estimate of latent classes depends on both the distribution of scores and the conditional distribution of item responses given scores in the different latent classes. It also shows that the distribution of the latent trait in the different classes has an effect on posterior estimates through the conditional score distribution, while the conditional distribution of item responses given score and class, P(Y_1, ..., Y_k | S, φ), does not depend on the latent trait. In the next section we will consider the possibility of estimating latent classes given nothing but the raw scores. The posterior estimates would in this case be
P( | S  s ) 
P( S  s, ) P( S  s| ) 

P( S  s )
P( S  s )
(11)
with the estimate, , of the latent class satisfying
P( S  s| )  

P( S  s|  )  
6
(12)
for all .
From a practical point of view (12) is a more attractive estimate than (10) because it does not require estimates of the conditional probabilities, P(Y_1 = y_1, ..., Y_k = y_k | S = s, φ). In general, however, (12) will be an inferior estimate with an unacceptably high risk of misclassification, and it is therefore not worthy of serious consideration. The one exception to this rule may be the case of stochastically ordered classes defined in the next section in connection with analysis of local homogeneity.
Analysis of local homogeneity
Analysis of local homogeneity is based on the same assumption of latent classes as the mixed Rasch model, but adds the assumption that the classes differ in terms of their expected summary scores. A simple two-class model can illustrate the idea.
Assume that items contain problems that require a certain strategy for a proper solution. If we assume that one (latent) class, φ1, consists of persons who are not able to use this strategy while the other class, φ2, consists of persons who master the strategy, then the two groups will be characterized by two different conditions. In the first group, scores will be relatively low and items will appear to be more or less equally difficult because the means to solve the problems are not readily available. In the second group items will generally be easier, scores will be higher, and the difference between easy items and difficult items will be more apparent. In terms of score distributions and item parameters we therefore expect the following to hold:
1) Two stochastically ordered latent classes, φ1 and φ2, with different item parameters will exist. The two classes are stochastically ordered in the sense that scores are relatively low in the first class and relatively higher in the second. The precise definition of stochastically ordered latent Rasch classes will be given below.
2) The class with low scores, φ1, will be characterized by fairly small item parameter contrasts, |σ_{φ,i} - σ_{φ,j}|, while the other class will have somewhat larger contrasts. If we assume that the item parameters are constrained in such a way that they sum to zero in each of the two classes, this requirement can be expressed as the requirement that the dispersion of item parameters, Σ_i σ_{φ,i}², is smaller in the latent class with low scores than in the class with high scores.
Condition 1) leads to the definition of stochastically ordered mixed Rasch models given below. Condition 2) is an additional requirement for stochastically ordered Rasch models for developmental abilities.

Let P1(s) and P2(s) be the probabilities of the score distributions in the two latent classes of a mixed Rasch model, P1(s) = P(S = s | φ1) and P2(s) = P(S = s | φ2). We define stochastically ordered mixed Rasch models with higher scores in the second class in the following way:
Definition 1. Let λ(s) be the ratio of the probabilities of scores in two different classes of a mixed Rasch model,

λ(s) = P2(s) / P1(s).

The two classes are stochastically ordered if λ(s) is an increasing function of s.
Let π_{2|s} be the posterior probability that a person belongs to φ2 given an observed score equal to s. It follows from Definition 1 and Bayes' theorem that π_{2|s} is an increasing function of s if and only if the two classes are stochastically ordered, because π_{2|s} depends on the marginal class sizes π1 and π2 in the following way,

π_{2|s} = P2(s) π2 / (P1(s) π1 + P2(s) π2) = λ(s)(π2/π1) / (1 + λ(s)(π2/π1))    (13)
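Equation (13) can be illustrated numerically: whenever the ratio λ(s) of the two score distributions increases in s, the posterior probability of the second class increases with the score. A sketch with made-up score distributions and class sizes:

```python
def posterior_class2(p1, p2, pi1, pi2):
    """Posterior probability (13) of belonging to the second latent class,
    given each score s, computed from the class-specific score
    distributions and the marginal class sizes."""
    return [pb * pi2 / (pa * pi1 + pb * pi2) for pa, pb in zip(p1, p2)]

# hypothetical score distributions over scores 0..3; lambda(s) is increasing
p1 = [0.4, 0.3, 0.2, 0.1]
p2 = [0.1, 0.2, 0.3, 0.4]
post = posterior_class2(p1, p2, pi1=0.4, pi2=0.6)  # increasing in s
```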
Equation (13) shows that highly skewed score distributions will generate data where persons with high scores almost exclusively come from the second class while persons with low scores belong to the first class. Extreme class sizes, on the other hand, will tend to generate data where persons at one end of the score range come from one class while persons at the other end are a mixture of persons from both classes. This means that one may expect little or no evidence against homogeneity of item parameters estimated in extreme score groups as long as one only compares score groups belonging to the same end of the score range. This leads to the following definition of local homogeneity.
Definition 2. A set of k dichotomous items is locally homogeneous if the score range, [1,k-1], partitions into r score intervals, [1,s1], [s1+1,s2], ..., [s_{r-1}+1,k-1], where at least one score interval contains more than one score value and such that further partitioning of the score intervals does not disclose evidence against homogeneity within the r score intervals.
Definition 3. Consider a partition, [1,s1], [s1+1,s2], ..., [s_{r-1}+1,k-1], of the score range, [1,k-1], into r score intervals such that item responses appear to be homogeneous within score intervals. If concatenation of adjacent score intervals results in score intervals where item responses appear to be heterogeneous, then we refer to the set of score intervals as a maximally locally homogeneous (MLH) set of score intervals.
Note that local homogeneity is defined in terms of empirical findings during item analysis by Rasch models and that these findings may suggest, but do not in themselves define, a stochastically ordered mixed Rasch model. Note also that the definition permits both situations where homogeneity is found in just one extreme score interval, with heterogeneity elsewhere, and situations where homogeneity is found in two extreme score intervals with heterogeneity in between.
The purpose of the analysis of local homogeneity discussed in this paper is to identify a set of MLH score intervals when evidence against the Rasch model has surfaced. Analysis of local homogeneity should thus be a protected procedure in the sense that it should only be performed when initial tests of the Rasch model have rejected the model, and only if the possibility of a stochastically ordered mixed Rasch model makes sense. Under these conditions the following stepwise procedure is suggested:
Initialization:
Select an initial set of m_0 disjoint score intervals, J^0 = {J_1^0, J_2^0, ..., J_{m_0}^0}, such that [1,k-1] = ∪_i J_i^0. Ideally the initial score intervals should be the elementary score groups, but larger score intervals may be warranted if the number of cases in each score group is too small for estimation of item parameters to make sense.
Iterative step:
Compare item parameters in adjacent score groups by Andersen's conditional likelihood ratio test. If all test statistics are significant the search for local homogeneity stops. If some test statistics are insignificant, two or more homogeneous score intervals are merged into larger score intervals, followed by a new iterative step.
Several procedures for joining score intervals can be considered. Let G_i²(t) be the conditional likelihood ratio test for the comparison of score intervals i and i+1 during the t'th step of the procedure. One procedure would be to merge score intervals i and i+1 if G_i²(t) is insignificant and smaller than all other test statistics calculated during the t'th step. The score intervals for the (t+1)'st step will then be

J_j^{t+1} = J_j^t for j = 1, ..., i-1
J_i^{t+1} = J_i^t ∪ J_{i+1}^t
J_j^{t+1} = J_{j+1}^t for j = i+1, ..., m_{t+1}

where m_{t+1} = m_t - 1.
In consideration of the fact that we are aiming at stochastically ordered mixed Rasch models, the above procedure should take second place to a procedure where the extreme score intervals, J_1^t and J_2^t or J_{m_t-1}^t and J_{m_t}^t, are joined first, before merging of midrange intervals is considered.
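The extreme-intervals-first variant of the stepwise procedure can be sketched as follows. The function assumes a callback returning the p-value of Andersen's conditional likelihood ratio test for two adjacent intervals; both the intervals and the p-values in the usage example are hypothetical:

```python
def merge_extremes(intervals, test, alpha=0.05):
    """Stepwise search for local homogeneity, joining extreme score
    intervals first. `intervals` is a list of (low, high) score intervals;
    `test(a, b)` returns the p-value of the homogeneity test for two
    adjacent intervals a and b."""
    intervals = list(intervals)
    changed = True
    while changed and len(intervals) > 1:
        changed = False
        # try the lower extreme pair first, then the upper extreme pair
        for i in (0, len(intervals) - 2):
            if test(intervals[i], intervals[i + 1]) > alpha:
                merged = (intervals[i][0], intervals[i + 1][1])
                intervals[i:i + 2] = [merged]
                changed = True
                break  # restart with the reduced set of intervals
    return intervals

# hypothetical p-values: homogeneous only in the high score range
def fake_test(a, b):
    return 0.5 if a[0] >= 4 else 0.01

result = merge_extremes([(1, 2), (3, 3), (4, 5), (6, 9)], fake_test)
# result: [(1, 2), (3, 3), (4, 9)]
```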
The results of a search for local homogeneity agree with a stochastically ordered mixed Rasch model with two latent classes if the end result is two homogeneous score intervals at the extreme ends of the score range. The results are particularly favorable if the end result consists of exactly two locally homogeneous score intervals, J_1^t ∪ J_2^t = [1,k-1]. When this happens, the upper limit of the first interval may be used as a diagnostic cut-point distinguishing persons from one latent class from persons from the other class. In most cases things will not turn out so conveniently. One or more score intervals between the two extreme intervals may exist, comprising a mixture of persons from both classes, where classification in practice is doubtful or close to impossible. Even when end results appear to be unambiguous, the risk of misclassification should be recognized. In order both to check the adequacy of a stochastically ordered latent class model and to evaluate the risk of misclassification, analysis of local homogeneity should be succeeded by an analysis by mixed Rasch models, which would provide not only a check of the adequacy of the model but also proper estimates of class sizes and of the risk of misclassification if cut-points separating homogeneous score intervals are used for classification.
CHIPS – a test of cognitive development
CHIPS (Children’s Problem Solving) is an instrument for measurement of cognitive function
meant for children from six to twelve years of age. CHIPS is based on the assumption that cognitive
function changes qualitatively during a person’s lifetime. It provides some possibility for evaluating
the level of cognitive function in quantitative terms. The main purpose of CHIPS is, however, to
classify pupils according to stage of cognitive development.
The theory of cognitive development behind CHIPS defines cognitive function in terms of the extent to which the person is able to draw on three qualitatively different types of strategies for solving abstract problems. Three stages of cognitive development are recognized: Global, A/S and Comprehensive. At the global stage the child registers likeness more than anything else. At the analytic/synthetic stage the child is able to cope with both likenesses and differences and to synthesize them into wholes. At these two stages of cognitive development the child is not yet ready to deal with abstract mental images, but has to see or handle the physical objects. At the comprehensive stage the child is able to find and use abstract principles and rules when required for problem solving.
CHIPS consists of three types of items:
1. 11 G-items requiring global cognition.
2. 14 A/S-items requiring analytic/synthetic cognition.
3. 15 C-items requiring comprehensive cognition.
We will here only be concerned with the final 15 C-items and the question of whether or not
responses to these items seem to confirm the hypothesis of two latent classes distinguishing
between pupils who are able and pupils who are not able to use the comprehensive strategies
required for finding a proper solution to the problems represented by these items. We refer to
students who are able to use comprehensive strategies as pupils at the C-level of cognitive
development. Figure 1 shows four C-items from CHIPS.
Local homogeneity of CHIPS subscales
The data for this example originated in the third study undertaken to develop and validate the
CHIPS test. The two previous studies were of roughly the same size. Apart from the fact that a few items were revised or replaced, the results from the first two studies corresponded very closely to the results from the third study presented here.
Data were collected from three different sources:
1) 454 pupils receiving ordinary education in the second grade of the Danish primary school.
2) 382 pupils receiving ordinary education in the fifth grade of the Danish primary school.
3) 78 pupils receiving special education in the fifth grade of the Danish primary school.
The score distributions in the three different subpopulations are shown in Table 1.
Table 1
Relative CHIPS C-score distributions in three different subpopulations

C-score   2nd grade   5th grade   5th grade - special education
0           13.0        3.1         16.7
1           13.7        3.4         11.5
2           10.8        5.2         20.5
3           11.0        6.0         10.3
4           12.1        8.6          6.4
5            9.0        8.4          7.7
6            9.3       13.6         12.8
7           10.4       11.8          3.8
8            4.0       13.9          3.8
9            3.5        8.9          5.1
10           2.0        8.9
11           1.1        5.2
12           0.2        2.4
13                      0.5

n          454         382          78
mean         3.9        6.5          3.4
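As a consistency check, the mean C-score reported for the 2nd grade column of Table 1 can be recomputed from its relative score distribution:

```python
# relative score distribution (percent) for the 2nd grade column of Table 1
dist_2nd = [13.0, 13.7, 10.8, 11.0, 12.1, 9.0, 9.3, 10.4, 4.0, 3.5, 2.0, 1.1, 0.2]

# weighted average of scores 0..12, normalized by the column total
mean = sum(s * p for s, p in enumerate(dist_2nd)) / sum(dist_2nd)
print(round(mean, 1))  # 3.9
```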
Analysis of local homogeneity
CHIPS is meant as a measure of cognitive function that can be used to study the development
of cognitive function over time. It is therefore important that there is no differential item
functioning depending on when measurements are taken. In addition to the usual tests of
homogeneity of responses across score groups we therefore also test for homogeneity of responses
across the three populations sampled for this study. The initial check of the Rasch model for
dichotomous items clearly rejects the model. Item parameters appear to be heterogeneous both across score groups (G² = 519.2, df = 140, p < 0.0005) and across subpopulations (G² = 71.1, df = 28, p < 0.0005).
Figure 1. Four C-items
The analysis of local homogeneity is summarized in Table 2. Pairwise comparisons of adjacent
score groups suggest that heterogeneity is located in the lower score range, whereas homogeneity
seems to be acceptable in the high score range.
The stepwise procedure starts by accepting that item parameters are the same in score group 10 and score group 11-14. In the next step these two score groups are combined and compared to score group 9, once again with no evidence of heterogeneity of item parameters across score groups.
Concatenation of score groups in the higher score range continues until score group 6-14 is formed, but then stops because the test of homogeneous item parameters in score group 5 and score group 6-14 is strongly rejected.
Test results comparing score groups in the lower score range are significant. Multiple testing problems suggest, first, that homogeneity of score groups 1 and 2 should nevertheless be accepted and, second, that homogeneity of score groups 4 and 5 should not be rejected either.
The final result then is a statement of local homogeneity. Item responses appear to be homogeneous for scores from 6 to 14. Below 6, the picture is less clear. If homogeneity is present at all, it seems to be located in fairly small score intervals. The low score range has to be partitioned into at least three intervals, 1-2, 3 and 4-5. The disparity of score groups in the low score range, however, poses no special problems for the hypothesis of a stochastically ordered mixed Rasch model. From the point of view of the hypothesis of a comprehensive stage of cognitive development, the wide score interval from 6 to 14 would consist of persons at the comprehensive stage, whereas the score groups below six could be a mixture of children at all three stages of development.
The analysis of local homogeneity suggests that qualitatively different problem solving processes for those with high and those with low scores were responsible for the rejection of the Rasch model in the first place. Four heterogeneous score groups were identified by the analysis. The distribution of pupils across these score groups is shown in Table 3. Note that pupils with extreme and therefore non-informative scores have been included in this table.
15
Table 2
Analysis of local homogeneity of C-items from CHIPS

Comparison of                                    G²      df    p
Subpopulations (DIF)                             71.1    28    <0.0005
Global comparison of score groups
  1,2,3,4,5,6,7,8,9,10,11-14                    519.2   140    <0.0005
Comparisons of adjacent score groups
  1 and 2                                        25.7    14    0.028
  2 and 3                                        23.4    14    0.054
  3 and 4                                        32.2    14    0.004
  4 and 5                                        26.9    14    0.020
  5 and 6                                        28.7    14    0.012
  6 and 7                                        14.1    14    0.442
  7 and 8                                        16.9    14    0.263
  8 and 9                                        17.9    14    0.210
  9 and 10                                        7.5    14    0.913
  10 and 11-14                                   15.5    14    0.348
Tests for combined score groups in the high score range
  9 and 10-14                                     9.6    14    0.793
  8 and 9-14                                     16.6    14    0.280
  7 and 8-14                                     25.8    14    0.028
  6 and 7-14                                     11.9    14    0.615
  5 and 6-14                                     46.3    14    <0.0005
Tests for combined score groups in the low score range
  1-2 and 3                                      38.7    14    <0.0005
  3 and 4-5                                      48.4    14    <0.0005
  4-5 and 6-14                                   46.3    14    <0.0005
Table 3
Pupils distributed across four heterogeneous score groups

Score    Count    Frequency
0-2      253      .277
3         81      .089
4-5      172      .188
6-15     408      .446
Although the demarcation between low and high scores is not as unambiguous as one could hope for, the results of the analysis of local homogeneity in general, and the fairly broad interval of high scores in particular, appear to confirm the existence of a fairly large class of pupils who are able to work with the strategy required to solve the C-items. Whether the presence of three narrow score intervals at the lower end of the score range is a result of general heterogeneity among pupils not able to use the optimal strategy, or whether it is caused by a mixing of pupils mastering and not mastering the C-strategy, is a question that cannot be answered by this analysis. We would be inclined to interpret the results in the following way:

- Pupils with a score of 6-15 are almost surely pupils who are able to use the C-strategies.
- Pupils with a score of 0-2 are almost surely pupils at a level of cognitive development where the strategies for solving the 15 items are not within reach. They must be at either a global or an analytic/synthetic level of development.

The score groups 3-5 consist of a mixture of pupils at the C-level and pupils at lower levels. Score group 3 probably has a majority of pupils below C-level, while score groups 4-5 have a higher frequency of pupils at C-level.
Analysis of differential functioning
The analysis so far has only dealt with the problem of heterogeneity among score groups. The
evidence of DIF suggested by the heterogeneity of item responses in the three populations also has
to be addressed. To examine whether or not DIF is present for children with a high score we use
Andersen’s conditional likelihood ratio to test that item parameters estimated for children with a
total score of 6 or higher are the same in the three subpopulations. The test is clearly insignificant
(G2 = 31.4, df = 28, p=0.199). The procedure for analysis of local homogeneity has thus identified a
population where DIF does not appear to be a problem.
Guessing
CHIPS is a multiple choice test with one correct response and five distracters for each item. Random guessing could therefore be one type of problem solving behavior generating responses that violate the assumptions underlying conventional Rasch models. Random guessing may, on the other hand, also provide a convenient starting point for an analysis of local homogeneity. If we assume that the problem solving behavior is purely random, it follows that the score on the 15 C-items will follow a binomial distribution with n = 15 and success probability 1/6. In this distribution the probability of a score of six or more is only 2.7%. A naïve, but nevertheless natural, argument would therefore be to hypothesize that the majority of the children reaching such scores are able to use the strategy required for solving the type of problems represented by the C-items. This does not guarantee homogeneity of item parameters in these score groups, but it implies that they define a natural initial hypothesis for a test of local homogeneity.
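The binomial argument is easy to verify. Under purely random guessing among six options per item the score follows a binomial distribution with n = 15 and success probability 1/6, and the upper tail probability can be computed directly:

```python
from math import comb

def binom_tail(n, p, s_min):
    """P(S >= s_min) for S ~ Bin(n, p): the probability of reaching a
    score of s_min or more by purely random guessing."""
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) for s in range(s_min, n + 1))

tail = binom_tail(15, 1/6, 6)
print(round(100 * tail, 1))  # 2.7 (percent)
```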
Classification based on this argument alone is, however, inadequate. Qualitative interviews with children indicate that purely random guessing is very rare. Guessing does occur, but it is almost exclusively among a subset of distracters sharing specific features that appear attractive to the child but are nevertheless insufficient as cues pointing to the correct solution of an item. If this observation is valid, it follows that children who are not able to solve the problems in the C-items tend to get a lower score than expected under purely random guessing, so that local homogeneity may be obtained in lower score groups as well.
Mixed Rasch models
The analysis of local homogeneity leaves a number of questions unanswered:
1. Does a proper analysis by mixed Rasch models confirm the results?
2. Are there two or more latent classes?
3. Are they stochastically ordered and is 6 a reasonable cut-point?
4. What are the class sizes?
5. What is the risk of misclassification using the cut-points suggested by the analysis of local
homogeneity?
The number of latent classes of a mixed model is a difficult question that cannot be answered by
the type of significance tests otherwise used to evaluate models. Instead one has to use information
criteria to investigate whether the larger number of parameters in a model with many classes gives
a better description of the complexities in the data than a model with fewer classes and fewer
parameters. Table 4 shows the Bayesian information criterion (BIC) suggested by Schwarz (1978)
for models with 1 to 4 latent classes. BIC reaches its smallest value for two classes, indicating that
a two-class model provides the best description of the data.
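Model selection by BIC amounts to computing Schwarz's penalized log-likelihood for each candidate model and picking the smallest value. A minimal sketch, using the BIC values of Table 4 (the underlying log-likelihoods and parameter counts are not reported in the paper, so the `bic` function only illustrates the general formula):

```python
from math import log

def bic(loglik, n_params, n_obs):
    """Schwarz's criterion: -2 * log-likelihood plus a parameter penalty."""
    return -2 * loglik + n_params * log(n_obs)

# BIC values reported in Table 4 for models with 1-4 latent classes;
# smaller is better.
bic_table = {1: 13680.00, 2: 13352.81, 3: 13385.16, 4: 13462.89}
best = min(bic_table, key=bic_table.get)
print(best)  # 2 -> the two-class model is preferred
```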
Table 4
Bayesian information criteria and class sizes for the Rasch model for dichotomous items and for
three mixed Rasch models

Number of classes   BIC        Size of    Size of    Size of    Size of
                               Class 1    Class 2    Class 3    Class 4
1                   13680.00   1.000
2                   13352.81    .393       .607
3                   13385.16    .382       .368       .250
4                   13462.89    .367       .232       .207       .193
The information criteria may support the idea of two latent classes but tell nothing about whether
or not these classes are stochastically ordered. To get an impression of this we must look at the
estimated score distributions in the two classes, P(S|1) and P(S|2), where 1 and 2 refer to Class 1
and Class 2 of Table 5. The score distributions are shown in Figure 2 and Table 5. Relatively low
scores characterize the first class while the second class has comparatively high scores. This is
exactly as expected under the hypothesis of a comprehensive stage of development required for
solving the problems in the C-items of CHIPS. According to this hypothesis the second class is
interpreted as pupils who have reached this level while pupils in Class 1 are at either the Global or
the Analytic/Synthetic stage of cognitive development. We notice also that the frequencies of scores
higher than or equal to five are higher in Class 2 than in Class 1 whereas the opposite holds for
scores below five. This agrees closely with the results obtained during the analysis of local
homogeneity. The frequencies of scores below six in Class 2 also seem to be in accordance with
the interpretation of score groups 3 - 5 as mixtures of children at the C-level and children at lower
levels. We return to the estimation of the latent classes and the risk of misclassification below.
The two latent classes are stochastically ordered if δs = P(S=s|1)/P(S=s|2) is a monotone
function of s. The values of δs in Table 5 show that this requirement is met.
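The monotonicity of the likelihood ratio can be verified mechanically from the δs column of Table 5; a minimal check:

```python
# Stochastic-ordering check: delta_s = P(S=s|1)/P(S=s|2) must be
# monotone (here: nonincreasing) in s. Values are the delta_s column
# of Table 5.
delta = [79.6243, 26.6072, 9.0718, 3.1559, 1.1202, 0.4057, 0.1499,
         0.0565, 0.0217, 0.0085, 0.0034, 0.0014, 0.0006, 0.0002,
         0.0001, 0.0000]

is_decreasing = all(a >= b for a, b in zip(delta, delta[1:]))
print(is_decreasing)  # True
```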
The second requirement for a stochastically ordered mixed Rasch model of cognitive development is that the dispersion of the item parameters in Class 2, containing the pupils at the C-level, is
larger than in the class with inferior scores. That this requirement is satisfied can be seen in
Table 6 and Figure 3. Table 6 also includes estimates of the item parameters in score groups 6 - 14
obtained during the analysis of local homogeneity. These estimates correspond closely to those
obtained for Class 2.
Figure 2. Score distributions in the two latent classes of the mixed Rasch model for the C-items of
CHIPS (relative frequency against score 0 - 15).
Table 5
Estimated score distributions in the two latent classes of the mixed
Rasch model for the C-items of CHIPS.

Score   P(S|1)   P(S|2)   δs = P(S|1)/P(S|2)   δs/(δs+1)
0       .2339    .0029          79.6243          .9876
1       .2346    .0088          26.6072          .9638
2       .2017    .0222           9.0718          .9007
3       .1486    .0471           3.1559          .7594
4       .0939    .0838           1.1202          .5284
5       .0509    .1253            .4057          .2886
6       .0236    .1575            .1499          .1304
7       .0094    .1662            .0565          .0535
8       .0032    .1473            .0217          .0213
9       .0009    .1098            .0085          .0085
10      .0002    .0687            .0034          .0034
11      .0001    .0361            .0014          .0014
12      .0000    .0160            .0006          .0006
13      .0000    .0059            .0002          .0002
14      .0000    .0018            .0001          .0001
15      .0000    .0005            .0000          .0000
Table 6
Estimated item parameters in the two latent classes of the mixed Rasch model
for the C-items of CHIPS.

Item   Class 1   Class 2   Score 6-14
C26    -0.75     -1.73     -1.74
C27    -0.56     -1.68     -1.57
C28    +0.64     -0.26     -0.25
C29    -0.10     -3.08     -2.75
C30    +0.69     -2.05     -1.88
C31    +1.15     +2.96     +3.02
C32    +2.45     +2.93     +2.97
C33    -0.41     +0.07     -0.08
C34    -0.38     -0.09     -0.23
C35    -1.65     -0.41     -0.49
C36    -0.55     -0.25     -0.39
C37    -0.82     -0.31     -0.38
C38    +0.29     +0.82     +0.74
C39    -0.66     +0.90     +0.92
C40    +0.67     +2.17     +2.10

Dispersion  3.75      6.59
Figure 3. Scatter plot of item parameters in the two latent classes (parameter estimates for Class 2
on the horizontal axis, both axes from -4 to +4).
Estimating stochastically ordered latent classes
Equation (12) gives the justification for score based posterior estimation of stochastically
ordered latent classes. We write 1 and 2 for Class 1 and Class 2, with class sizes π1 and π2. For a
given score, s, we choose Class 2 as the estimate of the latent class if

    δs = P(S=s|1)/P(S=s|2) ≤ π2/π1.

If δs is a decreasing function of s it follows that Class 2 will also be the estimate of the latent class
for all scores higher than s. Estimating the latent class is in other words a question of selecting the
smallest score value, s0, for which (12) applies:

    s0 = min {s : δs = P(S=s|1)/P(S=s|2) ≤ π2/π1}

    class estimate = 1 if s < s0
    class estimate = 2 if s ≥ s0
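The cut-point rule can be sketched in a few lines, using the δs column of Table 5 and the class sizes of the CHIPS study:

```python
# Cut-point rule: Class 2 is chosen for the smallest score s0 with
# delta_s <= pi2/pi1, and for every score above it.
def cut_point(delta, pi1, pi2):
    """Smallest score s with delta[s] <= pi2/pi1 (None if no such s)."""
    return next((s for s, d in enumerate(delta) if d <= pi2 / pi1), None)

# delta_s column of Table 5 for scores 0-15
delta = [79.6243, 26.6072, 9.0718, 3.1559, 1.1202, 0.4057, 0.1499,
         0.0565, 0.0217, 0.0085, 0.0034, 0.0014, 0.0006, 0.0002,
         0.0001, 0.0000]

print(cut_point(delta, 0.393, 0.607))  # 4: scores 4-15 estimated as Class 2
print(cut_point(delta, 0.5, 0.5))      # 5: cut-point with unknown class sizes
```

The two printed values correspond to the cut-points of 4 and 5 discussed in the text.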
The cut-points depend on the class sizes. In the CHIPS study the class sizes are π1 = 0.393 and
π2 = 0.607. The values of δs in Table 5 should therefore be compared to π2/π1 = 0.607/0.393 =
1.544, implying that not only cases with a score of 6 or higher but also cases from score group
4 - 5 should be classified as pupils at the comprehensive level. If the class sizes are completely
unknown, a cut-point corresponding to π1 = π2 = 0.5 would be more appropriate; in this case a
cut-point of 5 should be used instead of 4. Cut-points for other sizes of Class 2 depend on the
values of δs/(δs+1) shown in Table 5. If π2 > 0.9876 the estimate should always be Class 2.
If δs/(δs+1) ≤ π2 < δs-1/(δs-1+1) for some s = 1,...,k-1, the estimate of the latent class should be
Class 2 if and only if the score is equal to or higher than s. Finally, the estimate should always be
Class 1 if π2 < 0.0006.
The risk of misclassification
The unrestricted mixed Rasch model estimates the size of the class of pupils at C-level to 0.607.
The strict cut-point suggested by the analysis of local homogeneity estimates the class of pupil
definitely at C-level to 0.446, which is updated to 0.634 for the stochastically ordered mixed Rasch
model. While there thus seems to be a fairly good agreement between the unrestricted and restricted
estimates of class sizes the risk of misclassification may be very different and somewhat higher for
the two methods because the item responses are ignored for the second estimate. This is illustrated
in Table 6.
Table 7
Risk of misclassification by unrestricted and stochastically ordered mixed Rasch models

Method             Class   Risk of misclassification
Unrestricted       1       .043
                   2       .067
Stochastic order   1       .081
                   2       .182
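The stochastic-order risks in Table 7 can be recovered from the score distributions of Table 5. A sketch, assuming these risks are the conditional probabilities of falling on the wrong side of the cut-point s0 = 4 given the true class:

```python
# P(S|1) and P(S|2) for scores 0-15, taken from Table 5.
p1 = [.2339, .2346, .2017, .1486, .0939, .0509, .0236, .0094,
      .0032, .0009, .0002, .0001, .0000, .0000, .0000, .0000]
p2 = [.0029, .0088, .0222, .0471, .0838, .1253, .1575, .1662,
      .1473, .1098, .0687, .0361, .0160, .0059, .0018, .0005]

s0 = 4                        # cut-point of the stochastic-order rule
risk_from_1 = sum(p1[s0:])    # a Class 1 pupil scoring s0 or higher
risk_from_2 = sum(p2[:s0])    # a Class 2 pupil scoring below s0
print(round(risk_from_1, 3), round(risk_from_2, 3))  # 0.182 0.081
```

The two values match the stochastic-order entries of Table 7.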
A large part of the misclassification by the stochastically ordered mixed Rasch model occurs for
persons with scores between 3 and 5. Taking this into consideration, the cautious estimates
following the analysis of local homogeneity may be preferable. Out of the 277 pupils classified as
definitely not at the C-level and the 446 classified as definitely at the C-level, the unrestricted
mixed Rasch model estimates that respectively 2.4 % and 3.0 % will be misclassified.
Discussion
Mixed Rasch models
The mixed Rasch model is an attractive model for the kinds of problems addressed in this paper, but a few cautionary remarks concerning the use of these models may nevertheless be in order.
Notice first that the conditional distributions of θ given the latent class are only determined up to
an additive constant due to the unidentifiability of the item parameters. Item parameters of Rasch
models are in most cases identified by assuming that the sum of the item parameters is equal to
zero. As this has to be done for each latent class separately, Σi βic = 0 for each class c, it follows
that neither θ values nor item parameters from different classes are directly comparable. Comparison of persons and items from different classes must therefore be based on observed person scores
and item mean scores.
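The within-class identification constraint can be illustrated directly. The sketch below centers a set of item parameters so that they sum to zero, using the Class 1 estimates of Table 6 (which, as expected, are already essentially centered):

```python
# Item parameters are only identified up to an additive constant, so each
# class is normalized separately: subtract the mean so the parameters
# sum to zero within the class.
def center(betas):
    """Impose the identification constraint sum(beta) = 0."""
    mean = sum(betas) / len(betas)
    return [b - mean for b in betas]

class1 = [-0.75, -0.56, 0.64, -0.10, 0.69, 1.15, 2.45, -0.41,
          -0.38, -1.65, -0.55, -0.82, 0.29, -0.66, 0.67]

centered = center(class1)
print(abs(sum(centered)) < 1e-9)  # True
```

Because the constant is chosen per class, a difference between the Class 1 and Class 2 parameter of the same item is only meaningful relative to the other items, not in absolute terms.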
Second, the term "latent" is usually taken to indicate an unobservable characteristic of persons
(Vogt, 1993). While trying to fit a mixed Rasch model to data we should recognize that evidence
suggesting such a model may have causes that have nothing to do with latent classes in the strict
sense. Structures comparable to those described by mixed Rasch models may be generated not only
by unobservable variables but also by unobserved or unused variables. While the difference
between unobservable and unused variables is small in purely technical terms, the difference in
substantive terms may be crucial. If the failure of the conventional Rasch model to give an adequate
description of the item responses was caused by unobserved or unused variables, the population
heterogeneity implied by equation (6) should be recognized as a problem of differential item
functioning (DIF) rather than as a problem of mixed heterogeneous populations. Instead of
estimating latent classes, these problems should be solved either by observing the unobserved
variables or by using the unused ones.
Mixed Rasch models should therefore never be used indiscriminately, but only in situations
where substantive arguments suggest that latent classes actually might exist. If used in a purely
exploratory way, analysis by mixed Rasch models may lead to ad hoc interpretations of heterogeneity where other, simpler interpretations would be more appropriate.
From a more practical point of view, mixed Rasch modelling suffers from not providing the user
with simple operational diagnostic procedures that can be used to identify, or at least estimate, the
class that specific persons belong to. This problem can be, and has been, solved in several different
ways within the framework of mixed Rasch models, but the solutions invariably involve
specialized software that will not be generally available.
Local Homogeneity
The strategy for analysis of local homogeneity suggested in this paper may be described as a
step-up procedure for pairwise comparison of score groups. Several other step-down and step-up
procedures have been suggested (Hochberg and Tamhane, 1987). They all suffer from multiple
testing problems to which no satisfactory solution has been offered. Sequentially rejective
Bonferroni procedures make use of /r where r equals the maximum possible number of true
hypotheses given that a specific hypothesis is false. In the example discussed in this paper this
procedure results in fairly unambiguous results, but in other situations this may not always be the
case.
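A sequentially rejective Bonferroni procedure of this kind (Holm's step-down version) can be sketched as follows; the p-values in the example are hypothetical, not taken from the CHIPS analysis:

```python
# Sequentially rejective Bonferroni (Holm): compare the i-th smallest
# p-value to alpha/r, where r is the number of hypotheses that could
# still be true at that step.
def holm(p_values, alpha=0.05):
    """Return the number of hypotheses rejected by Holm's procedure."""
    m = len(p_values)
    for i, p in enumerate(sorted(p_values)):
        if p > alpha / (m - i):
            return i          # stop at the first non-rejection
    return m

print(holm([0.001, 0.010, 0.030, 0.040]))  # 2: only the two smallest rejected
```

Note how the adjustment relaxes from α/4 to α/3 as hypotheses are rejected, which is what distinguishes the sequential procedure from a fixed Bonferroni correction.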
On the validity of CHIPS
The main purpose of this study was to validate the hypothesis of a stochastically ordered mixed
Rasch model distinguishing between pupils at the comprehensive level of cognitive development
and pupils at lower levels. The results seem to confirm this hypothesis, as the estimated unrestricted
mixed Rasch model reproduced all the main features of such a model. We feel that the results
provide strong confirmation of the hypothesis because two computationally very different analyses
lead to basically the same results. The difference in the dispersion of the item parameters in the two
classes also supports the hypothesis underlying the C-subscale of CHIPS. We note that the same
kind of confirmation was obtained by analyses of the G- and A/S-subscales. These results will be
published elsewhere together with a more exhaustive discussion of the theory of cognitive
development on which CHIPS was grounded.
Estimation of latent classes
For a mixed model of cognitive development to be of practical use, simple procedures for
estimating the level of development of specific pupils are required. Classification based on
estimates of posterior probabilities under the unrestricted model uses more of the available
information for this purpose than classification procedures based on raw scores and must therefore
be expected to provide the best estimates. Estimates of this kind are, however, impractical for
clinical use as they presume that specialized software is available to the school psychologist
administering the test.
Instead we have suggested procedures based on cut-points partitioning the score range into intervals
indicative of different latent classes. Two procedures were discussed, one based on the stochastically
ordered mixed Rasch model and another utilizing the analysis of local homogeneity. The risk of
misclassification is acceptable for the stochastically ordered mixed Rasch model, but somewhat
higher than for the unrestricted model. The estimate by local homogeneity provides a cautious and,
to our mind, more honest estimate, distinguishing between cases where classification is uncertain
and cases where classification appears to be reliable. Once item analysis has confirmed the
adequacy of the stochastically ordered mixed Rasch model, we therefore tend to prefer this
approach to classification based on posterior estimates of classes given raw scores. Note also that
posterior estimates depend not only on class sizes, but also on the estimates of the score
distributions in the different classes (10). In this connection it is important to remember that the
score distributions (8) that are often used as part of an analysis by mixed Rasch models are
arbitrary. They do provide conveniently smoothed estimates of regular score distributions, but we
have no arguments implying that these models are more than models of convenience. If one wants
to estimate posterior class probabilities for populations of pupils composed in another way than the
one studied here, one may have to adjust the expected class sizes and the expected score
distributions within classes before class estimates are determined.
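The dependence of the posterior class probabilities on both class sizes and within-class score distributions can be made concrete with a small sketch. It applies the standard Bayes formula (assumed here to be what equation (10) expresses) to the distributions of Table 5:

```python
# Posterior probability of Class 2 given a raw score:
#   P(2|s) = pi2 * P(s|2) / (pi1 * P(s|1) + pi2 * P(s|2))
# P(S|1) and P(S|2) are the estimated score distributions of Table 5.
p1 = [.2339, .2346, .2017, .1486, .0939, .0509, .0236, .0094,
      .0032, .0009, .0002, .0001, .0000, .0000, .0000, .0000]
p2 = [.0029, .0088, .0222, .0471, .0838, .1253, .1575, .1662,
      .1473, .1098, .0687, .0361, .0160, .0059, .0018, .0005]

def posterior_class2(s, pi1=0.393, pi2=0.607):
    num = pi2 * p2[s]
    return num / (pi1 * p1[s] + num)

print(round(posterior_class2(5), 2))  # 0.79: a score of 5 already favours Class 2
```

Changing the class sizes `pi1` and `pi2` shifts all posteriors, which is exactly why the estimates must be adjusted for populations composed differently from the one studied here.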
We mention finally that the risk of misclassification may be reduced in two different ways if
estimates of the latent classes for the G- and A/S-subscales are calculated at the same time as the
posterior C-probabilities. The first is based on the observation that the G- and A/S-levels have to be
reached before the C-level: if estimates of the posterior G- and A/S-probabilities indicate that a
pupil is at the G-level, then the pupil cannot be at the C-level. The second way to improve estimates
of the cognitive level is through analysis of distracters. Pupils at different levels select different
types of distracters when they cannot solve a specific problem. Taking this into account reduces the
risk of misclassification not only for estimates based on posterior probabilities given scores, but
also for estimates taking the complete results into consideration. The analysis of distracters will
also be published elsewhere.
References
Andersen, E.B. (1973a) Conditional Inference and Models for Measuring. Copenhagen:
Mentalhygiejnisk Forskningsinstitut.
Andersen, E.B. (1973b). A goodness of fit test for the Rasch model. Psychometrika, 38, 123-140.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 47,
105 – 113.
Davier, M. von (1994). WINMIRA: A Windows program for Analyses with the Rasch Model , with
the Latent Class model , and with the Mixed Rasch Model. Kiel: Institute for Science
Education.
Hansen, M., Kreiner, S. and Hansen, C.R. (1992). CHIPS – Children’s Problem Solving. Manual.
Copenhagen: Dansk psykologisk forlag.
Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. New York: John Wiley
& Sons.
Kreiner, S., Simonsen, E. and Mogensen, J. (1990). Validation of a Personality Inventory Scale:
The MCMI P-Scale (Paranoia). Journal of Personality Disorders, 4, 303-311.
Leunbach, G. (1976). A probabilistic measurement model for assessing whether two tests measure
the same personal factor. Technical report 1976.19. Copenhagen: The Danish Institute of
Educational Research.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149 – 174.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis.
Applied Psychological Measurement, 14, 271-282.
Rost, J. and Davier, M. von (1995). Mixture distribution Rasch models. In Fischer, G. and
Molenaar, I. (eds.): Rasch Models: Foundations, Recent Developments and Applications,
257-268. New York: Springer Verlag.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-466.
Vogt, W. P. (1993). Dictionary of Statistics and Methodology. A Nontechnical Guide for the Social
Sciences. Newbury Park: Sage Publications.