Chapter 3 Polytomous Data
3.1 Introduction
Motivating example: response frequencies for four cheese additives.

Response            Cheese additive
category       A     B     C     D    Total
I              0     6     1     0        7
II             0     9     1     0       10
III            1    12     6     0       19
IV             7    11     8     1       27
V              8     7    23     3       41
VI             8     6     7     7       28
VII           19     1     5    14       39
VIII           8     0     1    16       25
IX             1     0     0    11       12
Total         52    52    52    52      208
Response category: I to IX (from “strong dislike” to “excellent taste”).
$y_{ij}$, $i = 1, 2, 3, 4$; $j = 1, 2, \ldots, 9$: the response frequency for cheese additive $i$ in response category $j$, where
$i = 1$: additive A, $i = 2$: additive B, $i = 3$: additive C, $i = 4$: additive D;
$j = 1$: response I, $j = 2$: response II, $j = 3$: response III, $j = 4$: response IV, $j = 5$: response V, $j = 6$: response VI, $j = 7$: response VII, $j = 8$: response VIII, $j = 9$: response IX.
For example,
$$y_{11} = 0,\; y_{12} = 0,\; y_{13} = 1,\; y_{14} = 7,\; y_{15} = 8,\; y_{16} = 8,\; y_{17} = 19,\; y_{18} = 8,\; y_{19} = 1.$$
Also,
$\pi_{ij}$, $i = 1, 2, 3, 4$; $j = 1, 2, \ldots, 9$: the probability of response category $j$ for cheese additive $i$.
$r_{ij} = \sum_{r \le j} \pi_{ir}$: the cumulative probability of response category $j$ for cheese additive $i$.
That is,
$$r_{i1} = \pi_{i1},\quad r_{i2} = \pi_{i1} + \pi_{i2},\quad \ldots,\quad r_{i9} = \pi_{i1} + \pi_{i2} + \cdots + \pi_{i9} = 1.$$
Then,
$Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{i9})$, $i = 1, 2, 3, 4$, follows a multinomial distribution with parameters $m_i = 52$ and $\pi_i = (\pi_{i1}, \ldots, \pi_{i9})$.
Objective:
We are concerned with the effect on the taste of various cheese
additives. That is, we want to evaluate the statistical significance of
the differences among these cheese additives. We want to find a
model which is capable of describing these differences, for example,
the interrelation among different cumulative probabilities such as
$r_{4j} \le r_{1j} \le r_{3j} \le r_{2j}$ (D, A, C, B from best to worst).
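The ordering above can be checked informally on the observed data. The following is a minimal sketch that computes the empirical cumulative proportions $z_{ij}/m_i$ from the cheese table; these are raw proportions, not fitted model values, and the names `counts` and `cumulative_proportions` are illustrative only.

```python
# Response counts from the cheese table: one list per additive,
# ordered by response category I..IX.
counts = {
    "A": [0, 0, 1, 7, 8, 8, 19, 8, 1],
    "B": [6, 9, 12, 11, 7, 6, 1, 0, 0],
    "C": [1, 1, 6, 8, 23, 7, 5, 1, 0],
    "D": [0, 0, 0, 1, 3, 7, 14, 16, 11],
}

def cumulative_proportions(y):
    """Return the empirical cumulative proportions sum_{r<=j} y_r / m."""
    m = sum(y)
    total, out = 0, []
    for y_j in y:
        total += y_j
        out.append(total / m)
    return out

r = {name: cumulative_proportions(y) for name, y in counts.items()}

# Check the ordering D <= A <= C <= B of cumulative proportions for every j.
ordered = all(
    r["D"][j] <= r["A"][j] <= r["C"][j] <= r["B"][j] for j in range(9)
)
print(ordered)
```

A smaller cumulative proportion at every category means less mass on the unfavourable end of the scale, which is why D appears as the best additive and B as the worst.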
Definition of Polytomous Data:
If the response of an individual or item in a study is restricted to one
of a fixed set of possible values, we say that the response is
polytomous.
Examples of polytomous data include blood type (A, B, AB, O,…),
food testing, measures of mental and physical well-being, variables
arising in social science research.
Note:
Whatever the nature of the scale of the response, the response probabilities
$$\pi_1, \pi_2, \ldots, \pi_k, \qquad \pi_j = P(Y = y_j),$$
need to be specified.
Note:
If the categories are ordered, we may prefer to work with the cumulative response probabilities
$$r_1 = \pi_1,\quad r_2 = \pi_1 + \pi_2,\quad \ldots,\quad r_k = \pi_1 + \pi_2 + \cdots + \pi_k = 1,$$
where
$$r_j = P(Y \le y_j) = \sum_{r \le j} P(Y = y_r).$$
It makes little sense to work with a model specified in terms of $r_j$ if the response categories are not ordered.
3.2 Measurement scales and modeling
(a) General
There are two types of scales: pure scales and compound scales. A bivariate response with one ordinal component and one continuous component is an example of a compound scale. For pure scales, there are several types:
1. nominal scales: the categories are regarded as exchangeable and totally devoid of structure.
2. ordinal scales: the categories are ordered much like the ordinal numbers “first”, “second”, …. It does not make sense to talk of “distance” or “spacing” between “first” and “second”, nor to compare “spacings” between pairs of response categories.
3. interval scales: the categories are ordered and numerical labels or scores are attached. The scores are treated as category averages, medians or mid-points. Differences between scores are therefore interpreted as a measure of separation of the categories.
Note:
In applications, the distinction between nominal and ordinal scales is usually, but not always, clear. For example, hair color and eye color can be ordered to a large extent on the grey-scale from light to dark and are therefore ordinal. However, unless there is a clear connection with the electromagnetic spectrum or a grey-scale, colors are best regarded as nominal.
(b) Models for ordinal scales
Ordinal scales occur more frequently in applications than the other
types. The applications include food testing (bad, good, excellent,…),
classification of radiographs, determination of physical or mental
well-being, ….
Note:
It is essential that the same conclusion be reached even if the number or choice of response categories is changed. As a consequence, if a new category is formed by combining adjacent categories of the old scale, the form of the conclusions should be unaffected. This is an important non-mathematical point that is difficult to make mathematically rigorous. It leads fairly directly to models based on the cumulative probabilities $r_j$ rather than the category probabilities $\pi_j$.
Commonly used models:
There are two commonly used models that are found to work well in
practice. They are
1. logistic scale:
It is the simplest model. The form is
$$\log\left\{\frac{r_j(x)}{1 - r_j(x)}\right\} = \theta_j - \beta^t x.$$
This model is also known as the proportional-odds model, since the ratio of the odds at $x = x_1$ and $x = x_2$ is
$$\frac{r_j(x_1)/\{1 - r_j(x_1)\}}{r_j(x_2)/\{1 - r_j(x_2)\}} = \exp\{-\beta^t (x_1 - x_2)\},$$
which is independent of the choice of category $j$. In addition, if
$$X = \begin{cases} 1, & \text{treatment group} \\ 0, & \text{control group,} \end{cases}$$
then
$$\frac{r_j(1)/\{1 - r_j(1)\}}{r_j(0)/\{1 - r_j(0)\}} = e^{-\beta}.$$
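The proportional-odds property can be illustrated numerically. The sketch below assumes the logistic-scale form $\log[r_j(x)/\{1 - r_j(x)\}] = \theta_j - \beta x$ with a single binary covariate; the cut-points `theta` and slope `beta` are hypothetical values chosen only for illustration.

```python
import math

def cumulative_prob(theta_j, beta, x):
    """r_j(x) under the logistic-scale model logit r_j(x) = theta_j - beta * x."""
    eta = theta_j - beta * x
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    return p / (1.0 - p)

# Hypothetical cut-points and slope, chosen only for illustration.
theta = [-1.0, 0.2, 1.5]
beta = 0.8

# Odds ratio, treatment (x = 1) versus control (x = 0), for each category j.
ratios = [
    odds(cumulative_prob(t, beta, 1)) / odds(cumulative_prob(t, beta, 0))
    for t in theta
]
print(ratios)            # every entry should be close to exp(-beta)
print(math.exp(-beta))
```

The cut-points cancel in the ratio, so the same value is obtained for every $j$.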
2. complementary log-log scale:
The form is
$$\log\left[-\log\{1 - r_j(x)\}\right] = \theta_j - \beta^t x.$$
Note:
The model based on the logistic scale may be derived from the notion of a tolerance distribution or an underlying unobserved continuous random variable $Z$, with $Z = \beta^t x + \varepsilon$, where $\varepsilon$ has a logistic distribution. If the unobserved variable lies in the interval $\theta_{j-1} < Z \le \theta_j$, then $Y = y_j$ is recorded. That is,
$$r_j(x) = P(Y \le y_j) = \sum_{r \le j} P(Y = y_r) = P(Z \le \theta_j) = P(Z - \beta^t x \le \theta_j - \beta^t x) = P(\varepsilon \le \theta_j - \beta^t x) = \frac{\exp(\theta_j - \beta^t x)}{1 + \exp(\theta_j - \beta^t x)}$$
$$\Longleftrightarrow\quad \log\left\{\frac{r_j(x)}{1 - r_j(x)}\right\} = \theta_j - \beta^t x.$$
Note:
It is sometimes claimed that the models based on logistic scale and
complementary log-log scale and related models are appropriate only
if there exists a latent variable Z. This claim seems to be too strong
and, in any case, the existence of Z is usually unverifiable in practice.
Note:
The model
$$\frac{Z - \beta^t x}{\exp(\tau^t x)} = \varepsilon$$
is worthy of serious consideration, where $\varepsilon$ has a logistic distribution. The model leads to
$$\log\left\{\frac{r_j(x)}{1 - r_j(x)}\right\} = \frac{\theta_j - \beta^t x}{\exp(\tau^t x)},$$
where $\beta^t x$ plays the role of linear predictor for the mean and, in the denominator, $\tau^t x$ plays the role of linear predictor for the dispersion or variance. If
$$X = \begin{cases} 1, & \text{treatment group} \\ 0, & \text{control group,} \end{cases}$$
then
$$\frac{r_j(1)/\{1 - r_j(1)\}}{r_j(0)/\{1 - r_j(0)\}} = \frac{\exp\{(\theta_j - \beta)/\sigma\}}{\exp(\theta_j)} = \exp\left(-\frac{\beta}{\sigma}\right) \exp\left\{\theta_j \left(\frac{1}{\sigma} - 1\right)\right\},$$
where $\sigma = \exp(\tau)$. If $\sigma < 1$, the odds ratio is increasing in $j$; if $\sigma > 1$, it is decreasing. This model is useful for testing the proportional-odds assumption ($\tau = 0$) against the alternative that the odds ratio is systematically increasing or systematically decreasing in $j$.
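A small numerical sketch may help to see how the dispersion parameter changes the odds ratio across categories. It assumes the form $\log[r_j(x)/\{1 - r_j(x)\}] = (\theta_j - \beta x)/\exp(\tau x)$ with a binary covariate; the cut-points and the values of `beta` and `tau` are hypothetical.

```python
import math

def odds_ratio(theta_j, beta, tau):
    """Odds ratio, treatment (x = 1) versus control (x = 0), for
    logit r_j(x) = (theta_j - beta * x) / exp(tau * x)."""
    sigma = math.exp(tau)
    return math.exp((theta_j - beta) / sigma - theta_j)

theta = [-1.0, 0.0, 1.0, 2.0]   # hypothetical increasing cut-points
beta = 0.5                      # hypothetical location effect

for tau in (-0.3, 0.0, 0.3):
    print(tau, [round(odds_ratio(t, beta, tau), 4) for t in theta])
```

With `tau = 0` the ratio is constant in $j$ (proportional odds); a negative `tau` gives ratios that increase with $j$ and a positive `tau` gives ratios that decrease.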
Note:
Models in which the $k - 1$ regression lines are not parallel can be specified by
$$\log\left\{\frac{r_j(x)}{1 - r_j(x)}\right\} = \theta_j - \beta_j^t x.$$
(c) Models for interval scales
Interval scales are distinguished by the following properties:
1. The categories are of interest in themselves and are not chosen
arbitrarily.
2. It does not normally make sense to form a new category by
amalgamating adjacent categories.
3. Attached to the $j$th category is a cardinal number or score, $s_j$, such that the difference between scores is a measure of distance between or separation of categories.
Note:
Genuine interval scales having these 3 properties are rare in practice
because, although properties 1 and 2 may be satisfied, it is rare to
find a response scale having well determined cardinal scores attached
to the categories.
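There are 3 options for model construction.
1. The cumulative logistic model with cut-points expressed through the scores:
$$\log\left\{\frac{r_j(x)}{1 - r_j(x)}\right\} = \zeta_0 + \zeta_1 \left(\frac{s_j + s_{j+1}}{2}\right) - \beta^t x - \xi^t x\,(c_j - \bar{c}),$$
where
$$c_j = \frac{s_j + s_{j+1}}{2} \quad\text{or}\quad c_j = \operatorname{logit}\left(\frac{s_j + s_{j+1}}{2}\right).$$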
2. The probability $\pi_j$ can also be used. The model is
$$\pi_j(x_i) = \frac{\exp\{\eta_j(x_i)\}}{\sum_{l=1}^{k} \exp\{\eta_l(x_i)\}},$$
where
$$\eta_j(x_i) = \eta_j + (\beta^t x_i)\, s_j + \alpha_i.$$
Note:
The relative odds for category $j$ over category $k$ in the above model are
$$\frac{\pi_j(x)}{\pi_k(x)} = \exp\{\eta_j - \eta_k + \beta^t x\,(s_j - s_k)\}.$$
Thus, for a scalar covariate, the relative odds are increased multiplicatively by the factor $\exp\{\beta (s_j - s_k)\}$ per unit increase in $x$.
3. The expected score can be modelled directly:
$$\sum_{j=1}^{k} \pi_j(x_i)\, s_j = \beta^t x_i.$$
In this model, instead of regarding $y$ as the response and the score $s_j$ as a contrast of special interest, we may regard the observed score as the response and $y$ as the set of observed multiplicities or weights. The sum $\sum_{j=1}^{k} \pi_j(x_i)\, s_j$ is the expected score. The estimate of the expected score is
$$\bar{S}_i = \sum_{j=1}^{k} \frac{s_j\, y_{ij}}{m_i}.$$
If there are only two treatment groups, with observed counts $\{y_{1j}, y_{2j}\}$, we may use the standardized difference as test statistic:
$$T = \frac{\bar{S}_1 - \bar{S}_2}{\sqrt{\left\{\sum_{j=1}^{k} \tilde{\pi}_j s_j^2 - \left(\sum_{j=1}^{k} \tilde{\pi}_j s_j\right)^2\right\} \left(\dfrac{1}{m_1} + \dfrac{1}{m_2}\right)}},$$
where
$$\tilde{\pi}_j = \frac{y_{1j} + y_{2j}}{m_1 + m_2}.$$
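The standardized score difference can be sketched in a few lines. The example below applies it to cheese additives A and D from the introduction; attaching the integer scores 1 to 9 to categories I to IX is an assumption made purely for illustration, and `score_test` is a hypothetical helper name.

```python
import math

def score_test(y1, y2, scores):
    """Standardized difference of mean scores between two groups.

    y1, y2 : category counts for the two treatment groups
    scores : scores s_j attached to the categories
    """
    m1, m2 = sum(y1), sum(y2)
    s1 = sum(s * y for s, y in zip(scores, y1)) / m1   # mean score, group 1
    s2 = sum(s * y for s, y in zip(scores, y2)) / m2   # mean score, group 2
    # Pooled category proportions, used to estimate the score variance.
    pooled = [(a + b) / (m1 + m2) for a, b in zip(y1, y2)]
    mean = sum(p * s for p, s in zip(pooled, scores))
    var = sum(p * s * s for p, s in zip(pooled, scores)) - mean ** 2
    return (s1 - s2) / math.sqrt(var * (1.0 / m1 + 1.0 / m2))

# Cheese additives A and D, with integer scores 1..9 as an assumption.
y_a = [0, 0, 1, 7, 8, 8, 19, 8, 1]
y_d = [0, 0, 0, 1, 3, 7, 14, 16, 11]
scores = list(range(1, 10))
t = score_test(y_a, y_d, scores)
print(round(t, 3))
```

A negative value indicates that the first group has the lower mean score; swapping the two groups only changes the sign.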
(d) Models for nominal scales
The probability $\pi_j$ can be used. The model is
$$\pi_j(x_i) = \frac{\exp\{\eta_j(x_i)\}}{\sum_{l=1}^{k} \exp\{\eta_l(x_i)\}},$$
where
$$\eta_j(x_i) = \eta_j(x_0) + (x_i - x_0)^t \beta_j + \alpha_i.$$
Note:
The relative odds for category $j$ over category $k$ in the above model are
$$\frac{\pi_j(x)}{\pi_k(x)} = \frac{\pi_j(x_0)}{\pi_k(x_0)} \exp\{(x - x_0)^t (\beta_j - \beta_k)\}.$$
Thus, starting from the baseline odds $\pi_j(x_0)/\pi_k(x_0)$, the relative odds are increased multiplicatively by the factor $\exp(\beta_j - \beta_k)$ per unit increase in $x$.
(e) Models for nested or hierarchical scales
Example:
Objective: we want to test the hypothesis that a winter diet
containing a high proportion of red clover has the effect of reducing
the fertility of milch cows.
To test the hypothesis, 80 cows were assigned at random to one of the
two diets. Most cows become pregnant at the first insemination, but a few require a second or third insemination. The response variable is the pregnancy rate. The response, probability and odds are summarized in the following table:
Insemination   Response                    Probability           Odds
First          $Y_1 \mid m$                $\pi_1$               $\pi_1/(1 - r_1)$
Second         $Y_2 \mid m - y_1$          $\pi_2/(1 - r_1)$     $\pi_2/(1 - r_2)$
Third          $Y_3 \mid m - y_1 - y_2$    $\pi_3/(1 - r_2)$     $\pi_3/(1 - r_3)$
Then, a simple sequence of models having a constant treatment effect is as follows:
$$g(\pi_1) = \alpha_1 + \beta^t x,$$
$$g\left(\frac{\pi_2}{1 - r_1}\right) = \alpha_2 + \beta^t x,$$
$$g\left(\frac{\pi_3}{1 - r_2}\right) = \alpha_3 + \beta^t x.$$
If the logistic link function is used, we have
$$\log\left(\frac{\pi_j}{1 - r_j}\right) = \alpha_j + \beta^t x.$$
The incidental parameters $\alpha_1, \alpha_2, \ldots, \alpha_{k-1}$ make allowance for the expected decline in fertility.
3.3 The multinomial distribution
The multinomial distribution is in many ways the most natural
distribution to consider in the context of a polytomous response
variable. We introduce the properties of the multinomial distribution
in this section.
(a) Source
There are two derivations of the multinomial distribution. One is based on simple random sampling and the other is based on the conditional distribution of Poisson random variables.
1. Simple random sampling:
Suppose there are $k$ attributes $A_1, A_2, \ldots, A_k$. The attributes might be “color of hair”, “socio-economic status”, “family size”, “cause of death” and so on. If the population is effectively infinitely large and a simple random sample of size $m$ is taken, the probability that the numbers of individuals observed to have attributes $A_1, A_2, \ldots, A_k$ are $y_1, y_2, \ldots, y_k$ is
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{m!}{\prod_{j=1}^{k} y_j!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k} = \frac{m!}{y_1! \cdots y_k!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k},$$
where $\sum_{i=1}^{k} y_i = m$ and $0 \le y_i \le m$.
2. Conditional distribution of Poisson random variables:
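Let $Y_1, Y_2, \ldots, Y_k$ be independent Poisson random variables, $Y_i \sim P(\mu_i)$. Denote
$$Y_{\cdot} = \sum_{i=1}^{k} Y_i, \qquad \mu = \sum_{i=1}^{k} \mu_i, \qquad \pi_i = \frac{\mu_i}{\mu}.$$
Then, the conditional joint distribution of $Y_1, Y_2, \ldots, Y_k$ given $Y_{\cdot} = m$ is
$$P\left(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k \,\Big|\, \sum_{i=1}^{k} Y_i = m\right) = \frac{m!}{y_1!\, y_2! \cdots y_k!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k}.$$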
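The Poisson derivation can be verified numerically for a single outcome. The sketch below divides the joint Poisson probability by the Poisson probability of the total and compares the result with the multinomial probability; the means `mu` and the outcome `y` are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    return math.exp(-mu) * mu ** y / math.factorial(y)

def multinomial_pmf(y, pi):
    # Multinomial coefficient m! / (y_1! ... y_k!), computed exactly.
    coef = math.factorial(sum(y))
    for y_j in y:
        coef //= math.factorial(y_j)
    prob = float(coef)
    for y_j, p_j in zip(y, pi):
        prob *= p_j ** y_j
    return prob

mu = [1.5, 0.5, 3.0]          # hypothetical Poisson means
y = [2, 1, 4]                 # one possible outcome, total m = 7
m, mu_total = sum(y), sum(mu)

# P(Y = y | sum Y = m) = prod_i Poisson(y_i; mu_i) / Poisson(m; mu_total)
joint = math.prod(poisson_pmf(a, b) for a, b in zip(y, mu))
conditional = joint / poisson_pmf(m, mu_total)

pi = [b / mu_total for b in mu]
print(conditional, multinomial_pmf(y, pi))
```

The two printed values agree up to floating-point rounding, because the exponential terms cancel and $\mu_i^{y_i}/\mu^{m}$ reduces to $\pi_i^{y_i}$.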
(b) Moments and cumulants
The moment generating function of the multinomial distribution is
$$M_Y(t) = M_Y(t_1, t_2, \ldots, t_k) = E\left\{\exp\left(\sum_{i=1}^{k} t_i Y_i\right)\right\} = \left\{\sum_{i=1}^{k} \pi_i \exp(t_i)\right\}^{m}$$
and the cumulant generating function is
$$K_Y(t) = K_Y(t_1, t_2, \ldots, t_k) = \log M_Y(t_1, t_2, \ldots, t_k) = m \log\left\{\sum_{i=1}^{k} \pi_i \exp(t_i)\right\}.$$
Then,
$$E(Y_r) = \left.\frac{\partial K_Y(t_1, t_2, \ldots, t_k)}{\partial t_r}\right|_{t=0} = \left.\frac{m \pi_r \exp(t_r)}{\sum_{i=1}^{k} \pi_i \exp(t_i)}\right|_{t=0} = m \pi_r,$$
and for $r \ne s$,
$$\operatorname{Cov}(Y_r, Y_s) = \left.\frac{\partial^2 K_Y(t_1, t_2, \ldots, t_k)}{\partial t_r\, \partial t_s}\right|_{t=0} = \left.-\frac{m\, \pi_r \exp(t_r)\, \pi_s \exp(t_s)}{\left\{\sum_{i=1}^{k} \pi_i \exp(t_i)\right\}^2}\right|_{t=0} = -m \pi_r \pi_s,$$
and
$$\operatorname{Var}(Y_r) = \left.\frac{\partial^2 K_Y(t_1, t_2, \ldots, t_k)}{\partial t_r^2}\right|_{t=0} = \left.\left[\frac{m \pi_r \exp(t_r)}{\sum_{i=1}^{k} \pi_i \exp(t_i)} - \frac{m \pi_r^2 \exp(2 t_r)}{\left\{\sum_{i=1}^{k} \pi_i \exp(t_i)\right\}^2}\right]\right|_{t=0} = m \pi_r - m \pi_r^2 = m \pi_r (1 - \pi_r).$$
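In addition, let $Z_1 = Y_1$, $Z_2 = Y_1 + Y_2$, $\ldots$, $Z_k = Y_1 + Y_2 + \cdots + Y_k$, so that
$$Z = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_k \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{pmatrix} = LY,$$
where $L$ is a lower-triangular matrix containing unit values. Then,
$$E(Z_j) = m r_j$$
and, for $j \le l$,
$$\operatorname{Cov}(Z_j, Z_l) = m r_j (1 - r_l).$$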
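For a small index $m$ the moments of the cumulative totals can be checked by brute-force enumeration. This is a minimal sketch with hypothetical values $m = 4$ and $\pi = (0.2, 0.5, 0.3)$; it is not an efficient way to compute moments in general.

```python
import itertools
import math

def multinomial_pmf(y, pi):
    prob = float(math.factorial(sum(y)))
    for y_j, p_j in zip(y, pi):
        prob *= p_j ** y_j / math.factorial(y_j)
    return prob

m, pi = 4, [0.2, 0.5, 0.3]          # hypothetical index and probabilities
r = [0.2, 0.7, 1.0]                 # cumulative probabilities r_j

# Enumerate every outcome (y_1, y_2, y_3) with y_1 + y_2 + y_3 = m.
outcomes = [y for y in itertools.product(range(m + 1), repeat=3) if sum(y) == m]

def expect(f):
    """Expectation of f(Y) under the multinomial distribution."""
    return sum(f(y) * multinomial_pmf(y, pi) for y in outcomes)

z1 = lambda y: y[0]                 # Z_1 = Y_1
z2 = lambda y: y[0] + y[1]          # Z_2 = Y_1 + Y_2

mean_z1, mean_z2 = expect(z1), expect(z2)
cov_z1_z2 = expect(lambda y: z1(y) * z2(y)) - mean_z1 * mean_z2
print(mean_z1, mean_z2, cov_z1_z2)  # compare with m*r_1, m*r_2, m*r_1*(1 - r_2)
```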
Note:
For $j < l < t$, the conditional distribution of $Z_j$ given $Z_l = z_l$ is
$$Z_j \mid Z_l = z_l \sim B\left(z_l, \frac{r_j}{r_l}\right).$$
In addition,
$$Z_t - z_l \mid Z_l = z_l \sim B\left(m - z_l, \frac{r_t - r_l}{1 - r_l}\right).$$
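Note:
For the total score $\sum_{i=1}^{k} s_i Y_i$,
$$\mu_s = E\left(\sum_{i=1}^{k} \frac{s_i Y_i}{m}\right) = \sum_{i=1}^{k} \pi_i s_i$$
and
$$\operatorname{Var}\left(\sum_{i=1}^{k} s_i Y_i\right) = m\left(\sum_{i=1}^{k} \pi_i s_i^2 - \mu_s^2\right) = m\left\{\sum_{i=1}^{k} \pi_i s_i^2 - \left(\sum_{i=1}^{k} \pi_i s_i\right)^2\right\}.$$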
(c) Marginal and conditional distributions
The multinomial distribution has the following important properties:
1. The marginal distribution of $Y_j$ is $Y_j \sim B(m, \pi_j)$.
2. The joint marginal distribution of $(Y_1, Y_2, m - Y_1 - Y_2)$ is multinomial on 3 categories with index $m$ and parameter $(\pi_1, \pi_2, 1 - \pi_1 - \pi_2)$.
3. The conditional distribution of $(Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_k)$ given that $Y_i = y_i$ is multinomial with index $m - y_i$ and probabilities
$$\left(\frac{\pi_1}{1 - \pi_i}, \ldots, \frac{\pi_{i-1}}{1 - \pi_i}, \frac{\pi_{i+1}}{1 - \pi_i}, \ldots, \frac{\pi_k}{1 - \pi_i}\right).$$
4. The marginal distribution of $Z_j$ is $Z_j \sim B(m, r_j)$.
5. The conditional distribution of $Z_i$ given $Z_j = z_j$ is $B\left(z_j, \dfrac{r_i}{r_j}\right)$ for $i < j$.
6. The conditional distribution of $Y_{j+1}$ given $Z_j = z_j$ is $B\left(m - z_j, \dfrac{\pi_{j+1}}{1 - r_j}\right)$.
7. The multinomial distribution can be expressed as a product of $k - 1$ binomial factors
$$P(Y_1 = y_1, \ldots, Y_k = y_k) = f(y_1 \mid z_0)\, f(y_2 \mid z_1) \cdots f(y_{k-1} \mid z_{k-2}),$$
where
$$f(y_j \mid z_{j-1}) = \binom{m - z_{j-1}}{y_j} \left(\frac{\pi_j}{1 - r_{j-1}}\right)^{y_j} \left(\frac{1 - r_j}{1 - r_{j-1}}\right)^{m - z_{j-1} - y_j}$$
and $z_0 = r_0 = 0$.
8. The sequence $Z_1, \ldots, Z_k$ has the Markov property. That is,
$$P(Z_j \mid Z_{j-1} = z_{j-1}, \ldots, Z_1 = z_1) = P(Z_j \mid Z_{j-1} = z_{j-1}).$$
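The factorization into $k - 1$ binomial terms can be checked directly. The sketch below multiplies the conditional binomial probabilities of $Y_j$ given the previous cumulative total, taking $z_0 = r_0 = 0$, and compares the product with the multinomial probability; the outcome and probabilities are hypothetical.

```python
import math

def binom_pmf(y, n, p):
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)

def multinomial_pmf(y, pi):
    prob = float(math.factorial(sum(y)))
    for y_j, p_j in zip(y, pi):
        prob *= p_j ** y_j / math.factorial(y_j)
    return prob

def product_of_binomials(y, pi):
    """Multiply the k-1 conditional binomial factors f(y_j | z_{j-1})."""
    m, z_prev, r_prev, prob = sum(y), 0, 0.0, 1.0   # z_0 = r_0 = 0
    for y_j, p_j in zip(y[:-1], pi[:-1]):
        # Y_j | Z_{j-1} = z_{j-1}  ~  B(m - z_{j-1}, pi_j / (1 - r_{j-1}))
        prob *= binom_pmf(y_j, m - z_prev, p_j / (1.0 - r_prev))
        z_prev += y_j
        r_prev += p_j
    return prob

y, pi = [2, 0, 3, 1], [0.1, 0.2, 0.3, 0.4]   # hypothetical outcome and probabilities
print(multinomial_pmf(y, pi), product_of_binomials(y, pi))
```

Only $k - 1$ factors are needed because the last count is determined once the first $k - 1$ counts and the index $m$ are known.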
(d) Quadratic forms
In order to test $H_0: \pi = \pi^0 = (\pi_1^0, \pi_2^0, \ldots, \pi_k^0)$, the quadratic form (Pearson’s statistic) in the residuals,
$$X^2 = \sum_{j=1}^{k} \frac{(Y_j - m \pi_j^0)^2}{m \pi_j^0},$$
can be used. When $m$ is large, $X^2$ is approximately distributed as $\chi^2_{k-1}$. In addition, we can also use the cumulative multinomial vector:
$$\sum_{j=1}^{k-1} \frac{(Z_j - m r_j^0)^2}{m} \left(\frac{1}{\pi_j^0} + \frac{1}{\pi_{j+1}^0}\right) - 2 \sum_{j=1}^{k-2} \frac{(Z_j - m r_j^0)(Z_{j+1} - m r_{j+1}^0)}{m \pi_{j+1}^0},$$
with $r_j^0$ computed under $H_0: \pi = \pi^0 = (\pi_1^0, \pi_2^0, \ldots, \pi_k^0)$. Note that the above quadratic form is identical to $X^2$.
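That the two quadratic forms coincide can be confirmed numerically. The following sketch computes Pearson's statistic from the category counts and again from the cumulative totals; the counts and null probabilities are hypothetical.

```python
def pearson_x2(y, pi0):
    """Pearson's statistic from category counts."""
    m = sum(y)
    return sum((y_j - m * p) ** 2 / (m * p) for y_j, p in zip(y, pi0))

def pearson_x2_cumulative(y, pi0):
    """The same statistic written in terms of cumulative totals Z_j."""
    m, k = sum(y), len(y)
    # d[j] holds Z_{j+1} - m * r_{j+1} (0-based j) for the first k-1 totals.
    d, z, r = [], 0, 0.0
    for y_j, p in zip(y[:-1], pi0[:-1]):
        z += y_j
        r += p
        d.append(z - m * r)
    square = sum(d[j] ** 2 / m * (1 / pi0[j] + 1 / pi0[j + 1]) for j in range(k - 1))
    cross = sum(d[j] * d[j + 1] / (m * pi0[j + 1]) for j in range(k - 2))
    return square - 2 * cross

y = [8, 7, 23, 3, 11]                    # hypothetical counts
pi0 = [0.2, 0.1, 0.4, 0.1, 0.2]          # hypothetical null probabilities
print(pearson_x2(y, pi0), pearson_x2_cumulative(y, pi0))
```

The identity follows from writing each residual as a difference of adjacent cumulative residuals and expanding the square.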
3.4 Likelihood functions
(a) Log likelihood for multinomial responses
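Let
$$y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})^t, \qquad \sum_{j=1}^{k} y_{ij} = m_i,$$
and
$$\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ik})^t, \qquad \sum_{j=1}^{k} \pi_{ij} = 1.$$
Then, the log-likelihood function for observation $y_i$ is
$$l_i(\pi_i \mid y_i) = \sum_{j=1}^{k} y_{ij} \log \pi_{ij} = \sum_{j=1}^{k-1} y_{ij} \log \pi_{ij} + y_{ik} \log\left(1 - \sum_{j=1}^{k-1} \pi_{ij}\right).$$
Thus,
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{1 - \sum_{j=1}^{k-1} \pi_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}}.$$
Then, for the maximum likelihood estimate $\hat{\pi}_{ij}$,
$$\left.\frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}}\right|_{\pi_{ij} = \hat{\pi}_{ij}} = \left.\left(\frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}}\right)\right|_{\pi_{ij} = \hat{\pi}_{ij}} = \left.\left(\frac{y_{ij}}{\pi_{ij}} - m_i\right)\right|_{\pi_{ij} = \hat{\pi}_{ij}} = \left.\frac{y_{ij} - m_i \pi_{ij}}{\pi_{ij}}\right|_{\pi_{ij} = \hat{\pi}_{ij}},$$
since, writing $c$ for the common value of the ratios,
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}} = 0 \;\Longrightarrow\; \frac{y_{i1}}{\hat{\pi}_{i1}} = \frac{y_{i2}}{\hat{\pi}_{i2}} = \cdots = \frac{y_{ik}}{\hat{\pi}_{ik}} = c$$
$$\Longrightarrow\; y_{ij} = c\, \hat{\pi}_{ij} \;\Longrightarrow\; m_i = \sum_{j=1}^{k} y_{ij} = \sum_{j=1}^{k} c\, \hat{\pi}_{ij} = c \;\Longrightarrow\; \frac{y_{ik}}{\hat{\pi}_{ik}} = m_i.$$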
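The log-likelihood function is
$$l(\pi) = \sum_{i=1}^{n} l_i(\pi_i \mid y_i) = \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log \pi_{ij}.$$
Thus,
$$\left.\frac{\partial l(\pi)}{\partial \pi_{ij}}\right|_{\pi_{ij} = \hat{\pi}_{ij}} = \left.\frac{y_{ij} - m_i \pi_{ij}}{\pi_{ij}}\right|_{\pi_{ij} = \hat{\pi}_{ij}}.$$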
Further, introducing matrix notation,
$$\left.\frac{\partial l(\pi)}{\partial \pi_i}\right|_{\pi_{ij} = \hat{\pi}_{ij}} = \left. m_i\, \Sigma_i^{-} (y_i - \mu_i)\right|_{\pi_{ij} = \hat{\pi}_{ij}},$$
where
$$\Sigma_i^{-} = \operatorname{diag}\left\{\frac{1}{m_i \pi_{ij}}\right\} = \begin{pmatrix} (m_i \pi_{i1})^{-1} & 0 & \cdots & 0 \\ 0 & (m_i \pi_{i2})^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (m_i \pi_{ik})^{-1} \end{pmatrix}$$
and
$$\mu_i = (m_i \pi_{i1}, m_i \pi_{i2}, \ldots, m_i \pi_{ik})^t.$$
Let
$$\pi = (\pi_1^t, \pi_2^t, \ldots, \pi_n^t)^t, \qquad y = (y_1^t, y_2^t, \ldots, y_n^t)^t.$$
Then,
$$\left.\frac{\partial l(\pi)}{\partial \pi}\right|_{\pi_{ij} = \hat{\pi}_{ij}} = \left. M\, \Sigma^{-} (y - \mu)\right|_{\pi_{ij} = \hat{\pi}_{ij}},$$
where
$$\Sigma^{-} = \operatorname{diag}(\Sigma_i^{-}) = \begin{pmatrix} \Sigma_1^{-} & 0 & \cdots & 0 \\ 0 & \Sigma_2^{-} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_n^{-} \end{pmatrix} = \operatorname{diag}\left\{(m_1 \pi_{11})^{-1}, \ldots, (m_1 \pi_{1k})^{-1}, \ldots, (m_n \pi_{n1})^{-1}, \ldots, (m_n \pi_{nk})^{-1}\right\},$$
$$M = \operatorname{diag}(\underbrace{m_1, \ldots, m_1}_{k}, \ldots, \underbrace{m_n, \ldots, m_n}_{k}),$$
and
$$\mu = (\mu_1^t, \mu_2^t, \ldots, \mu_n^t)^t.$$
If we choose to work with
$$r = (r_1^t, r_2^t, \ldots, r_n^t)^t, \qquad r_i = (r_{i1}, r_{i2}, \ldots, r_{ik})^t,$$
the log-likelihood function for observation $y_i$ can be rewritten, with $r_{i0} = 0$, as
$$l_i(\pi_i \mid y_i) = \sum_{j=1}^{k} y_{ij} \log(r_{ij} - r_{i,j-1}) = \sum_{j=1}^{k-1} y_{ij} \log(r_{ij} - r_{i,j-1}) + y_{ik} \log(1 - r_{i,k-1}).$$
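Then, for $j = 1, \ldots, k-1$,
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial r_{ij}} = \frac{y_{ij}}{r_{ij} - r_{i,j-1}} - \frac{y_{i,j+1}}{r_{i,j+1} - r_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{i,j+1}}{\pi_{i,j+1}}$$
$$= \left(\frac{1}{\pi_{ij}} + \frac{1}{\pi_{i,j+1}}\right) z_{ij} - \frac{z_{i,j-1}}{\pi_{ij}} - \frac{z_{i,j+1}}{\pi_{i,j+1}}$$
$$= \left(\frac{1}{\pi_{ij}} + \frac{1}{\pi_{i,j+1}}\right)(z_{ij} - m_i r_{ij}) - \frac{z_{i,j-1} - m_i r_{i,j-1}}{\pi_{ij}} - \frac{z_{i,j+1} - m_i r_{i,j+1}}{\pi_{i,j+1}},$$
where $z_{ij} = y_{i1} + y_{i2} + \cdots + y_{ij}$. Thus, introducing matrix notation,
$$\frac{\partial l}{\partial r_i} = m_i\, \Gamma_i (z_i - m_i r_i),$$
where
$$\Gamma_i = \frac{1}{m_i} \begin{pmatrix} \pi_{i1}^{-1} + \pi_{i2}^{-1} & -\pi_{i2}^{-1} & \cdots & 0 & 0 \\ -\pi_{i2}^{-1} & \pi_{i2}^{-1} + \pi_{i3}^{-1} & \ddots & \vdots & \vdots \\ \vdots & \ddots & \ddots & -\pi_{i,k-1}^{-1} & 0 \\ 0 & \cdots & -\pi_{i,k-1}^{-1} & \pi_{i,k-1}^{-1} + \pi_{ik}^{-1} & 0 \\ 0 & \cdots & 0 & 0 & 0 \end{pmatrix}$$
is tridiagonal, with $j$th diagonal element $\pi_{ij}^{-1} + \pi_{i,j+1}^{-1}$, off-diagonal elements $-\pi_{i,j+1}^{-1}$ between rows $j$ and $j+1$, and a zero last row and column, and
$$z_i = (z_{i1}, z_{i2}, \ldots, z_{ik})^t.$$
Further,
$$\frac{\partial l}{\partial r} = M\, \Gamma (z - \mu_r),$$
where $M$ is the diagonal matrix of indices $m_i$, each repeated $k$ times,
$$\Gamma = \operatorname{diag}(\Gamma_i) = \begin{pmatrix} \Gamma_1 & 0 & \cdots & 0 \\ 0 & \Gamma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Gamma_n \end{pmatrix}, \qquad z = (z_1^t, z_2^t, \ldots, z_n^t)^t,$$
and
$$\mu_r = (m_1 r_1^t, m_2 r_2^t, \ldots, m_n r_n^t)^t.$$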
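Note:
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial r_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{i,j+1}}{\pi_{i,j+1}} = \left(\frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}}\right) - \left(\frac{y_{i,j+1}}{\pi_{i,j+1}} - \frac{y_{ik}}{\pi_{ik}}\right) = \frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}} - \frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{i,j+1}}.$$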
(b) Parameter estimation
For the model with the form
$$\log\left\{\frac{r_{ij}(x_i)}{1 - r_{ij}(x_i)}\right\} = \theta_j - \beta^t x_i,$$
we can rewrite the model as
$$\log\left(\frac{r_{ij}}{1 - r_{ij}}\right) = x_{ij}^t \beta^*,$$
where
$$x_{ij} = (0, \ldots, 0, 1, 0, \ldots, 0, -x_i^t)^t = (x_{ij1}, x_{ij2}, \ldots, x_{ij(p+k-1)})^t,$$
with the 1 in the $j$th component, and
$$\beta^* = (\theta_1, \ldots, \theta_{k-1}, \beta_1, \ldots, \beta_p)^t = (\beta_1^*, \beta_2^*, \ldots, \beta_{p+k-1}^*)^t.$$
Then, differentiation with respect to $\beta_r^*$ gives
$$\frac{\partial l}{\partial \beta_r^*} = \sum_{i=1}^{n} \sum_{j=1}^{k-1} \frac{\partial l}{\partial r_{ij}} \frac{\partial r_{ij}}{\partial \beta_r^*} = \sum_{i=1}^{n} \sum_{j=1}^{k-1} \frac{\partial l}{\partial r_{ij}}\, x_{ijr}\, r_{ij} (1 - r_{ij}),$$
since
$$r_{ij} = \frac{\exp(x_{ij}^t \beta^*)}{1 + \exp(x_{ij}^t \beta^*)}$$
and
$$\frac{\partial r_{ij}}{\partial \beta_r^*} = \frac{x_{ijr} \exp(x_{ij}^t \beta^*)}{1 + \exp(x_{ij}^t \beta^*)} - \frac{x_{ijr} \{\exp(x_{ij}^t \beta^*)\}^2}{\{1 + \exp(x_{ij}^t \beta^*)\}^2} = x_{ijr} (r_{ij} - r_{ij}^2) = x_{ijr}\, r_{ij} (1 - r_{ij}).$$
The second-order derivatives can be obtained similarly.
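The derivative of the cumulative probability with respect to the parameters can be checked against finite differences. The sketch assumes the logistic form $r_{ij} = \exp(x_{ij}^t \beta^*)/\{1 + \exp(x_{ij}^t \beta^*)\}$; the design row and parameter values are hypothetical, and this is a check of one formula, not a fitting routine.

```python
import math

def r_ij(x, beta):
    """Cumulative probability exp(x'beta) / (1 + exp(x'beta))."""
    eta = sum(a * b for a, b in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def dr_dbeta(x, beta):
    """Analytic gradient: x_ijr * r_ij * (1 - r_ij) for each component r."""
    r = r_ij(x, beta)
    return [a * r * (1.0 - r) for a in x]

# Hypothetical design row for category j = 2 of k = 4 categories with one
# covariate x_i = 1.3: indicator for theta_2, then -x_i for the slope.
x = [0.0, 1.0, 0.0, -1.3]
beta = [-0.5, 0.4, 1.2, 0.7]          # (theta_1, theta_2, theta_3, beta_1)

# Central finite differences as an independent check of the formula.
h = 1e-6
numeric = []
for idx in range(len(beta)):
    up = beta[:idx] + [beta[idx] + h] + beta[idx + 1:]
    down = beta[:idx] + [beta[idx] - h] + beta[idx + 1:]
    numeric.append((r_ij(x, up) - r_ij(x, down)) / (2 * h))

print(dr_dbeta(x, beta))
print(numeric)
```

Components of the gradient that correspond to the other cut-points are zero, because the design row has zeros in those positions.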
(c) Deviance function
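The full model is the model with a separate parameter $\pi_{ij}$ for each cell. In this case, the estimate $\tilde{\pi}_{ij}$ is
$$\tilde{\pi}_{ij} = \frac{y_{ij}}{m_i}.$$
The reduced model with a certain link function, for example,
$$r_{ij} = \frac{\exp(x_{ij}^t \beta^*)}{1 + \exp(x_{ij}^t \beta^*)},$$
has the parameter estimate $\hat{\pi}_{ij}$. Then, the deviance function is
$$D(\tilde{\pi}, \hat{\pi}) = 2 l(\tilde{\pi}) - 2 l(\hat{\pi}) = 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log \tilde{\pi}_{ij} - 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log \hat{\pi}_{ij}$$
$$= 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log\left(\frac{y_{ij}}{m_i \hat{\pi}_{ij}}\right) = 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log\left(\frac{y_{ij}}{\hat{\mu}_{ij}}\right),$$
where
$$\tilde{\pi} = (\tilde{\pi}_{11}, \ldots, \tilde{\pi}_{1k}, \ldots, \tilde{\pi}_{n1}, \ldots, \tilde{\pi}_{nk}), \qquad \hat{\pi} = (\hat{\pi}_{11}, \ldots, \hat{\pi}_{1k}, \ldots, \hat{\pi}_{n1}, \ldots, \hat{\pi}_{nk}),$$
and $\hat{\mu}_{ij} = m_i \hat{\pi}_{ij}$. Under some regularity conditions (similar to chapter 2), the deviance function has an approximate $\chi^2$ distribution.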
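The deviance formula can be sketched on the cheese data from the introduction. The reduced model used here is only an illustrative assumption: all four additives share the pooled category probabilities. It is not the fitted logistic-scale model, and terms with $y_{ij} = 0$ are taken to contribute zero.

```python
import math

def deviance(counts, fitted_probs):
    """D = 2 * sum_ij y_ij * log(y_ij / (m_i * pi_hat_ij)), with 0*log(0) = 0."""
    total = 0.0
    for y_i, p_i in zip(counts, fitted_probs):
        m_i = sum(y_i)
        for y, p in zip(y_i, p_i):
            if y > 0:
                total += y * math.log(y / (m_i * p))
    return 2.0 * total

# Cheese data, rows = additives A..D, columns = response categories I..IX.
counts = [
    [0, 0, 1, 7, 8, 8, 19, 8, 1],
    [6, 9, 12, 11, 7, 6, 1, 0, 0],
    [1, 1, 6, 8, 23, 7, 5, 1, 0],
    [0, 0, 0, 1, 3, 7, 14, 16, 11],
]

# Illustrative reduced model (an assumption): no additive effect, so every
# row uses the pooled category proportions.
grand = sum(map(sum, counts))
pooled = [sum(col) / grand for col in zip(*counts)]
d_null = deviance(counts, [pooled] * len(counts))

# The full model reproduces the data, so its deviance is zero.
saturated = [[y / sum(row) for y in row] for row in counts]
d_sat = deviance(counts, saturated)
print(d_null, d_sat)
```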
3.5 Over-dispersion
Over-dispersion for polytomous responses can occur in exactly the
same way as over-dispersion for binary responses. Under the
cluster-sampling model, the covariance matrix of the observed
response vector is the sum of the within-cluster covariance matrix
and the between-cluster covariance matrix. Provided that these two
matrices are proportional, we have
$$E(Y) = m \pi, \qquad \operatorname{Cov}(Y) = \sigma^2 \Sigma,$$
where $\Sigma$ is the usual multinomial covariance matrix. The main problem now is to estimate $\sigma^2$. The sensible estimate is
$$\tilde{\sigma}^2 = \frac{X^2}{n(k-1) - p},$$
where $X^2$ is Pearson’s statistic and $p$ is the number of unknown parameters.
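The dispersion estimate is Pearson's statistic divided by its residual degrees of freedom $n(k-1) - p$. The sketch below uses hypothetical counts and hypothetical fitted probabilities, assumed to come from some model with $p = 2$ parameters; the function names are illustrative.

```python
def pearson_x2(counts, fitted_probs):
    """Pearson's statistic summed over all n multinomial observations."""
    x2 = 0.0
    for y_i, p_i in zip(counts, fitted_probs):
        m_i = sum(y_i)
        x2 += sum((y - m_i * p) ** 2 / (m_i * p) for y, p in zip(y_i, p_i))
    return x2

def dispersion_estimate(counts, fitted_probs, n_params):
    """Estimate of sigma^2: X^2 / (n(k-1) - p)."""
    n, k = len(counts), len(counts[0])
    return pearson_x2(counts, fitted_probs) / (n * (k - 1) - n_params)

# Hypothetical data: n = 3 observations on k = 3 categories, with fitted
# probabilities assumed to come from a model with p = 2 parameters.
counts = [[10, 5, 5], [4, 8, 8], [6, 6, 8]]
fitted = [[0.4, 0.3, 0.3], [0.3, 0.35, 0.35], [0.33, 0.33, 0.34]]
print(dispersion_estimate(counts, fitted, n_params=2))
```

Values well above 1 would suggest over-dispersion relative to the multinomial model, while values near 1 are consistent with it.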