ch3

Chapter 3 Polytomous Data

3.1 Introduction

Motivating example: Cheese

Response     A    B    C    D   Total
I            0    6    1    0       7
II           0    9    1    0      10
III          1   12    6    0      19
IV           7   11    8    1      27
V            8    7   23    3      41
VI           8    6    7    7      28
VII         19    1    5   14      39
VIII         8    0    1   16      25
IX           1    0    0   11      12
Total       52   52   52   52     208

Response category: I to IX (“strong dislike” to “excellent taste”).

$y_{ij}$, $i = 1, 2, 3, 4$; $j = 1, 2, \ldots, 9$: the response frequencies for the cheese additives, where

$i = 1$: additive A, $i = 2$: additive B, $i = 3$: additive C, $i = 4$: additive D;
$j = 1$: response I, $j = 2$: response II, $j = 3$: response III, $j = 4$: response IV, $j = 5$: response V, $j = 6$: response VI, $j = 7$: response VII, $j = 8$: response VIII, $j = 9$: response IX.

For example, $y_{11} = 0$, $y_{12} = 0$, $y_{13} = 1$, $y_{14} = 7$, $y_{15} = 8$, $y_{16} = 8$, $y_{17} = 19$, $y_{18} = 8$, $y_{19} = 1$.

Also,

$\pi_{ij}$, $i = 1, 2, 3, 4$; $j = 1, 2, \ldots, 9$: the probability of response category $j$ for cheese additive $i$.

$r_{ij} = \sum_{r \le j} \pi_{ir}$: the cumulative probability corresponding to cheese additive $i$ and response category $j$. That is,

$$r_{i1} = \pi_{i1}, \quad r_{i2} = \pi_{i1} + \pi_{i2}, \quad \ldots, \quad r_{i9} = \pi_{i1} + \pi_{i2} + \cdots + \pi_{i9} = 1.$$

Then $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{i9})$, $i = 1, 2, 3, 4$, follows a multinomial distribution with parameters $m_i = 52$ and $\pi_i = (\pi_{i1}, \ldots, \pi_{i9})$.

Objective: We are concerned with the effect of various cheese additives on taste. That is, we want to evaluate the statistical significance of the differences among these cheese additives. We want to find a model that is capable of describing these differences, for example, the interrelation among the cumulative probabilities such as $r_{4j} \le r_{1j} \le r_{3j} \le r_{2j}$ (D, A, C, B from best to worst).

Definition of Polytomous Data: If the response of an individual or item in a study is restricted to one of a fixed set of possible values, we say that the response is polytomous. Examples of polytomous data include blood type (A, B, AB, O,…), food testing, measures of mental and physical well-being, and variables arising in social science research.
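The cumulative probabilities of the cheese example can be estimated directly from the table. The following is a minimal sketch, assuming NumPy is available; the variable names (`counts`, `pi_hat`, `r_hat`, `ordering_holds`) are illustrative and not part of the original notes. It computes the observed proportions $y_{ij}/m_i$, their cumulative sums, and checks whether the empirical ordering D, A, C, B holds in every category.

```python
import numpy as np

# Cheese-tasting counts from the table: rows are additives A-D,
# columns are response categories I-IX.
counts = np.array([
    [0, 0, 1, 7, 8, 8, 19, 8, 1],     # additive A
    [6, 9, 12, 11, 7, 6, 1, 0, 0],    # additive B
    [1, 1, 6, 8, 23, 7, 5, 1, 0],     # additive C
    [0, 0, 0, 1, 3, 7, 14, 16, 11],   # additive D
])

m = counts.sum(axis=1)                      # multinomial index m_i (52 each)
pi_hat = counts / m[:, None]                # observed proportions y_ij / m_i
# Cumulate the integer counts first so equal cumulative counts give equal floats.
r_hat = np.cumsum(counts, axis=1) / m[:, None]

# Empirical check of r_4j <= r_1j <= r_3j <= r_2j for every category j.
A, B, C, D = r_hat
ordering_holds = bool(np.all((D <= A) & (A <= C) & (C <= B)))

print(np.round(r_hat, 3))
print("D <= A <= C <= B in every category:", ordering_holds)
```

This is only a descriptive check on the raw proportions; it says nothing about statistical significance, which is the purpose of the models developed in the rest of the chapter.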
Note: Whatever the nature of the scale of the response, the response probabilities $\pi_1, \pi_2, \ldots, \pi_k$, with $\pi_j = P(Y = y_j)$, need to be specified.

Note: If the categories are ordered, we may prefer to work with the cumulative response probabilities

$$r_1 = \pi_1, \quad r_2 = \pi_1 + \pi_2, \quad \ldots, \quad r_k = \pi_1 + \pi_2 + \cdots + \pi_k = 1,$$

where

$$r_j = P(Y \le y_j) = \sum_{r \le j} \pi_r.$$

It makes little sense to work with a model specified in terms of $r_j$ if the response categories are not ordered.

3.2 Measurement scales and modeling

(a) General

There are two types of scales, pure scales and compound scales. A bivariate response with one component ordinal and the other continuous is an example of a compound scale. For pure scales, there are several types:

1. nominal scales: the categories are regarded as exchangeable and totally devoid of structure.
2. ordinal scales: the categories are ordered much like the ordinal numbers “first”, “second”,…. It does not make sense to talk of “distance” or “spacing” between “first” and “second”, nor to compare “spacings” between pairs of response categories.
3. interval scales: the categories are ordered and numerical labels or scores are attached. The scores are treated as category averages, medians or mid-points. Differences between scores are therefore interpreted as a measure of separation of the categories.

Note: In applications, the distinction between nominal and ordinal scales is usually, but not always, clear. For example, hair color and eye color can be ordered to a large extent on the grey-scale from light to dark and are therefore ordinal. However, unless there is a clear connection with the electromagnetic spectrum or a grey-scale, colors are best regarded as nominal.

(b) Models for ordinal scales

Ordinal scales occur more frequently in applications than the other types. The applications include food testing (bad, good, excellent,…), classification of radiographs, determination of physical or mental well-being, ….
Note: It is essential that the same conclusions can be reached even when the number or choice of response categories is changed. As a consequence, if a new category is formed by combining adjacent categories of the old scale, the form of the conclusions should be unaffected. This is an important non-mathematical point that is difficult to make mathematically rigorous. It leads fairly directly to models based on the cumulative probabilities $r_j$ rather than the category probabilities $\pi_j$.

Commonly used models: There are two commonly used models that are found to work well in practice. They are

1. logistic scale: It is the simplest model. The form is

$$\log\left(\frac{r_j(x)}{1 - r_j(x)}\right) = \theta_j - \beta^T x.$$

This model is also known as the proportional-odds model since the ratio of the odds is

$$\frac{r_j(x_1)/[1 - r_j(x_1)]}{r_j(x_2)/[1 - r_j(x_2)]} = \exp\{-\beta^T (x_1 - x_2)\},$$

which is independent of the choice of category ($j$). In addition, if

$$X = \begin{cases} 1, & \text{treatment group} \\ 0, & \text{control group} \end{cases}$$

then

$$\frac{r_j(1)/[1 - r_j(1)]}{r_j(0)/[1 - r_j(0)]} = e^{-\beta}.$$

2. complementary log-log scale: The form is

$$\log[-\log\{1 - r_j(x)\}] = \theta_j - \beta^T x.$$

Note: The model based on the logistic scale may be derived from the notion of a tolerance distribution or an underlying unobserved continuous random variable $Z$ with

$$Z - \beta^T x = \varepsilon,$$

where $\varepsilon$ has a logistic distribution. If the unobserved variable lies in the interval $\theta_{j-1} < Z \le \theta_j$, then $Y = y_j$ is recorded. That is,

$$r_j(x) = P(Y \le y_j) = P(Z \le \theta_j) = P(Z - \beta^T x \le \theta_j - \beta^T x) = P(\varepsilon \le \theta_j - \beta^T x) = \frac{\exp(\theta_j - \beta^T x)}{1 + \exp(\theta_j - \beta^T x)}$$

$$\Longleftrightarrow \quad \log\left(\frac{r_j(x)}{1 - r_j(x)}\right) = \theta_j - \beta^T x.$$

Note: It is sometimes claimed that the models based on the logistic scale, the complementary log-log scale and related models are appropriate only if there exists a latent variable $Z$. This claim seems to be too strong and, in any case, the existence of $Z$ is usually unverifiable in practice.
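The proportional-odds property can be checked numerically. The sketch below assumes a single binary covariate and uses hypothetical cut points `theta` and a hypothetical coefficient `beta`; none of these values come from the notes. It evaluates $r_j(x)$ from $\operatorname{logit} r_j(x) = \theta_j - \beta x$ and confirms that the odds ratio between treatment and control equals $e^{-\beta}$ for every $j$.

```python
import numpy as np

def cumulative_probs(theta, beta, x):
    """Return r_j(x), j = 1..k-1, under logit r_j(x) = theta_j - beta * x."""
    eta = np.asarray(theta, dtype=float) - beta * x
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical parameter values chosen only for illustration.
theta = np.array([-1.0, 0.0, 1.5])   # increasing cut points, k = 4 categories
beta = 0.8

r_treat = cumulative_probs(theta, beta, 1.0)   # X = 1, treatment group
r_ctrl = cumulative_probs(theta, beta, 0.0)    # X = 0, control group

# Odds ratio of the cumulative odds; the same for every category j.
odds_ratio = (r_treat / (1 - r_treat)) / (r_ctrl / (1 - r_ctrl))

# Category probabilities are recovered by differencing r_0 = 0, r_1, ..., r_k = 1.
pi_treat = np.diff(np.concatenate(([0.0], r_treat, [1.0])))

print(odds_ratio, np.exp(-beta))
print(pi_treat)
```

With $\beta > 0$ the cumulative probabilities in the treatment group are smaller than in the control group, so the treatment shifts the response toward the higher categories.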
Note: The model

$$\frac{Z - \beta^T x}{\exp(\tau^T x)} = \varepsilon,$$

where $\varepsilon$ has a logistic distribution, is worthy of serious consideration. The model leads to

$$\log\left(\frac{r_j(x)}{1 - r_j(x)}\right) = \frac{\theta_j - \beta^T x}{\exp(\tau^T x)},$$

where $\beta^T x$ plays the role of linear predictor for the mean and, in the denominator, $\tau^T x$ plays the role of linear predictor for the dispersion or variance. If

$$X = \begin{cases} 1, & \text{treatment group} \\ 0, & \text{control group} \end{cases}$$

then

$$\frac{r_j(1)/[1 - r_j(1)]}{r_j(0)/[1 - r_j(0)]} = \exp\left(\frac{\theta_j - \beta}{\exp(\tau)} - \theta_j\right) = \exp\left(\frac{\theta_j - \beta}{\sigma} - \theta_j\right) = \exp\left(-\frac{\beta}{\sigma}\right) \exp\left[\left(\frac{1}{\sigma} - 1\right)\theta_j\right],$$

where $\sigma = \exp(\tau)$. If $\sigma < 1$, the odds ratio is increasing in $j$, and it is decreasing otherwise. This model is useful for testing the proportional-odds assumption ($\tau = 0$) against the alternative that the odds ratio is systematically increasing or systematically decreasing in $j$.

Note: Models in which the $k-1$ regression lines are not parallel can be specified by

$$\log\left(\frac{r_j(x)}{1 - r_j(x)}\right) = \theta_j - \beta_j^T x.$$

(c) Models for interval scales

Interval scales are distinguished by the following properties:

1. The categories are of interest in themselves and are not chosen arbitrarily.
2. It does not normally make sense to form a new category by amalgamating adjacent categories.
3. Attached to the $j$'th category is a cardinal number or score, $s_j$, such that the difference between scores is a measure of distance between or separation of categories.

Note: Genuine interval scales having these 3 properties are rare in practice because, although properties 1 and 2 may be satisfied, it is rare to find a response scale having well determined cardinal scores attached to the categories.

There are 3 options for model construction.

1.

$$\log\left(\frac{r_j(x)}{1 - r_j(x)}\right) = \zeta_0 + \zeta_1 \frac{s_j + s_{j+1}}{2} - \beta^T x - \xi^T x\,(c_j - \bar{c}),$$

where

$$c_j = \frac{s_j + s_{j+1}}{2} \quad \text{or} \quad c_j = \operatorname{logit}\left(\frac{s_j + s_{j+1}}{2}\right).$$

2. The probability $\pi_j$ can also be used.
The model is

$$\pi_j(x_i) = \frac{\exp\{\eta_j(x_i)\}}{\sum_{l=1}^{k} \exp\{\eta_l(x_i)\}},$$

where

$$\eta_j(x_i) = \eta_j + (\beta^T x_i)\, s_j + \alpha_i.$$

Note: The relative odds for category $j$ over category $k$ in the above model are

$$\frac{\pi_j(x)}{\pi_k(x)} = \exp\{\eta_j - \eta_k + \beta^T x\,(s_j - s_k)\}.$$

Thus, for a scalar covariate, the relative odds are increased multiplicatively by the factor $\exp\{\beta (s_j - s_k)\}$ per unit increase in $x$.

3.

$$\sum_{j=1}^{k} \pi_j(x_i)\, s_j = \beta^T x_i.$$

In this model, instead of regarding $y$ as the response and the score $s_j$ as a contrast of special interest, we may regard the observed score as the response and $y$ as the set of observed multiplicities or weights. The quantity $\sum_{j=1}^{k} \pi_j(x_i) s_j$ is the expected score. The estimate of the expected score is

$$\bar{S}_i = \frac{\sum_{j=1}^{k} s_j y_{ij}}{m_i}.$$

If there are only two treatment groups, with observed counts $\{y_{1j}, y_{2j}\}$, we may use the standardized difference as test statistic

$$T = \frac{\bar{S}_1 - \bar{S}_2}{\left\{\left[\sum_{j=1}^{k} \tilde{\pi}_j s_j^2 - \left(\sum_{j=1}^{k} \tilde{\pi}_j s_j\right)^2\right]\left(\frac{1}{m_1} + \frac{1}{m_2}\right)\right\}^{1/2}},$$

where

$$\tilde{\pi}_j = \frac{y_{1j} + y_{2j}}{m_1 + m_2}.$$

(d) Models for nominal scales

The probability $\pi_j$ can be used. The model is

$$\pi_j(x_i) = \frac{\exp\{\eta_j(x_i)\}}{\sum_{l=1}^{k} \exp\{\eta_l(x_i)\}},$$

where

$$\eta_j(x_i) = \eta_j(x_0) + (x_i - x_0)^T \beta_j + \alpha_i.$$

Note: The relative odds for category $j$ over category $k$ in the above model are

$$\frac{\pi_j(x)}{\pi_k(x)} = \frac{\pi_j(x_0)}{\pi_k(x_0)} \exp\{(x - x_0)^T (\beta_j - \beta_k)\}.$$

Thus, for a scalar covariate, the relative odds start from the baseline value $\pi_j(x_0)/\pi_k(x_0)$ at $x = x_0$ and are increased multiplicatively by the factor $\exp(\beta_j - \beta_k)$ per unit increase in $x$.

(e) Models for nested or hierarchical scales

Example:

Objective: we want to test the hypothesis that a winter diet containing a high proportion of red clover has the effect of reducing the fertility of milch cows. To test the hypothesis, 80 cows were assigned at random to one of the two diets. Most cows become pregnant at the first insemination, but a few require a second or third insemination. The response variable is the pregnancy rate.
The response, probability and odds are summarized in the following table:

Insemination   Response                  Probability            Odds
First          $Y_1 \mid m$              $\pi_1$                $\pi_1/(1 - r_1)$
Second         $Y_2 \mid m - y_1$        $\pi_2/(1 - r_1)$      $\pi_2/(1 - r_2)$
Third          $Y_3 \mid m - y_1 - y_2$  $\pi_3/(1 - r_2)$      $\pi_3/(1 - r_3)$

Then, a simple sequence of models having a constant treatment effect is as follows:

$$g(\pi_1) = \alpha_1 - \beta^T x, \qquad g\left(\frac{\pi_2}{1 - r_1}\right) = \alpha_2 - \beta^T x, \qquad g\left(\frac{\pi_3}{1 - r_2}\right) = \alpha_3 - \beta^T x.$$

If the logistic link function is used, we have

$$\log\left(\frac{\pi_j}{1 - r_j}\right) = \alpha_j - \beta^T x.$$

The incidental parameters $\alpha_1, \alpha_2, \ldots, \alpha_{k-1}$ make allowance for the expected decline in fertility.

3.3 The multinomial distribution

The multinomial distribution is in many ways the most natural distribution to consider in the context of a polytomous response variable. We introduce the properties of the multinomial distribution in this section.

(a) Source

There are two derivations of the multinomial distribution. One is based on simple random sampling and the other is based on the conditional distribution of Poisson random variables.

1. Simple random sampling: Suppose there are $k$ attributes $A_1, A_2, \ldots, A_k$. The attributes might be “color of hair”, “socio-economic status”, “family size”, “cause of death” and so on. If the population is effectively infinitely large and a simple random sample of size $m$ is taken, the probability that $y_1, y_2, \ldots, y_k$ individuals are observed to have attributes $A_1, A_2, \ldots, A_k$ is

$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{m!}{\prod_{j=1}^{k} y_j!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k} = \frac{m!}{y_1! \cdots y_k!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k},$$

where $\sum_{i=1}^{k} y_i = m$ and $0 \le y_i \le m$.

2. Conditional distribution of Poisson random variables: Let $Y_1, Y_2, \ldots, Y_k$ be independent with $Y_i \sim P(\mu_i)$. Denote

$$Y_\cdot = \sum_{i=1}^{k} Y_i, \qquad \mu_\cdot = \sum_{i=1}^{k} \mu_i, \qquad \pi_i = \frac{\mu_i}{\mu_\cdot}.$$

Then, the conditional joint distribution of $Y_1, Y_2, \ldots, Y_k$ given $Y_\cdot = m$ is

$$P\left(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k \,\Big|\, \sum_{i=1}^{k} Y_i = m\right) = \frac{m!}{y_1!\, y_2! \cdots y_k!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k}.$$
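The second derivation can be verified numerically for a single outcome. The sketch below uses only the Python standard library; the Poisson means `mu` and the outcome `y` are hypothetical values chosen for illustration. It compares the conditional probability of independent Poisson counts given their total with the multinomial probability computed from $\pi_i = \mu_i / \mu_\cdot$.

```python
from math import exp, factorial, isclose, prod

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson variable with mean mu."""
    return exp(-mu) * mu ** y / factorial(y)

def multinomial_pmf(y, pi):
    """Multinomial probability of counts y with index sum(y) and probabilities pi."""
    coef = factorial(sum(y)) // prod(factorial(v) for v in y)
    return coef * prod(p ** v for p, v in zip(pi, y))

# Hypothetical Poisson means and one observed outcome.
mu = [2.0, 3.5, 1.5]
y = [1, 3, 2]

mu_total = sum(mu)
pi = [u / mu_total for u in mu]

# Joint probability of the independent Poisson counts, divided by the
# probability of the observed total (itself Poisson with mean mu_total).
joint = prod(poisson_pmf(v, u) for v, u in zip(y, mu))
conditional = joint / poisson_pmf(sum(y), mu_total)

print(conditional, multinomial_pmf(y, pi))
```

The two printed values agree up to floating-point rounding, because the factors $e^{-\mu_i}$ and $\mu_\cdot^m$ cancel exactly in the ratio.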
(b) Moments and cumulants

The moment generating function of the multinomial distribution is

$$M_Y(t) = M_Y(t_1, t_2, \ldots, t_k) = E\left[\exp\left(\sum_{i=1}^{k} t_i Y_i\right)\right] = \left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right)^m$$

and the cumulant generating function is

$$K_Y(t) = K_Y(t_1, t_2, \ldots, t_k) = \log M_Y(t_1, t_2, \ldots, t_k) = m \log\left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right).$$

Then,

$$E(Y_r) = \left.\frac{\partial K_Y(t)}{\partial t_r}\right|_{t=0} = \left.\frac{m \pi_r \exp(t_r)}{\sum_{i=1}^{k} \pi_i \exp(t_i)}\right|_{t=0} = m\pi_r$$

and, for $r \ne s$,

$$\operatorname{Cov}(Y_r, Y_s) = \left.\frac{\partial^2 K_Y(t)}{\partial t_r\, \partial t_s}\right|_{t=0} = \left.\frac{-m\, \pi_r \exp(t_r)\, \pi_s \exp(t_s)}{\left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right)^2}\right|_{t=0} = -m \pi_r \pi_s$$

and

$$\operatorname{Var}(Y_r) = \left.\frac{\partial^2 K_Y(t)}{\partial t_r^2}\right|_{t=0} = \left[\frac{m \pi_r \exp(t_r)}{\sum_{i=1}^{k} \pi_i \exp(t_i)} - \frac{m \pi_r^2 \exp(2 t_r)}{\left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right)^2}\right]_{t=0} = m\pi_r - m\pi_r^2 = m\pi_r(1 - \pi_r).$$

In addition, let

$$Z_1 = Y_1, \quad Z_2 = Y_1 + Y_2, \quad \ldots, \quad Z_k = Y_1 + Y_2 + \cdots + Y_k,$$

$$Z = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_k \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{pmatrix} = LY,$$

where $L$ is a lower-triangular matrix containing unit values. Then, $E(Z_j) = m r_j$ and, for $j \le l$,

$$\operatorname{Cov}(Z_j, Z_l) = m r_j (1 - r_l).$$

Note: For $j < l < t$, the conditional distribution of $Z_j$ given $Z_l = z_l$ is

$$Z_j \mid Z_l = z_l \sim B\left(z_l, \frac{r_j}{r_l}\right).$$

In addition,

$$Z_t - z_l \mid Z_l = z_l \sim B\left(m - z_l, \frac{r_t - r_l}{1 - r_l}\right).$$

Note: For scores $s_1, \ldots, s_k$, the mean score $\sum_{i=1}^{k} s_i Y_i / m$ has expectation

$$\mu_s = E\left(\sum_{i=1}^{k} s_i \frac{Y_i}{m}\right) = \sum_{i=1}^{k} \pi_i s_i$$

and

$$\operatorname{Var}\left(\sum_{i=1}^{k} s_i Y_i\right) = m \sum_{i=1}^{k} \pi_i (s_i - \mu_s)^2 = m\left[\sum_{i=1}^{k} \pi_i s_i^2 - \left(\sum_{i=1}^{k} \pi_i s_i\right)^2\right].$$

(c) Marginal and conditional distributions

The multinomial distribution has the following important properties:

1. The marginal distribution of $Y_j$ is $Y_j \sim B(m, \pi_j)$.

2. The joint marginal distribution of $(Y_1, Y_2, m - Y_1 - Y_2)$ is multinomial on 3 categories with index $m$ and parameter $(\pi_1, \pi_2, 1 - \pi_1 - \pi_2)$.

3. The conditional distribution of $(Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_k)$ given that $Y_i = y_i$ is multinomial with index $m - y_i$ and probabilities

$$\left(\frac{\pi_1}{1 - \pi_i}, \ldots, \frac{\pi_{i-1}}{1 - \pi_i}, \frac{\pi_{i+1}}{1 - \pi_i}, \ldots, \frac{\pi_k}{1 - \pi_i}\right).$$

4. The marginal distribution of $Z_j$ is $Z_j \sim B(m, r_j)$.

5. The conditional distribution of $Z_i$ given $Z_j = z_j$ is $B\left(z_j, r_i / r_j\right)$ for $i < j$.

6. The conditional distribution of $Y_{j+1}$ given $Z_j = z_j$ is $B\left(m - z_j, \dfrac{\pi_{j+1}}{1 - r_j}\right)$.

7. The multinomial distribution can be expressed as a product of $k-1$ binomial factors

$$P(Y_1 = y_1, \ldots, Y_k = y_k) = f(y_1 \mid z_0)\, f(y_2 \mid z_1) \cdots f(y_{k-1} \mid z_{k-2}),$$

where

$$f(y_j \mid z_{j-1}) = \binom{m - z_{j-1}}{y_j} \left(\frac{\pi_j}{1 - r_{j-1}}\right)^{y_j} \left(\frac{1 - r_j}{1 - r_{j-1}}\right)^{m - z_{j-1} - y_j}$$

and $z_0 = r_0 = 0$.

8. The sequence $Z_1, \ldots, Z_k$ has the Markov property. That is,

$$P(Z_j \mid Z_{j-1} = z_{j-1}, \ldots, Z_1 = z_1) = P(Z_j \mid Z_{j-1} = z_{j-1}).$$

(d) Quadratic forms

In order to test $H_0: \pi = \pi^0 = (\pi_1^0, \pi_2^0, \ldots, \pi_k^0)$, the quadratic form (Pearson's statistic) in the residuals,

$$X^2 = \sum_{j=1}^{k} \frac{(Y_j - m\pi_j^0)^2}{m\pi_j^0},$$

can be used. When $m$ is large, $X^2$ is approximately distributed as $\chi^2_{k-1}$. In addition, we can also use the cumulative multinomial vector and form

$$\frac{1}{m}\sum_{j=1}^{k-1} (Z_j - m r_j^0)^2 \left(\frac{1}{\pi_j^0} + \frac{1}{\pi_{j+1}^0}\right) - \frac{2}{m}\sum_{j=1}^{k-2} \frac{(Z_j - m r_j^0)(Z_{j+1} - m r_{j+1}^0)}{\pi_{j+1}^0},$$

with $r_j^0$ computed under $H_0: \pi = \pi^0 = (\pi_1^0, \pi_2^0, \ldots, \pi_k^0)$. Note that the above quadratic form is identical to $X^2$.

3.4 Likelihood functions

(a) Log likelihood for multinomial responses

Let

$$y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})^t, \qquad \sum_{j=1}^{k} y_{ij} = m_i,$$

and

$$\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ik})^t, \qquad \sum_{j=1}^{k} \pi_{ij} = 1.$$

Then, the log-likelihood function for observation $y_i$ is

$$l_i(\pi_i \mid y_i) = \sum_{j=1}^{k} y_{ij} \log(\pi_{ij}) = \sum_{j=1}^{k-1} y_{ij} \log(\pi_{ij}) + y_{ik} \log\left(1 - \sum_{j=1}^{k-1} \pi_{ij}\right).$$

Thus, for $j = 1, \ldots, k-1$,

$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{1 - \sum_{l=1}^{k-1} \pi_{il}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}}.$$
j 1 Then, for the maximum likelihood estimate ˆ ij ,  li  i | yi   yij yik   yij   yij  mi ij      m         i   ij  ij îj  ij  ik  ij îj  ij  ij îj   ij  ij îj since li  i | yi  yij yik y y y    0  i1  i 2    ik  k  ij  ij  ik î1 ˆ i 2 ˆ ik k k j 1 j 1  yij  kˆ ij  mi   yij   kˆ ij  k  18 yik  mi ˆ ik The log-likelihood function is l   li  i | yi    yij log  ij  . n n i 1 k i 1 j 1 Thus,  l   yij  mi ij          ij   ij îj    ij îj ij Further, introducing matrix notation,  l       m  y  m   m  i i i i i  ij ˆ ij i i  yi   i   ij ˆ ij     i  ij ˆ ij     , where 1  mi i1   1   0    i  diag      mi ij    0  and Let  0 1 mi i 2    0      0     1 mi ik  0 i  mi i1 mi i 2  mi ik t .    1t  2t   nt  , y  y1t t y2t  Then,    l    y     ij îj ,  M        ij ˆ ij where 19 ynt . t 1 0  0  2      diag  i      0 0   m1 11 1   0     0     0   0     0  0  0 0     0    0  0    0       0  0    0 0   0 0    0    0   0 m1 12  1 m1 0    0 M    0 0   0  and   1t 0  0      n    0 m1  0  0 0  0   0  0    m1    0  0    0 mn n1 1 0 0  1 mn n1    0  0  0    0    mn  0    0  0 0 0  0  0 mn  0    0  0       0      0  0      mn   2t   nt  . t If we choose to work with r  r1t m1 1k 1   0     0     0   0    mn nk 1  0 r2t   rnt , ri  ri1 t ri 2  rik  t , the log-likelihood function for observation y i can be rewritten as k k 1 j 1 j 1 li  i | yi    yij log rij  rij1    yij log rij  rij1   yik log 1  rik1 . 
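Then, for $j = 1, \ldots, k-1$,

$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial r_{ij}} = \frac{y_{ij}}{r_{ij} - r_{i(j-1)}} - \frac{y_{i(j+1)}}{r_{i(j+1)} - r_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{i(j+1)}}{\pi_{i(j+1)}}$$

$$= z_{ij}\left(\frac{1}{\pi_{ij}} + \frac{1}{\pi_{i(j+1)}}\right) - \frac{z_{i(j-1)}}{\pi_{ij}} - \frac{z_{i(j+1)}}{\pi_{i(j+1)}}$$

$$= (z_{ij} - m_i r_{ij})\left(\frac{1}{\pi_{ij}} + \frac{1}{\pi_{i(j+1)}}\right) - \frac{z_{i(j-1)} - m_i r_{i(j-1)}}{\pi_{ij}} - \frac{z_{i(j+1)} - m_i r_{i(j+1)}}{\pi_{i(j+1)}},$$

where $z_{ij} = y_{i1} + y_{i2} + \cdots + y_{ij}$ and $z_{i0} = 0$. The last step holds because the added terms in $m_i r_{ij}$ cancel: $-(r_{ij} - r_{i(j-1)})/\pi_{ij} + (r_{i(j+1)} - r_{ij})/\pi_{i(j+1)} = -1 + 1 = 0$.

Thus, introducing matrix notation,

$$\frac{\partial l}{\partial r_i} = m_i \Gamma_i^- (z_i - m_i r_i),$$

where $\Gamma_i^-$ is the $k \times k$ symmetric tridiagonal matrix

$$\Gamma_i^- = \frac{1}{m_i}\begin{pmatrix}
\pi_{i1}^{-1} + \pi_{i2}^{-1} & -\pi_{i2}^{-1} & 0 & \cdots & 0 & 0 \\
-\pi_{i2}^{-1} & \pi_{i2}^{-1} + \pi_{i3}^{-1} & -\pi_{i3}^{-1} & \cdots & 0 & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & -\pi_{i(k-1)}^{-1} & \pi_{i(k-1)}^{-1} + \pi_{ik}^{-1} & 0 \\
0 & \cdots & 0 & 0 & 0 & 0
\end{pmatrix}$$

and

$$z_i = (z_{i1}, z_{i2}, \ldots, z_{ik})^t.$$

The last row and column of $\Gamma_i^-$ are zero because $z_{ik} - m_i r_{ik} = 0$.

Further,

$$\frac{\partial l}{\partial r} = M \Gamma^- (z - \mu_r),$$

where

$$\Gamma^- = \operatorname{diag}(\Gamma_1^-, \Gamma_2^-, \ldots, \Gamma_n^-), \qquad z = (z_1^t, z_2^t, \ldots, z_n^t)^t,$$

and

$$\mu_r = (m_1 r_1^t, m_2 r_2^t, \ldots, m_n r_n^t)^t.$$

Note:

$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial r_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{i(j+1)}}{\pi_{i(j+1)}} = \left(\frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}}\right) - \left(\frac{y_{i(j+1)}}{\pi_{i(j+1)}} - \frac{y_{ik}}{\pi_{ik}}\right) = \frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}} - \frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{i(j+1)}}.$$

(b) Parameter estimation

For the model with the form

$$\log\left(\frac{r_{ij}(x_i)}{1 - r_{ij}(x_i)}\right) = \theta_j - \beta^T x_i,$$

we can rewrite the model as

$$\log\left(\frac{r_{ij}}{1 - r_{ij}}\right) = x_{ij}^T \beta^*,$$

where

$$x_{ij} = (0, \ldots, 0, 1, 0, \ldots, 0, -x_i^T)^T = (x_{ij1}, x_{ij2}, \ldots, x_{ij(p+k-1)})^T,$$

with the 1 in the $j$'th component, and

$$\beta^* = (\theta_1, \ldots, \theta_{k-1}, \beta_1, \ldots, \beta_p)^t = (\beta_1^*, \beta_2^*, \ldots, \beta_{p+k-1}^*)^t.$$

Then, differentiation with respect to $\beta_r^*$ gives

$$\frac{\partial l}{\partial \beta_r^*} = \sum_{i=1}^{n} \sum_{j=1}^{k-1} \frac{\partial l}{\partial r_{ij}} \frac{\partial r_{ij}}{\partial \beta_r^*} = \sum_{i=1}^{n} \sum_{j=1}^{k-1} x_{ijr}\, r_{ij} (1 - r_{ij}) \frac{\partial l}{\partial r_{ij}},$$

since

$$r_{ij} = \frac{\exp(x_{ij}^T \beta^*)}{1 + \exp(x_{ij}^T \beta^*)}$$

and

$$\frac{\partial r_{ij}}{\partial \beta_r^*} = \frac{x_{ijr} \exp(x_{ij}^T \beta^*)}{1 + \exp(x_{ij}^T \beta^*)} - \frac{x_{ijr} \exp(x_{ij}^T \beta^*)^2}{\{1 + \exp(x_{ij}^T \beta^*)\}^2} = x_{ijr}\, r_{ij} - x_{ijr}\, r_{ij}^2 = x_{ijr}\, r_{ij} (1 - r_{ij}).$$

The second-order derivatives can be obtained similarly.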
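The chain-rule expression for the score can be checked against a numerical gradient. The sketch below assumes a single scalar covariate and a small hypothetical data set; the counts `y`, covariate `x`, parameter values, and function names are all illustrative, not taken from the notes. The analytic score uses $\partial l_i / \partial r_{ij} = y_{ij}/\pi_{ij} - y_{i(j+1)}/\pi_{i(j+1)}$ together with $\partial r_{ij} / \partial \beta_r^* = x_{ijr}\, r_{ij}(1 - r_{ij})$.

```python
import numpy as np

def category_probs(params, x, k):
    """Return (r, pi): cumulative r_ij for j = 1..k-1 and category probabilities."""
    theta, beta = params[:k - 1], params[k - 1]
    eta = theta[None, :] - beta * x[:, None]          # theta_j - beta * x_i
    r = 1.0 / (1.0 + np.exp(-eta))
    n = len(x)
    r_full = np.hstack([np.zeros((n, 1)), r, np.ones((n, 1))])
    return r, np.diff(r_full, axis=1)                 # pi_ij = r_ij - r_i(j-1)

def loglik(params, y, x):
    _, pi = category_probs(params, x, y.shape[1])
    return float(np.sum(y * np.log(pi)))

def score(params, y, x):
    """Analytic gradient with respect to (theta_1..theta_{k-1}, beta)."""
    r, pi = category_probs(params, x, y.shape[1])
    ratio = y / pi
    dl_dr = ratio[:, :-1] - ratio[:, 1:]              # dl_i / dr_ij
    w = dl_dr * r * (1.0 - r)                         # times dr_ij / d(eta_ij)
    grad_theta = w.sum(axis=0)                        # x_ijr = 1 for theta_j
    grad_beta = -(w * x[:, None]).sum()               # x_ijr = -x_i for beta
    return np.concatenate([grad_theta, [grad_beta]])

def numeric_gradient(f, params, h=1e-6):
    """Central finite differences, used only as a check."""
    grad = np.zeros_like(params)
    for a in range(len(params)):
        step = np.zeros_like(params)
        step[a] = h
        grad[a] = (f(params + step) - f(params - step)) / (2 * h)
    return grad

# Hypothetical data: two groups (x = 0, 1), k = 4 ordered categories.
y = np.array([[10.0, 20.0, 15.0, 5.0], [5.0, 15.0, 20.0, 10.0]])
x = np.array([0.0, 1.0])
params = np.array([-1.0, 0.5, 2.0, 0.4])              # theta_1..theta_3, beta

analytic = score(params, y, x)
numeric = numeric_gradient(lambda p: loglik(p, y, x), params)
print(analytic)
print(numeric)
```

In practice the score would be passed to an iterative scheme such as Newton-Raphson or Fisher scoring; that step is omitted here, and the cut points must remain increasing for all $\pi_{ij}$ to stay positive.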
(c) Deviance function

The full model is the model with different parameters $\pi_{ij}$. In this case, the estimate $\tilde{\pi}_{ij}$ is

$$\tilde{\pi}_{ij} = \frac{y_{ij}}{m_i}.$$

The reduced model, specified through a particular link function, for example

$$r_{ij} = \frac{\exp(x_{ij}^T \beta^*)}{1 + \exp(x_{ij}^T \beta^*)},$$

has the parameter estimate $\hat{\pi}_{ij}$. Then, the deviance function is

$$D(\tilde{\pi}, \hat{\pi}) = 2 l(\tilde{\pi}) - 2 l(\hat{\pi}) = 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\tilde{\pi}_{ij}) - 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\hat{\pi}_{ij})$$

$$= 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log\left(\frac{y_{ij}}{m_i \hat{\pi}_{ij}}\right) = 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log\left(\frac{y_{ij}}{\hat{\mu}_{ij}}\right),$$

where

$$\tilde{\pi} = (\tilde{\pi}_{11}, \ldots, \tilde{\pi}_{1k}, \ldots, \tilde{\pi}_{n1}, \ldots, \tilde{\pi}_{nk}), \qquad \hat{\pi} = (\hat{\pi}_{11}, \ldots, \hat{\pi}_{1k}, \ldots, \hat{\pi}_{n1}, \ldots, \hat{\pi}_{nk}),$$

and $\hat{\mu}_{ij} = m_i \hat{\pi}_{ij}$. Under some regularity conditions (similar to those in chapter 2), the deviance function has an approximate $\chi^2$ distribution.

3.5 Over-dispersion

Over-dispersion for polytomous responses can occur in exactly the same way as over-dispersion for binary responses. Under the cluster-sampling model, the covariance matrix of the observed response vector is the sum of the within-cluster covariance matrix and the between-cluster covariance matrix. Provided that these two matrices are proportional, we have

$$E(Y) = m\pi, \qquad \operatorname{Cov}(Y) = \sigma^2 \Sigma,$$

where $\Sigma$ is the usual multinomial covariance matrix. The main problem now is to estimate $\sigma^2$. The sensible estimate is

$$\tilde{\sigma}^2 = \frac{X^2}{n(k-1) - p},$$

where $X^2$ is Pearson's statistic and $p$ is the number of unknown parameters.
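The deviance, Pearson's statistic and the dispersion estimate are simple to compute once fitted values $\hat{\mu}_{ij} = m_i \hat{\pi}_{ij}$ are available. The sketch below is illustrative only: the counts are hypothetical, and as a stand-in for a fitted regression model it assumes the homogeneity model in which all groups share one probability vector, so that $p = k - 1$. Terms with $y_{ij} = 0$ are taken to contribute zero to the deviance.

```python
import numpy as np

def pearson_x2(y, mu):
    """Pearson's statistic: sum of (y - mu)^2 / mu over all cells."""
    return float(np.sum((y - mu) ** 2 / mu))

def deviance(y, mu):
    """D = 2 * sum y log(y / mu); cells with y = 0 contribute 0."""
    pos = y > 0
    return float(2.0 * np.sum(y[pos] * np.log(y[pos] / mu[pos])))

def dispersion_estimate(y, mu, p):
    """sigma^2 estimate X^2 / (n (k - 1) - p) for an n x k table of counts."""
    n, k = y.shape
    return pearson_x2(y, mu) / (n * (k - 1) - p)

# Hypothetical counts: n = 2 groups, k = 3 categories.
y = np.array([[10.0, 20.0, 20.0], [20.0, 20.0, 10.0]])
m = y.sum(axis=1)

# Assumed reduced model: common category probabilities for all groups.
pi_hat = y.sum(axis=0) / y.sum()
mu_hat = m[:, None] * pi_hat[None, :]
p = y.shape[1] - 1                      # k - 1 free probabilities

x2 = pearson_x2(y, mu_hat)
dev = deviance(y, mu_hat)
sigma2 = dispersion_estimate(y, mu_hat, p)
print(x2, dev, sigma2)
```

For a fitted proportional-odds model, `mu_hat` would instead come from the estimated $\hat{\pi}_{ij}$ and `p` would be the number of cut points plus the number of regression coefficients.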
