Chapter 3 Polytomous Data

3.1 Introduction

Motivating example: Cheese

Response    A    B    C    D   Total
I           0    6    1    0      7
II          0    9    1    0     10
III         1   12    6    0     19
IV          7   11    8    1     27
V           8    7   23    3     41
VI          8    6    7    7     28
VII        19    1    5   14     39
VIII        8    0    1   16     25
IX          1    0    0   11     12
Total      52   52   52   52    208

Response category: I to IX ("strong dislike" to "excellent taste").

$y_{ij}$, $i = 1, 2, 3, 4$; $j = 1, 2, \ldots, 9$: the response frequencies for the cheese additives, where
$i = 1$: additive A, $i = 2$: additive B, $i = 3$: additive C, $i = 4$: additive D;
$j = 1$: response I, $j = 2$: response II, $j = 3$: response III, $j = 4$: response IV, $j = 5$: response V, $j = 6$: response VI, $j = 7$: response VII, $j = 8$: response VIII, $j = 9$: response IX.

For example, $y_{11} = 0$, $y_{12} = 0$, $y_{13} = 1$, $y_{14} = 7$, $y_{15} = 8$, $y_{16} = 8$, $y_{17} = 19$, $y_{18} = 8$, $y_{19} = 1$.

Also,
$\pi_{ij}$, $i = 1, 2, 3, 4$; $j = 1, 2, \ldots, 9$: the probability corresponding to the $i$-th cheese additive and the $j$-th response category.
$r_{ij} = \sum_{l \le j} \pi_{il}$: the cumulative probability corresponding to the $i$-th cheese additive and the $j$-th response category. That is,
$$r_{i1} = \pi_{i1}, \quad r_{i2} = \pi_{i1} + \pi_{i2}, \quad \ldots, \quad r_{i9} = \pi_{i1} + \pi_{i2} + \cdots + \pi_{i9} = 1 .$$
Then $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{i9})$, $i = 1, 2, 3, 4$, has a multinomial distribution with parameters $m_i = 52$ and $\pi_i = (\pi_{i1}, \ldots, \pi_{i9})$.

Objective: We are concerned with the effect of various cheese additives on taste. That is, we want to evaluate the statistical significance of the differences among these cheese additives. We want to find a model that is capable of describing these differences, for example the interrelation among the cumulative probabilities such as
$$r_{4j} \le r_{1j} \le r_{3j} \le r_{2j}$$
(D, A, C, B from best to worst).

Definition of Polytomous Data: If the response of an individual or item in a study is restricted to one of a fixed set of possible values, we say that the response is polytomous. Examples of polytomous data include blood type (A, B, AB, O, ...), food testing, measures of mental and physical well-being, and variables arising in social science research.

Note: Whatever the nature of the scale of the response, the response probabilities $\pi_1, \pi_2, \ldots, \pi_k$, with $\pi_j = P(Y = y_j)$, need to be specified.
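As a rough illustration of the notation, the empirical category proportions $y_{ij}/m_i$ and the empirical cumulative proportions for the cheese table can be computed as follows. This is only a sketch: the variable names (`counts`, `pi_hat`, `r_hat`, `ordering_holds`) are chosen here for illustration, and the check compares raw proportions rather than fitting any model.

```python
import numpy as np

# Cheese data from the table: rows are additives A, B, C, D;
# columns are response categories I..IX.
counts = np.array([
    [0, 0, 1, 7, 8, 8, 19, 8, 1],      # A
    [6, 9, 12, 11, 7, 6, 1, 0, 0],     # B
    [1, 1, 6, 8, 23, 7, 5, 1, 0],      # C
    [0, 0, 0, 1, 3, 7, 14, 16, 11],    # D
])

m = counts.sum(axis=1)            # multinomial index m_i of each additive
z = np.cumsum(counts, axis=1)     # cumulative counts y_i1 + ... + y_ij
pi_hat = counts / m[:, None]      # empirical category proportions y_ij / m_i
r_hat = z / m[:, None]            # empirical cumulative proportions

# Empirical check of r_4j <= r_1j <= r_3j <= r_2j (D, A, C, B) for every j.
order = [3, 0, 2, 1]              # row indices of D, A, C, B
ordering_holds = all(
    np.all(z[a] <= z[b]) for a, b in zip(order[:-1], order[1:])
)

for name, row in zip("ABCD", r_hat):
    print(name, np.round(row, 3))
print("D <= A <= C <= B for all j:", ordering_holds)
```

A small cumulative proportion at the low categories means few "dislike" responses, which is why the additive with the smallest $r_{ij}$ is ranked best.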
Note: If the categories are ordered, we may prefer to work with the cumulative response probabilities
$$r_1 = \pi_1, \quad r_2 = \pi_1 + \pi_2, \quad \ldots, \quad r_k = \pi_1 + \pi_2 + \cdots + \pi_k = 1 ,$$
where $r_j = P(Y \le y_j) = \sum_{l \le j} \pi_l$. It makes little sense to work with a model specified in terms of $r_j$ if the response categories are not ordered.

3.2 Measurement scales and modeling

(a) General

There are two types of scales, pure scales and compound scales. A bivariate response with one component ordinal and the other continuous is an example of a compound scale. For pure scales, there are several types:

1. nominal scales: the categories are regarded as exchangeable and totally devoid of structure.
2. ordinal scales: the categories are ordered much like the ordinal numbers "first", "second", .... It does not make sense to talk of "distance" or "spacing" between "first" and "second", nor to compare "spacings" between pairs of response categories.
3. interval scales: the categories are ordered and numerical labels or scores are attached. The scores are treated as category averages, medians or mid-points. Differences between scores are therefore interpreted as a measure of separation of the categories.

Note: In applications, the distinction between nominal and ordinal scales is usually, but not always, clear. For example, hair color and eye color can be ordered to a large extent on the grey-scale from light to dark and are therefore ordinal. However, unless there is a clear connection with the electromagnetic spectrum or a grey-scale, colors are best regarded as nominal.

(b) Models for ordinal scales

Ordinal scales occur more frequently in applications than the other types. The applications include food testing (bad, good, excellent, ...), classification of radiographs, determination of physical or mental well-being, ....

Note: It is essential that the same conclusion can be reached even if the number or choice of response categories is changed.
As a consequence, if a new category is formed by combining adjacent categories of the old scale, the form of the conclusions should be unaffected. This is an important non-mathematical point that is difficult to make mathematically rigorous. It leads fairly directly to models based on the cumulative probabilities $r_j$ rather than the category probabilities $\pi_j$.

Commonly used models: Two models are found to work well in practice.

1. logistic scale: This is the simplest model. The form is
$$\log\frac{r_j(x)}{1 - r_j(x)} = \theta_j - \beta^T x .$$
This model is also known as the proportional-odds model, since the ratio of the odds is
$$\frac{r_j(x_1) / \{1 - r_j(x_1)\}}{r_j(x_2) / \{1 - r_j(x_2)\}} = \exp\{-\beta^T (x_1 - x_2)\} ,$$
which is independent of the choice of category $j$. In addition, if $x = 1$ for the treatment group and $x = 0$ for the control group, then
$$\frac{r_j(1)}{1 - r_j(1)} = e^{-\beta} \, \frac{r_j(0)}{1 - r_j(0)} .$$

2. complementary log-log scale: The form is
$$\log[-\log\{1 - r_j(x)\}] = \theta_j - \beta^T x .$$

Note: The model based on the logistic scale may be derived from the notion of a tolerance distribution, or an underlying unobserved continuous random variable $Z$ such that $Z - \beta^T x = \varepsilon$ has the logistic distribution. If the unobserved variable lies in the interval $\theta_{j-1} < Z \le \theta_j$, then $Y = y_j$ is recorded. That is,
$$r_j(x) = P(Y \le y_j) = P(Z \le \theta_j) = P(Z - \beta^T x \le \theta_j - \beta^T x) = P(\varepsilon \le \theta_j - \beta^T x) = \frac{\exp(\theta_j - \beta^T x)}{1 + \exp(\theta_j - \beta^T x)} ,$$
so that
$$\log\frac{r_j(x)}{1 - r_j(x)} = \theta_j - \beta^T x .$$

Note: It is sometimes claimed that the models based on the logistic scale, the complementary log-log scale and related scales are appropriate only if there exists a latent variable $Z$. This claim seems to be too strong and, in any case, the existence of $Z$ is usually unverifiable in practice.

Note: The model
$$\frac{Z - \beta^T x}{\exp(\tau^T x)} = \varepsilon ,$$
where $\varepsilon$ has the logistic distribution, is worthy of serious consideration. The model leads to
$$\log\frac{r_j(x)}{1 - r_j(x)} = \frac{\theta_j - \beta^T x}{\exp(\tau^T x)} ,$$
where $\beta^T x$ plays the role of linear predictor for the mean and $\tau^T x$ in the denominator plays the role of linear predictor for the dispersion or variance. If $x = 1$ for the treatment group and $x = 0$ for the control group, then
$$\frac{r_j(1) / \{1 - r_j(1)\}}{r_j(0) / \{1 - r_j(0)\}} = \exp\left\{\frac{\theta_j - \beta}{\sigma} - \theta_j\right\} = \exp\left\{\theta_j \left(\frac{1}{\sigma} - 1\right) - \frac{\beta}{\sigma}\right\} ,$$
where $\sigma = \exp(\tau)$.
If $\sigma = \exp(\tau) < 1$ (that is, $\tau < 0$), the odds ratio is increasing in $j$, because the cut-points $\theta_j$ are increasing in $j$; it is decreasing in $j$ if $\sigma > 1$. This model is useful for testing the proportional-odds assumption ($\tau = 0$) against the alternative that the odds ratio is systematically increasing or systematically decreasing in $j$.

Note: Models in which the $k - 1$ regression lines are not parallel can be specified by
$$\log\frac{r_j(x)}{1 - r_j(x)} = \theta_j - \beta_j^T x .$$

(c) Models for interval scales

Interval scales are distinguished by the following properties:

1. The categories are of interest in themselves and are not chosen arbitrarily.
2. It does not normally make sense to form a new category by amalgamating adjacent categories.
3. Attached to the $j$-th category is a cardinal number or score, $s_j$, such that the difference between scores is a measure of distance between, or separation of, categories.

Note: Genuine interval scales having these 3 properties are rare in practice because, although properties 1 and 2 may be satisfied, it is rare to find a response scale having well-determined cardinal scores attached to the categories.

There are 3 options for model construction.

1. A cumulative model in which the cut-points are tied to the scores:
$$\log\frac{r_j(x)}{1 - r_j(x)} = \zeta_0 + \zeta_1 \frac{s_j + s_{j+1}}{2} - \beta^T x - \xi^T x \, (c_j - \bar{c}) ,$$
where
$$c_j = \frac{s_j + s_{j+1}}{2} \quad \text{or} \quad c_j = \operatorname{logit}\left(\frac{s_j + s_{j+1}}{2}\right) .$$

2. The category probabilities $\pi_j$ can also be used. The model is
$$\pi_j(x_i) = \frac{\exp\{\eta_j(x_i)\}}{\sum_{l=1}^{k} \exp\{\eta_l(x_i)\}} , \quad \text{where} \quad \eta_j(x_i) = \eta_j + (\beta^T x_i) s_j + \alpha_i .$$
Note: The relative odds for category $j$ over category $l$ in the above model are
$$\frac{\pi_j(x)}{\pi_l(x)} = \exp\{\eta_j - \eta_l + \beta^T x \, (s_j - s_l)\} .$$
Thus, for scalar $x$, the relative odds are increased multiplicatively by the factor $\exp\{\beta (s_j - s_l)\}$ per unit increase in $x$.

3. A model for the expected score:
$$\sum_{j=1}^{k} s_j \pi_j(x_i) = \beta^T x_i .$$
In this model, instead of regarding $y$ as the response and the score $s_j$ as a contrast of special interest, we may regard the observed score as the response and $y$ as the set of observed multiplicities or weights. Here $\sum_{j=1}^{k} s_j \pi_j(x_i)$ is the expected score. The estimate of the expected score is
$$S_i = \frac{1}{m_i} \sum_{j=1}^{k} s_j y_{ij} .$$
With $S_i = \sum_{j=1}^{k} s_j y_{ij} / m_i$ the estimated expected score for group $i$, if there are only two treatment groups, with observed counts $\{y_{1j}, y_{2j}\}$, we may use the standardized difference as the test statistic
$$T = \frac{S_1 - S_2}{\sqrt{\left(\dfrac{1}{m_1} + \dfrac{1}{m_2}\right) \left[\sum_{j=1}^{k} \tilde{\pi}_j s_j^2 - \left(\sum_{j=1}^{k} \tilde{\pi}_j s_j\right)^2\right]}} , \quad \text{where} \quad \tilde{\pi}_j = \frac{y_{1j} + y_{2j}}{m_1 + m_2} .$$

(d) Models for nominal scales

The category probabilities $\pi_j$ can be used. The model is
$$\pi_j(x_i) = \frac{\exp\{\eta_j(x_i)\}}{\sum_{l=1}^{k} \exp\{\eta_l(x_i)\}} , \quad \text{where} \quad \eta_j(x_i) = \eta_j(x_0) + \beta_j^T (x_i - x_0) + \alpha_i .$$
Note: The relative odds for category $j$ over category $l$ in the above model are
$$\frac{\pi_j(x)}{\pi_l(x)} = \frac{\pi_j(x_0)}{\pi_l(x_0)} \exp\{(\beta_j - \beta_l)^T (x - x_0)\} .$$
Thus, for scalar $x$, the relative odds start from the baseline value $\pi_j(x_0) / \pi_l(x_0)$ and are increased multiplicatively by the factor $\exp(\beta_j - \beta_l)$ per unit increase in $x$.

(e) Models for nested or hierarchical scales

Example:
Objective: We want to test the hypothesis that a winter diet containing a high proportion of red clover has the effect of reducing the fertility of milch cows. To test the hypothesis, 80 cows were assigned at random to one of the two diets. Most cows become pregnant at the first insemination, but a few require a second or third insemination. The response variable is the pregnancy rate. The response, probability and odds are summarized in the following table:

Insemination   Response                 Probability           Odds
First          $Y_1 \mid m$             $\pi_1$               $\pi_1 / (1 - r_1)$
Second         $Y_2 \mid m - y_1$       $\pi_2 / (1 - r_1)$   $\pi_2 / (1 - r_2)$
Third          $Y_3 \mid m - y_1 - y_2$ $\pi_3 / (1 - r_2)$   $\pi_3 / (1 - r_3)$

Then a simple sequence of models having a constant treatment effect is as follows:
$$g(\pi_1) = \theta_1 - \beta^T x , \quad g\left(\frac{\pi_2}{1 - r_1}\right) = \theta_2 - \beta^T x , \quad g\left(\frac{\pi_3}{1 - r_2}\right) = \theta_3 - \beta^T x .$$
If the logistic link function is used, we have
$$\log\frac{\pi_j}{1 - r_j} = \theta_j - \beta^T x .$$
The incidental parameters $\theta_1, \theta_2, \ldots, \theta_{k-1}$ make allowance for the expected decline in fertility.

3.3 The multinomial distribution

The multinomial distribution is in many ways the most natural distribution to consider in the context of a polytomous response variable. We introduce the properties of the multinomial distribution in this section.

(a) Source

There are two derivations of the multinomial distribution. One is based on simple random sampling and the other is based on the conditional distribution of Poisson random variables.

1. Simple random sampling: Suppose there are $k$ attributes $A_1, A_2, \ldots, A_k$.
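The attributes might be "color of hair", "socio-economic status", "family size", "cause of death" and so on. If the population is effectively infinitely large and a simple random sample of size $m$ is taken, the probability that $y_1, y_2, \ldots, y_k$ individuals are observed to have attributes $A_1, A_2, \ldots, A_k$ is
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{m!}{y_1! \cdots y_k!} \, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k} = \frac{m!}{\prod_{j=1}^{k} y_j!} \, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k} ,$$
where $\sum_{i=1}^{k} y_i = m$ and $0 \le y_i \le m$.

2. Conditional distribution of Poisson random variables: Let $Y_1, Y_2, \ldots, Y_k$ be independent with $Y_i \sim \text{Poisson}(\mu_i)$. Denote
$$Y_{\cdot} = \sum_{i=1}^{k} Y_i , \quad \mu_{\cdot} = \sum_{i=1}^{k} \mu_i , \quad \pi_i = \frac{\mu_i}{\mu_{\cdot}} .$$
Then the conditional joint distribution of $Y_1, Y_2, \ldots, Y_k$ given $Y_{\cdot} = m$ is
$$P\left(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k \,\Big|\, \sum_{i=1}^{k} Y_i = m\right) = \frac{m!}{y_1! \, y_2! \cdots y_k!} \, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_k^{y_k} .$$

(b) Moments and cumulants

The moment generating function of the multinomial distribution is
$$M_Y(t) = M_Y(t_1, t_2, \ldots, t_k) = E\left[\exp\left(\sum_{i=1}^{k} t_i Y_i\right)\right] = \left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right)^{m}$$
and the cumulant generating function is
$$K_Y(t) = K_Y(t_1, t_2, \ldots, t_k) = \log M_Y(t_1, t_2, \ldots, t_k) = m \log\left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right) .$$
Then
$$E(Y_r) = \frac{\partial K_Y(t)}{\partial t_r}\bigg|_{t=0} = \frac{m \pi_r \exp(t_r)}{\sum_{i=1}^{k} \pi_i \exp(t_i)}\bigg|_{t=0} = m \pi_r ,$$
and for $r \ne s$,
$$\operatorname{Cov}(Y_r, Y_s) = \frac{\partial^2 K_Y(t)}{\partial t_r \, \partial t_s}\bigg|_{t=0} = -\frac{m \pi_r \exp(t_r) \, \pi_s \exp(t_s)}{\left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right)^2}\bigg|_{t=0} = -m \pi_r \pi_s ,$$
and
$$\operatorname{Var}(Y_r) = \frac{\partial^2 K_Y(t)}{\partial t_r^2}\bigg|_{t=0} = \left[\frac{m \pi_r \exp(t_r)}{\sum_{i=1}^{k} \pi_i \exp(t_i)} - \frac{m \pi_r^2 \exp(2 t_r)}{\left(\sum_{i=1}^{k} \pi_i \exp(t_i)\right)^2}\right]_{t=0} = m \pi_r - m \pi_r^2 = m \pi_r (1 - \pi_r) .$$

In addition, define the cumulative totals
$$Z_1 = Y_1 , \quad Z_2 = Y_1 + Y_2 , \quad \ldots, \quad Z_k = Y_1 + Y_2 + \cdots + Y_k ,$$
that is,
$$Z = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_k \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{pmatrix} = L Y ,$$
where $L$ is a lower-triangular matrix containing unit values. Then
$$E(Z_j) = m r_j$$
and, for $j \le l$,
$$\operatorname{Cov}(Z_j, Z_l) = m r_j (1 - r_l) .$$

Note: For $j < l < t$, the conditional distribution of $Z_j$ given $Z_l = z_l$ is
$$Z_j \mid Z_l = z_l \sim B\left(z_l, \frac{r_j}{r_l}\right) .$$
In addition,
$$Z_t - Z_l \mid Z_l = z_l \sim B\left(m - z_l, \frac{r_t - r_l}{1 - r_l}\right) .$$

Note: For the score total $\sum_{i=1}^{k} s_i Y_i$, write $\mu_s = \sum_{i=1}^{k} \pi_i s_i$. Then
$$E\left(\sum_{i=1}^{k} s_i \frac{Y_i}{m}\right) = \sum_{i=1}^{k} \pi_i s_i = \mu_s$$
and
$$\operatorname{Var}\left(\sum_{i=1}^{k} s_i Y_i\right) = m \sum_{i=1}^{k} \pi_i (s_i - \mu_s)^2 = m \left[\sum_{i=1}^{k} \pi_i s_i^2 - \left(\sum_{i=1}^{k} \pi_i s_i\right)^2\right] .$$

(c) Marginal and conditional distributions

The multinomial distribution has the following important properties:

1. The marginal distribution of $Y_j$ is $Y_j \sim B(m, \pi_j)$.
2. The joint marginal distribution of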
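$(Y_1, Y_2, m - Y_1 - Y_2)$ is multinomial on 3 categories with index $m$ and parameter $(\pi_1, \pi_2, 1 - \pi_1 - \pi_2)$.
3. The conditional distribution of $(Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_k)$ given that $Y_i = y_i$ is multinomial with index $m - y_i$ and probabilities
$$\left(\frac{\pi_1}{1 - \pi_i}, \ldots, \frac{\pi_{i-1}}{1 - \pi_i}, \frac{\pi_{i+1}}{1 - \pi_i}, \ldots, \frac{\pi_k}{1 - \pi_i}\right) .$$
4. The marginal distribution of $Z_j$ is $Z_j \sim B(m, r_j)$.
5. The conditional distribution of $Z_i$ given $Z_j = z_j$ is $B(z_j, r_i / r_j)$ for $i < j$.
6. The conditional distribution of $Y_{j+1}$ given $Z_j = z_j$ is $B\left(m - z_j, \dfrac{\pi_{j+1}}{1 - r_j}\right)$.
7. The multinomial distribution can be expressed as a product of $k - 1$ binomial factors,
$$P(Y_1 = y_1, \ldots, Y_k = y_k) = f(y_1 \mid z_0) \, f(y_2 \mid z_1) \cdots f(y_{k-1} \mid z_{k-2}) ,$$
where
$$f(y_j \mid z_{j-1}) = \binom{m - z_{j-1}}{y_j} \left(\frac{\pi_j}{1 - r_{j-1}}\right)^{y_j} \left(\frac{1 - r_j}{1 - r_{j-1}}\right)^{m - z_{j-1} - y_j}$$
and $z_0 = r_0 = 0$.
8. The sequence $Z_1, \ldots, Z_k$ has the Markov property. That is,
$$P(Z_j \mid Z_{j-1} = z_{j-1}, \ldots, Z_1 = z_1) = P(Z_j \mid Z_{j-1} = z_{j-1}) .$$

(d) Quadratic forms

To test $H_0: \pi = \pi^0 = (\pi_1^0, \pi_2^0, \ldots, \pi_k^0)$, the quadratic form in the residuals (Pearson's statistic),
$$X^2 = \sum_{j=1}^{k} \frac{(Y_j - m \pi_j^0)^2}{m \pi_j^0} ,$$
can be used. When $m$ is large, $X^2$ is approximately distributed as $\chi^2_{k-1}$. In addition, we can also use the cumulative multinomial vector $Z$, with $r_j^0 = \sum_{l \le j} \pi_l^0$ computed under $H_0$:
$$\sum_{j=1}^{k-1} \frac{(Z_j - m r_j^0)^2}{m} \left(\frac{1}{\pi_j^0} + \frac{1}{\pi_{j+1}^0}\right) - 2 \sum_{j=1}^{k-2} \frac{(Z_j - m r_j^0)(Z_{j+1} - m r_{j+1}^0)}{m \pi_{j+1}^0} .$$
Note that the above quadratic form is identical to $X^2$, because $Y_j - m \pi_j^0 = (Z_j - m r_j^0) - (Z_{j-1} - m r_{j-1}^0)$.

3.4 Likelihood functions

(a) Log likelihood for multinomial responses

Let
$$y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})^T , \quad \sum_{j=1}^{k} y_{ij} = m_i ,$$
and
$$\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ik})^T , \quad \sum_{j=1}^{k} \pi_{ij} = 1 .$$
Then the log-likelihood function for observation $y_i$ is
$$l_i(\pi_i \mid y_i) = \sum_{j=1}^{k} y_{ij} \log \pi_{ij} = \sum_{j=1}^{k-1} y_{ij} \log \pi_{ij} + y_{ik} \log\left(1 - \sum_{j=1}^{k-1} \pi_{ij}\right) .$$
Thus,
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{1 - \sum_{j=1}^{k-1} \pi_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}} .$$
Then, for the maximum likelihood estimate $\hat{\pi}_{ij}$,
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}}\bigg|_{\pi_{ij} = \hat{\pi}_{ij}} = \frac{y_{ij}}{\hat{\pi}_{ij}} - \frac{y_{ik}}{\hat{\pi}_{ik}} = \frac{y_{ij}}{\hat{\pi}_{ij}} - m_i = \frac{y_{ij} - m_i \hat{\pi}_{ij}}{\hat{\pi}_{ij}} ,$$
since setting the derivatives to zero gives
$$\frac{y_{i1}}{\hat{\pi}_{i1}} = \frac{y_{i2}}{\hat{\pi}_{i2}} = \cdots = \frac{y_{ik}}{\hat{\pi}_{ik}} = \kappa \;\Rightarrow\; y_{ij} = \kappa \hat{\pi}_{ij} \;\Rightarrow\; m_i = \sum_{j=1}^{k} y_{ij} = \kappa \sum_{j=1}^{k} \hat{\pi}_{ij} = \kappa \;\Rightarrow\; \frac{y_{ik}}{\hat{\pi}_{ik}} = m_i .$$
The log-likelihood function is
$$l(\pi) = \sum_{i=1}^{n} l_i(\pi_i \mid y_i) = \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log \pi_{ij} .$$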
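Thus,
$$\frac{\partial l}{\partial \pi_{ij}}\bigg|_{\pi_{ij} = \hat{\pi}_{ij}} = \frac{y_{ij} - m_i \pi_{ij}}{\pi_{ij}}\bigg|_{\pi_{ij} = \hat{\pi}_{ij}} .$$
Further, introducing matrix notation,
$$\frac{\partial l}{\partial \pi_i} = m_i \Sigma_i^{-} (y_i - m_i \pi_i) = m_i \Sigma_i^{-} (y_i - \mu_i) ,$$
where
$$\Sigma_i^{-} = \operatorname{diag}\left(\frac{1}{m_i \pi_{i1}}, \frac{1}{m_i \pi_{i2}}, \ldots, \frac{1}{m_i \pi_{ik}}\right) \quad \text{and} \quad \mu_i = (m_i \pi_{i1}, m_i \pi_{i2}, \ldots, m_i \pi_{ik})^T .$$
Let
$$\pi = (\pi_1^T, \pi_2^T, \ldots, \pi_n^T)^T , \quad y = (y_1^T, y_2^T, \ldots, y_n^T)^T .$$
Then
$$\frac{\partial l}{\partial \pi} = M \Sigma^{-} (y - \mu) ,$$
where $\Sigma^{-} = \operatorname{diag}(\Sigma_1^{-}, \Sigma_2^{-}, \ldots, \Sigma_n^{-})$ is block diagonal, $M = \operatorname{diag}(m_1 I_k, m_2 I_k, \ldots, m_n I_k)$, and $\mu = (\mu_1^T, \mu_2^T, \ldots, \mu_n^T)^T$.

If we choose to work with
$$r = (r_1^T, r_2^T, \ldots, r_n^T)^T , \quad r_i = (r_{i1}, r_{i2}, \ldots, r_{ik})^T ,$$
the log-likelihood function for observation $y_i$ can be rewritten, with $r_{i0} = 0$ and $r_{ik} = 1$, as
$$l_i(\pi_i \mid y_i) = \sum_{j=1}^{k} y_{ij} \log(r_{ij} - r_{i,j-1}) = \sum_{j=1}^{k-1} y_{ij} \log(r_{ij} - r_{i,j-1}) + y_{ik} \log(1 - r_{i,k-1}) .$$
Then, for $j = 1, \ldots, k - 1$,
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial r_{ij}} = \frac{y_{ij}}{r_{ij} - r_{i,j-1}} - \frac{y_{i,j+1}}{r_{i,j+1} - r_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{i,j+1}}{\pi_{i,j+1}}$$
$$= -\frac{z_{i,j-1}}{\pi_{ij}} + \left(\frac{1}{\pi_{ij}} + \frac{1}{\pi_{i,j+1}}\right) z_{ij} - \frac{z_{i,j+1}}{\pi_{i,j+1}}$$
$$= -\frac{z_{i,j-1} - m_i r_{i,j-1}}{\pi_{ij}} + \left(\frac{1}{\pi_{ij}} + \frac{1}{\pi_{i,j+1}}\right) (z_{ij} - m_i r_{ij}) - \frac{z_{i,j+1} - m_i r_{i,j+1}}{\pi_{i,j+1}} ,$$
where $z_{ij} = y_{i1} + y_{i2} + \cdots + y_{ij}$ and $z_{i0} = 0$. The terms in $m_i r_{ij}$ cancel because $r_{ij} - r_{i,j-1} = \pi_{ij}$.

Thus, introducing matrix notation,
$$\frac{\partial l}{\partial r_i} = m_i \Gamma_i^{-} (z_i - m_i r_i) ,$$
where $\Gamma_i^{-}$ is the tridiagonal matrix
$$\Gamma_i^{-} = \frac{1}{m_i} \begin{pmatrix} \pi_{i1}^{-1} + \pi_{i2}^{-1} & -\pi_{i2}^{-1} & 0 & \cdots & 0 \\ -\pi_{i2}^{-1} & \pi_{i2}^{-1} + \pi_{i3}^{-1} & -\pi_{i3}^{-1} & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & -\pi_{i,k-1}^{-1} & \pi_{i,k-1}^{-1} + \pi_{ik}^{-1} \end{pmatrix}$$
acting on the first $k - 1$ components of $z_i = (z_{i1}, z_{i2}, \ldots, z_{ik})^T$; the $k$-th component contributes nothing because $z_{ik} - m_i r_{ik} = 0$.

Further,
$$\frac{\partial l}{\partial r} = M \Gamma^{-} (z - \rho) ,$$
where $\Gamma^{-} = \operatorname{diag}(\Gamma_1^{-}, \Gamma_2^{-}, \ldots, \Gamma_n^{-})$, $z = (z_1^T, z_2^T, \ldots, z_n^T)^T$ and $\rho = (m_1 r_1^T, m_2 r_2^T, \ldots, m_n r_n^T)^T$.

Note:
$$\frac{\partial l_i(\pi_i \mid y_i)}{\partial r_{ij}} = \frac{y_{ij}}{\pi_{ij}} - \frac{y_{i,j+1}}{\pi_{i,j+1}} = \left(\frac{y_{ij}}{\pi_{ij}} - \frac{y_{ik}}{\pi_{ik}}\right) - \left(\frac{y_{i,j+1}}{\pi_{i,j+1}} - \frac{y_{ik}}{\pi_{ik}}\right) = \frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{ij}} - \frac{\partial l_i(\pi_i \mid y_i)}{\partial \pi_{i,j+1}} .$$

(b) Parameter estimation

For the model with the form
$$\log\frac{r_{ij}(x_i)}{1 - r_{ij}(x_i)} = \theta_j - \beta^T x_i ,$$
we can rewrite the model as
$$\log\frac{r_{ij}}{1 - r_{ij}} = x_{ij}^{*T} \beta^{*} ,$$
where
$$x_{ij}^{*} = (0, \ldots, 0, 1, 0, \ldots, 0, -x_i^T)^T = (x_{ij1}^{*}, x_{ij2}^{*}, \ldots, x_{ij(p+k-1)}^{*})^T ,$$
with the 1 in the $j$-th component, and
$$\beta^{*} = (\theta_1, \ldots, \theta_{k-1}, \beta_1, \ldots, \beta_p)^T = (\beta_1^{*}, \beta_2^{*}, \ldots, \beta_{p+k-1}^{*})^T .$$
Then differentiation with respect to $\beta^{*}$ gives
$$\frac{\partial l}{\partial \beta_t^{*}} = \sum_{i=1}^{n} \sum_{j=1}^{k-1} \frac{\partial l}{\partial r_{ij}} \frac{\partial r_{ij}}{\partial \beta_t^{*}} = \sum_{i=1}^{n} \sum_{j=1}^{k-1} x_{ijt}^{*} \, r_{ij} (1 - r_{ij}) \frac{\partial l}{\partial r_{ij}} ,$$
since
$$r_{ij} = \frac{\exp(x_{ij}^{*T} \beta^{*})}{1 + \exp(x_{ij}^{*T} \beta^{*})}$$
and
$$\frac{\partial r_{ij}}{\partial \beta_t^{*}} = \frac{x_{ijt}^{*} \exp(x_{ij}^{*T} \beta^{*})}{\{1 + \exp(x_{ij}^{*T} \beta^{*})\}^2} = x_{ijt}^{*} \, r_{ij} (1 - r_{ij}) .$$
The second-order derivatives can be obtained similarly.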
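The chain-rule score for the proportional-odds model can be checked numerically. The following is a minimal sketch, not a fitting routine: the function names (`cumulative_probs`, `category_probs`, `loglik`, `score`, `numeric_gradient`) are hypothetical, the cheese counts are used as data with indicator covariates for additives B, C and D (a coding assumed here, with A as baseline), and the evaluation point `b0` is arbitrary.

```python
import numpy as np

def cumulative_probs(beta_star, X, k):
    """r_ij from logit(r_ij) = theta_j - x_i' beta, for j = 1..k-1."""
    theta, beta = beta_star[:k - 1], beta_star[k - 1:]
    eta = theta[None, :] - (X @ beta)[:, None]        # n x (k-1)
    return 1.0 / (1.0 + np.exp(-eta))

def category_probs(r):
    """pi_ij = r_ij - r_i,j-1 with r_i0 = 0 and r_ik = 1."""
    n = r.shape[0]
    r_full = np.hstack([np.zeros((n, 1)), r, np.ones((n, 1))])
    return np.diff(r_full, axis=1)                    # n x k

def loglik(beta_star, X, Y):
    """Multinomial log-likelihood sum_ij y_ij log pi_ij (constant dropped)."""
    r = cumulative_probs(beta_star, X, Y.shape[1])
    return float(np.sum(Y * np.log(category_probs(r))))

def score(beta_star, X, Y):
    """Gradient of loglik by the chain rule through r_ij."""
    k = Y.shape[1]
    r = cumulative_probs(beta_star, X, k)
    ratio = Y / category_probs(r)                     # y_ij / pi_ij
    dl_dr = ratio[:, :-1] - ratio[:, 1:]              # y_ij/pi_ij - y_i,j+1/pi_i,j+1
    w = dl_dr * r * (1.0 - r)                         # times dr_ij / d(eta_ij)
    g_theta = w.sum(axis=0)                           # d(eta_ij)/d(theta_j) = 1
    g_beta = -(X.T @ w.sum(axis=1))                   # d(eta_ij)/d(beta) = -x_i
    return np.concatenate([g_theta, g_beta])

def numeric_gradient(f, b, eps=1e-6):
    """Central finite differences, used only as a check."""
    g = np.zeros_like(b)
    for t in range(b.size):
        step = np.zeros_like(b)
        step[t] = eps
        g[t] = (f(b + step) - f(b - step)) / (2.0 * eps)
    return g

# Cheese counts: rows A, B, C, D; columns I..IX.
Y = np.array([
    [0, 0, 1, 7, 8, 8, 19, 8, 1],
    [6, 9, 12, 11, 7, 6, 1, 0, 0],
    [1, 1, 6, 8, 23, 7, 5, 1, 0],
    [0, 0, 0, 1, 3, 7, 14, 16, 11],
], dtype=float)
X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

# Arbitrary evaluation point: increasing cut-points, then three slopes.
b0 = np.concatenate([np.linspace(-3.0, 3.0, 8), [-1.0, 0.5, 1.0]])

g_analytic = score(b0, X, Y)
g_numeric = numeric_gradient(lambda b: loglik(b, X, Y), b0)
print(np.max(np.abs(g_analytic - g_numeric)))
```

Agreement between the two gradients supports the sign convention $-x_i$ in $x_{ij}^{*}$. Solving `score(b) = 0`, for example by Newton-Raphson or Fisher scoring, would give the maximum likelihood estimate; that step is not shown here.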
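(c) Deviance function

The full model is the model with a separate parameter $\pi_{ij}$ for every cell. In this case, the estimate $\tilde{\pi}_{ij}$ is
$$\tilde{\pi}_{ij} = \frac{y_{ij}}{m_i} .$$
The reduced model with a certain link function, for example
$$r_{ij} = \frac{\exp(x_{ij}^{*T} \beta^{*})}{1 + \exp(x_{ij}^{*T} \beta^{*})} ,$$
has the parameter estimate $\hat{\pi}_{ij}$. Then the deviance function is
$$D(\tilde{\pi}, \hat{\pi}) = 2 l(\tilde{\pi}) - 2 l(\hat{\pi}) = 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log \tilde{\pi}_{ij} - 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log \hat{\pi}_{ij} = 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log\frac{y_{ij}}{m_i \hat{\pi}_{ij}} = 2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log\frac{y_{ij}}{\hat{\mu}_{ij}} ,$$
where
$$\tilde{\pi} = (\tilde{\pi}_{11}, \ldots, \tilde{\pi}_{1k}, \ldots, \tilde{\pi}_{n1}, \ldots, \tilde{\pi}_{nk}) , \quad \hat{\pi} = (\hat{\pi}_{11}, \ldots, \hat{\pi}_{1k}, \ldots, \hat{\pi}_{n1}, \ldots, \hat{\pi}_{nk}) ,$$
and $\hat{\mu}_{ij} = m_i \hat{\pi}_{ij}$. Under some regularity conditions (similar to Chapter 2), the deviance function has an approximate $\chi^2$ distribution.

3.5 Over-dispersion

Over-dispersion for polytomous responses can occur in exactly the same way as over-dispersion for binary responses. Under the cluster-sampling model, the covariance matrix of the observed response vector is the sum of the within-cluster covariance matrix and the between-cluster covariance matrix. Provided that these two matrices are proportional, we have
$$E(Y) = m \pi , \quad \operatorname{Cov}(Y) = \sigma^2 \Sigma ,$$
where $\Sigma$ is the usual multinomial covariance matrix. The main problem now is to estimate $\sigma^2$. The sensible estimate is
$$\tilde{\sigma}^2 = \frac{X^2}{n(k - 1) - p} ,$$
where $X^2$ is Pearson's statistic and $p$ is the number of unknown parameters.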
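The deviance, Pearson's statistic and the dispersion estimate can be computed directly once fitted values $\hat{\mu}_{ij}$ are available. The sketch below makes two assumptions that are not in the text: the reduced model is taken to be the simple homogeneity model (all additives share the same category probabilities, so $p = k - 1$), which stands in for a link-based fit, and the four rows of the cheese table are treated as the $n = 4$ multinomial observations. Variable names are illustrative.

```python
import numpy as np

# Cheese counts: rows A, B, C, D; columns I..IX.
Y = np.array([
    [0, 0, 1, 7, 8, 8, 19, 8, 1],
    [6, 9, 12, 11, 7, 6, 1, 0, 0],
    [1, 1, 6, 8, 23, 7, 5, 1, 0],
    [0, 0, 0, 1, 3, 7, 14, 16, 11],
], dtype=float)
n, k = Y.shape
m = Y.sum(axis=1)                       # multinomial indices m_i

# Assumed reduced model: common probabilities for every additive.
pi_hat = Y.sum(axis=0) / Y.sum()        # pooled category proportions
mu_hat = np.outer(m, pi_hat)            # fitted values mu_ij = m_i * pi_j
p = k - 1                               # free parameters in this reduced model

# Deviance 2 * sum y_ij log(y_ij / mu_ij), using the convention 0 log 0 = 0.
pos = Y > 0
deviance = 2.0 * np.sum(Y[pos] * np.log(Y[pos] / mu_hat[pos]))

# Pearson statistic and the moment estimate of the dispersion parameter.
pearson = np.sum((Y - mu_hat) ** 2 / mu_hat)
df = n * (k - 1) - p
sigma2_tilde = pearson / df

print("deviance:", round(deviance, 2), "Pearson X^2:", round(pearson, 2))
print("df:", df, "dispersion estimate:", round(sigma2_tilde, 3))
```

With a model fitted through a link function, only `mu_hat` and `p` would change; the three formulas stay the same. A dispersion estimate far above 1 under this homogeneity model reflects lack of fit (real differences between additives) as much as over-dispersion, so it should be read with that in mind.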