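Multinomial Distribution

Consider a series of $n$ independent and identical trials where the outcome for any single trial can be classified into one of $I$ mutually exclusive and exhaustive categories, and

$$\pi_i = \Pr\{\text{outcome is in the } i\text{-th category}\}$$

for any single trial. Then $0 \le \pi_i \le 1$ and $1 = \sum_{i=1}^{I} \pi_i$.

Let $Y_i$ = number of outcomes falling into the $i$-th category in $n$ trials. Then $Y_i \ge 0$ and $\sum_{i=1}^{I} Y_i = n$.

Probability function:

$$\Pr\{Y_1 = y_1, Y_2 = y_2, \ldots, Y_I = y_I\} = n! \prod_{i=1}^{I} \frac{\pi_i^{y_i}}{y_i!}$$

Arrange the counts and probabilities in column vectors

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_I \end{bmatrix}, \qquad \boldsymbol{\pi} = \begin{bmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_I \end{bmatrix}$$

A multinomial distribution is denoted by $\mathbf{Y} \sim \text{Mult}(n, \boldsymbol{\pi})$.

Each count has a binomial distribution: $Y_i \sim \text{Bin}(n, \pi_i)$ for each $i = 1, 2, \ldots, I$.

Moments:

$$E(Y_i) = n\pi_i, \qquad V(Y_i) = n\pi_i(1 - \pi_i), \qquad \text{Cov}(Y_i, Y_j) = -n\pi_i\pi_j \quad (i \ne j)$$

Observed (or sample) proportions:

$$\mathbf{p} = \frac{1}{n}\mathbf{Y} = \begin{bmatrix} Y_1/n \\ Y_2/n \\ \vdots \\ Y_I/n \end{bmatrix} = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_I \end{bmatrix}$$

$$E(p_i) = \pi_i, \qquad V(p_i) = n^{-1}\pi_i(1 - \pi_i), \qquad \text{Cov}(p_i, p_j) = -n^{-1}\pi_i\pi_j \quad (i \ne j)$$

Covariance matrix:

$$V(\mathbf{Y}) = n \begin{bmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 & \cdots & -\pi_1\pi_I \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) & \cdots & -\pi_2\pi_I \\ \vdots & \vdots & \ddots & \vdots \\ -\pi_1\pi_I & -\pi_2\pi_I & \cdots & \pi_I(1-\pi_I) \end{bmatrix} = n\left(\Delta_{\pi} - \boldsymbol{\pi}\boldsymbol{\pi}^T\right)$$

where

$$\Delta_{\pi} = \text{diag}(\boldsymbol{\pi}) = \begin{bmatrix} \pi_1 & & & 0 \\ & \pi_2 & & \\ & & \ddots & \\ 0 & & & \pi_I \end{bmatrix}$$

Then

$$V(\mathbf{p}) = V(n^{-1}\mathbf{Y}) = n^{-2}V(\mathbf{Y}) = n^{-1}\left(\Delta_{\pi} - \boldsymbol{\pi}\boldsymbol{\pi}^T\right)$$

Limiting normal distribution:

For the $j$-th trial define

$$\mathbf{Y}_j = (0, \ldots, 0, 1, 0, \ldots, 0)^T$$

with a one in the $i$-th position and zeros elsewhere if the outcome for the $j$-th trial is in the $i$-th category.

The $\mathbf{Y}_j$'s are i.i.d. random vectors with

$$E(\mathbf{Y}_j) = \boldsymbol{\pi} \quad \text{and} \quad V(\mathbf{Y}_j) = \Delta_{\pi} - \boldsymbol{\pi}\boldsymbol{\pi}^T$$

The vector of sample proportions

$$\mathbf{p} = \begin{bmatrix} p_1 \\ \vdots \\ p_I \end{bmatrix} = n^{-1}\sum_{j=1}^{n} \mathbf{Y}_j = n^{-1}\mathbf{Y}$$

is a vector of sample means. By the Multivariate Central Limit Theorem

$$\sqrt{n}\,(\mathbf{p} - \boldsymbol{\pi}) \xrightarrow{\text{dist'n}} N\left(\mathbf{0},\; \Delta_{\pi} - \boldsymbol{\pi}\boldsymbol{\pi}^T\right) \quad \text{as } n \to \infty$$

Likelihood function:

$$L(\boldsymbol{\pi}; \mathbf{Y}) = n! \prod_{i=1}^{I} \frac{\pi_i^{Y_i}}{Y_i!}$$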
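The probability function and covariance matrix above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `multinomial_pmf` and `multinomial_cov` and the values of `pi` and `n` are hypothetical choices for illustration, not part of the original notes.

```python
from math import factorial, prod


def multinomial_pmf(y, pi):
    """Pr{Y_1 = y_1, ..., Y_I = y_I} = n! * prod_i(pi_i**y_i / y_i!)."""
    coef = factorial(sum(y))
    for yi in y:
        coef //= factorial(yi)  # each division is exact
    return coef * prod(p ** yi for p, yi in zip(pi, y))


def multinomial_cov(n, pi):
    """V(Y) = n * (diag(pi) - pi pi^T), returned as a list of rows."""
    k = len(pi)
    return [[n * ((pi[i] if i == j else 0.0) - pi[i] * pi[j])
             for j in range(k)] for i in range(k)]


# Hypothetical values chosen only for illustration.
pi = [0.5, 0.3, 0.2]
n = 4

prob = multinomial_pmf([2, 1, 1], pi)
cov = multinomial_cov(n, pi)

print(prob)
for row in cov:
    print(row)
```

Each row of the covariance matrix sums to zero, which reflects the constraint that the counts always add up to $n$.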
Log-likelihood:

$$\ell(\boldsymbol{\pi}; \mathbf{Y}) = \log(n!) - \sum_{i=1}^{I} \log(Y_i!) + \sum_{i=1}^{I} Y_i \log(\pi_i)$$

Maximum likelihood estimates (mle's):

Given observed counts $\mathbf{y} = (y_1, \ldots, y_I)^T$, maximize $\ell(\boldsymbol{\pi}; \mathbf{y})$ subject to the constraint $1 = \sum_{i=1}^{I} \pi_i$.

Method of Lagrange multipliers: Maximize

$$g(\boldsymbol{\pi}, \lambda) = \ell(\boldsymbol{\pi}; \mathbf{Y}) + \lambda\left(1 - \sum_{i=1}^{I} \pi_i\right)$$

Solve the likelihood equations

$$0 = \frac{\partial g}{\partial \pi_i} = \frac{Y_i}{\pi_i} - \lambda, \qquad i = 1, 2, \ldots, I$$

$$0 = \frac{\partial g}{\partial \lambda} = 1 - \sum_{i=1}^{I} \pi_i$$

The mle's are

$$\hat{\pi}_i = \frac{Y_i}{n} = p_i, \qquad i = 1, 2, \ldots, I$$

Note that the mle's satisfy the parameter constraints, i.e., $1 = \sum_{i=1}^{I} \hat{\pi}_i$.

Example: Summer Squash

Sinnott and Durham (1922), Journal of Heredity, 13, 177-186.

Results for n = 205 progeny of cross bred white and yellow summer squash:

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 155 \\ 40 \\ 10 \end{bmatrix} \begin{matrix} \text{white} \\ \text{yellow} \\ \text{green} \end{matrix}$$

The genetic model they considered suggests that white : yellow : green should occur with a ratio of 12 : 3 : 1. Is this an appropriate model?

Test the null hypothesis $H_0: \boldsymbol{\pi} = \boldsymbol{\pi}_0$ where

$$\boldsymbol{\pi}_0 = \begin{bmatrix} \pi_{0,1} \\ \pi_{0,2} \\ \pi_{0,3} \end{bmatrix} = \begin{bmatrix} 12/16 \\ 3/16 \\ 1/16 \end{bmatrix}$$

against the general alternative

$$H_A: 0 < \pi_i < 1 \text{ for } i = 1, 2, 3 \text{ and } \sum_{i=1}^{3} \pi_i = 1.$$

1. Expected counts when $H_0$ is true

$$\mathbf{m}_0 = \begin{bmatrix} m_{0,1} \\ m_{0,2} \\ m_{0,3} \end{bmatrix} = n\boldsymbol{\pi}_0 = 205 \begin{bmatrix} 12/16 \\ 3/16 \\ 1/16 \end{bmatrix} = \begin{bmatrix} 153.75 \\ 38.4375 \\ 12.8125 \end{bmatrix}$$

2. Maximum likelihood estimates of "expected" counts under $H_A$

$$\hat{\mathbf{m}}_A = \begin{bmatrix} \hat{m}_{A,1} \\ \hat{m}_{A,2} \\ \hat{m}_{A,3} \end{bmatrix} = n\mathbf{p} = \mathbf{y} = \begin{bmatrix} 155 \\ 40 \\ 10 \end{bmatrix}$$

3. Log-likelihood ratio test (deviance)

$$G^2 = 2\sum_{i=1}^{3} y_i \log\left(\frac{\hat{m}_{A,i}}{m_{0,i}}\right) = 2\sum_{i=1}^{3} y_i \log\left(\frac{y_i}{m_{0,i}}\right) = 0.74$$

4. Pearson statistic

$$X^2 = \sum_{i=1}^{3} \frac{(\hat{m}_{A,i} - m_{0,i})^2}{m_{0,i}} = \sum_{i=1}^{3} \frac{(y_i - m_{0,i})^2}{m_{0,i}} = 0.69$$

Each statistic should be compared against the percentiles of a central chi-squared distribution with

d.f. = (dimension of parameter space under $H_A$) - (dimension of parameter space under $H_0$) = 2 - 0 = 2
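The four steps of the squash example can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python (standard library only); the variable names and the helper `chi2_sf_df2` are assumptions introduced here. The helper uses the fact that the upper-tail probability of a chi-squared variable with 2 d.f. is $e^{-x/2}$, so it applies only to this 2 d.f. case.

```python
from math import exp, log

y = [155, 40, 10]                # observed counts: white, yellow, green
pi0 = [12 / 16, 3 / 16, 1 / 16]  # proportions under the 12:3:1 model

n = sum(y)
m0 = [n * p for p in pi0]        # expected counts under H0

# Deviance and Pearson statistics (a zero count would contribute 0 to G^2)
g2 = 2 * sum(yi * log(yi / mi) for yi, mi in zip(y, m0) if yi > 0)
x2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, m0))
df = len(y) - 1


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with exactly 2 d.f."""
    return exp(-x / 2)


p_g2 = chi2_sf_df2(g2)
p_x2 = chi2_sf_df2(x2)

print("expected counts:", m0)
print("G2 =", round(g2, 3), " df =", df, " p-value =", round(p_g2, 4))
print("X2 =", round(x2, 3), " df =", df, " p-value =", round(p_x2, 4))
```

Both p-values are far above conventional significance levels, so the data give no evidence against the 12 : 3 : 1 model.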
SAS code

    /* This program is stored in the file multfit.sas */

    /* This program uses PROC IML in SAS to test the fit of a
       completely specified multinomial model against a general
       alternative. It is applied to test the fit of a genetic
       model for color of summer squash (Sinnott and Durham,
       1922, Journal of Heredity, 13, 177-186). */

    data set1;
      input color count;
      cards;
    1 155
    2 40
    3 10
    run;

    /* Establish a format to attach labels to levels
       of the color variable */
    proc format;
      value ccode 1 = 'White' 2 = 'Yellow' 3 = 'Green';
    run;

    /* Print the data set with color labels */
    proc print data=set1;
      format color ccode.;
    run;

    /* Use the IML procedure to compute likelihood ratio and
       Pearson chi-squared tests of the null hypothesis that
       white:yellow:green colors occur with a 12:3:1 ratio. */
    proc iml;
    start multfit;

      /* Enter the data */
      use set1;
      read all into w;

      /* Create a column of counts */
      x = w[ ,2];

      /* Compute the total sample size */
      n = sum(x);

      /* Compute the number of categories */
      nc = nrow(x);

      /* Enter the null hypothesis */
      pi = {12, 3, 1};
      pi = pi/sum(pi);

      /* Compute expected counts */
      m = n*pi;

      /* Smooth observed counts toward the null hypothesis
         to avoid computing the log of zero */
      a = .000000001;
      xl = (1-a)*x + a*m;

      /* Compute the deviance, df, and a p-value */
      g2 = 2*sum(x#log(xl/m));
      df = nc-1;
      pg2 = 1-probchi(g2,df);

      /* Compute the Pearson statistic */
      x2 = sum(((x-m)##2)/m);
      px2 = 1-probchi(x2,df);

      /* Round off results and print results */
      g2 = round(g2,.001);  pg2 = round(pg2,.0001);
      x2 = round(x2,.001);  px2 = round(px2,.0001);

      print,,,, 'Tests of the null hypothesis; ', pi;
      print,,, 'Observed Counts   Expected counts';
      print x m;
      print,,, '                 Test' '    DF    P-value';
      print 'Deviance test: ' g2 df pg2;
      print 'Pearson test: ' x2 df px2;

    finish;
    run multfit;

Output:

    Obs    color     count
     1     White      155
     2     Yellow      40
     3     Green       10

    Tests of the null hypothesis;

        PI
      0.75
    0.1875
    0.0625

    Observed Counts   Expected counts

      X          M
    155     153.75
     40    38.4375
     10    12.8125

                      Test    DF    P-value

                        G2    DF       PG2
    Deviance test:   0.741     2    0.6904

                        X2    DF       PX2
    Pearson test:    0.691     2    0.7078

SAS code

    /* Use the FREQ procedure in SAS to test a null hypothesis
       for a multinomial distribution. This code is posted
       as multfit2.sas */

    data set1;
      input type $9. y;
      datalines;
    white     155
    yellow    40
    green     10
    run;

    /* Note that the probabilities in the null hypothesis are
       listed in the order (green, white, yellow) because SAS
       alphabetically orders the values of the type variable. */
    proc freq data=set1;
      table type / testp = (.0625 .75 .1875);
      weight y;
    run;

Output:

    The FREQ Procedure

                                     Test
    type      Frequency   Percent   Percent
    green         10        4.88      6.25
    white        155       75.61     75.00
    yellow        40       19.51     18.75

    Chi-Square Test for Specified Proportions
    Chi-Square       0.6911
    DF                    2
    Pr > ChiSq       0.7078

    Sample Size = 205

S-PLUS code

    # This file contains Splus code for testing the fit of a
    # completely specified multinomial model against a general
    # alternative. It is used to test the fit of a genetic
    # model for the color of summer squash (Sinnott and Durham,
    # 1922, Journal of Heredity, 13, 177-186).
    # The file is stored as multfit.ssc

    # Enter the observed counts
    x<-c(155, 40, 10)

    # Compute total sample size
    n<-sum(x)

    # Enter labels
    labels<-c("white", "yellow", "green")

    # Enter the hypothesized proportions
    pi<-c(12, 3, 1)
    pi<-pi/sum(pi)

    # Compute expected counts
    e<-n*pi

    # Smooth observed counts toward the null
    # hypothesis to avoid computing the log
    # of zero
    a<-.000000001
    xl<-(1-a)*x + a*e

    # Compute the deviance
    g2<-2*sum(x*log(xl/e))
    df<-length(x)-1
    pg2<-(1-pchisq(g2, df))

    # Compute the Pearson statistic
    x2<-sum(((x-e)**2)/e)
    px2<-(1-pchisq(x2, df))

    # Display results
    g2<-round(g2, 3)
    pg2<-round(pg2, 4)
    x2<-round(x2, 3)
    px2<-round(px2, 4)
    dat<-as.matrix(cbind(labels, x, e))
    nc<-ncol(dat)
    dimnames(dat)[[2]]<-c("labels", "observed", "expected")
    cat(dimnames(dat)[[2]], format(t(dat)), file="",
        sep=c(rep(" ", nc-1), "\n"))
    cat("\nDeviance test:", format(g2), " df = ", format(df),
        " p-value = ", format(pg2), "\n")
    cat("\n Pearson test:", format(x2), " df = ", format(df),
        " p-value = ", format(px2), "\n")

Output:

    labels observed expected
    white  155      153.75
    yellow 40       38.4375
    green  10       12.8125

    Deviance test: 0.741  df =  2  p-value =  0.6904

     Pearson test: 0.691  df =  2  p-value =  0.7078

Displaying multinomial counts in a two-way contingency table

Example: Coronary Disease and Serum Cholesterol

A simple random sample of n = 1329 patient records was taken from records of a specific age/sex group maintained by a large HMO.

Each patient was classified into one of four possible categories defined by two traits (or factors), giving a 2 × 2 contingency table (n = 1329).
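Factor 1: Level of serum cholesterol
    (i = 1) less than 220 mg/100cc
    (i = 2) at least 220 mg/100cc

Factor 2: Coronary disease status
    (j = 1) Present
    (j = 2) Absent

                              Coronary Disease
    Serum Cholesterol
    (mg/100cc)                Present         Absent
    < 220                     y11 = 20        y12 = 553
    >= 220                    y21 = 72        y22 = 684

Rearrange the counts into a column vector

$$\mathbf{Y} = \begin{bmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \end{bmatrix} \sim \text{Mult}(n, \boldsymbol{\pi})$$

where

$$\boldsymbol{\pi} = \begin{bmatrix} \pi_{11} \\ \pi_{12} \\ \pi_{21} \\ \pi_{22} \end{bmatrix}, \qquad 1 = \sum_{i=1}^{2}\sum_{j=1}^{2} \pi_{ij}, \qquad n = \sum_{i}\sum_{j} Y_{ij}$$

Here $\pi_{12}$ = proportion of HMO patients with serum cholesterol less than 220 mg/100cc and no coronary disease.

Question: Is the incidence of coronary disease the same for both cholesterol categories?

Test the fit of the independence model. Here "independence" means that the incidence of coronary disease is the same for each serum cholesterol category.

The null hypothesis can be expressed in terms of conditional probabilities:

$$H_0: \Pr\{\text{coronary disease} \mid \text{low cholesterol level}\} = \Pr\{\text{coronary disease} \mid \text{high cholesterol level}\}$$

With respect to the elements of $\boldsymbol{\pi} = (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})^T$ this is written as

$$H_0: \frac{\pi_{11}}{\pi_{11} + \pi_{12}} = \frac{\pi_{21}}{\pi_{21} + \pi_{22}}$$

An equivalent statement is

$$H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$$

where $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$.

Under $H_0$ the vector of proportions is a function of the parameters $\pi_{1+}, \pi_{2+}, \pi_{+1}, \pi_{+2}$. Note that

$$1 = \pi_{1+} + \pi_{2+} \quad \text{and} \quad 1 = \pi_{+1} + \pi_{+2}$$

Then $\boldsymbol{\pi}$ is a function of just two parameters, $\{\pi_{1+}, \pi_{+1}\}$.

Likelihood:

$$L(\boldsymbol{\pi}; \mathbf{Y}) = n! \prod_{i=1}^{2}\prod_{j=1}^{2} \frac{\pi_{ij}^{Y_{ij}}}{Y_{ij}!}$$

Find maximum likelihood estimates by maximizing

$$g(\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}, \lambda) = \log(n!) - \sum_i\sum_j \log(Y_{ij}!) + \sum_i\sum_j Y_{ij}\log(\pi_{ij}) + \lambda\left(1 - \sum_i\sum_j \pi_{ij}\right)$$

Solve the equations

$$0 = \frac{\partial g}{\partial \pi_{ij}} = \frac{Y_{ij}}{\pi_{ij}} - \lambda, \qquad i = 1, 2; \; j = 1, 2$$

$$0 = \frac{\partial g}{\partial \lambda} = 1 - \sum_i\sum_j \pi_{ij}$$

Solution:

$$\hat{\pi}_{ij} = \frac{Y_{ij}}{n}, \qquad i = 1, 2; \; j = 1, 2$$

and the mle's for the expected counts are $n\hat{\pi}_{ij} = Y_{ij}$.

Maximum likelihood estimates for expected counts under independence

Substitute $\pi_{ij} = \pi_{i+}\pi_{+j}$ into the likelihood function to obtain

$$n!\,\frac{\prod_{i=1}^{2}\prod_{j=1}^{2} \pi_{ij}^{Y_{ij}}}{\prod_{i=1}^{2}\prod_{j=1}^{2} Y_{ij}!} = n!\,\frac{\prod_{i=1}^{2}\prod_{j=1}^{2} (\pi_{i+}\pi_{+j})^{Y_{ij}}}{\prod_{i=1}^{2}\prod_{j=1}^{2} Y_{ij}!} = \frac{n!}{\prod_{i=1}^{2}\prod_{j=1}^{2} Y_{ij}!} \prod_{i=1}^{2} \pi_{i+}^{Y_{i+}} \prod_{j=1}^{2} \pi_{+j}^{Y_{+j}}$$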
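Two steps in the argument above are stated without detail: why the conditional-probability form of $H_0$ is equivalent to the product form, and how the double product collapses to marginal totals. The following worked equations are a sketch of those steps, using only the definitions $\pi_{i+} = \sum_j \pi_{ij}$, $\pi_{+j} = \sum_i \pi_{ij}$, $Y_{i+} = \sum_j Y_{ij}$ and $Y_{+j} = \sum_i Y_{ij}$.

```latex
% Equivalence of the two forms of H_0 (shown for cell (1,1)):
% cross-multiplying the conditional-probability form gives
% \pi_{11}\pi_{22} = \pi_{12}\pi_{21}, and then
\begin{align*}
\pi_{1+}\pi_{+1}
  &= (\pi_{11} + \pi_{12})(\pi_{11} + \pi_{21}) \\
  &= \pi_{11}(\pi_{11} + \pi_{12} + \pi_{21}) + \pi_{12}\pi_{21} \\
  &= \pi_{11}(\pi_{11} + \pi_{12} + \pi_{21} + \pi_{22}) \\
  &= \pi_{11}.
\end{align*}

% Collapsing the double product in the likelihood:
\begin{align*}
\prod_{i=1}^{2}\prod_{j=1}^{2} (\pi_{i+}\pi_{+j})^{Y_{ij}}
  &= \prod_{i=1}^{2} \pi_{i+}^{\sum_{j} Y_{ij}}
     \prod_{j=1}^{2} \pi_{+j}^{\sum_{i} Y_{ij}} \\
  &= \prod_{i=1}^{2} \pi_{i+}^{Y_{i+}}
     \prod_{j=1}^{2} \pi_{+j}^{Y_{+j}}.
\end{align*}
```

The same argument applies to the other three cells, and the likelihood under independence therefore depends on the data only through the row totals and column totals.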
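Maximize

$$g(\pi_{1+}, \pi_{+1}, \pi_{2+}, \pi_{+2}, \lambda_1, \lambda_2) = \log(n!) - \sum_i\sum_j \log(Y_{ij}!) + \sum_i Y_{i+}\log(\pi_{i+}) + \sum_j Y_{+j}\log(\pi_{+j}) + \lambda_1\left(1 - \sum_i \pi_{i+}\right) + \lambda_2\left(1 - \sum_j \pi_{+j}\right)$$

Solve the equations

$$0 = \frac{\partial g}{\partial \pi_{i+}} = \frac{Y_{i+}}{\pi_{i+}} - \lambda_1, \qquad i = 1, 2$$

$$0 = \frac{\partial g}{\partial \pi_{+j}} = \frac{Y_{+j}}{\pi_{+j}} - \lambda_2, \qquad j = 1, 2$$

$$0 = \frac{\partial g}{\partial \lambda_1} = 1 - \sum_i \pi_{i+}, \qquad 0 = \frac{\partial g}{\partial \lambda_2} = 1 - \sum_j \pi_{+j}$$

Solution:

$$\hat{\pi}_{i+} = \frac{Y_{i+}}{n}, \quad i = 1, 2 \qquad \hat{\pi}_{+j} = \frac{Y_{+j}}{n}, \quad j = 1, 2$$

Then the m.l.e.'s for the cell proportions and expected counts are

$$\hat{\pi}_{ij} = \hat{\pi}_{i+}\hat{\pi}_{+j} = \left(\frac{Y_{i+}}{n}\right)\left(\frac{Y_{+j}}{n}\right), \qquad \hat{m}_{ij} = n\hat{\pi}_{ij} = \frac{Y_{i+}Y_{+j}}{n}$$

The "expected counts" are

                        Coronary Disease
    Serum chol.   Present                      Absent
    < 220         m11 = (573)(92)/1329         m12 = (573)(1237)/1329       y1+ = 573
                      = 39.67                      = 533.33
    >= 220        m21 = (756)(92)/1329         m22 = (756)(1237)/1329       y2+ = 756
                      = 52.33                      = 703.67
                  y+1 = 92                     y+2 = 1237

The "fit" of the independence model is assessed by comparing it to the "general alternative" model that places no restrictions on $\boldsymbol{\pi}$ other than $1 = \sum_i\sum_j \pi_{ij}$.

The m.l.e.'s for the expected counts under the general alternative are the observed counts, $\hat{m}_{A,ij} = Y_{ij}$.

Compute

$$G^2 = 2\sum_i\sum_j Y_{ij}\log\left(Y_{ij}/\hat{m}_{ij}\right) = 19.8$$

$$X^2 = \sum_i\sum_j (Y_{ij} - \hat{m}_{ij})^2/\hat{m}_{ij} = 18.4$$

with d.f. = 3 - 2 = 1.

Since $\chi^2_{(1),.005} = 7.88$, it appears that the independence model is inappropriate. The incidence of coronary disease is higher for the higher serum cholesterol group.

Comparing vectors of proportions for several independent samples (or experiments):

Suppose $j = 1, 2, \ldots, J$ simple random samples (or experiments) are done. For the $j$-th survey (or experiment) the $n_j$ outcomes are classified into $I$ categories. The random counts are

$$\mathbf{Y}_j = \begin{bmatrix} Y_{1j} \\ Y_{2j} \\ \vdots \\ Y_{Ij} \end{bmatrix} \sim \text{Mult}(n_j, \boldsymbol{\pi}_j)$$

where

$$\boldsymbol{\pi}_j = \begin{bmatrix} \pi_{1j} \\ \pi_{2j} \\ \vdots \\ \pi_{Ij} \end{bmatrix}, \qquad 1 = \sum_{i=1}^{I} \pi_{ij} \quad \text{and} \quad n_j = \sum_{i=1}^{I} Y_{ij} \quad \text{for } j = 1, \ldots, J.$$

$\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_J$ are independent vectors of random counts: no outcome of any experiment has any influence on any other outcome.

Also define

$$\mathbf{P}_j = \begin{bmatrix} P_{1j} \\ P_{2j} \\ \vdots \\ P_{Ij} \end{bmatrix} = \begin{bmatrix} Y_{1j}/n_j \\ Y_{2j}/n_j \\ \vdots \\ Y_{Ij}/n_j \end{bmatrix} = \frac{1}{n_j}\mathbf{Y}_j$$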
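The several-sample layout just defined is an $I \times J$ array of counts, the same shape as the 2 × 2 coronary disease table, and the (row total)(column total)/n expected counts apply to both. The following is a minimal Python sketch (standard library only) that computes those expected counts with $G^2$ and $X^2$ for any two-way table, checked here against the coronary disease values 19.8 and 18.4. The function name `independence_fit` is a hypothetical helper, and the p-value line uses the identity that a chi-squared upper tail with 1 d.f. equals `erfc(sqrt(x/2))`.

```python
from math import erfc, log, sqrt


def independence_fit(table):
    """Return (expected counts, G2, X2) for a two-way table given as rows."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    expected = [[r * c / n for c in col_tot] for r in row_tot]
    cells = [(y, m) for yr, mr in zip(table, expected)
             for y, m in zip(yr, mr)]
    g2 = 2 * sum(y * log(y / m) for y, m in cells if y > 0)
    x2 = sum((y - m) ** 2 / m for y, m in cells)
    return expected, g2, x2


# Rows: cholesterol < 220, >= 220; columns: disease present, absent
coronary = [[20, 553],
            [72, 684]]

m_hat, g2, x2 = independence_fit(coronary)
p_x2 = erfc(sqrt(x2 / 2))  # upper tail of chi-squared with 1 d.f.

for row in m_hat:
    print([round(m, 2) for m in row])
print("G2 =", round(g2, 1), " X2 =", round(x2, 1), " p-value =", p_x2)
```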
The hypothesis

$$H_0: \boldsymbol{\pi}_1 = \boldsymbol{\pi}_2 = \cdots = \boldsymbol{\pi}_J$$

is often of interest. This is often called the homogeneity or independence model and it is usually compared to the general alternative that only assumes

$$\sum_{i=1}^{I} \pi_{ij} = 1 \quad \text{for each } j = 1, \ldots, J.$$

Example: The General Social Survey, conducted by the National Opinion Research Center at the University of Chicago, uses many of the same questions from year to year. Haberman (1978) examined responses to the question

"In general, do you think courts in this area deal too harshly or not harshly enough with criminals?"

for independent surveys taken in 1972, 1973, 1974, and 1975.

Observed Counts

                                     Year of Survey
    Response                     1972    1973    1974    1975
    Too harshly (i=1)             105      68      42      61
    About right (i=2)             265     196      72     144
    Not harshly enough (i=3)     1066    1092     580    1174
    Don't know (i=4)              173     138      51     104
    No answer (i=5)                 4      10       8       7
    Sample Size                  1613    1504     753    1490

Percentages

                                     Year of Survey
    Response                     1972    1973    1974    1975
    Too harshly (i=1)             6.5     4.5     5.6     4.1
    About right (i=2)            16.4    13.0     9.6     9.7
    Not harshly enough (i=3)     66.1    72.6    77.0    78.8
    Don't know (i=4)             10.7     9.2     6.8     7.0
    No answer (i=5)               0.3     0.7     1.1     0.5
    Sample Size                  1613    1504     753    1490

Maximum likelihood estimation for expected counts:

Model A: general alternative

$\mathbf{Y}_j \sim \text{Mult}(n_j, \boldsymbol{\pi}_j)$, $j = 1, 2, 3, 4$, are independent vectors of random counts and

$$\mathbf{1}^T\boldsymbol{\pi}_j = \sum_{i=1}^{5} \pi_{ij} = 1 \quad \text{for each } j = 1, 2, 3, 4.$$

This is sometimes called the "product multinomial" model.

The joint likelihood function is

$$L(\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \boldsymbol{\pi}_3, \boldsymbol{\pi}_4; \mathbf{Y}_1, \mathbf{Y}_2, \mathbf{Y}_3, \mathbf{Y}_4) = \prod_{j=1}^{4}\left[ n_j! \prod_{i=1}^{5} \frac{\pi_{ij}^{Y_{ij}}}{Y_{ij}!} \right]$$

The log-likelihood function is

$$\sum_{j=1}^{4} \log(n_j!) - \sum_{j=1}^{4}\sum_{i=1}^{5} \log(Y_{ij}!) + \sum_{j=1}^{4}\sum_{i=1}^{5} Y_{ij}\log(\pi_{ij})$$

Maximize this subject to the conditions $\sum_{i=1}^{5} \pi_{ij} = 1$ for $j = 1, 2, 3, 4$.

Define

$$g(\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \boldsymbol{\pi}_3, \boldsymbol{\pi}_4, \lambda_1, \lambda_2, \lambda_3, \lambda_4) = \sum_{j=1}^{4} \log(n_j!) - \sum_{j=1}^{4}\sum_{i=1}^{5} \log(Y_{ij}!) + \sum_{j=1}^{4}\sum_{i=1}^{5} Y_{ij}\log(\pi_{ij}) + \sum_{j=1}^{4} \lambda_j\left(1 - \sum_{i=1}^{5} \pi_{ij}\right)$$
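Solve

$$0 = \frac{\partial g}{\partial \pi_{ij}} = \frac{Y_{ij}}{\pi_{ij}} - \lambda_j, \qquad i = 1, 2, \ldots, 5; \; j = 1, \ldots, 4$$

$$0 = \frac{\partial g}{\partial \lambda_j} = 1 - \sum_{i=1}^{5} \pi_{ij}, \qquad j = 1, \ldots, 4$$

Solution:

$$\hat{\pi}_{ij} = p_{ij} = Y_{ij}/n_j \quad \text{or} \quad \hat{\boldsymbol{\pi}}_j = \mathbf{p}_j = \frac{1}{n_j}\mathbf{Y}_j$$

expected count $= n_j\hat{\pi}_{ij} = Y_{ij}$ = observed count.

Model B: Independence (or homogeneity) model

$$H_0: \boldsymbol{\pi}_1 = \boldsymbol{\pi}_2 = \boldsymbol{\pi}_3 = \boldsymbol{\pi}_4 = \boldsymbol{\pi} = \begin{bmatrix} \pi_1 \\ \vdots \\ \pi_5 \end{bmatrix}$$

Substitute $\boldsymbol{\pi}$ for $\boldsymbol{\pi}_j$ in the log-likelihood and maximize subject to

$$1 = \sum_{i=1}^{5} \pi_i = \mathbf{1}^T\boldsymbol{\pi}.$$

Define

$$g(\boldsymbol{\pi}, \lambda) = \sum_{j=1}^{4} \log(n_j!) - \sum_{j=1}^{4}\sum_{i=1}^{5} \log(Y_{ij}!) + \sum_{i=1}^{5} Y_{i+}\log(\pi_i) + \lambda\left(1 - \sum_{i=1}^{5} \pi_i\right)$$

Solve

$$0 = \frac{\partial g}{\partial \pi_i} = \frac{Y_{i+}}{\pi_i} - \lambda \quad \text{for } i = 1, 2, \ldots, 5$$

and

$$0 = \frac{\partial g}{\partial \lambda} = 1 - \sum_{i=1}^{5} \pi_i$$

Solution:

$$\hat{\pi}_i = Y_{i+} \Big/ \sum_{j=1}^{4} n_j \quad \text{or} \quad \hat{\boldsymbol{\pi}} = \sum_{j=1}^{4} n_j\mathbf{p}_j \Big/ \sum_{j=1}^{4} n_j$$

Expected counts are:

$$\hat{m}_{ij} = n_j\hat{\pi}_i = \frac{n_j Y_{i+}}{\sum_{j=1}^{4} n_j} = \frac{(\text{column total})(\text{row total})}{\text{total for entire table}}$$

Test $H_0: \boldsymbol{\pi}_1 = \boldsymbol{\pi}_2 = \boldsymbol{\pi}_3 = \boldsymbol{\pi}_4$ (model B) against the general alternative (model A).

Compute:

$$X^2 = \sum_i\sum_j \frac{(y_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}} = 87.4$$

$$G^2 = 2\sum_i\sum_j y_{ij}\log\left(\frac{y_{ij}}{\hat{m}_{ij}}\right) = 87.1$$

each with 16 - 4 = 12 d.f.

Conclusion: The proportion of the population in at least one category was not the same in all 4 years.

In which categories or years did changes occur?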
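The Model B estimates and the two test statistics quoted above can be reproduced directly from the table of observed counts. The following is a minimal Python sketch (standard library only); the variable names and the helper `chi2_sf_even` are assumptions introduced here. The helper uses the closed form $e^{-x/2}\sum_{k=0}^{d/2-1} (x/2)^k/k!$ for the chi-squared upper tail, which is valid only when the degrees of freedom $d$ are even, as they are here (12).

```python
from math import exp, log

# Rows: too harshly, about right, not harshly enough, don't know, no answer
# Columns: 1972, 1973, 1974, 1975
counts = [
    [105, 68, 42, 61],
    [265, 196, 72, 144],
    [1066, 1092, 580, 1174],
    [173, 138, 51, 104],
    [4, 10, 8, 7],
]

n_j = [sum(col) for col in zip(*counts)]   # sample size for each year
y_i = [sum(row) for row in counts]         # row totals Y_{i+}
n = sum(n_j)

# Model B: pooled proportions and expected counts n_j * pi_hat_i
pi_hat = [yi / n for yi in y_i]
m_hat = [[nj * p for nj in n_j] for p in pi_hat]

cells = [(y, m) for yr, mr in zip(counts, m_hat) for y, m in zip(yr, mr)]
x2 = sum((y - m) ** 2 / m for y, m in cells)
g2 = 2 * sum(y * log(y / m) for y, m in cells if y > 0)
df = (len(counts) - 1) * (len(n_j) - 1)


def chi2_sf_even(x, df):
    """Chi-squared upper-tail probability; valid only for even df."""
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total


p_x2 = chi2_sf_even(x2, df)
print("n_j =", n_j, " n =", n)
print("X2 =", round(x2, 2), " G2 =", round(g2, 2), " df =", df)
print("p-value for X2 =", p_x2)
```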
Examine
    patterns in observed proportions
    differences between observed and expected counts

Pearson residuals

$$r_{ij} = \frac{Y_{ij} - \hat{m}_{ij}}{\sqrt{\hat{m}_{ij}}}$$

Note that $X^2 = \sum_i\sum_j r_{ij}^2$.

Adjusted residuals

$$\frac{Y_{ij} - \hat{m}_{ij}}{\sqrt{\hat{m}_{ij}\,[1 - \hat{\pi}_i]\,[1 - n_j/n]}}$$

where $n = \sum_{j=1}^{J} n_j$.

Observed Proportions (percent)

                                                          Overall
    Response               1972    1973    1974    1975   Proportion
    Too harshly             6.5     4.5     5.6     4.1      5.2
    About right            16.4    13.0     9.6     9.7     12.6
    Not harshly enough     66.1    72.6    77.0    78.8     73.0
    Don't know             10.7     9.2     6.8     7.0      8.7
    No answer               0.3     0.7     1.1     0.5      0.5
    Sample size            1613    1504     753    1490     5360

From 1972 through 1974 there was an increase in the proportion of the population that felt the courts do not deal harshly enough with criminals and a corresponding decrease in the proportion of "about right" and "don't know" opinions.

Adjusted Residuals

    Response               1972    1973    1974    1975
    Too harshly             3.0    -1.3     0.6    -2.2
    About right             5.5     0.6    -2.7    -4.1
    Not harshly enough     -7.5    -0.4     2.7     5.9
    Don't know              3.5     0.8    -2.0    -2.8
    No answer              -1.9     0.8     2.1    -0.4

Look at the sign and size of the adjusted residuals.

Compare 1974 to 1975

                           Observed Counts     Expected Counts
    Response               1974     1975       1974       1975
    Too harshly              42       61      34.58      68.42
    About right              72      144      72.51     143.49
    Not harshly enough      580     1174     588.84    1165.16
    Don't know               51      104      52.04     102.96
    No answer                 8        7       5.04       9.96
    Sample size             753     1490        753       1490

$$X^2 = \sum_{i=1}^{5}\sum_{j=1}^{2} \frac{(y_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}} = 5.3$$

with 4 d.f. and p-value = .26.

Conclusion: Essentially no changes in the proportions of the population holding various opinions between 1974 and 1975.

Values of $X^2$ (p-values in parentheses) for testing homogeneity for pairs of years

                              Second Year
    First Year      1973            1974            1975
    1972            21.3 (<.001)    41.7 (<.001)    65.9 (<.001)
    1973                            12.0 (.017)     16.5 (.002)
    1974                                             5.3 (.262)

SAS code

    /* This program is stored in the file crim1.sas */

    /* Enter the data as counts in a 2-dimensional table */
    data set1;
      input response year count;
      cards;
    1 1 105
    1 2 68
    1 3 42
    1 4 61
    2 1 265
    2 2 196
    . . .
    . . .
    5 4 7
    run;

    /* Use PROC FORMAT to assign labels to the
       row and column categories */
    proc format;
      value rowfmt 1 = 'Too harshly'
                   2 = 'About right'
                   3 = 'Too lenient'
                   4 = 'No opinion'
                   5 = 'No answer';
      value colfmt 1 = '1972'
                   2 = '1973'
                   3 = '1974'
                   4 = '1975';
    run;

    /* Compute test for independence */
    title 'Annual Opinions on Treatment of Criminals';
    proc freq data=set1;
      table response*year / chisq expected cellchi2;
      weight count;
      format response rowfmt. year colfmt.;
    run;

    /* Compare responses for each pair of years without
       printing tables of counts and proportions */
    data set2; set set1;
      if(year=1 or year=2);
    run;
    title 'Comparison Between 1972 and 1973';
    proc freq data=set2;
      table response*year / chisq noprint;
      weight count;
    run;

    data set2; set set1;
      if(year=1 or year=3);
    run;
    title 'Comparison Between 1972 and 1974';
    proc freq data=set2;
      table response*year / chisq noprint;
      weight count;
    run;

    data set2; set set1;
      if(year=1 or year=4);
    run;
    title 'Comparison Between 1972 and 1975';
    proc freq data=set2;
      table response*year / chisq noprint;
      weight count;
    run;

    data set2; set set1;
      if(year=2 or year=3);
    run;
    title 'Comparison Between 1973 and 1974';
    proc freq data=set2;
      table response*year / chisq noprint;
      weight count;
    run;

    data set2; set set1;
      if(year=2 or year=4);
    run;
    title 'Comparison Between 1973 and 1975';
    proc freq data=set2;
      table response*year / chisq noprint;
      weight count;
    run;

    data set2; set set1;
      if(year=3 or year=4);
    run;
    title 'Comparison Between 1974 and 1975';
    proc freq data=set2;
      table response*year / chisq noprint;
      weight count;
    run;

Output:

    Annual Opinions on Treatment of Criminals
    The FREQ Procedure
    Statistics for Table of response by year

    Statistic                      DF      Value      Prob
    Chi-Square                     12    87.3596    <.0001
    Likelihood Ratio Chi-Square    12    87.0513    <.0001
    Mantel-Haenszel Chi-Square      1    11.1240    0.0009
    Phi Coefficient                       0.1277
    Contingency Coefficient               0.1266
    Cramer's V                            0.0737
    Sample Size = 5360

    Comparison Between 1972 and 1973

    Statistic                      DF      Value      Prob
    Chi-Square                      4    21.2788    0.0003
    Likelihood Ratio Chi-Square     4    21.4459    0.0003
    Mantel-Haenszel Chi-Square      1     7.4150    0.0065
    Phi Coefficient                       0.0826
    Contingency Coefficient               0.0823
    Cramer's V                            0.0826
    Sample Size = 3117

    Comparison Between 1972 and 1974

    Statistic                      DF      Value      Prob
    Chi-Square                      4    41.7255    <.0001
    Likelihood Ratio Chi-Square     4    42.7808    <.0001
    Mantel-Haenszel Chi-Square      1     4.3917    0.0361
    Phi Coefficient                       0.1328
    Contingency Coefficient               0.1316
    Cramer's V                            0.1328
    Sample Size = 2366

    Comparison Between 1972 and 1975

    Statistic                      DF      Value      Prob
    Chi-Square                      4    65.9007    <.0001
    Likelihood Ratio Chi-Square     4    66.6726    <.0001
    Mantel-Haenszel Chi-Square      1    12.4171    0.0004
    Phi Coefficient                       0.1457
    Contingency Coefficient               0.1442
    Cramer's V                            0.1457
    Sample Size = 3103

    Comparison Between 1973 and 1974

    Statistic                      DF      Value      Prob
    Chi-Square                      4    12.0136    0.0173
    Likelihood Ratio Chi-Square     4    12.2566    0.0155
    Mantel-Haenszel Chi-Square      1     0.0076    0.9307
    Phi Coefficient                       0.0730
    Contingency Coefficient               0.0728
    Cramer's V                            0.0730
    Sample Size = 2257

    Comparison Between 1973 and 1975

    Statistic                      DF      Value      Prob
    Chi-Square                      4    16.5413    0.0024
    Likelihood Ratio Chi-Square     4    16.5917    0.0023
    Mantel-Haenszel Chi-Square      1     0.5301    0.4665
    Phi Coefficient                       0.0743
    Contingency Coefficient               0.0741
    Cramer's V                            0.0743
    Sample Size = 2994

    Comparison Between 1974 and 1975

    Statistic                      DF      Value      Prob
    Chi-Square                      4     5.2610    0.2615
    Likelihood Ratio Chi-Square     4     5.0254    0.2847
    Mantel-Haenszel Chi-Square      1     0.4879    0.4848
    Phi Coefficient                       0.0484
    Contingency Coefficient               0.0484
    Cramer's V                            0.0484
    Sample Size = 2243
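The adjusted residuals and the pairwise Pearson statistics reported above can also be computed directly from the counts. The following is a minimal Python sketch (standard library only); the helper names `adjusted_residuals`, `pearson_x2` and `chi2_sf_df4` are assumptions introduced here. The p-value helper uses the closed form $e^{-x/2}(1 + x/2)$, which holds only for 4 d.f., the degrees of freedom of each 5 × 2 comparison.

```python
from math import exp, sqrt

years = [1972, 1973, 1974, 1975]
counts = [
    [105, 68, 42, 61],        # too harshly
    [265, 196, 72, 144],      # about right
    [1066, 1092, 580, 1174],  # not harshly enough
    [173, 138, 51, 104],      # don't know
    [4, 10, 8, 7],            # no answer
]


def adjusted_residuals(table):
    """(y - m) / sqrt(m * (1 - row share) * (1 - column share)) per cell."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    result = []
    for row, r in zip(table, row_tot):
        out = []
        for y, c in zip(row, col_tot):
            m = r * c / n
            out.append((y - m) / sqrt(m * (1 - r / n) * (1 - c / n)))
        result.append(out)
    return result


def pearson_x2(table):
    """Pearson statistic for the homogeneity (independence) model."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return sum((y - r * c / n) ** 2 / (r * c / n)
               for row, r in zip(table, row_tot)
               for y, c in zip(row, col_tot))


def chi2_sf_df4(x):
    """Upper-tail probability of a chi-squared variable with exactly 4 d.f."""
    return exp(-x / 2) * (1 + x / 2)


adj = adjusted_residuals(counts)
for row in adj:
    print([round(v, 1) for v in row])

# Pearson tests for each pair of years (5 x 2 sub-tables, 4 d.f. each)
pairs = {}
for a in range(len(years)):
    for b in range(a + 1, len(years)):
        sub = [[row[a], row[b]] for row in counts]
        x2 = pearson_x2(sub)
        pairs[(years[a], years[b])] = (x2, chi2_sf_df4(x2))
        print(years[a], years[b], round(x2, 1), round(chi2_sf_df4(x2), 3))
```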
S-PLUS code

    # This code is stored in the file crim1.ssc
    #
    # Enter the data as a data frame. This includes
    # column headings that identify the years and
    # row labels that identify the type of response.
    crim1.dat<-read.table("crim1.dat",header=T)

    # Store the data as a matrix object
    crim1.mat<-as.matrix(crim1.dat)

    # List the table
    crim1.mat

    # Compute the Pearson chi-squared test
    # for independence
    chisq.test(crim1.mat)

    # Compute Pearson chi-squared tests
    # for each pair of years
    chisq.test(crim1.mat[ ,c(1,2)])
    chisq.test(crim1.mat[ ,c(1,3)])
    chisq.test(crim1.mat[ ,c(1,4)])
    chisq.test(crim1.mat[ ,c(2,3)])
    chisq.test(crim1.mat[ ,c(2,4)])
    chisq.test(crim1.mat[ ,c(3,4)])

    # There does not seem to be a corresponding
    # built in function to compute just the
    # deviance. The deviance can be obtained
    # from the glm function as follows. We
    # will first enter the data in another
    # way to form a two-way contingency table
    # with factor labels.
    #
    # Enter the counts for the contingency
    # table. The function expand.grid( ) creates
    # a dataframe containing all combinations of
    # its arguments: vectors, factors, or lists.
    crim1 <- cbind(expand.grid(
        Year=c("1972","1973","1974","1975"),
        Opinion=c("Too harshly","About right",
                  "Too lenient","No opinion","No answer")),
        Fr=c(105, 68, 42, 61, 265, 196, 72, 144,
             1066, 1092, 580, 1174, 173, 138, 51, 104,
             4, 10, 8, 7))

    # Fit the independence model using the
    # glm( ) function. Note that a Poisson
    # distribution must be specified, even
    # though we have multinomial data.
    crim1.indep <- glm(Fr ~ Opinion + Year,
                       family=poisson, data=crim1)

    # Display the results
    summary(crim1.indep, correlation=F)

    # If you just want to see the
    # value of the deviance use
    deviance(crim1.indep)

    # The summary( ) function does not supply
    # p-values in this application. Use the
    # anova( ) function with test="Chisq" to
    # get p-values for the deviance tests.
    anova(crim1.indep,test="Chisq")

    # Use fitted( ) to compute the expected
    # counts and compute the value of the
    # Pearson chi-squared statistic
    crim.fit <- as.matrix(fitted(crim1.indep))
    crim.x <- as.matrix(crim1$Fr)
    pearson <- apply((((crim.x - crim.fit)^2)/
                      crim.fit), 2, sum)
    p <- 1.0 - pchisq(pearson,12)

    # Display the results
    cat("\n", "Value of the Pearson statistic: ",
        pearson, "\n")
    cat("\n", "P-value for the Pearson statistic: ",
        p, "\n")

This is the output from the Splus code stored on the file crim1.ssc

    > crim1.dat<-read.table("crim1.dat",header=T)
    > crim1.mat<-as.matrix(crim1.dat)
    > crim1.mat
                1972 1973 1974 1975
    Too_harshly  105   68   42   61
    About_right  265  196   72  144
    Too_lenient 1066 1092  580 1174
     No_opinion  173  138   51  104
      No_answer    4   10    8    7

    > chisq.test(crim1.mat)

         Pearson's chi-square test without Yates'
         continuity correction

    data: crim1.mat
    X-squared = 87.3596, df = 12, p-value = 0

    Warning messages:
      Expected counts < 5. Chi-squared approximation
      may not be appropriate. in: chisq.test(crim1.mat)

    > chisq.test(crim1.mat[ ,c(1,2)])
    > chisq.test(crim1.mat[ ,c(1,3)])
    > chisq.test(crim1.mat[ ,c(1,4)])
    > chisq.test(crim1.mat[ ,c(2,3)])
    > chisq.test(crim1.mat[ ,c(2,4)])
    > chisq.test(crim1.mat[ ,c(3,4)])

         Pearson's chi-square test without Yates'
         continuity correction

    data: crim1.mat[, c(1, 2)]
    X-squared = 21.2788, df = 4, p-value = 3e-04

         Pearson's chi-square test without Yates'
         continuity correction

    data: crim1.mat[, c(1, 3)]
    X-squared = 41.7255, df = 4, p-value = 0

    Warning messages:
      Expected counts < 5. Chi-squared approximation
      may not be appropriate.
      in: chisq.test(crim1.mat[, c(1, 3)])

         Pearson's chi-square test without Yates'
         continuity correction

    data: crim1.mat[, c(1, 4)]
    X-squared = 65.9007, df = 4, p-value = 0

         Pearson's chi-square test without Yates'
         continuity correction

    data: crim1.mat[, c(2, 3)]
    X-squared = 12.0136, df = 4, p-value = 0.0173

         Pearson's chi-square test without Yates'
         continuity correction

    data: crim1.mat[, c(2, 4)]
    X-squared = 16.5413, df = 4, p-value = 0.0024

         Pearson's chi-square test without Yates'
         continuity correction

    data: crim1.mat[, c(3, 4)]
    X-squared = 5.261, df = 4, p-value = 0.2615

    > crim1 <- cbind(expand.grid(
        Year=c("1972","1973","1974","1975"),
        Opinion=c("Too harshly","About right",
                  "Too lenient", "No opinion","No answer")),
        Fr=c(105, 68, 42, 61, 265, 196, 72, 144,
             1066, 1092, 580, 1174, 173, 138, 51, 104,
             4, 10, 8, 7))

    > crim1.indep <- glm(Fr ~ Opinion + Year,
                         family=poisson, data=crim1)

    > summary(crim1.indep, correlation=F)

    Call: glm(formula = Fr ~ Opinion + Year,
              family = poisson, data = crim1)

    Deviance Residuals:
           Min       1Q    Median       3Q      Max
     -3.361964 -1.86109 0.1317903 1.393514 4.100514

    Coefficients:
                      Value  Std. Error    t value
    (Intercept)  4.55563937 0.041097475 110.849617
       Opinion1  0.44863522 0.035704247  12.565318
       Opinion2  0.73425598 0.013040165  56.307260
       Opinion3 -0.16477661 0.013087555 -12.590328
       Opinion4 -0.65423965 0.037270056 -17.554029
          Year1 -0.03498379 0.017921519  -1.952055
          Year2 -0.24226735 0.013536157 -17.897794
          Year3  0.04948287 0.007751406   6.383728

    (Dispersion Parameter for Poisson family taken to be 1 )

        Null Deviance: 8251.947 on 19 degrees of freedom
    Residual Deviance: 87.05134 on 12 degrees of freedom

    Number of Fisher Scoring Iterations: 3

    > anova(crim1.indep,test="Chisq")
    Analysis of Deviance Table

    Poisson model

    Response: Fr

    Terms added sequentially (first to last)
            Df Deviance Resid. Df Resid. Dev Pr(Chi)
       NULL                    19   8251.947
    Opinion  4 7771.211        15    480.735       0
       Year  3  393.684        12     87.051       0

    > crim.fit <- as.matrix(fitted(crim1.indep))
    > crim.x <- as.matrix(crim1$Fr)
    > pearson <- apply((((crim.x - crim.fit)^2)
                        /crim.fit), 2, sum)
    > p <- 1.0 - pchisq(pearson,12)

    > cat("\n", "Value of the Pearson statistic: ",
          pearson, "\n")

     Value of the Pearson statistic:  87.359438070264

    > cat("\n", "P-value for the Pearson statistic: ",
          p, "\n")

     P-value for the Pearson statistic:  1.598721e-13

S-PLUS for Windows

Click on Data → Select Data → click on New Data and enter a data file name in the New Data box → enter the data in the spreadsheet that appears, with columns labeled response, year, and y for response category, year, and count, respectively.

Click on Statistics → Data Summaries → Crosstabulations → make selections in the boxes.

    *** Crosstabulations ***
    Call:
    crosstabs(formula = y ~ response + year, data = crim1,
              na.action = na.fail, drop.unused.levels = T)
    5360 cases in table
    +----------+
    |N         |
    |N/RowTotal|
    |N/ColTotal|
    |N/Total   |
    +----------+
    response|year
            |1       |2       |3       |4       |RowTotl|
    --------+--------+--------+--------+--------+-------+
    1       | 105    |  68    |  42    |  61    |276    |
            |0.3804  |0.2464  |0.1522  |0.221   |0.05149|
            |0.0651  |0.04521 |0.05578 |0.04094 |       |
            |0.01959 |0.01269 |0.007836|0.01138 |       |
    --------+--------+--------+--------+--------+-------+
    2       | 265    | 196    |  72    | 144    |677    |
            |0.3914  |0.2895  |0.1064  |0.2127  |0.1263 |
            |0.1643  |0.1303  |0.09562 |0.09664 |       |
            |0.04944 |0.03657 |0.01343 |0.02687 |       |
    --------+--------+--------+--------+--------+-------+
    3       |1066    |1092    | 580    |1174    |3912   |
            |0.2725  |0.2791  |0.1483  |0.3001  |0.7299 |
            |0.6609  |0.7261  |0.7703  |0.7879  |       |
            |0.1989  |0.2037  |0.1082  |0.219   |       |
    --------+--------+--------+--------+--------+-------+
    4       | 173    | 138    |  51    | 104    |466    |
            |0.3712  |0.2961  |0.1094  |0.2232  |0.08694|
            |0.1073  |0.09176 |0.06773 |0.0698  |       |
            |0.03228 |0.02575 |0.009515|0.0194  |       |
    --------+--------+--------+--------+--------+-------+
    5       |   4    |  10    |   8    |   7    |29     |
            |0.1379  |0.3448  |0.2759  |0.2414  |0.00541|
            |0.00248 |0.006649|0.01062 |0.004698|       |
            |7.463e-4|0.001866|0.001493|0.001306|       |
    --------+--------+--------+--------+--------+-------+
    ColTotal|1613    |1504    |753     |1490    |5360   |
            |0.3009  |0.2806  |0.1405  |0.278   |       |
    --------+--------+--------+--------+--------+-------+

    Test for independence of all factors
         Chi^2 = 87.35959 d.f.= 12 (p=1.598721e-013)
         Yates' correction not used
         Some expected values are less than 5, don't trust p-value
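A note on the glm( ) fit used in the S-PLUS code: the comment there says a Poisson distribution must be specified even though the data are multinomial. The worked equations below are a sketch of why the residual deviance from that Poisson fit (87.05134 on 12 d.f.) agrees with the multinomial $G^2$ for the homogeneity model. The notation $\hat{\mu}_{ij}$ for the Poisson fitted values is introduced here; the argument relies on the standard property that, for a Poisson log-linear model with main effects for rows and columns, the fitted values reproduce the observed row and column totals.

```latex
% Poisson deviance for fitted means \hat{\mu}_{ij}:
\begin{align*}
D &= 2\sum_i\sum_j \left[\, y_{ij}\log\!\left(\frac{y_{ij}}{\hat{\mu}_{ij}}\right)
      - \left(y_{ij} - \hat{\mu}_{ij}\right) \right].
\end{align*}

% With main effects for Opinion and Year, the likelihood equations force
% \sum_j \hat{\mu}_{ij} = y_{i+} and \sum_i \hat{\mu}_{ij} = y_{+j} = n_j,
% which gives
\begin{align*}
\hat{\mu}_{ij} &= \frac{y_{i+}\, n_j}{n} = \hat{m}_{ij},
\qquad
\sum_i\sum_j \left(y_{ij} - \hat{\mu}_{ij}\right) = 0,
\end{align*}

% so the second term of the deviance vanishes and
\begin{align*}
D &= 2\sum_i\sum_j y_{ij}\log\!\left(\frac{y_{ij}}{\hat{m}_{ij}}\right) = G^2 .
\end{align*}
```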