Discrete Multivariate Analysis

Analysis of Multivariate Categorical Data

Example 1

In this study we examine n = 1237 individuals, measuring X, Systolic Blood Pressure, and Y, Serum Cholesterol.

Data Set #1 - A two-way frequency table

                          Systolic Blood Pressure
Serum Cholesterol     <127   127-146   147-166   167+    Total
<200                   117     121        47       22      307
200-219                 85      98        43       20      246
220-259                119     209        68       43      439
260+                    67      99        46       33      245
Total                  388     527       204      118     1237

Example 2

The following data were taken from a study of parole success involving 5587 parolees in Ohio between 1965 and 1972 (a ten percent sample of all parolees during this period).

The study involved a dichotomous response Y, based on a one-year follow-up:
• Success (no major parole violation), or
• Failure (returned to prison either as a technical violator or with a new conviction).

The predictors of parole success included are:
1. Type of committed offense (Person offense or Other offense),
2. Age (25 or Older or Under 25),
3. Prior Record (No prior sentence or Prior sentence), and
4. Drug or Alcohol Dependency (No drug or alcohol dependency or Drug and/or alcohol dependency).

• The data were randomly split into two parts. The counts for each part are displayed in the table, with those for the second part in parentheses.
• The second part of the data was set aside for a validation study of the model to be fitted to the first part.
Table: Parole outcomes (counts for the second part of the data in parentheses)

                                  No drug or alcohol dependency       Drug and/or alcohol dependency
                                  25 or older       Under 25          25 or older       Under 25
                                  Person   Other    Person   Other    Person   Other    Person   Other
                                  offense  offense  offense  offense  offense  offense  offense  offense
No prior sentence    Success        48       34       37       49       48       28       35       57
of any kind                        (44)     (34)     (29)     (58)     (47)     (38)     (37)     (53)
                     Failure         1        5        7       11        3        8        5       18
                                    (1)      (7)      (7)      (5)      (1)      (2)      (4)     (24)
Prior sentence       Success       117      259      131      319      197      435      107      291
                                  (111)    (253)    (131)    (320)    (202)    (392)    (103)    (294)
                     Failure        23       61       20       89       38      194       27      101
                                   (27)     (55)     (25)     (93)     (46)    (215)     (34)    (102)

Analysis of a Two-way Frequency Table

Frequency Distribution (Serum Cholesterol and Systolic Blood Pressure)

                          Systolic Blood Pressure
Serum Cholesterol     <127   127-146   147-166   167+    Total
<200                   117     121        47       22      307
200-219                 85      98        43       20      246
220-259                119     209        68       43      439
260+                    67      99        46       33      245
Total                  388     527       204      118     1237

Joint and Marginal Distributions (Serum Cholesterol and Systolic Blood Pressure), in percent

                          Systolic Blood Pressure                 Marginal distn
Serum Cholesterol     <127   127-146   147-166   167+    (Serum Chol.)
<200                   9.46    9.78      3.80     1.78       24.82
200-219                6.87    7.92      3.48     1.62       19.89
220-259                9.62   16.90      5.50     3.48       35.49
260+                   5.42    8.00      3.72     2.67       19.81
Marginal distn (BP)   31.37   42.60     16.49     9.54      100.00

The marginal distributions allow you to look at the effect of one variable, ignoring the other. The joint distribution allows you to look at the two variables simultaneously.

Conditional Distributions (Systolic Blood Pressure given Serum Cholesterol), in percent

                          Systolic Blood Pressure
Serum Cholesterol     <127   127-146   147-166   167+    Total
<200                  38.11   39.41     15.31     7.17    100.00
200-219               34.55   39.84     17.48     8.13    100.00
220-259               27.11   47.61     15.49     9.79    100.00
260+                  27.35   40.41     18.78    13.47    100.00
Marginal distn (BP)   31.37   42.60     16.49     9.54    100.00

The conditional distribution allows you to look at the effect of one variable when the other variable is held fixed or known.
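The joint, marginal, and conditional percentages above can be recomputed from the raw counts. The following is a minimal sketch in plain Python, not part of the original slides; the function name `distributions` and the dictionary keys are illustrative assumptions.

```python
# Observed counts: rows = serum cholesterol, columns = systolic blood pressure
counts = [
    [117, 121, 47, 22],   # <200
    [85, 98, 43, 20],     # 200-219
    [119, 209, 68, 43],   # 220-259
    [67, 99, 46, 33],     # 260+
]


def distributions(table):
    """Return joint, marginal and conditional distributions (in percent)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return {
        # joint distribution: each cell divided by the grand total
        "joint": [[100 * x / n for x in row] for row in table],
        # marginal distributions: row and column totals divided by the grand total
        "row_marginal": [100 * t / n for t in row_totals],
        "col_marginal": [100 * t / n for t in col_totals],
        # column variable given the row variable: each row sums to 100
        "col_given_row": [[100 * x / t for x in row]
                          for row, t in zip(table, row_totals)],
        # row variable given the column variable: each column sums to 100
        "row_given_col": [[100 * x / t for x, t in zip(row, col_totals)]
                          for row in table],
    }


dist = distributions(counts)
print([round(p, 2) for p in dist["col_given_row"][0]])
```

Here `col_given_row` corresponds to Systolic Blood Pressure given Serum Cholesterol, because cholesterol indexes the rows of `counts`.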
Conditional Distributions (Serum Cholesterol given Systolic Blood Pressure), in percent

                          Systolic Blood Pressure                 Marginal distn
Serum Cholesterol     <127   127-146   147-166   167+    (Serum Chol.)
<200                  30.15   22.96     23.04    18.64       24.82
200-219               21.91   18.60     21.08    16.95       19.89
220-259               30.67   39.66     33.33    36.44       35.49
260+                  17.27   18.79     22.55    27.97       19.81
Total                100.00  100.00    100.00   100.00      100.00

[Figure: Conditional distributions of Systolic Blood Pressure given Serum Cholesterol. Horizontal axis: systolic blood pressure (<127, 127-146, 147-166, 167+). Vertical axis: percent (10% to 50%). One series for each serum cholesterol level (<200, 200-219, 220-259, 260+) and one for the marginal distribution.]

Notation

Let $x_{ij}$ denote the frequency (number of cases) where X (the row variable) is i and Y (the column variable) is j.

$$x_{i\cdot} = R_i = \sum_{j=1}^{c} x_{ij}, \qquad x_{\cdot j} = C_j = \sum_{i=1}^{r} x_{ij}$$

$$x_{\cdot\cdot} = N = \sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij} = \sum_{i=1}^{r} x_{i\cdot} = \sum_{j=1}^{c} x_{\cdot j}$$

Different Models

The Multinomial Model: Here the total number of cases N is fixed and the $x_{ij}$ follow a multinomial distribution with parameters $\pi_{ij} = P[X = i, Y = j]$.

$$f(x_{11}, x_{12}, \dots, x_{rc}) = \frac{N!}{x_{11}!\, x_{12}! \cdots x_{rc}!}\; \pi_{11}^{x_{11}} \pi_{12}^{x_{12}} \cdots \pi_{rc}^{x_{rc}}, \qquad \mu_{ij} = E(x_{ij}) = N\pi_{ij}$$

The Product Multinomial Model: Here the row (or column) totals $R_i$ are fixed and, for a given row i, the $x_{ij}$ follow a multinomial distribution with parameters $\pi_{j|i}$.

$$f(x_{11}, x_{12}, \dots, x_{rc}) = \prod_{i=1}^{r} \frac{R_i!}{x_{i1}!\, x_{i2}! \cdots x_{ic}!}\; \pi_{1|i}^{x_{i1}} \pi_{2|i}^{x_{i2}} \cdots \pi_{c|i}^{x_{ic}}, \qquad \mu_{ij} = E(x_{ij}) = R_i\,\pi_{j|i}$$

The Poisson Model: In this case we observe over a fixed period of time and all counts in the table (including row, column and overall totals) follow a Poisson distribution. Let $\mu_{ij} = E(x_{ij})$ denote the mean of $x_{ij}$.

$$f_{ij}(x_{ij}) = \frac{\mu_{ij}^{x_{ij}}}{x_{ij}!}\, e^{-\mu_{ij}}, \qquad f(x_{11}, x_{12}, \dots, x_{rc}) = \prod_{i=1}^{r}\prod_{j=1}^{c} \frac{\mu_{ij}^{x_{ij}}}{x_{ij}!}\, e^{-\mu_{ij}}$$

Independence

Multinomial Model: if X and Y are independent,

$$\pi_{ij} = P[X = i, Y = j] = P[X = i]\,P[Y = j] = \pi_{i\cdot}\,\pi_{\cdot j} \qquad \text{and} \qquad \mu_{ij} = N\pi_{ij} = N\,\pi_{i\cdot}\,\pi_{\cdot j}$$

The estimated expected frequency in cell (i,j) in the case of independence is:

$$m_{ij} = \hat\mu_{ij} = N\,\hat\pi_{i\cdot}\,\hat\pi_{\cdot j} = N\,\frac{x_{i\cdot}}{N}\,\frac{x_{\cdot j}}{N} = \frac{x_{i\cdot}\,x_{\cdot j}}{N} = \frac{R_i C_j}{N}$$

The same can be shown for the other two models, the Product Multinomial model and the Poisson model. In each case the estimated expected frequency in cell (i,j) under independence is:

$$m_{ij} = \hat\mu_{ij} = \frac{R_i C_j}{N} = \frac{x_{i\cdot}\,x_{\cdot j}}{x_{\cdot\cdot}}$$

Standardized residuals are defined for each cell:

$$r_{ij} = \frac{x_{ij} - m_{ij}}{\sqrt{m_{ij}}}$$

The Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - m_{ij})^2}{m_{ij}}$$

The Chi-Square test for independence

Reject $H_0$: independence if

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - m_{ij})^2}{m_{ij}} > \chi^2_{\alpha}, \qquad \text{df} = (r-1)(c-1),$$

where $\chi^2_{\alpha}$ is the upper $\alpha$ critical value of the chi-square distribution.

Table: Expected frequencies, (Observed frequencies), Standardized residuals

                               Systolic Blood Pressure
Serum Cholesterol           <127     127-146   147-166    167+     Total
<200       expected         96.29    130.79     50.63     29.29
           (observed)       (117)    (121)      (47)      (22)      307
           std. residual     2.11    -0.86      -0.51     -1.35
200-219    expected         77.16    104.80     40.47     23.47
           (observed)       (85)     (98)       (43)      (20)      246
           std. residual     0.86    -0.66       0.38     -0.72
220-259    expected        137.70    187.03     72.40     41.88
           (observed)       (119)    (209)      (68)      (43)      439
           std. residual    -1.59     1.61      -0.52      0.17
260+       expected         76.85    104.38     40.04     23.37
           (observed)       (67)     (99)       (46)      (33)      245
           std. residual    -1.12    -0.53       0.88      1.99
Total                        388      527        204       118     1237

$\chi^2 = 20.85$ (p = 0.0133)

Example

In this example, N = 57,407 cases in which an individual was victimized twice by crime were studied. The crime of the first victimization (X) and the crime of the second victimization (Y) were noted.
The data are tabulated in Table 1.

Table 1: Frequencies

First                          Second Victimization in Pair
Victimization
in pair         Ra       A      Ro   PP/PS      PL       B      HL      MV    Total
Ra              26      50      11       6      82      39      48      11      273
A               65    2997     238      85    2553    1083    1349     216     8586
Ro              12     279     197      36     459     197     221      47     1448
PP/PS            3     102      40      61     243     115     101      38      703
PL              75    2628     413     229   12137    2658    3689     687    22516
B               52    1117     191     102    2649    3210    1973     301     9595
HL              42    1251     206     117    3757    1962    4646     391    12372
MV               3     221      51      24     678     301     367     269     1914
Total          278    8645    1347     660   22558    9565   12394    1960    57407

Table 2: Standardized residuals

First                          Second Victimization in Pair
Victimization
in pair         Ra       A      Ro   PP/PS      PL       B      HL      MV
Ra            21.5     1.4     1.8     1.6    -2.4    -1.0    -1.9     0.6
A              3.6    47.4     2.6    -1.4   -14.1    -9.2   -11.7    -4.5
Ro             1.9     4.1    28.0     4.7    -4.6    -2.8    -5.2    -0.3
PP/PS         -0.2    -0.4     5.8    18.6    -2.0    -0.2    -4.1     2.9
PL            -3.3   -13.1    -5.0    -1.9    35.0   -17.9   -16.8    -2.9
B              0.8    -8.6    -2.3    -0.8   -18.3    40.3    -2.2    -1.5
HL            -2.3   -14.2    -4.9    -2.1   -15.8    -2.2    38.2    -1.5
MV            -2.1    -4.0     0.9     0.4    -2.7    -1.0    -2.3    25.2

$\chi^2 = 11{,}430$ (highly significant)

Table 3: Conditional distribution of second victimization given the first victimization (%)

First                          Second Victimization in Pair
Victimization
in pair         Ra       A      Ro   PP/PS      PL       B      HL      MV    Total
Ra             9.5    18.3     4.0     2.2    30.0    14.3    17.6     4.0    100.0
A              0.8    34.9     2.8     1.0    29.7    12.6    15.7     2.5    100.0
Ro             0.8    19.3    13.6     2.5    31.7    13.6    15.3     3.2    100.0
PP/PS          0.4    14.5     5.7     8.7    34.6    16.4    14.4     5.4    100.0
PL             0.3    11.7     1.8     1.0    53.9    11.8    16.4     3.1    100.0
B              0.5    11.6     2.0     1.1    27.6    33.5    20.6     3.1    100.0
HL             0.3    10.1     1.7     0.9    30.4    15.9    37.6     3.2    100.0
MV             0.2    11.5     2.7     1.3    35.4    15.7    19.2    14.1    100.0
Marginal       0.5    15.1     2.3     1.1    39.3    16.7    21.6     3.4    100.0

Log Linear Model

Recall that if the two variables, rows (X) and columns (Y), are independent then

$$\mu_{ij} = N\pi_{ij} = N\,\pi_{i\cdot}\,\pi_{\cdot j} \qquad \text{and} \qquad \ln\mu_{ij} = \ln N + \ln\pi_{i\cdot} + \ln\pi_{\cdot j}$$

In general let

$$u = \frac{1}{rc}\sum_{i=1}^{r}\sum_{j=1}^{c} \ln\mu_{ij}$$

$$u_{1(i)} = \frac{1}{c}\sum_{j=1}^{c} \ln\mu_{ij} - u, \qquad u_{2(j)} = \frac{1}{r}\sum_{i=1}^{r} \ln\mu_{ij} - u$$

$$u_{12(i,j)} = \ln\mu_{ij} - u - u_{1(i)} - u_{2(j)}$$

then

$$\ln\mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)} \qquad (1)$$

where

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{i} u_{12(i,j)} = \sum_{j} u_{12(i,j)} = 0$$

Equation (1) is called the log-linear model for the frequencies $x_{ij}$.
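The decomposition in equation (1) can be sketched numerically. The example below is a plain-Python illustration, not part of the original slides: the function name `loglinear_terms` is hypothetical, and the input is the blood pressure by cholesterol table from Example 1. It applies the averaging definitions twice, once to the observed counts and once to the independence fitted values $R_i C_j / N$.

```python
import math


def loglinear_terms(mu):
    """Split ln(mu_ij) into u, u1(i), u2(j), u12(i,j) using the averaging definitions."""
    r, c = len(mu), len(mu[0])
    logs = [[math.log(m) for m in row] for row in mu]
    u = sum(map(sum, logs)) / (r * c)                # grand mean of the logs
    u1 = [sum(row) / c - u for row in logs]          # row effects
    u2 = [sum(col) / r - u for col in zip(*logs)]    # column effects
    u12 = [[logs[i][j] - u - u1[i] - u2[j] for j in range(c)]
           for i in range(r)]                        # interaction terms
    return u, u1, u2, u12


# rows = serum cholesterol, columns = systolic blood pressure
counts = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]
n = sum(map(sum, counts))
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# expected frequencies under independence: R_i * C_j / N
fitted = [[ri * cj / n for cj in col_totals] for ri in row_totals]

u_obs, u1_obs, u2_obs, u12_obs = loglinear_terms(counts)
u_ind, u1_ind, u2_ind, u12_ind = loglinear_terms(fitted)

largest_obs = max(abs(v) for row in u12_obs for v in row)
largest_ind = max(abs(v) for row in u12_ind for v in row)
print(largest_obs, largest_ind)
```

Applied to the observed counts, the interaction terms are generally nonzero; applied to the fitted values $R_i C_j / N$, they vanish up to floating-point rounding, because the log of a product of a row factor and a column factor is additive.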
Note: X and Y are independent if

$$u_{12(i,j)} = 0 \quad \text{for all } i, j$$

In this case the log-linear model becomes

$$\ln\mu_{ij} = u + u_{1(i)} + u_{2(j)}$$

Another formulation

$$\ln\mu_{ij} = u^* + u^*_{1(i)} + u^*_{2(j)} + u^*_{12(i,j)}$$

where

$$u^*_{1(I)} = u^*_{2(J)} = u^*_{12(I,j)} = u^*_{12(i,J)} = 0$$

Here I and J denote the reference row and column categories.

Three-way Frequency Tables

With two variables the dependence structure is simple: the variables are either dependent or independent. When there are three or more variables the dependence structure is much more complicated.

Marginal distributions

Distributions of two variables ignoring the third.
1. X1, X2 ignoring X3
2. X1, X3 ignoring X2
3. X2, X3 ignoring X1

Distributions of one variable ignoring the other two.
1. X1 ignoring X2, X3
2. X2 ignoring X1, X3
3. X3 ignoring X1, X2

Conditional distributions

Distributions of two variables given the third.
1. X1, X2 given X3
2. X1, X3 given X2
3. X2, X3 given X1

Distributions of one variable given the other two.
1. X1 given X2, X3
2. X2 given X1, X3
3. X3 given X1, X2

Distributions of one variable given either of the other two.
1. X1 given X2
2. X1 given X3
3. X2 given X1
4. X2 given X3
5. X3 given X1
6. X3 given X2

Example

Data from the Framingham Longitudinal Study of Coronary Heart Disease (Cornfield [1962])

Variables
1. Systolic Blood Pressure (X): <127, 127-146, 147-166, 167+
2. Serum Cholesterol: <200, 200-219, 220-259, 260+
3. Heart Disease: Present, Absent

The data are tabulated below.

Three-way Frequency Table

Coronary         Serum Cholesterol     Systolic Blood Pressure (mm Hg)
Heart Disease    (mg/100 cc)           <127   127-146   147-166   167+
Present          <200                     2       3         3        4
                 200-219                  3       2         0        3
                 220-259                  8      11         6        6
                 260+                     7      12        11       11
Absent           <200                   117     121        47       22
                 200-219                 85      98        43       20
                 220-259                119     209        68       43
                 260+                    67      99        46       33

Log-Linear model for three-way tables

Let $\mu_{ijk}$ denote the expected frequency in cell (i,j,k) of the table. Then in general

$$\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$$

where

$$\sum_i u_{1(i)} = \sum_j u_{2(j)} = \sum_k u_{3(k)} = 0$$

$$\sum_i u_{12(i,j)} = \sum_j u_{12(i,j)} = \sum_i u_{13(i,k)} = \sum_k u_{13(i,k)} = \sum_j u_{23(j,k)} = \sum_k u_{23(j,k)} = 0$$

$$\sum_i u_{123(i,j,k)} = \sum_j u_{123(i,j,k)} = \sum_k u_{123(i,j,k)} = 0$$

Hierarchical Log-linear models for categorical Data

For three-way tables

The hierarchical principle: If an interaction is in the model, also keep the lower order interactions and main effects associated with that interaction.

1. Model (all main effects model): $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)}$
   i.e. $u_{12(i,j)} = u_{13(i,k)} = u_{23(j,k)} = u_{123(i,j,k)} = 0$.
   Notation: [1][2][3]
   Description: Mutual independence between all three variables.

2. Model: $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)}$
   i.e. $u_{13(i,k)} = u_{23(j,k)} = u_{123(i,j,k)} = 0$.
   Notation: [12][3]
   Description: Independence of Variable 3 with variables 1 and 2.

3. Model: $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(i,k)}$
   i.e. $u_{12(i,j)} = u_{23(j,k)} = u_{123(i,j,k)} = 0$.
   Notation: [13][2]
   Description: Independence of Variable 2 with variables 1 and 3.

4. Model: $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{23(j,k)}$
   i.e. $u_{12(i,j)} = u_{13(i,k)} = u_{123(i,j,k)} = 0$.
   Notation: [23][1]
   Description: Independence of Variable 1 with variables 2 and 3.

5. Model: $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)}$
   i.e. $u_{23(j,k)} = u_{123(i,j,k)} = 0$.
   Notation: [12][13]
   Description: Conditional independence between variables 2 and 3 given variable 1.

6. Model: $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{23(j,k)}$
   i.e. $u_{13(i,k)} = u_{123(i,j,k)} = 0$.
   Notation: [12][23]
   Description: Conditional independence between variables 1 and 3 given variable 2.

7. Model: $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(i,k)} + u_{23(j,k)}$
   i.e. $u_{12(i,j)} = u_{123(i,j,k)} = 0$.
   Notation: [13][23]
   Description: Conditional independence between variables 1 and 2 given variable 3.

8. Model: $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)}$
   i.e. $u_{123(i,j,k)} = 0$.
   Notation: [12][13][23]
   Description: Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.

9. Model (the saturated model): $\ln\mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$
   Notation: [123]
   Description: No simplifying dependence structure.

Hierarchical Log-linear models for 3-way table

Model            Description
[1][2][3]        Mutual independence between all three variables.
[1][23]          Independence of Variable 1 with variables 2 and 3.
[2][13]          Independence of Variable 2 with variables 1 and 3.
[3][12]          Independence of Variable 3 with variables 1 and 2.
[12][13]         Conditional independence between variables 2 and 3 given variable 1.
[12][23]         Conditional independence between variables 1 and 3 given variable 2.
[13][23]         Conditional independence between variables 1 and 2 given variable 3.
[12][13][23]     Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
[123]            The saturated model.

Maximum Likelihood Estimation

Log-Linear Model

For any model it is possible to determine the maximum likelihood estimators of the parameters.

Example: Two-way table, independence, multinomial model

$$\mu_{ij} = E(x_{ij}) = N\pi_{ij} \qquad \text{or} \qquad \pi_{ij} = \frac{\mu_{ij}}{N}$$

$$f(x_{11}, x_{12}, \dots, x_{rc}) = \frac{N!}{x_{11}! \cdots x_{rc}!}\; \pi_{11}^{x_{11}} \pi_{12}^{x_{12}} \cdots \pi_{rc}^{x_{rc}} = \frac{N!}{x_{11}! \cdots x_{rc}!} \left(\frac{\mu_{11}}{N}\right)^{x_{11}} \left(\frac{\mu_{12}}{N}\right)^{x_{12}} \cdots \left(\frac{\mu_{rc}}{N}\right)^{x_{rc}}$$

Log-likelihood

$$l(\mu_{11}, \mu_{12}, \dots, \mu_{rc}) = \ln N! - \sum_i\sum_j \ln x_{ij}! - \sum_i\sum_j x_{ij}\ln N + \sum_i\sum_j x_{ij}\ln\mu_{ij} = K + \sum_i\sum_j x_{ij}\ln\mu_{ij}$$

where

$$K = \ln N! - \sum_i\sum_j \ln x_{ij}! - N\ln N$$

With the model of independence, $\ln\mu_{ij} = u + u_{1(i)} + u_{2(j)}$ and

$$l(u, u_{1(1)}, \dots, u_{1(r)}, u_{2(1)}, \dots, u_{2(c)}) = K + \sum_i\sum_j x_{ij}\left[u + u_{1(i)} + u_{2(j)}\right] = K + Nu + \sum_i x_{i\cdot}\,u_{1(i)} + \sum_j x_{\cdot j}\,u_{2(j)}$$

with $\sum_i u_{1(i)} = \sum_j u_{2(j)} = 0$. Also

$$\sum_i\sum_j \mu_{ij} = \sum_i\sum_j e^{u + u_{1(i)} + u_{2(j)}} = e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = N$$

Let

$$g(u, u_{1(1)}, \dots, u_{1(r)}, u_{2(1)}, \dots, u_{2(c)}, \lambda_1, \lambda_2, \lambda) = K + Nu + \sum_i x_{i\cdot}\,u_{1(i)} + \sum_j x_{\cdot j}\,u_{2(j)} + \lambda_1\sum_i u_{1(i)} + \lambda_2\sum_j u_{2(j)} + \lambda\left[e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} - N\right]$$

Now

$$\frac{\partial g}{\partial u} = N + \lambda\, e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = N + \lambda N = N(1 + \lambda) = 0, \qquad \text{so } \lambda = -1$$

$$\frac{\partial g}{\partial u_{1(i)}} = x_{i\cdot} + \lambda_1 + \lambda\, e^{u} e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = x_{i\cdot} + \lambda_1 - \frac{e^{u_{1(i)}}}{\sum_{i} e^{u_{1(i)}}}\, N = 0$$

so that

$$\frac{e^{u_{1(i)}}}{\sum_{i} e^{u_{1(i)}}} = \frac{x_{i\cdot} + \lambda_1}{N}$$

Since the left-hand side sums to 1 over i,

$$\sum_i \frac{x_{i\cdot} + \lambda_1}{N} = \frac{N + r\lambda_1}{N} = 1 \qquad \text{and} \qquad \lambda_1 = 0$$

Now $e^{u_{1(i)}} = x_{i\cdot}/K_1$ for a constant $K_1$ that does not depend on i, or

$$u_{1(i)} = \ln x_{i\cdot} - \ln K_1$$

$$\sum_i u_{1(i)} = \sum_i \ln x_{i\cdot} - r\ln K_1 = 0, \qquad \ln K_1 = \frac{1}{r}\sum_i \ln x_{i\cdot}$$

Hence

$$u_{1(i)} = \ln x_{i\cdot} - \frac{1}{r}\sum_i \ln x_{i\cdot}$$

Similarly

$$u_{2(j)} = \ln x_{\cdot j} - \frac{1}{c}\sum_j \ln x_{\cdot j}$$

Finally,

$$\sum_i\sum_j \mu_{ij} = e^{u}\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}} = N \qquad \text{hence} \qquad e^{u} = \frac{N}{\sum_i e^{u_{1(i)}}\sum_j e^{u_{2(j)}}}$$

Now

$$e^{u_{1(i)}} = \frac{x_{i\cdot}}{\left(\prod_{i=1}^{r} x_{i\cdot}\right)^{1/r}} \qquad \text{and} \qquad e^{u_{2(j)}} = \frac{x_{\cdot j}}{\left(\prod_{j=1}^{c} x_{\cdot j}\right)^{1/c}}$$

so

$$e^{u} = \frac{N\left(\prod_{i} x_{i\cdot}\right)^{1/r}\left(\prod_{j} x_{\cdot j}\right)^{1/c}}{\sum_i x_{i\cdot}\sum_j x_{\cdot j}} = \frac{1}{N}\left(\prod_{i=1}^{r} x_{i\cdot}\right)^{1/r}\left(\prod_{j=1}^{c} x_{\cdot j}\right)^{1/c}$$

Hence

$$u = \frac{1}{r}\sum_i \ln x_{i\cdot} + \frac{1}{c}\sum_j \ln x_{\cdot j} - \ln N$$

Note

$$\ln\mu_{ij} = u + u_{1(i)} + u_{2(j)} = \left[\frac{1}{r}\sum_i \ln x_{i\cdot} + \frac{1}{c}\sum_j \ln x_{\cdot j} - \ln N\right] + \left[\ln x_{i\cdot} - \frac{1}{r}\sum_i \ln x_{i\cdot}\right] + \left[\ln x_{\cdot j} - \frac{1}{c}\sum_j \ln x_{\cdot j}\right] = \ln x_{i\cdot} + \ln x_{\cdot j} - \ln N$$

or

$$\hat\mu_{ij} = \frac{x_{i\cdot}\,x_{\cdot j}}{N}$$

Comments
• Maximum likelihood estimates can be computed for any hierarchical log-linear model, including models with more than 2 variables.
• In certain situations the equations need to be solved numerically.
• For the saturated model (all interactions and main effects), the estimate of $\mu_{ijk\ldots}$ is $x_{ijk\ldots}$.

Goodness of Fit Statistics

These statistics can be used to check whether a log-linear model fits the observed frequency table.

The Chi-squared statistic

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum \frac{(x_{ijk} - \hat\mu_{ijk})^2}{\hat\mu_{ijk}}$$

The Likelihood Ratio statistic

$$G^2 = 2\sum \text{Observed}\,\ln\!\left(\frac{\text{Observed}}{\text{Expected}}\right) = 2\sum x_{ijk}\ln\!\left(\frac{x_{ijk}}{\hat\mu_{ijk}}\right)$$

d.f. = # cells - # parameters fitted

We reject the model if $\chi^2$ or $G^2$ is greater than $\chi^2_{\alpha}$, the upper $\alpha$ critical value of the chi-square distribution with these degrees of freedom.

Example

Variables
1. Systolic Blood Pressure (B)
2. Serum Cholesterol (C)
3. Coronary Heart Disease (H)

The data are those of the three-way frequency table given above.

Goodness of fit testing of Models

                        LIKELIHOOD-             PEARSON
MODEL            DF     RATIO CHISQ    PROB.    CHISQ      PROB.
B,C,H.           24        83.15      0.0000    102.00    0.0000
B,CH.            21        51.23      0.0002     56.89    0.0000
C,BH.            21        59.59      0.0000     60.43    0.0000
H,BC.            15        58.73      0.0000     64.78    0.0000
BC,BH.           12        35.16      0.0004     33.76    0.0007
BH,CH.           18        27.67      0.0673     26.58    0.0872    n.s.
CH,BC.           12        26.80      0.0082     33.18    0.0009
BC,BH,CH.         9         8.08      0.5265      6.56    0.6824    n.s.

Possible Models (lack of fit not significant, n.s.):
1. [BH][CH]: B and C independent given H.
2. [BC][BH][CH]: the all two-factor interaction model.

Model 1: [BH][CH]

Log-Linear Model

$$\ln\mu_{ijk} = u + u_{H(i)} + u_{B(j)} + u_{C(k)} + u_{HB(i,j)} + u_{HC(i,k)}$$

$$\mu_{ijk} = e^{u + u_{H(i)} + u_{B(j)} + u_{C(k)} + u_{HB(i,j)} + u_{HC(i,k)}} = e^{u}\, e^{u_{H(i)}}\, e^{u_{B(j)}}\, e^{u_{C(k)}}\, e^{u_{HB(i,j)}}\, e^{u_{HC(i,k)}} = \tau\,\tau_{H(i)}\,\tau_{B(j)}\,\tau_{C(k)}\,\tau_{HB(i,j)}\,\tau_{HC(i,k)}$$

Log-linear parameters

Heart Disease - Blood Pressure Interaction, $u_{HB(i,j)}$

Hd \ Bp    <127     127-146   147-166    167+
Pres      -0.256    -0.241     0.066     0.431
Abs        0.256     0.241    -0.066    -0.431

$z = u_{HB(i,j)} / \sigma_{u_{HB(i,j)}}$ (estimate divided by its standard error)

Hd \ Bp    <127     127-146   147-166    167+
Pres      -2.607    -2.733     0.660     4.461
Abs        2.607     2.733    -0.660    -4.461

Multiplicative effect, $\tau_{HB(i,j)} = \exp(u_{HB(i,j)}) = e^{u_{HB(i,j)}}$

Hd \ Bp    <127     127-146   147-166    167+
Pres       0.774     0.786     1.068     1.538
Abs        1.291     1.272     0.936     0.65

Heart Disease - Cholesterol Interaction, $u_{HC(i,k)}$

Hd \ Chol  <200     200-219   220-259    260+
Pres      -0.233    -0.325     0.063     0.494
Abs        0.233     0.325    -0.063    -0.494

$z = u_{HC(i,k)} / \sigma_{u_{HC(i,k)}}$

Hd \ Chol  <200     200-219   220-259    260+
Pres      -1.889    -2.268     0.677     5.558
Abs        1.889     2.268    -0.677    -5.558

Multiplicative effect, $\tau_{HC(i,k)} = \exp(u_{HC(i,k)}) = e^{u_{HC(i,k)}}$

Hd \ Chol  <200     200-219   220-259    260+
Pres       0.792     0.723     1.065     1.640
Abs        1.262     1.384     0.939     0.610

Model 2: [BC][BH][CH]

Log-linear parameters

Blood Pressure - Cholesterol Interaction, $u_{BC(j,k)}$

Bp \ Chol   <200     200-219   220-259    260+
<127        0.222     0.114    -0.114    -0.221
127-146    -0.019    -0.041     0.154    -0.094
147-166    -0.034     0.013    -0.058     0.079
167+       -0.169    -0.086     0.018     0.237

$z = u_{BC(j,k)} / \sigma_{u_{BC(j,k)}}$

Bp \ Chol   <200     200-219   220-259    260+
<127        2.68      1.27     -1.502    -2.487
127-146    -0.236    -0.472     2.253    -1.175
147-166    -0.326     0.117    -0.636     0.785
167+       -1.291    -0.626     0.167     2.051

Multiplicative effect, $\tau_{BC(j,k)} = \exp(u_{BC(j,k)}) = e^{u_{BC(j,k)}}$

Bp \ Chol   <200     200-219   220-259    260+
<127        1.248     1.120     0.892     0.802
127-146     0.981     0.960     1.166     0.910
147-166     0.967     1.013     0.944     1.082
167+        0.844     0.918     1.018     1.267

Heart Disease - Blood Pressure Interaction, $u_{HB(i,j)}$

Hd \ Bp    <127     127-146   147-166    167+
Pres      -0.211    -0.232     0.055     0.389
Abs        0.211     0.232    -0.055    -0.389

$z = u_{HB(i,j)} / \sigma_{u_{HB(i,j)}}$

Hd \ Bp    <127     127-146   147-166    167+
Pres      -2.125    -2.604     0.542     3.938
Abs        2.125     2.604    -0.542    -3.938

Multiplicative effect, $\tau_{HB(i,j)} = \exp(u_{HB(i,j)}) = e^{u_{HB(i,j)}}$

Hd \ Bp    <127     127-146   147-166    167+
Pres       0.809     0.793     1.056     1.475
Abs        1.235     1.261     0.947     0.678

Heart Disease - Cholesterol Interaction, $u_{HC(i,k)}$

Hd \ Chol  <200     200-219   220-259    260+
Pres      -0.212    -0.316     0.069     0.460
Abs        0.212     0.316    -0.069    -0.460

$z = u_{HC(i,k)} / \sigma_{u_{HC(i,k)}}$

Hd \ Chol  <200     200-219   220-259    260+
Pres      -1.712    -2.199     0.732     5.095
Abs        1.712     2.199    -0.732    -5.095

Multiplicative effect, $\tau_{HC(i,k)} = \exp(u_{HC(i,k)}) = e^{u_{HC(i,k)}}$

Hd \ Chol  <200     200-219   220-259    260+
Pres       0.809     0.729     1.071     1.584
Abs        1.237     1.372     0.933     0.631

Next topic: Discrete Multivariate Analysis II
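As a supplementary sketch (not part of the original slides), the goodness-of-fit statistics for model [BH][CH] in the example can be computed in plain Python. It assumes the standard closed-form fitted values for conditional independence of B and C given H, namely the product of the B-by-H and C-by-H margins divided by the H margin, and treats a zero observed count as contributing zero to $G^2$. The function names are hypothetical, and the software that produced the goodness-of-fit table may have handled zero cells differently, so the printed values need not match that table exactly.

```python
import math

# counts[h][c][b]: h = heart disease (0 = present, 1 = absent),
# c = serum cholesterol level, b = systolic blood pressure level
counts = [
    [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]],
    [[117, 121, 47, 22], [85, 98, 43, 20], [119, 209, 68, 43], [67, 99, 46, 33]],
]


def fit_bh_ch(table):
    """Fitted values for [BH][CH]: independence of B and C within each level of H."""
    fitted = []
    for layer in table:
        total = sum(map(sum, layer))
        chol_totals = [sum(row) for row in layer]
        bp_totals = [sum(col) for col in zip(*layer)]
        fitted.append([[ct * bt / total for bt in bp_totals] for ct in chol_totals])
    return fitted


def fit_statistics(table, fitted):
    """Return (Pearson chi-square, likelihood-ratio G^2)."""
    chi2 = g2 = 0.0
    for layer, fit_layer in zip(table, fitted):
        for row, fit_row in zip(layer, fit_layer):
            for x, m in zip(row, fit_row):
                chi2 += (x - m) ** 2 / m
                if x > 0:                      # a zero count adds nothing to G^2
                    g2 += 2 * x * math.log(x / m)
    return chi2, g2


fitted = fit_bh_ch(counts)
chi2, g2 = fit_statistics(counts, fitted)

# d.f. = (levels of H) * (levels of C - 1) * (levels of B - 1)
df = len(counts) * (len(counts[0]) - 1) * (len(counts[0][0]) - 1)
print(df, round(g2, 2), round(chi2, 2))
```

The degrees of freedom agree with the count "# cells - # parameters fitted": 32 cells minus 14 parameters gives 18.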