Loglinear Models
Stat 557 (Fall 2008)
Heike Hofmann

Outline
• Model Definition
• Interpretation of Parameters
• Overview of Models
• Collapsibility Condition
• Dissimilarity Index

Loglinear Models
• Instead of modeling the relationship between X variables and a response Y, loglinear models do not single out a response variable
• Objective is to model the association structure among a set of categorical variables

Definition: 2d Loglinear Model
• Assume the data are the cell counts of an I × J contingency table of X and Y
• Let $m_{ij}$ be the cell count of cell (i,j)
• The loglinear model assumes that these cell counts come from IJ independent Poisson distributed variables $M_{ij} \sim \text{Po}(\mu_{ij})$ with $E[M_{ij}] = \mu_{ij}$
• Model: $\log m_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$

Interpretation of effects
$\log m_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
• For binary variables X and Y using baseline effects, i.e. all effects involving a first level are 0:

           Y=0     Y=1
    X=0    m00     m01
    X=1    m10     m11

• $\lambda = \log m_{00}$
• $\lambda_1^X = \log m_{10} - \log m_{00}$
• $\lambda_1^Y = \log m_{01} - \log m_{00}$
• $\lambda_{11}^{XY} = \log \left( (m_{11} m_{00}) / (m_{01} m_{10}) \right)$, the log odds ratio of the table

Models of 3 Variables
Let X, Y, Z be variables with I, J, and K categories:

Model: systematic structure
• independence: $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$
• joint independence: $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$
• conditional independence: $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$
• no 3-way interaction: $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ij}^{XY}$
• 3-way interaction: $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ij}^{XY} + \lambda_{ijk}^{XYZ}$

$\log 20 + \log 0.5 + \log 1 + \log 7 = \log 70$

In the following we are going to discuss hierarchical models only. A model is hierarchical if, for any variable that appears in an interaction effect, all lower dimensional (interaction) effects for this variable are included, too.
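The hierarchy requirement can be checked mechanically. The following is a minimal sketch in R, not part of the original slides: the helper names `term_labels` and `is_hierarchical` are made up for this illustration, and the variables X, Y, Z are placeholders. It relies on the fact that R's formula operator `*` already expands to all lower-order terms, while `:` adds only the interaction itself.

```r
# Term labels of a model formula, e.g. "X", "Y", "X:Y".
term_labels <- function(formula) attr(terms(formula), "term.labels")

# A model is hierarchical if, for every interaction term, all lower-order
# terms built from the same variables are present as well.
is_hierarchical <- function(formula) {
  parts  <- strsplit(term_labels(formula), ":", fixed = TRUE)
  sorted <- function(vars) paste(sort(vars), collapse = ":")
  present <- vapply(parts, sorted, character(1))
  for (vars in parts) {
    for (k in seq_along(vars)) {
      needed <- combn(vars, k, FUN = sorted)   # all sub-terms of order k
      if (!all(needed %in% present)) return(FALSE)
    }
  }
  TRUE
}

term_labels(~ X * Y + Y * Z)       # main effects X, Y, Z plus X:Y and Y:Z
is_hierarchical(~ X * Y + Y * Z)   # TRUE
is_hierarchical(~ X + X:Y)         # FALSE: the main effect of Y is missing
```

Writing models with `*` or `^`, as in `Count ~ .^2`, therefore always yields a hierarchical model; only hand-written `:` terms can violate the hierarchy.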
Restricting attention to hierarchical models makes the models easier to compute (which will not affect us as much, since we are going to use software to estimate the parameters anyway), but the interpretation also becomes a lot easier that way.

6.2 Models for three variables
Let X, Y and Z be variables with I, J and K categories.

[Graphics column of the original table: association graph for each model]

• independence (shortcut: X, Y, Z)
  $\log \mu_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$
  $H_0: \pi_{ijk} = \pi_{i++} \pi_{+j+} \pi_{++k}$
• joint independence (shortcut: X, YZ)
  $\log \mu_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{jk}^{YZ}$
  $H_0: \pi_{ijk} = \pi_{i++} \pi_{+jk}$
• conditional independence (shortcut: XZ, YZ)
  $\log \mu_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$
  $H_0: \pi_{ijk} = \pi_{i+k} \pi_{+jk} / \pi_{++k}$
• no three way interaction (shortcut: XY, XZ, YZ)
  $\log \mu_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$
  $H_0$: n.a.
• full (shortcut: XYZ)
  $\log \mu_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}$

Association Graphs for Higher Order Models
• A node X indicates variable X
• An edge between X and Y indicates association between variables X and Y
• NB: the highest interaction is assumed to be present (and all lower order interactions, too)

Parameters & Degrees of Freedom: Higher Order Models

    Model          degrees of freedom
    null           IJK − 1
    X              IJK − 1 − (I − 1) = I(JK − 1)
    X, Y, Z        IJK − 1 − (I − 1) − (J − 1) − (K − 1) = IJK − I − J − K + 2
    XY, Z          IJK − I − J − K + 2 − (I − 1)(J − 1)
                     = IJK − I − J − K + 2 − IJ + I + J − 1
                     = (IJ − 1)(K − 1)
    XY, YZ         (IJ − 1)(K − 1) − (J − 1)(K − 1) = J(I − 1)(K − 1)
    XY, YZ, XZ     J(I − 1)(K − 1) − (I − 1)(K − 1) = (I − 1)(J − 1)(K − 1)
    XYZ            0

Model of Homogeneous Association
• Consider the model of no three-way interaction:
  $\log \mu_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$
• The conditional odds ratios are constant across the levels of the third variable:
  $\log \theta_{ij(k)}$ = constant for all k
  $\log \theta_{i(j)k}$ = constant for all j
  $\log \theta_{(i)jk}$ = constant for all i

Collapsibility Condition
• We can collapse over Z, if we do not change the association between variables X and Y (in order to avoid Simpson's paradox)
• Collapsibility condition: Collapsing over Z does not affect the association between X and Y, if X and Z or Y and Z (or both) are conditionally independent.

[Figure: association graphs, one labeled collapsible, one labeled not collapsible]

Marijuana Study
• 2276 high school seniors report on their use of alcohol, cigarettes and marijuana
• find an association graph representing the structure in the data

    > summary(mj)
     Alcohol  Cigarette  Marijuana      Count
     No :4    No :4      No :4      Min.   :  2.0
     Yes:4    Yes:4      Yes:4      1st Qu.: 33.0
                                    Median :161.5
                                    Mean   :284.5
                                    3rd Qu.:476.5
                                    Max.   :911.0

[Figure: bar charts of count by Alcohol, Cigarette and Marijuana (No/Yes)]

Marijuana Study
[Figure: displays of the joint distribution of Alcohol, Cigarette and Marijuana use (No/Yes)]

Example Marijuana

    glm(formula = Count ~ .^2, family = poisson(link = log), data = mj)

    Deviance Residuals:
           1         2         3         4         5         6         7         8
     0.02044  -0.02658  -0.09256   0.02890  -0.33428   0.09452   0.49134  -0.03690

    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
    (Intercept)                5.63342    0.05970  94.361  < 2e-16 ***
    AlcoholYes                 0.48772    0.07577   6.437 1.22e-10 ***
    CigaretteYes              -1.88667    0.16270 -11.596  < 2e-16 ***
    MarijuanaYes              -5.30904    0.47520 -11.172  < 2e-16 ***
    AlcoholYes:CigaretteYes    2.05453    0.17406  11.803  < 2e-16 ***
    AlcoholYes:MarijuanaYes    2.98601    0.46468   6.426 1.31e-10 ***
    CigaretteYes:MarijuanaYes  2.84789    0.16384  17.382  < 2e-16 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 2851.46098 on 7 degrees of freedom
    Residual deviance:    0.37399 on 1 degrees of freedom
    AIC: 63.417

    Number of Fisher Scoring iterations: 4

Loglinear Models for Large Data
Example: Accident Data, US DOT
• 68694 passengers in auto and light truck accidents are recorded with respect to their injury (I), seatbelt (S) use, gender (G) and location (L).
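The residual degrees of freedom of the models fitted to this 2 × 2 × 2 × 2 table follow from counting cells and parameters. A minimal sketch in R, assuming only the factor levels shown in the data summary; the counts themselves are not needed, and the names `cells` and `residual_df` are made up for this illustration:

```r
# Cell structure of the four-way accident table (no counts filled in).
cells <- expand.grid(
  belt     = c("No", "Yes"),
  location = c("Rural", "Urban"),
  gender   = c("Female", "Male"),
  injury   = c("No", "Yes")
)

# Residual df = number of cells - number of estimated parameters,
# where the parameters are the columns of the design matrix.
residual_df <- function(formula, cells) {
  nrow(cells) - ncol(model.matrix(formula, data = cells))
}

residual_df(~ 1,   cells)   # null model: 15
residual_df(~ .,   cells)   # main effects only: 11
residual_df(~ .^2, cells)   # all two-way interactions: 5
residual_df(~ .^3, cells)   # all three-way interactions: 1
residual_df(~ .^4, cells)   # saturated model: 0
```

These values agree with the degrees of freedom reported in the glm and analysis-of-deviance output for the accident data.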
     belt     location    gender      injury       count
     No :8    Rural:8     Female:8    No :8    Min.   :  380.0
     Yes:8    Urban:8     Male  :8    Yes:8    1st Qu.:  798.8
                                               Median : 2165.0
                                               Mean   : 4293.4
                                               3rd Qu.: 6841.5
                                               Max.   :11587.0

[Figure: bar charts of count by belt (No/Yes), gender (Female/Male), injury (No/Yes) and location (Rural/Urban)]

Main Effects?

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)    8.480603   0.008683  976.68   <2e-16 ***
    beltYes        0.201277   0.007669   26.24   <2e-16 ***
    locationUrban  0.525589   0.007896   66.57   <2e-16 ***
    genderMale     0.152155   0.007653   19.88   <2e-16 ***
    injuryYes     -2.297472   0.013244 -173.47   <2e-16 ***
    ---
        Null deviance: 61709.5 on 15 degrees of freedom
    Residual deviance:  2792.8 on 11 degrees of freedom

Interactions

    prodplot(data=dot, count~belt+location, c("vspine","hspine"), subset=.(level==2)) + aes(fill=belt)
    prodplot(data=dot, count~belt+gender, c("vspine","hspine"), subset=.(level==2)) + aes(fill=belt)
    prodplot(data=dot, count~belt+injury, c("vspine","hspine"), subset=.(level==2)) + aes(fill=belt)
    prodplot(data=dot, count~location+gender, c("vspine","hspine"), subset=.(level==2)) + aes(fill=location)

[Figure: product plots of belt by location, belt by gender, belt by injury, and location by gender]

Interactions

    prodplot(data=dot, count~location+injury, c("vspine","hspine"), subset=.(level==2)) + aes(fill=location)
    prodplot(data=dot, count~gender+injury, c("vspine","hspine"), subset=.(level==2)) + aes(fill=gender)

[Figure: product plots of location by injury and gender by injury]

All 2-way Interactions

    Call:
    glm(formula = count ~ .^2, family = poisson(), data = dot)

    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
    (Intercept)                8.11786    0.01453 558.535  < 2e-16 ***
    beltYes                    0.57924    0.01623  35.692  < 2e-16 ***
    locationUrban              0.75930    0.01602  47.399  < 2e-16 ***
    genderMale                 0.58918    0.01620  36.359  < 2e-16 ***
    injuryYes                 -1.22138    0.02637 -46.320  < 2e-16 ***
    beltYes:locationUrban     -0.08493    0.01619  -5.244 1.57e-07 ***
    beltYes:genderMale        -0.45992    0.01568 -29.328  < 2e-16 ***
    beltYes:injuryYes         -0.81400    0.02762 -29.473  < 2e-16 ***
    locationUrban:genderMale  -0.20992    0.01612 -13.019  < 2e-16 ***
    locationUrban:injuryYes   -0.75503    0.02695 -28.017  < 2e-16 ***
    genderMale:injuryYes      -0.54053    0.02722 -19.859  < 2e-16 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        Null deviance: 61709.521 on 15 degrees of freedom
    Residual deviance:    23.351 on  5 degrees of freedom
    AIC: 198.81

    Number of Fisher Scoring iterations: 3

Cornerstone Models

    Analysis of Deviance Table

    Model 1: count ~ 1
    Model 2: count ~ belt + location + gender + injury
    Model 3: count ~ (belt + location + gender + injury)^2
    Model 4: count ~ (belt + location + gender + injury)^3
    Model 5: count ~ (belt + location + gender + injury)^4
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1        15      61710
    2        11       2793  4    58917 < 2.2e-16 ***
    3         5         23  6     2769 < 2.2e-16 ***
    4         1          1  4       22 0.0001981 ***
    5         0          0  1        1 0.2496401
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Accident Data
• Only very complex models have acceptable fit
• Interpretation difficult
• Odds ratios are not very different from the simpler model GI, LI, SI, GLS

Large Data: DOT Study
Only very complex models are acceptable, but odds ratios are close to simpler models:

    odds ratio    IL, IS, IG, LGS    SLG, SLI, LGI
    θSL                1.17               1.18
    θSG                0.66               0.66
    θSI                0.44               0.46
    θLG                1.31               1.33
    θLI                2.13               2.14
    θGI                0.58               0.56
    θSLG               0.88               0.88
    θSLI               1.00               0.91
    θLGI               1.00               1.08

Practical difference?
- probably not

Model Comparison
• no practical difference

Stat 557 (Fall 2008), Loglinear Models, October 16, 2008

Dissimilarity Index
• Idea: measure the minimal percentage of observations that have to be moved to get a perfect fit
• Dissimilarity index: $\hat{\Delta} = \sum_i |n_i - \hat{\mu}_i| / (2n)$
• $\hat{\Delta}$ takes values between 0 and 1
• $\hat{\Delta} = 0$ indicates a perfect fit
• $\hat{\Delta}$ is independent of sample size
• a value ≤ 0.02 or 0.03 is considered a good model

Next: Relationship to Logistic Regression
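The dissimilarity index defined above is straightforward to compute from observed and fitted counts. A minimal sketch in R; the function name `dissimilarity` and the example counts are hypothetical and serve only as an illustration:

```r
# Dissimilarity index: half the total absolute difference between observed
# and fitted counts, relative to the sample size n = sum of observed counts.
dissimilarity <- function(observed, fitted) {
  stopifnot(length(observed) == length(fitted))
  sum(abs(observed - fitted)) / (2 * sum(observed))
}

# Hypothetical counts, for illustration only (not from the slides).
observed <- c(10, 20, 30, 40)
fitted   <- c(12, 18, 33, 37)
dissimilarity(observed, fitted)   # 0.05, i.e. 5% of observations would have to move
```

For a Poisson model fitted with `glm`, say an object `fit`, the index could be obtained as `dissimilarity(fit$y, fitted(fit))`, assuming the response was stored in the fitted object (the default).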