Working with Loglinear Models
Stat 557
Heike Hofmann

Outline
• Loglinear Models
• Accident Data
• Dissimilarity Index

Loglinear Models for Large Data
Example: Accident Data, US DOT
• 68694 passengers in auto and light truck accidents are recorded with respect to their injury (I), seatbelt (S) use, gender (G) and location (L).

    belt     location    gender      injury    count
    No :8    Rural:8     Female:8    No :8     Min.   :  380.0
    Yes:8    Urban:8     Male  :8    Yes:8     1st Qu.:  798.8
                                               Median : 2165.0
                                               Mean   : 4293.4
                                               3rd Qu.: 6841.5
                                               Max.   :11587.0

Accident Data
• Download accident.txt from the website & load into R
• Fit cornerstone models (Remember .^2 is the code for all two-way interactions)
• Find a model with acceptable residual deviance. Are the statistics (significances of some terms in particular) surprising? - Find an explanation!
• Compare odds ratios to model GI, LI, SI, GLS

Explore the data graphically with barcharts and mosaicplots. Keep a loglinear model and its terms in mind when interpreting the graphics.

Main Effects?
[Figure: barcharts of count by belt (No, Yes), gender (Female, Male), injury (No, Yes) and location (Rural, Urban)]

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)    8.480603   0.008683  976.68   <2e-16 ***
    beltYes        0.201277   0.007669   26.24   <2e-16 ***
    locationUrban  0.525589   0.007896   66.57   <2e-16 ***
    genderMale     0.152155   0.007653   19.88   <2e-16 ***
    injuryYes     -2.297472   0.013244 -173.47   <2e-16 ***
    ---
        Null deviance: 61709.5  on 15  degrees of freedom
    Residual deviance:  2792.8  on 11  degrees of freedom
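The cornerstone models can be sketched as follows. The cell counts are an assumption: accident.txt is not reproduced on the slides, so the 16 counts below were chosen because they reproduce the printed summary (total 68694, minimum 380, median 2165, maximum 11587). The model names m0 to m4 are illustrative only.

```r
# Assumed cell counts (not printed on the slides); injury varies fastest,
# then belt, then location, then gender.
dot <- expand.grid(
  injury   = c("No", "Yes"),
  belt     = c("No", "Yes"),
  location = c("Rural", "Urban"),
  gender   = c("Female", "Male")
)
dot$count <- c(
   3246,  973,  6134, 757,   # Female, Rural: belt No, then belt Yes
   7287,  996, 11587, 759,   # Female, Urban
   6123, 1084,  6693, 513,   # Male, Rural
  10381,  812, 10969, 380    # Male, Urban
)

# Cornerstone models: intercept only, main effects, and all
# 2-, 3- and 4-way interactions (.^k expands to all k-way terms)
m0 <- glm(count ~ 1,   family = poisson(), data = dot)
m1 <- glm(count ~ .,   family = poisson(), data = dot)
m2 <- glm(count ~ .^2, family = poisson(), data = dot)
m3 <- glm(count ~ .^3, family = poisson(), data = dot)
m4 <- glm(count ~ .^4, family = poisson(), data = dot)

summary(dot$count)                          # compare with the summary above
round(coef(m1), 4)                          # main-effects estimates
anova(m0, m1, m2, m3, m4, test = "Chisq")   # analysis of deviance
```

With these counts, the main-effects estimates agree with the coefficient table above, which supports (but does not prove) that the counts match the file used in class.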
Interactions
    prodplot(data=dot, count~belt+location, c("vspine","hspine"), subset=.(level==2)) + aes(fill=belt)
    prodplot(data=dot, count~belt+gender, c("vspine","hspine"), subset=.(level==2)) + aes(fill=belt)
    prodplot(data=dot, count~belt+injury, c("vspine","hspine"), subset=.(level==2)) + aes(fill=belt)
    prodplot(data=dot, count~location+gender, c("vspine","hspine"), subset=.(level==2)) + aes(fill=location)
[Figure: mosaic plots of belt by location, belt by gender, belt by injury, and location by gender]

Interactions
    prodplot(data=dot, count~location+injury, c("vspine","hspine"), subset=.(level==2)) + aes(fill=location)
    prodplot(data=dot, count~gender+injury, c("vspine","hspine"), subset=.(level==2)) + aes(fill=gender)
[Figure: mosaic plots of location by injury and gender by injury]

All 2-way Interactions
    Call:
    glm(formula = count ~ .^2, family = poisson(), data = dot)

    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
    (Intercept)                8.11786    0.01453 558.535  < 2e-16 ***
    beltYes                    0.57924    0.01623  35.692  < 2e-16 ***
    locationUrban              0.75930    0.01602  47.399  < 2e-16 ***
    genderMale                 0.58918    0.01620  36.359  < 2e-16 ***
    injuryYes                 -1.22138    0.02637 -46.320  < 2e-16 ***
    beltYes:locationUrban     -0.08493    0.01619  -5.244 1.57e-07 ***
    beltYes:genderMale        -0.45992    0.01568 -29.328  < 2e-16 ***
    beltYes:injuryYes         -0.81400    0.02762 -29.473  < 2e-16 ***
    locationUrban:genderMale  -0.20992    0.01612 -13.019  < 2e-16 ***
    locationUrban:injuryYes   -0.75503    0.02695 -28.017  < 2e-16 ***
    genderMale:injuryYes      -0.54053    0.02722 -19.859  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        Null deviance: 61709.521  on 15  degrees of freedom
    Residual deviance:    23.351  on  5  degrees of freedom
    AIC: 198.81

    Number of Fisher Scoring iterations: 3

Cornerstone Models
    Analysis of Deviance Table

    Model 1: count ~ 1
    Model 2: count ~ belt + location + gender + injury
    Model 3: count ~ (belt + location + gender + injury)^2
    Model 4: count ~ (belt + location + gender + injury)^3
    Model 5: count ~ (belt + location + gender + injury)^4
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1        15      61710
    2        11       2793  4    58917 < 2.2e-16 ***
    3         5         23  6     2769 < 2.2e-16 ***
    4         1          1  4       22 0.0001981 ***
    5         0          0  1        1 0.2496401
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Accident Data
• Only very complex models have acceptable fit
• Interpretation difficult
• Odds ratios are not very different from simpler model GI, LI, SI, GLS

Large Data: DOT Study
Example: 68694 passengers in auto and light truck accidents are recorded with respect to their injury (I), seatbelt (S) use, gender (G) and location (L).
Only very complex models are acceptable, but odds ratios are close to those of simpler models:

    odds ratio   (IL, IS, IG, LGS)   (SLG, SLI, LGI)
    θSL                1.17                1.18
    θSG                0.66                0.66
    θSI                0.44                0.46
    θLG                1.31                1.33
    θLI                2.13                2.14
    θGI                0.58                0.56
    θSLG               0.88                0.88
    θSLI               1.00                0.91
    θLGI               1.00                1.08

Practical difference? - probably not

Model Comparison
• no practical difference

Stat 557 (Fall 2008), Loglinear Models, October 16, 2008

Accident Data
• Assume that Injury is the response variable, and fit a logistic regression with main effects of gender, location and seatbelt.
• Compare effects to the effects of the loglinear model GI, LI, SI, GLS
• Try to come up with a general explanation.
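One way to approach the comparison is sketched below. The cell counts are again an assumption (chosen to reproduce the printed data summary), and the names m_ll, m_logit and ll_inj are illustrative. The sketch fits the loglinear model (GI, LI, SI, GLS) and the main-effects logistic regression for injury, then lines up the injury interaction terms of the former with the coefficients of the latter.

```r
# Assumed cell counts (not printed on the slides); injury varies fastest
dot <- expand.grid(
  injury   = c("No", "Yes"),
  belt     = c("No", "Yes"),
  location = c("Rural", "Urban"),
  gender   = c("Female", "Male")
)
dot$count <- c(3246, 973, 6134, 757, 7287, 996, 11587, 759,
               6123, 1084, 6693, 513, 10381, 812, 10969, 380)

# Loglinear model (GI, LI, SI, GLS)
m_ll <- glm(count ~ injury * (gender + location + belt) +
              gender * location * belt,
            family = poisson(), data = dot)

# Logistic regression: injury as response, cell counts as weights
m_logit <- glm(injury ~ gender + location + belt,
               family = binomial(), weights = count, data = dot)

# Keep only the two-way terms that involve injury and strip the
# "injuryYes" part so the names match the logistic coefficients
ll <- coef(m_ll)
ll_inj <- ll[grepl("injuryYes", names(ll)) & grepl(":", names(ll))]
names(ll_inj) <- gsub(":?injuryYes:?", "", names(ll_inj))

effects <- names(coef(m_logit))[-1]
comparison <- cbind(loglinear = ll_inj[effects],
                    logistic  = coef(m_logit)[effects])
round(comparison, 4)        # log odds ratios
round(exp(comparison), 2)   # odds ratios
```

The odds ratios here are reported for the non-reference level of each factor, so some of them may appear as reciprocals of the values in the odds ratio table, depending on how the levels were ordered there.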
Dissimilarity Index
• Idea: measure the minimal percentage of observations that have to be moved to get a perfect fit
• Dissimilarity Index:
    \hat{\Delta} = \sum_i |n_i - \hat{\mu}_i| / (2n)
  with observed counts n_i, fitted counts \hat{\mu}_i and total sample size n

Dissimilarity Index
Delta is
• between 0 and 1
• 0 indicates perfect fit
• independent of sample size
• a value ≤ 0.02 or 0.03 is considered a good model

Dissimilarity Index
• Write a function Delta that computes the dissimilarity index for a model m
• Use Delta to identify a reasonable model for the accidents data.
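A minimal sketch of such a function, assuming m is a Poisson glm fitted to cell counts (so that m$y holds the observed counts). The cell counts are an assumption chosen to reproduce the printed data summary, and the list name models is illustrative.

```r
# Assumed cell counts (not printed on the slides); injury varies fastest
dot <- expand.grid(
  injury   = c("No", "Yes"),
  belt     = c("No", "Yes"),
  location = c("Rural", "Urban"),
  gender   = c("Female", "Male")
)
dot$count <- c(3246, 973, 6134, 757, 7287, 996, 11587, 759,
               6123, 1084, 6693, 513, 10381, 812, 10969, 380)

# Dissimilarity index: sum_i |n_i - mu_hat_i| / (2n)
Delta <- function(m) {
  n  <- m$y         # observed counts stored in the glm object
  mu <- fitted(m)   # fitted counts
  sum(abs(n - mu)) / (2 * sum(n))
}

# Evaluate Delta for the cornerstone models
models <- list(
  main      = glm(count ~ .,   family = poisson(), data = dot),
  two_way   = glm(count ~ .^2, family = poisson(), data = dot),
  three_way = glm(count ~ .^3, family = poisson(), data = dot),
  saturated = glm(count ~ .^4, family = poisson(), data = dot)
)
d <- sapply(models, Delta)
round(d, 4)
```

Comparing the values in d against the 0.02 to 0.03 guideline gives a fit criterion that, unlike the residual deviance, is not driven by the large sample size.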