Working with loglinear Models Stat 557 Heike Hofmann

Outline
• Loglinear Models
• Accident Data
• Dissimilarity Index
Loglinear Models for Large Data
Example: Accident Data, US DOT
• 68694 passengers in auto and light truck accidents
are recorded with respect to their injury (I),
seatbelt use (S), gender (G) and location (L).
  belt     location     gender     injury       count
 No :8    Rural:8    Female:8    No :8    Min.   :  380.0
 Yes:8    Urban:8    Male  :8    Yes:8    1st Qu.:  798.8
                                          Median : 2165.0
                                          Mean   : 4293.4
                                          3rd Qu.: 6841.5
                                          Max.   :11587.0
Accident Data
• Download accident.txt from the website & load into R
• Explore the data graphically with barcharts and
mosaicplots. Keep a loglinear model and its terms in
mind when interpreting the graphics.
• Fit cornerstone models (Remember .^2 is the code
for all two-way interactions)
• Find model with acceptable residual deviance. Are the
statistics (significances of some terms in particular)
surprising? - Find an explanation!
• Compare odds ratios to model GI, LI, SI, GLS
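The load-and-fit steps above might be sketched as follows. The layout of accident.txt is not shown in these slides, so the read.table() call is hypothetical and left as a comment. As a stand-in for the accident data, the sketch uses R's built-in UCBAdmissions table (three factors instead of four), so the "saturated" model here is the three-way one.

```r
# Hypothetical load step; the file layout (header row, whitespace-separated
# columns) is an assumption, not taken from the slides:
# dot <- read.table("accident.txt", header = TRUE)

# Stand-in data: a built-in 3-way contingency table, one row per cell.
dat <- as.data.frame(UCBAdmissions)        # columns Admit, Gender, Dept, Freq
names(dat)[names(dat) == "Freq"] <- "count"

# Cornerstone models: null, main effects, all 2-way, all 3-way (saturated here).
m0 <- glm(count ~ 1,   family = poisson(), data = dat)
m1 <- glm(count ~ .,   family = poisson(), data = dat)
m2 <- glm(count ~ .^2, family = poisson(), data = dat)
m3 <- glm(count ~ .^3, family = poisson(), data = dat)

# Residual degrees of freedom and residual deviance for each model.
fits <- list(null = m0, main = m1, twoway = m2, saturated = m3)
summary_tab <- data.frame(
  resid_df  = sapply(fits, df.residual),
  resid_dev = sapply(fits, deviance)
)
print(summary_tab)
```

For the accident data the same calls would apply with data = dot, plus a .^4 model as the saturated one.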
Main Effects?
[Figure: barcharts of count by belt (No/Yes), gender (Female/Male), injury (No/Yes) and location (Rural/Urban)]
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    8.480603   0.008683  976.68   <2e-16 ***
beltYes        0.201277   0.007669   26.24   <2e-16 ***
locationUrban  0.525589   0.007896   66.57   <2e-16 ***
genderMale     0.152155   0.007653   19.88   <2e-16 ***
injuryYes     -2.297472   0.013244 -173.47   <2e-16 ***
---
    Null deviance: 61709.5  on 15  degrees of freedom
Residual deviance:  2792.8  on 11  degrees of freedom
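In a main-effects (mutual independence) model with R's default treatment contrasts, each coefficient is the log ratio of two marginal totals, which is why barcharts of the margins are the natural graphic for this model. A minimal sketch of that relationship, using the built-in UCBAdmissions table as stand-in data (an assumption, since the accident counts themselves are not listed in these slides):

```r
# Stand-in data: built-in 3-way table, one row per cell.
dat <- as.data.frame(UCBAdmissions)
names(dat)[names(dat) == "Freq"] <- "count"

# Main-effects (independence) loglinear model.
m_main <- glm(count ~ Admit + Gender + Dept, family = poisson(), data = dat)

# Marginal totals of Gender and the log ratio Female / Male
# (Male is the baseline level of Gender in this table).
gender_tot <- tapply(dat$count, dat$Gender, sum)
log_ratio  <- as.numeric(log(gender_tot["Female"] / gender_tot["Male"]))

coef_gender <- unname(coef(m_main)["GenderFemale"])
print(c(coefficient = coef_gender, log_marginal_ratio = log_ratio))
```

The two printed numbers should agree up to the convergence tolerance of glm().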
Interactions
prodplot(data=dot, count~belt+location,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=belt)
prodplot(data=dot, count~belt+gender,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=belt)
prodplot(data=dot, count~belt+injury,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=belt)
prodplot(data=dot, count~location+gender,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=location)
Interactions
prodplot(data=dot, count~location+injury,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=location)
prodplot(data=dot, count~gender+injury,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=gender)
All 2-way Interactions
Call:
glm(formula = count ~ .^2, family = poisson(), data = dot)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                8.11786    0.01453 558.535  < 2e-16 ***
beltYes                    0.57924    0.01623  35.692  < 2e-16 ***
locationUrban              0.75930    0.01602  47.399  < 2e-16 ***
genderMale                 0.58918    0.01620  36.359  < 2e-16 ***
injuryYes                 -1.22138    0.02637 -46.320  < 2e-16 ***
beltYes:locationUrban     -0.08493    0.01619  -5.244 1.57e-07 ***
beltYes:genderMale        -0.45992    0.01568 -29.328  < 2e-16 ***
beltYes:injuryYes         -0.81400    0.02762 -29.473  < 2e-16 ***
locationUrban:genderMale  -0.20992    0.01612 -13.019  < 2e-16 ***
locationUrban:injuryYes   -0.75503    0.02695 -28.017  < 2e-16 ***
genderMale:injuryYes      -0.54053    0.02722 -19.859  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Null deviance: 61709.521  on 15  degrees of freedom
Residual deviance:    23.351  on  5  degrees of freedom
AIC: 198.81

Number of Fisher Scoring iterations: 3
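In a model with all two-way interactions and nothing higher, each interaction coefficient between two binary factors is a log conditional odds ratio, assuming R's default treatment coding (which the coefficient names such as beltYes suggest). A small sketch that exponentiates the estimates printed in the summary above; the vector name two_way is only illustrative:

```r
# Two-way interaction estimates copied from the printed summary.
two_way <- c(
  "beltYes:locationUrban"    = -0.08493,
  "beltYes:genderMale"       = -0.45992,
  "beltYes:injuryYes"        = -0.81400,
  "locationUrban:genderMale" = -0.20992,
  "locationUrban:injuryYes"  = -0.75503,
  "genderMale:injuryYes"     = -0.54053
)

# Conditional odds ratios implied by the all-two-way model.
odds_ratio <- exp(two_way)
print(round(odds_ratio, 2))

# Swapping the two levels of one factor inverts the odds ratio.
print(round(1 / odds_ratio, 2))
```

Which orientation to report (the value or its reciprocal) depends only on which level of each factor is treated as the baseline.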
Cornerstone Models
Analysis of Deviance Table
Model 1: count ~ 1
Model 2: count ~ belt + location + gender + injury
Model 3: count ~ (belt + location + gender + injury)^2
Model 4: count ~ (belt + location + gender + injury)^3
Model 5: count ~ (belt + location + gender + injury)^4
  Resid. Df Resid. Dev Df Deviance  P(>|Chi|)
1        15      61710
2        11       2793  4    58917  < 2.2e-16 ***
3         5         23  6     2769  < 2.2e-16 ***
4         1          1  4       22  0.0001981 ***
5         0          0  1        1  0.2496401
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
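Besides comparing successive models, the residual deviance of a single model can be referred to a chi-square distribution on its residual degrees of freedom as a goodness-of-fit check against the saturated model. A short sketch using the residual deviances printed in the two model summaries earlier (2792.8 on 11 df and 23.351 on 5 df); the variable names are illustrative only:

```r
# Residual deviances and degrees of freedom as printed in the model summaries.
resid_dev <- c(main_effects = 2792.8, all_two_way = 23.351)
resid_df  <- c(main_effects = 11,     all_two_way = 5)

# Upper-tail chi-square probability: small values indicate lack of fit.
p_fit <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
print(signif(p_fit, 3))
```

With a sample this large, even small departures from a model produce small p-values, which is part of why only very complex models pass this test.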
Accident Data
• Only very complex models have acceptable fit
• Interpretation difficult
• Odds ratios are not very different from simpler
model GI, LI, SI, GLS
Large Data: DOT Study
Example: 68694 passengers in auto and light truck accidents are recorded
with respect to their injury (I), seatbelt (S) use, gender (G) and location
(L).
Only very complex models are acceptable, but odds ratios are close to
simpler models:
odds ratio   (IL, IS, IG, LGS)   (SLG, SLI, LGI)
θSL                1.17                1.18
θSG                0.66                0.66
θSI                0.44                0.46
θLG                1.31                1.33
θLI                2.13                2.14
θGI                0.58                0.56
θSLG               0.88                0.88
θSLI               1.00                0.91
θLGI               1.00                1.08
Practical difference? - probably not
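One way to produce such a comparison is to compute conditional odds ratios from the fitted cell counts of each model. The sketch below does this on stand-in data (the built-in UCBAdmissions table, an assumption because the accident counts are not listed here), comparing an all-two-way model with the saturated model; cond_or is a hypothetical helper name.

```r
# Stand-in data: built-in 3-way table, one row per cell.
dat <- as.data.frame(UCBAdmissions)
names(dat)[names(dat) == "Freq"] <- "count"

simple  <- glm(count ~ (Admit + Gender + Dept)^2, family = poisson(), data = dat)
complex <- glm(count ~ (Admit + Gender + Dept)^3, family = poisson(), data = dat)

# Conditional Admit-by-Gender odds ratio within each level of Dept,
# computed from a model's fitted cell counts.
cond_or <- function(model, data) {
  data$mu <- fitted(model)
  mu <- xtabs(mu ~ Admit + Gender + Dept, data = data)
  apply(mu, 3, function(t) (t[1, 1] * t[2, 2]) / (t[1, 2] * t[2, 1]))
}

or_tab <- rbind(simple = cond_or(simple, dat), complex = cond_or(complex, dat))
print(round(or_tab, 2))
```

Under the simpler model the odds ratio is forced to be the same in every department, while the saturated model reproduces the observed odds ratio in each one. Whether the differences matter in practice is a judgment about their size, not about p-values.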
Model Comparison
• no practical difference
Stat 557 (Fall 2008), Loglinear Models, October 16, 2008
Accident Data
• Assume that Injury is response variable, and
fit a logistic regression with main effects of
gender, location and seatbelt.
• Compare effects to the effects of the
loglinear GI, LI, SI, GLS
• Try to come up with a general explanation.
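A possible way to set up this comparison, sketched on stand-in data because the accident counts are not listed here: with the built-in UCBAdmissions table, Admit plays the role of injury, and the loglinear model (Admit:Gender, Admit:Dept, Gender:Dept) plays the role of GI, LI, SI, GLS, that is, all response-by-predictor two-way terms plus the full interaction among the predictors.

```r
# Stand-in data: built-in 3-way table, one row per cell.
dat <- as.data.frame(UCBAdmissions)
names(dat)[names(dat) == "Freq"] <- "count"

# Logistic regression on grouped data: factor response Admit
# (second level "Rejected" is modelled), cell counts as weights.
logit <- glm(Admit ~ Gender + Dept, family = binomial(), data = dat,
             weights = count)

# Loglinear counterpart: response-by-predictor two-way terms plus
# the full interaction among the predictors.
loglin <- glm(count ~ Admit * Gender + Admit * Dept + Gender * Dept,
              family = poisson(), data = dat)

# Line up the logistic coefficients with the loglinear terms involving Admit.
b_logit  <- coef(logit)
b_loglin <- coef(loglin)[c("AdmitRejected",
                           paste0("AdmitRejected:", names(b_logit)[-1]))]
print(round(cbind(logistic = b_logit, loglinear = unname(b_loglin)), 4))
```

The two columns should match: the logistic effects are the loglinear interaction terms between the response and each predictor.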
Dissimilarity Index
• Idea: measure the minimal percentage of observations
that have to be moved to get a perfect fit
• Dissimilarity Index:
\hat{\Delta} = \frac{\sum_i |n_i - \hat{\mu}_i|}{2n}
where n_i are the observed cell counts, \hat{\mu}_i the fitted cell counts and n = \sum_i n_i the total count
Dissimilarity Index
Delta is
• between 0 and 1
• 0 indicates perfect fit
• independent of sample size
• a value ≤ 0.02 or 0.03 is considered a good model
Dissimilarity Index
• Write a function Delta that computes the
dissimilarity index for a model m
• Use Delta to identify a reasonable model for
the accidents data.
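One possible version of Delta, as a sketch: it assumes m is a Poisson glm() fit with one row per cell and relies on glm() storing the observed counts in m$y (its default). The stand-in data (built-in UCBAdmissions) are used only to show the call; for the accident data the same function would be applied to the models fitted on dot.

```r
# Dissimilarity index: sum over cells of |observed - fitted|, divided by 2n.
Delta <- function(m) {
  n_obs <- m$y          # observed cell counts stored by glm()
  mu    <- fitted(m)    # fitted cell counts
  sum(abs(n_obs - mu)) / (2 * sum(n_obs))
}

# Stand-in data: built-in 3-way table, one row per cell.
dat <- as.data.frame(UCBAdmissions)
names(dat)[names(dat) == "Freq"] <- "count"

fits <- list(
  main      = glm(count ~ .,   family = poisson(), data = dat),
  twoway    = glm(count ~ .^2, family = poisson(), data = dat),
  saturated = glm(count ~ .^3, family = poisson(), data = dat)
)

delta <- sapply(fits, Delta)
print(round(delta, 4))
```

A model would then be called reasonable when its Delta falls below roughly 0.02 to 0.03, regardless of whether its deviance test rejects.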