Loglinear Models Stat 557 Heike Hofmann

Outline
• Model Definition
• Interpretation of Parameters
• Overview of Models
• Collapsibility Condition
• Dissimilarity Index
Loglinear Models
• Instead of modeling the relationship between X variables and a response Y, loglinear models do not single out a response variable
• Objective is to model the association structure among a set of categorical variables
Definition
2d loglinear Model
• Assume data is in an I by J contingency table of X and Y
• Let $m_{ij}$ be the cell count of cell (i,j) - assume counts come from IJ independent Poisson distributed variables $M_{ij} \sim \text{Po}(\mu_{ij})$ with $E[M_{ij}] = \mu_{ij}$
$\log m_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
Interpretation of effects
$\log m_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
• For binary variables X and Y using baseline effects, i.e. all first effects are 0
        Y=0    Y=1
X=0     m00    m01
X=1     m10    m11
• $\lambda = \log m_{00}$
• $\lambda_1^X = \log m_{10} - \log m_{00}$
• $\lambda_1^Y = \log m_{01} - \log m_{00}$
• $\lambda_{11}^{XY} = \log \left( m_{11} m_{00} / (m_{01} m_{10}) \right)$
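A minimal sketch of the baseline parameterization in R, using made-up counts for a 2 x 2 table (the numbers and the names `tab`, `fit`, `by_hand` are hypothetical, not course data). Assuming R's default treatment contrasts, the saturated Poisson model should reproduce the four closed-form expressions for the baseline effects.

```r
# Hypothetical 2x2 table; the counts are invented for illustration only.
m00 <- 20; m10 <- 10; m01 <- 40; m11 <- 60
tab <- data.frame(
  X = factor(c(0, 1, 0, 1)),
  Y = factor(c(0, 0, 1, 1)),
  m = c(m00, m10, m01, m11)
)

# Saturated loglinear model; treatment contrasts set all first-level effects to 0.
fit <- glm(m ~ X * Y, family = poisson(link = "log"), data = tab)

# Closed-form baseline parameters computed directly from the cell counts.
by_hand <- c(
  lambda    = log(m00),
  lambda_X1 = log(m10) - log(m00),
  lambda_Y1 = log(m01) - log(m00),
  lambda_XY = log((m11 * m00) / (m01 * m10))
)

# Side-by-side comparison of glm estimates and the hand computation.
print(round(cbind(glm = unname(coef(fit)), by_hand = by_hand), 4))
```

The interaction term is the log odds ratio of the table, so it is 0 exactly when X and Y are independent.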
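Models of 3 Variables
Let X, Y, Z be variables with I, J, and K categories:
Model | systematic structure
independence | $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$
joint independence | $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$
conditional independence | $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$
no 3-way interaction | $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ij}^{XY}$
3-way interaction | $\log m_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ij}^{XY} + \lambda_{ijk}^{XYZ}$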
log 20 + log 0.5 + log 1 + log 7 = log 70
In the following we are going to discuss hierarchical models only. A model is hierarchical if, for any variable
that appears in an interaction effect, all lower-dimensional (interaction) effects for this variable are included,
too. This makes the models easier to compute (which will not affect us much, since we are going to use
software to estimate the parameters anyway), but the interpretation also becomes a lot easier that way.
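6.2 Models for three variables
Let X, Y and Z be variables with I, J and K categories.
model | $\log \mu_{ijk} =$ | hypothesis | short cut
independence | $\mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$ | $H_0: \pi_{ijk} = \pi_{i++} \pi_{+j+} \pi_{++k}$ | X, Y, Z
joint independence | $\mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{jk}^{YZ}$ | $H_0: \pi_{ijk} = \pi_{i++} \pi_{+jk}$ | X, YZ
conditional independence | $\mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$ | $H_0: \pi_{ijk} = \pi_{i+k} \pi_{+jk} / \pi_{++k}$ | XZ, YZ
no three way interaction | $\mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$ | n.a. | XY, XZ, YZ
full | $\mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}$ | n.a. | XYZ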
Association Graphs for Higher Order Models
• a node labeled X indicates variable X
• an edge between nodes X and Y indicates association between variables X and Y
NB: highest interaction is assumed to be present (and all lower order interactions, too)
Parameters & Degrees of freedom
Model | degrees of freedom
null | IJK − 1
X | IJK − 1 − (I − 1) = I(JK − 1)
X, Y, Z | IJK − 1 − (I − 1) − (J − 1) − (K − 1) = IJK − I − J − K + 2
XY, Z | IJK − I − J − K + 2 − (I − 1)(J − 1) = IJK − I − J − K + 2 − IJ + I + J − 1 = (IJ − 1)(K − 1)
XY, YZ | (IJ − 1)(K − 1) − (J − 1)(K − 1) = J(I − 1)(K − 1)
XY, YZ, XZ | J(I − 1)(K − 1) − (I − 1)(K − 1) = (I − 1)(J − 1)(K − 1)
XYZ | 0
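The degrees of freedom can be checked numerically. The sketch below assumes a hypothetical 2 x 3 x 4 table filled with simulated Poisson counts (the data frame `d` and the seed are invented for illustration); it fits each model with `glm` and compares the residual degrees of freedom with the formulas in the table.

```r
# Hypothetical table dimensions and simulated counts (illustration only).
I <- 2; J <- 3; K <- 4
set.seed(557)
d <- expand.grid(X = factor(1:I), Y = factor(1:J), Z = factor(1:K))
d$n <- rpois(nrow(d), lambda = 20)

# Model formulas in the order of the table, named by their shorthand.
forms <- list(
  "null"       = n ~ 1,
  "X"          = n ~ X,
  "X,Y,Z"      = n ~ X + Y + Z,
  "XY,Z"       = n ~ X * Y + Z,
  "XY,YZ"      = n ~ X * Y + Y * Z,
  "XY,YZ,XZ"   = n ~ (X + Y + Z)^2,
  "XYZ"        = n ~ X * Y * Z
)
fits <- lapply(forms, function(f) glm(f, family = poisson(link = "log"), data = d))

# Residual degrees of freedom reported by glm ...
res_df <- sapply(fits, df.residual)

# ... and the closed-form expressions from the table.
formula_df <- c(
  I * J * K - 1,
  I * (J * K - 1),
  I * J * K - I - J - K + 2,
  (I * J - 1) * (K - 1),
  J * (I - 1) * (K - 1),
  (I - 1) * (J - 1) * (K - 1),
  0
)
print(cbind(glm = res_df, formula = formula_df))
```

The residual degrees of freedom depend only on the table dimensions and the model terms, so the simulated counts do not affect the comparison.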
Model of Homogeneous Association
• Consider the model of no three-way interaction:
$\log \mu_{ijk} = \mu + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$
• conditional odds ratios:
$\log \theta_{ij(k)}$ = constant for all k
$\log \theta_{i(j)k}$ = constant for all j
$\log \theta_{(i)jk}$ = constant for all i
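A short derivation of why the conditional XY odds ratio does not depend on k when there is no three-way interaction. It assumes local odds ratios for adjacent categories, which is a common convention but is not spelled out on the slide.

```latex
\theta_{ij(k)} = \frac{\mu_{ijk}\,\mu_{i+1,j+1,k}}{\mu_{i,j+1,k}\,\mu_{i+1,j,k}}

\log \theta_{ij(k)}
  = \log\mu_{ijk} + \log\mu_{i+1,j+1,k} - \log\mu_{i,j+1,k} - \log\mu_{i+1,j,k}
  = \lambda_{ij}^{XY} + \lambda_{i+1,j+1}^{XY} - \lambda_{i,j+1}^{XY} - \lambda_{i+1,j}^{XY}
```

The terms $\mu$, $\lambda_i^X$, $\lambda_j^Y$, $\lambda_k^Z$, $\lambda_{ik}^{XZ}$ and $\lambda_{jk}^{YZ}$ each appear twice with a plus sign and twice with a minus sign, so they cancel; only the XY terms remain, and none of them involves k. The same argument applies to $\theta_{i(j)k}$ and $\theta_{(i)jk}$.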
Collapsibility Condition
• We can collapse over Z, if we do not change the
association between variables X and Y (in order to
avoid Simpson’s paradox)
• Collapsibility condition:
Collapsing over Z does not affect the association
between X and Y, if X and Z or Y and Z (or both)
are conditionally independent.
[Figure: association graphs of a collapsible and a not collapsible model]
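The collapsibility condition can be illustrated with a small sketch in R. The expected counts are hypothetical and built as $\mu_{ijk} = a_{ij} b_{jk}$, so that X and Z are conditionally independent given Y (model XY, YZ); the names `a`, `b`, `mu` and `odds_ratio` are invented for this example.

```r
# Odds ratio of a 2x2 table.
odds_ratio <- function(t) (t[1, 1] * t[2, 2]) / (t[1, 2] * t[2, 1])

# Hypothetical building blocks: a[i, j] carries the XY association,
# b[j, k] carries the YZ association. There is no XZ term.
a <- matrix(c(10, 30, 20, 40), nrow = 2)
b <- matrix(c(1, 3, 2, 5), nrow = 2)

# Expected counts mu[i, j, k] = a[i, j] * b[j, k].
mu <- array(0, dim = c(2, 2, 2))
for (k in 1:2) mu[, , k] <- sweep(a, 2, b[, k], "*")

# Conditional XY odds ratios at each level of Z ...
conditional <- c(odds_ratio(mu[, , 1]), odds_ratio(mu[, , 2]))

# ... and the marginal XY odds ratio after collapsing over Z.
marginal <- odds_ratio(apply(mu, c(1, 2), sum))

print(c(conditional = conditional, marginal = marginal))
```

In this construction the conditional and marginal odds ratios agree, so collapsing over Z does not change the XY association. Adding an XZ term to the expected counts would in general break this equality.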
Marijuana Study
• 2276 high school seniors report on their use of alcohol, cigarettes and marijuana
• find an association graph representing the structure in the data
> summary(mj)
 Alcohol  Cigarette  Marijuana      Count
 No :4    No :4      No :4      Min.   :  2.0
 Yes:4    Yes:4      Yes:4      1st Qu.: 33.0
                                Median :161.5
                                Mean   :284.5
                                3rd Qu.:476.5
                                Max.   :911.0
[Figure: bar charts of count by Alcohol, Cigarette, and Marijuana (No/Yes)]
Marijuana Study
[Figure: mosaic plots of Alcohol, Cigarette, and Marijuana use]
Example Marijuana
glm(formula = Count ~ .^2, family = poisson(link = log), data = mj)

Deviance Residuals:
       1         2         3         4         5         6         7         8
 0.02044  -0.02658  -0.09256   0.02890  -0.33428   0.09452   0.49134  -0.03690

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                   5.63342    0.05970  94.361  < 2e-16 ***
AlcoholYes                    0.48772    0.07577   6.437 1.22e-10 ***
Cigarette Yes                -1.88667    0.16270 -11.596  < 2e-16 ***
Marijuana Yes                -5.30904    0.47520 -11.172  < 2e-16 ***
AlcoholYes:Cigarette Yes      2.05453    0.17406  11.803  < 2e-16 ***
AlcoholYes:Marijuana Yes      2.98601    0.46468   6.426 1.31e-10 ***
Cigarette Yes:Marijuana Yes   2.84789    0.16384  17.382  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2851.46098  on 7  degrees of freedom
Residual deviance:    0.37399  on 1  degrees of freedom
AIC: 63.417

Number of Fisher Scoring iterations: 4
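A sketch of how this fit could be reproduced. The course's `mj` data frame is not shown, so the eight counts below are an assumption: they are chosen because they match the summary statistics reported for Count (min 2, median 161.5, mean 284.5, max 911), and the row order may differ from the original. The factor names here have no embedded spaces, so the coefficient labels differ slightly from the output above.

```r
# Assumed cell counts (consistent with the summary of Count shown earlier).
mj <- expand.grid(
  Alcohol   = factor(c("No", "Yes")),
  Cigarette = factor(c("No", "Yes")),
  Marijuana = factor(c("No", "Yes"))
)
mj$Count <- c(279, 456, 43, 538, 2, 44, 3, 911)

# Homogeneous association model: all main effects and two-way interactions.
fit <- glm(Count ~ .^2, family = poisson(link = "log"), data = mj)
print(c(deviance = deviance(fit), df = df.residual(fit)))

# Fitted counts arranged as a 2 x 2 x 2 table.
mu <- xtabs(fitted(fit) ~ Alcohol + Cigarette + Marijuana, data = mj)

# Conditional Alcohol-Cigarette odds ratio at each level of Marijuana.
or_AC <- apply(mu, 3, function(t) (t[1, 1] * t[2, 2]) / (t[1, 2] * t[2, 1]))
print(or_AC)

# Under this model both equal exp() of the Alcohol:Cigarette interaction.
print(exp(coef(fit)["AlcoholYes:CigaretteYes"]))
```

With these counts the conditional Alcohol-Cigarette odds ratio is the same for marijuana users and non-users, which is what no three-way interaction means in practice.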
Loglinear Models for Large Data
Example: Accident Data, US DOT
• 68694 passengers in auto and light truck accidents are recorded with respect to their injury (I), seatbelt (S) use, gender (G) and location (L).
 belt    location   gender     injury       count
 No :8   Rural:8    Female:8   No :8    Min.   :  380.0
 Yes:8   Urban:8    Male  :8   Yes:8    1st Qu.:  798.8
                                        Median : 2165.0
                                        Mean   : 4293.4
                                        3rd Qu.: 6841.5
                                        Max.   :11587.0
[Figure: bar charts of count by belt, location, gender, and injury]
Main Effects?
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    8.480603   0.008683  976.68   <2e-16 ***
beltYes        0.201277   0.007669   26.24   <2e-16 ***
locationUrban  0.525589   0.007896   66.57   <2e-16 ***
genderMale     0.152155   0.007653   19.88   <2e-16 ***
injuryYes     -2.297472   0.013244 -173.47   <2e-16 ***
---
    Null deviance: 61709.5  on 15  degrees of freedom
Residual deviance:  2792.8  on 11  degrees of freedom
Interactions
# prodplot() is from the productplots package, aes() from ggplot2, .() from plyr
library(productplots)
library(ggplot2)
library(plyr)

# The trailing "+" keeps each aes() call attached to its plot.
prodplot(data=dot, count~belt+location,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=belt)
prodplot(data=dot, count~belt+gender,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=belt)
prodplot(data=dot, count~belt+injury,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=belt)
prodplot(data=dot, count~location+gender,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=location)
prodplot(data=dot, count~location+injury,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=location)
prodplot(data=dot, count~gender+injury,
  c("vspine","hspine"), subset=.(level==2)) +
  aes(fill=gender)
[Figure: spine plots for each pair of belt, location, gender, and injury]
All 2-way Interactions
Call:
glm(formula = count ~ .^2, family = poisson(), data = dot)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                8.11786    0.01453 558.535  < 2e-16 ***
beltYes                    0.57924    0.01623  35.692  < 2e-16 ***
locationUrban              0.75930    0.01602  47.399  < 2e-16 ***
genderMale                 0.58918    0.01620  36.359  < 2e-16 ***
injuryYes                 -1.22138    0.02637 -46.320  < 2e-16 ***
beltYes:locationUrban     -0.08493    0.01619  -5.244 1.57e-07 ***
beltYes:genderMale        -0.45992    0.01568 -29.328  < 2e-16 ***
beltYes:injuryYes         -0.81400    0.02762 -29.473  < 2e-16 ***
locationUrban:genderMale  -0.20992    0.01612 -13.019  < 2e-16 ***
locationUrban:injuryYes   -0.75503    0.02695 -28.017  < 2e-16 ***
genderMale:injuryYes      -0.54053    0.02722 -19.859  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Null deviance: 61709.521  on 15  degrees of freedom
Residual deviance:    23.351  on  5  degrees of freedom
AIC: 198.81

Number of Fisher Scoring iterations: 3
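In a model with only two-way interactions, each interaction estimate for binary factors is a conditional log odds ratio, so exponentiating gives the odds ratio directly. The sketch below simply copies the six estimates from the output above into a vector (the name `lambda2` and the labels are invented here); the direction of each ratio depends on which level is the baseline, so reversing the coding of one factor gives the reciprocal.

```r
# Two-way interaction estimates copied from the glm output above.
lambda2 <- c(
  "belt:location"   = -0.08493,
  "belt:gender"     = -0.45992,
  "belt:injury"     = -0.81400,
  "location:gender" = -0.20992,
  "location:injury" = -0.75503,
  "gender:injury"   = -0.54053
)

# Conditional odds ratios (Yes / Urban / Male versus the baseline levels).
theta <- exp(lambda2)
print(round(theta, 2))

# Reciprocals correspond to swapping the baseline of one of the two factors.
print(round(1 / theta, 2))
```

For example, the belt:injury estimate corresponds to an odds ratio of about 0.44: given location and gender, the estimated odds of injury with a seatbelt are less than half the odds without one.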
Cornerstone Models
Analysis of Deviance Table
Model 1: count ~ 1
Model 2: count ~ belt + location + gender + injury
Model 3: count ~ (belt + location + gender + injury)^2
Model 4: count ~ (belt + location + gender + injury)^3
Model 5: count ~ (belt + location + gender + injury)^4
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        15      61710
2        11       2793  4    58917 < 2.2e-16 ***
3         5         23  6     2769 < 2.2e-16 ***
4         1          1  4       22 0.0001981 ***
5         0          0  1        1 0.2496401
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
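The residual deviances can also be read as goodness-of-fit tests against the saturated model. A minimal sketch, assuming the usual chi-square approximation and using the unrounded residual deviances reported in the two glm outputs earlier (the vector names are invented here):

```r
# Residual deviances and degrees of freedom taken from the glm outputs.
resid_dev <- c(main_effects = 2792.8, two_way = 23.351)
resid_df  <- c(main_effects = 11,     two_way = 5)

# Upper-tail chi-square probabilities: small values indicate lack of fit.
p_fit <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
print(signif(p_fit, 3))
```

Both p-values are far below 0.05, so with 68694 observations even the model with all two-way interactions is formally rejected.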
Accident Data
• Only very complex models have acceptable fit
• Interpretation difficult
• Odds ratios are not very different from simpler
model GI, LI, SI, GLS
Large Data: DOT Study
Example: 68694 passengers in auto and light truck accidents are recorded
with respect to their injury (I), seatbelt (S) use, gender (G) and location
(L).
Only very complex models are acceptable, but odds ratios are close to
those of simpler models:
odds ratio | IL, IS, IG, LGS | SLG, SLI, LGI
θSL  | 1.17 | 1.18
θSG  | 0.66 | 0.66
θSI  | 0.44 | 0.46
θLG  | 1.31 | 1.33
θLI  | 2.13 | 2.14
θGI  | 0.58 | 0.56
θSLG | 0.88 | 0.88
θSLI | 1.00 | 0.91
θLGI | 1.00 | 1.08
Practical difference? - probably not
Model Comparison
• no practical difference
Stat 557 (Fall 2008), Loglinear Models, October 16, 2008
Dissimilarity Index
• Idea: measure the minimal percentage of observations that have to be moved to get a perfect fit
• Dissimilarity Index: $\hat{\Delta} = \sum_i |n_i - \hat{\mu}_i| / (2n)$
• $\hat{\Delta}$ takes values between 0 and 1
• $\hat{\Delta} = 0$ indicates a perfect fit
• $\hat{\Delta}$ is independent of sample size
• a value ≤ 0.02 or 0.03 is considered a good model
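A minimal sketch of the dissimilarity index in R. The function name `dissimilarity` and the 2 x 2 counts are hypothetical; the fitted values come from the independence model, computed from the margins.

```r
# Dissimilarity index: half the summed absolute differences between observed
# and fitted counts, divided by the total sample size.
dissimilarity <- function(observed, fitted) {
  stopifnot(length(observed) == length(fitted))
  sum(abs(observed - fitted)) / (2 * sum(observed))
}

# Hypothetical 2x2 table (invented counts, n = 100).
obs <- matrix(c(30, 20, 10, 40), nrow = 2)

# Fitted counts under independence: row total * column total / n.
fit_ind <- outer(rowSums(obs), colSums(obs)) / sum(obs)

delta <- dissimilarity(obs, fit_ind)
print(delta)
```

Here the index is 0.2: 20% of the observations would have to be moved to other cells for the independence model to fit perfectly, far above the 0.02 to 0.03 rule of thumb.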
Next: Relationship to
Logistic Regression