Statistics of Contingency Tables stat 557 Heike Hofmann

Outline
• Summary Statistics:
Difference of Proportions, Relative Risk,
Odds, Odds Ratio
• Visualizations: Mosaicplots
• Concordance & Discordance
Summaries of 2 x 2 Tables
        X=0    X=1
Y=0     π00    π01
Y=1     π10    π11
• Difference of Proportion: $\pi_{j|i=0} - \pi_{j|i=1} =: \pi_1 - \pi_2$
• Relative Risk: $r := \dfrac{\pi_{j=1|i=0}}{\pi_{j=1|i=1}}$
• Odds: $\pi : (1 - \pi)$
• Odds Ratio: $\theta := \dfrac{\pi_{00}\,\pi_{11}}{\pi_{10}\,\pi_{01}}$
• asymptotics: Agresti pp 70-75, 77
2 x 2 Mosaics
Gender by Survival
            Men     Women
Died        1364    126
Survived    367     344
odds ratio: (1364*344)/(367*126) = 10.15
95% CI (Wald): (8.03, 12.83)
[Figures: mosaicplots "Gender by Survival" (Men, Women; survival No, Yes) and "Survival by Gender" (Male, Female)]
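The summaries for 2 x 2 tables can be recomputed from the Gender by Survival counts. The following Python sketch is only an illustration: the function name `summarize_2x2`, the choice to condition on gender (P(died | men) versus P(died | women)), and the fixed normal quantile 1.96 for an approximate 95% Wald interval are assumptions, not part of the slides.

```python
import math

def summarize_2x2(a, b, c, d, z=1.96):
    """Summaries of the 2 x 2 table [[a, b], [c, d]].

    Rows are the outcome (died, survived), columns the groups (men, women).
    z = 1.96 is the normal quantile for an approximate 95% interval.
    """
    p1 = a / (a + c)            # P(died | men)
    p2 = b / (b + d)            # P(died | women)
    theta = (a * d) / (b * c)   # odds ratio
    # Wald interval: symmetric on the log odds ratio scale
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = (math.exp(math.log(theta) - z * se_log),
          math.exp(math.log(theta) + z * se_log))
    return {
        "difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": theta,
        "wald_ci": ci,
    }

titanic = summarize_2x2(1364, 126, 367, 344)
print(titanic)
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around the odds ratio itself.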
Mosaicplots
• John Hartigan (1980s)
• Area plots (i.e. area represents
#combinations)
• Built hierarchically, i.e. order of variables
matters
• based on conditional distributions
Mosaicplots
P(X|Y)*P(Y) = P(X,Y) = P(Y|X)*P(X)
prodplot(tc, Freq~Sex+Survived,
c("vspine", "hspine"),
subset=level==2)
prodplot(tc, Freq~Survived+Sex,
c("vspine", "hspine"),
subset=level==2)
[Figures: the two resulting mosaicplots, with axis labels No/Yes (Survived) and Male/Female (Sex)]
Visualizing Associations - 2 x 2 tables
Fourfold Displays and Mosaicplots for three tables:
weakest association
X\Y1    0     1
0       37    20
1       25    18
odds ratio: 1.33 (0.590, 3.01)
medium association
X\Y2    0     1
0       34    14
1       23    29
odds ratio: 3.06 (1.337, 7.014)
strongest association
X\Y3    0     1
0       43    14
1       14    29
odds ratio: 6.36 (2.645, 15.306)
[Figures: one fourfold display and one mosaicplot per table, with axes X and Y1, Y2, Y3]
Visualizing Associations - 2 x 2 tables
Fourfold Displays and Mosaicplots for three tables with the same odds ratio:
X\Y1    0     1
0       30    20
1       10    40
odds ratio: 6.00 (2.453, 14.678)
X\Y2    0     1
0       36    14
1       15    35
odds ratio: 6.00 (2.528, 14.234)
X\Y3    0     1
0       40    10
1       20    30
odds ratio: 6.00 (2.453, 14.678)
[Figures: fourfold displays (Row: A/B, Col: A/B) and mosaicplots (cells 1.1, 1.2, 2.1, 2.2) for each table]
Reading the Odds Ratio
Assume that a + b = 1 and c + d = 1, then for the Odds Ratio θ:
$\log\theta = \log\dfrac{ad}{bc} = \log\dfrac{1-b}{b} + \log\dfrac{d}{1-d} \overset{\text{Taylor}}{\approx} 4(d - b)$
[Figure: probability scale (0 to 1.0, with a, b, c, d marked) mapped to the log odds scale (-inf to +inf), showing ln((1-b)/b) and ln((1-d)/d); marked values -0.85 and 0.94]
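The Taylor step can be checked numerically. The sketch below compares the exact log odds ratio with 4(d − b), which results from expanding each log odds term to first order around a probability of 1/2. The function names and the example values of b and d are illustrative assumptions.

```python
import math

def log_odds_ratio(b, d):
    """Exact log odds ratio when a = 1 - b and c = 1 - d."""
    return math.log((1 - b) / b) + math.log(d / (1 - d))

def taylor_approx(b, d):
    """First-order approximation, accurate when b and d are near 1/2."""
    return 4 * (d - b)

for b, d in [(0.45, 0.55), (0.40, 0.60), (0.10, 0.90)]:
    print(b, d, round(log_odds_ratio(b, d), 3), round(taylor_approx(b, d), 3))
```

The two columns agree closely for probabilities near 1/2 and drift apart as b and d approach 0 or 1, so the approximation is best read as a rule of thumb for moderate probabilities.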
Odds ratios in 2 x 2 x K tables
Survival by Gender plots for each Class
Class    odds ratio    (CI)
1st      67.09         (23.7, 189.9)
2nd      44.07         (21.5, 90.3)
3rd      4.07          (2.8, 5.9)
Crew     23.26         (6.8, 79.1)
Odds ratios in 2 x 2 x K tables
X and Y are binary variables, Z is categorical with K categories
Death Penalty in Florida:
X death penalty (yes/no)
Y defendant’s race (black/white)
Z victim’s race (black/white)
Marginal Table of X/Y
Defendant    yes    no     Total
white        53     430    483
black        15     176    191
Total        68     606    674
Marginal odds ratio: 53*176/(430*15) = 1.45 (±0.59)
slight indication in favor of black defendants ?!
Odds ratios in 2 x 2 x K tables
Conditional Tables of X/Y
Z = white victim
Defendant    yes    no     Total
white        53     414    467
black        11     37     48
Total        64     451    515
Z = black victim
Defendant    yes    no     Total
white        0      16     16
black        4      139    143
Total        4      155    159
Conditional odds ratios: 0.43 (white victim), 0 (black victim)
very strong indication against black defendants
Florida Data
[Figures: mosaicplots of the Marginal Association (death penalty yes/no by defendant black/white) and of the Conditional Associations (death penalty by defendant, split by victim black/white)]
Simpson’s paradox
• Simpson’s paradox:
marginal association between X and Y is
opposite to
conditional associations between X and Y
for each level of Z
• due to:
very strong marginal association between
X and Z or Y and Z
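The reversal in the Florida data can be reproduced directly from the two conditional tables. This is a minimal sketch, assuming the counts given for white and black victims; the helper `odds_ratio` and the dictionary layout are illustrative choices.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Death penalty (yes, no) counts by defendant's race, for each victim's race.
tables = {
    "white victim": {"white": (53, 414), "black": (11, 37)},
    "black victim": {"white": (0, 16), "black": (4, 139)},
}

# Conditional odds ratios: one per level of Z (victim's race).
conditional = {
    z: odds_ratio(*t["white"], *t["black"]) for z, t in tables.items()
}

# Marginal table: add the conditional tables cell by cell.
white = [sum(t["white"][k] for t in tables.values()) for k in range(2)]
black = [sum(t["black"][k] for t in tables.values()) for k in range(2)]
marginal = odds_ratio(*white, *black)

print(conditional, marginal)
```

Both conditional odds ratios fall below 1 while the marginal odds ratio is above 1, which is the sign change that Simpson's paradox describes.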
Florida Data
[Figures: Marginal Association and Conditional Associations mosaicplots (axes death, defendant, victim), plus a panel labelled "Strong Interaction" with axes defendant and victim]
Conditional Odds Ratios
• X,Y are conditionally independent for level k
of Z, if the conditional log odds ratio is 0
• X,Y are conditionally independent given Z, if
all conditional log odds ratios are 0, i.e. all conditional odds ratios are 1.
(Does not imply marginal independence)
• X,Y have homogeneous association, if all
conditional odds ratios given Z are
equal across the levels of Z.
Testing Independence
• An odds ratio of 1 indicates independence;
a confidence interval helps to judge the
deviation from independence, but the CI is
only an approximation.
• Alternative solution: table tests
Testing independence
• null hypothesis: $\pi_{ij} = \pi_{i.} \cdot \pi_{.j} \quad \forall i, j$
• Score Test (Pearson, 1900): $X^2 = \sum_{i,j} \dfrac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$
• Likelihood-Ratio Test: $G^2 = 2 \sum_{i,j} n_{ij} \log\dfrac{n_{ij}}{\hat\mu_{ij}}$
• both X2 and G2 have the same limiting distribution of $\chi^2_{(I-1)(J-1)}$
Example: Cholesterol/Heart Disease
• 1329 patients of same age/sex
                      Coronary Disease
Cholesterol (mg/l)    present      absent
≤ 220                 y11 = 20     y12 = 553
> 220                 y21 = 72     y22 = 684
• Ho : P( heart disease | Cholesterol ≤ 220) = P( heart disease | Cholesterol > 220)
Cholesterol/Heart Disease
• Expected Values under independence
                      Coronary Disease
Cholesterol (mg/l)    present    absent     Total
≤ 220                 39.67      533.33     573
> 220                 52.33      703.67     756
Total                 92         1237       1329
Cholesterol/Heart Disease
• loglikelihood ratio test
G2 = 19.8
• Pearson score test
X2 = 18.4
• with df = (2-1)*(2-1) = 1, both exceed $\chi^2_{1,0.05} = 3.84$ and $\chi^2_{1,0.01} = 6.634897$:
independence seems to be violated
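Both test statistics can be recomputed from the observed cholesterol counts. The sketch below assumes the usual estimate of the expected counts under independence, (row total × column total) / n; the function name `independence_tests` is hypothetical. With this estimate the first expected count is 573 × 92 / 1329 ≈ 39.67.

```python
import math

def independence_tests(table):
    """Pearson X^2 and likelihood-ratio G^2 for an I x J table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected counts under independence: (row total * column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = 0.0
    g2 = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            x2 += (obs - exp) ** 2 / exp
            if obs > 0:  # a zero cell contributes nothing to G^2
                g2 += 2 * obs * math.log(obs / exp)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, x2, g2, df

cholesterol = [[20, 553], [72, 684]]
expected, x2, g2, df = independence_tests(cholesterol)
print([[round(e, 2) for e in row] for row in expected],
      round(x2, 1), round(g2, 1), df)
```

The function works for any I x J table of positive expected counts, so the same call could be reused for larger tables; comparing the statistics with a chi-square quantile is left to the reader.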
Extensions to
I x J Contingency Tables
Local Odds Ratios
• Each set of four cells forming a rectangle
yields one odds ratio
• Local Odds Ratio:
Use only neighboring cells
a  b
c  d
• local odds ratios form a
minimal sufficient set
Example: Marijuana Use
• Study on Marijuana use (based on parental use)
                    student
parent     never    occasional    regular
neither    141      54            40
one        68       44            51
both       17       11            19
• evidence of association?
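One way to look for association in this table is through its local odds ratios, each computed from a 2 x 2 block of neighboring cells as ad/(bc). The following sketch assumes the rows and columns are ordered as listed; the function name `local_odds_ratios` is hypothetical.

```python
def local_odds_ratios(table):
    """Odds ratios of all 2 x 2 blocks of neighboring cells in an I x J table."""
    ratios = []
    for i in range(len(table) - 1):
        row = []
        for j in range(len(table[0]) - 1):
            a, b = table[i][j], table[i][j + 1]
            c, d = table[i + 1][j], table[i + 1][j + 1]
            row.append((a * d) / (b * c))
        ratios.append(row)
    return ratios

# Rows: parent use (neither, one, both)
# Columns: student use (never, occasional, regular)
marijuana = [[141, 54, 40], [68, 44, 51], [17, 11, 19]]
lor = local_odds_ratios(marijuana)
for row in lor:
    print([round(r, 2) for r in row])
```

A 3 x 3 table yields (3-1)*(3-1) = 4 local odds ratios. Here none of them falls below 1, which is in line with a positive association between parent and student use.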
Example: Marijuana Use
• Student by Parent Use
prodplot(mj, count~student+parent,
c("vspine","hspine"),
subset=level==2)
• positive association?
[Figure: mosaicplot of student use by parent use (parent: neither, one, both)]
Summaries of I x J Tables
(ordinal variables)
        X=1    X=2    ...    X=I
Y=1     π11    π12    ...    π1I
Y=2     π21    π22    ...    π2I
...
Y=J     πJ1    πJ2    ...    πJI
For each pair of subjects count #concordant/discordant pairs, where
• Concordance/Discordance:
• a pair is concordant, if the subject ranked
higher on X is also ranked higher on Y
• a pair is discordant, if the subject ranked
higher on X is ranked lower on Y
Concordance/Discordance
concordance to (i,j): <i, <j or >i, >j
discordance to (i,j): <i, >j or >i, <j
$\Pi_C = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij} \cdot \sum_{h>i} \sum_{k>j} \pi_{hk} = \sum_{i,j} \pi_{ij}\, \pi^{c}_{ij}$
where $\pi^{c}_{ij}$ is the total probability of the cells concordant to (i,j), and $\pi^{d}_{ij}$ that of the cells discordant to (i,j)
Gamma Statistic
• let $\Pi_C$, $\Pi_D$ be the probabilities for concordance and discordance, resp.
$\gamma = \dfrac{\Pi_C - \Pi_D}{\Pi_C + \Pi_D}$
• approx. normal with
$\sigma^2(\hat\gamma)_\infty = \dfrac{16}{n(\Pi_C + \Pi_D)^4} \sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij} \left( \Pi_C\, \pi^{d}_{ij} - \Pi_D\, \pi^{c}_{ij} \right)^2$
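A sample version of gamma replaces the probabilities of concordance and discordance by counts of concordant and discordant pairs. The sketch below applies this to the marijuana table from the earlier example; treating its categories as ordered (neither < one < both, never < occasional < regular) and the function names are assumptions made for illustration.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in an I x J table of counts
    whose rows and columns are both in increasing order."""
    n_rows, n_cols = len(table), len(table[0])
    conc = disc = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(i + 1, n_rows):   # cells in a later row
                for k in range(n_cols):
                    if k > j:                # later column: concordant
                        conc += table[i][j] * table[h][k]
                    elif k < j:              # earlier column: discordant
                        disc += table[i][j] * table[h][k]
    return conc, disc

def gamma_hat(table):
    """Sample gamma: (C - D) / (C + D)."""
    conc, disc = concordant_discordant(table)
    return (conc - disc) / (conc + disc)

# Rows: parent use (neither, one, both)
# Columns: student use (never, occasional, regular)
marijuana = [[141, 54, 40], [68, 44, 51], [17, 11, 19]]
conc, disc = concordant_discordant(marijuana)
print(conc, disc, round(gamma_hat(marijuana), 3))
```

Pairs tied on either variable are ignored, so gamma only uses the untied pairs. Each pair is counted once here, which leaves the ratio unchanged because a common factor cancels in numerator and denominator.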