Measures of Association

Summarize strength of associations
Quantify relative risk

Types of measures
- odds ratio
- correlation
- Pearson statistic
- prediction
- concordance/discordance

References:

Goodman, L. A. and Kruskal, W. H. (1979). Measures of Association for Cross Classifications, Springer, New York.

Bishop, Fienberg and Holland (1975). Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA (Chapter 11).

Brown, M. B. and Benedetti (1977). Sampling behavior of tests for correlation in two-way contingency tables, Journal of the American Statistical Association, 72, 309-315.

Cohen, J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20, 37-46.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute, 22, 719-748.

Agresti, A. (1984). Analysis of Ordinal Categorical Data, Wiley, New York (Chapters 2 & 3).

Agresti, A. (2002). Categorical Data Analysis, 2nd edition, Wiley, New York (Chapter 2).
The odds ratio:
The most frequently used measure for 2 × 2 tables.

Observed counts:
                      Variable 2
                      j=1    j=2
Variable 1   i = 1    Y11    Y12
             i = 2    Y21    Y22

Means or "expected" counts:
                      j=1    j=2
             i = 1    m11    m12
             i = 2    m21    m22

True proportions:
                      j=1    j=2
             i = 1    π11    π12
             i = 2    π21    π22

The odds that a sampled unit is in category 1 for variable 1 given that it is in category j for variable 2:

\[
\frac{\Pr\{\text{variable 1 is in category 1} \mid \text{variable 2 is in category } j\}}
     {\Pr\{\text{variable 1 is in category 2} \mid \text{variable 2 is in category } j\}}
= \frac{\pi_{1j}/(\pi_{1j}+\pi_{2j})}{\pi_{2j}/(\pi_{1j}+\pi_{2j})}
= \frac{\pi_{1j}}{\pi_{2j}}
\]

Odds ratio:

\[
\alpha = \frac{\pi_{11}/\pi_{21}}{\pi_{12}/\pi_{22}}
       = \frac{\pi_{11}\pi_{22}}{\pi_{21}\pi_{12}}
       = \frac{m_{11}m_{22}}{m_{21}m_{12}}
\]

also called the cross-product ratio.
Example: Chinook Salmon

Early run (1999)

            Hook & Line    Net
Female          172        165
Male            119        202
Total           291        367

Odds of capturing a female:

hook & line:  (172/291) / (119/291) = .5911 / .4089 = 1.45

net:          (165/367) / (202/367) = .4496 / .5504 = 0.82
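The arithmetic for the early-run odds can be checked with a few lines of Python. This is only a sketch: the helper name `odds` and the variable names are illustrative and not part of the course programs, and the counts are the early-run counts from the table.

```python
def odds(successes, failures):
    """Odds of a 'success' from the counts of successes and failures."""
    return successes / failures

# Early run (1999) counts: females and males captured by each gear type
hook_female, hook_male = 172, 119
net_female, net_male = 165, 202

odds_hook = odds(hook_female, hook_male)   # same as (172/291) / (119/291)
odds_net = odds(net_female, net_male)      # same as (165/367) / (202/367)

# The ratio of the two odds equals the cross-product ratio of the table
odds_ratio = odds_hook / odds_net

print(round(odds_hook, 2), round(odds_net, 2), round(odds_ratio, 2))
```

The common row totals cancel, which is why the odds depend only on the two counts for each gear type.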
Estimated odds ratio:

\[
\hat\alpha = \frac{\text{odds of capturing a female with hook \& line}}
                  {\text{odds of capturing a female with a net}}
           = \frac{1.45}{0.82} = 1.77
\]

An approximate 95% confidence interval for $\alpha$ is (1.30, 2.42).

Conclusion:
The odds that a captured fish is female are about 30 to 140 percent greater with hook & line than with a net.

Late run:

            Hook & Line    Net
Female          199        168
Male            162        268
Total           361        436

Odds of capturing a female:

hook & line:  .5512 / .4488 = 1.23

net:          .3853 / .6147 = 0.63

Estimated odds ratio:

\[
\hat\alpha = \frac{1.23}{0.63} = 1.96
\]

An approximate 95% confidence interval for $\alpha$ is (1.48, 2.60).

Conclusion:
The odds that a captured fish is female are about 48 to 160 percent greater with hook & line than with a net.

/*
   Program to analyze the 1999
   Chinook salmon data.  This
   program is stored in the file
   chinook2.sas
*/

data set1;
  infile 'chinook.dat';
  input (year month day biweek run gear
         age sexa length)
        (4. 2. 2. 1. 1. 1. 2. $1. 4.);
  rage=int(age/10);
  oage=age-(10*rage);
  if(sexa = 'F') then sex=1;
                 else sex=2;
run;

/*  Attach labels to categories  */

proc format;
  value run  1 = 'Early'
             2 = 'Late';
  value sex  1 = 'Female'
             2 = 'Male';
  value gear 1 = 'Hook'
             2 = 'Net';
run;
proc sort data=set1; by run;
run;

/*
   Examine partial association between
   sex and method of capture within
   each run.
*/

proc freq data=set1; by run;
  table gear*sex / chisq Fisher all
                   nopercent nocol expected;
  format sex sex. gear gear. run run.;
run;

---------------- run=Early ----------------

The FREQ Procedure

Table of gear by sex

gear        sex

Frequency
Expected
Row Pct      Female     Male     Total

Hook           172       119       291
            149.04    141.96
             59.11     40.89

Net            165       202       367
            187.96    179.04
             44.96     55.04

Total          337       321       658
Statistics for Table of gear by sex

Statistic                      DF      Value     Prob
Chi-Square                      1    13.0018   0.0003
Likelihood Ratio Chi-Square     1    13.0545   0.0003
Continuity Adj. Chi-Square      1    12.4417   0.0004
Mantel-Haenszel Chi-Square      1    12.9820   0.0003
Phi Coefficient                       0.1406
Contingency Coefficient               0.1392
Cramer's V                            0.1406

Fisher's Exact Test
Cell (1,1) Frequency (F)          172
Left-sided Pr <= F             0.9999
Right-sided Pr >= F         2.050E-04

Table Probability (P)       9.352E-05
Two-sided Pr <= P           4.022E-04

Statistics for Table of gear by sex

Statistic                         Value      ASE
Gamma                            0.2778   0.0733
Kendall's Tau-b                  0.1406   0.0385
Stuart's Tau-c                   0.1396   0.0383

Somers' D C|R                    0.1415   0.0388
Somers' D R|C                    0.1397   0.0383

Pearson Correlation              0.1406   0.0385
Spearman Correlation             0.1406   0.0385

Lambda Asymmetric C|R            0.1153   0.0561
Lambda Asymmetric R|C            0.0241   0.0623
Lambda Symmetric                 0.0719   0.0513

Uncertainty Coefficient C|R      0.0143   0.0079
Uncertainty Coefficient R|C      0.0145   0.0080
Uncertainty Coefficient Sym      0.0144   0.0079
Estimates of the Relative Risk (Row1/Row2)

Type of Study                  Value    95% Confidence Limits
Case-Control (Odds Ratio)     1.7695     1.2961     2.4157
Cohort (Col1 Risk)            1.3147     1.1336     1.5246
Cohort (Col2 Risk)            0.7430     0.6292     0.8773

Sample Size = 658

------------------ run=Late ------------------

Table of gear by sex

gear        sex

Frequency
Expected
Row Pct      Female     Male     Total

Hook           199       162       361
            166.23    194.77
             55.12     44.88

Net            168       268       436
            200.77    235.23
             38.53     61.47

Total          367       430       797
Statistics for Table of gear by sex

Statistic                      DF      Value     Prob
Chi-Square                      1    21.8848   <.0001
Likelihood Ratio Chi-Square     1    21.9550   <.0001
Continuity Adj. Chi-Square      1    21.2221   <.0001
Mantel-Haenszel Chi-Square      1    21.8574   <.0001
Phi Coefficient                       0.1657
Contingency Coefficient               0.1635
Cramer's V                            0.1657

Fisher's Exact Test
Cell (1,1) Frequency (F)          199
Left-sided Pr <= F             1.0000
Right-sided Pr >= F         1.980E-06

Table Probability (P)       9.975E-07
Two-sided Pr <= P           3.361E-06

Statistics for Table of gear by sex

Statistic                         Value      ASE
Gamma                            0.3242   0.0647
Kendall's Tau-b                  0.1657   0.0350
Stuart's Tau-c                   0.1645   0.0348

Somers' D C|R                    0.1659   0.0350
Somers' D R|C                    0.1655   0.0350

Pearson Correlation              0.1657   0.0350
Spearman Correlation             0.1657   0.0350

Lambda Asymmetric C|R            0.1008   0.0491
Lambda Asymmetric R|C            0.0859   0.0507
Lambda Symmetric                 0.0934   0.0445

Uncertainty Coefficient C|R      0.0200   0.0085
Uncertainty Coefficient R|C      0.0200   0.0085
Uncertainty Coefficient Sym      0.0200   0.0085

Estimates of the Relative Risk (Row1/Row2)

Type of Study                  Value    95% Confidence Limits
Case-Control (Odds Ratio)     1.9596     1.4763     2.6012
Cohort (Col1 Risk)            1.4306     1.2305     1.6633
Cohort (Col2 Risk)            0.7301     0.6370     0.8367

Sample Size = 797

Properties
(i) The odds ratio is not "margin sensitive".
    Choose $t_1, t_2, s_1, s_2$ such that $t_1 + t_2 = 1$ and $s_1 + s_2 = 1$. Then the odds ratio for

        $s_1 t_1 \pi_{11}$    $s_1 t_2 \pi_{12}$
        $s_2 t_1 \pi_{21}$    $s_2 t_2 \pi_{22}$

    is

    \[
    \frac{(s_1 t_1 \pi_{11})(s_2 t_2 \pi_{22})}{(s_1 t_2 \pi_{12})(s_2 t_1 \pi_{21})}
    = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \alpha
    \]

(ii) $0 \le \alpha \le \infty$

(iii) $\alpha = 1$ corresponds to independence

(iv) Interchanging the rows of the table produces $1/\alpha$.

(v) Interchanging the columns of the table produces $1/\alpha$.

(vi) Interchanging both the rows and the columns of the 2 × 2 table produces $\alpha$.

So, $\alpha = 4$ indicates the same level of association as $\alpha = 1/4 = .25$.
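Properties (i) and (iv) to (vi) can be verified numerically. The following is a minimal sketch: the table of proportions `pi` and the multipliers `s` and `t` are made-up illustrative values, and `odds_ratio` is a hypothetical helper, not a function from the course files.

```python
import math

def odds_ratio(table):
    """Cross-product ratio of a 2x2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Hypothetical true proportions (they sum to 1)
pi = [[0.2, 0.3],
      [0.1, 0.4]]
alpha = odds_ratio(pi)

# (i) Rescale row i by s[i] and column j by t[j]: the odds ratio is unchanged
s = (0.3, 0.7)
t = (0.6, 0.4)
scaled = [[s[i] * t[j] * pi[i][j] for j in range(2)] for i in range(2)]

# (iv), (v), (vi) Interchange rows, columns, or both
rows_swapped = [pi[1], pi[0]]
cols_swapped = [[row[1], row[0]] for row in pi]
both_swapped = [[row[1], row[0]] for row in rows_swapped]

print(alpha, odds_ratio(scaled))
print(odds_ratio(rows_swapped), odds_ratio(cols_swapped), odds_ratio(both_swapped))
```

The row and column multipliers cancel in the cross-product ratio, so only the interior pattern of the table matters.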
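Estimation:

Substitute $Y_{ij}$ for $m_{ij}$ in $\alpha$:

\[
\hat\alpha = \frac{Y_{11} Y_{22}}{Y_{12} Y_{21}}
\]

$\hat\alpha$ has a "large sample" $N(\alpha, \sigma^2_{\infty,\hat\alpha})$ distribution with

\[
\sigma^2_{\infty,\hat\alpha} = \alpha^2 \left( \frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}} \right)
\]

(when each $m_{ij}$ is large).

Estimation of the large sample variance:

\[
\hat\sigma^2_{\infty,\hat\alpha} = \hat\alpha^2 \left( \frac{1}{Y_{11}} + \frac{1}{Y_{12}} + \frac{1}{Y_{21}} + \frac{1}{Y_{22}} \right)
\]

For small samples use

\[
\hat\alpha = \frac{(Y_{11} + .5)(Y_{22} + .5)}{(Y_{12} + .5)(Y_{21} + .5)}
\]

and

\[
\hat\sigma^2_{\infty,\hat\alpha} = (\hat\alpha)^2 \left( \frac{1}{Y_{11} + .5} + \frac{1}{Y_{12} + .5} + \frac{1}{Y_{21} + .5} + \frac{1}{Y_{22} + .5} \right)
\]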
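Both estimators can be written as one small function with an optional constant added to every cell. This is a sketch under stated assumptions: the function name `odds_ratio_estimate` and its `cont` argument are illustrative (the argument mirrors the idea of adding .5 to each count), and the table with a zero cell is a made-up example.

```python
import math

def odds_ratio_estimate(y11, y12, y21, y22, cont=0.0):
    """Return (alpha_hat, estimated large-sample standard error of alpha_hat).

    cont is added to every cell; cont=0.5 gives the small-sample version.
    """
    cells = [y + cont for y in (y11, y12, y21, y22)]
    a, b, c, d = cells
    alpha_hat = (a * d) / (b * c)
    var_hat = alpha_hat ** 2 * sum(1.0 / x for x in cells)
    return alpha_hat, math.sqrt(var_hat)

# Early-run Chinook counts (rows: hook, net; columns: female, male)
alpha_hat, se = odds_ratio_estimate(172, 119, 165, 202)
print(round(alpha_hat, 4), round(se, 4))

# Hypothetical small table with a zero cell: the .5 adjustment avoids
# dividing by zero in both the estimate and its variance
alpha_small, se_small = odds_ratio_estimate(5, 0, 3, 4, cont=0.5)
print(round(alpha_small, 4), round(se_small, 4))
```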
Other smooth functions of $\alpha$ are often used as measures of association, say $f(\alpha)$.

The large sample distribution for $f(\hat\alpha)$ is

\[
N\left( f(\alpha), \; [f'(\alpha)]^2 \, \sigma^2_{\infty,\hat\alpha} \right)
\]

The asymptotic variance is obtained from the delta method.

log-odds ratio: $\log(\alpha)$

(i) $-\infty < \log(\alpha) < \infty$
(ii) independence $\Leftrightarrow$ $\alpha = 1$ $\Leftrightarrow$ $\log(\alpha) = 0$
(iii) $\log(\alpha)$ is not "margin sensitive"
(iv) $\log(\alpha)$ and $-\log(\alpha) = \log(1/\alpha)$ imply equal levels of association
(v) $\log(\hat\alpha)$ has a more nearly symmetric distribution than $\hat\alpha$ for smaller sample sizes
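\[
\log(\hat\alpha) \;\overset{dist}{\longrightarrow}\; N\left( \log(\alpha), \; \frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}} \right)
\]

as $m_{ij} \to \infty$ for all $(i, j)$.

Approximate 95% confidence intervals:

For $\log(\alpha)$:

\[
\log(\hat\alpha) \pm (1.96) \sqrt{ \frac{1}{Y_{11}} + \frac{1}{Y_{12}} + \frac{1}{Y_{21}} + \frac{1}{Y_{22}} } = [A, B]
\]

for $\alpha$: $[\exp(A), \exp(B)]$

Yule's Q:

Yule (1900)

\[
Q = \frac{\alpha - 1}{\alpha + 1} = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}}
\]

Properties:

1. $-1 \le Q \le 1$
2. $Q = 0$ for the independence model
3. $Q = 1$ when either $\pi_{12} = 0$ or $\pi_{21} = 0$
4. $Q = -1$ when either $\pi_{11} = 0$ or $\pi_{22} = 0$
5. Q is "symmetric". When the columns (or rows) of a 2 × 2 table are interchanged, then $\alpha \Rightarrow 1/\alpha$ and $Q \Rightarrow -Q$
6. Q is a margin free measure

Q is the Goodman-Kruskal Gamma statistic for 2 × 2 tables.

Estimation:

\[
\hat{Q} = \frac{\hat\alpha - 1}{\hat\alpha + 1} = \frac{Y_{11}Y_{22} - Y_{12}Y_{21}}{Y_{11}Y_{22} + Y_{12}Y_{21}}
\]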
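A short Python sketch of the estimate of Q, applied to the early-run Chinook counts, illustrates the link to the odds ratio and the sign change when columns are interchanged. The function name `yules_q` is illustrative only; it is not part of the SAS or S-PLUS programs.

```python
import math

def yules_q(y11, y12, y21, y22):
    """Estimate of Yule's Q for a 2x2 table of counts."""
    return (y11 * y22 - y12 * y21) / (y11 * y22 + y12 * y21)

# Early-run Chinook counts (rows: hook, net; columns: female, male)
q_hat = yules_q(172, 119, 165, 202)

# Same value obtained from the estimated odds ratio
alpha_hat = (172 * 202) / (119 * 165)
q_from_alpha = (alpha_hat - 1) / (alpha_hat + 1)

# Interchanging the columns changes the sign of Q
q_cols_swapped = yules_q(119, 172, 202, 165)

print(round(q_hat, 4), round(q_from_alpha, 4), round(q_cols_swapped, 4))
```

The value agrees with the Gamma statistic reported by PROC FREQ for the early run (0.2778).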
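Large sample distribution:

As $m_{ij} \to \infty$ for all $i = 1, 2$; $j = 1, 2$

\[
\hat{Q} \;\dot\sim\; N\left( \frac{\alpha - 1}{\alpha + 1}, \; \frac{1}{4}(1 - Q^2)^2 \left( \frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}} \right) \right)
\]

The variance is

\[
[f'(\alpha)]^2 \, \alpha^2 \left( \frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}} \right)
\]

where $f(\alpha) = \dfrac{\alpha - 1}{\alpha + 1}$.

What is a large or substantial association?

What is a large value of $\alpha$? $\ln(\alpha)$? Q?

1. Use large sample distributions to construct confidence intervals or test hypotheses

   $H_0: \alpha = 1$ versus $H_A: \alpha \ne 1$.

   Reject $H_0$ if

   \[
   |Z| = \left| \frac{\ln(\hat\alpha) - 0}{\hat\sigma_{\ln(\hat\alpha)}} \right| > Z_{\alpha/2}
   \]

   (in the subscript of $Z_{\alpha/2}$, $\alpha$ denotes the significance level, not the odds ratio).

2. Suppose you discover that $\hat\alpha = 2.4$ is "significantly different" from one. Is an odds ratio of 2.4 large enough to have practical importance?

   There are no absolute guidelines. It depends on the subject matter or field of study.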
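The large-sample test of an odds ratio of one and the standard error of the estimate of Q can be sketched in Python using only the standard library. The function names are illustrative assumptions, and the counts are the early-run Chinook counts used earlier.

```python
import math
from statistics import NormalDist

def log_odds_ratio_test(y11, y12, y21, y22):
    """Z statistic and two-sided p-value for H0: odds ratio = 1."""
    log_alpha_hat = math.log((y11 * y22) / (y12 * y21))
    se_log = math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)
    z = (log_alpha_hat - 0.0) / se_log
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def yules_q_ase(y11, y12, y21, y22):
    """Estimated large-sample standard error of Q-hat (counts replace m_ij)."""
    q = (y11 * y22 - y12 * y21) / (y11 * y22 + y12 * y21)
    return 0.5 * (1 - q ** 2) * math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)

z, p_value = log_odds_ratio_test(172, 119, 165, 202)
ase_q = yules_q_ase(172, 119, 165, 202)
print(round(z, 2), p_value < 0.001, round(ase_q, 4))
```

The standard error for Q agrees with the ASE that PROC FREQ reports for Gamma in the early run (0.0733).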
A useful application of measures of association for two-way tables is to assess differences in levels of association across time, or locations.

2 × 2 tables of sex (male, female) by College Graduate (Yes, No), one table for each year:

Year     Estimated odds ratio
1950     $\hat\alpha_{1950} = 10.1$
1960     $\hat\alpha_{1960} = 4.2$
1970     $\hat\alpha_{1970} = 2.7$
1980     $\hat\alpha_{1980} = 1.8$
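Relative Risk

            Heart      No heart
            attack     attack
Placebo     π11        π12         n1
Aspirin     π21        π22         n2

Relative risk of a heart attack

\[
= \frac{\Pr\{\text{heart attack} \mid \text{placebo}\}}{\Pr\{\text{heart attack} \mid \text{aspirin}\}}
= \frac{\pi_{11}}{\pi_{21}}
\doteq \left( \frac{\pi_{11}}{\pi_{21}} \right) \left( \frac{\pi_{22}}{\pi_{12}} \right)
= \alpha
\]

when $\pi_{11}$ and $\pi_{21}$ are small. In that case,

$\pi_{22} = 1 - \pi_{21} \doteq 1.0$ and $\pi_{12} = 1 - \pi_{11} \doteq 1.0$.

Data:
            Heart      No heart
            attack     attack
Placebo      189       10845       n1 = 11034
Aspirin      104       10933       n2 = 11037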
Estimated relative risk of heart attack for those taking the placebo versus those taking the aspirin:

\[
\frac{189/11034}{104/11037} = \frac{.0171}{.0094} = 1.82
\]

Odds ratio:

\[
\hat\alpha = \frac{\text{odds of a heart attack for placebo users}}{\text{odds of a heart attack for aspirin users}}
= \frac{(189)(10933)}{(104)(10845)} = 1.83
\]
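Confidence interval: (case-control)

$\hat\alpha = 1.83205$

$\log(\hat\alpha) = 0.6054377$

\[
S^2_{\log(\hat\alpha)} = \sum_i \sum_j \frac{1}{Y_{ij}} = .01509
\]

Then $\log(\hat\alpha) \pm z_{\alpha/2} \, S_{\log(\hat\alpha)}$

$\Rightarrow$ $0.6054 \pm (1.96)\sqrt{.01509}$
$\Rightarrow$ $(0.36467, \; 0.84621)$
$\Rightarrow$ $(e^{0.36467}, \; e^{0.84621})$
$\Rightarrow$ $(1.440, \; 2.331)$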
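The steps of this interval can be reproduced in Python as a check on the arithmetic. This is a minimal sketch using only the standard library; the function name `odds_ratio_ci` is an illustrative assumption and is not the course's S-PLUS function.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(table, conf=0.95):
    """Odds ratio and an approximate confidence interval built on the log scale."""
    (y11, y12), (y21, y22) = table
    alpha_hat = (y11 * y22) / (y12 * y21)
    se_log = math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for conf = 0.95
    lower = math.exp(math.log(alpha_hat) - z * se_log)
    upper = math.exp(math.log(alpha_hat) + z * se_log)
    return alpha_hat, lower, upper

# Rows: placebo, aspirin; columns: heart attack, no heart attack
aspirin = [[189, 10845],
           [104, 10933]]
alpha_hat, lower, upper = odds_ratio_ci(aspirin)
print(round(alpha_hat, 4), round(lower, 3), round(upper, 3))
```

Exponentiating the endpoints keeps the interval inside the range of possible odds ratios, which an interval of the form estimate plus or minus a multiple of its standard error would not guarantee.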
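Relative risk:

\[
RR_{col1} = \frac{189/11034}{104/11037} = 1.8178
\]

$\log(RR_{col1}) = .5976275$

\[
S^2_{\log(RR_{col1})} = \frac{1 - \hat\pi_{11}}{n_1 \hat\pi_{11}} + \frac{1 - \hat\pi_{21}}{n_2 \hat\pi_{21}}
= \frac{Y_{12}}{n_1 Y_{11}} + \frac{Y_{22}}{n_2 Y_{21}}
= .01472515
\]

\[
\log(RR_{col1}) \pm (1.96) \sqrt{S^2_{\log(RR_{col1})}}
\]

$\Rightarrow$ $.5976275 \pm (1.96)\sqrt{.01472515}$
$\Rightarrow$ $(.359788, \; .835467)$
$\Rightarrow$ An approximate 95% confidence interval is $(e^{.359788}, \; e^{.835467})$
$\Rightarrow$ $(1.433, \; 2.306)$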
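The relative risk interval follows the same pattern with a different variance formula. Below is a sketch in Python; `relative_risk_ci` is a hypothetical helper name, and the inputs are the aspirin study counts.

```python
import math
from statistics import NormalDist

def relative_risk_ci(y11, n1, y21, n2, conf=0.95):
    """Relative risk (row 1 over row 2) with a confidence interval built on the log scale."""
    p1 = y11 / n1
    p2 = y21 / n2
    rr = p1 / p2
    var_log = (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * math.sqrt(var_log)
    return rr, math.exp(math.log(rr) - half_width), math.exp(math.log(rr) + half_width)

# Placebo: 189 heart attacks among 11034; aspirin: 104 among 11037
rr, rr_lower, rr_upper = relative_risk_ci(189, 11034, 104, 11037)
print(round(rr, 4), round(rr_lower, 3), round(rr_upper, 3))
```

Because heart attacks are rare in both groups, this interval is close to the odds ratio interval (1.440, 2.331).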
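\[
\log(RR) = \log\left( \frac{\pi_{11}}{\pi_{21}} \right) = \log(\pi_{11}) - \log(\pi_{21}) = g(\pi_{11}, \pi_{21})
\]

Independent binomial experiments:

$Y_{11} \sim \text{Bin}(n_1, \pi_{11})$,   $\hat\pi_{11} = \dfrac{Y_{11}}{n_1}$

$Y_{21} \sim \text{Bin}(n_2, \pi_{21})$,   $\hat\pi_{21} = \dfrac{Y_{21}}{n_2}$

\[
V = Var\begin{pmatrix} \hat\pi_{11} \\ \hat\pi_{21} \end{pmatrix}
= \begin{pmatrix} \dfrac{\pi_{11}(1 - \pi_{11})}{n_1} & 0 \\ 0 & \dfrac{\pi_{21}(1 - \pi_{21})}{n_2} \end{pmatrix}
\]

Delta method:

\[
Var\left( g(\hat\pi_{11}, \hat\pi_{21}) \right)
\doteq \begin{pmatrix} \dfrac{\partial g}{\partial \pi_{11}} & \dfrac{\partial g}{\partial \pi_{21}} \end{pmatrix}
V
\begin{pmatrix} \dfrac{\partial g}{\partial \pi_{11}} \\ \dfrac{\partial g}{\partial \pi_{21}} \end{pmatrix}
= \begin{pmatrix} \dfrac{1}{\pi_{11}} & -\dfrac{1}{\pi_{21}} \end{pmatrix}
V
\begin{pmatrix} \dfrac{1}{\pi_{11}} \\ -\dfrac{1}{\pi_{21}} \end{pmatrix}
= \frac{1 - \pi_{11}}{n_1 \pi_{11}} + \frac{1 - \pi_{21}}{n_2 \pi_{21}}
\]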
SAS Code

/*  This program computes the
    odds ratio for a 2x2 table.
    It is stored in the file
         aspirin.sas            */

DATA SET1;
  INPUT ROW COL COUNT;
  LABEL ROW = 'Treatment'
        COL = 'Heart attack';
  CARDS;
1 1 189
1 2 10845
2 1 104
2 2 10933
run;

/*  Assign labels to values  */

PROC FORMAT;
  VALUE RFMT 1 = 'Placebo'
             2 = 'Aspirin';
  VALUE CFMT 1 = 'Yes'
             2 = 'No'; run;

/*  Analyze the table of counts  */

PROC FREQ DATA=SET1;
  TABLES ROW*COL / CHISQ ALL
                   NOPERCENT NOCOL EXPECTED;
  WEIGHT COUNT;
  FORMAT ROW RFMT. COL CFMT.;
run;
The FREQ Procedure

Table of ROW by COL

ROW(Treatment)     COL(Heart attack)

Frequency
Expected
Row Pct        Yes        No     Total

Placebo        189     10845     11034
            146.48     10888
              1.71     98.29

Aspirin        104     10933     11037
            146.52     10890
              0.94     99.06

Total          293     21778     22071

Statistics for Table of ROW by COL

Statistic                      DF      Value     Prob
Chi-Square                      1    25.0139   <.0001
Likelihood Ratio Chi-Square     1    25.3720   <.0001
Continuity Adj. Chi-Square      1    24.4291   <.0001
Mantel-Haenszel Chi-Square      1    25.0128   <.0001
Phi Coefficient                       0.0337
Contingency Coefficient               0.0336
Cramer's V                            0.0337

Fisher's Exact Test
Cell (1,1) Frequency (F)          189
Left-sided Pr <= F             1.0000
Right-sided Pr >= F         3.253E-07

Table Probability (P)       1.516E-07
Two-sided Pr <= P           5.033E-07
Estimates of the Relative Risk (Row1/Row2)

Type of Study                  Value    95% Confidence Limits
Case-Control (Odds Ratio)     1.8321     1.4400     2.3308
Cohort (Col1 Risk)            1.8178     1.4330     2.3059
Cohort (Col2 Risk)            0.9922     0.9892     0.9953

Sample Size = 22071

S-PLUS Code

# An S-PLUS function to compute
# an odds ratio and construct an
# approximate confidence interval
# This code is posted in the file
#
#         oddsratio.ssc

oddsratio <- function(table,conf=.95,
                      cont=0.0)
{level <- 1-(1-conf)/2
 tablec <- table + cont
 alpha <- tablec[1,1]*tablec[2,2]/
          (tablec[1,2]*tablec[2,1])
 la <- log(alpha)
 sla <- sqrt(sum(tablec^(-1)))
 sa <- alpha*sla
 lowera <-
   round(exp(la-qnorm(level)*sla),4)
 uppera <-
   round(exp(la+qnorm(level)*sla),4)
 confper <- round(conf*100,1)
 sar <- round(sa,4)
 alphar <- round(alpha,4)
 cat("\n", "odds-ratio = ", alphar)
 cat("\n", "std. error = ", sar)
 cat("\n",confper,"% confidence interval")
 cat("\n","   lower limit     upper limit")
 cat("\n", "    ", lowera, "      ",
     uppera)
}

# To run this function, first create a
# table of counts
#
aspirin <- matrix(c(189, 10845, 104,
                    10933), 2, 2, byrow=T)
#
# Then source this function into the
# command window
#
source("yourdirectory/oddsratio.ssc")
#
# Then execute the function
#
oddsratio(aspirin,conf=.95,cont=0.0)
odds-ratio =  1.8321
std. error =  0.2251
95 % confidence interval
   lower limit     upper limit
      1.44            2.3308

The heart attack study is an example of a prospective study.

In such studies:

- Patients are randomly assigned to treatment groups.
- The treatments are administered.
- The proportion that give a certain response is recorded for each treatment group.

placebo: $Y_{11}/n_1$ = observed proportion that experience a heart attack

aspirin: $Y_{21}/n_2$ = observed proportion that experience a heart attack

These are direct estimates of population proportions needed to determine relative risk.
Retrospective (case-control) studies:

Examine what has happened in the past.

Example:

- Take a simple random sample of n1 patient records (cases), e.g. women who have experienced a heart attack.
- Classify each woman according to whether or not she ever used oral contraception.
- Take an independent simple random sample of n2 controls, and classify each woman in the same way.

oral
contraceptive      Heart      No heart
use                attack     attack
Yes                 Y11        Y12
No                  Y21        Y22
                    n1         n2

$Y_{11}/n_1$ estimates $\Pr\{\text{used oral contraceptive} \mid \text{experienced a heart attack}\}$

$Y_{12}/n_2$ estimates $\Pr\{\text{used oral contraceptive} \mid \text{never had a heart attack}\}$
These do not provide a direct estimate of

\[
RR = \frac{\Pr\{\text{heart attack} \mid \text{used oral contraception}\}}
          {\Pr\{\text{heart attack} \mid \text{do not use oral contraception}\}}
\]

Bayes Rule to the rescue?

\[
\Pr\{\text{heart attack} \mid \text{use o.c.}\}
= \frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\} \, \Pr\{\text{heart attack}\}}{\Pr\{\text{use o.c.}\}}
\]

Then

\[
RR = \frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\} \,/\, \Pr\{\text{use o.c.}\}}
          {\Pr\{\text{do not use o.c.} \mid \text{heart attack}\} \,/\, \Pr\{\text{do not use o.c.}\}}
\]

Relative risk of heart attack cannot be estimated without additional information on $\Pr\{\text{use o.c.}\}$, the proportion of women in the population who use oral contraceptives.
Approximate the relative risk with an odds ratio:

\[
\frac{\Pr\{\text{heart attack} \mid \text{use o.c.}\} \,/\, \Pr\{\text{no heart attack} \mid \text{use o.c.}\}}
     {\Pr\{\text{heart attack} \mid \text{do not use o.c.}\} \,/\, \Pr\{\text{no heart attack} \mid \text{do not use o.c.}\}}
\]

Which is equal to

\[
\frac{\dfrac{\Pr\{\text{use o.c.} \mid \text{heart attack}\} \Pr\{\text{heart attack}\} / \Pr\{\text{use o.c.}\}}
            {\Pr\{\text{use o.c.} \mid \text{no heart attack}\} \Pr\{\text{no heart attack}\} / \Pr\{\text{use o.c.}\}}}
     {\dfrac{\Pr\{\text{do not use o.c.} \mid \text{heart attack}\} \Pr\{\text{heart attack}\} / \Pr\{\text{do not use o.c.}\}}
            {\Pr\{\text{do not use o.c.} \mid \text{no heart attack}\} \Pr\{\text{no heart attack}\} / \Pr\{\text{do not use o.c.}\}}}
\]

Which is equal to

\[
\frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\} \,/\, \Pr\{\text{do not use o.c.} \mid \text{heart attack}\}}
     {\Pr\{\text{use o.c.} \mid \text{no heart attack}\} \,/\, \Pr\{\text{do not use o.c.} \mid \text{no heart attack}\}}
\]

An estimate is

\[
\frac{(Y_{11}/n_1) \,/\, (Y_{21}/n_1)}{(Y_{12}/n_2) \,/\, (Y_{22}/n_2)} = \frac{Y_{11} Y_{22}}{Y_{12} Y_{21}}
\]
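The equality of the two forms of the odds ratio can be checked numerically. The sketch below uses a made-up joint distribution of oral contraceptive use and heart attack; the probabilities are hypothetical and chosen only for illustration, and all variable names are assumptions.

```python
import math

# Hypothetical joint probabilities Pr{exposure, outcome}; they sum to 1
joint = {
    ("use", "ha"): 0.002, ("use", "no_ha"): 0.198,
    ("no_use", "ha"): 0.004, ("no_use", "no_ha"): 0.796,
}

def conditional(event, given, given_index):
    """Pr{event | given}; given_index is 0 for exposure, 1 for outcome."""
    total = sum(p for key, p in joint.items() if key[given_index] == given)
    key = (given, event) if given_index == 0 else (event, given)
    return joint[key] / total

# Prospective view: condition on exposure
p_ha_use = conditional("ha", "use", 0)
p_ha_no_use = conditional("ha", "no_use", 0)
relative_risk = p_ha_use / p_ha_no_use
or_prospective = (p_ha_use / (1 - p_ha_use)) / (p_ha_no_use / (1 - p_ha_no_use))

# Retrospective (case-control) view: condition on outcome
p_use_ha = conditional("use", "ha", 1)
p_use_no_ha = conditional("use", "no_ha", 1)
or_retrospective = (p_use_ha / (1 - p_use_ha)) / (p_use_no_ha / (1 - p_use_no_ha))

print(round(relative_risk, 4), round(or_prospective, 4), round(or_retrospective, 4))
```

With a rare outcome, as in this hypothetical example, the odds ratio that a case-control study can estimate is close to the relative risk that it cannot estimate directly.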