Contingency tables

Brian Healy, PhD
Types of analysis - independent samples

Outcome        Explanatory    Analysis
Continuous     Dichotomous    t-test, Wilcoxon test
Continuous     Categorical    ANOVA, linear regression
Continuous     Continuous     Correlation, linear regression
Dichotomous    Dichotomous    Chi-square test, logistic regression
Dichotomous    Continuous     Logistic regression
Time to event  Dichotomous    Log-rank test
Example
- MS is known to have a genetic component
- Several single nucleotide polymorphisms (SNPs) have been associated with susceptibility to MS
- Question: Do patients with susceptibility SNPs experience more sustained progression than patients without susceptibility SNPs?

Data
- Initially, we will focus on presence vs. absence of SNPs
- Among our 190 GA-treated patients, 74 had the SNP and 116 did not
  – 12 patients with the SNP experienced sustained progression:

    \hat{p}_{SNP+} = \frac{12}{74} = 0.162

  – 13 patients without the SNP experienced sustained progression:

    \hat{p}_{SNP-} = \frac{13}{116} = 0.112
Another way to look at the data
- Rather than investigating two proportions, we can look at a 2x2 table of the same data

          SNP+   SNP-   Total
Prog       12     13      25
No prog    62    103     165
Total      74    116     190
Question
- In our analysis, we assume that the margins are fixed
- If there were no relationship between the two variables, what would we expect the values in the table to be?

Example
- As an example, use this table

          SNP+               SNP-               Total
Prog      50*100/200 = 25    50*100/200 = 25     50
No prog   150*100/200 = 75   150*100/200 = 75   150
Total     100                100                200
Expected table
- Expected table for our analysis

          SNP+                 SNP-                  Total
Prog      25*74/190 = 9.73     25*116/190 = 15.3      25
No prog   165*74/190 = 64.3    165*116/190 = 100.7   165
Total     74                   116                   190

- How different is our observed data compared to the expected table?
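These expected counts can be checked quickly from the margins (row total times column total divided by N). A minimal Python sketch, assuming NumPy is available; the table values come from the slides above.

    import numpy as np

    # Observed 2x2 table: rows = Prog / No prog, columns = SNP+ / SNP-
    observed = np.array([[12, 13],
                         [62, 103]])

    row_totals = observed.sum(axis=1)   # [25, 165]
    col_totals = observed.sum(axis=0)   # [74, 116]
    n = observed.sum()                  # 190

    # Expected count for each cell = row total * column total / grand total
    expected = np.outer(row_totals, col_totals) / n
    print(expected)   # approximately [[9.74, 15.26], [64.26, 100.74]]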
Does our data show an effect?
- To test for an association between the outcome and the predictor, we would like to know if our observed table was different from the expected table
- How could we investigate if our table was different?

\chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \frac{(O_4 - E_4)^2}{E_4} = \sum_{\text{cells}} \frac{(O_i - E_i)^2}{E_i}
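Plugging the observed and expected tables above into this formula gives the statistic directly. A minimal sketch, again assuming NumPy:

    import numpy as np

    observed = np.array([[12, 13],
                         [62, 103]], dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Chi-square statistic: sum over the four cells of (O - E)^2 / E
    chi2_stat = ((observed - expected) ** 2 / expected).sum()
    print(round(chi2_stat, 2))   # approximately 0.99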
Chi-square distribution
- This statistic follows a chi-square distribution with 1 degree of freedom:

\chi^2 = \sum_{\text{cells}} \frac{(O_i - E_i)^2}{E_i}

- Assume x is a normal random variable with mean = 0 and variance = 1
  – x^2 has a chi-square distribution with 1 degree of freedom

[Figure: chi-square distribution with 1 degree of freedom; the area to the right of X^2 = 3.84 is 0.05]
Critical information for the chi-square test
- For 1 degree of freedom, the cut-off for \alpha = 0.05 is 3.84
  – For the normal distribution, this is 1.96
  – Note 1.96^2 = 3.84
- Inherently two-sided, since the statistic is squared
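The 3.84 cut-off and its link to the normal 1.96 can be confirmed numerically. A minimal sketch, assuming SciPy is available:

    from scipy.stats import chi2, norm

    print(chi2.ppf(0.95, df=1))    # ~3.84: chi-square cut-off for alpha = 0.05 with 1 df
    print(norm.ppf(0.975) ** 2)    # ~3.84: the two-sided normal cut-off 1.96, squared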
Hypothesis test with chi-square
1) H0: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) Chi-square test
4) Test statistic: \chi^2 = 0.99
5) p-value = 0.32
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

[STATA output: chi-square statistic and p-value]
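The same test can be run directly from the 2x2 table outside STATA. A minimal SciPy sketch; correction=False reproduces the uncorrected statistic on the slide (the continuity correction is discussed below):

    from scipy.stats import chi2_contingency

    table = [[12, 13],
             [62, 103]]

    stat, p, dof, expected = chi2_contingency(table, correction=False)
    print(round(stat, 2), round(p, 2), dof)   # approximately 0.99, 0.32, 1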
Hypothesis test comparison
- Yesterday, we completed this same test using a comparison of proportions
- Let's compare the results

Method               Test statistic    p-value
Test of proportions  z = 0.996         p = 0.32
Chi-square test      \chi^2 = 0.992    p = 0.32

- We get the same result!!! (Note that z^2 = 0.996^2 = 0.992 = \chi^2.)
Question: Continuity correction
- What is a continuity correction and when should I use it?
  – The continuity correction subtracts 1/2 from |O_i - E_i| in the numerator of the \chi^2 statistic:

\chi^2 = \sum_{i=1}^{r \cdot c} \frac{\left(|O_i - E_i| - 0.5\right)^2}{E_i}

  – Designed to improve the performance of the normal approximation
  – Use the default in STATA (or other stat package), but know which you are using
  – Less important today since exact tests are easily used
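In SciPy, this Yates-type correction is controlled by the correction argument of chi2_contingency (it is on by default, as in many packages). A small sketch comparing the two versions on this table:

    from scipy.stats import chi2_contingency

    table = [[12, 13],
             [62, 103]]

    for corrected in (False, True):
        stat, p, _, _ = chi2_contingency(table, correction=corrected)
        print(corrected, round(stat, 3), round(p, 3))
    # correction=False gives roughly 0.99 (p ~ 0.32); correction=True shrinks the statistic and raises the p-value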
Question: Why 1 degree of freedom?
- We used a \chi^2 distribution with 1 degree of freedom, but there are 4 numbers. Why?
  – For our analysis, we assume that the margins are fixed.
  – If we pick one number in the table, the rest of the numbers are known

          SNP+   SNP-   Total
Prog        3     22      25
No prog    71     94     165
Total      74    116     190
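This is easy to see in code: with the margins fixed, choosing a single cell determines the other three. A minimal sketch (the helper name fill_table is just for illustration):

    def fill_table(a, row_totals, col_totals):
        """Given one cell (Prog, SNP+) and fixed margins, the other three cells follow."""
        b = row_totals[0] - a          # Prog, SNP-
        c = col_totals[0] - a          # No prog, SNP+
        d = col_totals[1] - b          # No prog, SNP-
        return [[a, b], [c, d]]

    print(fill_table(3, row_totals=[25, 165], col_totals=[74, 116]))   # [[3, 22], [71, 94]]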
Question: Normal approximation
- We are using a normal approximation, but yesterday we talked about this being less than perfect. When can we use this test?
  – Rule of thumb: all expected cell counts are at least 5
  – Large samples
- What should I do if I do not have large samples?
  – Fisher's exact test
Fisher's exact test
- Remember that a p-value is the probability of the observed value or something more extreme
- Fisher's exact test looks at a table and determines how many tables are as extreme or more extreme than the observed table under the null hypothesis of no association
- Same concept as the exact test from the Wilcoxon test
- Easy to compute this in STATA
Hypothesis test with exact test
1) H0: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) Exact test
4) Test statistic: NA
5) p-value = 0.38 (two-sided)
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression
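The exact test can also be reproduced directly from the table. A minimal SciPy sketch; the two-sided p-value should be close to the value on the slide above:

    from scipy.stats import fisher_exact

    table = [[12, 13],
             [62, 103]]

    odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
    print(round(p_value, 2))   # approximately 0.38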
Results
- Our results were very similar to the other tests, in part because we have a large sample size
  – Normal approximation OK
- In small samples, larger differences are possible
Types of studies
- In a cohort study, people are enrolled based on exposure status, so we can somewhat control how many exposed and unexposed people we have
- In a case-control study, people are enrolled based on disease status, so we ensure that we have both diseased and non-diseased people
Measures of association
- Risk difference

RD = P(Disease+ \mid Exposure+) - P(Disease+ \mid Exposure-)

  – Do these two probabilities add up to 1? Why?
  – Under the null, what is the risk difference?
- Relative risk (risk ratio)

RR = \frac{P(Disease+ \mid Exposure+)}{P(Disease+ \mid Exposure-)}

  – Under the null, what is the relative risk?
              Exposure
Disease       Y      N      Total
Y             a      b      n1
N             c      d      n2
Total         m1     m2     N

- P(Disease+ \mid Exposure+) = a/m1 = p1
  – What is another name for this quantity?
  – Prevalence in patients with the exposure
- P(Disease+ \mid Exposure-) = b/m2 = p2
- RD = a/m1 - b/m2
  – Difference between two proportions
Confidence interval for RD
- Several confidence intervals are available for the RD
  – Asymptotic normal distribution:

\hat{p}_2 - \hat{p}_1 \sim N\!\left(p_2 - p_1,\; \frac{p_2(1-p_2)}{m_2} + \frac{p_1(1-p_1)}{m_1}\right)

  – Confidence interval:

\left( (\hat{p}_1 - \hat{p}_2) - z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{m_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{m_2}},\;\; (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{m_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{m_2}} \right)
              Exposure
Disease       Y      N      Total
Y             a      b      n1
N             c      d      n2
Total         m1     m2     N

- Estimate of RR:

\widehat{RR} = \frac{a/m_1}{b/m_2}
Confidence interval for RR
- To construct a confidence interval we use a normal approximation
- In addition, the CI is based on a log transformation of the RR
  – log(RR) = ln(RR)
  – I will use ln and log to represent the natural logarithm
- Quick math: e^{ln(RR)} = RR

ln(RR)
- Why do we use the ln(RR)?
  – It is generally easier to deal with subtraction rather than division
  – ln(RR) = ln(p1/p2) = ln(p1) - ln(p2)
- We can estimate the standard error for the ln(RR) using the following formula:

se\!\left(\ln \widehat{RR}\right) = \sqrt{\frac{c}{a\,m_1} + \frac{d}{b\,m_2}}
Confidence interval
- Now that we have an estimate of the variance, we can create a confidence interval for ln(RR) using our standard normal approximation:

\left( \ln\widehat{RR} - z_{\alpha/2}\sqrt{\frac{c}{a\,m_1} + \frac{d}{b\,m_2}},\;\; \ln\widehat{RR} + z_{\alpha/2}\sqrt{\frac{c}{a\,m_1} + \frac{d}{b\,m_2}} \right)

- To create a confidence interval for RR, we exponentiate this confidence interval:

\left( e^{\ln\widehat{RR} - z_{\alpha/2}\sqrt{\frac{c}{a\,m_1} + \frac{d}{b\,m_2}}},\;\; e^{\ln\widehat{RR} + z_{\alpha/2}\sqrt{\frac{c}{a\,m_1} + \frac{d}{b\,m_2}}} \right)
[STATA output: estimated proportions in the two groups, with the p-value from the chi-square test]
- Given the confidence interval, would you reject the null hypothesis? Why?
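A minimal Python sketch that reproduces the risk difference and its normal-approximation confidence interval for the study's table (a = 12, b = 13, m1 = 74, m2 = 116); the numbers should match the interpretation on the next slide:

    import math

    a, b, m1, m2 = 12, 13, 74, 116
    p1, p2 = a / m1, b / m2                       # risk in the SNP+ and SNP- groups
    rd = p1 - p2                                  # risk difference
    se = math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)
    z = 1.96                                      # normal quantile for a 95% interval
    print(round(rd, 3), (round(rd - z * se, 3), round(rd + z * se, 3)))
    # approximately 0.05 and (-0.052, 0.152)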
Interpretation of RD
- The estimated risk difference is 0.05
  – The interpretation of this is that the risk of progression for patients with the susceptibility allele is 5 percentage points higher than for patients without the allele
- The 95% confidence interval for the risk difference is (-0.052, 0.152)
  – Is there a significant difference between the allele groups?
- What was the confidence interval for the difference between the proportions that we investigated two classes ago?
  – 95% CI: (-0.052, 0.152)
Interpretation of RR
- The estimated relative risk is 1.45
  – The interpretation of this is that the risk of progression for patients with the susceptibility allele is 1.45 times the risk for patients without the allele
- The 95% confidence interval for the relative risk is (0.70, 3.00)
  – Is there a significant difference between the allele groups?
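A minimal sketch computing the relative risk and its log-based confidence interval for the same table, using the standard error formula above:

    import math

    a, b, c, d = 12, 13, 62, 103
    m1, m2 = a + c, b + d                          # 74 and 116
    rr = (a / m1) / (b / m2)                       # estimated relative risk
    se_log = math.sqrt(c / (a * m1) + d / (b * m2))
    z = 1.96
    ci = (math.exp(math.log(rr) - z * se_log), math.exp(math.log(rr) + z * se_log))
    print(round(rr, 2), tuple(round(x, 2) for x in ci))
    # approximately 1.45 and (0.70, 3.00)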
RD and RR
- Now that we know how to estimate these measures, can we estimate them with any study design?
  – Not directly
  – In a cohort study, the probabilities of interest, P(Disease | Exposure), can be estimated
  – In a case-control study, the probabilities cannot be estimated directly, so more information is required
Bayes theorem - technical
- The relationship between P(Disease | Exposure) and P(Exposure | Disease) can be shown using Bayes theorem:

P(D+ \mid E+) = \frac{P(E+ \mid D+)\,P(D+)}{P(E+ \mid D+)\,P(D+) + P(E+ \mid D-)\,P(D-)}

- Therefore, if we knew P(D+), we could estimate P(D+ | E+) from a case-control study
  – P(D+) is the prevalence
  – Usually we do not know this, so we cannot directly estimate the relative risk or risk difference
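As a numeric illustration, Bayes theorem recovers P(D+ | E+) from the quantities a case-control study can estimate plus an external prevalence. The exposure probabilities and the 0.1% prevalence below are made-up values, not from the study:

    # Hypothetical inputs -- illustrative only, not from the study
    p_exp_given_case = 0.40       # P(E+ | D+), estimable from the cases
    p_exp_given_control = 0.30    # P(E+ | D-), estimable from the controls
    prevalence = 0.001            # P(D+), must come from outside the case-control study

    numerator = p_exp_given_case * prevalence
    denominator = numerator + p_exp_given_control * (1 - prevalence)
    print(round(numerator / denominator, 5))   # P(D+ | E+) ~ 0.00133 with these made-up inputs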
Odds ratio
- Odds:

Odds = \frac{p}{1-p}

- Odds ratio:

OR = \frac{Odds_{Exposure+}}{Odds_{Exposure-}} = \frac{P(Disease+ \mid Exposure+)\,/\,\left[1 - P(Disease+ \mid Exposure+)\right]}{P(Disease+ \mid Exposure-)\,/\,\left[1 - P(Disease+ \mid Exposure-)\right]}

  – Under the null, what is the OR?
              Exposure
Disease       Y      N      Total
Y             a      b      n1
N             c      d      n2
Total         m1     m2     N

P(D+ \mid E+) = \frac{a}{a+c}

Odds_{Disease \mid Exposure+} = Odds_{D \mid E+} = \frac{P(D+ \mid E+)}{1 - P(D+ \mid E+)} = \frac{a/(a+c)}{c/(a+c)} = \frac{a}{c}

Odds_{Disease \mid Exposure-} = Odds_{D \mid E-} = \frac{b}{d}

\widehat{OR} = \frac{Odds_{D \mid E+}}{Odds_{D \mid E-}} = \frac{a/c}{b/d} = \frac{ad}{bc}

- This is the estimate of the odds ratio from a cohort study
              Exposure
Disease       Y      N      Total
Y             a      b      n1
N             c      d      n2
Total         m1     m2     N

P(E+ \mid D+) = \frac{a}{a+b}

Odds_{Exposure \mid Disease+} = Odds_{E \mid D+} = \frac{a}{b}

Odds_{Exposure \mid Disease-} = Odds_{E \mid D-} = \frac{c}{d}

\widehat{OR} = \frac{Odds_{E \mid D+}}{Odds_{E \mid D-}} = \frac{a/b}{c/d} = \frac{ad}{bc}

- This is the estimate of the odds ratio from a case-control study
Amazing!!
- The estimated odds ratio from each kind of study ends up being the same thing!!!
- Therefore, we can complete a case-control study and get an estimate that we really care about, which is the effect of the exposure on the disease
- This relationship is why the odds ratio is so commonly used
Confidence interval for OR
- In order to calculate a confidence interval for the OR, we will investigate Woolf's approximation
  – Other approximations and exact intervals are available in STATA (exact is the default)
- Woolf's approximation focuses on a log transformation of the OR, as for the RR
  – log(OR) = ln(OR)
- Quick math: e^{ln(OR)} = OR
- Woolf's approximation gives us

se\!\left(\ln \widehat{OR}\right) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}
- Using our normal approximation, we can create a confidence interval for ln(OR) using

\left( \ln\widehat{OR} - z_{\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}},\;\; \ln\widehat{OR} + z_{\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \right)

- The confidence interval for OR is then

\left( e^{\ln\widehat{OR} - z_{\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}},\;\; e^{\ln\widehat{OR} + z_{\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}} \right)
Example
- In yesterday's class, we discussed a study in which we wanted to estimate the effect of a SNP on disease progression
  – What type of study was this?
  – Cohort study, because we followed people forward over time
- Let's estimate the odds ratio and confidence interval for this study
CI for OR

          SNP+   SNP-   Total
Prog       12     13      25
No prog    62    103     165
Total      74    116     190

- Based on this table, the estimated OR = (12*103)/(13*62) = 1.53
- 95% CI: (0.66, 3.57)
- Should we reject the null hypothesis of OR = 1?
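A minimal sketch of Woolf's interval for this table; the result should match the OR and CI above:

    import math

    a, b, c, d = 12, 13, 62, 103
    or_hat = (a * d) / (b * c)                             # (12*103)/(13*62)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)      # Woolf's se for ln(OR)
    z = 1.96
    ci = (math.exp(math.log(or_hat) - z * se_log), math.exp(math.log(or_hat) + z * se_log))
    print(round(or_hat, 2), tuple(round(x, 2) for x in ci))
    # approximately 1.53 and (0.66, 3.57)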

Interpretation of OR
- The estimated odds ratio is 1.53
  – The interpretation of this is that the ODDS of progression for patients with the susceptibility allele are 1.53 times the ODDS for patients without the allele
- The 95% confidence interval for the odds ratio is (0.66, 3.57)
  – Is there a significant association between SNP and disease?

[STATA output: estimated OR and estimated CI (Woolf)]
OR vs. RR
- Although the odds ratio is interesting, the relative risk is more intuitive
- If we have a rare disease, which is often the case for a case-control study, then P(D+ \mid E+) and P(D+ \mid E-) are both small, so

1 - P(D+ \mid E+) \approx 1 - P(D+ \mid E-) \approx 1

- Therefore, in these cases, the odds ratio is also an estimate of the relative risk:

OR = \frac{P(D+ \mid E+)\,/\,\left[1 - P(D+ \mid E+)\right]}{P(D+ \mid E-)\,/\,\left[1 - P(D+ \mid E-)\right]} \approx \frac{P(D+ \mid E+)}{P(D+ \mid E-)} = RR

- In other cases, the odds ratio provides a valid estimate of the relative risk (see other courses)
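A quick numeric check of the rare-disease approximation, using made-up risks (2% vs. 1%) rather than the study's data:

    # Hypothetical small risks, illustrative only
    p1, p2 = 0.02, 0.01
    rr = p1 / p2                                    # relative risk = 2.0
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))  # ~2.02, close to the RR because both risks are small
    print(rr, round(odds_ratio, 3))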
Hypothesis test with CI
1) H0: No association between SNP and progression (RD = 0)
2) Dichotomous outcome, dichotomous predictor
3) Risk difference 95% confidence interval
4) Test statistic: Estimated RD = 0.05
5) 95% CI: (-0.052, 0.152); p-value > 0.05
6) Since the p-value is greater than 0.05 (the 95% CI contains 0), we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression
Hypothesis test with CI
1) H0: No association between SNP and progression (RR = 1)
2) Dichotomous outcome, dichotomous predictor
3) Relative risk 95% confidence interval
4) Test statistic: Estimated RR = 1.45
5) 95% CI: (0.70, 3.00); p-value > 0.05
6) Since the p-value is greater than 0.05 (the 95% CI contains 1), we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression
Hypothesis test with CI
1) H0: No association between SNP and progression (OR = 1)
2) Dichotomous outcome, dichotomous predictor
3) Odds ratio 95% confidence interval
4) Test statistic: Estimated OR = 1.53
5) 95% CI: (0.66, 3.57); p-value > 0.05
6) Since the p-value is greater than 0.05 (the 95% CI contains 1), we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression