Dichotomous and
survival outcomes
Brian Healy, PhD
Comments from previous class

Book suggestion/practice problems
– Fundamentals of Biostatistics by Bernard Rosner
– Available in Countway

How do I know when I need help?
– When you think your project is more complicated than we have discussed
  – Correlated observations
  – Skewed data
– This class is designed to help you do basic analysis, but also to help you communicate with a statistician
Objectives

Dichotomous outcome
– Chi-square test
– Logistic regression

Survival analysis
– Log-rank test
Quick aside

What is the most common proportion to
see in the news?
– Political polling

Most polls look like this:
– 50% of people support Scott Brown
– 45% of people support Martha Coakley
– Margin of error +/- 3%
Margin of error

What does the margin of error tell us?
– The plausible values for the true proportion
accounting for sampling variability (chance)

What does the margin of error not tell us?
– Sample design
  – Who was sampled?
  – How was sampling done?
– Was there any missing data?
– Were all people treated the same?
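As a rough illustration of where a figure like +/- 3% comes from, the sketch below uses the usual normal-approximation formula for a proportion. The sample size of 1067 is a hypothetical value chosen for illustration; the slides do not report how many people were polled.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 50% support among n = 1067 respondents (n is an assumption).
moe = margin_of_error(0.50, 1067)
print(f"margin of error = +/- {100 * moe:.1f} percentage points")
```

This only captures sampling variability; none of the design issues listed above enter the formula.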
Statement regarding accuracy

For confidence interval:
– We are 95% confident that the true parameter value lies within
our confidence bounds

For polling (from
http://www.pollingreport.com/sampling.htm)
– “In theory, with a sample of this size, one can say with 95
percent certainty that the results have a statistical precision of
plus or minus __ percentage points of what they would be if the
entire adult population had been polled with complete accuracy.
Unfortunately, there are several other possible sources of error
in all polls or surveys that are probably more serious than
theoretical calculations of sampling error. They include refusals
to be interviewed (non-response), question wording and
question order, interviewer bias, weighting by demographic
control data, and screening (e.g., for likely voters). It is difficult
or impossible to quantify the errors that may result from these
factors.”
Review

Steps for hypothesis test
– How do we set up a null hypothesis?

Choosing the right test
– Continuous outcome/dichotomous predictor:
Two sample t-test
– Continuous outcome/categorical predictor:
ANOVA
– Continuous outcome/continuous predictor:
Correlation or regression
Types of analysis-independent samples

Outcome          Explanatory    Analysis
Continuous       Dichotomous    t-test, Wilcoxon test, linear regression
Continuous       Categorical    ANOVA, linear regression
Continuous       Continuous     Correlation, linear regression
Dichotomous      Dichotomous    Chi-square test, logistic regression
Dichotomous      Continuous     Logistic regression
Time to event    Dichotomous    Log-rank test
Dichotomous outcome
Sustained disease progression in MS is often defined as a one-unit increase on EDSS that lasts for at least six months
This is a common outcome in clinical trials and observational studies
Patients are often classified as progressed or not progressed, which is a dichotomous outcome

Example
MS is known to have a genetic component
Several single nucleotide polymorphisms have been associated with susceptibility to MS
Question: Do patients with susceptibility SNPs experience more sustained progression than patients without susceptibility SNPs?

Data
Initially, we will focus on presence vs. absence of SNPs
Among our 190 treated patients, 74 had the SNP and 116 did not
– 12 patients with the SNP experienced sustained progression: $\hat{p}_{SNP+} = 12/74 = 0.162$
– 13 patients without the SNP experienced sustained progression: $\hat{p}_{SNP-} = 13/116 = 0.112$
Contingency table
A common way to look at this data is a 2x2 table
Does the SNP have an effect on whether or not patients progress?

            SNP+    SNP-    Total
Prog          12      13       25
No prog       62     103      165
Total         74     116      190
Question
In our analysis, we assume that the margins are set
Under the null hypothesis of no relationship between the two variables, what would we expect the values in the table to be?

Example

As an example, use this table

            SNP+              SNP-              Total
Prog        50*100/200=25     50*100/200=25     50
No prog     150*100/200=75    150*100/200=75    150
Total       100               100               200
Expected table

Expected table for our analysis

            SNP+               SNP-                 Total
Prog        25*74/190=9.74     25*116/190=15.3      25
No prog     165*74/190=64.3    165*116/190=100.7    165
Total       74                 116                  190
How different is our observed data compared to
the expected table?
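The expected counts can be reproduced from the margins alone: each cell is (row total × column total) / grand total. The following is a minimal sketch using the observed SNP table from the slides; the function name is only illustrative.

```python
def expected_counts(observed):
    """Expected cell counts under no association, with the margins held fixed."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: Prog, No prog. Columns: SNP+, SNP-.
observed = [[12, 13], [62, 103]]
expected = expected_counts(observed)
for label, row in zip(["Prog", "No prog"], expected):
    print(label, [round(x, 2) for x in row])
```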
Does our data show an effect?
To test for an association between the outcome and the predictor, we would like to know if our observed table was different from the expected table under the null hypothesis
How could we investigate if our table was different?

$\chi^2 = \sum_{\text{cells}} \frac{(O_i - E_i)^2}{E_i}$

This quantity has a chi-square distribution
If it is large, it implies a large difference from the expected
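To make the statistic concrete, the sketch below sums (O - E)^2 / E over the four cells of the SNP table. The p-value line relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so the upper tail can be written with the complementary error function; this shortcut applies only to the 1 degree of freedom case.

```python
import math

# Observed and expected counts, cell by cell (Prog/SNP+, Prog/SNP-, No prog/SNP+, No prog/SNP-).
observed = [12, 13, 62, 103]
expected = [25 * 74 / 190, 25 * 116 / 190, 165 * 74 / 190, 165 * 116 / 190]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail probability for 1 degree of freedom: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.2f}")
```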
Critical information for χ²

For 1 degree of freedom, cut-off for α = 0.05 is 3.84
– For normal distribution, this is 1.96
– Note 1.96² = 3.84
Inherently two-sided, since it is squared
Has problems with small cell counts
– Fix: Fisher’s exact test
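For readers curious what Fisher's exact test does, here is a minimal sketch: with the margins fixed, the top-left cell follows a hypergeometric distribution, and the two-sided p-value sums the probabilities of all tables no more likely than the observed one. This is one common definition of the two-sided p-value, written with only the standard library; statistical packages may differ in details.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    p_observed = prob(a)
    # Add up every table that is at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# SNP table from the slides: rows Prog / No prog, columns SNP+ / SNP-.
p_snp = fisher_exact_two_sided(12, 13, 62, 103)
print(f"Fisher's exact p-value = {p_snp:.3f}")
```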
Chi-square distribution
[Figure: chi-square density with 1 degree of freedom; the area to the right of χ² = 3.84 is 0.05]
Hypothesis test with χ²
1) H0: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) χ² test
4) Summary statistic: χ² = 0.99
5) p-value = 0.32
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression
[Figure: test output with the χ² statistic and p-value labeled]
Question: Why 1 degree of freedom?

We used a χ² distribution with 1 degree of freedom, but there are 4 numbers. Why?
– For our analysis, we assume that the margins are fixed.
– If we pick one number in the table, the rest of the numbers are known

            SNP+    SNP-    Total
Prog           3      22       25
No Prog       71      94      165
Total         74     116      190
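The "pick one number" argument can be checked directly. In this small sketch (the function name is illustrative), the margins from the slides are held fixed and the remaining three cells follow by subtraction from a single chosen cell.

```python
def fill_table(a, row_totals=(25, 165), col_totals=(74, 116)):
    """Given the Prog/SNP+ cell and fixed margins, recover the other three cells."""
    b = row_totals[0] - a   # Prog, SNP-
    c = col_totals[0] - a   # No prog, SNP+
    d = row_totals[1] - c   # No prog, SNP-
    return [[a, b], [c, d]]

print(fill_table(3))    # the example table: [[3, 22], [71, 94]]
print(fill_table(12))   # the observed table: [[12, 13], [62, 103]]
```

Because only one cell is free to vary, the test has 1 degree of freedom.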
Estimated effect

When you compare two groups with a dichotomous outcome, there are three common ways to show the difference between the groups
– Risk difference
  – P(disease | group 1) - P(disease | group 2)
– Relative risk/risk ratio
  – P(disease | group 1) / P(disease | group 2)
– Odds ratio
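Odds ratio

Odds:
$\text{Odds} = \frac{p}{1 - p}$

Odds ratio:
$OR = \frac{\text{Odds}_{\text{Exposure}+}}{\text{Odds}_{\text{Exposure}-}} = \frac{P(\text{Disease}+ \mid \text{Exposure}+) \,/\, [1 - P(\text{Disease}+ \mid \text{Exposure}+)]}{P(\text{Disease}+ \mid \text{Exposure}-) \,/\, [1 - P(\text{Disease}+ \mid \text{Exposure}-)]}$
– Under the null, what is the OR?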
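As a worked example, the three effect measures can be computed for the SNP data from earlier in the slides, treating SNP+ as the exposed group and sustained progression as the disease. This is a sketch; the function and key names are illustrative.

```python
def effect_measures(a, b, c, d):
    """a, b: diseased in exposed / unexposed; c, d: not diseased in exposed / unexposed."""
    p1 = a / (a + c)   # risk in the exposed group
    p2 = b / (b + d)   # risk in the unexposed group
    return {
        "risk_difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": (p1 / (1 - p1)) / (p2 / (1 - p2)),
    }

# SNP+ : 12 progressed, 62 did not. SNP- : 13 progressed, 103 did not.
measures = effect_measures(12, 13, 62, 103)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Under the null hypothesis of no association, the risk difference would be 0 and both ratios would be 1.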
                 Exposure
Disease      Y       N       Total
Y            a       b       n1
N            c       d       n2
Total        m1      m2      N

$P(D+ \mid E+) = \frac{a}{a + c}$

$\text{Odds}_{D+ \mid E+} = \frac{P(D+ \mid E+)}{1 - P(D+ \mid E+)} = \frac{a/(a + c)}{c/(a + c)} = \frac{a}{c}$

$\text{Odds}_{D+ \mid E-} = \frac{b}{d}$

$OR = \frac{\text{Odds}_{D+ \mid E+}}{\text{Odds}_{D+ \mid E-}} = \frac{a/c}{b/d} = \frac{ad}{bc}$

This is the estimate of the odds ratio from a cohort study
                 Exposure
Disease      Y       N       Total
Y            a       b       n1
N            c       d       n2
Total        m1      m2      N

$P(E+ \mid D+) = \frac{a}{a + b}$

$\text{Odds}_{E+ \mid D+} = \frac{a}{b}$

$\text{Odds}_{E+ \mid D-} = \frac{c}{d}$

$OR = \frac{\text{Odds}_{E+ \mid D+}}{\text{Odds}_{E+ \mid D-}} = \frac{a/b}{c/d} = \frac{ad}{bc}$

This is the estimate of the odds ratio from a case-control study
Amazing!!
Estimated odds ratio from each kind of study ends up being the same thing!!!
Therefore, we can complete a case-control study and get an estimate that we really care about, which is the effect of the exposure on the disease
This relationship is one reason why the odds ratio is so commonly reported
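The equality of the two estimates can be checked numerically. The sketch below computes the odds ratio both ways with exact fractions, using the SNP counts from earlier in the slides as the cells a, b, c, d; the function names are illustrative.

```python
from fractions import Fraction

def or_cohort(a, b, c, d):
    """Odds of disease among the exposed divided by odds of disease among the unexposed."""
    odds_exposed = Fraction(a, a + c) / Fraction(c, a + c)
    odds_unexposed = Fraction(b, b + d) / Fraction(d, b + d)
    return odds_exposed / odds_unexposed

def or_case_control(a, b, c, d):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = Fraction(a, a + b) / Fraction(b, a + b)
    odds_controls = Fraction(c, c + d) / Fraction(d, c + d)
    return odds_cases / odds_controls

a, b, c, d = 12, 13, 62, 103
print(or_cohort(a, b, c, d), or_case_control(a, b, c, d), Fraction(a * d, b * c))
```

All three printed values are the same fraction, ad/bc, which is why the direction of sampling does not change the estimated odds ratio.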

Logistic regression
Linear regression

When we fit linear regression, we used indicator variables to represent dichotomous predictors
– Ex. Effect of gender
– Gender=0 if Female
– Gender=1 if Male

$Y_i = b_0 + b_1 \cdot \text{Gender}_i + e_i$
What is the interpretation of b1?
Outcome

What if the outcome is dichotomous?
– Progression
– Y=0 if no progression
– Y=1 if progression

Can we just use linear regression with 0/1
as the outcome?
[Figure: scatter plot of the 0/1 progression outcome (y-axis, 0 to 1) against age (x-axis, 10 to 50)]
Can we fit a line to this data?
Is there another measure?
Better outcome
Rather than investigating the 0/1 value, we focus our attention on the probability of the event
Therefore, we could use the following regression equation

$p_i = b_0 + b_1 x$

Is there anything wrong with this
function?
Technical aside-Probabilities

Probabilities are required to be between 0 and 1
– Does the present equation impose this restriction? $p_i = b_0 + b_1 x$
– No
We would like a similar equation, but with the restriction that 0<=p<=1
One option:

$p_i = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$
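A quick numerical comparison shows why the straight-line form is a problem and the exponential form is not. The coefficient values below are hypothetical, chosen only so that the line leaves the 0 to 1 range within the plotted x values.

```python
import math

def linear_probability(x, b0, b1):
    """Straight-line model for a probability; nothing keeps it inside [0, 1]."""
    return b0 + b1 * x

def logistic_probability(x, b0, b1):
    """Logistic form e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); always strictly between 0 and 1."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

b0, b1 = -0.5, 0.03   # hypothetical coefficients for illustration only
for x in (0, 30, 60):
    print(x, round(linear_probability(x, b0, b1), 3), round(logistic_probability(x, b0, b1), 3))
```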
Logistic regression

The previous function is quite complex to deal with, but we can transform the equation

$p = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$

$\frac{p}{1 - p} = e^{b_0 + b_1 x}$

$\ln\left(\frac{p}{1 - p}\right) = \ln(\text{Odds}) = b_0 + b_1 x$

Note that the right side of the equation looks EXACTLY like our normal regression
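The step from the probability to the odds is easy to miss, so here it is written out. The symbol η is shorthand introduced here for b_0 + b_1 x; it does not appear in the slides.

```latex
\begin{aligned}
p &= \frac{e^{\eta}}{1 + e^{\eta}}, \qquad \eta = b_0 + b_1 x \\
1 - p &= \frac{1 + e^{\eta} - e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{\eta}} \\
\frac{p}{1 - p} &= \frac{e^{\eta} / (1 + e^{\eta})}{1 / (1 + e^{\eta})} = e^{\eta} \\
\ln\left(\frac{p}{1 - p}\right) &= \eta = b_0 + b_1 x
\end{aligned}
```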
Parameter interpretation-review
Let’s think about the following linear regression model for the effect of age on BPF

$E(BPF_i \mid age_i) = b_0 + b_1 \cdot age_i$

In linear regression, the meaning of b1 in this model is that for a one unit increase in age the mean BPF goes up by b1.
The meaning of b0 is the mean value of BPF when age=0

Parameter interpretation

How does this change for our logistic model?
– Not at all!!!

Logistic model:
$\ln\left(\frac{p_i}{1 - p_i}\right) = b_0 + b_1 \cdot age_i$

The meaning of b1 in this model is that for a one unit increase in age, the ln(Odds) goes up by b1
The meaning of b0 is the value of ln(Odds) when age=0

Results

When we fit our data, the parameter estimates were

$\ln\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) = -4.58 + 0.086 \cdot age_i$

For a one unit increase in age, the estimated log(Odds) increases by 0.086
Is this a statistically significant increase?

Hypothesis test
If there was no effect of age on the probability of progression, what would the value of b1 equal?
How could we test the hypothesis that there is no effect?
– H0: b1=0
Need an estimate of the variance of the estimated b1, but this is provided by STATA
Assume approximate normality

Hypothesis test
1) H0: b1=0
2) Dichotomous outcome with continuous predictor
3) Logistic regression
4) Summary statistic: z=1.99
5) p-value=0.047
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant effect of age at symptom onset on probability of progression
[Figure: regression output with the estimated coefficient for age, the estimated intercept coefficient, and the p-values for H0: b1=0 and H0: b0=0 labeled]
[Figure: observed progression (event2yr, 0/1) and fitted probability Pr(event2yr) plotted against age (x-axis, 10 to 50)]
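The reported p-value can be recovered from the z statistic under the approximate normality assumption. In this sketch the standard error is back-calculated from the reported slope (0.086) and z (1.99); the slides do not report it, so treat that number as an inferred value.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

b1_hat = 0.086             # estimated slope for age, from the slides
z = 1.99                   # reported summary statistic
implied_se = b1_hat / z    # inferred, not reported: z = estimate / standard error

p_value = two_sided_p_from_z(z)
print(f"implied SE = {implied_se:.4f}, p-value = {p_value:.3f}")
```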
Conclusions
Logistic regression allows us to investigate the relationship between a continuous predictor and a dichotomous outcome
Interpretation of coefficients is the same as linear regression, but on the log(odds) scale
We can calculate the predicted probability just like we could calculate the predicted mean value
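A predicted probability comes from inverting the log(odds) equation. The sketch below uses the estimates reported in the Results slide, reading the intercept as -4.58 and the slope as +0.086; the signs are an assumption, since they are not legible in the extracted equation, although a positive slope matches the statement that the log(Odds) increases with age.

```python
import math

B0_HAT = -4.58   # estimated intercept (sign assumed, see lead-in)
B1_HAT = 0.086   # estimated slope for age

def predicted_probability(age):
    """Invert ln(p / (1 - p)) = b0 + b1 * age to get the predicted probability."""
    log_odds = B0_HAT + B1_HAT * age
    return 1 / (1 + math.exp(-log_odds))

for age in (20, 30, 40, 50):
    print(f"age {age}: predicted probability = {predicted_probability(age):.3f}")

# Exponentiating the slope gives the estimated odds ratio per one-year increase in age.
odds_ratio_per_year = math.exp(B1_HAT)
print(f"odds ratio per year = {odds_ratio_per_year:.3f}")
```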

Survival analysis
Example

An important marker of disease activity in MS is
the occurrence of a relapse
– This is the presence of new symptoms that lasts for
at least 24 hours

Many clinical trials in MS have demonstrated that
treatments increase the time until the next
relapse
– How does the time to next relapse look in the clinic?

What is the distribution of survival times?
Kaplan-Meier curve
[Figure: Kaplan-Meier curve]
Each drop in the curve represents an event
Survival data

To create this curve, patients placed on treatment were followed and the time of the first relapse on treatment was recorded
– Survival time
If everyone had an event, some of the methods we have already learned could be applied
Often, not everyone has an event
– Loss to follow-up
– End of study
Censoring

The patients who did not have the event
are considered censored
– We know that they survived a specific amount
of time, but do not know the exact time of the
event
– We believe that the event would have
happened if we observed them long enough

These patients provide some information,
but not complete information
Censoring

How could we account for censoring?
– Ignore it and say event occurred at time of censoring
  – Incorrect because this is almost certainly not true
– Remove patient from analysis
  – Potential bias and loss of power
– Survival analysis

Our objective is to estimate the survival distribution for patients in the presence of censoring
Comparison of survival curve
One important aspect of survival analysis is the comparison of survival curves
Null hypothesis: survival curve in group 1 is the same as survival curve in group 2
Method: log-rank test

Example
Untreated             Treated
Patient   Time        Patient   Time
1         3           1         30
2         8+          2         38
3         15          3         52+
4         27+         4         58
5         32          5         66
6         46          6         73+
7         49          7         77
8         51          8         89
9         55+         9         107+
10        70
[Figure: Kaplan-Meier survival estimates for group = 0 and group = 1; x-axis: analysis time (0 to 100), y-axis: survival probability (0.00 to 1.00)]
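The curves in the figure can be sketched with the Kaplan-Meier (product-limit) estimator: at each event time the running survival probability is multiplied by the fraction of at-risk patients who did not have the event. The code assumes that a trailing + in the example data marks a censored time, which is the usual convention but is not stated in the slides; the function name is illustrative.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)] from the product-limit estimator."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        n_leaving = sum(1 for ti in times if ti == t)   # events plus censored at t
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve

# Example data from the slides; event = 1 for a relapse, 0 for a censored time (marked +).
untreated_times = [3, 8, 15, 27, 32, 46, 49, 51, 55, 70]
untreated_events = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
treated_times = [30, 38, 52, 58, 66, 73, 77, 89, 107]
treated_events = [1, 1, 0, 1, 1, 0, 1, 1, 0]

untreated_curve = kaplan_meier(untreated_times, untreated_events)
treated_curve = kaplan_meier(treated_times, treated_events)
for t, s in untreated_curve:
    print(f"untreated: t = {t}, S(t) = {s:.3f}")
```

Censored patients never cause a drop in the curve, but they do leave the at-risk set, which is how they contribute partial information.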
Technical aside-Log-rank test

To compare survival curves, a log-rank test creates 2x2 tables at each event time and combines across the tables
– Similar to the Mantel-Haenszel (MH) test
Provides a χ² statistic with 1 degree of freedom (for a two sample comparison) and a p-value
Same procedure for hypothesis testing

Hypothesis test
1) H0: Survival distribution in group 1 = survival distribution in group 2
2) Time to event outcome, dichotomous predictor
3) Log-rank test
4) Summary statistic: χ² = 4.4
5) p-value = 0.036
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant difference in the survival time in the treated compared to untreated
[Figure: test output with the p-value labeled]
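The log-rank statistic for the example data can be sketched as follows. At each event time a 2x2 table (group by event/no event among those still at risk) gives an observed count, an expected count, and a hypergeometric variance for the untreated group, and these are summed across event times. The code assumes a trailing + in the data marks censoring, and it is a plain standard-library sketch rather than the routine used to produce the slides.

```python
import math

def log_rank_test(times0, events0, times1, events1):
    """Two-sample log-rank test; returns (chi-square statistic, p-value) with 1 df."""
    observed0 = expected0 = variance = 0.0
    all_times, all_events = times0 + times1, events0 + events1
    for t in sorted({ti for ti, ei in zip(all_times, all_events) if ei}):
        n0 = sum(1 for ti in times0 if ti >= t)   # at risk in group 0 just before t
        n1 = sum(1 for ti in times1 if ti >= t)   # at risk in group 1 just before t
        d0 = sum(1 for ti, ei in zip(times0, events0) if ti == t and ei)
        d1 = sum(1 for ti, ei in zip(times1, events1) if ti == t and ei)
        n, d = n0 + n1, d0 + d1
        observed0 += d0
        expected0 += d * n0 / n
        if n > 1:
            variance += d * (n0 / n) * (n1 / n) * (n - d) / (n - 1)
    chi2 = (observed0 - expected0) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
    return chi2, p_value

# Example data from the slides; event = 1 for a relapse, 0 for a censored time (marked +).
untreated_times = [3, 8, 15, 27, 32, 46, 49, 51, 55, 70]
untreated_events = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
treated_times = [30, 38, 52, 58, 66, 73, 77, 89, 107]
treated_events = [1, 1, 0, 1, 1, 0, 1, 1, 0]

chi2, p_value = log_rank_test(untreated_times, untreated_events, treated_times, treated_events)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")
```

With these inputs the statistic and p-value round to the values reported in the hypothesis test above.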
What we learned
Chi-square test
Logistic regression
Survival analysis
