Survival analysis and logistic regression

Survival analysis
First example of the day
• Small cell lung cancer
• Median survival time: 8-10 months
• 2-year survival is 10%
• New treatment showed median survival of 13.2 months
Progressively censored observations
Current life table
• Completed dataset
Cohort life table
• Analysis “on the fly”
Problem
Do patients survive longer after treatment 1 than after treatment 2?
Possible solutions:
• ANOVA on mean survival time?
• ANOVA on median survival time?
• 100 person years of observation: how long has the average person been in the study?
• 10 persons being observed for 10 years
• 100 persons being observed for 1 year
Life table analysis
A sub-set of 13 patients undergoing the same treatment
Time interval chosen to be 3 months
$n_i$: number of patients starting a given period
$d_i$: number of terminal events (in this example, progression/response)
$w_i$: number of patients that have not yet been in the study long enough to finish this period
Number exposed to risk: $n_i - w_i/2$, assuming that patients withdraw in the middle of the period on average
$q_i = d_i / (n_i - w_i/2)$: proportion of patients terminating in the period
$p_i = 1 - q_i$: proportion of patients surviving the period
$S_i = p_i \, p_{i-1} \cdots p_1$: cumulative proportion surviving (a product of conditional probabilities)
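The life table columns can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own calculation: the function name `life_table` and the counts of events and withdrawals are hypothetical, since the 13-patient table itself is not reproduced in the text.

```python
def life_table(n_start, events, withdrawals):
    """Actuarial life table; one dict per interval with n, d, w, exposed, q, p, S."""
    rows = []
    n = n_start          # n_i: patients entering the interval
    s = 1.0              # running cumulative proportion surviving, S_i
    for d, w in zip(events, withdrawals):
        exposed = n - w / 2              # withdrawals count as half an interval at risk
        q = d / exposed if exposed > 0 else 0.0
        p = 1.0 - q
        s *= p                           # S_i = p_i * S_(i-1)
        rows.append({"n": n, "d": d, "w": w, "exposed": exposed,
                     "q": q, "p": p, "S": s})
        n -= d + w                       # patients left for the next interval
    return rows


# Hypothetical counts for 13 patients over three 3-month intervals (not the slide's data).
table = life_table(13, events=[2, 1, 3], withdrawals=[0, 2, 1])
for i, row in enumerate(table, start=1):
    print(f"interval {i}: n={row['n']} exposed={row['exposed']:.1f} "
          f"q={row['q']:.3f} p={row['p']:.3f} S={row['S']:.3f}")
```

Each row multiplies the previous cumulative survival by the conditional probability of surviving the current interval, which is why the curve can only stay level or fall.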
Survival curves
How long will a lung cancer patient keep having cancer on this particular treatment?
Kaplan-Meier
Simple example with only 2 "terminal events".
Confidence interval of the Kaplan-Meier method
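E.g. after 32 months:
$$SE(S_i) = S_i \sqrt{\frac{d_i}{n_i (n_i - d_i)}}$$
$$SE(S_i) = 0.9 \sqrt{\frac{1}{10 (10 - 1)}} = 0.0949$$
(This is one term of Greenwood's formula; with several terminal events the terms $d_j / (n_j (n_j - d_j))$ are summed over all events up to time $i$ before taking the square root.)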
Confidence interval of the Kaplan-Meier method
Survival plot for all data on treatment 1
Are there differences between the treatments?
Comparing Two Survival Curves
One could use the confidence intervals…
But what if the confidence intervals fail to overlap only at some points?
• Logrank statistics
• Hazard ratio
• Mantel-Haenszel methods
Comparing Two Survival Curves
The logrank statistics
Aka Mantel-logrank statistics
Aka Cox-Mantel-logrank statistics
Comparing Two Survival Curves
Five steps to the logrank statistics table
1. Divide the data into intervals (e.g. 10 months)
2. Count the number of patients at risk in the groups and in total
3. Count the number of terminal events in the groups and in total
4. Calculate the expected numbers of terminal events,
   e.g. in the interval 31-40 months there are 44 patients in group 1 and 46 in group 2, with 4 terminal events, so the expected terminal events are 4 × (44/90) and 4 × (46/90)
5. Calculate the totals
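The five steps above can be sketched as a small accumulation over intervals. This is an illustrative sketch only: the function name `logrank_totals` is hypothetical, and the split of the 4 terminal events between the two groups (3 and 1) is an assumption, because the slide gives only the total for the 31-40 month interval.

```python
def logrank_totals(intervals):
    """Sum observed and expected terminal events per group.

    intervals: list of (n1, n2, d1, d2) = patients at risk and terminal
    events in group 1 and group 2 for each time interval.
    """
    o1 = o2 = e1 = e2 = 0.0
    for n1, n2, d1, d2 in intervals:
        n = n1 + n2
        d = d1 + d2
        if n == 0:
            continue                     # nobody at risk, nothing to add
        o1 += d1
        o2 += d2
        e1 += d * n1 / n                 # events shared out in proportion to patients at risk
        e2 += d * n2 / n
    return o1, e1, o2, e2


# Interval 31-40 months from the slide: 44 and 46 at risk, 4 events in total.
# The 3/1 split of those events between the groups is hypothetical.
intervals = [(44, 46, 3, 1)]
o1, e1, o2, e2 = logrank_totals(intervals)
print(f"expected events: group 1 = {e1:.3f}, group 2 = {e2:.3f}")
```

The expected counts depend only on the numbers at risk and the total number of events, so they always add up to the observed total.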
Comparing Two Survival Curves
Smells like Chi-Square statistics
$$\chi^2 = \sum_{\text{all treatments}} \frac{(O - E)^2}{E} = \frac{(23 - 17.07)^2}{17.07} + \frac{(12 - 17.93)^2}{17.93} = 4.02$$
df = 1, p < 0.05
Comparing Two Survival Curves
Hazard ratio
$$\text{Hazard ratio} = \frac{O_1 / E_1}{O_2 / E_2} = \frac{23 / 17.07}{12 / 17.93} = 2.01$$
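The chi-square and hazard ratio arithmetic can be checked with the observed and expected totals quoted on the slides (23 vs 17.07 and 12 vs 17.93). The helper names are hypothetical; the p-value uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal, so no statistics library is needed.

```python
import math


def logrank_chi2(observed, expected):
    """Sum of (O - E)^2 / E over the treatment groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_p_value_df1(x):
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2))


observed = [23, 12]
expected = [17.07, 17.93]

chi2 = logrank_chi2(observed, expected)
p_value = chi2_p_value_df1(chi2)
hazard_ratio = (observed[0] / expected[0]) / (observed[1] / expected[1])

print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, hazard ratio = {hazard_ratio:.2f}")
```

The statistic lands just above 3.84, the 5% critical value for 1 degree of freedom, which is why the slide reports p < 0.05.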
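Comparing Two Survival Curves
Mantel Haenszel test
$$OR = \frac{\sum (a \, d / n)}{\sum (b \, c / n)}$$
(cells $a$, $b$ in the first row and $c$, $d$ in the second row of each 2×2 table; $n$ is the table total)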
Is the OR significantly different from 1?
Look at cell (1,1)
Estimated value: $E(a_i) = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$
Variance: $V(a_i) = \frac{(a + c)(b + d)(a + b)(c + d)}{n^2 (n - 1)}$
Comparing Two Survival Curves
Mantel Haenszel test
$$\chi^2_{MH} = \frac{\left( \sum a_i - \sum E(a_i) \right)^2}{\sum V(a_i)} = 1.12$$
df = 1; p > 0.05
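A sketch of the Mantel-Haenszel calculation over a set of 2×2 tables, one per time interval, follows. It assumes each table is laid out with cells a, b in the first row and c, d in the second, and it uses the standard pooled Mantel-Haenszel odds ratio estimator; both are assumptions about what the slide intended. The tables are hypothetical, so the output does not reproduce the slide's value of 1.12.

```python
def mantel_haenszel(tables):
    """Return (pooled odds ratio, chi-square) for a list of (a, b, c, d) tables."""
    sum_a = sum_e = sum_v = 0.0
    numerator = denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue                                   # variance undefined for n < 2
        sum_a += a
        sum_e += (a + b) * (a + c) / n                 # E(a): row total * column total / n
        sum_v += ((a + c) * (b + d) * (a + b) * (c + d)) / (n ** 2 * (n - 1))
        numerator += a * d / n
        denominator += b * c / n
    chi2 = (sum_a - sum_e) ** 2 / sum_v
    return numerator / denominator, chi2


# Hypothetical tables (events / non-events per group), not the slide's data.
single = [(10, 5, 5, 10)]
stratified = [(3, 41, 1, 45), (4, 30, 2, 38), (2, 20, 3, 25)]

or_single, chi2_single = mantel_haenszel(single)
or_strat, chi2_strat = mantel_haenszel(stratified)
print(f"single table: OR = {or_single:.2f}, chi2 = {chi2_single:.2f}")
print(f"stratified:   OR = {or_strat:.2f}, chi2 = {chi2_strat:.2f}")
```

The statistic is compared with a chi-square distribution with 1 degree of freedom, as on the slide.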
Hazard function
$$H = -\log(S_i)$$
$$H = \frac{d}{\sum f + \sum c}$$
$d$ is the number of terminal events
$\sum f$ is the sum of failure times
$\sum c$ is the sum of censored times
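Both hazard expressions are easy to evaluate directly. The sketch below reads the first formula as a cumulative hazard derived from a survival proportion and the second as an average hazard rate (events per unit of observed time); that reading, the function names, and the example times are assumptions rather than something stated on the slide.

```python
import math


def cumulative_hazard(survival):
    """H = -log(S) for a survival proportion S in (0, 1]."""
    return -math.log(survival)


def average_hazard_rate(failure_times, censored_times):
    """Terminal events divided by total observed time (failures plus censored)."""
    d = len(failure_times)
    return d / (sum(failure_times) + sum(censored_times))


# Hypothetical follow-up times in months.
failures = [2, 4]
censored = [6, 8]

rate = average_hazard_rate(failures, censored)
print(f"average hazard rate = {rate:.3f} events per month")
print(f"cumulative hazard at S = 0.9: {cumulative_hazard(0.9):.4f}")
```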
Logistic regression
Who survived Titanic?
The sinking of Titanic
Titanic sank April 14th 1912 with 2228 souls on board; 705 survived.
A dataset covering 1309 of the passengers has survived.
Who survived?
The data
pclass | survived | name | sex | age | sibsp | parch
1 | 1 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0
1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2
1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2
1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30 | 1 | 2
1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25 | 1 | 2
1 | 1 | Anderson, Mr. Harry | male | 48 | 0 | 0
1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63 | 1 | 0
1 | 0 | Andrews, Mr. Thomas Jr | male | 39 | 0 | 0
1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53 | 2 | 0
Sibsp is the number of siblings and/or spouses accompanying
Parch is the number of parents and/or children accompanying
Some values are missing
Can we predict who will survive Titanic II?
Analyzing the data in a (too) simple manner
• Associations between factors without considering interactions
Could we use multiple linear regression to predict survival?
Multiple linear regression:
• Response variable is defined between −inf and +inf
• Normally distributed
• $E(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$
Logistic regression:
• Response variable is defined between 0 and 1
• Bernoulli distributed
Logit transformation is modeled linearly
$$\ln \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
The logistic function
$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)} = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n))}$$
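The logit and the logistic function are inverses of each other, which a short sketch can confirm. The function names and the example coefficients are hypothetical and chosen only for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def logistic(z):
    """Inverse of the logit: maps any real z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_predictor(betas, xs):
    """z = beta_0 + beta_1 * x_1 + ... + beta_n * x_n."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


# Hypothetical coefficients and inputs.
betas = [-1.0, 0.5, 2.0]
xs = [3.0, 0.25]

z = linear_predictor(betas, xs)
p = logistic(z)
p_alt = math.exp(z) / (1 + math.exp(z))   # the first form of the logistic function

print(f"z = {z:.3f}, p = {p:.4f}, logit(p) = {logit(p):.3f}")
```

Because the model is linear on the logit scale, coefficients are interpreted as changes in log-odds rather than changes in probability.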
The sigmoidal curve
$$p = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
[Figure: sigmoidal curve of p (0 to 1) against x (−6 to 6) for $\beta_0 = 0$, $\beta_1 = 1$]
The sigmoidal curve
• The intercept basically just shifts the curve along the input variable
[Figure: sigmoidal curves of p against x (−6 to 6) for $\beta_0 = 0$, $2$ and $-2$, each with $\beta_1 = 1$]
The sigmoidal curve
• Large regression coefficient → risk factor strongly influences the probability
[Figure: sigmoidal curves of p against x (−6 to 6) for $\beta_1 = 1$, $2$ and $0.5$, each with $\beta_0 = 0$]
The sigmoidal curve
• Positive regression coefficient → risk factor increases the probability
[Figure: sigmoidal curves of p against x (−6 to 6) for $\beta_1 = 1$ and $-1$, each with $\beta_0 = 0$]
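The effect of the intercept and the regression coefficient on the curve can be tabulated without plotting. The parameter pairs below are the ones shown in the figure legends; the helper name `p_of_x` is hypothetical.

```python
import math


def p_of_x(x, beta0, beta1):
    """Probability from a one-variable logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


xs = [-6, -4, -2, 0, 2, 4, 6]
curves = {
    "b0= 0, b1= 1  ": (0.0, 1.0),    # baseline curve
    "b0= 2, b1= 1  ": (2.0, 1.0),    # intercept moves the curve along x
    "b0=-2, b1= 1  ": (-2.0, 1.0),
    "b0= 0, b1= 2  ": (0.0, 2.0),    # larger coefficient: steeper curve
    "b0= 0, b1= 0.5": (0.0, 0.5),    # smaller coefficient: flatter curve
    "b0= 0, b1=-1  ": (0.0, -1.0),   # negative coefficient: decreasing curve
}

for label, (b0, b1) in curves.items():
    values = " ".join(f"{p_of_x(x, b0, b1):.3f}" for x in xs)
    print(f"{label}: {values}")
```

With a coefficient of 1, an intercept of 2 gives the same probability at x as the baseline curve gives at x + 2, so the curve keeps its shape and only moves sideways.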
Logistic regression of the Titanic data
Logistic regression of the Titanic data – passenger class
1. Summary of data
2. Coding of the dependent variable
3. Coding of the categorical explanatory
variable:
First class: 1
Second class: 2
Third class: reference
Logistic regression of the Titanic data – passenger class
• A fit of the null model, basically just the intercept. Usually not interesting.
• The total probability of survival is 500/1309 = 0.382. The cutoff is 0.5, so all are classified as non-survivors.
• Basically tests if the null model is sufficient. It almost certainly is not.
• Shows that survival is related to pclass (which is not in the null model).
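The null-model numbers follow directly from the two counts on the slide (500 survivors among 1309 passengers). The sketch below is an illustration with hypothetical variable names; the intercept and the classification accuracy are derived values, not figures quoted in the slides.

```python
import math

survivors = 500
total = 1309
cutoff = 0.5

# With no explanatory variables every passenger gets the same probability.
p_null = survivors / total
intercept = math.log(p_null / (1 - p_null))      # log-odds of survival

# Everyone falls on the same side of the cutoff, so there is one prediction for all.
predict_survival = p_null >= cutoff
correct = survivors if predict_survival else total - survivors
accuracy = correct / total

print(f"p = {p_null:.3f}, intercept = {intercept:.3f}")
print(f"all classified as {'survivors' if predict_survival else 'non-survivors'}, "
      f"accuracy = {accuracy:.1%}")
```

This baseline accuracy is the figure that later classification tables have to beat.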
Logistic regression of the Titanic data – passenger class
1. Omnibus test: uses the likelihood ratio (LR) to test whether adding the pclass variable to the model makes it better. It did! But only better than the null model, so no surprise.
2. Model summary: other measures of the goodness of fit.
3. Classification table: by including pclass, 67.7% of passengers were correctly categorized.
4. Variables in the equation: the first line repeats that pclass has a significant effect on survival. B is the fitted logistic parameter. Exp(B) is the odds ratio, so the odds of survival for first-class passengers are 4.7 (3.6-6.3) times those of passengers in third class (the reference class).
Logistic regression of the Titanic data – Adding age to the model
Oops… Some data points are missing
And the null model is poorer
Logistic regression of the Titanic data – Adding age to the model
• Cox and Snell's R-square increased from 0.093 to 0.141, indicating a better model
• With this model we can correctly classify 69.1% of passengers; passenger class alone classified only 67.7%
Logistic regression of the Titanic data – Adding age to the model
• Age has a significant influence on survival.
• The odds ratio of age is 0.963.
• So the odds of survival for a 31-year-old are 0.963 times the odds for a 30-year-old.
• Or the odds for a 30-year-old to survive are 1/0.963 = 1.038 times larger than those of a 31-year-old.
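Because the model is linear in the log-odds, a per-year odds ratio compounds multiplicatively over larger age gaps. The sketch below uses the odds ratio of 0.963 from the slide; the function name and the 10-year example are illustrative additions.

```python
import math

OR_PER_YEAR = 0.963                      # Exp(B) for age, from the slide
b_age = math.log(OR_PER_YEAR)            # the corresponding coefficient B


def odds_ratio_for_age_gap(years):
    """Odds ratio comparing a person `years` older with the reference person."""
    return math.exp(b_age * years)       # same as OR_PER_YEAR ** years


print(f"B for age            = {b_age:.4f}")
print(f"one year older       = {odds_ratio_for_age_gap(1):.3f}")
print(f"one year younger     = {odds_ratio_for_age_gap(-1):.3f}")
print(f"ten years older      = {odds_ratio_for_age_gap(10):.3f}")
```

A small per-year effect therefore adds up: a ten-year age difference corresponds to roughly a one-third reduction in the odds under this model.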
Logistic regression of the Titanic data – Age alone
• The model is extremely poor
• Consequently, age appears to be insignificant in estimating survival.
Logistic regression of the Titanic data – Adding family and sex
• The model is becoming better
Logistic regression of the Titanic data – Using the model to predict
• What is the probability that a 25 year old woman accompanied only by her husband holding a second class ticket would survive Titanic?
$$z = -2.703 - 0.041 \times 25 + 2.552 + 1.718 + 0.925 = 1.4670$$
$$p = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-1.4670}} = 0.8126$$
Using the model to predict survival
• What is the probability that a 25 year old woman accompanied only by her husband holding a second class ticket would survive Titanic?
$$z = -3.929 - 0.589 \times \frac{-5}{14.41} + 1.718 + 2.552 + 0.926 = 1.4714$$
$$p = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-1.4714}} = 0.8133$$
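Both versions of the prediction can be checked numerically. The sketch below simply adds up the terms as they appear on the slides; the slides do not say which coefficient belongs to which variable, so the terms are left unlabeled, and reading −5/14.41 as a standardized age (25 minus a mean of about 30, divided by the standard deviation) is an assumption.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


# Version with age in years: intercept, age term, then three further terms.
z_raw = -2.703 + (-0.041) * 25 + 2.552 + 1.718 + 0.925

# Version with a rescaled age term: -0.589 * (-5 / 14.41).
z_scaled = -3.929 + (-0.589) * (-5 / 14.41) + 1.718 + 2.552 + 0.926

p_raw = logistic(z_raw)
p_scaled = logistic(z_scaled)

print(f"age in years: z = {z_raw:.4f}, p = {p_raw:.4f}")
print(f"rescaled age: z = {z_scaled:.4f}, p = {p_scaled:.4f}")
```

The two parameterizations agree to within rounding of the published coefficients, as they should if the second model only rescales the age variable.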
Is it realistic that Leonardo survives and the chick dies?