Survival Analysis (Chapter 27 of Crawley’s “The R Book, 2nd ed.”)
Survival analysis models the time until failure, death, or some other event.
It typically assumes that every subject will eventually “fail”, whereas logistic regression is
better suited to data where we want the probability of failure by a fixed time point. A survival
analysis usually evaluates the effects of different “treatments” on the time to failure.
A common complication of survival analysis is censoring; i.e., information is missing about
either the exact time at which the event occurred (right censored) or when the “clock” started
(left censored).
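To make right censoring concrete, here is a minimal R sketch that builds observed data from made-up failure times and an assumed end of follow-up. The numbers and the names true.time, study.end, obs.time and status are illustrative assumptions, not data from these notes.

```r
# Hypothetical true failure times (in years); in a real study these are
# only partly observable.
true.time <- c(1.5, 4.0, 7.2, 2.3, 9.8)

# Assumed end of follow-up: nobody is observed past 5 years.
study.end <- 5

# The recorded time is the earlier of failure and end of follow-up.
obs.time <- pmin(true.time, study.end)

# status = 1 if the failure was seen, 0 if the subject was right censored.
status <- as.integer(true.time <= study.end)

data.frame(obs.time, status)
```

The two subjects with status 0 are the ones that would be written as 5+ in the notation used later in these notes.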
Examples:
• Operating times until a type of machine breaks.
• Times until patients die due to a particular cancer.
• Times until children learn to read.
• Time until a chick leaves the nest.
• Times after exposure to HIV until people express AIDS symptoms.
Key concepts:
• Let T be a random variable that represents the time to failure (T ≥ 0). Example: if a person
dies on their 70th birthday, then T = 70.
• The density function of T is f(t); when time is recorded in discrete periods, it is the
probability of failure in time period t. It can follow many probability distributions.
• F(t), the cumulative distribution function of T, gives P(T ≤ t); i.e., the probability that
failure occurs at or before time t.
• Survival function: S(t) = 1 − F(t) = P(T > t); i.e., the probability that failure occurs after time t.
• Hazard function: the rate of failing during the immediate moment after time t, given that
the failure had not occurred before time t:
$\lambda(t) = \lim_{dt \to 0} \frac{P(t < T < t + dt \mid T > t)}{dt} = \frac{f(t)}{S(t)}$
so that P(t < T < t+dt | T > t) ≈ λ(t) dt when dt is a very small value. For example, suppose
the probability of a 70-year-old dying before his 71st birthday is 0.07; if time is recorded
to the nearest year, then λ(70) = 0.07. The hazard is sometimes written h(t).
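The relationship between f(t), F(t), S(t) and the hazard can be checked numerically. The sketch below assumes, purely for illustration, exponential failure times with rate 0.07 and a Weibull distribution with shape 2 and scale 50; neither distribution comes from these notes, and the variable names are hypothetical.

```r
# Assumed example: exponential failure times with rate 0.07.
rate <- 0.07
tt   <- c(1, 10, 70)

f.t    <- dexp(tt, rate)   # density f(t)
F.t    <- pexp(tt, rate)   # cumulative distribution F(t) = P(T <= t)
S.t    <- 1 - F.t          # survival function S(t) = P(T > t)
hazard <- f.t / S.t        # hazard lambda(t) = f(t) / S(t)

# For the exponential distribution the hazard is constant and equals the rate.
cbind(tt, f.t, F.t, S.t, hazard)

# Assumed second example: a Weibull with shape > 1 has a hazard that rises with t.
shape <- 2
scale <- 50
haz.weibull <- dweibull(tt, shape, scale) /
  pweibull(tt, shape, scale, lower.tail = FALSE)
haz.weibull
```

A constant hazard means the risk of failing in the next instant does not depend on age, which is rarely realistic for people but is a common starting model for machine parts.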
Kaplan-Meier Estimator (aka product limit estimator). The Kaplan-Meier estimator is a nonparametric estimator of the survival function. It essentially looks at proportions.
Data: 1, 2, 2+, 3+, 6, where a "+" sign means that the patient was still alive at the end of his or
her follow-up and no further information was collected; that is, the patient was censored at that
time.
General idea: Calculate a running proportion of the number alive at a given time while adjusting for
censored data. We assume non-informative (random) right censoring.
Interval     # At Risk    # Censored  # At Risk   # Who Died  Proportion     Cumulative
(Start-End]  at Start of  During      at End of   at End of   Surviving      Survival at
             Interval     Interval    Interval    Interval    This Interval  End of Interval
-----------  -----------  ----------  ----------  ----------  -------------  ------------------
(0-1]        5            0           4           1           4/5 = 0.80     0.80
(1-2]        4            1           2           1           3/4 = 0.75     0.80 * 0.75 = 0.60
(2-3]        2            1           1           0           2/2 = 1.00     0.60 * 1.00 = 0.60
(3-6]        1            0           0           1           0/1 = 0        0.60 * 0 = 0
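[Figure: Kaplan-Meier step plot of S(t) against t for the data 1, 2, 2+, 3+, 6; vertical axis S(t) from 0.0 to 1.0, horizontal axis t from 0 to 6]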
You need to be very careful about whether an interval includes its endpoint t or only goes up to it. This
may vary from stat package to stat package. If we had created the table with the intervals [0,1), [1,2),
etc., we would have gotten a slightly different table.
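The running-proportion calculation in the table can be reproduced by hand in base R. This is a minimal sketch that uses the same five observations; the names event.times, n.risk, n.event, prop.surv and km.table are introduced here for illustration only. It follows the table's convention that a subject censored at time t is still at risk for a death at time t.

```r
# Data: 1, 2, 2+, 3+, 6  (1 = death, 0 = censored)
time.end <- c(1, 2, 2, 3, 6)
dead     <- c(1, 1, 0, 0, 1)

# Distinct times at which at least one death occurred.
event.times <- sort(unique(time.end[dead == 1]))

# Number still at risk just before each death time.
n.risk <- sapply(event.times, function(tj) sum(time.end >= tj))

# Number of deaths at each death time.
n.event <- sapply(event.times, function(tj) sum(time.end == tj & dead == 1))

# Proportion surviving each death time, and the running product.
prop.surv <- 1 - n.event / n.risk
km        <- cumprod(prop.surv)

km.table <- data.frame(time = event.times, n.risk, n.event, prop.surv, surv = km)
print(km.table)
```

Only death times change the estimate, so the censored-only interval (2-3] from the table does not appear as a row here; it contributes a factor of 1.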
Data: 1, 2, 2+, 3+, 6
> library( survival )
> # time of death or censoring
> time.end <- c(1,2,2,3,6)
> # 1= death, 0=censored
> dead <- c(1,1,0,0,1)
> # create survival object
> surv.object <- Surv( time.end, dead)
> surv.object
[1] 1 2 2+ 3+ 6
> # kaplan-meier estimates
> kaplan1 <- survfit( surv.object ~ 1 ) # ~1 is because there are no covariates
> names( kaplan1 )
 [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  "surv"
 [7] "type"      "std.err"   "upper"     "lower"     "conf.type" "conf.int"
[13] "call"
> cbind( kaplan1$time,kaplan1$n.event,kaplan1$surv)
     [,1] [,2] [,3]
[1,]    1    1  0.8
[2,]    2    1  0.6
[3,]    3    0  0.6
[4,]    6    1  0.0
> plot(kaplan1, ylab="S(t)",xlab="t")
> # Note: Kaplan-Meier method is also called
> # the product-limit estimator
> 1*.8
[1] 0.8
> 1*.8*.75
[1] 0.6
> 1*.8*.75*0
[1] 0
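Because the estimate is a step function, it is worth checking how the fitted curve treats a death that falls exactly on a time of interest, which is the endpoint question raised earlier. The sketch below assumes the survival package's summary() method for survfit objects and its times argument; the query times are arbitrary illustrative values.

```r
library(survival)

time.end <- c(1, 2, 2, 3, 6)
dead     <- c(1, 1, 0, 0, 1)
kaplan1  <- survfit(Surv(time.end, dead) ~ 1)

# Evaluate the fitted step function at chosen times (illustrative values).
query.times <- c(0.5, 1, 2, 2.5, 6)
km.at <- summary(kaplan1, times = query.times)

data.frame(time = km.at$time, surv = km.at$surv)
```

In this sketch the value reported at t = 1 already reflects the death at time 1, which matches the (0-1] style of interval used in the table.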
A more involved example
> # Get Afifi's text book lung cancer data
> Lung <- read.table("http://users.humboldt.edu/rizzardi/Data.dir/Surv.txt",
+ header=F, na.strings=".")
>
> names(Lung) <-
+ c("ID","STAGET","Stagen","Hist","Treat","Perfbl","Poinf","Smokfu","Smokbl","Days","Death")
>
> # Death: 0 = Alive(censored), 1=Dead
> # STAGET: Tumor size: 0=small, 1=large
> # Hist: Histology: 1=squamous cells, 2=other types of cells
> # Treat: 0=control (saline), 1=BCG
>
> attach( Lung )
>
> # First 8 rows of data
> Lung[1:8,]
  ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
1  1      1      0    1     0      0     0      2      1 2926     0
2  2      1      0    1     1      0     0      1      1  590     1
3  3      0      0    1     0      0     0      1      2 2803     0
4  4      1      0    1     1      0     0      2      1 2762     0
5  5      1      0    1     0      0     0      2      1 2616     0
6  6      0      0    1     1      0     0     NA      1 2716     0
7  7      0      0    1     1      0     0      1      1 1485     1
8  8      1      0    1     0      0     0     NA      1 2456     0
> # First 10 people who died or were censored
> Lung[order(Days),][1:10,]
     ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
281 281      0      0    2     1      0     1     NA      1    4     1
228 228      1      0    1     0      0     1     NA      1    7     1
139 139      0      0    2     0      1     0     NA      1    8     0
120 120      1      0    1     0      1     1     NA      2   29     1
186 186      1      0    2     0      0     0     NA      1   29     1
149 149      0      0    2     1      0     1     NA      1   31     1
344 344      0      0    2     1      1     0     NA      2   40     1
300 300      1      0    2     0      0     0     NA      1   65     1
199 199      1      0    2     1      0     0     NA      2   69     1
393 393      0      0    1     1      1     0     NA      2   78     1
> # Demo of order() function
> order( c(9,7,6,8,8) )
[1] 3 2 4 5 1
>
>
> #see help(Surv)
> #see help(with) used instead of "attach"
>
> Lsurv <- with( Lung, Surv(Days,Death)) # makes survival analysis data
> # A '+' denotes censored
> Lsurv[1:7]
[1] 2926+ 590 2803+ 2762+ 2616+ 2716+ 1485
> # Compare to the last two columns (Days, Death) of the Lung data frame
> Lung[1:7,]
  ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
1  1      1      0    1     0      0     0      2      1 2926     0
2  2      1      0    1     1      0     0      1      1  590     1
3  3      0      0    1     0      0     0      1      2 2803     0
4  4      1      0    1     1      0     0      2      1 2762     0
5  5      1      0    1     0      0     0      2      1 2616     0
6  6      0      0    1     1      0     0     NA      1 2716     0
7  7      0      0    1     1      0     0      1      1 1485     1
>
> # Kaplan-Meier estimate of Survival function for all data
> kmfit <- survfit( Lsurv ~ 1 )
> kmfit
Call: survfit(formula = Lsurv ~ 1)

records   n.max n.start  events  median 0.95LCL 0.95UCL
    401     401     401     204    2300    2017    3116
> Lung[order(Days),][1:7,]
     ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death
281 281      0      0    2     1      0     1     NA      1    4     1
228 228      1      0    1     0      0     1     NA      1    7     1
139 139      0      0    2     0      1     0     NA      1    8     0
120 120      1      0    1     0      1     1     NA      2   29     1
186 186      1      0    2     0      0     0     NA      1   29     1
149 149      0      0    2     1      0     1     NA      1   31     1
344 344      0      0    2     1      1     0     NA      2   40     1
> # summary( ) produces values of survival function
> # Only part shown.
> summary( kmfit )
Call: survfit(formula = Lsurv ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    4    401       1    0.998 0.00249        0.993        1.000
    7    400       1    0.995 0.00352        0.988        1.000
   29    398       2    0.990 0.00497        0.980        1.000
   31    396       1    0.988 0.00555        0.977        0.998
   40    395       1    0.985 0.00607        0.973        0.997
   65    394       1    0.983 0.00655        0.970        0.995
    :      :       :        :       :            :            :
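The survival, std.err and confidence-limit columns can be reproduced from n.risk and n.event alone. The sketch below assumes that std.err is Greenwood's standard error and that the limits are computed on the log scale and capped at 1; the printed numbers are consistent with both assumptions, but the notes themselves do not state the method. The counts are copied from the six rows shown.

```r
# n.risk and n.event copied from the first six rows of summary(kmfit).
n.risk  <- c(401, 400, 398, 396, 395, 394)
n.event <- c(1, 1, 2, 1, 1, 1)

# Product-limit estimate: running product of the proportions surviving.
surv <- cumprod(1 - n.event / n.risk)

# Greenwood's formula (assumed to be what std.err reports):
# se(S) = S * sqrt( cumulative sum of d / (n * (n - d)) )
std.err <- surv * sqrt(cumsum(n.event / (n.risk * (n.risk - n.event))))

# Pointwise 95% limits on the log scale, with the upper limit capped at 1
# (assumed to be the default behaviour of the fit shown).
z     <- qnorm(0.975)
lower <- surv * exp(-z * std.err / surv)
upper <- pmin(1, surv * exp(z * std.err / surv))

round(cbind(surv, std.err, lower, upper), 5)
```

Note that 398 rather than 399 subjects are at risk at day 29 because subject 139 was censored at day 8.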
> plot( kmfit ) # show survival curve
> title( "All data together")
[Figure: Kaplan-Meier survival curve titled "All data together"; vertical axis from 0.0 to 1.0, horizontal axis from 0 to 3500 days]
>
> kmfit.bystaget <- survfit( Lsurv ~ STAGET )
> kmfit.bystaget
Call: survfit(formula = Lsurv ~ STAGET)

         records n.max n.start events median 0.95LCL 0.95UCL
STAGET=0     213   213     213     91     NA    2516      NA
STAGET=1     188   188     188    113   1689    1294    2236
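The median column, and the NA reported for STAGET=0, can be understood with a small base R sketch. It assumes the usual definition of the Kaplan-Meier median as the first time at which the estimated S(t) is at or below 0.5, and it ignores any special handling a package may apply when S(t) equals 0.5 exactly. The function name km.median and the second curve are hypothetical.

```r
# Median survival: first time at which the estimated S(t) is <= 0.5.
km.median <- function(time, surv) {
  hit <- which(surv <= 0.5)
  if (length(hit) == 0) NA else time[min(hit)]
}

# Small example from earlier in these notes (data 1, 2, 2+, 3+, 6).
med.small <- km.median(time = c(1, 2, 3, 6), surv = c(0.8, 0.6, 0.6, 0.0))
med.small

# A hypothetical curve that never falls to 0.5 has no estimable median.
med.none <- km.median(time = c(100, 500, 900), surv = c(0.9, 0.7, 0.6))
med.none
```

Under this definition, an NA median means the estimated curve for that group stays above 0.5 for the whole follow-up period.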
>
> plot( kmfit.bystaget, lty=c(1,2))
> legend( 500, .2, legend=c("Small","Large"), lty=c(1,2) )
> title("Grouped by stage of tumor")
[Figure: Kaplan-Meier curves titled "Grouped by stage of tumor"; solid line = Small, dashed line = Large; vertical axis from 0.0 to 1.0, horizontal axis from 0 to 3500 days]
>
>
> kmfit.byTreat <- survfit( Lsurv ~ Treat )
> kmfit.byTreat
Call: survfit(formula = Lsurv ~ Treat)

        records n.max n.start events median 0.95LCL 0.95UCL
Treat=0     195   195     195     96   2347    1915      NA
Treat=1     206   206     206    108   2253    1695      NA
> plot( kmfit.byTreat, lty=c(1,2))
> legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2) )
> title("Grouped by Treatment")
[Figure: Kaplan-Meier curves titled "Grouped by Treatment"; solid line = Placebo, dashed line = Drug; vertical axis from 0.0 to 1.0, horizontal axis from 0 to 3500 days]
>
> kmfit.byHist <- survfit( Lsurv ~ Hist )
> kmfit.byHist
Call: survfit(formula = Lsurv ~ Hist)

       records n.max n.start events median 0.95LCL 0.95UCL
Hist=1     196   196     196     93   2580    2084      NA
Hist=2     205   205     205    111   2089    1548    3116
> plot( kmfit.byHist, lty=c(1,2) )
> legend( 500, .2, legend=c("Squamous","Other"), lty=c(1,2) )
> title("Grouped by histology")
[Figure: Kaplan-Meier curves titled "Grouped by histology"; solid line = Squamous, dashed line = Other; vertical axis from 0.0 to 1.0, horizontal axis from 0 to 3500 days]
> plot( kmfit.byTreat, conf.int=T, col=c("black","grey"),lty=1:2)
> legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2), col=c("black","grey") )
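[Figure: Kaplan-Meier curves by treatment with confidence bands; black solid line = Placebo, grey dashed line = Drug; vertical axis from 0.0 to 1.0, horizontal axis from 0 to 3500 days]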
> survdiff( Lsurv ~ Treat ) # log-rank test for difference between groups
Call:
survdiff(formula = Lsurv ~ Treat)

          N Observed Expected (O-E)^2/E (O-E)^2/V
Treat=0 195       96     99.8     0.148     0.289
Treat=1 206      108    104.2     0.142     0.289

 Chisq= 0.3  on 1 degrees of freedom, p= 0.591
> survdiff( Lsurv ~ STAGET )
Call:
survdiff(formula = Lsurv ~ STAGET)

           N Observed Expected (O-E)^2/E (O-E)^2/V
STAGET=0 213       91    116.1      5.43      12.6
STAGET=1 188      113     87.9      7.18      12.6

 Chisq= 12.6  on 1 degrees of freedom, p= 0.000379
> survdiff( Lsurv ~ Hist )
Call:
survdiff(formula = Lsurv ~ Hist)

         N Observed Expected (O-E)^2/E (O-E)^2/V
Hist=1 196       93      104      1.07      2.16
Hist=2 205      111      100      1.10      2.16

 Chisq= 2.2  on 1 degrees of freedom, p= 0.141
> ?survdiff
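To show where the Observed, Expected and (O-E)^2/V columns come from, here is a minimal base R sketch of a two-group log-rank test. The data are hypothetical (they are not the lung cancer data), the function name logrank is made up for this illustration, and the variance is the standard hypergeometric one, which is assumed to match what survdiff() uses by default.

```r
# Hypothetical two-group data (1 = death, 0 = censored).
time   <- c(6, 7, 10, 15, 19, 25,  3, 5, 8, 9, 12, 20)
status <- c(1, 0,  1,  1,  0,  1,  1, 1, 1, 0,  1,  1)
group  <- rep(c(0, 1), each = 6)

logrank <- function(time, status, group) {
  event.times <- sort(unique(time[status == 1]))
  O1 <- E1 <- V <- 0
  for (tj in event.times) {
    n  <- sum(time >= tj)                  # at risk in both groups
    n1 <- sum(time >= tj & group == 1)     # at risk in group 1
    d  <- sum(time == tj & status == 1)    # deaths at tj in both groups
    d1 <- sum(time == tj & status == 1 & group == 1)
    O1 <- O1 + d1                          # observed deaths in group 1
    E1 <- E1 + d * n1 / n                  # expected deaths if no difference
    if (n > 1) {
      V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
  }
  chisq <- (O1 - E1)^2 / V
  c(observed = O1, expected = E1, chisq = chisq,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

lr <- logrank(time, status, group)
lr

# p-value implied by the printed Treat statistic (0.289 on 1 df).
p.treat <- pchisq(0.289, df = 1, lower.tail = FALSE)
p.treat
```

The last line recovers the p-value of about 0.59 reported for the treatment comparison, which shows that the test is a chi-square test on one degree of freedom.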
library( survival ) # Library of survival analysis functions
# time of death or censoring
time.end <- c(1,2,2,3,6)
# 1= death, 0=censored
dead <- c(1,1,0,0,1)
# create survival object
surv.object <- Surv( time.end, dead)
surv.object
# kaplan-meier estimates
kaplan1 <- survfit( surv.object ~ 1 ) # ~1 is because there are no covariates
names( kaplan1 )
cbind( kaplan1$time,kaplan1$n.event,kaplan1$surv)
plot(kaplan1, ylab="S(t)",xlab="t")
# Note: Kaplan-Meier method is also called
# the product-limit estimator
1*.8
1*.8*.75
1*.8*.75*0
# Get Afifi's text book lung cancer data
Lung <- read.table("http://users.humboldt.edu/rizzardi/Data.dir/Surv.txt",
header=F, na.strings=".")
names(Lung) <- c("ID","STAGET","Stagen","Hist","Treat","Perfbl","Poinf","Smokfu","Smokbl","Days","Death")
# Death: 0 = Alive(censored), 1=Dead
# STAGET: Tumor size: 0=small, 1=large
# Hist: Histology: 1=squamous cells, 2=other types of cells
# Treat: 0=control (saline), 1=BCG
attach( Lung )
# First 8 rows of data
Lung[1:8,]
# First 10 people who died or were censored
Lung[order(Days),][1:10,]
# Demo of order() function
order( c(9,7,6,8,8) )
#see help(Surv)
#see help(with) used instead of "attach"
Lsurv <- with( Lung, Surv(Days,Death)) # makes survival analysis data
# A '+' denotes censored
Lsurv[1:7]
# Compare to the last two columns (Days, Death) of the Lung data frame
Lung[1:7,]
# Kaplan-Meier estimate of Survival function for all data
kmfit <- survfit( Lsurv ~ 1 )
kmfit
Lung[order(Days),][1:7,]
# summary( ) produces values of survival function
# Only part shown.
summary( kmfit )
windows()
plot( kmfit ) # show survival curve
title( "All data together")
kmfit.bystaget <- survfit( Lsurv ~ STAGET )
kmfit.bystaget
windows()
plot( kmfit.bystaget, lty=c(1,2))
legend( 500, .2, legend=c("Small","Large"), lty=c(1,2) )
title("Grouped by stage of tumor")
kmfit.byTreat <- survfit( Lsurv ~ Treat )
kmfit.byTreat
windows()
plot( kmfit.byTreat, lty=c(1,2))
legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2) )
title("Grouped by Treatment")
kmfit.byHist <- survfit( Lsurv ~ Hist )
kmfit.byHist
windows()
plot( kmfit.byHist, lty=c(1,2) )
legend( 500, .2, legend=c("Squamous","Other"), lty=c(1,2) )
title("Grouped by histology")
windows()
plot( kmfit.byTreat, conf.int=T, col=c("black","grey"),lty=1:2)
legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2), col=c("black","grey") )
survdiff( Lsurv ~ Treat ) # log-rank test for difference between groups
survdiff( Lsurv ~ STAGET )
survdiff( Lsurv ~ Hist )
?survdiff