Survival Analysis (Chapter 27 of Crawley’s “The R Book, 2 nd ed.”) Survival analysis is interested in either modeling the time to failure, death or some other event. It is typically assumed that all subjects will eventually “fail” whereas logistic regression is used more for data where we want to know the probability of failure at a set time point. Usually survival analysis will try to evaluate the effects of different “treatments” on the time to failure. A common complication of survival analysis is censoring; i.e., there is missing information as to either when the exact time of the event occurred (right censored) or when the “clock” started (left censored). Examples: Operating times until a type of machine breaks. Times until patients die due to a particular cancer. Times until children learn to read. Time until a chick leaves the nest. Times after exposure to the HIV virus until people express AIDS symptoms. Key concepts: Let T be a random variable which represents time to failure (T≥0). Example: A person dies on their 70th birthday, then T=70. Density function of t is f(t) which is the probability of failure at time period t. This can follow many probability distributions. F(t), the cumulative distribution function of T, gives P(T≤t); i.e., the probability that failure occurs at or before time t. Survival function, S(t)=1-F(t)=P(T>t); i.e., the probability that failure occurs after time t. Hazard function: Probability of failing during the immediate moment of time after time t 𝑓(𝑡) given that the failure had not occurred before time t. 𝜆(𝑡)=P( t < T < t+dt | T>t ) =𝑆(𝑡) where dt is a very small value. For example, suppose the probability of a 70 year old dying before his 71st birthday is 0.07 and if the precision is recorded by year, then 𝜆(70) = .07. This is sometimes written h(t). 1 Kaplan-Meier Estimator (aka product limit estimator). The Kaplan-Meier estimator is a nonparametric estimator of the survival function. It essentially looks at proportions. Data: 1, 2, 2+, 3+, 6 where the "+" signs mean that the patient was still alive at the end of his or her follow-up but was no further information collected; that is, the patient was censored at that time. General idea: Calculate a running proportion of the number alive at a given time while adjusting for censored data. We assume non-informative (random) right censoring. Interval (StartEnd) # At Risk # Censored # At Risk at Start of During at End of Interval Interval Interval # Who Died at End of Interval Proportion Surviving This Interval Cumulative Survival at End of Interval (0-1] 5 0 4 1 4/5 = 0.8 0.80 (1-2] 4 1 2 1 3/4 = 0.75 0.80 * 0.75 = 0.60 (2-3] 2 1 1 0 2/2 = 1.00 0.60*1.00= 0.60 (3-6] 1 0 0 1 0/1 = 0 0.6*0 = 0 0.0 0.2 0.4 S(t) 0.6 0.8 1.0 You need to be very careful about whether an interval includes or only goes up to a time t. This may vary from stat package to stat package. If we had created the table with the intervals: [0-1), [1,2), etc. we would have gotten a slightly different table. 0 1 2 3 4 5 6 t 2 Data: 1, 2, 2+, 3+, 6 > library( survival ) > # time of death or censoring > time.end <- c(1,2,2,3,6) > # 1= death, 0=censored > dead <- c(1,1,0,0,1) > # create survival object > surv.object <- Surv( time.end, dead) > surv.object [1] 1 2 2+ 3+ 6 > # kaplan-meier estimates > kaplan1 <- survfit( surv.object ~ 1 ) # ~1 is because there are no covariates > names( kaplan1 ) [1] "n" "time" "n.risk" "n.event" "n.censor" "surv" [7] "type" "std.err" "upper" "lower" "conf.type" "conf.int" [13] "call" > cbind( kaplan1$time,kaplan1$n.event,kaplan1$surv) [,1] [,2] [,3] [1,] 1 1 0.8 [2,] 2 1 0.6 [3,] 3 0 0.6 [4,] 6 1 0.0 > plot(kaplan1, ylab="S(t)",xlab="t") > # Note: Kaplan-Meier method is also called > # the product-limit estimator > 1*.8 [1] 0.8 > 1*.8*.75 [1] 0.6 > 1*.8*.75*0 [1] 0 A more involved example > # Get Afifi's text book lung cancer data > Lung <- read.table("http://users.humboldt.edu/rizzardi/Data.dir/Surv.txt", + header=F, na.strings=".") > > names(Lung) <+ c("ID","STAGET","Stagen","Hist","Treat","Perfbl","Poinf","Smokfu","Smokbl","Days","Dea th") > > # Death: 0 = Alive(censored), 1=Dead > # STAGET: Tumor size: 0=small, 1=large > # Hist: Histology: 1=squamous cells, 2=other types of cells > # Treat: 0=control (saline), 1=BCG > > attach( Lung ) > 3 > # First 8 rows of data > Lung[1:8,] ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death 1 1 1 0 1 0 0 0 2 1 2926 0 2 2 1 0 1 1 0 0 1 1 590 1 3 3 0 0 1 0 0 0 1 2 2803 0 4 4 1 0 1 1 0 0 2 1 2762 0 5 5 1 0 1 0 0 0 2 1 2616 0 6 6 0 0 1 1 0 0 NA 1 2716 0 7 7 0 0 1 1 0 0 1 1 1485 1 8 8 1 0 1 0 0 0 NA 1 2456 0 > # First 10 people who died or were censored > Lung[order(Days),][1:10,] ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death 281 281 0 0 2 1 0 1 NA 1 4 1 228 228 1 0 1 0 0 1 NA 1 7 1 139 139 0 0 2 0 1 0 NA 1 8 0 120 120 1 0 1 0 1 1 NA 2 29 1 186 186 1 0 2 0 0 0 NA 1 29 1 149 149 0 0 2 1 0 1 NA 1 31 1 344 344 0 0 2 1 1 0 NA 2 40 1 300 300 1 0 2 0 0 0 NA 1 65 1 199 199 1 0 2 1 0 0 NA 2 69 1 393 393 0 0 1 1 1 0 NA 2 78 1 > # Demo of order() function > order( c(9,7,6,8,8) ) [1] 3 2 4 5 1 > > > #see help(Surv) > #see help(with) used instead of "attach" > > Lsurv <- with( Lung, Surv(Days,Death)) # makes survival analysis data > # A '+' denotes censored > Lsurv[1:7] [1] 2926+ 590 2803+ 2762+ 2616+ 2716+ 1485 > # Compare to last values of Lung data frame > Lung[1:7,] ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death 1 1 1 0 1 0 0 0 2 1 2926 0 2 2 1 0 1 1 0 0 1 1 590 1 3 3 0 0 1 0 0 0 1 2 2803 0 4 4 1 0 1 1 0 0 2 1 2762 0 5 5 1 0 1 0 0 0 2 1 2616 0 6 6 0 0 1 1 0 0 NA 1 2716 0 7 7 0 0 1 1 0 0 1 1 1485 1 > > # Kaplan-Meir estimate of Survival function for all data > kmfit <- survfit( Lsurv ~ 1 ) > kmfit Call: survfit(formula = Lsurv ~ 1) records 401 > n.max n.start 401 401 events 204 median 0.95LCL 0.95UCL 2300 2017 3116 4 > Lung[order(Days),][1:7,] ID STAGET Stagen Hist Treat Perfbl Poinf Smokfu Smokbl Days Death 281 281 0 0 2 1 0 1 NA 1 4 1 228 228 1 0 1 0 0 1 NA 1 7 1 139 139 0 0 2 0 1 0 NA 1 8 0 120 120 1 0 1 0 1 1 NA 2 29 1 186 186 1 0 2 0 0 0 NA 1 29 1 149 149 0 0 2 1 0 1 NA 1 31 1 344 344 0 0 2 1 1 0 NA 2 40 1 > # summary( ) produces values of survival function > # Only part shown. > summary( kmfit ) Call: survfit(formula = Lsurv ~ 1) time n.risk n.event survival 4 401 1 0.998 7 400 1 0.995 29 398 2 0.990 31 396 1 0.988 40 395 1 0.985 65 394 1 0.983 :: ::: :: ::::: std.err lower 95% CI upper 95% CI 0.00249 0.993 1.000 0.00352 0.988 1.000 0.00497 0.980 1.000 0.00555 0.977 0.998 0.00607 0.973 0.997 0.00655 0.970 0.995 ::::: ::: :::: > plot( kmfit ) # show survival curve > title( "All data together") 0.0 0.2 0.4 0.6 0.8 1.0 All data together 0 500 1000 1500 2000 2500 3000 3500 > > kmfit.bystaget <- survfit( Lsurv ~ STAGET ) > kmfit.bystaget Call: survfit(formula = Lsurv ~ STAGET) STAGET=0 STAGET=1 records n.max n.start events median 0.95LCL 0.95UCL 213 213 213 91 NA 2516 NA 188 188 188 113 1689 1294 2236 5 > > plot( kmfit.bystaget, lty=c(1,2)) > legend( 500, .2, legend=c("Small","Large"), lty=c(1,2) ) > title("Grouped by stage of tumor") 0.2 0.4 0.6 0.8 1.0 Grouped by stage of tumor 0.0 Small Large 0 500 1000 1500 2000 2500 3000 3500 > > > kmfit.byTreat <- survfit( Lsurv ~ Treat ) > kmfit.byTreat Call: survfit(formula = Lsurv ~ Treat) records n.max n.start events median 0.95LCL 0.95UCL Treat=0 195 195 195 96 2347 1915 NA Treat=1 206 206 206 108 2253 1695 NA > plot( kmfit.byTreat, lty=c(1,2)) > legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2) ) > title("Grouped by Treatment") 0.2 0.4 0.6 0.8 1.0 Grouped by Treatment 0.0 Placebo Drug 0 500 1000 1500 2000 2500 3000 3500 6 > > kmfit.byHist <- survfit( Lsurv ~ Hist ) > kmfit.byHist Call: survfit(formula = Lsurv ~ Hist) records n.max n.start events median 0.95LCL 0.95UCL Hist=1 196 196 196 93 2580 2084 NA Hist=2 205 205 205 111 2089 1548 3116 > plot( kmfit.byHist, lty=c(1,2) ) > legend( 500, .2, legend=c("Squamous","Other"), lty=c(1,2) ) > title("Grouped by histology") 0.2 0.4 0.6 0.8 1.0 Grouped by histology 0.0 Squamous Other 0 500 1000 1500 2000 2500 3000 3500 0.2 0.4 0.6 0.8 1.0 > plot( kmfit.byTreat, conf.int=T, col=c("black","grey"),lty=1:2) > legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2), col=c("black","grey") ) 0.0 Placebo Drug 0 500 1000 1500 2000 2500 3000 3500 7 > survdiff( Lsurv ~ Treat ) # log-rank test for difference between groups Call: survdiff(formula = Lsurv ~ Treat) N Observed Expected (O-E)^2/E (O-E)^2/V Treat=0 195 96 99.8 0.148 0.289 Treat=1 206 108 104.2 0.142 0.289 Chisq= 0.3 on 1 degrees of freedom, p= 0.591 > survdiff( Lsurv ~ STAGET ) Call: survdiff(formula = Lsurv ~ STAGET) N Observed Expected (O-E)^2/E (O-E)^2/V STAGET=0 213 91 116.1 5.43 12.6 STAGET=1 188 113 87.9 7.18 12.6 Chisq= 12.6 on 1 degrees of freedom, p= 0.000379 > survdiff( Lsurv ~ Hist ) Call: survdiff(formula = Lsurv ~ Hist) N Observed Expected (O-E)^2/E (O-E)^2/V Hist=1 196 93 104 1.07 2.16 Hist=2 205 111 100 1.10 2.16 Chisq= 2.2 > ?survdiff on 1 degrees of freedom, p= 0.141 8 library( survival ) # Library of survival analysis functions # time of death or censoring time.end <- c(1,2,2,3,6) # 1= death, 0=censored dead <- c(1,1,0,0,1) # create survival object surv.object <- Surv( time.end, dead) surv.object # kaplan-meier estimates kaplan1 <- survfit( surv.object ~ 1 ) # ~1 is because there are no covariates names( kaplan1 ) cbind( kaplan1$time,kaplan1$n.event,kaplan1$surv) plot(kaplan1, ylab="S(t)",xlab="t") # Note: Kaplan-Meier method is also called # the product-limit estimator 1*.8 1*.8*.75 1*.8*.75*0 # Get Afifi's text book lung cancer data Lung <- read.table("http://users.humboldt.edu/rizzardi/Data.dir/Surv.txt", header=F, na.strings=".") names(Lung) <c("ID","STAGET","Stagen","Hist","Treat","Perfbl","Poinf","Smokfu","Smokbl","Days","Dea th") # # # # Death: 0 = Alive(censored), 1=Dead STAGET: Tumor size: 0=small, 1=large Hist: Histology: 1=squamous cells, 2=other types of cells Treat: 0=control (saline), 1=BCG attach( Lung ) # First 8 rows of data Lung[1:8,] # First 10 people who died or were censored Lung[order(Days),][1:10,] # Demo of order() function order( c(9,7,6,8,8) ) #see help(Surv) #see help(with) used instead of "attach" Lsurv <- with( Lung, Surv(Days,Death)) # A '+' denotes censored Lsurv[1:7] # makes survival analysis data 9 # Compare to last values of Lung data frame Lung[1:7,] # Kaplan-Meir estimate of Survival function for all data kmfit <- survfit( Lsurv ~ 1 ) kmfit Lung[order(Days),][1:7,] # summary( ) produces values of survival function # Only part shown. summary( kmfit ) windows() plot( kmfit ) # show survival curve title( "All data together") kmfit.bystaget <- survfit( Lsurv ~ STAGET ) kmfit.bystaget windows() plot( kmfit.bystaget, lty=c(1,2)) legend( 500, .2, legend=c("Small","Large"), lty=c(1,2) ) title("Grouped by stage of tumor") kmfit.byTreat <- survfit( Lsurv ~ Treat ) kmfit.byTreat windows() plot( kmfit.byTreat, lty=c(1,2)) legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2) ) title("Grouped by Treatment") kmfit.byHist <- survfit( Lsurv ~ Hist ) kmfit.byHist windows() plot( kmfit.byHist, lty=c(1,2) ) legend( 500, .2, legend=c("Squamous","Other"), lty=c(1,2) ) title("Grouped by histology") windows() plot( kmfit.byTreat, conf.int=T, col=c("black","grey"),lty=1:2) legend( 500, .2, legend=c("Placebo","Drug"), lty=c(1,2), col=c("black","grey") ) survdiff( Lsurv ~ Treat ) # log-rank test for difference between groups survdiff( Lsurv ~ STAGET ) survdiff( Lsurv ~ Hist ) ?survdiff 10