Survival analysis in R

For an introduction, see the file “Survival analysis overview”.

The survival analysis example is from Peter Dalgaard, Introductory Statistics with R, Chapter 12. See also Glantz, Primer of Biostatistics, Chapter 11, Survival analysis. The material in the file “Survival analysis logrank details.doc” is supplementary.

install.packages("ISwR")

#### load the libraries
library(survival)
library(ISwR)

#### look at the melanoma data
data(melanom)
help(melanom)

Description:

The ‘melanom’ data frame has 205 rows and 7 columns. It contains data relating to the survival of patients after an operation for malignant melanoma, collected at Odense University Hospital by K.T. Drzewiecki.

(Note: the help text says 7 columns, but it lists only the six columns below, and the printout that follows shows the same six.)

This data frame contains the following columns:

‘no’ a numeric vector, patient code.
‘status’ a numeric vector code, survival status; 1: dead from melanoma, 2: alive, 3: dead from other cause.
‘days’ a numeric vector, observation time.
‘ulc’ a numeric vector code, ulceration; 1: present, 2: absent.
‘thick’ a numeric vector, tumor thickness (1/100 mm).
‘sex’ a numeric vector code; 1: female, 2: male.

> melanom[1:20,]
    no status days ulc thick sex
1  789      3   10   1   676   2
2   13      3   30   2    65   2
3   97      2   35   2   134   2
4   16      3   99   2   290   1
5   21      1  185   1  1208   2
6  469      1  204   1   484   2
7  685      1  210   1   516   2
8    7      1  232   1  1288   2
9  932      3  232   1   322   1
10 944      1  279   1   741   1
11 558      1  295   1   419   1
12 612      3  355   1    16   1
13   2      1  386   1   387   1
14 233      1  426   1   484   2
15 418      1  469   1   242   1
16 765      3  493   1  1256   2
17 777      1  529   1   580   2
18  61      1  621   1   706   2
19  67      1  629   1   548   2
20 819      1  659   1   773   2

# Fit a Kaplan-Meier curve and plot it.
fit = survfit(Surv(days, status == 1) ~ 1, data=melanom)

In Surv(days, status == 1), the event is death from melanoma (status 1). Patients who are alive (status 2) or who died from another cause (status 3) are treated as censored at their observation time.

Notice that the formula for survfit must have “~ 1” to fit a single curve. The example in the textbook without the “~ 1” is the old style, which is no longer supported.
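To make concrete what survfit estimates, here is a minimal base-R sketch of the Kaplan-Meier product-limit calculation. The function name km_estimate and the six-patient toy data are hypothetical and are not part of the melanom example; the sketch omits the standard errors and confidence limits that survfit also reports.

```r
# Kaplan-Meier estimate by hand (illustrative sketch, not the survival
# package's implementation). At each distinct event time t, survival is
# multiplied by (1 - d/n), where d = events at t and n = number still
# at risk just before t. Censored observations only reduce n.
km_estimate <- function(time, event) {
  event_times <- sort(unique(time[event]))
  n_risk  <- sapply(event_times, function(t) sum(time >= t))
  n_event <- sapply(event_times, function(t) sum(time == t & event))
  data.frame(time     = event_times,
             n.risk   = n_risk,
             n.event  = n_event,
             survival = cumprod(1 - n_event / n_risk))
}

# Hypothetical toy data: follow-up time in days, TRUE = event observed,
# FALSE = censored (analogous to status == 1 in the melanom example).
toy_time  <- c(5, 8, 8, 12, 15, 20)
toy_event <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

km <- km_estimate(toy_time, toy_event)
print(km)
```

If the survival package is loaded, summary(survfit(Surv(toy_time, toy_event) ~ 1)) should show the same survival column.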
> summary(fit, times = seq(0, 4000, 500))
Call: survfit(formula = Surv(days, status == 1) ~ 1, data = melanom)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0    205       0    1.000  0.0000        1.000        1.000
  500    189       9    0.955  0.0147        0.927        0.984
 1000    171      17    0.869  0.0240        0.823        0.917
 1500    159      10    0.818  0.0274        0.766        0.874
 2000    103      10    0.762  0.0308        0.704        0.825
 2500     66       7    0.700  0.0362        0.633        0.775
 3000     54       2    0.677  0.0385        0.605        0.757
 3500     24       2    0.645  0.0431        0.566        0.735
 4000     13       0    0.645  0.0431        0.566        0.735

plot(fit, xlab = "Days", ylab="Survival")

# Create KM estimates broken out by sex (1: female, 2: male)
surv.bysex = survfit(Surv(days, status == 1) ~ sex, data = melanom)

> summary(surv.bysex, times = seq(0, 4000, 500))

                sex=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0    126       0    1.000  0.0000        1.000        1.000
  500    119       4    0.968  0.0159        0.937        0.999
 1000    111       7    0.910  0.0258        0.861        0.962
 1500    106       5    0.869  0.0305        0.812        0.931
 2000     68       6    0.813  0.0362        0.745        0.887
 2500     43       4    0.755  0.0439        0.674        0.846
 3000     36       0    0.755  0.0439        0.674        0.846
 3500     17       2    0.704  0.0542        0.605        0.818
 4000      9       0    0.704  0.0542        0.605        0.818

                sex=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0     79       0    1.000  0.0000        1.000        1.000
  500     70       5    0.934  0.0284        0.880        0.992
 1000     60      10    0.801  0.0461        0.715        0.896
 1500     53       5    0.734  0.0510        0.640        0.841
 2000     35       4    0.677  0.0544        0.578        0.792
 2500     23       3    0.608  0.0619        0.498        0.742
 3000     18       2    0.553  0.0675        0.435        0.702
 3500      7       0    0.553  0.0675        0.435        0.702
 4000      4       0    0.553  0.0675        0.435        0.702

plot(surv.bysex, xlab = "Days", ylab="Survival", conf.int=TRUE, col=c("black", "red"), lty = 1:2)
legend(100, .2, c("Female", "Male"), lty = 1:2, col=c("black", "red"))

#### Logrank test for difference in survival by sex
surv.diff.sex = survdiff(Surv(days, status == 1) ~ sex, data = melanom)
surv.diff.sex

> surv.diff.sex
Call:
survdiff(formula = Surv(days, status == 1) ~ sex, data = melanom)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 126       28     37.1      2.25      6.47
sex=2  79       29     19.9      4.21      6.47

 Chisq= 6.5  on 1 degrees of freedom, p= 0.011

#### Logrank test for difference in survival by sex, stratified by ulceration
surv.diff.sex.ulc = survdiff(Surv(days, status == 1) ~ sex + strata(ulc), data = melanom)
surv.diff.sex.ulc

> surv.diff.sex.ulc
Call:
survdiff(formula = Surv(days, status == 1) ~ sex + strata(ulc), data = melanom)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 126       28     34.7      1.28      3.31
sex=2  79       29     22.3      1.99      3.31

 Chisq= 3.3  on 1 degrees of freedom, p= 0.0687

# The p-value rises from 0.011 to 0.0687 after stratifying by ulceration, suggesting that part of the apparent sex difference in survival is associated with ulceration.

Cox proportional hazards regression

Cox proportional hazards regression models have some of the characteristics and advantages of other regression models. In particular, unlike the logrank test, a Cox model can handle a continuous predictor variable and can adjust for several predictors at once.

# First, let’s look at the Cox model of survival in the melanom data set where the predictor variable is sex (1: female, 2: male).
coxph.sex = coxph(Surv(days, status == 1) ~ sex, data = melanom)
summary(coxph.sex)

Call:
coxph(formula = Surv(days, status == 1) ~ sex, data = melanom)

  n= 205

     coef exp(coef) se(coef)   z     p
sex 0.662      1.94    0.265 2.5 0.013

    exp(coef) exp(-coef) lower .95 upper .95
sex      1.94      0.516      1.15      3.26

Rsquare= 0.03   (max possible= 0.937 )
Likelihood ratio test= 6.15  on 1 df,   p=0.0131
Wald test            = 6.24  on 1 df,   p=0.0125
Score (logrank) test = 6.47  on 1 df,   p=0.0110

Here’s what we see in the summary output:

“coef” is the estimated logarithm of the hazard ratio for males versus females (coef = 0.662). To make this easier to interpret, we convert the log of the hazard ratio to the estimated hazard ratio itself using exp(coef) = exp(0.662) = 1.94.

In this data set, sex is encoded as a numeric vector (1: female, 2: male). Because sex enters the model as a number, coef is the change in log hazard for a one-unit increase in sex, so the R summary gives the hazard ratio for the second group relative to the first group, that is, male versus female.
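The columns of the two coefficient tables are tied together by simple arithmetic. The following base-R sketch recomputes them from the rounded coef and se(coef) printed above, so the results agree with the R output only to rounding. The variable names are illustrative, and the normal-approximation (Wald) formulas are the standard ones, stated here as an assumption about how the printed columns relate rather than taken from the package source.

```r
# Values copied (rounded) from the printed summary(coxph.sex) output.
coef_sex <- 0.662   # log hazard ratio, male vs female
se_sex   <- 0.265   # standard error of the log hazard ratio

hr_male_vs_female <- exp(coef_sex)    # the exp(coef) column
hr_female_vs_male <- exp(-coef_sex)   # the exp(-coef) column (reciprocal)

# Wald z statistic and two-sided p-value from the normal approximation.
z_sex <- coef_sex / se_sex
p_sex <- 2 * pnorm(-abs(z_sex))

# 95% confidence limits: build the interval on the log scale,
# then exponentiate to get limits for the hazard ratio.
ci_sex <- exp(coef_sex + c(-1, 1) * qnorm(0.975) * se_sex)

print(round(c(HR = hr_male_vs_female, inverse.HR = hr_female_vs_male,
              z = z_sex, p = p_sex,
              lower.95 = ci_sex[1], upper.95 = ci_sex[2]), 3))
```

Note that the interval is symmetric around coef on the log scale but not around the hazard ratio itself, which is why 1.94 is not the midpoint of 1.15 and 3.26.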
The estimated hazard ratio of 1.94 indicates that, in these data, males have a higher risk of death from melanoma (lower survival) than females.

se(coef) = 0.265 is the standard error of the log hazard ratio. Dividing coef = 0.662 by its standard error se(coef) = 0.265 gives the z score: z = coef/se(coef) = 0.662/0.265 = 2.5. The corresponding p-value for sex is p = 0.013, indicating a statistically significant difference in survival between the sexes.

The summary output also gives lower and upper 95% confidence limits for the hazard ratio (1.15 to 3.26).

Finally, the summary output gives p-values for three alternative tests for overall significance of the model:

Likelihood ratio test= 6.15  on 1 df,   p=0.0131
Wald test            = 6.24  on 1 df,   p=0.0125
Score (logrank) test = 6.47  on 1 df,   p=0.0110

These three tests are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The score test statistic (6.47) is the same chi-square that survdiff reported for the logrank test by sex.

Cox model using covariates in the melanom data

# Now, let’s look at the Cox model of survival in the melanom data set where the predictor variables include a continuous covariate, the log of the thickness of the tumor (variable name = “thick”).

# Plot the thickness values in the data set.
hist(melanom$thick)

# The values are strongly right-skewed. The Cox model does not require normally distributed predictors, but with a heavily skewed covariate a few extreme values can dominate the fit, so we’ll try a log transform.
hist(log(melanom$thick))

# The log of the thickness of the tumor looks much more symmetric, so we’ll use that.
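The effect of the log transform on a right-skewed variable can be checked numerically as well as by eye. The sketch below uses simulated log-normal values, not the melanom data; the sample size matches the 205 patients, but the parameters (meanlog = 5, sdlog = 0.8), the seed, and the helper name skewness are arbitrary choices made for illustration.

```r
# Simulated right-skewed "thickness" values (hypothetical, NOT melanom$thick).
set.seed(1)
sim_thick <- rlnorm(205, meanlog = 5, sdlog = 0.8)

# Moment-based sample skewness: near 0 for a symmetric distribution,
# positive when there is a long right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(sim_thick)        # clearly positive
skew_log <- skewness(log(sim_thick))   # much closer to 0

print(round(c(raw = skew_raw, log = skew_log), 2))
```

With the packages loaded, the same helper could be applied to melanom$thick and log(melanom$thick) to put a number on what the two histograms show.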
coxph.sex.thick = coxph(Surv(days, status == 1) ~ sex + log(thick), data = melanom)
summary(coxph.sex.thick)

Call:
coxph(formula = Surv(days, status == 1) ~ sex + log(thick), data = melanom)

  n= 205

            coef exp(coef) se(coef)    z       p
sex        0.458      1.58    0.269 1.70 8.8e-02
log(thick) 0.781      2.18    0.157 4.96 6.9e-07

           exp(coef) exp(-coef) lower .95 upper .95
sex             1.58      0.633     0.934      2.68
log(thick)      2.18      0.458     1.604      2.97

Rsquare= 0.151   (max possible= 0.937 )
Likelihood ratio test= 33.5  on 2 df,   p=5.45e-08
Wald test            = 31  on 2 df,   p=1.85e-07
Score (logrank) test = 32.5  on 2 df,   p=8.68e-08

The p-values for all three overall tests (likelihood ratio, Wald, and score) are very small, indicating that the model as a whole is significant.

Which variables in the model appear to be related to survival?

            coef exp(coef) se(coef)    z       p
sex        0.458      1.58    0.269 1.70 8.8e-02
log(thick) 0.781      2.18    0.157 4.96 6.9e-07

           exp(coef) exp(-coef) lower .95 upper .95
sex             1.58      0.633     0.934      2.68
log(thick)      2.18      0.458     1.604      2.97

The p-value for log(thick) is 6.9e-07, with a hazard ratio HR = exp(coef) = 2.18, indicating a strong relationship between the thickness of the tumor and increased risk of death. Because the covariate is on the natural log scale, this hazard ratio applies to a one-unit increase in log(thick), that is, to multiplying the thickness by e (about 2.72).

By contrast, the p-value for sex is now p = 0.088. The hazard ratio is HR = exp(coef) = 1.58, with a 95% confidence interval of 0.934 to 2.68. Because the confidence interval for the HR includes 1, the data are compatible with no difference in hazard between the sexes once tumor thickness is taken into account, although the point estimate still favors higher risk in males.

Why would the Cox model using sex as the only variable indicate that sex is a significant predictor (p-value for sex is p = 0.013), while the Cox model including both sex and tumor thickness indicates that sex may not be significant (p-value for sex is p = 0.088)?

# Let’s look at differences in tumor thickness in males and females. 1: female, 2: male.
boxplot(melanom$thick ~ melanom$sex)
boxplot(log(melanom$thick) ~ melanom$sex)
t.test(log(melanom$thick) ~ melanom$sex, var.equal = TRUE)

> t.test(log(melanom$thick) ~ melanom$sex, var.equal = TRUE)

        Two Sample t-test

data:  log(melanom$thick) by melanom$sex
t = -2.5405, df = 203, p-value = 0.01182
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.64672904 -0.08152128
sample estimates:
mean in group 1 mean in group 2
       5.083019        5.447145

Men (group 2) have thicker tumors on average than women (group 1): the mean log thickness is 5.45 versus 5.08, and the difference is significant (p = 0.012). Since thicker tumors carry a higher risk of death, part of the sex effect seen in the sex-only Cox model reflects the difference in tumor thickness between men and women.

We can speculate that women tend to go to their doctor earlier (when the tumor is less thick), while men tend to go to their physician later (when the tumor has already grown thicker).

If men and women really differ in their death rates due to melanoma, we might start a research program to understand the molecular basis of those differences, such as differences in tumor growth as a function of the levels of estrogen, testosterone, and other hormones.

If, instead, men and women do not really differ in their death rates, but differ in how quickly they visit their doctor, then we should consider a public health education effort to get men to visit their physicians as soon as they see an abnormality.
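The t-test above works on the log scale, which makes the size of the thickness difference hard to read directly. Exponentiating the group means gives geometric means on the original scale (1/100 mm), and exponentiating the difference gives a male-to-female ratio. The sketch below uses only the numbers printed in the t-test output; the variable names are illustrative.

```r
# Values copied from the printed t.test output (log scale).
mean_log_female <- 5.083019                 # group 1
mean_log_male   <- 5.447145                 # group 2
ci_log_diff     <- c(-0.64672904, -0.08152128)  # CI for female minus male

# Geometric mean thickness in each group, in units of 1/100 mm.
geo_mean_female <- exp(mean_log_female)
geo_mean_male   <- exp(mean_log_male)

# Ratio of geometric means, male vs female. The printed interval is for
# female minus male, so flip its sign and order before exponentiating.
ratio_male_vs_female <- exp(mean_log_male - mean_log_female)
ratio_ci <- rev(exp(-ci_log_diff))

print(round(c(female = geo_mean_female, male = geo_mean_male), 1))
print(round(c(ratio = ratio_male_vs_female,
              lower.95 = ratio_ci[1], upper.95 = ratio_ci[2]), 2))
```

Read this way, the typical tumor in men is roughly 1.4 times as thick as in women, and the confidence interval for that ratio lies entirely above 1.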