Survival analysis in R

For an introduction, see the file “Survival analysis
overview”.
Survival analysis example from Peter Dalgaard,
Introductory Statistics with R, Chapter 12.
See Glantz, Primer of Biostatistics, Chapter 11,
Survival analysis
The material in the file “Survival analysis logrank
details.doc” is supplementary.
install.packages("ISwR")
#### load the libraries
library(survival)
library(ISwR)
#### look at the melanoma data
data(melanom)
help(melanom)
Description:
The ‘melanom’ data frame has 205 rows and 7
columns. It contains data relating to the survival
of patients after an operation for malignant
melanoma, collected at Odense University Hospital
by K.T. Drzewiecki.
This data frame contains the following
columns:
‘no’ a numeric vector, patient code.
‘status’ a numeric vector code, survival status; 1:
dead from melanoma, 2: alive, 3: dead from other
cause.
‘days’ a numeric vector, observation time.
‘ulc’ a numeric vector code, ulceration; 1:
present, 2: absent.
‘thick’ a numeric vector, tumor thickness (1/100
mm).
‘sex’ a numeric vector code; 1: female, 2: male.
> melanom[1:20,]
    no status days ulc thick sex
1  789      3   10   1   676   2
2   13      3   30   2    65   2
3   97      2   35   2   134   2
4   16      3   99   2   290   1
5   21      1  185   1  1208   2
6  469      1  204   1   484   2
7  685      1  210   1   516   2
8    7      1  232   1  1288   2
9  932      3  232   1   322   1
10 944      1  279   1   741   1
11 558      1  295   1   419   1
12 612      3  355   1    16   1
13   2      1  386   1   387   1
14 233      1  426   1   484   2
15 418      1  469   1   242   1
16 765      3  493   1  1256   2
17 777      1  529   1   580   2
18  61      1  621   1   706   2
19  67      1  629   1   548   2
20 819      1  659   1   773   2
# Fit a Kaplan-Meier and plot it.
fit = survfit(Surv(days, status == 1) ~ 1,
data=melanom)
Notice that the formula for survfit must have “~ 1” to
plot a single curve. The example in the textbook
without the “~ 1” is the old style, which is no longer
supported.
> summary(fit, times = seq(0, 4000, 500))
Call: survfit(formula = Surv(days, status == 1) ~ 1, data = melanom)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0    205       0    1.000  0.0000        1.000        1.000
  500    189       9    0.955  0.0147        0.927        0.984
 1000    171      17    0.869  0.0240        0.823        0.917
 1500    159      10    0.818  0.0274        0.766        0.874
 2000    103      10    0.762  0.0308        0.704        0.825
 2500     66       7    0.700  0.0362        0.633        0.775
 3000     54       2    0.677  0.0385        0.605        0.757
 3500     24       2    0.645  0.0431        0.566        0.735
 4000     13       0    0.645  0.0431        0.566        0.735
plot(fit, xlab = "Days", ylab="Survival")
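To make the Kaplan-Meier estimate less of a black box, here is a minimal sketch of the product-limit calculation in base R. The toy times and event flags are invented for illustration and are not taken from melanom, and km_by_hand is a hypothetical helper, not a function from the survival package.

```r
# Toy data (hypothetical): follow-up times and event indicators
time  <- c(5, 8, 8, 12, 15, 20)
event <- c(1, 1, 0, 1, 0, 1)   # 1 = death observed, 0 = censored

# Hypothetical helper: Kaplan-Meier (product-limit) estimate by hand
km_by_hand <- function(time, event) {
  # Distinct times at which at least one death was observed
  t_event <- sort(unique(time[event == 1]))
  # Number still under observation just before each event time
  n_risk  <- sapply(t_event, function(t) sum(time >= t))
  # Number of deaths at each event time
  n_event <- sapply(t_event, function(t) sum(time == t & event == 1))
  # Survival is the running product of (1 - deaths / at risk)
  surv    <- cumprod(1 - n_event / n_risk)
  data.frame(time = t_event, n.risk = n_risk,
             n.event = n_event, survival = surv)
}

km <- km_by_hand(time, event)
print(km)
```

Censored observations (event = 0) never count as deaths, but they do stay in the risk set until their censoring time. This is why the melanom calls use status == 1, so that only deaths from melanoma count as events.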
# Create KM estimates broken out by sex
surv.bysex = survfit(Surv(days, status == 1) ~ sex,
data = melanom)
1: female, 2: male
> summary(surv.bysex, times = seq(0, 4000, 500))

                sex=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0    126       0    1.000  0.0000        1.000        1.000
  500    119       4    0.968  0.0159        0.937        0.999
 1000    111       7    0.910  0.0258        0.861        0.962
 1500    106       5    0.869  0.0305        0.812        0.931
 2000     68       6    0.813  0.0362        0.745        0.887
 2500     43       4    0.755  0.0439        0.674        0.846
 3000     36       0    0.755  0.0439        0.674        0.846
 3500     17       2    0.704  0.0542        0.605        0.818
 4000      9       0    0.704  0.0542        0.605        0.818

                sex=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0     79       0    1.000  0.0000        1.000        1.000
  500     70       5    0.934  0.0284        0.880        0.992
 1000     60      10    0.801  0.0461        0.715        0.896
 1500     53       5    0.734  0.0510        0.640        0.841
 2000     35       4    0.677  0.0544        0.578        0.792
 2500     23       3    0.608  0.0619        0.498        0.742
 3000     18       2    0.553  0.0675        0.435        0.702
 3500      7       0    0.553  0.0675        0.435        0.702
 4000      4       0    0.553  0.0675        0.435        0.702
plot(surv.bysex, xlab = "Days", ylab="Survival",
conf.int=TRUE, col=c("black", "red"), lty = 1:2)
legend(100, .2, c("Female", "Male"), lty = 1:2,
col=c("black", "red"))
#### Logrank test for difference in survival by sex
surv.diff.sex = survdiff(Surv(days, status == 1) ~ sex,
data = melanom)
surv.diff.sex
> surv.diff.sex
Call:
survdiff(formula = Surv(days, status == 1) ~ sex, data = melanom)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 126       28     37.1      2.25      6.47
sex=2  79       29     19.9      4.21      6.47

 Chisq= 6.5  on 1 degrees of freedom, p= 0.011
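The p-value in the logrank output can be reproduced from the chi-square statistic. The following is a minimal sketch in base R that uses the rounded values as printed, so the results agree with the survdiff output only approximately.

```r
# Values copied from the survdiff() output (rounded as printed)
observed <- c(female = 28, male = 29)
expected <- c(female = 37.1, male = 19.9)

# Observed and expected deaths have the same total (57)
total_gap <- sum(observed) - sum(expected)

# Per-group contribution (O-E)^2/E; this differs slightly from the printed
# column because Expected is rounded to one decimal place here
contrib <- (observed - expected)^2 / expected

# Upper-tail chi-square probability for the logrank statistic on 1 df
chisq   <- 6.47
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(round(p_value, 3))
```

Females had fewer deaths than expected under the null hypothesis (28 versus 37.1) and males had more (29 versus 19.9), which is the direction of the difference the test is picking up.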
#### Logrank test for difference in survival by sex, stratified by ulceration
surv.diff.sex.ulc = survdiff(Surv(days, status == 1) ~
sex + strata(ulc), data = melanom)
surv.diff.sex.ulc
> surv.diff.sex.ulc
Call:
survdiff(formula = Surv(days, status == 1) ~ sex + strata(ulc),
    data = melanom)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 126       28     34.7      1.28      3.31
sex=2  79       29     22.3      1.99      3.31

 Chisq= 3.3  on 1 degrees of freedom, p= 0.0687
# The p-value rises from 0.011 to 0.069 after stratifying by
ulceration, suggesting that part of the apparent sex difference
in survival is explained by ulceration.
Cox proportional hazards regression
Cox proportional hazards regression models have
some of the characteristics and advantages of other
regression models. In particular, Cox models are
useful when the predictor variable is continuous.
# First, let’s look at the cox model of survival in the
melanom data set where the predictor variable is sex
(male/female).
coxph.sex = coxph(Surv(days, status == 1) ~ sex,
data = melanom)
summary(coxph.sex)
Call:
coxph(formula = Surv(days, status == 1) ~ sex, data = melanom)

  n= 205
     coef exp(coef) se(coef)   z     p
sex 0.662      1.94    0.265 2.5 0.013

    exp(coef) exp(-coef) lower .95 upper .95
sex      1.94      0.516      1.15      3.26

Rsquare= 0.03   (max possible= 0.937 )
Likelihood ratio test= 6.15  on 1 df,   p=0.0131
Wald test            = 6.24  on 1 df,   p=0.0125
Score (logrank) test = 6.47  on 1 df,   p=0.0110
Here’s what we see in the summary output:
“coef” is the estimated logarithm of the hazard ratio
for males versus females (coef = 0.662). To make this
easier to interpret, we convert the log of the
hazard ratio to the estimated hazard ratio using
exp(coef) = exp(0.662) = 1.94.
In this data set, sex is encoded as a numeric vector.
1: female, 2: male.
The R summary for the cox model gives the hazard
ratio for the second group relative to the first group,
that is, male versus female.
The estimated hazard ratio = 1.94 indicates that
males have higher risk of death (lower survival rates)
than females, in these data.
se(coef) = 0.265 is the standard error of the log HR.
Dividing coef = 0.662 by its standard error se(coef)
= 0.265 gives the z score:
z = coef/se(coef) = 0.662/0.265 = 2.5.
The corresponding p-value for sex is p=0.013,
indicating that there is a significant difference in
survival as a function of sex.
The summary output also gives upper and lower 95%
confidence intervals for the hazard ratio.
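The quantities discussed above can be recomputed from coef and se(coef) alone. This is a minimal sketch in base R using the rounded values printed in the summary, so the last digit may differ slightly from what coxph reports. The 95% interval is assumed to be the usual Wald interval, exp(coef ± 1.96 × se).

```r
# Values copied from summary(coxph.sex), rounded as printed
coef_sex <- 0.662
se_sex   <- 0.265

# Hazard ratio for males versus females
hr <- exp(coef_sex)

# Wald z score and its two-sided p-value from the standard normal
z <- coef_sex / se_sex
p <- 2 * pnorm(-abs(z))

# Wald 95% confidence interval, computed on the log scale and
# then exponentiated back to the hazard-ratio scale
ci <- exp(coef_sex + c(-1, 1) * qnorm(0.975) * se_sex)

print(round(c(HR = hr, z = z, lower = ci[1], upper = ci[2]), 2))
```

Because the interval is built on the log scale, it is not symmetric around 1.94 on the hazard-ratio scale.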
Finally, the summary output gives p-values for three
alternative tests for overall significance of the model:
Likelihood ratio test= 6.15  on 1 df,   p=0.0131
Wald test            = 6.24  on 1 df,   p=0.0125
Score (logrank) test = 6.47  on 1 df,   p=0.0110
These three methods are asymptotically equivalent.
For large enough N, they will give similar results. For
small N, they may differ somewhat.
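Each of the three overall tests refers its statistic to a chi-square distribution, here with 1 degree of freedom. A minimal sketch in base R, using the rounded statistics from the summary output, shows how the p-values arise and how the Wald statistic relates to the z score for a one-coefficient model.

```r
# Test statistics copied from the summary(coxph.sex) output
stats <- c(likelihood_ratio = 6.15, wald = 6.24, score_logrank = 6.47)

# Upper-tail chi-square probabilities on 1 degree of freedom
p_values <- pchisq(stats, df = 1, lower.tail = FALSE)
print(signif(p_values, 3))

# With a single coefficient, the Wald statistic is the squared z score
wald_from_z <- (0.662 / 0.265)^2
print(round(wald_from_z, 2))
```

The score test statistic (6.47) is the same value reported by the logrank test from survdiff for sex, which is why the summary labels it “Score (logrank) test”.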
Cox model using covariates in the melanom data
# Now, let’s look at the Cox model of survival in the
melanom data set where the predictor variables
include a continuous covariate, the log of the
thickness of the tumor (variable name = “thick”).
# Plot the thickness values in the data set.
hist(melanom$thick)
# The values are strongly right-skewed. The Cox model does
not require normally distributed predictors, but with a
skewed covariate a few extreme values can dominate the
fit, so we’ll try a log transform.
hist(log(melanom$thick))
# The log of the thickness of the tumor looks to be
more normally-distributed, so we’ll use that.
coxph.sex.thick = coxph(Surv(days, status == 1) ~
sex + log(thick), data = melanom)
summary(coxph.sex.thick)
coxph(formula = Surv(days, status == 1) ~ sex + log(thick), data = melanom)

  n= 205
            coef exp(coef) se(coef)    z       p
sex        0.458      1.58    0.269 1.70 8.8e-02
log(thick) 0.781      2.18    0.157 4.96 6.9e-07

           exp(coef) exp(-coef) lower .95 upper .95
sex             1.58      0.633     0.934      2.68
log(thick)      2.18      0.458     1.604      2.97

Rsquare= 0.151   (max possible= 0.937 )
Likelihood ratio test= 33.5  on 2 df,   p=5.45e-08
Wald test            = 31  on 2 df,   p=1.85e-07
Score (logrank) test = 32.5  on 2 df,   p=8.68e-08
The p-values for all three overall tests (likelihood
ratio, Wald, and score) are significant, indicating that
the model as a whole is significant. Which variables in
the model appear to be related to survival?
            coef exp(coef) se(coef)    z       p
sex        0.458      1.58    0.269 1.70 8.8e-02
log(thick) 0.781      2.18    0.157 4.96 6.9e-07

           exp(coef) exp(-coef) lower .95 upper .95
sex             1.58      0.633     0.934      2.68
log(thick)      2.18      0.458     1.604      2.97
The p-value for log(thick) is 6.9e-07, with a hazard
ratio HR = exp(coef) = 2.18, indicating a strong
relationship between the thickness of the tumor and
increased risk of death.
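Because thickness enters the model on the natural-log scale, the hazard ratio of 2.18 applies to a one-unit increase in log(thick), that is, to multiplying thickness by e (about 2.72). A minimal sketch in base R, using the rounded coefficient from the output, converts this to scales that may be easier to read. The variable names are illustrative only.

```r
# Coefficient for log(thick), copied from summary(coxph.sex.thick)
coef_log_thick <- 0.781

# Hazard ratio when thickness is multiplied by e (one unit of log(thick))
hr_per_e_fold <- exp(coef_log_thick)

# Hazard ratio when thickness doubles: exp(coef * log(2)) = 2^coef
hr_per_doubling <- 2^coef_log_thick

# Hazard ratio when thickness increases tenfold
hr_per_tenfold <- 10^coef_log_thick

print(round(c(e_fold = hr_per_e_fold, doubling = hr_per_doubling), 2))
```

Under this model, each doubling of tumor thickness multiplies the estimated hazard by the same factor, whatever the starting thickness.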
By contrast, the p-value for sex is now p=0.088. The
hazard ratio HR = exp(coef) = 1.58, with a 95%
confidence interval of 0.934 to 2.68. Because the
confidence interval for HR includes 1, these results
suggest there may be little difference in the HR
attributable to sex after controlling for the thickness of
the tumor.
Why would the Cox model using sex as the only
variable indicate that sex is a significant predictor (p-
value for sex is p=0.013), while the Cox model
including both sex and tumor thickness indicates that
sex may not be significant (p-value for sex is
p=0.088)?
# Let’s look at differences in tumor thickness in males
and females. 1: female, 2: male.
boxplot(melanom$thick ~ melanom$sex)
boxplot(log(melanom$thick) ~ melanom$sex)
t.test(log(melanom$thick) ~ melanom$sex, var.equal
= TRUE)
> t.test(log(melanom$thick) ~ melanom$sex,
var.equal = TRUE)
Two Sample t-test
data: log(melanom$thick) by melanom$sex
t = -2.5405, df = 203, p-value = 0.01182
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval:
-0.64672904 -0.08152128
sample estimates:
mean in group 1 mean in group 2
       5.083019        5.447145
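The t-test compares means of log(thick), so the difference is easier to interpret after converting back to the original scale, where it becomes a ratio of geometric means. A minimal sketch in base R, using the values printed in the t-test output (group 1 = female, group 2 = male; thick is recorded in 1/100 mm):

```r
# Group means of log(thick), copied from the t.test() output
mean_log_female <- 5.083019
mean_log_male   <- 5.447145

# Geometric mean thickness in mm (thick is in units of 1/100 mm)
gm_female_mm <- exp(mean_log_female) / 100
gm_male_mm   <- exp(mean_log_male) / 100

# Ratio of geometric means, male to female
ratio <- exp(mean_log_male - mean_log_female)

# The printed 95% CI is for (female - male) on the log scale;
# negate and exponentiate to get a CI for the male/female ratio
ci_log_diff <- c(-0.64672904, -0.08152128)
ci_ratio    <- rev(exp(-ci_log_diff))

print(round(c(ratio = ratio, lower = ci_ratio[1], upper = ci_ratio[2]), 2))
```

A ratio above 1 means that tumors in men were thicker, on the geometric-mean scale, at the time of operation.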
We can speculate that women tend to go to their
doctor earlier (when the tumor is less thick), while
men are likely to go to their physician later (when the
tumor has already grown thicker).
If men and women really differ in their death rates due
to melanoma, we might start a research program to
understand the molecular basis of those differences,
such as differences in tumor growth as a function of
the levels of estrogen, testosterone, and other
hormones.
If, instead, men and women do not really differ in their
death rates, but instead differ in how quickly they will
visit their doctor, then we should consider a
public-health education effort to get men to visit their
physicians as soon as they see an abnormality.