Survival Analysis

Medical Statistics & Epidemiology (153133)
Survival Analysis (week 02)
Hazard/failure rates, Cox regression
October 2008 (K.P.)
CONTENTS
1 Survival function and hazard function
2 Distributions for failure rate
3 Kaplan Meier estimator
4 Regression models
5 Proportional hazards model
6 Modeling and testing in the proportional hazards model
7 Assignment
Section 1: Survival function and hazard function
This part of the course Medical Statistics and Epidemiology deals with the modeling and
analysis of data that have as a principal end point the time until an event occurs. Such
events are generically referred to as failures though the event may, for instance, be the
performance of a certain task in a learning experiment in psychology or a change of
residence in a demographic study as well. Major areas of application, however, are
medical studies on chronic diseases and industrial life testing.
We assume that observations are available on the independent failure times of $n$ individuals. Let $T$ be the nonnegative random variable representing the failure time of an arbitrary individual. We assume that the probability distribution of $T$ is described by a density function $f(t)$. We shall introduce the survival function $S(t)$ and the hazard function $\lambda(t)$, which characterize the distribution of $T$ as well. The survival function $S(t)$ is defined by
(1)
$$S(t) = P(T > t)$$
and is equal to $1 - F(t)$, where $F(t)$ is the cumulative distribution function of $T$. (Note that $P(T = t) = 0$ for each number $t$ in case of a density function.) Since the cumulative distribution function $F(t)$ specifies the distribution of $T$, the distribution of $T$ is specified as well by the survival function $S(t) = 1 - F(t)$.
The hazard function $\lambda(t)$ specifies the instantaneous rate of failure at $T = t$ conditional upon survival to time $t$ and is defined by the limit for $\Delta \to 0$ of the following ratio:
(2)
$$\frac{P(t \le T < t + \Delta \mid T \ge t)}{\Delta} = \frac{P(t \le T < t + \Delta)}{P(T \ge t)\cdot\Delta} = \frac{S(t) - S(t + \Delta)}{\Delta}\cdot\frac{1}{S(t)}$$
Taking this limit we get
(3)
$$\lambda(t) = \frac{f(t)}{S(t)}.$$
Note that the derivative of the survival function $S(t)$ is equal to $-f(t)$. The distribution of $T$ is specified by its hazard function as well, because the survival function is determined by the hazard function:
(4)
$$\frac{d}{dt}\ln S(t) = -\frac{f(t)}{S(t)} = -\lambda(t)$$
(5)
$$\ln S(t) = -\int_0^t \lambda(u)\,du \qquad (\text{Note: } S(0) = 1)$$
(6)
$$S(t) = \exp\left(-\int_0^t \lambda(u)\,du\right)$$
Section 2: Distributions for failure rate
In this section we present a number of models for the distribution of T .
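Relation (6) turns any hazard function into a survival function, which is how the exponential and Weibull survival functions in this section are obtained. As a minimal sketch, the integral in (6) can be approximated numerically. The function name `survival_from_hazard`, the midpoint rule and the two example hazards are illustrative assumptions, not part of the course material.

```python
import math

def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral_0^t hazard(u) du) with the midpoint rule."""
    if t <= 0:
        return 1.0  # S(0) = 1
    h = t / steps
    cumulative_hazard = sum(hazard((j + 0.5) * h) for j in range(steps)) * h
    return math.exp(-cumulative_hazard)

# Constant hazard 0.5: relation (6) gives S(t) = exp(-0.5 t).
s_const = survival_from_hazard(lambda u: 0.5, 2.0)
# Linearly increasing hazard u: the integral is t^2 / 2, so S(t) = exp(-t^2 / 2).
s_linear = survival_from_hazard(lambda u: u, 2.0)

print(s_const, math.exp(-1.0))
print(s_linear, math.exp(-2.0))
```

Both printed pairs should agree closely, because the midpoint rule integrates constant and linear hazards without discretisation error.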
The one parameter exponential distribution is obtained for $T$ by taking the hazard function to be a constant: $\lambda(t) = \lambda$ (with $\lambda > 0$), hence
(7)
$$S(t) = \exp\left(-\int_0^t \lambda\,du\right) = \exp(-\lambda t)$$
$F(t) = 1 - \exp(-\lambda t)$ and $f(t) = \lambda\exp(-\lambda t)$ follow rather easily. So for the exponential distribution the instantaneous failure rate is independent of $t$, so that the conditional chance of failure does not depend on how long the individual has been on trial. This is referred to as the memoryless property of the exponential distribution. An empirical check of the exponential distribution for a set of survival data is provided by plotting the log of the survival function estimate versus $t$. Such a plot should approximate a straight line through the origin, as can be concluded from (7).
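The memoryless property and the straight-line check can be illustrated with a few lines of Python. This is only a sketch: the rate `lam` and the time points are made-up values, and the exact survival function (7) is used in place of an estimate from data.

```python
import math

def exp_survival(t, lam):
    """Survival function (7) of the exponential distribution."""
    return math.exp(-lam * t)

lam = 0.1          # illustrative failure rate per month (assumed value)
s, t = 12.0, 6.0   # illustrative time points (assumed values)

# Memoryless property: P(T > s + t | T > s) equals P(T > t).
conditional = exp_survival(s + t, lam) / exp_survival(s, lam)
unconditional = exp_survival(t, lam)

# ln S(t) = -lam * t, so ln S(t) / t is the same for every t > 0
# (a straight line through the origin with slope -lam).
slopes = [math.log(exp_survival(x, lam)) / x for x in (1.0, 5.0, 20.0)]

print(conditional, unconditional)
print(slopes)
```

With real data one would replace `exp_survival` by an estimate of the survival function and judge the straightness of the plot by eye.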
An important generalization of the exponential distribution allows for a power dependence of the hazard function on time. This yields the two parameter Weibull distribution with hazard function
(8)
$$\lambda(t) = \lambda p(\lambda t)^{p-1} = p\lambda^p\, t^{p-1}.$$
This hazard function is monotone decreasing for $p < 1$, is monotone increasing for $p > 1$ and reduces to a constant if $p = 1$. For the Weibull distribution we get
(9)
$$S(t) = \exp\left(-\int_0^t p\lambda^p u^{p-1}\,du\right) = \exp\left(-(\lambda t)^p\right)$$
(10)
$$\ln\left[-\ln S(t)\right] = p\left(\ln(t) + \ln(\lambda)\right)$$
An empirical check for the Weibull distribution is provided by a plot of the estimate of $\ln[-\ln S(t)]$ versus $\ln(t)$. The plot should give approximately a straight line.
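The linear relation (10) behind this check can be verified directly. The sketch below uses assumed parameter values and the exact Weibull survival function (9) rather than an estimate; the function names are illustrative.

```python
import math

def weibull_hazard(t, lam, p):
    """Hazard function (8): p * lam**p * t**(p - 1)."""
    return p * lam**p * t ** (p - 1)

def weibull_survival(t, lam, p):
    """Survival function (9): exp(-(lam * t)**p)."""
    return math.exp(-((lam * t) ** p))

lam, p = 0.05, 1.5   # illustrative parameter values (assumed)
times = [2.0, 5.0, 10.0, 20.0]

# Points (ln t, ln(-ln S(t))) of the diagnostic plot based on (10).
points = [(math.log(t), math.log(-math.log(weibull_survival(t, lam, p))))
          for t in times]

# Slopes between consecutive points; by (10) each equals p.
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(points, points[1:])]
# Intercept at ln t = 0; by (10) it equals p * ln(lam).
intercept = points[0][1] - p * points[0][0]

print(slopes)
print(intercept, p * math.log(lam))
print([weibull_hazard(t, lam, p) for t in times])  # increasing, since p > 1
```

In practice the slope of the fitted line gives a rough estimate of $p$ and the intercept a rough estimate of $p\ln(\lambda)$.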
In general the distribution of a failure or survival time is skew. Skew distributions may be modeled by means of a lognormal distribution or a gamma distribution as well. If $T$ has a lognormal distribution then this means that $Y = \ln(T)$ has a normal distribution, described by an expectation $\mu$ and a variance $\sigma^2$. The gamma distribution may be regarded as another generalization of the exponential distribution; its density function is
(11)
$$f(t) = \frac{\lambda(\lambda t)^{\kappa-1} e^{-\lambda t}}{\Gamma(\kappa)},$$
where $\Gamma(\kappa)$ is the well-known gamma function:
(12)
$$\Gamma(\kappa) = \int_0^\infty x^{\kappa-1}\exp(-x)\,dx \qquad (0 < \kappa < \infty).$$
For $\kappa = 1$ the density (11) reduces to the density of the exponential distribution; note $\Gamma(1) = 1$.
Section 3: Kaplan Meier Estimator
Survival analysis is concerned with studying the time between entry to a study and a
subsequent event. Originally the analysis was concerned with time until death, hence the
name, but survival analysis is applicable to many areas other than mortality. A common feature of survival data is censoring: the exact failure times of a number of subjects are not known.
There are several reasons for censoring, to mention a few:
- Some patients might have left the study early; they are lost to follow-up. Examples: emigration, fatal accidents in traffic (competing risk).
- The study ends when a fixed time is reached (right censoring of type I).
- The study ends when a fixed number of failures has occurred (right censoring of type II).
In these examples there is right censoring, which means some failure times are not known. For these unknown failure times one only knows that the failure time exceeds some known value, the so-called censoring time. In this text we assume that the process of censoring is independent of the process we want to study. Furthermore we only consider right censoring. We study the survival of 49 patients with Dukes' C colorectal cancer. The survival times (months) of two treatment groups are as follows.
Control ($n = 24$):
3+, 6, 6, 6, 6, 8, 8, 12, 12, 12+, 15+, 16+, 18+, 18+, 20, 22+, 24, 28+, 28+, 28+, 30, 30+, 33+, 42
Treatment (γ-linoleic acid, $n = 25$):
1+, 5+, 6, 6, 9+, 10, 10, 10+, 12, 12, 12, 12, 12+, 13+, 15+, 16+, 20+, 24, 24+, 27+, 32, 34+, 36+, 36+, 44+
Here + means censoring. The first entry (3+) of the control group means that the patient left the study after 3 months of survival. So the corresponding survival time is known to be more than 3 months; 3 (months) is the censoring time of the patient. For the treatment group we shall estimate the survivor function $S(t) = P(T > t)$.
For estimation of $S(t)$ we use the Kaplan Meier estimator, also called the product limit estimator. Suppose that the survival times, including censored observations, of a homogeneous group of $n$ patients are represented by $t_1, t_2, \ldots, t_n$. We assume that the survival times (patients) are already ordered such that $t_1 \le t_2 \le \ldots \le t_n$. For a given value $t$ find the largest value $t_i$ such that $t_i \le t$; the probability $S(t)$ is then estimated by
(13)
$$\hat S(t) = \frac{r_1 - d_1}{r_1}\times\frac{r_2 - d_2}{r_2}\times\ldots\times\frac{r_i - d_i}{r_i}$$
where $r_k$ is the number of subjects alive just before time $t_k$ (the $k$th ordered survival time) and $d_k$ denotes the number who died at time $t_k$. Let us determine the estimates $\hat S(t)$ for the treatment group.
The estimate of formula (13) is the product of factors $(r_k - d_k)/r_k$. For the censored observations 1+ and 5+ these factors equal 1. We get $\hat S(t) = 1$ for $t < 6$. Following the recipe of the Kaplan Meier estimator we obtain
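(14)
$$\hat S(t) = \frac{23 - 2}{23} = 0.9130 \qquad \text{for } 6 \le t < 10$$
(15)
$$\hat S(t) = \frac{23 - 2}{23}\times\frac{20 - 2}{20} = \frac{21}{23}\times\frac{18}{20} = 0.8217 \qquad \text{for } 10 \le t < 12$$
(16)
$$\hat S(t) = \frac{21}{23}\times\frac{18}{20}\times\frac{17 - 4}{17} = 0.6284 \qquad \text{for } 12 \le t < 24$$
(17)
$$\hat S(t) = \frac{21}{23}\times\frac{18}{20}\times\frac{13}{17}\times\frac{8 - 1}{8} = 0.5498 \qquad \text{for } 24 \le t < 32$$
(18)
$$\hat S(t) = \frac{21}{23}\times\frac{18}{20}\times\frac{13}{17}\times\frac{7}{8}\times\frac{5 - 1}{5} = 0.4399 \qquad \text{for } t \ge 32$$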
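The product in (13) is easy to compute by machine. The following is a minimal sketch of the Kaplan Meier recipe applied to the treatment group; the function names `kaplan_meier` and `parse` are illustrative, and the code is not what SPSS does internally. It should reproduce the values in (14) to (18) up to rounding in the fourth decimal.

```python
def kaplan_meier(observations):
    """Return [(t_k, S_hat(t_k))] at each distinct event time.

    `observations` is a list of (time, event) pairs, where event is True for an
    observed failure and False for a censored time (the '+' entries).
    """
    event_times = sorted({t for t, event in observations if event})
    survival = 1.0
    steps = []
    for t_k in event_times:
        # r_k: subjects still under observation just before t_k.
        # A subject censored exactly at t_k is still counted as at risk.
        r_k = sum(1 for t, _ in observations if t >= t_k)
        d_k = sum(1 for t, event in observations if event and t == t_k)
        survival *= (r_k - d_k) / r_k
        steps.append((t_k, survival))
    return steps

def parse(entries):
    """Turn strings such as '12' or '12+' into (time, event) pairs."""
    return [(float(e.rstrip("+")), not e.endswith("+")) for e in entries.split()]

treatment = parse(
    "1+ 5+ 6 6 9+ 10 10 10+ 12 12 12 12 12+ 13+ 15+ 16+ 20+ "
    "24 24+ 27+ 32 34+ 36+ 36+ 44+"
)

for t_k, s in kaplan_meier(treatment):
    print(f"t = {t_k:g}: S_hat = {s:.4f}")
```

The same function can be applied to the control group by passing its 24 entries to `parse`.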
The Kaplan Meier estimate $\hat S(t)$ is a step function that only changes at survival times $t_k$ with a positive $d_k$; for the treatment group these times are 6, 10, 12, 24 and 32. Each factor in the Kaplan Meier estimator represents 1 minus an estimated hazard rate. For survival time $t_k$ we consider the number of people still alive, sometimes called the number at risk; this is the number $r_k$. Then the conditional probability of surviving is estimated in a straightforward way by $(r_k - d_k)/r_k$. The next plot shows the estimate of the survival function of the treatment group.
[Figure: Survival Function, treatment group. Vertical axis: Cum Survival; horizontal axis: survival; censored observations are marked.]
The corresponding plot of the control group is as follows.
[Figure: Survival Function, control group. Vertical axis: Cum Survival; horizontal axis: survco; censored observations are marked.]
Section 4: Regression models
In section 2 several survival distributions were introduced for modeling the survival
experience of a homogeneous population. Usually, however, there are explanatory
variables upon which failure time may depend. It therefore becomes of interest to
consider generalizations of these models to take account of information of explanatory
variables.
Consider failure times $T_1, T_2, \ldots, T_n$ of $n$ individuals. For each individual $i$ we have values $x_{i1}, x_{i2}, \ldots, x_{ik}$ of $k$ explanatory variables. Note that the explanatory variables may include both quantitative variables and qualitative variables such as treatment group; the latter can be incorporated through the use of indicator variables. The principal problem dealt with in this section is that of modeling the relationship between the failure time $T$ and the explanatory variables.
The exponential distribution can be generalized to obtain a regression model by allowing the failure rate of $T_i$ to be a function of $x_{i1}, x_{i2}, \ldots, x_{ik}$. In regression models it is common practice that the dependent variable depends on the explanatory variables only through a linear function
(19)
$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik},$$
where $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown parameters. For the exponential distribution we have a constant hazard function $\lambda$. In a regression model for survival analysis one can try to model the dependence on the explanatory variables by taking the (new) hazard rate to be
(20)
$$\lambda = \lambda_0\cdot c(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik}).$$
The (new) hazard rate is taken to be a constant ($\lambda_0$) times some function $c$ of the linear function $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik}$. Hazard rates being positive, it is natural to choose the function $c$ such that $c(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik})$ is positive irrespective of the values of $x_{i1}, x_{i2}, \ldots, x_{ik}$. For this reason one often takes $c(\cdot) = \exp(\cdot)$; the hazard rate in a regression model is then modeled by
(21)
$$\lambda = \lambda_0\cdot\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik}).$$
In survival analysis accelerated failure time models are obtained by modeling the log failure time $Y = \ln(T)$ instead of the failure time itself. Let us explain what the assumption (21) about the hazard function means if we study the log failure time $Y$. We shall use the following fact of probability theory: if $T$ has the exponential distribution with parameter $\lambda$ then we can write $T = U/\lambda$ where $U$ is a random variable having an exponential distribution with parameter $\lambda = 1$. Note that the survival function of $U/\lambda$ is equal to
(22)
$$P(U/\lambda > t) = P(U > \lambda t) = \int_{\lambda t}^{\infty}\exp(-u)\,du = \exp(-\lambda t),$$
which is equal to (7), the survival function of an exponential distribution with parameter $\lambda$; hence indeed $U/\lambda$ and $T$ are identical with respect to their distribution.
Consequences of (21) are:
(23)
$$T = \frac{U}{\lambda_0\cdot\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik})}$$
(24)
$$Y = -\ln(\lambda_0) - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_k x_{ik} + \tilde U$$
with $\tilde U = \ln(U)$. For the log failure time $Y$ we almost get a traditional regression model. Note that the term $-\ln(\lambda_0) - \beta_0$ is the intercept (constant) of the regression model; this term can be estimated, whereas $-\ln(\lambda_0)$ and $-\beta_0$ cannot be estimated separately. The disturbance $\tilde U = \ln(U)$ does not have a normal distribution; instead one can say that $\exp(\tilde U)$ has the exponential distribution with parameter $\lambda = 1$. From (24) one can conclude that the effects of covariates (explanatory variables) act additively on $Y$. Recall that we started with (21): the covariates act multiplicatively on the hazard rate.
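The step from (23) to (24) can be checked numerically. The sketch below draws one value of $U$ and computes the log failure time in two ways; all parameter and covariate values, and the function names, are made-up illustrations.

```python
import math
import random

def failure_time(u, lam0, betas, x):
    """Failure time (23): T = U / (lam0 * exp(beta_0 + beta_1 x_1 + ...))."""
    linear = betas[0] + sum(b * xj for b, xj in zip(betas[1:], x))
    return u / (lam0 * math.exp(linear))

def log_failure_time(u, lam0, betas, x):
    """Log failure time (24): intercept minus covariate effects plus ln(U)."""
    linear = betas[0] + sum(b * xj for b, xj in zip(betas[1:], x))
    return -math.log(lam0) - linear + math.log(u)

# All numbers below are made-up illustrative values.
lam0 = 0.02
betas = [0.1, -0.3, 0.05]   # beta_0, beta_1, beta_2
x = [1.0, 4.0]              # x_i1, x_i2

rng = random.Random(1)
u = rng.expovariate(1.0)    # U: exponential with parameter 1

y_direct = math.log(failure_time(u, lam0, betas, x))
y_model = log_failure_time(u, lam0, betas, x)
print(y_direct, y_model)

# Raising x_i1 by one unit multiplies the hazard by exp(beta_1) and therefore
# multiplies the failure time by exp(-beta_1).
ratio = failure_time(u, lam0, betas, [2.0, 4.0]) / failure_time(u, lam0, betas, x)
print(ratio, math.exp(-betas[1]))
```

The second comparison shows the "accelerated failure time" reading: a covariate rescales the time axis by a constant factor.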
Let us now consider the Weibull distribution, hence a hazard function given by (8). For the Weibull distribution the analogue of (21) is:
(25)
$$\lambda(t) = p\lambda_0^p\, t^{p-1}\cdot\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik}) = p\left[\lambda_0\cdot\exp\left((\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik})/p\right)\right]^p\cdot t^{p-1},$$
as a matter of fact the baseline hazard rate $\lambda_0$ is replaced by
(26)
$$\lambda_0\cdot\exp\left((\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik})/p\right).$$
We again study the log failure time $Y = \ln(T)$. Using probability theory we can state the following: if $T$ has a Weibull distribution with parameters $\lambda$ and $p$ then we can write $T = U^{1/p}/\lambda$ where $U$ has the exponential distribution with parameter $\lambda = 1$. For proving this we show that the survival function of $U^{1/p}/\lambda$ equals (9), the survival function of the Weibull distribution:
(27)
$$P\left(\frac{U^{1/p}}{\lambda} > t\right) = P\left(U^{1/p} > \lambda t\right) = P\left(U > (\lambda t)^p\right) = \ldots = \exp\left(-(\lambda t)^p\right),$$
which equals expression (9). From (26) and $T = U^{1/p}/\lambda$ one can obtain:
(28)
$$T = \frac{U^{1/p}}{\lambda_0\cdot\exp\left((\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik})/p\right)}$$
(29)
$$Y = -\ln(\lambda_0) - \frac{\beta_0}{p} - \frac{\beta_1}{p}x_{i1} - \frac{\beta_2}{p}x_{i2} - \ldots - \frac{\beta_k}{p}x_{ik} + \tilde U$$
with $\tilde U = \ln\left(U^{1/p}\right) = \ln(U)/p$. This regression equation is a generalization of (24), as expected. Again the effects of the covariates act additively on the log failure time.
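The representation $T = U^{1/p}/\lambda$ also gives a simple way to simulate Weibull failure times. The following sketch, with assumed parameter values and a fixed random seed, compares the empirical survival fraction of simulated times with the Weibull survival function (9).

```python
import math
import random

lam, p = 0.05, 1.5        # illustrative Weibull parameters (assumed values)
rng = random.Random(0)    # fixed seed so the run is reproducible

# Draw U from the exponential distribution with parameter 1 and transform.
n = 100_000
samples = [rng.expovariate(1.0) ** (1.0 / p) / lam for _ in range(n)]

def empirical_survival(t):
    """Fraction of simulated failure times exceeding t."""
    return sum(1 for s in samples if s > t) / n

def weibull_survival(t):
    """Survival function (9)."""
    return math.exp(-((lam * t) ** p))

for t in (5.0, 20.0, 40.0):
    print(t, round(empirical_survival(t), 3), round(weibull_survival(t), 3))
```

The two columns should agree up to simulation noise, which is of the order of a few thousandths for this sample size.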
Section 5: The Proportional Hazards Model
The most famous proportional hazards model is the proportional hazards model of Cox. In the proportional hazards model of Cox independent failure times $T_1, T_2, \ldots, T_n$ are studied, whose distribution is described by a hazard function $\lambda(t)$ given by
(30)
$$\lambda(t) = \lambda_0(t)\cdot\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik})$$
where $\lambda_0(t)$ is an arbitrary unspecified base-line hazard function which specifies a continuous distribution for a failure time. Special cases are $\lambda_0(t) = \lambda$ (exponential distribution) and $\lambda_0(t) = p\lambda_0^p\, t^{p-1}$ (Weibull distribution), but one of the most important features of the model of Cox is that no parametric model is made for the base-line hazard function $\lambda_0(t)$.
An empirical check for the proportional hazards model is provided by plots of the estimate of $\ln[-\ln S(t)]$ versus $\ln(t)$. In case of the Weibull distribution ($\lambda_0(t) = p\lambda_0^p\, t^{p-1}$) the graphs of the estimates of $\ln[-\ln S(t)]$ versus $\ln(t)$ should show parallel straight lines for separate subgroups. In general these graphs should show the same curvature for different subgroups (fixed vertical distances).
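The "fixed vertical distances" follow from (30): for two subgroups that differ only in one covariate, the curves of $\ln[-\ln S(t)]$ differ by the constant $\beta$ times the covariate difference, whatever the baseline hazard. A small sketch with a deliberately non-Weibull baseline; the baseline and the coefficient are assumptions chosen for illustration.

```python
import math

def baseline_cumulative_hazard(t):
    """Integral of an illustrative baseline hazard 0.01 + 0.004 t (assumed)."""
    return 0.01 * t + 0.002 * t * t

def survival(t, beta, x):
    """S(t) under model (30) with one covariate: exp(-Lambda_0(t) * exp(beta x))."""
    return math.exp(-baseline_cumulative_hazard(t) * math.exp(beta * x))

def log_minus_log(s):
    """The transformation ln(-ln s) used in the diagnostic plot."""
    return math.log(-math.log(s))

beta = -0.25  # illustrative coefficient (assumed value)
times = [1.0, 5.0, 10.0, 30.0]

# Vertical distance between the curves of subgroup x = 1 and subgroup x = 0.
gaps = [log_minus_log(survival(t, beta, 1)) - log_minus_log(survival(t, beta, 0))
        for t in times]
print(gaps)
```

Every entry of `gaps` equals `beta` up to floating point error, even though neither curve is a straight line in $\ln(t)$.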
Section 6: Modeling and testing in the Proportional Hazards Model
SPSS (like other programs) provides estimates and standard errors for, e.g., the parameters $\beta_j$ in the proportional hazards model of Cox. To apply 'Cox regression' in SPSS one first chooses Analyze, then Survival and finally Cox Regression.
For demonstrating testing theory we apply the proportional hazards model to the data of
section 3. In SPSS we have to fill the data matrix in the following way. One column
contains the 49 survival times. A second column indicates whether the survival times are
censored (0) or not (1), hence this column contains only the numbers 0 and 1. A third
column is indicating to which group each survival time belongs, we used the number 0
for control and the number 1 for the treatment group, so this column contains only
numbers 0 and 1 as well. We gave the columns (variables) names: survival, censor and
treatment. For producing the relevant output one has to choose survival for time, censor
for status (define event by single value 1) and treatment as covariate. Use furthermore
‘method: enter’.
We now shall investigate whether the groups really differ with respect to survival time. We apply a proportional hazards model with covariate (explanatory variable) treatment. Doing so, we assume that the hazard function for an individual is given by
(31)
$$\lambda(t) = \lambda_0(t)\cdot\exp(\beta_0 + \beta_1 x_{i1}) = \lambda_0(t)\exp(\beta_0)\cdot\exp(\beta_1 x_{i1})$$
with the $x_{i1}$ being the values of the variable treatment, $x_1 = \text{treatment}$. For investigating whether there is really a difference between the two groups, that is, whether there is really a treatment effect, we test the null hypothesis $H_0: \beta_1 = 0$ against the alternative hypothesis $H_1: \beta_1 \ne 0$. One has to take $T = \hat\beta_1 / S(\hat\beta_1)$ as test statistic, with $\hat\beta_1$ being the estimate of $\beta_1$ and $S(\hat\beta_1)$ being the corresponding standard error. The distribution of the test statistic is approximated by the standard normal distribution under the null hypothesis. The null hypothesis is rejected if $T \le -c$ or $T \ge c$. Using SPSS a part of the output is the following:
            B        SE      Wald    df   Sig.    Exp(B)
treatment   -0.253   0.430   0.345   1    0.557   0.777
Because the factor $\exp(\beta_0)$ can be absorbed by the baseline hazard function $\lambda_0(t)$, no estimate for $\beta_0$ is given. From the output we see $\hat\beta_1 = -0.253$ and $S(\hat\beta_1) = 0.430$, hence the outcome of $T = \hat\beta_1 / S(\hat\beta_1)$ is $-0.588$. Taking significance level 5% we reject $H_0: \beta_1 = 0$ if $T \le -1.96$ or $T \ge 1.96$, equivalently if $|T| \ge 1.96$. Instead of $T$ SPSS presents a Wald statistic equal to $T^2$. This Wald statistic has a chi-square distribution with one degree of freedom under the null hypothesis: significance level 5% means that the null hypothesis is rejected if the Wald statistic $T^2 \ge 3.84$; this renders an equivalent test. For the data of section 3 we don't have to reject the null hypothesis: no treatment effect can be proved (at significance level 5%).
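The numbers in the SPSS output can be recomputed from B and SE alone. The sketch below uses only the Python standard library; the helper name `two_sided_p_value` is illustrative. Because B and SE are rounded to three decimals in the output, the recomputed Wald statistic and p-value may differ from the SPSS values in the last digit.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

b, se = -0.253, 0.430        # B and SE of treatment from the SPSS output
t_stat = b / se              # test statistic T
wald = t_stat ** 2           # Wald statistic T^2
p_value = two_sided_p_value(t_stat)
hazard_ratio = math.exp(b)   # the Exp(B) column

print(f"T = {t_stat:.3f}, Wald = {wald:.3f}, p = {p_value:.3f}, exp(B) = {hazard_ratio:.3f}")

reject = abs(t_stat) >= 1.96
print("reject H0 at 5%:", reject)
```

The value of `hazard_ratio` estimates the ratio of the hazard rate in the treatment group to that in the control group.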
For the assignment of this part of the course Medical Statistics and Epidemiology a data set has to be studied; we call this data set the breastfeeding data. The data are contained in the SPSS system file breastfeeding.sav. The breastfeeding data concern 925 first-born children whose mothers chose to breastfeed. The following variables are recorded:
Duration    duration of breast feeding (weeks)
Censoring   1 for completed breast feeding, 0 for censored (still breast feeding)
Race        race of mother (1 = white, 2 = black, 3 = other)
Poverty     mother in poverty (1 = yes, 0 = no)
Smoking     mother smoked at birth of child (1 = yes, 0 = no)
Alcohol     mother used alcohol at birth of child (1 = yes, 0 = no)
Age         age of mother at birth of child
Birth       year of birth of child
School      years of school (education level)
Prenatal    prenatal care after 3rd month (1 = yes, 0 = no)
For a first preparation for the assignment we study how Duration depends on the covariate Birth. Inspection of the data set reveals that the outcome of the covariate Birth ranges from 78 (representing 1978) to 86. For this type of covariate it is not useful to assume a hazard rate of the form (31) with the $x_{i1}$ now being the values of the covariate Birth. The covariate Birth is rather a categorical variable. Perhaps the duration of breastfeeding is different for babies from different years of birth. Differences from year to year may be modeled by assigning effects to the levels (outcomes) of the covariate Birth. This can be done by introducing indicator variables. In case of the breastfeeding data we introduce indicator variables $x_1, x_2, \ldots, x_8$ for the covariate Birth defined as follows:
$x_1 = 1$ if the outcome of Birth is 78 and $x_1 = 0$ otherwise, $x_2 = 1$ if the outcome of Birth is 79 and $x_2 = 0$ otherwise, …, $x_8 = 1$ if the outcome of Birth is 85 and $x_8 = 0$ otherwise.
Using these indicator variables the hazard rate, and thus our model, becomes
(32)
$$\lambda(t) = \lambda_0(t)\cdot\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_8 x_{i8})$$
where $x_{i1}, x_{i2}, \ldots, x_{i8}$ are the values of the indicator variables of individual $i$. According to formula (32) the ratio of the hazard rates of the years of birth 78 and 86 is equal to $\exp(\beta_1)$. In the same way the ratio of the hazard rates of the years of birth 79 and 86 is equal to $\exp(\beta_2)$. So for the respective outcomes of the covariate we now have (possibly) different (multiplicative) effects, and here the year of birth 86 serves as reference value. Instead of the last value, the first value may be chosen as reference value (category) in SPSS as well. Using SPSS the following output is obtained.
           B        SE      Wald     df   Sig.    Exp(B)
birth                       18.411   8    0.018
birth(1)   -0.799   0.433   3.406    1    0.065   0.450
birth(2)   -0.722   0.424   2.901    1    0.089   0.486
birth(3)   -0.947   0.423   5.019    1    0.025   0.388
birth(4)   -0.707   0.419   2.853    1    0.091   0.493
birth(5)   -0.703   0.420   2.800    1    0.094   0.495
birth(6)   -0.666   0.419   2.529    1    0.112   0.514
birth(7)   -0.715   0.421   2.890    1    0.089   0.489
birth(8)   -0.402   0.421   0.911    1    0.340   0.669
In the first column of the table the parameters $\beta_j$ are indicated by birth(1), birth(2), …, birth(8), respectively. Testing the null hypothesis $H_0: \beta_j = 0$ against the alternative hypothesis $H_1: \beta_j \ne 0$ for each indicator variable $x_j$, no null hypothesis needs to be rejected at significance level 5% except one (verify this). It is however not useful to test $H_0: \beta_j = 0$ separately for each indicator variable of the covariate Birth.
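The indicator coding and the "verify this" step can be sketched as follows. The function `birth_indicators` is a hypothetical helper that mimics the coding with 86 as reference category (SPSS creates the indicators itself), and the B and SE values are copied from the output table. Recomputed Wald values may differ from the table in the last digit because of rounding.

```python
import math

def birth_indicators(birth, reference=86):
    """Indicator values (x_1, ..., x_8) for birth years 78..86 minus the reference."""
    levels = [year for year in range(78, 87) if year != reference]
    return [1 if birth == level else 0 for level in levels]

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# (B, SE) per indicator, copied from the SPSS output.
estimates = {
    "birth(1)": (-0.799, 0.433),
    "birth(2)": (-0.722, 0.424),
    "birth(3)": (-0.947, 0.423),
    "birth(4)": (-0.707, 0.419),
    "birth(5)": (-0.703, 0.420),
    "birth(6)": (-0.666, 0.419),
    "birth(7)": (-0.715, 0.421),
    "birth(8)": (-0.402, 0.421),
}

significant = []
for name, (b, se) in estimates.items():
    z = b / se
    p_value = two_sided_p_value(z)
    print(f"{name}: Wald = {z * z:.3f}, p = {p_value:.3f}, exp(B) = {math.exp(b):.3f}")
    if p_value < 0.05:
        significant.append(name)

print(birth_indicators(78), birth_indicators(86))
print("significant at 5%:", significant)
```

A child born in the reference year has all indicators equal to 0, so each `exp(B)` is a hazard ratio relative to the year of birth 86.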
Instead one can test whether all effects $\beta_1, \beta_2, \ldots, \beta_8$ of the covariate Birth are zero. Let us test the null hypothesis $H_0: (\beta_1, \beta_2, \ldots, \beta_8) = 0$ against the alternative hypothesis $H_1: (\beta_1, \beta_2, \ldots, \beta_8) \ne 0$. SPSS presents (the outcome of) a Wald test statistic; under the null hypothesis its distribution is the chi-square distribution with 8 degrees of freedom (df), and one has to reject the null hypothesis for large values of the test statistic. The number of degrees of freedom depends on the number of indicator variables and may be different for different data sets. In our testing problem we have to reject $H_0$ if Wald $\ge 15.5$ (chi-square with df $= 8$, level of significance 5%). Since the outcome of Wald is 18.411 we here reject the null hypothesis. We conclude that the (distribution of) Duration depends on the covariate Birth.
We don't give an explicit formula for the Wald statistic with (here) 8 degrees of freedom. We just indicate how to work with it.
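Working with the overall Wald statistic only requires the tail probability of a chi-square distribution. As a sketch, for an even number of degrees of freedom this probability has a closed form that needs nothing beyond the standard library; the function name is illustrative, and for odd df one would need a statistics library instead.

```python
import math

def chi_square_sf_even_df(x, df):
    """P(X >= x) for a chi-square variable with an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!, valid for even df only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(df // 2))

wald, df = 18.411, 8   # overall Wald statistic and df for Birth from the SPSS output
p_value = chi_square_sf_even_df(wald, df)
print(f"p = {p_value:.3f}")

# Tail probability at the 5% critical value for df = 8 (about 15.5).
p_at_critical = chi_square_sf_even_df(15.507, df)
print(f"{p_at_critical:.3f}")
```

The first value corresponds to the Sig. entry in the row birth of the output; it is below 0.05, in line with the rejection of the null hypothesis.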
Section 7: Assignment
Before you start: consult the text of the file About SPSS.doc. Use SPSS when you do parts A and B of this assignment.
You don't have to write a report for this assignment. Just take your computer output and your own notes with you for an (oral) discussion with the teacher about the answers to parts A and B. To make an appointment for the assignment, send an e-mail (k.poortema@ewi.utwente.nl) or ring (053)4893379.
Part A
Select a number of subgroups and study plots of the Kaplan Meier estimate of the
Survival function in order to answer the following two questions. Instead of plots of the
Survival function you may use plots of the hazard function or a function of the Survival
function.
(1) Is the distribution of the duration of breastfeeding modeled well by means of a
Weibull distribution or an exponential distribution for the subgroups chosen?
(2) Does a proportional hazards model fit the data?
Part B
Now assume that the proportional hazards model of Cox holds for the breastfeeding data.
Investigate within this model on which covariates the (dependent) variable Duration
really depends. For the covariates Race, Age and School (and Birth) decide whether you
take the covariate as categorical covariate (if so: go to Categorical … , etc. within the
Cox Regression menu). In case of categorical covariates you don’t need to worry about
the indicator variables described in section 6: SPSS introduces these indicator variables
automatically.
Important aspects:
Try to explain the dependent variable Duration as well as possible but refrain from including covariates (predictor variables) that seem to be superfluous.
Follow a clear strategy in order to select the covariates. Use statistical tests.