Medical Statistics & Epidemiology (153133) Survival Analysis (week 02) Hazard/failure rates, Cox regression October 2008 (K.P.) CONTENTS 1 2 3 4 5 6 7 Survival function and hazard function Distributions for failure rate Kaplan Meier estimator Regression models Proportional hazards model Modeling and testing in the proportional hazards model Assignment 1 Section 1: Survival function and hazard function This part of the course Medical Statistics and Epidemiology deals with the modeling and analysis of data that have as a principal end point the time until an event occurs. Such events are generically referred to as failures though the event may, for instance, be the performance of a certain task in a learning experiment in psychology or a change of residence in a demographic study as well. Major areas of application, however, are medical studies on chronic diseases and industrial life testing. We assume that observations are available on the independent failure time of n individuals. Let T be the nonnegative random variable representing the failure time of an arbitrary individual. We assume that the probability distribution of T is described by a density function f (t ) . We shall introduce the Survival function S (t ) and the hazard function (t ) which characterize the distribution of T as well. The survival function S (t ) is defined by (1) S (t ) P(T t ) and is equal to 1 F (t ) , where F (t ) is the cumulative distribution function of T . (Note P( X t ) 0 for each number t in case of a density function.) Since the cumulative distribution function F (t ) specifies the distribution of T , the distribution of T is specified as well by the survival function S (t ) 1 F (t ) . The hazard function (t ) specifies the instantaneous rate of failure at T t conditional upon survival to time t and is defined by the limit for 0 of the following ratio: (2) P(t T t | T t ) P(t T t ) S (t ) S (t ) 1 P(T t ) S (t ) Taking this limit we get (3) (t ) f (t ) . S (t ) Note that the derivative of the survival function S (t ) is equal to f (t ) . The distribution of T is specified by its hazard function as well because the survivor function is determined by the hazard function: (4) d f (t ) ln S (t ) (t ) dt S (t ) 2 t (5) ln S (t ) (u )du (Note: S (0) 1 ) 0 ( 6) t S (t ) exp (u )du 0 Section 2: Distributions for failure rate In this section we present a number of models for the distribution of T . The one parameter exponential distribution is obtained for T by taking the hazard function to be a constant: (t ) (with 0) , hence t S (t ) exp du exp( t ) 0 F (t ) 1 exp( t ) and f (t ) exp( t ) follow rather easily. So for the exponential distribution the instantaneous failure rate is independent of t so that the conditional chance of failure does not depend on how long the individual has been on trial. This is referred to as the memoryless property of the exponential distribution. An empirical check of the exponential distribution for a set of survival data is provided by plotting the log of the survival function estimate versus t . Such a plot should approximate a straight line through the origin as can be concluded from (7). (7 ) An important generalization of the exponential distribution allows for a power dependence of the hazard function on time. This yields the two parameter Weibull distribution with hazard function (8) (t ) p(t ) p 1 pp t p 1 . This hazard function is monotone decreasing for p 1 , is monotone increasing for p 1 and reduces to a constant if p 1 . For the Weibull distribution we get (9) t S (t ) exp p pu p 1du exp (t ) p 0 (10) ln ln( S (t ) pln( t ) ln( ) An empirical check for the Weibull distribution is provided by a plot of the estimate of ln ln( S (t ) versus ln( t ) . The plot should give approximately a straight line. In general the distribution of a failure or survival time is skew. Skew distributions may be modeled by means of a lognormal distribution or a gamma distribution as well. If T has a lognormal distribution then this means that Y ln( T ) has a normal distribution, described 3 by expectation and a variance 2 . The gamma distribution may be regarded as another generalization of the exponential distribution, its density function is (11) f (t ) (t ) 1 et , ( ) where (k ) is the well-known gamma function: (12) ( ) x 1 exp( x)dx ( 0 ). 0 For 1 the density (11) reduces to the density of the exponential distribution, note (1) 1 . Section 3: Kaplan Meier Estimator Survival analysis is concerned with studying the time between entry to a study and a subsequent event. Originally the analysis was concerned with time until death, hence the name, but survival analysis is applicable to many areas as well as mortality. A common feature of survival data is censoring, it means that the exact failure times of a number of subjects are not known. There are several reasons for censoring, to mention a few: Some patients might have left the study early, they are lost to follow up. Examples: emigration, fatal accidents in traffic (competing risk) Study ends when a fixed time is reached (right censoring of type I) Study ens when a fixed number of failures occur (right censoring of type II) In these examples there is right censoring, which means some failure times are not known. For these unknown failure times one only knows that the failure time exceeds some known value, the so-called censoring time. In this text we assume that the process of censoring is independent of the process we want to study. Furthermore we only consider right censoring. We study survival of 49 patients with Dukes’C colorectal cancer. The survival times (months) of two treatment groups are as follows. 3+ 6 6 6 6 8 8 12 12 12+ 15+ 16+ Control ( n 24 ) 18+ 18+ 20 22+ 24 28+ 28+ 28+ 30 30+ 33+ 42 Treatment ( linoleic acid , n 25) 1+ 13+ 5+ 15+ 6 16+ 6 20+ 9+ 24 10 24+ 10 27+ 10+ 32 12 34+ 12 36+ 12 36+ 12 44+ 12+ 4 Here + means censoring. The first entry (3+) of the control group means that the patient left the study after 3 months survival. So the corresponding survival time is known to be more than 3 months, 3 (months) is the censoring time of the patient. For the treatment group we shall estimate the survivor function S (t ) P(T t ) . For estimation of S (t ) we use the Kaplan Meier estimator, also called the product limit estimator. Suppose that the survival times, including censored observations, of a homogeneous group of n patients are represented by t1, t2 ,..., tn . We assume that the survival times (patients) are already ordered such that t1 t2 ... tn . For a given value t find the largest value ti such that ti t , the probability S (t ) is then estimated by (13) r d r d2 r di Sˆ (t ) 1 1 2 ... i r1 r2 ri where rk is the number of subjects alive just before time t k (the kth ordered survival time) and d k denotes the number who died at time t k . Let us determine the estimates Sˆ (t ) for the treatment group. The estimate of formule (13) is the product of factors (rk d k ) / rk . For censored observations 1+ and 5+ these factors equal 1. We get Sˆ (t ) 1 for t 6 . Following the receipt of the Kaplan Meier estimator we obtain (14) 23 2 Sˆ (t ) 0.9130 23 for 6 t 10 (15) 23 2 20 2 21 18 Sˆ (t ) 0.8217 23 20 23 20 for 10 t 12 (16) 21 18 17 4 Sˆ (t ) 0.6284 23 20 17 for 12 t 24 (17) 21 18 13 8 1 Sˆ (t ) 0.5498 23 20 17 8 for 24 t 32 (18) 21 18 13 7 5 1 Sˆ (t ) 0.4398 23 20 17 8 5 for t 32 The Kaplan Meier estimate Sˆ (t ) is just a step function, this step function only changes for survival times t k with a positive d k , for the treatment group these times are 6, 10, 12, 24 and 32. Each factor in the Kaplan Meier estimator represents 1 minus an estimated hazard rate. For survival time t k we consider the number of people still alive, sometimes 5 called the number at risk, this is the number rk . Then the condional probability of surviving is estimated in a straightforward way by (rk d k ) / rk . Next plot is a plot of the estimate of the survival function of the treatment group. Survival Function Survival Function Censored 1.0 Cum Survival 0.8 0.6 0.4 0.2 0.0 0 10 20 30 40 50 survival The corresponding plot of the control group is as follows. Survival Function Survival Function Censored 1.0 Cum Survival 0.8 0.6 0.4 0.2 0.0 0 10 20 30 40 survco 6 Section 4: Regression models In section 2 several survival distributions were introduced for modeling the survival experience of a homogeneous population. Usually, however, there are explanatory variables upon which failure time may depend. It therefore becomes of interest to consider generalizations of these models to take account of information of explanatory variables. Consider failure times T1 , T2 ,...,Tn of n individuals. For each individual i we have values xi1 , xi 2 ,..., xik of k explanatory variables. Note that the explanatory may include both quantitative variables and qualitative variables such as treatment group, the latter can be incorporated through the use of indicator variables. The principal problem dealt with in this section is that of modeling the relationship between the failure time T and explanatory variables. The exponential distribution can be generalized to obtain a regression model by allowing the failure rate of Ti to be function of xi1 , xi 2 ,..., xik . In regression models it is common practice that the dependent variable depends on the explanatory variables only through a linear function (19) 0 1xi1 2 xi 2 ... k xik , where 0 , 1 , 2 ,..., k are unknown parameters. For the exponential distribution we have a constant hazard function . In a regression model for survival analysis one can try to model the dependence on the explanatory by taking the (new) hazard rate to be ( 20) 0 c(0 1xi1 2 xi 2 ... k xik ) , The (new) hazard rate is taken to be a constant ( 0 ) times some function c of the linear function 0 1 xi1 2 xi 2 ... k xik . Hazard rates being positive it is natural to choose the function c such that c(0 1xi1 2 xi 2 ... k xik ) is positive irrespective the values of xi1 , xi 2 ,..., xik . For this reason one often takes c(.) exp(.) , the hazard rate in a regression model is then modeled by ( 21) 0 exp( 0 1xi1 2 xi 2 ... k xik ) . In survival analysis accelerated failure time models are obtained by modeling the log failure time Y ln( T ) instead of the failure time itself. Let us explain what the assumption (21) about the hazard function means if we study the log failure time Y . We shall use the following fact of probability theory: if T has the exponential distribution with parameter then we can write T U / where U is a random variable having an exponential distribution with parameter 1 . Note that the survival function of U / is equal to 7 P(U / t ) P(U t ) exp( u )du exp( t ) , ( 22) t which is equal to (7), the survival function of an exponential distribution with parameter , hence indeed U / and T are identical with respect to their distribution. Consequences of (21) are: U 0 exp( 0 1 xi1 2 xi 2 ... k xik ) ( 23) T ( 24) ~ Y ln( 0 ) 0 1xi1 2 xi 2 ... k xik U ~ with U ln( U ) . For log failure time Y we almost get a traditional regression model. Note that the term ln( 0 ) 0 is the intercept (constant) of the regression model, this term can be estimated whereas both ln( 0 ) and 0 cannot be estimated. The ~ disturbance U ln( U ) does not have a normal distribution, instead one can say that ~ exp(U ) has the exponential distribution with parameter 1 . From (24) one can conclude that the effects of covariates (explanatory variables) act additively on Y . Remind we started with (21): the covariates act multiplicatively on the hazard rate. Let us now consider the Weibull distribution, hence a hazard function given by (8). For the Weibull distribution the analogue of (21) is: p0p t p 1 exp( 0 1 xi1 2 xi 2 ... k xik ) ( 25) p0 exp(( 0 1xi1 2 xi 2 ... k xik ) / p) t p 1 , p as a matter of fact the baseline hazard rate 0 is replaced by 0 exp(( 0 1xi1 2 xi 2 ... k xik ) / p) . ( 26) We again study the log failure time Y ln( T ) . Using probability theory we can state the following: if T has a Weibull distribution with parameters and p then we can write / where U has the exponential distribution with parameter 1 . For proving 1 this we show that the survival function of U p / equals (9), the survival function of the Weibull distribution: T U 1 p 8 1 (27) U p 1 P t P U p t P U (t ) p ... exp (t ) p , which equals expression (9). From (26) and T U 1 p / one can obtain: 1 (28) U p T 0 exp(( 0 1 xi1 2 xi 2 ... k xik ) / p) ( 29) Y ln( 0 ) 0 p 1 p xi1 2 p xi 2 ... k p ~ xik U 1 ~ with U ln U p ln( U ) / p . This regression equation is a generalization of (25), as expected. Again the effects of the covariates act additively on the log failure time. Section 5: The Proportional Hazards Model The most famous proportional hazard model is the proportional hazards model of Cox. In the proportional hazards model of Cox independent failure times T1 , T2 ,...,Tn are studied, which distribution is described by a hazard function (t ) given by (30) (t ) 0 (t ) exp( 0 1xi1 2 xi 2 ... k xik ) where 0 (t ) is an arbitrary unspecified base-line hazard function which specifies a continuous distribution for a failure rate. Special cases are 0 (t ) (exponential distribution) and 0 (t ) p0pt p 1 (Weibull distribution) but one of the most important features of the model of Cox is that no parametric model is made for the base-line hazard function 0 (t ) . An empirical check for the proportional hazards model is provided by plots of the estimate of ln ln( S (t ) versus ln( t ) . In case of the Weibull distribution ( 0 (t ) p0pt p 1 ) the graphs of the estimates ln ln( S (t ) versus ln( t ) should show parallel straight lines for separate subgroups. In general these graphs should show the same curvature for different subgroups (fixed vertical distances). Section 6: Modeling and testing in the Proportional Hazards Model SPSS (like other programs) provides estimates and standard errors for e.g. the parameters j in the proportional hazards model of Cox. For application of ‘Cox regression’ one has to choose in SPSS first Analyse, then Survival and finally Cox Regression. 9 For demonstrating testing theory we apply the proportional hazards model to the data of section 3. In SPSS we have to fill the data matrix in the following way. One column contains the 49 survival times. A second column indicates whether the survival times are censored ( 0 ) or not (1) , hence this column contains only numbers 0 and 1. A third column is indicating to which group each survival time belongs, we used the number 0 for control and the number 1 for the treatment group, so this column contains only numbers 0 and 1 as well. We gave the columns (variables) names: survival, censor and treatment. For producing the relevant output one has to choose survival for time, censor for status (define event by single value 1) and treatment as covariate. Use furthermore ‘method: enter’. We now shall investigate whether the groups really differ with respect to survival time. We apply a proportional hazards model with covariate (explanatory variable) treatment. Doing so, we assume that the hazard function for an individual is given by (t ) 0 (t ) exp( 0 1xi1 ) 0 (t ) exp( 0 ) exp( 1xi1 ) (31) with the xi1 being the values of the variable treatment, x1 treatment . For investigating whether there is really a difference between the two groups or whether there is really a treatment effect, we test the null hypothesis H 0 : 1 0 against the alternative hypothesis H : 0 . One has to take T ˆ / S ( ˆ ) as testing statistic with ˆ being the estimate 1 1 1 1 1 of 1 and S ( ˆ1 ) being the corresponding standard error. The distribution of the testing statistic is approximated by the standard normal distribution under the null hypothesis. The null hypothesis is rejected if T c or T c . Using SPSS a part of the output is the following: treatment B 0.253 SE 0.430 Wald 0.345 df 1 Sig. 0.557 Exp(B) 0.777 Because the factor exp( 0 ) can be absorbed by the baseline hazard function 0 (t ) , no estimate for is given. From the output we see ˆ 0.253 and S ( ˆ ) 0.430 , hence 1 0 1 the outcome of T ˆ1 / S ( ˆ1 ) is 0.588 . Taking significance level 5% we reject H 0 : 1 0 if T 1.96 or T 1.96 , equivalently if | T | 1.96 . In stead of T SPSS presents a Wald Statistic being equal to T 2 . This Wald statistic has a chi-square distribution with one degrees of freedom under the null hypothesis: significance level 5% means that the null hypothesis is rejected if the Wald Statistic T 2 3.84 , this renders a equivalent test. For the data of section 3 we don’t have to reject the null hypothesis, no treatment effect can be proved (at significance level 5%). For the assignment of this part of the course Medical Statistics and Epidemiology a data set has to be studied, we call this data set the breastfeeding data. The data is contained by 10 the SPSS system file breastfeeding.sav . The breastfeeding data concerns data on 925 first-born children whose mothers chose for breast feeding. The following variables are recorded: Duration Censoring race Poverty Smoking Alcohol Age Birth School Prenatal duration of breast feeding (weeks) 1 for completed breast feeding, 0 for censored (still breast feeding) race of mother ( 1 white, 2 black, 3 other) mother in poverty ( 1 yes, 0 no) mother smoked at birth of child ( 1 yes, 0 no) mother used alcohol at birth of child ( 1 yes, 0 no) age of mother at birth of child year of birth of child years of school (education level) Prenatal care after 3rd month ( 1 yes, 0 no) For a first preparation for the assignment we study how Duration depends on the covariate Birth. Inspection of the data set reveals that the outcome of the covariate Birth ranges from 78 (representing 1978) to 86. For this type of covariate it is not useful to assume a hazard rate of the form (31) with the xi1 being now the values of the covariate Birth. The covariate Birth is rather a categorical variable. Perhaps the duration of breastfeeding is different for babies from different years of birth. Differences from year to year may be modeled by assigning effects to the levels (outcomes) of the covariate Birth. This can be done by introducing indicators variables. In case of the breastfeeding data we introduce indicator variables x1 , x2 ,..., x8 for the covariate Birth defined as follows: x1 1 if the outcome of birth is 78 and x1 0 elsewhere, x2 1 if the outcome of birth is 79 and x2 0 elsewhere,…, x8 1 if the outcome of birth is 85 and x8 0 elsewhere. Using these indicator variables the hazard rate, and thus our model, becomes (32) (t ) 0 (t ) exp( 0 1xi1 2 xi 2 ... 8 xi8 ) where xi1 , xi 2 ,..., xi8 are the values of the indicator variables of individual i . According to formula (32) the ratio of the hazards rates of the years of birth 78 and 86 is equal to exp( 1 ) . In the same way the ratio of the hazards rates of the years of birth 79 and 86 is equal to exp( 2 ) . So for the respective outcomes of the covariate we now have (possibly) different (multiplicative) effects and here the year of birth 86 serves as reference value. Instead of the last value the first value may be chosen as well as reference value (category) in SPSS. Using SPSS the following output is obtained. 11 birth birth(1) birth(2) birth(3) birth(4) birth(5) birth(6) birth(7) birth(8) B SE -0.799 -0.722 -0.947 -0.707 -0.703 -0.666 -0.715 -0.402 0.433 0.424 0.423 0.419 0.420 0.419 0.421 0.421 Wald 18.411 3.406 2.901 5.019 2.853 2.800 2.529 2.890 0.911 df 8 1 1 1 1 1 1 1 1 Sig. 0.018 0.065 0.089 0.025 0.091 0.094 0.112 0.089 0.340 Exp(B) 0.450 0.486 0.388 0.493 0.495 0.514 0.489 0.669 In the first column of the table the parameters j are indicated by respectively birth(1), birth(2),…,birth(8). Testing the null hypothesis H 0 : j 0 against alternative hypothesis H1 : j 0 for each indicator variable x j no null hypothesis needs to be rejected at significance level 5% except one (verify this). It is however not useful to test H 0 : j 0 for each indicator variable of the covariate Birth. Instead one can test whether all effects 1 , 2 ,..., 8 of the covariate Birth are zero. Let us test the null hypothesis H 0 : ( 1 , 2 ,..., 8 ) 0 against the alternative hypothesis H1 : (1 , 2 ,..., 8 ) 0 . SPSS presents (the outcome of) a Wald testing statistic, under the null hypothesis its distribution is the chi square distribution with 8 degrees of freedom (df) and one has to reject the null hypothesis for large values of the testing statistics. The degrees of freedom depends on the number of indicator variables, may be different for different data sets. In our testing problem we have to reject H 0 if Wald 15.5 (chi square with df 8 , level of significance 5%). Since the outcome of Wald is 18.411 we here reject the null hypothesis. We conclude that the (distribution of) Duration depends on the covariate Birth. We don’t give a explicit formula for the Wald statistic with (here) 8 degrees of freedom. We just indicate how to work with it. Section 7: Assignment Before you start: consult the text of the file About SPSS.doc. Use SPSS when you do parts A en B of this assignment. You don’t have to write a report for this assignment. Just take your computer output and your own notes with you for an (oral) discussion with the teacher about the answers of parts A and B. For making an appointment for the assignment: send an e-mail (k.poortema@ewi.utwente.nl) or ring (053)4893379 . 12 Part A Select a number of subgroups and study plots of the Kaplan Meier estimate of the Survival function in order to answer the following two questions. Instead of plots of the Survival function you may use plots of the hazard function or a function of the Survival function. (1) Is the distribution of the duration of breastfeeding modeled well by means of a Weibull distribution or an exponential distribution for the subgroups chosen? (2) Does a proportional hazards model fit the data? Part B Now assume that the proportional hazards model of Cox holds for the breastfeeding data. Investigate within this model on which covariates the (dependent) variable Duration really depends. For the covariates Race, Age and School (and Birth) decide whether you take the covariate as categorical covariate (if so: go to Categorical … , etc. within the Cox Regression menu). In case of categorical covariates you don’t need to worry about the indicator variables described in section 6: SPSS introduces these indicator variables automatically. Important aspects: Try to explain the dependent variable Duration as good as possible but refrain from including covariates (predictor variables) that seem to be superfluous. Follow a clear strategy in order to select the covariates. Use statistical tests. 13