DURATION (SURVIVAL) MODELS FOR TIME TO EVENT DATA INTRODUCTION A relatively new area in econometrics is the analysis of duration data (also called time to event data). The econometrics literature on the analysis of duration data draws heavily from statistical methods that have been developed by industrial engineers and biomedical researchers, who use these methods to analyze such phenomena as the useful lives of machines and survival times of patients after a particular type of operation. Dependent Variable In duration analysis, the dependent variable being studied is a duration. Duration is defined as: i) ii) The amount of time that elapses until some event occurs, or The amount of time that elapses until measurement is taken before the event actually occurs. Duration is often called time to event (e.g., time to death, time to machine failure, time to employment). If an observed duration corresponds to i, it is said to be uncensored. If the observed duration corresponds to ii, it is said to be censored. The following points should be noted about a duration variable: 1. A duration variable is always measured in units of time (e.g. minutes, days, weeks, months). 2. A duration variable must be non-negative (you can’t have a negative time to event). Censoring It is usually the case that some of the observations on a duration variable are censored. An observation is said to be censored when it is measured from the beginning of the period of interest until some point before the event takes place. For example, suppose that the duration variable is time to death after a heart transplant. Suppose that this variable is measured for a sample of 30 persons. Suppose that when the measurement is taken 20 of these individuals have died, but 10 are still alive. The 10 observations for the individuals still alive are censored observations. Duration Data No Censoring Let the variable duration be denoted by T. Duration is a random variable that measures time to event. Because T is a random variable, its behavior can be described by a probability distribution (T). Let t1, t2, …tn be a random sample of n-observations on the random variable T. The sample will usually consist of a cross-section of n times to event (durations) on individuals, firms, machines, etc. Censoring Let T* be the value of duration in the absence of censoring. Let T be the observed value of duration. Let c be the value of duration when it is censored at time c. The observed value of duration is given by T = T* if T* < c T=c if T* c The censoring time, c, can either be a known constant or a random variable. If c is a random variable, then it must be independent of T*. To indicate whether an observation is censored, a censor status variable is usually created. This variable is an indicator variable that takes a value of 1 if the observation is not censored and 0 if the observation is censored. Approaches to Analyzing Duration Data Three alternative approaches can be used to analyze duration data. These are: 1. Parametric approach 2. Semiparametric approach 3. Nonparametric approach The parametric approach makes assumptions about the probability distribution of T. This allows you to analyze duration data using regression models or regression-like models. The semiparametric approach makes only minimal assumptions about the probability distribution of T. The nonparametric approach makes no assumptions about the probability distribution of T. PARAMETRIC APPROACH There are two major types of parametric models of duration. These are: 1. Regression models 2. Regression-like models Regression Models A regression model is the appropriate model to use when your objective is to better understand how a set of variables, X1, X2, …, Xn influence the expected (average) time to event, E(T). Example Let T be the amount of time an individual is unemployed measured in weeks. Thus, the event of interest is finding a job. Let X1 be the level of unemployment benefits in hundreds of dollars per month, X2 be years of work experience, and X3 be marital status; X3 =1 if single, X3 = 0 if married. You have a sample of 200 individuals. Some of these observations are censored at T = 80 weeks. Suppose your objective is to better understand how the level of unemployment benefits, work experience, and marital status influence the average amount of time an individual is unemployed. One way to proceed is to estimate the following classical linear regression model, T = 0 + 1X1 + 2X2 + 3X3 + The coefficient 1 measures the effect of a one unit change ($100) in unemployment benefits on the average amount of time an individual is unemployed. The coefficients 2 and 3 have similar interpretations. However, there are 3 potential problems with this model. 1. The observations on T are censored. As a result, the OLS estimator will yield estimates of the coefficients that are biased and inconsistent. Thus, the appropriate model would be the censored regression model, which accounts for censored observations in the estimation procedure. 2. The classical linear regression model assumes that T has a normal distribution. There are a number of reasons to believe that duration (such as length of unemployment) does not have a normal distribution (the most obvious reason is that T is positive by construction). Jeffrey Wooldridge suggests that one way to deal with this problem is to use the logarithm of duration as the dependent variable; that is ln(T). This is because ln(T) usually has a distribution that is closer to a normal distribution than T itself. In this case, the slope coefficients (multiplied by 100) measures the approximate percentage change in T for a one unit change in X. 3. The dependent variable, T, measures a process that takes place over the length of time (0, t). Regression analysis assumes that the value of X does not change during the period that T is being observed. For example, suppose that an individual is unemployed for 12 months. Regression analysis assumes the level of unemployment benefits he received (X1) and his marital status (X3) did not change during this period of time. If either of these variables does change during the time an individual is unemployed, this greatly complicates the analysis. Regression-Like Models A regression-like model is the appropriate model to use when analyzing duration data if your objective is any of the following. 1. 2. 3. 4. The probability that an event will occur before time t. The probability that an event will occur after time t. The probability that an event will occur between time t and time t+1. The probability that an event will occur between time t and time t+1, given that it has not occurred up to time t. Notice that we are not interested in expected duration, rather we are interested in the probability of duration. However, a regression-like model can also be used to analyze average duration or median duration. Example Let T be the amount of time an individual is unemployed. We might be interested in the following questions. 1. 2. 3. 4. What is the probability that an individual will be unemployed for 6 months or less? What is the probability that an individual will be unemployed for more than 6 months? What is the probability that an individual will be unemployed between 6 and 7 months? Given that an individual has been unemployed for 6 months, what is the probability that he will find a job within the next month? 5. Will the probability that an individual finds a job increase or decrease the longer he is unemployed? 6. What is the average or median amount of time an individual is unemployed? Probability Distributions for a Duration Variable A continuous duration random variable, T, can be described by four alternative probability distributions. These are: 1. Probability density function 2. Cumulative distribution function 3. Survival function 4. Hazard function Once you choose a particular type of probability density function (e.g., normal, exponential, Weibull, etc.) you can derive the other three functions. Thus, all four functions have the same parameters and are simply different ways of describing the same system of probabilities. Probability Density Function Let T be a continuous duration random variable. Let t be a specific value of the random variable T. Let T have a probability density function given by (t), where t is a specific value of T. The probability density function (t) allows you to calculate the probability that T will fall in the interval between t1 and t2; that is, t2 Pr(t1 T t2) = (t)dt t1 Thus, the probability that T will fall in the interval between t 1 and t2 is equal to the area under the curve of (t) between the values t1 and t2. For example, if T is length of unemployment in weeks and t1=40 and t2=42, then you can find the probability that an individual will be unemployed between 40 and 42 weeks. Cumulative Distribution Function Given the probability density function (t), the cumulative distribution function F(t) can be derived as follows, t F(t) = Pr(T t) = (t)dt 0 Thus, the probability that T will take a value that is less than or equal to t is equal to the area under the curve of (t) between 0 and t. For example, if T is the length of unemployment in weeks and t=52, then you can find the probability that an individual will be unemployed for 52 weeks or less. Survival Function Given the cumulative distribution function F(t), the survival function S(t) can be derived as follows, S(t) = Pr(T t) = 1 – F(t) Thus, the probability that T will take a value that is greater than or equal to t is equal to one minus the area of the curve of (t) between 0 and t. This is equal to the area under the curve of (t) between t and the maximum value of t. For example, if T is the length of unemployment in weeks and t=52, then you can find the probability that an individual will be unemployed for 52 weeks or more; that is, you can find the probability that an individual will be unemployed for at least 52 weeks. Hazard Function Given the probability density function (t) and the survival function S(t), the hazard function h(t) can be derived as follows, h(t) = (t) / S(t) The hazard function is a particular type of conditional probability function. It tells you the probability that an event will occur in the next short interval of time, given that it has not occurred up to time t. Roughly speaking, it tells you the rate at which the event will occur at time t. For example, if T is length of unemployment in weeks and t = 52, then you can find the probability that individual will find employment during the next week, given that he has been unemployed for 52 weeks. That is, the hazard function tells you the rate at which individuals who have been unemployed for 52 weeks are finding jobs. For example, a hazard rate of 0.05 at t = 52 implies that 5 of 100 individuals who are unemployed for 52 weeks are expected to find a job shortly after that time. The Hazard Function and Duration Dependence Often times we are interested in questions like the following. 1. Is it more likely, less likely, or equally likely that an individual will find a job the longer he is unemployed? 2. Is it more likely, less likely, or equally likely that an a strike will end the longer it lasts? 3. Is it more likely, less likely, or equally likely that a patient will die the longer he has survived after open heart surgery? The answer to these questions depends upon the slope of the hazard function. We have the following definitions. 1. If the hazard function has a positive slope, then the distribution of the duration variable has positive duration dependence. In this case, the longer the duration (e.g., unemployment) the greater the probability the event will occur in the next short period (e.g., the greater the probability the individual will find employment). 2. If the hazard function has a negative slope, then the distribution of the duration variable has negative duration dependence. In this case, the longer the duration the smaller the probability the event will occur in the next short period. 3. If the hazard function has a constant slope, then the probability that an event will occur in the next short period does not depend upon duration. In this case, the event is said to have no memory. Duration Dependence and Functional Form If economic theory unequivocally indicates that a duration variable has, for example, negative duration dependence, then you should choose a functional form for the probability distribution that imposes this structure on the data. However, if economic theory allows a duration variable to have, for example, positive or negative duration dependence, then you should not choose a functional form for the probability distribution that imposes either positive or negative duration dependence on the data. If you do you create a fete accompli. In this case, you must choose a functional form for the probability distribution that is flexible enough to allow for both positive and negative duration dependence, and allow the data to determine the outcome. Choosing a Functional Form for the Probability Distribution of a Duration Variable To specify a parametric duration model, and estimate the parameters of this model, you must choose a particular functional form for the distribution of the duration variable. The 2 distributions chosen most often are the following: 1. Exponential distribution 2. Weibull distribution Exponential Distribution The exponential distribution is used frequently to specify a parametric duration model. This is because it is easy to work with, easy to interpret, and often times can be justified as a reasonable approximation of the data generation process. The probability density function, cumulative distribution function, survival function, and hazard function for the exponential distribution are as follows. (t) = exp(-t) F(t) = 1 – exp(-t) S(t) = exp(-t) h(t) = where the parameter > 0. The probability density function, cumulative distribution function, survival function and hazard function for the parameter value =1 are illustrated in Figure 3. Note the following. 1. The exponential distribution has only one parameter, . 2. Both the mean and variance of the distribution are given by 1/; that is E(t) = 1/ and Var(t) = 1/. The median of the distribution is given by (0.69314718)(1/). 3. The hazard function is constant and equal to . Thus, this distribution imposes the restriction of no duration dependence. Because of this characteristic, the exponential distribution is sometimes call memoryless. 4. The major shortcoming of the exponential distribution is that it depends on only one parameter: . The family of distributions obtained by varying the value of is not very flexible. Because both the mean and variance are given by 1/, they cannot be adjusted separately. Thus, the exponential distribution will not be a good approximation of the data generation process if the sample contains both very long and very short durations. Weibull Distribution The distribution that is probably used most often to specify a parametric duration model is the Weibull distribution. The Weibull distribution is a generalization of the exponential distribution and has the latter as a special case. The probability density function, cumulative distribution function, survival function, and hazard function for the Weibull distribution are as follows (t) = (t)-1exp[-(t)] F(t) = 1 – exp[-(t)] S(t) = exp[-(t)] h(t) = (t)-1 = t-1 where > 0 and > 0. The probability density function, cumulative distribution function, survival function and hazard function for the parameter values =1 and = 0.5 are illustrated in Figure 1 and for parameter values =1 and = 3 in Figure 2. Note the following. 1. The Weibull distribution has two parameters, and . 2. The Weibull distribution collapses to the exponential distribution when =1. Thus, the exponential distribution is a special case of the more general Weibull distribution. 3. The hazard function can be either monotonically increasing, monotonically decreasing, or constant, depending on whether the parameter is greater than one, less than one, or equal to one. Thus, the Weibull distribution has positive duration dependence if >1, negative duration dependence if 4. <1, and no duration dependence if =1. 5. The parameter represents the shape of the distribution and the parameter represents the location of the distribution. 6. The median of the distribution is given by: Median = (0.69314718)1/(1 / ). Because the Weibull and exponential distributions are skewed to the right, the median may be a better measure of central tendency than the mean. Adding Explanatory Variables to a Parametric Duration Model The duration model developed above is a univariate duration model; it includes one variable: the dependent variable. However, one or more explanatory variables can also be included in the duration model. It is possible to allow changes in the explanatory variables to influence the probability distribution of the dependent variable in various ways. Many parametric duration models allow changes in the explanatory variables to change the probability density, cumulative distribution, survival, and hazard functions by rescaling the horizontal axis. This is called an accelerated failure time model. The coefficients of the explanatory variables are relatively easy to interpret for most distributions and sometimes have a regression-like interpretation. Adding Explanatory Variables to the Weibull Duration Model The Weibull duration model is an accelerated failure time model. To include explanatory variables in the Weibull model, which has the exponential model as a special case, proceed as follows. Let T be a duration random variable that has a Weibull distribution. This distribution is described by two parameters: and . The parameter represents the shape of the distribution and the parameter represents the location of the distribution. If increases (decreases), the distribution shifts to the left (right). Assume is a constant. Let be a function of the explanatory variables. For the unemployment example, = g(X1, X2, X3) where the X’s are the explanatory variables. Because the parameter is a function of the explanatory variables, whenever an explanatory variable changes this will rescale the T-axis, thereby changing the location of the distribution. To simplify the estimation procedure, let g be an exponential function, = exp[-(0 + 1X1 + 2X2 + 3X3)] Estimation The Weibull model can be specified in survival function form as follows, S(t) = exp[-(t)] where = exp[-(0 + 1X1 + 2X2 + 3X3)] To obtain estimates of the parameters of the Weibull model, 0, 1, 2, 3 and , the maximum likelihood estimation procedure is used. The estimates of 0, 1, 2, …n and are the values that maximize the likelihood function for the sample of observations. The likelihood function accounts for both uncensored and censored observations. Interpretation of Parameter Estimates The Weibull model has two interpretations. 1) The effects of the explanatory variables on median duration. 2) The effects of the explanatory variables on the hazard rate. Median Duration Suppose that we want to better understand how a set of explanatory variables influences the center of the distribution of a duration variable. For a skewed distribution, such as the Weibull, the most appropriate measure would be the median rather than the mean. For example, suppose you want to better understand how unemployment benefits (X1), years of work experience (X2) and marital status (X3) influence the amount of time that a typical worker is unemployed. To make this interpretation, you can derive the median duration function for the Weibull distribution. The median of the Weibull distribution is given by M = (0.69314718)1/ (1/) Substituting = exp[-(0 + 1X1 + 2X2 + 3X3)] and taking the logarithm of both sides yields, lnM = + 1X1 + 2X2 + 3X3 where lnM is the natural logarithm of median length of time unemployed, and the constant is a function of the parameters 0 and , and the number 0.69314718. Note that the median duration function is a log-linear function. The coefficients are interpreted as follows. 1 is the approximate proportional change (1*100 is the approximate percentage change) in median length of unemployment that results from a one $100 increase in monthly unemployment benefits. The exact proportional change is given by exp(1) – 1. 2 is the approximate proportional change (2*100 is the approximate percentage change) in median length of unemployment that results from one additional year of work experience. The exact proportional change is given by exp(2) – 1. 3 is the approximate proportional difference (2100 is the approximate percentage difference) in median length of unemployment between a married worker and a single worker. The exact proportional difference is given by exp(2) – 1. Some statistical programs, such as Stata, calculate and report the measures exp(1), exp(2), and exp(3). These are called the time ratios. Hazard Rate Suppose that we want to better understand how a set of explanatory variables influences the hazard rate. For example, suppose you want to better understand how unemployment benefits (X1), years of work experience (X2) and marital status (X3) influence the rate at which unemployed workers find jobs, given that they have been unemployed for some given period of time. To make this interpretation, you can derive the hazard function for the Weibull distribution. The hazard function for the Weibull distribution is given by h(t) = t-1 Substituting = exp[-(0 + 1x1 + 2x2 + 3x3)] and taking the logarithm of both sides yields, lnh = lnh0(t) – 1X1 – 2X2 – 3X3 where lnh is the natural logarithm of the hazard rate, and lnh0(t) is the natural logarithm of the baseline hazard function, which is a function of duration, t. The baseline hazard function is given by h0 = exp(-0)t-1. Note that for the Weibull model, the hazard rate is a log-linear function of the explanatory variables, X1, X2, and X3. This means that changes in the explanatory variables have a proportional effect on the hazard rate. Because of this, the Weibull duration model is both an accelerated failure time model and a proportional hazards model. In a proportional hazards model, changes in the explanatory variables do not change the shape of the hazard function; rather they shift the hazard function in a proportional manner. The shape of the hazard function is given by the baseline hazard function and depends upon the shape parameter . Note that the effect of an explanatory variable on the log hazard rate is given by the negative of the product of two parameters: i. The interpretation is as follows. -1 is the approximate proportional change (-1*100 is the approximate percentage change) in the hazard rate that results from a $100 increase in monthly unemployment benefits. The exact proportional change is given by exp(-1) – 1. -2 is the approximate proportional change (-2*100 is the approximate percentage change) in the hazard rate that results from a one year increase in work experience. The exact proportional change is given by exp(-2) – 1. -3 is the approximate proportional difference (3100 is the approximate percentage difference) in the hazard rate between a married worker and a single worker. The exact proportional difference is given by exp(-3) – 1. Some statistical programs, such as Stata, calculate and report the measures exp(-1), exp(-2), exp(-3). These are called hazard ratios. The parameter has the following interpretation: 1. If > 1, then the hazard function has a positive slope. The longer a worker is unemployed, the greater the probability he will find a job in the next short period. 2. If < 1, then the hazard function has a negative slope. The longer a worker is unemployed, the smaller the probability that he will find a job next short period. 3. If = 0, then the hazard function has a constant slope. The probability that a worker will find a job in the next short period of time is independent of the amount of time he is unemployed. A Quick Interpretation of the Coefficients of the Explanatory Varaibles The median duration function and hazard function for the Weibull Model imply the following: 1. If > 0, then an increase in x will shift the hazard function downward and increase the median duration. A decrease in x will result in the opposite. 2. If < 0, then an increase in x will shift the hazard function upward and decrease the median duration. A decrease in x will result in the opposite. An Empirical Example We want to better understand the length of time that an individual is unemployed. The dependent variable is length of unemployment measured in weeks (UNEMP). The explanatory variables are the level of unemployment benefits measured in hundreds of dollars per month (BENEFIT), work experience measured in years (EXPER), and marital status (MARITAL) a dummy variable for which marital =1 if single, marital = 0 if married. We have a sample of 200 individuals. Some of these observations are censored at UNEMP = 80 weeks. The results are as follows. Variable Coefficient Standard Errort-statistic Constant BENEFIT EXPER MARITAL 4.80 0.085 -0.052 0.100 1.20 0.025 0.030 0.005 4.00 3.40 1.73 2.00 1.20 0.05 Algebraic Signs and Statistical Significance of Estimates of Coefficients of Explanatory Variables and Since > 1, the hazard function is upward sloping. Thus, the longer an individual in unemployed, the greater the probability that he will find a job in the next short period of time (e.g., the next week). Since the 95% confidence interval for is (1.10 , 1.30), the estimate of is significantly different from 1 at the 5% level of significance. Thus, the Weibull distribution is more appropriate than the exponential distribution. The coefficient of BENEFIT is positive and significant at the 5% level. There is strong evidence that an increase in monthly unemployment benefits will increase the median number of weeks that a typical individual is unemployed and decrease the probability that an typical individual will find a job during the next week, for any given number of weeks unemployed. The coefficient of EXPER is negative and significant at the 10% level. There is some evidence that an individual with more years of work experience will have a smaller median number of weeks unemployed and have a higher probability of finding a job during the next week, for any given number of weeks unemployed. The coefficient of MARITAL is positive and significant at the 5% level. There is strong evidence that a typical single individual is unemployed longer and has a smaller probability of finding a job during the next week, for any given number of weeks unemployed, than a typical married individual. Median Duration of Unemployment The estimated median duration function is used to obtain estimates of the magnitude of the effects of the explanatory variables on median length of unemployment. It is given by lnM = + 0.085*BENEFIT –0.052*EXPER + 0.010*MARITAL A $100 increase in monthly unemployment benefits results in an approximate 8.5% increase in the median number of weeks unemployed. The exact increase is: exp(0.085) – 1 = 1.089 – 1 = 0.89 = 8.9%. One year of additional work experience results in an approximate 5.2% decrease in the median number of weeks unemployed. The exact decrease is: exp(-0.052) – 1 = 0.95 – 1 = 0.05 = 5.0%. The median number of weeks unemployed for a single individual is approximately 10% greater than for a married individual. The exact percentage difference is: exp(0.10) – 1 = 1.101 – 1 = 0.101 = 10.1%. Some statistical programs report the time ratios rather than the coefficients. For this example, the time ratio results are as follows. Variable Time Ratio Standard Errort-statistic BENEFIT EXPER MARITAL 1.089 0.95 1.101 1.20 0.025 0.030 0.005 0.05 3.40 1.73 2.00 The standard errors and t-statistics for the time ratios are identical to those for the coefficients. However, the t-statistic for time ratio is not the ratio of the time-ratio estimate to the estimated standard error. If you subtract one from the time ratio estimate this gives you an estimate of the exact proportional change in median number of weeks unemployed that results from a change in the explanatory variable. Hazard Rate The estimated hazard function is used to obtain estimates of the magnitude of the effects of the explanatory variables on the hazard rate. It is given by lnh = lnh0(t) – (1.20)(0.085)*BENEFIT – (1.20)(-0.052)*EXPER – (1.20)(0.100)*MARITAL = lnh0(t) – 0.102*BENEFIT + 0.0624*EXPER – 0.12*MARITAL A $100 increase in monthly unemployment benefits results in an approximate 10.2% decrease in the hazard rate (the rate at which unemployed workers find jobs). The exact decrease is: exp(-0.102) – 1 = 0.903 –1 = - 0.097 = - 9.7%. One year of additional work experience results in an approximate 6.24% increase in the hazard rate. The exact increase is: exp(0.0624) – 1 = 1.064 – 1 = 0.064 = 6.4%. The hazard rate for a single individual is approximately 12% lower than for a married individual. The exact percentage difference is: exp(-0.12) – 1 = 1.127 – 1 = 0.127 = 12.7%. Some statistical programs report estimates of the hazard ratios. For this example, the hazard ratio results are as follows. Variable Time Ratio Standard Errort-statistic BENEFIT EXPER MARITAL 0.903 1.064 1.127 1.20 0.0023 0.0041 0.0004 0.05 3.93 1.68 2.03 The standard errors and t-statistics for the hazard ratios can differ from those for the coefficients. Also, the t-statistic for hazard ratio is not the ratio of the hazard-ratio estimate to the estimated standard error. If you subtract one from the hazard ratio estimate this gives you an estimate of the exact proportional change in the hazard rate that results from a change in the explanatory variable. Survival Function The estimated (uncensored) survival function is used to obtain estimates of the probability that a typical individual will be unemployed for more than a specific number of weeks. It is given by S(t) = exp[-(t)1.20] where = exp[- (4.80 + 0.085*BENEFIT – 0.052*EXPER + 0.100*MARITAL)] The explanatory variables BENEFIT, EXPER, and MARITAL are set equal to their sample mean values. The estimated (uncensored) survival function can be used to obtain an estimate of the probability that a typical unemployed worker will be unemployed for more than t weeks, for any given value of t you choose. Hypothesis Tests To test restrictions on the parameters of a parametric duration model, any of the following large sample tests can be used: asymptotic t-test, Likelihood ratio test, Wald test, Lagrange multiplier test. Time-Varying Explanatory Variables We have assumed that the independent variables are constant from the beginning of the time period until the event occurs, or until the measurement is taken (in the case of censored observations). For example, we assume that the level of monthly unemployment benefits did not change during this 52 week period of time. However, it is possible that the level of monthly unemployment benefits changed one or more times during this 52 week period. In this case, X is a time-varying explanatory variable; that is, its value changes during spells of unemployment. We can write this as X(t). This type of time-varying explanatory variable can be included into the accelerated failure time model. Heterogeneity Heterogeneity exists in a population when different individuals have different distributions of the dependent variable. For example, it is possible that married individuals have a different distribution than single individuals. It is possible that the distribution of length of unemployment differs for individuals who have different amounts of work experience. To control for heterogeneity, we include explanatory variables in econometric models. Heterogeneity can occur when unobserved factors have an important effect on the duration variable. For instance, an individual’s innate ability may be an important factor influencing length of unemployment; however, innate ability cannot be measured, and therefore is not included as an explanatory variable in the duration model. If heterogeneity still exists across individuals after we include our explanatory variables in the model, this can make it difficult to interpret the data, maximum likelihood parameter estimates will be inconsistent, and estimated standard errors will be incorrect. To deal with this problem, you can incorporate the heterogeneity into the survival distribution. SEMIPARAMETRIC APPROACH Introduction The semiparametric approach makes minimal assumptions about the probability distribution of the duration variable. This approach is based on direct estimation of the hazard function. It only requires an assumption about the general form of the hazard function. The general form that is usually adopted is the proportional hazard specification. Cox Proportional Hazard Model The most often used semiparametric model is the Cox proportional hazard model. This model adopts a general specification for the hazard function that allows changes in explanatory variables to multiply the hazard function by a scale factor. Specification of the Cox Proportional Hazard Model It is assumed that the general form of the hazard function is given by, h(t) = 0(t) (X1, X2, X3) The hazard rate is a function of duration, t, and a set of explanatory variables, X 1, X2, X3. It is assumed that the hazard rate is the product of two functions: 0(t) and (X1, X2, X3). The function 0(t) depends only on the value of duration, while the function (X1, X2, X3) depends only on the values of the explanatory variables. When (X1, X2, X3) = 1, then the hazard function is given by h(t) = 0(t) Thus, the function 0(t) is called the baseline hazard function. Note that changes in the values of the explanatory variables will result in changes in the value of the function (X1, X2, X3). In turn this will shift the hazard function up or down, such that for any given value of t the hazard rate will increase or decrease. The Cox model assumes that (X1, X2, X3) is an exponential function, (X1, X2, X3) = exp(α1X1 + α2X2 + α3X3) The exponential functional form simplifies estimation of the parameters and has a straightforward interpretation. The hazard function with exponential functional form is h(t,x;,0) = 0(t) exp(α1X1 + α2X2 + α3X3) Taking the natural log of both sides yields the Cox proportional hazard model, lnh(t) = ln0(t) + α1X1 + α2X2 + α3X3 The Cox proportional hazard model makes no assumption about the baseline hazard function 0(t). It assumes that when the values of the explanatory variables equal zero, the baseline hazard function has some unspecified shape given by 0(t). As a result, the coefficients of the explanatory variables (α’s) can be estimated without making an assumption a priori about the shape of the hazard function. As a result, this model imposes less structure on the data than a parametric duration model. It is because the baseline hazard function is unspecified that the Cox proportional hazards model is a semiparametric model. The hazard function is allowed to have any shape (upward sloping, downward sloping, constant, hill shaped, U-shaped, etc.). Interpretation of the Coefficients The interpretation of the coefficients of the hazard function is as follows. For the unemployment example, α1 is the approximate proportional change (α1 *100 is the approximate percentage change) in the hazard rate that results from a $100 increase in monthly unemployment benefits. The exact proportional change is given by exp(α 1) – 1. α2 is the approximate proportional change (α2*100 is the approximate percentage change) in the hazard rate that results from a one year increase in work experience. The exact proportional change is given by exp(α2) – 1. α3 is the approximate proportional difference (α3*100 is the approximate percentage difference) in the hazard rate between a married worker and a single worker. The exact proportional difference is given by exp(α3) – 1. Some statistical programs, such as Stata, calculate and report the measures exp(α1), exp(α2), exp(α3). These are called hazard ratios. Estimation To obtain estimates of the parameters of the Cox model, α1, α2, α3, the maximum likelihood estimation procedure is used. We choose as estimates of α1, α2, α3, the values that maximize the partial likelihood function for the sample of observations. A partial likelihood function, rather than a full likelihood function, must be maximized because the baseline hazard function, ln0(t), is unspecified. The partial likelihood function accounts for both uncensored and censored observations. Stratification For the Cox proportional hazard model, the sample can be stratified into different groups of observations. This allows different groups of observations to have different baseline hazard functions, even though these baseline hazard functions are not specified. Differences in baseline hazard functions across groups are captured by a coefficient which is a fixed parameter. Time-Varying Explanatory Variables Time-varying explanatory variables can be included in the Cox proportional hazard model. Weibull Parametric Duration Model as a Special Case of the Cox Proportional Hazards Model The parametric Weibull duration model can be viewed as a special case of the semiparametric Cox proportional hazards model. For the case of three explanatory variables, the Cox model is given by lnh(t) = ln0(t) + α1X1 + α2X2 + α3X3 where the baseline hazard function, 0(t), is unspecified. The Cox model reduces to the Weibull model when the baseline hazard function is specified as 0(t) = exp(-β0)t-1. The relationship between the parameters of the Cox model and the Weibull model is α1 = – β1, α2 = – β2, and α3 = – β3, where β0, β1, β2, and β3 denote the coefficients of the explanatory variables for the Weibull model. NONPARAMETRIC APPROACH Introduction The nonparametric approach makes no assumptions about the probability distribution of the duration variable, T. Thus, it is a strictly empirical approach to the estimation of survival and hazard functions. That is, it allows the sample data to determine the shape of the survival and/or hazard curves. The nonparametric approach is most often used to describe the behavior of a duration variable and/or analyze experimental data. It is usually not used to analyze observational data. Two Nonparametric Estimators The two most often used nonparametric estimators are: 1. Life-table estimator 2. Kaplan-Meier estimator Life Table Estimator The first step in a duration analysis should be to describe the duration variable, T, for the sample. This is tantamount to reporting descriptive statistics for a duration variable. The best way to describe a duration variable is to estimate the univariate survival function and hazard function without making any assumptions about how the duration variable is distributed. To do this, you can use the life-table estimator. To use the life-table estimator, you group the duration variable into about 8 to 20 different time intervals. You then use this grouped data to estimate the survival and hazard functions. To use the life-table estimator, you must have enough observations so that duration times can be meaningfully grouped into intervals. The life-table estimator can be used with censored data and makes a correction for censoring. Kaplan-Meier Estimator Like the life-table estimator, the Kaplan-Meier estimator can be used to estimate a univariate survival function without making any assumptions about how the duration variable is distributed. Also, like the life-table estimator it can be used with censored data and makes a correction for censoring. However, unlike the life-table estimator it cannot be used to directly estimate a univariate hazard function. To estimate the univariate survival function, the KaplanMeier estimator uses ordered observations rather than grouped data. Therefore, the estimated survival function does not depend on the size the time intervals that are chosen, and can be used to construct survival functions for small samples. The Kaplan-Meier Estimator has two potential uses. 1. Describe the behavior of a duration variable. 2. Analyze duration data that comes from a controlled experiment. Descriptive Study of Data Like the life-table estimator, the Kaplan Meier estimator can be used to describe the behavior of a censored or uncensored duration variable by constructing a univariate survival function for the sample. Analysis of Experimental Data In a randomized experiment, subjects are randomly assigned to a control group and one or more treatment groups. Random assignment is used to control for extraneous factors. The Kaplan-Meier estimator can be used to estimate a different survival function for each group, and then conduct nonparametric statistical tests, such as the generalized Wilcoxon test or logrank test, to test whether there is a significant difference in the survival functions among two or more groups. ADVANTAGES AND DISADVANTAGES OF THE DIFFERENT DURATION MODELS Parametric Weibull Duration Model Advantages 1. 2. 3. 4. You can determine the general shape of the hazard function (monotonic upward sloping, downward sloping, constant), and therefore draw conclusions about duration dependence. It is a flexible functional form with the exponential duration model as a special case. You can obtain estimates of the magnitude of the effects of the explanatory variables on the hazard rate. You can obtain estimates of the magnitude of the effects of the explanatory variables on median duration. 5. 6. You can obtain direct estimates of the parameters of the uncensored survival function. This facilitates the estimation of survival probabilities and making predictions. You obtain parameter estimates by maximizing the full maximum likelihood function. This should result in more precise estimates. Disadvantages 1. 2. You must make a specific assumption about the probability distribution of the duration variable. Specifically, you assume that the duration variable has a Weibull distribution. By making a specific assumption about the probability distribution of the duration variable you risk imposing an inappropriate structure on the data a priori. You require the explanatory variables to have a proportional effect on the hazard rate and median duration. If this assumption isn’t valid then you are imposing an incorrect structure on the data Semiparametric Cox Proportional Hazards Model Advantages 1. You can make minimal assumptions about the probability distribution of the duration variable and therefore impose minimal structure on the data a priori. 2. You must only make an assumption about the general form of the hazard function (e.g., explanatory variables to multiply the hazard function by a scale factor). 3. By ignoring the shape of the hazard function, you can focus on how explanatory variables proportionately increase or decrease the hazard function. 4. You can obtain estimates of the magnitude of the effects of the explanatory variables on the hazard rate. Disadvantages 1. 2. 3. 4. You cannot determine the shape of the hazard function. Therefore, you can’t obtain information on whether the hazard function is upward sloping, downward sloping, or constant. Because of this, you can’t make any conclusions about duration dependence. You cannot obtain estimates of the magnitude of the effects of the explanatory variables on median duration. You do not obtain direct estimates of the parameters of the uncensored survival function. This makes it difficult to obtain estimates of survival probabilities. To obtain estimates of survival probabilities, you must use a two-step approach. In step #1, you estimate the parameters of the hazard function. In step #2, you obtain a separate estimate of the baseline survival function. You then combine these to get estimates of survival probabilities. To obtain parameter estimates you maximize a partial likelihood function rather than a full likelihood function. This may result in less precise estimates. 5. You must assume that the explanatory variables have a proportional effect on the hazard rate. If this assumption isn’t valid then you are imposing an incorrect structure on the data. Nonparametric Life-Table Estimator Advantages 1. 2. 3. 4. 5. It does not require you to make an assumption about the probability distribution of the duration variable. You can estimate both the univariate survival and hazard functions for a duration variable. You can determine the shape of the descriptive hazard function. You can impose minimal structure on the data, a priori, and therefore allow the data to determine the shape of the survival and hazard functions. This may yield better estimates of survival probabilities and hazard rates. It is a useful tool for preliminary descriptive analysis of duration data. Disadvantages 1. 2. 3. 4. 5. You cannot estimate a covariate adjusted survival and hazard functions that control for explanatory variables. You cannot analyze the effect of explanatory variables on the duration variable. You cannot conduct statistical tests to determine if different groups of subjects have different survival and hazard functions. It is largely a descriptive tool. It cannot be used with small samples. Nonparametric Kaplan-Meier Estimator Advantages 1. It does not require you to make an assumption about the probability distribution of the duration variable, T. 2. You can estimate the univariate survival function for a duration variable. 3. You can impose minimal structure on the data, a priori. 4. Parametric models, by imposing more structure on the data a priori, may yield distorted estimates of survival probabilities. It is possible that the Kaplan-Meier estimator will yield a better representation of the survival function and better estimates of survival probabilities. 5. Heterogeneity of observations can often times obtained by placing observations into different groups (stratification) and constructing different survival distributions for the different groups. 6. You can conduct nonparametric statistical tests to determine if different groups of subjects have different survival functions and hence survival probabilities. 7. It can be used with small samples. Disadvantages 1. You cannot directly estimate the hazard function and hazard rates. 2. You cannot determine the shape of the hazard function, and therefore you cannot draw conclusions about duration dependence. 3. It is much more difficult to analyze the effects of explanatory variables using the nonparametric Kaplan-Meier approach than for the parametric or semiparametric approach. The only way to analyze the effect of an explanatory variable is to create a categorical variable with n-categories, and then estimate separate survival functions for the n-groups. You cannot analyze the effects of continuous explanatory variables on the duration variable. 4. You cannot obtain estimates of the magnitudes of explanatory variables on the duration variable. 5. It is less informative than the parametric Weibull model and semiparametric Cox proportional hazard model.