Modeling Event History via Duration Data

I. Describing Event History (or Duration) Data
II. Two Ways to Structure Event History Data
III. A Survey of Spell-Based Models

I. Describing Event History (or Duration) Data

A. Thinking and Talking About Spells. Suppose that instead of having a dependent variable that counts the number of times that similar events occurred over an interval, your variable measures the length of time for which similar events occurred. The canonical example here comes from the biostatistics literature and looks at the duration of lifetimes after a treatment or placebo is administered. The units here are people, the explanatory variables are their characteristics and the medical treatments that they receive, and the dependent variable measures how long they survive after the start of the study. This variable can be called a "spell," and it is simply the number of days, months, or years for which a subject survives. In order to get your drug approved by the FDA, you want to show that administering it increases the duration of people's lives. Alternatively, you could say that it reduces their hazard of dying at any particular time t. Like morticians, biostatisticians have a nicer euphemism for dying: when its spell ends, that observation "exits" the dataset. So you want to show that members of the treatment group tend to exit the dataset later than members of the control group, and that their "hazard rate" is lower at any point in time t.

B. Describing Duration Data (The Dumb Way). There are a number of basic graphs that you can use to describe duration data, and you can use them to conduct a bit of bivariate analysis by looking at distributions for cases that take on different values of your explanatory variable (such as the treatment group and the control group).

i. The Duration Distribution, or f(t). This is the pdf of the variable, which plots on the y-axis the fraction of spells that end at any specific time t, with t on the x-axis. It gives you a rough guess of when the hazards are greatest.

ii. The Cumulative Duration Distribution, or F(t). This is the CDF of the variable, which plots on the y-axis the proportion of spells that have ended by any specific time t. Its complement, 1 - F(t), tells you in raw terms what share of subjects are still living after t years.
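In symbols, these two plots carry the same information; this just restates the standard pdf/CDF identities, with no new assumptions:

  F(t) = \int_0^t f(u)\,du, \qquad f(t) = \frac{d}{dt} F(t)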
C. Describing Duration Data (The Smart Way). One common issue with event history datasets is that researchers cannot wait around long enough to observe the end of every observation. We need to get our drug to market, so we want to finish our study while some subjects are still alive and some spells have not yet ended. Since we do not observe the end of such a spell, our data for that observation are "right censored." (Left censoring arises when we do not know when a spell began, and we'll talk about that a bit later.) A simple duration distribution has to pretend that all subjects still alive at the end of our study died in its final year, which clearly biases our estimated lifespans downward. This becomes even more of a problem when some subjects began the study in 1980, some in 1990, and so on. Biostatisticians Kaplan and Meier devised a way to graph duration data that does not suffer from this bias.

i. A Kaplan-Meier Survival Function, S(t). This plots on the y-axis the fraction of subjects that are still alive at any time t (# still alive / total # at start of study), but it removes censored observations from the denominator at the time that they are censored. So now we have (# still alive at time t) / (# still at risk at time t). Another way to write the survival fraction at any failure time j is (n_j - d_j)/n_j, where n_j gives the number of observations at risk at the beginning of time j and d_j gives the number of deaths at time j. The Kaplan-Meier estimate at time t multiplies these one-period fractions together across all failure times up to t:

  S(t) = \prod_{j : t_j \le t} \frac{n_j - d_j}{n_j}

We can look at the survival plots for the treatment and control groups and conduct a log-rank test to see if they are significantly different (a sketch of this appears at the end of this section). Here's how to do a basic K-M survival plot in Stata:

  . stset duration, failure(rightcen)

       failure event:  rightcen != 0 & rightcen < .
  obs. time interval:  (0, duration]
   exit on or before:  failure

  ------------------------------------------------------------------------------
          588  total obs.
            0  exclusions
  ------------------------------------------------------------------------------
          588  obs. remaining, representing
          103  failures in single record/single failure data
         2194  total analysis time at risk, at risk from t = 0
                             earliest observed entry t = 0
                                  last observed exit t = 20

  . sts graph

           failure _d:  rightcen
     analysis time _t:  duration

  [Figure: Kaplan-Meier survival estimate; y-axis: survival probability, 0.00 to 1.00; x-axis: analysis time, 0 to 20]

ii. Hazard Rate, h(t). This plots on the y-axis the probability that a spell ends between period t and period t+Δt, given that it survived until t. It's the smarter version of the duration distribution.
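Here is a minimal sketch of the group comparison mentioned above, assuming the data have already been stset as in the output shown; "treat" is a hypothetical 0/1 treatment-group indicator, not a variable from the original example:

  * Sketch: compare survivor functions across treatment and control groups.
  sts graph, by(treat)    // one Kaplan-Meier curve per group
  sts test treat          // log-rank test for equality of survivor functions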
II. Two Ways to Structure Event History Data

When you collect or organize event history data, you have two choices of how to organize your dataset: by spells or in a time-series, cross-section format. The fundamental difference is that the second choice allows for "time-varying covariates," explanatory variables that can take on different values over the duration of an event. You can convert your dataset from one form to the other, but getting increased causal leverage out of the time-series, cross-section format requires you to add data by providing more precisely measured, time-varying explanatory variables. This discussion, which draws its methods for using the time-series, cross-section format from Beck, Katz, and Tucker (1998), will use the example of examining the duration of Congressional careers from 1980 to 2000. I will discuss the strengths and weaknesses of each approach, describe how to implement the basic Beck, Katz, and Tucker model, and discuss a simple extension to it.

A. Spell or duration format. This is the traditional format, and how you need to set up your dataset in order to use the exponential and Weibull models that we will discuss below. Each of your cases – the rows in your dataset – represents one spell, such as one member of Congress's career. The columns in your dataset will be attributes of the spell. Most importantly, you will create a duration variable denoting the length of the spell (in sessions, in years, in months, or whatever precision you find available and informative). You will also need to create a dichotomous variable reporting whether the spell's endpoint was censored or not (remember that this is called right censoring, and will occur when the member is still in Congress at the time your study ends). Suppose you have done this, naming your duration variable "CareerLength" and your censoring variable "RightCensored." You will need to use the stset command to tell Stata that you will be working with duration data and that these are the key variables, by entering "stset CareerLength, failure(RightCensored==0)." The other variables in your dataset will contain values of explanatory variables for the spell. Because each case takes up only one row, you can enter only one value of an independent variable for each case. For a variable like "district competitiveness" or "ideology," you could enter a Representative's average across her career or the value at the midpoint of her career, but you will need to find some way to summarize explanatory variables that vary over time. Your N will equal the number of members of Congress serving since 1980.

Strengths: This is the appropriate data format for traditional event history models such as the exponential and the Weibull. You can use Stata's streg command to estimate a variety of such models. Another popular alternative that uses this data format is the Cox proportional hazards model, discussed in the lab handout. The Cox model is often referred to as "nonparametric," but note that this refers only to the absence of parametric assumptions about the shape of the hazard rate. The exponential model assumes a flat hazard rate, the Weibull allows a flat, smoothly decreasing, or smoothly increasing hazard rate, and the Cox model just lets the data tell you the shape of the hazard. But the Cox model does make parametric assumptions about the systematic component linking explanatory variables to the hazard. Use the stcox command to estimate these models in Stata. Another strength of this format is that you can create many useful descriptive graphs, such as the survival plots that we will look at in the lab on Friday. Even if you plan to later use the time-series, cross-section approach, it may be worth your while to put your data into this format in order to look at survival functions that separate cases based on categorical variables, such as gender or party, that do not usually vary over a Representative's career.

Weakness: As discussed above, this basic format does not allow explanatory variables to vary over time. There are more complex ways to use Stata to set up a "multiple record" duration dataset with some variables that vary over time (see the help files on stset and stvary, and the sketch below), but the most convenient way to estimate the effects of time-varying covariates is with a time-series, cross-section approach (the subject of the next class).
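As a minimal sketch of what such a multiple-record setup might look like: the variable names here are hypothetical, with one row per member per session, "member" an id variable, "sessions" counting elapsed sessions served (the analysis time), and "exit" equal to 1 in the record where the career ends:

  * Sketch: declare multiple-record duration data (one row per member-session).
  stset sessions, id(member) failure(exit)
  * stvary reports which variables are constant and which vary within spells.
  stvary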
III. A Survey of Spell-Based Models

A. Modeling Hazard Rates. Even though our dependent variables in these models are really spell lengths (y_i), we are going to model the hazard rate h(t), finding out which aspects of the social system make an exit more or less likely in a given interval. Remember that the hazard rate is the probability that a spell ends between t and t+Δt, given that it survived until t. We can see how the hazard rate (sometimes also written λ(t)) relates to the pdf evaluated at t, the cdf, and the Kaplan-Meier survival function. Since knowing one of these functions can get us to the others, we can just pick one of them to model, and we will pick the most intuitively meaningful one, the hazard rate.

  h(t) = \lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le y < t + \Delta t \mid y \ge t)}{\Delta t} = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t \, S(t)} = \frac{f(t)}{S(t)}

  S(t) = 1 - F(t) = \exp\left( -\int_0^t h(y)\,dy \right)

B. The Exponential: A Model of Constant Hazard Rates. In this model, the hazard rate is constant across time. Do not confuse that with being constant across observations: observations in the treatment group (we hope) will have a lower hazard rate than observations in the control group. The assumption that underlies the exponential model is that, for each of these groups, the hazard rate stays the same across time and can be captured by a flat line in a hazard plot. In each time interval, you lose a fraction λ of the cases still at risk, and this implies that if there are no covariates, the survival function will be exponential. We can see whether or not this is the case by looking at our Kaplan-Meier plot and seeing if its decline is roughly exponential (again, important covariates can disrupt this).

  S(t) = \exp\left( -\int_0^t h(y)\,dy \right) = \exp\left( -\int_0^t \lambda\,dy \right) = e^{-\lambda t}

What is the substantive story hidden in this assumption of a constant hazard rate? It implies that our process has no memory, that our units don't age in any meaningful way. It is only our covariates that determine how long an event lasts, plus some random events or unmeasured characteristics, uncorrelated with time, that cause failures/exits. This is analogous to the Poisson event count model.

How do we incorporate covariates into this model? If we identify a set of factors x that influence the hazard rate, and measure those for observation i, then we can say that λ_i is a function of x_iβ. But because we can never have a negative hazard (this would be a Lazarus function), we need to use another somewhat ad hoc linkage function to make sure that we never predict that observations will come back to life. Also, we are going to model the hazard as a function of -x_iβ, because the hazard is inversely related to duration: covariates that are positively related to duration should be negatively related to the hazard. So now we can write out a systematic component that will predict different hazards for cases with different x_i's, but the same hazard for cases with the same vector x_i. Note that the hazard is constant across time, and that the x_i also cannot change over time.

  \lambda_i = e^{-x_i \beta}

What is the substantive assumption of using an exponential linkage function? Just as in the exponential-Poisson model, this says that x_i has a smaller effect at lower expected durations. It is hard to get a case to last two years rather than one year, but it is easy to shift it from 34 to 35 years. That's probably a pretty good assumption.

How does this model deal with censoring? Much as we did with ordered probit and with tobit, we are going to construct a likelihood function that is the product of the pdf for the uncensored observations and the survival probability for the censored ones. For some right-censoring point c_i, the likelihood function for any event history model will look like:

  L = \prod_{y_i < c_i} f(y_i) \prod_{y_i \ge c_i} \Pr(y_i \ge c_i) = \prod_{y_i < c_i} f(y_i) \prod_{y_i \ge c_i} S(y_i)

And for the exponential model, you can write this out using a dichotomous variable d_i that takes on a value of 1 if the observation is right-censored and a value of 0 if it is not (you just have to construct one of these variables in Stata), and then reparameterize:

  L = \prod_{y_i < c_i} \lambda_i e^{-\lambda_i y_i} \prod_{y_i \ge c_i} e^{-\lambda_i y_i}

  L = \prod_i \lambda_i^{1 - d_i} e^{-\lambda_i y_i}, \quad \text{then substitute} \quad \lambda_i = e^{-x_i \beta}
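A minimal sketch of fitting this model in Stata, assuming the data have already been stset as in section II.A; "treat" and "age" are hypothetical covariates. One wrinkle: streg parameterizes the hazard as λ_i = e^{x_iβ} rather than e^{-x_iβ}, so the signs of its coefficients are flipped relative to the convention above (a positive coefficient raises the hazard and shortens expected spells):

  * Sketch: exponential model; nohr reports coefficients rather than hazard ratios.
  streg treat age, distribution(exponential) nohr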
C. A Survey of Other Models. You can use Stata's streg command to call a number of different event history models (the Cox model is the exception: as noted above, use stcox). Sketches of fitting two of these appear after the list below.

i. The Cox model does not make any assumptions about the shape of the hazard curve.

ii. The log-linear model has a hazard rate that is accelerated or decelerated by the covariates.

iii. The Weibull is a family of models in which the exponential is nested. It drops the assumption of a constant hazard rate and lets λ vary according to a shape parameter ρ (rho). The hazard is given by the formula h(t) = λρ(λt)^(ρ-1).

  a. If ρ < 1, the hazard decreases over time, and we have negative duration dependence.
  b. If ρ = 1, the Weibull reduces to the exponential because the hazard is constant.
  c. If ρ > 1, the hazard increases over time, and we have positive duration dependence.
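Minimal sketches of the Weibull and Cox fits, under the same assumed setup as before ("treat" and "age" remain hypothetical covariates). Note that Stata labels the Weibull shape parameter p rather than ρ, and its reported test of ln(p) = 0 is a test of the exponential model's constant-hazard restriction:

  * Sketch: Weibull and Cox models (data already stset).
  streg treat age, distribution(weibull) nohr   // p > 1: rising hazard; p < 1: falling
  stcox treat age                               // baseline hazard left unspecified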