I. Describing Event History (or Duration) Data
A. Thinking and Talking About Spells. Suppose that instead of having a
dependent variable that counts the number of times that similar events occurred over an
interval, your variable measures the length of time for which similar events occurred. The
canonical example here comes from the biostatistics literature and looks at the duration of
lifetimes after a treatment or placebo is administered. The units here are people, the
explanatory variables are their characteristics and the medical treatments that they receive,
and the dependent variable measures how long they survive after the start of the study. This
variable can be called a “spell,” and it is simply the number of days, months, or years for
which a subject survives. In order to get your drug approved by the FDA, you want to show
that administering it increases the duration of people’s lives. Alternatively, you could say
that it reduces their hazard of dying at any particular time “t.” Like morticians,
biostatisticians have a nicer euphemism for dying: when a spell ends, that observation
“exits” the dataset. So you want to show that members of the treatment group tend to exit
the dataset later than the members of the control group, and that their “hazard rate” is lower
at any point in time t.
B. Describing Duration Data (The Dumb Way). There are a number of
different basic graphs that you can use to describe duration data, and you can use them to
conduct a bit of bivariate analysis by looking at distributions for cases that take on different
values of your explanatory variable (such as the treatment group and the control group).
The Duration Distribution, or f(t). This is the pdf of the variable, which
plots on the y-axis the fraction of spells that end at any specific time t, with
t on the x-axis. It gives you a rough guess of when the hazards are greatest.
The Cumulative Duration Distribution, or F(t). This is the CDF of the
variable, which plots on the y-axis the proportion of spells that have ended
by any specific time t. Its complement, 1 - F(t), tells you about how many people are
still living after t years, in raw terms.
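The two "dumb" descriptions above can be computed directly from a list of spell lengths. The following is a minimal sketch in Python rather than Stata; the function name and the spell lengths are hypothetical, and it deliberately ignores censoring, which is the weakness the next section takes up.

```python
from collections import Counter

def duration_distributions(durations):
    """Return (f, F): dicts mapping each observed time t to the fraction of
    spells ending exactly at t and the fraction that have ended by t."""
    n = len(durations)
    counts = Counter(durations)
    f, F = {}, {}
    ended_so_far = 0
    for t in sorted(counts):
        f[t] = counts[t] / n
        ended_so_far += counts[t]
        F[t] = ended_so_far / n
    return f, F

# Hypothetical spell lengths in years (illustrative only, no censoring).
spells = [1, 2, 2, 3, 5, 5, 5, 8]
f, F = duration_distributions(spells)
for t in sorted(f):
    # columns: t, f(t), F(t), and the share still alive, 1 - F(t)
    print(t, round(f[t], 3), round(F[t], 3), round(1 - F[t], 3))
```

Running the same function separately on the treatment and control groups gives the simple bivariate comparison described above.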
C. Describing Duration Data (The Smart Way). One common issue with event
history datasets is that researchers cannot wait around long enough to observe the end of
every observation. We need to get our drug to market, so we want to finish our study when
some subjects are still alive, while some spells have not ended. Since we do not observe the
end of the spell, our data for that observation is “right censored.” (Left censoring comes
when we do not know when our spell began, and we’ll talk about that a bit later). A simple
duration distribution has to pretend that all subjects still alive at the end of our study died in
its final year, which clearly biases our estimated lifespans downward. This becomes even
more of a problem when some subjects began the study in 1980, some in 1990, etc.
Biostatisticians Kaplan and Meier devised a way to graph duration data that does not suffer
from this bias.
i. A Kaplan-Meier Survival Function, S(t). This plots on the y-axis the fraction of
subjects that are still alive at any time t, (# still alive/total # at start of study), but it removes
censored observations from the denominator just at the time that they are censored. So now,
for each period, we have # still alive at the end of the period/# still at risk at its start.
Another way to write this is (nj - dj)/nj, where nj gives the number of observations at risk at
the beginning of time j and dj gives the number of deaths at time j; S(t) is the product of
these fractions over all periods j up to t. We can look at the survival plots for the treatment
and control groups, and conduct a log rank test to see if they are significantly different.
Here’s how to produce a basic K-M survival plot in Stata.
. stset duration, failure(rightcen)

     failure event:  rightcen != 0 & rightcen < .
obs. time interval:  (0, duration]
 exit on or before:

------------------------------------------------------------------------------
      588  total obs.
        0  exclusions
------------------------------------------------------------------------------
      588  obs. remaining, representing
      103  failures in single record/single failure data
     2194  total analysis time at risk, at risk from t =
                             earliest observed entry t =
                                  last observed exit t =

. sts graph

         failure _d:
   analysis time _t:

[Figure: Kaplan-Meier survival estimate; x-axis: analysis time]
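To see what the Kaplan-Meier plot is computing, here is a minimal hand-rolled sketch in Python rather than Stata. The function name and the six spells are hypothetical. It multiplies the (nj - dj)/nj fractions at each failure time and drops censored cases from the risk set only after the time at which they are censored.

```python
def kaplan_meier(durations, failed):
    """Return a list of (t, S(t)) pairs, one per distinct failure time.

    durations: observed spell lengths
    failed:    1 if the spell ended in an observed failure, 0 if right-censored
    """
    data = sorted(zip(durations, failed))
    at_risk = len(data)   # n_j: cases still under observation
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0        # d_j: failures observed at time t
        leaving = 0       # failures plus censored cases leaving at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= (at_risk - deaths) / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

# Hypothetical data: two of the six spells are right-censored.
durations = [2, 3, 3, 5, 6, 8]
failed = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, failed)
for t, s in curve:
    print(t, round(s, 4))
```

In this toy example the censored spell at t = 6 never pulls the curve down; it only shrinks the risk set for the failure at t = 8.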
ii. Hazard Rate, h(t). This plots on the y-axis the probability that a spell ends
between period t and period t+Δt, given that it survived until t. It’s the smarter version of
the duration distribution.
When you collect or organize event history data, you have two choices of how to
organize your dataset: by spells or in a time series, cross sectional format. The fundamental
difference is that the second choice allows for “time varying covariates,” explanatory
variables that can take on different values over the duration of an event. You can convert
your dataset from one form to another, but getting increased causal leverage out of the time
series, cross sectional format requires you to add data by providing more precisely measured,
time varying explanatory variables.
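The conversion from spell format to time series, cross sectional format can be sketched as follows. This is a minimal Python illustration, not Stata, and the field names (id, duration, censored, period, exit) are hypothetical. It assumes durations are measured in whole periods and that the outcome in the long format is a 0/1 exit indicator for each period.

```python
def spells_to_person_periods(spells):
    """Expand one-row-per-spell records into one row per unit per period.

    Each spell is a dict with keys 'id', 'duration' (whole periods observed)
    and 'censored' (1 if right-censored, 0 if the exit was observed).
    """
    rows = []
    for s in spells:
        for period in range(1, s["duration"] + 1):
            is_last = period == s["duration"]
            rows.append({
                "id": s["id"],
                "period": period,
                # exit is 1 only in the final period of an uncensored spell
                "exit": 1 if (is_last and not s["censored"]) else 0,
            })
    return rows

# Hypothetical spells: A exits after 3 periods, B is censored after 2.
example = [
    {"id": "A", "duration": 3, "censored": 0},
    {"id": "B", "duration": 2, "censored": 1},
]
rows = spells_to_person_periods(example)
for r in rows:
    print(r)
```

Each new row is a place where a time varying covariate could be recorded, which is where the extra causal leverage has to come from; the expansion by itself adds no information.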
This discussion, which draws its methods for using the time series, cross section
format from Beck, Katz, and Tucker (1998), will use the example of examining the duration
of Congressional careers from 1980 to 2000. I will discuss the strengths and weaknesses of
each approach, describe how to implement the basic Beck, Katz, and Tucker model, and
discuss a simple extension to it.
A. Spell or duration format. This is the traditional format, and how you need to
set up your dataset in order to use the exponential and Weibull models that we will discuss
below. Each of your cases – the rows in your dataset – represents one spell, such as one
member of Congress’ career. The columns in your dataset will be attributes of the spell.
Most importantly, you will create a duration variable denoting the length of the spell (in
sessions, in years, in months, or whatever precision you find available and informative). You
will also need to create a dichotomous variable reporting whether the spell’s endpoint was
censored or not (remember that this is called right censoring, and will occur when the
member is still in Congress at the time your study ends). Suppose you have done this,
naming your duration variable “CareerLength” and your censoring variable
“RightCensored.” You will need to use the stset command to tell Stata that you will be
working with duration data and that these are the key variables by entering “stset
CareerLength, failure(RightCensored==0).”
The other variables in your dataset will contain values of explanatory variables for
the spell. Because each case takes up only one row, you can enter only one value of an
independent variable for each case. For a variable like “district competitiveness” or
“ideology,” you could enter a Representative’s average across her career or the values at the
midpoint of her career, but you will need to find some way to summarize explanatory
variables that vary over time. Your N will equal the number of Congressional members
serving since 1980.
• Strengths: This is the appropriate data format for the traditional event history
models such as the exponential and the Weibull. You can use Stata’s streg command to
estimate a variety of such models. Another popular alternative that uses this data format is
the Cox proportional hazard model, discussed in the lab handout. The Cox model is
often referred to as “nonparametric” (or, more precisely, “semiparametric”); what this refers
to is a lack of parametric assumptions about the shape of the hazard rate. The exponential
model assumes a flat hazard rate, the Weibull allows a flat, smoothly decreasing, or smoothly
increasing hazard rate, and the Cox model just lets the data tell you the shape of the hazard.
But it does make parametric assumptions about the systematic component of the model
linking explanatory variables to the hazard. Use the “stcox” command to estimate these
models in Stata.
Another strength of this format is that you can create many useful descriptive graphs
using this data format, such as the survival plots that we will look at in the lab on Friday.
Even if you plan to later use the time series, cross section approach, it may be worth your
while to put your data into this format in order to look at survival functions that separate cases
based on categorical variables such as gender or party that do not usually vary over a
Representative’s career.
• Weakness: As discussed above, this basic format does not allow explanatory
variables to vary over time. There are more complex ways to use Stata to set up a “multiple
record” duration dataset with some variables that vary over time (see the help files on stset
and stvary), but the most convenient way to estimate the effects of time varying covariates is
with a time-series, cross-section approach (the subject of the next class).
A Survey of Spell-Based Models
A. Modeling Hazard Rates. Even though our dependent variables in these models
are really spell lengths (yi), we are going to model the hazard rate h(t), finding out which
aspects of the social system make an exit more or less likely in a given interval. Remember
that the hazard rate is the probability that a spell ends between t and t+Δt, given that it
survived until t. We can see how the hazard rate (sometimes also called λ(t)) relates to the
pdf evaluated at t, the cdf, and the Kaplan-Meier survival function. Since knowing one of
these functions can get us to the others, we can just pick one of them to model, and we will
pick the most intuitively meaningful one, the hazard rate.
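\[
h(t) = \lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le y < t + \Delta t \mid y \ge t)}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t \, S(t)}
     = \frac{f(t)}{S(t)}
\]
\[
S(t) = 1 - F(t) = \exp\left( -\int_0^t h(y)\, dy \right)
\]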
B. The Exponential: A Model of Constant Hazard Rates. In this model, the
hazard rate is constant across time. Do not confuse that with being constant across
observations, because observations in the treatment group (we hope) will have a lower
hazard rate than observations in the control group. The assumption that underlies the
exponential model is that, for each of these groups, the hazard rate stays the same across
time and can be captured by flat lines in hazard plots. In each time interval, you lose the
same proportion (about λ per unit of time) of the cases still at risk,
and this implies that if there are no covariates, the survival function will be exponential. We
can see whether or not this is the case by looking at our Kaplan-Meier plot and seeing if the
decline in it is roughly exponential (again, important covariates can disrupt this).
\[
\begin{aligned}
S(t) &= \exp\left( -\int_0^t h(y)\, dy \right) \\
S(t) &= \exp(-\lambda t) \\
S(t) &= e^{-\lambda t}
\end{aligned}
\]
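A quick simulation shows what an exponential survival curve looks like in data. This is a minimal Python sketch under stated assumptions: the hazard value of 0.25, the seed, and the sample size are arbitrary choices for illustration, there are no covariates, and there is no censoring.

```python
import math
import random

random.seed(0)
lam = 0.25  # hypothetical constant hazard per period
draws = [random.expovariate(lam) for _ in range(20000)]

def empirical_survival(t):
    """Fraction of simulated spells that last longer than t."""
    return sum(1 for y in draws if y > t) / len(draws)

for t in (1, 2, 4, 8):
    # columns: t, simulated share surviving, theoretical exp(-lam * t)
    print(t, round(empirical_survival(t), 3), round(math.exp(-lam * t), 3))
```

If a real Kaplan-Meier plot departs sharply from this shape, either the hazard is not constant or important covariates are mixing groups with different hazards.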
What is the substantive story hidden in this assumption of a constant hazard rate? It
implies that our process has no memory, that our units don’t age in any meaningful way. It
is only our covariates that determine how long an event lasts, plus some random events or
unmeasured characteristics uncorrelated with time that cause failures/exits. This is
analogous to the Poisson event count model.
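The "no memory" claim can be made precise with one line of algebra. Taking the exponential survival function S(t) = e^{-λt} and writing s for some additional waiting time (a symbol introduced here only for this illustration), the probability of lasting s more periods does not depend on how long the spell has already lasted:

```latex
\[
\Pr(y \ge t + s \mid y \ge t)
  = \frac{S(t + s)}{S(t)}
  = \frac{e^{-\lambda (t + s)}}{e^{-\lambda t}}
  = e^{-\lambda s}
\]
```

A unit that has survived 30 periods faces exactly the same prospects as one that has just entered.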
How do we incorporate covariates into this model? If we identify a set of factors x
that influence the hazard rate, and measure those for observation i, then we can say that the
λi is a function of xiβ. But because we can never have a negative hazard (this would be a
Lazarus function), we need to use another somewhat ad hoc linkage function to make sure
that we never predict that observations will come back to life. Also, we are going to model
hazard as a function of -xiβ, because the hazard is the inverse of duration, and covariates that
are positively related to duration should be negatively related to the hazard. So now we can
write out a systematic component that will predict different hazards for cases with different
xis, but the same hazard for cases with the same vector xi. Note that the hazard is constant
across time, and that the xi also cannot change over time.
\[
\lambda_i = e^{-x_i \beta}
\]
What is the substantive assumption of using an exponential linkage function? Just as
in the exponential-Poisson model, this says that xi has a smaller effect at lower expected
durations. It is hard to get a case to last two years rather than one year, but it is easy to shift
it from 34 to 35 years. That’s probably a pretty good assumption.
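The substantive point about the exponential linkage function can be seen with a few numbers. In this minimal Python sketch the coefficient value of 0.5 and the x values are hypothetical; the only outside fact it relies on is that an exponential distribution with rate λ has mean 1/λ.

```python
import math

BETA = 0.5  # hypothetical coefficient, for illustration only

def hazard(x, beta=BETA):
    """Constant hazard implied by the link lambda_i = exp(-x_i * beta)."""
    return math.exp(-x * beta)

def expected_duration(x, beta=BETA):
    """Mean spell length of an exponential with rate lambda_i, i.e. 1 / lambda_i."""
    return 1.0 / hazard(x, beta)

# Same one-unit change in x, starting from a short and a long expected duration.
gain_low = expected_duration(1) - expected_duration(0)
gain_high = expected_duration(7) - expected_duration(6)
print(round(expected_duration(0), 2), round(gain_low, 2))
print(round(expected_duration(6), 2), round(gain_high, 2))
```

The proportional change is the same in both cases (a factor of e^β), but the absolute gain in expected duration is much larger where the expected duration is already long.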
How does this model deal with censoring? Much like we did with ordered probit
and with tobit, we are going to construct a likelihood function that is the product of the pdf
of uncensored observations and the probability that any observation is censored. For some
right censoring point ci, the likelihood function for any event history model will look like:
\[
L = \prod_{y_i < c_i} f(y_i) \prod_{y_i \ge c_i} \Pr(y_i \ge c_i)
  = \prod_{y_i < c_i} f(y_i) \prod_{y_i \ge c_i} S(y_i)
\]
And for the exponential model, you can write this out using a dichotomous variable di that
takes on a value of 1 if the observation is right-censored and a value of 0 if it is not (you just
have to construct one of these variables in Stata), and then reparameterize:
\[
L = \prod_{y_i < c_i} \lambda_i e^{-\lambda_i y_i} \prod_{y_i \ge c_i} e^{-\lambda_i y_i}
\]
\[
L = \prod_{i=1}^{n} \lambda_i^{1 - d_i} e^{-\lambda_i y_i},
\quad \text{then substitute } \lambda_i = e^{-x_i \beta}
\]
C. A Survey of Other Models. You can use Stata’s “streg” command to call a
number of different event history models.
The Cox model does not make any assumptions about the shape of the
hazard curve. (It is estimated with “stcox” rather than “streg.”)
The log-linear model has a hazard rate that is accelerated or decelerated by
the covariates.
The Weibull is a family of models in which the exponential is nested. It
drops the assumption of a constant hazard rate and lets the hazard vary over
time according to a shape parameter ρ (rho). The hazard is given by the
formula λρ(λt)^(ρ-1).
a. If ρ<1, the hazard decreases over time, and we have negative duration
dependence.
b. If ρ=1, the Weibull reduces to the exponential because the hazard is
constant.
c. If ρ>1, the hazard increases over time, and we have positive duration
dependence.
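The three cases for the Weibull shape parameter can be checked by evaluating the hazard formula at a few time points. This is a minimal Python sketch; the value of λ and the time points are hypothetical choices for illustration.

```python
def weibull_hazard(t, lam, rho):
    """Weibull hazard lam * rho * (lam * t) ** (rho - 1), defined for t > 0."""
    return lam * rho * (lam * t) ** (rho - 1)

LAM = 0.1                 # hypothetical scale parameter
TIMES = [1, 5, 10, 20]    # hypothetical time points

for rho in (0.5, 1.0, 2.0):
    # rho < 1: falling hazard; rho = 1: flat at LAM; rho > 1: rising hazard
    print(rho, [round(weibull_hazard(t, LAM, rho), 4) for t in TIMES])
```

With ρ = 1 the time term drops out entirely and every value equals λ, which is the exponential model as a special case.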