A Note on Estimating Maximum Likelihood of Models 1

advertisement
A Note on Estimating Maximum Likelihood of
Mixed Proportional Hazard Competing Risk
Models
Simen Gaure
The Frisch Centre for Economic Research
Scientific Computing Group, USIT, University of Oslo
3/1-2012
1
Introduction
This note is about estimating finite mixture competing risk models as used in [1].
We describe the ideas and functionality of an estimation program in production
use at the Frisch Centre. The estimation program has evolved since we did the
original estimates, this note is updated to reflect the changes.
Say we have N individuals who may be in S different states. From each state
they may make one of T different transits and possibly end up in a different
state. One or more states are “final”, i.e. the individual leaves the data set
upon such a transition.
A typical example is from labour market modelling, where unemployed individuals may transit to either a labour market programme or to job. An indivdual on a labour market programme (with exogeneously determined length)
may transit only to a job. In this example the states are “unemployed” and
“on programme” whereas the transits are “to job” and “to programme”. For
an individual in state “on programme”, the only legal transit is “to job”. I.e.
there is a single risk for these.
We will use this example now and then, though of course the estimation program works equally well if e.g. the individuals are patients and the transitions
are “medical treatment A”, “cured” and “dead”.
We may have any number of different transitions, but for real datasets we
have not tried more than 7.
Associated with a transition t is an individual hazard rate θ which depends
on some covariates Xi (with i = 1..N ). We model this hazard as
θi = eXi ·β
(where β is a parameter vector and · is the inner product).
1
We may use a different set of covariates for different transitions. We want
to estimate the vector β with the maximum likelihood estimator, but there are
some further complications.
To compute the likelihood Li (Xi , β) of a particular individual transit we
must integrate the survival rate over the period in the current state and multiply
with the hazard rate at the end of the period. The form of the integrated survival
depends on how we have measured the duration. In particular the durations
may be exact, interval censored, or discrete.
Some of the covariates Xi may depend on the duration that has been spent
in the state, this means that to avoid bias in β we must also model unobserved
heterogeneity. Following [5], and [3], we use a discrete heterogeneity distribution
and we allow for transition specific
heterogeneity. I.e. we have probabilities
P
(masspoints) p = (pi )K
with
p
=
1 and vectors v = (vt )Tt=1 of length K for
i
i=1
each transition t. I.e. so that vtj is the heterogeneous intercept for masspoint j
for transition t. Let also v·j denote the other projection, i.e. the location vector
of length T : v·j = (v1j , v2j , . . . , vT j ).
These things enter the hazard as
θitj = eXi ·β+vtj
and we have to integrate both the survival rate over the period, and the heterogeneity over the heterogeneity distribution, so that our individual likelihood
becomes
K
X
L(Xi , β, K, v, p) =
pj Θ(Xi , β, v·j ),
(1)
j=1
where Θ is the likelihood conditional on a particular masspoint.
The vtj enters as a random intercept, so we have no intercept in β.
Note that there are no other functional forms in our likelihood, in particular
even with continuous duration time, the hazard rate is considered to be piecewise
constant, though with completely flexible interval lengths.
2
Overall strategy
Our individual likelihoods from the previous section must be multiplied over
the individuals to get a function of the parameters β and our heterogeneity
distribution. We take the logarithm to get the following log-likelihood
L(β, K, v, p) =
N
X
log L(Xi , β, K, v, p).
i=1
Our task is to maximize this function with respect to all its arguments
subject to the constraint K ≤ N (this is from [5]). Fortunately, [3] has taken
care of some of it: We may handle K in the following way.
1. Start with K = 1, maximize L(β, K, v, p) w.r.t. (β, v, p).
2
2. Increase K by 1, keep β fixed and search for a better likelihood L(β, K, v, p)
by varying (v, p).
3. Do a maximization of L(β, K, v, p) w.r.t (β, v, p), using (β, v, p) from the
previous step as an initial value. If the new likelihood is an improvement
go to step 2, otherwise terminate.
Some words are emphasized above, in step 2, we need to specify what we
mean by better. In step 3 we need to specify what we mean by improvement.
We also need to decide how to search in step 2, and how to maximize in step 3.
When we have finished, we need to decide which model (i.e. which K) to use.
Note that the algorithm given here is slightly different from Heckman and
Singer’s. In step 2, we could terminate the algorithm if we couldn’t improve
the likelihood (in the sense that a certain directional derivative can’t be made
positive), but this is conditional on the fixed β. There is a chance that step
3 may actually find a better likelihood (by changing β slightly), even if step 2
can’t do it.
2.1
How we search
In step 2 we need to search for a new masspoint. That is, we should find a
new location vector and a probability which increases the likelihood. Following
Heckman and Singer we do this as follows:
Let p = (p1 , . . . , pK ) and v = (vtk ) be the old probabilities and location
points. Let w be a T -dimensional vector. Let
M(w, ρ) = L(β, K + 1, [vw], ((1 − ρ)p, ρ))
where [vw] is the matrix v extended (by concatenation) with the vector w. I.e.
M(w, ρ) is the likelihood with K + 1 points, the new location vector is w and
the new probability is ρ. Form the derivative in ρ = 0:
G(w) = lim
ρ→0
M(w, ρ) − M(w, 0)
.
ρ
We must search for a T -dimensional vector w which makes G(w) positive.
There are several ways to do this, e.g. local maximization, grid search,
adaptive grid search, etc. We have picked simulated annealing as implemented
by [2]. For our purposes we have usually limited each coordinate v to the interval
[−5, 1]. As we get more points in our heterogeneity distribution, we replace the
fixed search inverval by [E(v) − 4σ(v), E(v) + 4σ(v)].
Note that we do not maximize G(w), we stop as soon as G(w) becomes
positive. Once we have found such a w we use it as the new location point
v·K+1 , we set ρ = 10−4 and use ((1 − ρ)p, ρ) as the new probabilities.
If we do not find a w with G(w) > 0, we use the best one, i.e. we have done
a global maximization of G(w) and use the w which maximizes G(w).
3
We have found that maximizing G(w), as some authors have suggested,
sometimes performs worse, and typically does not perform better, than stopping when it becomes positive. The reason it performs worse seems to be that
maximization often forces w into a corner of its domain, a less favourable position for further maximization.
2.2
How we maximize
The maximization in step 1 and 3, being likelihood maximization, should really
be a global maximization. With N ≈ 106 , T = 7, K ≈ 50 and β ∈ R3000 this
is beyond our current computational capabilities. We have resorted to local
maximization, more specifically a combination of BFGS and Fisher scoring (i.e.
Newton’s method with the Hessian replaced by the Fisher matrix).
Moreover, we have divided the maximization into two parts. We have seen
that often the maximization leads to quite small changes in β, so we first maximize only with respect to (v, p), then w.r.t. (β, v, p).
We typically combine BFGS and Fisher scoring as follows:
a. Set b to 50. Set n to 30.
b. Do up to b iterations of BFGS.
c. Do up to n Newton iterations. If the resulting gradient is smaller than
10−9 , terminate the algorithm with success. If n ≥ 100, then terminate
with failure, otherwise increase b by 100, n by 10 and go to step b.
It turns out that both BFGS and Newton may run into problems, but the
above combination often gets around problematic regions. The number of iterations and the convergence criterion may be adjusted.
2.3
When to stop, what to do
If the log-likelihood does not improve by at least 0.01 we do not consider it an
improvement.
When our algorithm has terminated there are several model candidates. We
may pick the one with the highest likelihood, we may pick the one with the
highest AIC. In [1], AIC is recommended. If using AIC as a criterion it would
be tempting to terminate the algorithm when the AIC does not improve, but
we don’t recommend that. The reason is that the AIC may improve in later
steps.
2.4
Further complications
Sometimes, during maximization, the heterogeneity distribution ends up as degenerate. I.e. one or more probabilities are estimated as zero, or some location
vectors are equal (or close enough to be considered equal). In this case, we
remove the superfluous point(s) from the distribution and re-estimate.
4
To this end, a probability of less than 10−6 is considered to be zero. Two
location vectors v·i and v·j are considered to be equal if kv·i − v·j k∞ < 0.1.
There are more complications. In our example, some people never enter a
programme, their only transition is to “employed”. This means that technically,
the risk that they enter a programme is zero, i.e. a defective risk. The maximization algorithm may sometimes estimate a very large negative value in this
case. E.g. v26 is very negative. Then ev26 is numerically equal to zero and we
are not able to compute meaningful gradients, nor refine v26 . The Fisher matrix
becomes degenerate, and we can’t invert it to obtain standard errors for any of
the parameters. In this case we mark v26 as −∞ and leave it as a constant in
the estimation.
Our complications do far from end here. In particular for interval censored
durations we have occasional numerical problems. I.e. we have expressions like
x
1 − e−e and their derivatives. For certain arguments we then use a Taylor approximation to alleviate numerical difficulties. For an example of this, consider
formula (3): For hazards close to 0 it’s better to write it as
X
Pt = f (`
θu )`θt
u
where
(1 − e−x )
x
and use the Taylor-expansion of f . Similar considerations must be done for
some derivatives.
Also, for some expressions it’s more accurate to work directly with their
logarithms.
There’s also a trick which has proved to be very useful. During estimation,
we sometimes encounter values of parameters which makes the likelihood of some
individuals very close to zero, so that their logarithm becomes very negative.
It will typically be hard to compute the gradient with any reasonable accuracy
for these individuals, resulting in a seriously confused maximization routine.
We have solved this problem by temporarily setting the log-likelihood for these
individuals to a large negative number, whereas their gradient is set to zero. This
introduces a discontinuity in the gradient, but it turns out that the maximization
algorithm most often moves away from such regions, relying on the gradient
contribution from the “sane” individuals.
f (x) =
3
The integrated hazard
As mentioned above, we can either use continuous time, in which the duration
until transit is exactly recorded, or interval censored durations, where we merely
record in which interval the transit occurs. The likelihoods we compute below
are the individual likelihoods for a single period, conditioned on a mass-point.
5
3.1
The hazard
The hazard for transition t is given by
θt (β) = exp(Xt · β)
Where we for simplicity drop most indices and assume that our heterogeneous
intercept is incorporated in the vector β (with a corresponding Xi = 1. Thus
the derivatives become
∂θt
= (Xt )i θt (β)
∂βi
For the competing risk case, we also have a special extension which has
been used to model “motivation”. We use the hazard for one transition as an
explanatory variable in another. In this case, one of the elements in Xt is θs (β).
Or, we may write it explicitly
θt (β) = exp(Xt · β + hs θs (β))
The derivative becomes
∂θt
∂θs
= ((Xt )i + hs
)θt
∂βi
∂βi
Note that we have assumed throughout that the β’s are different for different
transitions (although restrictions may be put on equality of some β’s, but this
may be done on a much higher level, aggregated over all observations). That is,
we will typically only have one nonzero term in the sum above.
3.2
Interval censored data
We organize our data so that all covariates are constant within each time-period.
The periods may have arbitrary and different lengths. With time-varying covariates we therefore may have several “observations” with no transition, before
the observation with the transition.
With interval censored data we need to compute the probability that the
observed outcome occured within the observed interval. The probability Ps of
survival through a period is the product integral([4]) of the hazard:
X
Ps = exp(−`
θt )
(2)
t=1..T
where θt is the hazard of doing transition t and ` is the length of the period.
In case the individual did not make a transition within a period, Ps will be the
likelihood for this period.
In the case that a t-transition was made, the likelihood will be
θt
Pt = (1 − Ps ) P
u θu
.
(3)
I.e. the probability of non-survival times the probability of t-transition conditional that a transition was made.
6
3.2.1
Derivatives
In order to compute the gradient, and the Fisher matrix, we need some derivatives. We have
X ∂Ps ∂θu
∂Ps
=
∂βi
∂θu ∂βi
u
so that, assuming βi explains transition v
X ∂θu
∂Ps
= −`
Ps
= −`Xi Ps θv
∂βi
∂βi
u
(4)
For the case that a t-transition was made, and βi explains transition v we
get
X ∂Pt ∂θu
∂Pt
∂Pt
=
= Xi
θv
∂βi
∂θ
∂β
∂θv
u
i
u
(5)
We have, provided v 6= t:
∂Pt
∂(1 − Ps ) θt
θt
P
=
− (1 − Ps ) P
2
∂θv
∂θv
θ
(
u u
u θu )
which becomes
θt
`Ps P
u θu
so that
Pt
`Ps θt − Pt
−P
= P
,
θ
u
u
u θu
`Ps θt − Pt
∂Pt
= Xi θv P
∂βi
u θu
If v = t we get
P
∂Pt
∂(1 − Ps ) θt
u θu − θt
P
=
+ (1 − Ps ) P
2
∂θt
∂θt
θ
(
u u
u θu )
which becomes
(1 − Ps )
Pt
`Ps θt − Pt
Pt
+ P
−P
= P
+ ,
θt
u θu
u θu
u θu
u θu
θt
`Ps P
so that
∂Pt
`Ps θt − Pt
= Xi θt P
+ Xi Pt
∂βi
u θu
Note that rewriting of these expressions may yield numerically more favourable
expressions. Note also that the i in βi and Xi does not refer to the individual
(as we only look at a single individual here), but merely explanatory variable
number i.
In case βi explains more than one transitions, i.e. if we use the “motivation”
approach of letting the hazard of one transition explain the hazard of another,
the simplifications of the sum in equations (4) and (5) are not valid.
7
3.3
Exact duration
Exact duration means that our data contains the exact time of transition, rather
than an interval. Here we also organize our data into periods with constant
covariates. The periods may be of arbitrary and different lengths. The likelihood
for a period of length ` is the same as the interval-censored likelihood in (2):
X
Ps = exp(−`
θu ).
u
For a t-transition it gets simpler:
Pt = Ps θ t
This is a density, rather than a probability.
3.3.1
Derivatives
For the case that there is no transition, the derivative is the same as above. For
transition to t, and βi explains transition t, we have
∂Ps
∂θt
∂Pt
=
θ t + Ps
= −`Ps θt2 Xi + Ps θt Xi = Xi Pt (1 − `θt ).
∂βi
∂βi
∂βi
If βi explains a transition different from t, the last term vanishes, so we get
∂Pt
= −`θt Pt Xi
∂βi
3.4
Discrete data
Discrete data means that the “transition” is defined directly in terms of timeperiods, e.g. like “main-activity in a week”, i.e. a transition has not occured at
a particular point in time. In this case we model the likelihood as a “logit”:
1
Ps = P
u θu
and for a t-transition:
Pt = Ps θ t .
3.4.1
Derivatives
In case of no transition, we get
∂Ps
= Ps (1 − Ps )
∂βi
and with transition to t we get, when βi explains a transition s 6= t:
∂Pt
= −Pt Ps θs Xi .
∂βi
When βi explains t:
∂Pt
= Pt (1 − Pt )Xi .
∂βi
8
3.4.2
Uncertain transition
In some cases one doesn’t know exactly which transition has been
Ptaken, i.e.
one only knows it’s in some set U . In this
case
the
hazard
is
θ
=
U
t∈U θt . Ps
P
will be the same as before, but PU = t∈U Pt . This functionality has not yet
been implemented.
3.5
The individual likelihood
If for each period i and masspoint j we compute a likelihood as above:
(
Ps if no transition
Lij =
Pt if t-transition
then the product
Θ=
Y
Lij
i
is the individual likelihood as used in (1)
4
Duration dependence and implicit dummies
We have briefly mentioned duration dependence. I.e. the hazard rate may
depend on how long the individual has been in the current state. In the literature
one sees various functional forms of duration dependence. One popular form has
been the Weibull distribution, and there are other forms, decreasing, increasing
and with various regular humps and bumps.
We have used no such functional form. We depend on estimating duration
dependence with dummy variables. E.g. for a dataset with monthly data with
durations up to 5 years, we use 60 dummies, one for each month, to estimate
duration dependence. In some applications we get collections of over 100 such
dummies. Therefore we have optimized particularly for long strings of dummies.
The user may e.g. specify an ordinal variable “dur” as implicit dummy from 1
to 60 with reference 13. “dur” should then have integer values between 1 and 60
and the estimation program will create 59 dummies (“dur1” – “dur60”, except
the reference “dur13”) corresponding to each value of “dur”. This is not merely
a convenience for the user, the fact that no more than 1 of the 59 dummies
is non-zero at any time, is actively used to speed up the execution, effectively
treating the string of dummies as one variable (which it really is), both with
respect to speed and space.
Since we have this flexible (though piecewise constant) duration dependence,
it’s possible to track duration dependence which e.g. stems from pure policy
decisions, e.g. we may see bumps prior to (partial) benefit exhaustion at 18 and
36 months. Implicit dummies may of course also be used to track calendar time
dependence, e.g. the business cycle.
Note that the duration dummies will enter the likelihood just like any other
covariate, this means that we use a model which is known as “proportional
9
hazard”, or, taking into account the heterogeneity distribution, “mixed proportional hazard”.
Note also that with this many parameters we require a lot of data to get
sensible standard errors. Since we typically use these methods for analyzing
register data this is not a big problem for us.
5
Random Coefficients
Just like we modeled the heterogeneity with a discrete distribution, we may
estimate random coefficients. From above, our individual hazard
θitj = eXi ·β+vtj
is replaced by
θitj = eXi ·β+αj ri +vtj
where ri is some covariate and αj is a random coefficient. Note that the unobserved heterogeneity is merely a special case of a random coefficient corresponding to a covariate which is constantly equal to 1.
6
Linear Predictors
In addition to computing likelihoods for transitions, we may, for certain observations, augment the likelihood with a linear predictor.
Continuing our example, when unemployed people get a job, it might be
interesting to analyze how good a job it is, e.g. by looking at the wage. We
update our likelihood to include a linear predictor for the (log)wage e.g. like
Yi = Xi · γ + v + N (0, σ)
where Yi is a covariate (the (log)wage), Xi are covariates (possibly different
from the ones occuring in the hazard), γ is a set of parameters to estimate, v
is heterogeneity and N (0, σ) is an error term normalized to have mean 0. The
individual likelihood is
φij = Nσ (Yi − (Xi · γ + vj ))
where
Nσ (x) =
x2
1
√ e− 2σ2
σ 2π
is the normal density with mean 0. This likelihood φij is multiplied with the
hazard likelihood Θ in (1). We get parameters (γ, σ, vj ) to maximize in addition
to the ones from the hazard.
Interestingely, we have seen that with a linear predictor we benefit from
changing our maximization algorithm slightly. We still do the same search for a
new point, but we do not do a separate maximization over the distribution (as
10
mentioned in 2.2), instead we jump right to a maximization of all the parameters.
This has, in our datasets, prevented the algorithm from repeatedly converging
into a degenerate pit. We have no explanation for this, and it may very well be
data-dependent. This is an area for further investigation.
7
Implementation
The estimation as described above is quite time-consuming. Although it is possible to write down the likelihood and feed it to some general purpose statistical
package, this has not proven useful for the datasets we have. We have resorted
to write the procedure in the de facto standard language for high performance
computations, FORTRAN 90. We have used MPI to do the computations in
parallel (over the dataset).
Since we want to have an easy way to specify a particular problem, we
have written a preprocessor in perl, which takes a human-writeable specification
and transforms it into problem-specific fortran code which is included in the
estimation program. This has the additional benefit that many loop-limits in
the program are compile-time constants, this aids optimization.
Since Fortran hardly can be said to be a rational language for book-keeping,
we have wrapped the fortran stuff in an R-package. This takes care of everything except actually computing the likelihood-function and its derivatives.
This package is very far from being production quality.
In addition we have written a web-frontend in PHP to submit problems to
the Titan cluster at the University of Oslo. (A cluster which the Frisch Centre
has a share in).
7.1
Speed
As mentioned above, the estimation procedure is time-consuming, in particular
with large datasets and large sets of parameters. In addition to coarse grain
parallellism and implicit dummies, we have done quite detailed loop-level optimization. E.g. we try to ensure that all time-consuming loops are executed
in the right order (forwards in memory) and that cache-lines are fully utilized
before wasted. But, the single most important thing is to use a good blas and
lapack and a good compiler. This also holds for the R post-processing routine.
As an example of this, our Newton step uses the Fisher matrix, which can
be computed from the outer product of the individual gradients ∇Li as
X
(∇Li )(∇Li )0 .
i
Doing this without further thought will be next to disastrous when the vector
length becomes long (i.e. so long that the Fisher matrix no longer fits in the
cache). To speed up this critical process we collect ∇Li for 32–128 individuals at
a time (depending on the machine architecture), and compute their contribution
to the Fisher matrix with the blas level 3 routine “dsyrk”. With a vector length
11
of 3000, the speed difference between a good and a poor blas may be an order
of magnitude or more.
7.1.1
Differentiation of products
In computing the gradient we need to compute the derivative of a product over
a spell: I.e. we have a product
pn =
n
Y
fi .
i=1
By taking the logarithm we may find a closed form formula for dn = (pn )0 .
However, it might be faster to use a recurrence relation:
dn = (pn )0 = (fn pn−1 )0 = fn0 pn−1 + fn dn−1
where pn = fn pn−1 , f0 = p0 = 1, and d0 = 0.
However, it’s hard to do implicit dummies in this way.
References
[1] S. Gaure, K. Røed, and T. Zhang, Time and Causality: A Monte-Carlo
Assessment of the timing-of-events approach, J. Econometrics (in print)
(2007).
[2] W. L. Goffe, G. D. Ferrier, and J. Rogers, Global Optimization of Statistical
Functions with Simulated Annealing, J. Econometrics 60 (1994), 65–99.
[3] J. Heckman and B. Singer, A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data, Econometrica
52 (1984), 271–320.
[4] T. Lancaster, The Econometric Analysis of Transition Data, Cambridge
University Press, 1992.
[5] B. G. Lindsay, The Geometry of Mixture Likelihoods: A General Theory,
Ann. Stat. 11 (1983), 86–94.
12
Download