A Note on Estimating Maximum Likelihood of Mixed Proportional Hazard Competing Risk Models Simen Gaure The Frisch Centre for Economic Research Scientific Computing Group, USIT, University of Oslo 3/1-2012 1 Introduction This note is about estimating finite mixture competing risk models as used in [1]. We describe the ideas and functionality of an estimation program in production use at the Frisch Centre. The estimation program has evolved since we did the original estimates, this note is updated to reflect the changes. Say we have N individuals who may be in S different states. From each state they may make one of T different transits and possibly end up in a different state. One or more states are “final”, i.e. the individual leaves the data set upon such a transition. A typical example is from labour market modelling, where unemployed individuals may transit to either a labour market programme or to job. An indivdual on a labour market programme (with exogeneously determined length) may transit only to a job. In this example the states are “unemployed” and “on programme” whereas the transits are “to job” and “to programme”. For an individual in state “on programme”, the only legal transit is “to job”. I.e. there is a single risk for these. We will use this example now and then, though of course the estimation program works equally well if e.g. the individuals are patients and the transitions are “medical treatment A”, “cured” and “dead”. We may have any number of different transitions, but for real datasets we have not tried more than 7. Associated with a transition t is an individual hazard rate θ which depends on some covariates Xi (with i = 1..N ). We model this hazard as θi = eXi ·β (where β is a parameter vector and · is the inner product). 1 We may use a different set of covariates for different transitions. We want to estimate the vector β with the maximum likelihood estimator, but there are some further complications. To compute the likelihood Li (Xi , β) of a particular individual transit we must integrate the survival rate over the period in the current state and multiply with the hazard rate at the end of the period. The form of the integrated survival depends on how we have measured the duration. In particular the durations may be exact, interval censored, or discrete. Some of the covariates Xi may depend on the duration that has been spent in the state, this means that to avoid bias in β we must also model unobserved heterogeneity. Following [5], and [3], we use a discrete heterogeneity distribution and we allow for transition specific heterogeneity. I.e. we have probabilities P (masspoints) p = (pi )K with p = 1 and vectors v = (vt )Tt=1 of length K for i i=1 each transition t. I.e. so that vtj is the heterogeneous intercept for masspoint j for transition t. Let also v·j denote the other projection, i.e. the location vector of length T : v·j = (v1j , v2j , . . . , vT j ). These things enter the hazard as θitj = eXi ·β+vtj and we have to integrate both the survival rate over the period, and the heterogeneity over the heterogeneity distribution, so that our individual likelihood becomes K X L(Xi , β, K, v, p) = pj Θ(Xi , β, v·j ), (1) j=1 where Θ is the likelihood conditional on a particular masspoint. The vtj enters as a random intercept, so we have no intercept in β. Note that there are no other functional forms in our likelihood, in particular even with continuous duration time, the hazard rate is considered to be piecewise constant, though with completely flexible interval lengths. 2 Overall strategy Our individual likelihoods from the previous section must be multiplied over the individuals to get a function of the parameters β and our heterogeneity distribution. We take the logarithm to get the following log-likelihood L(β, K, v, p) = N X log L(Xi , β, K, v, p). i=1 Our task is to maximize this function with respect to all its arguments subject to the constraint K ≤ N (this is from [5]). Fortunately, [3] has taken care of some of it: We may handle K in the following way. 1. Start with K = 1, maximize L(β, K, v, p) w.r.t. (β, v, p). 2 2. Increase K by 1, keep β fixed and search for a better likelihood L(β, K, v, p) by varying (v, p). 3. Do a maximization of L(β, K, v, p) w.r.t (β, v, p), using (β, v, p) from the previous step as an initial value. If the new likelihood is an improvement go to step 2, otherwise terminate. Some words are emphasized above, in step 2, we need to specify what we mean by better. In step 3 we need to specify what we mean by improvement. We also need to decide how to search in step 2, and how to maximize in step 3. When we have finished, we need to decide which model (i.e. which K) to use. Note that the algorithm given here is slightly different from Heckman and Singer’s. In step 2, we could terminate the algorithm if we couldn’t improve the likelihood (in the sense that a certain directional derivative can’t be made positive), but this is conditional on the fixed β. There is a chance that step 3 may actually find a better likelihood (by changing β slightly), even if step 2 can’t do it. 2.1 How we search In step 2 we need to search for a new masspoint. That is, we should find a new location vector and a probability which increases the likelihood. Following Heckman and Singer we do this as follows: Let p = (p1 , . . . , pK ) and v = (vtk ) be the old probabilities and location points. Let w be a T -dimensional vector. Let M(w, ρ) = L(β, K + 1, [vw], ((1 − ρ)p, ρ)) where [vw] is the matrix v extended (by concatenation) with the vector w. I.e. M(w, ρ) is the likelihood with K + 1 points, the new location vector is w and the new probability is ρ. Form the derivative in ρ = 0: G(w) = lim ρ→0 M(w, ρ) − M(w, 0) . ρ We must search for a T -dimensional vector w which makes G(w) positive. There are several ways to do this, e.g. local maximization, grid search, adaptive grid search, etc. We have picked simulated annealing as implemented by [2]. For our purposes we have usually limited each coordinate v to the interval [−5, 1]. As we get more points in our heterogeneity distribution, we replace the fixed search inverval by [E(v) − 4σ(v), E(v) + 4σ(v)]. Note that we do not maximize G(w), we stop as soon as G(w) becomes positive. Once we have found such a w we use it as the new location point v·K+1 , we set ρ = 10−4 and use ((1 − ρ)p, ρ) as the new probabilities. If we do not find a w with G(w) > 0, we use the best one, i.e. we have done a global maximization of G(w) and use the w which maximizes G(w). 3 We have found that maximizing G(w), as some authors have suggested, sometimes performs worse, and typically does not perform better, than stopping when it becomes positive. The reason it performs worse seems to be that maximization often forces w into a corner of its domain, a less favourable position for further maximization. 2.2 How we maximize The maximization in step 1 and 3, being likelihood maximization, should really be a global maximization. With N ≈ 106 , T = 7, K ≈ 50 and β ∈ R3000 this is beyond our current computational capabilities. We have resorted to local maximization, more specifically a combination of BFGS and Fisher scoring (i.e. Newton’s method with the Hessian replaced by the Fisher matrix). Moreover, we have divided the maximization into two parts. We have seen that often the maximization leads to quite small changes in β, so we first maximize only with respect to (v, p), then w.r.t. (β, v, p). We typically combine BFGS and Fisher scoring as follows: a. Set b to 50. Set n to 30. b. Do up to b iterations of BFGS. c. Do up to n Newton iterations. If the resulting gradient is smaller than 10−9 , terminate the algorithm with success. If n ≥ 100, then terminate with failure, otherwise increase b by 100, n by 10 and go to step b. It turns out that both BFGS and Newton may run into problems, but the above combination often gets around problematic regions. The number of iterations and the convergence criterion may be adjusted. 2.3 When to stop, what to do If the log-likelihood does not improve by at least 0.01 we do not consider it an improvement. When our algorithm has terminated there are several model candidates. We may pick the one with the highest likelihood, we may pick the one with the highest AIC. In [1], AIC is recommended. If using AIC as a criterion it would be tempting to terminate the algorithm when the AIC does not improve, but we don’t recommend that. The reason is that the AIC may improve in later steps. 2.4 Further complications Sometimes, during maximization, the heterogeneity distribution ends up as degenerate. I.e. one or more probabilities are estimated as zero, or some location vectors are equal (or close enough to be considered equal). In this case, we remove the superfluous point(s) from the distribution and re-estimate. 4 To this end, a probability of less than 10−6 is considered to be zero. Two location vectors v·i and v·j are considered to be equal if kv·i − v·j k∞ < 0.1. There are more complications. In our example, some people never enter a programme, their only transition is to “employed”. This means that technically, the risk that they enter a programme is zero, i.e. a defective risk. The maximization algorithm may sometimes estimate a very large negative value in this case. E.g. v26 is very negative. Then ev26 is numerically equal to zero and we are not able to compute meaningful gradients, nor refine v26 . The Fisher matrix becomes degenerate, and we can’t invert it to obtain standard errors for any of the parameters. In this case we mark v26 as −∞ and leave it as a constant in the estimation. Our complications do far from end here. In particular for interval censored durations we have occasional numerical problems. I.e. we have expressions like x 1 − e−e and their derivatives. For certain arguments we then use a Taylor approximation to alleviate numerical difficulties. For an example of this, consider formula (3): For hazards close to 0 it’s better to write it as X Pt = f (` θu )`θt u where (1 − e−x ) x and use the Taylor-expansion of f . Similar considerations must be done for some derivatives. Also, for some expressions it’s more accurate to work directly with their logarithms. There’s also a trick which has proved to be very useful. During estimation, we sometimes encounter values of parameters which makes the likelihood of some individuals very close to zero, so that their logarithm becomes very negative. It will typically be hard to compute the gradient with any reasonable accuracy for these individuals, resulting in a seriously confused maximization routine. We have solved this problem by temporarily setting the log-likelihood for these individuals to a large negative number, whereas their gradient is set to zero. This introduces a discontinuity in the gradient, but it turns out that the maximization algorithm most often moves away from such regions, relying on the gradient contribution from the “sane” individuals. f (x) = 3 The integrated hazard As mentioned above, we can either use continuous time, in which the duration until transit is exactly recorded, or interval censored durations, where we merely record in which interval the transit occurs. The likelihoods we compute below are the individual likelihoods for a single period, conditioned on a mass-point. 5 3.1 The hazard The hazard for transition t is given by θt (β) = exp(Xt · β) Where we for simplicity drop most indices and assume that our heterogeneous intercept is incorporated in the vector β (with a corresponding Xi = 1. Thus the derivatives become ∂θt = (Xt )i θt (β) ∂βi For the competing risk case, we also have a special extension which has been used to model “motivation”. We use the hazard for one transition as an explanatory variable in another. In this case, one of the elements in Xt is θs (β). Or, we may write it explicitly θt (β) = exp(Xt · β + hs θs (β)) The derivative becomes ∂θt ∂θs = ((Xt )i + hs )θt ∂βi ∂βi Note that we have assumed throughout that the β’s are different for different transitions (although restrictions may be put on equality of some β’s, but this may be done on a much higher level, aggregated over all observations). That is, we will typically only have one nonzero term in the sum above. 3.2 Interval censored data We organize our data so that all covariates are constant within each time-period. The periods may have arbitrary and different lengths. With time-varying covariates we therefore may have several “observations” with no transition, before the observation with the transition. With interval censored data we need to compute the probability that the observed outcome occured within the observed interval. The probability Ps of survival through a period is the product integral([4]) of the hazard: X Ps = exp(−` θt ) (2) t=1..T where θt is the hazard of doing transition t and ` is the length of the period. In case the individual did not make a transition within a period, Ps will be the likelihood for this period. In the case that a t-transition was made, the likelihood will be θt Pt = (1 − Ps ) P u θu . (3) I.e. the probability of non-survival times the probability of t-transition conditional that a transition was made. 6 3.2.1 Derivatives In order to compute the gradient, and the Fisher matrix, we need some derivatives. We have X ∂Ps ∂θu ∂Ps = ∂βi ∂θu ∂βi u so that, assuming βi explains transition v X ∂θu ∂Ps = −` Ps = −`Xi Ps θv ∂βi ∂βi u (4) For the case that a t-transition was made, and βi explains transition v we get X ∂Pt ∂θu ∂Pt ∂Pt = = Xi θv ∂βi ∂θ ∂β ∂θv u i u (5) We have, provided v 6= t: ∂Pt ∂(1 − Ps ) θt θt P = − (1 − Ps ) P 2 ∂θv ∂θv θ ( u u u θu ) which becomes θt `Ps P u θu so that Pt `Ps θt − Pt −P = P , θ u u u θu `Ps θt − Pt ∂Pt = Xi θv P ∂βi u θu If v = t we get P ∂Pt ∂(1 − Ps ) θt u θu − θt P = + (1 − Ps ) P 2 ∂θt ∂θt θ ( u u u θu ) which becomes (1 − Ps ) Pt `Ps θt − Pt Pt + P −P = P + , θt u θu u θu u θu u θu θt `Ps P so that ∂Pt `Ps θt − Pt = Xi θt P + Xi Pt ∂βi u θu Note that rewriting of these expressions may yield numerically more favourable expressions. Note also that the i in βi and Xi does not refer to the individual (as we only look at a single individual here), but merely explanatory variable number i. In case βi explains more than one transitions, i.e. if we use the “motivation” approach of letting the hazard of one transition explain the hazard of another, the simplifications of the sum in equations (4) and (5) are not valid. 7 3.3 Exact duration Exact duration means that our data contains the exact time of transition, rather than an interval. Here we also organize our data into periods with constant covariates. The periods may be of arbitrary and different lengths. The likelihood for a period of length ` is the same as the interval-censored likelihood in (2): X Ps = exp(−` θu ). u For a t-transition it gets simpler: Pt = Ps θ t This is a density, rather than a probability. 3.3.1 Derivatives For the case that there is no transition, the derivative is the same as above. For transition to t, and βi explains transition t, we have ∂Ps ∂θt ∂Pt = θ t + Ps = −`Ps θt2 Xi + Ps θt Xi = Xi Pt (1 − `θt ). ∂βi ∂βi ∂βi If βi explains a transition different from t, the last term vanishes, so we get ∂Pt = −`θt Pt Xi ∂βi 3.4 Discrete data Discrete data means that the “transition” is defined directly in terms of timeperiods, e.g. like “main-activity in a week”, i.e. a transition has not occured at a particular point in time. In this case we model the likelihood as a “logit”: 1 Ps = P u θu and for a t-transition: Pt = Ps θ t . 3.4.1 Derivatives In case of no transition, we get ∂Ps = Ps (1 − Ps ) ∂βi and with transition to t we get, when βi explains a transition s 6= t: ∂Pt = −Pt Ps θs Xi . ∂βi When βi explains t: ∂Pt = Pt (1 − Pt )Xi . ∂βi 8 3.4.2 Uncertain transition In some cases one doesn’t know exactly which transition has been Ptaken, i.e. one only knows it’s in some set U . In this case the hazard is θ = U t∈U θt . Ps P will be the same as before, but PU = t∈U Pt . This functionality has not yet been implemented. 3.5 The individual likelihood If for each period i and masspoint j we compute a likelihood as above: ( Ps if no transition Lij = Pt if t-transition then the product Θ= Y Lij i is the individual likelihood as used in (1) 4 Duration dependence and implicit dummies We have briefly mentioned duration dependence. I.e. the hazard rate may depend on how long the individual has been in the current state. In the literature one sees various functional forms of duration dependence. One popular form has been the Weibull distribution, and there are other forms, decreasing, increasing and with various regular humps and bumps. We have used no such functional form. We depend on estimating duration dependence with dummy variables. E.g. for a dataset with monthly data with durations up to 5 years, we use 60 dummies, one for each month, to estimate duration dependence. In some applications we get collections of over 100 such dummies. Therefore we have optimized particularly for long strings of dummies. The user may e.g. specify an ordinal variable “dur” as implicit dummy from 1 to 60 with reference 13. “dur” should then have integer values between 1 and 60 and the estimation program will create 59 dummies (“dur1” – “dur60”, except the reference “dur13”) corresponding to each value of “dur”. This is not merely a convenience for the user, the fact that no more than 1 of the 59 dummies is non-zero at any time, is actively used to speed up the execution, effectively treating the string of dummies as one variable (which it really is), both with respect to speed and space. Since we have this flexible (though piecewise constant) duration dependence, it’s possible to track duration dependence which e.g. stems from pure policy decisions, e.g. we may see bumps prior to (partial) benefit exhaustion at 18 and 36 months. Implicit dummies may of course also be used to track calendar time dependence, e.g. the business cycle. Note that the duration dummies will enter the likelihood just like any other covariate, this means that we use a model which is known as “proportional 9 hazard”, or, taking into account the heterogeneity distribution, “mixed proportional hazard”. Note also that with this many parameters we require a lot of data to get sensible standard errors. Since we typically use these methods for analyzing register data this is not a big problem for us. 5 Random Coefficients Just like we modeled the heterogeneity with a discrete distribution, we may estimate random coefficients. From above, our individual hazard θitj = eXi ·β+vtj is replaced by θitj = eXi ·β+αj ri +vtj where ri is some covariate and αj is a random coefficient. Note that the unobserved heterogeneity is merely a special case of a random coefficient corresponding to a covariate which is constantly equal to 1. 6 Linear Predictors In addition to computing likelihoods for transitions, we may, for certain observations, augment the likelihood with a linear predictor. Continuing our example, when unemployed people get a job, it might be interesting to analyze how good a job it is, e.g. by looking at the wage. We update our likelihood to include a linear predictor for the (log)wage e.g. like Yi = Xi · γ + v + N (0, σ) where Yi is a covariate (the (log)wage), Xi are covariates (possibly different from the ones occuring in the hazard), γ is a set of parameters to estimate, v is heterogeneity and N (0, σ) is an error term normalized to have mean 0. The individual likelihood is φij = Nσ (Yi − (Xi · γ + vj )) where Nσ (x) = x2 1 √ e− 2σ2 σ 2π is the normal density with mean 0. This likelihood φij is multiplied with the hazard likelihood Θ in (1). We get parameters (γ, σ, vj ) to maximize in addition to the ones from the hazard. Interestingely, we have seen that with a linear predictor we benefit from changing our maximization algorithm slightly. We still do the same search for a new point, but we do not do a separate maximization over the distribution (as 10 mentioned in 2.2), instead we jump right to a maximization of all the parameters. This has, in our datasets, prevented the algorithm from repeatedly converging into a degenerate pit. We have no explanation for this, and it may very well be data-dependent. This is an area for further investigation. 7 Implementation The estimation as described above is quite time-consuming. Although it is possible to write down the likelihood and feed it to some general purpose statistical package, this has not proven useful for the datasets we have. We have resorted to write the procedure in the de facto standard language for high performance computations, FORTRAN 90. We have used MPI to do the computations in parallel (over the dataset). Since we want to have an easy way to specify a particular problem, we have written a preprocessor in perl, which takes a human-writeable specification and transforms it into problem-specific fortran code which is included in the estimation program. This has the additional benefit that many loop-limits in the program are compile-time constants, this aids optimization. Since Fortran hardly can be said to be a rational language for book-keeping, we have wrapped the fortran stuff in an R-package. This takes care of everything except actually computing the likelihood-function and its derivatives. This package is very far from being production quality. In addition we have written a web-frontend in PHP to submit problems to the Titan cluster at the University of Oslo. (A cluster which the Frisch Centre has a share in). 7.1 Speed As mentioned above, the estimation procedure is time-consuming, in particular with large datasets and large sets of parameters. In addition to coarse grain parallellism and implicit dummies, we have done quite detailed loop-level optimization. E.g. we try to ensure that all time-consuming loops are executed in the right order (forwards in memory) and that cache-lines are fully utilized before wasted. But, the single most important thing is to use a good blas and lapack and a good compiler. This also holds for the R post-processing routine. As an example of this, our Newton step uses the Fisher matrix, which can be computed from the outer product of the individual gradients ∇Li as X (∇Li )(∇Li )0 . i Doing this without further thought will be next to disastrous when the vector length becomes long (i.e. so long that the Fisher matrix no longer fits in the cache). To speed up this critical process we collect ∇Li for 32–128 individuals at a time (depending on the machine architecture), and compute their contribution to the Fisher matrix with the blas level 3 routine “dsyrk”. With a vector length 11 of 3000, the speed difference between a good and a poor blas may be an order of magnitude or more. 7.1.1 Differentiation of products In computing the gradient we need to compute the derivative of a product over a spell: I.e. we have a product pn = n Y fi . i=1 By taking the logarithm we may find a closed form formula for dn = (pn )0 . However, it might be faster to use a recurrence relation: dn = (pn )0 = (fn pn−1 )0 = fn0 pn−1 + fn dn−1 where pn = fn pn−1 , f0 = p0 = 1, and d0 = 0. However, it’s hard to do implicit dummies in this way. References [1] S. Gaure, K. Røed, and T. Zhang, Time and Causality: A Monte-Carlo Assessment of the timing-of-events approach, J. Econometrics (in print) (2007). [2] W. L. Goffe, G. D. Ferrier, and J. Rogers, Global Optimization of Statistical Functions with Simulated Annealing, J. Econometrics 60 (1994), 65–99. [3] J. Heckman and B. Singer, A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data, Econometrica 52 (1984), 271–320. [4] T. Lancaster, The Econometric Analysis of Transition Data, Cambridge University Press, 1992. [5] B. G. Lindsay, The Geometry of Mixture Likelihoods: A General Theory, Ann. Stat. 11 (1983), 86–94. 12