A Quasi-Newton Algorithm for Efficient Computation of Gehan Estimates

Matthias Conrad(1) and Brent A. Johnson(2)

Technical Report 10-02, February 25, 2010
Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road, N.E., Atlanta, Georgia 30322

(1) Department of Mathematics and Computer Science and Computational and Life Sciences Strategic Initiative, Emory University, Atlanta, GA 30322, U.S.A. (Email: conrad@mathcs.emory.edu)
(2) Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Rd., N.E., Atlanta, GA 30322, U.S.A. (Email: bajohn3@emory.edu)

Corresponding author: Dr. Matthias Conrad; Telephone: (404) 727-7591; Fax: (404) 727-5611; Email: conrad@mathcs.emory.edu

Short title: Efficient computation of Gehan estimates

Address for correspondence: Brent A. Johnson, Department of Biostatistics, Rollins School of Public Health, Emory University, 1518 Clifton Rd., NE, Atlanta, GA 30322, U.S.A. Email: bajohn3@emory.edu

Abstract

The analysis of lifetime data is an important research area in statistics, particularly among econometricians and biostatisticians. The two most popular semi-parametric models are the proportional hazards model and the accelerated failure time (AFT) model. The proportional hazards model is computationally advantageous over virtually any other competing semi-parametric model because the ubiquitous maximum partial likelihood estimator is computed using ordinary Newton methods. Rank-based estimation in the semi-parametric AFT model is computationally more challenging, for example, because the Hessian matrix is not directly estimable without nonparametric smoothing. Recently, authors showed that the rank-based estimators may be written as the solution to a linear programming problem. Unfortunately, the linear programming problem has O(n^2) unknowns subject to n^2 linear constraints, where n denotes the sample size. Thus, the linear programming solution to rank-based estimation is restricted by well-known computational limitations of linear programming and is impractical for many applications. In this paper, we describe quasi-Newton methods for rank-based estimation in the semi-parametric AFT model. The algorithm converges super-linearly and is computationally efficient, so the computational cost remains low even for large data sets. We illustrate the algorithm through speed trials and three real data sets.

1 Introduction

Survival analysis is a ubiquitous concept in statistics, and its semi-parametric models and estimators are well known. Cox's (1972) proportional hazards model, for example, has been studied extensively for nearly four decades and the paper is among the most widely cited statistics papers in the scientific literature. This paper proposes new computational strategies and algorithms for rank-based estimation in the semi-parametric accelerated failure time model (Cox and Oakes, 1984; Kalbfleisch and Prentice, 2002), a popular alternative to the proportional hazards model.
The accelerated failure time (AFT) model asserts that the endpoint T_i is linearly related to the explanatory variables after logarithmic transformation, i.e.

    \log T_i = x_i^\top \beta + e_i, \qquad i = 1, \ldots, n,        (1)

where x_i is a p-vector of fixed predictors for the i-th subject, β is a p-vector of regression coefficients, and (e_1, ..., e_n) are independent and identically distributed errors with unspecified distribution function F. If C_i is a stochastic, subject-specific censoring variable, then the observed data are {(Z_i, δ_i, x_i)}, i = 1, ..., n, where Z_i = min(T_i, C_i), δ_i = I(T_i ≤ C_i), and I(·) is the indicator function. The goal is to estimate the regression coefficients β using the observed data.
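As a concrete illustration of the observed-data structure {(Z_i, δ_i, x_i)}, the following Python sketch generates right-censored data from model (1) with normal errors and uniform censoring, the same design used later in the speed trials of Section 3.2. The code and all names in it are ours (illustrative only); the authors' own implementation is in Matlab and is not reproduced here.

```python
import numpy as np

def simulate_aft_data(n, beta, sigma=1.5, kappa=20.0, seed=0):
    """Generate right-censored data (log Z_i, delta_i, x_i) from model (1).

    Illustrative sketch; names and defaults are ours.  kappa is the upper
    limit of the unif(0, kappa) censoring variable and would be tuned to
    reach a desired censoring rate (e.g. 30% as in Section 3.2).
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    x = rng.standard_normal((n, len(beta)))            # fixed predictors x_i
    log_t = x @ beta + sigma * rng.standard_normal(n)  # log T_i = x_i' beta + e_i
    log_c = np.log(rng.uniform(0.0, kappa, size=n))    # censoring C_i ~ unif(0, kappa)
    log_z = np.minimum(log_t, log_c)                   # log Z_i, Z_i = min(T_i, C_i)
    delta = (log_t <= log_c).astype(float)             # delta_i = I(T_i <= C_i)
    return log_z, delta, x
```

For example, `log_z, delta, x = simulate_aft_data(200, [1.0, -0.5])` produces a data set with n = 200 and p = 2; only log Z_i is returned because the estimating functions below depend on Z_i only through its logarithm.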
Prentice (1978) proposed linear rank tests for the slope parameter in (1), and these tests formed the basis for subsequent coefficient estimators. Rank-based estimation of the regression coefficients β in (1) was first considered by Louis (1981) and then Wei and Gail (1983). Tsiatis (1990) proposed a system of estimating equations by inverting the linear rank tests, and further developments generally follow this framework (see also Lai and Ying, 1991; Ying, 1993). From its inception, coefficient estimation without nonparametric smoothing has been difficult. Until recently, rank-based estimation algorithms for censored data were limited to grid searches. Most notably, Lin and Geyer (1992) explained how simulated annealing performs a stochastic grid search of the parameter space and can be used to solve a general system of estimating equations. Naturally, the computational cost of grid searches precluded the use of rank-based estimators outside of academic settings. Moreover, simulated annealing is not guaranteed to produce a true minimum. Two decades after Prentice (1978) proposed the rank-based estimator, Lin et al. (1998) provided another estimation algorithm for modest sample sizes by showing that the Gehan estimate may be computed through linear programming. However, the linear programming problem has n^2 + p unknown parameters subject to n^2 linear constraints, and the size of the optimization problem defeats many "out-of-the-box" linear solvers familiar to statisticians (e.g., linprog in Matlab) for sample sizes as low as n = 40 and p = 2. Later, Jin et al. (2003) provided an effective numerical strategy by rewriting the linear programming problem as an equivalent median regression problem. The significance of the latter development was that algorithms for sparse quantile regression (Koenker and Bassett, 1978; Koenker and D'Orey, 1987; Koenker and Ng, 2005) could be used, and those algorithms were already available in standard statistical software. This latter algorithm is the current state-of-the-art algorithm for rank-based estimation in the semi-parametric accelerated failure time model.

Survival analysis is a core component of (bio)statistics, and many clinical trials rely on these methods to draw statistical inference on the effect of new therapies for disease recurrence, incident disease, or mortality. Average clinical trials range in size from n = 1,000 to n = 5,000 subjects, while large clinical trials may enroll several million. Unfortunately, the computation and storage of the rank-based estimator grow like the square of the analogous median regression estimator. So, a rank-based estimate for n = 1,000 has the computational complexity of a median regression estimate for n = 10^6, a rank-based estimate for n = 10^5 has the computational complexity of a median regression estimate for n = 10^10, and so on. For sample sizes typical of clinical trials, calculating rank-based regression estimates via linear programming is therefore slow, even on modern desktop computers. On the other hand, Cox's (1972) maximum partial likelihood estimate can often be calculated in a matter of seconds, even for very large sample sizes. Hence, the computational cost of the rank-based estimator is a huge deterrent, and practitioners often use the Cox model in clinical trials even amidst evidence that the model assumptions are not well supported by the data. A better algorithm for the rank-based estimator would allow investigators and users to choose an estimator based on the science rather than merely on the convenience and cost of the computational algorithm.

The premise of this paper is that the Gehan loss function is itself convex and minimizing it directly will be more efficient than rewriting the optimization with constraints. We perform unconstrained optimization on a modified version of the Gehan loss function; the original Gehan estimate is then defined as the limit of a sequence of modified estimates. Our algorithm is built on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, which provides a solid foundation for convex optimization. Quasi-Newton methods, such as BFGS, provide curvature information on the search direction at each iteration without calculating the Hessian directly. Our results will show that this optimization technique works very efficiently for large sample sizes n where the existing method is impractical or fails completely. Because weighted logrank estimators are fundamental concepts in survival analysis and semi-parametric modeling and theory, we feel the methods are sufficiently important for a rather general audience.

2 Background

The weighted logrank estimator β̂ is defined as a zero-crossing of the estimating function

    \Psi(\beta) = n^{-1} \sum_{i=1}^{n} \delta_i \, \phi\{e_i(\beta), \beta\} \, \big[ x_i - \bar{x}\{e_i(\beta), \beta\} \big],        (2)

where e_i(β) = log Z_i − x_i^⊤β, φ is a data-dependent weight function, and x̄(t, β) = Σ_{j=1}^n x_j I{e_j(β) ≥ t} / Σ_{j=1}^n I{e_j(β) ≥ t}. We define a zero-crossing as an estimator β̂ which satisfies Ψ_j(β̂−)Ψ_j(β̂+) ≤ 0 for all j = 1, ..., p, where Ψ_j(β−)Ψ_j(β+) = lim_{γ→0} Ψ_j(β − γu_j)Ψ_j(β + γu_j), u_j is the j-th canonical unit vector, and Ψ(β) = (Ψ_1, ..., Ψ_p)^⊤. Two weight functions of significant interest are φ(t, β) = 1 and φ(t, β) = n^{-1} Σ_{j=1}^n I{e_j(β) ≥ t}, which correspond to the log-rank (Mantel, 1966) and Gehan (1965) weights, respectively. Under the latter weight function, the Gehan estimating function is simply

    \Psi(\beta) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \, (x_i - x_j) \, I\{e_i(\beta) \le e_j(\beta)\}.        (3)

The Gehan estimating function in (3) is the p-dimensional gradient of the following convex loss function,

    f_G(\beta) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \, \{e_i(\beta) - e_j(\beta)\}^-,        (4)

where c^- = max(−c, 0). The Gehan estimator is then defined as β̂_G = arg min f_G(β). Recently, Jin et al. (2003) argued that one may estimate the sampling variability of the Gehan estimate through a novel resampling scheme whereby one perturbs the loss function f_G with a vector of independent and identically distributed random variables with unit mean and variance. It is common to use 1000 resampled estimates for standard error estimation and, in Section 4, we use this number to exemplify the computational cost of statistical inference for Gehan estimates. Thus, efficient numerical optimization of f_G is the lifeline of rank-based estimation and inference in the AFT model.
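For concreteness, (3) and (4) can be evaluated directly by looping over all n^2 pairs. The vectorized Python sketch below (ours, not the authors' code) stores the full n × n matrix of residual differences, which is exactly the O(n^2) memory footprint discussed in the Introduction.

```python
import numpy as np

def gehan_loss(beta, log_z, delta, x):
    """Gehan loss f_G(beta) in (4): n^{-2} sum_{i,j} delta_i {e_i - e_j}^-,
    with a^- = max(-a, 0).  Vectorized over all n^2 pairs."""
    n = len(log_z)
    e = log_z - x @ beta                       # residuals e_i(beta)
    diff = e[:, None] - e[None, :]             # (i, j) entry: e_i(beta) - e_j(beta)
    return np.sum(delta[:, None] * np.maximum(-diff, 0.0)) / n**2

def gehan_estfun(beta, log_z, delta, x):
    """Gehan estimating function Psi(beta) in (3), the gradient of f_G."""
    n = len(log_z)
    e = log_z - x @ beta
    ind = (e[:, None] <= e[None, :]).astype(float)   # I{e_i <= e_j}
    w = delta[:, None] * ind                         # delta_i I{e_i <= e_j}
    # sum_{i,j} w_ij (x_i - x_j) split into row-sum and column-sum contributions
    return (w.sum(axis=1) @ x - w.sum(axis=0) @ x) / n**2
```

The point of this paper is precisely that f_G itself is non-differentiable, so the next step replaces it with a smooth surrogate before handing it to a gradient-based optimizer.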
Lin et al. (1998) showed that minimizing f_G(β) is equivalent to the linear programming (LP) problem

    \min_{u, \beta} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \, u_{ij}, \qquad \text{subject to } u_{ij} \ge -\{e_i(\beta) - e_j(\beta)\} \text{ and } u_{ij} \ge 0 \quad \forall\, i, j.        (5)

Jin et al. (2003) showed how algorithms developed for quantile regression could be used to solve the optimization problem in (5). Applying barrier methods for interior point programming, Jin et al. (2003) suggest one minimize the loss function

    \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \, |e_i(\beta) - e_j(\beta)| \; + \; \Big| M - \beta^\top \sum_{k=1}^{n} \sum_{l=1}^{n} \delta_k (x_l - x_k) \Big|,        (6)

for a large number M. The significance of the latter expression (6) is that one may use standard algorithms for median regression (Koenker and D'Orey, 1987; Koenker and Ng, 2005) to calculate the Gehan estimate. At the time of writing this paper, minimizing the expression in (6) via the quantreg package in R is the current state-of-the-art algorithm for the Gehan estimator.

It is well known that, in terms of computational complexity and cost, linear programming is no panacea (cf. Dennis and Schnabel, 1983; Nocedal and Wright, 2006, and references therein), particularly for moderately sized problems. Moreover, while the interior point methods of Koenker and colleagues (Koenker and Bassett, 1978; Koenker and D'Orey, 1987; Koenker and Ng, 2005) work well for quantile regression, the computational complexity of the rank-based optimization problem via (6) overwhelms many standard desktop computers. Naturally, outside of academic settings (i.e., where both n and p are small), the computational cost of (6) for practical problems (e.g., clinical trials) deters many consumers of rank-based estimators in the AFT model.

3 Algorithm

3.1 The approximation

For our algorithmic implementation, we start with the convex loss function f_G(β) and consider a simple smooth approximation to it. Define the approximating loss function

    f_\varepsilon(\beta) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \, c_\varepsilon\{e_i(\beta) - e_j(\beta)\},

where c_ε is a sufficiently smooth real-valued function. In this paper, we define

    c_\varepsilon(x) = \begin{cases}
      0, & \text{if } x < -\varepsilon, \\
      -\dfrac{1}{16\varepsilon^3}(x + \varepsilon)^4 + \dfrac{1}{4\varepsilon^2}(x + \varepsilon)^3, & \text{if } x \in [-\varepsilon, \varepsilon], \\
      x, & \text{if } x > \varepsilon,
    \end{cases}        (7)

with sufficiently small ε (e.g., ε = 10^{-6}). We note that other definitions of c_ε are possible. Now, define the minimizer of the approximate loss function in (7), i.e.

    \hat{\beta}_\varepsilon = \arg\min_{\beta} f_\varepsilon(\beta).        (8)

Given our definition of c_ε, it is evident that lim_{ε→0} f_ε(β) = f_0(β) = f_G(β). Furthermore, by definition, β̂_ε is a global minimizer of a twice-differentiable convex loss function for every ε > 0. However, β̂_ε need not be unique; the set of minimizers may contain more than one point, as is also the case for β̂_G (see Jin et al., 2003). At the same time, both β̂_ε and β̂_G converge to unique solutions as the sample size n converges to infinity.
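To make the approximation concrete, here is a sketch of c_ε in (7), its derivative, and the resulting smoothed loss with its gradient. One caveat on our reading: since (7) smooths the positive-part function max(x, 0), we apply c_ε to e_j(β) − e_i(β) = −{e_i(β) − e_j(β)} so that, as ε → 0, the smoothed loss recovers f_G exactly as written in (4), where {a}^- = max(−a, 0). This composition, and all names below, are ours.

```python
import numpy as np

def c_eps(x, eps=1e-6):
    """Smooth C^2 approximation (7) to the positive part max(x, 0)."""
    x = np.asarray(x, dtype=float)
    mid = -(x + eps)**4 / (16.0 * eps**3) + (x + eps)**3 / (4.0 * eps**2)
    return np.where(x < -eps, 0.0, np.where(x > eps, x, mid))

def c_eps_grad(x, eps=1e-6):
    """Derivative of c_eps; a smooth approximation to I(x > 0)."""
    x = np.asarray(x, dtype=float)
    mid = -(x + eps)**3 / (4.0 * eps**3) + 3.0 * (x + eps)**2 / (4.0 * eps**2)
    return np.where(x < -eps, 0.0, np.where(x > eps, 1.0, mid))

def f_eps_and_grad(beta, log_z, delta, x, eps=1e-6):
    """Smoothed Gehan loss f_eps and its gradient (our reading of Section 3.1).

    c_eps is applied to e_j - e_i, so c_eps(e_j - e_i) -> {e_i - e_j}^- as
    eps -> 0 and the smoothed loss tends to f_G in (4).
    """
    n = len(log_z)
    e = log_z - x @ beta
    diff = e[None, :] - e[:, None]                  # (i, j) entry: e_j(beta) - e_i(beta)
    loss = np.sum(delta[:, None] * c_eps(diff, eps)) / n**2
    w = delta[:, None] * c_eps_grad(diff, eps)      # weights of d(e_j - e_i)/dbeta = x_i - x_j
    grad = (w.sum(axis=1) @ x - w.sum(axis=0) @ x) / n**2
    return loss, grad
```

As ε → 0 the gradient returned here tends to the Gehan estimating function (3), which is the property the quasi-Newton machinery below relies on.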
3.2 Quasi-Newton methods

The computational advantage of our approximation lies in the smoothness and convexity of f_ε in (8). Because f_ε possesses these properties, we may use gradient-based optimization algorithms, for which optimization theory provides numerous iterative methods. The foremost gradient-based optimization algorithm is Newton's method, which converges locally at a quadratic rate and uses the gradient ∇f_ε and Hessian ∇²f_ε to form the search direction and step length at each iteration. However, numerical calculation of the Hessian of the smoothed Gehan loss comes at an unreasonable cost.

The fundamental concept behind quasi-Newton methods is to provide curvature information about the loss function f_ε in order to calculate an efficient search direction at each iteration without calculating the Hessian matrix explicitly. Below, we delineate the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method (Nocedal and Wright, 2006), which provides an efficient and fast algorithm for calculating the rank-based Gehan estimate. We then describe the limited-memory BFGS (L-BFGS) method to solve problems with large numbers of predictors p.

Let β_{ℓ+1} be the (ℓ+1)-th iterate of the regression coefficients in (1). Define the objective function and its gradient at step ℓ+1 as f_{ℓ+1} = f_ε(β_{ℓ+1}) and g_{ℓ+1} = ∇f_{ℓ+1}, respectively. Then, the BFGS search direction s_{ℓ+1} is given by s_{ℓ+1} = −H_{ℓ+1} g_{ℓ+1}, with H_{ℓ+1} given by

    H_{\ell+1} = \arg\min_{H} \|H - H_\ell\|_F \qquad \text{subject to } H = H^\top \text{ and } H y_\ell = v_\ell,

where y_ℓ = g_{ℓ+1} − g_ℓ, v_ℓ = β_{ℓ+1} − β_ℓ, ‖·‖_F denotes a weighted Frobenius matrix norm, and H_0 is an initial, user-chosen symmetric positive definite matrix (e.g., H_0 = I). Its unique solution is given by the update formula (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970)

    H_{\ell+1} = V_\ell^\top H_\ell V_\ell + \rho_\ell \, v_\ell v_\ell^\top,        (9)

where V_ℓ = I − ρ_ℓ y_ℓ v_ℓ^⊤ and ρ_ℓ = 1/(y_ℓ^⊤ v_ℓ).

In ordinary Newton methods, both the gradient and the Hessian are calculated and a linear system is solved at every iteration. Quasi-Newton methods are similar in that the gradient g_ℓ and a matrix H_ℓ are calculated at each iteration ℓ. However, as shown in equation (9), quasi-Newton methods differ from ordinary Newton methods in that the former avoid solving a linear system explicitly, as such systems may be large or possibly ill-conditioned. Thus, H_ℓ contains information about the inverse of the Hessian of the loss function in a local neighborhood of β_ℓ.

The L-BFGS method is especially useful for solving large optimization problems (large p), where the storage or computation of the matrix H_ℓ is unreasonably costly. Instead of using the full p × p matrix H_ℓ, the L-BFGS algorithm stores m vectors of size p with m ≪ p. These vectors contain the curvature information of the m most recent iterations, and the update formula for the m vectors of the L-BFGS method can be derived from the BFGS formula by rewriting (9) as

    H_\ell = \sum_{j=0}^{m} \rho_{\ell-m-1+j} \Bigg( \prod_{k=1}^{m-j} V_{\ell-k}^\top \Bigg) v_{\ell-m-1+j} \, v_{\ell-m-1+j}^\top \Bigg( \prod_{k=1}^{m-j} V_{\ell-k} \Bigg),

with the conventions ρ_{ℓ−m−1} = 1 and v_{ℓ−m−1} v_{ℓ−m−1}^⊤ = H_0 (cf. Dennis and Schnabel, 1983). An efficient implementation of the L-BFGS updating strategy is presented in the pseudo-code of Algorithm 1, lines 5 and 10. Note that if m = p, L-BFGS is identical to the BFGS method.

The strong Wolfe line search method (cf. Nocedal and Wright, 2006, p. 31) provides an efficient step length α_ℓ, which determines the new iterate β_{ℓ+1} = β_ℓ + α_ℓ s_ℓ. We use standard stopping criteria for smooth unconstrained optimization,

    |f_{\ell+1} - f_\ell| < \tau \,(1 + |f_{\ell+1}|), \qquad
    \|g_\ell\| < \sqrt[3]{\tau}\,(1 + |f_{\ell+1}|), \qquad
    \|\beta_{\ell+1} - \beta_\ell\| < \sqrt{\tau}\,(1 + \|\beta_{\ell+1}\|), \qquad
    \|g_\ell\| < \epsilon_{\mathrm{mach}},        (10)

with the Euclidean norm ‖·‖ and the machine accuracy ϵ_mach (cf. Gill et al., 1981). By default we choose τ = 10^{-6}. The principal advantage of applying the L-BFGS algorithm to the approximate Gehan loss function f_ε is local super-linear convergence (Nocedal and Wright, 2006) to a local minimizer, which results in fast and efficient computation for even large data sets.
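In practice one need not code BFGS and the Wolfe line search by hand; any quasi-Newton routine that accepts a loss and its gradient will do. Below is a minimal sketch (ours, not the authors' Matlab code) using SciPy's L-BFGS-B implementation with the smoothed loss from the previous snippet; the ftol/gtol options only roughly mimic the criteria in (10).

```python
import numpy as np
from scipy.optimize import minimize

def gehan_qn(log_z, delta, x, eps=1e-6, tau=1e-6, memory=10):
    """Minimize the smoothed Gehan loss f_eps with a limited-memory quasi-Newton method.

    Illustrative sketch: SciPy's L-BFGS-B uses its own line search and
    stopping rules, which only approximate the criteria in (10).
    """
    p = x.shape[1]
    obj = lambda b: f_eps_and_grad(b, log_z, delta, x, eps)   # returns (loss, gradient)
    res = minimize(obj, x0=np.zeros(p), jac=True, method="L-BFGS-B",
                   options={"maxcor": memory, "ftol": tau, "gtol": np.cbrt(tau)})
    return res.x, res.fun
```

For instance, applying `gehan_qn` to the output of `simulate_aft_data` reproduces, in outline, the simulation study described next.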
We assess the speed of our algorithm through a simple simulation study. We simulated p independent, standard normal predictors x = (x^{(1)}, ..., x^{(p)})^⊤, independently simulated the coefficient vector β from a standard normal distribution, and then simulated the true response log T according to the linear model

    \log T = \beta_1 x^{(1)} + \cdots + \beta_p x^{(p)} + W,

where W is a normal random variable with mean zero and standard deviation σ = 1.5. A censoring random variable was simulated according to a unif(0, κ) distribution, where κ was chosen to yield 30% censored observations. The observed random pair (Z, δ) is defined accordingly. The data were independently and identically distributed for n subjects, where n ranges from 100 to 5000. Using the convergence criteria described above, we display in Figure 1 the clock time required to compute the Gehan coefficient estimates. These timings were taken on an Apple MacBook Pro (Model MB134LL/A) with a 2.4 GHz Intel Core 2 Duo and 4 GB 667 MHz DDR2 RAM (OS X version 10.5.7). Our code is written in the interpreted Matlab language (version 7.8.0.347, R2009a), and CPU times may be even faster if our code were rewritten in a compiled language like Fortran.

[Figure 1 about here. Caption: CPU times for computation of Gehan estimates. For each sample size n, we simulate 10 data sets generated with σ = 1.5 and p = 2 (dotted curve), p = 10 (dashed curve), and p = 50 (solid curve). Axes: sample size n (horizontal) versus CPU time in seconds (vertical).]

Our method is restricted only by the storage size of the design matrix X ∈ R^{n×p}. Compared to the search space of size n^2 + p for the linear programming problem (6), the convex optimization problem (8) is of size p. By contrast, our simulation in Figure 1 with n = 5000 reflects a linear program of size roughly 25,000,000. In case p is large in comparison to n, the linear system (1) might be underdetermined and problem (8) will therefore be ill-posed. Here, regularization methods might be useful to yield reasonable results; this is open for further investigation.

3.3 The error in the approximation

The approach we propose is based on a modified Gehan loss function in which we approximate a non-differentiable function with a smoothed, differentiable one. It is natural to wonder how poorly our modified loss function f_ε approximates the true Gehan loss function f_G. We consider the error in the approximation below.

In order to apply gradient-based methods, we consider the modified Gehan loss function f_ε defined through c_ε. Then, for every β ∈ R^p, we get the following worst-case error of the loss function:

    |f_\varepsilon(\beta) - f_0(\beta)|
      = \Bigg| \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \big[ c_\varepsilon\{e_i(\beta) - e_j(\beta)\} - c_0\{e_i(\beta) - e_j(\beta)\} \big] \Bigg|
      \le \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \big| c_\varepsilon\{e_i(\beta) - e_j(\beta)\} - c_0\{e_i(\beta) - e_j(\beta)\} \big|
      \le \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \max_x \big| c_\varepsilon(x) - c_0(x) \big|
      = \frac{3}{16}\,\varepsilon,

where the last line follows from

    \max_x |c_\varepsilon(x) - c_0(x)| = |c_\varepsilon(0) - c_0(0)| = \frac{3}{16}\,\varepsilon.

Hence, the error of the approximate loss function at any value in the parameter space is bounded above by 3ε/16.

Algorithm 1: L-BFGS

Require: f ∈ C²: R^p → R (objective function); β_0 ∈ R^p (initial value); m ∈ N and H_0 ∈ R^{p×p} symmetric positive definite (user-chosen parameters)
 1: ℓ ← 0                                                        (set counter)
 2: calculate f_0 ← f(β_0), g_0 ← g(β_0)                          (function evaluations)
 3: repeat
 4:   check stopping criteria, e.g. equation (10)
 5:   q_ℓ ← g_ℓ                                                   (limited-memory quasi-Newton direction)
      for j = m, m−1, ..., 1 do
        α_j ← ρ_j v_j^⊤ q_ℓ and q_ℓ ← q_ℓ − α_j y_j
      end for
      if ℓ = 0 then r_ℓ ← H_0 q_ℓ else r_ℓ ← (v_m^⊤ y_m)/(y_m^⊤ y_m) q_ℓ end if
      for j = 1, ..., m do
        r_ℓ ← r_ℓ + v_j (α_j − ρ_j y_j^⊤ r_ℓ)
      end for
 6:   s_ℓ ← −r_ℓ                                                  (choose search direction)
 7:   calculate the step size α_ℓ via a strong Wolfe line search
 8:   β_{ℓ+1} ← β_ℓ + α_ℓ s_ℓ                                      (update step)
 9:   calculate f_{ℓ+1} ← f(β_{ℓ+1}), g_{ℓ+1} ← g(β_{ℓ+1})         (function evaluations)
10:   v_j ← v_{j+1}, y_j ← y_{j+1}, and ρ_j ← ρ_{j+1} for j = 1, ..., m−1        (update memory)
      v_m ← β_{ℓ+1} − β_ℓ,  y_m ← g_{ℓ+1} − g_ℓ,  ρ_m ← 1/(y_m^⊤ v_m)
11:   ℓ ← ℓ + 1
12: end repeat
13: β̂ ← β_ℓ
Ensure: β̂ = arg min f(β) is a local minimizer and f(β̂) a local minimum
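Lines 5 and 10 of Algorithm 1 are the standard L-BFGS two-loop recursion and memory update. The following Python transcription (ours) may help readers map the pseudo-code to an implementation; the lists V and Y hold the stored pairs v_j and y_j, oldest first, with at most m entries each.

```python
import numpy as np

def lbfgs_direction(g, V, Y, h0_scale=1.0):
    """Two-loop recursion: return the search direction s = -H g (Algorithm 1, lines 5-6)."""
    q = np.array(g, dtype=float)
    rho = [1.0 / (y @ v) for v, y in zip(V, Y)]
    alpha = [0.0] * len(V)
    for j in range(len(V) - 1, -1, -1):          # backward pass, newest pair first
        alpha[j] = rho[j] * (V[j] @ q)
        q -= alpha[j] * Y[j]
    if V:                                        # scaled initial matrix (v_m'y_m / y_m'y_m) I
        r = (V[-1] @ Y[-1]) / (Y[-1] @ Y[-1]) * q
    else:                                        # first iteration: r = H_0 q with H_0 = h0_scale * I
        r = h0_scale * q
    for j in range(len(V)):                      # forward pass, oldest pair first
        beta_corr = rho[j] * (Y[j] @ r)
        r += V[j] * (alpha[j] - beta_corr)
    return -r                                    # s_l = -r_l
```

Each call costs O(mp) operations and O(mp) storage, which is the saving over the dense p × p matrix H_ℓ emphasized in Section 3.2.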
4 Examples

In this section, we present three worked examples to illustrate the methods. Explanatory variables have been standardized to have mean zero. The following three examples use data sets familiar to statisticians that are freely available online.

4.1 Multiple Myeloma

We first report our results from a study of multiple myeloma (Krall et al., 1975). The study included a total of n = 65 patients who were treated with alkylating agents; 48 of them died and the remaining 17 survived. As in Jin et al. (2003), we include two covariates in our analyses: hemoglobin (HGB) and the logarithm of blood urea nitrogen (BUN). Using our method with ε = 0.01 and standardized data, we found the coefficient estimates to be −0.5318 and 0.2923 for BUN and HGB, respectively. Using the method of Jin et al. (2003), the parameter estimates are −0.5316 and 0.2922.

Due to the small sample size and number of predictors, the multiple myeloma data set provides an excellent opportunity to visualize the L-BFGS algorithm in practice. In Figure 2, we display the progress of the algorithm starting from the initial value β_0 = (0, 0)^⊤. The L-BFGS iterates step rapidly downhill, converging to the local minimizer in 7 iterations. The last three iterations are difficult to identify in Figure 2 because they are so close to one another. As a complement to the graphical display of the search algorithm in Figure 2, the sequential parameter estimates are also given in Table 1.

[Figure 2 about here. Caption: Search path of the L-BFGS algorithm applied to the multiple myeloma data set with initial β_0 = (0, 0)^⊤ and ε = 0.01. The local minimum is at β̂ = (−0.53160, 0.29222)^⊤ with f_ε(β̂) = 0.39764. Axes: β_1 (horizontal) versus β_2 (vertical).]

Table 1: Search path of the L-BFGS algorithm applied to the multiple myeloma data set with initial value β_0 = (0, 0)^⊤ and ε = 0.01.

  step      β_1        β_2      f_ε(β)
    0      0.00000    0.00000   0.46846
    1     -0.20132    0.12241   0.42335
    2     -0.55482    0.28094   0.39791
    3     -0.52136    0.30626   0.39775
    4     -0.53458    0.29458   0.39764
    5     -0.53259    0.29147   0.39764
    6     -0.53088    0.29251   0.39764
    7     -0.53160    0.29222   0.39764

Finally, as a footnote on the error of the approximate loss function f_ε compared to the original Gehan loss function f_G, we illustrate the differences for the multiple myeloma example in Figure 3. Each black line in Figure 3 represents a line of non-differentiability of the Gehan loss f_G; these are the points where constraints of the linear optimization problem (5) hold with equality (i.e., where e_i(β) = e_j(β) for some pair), and they mark where the error between the modified and original Gehan loss functions is maximal (approximately 3.0 × 10^{-5}). An important message is that such lines appear nearly everywhere, with the exception of small white pockets where the error is smallest. It is clear that the modified loss function is a very close approximation to the original loss function, which confirms our analytical calculations in Section 3. We find the error in the loss function at the local minimizer to be f_ε(β̂_ε) − f_G(β̂_ε) = 2.8586 × 10^{-5}.

[Figure 3 about here. Caption: Error map of f_ε − f_G for the multiple myeloma data set with ε = 0.01. Axes: β_1 (horizontal) versus β_2 (vertical); error values on the order of 10^{-5}.]
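Iterate paths such as the one in Table 1 are straightforward to record with an iteration callback. The sketch below relies on the same assumed helper functions as the earlier snippets (again, names are ours and this is not the authors' code).

```python
import numpy as np
from scipy.optimize import minimize

def gehan_qn_path(log_z, delta, x, eps=0.01, tau=1e-6):
    """Minimize the smoothed Gehan loss and record the iterates, as in Table 1.

    eps = 0.01 mirrors the choice used for the multiple myeloma example.
    """
    p = x.shape[1]
    obj = lambda b: f_eps_and_grad(b, log_z, delta, x, eps)
    path = [np.zeros(p)]                                   # beta_0 = (0, ..., 0)
    res = minimize(obj, x0=path[0], jac=True, method="L-BFGS-B",
                   options={"ftol": tau},
                   callback=lambda bk: path.append(np.array(bk)))
    return res.x, np.array(path)
```

The returned array contains one row per iteration and can be tabulated or plotted over a contour of f_ε to reproduce displays in the spirit of Figure 2.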
4.2 Mayo PBC

The Mayo primary biliary cirrhosis (PBC) data (Fleming and Harrington, 1991, Appendix D.1) contain information about the survival time and prognostic variables for 418 patients who were eligible to participate in a randomized study of the drug D-penicillamine. Of the 418 patients who met standard eligibility criteria, a total of 312 participated in the randomized portion of the study. Using the smaller randomized cohort, the study investigators used stepwise deletion to build a Cox proportional hazards (PH) model for the natural history of PBC (Dickson et al., 1989). Of the original ten predictors, stepwise deletion selected five significant variables: age, albumin, bilirubin, edema, and prothrombin time (protime). We take the natural logarithmic transformation of albumin, bilirubin, and prothrombin time to conform to the analysis in Fleming and Harrington (1991). These five variables constitute the natural history model for PBC (Dickson et al., 1989).

We present the coefficient estimates using quasi-Newton methods and linear programming in Table 2. We note that the two different algorithms lead to solutions that are identical to four decimal places. Although our quasi-Newton algorithm runs in less than one-half of one second, the linear programming method of Jin et al. (2003) is still reasonable at 5.3 seconds.

Table 2: Coefficient estimates for the Mayo PBC data

  Parameter     New          Jin et al.
  age          -0.2706201   -0.2706168
  albumin       0.2042720    0.2042710
  bilirubin    -0.5941747   -0.5941735
  edema        -0.2236625   -0.2236607
  protime      -0.2372911   -0.2372882

4.3 Nursing Home

From 1980 to 1982, the National Center for Health Services Research conducted a study to determine the effect of financial incentives on variation in patient care in nursing homes. In particular, 18 out of 36 nursing homes from San Diego, California, received higher per diem payments for accepting and admitting Medicaid patients, and additional bonuses when the patient's prognosis improved. The study collected data from an additional 18 control nursing homes where no financial incentives were used. A complete description is given in Morris, Norton, and Zhou (1994). The total sample size from all 36 nursing homes is n = 1601. Our data set consists of seven covariates: treatment (trt), age, sex, marital status (mar.stat.), and three health status indicators (h1–h3), ranging from the best health to the worst health. The coefficient estimates are displayed in Table 3.

Table 3: Coefficient estimates for the nursing home data

  Parameter      New           Jin et al.
  trt           0.14407872    0.14407174
  age           0.09617243    0.09616416
  sex          -0.62928306   -0.62928863
  mar.stat.    -0.25233286   -0.25234432
  h1           -0.09115733   -0.09115354
  h2           -0.58666473   -0.58664576
  h3           -1.07069675   -1.07070227

The sample size of the nursing home data is sufficiently large that the algorithm makes a significant impact. Using the algorithm of Jin et al. (2003) along with the Barrodale-Roberts simplex optimization (Koenker and D'Orey, 1987) via quantreg in R, the computation fails. However, the improved Frisch-Newton algorithm (Koenker and Ng, 2005) performs better and finishes in just under two minutes (i.e., 1.75 minutes on our MacBook Pro running R 2.9.1). For the nursing home data set, our quasi-Newton algorithm runs in ten seconds. To accentuate the differences in CPU times, consider computing standard error estimates using the resampling scheme of Jin et al. (2003) with 1000 resamples. Then, the method of Jin et al. (2003) runs in approximately 30 hours, compared with our algorithm, which takes two minutes and forty-five seconds.
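To sketch how such a resampling-based standard error computation might be organized with the smoothed loss, the code below re-minimizes a perturbed objective for each replicate, weighting the (i, j) term by ξ_i ξ_j with ξ_1, ..., ξ_n independent standard exponentials (unit mean and variance). This is only our reading of the perturbation idea; Jin et al. (2003) should be consulted for the exact scheme, and the helper functions c_eps and c_eps_grad are the assumed ones from Section 3.1.

```python
import numpy as np
from scipy.optimize import minimize

def gehan_resample_se(log_z, delta, x, n_boot=1000, eps=1e-6, seed=0):
    """Point estimate and resampling standard errors for the Gehan estimate (sketch)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape

    def fit(xi):
        def obj(b):
            e = log_z - x @ b
            diff = e[None, :] - e[:, None]                      # e_j - e_i
            w = (xi[:, None] * xi[None, :]) * delta[:, None]    # xi_i xi_j delta_i weights
            loss = np.sum(w * c_eps(diff, eps)) / n**2
            wg = w * c_eps_grad(diff, eps)
            grad = (wg.sum(axis=1) @ x - wg.sum(axis=0) @ x) / n**2
            return loss, grad
        return minimize(obj, np.zeros(p), jac=True, method="L-BFGS-B").x

    beta_hat = fit(np.ones(n))                                  # unperturbed point estimate
    boot = np.array([fit(rng.exponential(1.0, size=n)) for _ in range(n_boot)])
    return beta_hat, boot.std(axis=0)                           # coefficient-wise SEs
```

Because each replicate is itself a smooth, low-dimensional optimization, the full set of 1000 resamples remains inexpensive, which is the source of the timing gap reported above.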
5 Remarks

This paper describes a new algorithm for estimating the slope parameter in the semi-parametric accelerated failure time (AFT) model (Prentice, 1978). The current algorithmic approach to this estimation problem is linear programming, as described by Jin et al. (2003). However, the computational complexity of linear programming leads to extraordinary CPU times for even modest sample sizes and few predictors. This computing bottleneck has rendered most extensions and applications of rank-based estimators impractical, especially when compared to the computationally expedient maximum partial likelihood estimator (Cox, 1972). This paper challenges the computational discrepancy between these two semi-parametric estimators by providing a new computational algorithm for the AFT model. Our algorithm is based on gradient-based, quasi-Newton methods for convex loss functions. Specifically, we adopt a Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm which approximates the inverse Hessian matrix to determine the search direction rather than calculating the Hessian matrix explicitly. When the number of predictors is sufficiently large, we advocate a limited-memory BFGS (L-BFGS) algorithm, which saves significantly on storage of the approximate Hessian and therefore lowers the computational cost of the standard BFGS algorithm. Our theoretical and numerical calculations indicate that our method is as accurate as the linear programming method of Jin et al. (2003) without the computational burden. In particular, our experience suggests the method of Jin et al. (2003) takes exponentially more CPU time than our algorithm as the sample size increases. In the nursing home example with n = 1601, for example, we showed that our algorithm reduces rank-based estimation and inference from 30 hours to less than 3 minutes, a significant reduction in CPU cost and time. We hope the current paper will permit and encourage statisticians to consider weighted logrank estimators in the AFT model for censored data regression as viable semi-parametric alternatives to the ubiquitous maximum partial likelihood estimator in the proportional hazards model.

Acknowledgements

The authors thank Janine Olesch from Lübeck University for the joint work on the algorithmic implementation. This work was supported, in part, by the Computational Life Science Initiative at Emory University (Conrad), NIH grant R03 AI068484 (Johnson), and Emory's Center for AIDS Research, P30 AI050409 (Johnson).

References

[1] Broyden, C. G. (1970) The convergence of a class of double-rank minimization algorithms. J. Inst. Math. Its Appl. 6, 76–90.

[2] Cox, D. R. (1972) Regression models and life-tables (with Discussion). J. Roy. Statist. Soc. Ser. B 34, 187–202.

[3] Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. Chapman and Hall, London.

[4] Dennis, J. E. and Schnabel, R. B. (1983) Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia.

[5] Dickson, E. R., Grambsch, P. M., Fleming, T. R., Fisher, L. D. and Langworthy, A. (1989) Prognosis in primary biliary cirrhosis: model for decision making. Hepatology 10, 1–7.

[6] Fleming, T. R. and Harrington, D. P. (1991) Counting Processes and Survival Analysis. Wiley, New York.
[7] Fletcher, R. (1970) A new approach to variable metric algorithms. Computer Journal 13, 317–322.

[8] Gehan, E. A. (1965) A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 203–223.

[9] Gill, P. E., Murray, W. and Wright, M. H. (1981) Practical Optimization. Elsevier, Bingley.

[10] Goldfarb, D. (1970) A family of variable metric updates derived by variational means. Math. Computation 24, 23–27.

[11] Jin, Z., Lin, D. Y., Wei, L. J. and Ying, Z. (2003) Rank-based inference for the accelerated failure time model. Biometrika 90, 341–353.

[12] Kalbfleisch, J. D. and Prentice, R. L. (2002) The Statistical Analysis of Failure Time Data. Wiley, New York.

[13] Koenker, R. and Bassett, G. S. (1978) Regression quantiles. Econometrica 46, 33–50.

[14] Koenker, R. and D'Orey, V. (1987) Computing regression quantiles. Appl. Statist. 36, 383–393.

[15] Koenker, R. and Ng, P. (2005) A Frisch-Newton algorithm for sparse quantile regression. Acta Mathematicae Applicatae Sinica (English Series) 21(2), 225–236.

[16] Krall, J. M., Uthoff, V. A. and Harley, J. B. (1975) A step-up procedure for selecting variables associated with survival. Biometrics 31, 49–57.

[17] Lai, T. L. and Ying, Z. (1991) Rank regression methods for left-truncated and right-censored data. Ann. Statist. 19, 531–556.

[18] Lin, D. Y., Wei, L. J. and Ying, Z. (1998) Accelerated failure time models for counting processes. Biometrika 85, 605–618.

[19] Louis, T. A. (1981) Nonparametric analysis of an accelerated failure time model. Biometrika 68, 381–390.

[20] Mantel, N. (1966) Evaluation of survival data and two new rank order statistics arising in its considerations. Cancer Chemo. Rep. 50, 163–170.

[21] Morris, C. N., Norton, E. C. and Zhou, X. H. (1994) Parametric duration analysis of nursing home usage. In Case Studies in Biometry (N. Lange, L. Ryan, L. Billard, D. Brillinger, L. Conquest and J. Greenhouse, eds.), 231–248. Wiley, New York.

[22] Nocedal, J. and Wright, S. J. (2006) Numerical Optimization. Springer, Berlin.

[23] Prentice, R. L. (1978) Linear rank tests with right-censored data. Biometrika 65, 167–179.

[24] Shanno, D. F. (1970) Conditioning of quasi-Newton methods for function minimization. Math. Computation 24, 647–656.

[25] Tsiatis, A. A. (1990) Estimating regression parameters using linear rank tests for censored data. Ann. Statist. 18, 354–372.

[26] Wei, L. J. and Gail, M. H. (1983) Nonparametric estimation for a scale-change with censored observations. J. Amer. Statist. Assoc. 78, 382–388.

[27] Wei, L. J., Ying, Z. and Lin, D. Y. (1990) Regression analysis of censored survival data based on rank tests. Biometrika 77, 845–851.