A Quasi-Newton Algorithm for Efficient
Computation of Gehan Estimates
by
Matthias Conrad1 and Brent A. Johnson2
Technical Report 10-02
February 25, 2010
Department of Biostatistics and Bioinformatics
Rollins School of Public Health
1518 Clifton Road, N.E.
Emory University
Atlanta, Georgia 30322
1
Emory University
Department of Mathematics and Computer Science and Computational Life Sciences Initiative
Atlanta, GA 30322
2
Emory University
Department of Biostatistics and Bioinformatics
1518 Clifton Rd., N.E., 3rd Floor
Rollins School of Public Health
Atlanta, GA 30322
Corresponding Author: Dr. Matthias Conrad
Telephone: (404) 727-7591
FAX: (404) 727-5611
e-mail: conrad@mathcs.emory.edu
A quasi-Newton algorithm for efficient computation of Gehan
estimates
Short title:
Efficient computation of Gehan estimates
Address for correspondence:
Brent A. Johnson
Department of Biostatistics
Rollins School of Public Health
Emory University
1518 Clifton Rd., NE
Atlanta, GA 30322
U. S. A.
Email: bajohn3@emory.edu
A quasi-Newton algorithm for efficient computation of Gehan
estimates
Matthias Conrad1 and Brent A. Johnson2
Current Version: February 25, 2010
Abstract
The analysis of lifetime data is an important research area in statistics, particularly among
econometricians and biostatisticians. The two most popular semi-parametric models are the proportional hazards model and the accelerated failure time (AFT) model. The proportional hazards model is computationally advantageous over virtually any other competing semi-parametric
model because the ubiquitous maximum partial likelihood estimator is computed using ordinary
Newton methods. Rank-based estimation in the semi-parametric AFT model is computationally
more challenging, for example, because the Hessian matrix is not directly estimable without
nonparametric smoothing. Recently, authors showed that the rank-based estimators may be
written as the solution to a linear programming problem. Unfortunately, the linear programming problem involves O(n^2) unknowns subject to n^2 linear constraints, where n denotes the sample size.
Thus, the linear programming solution to rank-based estimation is restricted by well-known
computational limitations of linear programming and impractical for many applications. In this
paper, we describe quasi-Newton methods for rank-based estimation in the semi-parametric
AFT model. The algorithm converges super-linearly and is computationally efficient. Thus,
the computational cost of our algorithm remains low for even large data sets. We illustrate the
algorithm through speed trials and three real data sets.
1 Introduction
Survival analysis is a ubiquitous concept in statistics, and its semi-parametric models and estimators
are well-known. Cox’s (1972) proportional hazards model, for example, has been studied extensively
for nearly four decades now and the paper is among the most widely cited statistics papers in the
scientific literature. This paper proposes new computational strategies and algorithms for rank-based estimation in the semi-parametric accelerated failure time model (Cox and Oakes, 1984;
Kalbfleisch and Prentice, 2002), a popular alternative to the proportional hazards model.
The accelerated failure time (AFT) model asserts that the endpoint Ti is linearly related to
explanatory variables after logarithmic transformation, i.e.
log T_i = x_i^⊤ β + e_i,    (i = 1, …, n),    (1)
where xi is a p-vector of fixed predictors for the ith subject, β is a p-vector of regression coefficients,
and (e1 , . . . , en ) are independent and identically distributed errors with unspecified distribution
function F . If Ci is a stochastic, subject-specific censoring variable, then the observed data are
{(Zi , δi , xi )}ni=1 , where Zi = min(Ti , Ci ), δi = I(Ti ≤ Ci ) and I(·) is the indicator function. The
goal is to estimate the regression coefficients β using the observed data.
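In code, assembling the observed data and the log-scale residuals is straightforward; the following Python sketch is for exposition only, and the array names (T, C, X, Z, delta) are illustrative.

import numpy as np

# Illustrative inputs: failure times T, censoring times C, and an n x p design matrix X.
def observed_data(T, C):
    """Observed data: Z_i = min(T_i, C_i) and delta_i = I(T_i <= C_i)."""
    Z = np.minimum(T, C)
    delta = (T <= C).astype(float)
    return Z, delta

def residuals(beta, X, Z):
    """Log-scale residuals e_i(beta) = log Z_i - x_i' beta of model (1)."""
    return np.log(Z) - X @ beta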
Prentice (1978) proposed linear rank tests for the slope parameter in (1) and these tests formed
the basis for subsequent coefficient estimators. Rank-based estimation of the regression coefficients
β in (1) was first considered by Louis (1981) and then Wei and Gail (1983). Tsiatis (1990) proposed
a system of estimating equations by inverting the linear rank tests and further developments generally follow this framework (see also, Lai and Ying, 1991; Ying, 1993). From its inception, coefficient
estimation without nonparametric smoothing has been difficult. Until recently, rank-based estimation algorithms for censored data were limited to grid searches. Most notably, Lin and Geyer (1992)
explained how simulated annealing performed a stochastic grid search of the parameter space and
could be used to solve a general system of estimating equations. Naturally, the computational cost
of grid searches precluded the use of rank-based estimators outside of academic settings. Moreover,
simulated annealing is not guaranteed to produce a true minimum.
Two decades after Prentice (1978) proposed the rank-based estimator, Lin et al. (1998) provided
another estimation algorithm for modest sample sizes by showing that the Gehan estimate may be
computed through linear programming. However, the linear programming problem has n^2 + p unknown parameters subject to n^2 linear constraints, and the size of the optimization problem defeats many "out-of-the-box" linear solvers familiar to statisticians (e.g., linprog in Matlab) for sample sizes as low
as n = 40 and p = 2. Later, Jin et al. (2003) provided an effective numerical strategy by rewriting
the linear programming problem as an equivalent median regression problem. The significance of
the latter development was that algorithms for sparse quantile regression (Koenker and Bassett,
1978; Koenker and D’Orey, 1987; Koenker and Ng, 2005) could be used and those algorithms were
already available in standard statistical software. This latter algorithm is the current state-of-the-art algorithm for rank-based estimation in the semiparametric accelerated failure time model.
Survival analysis is a core component of (bio)statistics and many clinical trials rely on these
methods to draw statistical inference on the effect of new therapies for disease recurrence, incident
disease, or mortality. Average clinical trials range in size from n = 1,000 to n = 5,000 subjects
while large clinical trials may enroll several million. Unfortunately, the computation and storage
of the rank-based estimator grows like the square of the analogous median regression estimator.
So, a rank-based estimate for n = 1,000 has the computational complexity of a median regression estimate for n = 10^6, a rank-based estimate for n = 10^5 has the computational complexity of a median regression estimate for n = 10^10, and so on. For sample sizes typical of clinical trials, calculating rank-based regression estimates via linear programming is slow, even on modern desktop computers. On the other hand, Cox's (1972) maximum partial likelihood
estimate can often be calculated in a matter of seconds, even for very large sample sizes. Hence, the
computational cost of the rank-based estimator is a huge deterrent and practitioners often use the
Cox model in clinical trials even amidst evidence that suggests the model assumptions are not well-supported by the data. A better algorithm for the rank-based estimator would allow investigators
and users to choose an estimator based on the science rather than merely on the convenience and
cost of the computational algorithm.
The premise of this paper is that the Gehan loss function is itself convex and minimizing
it directly will be more efficient than rewriting the optimization with constraints. We perform
unconstrained optimization on a modified version of the Gehan loss function. Then, the original
Gehan estimate is defined as the limit of a sequence of modified estimates. Our algorithm is built
on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method which provides a solid foundation for
convex optimization. Quasi-Newton methods, such as the BFGS, provide curvature information on
the search direction at each iteration without calculating the Hessian directly. Our results will show
that this optimization technique works very efficiently for large sample sizes n where the existing method is impractical or fails completely. Because the weighted logrank estimators are fundamental concepts in survival analysis, semi-parametric models, and theory, we feel the methods are sufficiently important to interest a rather general audience.
2 Background
The weighted logrank estimator β̂ is defined as a zero-crossing of the estimating function

Ψ(β) = n^{-1} Σ_{i=1}^{n} δ_i φ{e_i(β), β} [x_i − x̄{e_i(β), β}],    (2)

where e_i(β) = log Z_i − x_i^⊤ β, φ is a data-dependent weight function, and x̄(t, β) = Σ_{j=1}^{n} x_j I{e_j(β) ≥ t} / Σ_{j=1}^{n} I{e_j(β) ≥ t}. We define a zero-crossing as an estimator β̂ which satisfies Ψ_j(β̂−)Ψ_j(β̂+) ≤ 0 for all j = 1, …, p, where

Ψ_j(β−)Ψ_j(β+) = lim_{γ→0} Ψ_j(β − γu_j) Ψ_j(β + γu_j),

u_j is the j-th canonical unit vector, and Ψ(β) = (Ψ_1, …, Ψ_p)^⊤. Two weight functions of significant interest are φ(t, β) = 1 and φ(t, β) = n^{-1} Σ_{j=1}^{n} I{e_j(β) ≥ t}, which correspond to the log-rank (Mantel, 1966) and Gehan (1965) weights, respectively. Under the latter weight function, the Gehan estimating function is simply

Ψ(β) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i (x_i − x_j) I{e_i(β) ≤ e_j(β)}.    (3)
The Gehan estimating function in (3) is the p-dimensional gradient of the following convex loss
function,
f_G(β) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i {e_i(β) − e_j(β)}^−,    (4)

where c^− = max(−c, 0). Then, the Gehan estimator is defined as β̂_G = arg min_β f_G(β). Recently, Jin et al. (2003) argued that one may estimate the sampling variability of the Gehan estimate through a novel resampling scheme whereby one perturbs the loss function f_G with a vector of independent and identically distributed random variables which have unit mean and variance. It is common to use 1000 resampled estimates for standard error estimation and, in Section 4, we use this number to exemplify the computational cost of statistical inference for Gehan estimates. Thus, efficient numerical optimization of f_G is the lifeline of rank-based estimation and inference in the AFT model.
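For exposition, the Gehan loss (4) and the estimating function (3) can be evaluated directly from their double-sum definitions. The Python sketch below (an illustration, not the implementation used later for the timings) makes the O(n^2) pairwise structure explicit; up to ties, gehan_score is the gradient of gehan_loss.

import numpy as np

def gehan_loss(beta, X, Z, delta):
    """Gehan loss f_G(beta) in (4): n^{-2} sum_{i,j} delta_i {e_i(beta) - e_j(beta)}^-."""
    n = len(Z)
    e = np.log(Z) - X @ beta
    D = e[:, None] - e[None, :]                              # D[i, j] = e_i(beta) - e_j(beta)
    return (delta[:, None] * np.maximum(-D, 0.0)).sum() / n ** 2   # {c}^- = max(-c, 0)

def gehan_score(beta, X, Z, delta):
    """Gehan estimating function Psi(beta) in (3)."""
    n = len(Z)
    e = np.log(Z) - X @ beta
    W = delta[:, None] * (e[:, None] <= e[None, :])          # delta_i I{e_i <= e_j}
    # sum_{i,j} W_ij (x_i - x_j) = X'(row sums of W) - X'(column sums of W)
    return X.T @ (W.sum(axis=1) - W.sum(axis=0)) / n ** 2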
Lin et al. (1998) showed that minimizing fG (β) is equivalent to the linear programming (LP)
problem:
min_{u,β} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i u_{ij},    (5)

subject to u_{ij} ≥ −{e_i(β) − e_j(β)} and u_{ij} ≥ 0 for all i, j.
Jin et al. (2003) show how algorithms developed for quantile regression could be used to solve the
optimization problem in (5). Applying barrier methods for interior point programming, Jin et al.
(2003) suggest one minimize the loss function,
Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i |e_i(β) − e_j(β)| + | M − β^⊤ Σ_{k=1}^{n} Σ_{l=1}^{n} δ_k (x_l − x_k) |,    (6)
for a large number M . The significance of the latter expression (6) is that one may use standard
algorithms for median regression (Koenker and D’Orey, 1987; Koenker and Ng, 2005) to calculate
the Gehan estimate. At the time of writing this paper, minimizing the expression in (6) via the
quantreg package in R is the current state-of-the-art algorithm for the Gehan estimator.
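To make the size of this reformulation concrete, the following sketch (an illustration; the default value of M is only a placeholder) stacks the data implied by (6): one pseudo-observation per pair (i, j) with δ_i = 1, plus a single extra row carrying the constant M. Minimizing Σ_k |y_k − x_k^⊤ β| over the stacked data with any L1 (median) regression solver is then the same as minimizing (6), and the roughly n^2 stacked rows are exactly the bottleneck discussed next.

import numpy as np

def jin_augmented_data(X, Z, delta, M=1e6):
    """Augmented median-regression data implied by (6); minimizing
    sum_k |y_k - x_k' beta| over (y_aug, X_aug) is equivalent to minimizing (6)."""
    n, p = X.shape
    logZ = np.log(Z)
    rows_i = np.repeat(np.flatnonzero(delta > 0.5), n)       # i with delta_i = 1, repeated over j
    rows_j = np.tile(np.arange(n), int((delta > 0.5).sum()))
    # e_i(beta) - e_j(beta) = (log Z_i - log Z_j) - (x_i - x_j)' beta;
    # pairs with i = j contribute zero and are kept for simplicity.
    y_pairs = logZ[rows_i] - logZ[rows_j]
    X_pairs = X[rows_i] - X[rows_j]
    # Extra row for the term |M - beta' sum_{k,l} delta_k (x_l - x_k)|.
    s = (delta[:, None, None] * (X[None, :, :] - X[:, None, :])).sum(axis=(0, 1))
    y_aug = np.append(y_pairs, M)
    X_aug = np.vstack([X_pairs, s[None, :]])
    return y_aug, X_aug      # roughly n^2 rows: the computational bottleneck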
It is well-known that, in terms of computational complexity and cost, linear programming is
no panacea (cf. Dennis and Schnabel, 1983; Nocedal and Wright, 2006, and references therein),
particularly for moderately-sized problems. Moreover, while the interior point methods of Koenker
and colleagues (Koenker and Bassett, 1978; Koenker and D’Orey, 1987; Koenker and Ng, 2005)
work well for quantile regression, the computational complexity of the rank-based optimization
problem via (6) overwhelms many standard desktop computers. Naturally, outside of academic
settings (i.e. where both n and p are small), the computational cost of (6) for practical problems
(e.g. clinical trials) deters many consumers of rank-based estimators in the AFT model.
3 Algorithm

3.1 The approximation
For our algorithmic implementation, we start with the convex loss function fG (β) and consider a
simple smooth approximation to it. Define the approximating loss function,
f_ε(β) = n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i c_ε{e_i(β) − e_j(β)},

where c_ε is a sufficiently smooth real-valued function. In this paper, we define

c_ε(x) = { 0                                             if x < −ε,
           −(x + ε)^4/(16ε^3) + (x + ε)^3/(4ε^2)          if x ∈ [−ε, ε],       (7)
           x                                              if x > ε,
with sufficiently small ε (e.g. ε = 10^{-6}). We note that other definitions of c_ε are possible. Now, define the minimizer of the approximate loss function in (7), i.e.

β̂_ε = arg min_β f_ε(β).    (8)

Given our definition of c_ε, it is evident that lim_{ε→0} f_ε(β) = f_0(β) = f_G(β). Furthermore, by definition, β̂_ε is the global minimizer of a twice-differentiable convex loss function for every ε > 0. However, β̂_ε is a set of local minimizers and not necessarily a unique solution; this is similar to β̂_G (see Jin et al., 2003). At the same time, both β̂_ε and β̂_G converge to unique solutions as the sample size n converges to infinity.
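The approximation translates directly into code. The following Python sketch (for exposition; the timings in Section 3.2 are based on a separate Matlab implementation) evaluates c_ε, its derivative, and the smoothed loss f_ε together with the gradient required by the quasi-Newton iterations of Section 3.2.

import numpy as np

def c_eps(x, eps=1e-6):
    """Smooth approximation (7): 0 below -eps, x above eps, and a C^2 polynomial
    bridge -(x + eps)^4/(16 eps^3) + (x + eps)^3/(4 eps^2) on [-eps, eps]."""
    mid = -(x + eps) ** 4 / (16 * eps ** 3) + (x + eps) ** 3 / (4 * eps ** 2)
    return np.where(x < -eps, 0.0, np.where(x > eps, x, mid))

def c_eps_prime(x, eps=1e-6):
    """First derivative of c_eps; needed for the gradient of f_eps."""
    mid = -(x + eps) ** 3 / (4 * eps ** 3) + 3 * (x + eps) ** 2 / (4 * eps ** 2)
    return np.where(x < -eps, 0.0, np.where(x > eps, 1.0, mid))

def f_eps_and_grad(beta, X, Z, delta, eps=1e-6):
    """Smoothed Gehan loss f_eps(beta) and its gradient (both O(n^2) to evaluate)."""
    n = len(Z)
    e = np.log(Z) - X @ beta
    D = e[:, None] - e[None, :]                              # e_i(beta) - e_j(beta)
    val = (delta[:, None] * c_eps(D, eps)).sum() / n ** 2
    W = delta[:, None] * c_eps_prime(D, eps)                 # chain-rule weights
    grad = -X.T @ (W.sum(axis=1) - W.sum(axis=0)) / n ** 2
    return val, grad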
3.2 Quasi-Newton methods
The computational advantage of our approximation lies in the smoothness and convexity of fε in (8).
Because fε possesses these properties, we may use gradient-based optimization algorithms for which
the optimization theory provides numerous such iterative methods. The foremost gradient-based
optimization algorithm is Newton’s method which converges locally at a quadratic rate and uses the
gradient ∇f_ε and Hessian ∇²f_ε to form the search direction and step length at each iteration. Numerical calculation of the Hessian of the smoothed Gehan loss function, however, comes at an unreasonable cost.
The fundamental concept behind quasi-Newton methods is to provide curvature information about
the loss function fε in order to calculate an efficient search direction at each iteration without
calculating the Hessian matrix explicitly. Below, we delineate the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method (Nocedal and Wright, 2006) to provide an efficient and fast algorithm
for calculating the rank-based Gehan estimate. Then, we describe the Limited-BFGS (L-BFGS)
method to solve problems for large numbers of predictors p.
Let β_{ℓ+1} be the (ℓ+1)-th iterate of the regression coefficients in (1). Define the objective function and its gradient at step ℓ+1 as f_{ℓ+1} = f_ε(β_{ℓ+1}) and g_{ℓ+1} = ∇f_{ℓ+1}, respectively. Then, the BFGS search direction s_{ℓ+1} is given by

s_{ℓ+1} = −H_{ℓ+1} g_{ℓ+1},

with H_{ℓ+1} given by

H_{ℓ+1} = arg min_H ‖H − H_ℓ‖_F    subject to H = H^⊤ and H y_ℓ = v_ℓ,

where y_ℓ = g_{ℓ+1} − g_ℓ, v_ℓ = β_{ℓ+1} − β_ℓ, ‖·‖_F denotes a weighted Frobenius matrix norm, and H_0 is an initial, user-chosen symmetric positive definite matrix (e.g. H_0 = I). Its unique solution is given by the update formula (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970),

H_{ℓ+1} = V_ℓ^⊤ H_ℓ V_ℓ + ρ_ℓ v_ℓ v_ℓ^⊤,    (9)

where V_ℓ = I − ρ_ℓ y_ℓ v_ℓ^⊤ and ρ_ℓ = 1/(y_ℓ^⊤ v_ℓ).
In ordinary Newton methods, both the gradient and the Hessian are calculated and a linear system is solved at every iteration. Quasi-Newton methods are similar in that the gradient g_ℓ and a matrix H_ℓ are calculated at each iteration ℓ. However, as shown in equation (9), quasi-Newton methods differ from ordinary Newton methods in that the former avoid solving a linear system explicitly, as such systems may be large or possibly ill-conditioned. Thus, H_ℓ contains information about the inverse of the Hessian of the loss function in a local neighborhood of β_ℓ.
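The update (9) is a rank-two modification of H_ℓ and can be transcribed in a few lines; the sketch below is an illustration, with variable names following the text.

import numpy as np

def bfgs_update(H, v, y):
    """BFGS inverse-Hessian update (9): H_new = V' H V + rho v v',
    with V = I - rho y v' and rho = 1 / (y' v), where v = beta_new - beta_old
    and y = g_new - g_old. The next search direction is then s = -H_new @ g_new."""
    rho = 1.0 / (y @ v)
    V = np.eye(len(v)) - rho * np.outer(y, v)
    return V.T @ H @ V + rho * np.outer(v, v)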
The L-BFGS method is especially useful for solving large optimization problems (large p) where the storage or computation of the matrix H_ℓ comes at an unreasonable cost. Instead of using the full p × p matrix H_ℓ, the L-BFGS algorithm stores m vectors of size p with m ≪ p. These vectors contain the curvature information of the m most recent iterations, and the update formula for the m vectors of the L-BFGS method can be derived from the BFGS formula by rewriting (9) as

H_ℓ = Σ_{j=0}^{m} ρ_{ℓ−m−1+j} ( Π_{k=1}^{m−j} V_{ℓ−k}^⊤ ) v_{ℓ−m−1+j} v_{ℓ−m−1+j}^⊤ ( Π_{k=1}^{m−j} V_{ℓ−k} ),

with ρ_{ℓ−m−1} = 1 and v_{ℓ−m−1} v_{ℓ−m−1}^⊤ = H_0 (cf. Dennis and Schnabel, 1983). An efficient implementation of the L-BFGS updating strategy is presented in the pseudo-code of Algorithm 1, lines 5 and 10. Note that if m = p, the L-BFGS method is identical to the BFGS method. The strong Wolfe line search method (cf. Nocedal and Wright, 2006, p. 31) provides an efficient step length α_ℓ, which determines the new iterate β_{ℓ+1} = β_ℓ + α_ℓ s_ℓ. We use standard stopping criteria for smooth unconstrained optimization,

f_ℓ − f_{ℓ+1} < τ(1 + |f_{ℓ+1}|),    ‖g_ℓ‖ < τ^{1/3}(1 + |f_{ℓ+1}|),    ‖β_{ℓ+1} − β_ℓ‖ < τ^{1/2}(1 + ‖β_{ℓ+1}‖),    (10)

‖g_ℓ‖ < ϵ_mach,

with the Euclidean norm ‖·‖ and the machine accuracy ϵ_mach (cf. Gill et al., 1981). By default we choose τ = 10^{-6}. The principal advantage of applying the L-BFGS algorithm to the approximate Gehan loss function f_ε is local super-linear convergence (Nocedal and Wright, 2006) to a local minimizer, which results in fast and efficient computation even for large data sets.
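The stopping tests in (10) also translate directly into code; the sketch below (an illustration) declares convergence when the first three criteria hold jointly or, alternatively, when the gradient reaches machine accuracy.

import numpy as np

def converged(f_new, f_old, g, beta_new, beta_old, tau=1e-6):
    """Stopping criteria (10) for smooth unconstrained optimization (one possible combination)."""
    eps_mach = np.finfo(float).eps
    small_decrease = (f_old - f_new) < tau * (1.0 + abs(f_new))
    small_gradient = np.linalg.norm(g) < tau ** (1.0 / 3.0) * (1.0 + abs(f_new))
    small_step = np.linalg.norm(beta_new - beta_old) < np.sqrt(tau) * (1.0 + np.linalg.norm(beta_new))
    return (small_decrease and small_gradient and small_step) or np.linalg.norm(g) < eps_mach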
We assess the speed of our algorithm through a simple simulation study. We simulated p independent, standard normal predictors x = (x^{(1)}, …, x^{(p)})^⊤, independently simulated the coefficient vector β from a standard normal distribution, and then simulated the true response log T according to the linear model,

log T = β_1 x^{(1)} + ⋯ + β_p x^{(p)} + W,
where W is a normal random variable with mean zero and standard deviation σ = 1.5. A censoring
random variable was simulated according to a unif(0, κ) distribution, where κ was chosen to yield
30% censored observations. The observed random pair (Z, δ) is defined accordingly. The data were identically distributed for n subjects, where n ranges from 100 to 5000. Using the convergence criteria described above, we display in Figure 1 the clock time required to compute the Gehan coefficient estimates. These timings were taken on an Apple MacBook Pro (Model MB134LL/A)
with 2.4GHz Intel Core 2 Duo and 4GB 667 MHz DDR2 RAM (OS X Version 10.5.7). Our code is
written in the interpreted Matlab language (version 7.8.0.347, R2009a) and CPU times may be
even faster if our code were rewritten in a compiled language like Fortran. Our method is restricted only by the storage size of the design matrix X ∈ R^{n×p}. Compared to the search space of size n^2 + p for the linear programming problem (6), the convex optimization problem (8) is of size p. On the other hand, our simulation in Figure 1 with n = 5000 corresponds to a linear program with roughly 25,000,000 unknowns. In case p is large in comparison to n, the linear system (1) might be underdetermined and problem (8) will therefore be ill-posed. Here, regularization methods might be useful to yield reasonable results; this is open for further investigation.
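To reproduce an experiment of this kind without the Matlab code, the following self-contained Python sketch simulates data as described above (the censoring bound κ is a knob that must be tuned to reach roughly 30% censoring) and hands the smoothed loss and gradient of Section 3.1 to SciPy's L-BFGS implementation; it is an illustration, not the implementation timed in Figure 1.

import numpy as np
from scipy.optimize import minimize

def f_eps_and_grad(beta, X, Z, delta, eps=1e-6):
    """Smoothed Gehan loss and gradient (as in the Section 3.1 sketch)."""
    n = len(Z)
    e = np.log(Z) - X @ beta
    D = e[:, None] - e[None, :]
    mid = -(D + eps) ** 4 / (16 * eps ** 3) + (D + eps) ** 3 / (4 * eps ** 2)
    c = np.where(D < -eps, 0.0, np.where(D > eps, D, mid))
    dmid = -(D + eps) ** 3 / (4 * eps ** 3) + 3 * (D + eps) ** 2 / (4 * eps ** 2)
    dc = np.where(D < -eps, 0.0, np.where(D > eps, 1.0, dmid))
    W = delta[:, None] * dc
    val = (delta[:, None] * c).sum() / n ** 2
    grad = -X.T @ (W.sum(axis=1) - W.sum(axis=0)) / n ** 2
    return val, grad

rng = np.random.default_rng(1)
n, p, sigma, kappa = 1000, 10, 1.5, 20.0      # kappa is a placeholder, not a value from the paper
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
T = np.exp(X @ beta_true + sigma * rng.standard_normal(n))
C = rng.uniform(0.0, kappa, size=n)
Z, delta = np.minimum(T, C), (T <= C).astype(float)

res = minimize(f_eps_and_grad, x0=np.zeros(p), args=(X, Z, delta, 1e-6),
               jac=True, method="L-BFGS-B")
beta_hat = res.x                               # approximate Gehan estimate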
3.3 The error in the approximation
The approach we propose is based on a modified Gehan loss function where we approximate a
non-differentiable function with a smoothed, differentiable one. It is natural to wonder how poorly
our modified loss function fε approximates the true Gehan loss function fG . We consider the error
in the approximation below.
Figure 1: CPU times (in seconds) for the computation of Gehan estimates as a function of the sample size n. For each sample size n, we simulate 10 data sets with σ = 1.5 and p = 2 (dotted curve), p = 10 (dashed curve), and p = 50 (solid curve).

In order to apply gradient-based methods, we consider a modified Gehan loss function f_ε defined through c_ε. Then, for every β ∈ R^p, we get the following worst-case error of the loss function,
|f_ε(β) − f_0(β)| = | n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} δ_i [c_ε{e_i(β) − e_j(β)} − c_0{e_i(β) − e_j(β)}] |
                  ≤ n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} | c_ε{e_i(β) − e_j(β)} − c_0{e_i(β) − e_j(β)} |
                  ≤ n^{-2} Σ_{i=1}^{n} Σ_{j=1}^{n} max_x | c_ε(x) − c_0(x) |
                  = (3/16) ε,

where the last line follows from

max_x | c_ε(x) − c_0(x) | = | c_ε(0) − c_0(0) | = (3/16) ε.
Hence, the mean error of the approximate loss function for any value in the parameter space is
bounded above by 3ε/16.
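For completeness, the constant 3ε/16 follows from evaluating (7) at x = 0:

c_ε(0) = −(1/(16ε^3)) ε^4 + (1/(4ε^2)) ε^3 = −ε/16 + ε/4 = 3ε/16,    while    c_0(0) = 0.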
Algorithm 1 L-BFGS

Require: f ∈ C^2: R^p → R (objective function); β_0 ∈ R^p (initial value); m ∈ N and H_0 ∈ R^{p×p} symmetric positive definite (pick parameters)

 1:  ℓ ← 0                                                       {set parameter}
 2:  calculate f_0 ← f(β_0), g_0 ← g(β_0)                        {function evaluations}
 3:  repeat
 4:      check stop criteria                                     {e.g. see equation (10)}
 5:      q_ℓ ← g_ℓ                                               {calculate limited-memory quasi-Newton direction}
         for j = m, m−1, …, 1 do α_j ← ρ_j v_j^⊤ q_ℓ and q_ℓ ← q_ℓ − α_j y_j end for
         if ℓ = 0 do r_ℓ ← H_0 q_ℓ else r_ℓ ← ((v_m^⊤ y_m)/(y_m^⊤ y_m)) I q_ℓ end if
         for j = 1, …, m do r_ℓ ← r_ℓ + v_j (α_j − ρ_j y_j^⊤ r_ℓ) end for
 6:      s_ℓ ← −r_ℓ                                              {choose search direction}
 7:      calculate step size α_ℓ via strong Wolfe line search
 8:      β_{ℓ+1} ← β_ℓ + α_ℓ s_ℓ                                 {update steps}
 9:      calculate f_{ℓ+1} ← f(β_{ℓ+1}), g_{ℓ+1} ← g(β_{ℓ+1})    {function evaluations}
10:      v_j ← v_{j+1}, y_j ← y_{j+1}, and ρ_j ← ρ_{j+1} for j = 1, …, m−1     {update memory}
         v_m ← β_{ℓ+1} − β_ℓ, y_m ← g_{ℓ+1} − g_ℓ, ρ_m ← 1/(y_m^⊤ v_m)
11:      ℓ ← ℓ + 1
12:  end repeat
13:  β̂ ← β_ℓ

Ensure: β̂ = arg min f(β) is a local minimizer and f(β̂) a local minimum.
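For readers who prefer code to pseudo-code, the two-loop recursion in line 5 of Algorithm 1 can be transcribed as follows (a Python sketch; pairs holds the m most recent triples (v_j, y_j, ρ_j), oldest first).

import numpy as np

def lbfgs_direction(g, pairs, H0_diag=None):
    """Two-loop recursion (line 5 of Algorithm 1): returns r, an approximation to
    H_l g built from the stored (v, y, rho) triples; the search direction is s_l = -r."""
    q = np.array(g, dtype=float)
    alphas = []
    for v, y, rho in reversed(pairs):                 # j = m, m-1, ..., 1
        a = rho * (v @ q)
        q -= a * y
        alphas.append(a)
    if pairs:                                         # scaled initial matrix ((v_m' y_m)/(y_m' y_m)) I
        v_m, y_m, _ = pairs[-1]
        r = (v_m @ y_m) / (y_m @ y_m) * q
    else:                                             # first iteration (l = 0): r = H_0 q
        r = q if H0_diag is None else H0_diag * q
    for (v, y, rho), a in zip(pairs, reversed(alphas)):   # j = 1, ..., m
        r += v * (a - rho * (y @ r))
    return r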
4 Examples
In this section, we present three worked examples to illustrate the methods. Explanatory variables
have been standardized to have mean zero. The following three examples use data sets familiar to
statisticians and are freely available online.
4.1 Multiple Myeloma
Figure 2: Search path of the L-BFGS algorithm applied to the multiple myeloma data set with initial value β_0 = (0, 0)^⊤ and ε = 0.01. The local minimum is at β̂ = (−0.53160, 0.29222)^⊤ with f_ε(β̂) = 0.39764.

We first report our results from a study of multiple myeloma (Krall et al., 1975). The study included a total of n = 65 patients who were treated with alkylating agents, 48 of whom died and the remaining 17 survived. As in Jin et al. (2003), we include two covariates in our analyses:
hemoglobin (HGB) and logarithm of blood urea (BUN). Using our method with ε = 0.01 and
standardized data, we found the coefficient estimates were −0.5318 and 0.2923 for BUN and HGB,
respectively. Using the method of Jin et al. (2003), the parameter estimates are −0.5316 and
0.2922.
Due to the small sample size and number of predictors, the multiple myeloma data set provides an excellent opportunity to visualize the L-BFGS algorithm in practice. In Figure 2, we display the progress of the algorithm starting with the initial value β_0 = (0, 0)^⊤. The L-BFGS algorithm rapidly steps downhill, converging to the local minimizer in 7 iterations. The last three iterations are difficult to identify in Figure 2 because they are so close to one another. As a complement
to the graphical display of the search algorithm in Figure 2, the sequential parameter estimates are also given in Table 1.

Table 1: Search path of the L-BFGS algorithm applied to the multiple myeloma data set with initial value β_0 = (0, 0)^⊤ and ε = 0.01.

step    β_1         β_2         f_ε(β)
0        0.00000     0.00000    0.46846
1       -0.20132     0.12241    0.42335
2       -0.55482     0.28094    0.39791
3       -0.52136     0.30626    0.39775
4       -0.53458     0.29458    0.39764
5       -0.53259     0.29147    0.39764
6       -0.53088     0.29251    0.39764
7       -0.53160     0.29222    0.39764
Finally, as a footnote on the error of the approximate loss function fε compared to the original
Gehan loss function fG , we illustrate the differences in the multiple myeloma example in Figure 3.
Each black line in Figure 3 represents a line of non-differentiability in the Gehan loss f_G; these lines correspond to the constraints of the linear optimization problem (5) holding with equality and to the points of maximal error (approximately 3.0 × 10^{-5}) between the modified and original Gehan loss functions. An important message is that such lines appear nearly everywhere, with the exception of small white pockets where the error is smallest. It is clear that the modified loss function is a very close approximation to the original loss function, which confirms our analytical calculations in Section 3. We find that the error in the loss function at the local minimizer is f_ε(β̂_ε) − f_G(β̂_ε) = 2.8586 × 10^{-5}.

Figure 3: Error map of f_ε − f_G (scale ×10^{-5}) over (β_1, β_2) for the multiple myeloma data set with ε = 0.01.

4.2 Mayo PBC

The Mayo primary biliary cirrhosis (PBC) data (Fleming and Harrington, 1991, Appendix D.1) contains information about the survival time and prognostic variables for 418 patients who were eligible to participate in a randomized study of the drug D-penicillamine. Of the 418 patients who met standard eligibility criteria, a total of 312 patients participated in the randomized portion of the study. Using the smaller randomized cohort, the study investigators used stepwise deletion to build
a Cox proportional hazards (PH) model for the natural history of PBC (Dickson et al., 1989).
Of the original ten predictors, stepwise deletion selected five significant variables: age, albumin,
bilirubin, edema, and prothrombin time (protime). We take the natural logarithmic transformation
of albumin, bilirubin, and prothrombin time to conform to the analysis in Fleming and Harrington
(1991). These five variables constitute the natural history model for PBC (Dickson et al., 1989). We
present the coefficient estimates using quasi-Newton methods and linear programming in Table 2.
We note that the two different algorithms lead to solutions which are identical to four decimal
places. Although our quasi-Newton algorithm runs in less than one-half of one second, the linear
programming method of Jin et al. (2003) is still reasonable at 5.3 seconds.
Table 2: Coefficient estimates for the Mayo PBC data

Parameter     New           Jin et al.
age          -0.2706201    -0.2706168
albumin       0.2042720     0.2042710
bilirubin    -0.5941747    -0.5941735
edema        -0.2236625    -0.2236607
protime      -0.2372911    -0.2372882
Table 3: Coefficient estimates for the nursing home data

Parameter     New            Jin et al.
trt            0.14407872     0.14407174
age            0.09617243     0.09616416
sex           -0.62928306    -0.62928863
mar.stat.     -0.25233286    -0.25234432
h1            -0.09115733    -0.09115354
h2            -0.58666473    -0.58664576
h3            -1.07069675    -1.07070227

4.3 Nursing Home
From 1980 to 1982, the National Center for Health Services Research conducted a study to determine
the effect of financial incentives on variation of patient care in nursing homes. In particular, 18 out
of 36 nursing homes from San Diego, California, received higher per diem payments for accepting
and admitting Medicaid patients and additional bonuses when the patient’s prognosis improved.
The study collected data from an additional 18 control nursing homes where no financial incentives
were used. A complete description is given in Morris, Norton, and Zhou (1994). The total sample
size from all 36 nursing homes is n = 1601. Our data set consists of seven covariables: treatment
(trt), age, sex, marital status, and three health status indicators (h1–h3), ranging from the best
health to the worst health. The coefficient estimates are displayed in Table 3.
The sample size of the nursing home data is sufficiently large that the new algorithm makes a significant impact. Using the algorithm of Jin et al. (2003) along with the Barrodale-Roberts simplex
optimization (Koenker and D’Orey, 1987) via quantreg in R, the computation fails. However, the
improved Frisch-Newton (Koenker and Ng, 2005) algorithm performs better and finishes in just
under two minutes (i.e. 1.75 minutes on our MacBook Pro running R 2.9.1). For the nursing
home data set, our quasi-Newton algorithm runs in ten seconds. To accentuate the differences in
CPU times, consider computing standard error estimates using the resampling scheme by Jin et
al. (2003) with 1000 resamples. Then, the method by Jin et al. (2003) runs in approximately 30
hours compared with our algorithm which takes two minutes and forty-five seconds.
5 Remarks
This paper describes a new algorithm for estimating the slope parameter in the semi-parametric
accelerated failure time (AFT) model (Prentice, 1978). The current algorithmic approach to this
estimation problem is linear programming as described by Jin et al. (2003). However, the computational complexity of linear programming leads to extraordinary CPU times for even modest sample
sizes and few predictors. The computing bottleneck has rendered most extensions and applications
of rank-based estimators impractical, especially when compared to the computationally-expedient
maximum partial likelihood estimator (Cox, 1972). This paper addresses the computational discrepancy between these two semi-parametric estimators by providing a new computational algorithm for the AFT model.
Our algorithm is based on gradient-based, quasi-Newton methods for convex loss functions.
Specifically, we adopt a Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm which approximates
the inverse Hessian matrix to determine search direction rather than calculate the Hessian matrix explicitly. When the number of predictors is sufficiently large, we advocate a Limited-BFGS
algorithm which saves significantly on storage space of the approximate Hessian and, therefore,
lowers the computational cost of the standard BFGS algorithm. Our theoretical and numerical
calculations indicate that our method is as accurate as the linear programming method of Jin et
al. (2003) without the computational burden. In particular, our experience suggests the method
of Jin et al. (2003) takes dramatically more CPU time than our algorithm as the sample size increases. In the nursing home example with n = 1601, we showed that our algorithm
reduces rank-based estimation and inference from 30 hours to less than 3 minutes, a significant
reduction in CPU cost and time. We hope the current paper will permit and encourage statisticians to consider weighted logrank estimators in the AFT model for censored data regression as
viable semi-parametric alternatives to the ubiquitous maximum partial likelihood estimator in a
proportional hazards model.
Acknowledgements
The authors thank Janine Olesch from Lübeck University for the joint work on the algorithmic
implementation. This work was supported, in part, by the Computational Life Science Initiative
at Emory University (Conrad), NIH grants R03 AI068484 (Johnson) and Emory’s Center for AIDS
Research, P30 AI050409 (Johnson).
References
[1] Broyden, C. G. (1970) The convergence of a class of double-rank minimization algorithms.
J. Inst. Math. Its Appl. 6 76–90.
[2] Cox, D. R. (1972) Regression models and life-tables (with Discussion), J. Roy. Statist. Soc.
Ser. B 34 187-202.
[3] Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. London: Chapman and Hall.
[4] Dennis, J. E. and Schnabel, R. B. (1983) Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia.
[5] Dickson, E. R., Grambsch, P. M., Fleming, T. R., Fisher, L. D., and Langworthy, A.
(1989). Prognosis in primary biliary cirrhosis: model for decision making. Hepatology 10 1–7.
[6] Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis.
New York: Wiley.
[7] Fletcher, R. (1970) A new approach to variable metric algorithms. Computer Journal 13
317–322.
[8] Gehan, E. A. (1965) A generalized Wilcoxon test for comparing arbitrarily single-censored
samples. Biometrika 52 203–223.
[9] Gill, P. E. and Murray, W. and Wright, M. H. (1981) Practical Optimization. Elsevier,
Bingley.
[10] Goldfarb, D. (1970) A family of variable metric updates derived by variational means. Math.
Computation 24 23–27.
[11] Jin, Z., Lin, D. Y., Wei, L. J. and Ying, Z. (2003) Rank-based inference for the accelerated
failure time model. Biometrika 90 341–353.
[12] Kalbfleisch, J. D. and Prentice, R. L. (2002) The Statistical Analysis of Failure Time
Data. John Wiley: New York.
[13] Koenker, R. and Bassett, G. S. (1978) Regression quantiles. Econometrica 46 33–50.
[14] Koenker, R. and D’Orey, V. (1987) Computing regression quantiles. Appl. Statist. 36
383–393.
[15] Koenker, R. and Ng, P. (2005) A Frisch-Newton Algorithm for Sparse Quantile Regression.
Acta Mathematicae Applicatae Sinica (English Series) 21(2) 225–236.
[16] Krall, J. M., Uthoff, V. A. and Harley, J. B. (1975) A step-up procedure for selecting
variables associated with survival. Biometrics 31 49–57.
[17] Lai, T. L. and Ying, Z. (1991) Rank regression methods for left truncated and right censored
data. Ann. Statist. 19 531–556.
[18] Lin, D. Y., Wei, L. J., and Ying, Z. (1998) Accelerated failure time models for counting
processes. Biometrika 85 605–618.
[19] Louis, T. A. (1981) Nonparametric analysis of an accelerated failure time model. Biometrika
68 381–390.
[20] Mantel, N. (1966) Evaluation of survival data and two new rank order statistics arising in
its considerations. Cancer Chemo. Rep. 50 163–170.
[21] Morris, C. N., Norton, E. C. and Zhou, X. H. (1994) Parametric duration analysis of
nursing home usage. In Case Studies in Biometry (N. Lange, L. Ryan, L. Billard, D. Brillinger,
L. Conquest, and J. Greenhouse, eds.) 231–248. Wiley, New York.
[22] Nocedal, J. and Wright, S. J. (2006) Numerical Optimization. Springer, Berlin.
[23] Prentice, R. L. (1978) Linear rank tests with right-censored data. Biometrika 65 167–179.
[24] Shanno, D. F. (1970) Conditioning of quasi-Newton methods for function minimization.
Math. Computation 24 647–656.
[25] Tsiatis, A. A. (1990) Estimating regression parameters using linear rank tests for censored
data. Ann. Statist. 18 354–372.
[26] Wei, L. J. and Gail, M. H. (1983) Nonparametric estimation for a scale-change with censored
observations. J. Amer. Statist. Assoc. 78 382–388.
[27] Wei, L. J., Ying, Z. and Lin, D. Y. (1990) Regression analysis of censored survival data
based on rank tests. Biometrika 77 845–851.