TACKLING BOX-CONSTRAINED OPTIMIZATION VIA
A NEW PROJECTED QUASI-NEWTON APPROACH
DONGMIN KIM∗, SUVRIT SRA∗†, AND INDERJIT S. DHILLON∗
Abstract. Numerous scientific applications across a variety of fields depend on box-constrained
convex optimization. Box-constrained problems therefore continue to attract research interest. We
address box-constrained (strictly convex) problems by deriving two new quasi-Newton algorithms.
Our algorithms are positioned between the projected-gradient [J. B. Rosen, J. SIAM, 8(1), 1960, pp.
181–217], and projected-Newton [D. P. Bertsekas, SIAM J. Cont. & Opt., 20(2), 1982, pp. 221–
246] methods. We also prove their convergence under a simple Armijo step-size rule. We provide
experimental results for two particular box-constrained problems: nonnegative least squares (NNLS),
and nonnegative Kullback-Leibler (NNKL) minimization. For both NNLS and NNKL our algorithms
perform competitively compared with well-established methods on medium-sized problems; for larger
problems our approach frequently outperforms the competition.
Key words. Non-negative least squares, KL-divergence minimization, projected Newton methods, quasi-Newton, box-constrained convex optimization.
1. Introduction. The central object of study in this paper is the box-constrained
optimization problem
    min_{x ∈ R^n} f(x),   subject to   l ≤ x ≤ u,                                        (1.1)
where l and u are fixed vectors and inequalities are taken componentwise; the function
f is assumed to be twice continuously differentiable and strictly convex.
Problem (1.1) is a simple constrained optimization problem that can be solved by
numerous methods. But general purpose methods might be overkill; indeed, for simple problems they can be computationally expensive or even needlessly complicated.
These concerns motivate us to develop algorithms specifically designed for (1.1). Our
algorithms are founded upon the unconstrained BFGS and L-BFGS procedures, both
of which we adapt to our constrained setting. This adaptation is the crux of our work;
but before we describe it, we would like to position it vis-à-vis other methods.
A basic, and perhaps obvious, approach to solving (1.1) is the well-known projected-gradient method [23]. This method, however, frequently suffers from slow convergence, making it unappealing when medium to high accuracy solutions are desired. In such cases higher-order methods might be preferable, e.g., LBFGS-B [5],¹ TRON [12],
or projected-Newton [1] (which could be viewed as a non-trivial second-order extension to projected-gradient). LBFGS-B uses a limited amount of additional memory to
approximate the second-order information of the objective function, while TRON uses
trust-regions [15], and is usually very efficient for medium scale problems.
LBFGS-B and TRON tackle constraints by explicitly distinguishing between steps
that identify active variables and steps that solve the resulting unconstrained subproblems. In comparison, projected-Newton makes the variable-identification step
implicit by dynamically finding active candidates at every iteration. Bertsekas noted
that such an implicit identification of active variables could lead to unstable active
sets across iterations, which in turn could adversely affect convergence [1]. Thus, to
∗Department of Computer Sciences, University of Texas, Austin, TX 78712-1188, USA (dmkim@cs.utexas.edu, inderjit@cs.utexas.edu).
†Max-Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany (suvrit.sra@tuebingen.mpg.de).
¹An extension of L-BFGS [18] that allows box constraints.
stabilize the projected-Newton method, he developed a refined procedure that guarantees finite identification of the active set. Although we build on projected-Newton, we avoid its refinements and make important modifications: we use a simpler line-search and a different choice of working sets. These changes sacrifice the finite-identification property but gain ease of implementation. Moreover, by supporting a limited-memory version, we allow our method to scale well to large problems.
The rest of this paper is organized as follows. In §2.1 we overview our approach;
in §2.2 major implementation issues are discussed; and in §3 we prove convergence.
In §4 we compare our method to several other major approaches and show numerical
results across several synthetic as well as real-life data. Finally, §5 concludes the paper
with an outlook to future work.
2. Algorithms and Theory. Recall that the central problem of interest is
    min_{x ∈ R^n} f(x),   subject to   l ≤ x ≤ u,                                        (2.1)
where l and u are fixed vectors and inequalities are taken componentwise; f is a twice
continuously differentiable, strictly convex function.
2.1. Overview of the method. Our algorithm for solving (2.1) is inspired
by projected-gradient [23] and projected-Newton [1]. Viewed in the general light of
gradient methods [2], our method distinguishes itself in two main ways: (i) its choice
of the variables optimized at each iteration; and (ii) the gradient-scaling it uses to
compute the descent direction. Further details of the resulting iterative method follow.
At each iteration we partition the variables into two groups: free and fixed. Then,
we optimize over only the free variables while holding the fixed ones unchanged. The
fixed variables are defined as a particular subset of the active variables,² and this
subset is identified using both the gradient and second-order information.
Formally, first at iteration k, we identify the binding set
    I_1^k = { i : x_i^k = l_i ∧ ∂_i f(x^k) > 0,  or  x_i^k = u_i ∧ ∂_i f(x^k) < 0 },      (2.2)
where ∂i f (xk ) denotes the i-th component of the gradient ∇f (xk ). The set I1k collects
variables that satisfy the KKT conditions (1 ≤ i ≤ n):
    ∇f(x) − λ + µ = 0,    λ_i (x_i − l_i) = 0,    µ_i (x_i − u_i) = 0,    and    λ, µ ≥ 0,
where λ and µ are Lagrange multipliers corresponding to the ℓ ≤ x and x ≤ u
constraints, respectively. To see why the gradient information makes a difference
in (2.2), consider an active variable x_i^k = l_i whose gradient component ∂_i f(x^k) ≤ 0. For such a variable it may be possible to decrease the objective further if at the next iteration we allow x_i^{k+1} > l_i. Thus, it is better to leave x_i^k free by not including it in (2.2) (a similar observation holds for x_i^k = u_i).
Next, we refine the active variables using gradient-scaling information. To that
end, let S k be some non-diagonal positive-definite matrix. Further let S̄ k be the
matrix induced by S k and the variables not in the binding set, i.e.,
    S̄_ij^k = { S_ij^k   if i, j ∉ I_1^k;    0   otherwise }.
²Components of x that satisfy the constraints with equality.
Then, we define the second index-set
    I_2^k = { i : x_i^k = l_i ∧ [S̄^k ∇f(x^k)]_i > 0,  or  x_i^k = u_i ∧ [S̄^k ∇f(x^k)]_i < 0 }.      (2.3)
The set I_2^k captures variables left free by I_1^k that should actually be fixed. For example, if x_i^k is active but not in the binding set (i.e., x_i^k = l_i but i ∉ I_1^k), then, as argued before, the objective function can potentially be decreased if we let x_i^{k+1} > l_i. In the presence of gradient-scaling, however, this decrease might not happen, because even though i ∉ I_1^k, the scaled gradient [S̄^k ∇f(x^k)]_i is positive. Thus, at the next iteration too we should fix x_i^{k+1} = l_i. The actual fixed set is defined below.
Definition 2.1 (Fixed set). Given the sets I_1^k and I_2^k identified by (2.2) and (2.3), respectively, the fixed set I^k at iteration k is defined as the union I^k = I_1^k ∪ I_2^k.
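To make the index-set computations concrete, the following Python/NumPy sketch (hypothetical names; the authors' code is in Matlab) identifies the binding set (2.2), the set (2.3), and the fixed set of Definition 2.1, assuming the scaling matrix S is stored densely and the bounds l, u are vectors:

    import numpy as np

    def fixed_set(x, grad, S, l, u, tol=1e-12):
        """Return boolean masks (I1, I) for the binding set (2.2) and the fixed set I = I1 | I2."""
        at_l = np.abs(x - l) <= tol                       # active at the lower bound
        at_u = np.abs(x - u) <= tol                       # active at the upper bound
        I1 = (at_l & (grad > 0)) | (at_u & (grad < 0))    # binding set (2.2)

        free1 = ~I1                                       # variables not in I1
        Sbar = np.zeros_like(S)                           # S-bar: keep rows/cols outside I1
        Sbar[np.ix_(free1, free1)] = S[np.ix_(free1, free1)]
        sg = Sbar @ grad                                  # scaled gradient [S-bar grad f]
        I2 = (at_l & (sg > 0)) | (at_u & (sg < 0))        # second index set (2.3)
        return I1, I1 | I2                                # fixed set of Definition 2.1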
2.1.1. Projected quasi-Newton Step. Having defined the fixed set we now
present the main projected quasi-Newton step. The projection part is easy and requires computing PΩ (x), the orthogonal projection of x onto the set Ω = [l, u]. The
quasi-Newton part is slightly more involved as the variables not in the fixed-set I k
influence the gradient-scaling matrix, which is given by
    Ŝ_ij^k = { S_ij^k   if i, j ∉ I^k;    0   otherwise }.
Finally, our projected quasi-Newton step is given by
    x^{k+1} = P_Ω( x^k − α_k Ŝ^k ∇f(x^k) ),                                              (2.4)
where αk > 0 is a step-size.
Pseudo-code encapsulating the details described above is presented in Algorithm 1,
which we call projected quasi-Newton (PQN). Observe that in PQN an arbitrary
positive-definite matrix may be used to initialize S 0 , and any feasible point to initialize x0 . Further notice that we have not yet specified the details of computing the
step-size αk , and of updating the matrix S k . We now turn to these two important
steps (and some other implementation issues) in §2.2 below.
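To make the overall iteration concrete, a minimal Python/NumPy sketch of one pass of the framework might look as follows (hypothetical helper names, not the authors' Matlab code); fixed_set is the computation of Definition 2.1 sketched above, and armijo stands for the step-size rule of §2.2.1 (sketched there):

    import numpy as np

    def pqn_step(x, f, grad_f, S, l, u):
        """One projected quasi-Newton iteration (2.4); a sketch, not the reference implementation."""
        g = grad_f(x)
        _, I = fixed_set(x, g, S, l, u)               # fixed set of Definition 2.1
        free = ~I
        Shat = np.zeros_like(S)                       # S-hat: rows/cols of fixed variables zeroed
        Shat[np.ix_(free, free)] = S[np.ix_(free, free)]
        alpha = armijo(f, x, g, Shat, l, u)           # Armijo rule of Sec. 2.2.1
        return np.clip(x - alpha * (Shat @ g), l, u)  # projection P_Omega onto the box [l, u]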
2.2. Implementation details. In this section we describe a line-search routine
for computing αk , and two methods for updating the gradient scaling matrix S k .
2.2.1. Line Search. We choose the Armijo step-size rule [2, 6] for the line-search; this rule is simple and usually works well for both unconstrained and constrained problems. The important difference in our implementation of Armijo is that
we restrict the computation to be over the free variables only.
More specifically, to compute the step-size, first choose an initial step γ > 0 and
some scalars τ , σ ∈ (0, 1). Then, determine the smallest non-negative integer t, for
which the sufficient-descent condition
    f(x^k) − f( P_Ω( x^k − γσ^t Ŝ^k ∇f(x^k) ) )  ≥  τ γσ^t ∇f(x^k)^T Ŝ^k ∇f(x^k),        (2.5)
is satisfied. The step-size is then set to α_k = γσ^t. In §3 we prove that with this
Armijo-based step-size our algorithm converges to the optimum.
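A backtracking sketch of this rule in Python/NumPy follows (hypothetical names; a sketch, not the authors' code). It tries t = 0, 1, 2, ... until the sufficient-descent test (2.5) holds:

    import numpy as np

    def armijo(f, x, grad, Shat, l, u, gamma=1.0, sigma=0.5, tau=1e-4, max_t=50):
        """Armijo step size for the projected step (2.4); Shat has the rows and
        columns of the fixed set zeroed out."""
        d = Shat @ grad                  # scaled gradient; the search direction is -d
        slope = tau * grad.dot(d)        # tau * grad' Shat grad
        fx = f(x)
        step = gamma
        for _ in range(max_t):
            x_trial = np.clip(x - step * d, l, u)   # P_Omega(x - gamma sigma^t Shat grad)
            if fx - f(x_trial) >= step * slope:     # sufficient-descent condition (2.5)
                return step
            step *= sigma                           # backtrack: t <- t + 1
        return step                                 # fall back to the smallest trial step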
2.2.2. Gradient Scaling. Now we turn to gradient scaling. To circumvent
some of the problems associated with using the full (inverse) Hessian, it is usual
to approximate it. Popular choices involve iteratively approximating the Hessian
Algorithm 1: Projected Quasi-Newton Framework
Input: Function f(x); vectors l, u ∈ R^n
Output: x* = argmin_{l ≤ x ≤ u} f(x)
Initialize k ← 0, S^k ← I, and l ≤ x^k ≤ u;
repeat
    Compute the first index set
        I_1^k = { i : x_i^k = l_i ∧ ∂_i f(x^k) > 0,  or  x_i^k = u_i ∧ ∂_i f(x^k) < 0 };
    Compute the second index set
        I_2^k = { i : x_i^k = l_i ∧ [S̄^k ∇f(x^k)]_i > 0,  or  x_i^k = u_i ∧ [S̄^k ∇f(x^k)]_i < 0 };
    Compute the fixed set I^k = I_1^k ∪ I_2^k;
    if every i lies in I^k, or ∂_i f(x^k) = 0 for all i ∉ I^k, then
        S̄^k ← I;
        re-compute I^k;
    Find an appropriate value for α_k using the line-search;
    x^{k+1} ← P_Ω( x^k − α_k Ŝ^k ∇f(x^k) );
    Update the gradient scaling matrix S^k to obtain S^{k+1}, if necessary;
    k ← k + 1;
until stopping criteria are met;
using the Powell-Symmetric-Broyden (PSB), Davidon-Fletcher-Powell (DFP), or Broyden-Fletcher-Goldfarb-Shanno (BFGS) updates; the last is generally believed to be the most effective [2, 8], so we adapt it for our algorithm.
Suppose H k is the current approximation to the Hessian, and that vectors g and
s are given by the differences
    g = ∇f(x^{k+1}) − ∇f(x^k),    and    s = x^{k+1} − x^k.
Then, the BFGS update makes a rank-two correction to H k , and is given by
    H^{k+1} = H^k − (H^k s s^T H^k) / (s^T H^k s) + (g g^T) / (s^T g).                   (2.6)
Let S k be the inverse of H k . Applying the Sherman-Morrison-Woodbury formula
to (2.6) one obtains the update
    S^{k+1} = S^k + ( 1 + (g^T S^k g)/(s^T g) ) (s s^T)/(s^T g) − (S^k g s^T + s g^T S^k)/(s^T g).      (2.7)
The version of Alg. 1 that uses (2.7) for the gradient-scaling will be called PQN-BFGS.
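For illustration, the inverse update (2.7) takes only a few lines of Python/NumPy (a hedged sketch with hypothetical names; the curvature safeguard is an added assumption, not part of the update as stated):

    import numpy as np

    def bfgs_inverse_update(S, s, g):
        """One BFGS update (2.7) of the inverse Hessian approximation S, given
        s = x_{k+1} - x_k and g = grad f(x_{k+1}) - grad f(x_k)."""
        sg = s.dot(g)
        if sg <= 1e-12:                  # skip the update if the curvature s'g is not positive
            return S
        Sg = S @ g
        return (S + (1.0 + g.dot(Sg) / sg) * np.outer(s, s) / sg
                  - (np.outer(Sg, s) + np.outer(s, Sg)) / sg)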
Although updating the gradient-scaling matrix via the BFGS formula is efficient, it requires O(n²) memory, which restricts its applicability to large-scale
problems. But we can also implement a limited memory BFGS (L-BFGS) [18] version,
which we call PQN-LBFGS. Here, instead of storing an actual gradient scaling matrix,
a small number of additional vectors (say M ) are used to approximate the (inverse)
Hessian. The standard L-BFGS approach is implemented using the following formula
    S^k = (s_{k−1}^T g_{k−1})/(g_{k−1}^T g_{k−1}) · V̄_{k−M}^T V̄_{k−M}
          + ρ_{k−M} V̄_{k−M+1}^T s_{k−M} s_{k−M}^T V̄_{k−M+1}
          + ρ_{k−M+1} V̄_{k−M+2}^T s_{k−M+1} s_{k−M+1}^T V̄_{k−M+2}
          + · · ·
          + ρ_{k−1} s_{k−1} s_{k−1}^T,                                                    (2.8)
where S^0 = I, k ≥ 1, and the scalars ρ_k and matrices V̄_{k−M} are defined by
    ρ_k = 1/(s_k^T g_k),    V̄_{k−M} = [V_{k−M} · · · V_{k−1}],    and    V_k = I − ρ_k s_k g_k^T.
At this point the reader might wonder how our method differs from the original
BFGS and L-BFGS methods. There are two basic differences: First, we are performing
constrained optimization. Second, we actually perform the updates (2.7) and (2.8)
with respect to the first index set (2.2)—that is, both updates (2.7) and (2.8) result
in the matrix S̄ k , which is then used in Algorithm 1.
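In a limited-memory implementation the matrix S^k of (2.8) is never formed explicitly; the product S^k ∇f(x^k) can be applied from the stored pairs with the standard two-loop recursion of [18]. A hedged Python/NumPy sketch (hypothetical names; the actual PQN-LBFGS code additionally restricts the update to the first index set, as described above):

    import numpy as np

    def lbfgs_apply(grad, s_list, g_list):
        """Compute S^k @ grad via the two-loop recursion [18]; s_list and g_list
        hold the last M difference pairs (oldest first) with s_i' g_i > 0."""
        q = grad.copy()
        rhos = [1.0 / s.dot(g) for s, g in zip(s_list, g_list)]
        alphas = []
        for s, g, rho in zip(reversed(s_list), reversed(g_list), reversed(rhos)):
            a = rho * s.dot(q)
            alphas.append(a)
            q -= a * g
        if s_list:                       # initial scaling, as in the leading term of (2.8)
            q *= s_list[-1].dot(g_list[-1]) / g_list[-1].dot(g_list[-1])
        for (s, g, rho), a in zip(zip(s_list, g_list, rhos), reversed(alphas)):
            b = rho * g.dot(q)
            q += (a - b) * s
        return q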
3. Convergence. Let {xk } be a sequence of iterates produced by either the
PQN-BFGS or the PQN-LBFGS versions of Algorithm 1. We prove below under mild
assumptions that each accumulation point of the sequence {xk } is a stationary point
of (2.1). We begin by showing that the sequence {f (xk )} is monotonically decreasing.
Lemma 3.1 (Descent). If xk is not a stationary point of Problem (2.1), then
with xk+1 given by (2.4), there exists some constant α̂ > 0 such that
    f(x^{k+1}) < f(x^k),    for all α_k ∈ (0, α̂].
Proof. By construction of the fixed set I k , the free variables in xk must satisfy:
    either   l_i < x_i^k < u_i,
    or       x_i^k = l_i ∧ ∂_i f(x^k) ≤ 0 ∧ [S̄^k ∇f(x^k)]_i ≤ 0,
    or       x_i^k = u_i ∧ ∂_i f(x^k) ≥ 0 ∧ [S̄^k ∇f(x^k)]_i ≥ 0.
Note that if I_1^k ∪ I_2^k captures all the variables in x, Algorithm 1 resets S̄^k = I and re-computes I^k, which then reduces to I_1^k. If there are still no free variables, then all the KKT conditions are fulfilled; but this implies that x^k is stationary, a contradiction. Thus, for a non-stationary point x^k there exists at least one free index i for which ∂_i f(x^k) ≠ 0, and this yields a descent direction. To see how, let d^k be the direction defined by
    d_i^k = { −[Ŝ^k ∇f(x^k)]_i   if i ∉ I^k;    0   otherwise }.
Now partition (reordering variables if necessary) all objects with respect to I k so that
    Ŝ^k = [ T^k   0 ]       ∇f(x^k) = [ y^k ]       d^k = [ p^k ] = [ −T^k y^k ]
          [ 0     0 ],                 [ z^k ],            [ 0   ]   [ 0        ].        (3.1)
In (3.1) the components that constitute z^k belong to I^k, while those that constitute y^k, p^k, and T^k do not. Given this partitioning we see that
    ⟨∇f(x^k), d^k⟩ = ⟨y^k, p^k⟩ = −⟨y^k, T^k y^k⟩ < 0,
where the final inequality follows because T k is positive-definite (as it is a principal
submatrix of the positive definite matrix S k ). Thus, dk is a descent direction.
Now we only need to verify that descent along dk upholds feasibility. It is easy
to check that there exists an ᾱ > 0 such that for i ∉ I^k,
    l_i ≤ P_Ω( x_i^k + α_k d_i^k ) = x_i^k + α_k d_i^k ≤ u_i,    for all α_k ≤ ᾱ.
Noting that d^k is zero for the fixed variables, we may equivalently write
    l ≤ P_Ω( x^k + α_k d^k ) = x^k + α_k d^k ≤ u,    for all α_k ≤ ᾱ.                    (3.2)
Thus, we can finally conclude that there exists some α̂ ≤ ᾱ such that
    f(x^{k+1}) < f(x^k),    for all α_k ∈ (0, α̂].
To prove the convergence of an iterative algorithm one generally needs to show
that the sequence of iterates {xk } has an accumulation point. Since the existence of
limit points inevitably depends on the given problem instance, it is frequently assumed
as a given. For our problem, accumulation points certainly exist as we are dealing
with a continuous function over a compact domain.
To prove the main theorem, we consider the set Ω = [l, u] as the domain of
Problem 2.1 and assume the following condition:
Assumption 1. The eigenvalues of the gradient scaling matrix T k (for all k) lie
in the interval [m, M ] where 0 < m ≤ M < ∞.
This assumption is frequently found in convergence proofs of iterative methods,
especially when proving the so-called gradient related condition, which, in turn helps
prove convergence for several iterative descent methods (as it ensures that the search
direction remains well-defined throughout the iterations).
Lemma 3.2 (Gradient Related Condition). Let {xk } be a sequence generated
by (2.4). Then, for any subsequence {xk }k∈K that converges to a non-stationary
point, the corresponding subsequence {dk }k∈K is bounded and satisfies
    lim sup_{k→∞, k∈K} ‖x^{k+1} − x^k‖ < ∞,                                              (3.3)
    lim sup_{k→∞, k∈K} ∇f(x^k)^T d^k < 0.                                                (3.4)
Proof. Since {xk }k∈K is a converging sequence over a compact domain, xk+1 − xk
is also bounded for all k, therefore (3.3) holds. Since dk = [−T k y k ; 0] and T k is
positive-definite, under Assumption 1, the gradient-related condition (3.4) follows
immediately [2, Chapter 1].
Now we are ready to prove the main convergence theorem. Although, using Lemma 3.2, it is possible to invoke a general convergence proof for feasible direction methods (e.g., [2]), for completeness we outline a short proof.
Theorem 3.3 (Convergence). Let {xk } be a sequence of points generated by (2.4).
Then every limit point of {xk } is a stationary point of Problem (2.1).
Proof. The proof is by contradiction. Since {f (xk )} is monotonically decreasing
(Lemma 3.1), the domain Ω = [l, u] is compact, and f is continuous, we have
    lim_{k→∞} ( f(x^k) − f(x^{k+1}) ) = 0.
Further, the sufficient-descent condition (2.5) implies that
    f(x^k) − f(x^{k+1}) ≥ −τ α_k ∇f(x^k)^T d^k,    τ ∈ (0, 1),                           (3.5)
whereby
    { α_k ∇f(x^k)^T d^k } → 0.                                                           (3.6)
Assume that a subsequence {xk }k∈K converges to a non-stationary point x̃. From (3.4)
and (3.6), we obtain {αk }k∈K → 0. Since dk is gradient related, this implies that we
have some index k1 ≥ 0 that satisfies (3.2) such that
    P_Ω( x^k + α_k d^k ) = x^k + α_k d^k,    for all k ∈ K, k ≥ k1.
Further, the Armijo rule (2.5) implies that there exists an index k2 satisfying
    f(x^k) − f( P_Ω( x^k + (α_k/γ) d^k ) ) = f(x^k) − f( x^k + (α_k/γ) d^k ) < −τ (α_k/γ) ∇f(x^k)^T d^k,
or equivalently,
    ( f(x^k) − f( x^k + α̌_k d^k ) ) / α̌_k < −τ ∇f(x^k)^T d^k,                            (3.7)
for all k ∈ K, k ≥ k2, where α̌_k = α_k/γ.
Now let k̃ = max{k1, k2}. Since {d^k}_{k∈K} is bounded, it can further be shown that there exists a subsequence {d^k}_{k∈K̃}, K̃ ⊂ K, of {d^k}_{k∈K} such that {d^k}_{k∈K̃} → d̃, where d̃ is some finite vector. On the other hand, by applying the mean value theorem to (3.7), it can be shown that there exists some {α̃_k}_{k∈K̃} such that
    −∇f( x^k + α̃_k d^k )^T d^k < −τ ∇f(x^k)^T d^k,
where α̃_k ∈ [0, α̌_k], for all k ∈ K̃, k ≥ k̃. By taking limits on both sides we obtain
    −∇f(x̃)^T d̃ ≤ −τ ∇f(x̃)^T d̃,    or    0 ≤ (1 − τ) ∇f(x̃)^T d̃,    τ ∈ (0, 1),
from which it follows that ∇f(x̃)^T d̃ ≥ 0, contradicting that {d^k} satisfies (3.4).
4. Applications & Numerical Results. In this section we report numerical
results on two fundamental and important examples of (1.1). Performance results are
shown for different software fed with both synthetic and real-world data. The experiments reveal that on several datasets our methods are competitive with well-established algorithms; in fact, as problem size increases our algorithms often outperform the competition significantly.
4.1. Non-negative Least Squares. The first specific problem to which we
specialize our method is nonnegative least-squares (NNLS):
    min_{x ∈ R^n}  (1/2) ‖Ax − b‖_2^2,    s.t.   x ≥ 0,                                  (4.1)
where A ∈ Rm×n , and b ∈ Rm . To fit (4.1) in our framework, we further assume
that A^T A is positive-definite, whereby the objective function is strictly convex. We remark that (L-)BFGS ensures the positive definiteness of the gradient scaling matrix S^k, hence Assumption 1 is always satisfied. Note that even though (4.1) is written
with only lower bounds, we can easily guarantee that the sequence of iterates {xk }
generated by our algorithm is bounded. To see how, we first show that the initial
level set L_0 = {x : ‖Ax − b‖_2 ≤ ‖Ax^0 − b‖_2} is bounded; then, since each iteration enforces descent, the sequence {x^k} must be bounded as it also belongs to this level set.
To show that L0 is bounded, consider for any x ∈ L0 , the inequalities
    ‖Ax‖_2 − ‖b‖_2  ≤  ‖Ax − b‖_2  ≤  ‖Ax^0 − b‖_2.
These inequalities yield a bound on ‖x‖_2 (which implies L_0 is bounded), because
    σ_min(A) · ‖x‖_2  ≤  ‖Ax‖_2  ≤  ‖Ax^0 − b‖_2 + ‖b‖_2,    so that
    ‖x‖_2  ≤  σ_min(A)^{-1} ( ‖Ax^0 − b‖_2 + ‖b‖_2 )  =  U  <  ∞,                        (4.2)
where σ_min(A) > 0 is the smallest singular value of A.
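To run the framework on (4.1), only the objective and gradient callbacks are needed; a brief Python/NumPy sketch (hypothetical names, unrelated to the packages compared below) is:

    import numpy as np

    def nnls_oracles(A, b):
        """Objective and gradient of f(x) = 0.5 * ||Ax - b||^2 for problem (4.1)."""
        def f(x):
            r = A @ x - b
            return 0.5 * r.dot(r)
        def grad(x):
            return A.T @ (A @ x - b)     # gradient A'(Ax - b)
        return f, grad

Together with l = 0 and an (effectively) infinite upper bound, these callbacks plug directly into the iteration sketched in §2.1.1.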
NNLS is a classical problem in scientific computing [11]; applications include
image restoration [17], vehicle dynamics control [25], and data fitting [14]. Given
its wide applicability many specialized algorithms have been developed for NNLS,
e.g., methods based on active sets such as the classic Lawson-Hanson method [11], or
its efficient modification Fast-NNLS (FNNLS) [4]. Not surprisingly, some constrained
optimization methods have also been applied to solve NNLS [3, 13]. It is interesting
to note that for large scale problems these specialized algorithms are outperformed
by modern methods such as TRON, LBFGS-B, or the methods of this paper.
There exist several different algorithms for solving the NNLS problem. Of these,
some are designed expressly for NNLS, while others are for general differentiable
convex minimization. Naturally, owing to the large variety of optimization methods
and software available, we cannot hope to be exhaustive in our comparisons. We solved
the NNLS problem using several different software, and then picked the ones that
performed the best. Specifically, we show results on the following implementations:
• TRON – a trust region Newton-type method for bound constrained optimization [12]. The underlying implementation was in FORTRAN. For convenience
we used the MTRON [19] Matlab interface.
• LBFGS-B – a limited memory BFGS algorithm extended to handle bound constraints [5]. Again, the underlying implementation was in FORTRAN, and
we invoked it via a Matlab interface.
• PQN-BFGS – our projected quasi-Newton implementation with the full BFGS update; written entirely in Matlab (i.e., no MEX or C-level interface is used).
• PQN-LBFGS – our projected L-BFGS implementation; written entirely in Matlab.
We also compare our method with FNNLS, which is Bro and De Jong's improved
implementation of the Lawson-Hanson NNLS procedure [4]. FNNLS was obtained as
a Matlab implementation. We remark that the classic NNLS algorithm of Lawson
and Hanson [11], which is still available as the function lsqnonneg in Matlab, was
far too slow for all our experiments (taking several hours to make progress when other
methods were taking seconds), and hence we have not reported any results on it. For
all the subsequent experiments, whenever possible, we use ‖g(x)‖∞ to decide when to terminate; here g_i is defined as
    g_i(x^k) = { ∂_i f(x^k)   if i ∉ I^k;    0   otherwise }.
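In code, this amounts to zeroing the gradient components belonging to the fixed set before taking the infinity norm; a small Python/NumPy sketch (hypothetical names):

    import numpy as np

    def projected_grad_norm(grad, fixed_mask):
        """Infinity norm of g(x) used in the stopping test; g_i = 0 on the fixed set."""
        g = grad.copy()
        g[fixed_mask] = 0.0
        return np.linalg.norm(g, np.inf)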
We note that each method also includes its own set of stopping criteria; we leave these other criteria at their default settings.
We first show experiments on synthetic data. These experiments compare our method against others in a controlled setting. They obviously do not reflect the complete picture, because the behavior of the different algorithms can vary
substantially on real-world data. Nevertheless, results on synthetic data do provide
a basic performance comparison that helps place our method relative to competing
approaches. We use the following three data sets:
• P1: dense random (uniform) nonnegative matrices with varying sizes,
• P2: medium sized sparse random matrices, and
• P3: large sparse random matrices with varying sparsity.
    Matrix    Size           κ(A)
    P1-1      2800 × 2000    494
    P1-2      3600 × 2400    459
    P1-3      4400 × 2800    450
    P1-4      5200 × 3200    452
    P1-5      6000 × 3600    460

Table 4.1: The size and condition number of the test matrices A in dense data set P1.
We generated the test matrices A and vectors b corresponding to these datasets
using Matlab’s rand and sprand functions. Table 4.1 shows the number of rows and
columns in A for dataset P1, along with the associated condition numbers κ(A). The
matrices for problem set P2 were of size 12000 × 6400, and for P3 they were of size
65536 × 50000. We varied the sparsity in both P2 and P3. Table 4.2 lists the matrices
in these datasets. Note that by sparsity we mean the ratio of zero entries to the total
number of entries in the given matrix.
    Matrix      P2-1    P2-2    P2-3    P2-4    P2-5    P2-6
    Sparsity    0.996   0.994   0.992   0.99    0.988   0.900

    Matrix      P3-1    P3-2    P3-3    P3-4    P3-5    P3-6
    Sparsity    0.998   0.997   0.996   0.995   0.994   0.990

Table 4.2: The sparsity of the test matrices A in problem sets P2 and P3.
Table 4.3 shows the running times of TRON, FNNLS, PQN-BFGS, LBFGS-B, and PQN-LBFGS for the matrices in problem set P1. We remark that FNNLS requires the inputs
in the form of AT A and AT b, hence we included the time for these one-time computations while reporting results for FNNLS. On some mid-sized matrices, e.g., 2800 × 2000
(P1-1) and 3600 × 2400 (P1-2), TRON and FNNLS remain relatively competitive to
others. However, with increasing matrix sizes, the running time of these methods
increases drastically while that of LBFGS-B and PQN-LBFGS changes only linearly. Consequently, for the largest matrix in the problem set (P1-5), PQN-LBFGS outperforms
others significantly, as revealed by Table 4.3.
In Table 4.4 we show running time comparisons analogous to those in Table 4.3,
except that the matrices used were sparse. As shown in the table, LBFGS-B and PQN-LBFGS generally outperform the others for sparse matrices, and the difference becomes
starker with decreasing sparsity. However, we also observe that sparsity benefits the
TRON software, as it is claimed to take advantage of sparse problems [12].
¹A suffix (F) denotes that the underlying implementation is in FORTRAN, while (M) denotes Matlab. All reported results are averages over 10 runs to decrease variability.
    Method¹          Metric      P1-1       P1-2       P1-3       P1-4       P1-5
    TRON(F)          time        55.60s     8.32s      142.93s    204.09s    290.77s
                     f(x)        4.45e+04   1.67e+04   5.56e+04   7.64e+04   1.02e+05
                     ‖g(x)‖∞     1.17e-05   2.27e-04   1.69e-04   2.80e-05   4.85e-05
    FNNLS(M)         time        9.56s      16.93s     26.89s     38.94s     60.22s
                     f(x)        4.45e+04   1.67e+04   5.46e+04   7.64e+04   1.02e+05
                     ‖g(x)‖∞     2.64e-14   5.63e-15   2.64e-14   1.50e-14   2.01e-14
    PQN-BFGS(M)      time        71.92s     18.31s     158.56s    120.17s    112.12s
                     f(x)        4.45e+04   1.67e+04   5.56e+04   7.64e+04   1.02e+05
                     ‖g(x)‖∞     0.0018     8.49e-04   0.0022     0.0055     0.0011
    LBFGS-B(F)       time        17.78s     3.71s      33.91s     26.41s     27.39s
                     f(x)        4.45e+04   1.67e+04   5.56e+04   7.64e+04   1.02e+05
                     ‖g(x)‖∞     9.76e-04   9.95e-04   9.28e-04   9.23e-04   9.91e-04
    PQN-LBFGS(M)     time        8.77s      5.06s      9.24s      18.05s     17.75s
                     f(x)        4.45e+04   1.67e+04   5.46e+04   7.64e+04   1.02e+05
                     ‖g(x)‖∞     7.40e-04   7.28e-04   8.38e-04   3.72e-04   5.98e-04

Table 4.3: NNLS experiments on dataset P1. For all algorithms (except FNNLS), we used ‖g(x)‖∞ ≤ 10⁻³ as the main stopping criterion. In addition, we stop each method after 10,000 iterations or 20,000 seconds. Each method retains its own default stopping criteria, e.g., PQN-BFGS, LBFGS-B, and PQN-LBFGS terminate iterations when their internal line-search fails. Finally, we ran FNNLS with no custom parameters as it does not support the stopping criterion ‖g(x)‖∞.
    Method¹          Metric      P2-1       P2-2        P2-3        P2-4        P2-5        P2-6
    TRON(F)          time        153.31s    2.06e+03s   2.68e+03s   3.58e+03s   4.66e+03s   1.68e+04s
                     f(x)        1.11e+09   1.45e+09    1.51e+08    2.06e+09    1.30e+10    1.56e+12
                     ‖g(x)‖∞     0.077      0.050       0.174       2.75e-04    0.0053      0.371
    PQN-BFGS(M)      time        48.93s     105.61s     144.27s     186.95s     202.21s     601.78s
                     f(x)        1.11e+09   1.45e+09    1.51e+08    2.06e+09    1.30e+10    1.56e+12
                     ‖g(x)‖∞     0.0092     0.0079      0.0092      0.0088      0.0084      0.0098
    LBFGS-B(F)       time        0.55s      0.85s       1.35s       1.89s       2.53s       43.50s
                     f(x)        1.11e+09   1.45e+09    1.51e+08    2.06e+09    1.30e+10    1.56e+12
                     ‖g(x)‖∞     0.0086     0.0088      0.0097      0.0088      0.0091      1.15
    PQN-LBFGS(M)     time        0.22s      0.37s       0.49s       0.75s       0.84s       18.34s
                     f(x)        1.11e+09   1.45e+09    1.51e+08    2.06e+09    1.30e+10    1.56e+12
                     ‖g(x)‖∞     0.0087     0.0099      0.0083      0.0085      0.0046      0.88

Table 4.4: NNLS experiments on dataset P2. For all algorithms, we used ‖g(x)‖∞ ≤ 10⁻² as the main stopping criterion, and as for P1, other method-specific stopping criteria are set to their defaults. FNNLS does not scale on this dataset.
    Method¹          Metric      P3-1       P3-2       P3-3       P3-4       P3-5       P3-6
    LBFGS-B(F)       time        0.27s      0.47s      0.48s      0.74s      0.94s      2.01s
                     f(x)        5.24e+07   3.46e+07   1.82e+09   1.18e+09   3.60e+09   1.13e+09
                     ‖g(x)‖∞     0.010      0.014      0.0057     0.012      0.0099     0.0095
    PQN-LBFGS(M)     time        0.077s     0.13s      0.18s      0.26s      0.38s      0.68s
                     f(x)        5.24e+07   3.46e+07   1.82e+09   1.18e+09   3.60e+09   1.13e+09
                     ‖g(x)‖∞     0.008      0.0032     0.0051     0.0073     0.0054     0.008

Table 4.5: NNLS experiments on dataset P3. The main stopping criterion is ‖g(x)‖∞ ≤ 10⁻² for both algorithms; other settings are the same as in the previous experiments. FNNLS, TRON, and PQN-BFGS do not scale on this dataset.
Our next set of experiments is on dataset P3, and the result is shown in Table 4.5.
Note that only LBFGS-B and PQN-LBFGS scale to this problem set. Although both
methods perform well across all the problems, PQN-LBFGS is seen to have an edge
    Matrix          Size             Sparsity
    ash958          958 × 292        0.9932
    well1850        1850 × 712       0.9934
    bcspwr10        5300 × 5300      0.9992
    e40r5000        17281 × 17281    0.9981
    conf6.0-8000    49152 × 49152    0.9992

Table 4.6: The size and sparsity of some real-world data sets from the MatrixMarket.
    Method¹          Metric      ash958     well1850    bcspwr10    e40r5000    conf6.0-8000
    TRON(F)          time        0.0065s    0.011s      0.32s       1.71e+03s   21.70s
                     f(x)        6.01e+03   4.15e-07    3.66e-07    1.26e-06    3.22e+04
                     ‖g(x)‖∞     9.77e-15   3.33e-10    5.10e-04    5.86e-04    1.08e-06
    FNNLS(M)         time        0.013s     0.088s      7.03s       –           –
                     f(x)        6.01e+03   2.50e-30    6.99e-29    –           –
                     ‖g(x)‖∞     7.99e-15   5.62e-16    3.89e-15    –           –
    PQN-BFGS(M)      time        0.007s     0.71s       240.84s     1.67e+04s   –
                     f(x)        6.01e+03   1.27e-07    2.23e-07    0.0066      –
                     ‖g(x)‖∞     5.51e-05   6.80e-05    8.92e-05    0.044       –
    LBFGS-B(F)       time        0.0044s    0.064s      1.79s       17.36s      3.82s
                     f(x)        6.01e+03   7.60e-07    5.72e-07    0.020       3.22e+04
                     ‖g(x)‖∞     8.91e-05   9.16e-05    9.99e-05    0.027       7.96e-05
    PQN-LBFGS(M)     time        0.0097s    0.029s      0.26s       15.40s      1.90s
                     f(x)        6.01e+03   6.85e-07    1.48e-07    0.0091      3.22e+04
                     ‖g(x)‖∞     3.37e-05   7.31e-05    9.61e-05    0.009       2.68e-05

Table 4.7: NNLS experiments on real-world datasets. For all algorithms, we used ‖g(x)‖∞ ≤ 10⁻⁴ as the main stopping criterion with other method-specific default stopping criteria. FNNLS does not scale on e40r5000 and conf6.0-8000; PQN-BFGS also fails to scale on conf6.0-8000.
on LBFGS-B as it runs almost twice as fast while attaining similar objective function
values, and better satisfaction of convergence tolerance.
Finally, we experiment with five datasets drawn from real-world applications (see
Table 4.6). The datasets were obtained from the MatrixMarket,² and they arise
from problems solving least squares (ash958, well1850) and linear systems (bcspwr10,
conf6.0-8000). We impose non-negativity on the solutions to obtain NNLS problems
with these matrices.
Table 4.7 shows results on these real-world datasets; we observe somewhat different behaviors of the various methods. For example, TRON is competitive overall,
except on conf6.0-8000, where it is overwhelmed by LBFGS-B and PQN-LBFGS. PQN-LBFGS exhibits (as expected) behavior similar to LBFGS-B, while consistently delivering competitive results across all ranges.
4.2. KL-Divergence Minimization. Our next example of (1.1) is nonnegative
Kullback-Leibler divergence minimization (NNKL):
    min_{x ∈ R^n}  f(x) = Σ_i ( b_i log( b_i / [Ax]_i ) − b_i + [Ax]_i ),    s.t.   x ≥ 0,        (4.3)
where A ∈ R_+^{m×n} (m ≤ n) is a full-rank matrix and b ∈ R_+^m. We note here a minor technical point: to guarantee that the Hessian has bounded eigenvalues, we actually need to ensure that at iteration k of the algorithm, [Ax^k]_i ≥ δ > 0. This bound can be achieved, for example, by modifying the constraint to be x ≥ ε > 0. However, in practice one can usually run the algorithm with the formulation (4.3) itself. Since
²http://math.nist.gov/MatrixMarket
A is of full rank, simple calculations show that lim_{‖x‖→∞} f(x)/‖x‖ > 0. Therefore,
f (x) has bounded sublevel sets, and (4.3) has a solution [22, Theorem 1.9]. Thus, in
particular, if we select x0 = 1, the all-ones vector, we know that the initial sublevel
set L0 = {x : f (x) ≤ f (x0 )} is bounded, ensuring that the sequence of iterates {xk }
generated by our algorithm is bounded as it performs descent on each iteration.
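As with NNLS, only the objective and gradient callbacks change. A hedged Python/NumPy sketch for (4.3) follows (hypothetical names; the floor delta implements the safeguard [Ax]_i ≥ δ mentioned above, and b is assumed strictly positive):

    import numpy as np

    def nnkl_oracles(A, b, delta=1e-10):
        """Objective and gradient of the KL objective in (4.3); Ax is floored at delta."""
        def f(x):
            Ax = np.maximum(A @ x, delta)
            return np.sum(b * np.log(b / Ax) - b + Ax)
        def grad(x):
            Ax = np.maximum(A @ x, delta)
            return A.T @ (1.0 - b / Ax)   # gradient A'(1 - b ./ Ax)
        return f, grad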
NNKL arises as a core problem in several applications, e.g., positron emission
tomography [7], astronomical image restoration [26], and signal processing [24]. In
contrast to (4.1), the most popular algorithms for (4.3) have been based on the Expectation Maximization (EM) framework [26]. Two of the most popular EM algorithms
for NNKL have been maximum-likelihood EM and ordered-subsets EM [9], where
the latter and its variants are pervasive in medical imaging. Several other related
algorithms for NNKL are summarized in [7, 21]. It turns out that for NNKL too,
optimization methods like TRON, LBFGS-B, or our methods frequently (though not
always) perform better than the essentially gradient based EM methods.
We report below results for the NNKL problem (4.3). Our first set of results
are again with synthetic data. For these experiments we simply reuse the random
matrices described in Section 4.1. We do not report the results of TRON as it takes
an inordinately long time in comparison to the other methods.
The empirical results shown in Tables 4.8, 4.9 and 4.10 are roughly in agreement
with the behavior observed for NNLS, but with a few notable exceptions. For example,
PQN-BFGS is competitive with the other methods on the small problem P1-1. This can be explained by examining the Hessian of the KL-divergence objective. Unlike NNLS, where the Hessian is constant, a method like TRON must repeatedly compute a varying Hessian across iterations, whereas PQN-BFGS updates its Hessian approximation at virtually the same cost for both NNLS and NNKL. This efficiency, combined with a more accurate Hessian approximation, results in faster convergence of PQN-BFGS compared to the other methods. However, the size
of the given problem still plays a critical role and an increased scale even makes some
of the methods infeasible to test.
    Method¹          Metric      P1-1       P1-2       P1-3       P1-4       P1-5
    PQN-BFGS(M)      time        29.2s      51.9s      67.4s      77.3s      84.4s
                     f(x)        5.47e-08   5.87e-08   1.75e-08   2.87e-08   1.59e-08
                     ‖g(x)‖∞     7.83e-06   6.82e-06   9.76e-06   7.69e-06   8.85e-06
    LBFGS-B(F)       time        30.63s     15.57s     62.67s     72.61s     78.54s
                     f(x)        1.50e-09   6.48e-11   6.95e-11   2.06e-08   7.62e-12
                     ‖g(x)‖∞     6.14e-06   3.12e-07   2.95e-07   1.17e-05   1.58e-07
    PQN-LBFGS(M)     time        39.90s     35.46s     42.54s     62.58s     72.36s
                     f(x)        2.73e-08   3.97e-08   4.07e-08   2.17e-08   1.85e-08
                     ‖g(x)‖∞     7.79e-06   8.59e-06   9.85e-06   6.36e-06   7.03e-06

Table 4.8: NNKL experiments on dataset P1. For all algorithms, we used ‖g(x)‖∞ ≤ 10⁻⁵ as the main stopping criterion.
4.3. Image Deblurring examples. In Figure 4.1 we show examples of image
deblurring, where the images shown were blurred using various blur kernels, and
deblurred using LBFGS-B and PQN-LBFGS. The latter usually outperforms LBFGS-B, and
we show some examples where the performance difference is significant. Objective
function values are also shown to permit a quantitative comparison.
We compare the progress of the algorithms with respect to time by running both
of them for the same amount of time. We illustrate the comparison by plotting the
    Method¹          Metric      P2-1       P2-2       P2-3       P2-4       P2-5       P2-6
    PQN-BFGS(M)      time        500.36s    500.53s    62.53s     69.78s     71.85s     501.24s
                     f(x)        4.11e-11   1.48e-10   4.87e-12   3.95e-12   3.03e-12   1.04e-11
                     ‖g(x)‖∞     2.93e-07   4.36e-07   9.42e-08   8.94e-08   8.85e-08   3.26e-07
    LBFGS-B(F)       time        2.23s      4.01s      3.41s      7.09s      5.55s      61.86s
                     f(x)        6.44e-13   1.16e-12   1.17e-12   6.26e-13   2.24e-12   6.54e-12
                     ‖g(x)‖∞     1.91e-07   1.08e-07   2.24e-07   4.00e-08   7.37e-08   1.90e-07
    PQN-LBFGS(M)     time        10.40s     3.62s      5.86s      6.54s      4.25s      135.9s
                     f(x)        1.13e-12   8.10e-13   6.18e-13   9.55e-13   3.35e-12   6.14e-13
                     ‖g(x)‖∞     5.87e-08   8.26e-08   4.91e-08   1.06e-07   9.49e-08   5.42e-08

Table 4.9: NNKL experiments on dataset P2. For all algorithms, we used ‖g(x)‖∞ ≤ 10⁻⁷ or a maximum running time of 500 seconds as the stopping criterion.
    Method¹          Metric      P3-1       P3-2       P3-3       P3-4       P3-5       P3-6
    LBFGS-B(F)       time        148.15s    466.92s    389.76s    490.99s    837.23s    2354.23s
                     f(x)        1.19e-11   3.69e-12   4.60e-11   3.58e-12   3.64e-11   5.43e-11
                     ‖g(x)‖∞     7.86e-08   1.22e-07   1.82e-07   1.20e-07   2.82e-07   1.70e-07
    PQN-LBFGS(M)     time        128.54s    297.40s    242.67s    364.82s    491.50s    1169.36s
                     f(x)        3.21e-11   2.78e-11   2.64e-11   3.39e-11   4.51e-11   6.85e-11
                     ‖g(x)‖∞     9.32e-08   9.50e-08   9.99e-08   9.83e-08   7.60e-08   2.40e-07

Table 4.10: NNKL experiments on dataset P3. For both algorithms, we used ‖g(x)‖∞ ≤ 10⁻⁷ as the main stopping criterion.
objective function value against the number of iterations. Naturally, the number
of iterations differs for both algorithms; but we adjust this number so that both
PQN-LBFGS and LBFGS-B run for approximately the same time. For Figures 4.2(a) and
4.2(b), an iteration of PQN-LBFGS took roughly the same time as an iteration of LBFGS-B, while for Figures 4.2(c) and 4.2(d), LBFGS-B could run fewer iterations for the same
time (or equivalently, needed more time to run the same number of iterations). Note
that we did not run both algorithms to completion, as that usually led to overfitting
of noise and visually less appealing results.
The moon image was obtained from [10]; the cell image is taken from the software
of [16]; the brain image is from the PET Sorteo database [20], and the image of
Haumea is a rendition obtained from Wikipedia.
5. Discussion and Future work. In this paper we have proposed an algorithmic framework for solving box-constrained convex optimization problems using
quasi-Newton methods. We presented two implementations, namely PQN-BFGS and
PQN-LBFGS, based on the BFGS and limited-memory BFGS updates, respectively. Our
methods are as simple as the popular projected-gradient method, while overcoming its deficiencies such as slow convergence. We reported empirical results of our method applied to the NNLS and the NNKL problems, showing that our Matlab software delivers strong performance in terms of both running time and accuracy, especially for larger scale problems.
We emphasize that the simplicity and ease of implementation of our methods are
two strong points that favor their wide applicability. It remains, however, a subject
of future research to make further algorithmic improvements.
Acknowledgments. We acknowledge support of NSF grants CCF-0431257 and
CCF-0728879. SS also acknowledges support of a Max-Planck research grant. We
thank James Nagy for providing us with his Matlab code, from which we extracted
[Figure 4.1 shows, for each test image, four panels: the original image, the blurred image, and the PQN-LBFGS and LBFGS-B reconstructions.]
Fig. 4.1. Deblurring examples using PQN-LBFGS and LBFGS-B; from top to bottom, a blurred image of the moon, a cell, a brain, and Haumea (a dwarf planet).
some useful subroutines.
References.
[1] Dimitri P. Bertsekas, Projected Newton Methods for Optimization Problems with Simple Constraints, SIAM Journal on Control and Optimization, 20 (1982), pp. 221–246.
[2] Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, second ed., 1999.
[3] M. Bierlaire, Ph. L. Toint, and D. Tuyttens, On Iterative Algorithms for
Linear Least Squares Problems with Bound Constraints, Linear Algebra and its
Applications, 143 (1991), pp. 111–143.
[4] Rasmus Bro and Sijmen De Jong, A Fast Non-negativity-constrained Least
Squares Algorithm, Journal of Chemometrics, 11 (1997), pp. 393–401.
[5] R. Byrd, P. Lu, J. Nocedal, and C. Zhu, A Limited Memory Algorithm for Bound Constrained Optimization, SIAM Journal on Scientific Computing, 16 (1995), pp. 1190–1208.
Fig. 4.2. Comparison of PQN-LBFGS against LBFGS-B on the examples: (a) the moon, (b) a
cell, (c) a brain and (d) Haumea. The objective function values (y-axes) are plotted with respect to
the number of iterations (x-axes).
[6] J. C. Dunn, Global and Asymptotic Convergence Rate Estimates for a Class of Projected Gradient Processes, SIAM Journal on Control and Optimization, 19 (1981), pp. 368–400.
[7] J. Fessler, Image Reconstruction: Algorithms and Analysis, 2008. Book preprint.
[8] Philip E. Gill, Walter Murray, and Margaret H. Wright, Practical Optimization, Academic Press, 1981.
[9] H. M. Hudson and R. S. Larkin, Accelerated image reconstruction using ordered subsets of projection data, IEEE Trans. Med. Imag., 13 (1994), pp. 601–609.
[10] B. Katzung, http://www.astronomy-images.com, 2003.
[11] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, Prentice–Hall, 1974.
[12] Chih-Jen Lin and Jorge J. Moré, Newton's Method for Large Bound-Constrained Optimization Problems, SIAM Journal on Optimization, 9 (1999), pp. 1100–1127.
[13] M. Merritt and Y. Zhang, Interior-point gradient method for large-scale
totally nonnegative least squares problems, Journal of Optimization Theory and
Applications, 126 (2005), pp. 191–202.
[14] Nicolas Molinari, Jean-François Durand, and Robert Sabatier, Bounded Optimal Knots for Regression Splines, Computational Statistics and Data Analysis, 45 (2004), pp. 159–178.
[15] Jorge J. Moré and D. C. Sorensen, Computing a Trust Region Step, SIAM
Journal on Scientific and Statistical Computing, 3 (1983), pp. 553–572.
[16] J. Nagy, 2008. Personal Communication.
[17] J. Nagy and Z. Strakos, Enforcing nonnegativity in image reconstruction
algorithms, Mathematical Modeling, Estimation, and Imaging, 4121 (2000),
pp. 182–190.
[18] J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation, 35 (1980), pp. 773–782.
[19] C. Ortner, MTRON: Matlab wrapper for TRON. Mathworks file exchange,
2007.
[20] A. Reilhac, PET-SORTEO. http://sorteo.cermep.fr/home.php, 2008.
[21] R. M. Lewitt and S. Matej, Overview of methods for image reconstruction from projections in emission computed tomography, Proc. IEEE, 91 (2003), pp. 1588–1611.
[22] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer, 1997.
[23] J.B. Rosen, The Gradient Projection Method for Nonlinear Programming. Part
I. Linear Constraints, Journal of the Society for Industrial and Applied Mathematics, 8 (1960), pp. 181–217.
[24] Lawrence K. Saul, Fei Sha, and Daniel D. Lee, Statistical signal processing with nonnegativity constraints, in Proceedings of EuroSpeech 2003, vol. 2,
2003, pp. 1001–1004.
[25] Brad Schofield, Vehicle Dynamics Control for Rollover Prevention, PhD thesis, Lund University, 2006.
[26] Y. Vardi and D. Lee, From Image Deblurring to Optimal Investment Portfolios: Maximum Likelihood Solutions for Positive Linear Problems, Journal of the Royal Statistical Society: Series B, 55 (1993), pp. 569–612.