TACKLING BOX-CONSTRAINED OPTIMIZATION VIA A NEW PROJECTED QUASI-NEWTON APPROACH

DONGMIN KIM∗, SUVRIT SRA∗†, AND INDERJIT S. DHILLON∗

∗ Department of Computer Sciences, University of Texas, Austin, TX 78712-1188, USA (dmkim@cs.utexas.edu, inderjit@cs.utexas.edu).
† Max-Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany (suvrit.sra@tuebingen.mpg.de).

Abstract. Numerous scientific applications across a variety of fields depend on box-constrained convex optimization. Box-constrained problems therefore continue to attract research interest. We address box-constrained (strictly convex) problems by deriving two new quasi-Newton algorithms. Our algorithms are positioned between the projected-gradient [J. B. Rosen, J. SIAM, 8(1), 1960, pp. 181–217] and projected-Newton [D. P. Bertsekas, SIAM J. Cont. & Opt., 20(2), 1982, pp. 221–246] methods. We also prove their convergence under a simple Armijo step-size rule. We provide experimental results for two particular box-constrained problems: nonnegative least squares (NNLS) and nonnegative Kullback-Leibler (NNKL) minimization. For both NNLS and NNKL our algorithms perform competitively against well-established methods on medium-sized problems; for larger problems our approach frequently outperforms the competition.

Key words. Nonnegative least squares, KL-divergence minimization, projected Newton methods, quasi-Newton, box-constrained convex optimization

1. Introduction. The central object of study in this paper is the box-constrained optimization problem

    min_{x ∈ R^n} f(x),   subject to l ≤ x ≤ u,        (1.1)

where l and u are fixed vectors and the inequalities are taken componentwise; the function f is assumed to be twice continuously differentiable and strictly convex.

Problem (1.1) is a simple constrained optimization problem that can be solved by numerous methods. But general-purpose methods might be overkill; indeed, for simple problems they can be computationally expensive or even needlessly complicated. These concerns motivate us to develop algorithms specifically designed for (1.1). Our algorithms are founded upon the unconstrained BFGS and L-BFGS procedures, both of which we adapt to our constrained setting. This adaptation is the crux of our work; but before we describe it, we would like to position it vis-à-vis other methods.

A basic, and perhaps obvious, approach to solving (1.1) is the well-known projected-gradient method [23]. This method, however, frequently suffers from slow convergence, which makes it unappealing when medium- to high-accuracy solutions are desired. In such cases higher-order methods might be preferable, e.g., LBFGS-B [5] (an extension of L-BFGS [18] that allows box constraints), TRON [12], or projected-Newton [1] (which can be viewed as a non-trivial second-order extension of projected-gradient). LBFGS-B uses a limited amount of additional memory to approximate the second-order information of the objective function, while TRON uses trust regions [15] and is usually very efficient for medium-scale problems. LBFGS-B and TRON tackle constraints by explicitly distinguishing between steps that identify active variables and steps that solve the resulting unconstrained subproblems. In comparison, projected-Newton makes the variable-identification step implicit by dynamically finding active candidates at every iteration. Bertsekas noted that such an implicit identification of active variables could lead to unstable active sets across iterations, which in turn could adversely affect convergence [1].
Thus, to stabilize the projected-Newton method, he developed a refined procedure that guarantees finite identification of the active set. Although we build on projected-Newton, we avoid its refinements and make important modifications: we use a simpler line search and a different choice of working sets. These changes sacrifice the finite-identification property, but in return we gain ease of implementation. Moreover, by supporting a limited-memory version, we allow our method to scale well to large problems.

The rest of this paper is organized as follows. In §2.1 we give an overview of our approach; in §2.2 we discuss the major implementation issues; and in §3 we prove convergence. In §4 we compare our method with several other major approaches and show numerical results on both synthetic and real-life data. Finally, §5 concludes the paper with an outlook on future work.

2. Algorithms and Theory. Recall that the central problem of interest is

    min_{x ∈ R^n} f(x),   subject to l ≤ x ≤ u,        (2.1)

where l and u are fixed vectors and the inequalities are taken componentwise; f is a twice continuously differentiable, strictly convex function.

2.1. Overview of the method. Our algorithm for solving (2.1) is inspired by projected-gradient [23] and projected-Newton [1]. Viewed in the general light of gradient methods [2], our method distinguishes itself in two main ways: (i) its choice of the variables optimized at each iteration; and (ii) the gradient scaling it uses to compute the descent direction. Further details of the resulting iterative method follow.

At each iteration we partition the variables into two groups: free and fixed. Then we optimize over only the free variables while holding the fixed ones unchanged. The fixed variables are defined as a particular subset of the active variables (i.e., of the components of x that satisfy the constraints with equality), and this subset is identified using both the gradient and second-order information. Formally, at iteration k we first identify the binding set

    I_1^k = { i : x_i^k = l_i ∧ ∂_i f(x^k) > 0,  or  x_i^k = u_i ∧ ∂_i f(x^k) < 0 },        (2.2)

where ∂_i f(x^k) denotes the i-th component of the gradient ∇f(x^k). The set I_1^k collects variables that satisfy the KKT conditions (1 ≤ i ≤ n):

    ∇f(x) − λ + µ = 0,
    λ_i (x_i − l_i) = 0,   µ_i (x_i − u_i) = 0,   and   λ, µ ≥ 0,

where λ and µ are the Lagrange multipliers corresponding to the constraints l ≤ x and x ≤ u, respectively. To see why the gradient information makes a difference in (2.2), consider an active variable x_i^k = l_i whose gradient component satisfies ∂_i f(x^k) ≤ 0. For such a variable it may be possible to decrease the objective further if at the next iteration we allow x_i^{k+1} > l_i. Thus, it is better to leave x_i^k free by not including it in (2.2) (a similar observation holds for x_i^k = u_i).

Next, we refine the active variables using gradient-scaling information. To that end, let S^k be some non-diagonal positive-definite matrix. Further, let S̄^k be the matrix induced by S^k and the variables not in the binding set, i.e.,

    S̄_ij^k = S_ij^k   if i, j ∉ I_1^k,   and   S̄_ij^k = 0   otherwise.

Then we define the second index set

    I_2^k = { i : x_i^k = l_i ∧ [S̄^k ∇f(x^k)]_i > 0,  or  x_i^k = u_i ∧ [S̄^k ∇f(x^k)]_i < 0 }.        (2.3)

The set I_2^k captures variables left free by I_1^k that should actually be fixed; e.g., if x_i^k is active but not in the binding set (i.e., x_i^k = l_i but i ∉ I_1^k), then, as argued before, the objective function could potentially be decreased if we let x_i^{k+1} > l_i. In the presence of gradient scaling, however, this decrease might not happen, because even though i ∉ I_1^k, the scaled gradient component [S̄^k ∇f(x^k)]_i is positive. Thus, at the next iteration too we should fix x_i^{k+1} = l_i. The actual fixed set is defined below.

Definition 2.1 (Fixed set). Given the sets I_1^k and I_2^k identified by (2.2) and (2.3), respectively, the fixed set I^k at iteration k is defined as the union I^k = I_1^k ∪ I_2^k.
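To make the index-set computations concrete, the following Python/NumPy sketch (illustrative only; the function and variable names are our own, and a dense scaling matrix is assumed, not the authors' Matlab code) evaluates I_1^k, I_2^k, and the fixed set I^k of Definition 2.1 for a given iterate.

    import numpy as np

    def fixed_set(x, grad, S, l, u):
        """Sketch of the index-set computations (2.2)-(2.3) and Definition 2.1.

        x, grad : current iterate x^k and gradient grad f(x^k) (1-D arrays)
        S       : current positive-definite gradient-scaling matrix S^k
        l, u    : lower and upper bound vectors
        Returns boolean masks for I1, I2, and I = I1 union I2.
        """
        at_lower, at_upper = (x <= l), (x >= u)

        # Binding set (2.2): active variables whose gradient points out of the box.
        I1 = (at_lower & (grad > 0)) | (at_upper & (grad < 0))

        # S-bar: the scaling matrix restricted to variables not in I1.
        free1 = ~I1
        S_bar = np.zeros_like(S)
        S_bar[np.ix_(free1, free1)] = S[np.ix_(free1, free1)]

        # Second index set (2.3): active variables whose *scaled* gradient
        # would still push them outside the box.
        sg = S_bar @ grad
        I2 = (at_lower & (sg > 0)) | (at_upper & (sg < 0))

        return I1, I2, I1 | I2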
2.1.1. Projected quasi-Newton step. Having defined the fixed set, we now present the main projected quasi-Newton step. The projection part is easy and requires computing P_Ω(x), the orthogonal projection of x onto the set Ω = [l, u]. The quasi-Newton part is slightly more involved, as the variables not in the fixed set I^k determine the gradient-scaling matrix, which is given by

    Ŝ_ij^k = S_ij^k   if i, j ∉ I^k,   and   Ŝ_ij^k = 0   otherwise.

Finally, our projected quasi-Newton step is given by

    x^{k+1} = P_Ω( x^k − α_k Ŝ^k ∇f(x^k) ),        (2.4)

where α_k > 0 is a step size. Pseudo-code encapsulating the details described above is presented in Algorithm 1, which we call projected quasi-Newton (PQN). Observe that in PQN an arbitrary positive-definite matrix may be used to initialize S^0, and any feasible point may be used to initialize x^0. Further notice that we have not yet specified the details of computing the step size α_k and of updating the matrix S^k. We turn to these two important steps (and some other implementation issues) in §2.2 below.

Algorithm 1: Projected Quasi-Newton (PQN) Framework
    Input: function f(x); vectors l, u ∈ R^n
    Output: x* = argmin_{l ≤ x ≤ u} f(x)
    Initialize k ← 0, S^k ← I, and l ≤ x^k ≤ u
    repeat
        Compute the first index set I_1^k = { i : x_i^k = l_i ∧ ∂_i f(x^k) > 0, or x_i^k = u_i ∧ ∂_i f(x^k) < 0 }
        Compute the second index set I_2^k = { i : x_i^k = l_i ∧ [S̄^k ∇f(x^k)]_i > 0, or x_i^k = u_i ∧ [S̄^k ∇f(x^k)]_i < 0 }
        Compute the fixed set I^k = I_1^k ∪ I_2^k
        if i ∈ I^k for all i, or ∂_i f(x^k) = 0 for all i ∉ I^k then
            S̄^k ← I; re-compute I^k
        Find an appropriate value for α_k using the line search
        x^{k+1} ← P_Ω( x^k − α_k Ŝ^k ∇f(x^k) )
        Update the gradient-scaling matrix S^k to obtain S^{k+1}, if necessary
        k ← k + 1
    until stopping criteria are met

2.2. Implementation details. In this section we describe a line-search routine for computing α_k, and two methods for updating the gradient-scaling matrix S^k.

2.2.1. Line search. We choose the Armijo step-size rule [2, 6] for the line search; this rule is simple and usually works well for both unconstrained and constrained problems. The important difference in our implementation of the Armijo rule is that we restrict the computation to the free variables only. More specifically, to compute the step size, first choose an initial step γ > 0 and scalars τ, σ ∈ (0, 1). Then determine the smallest non-negative integer t for which the sufficient-descent condition

    f(x^k) − f( P_Ω( x^k − γσ^t Ŝ^k ∇f(x^k) ) ) ≥ τ γσ^t ∇f(x^k)^T Ŝ^k ∇f(x^k)        (2.5)

is satisfied. The step size is then set to α_k = γσ^t. In §3 we prove that with this Armijo-based step size our algorithm converges to the optimum.
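For illustration, the step (2.4) together with the Armijo rule (2.5), restricted to the free variables, can be sketched as follows. This is a Python/NumPy rendering under our own naming conventions (a sketch, not the authors' Matlab implementation); `fixed` is the boolean mask for I^k from Definition 2.1.

    import numpy as np

    def pqn_step(f, grad_f, x, S, l, u, fixed,
                 gamma=1.0, sigma=0.5, tau=1e-4, max_backtracks=30):
        """One projected quasi-Newton step (2.4) with the Armijo rule (2.5)."""
        g = grad_f(x)
        free = ~fixed

        # S-hat: scaling matrix restricted to the free variables.
        S_hat = np.zeros_like(S)
        S_hat[np.ix_(free, free)] = S[np.ix_(free, free)]
        d = S_hat @ g                    # the step moves along -d on the free variables

        fx = f(x)
        rhs = g @ d                      # grad f(x)^T S-hat grad f(x), the descent measure
        step = gamma
        for _ in range(max_backtracks):
            x_new = np.clip(x - step * d, l, u)     # projection P_Omega onto the box
            if fx - f(x_new) >= tau * step * rhs:   # sufficient-descent test (2.5)
                return x_new, step
            step *= sigma                           # backtrack: gamma * sigma^t
        return x, 0.0                               # line search failed; no progress made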
2.2.2. Gradient scaling. Now we turn to gradient scaling. To circumvent some of the problems associated with using the full (inverse) Hessian, it is usual to approximate it. Popular choices involve iteratively approximating the Hessian using the Powell-Symmetric-Broyden (PSB), Davidon-Fletcher-Powell (DFP), or Broyden-Fletcher-Goldfarb-Shanno (BFGS) updates; the latter are believed to be the most effective in general [2, 8], so we adapt them for our algorithm.

Suppose H^k is the current approximation to the Hessian, and that the vectors g and s are given by the differences

    g = ∇f(x^{k+1}) − ∇f(x^k),   and   s = x^{k+1} − x^k.

Then the BFGS update makes a rank-two correction to H^k and is given by

    H^{k+1} = H^k − (H^k s s^T H^k)/(s^T H^k s) + (g g^T)/(s^T g).        (2.6)

Let S^k be the inverse of H^k. Applying the Sherman-Morrison-Woodbury formula to (2.6), one obtains the update

    S^{k+1} = S^k + ( 1 + (g^T S^k g)/(s^T g) ) (s s^T)/(s^T g) − (S^k g s^T + s g^T S^k)/(s^T g).        (2.7)

The version of Algorithm 1 that uses (2.7) for the gradient scaling will be called PQN-BFGS.

Although updating the gradient-scaling matrix via BFGS updates is efficient, it can require up to O(n^2) memory, which restricts its applicability to large-scale problems. But we can also implement a limited-memory BFGS (L-BFGS) [18] version, which we call PQN-LBFGS. Here, instead of storing an actual gradient-scaling matrix, a small number of additional vectors (say M) are used to approximate the (inverse) Hessian. The standard L-BFGS approach is implemented using the formula

    S^k = (s_{k−1}^T g_{k−1})/(g_{k−1}^T g_{k−1}) · V̄_{k−M}^T V̄_{k−M}
          + ρ_{k−M} V̄_{k−M+1}^T s_{k−M} s_{k−M}^T V̄_{k−M+1}
          + ρ_{k−M+1} V̄_{k−M+2}^T s_{k−M+1} s_{k−M+1}^T V̄_{k−M+2}
          + ···
          + ρ_{k−1} s_{k−1} s_{k−1}^T,        (2.8)

where S^0 = I, k ≥ 1, and the scalars ρ_k and matrices V̄_{k−M} are defined by

    ρ_k = 1/(s_k^T g_k),   V̄_{k−M} = [V_{k−M} ··· V_{k−1}],   and   V_k = I − ρ_k s_k g_k^T.

At this point the reader might wonder how our method differs from the original BFGS and L-BFGS methods. There are two basic differences. First, we are performing constrained optimization. Second, we actually perform the updates (2.7) and (2.8) with respect to the first index set (2.2); that is, both updates (2.7) and (2.8) result in the matrix S̄^k, which is then used in Algorithm 1.
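As an illustration, the inverse update (2.7) can be written compactly as below. This sketch is ours (Python/NumPy, dense matrices) and adds a standard curvature safeguard that skips the update when s^T g is not safely positive; that safeguard is our own choice and is not spelled out in the text above.

    import numpy as np

    def bfgs_inverse_update(S, s, g, eps=1e-12):
        """Sketch of the inverse BFGS update (2.7).

        S : current inverse-Hessian approximation S^k
        s : x^{k+1} - x^k
        g : grad f(x^{k+1}) - grad f(x^k)
        """
        sg = s @ g
        if sg <= eps:                    # skip the update if curvature is not positive
            return S
        Sg = S @ g
        return (S
                + (1.0 + (g @ Sg) / sg) * np.outer(s, s) / sg
                - (np.outer(Sg, s) + np.outer(s, Sg)) / sg)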
3. Convergence. Let {x^k} be a sequence of iterates produced by either the PQN-BFGS or the PQN-LBFGS version of Algorithm 1. We prove below, under mild assumptions, that each accumulation point of the sequence {x^k} is a stationary point of (2.1). We begin by showing that the sequence {f(x^k)} is monotonically decreasing.

Lemma 3.1 (Descent). If x^k is not a stationary point of Problem (2.1), then with x^{k+1} given by (2.4), there exists some constant α̂ > 0 such that f(x^{k+1}) < f(x^k) for all α_k ∈ (0, α̂].

Proof. By construction of the fixed set I^k, each free variable in x^k must satisfy one of the following: either l_i < x_i^k < u_i, or

    x_i^k = l_i ∧ ∂_i f(x^k) ≤ 0 ∧ [S̄^k ∇f(x^k)]_i ≤ 0,   or   x_i^k = u_i ∧ ∂_i f(x^k) ≥ 0 ∧ [S̄^k ∇f(x^k)]_i ≥ 0.

Note that if I_1^k ∪ I_2^k captures all the variables in x, Algorithm 1 resets S̄^k = I, so that I^k reduces to I_1^k. If there are still no free variables, all the KKT conditions are fulfilled; but this implies that x^k is stationary, a contradiction. Thus, for a non-stationary point x^k, there exists at least one free index i for which ∂_i f(x^k) ≠ 0.

Such an x^k yields a descent direction. To see how, let d^k be the direction defined by

    d_i^k = −[Ŝ^k ∇f(x^k)]_i   if i ∉ I^k,   and   d_i^k = 0   otherwise.

Now partition (reordering variables if necessary) all objects with respect to I^k, so that

    Ŝ^k = [ T^k  0 ; 0  0 ],   ∇f(x^k) = [ y^k ; z^k ],   and   d^k = [ p^k ; 0 ] = [ −T^k y^k ; 0 ].        (3.1)

In (3.1) the components that constitute z^k belong to I^k, while those that constitute y^k, p^k, and T^k do not. Given this partitioning we see that

    ⟨∇f(x^k), d^k⟩ = ⟨y^k, p^k⟩ = −⟨y^k, T^k y^k⟩ < 0,

where the final inequality follows because T^k is positive-definite (as it is a principal submatrix of the positive-definite matrix S^k). Thus, d^k is a descent direction.

Now we only need to verify that descent along d^k upholds feasibility. It is easy to check that there exists an ᾱ > 0 such that for i ∉ I^k,

    l_i ≤ P_Ω( x_i^k + α_k d_i^k ) = x_i^k + α_k d_i^k ≤ u_i,   for all α_k ≤ ᾱ.

Noting that d^k is zero for the fixed variables, we may equivalently write

    l ≤ P_Ω( x^k + α_k d^k ) = x^k + α_k d^k ≤ u,   for all α_k ≤ ᾱ.        (3.2)

Thus, we can finally conclude that there exists some α̂ ≤ ᾱ such that f(x^{k+1}) < f(x^k) for all α_k ∈ (0, α̂].

To prove the convergence of an iterative algorithm one generally needs to show that the sequence of iterates {x^k} has an accumulation point. Since the existence of limit points inevitably depends on the given problem instance, it is frequently assumed as a given. For our problem, accumulation points certainly exist, as we are dealing with a continuous function over a compact domain. To prove the main theorem, we consider the set Ω = [l, u] as the domain of Problem (2.1) and assume the following condition.

Assumption 1. The eigenvalues of the gradient-scaling matrix T^k (for all k) lie in the interval [m, M], where 0 < m ≤ M < ∞.

This assumption is frequently found in convergence proofs of iterative methods, especially when proving the so-called gradient-related condition, which in turn helps prove convergence for several iterative descent methods (as it ensures that the search direction remains well-defined throughout the iterations).

Lemma 3.2 (Gradient-related condition). Let {x^k} be a sequence generated by (2.4). Then, for any subsequence {x^k}_{k∈K} that converges to a non-stationary point, the corresponding subsequence {d^k}_{k∈K} is bounded and satisfies

    lim sup_{k→∞, k∈K} ‖x^{k+1} − x^k‖ < ∞,        (3.3)
    lim sup_{k→∞, k∈K} ∇f(x^k)^T d^k < 0.        (3.4)

Proof. Since {x^k}_{k∈K} is a converging sequence over a compact domain, x^{k+1} − x^k is bounded for all k, and therefore (3.3) holds. Since d^k = [−T^k y^k ; 0] and T^k is positive-definite, under Assumption 1 the gradient-related condition (3.4) follows immediately [2, Chapter 1].

Now we are ready to prove the main convergence theorem. Although, using Lemma 3.2, it is possible to invoke a general convergence proof for feasible direction methods (e.g., [2]), for completeness we outline a short proof.

Theorem 3.3 (Convergence). Let {x^k} be a sequence of points generated by (2.4). Then every limit point of {x^k} is a stationary point of Problem (2.1).

Proof. The proof is by contradiction. Since {f(x^k)} is monotonically decreasing (Lemma 3.1), the domain Ω = [l, u] is compact, and f is continuous, we have

    lim_{k→∞} ( f(x^k) − f(x^{k+1}) ) = 0.

Further, the sufficient-descent condition (2.5) implies that

    f(x^k) − f(x^{k+1}) ≥ −τ α_k ∇f(x^k)^T d^k,   τ ∈ (0, 1),        (3.5)

whereby

    { α_k ∇f(x^k)^T d^k } → 0.        (3.6)

Assume that a subsequence {x^k}_{k∈K} converges to a non-stationary point x̃. From (3.4) and (3.6), we obtain {α_k}_{k∈K} → 0. Since d^k is gradient related, this implies that there is some index k_1 ≥ 0 such that (3.2) holds, i.e.,

    P_Ω( x^k + α_k d^k ) = x^k + α_k d^k,   for all k ∈ K, k ≥ k_1.

Further, since α_k → 0, the step α_k/σ was eventually tried and rejected by the Armijo rule (2.5); hence there exists an index k_2 satisfying

    f(x^k) − f( P_Ω( x^k + (α_k/σ) d^k ) ) = f(x^k) − f( x^k + (α_k/σ) d^k ) < −τ (α_k/σ) ∇f(x^k)^T d^k,

or equivalently,

    ( f(x^k) − f( x^k + α̌_k d^k ) ) / α̌_k < −τ ∇f(x^k)^T d^k,        (3.7)

for all k ∈ K, k ≥ k_2, where α̌_k = α_k/σ. Now let k̃ = max{k_1, k_2}. Since {d^k}_{k∈K} is bounded, it can further be shown that there exists a subsequence {d^k}_{k∈K̃}, K̃ ⊂ K, of {d^k}_{k∈K} such that {d^k}_{k∈K̃} → d̃, where d̃ is some finite vector.
On the other hand, by applying the mean value theorem to (3.7), it can be shown that there exist scalars {α̃_k}_{k∈K̃}, K̃ ⊂ K, such that

    −∇f( x^k + α̃_k d^k )^T d^k < −τ ∇f(x^k)^T d^k,   where α̃_k ∈ [0, α̌_k], for all k ∈ K̃, k ≥ k̃.

By taking limits on both sides we obtain

    −∇f(x̃)^T d̃ ≤ −τ ∇f(x̃)^T d̃,   or   0 ≤ (1 − τ) ∇f(x̃)^T d̃,   τ ∈ (0, 1),

from which it follows that ∇f(x̃)^T d̃ ≥ 0, contradicting the fact that {d^k} satisfies (3.4).

4. Applications & Numerical Results. In this section we report numerical results on two fundamental and important examples of (1.1). Performance results are shown for different software packages fed with both synthetic and real-world data. The experiments reveal that for several datasets our methods are competitive with well-established algorithms; in fact, as the problem size increases, our algorithms often outperform the competition significantly.

4.1. Nonnegative Least Squares. The first specific problem to which we specialize our method is nonnegative least squares (NNLS):

    min_{x ∈ R^n} (1/2) ‖Ax − b‖_2^2,   s.t. x ≥ 0,        (4.1)

where A ∈ R^{m×n} and b ∈ R^m. To fit (4.1) into our framework, we further assume that A^T A is positive-definite, whereby the objective function is strictly convex. We remark that (L-)BFGS ensures the positive-definiteness of the gradient-scaling matrix S^k, hence Assumption 1 is always satisfied. Note that even though (4.1) is written with only lower bounds, we can easily guarantee that the sequence of iterates {x^k} generated by our algorithm is bounded. To see how, we first show that the initial level set L_0 = {x : ‖Ax − b‖_2 ≤ ‖Ax^0 − b‖_2} is bounded; then, since each iteration enforces descent, the sequence {x^k} must be bounded, as it also belongs to this level set. To show that L_0 is bounded, consider for any x ∈ L_0 the inequalities

    ‖Ax‖_2 − ‖b‖_2 ≤ ‖Ax − b‖_2 ≤ ‖Ax^0 − b‖_2.

These inequalities yield a bound on ‖x‖_2 (which implies that L_0 is bounded), because

    σ_min(A) · ‖x‖_2 ≤ ‖Ax‖_2 ≤ ‖Ax^0 − b‖_2 + ‖b‖_2,   so that   ‖x‖_2 ≤ σ_min(A)^{−1} ( ‖Ax^0 − b‖_2 + ‖b‖_2 ) = U < ∞,        (4.2)

where σ_min(A) > 0 is the smallest singular value of A.

NNLS is a classical problem in scientific computing [11]; applications include image restoration [17], vehicle dynamics control [25], and data fitting [14]. Given its wide applicability, many specialized algorithms have been developed for NNLS, e.g., methods based on active sets such as the classic Lawson-Hanson method [11], or its efficient modification Fast-NNLS (FNNLS) [4]. Not surprisingly, some constrained optimization methods have also been applied to solve NNLS [3, 13]. It is interesting to note that for large-scale problems these specialized algorithms are outperformed by modern methods such as TRON, LBFGS-B, or the methods of this paper.

There exist several different algorithms for solving the NNLS problem. Of these, some are designed expressly for NNLS, while others are for general differentiable convex minimization. Naturally, owing to the large variety of optimization methods and software available, we cannot hope to be exhaustive in our comparisons. We solved the NNLS problem using several different software packages, and then picked the ones that performed the best. Specifically, we show results for the following implementations:
• TRON – a trust-region Newton-type method for bound-constrained optimization [12]. The underlying implementation is in FORTRAN; for convenience we used the MTRON [19] Matlab interface.
• LBFGS-B – a limited-memory BFGS algorithm extended to handle bound constraints [5]. Again, the underlying implementation is in FORTRAN, and we invoked it via a Matlab interface.
• PQN-BFGS – our projected quasi-Newton implementation with the full BFGS update, written entirely in Matlab (i.e., no MEX or C-level interface is used).
• PQN-LBFGS – our projected L-BFGS implementation, also written entirely in Matlab.
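For concreteness, the following sketch (ours, in Python/NumPy) shows how an NNLS instance (4.1) can be packaged as an objective, a gradient, and bounds in the form that box-constrained solvers of this kind consume. The driver call `pqn_solve` is hypothetical and merely stands in for whichever solver is used.

    import numpy as np

    def make_nnls(A, b):
        """Objective and gradient of the NNLS problem (4.1)."""
        def f(x):
            r = A @ x - b
            return 0.5 * (r @ r)              # (1/2) ||Ax - b||_2^2
        def grad_f(x):
            return A.T @ (A @ x - b)          # A^T (Ax - b)
        return f, grad_f

    A = np.random.rand(200, 50)               # small synthetic instance
    b = np.random.rand(200)
    f, grad_f = make_nnls(A, b)
    l = np.zeros(50)                           # x >= 0
    u = np.full(50, np.inf)                    # no upper bound
    # x_opt = pqn_solve(f, grad_f, l, u, x0=np.ones(50))   # hypothetical driver call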
We also compare our method with FNNLS, which is Bro and de Jong's improved implementation of the Lawson-Hanson NNLS procedure [4]. FNNLS was obtained as a Matlab implementation. We remark that the classic NNLS algorithm of Lawson and Hanson [11], which is still available as the function lsqnonneg in Matlab, was far too slow for all our experiments (taking several hours to make progress when other methods were taking seconds), and hence we do not report any results for it.

For all the subsequent experiments, whenever possible, we use ‖g(x)‖∞ to decide when to terminate; here g is defined componentwise as

    g_i(x^k) = ∂_i f(x^k)   if i ∉ I^k,   and   g_i(x^k) = 0   otherwise.

We also note that each method includes its own additional stopping criteria; we leave those other criteria at their default settings.

We first show experiments on synthetic data. These experiments compare our method against the others in a controlled setting. They obviously do not reflect the complete situation, because the behavior of the different algorithms can vary substantially on real-world data. Nevertheless, results on synthetic data do provide a basic performance comparison that helps place our method relative to competing approaches. We use the following three data sets:
• P1: dense random (uniform) nonnegative matrices of varying sizes,
• P2: medium-sized sparse random matrices, and
• P3: large sparse random matrices with varying sparsity.

    Matrix   Size          κ(A)
    P1-1     2800 × 2000   494
    P1-2     3600 × 2400   459
    P1-3     4400 × 2800   450
    P1-4     5200 × 3200   452
    P1-5     6000 × 3600   460

    Table 4.1. The size and condition number of the test matrices A in the dense data set P1.

We generated the test matrices A and vectors b corresponding to these datasets using Matlab's rand and sprand functions. Table 4.1 shows the number of rows and columns of A for dataset P1, along with the associated condition numbers κ(A). The matrices for problem set P2 were of size 12000 × 6400, and for P3 they were of size 65536 × 50000. We varied the sparsity in both P2 and P3. Table 4.2 lists the matrices in these datasets. Note that by sparsity we mean the ratio of zero entries to the total number of entries in the given matrix.

    Matrix    P2-1   P2-2   P2-3   P2-4   P2-5   P2-6
    Sparsity  0.996  0.994  0.992  0.99   0.988  0.900

    Matrix    P3-1   P3-2   P3-3   P3-4   P3-5   P3-6
    Sparsity  0.998  0.997  0.996  0.995  0.994  0.990

    Table 4.2. The sparsity of the test matrices A in problem sets P2 and P3.

Table 4.3 shows the running times of TRON, FNNLS, PQN-BFGS, LBFGS-B, and PQN-LBFGS for the matrices in problem set P1. We remark that FNNLS requires its inputs in the form of A^T A and A^T b, hence we include the time for these one-time computations when reporting results for FNNLS. On some mid-sized matrices, e.g., 2800 × 2000 (P1-1) and 3600 × 2400 (P1-2), TRON and FNNLS remain relatively competitive with the others. However, with increasing matrix sizes, the running time of these methods increases drastically, while that of LBFGS-B and PQN-LBFGS grows only linearly. Consequently, for the largest matrix in the problem set (P1-5), PQN-LBFGS outperforms the others significantly, as revealed by Table 4.3.
In Table 4.4 we show running-time comparisons analogous to those in Table 4.3, except that the matrices used were sparse. As shown in the table, LBFGS-B and PQN-LBFGS generally outperform the others on sparse matrices, and the difference becomes starker with decreasing sparsity. However, we also observe that sparsity benefits the TRON software, as it is claimed to take advantage of sparse problems [12].

In the tables that follow, a suffix (F) denotes that the underlying implementation is in FORTRAN, while (M) denotes Matlab. All reported results are averages over 10 runs to decrease variability.

    Method          Metric     P1-1      P1-2      P1-3      P1-4      P1-5
    TRON (F)        time       55.60s    8.32s     142.93s   204.09s   290.77s
                    f(x)       4.45e+04  1.67e+04  5.56e+04  7.64e+04  1.02e+05
                    ‖g(x)‖∞    1.17e-05  2.27e-04  1.69e-04  2.80e-05  4.85e-05
    FNNLS (M)       time       9.56s     16.93s    26.89s    38.94s    60.22s
                    f(x)       4.45e+04  1.67e+04  5.46e+04  7.64e+04  1.02e+05
                    ‖g(x)‖∞    2.64e-14  5.63e-15  2.64e-14  1.50e-14  2.01e-14
    PQN-BFGS (M)    time       71.92s    18.31s    158.56s   120.17s   112.12s
                    f(x)       4.45e+04  1.67e+04  5.56e+04  7.64e+04  1.02e+05
                    ‖g(x)‖∞    0.0018    8.49e-04  0.0022    0.0055    0.0011
    LBFGS-B (F)     time       17.78s    3.71s     33.91s    26.41s    27.39s
                    f(x)       4.45e+04  1.67e+04  5.56e+04  7.64e+04  1.02e+05
                    ‖g(x)‖∞    9.76e-04  9.95e-04  9.28e-04  9.23e-04  9.91e-04
    PQN-LBFGS (M)   time       8.77s     5.06s     9.24s     18.05s    17.75s
                    f(x)       4.45e+04  1.67e+04  5.46e+04  7.64e+04  1.02e+05
                    ‖g(x)‖∞    7.40e-04  7.28e-04  8.38e-04  3.72e-04  5.98e-04

    Table 4.3. NNLS experiments on dataset P1. For all algorithms (except FNNLS), we used ‖g(x)‖∞ ≤ 10^-3 as the main stopping criterion. In addition, we stop each method after 10,000 iterations or 20,000 seconds. Each method retains its own default stopping criteria, e.g., PQN-BFGS, LBFGS-B, and PQN-LBFGS terminate when their internal line search fails. Finally, we ran FNNLS with no custom parameters, as it does not support the stopping criterion ‖g(x)‖∞. The best values are shown in bold, and the italicized entries represent abnormal terminations.

    Method          Metric     P2-1      P2-2       P2-3       P2-4       P2-5       P2-6
    TRON (F)        time       153.31s   2.06e+03s  2.68e+03s  3.58e+03s  4.66e+03s  1.68e+04s
                    f(x)       1.11e+09  1.45e+09   1.51e+08   2.06e+09   1.30e+10   1.56e+12
                    ‖g(x)‖∞    0.077     0.050      0.174      2.75e-04   0.0053     0.371
    PQN-BFGS (M)    time       48.93s    105.61s    144.27s    186.95s    202.21s    601.78s
                    f(x)       1.11e+09  1.45e+09   1.51e+08   2.06e+09   1.30e+10   1.56e+12
                    ‖g(x)‖∞    0.0092    0.0079     0.0092     0.0088     0.0084     0.0098
    LBFGS-B (F)     time       0.55s     0.85s      1.35s      1.89s      2.53s      43.50s
                    f(x)       1.11e+09  1.45e+09   1.51e+08   2.06e+09   1.30e+10   1.56e+12
                    ‖g(x)‖∞    0.0086    0.0088     0.0097     0.0088     0.0091     1.15
    PQN-LBFGS (M)   time       0.22s     0.37s      0.49s      0.75s      0.84s      18.34s
                    f(x)       1.11e+09  1.45e+09   1.51e+08   2.06e+09   1.30e+10   1.56e+12
                    ‖g(x)‖∞    0.0087    0.0099     0.0083     0.0085     0.0046     0.88

    Table 4.4. NNLS experiments on dataset P2. For all algorithms, we used ‖g(x)‖∞ ≤ 10^-2 as the main stopping criterion, and, as for P1, the other method-specific stopping criteria are left at their defaults. FNNLS does not scale to this dataset.

    Method          Metric     P3-1      P3-2      P3-3      P3-4      P3-5      P3-6
    LBFGS-B (F)     time       0.27s     0.47s     0.48s     0.74s     0.94s     2.01s
                    f(x)       5.24e+07  3.46e+07  1.82e+09  1.18e+09  3.60e+09  1.13e+09
                    ‖g(x)‖∞    0.010     0.014     0.0057    0.012     0.0099    0.0095
    PQN-LBFGS (M)   time       0.077s    0.13s     0.18s     0.26s     0.38s     0.68s
                    f(x)       5.24e+07  3.46e+07  1.82e+09  1.18e+09  3.60e+09  1.13e+09
                    ‖g(x)‖∞    0.008     0.0032    0.0051    0.0073    0.0054    0.008

    Table 4.5. NNLS experiments on dataset P3. The main stopping criterion is ‖g(x)‖∞ ≤ 10^-2 for both algorithms; the other settings are the same as in the previous experiments. FNNLS, TRON, and PQN-BFGS do not scale to this dataset.

Our next set of experiments is on dataset P3, and the results are shown in Table 4.5.
Note that only LBFGS-B and PQN-LBFGS scale to this problem set. Although both methods perform well across all the problems, PQN-LBFGS is seen to have an edge over LBFGS-B, as it runs almost twice as fast while attaining similar objective function values and better satisfying the convergence tolerance.

Finally, we experiment with five datasets drawn from real-world applications (see Table 4.6). The datasets were obtained from the MatrixMarket (http://math.nist.gov/MatrixMarket), and they arise from least-squares problems (ash958, well1850) and linear systems (bcspwr10, conf6.0-8000). We impose non-negativity on the solutions to obtain NNLS problems with these matrices.

    Matrix         Size            Sparsity
    ash958         958 × 292       0.9932
    well1850       1850 × 712      0.9934
    bcspwr10       5300 × 5300     0.9992
    e40r5000       17281 × 17281   0.9981
    conf6.0-8000   49152 × 49152   0.9992

    Table 4.6. The size and sparsity of some real-world data sets from the MatrixMarket.

Table 4.7 shows the results on these real-world datasets; we observe somewhat different behaviors of the various methods. For example, TRON is competitive overall, except on conf6.0-8000, where it is overwhelmed by LBFGS-B and PQN-LBFGS. PQN-LBFGS resembles (as expected) the characteristics of LBFGS-B, while consistently delivering competitive results across the full range of problems.

    Method          Metric     ash958    well1850  bcspwr10  e40r5000   conf6.0-8000
    TRON (F)        time       0.0065s   0.011s    0.32s     1.71e+03s  21.70s
                    f(x)       6.01e+03  4.15e-07  3.66e-07  1.26e-06   3.22e+04
                    ‖g(x)‖∞    9.77e-15  3.33e-10  5.10e-04  5.86e-04   1.08e-06
    FNNLS (M)       time       0.013s    0.088s    7.03s     –          –
                    f(x)       6.01e+03  2.50e-30  6.99e-29  –          –
                    ‖g(x)‖∞    7.99e-15  5.62e-16  3.89e-15  –          –
    PQN-BFGS (M)    time       0.007s    0.71s     240.84s   1.67e+04s  –
                    f(x)       6.01e+03  1.27e-07  2.23e-07  0.0066     –
                    ‖g(x)‖∞    5.51e-05  6.80e-05  8.92e-05  0.044      –
    LBFGS-B (F)     time       0.0044s   0.064s    1.79s     17.36s     3.82s
                    f(x)       6.01e+03  7.60e-07  5.72e-07  0.020      3.22e+04
                    ‖g(x)‖∞    8.91e-05  9.16e-05  9.99e-05  0.027      7.96e-05
    PQN-LBFGS (M)   time       0.0097s   0.029s    0.26s     15.40s     1.90s
                    f(x)       6.01e+03  6.85e-07  1.48e-07  0.0091     3.22e+04
                    ‖g(x)‖∞    3.37e-05  7.31e-05  9.61e-05  0.009      2.68e-05

    Table 4.7. NNLS experiments on real-world datasets. For all algorithms, we used ‖g(x)‖∞ ≤ 10^-4 as the main stopping criterion, together with the method-specific default stopping criteria. FNNLS does not scale to e40r5000 and conf6.0-8000; PQN-BFGS also fails to scale to conf6.0-8000.

4.2. KL-Divergence Minimization. Our next example of (1.1) is nonnegative Kullback-Leibler divergence minimization (NNKL):

    min_{x ∈ R^n} f(x) = Σ_i ( b_i log( b_i / [Ax]_i ) − b_i + [Ax]_i ),   s.t. x ≥ 0,        (4.3)

where A ∈ R_+^{m×n} (m ≤ n) is a full-rank matrix and b ∈ R_+^m. We note here a minor technical point: to guarantee that the Hessian has bounded eigenvalues, we actually need to ensure that at iteration k of the algorithm, [Ax^k]_i ≥ δ > 0. This bound can be achieved, for example, by modifying the constraint to x ≥ ε > 0. However, in practice one can usually run the algorithm with the formulation (4.3) itself. Since A is of full rank, simple calculations show that lim_{‖x‖→∞} f(x)/‖x‖ > 0. Therefore, f(x) has bounded sublevel sets, and (4.3) has a solution [22, Theorem 1.9]. Thus, in particular, if we select x^0 = 1, the all-ones vector, we know that the initial sublevel set L_0 = {x : f(x) ≤ f(x^0)} is bounded, ensuring that the sequence of iterates {x^k} generated by our algorithm is bounded, as the algorithm performs descent at each iteration.

NNKL arises as a core problem in several applications, e.g., positron emission tomography [7], astronomical image restoration [26], and signal processing [24].
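As with NNLS, specializing the framework to NNKL only requires the objective and gradient of (4.3). The sketch below is ours (Python/NumPy, assuming b > 0 componentwise); it applies a small floor δ to Ax, mirroring the technical remark above, with a value of δ that is our own choice.

    import numpy as np

    def make_nnkl(A, b, delta=1e-10):
        """Objective and gradient of the NNKL problem (4.3); assumes b > 0."""
        def f(x):
            Ax = np.maximum(A @ x, delta)          # enforce [Ax]_i >= delta > 0
            return np.sum(b * np.log(b / Ax) - b + Ax)
        def grad_f(x):
            Ax = np.maximum(A @ x, delta)
            return A.T @ (1.0 - b / Ax)            # gradient: A^T (1 - b ./ Ax)
        return f, grad_f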
In contrast to (4.1), the most popular algorithms for (4.3) have been based on the Expectation Maximization (EM) framework [26]. Two of the most popular EM algorithms for NNKL have been maximum-likelihood EM and ordered-subsets EM [9], where the latter and its variants are pervasive in medical imaging. Several other related algorithms for NNKL are summarized in [7, 21]. It turns out that for NNKL too, optimization methods like TRON, LBFGS-B, or our methods frequently (though not always) perform better than the essentially gradient-based EM methods.

We report below results for the NNKL problem (4.3). Our first set of results is again on synthetic data. For these experiments we simply reuse the random matrices described in Section 4.1. We do not report results for TRON, as it takes an inordinately long time in comparison to the other methods.

The empirical results shown in Tables 4.8, 4.9, and 4.10 are roughly in agreement with the behavior observed for NNLS, but with a few notable exceptions. For example, PQN-BFGS is competitive with the other methods on the small problem P1-1. This can be explained by looking at the Hessian of the KL-divergence objective. Unlike NNLS, where the Hessian is constant, a method like TRON must repeatedly compute a varying Hessian across iterations, whereas PQN-BFGS updates its Hessian approximation for both NNLS and NNKL at virtually the same cost. This efficiency, combined with a more accurate Hessian approximation, results in faster convergence of PQN-BFGS relative to the other methods. However, the size of the given problem still plays a critical role, and at larger scales some of the methods even become infeasible to test.

    Method          Metric     P1-1      P1-2      P1-3      P1-4      P1-5
    PQN-BFGS (M)    time       29.2s     51.9s     67.4s     77.3s     84.4s
                    f(x)       5.47e-08  5.87e-08  1.75e-08  2.87e-08  1.59e-08
                    ‖g(x)‖∞    7.83e-06  6.82e-06  9.76e-06  7.69e-06  8.85e-06
    LBFGS-B (F)     time       30.63s    15.57s    62.67s    72.61s    78.54s
                    f(x)       1.50e-09  6.48e-11  6.95e-11  2.06e-08  7.62e-12
                    ‖g(x)‖∞    6.14e-06  3.12e-07  2.95e-07  1.17e-05  1.58e-07
    PQN-LBFGS (M)   time       39.90s    35.46s    42.54s    62.58s    72.36s
                    f(x)       2.73e-08  3.97e-08  4.07e-08  2.17e-08  1.85e-08
                    ‖g(x)‖∞    7.79e-06  8.59e-06  9.85e-06  6.36e-06  7.03e-06

    Table 4.8. NNKL experiments on dataset P1. For all algorithms, we used ‖g(x)‖∞ ≤ 10^-5 as the main stopping criterion.

    Method          Metric     P2-1      P2-2      P2-3      P2-4      P2-5      P2-6
    PQN-BFGS (M)    time       500.36s   500.53s   62.53s    69.78s    71.85s    501.24s
                    f(x)       4.11e-11  1.48e-10  4.87e-12  3.95e-12  3.03e-12  1.04e-11
                    ‖g(x)‖∞    2.93e-07  4.36e-07  9.42e-08  8.94e-08  8.85e-08  3.26e-07
    LBFGS-B (F)     time       2.23s     4.01s     3.41s     7.09s     5.55s     61.86s
                    f(x)       6.44e-13  1.16e-12  1.17e-12  6.26e-13  2.24e-12  6.54e-12
                    ‖g(x)‖∞    1.91e-07  1.08e-07  2.24e-07  4.00e-08  7.37e-08  1.90e-07
    PQN-LBFGS (M)   time       10.40s    3.62s     5.86s     6.54s     4.25s     135.9s
                    f(x)       1.13e-12  8.10e-13  6.18e-13  9.55e-13  3.35e-12  6.14e-13
                    ‖g(x)‖∞    5.87e-08  8.26e-08  4.91e-08  1.06e-07  9.49e-08  5.42e-08

    Table 4.9. NNKL experiments on dataset P2. For all algorithms, we used ‖g(x)‖∞ ≤ 10^-7 or a maximum running time of 500 seconds as the stopping criterion.

    Method          Metric     P3-1      P3-2      P3-3      P3-4      P3-5      P3-6
    LBFGS-B (F)     time       148.15s   466.92s   389.76s   490.99s   837.23s   2354.23s
                    f(x)       1.19e-11  3.69e-12  4.60e-11  3.58e-12  3.64e-11  5.43e-11
                    ‖g(x)‖∞    7.86e-08  1.22e-07  1.82e-07  1.20e-07  2.82e-07  1.70e-07
    PQN-LBFGS (M)   time       128.54s   297.40s   242.67s   364.82s   491.50s   1169.36s
                    f(x)       3.21e-11  2.78e-11  2.64e-11  3.39e-11  4.51e-11  6.85e-11
                    ‖g(x)‖∞    9.32e-08  9.50e-08  9.99e-08  9.83e-08  7.60e-08  2.40e-07

    Table 4.10. NNKL experiments on dataset P3. For both algorithms, we used ‖g(x)‖∞ ≤ 10^-7 as the main stopping criterion.

4.3. Image Deblurring Examples. In Figure 4.1 we show examples of image deblurring, where the images shown were blurred using various blur kernels and then deblurred using LBFGS-B and PQN-LBFGS. The latter usually outperforms LBFGS-B, and we show some examples where the performance difference is significant. Objective function values are also shown to permit a quantitative comparison.

We compare the progress of the algorithms with respect to time by running both of them for the same amount of time.
We illustrate the comparison by plotting the objective function value against the number of iterations. Naturally, the number of iterations differs between the two algorithms, but we adjust this number so that both PQN-LBFGS and LBFGS-B run for approximately the same time. For Figures 4.2(a) and 4.2(b), an iteration of PQN-LBFGS took roughly the same time as an iteration of LBFGS-B, while for Figures 4.2(c) and 4.2(d), LBFGS-B could run fewer iterations in the same time (or, equivalently, needed more time to run the same number of iterations). Note that we did not run both algorithms to completion, as that usually led to overfitting of noise and visually less appealing results. The moon image was obtained from [10]; the cell image is taken from the software of [16]; the brain image is from the PET Sorteo database [20]; and the image of Haumea is a rendition obtained from Wikipedia.

    Fig. 4.1. Deblurring examples using PQN-LBFGS and LBFGS-B; from top to bottom, a blurred image of the moon, a cell, a brain, and Haumea (a dwarf planet). The panel columns show, from left to right, the original image, the blurred image, the PQN-LBFGS result, and the LBFGS-B result.

    Fig. 4.2. Comparison of PQN-LBFGS against LBFGS-B on the examples: (a) the moon, (b) a cell, (c) a brain, and (d) Haumea. The objective function values (y-axes) are plotted with respect to the number of iterations (x-axes).

5. Discussion and Future Work. In this paper we have proposed an algorithmic framework for solving box-constrained convex optimization problems using quasi-Newton methods. We presented two implementations, namely PQN-BFGS and PQN-LBFGS, based on the BFGS and limited-memory BFGS updates, respectively. Our methods are simple like the popular projected-gradient method, yet they overcome deficiencies such as slow convergence. We reported empirical results of our method applied to the NNLS and NNKL problems, showing that our Matlab software delivers uncompromising performance, both in running time and in accuracy, especially for larger-scale problems. We emphasize that the simplicity and ease of implementation of our methods are two strong points that favor their wide applicability. It remains, however, a subject of future research to make further algorithmic improvements.

Acknowledgments. We acknowledge the support of NSF grants CCF-0431257 and CCF-0728879. SS also acknowledges the support of a Max-Planck research grant. We thank James Nagy for providing us with his Matlab code, from which we extracted some useful subroutines.

References.
[1] Dimitri P. Bertsekas, Projected Newton Methods for Optimization Problems with Simple Constraints, SIAM Journal on Control and Optimization, 20 (1982), pp. 221–246.
[2] Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, second ed., 1999.
[3] M. Bierlaire, Ph. L. Toint, and D. Tuyttens, On Iterative Algorithms for Linear Least Squares Problems with Bound Constraints, Linear Algebra and its Applications, 143 (1991), pp. 111–143.
[4] Rasmus Bro and Sijmen De Jong, A Fast Non-negativity-constrained Least Squares Algorithm, Journal of Chemometrics, 11 (1997), pp. 393–401.
[5] R. Byrd, P. Lu, J. Nocedal, and C. Zhu, A Limited Memory Algorithm for Bound Constrained Optimization, SIAM Journal on Scientific Computing, 16 (1995), pp. 1190–1208.
[6] J. C. Dunn, Global and Asymptotic Convergence Rate Estimates for a Class of Projected Gradient Processes, SIAM Journal on Control and Optimization, 19 (1981), pp. 368–400.
[7] J. Fessler, Image Reconstruction: Algorithms and Analysis, 2008. Book preprint.
[8] Philip E. Gill, Walter Murray, and Margaret H. Wright, Practical Optimization, Academic Press, 1981.
[9] H. M. Hudson and R. S. Larkin, Accelerated Image Reconstruction Using Ordered Subsets of Projection Data, IEEE Transactions on Medical Imaging, 13 (1994), pp. 601–609.
[10] B. Katzung, http://www.astronomy-images.com, 2003.
[11] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, Prentice-Hall, 1974.
[12] Chih-Jen Lin and Jorge J. Moré, Newton's Method for Large Bound-Constrained Optimization Problems, SIAM Journal on Optimization, 9 (1999), pp. 1100–1127.
[13] M. Merritt and Y. Zhang, Interior-Point Gradient Method for Large-Scale Totally Nonnegative Least Squares Problems, Journal of Optimization Theory and Applications, 126 (2005), pp. 191–202.
[14] Nicolas Molinari, Jean-François Durand, and Robert Sabatier, Bounded Optimal Knots for Regression Splines, Computational Statistics and Data Analysis, 45 (2004), pp. 159–178.
[15] Jorge J. Moré and D. C. Sorensen, Computing a Trust Region Step, SIAM Journal on Scientific and Statistical Computing, 3 (1983), pp. 553–572.
[16] J. Nagy, 2008. Personal communication.
[17] J. Nagy and Z. Strakos, Enforcing Nonnegativity in Image Reconstruction Algorithms, Mathematical Modeling, Estimation, and Imaging, 4121 (2000), pp. 182–190.
[18] J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation, 35 (1980), pp. 773–782.
[19] C. Ortner, MTRON: Matlab Wrapper for TRON, Mathworks file exchange, 2007.
[20] A. Reilhac, PET-SORTEO, http://sorteo.cermep.fr/home.php, 2008.
[21] R. M. Lewitt and S. Matej, Overview of Methods for Image Reconstruction from Projections in Emission Computed Tomography, Proceedings of the IEEE, 91 (2003), pp. 1588–1611.
[22] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer, 1997.
[23] J. B. Rosen, The Gradient Projection Method for Nonlinear Programming.
Part I. Linear Constraints, Journal of the Society for Industrial and Applied Mathematics, 8 (1960), pp. 181–217.
[24] Lawrence K. Saul, Fei Sha, and Daniel D. Lee, Statistical Signal Processing with Nonnegativity Constraints, in Proceedings of EuroSpeech 2003, vol. 2, 2003, pp. 1001–1004.
[25] Brad Schofield, Vehicle Dynamics Control for Rollover Prevention, PhD thesis, Lund University, 2006.
[26] Y. Vardi and D. Lee, From Image Deblurring to Optimal Investment Portfolios: Maximum Likelihood Solutions for Positive Linear Problems, Journal of the Royal Statistical Society: Series B, 55 (1993), pp. 569–612.