Maria Cameron

1. Line Search Methods

Line search methods proceed as follows. Each iteration computes a search direction p_k and then decides how far to move along this direction. The iteration is given by

    x_{k+1} = x_k + α_k p_k,

where the scalar α_k is called the step length. Most line search algorithms require p_k to be a descent direction, i.e., p_k^T ∇f(x_k) < 0.

Example. The steepest descent direction p_k = −∇f_k is a descent direction:

    p_k^T ∇f(x_k) = −‖∇f_k‖² < 0.

If the Hessian H_k is symmetric positive definite, then the Newton direction p_k = −H_k^{−1} ∇f_k is a descent direction. Likewise, if the approximate Hessian B_k is symmetric positive definite, then the quasi-Newton search direction p_k = −B_k^{−1} ∇f_k is a descent direction:

    p_k^T ∇f(x_k) = −∇f_k^T B_k^{−1} ∇f_k < 0.

Ideally, one would like to find a global minimizer α_k of F(α) := f(x_k + α p_k). However, even finding a local minimizer might take too many iterations and result in a slow routine. Therefore, the goal is merely to take a step along the direction p_k that gives a reasonable reduction of f, so that the overall algorithm converges to a local minimum of f.

1.1. Wolfe's conditions. A popular inexact line search condition stipulates that α_k should, first of all, give a sufficient decrease in the objective function f, as measured by the inequality

(1)    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k,

where c_1 ∈ (0, 1) is a fixed constant. Eq. (1) requires that for the chosen value of α the graph of F(α) := f(x_k + α p_k) lie below the line f(x_k) + c_1 α ∇f_k^T p_k. By Taylor's theorem,

    f(x_k + α p_k) = f(x_k) + α ∇f_k^T p_k + O(α²).

Since p_k is a descent direction, i.e., ∇f_k^T p_k < 0, such an α exists. The sufficient decrease condition (1) alone is not enough to ensure convergence since, as we have just seen, it is satisfied for all small enough α. To rule out unacceptably small steps, a second requirement, called the curvature condition, is introduced:

(2)    ∇f(x_k + α p_k)^T p_k ≥ c_2 ∇f_k^T p_k,

where c_2 ∈ (c_1, 1) is a fixed constant. The curvature condition forces α to be chosen large enough that the slope of F at α is at least c_2 times the slope of F at 0; since both slopes are negative, this rules out steps at which F is still decreasing steeply.

Conditions (1) and (2) together are the Wolfe conditions. Sometimes the curvature condition is strengthened to also rule out those α for which F increases too steeply, i.e., for which the slope of F exceeds c_2 |∇f_k^T p_k|. The resulting conditions are called the strong Wolfe conditions:

(3)    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k,
(4)    |∇f(x_k + α p_k)^T p_k| ≤ c_2 |∇f_k^T p_k|,

where 0 < c_1 < c_2 < 1.

Lemma 1. Suppose that f : R^n → R is continuously differentiable. Let p_k be a descent direction at x_k and assume that f is bounded from below along the ray {x_k + α p_k | α > 0}. Then, if 0 < c_1 < c_2 < 1, there exist intervals of step lengths α satisfying the Wolfe conditions and the strong Wolfe conditions.

Proof. The function F(α) = f(x_k + α p_k) is bounded from below for all α > 0, while the line l(α) = f(x_k) + α c_1 ∇f_k^T p_k is unbounded from below because its slope c_1 ∇f_k^T p_k is negative. Hence the line must intersect the graph of F at least once. Let α′ > 0 be the smallest intersecting value of α, i.e.,

(5)    f(x_k + α′ p_k) = f(x_k) + α′ c_1 ∇f_k^T p_k.

Hence the sufficient decrease condition (1) holds for all 0 < α ≤ α′. By the mean value theorem, there exists α″ ∈ (0, α′) such that

(6)    f(x_k + α′ p_k) − f(x_k) = α′ ∇f(x_k + α″ p_k)^T p_k.

Combining Eqs. (5) and (6), we obtain

(7)    ∇f(x_k + α″ p_k)^T p_k = c_1 ∇f_k^T p_k > c_2 ∇f_k^T p_k,

since c_1 < c_2 and ∇f_k^T p_k < 0. Therefore, α″ satisfies the Wolfe conditions (1) and (2), and the inequalities are strict. By the smoothness assumption on f, there is an interval around α″ throughout which the Wolfe conditions hold. Moreover, since ∇f(x_k + α″ p_k)^T p_k = c_1 ∇f_k^T p_k < 0, the left-hand side of (4) equals c_1 |∇f_k^T p_k| < c_2 |∇f_k^T p_k|, so the strong Wolfe conditions (3) and (4) hold in the same interval.

1.2. Backtracking. The sufficient decrease condition alone is not enough to guarantee that the algorithm makes reasonable progress along the given search direction. However, if the line search algorithm chooses the step length by backtracking, the curvature condition can be dispensed with: we first try a large step and then gradually decrease the step length until the sufficient decrease condition is satisfied.
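The following is a minimal Python/NumPy sketch of these ideas. The function names satisfies_wolfe and backtracking_line_search, the default constants c1 = 1e-4 and c2 = 0.9, the initial trial step alpha0 = 1, and the contraction factor rho = 0.5 are illustrative choices rather than prescriptions from these notes; the tests themselves are exactly conditions (1) and (2).

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions (1) and (2) for a candidate step length alpha."""
    slope = grad_f(x) @ p                    # F'(0) = grad(f_k)^T p_k, negative for a descent direction
    sufficient_decrease = f(x + alpha * p) <= f(x) + c1 * alpha * slope   # Eq. (1)
    curvature = grad_f(x + alpha * p) @ p >= c2 * slope                   # Eq. (2)
    return sufficient_decrease and curvature

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4, max_iter=50):
    """Shrink the trial step until the sufficient decrease condition (1) holds."""
    fx, slope = f(x), grad_f(x) @ p
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:   # Eq. (1)
            return alpha
        alpha *= rho                                      # backtrack: try a smaller step
    return alpha

# Illustrative use on a simple quadratic with the steepest descent direction
if __name__ == "__main__":
    A = np.diag([1.0, 10.0])
    f = lambda x: 0.5 * x @ A @ x
    grad_f = lambda x: A @ x
    x = np.array([1.0, 1.0])
    p = -grad_f(x)                                        # steepest descent direction
    alpha = backtracking_line_search(f, grad_f, x, p)
    print(alpha, satisfies_wolfe(f, grad_f, x, p, alpha))
```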
1.3. Convergence of line search methods. To obtain global convergence, we must have not only well-chosen step lengths but also well-chosen directions p_k. The angle θ_k between the search direction p_k and the steepest descent direction −∇f_k is defined by

(8)    cos θ_k = − (∇f_k^T p_k) / (‖∇f_k‖ ‖p_k‖).

The following theorem, due to Zoutendijk, has far-reaching consequences. It shows that the steepest descent method is globally convergent. For other algorithms it describes how far p_k can deviate from the steepest descent direction and still give rise to a globally convergent iteration.

Theorem 1. Consider any iteration of the form x_{k+1} = x_k + α_k p_k, where p_k is a descent direction and α_k satisfies the Wolfe conditions (1), (2). Suppose f is bounded from below in R^n and f is continuously differentiable in an open set D containing the sublevel set L := {x ∈ R^n | f(x) ≤ f(x_0)}, where x_0 is the starting point of the iteration. Assume also that ∇f is Lipschitz continuous in D, i.e.,

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,   x, y ∈ D.

Then

(9)    Σ_{k≥0} cos² θ_k ‖∇f_k‖² < ∞.

Proof. Subtracting ∇f_k^T p_k from both sides of Eq. (2) and taking into account that x_{k+1} = x_k + α_k p_k, we get

    (∇f_{k+1} − ∇f_k)^T p_k ≥ (c_2 − 1) ∇f_k^T p_k,

while the Lipschitz continuity implies that

    (∇f_{k+1} − ∇f_k)^T p_k ≤ ‖∇f_{k+1} − ∇f_k‖ ‖p_k‖ ≤ α_k L ‖p_k‖².

Combining these two relations, we obtain

    α_k ≥ ((c_2 − 1)/L) · (∇f_k^T p_k / ‖p_k‖²).

Substituting this inequality into Eq. (1), we get

    f_{k+1} ≤ f_k + c_1 α_k ∇f_k^T p_k ≤ f_k − c_1 ((1 − c_2)/L) · ((∇f_k^T p_k)² / ‖p_k‖²) = f_k − c cos² θ_k ‖∇f_k‖²,

where c := c_1 (1 − c_2)/L. Summing this expression over all indices, we obtain

    f_{k+1} ≤ f_0 − c Σ_{j=0}^{k} cos² θ_j ‖∇f_j‖².

Since f is bounded from below, f_0 − f_{k+1} is less than some positive constant for all k. Hence

    Σ_{j=0}^{∞} cos² θ_j ‖∇f_j‖² < ∞.

We call the inequality (9) the Zoutendijk condition. It implies that

    cos² θ_k ‖∇f_k‖² → 0   as k → ∞.

If an algorithm chooses directions so that cos θ_k is bounded away from 0, then

    lim_{k→∞} ‖∇f_k‖ = 0.

For example, for the steepest descent method cos θ_k = 1, hence ‖∇f_k‖ → 0. Therefore, without any additional requirements on f we can only guarantee convergence to a stationary point rather than to a minimizer.

For Newton-like methods the search direction is of the form p_k = −B_k^{−1} ∇f_k. If we assume that all matrices B_k are symmetric positive definite with uniformly bounded condition number, i.e., ‖B_k‖ ‖B_k^{−1}‖ ≤ M for all k, then

    cos θ_k = − (∇f_k^T p_k) / (‖∇f_k‖ ‖p_k‖) = (∇f_k^T B_k^{−1} ∇f_k) / (‖∇f_k‖ ‖B_k^{−1} ∇f_k‖)
            ≥ (‖∇f_k‖² / ‖B_k‖) / (‖∇f_k‖ ‖B_k^{−1}‖ ‖∇f_k‖) = 1 / (‖B_k‖ ‖B_k^{−1}‖) ≥ 1/M,

where we used ∇f_k^T B_k^{−1} ∇f_k ≥ ‖∇f_k‖² / ‖B_k‖ (the smallest eigenvalue of B_k^{−1} is 1/‖B_k‖) and ‖B_k^{−1} ∇f_k‖ ≤ ‖B_k^{−1}‖ ‖∇f_k‖. Therefore, cos θ_k is bounded away from 0, and hence ‖∇f_k‖ → 0.

Example. In the BFGS method (Broyden, Fletcher, Goldfarb, and Shanno),

(10)    B_{k+1} = B_k − (B_k s_k s_k^T B_k) / (s_k^T B_k s_k) + (y_k y_k^T) / (y_k^T s_k),

where s_k = x_{k+1} − x_k and y_k = ∇f_{k+1} − ∇f_k. Note that B_{k+1} − B_k is a symmetric rank-2 matrix. One can show that the BFGS update generates symmetric positive definite matrices whenever B_0 is symmetric positive definite and s_k^T y_k > 0.
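As a concrete illustration, here is a minimal Python/NumPy sketch of the update (10). The function name bfgs_update and the safeguard that skips the update when s_k^T y_k ≤ 0 are illustrative choices, not part of the notes; the formula itself is Eq. (10). The small check below verifies the secant equation B_{k+1} s_k = y_k, positive definiteness, and the rank-2 structure of B_{k+1} − B_k.

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation, Eq. (10).

    B : current symmetric positive definite approximation B_k
    s : s_k = x_{k+1} - x_k
    y : y_k = grad f_{k+1} - grad f_k
    Returns B_{k+1}; the update is skipped if s^T y <= 0, since positive
    definiteness is only guaranteed when the curvature condition s^T y > 0 holds.
    """
    sy = s @ y
    if sy <= 0:
        return B                    # illustrative safeguard, not from the notes
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy

# Quick check on a quadratic f(x) = 0.5 x^T A x, for which y_k = A s_k
if __name__ == "__main__":
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    B = np.eye(2)
    s = np.array([1.0, -2.0])
    y = A @ s
    B_new = bfgs_update(B, s, y)
    print(np.allclose(B_new @ s, y))               # secant equation B_{k+1} s_k = y_k
    print(np.all(np.linalg.eigvalsh(B_new) > 0))   # B_{k+1} remains positive definite
    print(np.linalg.matrix_rank(B_new - B))        # B_{k+1} - B_k has rank 2
```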