Line search methods, Wolfe's conditions

Maria Cameron
1. Line Search Methods
The line search methods proceed as follows. Each iteration computes a search direction
p_k and then decides how far to move along this direction. The iteration is given by
x_{k+1} = x_k + α_k p_k,
where the scalar α_k is called the step length.
Most line search algorithms require p_k to be a descent direction, i.e.,
p_k^T ∇f(x_k) < 0.
Example. The steepest descent direction p_k = −∇f_k is a descent direction:
p_k^T ∇f(x_k) = −‖∇f_k‖² < 0.
If the Hessian H_k is symmetric positive definite, then the Newton direction p_k = −H_k^{-1} ∇f_k is
a descent direction. If the approximate Hessian B_k is symmetric positive definite, then the
quasi-Newton search direction p_k = −B_k^{-1} ∇f_k is a descent direction:
p_k^T ∇f(x_k) = −∇f_k^T B_k^{-1} ∇f_k < 0.
Ideally, one would like to find a global minimizer α_k of
F(α) := f(x_k + α p_k).
However, even finding a local minimizer might take too many iterations and result in a
slow routine. Therefore, the goal is just to take a step along the direction p_k that results
in a reasonable reduction of f, so that the overall algorithm converges to a local minimum
of f.
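For a strictly convex quadratic f(x) = ½ x^T A x − b^T x, the global minimizer of F(α) along any descent direction is available in closed form, α = −∇f_k^T p_k / (p_k^T A p_k), so the ideal exact line search can actually be carried out. The sketch below is a toy illustration, not part of the notes; the matrix A and vector b are made up. It runs steepest descent with this exact step.

```python
import numpy as np

# Toy strictly convex quadratic f(x) = 0.5 x^T A x - b^T x (illustrative data).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # symmetric positive definite
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b                     # gradient of the quadratic

x = np.zeros(2)
for k in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    p = -g                               # steepest descent direction
    alpha = -(g @ p) / (p @ (A @ p))     # exact minimizer of F(alpha) = f(x + alpha p)
    x = x + alpha * p

print(x, np.linalg.solve(A, b))          # iterate vs. the exact minimizer A^{-1} b
```

For general nonlinear f no such formula exists, which motivates the inexact conditions below.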
1.1. Wolfe's conditions. A popular inexact line search condition stipulates that α_k
should first of all give sufficient decrease in the objective function f, as measured by the
following inequality:
(1)    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k,
where c_1 ∈ (0, 1) is some fixed constant. Eq. (1) requires that, for the chosen value of α, the graph of
F(α) := f(x_k + α p_k) lies below the line f(x_k) + c_1 α ∇f_k^T p_k. By Taylor's theorem,
f(x_k + α p_k) = f(x_k) + α ∇f_k^T p_k + O(α²).
Since p_k is a descent direction, ∇f_k^T p_k < c_1 ∇f_k^T p_k < 0 (as c_1 < 1), so such α exists: (1) holds for all sufficiently small α > 0.
The sufficient decrease condition (1) is not enough to ensure convergence since, as we
have just seen, it is satisfied for all small enough α. To rule out unacceptably
small steps, a second requirement, called the curvature condition, is introduced:
(2)    ∇f(x_k + α p_k)^T p_k ≥ c_2 ∇f_k^T p_k,
where c_2 ∈ (c_1, 1) is a fixed constant. The left-hand side of (2) is F′(α) and the right-hand side is c_2 F′(0), so the curvature condition forces α to be large enough that the
slope of F at α is at least c_2 times the (negative) slope of F at 0.
Conditions (1) and (2) are the Wolfe conditions. Sometimes the curvature condition is
strengthened to rule out α's for which F′(α) exceeds c_2 |∇f_k^T p_k|, i.e., α's at which F increases too steeply. The resulting
conditions are called the strong Wolfe conditions:
(3)    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k,
(4)    |∇f(x_k + α p_k)^T p_k| ≤ c_2 |∇f_k^T p_k|,
where 0 < c_1 < c_2 < 1.
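As a concrete illustration, here is a minimal Python sketch of a routine that tests whether a trial step length satisfies conditions (1)-(2) or (3)-(4); the function name and the default values c_1 = 10⁻⁴, c_2 = 0.9 are illustrative choices commonly used in practice, not prescribed by the notes.

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9, strong=False):
    """Check the (strong) Wolfe conditions for a trial step alpha along p."""
    slope0 = grad(x) @ p                       # F'(0) = grad f(x)^T p, negative for a descent direction
    slope1 = grad(x + alpha * p) @ p           # F'(alpha)
    sufficient_decrease = f(x + alpha * p) <= f(x) + c1 * alpha * slope0
    if strong:
        curvature = abs(slope1) <= c2 * abs(slope0)   # condition (4)
    else:
        curvature = slope1 >= c2 * slope0             # condition (2)
    return sufficient_decrease and curvature
```

A line search routine would typically call such a test repeatedly while bracketing and refining α.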
Lemma 1. Suppose that f : Rⁿ → R is continuously differentiable. Let p_k be a descent
direction at x_k, and assume that f is bounded from below along the ray {x_k + α p_k | α > 0}.
Then, if 0 < c_1 < c_2 < 1, there exist intervals of step lengths α satisfying the Wolfe
conditions and the strong Wolfe conditions.
Proof. The function F(α) = f(x_k + α p_k) is bounded from below for all α > 0, while the line
l(α) = f(x_k) + α c_1 ∇f_k^T p_k is unbounded from below since its slope c_1 ∇f_k^T p_k is negative.
Hence l must intersect the graph of F at least once. Let α′ > 0 be the smallest
intersecting value of α, i.e.,
(5)    f(x_k + α′ p_k) = f(x_k) + α′ c_1 ∇f_k^T p_k.
Hence the sufficient decrease condition holds for all 0 < α < α′.
By the mean value theorem, there exists α″ ∈ (0, α′) such that
(6)    f(x_k + α′ p_k) − f(x_k) = α′ ∇f(x_k + α″ p_k)^T p_k.
Combining Eqs. (5) and (6) we obtain
(7)    ∇f(x_k + α″ p_k)^T p_k = c_1 ∇f_k^T p_k > c_2 ∇f_k^T p_k,
since c_1 < c_2 and ∇f_k^T p_k < 0. Therefore, α″ satisfies the Wolfe conditions (1) and (2),
and the inequalities are strict. By the smoothness assumption on f, there is an interval around
α″ on which the Wolfe conditions hold. Since ∇f(x_k + α″ p_k)^T p_k < 0, the strong Wolfe
conditions (3) and (4) hold in the same interval.
1.2. Backtracking. The sufficient decrease condition alone is not enough to guarantee
that the algorithm makes reasonable progress along the given search direction. However,
if the line search algorithm chooses the step length by backtracking, the curvature condition
can be dispensed with. This means that we first try a large step and then gradually decrease the
step length until the sufficient decrease condition is satisfied.
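The following Python sketch implements this idea; the initial step α₀ = 1, the contraction factor ρ = 0.5, and c_1 = 10⁻⁴ are illustrative defaults, not values fixed by the notes.

```python
def backtracking(f, grad, x, p, alpha0=1.0, rho=0.5, c1=1e-4, max_iter=50):
    """Shrink alpha geometrically until the sufficient decrease condition (1) holds."""
    alpha = alpha0
    fx = f(x)
    slope = grad(x) @ p                  # grad f(x)^T p, negative for a descent direction
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= rho                     # try a smaller step
    return alpha                         # give up after max_iter contractions
```

Because backtracking starts from a large trial step and only shrinks it by the fixed factor ρ, the accepted α cannot be unacceptably small relative to the last rejected step, which is why the curvature condition is not needed here.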
1.3. Convergence of line search methods. To obtain global convergence, we must have not
only well-chosen step lengths but also well-chosen directions p_k. The angle θ_k between
the search direction p_k and the steepest descent direction −∇f_k is defined by
(8)    cos θ_k = − ∇f_k^T p_k / (‖∇f_k‖ ‖p_k‖).
The following theorem, due to Zoutendijk, has far-reaching consequences. It shows that
the steepest descent method is globally convergent. For other algorithms, it describes how
far p_k can deviate from the steepest descent direction and still give rise to a globally
convergent iteration.
Theorem 1. Consider any iteration of the form
x_{k+1} = x_k + α_k p_k,
where p_k is a descent direction and α_k satisfies the Wolfe conditions (1), (2). Suppose f is
bounded from below in Rⁿ and f is continuously differentiable in an open set D containing
the sublevel set
L := {x ∈ Rⁿ | f(x) ≤ f(x_0)},
where x_0 is the starting point of the iteration. Assume also that ∇f is Lipschitz-continuous
in D, i.e.,
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,  x, y ∈ D.
Then
(9)    ∑_{k≥0} cos² θ_k ‖∇f_k‖² < ∞.
Proof. Subtracting ∇f_k^T p_k from both sides of Eq. (2) and taking into account that x_{k+1} = x_k + α_k p_k,
we get
(∇f_{k+1} − ∇f_k)^T p_k ≥ (c_2 − 1) ∇f_k^T p_k,
while the Lipschitz continuity implies that
(∇f_{k+1} − ∇f_k)^T p_k ≤ ‖∇f_{k+1} − ∇f_k‖ ‖p_k‖ ≤ α_k L ‖p_k‖².
Combining these two relations we obtain
α_k ≥ ((c_2 − 1)/L) · ∇f_k^T p_k / ‖p_k‖².
Substituting this inequality into Eq. (1) we get
f_{k+1} ≤ f_k + c_1 α_k ∇f_k^T p_k ≤ f_k − c_1 ((1 − c_2)/L) · (∇f_k^T p_k)² / ‖p_k‖² ≤ f_k − c cos² θ_k ‖∇f_k‖²,
where c := c_1(1 − c_2)/L and the last step uses the definition (8) of cos θ_k. Summing this expression over all indices j ≤ k, we obtain
f_{k+1} ≤ f_0 − c ∑_{j=0}^{k} cos² θ_j ‖∇f_j‖².
Since f is bounded from below, f_0 − f_{k+1} is less than some positive constant for all k.
Hence
∑_{j=0}^{∞} cos² θ_j ‖∇f_j‖² < ∞.
We call the inequality (9) the Zoutendijk condition. It implies that
cos² θ_k ‖∇f_k‖² → 0  as  k → ∞.
If an algorithm chooses directions so that cos θ_k is bounded away from 0, then
lim_{k→∞} ‖∇f_k‖ = 0.
For example, for the steepest descent method cos θ_k = 1, hence ‖∇f_k‖ → 0. Therefore,
without any additional requirements on f, we can only guarantee convergence to a stationary
point rather than to a minimizer.
For Newton-like methods the search direction is of the form
p_k = −B_k^{-1} ∇f_k.
If we assume that all matrices B_k are symmetric positive definite with uniformly bounded
condition number, i.e.,
‖B_k‖ ‖B_k^{-1}‖ ≤ M for all k,
then
cos θ_k = − ∇f_k^T p_k / (‖∇f_k‖ ‖p_k‖)
        = ∇f_k^T B_k^{-1} ∇f_k / (‖∇f_k‖ ‖B_k^{-1} ∇f_k‖)
        ≥ (‖∇f_k‖² / ‖B_k‖) / (‖∇f_k‖ ‖B_k^{-1}‖ ‖∇f_k‖)
        = 1 / (‖B_k‖ ‖B_k^{-1}‖) ≥ 1/M,
where we used that λ_min(B_k^{-1}) = 1/‖B_k‖ for a symmetric positive definite B_k.
Therefore, cos θ_k is bounded away from 0, and hence ‖∇f_k‖ → 0.
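The bound cos θ_k ≥ 1/M is easy to verify numerically. The short Python check below is an illustration only; the matrix B and vector g are random stand-ins for B_k and ∇f_k. It compares cos θ_k with the reciprocal of the 2-norm condition number.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
Q = rng.standard_normal((n, n))
B = Q @ Q.T + n * np.eye(n)              # symmetric positive definite stand-in for B_k
g = rng.standard_normal(n)               # stand-in for grad f_k

p = -np.linalg.solve(B, g)               # quasi-Newton direction p_k = -B_k^{-1} grad f_k
cos_theta = -(g @ p) / (np.linalg.norm(g) * np.linalg.norm(p))
kappa = np.linalg.cond(B)                # ||B_k|| * ||B_k^{-1}|| in the 2-norm

print(cos_theta, 1.0 / kappa)            # cos_theta should be at least 1/kappa
```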
Example. In the BFGS method (Broyden, Fletcher, Goldfarb, and Shanno) the approximate Hessian is updated by
(10)    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k),
where
s_k = x_{k+1} − x_k,   y_k = ∇f_{k+1} − ∇f_k.
Note that B_{k+1} − B_k is a symmetric rank-2 matrix. One can show that the BFGS update
generates positive definite matrices whenever B_0 is positive definite and s_k^T y_k > 0.
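Written out in code, the update (10) is a one-liner. The sketch below is a direct transcription; the function name is illustrative, and it assumes s_k^T y_k > 0, which holds when the step satisfies the Wolfe conditions.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Apply the BFGS update (10): return B_{k+1} given B_k, s_k, y_k."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```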