Chapter 3 Unconstrained Optimization


INTRODUCTION
In this lecture note we discuss numerical methods for solving unconstrained optimization problems. For a real function of several real variables we want to find an argument vector which corresponds to a minimal function value. In some cases we want a maximizer of a function; this is easily found by computing a minimizer of the function with opposite sign. Optimization plays an important role in many branches of science and applications: economics, operations research, network analysis, optimal design of mechanical or electrical systems, to mention but a few.
The ideal situation for optimization computations is that the objective function has a unique
minimizer. We call this the global minimizer. In some cases the objective function has several (or even
infinitely many) minimizers. In such problems it may be sufficient to find one of these minimizers. In
many objective functions from applications we have a global minimizer and several local minimizers.
It is very difficult to develop methods which can find the global minimizer with certainty in this
situation. Methods for global optimization are outside the scope of this lecture note. The methods
described here can find a local minimizer for the objective function. When a local minimizer has been
discovered, we do not know whether it is a global minimizer or one of the local minimizers. We cannot
even be sure that our optimization method will find the local minimizer closest to the starting point. In
order to explore several local minimizers we can try several runs with different starting points, or better
still examine intermediate results produced by a global minimizer.
Conditions for a Local Minimizer
A local minimizer for $f$ is an argument vector giving the smallest function value inside a certain region defined by $\varepsilon$:

$x^*$ is a local minimizer for $f$ if $f(x^*) \le f(x)$ for all $x$ with $\|x - x^*\| \le \varepsilon$ $(\varepsilon > 0)$.
Most objective functions, especially those with several local minimizers, contain local maximizers and
other points which satisfy a necessary condition for a local minimizer. The following theorems help us
find such points and distinguish the local minimizers from the irrelevant points.
Necessary condition for a local minimum.
We assume that $f$ has continuous partial derivatives of second order. $\nabla f(x)$ is the gradient of $f$.

If $x^*$ is a local minimizer for $f$, then

$\nabla f(x^*) = 0$, i.e. $\dfrac{\partial f}{\partial x_1} = 0,\ \dfrac{\partial f}{\partial x_2} = 0,\ \ldots,\ \dfrac{\partial f}{\partial x_n} = 0$
The Hessian $\nabla^2 f(x)$ of the function $f$ is the matrix containing the second partial derivatives of $f$:

$\nabla^2 f(x) = \left[\dfrac{\partial^2 f}{\partial x_i\,\partial x_j}\right]$

Note that this is a symmetric matrix.
Second order necessary condition.
If $x^*$ is a local minimizer, then $\nabla^2 f(x^*)$ is positive semidefinite.
(Criteria for positive semidefiniteness:
1. For a positive semidefinite matrix, the condition is that all principal minors are nonnegative; nonnegativity of the leading principal minors alone does not imply that the matrix is positive semidefinite.
2. Definition: let A be a real symmetric matrix. If $X^T A X \ge 0$ for every nonzero real column vector X, then A is called positive semidefinite.
3. $A \in M_n(K)$ is positive semidefinite if and only if all principal minors of A are greater than or equal to zero.)
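As a quick numerical illustration of these criteria, one can test positive semidefiniteness of a symmetric matrix by checking that all its eigenvalues are nonnegative (equivalent to the principal-minor test above). A minimal sketch using NumPy; the matrix below is the Hessian of the example function used later in this note:

```python
import numpy as np

def is_positive_semidefinite(A, tol=1e-10):
    """Check X^T A X >= 0 for all X by testing that every
    eigenvalue of the symmetric matrix A is nonnegative."""
    A = np.asarray(A, dtype=float)
    eigvals = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
    return np.all(eigvals >= -tol)

# Hessian of f(X) = 4*x1^2 + x2^2 (the example used below):
H = np.array([[8.0, 0.0],
              [0.0, 2.0]])
print(is_positive_semidefinite(H))    # True: H is in fact positive definite
```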
DESCENT METHODS (STEEPEST DESCENT METHOD)
The search direction $S_k$ must be a descent direction. Then we are able to gain a smaller value of $f(x)$ by choosing an appropriate step length, and thus we can satisfy the descending condition $f(x_{k+1}) < f(x_k)$. As stopping criterion the ideal would be that the current function value is close enough to the minimal value, i.e.

$f(x_k) - f(x^*) \le \varepsilon$

This cannot be used in practice, however, because $x^*$ and $f(x^*)$ are not known. Instead we have to use approximations to this condition:

$\|x_{k+1} - x_k\| \le \varepsilon_1$ or $f(x_k) - f(x_{k+1}) \le \varepsilon_2$

The other type of convergence mentioned is $\nabla f(x_k) \to 0$. This can be reflected in the stopping criterion

$\|\nabla f(x_k)\| \le \varepsilon_3$

which is included in many implementations of descent methods.
There is a good way of using the property of converging function values. The Taylor expansion of $f$ at $x^*$ is

$f(x_k) \simeq f(x^*) + (x_k - x^*)^T \nabla f(x^*) + \frac{1}{2}(x_k - x^*)^T \nabla^2 f(x^*)\,(x_k - x^*)$

Now, if $x^*$ is a local minimizer, then $\nabla f(x^*) = 0$ and $H^* = \nabla^2 f(x^*)$ is positive semidefinite. This gives us

$f(x_k) - f(x^*) \simeq \frac{1}{2}(x_k - x^*)^T H^* (x_k - x^*)$

so the stopping criterion could be

$\frac{1}{2}(x_{k+1} - x_k)^T H_k\,(x_{k+1} - x_k) \le \varepsilon_4$

Here $x_k - x^*$ is approximated by $x_{k+1} - x_k$, and $H^*$ is approximated by $H_k = \nabla^2 f(x_k)$.
The negative gradient direction,

$S_k = -\nabla f(x_k)$

is called the direction of steepest descent. It gives us a useful gain in function value if the step is so short that the third term in the Taylor expansion is insignificant. Thus we have to stop well before we reach the minimizer along the direction $S_k$. At the minimizer the higher order terms are large enough to have changed the slope from its negative starting value to zero.
A descent method based on steepest descent is convergent: if we make a method using $S_k$ and a version of line search that ensures sufficiently short steps, the global convergence will manifest itself as a very robust global performance. The disadvantage is that the method has linear final convergence, and this will often be exceedingly slow. If we use exact line search together with steepest descent, we invite trouble.
Steepest descent algorithm
Step 1. Estimate a starting design $X_0$ and set the iteration counter $k = 0$. Select a convergence parameter $\varepsilon > 0$;
Step 2. Calculate the gradient of $f(x)$ at the point $X_k$ as $c_k = \nabla f(X_k)$. Calculate $\|c_k\| = \sqrt{c_k^T c_k}$. If $\|c_k\| \le \varepsilon$, stop the iteration process: $X^* = X_k$ is a minimum point. Otherwise, go to Step 3;
Step 3. Let the search direction at the current point $X_k$ be $S_k = -\nabla f(X_k)$;
Step 4. Calculate a step size $\alpha_k$ to minimize $f(X_k + \alpha_k S_k)$. A one-dimensional search is used to determine $\alpha_k$;
Step 5. Update the design as $X_{k+1} = X_k + \alpha_k S_k$. Set $k = k + 1$ and go to Step 2.
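The following sketch implements Steps 1-5 in Python. The function and gradient are passed in as callables, and the one-dimensional search of Step 4 is delegated to scipy.optimize.minimize_scalar (any line search procedure would serve):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, eps=0.1, max_iter=1000):
    """Steps 1-5 of the steepest descent algorithm above."""
    x = np.asarray(x0, dtype=float)                       # Step 1
    for k in range(max_iter):
        c = grad(x)                                       # Step 2: gradient
        if np.sqrt(c @ c) <= eps:                         # stop when ||c|| <= eps
            return x, k
        s = -c                                            # Step 3: steepest descent direction
        alpha = minimize_scalar(lambda a: f(x + a * s)).x # Step 4: one-dimensional search
        x = x + alpha * s                                 # Step 5: update
    return x, max_iter
```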
Example
We test a steepest descent method with line search on the function

$\min f(X) = 4x_1^2 + x_2^2$

The starting point is taken as $X_0 = (1,1)^T$, and the error tolerance is $\varepsilon = 0.1$.

The gradient is

$\nabla f(X) = (8x_1,\ 2x_2)^T$

so the first search direction is

$S_0 = -\nabla f(X_0) = (-8,-2)^T, \qquad \|S_0\| = \sqrt{64+4} = 2\sqrt{17}$

A one-dimensional search, $\min_\alpha f(X_0 + \alpha S_0) = \min_\alpha \varphi(\alpha)$ with $\varphi'(\alpha) = 0$, gives

$\alpha_0 = 0.130769$

Update the point as

$X_1 = (1,1)^T - 0.130769\,(8,2)^T = (-0.046152,\ 0.738462)^T$

Thus the next search direction will be

$S_1 = -\nabla f(X_1) = (0.369216,\ -1.476924)^T, \qquad \|S_1\| = 1.522375$

Update the point as $X_2 = (0.101537,\ 0.147682)^T$.

Thus the 3rd search direction will be

$S_2 = -\nabla f(X_2) = (-0.812296,\ -0.295364)^T, \qquad \|S_2\| = 0.864329$

Update the point as $X_3 = (-0.009747,\ 0.107217)^T$.

Thus the 4th search direction will be

$S_3 = -\nabla f(X_3) = (0.077976,\ -0.214434)^T, \qquad \|S_3\| = 0.228171$

Update the point as $X_4 = (0.019126,\ 0.027816)^T$.

Thus the 5th search direction will be

$S_4 = -\nabla f(X_4) = (-0.153008,\ -0.055632)^T, \qquad \|S_4\| = 0.162807$

Update the point as $X_5 = (-0.001835,\ 0.020195)^T$.

Now $\|\nabla f(X_5)\| = \sqrt{0.001847} \approx 0.043 \le \varepsilon$, so the iteration stops. The iterate $X_5$ approximates the minimum point $X^* = (0,0)^T$, and $f(X_5)$ is the minimum value of the objective function found at this iteration.
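The iteration above can be reproduced, up to the accuracy of the inner line search, by running the steepest_descent sketch from the previous section on this problem:

```python
import numpy as np

# Reproduce the worked example with the steepest_descent sketch above.
f = lambda x: 4 * x[0]**2 + x[1]**2
grad = lambda x: np.array([8 * x[0], 2 * x[1]])

x_star, iters = steepest_descent(f, grad, x0=[1.0, 1.0], eps=0.1)
print(x_star, iters)   # a point near (0, 0) after a handful of iterations
```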
This example shows how the final linear convergence of the steepest descent method can become
so slow that it makes the method completely useless when we are near the solution. We say that the
iteration is caught in Stiefel’s cage.
The method is useful, however, when we are far from the solution. It performs a little better if we
ensure that the steps taken are small enough. In such a version it is included in several modern hybrid
methods, where there is a switch between two methods, one with robust global performance and one
with superlinear (or even quadratic) final convergence. Under these circumstances the method of
steepest descent does a very good job as the “global part” of the hybrid.
Newton's Method
The basic idea of Newton's method for unconstrained optimization is to iteratively use the quadratic approximation $\varphi(X)$ to the objective function $f(X)$ at the current iterate $X_k$ and to minimize the approximation $\varphi(X)$:

$f(X) \approx \varphi(X) = f(X_k) + \nabla f(X_k)^T (X - X_k) + \frac{1}{2}(X - X_k)^T \nabla^2 f(X_k)(X - X_k)$

where $H(X_k) = \nabla^2 f(X_k)$ is the Hessian matrix.

Minimizing $\varphi(X)$ yields $\nabla \varphi(X) = 0$:

$\nabla f(X_k) + H(X_k)(X - X_k) = 0$

If the Hessian is positive definite, then

$X^* = X_k - H^{-1}(X_k)\,\nabla f(X_k)$

The point $X^*$ is near the minimizer, and the iteration becomes

$X_{k+1} = X_k - H^{-1}(X_k)\,\nabla f(X_k)$

The search direction is $S_k = -H^{-1}(X_k)\,\nabla f(X_k)$ and the step size is 1.
For a positive definite quadratic function, Newton's method reaches the minimizer in one iteration. For a general non-quadratic function, however, it is not certain that Newton's method reaches the minimizer in finitely many iterations. Fortunately, since the objective function is approximately quadratic near the minimizer, Newton's method converges rapidly if the starting point is close to the minimizer. Newton's method thus enjoys local convergence with a quadratic convergence rate.
Newton's method is a local method. When the starting point is far away from the solution, it is not certain that $H(X_k)$ is positive definite and that the Newton direction $S_k$ is a descent direction, so convergence is not guaranteed. Since line search is a globalization strategy, we can employ Newton's method with line search to obtain global convergence. It should be noted, however, that Newton's method converges at the quadratic rate only when the step size sequence $\{\alpha_k\}$ converges to 1. Newton's iteration with line search is as follows:

$S_k = -H^{-1}(X_k)\,\nabla f(X_k)$
$X_{k+1} = X_k + \alpha_k S_k$
Advantages and disadvantages of Newton’s method for unconstrained optimization problems
Advantages
1. Quadratically convergent from a good starting point if $H(X_k) = \nabla^2 f(X_k)$ is positive definite.
2. Simple and easy to implement.
Disadvantages
1. Not globally convergent for many problems.
2. May converge towards a maximum or saddle point of f ( x ) ;
3. The system of linear equations to be solved in each iteration may be ill-conditioned or singular;
4. Requires analytic second order derivatives of f ( x ) .
Example
We shall use Newton's method with line search to find the minimizer of the following function:

$\min f(X) = 4x_1^2 + x_2^2$

The starting point is taken as $X_0 = (1,1)^T$, and the error tolerance is $\varepsilon = 0.1$.

We need the derivatives of first and second order of this function:

$\nabla f(X_0) = (8, 2)^T, \qquad \nabla^2 f(X_0) = \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}$

$[\nabla^2 f(X_0)]^{-1} = \begin{bmatrix} 1/8 & 0 \\ 0 & 1/2 \end{bmatrix}, \qquad S_0 = -[\nabla^2 f(X_0)]^{-1}\nabla f(X_0) = -(1,1)^T$

The one-dimensional search along $S_0$ gives $\alpha_0 = 1$, so

$X_1 = X_0 + S_0 = (1,1)^T - (1,1)^T = (0,0)^T$

Since $\|\nabla f(X_1)\| = 0 \le 0.1$, we stop with $X^* = X_1$.
Modified Newton’s Method (Damped Newton Methods)
The more efficient modified Newton methods are constructed as either explicit or implicit hybrids between the original Newton method and the method of steepest descent. The idea is that the algorithm should in some way take advantage of the safe, global convergence properties of the steepest descent method whenever Newton's method gets into trouble. On the other hand, the quadratic convergence of Newton's method should be obtained when the iterates get close enough to $X^*$, provided that the Hessian is positive definite.
Modified Newton’s Method algorithm
Step 1. Estimate a starting point $X_0$ and set the iteration counter $k = 0$. Select a convergence parameter $\varepsilon > 0$;
Step 2. Calculate the gradient of $f(x)$ at the point $X_k$ as $c_k = \nabla f(X_k)$. Calculate $\|c_k\| = \sqrt{c_k^T c_k}$. If $\|c_k\| \le \varepsilon$, stop the iteration process: $X^* = X_k$ is a minimum point. Otherwise, go to Step 3;
Step 3. Let the search direction at the current point $X_k$ be $S_k = -[\nabla^2 f(X_k)]^{-1}\nabla f(X_k)$;
Step 4. Calculate a step size $\alpha_k$ to minimize $f(X_k + \alpha S_k)$. A one-dimensional search is used to determine $\alpha_k$: $f(X_k + \alpha_k S_k) = \min_{\alpha \ge 0} f(X_k + \alpha S_k)$;
Step 5. Update the design as $X_{k+1} = X_k + \alpha_k S_k$. Set $k = k + 1$ and go to Step 2.
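A sketch of the damped (modified) Newton iteration, differing from the pure Newton sketch above only in Step 4's line search. The fallback to the steepest descent direction when the Hessian is not positive definite is one simple way to realize the hybrid idea, an illustrative assumption rather than the only strategy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def damped_newton(f, grad, hess, x0, eps=0.1, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)                                       # Step 2
        if np.linalg.norm(g) <= eps:
            return x, k
        H = hess(x)
        try:
            np.linalg.cholesky(H)                         # succeeds iff H is positive definite
            s = np.linalg.solve(H, -g)                    # Step 3: Newton direction
        except np.linalg.LinAlgError:
            s = -g                                        # fall back to steepest descent
        alpha = minimize_scalar(lambda a: f(x + a * s)).x # Step 4: line search
        x = x + alpha * s                                 # Step 5
    return x, max_iter
```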
Conjugate Gradient Method
We introduce the conjugate gradient method, which lies between the steepest descent method and Newton's method. The conjugate gradient method deflects the direction of the steepest descent method by adding to it a positive multiple of the direction used in the last step. The method requires only first-order derivatives, yet overcomes the steepest descent method's shortcoming of slow convergence. At the same time, it need not store or compute the second-order derivatives that Newton's method requires. In particular, since it does not require the Hessian matrix or an approximation of it, it is widely used to solve large scale optimization problems. As a beginning, we first introduce the concept of conjugate directions and the conjugate direction method.
Definition
Let $G$ be an $n \times n$ symmetric and positive definite matrix, and let $d_1, d_2, \ldots, d_m$ be non-zero vectors, $m \le n$. If

$d_i^T G d_j = 0, \quad i \ne j$

the vectors $d_1, d_2, \ldots, d_m$ are called G-conjugate, or simply conjugate.
Obviously, if the vectors $d_1, \ldots, d_m$ are G-conjugate, then they are linearly independent. If $G = I$, conjugacy is equivalent to ordinary orthogonality.
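For instance, G-conjugacy can be verified numerically; a small sketch (the matrix and vectors are illustrative choices, not taken from the text):

```python
import numpy as np

G = np.array([[8.0, 0.0],
              [0.0, 2.0]])
d1 = np.array([1.0, 0.0])
d2 = np.array([0.0, 1.0])

# d1 and d2 are G-conjugate: d1^T G d2 = 0
print(d1 @ G @ d2)          # 0.0
# With G = I, conjugacy reduces to ordinary orthogonality:
print(d1 @ np.eye(2) @ d2)  # 0.0
```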
A general conjugate direction method has the following steps:
Step 1. Given an initial point $X_0$ and $\varepsilon > 0$, set $k = 0$ and compute $S_0$;
Step 2. A one-dimensional search is used to determine $X_{k+1} = X_k + \alpha_k S_k$: compute $\alpha_k$ such that $f(X_k + \alpha_k S_k) = \min_{\alpha \ge 0} f(X_k + \alpha S_k)$;
Step 3. Calculate the gradient at the new point as $c = \nabla f(X_{k+1})$ and calculate $\|c\| = \sqrt{c^T c}$. If $\|c\| \le \varepsilon$, stop the iteration process. Otherwise, go to Step 4;
Step 4. Compute $S_{k+1}$ by some conjugate direction method, such that

$[S_j]^T H S_{k+1} = 0, \quad j = 0, 1, 2, \ldots, k$

Step 5. Set $k = k + 1$ and go to Step 2.
Conjugate Gradient Method
In the conjugate direction method, there is not an explicit procedure for generating a conjugate
system of vectors d1, d2,… In this section we describe a method for generating mutually conjugate
direction vectors, which is theoretically appealing and computationally effective. This method is called
the conjugate gradient method. In conjugate direction methods, the conjugate gradient method is of
particular importance. Now it is widely used to solve large scale optimization problems. The conjugate
gradient method was originally proposed by Hestenes and Stiefel in the 1950s to solve linear systems.
Since solving a linear system is equivalent to minimizing a positive definite quadratic function,
Fletcher and Reeves in the 1960s modified it and developed a conjugate gradient method for
unconstrained minimization. By imposing conjugacy on the successive steepest descent directions, the conjugate gradient method increases the efficiency and reliability of the algorithm.
The function $f(X)$ is approximated locally by a quadratic form:

$f(X) = \frac{1}{2} X^T H X + B^T X + C$

where $H$ is a symmetric matrix which is usually required to be positive definite.
It is not difficult to see that the gradient at $X_k$ is given by $g_k = H X_k + B$, the gradient at $X_{k+1}$ is $g_{k+1} = H X_{k+1} + B$, and for all $X$ the Hessian is

$H = \nabla^2 f(X)$

If $H$ is positive definite, then $f(X)$ has a unique minimizer $X^* = -H^{-1}B$. If $n = 2$, the contours of $f(X)$ are ellipses centered at $X^*$. The shape and orientation of the ellipses are determined by the eigenvalues and eigenvectors of $H$. For $n = 3$ this generalizes to ellipsoids, and in higher dimensions we get $(n-1)$-dimensional hyper-ellipsoids. It is of course possible to define quadratic functions with a non-positive definite Hessian, but then there is no longer a unique minimizer.
Finally, a useful fact is derived in a simple way: multiplication by $H$ maps differences in $X$-values to differences in the corresponding gradients,

$g_{k+1} - g_k = H(X_{k+1} - X_k) = \alpha_k H S_k$

(a one-dimensional search is used to determine $X_{k+1} = X_k + \alpha_k S_k$). If the search directions $S_j$ and $S_k$ are H-conjugate, then

$[S_j]^T H S_k = 0$

The new search direction is taken as the steepest descent direction deflected by a multiple of the previous direction,

$S_{k+1} = -g_{k+1} + \beta_k S_k$

where $\beta_k$ is chosen so that $[S_k]^T H S_{k+1} = 0$. Using $g_{k+1} - g_k = \alpha_k H S_k$, this gives

$\beta_k = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{S_k^T (g_{k+1} - g_k)}$

With exact line searches on the quadratic function ($S_k^T g_{k+1} = 0$, $g_{k+1}^T g_k = 0$), this reduces to the Fletcher–Reeves formula

$\beta_k = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$
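A minimal sketch of the Fletcher–Reeves conjugate gradient method for a general function, with the line search again delegated to scipy's scalar minimizer (periodic restarts and more robust choices of $\beta_k$ are omitted for brevity):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_fr(f, grad, x0, eps=0.1, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    s = -g                                    # first direction: steepest descent
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps:
            return x, k
        alpha = minimize_scalar(lambda a: f(x + a * s)).x  # exact-ish line search
        x = x + alpha * s
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves beta_k
        s = -g_new + beta * s                 # deflected steepest descent direction
        g = g_new
    return x, max_iter
```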
COORDINATE SEARCH METHOD
Many practical applications require the optimization of functions whose derivatives are not
available. Problems of this kind can be solved, in principle, by approximating the gradient (and
possibly the Hessian) using finite differences, and using these approximate gradients within the
algorithms described in earlier parts. Even though this finite-difference approach is effective in some
applications, it cannot be regarded as a general-purpose technique for derivative-free optimization
because the number of function evaluations required can be excessive and the approach can be
unreliable in the presence of noise. Because of these shortcomings, various algorithms have been
developed that do not attempt to approximate the gradient. Rather, they use the function values at a set
of sample points to determine a new iterate by some other means.
Derivative-free optimization (DFO) algorithms differ in the way they use the sampled function
values to determine the new iterate. Derivative-free optimization methods are not as well developed as
gradient-based methods; current algorithms are effective only for small problems. Although most DFO
methods have been adapted to handle simple types of constraints, such as bounds, the efficient
treatment of general constraints is still the subject of investigation. Consequently, we limit our
discussion to the unconstrained optimization problem
min f ( X )
Problems in which derivatives are not available arise often in practice. The evaluation of $f(X)$ can, for example, be the result of an experimental measurement or a stochastic simulation, with the underlying analytic form of $f(X)$ unknown. Even if the objective function $f(X)$ is known in analytic form, coding its derivatives may be time-consuming or impractical.
The coordinate search method (also known as the coordinate descent method or the alternating variables method) cycles through the $n$ coordinate directions $e_1, e_2, \ldots, e_n$, obtaining new iterates by performing a line search along each direction in turn. Specifically, at the first iteration, we fix all components of $x$ except the first one, $x_1$, and find a new value of this component that minimizes (or at least reduces) the objective function. On the next iteration, we repeat the process with the second component $x_2$, and so on. After $n$ iterations, we return to the first variable and repeat the cycle. Though simple and somewhat intuitive, this method can be quite inefficient in practice, as illustrated in the figure for a quadratic function in two variables: after a few iterations, neither the vertical ($x_2$) nor the horizontal ($x_1$) move makes much progress toward the solution.
In general, the coordinate search method can iterate
infinitely without ever approaching a point where the gradient
of the objective function vanishes, even when exact line searches are used. In fact, a cyclic search
along any set of linearly independent directions does not guarantee global convergence.
Technically speaking, this difficulty arises because the steepest descent search direction $-\nabla f(X)$ may become more and more perpendicular to the coordinate search direction.
When the coordinate search method does converge to a solution, it often converges much
more slowly than the steepest descent method, and the difference between the two approaches
tends to increase with the number of variables. However, coordinate search may still be useful
because it does not require calculation of the gradient $\nabla f(X)$, and the speed of convergence can
be quite acceptable if the variables are loosely coupled in the objective function f ( X ) .
Many variants of the coordinate search method have been proposed, some of which allow a
global convergence property to be proved. One simple variant is a “back-and-forth” approach in
which we search along the sequence of directions
e1, e2, . . . en−1, en, en−1, . . . , e2, e1, e2, . . . (repeats).
Another approach, suggested by Figure, is first to perform a sequence of coordinate descent steps
and then search along the line joining the first and last points in the cycle.
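A minimal sketch of the basic coordinate search cycle. It is derivative-free in the sense that only function values are used; the line search along each coordinate is done with scipy's scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_search(f, x0, eps=1e-6, max_cycles=100):
    """Cycle through the coordinate directions e_1, ..., e_n,
    minimizing f along each one in turn."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for cycle in range(max_cycles):
        x_old = x.copy()
        for i in range(n):                    # line search along e_i: vary x[i] only
            e = np.zeros(n)
            e[i] = 1.0
            alpha = minimize_scalar(lambda a: f(x + a * e)).x
            x = x + alpha * e
        if np.linalg.norm(x - x_old) <= eps:  # stop when a full cycle makes no progress
            return x, cycle
    return x, max_cycles
```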
Powell Algorithm
The Powell algorithm follows the idea of successive directional minimizations in order to find the solution of the optimization problem. Once the starting point $X^0$ is chosen, there is always the dilemma of how to generate directions for the line search subroutine. Iterating through a set of versors (unit coordinate vectors) is mathematically correct, as they span the optimization domain, but it can turn out to be strikingly ineffective if the objective function forms narrow curving valleys. The improvement is to choose the next direction so that, while optimizing along it, the gradient stays perpendicular to the current direction. Such a pair of directions is called conjugate.
Directional optimization along versors vs. conjugate vectors is shown in the figure. The first two steps (black arrows) are done in the direction of versors. The third and fourth could also be done along versors (black arrows), but it is much better to perform them using a conjugate vector (arrows nearby) that is created utilizing experience from the last two steps. The steps taken along conjugate vectors lead immediately to the function minimum.
Usually, to construct a new conjugate direction one needs the Hessian matrix (or an approximation of it) of the function being optimized. Powell suggested a routine in which the next direction is built solely from the data of the last $N$ line searches. The routine preserves the algorithm's quadratic convergence rate, but its drawback is that the directions tend to become linearly dependent. This can be avoided in several ways: one of them is to give up the direction set periodically and to start over with a set of versors. The algorithm is therefore shaped as below:
Step 1. Let $X^0$ be the starting point, chosen arbitrarily.
Step 2. Create the direction set $R = \{S^1, S^2, \ldots, S^n\}$ and initialize its components to the search space versors.
Step 3. Perform $n$ line searches: for the $k$-th line search, start the optimization from the point $X^{k-1}$ along $S^k$ and call the result $X^k$.
Step 4. Remove $S^1$ and shift the remaining directions. Complete the set with $S^n = X^n - X^0$.
Step 5. Perform an additional minimization along $S^n$ and call the result $X^0$. Stop if the stopping criterion is satisfied; otherwise return to Step 3.
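A sketch of the basic Powell cycle described above (without the periodic reset of the direction set to versors, which would be added for robustness; the stopping criterion used here, a negligible cycle displacement, is one common choice):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def powell(f, x0, eps=1e-6, max_cycles=100):
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    R = [np.eye(n)[i] for i in range(n)]          # Step 2: direction set = versors
    for cycle in range(max_cycles):
        x = x0.copy()
        for s in R:                               # Step 3: n line searches
            alpha = minimize_scalar(lambda a: f(x + a * s)).x
            x = x + alpha * s
        s_new = x - x0                            # Step 4: new direction S^n = X^n - X^0
        if np.linalg.norm(s_new) <= eps:          # stop: cycle made no progress
            return x, cycle
        R = R[1:] + [s_new]                       # drop S^1, shift, append S^n
        alpha = minimize_scalar(lambda a: f(x + a * s_new)).x
        x0 = x + alpha * s_new                    # Step 5: extra minimization -> new X^0
    return x0, max_cycles
```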