Chapter 3 Unconstrained Optimization

INTRODUCTION

In this lecture note we discuss numerical methods for solving the unconstrained optimization problem: for a real function of several real variables we want to find an argument vector which corresponds to a minimal function value. In some cases we want a maximizer of a function; this is easily obtained by finding a minimizer of the function with opposite sign. Optimization plays an important role in many branches of science and in applications: economics, operations research, network analysis, and optimal design of mechanical or electrical systems, to mention but a few.

The ideal situation for optimization computations is that the objective function has a unique minimizer, the global minimizer. In some cases the objective function has several (or even infinitely many) minimizers, and in such problems it may be sufficient to find one of them. Many objective functions from applications have one global minimizer together with several local minimizers. It is very difficult to develop methods which can find the global minimizer with certainty in this situation, and methods for global optimization are outside the scope of this lecture note.

The methods described here can find a local minimizer of the objective function. When a local minimizer has been found, we do not know whether it is the global minimizer or one of the local minimizers; we cannot even be sure that the method finds the local minimizer closest to the starting point. In order to explore several local minimizers we can try several runs with different starting points or, better still, examine the intermediate results produced by a global minimizer.

Conditions for a Local Minimizer

A local minimizer $x^*$ of $f$ is an argument vector giving the smallest function value inside a certain region defined by $\varepsilon$:

$$x^* \text{ is a local minimizer for } f \iff f(x^*) \le f(x) \ \text{ for all } x \text{ with } \|x - x^*\| \le \varepsilon \quad (\varepsilon > 0).$$

Most objective functions, especially those with several local minimizers, also contain local maximizers and other points which satisfy a necessary condition for a local minimizer. The following results help us find such points and distinguish the local minimizers from the irrelevant points.

First-order necessary condition. Assume that $f$ has continuous second-order partial derivatives, and let $\nabla f(x)$ denote the gradient of $f$. If $x^*$ is a local minimizer for $f$, then

$$\nabla f(x^*) = 0, \qquad \text{i.e.} \quad \frac{\partial f}{\partial x_1} = 0,\ \frac{\partial f}{\partial x_2} = 0,\ \dots,\ \frac{\partial f}{\partial x_n} = 0.$$

The Hessian $\nabla^2 f(x)$ of the function $f$ is the matrix containing the second partial derivatives of $f$:

$$\nabla^2 f(x) = \left[ \frac{\partial^2 f}{\partial x_i\, \partial x_j} \right].$$

Note that this is a symmetric matrix.

Second-order necessary condition. If $x^*$ is a local minimizer, then $\nabla^2 f(x^*)$ is positive semidefinite.

(How to check that a matrix is positive semidefinite:
1. For a positive semidefinite matrix the condition is that all principal minors are nonnegative; nonnegativity of the leading principal minors alone does not imply that a matrix is positive semidefinite.
2. Definition: let $A$ be a real symmetric matrix. If $X^T A X \ge 0$ for every nonzero real column vector $X$, then $A$ is called positive semidefinite.
3. $A \in M_n(K)$ is positive semidefinite if and only if all principal minors of $A$ are greater than or equal to zero.)

DESCENT METHODS (STEEPEST DESCENT METHOD)

The search direction $S_k$ must be a descent direction. Then we are able to gain a smaller value of $f(x)$ by choosing an appropriate step length, and thus we can satisfy the descending condition $f(x_{k+1}) < f(x_k)$. As stopping criterion, the ideal choice would be that the current function value is close enough to the minimal value,

$$f(x_{k+1}) - f(x^*) \le \delta.$$

This cannot be used in practice, however, because $x^*$ and $f(x^*)$ are not known. Instead we have to use approximations to these conditions:

$$\|x_{k+1} - x_k\| \le \varepsilon_1 \quad (1) \qquad \text{or} \qquad f(x_k) - f(x_{k+1}) \le \varepsilon_2. \quad (2)$$

The other type of convergence mentioned is $\nabla f(x_k) \to 0$. These tests are cheap to evaluate inside a descent loop; a minimal sketch combining them is given below.
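As a concrete illustration (not part of the original note), the following minimal Python sketch shows how the tests (1) and (2), together with the gradient test (3) introduced next, are typically combined in a descent loop. The function names `descent`, `direction` and `line_search` and the default tolerances are illustrative assumptions.

```python
import numpy as np

def descent(f, grad, x0, direction, line_search,
            eps1=1e-8, eps2=1e-12, eps3=1e-6, max_iter=1000):
    """Generic descent loop combining stopping criteria (1)-(3)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps3:            # criterion (3)
            break
        s = direction(x, g)                       # must be a descent direction
        alpha = line_search(f, x, s)
        x_new = x + alpha * s
        small_step = np.linalg.norm(x_new - x) <= eps1   # criterion (1)
        small_gain = f(x) - f(x_new) <= eps2             # criterion (2)
        x = x_new
        if small_step or small_gain:
            break
    return x
```

With `direction = lambda x, g: -g` this skeleton becomes the steepest descent method discussed next.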
The gradient type of convergence can be reflected in the stopping criterion

$$\|\nabla f(x_k)\| \le \varepsilon_3, \quad (3)$$

which is included in many implementations of descent methods.

There is a good way of using the property of converging function values. The Taylor expansion of $f$ at $x^*$ is

$$f(x_k) \approx f(x^*) + (x_k - x^*)^T \nabla f(x^*) + \tfrac{1}{2}(x_k - x^*)^T \nabla^2 f(x^*)(x_k - x^*).$$

Now, if $x^*$ is a local minimizer, then $\nabla f(x^*) = 0$ and $H^* = \nabla^2 f(x^*)$ is positive semidefinite. This gives us

$$f(x_k) - f(x^*) \approx \tfrac{1}{2}(x_k - x^*)^T H^* (x_k - x^*),$$

so the stopping criterion could be

$$\tfrac{1}{2}(x_{k+1} - x_k)^T H_k (x_{k+1} - x_k) \le \varepsilon_4. \quad (4)$$

Here $x_k - x^*$ is approximated by $x_{k+1} - x_k$, and $H^*$ is approximated by $H_k = \nabla^2 f(x_k)$.

The search direction

$$S_k = -\nabla f(x_k),$$

the negative gradient direction, is called the direction of steepest descent. It gives a useful gain in function value if the step is so short that the third term in the Taylor expansion is insignificant. Thus we have to stop well before we reach the minimizer along the direction $S_k$; at that minimizer the higher-order terms are large enough to have changed the slope from its negative starting value to zero.

By the general convergence theory for descent methods, a method based on steepest descent is convergent. If we make a method using $S_k$ and a version of line search that ensures sufficiently short steps, the global convergence will manifest itself as a very robust global performance. The disadvantage is that the method will have linear final convergence, and this will often be exceedingly slow. If we use exact line search together with steepest descent, we invite trouble.

Steepest descent algorithm

Step 1. Estimate a starting design $X_0$ and set the iteration counter $k = 0$. Select a convergence parameter $\varepsilon > 0$.
Step 2. Calculate the gradient of $f(x)$ at the point $X_k$, $c_k = \nabla f(X_k)$, and its norm $\|c_k\| = \sqrt{c_k^T c_k}$. If $\|c_k\| \le \varepsilon$, stop the iteration process: $X^* = X_k$ is a minimum point. Otherwise, go to Step 3.
Step 3. Take the search direction at the current point $X_k$ as $S_k = -\nabla f(X_k)$.
Step 4. Calculate a step size $\alpha_k$ that minimizes $f(X_k + \alpha S_k)$; a one-dimensional search is used to determine $\alpha_k$.
Step 5. Update the design as $X_{k+1} = X_k + \alpha_k S_k$. Set $k = k + 1$ and go to Step 2.

Example

We test the steepest descent method with line search on the function

$$\min f(X) = 4x_1^2 + x_2^2,$$

with starting point $X_0 = (1, 1)^T$ and tolerance $\varepsilon = 0.1$.

The gradient is $\nabla f(X) = (8x_1, 2x_2)^T$, so the first search direction is

$$S_0 = -\nabla f(X_0) = (-8, -2)^T, \qquad \|S_0\| = \sqrt{64 + 4} \approx 8.246.$$

The one-dimensional search minimizes $\varphi(\alpha) = f(X_0 + \alpha S_0) = 4(1 - 8\alpha)^2 + (1 - 2\alpha)^2$; setting $\varphi'(\alpha) = 0$ gives $\alpha_0 = 0.130769$. Update the point as

$$X_1 = (1, 1)^T + 0.130769\,(-8, -2)^T = (-0.046152, 0.738462)^T.$$

Thus the next search direction is

$$S_1 = -\nabla f(X_1) = (0.369216, -1.476924)^T, \qquad \|S_1\| = \sqrt{0.136321 + 2.181305} = 1.522375,$$

and the line search updates the point to $X_2 = (0.101537, 0.147682)^T$. The third search direction is

$$S_2 = -\nabla f(X_2) = (-0.812296, -0.295364)^T, \qquad \|S_2\| = 0.864329,$$

and the updated point is $X_3 = (-0.009747, 0.107217)^T$. The fourth search direction is

$$S_3 = -\nabla f(X_3) = (0.077976, -0.214434)^T, \qquad \|S_3\| = 0.228171,$$

and the updated point is $X_4 = (0.019126, 0.027816)^T$. The fifth search direction is

$$S_4 = -\nabla f(X_4) = (-0.153008, -0.055632)^T, \qquad \|S_4\| = 0.162807,$$

and the updated point is $X_5 = (-0.001835, 0.020195)^T$. Here

$$\|\nabla f(X_5)\|^2 = 0.001847, \qquad \|\nabla f(X_5)\| \approx 0.043 < \varepsilon,$$

so the iteration stops. The point found at this iteration approximates the minimum point $X^* = (0, 0)^T$, where the minimum value of the objective function is $0$.

This example shows how the final linear convergence of the steepest descent method can become so slow that it makes the method completely useless when we are near the solution. We say that the iteration is caught in Stiefel's cage. (A code sketch reproducing this computation is given below.)
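The following Python sketch reproduces this computation (an illustration, not part of the original note). For a quadratic the exact minimizing step is $\alpha_k = g_k^T g_k / (g_k^T H g_k)$, and this is used here in place of the note's numerical one-dimensional search, so later iterates may differ slightly in the trailing digits from those listed above.

```python
import numpy as np

H = np.array([[8.0, 0.0],
              [0.0, 2.0]])            # Hessian of f(X) = 4*x1^2 + x2^2

def f(x):
    return 4.0 * x[0]**2 + x[1]**2

def grad(x):
    return H @ x                       # gradient (8*x1, 2*x2)

x = np.array([1.0, 1.0])               # starting point X0
eps = 0.1                              # convergence tolerance
k = 0
while np.linalg.norm(grad(x)) > eps:   # Step 2: gradient test
    g = grad(x)
    s = -g                             # Step 3: steepest descent direction
    alpha = (g @ g) / (g @ H @ g)      # Step 4: exact step for a quadratic
    x = x + alpha * s                  # Step 5: update the design
    k += 1
    print(k, x, f(x))
```

The first step gives $\alpha_0 = 68/520 = 0.130769$ and $X_1 = (-0.046152, 0.738462)^T$, exactly as in the example; the zig-zagging of the subsequent iterates towards $(0, 0)^T$ exhibits the slow linear convergence described above.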
The method is useful, however, when we are far from the solution, and it performs a little better if we ensure that the steps taken are small enough. In such a version it is included in several modern hybrid methods, which switch between two methods: one with robust global performance and one with superlinear (or even quadratic) final convergence. Under these circumstances the method of steepest descent does a very good job as the "global part" of the hybrid.

Newton's Method

The basic idea of Newton's method for unconstrained optimization is to iteratively use the quadratic approximation $\varphi(X)$ to the objective function $f(X)$ at the current iterate $X^k$ and to minimize this approximation:

$$f(X) \approx \varphi(X) = f(X^k) + \nabla f(X^k)^T (X - X^k) + \tfrac{1}{2}(X - X^k)^T \nabla^2 f(X^k)(X - X^k),$$

where $H(X^k) = \nabla^2 f(X^k)$ is the Hessian matrix. Minimizing $\varphi(X)$ yields $\nabla \varphi(X) = 0$, i.e.

$$\nabla f(X^k) + H(X^k)(X - X^k) = 0,$$

and, if the Hessian is positive definite,

$$X = X^k - H^{-1}(X^k)\,\nabla f(X^k).$$

If the point $X^k$ is near the minimizer, we therefore take the next iterate as

$$X^{k+1} = X^k - H^{-1}(X^k)\,\nabla f(X^k),$$

i.e. the search direction is $S^k = -H^{-1}(X^k)\nabla f(X^k)$ and the step size is $1$.

For a positive definite quadratic function, Newton's method reaches the minimizer in one iteration. For a general non-quadratic function, however, it is not certain that Newton's method reaches the minimizer in a finite number of iterations. Fortunately, since the objective function is approximately quadratic near the minimizer, Newton's method converges rapidly if the starting point is close to the minimizer; under these conditions local convergence with a quadratic rate can be established.

Newton's method is a local method. When the starting point is far away from the solution, it is not certain that $H(X^k)$ is positive definite and that the Newton direction $S^k$ is a descent direction, and hence convergence is not guaranteed. Since, as we know, the line search is a global strategy, we can employ Newton's method with line search to guarantee global convergence. It should be noted, however, that Newton's method converges with the quadratic rate only when the step size sequence $\alpha_k$ converges to $1$. Newton's iteration with line search is as follows:

$$S^k = -H^{-1}(X^k)\,\nabla f(X^k), \qquad X^{k+1} = X^k + \alpha_k S^k.$$

Advantages and disadvantages of Newton's method for unconstrained optimization problems

Advantages
1. Quadratically convergent from a good starting point if $H(X^k) = \nabla^2 f(X^k)$ is positive definite.
2. Simple and easy to implement.

Disadvantages
1. Not globally convergent for many problems.
2. May converge towards a maximum or saddle point of $f(X)$.
3. The system of linear equations to be solved in each iteration may be ill-conditioned or singular.
4. Requires analytic second-order derivatives of $f(X)$.

Example

We use Newton's method to find the minimizer of

$$\min f(X) = 4x_1^2 + x_2^2,$$

with starting point $X_0 = (1, 1)^T$ and tolerance $\varepsilon = 0.1$. We need the derivatives of first and second order for this function:

$$\nabla f(X_0) = (8, 2)^T, \qquad \nabla^2 f(X_0) = \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}, \qquad [\nabla^2 f(X_0)]^{-1} = \begin{bmatrix} 1/8 & 0 \\ 0 & 1/2 \end{bmatrix},$$

so

$$S_0 = -[\nabla^2 f(X_0)]^{-1}\nabla f(X_0) = (-1, -1)^T, \qquad X_1 = X_0 + S_0 = (1, 1)^T - (1, 1)^T = (0, 0)^T.$$

Since $\|\nabla f(X_1)\| = 0 < 0.1$, we stop with $X^* = X_1$: the quadratic function is minimized in a single Newton step. (A code sketch of this iteration is given below.)
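A minimal Python sketch of the plain Newton iteration above (an illustration; the names are mine). The linear system $H s = -\nabla f$ is solved directly instead of forming the inverse Hessian explicitly.

```python
import numpy as np

def newton(grad, hess, x0, eps=0.1, max_iter=50):
    """Plain Newton iteration with unit step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        s = np.linalg.solve(hess(x), -g)   # Newton direction S^k
        x = x + s                           # step size 1
    return x

# the example: f(X) = 4*x1^2 + x2^2
grad = lambda x: np.array([8.0 * x[0], 2.0 * x[1]])
hess = lambda x: np.array([[8.0, 0.0], [0.0, 2.0]])
print(newton(grad, hess, [1.0, 1.0]))       # -> [0. 0.] in one step
```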
Modified Newton's Method (Damped Newton Methods)

The more efficient modified Newton methods are constructed as either explicit or implicit hybrids between the original Newton method and the method of steepest descent. The idea is that the algorithm should in some way take advantage of the safe, global convergence properties of the steepest descent method whenever Newton's method gets into trouble. On the other hand, the quadratic convergence of Newton's method should be obtained when the iterates get close enough to $X^*$, provided that the Hessian is positive definite.

Modified Newton's Method algorithm

Step 1. Estimate a starting point $X_0$ and set the iteration counter $k = 0$. Select a convergence parameter $\varepsilon > 0$.
Step 2. Calculate the gradient of $f(x)$ at the point $X_k$, $c_k = \nabla f(X_k)$, and $\|c_k\| = \sqrt{c_k^T c_k}$. If $\|c_k\| \le \varepsilon$, stop the iteration process: $X^* = X_k$ is a minimum point. Otherwise, go to Step 3.
Step 3. Take the search direction at the current point $X_k$ as $S_k = -[\nabla^2 f(X_k)]^{-1}\nabla f(X_k)$.
Step 4. Calculate a step size $\alpha_k$ to minimize $f(X_k + \alpha S_k)$; a one-dimensional search is used to determine $\alpha_k$ such that $f(X_k + \alpha_k S_k) = \min_{\alpha \ge 0} f(X_k + \alpha S_k)$.
Step 5. Update the design as $X_{k+1} = X_k + \alpha_k S_k$. Set $k = k + 1$ and go to Step 2.

A code sketch of this damped iteration is given below.
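A minimal Python sketch of the damped Newton iteration (an illustration, with names and tolerances of my choosing). A backtracking (Armijo) search stands in for the exact one-dimensional minimization of Step 4; it is well defined here only when the Hessian is positive definite, so that $S_k$ is a descent direction.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, eps=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:          # Step 2: gradient test
            break
        s = np.linalg.solve(hess(x), -g)      # Step 3: Newton direction
        alpha, rho, c = 1.0, 0.5, 1e-4        # Step 4: backtracking search
        while f(x + alpha * s) > f(x) + c * alpha * (g @ s):
            alpha *= rho
        x = x + alpha * s                      # Step 5: update
    return x
```

Because the search starts from $\alpha = 1$, the full Newton step is accepted near the solution and the quadratic convergence rate is retained.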
Conjugate Gradient Method

We now introduce the conjugate gradient method, which lies between the steepest descent method and Newton's method. The conjugate gradient method deflects the direction of the steepest descent method by adding to it a positive multiple of the direction used in the last step. The method requires only first-order derivatives, but it overcomes the steepest descent method's shortcoming of slow convergence. At the same time, it need not store and compute the second-order derivatives which are needed by Newton's method. In particular, since it does not require the Hessian matrix or an approximation of it, it is widely used to solve large-scale optimization problems.

As a beginning, we first introduce the concept of conjugate directions and the conjugate direction method.

Definition. Let $G$ be an $n \times n$ symmetric positive definite matrix and let $d_1, d_2, \dots, d_m$ be non-zero vectors, $m \le n$. If

$$d_i^T G d_j = 0, \qquad i \ne j,$$

the vectors $d_1, d_2, \dots, d_m$ are called $G$-conjugate, or simply conjugate. Obviously, if the vectors $d_1, \dots, d_m$ are $G$-conjugate, then they are linearly independent. If $G = I$, conjugacy is equivalent to the usual orthogonality.

A general conjugate direction method has the following steps:

Step 1. Given an initial point $X_0$ and $\varepsilon > 0$, set $k = 0$ and compute the first direction $S_0$.
Step 2. A one-dimensional search is used to determine $X_{k+1} = X_k + \alpha_k S_k$, where $\alpha_k$ is computed such that $f(X_k + \alpha_k S_k) = \min_{\alpha \ge 0} f(X_k + \alpha S_k)$.
Step 3. Calculate the gradient at the new point, $c_k = \nabla f(X_{k+1})$, and $\|c_k\| = \sqrt{c_k^T c_k}$. If $\|c_k\| \le \varepsilon$, stop the iteration process. Otherwise, go to Step 4.
Step 4. Compute $S_{k+1}$ by some conjugate direction method, such that

$$S_j^T H S_{k+1} = 0, \qquad j = 0, 1, 2, \dots, k.$$

Step 5. Set $k = k + 1$ and go to Step 2.

Conjugate Gradient Method

In the conjugate direction method there is no explicit procedure for generating a conjugate system of vectors $d_1, d_2, \dots$. In this section we describe a method for generating mutually conjugate direction vectors which is both theoretically appealing and computationally effective: the conjugate gradient method. Among conjugate direction methods the conjugate gradient method is of particular importance, and it is now widely used to solve large-scale optimization problems. The conjugate gradient method was originally proposed by Hestenes and Stiefel in the 1950s to solve linear systems. Since solving a linear system is equivalent to minimizing a positive definite quadratic function, Fletcher and Reeves modified it in the 1960s and developed a conjugate gradient method for unconstrained minimization.

By means of conjugacy, the conjugate gradient method makes the successive steepest descent directions conjugate, and thus increases the efficiency and reliability of the algorithm. The function $f(X)$ is approximated locally by a quadratic form

$$f(X) = \tfrac{1}{2} X^T H X + B^T X + C,$$

where $H$ is a symmetric matrix which is usually required to be positive definite. It is not difficult to see that the gradient at $X_k$ is

$$g_k = H X_k + B,$$

the gradient at $X_{k+1}$ is $g_{k+1} = H X_{k+1} + B$, and for all $X$ the Hessian is $H = \nabla^2 f(X)$.

If $H$ is positive definite, then $f(X)$ has the unique minimizer $X^* = -H^{-1}B$. If $n = 2$, the contours of $f(X)$ are ellipses centered at $X^*$; the shape and orientation of the ellipses are determined by the eigenvalues and eigenvectors of $H$. For $n = 3$ this generalizes to ellipsoids, and in higher dimensions we get $(n-1)$-dimensional hyper-ellipsoids. It is of course possible to define quadratic functions with a non-positive definite Hessian, but then there is no longer a unique minimizer.

Finally, a useful fact is derived in a simple way: multiplication by $H$ maps differences in $X$-values to differences in the corresponding gradients,

$$g_{k+1} - g_k = H(X_{k+1} - X_k) = \alpha_k H S_k,$$

where $X_{k+1} = X_k + \alpha_k S_k$ and $\alpha_k$ is determined by the one-dimensional search. The new direction is taken as the deflected steepest descent direction

$$S_{k+1} = -g_{k+1} + \beta_k S_k.$$

If the search directions $S_k$ and $S_{k+1}$ are to be $H$-conjugate, $S_{k+1}^T H S_k = 0$, then by the fact above

$$S_{k+1}^T (g_{k+1} - g_k) = (-g_{k+1} + \beta_k S_k)^T (g_{k+1} - g_k) = 0.$$

With an exact line search we have $S_k^T g_{k+1} = 0$ (and, for a quadratic, $g_{k+1}^T g_k = 0$), and since directions of this form satisfy $S_k^T g_k = -g_k^T g_k$, solving for $\beta_k$ yields the Fletcher-Reeves formula

$$\beta_k = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}.$$

A code sketch of the resulting method is given below.
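A minimal Python sketch of the Fletcher-Reeves conjugate gradient method (an illustration; the names and the demonstration on the earlier quadratic are mine).

```python
import numpy as np

def fletcher_reeves(grad, x0, line_search, eps=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    s = -g                                  # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        alpha = line_search(x, s)
        x = x + alpha * s
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves formula
        s = -g_new + beta * s               # deflected steepest descent
        g = g_new
    return x

# demonstration on f(X) = 4*x1^2 + x2^2 with an exact line search
H = np.array([[8.0, 0.0], [0.0, 2.0]])
grad = lambda x: H @ x
exact = lambda x, s: -(grad(x) @ s) / (s @ H @ s)
print(fletcher_reeves(grad, [1.0, 1.0], exact))   # -> (0, 0) in two steps
```

On an $n$-dimensional positive definite quadratic, the method with exact line searches terminates in at most $n$ steps; here $n = 2$.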
COORDINATE SEARCH METHOD

Many practical applications require the optimization of functions whose derivatives are not available. Problems of this kind can, in principle, be solved by approximating the gradient (and possibly the Hessian) using finite differences and using these approximate gradients within the algorithms described in earlier parts. Even though this finite-difference approach is effective in some applications, it cannot be regarded as a general-purpose technique for derivative-free optimization, because the number of function evaluations required can be excessive and the approach can be unreliable in the presence of noise. Because of these shortcomings, various algorithms have been developed that do not attempt to approximate the gradient at all; rather, they use the function values at a set of sample points to determine a new iterate by some other means.

Derivative-free optimization (DFO) algorithms differ in the way they use the sampled function values to determine the new iterate. Derivative-free optimization methods are not as well developed as gradient-based methods, and current algorithms are effective only for small problems. Although most DFO methods have been adapted to handle simple types of constraints, such as bounds, the efficient treatment of general constraints is still the subject of investigation. Consequently, we limit our discussion to the unconstrained optimization problem $\min f(X)$.

Problems in which derivatives are not available arise often in practice. The evaluation of $f(X)$ can, for example, be the result of an experimental measurement or a stochastic simulation, with the underlying analytic form of $f(X)$ unknown. Even if the objective function $f(X)$ is known in analytic form, coding its derivatives may be time-consuming or impractical.

The coordinate search method (also known as the coordinate descent method or the alternating variables method) cycles through the $n$ coordinate directions $e_1, e_2, \dots, e_n$, obtaining new iterates by performing a line search along each direction in turn. Specifically, at the first iteration we fix all components of $X$ except the first one, $x_1$, and find a new value of this component that minimizes (or at least reduces) the objective function. At the next iteration we repeat the process with the second component $x_2$, and so on. After $n$ iterations we return to the first variable and repeat the cycle.

Though simple and somewhat intuitive, this method can be quite inefficient in practice, as the figure illustrates for a quadratic function in two variables: after a few iterations, neither the vertical ($x_2$) nor the horizontal ($x_1$) move makes much progress toward the solution. In general, the coordinate search method can iterate infinitely without ever approaching a point where the gradient of the objective function vanishes, even when exact line searches are used. In fact, a cyclic search along any fixed set of linearly independent directions does not guarantee global convergence. Technically speaking, this difficulty arises because the steepest descent direction $-\nabla f(X)$ may become more and more perpendicular to the current coordinate search direction.

When the coordinate search method does converge to a solution, it often converges much more slowly than the steepest descent method, and the difference between the two approaches tends to increase with the number of variables. However, coordinate search may still be useful because it does not require calculation of the gradient $\nabla f(X)$, and the speed of convergence can be quite acceptable if the variables are loosely coupled in the objective function $f(X)$.

Many variants of the coordinate search method have been proposed, some of which allow a global convergence property to be proved. One simple variant is a "back-and-forth" approach in which we search along the sequence of directions $e_1, e_2, \dots, e_{n-1}, e_n, e_{n-1}, \dots, e_2, e_1, e_2, \dots$ (repeating). Another approach, suggested by the figure, is first to perform a sequence of coordinate descent steps and then search along the line joining the first and last points in the cycle. A code sketch of the basic cyclic method is given below.
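A minimal Python sketch of the cyclic coordinate search (an illustration; SciPy's `minimize_scalar` stands in for the one-dimensional search, and the stopping rule based on the progress of a whole cycle is my choice).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_search(f, x0, eps=1e-6, max_cycles=100):
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_cycles):
        x_old = x.copy()
        for i in range(n):                    # line search along e_i
            phi = lambda t: f(np.concatenate((x[:i], [t], x[i+1:])))
            x[i] = minimize_scalar(phi).x
        if np.linalg.norm(x - x_old) <= eps:  # the whole cycle made no progress
            break
    return x

print(coordinate_search(lambda x: 4 * x[0]**2 + x[1]**2, [1.0, 1.0]))
```

On this separable function a single cycle suffices; on a quadratic with coupled variables the same code zig-zags exactly as described above.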
Powell algorithm

The Powell algorithm follows the idea of successive directional minimizations in order to find the solution of the optimization problem. Once the starting point $X_0$ is chosen, there is always the dilemma of how to generate directions for the line search subroutine. Iterating through a set of versors (unit coordinate vectors) is mathematically correct, as they span the optimization domain, but it can turn out to be strikingly ineffective if the objective function forms narrow curving valleys. The improvement is to choose the next direction so that, while optimizing along it, the gradient stays perpendicular to the previous direction; such a pair of directions is called conjugate. Directional optimization along versors versus conjugate vectors is shown in the figure: the first two steps (black arrows) are done in the direction of versors; the third and fourth could also be done along versors (black arrows), but it is much better to perform them using a conjugate vector (the nearby arrows), created using the experience from the last two steps. The steps taken along conjugate vectors lead immediately to the function minimum.

Usually, to construct a new conjugate direction one needs the Hessian matrix (or an approximation of it) of the function being optimized. Powell suggested a routine in which the next direction is built solely from the data of the last $n$ line searches. The routine preserves the algorithm's quadratic convergence rate, but its drawback is that the directions tend to become linearly dependent. This can be avoided in several ways; one of them is to give up the direction set periodically and start over with a set of versors. The algorithm is therefore shaped as follows:

Step 1. Choose the starting point $X_0$ arbitrarily.
Step 2. Create the direction set $R = \{S_1, S_2, \dots, S_n\}$ and initialize its components to the versors of the search space.
Step 3. Perform $n$ line searches: the $k$-th line search starts the optimization from the point $X_{k-1}$, proceeds along $S_k$, and its result is called $X_k$.
Step 4. Remove $S_1$ and shift the remaining directions down; complete the set with $S_n = X_n - X_0$.
Step 5. Perform an additional minimization along the new $S_n$ and call the result $X_0$. Stop if the stopping criterion is satisfied; otherwise return to Step 2.

A code sketch of this routine is given below.
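A minimal Python sketch of the routine (an illustration). Here the direction set is kept across cycles and restarted with versors every $n$ cycles, which is one way to realize the periodic reset mentioned above; the restart period and the use of SciPy's `minimize_scalar` as the line search subroutine are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_min(f, x, s):
    """Minimize f along the ray x + t*s (the line search subroutine)."""
    t = minimize_scalar(lambda t: f(x + t * s)).x
    return x + t * s

def powell(f, x0, eps=1e-8, max_cycles=50):
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    R = list(np.eye(n))                      # Step 2: direction set = versors
    for cycle in range(1, max_cycles + 1):
        x = x0.copy()
        for s in R:                          # Step 3: n line searches
            x = line_min(f, x, s)
        s_new = x - x0                       # Step 4: S_n = X_n - X_0
        if np.linalg.norm(s_new) <= eps:     # the cycle made no progress: stop
            return x
        R = R[1:] + [s_new]                  # drop S_1, shift, append S_n
        x0 = line_min(f, x, s_new)           # Step 5: extra minimization
        if cycle % n == 0:                   # periodic restart with versors
            R = list(np.eye(n))
    return x0

print(powell(lambda x: 4 * x[0]**2 + x[1]**2, [1.0, 1.0]))   # -> (0, 0)
```

On the quadratic $f(X) = 4x_1^2 + x_2^2$ this sketch reaches $(0, 0)^T$ within two cycles, in line with the quadratic convergence of the direction-set idea.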