Optimisation Methods
Minimization or Maximization of Functions (Readings – 10.0 – 10.7 of NRC)

Introduction

You are given a single function f that depends on one or more independent variables. You want to find the value of those variables where f takes on a maximum or a minimum value. An extremum (maximum or minimum point) can be either global (truly the highest or lowest function value) or local (the highest or lowest in a finite neighborhood and not on the boundary of that neighborhood). The unconstrained multi-variable problem is written as

$$\min_{x \in \mathbb{R}^N} f(x)$$

where x is the vector of decision variables.

[Figure: extrema of a function in an interval.] Points A, C, and E are local, but not global, maxima. Points B and F are local, but not global, minima. The global maximum occurs at G, which is on the boundary of the interval, so the derivative of the function need not vanish there. The global minimum is at D. At point E, derivatives higher than the first vanish, a situation which can cause difficulty for some algorithms. The points X, Y, and Z are said to "bracket" the minimum F, since the function value at Y is less than the values at both X and Z.

Contour Plots

A contour plot consists of contour lines, where each contour line joins the points at which the function $f(x_1, x_2)$ takes one specific value.

Solution Methods

The solution methods are classified into 3 broad categories:

1. Direct (zero order) search methods:
   a. Bisection Search
   b. Golden Section Search
   c. Parabolic Interpolation and Brent's Method
   d. Simplex Method
   e. Powell's Method
2. Gradient based (first order) methods:
   a. Steepest descent
   b. Conjugate gradient
3. Second order methods:
   a. Newton
   b. Modified Newton
   c. Quasi-Newton

Direct (zero order) search methods

They require only function values.
They are computationally uncomplicated.
They are slow to converge compared with methods that use derivative information.
How Small is Tolerably Small

One might hope to bracket a minimum at b within an interval as narrow as

$$(1-\epsilon)\,b < b < (1+\epsilon)\,b$$

where ε is the computer's floating-point precision (about 3 × 10^-8 for single and 10^-15 for double precision). But by Taylor's theorem, f(x) near b is

$$f(x) \approx f(b) + \tfrac{1}{2} f''(b)\,(x-b)^2$$

The second term is negligible compared to the first when

$$|x-b| < \sqrt{\epsilon}\;|b|\;\sqrt{\frac{2\,|f(b)|}{b^2 f''(b)}}$$

The last square root is of order unity for most functions, so the smallest useful fractional width is about $\sqrt{\epsilon}$, which is 3 × 10^-4 for single and 10^-8 for double precision.

Bisection Method for Finding Roots of a Function

The bisection method finds roots of functions in one dimension. The root is supposed to have been bracketed in an interval (a,b). Evaluate the function at an intermediate point x and obtain a new, smaller bracketing interval, either (a,x) or (x,b). The process continues until the bracketing interval is acceptably small. It is optimal to choose x to be the midpoint of (a,b), so that the decrease in the interval length is maximized when the function is as uncooperative as it can be, i.e., when the luck of the draw forces you to take the bigger bisected segment.

Golden Section Search – 1D

[Figure: successive bracketing of a minimum.] The minimum is originally bracketed by points 1, 3, 2. The function is evaluated at 4, which replaces 2; then at 5, which replaces 1; then at 6, which replaces 4. The rule at each stage is to keep a center point that is lower than the two outside points. After the steps shown, the minimum is bracketed by points 5, 3, 6.

Golden Section Search – Discussion 1

Let $x_1 < x_2 < x_3$ bracket the minimum, with $a = x_2 - x_1$, $b = x_3 - x_2$, and a new trial point $x_4$ placed in the larger segment so that $c = x_4 - x_2$. The new search interval will be either between $x_1$ and $x_4$, with a length of $a + c$, or between $x_2$ and $x_3$, with a length of $b$. To ensure that $b = a + c$, the algorithm should choose $x_4 = x_1 - x_2 + x_3$.

This leaves the question of where $x_2$ should be placed in relation to $x_1$ and $x_3$. The golden section search chooses the spacing between these points so that they have the same proportion of spacing as the subsequent triple $x_1, x_2, x_4$ or $x_2, x_4, x_3$. By maintaining the same proportion of spacing throughout the algorithm, we avoid a situation in which $x_2$ is very close to $x_1$ or $x_3$, and we guarantee that the interval width shrinks by the same constant proportion in each step.
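Returning briefly to the bisection method for root finding described earlier in this section, the procedure can be sketched in a few lines. This is a minimal illustration rather than a library routine: the function name `bisect_root`, the tolerance `tol` and the iteration cap `max_iter` are assumptions introduced here, and the root is assumed to be bracketed by a sign change.

```python
def bisect_root(f, a, b, tol=1e-10, max_iter=200):
    """Bisection sketch: find a root of f bracketed by a sign change on (a, b)."""
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if fa * fb > 0.0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        x = 0.5 * (a + b)          # midpoint: best choice in the worst case
        fx = f(x)
        if fx == 0.0 or abs(b - a) < tol:
            return x
        if fa * fx < 0.0:          # sign change in (a, x): keep that half
            b, fb = x, fx
        else:                      # otherwise the root is in (x, b)
            a, fa = x, fx
    return 0.5 * (a + b)


# Hypothetical usage: the positive root of x^2 - 2 on (0, 2).
root = bisect_root(lambda x: x * x - 2.0, 0.0, 2.0)
```

Each pass halves the bracket, so the number of function evaluations grows only with the logarithm of the requested accuracy.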
Mathematically, to ensure that the spacing after evaluating $f(x_4)$ is proportional to the spacing prior to that evaluation: if $f(x_4)$ is $f_{4a}$ and our new triplet of points is $x_1$, $x_2$, and $x_4$, then we want

$$\frac{c}{a} = \frac{a}{b}$$

However, if $f(x_4)$ is $f_{4b}$ and our new triplet of points is $x_2$, $x_4$, and $x_3$, then we want

$$\frac{c}{b-c} = \frac{a}{b}$$

Eliminating c from these two simultaneous equations yields

$$\left(\frac{b}{a}\right)^2 = \frac{b}{a} + 1$$

and solving gives $b/a = \varphi$, the golden ratio, where

$$\varphi = \frac{1+\sqrt{5}}{2} \approx 1.618$$

Golden Section Search – Discussion 2

Given a bracketing triplet (a,b,c), suppose b is a fraction w of the way between a and c, and the next trial point x is an additional fraction z beyond b:

$$\frac{b-a}{c-a} = w, \qquad \frac{c-b}{c-a} = 1-w, \qquad \frac{x-b}{c-a} = z$$

The next bracketing segment will either be of length $w + z$ or of length $1 - w$ (relative to the current one). To minimise the worst case possibility these should be equal, giving

$$z = 1 - 2w$$

Scale similarity implies that x should be the same fraction of the way from b to c as b was from a to c, giving

$$\frac{z}{1-w} = w$$

Solving these gives $w^2 - 3w + 1 = 0$, so $w = (3-\sqrt{5})/2 \approx 0.38197$, the golden mean / section.

Golden Section Search – Discussion 3

Parabolic Interpolation

The Golden Section Search is designed to handle the worst possible case of function minimisation, where the function behaves erratically. However, most functions, if they are sufficiently smooth, are nearly parabolic near a minimum. Given three points near a minimum, successively fitting a parabola through these three points should help to get a point closer to the minimum.
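The golden section search derived earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the function is assumed to have a single minimum on the starting interval, and the names `golden_section_min`, `tol` and `max_iter` are hypothetical choices made here, not taken from the readings. The interior points sit at fractions 0.38197 and 0.61803 of the interval, so one of them can be reused at every step.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0   # 1/phi = 0.61803..., so 1 - INV_PHI = 0.38197...


def golden_section_min(f, a, b, tol=1e-8, max_iter=200):
    """Golden section sketch: minimise a unimodal f on the interval [a, b]."""
    x1 = b - INV_PHI * (b - a)   # interior point at fraction 0.382
    x2 = a + INV_PHI * (b - a)   # interior point at fraction 0.618
    f1, f2 = f(x1), f(x2)
    for _ in range(max_iter):
        if abs(b - a) < tol:
            break
        if f1 < f2:
            # Minimum lies in [a, x2]; the old x1 becomes the new x2.
            b, x2, f2 = x2, x1, f1
            x1 = b - INV_PHI * (b - a)
            f1 = f(x1)
        else:
            # Minimum lies in [x1, b]; the old x2 becomes the new x1.
            a, x1, f1 = x1, x2, f2
            x2 = a + INV_PHI * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)


# Hypothetical usage: minimum of (x - 2)^2 on [0, 5].
x_min = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Only one new function evaluation is needed per step, and the bracket shrinks by the constant factor 0.618 each time. Asking for a tolerance much below the square root of the machine precision gains nothing, for the reason given under "How Small is Tolerably Small".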
Parabolic Interpolation and Brent's Method

The formula for the abscissa x at the minimum of a parabola through the three points (a, f(a)), (b, f(b)) and (c, f(c)) is:

$$x = b - \frac{1}{2}\,\frac{(b-a)^2\,[f(b)-f(c)] - (b-c)^2\,[f(b)-f(a)]}{(b-a)\,[f(b)-f(c)] - (b-c)\,[f(b)-f(a)]}$$

The exacting task is to invent a scheme that relies on a sure-but-slow technique, like golden section search, when the function is not cooperative, but that switches over to parabolic interpolation when the function allows. The task is nontrivial for several reasons, including these:

- The housekeeping needed to avoid unnecessary function evaluations in switching between the two methods can be complicated.
- Careful attention must be given to the "endgame," where the function is being evaluated very near to the round-off limit.
- The scheme for detecting a cooperative versus noncooperative function must be very robust.

Brent's Method

Keeps track of 6 function points:

- a and b bracket the minimum
- x is the point with the least function value found so far
- w is the point with the second least function value
- v is the previous value of w
- u is the point at which the function was most recently evaluated

Parabolic interpolation is attempted, fitting through x, v and w. To be acceptable, the parabolic step must fall between a and b, and imply a movement from x that is less than half the movement of the step before last. Where a parabolic step is not acceptable, Brent's Method takes a golden section step instead, so in practice it alternates between parabolic steps and golden sections.

Brent's Method with First Derivatives

First derivatives can be used within Brent's Method as follows:

- The sign of the derivative at the central point of the bracketing triplet (a,b,c) indicates uniquely whether the next test point should be taken in the interval (a,b) or in the interval (b,c).
- The value of this derivative and of the derivative at the second-best-so-far point are extrapolated to zero by the secant method (inverse linear interpolation).
- We impose the same sort of restrictions on this new trial point as in Brent's method.
- If the trial point must be rejected, we bisect the interval under scrutiny.
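As a small illustration of the parabolic interpolation step used inside Brent's method, the sketch below evaluates the vertex of the parabola through three points. The function name `parabolic_step` and the choice to return `None` for collinear points are assumptions made for this example. Note that the formula locates an extremum of the parabola, which may be a maximum, which is one reason Brent's method checks each step before accepting it.

```python
def parabolic_step(a, fa, b, fb, c, fc):
    """Abscissa of the vertex of the parabola through (a, fa), (b, fb), (c, fc).

    Returns None when the three points are collinear (zero denominator).
    """
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if den == 0.0:
        return None
    return b - 0.5 * num / den


# Hypothetical usage: f(x) = (x - 3)^2 + 1 sampled at x = 0, 1, 5.
f = lambda x: (x - 3.0) ** 2 + 1.0
x_new = parabolic_step(0.0, f(0.0), 1.0, f(1.0), 5.0, f(5.0))
```

Because the sampled function is itself a parabola, a single step lands on its minimum at x = 3; for a general smooth function the step is repeated with the updated points.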
Downhill Simplex Method in Multi-Dimensions

Bisection methods only work in one dimension. The downhill simplex method handles multidimensional problems and is due to Nelder and Mead. The method requires only function evaluations, not derivatives.

A simplex is the geometrical figure consisting, in N dimensions, of N + 1 points (or vertices) and all their interconnecting line segments, polygonal faces, etc. In two dimensions, a simplex is a triangle. In three dimensions it is a tetrahedron, not necessarily the regular tetrahedron.

After initialisation, the downhill simplex method takes a series of steps, most steps just moving the point of the simplex where the function is largest through the opposite face of the simplex to a lower point. These steps are called reflections, and they are constructed to conserve the volume of the simplex (hence maintain its non-degeneracy). When it can do so, the method expands the simplex in one or another direction to take larger steps. When it reaches a "valley floor," the method contracts itself in the transverse direction and tries to ooze down the valley. If the simplex is trying to "pass through the eye of a needle," it contracts itself in all directions, pulling itself in around its lowest (best) point.

Let $x_i$ be the location of the ith vertex, ordered so that $f(x_1) > f(x_2) > \dots > f(x_{D+1})$, where D is the number of dimensions (N above) and $x_1$ is the worst vertex, the one we are trying to improve. The center of the face of the simplex defined by all the other vertices is

$$x_{\text{mean}} = \frac{1}{D} \sum_{i=2}^{D+1} x_i$$

Since all of the others have a better function value, they give a good direction to move in; reflection:

$$x_1 \to x_1^{\text{new}} = x_{\text{mean}} + (x_{\text{mean}} - x_1) = 2x_{\text{mean}} - x_1$$

If the new position is better, it is worth checking to see if it is even better to double the size of the step; expansion:

$$x_1 \to x_1^{\text{new}} = x_{\text{mean}} + 2(x_{\text{mean}} - x_1) = 3x_{\text{mean}} - 2x_1$$

If the new position is worse, it means we overshot.
Then, reflect and shrink:

$$x_1 \to x_1^{\text{new}} = x_{\text{mean}} + \tfrac{1}{2}(x_{\text{mean}} - x_1) = \tfrac{3}{2}x_{\text{mean}} - \tfrac{1}{2}x_1$$

If after reflecting and shrinking the new position is still worse, we can try just shrinking:

$$x_1 \to x_1^{\text{new}} = x_{\text{mean}} - \tfrac{1}{2}(x_{\text{mean}} - x_1) = \tfrac{1}{2}x_{\text{mean}} + \tfrac{1}{2}x_1$$

If after shrinking the new position is still worse, give up and shrink all of the vertices towards the best one:

$$x_i \to x_i^{\text{new}} = x_i - \tfrac{1}{2}(x_i - x_{D+1}) = \tfrac{1}{2}(x_i + x_{D+1})$$

When the simplex reaches a minimum it will give up and shrink down around it, triggering a stopping decision when the values are no longer improving.

Solve

$$\min f(x) = 2x_1^3 + 4x_1 x_2^2 - 10x_1 x_2 + x_2^2$$

by applying 5 iterations of the simplex method, starting with $x_0 = [5, 2]^T$.

[Figures: simplex vertices plotted in the ($x_1$, $x_2$) plane for each iteration.]

Start: (5, 2), f = 234
Iteration 1: (5.51, 4.63), f = 576.31; (6.8, 4.12), f = 851.91; (5, 2), f = 234
Iteration 2: (5.51, 4.63), f = 576.31; (3.63, 2.51), f = 102.88; (5, 2), f = 234
Iteration 3: (3.63, 2.517), f = 102.88; (5, 2), f = 234; (3.12, -0.1204), f = 64.71
Iteration 4: (3.638, 2.517), f = 102.88; (1.75, 0.397), f = 5.15; (3.12, -0.1204), f = 64.71
Iteration 5: (1.758, 0.3972), f = 5.15 (the solution); (3.12, -0.12), f = 64.71; (1.24, -2.24), f = 61.877

Rosenbrock's "banana" function:

$$F = 100(x_2 - x_1^2)^2 + (x_1 - 1)^2$$

Direction Set Methods

General Scheme

Initial Step
- set k = 0
- supply an initial guess, $x_k$, within any specified constraints

Iterative Step
- calculate a search direction $p_k$
- determine an appropriate step length $\lambda_k$
- set $x_{k+1}$ to $x_k + \lambda_k p_k$

Stopping Criteria
- if the convergence criteria are reached, the optimum vector is $x_{k+1}$; stop
- else set k = k + 1 and repeat the Iterative Step

Direction Set (Powell's) Method

Sometimes it is not possible to estimate the derivative $\nabla f$ to obtain the direction in a steepest descent method. A first guess at a derivative-free scheme: minimize along one coordinate axis, then along the next, and so on, and repeat. This can be very slow to converge.

Conjugate directions: directions which are independent of each other, so that minimizing along each one does not move away from the minimum in the other directions. Powell introduced a method to obtain conjugate directions without computing the derivative.

If f is minimised along u, then $\nabla f$ must be perpendicular to u at the minimum. The function may be expanded using the Taylor series around the origin p as:

$$f(x) = f(p) + \sum_i \frac{\partial f}{\partial x_i} x_i + \frac{1}{2} \sum_{i,j} \frac{\partial^2 f}{\partial x_i \partial x_j} x_i x_j + \dots \approx c - b \cdot x + \frac{1}{2}\, x \cdot H \cdot x$$

where $c = f(p)$, $b = -\nabla f|_p$ and H is the Hessian matrix at p. Taking the gradient of the Taylor expansion:

$$\nabla f = H \cdot x - b$$

The change in gradient when moving in one direction is:

$$\delta(\nabla f) = H \cdot (\delta x)$$

After f is minimised along u, the algorithm proposes a new direction v so that minimisation along v does not spoil the minimum along u. For this to be true, the function gradient must stay perpendicular to u:

$$0 = u \cdot \delta(\nabla f) = u \cdot H \cdot v$$

When this is true, u and v are said to be conjugate, and we get quadratic convergence to the minimum.

1. Initialise the set of directions $u_i$ to the basis vectors
2. Repeat until the function stops decreasing:
   1. Save the starting position as $P_0$
   2. For i = 0..N-1, move $P_i$ to the minimum along direction $u_i$ and call this point $P_{i+1}$
   3. For i = 0..N-2, set $u_i = u_{i+1}$
   4. Set $u_{N-1} = P_N - P_0$
   5. Move $P_N$ to the minimum along direction $u_{N-1}$ and call this point $P_0$

Powell showed that, for a quadratic form, k iterations of the above procedure produce a set of directions $u_i$ whose last k members are mutually conjugate. Therefore, N iterations, involving N(N+1) line minimisations, will exactly minimise a quadratic form.
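The basic Powell procedure listed above can be sketched as follows. This is an illustrative sketch, not a production implementation: the line minimisation is done by a golden section search over a fixed step range `[-10, 10]` along each direction, which is an assumption made here for simplicity (a real code would bracket the minimum first), and the names `line_min`, `powell` and `quad` are hypothetical.

```python
import math


def line_min(f, p, u, lo=-10.0, hi=10.0, tol=1e-10):
    """Golden section search for t in [lo, hi] minimising f(p + t*u); returns the new point."""
    g = lambda t: f([pi + t * ui for pi, ui in zip(p, u)])
    r = (math.sqrt(5.0) - 1.0) / 2.0
    t1, t2 = hi - r * (hi - lo), lo + r * (hi - lo)
    g1, g2 = g(t1), g(t2)
    while hi - lo > tol:
        if g1 < g2:
            hi, t2, g2 = t2, t1, g1
            t1 = hi - r * (hi - lo)
            g1 = g(t1)
        else:
            lo, t1, g1 = t1, t2, g2
            t2 = lo + r * (hi - lo)
            g2 = g(t2)
    t = 0.5 * (lo + hi)
    return [pi + t * ui for pi, ui in zip(p, u)]


def powell(f, p0, n_outer=None):
    """Basic Powell direction set sketch: N outer iterations by default."""
    n = len(p0)
    dirs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    p = list(p0)
    for _ in range(n_outer or n):
        start = list(p)                        # step 1: save P0
        for u in dirs:                         # step 2: minimise along each u_i
            p = line_min(f, p, u)
        new_dir = [pi - si for pi, si in zip(p, start)]
        if all(abs(d) < 1e-12 for d in new_dir):
            break                              # no progress: stop
        dirs = dirs[1:] + [new_dir]            # steps 3-4: drop u_0, append P_N - P_0
        p = line_min(f, p, new_dir)            # step 5: minimise along the new direction
    return p


def quad(x):
    """Hypothetical test function: a quadratic form with its minimum at (1, -2)."""
    u, v = x[0] - 1.0, x[1] + 2.0
    return u * u + u * v + 2.0 * v * v


p_min = powell(quad, [4.0, 1.0])
```

For this two-variable quadratic form, two outer iterations bring the point to the minimum up to the accuracy of the line searches. One known weakness of the basic procedure is that the direction set can become linearly dependent (for example when a line minimisation makes no progress), which is why practical versions add a rule for resetting or selectively replacing directions.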
Gradient Based Methods

They employ the gradient information. They are iterative methods and employ the iteration procedure

$$x^{(k+1)} = x^{(k)} + \alpha^{(k)} s(x^{(k)})$$

where $\alpha^{(k)}$ is the step size and $s(x^{(k)})$ is the search direction. The methods differ in how $s(x^{(k)})$ is computed.

Steepest Descent Method

Let $x^{(k)}$ be the current point. The first-order Taylor expansion of the objective function about $x^{(k)}$ is:

$$f(x^{(k)} + \alpha^{(k)} s^{(k)}) \approx f(x^{(k)}) + \nabla f(x^{(k)})^T (\alpha^{(k)} s^{(k)})$$

We need the next point to have a lower objective function value than the current point:

$$f(x^{(k)} + \alpha^{(k)} s^{(k)}) - f(x^{(k)}) \approx \nabla f(x^{(k)})^T (\alpha^{(k)} s^{(k)}) < 0$$

Since the step size is positive, that is equivalent to

$$\nabla f(x^{(k)})^T s^{(k)} < 0$$

For a direction of given length, the smallest value of this product is obtained when

$$s^{(k)} = -\nabla f(x^{(k)})$$

We call this direction the steepest descent direction.

Another way to see this is to recognize that the gradient always points towards increasing values of the objective function. Taking the negative of the gradient, then, leads towards decreasing values of the objective function.

Now that the direction is determined, a single variable search is needed to determine the value of the step size. In every iteration of the method, the direction and step size are computed.

Steepest Descent Method Notes

The good thing about the steepest descent method is that it is robust: with a proper line search every iteration decreases the objective, so for well-behaved functions it converges (to a local minimum or other stationary point). What is bad about it is that it converges more slowly as the minimum is approached.

The gradient is perpendicular to the tangent of the contour line of the function at a particular point.

[Figure: the gradient $\nabla f(x^{(k)})$ drawn perpendicular to a contour line.]

The steepest descent method zigzags its way towards the optimum point. This is because, with an exact line search, each direction is orthogonal to the previous direction.

[Figure: zigzag path through the points $x^{(1)}$, $x^{(2)}$, $x^{(3)}$ towards the optimum $x^*$.]

Conjugate Gradient Method Review

Two vectors u and v are said to be conjugate with respect to a matrix C if $u^T C v = 0$.
For example, let

$$u = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad v = \begin{bmatrix} -1/2 \\ 1 \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} 8 & 4 \\ 4 & 6 \end{bmatrix}$$

Then $u^T C v = 8(-1/2) + 4(1) = 0$. The two vectors are C-conjugate.

A set of vectors in which every pair is conjugate is called a conjugate set. The (mutually orthogonal) eigenvectors of a symmetric matrix are conjugate with respect to it.

Conjugate Gradient Method

Conjugacy is not transitive in general: if u and v are conjugate and v and w are conjugate, it does not follow that u and w are conjugate, so every pair in a conjugate set must be checked or constructed to be conjugate. Conjugate directions, which are vectors, are used to find the minimum of a function. The minimum of a quadratic function of N variables (with positive definite C) can be found after at most N exact line searches along conjugate directions.

The question now is, how can we conveniently generate conjugate directions? For a quadratic function f(x), the gradient is given by

$$\nabla f(x) = Cx + b = g(x)$$

Take two points $x^{(0)}$ and $x^{(1)}$; the change in the gradient is given by

$$\Delta g(x) = g(x^{(1)}) - g(x^{(0)}) = C(x^{(1)} - x^{(0)}) = C\,\Delta x$$

The iteration procedure we will apply is

$$x^{(k+1)} = x^{(k)} + \alpha^{(k)} s(x^{(k)})$$

The search directions are calculated as

$$s^{(k)} = -g^{(k)} + \gamma^{(k-1)} s^{(k-1)}, \quad \text{for } k = 1, 2, \dots, N-1$$

with $s^{(0)} = -g^{(0)}$.

If the steepest descent direction is used with an exact line search, we know that

$$g^{(k+1)T} g^{(k)} = 0$$

We want to choose $\gamma^{(k-1)}$ such that $s^{(k)}$ is C-conjugate to $s^{(k-1)}$. Take the first direction:

$$s^{(1)} = -g^{(1)} + \gamma^{(0)} s^{(0)} = -g^{(1)} - \gamma^{(0)} g^{(0)}$$

We require $s^{(0)}$ and $s^{(1)}$ to be C-conjugate:

$$s^{(1)T} C s^{(0)} = 0 \quad \Rightarrow \quad [g^{(1)} + \gamma^{(0)} g^{(0)}]^T C s^{(0)} = 0$$

We know that

$$s^{(0)} = \frac{\Delta x}{\alpha^{(0)}}$$

Therefore,

$$[g^{(1)} + \gamma^{(0)} g^{(0)}]^T C\, \frac{\Delta x}{\alpha^{(0)}} = 0$$

From the quadratic property $C\,\Delta x = \Delta g$,

$$[g^{(1)} + \gamma^{(0)} g^{(0)}]^T \Delta g = 0$$

After expansion with $\Delta g = g^{(1)} - g^{(0)}$,

$$g^{(1)T} g^{(1)} + \gamma^{(0)} g^{(0)T} g^{(1)} - g^{(1)T} g^{(0)} - \gamma^{(0)} g^{(0)T} g^{(0)} = 0$$

Since $g^{(1)T} g^{(0)} = 0$, this gives

$$\gamma^{(0)} = \frac{\|g^{(1)}\|^2}{\|g^{(0)}\|^2}$$

Therefore, the general iteration is given by

$$s^{(k)} = -\nabla f^{(k)} + \frac{\|\nabla f^{(k)}\|^2}{\|\nabla f^{(k-1)}\|^2}\, s^{(k-1)}$$

for k = 1, …, N-1. If the function is not quadratic, more iterations may be required.

The steepest descent direction is deflected so the minimum is reached directly.
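The conjugate gradient iteration derived above can be sketched for a quadratic function whose gradient is Cx + b. This is a minimal sketch under stated assumptions: C is taken to be symmetric positive definite, the step size along each direction is the exact line minimiser for a quadratic, and the helper names `matvec`, `dot` and `conjugate_gradient`, together with the vector `b_ex`, are hypothetical choices made for this example.

```python
def matvec(C, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(cij * xj for cij, xj in zip(row, x)) for row in C]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(C, b, x0):
    """Minimise f(x) = 0.5 x^T C x + b^T x, whose gradient is g(x) = C x + b."""
    x = list(x0)
    g = [ci + bi for ci, bi in zip(matvec(C, x), b)]
    s = [-gi for gi in g]                      # s(0) = -g(0)
    for _ in range(len(x)):                    # at most N line searches
        gg = dot(g, g)
        if gg < 1e-30:
            break
        Cs = matvec(C, s)
        alpha = -dot(g, s) / dot(s, Cs)        # exact minimiser along s for a quadratic
        x = [xi + alpha * si for xi, si in zip(x, s)]
        g_new = [gi + alpha * ci for gi, ci in zip(g, Cs)]   # g changes by C * (alpha s)
        gamma = dot(g_new, g_new) / gg         # ratio of squared gradient norms
        s = [-gn + gamma * si for gn, si in zip(g_new, s)]
        g = g_new
    return x


# Hypothetical usage with the matrix C from the conjugacy example; b_ex is made up.
C_ex = [[8.0, 4.0], [4.0, 6.0]]
b_ex = [-8.0, -2.0]
x_cg = conjugate_gradient(C_ex, b_ex, [0.0, 0.0])
```

With two variables the loop runs twice, and the result satisfies C x + b = 0 up to round-off, in line with the statement that a quadratic in N variables needs at most N conjugate line searches.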
[Figure: conjugate gradient path from $x^{(0)}$ along $s^{(0)}$ to $x^{(1)}$, then along $s^{(1)}$, deflected from $-\nabla f(x^{(1)})$, to $x^{(2)}$.]

Newton's Method

It is a second order method. Let $x^{(k)}$ be the current point. The Taylor expansion of the objective function about $x^{(k)}$, with $\Delta x = x - x^{(k)}$, is:

$$f(x) = f(x^{(k)}) + \nabla f(x^{(k)})^T \Delta x + \tfrac{1}{2} \Delta x^T \nabla^2 f(x^{(k)}) \Delta x + O(\Delta x^3)$$

The quadratic approximation of f(x) is

$$\tilde{f}(x) = f(x^{(k)}) + \nabla f(x^{(k)})^T \Delta x + \tfrac{1}{2} \Delta x^T \nabla^2 f(x^{(k)}) \Delta x$$

We need to find the critical point of the approximation:

$$\nabla f(x^{(k)}) + \nabla^2 f(x^{(k)}) \Delta x = 0 \quad \Rightarrow \quad \Delta x = -\left[\nabla^2 f(x^{(k)})\right]^{-1} \nabla f(x^{(k)})$$

The Newton optimization method is

$$x^{(k+1)} = x^{(k)} - \left[\nabla^2 f(x^{(k)})\right]^{-1} \nabla f(x^{(k)})$$

If the function f(x) is quadratic (with a positive definite Hessian), the solution can be found in exactly one step.

Modified Newton's Method

The Newton method can be unreliable for non-quadratic functions. The Newton step will often be large when $x^{(0)}$ is far from $x^*$. To solve this problem, we add a step length:

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \left[\nabla^2 f(x^{(k)})\right]^{-1} \nabla f(x^{(k)})$$

Quasi-Newton Method

Quasi-Newton methods use a Hessian-like matrix but without calculating second-order derivatives. Sometimes, these methods are referred to as variable metric methods because the matrix A changes at each iteration. Take the general formula:

$$x^{(k+1)} = x^{(k)} - A^{(k)} \nabla f(x^{(k)})$$

When $A^{(k)} = I$ (identity matrix), the formula becomes the formula of the steepest descent method. When $A^{(k)} = [\nabla^2 f(x^{(k)})]^{-1}$, the formula becomes the formula of the Newton method.

Quasi-Newton methods are based primarily upon properties of quadratic functions, and they are designed to mimic the Newton method using only first-order information.

Starting from a positive definite matrix, the quasi-Newton methods gradually build up an approximation to the inverse Hessian matrix by using gradient information from the previous iterations. The matrix A is kept positive definite; hence the direction $s^{(k)} = -A^{(k)} \nabla f(x^{(k)})$ remains a descent direction.
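As a brief aside, the two special cases of the general formula (A equal to the identity and A equal to the inverse Hessian) can be compared numerically on a quadratic. The sketch below is illustrative only: the quadratic, the fixed step size of 0.1 used for the steepest descent case (the text determines the step by a line search instead), and the names `grad`, `iterate`, `IDENTITY` and `H_INV` are all assumptions made for this example.

```python
# Quadratic f(x) = 0.5 x^T C x + b^T x, with gradient C x + b and constant Hessian C.
C = [[8.0, 4.0], [4.0, 6.0]]
b = [-8.0, -2.0]


def grad(x):
    """Gradient C x + b of the example quadratic."""
    return [sum(cij * xj for cij, xj in zip(row, x)) + bi for row, bi in zip(C, b)]


def iterate(x, A, alpha, n_steps):
    """Apply x <- x - alpha * A * grad(x) for n_steps iterations."""
    for _ in range(n_steps):
        g = grad(x)
        step = [sum(aij * gj for aij, gj in zip(row, g)) for row in A]
        x = [xi - alpha * si for xi, si in zip(x, step)]
    return x


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
# Inverse of C, computed by hand: det(C) = 32.
H_INV = [[6.0 / 32.0, -4.0 / 32.0], [-4.0 / 32.0, 8.0 / 32.0]]

x_newton = iterate([0.0, 0.0], H_INV, 1.0, 1)       # A = inverse Hessian: Newton step
x_sd_1 = iterate([0.0, 0.0], IDENTITY, 0.1, 1)      # A = I: one steepest descent step
x_sd_100 = iterate([0.0, 0.0], IDENTITY, 0.1, 100)  # A = I: many small steps
```

The Newton choice reaches the minimum of the quadratic in a single step, while the identity choice with a small fixed step needs many iterations to get close. Quasi-Newton methods aim for behaviour between these two without ever forming second derivatives.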
There are several ways to update the matrix A, one of which is

$$A^{(k+1)} = A^{(k)} + \frac{\delta^{(k)} \delta^{(k)T}}{\delta^{(k)T} \gamma^{(k)}} - \frac{A^{(k)} \gamma^{(k)} \gamma^{(k)T} A^{(k)}}{\gamma^{(k)T} A^{(k)} \gamma^{(k)}}$$

where $A^{(0)} = I$, $\delta^{(k)} = x^{(k+1)} - x^{(k)}$ and $\gamma^{(k)} = \nabla f(x^{(k+1)}) - \nabla f(x^{(k)})$.

Quasi-Newton Methods (DFP)

The DFP formula used to update the matrix A is

$$A^{(k)} = A^{(k-1)} + \frac{\Delta x^{(k-1)} \Delta x^{(k-1)T}}{\Delta x^{(k-1)T} \Delta g^{(k-1)}} - \frac{A^{(k-1)} \Delta g^{(k-1)} \Delta g^{(k-1)T} A^{(k-1)}}{\Delta g^{(k-1)T} A^{(k-1)} \Delta g^{(k-1)}}$$

where $A^{(0)} = I$, $\Delta x^{(k)} = x^{(k+1)} - x^{(k)}$ and $\Delta g^{(k)} = g(x^{(k+1)}) - g(x^{(k)}) = \nabla f(x^{(k+1)}) - \nabla f(x^{(k)})$. This is the same update as the one above, written with $\Delta x$ for $\delta$ and $\Delta g$ for $\gamma$.

The DFP formula preserves symmetry and positive definiteness (provided $\Delta x^T \Delta g > 0$, which an accurate line search ensures), so that the sequence $A^{(1)}, A^{(2)}, \dots$ will also be symmetric and positive definite.

Lagrange Multipliers - Introduction

Checking for Maximality : Closed Interval Method

Checking for Maximality : First Derivative Test

Checking for Maximality : Second Derivative Test

Lagrange Multipliers - Method

Lagrange Multipliers - Examples

The Kuhn-Tucker Conditions

The Kuhn-Tucker conditions are used to solve NLPs of the following type:

$$\begin{aligned} \max\ (\text{or } \min)\quad & f(x_1, x_2, \dots, x_n) \\ \text{s.t.}\quad & g_1(x_1, x_2, \dots, x_n) \le b_1 \\ & g_2(x_1, x_2, \dots, x_n) \le b_2 \\ & \qquad \vdots \\ & g_m(x_1, x_2, \dots, x_n) \le b_m \end{aligned}$$

The Kuhn-Tucker conditions are necessary for a point $\bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n)$ to solve the NLP.

Suppose the NLP is a maximization problem. If $\bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n)$ is an optimal solution to the NLP, then $\bar{x}$ must satisfy the m constraints in the NLP, and there must exist multipliers $\bar{\lambda}_1, \bar{\lambda}_2, \dots, \bar{\lambda}_m$ satisfying

$$\frac{\partial f(\bar{x})}{\partial x_j} - \sum_{i=1}^{m} \bar{\lambda}_i \frac{\partial g_i(\bar{x})}{\partial x_j} = 0 \qquad (j = 1, 2, \dots, n)$$

$$\bar{\lambda}_i \,[\,b_i - g_i(\bar{x})\,] = 0 \qquad (i = 1, 2, \dots, m)$$

$$\bar{\lambda}_i \ge 0 \qquad (i = 1, 2, \dots, m)$$

Suppose the NLP is a minimization problem. If $\bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n)$ is an optimal solution to the NLP, then $\bar{x}$ must satisfy the m constraints in the NLP, and there must exist multipliers $\bar{\lambda}_1, \bar{\lambda}_2, \dots, \bar{\lambda}_m$ satisfying

$$\frac{\partial f(\bar{x})}{\partial x_j} + \sum_{i=1}^{m} \bar{\lambda}_i \frac{\partial g_i(\bar{x})}{\partial x_j} = 0 \qquad (j = 1, 2, \dots, n)$$

$$\bar{\lambda}_i \,[\,b_i - g_i(\bar{x})\,] = 0 \qquad (i = 1, 2, \dots, m)$$

$$\bar{\lambda}_i \ge 0 \qquad (i = 1, 2, \dots, m)$$

Unless a constraint qualification or regularity condition is satisfied at an optimal point $\bar{x}$, the Kuhn-Tucker conditions may fail to hold at $\bar{x}$.

LINGO can be used to solve NLPs with inequality (and possibly equality) constraints. If LINGO displays the message DUAL CONDITIONS:SATISFIED then you know it has found a point satisfying the Kuhn-Tucker conditions.
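The Kuhn-Tucker conditions for a maximization problem can be checked numerically at a candidate point. The sketch below is a hedged illustration: the checker `kkt_max_satisfied`, its argument layout, the tolerance `tol` and the small example problem (maximize -(x1 - 2)^2 - (x2 - 1)^2 subject to x1 + x2 <= 2) are all assumptions introduced here, and the gradients are supplied by hand rather than computed automatically.

```python
def kkt_max_satisfied(grad_f, gs, grad_gs, bs, x, lam, tol=1e-8):
    """Check the Kuhn-Tucker conditions of a maximization NLP with constraints g_i(x) <= b_i."""
    # Primal feasibility: every constraint holds at x.
    feasible = all(g(x) <= b + tol for g, b in zip(gs, bs))
    # Stationarity: df/dx_j - sum_i lam_i * dg_i/dx_j = 0 for every j.
    gf = grad_f(x)
    ggs = [gg(x) for gg in grad_gs]
    stationary = all(
        abs(gf[j] - sum(l * gg[j] for l, gg in zip(lam, ggs))) <= tol
        for j in range(len(x))
    )
    # Complementary slackness: lam_i * (b_i - g_i(x)) = 0.
    complementary = all(abs(l * (b - g(x))) <= tol for l, g, b in zip(lam, gs, bs))
    # Sign condition: lam_i >= 0.
    nonnegative = all(l >= -tol for l in lam)
    return feasible and stationary and complementary and nonnegative


# Hypothetical example: max -(x1 - 2)^2 - (x2 - 1)^2  s.t.  x1 + x2 <= 2.
grad_f = lambda x: [-2.0 * (x[0] - 2.0), -2.0 * (x[1] - 1.0)]
gs = [lambda x: x[0] + x[1]]
grad_gs = [lambda x: [1.0, 1.0]]
bs = [2.0]

at_optimum = kkt_max_satisfied(grad_f, gs, grad_gs, bs, [1.5, 0.5], [1.0])
at_infeasible = kkt_max_satisfied(grad_f, gs, grad_gs, bs, [2.0, 1.0], [0.0])
at_interior = kkt_max_satisfied(grad_f, gs, grad_gs, bs, [1.0, 1.0], [0.0])
```

The point (1.5, 0.5) with multiplier 1 satisfies all the conditions. The unconstrained maximizer (2, 1) violates the constraint, and the feasible point (1, 1) fails stationarity, so both are rejected. Passing the check shows only that the necessary conditions hold, not that the point is optimal.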