Engineering System Design Optimization: Gradient-Based Methods
M S Prasad, AISST: Amity University
This lecture note is based on textbooks and open literature. It is suitable for graduate/postgraduate students of Aerospace & Avionics, and is to be read in conjunction with classroom discussions.

Gradient-Based Optimization: LN - 4

Most local optimization algorithms are gradient-based. As the name indicates, gradient-based optimization techniques make use of gradient information to find the optimum solution; they are sometimes also referred to as numerical techniques. Gradient-based algorithms are widely used for solving a variety of optimization problems in engineering. These techniques are popular because they are efficient (in terms of the number of function evaluations required to find the optimum), they can solve problems with large numbers of design variables, and they typically require little problem-specific parameter tuning.

Gradient-based algorithms typically use a two-step process to reach the optimum. The first step uses gradient information to find a desirable search direction S in which to move. The second step moves in this direction until no more progress can be made; this step is known as the one-dimensional or line search, and it provides the optimum step size α, a positive scalar. There are also gradient-based algorithms that do not rely on a one-dimensional search. For most optimization problems the gradient information is not readily available and is obtained using finite difference calculations, which provide a flexible means of estimating the gradients. The various gradient-based algorithms differ mostly in the logic used to determine the search direction. For the one-dimensional search there are many algorithms that will find the best step size, and generally any of these techniques can be combined with a particular gradient-based algorithm to perform the required one-dimensional search.
Some of the popular one-dimensional search algorithms include the Golden Section search, the Fibonacci search, and many variations of polynomial approximations (refer Line Search, LN - 2). The iteration toward the optimum value follows the scheme

X(k+1) = X(k) + ΔX(k), where ΔX(k) = αk Sk

Here Sk is the desirable search direction in the design space and αk is the step size in that direction. For this reason the gradient-based methods are also known as "search techniques" or "direct techniques".

Summary: The basic idea of numerical methods for nonlinear optimization problems is to start with a reasonable estimate for the optimum design. Cost and constraint functions and their derivatives are evaluated at that point. Based on them, the design is moved to a new point. The process is continued until either optimality conditions or some other stopping criteria are met.

ESDO LN -4 # M S Prasad Page 1

This iterative process represents an organized search through the design space for points that represent local minima. Thus, the procedures are often called search techniques or direct methods of optimization.

Algorithm
Step 1: Estimate a starting value of the variable, X0, and set the iteration counter k = 0.
Step 2: Compute a search direction Sk in the design space.
Step 3: Calculate the step size αk in the direction Sk.
Step 4: Calculate X(k+1) = Xk + αk Sk.
Step 5: Check f(X(k+1)) < f(Xk), i.e. f(Xk + αk Sk) < f(Xk), to accept the next iterate.
Step 6: Continue until convergence.

How do we find the direction? Let us expand the function f(Xk + αk Sk) by Taylor series:

f(Xk + αk Sk) = f(X(k+1)) = f(Xk) + ∇f(Xk)ᵀ ΔX + (1/2) ΔXᵀ H(Xk) ΔX, where ΔX = αk Sk

(we can also retain only the first-order term of the Taylor series). This function needs to be minimized to find the optimum step, i.e. its derivative with respect to α should be set to zero:

d f(Xk + αk Sk)/dα = ∇f(Xk)ᵀ Sk + α Skᵀ H(Xk) Sk = 0

α = − ∇f(Xk)ᵀ Sk / ( Skᵀ H(Xk) Sk )

For the function value to decrease we should have CK · SK < 0.
Here CK = ∇f(XK) is the gradient of the cost function. A direction SK satisfying this condition is known as a direction of descent.

Steepest Descent Search Method
We know that the gradient points in the direction of the maximum rate of increase of the function f(X) at the design point X*. Hence in the steepest descent algorithm we select the direction as the negative of the gradient, i.e. opposite to it.

Algorithm
Step 1: Select a starting point X0.
Step 2: Calculate the gradient Ck (C0 at the start).
Step 3: Check convergence: if ||Ck|| < ε, stop; x* = Xk is the minimum point.
Step 4: Set Sk = − Ck.
Step 5: Calculate αk to minimize the function f(Xk + αk Sk). #
Step 6: Calculate X(k+1) = Xk + αk Sk; iterate until the convergence criterion or the minimum value is reached.
# Setting the derivative of f(Xk + α Sk) with respect to α to zero gives an equation in α; the minimizing α can also be found by any line search algorithm.

Problems with the Steepest Descent Method
1. A large number of iterations may be required for the minimization of even positive definite quadratic forms, i.e., the method can be quite slow to converge to the minimum point.
2. Information calculated at previous iterations is not used. Each iteration is started independently of the others, which is inefficient.
3. Only first-order information about the function is used at each iteration to determine the search direction. This is one reason that convergence of the method is slow. It can deteriorate further if an inaccurate line search is used.

Note: with an exact line search we have d f(X(k+1))/dα = ∇f(X(k+1)) · dX(k+1)/dα = C(k+1) · Sk = 0. Since Sk = − Ck, this gives C(k+1) · Ck = 0, which shows that successive steepest descent directions are normal (orthogonal) to one another.

Conjugate Gradient Method
The normal steepest descent algorithm sometimes takes a large number of iterations to converge, even for quadratic functions. A modification was suggested by Fletcher & Reeves so that it converges faster.
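As a concrete illustration of the steepest descent steps above, here is a minimal Python sketch. The quadratic test function, its gradient, and the closed-form step size α = −CᵀS / (SᵀHS) for a quadratic are illustrative assumptions, not part of the lecture note.

```python
import numpy as np

# Hypothetical quadratic cost f(X) = 0.5 X^T H X - b^T X (illustrative test problem)
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def grad(x):                              # gradient C_k of the cost at x
    return H @ x - b

def steepest_descent(x0, eps=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        c = grad(x)                       # Step 2: gradient C_k
        if np.linalg.norm(c) < eps:       # Step 3: convergence check
            break
        s = -c                            # Step 4: S_k = -C_k
        alpha = -(c @ s) / (s @ H @ s)    # Step 5: exact line search (quadratic case)
        x = x + alpha * s                 # Step 6: update and iterate
    return x

x_star = steepest_descent([0.0, 0.0])
# For this quadratic the minimizer solves H x = b
```

The exact α used here is the closed-form expression derived above; for a non-quadratic function it would be replaced by a line search such as Golden Section.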
The concept is to find a conjugate direction at each point and then find the step size. The direction update is modified as

Sk = − ∇f(Xk) + βk S(k−1)

where the scale factor βk is given as

βk = ( ||Ck|| / ||C(k−1)|| )²

The rest of the steps are the same as in the steepest descent algorithm, and the convergence of this method is faster. In this algorithm the current steepest descent direction is modified by adding a scaled version of the direction used in the previous iteration. The scale factor is determined using the lengths of the gradient vectors at the two iterations, as shown in the equation for βk above. Thus, the conjugate direction is nothing but a deflected steepest descent direction. This is an extremely simple modification that requires little additional calculation. The conjugate gradient algorithm finds the minimum in n iterations for positive definite quadratic functions having n design variables.

Newton Method of Optimization
The basic idea of Newton's method is to use a second-order Taylor expansion of the function about the current design point. This gives a quadratic expression for the change in design ΔX. The necessary condition for minimization of this expression then gives an explicit calculation for the design change. The Taylor expansion of our function f(X) with a small change ΔX is:

f(X + ΔX) = f(X) + Cᵀ ΔX + (1/2) ΔXᵀ H ΔX

Here C is ∇f(X) and H = [∂²f(X)/∂xi ∂xj] is the Hessian of the function f(X). Differentiating the above with respect to ΔX and equating to zero for minimization, we have

C + H ΔX = 0, i.e. ΔX = − H⁻¹ C

Now we can update X1 = X0 + ΔX.

Note: if H is positive semidefinite the function is convex, and convexity ensures that any minimum found is a global minimum; a quadratic function needs to be convex (H positive semidefinite) to have a minimum at all.

The above Newton's method does not have a step size associated with the calculation of the design change ΔX; i.e., the step size is taken as one (a step of length one is called an ideal step size or Newton's step).
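The Newton design change ΔX = −H⁻¹C above can be sketched for a small quadratic example; the test function here is a hypothetical illustration, not from the note.

```python
import numpy as np

# Hypothetical quadratic cost f(X) = 0.5 X^T H X - b^T X, so C = H X - b and the
# Hessian H is constant; for a quadratic, one Newton step lands on the minimum.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

x0 = np.zeros(2)
c = H @ x0 - b                    # gradient C at X0
dx = np.linalg.solve(H, -c)       # solve H ΔX = -C rather than forming H^-1
x1 = x0 + dx                      # Newton step with the ideal step size α = 1

# x1 satisfies the stationarity condition C + H ΔX = 0, i.e. H x1 = b
```

For a quadratic this single step is exact; for a general function the unit step is only the "ideal" step mentioned above, with no guarantee attached.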
Therefore, we cannot guarantee that the cost function will reduce at each iteration, i.e. that f(x(k+1)) < f(x(k)). Thus, this algorithm may not converge easily.

Modified Newton's Method
In the modified Newton's method we incorporate a step size α computation, resulting in a better convergence rate.

Algorithm (Modified Newton)
Step 1: Start with X0 and set k = 0; select a convergence parameter ε (a small number).
Step 2: Calculate Ck; if ||Ck|| < ε stop, else continue.
Step 3: Calculate Hk at Xk (H0 at the start).
Step 4: Calculate Sk = − [Hk]⁻¹ Ck (S0 = − [H0]⁻¹ C0 at the start). (Generally it is better to solve the linear system Hk Sk = − Ck instead of calculating the inverse.)
Step 5: Update X(k+1) = Xk + αk Sk, calculating αk by minimizing the function f(Xk + αk Sk).
Step 6: Set k = k + 1 and go to Step 2.

It is important to note here that unless H is positive definite, the direction Sk determined may not be one of descent for the cost function. Descent requires

Ckᵀ Sk = − Ckᵀ [Hk]⁻¹ Ck < 0

If H is negative definite or negative semidefinite, this condition is always violated. With H indefinite or positive semidefinite, the condition may or may not be satisfied, so we must check for it. The condition is always satisfied if H is positive definite.

Disadvantages of Newton's Method
1. It requires calculation of second-order derivatives at each iteration, which is usually quite time consuming. In some applications it may not even be possible to calculate such derivatives. Also, a linear system of equations Hk Sk = − Ck needs to be solved, needing more computation at each step.
2. The method is not convergent unless the Hessian remains positive definite and a step size is calculated along the search direction to update the design. However, the method has a quadratic rate of convergence when it converges.
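The modified Newton steps above can be sketched as follows. The test function is a hypothetical example, and the backtracking loop is a simple stand-in for the exact step-size minimization of Step 5.

```python
import numpy as np

# Hypothetical smooth test function (not from the notes): minimum at (1, 1).
def f(x):
    return 0.25 * x[0]**4 + 0.5 * x[1]**2 - x[0] - x[1]

def grad(x):
    return np.array([x[0]**3 - 1.0, x[1] - 1.0])

def hess(x):
    return np.array([[3.0 * x[0]**2, 0.0], [0.0, 1.0]])

def modified_newton(x0, eps=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        c = grad(x)                          # Step 2: gradient C_k
        if np.linalg.norm(c) < eps:
            break
        s = np.linalg.solve(hess(x), -c)     # Step 4: solve H_k S_k = -C_k
        alpha = 1.0                          # Step 5: backtracking stand-in for
        while f(x + alpha * s) >= f(x):      # the exact minimizing alpha
            alpha *= 0.5
            if alpha < 1e-12:
                break
        x = x + alpha * s                    # update, then loop back to Step 2
    return x

x_min = modified_newton([2.0, 0.0])          # approaches the minimizer (1, 1)
```

Here the Hessian stays positive definite along the path, so each Sk is a descent direction and the backtracking loop always finds an acceptable α.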
For a strictly convex quadratic function, the method converges in just one iteration from any starting point.

Quasi-Newton Method
Sometimes it may be difficult to compute the Hessian matrix due to complex equations. Is it possible to approximate the second derivative and proceed? In such cases we can approximate the second derivative using two pieces of information: the change in design and the change in the gradient vector between two successive iterations. While updating, the properties of symmetry and positive definiteness need to be preserved. The derivation of the updating procedures is based on the so-called quasi-Newton condition. This condition is derived by requiring the curvature of the cost function in the search direction d(k) to be the same at two consecutive points x(k) and x(k+1). Enforcing this condition gives the updating formulas for the Hessian of the cost function or its inverse. For a strictly convex quadratic function, the updating procedure converges to the exact Hessian in n iterations.

Davidon-Fletcher-Powell (Inverse Hessian Approximation): DFP Algorithm
Step 1: Choose an initial value X0. Select a symmetric positive definite (n×n) matrix A0 as an estimate of the inverse of the Hessian (selection of the identity matrix is possible). Choose ε and set the iteration counter k = 0.
Step 2: Calculate ||Ck||; if ||Ck|| < ε stop, else continue.
Step 3: Sk = − Ak Ck.
Step 4: Calculate αk by minimizing the function f(Xk + αk Sk).
Step 5: Update X(k+1) = Xk + αk Sk.
Step 6: Update Ak as

A(k+1) = Ak + Bk + Ck, where
Bk = sk skᵀ / (sk · yk);  Ck = − zk zkᵀ / (yk · zk)
sk = αk Sk (change in design);  yk = C(k+1) − C(k) (change in gradient);  zk = Ak yk

(Here Bk and Ck denote correction matrices; the Ck matrix is not to be confused with the gradient C(k).)
Step 7: Set k = k + 1 and go to Step 2.

Properties:
1. The matrix A(k) is positive definite for all k. This implies that the method will always converge to a local minimum point.
2. When the method is applied to a positive definite quadratic form, A(k) converges to the inverse of the Hessian of the quadratic form.
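The DFP steps above can be sketched in Python. The quadratic test problem and the closed-form exact line search are illustrative assumptions; on a quadratic, the final A should approach the inverse Hessian, as noted in property 2.

```python
import numpy as np

# DFP sketch on a hypothetical quadratic f(X) = 0.5 X^T H X - b^T X (assumed example).
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def grad(x):
    return H @ x - b

def dfp(x0, eps=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    A = np.eye(len(x))                    # Step 1: A0 = I estimates the inverse Hessian
    c = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(c) < eps:       # Step 2: convergence check
            break
        s_dir = -A @ c                    # Step 3: S_k = -A_k C_k
        alpha = -(c @ s_dir) / (s_dir @ H @ s_dir)  # Step 4: exact alpha (quadratic)
        s = alpha * s_dir                 # s_k, the change in design
        x = x + s                         # Step 5: update the design
        c_new = grad(x)
        y = c_new - c                     # y_k, the change in gradient
        z = A @ y                         # z_k = A_k y_k
        # Step 6: A_{k+1} = A_k + B_k + C_k
        A = A + np.outer(s, s) / (s @ y) - np.outer(z, z) / (y @ z)
        c = c_new
    return x, A

x_star, A_final = dfp([0.0, 0.0])
# For a positive definite quadratic, A converges to the inverse Hessian H^-1
```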
Direct Hessian Updating: The BFGS Method
Instead of updating the inverse of H, we can also update H itself, as suggested by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.

Step 1: Estimate an initial design X0. Choose a symmetric positive definite (n×n) matrix H(0) as an estimate of the Hessian of the cost function; in the absence of more information, H(0) = I can be chosen. Choose a convergence parameter ε. Set k = 0 and calculate the gradient C0 = ∇f(X0).
Step 2: Calculate ||Ck||; if ||Ck|| < ε stop.
Step 3: Compute Sk by solving Hk Sk = − Ck.
Step 4: Calculate αk by minimizing the function f(Xk + αk Sk).
Step 5: Update X(k+1) = Xk + αk Sk.
Step 6: Update Hk as

H(k+1) = Hk + Dk + Ek, where
Dk = yk ykᵀ / (yk · sk);  Ek = Ck Ckᵀ / (Ck · Sk)
sk = αk Sk (change in design);  yk = C(k+1) − C(k) (change in gradient)

(Note that Ck · Sk < 0 for a descent direction, so Ek is a negative correction.)
Step 7: Set k = k + 1 and go to Step 2.

Note again that the first iteration of the method is the same as that of the steepest descent method when H(0) = I. It can be shown that the BFGS update formula keeps the Hessian approximation positive definite if an accurate line search is used.

Gradient Projection Algorithm
The gradient projection method is based on the concept of projecting the search direction into the subspace tangent to the active constraints.

Problem definition: minimize f(x) subject to the constraints

gj(x) = njᵀ x − ej ≥ 0, i.e. Σi nij xi − ej ≥ 0

Solution: Assume there are p active constraints, let g_a be the vector of active constraints, and let N be the matrix whose columns are the gradients of these active constraints. Then we have

g_a = Nᵀ X − e = 0 ---------------------- (A)

The basic assumption we make in this algorithm is that X stays in the subspace tangent to the active constraints, i.e. both Xi and X(i+1) satisfy the constraints defined by equation (A). If X(i+1) = Xi + α S, this amounts to NᵀS = 0. Hence the steepest descent subproblem can be defined as:

Minimize Sᵀ∇f subject to NᵀS = 0 and SᵀS = 1.
To solve this, let us define a Lagrangian function:

L(S, λ, μ) = Sᵀ∇f − λᵀ NᵀS − μ(SᵀS − 1)

∂L/∂S = ∇f − Nλ − 2μS = 0 ------------ (1)

Premultiplying by Nᵀ and using NᵀS = 0:

Nᵀ∇f − NᵀNλ = 0, so λ = (NᵀN)⁻¹ Nᵀ∇f

Substituting this into equation (1) above:

S = (1/2μ) [I − N(NᵀN)⁻¹Nᵀ] ∇f = (1/2μ) P ∇f

P is known as the projection matrix; the factor 1/2μ is insignificant, being a scalar. That is how we calculate the new search direction, S = − P ∇f.

After a search direction has been determined, we have to determine the value of α. Unlike the unconstrained case, there is an upper limit on α set by the inactive constraints: as α increases, some of them may become active and then violated. Since gj(x) = njᵀx − ej,

gj(xi + αS) = njᵀ(xi + αS) − ej ≥ 0, which gives α ≤ − gj(xi) / (njᵀ S)  (for each inactive constraint with njᵀS < 0)

The main difficulty caused by nonlinearity of the constraints is that the one-dimensional search typically moves away from the constraint boundary. This is because we move in the tangent subspace, which no longer follows the constraint boundary exactly. After the one-dimensional search is over, we therefore require a restoration move to bring x back to the constraint boundaries, using the linear approximation

gj ≈ gj + ∇gjᵀ (x̄i − xi)

We want to find a correction x̄i − xi in the tangent subspace (i.e. P(x̄i − xi) = 0) that would reduce g_a to zero:

(x̄i − xi) = − N(NᵀN)⁻¹ g_a

is the desired correction, where g_a is the vector of active constraints. In addition, we carry out an improvement by specifying a parameter β for the reduction in the cost function. We specify

f(xi) − f(x(i+1)) ≅ β f(xi)  and  α* = − β f(xi) / (Sᵀ∇f)

and update

x(i+1) = xi + α* S − N(NᵀN)⁻¹ g_a

This method is very suitable for nonlinear optimization.
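The projection matrix P = I − N(NᵀN)⁻¹Nᵀ and the projected direction S = −P∇f above can be illustrated with a small sketch; the constraint and the cost gradient used here are hypothetical data.

```python
import numpy as np

# Hypothetical example: one active linear constraint g(x) = x1 + x2 - 2 >= 0,
# so N is the single column (1, 1); the cost gradient at the point is assumed.
N = np.array([[1.0], [1.0]])          # gradients of active constraints (columns)
grad_f = np.array([3.0, 1.0])         # assumed gradient of f at the current design

# Projection matrix P = I - N (N^T N)^-1 N^T
P = np.eye(2) - N @ np.linalg.inv(N.T @ N) @ N.T

S = -P @ grad_f                       # projected search direction in the tangent subspace

# S satisfies N^T S = 0, so a move along S keeps the linear constraint active,
# and S . grad_f < 0, so it is still a descent direction.
```

The same P also appears in the restoration step, since the correction −N(NᵀN)⁻¹g_a is exactly the component removed by P.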
---------------------------------------------------------------------------------------------------------------------

SECTION II: Constrained Steepest Descent Optimization

Concept of a Descent Function
In unconstrained optimization methods we used the cost function as the descent function to monitor the progress of the algorithm toward the optimum point. For constrained problems, the descent function is usually constructed by adding a penalty for constraint violations to the current value of the cost function. Pshenichny's descent function (also called the exact penalty function) is commonly used due to its simplicity. Pshenichny's descent function Φ at any point x is defined as

Φ(x) = f(x) + R V(x)

where R > 0 is a positive number called the penalty parameter (initially specified by the user), V(x) ≥ 0 is either the maximum constraint violation among all the constraints or zero, and f(x) is the cost function value at x. The descent function at the point x(k) is

Φk = fk + R Vk

with Φk = Φ(Xk) and Vk = V(Xk), where R is the most current value of the penalty parameter. It must be ensured that R is greater than or equal to the sum of the Lagrange multipliers of the quadratic subproblem at the point x(k):

rk = Σ(i=1..p) |vik| + Σ(i=1..m) uik, and R ≥ rk

where the vik are the Lagrange multipliers of the equality constraints and the uik ≥ 0 are the multipliers of the inequality constraints. Vk ≥ 0 is the maximum constraint violation at the kth iteration, i.e. max{0; |h1|, |h2|, ...; g1, g2, ...}.

Quadratic Subproblems
Whenever we deal with constrained optimization problems, we seek to linearize the cost and constraint functions so as to obtain a tractable subproblem at each iteration.
In general we

Minimize f(Xk + ΔXk) ≅ f(Xk) + ∇f(Xk)ᵀ ΔXk

subject to the linearized equality constraints

hj(Xk + ΔXk) ≅ hj(Xk) + ∇hj(Xk)ᵀ ΔXk = 0

and the linearized inequality constraints

gj(Xk + ΔXk) ≅ gj(Xk) + ∇gj(Xk)ᵀ ΔXk ≤ 0

Such minimization problems are denoted in vector form as:

Minimize  f̄ = Cᵀd + (1/2) dᵀd = Σ(i=1..n) Ci di + (1/2) Σ(i=1..n) di di   ---- (1)

Linear equality constraints:  Nᵀd = e, i.e. Σ(i=1..n) nij di = ej   ---- (2)

Linear inequality constraints:  Aᵀd ≤ b, i.e. Σ(i=1..n) aij di ≤ bj   ---- (3)

where Ci = ∂f(Xk)/∂xi; ej = − hj(Xk); bj = − gj(Xk); di = ΔXik; nij = ∂hj(Xk)/∂xi; aij = ∂gj(Xk)/∂xi; and the matrix A is formed by the components aij. This set of equations is known as the quadratic subproblem.

The parameter Vk ≥ 0, related to the maximum constraint violation at the kth iteration, is determined using the calculated values of the constraint functions at the design point x(k) as

Vk = max{0; |h1|, |h2|, ..., |hp|; g1, g2, ..., gm}   ---- (4)

Since an equality constraint is violated whenever it differs from zero, the absolute value is used with each hi. Note that Vk is always nonnegative, i.e. Vk ≥ 0; if all constraints are satisfied at x(k), then Vk = 0.

Constrained Steepest Descent (CSD) Algorithm
The stopping criterion for the algorithm is that ||d|| ≤ ε for a feasible point. Here ε is a small positive number and d is the search direction obtained as a solution of the QP subproblem.

Step 1: Set k = 0. Select initial values for the design variables, X0. Select an appropriate initial value for the penalty parameter R0, and two small numbers ε1 and ε2 defining the permissible constraint violation and the convergence parameter, respectively. R0 = 1 can be a starting selection.

Step 2: Compute at Xk the cost, the constraints, and their gradients. Calculate the maximum constraint violation Vk as defined in equation (4) above.

Step 3:
Using the cost and constraint function values and their gradients, define the QP subproblem given by equations (1) to (3). Solve the QP subproblem to obtain the search direction dk and the Lagrange multiplier vectors vk and uk.

Step 4: Check ||d(k)|| < ε2 and the maximum constraint violation Vk ≤ ε1. If both criteria are satisfied, stop; else continue.

Step 5: To check the necessary condition R ≥ rk for the penalty parameter R, calculate the sum rk of the Lagrange multipliers:

rk = Σ(i=1..p) |vik| + Σ(i=1..m) uik

Set R = max{Rk, rk}.

Step 6: Update X(k+1) = Xk + αk d(k). As in the unconstrained case, the step size αk is calculated by minimizing the descent function along the search direction d(k).

Step 7: Save the current penalty parameter as R(k+1) = R. Update the iteration counter k = k + 1 and go to Step 2.

The CSD algorithm is a first-order method that can treat both equality and inequality constraints. The algorithm converges to a local minimum point starting from an arbitrary point. Its rate of convergence can be improved by including second-order information in the QP subproblem.

References
Standard textbooks: J. S. Arora, Introduction to Optimum Design; and the engineering optimization texts of Kalyanmoy Deb and S. S. Rao.