AN INTRODUCTION TO OPTIMIZATION

WILEY SERIES IN DISCRETE MATHEMATICS AND OPTIMIZATION

A complete list of titles in this series appears at the end of this volume.

AN INTRODUCTION TO OPTIMIZATION
Fourth Edition

Edwin K. P. Chong, Colorado State University
Stanisław H. Żak, Purdue University

WILEY
A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Chong, Edwin Kah Pin.
An introduction to optimization / Edwin K. P. Chong, Colorado State University, Stanislaw H. Zak, Purdue University. — Fourth edition.
pages cm
Summary: "The purpose of the book is to give the reader a working knowledge of optimization theory and methods" — Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-118-27901-4 (hardback)
1. Mathematical optimization. I. Zak, Stanislaw H. II. Title.
QA402.5.C476 2012
519.6—dc23
2012031772

Printed in the United States of America.

To my wife, Yat-Yee, and to my parents, Paul and Julienne Chong. — Edwin K. P. Chong

To JMJ; my wife, Mary Ann; and my parents, Janina and Konstanty Żak. — Stanisław H. Żak
CONTENTS

Preface

PART I  MATHEMATICAL REVIEW

1  Methods of Proof and Some Notation
   1.1  Methods of Proof
   1.2  Notation
   Exercises

2  Vector Spaces and Matrices
   2.1  Vector and Matrix
   2.2  Rank of a Matrix
   2.3  Linear Equations
   2.4  Inner Products and Norms
   Exercises

3  Transformations
   3.1  Linear Transformations
   3.2  Eigenvalues and Eigenvectors
   3.3  Orthogonal Projections
   3.4  Quadratic Forms
   3.5  Matrix Norms
   Exercises

4  Concepts from Geometry
   4.1  Line Segments
   4.2  Hyperplanes and Linear Varieties
   4.3  Convex Sets
   4.4  Neighborhoods
   4.5  Polytopes and Polyhedra
   Exercises

5  Elements of Calculus
   5.1  Sequences and Limits
   5.2  Differentiability
   5.3  The Derivative Matrix
   5.4  Differentiation Rules
   5.5  Level Sets and Gradients
   5.6  Taylor Series
   Exercises

PART II  UNCONSTRAINED OPTIMIZATION

6  Basics of Set-Constrained and Unconstrained Optimization
   6.1  Introduction
   6.2  Conditions for Local Minimizers
   Exercises

7  One-Dimensional Search Methods
   7.1  Introduction
   7.2  Golden Section Search
   7.3  Fibonacci Method
   7.4  Bisection Method
   7.5  Newton's Method
   7.6  Secant Method
   7.7  Bracketing
   7.8  Line Search in Multidimensional Optimization
   Exercises

8  Gradient Methods
   8.1  Introduction
   8.2  The Method of Steepest Descent
   8.3  Analysis of Gradient Methods
   Exercises

9  Newton's Method
   9.1  Introduction
   9.2  Analysis of Newton's Method
   9.3  Levenberg-Marquardt Modification
   9.4  Newton's Method for Nonlinear Least Squares
   Exercises

10  Conjugate Direction Methods
   10.1  Introduction
   10.2  The Conjugate Direction Algorithm
   10.3  The Conjugate Gradient Algorithm
   10.4  The Conjugate Gradient Algorithm for Nonquadratic Problems
   Exercises

11  Quasi-Newton Methods
   11.1  Introduction
   11.2  Approximating the Inverse Hessian
   11.3  The Rank One Correction Formula
   11.4  The DFP Algorithm
   11.5  The BFGS Algorithm
   Exercises

12  Solving Linear Equations
   12.1  Least-Squares Analysis
   12.2  The Recursive Least-Squares Algorithm
   12.3  Solution to a Linear Equation with Minimum Norm
   12.4  Kaczmarz's Algorithm
   12.5  Solving Linear Equations in General
   Exercises

13  Unconstrained Optimization and Neural Networks
   13.1  Introduction
   13.2  Single-Neuron Training
   13.3  The Backpropagation Algorithm
   Exercises

14  Global Search Algorithms
   14.1  Introduction
   14.2  The Nelder-Mead Simplex Algorithm
   14.3  Simulated Annealing
   14.4  Particle Swarm Optimization
   14.5  Genetic Algorithms
   Exercises

PART III  LINEAR PROGRAMMING

15  Introduction to Linear Programming
   15.1  Brief History of Linear Programming
   15.2  Simple Examples of Linear Programs
   15.3  Two-Dimensional Linear Programs
   15.4  Convex Polyhedra and Linear Programming
   15.5  Standard Form Linear Programs
   15.6  Basic Solutions
   15.7  Properties of Basic Solutions
   15.8  Geometric View of Linear Programs
   Exercises

16  Simplex Method
   16.1  Solving Linear Equations Using Row Operations
   16.2  The Canonical Augmented Matrix
   16.3  Updating the Augmented Matrix
   16.4  The Simplex Algorithm
   16.5  Matrix Form of the Simplex Method
   16.6  Two-Phase Simplex Method
   16.7  Revised Simplex Method
   Exercises

17  Duality
   17.1  Dual Linear Programs
   17.2  Properties of Dual Problems
   Exercises
18  Nonsimplex Methods
   18.1  Introduction
   18.2  Khachiyan's Method
   18.3  Affine Scaling Method
   18.4  Karmarkar's Method
   Exercises

19  Integer Linear Programming
   19.1  Introduction
   19.2  Unimodular Matrices
   19.3  The Gomory Cutting-Plane Method
   Exercises

PART IV  NONLINEAR CONSTRAINED OPTIMIZATION

20  Problems with Equality Constraints
   20.1  Introduction
   20.2  Problem Formulation
   20.3  Tangent and Normal Spaces
   20.4  Lagrange Condition
   20.5  Second-Order Conditions
   20.6  Minimizing Quadratics Subject to Linear Constraints
   Exercises

21  Problems with Inequality Constraints
   21.1  Karush-Kuhn-Tucker Condition
   21.2  Second-Order Conditions
   Exercises

22  Convex Optimization Problems
   22.1  Introduction
   22.2  Convex Functions
   22.3  Convex Optimization Problems
   22.4  Semidefinite Programming
   Exercises

23  Algorithms for Constrained Optimization
   23.1  Introduction
   23.2  Projections
   23.3  Projected Gradient Methods with Linear Constraints
   23.4  Lagrangian Algorithms
   23.5  Penalty Methods
   Exercises

24  Multiobjective Optimization
   24.1  Introduction
   24.2  Pareto Solutions
   24.3  Computing the Pareto Front
   24.4  From Multiobjective to Single-Objective Optimization
   24.5  Uncertain Linear Programming Problems
   Exercises

References

Index

PREFACE

Optimization is central to any problem involving decision making, whether in engineering or in economics. The task of decision making entails choosing among various alternatives. This choice is governed by our desire to make the "best" decision. The measure of goodness of the alternatives is described by an objective function or performance index. Optimization theory and methods deal with selecting the best alternative in the sense of the given objective function.

The area of optimization has received enormous attention in recent years, primarily because of the rapid progress in computer technology, including the development and availability of user-friendly software, high-speed and parallel processors, and artificial neural networks. A clear example of this phenomenon is the wide accessibility of optimization software tools such as the Optimization Toolbox of MATLAB (MATLAB is a registered trademark of The MathWorks, Inc.) and the many other commercial software packages.

There are currently several excellent graduate textbooks on optimization theory and methods (e.g., [3], [39], [43], [51], [87], [88], [104], [129]), as well as undergraduate textbooks on the subject with an emphasis on engineering design (e.g., [1] and [109]). However, there is a need for an introductory textbook on optimization theory and methods at a senior undergraduate or beginning graduate level. The present text was written with this goal in mind. The material is an outgrowth of our lecture notes for a one-semester course in optimization methods for seniors and beginning graduate students at Purdue University, West Lafayette, Indiana. In our presentation, we assume a working knowledge of basic linear algebra and multivariable calculus.

For the reader's convenience, a part of this book (Part I) is devoted to a review of the required mathematical background material. Numerous figures throughout the text complement the written presentation of the material. We also include a variety of exercises at the end of each chapter.
A solutions manual with complete solutions to the exercises is available from the publisher to instructors who adopt this text. Some of the exercises require using MATLAB. The student edition of MATLAB is sufficient for all of the MATLAB exercises included in the text. The MATLAB source listings for the MATLAB exercises are also included in the solutions manual.

The purpose of the book is to give the reader a working knowledge of optimization theory and methods. To accomplish this goal, we include many examples that illustrate the theory and algorithms discussed in the text. However, it is not our intention to provide a cookbook of the most recent numerical techniques for optimization; rather, our goal is to equip the reader with sufficient background for further study of advanced topics in optimization.

The field of optimization is still a very active research area. In recent years, various new approaches to optimization have been proposed. In this text, we have tried to reflect at least some of the flavor of recent activity in the area. For example, we include a discussion of randomized search methods—these include particle swarm optimization and genetic algorithms, topics of increasing importance in the study of complex adaptive systems. There has also been a recent surge of applications of optimization methods to a variety of new problems. An example of this is the use of descent algorithms for the training of feedforward neural networks. An entire chapter in the book is devoted to this topic. The area of neural networks is an active area of ongoing research, and many books have been devoted to this subject. The topic of neural network training fits perfectly into the framework of unconstrained optimization methods. Therefore, the chapter on feedforward neural networks not only provides an example of application of unconstrained optimization methods but also gives the reader an accessible introduction to what is currently a topic of wide interest.

The material in this book is organized into four parts. Part I contains a review of some basic definitions, notations, and relations from linear algebra, geometry, and calculus that we use frequently throughout the book. In Part II we consider unconstrained optimization problems. We first discuss some theoretical foundations of set-constrained and unconstrained optimization, including necessary and sufficient conditions for minimizers and maximizers. This is followed by a treatment of various iterative optimization algorithms, including line search methods, together with their properties. A discussion of global search algorithms is included in this part. We also analyze the least-squares optimization problem and the associated recursive least-squares algorithm.

Parts III and IV are devoted to constrained optimization. Part III deals with linear programming problems, which form an important class of constrained optimization problems. We give examples and analyze properties of linear programs, and then discuss the simplex method for solving linear programs. We also provide a brief treatment of dual linear programming problems. We then describe some nonsimplex algorithms for solving linear programs: Khachiyan's method, the affine scaling method, and Karmarkar's method. We wrap up Part III by discussing integer linear programming problems. In Part IV we treat nonlinear constrained optimization.
Here, as in Part II, we first present some theoretical foundations of nonlinear constrained optimization problems, including convex optimization problems. We then discuss different algorithms for solving constrained optimization problems. We also treat multiobjective optimization.

Although we have made every effort to ensure an error-free text, we suspect that some errors remain undetected. For this purpose, we provide online updated errata that can be found at the Web site for the book, accessible via http://www.wiley.com/mathematics

We are grateful to several people for their help during the course of writing this book. In particular, we thank Dennis Goodman of Lawrence Livermore Laboratories for his comments on early versions of Part II and for making available to us his lecture notes on nonlinear optimization. We thank Moshe Kam of Drexel University for pointing out some useful references on nonsimplex methods. We are grateful to Ed Silverman and Russell Quong for their valuable remarks on Part I of the first edition. We also thank the students of ECE 580 at Purdue University and ECE 520 and MATH 520 at Colorado State University for their many helpful comments and suggestions. In particular, we are grateful to Christopher Taylor for his diligent proofreading of early manuscripts of this book. This fourth edition incorporates many valuable suggestions of users of the first, second, and third editions, to whom we are grateful.

E. K. P. CHONG AND S. H. ŻAK
Fort Collins, Colorado, and West Lafayette, Indiana

PART I

MATHEMATICAL REVIEW

CHAPTER 1

METHODS OF PROOF AND SOME NOTATION

1.1 Methods of Proof

Consider two statements, "A" and "B," which could be either true or false. For example, let "A" be the statement "John is an engineering student," and let "B" be the statement "John is taking a course on optimization." We can combine these statements to form other statements, such as "A and B" or "A or B." In our example, "A and B" means "John is an engineering student, and he is taking a course on optimization." We can also form statements such as "not A," "not B," "not (A and B)," and so on. For example, "not A" means "John is not an engineering student." The truth or falsity of the combined statements depends on the truth or falsity of the original statements, "A" and "B." This relationship is expressed by means of truth tables; see Tables 1.1 and 1.2.

Table 1.1  Truth Table for "A and B" and "A or B"

    A  B  |  A and B  |  A or B
    F  F  |     F     |     F
    F  T  |     F     |     T
    T  F  |     F     |     T
    T  T  |     T     |     T

Table 1.2  Truth Table for "not A"

    A  |  not A
    F  |    T
    T  |    F

From the tables, it is easy to see that the statement "not (A and B)" is equivalent to "(not A) or (not B)" (see Exercise 1.3). This is called DeMorgan's law.

In proving statements, it is convenient to express a combined statement by a conditional, such as "A implies B," which we denote "A ⇒ B." The conditional "A ⇒ B" is simply the combined statement "(not A) or B" and is often also read "A only if B," or "if A then B," or "A is sufficient for B," or "B is necessary for A." We can combine two conditional statements to form a biconditional statement of the form "A ⇔ B," which simply means "(A ⇒ B) and (B ⇒ A)." The statement "A ⇔ B" reads "A if and only if B," or "A is equivalent to B," or "A is necessary and sufficient for B." Truth tables for conditional and biconditional statements are given in Table 1.3.

Table 1.3  Truth Table for Conditionals and Biconditionals

    A  B  |  A ⇒ B  |  A ⇐ B  |  A ⇔ B
    F  F  |    T    |    T    |    T
    F  T  |    T    |    F    |    F
    T  F  |    F    |    T    |    F
    T  T  |    T    |    T    |    T
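These table entries can also be checked mechanically. The following MATLAB fragment is a small illustrative sketch (ours, not part of the text; it assumes only base MATLAB): it enumerates every truth assignment and verifies DeMorgan's law together with the equivalences explored in Exercises 1.1 and 1.2.

```matlab
% Enumerate all truth assignments for A and B and check the equivalences.
implies = @(a, b) ~a | b;   % "A => B" is "(not A) or B"
for A = [false true]
    for B = [false true]
        % DeMorgan's law: not (A and B) == (not A) or (not B).
        assert(~(A & B) == (~A | ~B));
        % A conditional is equivalent to its contrapositive.
        assert(implies(A, B) == implies(~B, ~A));
        % Basis for proof by contradiction: "A => B" == "not (A and (not B))".
        assert(implies(A, B) == ~(A & ~B));
    end
end
disp('All equivalences hold for every truth assignment.');
```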
It is easy to verify, using the truth table, that the statement "A ⇒ B" is equivalent to the statement "(not B) ⇒ (not A)." The latter is called the contrapositive of the former. If we take the contrapositive of DeMorgan's law, we obtain the assertion that "not (A or B)" is equivalent to "(not A) and (not B)."

Most statements we deal with have the form "A ⇒ B." To prove such a statement, we may use one of the following three different techniques:

1. The direct method
2. Proof by contraposition
3. Proof by contradiction or reductio ad absurdum

In the case of the direct method, we start with "A," then deduce a chain of various consequences to end with "B."

A useful method for proving statements is proof by contraposition, based on the equivalence of the statements "A ⇒ B" and "(not B) ⇒ (not A)." We start with "not B," then deduce various consequences to end with "not A" as a conclusion.

Another method of proof that we use is proof by contradiction, based on the equivalence of the statements "A ⇒ B" and "not (A and (not B))." Here we begin with "A and (not B)" and derive a contradiction.

Occasionally, we use the principle of induction to prove statements. This principle may be stated as follows. Assume that a given property of positive integers satisfies the following conditions:

- The number 1 possesses this property.
- If the number $n$ possesses this property, then the number $n + 1$ possesses it too.

The principle of induction states that under these assumptions any positive integer possesses the property.

The principle of induction is easily understood using the following intuitive argument. If the number 1 possesses the given property, then the second condition implies that the number 2 possesses the property. But, then again, the second condition implies that the number 3 possesses this property, and so on. The principle of induction is a formal statement of this intuitive reasoning.

For a detailed treatment of different methods of proof, see [130].

1.2 Notation

Throughout, we use the following notation. If $X$ is a set, then we write $x \in X$ to mean that $x$ is an element of $X$. When an object $x$ is not an element of a set $X$, we write $x \notin X$. We also use the "curly bracket notation" for sets, writing down the first few elements of a set followed by three dots. For example, $\{x_1, x_2, x_3, \dots\}$ is the set containing the elements $x_1, x_2, x_3$, and so on. Alternatively, we can explicitly display the law of formation. For example, $\{x : x \in \mathbb{R}, x > 5\}$ reads "the set of all $x$ such that $x$ is real and $x$ is greater than 5." The colon following $x$ reads "such that." An alternative notation for the same set is $\{x \in \mathbb{R} : x > 5\}$.

If $X$ and $Y$ are sets, then we write $X \subset Y$ to mean that every element of $X$ is also an element of $Y$. In this case, we say that $X$ is a subset of $Y$. If $X$ and $Y$ are sets, then we denote by $X \setminus Y$ ("X minus Y") the set of all points in $X$ that are not in $Y$. Note that $X \setminus Y$ is a subset of $X$. The notation $f : X \to Y$ means "$f$ is a function from the set $X$ into the set $Y$." The symbol $:=$ denotes arithmetic assignment. Thus, a statement of the form $x := y$ means "$x$ becomes $y$." The symbol $\triangleq$ means "equals by definition."

Throughout the text, we mark the end of theorems, lemmas, propositions, and corollaries using the symbol □. We mark the end of proofs, definitions, and examples by ■. We use the IEEE style when citing reference items. For example, [77] represents reference number 77 in the list of references at the end of the book.
EXERCISES

1.1 Construct the truth table for the statement "(not B) ⇒ (not A)," and use it to show that this statement is equivalent to the statement "A ⇒ B."

1.2 Construct the truth table for the statement "not (A and (not B))," and use it to show that this statement is equivalent to the statement "A ⇒ B."

1.3 Prove DeMorgan's law by constructing the appropriate truth tables.

1.4 Prove that for any statements A and B, we have "A ⇔ (A and B) or (A and (not B))." This is useful because it allows us to prove a statement A by proving the two separate cases "(A and B)" and "(A and (not B))." For example, to prove that $|x| \ge x$ for any $x \in \mathbb{R}$, we separately prove the cases "$|x| \ge x$ and $x \ge 0$" and "$|x| \ge x$ and $x < 0$." Proving the two cases turns out to be easier than proving the statement $|x| \ge x$ directly (see Section 2.4 and Exercise 2.7).

1.5 (This exercise is adopted from [22, pp. 80-81].) Suppose that you are shown four cards, laid out in a row. Each card has a letter on one side and a number on the other. On the visible side of the cards are printed the symbols

    S  8  3  A

Determine which cards you should turn over to decide if the following rule is true or false: "If there is a vowel on one side of the card, then there is an even number on the other side."

CHAPTER 2

VECTOR SPACES AND MATRICES

2.1 Vector and Matrix

We define a column $n$-vector to be an array of $n$ numbers, denoted
$$a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}.$$
The number $a_i$ is called the $i$th component of the vector $a$. Denote by $\mathbb{R}$ the set of real numbers and by $\mathbb{R}^n$ the set of column $n$-vectors with real components. We call $\mathbb{R}^n$ an $n$-dimensional real vector space. We commonly denote elements of $\mathbb{R}^n$ by lowercase bold letters (e.g., $x$). The components of $x \in \mathbb{R}^n$ are denoted $x_1, \dots, x_n$.

We define a row $n$-vector as $[a_1, a_2, \dots, a_n]$. The transpose of a given column vector $a$ is a row vector with corresponding elements, denoted $a^T$. For example, if
$$a = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix},$$
then $a^T = [a_1, a_2, \dots, a_n]$. Equivalently, we may write $a = [a_1, a_2, \dots, a_n]^T$. Throughout the text we adopt the convention that the term vector (without the qualifier row or column) refers to a column vector.

Two vectors $a = [a_1, a_2, \dots, a_n]^T$ and $b = [b_1, b_2, \dots, b_n]^T$ are equal if $a_i = b_i$, $i = 1, 2, \dots, n$.

The sum of the vectors $a$ and $b$, denoted $a + b$, is the vector
$$a + b = [a_1 + b_1, a_2 + b_2, \dots, a_n + b_n]^T.$$
The operation of addition of vectors has the following properties:

1. The operation is commutative: $a + b = b + a$.
2. The operation is associative: $(a + b) + c = a + (b + c)$.
3. There is a zero vector $0 = [0, 0, \dots, 0]^T$ such that $a + 0 = 0 + a = a$.

The vector $[a_1 - b_1, a_2 - b_2, \dots, a_n - b_n]^T$ is called the difference between $a$ and $b$ and is denoted $a - b$. The vector $0 - b$ is denoted $-b$. Note that
$$b + (a - b) = a, \qquad -(a - b) = b - a.$$

The vector $b - a$ is the unique solution of the vector equation $a + x = b$. Indeed, suppose that $x = [x_1, x_2, \dots, x_n]^T$ is a solution to $a + x = b$. Then,
$$a_1 + x_1 = b_1, \quad a_2 + x_2 = b_2, \quad \dots, \quad a_n + x_n = b_n,$$
and thus $x = b - a$.

We define an operation of multiplication of a vector $a \in \mathbb{R}^n$ by a real scalar $\alpha \in \mathbb{R}$ as
$$\alpha a = [\alpha a_1, \alpha a_2, \dots, \alpha a_n]^T.$$
This operation has the following properties:
1. The operation is distributive: for any real scalars $\alpha$ and $\beta$,
$$\alpha(a + b) = \alpha a + \alpha b, \qquad (\alpha + \beta)a = \alpha a + \beta a.$$
2. The operation is associative: $\alpha(\beta a) = (\alpha\beta)a$.
3. The scalar 1 satisfies $1a = a$.
4. Any scalar $\alpha$ satisfies $\alpha 0 = 0$.
5. The scalar 0 satisfies $0a = 0$.
6. The scalar $-1$ satisfies $(-1)a = -a$.

Note that $\alpha a = 0$ if and only if $\alpha = 0$ or $a = 0$. To see this, observe that $\alpha a = 0$ is equivalent to $\alpha a_1 = \alpha a_2 = \cdots = \alpha a_n = 0$. If $\alpha = 0$ or $a = 0$, then $\alpha a = 0$. If $a \ne 0$, then at least one of its components $a_k \ne 0$. For this component, $\alpha a_k = 0$, and hence we must have $\alpha = 0$. Similar arguments can be applied to the case when $\alpha \ne 0$.

A set of vectors $\{a_1, \dots, a_k\}$ is said to be linearly independent if the equality
$$\alpha_1 a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k = 0$$
implies that all coefficients $\alpha_i$, $i = 1, \dots, k$, are equal to zero. A set of the vectors $\{a_1, \dots, a_k\}$ is linearly dependent if it is not linearly independent. Note that the set composed of the single vector 0 is linearly dependent, for if $\alpha \ne 0$, then $\alpha 0 = 0$. In fact, any set of vectors containing the vector 0 is linearly dependent. A set composed of a single nonzero vector $a \ne 0$ is linearly independent since $\alpha a = 0$ implies that $\alpha = 0$.

A vector $a$ is said to be a linear combination of vectors $a_1, a_2, \dots, a_k$ if there are scalars $\alpha_1, \dots, \alpha_k$ such that
$$a = \alpha_1 a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k.$$

Proposition 2.1 A set of vectors $\{a_1, a_2, \dots, a_k\}$ is linearly dependent if and only if one of the vectors from the set is a linear combination of the remaining vectors. □

Proof. ⇒: If $\{a_1, a_2, \dots, a_k\}$ is linearly dependent, then
$$\alpha_1 a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k = 0,$$
where at least one of the scalars $\alpha_i \ne 0$, whence
$$a_i = -\frac{\alpha_1}{\alpha_i}a_1 - \cdots - \frac{\alpha_{i-1}}{\alpha_i}a_{i-1} - \frac{\alpha_{i+1}}{\alpha_i}a_{i+1} - \cdots - \frac{\alpha_k}{\alpha_i}a_k.$$
⇐: Suppose that
$$a_1 = \alpha_2 a_2 + \alpha_3 a_3 + \cdots + \alpha_k a_k;$$
then
$$(-1)a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k = 0.$$
Because the first scalar is nonzero, the set of vectors $\{a_1, a_2, \dots, a_k\}$ is linearly dependent. The same argument holds if $a_i$, $i = 2, \dots, k$, is a linear combination of the remaining vectors. ■

A subset $V$ of $\mathbb{R}^n$ is called a subspace of $\mathbb{R}^n$ if $V$ is closed under the operations of vector addition and scalar multiplication. That is, if $a$ and $b$ are vectors in $V$, then the vectors $a + b$ and $\alpha a$ are also in $V$ for every scalar $\alpha$. Every subspace contains the zero vector 0, for if $a$ is an element of the subspace, so is $(-1)a = -a$. Hence, $a - a = 0$ also belongs to the subspace.

Let $a_1, a_2, \dots, a_k$ be arbitrary vectors in $\mathbb{R}^n$. The set of all their linear combinations is called the span of $a_1, a_2, \dots, a_k$ and is denoted
$$\operatorname{span}[a_1, a_2, \dots, a_k] = \left\{ \sum_{i=1}^k \alpha_i a_i : \alpha_1, \dots, \alpha_k \in \mathbb{R} \right\}.$$
Given a vector $a$, the subspace $\operatorname{span}[a]$ is composed of the vectors $\alpha a$, where $\alpha$ is an arbitrary real number ($\alpha \in \mathbb{R}$). Also observe that if $a$ is a linear combination of $a_1, a_2, \dots, a_k$, then
$$\operatorname{span}[a_1, a_2, \dots, a_k, a] = \operatorname{span}[a_1, a_2, \dots, a_k].$$
The span of any set of vectors is a subspace.

Given a subspace $V$, any set of linearly independent vectors $\{a_1, a_2, \dots, a_k\} \subset V$ such that $V = \operatorname{span}[a_1, a_2, \dots, a_k]$ is referred to as a basis of the subspace $V$. All bases of a subspace $V$ contain the same number of vectors. This number is called the dimension of $V$, denoted $\dim V$.
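To make these definitions concrete, here is a small MATLAB check (ours, not the book's; the vectors are arbitrary). It uses the fact, developed formally in Section 2.2, that the maximal number of linearly independent columns of a matrix is its rank, which MATLAB's built-in rank function computes.

```matlab
% Columns of A are the vectors a1, a2, a3 in R^3; since a3 = a1 + a2,
% the set {a1, a2, a3} is linearly dependent.
A = [1 0 1;
     2 1 3;
     0 1 1];
r = rank(A);            % maximal number of linearly independent columns
disp(r);                % 2: dim span[a1, a2, a3] = 2
disp(r == size(A, 2));  % 0 (false): the columns are not independent
```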
Proposition 2.2 If $\{a_1, a_2, \dots, a_k\}$ is a basis of $V$, then any vector $a \in V$ can be represented uniquely as
$$a = \alpha_1 a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k,$$
where $\alpha_i \in \mathbb{R}$, $i = 1, 2, \dots, k$. □

Proof. To prove the uniqueness of the representation of $a$ in terms of the basis vectors, assume that
$$a = \alpha_1 a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k$$
and
$$a = \beta_1 a_1 + \beta_2 a_2 + \cdots + \beta_k a_k.$$
We now show that $\alpha_i = \beta_i$, $i = 1, \dots, k$. We have
$$\alpha_1 a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k = \beta_1 a_1 + \beta_2 a_2 + \cdots + \beta_k a_k,$$
or
$$(\alpha_1 - \beta_1)a_1 + (\alpha_2 - \beta_2)a_2 + \cdots + (\alpha_k - \beta_k)a_k = 0.$$
Because the set $\{a_i : i = 1, 2, \dots, k\}$ is linearly independent, $\alpha_1 - \beta_1 = \alpha_2 - \beta_2 = \cdots = \alpha_k - \beta_k = 0$, which implies that $\alpha_i = \beta_i$, $i = 1, \dots, k$. ■

Suppose that we are given a basis $\{a_1, a_2, \dots, a_k\}$ of $V$ and a vector $a \in V$ such that
$$a = \alpha_1 a_1 + \alpha_2 a_2 + \cdots + \alpha_k a_k.$$
The coefficients $\alpha_i$, $i = 1, \dots, k$, are called the coordinates of $a$ with respect to the basis $\{a_1, a_2, \dots, a_k\}$.

The natural basis for $\mathbb{R}^n$ is the set of vectors
$$e_1 = \begin{bmatrix}1\\0\\0\\\vdots\\0\end{bmatrix}, \quad e_2 = \begin{bmatrix}0\\1\\0\\\vdots\\0\end{bmatrix}, \quad \dots, \quad e_n = \begin{bmatrix}0\\0\\\vdots\\0\\1\end{bmatrix}.$$
The reason for calling these vectors the natural basis is that
$$x = \begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix} = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n.$$

We can similarly define complex vector spaces. For this, let $\mathbb{C}$ denote the set of complex numbers and $\mathbb{C}^n$ the set of column $n$-vectors with complex components. As the reader can easily verify, the set $\mathbb{C}^n$ has properties similar to those of $\mathbb{R}^n$, where scalars can take complex values.

A matrix is a rectangular array of numbers, commonly denoted by uppercase bold letters (e.g., $A$). A matrix with $m$ rows and $n$ columns is called an $m \times n$ matrix, and we write
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.$$
The real number $a_{ij}$ located in the $i$th row and $j$th column is called the $(i, j)$th entry. We can think of $A$ in terms of its $n$ columns, each of which is a column vector in $\mathbb{R}^m$. Alternatively, we can think of $A$ in terms of its $m$ rows, each of which is a row $n$-vector.

Consider the $m \times n$ matrix $A$ above. The transpose of matrix $A$, denoted $A^T$, is the $n \times m$ matrix
$$A^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix};$$
that is, the columns of $A$ are the rows of $A^T$, and vice versa.

Let the symbol $\mathbb{R}^{m \times n}$ denote the set of $m \times n$ matrices whose entries are real numbers. We treat column vectors in $\mathbb{R}^n$ as elements of $\mathbb{R}^{n \times 1}$. Similarly, we treat row $n$-vectors as elements of $\mathbb{R}^{1 \times n}$. Accordingly, vector transposition is simply a special case of matrix transposition, and we will no longer distinguish between the two. Note that there is a slight inconsistency in the notation of row vectors when identified as $1 \times n$ matrices: We separate the components of the row vector with commas, whereas in matrix notation we do not generally use commas. However, the use of commas in separating elements in a row helps to clarify their separation. We use such commas even in separating matrices arranged in a horizontal row.

2.2 Rank of a Matrix

Consider the $m \times n$ matrix
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}.$$
Let us denote the $k$th column of $A$ by $a_k$:
$$a_k = \begin{bmatrix} a_{1k} \\ a_{2k} \\ \vdots \\ a_{mk} \end{bmatrix}.$$
The maximal number of linearly independent columns of $A$ is called the rank of the matrix $A$, denoted $\operatorname{rank} A$. Note that $\operatorname{rank} A$ is the dimension of $\operatorname{span}[a_1, \dots, a_n]$.

Proposition 2.3 The rank of a matrix $A$ is invariant under the following operations:

1. Multiplication of the columns of $A$ by nonzero scalars.
2. Interchange of the columns.
3. Addition to a given column of a linear combination of other columns. □

Proof. 1. Let $b_k = \alpha_k a_k$, where $\alpha_k \ne 0$, $k = 1, \dots, n$, and let $B = [b_1, b_2, \dots, b_n]$. Obviously,
$$\operatorname{span}[a_1, a_2, \dots, a_n] = \operatorname{span}[b_1, b_2, \dots, b_n],$$
and thus $\operatorname{rank} A = \operatorname{rank} B$.

2. The number of linearly independent vectors does not depend on their order.
3. Let
$$b_1 = a_1 + c_2 a_2 + \cdots + c_n a_n, \quad b_2 = a_2, \quad \dots, \quad b_n = a_n.$$
So, for any $\alpha_1, \dots, \alpha_n$,
$$\alpha_1 b_1 + \alpha_2 b_2 + \cdots + \alpha_n b_n = \alpha_1 a_1 + (\alpha_2 + \alpha_1 c_2)a_2 + \cdots + (\alpha_n + \alpha_1 c_n)a_n,$$
and hence
$$\operatorname{span}[b_1, b_2, \dots, b_n] \subset \operatorname{span}[a_1, a_2, \dots, a_n].$$
On the other hand,
$$a_1 = b_1 - c_2 b_2 - \cdots - c_n b_n, \quad a_2 = b_2, \quad \dots, \quad a_n = b_n.$$
Hence,
$$\operatorname{span}[a_1, a_2, \dots, a_n] \subset \operatorname{span}[b_1, b_2, \dots, b_n].$$
Therefore, $\operatorname{rank} A = \operatorname{rank} B$. ■

A matrix $A$ is said to be square if the number of its rows is equal to the number of its columns (i.e., it is $n \times n$). Associated with each square matrix $A$ is a scalar called the determinant of the matrix $A$, denoted $\det A$ or $|A|$. The determinant of a square matrix is a function of its columns and has the following properties:

1. The determinant of the matrix $A = [a_1, a_2, \dots, a_n]$ is a linear function of each column; that is,
$$\det[a_1, \dots, a_{k-1}, \alpha a_k^{(1)} + \beta a_k^{(2)}, a_{k+1}, \dots, a_n] = \alpha \det[a_1, \dots, a_{k-1}, a_k^{(1)}, a_{k+1}, \dots, a_n] + \beta \det[a_1, \dots, a_{k-1}, a_k^{(2)}, a_{k+1}, \dots, a_n]$$
for each $\alpha, \beta \in \mathbb{R}$, $a_k^{(1)}, a_k^{(2)} \in \mathbb{R}^n$.

2. If for some $k$ we have $a_k = a_{k+1}$, then
$$\det A = \det[a_1, \dots, a_k, a_{k+1}, \dots, a_n] = \det[a_1, \dots, a_k, a_k, \dots, a_n] = 0.$$

3. Let
$$I_n = [e_1, e_2, \dots, e_n] = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix},$$
where $\{e_1, \dots, e_n\}$ is the natural basis for $\mathbb{R}^n$. Then, $\det I_n = 1$.

Note that if $\alpha = \beta = 0$ in property 1, then
$$\det[a_1, \dots, a_{k-1}, 0, a_{k+1}, \dots, a_n] = 0.$$
Thus, if one of the columns is 0, then the determinant is equal to zero.

The determinant does not change its value if we add to a column another column multiplied by a scalar. This follows from properties 1 and 2, as shown below:
$$\det[a_1, \dots, a_{k-1}, a_k + \alpha a_j, a_{k+1}, \dots, a_j, \dots, a_n] = \det[a_1, \dots, a_{k-1}, a_k, a_{k+1}, \dots, a_j, \dots, a_n] + \alpha\det[a_1, \dots, a_{k-1}, a_j, a_{k+1}, \dots, a_j, \dots, a_n] = \det[a_1, \dots, a_n].$$

However, the determinant changes its sign if we interchange columns. To show this property, note that
$$\begin{aligned}
\det[a_1, \dots, a_{k-1}, a_k, a_{k+1}, \dots, a_n] &= \det[a_1, \dots, a_k + a_{k+1}, a_{k+1}, \dots, a_n] \\
&= \det[a_1, \dots, a_k + a_{k+1}, a_{k+1} - (a_k + a_{k+1}), \dots, a_n] \\
&= \det[a_1, \dots, a_k + a_{k+1}, -a_k, \dots, a_n] \\
&= -\det[a_1, \dots, a_k + a_{k+1}, a_k, \dots, a_n] \\
&= -(\det[a_1, \dots, a_k, a_k, \dots, a_n] + \det[a_1, \dots, a_{k+1}, a_k, \dots, a_n]) \\
&= -\det[a_1, \dots, a_{k+1}, a_k, \dots, a_n].
\end{aligned}$$

A $p$th-order minor of an $m \times n$ matrix $A$, with $p \le \min\{m, n\}$, is the determinant of a $p \times p$ matrix obtained from $A$ by deleting $m - p$ rows and $n - p$ columns. (The notation $\min\{m, n\}$ represents the smaller of $m$ and $n$.)

We can use minors to investigate the rank of a matrix. In particular, we have the following proposition.

Proposition 2.4 If an $m \times n$ ($m \ge n$) matrix $A$ has a nonzero $n$th-order minor, then the columns of $A$ are linearly independent; that is, $\operatorname{rank} A = n$. □

Proof. Suppose that $A$ has a nonzero $n$th-order minor. Without loss of generality, we assume that the $n$th-order minor corresponding to the first $n$ rows of $A$ is nonzero. Let $x_i$, $i = 1, \dots, n$, be scalars such that
$$x_1 a_1 + x_2 a_2 + \cdots + x_n a_n = 0.$$
The vector equality above is equivalent to the following set of $m$ equations:
$$a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = 0,$$
$$\vdots$$
$$a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n = 0.$$
For $i = 1, \dots, n$, let
$$\bar a_i = \begin{bmatrix} a_{1i} \\ \vdots \\ a_{ni} \end{bmatrix}.$$
Then, $x_1 \bar a_1 + \cdots + x_n \bar a_n = 0$. The $n$th-order minor is $\det[\bar a_1, \bar a_2, \dots, \bar a_n]$, assumed to be nonzero. From the properties of determinants it follows that the columns $\bar a_1, \bar a_2, \dots, \bar a_n$ are linearly independent. Therefore, all $x_i = 0$, $i = 1, \dots, n$. Hence, the columns $a_1, a_2, \dots, a_n$ are linearly independent. ■

From the above it follows that if there is a nonzero minor, then the columns associated with this nonzero minor are linearly independent. If a matrix $A$ has an $r$th-order minor $|M|$ with the properties (i) $|M| \ne 0$ and (ii) any minor of $A$ that is formed by adding a row and a column of $A$ to $M$ is zero, then $\operatorname{rank} A = r$. Thus, the rank of a matrix is equal to the highest order of its nonzero minor(s).
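The characterization of rank through minors can be tested numerically. The MATLAB sketch below is illustrative only (the brute-force loop, the example matrix, and the tolerance 1e-10 are our own choices): it finds the highest order of a nonzero minor by enumeration and compares the result with MATLAB's built-in rank.

```matlab
% Rank as the highest order of a nonzero minor.
A = [1 2 3;
     2 4 6;    % row 2 = 2 * row 1, so rank A = 2
     1 0 1];
[m, n] = size(A);
r = 0;
for p = min(m, n):-1:1
    rowsets = nchoosek(1:m, p);          % all choices of p rows
    colsets = nchoosek(1:n, p);          % all choices of p columns
    for i = 1:size(rowsets, 1)
        for j = 1:size(colsets, 1)
            if abs(det(A(rowsets(i,:), colsets(j,:)))) > 1e-10
                r = p;                   % found a nonzero p-th order minor
            end
        end
    end
    if r > 0, break; end
end
fprintf('%d %d\n', r, rank(A));          % both print 2
```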
A nonsingular (or invertible) matrix is a square matrix whose determinant is nonzero. Suppose that $A$ is an $n \times n$ square matrix. Then, $A$ is nonsingular if and only if there is another $n \times n$ matrix $B$ such that
$$AB = BA = I_n,$$
where $I_n$ denotes the $n \times n$ identity matrix:
$$I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}.$$
We call the matrix $B$ above the inverse matrix of $A$, and write $B = A^{-1}$.

2.3 Linear Equations

Suppose that we are given $m$ equations in $n$ unknowns of the form
$$a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1,$$
$$a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2,$$
$$\vdots$$
$$a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n = b_m.$$
We can represent the set of equations above as a vector equation
$$x_1 a_1 + x_2 a_2 + \cdots + x_n a_n = b,$$
where
$$a_j = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}.$$
Associated with this system of equations is the matrix
$$A = [a_1, a_2, \dots, a_n],$$
and an augmented matrix
$$[A, b] = [a_1, a_2, \dots, a_n, b].$$
We can also represent the system of equations above as
$$Ax = b,$$
where $x = [x_1, \dots, x_n]^T$.

Theorem 2.1 The system of equations $Ax = b$ has a solution if and only if
$$\operatorname{rank} A = \operatorname{rank}[A, b]. \quad □$$

Proof. ⇒: Suppose that the system $Ax = b$ has a solution. Therefore, $b$ is a linear combination of the columns of $A$; that is, there exist $x_1, \dots, x_n$ such that $x_1 a_1 + x_2 a_2 + \cdots + x_n a_n = b$. It follows that $b$ belongs to $\operatorname{span}[a_1, \dots, a_n]$, and hence
$$\operatorname{rank} A = \dim \operatorname{span}[a_1, \dots, a_n] = \dim \operatorname{span}[a_1, \dots, a_n, b] = \operatorname{rank}[A, b].$$

⇐: Suppose that $\operatorname{rank} A = \operatorname{rank}[A, b] = r$. Thus, we have $r$ linearly independent columns of $A$. Without loss of generality, let $a_1, a_2, \dots, a_r$ be these columns. Therefore, $a_1, a_2, \dots, a_r$ are also linearly independent columns of the matrix $[A, b]$. Because $\operatorname{rank}[A, b] = r$, the remaining columns of $[A, b]$ can be expressed as linear combinations of $a_1, a_2, \dots, a_r$. In particular, $b$ can be expressed as a linear combination of these columns. Hence, there exist $x_1, \dots, x_n$ such that $x_1 a_1 + x_2 a_2 + \cdots + x_n a_n = b$. ■

Theorem 2.2 Consider the equation $Ax = b$, where $A \in \mathbb{R}^{m \times n}$ and $\operatorname{rank} A = m$. A solution to $Ax = b$ can be obtained by assigning arbitrary values for $n - m$ variables and solving for the remaining ones. □

Proof. We have $\operatorname{rank} A = m$, and therefore we can find $m$ linearly independent columns of $A$. Without loss of generality, let $a_1, a_2, \dots, a_m$ be such columns. Rewrite the equation $Ax = b$ as
$$x_1 a_1 + x_2 a_2 + \cdots + x_m a_m = b - x_{m+1}a_{m+1} - \cdots - x_n a_n.$$
Assign to $x_{m+1}, x_{m+2}, \dots, x_n$ arbitrary values, say
$$x_{m+1} = d_{m+1}, \quad x_{m+2} = d_{m+2}, \quad \dots, \quad x_n = d_n,$$
and let
$$B = [a_1, a_2, \dots, a_m] \in \mathbb{R}^{m \times m}.$$
Note that $\det B \ne 0$. We can represent the system of equations above as
$$B\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} = b - d_{m+1}a_{m+1} - \cdots - d_n a_n.$$
The matrix $B$ is invertible, and therefore we can solve for $[x_1, x_2, \dots, x_m]^T$. Specifically,
$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} = B^{-1}(b - d_{m+1}a_{m+1} - \cdots - d_n a_n). \quad ■$$
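Theorems 2.1 and 2.2 translate directly into a computation. The MATLAB sketch below (ours; it uses the system of Exercise 2.6 as its data) first checks solvability by comparing $\operatorname{rank} A$ with $\operatorname{rank}[A, b]$ and then constructs a particular solution by fixing $x_3$ and $x_4$ and solving for $x_1$ and $x_2$.

```matlab
% The system of Exercise 2.6:
%   x1 + x2 + 2*x3 + x4 =  1
%   x1 - 2*x2      - x4 = -2
A = [1  1 2  1;
     1 -2 0 -1];
b = [1; -2];
assert(rank(A) == rank([A b]));   % Theorem 2.1: a solution exists

d3 = 0; d4 = 0;                   % arbitrary values for x3 and x4
B = A(:, 1:2);                    % B = [a1, a2]; det B = -3, invertible
x12 = B \ (b - d3*A(:,3) - d4*A(:,4));
x = [x12; d3; d4];
disp(A*x);                        % reproduces b
```

Different choices of d3 and d4 produce different members of the family of solutions, which is exactly the "general solution" asked for in Exercise 2.6.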
2.4 Inner Products and Norms

The absolute value of a real number $a$, denoted $|a|$, is defined as
$$|a| = \begin{cases} a & \text{if } a \ge 0 \\ -a & \text{if } a < 0. \end{cases}$$
The following formulas hold:

1. $|a| = |-a|$.
2. $-|a| \le a \le |a|$.
3. $|a + b| \le |a| + |b|$.
4. $||a| - |b|| \le |a - b| \le |a| + |b|$.
5. $|ab| = |a||b|$.
6. $|a| \le c$ and $|b| \le d$ imply that $|a + b| \le c + d$.
7. The inequality $|a| < b$ is equivalent to $-b < a < b$ (i.e., $a < b$ and $-a < b$). The same holds if we replace every occurrence of "$<$" by "$\le$."
8. The inequality $|a| > b$ is equivalent to $a > b$ or $-a > b$. The same holds if we replace every occurrence of "$>$" by "$\ge$."

For $x, y \in \mathbb{R}^n$, we define the Euclidean inner product by
$$\langle x, y\rangle = \sum_{i=1}^n x_i y_i = x^T y.$$
The inner product is a real-valued function $\langle \cdot, \cdot\rangle : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ having the following properties:

1. Positivity: $\langle x, x\rangle \ge 0$, $\langle x, x\rangle = 0$ if and only if $x = 0$.
2. Symmetry: $\langle x, y\rangle = \langle y, x\rangle$.
3. Additivity: $\langle x + y, z\rangle = \langle x, z\rangle + \langle y, z\rangle$.
4. Homogeneity: $\langle rx, y\rangle = r\langle x, y\rangle$ for every $r \in \mathbb{R}$.

The properties of additivity and homogeneity in the second vector also hold; that is,
$$\langle x, y + z\rangle = \langle x, y\rangle + \langle x, z\rangle, \qquad \langle x, ry\rangle = r\langle x, y\rangle \text{ for every } r \in \mathbb{R}.$$
The above can be shown using properties 2 to 4. Indeed,
$$\langle x, y + z\rangle = \langle y + z, x\rangle = \langle y, x\rangle + \langle z, x\rangle = \langle x, y\rangle + \langle x, z\rangle$$
and
$$\langle x, ry\rangle = \langle ry, x\rangle = r\langle y, x\rangle = r\langle x, y\rangle.$$
It is possible to define other real-valued functions on $\mathbb{R}^n \times \mathbb{R}^n$ that satisfy properties 1 to 4 above (see Exercise 2.8). Many results involving the Euclidean inner product also hold for these other forms of inner products.

The vectors $x$ and $y$ are said to be orthogonal if $\langle x, y\rangle = 0$.

The Euclidean norm of a vector $x$ is defined as
$$\|x\| = \sqrt{\langle x, x\rangle} = \sqrt{x^T x}.$$

Theorem 2.3 Cauchy-Schwarz Inequality. For any two vectors $x$ and $y$ in $\mathbb{R}^n$, the Cauchy-Schwarz inequality
$$|\langle x, y\rangle| \le \|x\|\|y\|$$
holds. Furthermore, equality holds if and only if $x = \alpha y$ for some $\alpha \in \mathbb{R}$. □

Proof. First assume that $x$ and $y$ are unit vectors; that is, $\|x\| = \|y\| = 1$. Then,
$$0 \le \|x - y\|^2 = \langle x - y, x - y\rangle = \|x\|^2 - 2\langle x, y\rangle + \|y\|^2 = 2 - 2\langle x, y\rangle,$$
or
$$\langle x, y\rangle \le 1,$$
with equality holding if and only if $x = y$. Next, assuming that neither $x$ nor $y$ is zero (for the inequality obviously holds if one of them is zero), we replace $x$ and $y$ by the unit vectors $x/\|x\|$ and $y/\|y\|$. Then, apply property 4 to get
$$\langle x, y\rangle \le \|x\|\|y\|.$$
Now replace $x$ by $-x$ and again apply property 4 to get
$$-\langle x, y\rangle \le \|x\|\|y\|.$$
The last two inequalities imply the absolute value inequality. Equality holds if and only if $x/\|x\| = \pm y/\|y\|$; that is, $x = \alpha y$ for some $\alpha \in \mathbb{R}$. ■

The Euclidean norm of a vector $\|x\|$ has the following properties:

1. Positivity: $\|x\| \ge 0$, $\|x\| = 0$ if and only if $x = 0$.
2. Homogeneity: $\|rx\| = |r|\|x\|$, $r \in \mathbb{R}$.
3. Triangle inequality: $\|x + y\| \le \|x\| + \|y\|$.

The triangle inequality can be proved using the Cauchy-Schwarz inequality, as follows. We have
$$\|x + y\|^2 = \|x\|^2 + 2\langle x, y\rangle + \|y\|^2.$$
By the Cauchy-Schwarz inequality,
$$\|x + y\|^2 \le \|x\|^2 + 2\|x\|\|y\| + \|y\|^2 = (\|x\| + \|y\|)^2,$$
and therefore
$$\|x + y\| \le \|x\| + \|y\|.$$
Note that if $x$ and $y$ are orthogonal, $\langle x, y\rangle = 0$, then
$$\|x + y\|^2 = \|x\|^2 + \|y\|^2,$$
which is the Pythagorean theorem for $\mathbb{R}^n$.

The Euclidean norm is an example of a general vector norm, which is any function satisfying the three properties of positivity, homogeneity, and triangle inequality. Other examples of vector norms on $\mathbb{R}^n$ include the 1-norm, defined by $\|x\|_1 = |x_1| + \cdots + |x_n|$, and the $\infty$-norm, defined by $\|x\|_\infty = \max_i |x_i|$ (where the notation $\max_i$ represents the largest over all the possible index values of $i$). The Euclidean norm is often referred to as the 2-norm, and denoted $\|x\|_2$. The norms above are special cases of the $p$-norm, given by
$$\|x\|_p = \begin{cases} (|x_1|^p + \cdots + |x_n|^p)^{1/p} & \text{if } 1 \le p < \infty \\ \max\{|x_1|, \dots, |x_n|\} & \text{if } p = \infty. \end{cases}$$
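Each of the inequalities above is easy to test numerically. A brief MATLAB check (illustrative only; the vectors are arbitrary):

```matlab
x = [1; -2; 3];
y = [4; 0; -1];

% Cauchy-Schwarz: |<x, y>| <= ||x|| ||y|| (2-norm).
assert(abs(x'*y) <= norm(x)*norm(y));

% Triangle inequality for the 1-, 2-, and infinity-norms.
for p = [1 2 Inf]
    assert(norm(x + y, p) <= norm(x, p) + norm(y, p));
end

% Equality in Cauchy-Schwarz when x = alpha*y:
alpha = -2;
disp(abs((alpha*y)'*y) - norm(alpha*y)*norm(y));   % zero up to roundoff
```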
We can use norms to define the notion of a continuous function, as follows. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $x$ if for all $\varepsilon > 0$, there exists $\delta > 0$ such that $\|y - x\| < \delta \Rightarrow \|f(y) - f(x)\| < \varepsilon$. If the function $f$ is continuous at every point in $\mathbb{R}^n$, we say that it is continuous on $\mathbb{R}^n$. Note that $f = [f_1, \dots, f_m]^T$ is continuous if and only if each component $f_i$, $i = 1, \dots, m$, is continuous.

For the complex vector space $\mathbb{C}^n$, we define an inner product $\langle x, y\rangle$ to be $\sum_{i=1}^n x_i \bar y_i$, where the bar denotes complex conjugation. The inner product on $\mathbb{C}^n$ is a complex-valued function having the following properties:

1. $\langle x, x\rangle \ge 0$, $\langle x, x\rangle = 0$ if and only if $x = 0$.
2. $\langle x, y\rangle = \overline{\langle y, x\rangle}$.
3. $\langle x + y, z\rangle = \langle x, z\rangle + \langle y, z\rangle$.
4. $\langle rx, y\rangle = r\langle x, y\rangle$, where $r \in \mathbb{C}$.

From properties 1 to 4, we can deduce other properties, such as
$$\langle x, r_1 y + r_2 z\rangle = \bar r_1 \langle x, y\rangle + \bar r_2 \langle x, z\rangle,$$
where $r_1, r_2 \in \mathbb{C}$. For $\mathbb{C}^n$, the vector norm can similarly be defined by $\|x\|^2 = \langle x, x\rangle$. For more information, consult Gel'fand [47].

EXERCISES

2.1 Let $A \in \mathbb{R}^{m \times n}$ and $\operatorname{rank} A = m$. Show that $m \le n$.

2.2 Prove that the system $Ax = b$, $A \in \mathbb{R}^{m \times n}$, has a unique solution if and only if $\operatorname{rank} A = \operatorname{rank}[A, b] = n$.

2.3 (Adapted from [38].) We know that if $k \ge n + 1$, then the vectors $a_1, a_2, \dots, a_k \in \mathbb{R}^n$ are linearly dependent; that is, there exist scalars $\alpha_1, \dots, \alpha_k$ such that at least one $\alpha_i \ne 0$ and $\sum_{i=1}^k \alpha_i a_i = 0$. Show that if $k \ge n + 2$, then there exist scalars $\alpha_1, \dots, \alpha_k$ such that at least one $\alpha_i \ne 0$, $\sum_{i=1}^k \alpha_i a_i = 0$, and $\sum_{i=1}^k \alpha_i = 0$. Hint: Introduce the vectors $\bar a_i = [1, a_i^T]^T \in \mathbb{R}^{n+1}$, $i = 1, \dots, k$, and use the fact that any $n + 2$ vectors in $\mathbb{R}^{n+1}$ are linearly dependent.

2.4 Consider an $m \times m$ matrix $M$ that has the block form
$$M = \begin{bmatrix} M_{m-k,k} & I_{m-k} \\ M_{k,k} & O_{k,m-k} \end{bmatrix},$$
where $M_{k,k}$ is $k \times k$, $M_{m-k,k}$ is $(m-k) \times k$, $I_{m-k}$ is the $(m-k) \times (m-k)$ identity matrix, and $O_{k,m-k}$ is the $k \times (m-k)$ zero matrix.
a. Show that $|\det M| = |\det M_{k,k}|$. This result is relevant to the proof of Proposition 19.1.
b. Under certain assumptions, the following stronger result holds:
$$\det M = \det(-M_{k,k}).$$
Identify cases where this is true, and show that it is false in general.

2.5 It is well known that for any $a, b, c, d \in \mathbb{C}$,
$$\det \begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc.$$
Suppose now that $A$, $B$, $C$, and $D$ are real or complex square matrices of the same size. Give a sufficient condition under which
$$\det \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \det(AD - BC).$$
An interesting discussion on determinants of block matrices is provided in [121].

2.6 Consider the following system of linear equations:
$$x_1 + x_2 + 2x_3 + x_4 = 1,$$
$$x_1 - 2x_2 - x_4 = -2.$$
Use Theorem 2.1 to check if the system has a solution. Then, use the method of Theorem 2.2 to find a general solution to the system.

2.7 Prove the seven properties of the absolute value of a real number.

2.8 Consider the function $\langle \cdot, \cdot\rangle_2 : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$, defined by
$$\langle x, y\rangle_2 = 2x_1y_1 + 3x_2y_1 + 3x_1y_2 + 5x_2y_2,$$
where $x = [x_1, x_2]^T$ and $y = [y_1, y_2]^T$. Show that $\langle \cdot, \cdot\rangle_2$ satisfies conditions 1 to 4 for inner products. Note: This is a special case of Exercise 3.21.

2.9 Show that for any two vectors $x, y \in \mathbb{R}^n$, $|\|x\| - \|y\|| \le \|x - y\|$. Hint: Write $x = (x - y) + y$, and use the triangle inequality. Do the same for $y$.
2.10 Use Exercise 2.9 to show that the norm $\|\cdot\|$ is a uniformly continuous function; that is, for all $\varepsilon > 0$, there exists $\delta > 0$ such that if $\|x - y\| < \delta$, then $|\|x\| - \|y\|| < \varepsilon$.

CHAPTER 3

TRANSFORMATIONS

3.1 Linear Transformations

A function $\mathcal{L} : \mathbb{R}^n \to \mathbb{R}^m$ is called a linear transformation if:

1. $\mathcal{L}(\alpha x) = \alpha\mathcal{L}(x)$ for every $x \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$.
2. $\mathcal{L}(x + y) = \mathcal{L}(x) + \mathcal{L}(y)$ for every $x, y \in \mathbb{R}^n$.

If we fix the bases for $\mathbb{R}^n$ and $\mathbb{R}^m$, then the linear transformation $\mathcal{L}$ can be represented by a matrix. Specifically, there exists $A \in \mathbb{R}^{m \times n}$ such that the following representation holds. Suppose that $x \in \mathbb{R}^n$ is a given vector, and $x'$ is the representation of $x$ with respect to the given basis for $\mathbb{R}^n$. If $y = \mathcal{L}(x)$, and $y'$ is the representation of $y$ with respect to the given basis for $\mathbb{R}^m$, then
$$y' = Ax'.$$
We call $A$ the matrix representation of $\mathcal{L}$ with respect to the given bases for $\mathbb{R}^n$ and $\mathbb{R}^m$. In the special case where we assume the natural bases for $\mathbb{R}^n$ and $\mathbb{R}^m$, the matrix representation $A$ satisfies $\mathcal{L}(x) = Ax$.

Let $\{e_1, e_2, \dots, e_n\}$ and $\{e'_1, e'_2, \dots, e'_n\}$ be two bases for $\mathbb{R}^n$. Define the matrix
$$T = [e'_1, e'_2, \dots, e'_n]^{-1}[e_1, e_2, \dots, e_n].$$
We call $T$ the transformation matrix from $\{e_1, e_2, \dots, e_n\}$ to $\{e'_1, e'_2, \dots, e'_n\}$. It is clear that
$$[e_1, e_2, \dots, e_n] = [e'_1, e'_2, \dots, e'_n]T;$$
that is, the $i$th column of $T$ is the vector of coordinates of $e_i$ with respect to the basis $\{e'_1, e'_2, \dots, e'_n\}$.

Fix a vector in $\mathbb{R}^n$. Let $x$ be the column of the coordinates of the vector with respect to $\{e_1, e_2, \dots, e_n\}$ and $x'$ the coordinates of the same vector with respect to $\{e'_1, e'_2, \dots, e'_n\}$. Then, we can show that $x' = Tx$ (see Exercise 3.1).

Consider a linear transformation $\mathcal{L} : \mathbb{R}^n \to \mathbb{R}^n$, and let $A$ be its representation with respect to $\{e_1, e_2, \dots, e_n\}$ and $B$ its representation with respect to $\{e'_1, e'_2, \dots, e'_n\}$. Let $y = Ax$ and $y' = Bx'$. Therefore, $y' = Ty = TAx$ and $y' = Bx' = BTx$, and hence $TA = BT$, or $A = T^{-1}BT$. Two $n \times n$ matrices $A$ and $B$ are similar if there exists a nonsingular matrix $T$ such that $A = T^{-1}BT$. In conclusion, similar matrices correspond to the same linear transformation with respect to different bases.
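The change-of-basis relations $x' = Tx$ and $A = T^{-1}BT$ can be verified numerically. In the MATLAB sketch below (ours; the map and the basis are arbitrary choices), $A$ is the representation of a linear map in the natural basis and $B$ its representation in a new basis stored as the columns of $E$:

```matlab
% A: representation of a linear map in the natural basis {e1, e2}.
A = [2 1;
     0 3];
E = [1 1;
     0 1];             % columns are the new basis {e1', e2'}
T = inv(E) * eye(2);   % T = [e1', e2']^{-1} [e1, e2]
B = T * A * inv(T);    % representation in the new basis: B = T A T^{-1}
x = [1; 2];            % coordinates with respect to {e1, e2}
disp(norm(T*(A*x) - B*(T*x)));   % zero: same map in both bases
```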
, n , there exist nonzero Vi, i = 1 , . . . , n, such that Avi = A ^ , i — 1 , . . . , n. We now prove the linear independence of {vi, v 2 , . . . , vn}. To do this, let c i , . . . , cn be scalars such that Σ™=ι CiVi — 0. We show that cz- = 0, i = 1 , . . . , n. Consider the matrix Z = (X2I - A)(X3I -A)··· (XnI - A). We first show that c\ = 0. Note that Zvn = ( λ 2 Ι - A)(X3I - A) · · · (A n _il - Λ ) ( λ η Ι - A ) u n = ( λ 2 Ι - Α)(λ 3 7 - A) · · · ( λ η - ΐ / - Α)(ληΤ7η - At>„) = 0 since Xnvn — A v n = 0. Repeating the argument above, we get Zvk = 0, fc = 2 , 3 , . . . , n . But Z v i = ( λ 2 Ι - A)(X3I - A) · · · (A n _il - Α ) ( λ η / - Α)νχ = ( λ 2 / - Α)(λ 3 7 - A) · · · (Ä n _!t;i - Α ν ι ) ( λ η - λχ) = ( λ 2 / - Α)(Χ3Ι - A)v! · · · ( λ η - ι - λι)(λ η - λι) = (λ 2 - λι)(λ 3 - λι) · · · (λ η _ι - λι)(λ η - λι)ι>ι. Using the equation above, we see that ( n \ i=l / n i=l = C\ZV\ = Ci(A 2 - λι)(λ3 - λ ι ) · · · ( λ η " λ ι ) « ι = 0. Because the λζ are distinct, it must follow that C\ = 0. Using similar arguments, we can show that all c2- must vanish, and therefore the set of eigenvectors {vi, i> 2 ,..., vn} is linearly independent. I Consider a basis formed by a linearly independent set of eigenvectors {vi, v 2 , . . . , vn}. With respect to this basis, the matrix A is diagonal [i.e., if dij is the (i, j ) t h element of A, then α^ = 0 for all i φ j]. Indeed, let T= [vi,v2,...,vn] 28 TRANSFORMATIONS Then, TAT-1=TA[vuv2,...,vn] = T[Av1,Av2,...,Avn] = Τ[λινι,λ2ν2,·..,ληυη] λι = T T θ" λ2 - i 0 Ι λη λι λ2 L° because TT1 Xn = I. A matrix A is symmetric if A = A T . Theorem 3.2 J4ZZ eigenvalues of a real symmetric matrix are real. Q Proof Let where x ^ 0. Taking the inner product of Ax with a? yields (Αχ,χ) = (Χχ,χ) = X(x,x). On the other hand, (Aa?,ir} = (a;, A a;) = (χ,Αχ) = (χ,Χχ) — X{x,x). The above follows from the definition of the inner product on C n . We note that (a?, x) is real and (x,x) > 0. Hence, X(x,x) = X(x,x) and Because (x, x) > 0, Thus, λ is real. (λ-λ)(ίυ,χ) =0. λ = λ. I Theorem 3.3 Any real symmetric n x n matrix has a set of n eigenvectors that are mutually orthogonal. Q ORTHOGONAL PROJECTIONS 29 Proof. We prove the result for the case when the n eigenvalues are distinct. For a general proof, see [62, p. 104]. Suppose that Av\ = Ai^i, Av2 = X2v2, where λι φ \2. Then, (Avi,v2) = (λινι,ν2) = λι(νι,ν2). Because A = A , (Av1,v2) = (vuATv2) = (vuAv2) = \2{vi,v2). Therefore, Al(Vl,V2> = A 2 ( V 1 , V 2 > . Because Ai φ \2, it follows that (vuv2) = 0. If A is symmetric, then a set of its eigenvectors forms an orthogonal basis for R n . If the basis {v\, v2,..., vn} is normalized so that each element has norm of unity, then defining the matrix T = [Vi,U2,...,«n], we have and hence TTT /TIT = I rrt—1 A matrix whose transpose is its inverse is said to be an orthogonal matrix. 3.3 Orthogonal Projections Recall that a subspace V of R n is a subset that is closed under the operations of vector addition and scalar multiplication. In other words, V is a subspace of R n if xi,x2 e V => α χ ι + βχ2 e V for all α,/3 G R. Furthermore, the dimension of a subspace V is equal to the maximum number of linearly independent vectors in V. If V is a subspace of R n , then the orthogonal complement of V, denoted V-1", consists of all vectors that are orthogonal to every vector in V. Thus, V± = {x: vTx = 0 for all v G V}. The orthogonal complement of V is also a subspace (see Exercise 3.7). 
3.3 Orthogonal Projections

Recall that a subspace $V$ of $\mathbb{R}^n$ is a subset that is closed under the operations of vector addition and scalar multiplication. In other words, $V$ is a subspace of $\mathbb{R}^n$ if
$$x_1, x_2 \in V \Rightarrow \alpha x_1 + \beta x_2 \in V$$
for all $\alpha, \beta \in \mathbb{R}$. Furthermore, the dimension of a subspace $V$ is equal to the maximum number of linearly independent vectors in $V$. If $V$ is a subspace of $\mathbb{R}^n$, then the orthogonal complement of $V$, denoted $V^\perp$, consists of all vectors that are orthogonal to every vector in $V$. Thus,
$$V^\perp = \{x : v^T x = 0 \text{ for all } v \in V\}.$$
The orthogonal complement of $V$ is also a subspace (see Exercise 3.7).

Together, $V$ and $V^\perp$ span $\mathbb{R}^n$ in the sense that every vector $x \in \mathbb{R}^n$ can be represented uniquely as
$$x = x_1 + x_2,$$
where $x_1 \in V$ and $x_2 \in V^\perp$. We call the representation above the orthogonal decomposition of $x$ (with respect to $V$). We say that $x_1$ and $x_2$ are orthogonal projections of $x$ onto the subspaces $V$ and $V^\perp$, respectively. We write $\mathbb{R}^n = V \oplus V^\perp$ and say that $\mathbb{R}^n$ is a direct sum of $V$ and $V^\perp$. We say that a linear transformation $P$ is an orthogonal projector onto $V$ if for all $x \in \mathbb{R}^n$, we have $Px \in V$ and $x - Px \in V^\perp$.

In the subsequent discussion we use the following notation. Let $A \in \mathbb{R}^{m \times n}$. Let the range, or image, of $A$ be denoted
$$\mathcal{R}(A) = \{Ax : x \in \mathbb{R}^n\},$$
and the nullspace, or kernel, of $A$ be denoted
$$\mathcal{N}(A) = \{x \in \mathbb{R}^n : Ax = 0\}.$$
Note that $\mathcal{R}(A)$ and $\mathcal{N}(A)$ are subspaces (see Exercise 3.9).

Theorem 3.4 Let $A$ be a given matrix. Then, $\mathcal{R}(A)^\perp = \mathcal{N}(A^T)$ and $\mathcal{N}(A)^\perp = \mathcal{R}(A^T)$. □

Proof. Suppose that $x \in \mathcal{R}(A)^\perp$. Then, $y^T(A^T x) = (Ay)^T x = 0$ for all $y$, so that $A^T x = 0$. Hence, $x \in \mathcal{N}(A^T)$. This implies that $\mathcal{R}(A)^\perp \subset \mathcal{N}(A^T)$. If now $x \in \mathcal{N}(A^T)$, then $(Ay)^T x = y^T(A^T x) = 0$ for all $y$, so that $x \in \mathcal{R}(A)^\perp$, and consequently, $\mathcal{N}(A^T) \subset \mathcal{R}(A)^\perp$. Thus, $\mathcal{R}(A)^\perp = \mathcal{N}(A^T)$. The equation $\mathcal{N}(A)^\perp = \mathcal{R}(A^T)$ follows from what we have proved above and the fact that for any subspace $V$, we have $(V^\perp)^\perp = V$ (see Exercise 3.11). ■

Theorem 3.4 allows us to establish the following necessary and sufficient condition for orthogonal projectors. For this, note that if $P$ is an orthogonal projector onto $V$, then $Px = x$ for all $x \in V$, and $\mathcal{R}(P) = V$ (see Exercise 3.14).

Theorem 3.5 A matrix $P$ is an orthogonal projector [onto the subspace $V = \mathcal{R}(P)$] if and only if $P^2 = P = P^T$. □

Proof. ⇒: Suppose that $P$ is an orthogonal projector onto $V = \mathcal{R}(P)$. Then, $\mathcal{R}(I - P) \subset \mathcal{R}(P)^\perp$. But, $\mathcal{R}(P)^\perp = \mathcal{N}(P^T)$ by Theorem 3.4. Therefore, $\mathcal{R}(I - P) \subset \mathcal{N}(P^T)$. Hence, $P^T(I - P)y = 0$ for all $y$, which implies that $P^T(I - P) = O$, where $O$ is the matrix with all entries equal to zero; i.e., the zero matrix. Therefore, $P^T = P^T P$, and thus $P = P^T = P^2$.

⇐: Suppose that $P^2 = P = P^T$. For any $x$, we have $(Py)^T(I - P)x = y^T P^T(I - P)x = y^T P(I - P)x = 0$ for all $y$. Thus, $(I - P)x \in \mathcal{R}(P)^\perp$, which means that $P$ is an orthogonal projector. ■
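For a matrix $A$ with linearly independent columns, $P = A(A^TA)^{-1}A^T$ is an orthogonal projector onto $\mathcal{R}(A)$. This formula is not derived until the least-squares material of Chapter 12, so the MATLAB sketch below (ours) simply takes it as given and verifies the conditions of Theorem 3.5 and the orthogonal decomposition:

```matlab
A = [1 0;
     1 1;
     0 1];                      % full column rank; V = R(A)
P = A * inv(A'*A) * A';         % orthogonal projector onto V
disp(norm(P*P - P));            % zero: P^2 = P
disp(norm(P - P'));             % zero: P = P'

x = [1; 2; 3];
x1 = P*x;  x2 = x - P*x;        % orthogonal decomposition x = x1 + x2
disp(A'*x2);                    % zero vector: x2 lies in V-perp
```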
3.4 Quadratic Forms

A quadratic form $f : \mathbb{R}^n \to \mathbb{R}$ is a function
$$f(x) = x^T Q x,$$
where $Q$ is an $n \times n$ real matrix. There is no loss of generality in assuming $Q$ to be symmetric: $Q = Q^T$. For if the matrix $Q$ is not symmetric, we can always replace it with the symmetric matrix
$$Q_0 = Q_0^T = \frac{1}{2}(Q + Q^T).$$
Note that
$$x^T Q x = x^T Q_0 x = x^T\left(\frac{1}{2}Q + \frac{1}{2}Q^T\right)x.$$

A quadratic form $x^T Q x$, $Q = Q^T$, is said to be positive definite if $x^T Q x > 0$ for all nonzero vectors $x$. It is positive semidefinite if $x^T Q x \ge 0$ for all $x$. Similarly, we define the quadratic form to be negative definite, or negative semidefinite, if $x^T Q x < 0$ for all nonzero vectors $x$, or $x^T Q x \le 0$ for all $x$, respectively.

Recall that the minors of a matrix $Q$ are the determinants of the matrices obtained by successively removing rows and columns from $Q$. The principal minors are $\det Q$ itself and the determinants of matrices obtained by successively removing an $i$th row and an $i$th column. That is, the principal minors are
$$\det\begin{bmatrix} q_{i_1 i_1} & q_{i_1 i_2} & \cdots & q_{i_1 i_p} \\ q_{i_2 i_1} & q_{i_2 i_2} & \cdots & q_{i_2 i_p} \\ \vdots & \vdots & & \vdots \\ q_{i_p i_1} & q_{i_p i_2} & \cdots & q_{i_p i_p} \end{bmatrix}, \quad 1 \le i_1 < \cdots < i_p \le n, \quad p = 1, 2, \dots, n.$$
The leading principal minors are $\det Q$ and the minors obtained by successively removing the last row and the last column. That is, the leading principal minors are
$$\Delta_1 = q_{11}, \quad \Delta_2 = \det\begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{bmatrix}, \quad \Delta_3 = \det\begin{bmatrix} q_{11} & q_{12} & q_{13} \\ q_{21} & q_{22} & q_{23} \\ q_{31} & q_{32} & q_{33} \end{bmatrix}, \quad \dots, \quad \Delta_n = \det Q.$$
We now prove Sylvester's criterion, which allows us to determine if a quadratic form $x^T Q x$ is positive definite using only the leading principal minors of $Q$.

Theorem 3.6 Sylvester's Criterion. A quadratic form $x^T Q x$, $Q = Q^T$, is positive definite if and only if the leading principal minors of $Q$ are positive. □

Proof. The key to the proof of Sylvester's criterion is the fact that a quadratic form whose leading principal minors are nonzero can be expressed in some basis as a sum of squares
$$f(x) = \sum_{i=1}^n \frac{\Delta_{i-1}}{\Delta_i}\tilde x_i^2,$$
where $\tilde x_i$ are the coordinates of the vector $x$ in the new basis, $\Delta_0 \triangleq 1$, and $\Delta_1, \dots, \Delta_n$ are the leading principal minors of $Q$.

To this end, consider a quadratic form $f(x) = x^T Q x$, where $Q = Q^T$. Let $\{e_1, e_2, \dots, e_n\}$ be the natural basis for $\mathbb{R}^n$, and let
$$x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n$$
be a given vector in $\mathbb{R}^n$. Let $\{v_1, v_2, \dots, v_n\}$ be another basis for $\mathbb{R}^n$. Then, the vector $x$ is represented in the new basis as $\tilde x$, where
$$x = [v_1, v_2, \dots, v_n]\tilde x = V\tilde x.$$
Accordingly, the quadratic form can be written as
$$x^T Q x = \tilde x^T V^T Q V\tilde x = \tilde x^T\tilde Q\tilde x,$$
where
$$\tilde Q = V^T Q V = \begin{bmatrix} \tilde q_{11} & \cdots & \tilde q_{1n} \\ \vdots & & \vdots \\ \tilde q_{n1} & \cdots & \tilde q_{nn} \end{bmatrix}.$$
Note that $\tilde q_{ij} = \langle v_i, Qv_j\rangle$. Our goal is to determine conditions on the new basis $\{v_1, v_2, \dots, v_n\}$ such that $\tilde q_{ij} = 0$ for $i \ne j$. We seek the new basis in the form
$$v_1 = \alpha_{11}e_1,$$
$$v_2 = \alpha_{21}e_1 + \alpha_{22}e_2,$$
$$\vdots$$
$$v_n = \alpha_{n1}e_1 + \alpha_{n2}e_2 + \cdots + \alpha_{nn}e_n.$$
Observe that for $j = 1, \dots, i - 1$, if $\langle v_i, Qe_j\rangle = 0$, then $\langle v_i, Qv_j\rangle = 0$. Our goal then is to determine the coefficients $\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ii}$, $i = 1, \dots, n$, such that the vector $v_i = \alpha_{i1}e_1 + \alpha_{i2}e_2 + \cdots + \alpha_{ii}e_i$ satisfies the $i$ relations
$$\langle v_i, Qe_j\rangle = 0, \quad j = 1, \dots, i - 1, \qquad \langle e_i, Qv_i\rangle = 1.$$
In this case, we get
$$\tilde Q = \begin{bmatrix} \alpha_{11} & & 0 \\ & \ddots & \\ 0 & & \alpha_{nn} \end{bmatrix}.$$
For each $i = 1, \dots, n$, the $i$ relations above determine the coefficients $\alpha_{i1}, \dots, \alpha_{ii}$ in a unique way. Indeed, upon substituting the expression for $v_i$ into the equations above, we obtain the set of equations
$$\alpha_{i1}q_{11} + \alpha_{i2}q_{12} + \cdots + \alpha_{ii}q_{1i} = 0,$$
$$\vdots$$
$$\alpha_{i1}q_{i-1,1} + \alpha_{i2}q_{i-1,2} + \cdots + \alpha_{ii}q_{i-1,i} = 0,$$
$$\alpha_{i1}q_{i1} + \alpha_{i2}q_{i2} + \cdots + \alpha_{ii}q_{ii} = 1.$$
The set of equations above can be expressed in matrix form as
$$\begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1i} \\ \vdots & \vdots & & \vdots \\ q_{i-1,1} & q_{i-1,2} & \cdots & q_{i-1,i} \\ q_{i1} & q_{i2} & \cdots & q_{ii} \end{bmatrix}\begin{bmatrix} \alpha_{i1} \\ \alpha_{i2} \\ \vdots \\ \alpha_{ii} \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.$$
If the leading principal minors of the matrix $Q$ do not vanish, then the coefficients $\alpha_{ij}$ can be obtained using Cramer's rule. In particular,
$$\alpha_{ii} = \frac{1}{\Delta_i}\det\begin{bmatrix} q_{11} & \cdots & q_{1,i-1} & 0 \\ \vdots & & \vdots & \vdots \\ q_{i-1,1} & \cdots & q_{i-1,i-1} & 0 \\ q_{i1} & \cdots & q_{i,i-1} & 1 \end{bmatrix} = \frac{\Delta_{i-1}}{\Delta_i}.$$
Hence,
$$\tilde Q = \begin{bmatrix} \frac{\Delta_0}{\Delta_1} & & & 0 \\ & \frac{\Delta_1}{\Delta_2} & & \\ & & \ddots & \\ 0 & & & \frac{\Delta_{n-1}}{\Delta_n} \end{bmatrix}.$$
In the new basis, the quadratic form can be expressed as a sum of squares:
$$x^T Q x = \tilde x^T\tilde Q\tilde x = \frac{\Delta_0}{\Delta_1}\tilde x_1^2 + \frac{\Delta_1}{\Delta_2}\tilde x_2^2 + \cdots + \frac{\Delta_{n-1}}{\Delta_n}\tilde x_n^2.$$
We now show that a necessary and sufficient condition for the quadratic form to be positive definite is $\Delta_i > 0$, $i = 1, \dots, n$. Sufficiency is clear, for if $\Delta_i > 0$, $i = 1, \dots, n$, then by the previous argument there is a basis such that $x^T Q x = \tilde x^T\tilde Q\tilde x > 0$ for any $x \ne 0$ (or, equivalently, any $\tilde x \ne 0$).

To prove necessity, we first show that for $i = 1, \dots, n$, we have $\Delta_i \ne 0$. To see this, suppose that $\Delta_k = 0$ for some $k$. Note that $\Delta_k = \det Q_k$, where
$$Q_k = \begin{bmatrix} q_{11} & \cdots & q_{1k} \\ \vdots & & \vdots \\ q_{k1} & \cdots & q_{kk} \end{bmatrix}.$$
Then, there exists a vector $v \in \mathbb{R}^k$, $v \ne 0$, such that $v^T Q_k = 0^T$. Now let $x \in \mathbb{R}^n$ be given by $x = [v^T, 0^T]^T$. Then,
$$x^T Q x = v^T Q_k v = 0.$$
But $x \ne 0$, which contradicts the fact that the quadratic form $f$ is positive definite. Therefore, if $x^T Q x > 0$, then $\Delta_i \ne 0$, $i = 1, \dots, n$. Then, using our previous argument, we may write
$$x^T Q x = \tilde x^T\tilde Q\tilde x = \frac{\Delta_0}{\Delta_1}\tilde x_1^2 + \frac{\Delta_1}{\Delta_2}\tilde x_2^2 + \cdots + \frac{\Delta_{n-1}}{\Delta_n}\tilde x_n^2,$$
where $x = [v_1, \dots, v_n]\tilde x$. Hence, if the quadratic form is positive definite, then all leading principal minors must be positive. ■
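Sylvester's criterion is straightforward to implement. The following MATLAB sketch (ours; the test matrix is an arbitrary choice) checks positive definiteness by computing the leading principal minors and compares the verdict with the eigenvalue test stated below as Theorem 3.7:

```matlab
Q = [2 1 0;
     1 2 1;
     0 1 2];                  % symmetric test matrix
n = size(Q, 1);
pd = true;
for k = 1:n
    pd = pd && (det(Q(1:k, 1:k)) > 0);   % k-th leading principal minor
end
disp(pd);                     % 1 (true): Q > 0 by Sylvester's criterion
disp(all(eig(Q) > 0));        % agrees with the eigenvalue test
```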
Note that if $Q$ is not symmetric, Sylvester's criterion cannot be used to check positive definiteness of the quadratic form $x^TQx$. To see this, consider an example where
$$Q = \begin{bmatrix}1 & 0\\ -4 & 1\end{bmatrix}.$$
The leading principal minors of $Q$ are $\Delta_1 = 1 > 0$ and $\Delta_2 = \det Q = 1 > 0$. However, if $x = [1, 1]^T$, then $x^TQx = -2 < 0$, and hence the associated quadratic form is not positive definite. Note that
$$x^TQx = x^T\begin{bmatrix}1 & 0\\ -4 & 1\end{bmatrix}x = x^T\begin{bmatrix}1 & -2\\ -2 & 1\end{bmatrix}x = x^TQ_0x.$$
The leading principal minors of $Q_0$ are $\Delta_1 = 1 > 0$ and $\Delta_2 = \det Q_0 = -3 < 0$, as expected.

A necessary condition for a real quadratic form to be positive semidefinite is that the leading principal minors be nonnegative. However, this is not a sufficient condition (see Exercise 3.16). In fact, a real quadratic form is positive semidefinite if and only if all principal minors are nonnegative (for a proof of this fact, see [44, p. 307]).

A symmetric matrix $Q$ is said to be positive definite if the quadratic form $x^TQx$ is positive definite. If $Q$ is positive definite, we write $Q > 0$. Similarly, we define a symmetric matrix $Q$ to be positive semidefinite ($Q \geq 0$), negative definite ($Q < 0$), and negative semidefinite ($Q \leq 0$) if the corresponding quadratic forms have the respective properties. The symmetric matrix $Q$ is indefinite if it is neither positive semidefinite nor negative semidefinite. Note that the matrix $Q$ is positive definite (semidefinite) if and only if the matrix $-Q$ is negative definite (semidefinite).

Sylvester's criterion provides a way of checking the definiteness of a quadratic form, or equivalently, a symmetric matrix. An alternative method involves checking the eigenvalues of $Q$, as stated below.

Theorem 3.7 A symmetric matrix $Q$ is positive definite (or positive semidefinite) if and only if all eigenvalues of $Q$ are positive (or nonnegative).

Proof. For any $x$, let $y = T^{-1}x = T^Tx$, where $T$ is an orthogonal matrix whose columns are eigenvectors of $Q$. Then, $x^TQx = y^TT^TQTy = \sum_{i=1}^n\lambda_iy_i^2$. From this, the result follows. ■

Through diagonalization, we can show that a symmetric positive semidefinite matrix $Q$ has a positive semidefinite (symmetric) square root $Q^{1/2}$ satisfying $Q^{1/2}Q^{1/2} = Q$. For this, we use $T$ as above and define
$$Q^{1/2} = T\begin{bmatrix}\sqrt{\lambda_1} & & \\ & \ddots & \\ & & \sqrt{\lambda_n}\end{bmatrix}T^T,$$
which is easily verified to have the desired properties. Note that the quadratic form $x^TQx$ can be expressed as $\|Q^{1/2}x\|^2$.

In summary, we have presented two tests for definiteness of quadratic forms and symmetric matrices. We point out again that nonnegativity of leading principal minors is a necessary but not a sufficient condition for positive semidefiniteness.
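The square-root construction above is likewise easy to verify numerically. In the minimal sketch below (the matrix $Q$ and the test vector $x$ are arbitrary choices), $Q^{1/2} = T\,\mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n})\,T^T$ is built from the eigendecomposition and checked against $Q^{1/2}Q^{1/2} = Q$ and $x^TQx = \|Q^{1/2}x\|^2$.

```python
import numpy as np

Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])                     # symmetric positive definite
lam, T = np.linalg.eigh(Q)                     # Q = T diag(lam) T^T, T orthogonal
Q_half = T @ np.diag(np.sqrt(lam)) @ T.T       # positive semidefinite square root

print(np.allclose(Q_half @ Q_half, Q))         # True: Q^{1/2} Q^{1/2} = Q
x = np.array([3.0, -1.0])
print(np.isclose(x @ Q @ x, np.linalg.norm(Q_half @ x) ** 2))  # True
```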
3.5 Matrix Norms

The norm of a matrix may be chosen in a variety of ways. Because the set of matrices $\mathbb{R}^{m\times n}$ can be viewed as the real vector space $\mathbb{R}^{mn}$, matrix norms should be no different from regular vector norms. Therefore, we define the norm of a matrix $A$, denoted $\|A\|$, to be any function $\|\cdot\|$ that satisfies the following conditions:

1. $\|A\| > 0$ if $A \neq O$, and $\|O\| = 0$, where $O$ is a matrix with all entries equal to zero.
2. $\|cA\| = |c|\|A\|$ for any $c \in \mathbb{R}$.
3. $\|A + B\| \leq \|A\| + \|B\|$.

An example of a matrix norm is the Frobenius norm, defined as
$$\|A\|_F = \left(\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2\right)^{1/2},$$
where $A \in \mathbb{R}^{m\times n}$. Note that the Frobenius norm is equivalent to the Euclidean norm on $\mathbb{R}^{mn}$. For our purposes, we consider only matrix norms that satisfy the following additional condition:

4. $\|AB\| \leq \|A\|\|B\|$.

It turns out that the Frobenius norm satisfies condition 4 as well.

In many problems, both matrices and vectors appear simultaneously. Therefore, it is convenient to construct the norm of a matrix in such a way that it will be related to vector norms. To this end we consider a special class of matrix norms, called induced norms. Let $\|\cdot\|_{(n)}$ and $\|\cdot\|_{(m)}$ be vector norms on $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. We say that the matrix norm is induced by, or is compatible with, the given vector norms if for any matrix $A \in \mathbb{R}^{m\times n}$ and any vector $x \in \mathbb{R}^n$, the following inequality is satisfied:
$$\|Ax\|_{(m)} \leq \|A\|\|x\|_{(n)}.$$
We can define an induced matrix norm as
$$\|A\| = \max_{\|x\|_{(n)}=1}\|Ax\|_{(m)};$$
that is, $\|A\|$ is the maximum of the norms of the vectors $Ax$ where the vector $x$ runs over the set of all vectors with unit norm. When there is no ambiguity, we omit the subscripts $(m)$ and $(n)$ from $\|\cdot\|_{(m)}$ and $\|\cdot\|_{(n)}$. Because of the continuity of a vector norm (see Exercise 2.10), for each matrix $A$ the maximum $\max_{\|x\|=1}\|Ax\|$ is attainable; that is, a vector $x_0$ exists such that $\|x_0\| = 1$ and $\|Ax_0\| = \|A\|$. This fact follows from the theorem of Weierstrass (see Theorem 4.2).

The induced norm satisfies conditions 1 to 4 and the compatibility condition, as we prove below.

Proof of Condition 1. Let $A \neq O$. Then, a vector $x$, $\|x\| = 1$, can be found such that $Ax \neq 0$, and thus $\|Ax\| \neq 0$. Hence, $\|A\| = \max_{\|x\|=1}\|Ax\| \neq 0$. If, on the other hand, $A = O$, then $\|A\| = \max_{\|x\|=1}\|Ox\| = 0$. ■

Proof of Condition 2. By definition, $\|cA\| = \max_{\|x\|=1}\|cAx\|$. Obviously, $\|cAx\| = |c|\|Ax\|$, and therefore $\|cA\| = \max_{\|x\|=1}|c|\|Ax\| = |c|\max_{\|x\|=1}\|Ax\| = |c|\|A\|$. ■

Proof of Compatibility Condition. Let $y \neq 0$ be any vector. Then, $x = y/\|y\|$ satisfies the condition $\|x\| = 1$. Consequently, $\|Ay\| = \|A(\|y\|x)\| = \|y\|\|Ax\| \leq \|y\|\|A\|$. ■

Proof of Condition 3. For the matrix $A + B$, we can find a vector $x_0$ such that $\|A + B\| = \|(A + B)x_0\|$ and $\|x_0\| = 1$. Then, we have
$$\|A + B\| = \|(A + B)x_0\| = \|Ax_0 + Bx_0\| \leq \|Ax_0\| + \|Bx_0\| \leq \|A\|\|x_0\| + \|B\|\|x_0\| = \|A\| + \|B\|,$$
which shows that condition 3 holds. ■

Proof of Condition 4. For the matrix $AB$, we can find a vector $x_0$ such that $\|x_0\| = 1$ and $\|ABx_0\| = \|AB\|$. Then, we have
$$\|AB\| = \|ABx_0\| = \|A(Bx_0)\| \leq \|A\|\|Bx_0\| \leq \|A\|\|B\|\|x_0\| = \|A\|\|B\|,$$
which shows that condition 4 holds. ■

Theorem 3.8 Let
$$\|x\| = \left(\sum_{i=1}^n x_i^2\right)^{1/2} = \sqrt{\langle x, x\rangle}.$$
The matrix norm induced by this vector norm is
$$\|A\| = \sqrt{\lambda_1},$$
where $\lambda_1$ is the largest eigenvalue of the matrix $A^TA$.

Proof. We have $\|Ax\|^2 = \langle Ax, Ax\rangle = \langle x, A^TAx\rangle$. The matrix $A^TA$ is symmetric and positive semidefinite. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ be its eigenvalues and $x_1, x_2, \ldots, x_n$ the orthonormal set of the eigenvectors corresponding to these eigenvalues. Now, we take an arbitrary vector $x$ with $\|x\| = 1$ and represent it as a linear combination of $x_i$, $i = 1, \ldots, n$:
$$x = c_1x_1 + c_2x_2 + \cdots + c_nx_n.$$
Note that
$$\langle x, x\rangle = c_1^2 + \cdots + c_n^2 = 1.$$
Furthermore,
$$\|Ax\|^2 = \langle x, A^TAx\rangle = \langle c_1x_1 + \cdots + c_nx_n, c_1\lambda_1x_1 + \cdots + c_n\lambda_nx_n\rangle = \lambda_1c_1^2 + \cdots + \lambda_nc_n^2 \leq \lambda_1(c_1^2 + \cdots + c_n^2) = \lambda_1.$$
For the eigenvector $x_1$ of $A^TA$ corresponding to the eigenvalue $\lambda_1$, we have
$$\|Ax_1\|^2 = \langle x_1, A^TAx_1\rangle = \langle x_1, \lambda_1x_1\rangle = \lambda_1,$$
and hence
$$\max_{\|x\|=1}\|Ax\| = \sqrt{\lambda_1}.$$
This completes the proof. ■
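Theorem 3.8 can be observed directly in a short computation (an illustrative sketch; the matrix $A$ is randomly generated and not from the text): $\sqrt{\lambda_1}$ computed from $A^TA$ agrees with the induced 2-norm returned by NumPy, and the maximum is attained at the unit eigenvector corresponding to $\lambda_1$.

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(3, 2)
lam_max = np.max(np.linalg.eigvalsh(A.T @ A))   # largest eigenvalue of A^T A
print(np.isclose(np.sqrt(lam_max), np.linalg.norm(A, 2)))  # True

# The maximum of ||Ax|| over unit vectors is attained at the top eigenvector x1:
_, V = np.linalg.eigh(A.T @ A)
x1 = V[:, -1]                                   # unit eigenvector for lambda_1
print(np.isclose(np.linalg.norm(A @ x1), np.linalg.norm(A, 2)))  # True
```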
Using arguments similar to the above, we can deduce the following important inequalities.

Rayleigh's Inequalities. If an $n \times n$ matrix $P$ is real symmetric positive definite, then
$$\lambda_{\min}(P)\|x\|^2 \leq x^TPx \leq \lambda_{\max}(P)\|x\|^2,$$
where $\lambda_{\min}(P)$ denotes the smallest eigenvalue of $P$, and $\lambda_{\max}(P)$ denotes the largest eigenvalue of $P$.

Example 3.1 Consider the matrix
$$A = \begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix},$$
and let the norm in $\mathbb{R}^2$ be given by $\|x\| = \sqrt{x_1^2 + x_2^2}$. Then,
$$A^TA = \begin{bmatrix}5 & 4\\ 4 & 5\end{bmatrix}$$
and
$$\det[\lambda I_2 - A^TA] = \lambda^2 - 10\lambda + 9 = (\lambda - 1)(\lambda - 9).$$
Thus, $\|A\| = \sqrt{9} = 3$. The eigenvector of $A^TA$ corresponding to $\lambda_1 = 9$ is
$$x_1 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\ 1\end{bmatrix}.$$
Note that $\|Ax_1\| = \|A\|$. Indeed,
$$\|Ax_1\| = \left\|\begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix}1\\ 1\end{bmatrix}\right\| = \left\|\frac{1}{\sqrt{2}}\begin{bmatrix}3\\ 3\end{bmatrix}\right\| = 3.$$
Because $A = A^T$ in this example, we also have $\|A\| = \max_{1\leq i\leq n}|\lambda_i(A)|$, where $\lambda_1(A), \ldots, \lambda_n(A)$ are the eigenvalues of $A$ (possibly repeated). ■

Warning: In general, $\max_{1\leq i\leq n}|\lambda_i(A)| \neq \|A\|$. Instead, we have $\|A\| \geq \max_{1\leq i\leq n}|\lambda_i(A)|$, as illustrated in the following example (see also Exercise 5.2).

Example 3.2 Let
$$A = \begin{bmatrix}0 & 1\\ 0 & 0\end{bmatrix}.$$
Then,
$$A^TA = \begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix}$$
and
$$\det[\lambda I_2 - A^TA] = \det\begin{bmatrix}\lambda & 0\\ 0 & \lambda - 1\end{bmatrix} = \lambda(\lambda - 1).$$
Note that 0 is the only eigenvalue of $A$, so $|\lambda_i(A)| = 0$ for $i = 1, 2$. Thus,
$$\|A\| = 1 > \max_{1\leq i\leq n}|\lambda_i(A)| = 0. \;■$$
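Both the warning and Rayleigh's inequalities are easy to reproduce numerically. The sketch below is illustrative only (the matrix $P$ and the test vector $x$ are my own choices): for the nonsymmetric $A$ of Example 3.2 the largest eigenvalue magnitude is 0 while the induced norm is 1, and the Rayleigh bounds hold for a symmetric positive definite $P$.

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
print(np.max(np.abs(np.linalg.eigvals(A))))   # 0.0: max |lambda_i(A)|
print(np.linalg.norm(A, 2))                   # 1.0: the induced norm ||A|| is larger

# Spot check of Rayleigh's inequalities for a symmetric positive definite P:
P = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam = np.linalg.eigvalsh(P)                   # eigenvalues in ascending order: 1, 3
x = np.array([0.3, -1.7])
print(lam[0] * x.dot(x) <= x.dot(P @ x) <= lam[-1] * x.dot(x))   # True
```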
For a more complete but still basic treatment of topics in linear algebra as discussed in this and the preceding chapter, see [47], [66], [95], [126]. For a treatment of matrices, we refer the reader to [44], [62]. Numerical aspects of matrix computations are discussed in [41], [53].

EXERCISES

3.1 Fix a vector in $\mathbb{R}^n$. Let $x$ be the column of the coordinates of the vector with respect to the basis $\{e_1, e_2, \ldots, e_n\}$ and $x'$ the coordinates of the same vector with respect to the basis $\{e_1', e_2', \ldots, e_n'\}$. Show that $x' = Tx$, where $T$ is the transformation matrix from $\{e_1, e_2, \ldots, e_n\}$ to $\{e_1', e_2', \ldots, e_n'\}$.

3.2 For each of the following cases, find the transformation matrix $T$ from $\{e_1, e_2, e_3\}$ to $\{e_1', e_2', e_3'\}$:
a. $e_1' = e_1 + 3e_2 - 4e_3$, $e_2' = 2e_1 - e_2 + 5e_3$, $e_3' = 4e_1 + 5e_2 + 3e_3$.
b. $e_1 = e_1' + e_2' + 3e_3'$, $e_2 = 2e_1' - e_2' + 4e_3'$, $e_3 = 3e_1' + 5e_3'$.

3.3 Consider two bases of $\mathbb{R}^3$, $\{e_1, e_2, e_3\}$ and $\{e_1', e_2', e_3'\}$, where $e_1 = 2e_1' + e_2' - e_3'$, $e_2 = 2e_1' - e_2' + 2e_3'$, and $e_3 = 3e_1' + e_3'$. Suppose that a linear transformation has a matrix representation in $\{e_1, e_2, e_3\}$ of the form
$$\begin{bmatrix}2 & -1 & 0\\ 0 & 1 & -1\\ 0 & 0 & 1\end{bmatrix}.$$
Find the matrix representation of this linear transformation in the basis $\{e_1', e_2', e_3'\}$.

3.4 Consider two bases of $\mathbb{R}^4$, $\{e_1, e_2, e_3, e_4\}$ and $\{e_1', e_2', e_3', e_4'\}$, where $e_1' = e_1$, $e_2' = e_1 + e_2$, $e_3' = e_1 + e_2 + e_3$, and $e_4' = e_1 + e_2 + e_3 + e_4$. Suppose that a linear transformation has a matrix representation in $\{e_1, e_2, e_3, e_4\}$ of the form
$$\begin{bmatrix}2 & 0 & 1 & 0\\ -3 & 2 & 0 & 1\\ 0 & 1 & -1 & 2\\ 1 & 0 & 0 & 3\end{bmatrix}.$$
Find the matrix representation of this linear transformation in the basis $\{e_1', e_2', e_3', e_4'\}$.

3.5 Consider a linear transformation given by the matrix
$$\begin{bmatrix}-1 & 0 & 0 & 0\\ 1 & 1 & 0 & 0\\ 2 & 5 & 2 & 1\\ -1 & 1 & 0 & 3\end{bmatrix}.$$
Find a basis for $\mathbb{R}^4$ with respect to which the matrix representation for the linear transformation above is diagonal.

3.6 Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of the matrix $A \in \mathbb{R}^{n\times n}$. Show that the eigenvalues of the matrix $I_n - A$ are $1 - \lambda_1, \ldots, 1 - \lambda_n$.

3.7 Let $V$ be a subspace. Show that $V^\perp$ is also a subspace.

3.8 Find the nullspace of
$$A = \begin{bmatrix}-2 & 2 & 1\\ 2 & -3 & 0\\ 0 & -1 & 1\end{bmatrix}.$$

3.9 Let $A \in \mathbb{R}^{m\times n}$ be a matrix. Show that $\mathcal{R}(A)$ is a subspace of $\mathbb{R}^m$ and $\mathcal{N}(A)$ is a subspace of $\mathbb{R}^n$.

3.10 Prove that if $A$ and $B$ are two matrices with $m$ rows, and $\mathcal{N}(A^T) \subset \mathcal{N}(B^T)$, then $\mathcal{R}(B) \subset \mathcal{R}(A)$.
Hint: Use the fact that for any matrix $M$ with $m$ rows, we have $\dim\mathcal{R}(M) + \dim\mathcal{N}(M^T) = m$ [this is one of the fundamental theorems of linear algebra (see [126, p. 75])].

3.11 Let $V$ be a subspace. Show that $(V^\perp)^\perp = V$.
Hint: Use Exercise 3.10.

3.12 Let $V$ and $W$ be subspaces. Show that if $V \subset W$, then $W^\perp \subset V^\perp$.

3.13 Let $\mathcal{V}$ be a subspace of $\mathbb{R}^n$. Show that there exist matrices $V$ and $U$ such that $\mathcal{V} = \mathcal{R}(V) = \mathcal{N}(U)$.

3.14 Let $P$ be an orthogonal projector onto a subspace $V$. Show that
a. $Px = x$ for all $x \in V$.
b. $\mathcal{R}(P) = V$.

3.15 Is the quadratic form
$$x^T\begin{bmatrix}1 & 1\\ -5 & 1\end{bmatrix}x$$
positive definite, positive semidefinite, negative definite, negative semidefinite, or indefinite?

3.16 Let
$$A = \begin{bmatrix}2 & 2 & 2\\ 2 & 2 & 2\\ 2 & 2 & 0\end{bmatrix}.$$
Show that although all leading principal minors of $A$ are nonnegative, $A$ is not positive semidefinite.

3.17 Consider the matrix
$$Q = \begin{bmatrix}0 & 1 & 1\\ 1 & 0 & 1\\ 1 & 1 & 0\end{bmatrix}.$$
a. Is this matrix positive definite, negative definite, or indefinite?
b. Is this matrix positive definite, negative definite, or indefinite on the subspace $M = \{x : x_1 + x_2 + x_3 = 0\}$?

3.18 For each of the following quadratic forms, determine if it is positive definite, negative definite, positive semidefinite, negative semidefinite, or indefinite.
a. $f(x_1, x_2, x_3) = x_2^2$
b. $f(x_1, x_2, x_3) = x_1^2 + 2x_2^2 - x_1x_3$
c. $f(x_1, x_2, x_3) = x_1^2 + x_3^2 + 2x_1x_2 + 2x_1x_3 + 2x_2x_3$

3.19 Find a transformation that brings the following quadratic form into the diagonal form:
$$f(x_1, x_2, x_3) = 4x_1^2 + x_2^2 + 9x_3^2 - 4x_1x_2 - 6x_2x_3 + 12x_1x_3.$$
Hint: Read the proof of Theorem 3.6.

3.20 Consider the quadratic form
$$f(x_1, x_2, x_3) = x_1^2 + x_2^2 + 5x_3^2 + 2\xi x_1x_2 - 2x_1x_3 + 4x_2x_3.$$
Find the values of the parameter $\xi$ for which this quadratic form is positive definite.

3.21 Consider the function $\langle\cdot,\cdot\rangle_Q : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, defined by $\langle x, y\rangle_Q = x^TQy$, where $x, y \in \mathbb{R}^n$ and $Q \in \mathbb{R}^{n\times n}$ is a symmetric positive definite matrix. Show that $\langle\cdot,\cdot\rangle_Q$ satisfies conditions 1 to 4 for inner products (see Section 2.4).

3.22 Consider the vector norm $\|\cdot\|_\infty$ on $\mathbb{R}^n$ given by $\|x\|_\infty = \max_i|x_i|$, where $x = [x_1, \ldots, x_n]^T$. Define the norm $\|\cdot\|_\infty$ on $\mathbb{R}^m$ similarly. Show that the matrix norm induced by these vector norms is given by
$$\|A\|_\infty = \max_{1\leq i\leq m}\sum_{k=1}^n|a_{ik}|,$$
where $a_{ij}$ is the $(i, j)$th element of $A \in \mathbb{R}^{m\times n}$.

3.23 Consider the vector norm $\|\cdot\|_1$ on $\mathbb{R}^n$ given by $\|x\|_1 = \sum_{i=1}^n|x_i|$, where $x = [x_1, \ldots, x_n]^T$. Define the norm $\|\cdot\|_1$ on $\mathbb{R}^m$ similarly. Show that the matrix norm induced by these vector norms is given by
$$\|A\|_1 = \max_{1\leq k\leq n}\sum_{i=1}^m|a_{ik}|,$$
where $a_{ij}$ is the $(i, j)$th element of $A \in \mathbb{R}^{m\times n}$.

CHAPTER 4

CONCEPTS FROM GEOMETRY

4.1 Line Segments

In the following analysis we concern ourselves only with $\mathbb{R}^n$. The elements of this space are the $n$-component vectors $x = [x_1, x_2, \ldots, x_n]^T$. The line segment between two points $x$ and $y$ in $\mathbb{R}^n$ is the set of points on the straight line joining points $x$ and $y$ (see Figure 4.1). Note that if $z$ lies on the line segment between $x$ and $y$, then
$$z - y = \alpha(x - y),$$
where $\alpha$ is a real number from the interval $[0, 1]$. The equation above can be rewritten as $z = \alpha x + (1 - \alpha)y$. Hence, the line segment between $x$ and $y$ can be represented as
$$\{\alpha x + (1 - \alpha)y : \alpha \in [0, 1]\}.$$

Figure 4.1 Line segment.
4.2 Hyperplanes and Linear Varieties

Let $u_1, u_2, \ldots, u_n, v \in \mathbb{R}$, where at least one of the $u_i$ is nonzero. The set of all points $x = [x_1, x_2, \ldots, x_n]^T$ that satisfy the linear equation
$$u_1x_1 + u_2x_2 + \cdots + u_nx_n = v$$
is called a hyperplane of the space $\mathbb{R}^n$. We may describe the hyperplane by
$$\{x \in \mathbb{R}^n : u^Tx = v\},$$
where $u = [u_1, u_2, \ldots, u_n]^T$. A hyperplane is not necessarily a subspace of $\mathbb{R}^n$ since, in general, it does not contain the origin. For $n = 2$, the equation of the hyperplane has the form $u_1x_1 + u_2x_2 = v$, which is the equation of a straight line. Thus, straight lines are hyperplanes in $\mathbb{R}^2$. In $\mathbb{R}^3$ (three-dimensional space), hyperplanes are ordinary planes. By translating a hyperplane so that it contains the origin of $\mathbb{R}^n$, it becomes a subspace of $\mathbb{R}^n$ (see Figure 4.2). Because the dimension of this subspace is $n - 1$, we say that the hyperplane has dimension $n - 1$.

Figure 4.2 Translation of a hyperplane.

The hyperplane $H = \{x : u_1x_1 + \cdots + u_nx_n = v\}$ divides $\mathbb{R}^n$ into two half-spaces. One of these half-spaces consists of the points satisfying the inequality $u_1x_1 + u_2x_2 + \cdots + u_nx_n \geq v$, denoted
$$H_+ = \{x \in \mathbb{R}^n : u^Tx \geq v\},$$
where, as before, $u = [u_1, u_2, \ldots, u_n]^T$. The other half-space consists of the points satisfying the inequality $u_1x_1 + u_2x_2 + \cdots + u_nx_n \leq v$, denoted
$$H_- = \{x \in \mathbb{R}^n : u^Tx \leq v\}.$$
The half-space $H_+$ is called the positive half-space, and the half-space $H_-$ is called the negative half-space.

Let $a = [a_1, a_2, \ldots, a_n]^T$ be an arbitrary point of the hyperplane $H$. Thus, $u^Ta - v = 0$. We can write
$$u^Tx - v = u^Tx - v - (u^Ta - v) = u^T(x - a) = u_1(x_1 - a_1) + u_2(x_2 - a_2) + \cdots + u_n(x_n - a_n) = 0.$$
The numbers $(x_i - a_i)$, $i = 1, \ldots, n$, are the components of the vector $x - a$. Therefore, the hyperplane $H$ consists of the points $x$ for which $\langle u, x - a\rangle = 0$. In other words, the hyperplane $H$ consists of the points $x$ for which the vectors $u$ and $x - a$ are orthogonal (see Figure 4.3). We call the vector $u$ the normal to the hyperplane $H$. The set $H_+$ consists of those points $x$ for which $\langle u, x - a\rangle \geq 0$, and $H_-$ consists of those points $x$ for which $\langle u, x - a\rangle \leq 0$.

Figure 4.3 The hyperplane $H = \{x \in \mathbb{R}^n : u^T(x - a) = 0\}$.

A linear variety is a set of the form
$$\{x \in \mathbb{R}^n : Ax = b\}$$
for some matrix $A \in \mathbb{R}^{m\times n}$ and vector $b \in \mathbb{R}^m$. If $\dim\mathcal{N}(A) = r$, we say that the linear variety has dimension $r$. A linear variety is a subspace if and only if $b = 0$. If $A = O$, the linear variety is $\mathbb{R}^n$. If the dimension of the linear variety is less than $n$, then it is the intersection of a finite number of hyperplanes.

4.3 Convex Sets

Recall that the line segment between two points $u, v \in \mathbb{R}^n$ is the set $\{w \in \mathbb{R}^n : w = \alpha u + (1 - \alpha)v, \alpha \in [0, 1]\}$. A point $w = \alpha u + (1 - \alpha)v$ (where $\alpha \in [0, 1]$) is called a convex combination of the points $u$ and $v$. A set $\Theta \subset \mathbb{R}^n$ is convex if for all $u, v \in \Theta$, the line segment between $u$ and $v$ is in $\Theta$. Figure 4.4 gives examples of convex sets, whereas Figure 4.5 gives examples of sets that are not convex. Note that $\Theta$ is convex if and only if $\alpha u + (1 - \alpha)v \in \Theta$ for all $u, v \in \Theta$ and $\alpha \in (0, 1)$.

Figure 4.4 Convex sets.
Figure 4.5 Sets that are not convex.

Examples of convex sets include the following (a numerical sanity check of the definition appears after this list):
■ The empty set
■ A set consisting of a single point
■ A line or a line segment
■ A subspace
■ A hyperplane
■ A linear variety
■ A half-space
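As a sanity check on the definition of convexity (an illustrative sketch, not from the text; the membership functions, the sampling box, and the trial counts are all my own choices), the code below samples pairs of points from a half-space and from a non-convex annulus and tests whether their convex combinations stay in the set.

```python
import numpy as np

def in_halfspace(x):        # H+ = {x : u^T x >= v} with u = [1, 2], v = 1
    return np.array([1.0, 2.0]) @ x >= 1.0

def in_annulus(x):          # non-convex set: 1 <= ||x|| <= 2
    return 1.0 <= np.linalg.norm(x) <= 2.0

def looks_convex(member, trials=1000):
    rng = np.random.default_rng(0)
    pts = [p for p in rng.uniform(-3, 3, (5000, 2)) if member(p)]
    for _ in range(trials):
        u, v = pts[rng.integers(len(pts))], pts[rng.integers(len(pts))]
        a = rng.uniform()
        if not member(a * u + (1 - a) * v):
            return False        # found a convex combination outside the set
    return True                 # no counterexample found (evidence, not a proof)

print(looks_convex(in_halfspace))   # True: half-spaces are convex
print(looks_convex(in_annulus))     # False: the annulus is not
```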
Theorem 4.1 Convex subsets of $\mathbb{R}^n$ have the following properties:
a. If $\Theta$ is a convex set and $\beta$ is a real number, then the set
$$\beta\Theta = \{x : x = \beta v, v \in \Theta\}$$
is also convex.
b. If $\Theta_1$ and $\Theta_2$ are convex sets, then the set
$$\Theta_1 + \Theta_2 = \{x : x = v_1 + v_2, v_1 \in \Theta_1, v_2 \in \Theta_2\}$$
is also convex.
c. The intersection of any collection of convex sets is convex (see Figure 4.6 for an illustration of this result for two sets).

Proof.
a. Let $\beta v_1, \beta v_2 \in \beta\Theta$, where $v_1, v_2 \in \Theta$. Because $\Theta$ is convex, we have $\alpha v_1 + (1 - \alpha)v_2 \in \Theta$ for any $\alpha \in (0, 1)$. Hence,
$$\alpha\beta v_1 + (1 - \alpha)\beta v_2 = \beta(\alpha v_1 + (1 - \alpha)v_2) \in \beta\Theta,$$
and thus $\beta\Theta$ is convex.

Figure 4.6 Intersection of two convex sets.

b. Let $v_1, v_2 \in \Theta_1 + \Theta_2$. Then, $v_1 = v_1' + v_1''$ and $v_2 = v_2' + v_2''$, where $v_1', v_2' \in \Theta_1$ and $v_1'', v_2'' \in \Theta_2$. Because $\Theta_1$ and $\Theta_2$ are convex, for all $\alpha \in (0, 1)$,
$$x_1 = \alpha v_1' + (1 - \alpha)v_2' \in \Theta_1 \quad\text{and}\quad x_2 = \alpha v_1'' + (1 - \alpha)v_2'' \in \Theta_2.$$
By definition of $\Theta_1 + \Theta_2$, $x_1 + x_2 \in \Theta_1 + \Theta_2$. Now,
$$\alpha v_1 + (1 - \alpha)v_2 = \alpha(v_1' + v_1'') + (1 - \alpha)(v_2' + v_2'') = x_1 + x_2 \in \Theta_1 + \Theta_2.$$
Hence, $\Theta_1 + \Theta_2$ is convex.
c. Let $\mathcal{C}$ be a collection of convex sets. Let $x_1, x_2 \in \bigcap_{\Theta\in\mathcal{C}}\Theta$ (where $\bigcap_{\Theta\in\mathcal{C}}\Theta$ represents the intersection of all elements in $\mathcal{C}$). Then, $x_1, x_2 \in \Theta$ for each $\Theta \in \mathcal{C}$. Because each $\Theta \in \mathcal{C}$ is convex, $\alpha x_1 + (1 - \alpha)x_2 \in \Theta$ for all $\alpha \in (0, 1)$ and each $\Theta \in \mathcal{C}$. Thus, $\alpha x_1 + (1 - \alpha)x_2 \in \bigcap_{\Theta\in\mathcal{C}}\Theta$. ■

A point $x$ in a convex set $\Theta$ is said to be an extreme point of $\Theta$ if there are no two distinct points $u$ and $v$ in $\Theta$ such that $x = \alpha u + (1 - \alpha)v$ for some $\alpha \in (0, 1)$. For example, in Figure 4.4, any point on the boundary of the disk is an extreme point, the vertex (corner) of the set on the right is an extreme point, and the endpoint of the half-line is also an extreme point.

4.4 Neighborhoods

A neighborhood of a point $x \in \mathbb{R}^n$ is the set
$$\{y \in \mathbb{R}^n : \|y - x\| < \varepsilon\},$$
where $\varepsilon$ is some positive number. The neighborhood is also called a ball with radius $\varepsilon$ and center $x$. In the plane $\mathbb{R}^2$, a neighborhood of $x = [x_1, x_2]^T$ consists of all the points inside a disk centered at $x$. In $\mathbb{R}^3$, a neighborhood of $x = [x_1, x_2, x_3]^T$ consists of all the points inside a sphere centered at $x$ (see Figure 4.7).

Figure 4.7 Examples of neighborhoods of a point: a disk in $\mathbb{R}^2$ and a sphere in $\mathbb{R}^3$.

A point $x \in S$ is said to be an interior point of the set $S$ if the set $S$ contains some neighborhood of $x$; that is, if all points within some neighborhood of $x$ are also in $S$ (see Figure 4.8). The set of all the interior points of $S$ is called the interior of $S$. A point $x$ is said to be a boundary point of the set $S$ if every neighborhood of $x$ contains a point in $S$ and a point not in $S$ (see Figure 4.8). Note that a boundary point of $S$ may or may not be an element of $S$. The set of all boundary points of $S$ is called the boundary of $S$.

Figure 4.8 $x$ is an interior point; $y$ is a boundary point.

A set $S$ is said to be open if it contains a neighborhood of each of its points; that is, if each of its points is an interior point, or equivalently, if $S$ contains no boundary points. A set $S$ is said to be closed if it contains its boundary (see Figure 4.9). We can show that a set is closed if and only if its complement is open. A set that is contained in a ball of finite radius is said to be bounded. A set is compact if it is both closed and bounded.

Figure 4.9 Open and closed sets: $S_1 = \{[x_1, x_2]^T : 1 < x_1 < 2,\ 1 < x_2 < 2\}$ is open; $S_2 = \{[x_1, x_2]^T : 3 \leq x_1 \leq 4,\ 1 \leq x_2 \leq 2\}$ is closed.

Compact sets are important in optimization problems for the following reason.

Theorem 4.2 Theorem of Weierstrass. Let $f : \Omega \to \mathbb{R}$ be a continuous function, where $\Omega \subset \mathbb{R}^n$ is a compact set. Then, there exists a point $x_0 \in \Omega$ such that $f(x_0) \leq f(x)$ for all $x \in \Omega$. In other words, $f$ achieves its minimum on $\Omega$.

Proof. See [112, p. 89] or [2, p. 154]. ■
4.5 Polytopes and Polyhedra

Let $\Theta$ be a convex set, and suppose that $y$ is a boundary point of $\Theta$. A hyperplane passing through $y$ is called a hyperplane of support (or supporting hyperplane) of the set $\Theta$ if the entire set $\Theta$ lies completely in one of the two half-spaces into which this hyperplane divides the space $\mathbb{R}^n$.

Recall that by Theorem 4.1, the intersection of any number of convex sets is convex. In what follows we are concerned with the intersection of a finite number of half-spaces. Because every half-space $H_+$ or $H_-$ is convex in $\mathbb{R}^n$, the intersection of any number of half-spaces is a convex set. A set that can be expressed as the intersection of a finite number of half-spaces is called a convex polytope (see Figure 4.10). A nonempty bounded polytope is called a polyhedron (see Figure 4.11).

Figure 4.10 Polytopes.
Figure 4.11 One-dimensional polyhedron.

For every convex polyhedron $\Theta \subset \mathbb{R}^n$, there exists a nonnegative integer $k \leq n$ such that $\Theta$ is contained in a linear variety of dimension $k$, but is not entirely contained in any $(k-1)$-dimensional linear variety of $\mathbb{R}^n$. Furthermore, there exists only one $k$-dimensional linear variety containing $\Theta$, called the carrier of the polyhedron $\Theta$, and $k$ is called the dimension of $\Theta$. For example, a zero-dimensional polyhedron is a point of $\mathbb{R}^n$, and its carrier is itself. A one-dimensional polyhedron is a segment, and its carrier is the straight line on which it lies. The boundary of any $k$-dimensional polyhedron, $k > 0$, consists of a finite number of $(k-1)$-dimensional polyhedra. For example, the boundary of a one-dimensional polyhedron consists of two points that are the endpoints of the segment.

The $(k-1)$-dimensional polyhedra forming the boundary of a $k$-dimensional polyhedron are called the faces of the polyhedron. Each of these faces has, in turn, $(k-2)$-dimensional faces. We also consider each of these $(k-2)$-dimensional faces to be faces of the original $k$-dimensional polyhedron. Thus, every $k$-dimensional polyhedron has faces of dimensions $k-1, k-2, \ldots, 1, 0$. A zero-dimensional face of a polyhedron is called a vertex, and a one-dimensional face is called an edge.

EXERCISES

4.1 Show that a set $S \subset \mathbb{R}^n$ is a linear variety if and only if for all $x, y \in S$ and $\alpha \in \mathbb{R}$, we have $\alpha x + (1 - \alpha)y \in S$.

4.2 Show that the set $\{x \in \mathbb{R}^n : \|x\| \leq r\}$ is convex, where $r > 0$ is a given real number and $\|x\| = \sqrt{x^Tx}$ is the Euclidean norm of $x \in \mathbb{R}^n$.

4.3 Show that for any matrix $A \in \mathbb{R}^{m\times n}$ and vector $b \in \mathbb{R}^m$, the set (linear variety) $\{x \in \mathbb{R}^n : Ax = b\}$ is convex.

4.4 Show that the set $\{x \in \mathbb{R}^n : x \geq 0\}$ is convex (where $x \geq 0$ means that every component of $x$ is nonnegative).

CHAPTER 5

ELEMENTS OF CALCULUS

5.1 Sequences and Limits

A sequence of real numbers is a function whose domain is the set of natural numbers $1, 2, \ldots, k, \ldots$ and whose range is contained in $\mathbb{R}$. Thus, a sequence of real numbers can be viewed as a set of numbers $\{x_1, x_2, \ldots, x_k, \ldots\}$, which is often also denoted as $\{x_k\}$ (or sometimes as $\{x_k\}_{k=1}^\infty$, to indicate explicitly the range of values that $k$ can take).

A sequence $\{x_k\}$ is increasing if $x_1 < x_2 < \cdots < x_k < \cdots$; that is, $x_k < x_{k+1}$ for all $k$. If $x_k \leq x_{k+1}$, then we say that the sequence is nondecreasing. Similarly, we can define decreasing and nonincreasing sequences. Nonincreasing or nondecreasing sequences are called monotone sequences.

A number $x^* \in \mathbb{R}$ is called the limit of the sequence $\{x_k\}$ if for any positive $\varepsilon$ there is a number $K$ (which may depend on $\varepsilon$) such that for all $k > K$, $|x_k - x^*| < \varepsilon$; that is, $x_k$ lies between $x^* - \varepsilon$ and $x^* + \varepsilon$ for all $k > K$.
In this case we write
$$x^* = \lim_{k\to\infty}x_k \quad\text{or}\quad x_k \to x^*.$$
A sequence that has a limit is called a convergent sequence.

The notion of a sequence can be extended to sequences with elements in $\mathbb{R}^n$. Specifically, a sequence in $\mathbb{R}^n$ is a function whose domain is the set of natural numbers $1, 2, \ldots, k, \ldots$ and whose range is contained in $\mathbb{R}^n$. We use the notation $\{x^{(1)}, x^{(2)}, \ldots\}$ or $\{x^{(k)}\}$ for sequences in $\mathbb{R}^n$. For limits of sequences in $\mathbb{R}^n$, we need to replace absolute values with vector norms. In other words, $x^*$ is the limit of $\{x^{(k)}\}$ if for any positive $\varepsilon$ there is a number $K$ (which may depend on $\varepsilon$) such that for all $k > K$, $\|x^{(k)} - x^*\| < \varepsilon$. As before, if a sequence $\{x^{(k)}\}$ is convergent, we write $x^* = \lim_{k\to\infty}x^{(k)}$ or $x^{(k)} \to x^*$.

Theorem 5.1 A convergent sequence has only one limit.

Proof. We prove this result by contradiction. Suppose that a sequence $\{x^{(k)}\}$ has two different limits, say $x_1$ and $x_2$. Then, we have $\|x_1 - x_2\| > 0$. Let
$$\varepsilon = \tfrac{1}{2}\|x_1 - x_2\|.$$
From the definition of a limit, there exist $K_1$ and $K_2$ such that for $k > K_1$ we have $\|x^{(k)} - x_1\| < \varepsilon$, and for $k > K_2$ we have $\|x^{(k)} - x_2\| < \varepsilon$. Let $K = \max\{K_1, K_2\}$. Then, if $k > K$, we have $\|x^{(k)} - x_1\| < \varepsilon$ and $\|x^{(k)} - x_2\| < \varepsilon$. Adding $\|x^{(k)} - x_1\| < \varepsilon$ and $\|x^{(k)} - x_2\| < \varepsilon$ yields
$$\|x^{(k)} - x_1\| + \|x^{(k)} - x_2\| < 2\varepsilon.$$
Applying the triangle inequality gives
$$\|-x_1 + x_2\| = \|(x^{(k)} - x_1) - (x^{(k)} - x_2)\| \leq \|x^{(k)} - x_1\| + \|x^{(k)} - x_2\|.$$
Therefore,
$$\|-x_1 + x_2\| = \|x_1 - x_2\| < 2\varepsilon.$$
However, this contradicts the assumption that $\|x_1 - x_2\| = 2\varepsilon$, which completes the proof. ■

A sequence $\{x^{(k)}\}$ in $\mathbb{R}^n$ is bounded if there exists a number $B \geq 0$ such that $\|x^{(k)}\| \leq B$ for all $k = 1, 2, \ldots$.

Theorem 5.2 Every convergent sequence is bounded.

Proof. Let $\{x^{(k)}\}$ be a convergent sequence with limit $x^*$. Choose $\varepsilon = 1$. Then, by definition of the limit, there exists a natural number $K$ such that for all $k > K$,
$$\|x^{(k)} - x^*\| < 1.$$
By the result of Exercise 2.9, we get
$$\|x^{(k)}\| - \|x^*\| \leq \|x^{(k)} - x^*\| < 1$$
for all $k > K$. Therefore, $\|x^{(k)}\| < \|x^*\| + 1$ for all $k > K$. Letting
$$B = \max\{\|x^{(1)}\|, \|x^{(2)}\|, \ldots, \|x^{(K)}\|, \|x^*\| + 1\},$$
we have $B \geq \|x^{(k)}\|$ for all $k$, which means that the sequence $\{x^{(k)}\}$ is bounded. ■

For a sequence $\{x_k\}$ in $\mathbb{R}$, a number $B$ is called an upper bound if $x_k \leq B$ for all $k = 1, 2, \ldots$. In this case, we say that $\{x_k\}$ is bounded above. Similarly, $B$ is called a lower bound if $x_k \geq B$ for all $k = 1, 2, \ldots$. In this case, we say that $\{x_k\}$ is bounded below. Clearly, a sequence is bounded if it is both bounded above and bounded below.

Any sequence $\{x_k\}$ in $\mathbb{R}$ that has an upper bound has a least upper bound (also called the supremum), which is the smallest number $B$ that is an upper bound of $\{x_k\}$. Similarly, any sequence $\{x_k\}$ in $\mathbb{R}$ that has a lower bound has a greatest lower bound (also called the infimum). If $B$ is the least upper bound of the sequence $\{x_k\}$, then $x_k \leq B$ for all $k$, and for any $\varepsilon > 0$, there exists a number $K$ such that $x_K > B - \varepsilon$. An analogous statement applies to the greatest lower bound: If $B$ is the greatest lower bound of $\{x_k\}$, then $x_k \geq B$ for all $k$, and for any $\varepsilon > 0$, there exists a number $K$ such that $x_K < B + \varepsilon$.

Theorem 5.3 Every monotone bounded sequence in $\mathbb{R}$ is convergent.

Proof. We prove the theorem for nondecreasing sequences. The proof for nonincreasing sequences is analogous. Let $\{x_k\}$ be a bounded nondecreasing sequence in $\mathbb{R}$ and $x^*$ the least upper bound. Fix a number $\varepsilon > 0$. Then, there exists a number $K$ such that $x_K > x^* - \varepsilon$. Because $\{x_k\}$ is nondecreasing, for any $k \geq K$,
$$x_k \geq x_K > x^* - \varepsilon.$$
Also, because $x^*$ is an upper bound of $\{x_k\}$, we have
$$x_k \leq x^* < x^* + \varepsilon.$$
Therefore, for any $k \geq K$, $|x_k - x^*| < \varepsilon$, which means that $x_k \to x^*$. ■
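Theorem 5.3 is easy to see in action. The following small sketch is illustrative only (the iteration and starting point are my own choices): the Babylonian iteration $x_{k+1} = (x_k + 2/x_k)/2$ is nonincreasing and bounded below by $\sqrt{2}$ once $x_k \geq \sqrt{2}$, so the theorem guarantees a limit, which here is $\sqrt{2}$.

```python
# Monotone bounded sequences converge (Theorem 5.3): Babylonian square-root iteration.
x = 2.0                        # x_1; for x_k >= sqrt(2) the sequence is nonincreasing
for k in range(8):
    x = 0.5 * (x + 2.0 / x)    # x_{k+1} = (x_k + 2/x_k)/2, still bounded below by sqrt(2)
    print(k + 2, x)            # converges to sqrt(2) = 1.41421356...
```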
Suppose that we are given a sequence $\{x^{(k)}\}$ and an increasing sequence of natural numbers $\{m_k\}$. The sequence
$$\{x^{(m_k)}\} = \{x^{(m_1)}, x^{(m_2)}, \ldots\}$$
is called a subsequence of the sequence $\{x^{(k)}\}$. A subsequence of a given sequence can thus be obtained by neglecting some elements of the given sequence.

Theorem 5.4 Consider a convergent sequence $\{x^{(k)}\}$ with limit $x^*$. Then, any subsequence of $\{x^{(k)}\}$ also converges to $x^*$.

Proof. Let $\{x^{(m_k)}\}$ be a subsequence of $\{x^{(k)}\}$, where $\{m_k\}$ is an increasing sequence of natural numbers. Observe that $m_k \geq k$ for all $k = 1, 2, \ldots$. To show this, first note that $m_1 \geq 1$ because $m_1$ is a natural number. Next, we proceed by induction by assuming that $m_k \geq k$. Then, we have $m_{k+1} > m_k \geq k$, which implies that $m_{k+1} \geq k + 1$. Therefore, we have shown that $m_k \geq k$ for all $k = 1, 2, \ldots$.
Let $\varepsilon > 0$ be given. Then, by definition of the limit, there exists $K$ such that $\|x^{(k)} - x^*\| < \varepsilon$ for any $k > K$. Because $m_k \geq k$, we also have $\|x^{(m_k)} - x^*\| < \varepsilon$ for any $k > K$. This means that
$$\lim_{k\to\infty}x^{(m_k)} = x^*. \;■$$

It turns out that any bounded sequence contains a convergent subsequence. This result is called the Bolzano-Weierstrass theorem (see [2, p. 70]).

Consider a function $f : \mathbb{R}^n \to \mathbb{R}^m$ and a point $x_0 \in \mathbb{R}^n$. Suppose that there exists $f^*$ such that for any convergent sequence $\{x^{(k)}\}$ with limit $x_0$, we have
$$\lim_{k\to\infty}f(x^{(k)}) = f^*.$$
Then, we use the notation
$$\lim_{x\to x_0}f(x)$$
to represent the limit $f^*$. It turns out that $f$ is continuous at $x_0$ if and only if for any convergent sequence $\{x^{(k)}\}$ with limit $x_0$, we have
$$\lim_{k\to\infty}f(x^{(k)}) = f\left(\lim_{k\to\infty}x^{(k)}\right) = f(x_0)$$
(see [2, p. 137]). Therefore, using the notation introduced above, the function $f$ is continuous at $x_0$ if and only if
$$\lim_{x\to x_0}f(x) = f(x_0).$$

We end this section with some results involving sequences and limits of matrices. These results are useful in the analysis of algorithms (e.g., Newton's algorithm in Chapter 9). We say that a sequence $\{A_k\}$ of $m \times n$ matrices converges to the $m \times n$ matrix $A$ if
$$\lim_{k\to\infty}\|A - A_k\| = 0.$$

Lemma 5.1 Let $A \in \mathbb{R}^{n\times n}$. Then, $\lim_{k\to\infty}A^k = O$ if and only if the eigenvalues of $A$ satisfy $|\lambda_i(A)| < 1$, $i = 1, \ldots, n$.

Proof. To prove this theorem, we use the Jordan form (see, e.g., [47]). Specifically, it is well known that any square matrix is similar to its Jordan form: There exists a nonsingular $T$ such that
$$TAT^{-1} = \operatorname{diag}[J_{m_1}(\lambda_1), \ldots, J_{m_s}(\lambda_1), J_{n_1}(\lambda_2), \ldots, J_{t_u}(\lambda_q)] = J,$$
where $J_r(\lambda)$ is the $r \times r$ matrix
$$J_r(\lambda) = \begin{bmatrix}\lambda & 1 & & \\ & \lambda & \ddots & \\ & & \ddots & 1\\ & & & \lambda\end{bmatrix}.$$
The $\lambda_1, \ldots, \lambda_q$ above are the distinct eigenvalues of $A$, the multiplicity of $\lambda_1$ is $m_1 + \cdots + m_s$, and so on. We may rewrite the above as $A = T^{-1}JT$. To complete the proof, observe that
$$(J_r(\lambda))^k = \begin{bmatrix}\lambda^k & \binom{k}{1}\lambda^{k-1} & \cdots & \binom{k}{r-1}\lambda^{k-r+1}\\ & \lambda^k & \ddots & \vdots\\ & & \ddots & \binom{k}{1}\lambda^{k-1}\\ 0 & & & \lambda^k\end{bmatrix},$$
where
$$\binom{k}{i} = \frac{k!}{i!(k-i)!}.$$
Furthermore, $A^k = T^{-1}J^kT$. Hence,
$$\lim_{k\to\infty}A^k = T^{-1}\left(\lim_{k\to\infty}J^k\right)T = O$$
if and only if $|\lambda_i| < 1$, $i = 1, \ldots, n$. ■
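A quick numerical illustration of Lemma 5.1 (an illustrative sketch; the matrix is chosen arbitrarily): a nonsymmetric $A$ whose eigenvalue magnitudes are all below 1 has powers that decay to the zero matrix.

```python
import numpy as np

A = np.array([[0.5, 1.0],
              [0.0, 0.8]])                       # eigenvalues 0.5 and 0.8
print(np.max(np.abs(np.linalg.eigvals(A))))      # 0.8 < 1
for k in [1, 5, 20, 50]:
    print(k, np.linalg.norm(np.linalg.matrix_power(A, k)))  # -> 0 as k grows
```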
Lemma 5.2 The series of $n \times n$ matrices
$$I_n + A + A^2 + \cdots + A^k + \cdots$$
converges if and only if $\lim_{k\to\infty}A^k = O$. In this case the sum of the series equals $(I_n - A)^{-1}$.

Proof. The necessity of the condition is obvious. To prove the sufficiency, suppose that $\lim_{k\to\infty}A^k = O$. By Lemma 5.1 we deduce that $|\lambda_i(A)| < 1$, $i = 1, \ldots, n$. This implies that $\det(I_n - A) \neq 0$, and hence $(I_n - A)^{-1}$ exists. Consider now the following relation:
$$(I_n + A + A^2 + \cdots + A^k)(I_n - A) = I_n - A^{k+1}.$$
Postmultiplying the equation above by $(I_n - A)^{-1}$ yields
$$I_n + A + A^2 + \cdots + A^k = (I_n - A)^{-1} - A^{k+1}(I_n - A)^{-1}.$$
Hence,
$$\lim_{k\to\infty}\sum_{j=0}^kA^j = (I_n - A)^{-1},$$
because $\lim_{k\to\infty}A^{k+1} = O$. Thus,
$$\sum_{j=0}^\infty A^j = (I_n - A)^{-1},$$
which completes the proof. ■

A matrix-valued function $A : \mathbb{R}^r \to \mathbb{R}^{n\times n}$ is continuous at a point $\xi_0 \in \mathbb{R}^r$ if
$$\lim_{\xi\to\xi_0}\|A(\xi) - A(\xi_0)\| = 0.$$

Lemma 5.3 Let $A : \mathbb{R}^r \to \mathbb{R}^{n\times n}$ be an $n \times n$ matrix-valued function that is continuous at $\xi_0$. If $A(\xi_0)^{-1}$ exists, then $A(\xi)^{-1}$ exists for $\xi$ sufficiently close to $\xi_0$ and $A(\cdot)^{-1}$ is continuous at $\xi_0$.

Proof. We follow [114]. We first prove the existence of $A(\xi)^{-1}$ for all $\xi$ sufficiently close to $\xi_0$. We have
$$A(\xi) = A(\xi_0) - A(\xi_0) + A(\xi) = A(\xi_0)(I_n - K(\xi)),$$
where $K(\xi) = A(\xi_0)^{-1}(A(\xi_0) - A(\xi))$. Thus,
$$\|K(\xi)\| \leq \|A(\xi_0)^{-1}\|\|A(\xi_0) - A(\xi)\|$$
and
$$\lim_{\xi\to\xi_0}\|K(\xi)\| = 0.$$
Because $A$ is continuous at $\xi_0$, for all $\xi$ close enough to $\xi_0$ we have
$$\|K(\xi)\| \leq \theta < 1,$$
where $\theta \in (0, 1)$. Then, $(I_n - K(\xi))^{-1}$ exists. But then
$$A(\xi)^{-1} = (A(\xi_0)(I_n - K(\xi)))^{-1} = (I_n - K(\xi))^{-1}A(\xi_0)^{-1},$$
which means that $A(\xi)^{-1}$ exists for $\xi$ sufficiently close to $\xi_0$.
To prove the continuity of $A(\cdot)^{-1}$, note that
$$\|A(\xi)^{-1} - A(\xi_0)^{-1}\| = \|((I_n - K(\xi))^{-1} - I_n)A(\xi_0)^{-1}\|.$$
However, since $\|K(\xi)\| < 1$, it follows from Lemma 5.2 that
$$(I_n - K(\xi))^{-1} - I_n = K(\xi) + K^2(\xi) + \cdots = K(\xi)(I_n + K(\xi) + \cdots).$$
Hence,
$$\|(I_n - K(\xi))^{-1} - I_n\| \leq \|K(\xi)\|(1 + \|K(\xi)\| + \|K(\xi)\|^2 + \cdots) = \frac{\|K(\xi)\|}{1 - \|K(\xi)\|}$$
when $\|K(\xi)\| < 1$. Therefore,
$$\|A(\xi)^{-1} - A(\xi_0)^{-1}\| \leq \frac{\|K(\xi)\|}{1 - \|K(\xi)\|}\|A(\xi_0)^{-1}\|.$$
Because
$$\lim_{\xi\to\xi_0}\|K(\xi)\| = 0,$$
we obtain
$$\lim_{\xi\to\xi_0}\|A(\xi)^{-1} - A(\xi_0)^{-1}\| = 0,$$
which completes the proof. ■
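Lemma 5.2 can also be checked numerically before moving on. The sketch below is illustrative only (same arbitrarily chosen $A$ as in the previous sketch): partial sums of $I + A + A^2 + \cdots$ approach $(I - A)^{-1}$.

```python
import numpy as np

A = np.array([[0.5, 1.0],
              [0.0, 0.8]])              # lim A^k = O, so the series converges
S = np.zeros((2, 2))
term = np.eye(2)
for _ in range(200):                    # partial sums of I + A + A^2 + ...
    S += term
    term = term @ A
print(np.allclose(S, np.linalg.inv(np.eye(2) - A)))   # True
```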
5.2 Differentiability

Differential calculus is based on the idea of approximating an arbitrary function by an affine function. A function $\mathcal{A} : \mathbb{R}^n \to \mathbb{R}^m$ is affine if there exists a linear function $\mathcal{L} : \mathbb{R}^n \to \mathbb{R}^m$ and a vector $y \in \mathbb{R}^m$ such that
$$\mathcal{A}(x) = \mathcal{L}(x) + y$$
for every $x \in \mathbb{R}^n$. Consider a function $f : \mathbb{R}^n \to \mathbb{R}^m$ and a point $x_0 \in \mathbb{R}^n$. We wish to find an affine function $\mathcal{A}$ that approximates $f$ near the point $x_0$. First, it is natural to impose the condition
$$\mathcal{A}(x_0) = f(x_0).$$
Because $\mathcal{A}(x) = \mathcal{L}(x) + y$, we obtain $y = f(x_0) - \mathcal{L}(x_0)$. By the linearity of $\mathcal{L}$,
$$\mathcal{L}(x) + y = \mathcal{L}(x) - \mathcal{L}(x_0) + f(x_0) = \mathcal{L}(x - x_0) + f(x_0).$$
Hence, we may write
$$\mathcal{A}(x) = \mathcal{L}(x - x_0) + f(x_0).$$
Next, we require that $\mathcal{A}(x)$ approaches $f(x)$ faster than $x$ approaches $x_0$; that is,
$$\lim_{x\to x_0,\,x\in\Omega}\frac{\|f(x) - \mathcal{A}(x)\|}{\|x - x_0\|} = 0.$$
The conditions above on $\mathcal{A}$ ensure that $\mathcal{A}$ approximates $f$ near $x_0$ in the sense that the error in the approximation at a given point is "small" compared with the distance of the point from $x_0$.

In summary, a function $f : \Omega \to \mathbb{R}^m$, $\Omega \subset \mathbb{R}^n$, is said to be differentiable at $x_0 \in \Omega$ if there is an affine function that approximates $f$ near $x_0$; that is, there exists a linear function $\mathcal{L} : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$\lim_{x\to x_0,\,x\in\Omega}\frac{\|f(x) - (\mathcal{L}(x - x_0) + f(x_0))\|}{\|x - x_0\|} = 0.$$
The linear function $\mathcal{L}$ above is determined uniquely by $f$ and $x_0$ and is called the derivative of $f$ at $x_0$. The function $f$ is said to be differentiable on $\Omega$ if $f$ is differentiable at every point of its domain $\Omega$.

In $\mathbb{R}$, an affine function has the form $ax + b$, with $a, b \in \mathbb{R}$. Hence, a real-valued function $f(x)$ of a real variable $x$ that is differentiable at $x_0$ can be approximated near $x_0$ by a function
$$\mathcal{A}(x) = ax + b.$$
Because $f(x_0) = \mathcal{A}(x_0) = ax_0 + b$, we obtain
$$\mathcal{A}(x) = ax + b = a(x - x_0) + f(x_0).$$
The linear part of $\mathcal{A}(x)$, denoted earlier by $\mathcal{L}(x)$, is in this case just $ax$. The norm of a real number is its absolute value, so by the definition of differentiability we have
$$\lim_{x\to x_0}\frac{|f(x) - (a(x - x_0) + f(x_0))|}{|x - x_0|} = 0,$$
which is equivalent to
$$\lim_{x\to x_0}\frac{f(x) - f(x_0)}{x - x_0} = a.$$
The number $a$ is commonly denoted $f'(x_0)$ and is called the derivative of $f$ at $x_0$. The affine function $\mathcal{A}$ is therefore given by
$$\mathcal{A}(x) = f(x_0) + f'(x_0)(x - x_0).$$
This affine function is tangent to $f$ at $x_0$ (see Figure 5.1).

Figure 5.1 Illustration of the notion of the derivative.

5.3 The Derivative Matrix

Any linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$, and in particular the derivative $\mathcal{L}$ of $f : \mathbb{R}^n \to \mathbb{R}^m$, can be represented by an $m \times n$ matrix. To find the matrix representation $L$ of the derivative $\mathcal{L}$ of a differentiable function $f : \mathbb{R}^n \to \mathbb{R}^m$, we use the natural basis $\{e_1, \ldots, e_n\}$ for $\mathbb{R}^n$. Consider the vectors
$$x_j = x_0 + te_j, \quad j = 1, \ldots, n.$$
By the definition of the derivative, we have
$$\lim_{t\to0}\frac{f(x_j) - (tLe_j + f(x_0))}{t} = 0$$
for $j = 1, \ldots, n$. This means that
$$\lim_{t\to0}\frac{f(x_j) - f(x_0)}{t} = Le_j$$
for $j = 1, \ldots, n$. But $Le_j$ is the $j$th column of the matrix $L$. On the other hand, the vector $x_j$ differs from $x_0$ only in the $j$th coordinate, and in that coordinate the difference is just the number $t$. Therefore, the left side of the preceding equation is the partial derivative
$$\frac{\partial f}{\partial x_j}(x_0).$$
Because vector limits are computed by taking the limit of each coordinate function, it follows that if
$$f(x) = \begin{bmatrix}f_1(x)\\ \vdots\\ f_m(x)\end{bmatrix},$$
then
$$\frac{\partial f}{\partial x_j}(x_0) = \begin{bmatrix}\frac{\partial f_1}{\partial x_j}(x_0)\\ \vdots\\ \frac{\partial f_m}{\partial x_j}(x_0)\end{bmatrix},$$
and the matrix $L$ has the form
$$L = \begin{bmatrix}\frac{\partial f_1}{\partial x_1}(x_0) & \cdots & \frac{\partial f_1}{\partial x_n}(x_0)\\ \vdots & & \vdots\\ \frac{\partial f_m}{\partial x_1}(x_0) & \cdots & \frac{\partial f_m}{\partial x_n}(x_0)\end{bmatrix}.$$
The matrix $L$ is called the Jacobian matrix, or derivative matrix, of $f$ at $x_0$, and is denoted $Df(x_0)$. For convenience, we often refer to $Df(x_0)$ simply as the derivative of $f$ at $x_0$. We summarize the foregoing discussion in the following theorem.

Theorem 5.5 If a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x_0$, then the derivative of $f$ at $x_0$ is determined uniquely and is represented by the $m \times n$ derivative matrix $Df(x_0)$. The best affine approximation to $f$ near $x_0$ is then given by
$$\mathcal{A}(x) = f(x_0) + Df(x_0)(x - x_0),$$
in the sense that
$$f(x) = \mathcal{A}(x) + r(x)$$
and $\lim_{x\to x_0}\|r(x)\|/\|x - x_0\| = 0$. The columns of the derivative matrix $Df(x_0)$ are vector partial derivatives. The vector
$$\frac{\partial f}{\partial x_j}(x_0)$$
is a tangent vector at $x_0$ to the curve $f$ obtained by varying only the $j$th coordinate of $x$.
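The column-by-column description of $Df(x_0)$ translates directly into a finite-difference approximation. The sketch below is illustrative only (the test function $f$, the helper `jacobian_fd`, and the step size $t$ are my own choices): perturbing $x_0$ along each natural basis vector $e_j$ recovers the vector partial derivatives $\partial f/\partial x_j$.

```python
import numpy as np

def f(x):                      # example f : R^2 -> R^2
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def jacobian_fd(f, x0, t=1e-6):
    """Finite-difference approximation of Df(x0); column j is df/dx_j."""
    n, m = x0.size, f(x0).size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = 1.0
        J[:, j] = (f(x0 + t * e) - f(x0)) / t
    return J

x0 = np.array([1.0, 2.0])
print(jacobian_fd(f, x0))
# analytically: [[x2, x1], [cos(x1), 2 x2]] = [[2, 1], [0.5403, 4]]
```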
If $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable, then the function $\nabla f$ defined by
$$\nabla f(x) = Df(x)^T = \begin{bmatrix}\frac{\partial f}{\partial x_1}(x)\\ \vdots\\ \frac{\partial f}{\partial x_n}(x)\end{bmatrix}$$
is called the gradient of $f$. The gradient is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$, and can be pictured as a vector field, by drawing the arrow representing $\nabla f(x)$ so that its tail starts at $x$.

Given $f : \mathbb{R}^n \to \mathbb{R}$, if $\nabla f$ is differentiable, we say that $f$ is twice differentiable, and we write the derivative of $\nabla f$ as
$$D^2f = \begin{bmatrix}\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_2\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_1}\\ \frac{\partial^2 f}{\partial x_1\partial x_2} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_2}\\ \vdots & \vdots & & \vdots\\ \frac{\partial^2 f}{\partial x_1\partial x_n} & \frac{\partial^2 f}{\partial x_2\partial x_n} & \cdots & \frac{\partial^2 f}{\partial x_n^2}\end{bmatrix}.$$
(The notation $\frac{\partial^2 f}{\partial x_i\partial x_j}$ represents taking the partial derivative with respect to $x_j$ first, then with respect to $x_i$.) The matrix $D^2f(x)$ is called the Hessian matrix of $f$ at $x$, and is often also denoted $F(x)$.

A function $f : \Omega \to \mathbb{R}^m$, $\Omega \subset \mathbb{R}^n$, is said to be continuously differentiable on $\Omega$ if it is differentiable (on $\Omega$), and $Df : \Omega \to \mathbb{R}^{m\times n}$ is continuous; that is, the components of $f$ have continuous partial derivatives. In this case, we write $f \in \mathcal{C}^1$. If the components of $f$ have continuous partial derivatives of order $p$, then we write $f \in \mathcal{C}^p$.

Note that the Hessian matrix of a function $f : \mathbb{R}^n \to \mathbb{R}$ at $x$ is symmetric if $f$ is twice continuously differentiable at $x$. This is a well-known result from calculus called Clairaut's theorem or Schwarz's theorem. However, if the second partial derivatives of $f$ are not continuous, then there is no guarantee that the Hessian is symmetric, as shown in the following well-known example.

Example 5.1 Consider the function
$$f(x) = \begin{cases}x_1x_2(x_1^2 - x_2^2)/(x_1^2 + x_2^2) & \text{if } x \neq 0,\\ 0 & \text{if } x = 0.\end{cases}$$
Let us compute its Hessian at the point $0 = [0, 0]^T$. We have
$$F = \begin{bmatrix}\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2}\\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2}\end{bmatrix}.$$
We now proceed with computing the components of the Hessian and evaluating them at the point $[0, 0]^T$ one by one. We start with
$$\frac{\partial f}{\partial x_1}(x) = \begin{cases}x_2(x_1^4 - x_2^4 + 4x_1^2x_2^2)/(x_1^2 + x_2^2)^2 & \text{if } x \neq 0,\\ 0 & \text{if } x = 0.\end{cases}$$
Note that
$$\frac{\partial f}{\partial x_1}(0) = 0.$$
Also,
$$\frac{\partial f}{\partial x_1}([0, x_2]^T) = -x_2.$$
Hence, the mixed partial is
$$\frac{\partial^2 f}{\partial x_2\partial x_1}(0) = -1.$$
We next compute
$$\frac{\partial f}{\partial x_2}(x) = \begin{cases}x_1(x_1^4 - x_2^4 - 4x_1^2x_2^2)/(x_1^2 + x_2^2)^2 & \text{if } x \neq 0,\\ 0 & \text{if } x = 0.\end{cases}$$
Note that
$$\frac{\partial f}{\partial x_2}(0) = 0.$$
Also,
$$\frac{\partial f}{\partial x_2}([x_1, 0]^T) = x_1.$$
Hence, the mixed partial is
$$\frac{\partial^2 f}{\partial x_1\partial x_2}(0) = 1.$$
Therefore, the Hessian evaluated at the point $0$ is
$$F(0) = \begin{bmatrix}0 & 1\\ -1 & 0\end{bmatrix},$$
which is not symmetric. ■

5.4 Differentiation Rules

We now introduce the chain rule for differentiating the composition $g(f(t))$ of a function $f : \mathbb{R} \to \mathbb{R}^n$ and a function $g : \mathbb{R}^n \to \mathbb{R}$.

Theorem 5.6 Let $g : \mathcal{D} \to \mathbb{R}$ be differentiable on an open set $\mathcal{D} \subset \mathbb{R}^n$, and let $f : (a, b) \to \mathcal{D}$ be differentiable on $(a, b)$. Then, the composite function $h : (a, b) \to \mathbb{R}$ given by $h(t) = g(f(t))$ is differentiable on $(a, b)$, and
$$h'(t) = Dg(f(t))Df(t) = \nabla g(f(t))^T\begin{bmatrix}f_1'(t)\\ \vdots\\ f_n'(t)\end{bmatrix}.$$

Proof. By definition,
$$h'(t) = \lim_{s\to t}\frac{h(s) - h(t)}{s - t}$$
if the limit exists. By Theorem 5.5 we write
$$g(f(s)) - g(f(t)) = Dg(f(t))(f(s) - f(t)) + r(s),$$
where $\lim_{s\to t}r(s)/(s - t) = 0$. Therefore,
$$\frac{h(s) - h(t)}{s - t} = Dg(f(t))\frac{f(s) - f(t)}{s - t} + \frac{r(s)}{s - t}.$$
Letting $s \to t$ yields
$$h'(t) = \lim_{s\to t}\left(Dg(f(t))\frac{f(s) - f(t)}{s - t} + \frac{r(s)}{s - t}\right) = Dg(f(t))Df(t). \;■$$

Next, we present the product rule. Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^n \to \mathbb{R}^m$ be two differentiable functions. Define the function $h : \mathbb{R}^n \to \mathbb{R}$ by $h(x) = f(x)^Tg(x)$. Then, $h$ is also differentiable and
$$Dh(x) = f(x)^TDg(x) + g(x)^TDf(x).$$
We end this section with a list of some useful formulas from multivariable calculus. In each case, we compute the derivative with respect to $x$. Let $A \in \mathbb{R}^{m\times n}$ be a given matrix and $y \in \mathbb{R}^m$ a given vector. Then,
$$D(y^TAx) = y^TA,$$
$$D(x^TAx) = x^T(A + A^T) \quad\text{if } m = n.$$
It follows from the first formula above that if $y \in \mathbb{R}^n$, then
$$D(y^Tx) = y^T.$$
It follows from the second formula above that if $Q$ is a symmetric matrix, then
$$D(x^TQx) = 2x^TQ.$$
In particular,
$$D(x^Tx) = 2x^T.$$
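The formulas above are easy to spot-check with finite differences. The sketch below is illustrative only (the random matrix $Q$, test vector $x$, and helper `grad_fd` are my own choices): for symmetric $Q$, the numerical gradient of $x^TQx$ should match $2Qx$, the transpose of $D(x^TQx)$.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 3)); Q = Q + Q.T    # make Q symmetric
x = rng.standard_normal(3)

def grad_fd(f, x, t=1e-6):
    # forward-difference gradient of a scalar-valued f
    return np.array([(f(x + t * e) - f(x)) / t for e in np.eye(len(x))])

print(grad_fd(lambda z: z @ Q @ z, x))   # numerical gradient
print(2 * Q @ x)                         # D(x^T Q x)^T = 2 Q x: should agree
```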
5.5 Level Sets and Gradients

The level set of a function $f : \mathbb{R}^n \to \mathbb{R}$ at level $c$ is the set of points
$$S = \{x : f(x) = c\}.$$
For $f : \mathbb{R}^2 \to \mathbb{R}$, we are usually interested in $S$ when it is a curve. For $f : \mathbb{R}^3 \to \mathbb{R}$, the sets $S$ most often considered are surfaces.

Example 5.2 Consider the following real-valued function on $\mathbb{R}^2$:
$$f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2, \quad x = [x_1, x_2]^T.$$
The function above is called Rosenbrock's function. A plot of the function $f$ is shown in Figure 5.2. The level sets of $f$ at levels 0.7, 7, 70, 200, and 700 are depicted in Figure 5.3. These level sets have a particular shape resembling bananas. For this reason, Rosenbrock's function is also called the banana function. ■

Figure 5.2 Graph of Rosenbrock's function.
Figure 5.3 Level sets of Rosenbrock's (banana) function.

To say that a point $x_0$ is on the level set $S$ at level $c$ means that $f(x_0) = c$. Now suppose that there is a curve $\gamma$ lying in $S$ and parameterized by a continuously differentiable function $g : \mathbb{R} \to \mathbb{R}^n$. Suppose also that $g(t_0) = x_0$ and $Dg(t_0) = v \neq 0$, so that $v$ is a tangent vector to $\gamma$ at $x_0$ (see Figure 5.4). Applying the chain rule to the function $h(t) = f(g(t))$ at $t_0$ gives
$$h'(t_0) = Df(g(t_0))Dg(t_0) = Df(x_0)v.$$
But since $\gamma$ lies on $S$, we have $h(t) = f(g(t)) = c$; that is, $h$ is constant. Thus, $h'(t_0) = 0$ and
$$Df(x_0)v = \nabla f(x_0)^Tv = 0.$$
Hence, we have proved, assuming $f$ continuously differentiable, the following theorem (see Figure 5.4).

Theorem 5.7 The vector $\nabla f(x_0)$ is orthogonal to the tangent vector to an arbitrary smooth curve passing through $x_0$ on the level set determined by $f(x) = f(x_0)$.

Figure 5.4 Orthogonality of the gradient to the level set.

It is natural to say that $\nabla f(x_0)$ is orthogonal or normal to the level set $S$ corresponding to $x_0$, and it is also natural to take as the tangent plane (or line) to $S$ at $x_0$ the set of all points $x$ satisfying
$$\nabla f(x_0)^T(x - x_0) = 0 \quad\text{if } \nabla f(x_0) \neq 0.$$
As we shall see later, $\nabla f(x_0)$ is the direction of maximum rate of increase of $f$ at $x_0$. Because $\nabla f(x_0)$ is orthogonal to the level set through $x_0$ determined by $f(x) = f(x_0)$, we deduce the following fact: The direction of maximum rate of increase of a real-valued differentiable function at a point is orthogonal to the level set of the function through that point.

Figure 5.5 illustrates the discussion above for the case $f : \mathbb{R}^2 \to \mathbb{R}$. The curve on the shaded surface in Figure 5.5 running from bottom to top has the property that its projection onto the $(x_1, x_2)$-plane is always orthogonal to the level curves; it is called a path of steepest ascent because it always heads in the direction of maximum rate of increase for $f$.

Figure 5.5 Illustration of a path of steepest ascent.

The graph of $f : \mathbb{R}^n \to \mathbb{R}$ is the set $\{[x^T, f(x)]^T : x \in \mathbb{R}^n\} \subset \mathbb{R}^{n+1}$. The notion of the gradient of a function has an alternative useful interpretation in terms of the tangent hyperplane to its graph. To proceed, let $x_0 \in \mathbb{R}^n$ and $z_0 = f(x_0)$. The point $[x_0^T, z_0]^T \in \mathbb{R}^{n+1}$ is a point on the graph of $f$. If $f$ is differentiable at $x_0$, then the graph admits a nonvertical tangent hyperplane at $\xi = [x_0^T, z_0]^T$. The hyperplane through $\xi$ is the set of all points $[x_1, \ldots, x_n, z]^T \in \mathbb{R}^{n+1}$ satisfying the equation
$$u_1(x_1 - x_{01}) + \cdots + u_n(x_n - x_{0n}) + v(z - z_0) = 0,$$
where the vector $[u_1, \ldots, u_n, v]^T \in \mathbb{R}^{n+1}$ is normal to the hyperplane. Assuming that this hyperplane is nonvertical (that is, $v \neq 0$), let
$$d_i = -\frac{u_i}{v}.$$
Thus, we can rewrite the hyperplane equation above as
$$z = d_1(x_1 - x_{01}) + \cdots + d_n(x_n - x_{0n}) + z_0.$$
We can think of the right side of the above equation as a function $z : \mathbb{R}^n \to \mathbb{R}$. Observe that for the hyperplane to be tangent to the graph of $f$, the functions $f$ and $z$ must have the same partial derivatives at the point $x_0$. Hence, if $f$ is differentiable at $x_0$, its tangent hyperplane can be written in terms of its gradient, as given by the equation
$$z - z_0 = Df(x_0)(x - x_0) = (x - x_0)^T\nabla f(x_0).$$
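Theorem 5.7 can be seen numerically for Rosenbrock's function. In the illustrative sketch below (the evaluation point $x_0$ is my own choice), a tangent direction to the level curve is obtained by rotating the gradient by 90°, and we confirm that it is orthogonal to $\nabla f(x_0)$ and that $f$ barely changes along it.

```python
import numpy as np

def f(x):        # Rosenbrock's (banana) function
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad_f(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0] ** 2)])

x0 = np.array([0.5, 1.0])
g = grad_f(x0)
t = np.array([-g[1], g[0]])            # rotate gradient 90 degrees: tangent direction
t /= np.linalg.norm(t)
print(g @ t)                           # 0.0: tangent to the level set is orthogonal to grad
print(f(x0 + 1e-4 * t) - f(x0))        # ~0: f barely changes along the level curve
```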
5.6 Taylor Series

The basis for many numerical methods and models for optimization is Taylor's formula, which is given by Taylor's theorem.

Theorem 5.8 Taylor's Theorem. Assume that a function $f : \mathbb{R} \to \mathbb{R}$ is $m$ times continuously differentiable (i.e., $f \in \mathcal{C}^m$) on an interval $[a, b]$. Denote $h = b - a$. Then,
$$f(b) = f(a) + \frac{h}{1!}f^{(1)}(a) + \frac{h^2}{2!}f^{(2)}(a) + \cdots + \frac{h^{m-1}}{(m-1)!}f^{(m-1)}(a) + R_m$$
(called Taylor's formula), where $f^{(i)}$ is the $i$th derivative of $f$, and
$$R_m = \frac{h^m(1 - \theta)^{m-1}}{(m-1)!}f^{(m)}(a + \theta h) = \frac{h^m}{m!}f^{(m)}(a + \theta'h),$$
with $\theta, \theta' \in (0, 1)$.

Proof. We have
$$R_m = f(b) - f(a) - \frac{h}{1!}f^{(1)}(a) - \frac{h^2}{2!}f^{(2)}(a) - \cdots - \frac{h^{m-1}}{(m-1)!}f^{(m-1)}(a).$$
Denote by $g_m(x)$ an auxiliary function obtained from $R_m$ by replacing $a$ by $x$. Hence,
$$g_m(x) = f(b) - f(x) - \frac{b - x}{1!}f^{(1)}(x) - \frac{(b - x)^2}{2!}f^{(2)}(x) - \cdots - \frac{(b - x)^{m-1}}{(m-1)!}f^{(m-1)}(x).$$
Differentiating $g_m(x)$ yields a telescoping sum:
$$g_m'(x) = -f^{(1)}(x) + f^{(1)}(x) - \frac{b - x}{1!}f^{(2)}(x) + \frac{b - x}{1!}f^{(2)}(x) - \cdots + \frac{(b - x)^{m-2}}{(m-2)!}f^{(m-1)}(x) - \frac{(b - x)^{m-1}}{(m-1)!}f^{(m)}(x) = -\frac{(b - x)^{m-1}}{(m-1)!}f^{(m)}(x).$$
Observe that $g_m(b) = 0$ and $g_m(a) = R_m$. Applying the mean-value theorem yields
$$\frac{g_m(b) - g_m(a)}{b - a} = g_m'(a + \theta h),$$
where $\theta \in (0, 1)$. The equation above is equivalent to
$$-\frac{R_m}{h} = -\frac{(b - a - \theta h)^{m-1}}{(m-1)!}f^{(m)}(a + \theta h) = -\frac{h^{m-1}(1 - \theta)^{m-1}}{(m-1)!}f^{(m)}(a + \theta h).$$
Hence,
$$R_m = \frac{h^m(1 - \theta)^{m-1}}{(m-1)!}f^{(m)}(a + \theta h).$$
To derive the formula
$$R_m = \frac{h^m}{m!}f^{(m)}(a + \theta'h),$$
see, e.g., [81] or [83]. ■

An important property of Taylor's theorem arises from the form of the remainder $R_m$. To discuss this property further, we introduce the order symbols, $O$ and $o$.

Let $g$ be a real-valued function defined in some neighborhood of $0 \in \mathbb{R}^n$, with $g(x) \neq 0$ if $x \neq 0$. Let $f : \Omega \to \mathbb{R}^m$ be defined in a domain $\Omega \subset \mathbb{R}^n$ that includes $0$. Then, we write
1. $f(x) = O(g(x))$ to mean that the quotient $\|f(x)\|/|g(x)|$ is bounded near $0$; that is, there exist numbers $K > 0$ and $\delta > 0$ such that if $\|x\| < \delta$, $x \in \Omega$, then $\|f(x)\|/|g(x)| \leq K$.
2. $f(x) = o(g(x))$ to mean that
$$\lim_{x\to0,\,x\in\Omega}\frac{\|f(x)\|}{|g(x)|} = 0.$$
The symbol $O(g(x))$ [read "big-oh of $g(x)$"] is used to represent a function that is bounded by a scaled version of $g$ in a neighborhood of $0$. Examples of such a function are:
■ $x = O(x)$.
■ $x^3 = O(x^2)$.
■ $2x + 3x^4 = O(x)$.
■ $\cos x = O(1)$.
■ $\sin x = O(x)$.
On the other hand, $o(g(x))$ [read "little-oh of $g(x)$"] represents a function that goes to zero "faster" than $g(x)$ in the sense that $\lim_{x\to0}\|o(g(x))\|/|g(x)| = 0$. Examples of such functions are:
■ $x^2 = o(x)$.
■ $x^3 = o(x^2)$.
■ $2x^2 + 3x^4 = o(x)$.
■ $x = o(1)$.
Note that if $f(x) = o(g(x))$, then $f(x) = O(g(x))$ (but the converse is not necessarily true). Also, if $f(x) = O(\|x\|^p)$, then $f(x) = o(\|x\|^{p-\varepsilon})$ for any $\varepsilon > 0$.

Suppose that $f \in \mathcal{C}^m$. Recall that the remainder term in Taylor's theorem has the form
$$R_m = \frac{h^m}{m!}f^{(m)}(a + \theta h),$$
where $\theta \in (0, 1)$. Substituting this into Taylor's formula, we get
$$f(b) = f(a) + \frac{h}{1!}f^{(1)}(a) + \frac{h^2}{2!}f^{(2)}(a) + \cdots + \frac{h^{m-1}}{(m-1)!}f^{(m-1)}(a) + \frac{h^m}{m!}f^{(m)}(a + \theta h).$$
By the continuity of $f^{(m)}$, we have $f^{(m)}(a + \theta h) \to f^{(m)}(a)$ as $h \to 0$; that is, $f^{(m)}(a + \theta h) = f^{(m)}(a) + o(1)$. Therefore,
$$\frac{h^m}{m!}f^{(m)}(a + \theta h) = \frac{h^m}{m!}f^{(m)}(a) + o(h^m),$$
since $h^mo(1) = o(h^m)$. We may then write Taylor's formula as
$$f(b) = f(a) + \frac{h}{1!}f^{(1)}(a) + \frac{h^2}{2!}f^{(2)}(a) + \cdots + \frac{h^m}{m!}f^{(m)}(a) + o(h^m).$$
If, in addition, we assume that $f \in \mathcal{C}^{m+1}$, we may replace the term $o(h^m)$ above by $O(h^{m+1})$. To see this, we first write Taylor's formula with $R_{m+1}$:
$$f(b) = f(a) + \frac{h}{1!}f^{(1)}(a) + \frac{h^2}{2!}f^{(2)}(a) + \cdots + \frac{h^m}{m!}f^{(m)}(a) + R_{m+1},$$
where
$$R_{m+1} = \frac{h^{m+1}}{(m+1)!}f^{(m+1)}(a + \theta'h)$$
with $\theta' \in (0, 1)$. Because $f^{(m+1)}$ is bounded on $[a, b]$ (by Theorem 4.2),
$$R_{m+1} = O(h^{m+1}).$$
Therefore, if $f \in \mathcal{C}^{m+1}$, we may write Taylor's formula as
$$f(b) = f(a) + \frac{h}{1!}f^{(1)}(a) + \frac{h^2}{2!}f^{(2)}(a) + \cdots + \frac{h^m}{m!}f^{(m)}(a) + O(h^{m+1}).$$
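The order of the remainder can be observed numerically. The illustrative sketch below (with $f = \exp$ and $a = 0$ as my own choices) evaluates the second-order expansion; halving $h$ should cut the error by roughly $2^3 = 8$, consistent with $R_3 = O(h^3)$.

```python
import numpy as np

f, a = np.exp, 0.0
for h in [0.1, 0.05, 0.025]:
    taylor2 = f(a) + h * f(a) + h ** 2 / 2 * f(a)   # f = f' = f'' = exp at a = 0
    err = abs(f(a + h) - taylor2)
    print(h, err)    # error scales like h^3: drops by ~8x each time h halves
```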
We now turn to the Taylor series expansion of a real-valued function $f : \mathbb{R}^n \to \mathbb{R}$ about the point $x_0 \in \mathbb{R}^n$. Suppose that $f \in \mathcal{C}^2$. Let $x$ and $x_0$ be points in $\mathbb{R}^n$, and let $z(\alpha) = x_0 + \alpha(x - x_0)/\|x - x_0\|$. Define $\phi : \mathbb{R} \to \mathbb{R}$ by
$$\phi(\alpha) = f(z(\alpha)) = f(x_0 + \alpha(x - x_0)/\|x - x_0\|).$$
Using the chain rule, we obtain
$$\phi'(\alpha) = Df(z(\alpha))\frac{x - x_0}{\|x - x_0\|} = \frac{(x - x_0)^T}{\|x - x_0\|}Df(z(\alpha))^T$$
and
$$\phi''(\alpha) = \frac{d}{d\alpha}\left(Df(z(\alpha))\frac{x - x_0}{\|x - x_0\|}\right) = \frac{(x - x_0)^T}{\|x - x_0\|}D^2f(z(\alpha))\frac{x - x_0}{\|x - x_0\|} = \frac{1}{\|x - x_0\|^2}(x - x_0)^TD^2f(z(\alpha))(x - x_0),$$
where we recall that $D^2f$ is the Hessian matrix introduced in Section 5.3. Observe that
$$f(x) = \phi(\|x - x_0\|) = \phi(0) + \frac{\|x - x_0\|}{1!}\phi'(0) + \frac{\|x - x_0\|^2}{2!}\phi''(0) + o(\|x - x_0\|^2).$$
Hence,
$$f(x) = f(x_0) + \frac{1}{1!}Df(x_0)(x - x_0) + \frac{1}{2!}(x - x_0)^TD^2f(x_0)(x - x_0) + o(\|x - x_0\|^2).$$
If we assume that $f \in \mathcal{C}^3$, we may use the formula for the remainder term $R_3$ to conclude that
$$f(x) = f(x_0) + \frac{1}{1!}Df(x_0)(x - x_0) + \frac{1}{2!}(x - x_0)^TD^2f(x_0)(x - x_0) + O(\|x - x_0\|^3).$$
We end with a statement of the mean value theorem, which is closely related to Taylor's theorem.

Theorem 5.9 If a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable on an open set $\Omega \subset \mathbb{R}^n$, then for any pair of points $x, y \in \Omega$, there exists a matrix $M$ such that
$$f(x) - f(y) = M(x - y).$$

The mean value theorem follows from Taylor's theorem (for the case where $m = 1$) applied to each component of $f$. It is easy to see that $M$ is a matrix whose rows are the rows of $Df$ evaluated at points that lie on the line segment joining $x$ and $y$ (these points may differ from row to row).

For further reading in calculus, consult [13], [81], [83], [115], [120], [134]. A basic treatment of real analysis can be found in [2], [112], whereas a more advanced treatment is provided in [89], [111]. For stimulating reading on the "big-oh" notation, see [77, pp. 104-108].

EXERCISES

5.1 Show that a sufficient condition for $\lim_{k\to\infty}A^k = O$ is $\|A\| < 1$.

5.2 Show that for any matrix $A \in \mathbb{R}^{n\times n}$,
$$\|A\| \geq \max_{1\leq i\leq n}|\lambda_i(A)|.$$
Hint: Use Exercise 5.1.

5.3 Consider the function $f(x) = (a^Tx)(b^Tx)$, where $a$, $b$, and $x$ are $n$-dimensional vectors.
a. Find $\nabla f(x)$.
b. Find the Hessian $F(x)$.

5.4 Define the functions $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}^2$ by $f(x) = x_1^3/6 + x_2^2/4$ and $g(t) = [3t + 5, 2t - 6]^T$. Let $F : \mathbb{R} \to \mathbb{R}$ be given by $F(t) = f(g(t))$. Evaluate $\frac{dF}{dt}(t)$ using the chain rule.

5.5 Consider $f(x) = x_1x_2/2$ and $g(s, t) = [4s + 3t, 2s + t]^T$. Evaluate $\frac{\partial}{\partial s}f(g(s, t))$ and $\frac{\partial}{\partial t}f(g(s, t))$ using the chain rule.

5.6 Let $x(t) = [e^t + t^3, t^2, t + 1]^T$, $t \in \mathbb{R}$, and $f(x) = x_1x_2x_3^2 + x_1x_2 + x_3$, $x = [x_1, x_2, x_3]^T \in \mathbb{R}^3$. Find $\frac{d}{dt}f(x(t))$ in terms of $t$.

5.7 Suppose that $f(x) = o(g(x))$. Show that for any given $\varepsilon > 0$, there exists $\delta > 0$ such that if $\|x\| < \delta$, then $\|f(x)\| < \varepsilon|g(x)|$.

5.8 Use Exercise 5.7 to show that if functions $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}$ satisfy $f(x) = -g(x) + o(g(x))$ and $g(x) > 0$ for all $x \neq 0$, then for all $x \neq 0$ sufficiently small, we have $f(x) < 0$.

5.9 Let $f_1(x_1, x_2) = x_1^2 - x_2^2$ and $f_2(x_1, x_2) = 2x_1x_2$. Sketch the level sets associated with $f_1(x_1, x_2) = 12$ and $f_2(x_1, x_2) = 16$ on the same diagram. Indicate on the diagram the values of $x = [x_1, x_2]^T$ for which $f(x) = [f_1(x_1, x_2), f_2(x_1, x_2)]^T = [12, 16]^T$.

5.10 Write down the Taylor series expansion of the following functions about the given points $x_0$. Neglect terms of order three or higher.
a. $f(x) = x_1e^{-x_2} + x_2 + 1$, $x_0 = [1, 0]^T$.
b. $f(x) = x_1^4 + 2x_1^2x_2^2 + x_2^4$, $x_0 = [1, 1]^T$.
c. $f(x) = e^{x_1 - x_2} + e^{x_1 + x_2} + x_1 + x_2 + 1$, $x_0 = [1, 0]^T$.
PART II

UNCONSTRAINED OPTIMIZATION

CHAPTER 6

BASICS OF SET-CONSTRAINED AND UNCONSTRAINED OPTIMIZATION

6.1 Introduction

In this chapter we consider the optimization problem
$$\begin{array}{ll}\text{minimize} & f(x)\\ \text{subject to} & x \in \Omega.\end{array}$$
The function $f : \mathbb{R}^n \to \mathbb{R}$ that we wish to minimize is a real-valued function called the objective function or cost function. The vector $x$ is an $n$-vector of independent variables: $x = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$. The variables $x_1, \ldots, x_n$ are often referred to as decision variables. The set $\Omega$ is a subset of $\mathbb{R}^n$ called the constraint set or feasible set.

The optimization problem above can be viewed as a decision problem that involves finding the "best" vector $x$ of the decision variables over all possible vectors in $\Omega$. By the "best" vector we mean the one that results in the smallest value of the objective function. This vector is called the minimizer of $f$ over $\Omega$. It is possible that there may be many minimizers. In this case, finding any of the minimizers will suffice.

There are also optimization problems that require maximization of the objective function, in which case we seek maximizers. Minimizers and maximizers are also called extremizers. Maximization problems, however, can be represented equivalently in the minimization form above because maximizing $f$ is equivalent to minimizing $-f$. Therefore, we can confine our attention to minimization problems without loss of generality.

The problem above is a general form of a constrained optimization problem, because the decision variables are constrained to be in the constraint set $\Omega$. If $\Omega = \mathbb{R}^n$, then we refer to the problem as an unconstrained optimization problem. In this chapter we discuss basic properties of the general optimization problem above, which includes the unconstrained case. In the remaining chapters of this part, we deal with iterative algorithms for solving unconstrained optimization problems.

The constraint "$x \in \Omega$" is called a set constraint. Often, the constraint set $\Omega$ takes the form $\Omega = \{x : h(x) = 0, g(x) \leq 0\}$, where $h$ and $g$ are given functions. We refer to such constraints as functional constraints. The remainder of this chapter deals with general set constraints, including the special case where $\Omega = \mathbb{R}^n$. The case where $\Omega = \mathbb{R}^n$ is called the unconstrained case. In Parts III and IV we consider constrained optimization problems with functional constraints.

In considering the general optimization problem above, we distinguish between two kinds of minimizers, as specified by the following definitions.

Definition 6.1 Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a real-valued function defined on some set $\Omega \subset \mathbb{R}^n$. A point $x^* \in \Omega$ is a local minimizer of $f$ over $\Omega$ if there exists $\varepsilon > 0$ such that $f(x) \geq f(x^*)$ for all $x \in \Omega \setminus \{x^*\}$ with $\|x - x^*\| < \varepsilon$. A point $x^* \in \Omega$ is a global minimizer of $f$ over $\Omega$ if $f(x) \geq f(x^*)$ for all $x \in \Omega \setminus \{x^*\}$. ■

If in the definitions above we replace "$\geq$" with "$>$," then we have a strict local minimizer and a strict global minimizer, respectively. In Figure 6.1, we illustrate the definitions for $n = 1$.

If $x^*$ is a global minimizer of $f$ over $\Omega$, we write $f(x^*) = \min_{x\in\Omega}f(x)$ and $x^* = \arg\min_{x\in\Omega}f(x)$. If the minimization is unconstrained, we simply write $x^* = \arg\min_xf(x)$ or $x^* = \arg\min f(x)$.
In other words, given a real-valued function $f$, the notation $\arg\min f(x)$ denotes the argument that minimizes the function $f$ (a point in the domain of $f$), assuming that such a point is unique (if there is more than one such point, we pick one arbitrarily). For example, if $f : \mathbb{R} \to \mathbb{R}$ is given by $f(x) = (x + 1)^2 + 3$, then $\arg\min f(x) = -1$. If we write $\arg\min_{x\in\Omega}$, then we treat "$x \in \Omega$" as a constraint for the minimization. For example, for the function $f$ above, $\arg\min_{x\geq0}f(x) = 0$.

Strictly speaking, an optimization problem is solved only when a global minimizer is found. However, global minimizers are, in general, difficult to find. Therefore, in practice, we often have to be satisfied with finding local minimizers.

Figure 6.1 Examples of minimizers: $x_1$: strict global minimizer; $x_2$: strict local minimizer; $x_3$: local (not strict) minimizer.

6.2 Conditions for Local Minimizers

In this section we derive conditions for a point $x^*$ to be a local minimizer. We use derivatives of a function $f : \mathbb{R}^n \to \mathbb{R}$. Recall that the first-order derivative of $f$, denoted $Df$, is
$$Df = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right].$$
Note that the gradient $\nabla f$ is just the transpose of $Df$; that is, $\nabla f = (Df)^T$. The second derivative of $f : \mathbb{R}^n \to \mathbb{R}$ (also called the Hessian of $f$) is
$$F(x) = D^2f(x) = \begin{bmatrix}\frac{\partial^2 f}{\partial x_1^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_1}(x)\\ \vdots & & \vdots\\ \frac{\partial^2 f}{\partial x_1\partial x_n}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x)\end{bmatrix}.$$

Example 6.1 Let $f(x_1, x_2) = 5x_1 + 8x_2 + x_1x_2 - x_1^2 - 2x_2^2$. Then,
$$Df(x) = (\nabla f(x))^T = \left[\frac{\partial f}{\partial x_1}(x), \frac{\partial f}{\partial x_2}(x)\right] = [5 + x_2 - 2x_1, 8 + x_1 - 4x_2]$$
and
$$F(x) = D^2f(x) = \begin{bmatrix}-2 & 1\\ 1 & -4\end{bmatrix}. \;■$$
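A quick finite-difference spot check of Example 6.1 (an illustrative sketch; the evaluation point and central-difference step $t$ are my own choices): the numerical gradient should match the expression for $Df(x)^T$ given above, and the Hessian here is the constant matrix $F$.

```python
import numpy as np

def f(x):
    return 5 * x[0] + 8 * x[1] + x[0] * x[1] - x[0] ** 2 - 2 * x[1] ** 2

x = np.array([1.0, 1.0])
t = 1e-5
grad = np.array([(f(x + t * e) - f(x - t * e)) / (2 * t) for e in np.eye(2)])
print(grad)      # [5 + x2 - 2 x1, 8 + x1 - 4 x2] = [4, 5] at x = (1, 1)
# The Hessian is constant: [[-2, 1], [1, -4]], independent of x.
```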
Given an optimization problem with constraint set $\Omega$, a minimizer may lie either in the interior or on the boundary of $\Omega$. To study the case where it lies on the boundary, we need the notion of feasible directions.

Definition 6.2 A vector $d \in \mathbb{R}^n$, $d \ne 0$, is a feasible direction at $x \in \Omega$ if there exists $\alpha_0 > 0$ such that $x + \alpha d \in \Omega$ for all $\alpha \in [0, \alpha_0]$. ■

Figure 6.2 Two-dimensional illustration of feasible directions; $d_1$ is a feasible direction, $d_2$ is not a feasible direction.

Figure 6.2 illustrates the notion of feasible directions.

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a real-valued function and let $d$ be a feasible direction at $x \in \Omega$. The directional derivative of $f$ in the direction $d$, denoted $\partial f/\partial d$, is the real-valued function defined by

  $\dfrac{\partial f}{\partial d}(x) = \lim_{\alpha \to 0} \dfrac{f(x + \alpha d) - f(x)}{\alpha}$.

If $\|d\| = 1$, then $\partial f/\partial d$ is the rate of increase of $f$ at $x$ in the direction $d$. To compute the directional derivative above, suppose that $x$ and $d$ are given. Then, $f(x + \alpha d)$ is a function of $\alpha$, and

  $\dfrac{\partial f}{\partial d}(x) = \left.\dfrac{d}{d\alpha} f(x + \alpha d)\right|_{\alpha = 0}$.

Applying the chain rule yields

  $\dfrac{\partial f}{\partial d}(x) = \nabla f(x)^T d = \langle \nabla f(x), d \rangle = d^T \nabla f(x)$.

In summary, if $d$ is a unit vector ($\|d\| = 1$), then $\langle \nabla f(x), d \rangle$ is the rate of increase of $f$ at the point $x$ in the direction $d$.

Example 6.2 Define $f : \mathbb{R}^3 \to \mathbb{R}$ by $f(x) = x_1 x_2 x_3$, and let

  $d = \left[\dfrac{1}{2}, \dfrac{1}{2}, \dfrac{1}{\sqrt{2}}\right]^T$.

The directional derivative of $f$ in the direction $d$ is

  $\dfrac{\partial f}{\partial d}(x) = \nabla f(x)^T d = [x_2 x_3,\ x_1 x_3,\ x_1 x_2]\begin{bmatrix}1/2 \\ 1/2 \\ 1/\sqrt{2}\end{bmatrix} = \dfrac{x_2 x_3 + x_1 x_3 + \sqrt{2}\, x_1 x_2}{2}$.

Note that because $\|d\| = 1$, the above is also the rate of increase of $f$ at $x$ in the direction $d$. ■

We are now ready to state and prove the following theorem.

Theorem 6.1 First-Order Necessary Condition (FONC). Let $\Omega$ be a subset of $\mathbb{R}^n$ and $f \in \mathcal{C}^1$ a real-valued function on $\Omega$. If $x^*$ is a local minimizer of $f$ over $\Omega$, then for any feasible direction $d$ at $x^*$, we have

  $d^T \nabla f(x^*) \ge 0$. □

Proof. Define $x(\alpha) = x^* + \alpha d \in \Omega$. Note that $x(0) = x^*$. Define the composite function $\phi(\alpha) = f(x(\alpha))$. Then, by Taylor's theorem,

  $f(x^* + \alpha d) - f(x^*) = \phi(\alpha) - \phi(0) = \phi'(0)\alpha + o(\alpha) = \alpha d^T \nabla f(x(0)) + o(\alpha)$,

where $\alpha \ge 0$ [recall the definition of $o(\alpha)$ ("little-oh of $\alpha$") in Part I]. Thus, if $\phi(\alpha) \ge \phi(0)$, that is, $f(x^* + \alpha d) \ge f(x^*)$ for sufficiently small values of $\alpha > 0$ ($x^*$ is a local minimizer), then we have to have $d^T \nabla f(x^*) \ge 0$ (see Exercise 5.8). ■

Theorem 6.1 is illustrated in Figure 6.3.

An alternative way to express the FONC is

  $\dfrac{\partial f}{\partial d}(x^*) \ge 0$

for all feasible directions $d$. In other words, if $x^*$ is a local minimizer, then the rate of increase of $f$ at $x^*$ in any feasible direction $d$ in $\Omega$ is nonnegative. Using directional derivatives, an alternative proof of Theorem 6.1 is as follows. Suppose that $x^*$ is a local minimizer. Then, for any feasible direction $d$, there exists $\bar\alpha > 0$ such that for all $\alpha \in (0, \bar\alpha)$,

  $f(x^*) \le f(x^* + \alpha d)$.

Hence, for all $\alpha \in (0, \bar\alpha)$, we have

  $\dfrac{f(x^* + \alpha d) - f(x^*)}{\alpha} \ge 0$.

Taking the limit as $\alpha \to 0$, we conclude that

  $\dfrac{\partial f}{\partial d}(x^*) \ge 0$.

Figure 6.3 Illustration of the FONC for a constrained case; $x_1$ does not satisfy the FONC, whereas $x_2$ satisfies the FONC.

A special case of interest is when $x^*$ is an interior point of $\Omega$ (see Section 4.4). In this case, any direction is feasible, and we have the following result.

Corollary 6.1 Interior Case. Let $\Omega$ be a subset of $\mathbb{R}^n$ and $f \in \mathcal{C}^1$ a real-valued function on $\Omega$. If $x^*$ is a local minimizer of $f$ over $\Omega$ and if $x^*$ is an interior point of $\Omega$, then

  $\nabla f(x^*) = 0$. □

Proof. Suppose that $f$ has a local minimizer $x^*$ that is an interior point of $\Omega$. Because $x^*$ is an interior point of $\Omega$, the set of feasible directions at $x^*$ is the whole of $\mathbb{R}^n$. Thus, for any $d \in \mathbb{R}^n$, $d^T \nabla f(x^*) \ge 0$ and $-d^T \nabla f(x^*) \ge 0$. Hence, $d^T \nabla f(x^*) = 0$ for all $d \in \mathbb{R}^n$, which implies that $\nabla f(x^*) = 0$. ■

Example 6.3 Consider the problem

  minimize $x_1^2 + 0.5 x_2^2 + 3 x_2 + 4.5$
  subject to $x_1, x_2 \ge 0$.

a. Is the first-order necessary condition (FONC) for a local minimizer satisfied at $x = [1,3]^T$?
b. Is the FONC for a local minimizer satisfied at $x = [0,3]^T$?
c. Is the FONC for a local minimizer satisfied at $x = [1,0]^T$?
d. Is the FONC for a local minimizer satisfied at $x = [0,0]^T$?

Figure 6.4 Level sets of the function in Example 6.3.

Solution: First, let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x) = x_1^2 + 0.5 x_2^2 + 3 x_2 + 4.5$, where $x = [x_1, x_2]^T$. A plot of the level sets of $f$ is shown in Figure 6.4.

a. At $x = [1,3]^T$, we have $\nabla f(x) = [2x_1, x_2 + 3]^T = [2,6]^T$. The point $x = [1,3]^T$ is an interior point of $\Omega = \{x : x_1 \ge 0, x_2 \ge 0\}$. Hence, the FONC requires that $\nabla f(x) = 0$. The point $x = [1,3]^T$ does not satisfy the FONC for a local minimizer.

b. At $x = [0,3]^T$, we have $\nabla f(x) = [0,6]^T$, and hence $d^T \nabla f(x) = 6 d_2$, where $d = [d_1, d_2]^T$. For $d$ to be feasible at $x$, we need $d_1 \ge 0$, and $d_2$ can take an arbitrary value in $\mathbb{R}$. The point $x = [0,3]^T$ does not satisfy the FONC for a minimizer because $d_2$ is allowed to be less than zero. For example, $d = [1,-1]^T$ is a feasible direction, but $d^T \nabla f(x) = -6 < 0$.

c. At $x = [1,0]^T$, we have $\nabla f(x) = [2,3]^T$, and hence $d^T \nabla f(x) = 2 d_1 + 3 d_2$. For $d$ to be feasible, we need $d_2 \ge 0$, and $d_1$ can take an arbitrary value in $\mathbb{R}$. For example, $d = [-5,1]^T$ is a feasible direction. But $d^T \nabla f(x) = -7 < 0$. Thus, $x = [1,0]^T$ does not satisfy the FONC for a local minimizer.

d. At $x = [0,0]^T$, we have $\nabla f(x) = [0,3]^T$, and hence $d^T \nabla f(x) = 3 d_2$. For $d$ to be feasible, we need $d_2 \ge 0$ and $d_1 \ge 0$. Hence, $x = [0,0]^T$ satisfies the FONC for a local minimizer. ■
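The boundary reasoning in Example 6.3 can also be explored numerically. The sketch below is our own illustration (the probe directions are merely sample feasible directions, not an exhaustive test); it prints the gradient at the four candidate points so that the signs of $d^T \nabla f(x)$ can be read off:

  % Numerical illustration of Example 6.3 (a sketch; the example is
  % worked analytically above)
  gradf = @(x) [2*x(1); x(2) + 3];
  P = [1 0 1 0; 3 3 0 0];            % candidate points as columns
  for j = 1:4
      x = P(:,j);  g = gradf(x);
      fprintf('x = [%g %g]'': grad f = [%g %g]''\n', x(1), x(2), g(1), g(2));
  end
  % At [0,3]', the feasible direction d = [1 -1]' gives d'*g = -6 < 0,
  % so the FONC fails there; at [0,0]', every feasible d has d1,d2 >= 0
  % and d'*g = 3*d2 >= 0, so the FONC holds, as found in the text.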
Example 6.4 Figure 6.5 shows a simplified model of a cellular wireless system (the distances shown have been scaled down to make the calculations simpler). A mobile user (also called a mobile) is located at position $x$ (see Figure 6.5).

Figure 6.5 Simplified cellular wireless system in Example 6.4.

There are two base station antennas, one for the primary base station and another for the neighboring base station. Both antennas are transmitting signals to the mobile user, at equal power. However, the power of the received signal as measured by the mobile is the reciprocal of the squared distance from the associated antenna (primary or neighboring base station). We are interested in finding the position of the mobile that maximizes the signal-to-interference ratio, which is the ratio of the signal power received from the primary base station to the signal power received from the neighboring base station.

We use the FONC to solve this problem. The squared distance from the mobile to the primary antenna is $1 + x^2$, while the squared distance from the mobile to the neighboring antenna is $1 + (2-x)^2$. Therefore, the signal-to-interference ratio is

  $f(x) = \dfrac{1 + (2-x)^2}{1 + x^2}$.

We have

  $f'(x) = \dfrac{-2(2-x)(1+x^2) - 2x\left(1 + (2-x)^2\right)}{(1+x^2)^2} = \dfrac{4(x^2 - 2x - 1)}{(1+x^2)^2}$.

By the FONC, at the optimal position $x^*$ we have $f'(x^*) = 0$. Hence, either $x^* = 1 - \sqrt{2}$ or $x^* = 1 + \sqrt{2}$. Evaluating the objective function at these two candidate points, it is easy to see that $x^* = 1 - \sqrt{2}$ is the optimal position. ■

The next example illustrates that in some problems the FONC is not helpful for eliminating candidate local minimizers. However, in such cases, there may be a recasting of the problem into an equivalent form that makes the FONC useful.

Example 6.5 Consider the set-constrained problem

  minimize $f(x)$
  subject to $x \in \Omega$,

where $\Omega = \{[x_1, x_2]^T : x_1^2 + x_2^2 = 1\}$.

a. Consider a point $x^* \in \Omega$. Specify all feasible directions at $x^*$.
b. Which points in $\Omega$ satisfy the FONC for this set-constrained problem?
c. Based on part b, is the FONC for this set-constrained problem useful for eliminating local-minimizer candidates?
d. Suppose that we use polar coordinates to parameterize points $x \in \Omega$ in terms of a single parameter $\theta$:

  $x_1 = \cos\theta$, $x_2 = \sin\theta$.

Now use the FONC for unconstrained problems (with respect to $\theta$) to derive a necessary condition of this sort: If $x^* \in \Omega$ is a local minimizer, then $d^T \nabla f(x^*) = 0$ for all $d$ satisfying a "certain condition." Specify what this certain condition is.

Solution:

a. There are no feasible directions at any $x^*$.
b. Because of part a, all points in $\Omega$ satisfy the FONC for this set-constrained problem.
c. No, the FONC for this set-constrained problem is not useful for eliminating local-minimizer candidates.
d. Write $h(\theta) = f(g(\theta))$, where $g : \mathbb{R} \to \mathbb{R}^2$ is given by the equations relating $\theta$ to $x = [x_1, x_2]^T$. Note that $Dg(\theta) = [-\sin\theta, \cos\theta]^T$. Hence, by the chain rule,

  $h'(\theta) = Df(g(\theta))\,Dg(\theta) = Dg(\theta)^T \nabla f(g(\theta))$.

Notice that $Dg(\theta)$ is tangent to $\Omega$ at $x = g(\theta)$. Alternatively, we could say that $Dg(\theta)$ is orthogonal to $x = g(\theta)$. Suppose that $x^* \in \Omega$ is a local minimizer. Write $x^* = g(\theta^*)$. Then $\theta^*$ is an unconstrained minimizer of $h$. By the FONC for unconstrained problems, $h'(\theta^*) = 0$, which implies that $d^T \nabla f(x^*) = 0$ for all $d$ tangent to $\Omega$ at $x^*$ (or, alternatively, for all $d$ orthogonal to $x^*$). ■
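Example 6.4 lends itself to a quick numerical check. The following sketch (ours; it simply re-evaluates the formulas derived in the example) confirms that the derivative vanishes at the two FONC candidates and that $x^* = 1 - \sqrt{2}$ gives the larger signal-to-interference ratio:

  % Numerical check of Example 6.4 (illustrative sketch)
  f  = @(x) (1 + (2 - x).^2) ./ (1 + x.^2);     % signal-to-interference ratio
  df = @(x) 4*(x.^2 - 2*x - 1) ./ (1 + x.^2).^2;
  candidates = [1 - sqrt(2), 1 + sqrt(2)];
  disp(df(candidates))   % both derivatives vanish (up to roundoff)
  disp(f(candidates))    % f(1-sqrt(2)) = 5.83 > f(1+sqrt(2)) = 0.17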
We now derive a second-order necessary condition that is satisfied by a local minimizer.

Theorem 6.2 Second-Order Necessary Condition (SONC). Let $\Omega \subset \mathbb{R}^n$, $f \in \mathcal{C}^2$ a function on $\Omega$, $x^*$ a local minimizer of $f$ over $\Omega$, and $d$ a feasible direction at $x^*$. If $d^T \nabla f(x^*) = 0$, then

  $d^T F(x^*) d \ge 0$,

where $F$ is the Hessian of $f$. □

Proof. We prove the result by contradiction. Suppose that there is a feasible direction $d$ at $x^*$ such that $d^T \nabla f(x^*) = 0$ and $d^T F(x^*) d < 0$. Let $x(\alpha) = x^* + \alpha d$ and define the composite function $\phi(\alpha) = f(x^* + \alpha d) = f(x(\alpha))$. Then, by Taylor's theorem,

  $\phi(\alpha) = \phi(0) + \phi''(0)\dfrac{\alpha^2}{2} + o(\alpha^2)$,

where by assumption, $\phi'(0) = d^T \nabla f(x^*) = 0$ and $\phi''(0) = d^T F(x^*) d < 0$. For sufficiently small $\alpha$,

  $\phi(\alpha) - \phi(0) = \phi''(0)\dfrac{\alpha^2}{2} + o(\alpha^2) < 0$,

that is,

  $f(x^* + \alpha d) < f(x^*)$,

which contradicts the assumption that $x^*$ is a local minimizer. Thus, $\phi''(0) = d^T F(x^*) d \ge 0$. ■

Corollary 6.2 Interior Case. Let $x^*$ be an interior point of $\Omega \subset \mathbb{R}^n$. If $x^*$ is a local minimizer of $f : \Omega \to \mathbb{R}$, $f \in \mathcal{C}^2$, then

  $\nabla f(x^*) = 0$,

and $F(x^*)$ is positive semidefinite ($F(x^*) \ge 0$); that is, for all $d \in \mathbb{R}^n$,

  $d^T F(x^*) d \ge 0$. □

Proof. If $x^*$ is an interior point, then all directions are feasible. The result then follows from Corollary 6.1 and Theorem 6.2. ■

In the examples below, we show that the necessary conditions are not sufficient.

Example 6.6 Consider a function of one variable $f(x) = x^3$, $f : \mathbb{R} \to \mathbb{R}$. Because $f'(0) = 0$ and $f''(0) = 0$, the point $x = 0$ satisfies both the FONC and SONC. However, $x = 0$ is not a minimizer (see Figure 6.6). ■

Figure 6.6 The point 0 satisfies the FONC and SONC but is not a minimizer ($f(x) = x^3$).

Example 6.7 Consider a function $f : \mathbb{R}^2 \to \mathbb{R}$, where $f(x) = x_1^2 - x_2^2$. The FONC requires that $\nabla f(x) = [2x_1, -2x_2]^T = 0$. Thus, $x = [0,0]^T$ satisfies the FONC. The Hessian matrix of $f$ is

  $F(x) = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix}$.

The Hessian matrix is indefinite; that is, for some $d_1 \in \mathbb{R}^2$ we have $d_1^T F d_1 > 0$ (e.g., $d_1 = [1,0]^T$) and for some $d_2$ we have $d_2^T F d_2 < 0$ (e.g., $d_2 = [0,1]^T$). Thus, $x = [0,0]^T$ does not satisfy the SONC, and hence it is not a minimizer. The graph of $f(x) = x_1^2 - x_2^2$ is shown in Figure 6.7. ■

Figure 6.7 Graph of $f(x) = x_1^2 - x_2^2$. The point 0 satisfies the FONC but not the SONC; this point is not a minimizer.

We now derive sufficient conditions that imply that $x^*$ is a local minimizer.

Theorem 6.3 Second-Order Sufficient Condition (SOSC), Interior Case. Let $f \in \mathcal{C}^2$ be defined on a region in which $x^*$ is an interior point. Suppose that

1. $\nabla f(x^*) = 0$.
2. $F(x^*) > 0$.

Then, $x^*$ is a strict local minimizer of $f$. □

Proof. Because $f \in \mathcal{C}^2$, we have $F(x^*) = F^T(x^*)$. Using assumption 2 and Rayleigh's inequality, it follows that if $d \ne 0$, then $0 < \lambda_{\min}(F(x^*))\|d\|^2 \le d^T F(x^*) d$. By Taylor's theorem and assumption 1,

  $f(x^* + d) - f(x^*) = \dfrac{1}{2} d^T F(x^*) d + o(\|d\|^2) \ge \dfrac{\lambda_{\min}(F(x^*))}{2}\|d\|^2 + o(\|d\|^2)$.

Hence, for all $d$ such that $\|d\|$ is sufficiently small,

  $f(x^* + d) > f(x^*)$,

which completes the proof. ■

Example 6.8 Let $f(x) = x_1^2 + x_2^2$. We have $\nabla f(x) = [2x_1, 2x_2]^T = 0$ if and only if $x = [0,0]^T$. For all $x \in \mathbb{R}^2$, we have

  $F(x) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} > 0$.

The point $x = [0,0]^T$ satisfies the FONC, SONC, and SOSC. It is a strict local minimizer. Actually, $x = [0,0]^T$ is a strict global minimizer. Figure 6.8 shows the graph of $f(x) = x_1^2 + x_2^2$. ■

Figure 6.8 Graph of $f(x) = x_1^2 + x_2^2$.
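For interior candidate points with zero gradient, the second-order conditions in Examples 6.6 to 6.8 reduce to eigenvalue tests on the Hessian. The following MATLAB sketch (our own summary of the examples, not part of the original text) illustrates this:

  % Second-order tests for Examples 6.6-6.8 (illustrative sketch)
  F1 = [2 0; 0 -2];     % Example 6.7: indefinite Hessian, SONC fails
  F2 = [2 0; 0  2];     % Example 6.8: positive definite, SOSC holds
  disp(eig(F1))         % eigenvalues -2 and 2
  disp(eig(F2))         % eigenvalues  2 and 2
  % min(eig(F)) > 0  => strict local minimizer (SOSC);
  % min(eig(F)) < 0  => SONC fails, so not a minimizer;
  % min(eig(F)) = 0  => inconclusive, as for f(x) = x^3 in
  %                     Example 6.6, where f''(0) = 0.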
In this chapter we presented a theoretical basis for the solution of nonlinear unconstrained problems. In the following chapters we are concerned with iterative methods of solving such problems. Such methods are of great importance in practice. Indeed, suppose that one is confronted with a highly nonlinear function of 20 variables. Then, the FONC requires the solution of 20 nonlinear simultaneous equations for 20 variables. These equations, being nonlinear, will normally have multiple solutions. In addition, we would have to compute 210 second derivatives (provided that $f \in \mathcal{C}^2$) to use the SONC or SOSC. We begin our discussion of iterative methods in the next chapter with search methods for functions of one variable.

EXERCISES

6.1 Consider the problem

  minimize $f(x)$ subject to $x \in \Omega$,

where $f \in \mathcal{C}^2$. For each of the following specifications for $\Omega$, $x^*$, and $f$, determine if the given point $x^*$ is: (i) definitely a local minimizer; (ii) definitely not a local minimizer; or (iii) possibly a local minimizer.

a. $f : \mathbb{R}^2 \to \mathbb{R}$, $\Omega = \{x = [x_1, x_2]^T : x_1 \ge 1\}$, $x^* = [1,2]^T$, and gradient $\nabla f(x^*) = [1,1]^T$.
b. $f : \mathbb{R}^2 \to \mathbb{R}$, $\Omega = \{x = [x_1, x_2]^T : x_1 \ge 1, x_2 \ge 2\}$, $x^* = [1,2]^T$, and gradient $\nabla f(x^*) = [1,0]^T$.
c. $f : \mathbb{R}^2 \to \mathbb{R}$, $\Omega = \{x = [x_1, x_2]^T : x_1 \ge 0, x_2 \ge 0\}$, $x^* = [1,2]^T$, gradient $\nabla f(x^*) = [0,0]^T$, and Hessian $F(x^*) = I$ (identity matrix).
d. $f : \mathbb{R}^2 \to \mathbb{R}$, $\Omega = \{x = [x_1, x_2]^T : x_1 \ge 1, x_2 \ge 2\}$, $x^* = [1,2]^T$, gradient $\nabla f(x^*) = [1,0]^T$, and Hessian

  $F(x^*) = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$.

6.2 Find minimizers and maximizers of the function

  $f(x_1, x_2) = \dfrac{1}{3}x_1^3 - 4x_1 + \dfrac{1}{3}x_2^3 - 16x_2$.

6.3 Show that if $x^*$ is a global minimizer of $f$ over $\Omega$, and $x^* \in \Omega' \subset \Omega$, then $x^*$ is a global minimizer of $f$ over $\Omega'$.

6.4 Suppose that $x^*$ is a local minimizer of $f$ over $\Omega$, and $\Omega \subset \Omega'$. Show that if $x^*$ is an interior point of $\Omega$, then $x^*$ is a local minimizer of $f$ over $\Omega'$. Show that the same conclusion cannot be made if $x^*$ is not an interior point of $\Omega$.

6.5 Consider the problem of minimizing $f : \mathbb{R} \to \mathbb{R}$, $f \in \mathcal{C}^3$, over the constraint set $\Omega$. Suppose that 0 is an interior point of $\Omega$.

a. Suppose that 0 is a local minimizer. By the FONC we know that $f'(0) = 0$ (where $f'$ is the first derivative of $f$). By the SONC we know that $f''(0) \ge 0$ (where $f''$ is the second derivative of $f$). State and prove a third-order necessary condition (TONC) involving the third derivative at 0, $f'''(0)$.
b. Give an example of $f$ such that the FONC, SONC, and TONC (in part a) hold at the interior point 0, but 0 is not a local minimizer of $f$ over $\Omega$. (Show that your example is correct.)
c. Suppose that $f$ is a third-order polynomial. If 0 satisfies the FONC, SONC, and TONC (in part a), then is this sufficient for 0 to be a local minimizer?

6.6 Consider the problem of minimizing $f : \mathbb{R} \to \mathbb{R}$, $f \in \mathcal{C}^3$, over the constraint set $\Omega = [0,1]$. Suppose that $x^* = 0$ is a local minimizer.

a. By the FONC we know that $f'(0) \ge 0$ (where $f'$ is the first derivative of $f$). By the SONC we know that if $f'(0) = 0$, then $f''(0) \ge 0$ (where $f''$ is the second derivative of $f$). State and prove a third-order necessary condition involving the third derivative at 0, $f'''(0)$.
b. Give an example of $f$ such that the FONC, SONC, and TONC (in part a) hold at the point 0, but 0 is not a local minimizer of $f$ over $\Omega = [0,1]$.

6.7 Let $f : \mathbb{R}^n \to \mathbb{R}$, $x_0 \in \mathbb{R}^n$, and $\Omega \subset \mathbb{R}^n$. Show that

  $x_0 + \arg\min_{x \in \Omega} f(x) = \arg\min_{y \in \Omega'} f(y - x_0)$,

where $\Omega' = \{y : y - x_0 \in \Omega\}$.
6.8 Consider the following function $f : \mathbb{R}^2 \to \mathbb{R}$:

  $f(x) = x^T \begin{bmatrix} 1 & 2 \\ 4 & 7 \end{bmatrix} x + x^T \begin{bmatrix} 3 \\ 5 \end{bmatrix}$.

a. Find the gradient and Hessian of $f$ at the point $[1,1]^T$.
b. Find the directional derivative of $f$ at $[1,1]^T$ with respect to a unit vector in the direction of maximal rate of increase.
c. Find a point that satisfies the FONC (interior case) for $f$. Does this point satisfy the SONC (for a minimizer)?

6.9 Consider the following function:

  $f(x_1, x_2) = x_1^2 x_2 + x_2^2 x_1$.

a. In what direction does the function $f$ decrease most rapidly at the point $x^{(0)} = [2,1]^T$?
b. What is the rate of increase of $f$ at the point $x^{(0)}$ in the direction of maximum decrease of $f$?
c. Find the rate of increase of $f$ at the point $x^{(0)}$ in the direction $d = [3,4]^T$.

6.10 Consider the following function $f : \mathbb{R}^2 \to \mathbb{R}$:

  $f(x) = x^T \begin{bmatrix} 2 & -1 \\ 1 & 3 \end{bmatrix} x + x^T \begin{bmatrix} 5 \\ 4 \end{bmatrix}$.

a. Find the directional derivative of $f$ at $[0,1]^T$ in the direction $[1,0]^T$.
b. Find all points that satisfy the first-order necessary condition for $f$. Does $f$ have a minimizer? If it does, then find all minimizer(s); otherwise, explain why it does not.

6.11 Consider the problem

  minimize $-x_2$
  subject to $|x_2| \le x_1^2$, $x_1 \ge 0$,

where $x_1, x_2 \in \mathbb{R}$.

a. Does the point $[x_1, x_2]^T = 0$ satisfy the first-order necessary condition for a minimizer? That is, if $f$ is the objective function, is it true that $d^T \nabla f(0) \ge 0$ for all feasible directions $d$ at 0?
b. Is the point $[x_1, x_2]^T = 0$ a local minimizer, a strict local minimizer, a local maximizer, a strict local maximizer, or none of the above?

6.12 Consider the problem

  minimize $f(x)$ subject to $x \in \Omega$,

where $f : \mathbb{R}^2 \to \mathbb{R}$ is given by $f(x) = 5x_2$ with $x = [x_1, x_2]^T$, and $\Omega = \{x = [x_1, x_2]^T : x_1^2 + x_2 \ge 1\}$.

a. Does the point $x^* = [0,1]^T$ satisfy the first-order necessary condition?
b. Does the point $x^* = [0,1]^T$ satisfy the second-order necessary condition?
c. Is the point $x^* = [0,1]^T$ a local minimizer?

6.13 Consider the problem

  minimize $f(x)$ subject to $x \in \Omega$,

where $f : \mathbb{R}^2 \to \mathbb{R}$ is given by $f(x) = -3x_1$ with $x = [x_1, x_2]^T$, and $\Omega = \{x = [x_1, x_2]^T : x_1^2 + x_2^2 \le 2\}$. Answer each of the following questions, showing complete justification.

a. Does the point $x^* = [\sqrt{2}, 0]^T$ satisfy the first-order necessary condition?
b. Does the point $x^* = [\sqrt{2}, 0]^T$ satisfy the second-order necessary condition?
c. Is the point $x^* = [\sqrt{2}, 0]^T$ a local minimizer?

6.14 Consider the problem

  minimize $f(x)$ subject to $x \in \Omega$,

where $\Omega = \{x \in \mathbb{R}^2 : x_1^2 + x_2^2 \ge 1\}$ and $f(x) = x_2$.

a. Find all point(s) satisfying the FONC.
b. Which of the point(s) in part a satisfy the SONC?
c. Which of the point(s) in part a are local minimizers?

6.15 Consider the problem

  minimize $f(x)$ subject to $x \in \Omega$,

where $f : \mathbb{R}^2 \to \mathbb{R}$ is given by $f(x) = 3x_1$ with $x = [x_1, x_2]^T$, and $\Omega = \{x = [x_1, x_2]^T : x_1^2 + x_2^2 \ge 2\}$. Answer each of the following questions, showing complete justification.

a. Does the point $x^* = [\sqrt{2}, 0]^T$ satisfy the first-order necessary condition?
b. Does the point $x^* = [\sqrt{2}, 0]^T$ satisfy the second-order necessary condition?
c. Is the point $x^* = [\sqrt{2}, 0]^T$ a local minimizer?
Hint: Draw a picture with the constraint set and level sets of $f$.

6.16 Consider the problem

  minimize $f(x)$ subject to $x \in \Omega$,

where $x = [x_1, x_2]^T$, $f : \mathbb{R}^2 \to \mathbb{R}$ is given by $f(x) = 4x_2 - x_1^2$, and $\Omega = \{x : x_1^2 + 2x_1 - x_2 \ge 0,\ x_1 \ge 0,\ x_2 \ge 0\}$.

a. Does the point $x^* = 0 = [0,0]^T$ satisfy the first-order necessary condition?
b. Does the point $x^* = 0$ satisfy the second-order necessary condition?
c. Is the point $x^* = 0$ a local minimizer of the given problem?
6.17 Consider the problem

  maximize $f(x)$ subject to $x \in \Omega$,

where $\Omega \subset \{x \in \mathbb{R}^2 : x_1 \ge 0, x_2 \ge 0\}$ and $f : \Omega \to \mathbb{R}$ is given by $f(x) = \log(x_1) + \log(x_2)$ with $x = [x_1, x_2]^T$, where "log" represents natural logarithm. Suppose that $x^*$ is an optimal solution. Answer each of the following questions, showing complete justification.

a. Is it possible that $x^*$ is an interior point of $\Omega$?
b. At what point(s) (if any) is the second-order necessary condition satisfied?

6.18 Suppose that we are given $n$ real numbers, $x_1, \ldots, x_n$. Find the number $x \in \mathbb{R}$ such that the sum of the squared differences between $x$ and the numbers above is minimized (assuming that the solution $x$ exists).

6.19 An art collector stands at a distance of $x$ feet from the wall, where a piece of art (picture) of height $a$ feet is hung, $b$ feet above his eyes, as shown in Figure 6.9. Find the distance from the wall for which the angle $\theta$ subtended by the eye to the picture is maximized.
Hint: (1) Maximizing $\theta$ is equivalent to maximizing $\tan(\theta)$. (2) If $\theta = \theta_2 - \theta_1$, then

  $\tan(\theta) = \dfrac{\tan(\theta_2) - \tan(\theta_1)}{1 + \tan(\theta_2)\tan(\theta_1)}$.

Figure 6.9 Art collector's eye position in Exercise 6.19.

Figure 6.10 Simplified fetal heart monitoring system for Exercise 6.20.

6.20 Figure 6.10 shows a simplified model of a fetal heart monitoring system (the distances shown have been scaled down to make the calculations simpler). A heartbeat sensor is located at position $x$ (see Figure 6.10). The energy of the heartbeat signal measured by the sensor is the reciprocal of the squared distance from the source (baby's heart or mother's heart). Find the position of the sensor that maximizes the signal-to-interference ratio, which is the ratio of the signal energy from the baby's heart to the signal energy from the mother's heart.

6.21 An amphibian vehicle needs to travel from point A (on land) to point B (in water), as illustrated in Figure 6.11. The speeds at which the vehicle travels on land and water are $v_1$ and $v_2$, respectively.

Figure 6.11 Path of amphibian vehicle in Exercise 6.21.

a. Suppose that the vehicle traverses a path that minimizes the total time taken to travel from A to B. Use the first-order necessary condition to show that for the optimal path above, the angles $\theta_1$ and $\theta_2$ in Figure 6.11 satisfy Snell's law:

  $\dfrac{\sin\theta_1}{\sin\theta_2} = \dfrac{v_1}{v_2}$.

b. Does the minimizer for the problem in part a satisfy the second-order sufficient condition?

6.22 Suppose that you have a piece of land to sell and you have two buyers. If the first buyer receives a fraction $x_1$ of the piece of land, the buyer will pay you $U_1(x_1)$ dollars. Similarly, the second buyer will pay you $U_2(x_2)$ dollars for a fraction $x_2$ of the land. Your goal is to sell parts of your land to the two buyers so that you maximize the total dollars you receive. (Other than the constraint that you can only sell whatever land you own, there are no restrictions on how much land you can sell to each buyer.)

a. Formulate the problem as an optimization problem of the kind

  maximize $f(x)$ subject to $x \in \Omega$

by specifying $f$ and $\Omega$. Draw a picture of the constraint set.
b. Suppose that $U_i(x_i) = a_i x_i$, $i = 1, 2$, where $a_1$ and $a_2$ are given positive constants such that $a_1 > a_2$.
Find all feasible points that satisfy the first-order necessary condition, giving full justification.
c. Among those points in the answer of part b, find all that also satisfy the second-order necessary condition.

6.23 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by

  $f(x) = (x_1 - x_2)^4 + x_1^2 - x_2^2 - 2x_1 + 2x_2 + 1$,

where $x = [x_1, x_2]^T$. Suppose that we wish to minimize $f$ over $\mathbb{R}^2$. Find all points satisfying the FONC. Do these points satisfy the SONC?

6.24 Show that if $d$ is a feasible direction at a point $x \in \Omega$, then for all $\beta > 0$, the vector $\beta d$ is also a feasible direction at $x$.

6.25 Let $\Omega = \{x \in \mathbb{R}^n : Ax = b\}$. Show that $d \in \mathbb{R}^n$ is a feasible direction at $x \in \Omega$ if and only if $Ad = 0$.

6.26 Let $f : \mathbb{R}^2 \to \mathbb{R}$. Consider the problem

  minimize $f(x)$
  subject to $x_1, x_2 \ge 0$,

where $x = [x_1, x_2]^T$. Suppose that $\nabla f(0) \ne 0$, and

  $\dfrac{\partial f}{\partial x_1}(0) \le 0$, $\dfrac{\partial f}{\partial x_2}(0) \le 0$.

Show that 0 cannot be a minimizer for this problem.

6.27 Let $c \in \mathbb{R}^n$, $c \ne 0$, and consider the problem of minimizing the function $f(x) = c^T x$ over a constraint set $\Omega \subset \mathbb{R}^n$. Show that we cannot have a solution lying in the interior of $\Omega$.

6.28 Consider the problem

  maximize $c_1 x_1 + c_2 x_2$
  subject to $x_1 + x_2 \le 1$, $x_1, x_2 \ge 0$,

where $c_1$ and $c_2$ are constants such that $c_1 > c_2 > 0$. This is a linear programming problem (see Part III). Assuming that the problem has an optimal feasible solution, use the first-order necessary condition to show that the unique optimal feasible solution $x^*$ is $[1,0]^T$.
Hint: First show that $x^*$ cannot lie in the interior of the constraint set. Then, show that $x^*$ cannot lie on the line segments $L_1 = \{x : x_1 = 0,\ 0 \le x_2 \le 1\}$, $L_2 = \{x : 0 \le x_1 < 1,\ x_2 = 0\}$, $L_3 = \{x : 0 < x_1 < 1,\ x_2 = 1 - x_1\}$.

6.29 Line Fitting. Let $[x_1, y_1]^T, \ldots, [x_n, y_n]^T$, $n \ge 2$, be points on the $\mathbb{R}^2$ plane (each $x_i, y_i \in \mathbb{R}$). We wish to find the straight line of "best fit" through these points ("best" in the sense that the average squared error is minimized); that is, we wish to find $a, b \in \mathbb{R}$ to minimize

  $f(a,b) = \dfrac{1}{n}\sum_{i=1}^n (a x_i + b - y_i)^2$.

a. Let

  $\bar X = \dfrac{1}{n}\sum_{i=1}^n x_i$,  $\bar Y = \dfrac{1}{n}\sum_{i=1}^n y_i$,  $\overline{X^2} = \dfrac{1}{n}\sum_{i=1}^n x_i^2$,  $\overline{Y^2} = \dfrac{1}{n}\sum_{i=1}^n y_i^2$,  $\overline{XY} = \dfrac{1}{n}\sum_{i=1}^n x_i y_i$.

Show that $f(a,b)$ can be written in the form $z^T Q z - 2 c^T z + d$, where $z = [a, b]^T$, $Q = Q^T \in \mathbb{R}^{2\times 2}$, $c \in \mathbb{R}^2$, and $d \in \mathbb{R}$, and find expressions for $Q$, $c$, and $d$ in terms of $\bar X$, $\bar Y$, $\overline{X^2}$, $\overline{Y^2}$, and $\overline{XY}$.
b. Assume that the $x_i$, $i = 1, \ldots, n$, are not all equal. Find the parameters $a^*$ and $b^*$ for the line of best fit in terms of $\bar X$, $\bar Y$, $\overline{X^2}$, $\overline{Y^2}$, and $\overline{XY}$. Show that the point $[a^*, b^*]^T$ is the only local minimizer of $f$.
Hint: $\overline{X^2} - (\bar X)^2 = \dfrac{1}{n}\sum_{i=1}^n (x_i - \bar X)^2$.
c. Show that if $a^*$ and $b^*$ are the parameters of the line of best fit, then $\bar Y = a^* \bar X + b^*$ (and hence once we have computed $a^*$, we can compute $b^*$ using the formula $b^* = \bar Y - a^* \bar X$).

6.30 Suppose that we are given a set of vectors $\{x^{(1)}, \ldots, x^{(p)}\}$, $x^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, p$. Find the vector $x \in \mathbb{R}^n$ such that the average squared distance (norm) between $x$ and $x^{(1)}, \ldots, x^{(p)}$,

  $\dfrac{1}{p}\sum_{i=1}^p \|x - x^{(i)}\|^2$,

is minimized. Use the SOSC to prove that the vector $x$ found above is a strict local minimizer. How is $x$ related to the centroid (or center of gravity) of the given set of points $\{x^{(1)}, \ldots, x^{(p)}\}$?

6.31 Consider a function $f : \Omega \to \mathbb{R}$, where $\Omega \subset \mathbb{R}^n$ is a convex set and $f \in \mathcal{C}^1$. Given $x^* \in \Omega$, suppose that there exists $c > 0$ such that $d^T \nabla f(x^*) \ge c\|d\|$ for all feasible directions $d$ at $x^*$. Show that $x^*$ is a strict local minimizer of $f$ over $\Omega$.
6.32 Prove the following generalization of the second-order sufficient condition:
Theorem: Let $\Omega$ be a convex subset of $\mathbb{R}^n$, $f \in \mathcal{C}^2$ a real-valued function on $\Omega$, and $x^*$ a point in $\Omega$. Suppose that there exists $c \in \mathbb{R}$, $c > 0$, such that for all feasible directions $d$ at $x^*$ ($d \ne 0$), the following hold:
1. $d^T \nabla f(x^*) \ge 0$.
2. $d^T F(x^*) d \ge c\|d\|^2$.
Then, $x^*$ is a strict local minimizer of $f$.

6.33 Consider the quadratic function $f : \mathbb{R}^n \to \mathbb{R}$ given by

  $f(x) = \dfrac{1}{2} x^T Q x - x^T b$,

where $Q = Q^T > 0$. Show that $x^*$ minimizes $f$ if and only if $x^*$ satisfies the FONC.

6.34 Consider the linear system $x_{k+1} = a x_k + b u_{k+1}$, $k \ge 0$, where $x_i \in \mathbb{R}$, $u_i \in \mathbb{R}$, and the initial condition is $x_0 = 0$. Find the values of the control inputs $u_1, \ldots, u_n$ to minimize

  $-q x_n + r \sum_{i=1}^n u_i^2$,

where $q, r > 0$ are given constants. This can be interpreted as desiring to make $x_n$ as large as possible but at the same time desiring to make the total input energy $\sum_{i=1}^n u_i^2$ as small as possible. The constants $q$ and $r$ reflect the relative weights of these two objectives.

CHAPTER 7
ONE-DIMENSIONAL SEARCH METHODS

7.1 Introduction

In this chapter, we are interested in the problem of minimizing an objective function $f : \mathbb{R} \to \mathbb{R}$ (i.e., a one-dimensional problem). The approach is to use an iterative search algorithm, also called a line-search method. One-dimensional search methods are of interest for the following reasons. First, they are special cases of search methods used in multivariable problems. Second, they are used as part of general multivariable algorithms (as described later in Section 7.8).

In an iterative algorithm, we start with an initial candidate solution $x^{(0)}$ and generate a sequence of iterates $x^{(1)}, x^{(2)}, \ldots$. For each iteration $k = 0, 1, 2, \ldots$, the next point $x^{(k+1)}$ depends on $x^{(k)}$ and the objective function $f$. The algorithm may use only the value of $f$ at specific points, or perhaps its first derivative $f'$, or even its second derivative $f''$. In this chapter, we study several algorithms:

■ Golden section method (uses only $f$)
■ Fibonacci method (uses only $f$)
■ Bisection method (uses only $f'$)
■ Secant method (uses only $f'$)
■ Newton's method (uses $f'$ and $f''$)

The exposition here is based on [27].

7.2 Golden Section Search

The search methods we discuss in this and the next two sections allow us to determine the minimizer of an objective function $f : \mathbb{R} \to \mathbb{R}$ over a closed interval, say $[a_0, b_0]$. The only property that we assume of the objective function $f$ is that it is unimodal, which means that $f$ has only one local minimizer. An example of such a function is depicted in Figure 7.1.

Figure 7.1 Unimodal function.

The methods we discuss are based on evaluating the objective function at different points in the interval $[a_0, b_0]$. We choose these points in such a way that an approximation to the minimizer of $f$ may be achieved in as few evaluations as possible. Our goal is to narrow the range progressively until the minimizer is "boxed in" with sufficient accuracy.

Consider a unimodal function $f$ of one variable and the interval $[a_0, b_0]$. If we evaluate $f$ at only one intermediate point of the interval, we cannot narrow the range within which we know the minimizer is located. We have to evaluate $f$ at two intermediate points, as illustrated in Figure 7.2.
Figure 7.2 Evaluating the objective function at two intermediate points.

We choose the intermediate points in such a way that the reduction in the range is symmetric, in the sense that

  $a_1 - a_0 = b_0 - b_1 = \rho(b_0 - a_0)$,

where

  $\rho < \dfrac{1}{2}$.

We then evaluate $f$ at the intermediate points. If $f(a_1) < f(b_1)$, then the minimizer must lie in the range $[a_0, b_1]$. If, on the other hand, $f(a_1) \ge f(b_1)$, then the minimizer is located in the range $[a_1, b_0]$ (see Figure 7.3).

Figure 7.3 The case where $f(a_1) < f(b_1)$; the minimizer $x^* \in [a_0, b_1]$.

Starting with the reduced range of uncertainty, we can repeat the process and similarly find two new points, say $a_2$ and $b_2$, using the same value of $\rho < \frac{1}{2}$ as before. However, we would like to minimize the number of objective function evaluations while reducing the width of the uncertainty interval. Suppose, for example, that $f(a_1) < f(b_1)$, as in Figure 7.3. Then, we know that $x^* \in [a_0, b_1]$. Because $a_1$ is already in the uncertainty interval and $f(a_1)$ is already known, we can make $a_1$ coincide with $b_2$. Thus, only one new evaluation of $f$ at $a_2$ would be necessary.

To find the value of $\rho$ that results in only one new evaluation of $f$, see Figure 7.4. Without loss of generality, imagine that the original range $[a_0, b_0]$ is of unit length. Then, to have only one new evaluation of $f$ it is enough to choose $\rho$ so that

  $\rho(b_1 - a_0) = b_1 - b_2$.

Because $b_1 - a_0 = 1 - \rho$ and $b_1 - b_2 = 1 - 2\rho$, we have

  $\rho(1 - \rho) = 1 - 2\rho$.

We write the quadratic equation above as

  $\rho^2 - 3\rho + 1 = 0$.

The solutions are

  $\rho_1 = \dfrac{3 + \sqrt{5}}{2}$, $\rho_2 = \dfrac{3 - \sqrt{5}}{2}$.

Figure 7.4 Finding the value of $\rho$ resulting in only one new evaluation of $f$.

Because we require that $\rho < \frac{1}{2}$, we take

  $\rho = \dfrac{3 - \sqrt{5}}{2} \approx 0.382$.

Observe that

  $1 - \rho = \dfrac{\sqrt{5} - 1}{2}$

and

  $\dfrac{\rho}{1 - \rho} = \dfrac{1 - \rho}{1}$.

Thus, dividing a range in the ratio of $\rho$ to $1 - \rho$ has the effect that the ratio of the shorter segment to the longer equals the ratio of the longer to the sum of the two. This rule was referred to by ancient Greek geometers as the golden section.

Using the golden section rule means that at every stage of the uncertainty range reduction (except the first), the objective function $f$ need only be evaluated at one new point. The uncertainty range is reduced by the ratio $1 - \rho \approx 0.61803$ at every stage. Hence, $N$ steps of reduction using the golden section method reduces the range by the factor

  $(1 - \rho)^N \approx (0.61803)^N$.

Example 7.1 Suppose that we wish to use the golden section search method to find the value of $x$ that minimizes

  $f(x) = x^4 - 14x^3 + 60x^2 - 70x$

in the interval $[0, 2]$ (this function comes from an example in [21]). We wish to locate this value of $x$ to within a range of 0.3.

After $N$ stages the range $[0,2]$ is reduced by $(0.61803)^N$. So, we choose $N$ so that $(0.61803)^N \le 0.3/2$. Four stages of reduction will do; that is, $N = 4$.

Iteration 1. We evaluate $f$ at two intermediate points $a_1$ and $b_1$. We have

  $a_1 = a_0 + \rho(b_0 - a_0) = 0.7639$,
  $b_1 = a_0 + (1 - \rho)(b_0 - a_0) = 1.236$,

where $\rho = (3 - \sqrt{5})/2$. We compute

  $f(a_1) = -24.36$, $f(b_1) = -18.96$.

Thus, $f(a_1) < f(b_1)$, so the uncertainty interval is reduced to

  $[a_0, b_1] = [0, 1.236]$.

Iteration 2. We choose $b_2$ to coincide with $a_1$, and so $f$ need only be evaluated at one new point,

  $a_2 = a_0 + \rho(b_1 - a_0) = 0.4721$.

We have

  $f(a_2) = -21.10$, $f(b_2) = f(a_1) = -24.36$.

Now, $f(b_2) < f(a_2)$, so the uncertainty interval is reduced to

  $[a_2, b_1] = [0.4721, 1.236]$.
Iteration 3. We set $a_3 = b_2$ and compute $b_3$:

  $b_3 = a_2 + (1 - \rho)(b_1 - a_2) = 0.9443$.

We have

  $f(a_3) = f(b_2) = -24.36$, $f(b_3) = -23.59$.

So $f(b_3) > f(a_3)$. Hence, the uncertainty interval is further reduced to

  $[a_2, b_3] = [0.4721, 0.9443]$.

Iteration 4. We set $b_4 = a_3$ and

  $a_4 = a_2 + \rho(b_3 - a_2) = 0.6525$.

We have

  $f(a_4) = -23.84$, $f(b_4) = f(a_3) = -24.36$.

Hence, $f(a_4) > f(b_4)$. Thus, the value of $x$ that minimizes $f$ is located in the interval

  $[a_4, b_3] = [0.6525, 0.9443]$.

Note that $b_3 - a_4 = 0.292 < 0.3$. ■
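The golden section iterations of Example 7.1 are easily mechanized. The following MATLAB sketch (ours, not part of the original example; the loop count of four matches the example) reproduces the computation above and ends with the uncertainty interval [0.6525, 0.9443]:

  % Golden section search, reproducing Example 7.1 (illustrative sketch)
  f = @(x) x.^4 - 14*x.^3 + 60*x.^2 - 70*x;
  a = 0;  b = 2;  rho = (3 - sqrt(5))/2;
  a1 = a + rho*(b - a);      b1 = a + (1 - rho)*(b - a);
  fa1 = f(a1);               fb1 = f(b1);
  for k = 1:4                          % four stages of reduction
      if fa1 < fb1                     % minimizer in [a, b1]
          b = b1;  b1 = a1;  fb1 = fa1;
          a1 = a + rho*(b - a);        fa1 = f(a1);
      else                             % minimizer in [a1, b]
          a = a1;  a1 = b1;  fa1 = fb1;
          b1 = a + (1 - rho)*(b - a);  fb1 = f(b1);
      end
  end
  fprintf('uncertainty interval: [%.4f, %.4f]\n', a, b);

Note how each pass reuses one interior point and its function value, so only one new evaluation of $f$ is needed per stage, exactly as the golden section rule promises.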
7.3 Fibonacci Method

Recall that the golden section method uses the same value of $\rho$ throughout. Suppose now that we are allowed to vary the value of $\rho$ from stage to stage, so that at the $k$th stage in the reduction process we use a value $\rho_k$, at the next stage we use a value $\rho_{k+1}$, and so on.

As in the golden section search, our goal is to select successive values of $\rho_k$, $0 \le \rho_k \le 1/2$, such that only one new function evaluation is required at each stage. To derive the strategy for selecting evaluation points, consider Figure 7.5. From this figure we see that it is sufficient to choose the $\rho_k$ such that

  $\rho_{k+1}(1 - \rho_k) = 1 - 2\rho_k$.

After some manipulations, we obtain

  $\rho_{k+1} = 1 - \dfrac{\rho_k}{1 - \rho_k}$.

Figure 7.5 Selecting evaluation points.

There are many sequences $\rho_1, \rho_2, \ldots$ that satisfy the law of formation above and the condition that $0 \le \rho_k \le 1/2$. For example, the sequence $\rho_1 = \rho_2 = \rho_3 = \cdots = (3 - \sqrt{5})/2$ satisfies the conditions above and gives rise to the golden section method.

Suppose that we are given a sequence $\rho_1, \rho_2, \ldots$ satisfying the conditions above and we use this sequence in our search algorithm. Then, after $N$ iterations of the algorithm, the uncertainty range is reduced by a factor of

  $(1 - \rho_1)(1 - \rho_2)\cdots(1 - \rho_N)$.

Depending on the sequence $\rho_1, \rho_2, \ldots$, we get a different reduction factor. The natural question is as follows: What sequence $\rho_1, \rho_2, \ldots$ minimizes the reduction factor above? This problem is a constrained optimization problem that can be stated formally as

  minimize $(1 - \rho_1)(1 - \rho_2)\cdots(1 - \rho_N)$
  subject to $\rho_{k+1} = 1 - \dfrac{\rho_k}{1 - \rho_k}$, $k = 1, \ldots, N-1$,
       $0 \le \rho_k \le \dfrac{1}{2}$, $k = 1, \ldots, N$.

Before we give the solution to the optimization problem above, we need to introduce the Fibonacci sequence $F_1, F_2, F_3, \ldots$. This sequence is defined as follows. First, let $F_{-1} = 0$ and $F_0 = 1$ by convention. Then, for $k \ge 0$,

  $F_{k+1} = F_k + F_{k-1}$.

Some values of elements in the Fibonacci sequence are:

  $F_1 = 1$, $F_2 = 2$, $F_3 = 3$, $F_4 = 5$, $F_5 = 8$, $F_6 = 13$, $F_7 = 21$, $F_8 = 34$.

It turns out that the solution to the optimization problem above is

  $\rho_1 = 1 - \dfrac{F_N}{F_{N+1}}$, $\rho_2 = 1 - \dfrac{F_{N-1}}{F_N}$, $\ldots$, $\rho_k = 1 - \dfrac{F_{N-k+1}}{F_{N-k+2}}$, $\ldots$, $\rho_N = 1 - \dfrac{F_1}{F_2}$,

where the $F_k$ are the elements of the Fibonacci sequence. The resulting algorithm is called the Fibonacci search method. We present a proof for the optimality of the Fibonacci search method later in this section.

In the Fibonacci search method, the uncertainty range is reduced by the factor

  $(1 - \rho_1)(1 - \rho_2)\cdots(1 - \rho_N) = \dfrac{F_N}{F_{N+1}}\cdot\dfrac{F_{N-1}}{F_N}\cdots\dfrac{F_1}{F_2} = \dfrac{F_1}{F_{N+1}} = \dfrac{1}{F_{N+1}}$.

Because the Fibonacci method uses the optimal values of $\rho_1, \rho_2, \ldots$, the reduction factor above is less than that of the golden section method. In other words, the Fibonacci method is better than the golden section method in that it gives a smaller final uncertainty range.

We point out that there is an anomaly in the final iteration of the Fibonacci search method, because

  $\rho_N = 1 - \dfrac{F_1}{F_2} = \dfrac{1}{2}$.

Recall that we need two intermediate points at each stage, one that comes from a previous iteration and another that is a new evaluation point. However, with $\rho_N = 1/2$, the two intermediate points coincide in the middle of the uncertainty interval, and therefore we cannot further reduce the uncertainty range. To get around this problem, we perform the new evaluation for the last iteration using $\rho_N = 1/2 - \varepsilon$, where $\varepsilon$ is a small number. In other words, the new evaluation point is just to the left or right of the midpoint of the uncertainty interval. This modification to the Fibonacci method is, of course, of no significant practical consequence.

As a result of the modification above, the reduction in the uncertainty range at the last iteration may be either

  $1 - \rho_N = \dfrac{1}{2}$

or

  $1 - (\rho_N - \varepsilon) = \dfrac{1}{2} + \varepsilon = \dfrac{1 + 2\varepsilon}{2}$,

depending on which of the two points has the smaller objective function value. Therefore, in the worst case, the reduction factor in the uncertainty range for the Fibonacci method is

  $\dfrac{1 + 2\varepsilon}{F_{N+1}}$.

Example 7.2 Consider the function

  $f(x) = x^4 - 14x^3 + 60x^2 - 70x$.

Suppose that we wish to use the Fibonacci search method to find the value of $x$ that minimizes $f$ over the range $[0,2]$, and locate this value of $x$ to within the range 0.3.

After $N$ steps the range is reduced by $(1 + 2\varepsilon)/F_{N+1}$ in the worst case. We need to choose $N$ such that

  $\dfrac{1 + 2\varepsilon}{F_{N+1}} \le \dfrac{\text{final range}}{\text{initial range}} = \dfrac{0.3}{2} = 0.15$.

Thus, we need

  $F_{N+1} \ge \dfrac{1 + 2\varepsilon}{0.15}$.

If we choose $\varepsilon \le 0.1$, then $N = 4$ will do.

Iteration 1. We start with

  $\rho_1 = 1 - \dfrac{F_4}{F_5} = \dfrac{3}{8}$.

We then compute

  $a_1 = a_0 + \rho_1(b_0 - a_0) = \dfrac{3}{4}$, $b_1 = a_0 + (1 - \rho_1)(b_0 - a_0) = \dfrac{5}{4}$,
  $f(a_1) = -24.34$, $f(b_1) = -18.65$, $f(a_1) < f(b_1)$.

The range is reduced to

  $[a_0, b_1] = \left[0, \dfrac{5}{4}\right]$.

Iteration 2. We have

  $\rho_2 = 1 - \dfrac{F_3}{F_4} = \dfrac{2}{5}$,
  $a_2 = a_0 + \rho_2(b_1 - a_0) = \dfrac{1}{2}$, $b_2 = a_1 = \dfrac{3}{4}$,
  $f(a_2) = -21.69$, $f(b_2) = f(a_1) = -24.34$, $f(a_2) > f(b_2)$,

so the range is reduced to

  $[a_2, b_1] = \left[\dfrac{1}{2}, \dfrac{5}{4}\right]$.

Iteration 3. We compute

  $\rho_3 = 1 - \dfrac{F_2}{F_3} = \dfrac{1}{3}$,
  $a_3 = b_2 = \dfrac{3}{4}$, $b_3 = a_2 + (1 - \rho_3)(b_1 - a_2) = 1$,
  $f(a_3) = f(b_2) = -24.34$, $f(b_3) = -23$, $f(a_3) < f(b_3)$.

The range is reduced to

  $[a_2, b_3] = \left[\dfrac{1}{2}, 1\right]$.

Iteration 4. We choose $\varepsilon = 0.05$. We have

  $\rho_4 = 1 - \dfrac{F_1}{F_2} = \dfrac{1}{2}$,
  $a_4 = a_2 + (\rho_4 - \varepsilon)(b_3 - a_2) = 0.725$, $b_4 = a_3 = \dfrac{3}{4}$,
  $f(a_4) = -24.27$, $f(b_4) = f(a_3) = -24.34$, $f(a_4) > f(b_4)$.

The range is reduced to $[a_4, b_3] = [0.725, 1]$. Note that $b_3 - a_4 = 0.275 < 0.3$. ■
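The step ratios $\rho_k$ used in Example 7.2 follow directly from the formulas above. The sketch below (our own illustration) generates the Fibonacci numbers with the convention $F_{-1} = 0$, $F_0 = 1$ and prints the $\rho_k$ and the worst-case reduction factor for $N = 4$ and $\varepsilon = 0.05$:

  % Fibonacci step ratios and worst-case reduction (illustrative sketch)
  N = 4;  eps_ = 0.05;
  F = zeros(1, N+2);            % F(i) stores F_{i-1}, i.e., F_0,...,F_{N+1}
  F(1) = 1;  F(2) = 1;          % F_0 = 1, F_1 = 1 (with F_{-1} = 0)
  for i = 3:N+2
      F(i) = F(i-1) + F(i-2);
  end
  rho = zeros(1, N);
  for k = 1:N
      rho(k) = 1 - F(N-k+2)/F(N-k+3);   % rho_k = 1 - F_{N-k+1}/F_{N-k+2}
  end
  disp(rho)                     % for N = 4: 3/8, 2/5, 1/3, 1/2
  disp((1 + 2*eps_)/F(N+2))     % (1+2*eps)/F_{N+1} = 0.1375 <= 0.15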
We now turn to a proof of the optimality of the Fibonacci search method. Skipping the rest of this section does not affect the continuity of the presentation.

To begin, recall that we wish to prove that the values of $\rho_1, \rho_2, \ldots, \rho_N$ used in the Fibonacci method, where $\rho_k = 1 - F_{N-k+1}/F_{N-k+2}$, solve the optimization problem

  minimize $(1 - \rho_1)(1 - \rho_2)\cdots(1 - \rho_N)$
  subject to $\rho_{k+1} = 1 - \dfrac{\rho_k}{1 - \rho_k}$, $k = 1, \ldots, N-1$,
       $0 \le \rho_k \le \dfrac{1}{2}$, $k = 1, \ldots, N$.

It is easy to check that the values of $\rho_1, \rho_2, \ldots$ above for the Fibonacci search method satisfy the feasibility conditions in the optimization problem above (see Exercise 7.4). Recall that the Fibonacci method has an overall reduction factor of $(1 - \rho_1)\cdots(1 - \rho_N) = 1/F_{N+1}$. To prove that the Fibonacci search method is optimal, we show that for any feasible values of $\rho_1, \ldots, \rho_N$, we have

  $(1 - \rho_1)\cdots(1 - \rho_N) \ge \dfrac{1}{F_{N+1}}$.

It is more convenient to work with $r_k = 1 - \rho_k$ rather than $\rho_k$. The optimization problem stated in terms of $r_k$ is

  minimize $r_1 \cdots r_N$
  subject to $r_{k+1} = \dfrac{1}{r_k} - 1$, $k = 1, \ldots, N-1$,
       $\dfrac{1}{2} \le r_k \le 1$, $k = 1, \ldots, N$.

Note that if $r_1, r_2, \ldots$ satisfy $r_{k+1} = 1/r_k - 1$, then $r_k \ge 1/2$ if and only if $r_{k+1} \le 1$. Also, $r_k \ge 1/2$ if and only if $r_{k-1} \le 2/3 \le 1$. Therefore, in the constraints above, we may remove the constraint $r_k \le 1$, because it is implied implicitly by $r_k \ge 1/2$ and the other constraints. Therefore, the constraints above reduce to

  $r_{k+1} = \dfrac{1}{r_k} - 1$, $k = 1, \ldots, N-1$,
  $r_k \ge \dfrac{1}{2}$, $k = 1, \ldots, N$.

To proceed, we need the following technical lemmas. In the statements of the lemmas, we assume that $r_1, r_2, \ldots$ is a sequence that satisfies

  $r_{k+1} = \dfrac{1}{r_k} - 1$, $r_k \ge \dfrac{1}{2}$, $k = 1, 2, \ldots$.

Lemma 7.1 For $k \ge 2$,

  $r_k = -\dfrac{F_{k-2} - F_{k-1} r_1}{F_{k-3} - F_{k-2} r_1}$. □

Proof. We proceed by induction. For $k = 2$ we have

  $r_2 = \dfrac{1}{r_1} - 1 = \dfrac{1 - r_1}{r_1} = -\dfrac{F_0 - F_1 r_1}{F_{-1} - F_0 r_1}$,

and hence the lemma holds for $k = 2$. Suppose now that the lemma holds for $k \ge 2$. We show that it also holds for $k + 1$. We have

  $r_{k+1} = \dfrac{1}{r_k} - 1 = -\dfrac{F_{k-3} - F_{k-2} r_1}{F_{k-2} - F_{k-1} r_1} - 1 = -\dfrac{F_{k-3} + F_{k-2} - (F_{k-2} + F_{k-1}) r_1}{F_{k-2} - F_{k-1} r_1} = -\dfrac{F_{k-1} - F_k r_1}{F_{k-2} - F_{k-1} r_1}$,

where we used the formation law for the Fibonacci sequence. ■

Lemma 7.2 For $k \ge 2$,

  $(-1)^k (F_{k-2} - F_{k-1} r_1) > 0$. □

Proof. We proceed by induction. For $k = 2$, we have

  $(-1)^2 (F_0 - F_1 r_1) = 1 - r_1$.

But $r_1 = 1/(1 + r_2) \le 2/3$, and hence $1 - r_1 > 0$. Therefore, the result holds for $k = 2$. Suppose now that the lemma holds for $k \ge 2$. We show that it also holds for $k + 1$. We have

  $(-1)^{k+1}(F_{k-1} - F_k r_1) = (-1)^{k+1} r_{k+1} \dfrac{1}{r_{k+1}} (F_{k-1} - F_k r_1)$.

By Lemma 7.1,

  $\dfrac{1}{r_{k+1}} = -\dfrac{F_{k-2} - F_{k-1} r_1}{F_{k-1} - F_k r_1}$.

Substituting for $1/r_{k+1}$, we obtain

  $(-1)^{k+1}(F_{k-1} - F_k r_1) = r_{k+1}(-1)^k (F_{k-2} - F_{k-1} r_1) > 0$,

which completes the proof. ■

Lemma 7.3 For $k \ge 2$,

  $(-1)^{k+1} r_1 \ge (-1)^{k+1}\dfrac{F_k}{F_{k+1}}$. □

Proof. Because $r_{k+1} = \dfrac{1}{r_k} - 1$ and $r_k \ge \dfrac{1}{2}$, we have $r_{k+1} \le 1$. Substituting for $r_{k+1}$ from Lemma 7.1, we get

  $-\dfrac{F_{k-1} - F_k r_1}{F_{k-2} - F_{k-1} r_1} \le 1$.

Multiplying the numerator and denominator by $(-1)^k$ yields

  $\dfrac{(-1)^{k+1}(F_{k-1} - F_k r_1)}{(-1)^k (F_{k-2} - F_{k-1} r_1)} \le 1$.

By Lemma 7.2, $(-1)^k(F_{k-2} - F_{k-1} r_1) > 0$, and therefore we can multiply both sides of the inequality above by $(-1)^k(F_{k-2} - F_{k-1} r_1)$ to obtain

  $(-1)^{k+1}(F_{k-1} - F_k r_1) \le (-1)^k (F_{k-2} - F_{k-1} r_1)$.

Rearranging yields

  $(-1)^{k+1}(F_{k-1} + F_k) r_1 \ge (-1)^{k+1}(F_{k-2} + F_{k-1})$.

Using the law of formation of the Fibonacci sequence, we get

  $(-1)^{k+1} F_{k+1} r_1 \ge (-1)^{k+1} F_k$,

which upon dividing by $F_{k+1}$ on both sides gives the desired result. ■

We are now ready to prove the optimality of the Fibonacci search method and the uniqueness of this optimal solution.

Theorem 7.1 Let $r_1, \ldots, r_N$, $N \ge 2$, satisfy the constraints

  $r_{k+1} = \dfrac{1}{r_k} - 1$, $k = 1, \ldots, N-1$,
  $r_k \ge \dfrac{1}{2}$, $k = 1, \ldots, N$.

Then,

  $r_1 \cdots r_N \ge \dfrac{1}{F_{N+1}}$.

Furthermore,

  $r_1 \cdots r_N = \dfrac{1}{F_{N+1}}$

if and only if $r_k = F_{N-k+1}/F_{N-k+2}$, $k = 1, \ldots, N$. In other words, the values of $r_1, \ldots, r_N$ used in the Fibonacci search method form a unique solution to the optimization problem. □

Proof. By substituting expressions for $r_1, \ldots, r_N$ from Lemma 7.1 and performing the appropriate cancellations, we obtain

  $r_1 \cdots r_N = (-1)^N (F_{N-2} - F_{N-1} r_1) = (-1)^N F_{N-2} + F_{N-1}(-1)^{N+1} r_1$.

Using Lemma 7.3 yields

  $r_1 \cdots r_N \ge (-1)^N F_{N-2} + F_{N-1}(-1)^{N+1}\dfrac{F_N}{F_{N+1}} = (-1)^N\left(F_{N-2} F_{N+1} - F_{N-1} F_N\right)\dfrac{1}{F_{N+1}}$.

By Exercise 7.5, it is readily checked that the following identity holds:

  $(-1)^N (F_{N-2} F_{N+1} - F_{N-1} F_N) = 1$.

Hence,

  $r_1 \cdots r_N \ge \dfrac{1}{F_{N+1}}$.

From the above we see that

  $r_1 \cdots r_N = \dfrac{1}{F_{N+1}}$

if and only if

  $r_1 = \dfrac{F_N}{F_{N+1}}$.

This is simply the value of $r_1$ for the Fibonacci search method. Note that fixing $r_1$ determines $r_2, \ldots, r_N$ uniquely. ■
For further discussion on the Fibonacci search method and its variants, see [133].

7.4 Bisection Method

Again we consider finding the minimizer of an objective function $f : \mathbb{R} \to \mathbb{R}$ over an interval $[a_0, b_0]$. As before, we assume that the objective function $f$ is unimodal. Further, suppose that $f$ is continuously differentiable and that we can use values of the derivative $f'$ as a basis for reducing the uncertainty interval.

The bisection method is a simple algorithm for successively reducing the uncertainty interval based on evaluations of the derivative. To begin, let $x^{(0)} = (a_0 + b_0)/2$ be the midpoint of the initial uncertainty interval. Next, evaluate $f'(x^{(0)})$. If $f'(x^{(0)}) > 0$, then we deduce that the minimizer lies to the left of $x^{(0)}$. In other words, we reduce the uncertainty interval to $[a_0, x^{(0)}]$. On the other hand, if $f'(x^{(0)}) < 0$, then we deduce that the minimizer lies to the right of $x^{(0)}$. In this case, we reduce the uncertainty interval to $[x^{(0)}, b_0]$. Finally, if $f'(x^{(0)}) = 0$, then we declare $x^{(0)}$ to be the minimizer and terminate our search.

With the new uncertainty interval computed, we repeat the process iteratively. At each iteration $k$, we compute the midpoint of the uncertainty interval. Call this point $x^{(k)}$. Depending on the sign of $f'(x^{(k)})$ (assuming that it is nonzero), we reduce the uncertainty interval to the left or right of $x^{(k)}$. If at any iteration $k$ we find that $f'(x^{(k)}) = 0$, then we declare $x^{(k)}$ to be the minimizer and terminate our search.

Two salient features distinguish the bisection method from the golden section and Fibonacci methods. First, instead of using values of $f$, the bisection method uses values of $f'$. Second, at each iteration, the length of the uncertainty interval is reduced by a factor of $1/2$. Hence, after $N$ steps, the range is reduced by a factor of $(1/2)^N$. This factor is smaller than in the golden section and Fibonacci methods.

Example 7.3 Recall Example 7.1, where we wish to find the minimizer of

  $f(x) = x^4 - 14x^3 + 60x^2 - 70x$

in the interval $[0,2]$ to within a range of 0.3. The golden section method requires at least four stages of reduction. If, instead, we use the bisection method, we would choose $N$ so that $(0.5)^N \le 0.3/2$. In this case, only three stages of reduction are needed. ■
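A sketch of the bisection method in MATLAB (ours, not part of the original example), applied to the function of Example 7.3, is given below; three derivative evaluations reduce the interval [0,2] to a width of 0.25 < 0.3:

  % Bisection method using f' (illustrative sketch)
  df = @(x) 4*x.^3 - 42*x.^2 + 120*x - 70;   % f'(x) for Example 7.3
  a = 0;  b = 2;
  for k = 1:3                  % three stages suffice for range 0.3
      x = (a + b)/2;
      if df(x) > 0
          b = x;               % minimizer lies to the left of x
      elseif df(x) < 0
          a = x;               % minimizer lies to the right of x
      else
          a = x;  b = x;       % x is the minimizer
          break
      end
  end
  fprintf('uncertainty interval: [%.4f, %.4f]\n', a, b);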
7.5 Newton's Method

Suppose again that we are confronted with the problem of minimizing a function $f$ of a single real variable $x$. We assume now that at each measurement point $x^{(k)}$ we can determine $f(x^{(k)})$, $f'(x^{(k)})$, and $f''(x^{(k)})$. We can fit a quadratic function through $x^{(k)}$ that matches its first and second derivatives with those of the function $f$. This quadratic has the form

  $q(x) = f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) + \dfrac{1}{2} f''(x^{(k)})(x - x^{(k)})^2$.

Note that $q(x^{(k)}) = f(x^{(k)})$, $q'(x^{(k)}) = f'(x^{(k)})$, and $q''(x^{(k)}) = f''(x^{(k)})$. Then, instead of minimizing $f$, we minimize its approximation $q$. The first-order necessary condition for a minimizer of $q$ yields

  $0 = q'(x) = f'(x^{(k)}) + f''(x^{(k)})(x - x^{(k)})$.

Setting $x = x^{(k+1)}$, we obtain

  $x^{(k+1)} = x^{(k)} - \dfrac{f'(x^{(k)})}{f''(x^{(k)})}$.

Example 7.4 Using Newton's method, we will find the minimizer of

  $f(x) = \dfrac{1}{2}x^2 - \sin x$.

Suppose that the initial value is $x^{(0)} = 0.5$, and that the required accuracy is $\varepsilon = 10^{-5}$, in the sense that we stop when $|x^{(k+1)} - x^{(k)}| < \varepsilon$. We compute

  $f'(x) = x - \cos x$, $f''(x) = 1 + \sin x$.

Hence,

  $x^{(1)} = 0.5 - \dfrac{0.5 - \cos 0.5}{1 + \sin 0.5} = 0.5 - \dfrac{-0.3775}{1.479} = 0.7552$.

Proceeding in a similar manner, we obtain

  $x^{(2)} = x^{(1)} - \dfrac{f'(x^{(1)})}{f''(x^{(1)})} = 0.7391$,

and further iterations yield $x^{(3)}$ and $x^{(4)}$. Note that $|x^{(4)} - x^{(3)}| < \varepsilon = 10^{-5}$. Furthermore, $f'(x^{(4)}) = -8.6 \times 10^{-6} \approx 0$. Observe that $f''(x^{(4)}) = 1.673 > 0$, so we can assume that $x^* \approx x^{(4)}$ is a strict minimizer. ■

Figure 7.6 Newton's algorithm with $f''(x) > 0$.

Newton's method works well if $f''(x) > 0$ everywhere (see Figure 7.6). However, if $f''(x) < 0$ for some $x$, Newton's method may fail to converge to the minimizer (see Figure 7.7).

Figure 7.7 Newton's algorithm with $f''(x) < 0$.

Newton's method can also be viewed as a way to drive the first derivative of $f$ to zero. Indeed, if we set $g(x) = f'(x)$, then we obtain a formula for iterative solution of the equation $g(x) = 0$:

  $x^{(k+1)} = x^{(k)} - \dfrac{g(x^{(k)})}{g'(x^{(k)})}$.

In other words, we can use Newton's method for zero finding.

Example 7.5 We apply Newton's method to improve a first approximation, $x^{(0)} = 12$, to the root of the equation

  $g(x) = x^3 - 12.2x^2 + 7.45x + 42 = 0$.

We have $g'(x) = 3x^2 - 24.4x + 7.45$. Performing two iterations yields

  $x^{(1)} = 12 - \dfrac{102.6}{146.65} = 11.33$,
  $x^{(2)} = 11.33 - \dfrac{14.73}{116.11} = 11.21$. ■

Newton's method for solving equations of the form $g(x) = 0$ is also referred to as Newton's method of tangents. This name is easily justified if we look at a geometric interpretation of the method when applied to the solution of the equation $g(x) = 0$ (see Figure 7.8).

Figure 7.8 Newton's method of tangents.

If we draw a tangent to $g(x)$ at the given point $x^{(k)}$, then the tangent line intersects the $x$-axis at the point $x^{(k+1)}$, which we expect to be closer to the root $x^*$ of $g(x) = 0$. Note that the slope of $g(x)$ at $x^{(k)}$ is

  $g'(x^{(k)}) = \dfrac{g(x^{(k)})}{x^{(k)} - x^{(k+1)}}$.

Hence,

  $x^{(k+1)} = x^{(k)} - \dfrac{g(x^{(k)})}{g'(x^{(k)})}$.

Newton's method of tangents may fail if the first approximation to the root is such that the ratio $g(x^{(0)})/g'(x^{(0)})$ is not small enough (see Figure 7.9). Thus, an initial approximation to the root is very important.

Figure 7.9 Example where Newton's method of tangents fails to converge to the root $x^*$ of $g(x) = 0$.
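The iteration of Example 7.4 can be written in a few lines. The following sketch (ours; the stopping rule is the one stated in the example) converges to $x^* \approx 0.7391$:

  % Newton's method for Example 7.4 (illustrative sketch)
  df  = @(x) x - cos(x);      % f'(x)  for f(x) = x^2/2 - sin(x)
  ddf = @(x) 1 + sin(x);      % f''(x)
  x = 0.5;  eps_ = 1e-5;
  while true
      x_new = x - df(x)/ddf(x);        % Newton update
      if abs(x_new - x) < eps_
          x = x_new;
          break
      end
      x = x_new;
  end
  fprintf('x* = %.7f, f''(x*) = %g\n', x, df(x));

Since $f''(x) = 1 + \sin x > 0$ near the iterates here, the method behaves as in Figure 7.6 and converges rapidly.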
7.6 Secant Method

Newton's method for minimizing $f$ uses second derivatives of $f$:

  $x^{(k+1)} = x^{(k)} - \dfrac{f'(x^{(k)})}{f''(x^{(k)})}$.

If the second derivative is not available, we may attempt to approximate it using first derivative information. In particular, we may approximate $f''(x^{(k)})$ above with

  $\dfrac{f'(x^{(k)}) - f'(x^{(k-1)})}{x^{(k)} - x^{(k-1)}}$.

Using the foregoing approximation of the second derivative, we obtain the algorithm

  $x^{(k+1)} = x^{(k)} - \dfrac{x^{(k)} - x^{(k-1)}}{f'(x^{(k)}) - f'(x^{(k-1)})} f'(x^{(k)})$,

called the secant method. Note that the algorithm requires two initial points to start it, which we denote $x^{(-1)}$ and $x^{(0)}$. The secant algorithm can be represented in the following equivalent form:

  $x^{(k+1)} = \dfrac{f'(x^{(k)}) x^{(k-1)} - f'(x^{(k-1)}) x^{(k)}}{f'(x^{(k)}) - f'(x^{(k-1)})}$.

Observe that, like Newton's method, the secant method does not directly involve values of $f(x^{(k)})$. Instead, it tries to drive the derivative $f'$ to zero. In fact, as we did for Newton's method, we can interpret the secant method as an algorithm for solving equations of the form $g(x) = 0$. Specifically, the secant algorithm for finding a root of the equation $g(x) = 0$ takes the form

  $x^{(k+1)} = x^{(k)} - \dfrac{x^{(k)} - x^{(k-1)}}{g(x^{(k)}) - g(x^{(k-1)})} g(x^{(k)})$,

or, equivalently,

  $x^{(k+1)} = \dfrac{g(x^{(k)}) x^{(k-1)} - g(x^{(k-1)}) x^{(k)}}{g(x^{(k)}) - g(x^{(k-1)})}$.

The secant method for root finding is illustrated in Figure 7.10 (compare this with Figure 7.8). Unlike Newton's method, which uses the slope of $g$ to determine the next point, the secant method uses the "secant" between the $(k-1)$th and $k$th points to determine the $(k+1)$th point.

Figure 7.10 Secant method for root finding.

Example 7.6 We apply the secant method to find the root of the equation

  $g(x) = x^3 - 12.2x^2 + 7.45x + 42 = 0$.

We perform two iterations, with starting points $x^{(-1)} = 13$ and $x^{(0)} = 12$. We obtain

  $x^{(1)} = 11.40$, $x^{(2)} = 11.25$. ■

Example 7.7 Suppose that the voltage across a resistor in a circuit decays according to the model $V(t) = e^{-Rt}$, where $V(t)$ is the voltage at time $t$ and $R$ is the resistance value.

Given measurements $V_1, \ldots, V_n$ of the voltage at times $t_1, \ldots, t_n$, respectively, we wish to find the best estimate of $R$. By the best estimate we mean the value of $R$ that minimizes the total squared error between the measured voltages and the voltages predicted by the model.

We derive an algorithm to find the best estimate of $R$ using the secant method. The objective function is

  $f(R) = \sum_{i=1}^n \left(V_i - e^{-R t_i}\right)^2$.

Hence, we have

  $f'(R) = 2 \sum_{i=1}^n \left(V_i - e^{-R t_i}\right) e^{-R t_i} t_i$.

The secant algorithm for the problem is

  $R_{k+1} = R_k - \dfrac{R_k - R_{k-1}}{\sum_{i=1}^n (V_i - e^{-R_k t_i}) e^{-R_k t_i} t_i - \sum_{i=1}^n (V_i - e^{-R_{k-1} t_i}) e^{-R_{k-1} t_i} t_i} \sum_{i=1}^n (V_i - e^{-R_k t_i}) e^{-R_k t_i} t_i$. ■

For further reading on the secant method, see [32].

Newton's method and the secant method are instances of quadratic fit methods. In Newton's method, $x^{(k+1)}$ is the stationary point of a quadratic function that fits $f'$ and $f''$ at $x^{(k)}$. In the secant method, $x^{(k+1)}$ is the stationary point of a quadratic function that fits $f'$ at $x^{(k)}$ and $x^{(k-1)}$. The secant method uses only $f'$ (and not $f''$) but needs values from two previous points. We leave it to the reader to verify that if we set $x^{(k+1)}$ to be the stationary point of a quadratic function that fits $f$ at $x^{(k)}$, $x^{(k-1)}$, and $x^{(k-2)}$, we obtain a quadratic fit method that uses only values of $f$:

  $x^{(k+1)} = \dfrac{\sigma_{12} f(x^{(k)}) + \sigma_{20} f(x^{(k-1)}) + \sigma_{01} f(x^{(k-2)})}{2\left(\delta_{12} f(x^{(k)}) + \delta_{20} f(x^{(k-1)}) + \delta_{01} f(x^{(k-2)})\right)}$,

where $\sigma_{ij} = (x^{(k-i)})^2 - (x^{(k-j)})^2$ and $\delta_{ij} = x^{(k-i)} - x^{(k-j)}$ (see Exercise 7.9). This method does not use $f'$ or $f''$, but needs values of $f$ from three previous points. Three points are needed to initialize the iterations. The method is also sometimes called inverse parabolic interpolation.

An approach similar to fitting (or interpolation) based on higher-order polynomials is possible. For example, we could set $x^{(k+1)}$ to be a stationary point of a cubic function that fits $f'$ at $x^{(k)}$, $x^{(k-1)}$, and $x^{(k-2)}$.

It is often practically advantageous to combine multiple methods, to overcome the limitations in any one method. For example, the golden section method is more robust but slower than inverse parabolic interpolation. Brent's method combines the two [17], resulting in a method that is faster than the golden section method but still retains its robustness properties.
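The root-finding form of the secant method is equally short. The sketch below (ours, not part of the original example) performs the two iterations of Example 7.6:

  % Secant method for root finding, as in Example 7.6 (illustrative sketch)
  g = @(x) x.^3 - 12.2*x.^2 + 7.45*x + 42;
  x_prev = 13;  x = 12;                % x^(-1) and x^(0)
  for k = 1:2                          % two iterations, as in the text
      x_new = x - (x - x_prev)/(g(x) - g(x_prev)) * g(x);
      x_prev = x;  x = x_new;
      fprintf('x(%d) = %.4f\n', k, x);
  end

Replacing g here by the derivative $f'$ of an objective function gives the minimization form of the algorithm, such as the resistance-estimation iteration of Example 7.7.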
7.7 Bracketing

Many of the methods we have described rely on an initial interval in which the minimizer is known to lie. This interval is also called a bracket, and procedures for finding such a bracket are called bracketing methods. To find a bracket $[a, b]$ containing the minimizer, assuming unimodality, it suffices to find three points $a < c < b$ such that $f(c) < f(a)$ and $f(c) < f(b)$.

A simple bracketing procedure is as follows. First, we pick three arbitrary points $x_0 < x_1 < x_2$. If $f(x_1) < f(x_0)$ and $f(x_1) < f(x_2)$, then we are done: the desired bracket is $[x_0, x_2]$. If not, say $f(x_0) > f(x_1) > f(x_2)$, then we pick a point $x_3 > x_2$ and check if $f(x_2) < f(x_3)$. If it holds, then again we are done: the desired bracket is $[x_1, x_3]$. Otherwise, we continue with this process until the function increases. Typically, each new point chosen involves an expansion in distance between successive test points. For example, we could double the distance between successive points, as illustrated in Figure 7.11. An analogous process applies if the initial three points are such that $f(x_0) < f(x_1) < f(x_2)$.

Figure 7.11 An illustration of the process of bracketing a minimizer.

In the procedure described above, when the bracketing process terminates, we have three points $x_{k-2}$, $x_{k-1}$, and $x_k$ such that $f(x_{k-1}) < f(x_{k-2})$ and $f(x_{k-1}) < f(x_k)$. The desired bracket is then $[x_{k-2}, x_k]$, which we can then use to initialize any of a number of search methods, including the golden section, Fibonacci, and bisection methods. Note that at this point, we have already evaluated $f(x_{k-2})$, $f(x_{k-1})$, and $f(x_k)$. If function evaluations are expensive to obtain, it would help if the point $x_{k-1}$ inside the bracket also coincides with one of the points used in the search method. For example, if we intend to use the golden section method, then it would help if $x_{k-1} - x_{k-2} = \rho(x_k - x_{k-2})$, where $\rho = (3 - \sqrt{5})/2$. In this case, $x_{k-1}$ would be one of the two points within the initial interval used in the golden section method. This is achieved if each successive point $x_k$ is chosen such that $x_k = x_{k-1} + (2 - \rho)(x_{k-1} - x_{k-2})$. In this case, the expansion in the distance between successive points is a factor $2 - \rho \approx 1.618$, which is less than double.
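A minimal version of the doubling procedure just described is sketched below (ours; for simplicity it assumes the function initially decreases to the right of the starting point, as it does for the sample function shown, and the starting point and step are arbitrary choices):

  % Simple doubling bracket search (illustrative sketch)
  f = @(x) x.^4 - 14*x.^3 + 60*x.^2 - 70*x;   % sample function
  x0 = 0;  step = 0.1;                        % sample starting choices
  x1 = x0 + step;  x2 = x1 + 2*step;
  while f(x2) <= f(x1)                        % not yet bracketed
      step = 2*step;                          % double the gap
      x0 = x1;  x1 = x2;  x2 = x2 + 2*step;
  end
  fprintf('bracket: [%g, %g], interior point %g\n', x0, x2, x1);

On termination, $f(x_1) < f(x_0)$ and $f(x_1) < f(x_2)$, so $[x_0, x_2]$ can seed the golden section, Fibonacci, or bisection methods described earlier.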
Second, practical experience suggests that it is better to allocate more computational time on iterating the multidimensional LINE SEARCH IN MULTIDIMENSIONAL OPTIMIZATION Figure 7.12 125 Line search in multidimensional optimization. optimization algorithm rather than performing exact line searches. These considerations led to the development of conditions for terminating line-search algorithms that would result in low-accuracy line searches while still securing a sufficient decrease in the value of the / from one iteration to the next. The basic idea is that we have to ensure that the step size ctk is not too small or too large. Some commonly used termination conditions are as follows. First, let ε G (0,1), 7 > 1, and η G (ε, 1) be given constants. The Armijo condition ensures that Qfc is not too large by requiring that 0fc(a f c )<0fc(O)+ea f c ^(O). Further, it ensures that a& is not too small by requiring that 0*(7<*fc) > 0*(O) + e7a*0!b(O). The Goldstein condition differs from Armijo in the second inequality: <£*(<**) > ^ ( Ο ) + η α * 0 * ( Ο ) . The first Armijo inequality together with the Goldstein condition are often jointly called the Armijo-Goldstein condition. The Wolfe condition differs from Goldstein in that it involves only (fr'k: 4>'k(ak) > ηφ'^0). 126 ONE-DIMENSIONAL SEARCH METHODS A stronger variation of this is the strong Wolfe condition: Wk{ak)\<nWkmA simple practical (inexact) line-search method is the Armijo backtracking algorithm, described as follows. We start with some candidate value for the step size α&. If this candidate value satisfies a prespecified termination condition (usually the first Armijo inequality), then we stop and use it as the step size. Otherwise, we iteratively decrease the value by multiplying it by some constant factor r G (0,1) (typically r = 0.5) and re-check the termination condition. If a^0^ is the initial candidate value, then after m iterations the value obtained is α& = τ^α^. The algorithm backtracks from the initial value until the termination condition holds. In other words, the algorithm produces a value for the step size of the form α^ = rma^ with m being the smallest value in {0,1,2,...} for which α^ satisfies the termination condition. For more information on practical line-search methods, we refer the reader to [43, pp. 26-40], [96, Sec. 10.5], [11, App. C], [49], and [50].x EXERCISES 7.1 Suppose that we have a unimodal function over the interval [5,8]. Give an example of a desired final uncertainty range where the golden section method requires at least four iterations, whereas the Fibonacci method requires only three. You may choose an arbitrarily small value of ε for the Fibonacci method. 7.2 Let f(x) = x2 + 4cosx, x G i We wish to find the minimizer x* of / over the interval [1,2]. (Calculator users: Note that in cosx, the argument x is in radians.) a. Plot f(x) versus x over the interval [1,2]. b . Use the golden section method to locate x* to within an uncertainty of 0.2. Display all intermediate steps using a table: Iteration k 1 2 CLk bk ? ? ? ? /K) /(**) ? ? ? ? New uncertainty interval [?,?] [?,?] c. Repeat part b using the Fibonacci method, with ε = 0.05. Display all intermediate steps using a table: 1 We thank Dennis M. Goodman for furnishing us with references [49] and [50]. EXERCISES Iteration k 1 2 Pk ? Q>k &fc ? Hak) ? ? ? ? ? ? /(M ? ? 127 New uncertainty interval [?,?] [?,?] d. Apply Newton's method, using the same number of iterations as in part b, with χ(°) = 1. 
EXERCISES

7.1 Suppose that we have a unimodal function over the interval [5, 8]. Give an example of a desired final uncertainty range where the golden section method requires at least four iterations, whereas the Fibonacci method requires only three. You may choose an arbitrarily small value of $\varepsilon$ for the Fibonacci method.

7.2 Let $f(x) = x^2 + 4\cos x$, $x \in \mathbb{R}$. We wish to find the minimizer $x^*$ of $f$ over the interval [1, 2]. (Calculator users: Note that in $\cos x$, the argument $x$ is in radians.)

a. Plot $f(x)$ versus $x$ over the interval [1, 2].

b. Use the golden section method to locate $x^*$ to within an uncertainty of 0.2. Display all intermediate steps using a table:

    Iteration k   a_k   b_k   f(a_k)   f(b_k)   New uncertainty interval
    1             ?     ?     ?        ?        [?, ?]
    2             ?     ?     ?        ?        [?, ?]

c. Repeat part b using the Fibonacci method, with $\varepsilon = 0.05$. Display all intermediate steps using a table:

    Iteration k   rho_k   a_k   b_k   f(a_k)   f(b_k)   New uncertainty interval
    1             ?       ?     ?     ?        ?        [?, ?]
    2             ?       ?     ?     ?        ?        [?, ?]

d. Apply Newton's method, using the same number of iterations as in part b, with $x^{(0)} = 1$.

7.3 Let $f(x) = 8e^{1-x} + 7\log(x)$, where "log" represents the natural logarithm function.

a. Use MATLAB to plot $f(x)$ versus $x$ over the interval [1, 2], and verify that $f$ is unimodal over [1, 2].

b. Write a simple MATLAB program to implement the golden section method that locates the minimizer of $f$ over [1, 2] to within an uncertainty of 0.23. Display all intermediate steps using a table as in Exercise 7.2.

c. Repeat part b using the Fibonacci method, with $\varepsilon = 0.05$. Display all intermediate steps using a table as in Exercise 7.2.

7.4 Suppose that $\rho_1, \ldots, \rho_N$ are the values used in the Fibonacci search method. Show that for each $k = 1, \ldots, N$, $0 \leq \rho_k \leq 1/2$, and for each $k = 1, \ldots, N-1$,
$$\rho_{k+1} = 1 - \frac{\rho_k}{1 - \rho_k}.$$

7.5 Show that if $F_0, F_1, \ldots$ is the Fibonacci sequence, then for each $k = 2, 3, \ldots$,
$$F_{k-2}F_{k+1} - F_{k-1}F_k = (-1)^k.$$

7.6 Show that the Fibonacci sequence can be calculated using the formula
$$F_k = \frac{1}{\sqrt{5}}\left(\left(\frac{1+\sqrt{5}}{2}\right)^{k+1} - \left(\frac{1-\sqrt{5}}{2}\right)^{k+1}\right).$$

7.7 Suppose that we have an efficient way of calculating exponentials. Based on this, use Newton's method to devise a method to approximate log(2) [where "log" is the natural logarithm function]. Use an initial point of $x^{(0)} = 1$, and perform two iterations.

7.8 Consider the problem of finding the zero of $g(x) = (e^x - 1)/(e^x + 1)$, $x \in \mathbb{R}$, where $e^x$ is the exponential of $x$. (Note that 0 is the unique zero of $g$.)

a. Write down the algorithm for Newton's method of tangents applied to this problem. Simplify using the identity $\sinh x = (e^x - e^{-x})/2$.

b. Find an initial condition $x^{(0)}$ such that the algorithm cycles [i.e., $x^{(0)} = x^{(2)} = x^{(4)} = \cdots$]. You need not explicitly calculate the initial condition; it suffices to provide an equation that the initial condition must satisfy. Hint: Draw a graph of $g$.

c. For what values of the initial condition does the algorithm converge?

7.9 Derive a one-dimensional search (minimization) algorithm based on quadratic fit with only objective function values. Specifically, derive an algorithm that computes $x^{(k+1)}$ based on $x^{(k)}$, $x^{(k-1)}$, $x^{(k-2)}$, $f(x^{(k)})$, $f(x^{(k-1)})$, and $f(x^{(k-2)})$. Hint: To simplify, use the notation $\sigma_{ij} = (x^{(k-i)})^2 - (x^{(k-j)})^2$ and $\delta_{ij} = x^{(k-i)} - x^{(k-j)}$. You might also find it useful to experiment with your algorithm by writing a MATLAB program. Note that three points are needed to initialize the algorithm.

7.10 The objective of this exercise is to implement the secant method using MATLAB.

a. Write a simple MATLAB program to implement the secant method to locate the root of the equation $g(x) = 0$. For the stopping criterion, use the condition $|x^{(k+1)} - x^{(k)}| < |x^{(k)}|\varepsilon$, where $\varepsilon > 0$ is a given constant.

b. Let $g(x) = (2x - 1)^2 + 4(4 - 1024x)^4$. Find the root of $g(x) = 0$ using the secant method with $x^{(-1)} = 0$, $x^{(0)} = 1$, and $\varepsilon = 10^{-5}$. Also determine the value of $g$ at the solution obtained.

7.11 Write a MATLAB function that implements a line search algorithm using the secant method. The arguments to this function are the name of the M-file for the gradient, the current point, and the search direction. For example, the function may be called linesearch_secant and be used by the function call alpha=linesearch_secant('grad',x,d), where grad.m is the M-file containing the gradient, x is the starting line search point, d is the search direction, and alpha is the value returned by the function [which we use in the following chapters as the step size for iterative algorithms (see, e.g., Exercises 8.25 and 10.11)].
Note: In the solutions manual, we used the stopping criterion
$$|d^T\nabla f(x + \alpha d)| \leq \varepsilon|d^T\nabla f(x)|,$$
where $\varepsilon > 0$ is a prespecified number, $\nabla f$ is the gradient, $x$ is the starting line search point, and $d$ is the search direction. The rationale for this stopping criterion is that we want to reduce the directional derivative of $f$ in the direction $d$ by the specified fraction $\varepsilon$. We used a value of $\varepsilon = 10^{-4}$ and initial conditions of 0 and 0.001.

7.12 Consider using a gradient algorithm to minimize the function
$$f(x) = \frac{1}{2}x^T\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}x$$
with the initial guess $x^{(0)} = [0.8, -0.25]^T$.

a. To initialize the line search, apply the bracketing procedure in Figure 7.11 along the line starting at $x^{(0)}$ in the direction of the negative gradient. Use $\varepsilon = 0.075$.

b. Apply the golden section method to reduce the width of the uncertainty region to 0.01. Organize the results of your computation in a table format similar to that of Exercise 7.2.

c. Repeat the above using the Fibonacci method.

CHAPTER 8

GRADIENT METHODS

8.1 Introduction

In this chapter we consider a class of search methods for real-valued functions on $\mathbb{R}^n$. These methods use the gradient of the given function. In our discussion we use such terms as level sets, normal vectors, and tangent vectors. These notions were discussed in some detail in Part I.

Recall that a level set of a function $f : \mathbb{R}^n \to \mathbb{R}$ is the set of points $x$ satisfying $f(x) = c$ for some constant $c$. Thus, a point $x_0 \in \mathbb{R}^n$ is on the level set corresponding to level $c$ if $f(x_0) = c$. In the case of functions of two real variables, $f : \mathbb{R}^2 \to \mathbb{R}$, the notion of the level set is illustrated in Figure 8.1.

Figure 8.1 Constructing a level set corresponding to level $c$ for $f$.

The gradient of $f$ at $x_0$, denoted $\nabla f(x_0)$, if it is not a zero vector, is orthogonal to the tangent vector to an arbitrary smooth curve passing through $x_0$ on the level set $f(x) = c$. Thus, the direction of maximum rate of increase of a real-valued differentiable function at a point is orthogonal to the level set of the function through that point. In other words, the gradient acts in such a direction that for a given small displacement, the function $f$ increases more in the direction of the gradient than in any other direction. To prove this statement, recall that $\langle\nabla f(x), d\rangle$, $\|d\| = 1$, is the rate of increase of $f$ in the direction $d$ at the point $x$. By the Cauchy-Schwarz inequality,
$$\langle\nabla f(x), d\rangle \leq \|\nabla f(x)\|$$
because $\|d\| = 1$. But if $d = \nabla f(x)/\|\nabla f(x)\|$, then
$$\left\langle\nabla f(x), \frac{\nabla f(x)}{\|\nabla f(x)\|}\right\rangle = \|\nabla f(x)\|.$$
Thus, the direction in which $\nabla f(x)$ points is the direction of maximum rate of increase of $f$ at $x$. The direction in which $-\nabla f(x)$ points is the direction of maximum rate of decrease of $f$ at $x$. Hence, the direction of negative gradient is a good direction to search if we want to find a function minimizer.

We proceed as follows. Let $x^{(0)}$ be a starting point, and consider the point $x^{(0)} - \alpha\nabla f(x^{(0)})$. Then, by Taylor's theorem, we obtain
$$f(x^{(0)} - \alpha\nabla f(x^{(0)})) = f(x^{(0)}) - \alpha\|\nabla f(x^{(0)})\|^2 + o(\alpha).$$
Thus, if $\nabla f(x^{(0)}) \neq 0$, then for sufficiently small $\alpha > 0$, we have
$$f(x^{(0)} - \alpha\nabla f(x^{(0)})) < f(x^{(0)}).$$
This means that the point $x^{(0)} - \alpha\nabla f(x^{(0)})$ is an improvement over the point $x^{(0)}$ if we are searching for a minimizer.
To formulate an algorithm that implements this idea, suppose that we are given a point $x^{(k)}$. To find the next point $x^{(k+1)}$, we start at $x^{(k)}$ and move by an amount $-\alpha_k\nabla f(x^{(k)})$, where $\alpha_k$ is a positive scalar called the step size. This procedure leads to the following iterative algorithm:
$$x^{(k+1)} = x^{(k)} - \alpha_k\nabla f(x^{(k)}).$$
We refer to this as a gradient descent algorithm (or simply a gradient algorithm). The gradient varies as the search proceeds, tending to zero as we approach the minimizer. We have the option of either taking very small steps and reevaluating the gradient at every step, or taking large steps each time. The first approach results in a laborious method of reaching the minimizer, whereas the second approach may result in a more zigzag path to the minimizer. The advantage of the second approach is a possibly smaller number of gradient evaluations. Among the many methods that use this philosophy, the most popular is the method of steepest descent, which we discuss next.

Gradient methods are simple to implement and often perform well. For this reason, they are used widely in practical applications. For a discussion of applications of the steepest descent method to the computation of optimal controllers, we recommend [85, pp. 481-515]. In Chapter 13 we apply a gradient method to the training of a class of neural networks.

8.2 The Method of Steepest Descent

The method of steepest descent is a gradient algorithm where the step size $\alpha_k$ is chosen to achieve the maximum amount of decrease of the objective function at each individual step. Specifically, $\alpha_k$ is chosen to minimize $\phi_k(\alpha) = f(x^{(k)} - \alpha\nabla f(x^{(k)}))$. In other words,
$$\alpha_k = \arg\min_{\alpha\geq 0} f(x^{(k)} - \alpha\nabla f(x^{(k)})).$$
To summarize, the steepest descent algorithm proceeds as follows: At each step, starting from the point $x^{(k)}$, we conduct a line search in the direction $-\nabla f(x^{(k)})$ until a minimizer, $x^{(k+1)}$, is found. A typical sequence resulting from the method of steepest descent is depicted in Figure 8.2.

Figure 8.2 Typical sequence resulting from the method of steepest descent ($c_0 > c_1 > c_2 > c_3$).

Observe that the method of steepest descent moves in orthogonal steps, as stated in the following proposition.

Proposition 8.1 If $\{x^{(k)}\}_{k=0}^{\infty}$ is a steepest descent sequence for a given function $f : \mathbb{R}^n \to \mathbb{R}$, then for each $k$ the vector $x^{(k+1)} - x^{(k)}$ is orthogonal to the vector $x^{(k+2)} - x^{(k+1)}$. $\Box$

Proof. From the iterative formula of the method of steepest descent it follows that
$$\langle x^{(k+1)} - x^{(k)}, x^{(k+2)} - x^{(k+1)}\rangle = \alpha_k\alpha_{k+1}\langle\nabla f(x^{(k)}), \nabla f(x^{(k+1)})\rangle.$$
To complete the proof it is enough to show that
$$\langle\nabla f(x^{(k)}), \nabla f(x^{(k+1)})\rangle = 0.$$
To this end, observe that $\alpha_k$ is a nonnegative scalar that minimizes $\phi_k(\alpha) = f(x^{(k)} - \alpha\nabla f(x^{(k)}))$. Hence, using the FONC and the chain rule gives us
$$0 = \phi_k'(\alpha_k) = \nabla f(x^{(k)} - \alpha_k\nabla f(x^{(k)}))^T(-\nabla f(x^{(k)})) = -\langle\nabla f(x^{(k+1)}), \nabla f(x^{(k)})\rangle,$$
which completes the proof. ∎

The proposition above implies that $\nabla f(x^{(k)})$ is parallel to the tangent plane to the level set $\{f(x) = f(x^{(k+1)})\}$ at $x^{(k+1)}$. Note that as each new point is generated by the steepest descent algorithm, the corresponding value of the function $f$ decreases, as stated below.

Proposition 8.2 If $\{x^{(k)}\}_{k=0}^{\infty}$ is the steepest descent sequence for $f : \mathbb{R}^n \to \mathbb{R}$ and if $\nabla f(x^{(k)}) \neq 0$, then $f(x^{(k+1)}) < f(x^{(k)})$. $\Box$

Proof. First recall that
$$x^{(k+1)} = x^{(k)} - \alpha_k\nabla f(x^{(k)}),$$
where $\alpha_k \geq 0$ is the minimizer of
$$\phi_k(\alpha) = f(x^{(k)} - \alpha\nabla f(x^{(k)}))$$
over all $\alpha \geq 0$.
Thus, for all $\alpha \geq 0$, we have
$$\phi_k(\alpha_k) \leq \phi_k(\alpha).$$
By the chain rule,
$$\phi_k'(0) = \frac{d\phi_k}{d\alpha}(0) = -\left(\nabla f(x^{(k)} - 0\cdot\nabla f(x^{(k)}))\right)^T\nabla f(x^{(k)}) = -\|\nabla f(x^{(k)})\|^2 < 0$$
because $\nabla f(x^{(k)}) \neq 0$ by assumption. Thus, $\phi_k'(0) < 0$, and this implies that there is an $\bar\alpha > 0$ such that $\phi_k(0) > \phi_k(\alpha)$ for all $\alpha \in (0, \bar\alpha]$. Hence,
$$f(x^{(k+1)}) = \phi_k(\alpha_k) \leq \phi_k(\bar\alpha) < \phi_k(0) = f(x^{(k)}),$$
which completes the proof. ∎

In Proposition 8.2 we proved that the algorithm possesses the descent property: $f(x^{(k+1)}) < f(x^{(k)})$ if $\nabla f(x^{(k)}) \neq 0$. If for some $k$ we have $\nabla f(x^{(k)}) = 0$, then the point $x^{(k)}$ satisfies the FONC. In this case, $x^{(k+1)} = x^{(k)}$. We can use the above as the basis for a stopping (termination) criterion for the algorithm.

The condition $\nabla f(x^{(k+1)}) = 0$, however, is not directly suitable as a practical stopping criterion, because the numerically computed gradient will rarely be identically equal to zero. A practical stopping criterion is to check if the norm $\|\nabla f(x^{(k)})\|$ of the gradient is less than a prespecified threshold, in which case we stop. Alternatively, we may compute the absolute difference $|f(x^{(k+1)}) - f(x^{(k)})|$ between objective function values for every two successive iterations, and if the difference is less than some prespecified threshold, then we stop; that is, we stop when
$$|f(x^{(k+1)}) - f(x^{(k)})| < \varepsilon,$$
where $\varepsilon > 0$ is a prespecified threshold. Yet another alternative is to compute the norm $\|x^{(k+1)} - x^{(k)}\|$ of the difference between two successive iterates, and we stop if the norm is less than a prespecified threshold:
$$\|x^{(k+1)} - x^{(k)}\| < \varepsilon.$$
Alternatively, we may check "relative" values of the quantities above; for example,
$$\frac{|f(x^{(k+1)}) - f(x^{(k)})|}{|f(x^{(k)})|} < \varepsilon \quad\text{or}\quad \frac{\|x^{(k+1)} - x^{(k)}\|}{\|x^{(k)}\|} < \varepsilon.$$
The two (relative) stopping criteria above are preferable to the previous (absolute) criteria because the relative criteria are "scale-independent." For example, scaling the objective function does not change the satisfaction of the criterion $|f(x^{(k+1)}) - f(x^{(k)})|/|f(x^{(k)})| < \varepsilon$. Similarly, scaling the decision variable does not change the satisfaction of the criterion $\|x^{(k+1)} - x^{(k)}\|/\|x^{(k)}\| < \varepsilon$. To avoid dividing by very small numbers, we can modify these stopping criteria as follows:
$$\frac{|f(x^{(k+1)}) - f(x^{(k)})|}{\max\{1, |f(x^{(k)})|\}} < \varepsilon \quad\text{or}\quad \frac{\|x^{(k+1)} - x^{(k)}\|}{\max\{1, \|x^{(k)}\|\}} < \varepsilon.$$
Note that the stopping criteria above are relevant to all the iterative algorithms we discuss in this part.

Example 8.1 We use the method of steepest descent to find the minimizer of
$$f(x_1, x_2, x_3) = (x_1 - 4)^4 + (x_2 - 3)^2 + 4(x_3 + 5)^4.$$
The initial point is $x^{(0)} = [4, 2, -1]^T$. We perform three iterations. We find that
$$\nabla f(x) = [4(x_1 - 4)^3, 2(x_2 - 3), 16(x_3 + 5)^3]^T.$$
Hence,
$$\nabla f(x^{(0)}) = [0, -2, 1024]^T.$$
To compute $x^{(1)}$, we need
$$\alpha_0 = \arg\min_{\alpha\geq 0} f(x^{(0)} - \alpha\nabla f(x^{(0)})) = \arg\min_{\alpha\geq 0}\left(0 + (2 + 2\alpha - 3)^2 + 4(-1 - 1024\alpha + 5)^4\right) = \arg\min_{\alpha\geq 0}\phi_0(\alpha).$$
Using the secant method from Section 7.6, we obtain $\alpha_0 = 3.967 \times 10^{-3}$. For illustrative purposes, we show a plot of $\phi_0(\alpha)$ versus $\alpha$ in Figure 8.3, obtained using MATLAB. Thus,
$$x^{(1)} = x^{(0)} - \alpha_0\nabla f(x^{(0)}) = [4.000, 2.008, -5.062]^T.$$

Figure 8.3 Plot of $\phi_0(\alpha)$ versus $\alpha$.

To find $x^{(2)}$, we first determine
$$\nabla f(x^{(1)}) = [0.000, -1.984, -0.003875]^T.$$
Next, we find $\alpha_1$, where
$$\alpha_1 = \arg\min_{\alpha\geq 0}\left(0 + (2.008 + 1.984\alpha - 3)^2 + 4(-5.062 + 0.003875\alpha + 5)^4\right) = \arg\min_{\alpha\geq 0}\phi_1(\alpha).$$
Using the secant method again, we obtain $\alpha_1 = 0.5000$. Figure 8.4 depicts a plot of $\phi_1(\alpha)$ versus $\alpha$. Thus,
$$x^{(2)} = x^{(1)} - \alpha_1\nabla f(x^{(1)}) = [4.000, 3.000, -5.060]^T.$$

Figure 8.4 Plot of $\phi_1(\alpha)$ versus $\alpha$.

To find $x^{(3)}$, we first determine
$$\nabla f(x^{(2)}) = [0.000, 0.000, -0.003525]^T$$
and
$$\alpha_2 = \arg\min_{\alpha\geq 0}\left(0.000 + 0.000 + 4(-5.060 + 0.003525\alpha + 5)^4\right) = \arg\min_{\alpha\geq 0}\phi_2(\alpha).$$
We proceed as in the previous iterations to obtain $\alpha_2 = 16.29$. A plot of $\phi_2(\alpha)$ versus $\alpha$ is shown in Figure 8.5.

Figure 8.5 Plot of $\phi_2(\alpha)$ versus $\alpha$.

The value of $x^{(3)}$ is
$$x^{(3)} = [4.000, 3.000, -5.002]^T.$$
Note that the minimizer of $f$ is $[4, 3, -5]^T$, and hence it appears that we have arrived at the minimizer in only three iterations. The reader should be cautioned not to draw any conclusions from this example about the number of iterations required to arrive at a solution in general. It goes without saying that numerical computations, such as those in this example, are performed in practice using a computer (rather than by hand). The calculations above were written out explicitly, step by step, for the purpose of illustrating the operations involved in the steepest descent algorithm. The computations themselves were, in fact, carried out using a MATLAB program (see Exercise 8.25). ∎
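As a rough illustration of how such computations can be organized, the following MATLAB sketch applies steepest descent to the function of Example 8.1. Instead of the secant line search of Section 7.6, it uses MATLAB's built-in one-dimensional minimizer fminbnd over an assumed bracket [0, 100]; the bracket, the tolerance, and the fixed iteration count are our own illustrative choices.

    % Steepest descent for Example 8.1 (a minimal sketch; the line-search
    % bracket [0, 100] and tolerance are illustrative assumptions).
    f     = @(x) (x(1)-4)^4 + (x(2)-3)^2 + 4*(x(3)+5)^4;
    gradf = @(x) [4*(x(1)-4)^3; 2*(x(2)-3); 16*(x(3)+5)^3];
    x = [4; 2; -1];                           % initial point
    opts = optimset('TolX', 1e-8);
    for k = 1:3                               % three iterations, as in the example
        g   = gradf(x);
        phi = @(a) f(x - a*g);                % the one-dimensional function phi_k
        a   = fminbnd(phi, 0, 100, opts);     % line search
        x   = x - a*g;
    end
    % x is now approximately [4.000; 3.000; -5.002], matching x^(3) above.

In a general-purpose implementation, the fixed iteration count would be replaced by one of the stopping criteria discussed above, such as norm(gradf(x)) < epsilon.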
Let us now see what the method of steepest descent does with a quadratic function of the form
$$f(x) = \frac{1}{2}x^T Q x - b^T x,$$
where $Q \in \mathbb{R}^{n\times n}$ is a symmetric positive definite matrix, $b \in \mathbb{R}^n$, and $x \in \mathbb{R}^n$. The unique minimizer of $f$ can be found by setting the gradient of $f$ to zero, where
$$\nabla f(x) = Qx - b,$$
because $D(x^T Q x) = x^T(Q + Q^T) = 2x^T Q$ and $D(b^T x) = b^T$. There is no loss of generality in assuming $Q$ to be a symmetric matrix. For if we are given a quadratic form $x^T A x$ with $A \neq A^T$, then because the transposition of a scalar equals itself, we obtain $(x^T A x)^T = x^T A^T x = x^T A x$. Hence,
$$x^T A x = \frac{1}{2}x^T A x + \frac{1}{2}x^T A^T x = \frac{1}{2}x^T(A + A^T)x = \frac{1}{2}x^T Q x.$$
Note that
$$(A + A^T)^T = Q^T = A + A^T = Q.$$
The Hessian of $f$ is $F(x) = Q = Q^T > 0$. To simplify the notation, we write $g^{(k)} = \nabla f(x^{(k)})$. Then, the steepest descent algorithm for the quadratic function can be represented as
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)},$$
where
$$\alpha_k = \arg\min_{\alpha\geq 0} f(x^{(k)} - \alpha g^{(k)}) = \arg\min_{\alpha\geq 0}\left(\frac{1}{2}(x^{(k)} - \alpha g^{(k)})^T Q(x^{(k)} - \alpha g^{(k)}) - (x^{(k)} - \alpha g^{(k)})^T b\right).$$
In the quadratic case, we can find an explicit formula for $\alpha_k$. We proceed as follows. Assume that $g^{(k)} \neq 0$, for if $g^{(k)} = 0$, then $x^{(k)} = x^*$ and the algorithm stops. Because $\alpha_k > 0$ is a minimizer of $\phi_k(\alpha) = f(x^{(k)} - \alpha g^{(k)})$, we apply the FONC to $\phi_k$ to obtain
$$\phi_k'(\alpha) = (x^{(k)} - \alpha g^{(k)})^T Q(-g^{(k)}) - b^T(-g^{(k)}).$$
Therefore, $\phi_k'(\alpha) = 0$ if $\alpha g^{(k)T} Q g^{(k)} = (x^{(k)T} Q - b^T)g^{(k)}$. But
$$x^{(k)T} Q - b^T = g^{(k)T}.$$
Hence,
$$\alpha_k = \frac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}}.$$
In summary, the method of steepest descent for the quadratic takes the form
$$x^{(k+1)} = x^{(k)} - \frac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}}\,g^{(k)},$$
where $g^{(k)} = \nabla f(x^{(k)}) = Qx^{(k)} - b$.
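In code, the closed-form step size makes the quadratic case particularly simple. The following MATLAB sketch is a direct transcription of the update above; the particular Q, b, starting point, and stopping threshold are arbitrary illustrative choices.

    % Steepest descent with exact step size for f(x) = 0.5*x'*Q*x - b'*x
    % (a minimal sketch; Q, b, and the starting point are illustrative).
    Q = [8 2*sqrt(2); 2*sqrt(2) 10];   % symmetric positive definite
    b = [3; -1];
    x = [10; -5];
    g = Q*x - b;
    while norm(g) > 1e-8
        alpha = (g'*g)/(g'*Q*g);       % exact line-search step for quadratics
        x = x - alpha*g;
        g = Q*x - b;
    end
    % On exit, x approximates the unique minimizer x* = Q\b.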
Example 8.2 Let $f(x_1, x_2) = x_1^2 + x_2^2$. Then, starting from an arbitrary initial point $x^{(0)} \in \mathbb{R}^2$, we arrive at the solution $x^* = 0 \in \mathbb{R}^2$ in only one step. See Figure 8.6.

Figure 8.6 Steepest descent method applied to $f(x_1, x_2) = x_1^2 + x_2^2$.

However, if
$$f(x_1, x_2) = \frac{x_1^2}{5} + x_2^2,$$
then the method of steepest descent shuffles ineffectively back and forth when searching for the minimizer in a narrow valley (see Figure 8.7). This example illustrates a major drawback in the steepest descent method. More sophisticated methods that alleviate this problem are discussed in subsequent chapters. ∎

Figure 8.7 Steepest descent method in search for a minimizer in a narrow valley.

To understand better the method of steepest descent, we examine its convergence properties in the next section.

8.3 Analysis of Gradient Methods

Convergence

The method of steepest descent is an example of an iterative algorithm. This means that the algorithm generates a sequence of points, each calculated on the basis of the points preceding it. The method is a descent method because as each new point is generated by the algorithm, the corresponding value of the objective function decreases (i.e., the algorithm possesses the descent property).

We say that an iterative algorithm is globally convergent if for any arbitrary starting point the algorithm is guaranteed to generate a sequence of points converging to a point that satisfies the FONC for a minimizer. When the algorithm is not globally convergent, it may still generate a sequence that converges to a point satisfying the FONC, provided that the initial point is sufficiently close to the point. In this case we say that the algorithm is locally convergent. How close to a solution point we need to start for the algorithm to converge depends on the local convergence properties of the algorithm. A related issue of interest pertaining to a given locally or globally convergent algorithm is the rate of convergence; that is, how fast the algorithm converges to a solution point.

In this section we analyze the convergence properties of descent gradient methods, including the method of steepest descent and gradient methods with fixed step size. We can investigate important convergence characteristics of a gradient method by applying the method to quadratic problems. The convergence analysis is more convenient if, instead of working with $f$, we deal with
$$V(x) = f(x) + \frac{1}{2}x^{*T} Q x^* = \frac{1}{2}(x - x^*)^T Q(x - x^*),$$
where $Q = Q^T > 0$. The solution point $x^*$ is obtained by solving $Qx = b$; that is, $x^* = Q^{-1}b$. The function $V$ differs from $f$ only by a constant $\frac{1}{2}x^{*T} Q x^*$. We begin our analysis with the following useful lemma that applies to a general gradient algorithm.

Lemma 8.1 The iterative algorithm
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$$
with $g^{(k)} = Qx^{(k)} - b$ satisfies
$$V(x^{(k+1)}) = (1 - \gamma_k)V(x^{(k)}),$$
where if $g^{(k)} = 0$, then $\gamma_k = 1$, and if $g^{(k)} \neq 0$, then
$$\gamma_k = \alpha_k\frac{g^{(k)T} Q g^{(k)}}{g^{(k)T} Q^{-1} g^{(k)}}\left(2\frac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}} - \alpha_k\right). \qquad \Box$$

Proof. The proof is by direct computation. Note that if $g^{(k)} = 0$, then the desired result holds trivially. In the remainder of the proof, assume that $g^{(k)} \neq 0$. We first evaluate the expression
$$\frac{V(x^{(k)}) - V(x^{(k+1)})}{V(x^{(k)})}.$$
To facilitate computations, let $y^{(k)} = x^{(k)} - x^*$. Then, $V(x^{(k)}) = \frac{1}{2}y^{(k)T} Q y^{(k)}$. Hence,
$$V(x^{(k+1)}) = \frac{1}{2}(x^{(k+1)} - x^*)^T Q(x^{(k+1)} - x^*) = \frac{1}{2}(x^{(k)} - x^* - \alpha_k g^{(k)})^T Q(x^{(k)} - x^* - \alpha_k g^{(k)}) = \frac{1}{2}y^{(k)T} Q y^{(k)} - \alpha_k g^{(k)T} Q y^{(k)} + \frac{1}{2}\alpha_k^2 g^{(k)T} Q g^{(k)}.$$
Therefore,
$$\frac{V(x^{(k)}) - V(x^{(k+1)})}{V(x^{(k)})} = \frac{2\alpha_k g^{(k)T} Q y^{(k)} - \alpha_k^2 g^{(k)T} Q g^{(k)}}{y^{(k)T} Q y^{(k)}}.$$
Because $g^{(k)} = Qx^{(k)} - b = Qx^{(k)} - Qx^* = Qy^{(k)}$, we have
$$y^{(k)T} Q y^{(k)} = g^{(k)T} Q^{-1} g^{(k)}, \qquad g^{(k)T} Q y^{(k)} = g^{(k)T} g^{(k)}.$$
Therefore, substituting the above, we get
$$\frac{V(x^{(k)}) - V(x^{(k+1)})}{V(x^{(k)})} = \alpha_k\frac{g^{(k)T} Q g^{(k)}}{g^{(k)T} Q^{-1} g^{(k)}}\left(2\frac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}} - \alpha_k\right) = \gamma_k. \qquad \blacksquare$$

Note that $\gamma_k \leq 1$ for all $k$, because $\gamma_k = 1 - V(x^{(k+1)})/V(x^{(k)})$ and $V$ is a nonnegative function. If $\gamma_k = 1$ for some $k$, then $V(x^{(k+1)}) = 0$, which is equivalent to $x^{(k+1)} = x^*$. In this case we also have that for all $i \geq k+1$, $x^{(i)} = x^*$ and $\gamma_i = 1$. It turns out that $\gamma_k = 1$ if and only if either $g^{(k)} = 0$ or $g^{(k)}$ is an eigenvector of $Q$ (see Lemma 8.3).
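The identity in Lemma 8.1 is easy to check numerically. The following MATLAB fragment, with an arbitrary illustrative Q, b, iterate, and step size of our own choosing, compares $V(x^{(k+1)})$ with $(1 - \gamma_k)V(x^{(k)})$; the two values agree to machine precision.

    % Numerical check of Lemma 8.1 (illustrative data; any symmetric
    % positive definite Q, and any b, alpha, and x, may be used).
    Q = [8 2*sqrt(2); 2*sqrt(2) 10];
    b = [3; -1];
    xstar = Q\b;
    V = @(x) 0.5*(x - xstar)'*Q*(x - xstar);
    x = [10; -5]; alpha = 0.05;
    g = Q*x - b;
    gamma = alpha*((g'*Q*g)/(g'*(Q\g)))*(2*(g'*g)/(g'*Q*g) - alpha);
    xnext = x - alpha*g;
    disp([V(xnext), (1 - gamma)*V(x)])   % the two values coincide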
We are now ready to state and prove our key convergence theorem for gradient methods. The theorem gives a necessary and sufficient condition for the sequence $\{x^{(k)}\}$ generated by a gradient method to converge to $x^*$; that is, $x^{(k)} \to x^*$ or $\lim_{k\to\infty} x^{(k)} = x^*$.

Theorem 8.1 Let $\{x^{(k)}\}$ be the sequence resulting from a gradient algorithm $x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$. Let $\gamma_k$ be as defined in Lemma 8.1, and suppose that $\gamma_k > 0$ for all $k$. Then, $\{x^{(k)}\}$ converges to $x^*$ for any initial condition $x^{(0)}$ if and only if
$$\sum_{k=0}^{\infty}\gamma_k = \infty. \qquad \Box$$

Proof. From Lemma 8.1 we have $V(x^{(k+1)}) = (1 - \gamma_k)V(x^{(k)})$, from which we obtain
$$V(x^{(k)}) = \left(\prod_{i=0}^{k-1}(1 - \gamma_i)\right)V(x^{(0)}).$$
Assume that $\gamma_k < 1$ for all $k$, for otherwise the result holds trivially. Note that $x^{(k)} \to x^*$ if and only if $V(x^{(k)}) \to 0$. By the equation above we see that this occurs if and only if $\prod_{i=0}^{\infty}(1 - \gamma_i) = 0$, which, in turn, holds if and only if $\sum_{i=0}^{\infty}-\log(1 - \gamma_i) = \infty$ (we get this simply by taking logs). Note that by Lemma 8.1, $1 - \gamma_i \geq 0$ and $\log(1 - \gamma_i)$ is well-defined [$\log(0)$ is taken to be $-\infty$]. Therefore, it remains to show that $\sum_{i=0}^{\infty}-\log(1 - \gamma_i) = \infty$ if and only if
$$\sum_{i=0}^{\infty}\gamma_i = \infty.$$
We first show that $\sum_{i=0}^{\infty}\gamma_i = \infty$ implies that $\sum_{i=0}^{\infty}-\log(1 - \gamma_i) = \infty$. For this, first observe that for any $x \in \mathbb{R}$, $x > 0$, we have $\log(x) \leq x - 1$ [this is easy to see simply by plotting $\log(x)$ and $x - 1$ versus $x$]. Therefore, $\log(1 - \gamma_i) \leq 1 - \gamma_i - 1 = -\gamma_i$, and hence $-\log(1 - \gamma_i) \geq \gamma_i$. Thus, if $\sum_{i=0}^{\infty}\gamma_i = \infty$, then clearly $\sum_{i=0}^{\infty}-\log(1 - \gamma_i) = \infty$.

Finally, we show that $\sum_{i=0}^{\infty}-\log(1 - \gamma_i) = \infty$ implies that $\sum_{i=0}^{\infty}\gamma_i = \infty$. We proceed by contraposition. Suppose that $\sum_{i=0}^{\infty}\gamma_i < \infty$. Then, it must be that $\gamma_i \to 0$. Now observe that for $x \in \mathbb{R}$, $x < 1$ and $x$ sufficiently close to 1, we have $\log(x) \geq 2(x - 1)$ [as before, this is easy to see simply by plotting $\log(x)$ and $2(x - 1)$ versus $x$]. Therefore, for sufficiently large $i$, $\log(1 - \gamma_i) \geq 2(1 - \gamma_i - 1) = -2\gamma_i$, which implies that $-\log(1 - \gamma_i) \leq 2\gamma_i$. Hence, $\sum_{i=0}^{\infty}\gamma_i < \infty$ implies that $\sum_{i=0}^{\infty}-\log(1 - \gamma_i) < \infty$. This completes the proof. ∎

The assumption in Theorem 8.1 that $\gamma_k > 0$ for all $k$ is significant in that it corresponds to the algorithm having the descent property (see Exercise 8.23). Furthermore, the result of the theorem does not hold in general if we do not assume that $\gamma_k > 0$ for all $k$, as shown in the following example.

Example 8.3 We show, using a counterexample, that the assumption that $\gamma_k > 0$ in Theorem 8.1 is necessary for the result of the theorem to hold. Indeed, for each $k = 0, 1, 2, \ldots$, choose $\alpha_k$ in such a way that $\gamma_{2k} = -1/2$ and $\gamma_{2k+1} = 1/2$ (we can always do this if, for example, $Q = I_n$). From Lemma 8.1 we have
$$V(x^{(2(k+1))}) = (1 - 1/2)(1 + 1/2)V(x^{(2k)}) = (3/4)V(x^{(2k)}).$$
Therefore, $V(x^{(2k)}) \to 0$. Because $V(x^{(2k+1)}) = (3/2)V(x^{(2k)})$, we also have that $V(x^{(2k+1)}) \to 0$. Hence, $V(x^{(k)}) \to 0$, which implies that $x^{(k)} \to x^*$ (for all $x^{(0)}$). On the other hand, the sum $\sum_{k=0}^{\infty}\gamma_k$ does not diverge to $\infty$, because its partial sums oscillate between $-1/2$ and $0$. Hence, the result of the theorem does not hold if $\gamma_k < 0$ for some $k$. ∎

Using the general theorem above, we can now establish the convergence of specific cases of the gradient algorithm, including the steepest descent algorithm and algorithms with fixed step size. In the analysis to follow, we use Rayleigh's inequality, which states that for any $Q = Q^T > 0$, we have
$$\lambda_{\min}(Q)\|x\|^2 \leq x^T Q x \leq \lambda_{\max}(Q)\|x\|^2,$$
where $\lambda_{\min}(Q)$ denotes the minimal eigenvalue of $Q$ and $\lambda_{\max}(Q)$ denotes the maximal eigenvalue of $Q$. For $Q = Q^T > 0$, we also have
$$\lambda_{\min}(Q^{-1}) = \frac{1}{\lambda_{\max}(Q)} \quad\text{and}\quad \lambda_{\max}(Q^{-1}) = \frac{1}{\lambda_{\min}(Q)}.$$

Lemma 8.2 Let $Q = Q^T > 0$ be an $n \times n$ real symmetric positive definite matrix.
Then, for any $x \in \mathbb{R}^n$, $x \neq 0$, we have
$$\frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)} \leq \frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \leq \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}. \qquad \Box$$

Proof. Applying Rayleigh's inequality and using the properties of symmetric positive definite matrices listed previously, we get
$$\frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \leq \frac{\|x\|^4}{\lambda_{\min}(Q)\|x\|^2\,\lambda_{\min}(Q^{-1})\|x\|^2} = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}$$
and
$$\frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \geq \frac{\|x\|^4}{\lambda_{\max}(Q)\|x\|^2\,\lambda_{\max}(Q^{-1})\|x\|^2} = \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)}. \qquad \blacksquare$$

We are now ready to establish the convergence of the steepest descent method.

Theorem 8.2 In the steepest descent algorithm, we have $x^{(k)} \to x^*$ for any $x^{(0)}$. $\Box$

Proof. If $g^{(k)} = 0$ for some $k$, then $x^{(k)} = x^*$ and the result holds. So assume that $g^{(k)} \neq 0$ for all $k$. Recall that for the steepest descent algorithm,
$$\alpha_k = \frac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}}.$$
Substituting this expression for $\alpha_k$ in the formula for $\gamma_k$ yields
$$\gamma_k = \frac{(g^{(k)T} g^{(k)})^2}{(g^{(k)T} Q g^{(k)})(g^{(k)T} Q^{-1} g^{(k)})}.$$
Note that in this case $\gamma_k > 0$ for all $k$. Furthermore, by Lemma 8.2, we have $\gamma_k \geq \lambda_{\min}(Q)/\lambda_{\max}(Q) > 0$. Therefore, we have $\sum_{k=0}^{\infty}\gamma_k = \infty$, and hence by Theorem 8.1 we conclude that $x^{(k)} \to x^*$. ∎

Consider now a gradient method with fixed step size; that is, $\alpha_k = \alpha \in \mathbb{R}$ for all $k$. The resulting algorithm is of the form
$$x^{(k+1)} = x^{(k)} - \alpha g^{(k)}.$$
We refer to the algorithm above as a fixed-step-size gradient algorithm. The algorithm is of practical interest because of its simplicity. In particular, the algorithm does not require a line search at each step to determine $\alpha_k$, because the same step size $\alpha$ is used at each step. Clearly, the convergence of the algorithm depends on the choice of $\alpha$, and we would not expect the algorithm to work for arbitrary $\alpha$. The following theorem gives a necessary and sufficient condition on $\alpha$ for convergence of the algorithm.

Theorem 8.3 For the fixed-step-size gradient algorithm, $x^{(k)} \to x^*$ for any $x^{(0)}$ if and only if
$$0 < \alpha < \frac{2}{\lambda_{\max}(Q)}. \qquad \Box$$

Proof. ⇐: By Rayleigh's inequality we have
$$\lambda_{\min}(Q)g^{(k)T} g^{(k)} \leq g^{(k)T} Q g^{(k)} \leq \lambda_{\max}(Q)g^{(k)T} g^{(k)}$$
and
$$g^{(k)T} Q^{-1} g^{(k)} \leq \frac{1}{\lambda_{\min}(Q)}g^{(k)T} g^{(k)}.$$
Therefore, substituting the above into the formula for $\gamma_k$, we get
$$\gamma_k \geq \alpha\lambda_{\min}(Q)^2\left(\frac{2}{\lambda_{\max}(Q)} - \alpha\right) > 0.$$
Therefore, $\gamma_k > 0$ for all $k$, and $\sum_{k=0}^{\infty}\gamma_k = \infty$. Hence, by Theorem 8.1 we conclude that $x^{(k)} \to x^*$.

⇒: We use contraposition. Suppose that $\alpha \leq 0$ or $\alpha \geq 2/\lambda_{\max}(Q)$. Let $x^{(0)}$ be chosen such that $x^{(0)} - x^*$ is an eigenvector of $Q$ corresponding to the eigenvalue $\lambda_{\max}(Q)$. Because
$$x^{(k+1)} = x^{(k)} - \alpha(Qx^{(k)} - b) = x^{(k)} - \alpha(Qx^{(k)} - Qx^*),$$
we obtain
$$x^{(k+1)} - x^* = x^{(k)} - x^* - \alpha(Qx^{(k)} - Qx^*) = (I_n - \alpha Q)(x^{(k)} - x^*) = (I_n - \alpha Q)^{k+1}(x^{(0)} - x^*) = (1 - \alpha\lambda_{\max}(Q))^{k+1}(x^{(0)} - x^*),$$
where in the last equality we used the property that $x^{(0)} - x^*$ is an eigenvector of $Q$. Taking norms on both sides, we get
$$\|x^{(k+1)} - x^*\| = |1 - \alpha\lambda_{\max}(Q)|^{k+1}\|x^{(0)} - x^*\|.$$
Because $\alpha \leq 0$ or $\alpha \geq 2/\lambda_{\max}(Q)$,
$$|1 - \alpha\lambda_{\max}(Q)| \geq 1.$$
Hence, $\|x^{(k+1)} - x^*\|$ cannot converge to 0, and thus the sequence $\{x^{(k)}\}$ does not converge to $x^*$. ∎

Example 8.4 Let the function $f$ be given by
$$f(x) = x^T\begin{bmatrix} 4 & 2\sqrt{2} \\ 0 & 5 \end{bmatrix}x + b^T x + 24,$$
where $b \in \mathbb{R}^2$ (the linear term does not affect the convergence condition below). We wish to find the minimizer of $f$ using a fixed-step-size gradient algorithm
$$x^{(k+1)} = x^{(k)} - \alpha\nabla f(x^{(k)}),$$
where $\alpha \in \mathbb{R}$ is a fixed step size. To apply Theorem 8.3, we first symmetrize the matrix in the quadratic term of $f$ to get
$$f(x) = \frac{1}{2}x^T\begin{bmatrix} 8 & 2\sqrt{2} \\ 2\sqrt{2} & 10 \end{bmatrix}x + b^T x + 24.$$
The eigenvalues of the matrix in the quadratic term are 6 and 12. Hence, using Theorem 8.3, the algorithm converges to the minimizer for all $x^{(0)}$ if and only if $\alpha$ lies in the range $0 < \alpha < 2/12$. ∎
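Computations like those in Example 8.4 are easily automated. The following MATLAB fragment computes the admissible step-size range from the eigenvalues of the symmetrized quadratic-term matrix; the nonsymmetric matrix A below is the one used in the example.

    % Admissible fixed step sizes for Example 8.4 (a minimal sketch).
    A = [4 2*sqrt(2); 0 5];     % matrix in the quadratic term
    Q = A + A';                 % symmetrized matrix: f(x) = 0.5*x'*Q*x + ...
    lambda = eig(Q);            % eigenvalues 6 and 12
    alpha_max = 2/max(lambda);  % convergence iff 0 < alpha < alpha_max = 1/6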
Convergence Rate

We now turn our attention to the issue of convergence rates of gradient algorithms. In particular, we focus on the steepest descent algorithm. We first present the following theorem.

Theorem 8.4 In the method of steepest descent applied to the quadratic function, at every step $k$ we have
$$V(x^{(k+1)}) \leq \left(1 - \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)}\right)V(x^{(k)}). \qquad \Box$$

Proof. In the proof of Theorem 8.2, we showed that $\gamma_k \geq \lambda_{\min}(Q)/\lambda_{\max}(Q)$. Therefore,
$$\frac{V(x^{(k)}) - V(x^{(k+1)})}{V(x^{(k)})} = \gamma_k \geq \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)},$$
and the result follows. ∎

Theorem 8.4 is relevant to our consideration of the convergence rate of the steepest descent algorithm as follows. Let
$$r = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)} = \|Q\|\,\|Q^{-1}\|,$$
called the condition number of $Q$. Then, it follows from Theorem 8.4 that
$$V(x^{(k+1)}) \leq \left(1 - \frac{1}{r}\right)V(x^{(k)}).$$
The term $(1 - 1/r)$ plays an important role in the convergence of $\{V(x^{(k)})\}$ to 0 (and hence of $\{x^{(k)}\}$ to $x^*$). We refer to $(1 - 1/r)$ as the convergence ratio. Specifically, we see that the smaller the value of $(1 - 1/r)$, the smaller $V(x^{(k+1)})$ will be relative to $V(x^{(k)})$, and hence the "faster" $V(x^{(k)})$ converges to 0, as indicated by the inequality above. The convergence ratio $(1 - 1/r)$ decreases as $r$ decreases. If $r = 1$, then $\lambda_{\max}(Q) = \lambda_{\min}(Q)$, corresponding to circular contours of $f$ (see Figure 8.6). In this case the algorithm converges in a single step to the minimizer. As $r$ increases, the speed of convergence of $\{V(x^{(k)})\}$ (and hence of $\{x^{(k)}\}$) decreases. The increase in $r$ reflects the fact that the contours of $f$ are more eccentric (see, e.g., Figure 8.7). We refer the reader to [88, pp. 238, 239] for an alternative approach to the analysis above.

To investigate the convergence properties of $\{x^{(k)}\}$ further, we need the following definition.

Definition 8.1 Given a sequence $\{x^{(k)}\}$ that converges to $x^*$, that is, $\lim_{k\to\infty}\|x^{(k)} - x^*\| = 0$, we say that the order of convergence is $p$, where $p \in \mathbb{R}$, if
$$0 < \lim_{k\to\infty}\frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} < \infty.$$
If for all $p > 0$,
$$\lim_{k\to\infty}\frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} = 0,$$
then we say that the order of convergence is $\infty$. ∎

Note that in the definition above, 0/0 should be understood to be 0. The order of convergence of a sequence is a measure of its rate of convergence; the higher the order, the faster the rate of convergence. The order of convergence is sometimes also called the rate of convergence (see, e.g., [96]). If $p = 1$ (first-order convergence) and $\lim_{k\to\infty}\|x^{(k+1)} - x^*\|/\|x^{(k)} - x^*\| = 1$, we say that the convergence is sublinear. If $p = 1$ and $\lim_{k\to\infty}\|x^{(k+1)} - x^*\|/\|x^{(k)} - x^*\| < 1$, we say that the convergence is linear. If $p > 1$, we say that the convergence is superlinear. If $p = 2$ (second-order convergence), we say that the convergence is quadratic.

Example 8.5

1. Suppose that $x^{(k)} = 1/k$ and thus $x^{(k)} \to 0$. Then,
$$\frac{|x^{(k+1)}|}{|x^{(k)}|^p} = \frac{1/(k+1)}{1/k^p} = \frac{k^p}{k+1}.$$
If $p < 1$, the sequence above converges to 0, whereas if $p > 1$, it grows to $\infty$. If $p = 1$, the sequence converges to 1. Hence, the order of convergence is 1 (i.e., we have linear convergence).

2. Suppose that $x^{(k)} = \gamma^k$, where $0 < \gamma < 1$, and thus $x^{(k)} \to 0$. Then,
$$\frac{|x^{(k+1)}|}{|x^{(k)}|^p} = \frac{\gamma^{k+1}}{(\gamma^k)^p} = \gamma^{k(1-p)+1}.$$
If $p < 1$, the sequence above converges to 0, whereas if $p > 1$, it grows to $\infty$. If $p = 1$, the sequence converges to $\gamma$ (in fact, remains constant at $\gamma$). Hence, the order of convergence is 1.

3. Suppose that $x^{(k)} = \gamma^{(q^k)}$, where $q > 1$ and $0 < \gamma < 1$, and thus $x^{(k)} \to 0$. Then,
$$\frac{|x^{(k+1)}|}{|x^{(k)}|^p} = \frac{\gamma^{(q^{k+1})}}{\gamma^{pq^k}} = \gamma^{q^k(q-p)}.$$
If $p < q$, the sequence above converges to 0, whereas if $p > q$, it grows to $\infty$. If $p = q$, the sequence converges to 1 (in fact, remains constant at 1). Hence, the order of convergence is $q$.

4. Suppose that $x^{(k)} = 1$ for all $k$, and thus $x^{(k)} \to 1$ trivially. Then,
$$\frac{|x^{(k+1)} - 1|}{|x^{(k)} - 1|^p} = \frac{0}{0^p} = 0$$
for all $p$. Hence, the order of convergence is $\infty$. ∎
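These limits can be observed numerically. The following MATLAB fragment tabulates the relevant ratios for the first three sequences of Example 8.5, with illustrative parameter values $\gamma = 0.5$ and $q = 2$ of our own choosing.

    % Observing orders of convergence numerically (illustrative values).
    k  = (1:8)';
    x1 = 1./k;                          % part 1: order 1 (sublinear)
    x2 = 0.5.^k;                        % part 2, gamma = 0.5: order 1 (linear)
    x3 = 0.5.^(2.^k);                   % part 3, gamma = 0.5, q = 2: order 2
    r1 = x1(2:end)./x1(1:end-1);        % -> 1
    r2 = x2(2:end)./x2(1:end-1);        % -> 0.5
    r3 = x3(2:end)./x3(1:end-1).^2;     % -> 1 (ratio at p = 2)
    disp([r1 r2 r3])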
The order of convergence can be interpreted using the notion of the order symbol $O$, as follows. Recall that $a = O(h)$ ("big-oh of $h$") if there exists a constant $c$ such that $|a| \leq c|h|$ for sufficiently small $h$. Then, the order of convergence is at least $p$ if
$$\|x^{(k+1)} - x^*\| = O(\|x^{(k)} - x^*\|^p)$$
(see Theorem 8.5 below). For example, the order of convergence is at least 2 if
$$\|x^{(k+1)} - x^*\| = O(\|x^{(k)} - x^*\|^2)$$
(this fact is used in the analysis of Newton's algorithm in Chapter 9).

Theorem 8.5 Let $\{x^{(k)}\}$ be a sequence that converges to $x^*$. If
$$\|x^{(k+1)} - x^*\| = O(\|x^{(k)} - x^*\|^p),$$
then the order of convergence (if it exists) is at least $p$. $\Box$

Proof. Let $s$ be the order of convergence of $\{x^{(k)}\}$. Suppose that
$$\|x^{(k+1)} - x^*\| = O(\|x^{(k)} - x^*\|^p).$$
Then, there exists $c$ such that for sufficiently large $k$,
$$\frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} \leq c.$$
Hence,
$$\frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^s} = \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p}\,\|x^{(k)} - x^*\|^{p-s} \leq c\|x^{(k)} - x^*\|^{p-s}.$$
Taking limits yields
$$\lim_{k\to\infty}\frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^s} \leq c\lim_{k\to\infty}\|x^{(k)} - x^*\|^{p-s}.$$
Because by definition $s$ is the order of convergence,
$$\lim_{k\to\infty}\frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^s} > 0.$$
Combining the two inequalities above, we get
$$c\lim_{k\to\infty}\|x^{(k)} - x^*\|^{p-s} > 0.$$
Therefore, because $\lim_{k\to\infty}\|x^{(k)} - x^*\| = 0$, we conclude that $s \geq p$; that is, the order of convergence is at least $p$. ∎

By an argument similar to the above, we can show that if
$$\|x^{(k+1)} - x^*\| = o(\|x^{(k)} - x^*\|^p),$$
then the order of convergence (if it exists) strictly exceeds $p$.

Example 8.6 Suppose that we are given a scalar sequence $\{x^{(k)}\}$ that converges with order of convergence $p$ and satisfies
$$\lim_{k\to\infty}\frac{|x^{(k+1)} - 2|}{|x^{(k)} - 2|^3} = 0.$$
The limit of $\{x^{(k)}\}$ must be 2, because it is clear from the equation that $|x^{(k+1)} - 2| \to 0$. Also, we see that $|x^{(k+1)} - 2| = o(|x^{(k)} - 2|^3)$. Hence, we conclude that $p > 3$. ∎

It turns out that the order of convergence of any convergent sequence cannot be less than 1 (see Exercise 8.3). In the following, we provide an example where the order of convergence of a fixed-step-size gradient algorithm exceeds 1.

Example 8.7 Consider the problem of finding a minimizer of the function $f : \mathbb{R} \to \mathbb{R}$ given by
$$f(x) = x^2 - \frac{x^3}{3}.$$
Suppose that we use the algorithm $x^{(k+1)} = x^{(k)} - \alpha f'(x^{(k)})$ with step size $\alpha = 1/2$ and initial condition $x^{(0)} = 1$. (The notation $f'$ represents the derivative of $f$.) We first show that the algorithm converges to a local minimizer of $f$. Indeed, we have $f'(x) = 2x - x^2$. The fixed-step-size gradient algorithm with step size $\alpha = 1/2$ is therefore given by
$$x^{(k+1)} = x^{(k)} - \alpha f'(x^{(k)}) = \frac{1}{2}(x^{(k)})^2.$$
With $x^{(0)} = 1$, we can derive the expression $x^{(k)} = (1/2)^{2^k - 1}$. Hence, the algorithm converges to 0, a strict local minimizer of $f$. Next, we find the order of convergence. Note that
$$\frac{|x^{(k+1)}|}{|x^{(k)}|^2} = \frac{1}{2}.$$
Therefore, the order of convergence is 2. ∎
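A quick numerical check of Example 8.7 confirms the quadratic convergence (the variable names below are ours):

    % Verifying quadratic convergence in Example 8.7.
    x = 1;                      % x^(0)
    for k = 1:5
        xnew = 0.5*x^2;         % the fixed-step-size gradient iteration
        fprintf('%d  %.3e  %.3f\n', k, xnew, xnew/x^2);  % last column stays at 0.5
        x = xnew;
    end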
Finally, we show that the steepest descent algorithm has an order of convergence of 1 in the worst case; that is, there are cases for which the order of convergence of the steepest descent algorithm is equal to 1. To proceed, we need the following simple lemma.

Lemma 8.3 In the steepest descent algorithm, if $g^{(k)} \neq 0$ for all $k$, then $\gamma_k = 1$ if and only if $g^{(k)}$ is an eigenvector of $Q$. $\Box$

Proof. Suppose that $g^{(k)} \neq 0$ for all $k$. Recall that for the steepest descent algorithm,
$$\gamma_k = \frac{(g^{(k)T} g^{(k)})^2}{(g^{(k)T} Q g^{(k)})(g^{(k)T} Q^{-1} g^{(k)})}.$$
Sufficiency is easy to show by verification. To show necessity, suppose that $\gamma_k = 1$. Then, $V(x^{(k+1)}) = 0$, which implies that $x^{(k+1)} = x^*$. Therefore,
$$x^* = x^{(k)} - \alpha_k g^{(k)}.$$
Premultiplying by $Q$ and subtracting $b$ from both sides yields
$$0 = g^{(k)} - \alpha_k Q g^{(k)},$$
which can be rewritten as
$$Qg^{(k)} = \frac{1}{\alpha_k}g^{(k)}.$$
Hence, $g^{(k)}$ is an eigenvector of $Q$. ∎

By the lemma, if $g^{(k)}$ is not an eigenvector of $Q$, then $\gamma_k < 1$ (recall that $\gamma_k$ cannot exceed 1). We use this fact in the proof of the following result on the worst-case order of convergence of the steepest descent algorithm.

Theorem 8.6 Let $\{x^{(k)}\}$ be a convergent sequence of iterates of the steepest descent algorithm applied to a function $f$. Then, the order of convergence of $\{x^{(k)}\}$ is 1 in the worst case; that is, there exist a function $f$ and an initial condition $x^{(0)}$ such that the order of convergence of $\{x^{(k)}\}$ is equal to 1. $\Box$

Proof. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a quadratic function with Hessian $Q$. Assume that the maximum and minimum eigenvalues of $Q$ satisfy $\lambda_{\max}(Q) > \lambda_{\min}(Q)$. To show that the order of convergence of $\{x^{(k)}\}$ is 1, it suffices to show that there exists $x^{(0)}$ such that
$$\|x^{(k+1)} - x^*\| \geq c\|x^{(k)} - x^*\|$$
for some $c > 0$ (see Exercise 8.2). Indeed, by Rayleigh's inequality,
$$V(x^{(k+1)}) = \frac{1}{2}(x^{(k+1)} - x^*)^T Q(x^{(k+1)} - x^*) \leq \frac{\lambda_{\max}(Q)}{2}\|x^{(k+1)} - x^*\|^2.$$
Similarly,
$$V(x^{(k)}) \geq \frac{\lambda_{\min}(Q)}{2}\|x^{(k)} - x^*\|^2.$$
Combining the inequalities above with Lemma 8.1, we obtain
$$\|x^{(k+1)} - x^*\|^2 \geq \frac{2}{\lambda_{\max}(Q)}V(x^{(k+1)}) = \frac{2}{\lambda_{\max}(Q)}(1 - \gamma_k)V(x^{(k)}) \geq \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)}(1 - \gamma_k)\|x^{(k)} - x^*\|^2.$$
Therefore, it suffices to choose $x^{(0)}$ such that $\gamma_k \leq d$ for some $d < 1$. Recall that for the steepest descent algorithm, assuming that $g^{(k)} \neq 0$ for all $k$, $\gamma_k$ depends on $g^{(k)}$ according to
$$\gamma_k = \frac{(g^{(k)T} g^{(k)})^2}{(g^{(k)T} Q g^{(k)})(g^{(k)T} Q^{-1} g^{(k)})}.$$
First consider the case where $n = 2$. Suppose that $x^{(0)} \neq x^*$ is chosen such that $x^{(0)} - x^*$ is not an eigenvector of $Q$. Then, $g^{(0)} = Q(x^{(0)} - x^*) \neq 0$ is also not an eigenvector of $Q$. By Proposition 8.1, $g^{(k)} = -(x^{(k+1)} - x^{(k)})/\alpha_k$ is not an eigenvector of $Q$ for any $k$ [because any two eigenvectors corresponding to $\lambda_{\max}(Q)$ and $\lambda_{\min}(Q)$ are mutually orthogonal]. Also, $g^{(k)}$ lies in one of two mutually orthogonal directions. Therefore, by Lemma 8.3, for each $k$, the value of $\gamma_k$ is one of two numbers, both of which are strictly less than 1. This proves the $n = 2$ case.

For the general $n$ case, let $v_1$ and $v_2$ be mutually orthogonal eigenvectors corresponding to $\lambda_{\max}(Q)$ and $\lambda_{\min}(Q)$. Choose $x^{(0)}$ such that $x^{(0)} - x^* \neq 0$ lies in the span of $v_1$ and $v_2$ but is not equal to either. Note that $g^{(0)} = Q(x^{(0)} - x^*)$ also lies in the span of $v_1$ and $v_2$, but is not equal to either. By manipulating $x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$ as before, we can write $g^{(k+1)} = (I - \alpha_k Q)g^{(k)}$. Any eigenvector of $Q$ is also an eigenvector of $I - \alpha_k Q$. Therefore, $g^{(k)}$ lies in the span of $v_1$ and $v_2$ for all $k$; that is, the sequence $\{g^{(k)}\}$ is confined within the two-dimensional subspace spanned by $v_1$ and $v_2$. We can now proceed as in the $n = 2$ case. ∎

In the next chapter we discuss Newton's method, which has order of convergence at least 2 if the initial guess is near the solution.

EXERCISES

8.1 Perform two iterations leading to the minimization of
$$f(x_1, x_2) = x_1^2 + 2x_2^2 + 9x_1 + x_2 + 3$$
using the steepest descent method with the starting point $x^{(0)} = 0$. Also determine an optimal solution analytically.

8.2 Let $\{x^{(k)}\}$ be a sequence that converges to $x^*$. Show that if there exists $c > 0$ such that
$$\|x^{(k+1)} - x^*\| \geq c\|x^{(k)} - x^*\|^p$$
for sufficiently large $k$, then the order of convergence (if it exists) is at most $p$.

8.3 Let $\{x^{(k)}\}$ be a sequence that converges to $x^*$. Show that there does not exist $p < 1$ such that
$$\lim_{k\to\infty}\frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} > 0.$$

8.4 Consider the sequence $\{x^{(k)}\}$ given by $x^{(k)} = 2^{-2^k}$.

a. Write down the value of the limit of $\{x^{(k)}\}$.
b. Find the order of convergence of $\{x^{(k)}\}$.

8.5 Consider the two sequences $\{x^{(k)}\}$ and $\{y^{(k)}\}$ defined iteratively as follows:
$$x^{(k+1)} = ax^{(k)}, \qquad y^{(k+1)} = (y^{(k)})^b,$$
where $a \in \mathbb{R}$, $b \in \mathbb{R}$, $0 < a < 1$, $b > 1$, $x^{(0)} \neq 0$, $y^{(0)} \neq 0$, and $|y^{(0)}| < 1$.

a. Derive a formula for $x^{(k)}$ in terms of $x^{(0)}$ and $a$. Use this to deduce that $x^{(k)} \to 0$.

b. Derive a formula for $y^{(k)}$ in terms of $y^{(0)}$ and $b$. Use this to deduce that $y^{(k)} \to 0$.

c. Find the order of convergence of $\{x^{(k)}\}$ and the order of convergence of $\{y^{(k)}\}$.

d. Calculate the smallest number of iterations $k$ such that $|x^{(k)}| \leq c|x^{(0)}|$, where $0 < c < 1$. Hint: The answer is in terms of $a$ and $c$. You may use the notation $\lceil z\rceil$ to represent the smallest integer not smaller than $z$.

e. Calculate the smallest number of iterations $k$ such that $|y^{(k)}| \leq c|y^{(0)}|$, where $0 < c < 1$.

f. Compare the answer of part e with that of part d, focusing on the case where $c$ is very small.

8.6 Suppose that we use the golden section algorithm to find the minimizer of a function. Let $u_k$ be the uncertainty range at the $k$th iteration. Find the order of convergence of $\{u_k\}$.

8.7 Suppose that we wish to minimize a function $f : \mathbb{R} \to \mathbb{R}$ that has a derivative $f'$. A simple line search method, called derivative descent search (DDS), is described as follows: given that we are at a point $x^{(k)}$, we move in the direction of the negative derivative with step size $\alpha$; that is, $x^{(k+1)} = x^{(k)} - \alpha f'(x^{(k)})$, where $\alpha > 0$ is a constant. In the following parts, assume that $f$ is quadratic: $f(x) = \frac{1}{2}ax^2 - bx + c$ (where $a$, $b$, and $c$ are constants, and $a > 0$).

a. Write down the value of $x^*$ (in terms of $a$, $b$, and $c$) that minimizes $f$.

b. Write down the recursive equation for the DDS algorithm explicitly for this quadratic $f$.

c. Assuming that the DDS algorithm converges, show that it converges to the optimal value $x^*$ (found in part a).

d. Find the order of convergence of the algorithm, assuming that it does converge.

e. Find the range of values of $\alpha$ for which the algorithm converges (for this particular $f$) for all starting points $x^{(0)}$.

8.8 Consider the function
$$f(x) = 3(x_1^2 + x_2^2) + 4x_1x_2 + 5x_1 + 6x_2 + 7,$$
where $x = [x_1, x_2]^T \in \mathbb{R}^2$. Suppose that we use a fixed-step-size gradient algorithm to find the minimizer of $f$:
$$x^{(k+1)} = x^{(k)} - \alpha\nabla f(x^{(k)}).$$
Find the largest range of values of $\alpha$ for which the algorithm is globally convergent.

8.9 This exercise explores a zero-finding algorithm. Suppose that we wish to solve the equation $h(x) = 0$, where
$$h(x) = \begin{bmatrix} 4 + 3x_1 + 2x_2 \\ 1 + 2x_1 + 3x_2 \end{bmatrix}.$$
Consider using an algorithm of the form $x^{(k+1)} = x^{(k)} - \alpha h(x^{(k)})$, where $\alpha$ is a scalar constant that does not depend on $k$.

a. Find the solution of $h(x) = 0$.

b. Find the largest range of values of $\alpha$ such that the algorithm is globally convergent to the solution of $h(x) = 0$.

c. Assuming that $\alpha$ is outside the range of values in part b, give an example of an initial condition $x^{(0)}$ of the form $[x_1, 0]^T$ such that the algorithm is guaranteed not to satisfy the descent property.

8.10 Consider the function $f : \mathbb{R}^2 \to \mathbb{R}$ given by
$$f(x) = \frac{3}{2}(x_1^2 + x_2^2) + (1 + a)x_1x_2 - (x_1 + x_2) + b,$$
where $a$ and $b$ are some unknown real-valued parameters.

a. Write the function $f$ in the usual multivariable quadratic form.

b. Find the largest set of values of $a$ and $b$ such that the unique global minimizer of $f$ exists, and write down the minimizer (in terms of the parameters $a$ and $b$).
c. Consider the following algorithm:
$$x^{(k+1)} = x^{(k)} - \frac{2}{5}\nabla f(x^{(k)}).$$
Find the largest set of values of $a$ and $b$ for which this algorithm converges to the global minimizer of $f$ for any initial point $x^{(0)}$.

8.11 Consider the function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) = \frac{1}{2}(x - c)^2$, $c \in \mathbb{R}$. We are interested in computing the minimizer of $f$ using the iterative algorithm
$$x^{(k+1)} = x^{(k)} - \alpha_k f'(x^{(k)}),$$
where $f'$ is the derivative of $f$ and $\alpha_k$ is a step size satisfying $0 < \alpha_k < 1$.

a. Derive a formula relating $f(x^{(k+1)})$ with $f(x^{(k)})$, involving $\alpha_k$.

b. Show that the algorithm is globally convergent if and only if
$$\sum_{k=0}^{\infty}\alpha_k = \infty.$$
Hint: Use part a and the fact that for any sequence $\{\alpha_k\} \subset (0,1)$, we have
$$\prod_{k=0}^{\infty}(1 - \alpha_k) = 0 \iff \sum_{k=0}^{\infty}\alpha_k = \infty.$$

8.12 Consider the function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) = x^3 - x$. Suppose that we use a fixed-step-size algorithm $x^{(k+1)} = x^{(k)} - \alpha f'(x^{(k)})$ to find a local minimizer of $f$. Find the largest range of values of $\alpha$ such that the algorithm is locally convergent (i.e., for all $x_0$ sufficiently close to a local minimizer $x^*$, we have $x^{(k)} \to x^*$).

8.13 Consider the function $f$ given by $f(x) = (x - 1)^2$, $x \in \mathbb{R}$. We are interested in computing the minimizer of $f$ using the iterative algorithm
$$x^{(k+1)} = x^{(k)} - \alpha 2^{-k} f'(x^{(k)}),$$
where $f'$ is the derivative of $f$ and $0 < \alpha < 1$. Does the algorithm have the descent property? Is the algorithm globally convergent?

8.14 Let $f : \mathbb{R} \to \mathbb{R}$, $f \in C^3$, with first derivative $f'$, second derivative $f''$, and unique minimizer $x^*$. Consider a fixed-step-size gradient algorithm
$$x^{(k+1)} = x^{(k)} - \alpha f'(x^{(k)}).$$
Suppose that $f''(x^*) \neq 0$ and $\alpha = 1/f''(x^*)$. Assuming that the algorithm converges to $x^*$, show that the order of convergence is at least 2.

8.15 Consider the problem of minimizing $f(x) = \|ax - b\|^2$, where $a$ and $b$ are vectors in $\mathbb{R}^n$, and $a \neq 0$.

a. Derive an expression (in terms of $a$ and $b$) for the solution to this problem.

b. To solve the problem, suppose that we use an iterative algorithm of the form
$$x^{(k+1)} = x^{(k)} - \alpha f'(x^{(k)}),$$
where $f'$ is the derivative of $f$. Find the largest range of values of $\alpha$ (in terms of $a$ and $b$) for which the algorithm converges to the solution for all starting points $x^{(0)}$.

8.16 Consider the optimization problem: minimize $\|Ax - b\|^2$, where $A \in \mathbb{R}^{m\times n}$, $m > n$, and $b \in \mathbb{R}^m$.

a. Show that the objective function for this problem is a quadratic function, and write down the gradient and Hessian of this quadratic.

b. Write down the fixed-step-size gradient algorithm for solving this optimization problem.

c. Suppose that … Find the largest range of values for $\alpha$ such that the algorithm in part b converges to the solution of the problem.

8.17 Consider a function $f : \mathbb{R}^n \to \mathbb{R}^n$ given by $f(x) = Ax + b$, where $A \in \mathbb{R}^{n\times n}$ and $b \in \mathbb{R}^n$. Suppose that $A$ is invertible and $x^*$ is the zero of $f$ [i.e., $f(x^*) = 0$]. We wish to compute $x^*$ using the iterative algorithm
$$x^{(k+1)} = x^{(k)} - \alpha f(x^{(k)}),$$
where $\alpha \in \mathbb{R}$, $\alpha > 0$. We say that the algorithm is globally monotone if for any $x^{(0)}$, $\|x^{(k+1)} - x^*\| \leq \|x^{(k)} - x^*\|$ for all $k$.

a. Assume that all the eigenvalues of $A$ are real. Show that a necessary condition for the algorithm above to be globally monotone is that all the eigenvalues of $A$ are nonnegative. Hint: Use contraposition.

b. Suppose that
$$A = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} 3 \\ -1 \end{bmatrix}.$$
Find the largest range of values of $\alpha$ for which the algorithm is globally convergent (i.e., $x^{(k)} \to x^*$ for all $x^{(0)}$).

8.18 Let $f : \mathbb{R}^n \to \mathbb{R}$ be given by $f(x) = \frac{1}{2}x^T Q x - x^T b$, where $b \in \mathbb{R}^n$ and $Q$ is a real symmetric positive definite $n \times n$ matrix. Suppose that we apply the steepest descent method to this function, with $x^{(0)} \neq Q^{-1}b$.
Show that the method converges in one step, that is, $x^{(1)} = Q^{-1}b$, if and only if $x^{(0)}$ is chosen such that $g^{(0)} = Qx^{(0)} - b$ is an eigenvector of $Q$.

8.19 Suppose that we apply the steepest descent algorithm $x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}$ to a quadratic function $f$ with Hessian $Q > 0$. Let $\lambda_{\max}$ and $\lambda_{\min}$ be the largest and smallest eigenvalues of $Q$, respectively. Which of the following two inequalities are possibly true? (When we say here that an inequality is "possibly" true, we mean that there exists a choice of $f$ and $x^{(0)}$ such that the inequality holds.)

a. $\alpha_0 > 2/\lambda_{\max}$.

b. $\alpha_0 > 1/\lambda_{\min}$.

8.20 Suppose that we apply a fixed-step-size gradient algorithm to minimize
$$f(x) = x^T\begin{bmatrix} 3/2 & 2 \\ 0 & 3/2 \end{bmatrix}x + x^T\begin{bmatrix} 3 \\ -1 \end{bmatrix} - 22.$$

a. Find the range of values of the step size for which the algorithm converges to the minimizer.

b. Suppose that we use a step size of 1000 (which is too large). Find an initial condition that will cause the algorithm to diverge (not converge).

8.21 Consider a fixed-step-size gradient algorithm applied to each of the functions $f : \mathbb{R}^2 \to \mathbb{R}$ in parts a and b below. In each case, find the largest range of values of the step size $\alpha$ for which the algorithm is globally convergent.

a. $f(x) = 1 + 2x_1 + 3(x_1^2 + x_2^2) + 4x_1x_2$.

b. $f(x) = x^T\begin{bmatrix} 3 & 1 \\ 2 & 3 \end{bmatrix}x + [16, 23]\,x + \pi$.

8.22 Let $f : \mathbb{R}^n \to \mathbb{R}$ be given by $f(x) = \frac{1}{2}x^T Q x - x^T b$, where $b \in \mathbb{R}^n$ and $Q$ is a real symmetric positive definite $n \times n$ matrix. Consider the algorithm
$$x^{(k+1)} = x^{(k)} - \beta\alpha_k g^{(k)},$$
where $g^{(k)} = Qx^{(k)} - b$, $\alpha_k = g^{(k)T}g^{(k)}/g^{(k)T}Qg^{(k)}$, and $\beta \in \mathbb{R}$ is a given constant. (Note that the above reduces to the steepest descent algorithm if $\beta = 1$.) Show that $\{x^{(k)}\}$ converges to $x^* = Q^{-1}b$ for any initial condition $x^{(0)}$ if and only if $0 < \beta < 2$.

8.23 Let $f : \mathbb{R}^n \to \mathbb{R}$ be given by $f(x) = \frac{1}{2}x^T Q x - x^T b$, where $b \in \mathbb{R}^n$ and $Q$ is a real symmetric positive definite $n \times n$ matrix. Consider a gradient algorithm
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)},$$
where $g^{(k)} = Qx^{(k)} - b$ is the gradient of $f$ at $x^{(k)}$ and $\alpha_k$ is some step size. Show that the algorithm has the descent property [i.e., $f(x^{(k+1)}) < f(x^{(k)})$ whenever $g^{(k)} \neq 0$] if and only if $\gamma_k > 0$ for all $k$.

8.24 Given $f : \mathbb{R}^n \to \mathbb{R}$, consider the general iterative algorithm
$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)},$$
where $d^{(1)}, d^{(2)}, \ldots$ are given vectors in $\mathbb{R}^n$ and $\alpha_k$ is chosen to minimize $f(x^{(k)} + \alpha d^{(k)})$; that is,
$$\alpha_k = \arg\min_{\alpha} f(x^{(k)} + \alpha d^{(k)}).$$
Show that for each $k$, the vector $x^{(k+1)} - x^{(k)}$ is orthogonal to $\nabla f(x^{(k+1)})$ (assuming that the gradient exists).

8.25 Write a simple MATLAB program for implementing the steepest descent algorithm using the secant method for the line search (e.g., the MATLAB function of Exercise 7.11). For the stopping criterion, use the condition $\|g^{(k)}\| < \varepsilon$, where $\varepsilon = 10^{-6}$. Test your program by comparing the output with the numbers in Example 8.1. Also test your program using an initial condition of $[-4, 5, 1]^T$, and determine the number of iterations required to satisfy the stopping criterion. Evaluate the objective function at the final point to see how close it is to 0.

8.26 Apply the MATLAB program from Exercise 8.25 to Rosenbrock's function:
$$f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2.$$
Use an initial condition of $x^{(0)} = [-2, 2]^T$. Terminate the algorithm when the norm of the gradient of $f$ is less than $10^{-4}$.

CHAPTER 9

NEWTON'S METHOD

9.1 Introduction

Recall that the method of steepest descent uses only first derivatives (gradients) in selecting a suitable search direction. This strategy is not always the most effective. If higher derivatives are used, the resulting iterative algorithm may perform better than the steepest descent method.
Newton's method (sometimes called the Newton-Raphson method) uses first and second derivatives and indeed does perform better than the steepest descent method if the initial point is close to the minimizer. The idea behind this method is as follows. Given a starting point, we construct a quadratic approximation to the objective function that matches the first and second derivative values at that point. We then minimize the approximate (quadratic) function instead of the original objective function. We use the minimizer of the approximate function as the starting point in the next step and repeat the procedure iteratively. If the objective function is quadratic, then the approximation is exact, and the method yields the true minimizer in one step. If, on the other hand, the objective function is not quadratic, then the approximation will provide only an estimate of the position of the true minimizer. Figure 9.1 illustrates this idea.

Figure 9.1 Quadratic approximation to the objective function using first and second derivatives.

We can obtain a quadratic approximation to the twice continuously differentiable objective function $f : \mathbb{R}^n \to \mathbb{R}$ using the Taylor series expansion of $f$ about the current point $x^{(k)}$, neglecting terms of order three and higher. We obtain
$$f(x) \approx f(x^{(k)}) + (x - x^{(k)})^T g^{(k)} + \frac{1}{2}(x - x^{(k)})^T F(x^{(k)})(x - x^{(k)}) = q(x),$$
where, for simplicity, we use the notation $g^{(k)} = \nabla f(x^{(k)})$. Applying the FONC to $q$ yields
$$0 = \nabla q(x) = g^{(k)} + F(x^{(k)})(x - x^{(k)}).$$
If $F(x^{(k)}) > 0$, then $q$ achieves a minimum at
$$x^{(k+1)} = x^{(k)} - F(x^{(k)})^{-1}g^{(k)}.$$
This recursive formula represents Newton's method.

Example 9.1 Use Newton's method to minimize the Powell function:
$$f(x_1, x_2, x_3, x_4) = (x_1 + 10x_2)^2 + 5(x_3 - x_4)^2 + (x_2 - 2x_3)^4 + 10(x_1 - x_4)^4.$$
Use as the starting point $x^{(0)} = [3, -1, 0, 1]^T$. Perform three iterations.

Note that $f(x^{(0)}) = 215$. We have
$$\nabla f(x) = \begin{bmatrix} 2(x_1 + 10x_2) + 40(x_1 - x_4)^3 \\ 20(x_1 + 10x_2) + 4(x_2 - 2x_3)^3 \\ 10(x_3 - x_4) - 8(x_2 - 2x_3)^3 \\ -10(x_3 - x_4) - 40(x_1 - x_4)^3 \end{bmatrix}$$
and $F(x)$ is given by
$$\begin{bmatrix} 2 + 120(x_1 - x_4)^2 & 20 & 0 & -120(x_1 - x_4)^2 \\ 20 & 200 + 12(x_2 - 2x_3)^2 & -24(x_2 - 2x_3)^2 & 0 \\ 0 & -24(x_2 - 2x_3)^2 & 10 + 48(x_2 - 2x_3)^2 & -10 \\ -120(x_1 - x_4)^2 & 0 & -10 & 10 + 120(x_1 - x_4)^2 \end{bmatrix}.$$

Iteration 1.
$$g^{(0)} = [306, -144, -2, -310]^T,$$
$$F(x^{(0)}) = \begin{bmatrix} 482 & 20 & 0 & -480 \\ 20 & 212 & -24 & 0 \\ 0 & -24 & 58 & -10 \\ -480 & 0 & -10 & 490 \end{bmatrix},$$
$$F(x^{(0)})^{-1} = \begin{bmatrix} 0.1126 & -0.0089 & 0.0154 & 0.1106 \\ -0.0089 & 0.0057 & 0.0008 & -0.0087 \\ 0.0154 & 0.0008 & 0.0203 & 0.0155 \\ 0.1106 & -0.0087 & 0.0155 & 0.1107 \end{bmatrix},$$
$$F(x^{(0)})^{-1}g^{(0)} = [1.4127, -0.8413, -0.2540, 0.7460]^T.$$
Hence,
$$x^{(1)} = x^{(0)} - F(x^{(0)})^{-1}g^{(0)} = [1.5873, -0.1587, 0.2540, 0.2540]^T, \qquad f(x^{(1)}) = 31.8.$$

Iteration 2.
$$g^{(1)} = [94.81, -1.179, 2.371, -94.81]^T,$$
$$F(x^{(1)}) = \begin{bmatrix} 215.3 & 20 & 0 & -213.3 \\ 20 & 205.3 & -10.67 & 0 \\ 0 & -10.67 & 31.34 & -10 \\ -213.3 & 0 & -10 & 223.3 \end{bmatrix},$$
$$F(x^{(1)})^{-1}g^{(1)} = [0.5291, -0.0529, 0.0846, 0.0846]^T.$$
Hence,
$$x^{(2)} = x^{(1)} - F(x^{(1)})^{-1}g^{(1)} = [1.0582, -0.1058, 0.1694, 0.1694]^T, \qquad f(x^{(2)}) = 6.28.$$

Iteration 3.
$$g^{(2)} = [28.09, -0.3475, 0.7031, -28.08]^T,$$
$$F(x^{(2)}) = \begin{bmatrix} 96.80 & 20 & 0 & -94.80 \\ 20 & 202.4 & -4.744 & 0 \\ 0 & -4.744 & 19.49 & -10 \\ -94.80 & 0 & -10 & 104.80 \end{bmatrix},$$
$$x^{(3)} = x^{(2)} - F(x^{(2)})^{-1}g^{(2)} = [0.7037, -0.0704, 0.1121, 0.1111]^T, \qquad f(x^{(3)}) = 1.24. \qquad \blacksquare$$
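The iterations above are easily reproduced with a short MATLAB program; the following sketch, with an illustrative fixed iteration count of our own choosing, implements Newton's method for the Powell function. Note that the Newton direction is computed by solving a linear system with the backslash operator rather than by forming $F(x)^{-1}$ explicitly.

    % Newton's method for the Powell function of Example 9.1 (a sketch;
    % the fixed iteration count is an illustrative choice).
    gradf = @(x) [ 2*(x(1)+10*x(2)) + 40*(x(1)-x(4))^3;
                  20*(x(1)+10*x(2)) +  4*(x(2)-2*x(3))^3;
                  10*(x(3)-x(4))    -  8*(x(2)-2*x(3))^3;
                 -10*(x(3)-x(4))    - 40*(x(1)-x(4))^3 ];
    F = @(x) [ 2+120*(x(1)-x(4))^2, 20, 0, -120*(x(1)-x(4))^2;
               20, 200+12*(x(2)-2*x(3))^2, -24*(x(2)-2*x(3))^2, 0;
               0, -24*(x(2)-2*x(3))^2, 10+48*(x(2)-2*x(3))^2, -10;
              -120*(x(1)-x(4))^2, 0, -10, 10+120*(x(1)-x(4))^2 ];
    x = [3; -1; 0; 1];
    for k = 1:3
        d = -(F(x)\gradf(x));   % solve F(x)d = -g rather than inverting F(x)
        x = x + d;
    end
    % x is now approximately [0.7037; -0.0704; 0.1121; 0.1111].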
Step 1 requires the solution of an n x n system of linear equations. Thus, an efficient method for solving systems of linear equations is essential when using Newton's method. As in the one-variable case, Newton's method can also be viewed as a technique for iteratively solving the equation g(x) = 0, where a ; G l n and g : R n —► R n . In this case F(x) is the Jacobian matrix of g at x\ that is, F(x) is the n x n matrix whose (i,j) entry is (dgi/dxj)(x), ij = 1 , 2 , . . . , n . 9.2 Analysis of Newton's Method As in the one-variable case there is no guarantee that Newton's algorithm heads in the direction of decreasing values of the objective function if F(x^) is not positive definite (recall Figure 7.7 illustrating Newton's method for functions of one variable when / " < 0). Moreover, even if F(x^) > 0, Newton's method may not be a descent method; that is, it is possible that /(x( f c + 1 )) > f(x^). For example, this may occur if our starting point x^ is far away from the solution. See the end of this section for a possible remedy to this problem. Despite these drawbacks, Newton's method has superior convergence properties when the starting point is near the solution, as we shall see in the remainder of this section. The convergence analysis of Newton's method when / is a quadratic function is straightforward. In fact, Newton's method reaches the point x* such ANALYSIS OF NEWTON'S METHOD that V/(aj*) = 0 in just one step starting from any initial point χ(°\ this, suppose that Q — QT is invertible and f(x) = -xTQx Then, To see — xTb. g(x) = Vf(x) = and 165 Qx-b F(x) = Q. Hence, given any initial point χ(°\ by Newton's algorithm x(i)=a.(o)_jF(a.(o))-i^(o) 1 = = XW-Q- [QXW-b] 1 Qb = x*. Therefore, for the quadratic case the order of convergence of Newton's algorithm is oo for any initial point x^ (compare this with Exercise 8.18, which deals with the steepest descent algorithm). To analyze the convergence of Newton's method in the general case, we use results from Section 5.1. Let {x^} be the Newton's method sequence for minimizing a function / : R n —► R. We show that {x^} converges to the minimizer x* with order of convergence at least 2. T h e o r e m 9.1 Suppose that f G C3 and x* G W1 is a point such that V/(a?*) = 0 and F(x*) is invertible. Then, for all x^ sufficiently close to x*, Newton's method is well-defined for all k and converges to x* with an order of convergence at least 2. D Proof. The Taylor series expansion of V / about x^ V / ( « ) - V / ( x ( 0 ) ) - F(x^)(x - x^) yields = 0(\\x - x ( 0 ) || 2 ). Because by assumption f £ C3 and F(x*) is invertible, there exist constants ε > 0, C\ > 0, and c2 > 0 such that if χ(°\χ G {x : \\x — x*|| < ε}, we have ||V/(a?) - V / ( x ( 0 ) ) - F(x^)(x and by Lemma 5.3, F(x)~1 - x^)\\ < cx\\x - a^f exists and satisfies \\F(x)-l\\<c2. The first inequality above holds because the remainder term in the Taylor series expansion contains third derivatives of / that are continuous and hence bounded on {x : \\x — x*\\ < ε}. 166 NEWTON'S METHOD Suppose that x^ e {x : \\x — x*\\ < ε}. Then, substituting x = x* in the inequality above and using the assumption that V/(x*) = 0, we get \\F(xW)(x<® - x*) - V/(ir ( 0 ) )|| < ci||x ( 0 ) - x*|| 2 . Now, subtracting x* from both sides of Newton's algorithm and taking norms yields \\χ™ - x*\\ = ||x (0) - x* 0 Fix^^Vfix^W 1 = | | F ( x ( ) ) - ( F ( x ^ ) ( x ^ - x*) - V / ( x ( 0 ) ) ) | | - x*) - V / ( x ( 0 ) ) | | . 
< WFix^y^lWFix^ix^ Applying the inequalities above involving the constants c\ and c2 gives \\χΜ-χ*\\ Suppose that x^ <cic2||x(0)-ir*||2. is such that ||*(0)-*!<—, ClC2 where a € (0,1). Then, ||a; ( 1 ) -a;*|| < a||x ( 0 ) - x*\\. By induction, we obtain ||a.(*+i) - os*|| < ciC2||a:(ls) - aj*|| 2 , Ua-ifc+i) — as*|| < a\\xw Hence, - x*\\. lim ||x<*> - sc*|| = 0, and therefore the sequence {x^} converges to x*. The order of convergence is at least 2 because \\x<<k+l>>-x*|| < cic 2 ||x ( / c ) -ir*|| 2 ; that is, | | x ^ + 1 ) - x * | | = 0(||xW-x*|| 2 ). ■ Warning: In the Theorem 9.1, we did not state that x* is a local minimizer. For example, if x* is a local maximizes then provided that / G C3 and F(x*) is invertible, Newton's method would converge to x* if we start close enough to it. As stated in Theorem 9.1, Newton's method has superior convergence properties if the starting point is near the solution. However, the method is not guaranteed to converge to the solution if we start far away from it (in fact, it may not even be well-defined because the Hessian may be singular). In particular, the method may not be a descent method; that is, it is possible that ANALYSIS OF NEWTON'S METHOD 167 /(ic( fc+1 )) > f(x^). Fortunately, it is possible to modify the algorithm such that the descent property holds. To see this, we need the following result. Theorem 9.2 Let {x^} be the sequence generated by Newton's method for minimizing a given objective function f(x). If the Hessian F(x^) > 0 and g{k) _ v / ( a ; ^ ) ) φ 0, then the search direction <*(*> = - F ( * < * > ) - V f c ) = *(/c+1) - *{k) from x^ to χ^+^ is a descent direction for f in the sense that there exists an ä > 0 such that for all a G (0, ä ) , f(x(k)+ad{k))<f(x^). D Proof Let (j){a) = f{x{k) +adw). Then, using the chain rule, we obtain φ'(α) = Vf{x {k) + ad{k))Jd{k\ Hence, <//(()) = Vf(x{k))Td{k) = -0( fc > T F(a5< fc >)-V fc) < 0, 1 because F(x^)~ > 0 and g^ φ 0. Thus, there exists an ä > 0 so that for all a G (0, α), φ(α) < φ(0). This implies that for all a G (0, ä), f(xW+adW)<f(x^), which completes the proof. I Theorem 9.2 motivates the following modification of Newton's method: x(k+D=x(k)_akF{x(k)rig(k)y where ak = axgmin/(x<fc> - aF(a: ( f c ) )-^ ( f c ) ); a>0 that is, at each iteration, we perform a line search in the direction -F(x<*>)-V f c ) · By Theorem 9.2 we conclude that the modified Newton's method has the descent property; that is, /(*( fc+i >) < /(*<*>) whenever gW φ 0. A drawback of Newton's method is that evaluation of F(x^) for large n can be computationally expensive. Furthermore, we have to solve the set of n linear equations F(x^)d^ ' = —g^k\ In Chapters 10 and 11 we discuss methods that alleviate this difficulty. Another source of potential problems in Newton's method arises from the Hessian matrix not being positive definite. In the next section we describe a simple modification of Newton's method to overcome this problem. 168 9.3 NEWTON'S METHOD Levenberg-Marquardt Modification If the Hessian matrix F(x^) is not positive definite, then the search direction (Γ ' = — F(x^)~1g^ may not point in a descent direction. A simple technique to ensure that the search direction is a descent direction is to introduce the Levenberg-Marquardt modification of Newton's algorithm: x<*+D = XW - (F(xW) + /ifc/rV0, where μ^ > 0. The idea underlying the Levenberg-Marquardt modification is as follows. Consider a symmetric matrix F, which may not be positive definite. Let λ ι , . . . 
9.3 Levenberg-Marquardt Modification

If the Hessian matrix $F(x^{(k)})$ is not positive definite, then the search direction $d^{(k)} = -F(x^{(k)})^{-1} g^{(k)}$ may not point in a descent direction. A simple technique to ensure that the search direction is a descent direction is to introduce the Levenberg-Marquardt modification of Newton's algorithm:

$$x^{(k+1)} = x^{(k)} - (F(x^{(k)}) + \mu_k I)^{-1} g^{(k)},$$

where $\mu_k \geq 0$.

The idea underlying the Levenberg-Marquardt modification is as follows. Consider a symmetric matrix $F$, which may not be positive definite. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $F$ with corresponding eigenvectors $v_1, \ldots, v_n$. The eigenvalues $\lambda_1, \ldots, \lambda_n$ are real, but may not all be positive. Next, consider the matrix $G = F + \mu I$, where $\mu \geq 0$. Note that the eigenvalues of $G$ are $\lambda_1 + \mu, \ldots, \lambda_n + \mu$. Indeed,

$$G v_i = (F + \mu I) v_i = F v_i + \mu I v_i = \lambda_i v_i + \mu v_i = (\lambda_i + \mu) v_i,$$

which shows that for all $i = 1, \ldots, n$, $v_i$ is also an eigenvector of $G$ with eigenvalue $\lambda_i + \mu$. Therefore, if $\mu$ is sufficiently large, then all the eigenvalues of $G$ are positive and $G$ is positive definite. Accordingly, if the parameter $\mu_k$ in the Levenberg-Marquardt modification of Newton's algorithm is sufficiently large, then the search direction $d^{(k)} = -(F(x^{(k)}) + \mu_k I)^{-1} g^{(k)}$ always points in a descent direction (in the sense of Theorem 9.2). In this case, if we further introduce a step size $\alpha_k$ as described in Section 9.2,

$$x^{(k+1)} = x^{(k)} - \alpha_k (F(x^{(k)}) + \mu_k I)^{-1} g^{(k)},$$

then we are guaranteed that the descent property holds.

The Levenberg-Marquardt modification of Newton's algorithm can be made to approach the behavior of the pure Newton's method by letting $\mu_k \to 0$. On the other hand, by letting $\mu_k \to \infty$, the algorithm approaches a pure gradient method with small step size. In practice, we may start with a small value of $\mu_k$ and increase it slowly until we find that the iteration is descent: $f(x^{(k+1)}) < f(x^{(k)})$.
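As a concrete sketch of this start-small-and-increase strategy, the MATLAB fragment below computes one Levenberg-Marquardt step for a two-dimensional problem. It reuses the fun, grad, and hess handles and the iterate x from the previous sketch, and the initial shift and growth factor are assumptions, not values from the text.

    % One Levenberg-Marquardt step: increase mu until the shifted
    % Hessian is positive definite and the step achieves descent.
    g  = grad(x);
    mu = 1e-4;                               % small initial shift
    while true
        [R, p] = chol(hess(x) + mu*eye(2));  % p == 0 iff matrix is PD
        if p == 0
            d = -(R \ (R' \ g));             % solve (F + mu*I)d = -g
            if fun(x + d) < fun(x), break; end
        end
        mu = 10*mu;                          % not descent yet: raise mu
    end
    x = x + d;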
9.4 Newton's Method for Nonlinear Least Squares

We now examine a particular class of optimization problems and the use of Newton's method for solving them. Consider the following problem:

$$\text{minimize} \quad \sum_{i=1}^m (r_i(x))^2,$$

where $r_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, m$, are given functions. This particular problem is called a nonlinear least-squares problem. The special case where the $r_i$ are linear is discussed in Section 12.1.

Example 9.2 Suppose that we are given $m$ measurements of a process at $m$ points in time, as depicted in Figure 9.2 (here, $m = 21$). Let $t_1, \ldots, t_m$ denote the measurement times and $y_1, \ldots, y_m$ the measurement values. Note that $t_1 = 0$ while $t_{21} = 10$. We wish to fit a sinusoid to the measurement data. The equation of the sinusoid is

$$y = A \sin(\omega t + \phi),$$

with appropriate choices of the parameters $A$, $\omega$, and $\phi$. To formulate the data-fitting problem, we construct the objective function

$$\sum_{i=1}^m (y_i - A \sin(\omega t_i + \phi))^2,$$

representing the sum of the squared errors between the measurement values and the function values at the corresponding points in time. Let $x = [A, \omega, \phi]^T$ represent the vector of decision variables. We therefore obtain a nonlinear least-squares problem with

$$r_i(x) = y_i - A \sin(\omega t_i + \phi).$$

Defining $r = [r_1, \ldots, r_m]^T$, we write the objective function as $f(x) = r(x)^T r(x)$. ■

Figure 9.2 Measurement data for Example 9.2.

To apply Newton's method, we need to compute the gradient and the Hessian of $f$. The $j$th component of $\nabla f(x)$ is

$$\frac{\partial f}{\partial x_j}(x) = 2 \sum_{i=1}^m r_i(x) \frac{\partial r_i}{\partial x_j}(x).$$

Denote the Jacobian matrix of $r$ by

$$J(x) = \begin{bmatrix} \frac{\partial r_1}{\partial x_1}(x) & \cdots & \frac{\partial r_1}{\partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial r_m}{\partial x_1}(x) & \cdots & \frac{\partial r_m}{\partial x_n}(x) \end{bmatrix}.$$

Then, the gradient of $f$ can be represented as

$$\nabla f(x) = 2 J(x)^T r(x).$$

Next, we compute the Hessian matrix of $f$. The $(k,j)$th component of the Hessian is given by

$$\frac{\partial^2 f}{\partial x_k \partial x_j}(x) = \frac{\partial}{\partial x_k}\left(\frac{\partial f}{\partial x_j}(x)\right) = 2 \sum_{i=1}^m \left( \frac{\partial r_i}{\partial x_k}(x) \frac{\partial r_i}{\partial x_j}(x) + r_i(x) \frac{\partial^2 r_i}{\partial x_k \partial x_j}(x) \right).$$

Letting $S(x)$ be the matrix whose $(k,j)$th component is

$$\sum_{i=1}^m r_i(x) \frac{\partial^2 r_i}{\partial x_k \partial x_j}(x),$$

we write the Hessian matrix as

$$F(x) = 2 (J(x)^T J(x) + S(x)).$$

Therefore, Newton's method applied to the nonlinear least-squares problem is given by

$$x^{(k+1)} = x^{(k)} - (J(x)^T J(x) + S(x))^{-1} J(x)^T r(x),$$

where the right-hand side is evaluated at $x = x^{(k)}$.

In some applications, the matrix $S(x)$ involving the second derivatives of the function $r$ can be ignored because its components are negligibly small. In this case Newton's algorithm reduces to what is commonly called the Gauss-Newton method:

$$x^{(k+1)} = x^{(k)} - (J(x)^T J(x))^{-1} J(x)^T r(x).$$

Note that the Gauss-Newton method does not require calculation of the second derivatives of $r$.

Example 9.3 Recall the data-fitting problem in Example 9.2, with

$$r_i(x) = y_i - A \sin(\omega t_i + \phi), \qquad i = 1, \ldots, 21.$$

The Jacobian matrix $J(x)$ in this problem is a $21 \times 3$ matrix with elements given by

$$(J(x))_{(i,1)} = -\sin(\omega t_i + \phi), \quad (J(x))_{(i,2)} = -t_i A \cos(\omega t_i + \phi), \quad (J(x))_{(i,3)} = -A \cos(\omega t_i + \phi),$$

$i = 1, \ldots, 21$. Using the expressions above, we apply the Gauss-Newton algorithm to find the sinusoid of best fit, given the data pairs $(t_1, y_1), \ldots, (t_m, y_m)$. Figure 9.3 shows a plot of the sinusoid of best fit obtained from the Gauss-Newton algorithm. The parameters of this sinusoid are $A = 2.01$, $\omega = 0.992$, and $\phi = 0.541$. ■

Figure 9.3 Sinusoid of best fit in Example 9.3.

A potential problem with the Gauss-Newton method is that the matrix $J(x)^T J(x)$ may not be positive definite. As described before, this problem can be overcome using a Levenberg-Marquardt modification:

$$x^{(k+1)} = x^{(k)} - (J(x)^T J(x) + \mu_k I)^{-1} J(x)^T r(x).$$

This is referred to in the literature as the Levenberg-Marquardt algorithm, because the original Levenberg-Marquardt modification was developed specifically for the nonlinear least-squares problem. An alternative interpretation of the Levenberg-Marquardt algorithm is to view the term $\mu_k I$ as an approximation to $S(x)$ in Newton's algorithm.
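The following MATLAB sketch illustrates the Gauss-Newton iteration on this kind of sinusoid fit. The data are synthetic stand-ins generated for the occasion (the measurements behind Figures 9.2 and 9.3 are not reproduced here), so the recovered parameters will only roughly match those quoted above.

    % Gauss-Newton iteration for fitting y = A*sin(w*t + phi);
    % decision vector x = [A; w; phi].
    t = linspace(0, 10, 21)';                  % 21 sample times
    y = 2*sin(t + 0.5) + 0.1*randn(21, 1);     % assumed noisy data
    x = [1; 1; 0];                             % initial guess
    for k = 1:50
        r = y - x(1)*sin(x(2)*t + x(3));       % residual vector
        J = [-sin(x(2)*t + x(3)), ...          % dr_i/dA
             -t*x(1).*cos(x(2)*t + x(3)), ...  % dr_i/dw
             -x(1)*cos(x(2)*t + x(3))];        % dr_i/dphi
        x = x - (J'*J) \ (J'*r);               % Gauss-Newton update
    end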
EXERCISES

9.1 Let $f : \mathbb{R} \to \mathbb{R}$ be given by $f(x) = (x - x_0)^4$, where $x_0 \in \mathbb{R}$ is a constant. Suppose that we apply Newton's method to the problem of minimizing $f$.
a. Write down the update equation for Newton's method applied to the problem.
b. Let $y^{(k)} = |x^{(k)} - x_0|$, where $x^{(k)}$ is the $k$th iterate in Newton's method. Show that the sequence $\{y^{(k)}\}$ satisfies $y^{(k+1)} = \frac{2}{3} y^{(k)}$.
c. Show that $x^{(k)} \to x_0$ for any initial guess $x^{(0)}$.
d. Show that the order of convergence of the sequence $\{x^{(k)}\}$ in part b is 1.
e. Theorem 9.1 states that under certain conditions, the order of convergence of Newton's method is at least 2. Why does that theorem not hold in this particular problem?

9.2 This question relates to the order of convergence of the secant method, using an argument similar to that of the proof of Theorem 9.1.
a. Consider a function $f : \mathbb{R} \to \mathbb{R}$, $f \in \mathcal{C}^2$, such that $x^*$ is a local minimizer and $f''(x^*) \neq 0$. Suppose that we apply the algorithm $x^{(k+1)} = x^{(k)} - \alpha_k f'(x^{(k)})$ such that $\{\alpha_k\}$ is a positive step-size sequence that converges to $1/f''(x^*)$. Show that if $x^{(k)} \to x^*$, then the order of convergence of the algorithm is superlinear (i.e., strictly greater than 1).
b. Given part a, what can you say about the order of convergence of the secant algorithm?

9.3 Consider the problem of minimizing $f(x) = x^{4/3} = (\sqrt[3]{x})^4$, $x \in \mathbb{R}$. Note that 0 is the global minimizer of $f$.
a. Write down the algorithm for Newton's method applied to this problem.
b. Show that as long as the starting point is not 0, the algorithm in part a does not converge to 0 (no matter how close to 0 we start).

9.4 Consider Rosenbrock's function: $f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2$, where $x = [x_1, x_2]^T$ (known to be a "nasty" function, often used as a benchmark for testing algorithms). This function is also known as the banana function because of the shape of its level sets.
a. Prove that $[1, 1]^T$ is the unique global minimizer of $f$ over $\mathbb{R}^2$.
b. With a starting point of $[0, 0]^T$, apply two iterations of Newton's method. Hint:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.$$

c. Repeat part b using a gradient algorithm with a fixed step size of $\alpha_k = 0.05$ at each iteration.

9.5 Consider the modified Newton's algorithm

$$x^{(k+1)} = x^{(k)} - \alpha_k F(x^{(k)})^{-1} g^{(k)},$$

where $\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} - \alpha F(x^{(k)})^{-1} g^{(k)})$. Suppose that we apply the algorithm to a quadratic function $f(x) = \frac{1}{2} x^T Q x - x^T b$, where $Q = Q^T > 0$. Recall that the standard Newton's method reaches the point $x^*$ such that $\nabla f(x^*) = 0$ in just one step starting from any initial point $x^{(0)}$. Does the modified Newton's algorithm above possess the same property?

CHAPTER 10

CONJUGATE DIRECTION METHODS

10.1 Introduction

The class of conjugate direction methods can be viewed as being intermediate between the method of steepest descent and Newton's method. The conjugate direction methods have the following properties:

1. They solve quadratics of $n$ variables in $n$ steps.
2. The usual implementation, the conjugate gradient algorithm, requires no Hessian matrix evaluations.
3. No matrix inversion and no storage of an $n \times n$ matrix are required.

The conjugate direction methods typically perform better than the method of steepest descent, but not as well as Newton's method. As we saw from the method of steepest descent and Newton's method, the crucial factor in the efficiency of an iterative search method is the direction of search at each iteration. For a quadratic function of $n$ variables

$$f(x) = \frac{1}{2} x^T Q x - x^T b, \qquad x \in \mathbb{R}^n, \quad Q = Q^T > 0,$$

the best direction of search, as we shall see, is in the $Q$-conjugate direction. Basically, two directions $d^{(1)}$ and $d^{(2)}$ in $\mathbb{R}^n$ are said to be $Q$-conjugate if $d^{(1)T} Q d^{(2)} = 0$. In general, we have the following definition.

Definition 10.1 Let $Q$ be a real symmetric $n \times n$ matrix. The directions $d^{(0)}, d^{(1)}, d^{(2)}, \ldots, d^{(m)}$ are $Q$-conjugate if for all $i \neq j$, we have $d^{(i)T} Q d^{(j)} = 0$. ■

Lemma 10.1 Let $Q$ be a symmetric positive definite $n \times n$ matrix. If the directions $d^{(0)}, d^{(1)}, \ldots, d^{(k)} \in \mathbb{R}^n$, $k \leq n - 1$, are nonzero and $Q$-conjugate, then they are linearly independent. □

Proof. Let $\alpha_0, \ldots, \alpha_k$ be scalars such that

$$\alpha_0 d^{(0)} + \alpha_1 d^{(1)} + \cdots + \alpha_k d^{(k)} = 0.$$

Premultiplying this equality by $d^{(j)T} Q$, $0 \leq j \leq k$, yields

$$\alpha_j d^{(j)T} Q d^{(j)} = 0,$$

because all other terms $d^{(j)T} Q d^{(i)} = 0$, $i \neq j$, by $Q$-conjugacy. But $Q = Q^T > 0$ and $d^{(j)} \neq 0$; hence $\alpha_j = 0$, $j = 0, 1, \ldots, k$. Therefore, $d^{(0)}, d^{(1)}, \ldots, d^{(k)}$, $k \leq n - 1$, are linearly independent. ■
Example 10.1 Let

$$Q = \begin{bmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}.$$

Note that $Q = Q^T > 0$. The matrix $Q$ is positive definite because all its leading principal minors are positive:

$$\Delta_1 = 3 > 0, \qquad \Delta_2 = \det \begin{bmatrix} 3 & 0 \\ 0 & 4 \end{bmatrix} = 12 > 0, \qquad \Delta_3 = \det Q = 20 > 0.$$

Our goal is to construct a set of $Q$-conjugate vectors $d^{(0)}, d^{(1)}, d^{(2)}$. Let $d^{(0)} = [1, 0, 0]^T$ and $d^{(1)} = [d_1^{(1)}, d_2^{(1)}, d_3^{(1)}]^T$. We require that $d^{(0)T} Q d^{(1)} = 0$. We have

$$d^{(0)T} Q d^{(1)} = [1, 0, 0] \begin{bmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix} d^{(1)} = 3 d_1^{(1)} + d_3^{(1)}.$$

Let $d_1^{(1)} = 1$, $d_2^{(1)} = 0$, $d_3^{(1)} = -3$. Then $d^{(1)} = [1, 0, -3]^T$, and thus $d^{(0)T} Q d^{(1)} = 0$.

To find the third vector $d^{(2)} = [d_1^{(2)}, d_2^{(2)}, d_3^{(2)}]^T$, which would be $Q$-conjugate with $d^{(0)}$ and $d^{(1)}$, we require that $d^{(0)T} Q d^{(2)} = 0$ and $d^{(1)T} Q d^{(2)} = 0$. We have

$$d^{(0)T} Q d^{(2)} = 3 d_1^{(2)} + d_3^{(2)} = 0, \qquad d^{(1)T} Q d^{(2)} = -6 d_2^{(2)} - 8 d_3^{(2)} = 0.$$

If we take $d^{(2)} = [1, 4, -3]^T$, then the resulting set of vectors is mutually conjugate. ■

This method of finding $Q$-conjugate vectors is inefficient. A systematic procedure for finding $Q$-conjugate vectors can be devised using the idea underlying the Gram-Schmidt process of transforming a given basis of $\mathbb{R}^n$ into an orthonormal basis of $\mathbb{R}^n$ (see Exercise 10.1).

10.2 The Conjugate Direction Algorithm

We now present the conjugate direction algorithm for minimizing the quadratic function of $n$ variables

$$f(x) = \frac{1}{2} x^T Q x - x^T b,$$

where $Q = Q^T > 0$, $x \in \mathbb{R}^n$. Note that because $Q > 0$, the function $f$ has a global minimizer that can be found by solving $Qx = b$.

Basic Conjugate Direction Algorithm. Given a starting point $x^{(0)}$ and $Q$-conjugate directions $d^{(0)}, d^{(1)}, \ldots, d^{(n-1)}$; for $k \geq 0$,

$$g^{(k)} = \nabla f(x^{(k)}) = Q x^{(k)} - b,$$
$$\alpha_k = -\frac{g^{(k)T} d^{(k)}}{d^{(k)T} Q d^{(k)}},$$
$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}.$$

Theorem 10.1 For any starting point $x^{(0)}$, the basic conjugate direction algorithm converges to the unique $x^*$ (that solves $Qx = b$) in $n$ steps; that is, $x^{(n)} = x^*$. □

Proof. Consider $x^* - x^{(0)} \in \mathbb{R}^n$. Because the $d^{(i)}$ are linearly independent, there exist constants $\beta_i$, $i = 0, \ldots, n - 1$, such that

$$x^* - x^{(0)} = \beta_0 d^{(0)} + \cdots + \beta_{n-1} d^{(n-1)}.$$

Now premultiply both sides of this equation by $d^{(k)T} Q$, $0 \leq k < n$, to obtain

$$d^{(k)T} Q (x^* - x^{(0)}) = \beta_k d^{(k)T} Q d^{(k)},$$

where the terms $d^{(k)T} Q d^{(i)} = 0$, $k \neq i$, by the $Q$-conjugacy property. Hence,

$$\beta_k = \frac{d^{(k)T} Q (x^* - x^{(0)})}{d^{(k)T} Q d^{(k)}}.$$

Now, we can write

$$x^{(k)} = x^{(0)} + \alpha_0 d^{(0)} + \cdots + \alpha_{k-1} d^{(k-1)}.$$

Therefore,

$$x^{(k)} - x^{(0)} = \alpha_0 d^{(0)} + \cdots + \alpha_{k-1} d^{(k-1)}.$$

So writing

$$x^* - x^{(0)} = (x^* - x^{(k)}) + (x^{(k)} - x^{(0)})$$

and premultiplying the above by $d^{(k)T} Q$, we obtain

$$d^{(k)T} Q (x^* - x^{(0)}) = d^{(k)T} Q (x^* - x^{(k)}) = -d^{(k)T} g^{(k)},$$

because $g^{(k)} = Q x^{(k)} - b$ and $Q x^* = b$. Thus,

$$\beta_k = -\frac{d^{(k)T} g^{(k)}}{d^{(k)T} Q d^{(k)}} = \alpha_k,$$

and $x^* = x^{(n)}$, which completes the proof. ■

Example 10.2 Find the minimizer of

$$f(x_1, x_2) = \frac{1}{2} x^T \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix} x - x^T \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \qquad x \in \mathbb{R}^2,$$

using the conjugate direction method with the initial point $x^{(0)} = [0, 0]^T$ and $Q$-conjugate directions $d^{(0)} = [1, 0]^T$ and $d^{(1)} = [-\frac{3}{8}, \frac{3}{4}]^T$.

We have $g^{(0)} = Q x^{(0)} - b = -b = [1, -1]^T$, and hence

$$\alpha_0 = -\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}} = -\frac{1}{4}.$$

Thus,

$$x^{(1)} = x^{(0)} + \alpha_0 d^{(0)} = \left[-\tfrac{1}{4}, 0\right]^T.$$

To find $x^{(2)}$, we compute

$$g^{(1)} = Q x^{(1)} - b = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} -\tfrac{1}{4} \\ 0 \end{bmatrix} - \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ -\tfrac{3}{2} \end{bmatrix}$$

and

$$\alpha_1 = -\frac{g^{(1)T} d^{(1)}}{d^{(1)T} Q d^{(1)}} = -\frac{-\tfrac{9}{8}}{\tfrac{9}{16}} = 2.$$

Therefore,

$$x^{(2)} = x^{(1)} + \alpha_1 d^{(1)} = \begin{bmatrix} -\tfrac{1}{4} \\ 0 \end{bmatrix} + 2 \begin{bmatrix} -\tfrac{3}{8} \\ \tfrac{3}{4} \end{bmatrix} = \begin{bmatrix} -1 \\ \tfrac{3}{2} \end{bmatrix}.$$

Because $f$ is a quadratic function in two variables, $x^{(2)} = x^*$. ■
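The computation in Example 10.2 is easy to reproduce; the short MATLAB sketch below runs the basic conjugate direction algorithm with the same data.

    % Basic conjugate direction algorithm on the quadratic of Example 10.2.
    Q = [4 2; 2 2];  b = [-1; 1];
    D = [1, -3/8; 0, 3/4];        % columns are the Q-conjugate directions
    x = [0; 0];
    for k = 1:2
        g = Q*x - b;              % gradient at the current iterate
        d = D(:, k);
        a = -(g'*d) / (d'*Q*d);   % step size along d
        x = x + a*d;
    end
    disp(x)                       % [-1; 1.5], the minimizer x* = Q\b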
For a quadratic function of $n$ variables, the conjugate direction method reaches the solution after $n$ steps. As we shall see below, the method also possesses a certain desirable property in the intermediate steps. To see this, suppose that we start at $x^{(0)}$ and search in the direction $d^{(0)}$ to obtain

$$x^{(1)} = x^{(0)} - \left(\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}}\right) d^{(0)}.$$

We claim that

$$g^{(1)T} d^{(0)} = 0.$$

To see this,

$$g^{(1)T} d^{(0)} = (Q x^{(1)} - b)^T d^{(0)} = x^{(0)T} Q d^{(0)} - \left(\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}}\right) d^{(0)T} Q d^{(0)} - b^T d^{(0)} = g^{(0)T} d^{(0)} - g^{(0)T} d^{(0)} = 0.$$

The equation $g^{(1)T} d^{(0)} = 0$ implies that $\alpha_0$ has the property that $\alpha_0 = \arg\min \phi_0(\alpha)$, where $\phi_0(\alpha) = f(x^{(0)} + \alpha d^{(0)})$. To see this, apply the chain rule to get

$$\frac{d\phi_0}{d\alpha}(\alpha) = \nabla f(x^{(0)} + \alpha d^{(0)})^T d^{(0)}.$$

Evaluating the above at $\alpha = \alpha_0$, we get

$$\frac{d\phi_0}{d\alpha}(\alpha_0) = g^{(1)T} d^{(0)} = 0.$$

Because $\phi_0$ is a quadratic function of $\alpha$, and the coefficient of the $\alpha^2$ term in $\phi_0$ is $d^{(0)T} Q d^{(0)} > 0$, the above implies that $\alpha_0 = \arg\min_{\alpha \in \mathbb{R}} \phi_0(\alpha)$.

Using a similar argument, we can show that for all $k$,

$$g^{(k+1)T} d^{(k)} = 0$$

and hence

$$\alpha_k = \arg\min f(x^{(k)} + \alpha d^{(k)}).$$

In fact, an even stronger condition holds, as given by the following lemma.

Lemma 10.2 In the conjugate direction algorithm,

$$g^{(k+1)T} d^{(i)} = 0$$

for all $k$, $0 \leq k \leq n - 1$, and $0 \leq i \leq k$. □

Proof. Note that

$$Q(x^{(k+1)} - x^{(k)}) = Q x^{(k+1)} - b - (Q x^{(k)} - b) = g^{(k+1)} - g^{(k)},$$

because $g^{(k)} = Q x^{(k)} - b$. Thus,

$$g^{(k+1)} = g^{(k)} + \alpha_k Q d^{(k)}.$$

We prove the lemma by induction. The result is true for $k = 0$ because $g^{(1)T} d^{(0)} = 0$, as shown before. We now show that if the result is true for $k - 1$ (i.e., $g^{(k)T} d^{(i)} = 0$, $i \leq k - 1$), then it is true for $k$ (i.e., $g^{(k+1)T} d^{(i)} = 0$, $i \leq k$). Fix $k > 0$ and $0 \leq i < k$. By the induction hypothesis, $g^{(k)T} d^{(i)} = 0$. Because

$$g^{(k+1)} = g^{(k)} + \alpha_k Q d^{(k)},$$

and $d^{(k)T} Q d^{(i)} = 0$ by $Q$-conjugacy, we have

$$g^{(k+1)T} d^{(i)} = g^{(k)T} d^{(i)} + \alpha_k d^{(k)T} Q d^{(i)} = 0.$$

It remains to be shown that $g^{(k+1)T} d^{(k)} = 0$. Indeed,

$$g^{(k+1)T} d^{(k)} = (g^{(k)} + \alpha_k Q d^{(k)})^T d^{(k)} = g^{(k)T} d^{(k)} - \left(\frac{g^{(k)T} d^{(k)}}{d^{(k)T} Q d^{(k)}}\right) d^{(k)T} Q d^{(k)} = 0.$$

Therefore, by induction, for all $0 \leq k \leq n - 1$ and $0 \leq i \leq k$, $g^{(k+1)T} d^{(i)} = 0$. ■

Figure 10.1 Illustration of Lemma 10.2.

By Lemma 10.2 we see that $g^{(k+1)}$ is orthogonal to any vector from the subspace spanned by $d^{(0)}, d^{(1)}, \ldots, d^{(k)}$. Figure 10.1 illustrates this statement.

The lemma can be used to show an interesting optimal property of the conjugate direction algorithm. Specifically, we now show that not only does $f(x^{(k+1)})$ satisfy $f(x^{(k+1)}) = \min_\alpha f(x^{(k)} + \alpha d^{(k)})$, as indicated before, but also

$$f(x^{(k+1)}) = \min_{a_0, \ldots, a_k} f\left(x^{(0)} + \sum_{i=0}^k a_i d^{(i)}\right).$$

In other words, if we write

$$\mathcal{V}_k = x^{(0)} + \operatorname{span}[d^{(0)}, d^{(1)}, \ldots, d^{(k)}],$$

then we can express $f(x^{(k+1)}) = \min_{x \in \mathcal{V}_k} f(x)$. As $k$ increases, the subspace $\operatorname{span}[d^{(0)}, d^{(1)}, \ldots, d^{(k)}]$ "expands," and will eventually fill the whole of $\mathbb{R}^n$ (provided that the vectors $d^{(0)}, d^{(1)}, \ldots$ are linearly independent). Therefore, for some sufficiently large $k$, $x^*$ will lie in $\mathcal{V}_k$. For this reason, the above result is sometimes called the expanding subspace theorem (see, e.g., [88, p. 266]).

To prove the expanding subspace theorem, define the matrix

$$D^{(k)} = [d^{(0)}, d^{(1)}, \ldots, d^{(k)}];$$

that is, $d^{(i)}$ is the $i$th column of $D^{(k)}$. Note that $x^{(0)} + \mathcal{R}(D^{(k)}) = \mathcal{V}_k$. Also,

$$x^{(k+1)} = x^{(0)} + \sum_{i=0}^k \alpha_i d^{(i)} = x^{(0)} + D^{(k)} \alpha,$$

where $\alpha = [\alpha_0, \ldots, \alpha_k]^T$. Hence, $x^{(k+1)} \in x^{(0)} + \mathcal{R}(D^{(k)}) = \mathcal{V}_k$.

Now, consider any vector $x \in \mathcal{V}_k$. There exists a vector $a$ such that $x = x^{(0)} + D^{(k)} a$. Let $\phi_k(a) = f(x^{(0)} + D^{(k)} a)$. Note that $\phi_k$ is a quadratic function and has a unique minimizer that satisfies the FONC (see Exercises 6.33 and 10.7). By the chain rule,

$$D\phi_k(a) = \nabla f(x^{(0)} + D^{(k)} a)^T D^{(k)}.$$

Therefore,

$$D\phi_k(\alpha) = \nabla f(x^{(k+1)})^T D^{(k)} = g^{(k+1)T} D^{(k)}.$$

By Lemma 10.2, $g^{(k+1)T} D^{(k)} = 0^T$. Therefore, $\alpha$ satisfies the FONC for the quadratic function $\phi_k$, and hence $\alpha$ is the minimizer of $\phi_k$; that is,

$$f(x^{(k+1)}) = \min_a f(x^{(0)} + D^{(k)} a) = \min_{x \in \mathcal{V}_k} f(x),$$

which completes the proof of our result.

The conjugate direction algorithm is very effective. However, to use the algorithm, we need to specify the $Q$-conjugate directions. Fortunately, there is a way to generate $Q$-conjugate directions as we perform iterations. In the next section we discuss an algorithm that incorporates the generation of $Q$-conjugate directions.
10.3 The Conjugate Gradient Algorithm

The conjugate gradient algorithm does not use prespecified conjugate directions, but instead computes the directions as the algorithm progresses. At each stage of the algorithm, the direction is calculated as a linear combination of the previous direction and the current gradient, in such a way that all the directions are mutually $Q$-conjugate; hence the name conjugate gradient algorithm. This calculation exploits the fact that for a quadratic function of $n$ variables, we can locate the function minimizer by performing $n$ searches along mutually conjugate directions.

As before, we consider the quadratic function

$$f(x) = \frac{1}{2} x^T Q x - x^T b, \qquad x \in \mathbb{R}^n,$$

where $Q = Q^T > 0$. Our first search direction from an initial point $x^{(0)}$ is in the direction of steepest descent; that is,

$$d^{(0)} = -g^{(0)}.$$

Thus,

$$x^{(1)} = x^{(0)} + \alpha_0 d^{(0)},$$

where

$$\alpha_0 = -\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}}.$$

In the next stage, we search in a direction $d^{(1)}$ that is $Q$-conjugate to $d^{(0)}$. We choose $d^{(1)}$ as a linear combination of $g^{(1)}$ and $d^{(0)}$. In general, at the $(k+1)$th step, we choose $d^{(k+1)}$ to be a linear combination of $g^{(k+1)}$ and $d^{(k)}$. Specifically, we choose

$$d^{(k+1)} = -g^{(k+1)} + \beta_k d^{(k)}, \qquad k = 0, 1, 2, \ldots.$$

The coefficients $\beta_k$, $k = 0, 1, 2, \ldots$, are chosen in such a way that $d^{(k+1)}$ is $Q$-conjugate to $d^{(0)}, d^{(1)}, \ldots, d^{(k)}$. This is accomplished by choosing $\beta_k$ to be

$$\beta_k = \frac{g^{(k+1)T} Q d^{(k)}}{d^{(k)T} Q d^{(k)}}.$$

The conjugate gradient algorithm is summarized below.

1. Set $k := 0$; select the initial point $x^{(0)}$.
2. $g^{(0)} = \nabla f(x^{(0)})$. If $g^{(0)} = 0$, stop; else, set $d^{(0)} = -g^{(0)}$.
3. $\alpha_k = -\dfrac{g^{(k)T} d^{(k)}}{d^{(k)T} Q d^{(k)}}$.
4. $x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$.
5. $g^{(k+1)} = \nabla f(x^{(k+1)})$. If $g^{(k+1)} = 0$, stop.
6. $\beta_k = \dfrac{g^{(k+1)T} Q d^{(k)}}{d^{(k)T} Q d^{(k)}}$.
7. $d^{(k+1)} = -g^{(k+1)} + \beta_k d^{(k)}$.
8. Set $k := k + 1$; go to step 3.

Proposition 10.1 In the conjugate gradient algorithm, the directions $d^{(0)}, d^{(1)}, \ldots, d^{(n-1)}$ are $Q$-conjugate. □

Proof. We use induction. We first show that $d^{(0)T} Q d^{(1)} = 0$. To this end we write

$$d^{(0)T} Q d^{(1)} = d^{(0)T} Q (-g^{(1)} + \beta_0 d^{(0)}).$$

Substituting for

$$\beta_0 = \frac{g^{(1)T} Q d^{(0)}}{d^{(0)T} Q d^{(0)}}$$

in the equation above, we see that $d^{(0)T} Q d^{(1)} = 0$.

We now assume that $d^{(0)}, d^{(1)}, \ldots, d^{(k)}$, $k < n - 1$, are $Q$-conjugate directions. From Lemma 10.2 we have $g^{(k+1)T} d^{(j)} = 0$, $j = 0, 1, \ldots, k$. Thus, $g^{(k+1)}$ is orthogonal to each of the directions $d^{(0)}, d^{(1)}, \ldots, d^{(k)}$. We now show that

$$g^{(k+1)T} g^{(j)} = 0, \qquad j = 0, 1, \ldots, k.$$

Fix $j \in \{0, \ldots, k\}$. We have

$$d^{(j)} = -g^{(j)} + \beta_{j-1} d^{(j-1)}.$$

Substituting this equation into the previous one yields

$$g^{(k+1)T} d^{(j)} = 0 = -g^{(k+1)T} g^{(j)} + \beta_{j-1} g^{(k+1)T} d^{(j-1)}.$$

Because $g^{(k+1)T} d^{(j-1)} = 0$, it follows that $g^{(k+1)T} g^{(j)} = 0$.

We are now ready to show that $d^{(k+1)T} Q d^{(j)} = 0$, $j = 0, \ldots, k$. We have

$$d^{(k+1)T} Q d^{(j)} = (-g^{(k+1)} + \beta_k d^{(k)})^T Q d^{(j)}.$$

If $j < k$, then $d^{(k)T} Q d^{(j)} = 0$, by virtue of the induction hypothesis. Hence, we have

$$d^{(k+1)T} Q d^{(j)} = -g^{(k+1)T} Q d^{(j)}.$$

But $g^{(j+1)} = g^{(j)} + \alpha_j Q d^{(j)}$. Because $g^{(k+1)T} g^{(i)} = 0$, $i = 0, \ldots, k$,

$$d^{(k+1)T} Q d^{(j)} = -\frac{g^{(k+1)T} (g^{(j+1)} - g^{(j)})}{\alpha_j} = 0.$$

Thus, $d^{(k+1)T} Q d^{(j)} = 0$, $j = 0, \ldots, k - 1$. It remains to be shown that $d^{(k+1)T} Q d^{(k)} = 0$. We have

$$d^{(k+1)T} Q d^{(k)} = (-g^{(k+1)} + \beta_k d^{(k)})^T Q d^{(k)}.$$

Using the expression for $\beta_k$, we get $d^{(k+1)T} Q d^{(k)} = 0$, which completes the proof. ■
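A direct MATLAB transcription of steps 1 to 8 is given below; it is run here on the quadratic of Example 10.3, which follows.

    % Conjugate gradient algorithm for f(x) = 0.5*x'*Q*x - x'*b.
    Q = [3 0 1; 0 4 2; 1 2 3];  b = [3; 0; 1];
    x = zeros(3, 1);
    g = Q*x - b;  d = -g;                  % steps 1-2
    for k = 0:2
        a = -(g'*d) / (d'*Q*d);            % step 3
        x = x + a*d;                       % step 4
        g = Q*x - b;                       % step 5
        if norm(g) < 1e-12, break; end
        beta = (g'*Q*d) / (d'*Q*d);        % step 6
        d = -g + beta*d;                   % step 7
    end
    disp(x)                                % [1; 0; 0] after three steps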
Example 10.3 Consider the quadratic function

$$f(x_1, x_2, x_3) = \frac{3}{2} x_1^2 + 2 x_2^2 + \frac{3}{2} x_3^2 + x_1 x_3 + 2 x_2 x_3 - 3 x_1 - x_3.$$

We find the minimizer using the conjugate gradient algorithm, using the starting point $x^{(0)} = [0, 0, 0]^T$. We can represent $f$ as

$$f(x) = \frac{1}{2} x^T Q x - x^T b, \qquad Q = \begin{bmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix}.$$

We have

$$g(x) = \nabla f(x) = Qx - b = [3x_1 + x_3 - 3, \; 4x_2 + 2x_3, \; x_1 + 2x_2 + 3x_3 - 1]^T.$$

Hence,

$$g^{(0)} = [-3, 0, -1]^T, \qquad d^{(0)} = -g^{(0)},$$
$$\alpha_0 = -\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}} = \frac{10}{36} = 0.2778,$$

and

$$x^{(1)} = x^{(0)} + \alpha_0 d^{(0)} = [0.8333, 0, 0.2778]^T.$$

The next stage yields

$$g^{(1)} = \nabla f(x^{(1)}) = [-0.2222, 0.5556, 0.6667]^T, \qquad \beta_0 = \frac{g^{(1)T} Q d^{(0)}}{d^{(0)T} Q d^{(0)}} = 0.08025.$$

We can now compute

$$d^{(1)} = -g^{(1)} + \beta_0 d^{(0)} = [0.4630, -0.5556, -0.5864]^T.$$

Hence,

$$\alpha_1 = -\frac{g^{(1)T} d^{(1)}}{d^{(1)T} Q d^{(1)}} = 0.2187$$

and

$$x^{(2)} = x^{(1)} + \alpha_1 d^{(1)} = [0.9346, -0.1215, 0.1495]^T.$$

To perform the third iteration, we compute

$$g^{(2)} = \nabla f(x^{(2)}) = [-0.04673, -0.1869, 0.1402]^T,$$
$$d^{(2)} = -g^{(2)} + \beta_1 d^{(1)} = [0.07948, 0.1476, -0.1817]^T, \qquad \beta_1 = \frac{g^{(2)T} Q d^{(1)}}{d^{(1)T} Q d^{(1)}}.$$

Hence,

$$\alpha_2 = -\frac{g^{(2)T} d^{(2)}}{d^{(2)T} Q d^{(2)}} = 0.8231$$

and

$$x^{(3)} = x^{(2)} + \alpha_2 d^{(2)} = [1.000, 0.000, 0.000]^T.$$

Note that $g^{(3)} = \nabla f(x^{(3)}) = 0$, as expected, because $f$ is a quadratic function of three variables. Hence, $x^* = x^{(3)}$. ■

10.4 The Conjugate Gradient Algorithm for Nonquadratic Problems

In Section 10.3, we showed that the conjugate gradient algorithm is a conjugate direction method, and therefore minimizes a positive definite quadratic function of $n$ variables in $n$ steps. The algorithm can be extended to general nonlinear functions by interpreting $f(x) = \frac{1}{2} x^T Q x - x^T b$ as a second-order Taylor series approximation of the objective function. Near the solution such functions behave approximately as quadratics, as suggested by the Taylor series expansion. For a quadratic, the matrix $Q$, the Hessian of the quadratic, is constant. However, for a general nonlinear function the Hessian is a matrix that has to be reevaluated at each iteration of the algorithm. This can be computationally very expensive. Thus, an efficient implementation of the conjugate gradient algorithm that eliminates the Hessian evaluation at each step is desirable.

Observe that $Q$ appears only in the computation of the scalars $\alpha_k$ and $\beta_k$. Because

$$\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} + \alpha d^{(k)}),$$

the closed-form formula for $\alpha_k$ in the algorithm can be replaced by a numerical line search procedure. Therefore, we need only concern ourselves with the formula for $\beta_k$. Fortunately, elimination of $Q$ from the formula is possible and results in algorithms that depend only on the function and gradient values at each iteration. We now discuss modifications of the conjugate gradient algorithm for a quadratic function for the case in which the Hessian is unknown but in which objective function values and gradients are available. The modifications are all based on algebraically manipulating the formula for $\beta_k$ in such a way that $Q$ is eliminated. We discuss three well-known modifications.

Hestenes-Stiefel Formula. Recall that

$$\beta_k = \frac{g^{(k+1)T} Q d^{(k)}}{d^{(k)T} Q d^{(k)}}.$$

The Hestenes-Stiefel formula is based on replacing the term $Q d^{(k)}$ by the term $(g^{(k+1)} - g^{(k)})/\alpha_k$. The two terms are equal in the quadratic case, as we now show. Now, $x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$. Premultiplying both sides by $Q$, subtracting $b$ from both sides, and recognizing that $g^{(k)} = Q x^{(k)} - b$, we get $g^{(k+1)} = g^{(k)} + \alpha_k Q d^{(k)}$, which we can rewrite as $Q d^{(k)} = (g^{(k+1)} - g^{(k)})/\alpha_k$. Substituting this into the original equation for $\beta_k$ gives the Hestenes-Stiefel formula

$$\beta_k = \frac{g^{(k+1)T} [g^{(k+1)} - g^{(k)}]}{d^{(k)T} [g^{(k+1)} - g^{(k)}]}.$$

Polak-Ribiere Formula. Starting from the Hestenes-Stiefel formula, we multiply out the denominator to get

$$d^{(k)T} [g^{(k+1)} - g^{(k)}] = d^{(k)T} g^{(k+1)} - d^{(k)T} g^{(k)}.$$

By Lemma 10.2, $d^{(k)T} g^{(k+1)} = 0$. Also, since

$$d^{(k)} = -g^{(k)} + \beta_{k-1} d^{(k-1)},$$

premultiplying this by $g^{(k)T}$, we get

$$g^{(k)T} d^{(k)} = -g^{(k)T} g^{(k)} + \beta_{k-1} g^{(k)T} d^{(k-1)} = -g^{(k)T} g^{(k)},$$

where once again we used Lemma 10.2. Hence, we get the Polak-Ribiere formula

$$\beta_k = \frac{g^{(k+1)T} [g^{(k+1)} - g^{(k)}]}{g^{(k)T} g^{(k)}}.$$
Fletcher-Reeves Formula. Starting with the Polak-Ribiere formula, we multiply out the numerator to get

$$\beta_k = \frac{g^{(k+1)T} g^{(k+1)} - g^{(k+1)T} g^{(k)}}{g^{(k)T} g^{(k)}}.$$

We now use the fact that $g^{(k+1)T} g^{(k)} = 0$, which we get by using the equation

$$g^{(k+1)T} d^{(k)} = 0 = -g^{(k+1)T} g^{(k)} + \beta_{k-1} g^{(k+1)T} d^{(k-1)}$$

and applying Lemma 10.2. This leads to the Fletcher-Reeves formula

$$\beta_k = \frac{g^{(k+1)T} g^{(k+1)}}{g^{(k)T} g^{(k)}}.$$

The formulas above give us conjugate gradient algorithms that do not require explicit knowledge of the Hessian matrix $Q$. All we need are the objective function and gradient values at each iteration. For the quadratic case the three expressions for $\beta_k$ are exactly equal. However, this is not the case for a general nonlinear objective function.

We need a few more slight modifications to apply the algorithm to general nonlinear functions in practice. First, as mentioned in our discussion of the steepest descent algorithm (Section 8.2), the stopping criterion $\nabla f(x^{(k+1)}) = 0$ is not practical. A suitable practical stopping criterion, such as those discussed in Section 8.2, needs to be used. For nonquadratic problems, the algorithm will not usually converge in $n$ steps, and as the algorithm progresses, the "$Q$-conjugacy" of the direction vectors will tend to deteriorate. Thus, a common practice is to reinitialize the direction vector to the negative gradient after every few iterations (e.g., $n$ or $n + 1$) and continue until the algorithm satisfies the stopping criterion.

A very important issue in minimization problems of nonquadratic functions is the line search. The purpose of the line search is to minimize $\phi_k(\alpha) = f(x^{(k)} + \alpha d^{(k)})$ with respect to $\alpha \geq 0$. A typical approach is to bracket or box in the minimizer and then estimate it. The accuracy of the line search is a critical factor in the performance of the conjugate gradient algorithm. If the line search is known to be inaccurate, the Hestenes-Stiefel formula for $\beta_k$ is recommended [69].

In general, the choice of which formula for $\beta_k$ to use depends on the objective function. For example, the Polak-Ribiere formula is known to perform far better than the Fletcher-Reeves formula in some cases but not in others. In fact, there are cases in which the $g^{(k)}$, $k = 1, 2, \ldots$, are bounded away from zero when the Polak-Ribiere formula is used (see [107]). In the study by Powell in [107], a global convergence analysis suggests that the Fletcher-Reeves formula for $\beta_k$ is superior. Powell further suggests another formula for $\beta_k$:

$$\beta_k = \max\left\{0, \; \frac{g^{(k+1)T} [g^{(k+1)} - g^{(k)}]}{g^{(k)T} g^{(k)}}\right\}.$$

For general results on the convergence of conjugate gradient methods, we refer the reader to [135]. For an application of conjugate gradient algorithms to Wiener filtering, see [116], [117], and [118]. Conjugate gradient algorithms are related to Krylov subspace methods (see Exercise 10.6). Krylov-subspace-iteration methods, initiated by Magnus Hestenes, Eduard Stiefel, and Cornelius Lanczos, have been declared one of the 10 algorithms with the greatest influence on the development and practice of science and engineering in the twentieth century [40]. For a control perspective on the conjugate gradient algorithm, derived from a proportional-plus-derivative (PD) controller architecture, see [4]. In addition, these authors offer a control perspective on Krylov-subspace-iteration methods as discrete feedback control systems.
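To make the preceding discussion concrete, here is a minimal MATLAB sketch of the nonquadratic conjugate gradient method with the Polak-Ribiere formula and periodic restarts. The test function, tolerance, and use of fminbnd for the line search (bracketing the step in [0, 1]) are illustrative choices, not prescriptions from the text.

    % Nonquadratic CG with the Polak-Ribiere formula and restarts.
    fun  = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;   % Rosenbrock
    grad = @(x) [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1)); ...
                 200*(x(2) - x(1)^2)];
    n = 2;  x = [-2; 2];  g = grad(x);  d = -g;
    for k = 1:1000
        if norm(g) < 1e-6, break; end
        a = fminbnd(@(t) fun(x + t*d), 0, 1);   % numerical line search
        x = x + a*d;
        gnew = grad(x);
        if mod(k, n) == 0
            d = -gnew;                          % periodic restart
        else
            beta = (gnew'*(gnew - g)) / (g'*g); % Polak-Ribiere
            d = -gnew + beta*d;
        end
        g = gnew;
    end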
EXERCISES

10.1 (Adapted from [88, Exercise 9.8(1)]) Let $Q$ be a real symmetric positive definite $n \times n$ matrix. Given an arbitrary set of linearly independent vectors $\{p^{(0)}, \ldots, p^{(n-1)}\}$ in $\mathbb{R}^n$, the Gram-Schmidt procedure generates a set of vectors $\{d^{(0)}, \ldots, d^{(n-1)}\}$ as follows:

$$d^{(0)} = p^{(0)},$$
$$d^{(k+1)} = p^{(k+1)} - \sum_{i=0}^k \frac{p^{(k+1)T} Q d^{(i)}}{d^{(i)T} Q d^{(i)}} d^{(i)}.$$

Show that the vectors $d^{(0)}, \ldots, d^{(n-1)}$ are $Q$-conjugate.

10.2 Let $f : \mathbb{R}^n \to \mathbb{R}$ be the quadratic function

$$f(x) = \frac{1}{2} x^T Q x - x^T b,$$

where $Q = Q^T > 0$. Given a set of directions $\{d^{(0)}, d^{(1)}, \ldots\} \subset \mathbb{R}^n$, consider the algorithm

$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)},$$

where $\alpha_k$ is the step size. Suppose that $g^{(k+1)T} d^{(i)} = 0$ for all $k = 0, \ldots, n-1$ and $i = 0, \ldots, k$, where $g^{(k+1)} = \nabla f(x^{(k+1)})$. Show that if $g^{(k)T} d^{(k)} \neq 0$ for all $k = 0, \ldots, n-1$, then $d^{(0)}, \ldots, d^{(n-1)}$ are $Q$-conjugate.

10.3 Let $f : \mathbb{R}^n \to \mathbb{R}$ be given by $f(x) = \frac{1}{2} x^T Q x - x^T b$, where $b \in \mathbb{R}^n$ and $Q$ is a real symmetric positive definite $n \times n$ matrix. Show that in the conjugate gradient method for this $f$, $d^{(k)T} Q d^{(k)} = -d^{(k)T} Q g^{(k)}$.

10.4 Let $Q$ be a real $n \times n$ symmetric matrix.
a. Show that there exists a $Q$-conjugate set $\{d^{(1)}, \ldots, d^{(n)}\}$ such that each $d^{(i)}$ ($i = 1, \ldots, n$) is an eigenvector of $Q$. Hint: Use the fact that for any real symmetric $n \times n$ matrix, there exists a set $\{v_1, \ldots, v_n\}$ of its eigenvectors such that $v_i^T v_j = 0$ for all $i, j = 1, \ldots, n$, $i \neq j$.
b. Suppose that $Q$ is positive definite. Show that if $\{d^{(1)}, \ldots, d^{(n)}\}$ is a $Q$-conjugate set that is also orthogonal (i.e., $d^{(i)T} d^{(j)} = 0$ for all $i, j = 1, \ldots, n$, $i \neq j$), and $d^{(i)} \neq 0$, $i = 1, \ldots, n$, then each $d^{(i)}$, $i = 1, \ldots, n$, is an eigenvector of $Q$.

10.5 Consider the following algorithm for minimizing a function $f$:

$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)},$$

where $\alpha_k = \arg\min_\alpha f(x^{(k)} + \alpha d^{(k)})$. Let $g^{(k)} = \nabla f(x^{(k)})$ (as usual). Suppose that $f$ is quadratic with Hessian $Q$. We choose $d^{(k+1)} = \gamma_k g^{(k+1)} + d^{(k)}$, and we wish the directions $d^{(k)}$ and $d^{(k+1)}$ to be $Q$-conjugate. Find a formula for $\gamma_k$ in terms of $d^{(k)}$, $g^{(k+1)}$, and $Q$.

10.6 Consider the algorithm

$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)},$$

with $\alpha_k \in \mathbb{R}$ scalar and $x^{(0)} = 0$, applied to the quadratic function $f : \mathbb{R}^n \to \mathbb{R}$ given by

$$f(x) = \frac{1}{2} x^T Q x - b^T x,$$

where $Q > 0$. As usual, write $g^{(k)} = \nabla f(x^{(k)})$. Suppose that the search directions are generated according to

$$d^{(k+1)} = a_k g^{(k+1)} + b_k d^{(k)},$$

where $a_k$ and $b_k$ are real constants, and by convention we take $d^{(-1)} = 0$.
a. Define the subspace $\mathcal{V}_k = \operatorname{span}[b, Qb, \ldots, Q^{k-1} b]$ (called the Krylov subspace of order $k$). Show that $d^{(k)} \in \mathcal{V}_{k+1}$ and $x^{(k)} \in \mathcal{V}_k$. Hint: Use induction. Note that $\mathcal{V}_0 = \{0\}$ and $\mathcal{V}_1 = \operatorname{span}[b]$.
b. In light of part a, what can you say about the "optimality" of the conjugate gradient algorithm with respect to the Krylov subspace?

10.7 Consider the quadratic function $f : \mathbb{R}^n \to \mathbb{R}$ given by

$$f(x) = \frac{1}{2} x^T Q x - x^T b,$$

where $Q = Q^T > 0$. Let $D \in \mathbb{R}^{n \times r}$ be of rank $r$ and $x_0 \in \mathbb{R}^n$. Define the function $\phi : \mathbb{R}^r \to \mathbb{R}$ by $\phi(a) = f(x_0 + Da)$. Show that $\phi$ is a quadratic function with a positive definite quadratic term.

10.8 Consider a conjugate gradient algorithm applied to a quadratic function.
a. Show that the gradients associated with the algorithm are mutually orthogonal. Specifically, show that $g^{(k+1)T} g^{(i)} = 0$ for all $0 \leq k \leq n-1$ and $0 \leq i \leq k$. Hint: Write $g^{(i)}$ in terms of $d^{(i)}$ and $d^{(i-1)}$.
b. Show that the gradients associated with the algorithm are $Q$-conjugate if separated by at least two iterations. Specifically, show that $g^{(k+1)T} Q g^{(i)} = 0$ for all $0 \leq k \leq n-1$ and $0 \leq i \leq k-1$.
10.9 Represent the function

$$f(x_1, x_2) = \frac{1}{2} x_1^2 + x_2^2 - 3 x_1 x_2 - x_2 - 7$$

in the form $f(x) = \frac{1}{2} x^T Q x - x^T b + c$. Then use the conjugate gradient algorithm to construct a vector $d^{(1)}$ that is $Q$-conjugate with $d^{(0)} = \nabla f(x^{(0)})$, where $x^{(0)} = 0$.

10.10 Let $f(x)$, $x = [x_1, x_2]^T \in \mathbb{R}^2$, be given by

$$f(x) = \frac{5}{2} x_1^2 + \frac{1}{2} x_2^2 + 2 x_1 x_2 - 3 x_1 - x_2.$$

a. Express $f(x)$ in the form $f(x) = \frac{1}{2} x^T Q x - x^T b$.
b. Find the minimizer of $f$ using the conjugate gradient algorithm. Use a starting point of $x^{(0)} = [0, 0]^T$.
c. Calculate the minimizer of $f$ analytically from $Q$ and $b$, and check it with your answer in part b.

10.11 Write a MATLAB program to implement the conjugate gradient algorithm for general functions. Use the secant method for the line search (e.g., the MATLAB function of Exercise 7.11). Test the different formulas for $\beta_k$ on Rosenbrock's function (see Exercise 9.4) with an initial condition $x^{(0)} = [-2, 2]^T$. For this exercise, reinitialize the update direction to the negative gradient every six iterations.

CHAPTER 11

QUASI-NEWTON METHODS

11.1 Introduction

Newton's method is one of the more successful algorithms for optimization. If it converges, it has a quadratic order of convergence. However, as pointed out before, for a general nonlinear objective function, convergence to a solution cannot be guaranteed from an arbitrary initial point $x^{(0)}$. In general, if the initial point is not sufficiently close to the solution, then the algorithm may not possess the descent property [i.e., $f(x^{(k+1)}) \not< f(x^{(k)})$ for some $k$].

Recall that the idea behind Newton's method is to locally approximate the function $f$ being minimized, at every iteration, by a quadratic function. The minimizer for the quadratic approximation is used as the starting point for the next iteration. This leads to Newton's recursive algorithm

$$x^{(k+1)} = x^{(k)} - F(x^{(k)})^{-1} g^{(k)}.$$

We may try to guarantee that the algorithm has the descent property by modifying the original algorithm as follows:

$$x^{(k+1)} = x^{(k)} - \alpha_k F(x^{(k)})^{-1} g^{(k)},$$

where $\alpha_k$ is chosen to ensure that

$$f(x^{(k+1)}) < f(x^{(k)}).$$

For example, we may choose $\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} - \alpha F(x^{(k)})^{-1} g^{(k)})$ (see Theorem 9.2). We can then determine an appropriate value of $\alpha_k$ by performing a line search in the direction $-F(x^{(k)})^{-1} g^{(k)}$. Note that although the line search is simply the minimization of the real-variable function $\phi_k(\alpha) = f(x^{(k)} - \alpha F(x^{(k)})^{-1} g^{(k)})$, it is not a trivial problem to solve.

A computational drawback of Newton's method is the need to evaluate $F(x^{(k)})$ and solve the equation $F(x^{(k)}) d^{(k)} = -g^{(k)}$ [i.e., compute $d^{(k)} = -F(x^{(k)})^{-1} g^{(k)}$]. To avoid the computation of $F(x^{(k)})^{-1}$, the quasi-Newton methods use an approximation to $F(x^{(k)})^{-1}$ in place of the true inverse. This approximation is updated at every stage so that it exhibits at least some properties of $F(x^{(k)})^{-1}$. To get some idea about the properties that an approximation to $F(x^{(k)})^{-1}$ should satisfy, consider the formula

$$x^{(k+1)} = x^{(k)} - \alpha H_k g^{(k)},$$

where $H_k$ is an $n \times n$ real matrix and $\alpha$ is a positive search parameter. Expanding $f$ about $x^{(k)}$ yields

$$f(x^{(k+1)}) = f(x^{(k)}) + g^{(k)T} (x^{(k+1)} - x^{(k)}) + o(\|x^{(k+1)} - x^{(k)}\|) = f(x^{(k)}) - \alpha g^{(k)T} H_k g^{(k)} + o(\|H_k g^{(k)}\| \alpha).$$

As $\alpha$ tends to zero, the second term on the right-hand side of this equation dominates the third. Thus, to guarantee a decrease in $f$ for small $\alpha$, we have to have

$$g^{(k)T} H_k g^{(k)} > 0.$$

A simple way to ensure this is to require that $H_k$ be positive definite. We have proved the following result.
Proposition 11.1 Let $f \in \mathcal{C}^1$, $x^{(k)} \in \mathbb{R}^n$, $g^{(k)} = \nabla f(x^{(k)}) \neq 0$, and $H_k$ an $n \times n$ real symmetric positive definite matrix. If we set $x^{(k+1)} = x^{(k)} - \alpha_k H_k g^{(k)}$, where $\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} - \alpha H_k g^{(k)})$, then $\alpha_k > 0$ and $f(x^{(k+1)}) < f(x^{(k)})$. □

In constructing an approximation to the inverse of the Hessian matrix, we should use only the objective function and gradient values. Thus, if we can find a suitable method of choosing $H_k$, the iteration may be carried out without any evaluation of the Hessian and without the solution of any set of linear equations.

11.2 Approximating the Inverse Hessian

Let $H_0, H_1, H_2, \ldots$ be successive approximations of the inverse $F(x^{(k)})^{-1}$ of the Hessian. We now derive a condition that the approximations should satisfy, which forms the starting point for our subsequent discussion of quasi-Newton algorithms. To begin, suppose first that the Hessian matrix $F(x)$ of the objective function $f$ is constant and independent of $x$. In other words, the objective function is quadratic, with Hessian $F(x) = Q$ for all $x$, where $Q = Q^T$. Then,

$$g^{(k+1)} - g^{(k)} = Q (x^{(k+1)} - x^{(k)}).$$

Let

$$\Delta g^{(k)} = g^{(k+1)} - g^{(k)} \quad \text{and} \quad \Delta x^{(k)} = x^{(k+1)} - x^{(k)}.$$

Then, we may write

$$\Delta g^{(k)} = Q \Delta x^{(k)}.$$

We start with a real symmetric positive definite matrix $H_0$. Note that given $k$, the matrix $Q^{-1}$ satisfies

$$Q^{-1} \Delta g^{(i)} = \Delta x^{(i)}, \qquad 0 \leq i \leq k.$$

Therefore, we also impose the requirement that the approximation $H_{k+1}$ of the inverse Hessian satisfy

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \qquad 0 \leq i \leq k.$$

If $n$ steps are involved, then moving in $n$ directions $\Delta x^{(0)}, \Delta x^{(1)}, \ldots, \Delta x^{(n-1)}$ yields

$$H_n \Delta g^{(0)} = \Delta x^{(0)}, \quad H_n \Delta g^{(1)} = \Delta x^{(1)}, \quad \ldots, \quad H_n \Delta g^{(n-1)} = \Delta x^{(n-1)}.$$

This set of equations can be represented as

$$H_n [\Delta g^{(0)}, \Delta g^{(1)}, \ldots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \Delta x^{(1)}, \ldots, \Delta x^{(n-1)}].$$

Note that $Q$ satisfies

$$Q [\Delta x^{(0)}, \Delta x^{(1)}, \ldots, \Delta x^{(n-1)}] = [\Delta g^{(0)}, \Delta g^{(1)}, \ldots, \Delta g^{(n-1)}]$$

and

$$Q^{-1} [\Delta g^{(0)}, \Delta g^{(1)}, \ldots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \Delta x^{(1)}, \ldots, \Delta x^{(n-1)}].$$

Therefore, if $[\Delta g^{(0)}, \Delta g^{(1)}, \ldots, \Delta g^{(n-1)}]$ is nonsingular, then $Q^{-1}$ is determined uniquely after $n$ steps, via

$$Q^{-1} = H_n = [\Delta x^{(0)}, \Delta x^{(1)}, \ldots, \Delta x^{(n-1)}][\Delta g^{(0)}, \Delta g^{(1)}, \ldots, \Delta g^{(n-1)}]^{-1}.$$

As a consequence, we conclude that if $H_n$ satisfies the equations $H_n \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq n - 1$, then the algorithm $x^{(k+1)} = x^{(k)} - \alpha_k H_k g^{(k)}$, $\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} - \alpha H_k g^{(k)})$, is guaranteed to solve problems with quadratic objective functions in $n + 1$ steps, because the update $x^{(n+1)} = x^{(n)} - \alpha_n H_n g^{(n)}$ is equivalent to Newton's algorithm. In fact, as we shall see below (Theorem 11.1), such algorithms solve quadratic problems of $n$ variables in at most $n$ steps.

The considerations above illustrate the basic idea behind the quasi-Newton methods. Specifically, quasi-Newton algorithms have the form

$$d^{(k)} = -H_k g^{(k)},$$
$$\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} + \alpha d^{(k)}),$$
$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)},$$

where the matrices $H_0, H_1, \ldots$ are symmetric. In the quadratic case these matrices are required to satisfy

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \qquad 0 \leq i \leq k,$$

where $\Delta x^{(i)} = x^{(i+1)} - x^{(i)} = \alpha_i d^{(i)}$ and $\Delta g^{(i)} = g^{(i+1)} - g^{(i)} = Q \Delta x^{(i)}$.
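The identity $Q^{-1} = [\Delta x^{(0)}, \ldots][\Delta g^{(0)}, \ldots]^{-1}$ is easy to confirm numerically. In the MATLAB sketch below, the matrix $Q$ and the iterates are arbitrary illustrative choices.

    % Check: inv(Q) equals [dx ...]*inv([dg ...]) for a quadratic.
    Q  = [3 0 1; 0 4 2; 1 2 3];                    % assumed Hessian
    X  = [zeros(3,1), [1;0;0], [1;1;0], [0;1;1]];  % four arbitrary iterates
    dX = diff(X, 1, 2);                 % columns dx(0), dx(1), dx(2)
    dG = Q*dX;                          % dg(i) = Q*dx(i)
    disp(norm(dX/dG - inv(Q)))          % ~0: dX*inv(dG) recovers inv(Q)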
It turns out that quasi-Newton methods are also conjugate direction methods, as stated in the following theorem.

Theorem 11.1 Consider a quasi-Newton algorithm applied to a quadratic function with Hessian $Q = Q^T$ such that for $0 \leq k \leq n - 1$,

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \qquad 0 \leq i \leq k,$$

where $H_{k+1} = H_{k+1}^T$. If $\alpha_i \neq 0$, $0 \leq i \leq k$, then $d^{(0)}, \ldots, d^{(k+1)}$ are $Q$-conjugate. □

Proof. We proceed by induction. We begin with the $k = 0$ case: that $d^{(0)}$ and $d^{(1)}$ are $Q$-conjugate. Because $\alpha_0 \neq 0$, we can write $d^{(0)} = \Delta x^{(0)}/\alpha_0$. Hence,

$$d^{(1)T} Q d^{(0)} = -g^{(1)T} H_1 Q \frac{\Delta x^{(0)}}{\alpha_0} = -g^{(1)T} \frac{H_1 \Delta g^{(0)}}{\alpha_0} = -g^{(1)T} \frac{\Delta x^{(0)}}{\alpha_0} = -g^{(1)T} d^{(0)}.$$

But $g^{(1)T} d^{(0)} = 0$ as a consequence of $\alpha_0 > 0$ being the minimizer of $\phi(\alpha) = f(x^{(0)} + \alpha d^{(0)})$ (see Exercise 11.1). Hence, $d^{(1)T} Q d^{(0)} = 0$.

Assume that the result is true for $k - 1$ (where $k \leq n - 1$). We now prove the result for $k$, that is, that $d^{(0)}, \ldots, d^{(k+1)}$ are $Q$-conjugate. It suffices to show that $d^{(k+1)T} Q d^{(i)} = 0$, $0 \leq i \leq k$. Given $i$, $0 \leq i \leq k$, using the same algebraic steps as in the $k = 0$ case, and using the assumption that $\alpha_i \neq 0$, we obtain

$$d^{(k+1)T} Q d^{(i)} = -g^{(k+1)T} H_{k+1} Q d^{(i)} = -g^{(k+1)T} d^{(i)}.$$

Because $d^{(0)}, \ldots, d^{(k)}$ are $Q$-conjugate by assumption, we conclude from Lemma 10.2 that $g^{(k+1)T} d^{(i)} = 0$. Hence, $d^{(k+1)T} Q d^{(i)} = 0$, which completes the proof. ■

By Theorem 11.1 we conclude that a quasi-Newton algorithm solves a quadratic of $n$ variables in at most $n$ steps.

Note that the equations that the matrices $H_k$ are required to satisfy do not determine those matrices uniquely. Thus, we have some freedom in the way we compute the $H_k$. In the methods we describe, we compute $H_{k+1}$ by adding a correction to $H_k$. In the following sections we consider three specific updating formulas.

11.3 The Rank One Correction Formula

In the rank one correction formula, the correction term is symmetric and has the form $a_k z^{(k)} z^{(k)T}$, where $a_k \in \mathbb{R}$ and $z^{(k)} \in \mathbb{R}^n$. Therefore, the update equation is

$$H_{k+1} = H_k + a_k z^{(k)} z^{(k)T}.$$

Note that

$$\operatorname{rank} z^{(k)} z^{(k)T} = 1,$$

and hence the name rank one correction [it is also called the single-rank symmetric (SRS) algorithm]. The product $z^{(k)} z^{(k)T}$ is sometimes referred to as the dyadic product or outer product. Observe that if $H_k$ is symmetric, then so is $H_{k+1}$.

Our goal now is to determine $a_k$ and $z^{(k)}$, given $H_k$, $\Delta x^{(k)}$, and $\Delta g^{(k)}$, so that the required relationship discussed in Section 11.2 is satisfied; namely, $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$, $i = 0, 1, \ldots, k$. To begin, let us first consider the condition $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. In other words, given $H_k$, $\Delta g^{(k)}$, and $\Delta x^{(k)}$, we wish to find $a_k$ and $z^{(k)}$ to ensure that

$$H_{k+1} \Delta g^{(k)} = (H_k + a_k z^{(k)} z^{(k)T}) \Delta g^{(k)} = \Delta x^{(k)}.$$

First note that $z^{(k)T} \Delta g^{(k)}$ is a scalar. Thus,

$$\Delta x^{(k)} - H_k \Delta g^{(k)} = (a_k z^{(k)T} \Delta g^{(k)}) z^{(k)},$$

and hence

$$z^{(k)} = \frac{\Delta x^{(k)} - H_k \Delta g^{(k)}}{a_k (z^{(k)T} \Delta g^{(k)})}.$$

We can now determine

$$a_k z^{(k)} z^{(k)T} = \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{a_k (z^{(k)T} \Delta g^{(k)})^2}.$$

Hence,

$$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{a_k (z^{(k)T} \Delta g^{(k)})^2}.$$

The next step is to express the denominator of the second term on the right-hand side of the equation above as a function of the given quantities $H_k$, $\Delta g^{(k)}$, and $\Delta x^{(k)}$. To accomplish this, premultiply $\Delta x^{(k)} - H_k \Delta g^{(k)} = (a_k z^{(k)T} \Delta g^{(k)}) z^{(k)}$ by $\Delta g^{(k)T}$ to obtain

$$\Delta g^{(k)T} \Delta x^{(k)} - \Delta g^{(k)T} H_k \Delta g^{(k)} = \Delta g^{(k)T} a_k z^{(k)} z^{(k)T} \Delta g^{(k)}.$$

Observe that $a_k$ is a scalar and so is $\Delta g^{(k)T} z^{(k)} = z^{(k)T} \Delta g^{(k)}$. Thus,

$$\Delta g^{(k)T} \Delta x^{(k)} - \Delta g^{(k)T} H_k \Delta g^{(k)} = a_k (z^{(k)T} \Delta g^{(k)})^2.$$

Taking this relation into account yields

$$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)})}.$$

We summarize the above development in the following algorithm.

Rank One Algorithm

1. Set $k := 0$; select $x^{(0)}$ and a real symmetric positive definite $H_0$.
2. If $g^{(k)} = 0$, stop; else, $d^{(k)} = -H_k g^{(k)}$.
3. Compute

$$\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} + \alpha d^{(k)}),$$
$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}.$$

4. Compute

$$\Delta x^{(k)} = \alpha_k d^{(k)},$$
$$\Delta g^{(k)} = g^{(k+1)} - g^{(k)},$$
$$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)})}.$$

5. Set $k := k + 1$; go to step 2.
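The MATLAB sketch below runs the rank one algorithm on a small quadratic; the quadratic, the starting point, and the guard against a near-zero denominator (a pitfall discussed later in this section) are illustrative choices.

    % Rank one algorithm on a quadratic (exact line search in closed form).
    Q = [2 0; 0 1];  b = [1; -1];
    x = [0; 0];  H = eye(2);  g = Q*x - b;
    for k = 1:2
        d = -H*g;
        a = -(g'*d) / (d'*Q*d);          % step 3: exact step size
        xnew = x + a*d;  gnew = Q*xnew - b;
        dx = xnew - x;   dg = gnew - g;
        u = dx - H*dg;                   % step 4: rank one correction
        if abs(dg'*u) > 1e-10            % skip update if denominator ~ 0
            H = H + (u*u') / (dg'*u);
        end
        x = xnew;  g = gnew;
    end
    disp(x)                              % [0.5; -1] = Q\b after two steps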
The rank one algorithm is based on satisfying the equation $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. However, what we want is

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \qquad i = 0, 1, \ldots, k.$$

It turns out that the above is, in fact, true automatically, as stated in the following theorem.

Theorem 11.2 For the rank one algorithm applied to the quadratic with Hessian $Q = Q^T$, we have $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq k$. □

Proof. We prove the result by induction. From the discussion before the theorem, it is clear that the claim is true for $k = 0$. Suppose now that the theorem is true for $k - 1 \geq 0$; that is, $H_k \Delta g^{(i)} = \Delta x^{(i)}$, $i < k$. We now show that the theorem is true for $k$. Our construction of the correction term ensures that $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. So we only have to show that

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \qquad i < k.$$

To this end, fix $i < k$. We have

$$H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)})} \Delta g^{(i)}.$$

By the induction hypothesis, $H_k \Delta g^{(i)} = \Delta x^{(i)}$. To complete the proof, it is enough to show that the second term on the right-hand side of the equation above is equal to zero. For this to be true it is enough that

$$(\Delta x^{(k)} - H_k \Delta g^{(k)})^T \Delta g^{(i)} = \Delta x^{(k)T} \Delta g^{(i)} - \Delta g^{(k)T} H_k \Delta g^{(i)} = 0.$$

Indeed, since

$$\Delta g^{(k)T} H_k \Delta g^{(i)} = \Delta g^{(k)T} (H_k \Delta g^{(i)}) = \Delta g^{(k)T} \Delta x^{(i)}$$

by the induction hypothesis, and because $\Delta g^{(k)} = Q \Delta x^{(k)}$, we have

$$\Delta g^{(k)T} H_k \Delta g^{(i)} = \Delta g^{(k)T} \Delta x^{(i)} = \Delta x^{(k)T} Q \Delta x^{(i)} = \Delta x^{(k)T} \Delta g^{(i)}.$$

Hence,

$$(\Delta x^{(k)} - H_k \Delta g^{(k)})^T \Delta g^{(i)} = \Delta x^{(k)T} \Delta g^{(i)} - \Delta x^{(k)T} \Delta g^{(i)} = 0,$$

which completes the proof. ■

Example 11.1 Let

$$f(x_1, x_2) = x_1^2 + \frac{1}{2} x_2^2 + 3.$$

Apply the rank one correction algorithm to minimize $f$. Use $x^{(0)} = [1, 2]^T$ and $H_0 = I_2$ (the $2 \times 2$ identity matrix).

We can represent $f$ as

$$f(x) = \frac{1}{2} x^T \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} x + 3.$$

Thus,

$$g^{(k)} = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} x^{(k)}.$$

Because $H_0 = I_2$,

$$d^{(0)} = -g^{(0)} = [-2, -2]^T.$$

The objective function is quadratic, and hence

$$\alpha_0 = \arg\min_{\alpha \geq 0} f(x^{(0)} + \alpha d^{(0)}) = -\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}} = \frac{8}{12} = \frac{2}{3},$$

and thus

$$x^{(1)} = x^{(0)} + \alpha_0 d^{(0)} = \left[-\tfrac{1}{3}, \tfrac{2}{3}\right]^T.$$

We then compute

$$\Delta x^{(0)} = \alpha_0 d^{(0)} = \left[-\tfrac{4}{3}, -\tfrac{4}{3}\right]^T, \qquad \Delta g^{(0)} = g^{(1)} - g^{(0)} = \left[-\tfrac{8}{3}, -\tfrac{4}{3}\right]^T.$$

Because

$$\Delta g^{(0)T} (\Delta x^{(0)} - H_0 \Delta g^{(0)}) = -\frac{32}{9},$$

we obtain

$$H_1 = H_0 + \frac{(\Delta x^{(0)} - H_0 \Delta g^{(0)})(\Delta x^{(0)} - H_0 \Delta g^{(0)})^T}{\Delta g^{(0)T} (\Delta x^{(0)} - H_0 \Delta g^{(0)})} = \begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & 1 \end{bmatrix}.$$

Therefore,

$$d^{(1)} = -H_1 g^{(1)} = \left[\tfrac{1}{3}, -\tfrac{2}{3}\right]^T$$

and

$$\alpha_1 = -\frac{g^{(1)T} d^{(1)}}{d^{(1)T} Q d^{(1)}} = 1.$$

We now compute

$$x^{(2)} = x^{(1)} + \alpha_1 d^{(1)} = [0, 0]^T.$$

Note that $g^{(2)} = 0$, and therefore $x^{(2)} = x^*$. As expected, the algorithm solves the problem in two steps. Note that the directions $d^{(0)}$ and $d^{(1)}$ are $Q$-conjugate, in accordance with Theorem 11.1. ■

Unfortunately, the rank one correction algorithm is not very satisfactory, for several reasons. First, the matrix $H_{k+1}$ that the rank one algorithm generates may not be positive definite (see Example 11.2 below), and thus $d^{(k+1)}$ may not be a descent direction. This happens even in the quadratic case (see Exercise 11.10). Furthermore, if $\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)})$ is close to zero, then there may be numerical problems in evaluating $H_{k+1}$.

Example 11.2 Assume that $H_k > 0$. It turns out that if $\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)}) > 0$, then $H_{k+1} > 0$ (see Exercise 11.7). However, if $\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)}) < 0$, then $H_{k+1}$ may not be positive definite. As an example of what might happen in this case, consider applying the rank one algorithm to the function

$$f(x) = \frac{x_1^4}{4} + \frac{x_2^2}{2} - x_1 x_2 + x_1 - x_2,$$

with an initial point

$$x^{(0)} = [0.59607, 0.59607]^T$$

and initial matrix

$$H_0 = \begin{bmatrix} 0.94913 & 0.14318 \\ 0.14318 & 0.59702 \end{bmatrix}.$$

Note that $H_0 > 0$. We have

$$\Delta g^{(0)T} (\Delta x^{(0)} - H_0 \Delta g^{(0)}) = -0.03276$$

and

$$H_1 = \begin{bmatrix} 0.94481 & 0.23324 \\ 0.23324 & -1.2788 \end{bmatrix}.$$

It is easy to check that $H_1$ is not positive definite (it is indefinite, with eigenvalues 0.96901 and $-1.3030$). ■

Fortunately, alternative algorithms have been developed for updating $H_k$. In particular, if we use a "rank two" update, then $H_k$ is guaranteed to be positive definite for all $k$, provided that the line search is exact. We discuss this in the next section.
11.4 The DFP Algorithm

The rank two update was originally developed by Davidon in 1959 and was subsequently modified by Fletcher and Powell in 1963: hence the name DFP algorithm. The DFP algorithm is also known as the variable metric algorithm. We summarize the algorithm below.

DFP Algorithm

1. Set $k := 0$; select $x^{(0)}$ and a real symmetric positive definite $H_0$.
2. If $g^{(k)} = 0$, stop; else, $d^{(k)} = -H_k g^{(k)}$.
3. Compute

$$\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} + \alpha d^{(k)}),$$
$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}.$$

4. Compute

$$\Delta x^{(k)} = \alpha_k d^{(k)},$$
$$\Delta g^{(k)} = g^{(k+1)} - g^{(k)},$$
$$H_{k+1} = H_k + \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{[H_k \Delta g^{(k)}][H_k \Delta g^{(k)}]^T}{\Delta g^{(k)T} H_k \Delta g^{(k)}}.$$

5. Set $k := k + 1$; go to step 2.

We now show that the DFP algorithm is a quasi-Newton method, in the sense that when applied to quadratic problems, we have $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq k$.

Theorem 11.3 In the DFP algorithm applied to the quadratic with Hessian $Q = Q^T$, we have $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq k$. □

Proof. We use induction. For $k = 0$, we have

$$H_1 \Delta g^{(0)} = H_0 \Delta g^{(0)} + \frac{\Delta x^{(0)} (\Delta x^{(0)T} \Delta g^{(0)})}{\Delta x^{(0)T} \Delta g^{(0)}} - \frac{H_0 \Delta g^{(0)} (\Delta g^{(0)T} H_0 \Delta g^{(0)})}{\Delta g^{(0)T} H_0 \Delta g^{(0)}} = \Delta x^{(0)}.$$

Assume that the result is true for $k - 1$; that is, $H_k \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq k - 1$. We now show that $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq k$. First, consider $i = k$. We have

$$H_{k+1} \Delta g^{(k)} = H_k \Delta g^{(k)} + \frac{\Delta x^{(k)} (\Delta x^{(k)T} \Delta g^{(k)})}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{H_k \Delta g^{(k)} (\Delta g^{(k)T} H_k \Delta g^{(k)})}{\Delta g^{(k)T} H_k \Delta g^{(k)}} = \Delta x^{(k)}.$$

It remains to consider the case $i < k$. To this end,

$$H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} + \frac{\Delta x^{(k)} (\Delta x^{(k)T} \Delta g^{(i)})}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{H_k \Delta g^{(k)} (\Delta g^{(k)T} H_k \Delta g^{(i)})}{\Delta g^{(k)T} H_k \Delta g^{(k)}}.$$

Now,

$$\Delta x^{(k)T} \Delta g^{(i)} = \Delta x^{(k)T} Q \Delta x^{(i)} = \alpha_k \alpha_i d^{(k)T} Q d^{(i)} = 0,$$

by the induction hypothesis and Theorem 11.1. The same arguments yield $\Delta g^{(k)T} H_k \Delta g^{(i)} = \Delta g^{(k)T} \Delta x^{(i)} = 0$. Hence, $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$, which completes the proof. ■

By Theorems 11.1 and 11.3 we conclude that the DFP algorithm is a conjugate direction algorithm.

Example 11.3 Locate the minimizer of

$$f(x) = \frac{1}{2} x^T \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix} x - x^T \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \qquad x \in \mathbb{R}^2.$$

Use the initial point $x^{(0)} = [0, 0]^T$ and $H_0 = I_2$. Note that in this case

$$g^{(k)} = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix} x^{(k)} - \begin{bmatrix} -1 \\ 1 \end{bmatrix}.$$

Hence,

$$g^{(0)} = [1, -1]^T, \qquad d^{(0)} = -H_0 g^{(0)} = [-1, 1]^T.$$

Because $f$ is a quadratic function,

$$\alpha_0 = \arg\min_{\alpha \geq 0} f(x^{(0)} + \alpha d^{(0)}) = -\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}} = \frac{2}{2} = 1.$$

Therefore,

$$x^{(1)} = x^{(0)} + \alpha_0 d^{(0)} = [-1, 1]^T.$$

We then compute

$$\Delta x^{(0)} = x^{(1)} - x^{(0)} = [-1, 1]^T, \qquad g^{(1)} = Q x^{(1)} - b = [-1, -1]^T,$$

and

$$\Delta g^{(0)} = g^{(1)} - g^{(0)} = [-2, 0]^T.$$

Observe that

$$\Delta x^{(0)T} \Delta g^{(0)} = 2, \qquad \Delta x^{(0)} \Delta x^{(0)T} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix},$$
$$(H_0 \Delta g^{(0)})(H_0 \Delta g^{(0)})^T = \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \Delta g^{(0)T} H_0 \Delta g^{(0)} = 4.$$

Using the above, we now compute

$$H_1 = H_0 + \frac{\Delta x^{(0)} \Delta x^{(0)T}}{\Delta x^{(0)T} \Delta g^{(0)}} - \frac{(H_0 \Delta g^{(0)})(H_0 \Delta g^{(0)})^T}{\Delta g^{(0)T} H_0 \Delta g^{(0)}} = \begin{bmatrix} \tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{3}{2} \end{bmatrix}.$$

We now compute $d^{(1)} = -H_1 g^{(1)} = [0, 1]^T$ and

$$\alpha_1 = \arg\min_{\alpha \geq 0} f(x^{(1)} + \alpha d^{(1)}) = -\frac{g^{(1)T} d^{(1)}}{d^{(1)T} Q d^{(1)}} = \frac{1}{2}.$$

Hence,

$$x^{(2)} = x^{(1)} + \alpha_1 d^{(1)} = [-1, 3/2]^T = x^*,$$

because $f$ is a quadratic function of two variables. Note that we have $d^{(0)T} Q d^{(1)} = d^{(1)T} Q d^{(0)} = 0$; that is, $d^{(0)}$ and $d^{(1)}$ are $Q$-conjugate directions. ■
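The following MATLAB sketch replays Example 11.3 with the DFP update; it is a direct transcription of steps 1 to 5 for the quadratic case.

    % DFP algorithm on the quadratic of Example 11.3.
    Q = [4 2; 2 2];  b = [-1; 1];
    x = [0; 0];  H = eye(2);  g = Q*x - b;
    for k = 1:2
        d = -H*g;
        a = -(g'*d) / (d'*Q*d);              % exact line search
        dx = a*d;  x = x + dx;
        gnew = Q*x - b;  dg = gnew - g;
        H = H + (dx*dx') / (dx'*dg) ...      % DFP rank two update
              - (H*dg)*(H*dg)' / (dg'*H*dg);
        g = gnew;
    end
    disp(x)                                  % [-1; 1.5] = Q\b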
We now show that in the DFP algorithm, $H_{k+1}$ inherits positive definiteness from $H_k$.

Theorem 11.4 Suppose that $g^{(k)} \neq 0$. In the DFP algorithm, if $H_k$ is positive definite, then so is $H_{k+1}$. □

Proof. We first write the following quadratic form:

$$x^T H_{k+1} x = x^T H_k x + \frac{(x^T \Delta x^{(k)})^2}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{(x^T H_k \Delta g^{(k)})^2}{\Delta g^{(k)T} H_k \Delta g^{(k)}}.$$

Define

$$a = H_k^{1/2} x, \qquad b = H_k^{1/2} \Delta g^{(k)},$$

where $H_k^{1/2}$ denotes the square root of $H_k$. Note that because $H_k > 0$, its square root is well-defined; see Section 3.4 for more information on this property of positive definite matrices. Using the definitions of $a$ and $b$, we obtain

$$x^T H_k x = x^T H_k^{1/2} H_k^{1/2} x = a^T a,$$
$$x^T H_k \Delta g^{(k)} = x^T H_k^{1/2} H_k^{1/2} \Delta g^{(k)} = a^T b,$$
$$\Delta g^{(k)T} H_k \Delta g^{(k)} = b^T b.$$

Hence,

$$x^T H_{k+1} x = a^T a - \frac{(a^T b)^2}{b^T b} + \frac{(x^T \Delta x^{(k)})^2}{\Delta x^{(k)T} \Delta g^{(k)}} = \frac{\|a\|^2 \|b\|^2 - (a^T b)^2}{\|b\|^2} + \frac{(x^T \Delta x^{(k)})^2}{\Delta x^{(k)T} \Delta g^{(k)}}.$$

We also have

$$\Delta x^{(k)T} \Delta g^{(k)} = \Delta x^{(k)T} (g^{(k+1)} - g^{(k)}) = -\Delta x^{(k)T} g^{(k)},$$

since $\Delta x^{(k)T} g^{(k+1)} = \alpha_k d^{(k)T} g^{(k+1)} = 0$ by Lemma 10.2 (see also Exercise 11.1). Because $\Delta x^{(k)} = \alpha_k d^{(k)} = -\alpha_k H_k g^{(k)}$, we have

$$\Delta x^{(k)T} \Delta g^{(k)} = -\Delta x^{(k)T} g^{(k)} = \alpha_k g^{(k)T} H_k g^{(k)}.$$

This yields

$$x^T H_{k+1} x = \frac{\|a\|^2 \|b\|^2 - (a^T b)^2}{\|b\|^2} + \frac{(x^T \Delta x^{(k)})^2}{\alpha_k g^{(k)T} H_k g^{(k)}}.$$

Both terms on the right-hand side of the above equation are nonnegative: the first term is nonnegative because of the Cauchy-Schwarz inequality, and the second term is nonnegative because $H_k > 0$ and $\alpha_k > 0$ (by Proposition 11.1). Therefore, to show that $x^T H_{k+1} x > 0$ for $x \neq 0$, we only need to demonstrate that these terms do not both vanish simultaneously. The first term vanishes only if $a$ and $b$ are proportional, that is, if $a = \beta b$ for some scalar $\beta$. Thus, to complete the proof it is enough to show that if $a = \beta b$, then $(x^T \Delta x^{(k)})^2/(\alpha_k g^{(k)T} H_k g^{(k)}) > 0$. Indeed, first observe that

$$H_k^{1/2} x = a = \beta b = \beta H_k^{1/2} \Delta g^{(k)}.$$

Hence,

$$x = \beta \Delta g^{(k)},$$

where $\beta \neq 0$ because $x \neq 0$. Using the expression for $x$ above and the expression $\Delta x^{(k)T} \Delta g^{(k)} = \alpha_k g^{(k)T} H_k g^{(k)}$, we obtain

$$\frac{(x^T \Delta x^{(k)})^2}{\alpha_k g^{(k)T} H_k g^{(k)}} = \frac{\beta^2 (\Delta g^{(k)T} \Delta x^{(k)})^2}{\alpha_k g^{(k)T} H_k g^{(k)}} = \beta^2 \alpha_k g^{(k)T} H_k g^{(k)} > 0.$$

Thus, for all $x \neq 0$, $x^T H_{k+1} x > 0$, which completes the proof. ■

The DFP algorithm is superior to the rank one algorithm in that it preserves the positive definiteness of $H_k$. However, it turns out that in the case of larger nonquadratic problems the algorithm has the tendency of sometimes getting "stuck." This phenomenon is attributed to $H_k$ becoming nearly singular [19]. In the next section we discuss an algorithm that alleviates this problem.

11.5 The BFGS Algorithm

In 1970, an alternative update formula was suggested independently by Broyden, Fletcher, Goldfarb, and Shanno. The method, now called the BFGS algorithm, is discussed in this section.

To derive the BFGS update, we use the concept of duality, or complementarity, as presented in [43] and [88]. To discuss this concept, recall that the updating formulas for the approximation of the inverse of the Hessian matrix were based on satisfying the equations

$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \qquad 0 \leq i \leq k,$$

which were derived from $\Delta g^{(i)} = Q \Delta x^{(i)}$, $0 \leq i \leq k$. We then formulated update formulas for the approximations to the inverse of the Hessian matrix $Q^{-1}$. An alternative to approximating $Q^{-1}$ is to approximate $Q$ itself. To do this, let $B_k$ be our estimate of $Q$ at the $k$th step. We require $B_{k+1}$ to satisfy

$$\Delta g^{(i)} = B_{k+1} \Delta x^{(i)}, \qquad 0 \leq i \leq k.$$

Notice that this set of equations is similar to the previous set of equations for $H_{k+1}$, the only difference being that the roles of $\Delta x^{(i)}$ and $\Delta g^{(i)}$ are interchanged. Thus, given any update formula for $H_k$, a corresponding update formula for $B_k$ can be found by interchanging the roles of $B_k$ and $H_k$ and of $\Delta g^{(k)}$ and $\Delta x^{(k)}$. In particular, the BFGS update for $B_k$ corresponds to the DFP update for $H_k$. Formulas related in this way are said to be dual or complementary [43].

Recall that the DFP update for the approximation $H_k$ of the inverse Hessian is

$$H_{k+1}^{DFP} = H_k + \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{H_k \Delta g^{(k)} \Delta g^{(k)T} H_k}{\Delta g^{(k)T} H_k \Delta g^{(k)}}.$$

Using the complementarity concept, we can easily obtain an update equation for the approximation $B_k$ of the Hessian:

$$B_{k+1} = B_k + \frac{\Delta g^{(k)} \Delta g^{(k)T}}{\Delta g^{(k)T} \Delta x^{(k)}} - \frac{B_k \Delta x^{(k)} \Delta x^{(k)T} B_k}{\Delta x^{(k)T} B_k \Delta x^{(k)}}.$$

This is the BFGS update of $B_k$.
Now, to obtain the BFGS update for the approximation of the inverse Hessian, we take the inverse of $B_{k+1}$ to obtain

$$H_{k+1}^{BFGS} = (B_{k+1})^{-1} = \left( B_k + \frac{\Delta g^{(k)} \Delta g^{(k)T}}{\Delta g^{(k)T} \Delta x^{(k)}} - \frac{B_k \Delta x^{(k)} \Delta x^{(k)T} B_k}{\Delta x^{(k)T} B_k \Delta x^{(k)}} \right)^{-1}.$$

To compute $H_{k+1}^{BFGS}$ by inverting the right-hand side of this equation, we apply the following formula for a matrix inverse, known as the Sherman-Morrison formula (see [63, p. 123] or [53, p. 50]).

Lemma 11.1 Let $A$ be a nonsingular matrix. Let $u$ and $v$ be column vectors such that $1 + v^T A^{-1} u \neq 0$. Then, $A + uv^T$ is nonsingular, and its inverse can be written in terms of $A^{-1}$ using the following formula:

$$(A + uv^T)^{-1} = A^{-1} - \frac{(A^{-1} u)(v^T A^{-1})}{1 + v^T A^{-1} u}.$$ □

Proof. We can prove the result easily by verification. ■

From Lemma 11.1 it follows that if $A^{-1}$ is known, then the inverse of the matrix $A$ augmented by a rank one matrix can be obtained by a modification of the matrix $A^{-1}$. Applying Lemma 11.1 twice to $B_{k+1}$ (see Exercise 11.12) yields

$$H_{k+1}^{BFGS} = H_k + \left( 1 + \frac{\Delta g^{(k)T} H_k \Delta g^{(k)}}{\Delta g^{(k)T} \Delta x^{(k)}} \right) \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{H_k \Delta g^{(k)} \Delta x^{(k)T} + (H_k \Delta g^{(k)} \Delta x^{(k)T})^T}{\Delta g^{(k)T} \Delta x^{(k)}},$$

which represents the BFGS formula for updating $H_k$.

Recall that for the quadratic case the DFP algorithm satisfies $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq k$. Therefore, the BFGS update for $B_k$ satisfies $B_{k+1} \Delta x^{(i)} = \Delta g^{(i)}$, $0 \leq i \leq k$. By construction of the BFGS formula for $H_{k+1}^{BFGS}$, we conclude that $H_{k+1}^{BFGS} \Delta g^{(i)} = \Delta x^{(i)}$, $0 \leq i \leq k$. Hence, the BFGS algorithm enjoys all the properties of quasi-Newton methods, including the conjugate directions property. Moreover, the BFGS algorithm also inherits the positive definiteness property of the DFP algorithm; that is, if $g^{(k)} \neq 0$ and $H_k > 0$, then $H_{k+1}^{BFGS} > 0$.

The BFGS update is reasonably robust when the line searches are sloppy (see [19]). This property allows us to save time in the line search part of the algorithm. The BFGS formula is often far more efficient than the DFP formula (see [107] for further discussion).

We conclude our discussion of the BFGS algorithm with the following numerical example.

Example 11.4 Use the BFGS method to minimize

$$f(x) = \frac{1}{2} x^T Q x - x^T b + \log(\pi), \qquad Q = \begin{bmatrix} 5 & -3 \\ -3 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

Take $H_0 = I_2$ and $x^{(0)} = [0, 0]^T$. Verify that $H_2 = Q^{-1}$.

We have

$$d^{(0)} = -g^{(0)} = -(Q x^{(0)} - b) = b = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

The objective function is a quadratic, and hence we can use the following formula to compute $\alpha_0$:

$$\alpha_0 = -\frac{g^{(0)T} d^{(0)}}{d^{(0)T} Q d^{(0)}} = \frac{1}{2}.$$

Therefore,

$$x^{(1)} = x^{(0)} + \alpha_0 d^{(0)} = \begin{bmatrix} 0 \\ \tfrac{1}{2} \end{bmatrix}.$$

To compute $H_1 = H_1^{BFGS}$, we need the following quantities:

$$g^{(1)} = Q x^{(1)} - b = \begin{bmatrix} -\tfrac{3}{2} \\ 0 \end{bmatrix}, \qquad \Delta x^{(0)} = \begin{bmatrix} 0 \\ \tfrac{1}{2} \end{bmatrix}, \qquad \Delta g^{(0)} = g^{(1)} - g^{(0)} = \begin{bmatrix} -\tfrac{3}{2} \\ 1 \end{bmatrix}.$$

Therefore,

$$H_1 = H_0 + \left( 1 + \frac{\Delta g^{(0)T} H_0 \Delta g^{(0)}}{\Delta g^{(0)T} \Delta x^{(0)}} \right) \frac{\Delta x^{(0)} \Delta x^{(0)T}}{\Delta x^{(0)T} \Delta g^{(0)}} - \frac{H_0 \Delta g^{(0)} \Delta x^{(0)T} + (H_0 \Delta g^{(0)} \Delta x^{(0)T})^T}{\Delta g^{(0)T} \Delta x^{(0)}} = \begin{bmatrix} 1 & \tfrac{3}{2} \\ \tfrac{3}{2} & \tfrac{11}{4} \end{bmatrix}.$$

Hence, we have

$$d^{(1)} = -H_1 g^{(1)} = \begin{bmatrix} \tfrac{3}{2} \\ \tfrac{9}{4} \end{bmatrix}, \qquad \alpha_1 = -\frac{g^{(1)T} d^{(1)}}{d^{(1)T} Q d^{(1)}} = 2.$$

Therefore,

$$x^{(2)} = x^{(1)} + \alpha_1 d^{(1)} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}.$$

Because our objective function is a quadratic on $\mathbb{R}^2$, $x^{(2)}$ is the minimizer. Notice that the gradient at $x^{(2)}$ is 0; that is, $g^{(2)} = 0$.

To verify that $H_2 = Q^{-1}$, we compute

$$\Delta x^{(1)} = x^{(2)} - x^{(1)} = \begin{bmatrix} 3 \\ \tfrac{9}{2} \end{bmatrix}, \qquad \Delta g^{(1)} = g^{(2)} - g^{(1)} = \begin{bmatrix} \tfrac{3}{2} \\ 0 \end{bmatrix}.$$

Hence,

$$H_2 = H_1 + \left( 1 + \frac{\Delta g^{(1)T} H_1 \Delta g^{(1)}}{\Delta g^{(1)T} \Delta x^{(1)}} \right) \frac{\Delta x^{(1)} \Delta x^{(1)T}}{\Delta x^{(1)T} \Delta g^{(1)}} - \frac{H_1 \Delta g^{(1)} \Delta x^{(1)T} + \Delta x^{(1)} \Delta g^{(1)T} H_1}{\Delta g^{(1)T} \Delta x^{(1)}} = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix}.$$

Note that indeed $H_2 Q = Q H_2 = I_2$, and hence $H_2 = Q^{-1}$. ■

For nonquadratic problems, quasi-Newton algorithms will not usually converge in $n$ steps. As in the case of the conjugate gradient methods, here, too, some modifications may be necessary to deal with nonquadratic problems. For example, we may reinitialize the direction vector to the negative gradient after every few iterations (e.g., $n$ or $n + 1$), and continue until the algorithm satisfies the stopping criterion.
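The two steps of Example 11.4 can be replayed with the short MATLAB sketch below, which also confirms $H_2 = Q^{-1}$ numerically.

    % BFGS update on the quadratic of Example 11.4.
    Q = [5 -3; -3 2];  b = [0; 1];
    x = [0; 0];  H = eye(2);  g = Q*x - b;
    for k = 1:2
        d = -H*g;
        a = -(g'*d) / (d'*Q*d);                 % exact line search
        dx = a*d;  x = x + dx;
        gnew = Q*x - b;  dg = gnew - g;
        rho = dx'*dg;                           % scalar dx'*dg
        H = H + (1 + (dg'*H*dg)/rho)*(dx*dx')/rho ...
              - (H*dg*dx' + (H*dg*dx')')/rho;   % BFGS update of H
        g = gnew;
    end
    disp(x), disp(H)                % x = [3; 5], H = inv(Q) = [2 3; 3 5]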
EXERCISES

11.1 Given $f : \mathbb{R}^n \to \mathbb{R}$, $f \in \mathcal{C}^1$, consider the algorithm

$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)},$$

where $d^{(0)}, d^{(1)}, \ldots$ are vectors in $\mathbb{R}^n$ and $\alpha_k > 0$ is chosen to minimize $f(x^{(k)} + \alpha d^{(k)})$; that is,

$$\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} + \alpha d^{(k)}).$$

Note that the general algorithm above encompasses almost all algorithms that we discussed in this part, including the steepest descent, Newton, conjugate gradient, and quasi-Newton algorithms. Let $g^{(k)} = \nabla f(x^{(k)})$, and assume that $d^{(k)T} g^{(k)} < 0$.
a. Show that $d^{(k)}$ is a descent direction for $f$ in the sense that there exists $\bar{\alpha} > 0$ such that for all $\alpha \in (0, \bar{\alpha}]$,

$$f(x^{(k)} + \alpha d^{(k)}) < f(x^{(k)}).$$

b. Show that $\alpha_k > 0$.
c. Show that $d^{(k)T} g^{(k+1)} = 0$.
d. Show that the following algorithms all satisfy the condition $d^{(k)T} g^{(k)} < 0$ if $g^{(k)} \neq 0$:
1. Steepest descent algorithm.
2. Newton's method, assuming that the Hessian is positive definite.
3. Conjugate gradient algorithm.
4. Quasi-Newton algorithm, assuming that $H_k > 0$.
e. For the case where $f(x) = \frac{1}{2} x^T Q x - x^T b$, with $Q = Q^T > 0$, derive an expression for $\alpha_k$ in terms of $Q$, $d^{(k)}$, and $g^{(k)}$.

11.2 Consider Newton's algorithm applied to a function $f \in \mathcal{C}^2$:

$$x^{(k+1)} = x^{(k)} - \alpha_k F(x^{(k)})^{-1} \nabla f(x^{(k)}),$$

where $\alpha_k$ is chosen according to a line search. Is this algorithm a member of the quasi-Newton family?

11.3 In some optimization methods, when minimizing a given function $f(x)$, we select an initial guess $x^{(0)}$ and a real symmetric positive definite matrix $H_0$. Then we iteratively compute $H_k$, $d^{(k)} = -H_k g^{(k)}$ (where $g^{(k)} = \nabla f(x^{(k)})$), and $x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$, where

$$\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} + \alpha d^{(k)}).$$

Suppose that the function we wish to minimize is a standard quadratic of the form

$$f(x) = \frac{1}{2} x^T Q x - x^T b + c, \qquad Q = Q^T > 0.$$

a. Find an expression for $\alpha_k$ in terms of $Q$, $H_k$, $g^{(k)}$, and $d^{(k)}$.
b. Give a sufficient condition on $H_k$ for $\alpha_k$ to be positive.

11.4 Consider the algorithm

$$x^{(k+1)} = x^{(k)} - H g^{(k)},$$

where, as usual, $g^{(k)} = \nabla f(x^{(k)})$ and $H$ is a fixed symmetric matrix.
a. Suppose that $f \in \mathcal{C}^3$ and there is a point $x^*$ such that $\nabla f(x^*) = 0$ and $F(x^*)^{-1}$ exists. Find $H$ such that if $x^{(0)}$ is sufficiently close to $x^*$, then $x^{(k)}$ converges to $x^*$ with order of convergence of at least 2.
b. With the setting of $H$ in part a, is the given algorithm a quasi-Newton method?

11.5 Minimize the function

$$f(x) = \frac{1}{2} x^T \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} x - x^T \begin{bmatrix} 1 \\ -1 \end{bmatrix} + 7$$

using the rank one correction method with the starting point $x^{(0)} = 0$.

11.6 Consider the algorithm

$$x^{(k+1)} = x^{(k)} - \alpha_k M_k \nabla f(x^{(k)}),$$

where $f : \mathbb{R}^2 \to \mathbb{R}$, $f \in \mathcal{C}^1$, $M_k \in \mathbb{R}^{2 \times 2}$ is given by

$$M_k = \begin{bmatrix} 1 & 0 \\ 0 & a \end{bmatrix}$$

with $a \in \mathbb{R}$, and

$$\alpha_k = \arg\min_{\alpha \geq 0} f(x^{(k)} - \alpha M_k \nabla f(x^{(k)})).$$

Suppose that at some iteration $k$ we have $\nabla f(x^{(k)}) = [1, 1]^T$. Find the largest range of values of $a$ that guarantees that $\alpha_k > 0$ for any $f$.

11.7 Consider the rank one algorithm. Assume that $H_k > 0$. Show that if $\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)}) > 0$, then $H_{k+1} > 0$.

11.8 Based on the rank one update equation, derive an update formula using complementarity and the matrix inverse formula.

11.9 Let

$$f(x) = \frac{1}{2} x^T \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} x - x^T \begin{bmatrix} 1 \\ -1 \end{bmatrix} + 7$$

and $x^{(0)} = 0$. Use the rank one correction method to generate two $Q$-conjugate directions.

11.10 Apply the rank one algorithm to the problem in Example 11.3.

11.11 Consider the DFP algorithm applied to the quadratic function

$$f(x) = \frac{1}{2} x^T Q x - x^T b,$$

where $Q = Q^T > 0$.
a. Write down a formula for $\alpha_k$ in terms of $Q$, $g^{(k)}$, and $d^{(k)}$.
b. Show that if $g^{(k)} \neq 0$, then $\alpha_k > 0$.
Hint: Define
$$A_0 = B_k,\qquad u_0 = \frac{\Delta g^{(k)}}{\Delta g^{(k)\top}\Delta x^{(k)}},\qquad v_0 = \Delta g^{(k)},$$
$$u_1 = -\frac{B_k\Delta x^{(k)}}{\Delta x^{(k)\top}B_k\Delta x^{(k)}},\qquad v_1^\top = \Delta x^{(k)\top}B_k,$$
$$A_1 = B_k + \frac{\Delta g^{(k)}\Delta g^{(k)\top}}{\Delta g^{(k)\top}\Delta x^{(k)}} = A_0 + u_0v_0^\top.$$
Using the notation above, represent $B_{k+1}$ as
$$B_{k+1} = A_0 + u_0v_0^\top + u_1v_1^\top = A_1 + u_1v_1^\top.$$
Apply Lemma 11.1 to the above.

11.13 Assuming exact line search, show that if $H_0 = I_n$ (the $n \times n$ identity matrix), then the first two steps of the BFGS algorithm yield the same points $x^{(1)}$ and $x^{(2)}$ as conjugate gradient algorithms with the Hestenes-Stiefel, the Polak-Ribiere, and the Fletcher-Reeves formulas.

11.14 Let $f: \mathbb{R}^n \to \mathbb{R}$ be such that $f \in \mathcal{C}^1$. Consider an optimization algorithm applied to this $f$, of the usual form $x^{(k+1)} = x^{(k)} + \alpha_kd^{(k)}$, where $\alpha_k > 0$ is chosen according to line search. Suppose that $d^{(k)} = -H_kg^{(k)}$, where $g^{(k)} = \nabla f(x^{(k)})$ and $H_k$ is symmetric.

a. Show that if $H_k$ satisfies the following conditions whenever the algorithm is applied to a quadratic, then the algorithm is quasi-Newton:
1. $H_{k+1} = H_k + U_k$.
2. $U_k\Delta g^{(k)} = \Delta x^{(k)} - H_k\Delta g^{(k)}$.
3. $U_k = a^{(k)}\Delta x^{(k)\top} + b^{(k)}\Delta g^{(k)\top}H_k$, where $a^{(k)}$ and $b^{(k)}$ are in $\mathbb{R}^n$.

b. Which (if any) among the rank-one, DFP, and BFGS algorithms satisfy the three conditions in part a (whenever the algorithm is applied to a quadratic)? For those that do, specify the vectors $a^{(k)}$ and $b^{(k)}$.

11.15 Given a function $f: \mathbb{R}^n \to \mathbb{R}$, consider an algorithm $x^{(k+1)} = x^{(k)} - \alpha_kH_kg^{(k)}$ for finding the minimizer of $f$, where $g^{(k)} = \nabla f(x^{(k)})$ and $H_k \in \mathbb{R}^{n\times n}$ is symmetric. Suppose that $H_k = \phi H_k^{DFP} + (1 - \phi)H_k^{BFGS}$, where $\phi \in \mathbb{R}$, and $H_k^{DFP}$ and $H_k^{BFGS}$ are matrices generated by the DFP and BFGS algorithms, respectively.

a. Show that the algorithm above is a quasi-Newton algorithm. Is the above algorithm a conjugate direction algorithm?

b. Suppose that $0 \le \phi \le 1$. Show that if $H_k^{DFP} > 0$ and $H_k^{BFGS} > 0$, then $H_k > 0$ for all $k$. What can you conclude from this about whether or not the algorithm has the descent property?

11.16 Consider the following simple modification of the quasi-Newton family of algorithms. In the quadratic case, instead of the usual quasi-Newton condition $H_{k+1}\Delta g^{(i)} = \Delta x^{(i)}$, $0 \le i \le k$, suppose that we have $H_{k+1}\Delta g^{(i)} = \rho_i\Delta x^{(i)}$, $0 \le i \le k$, where $\rho_i > 0$. We refer to the set of algorithms that satisfy the condition above as the symmetric Huang family. Show that the symmetric Huang family algorithms are conjugate direction algorithms.

11.17 Write a MATLAB program to implement the quasi-Newton algorithm for general functions. Use the secant method for the line search (e.g., the MATLAB function of Exercise 7.11). Test the various update formulas for $H_k$ on Rosenbrock's function (see Exercise 9.4), with an initial condition $x^{(0)} = [-2, 2]^\top$. For this exercise, reinitialize the update direction to the negative gradient every six iterations.

11.18 Consider the function
$$f(x) = \frac{x_1^4}{4} + \frac{x_2^2}{2} - x_1x_2 + x_1 - x_2.$$

a. Use MATLAB to plot the level sets of $f$ at levels $-0.72$, $-0.6$, $-0.2$, $0.5$, $2$. Locate the minimizers of $f$ from the plots of the level sets.

b. Apply the DFP algorithm to minimize the function above with the following starting initial conditions: (i) $[0, 0]^\top$; (ii) $[1.5, 1]^\top$. Use $H_0 = I_2$. Does the algorithm converge to the same point for the two initial conditions? If not, explain.

CHAPTER 12

SOLVING LINEAR EQUATIONS

12.1 Least-Squares Analysis

Consider a system of linear equations
$$Ax = b,$$
where $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, $m \ge n$, and $\operatorname{rank} A = n$. Note that the number of unknowns, $n$, is no larger than the number of equations, $m$. If $b$ does not belong to the range of $A$, that is, if $b \notin \mathcal{R}(A)$, then this system of equations is said to be inconsistent or overdetermined.
In this case there is no solution to the above set of equations. Our goal then is to find the vector (or vectors) $x$ minimizing $\|Ax - b\|^2$. This problem is a special case of the nonlinear least-squares problem discussed in Section 9.4.

Let $x^*$ be a vector that minimizes $\|Ax - b\|^2$; that is, for all $x \in \mathbb{R}^n$,
$$\|Ax - b\|^2 \ge \|Ax^* - b\|^2.$$
We refer to the vector $x^*$ as a least-squares solution to $Ax = b$. In the case where $Ax = b$ has a solution, then the solution is a least-squares solution. Otherwise, a least-squares solution minimizes the norm of the difference between the left- and right-hand sides of the equation $Ax = b$. To characterize least-squares solutions, we need the following lemma.

Lemma 12.1 Let $A \in \mathbb{R}^{m\times n}$, $m \ge n$. Then, $\operatorname{rank} A = n$ if and only if $\operatorname{rank} A^\top A = n$ (i.e., the square matrix $A^\top A$ is nonsingular). □

Proof. ⇒: Suppose that $\operatorname{rank} A = n$. To show $\operatorname{rank} A^\top A = n$, it is equivalent to show $\mathcal{N}(A^\top A) = \{0\}$. To proceed, let $x \in \mathcal{N}(A^\top A)$; that is, $A^\top Ax = 0$. Therefore,
$$\|Ax\|^2 = x^\top A^\top Ax = 0,$$
which implies that $Ax = 0$. Because $\operatorname{rank} A = n$, we have $x = 0$.

⇐: Suppose that $\operatorname{rank} A^\top A = n$; that is, $\mathcal{N}(A^\top A) = \{0\}$. To show $\operatorname{rank} A = n$, it is equivalent to show that $\mathcal{N}(A) = \{0\}$. To proceed, let $x \in \mathcal{N}(A)$; that is, $Ax = 0$. Then, $A^\top Ax = 0$, and hence $x = 0$. ∎

Recall that we assume throughout that $\operatorname{rank} A = n$. By Lemma 12.1 we conclude that $(A^\top A)^{-1}$ exists. The following theorem characterizes the least-squares solution.

Theorem 12.1 The unique vector $x^*$ that minimizes $\|Ax - b\|^2$ is given by the solution to the equation $A^\top Ax = A^\top b$; that is, $x^* = (A^\top A)^{-1}A^\top b$. □

Proof. Let $x^* = (A^\top A)^{-1}A^\top b$. First observe that
$$\|Ax - b\|^2 = \|A(x - x^*) + (Ax^* - b)\|^2 = \|A(x - x^*)\|^2 + \|Ax^* - b\|^2 + 2[A(x - x^*)]^\top(Ax^* - b).$$
We now show that the last term in this equation is zero. Indeed, substituting the expression above for $x^*$,
$$[A(x - x^*)]^\top(Ax^* - b) = (x - x^*)^\top A^\top\left[A(A^\top A)^{-1}A^\top - I_m\right]b = (x - x^*)^\top\left[(A^\top A)(A^\top A)^{-1}A^\top - A^\top\right]b = (x - x^*)^\top\left(A^\top - A^\top\right)b = 0.$$
Hence,
$$\|Ax - b\|^2 = \|A(x - x^*)\|^2 + \|Ax^* - b\|^2.$$
If $x \neq x^*$, then $\|A(x - x^*)\|^2 > 0$, because $\operatorname{rank} A = n$. Thus, if $x \neq x^*$, we have
$$\|Ax - b\|^2 > \|Ax^* - b\|^2.$$
Thus, $x^* = (A^\top A)^{-1}A^\top b$ is the unique minimizer of $\|Ax - b\|^2$. ∎

We now give a geometric interpretation of Theorem 12.1. First note that the columns of $A$ span the range $\mathcal{R}(A)$ of $A$, which is an $n$-dimensional subspace of $\mathbb{R}^m$. The equation $Ax = b$ has a solution if and only if $b$ lies in this $n$-dimensional subspace $\mathcal{R}(A)$. If $m = n$, then $b \in \mathcal{R}(A)$ always, and the solution is $x^* = A^{-1}b$. Suppose now that $m > n$. Intuitively, we would expect the "likelihood" of $b \in \mathcal{R}(A)$ to be small, because the subspace spanned by the columns of $A$ is very "thin." Therefore, let us suppose that $b$ does not belong to $\mathcal{R}(A)$. We wish to find a point $h \in \mathcal{R}(A)$ that is "closest" to $b$. Geometrically, the point $h$ should be such that the vector $e = h - b$ is orthogonal to the subspace $\mathcal{R}(A)$ (see Figure 12.1). Recall that a vector $e \in \mathbb{R}^m$ is said to be orthogonal to the subspace $\mathcal{R}(A)$ if it is orthogonal to every vector in this subspace. We call $h$ the orthogonal projection of $b$ onto the subspace $\mathcal{R}(A)$. It turns out that $h = Ax^* = A(A^\top A)^{-1}A^\top b$. Hence, the vector $h \in \mathcal{R}(A)$ minimizing $\|b - h\|$ is exactly the orthogonal projection of $b$ onto $\mathcal{R}(A)$. In other words, the vector $x^*$ minimizing $\|Ax - b\|$ is exactly the vector that makes $Ax^* - b$ orthogonal to $\mathcal{R}(A)$.

Figure 12.1 Orthogonal projection of $b$ on the subspace $\mathcal{R}(A)$.
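As a quick numerical check of Theorem 12.1 and the orthogonality property, the following MATLAB lines are a sketch of ours; the data are arbitrary, chosen only so that $\operatorname{rank} A = n$.

```matlab
% Numerical illustration of Theorem 12.1 (our sketch; arbitrary data).
A = [1 0; 0 1; 1 1];  b = [2; 3; 3];
xstar = (A'*A)\(A'*b);       % solves the normal equations A'A x = A'b
disp(xstar)
disp(A'*(A*xstar - b))       % ~ zero vector: A x* - b is orthogonal to R(A)
```

MATLAB's backslash operator solves the normal equations by factorization, which is numerically preferable to forming the explicit inverse of $A^\top A$.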
To proceed further, we write $A = [a_1, \ldots, a_n]$, where $a_1, \ldots, a_n$ are the columns of $A$. The vector $e$ is orthogonal to $\mathcal{R}(A)$ if and only if it is orthogonal to each of the columns $a_1, \ldots, a_n$ of $A$. To see this, note that
$$\langle e, a_i\rangle = 0,\qquad i = 1, \ldots, n,$$
if and only if for any set of scalars $\{x_1, x_2, \ldots, x_n\}$ we also have
$$\langle e, x_1a_1 + \cdots + x_na_n\rangle = 0.$$
Any vector in $\mathcal{R}(A)$ has the form $x_1a_1 + \cdots + x_na_n$.

Proposition 12.1 Let $h \in \mathcal{R}(A)$ be such that $h - b$ is orthogonal to $\mathcal{R}(A)$. Then, $h = Ax^* = A(A^\top A)^{-1}A^\top b$. □

Proof. Because $h \in \mathcal{R}(A) = \operatorname{span}[a_1, \ldots, a_n]$, it has the form $h = x_1a_1 + \cdots + x_na_n$, where $x_1, \ldots, x_n \in \mathbb{R}$. To find $x_1, \ldots, x_n$, we use the assumption that $e = h - b$ is orthogonal to $\operatorname{span}[a_1, \ldots, a_n]$; that is, for all $i = 1, \ldots, n$, we have
$$\langle h - b, a_i\rangle = 0$$
or, equivalently, $\langle h, a_i\rangle = \langle b, a_i\rangle$. Substituting $h$ into the equations above, we obtain a set of $n$ linear equations of the form
$$\langle a_1, a_i\rangle x_1 + \cdots + \langle a_n, a_i\rangle x_n = \langle b, a_i\rangle,\qquad i = 1, \ldots, n.$$
In matrix notation this system of $n$ equations can be represented as
$$\begin{bmatrix}\langle a_1, a_1\rangle & \cdots & \langle a_n, a_1\rangle\\ \vdots & & \vdots\\ \langle a_1, a_n\rangle & \cdots & \langle a_n, a_n\rangle\end{bmatrix}\begin{bmatrix}x_1\\ \vdots\\ x_n\end{bmatrix} = \begin{bmatrix}\langle b, a_1\rangle\\ \vdots\\ \langle b, a_n\rangle\end{bmatrix}.$$
Note that we can write
$$A^\top A = \begin{bmatrix}\langle a_1, a_1\rangle & \cdots & \langle a_n, a_1\rangle\\ \vdots & & \vdots\\ \langle a_1, a_n\rangle & \cdots & \langle a_n, a_n\rangle\end{bmatrix}.$$
We also note that
$$A^\top b = \begin{bmatrix}a_1^\top\\ \vdots\\ a_n^\top\end{bmatrix}b = \begin{bmatrix}\langle b, a_1\rangle\\ \vdots\\ \langle b, a_n\rangle\end{bmatrix}.$$
Because $\operatorname{rank} A = n$, $A^\top A$ is nonsingular, and thus we conclude that
$$\begin{bmatrix}x_1\\ \vdots\\ x_n\end{bmatrix} = (A^\top A)^{-1}A^\top b = x^*. \qquad\blacksquare$$

Notice that the matrix $A^\top A$ plays an important role in the least-squares solution. This matrix is often called the Gram matrix (or Grammian).

An alternative method of arriving at the least-squares solution is to proceed as follows. First, we write
$$f(x) = \|Ax - b\|^2 = (Ax - b)^\top(Ax - b) = \frac{1}{2}x^\top(2A^\top A)x - x^\top(2A^\top b) + b^\top b.$$
Therefore, $f$ is a quadratic function. The quadratic term is positive definite because $\operatorname{rank} A = n$. Thus, the unique minimizer of $f$ is obtained by solving the FONC (see Exercise 6.33); that is,
$$\nabla f(x) = 2A^\top Ax - 2A^\top b = 0.$$
The only solution to the equation $\nabla f(x) = 0$ is $x^* = (A^\top A)^{-1}A^\top b$.

Example 12.1 Suppose that you are given two different types of concrete. The first type contains 30% cement, 40% gravel, and 30% sand (all percentages of weight). The second type contains 10% cement, 20% gravel, and 70% sand. How many pounds of each type of concrete should you mix together so that you get a concrete mixture that has as close as possible to a total of 5 pounds of cement, 3 pounds of gravel, and 4 pounds of sand?

The problem can be formulated as a least-squares problem with
$$A = \begin{bmatrix}0.3 & 0.1\\ 0.4 & 0.2\\ 0.3 & 0.7\end{bmatrix},\qquad b = \begin{bmatrix}5\\ 3\\ 4\end{bmatrix},$$
where the decision variable is $x = [x_1, x_2]^\top$ and $x_1$ and $x_2$ are the amounts of concrete of the first and second types, respectively. After some algebra, we obtain the solution:
$$x^* = (A^\top A)^{-1}A^\top b = \frac{1}{(0.34)(0.54) - (0.32)^2}\begin{bmatrix}0.54 & -0.32\\ -0.32 & 0.34\end{bmatrix}\begin{bmatrix}3.9\\ 3.9\end{bmatrix} \approx \begin{bmatrix}10.6\\ 0.96\end{bmatrix}.$$
(For a variation of this problem solved using a different method, see Example 15.7.) ■
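The arithmetic of Example 12.1 is easy to reproduce; the following MATLAB lines are a sketch of ours:

```matlab
% Example 12.1 computed numerically (our sketch).
A = [0.3 0.1; 0.4 0.2; 0.3 0.7];   % cement, gravel, sand content of each type
b = [5; 3; 4];                     % desired pounds of cement, gravel, sand
xstar = (A'*A)\(A'*b);             % least-squares amounts of the two types
disp(xstar)                        % approximately [10.57; 0.96]
```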
We now give an example in which least-squares analysis is used to fit measurements by a straight line.

Example 12.2 Line Fitting. Suppose that a process has a single input $t \in \mathbb{R}$ and a single output $y \in \mathbb{R}$. Suppose that we perform an experiment on the process, resulting in a number of measurements, as displayed in Table 12.1. The $i$th measurement results in the input labeled $t_i$ and the output labeled $y_i$.

Table 12.1 Experimental Data for Example 12.2.

    i    | 0 | 1 | 2
    t_i  | 2 | 3 | 4
    y_i  | 3 | 4 | 15

We would like to find a straight line given by
$$y = mt + c$$
that fits the experimental data. In other words, we wish to find two numbers, $m$ and $c$, such that $y_i = mt_i + c$, $i = 0, 1, 2$. However, it is apparent that there is no choice of $m$ and $c$ that satisfies the requirement above; that is, there is no straight line that passes through all three points simultaneously. Therefore, we would like to find the values of $m$ and $c$ that best fit the data. A graphical illustration of our problem is shown in Figure 12.2.

Figure 12.2 Fitting a straight line to experimental data.

We can represent our problem as a system of three linear equations of the form
$$2m + c = 3,$$
$$3m + c = 4,$$
$$4m + c = 15.$$
We can write this system of equations as $Ax = b$, where
$$A = \begin{bmatrix}2 & 1\\ 3 & 1\\ 4 & 1\end{bmatrix},\qquad b = \begin{bmatrix}3\\ 4\\ 15\end{bmatrix},\qquad x = \begin{bmatrix}m\\ c\end{bmatrix}.$$
Note that since $\operatorname{rank} A = 2 < \operatorname{rank}[A, b] = 3$, the vector $b$ does not belong to the range of $A$. Thus, as we have noted before, the system of equations above is inconsistent. The straight line of best fit is the one that minimizes
$$\|Ax - b\|^2 = \sum_{i=0}^{2}(mt_i + c - y_i)^2.$$
Therefore, our problem lies in the class of least-squares problems. Note that the foregoing function of $m$ and $c$ is simply the total squared vertical distance (squared error) between the straight line defined by $m$ and $c$ and the experimental points. The solution to our least-squares problem is
$$x^* = \begin{bmatrix}m^*\\ c^*\end{bmatrix} = (A^\top A)^{-1}A^\top b = \begin{bmatrix}6\\ -32/3\end{bmatrix}.$$
Note that the error vector $e = Ax^* - b$ is orthogonal to each column of $A$. ■

Next, we give an example of the use of least squares in wireless communications.

Example 12.3 Attenuation Estimation. A wireless transmitter sends a discrete-time signal $\{s_0, s_1, s_2\}$ (of duration 3) to a receiver, as shown in Figure 12.3. The real number $s_i$ is the value of the signal at time $i$. The transmitted signal takes two paths to the receiver: a direct path, with delay 10 and attenuation factor $a_1$, and an indirect (reflected) path, with delay 12 and attenuation factor $a_2$. The received signal is the sum of the signals from these two paths, with their respective delays and attenuation factors.

Figure 12.3 Wireless transmission in Example 12.3.

Suppose that the received signal is measured from times 10 through 14 as $r_{10}, r_{11}, \ldots, r_{14}$, as shown in the figure. We wish to compute the least-squares estimates of $a_1$ and $a_2$, based on the following values:

    s_0 | s_1 | s_2 | r_10 | r_11 | r_12 | r_13 | r_14
     1  |  2  |  1  |  4   |  7   |  8   |  6   |  3

The problem can be posed as a least-squares problem with
$$A = \begin{bmatrix}s_0 & 0\\ s_1 & 0\\ s_2 & s_0\\ 0 & s_1\\ 0 & s_2\end{bmatrix},\qquad x = \begin{bmatrix}a_1\\ a_2\end{bmatrix},\qquad b = \begin{bmatrix}r_{10}\\ r_{11}\\ r_{12}\\ r_{13}\\ r_{14}\end{bmatrix}.$$
The least-squares estimate is given by
$$x^* = (A^\top A)^{-1}A^\top b = \begin{bmatrix}\|s\|^2 & s_0s_2\\ s_0s_2 & \|s\|^2\end{bmatrix}^{-1}\begin{bmatrix}s_0r_{10} + s_1r_{11} + s_2r_{12}\\ s_0r_{12} + s_1r_{13} + s_2r_{14}\end{bmatrix} = \begin{bmatrix}6 & 1\\ 1 & 6\end{bmatrix}^{-1}\begin{bmatrix}4 + 14 + 8\\ 8 + 12 + 3\end{bmatrix} = \frac{1}{35}\begin{bmatrix}6 & -1\\ -1 & 6\end{bmatrix}\begin{bmatrix}26\\ 23\end{bmatrix} = \frac{1}{35}\begin{bmatrix}133\\ 112\end{bmatrix} = \begin{bmatrix}3.8\\ 3.2\end{bmatrix}. \qquad\blacksquare$$
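Both examples reduce to the same computation. Here is a short MATLAB sketch of ours for the line-fitting problem of Example 12.2:

```matlab
% Example 12.2: least-squares line fit (our sketch).
t = [2; 3; 4];  y = [3; 4; 15];
A = [t ones(size(t))];         % rows [t_i 1], for the model y = m*t + c
x = (A'*A)\(A'*y);             % x = [m; c] = [6; -32/3]
disp(x)
disp(A'*(A*x - y))             % ~ 0: the error A x* - y is orthogonal to R(A)
```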
We now give a simple example where the least-squares method is used in digital signal processing.

Example 12.4 Discrete Fourier Series. Suppose that we are given a discrete-time signal, represented by the vector
$$b = [b_1, b_2, \ldots, b_m]^\top.$$
We wish to approximate this signal by a sum of sinusoids. Specifically, we approximate $b$ by the vector
$$y_0c^{(0)} + \sum_{k=1}^{n}\left(y_kc^{(k)} + z_ks^{(k)}\right),$$
where $y_0, y_1, \ldots, y_n, z_1, \ldots, z_n \in \mathbb{R}$ and the vectors $c^{(k)}$ and $s^{(k)}$ are given by
$$c^{(0)} = \left[\frac{1}{\sqrt 2}, \frac{1}{\sqrt 2}, \ldots, \frac{1}{\sqrt 2}\right]^\top,$$
$$c^{(k)} = \left[\cos\left(\frac{2k\pi}{m}\right), \cos\left(2\,\frac{2k\pi}{m}\right), \ldots, \cos\left(m\,\frac{2k\pi}{m}\right)\right]^\top,\qquad k = 1, \ldots, n,$$
$$s^{(k)} = \left[\sin\left(\frac{2k\pi}{m}\right), \sin\left(2\,\frac{2k\pi}{m}\right), \ldots, \sin\left(m\,\frac{2k\pi}{m}\right)\right]^\top,\qquad k = 1, \ldots, n.$$
We call the sum of sinusoids above a discrete Fourier series (although, strictly speaking, it is not a series but a finite sum). We wish to find $y_0, y_1, \ldots, y_n, z_1, \ldots, z_n$ such that
$$\left\|y_0c^{(0)} + \sum_{k=1}^n\left(y_kc^{(k)} + z_ks^{(k)}\right) - b\right\|^2$$
is minimized. To proceed, we define
$$A = \left[c^{(0)}, c^{(1)}, \ldots, c^{(n)}, s^{(1)}, \ldots, s^{(n)}\right],\qquad x = [y_0, y_1, \ldots, y_n, z_1, \ldots, z_n]^\top.$$
Our problem can be reformulated as minimizing $\|Ax - b\|^2$. We assume that $m \ge 2n + 1$. To find the solution, we first compute $A^\top A$. We make use of the following trigonometric identities: For any nonzero integer $k$ that is not an integral multiple of $m$, we have
$$\sum_{i=1}^m\cos\left(i\,\frac{2k\pi}{m}\right) = 0,\qquad \sum_{i=1}^m\sin\left(i\,\frac{2k\pi}{m}\right) = 0.$$
With the aid of these identities, we can verify that
$$c^{(k)\top}c^{(j)} = \begin{cases}m/2 & \text{if } k = j\\ 0 & \text{otherwise,}\end{cases}\qquad s^{(k)\top}s^{(j)} = \begin{cases}m/2 & \text{if } k = j\\ 0 & \text{otherwise,}\end{cases}\qquad c^{(k)\top}s^{(j)} = 0\ \text{for any } k, j.$$
Hence,
$$A^\top A = \frac{m}{2}I_{2n+1},$$
which is clearly nonsingular, with inverse
$$(A^\top A)^{-1} = \frac{2}{m}I_{2n+1}.$$
Therefore, the solution to our problem is
$$x^* = [y_0^*, y_1^*, \ldots, y_n^*, z_1^*, \ldots, z_n^*]^\top = (A^\top A)^{-1}A^\top b = \frac{2}{m}A^\top b.$$
We represent the solution as
$$y_0^* = \frac{\sqrt 2}{m}\sum_{i=1}^m b_i,\qquad y_k^* = \frac{2}{m}\sum_{i=1}^m b_i\cos\left(i\,\frac{2k\pi}{m}\right),\qquad z_k^* = \frac{2}{m}\sum_{i=1}^m b_i\sin\left(i\,\frac{2k\pi}{m}\right),\qquad k = 1, \ldots, n.$$
We call these discrete Fourier coefficients. ■

Finally, we show how least-squares analysis can be used to derive formulas for orthogonal projectors.

Example 12.5 Orthogonal Projectors. Let $\mathcal{V} \subset \mathbb{R}^n$ be a subspace. Given a vector $x \in \mathbb{R}^n$, we write the orthogonal decomposition of $x$ as
$$x = x_\mathcal{V} + x_{\mathcal{V}^\perp},$$
where $x_\mathcal{V} \in \mathcal{V}$ is the orthogonal projection of $x$ onto $\mathcal{V}$ and $x_{\mathcal{V}^\perp} \in \mathcal{V}^\perp$ is the orthogonal projection of $x$ onto $\mathcal{V}^\perp$. (See Section 3.3; also recall that $\mathcal{V}^\perp$ is the orthogonal complement of $\mathcal{V}$.) We can write $x_\mathcal{V} = Px$ for some matrix $P$ called the orthogonal projector. In the following, we derive expressions for $P$ for the case where $\mathcal{V} = \mathcal{R}(A)$ and the case where $\mathcal{V} = \mathcal{N}(A)$.

Consider a matrix $A \in \mathbb{R}^{m\times n}$, $m \ge n$, and $\operatorname{rank} A = n$. Let $\mathcal{V} = \mathcal{R}(A)$ be the range of $A$ (note that any subspace can be written as the range of some matrix). In this case we can write an expression for $P$ in terms of $A$, as follows. By Proposition 12.1 we have $x_\mathcal{V} = A(A^\top A)^{-1}A^\top x$, whence $P = A(A^\top A)^{-1}A^\top$. Note that by Proposition 12.1, we may also write
$$x_\mathcal{V} = \arg\min_{y \in \mathcal{V}}\|y - x\|.$$

Next, consider a matrix $A \in \mathbb{R}^{m\times n}$, $m \le n$, and $\operatorname{rank} A = m$. Let $\mathcal{V} = \mathcal{N}(A)$ be the nullspace of $A$ (note that any subspace can be written as the nullspace of some matrix). To derive an expression for the orthogonal projector $P$ in terms of $A$ for this case, we use the formula derived above and the identity $\mathcal{N}(A)^\perp = \mathcal{R}(A^\top)$ (see Theorem 3.4). Indeed, if $\mathcal{U} = \mathcal{R}(A^\top)$, then the orthogonal decomposition with respect to $\mathcal{U}$ is $x = x_\mathcal{U} + x_{\mathcal{U}^\perp}$, where $x_\mathcal{U} = A^\top(AA^\top)^{-1}Ax$ (using the formula derived above). Because $\mathcal{N}(A)^\perp = \mathcal{R}(A^\top)$, we deduce that $x_{\mathcal{V}^\perp} = x_\mathcal{U} = A^\top(AA^\top)^{-1}Ax$. Hence,
$$x_\mathcal{V} = x - x_{\mathcal{V}^\perp} = x - A^\top(AA^\top)^{-1}Ax = \left(I - A^\top(AA^\top)^{-1}A\right)x.$$
Thus, the orthogonal projector in this case is $P = I - A^\top(AA^\top)^{-1}A$. ■
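The projector formulas are easy to check numerically. The following MATLAB lines are a sketch of ours, with an arbitrarily chosen matrix $A$:

```matlab
% Example 12.5 checked numerically (our sketch; A chosen arbitrarily).
A = [1 0; 0 1; 1 1];            % V = R(A), a 2-dimensional subspace of R^3
P = A/(A'*A)*A';                % P = A (A'A)^{-1} A'
x = [1; 2; 3];
xV = P*x;                       % orthogonal projection of x onto V
disp(norm(P*P - P))             % ~ 0: P is idempotent
disp(A'*(x - xV))               % ~ 0: x - xV is orthogonal to R(A)
```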
12.2 The Recursive Least-Squares Algorithm

Consider again the example in Section 12.1 where we are given experimental points $(t_0, y_0)$, $(t_1, y_1)$, and $(t_2, y_2)$, and we find the parameters $m^*$ and $c^*$ of the straight line that best fits these data in the least-squares sense. Suppose that we are now given an extra measurement point $(t_3, y_3)$, so that we now have a set of four experimental data points: $(t_0, y_0)$, $(t_1, y_1)$, $(t_2, y_2)$, and $(t_3, y_3)$. We can similarly go through the procedure for finding the parameters of the line of best fit for this set of four points. However, as we shall see, there is a more efficient way: We can use previous calculations of $m^*$ and $c^*$ for the three data points to calculate the parameters for the four data points. In effect, we simply "update" our values of $m^*$ and $c^*$ to accommodate the new data point. This procedure, called the recursive least-squares (RLS) algorithm, is the topic of this section.

To derive the RLS algorithm, first consider the problem of minimizing $\|A_0x - b^{(0)}\|^2$. We know that the solution to this is given by $x^{(0)} = G_0^{-1}A_0^\top b^{(0)}$, where $G_0 = A_0^\top A_0$. Suppose now that we are given more data, in the form of a matrix $A_1$ and a vector $b^{(1)}$. Consider now the problem of minimizing
$$\left\|\begin{bmatrix}A_0\\ A_1\end{bmatrix}x - \begin{bmatrix}b^{(0)}\\ b^{(1)}\end{bmatrix}\right\|^2.$$
The solution is given by
$$x^{(1)} = G_1^{-1}\begin{bmatrix}A_0\\ A_1\end{bmatrix}^\top\begin{bmatrix}b^{(0)}\\ b^{(1)}\end{bmatrix},\qquad\text{where}\qquad G_1 = \begin{bmatrix}A_0\\ A_1\end{bmatrix}^\top\begin{bmatrix}A_0\\ A_1\end{bmatrix}.$$
Our goal is to write $x^{(1)}$ as a function of $x^{(0)}$, $G_0$, and the new data $A_1$ and $b^{(1)}$. To this end, we first write $G_1$ as
$$G_1 = A_0^\top A_0 + A_1^\top A_1 = G_0 + A_1^\top A_1.$$
Next, we write
$$\begin{bmatrix}A_0\\ A_1\end{bmatrix}^\top\begin{bmatrix}b^{(0)}\\ b^{(1)}\end{bmatrix} = A_0^\top b^{(0)} + A_1^\top b^{(1)}.$$
To proceed further, we write $A_0^\top b^{(0)}$ as
$$A_0^\top b^{(0)} = G_0G_0^{-1}A_0^\top b^{(0)} = G_0x^{(0)} = \left(G_1 - A_1^\top A_1\right)x^{(0)} = G_1x^{(0)} - A_1^\top A_1x^{(0)}.$$
Combining these formulas, we see that we can write $x^{(1)}$ as
$$x^{(1)} = G_1^{-1}\left(G_1x^{(0)} - A_1^\top A_1x^{(0)} + A_1^\top b^{(1)}\right) = x^{(0)} + G_1^{-1}A_1^\top\left(b^{(1)} - A_1x^{(0)}\right),$$
where $G_1$ can be calculated using
$$G_1 = G_0 + A_1^\top A_1.$$
We note that with this formula, $x^{(1)}$ can be computed using only $x^{(0)}$, $A_1$, $b^{(1)}$, and $G_0$. Hence, we have a way of using our previous efforts in calculating $x^{(0)}$ to compute $x^{(1)}$, without having to compute $x^{(1)}$ directly from scratch. The solution $x^{(1)}$ is obtained from $x^{(0)}$ by a simple update equation that adds to $x^{(0)}$ a "correction term" $G_1^{-1}A_1^\top(b^{(1)} - A_1x^{(0)})$. Observe that if the new data are consistent with the old data, that is, $A_1x^{(0)} = b^{(1)}$, then the correction term is $0$ and the updated solution $x^{(1)}$ is equal to the previous solution $x^{(0)}$.

We can generalize the argument above to write a recursive algorithm for updating the least-squares solution as new data arrive. At the $(k+1)$th iteration, we have
$$G_{k+1} = G_k + A_{k+1}^\top A_{k+1},$$
$$x^{(k+1)} = x^{(k)} + G_{k+1}^{-1}A_{k+1}^\top\left(b^{(k+1)} - A_{k+1}x^{(k)}\right).$$
The vector $b^{(k+1)} - A_{k+1}x^{(k)}$ is often called the innovation. As before, observe that if the innovation is zero, then the updated solution $x^{(k+1)}$ is equal to the previous solution $x^{(k)}$.

We can see from the above that to compute $x^{(k+1)}$ from $x^{(k)}$ we need $G_{k+1}^{-1}$, rather than $G_{k+1}$. It turns out that we can derive an update formula for $G_{k+1}^{-1}$ itself. To do so, we need the following technical lemma, which is a generalization of the Sherman-Morrison formula (Lemma 11.1), due to Woodbury, and hence also called the Sherman-Morrison-Woodbury formula (see [63, p. 124] or [53, p. 50]).

Lemma 12.2 Let $A$ be a nonsingular matrix. Let $U$ and $V$ be matrices such that $I + VA^{-1}U$ is nonsingular. Then, $A + UV$ is nonsingular, and
$$(A + UV)^{-1} = A^{-1} - \left(A^{-1}U\right)\left(I + VA^{-1}U\right)^{-1}\left(VA^{-1}\right). \qquad\Box$$

Proof. We can prove the result easily by verification. ∎

Using Lemma 12.2 we get
$$G_{k+1}^{-1} = \left(G_k + A_{k+1}^\top A_{k+1}\right)^{-1} = G_k^{-1} - G_k^{-1}A_{k+1}^\top\left(I + A_{k+1}G_k^{-1}A_{k+1}^\top\right)^{-1}A_{k+1}G_k^{-1}.$$
For simplicity of notation, we rewrite $G_k^{-1}$ as $P_k$. We summarize by writing the RLS algorithm using $P_k$:
$$P_{k+1} = P_k - P_kA_{k+1}^\top\left(I + A_{k+1}P_kA_{k+1}^\top\right)^{-1}A_{k+1}P_k,$$
$$x^{(k+1)} = x^{(k)} + P_{k+1}A_{k+1}^\top\left(b^{(k+1)} - A_{k+1}x^{(k)}\right).$$
In the special case where the new data at each step are such that $A_{k+1}$ is a matrix consisting of a single row, $A_{k+1} = a_{k+1}^\top$, and $b^{(k+1)}$ is a scalar, $b^{(k+1)} = b_{k+1}$, we get
$$P_{k+1} = P_k - \frac{P_ka_{k+1}a_{k+1}^\top P_k}{1 + a_{k+1}^\top P_ka_{k+1}},$$
$$x^{(k+1)} = x^{(k)} + \left(b_{k+1} - a_{k+1}^\top x^{(k)}\right)P_{k+1}a_{k+1}.$$
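The scalar-observation form of the algorithm is only a few lines of code. The following MATLAB fragment is a sketch of ours that processes rows one at a time; the data stream is hypothetical, and initializing $P_0$ with a large multiple of the identity (rather than from a batch solve, as in the derivation above) is a common practical shortcut, not part of the derivation.

```matlab
% Recursive least squares with scalar observations (our sketch).
A = [2 1; 3 1; 4 1; 5 1];  b = [3; 4; 15; 20];   % hypothetical data stream
P = 1e6*eye(2);                 % large P0 approximates "no prior data"
x = zeros(2,1);
for k = 1:size(A,1)
    a  = A(k,:)';  bk = b(k);
    P  = P - (P*(a*a')*P)/(1 + a'*P*a);   % update of P_k
    x  = x + (bk - a'*x)*P*a;             % innovation-driven update of x
end
disp(x)                         % nearly equal to the batch solution:
disp((A'*A)\(A'*b))
```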
Example 12.6 Let
$$A_0 = \begin{bmatrix}1 & 0\\ 0 & 1\\ 1 & 1\end{bmatrix},\quad b^{(0)} = \begin{bmatrix}1\\ 1\\ 1\end{bmatrix},\quad A_1 = a_1^\top = [2\ \ 1],\quad b^{(1)} = b_1 = 3,\quad A_2 = a_2^\top = [3\ \ 1],\quad b^{(2)} = b_2 = 4.$$
First compute the vector $x^{(0)}$ minimizing $\|A_0x - b^{(0)}\|^2$. Then, use the RLS algorithm to find $x^{(2)}$ minimizing
$$\left\|\begin{bmatrix}A_0\\ A_1\\ A_2\end{bmatrix}x - \begin{bmatrix}b^{(0)}\\ b^{(1)}\\ b^{(2)}\end{bmatrix}\right\|^2.$$
We have
$$P_0 = \left(A_0^\top A_0\right)^{-1} = \begin{bmatrix}2 & 1\\ 1 & 2\end{bmatrix}^{-1} = \begin{bmatrix}2/3 & -1/3\\ -1/3 & 2/3\end{bmatrix},\qquad x^{(0)} = P_0A_0^\top b^{(0)} = \begin{bmatrix}2/3\\ 2/3\end{bmatrix}.$$
Applying the RLS algorithm twice, we get
$$P_1 = P_0 - \frac{P_0a_1a_1^\top P_0}{1 + a_1^\top P_0a_1} = \begin{bmatrix}1/3 & -1/3\\ -1/3 & 2/3\end{bmatrix},\qquad x^{(1)} = x^{(0)} + P_1a_1\left(b_1 - a_1^\top x^{(0)}\right) = \begin{bmatrix}1\\ 2/3\end{bmatrix},$$
$$P_2 = P_1 - \frac{P_1a_2a_2^\top P_1}{1 + a_2^\top P_1a_2} = \begin{bmatrix}1/6 & -1/4\\ -1/4 & 5/8\end{bmatrix},\qquad x^{(2)} = x^{(1)} + P_2a_2\left(b_2 - a_2^\top x^{(1)}\right) = \begin{bmatrix}13/12\\ 5/8\end{bmatrix}.$$
We can easily check our solution by computing $x^{(2)}$ directly using the formula $x^{(2)} = (A^\top A)^{-1}A^\top b$, where
$$A = \begin{bmatrix}A_0\\ A_1\\ A_2\end{bmatrix},\qquad b = \begin{bmatrix}b^{(0)}\\ b^{(1)}\\ b^{(2)}\end{bmatrix}. \qquad\blacksquare$$

12.3 Solution to a Linear Equation with Minimum Norm

Consider now a system of linear equations $Ax = b$, where $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, $m \le n$, and $\operatorname{rank} A = m$. Note that the number of equations is no larger than the number of unknowns. There may exist an infinite number of solutions to this system of equations. However, as we shall see, there is only one solution that is closest to the origin: the solution to $Ax = b$ whose norm $\|x\|$ is minimal. Let $x^*$ be this solution; that is, $Ax^* = b$ and $\|x^*\| \le \|x\|$ for any $x$ such that $Ax = b$. In other words, $x^*$ is the solution to the problem
$$\begin{array}{ll}\text{minimize} & \|x\|\\ \text{subject to} & Ax = b.\end{array}$$
In Part IV we study problems of this type in more detail.

Theorem 12.2 The unique solution $x^*$ to $Ax = b$ that minimizes the norm $\|x\|$ is given by
$$x^* = A^\top\left(AA^\top\right)^{-1}b. \qquad\Box$$

Proof. Let $x^* = A^\top(AA^\top)^{-1}b$, and let $x$ be any vector satisfying $Ax = b$. Note that
$$\|x\|^2 = \|(x - x^*) + x^*\|^2 = ((x - x^*) + x^*)^\top((x - x^*) + x^*) = \|x - x^*\|^2 + \|x^*\|^2 + 2x^{*\top}(x - x^*).$$
We now show that $x^{*\top}(x - x^*) = 0$. Indeed,
$$x^{*\top}(x - x^*) = \left[A^\top(AA^\top)^{-1}b\right]^\top\left[x - A^\top(AA^\top)^{-1}b\right] = b^\top(AA^\top)^{-1}\left[Ax - (AA^\top)(AA^\top)^{-1}b\right] = b^\top(AA^\top)^{-1}[b - b] = 0.$$
Therefore,
$$\|x\|^2 = \|x^*\|^2 + \|x - x^*\|^2.$$
Because $\|x - x^*\|^2 > 0$ for all $x \neq x^*$, it follows that for all $x \neq x^*$ satisfying $Ax = b$,
$$\|x\|^2 > \|x^*\|^2,$$
which implies that $\|x\| > \|x^*\|$. ∎

Example 12.7 Find the point closest to the origin of $\mathbb{R}^3$ on the line of intersection of the two planes defined by the following two equations:
$$x_1 + 2x_2 - x_3 = 1,$$
$$4x_1 + x_2 + 3x_3 = 0.$$
Note that this problem is equivalent to the problem
$$\begin{array}{ll}\text{minimize} & \|x\|\\ \text{subject to} & Ax = b,\end{array}$$
where
$$A = \begin{bmatrix}1 & 2 & -1\\ 4 & 1 & 3\end{bmatrix},\qquad b = \begin{bmatrix}1\\ 0\end{bmatrix}.$$
Thus, the solution to the problem is
$$x^* = A^\top\left(AA^\top\right)^{-1}b = \begin{bmatrix}0.0952\\ 0.3333\\ -0.2381\end{bmatrix}. \qquad\blacksquare$$
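Example 12.7 can be reproduced in two lines; the following MATLAB fragment is a sketch of ours:

```matlab
% Example 12.7: minimum-norm solution (our sketch).
A = [1 2 -1; 4 1 3];  b = [1; 0];
xstar = A'/(A*A')*b;      % x* = A'(AA')^{-1} b
disp(xstar)               % approximately [0.0952; 0.3333; -0.2381]
disp(A*xstar - b)         % ~ 0: x* satisfies both plane equations
```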
In the next section we discuss an iterative algorithm for solving $Ax = b$.

12.4 Kaczmarz's Algorithm

As in Section 12.3, let $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, $m \le n$, and $\operatorname{rank} A = m$. We now discuss an iterative algorithm for solving $Ax = b$, originally analyzed by Kaczmarz in 1937 [70]. The algorithm converges to the vector $x^* = A^\top(AA^\top)^{-1}b$ without explicitly having to invert the matrix $AA^\top$. This is important from a practical point of view, especially when $A$ has many rows.

Let $a_j^\top$ denote the $j$th row of $A$, $b_j$ the $j$th component of $b$, and $\mu$ a positive scalar, $0 < \mu < 2$. With this notation, Kaczmarz's algorithm is:

1. Set $i := 0$; initial condition $x^{(0)}$.
2. For $j = 1, \ldots, m$, set
$$x^{(im+j)} = x^{(im+j-1)} + \mu\left(b_j - a_j^\top x^{(im+j-1)}\right)\frac{a_j}{a_j^\top a_j}.$$
3. Set $i := i + 1$; go to step 2.

In words, Kaczmarz's algorithm works as follows. For the first $m$ iterations ($k = 0, \ldots, m - 1$), we have
$$x^{(k+1)} = x^{(k)} + \mu\left(b_{k+1} - a_{k+1}^\top x^{(k)}\right)\frac{a_{k+1}}{a_{k+1}^\top a_{k+1}},$$
where, in each iteration, we use rows of $A$ and corresponding components of $b$ successively. For the $(m+1)$th iteration, we revert back to the first row of $A$ and the first component of $b$; that is,
$$x^{(m+1)} = x^{(m)} + \mu\left(b_1 - a_1^\top x^{(m)}\right)\frac{a_1}{a_1^\top a_1}.$$
We continue with the $(m+2)$th iteration using the second row of $A$ and second component of $b$, and so on, repeating the cycle every $m$ iterations. We can view the scalar $\mu$ as the step size of the algorithm. The reason for requiring that $\mu$ be between 0 and 2 will become apparent from the convergence analysis.
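A direct transcription of the steps above into MATLAB is given below (our sketch); it uses the data of Example 12.8, which appears later in this section, so the iterates can be compared with the figure there.

```matlab
% Kaczmarz's algorithm (our sketch; data from Example 12.8 below).
A = [1 -1; 0 1];  b = [2; 3];       % solution of Ax = b is x* = [5; 3]
mu = 1;  x = zeros(2,1);  m = size(A,1);
for k = 0:99
    j = mod(k, m) + 1;              % cycle through rows 1, ..., m
    a = A(j,:)';
    x = x + mu*(b(j) - a'*x)*a/(a'*a);
end
disp(x)                             % approaches [5; 3]
```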
We now prove the convergence of Kaczmarz's algorithm, using ideas from Kaczmarz's original paper [70] and a subsequent exposition by Parks [102].

Theorem 12.3 In Kaczmarz's algorithm, if $x^{(0)} = 0$, then $x^{(k)} \to x^* = A^\top(AA^\top)^{-1}b$ as $k \to \infty$. □

Proof. We may assume without loss of generality that $\|a_i\| = 1$, $i = 1, \ldots, m$. For if not, we simply replace each $a_i$ by $a_i/\|a_i\|$ and each $b_i$ by $b_i/\|a_i\|$.

We first introduce the following notation. For each $j = 0, 1, 2, \ldots$, let $R(j)$ denote the unique integer in $\{0, \ldots, m-1\}$ satisfying $j = lm + R(j)$ for some integer $l$; that is, $R(j)$ is the remainder that results if we divide $j$ by $m$. Using the notation above, we can write Kaczmarz's algorithm as
$$x^{(k+1)} = x^{(k)} + \mu\left(b_{R(k)+1} - a_{R(k)+1}^\top x^{(k)}\right)a_{R(k)+1}.$$
Using the identity $\|x + y\|^2 = \|x\|^2 + \|y\|^2 + 2\langle x, y\rangle$, we obtain
$$\|x^{(k+1)} - x^*\|^2 = \|x^{(k)} - x^*\|^2 + \mu^2\left(b_{R(k)+1} - a_{R(k)+1}^\top x^{(k)}\right)^2 + 2\mu\left(b_{R(k)+1} - a_{R(k)+1}^\top x^{(k)}\right)a_{R(k)+1}^\top\left(x^{(k)} - x^*\right).$$
Substituting $a_{R(k)+1}^\top x^* = b_{R(k)+1}$ into this equation, we get
$$\|x^{(k+1)} - x^*\|^2 = \|x^{(k)} - x^*\|^2 - \mu(2 - \mu)\left(b_{R(k)+1} - a_{R(k)+1}^\top x^{(k)}\right)^2 = \|x^{(k)} - x^*\|^2 - \mu(2 - \mu)\left(a_{R(k)+1}^\top\left(x^{(k)} - x^*\right)\right)^2.$$
Because $0 < \mu < 2$, the second term on the right-hand side is nonnegative, and hence
$$\|x^{(k+1)} - x^*\|^2 \le \|x^{(k)} - x^*\|^2.$$
Therefore, $\{\|x^{(k)} - x^*\|^2\}$ is a nonincreasing sequence that is bounded below, because $\|x^{(k)} - x^*\|^2 \ge 0$ for all $k$. Hence, $\{\|x^{(k)} - x^*\|^2\}$ converges (see Theorem 5.3). Furthermore, we may write
$$\|x^{(k)} - x^*\|^2 = \|x^{(0)} - x^*\|^2 - \mu(2 - \mu)\sum_{i=0}^{k-1}\left(a_{R(i)+1}^\top\left(x^{(i)} - x^*\right)\right)^2.$$
Because $\{\|x^{(k)} - x^*\|^2\}$ converges, we conclude that
$$\sum_{i=0}^{\infty}\left(a_{R(i)+1}^\top\left(x^{(i)} - x^*\right)\right)^2 < \infty,$$
which implies that $a_{R(k)+1}^\top(x^{(k)} - x^*) \to 0$. Observe that
$$\|x^{(k+1)} - x^{(k)}\|^2 = \mu^2\left(b_{R(k)+1} - a_{R(k)+1}^\top x^{(k)}\right)^2 = \mu^2\left(a_{R(k)+1}^\top\left(x^{(k)} - x^*\right)\right)^2,$$
and therefore $\|x^{(k+1)} - x^{(k)}\|^2 \to 0$. Note also that because $\{\|x^{(k)} - x^*\|^2\}$ converges, $\{x^{(k)}\}$ is a bounded sequence (see Theorem 5.2).

Following Kaczmarz [70], we introduce the notation $x^{(r,s)} = x^{(rm+s)}$, $r = 0, 1, 2, \ldots$, $s = 0, \ldots, m-1$. With this notation, we have, for each $s = 0, \ldots, m-1$,
$$a_{s+1}^\top\left(x^{(r,s)} - x^*\right) \to 0\qquad\text{as } r \to \infty.$$
Consider now the sequence $\{x^{(r,0)} : r \ge 0\}$. Because this sequence is bounded, we conclude that it has a convergent subsequence—this follows from the Bolzano-Weierstrass theorem (see [2, p. 70]; see also Section 5.1 for a discussion of sequences and subsequences). Denote this convergent subsequence by $\{x^{(r,0)} : r \in \mathcal{E}\}$, where $\mathcal{E}$ is a subset of $\{0, 1, \ldots\}$. Let $z^*$ be the limit of $\{x^{(r,0)} : r \in \mathcal{E}\}$. Hence, $a_1^\top(z^* - x^*) = 0$. Next, note that because $\|x^{(k+1)} - x^{(k)}\|^2 \to 0$ as $k \to \infty$, we also have $\|x^{(r,1)} - x^{(r,0)}\|^2 \to 0$ as $r \to \infty$. Therefore, the subsequence $\{x^{(r,1)} : r \in \mathcal{E}\}$ also converges to $z^*$. Hence, $a_2^\top(z^* - x^*) = 0$. Repeating the argument, we conclude that for each $i = 1, \ldots, m$,
$$a_i^\top(z^* - x^*) = 0.$$
In matrix notation, the equations above take the form
$$A(z^* - x^*) = 0.$$
Now, $x^{(k)} \in \mathcal{R}(A^\top)$ for all $k$ because $x^{(0)} = 0$ (see Exercise 12.25). Therefore, $z^* \in \mathcal{R}(A^\top)$, because $\mathcal{R}(A^\top)$ is closed. Hence, there exists $y^*$ such that $z^* = A^\top y^*$. Thus,
$$A(z^* - x^*) = A\left(A^\top y^* - A^\top(AA^\top)^{-1}b\right) = (AA^\top)y^* - b = 0.$$
Because $\operatorname{rank} A = m$, $y^* = (AA^\top)^{-1}b$ and hence $z^* = x^*$. Therefore, the subsequence $\{\|x^{(r,0)} - x^*\|^2 : r \in \mathcal{E}\}$ converges to 0. Because $\{\|x^{(r,0)} - x^*\|^2 : r \in \mathcal{E}\}$ is a subsequence of the convergent sequence $\{\|x^{(k)} - x^*\|^2\}$, we conclude that the sequence $\{\|x^{(k)} - x^*\|^2\}$ converges to 0; that is, $x^{(k)} \to x^*$. ∎

For the case where $x^{(0)} \neq 0$, Kaczmarz's algorithm converges to the unique point on $\{x : Ax = b\}$ minimizing the distance $\|x - x^{(0)}\|$ (see Exercise 12.26).

If we set $\mu = 1$, Kaczmarz's algorithm has the property that at each iteration $k$, the "error" $b_{R(k)+1} - a_{R(k)+1}^\top x^{(k+1)}$ satisfies
$$b_{R(k)+1} - a_{R(k)+1}^\top x^{(k+1)} = 0$$
(see Exercise 12.28). Substituting $b_{R(k)+1} = a_{R(k)+1}^\top x^*$, we may write
$$a_{R(k)+1}^\top\left(x^{(k+1)} - x^*\right) = 0.$$
Hence, the difference between $x^{(k+1)}$ and the solution $x^*$ is orthogonal to $a_{R(k)+1}$. This property is illustrated in the following example.

Example 12.8 Let
$$A = \begin{bmatrix}1 & -1\\ 0 & 1\end{bmatrix},\qquad b = \begin{bmatrix}2\\ 3\end{bmatrix}.$$
In this case, $x^* = [5, 3]^\top$. Figure 12.4 shows a few iterations of Kaczmarz's algorithm with $\mu = 1$ and $x^{(0)} = 0$. We have $a_1^\top = [1, -1]$, $a_2^\top = [0, 1]$, $b_1 = 2$, $b_2 = 3$. In Figure 12.4, the diagonal line passing through the point $[2, 0]^\top$ corresponds to the set $\{x : a_1^\top x = b_1\}$, and the horizontal line passing through the point $[0, 3]^\top$ corresponds to the set $\{x : a_2^\top x = b_2\}$. To illustrate the algorithm, we perform three iterations:
$$x^{(1)} = \begin{bmatrix}0\\ 0\end{bmatrix} + (2 - 0)\frac{1}{2}\begin{bmatrix}1\\ -1\end{bmatrix} = \begin{bmatrix}1\\ -1\end{bmatrix},$$
$$x^{(2)} = \begin{bmatrix}1\\ -1\end{bmatrix} + (3 - (-1))\begin{bmatrix}0\\ 1\end{bmatrix} = \begin{bmatrix}1\\ 3\end{bmatrix},$$
$$x^{(3)} = \begin{bmatrix}1\\ 3\end{bmatrix} + (2 - (-2))\frac{1}{2}\begin{bmatrix}1\\ -1\end{bmatrix} = \begin{bmatrix}3\\ 1\end{bmatrix}.$$

Figure 12.4 Iterations of Kaczmarz's algorithm in Example 12.8.

As Figure 12.4 illustrates, the property $a_{R(k)+1}^\top(x^{(k+1)} - x^*) = 0$ holds at every iteration. Note the convergence of the iterations of the algorithm to the solution. ■
12.5 Solving Linear Equations in General

Consider the general problem of solving a system of linear equations
$$Ax = b,$$
where $A \in \mathbb{R}^{m\times n}$ and $\operatorname{rank} A = r$. Note that we always have $r \le \min\{m, n\}$. In the case where $A \in \mathbb{R}^{n\times n}$ and $\operatorname{rank} A = n$, the unique solution to the equation above has the form $x^* = A^{-1}b$. Thus, to solve the problem in this case it is enough to know the inverse $A^{-1}$. In this section we analyze a general approach to solving $Ax = b$. The approach involves defining a pseudoinverse or generalized inverse of a given matrix $A \in \mathbb{R}^{m\times n}$, which plays the role of $A^{-1}$ when $A$ does not have an inverse (e.g., when $A$ is not a square matrix). In particular, we discuss the Moore-Penrose inverse of a given matrix $A$, denoted $A^\dagger$.

In our discussion of the Moore-Penrose inverse we use the fact that a nonzero matrix of rank $r$ can be expressed as the product of a matrix of full column rank $r$ and a matrix of full row rank $r$. Such a factorization is referred to as a full-rank factorization, a term which in this context was proposed by Gantmacher [45] and Ben-Israel and Greville [6]. We state and prove the above result in the following lemma.

Lemma 12.3 Full-Rank Factorization. Let $A \in \mathbb{R}^{m\times n}$, $\operatorname{rank} A = r \le \min\{m, n\}$. Then, there exist matrices $B \in \mathbb{R}^{m\times r}$ and $C \in \mathbb{R}^{r\times n}$ such that
$$A = BC,$$
where $\operatorname{rank} A = \operatorname{rank} B = \operatorname{rank} C = r$. □

Proof. Because $\operatorname{rank} A = r$, we can find $r$ linearly independent columns of $A$. Without loss of generality, let $a_1, a_2, \ldots, a_r$ be such columns, where $a_i$ is the $i$th column of $A$. The remaining columns of $A$ can be expressed as linear combinations of $a_1, a_2, \ldots, a_r$. Thus, a possible choice for the full-rank matrices $B$ and $C$ are
$$B = [a_1, \ldots, a_r] \in \mathbb{R}^{m\times r},\qquad C = \begin{bmatrix}1 & & & c_{1,r+1} & \cdots & c_{1,n}\\ & \ddots & & \vdots & & \vdots\\ & & 1 & c_{r,r+1} & \cdots & c_{r,n}\end{bmatrix} \in \mathbb{R}^{r\times n},$$
where the entries $c_{ij}$ are such that for each $j = r+1, \ldots, n$, we have $a_j = c_{1j}a_1 + \cdots + c_{rj}a_r$. Thus, $A = BC$. ∎

Note that if $m \le n$ and $\operatorname{rank} A = m$, then we can take $B = I_m$, $C = A$, where $I_m$ is the $m\times m$ identity matrix. If, on the other hand, $m \ge n$ and $\operatorname{rank} A = n$, then we can take $B = A$, $C = I_n$.

Example 12.9 Let $A$ be given by
$$A = \begin{bmatrix}2 & 1 & -2 & 5\\ 1 & 0 & -3 & 2\\ 3 & -1 & -13 & 5\end{bmatrix}.$$
Note that $\operatorname{rank} A = 2$. We can write a full-rank factorization of $A$ based on the proof of Lemma 12.3:
$$A = \begin{bmatrix}2 & 1\\ 1 & 0\\ 3 & -1\end{bmatrix}\begin{bmatrix}1 & 0 & -3 & 2\\ 0 & 1 & 4 & 1\end{bmatrix} = BC. \qquad\blacksquare$$

We now introduce the Moore-Penrose inverse and discuss its existence and uniqueness. For this, we first consider the matrix equation
$$AXA = A,$$
where $A \in \mathbb{R}^{m\times n}$ is a given matrix and $X \in \mathbb{R}^{n\times m}$ is a matrix we wish to determine. Observe that if $A$ is a nonsingular square matrix, then the equation above has the unique solution $X = A^{-1}$. We now define the Moore-Penrose inverse, also called the pseudoinverse or generalized inverse.

Definition 12.1 Given $A \in \mathbb{R}^{m\times n}$, a matrix $A^\dagger \in \mathbb{R}^{n\times m}$ is called a pseudoinverse of the matrix $A$ if
$$AA^\dagger A = A$$
and there exist matrices $U \in \mathbb{R}^{n\times n}$ and $V \in \mathbb{R}^{m\times m}$ such that
$$A^\dagger = UA^\top\quad\text{and}\quad A^\dagger = A^\top V. \qquad\blacksquare$$

The requirement $A^\dagger = UA^\top = A^\top V$ can be interpreted as follows: Each row of the pseudoinverse $A^\dagger$ of $A$ is a linear combination of the rows of $A^\top$, and each column of $A^\dagger$ is a linear combination of the columns of $A^\top$.

For the case in which a matrix $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ and $\operatorname{rank} A = n$, we can easily check that the following is a pseudoinverse of $A$:
$$A^\dagger = (A^\top A)^{-1}A^\top.$$
Indeed, $A(A^\top A)^{-1}A^\top A = A$, and if we define $U = (A^\top A)^{-1}$ and $V = A(A^\top A)^{-1}(A^\top A)^{-1}A^\top$, then $A^\dagger = UA^\top = A^\top V$. Note that, in fact, we have $A^\dagger A = I_n$. For this reason, $(A^\top A)^{-1}A^\top$ is often called the left pseudoinverse of $A$. This formula also appears in least-squares analysis (see Section 12.1).

For the case in which a matrix $A \in \mathbb{R}^{m\times n}$ with $m \le n$ and $\operatorname{rank} A = m$, we can easily check, as we did in the previous case, that the following is a pseudoinverse of $A$:
$$A^\dagger = A^\top(AA^\top)^{-1}.$$
Note that in this case we have $AA^\dagger = I_m$. For this reason, $A^\top(AA^\top)^{-1}$ is often called the right pseudoinverse of $A$. This formula also appears in the problem of minimizing $\|x\|$ subject to $Ax = b$ (see Section 12.3).

Theorem 12.4 Let $A \in \mathbb{R}^{m\times n}$. If a pseudoinverse $A^\dagger$ of $A$ exists, then it is unique. □

Proof. Let $A_1^\dagger$ and $A_2^\dagger$ be pseudoinverses of $A$. We show that $A_1^\dagger = A_2^\dagger$. By definition, there are matrices $U_1, U_2 \in \mathbb{R}^{n\times n}$ and $V_1, V_2 \in \mathbb{R}^{m\times m}$ such that
$$A_i^\dagger = U_iA^\top = A^\top V_i,\qquad i = 1, 2.$$
Let
$$D = A_2^\dagger - A_1^\dagger,\qquad U = U_2 - U_1,\qquad V = V_2 - V_1.$$
Then, we have
$$ADA = O,\qquad D = UA^\top = A^\top V.$$
Therefore, using the two equations above, we have
$$(DA)^\top DA = A^\top D^\top DA = A^\top V^\top ADA = O,$$
which implies that $DA = O$. On the other hand, because $DA = O$, we have
$$DD^\top = D(UA^\top)^\top = DAU^\top = O,$$
which implies that $D = O$, and hence $A_2^\dagger = A_1^\dagger$. ∎

From Theorem 12.4, we know that if a pseudoinverse matrix exists, then it is unique. Our goal now is to show that the pseudoinverse always exists. In fact, we show that the pseudoinverse of any given matrix $A$ is given by the formula
$$A^\dagger = C^\dagger B^\dagger,$$
where $B^\dagger$ and $C^\dagger$ are the pseudoinverses of the matrices $B$ and $C$ that form a full-rank factorization of $A$; that is, $A = BC$, where $B$ and $C$ are of full rank (see Lemma 12.3). Note that we already know how to compute $B^\dagger$ and $C^\dagger$:
$$B^\dagger = (B^\top B)^{-1}B^\top,\qquad C^\dagger = C^\top(CC^\top)^{-1}.$$
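Numerically, the recipe is short. The following MATLAB sketch of ours builds $A^\dagger$ for the matrix of Example 12.9 from its full-rank factorization and compares the result with MATLAB's built-in pinv:

```matlab
% Pseudoinverse from a full-rank factorization (our sketch; Example 12.9 data).
B = [2 1; 1 0; 3 -1];           % full column rank
C = [1 0 -3 2; 0 1 4 1];        % full row rank
A = B*C;
Bd = (B'*B)\B';                 % left pseudoinverse of B
Cd = C'/(C*C');                 % right pseudoinverse of C
Ad = Cd*Bd;                     % claimed pseudoinverse of A
disp(norm(Ad - pinv(A)))        % ~ 0: agrees with MATLAB's pinv
disp(2052*Ad)                   % the integer matrix of Example 12.10 below
```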
Theorem 12.5 Let a matrix $A \in \mathbb{R}^{m\times n}$ have a full-rank factorization $A = BC$, with $\operatorname{rank} A = \operatorname{rank} B = \operatorname{rank} C = r$, $B \in \mathbb{R}^{m\times r}$, $C \in \mathbb{R}^{r\times n}$. Then,
$$A^\dagger = C^\dagger B^\dagger. \qquad\Box$$

Proof. We show that $A^\dagger = C^\dagger B^\dagger = C^\top(CC^\top)^{-1}(B^\top B)^{-1}B^\top$ satisfies the conditions of Definition 12.1 for a pseudoinverse. Indeed, first observe that
$$AC^\dagger B^\dagger A = BCC^\top(CC^\top)^{-1}(B^\top B)^{-1}B^\top BC = BC = A.$$
Next, define
$$U = C^\top(CC^\top)^{-1}(B^\top B)^{-1}(CC^\top)^{-1}C,\qquad V = B(B^\top B)^{-1}(CC^\top)^{-1}(B^\top B)^{-1}B^\top.$$
It is easy to verify that the matrices $U$ and $V$ above satisfy
$$A^\dagger = C^\dagger B^\dagger = UA^\top = A^\top V.$$
Hence, $A^\dagger = C^\dagger B^\dagger$ is the pseudoinverse of $A$. ∎

Example 12.10 Continued from Example 12.9. Recall that
$$A = \begin{bmatrix}2 & 1 & -2 & 5\\ 1 & 0 & -3 & 2\\ 3 & -1 & -13 & 5\end{bmatrix} = \begin{bmatrix}2 & 1\\ 1 & 0\\ 3 & -1\end{bmatrix}\begin{bmatrix}1 & 0 & -3 & 2\\ 0 & 1 & 4 & 1\end{bmatrix} = BC.$$
We compute
$$B^\dagger = (B^\top B)^{-1}B^\top = \frac{1}{27}\begin{bmatrix}5 & 2 & 5\\ 16 & 1 & -11\end{bmatrix}$$
and
$$C^\dagger = C^\top(CC^\top)^{-1} = \frac{1}{76}\begin{bmatrix}9 & 5\\ 5 & 7\\ -7 & 13\\ 23 & 17\end{bmatrix}.$$
Thus,
$$A^\dagger = C^\dagger B^\dagger = \frac{1}{2052}\begin{bmatrix}125 & 23 & -10\\ 137 & 17 & -52\\ 173 & -1 & -178\\ 387 & 63 & -72\end{bmatrix}. \qquad\blacksquare$$

We emphasize that the formula $A^\dagger = C^\dagger B^\dagger$ does not necessarily hold if $A = BC$ is not a full-rank factorization. The following example, which is taken from [45], illustrates this point.

Example 12.11 Let $A = [1]$. Obviously, $A^\dagger = A^{-1} = A = [1]$. Observe that $A$ can be represented as
$$A = [0\ \ 1]\begin{bmatrix}1\\ 1\end{bmatrix} = BC.$$
The above is not a full-rank factorization of $A$. Let us now compute $B^\dagger$ and $C^\dagger$. We have
$$B^\dagger = B^\top(BB^\top)^{-1} = \begin{bmatrix}0\\ 1\end{bmatrix},\qquad C^\dagger = (C^\top C)^{-1}C^\top = [1/2\ \ 1/2].$$
(Note that the formulas for $B^\dagger$ and $C^\dagger$ here are different from those in Example 12.10 because of the dimensions of $B$ and $C$ in this example.) Thus,
$$C^\dagger B^\dagger = [1/2],$$
which is not equal to $A^\dagger$. ■

We can simplify the expression $A^\dagger = C^\dagger B^\dagger$ to
$$A^\dagger = C^\top(CC^\top)^{-1}(B^\top B)^{-1}B^\top = C^\top\left(B^\top AC^\top\right)^{-1}B^\top.$$
The expression above is easily verified simply by substituting $A = BC$. This explicit formula for $A^\dagger$ is attributed to C. C. MacDuffee by Ben-Israel and Greville [6]. Ben-Israel and Greville report that around 1959, MacDuffee was the first to point out that a full-rank factorization of $A$ leads to the above explicit formula. However, they mention that MacDuffee did it in a private communication, so there is no published work by MacDuffee that contains the result.
We now prove two important properties of $A^\dagger$ in the context of solving a system of linear equations $Ax = b$.

Theorem 12.6 Consider a system of linear equations $Ax = b$, $A \in \mathbb{R}^{m\times n}$, $\operatorname{rank} A = r$. The vector $x^* = A^\dagger b$ minimizes $\|Ax - b\|^2$ on $\mathbb{R}^n$. Furthermore, among all vectors in $\mathbb{R}^n$ that minimize $\|Ax - b\|^2$, the vector $x^* = A^\dagger b$ is the unique vector with minimal norm. □

Proof. We first show that $x^* = A^\dagger b$ minimizes $\|Ax - b\|^2$ over $\mathbb{R}^n$. To this end, observe that for any $x \in \mathbb{R}^n$,
$$\|Ax - b\|^2 = \|A(x - x^*) + (Ax^* - b)\|^2 = \|A(x - x^*)\|^2 + \|Ax^* - b\|^2 + 2[A(x - x^*)]^\top(Ax^* - b).$$
We now show that $[A(x - x^*)]^\top(Ax^* - b) = 0$. Indeed,
$$[A(x - x^*)]^\top(Ax^* - b) = (x - x^*)^\top\left(A^\top Ax^* - A^\top b\right) = (x - x^*)^\top\left(A^\top AA^\dagger b - A^\top b\right).$$
However,
$$A^\top AA^\dagger = C^\top B^\top BCC^\top(CC^\top)^{-1}(B^\top B)^{-1}B^\top = C^\top B^\top = A^\top.$$
Hence,
$$[A(x - x^*)]^\top(Ax^* - b) = (x - x^*)^\top\left(A^\top b - A^\top b\right) = 0.$$
Thus, we have
$$\|Ax - b\|^2 = \|A(x - x^*)\|^2 + \|Ax^* - b\|^2.$$
Because $\|A(x - x^*)\|^2 \ge 0$, we obtain
$$\|Ax - b\|^2 \ge \|Ax^* - b\|^2,$$
and thus $x^*$ minimizes $\|Ax - b\|^2$.

We now show that among all $x$ that minimize $\|Ax - b\|^2$, the vector $x^* = A^\dagger b$ is the unique vector with minimum norm. So let $x$ be a vector minimizing $\|Ax - b\|^2$. We have
$$\|x\|^2 = \|(x - x^*) + x^*\|^2 = \|x - x^*\|^2 + \|x^*\|^2 + 2x^{*\top}(x - x^*).$$
Observe that $x^{*\top}(x - x^*) = 0$. To see this, note that
$$x^{*\top}(x - x^*) = (A^\dagger b)^\top(x - A^\dagger b) = b^\top B(B^\top B)^{-\top}(CC^\top)^{-\top}C\left(x - C^\top(CC^\top)^{-1}(B^\top B)^{-1}B^\top b\right) = b^\top B(B^\top B)^{-\top}(CC^\top)^{-\top}\left[Cx - (B^\top B)^{-1}B^\top b\right],$$
where the superscript $-\top$ denotes the transpose of the inverse. Now, $\|Ax - b\|^2 = \|B(Cx) - b\|^2$. Because $x$ minimizes $\|Ax - b\|^2$ and $C$ is of full rank, $y^* = Cx$ minimizes $\|By - b\|^2$ over $\mathbb{R}^r$ (see Exercise 12.29). Because $B$ is of full rank, by Theorem 12.1 we have
$$Cx = y^* = (B^\top B)^{-1}B^\top b.$$
Substituting this into the equation above, we get $x^{*\top}(x - x^*) = 0$. Therefore, we have
$$\|x\|^2 = \|x^*\|^2 + \|x - x^*\|^2.$$
For all $x \neq x^*$, we have $\|x - x^*\|^2 > 0$, and hence
$$\|x\|^2 > \|x^*\|^2$$
or, equivalently, $\|x\| > \|x^*\|$. Hence, among all vectors minimizing $\|Ax - b\|^2$, the vector $x^* = A^\dagger b$ is the unique vector with minimum norm. ∎

The generalized inverse has the following useful properties (see Exercise 12.30):

a. $(A^\top)^\dagger = (A^\dagger)^\top$.

b. $(A^\dagger)^\dagger = A$.

These two properties are similar to those that are satisfied by the usual matrix inverse. However, we point out that the property $(A_1A_2)^\dagger = A_2^\dagger A_1^\dagger$ does not hold in general (see Exercise 12.32).

Finally, it is important to note that we can define the generalized inverse in an alternative way, following the definition of Penrose. Specifically, the Penrose definition of the generalized inverse of a matrix $A \in \mathbb{R}^{m\times n}$ is the unique matrix $A^\dagger \in \mathbb{R}^{n\times m}$ satisfying the following properties:

1. $AA^\dagger A = A$.
2. $A^\dagger AA^\dagger = A^\dagger$.
3. $(AA^\dagger)^\top = AA^\dagger$.
4. $(A^\dagger A)^\top = A^\dagger A$.

The Penrose definition above is equivalent to Definition 12.1 (see Exercise 12.31). For more information on generalized inverses and their applications, we refer the reader to the books by Ben-Israel and Greville [6] and Campbell and Meyer [23].
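As a quick sanity check, the four Penrose conditions can be verified numerically. The MATLAB lines below (our sketch) use the matrix of Example 12.9 and MATLAB's built-in pinv, which computes the Moore-Penrose inverse:

```matlab
% Checking the four Penrose conditions (our sketch; Example 12.9 data).
A  = [2 1 -2 5; 1 0 -3 2; 3 -1 -13 5];
Ad = pinv(A);
disp(norm(A*Ad*A - A))        % condition 1
disp(norm(Ad*A*Ad - Ad))      % condition 2
disp(norm((A*Ad)' - A*Ad))    % condition 3
disp(norm((Ad*A)' - Ad*A))    % condition 4
```

All four displayed values should be zero up to rounding error.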
EXERCISES

12.1 A rock is accelerated to 3, 5, and 6 m/s² by applying forces of 1, 2, and 3 N, respectively. Assuming that Newton's law $F = ma$ holds, where $F$ is the force and $a$ is the acceleration, estimate the mass $m$ of the rock using the least-squares method.

12.2 A spring is stretched to lengths $L = 3$, 4, and 5 cm under applied forces $F = 1$, 2, and 4 N, respectively. Assuming that Hooke's law $L = a + bF$ holds, estimate the normal length $a$ and spring constant $b$ using the least-squares approach.

12.3 Suppose that we perform an experiment to calculate the gravitational constant $g$ as follows. We drop a ball from a certain height and measure its distance from the original point at certain time instants. The results of the experiment are shown in the following table:

    Time (seconds)    | 1.00 | 2.00 | 3.00
    Distance (meters) | 5.00 | 19.5 | 44.0

The equation relating the distance $s$ and the time $t$ at which $s$ is measured is given by
$$s = \frac{1}{2}gt^2.$$

a. Find a least-squares estimate of $g$ using the experimental results from the table above.

b. Suppose that we take an additional measurement at time 4.00 and obtain a distance of 78.5. Use the recursive least-squares algorithm to calculate an updated least-squares estimate of $g$.

12.4 Suppose that we have a speech signal, represented as a finite sequence of real numbers $x_1, x_2, \ldots, x_n$. Suppose that we record this signal onto magnetic tape, and the recorded speech signal is represented by another sequence of real numbers $y_1, y_2, \ldots, y_n$. Suppose that we model the recording process as a simple scaling of the original signal; that is, we believe that a good model of the relationship between the recorded signal and the original signal is $y_i = ax_i$ for some constant $a$ that does not depend on $i$. Suppose that we know exactly the original signal $x_1, x_2, \ldots, x_n$ as well as the recorded signal $y_1, y_2, \ldots, y_n$. Use the least-squares method to find a formula for estimating the scale factor $a$ given these data. (You may assume that at least one $x_i$ is nonzero.)

12.5 Suppose that we wish to estimate the value of the resistance $R$ of a resistor. Ohm's law states that if $V$ is the voltage across the resistor and $I$ is the current through the resistor, then $V = IR$. To estimate $R$, we apply a 1-ampere current through the resistor and measure the voltage across it. We perform the experiment on $n$ voltage-measuring devices and obtain measurements of $V_1, \ldots, V_n$. Show that the least-squares estimate of $R$ is simply the average of $V_1, \ldots, V_n$.

12.6 The table below shows the stock prices for three companies, X, Y, and Z, tabulated over three days:

        | Day 1 | Day 2 | Day 3
    X   |   6   |   4   |   5
    Y   |   1   |   1   |   3
    Z   |   2   |   1   |   2

Suppose that an investment analyst proposes a model for predicting the stock price of X based on those of Y and Z:
$$p_X = ap_Y + bp_Z,$$
where $p_X$, $p_Y$, and $p_Z$ are the stock prices of X, Y, and Z, respectively, and $a$ and $b$ are real-valued parameters. Calculate the least-squares estimate of the parameters $a$ and $b$ based on the data in the table above.

12.7 We are given two mixtures, A and B. Mixture A contains 30% gold, 40% silver, and 30% platinum, whereas mixture B contains 10% gold, 20% silver, and 70% platinum (all percentages of weight). We wish to determine the ratio of the weight of mixture A to the weight of mixture B such that we have as close as possible to a total of 5 ounces of gold, 3 ounces of silver, and 4 ounces of platinum. Formulate and solve the problem using the linear least-squares method.

12.8 Background: If $Ax + w = b$, where $w$ is a "white noise" vector, then define the least-squares estimate of $x$ given $b$ to be the solution to the problem
$$\text{minimize}\quad \|Ax - b\|^2.$$
This problem is related to Wiener filtering.

Application: Suppose that a given speech signal $\{u_k : k = 1, \ldots, n\}$ (with $u_k \in \mathbb{R}$) is transmitted over a telephone cable with input-output behavior given by
$$y_k = ay_{k-1} + bu_k + v_k,$$
where, at each time $k$, $y_k \in \mathbb{R}$ is the output, $u_k \in \mathbb{R}$ is the input (speech signal value), and $v_k$ represents white noise. The parameters $a$ and $b$ are fixed known constants, and the initial condition is $y_0 = 0$. We can measure the signal $\{y_k\}$ at the output of the telephone cable, but we cannot directly measure the desired signal $\{u_k\}$ or the noise signal $\{v_k\}$. Derive a formula for the linear least-squares estimate of the signal $\{u_k : k = 1, \ldots, n\}$ given the signal $\{y_k : k = 1, \ldots, n\}$.

Note: Even though the vector $v = [v_1, \ldots, v_n]^\top$ is a white noise vector, the vector $Dv$ (where $D$ is a matrix) is, in general, not.

12.9 Line Fitting. Let $[x_1, y_1]^\top, \ldots, [x_p, y_p]^\top$, $p \ge 2$, be points in $\mathbb{R}^2$. We wish to find the straight line of best fit through these points ("best" in the sense that the total squared error is minimized); that is, we wish to find $a^*, b^* \in \mathbb{R}$ to minimize
$$f(a, b) = \sum_{i=1}^p(ax_i + b - y_i)^2.$$
Assume that the $x_i$, $i = 1, \ldots, p$, are not all equal. Show that there exist unique parameters $a^*$ and $b^*$ for the line of best fit, and find the parameters in terms of the following quantities:
$$\overline{X} = \frac{1}{p}\sum_{i=1}^p x_i,\qquad \overline{Y} = \frac{1}{p}\sum_{i=1}^p y_i,\qquad \overline{X^2} = \frac{1}{p}\sum_{i=1}^p x_i^2,\qquad \overline{XY} = \frac{1}{p}\sum_{i=1}^p x_iy_i.$$

12.10 Suppose that we take measurements of a sinusoidal signal $y(t) = \sin(\omega t + \theta)$ at times $t_1, \ldots, t_p$, and obtain values $y_1, \ldots, y_p$, where $-\pi/2 \le \omega t_i + \theta \le \pi/2$, $i = 1, \ldots, p$, and the $t_i$ are not all equal. We wish to determine the values of the frequency $\omega$ and phase $\theta$.
a. Express the problem as a system of linear equations.

b. Find the least-squares estimate of $\omega$ and $\theta$ based on part a. Use the following notation:
$$\overline{T} = \frac{1}{p}\sum_{i=1}^p t_i,\qquad \overline{T^2} = \frac{1}{p}\sum_{i=1}^p t_i^2,\qquad \overline{TY} = \frac{1}{p}\sum_{i=1}^p t_i\arcsin y_i,\qquad \overline{Y} = \frac{1}{p}\sum_{i=1}^p \arcsin y_i.$$

12.11 We are given a point $[x_0, y_0]^\top \in \mathbb{R}^2$. Consider the straight line on the $\mathbb{R}^2$ plane given by the equation $y = mx$. Using a least-squares formulation, find the point on the straight line that is closest to the given point $[x_0, y_0]^\top$, where the measure of closeness is in terms of the Euclidean norm on $\mathbb{R}^2$.

Hint: The given line can be expressed as the range of the matrix $A = [1, m]^\top$.

12.12 Consider the affine function $f: \mathbb{R}^n \to \mathbb{R}$ of the form $f(x) = a^\top x + c$, where $a \in \mathbb{R}^n$ and $c \in \mathbb{R}$.

a. We are given a set of $p$ pairs $(x_1, y_1), \ldots, (x_p, y_p)$, where $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, $i = 1, \ldots, p$. We wish to find the affine function of best fit to these points, where "best" is in the sense of minimizing the total square error
$$\sum_{i=1}^p(f(x_i) - y_i)^2.$$
Formulate the above as an optimization problem of the form: minimize $\|Az - b\|^2$ with respect to $z$. Specify the dimensions of $A$, $z$, and $b$.

b. Suppose that the points satisfy $x_1 + \cdots + x_p = 0$ and $y_1x_1 + \cdots + y_px_p = 0$. Find the affine function of best fit in this case, assuming that it exists and is unique.

Figure 12.5 Input-output system in Exercise 12.13.

12.13 For the system shown in Figure 12.5, consider a set of input-output pairs $(u_1, y_1), \ldots, (u_n, y_n)$, where $u_k \in \mathbb{R}$, $y_k \in \mathbb{R}$, $k = 1, \ldots, n$.

a. Suppose that we wish to find the best linear estimate of the system based on the input-output data above. In other words, we wish to find a $\theta_n \in \mathbb{R}$ to fit the model $y_k = \theta_nu_k$, $k = 1, \ldots, n$. Using the least-squares approach, derive a formula for $\theta_n$ based on $u_1, \ldots, u_n$ and $y_1, \ldots, y_n$.

b. Suppose that the data in part a are generated by $y_k = \theta u_k + e_k$, where $\theta \in \mathbb{R}$ and $u_k = 1$ for all $k$. Show that the parameter $\theta_n$ in part a converges to $\theta$ as $n \to \infty$ if and only if
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n e_k = 0.$$

12.14 Consider a discrete-time linear system $x_{k+1} = ax_k + bu_k$, where $u_k$ is the input at time $k$, $x_k$ is the output at time $k$, and $a, b \in \mathbb{R}$ are system parameters. Suppose that we apply a constant input $u_k = 1$ for all $k \ge 0$ and measure the first four values of the output to be $x_0 = 0$, $x_1 = 1$, $x_2 = 2$, $x_3 = 8$. Find the least-squares estimate of $a$ and $b$ based on the data above.

12.15 Consider a discrete-time linear system $x_{k+1} = ax_k + bu_k$, where $u_k$ is the input at time $k$, $x_k$ is the output at time $k$, and $a, b \in \mathbb{R}$ are system parameters. Given the first $n+1$ values of the impulse response $h_0, \ldots, h_n$, find the least-squares estimate of $a$ and $b$. You may assume that at least one $h_k$ is nonzero.

Note: The impulse response is the output sequence resulting from an input of $u_0 = 1$, $u_k = 0$ for $k \neq 0$, and zero initial condition $x_0 = 0$.

12.16 Consider a discrete-time linear system $x_{k+1} = ax_k + bu_k$, where $u_k$ is the input at time $k$, $x_k$ is the output at time $k$, and $a, b \in \mathbb{R}$ are system parameters. Given the first $n+1$ values of the step response $s_0, \ldots, s_n$, where $n \ge 1$, find the least-squares estimate of $a$ and $b$. You may assume that at least one $s_k$ is nonzero.

Note: The step response is the output sequence resulting from an input of $u_k = 1$ for $k \ge 0$, and zero initial condition $x_0 = 0$ (i.e., $s_0 = x_0 = 0$).

12.17 Consider a known discrete-time signal on the time interval $\{1, \ldots, n\}$, represented by the vector $x \in \mathbb{R}^n$ ($x_i$ is the value of the signal at time $i$).
We transmit the signal $ax$ over a communication channel, where $a \in \mathbb{R}$ represents the "amplitude" of the transmission, a quantity unknown to the receiver. The receiver receives a signal $y \in \mathbb{R}^n$, which is a distorted version of the transmitted signal (so that $y$ may not be equal to $ax$ for any $a$). Formulate the problem of estimating the quantity $a$ according to a least-squares criterion, and solve it (stating whatever appropriate assumptions are necessary, if any).

12.18 Let $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, $m \ge n$, and $\operatorname{rank} A = n$. Consider the constrained optimization problem
$$\begin{array}{ll}\text{minimize} & \frac{1}{2}x^\top x - x^\top b\\ \text{subject to} & x \in \mathcal{R}(A),\end{array}$$
where $\mathcal{R}(A)$ denotes the range of $A$. Derive an expression for the global minimizer of this problem in terms of $A$ and $b$.

12.19 Solve the problem
$$\begin{array}{ll}\text{minimize} & \|x - x_0\|\\ \text{subject to} & [1\ \ 1\ \ 1]x = 1,\end{array}$$
where $x_0 = [0, -3, 0]^\top$.

12.20 Let $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, $m \le n$, $\operatorname{rank} A = m$, and $x_0 \in \mathbb{R}^n$. Consider the problem
$$\begin{array}{ll}\text{minimize} & \|x - x_0\|\\ \text{subject to} & Ax = b.\end{array}$$
Show that this problem has a unique solution given by
$$x^* = A^\top(AA^\top)^{-1}b + \left(I_n - A^\top(AA^\top)^{-1}A\right)x_0.$$

12.21 Given $A \in \mathbb{R}^{m\times n}$, $m \ge n$, $\operatorname{rank} A = n$, and $b_1, \ldots, b_p \in \mathbb{R}^m$, consider the problem
$$\text{minimize}\quad \|Ax - b_1\|^2 + \|Ax - b_2\|^2 + \cdots + \|Ax - b_p\|^2.$$
Suppose that $x_i^*$ is a solution to the problem: minimize $\|Ax - b_i\|^2$, where $i = 1, \ldots, p$. Write the solution to the problem above in terms of $x_1^*, \ldots, x_p^*$.

12.22 Given $A \in \mathbb{R}^{m\times n}$, $m \ge n$, $\operatorname{rank} A = n$, $b_1, \ldots, b_p \in \mathbb{R}^m$, and $\alpha_1, \ldots, \alpha_p \in \mathbb{R}$, consider the problem
$$\text{minimize}\quad \alpha_1\|Ax - b_1\|^2 + \alpha_2\|Ax - b_2\|^2 + \cdots + \alpha_p\|Ax - b_p\|^2.$$
Suppose that $x_i^*$ is the solution to the problem: minimize $\|Ax - b_i\|^2$, where $i = 1, \ldots, p$. Assuming that $\alpha_1 + \cdots + \alpha_p > 0$, derive a simple expression for the solution to this problem in terms of $x_1^*, \ldots, x_p^*$ and $\alpha_1, \ldots, \alpha_p$.

12.23 Let $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, $m \le n$, and $\operatorname{rank} A = m$. Show that $x^* = A^\top(AA^\top)^{-1}b$ is the only vector in $\mathcal{R}(A^\top)$ satisfying $Ax^* = b$.

12.24 The purpose of this exercise is to derive a recursive least-squares algorithm where we remove (instead of add) a data point. To formulate the algorithm, suppose that we are given matrices $A_0$ and $A_1$ such that
$$A_0 = \begin{bmatrix}A_1\\ a_1^\top\end{bmatrix},$$
where $a_1 \in \mathbb{R}^n$. Similarly, suppose that the vectors $b^{(0)}$ and $b^{(1)}$ satisfy
$$b^{(0)} = \begin{bmatrix}b^{(1)}\\ b_1\end{bmatrix},$$
where $b_1 \in \mathbb{R}$. Let $x^{(0)}$ be the least-squares solution associated with $(A_0, b^{(0)})$ and $x^{(1)}$ the least-squares solution associated with $(A_1, b^{(1)})$. Our goal is to write $x^{(1)}$ in terms of $x^{(0)}$ and the data point "removed," $(a_1, b_1)$. As usual, let $G_0$ and $G_1$ be the Grammians associated with $x^{(0)}$ and $x^{(1)}$, respectively.

a. Write down expressions for the least-squares solutions $x^{(0)}$ and $x^{(1)}$ in terms of $A_0$, $b^{(0)}$, $A_1$, and $b^{(1)}$.

b. Derive a formula for $G_1$ in terms of $G_0$ and $a_1$.

c. Let $P_0 = G_0^{-1}$ and $P_1 = G_1^{-1}$. Derive a formula for $P_1$ in terms of $P_0$ and $a_1$. (The formula must not contain any matrix inversions.)

d. Derive a formula for $A_1^\top b^{(1)}$ in terms of $G_1$, $x^{(0)}$, $a_1$, and $b_1$.

e. Finally, derive a formula for $x^{(1)}$ in terms of $x^{(0)}$, $P_1$, $a_1$, and $b_1$. Use this and part c to write a recursive algorithm associated with successive removals of rows.

12.25 Show that in Kaczmarz's algorithm, if $x^{(0)} = 0$, then $x^{(k)} \in \mathcal{R}(A^\top)$ for all $k$.

12.26 Consider Kaczmarz's algorithm with $x^{(0)} \neq 0$.

a. Show that there exists a unique point minimizing $\|x - x^{(0)}\|$ subject to $\{x : Ax = b\}$.

b. Show that Kaczmarz's algorithm converges to the point in part a.

12.27 Consider Kaczmarz's algorithm with $x^{(0)} = 0$, where $m = 1$; that is, $A = [a^\top] \in \mathbb{R}^{1\times n}$ and $a \neq 0$, and $0 < \mu < 2$.
Show that there exists $0 < \gamma < 1$ such that $\|x^{(k+1)} - x^*\| \le \gamma\|x^{(k)} - x^*\|$ for all $k \ge 0$.

12.28 Show that in Kaczmarz's algorithm, if $\mu = 1$, then $b_{R(k)+1} - a_{R(k)+1}^\top x^{(k+1)} = 0$ for each $k$.

12.29 Consider the problem of minimizing $\|Ax - b\|^2$ over $\mathbb{R}^n$, where $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$. Let $x^*$ be a solution. Suppose that $A = BC$ is a full-rank factorization of $A$; that is, $\operatorname{rank} A = \operatorname{rank} B = \operatorname{rank} C = r$, and $B \in \mathbb{R}^{m\times r}$, $C \in \mathbb{R}^{r\times n}$. Show that the minimizer of $\|By - b\|$ over $\mathbb{R}^r$ is $Cx^*$.

12.30 Prove the following properties of generalized inverses:

a. $(A^\top)^\dagger = (A^\dagger)^\top$.

b. $(A^\dagger)^\dagger = A$.

12.31 Show that the Penrose definition of the generalized inverse is equivalent to Definition 12.1.

12.32 Construct matrices $A_1$ and $A_2$ such that $(A_1A_2)^\dagger \neq A_2^\dagger A_1^\dagger$.
CHAPTER 13

UNCONSTRAINED OPTIMIZATION AND NEURAL NETWORKS

13.1 Introduction

In this chapter we apply the techniques of previous chapters to the training of feedforward neural networks. Neural networks have found numerous practical applications, ranging from telephone echo cancellation to aiding in the interpretation of EEG data (see, e.g., [108] and [72]). The essence of neural networks lies in the connection weights between neurons. The selection of these weights is referred to as training or learning. For this reason, we often refer to the weights as the learning parameters. A popular method for training a neural network is the backpropagation algorithm, based on an unconstrained optimization problem and an associated gradient algorithm applied to the problem. This chapter is devoted to a description of neural networks and the use of techniques developed in preceding chapters for the training of neural networks.

An artificial neural network is a circuit composed of interconnected simple circuit elements called neurons. Each neuron represents a map, typically with multiple inputs and a single output. Specifically, the output of the neuron is a function of the sum of the inputs, as illustrated in Figure 13.1. The function at the output of the neuron is called the activation function. We use the symbol shown in Figure 13.2 to represent a single neuron. Note that the single output of the neuron may be used as an input to several other neurons, and therefore the symbol for a single neuron has multiple arrows emanating from it. A neural network may be implemented using an analog circuit. In this case inputs and outputs may be represented by currents and voltages.

Figure 13.1 Single neuron.

Figure 13.2 Symbol for a single neuron.

A neural network consists of interconnected neurons, with the inputs to each neuron consisting of weighted outputs of other neurons. The interconnections allow the exchange of data or information between neurons. In a feedforward neural network, the neurons are interconnected in layers, so that the data flow in only one direction. Thus, each neuron receives information only from neurons in the preceding layer: The inputs to each neuron are weighted outputs of neurons in the preceding layer. Figure 13.3 illustrates the structure of feedforward neural networks. The first layer in the network is called the input layer, and the last layer is called the output layer. The layers in between the input and output layers are called hidden layers.

Figure 13.3 Structure of a feedforward neural network.

We can view a neural network as simply a particular implementation of a map from $\mathbb{R}^n$ to $\mathbb{R}^m$, where $n$ is the number of inputs $x_1, \ldots, x_n$ and $m$ is the number of outputs $y_1, \ldots, y_m$. The map that is implemented by a neural network depends on the weights of the interconnections in the network. Therefore, we can change the mapping that is implemented by the network by adjusting the values of the weights in the network. The information about the mapping is "stored" in the weights over all the neurons, and thus the neural network is a distributed representation of the mapping. Moreover, for any given input, computation of the corresponding output is achieved through the collective effect of individual input-output characteristics of each neuron; therefore, the neural network can be considered as a parallel computation device. We point out that the ability to implement or approximate a map encompasses many important practical applications. For example, pattern recognition and classification problems can be viewed as function implementation or approximation problems.

Suppose that we are given a map $F: \mathbb{R}^n \to \mathbb{R}^m$ that we wish to implement using a given neural network. Our task boils down to selecting the interconnection weights in the network appropriately. As mentioned earlier, we refer to this task as training of the neural network or learning by the neural network. We use input-output examples of the given map to train the neural network. Specifically, let $(x_{d,1}, y_{d,1}), \ldots, (x_{d,p}, y_{d,p}) \in \mathbb{R}^n \times \mathbb{R}^m$, where each $y_{d,i}$ is the output of the map $F$ corresponding to the input $x_{d,i}$; that is, $y_{d,i} = F(x_{d,i})$. We refer to the set $\{(x_{d,1}, y_{d,1}), \ldots, (x_{d,p}, y_{d,p})\}$ as the training set. We train the neural network by adjusting the weights such that the map that is implemented by the network is close to the desired map $F$. For this reason, we can think of neural networks as function approximators.

The form of learning described above can be thought of as learning with a teacher. The teacher supplies questions to the network in the form of $x_{d,1}, \ldots, x_{d,p}$ and tells the network the correct answers $y_{d,1}, \ldots, y_{d,p}$. Training of the network then comprises applying a training algorithm that adjusts weights based on the error between the computed and desired outputs, that is, the difference between $y_{d,i} = F(x_{d,i})$ and the output of the neural network corresponding to $x_{d,i}$. Having trained the neural network, our hope is that the network correctly generalizes the examples used in the training set. By this we mean that the network should correctly implement the mapping $F$ and produce the correct output corresponding to any input, including those that were not a part of the training set.

As we shall see in the remainder of this chapter, the training problem can be formulated as an optimization problem. We can then use optimization techniques and search methods (e.g., steepest descent, conjugate gradients [69], and quasi-Newton) for selection of the weights. The training algorithms are based on such optimization algorithms.

In the literature, for obvious reasons, the form of learning described above is referred to as supervised learning, a term which suggests that there is also a form of learning called unsupervised learning. Indeed, this is the case. However, unsupervised learning does not fit into the framework described above. Therefore, we do not discuss the idea of unsupervised learning any further. We refer the interested reader to [60].
13.2 Single-Neuron Training

Consider a single neuron, as shown in Figure 13.4. For this particular neuron, the activation function is simply the identity (a linear function with unit slope). The neuron implements the following (linear) map from $\mathbb{R}^n$ to $\mathbb{R}$:
$$y = \sum_{i=1}^{n} w_i x_i = x^\top w,$$
where $x = [x_1, \ldots, x_n]^\top \in \mathbb{R}^n$ is the vector of inputs, $y \in \mathbb{R}$ is the output, and $w = [w_1, \ldots, w_n]^\top \in \mathbb{R}^n$ is the vector of weights.

Figure 13.4 Single linear neuron.

Suppose that we are given a map $F : \mathbb{R}^n \to \mathbb{R}$. We wish to find the values of the weights $w_1, \ldots, w_n$ such that the neuron approximates $F$ as closely as possible. To do this, we use a training set consisting of $p$ pairs $\{(x_{d,1}, y_{d,1}), \ldots, (x_{d,p}, y_{d,p})\}$, where $x_{d,i} \in \mathbb{R}^n$ and $y_{d,i} \in \mathbb{R}$, $i = 1, \ldots, p$. For each $i$, $y_{d,i} = F(x_{d,i})$ is the "desired" output corresponding to the given input $x_{d,i}$. The training problem can then be formulated as the following optimization problem:
$$\text{minimize } \frac{1}{2}\sum_{i=1}^{p} \left(y_{d,i} - x_{d,i}^\top w\right)^2,$$
where the minimization is taken over all $w = [w_1, \ldots, w_n]^\top \in \mathbb{R}^n$. Note that the objective function represents the sum of the squared errors between the desired outputs $y_{d,i}$ and the corresponding outputs of the neuron $x_{d,i}^\top w$. The factor of $1/2$ is added for notational convenience and does not change the minimizer.

The objective function above can be written in matrix form. First define the matrix $X_d \in \mathbb{R}^{n \times p}$ and vector $y_d \in \mathbb{R}^p$ by
$$X_d = [x_{d,1} \cdots x_{d,p}], \qquad y_d = [y_{d,1}, \ldots, y_{d,p}]^\top.$$
Then, the optimization problem becomes
$$\text{minimize } \frac{1}{2}\|y_d - X_d^\top w\|^2.$$
There are two cases to consider in this optimization problem: $p < n$ and $p > n$. We first consider the case where $p < n$; that is, where we have fewer training pairs than weights. For convenience, we assume that $\operatorname{rank} X_d = p$. In this case there are infinitely many points satisfying $y_d = X_d^\top w$. Hence, there are infinitely many solutions to the optimization problem above, with the optimal objective function value of 0, and we have a choice of which optimal solution to select. A possible criterion for this selection is that of minimizing the solution norm. This is exactly the problem considered in Section 12.3. Recall that the minimum-norm solution is
$$w^* = X_d (X_d^\top X_d)^{-1} y_d.$$
An efficient iterative algorithm for finding this solution is Kaczmarz's algorithm (discussed in Section 12.4), which in this setting takes the form
$$w^{(k+1)} = w^{(k)} + \mu \frac{e_k}{\|x_{d,R(k)+1}\|^2}\, x_{d,R(k)+1},$$
where $w^{(0)} = 0$ and
$$e_k = y_{d,R(k)+1} - x_{d,R(k)+1}^\top w^{(k)}.$$
Recall that $R(k)$ is the unique integer in $\{0, \ldots, p-1\}$ satisfying $k = lp + R(k)$ for some integer $l$; that is, $R(k)$ is the remainder that results if we divide $k$ by $p$ (see Section 12.4 for further details on the algorithm). The algorithm above was applied to the training of linear neurons by Widrow and Hoff (see [132] for some historical remarks). The single neuron together with the training algorithm above is illustrated in Figure 13.5 and is often called Adaline, an acronym for adaptive linear element.

Figure 13.5 Adaline.
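To make the update above concrete, the following minimal MATLAB sketch runs the Widrow-Hoff (Kaczmarz) training loop for a linear neuron. The training data, step size, and iteration count are illustrative assumptions, not values from the text.

% Widrow-Hoff (Kaczmarz) training of a linear neuron: a sketch.
% Assumed example data: p = 2 training pairs, n = 3 weights (p < n).
Xd = [1 0; 0 1; 1 1];        % n x p matrix; columns are x_{d,i}
yd = [1; 2];                 % desired outputs
mu = 1;                      % step size
p  = size(Xd, 2);
w  = zeros(size(Xd, 1), 1);  % w^(0) = 0 yields the minimum-norm solution
for k = 0:99
    i = mod(k, p) + 1;                   % R(k) + 1: cycle through the pairs
    x = Xd(:, i);
    e = yd(i) - x' * w;                  % error e_k
    w = w + mu * (e / norm(x)^2) * x;    % Kaczmarz update
end
disp(w')   % converges to the minimum-norm solution Xd*(Xd'*Xd)^(-1)*yd

With $\mu = 1$, each update makes the neuron reproduce the current training pair exactly (compare Exercise 12.28).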
We now consider the case where $p > n$. Here, we have more training pairs than weights. We assume that $\operatorname{rank} X_d = n$. In this case the objective function $\frac{1}{2}\|y_d - X_d^\top w\|^2$ is simply a strictly convex quadratic function of $w$, because the matrix $X_d X_d^\top$ is positive definite. To solve this optimization problem, we have at our disposal the whole slew of unconstrained optimization algorithms considered in earlier chapters. For example, we can use a gradient algorithm, which in this case takes the form
$$w^{(k+1)} = w^{(k)} + \alpha_k X_d e^{(k)},$$
where $e^{(k)} = y_d - X_d^\top w^{(k)}$.

The discussion above assumed that the activation function for the neuron is the identity map. The derivation and analysis of the algorithms can be extended to the case of a general differentiable activation function $f_a$. Specifically, the output of the neuron in this case is given by
$$y = f_a\left(\sum_{i=1}^{n} w_i x_i\right) = f_a(x^\top w).$$
The gradient algorithm for the case of a single training pair $(x_d, y_d)$ then has the form
$$w^{(k+1)} = w^{(k)} + \alpha_k e_k f_a'(x_d^\top w^{(k)})\, x_d,$$
where the error is given by $e_k = y_d - f_a(x_d^\top w^{(k)})$. For a convergence analysis of the algorithm above, see [64].
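As an illustration, the MATLAB fragment below runs this update with a sigmoid activation $f_a(v) = 1/(1 + e^{-v})$, for which $f_a'(v) = f_a(v)(1 - f_a(v))$. The training pair, step size, and iteration count are assumptions made for the example.

% Gradient training of a single neuron with sigmoid activation: a sketch.
fa  = @(v) 1 ./ (1 + exp(-v));           % sigmoid activation
dfa = @(v) fa(v) .* (1 - fa(v));         % its derivative
xd = [0.2; 0.6];                         % assumed training input
yd = 0.7;                                % assumed desired output
w  = [0.1; 0.3];                         % assumed initial weights
alpha = 5;                               % fixed step size (assumed)
for k = 1:1000
    v = xd' * w;
    e = yd - fa(v);                      % error e_k
    w = w + alpha * e * dfa(v) * xd;     % gradient update
end
fprintf('output after training: %.4f\n', fa(xd' * w))   % approximately 0.7000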
13.3 The Backpropagation Algorithm

In Section 13.2 we considered the problem of training a single neuron. In this section we consider a neural network consisting of many layers. For simplicity of presentation, we restrict our attention to networks with three layers, as depicted in Figure 13.6. The three layers are referred to as the input, hidden, and output layers. There are $n$ inputs $x_i$, $i = 1, \ldots, n$, and $m$ outputs $y_s$, $s = 1, \ldots, m$. There are $l$ neurons in the hidden layer, whose outputs are $z_j$, $j = 1, \ldots, l$. The inputs are distributed to the neurons in the hidden layer. We may think of the neurons in the input layer as single-input-single-output linear elements, with each activation function being the identity map. In Figure 13.6 we do not explicitly depict the neurons in the input layer; instead, we illustrate them as signal splitters.

Figure 13.6 Three-layered neural network.

We denote the activation functions of the neurons in the hidden layer by $f_j^h$, $j = 1, \ldots, l$, and the activation functions of the neurons in the output layer by $f_s^o$, $s = 1, \ldots, m$. Note that each activation function is a function from $\mathbb{R}$ to $\mathbb{R}$. We denote the weights for inputs into the hidden layer by $w_{ji}^h$, $i = 1, \ldots, n$, $j = 1, \ldots, l$, and the weights for inputs from the hidden layer into the output layer by $w_{sj}^o$, $j = 1, \ldots, l$, $s = 1, \ldots, m$.

Given the weights $w_{ji}^h$ and $w_{sj}^o$, the neural network implements a map from $\mathbb{R}^n$ to $\mathbb{R}^m$. To find an explicit formula for this map, let us denote the input to the $j$th neuron in the hidden layer by $v_j$ and the output of the $j$th neuron in the hidden layer by $z_j$. Then, we have
$$v_j = \sum_{i=1}^{n} w_{ji}^h x_i, \qquad z_j = f_j^h(v_j).$$
The output from the $s$th neuron of the output layer is
$$y_s = f_s^o\left(\sum_{j=1}^{l} w_{sj}^o z_j\right).$$
Therefore, the relationship between the inputs $x_i$, $i = 1, \ldots, n$, and the $s$th output $y_s$ is given by
$$y_s = f_s^o\left(\sum_{j=1}^{l} w_{sj}^o f_j^h\left(\sum_{i=1}^{n} w_{ji}^h x_i\right)\right) = F_s(x_1, \ldots, x_n).$$
The overall mapping that the neural network implements is therefore given by
$$\begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} F_1(x_1, \ldots, x_n) \\ \vdots \\ F_m(x_1, \ldots, x_n) \end{bmatrix}.$$

We now consider the problem of training the neural network. As for the single neuron considered in Section 13.2, we analyze the case where the training set consists of a single pair $(x_d, y_d)$, where $x_d \in \mathbb{R}^n$ and $y_d \in \mathbb{R}^m$. In practice, the training set consists of many such pairs, and training is typically performed with one pair at a time (see, e.g., [65] or [113]). Our analysis is therefore also relevant to the general training problem with multiple training pairs. The training of the neural network involves adjusting the weights of the network such that the output generated by the network for the given input $x_d = [x_{d1}, \ldots, x_{dn}]^\top$ is as close to $y_d$ as possible. Formally, this can be formulated as the following optimization problem:
$$\text{minimize } \frac{1}{2}\sum_{s=1}^{m} (y_{ds} - y_s)^2,$$
where $y_s$, $s = 1, \ldots, m$, are the actual outputs of the neural network in response to the inputs $x_{d1}, \ldots, x_{dn}$, as given by
$$y_s = f_s^o\left(\sum_{j=1}^{l} w_{sj}^o f_j^h\left(\sum_{i=1}^{n} w_{ji}^h x_{di}\right)\right).$$
This minimization is taken over all $w_{ji}^h$, $w_{sj}^o$, $i = 1, \ldots, n$, $j = 1, \ldots, l$, $s = 1, \ldots, m$. For simplicity of notation, we use the symbol $w$ for the vector
$$w = \{w_{ji}^h, w_{sj}^o : i = 1, \ldots, n,\ j = 1, \ldots, l,\ s = 1, \ldots, m\}$$
and the symbol $E$ for the objective function to be minimized; that is,
$$E(w) = \frac{1}{2}\sum_{s=1}^{m} (y_{ds} - y_s)^2.$$

To solve the optimization problem above, we use a gradient algorithm with fixed step size. To formulate the algorithm, we need to compute the partial derivatives of $E$ with respect to each component of $w$. For this, let us first fix the indices $i$, $j$, and $s$. We first compute the partial derivative of $E$ with respect to $w_{sj}^o$. For this, we write
$$E(w) = \frac{1}{2}\sum_{s=1}^{m} \left(y_{ds} - f_s^o\left(\sum_{q=1}^{l} w_{sq}^o z_q\right)\right)^2,$$
where for each $q = 1, \ldots, l$,
$$z_q = f_q^h\left(\sum_{i=1}^{n} w_{qi}^h x_{di}\right).$$
Using the chain rule, we obtain
$$\frac{\partial E}{\partial w_{sj}^o}(w) = -(y_{ds} - y_s)\, f_s^{o\prime}\left(\sum_{q=1}^{l} w_{sq}^o z_q\right) z_j,$$
where $f_s^{o\prime} : \mathbb{R} \to \mathbb{R}$ is the derivative of $f_s^o$. For simplicity of notation, we write
$$\delta_s = (y_{ds} - y_s)\, f_s^{o\prime}\left(\sum_{q=1}^{l} w_{sq}^o z_q\right).$$
We can think of each $\delta_s$ as a scaled output error, because it is the difference between the actual output $y_s$ of the neural network and the desired output $y_{ds}$, scaled by $f_s^{o\prime}(\sum_{q=1}^{l} w_{sq}^o z_q)$. Using the $\delta_s$ notation, we have
$$\frac{\partial E}{\partial w_{sj}^o}(w) = -\delta_s z_j.$$

We next compute the partial derivative of $E$ with respect to $w_{ji}^h$. We start with the equation
$$E(w) = \frac{1}{2}\sum_{p=1}^{m} \left(y_{dp} - f_p^o\left(\sum_{q=1}^{l} w_{pq}^o f_q^h\left(\sum_{r=1}^{n} w_{qr}^h x_{dr}\right)\right)\right)^2.$$
Using the chain rule once again, we get
$$\frac{\partial E}{\partial w_{ji}^h}(w) = -\sum_{p=1}^{m} (y_{dp} - y_p)\, f_p^{o\prime}\left(\sum_{q=1}^{l} w_{pq}^o z_q\right) w_{pj}^o\, f_j^{h\prime}\left(\sum_{r=1}^{n} w_{jr}^h x_{dr}\right) x_{di},$$
where $f_j^{h\prime} : \mathbb{R} \to \mathbb{R}$ is the derivative of $f_j^h$. Simplifying the above yields
$$\frac{\partial E}{\partial w_{ji}^h}(w) = -\left(\sum_{p=1}^{m} \delta_p w_{pj}^o\right) f_j^{h\prime}(v_j)\, x_{di}.$$

We are now ready to formulate the gradient algorithm for updating the weights of the neural network. We write the update equations for the two sets of weights $w_{sj}^o$ and $w_{ji}^h$ separately:
$$w_{sj}^{o(k+1)} = w_{sj}^{o(k)} + \eta\, \delta_s^{(k)} z_j^{(k)},$$
$$w_{ji}^{h(k+1)} = w_{ji}^{h(k)} + \eta \left(\sum_{p=1}^{m} \delta_p^{(k)} w_{pj}^{o(k)}\right) f_j^{h\prime}(v_j^{(k)})\, x_{di},$$
where $\eta$ is the (fixed) step size and
$$v_j^{(k)} = \sum_{i=1}^{n} w_{ji}^{h(k)} x_{di}, \qquad z_j^{(k)} = f_j^h(v_j^{(k)}), \qquad \delta_s^{(k)} = (y_{ds} - y_s^{(k)})\, f_s^{o\prime}\left(\sum_{q=1}^{l} w_{sq}^{o(k)} z_q^{(k)}\right).$$
The update equation for the weights $w_{sj}^o$ of the output layer neurons is illustrated in Figure 13.7, whereas the update equation for the weights $w_{ji}^h$ of the hidden layer neurons is illustrated in Figure 13.8.

Figure 13.7 Illustration of the update equation for the output layer weights.

The update equations above are referred to in the literature as the backpropagation algorithm. The reason for the name backpropagation is that the output errors $\delta_1^{(k)}, \ldots, \delta_m^{(k)}$ are propagated back from the output layer to the hidden layer and are used in the update equation for the hidden layer weights, as illustrated in Figure 13.8. In the discussion above we assumed only a single hidden layer. In general, we may have multiple hidden layers; in this case the update equations for the weights will resemble the equations derived above. In the general case the output errors are propagated backward from layer to layer and are used to update the weights at each layer.

We summarize the backpropagation algorithm qualitatively as follows. Using the inputs $x_{di}$ and the current set of weights, we first compute the quantities $v_j^{(k)}$, $z_j^{(k)}$, $y_s^{(k)}$, and $\delta_s^{(k)}$, in turn.
This is called the forward pass of the algorithm, because it involves propagating the input forward from the input layer to the output layer. Next, we compute the updated weights using the quantities computed in the forward pass. This is called the reverse pass of the algorithm, because it involves propagating the computed output errors $\delta_s^{(k)}$ backward through the network. We illustrate the backpropagation procedure numerically in the following example.

Figure 13.8 Illustration of the update equation for the hidden layer weights.

Example 13.1 Consider the simple feedforward neural network shown in Figure 13.9. The activation functions for all the neurons are given by $f(v) = 1/(1 + e^{-v})$. This particular activation function has the convenient property that $f'(v) = f(v)(1 - f(v))$. Therefore, using this property, we can write
$$\delta_1 = (y_d - y_1)\, f'\left(\sum_{q} w_{1q}^o z_q\right) = (y_d - y_1)\, y_1 (1 - y_1).$$

Figure 13.9 Neural network for Example 13.1.

Suppose that the initial weights are $w_{11}^{h(0)} = 0.1$, $w_{12}^{h(0)} = 0.3$, $w_{21}^{h(0)} = 0.3$, $w_{22}^{h(0)} = 0.4$, $w_{11}^{o(0)} = 0.4$, and $w_{12}^{o(0)} = 0.6$. Let $x_d = [0.2, 0.6]^\top$ and $y_d = 0.7$. Perform one iteration of the backpropagation algorithm to update the weights of the network. Use a step size of $\eta = 10$.

To proceed, we first compute
$$v_1^{(0)} = w_{11}^{h(0)} x_{d1} + w_{12}^{h(0)} x_{d2} = 0.2, \qquad v_2^{(0)} = w_{21}^{h(0)} x_{d1} + w_{22}^{h(0)} x_{d2} = 0.3.$$
Next, we compute
$$z_1^{(0)} = f(v_1^{(0)}) = \frac{1}{1 + e^{-0.2}} = 0.5498, \qquad z_2^{(0)} = f(v_2^{(0)}) = \frac{1}{1 + e^{-0.3}} = 0.5744.$$
We then compute
$$y_1^{(0)} = f\left(w_{11}^{o(0)} z_1^{(0)} + w_{12}^{o(0)} z_2^{(0)}\right) = f(0.5646) = 0.6375,$$
which gives an output error of
$$\delta_1^{(0)} = (y_d - y_1^{(0)})\, y_1^{(0)} (1 - y_1^{(0)}) = 0.01444.$$
This completes the forward pass. To update the weights, we use
$$w_{11}^{o(1)} = w_{11}^{o(0)} + \eta\, \delta_1^{(0)} z_1^{(0)} = 0.4794, \qquad w_{12}^{o(1)} = w_{12}^{o(0)} + \eta\, \delta_1^{(0)} z_2^{(0)} = 0.6830,$$
and, using the fact that $f'(v_j^{(0)}) = f(v_j^{(0)})(1 - f(v_j^{(0)})) = z_j^{(0)}(1 - z_j^{(0)})$,
$$w_{11}^{h(1)} = w_{11}^{h(0)} + \eta\, \delta_1^{(0)} w_{11}^{o(0)} z_1^{(0)}(1 - z_1^{(0)})\, x_{d1} = 0.1029,$$
$$w_{12}^{h(1)} = w_{12}^{h(0)} + \eta\, \delta_1^{(0)} w_{11}^{o(0)} z_1^{(0)}(1 - z_1^{(0)})\, x_{d2} = 0.3086,$$
$$w_{21}^{h(1)} = w_{21}^{h(0)} + \eta\, \delta_1^{(0)} w_{12}^{o(0)} z_2^{(0)}(1 - z_2^{(0)})\, x_{d1} = 0.3042,$$
$$w_{22}^{h(1)} = w_{22}^{h(0)} + \eta\, \delta_1^{(0)} w_{12}^{o(0)} z_2^{(0)}(1 - z_2^{(0)})\, x_{d2} = 0.4127.$$
Thus, we have completed one iteration of the backpropagation algorithm. We can easily check that $y_1^{(1)} = 0.6588$, and hence $|y_d - y_1^{(1)}| < |y_d - y_1^{(0)}|$; that is, the actual output of the neural network has become closer to the desired output as a result of updating the weights. After 15 iterations of the backpropagation algorithm, we get
$$w_{11}^{o(15)} = 0.6365, \quad w_{12}^{o(15)} = 0.8474,$$
$$w_{11}^{h(15)} = 0.1105, \quad w_{12}^{h(15)} = 0.3315, \quad w_{21}^{h(15)} = 0.3146, \quad w_{22}^{h(15)} = 0.4439.$$
The resulting value of the output corresponding to the input $x_d = [0.2, 0.6]^\top$ is $y_1^{(15)} = 0.6997$. ■
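The computations in Example 13.1 are easy to reproduce. The following MATLAB sketch implements the forward pass and the weight updates for the 2-2-1 network of Figure 13.9, using the data of the example; it prints 0.6588 after the first iteration and 0.6997 after 15 iterations.

% Backpropagation for the 2-2-1 network of Example 13.1: a sketch.
f  = @(v) 1 ./ (1 + exp(-v));            % sigmoid activation
Wh = [0.1 0.3; 0.3 0.4];                 % hidden weights [w11 w12; w21 w22]
Wo = [0.4 0.6];                          % output weights [w11 w12]
xd = [0.2; 0.6];  yd = 0.7;  eta = 10;   % data and step size from the example
for k = 1:15
    v = Wh * xd;                         % forward pass
    z = f(v);
    y = f(Wo * z);
    delta = (yd - y) * y * (1 - y);      % scaled output error
    Wh = Wh + eta * delta * (Wo' .* z .* (1 - z)) * xd';  % hidden update
    Wo = Wo + eta * delta * z';                           % output update
    if k == 1, fprintf('y after 1 iteration: %.4f\n', f(Wo * f(Wh * xd))); end
end
fprintf('y after 15 iterations: %.4f\n', f(Wo * f(Wh * xd)))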
In the example above, we considered an activation function of the form
$$f(v) = \frac{1}{1 + e^{-v}}.$$
This function is called a sigmoid and is a popular activation function used in practice. The sigmoid function is illustrated in Figure 13.10. It is possible to use a more general version of the sigmoid function, of the form
$$g(v) = \frac{1}{1 + e^{-\beta(v - \theta)}}.$$

Figure 13.10 Sigmoid function.

The parameters $\beta$ and $\theta$ represent scale and shift (or location) parameters, respectively. The parameter $\theta$ is often interpreted as a threshold. If such an activation function is used in a neural network, we would also want to adjust the values of the parameters $\beta$ and $\theta$, which also affect the value of the objective function to be minimized. However, it turns out that these parameters can be incorporated into the backpropagation algorithm simply by treating them as additional weights in the network. Specifically, we can represent a neuron with activation function $g$ as one with activation function $f$ with the addition of two extra weights, as shown in Figure 13.11.

Figure 13.11 Two configurations that are equivalent.

Example 13.2 Consider the same neural network as in Example 13.1. We introduce shift parameters $\theta_1$, $\theta_2$, and $\theta_3$ to the activation functions in the neurons. Using the configuration illustrated in Figure 13.11, we can incorporate the shift parameters into the backpropagation algorithm. We have
$$v_1 = w_{11}^h x_{d1} + w_{12}^h x_{d2} - \theta_1, \qquad v_2 = w_{21}^h x_{d1} + w_{22}^h x_{d2} - \theta_2,$$
$$z_1 = f(v_1), \qquad z_2 = f(v_2),$$
$$y_1 = f(w_{11}^o z_1 + w_{12}^o z_2 - \theta_3), \qquad \delta_1 = (y_d - y_1)\, y_1(1 - y_1),$$
where $f$ is the sigmoid function. The components of the gradient of the objective function $E$ with respect to the shift parameters are
$$\frac{\partial E}{\partial \theta_1}(w) = \delta_1 w_{11}^o z_1(1 - z_1), \qquad \frac{\partial E}{\partial \theta_2}(w) = \delta_1 w_{12}^o z_2(1 - z_2), \qquad \frac{\partial E}{\partial \theta_3}(w) = \delta_1. ■$$

In the next example, we apply the network discussed in Example 13.2 to solve the celebrated exclusive OR (XOR) problem (see [113]).

Example 13.3 Consider the neural network of Example 13.2. We wish to train the neural network to approximate the exclusive OR (XOR) function, defined in Table 13.1. Note that the XOR function has two inputs and one output. To train the neural network, we use the following training pairs:
$$x_{d,1} = [0,0]^\top, \quad y_{d,1} = 0, \qquad x_{d,2} = [0,1]^\top, \quad y_{d,2} = 1,$$
$$x_{d,3} = [1,0]^\top, \quad y_{d,3} = 1, \qquad x_{d,4} = [1,1]^\top, \quad y_{d,4} = 0.$$

Table 13.1 Truth Table for XOR Function

x1   x2   F(x1, x2)
0    0    0
0    1    1
1    0    1
1    1    0

We now apply the backpropagation algorithm to train the network using the training pairs above. To do this, we apply one pair per iteration in a cyclic fashion. In other words, in the $k$th iteration of the algorithm, we apply the pair $(x_{d,R(k)+1}, y_{d,R(k)+1})$, where, as in Kaczmarz's algorithm, $R(k)$ is the unique integer in $\{0, 1, 2, 3\}$ satisfying $k = 4l + R(k)$ for some integer $l$; that is, $R(k)$ is the remainder that results if we divide $k$ by 4 (see Section 12.4). The experiment yields the following weights (see Exercise 13.5):
$$w_{11}^o = -11.01, \quad w_{12}^o = 10.92,$$
$$w_{11}^h = -7.777, \quad w_{12}^h = -8.403, \quad w_{21}^h = -5.593, \quad w_{22}^h = -5.638,$$
$$\theta_1 = -3.277, \quad \theta_2 = -8.357, \quad \theta_3 = 5.261.$$
Table 13.2 shows the output of the neural network with the weights above corresponding to the training input data. Figure 13.12 shows a plot of the function that is implemented by this neural network. ■

Table 13.2 Response of the Trained Network of Example 13.3

x1   x2   y1
0    0    0.007
0    1    0.99
1    0    0.99
1    1    0.009

Figure 13.12 Plot of the function implemented by the trained network of Example 13.3.

For a more comprehensive treatment of neural networks, see [58], [59], or [137]. For applications of neural networks to optimization, signal processing, and control problems, see [28] and [67].
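As a quick check, the following MATLAB fragment evaluates the trained network at the four training inputs using the weights reported in Example 13.3; it reproduces the outputs in Table 13.2.

% Evaluate the trained XOR network of Example 13.3.
f  = @(v) 1 ./ (1 + exp(-v));                 % sigmoid activation
Wh = [-7.777 -8.403; -5.593 -5.638];          % hidden weights
Wo = [-11.01 10.92];                          % output weights
th = [-3.277; -8.357];  th3 = 5.261;          % shift parameters
X  = [0 0; 0 1; 1 0; 1 1];                    % the four training inputs
for i = 1:4
    z = f(Wh * X(i,:)' - th);                 % hidden-layer outputs
    y = f(Wo * z - th3);                      % network output
    fprintf('%d %d -> %.3f\n', X(i,1), X(i,2), y);
end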
EXERCISES

13.1 Consider a single linear neuron, with $n$ inputs (see Figure 13.4). Suppose that we are given $X_d \in \mathbb{R}^{n \times p}$ and $y_d \in \mathbb{R}^p$ representing $p$ training pairs, where $p > n$. The objective function to be minimized in the training of the neuron is
$$f(w) = \frac{1}{2}\|y_d - X_d^\top w\|^2.$$
a. Find the gradient of the objective function.
b. Write the conjugate gradient algorithm for training the neuron.
c. Suppose that we wish to approximate the function $F : \mathbb{R}^2 \to \mathbb{R}$ given by $F(x) = (\sin x_1)(\cos x_2)$. Use the conjugate gradient algorithm from part b to train the linear neuron, using the training points $\{x : x_1, x_2 = -0.5, 0, 0.5\}$. It may be helpful to use the MATLAB program from Exercise 10.11.
d. Plot the level sets of the objective function for the problem in part c, at levels 0.01, 0.1, 0.2, and 0.4. Check if the solution in part c agrees with the level sets.
e. Plot the error function $e(x) = F(x) - w^{*\top} x$ versus $x_1$ and $x_2$, where $w^*$ is the optimal weight vector obtained in part c.

13.2 Consider the Adaline, depicted in Figure 13.5. Assume that we have a single training pair $(x_d, y_d)$, where $x_d \neq 0$. Suppose that we use the Widrow-Hoff algorithm to adjust the weights:
$$w^{(k+1)} = w^{(k)} + \mu \frac{x_d}{\|x_d\|^2}\, e_k,$$
where $e_k = y_d - x_d^\top w^{(k)}$.
a. Write an expression for $e_{k+1}$ as a function of $e_k$ and $\mu$.
b. Find the largest range of values for $\mu$ for which $e_k \to 0$ (for any initial condition $w^{(0)}$).

13.3 As in Exercise 13.2, consider the Adaline. Consider the case in which there are multiple pairs in the training set $\{(x_{d,1}, y_{d,1}), \ldots, (x_{d,p}, y_{d,p})\}$, where $p \leq n$ and $\operatorname{rank} X_d = p$ (the matrix $X_d$ has $x_{d,i}$ as its $i$th column). Suppose that we use the following training algorithm:
$$w^{(k+1)} = w^{(k)} + X_d (X_d^\top X_d)^{-1} \mu\, e^{(k)},$$
where $e^{(k)} = y_d - X_d^\top w^{(k)}$ and $\mu$ is a given constant $p \times p$ matrix.
a. Find an expression for $e^{(k+1)}$ as a function of $e^{(k)}$ and $\mu$.
b. Find a necessary and sufficient condition on $\mu$ for which $e^{(k)} \to 0$ (for any initial condition $w^{(0)}$).

13.4 Consider the three-layered neural network described in Example 13.1 (see Figure 13.9). Implement the backpropagation algorithm for this network in MATLAB. Test the algorithm for the training pair $x_d = [0,1]^\top$ and $y_d = 1$. Use a step size of $\eta = 50$ and initial weights as in Example 13.1.

13.5 Consider the neural network of Example 13.3, with training pairs for the XOR problem. Use MATLAB to implement the training algorithm described in Example 13.3, with a step size of $\eta = 10.0$. Tabulate the outputs of the trained network corresponding to the training input data.

CHAPTER 14

GLOBAL SEARCH ALGORITHMS

14.1 Introduction

The iterative algorithms in previous chapters, in particular gradient methods, Newton's method, conjugate gradient methods, and quasi-Newton methods, start with an initial point and then generate a sequence of iterates. Typically, the best we can hope for is that the sequence converges to a local minimizer. For this reason, it is often desirable for the initial point to be close to a global minimizer. Moreover, these methods require first derivatives (and also second derivatives in the case of Newton's method).

In this chapter we discuss various search methods that are global in nature in the sense that they attempt to search throughout the entire feasible set. These methods use only objective function values and do not require derivatives. Consequently, they are applicable to a much wider class of optimization problems. In some cases, they can also be used to generate "good" initial (starting) points for the iterative methods discussed in earlier chapters. Some of the methods we discuss in this chapter (specifically, the randomized search methods) are also used in combinatorial optimization, where the feasible set is finite (discrete), but typically large.

14.2 The Nelder-Mead Simplex Algorithm

The method originally proposed by Spendley, Hext, and Himsworth [122] in 1962 was improved by Nelder and Mead [97] in 1965, and it is now commonly referred to as the Nelder-Mead simplex algorithm.
A contemporary view of the algorithm is provided in the well-written paper by Lagarias et al. [82]. In our exposition, we use the notation of this paper.

The Nelder-Mead algorithm is a derivative-free method. The method uses the concept of a simplex. A simplex is a geometric object determined by an assembly of $n+1$ points, $p_0, p_1, \ldots, p_n$, in the $n$-dimensional space such that
$$\det \begin{bmatrix} p_0 & p_1 & \cdots & p_n \\ 1 & 1 & \cdots & 1 \end{bmatrix} \neq 0.$$
This condition ensures that two points in $\mathbb{R}$ do not coincide, three points in $\mathbb{R}^2$ are not collinear, four points in $\mathbb{R}^3$ are not coplanar, and so on. Thus, a simplex in $\mathbb{R}$ is a line segment, in $\mathbb{R}^2$ it is a triangle, and in $\mathbb{R}^3$ it is a tetrahedron; in each case it encloses a finite $n$-dimensional volume.

Suppose that we wish to minimize $f(x)$, $x \in \mathbb{R}^n$. To start the algorithm, we initialize a simplex of $n+1$ points. A possible way to set up a simplex, as suggested by Jang, Sun, and Mizutani [67], is to start with an initial point $x^{(0)} = p_0$ and generate the remaining points of the initial simplex as follows:
$$p_i = p_0 + \lambda_i e_i, \qquad i = 1, 2, \ldots, n,$$
where the $e_i$ are unit vectors constituting the natural basis of $\mathbb{R}^n$ as described in Section 2.1. The positive constant coefficients $\lambda_i$ are selected in such a way that their magnitudes reflect the length scale of the optimization problem.

Our objective is to modify the initial simplex stage by stage so that the resulting simplices converge toward the minimizer. At each iteration we evaluate the function $f$ at each point of the simplex. In the function minimization process, the point with the largest function value is replaced with another point. The process for modifying the simplex continues until it converges toward the function minimizer.

We now present the rules for modifying the simplex stage by stage. To aid in our presentation, we use a two-dimensional example to illustrate the rules. We begin by selecting the initial set of $n+1$ points that are to form the initial simplex. We next evaluate $f$ at each point and order the $n+1$ vertices to satisfy
$$f(p_0) \leq f(p_1) \leq \cdots \leq f(p_n).$$
For the two-dimensional case we let $p_l$, $p_{nl}$, and $p_s$ denote the points of the simplex for which $f$ is largest, next largest, and smallest; that is, because we wish to minimize $f$, the vertex $p_s$ is the best vertex, $p_l$ is the worst vertex, and $p_{nl}$ is the next-worst vertex. We next compute $p_g$, the centroid (center of gravity) of the best $n$ points:
$$p_g = \frac{1}{n}\sum_{i=0}^{n-1} p_i.$$
In our two-dimensional case ($n = 2$), we would have
$$p_g = \frac{1}{2}(p_{nl} + p_s).$$
We then reflect the worst vertex, $p_l$, in $p_g$ using a reflection coefficient $\rho > 0$ to obtain the reflection point
$$p_r = p_g + \rho(p_g - p_l).$$
A typical value is $\rho = 1$. The operation above is illustrated in Figure 14.1.

Figure 14.1 Reflecting $p_l$ in $p_g$ with a reflection coefficient $\rho$.

We proceed to evaluate $f$ at $p_r$ to obtain $f_r = f(p_r)$. If $f_0 \leq f_r < f_{n-1}$ [i.e., if $f_r$ lies between $f_s = f(p_s)$ and $f_{nl} = f(p_{nl})$], then the point $p_r$ replaces $p_l$ to form a new simplex, and we terminate the iteration. We proceed to repeat the process. Thus, we compute the centroid of the best $n$ vertices of the new simplex and again reflect the point with the largest function value in the centroid obtained for the best $n$ points of the new simplex.

If, however, $f_r < f_s = f_0$, so that the point $p_r$ yields the smallest function value among the points of the simplex, we argue that this direction is a good one.
In this case we increase the distance traveled using an expansion coefficient $\chi > 1$ (e.g., $\chi = 2$) to obtain
$$p_e = p_g + \chi(p_r - p_g).$$
The operation above yields a new point on the line through $p_l$, $p_g$, and $p_r$, extended beyond $p_r$. We illustrate this operation in Figure 14.2. If $f_e < f_r$, the expansion is declared a success and $p_e$ replaces $p_l$ in the next simplex. If, on the other hand, $f_e \geq f_r$, the expansion is a failure and $p_r$ replaces $p_l$.

Figure 14.2 Expansion operation with the expansion coefficient $\chi$.

Finally, if $f_r \geq f_{nl}$, the reflected point $p_r$ would constitute the point with the largest function value in the new simplex. Then in the next step it would be reflected in $p_g$, probably an unfruitful operation. Instead, this case is dealt with by a contraction operation in one of two ways. First, if $f_r \geq f_{nl}$ and $f_r < f_l$, then we contract $(p_r - p_g)$ with a contraction coefficient $0 < \gamma < 1$ (e.g., $\gamma = 1/2$) to obtain
$$p_c = p_g + \gamma(p_r - p_g).$$
We refer to this operation as the outside contraction; see Figure 14.3 for an illustration. If, on the other hand, $f_r \geq f_{nl}$ and $f_r \geq f_l$, then $p_l$ replaces $p_r$ in the contraction operation and we get
$$p_c = p_g + \gamma(p_l - p_g).$$
This operation, referred to as the inside contraction, is illustrated in Figure 14.4.

Figure 14.3 Outside contraction operation for the case when $f_r \in [f_{nl}, f_l)$.

Figure 14.4 Inside contraction operation for the case when $f_r \geq f_l$.

If, in either case, $f_c < f_l$, the contraction is considered a success, and we replace $p_l$ with $p_c$ in the new simplex. If, however, $f_c \geq f_l$, the contraction is a failure, and in this case a new simplex can be formed by retaining $p_s$ only and halving the distance from $p_s$ to every other point in the simplex. We refer to this event as a shrinkage operation, illustrated in Figure 14.5. In general, the shrink step produces the $n$ new vertices of the new simplex according to the formula
$$v_i = p_s + \sigma(p_i - p_s), \qquad i = 1, 2, \ldots, n,$$
where $\sigma = 1/2$. Hence, the vertices of the new simplex are $p_s, v_1, \ldots, v_n$.

Figure 14.5 Shrinkage operation.

When implementing the simplex algorithm, we need a tie-breaking rule to order points in the case of equal function values. Lagarias et al. [82] propose tie-breaking rules that assign to the new vertex the highest possible index consistent with the relation
$$f(p_0) \leq f(p_1) \leq \cdots \leq f(p_n).$$
In Figure 14.6 we illustrate the simplex search method by showing the first few stages of the search for a minimizer of a function of two variables. Our drawing is inspired by a figure in Layton [84, p. 225]. The starting simplex is composed of the vertices A, B, and C. The vertices D and E are obtained by the expansion operation. The vertex F is obtained by the reflection operation. The vertex G is obtained using the outside contraction operation, while the vertex I is obtained employing the inside contraction operation. For the sake of clarity we terminate the process at this simplex; the process may, of course, be continued further.

Figure 14.6 The simplex search method applied to minimization of a function of two variables.

We add that a variant of the simplex method described above is presented in Jang et al. [67], where they use the centroid of the entire simplex rather than the centroid of the best $n$ vertices; that is, they compute the point $p_g$ using all $n+1$ vertices of the simplex. In addition, they use only the inside contraction and do not use the outside contraction operation.
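The following MATLAB sketch assembles the rules above (reflection, expansion, outside and inside contraction, and shrinkage) into one iteration loop, using the coefficient values $\rho = 1$, $\chi = 2$, $\gamma = 1/2$, and $\sigma = 1/2$ suggested in the text. The test function, initial simplex, and iteration count are assumptions made for illustration.

% Nelder-Mead simplex method: a sketch following the rules in this section.
f = @(x) (x(1) - 1)^2 + 2*(x(2) + 0.5)^2;     % assumed test function
n = 2;  rho = 1;  chi = 2;  gam = 0.5;  sig = 0.5;
P = [zeros(n,1), eye(n)];                     % initial simplex (columns)
for iter = 1:100
    fv = arrayfun(@(j) f(P(:,j)), 1:n+1);
    [fv, idx] = sort(fv);  P = P(:, idx);     % order so f(p0)<=...<=f(pn)
    pg = mean(P(:, 1:n), 2);                  % centroid of best n vertices
    pr = pg + rho * (pg - P(:, n+1));         % reflection
    fr = f(pr);
    if fr < fv(1)                             % try expansion
        pe = pg + chi * (pr - pg);
        if f(pe) < fr, P(:, n+1) = pe; else, P(:, n+1) = pr; end
    elseif fr < fv(n)                         % accept reflection
        P(:, n+1) = pr;
    else                                      % contraction
        if fr < fv(n+1), pc = pg + gam * (pr - pg);         % outside
        else,            pc = pg + gam * (P(:, n+1) - pg);  % inside
        end
        if f(pc) < fv(n+1), P(:, n+1) = pc;
        else                                  % shrink toward best vertex
            for j = 2:n+1, P(:, j) = P(:, 1) + sig * (P(:, j) - P(:, 1)); end
        end
    end
end
disp(P(:, 1)')   % best vertex, approximately [1 -0.5]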
14.3 Simulated Annealing

Randomized Search

Simulated annealing is an instance of a randomized search method. A randomized search method, also sometimes called a probabilistic search method, is an algorithm that searches the feasible set of an optimization problem by considering randomized samples of candidate points in the set. The simulated annealing algorithm was first suggested for optimization by Kirkpatrick et al. [75] based on techniques of Metropolis et al. [91]. An early application to image processing was described by Geman and Geman [48]. As usual, suppose that we wish to solve an optimization problem of the form

minimize $f(x)$
subject to $x \in \Omega$.

The basic assumption in randomized search is our ability to select a random sample from the feasible set $\Omega$. Typically, we start a randomized search process by selecting a random initial point $x^{(0)} \in \Omega$. Then, we select a random next-candidate point, usually close to $x^{(0)}$. More formally, we assume that for any $x \in \Omega$, there is a set $N(x) \subset \Omega$ such that we can generate a random sample from this set. Typically, $N(x)$ is a set of points that are "close" to $x$, and for this reason we usually think of $N(x)$ as a "neighborhood" of $x$ [we use the term neighborhood for $N(x)$ even in the general case where the points in it are arbitrary, not necessarily close to $x$]. When we speak of generating a random point in $N(x)$, we mean that there is a prespecified distribution over $N(x)$, and we sample a point with this distribution. Often, this distribution is chosen to be uniform over $N(x)$; other distributions are also used, including Gaussian and Cauchy.

Before discussing the simulated annealing method, we first consider a simple randomized search algorithm, which we will call naive random search.

Naive Random Search Algorithm
1. Set $k := 0$. Select an initial point $x^{(0)} \in \Omega$.
2. Pick a candidate point $z^{(k)}$ at random from $N(x^{(k)})$.
3. If $f(z^{(k)}) < f(x^{(k)})$, then set $x^{(k+1)} = z^{(k)}$; else, set $x^{(k+1)} = x^{(k)}$.
4. If the stopping criterion is satisfied, then stop.
5. Set $k := k + 1$, go to step 2.

Note that the algorithm above has the familiar form $x^{(k+1)} = x^{(k)} + d^{(k)}$, where $d^{(k)}$ is randomly generated. By design, the direction $d^{(k)}$ either is 0 or is a descent direction. Typical stopping criteria include reaching a certain number of iterations, or reaching a certain objective function value.

Simulated Annealing Algorithm

The main problem with the naive random search method is that it may get stuck in a region around a local minimizer. This is easy to imagine; for example, if $x^{(k)}$ is a local minimizer and $N(x^{(k)})$ is sufficiently small that all points in it have no smaller objective function value than $x^{(k)}$ [i.e., $x^{(k)}$ is a global minimizer of $f$ over $N(x^{(k)})$], then clearly the algorithm will be stuck and will never find a point outside of $N(x^{(k)})$. To prevent getting stuck in a region around a local minimizer, we need a way to consider points outside this region. One way to achieve this goal is to make sure that at each $k$, the neighborhood $N(x^{(k)})$ is a very large set. Indeed, if $N(x^{(k)})$ is sufficiently large, then we are guaranteed that the algorithm will converge (in some sense) to a global minimizer. An extreme example of this case is where $N(x) = \Omega$ for any $x \in \Omega$ (in this case running $k$ iterations of the naive random search algorithm amounts to finding the best point among $k$ randomly chosen points in $\Omega$).
However, having too large a neighborhood in the search algorithm results in a slow search process, because the sampling of candidate points to consider is spread out, making it more unlikely to find a better candidate point.

Another way to overcome the problem of getting stuck in a region around a local minimizer is to modify the naive search algorithm so that we can "climb out" of such a region. This means that the algorithm may accept a new point that is worse than the current point. The simulated annealing algorithm incorporates such a mechanism.

Simulated Annealing Algorithm
1. Set $k := 0$; select an initial point $x^{(0)} \in \Omega$.
2. Pick a candidate point $z^{(k)}$ at random from $N(x^{(k)})$.
3. Toss a coin with probability of HEAD equal to $p(k, f(z^{(k)}), f(x^{(k)}))$. If HEAD, then set $x^{(k+1)} = z^{(k)}$; else, set $x^{(k+1)} = x^{(k)}$.
4. If the stopping criterion is satisfied, then stop.
5. Set $k := k + 1$, go to step 2.

In step 3, the use of a "coin toss" is simply descriptive for a randomized decision; we do not mean literally that an actual coin needs to be tossed. As in naive random search, the simulated annealing algorithm above has the familiar form $x^{(k+1)} = x^{(k)} + d^{(k)}$, where $d^{(k)}$ is randomly generated. But in simulated annealing the direction $d^{(k)}$ might be an ascent direction. However, as the algorithm progresses, we can keep track of the best-so-far point: this is a point $x_{\text{best}}^{(k)}$ which, at each $k$, is equal to an $x^{(j)}$, $j \in \{0, \ldots, k\}$, such that $f(x^{(j)}) \leq f(x^{(i)})$ for all $i \in \{0, \ldots, k\}$. The best-so-far point can be updated at each step $k$ as follows:
$$x_{\text{best}}^{(k+1)} = \begin{cases} x^{(k+1)} & \text{if } f(x^{(k+1)}) < f(x_{\text{best}}^{(k)}) \\ x_{\text{best}}^{(k)} & \text{otherwise.} \end{cases}$$
By keeping track of the best-so-far point, we can treat the simulated annealing algorithm simply as a search procedure; the best-so-far point is what we eventually use when the algorithm stops. This comment applies not only to simulated annealing, but to other search techniques as well (including the randomized algorithms presented in the next two sections).

The major difference between simulated annealing and naive random search is that in step 3, there is some probability that we set the next iterate to be equal to the random point selected from the neighborhood, even if that point turns out to be worse than the current iterate. This probability is called the acceptance probability. For the algorithm to work properly, the acceptance probability must be chosen appropriately. A typical choice is
$$p(k, f(z^{(k)}), f(x^{(k)})) = \min\left\{1,\ \exp\left(-\frac{f(z^{(k)}) - f(x^{(k)})}{T_k}\right)\right\},$$
where $\exp$ is the exponential function and $T_k$ represents a positive sequence called the temperature schedule or cooling schedule. This form of acceptance probability is usually credited to Boltzmann and leads to a simulated annealing algorithm that behaves as a Gibbs sampler (a method of probabilistic sampling based on the Gibbs distribution).

Notice that if $f(z^{(k)}) \leq f(x^{(k)})$, then $p(k, f(z^{(k)}), f(x^{(k)})) = 1$, which means that we set $x^{(k+1)} = z^{(k)}$ (i.e., we move to the point $z^{(k)}$). However, if $f(z^{(k)}) > f(x^{(k)})$, there is still a positive probability of setting $x^{(k+1)} = z^{(k)}$; this probability is equal to
$$\exp\left(-\frac{f(z^{(k)}) - f(x^{(k)})}{T_k}\right).$$
Note that the larger the difference between $f(z^{(k)})$ and $f(x^{(k)})$, the less likely we are to move to the worse point $z^{(k)}$. Similarly, the smaller the value of $T_k$, the less likely we are to move to $z^{(k)}$. It is typical to let the "temperature" $T_k$ be monotonically decreasing to 0 (hence the word cooling). In other words, as the iteration index $k$ increases, the algorithm becomes increasingly reluctant to move to a worse point.
The intuitive reason for this behavior is that initially we wish to actively explore the feasible set, but with time we would like to be less active in exploration so that we spend more time in a region around a global minimizer. In other words, the desired behavior is this: Initially, the algorithm jumps around and is more likely to climb out of regions around local minimizers, but with time it settles down and is more likely to spend time around a global minimizer.

The term annealing comes from the field of metallurgy, where it refers to a technique for improving the properties of metals. The basic procedure is to heat up a piece of metal and then cool it down in a controlled fashion. When the metal is first heated, the atoms in it become unstuck from their initial positions (with some level of internal energy). Then, as cooling takes place, the atoms gradually configure themselves in states of lower internal energy. Provided that the cooling is sufficiently slow, the final internal energy is lower than the initial internal energy, thereby refining the crystalline structure and reducing defects.

In an analogous way, the temperature in simulated annealing must be cooled in a controlled fashion. In particular, the cooling should be sufficiently slow. In a seminal paper, Hajek [56] provides a rigorous analysis of the cooling schedule for convergence of the algorithm to a global minimizer. Specifically, he shows that an appropriate cooling schedule is
$$T_k = \frac{\gamma}{\log(k+2)},$$
where $\gamma > 0$ is a problem-dependent constant (large enough to allow the algorithm to "climb out" of regions around local minimizers that are not global minimizers). See also [57] for an analysis of a generalized version of simulated annealing.

Simulated annealing is often also used in combinatorial optimization, where the feasible set is finite (but typically large). An example of such a problem is the celebrated traveling salesperson problem. In the most basic form of this problem, we are given a number of cities and the cost of traveling from any city to any other city. The optimization problem is to find the cheapest round-trip route, starting from a given city, that visits every other city exactly once. For a description of how to apply simulated annealing to the traveling salesperson problem, see [67, p. 183].
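A minimal MATLAB sketch of the algorithm with the Boltzmann acceptance probability and a logarithmic cooling schedule is given below. The objective function, the Gaussian neighborhood sampling, and the constant $\gamma$ are assumptions chosen for illustration.

% Simulated annealing with best-so-far tracking: a sketch.
f = @(x) x.^4 - 4*x.^2 + x;      % assumed objective with two local minima
gamma = 2;                        % assumed cooling constant
x = 2;  xbest = x;                % initial point (near the local minimizer)
for k = 0:9999
    z = x + 0.5 * randn;                       % sample from neighborhood N(x)
    T = gamma / log(k + 2);                    % cooling schedule
    p = min(1, exp(-(f(z) - f(x)) / T));       % acceptance probability
    if rand < p, x = z; end                    % "coin toss"
    if f(x) < f(xbest), xbest = x; end         % update best-so-far
end
fprintf('best-so-far: x = %.4f, f = %.4f\n', xbest, f(xbest))

For this objective, naive random search started at $x = 2$ would tend to remain near the local minimizer around $x \approx 1.35$, whereas simulated annealing can climb out and locate the global minimizer near $x \approx -1.47$.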
14.4 Particle Swarm Optimization

Particle swarm optimization (PSO) is a randomized search technique presented by James Kennedy (a social psychologist) and Russell C. Eberhart (an engineer) in 1995 [73]. This optimization method is inspired by social interaction principles. The PSO algorithm differs from the randomized search methods discussed in Section 14.3 in one key way: Instead of updating a single candidate solution $x^{(k)}$ at each iteration, we update a population (set) of candidate solutions, called a swarm. Each candidate solution in the swarm is called a particle. We think of a swarm as an apparently disorganized population of moving individuals that tend to cluster together while each individual seems to be moving in a random direction. (This description was adapted from a presentation by R. C. Eberhart.) The PSO algorithm aims to mimic the social behavior of animals and insects, such as a swarm of bees, a flock of birds, or a herd of wildebeest.

Suppose that we wish to minimize an objective function over $\mathbb{R}^n$. In the PSO algorithm, we start with an initial randomly generated population of points in $\mathbb{R}^n$. Associated with each point in the population is a velocity vector. We think of each point as the position of a particle, moving with an associated velocity. We then evaluate the objective function at each point in the population. Based on this evaluation, we create a new population of points together with a new set of velocities. The creation of points in the new population, and their velocities, involves certain operations on points and velocities of the particles in the preceding population, described later.

Each particle keeps track of its best-so-far position, the best position it has visited so far (with respect to the value of the objective function). We call this particle-dependent best-so-far position a personal best (pbest). In contrast, the overall best-so-far position (best among all the positions encountered so far by the entire population) is called a global best (gbest). The particles "interact" with each other by updating their velocities according to their individual personal best as well as the global best. In the gbest version of the PSO algorithm, presented below, the velocity of each particle is changed, at each time step, toward a combination of its pbest and the gbest locations. The velocity is weighted by a random term, with separate random numbers being generated for velocities toward the pbest and gbest locations. Thus, the particles are drawn both to their own personal best positions and to the best position of the entire swarm. As usual, typical stopping criteria of the algorithm consist of reaching a certain number of iterations, or reaching a certain objective function value.

Basic PSO Algorithm

We now present a simple version of the gbest PSO algorithm, where at each time step the velocity of each particle is changed toward its pbest and the gbest locations. Let $f : \mathbb{R}^n \to \mathbb{R}$ be the objective function that we wish to minimize. Let $d$ be the population size, and index the particles in the swarm by $i = 1, \ldots, d$. Denote the position of particle $i$ by $x_i \in \mathbb{R}^n$ and its velocity by $v_i \in \mathbb{R}^n$. Let $p_i$ be the pbest of particle $i$ and $g$ the gbest.

It is convenient to introduce the Hadamard product (or Schur product) operator, denoted by $\circ$: If $A$ and $B$ are matrices with the same dimension, then $A \circ B$ is a matrix of the same dimension as $A$ (or $B$) resulting from entry-by-entry multiplication of $A$ and $B$. This operation is denoted in MATLAB by ".*" (the dot before an operator indicates entry-by-entry operations). Thus, if A and B have the same dimension, then A.*B returns a matrix whose entries are simply the products of the corresponding individual entries of A and B.

The PSO gbest algorithm uses three given constant real parameters, $\omega$, $c_1$, and $c_2$, which we discuss after presenting the algorithm.

PSO Gbest Algorithm
1. Set $k := 0$. For $i = 1, \ldots, d$, generate initial random positions $x_i^{(0)}$ and velocities $v_i^{(0)}$, and set $p_i^{(0)} = x_i^{(0)}$. Set $g^{(0)} = \arg\min_{x \in \{p_1^{(0)}, \ldots, p_d^{(0)}\}} f(x)$.
2. For $i = 1, \ldots, d$, generate random $n$-vectors $r_i^{(k)}$ and $s_i^{(k)}$ with components uniformly in the interval $(0,1)$, and set
$$v_i^{(k+1)} = \omega v_i^{(k)} + c_1 r_i^{(k)} \circ (p_i^{(k)} - x_i^{(k)}) + c_2 s_i^{(k)} \circ (g^{(k)} - x_i^{(k)}),$$
$$x_i^{(k+1)} = x_i^{(k)} + v_i^{(k+1)}.$$
3. For $i = 1, \ldots, d$, if $f(x_i^{(k+1)}) < f(p_i^{(k)})$, then set $p_i^{(k+1)} = x_i^{(k+1)}$; else, set $p_i^{(k+1)} = p_i^{(k)}$.
4. If there exists $i \in \{1, \ldots, d\}$ such that $f(x_i^{(k+1)}) < f(g^{(k)})$, then set $g^{(k+1)} = x_i^{(k+1)}$; else, set $g^{(k+1)} = g^{(k)}$.
5. If the stopping criterion is satisfied, then stop.
6. Set $k := k + 1$, go to step 2.

In the algorithm, the parameter $\omega$ is referred to as an inertial constant; recommended values are slightly less than 1. The parameters $c_1$ and $c_2$ are constants that determine how much the particle is directed toward "good" positions. They represent a "cognitive" and a "social" component, respectively, in that they affect how much the particle's personal best and the global best influence its movement. Recommended values are $c_1, c_2 \approx 2$.
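Here is a compact MATLAB sketch of the gbest algorithm above, written with columns as particles so that the Hadamard products are simply ".*" operations. The objective function, swarm size, search box, and parameter values ($\omega = 0.9$, $c_1 = c_2 = 2$) are illustrative assumptions.

% PSO gbest algorithm: a sketch.
f = @(x) sum(x.^2, 1);            % assumed objective (columns are particles)
n = 2;  d = 20;                   % dimension and swarm size (assumed)
w = 0.9;  c1 = 2;  c2 = 2;        % inertial, cognitive, social constants
X = 6 * rand(n, d) - 3;           % random initial positions in [-3,3]^n
V = zeros(n, d);                  % initial velocities
P = X;                            % personal bests (pbest)
[~, j] = min(f(P));  g = P(:, j); % global best (gbest)
for k = 1:200
    R = rand(n, d);  S = rand(n, d);
    V = w*V + c1*R.*(P - X) + c2*S.*(g - X);   % velocity update
    X = X + V;                                  % position update
    better = f(X) < f(P);                       % particles improving pbest
    P(:, better) = X(:, better);
    [fg, j] = min(f(P));
    if fg < f(g), g = P(:, j); end              % update gbest
end
disp(g')   % approximately the minimizer, here the origin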
Variations

PSO techniques have evolved since 1995. For example, recently Clerc [29] proposed a constriction-factor version of the algorithm, where the velocity is updated as
$$v_i^{(k+1)} = \kappa\left(v_i^{(k)} + c_1 r_i^{(k)} \circ (p_i^{(k)} - x_i^{(k)}) + c_2 s_i^{(k)} \circ (g^{(k)} - x_i^{(k)})\right),$$
where the constriction coefficient $\kappa$ is computed as
$$\kappa = \frac{2}{\left|2 - \varphi - \sqrt{\varphi^2 - 4\varphi}\right|},$$
where $\varphi = c_1 + c_2$ and $\varphi > 4$. For example, for $\varphi = 4.1$, we have $\kappa \approx 0.729$. The role of the constriction coefficient is to speed up the convergence.

When using PSO in practice, one might wish to clamp the velocities to a certain maximum amount, say $v_{\max}$. In other words, we replace each component $v$ of the velocity vector by
$$\min\{v_{\max},\ \max\{-v_{\max},\ v\}\}.$$

For an up-to-date literature survey and other modifications and heuristics, we recommend the first part of the proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms, held in April 2007 in Warsaw, Poland [5]. In these proceedings, one can find a number of papers dealing with applications of PSO to multiobjective optimization problems, versions of PSO for constrained optimization problems, as well as "niching" versions designed to find multiple solutions, that is, applications of PSO to multimodal optimization problems. For a mathematical analysis of the PSO algorithm, see Clerc and Kennedy [30].

14.5 Genetic Algorithms

Basic Description

A genetic algorithm is a randomized, population-based search technique that has its roots in the principles of genetics. The beginnings of genetic algorithms are credited to John Holland, who developed the basic ideas in the late 1960s and early 1970s. Since its conception, genetic algorithms have been used widely as a tool in computer programming and artificial intelligence (e.g., [61], [79], and [94]), optimization (e.g., [36], [67], and [127]), neural network training (e.g., [80]), and many other areas.

Suppose that we wish to solve an optimization problem of the form

maximize $f(x)$
subject to $x \in \Omega$

(notice that the problem is a maximization, which is more convenient for describing genetic algorithms). The underlying idea of genetic algorithms applied to this problem is as follows. We start with an initial set of points in $\Omega$, denoted $P(0)$, called the initial population. We then evaluate the objective function at points in $P(0)$. Based on this evaluation, we create a new set of points $P(1)$. The creation of $P(1)$ involves certain operations on points in $P(0)$, called crossover and mutation, discussed later. We repeat the procedure iteratively, generating populations $P(2), P(3), \ldots$, until an appropriate stopping criterion is reached. The purpose of the crossover and mutation operations is to create a new population with an average objective function value that is higher than that of the previous population. To summarize, the genetic algorithm iteratively performs the operations of crossover and mutation on each population to produce a new population until a chosen stopping criterion is met. The terminology used in describing genetic algorithms is adopted from genetics.
To proceed with describing the details of the algorithm, we need the additional ideas and terms described below.

Chromosomes and Representation Schemes

First, we point out that genetic algorithms do not work directly with points in the set $\Omega$, but rather with an encoding of the points in $\Omega$. Specifically, we first need to map $\Omega$ onto a set consisting of strings of symbols, all of equal length. These strings are called chromosomes. Each chromosome consists of elements from a chosen set of symbols, called the alphabet. For example, a common alphabet is the set $\{0, 1\}$, in which case the chromosomes are simply binary strings. We denote by $L$ the length of chromosomes (i.e., the number of symbols in the strings). To each chromosome there corresponds a value of the objective function, referred to as the fitness of the chromosome. For each chromosome $x$, we write $f(x)$ for its fitness. Note that, for convenience, we use $f$ to denote both the original objective function and the fitness measure on the set of chromosomes. We assume that $f$ is a nonnegative function.

The choice of chromosome length, alphabet, and encoding (i.e., the mapping from $\Omega$ onto the set of chromosomes) is called the representation scheme for the problem. Identification of an appropriate representation scheme is the first step in using genetic algorithms to solve a given optimization problem. Once a suitable representation scheme has been chosen, the next phase is to initialize the first population $P(0)$ of chromosomes. This is usually done by a random selection of a set of chromosomes. After we form the initial population of chromosomes, we apply the operations of crossover and mutation on the population. During each iteration $k$ of the process, we evaluate the fitness $f(x^{(k)})$ of each member $x^{(k)}$ of the population $P(k)$. After the fitness of the entire population has been evaluated, we form a new population $P(k+1)$ in two stages.

Selection and Evolution

In the first stage we apply an operation called selection, where we form a set $M(k)$ with the same number of elements as $P(k)$. This number is called the population size, which we denote by $N$. The set $M(k)$, called the mating pool, is formed from $P(k)$ using a random procedure as follows: Each point $m^{(k)}$ in $M(k)$ is equal to $x^{(k)}$ in $P(k)$ with probability
$$\frac{f(x^{(k)})}{F(k)},$$
where
$$F(k) = \sum f(x^{(k)})$$
and the sum is taken over the whole of $P(k)$. In other words, we select chromosomes into the mating pool with probabilities proportional to their fitness.

The selection scheme above is also called the roulette-wheel scheme, for the following reason. Imagine a roulette wheel in which each slot is assigned to a chromosome in $P(k)$; some chromosomes may be assigned multiple slots. The number of slots associated with each chromosome is in proportion to its fitness. We then spin the roulette wheel and select [for inclusion in $M(k)$] the chromosome on whose slot the ball comes to rest. This procedure is repeated $N$ times, so that the mating pool $M(k)$ contains $N$ chromosomes.
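A minimal MATLAB sketch of roulette-wheel selection, implemented by sampling from the cumulative fitness distribution, is shown below; the example fitness values are assumptions.

% Roulette-wheel selection: a sketch.
fit = [1 4 3 2];                 % assumed fitness of the N chromosomes in P(k)
N = numel(fit);
cumf = cumsum(fit) / sum(fit);   % cumulative selection probabilities
M = zeros(1, N);                 % indices of chromosomes selected into M(k)
for i = 1:N
    M(i) = find(rand <= cumf, 1);   % spin the wheel once
end
disp(M)   % on average, fitter chromosomes appear more often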
The second stage is called evolution: in this stage, we apply the crossover and mutation operations. The crossover operation takes a pair of chromosomes, called the parents, and gives a pair of offspring chromosomes. The operation involves exchanging substrings of the two parent chromosomes, described below. Pairs of parents for crossover are chosen from the mating pool randomly, such that the probability that a chromosome is chosen for crossover is pc. We assume that whether or not a given chromosome is chosen is independent of whether or not any other chromosome is chosen for crossover. We can pick parents for crossover in several ways. For example, we may randomly choose two chromosomes from the mating pool as parents. In this case if TV is the number of chromosomes in the mating pool, then pc = 2/N. Similarly, if we randomly pick 2k chromosomes from the mating pool (where k < ΑΓ/2), forming k pairs of parents, we have pc = 2k/N. In the two examples above, the number of pairs of parents is fixed and the value of pc is dependent on this number. Yet another way of choosing parents is as follows: Given a value of p c , we pick a random number of pairs of parents such that the average number of pairs is pcN/2. Once the parents for crossover have been determined, we apply the crossover operation to the parents. There are many types of possible crossover operations. The simplest crossover operation is the one-point crossover. In this operation, we first choose a number randomly between 1 and L — 1 according to a uniform distribution, where L is the length of chromosomes. We refer to this number as the crossing site. Crossover then involves exchanging substrings of the parents to the left of the crossing site, as illustrated in Figure 14.7 and in the following example. Example 14.1 Suppose that we have chromosomes of length L = 6 over the binary alphabet {0,1}. Consider the pair of parents 000000 and 111111. Suppose that the crossing site is 4. Then, the crossover operation applied to the parent chromosomes yields the two offspring 000011 and 111100. I We can also have crossover operations with multiple crossing sites, as illustrated in Figure 14.8 and in the following example. 288 GLOBAL SEARCH ALGORITHMS Parent chromosomes Offspring chromosomes Crossing sites Figure 14.8 Illustration of two-point crossover operation. Example 14.2 Consider two chromosomes, 000000000 and 111111111, of length L = 9. Suppose that we have two crossing sites, at 3 and 7. Then, the crossover operation applied to the parent chromosomes above yields the two offspring 000111100 and 111000011. I After the crossover operation, we replace the parents in the mating pool by their offspring. The mating pool has therefore been modified but maintains the same number of elements. Next, we apply the mutation operation, which takes each chromosome from the mating pool and randomly changes each symbol of the chromosome with a given probability pm. In the case of the binary alphabet, this change corresponds to complementing the corresponding bits; that is, we replace each bit with probability pm from 0 to 1, or vice versa. If the alphabet contains more than two symbols, then the change involves randomly substituting the symbol with another symbol from the alphabet. Typically, the value of pm is very small (e.g., 0.01), so that only a few chromosomes will undergo a change due to mutation, and of those that are affected, only a few of the symbols are modified. 
Therefore, the mutation operation plays only a minor role in the genetic algorithm relative to the crossover operation.

After applying the crossover and mutation operations to the mating pool $M(k)$, we obtain the new population $P(k+1)$. We then repeat the procedure of evaluation, selection, and evolution, iteratively. We summarize the genetic algorithm as follows.

Genetic Algorithm
1. Set $k := 0$. Generate an initial population $P(0)$.
2. Evaluate $P(k)$.
3. If the stopping criterion is satisfied, then stop.
4. Select $M(k)$ from $P(k)$.
5. Evolve $M(k)$ to form $P(k+1)$.
6. Set $k := k + 1$, go to step 2.

A flowchart illustrating this algorithm is shown in Figure 14.9.

Figure 14.9 Flowchart for the genetic algorithm.

During execution of the genetic algorithm, we keep track of the best-so-far chromosome, that is, the chromosome with the highest fitness of all the chromosomes evaluated. After each iteration, the best-so-far chromosome serves as the candidate for the solution to the original problem. Indeed, we may even copy the best-so-far chromosome into each new population, a practice referred to as elitism. The elitist strategy may result in domination of the population by "superchromosomes." However, practical experience suggests that elitism often improves the performance of the algorithm.

The stopping criterion can be implemented in a number of ways. For example, a simple stopping criterion is to stop after a prespecified number of iterations. Another possible criterion is to stop when the fitness for the best-so-far chromosome does not change significantly from iteration to iteration.

The genetic algorithm differs from the algorithms discussed in previous chapters in several respects. First, it does not use derivatives of the objective function (like the other methods in this chapter). Second, it uses operations that are random within each iteration (like the other randomized search methods). Third, it searches from a set of points rather than a single point at each iteration (like the PSO algorithm). Fourth, it works with an encoding of the feasible set rather than with the set itself.
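The following MATLAB sketch puts the pieces together for binary chromosomes: roulette-wheel selection, one-point crossover, and mutation. The fitness function, chromosome length, and parameter values are assumptions for illustration; decoding a chromosome into a point of $\Omega$ is problem-dependent and is represented here by a simple bits-to-integer map, and the pairing of adjacent mating-pool members for crossover is one of the choices discussed above.

% Genetic algorithm over binary chromosomes: a sketch.
L = 8;  N = 20;  pc = 0.75;  pm = 0.01;        % assumed parameters
decode = @(c) c * (2.^(L-1:-1:0))';            % bits -> integer (assumed map)
fitness = @(c) decode(c) * (255 - decode(c));  % assumed fitness, max near 127
P = rand(N, L) > 0.5;                          % initial population (rows)
for k = 1:50
    f = zeros(N, 1);
    for i = 1:N, f(i) = fitness(P(i, :)); end
    cumf = cumsum(f) / sum(f);                 % roulette-wheel selection
    M = P;
    for i = 1:N, M(i, :) = P(find(rand <= cumf, 1), :); end
    for i = 1:2:N-1                            % pair up for crossover
        if rand < pc
            site = randi(L - 1);               % crossing site
            tmp = M(i, 1:site);
            M(i, 1:site) = M(i+1, 1:site);     % exchange left substrings
            M(i+1, 1:site) = tmp;
        end
    end
    P = xor(M, rand(N, L) < pm);               % mutation: flip bits w.p. pm
end
for i = 1:N, f(i) = fitness(P(i, :)); end      % report the best chromosome
[fb, ib] = max(f);
fprintf('best: %s, fitness %d\n', char(P(ib, :) + '0'), fb)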
For example, the point [x,2/]T = [—1,3]T is encoded as (see GENETIC ALGORITHMS Figure 14.10 291 Plot of / for Example 14.3. Exercise 14.4 for a simple algorithm for converting from decimal into binary) v01010101010101011111111111111111. N v ' encoded x = — 1 v ' encoded y = 3 Using a population size of 20, we apply 50 iterations of the genetic algorithm on the problem above. We used parameter values of pc = 0.75 and pm = 0.0075. Figure 14.11 shows plots of the best, average, and worst objective function values in the population for every iteration (generation) of the algorithm. The best-so-far solution obtained at the end of the 50 iterations is [0.0615,1.5827]T, with objective function value 8.1013. Note that this solution and objective function value are very close to those obtained using MATLAB. ■ Analysis of Genetic Algorithms In this section we use heuristic arguments to describe why genetic algorithms work. As pointed out before, the genetic algorithm was motivated by ideas from natural genetics [61]. Specifically, the notion of "survival of the fittest" plays a central role. The mechanisms used in the genetic algorithm mimic this principle. We start with a population of chromosomes, and selectively pick the fittest ones for reproduction. From these selected chromosomes, we form the new generation by combining information encoded in them. In this way, 292 GLOBAL SEARCH ALGORITHMS 10 8 o Best - * - Average -*- Worst 6 <D ,3 <0 > 4 c g t3 § 2 LL <D 1° o -2 -4 "0 10 20 30 Generations 40 50 Figure 14.11 The best, average, and worst objective function values in the population for every iteration (generation) of the genetic algorithm in Example 14.3. the goal is to ensure that the fittest members of the population survive and their information content is preserved and combined to produce even better offspring. To further analyze the genetic algorithm in a more quantitative fashion, we need to define a few terms. For convenience, we only consider chromosomes over the binary alphabet. We introduce the notion of a schema (plural: schemata) as a set of chromosomes with certain common features. Specifically, a schema is a set of chromosomes that contain Is and 0s in particular locations. We represent a schema using a string notation over an extended alphabet {0,1, *}. For example, the notation 1 * 01 represents the schema 1*01 = {1001,1101}, and the notation 0 * 101* represents the schema 0 * 101* = {001010,001011,011010,011011}. In the schema notation, the numbers 0 and 1 denote the fixed binary values in the chromosomes that belong to the schema. The symbol *, meaning "don't care," matches either 0 or 1 at the positions it occupies. Thus, a schema describes a set of chromosomes that have certain specified similarities. A chromosome belongs to a particular schema if for all positions j = 1 , . . . , L the symbol found in the jth position of the chromosome matches the symbol found in the jth position of the schema, with the understanding that any symbol matches *. Note that if a schema has r "don't care" symbols, then it GENETIC ALGORITHMS 293 contains 2 r chromosomes. Moreover, any chromosome of length L belongs to 2 L schemata. Given a schema that represents good solutions to our optimization problem, we would like the number of matching chromosomes in the population P(k) to grow as k increases. This growth is affected by several factors, which we analyze in the following discussion. We assume throughout that we are using the roulette-wheel selection method. 
The first key idea in explaining why the genetic algorithm works is the observation that if a schema has chromosomes with better-than-average fitness, then the expected (mean) number of chromosomes matching this schema in the mating pool M(k) is larger than the number of chromosomes matching this schema in the population P(k).

To quantify this assertion, we need some additional notation. Let H be a given schema, and let $e(H, k)$ be the number of chromosomes in P(k) that match H; that is, $e(H, k)$ is the number of elements in the set $P(k) \cap H$. Let $f(H, k)$ be the average fitness of chromosomes in P(k) that match schema H. This means that if $H \cap P(k) = \{x_1, \ldots, x_{e(H,k)}\}$, then
$$f(H, k) = \frac{f(x_1) + \cdots + f(x_{e(H,k)})}{e(H, k)}.$$
Let N be the number of chromosomes in the population and F(k) be the sum of the fitness values of chromosomes in P(k), as before. Denote by $\bar{F}(k)$ the average fitness of chromosomes in the population; that is,
$$\bar{F}(k) = \frac{F(k)}{N}.$$
Finally, let $m(H, k)$ be the number of chromosomes in M(k) that match H, in other words, the number of elements in the set $M(k) \cap H$.

Lemma 14.1 Let H be a given schema and $\mathcal{M}(H, k)$ be the expected value of $m(H, k)$ given P(k). Then,
$$\mathcal{M}(H, k) = \frac{f(H, k)}{\bar{F}(k)}\, e(H, k). \qquad \Box$$

Proof. Let $P(k) \cap H = \{x_1, \ldots, x_{e(H,k)}\}$. In the remainder of the proof, the term expected should be taken to mean "expected, given P(k)." For each element $m^{(j)} \in M(k)$ and each $i = 1, \ldots, e(H, k)$, the probability that $m^{(j)} = x_i$ is given by $f(x_i)/F(k)$. Thus, the expected number of chromosomes in M(k) equal to $x_i$ is
$$N\, \frac{f(x_i)}{F(k)} = \frac{f(x_i)}{\bar{F}(k)}.$$
Hence, the expected number of chromosomes in $P(k) \cap H$ that are selected into M(k) is
$$\sum_{i=1}^{e(H,k)} \frac{f(x_i)}{\bar{F}(k)} = \frac{f(H, k)}{\bar{F}(k)}\, e(H, k).$$
Because any chromosome in M(k) is also a chromosome in P(k), the chromosomes in $M(k) \cap H$ are simply those in $P(k) \cap H$ that are selected into M(k). Hence,
$$\mathcal{M}(H, k) = \frac{f(H, k)}{\bar{F}(k)}\, e(H, k). \qquad \blacksquare$$

Lemma 14.1 quantifies our assertion that if a schema H has chromosomes with better-than-average fitness [i.e., $f(H, k)/\bar{F}(k) > 1$], then the expected number of chromosomes matching H in the mating pool M(k) is larger than the number of chromosomes matching H in the population P(k).

We now analyze the effect of the evolution operations on the chromosomes in the mating pool. For this, we need to introduce two parameters that are useful in the characterization of a schema: order and length. The order $o(S)$ of a schema S is the number of fixed symbols (non-* symbols) in its representation (the notation $o(S)$ is standard in the literature on genetic algorithms and should not be confused with the "little-oh" symbol defined in Section 5.6). If the length of chromosomes in S is L, then $o(S)$ is L minus the number of * symbols in S. For example,
$$o(1{*}01) = 4 - 1 = 3,$$
whereas
$$o(0{*}1{*}01) = 6 - 2 = 4.$$
The length $l(S)$ of a schema S is the distance between the first and last fixed symbols (i.e., the difference between the positions of the rightmost fixed symbol and the leftmost fixed symbol). For example,
$$l(1{*}01) = 4 - 1 = 3, \qquad l(0{*}101{*}) = 5 - 1 = 4, \qquad l({*}{*}1{*}) = 0.$$
Note that for a schema S with chromosomes of length L, the order $o(S)$ is a number between 0 and L and the length $l(S)$ is a number between 0 and L - 1. The order of a schema with all * symbols is 0; its length is also 0. The order of a schema containing only a single element (i.e., a schema whose representation has no * symbols) is L [e.g., $o(1011) = 4 - 0 = 4$]. The length of a schema with fixed symbols in its first and last positions is L - 1 [e.g., $l(0{*}{*}1) = 4 - 1 = 3$].
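Both parameters are easy to read off from the string representation of a schema. A short MATLAB sketch (ours, not from the text), using the character '*' for the don't-care symbol:

% Sketch: order o(S) and length l(S) of a schema S over {'0','1','*'}.
S = '0*101*';
fixed = find(S ~= '*');        % positions of the fixed symbols
o = numel(fixed);              % order: number of fixed symbols, here 4
l = max(fixed) - min(fixed);   % length: span of fixed symbols, here 5 - 1 = 4
% (For a schema consisting of all *'s, take o(S) = 0 and l(S) = 0.)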
We first consider the effect of the crossover operation on the mating pool. The basic observation in the following lemma is that given a chromosome in $M(k) \cap H$, the probability that it leaves H after crossover is bounded above by a quantity that is proportional to $p_c$ and $l(H)$.

Lemma 14.2 Given a chromosome in $M(k) \cap H$, the probability that it is chosen for crossover and neither of its offspring is in H is bounded above by
$$p_c\, \frac{l(H)}{L - 1}. \qquad \Box$$

Proof. Consider a given chromosome in $M(k) \cap H$. The probability that it is chosen for crossover is $p_c$. If neither of its offspring is in H, then the crossover point must lie between the corresponding first and last fixed symbols of H. The probability of this is $l(H)/(L - 1)$. Hence, the probability that the given chromosome is chosen for crossover and neither of its offspring is in H is bounded above by $p_c\, l(H)/(L - 1)$. ■

From Lemma 14.2 we conclude that given a chromosome in $M(k) \cap H$, the probability either that it is not selected for crossover or that at least one of its offspring is in H after the crossover operation is bounded below by
$$1 - p_c\, \frac{l(H)}{L - 1}.$$
Note that if a chromosome in H is chosen for crossover and the other parent chromosome is also in H, then both offspring are automatically in H (see Exercise 14.5). Hence, for each chromosome in $M(k) \cap H$, there is a certain probability that it will result in an associated chromosome in H (either itself or one of its offspring) after going through crossover (including selection for crossover), and that probability is bounded below by the foregoing expression.

We next consider the effect of the mutation operation on the mating pool M(k).

Lemma 14.3 Given a chromosome in $M(k) \cap H$, the probability that it remains in H after the mutation operation is given by
$$(1 - p_m)^{o(H)}. \qquad \Box$$

Proof. Given a chromosome in $M(k) \cap H$, it remains in H after the mutation operation if and only if none of the symbols in this chromosome that correspond to fixed symbols in H are changed by the mutation operation. The probability of this event is $(1 - p_m)^{o(H)}$. ■

Note that if $p_m$ is small, the expression $(1 - p_m)^{o(H)}$ above is approximately equal to $1 - p_m o(H)$.

The following theorem combines the results of the preceding lemmas.

Theorem 14.1 Let H be a given schema and $\mathcal{E}(H, k + 1)$ be the expected value of $e(H, k + 1)$ given P(k). Then,
$$\mathcal{E}(H, k + 1) \ge \left(1 - p_c\, \frac{l(H)}{L - 1}\right)(1 - p_m)^{o(H)}\, \frac{f(H, k)}{\bar{F}(k)}\, e(H, k). \qquad \Box$$

Proof. Consider a given chromosome in $M(k) \cap H$. If, after the evolution operations, it has a resulting chromosome that is in H, then that chromosome is in $P(k + 1) \cap H$. By Lemmas 14.2 and 14.3, the probability of this event is bounded below by
$$\left(1 - p_c\, \frac{l(H)}{L - 1}\right)(1 - p_m)^{o(H)}.$$
Therefore, because each chromosome in $M(k) \cap H$ results in a chromosome in $P(k + 1) \cap H$ with a probability bounded below by the expression above, the expected value of $e(H, k + 1)$ given M(k) is bounded below by
$$\left(1 - p_c\, \frac{l(H)}{L - 1}\right)(1 - p_m)^{o(H)}\, m(H, k).$$
Taking the expectation given P(k), we get
$$\mathcal{E}(H, k + 1) \ge \left(1 - p_c\, \frac{l(H)}{L - 1}\right)(1 - p_m)^{o(H)}\, \mathcal{M}(H, k).$$
Finally, using Lemma 14.1, we arrive at the desired result. ■

Theorem 14.1 indicates how the number of chromosomes in a given schema changes from one population to the next. Three factors influence this change, reflected by the three terms on the right-hand side of the inequality in Theorem 14.1: $1 - p_c l(H)/(L - 1)$, $(1 - p_m)^{o(H)}$, and $f(H, k)/\bar{F}(k)$. Note that the larger the values of these terms, the higher the expected number of matches of the schema H in the next population.
The effect of each term is summarized as follows:

■ The term $f(H, k)/\bar{F}(k)$ reflects the role of the average fitness of the given schema H—the higher the average fitness, the higher the expected number of matches in the next population.

■ The term $1 - p_c l(H)/(L - 1)$ reflects the effect of crossover—the smaller the term $p_c l(H)/(L - 1)$, the higher the expected number of matches in the next population.

■ The term $(1 - p_m)^{o(H)}$ reflects the effect of mutation—the larger the term, the higher the expected number of matches in the next population.

In summary, we see that a schema that is short, is of low order, and has above-average fitness will have, on average, an increasing number of representatives in the population from iteration to iteration. Observe that the encoding is relevant to the performance of the algorithm. Specifically, a good encoding is one that results in high-fitness schemata having small lengths and orders.

Real-Number Genetic Algorithms

The genetic algorithms described thus far operate on binary strings representing elements of the feasible set $\Omega$. (For this reason, genetic algorithms are also suitably applied to combinatorial optimization problems, where $\Omega$ is not $\mathbb{R}^n$ but some discrete set.) Binary encodings allow us to use the schema theory, described in the preceding section, to analyze genetic algorithms. However, there are some disadvantages to operating on binary strings. To see this, let $g : \{0,1\}^L \to \Omega$ represent the binary "decoding" function; that is, if $x$ is a binary chromosome, $g(x) \in \Omega$ is the point in the feasible set $\Omega \subset \mathbb{R}^n$ whose encoding is $x$. Therefore, the objective function being maximized by the genetic algorithm is not $f$ itself but rather the composition of $f$ and the decoding function $g$. In other words, the optimization problem being solved by the genetic algorithm is
$$\begin{array}{ll} \text{maximize} & f(g(x)) \\ \text{subject to} & x \in \{y \in \{0,1\}^L : g(y) \in \Omega\}. \end{array}$$
This optimization problem may be more complex than the original optimization problem. For example, it may have extra maximizers, making the search for a global maximizer more difficult.

The above motivates a consideration of genetic algorithms that operate directly on the original optimization problem. In other words, we wish to implement a genetic algorithm that operates directly on $\mathbb{R}^n$. The steps of this algorithm will be the same as before (see Figure 14.9), except that the elements of the population are points in the feasible set $\Omega$ rather than binary strings. We will need to define appropriate crossover and mutation operations for this case.

For crossover, we have several options. The simplest is to use averaging: For a pair of parents $x$ and $y$, the offspring is $z = (x + y)/2$ (this type of crossover operation is used, e.g., in [103]). This offspring can then replace one of the parents. Alternatively, we may produce two offspring as follows: $z_1 = (x + y)/2 + w_1$ and $z_2 = (x + y)/2 + w_2$, where $w_1$ and $w_2$ are two randomly generated vectors (with zero mean). If either offspring lies outside $\Omega$, we have to bring the offspring back into $\Omega$ using, for example, a projection (see Section 23.2). A third option for crossover is to take random convex combinations of the parents. Specifically, given a pair of parents $x$ and $y$, we generate a random number $\alpha \in (0, 1)$ and then produce two offspring $z_1 = \alpha x + (1 - \alpha)y$ and $z_2 = (1 - \alpha)x + \alpha y$. This method of crossover ensures that the offspring are always in the feasible set, provided that the feasible set is convex.
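For illustration, the convex-combination crossover just described amounts to only a few lines of MATLAB (a sketch; x and y are the parent vectors, assumed given):

% Sketch of crossover by random convex combination of parents x and y.
alpha = rand;                    % alpha uniformly distributed on (0,1)
z1 = alpha*x + (1 - alpha)*y;    % first offspring
z2 = (1 - alpha)*x + alpha*y;    % second offspring
% If the feasible set is convex, both offspring are automatically feasible.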
A fourth option is to perturb the two points above by some random amount: $z_1 = \alpha x + (1 - \alpha)y + w_1$ and $z_2 = (1 - \alpha)x + \alpha y + w_2$, where $w_1$ and $w_2$ are two randomly generated vectors (with zero mean). In this case we have to check the offspring for feasibility and use projections if needed.

For mutation, a simple implementation is to add a random vector to the chromosome. Specifically, given a chromosome $x$, we produce its mutation as $x' = x + w$, where $w$ is a random vector with zero mean. This mutation operation is also called real-number creep (see, e.g., [103]). As before, we have to ensure that the mutated chromosome is feasible; if it is not, we may use a projection. An alternative method for mutation is to replace the chosen chromosome with a random convex combination of the chromosome and a random point in the feasible set; that is, we generate a random number $\alpha \in (0, 1)$ and a random point $w \in \Omega$, and set $x' = \alpha x + (1 - \alpha)w$. Provided that the feasible set is convex, the mutated chromosome will always be feasible.

Example 14.4 Consider again the function $f : \mathbb{R}^2 \to \mathbb{R}$ from Example 14.3. We apply a real-number genetic algorithm to find a maximizer of $f$, using a crossover operation of the fourth type described above and a mutation operation of the second type above. With a population size of 20, we apply 50 iterations of the genetic algorithm. As before, we used parameter values of $p_c = 0.75$ and $p_m = 0.0075$. Figure 14.12 shows plots of the best, average, and worst objective function values in the population for every iteration (generation) of the algorithm. The best-so-far solution obtained at the end of the 50 iterations is $[-0.0096, 1.5845]^T$, with objective function value 8.1061, which is close to the result of Example 14.3. ■

Figure 14.12 The best, average, and worst objective function values in the population for every iteration (generation) of the real-number genetic algorithm in Example 14.4.

EXERCISES

14.1 Write a MATLAB program to implement the Nelder-Mead algorithm applied to minimizing the function
$$f(x_1, x_2) = (x_2 - x_1)^4 + 12x_1x_2 - x_1 + x_2 - 3$$
on $\Omega = \{x \in \mathbb{R}^2 : x_1, x_2 \in [-1, 1]\}$. Locate the iteration points on the level sets of $f$. Connect the successive points with lines to show clearly the progression of the optimization process. Test your program with the two starting points
$$x^{(0)} = \begin{bmatrix} 0.55 \\ 0.7 \end{bmatrix} \quad \text{and} \quad x^{(0)} = \begin{bmatrix} 0.9 \\ -0.5 \end{bmatrix}.$$

14.2 Write MATLAB programs to implement naive random search and simulated annealing. Use the neighborhood
$$N(x^{(k)}) = \{x : x_i^{(k)} - \alpha \le x_i \le x_i^{(k)} + \alpha\},$$
where $\alpha > 0$ is prespecified, and pick $z^{(k)}$ to be uniformly distributed on $N(x^{(k)})$. Test both algorithms on maximizing the MATLAB "peaks" function given in Example 14.3. Observe the effect of varying $\alpha$.

14.3 Write a MATLAB program to implement a particle swarm optimization algorithm. Test your implementation on maximizing the MATLAB "peaks" function given in Example 14.3.

14.4 This problem has four parts and is related to binary encoding for genetic algorithms.

a. Let $(I)_{10}$ be the decimal representation of a given integer, and let $a_m a_{m-1} \cdots a_0$ be its binary representation; that is, each $a_i$ is either 0 or 1, and
$$(I)_{10} = a_m 2^m + a_{m-1} 2^{m-1} + \cdots + a_1 2^1 + a_0 2^0.$$
Verify that the following is true:
$$(I)_{10} = (((\cdots(((a_m 2 + a_{m-1})2 + a_{m-2})2 + a_{m-3})\cdots)2 + a_1)2 + a_0).$$
b. The second expression in part a suggests a simple algorithm for converting from decimal representation to the equivalent binary representation, as follows. Dividing both sides of the expression in part a by 2, the remainder is $a_0$. Subsequent divisions by 2 yield the remaining bits $a_1, a_2, \ldots, a_m$ as remainders. Use this algorithm to find the binary representation of the integer $(I)_{10} = 1995$.

c. Let $(F)_{10}$ be the decimal representation of a given number in [0, 1], and let $0.a_{-1}a_{-2}\cdots$ be its binary representation; that is,
$$(F)_{10} = a_{-1}2^{-1} + a_{-2}2^{-2} + \cdots.$$
If this expression is multiplied by 2, the integer part of the product is $a_{-1}$. Subsequent multiplications yield the remaining bits $a_{-2}, a_{-3}, \ldots$. As in part b, the above gives a simple algorithm for converting a decimal fraction to its binary representation. Use this algorithm to find the binary representation of $(F)_{10} = 0.7265625$.

Note that we can combine the algorithms from parts b and c to convert an arbitrary positive decimal representation into its equivalent binary representation. Specifically, we apply the algorithms of parts b and c separately to the integer and fraction parts of the given decimal number, respectively.

d. The procedure in part c may yield an infinitely long binary representation. If this is the case, then we need to determine the number of bits required to keep at least the same accuracy as the given decimal number. If we have a d-digit decimal fraction, then the number of bits b in the binary representation must satisfy $2^{-b} \le 10^{-d}$, which yields $b \ge 3.32d$. Convert 19.95 to its equivalent binary representation with at least the same degree of accuracy (i.e., to two decimal places).

14.5 Given two chromosomes in a schema H, suppose that we swap some (or all) of the symbols between them at corresponding positions. Show that the resulting two chromosomes are also in H. From this fact we conclude that given two chromosomes in H, both offspring after the crossover operation are also in H. In other words, the crossover operation preserves membership in H.

14.6 Consider a two-point crossover scheme (see Example 14.2), described as follows. Given a pair of binary chromosomes of length L, we independently choose two random numbers, uniform over $1, \ldots, L - 1$. We call the two numbers $c_1$ and $c_2$, where $c_1 \le c_2$. If $c_1 = c_2$, we do not swap any symbols (i.e., we leave the two given parent chromosomes unchanged). If $c_1 < c_2$, we interchange the $(c_1 + 1)$th through $c_2$th bits of the given parent chromosomes. Prove the analog of Lemma 14.2 for this case, given below.

Lemma: Given a chromosome in $M(k) \cap H$, the probability that it is chosen for crossover and neither of its offspring is in H is bounded above by
$$p_c\left(1 - \left(1 - \frac{l(H)}{L - 1}\right)^2\right). \qquad \Box$$

Hint: Note that the two-point crossover operation is equivalent to a composition of two one-point crossover operations (i.e., doing two one-point crossover operations in succession).

14.7 State and prove the analog of Lemma 14.2 for an n-point crossover operation. Hint: See Exercise 14.6.

14.8 Implement the roulette-wheel selection scheme using MATLAB. Hint: Use the MATLAB functions sum, cumsum, and find.

14.9 Implement the one-point crossover operation using MATLAB, assuming that we are given two binary parent chromosomes.

14.10 Implement the mutation operation using the MATLAB function xor, assuming that the chromosomes in the mating pool are binary vectors.

14.11 Write a MATLAB program to implement a genetic algorithm using binary encoding. Test your implementation on the following functions:
a. $f(x) = -15\sin^2(2x) - (x - 2)^2 + 160$, $|x| \le 10$.

b. $f(x, y) = 3(1 - x)^2 e^{-x^2 - (y+1)^2} - 10\left(\frac{x}{5} - x^3 - y^5\right) e^{-x^2 - y^2} - \frac{1}{3} e^{-(x+1)^2 - y^2}$, $|x|, |y| \le 3$ (considered in Example 14.3).

14.12 Write a MATLAB program to implement a real-number genetic algorithm. Test your implementation on the function $f(x) = x_1 \sin(x_1) + x_2 \sin(5x_2)$ with the constraint set $\Omega = \{x : 0 \le x_1 \le 10, \ 4 \le x_2 \le 6\}$.

PART III

LINEAR PROGRAMMING

CHAPTER 15

INTRODUCTION TO LINEAR PROGRAMMING

15.1 Brief History of Linear Programming

The goal of linear programming is to determine the values of decision variables that maximize or minimize a linear objective function, where the decision variables are subject to linear constraints. A linear programming problem is a special case of a general constrained optimization problem. In the general setting, the goal is to find a point that minimizes the objective function and at the same time satisfies the constraints. We refer to any point that satisfies the constraints as a feasible point. In a linear programming problem, the objective function is linear, and the set of feasible points is determined by a set of linear equations and/or inequalities.

In this part we study methods for solving linear programming problems. Linear programming methods provide a way of choosing the best feasible point among the many possible feasible points. In general, the number of feasible points is infinitely large. However, as we shall see, the solution to a linear programming problem can be found by searching through a particular finite number of feasible points, known as basic feasible solutions. Therefore, in principle, we can solve a linear programming problem simply by comparing the finite number of basic feasible solutions and finding one that minimizes or maximizes the objective function—we refer to this approach as the brute-force approach. For most practical decision problems, even this finite number of basic feasible solutions is so large that the method of choosing the best solution by comparing them to each other is impractical.

To get a feel for the amount of computation needed in a brute-force approach, consider the following example. Suppose that we have a small factory with 20 different machines producing 20 different parts. Assume that any of the machines can produce any part. We also assume that the time for producing each part on each machine is known. The problem then is to assign a part to each machine so that the overall production time is minimized. We see that there are 20! (20 factorial) possible assignments. The brute-force approach to solving this assignment problem would involve writing down all the possible assignments and then choosing the best one by comparing them. Suppose that we have at our disposal a computer that takes 1 μs ($10^{-6}$ second) to determine each assignment. Then, to find the best (optimal) assignment, this computer would need 77,147 years (working 24 hours a day, 365 days a year) to find the best solution.

An alternative approach to solving this problem is to use experienced planners to optimize this assignment problem. Such an approach relies on heuristics. Heuristics come close, but give suboptimal answers. Heuristics that do reasonably well, with an error of, say, 10%, may still not be good enough.
For example, in a business that operates on large volumes and a small profit margin, a 10% error could mean the difference between loss and profit.

Efficient methods for solving linear programming problems became available in the late 1930s. In 1939, Kantorovich presented a number of solutions to some problems related to production and transportation planning. During World War II, Koopmans contributed significantly to the solution of transportation problems. Kantorovich and Koopmans were awarded a Nobel Prize in Economics in 1975 for their work on the theory of optimal allocation of resources. In 1947, Dantzig developed a new method for solving linear programs, known today as the simplex method (see [34] for Dantzig's own treatment of the algorithm). In the following chapters we discuss the simplex method in detail. The simplex method is efficient and elegant and has been declared one of the 10 algorithms with the greatest influence on the development and practice of science and engineering in the twentieth century [40].

The simplex method has the undesirable property that, in the worst case, the number of steps (and hence the total time) required to find a solution grows exponentially with the number of variables. Thus, the simplex method is said to have exponential worst-case complexity. This led to an interest in devising algorithms for solving linear programs that have polynomial complexity—algorithms that find a solution in an amount of time that is bounded by a polynomial in the number of variables. Khachiyan, in 1979, was the first to devise such an algorithm. However, his algorithm gained more theoretical than practical interest. Then, in 1984, Karmarkar proposed a new linear programming algorithm that has polynomial complexity and appears to solve some complicated real-world problems of scheduling, routing, and planning more efficiently than the simplex method. Karmarkar's work led to the development of many other nonsimplex methods, commonly referred to as interior-point methods. This approach is currently still an active research area. For more details on Karmarkar's and related algorithms, see [42], [55], [71], [119], and [124]. Some basic ideas illustrating Khachiyan's and Karmarkar's algorithms are presented in Chapter 18.

15.2 Simple Examples of Linear Programs

Formally, a linear program is an optimization problem of the form
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax = b \\ & x \ge 0, \end{array}$$
where $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $A \in \mathbb{R}^{m \times n}$. The vector inequality $x \ge 0$ means that each component of $x$ is nonnegative. Several variations of this problem are possible; for example, instead of minimizing, we can maximize, or the constraints may be in the form of inequalities, such as $Ax \ge b$ or $Ax \le b$. We also refer to these variations as linear programs. In fact, as we shall see later, these variations can all be rewritten into the standard form shown above.

The purpose of this section is to give some simple examples of linear programming problems illustrating the importance and the various applications of linear programming methods.

Example 15.1 This example is adapted from [123]. A manufacturer produces four different products: $X_1$, $X_2$, $X_3$, and $X_4$. There are three inputs to this production process: labor in person-weeks, kilograms of raw material A, and boxes of raw material B. Each product has different input requirements. In determining each week's production schedule, the manufacturer cannot use more than the available amounts of labor and the two raw materials.
The relevant information is presented in Table 15.1. Every production decision must satisfy the restrictions on the availability of inputs. These constraints can be written using the data in Table 15.1. In particular, we have
$$\begin{aligned} x_1 + 2x_2 + x_3 + 2x_4 &\le 20 \\ 6x_1 + 5x_2 + 3x_3 + 2x_4 &\le 100 \\ 3x_1 + 4x_2 + 9x_3 + 12x_4 &\le 75. \end{aligned}$$
Because negative production levels are not meaningful, we must impose the following nonnegativity constraints on the production levels:
$$x_i \ge 0, \quad i = 1, 2, 3, 4.$$

Table 15.1 Data for Example 15.1

Inputs                     X1   X2   X3   X4   Input Availabilities
Person-weeks                1    2    1    2    20
Kilograms of material A     6    5    3    2   100
Boxes of material B         3    4    9   12    75
Production levels          x1   x2   x3   x4

Now, suppose that one unit of product $X_1$ sells for $6, and $X_2$, $X_3$, and $X_4$ sell for $4, $7, and $5, respectively. Then, the total revenue for any production decision $(x_1, x_2, x_3, x_4)$ is
$$f(x_1, x_2, x_3, x_4) = 6x_1 + 4x_2 + 7x_3 + 5x_4.$$
The problem is then to maximize $f$ subject to the given constraints (the three inequalities and four nonnegativity constraints). Using vector notation with $x = [x_1, x_2, x_3, x_4]^T$, the problem can be written in the compact form
$$\begin{array}{ll} \text{maximize} & c^T x \\ \text{subject to} & Ax \le b \\ & x \ge 0, \end{array}$$
where
$$c^T = [6, 4, 7, 5], \quad A = \begin{bmatrix} 1 & 2 & 1 & 2 \\ 6 & 5 & 3 & 2 \\ 3 & 4 & 9 & 12 \end{bmatrix}, \quad b = \begin{bmatrix} 20 \\ 100 \\ 75 \end{bmatrix}. \qquad \blacksquare$$

Another example that illustrates linear programming involves determining the most economical diet that satisfies the basic minimum requirements for good health.

Example 15.2 Diet Problem. This example is adapted from [88]. Assume that n different food types are available. The jth food sells at a price $c_j$ per unit. In addition, there are m basic nutrients. To achieve a balanced diet, you must receive at least $b_i$ units of the ith nutrient per day. Assume that each unit of food j contains $a_{ij}$ units of the ith nutrient. Denote by $x_j$ the number of units of food j in the diet. The objective is to select the $x_j$ to minimize the total cost of the diet:
$$\text{minimize} \quad c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$
subject to the nutritional constraints
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &\ge b_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &\ge b_2 \\ &\ \ \vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &\ge b_m \end{aligned}$$
and the nonnegativity constraints
$$x_1 \ge 0, \ x_2 \ge 0, \ \ldots, \ x_n \ge 0.$$
In the more compact vector notation, this problem becomes
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax \ge b \\ & x \ge 0, \end{array}$$
where $x = [x_1, x_2, \ldots, x_n]^T$ is an n-dimensional column vector, $c^T$ is an n-dimensional row vector, $A$ is an $m \times n$ matrix, and $b$ is an m-dimensional column vector. We call this problem the diet problem and will return to it in Chapter 17. ■

In the next example we consider a linear programming problem that arises in manufacturing.

Example 15.3 A manufacturer produces two different products, $X_1$ and $X_2$, using three machines: $M_1$, $M_2$, and $M_3$. Each machine can be used for only a limited amount of time. Production times of each product on each machine are given in Table 15.2. The objective is to maximize the combined time of utilization of all three machines.

Every production decision must satisfy the constraints on the available time. These restrictions can be written down using the data from Table 15.2. In particular, we have
$$\begin{aligned} x_1 + x_2 &\le 8 \\ x_1 + 3x_2 &\le 18 \\ 2x_1 + x_2 &\le 14, \end{aligned}$$
where $x_1$ and $x_2$ denote the production levels.
The combined production time of all three machines is
$$f(x_1, x_2) = 4x_1 + 5x_2.$$

Table 15.2 Data for Example 15.3

           Production time (hours/unit)
Machine       X1    X2    Available time (hours)
M1             1     1     8
M2             1     3    18
M3             2     1    14
Total          4     5

Thus, writing $x = [x_1, x_2]^T$, the problem in compact notation has the form
$$\begin{array}{ll} \text{maximize} & c^T x \\ \text{subject to} & Ax \le b \\ & x \ge 0, \end{array}$$
where
$$c^T = [4, 5], \quad A = \begin{bmatrix} 1 & 1 \\ 1 & 3 \\ 2 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 8 \\ 18 \\ 14 \end{bmatrix}. \qquad \blacksquare$$

In the following example we discuss an application of linear programming in transportation.

Example 15.4 A manufacturing company has plants in cities A, B, and C. The company produces and distributes its product to dealers in various cities. On a particular day, the company has 30 units of its product in A, 40 in B, and 30 in C. The company plans to ship 20 units to D, 20 to E, 25 to F, and 35 to G, following orders received from dealers. The transportation costs per unit of each product between the cities are given in Table 15.3. In the table, the quantities supplied and demanded appear at the right and along the bottom of the table. The quantities to be transported from the plants to different destinations are represented by the decision variables.

Table 15.3 Data for Example 15.4

From\To     D      E      F      G      Supply
A          $7    $10    $14     $8      30
B          $7    $11    $12     $6      40
C          $5     $8    $15     $9      30
Demand     20     20     25     35     100

This problem can be stated in the form
$$\begin{array}{ll} \text{minimize} & 7x_{11} + 10x_{12} + 14x_{13} + 8x_{14} + 7x_{21} + 11x_{22} + 12x_{23} + 6x_{24} \\ & \quad + 5x_{31} + 8x_{32} + 15x_{33} + 9x_{34} \\ \text{subject to} & x_{11} + x_{12} + x_{13} + x_{14} = 30 \\ & x_{21} + x_{22} + x_{23} + x_{24} = 40 \\ & x_{31} + x_{32} + x_{33} + x_{34} = 30 \\ & x_{11} + x_{21} + x_{31} = 20 \\ & x_{12} + x_{22} + x_{32} = 20 \\ & x_{13} + x_{23} + x_{33} = 25 \\ & x_{14} + x_{24} + x_{34} = 35 \\ & x_{11}, x_{12}, \ldots, x_{34} \ge 0. \end{array}$$
In this problem one of the constraint equations is redundant because it can be derived from the rest of the constraint equations. The mathematical formulation of the transportation problem is then in a linear programming form with twelve (3 × 4) decision variables and six (3 + 4 − 1) linearly independent constraint equations. Obviously, we also require nonnegativity of the decision variables, since a negative shipment is impossible and does not have a valid interpretation. ■

Next, we give an example of a linear programming problem arising in electrical engineering.

Example 15.5 This example is adapted from [100]. Figure 15.1 shows an electric circuit that is designed to use a 30-V source to charge 10-V, 6-V, and 20-V batteries connected in parallel. Physical constraints limit the currents $I_1$, $I_2$, $I_3$, $I_4$, and $I_5$ to a maximum of 4 A, 3 A, 3 A, 2 A, and 2 A, respectively. In addition, the batteries must not be discharged; that is, the currents $I_1$, $I_2$, $I_3$, $I_4$, and $I_5$ must not be negative. We wish to find the values of the currents $I_1, \ldots, I_5$ such that the total power transferred to the batteries is maximized.

Figure 15.1 Battery charger circuit for Example 15.5.

The total power transferred to the batteries is the sum of the powers transferred to each battery and is given by $10I_2 + 6I_4 + 20I_5$ W. From the circuit in Figure 15.1, we observe that the currents satisfy the constraints $I_1 = I_2 + I_3$ and $I_3 = I_4 + I_5$. Therefore, the problem can be posed as the following linear program:
$$\begin{array}{ll} \text{maximize} & 10I_2 + 6I_4 + 20I_5 \\ \text{subject to} & I_1 = I_2 + I_3 \\ & I_3 = I_4 + I_5 \\ & I_1 \le 4, \ I_2 \le 3, \ I_3 \le 3, \ I_4 \le 2, \ I_5 \le 2 \\ & I_1, I_2, I_3, I_4, I_5 \ge 0. \end{array} \qquad \blacksquare$$

Finally, we present an example from wireless communications.

Example 15.6 Consider the wireless communication system shown in Figure 15.2. There are n "mobile" users.
For each $i = 1, \ldots, n$, user $i$ transmits a signal to the base station with power $p_i$ and an attenuation factor of $h_i$ (i.e., the actual signal power received at the base station from user $i$ is $h_i p_i$). When the base station is receiving from user $i$, the total power received from all other users is considered interference (i.e., the interference for user $i$ is $\sum_{j \ne i} h_j p_j$). For the communication with user $i$ to be reliable, the signal-to-interference ratio must exceed a threshold $\gamma_i$, where the "signal" is the power received from user $i$.

Figure 15.2 Wireless communication system in Example 15.6.

We are interested in minimizing the total power transmitted by all users subject to having reliable communications for all users. We can formulate the problem as a linear programming problem of the form
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax \ge b \\ & x \ge 0. \end{array}$$
We proceed as follows. The total power transmitted is $p_1 + \cdots + p_n$. The signal-to-interference ratio for user $i$ is
$$\frac{h_i p_i}{\sum_{j \ne i} h_j p_j}.$$
Hence, the problem can be written as
$$\begin{array}{ll} \text{minimize} & p_1 + \cdots + p_n \\ \text{subject to} & \dfrac{h_i p_i}{\sum_{j \ne i} h_j p_j} \ge \gamma_i, \quad i = 1, \ldots, n \\ & p_1, \ldots, p_n \ge 0. \end{array}$$
We can write the above as the linear programming problem
$$\begin{array}{ll} \text{minimize} & p_1 + \cdots + p_n \\ \text{subject to} & h_i p_i - \gamma_i \displaystyle\sum_{j \ne i} h_j p_j \ge 0, \quad i = 1, \ldots, n \\ & p_1, \ldots, p_n \ge 0. \end{array}$$
In matrix form, we have
$$c = [1, \ldots, 1]^T, \quad A = \begin{bmatrix} h_1 & -\gamma_1 h_2 & \cdots & -\gamma_1 h_n \\ -\gamma_2 h_1 & h_2 & \cdots & -\gamma_2 h_n \\ \vdots & \vdots & \ddots & \vdots \\ -\gamma_n h_1 & -\gamma_n h_2 & \cdots & h_n \end{bmatrix}, \quad b = 0. \qquad \blacksquare$$

For more examples of linear programming and their applications in a variety of engineering problems, we refer the reader to [1], [34], [35], [46], and [109]. For applications of linear programming to the design of control systems, see [33]. Linear programming also provides the basis for theoretical applications, as, for example, in matrix game theory (discussed in [18]).

15.3 Two-Dimensional Linear Programs

Many fundamental concepts of linear programming are easily illustrated in two-dimensional space. Therefore, we consider linear programs in $\mathbb{R}^2$ before discussing general linear programming problems.

Consider the following linear program (adapted from [123]):
$$\begin{array}{ll} \text{maximize} & c^T x \\ \text{subject to} & Ax \le b \\ & x \ge 0, \end{array}$$
where $x = [x_1, x_2]^T$ and
$$c^T = [1, 5], \quad A = \begin{bmatrix} 5 & 6 \\ 3 & 2 \end{bmatrix}, \quad b = \begin{bmatrix} 30 \\ 12 \end{bmatrix}.$$
First, we note that the set of equations $\{c^T x = x_1 + 5x_2 = f, \ f \in \mathbb{R}\}$ specifies a family of straight lines in $\mathbb{R}^2$. Each member of this family can be obtained by setting $f$ equal to some real number. Thus, for example, $x_1 + 5x_2 = -5$, $x_1 + 5x_2 = 0$, and $x_1 + 5x_2 = 3$ are three parallel lines belonging to the family.

Now, suppose that we try to choose several values for $x_1$ and $x_2$ and observe how large we can make $f$ while still satisfying the constraints on $x_1$ and $x_2$. We first try $x_1 = 1$ and $x_2 = 3$. This point satisfies the constraints. For this point, $f = 16$. If we now select $x_1 = 0$ and $x_2 = 5$, then $f = 25$, and this point yields a larger value of $f$ than does $x = [1, 3]^T$. There are infinitely many points $[x_1, x_2]^T$ satisfying the constraints. Therefore, we need a better method than trial and error to solve the problem. In the following sections we develop a systematic approach that simplifies considerably the process of solving linear programming problems.

For the example above we can easily solve the problem using geometric arguments. First let us sketch the constraints in $\mathbb{R}^2$. The region of feasible points (the set of points $x$ satisfying the constraints $Ax \le b$, $x \ge 0$) is depicted by the shaded region in Figure 15.3.

Figure 15.3 Geometric solution of a linear program in $\mathbb{R}^2$.
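Before turning to the geometric argument, we note that an instance this small can also be checked numerically. A sketch using linprog from the Optimization Toolbox, with the data above (linprog minimizes, so we negate c):

% Sketch: solving the two-dimensional example with linprog.
c = [1; 5];
A = [5 6; 3 2];
b = [30; 12];
x = linprog(-c, A, b, [], [], [0; 0], []);   % maximize c'x over Ax <= b, x >= 0
% The returned solution can be compared with the geometric one obtained below.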
Geometrically, maximizing $c^T x = x_1 + 5x_2$ subject to the constraints can be thought of as finding the straight line $f = x_1 + 5x_2$ that intersects the shaded region and has the largest $f$. The coordinates of the point of intersection will then yield a maximum value of $c^T x$. In our example, the point $[0, 5]^T$ is the solution (see Figure 15.3).

Example 15.7 Suppose that you are given two different types of concrete. The first type contains 30% cement, 40% gravel, and 30% sand (all percentages of weight). The second type contains 10% cement, 20% gravel, and 70% sand. The first type of concrete costs $5 per pound and the second type costs $1 per pound. How many pounds of each type of concrete should you buy and mix together so that your cost is minimized but you get a concrete mixture that has at least a total of 5 pounds of cement, 3 pounds of gravel, and 4 pounds of sand?

The problem can be represented as
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax \ge b \\ & x \ge 0, \end{array}$$
where
$$c^T = [5, 1], \quad A = \begin{bmatrix} 0.3 & 0.1 \\ 0.4 & 0.2 \\ 0.3 & 0.7 \end{bmatrix}, \quad b = \begin{bmatrix} 5 \\ 3 \\ 4 \end{bmatrix}.$$
Using the graphical method described above, we get a solution of $[0, 50]^T$, which means that we should purchase 50 pounds of the second type of concrete. (For a variation of this problem solved using a different method, see Example 12.1.) ■

In some cases, when using the graphical method, there may be more than one point of intersection of the optimal straight line $f = c^T x$ with the boundary of the feasible region. In this case all of the intersection points will yield the same value of the objective function $c^T x$, and therefore any one of them is a solution.

15.4 Convex Polyhedra and Linear Programming

The goal of linear programming is to minimize (or maximize) a linear objective function
$$c^T x = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$
subject to constraints that are represented by linear equalities and/or inequalities. For the time being, let us consider only constraints of the form $Ax \le b$, $x \ge 0$. In this section we discuss linear programs from a geometric point of view (for a review of the geometric concepts used in this section, see Chapter 4).

The set of points satisfying these constraints can be represented as the intersection of a finite number of closed half-spaces. Thus, the constraints define a convex polytope. We assume, for simplicity, that this polytope is nonempty and bounded. In other words, the equations of the constraints define a polyhedron M in $\mathbb{R}^n$. Let H be a hyperplane of support of this polyhedron. If the dimension of M is less than n, then the set of all points common to the hyperplane H and the polyhedron M coincides with M. If the dimension of M is equal to n, then the set of all points common to the hyperplane H and the polyhedron M is a face of the polyhedron. If this face is (n − 1)-dimensional, then there exists only one hyperplane of support, namely, the carrier of this face. If the dimension of the face is less than n − 1, then there exist an infinite number of hyperplanes of support whose intersection with this polyhedron yields this face (see Figure 15.4).

Figure 15.4 Hyperplanes of support at different boundary points of the polyhedron M.

The goal of our linear programming problem is to maximize a linear objective function $f(x) = c^T x = c_1 x_1 + \cdots + c_n x_n$ on the convex polyhedron M. Next, let H be the hyperplane defined by the equation $c^T x = 0$.
Draw a hyperplane of support $\bar{H}$ to the polyhedron M that is parallel to H and positioned such that the vector $c$ points in the direction of the half-space that does not contain M (see Figure 15.5). The equation of the hyperplane $\bar{H}$ has the form $c^T x = \beta$, and for all $x \in M$ we have $c^T x \le \beta$. Denote by $\bar{M}$ the convex polyhedron that is the intersection of the hyperplane of support $\bar{H}$ with the polyhedron M. We now show that $f$ is constant on $\bar{M}$ and that $\bar{M}$ is the set of all points in M for which $f$ attains its maximum value. To this end, let $y$ and $z$ be two arbitrary points in $\bar{M}$. This implies that both $y$ and $z$ belong to $\bar{H}$. Hence,
$$f(y) = c^T y = \beta = c^T z = f(z),$$
which means that $f$ is constant on $\bar{M}$.

Let $y$ be a point of $\bar{M}$, and let $x$ be a point of $M \setminus \bar{M}$; that is, $x$ is a point of M that does not belong to $\bar{M}$ (see Figure 15.5). Then, $c^T x < \beta = c^T y$, which implies that
$$f(x) < f(y).$$

Figure 15.5 Maximization of a linear function on the polyhedron M.

Thus, the values of $f$ at the points of M that do not belong to $\bar{M}$ are smaller than the values at points of $\bar{M}$. Hence, $f$ achieves its maximum on M at points in $\bar{M}$. It may happen that $\bar{M}$ contains only a single point, in which case $f$ achieves its maximum at a unique point. This occurs when the hyperplane of support passes through an extreme point of M (see Figure 15.6).

Figure 15.6 Unique maximum point of $f$ on the polyhedron M.

15.5 Standard Form Linear Programs

We refer to a linear program of the form
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax = b \\ & x \ge 0 \end{array}$$
as a linear program in standard form. Here A is an $m \times n$ matrix composed of real entries, $m < n$, $\operatorname{rank} A = m$. Without loss of generality, we assume that $b \ge 0$. If a component of $b$ is negative, say the ith component, we multiply the ith constraint by −1 to obtain a positive right-hand side.

Theorems and solution techniques for linear programs are usually stated for problems in standard form. Other forms of linear programs can be converted to the standard form, as we now show. If a linear program is in the form
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax \ge b \\ & x \ge 0, \end{array}$$
then by introducing surplus variables $y_i$, we can convert the original problem into the standard form
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n - y_i = b_i, \quad i = 1, \ldots, m \\ & x_1 \ge 0, \ x_2 \ge 0, \ \ldots, \ x_n \ge 0 \\ & y_1 \ge 0, \ y_2 \ge 0, \ \ldots, \ y_m \ge 0. \end{array}$$
In more compact notation, the formulation above can be represented as
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax - I_m y = [A, -I_m] \begin{bmatrix} x \\ y \end{bmatrix} = b \\ & x \ge 0, \ y \ge 0, \end{array}$$
where $I_m$ is the $m \times m$ identity matrix.

If, on the other hand, the constraints have the form
$$Ax \le b, \quad x \ge 0,$$
then we introduce slack variables $y_i$ to convert the constraints into the form
$$Ax + I_m y = [A, I_m] \begin{bmatrix} x \\ y \end{bmatrix} = b, \quad x \ge 0, \ y \ge 0,$$
where $y$ is the vector of slack variables. Note that neither surplus nor slack variables contribute to the objective function $c^T x$.

At first glance, it may appear that the two problems
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax \ge b \\ & x \ge 0 \end{array}$$
and
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax - I_m y = b \\ & x \ge 0, \ y \ge 0 \end{array}$$
are different, in that the first problem refers to the intersection of half-spaces in the n-dimensional space, whereas the second problem refers to an intersection of half-spaces and hyperplanes in the (n + m)-dimensional space. It turns out that both formulations are algebraically equivalent in the sense that a solution to one of the problems implies a solution to the other. To illustrate this equivalence, we consider the following examples.

Example 15.8 Suppose that we are given the inequality constraint
$$x_1 \le 7.$$
We convert this to an equality constraint by introducing a slack variable $x_2 \ge 0$ to obtain
$$x_1 + x_2 = 7, \quad x_2 \ge 0.$$
Consider the sets $C_1 = \{x_1 : x_1 \le 7\}$ and $C_2 = \{x_1 : x_1 + x_2 = 7, \ x_2 \ge 0\}$. Are the sets $C_1$ and $C_2$ equal? It is clear that indeed they are; in this example, we give a geometric interpretation of their equality. Consider a third set $C_3 = \{[x_1, x_2]^T : x_1 + x_2 = 7, \ x_2 \ge 0\}$. From Figure 15.7 we can see that the set $C_3$ consists of all points on the line that lie to the left of and above its point of intersection with the $x_1$-axis. This set, being a subset of $\mathbb{R}^2$, is of course not the same set as the set $C_1$ (a subset of $\mathbb{R}$). However, we can project the set $C_3$ onto the $x_1$-axis (see Figure 15.7). We can associate with each point $x_1 \in C_1$ the point $[x_1, 0]^T$ on the orthogonal projection of $C_3$ onto the $x_1$-axis, and vice versa. Note that $C_2 = \{x_1 : [x_1, x_2]^T \in C_3\} = C_1$. ■

Figure 15.7 Projection of the set $C_3$ onto the $x_1$-axis.

Example 15.9 Consider the inequality constraints
$$a_1 x_1 + a_2 x_2 \le b, \quad x_1, x_2 \ge 0,$$
where $a_1$, $a_2$, and $b$ are positive numbers. Again, we introduce a slack variable $x_3 \ge 0$ to get
$$a_1 x_1 + a_2 x_2 + x_3 = b, \quad x_1, x_2, x_3 \ge 0.$$
Define the sets
$$\begin{aligned} C_1 &= \{[x_1, x_2]^T : a_1 x_1 + a_2 x_2 \le b, \ x_1, x_2 \ge 0\}, \\ C_2 &= \{[x_1, x_2]^T : a_1 x_1 + a_2 x_2 + x_3 = b, \ x_1, x_2, x_3 \ge 0\}, \\ C_3 &= \{[x_1, x_2, x_3]^T : a_1 x_1 + a_2 x_2 + x_3 = b, \ x_1, x_2, x_3 \ge 0\}. \end{aligned}$$
We again see that $C_3$ is not the same as $C_1$. However, the orthogonal projection of $C_3$ onto the $(x_1, x_2)$-plane allows us to associate the resulting set with the set $C_1$. We associate the points $[x_1, x_2, 0]^T$ resulting from the orthogonal projection of $C_3$ onto the $(x_1, x_2)$-plane with the points in $C_1$ (see Figure 15.8). Note that $C_2 = \{[x_1, x_2]^T : [x_1, x_2, x_3]^T \in C_3\} = C_1$. ■

Figure 15.8 Projection of the set $C_3$ onto the $(x_1, x_2)$-plane.

Example 15.10 Suppose that we wish to maximize
$$f(x_1, x_2) = c_1 x_1 + c_2 x_2$$
subject to the constraints
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 &\le b_1 \\ a_{21}x_1 + a_{22}x_2 &= b_2 \\ x_1, x_2 &\ge 0, \end{aligned}$$
where, for simplicity, we assume that each $a_{ij} > 0$ and $b_1, b_2 > 0$. The set of feasible points is depicted in Figure 15.9. Let $C_1 \subset \mathbb{R}^2$ be the set of points satisfying the constraints.

Figure 15.9 The feasible set for Example 15.10.

Introducing a slack variable, we convert the constraints into standard form:
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + x_3 &= b_1 \\ a_{21}x_1 + a_{22}x_2 &= b_2 \\ x_i &\ge 0, \quad i = 1, 2, 3. \end{aligned}$$
Let $C_2 \subset \mathbb{R}^3$ be the set of points satisfying the constraints. As illustrated in Figure 15.10, this set is a line segment (in $\mathbb{R}^3$). We now project $C_2$ onto the $(x_1, x_2)$-plane. The projected set consists of the points $[x_1, x_2, 0]^T$ with $[x_1, x_2, x_3]^T \in C_2$ for some $x_3 \ge 0$. In Figure 15.10 this set is marked by a heavy line in the $(x_1, x_2)$-plane. We can associate the points on the projection with the corresponding points in the set $C_1$. ■

Figure 15.10 Projection of $C_2$ onto the $(x_1, x_2)$-plane.

In the following example we convert an optimization problem into a standard form linear programming problem.

Example 15.11 Consider the following optimization problem:
$$\begin{array}{ll} \text{maximize} & x_2 - x_1 \\ \text{subject to} & 3x_1 = x_2 - 5 \\ & |x_2| \le 2 \\ & x_1 \le 0. \end{array}$$
To convert the problem into a standard form linear programming problem, we perform the following steps:

1. Change the objective function to: minimize $x_1 - x_2$.

2. Substitute $x_1 = -x_1'$.

3.
Write $|x_2| \le 2$ as $x_2 \le 2$ and $-x_2 \le 2$.

4. Introduce slack variables $x_3$ and $x_4$, and convert the inequalities above to $x_2 + x_3 = 2$ and $-x_2 + x_4 = 2$.

5. Write $x_2 = u - v$, $u, v \ge 0$.

Hence, we obtain
$$\begin{array}{ll} \text{minimize} & -x_1' - u + v \\ \text{subject to} & 3x_1' + u - v = 5 \\ & u - v + x_3 = 2 \\ & v - u + x_4 = 2 \\ & x_1', u, v, x_3, x_4 \ge 0. \end{array} \qquad \blacksquare$$

15.6 Basic Solutions

We have seen in Section 15.5 that any linear programming problem involving inequalities can be converted to standard form, that is, a problem involving linear equations with nonnegative variables:
$$\begin{array}{ll} \text{minimize} & c^T x \\ \text{subject to} & Ax = b \\ & x \ge 0, \end{array}$$
where $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $m < n$, $\operatorname{rank} A = m$, and $b \ge 0$. In the following discussion we consider only linear programming problems in standard form.

Consider the system of equalities $Ax = b$, where $\operatorname{rank} A = m$. In dealing with this system of equations, we frequently need to consider a subset of the columns of the matrix A. For convenience, we often reorder the columns of A so that the columns we are interested in appear first. Specifically, let B be a square matrix whose columns are m linearly independent columns of A. If necessary, we reorder the columns of A so that the columns in B appear first: A has the form $A = [B, D]$, where D is an $m \times (n - m)$ matrix whose columns are the remaining columns of A. The matrix B is nonsingular, and thus we can solve the equation
$$Bx_B = b$$
for the m-vector $x_B$. The solution is $x_B = B^{-1}b$. Let $x$ be the n-vector whose first m components are equal to $x_B$ and whose remaining components are equal to zero; that is, $x = [x_B^T, 0^T]^T$. Then, $x$ is a solution to $Ax = b$.

Definition 15.1 We call $[x_B^T, 0^T]^T$ a basic solution to $Ax = b$ with respect to the basis B. We refer to the components of the vector $x_B$ as basic variables and the columns of B as basic columns. If some of the basic variables of a basic solution are zero, then the basic solution is said to be a degenerate basic solution. A vector $x$ satisfying $Ax = b$, $x \ge 0$, is said to be a feasible solution. A feasible solution that is also basic is called a basic feasible solution. If the basic feasible solution is a degenerate basic solution, then it is called a degenerate basic feasible solution. ■

Note that in any basic feasible solution, $x_B \ge 0$.

Example 15.12 Consider the equation $Ax = b$ with
$$A = [a_1, a_2, a_3, a_4] = \begin{bmatrix} 1 & 1 & -1 & 4 \\ 1 & -2 & -1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 8 \\ 2 \end{bmatrix},$$
where $a_i$ denotes the ith column of the matrix A. Then, $x = [6, 2, 0, 0]^T$ is a basic feasible solution with respect to the basis $B = [a_1, a_2]$; $x = [0, 0, 0, 2]^T$ is a degenerate basic feasible solution with respect to the basis $B = [a_3, a_4]$ (as well as $[a_1, a_4]$ and $[a_2, a_4]$); $x = [3, 1, 0, 1]^T$ is a feasible solution that is not basic; and $x = [0, 2, -6, 0]^T$ is a basic solution with respect to the basis $B = [a_2, a_3]$, but is not feasible. ■

Example 15.13 As another example, consider the system of linear equations $Ax = b$, where
$$A = \begin{bmatrix} 2 & 3 & -1 & -1 \\ 4 & 1 & 1 & -2 \end{bmatrix}, \quad b = \begin{bmatrix} -1 \\ 9 \end{bmatrix}.$$
We now find all solutions of this system. Note that every solution $x$ of $Ax = b$ has the form $x = v + h$, where $v$ is a particular solution of $Ax = b$ and $h$ is a solution to $Ax = 0$.

We form the augmented matrix $[A, b]$ of the system:
$$[A, b] = \begin{bmatrix} 2 & 3 & -1 & -1 & -1 \\ 4 & 1 & 1 & -2 & 9 \end{bmatrix}.$$
Using elementary row operations, we transform this matrix into the form (see Chapter 16)
$$\begin{bmatrix} 1 & 0 & \frac{2}{5} & -\frac{1}{2} & \frac{14}{5} \\[2pt] 0 & 1 & -\frac{3}{5} & 0 & -\frac{11}{5} \end{bmatrix}.$$
The corresponding system of equations is given by
$$\begin{aligned} x_1 + \tfrac{2}{5}x_3 - \tfrac{1}{2}x_4 &= \tfrac{14}{5} \\ x_2 - \tfrac{3}{5}x_3 &= -\tfrac{11}{5}. \end{aligned}$$
Solving for the leading unknowns $x_1$ and $x_2$, we obtain
$$\begin{aligned} x_1 &= \tfrac{14}{5} - \tfrac{2}{5}x_3 + \tfrac{1}{2}x_4 \\ x_2 &= -\tfrac{11}{5} + \tfrac{3}{5}x_3, \end{aligned}$$
where $x_3$ and $x_4$ are arbitrary real numbers.
If $[x_1, x_2, x_3, x_4]^T$ is a solution, then we have
$$x_1 = \tfrac{14}{5} - \tfrac{2}{5}s + \tfrac{1}{2}t, \quad x_2 = -\tfrac{11}{5} + \tfrac{3}{5}s, \quad x_3 = s, \quad x_4 = t,$$
where we have substituted $s$ and $t$ for $x_3$ and $x_4$, respectively, to indicate that they are arbitrary real numbers. Using vector notation, we may write the system of equations above as
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} \tfrac{14}{5} \\ -\tfrac{11}{5} \\ 0 \\ 0 \end{bmatrix} + s \begin{bmatrix} -\tfrac{2}{5} \\ \tfrac{3}{5} \\ 1 \\ 0 \end{bmatrix} + t \begin{bmatrix} \tfrac{1}{2} \\ 0 \\ 0 \\ 1 \end{bmatrix}.$$
Note that we have infinitely many solutions, parameterized by $s, t \in \mathbb{R}$. For the choice $s = t = 0$ we obtain a particular solution to $Ax = b$, given by
$$v = \begin{bmatrix} \tfrac{14}{5} \\ -\tfrac{11}{5} \\ 0 \\ 0 \end{bmatrix}.$$
Any other solution has the form $v + h$, where
$$h = s \begin{bmatrix} -\tfrac{2}{5} \\ \tfrac{3}{5} \\ 1 \\ 0 \end{bmatrix} + t \begin{bmatrix} \tfrac{1}{2} \\ 0 \\ 0 \\ 1 \end{bmatrix}.$$
The total number of possible basic solutions is at most
$$\frac{n!}{m!(n - m)!} = \frac{4!}{2!(4 - 2)!} = 6.$$
To find basic solutions that are feasible, we check each of the basic solutions for feasibility.

Our first candidate for a basic feasible solution is obtained by setting $x_3 = x_4 = 0$, which corresponds to the basis $B = [a_1, a_2]$. Solving $Bx_B = b$, we obtain $x_B = [14/5, -11/5]^T$, and hence $x = [14/5, -11/5, 0, 0]^T$ is a basic solution that is not feasible.

For our second candidate basic feasible solution, we set $x_2 = x_4 = 0$. We have the basis $B = [a_1, a_3]$. Solving $Bx_B = b$ yields $x_B = [4/3, 11/3]^T$. Hence, $x = [4/3, 0, 11/3, 0]^T$ is a basic feasible solution.

A third candidate basic feasible solution is obtained by setting $x_2 = x_3 = 0$. However, the matrix
$$B = [a_1, a_4] = \begin{bmatrix} 2 & -1 \\ 4 & -2 \end{bmatrix}$$
is singular. Therefore, B cannot be a basis, and we do not have a basic solution corresponding to $B = [a_1, a_4]$.

We get our fourth candidate for a basic feasible solution by setting $x_1 = x_4 = 0$. We have the basis $B = [a_2, a_3]$, resulting in $x = [0, 2, 7, 0]^T$, which is a basic feasible solution.

Our fifth candidate for a basic feasible solution corresponds to setting $x_1 = x_3 = 0$, with the basis $B = [a_2, a_4]$. This results in $x = [0, -11/5, 0, -28/5]^T$, which is a basic solution that is not feasible.

Finally, the sixth candidate for a basic feasible solution is obtained by setting $x_1 = x_2 = 0$. This results in the basis $B = [a_3, a_4]$ and $x = [0, 0, 11/3, -8/3]^T$, which is a basic solution but is not feasible. ■

15.7 Properties of Basic Solutions

In this section we discuss the importance of basic feasible solutions in solving linear programming (LP) problems. We first prove the fundamental theorem of LP, which states that when solving an LP problem, we need only consider basic feasible solutions. This is because the optimal value (if it exists) is always achieved at a basic feasible solution. We need the following definitions.

Definition 15.2 Any vector $x$ that yields the minimum value of the objective function $c^T x$ over the set of vectors satisfying the constraints $Ax = b$, $x \ge 0$, is said to be an optimal feasible solution. An optimal feasible solution that is basic is said to be an optimal basic feasible solution. ■

Theorem 15.1 Fundamental Theorem of LP. Consider a linear program in standard form.

1. If there exists a feasible solution, then there exists a basic feasible solution.

2. If there exists an optimal feasible solution, then there exists an optimal basic feasible solution. □

Proof. We first prove part 1. Suppose that $x = [x_1, \ldots, x_n]^T$ is a feasible solution and that it has p positive components. Without loss of generality, we can assume that the first p components are positive, whereas the remaining components are zero. Then, in terms of the columns of $A = [a_1, \ldots, a_p, \ldots, a_n]$, this solution satisfies
$$x_1 a_1 + x_2 a_2 + \cdots + x_p a_p = b.$$
There are now two cases to consider.
Case 1: If $a_1, a_2, \ldots, a_p$ are linearly independent, then $p \le m$. If $p = m$, then the solution $x$ is basic and the proof is done. If, on the other hand, $p < m$, then, since $\operatorname{rank} A = m$, we can find $m - p$ columns of A from the remaining $n - p$ columns so that the resulting set of m columns forms a basis. Hence, the solution $x$ is a (degenerate) basic feasible solution corresponding to the basis above.

Case 2: Assume that $a_1, a_2, \ldots, a_p$ are linearly dependent. Then, there exist numbers $y_i$, $i = 1, \ldots, p$, not all zero, such that
$$y_1 a_1 + y_2 a_2 + \cdots + y_p a_p = 0.$$
We can assume that there exists at least one $y_i$ that is positive, for if all the $y_i$ are nonpositive, we can multiply the equation above by −1. Multiply the equation by a scalar $\varepsilon$ and subtract the resulting equation from $x_1 a_1 + x_2 a_2 + \cdots + x_p a_p = b$ to obtain
$$(x_1 - \varepsilon y_1)a_1 + (x_2 - \varepsilon y_2)a_2 + \cdots + (x_p - \varepsilon y_p)a_p = b.$$
Let
$$y = [y_1, \ldots, y_p, 0, \ldots, 0]^T.$$
Then, for any $\varepsilon$ we can write
$$A[x - \varepsilon y] = b.$$
Let $\varepsilon = \min\{x_i/y_i : i = 1, \ldots, p, \ y_i > 0\}$. Then, the first p components of $x - \varepsilon y$ are nonnegative, and at least one of these components is zero. We then have a feasible solution with at most $p - 1$ positive components. We can repeat this process until we get linearly independent columns of A, after which we are back to case 1. Therefore, part 1 is proved.

We now prove part 2. Suppose that $x = [x_1, \ldots, x_n]^T$ is an optimal feasible solution and that only the first p variables are nonzero. Then, we have two cases to consider. The first case is exactly the same as in part 1. The second case follows the same arguments as in part 1, but in addition we must show that $x - \varepsilon y$ is optimal for any $\varepsilon$. We do this by showing that $c^T y = 0$. To this end, assume that $c^T y \ne 0$. Note that for $\varepsilon$ of sufficiently small magnitude ($|\varepsilon| < \min\{|x_i/y_i| : i = 1, \ldots, p, \ y_i \ne 0\}$), the vector $x - \varepsilon y$ is feasible. We can choose $\varepsilon$ such that
$$c^T x > c^T x - \varepsilon c^T y = c^T(x - \varepsilon y).$$
This contradicts the optimality of $x$. We can now use the procedure from part 1 to obtain an optimal basic feasible solution from a given optimal feasible solution. ■

Example 15.14 Consider the system of equations given in Example 15.13. Find a nonbasic feasible solution to this system and use the method in the proof of the fundamental theorem of LP to find a basic feasible solution.

Recall that solutions for the system given in Example 15.13 have the form
$$x = \begin{bmatrix} \tfrac{14}{5} \\ -\tfrac{11}{5} \\ 0 \\ 0 \end{bmatrix} + s \begin{bmatrix} -\tfrac{2}{5} \\ \tfrac{3}{5} \\ 1 \\ 0 \end{bmatrix} + t \begin{bmatrix} \tfrac{1}{2} \\ 0 \\ 0 \\ 1 \end{bmatrix},$$
where $s, t \in \mathbb{R}$. Note that if $s = 4$ and $t = 0$, then
$$x_0 = \begin{bmatrix} \tfrac{6}{5} \\ \tfrac{1}{5} \\ 4 \\ 0 \end{bmatrix}$$
is a nonbasic feasible solution. There are constants $y_i$, $i = 1, 2, 3$, such that
$$y_1 a_1 + y_2 a_2 + y_3 a_3 = 0.$$
For example, let
$$y_1 = -\frac{2}{5}, \quad y_2 = \frac{3}{5}, \quad y_3 = 1.$$
This leads us to the simplex method for solving linear programs, which we discuss in Chapter 16. 15.8 Geometric View of Linear Programs Recall that a set Θ C W1 is said to be convex if, for every cc, y G Θ and every real number a, 0 < a < 1, the point ax + (1 — a)y £ Θ. In other words, a set is convex if given two points in the set, every point on the line segment joining these two points is also a member of the set. Note that the set of points satisfying the constraints Ax = b, x > 0 GEOMETRIC VIEW OF LINEAR PROGRAMS 331 is convex. To see this, let X\ and x2 satisfy the constraints, that is, Axi = 6, x% > 0, i = 1,2. Then, for all a G (0,1), A(ax1 + (1 - a)x2) = α-Αχι + (1 a)Ax2 = b. Also, for a G (0,1), we have ax\ -f (1 — OL)X2 > 0. Recall that a point cc in a convex set Θ is said to be an extreme point of Θ if there are no two distinct points X\ and x2 in Θ such that x = ax\ + (1 — a)x2 for some a G (0,1). In other words, an extreme point is a point that does not lie strictly within the line segment connecting two other points of the set. Therefore, if x is an extreme point, and x = αχχ + (1 — a)x2 for some Xi,x2 £ Θ and a G (0,1), then x\ = x2. In the following theorem we show that extreme points of the constraint set are equivalent to basic feasible solutions. T h e o r e m 15.2 Let Ω be the convex set consisting of all feasible solutions, that is, all n-vectors x satisfying Ax = 6, x > 0, where A G R m X n ; m < n. Then, x is an extreme point of Ω if and only if x is a basic feasible solution to Ax — b, x > 0. □ Proof. =>: Suppose that x satisfies Ax = 6, x > 0, and has p positive components. As before, without loss of generality, we can assume that the first p components are positive and the remaining components are zero. We have x\a\ + χ2α>2 + · · · + XpQ>p = b. Let ?/i, i = 1 , . . . ,p, be numbers such that 2/i«i + 2/2^2 H h ypap = 0. We show that each yi = 0. To begin, multiply this equation by ε > 0, then add and subtract the result from the equation Χχα,ι + x2a2 + · · · + xpa,p = b to get (xi + eyi)ai + (x2 + 62/2)02 H (xi - £j/i)ai + (x2 - ey2)a2 Λ h (z p + £2/p)op = 6, h (xp - eyp)ap = b. Because each Xi > 0, ε > 0 can be chosen such that each xi + εy^,Xi — εyi>0 (e.g., ε = min{\xi/yi\ : z = 1 , . . . ,p, ^ ^ 0}). For such a choice of ε, the vectors zi = [xi + ε 2 / ι , χ 2 + ε ? / 2 , . · . , ^ ρ + ε?/ρ,0, . . . , 0 ] T , ^2 = [xi -eyi,X2 -εy2,...,Xp-εyp,0,...,0}τ belong to Ω. Observe that x = \z\ + \z2. Because x is an extreme point, Z\ = z2. Hence, each yi = 0, which implies that the ai are linearly independent. 332 INTRODUCTION TO LINEAR PROGRAMMING <=: Let x G Ω be a basic feasible solution. Let y, z G Ω be such that cc = ay + (1 — a ) z for some a G (0,1). We show that y = z and conclude that x is an extreme point. Because t/, z > 0, and the last n — m components of x are zero, the last n — m components of y and z are zero as well. Furthermore, since Ay = Az = 6, 2/i αι H ht/mflm = 6 and ζιαι H (- zmam = b. Combining these two equations yields (yi - ζι)αι + h (ym - zm)am = 0. Because the columns o i , . . . , a m are linearly independent, we have yi = ^ , z = 1 , . . . , ra. Therefore, y = z, and hence x is an extreme point of Ω. I From Theorem 15.2 it follows that the set of extreme points of the constraint set Ω = {x : Ax = 6, x > 0} is equal to the set of basic feasible solutions to Ax = 6, x > 0. 
Combining this observation with the fundamental theorem of LP (Theorem 15.1), we can see that in solving linear programming problems we need only examine the extreme points of the constraint set.

Example 15.15 Consider the following LP problem:

maximize $3x_1 + 5x_2$
subject to $x_1 + 5x_2 \le 40$
    $2x_1 + x_2 \le 20$
    $x_1 + x_2 \le 12$
    $x_1, x_2 \ge 0$.

We introduce slack variables $x_3, x_4, x_5$ to convert this LP problem into standard form:

minimize $-3x_1 - 5x_2$
subject to $x_1 + 5x_2 + x_3 = 40$
    $2x_1 + x_2 + x_4 = 20$
    $x_1 + x_2 + x_5 = 12$
    $x_1, \ldots, x_5 \ge 0$.

In the remainder of the example we consider only the problem in standard form. We can represent the constraints above as
$$x_1\begin{bmatrix}1\\2\\1\end{bmatrix} + x_2\begin{bmatrix}5\\1\\1\end{bmatrix} + x_3\begin{bmatrix}1\\0\\0\end{bmatrix} + x_4\begin{bmatrix}0\\1\\0\end{bmatrix} + x_5\begin{bmatrix}0\\0\\1\end{bmatrix} = \begin{bmatrix}40\\20\\12\end{bmatrix},$$
that is, $x_1 a_1 + x_2 a_2 + x_3 a_3 + x_4 a_4 + x_5 a_5 = b$, $x \ge 0$.

Note that $x = [0, 0, 40, 20, 12]^T$ is a feasible solution. But for this $x$, the value of the objective function is zero. We already know that the minimum of the objective function (if it exists) is achieved at an extreme point of the constraint set $\Omega$ defined by the constraints. The point $[0, 0, 40, 20, 12]^T$ is an extreme point of the set of feasible solutions, but it turns out that it does not minimize the objective function. Therefore, we need to seek the solution among the other extreme points. To do this we move from one extreme point to an adjacent extreme point such that the value of the objective function decreases. Here, we define two extreme points to be adjacent if the corresponding basic columns differ by only one vector.

We begin with $x = [0, 0, 40, 20, 12]^T$. We have
$$0a_1 + 0a_2 + 40a_3 + 20a_4 + 12a_5 = b.$$
To select an adjacent extreme point, let us choose to include $a_1$ as a basic column in the new basis. We need to remove either $a_3$, $a_4$, or $a_5$ from the old basis. We proceed as follows. We first express $a_1$ as a linear combination of the old basic columns:
$$a_1 = 1a_3 + 2a_4 + 1a_5.$$
Multiplying both sides of this equation by $\varepsilon_1 > 0$, we get
$$\varepsilon_1 a_1 = \varepsilon_1 a_3 + 2\varepsilon_1 a_4 + \varepsilon_1 a_5.$$
We now add this equation to the equation $0a_1 + 0a_2 + 40a_3 + 20a_4 + 12a_5 = b$. Collecting terms yields
$$\varepsilon_1 a_1 + 0a_2 + (40 - \varepsilon_1)a_3 + (20 - 2\varepsilon_1)a_4 + (12 - \varepsilon_1)a_5 = b.$$
We want to choose $\varepsilon_1$ in such a way that each of the coefficients above is nonnegative and, at the same time, one of the coefficients of $a_3$, $a_4$, or $a_5$ becomes zero. Clearly, $\varepsilon_1 = 10$ does the job. The result is
$$10a_1 + 30a_3 + 2a_5 = b.$$
The corresponding basic feasible solution (extreme point) is $[10, 0, 30, 0, 2]^T$. For this solution, the objective function value is $-30$, which is an improvement relative to the objective function value at the old extreme point.

We now apply the same procedure as above to move to another adjacent extreme point, which hopefully further decreases the value of the objective function. This time, we choose $a_2$ to enter the new basis. We have
$$a_2 = \frac{1}{2}a_1 + \frac{9}{2}a_3 + \frac{1}{2}a_5$$
and
$$\left(10 - \frac{1}{2}\varepsilon_2\right)a_1 + \varepsilon_2 a_2 + \left(30 - \frac{9}{2}\varepsilon_2\right)a_3 + \left(2 - \frac{1}{2}\varepsilon_2\right)a_5 = b.$$
Substituting $\varepsilon_2 = 4$, we obtain
$$8a_1 + 4a_2 + 12a_3 = b.$$
The solution is $[8, 4, 12, 0, 0]^T$ and the corresponding value of the objective function is $-44$, which is smaller than the value at the previous extreme point.

To complete the example we repeat the procedure once more. This time, we select $a_4$ and express it as a combination of the vectors in the previous basis, $a_1$, $a_2$, and $a_3$:
$$a_4 = a_1 - a_2 + 4a_3,$$
and hence
$$(8 - \varepsilon_3)a_1 + (4 + \varepsilon_3)a_2 + (12 - 4\varepsilon_3)a_3 + \varepsilon_3 a_4 = b.$$
The largest permissible value for $\varepsilon_3$ is 3. The corresponding basic feasible solution is $[5, 7, 0, 3, 0]^T$, with an objective function value of $-50$. The solution $[5, 7, 0, 3, 0]^T$ turns out to be an optimal solution to our problem in standard form. Hence, the solution to the original problem is $[5, 7]^T$, which we can easily obtain graphically (see Figure 15.11). $\blacksquare$

[Figure 15.11: Graphical solution to the LP problem in Example 15.15.]
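Since Example 15.15 is fully specified, its conclusion can be checked numerically. Here is a short hedged sketch assuming SciPy is available; scipy.optimize.linprog is a standard off-the-shelf solver, used here only as a check and not as the method of this chapter:

```python
import numpy as np
from scipy.optimize import linprog

# Example 15.15 as a minimization: minimize -3x1 - 5x2
# subject to x1 + 5x2 <= 40, 2x1 + x2 <= 20, x1 + x2 <= 12, x >= 0.
c = np.array([-3.0, -5.0])
A_ub = np.array([[1.0, 5.0],
                 [2.0, 1.0],
                 [1.0, 1.0]])
b_ub = np.array([40.0, 20.0, 12.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub)  # default bounds are x >= 0
print(res.x)    # expected: approximately [5., 7.]
print(res.fun)  # expected: approximately -50.0
```

The reported optimum, $x \approx [5, 7]^T$ with objective $-50$, matches the extreme-point walk carried out by hand above.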
The technique used in this example for moving from one extreme point to an adjacent extreme point is also used in the simplex method for solving LP problems. The simplex method is essentially a refined method of performing these manipulations.

EXERCISES

15.1 Convert the following linear programming problem to standard form:

maximize $2x_1 + x_2$
subject to $0 \le x_1 \le 2$
    $x_1 + x_2 \le 3$
    $x_1 + 2x_2 \le 5$
    $x_2 \ge 0$.

15.2 Consider a discrete-time linear system $x_{k+1} = ax_k + bu_k$, where $u_k$ is the input at time $k$, $x_k$ is the output at time $k$, and $a, b \in \mathbb{R}$ are system parameters. Given an initial condition $x_0 = 1$, consider the problem of minimizing the output $x_2$ at time 2 subject to the constraint that $|u_i| \le 1$, $i = 0, 1$. Formulate the problem as a linear programming problem, and convert it into standard form.

15.3 Consider the optimization problem

minimize $c_1|x_1| + c_2|x_2| + \cdots + c_n|x_n|$
subject to $Ax = b$,

where $c_i \neq 0$, $i = 1, \ldots, n$. Convert this problem into an equivalent standard form linear programming problem. Hint: Given any $x \in \mathbb{R}$, we can find unique numbers $x^+, x^- \in \mathbb{R}$, $x^+, x^- \ge 0$, such that $|x| = x^+ + x^-$ and $x = x^+ - x^-$.

15.4 Does every linear programming problem in standard form have a nonempty feasible set? If "yes," provide a proof. If "no," give a specific example. Does every linear programming problem in standard form (assuming a nonempty feasible set) have an optimal solution? If "yes," provide a proof. If "no," give a specific example.

15.5 Suppose that a computer supplier has two warehouses, one located in city A and another in city B. The supplier receives orders from two customers, one in city C and another in city D. The customer in city C orders 50 units, and the customer in city D orders 60 units. The number of units at the warehouse in city A is 70, and the number of units at the warehouse in city B is 80. The cost of shipping each unit from A to C is 1, from A to D is 2, from B to C is 3, and from B to D is 4. Formulate the problem of deciding how many units from each warehouse should be shipped to each customer to minimize the total shipping cost (assuming that the numbers of units to be shipped are real numbers). Express the problem as an equivalent standard form linear programming problem.

15.6 Consider a computer network consisting of six computers, A through F. The computers are connected according to the following links, with maximum data rates (in Mbps) shown: AC (10), BC (7), BF (3), CD (8), DE (12), DF (4). For example, "AC (10)" means that computers A and C are connected with a link that supports data rates up to 10 Mbps. Suppose that A and B need to send data to E and F, respectively (no other communication is taking place in the network). Any path through the given links above may be used as long as the path has no loop. Also, multiple paths (say from A to E) can be used simultaneously. Link bandwidth can be shared as long as the total data rate through the link does not exceed its maximum (the total data rate through a link is the sum of the data rates of communication in both directions). For every Mbps of data rate the network can support for transmission from A to E, we receive 2 dollars.
For every Mbps of data rate the network can support for transmission from B to F, we receive 3 dollars. Formulate a linear programming problem to represent the goal of maximizing the total revenue. Then, convert this problem into standard form. Hint: Draw a picture of the network, then label each link with the maximum data rate and the paths that share that link.

15.7 A cereal manufacturer wishes to produce 1000 pounds of a cereal that contains exactly 10% fiber, 2% fat, and 5% sugar (by weight). The cereal is to be produced by combining four items of raw food material in appropriate proportions. These four items have certain combinations of fiber, fat, and sugar content, and are available at various prices per pound:

Item   % fiber   % fat   % sugar   Price/lb
  1        3        6       20         2
  2        8       46        5         4
  3       16        9        4         1
  4        4        9        0         2

The manufacturer wishes to find the amounts of each item to be used to produce the cereal in the least expensive way. Formulate the problem as a linear programming problem. What can you say about the existence of a solution to this problem?

15.8 Suppose that a wireless broadcast system has $n$ transmitters. Transmitter $j$ broadcasts at a power of $p_j \ge 0$. There are $m$ locations where the broadcast is to be received. The path gain from transmitter $j$ to location $i$ is $g_{ij}$; that is, the power of the signal transmitted from transmitter $j$ received at location $i$ is $g_{ij}p_j$. The total power received at location $i$ is the sum of the powers received from all the transmitters. Formulate the problem of finding the minimum sum of the powers transmitted subject to the requirement that the power received at each location is at least $P$.

15.9 Consider the system of equations
$$\begin{bmatrix} 2 & -1 & 2 & -1 & 3 \\ 1 & 2 & 3 & 1 & 0 \\ 1 & 0 & -2 & 0 & -5 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 14 \\ 5 \\ -10 \end{bmatrix}.$$
Check if the system has basic solutions. If yes, find all basic solutions.

15.10 Solve the following linear program graphically:

maximize $2x_1 + 5x_2$
subject to $0 \le x_1 \le 4$
    $0 \le x_2 \le 6$
    $x_1 + x_2 \le 8$.

15.11 The optimization toolbox in MATLAB provides a function, linprog, for solving linear programming problems. Use the function linprog to solve the problem in Example 15.5. Use the initial condition 0.

CHAPTER 16

SIMPLEX METHOD

16.1 Solving Linear Equations Using Row Operations

The examples in previous chapters illustrate that solving linear programs involves the solution of systems of linear simultaneous algebraic equations. In this section we describe a method for solving a system of $n$ linear equations in $n$ unknowns that we use in subsequent sections. The method uses elementary row operations and corresponding elementary matrices. For a discussion of numerical issues involved in solving a system of simultaneous linear algebraic equations, we refer the reader to [41] and [53].

An elementary row operation on a given matrix is an algebraic manipulation of the matrix that corresponds to one of the following:

1. Interchanging any two rows of the matrix
2. Multiplying one of its rows by a real nonzero number
3. Adding a scalar multiple of one row to another row

An elementary row operation on a matrix is equivalent to premultiplying the matrix by a corresponding elementary matrix, which we define next.

Definition 16.1 We call $E$ an elementary matrix of the first kind if $E$ is obtained from the identity matrix $I$ by interchanging any two of its rows. $\blacksquare$
An elementary matrix of the first kind formed from $I$ by interchanging the $i$th and the $j$th rows has the form
$$E = \begin{bmatrix} 1 & & & & & \\ & \ddots & & & & \\ & & 0 & \cdots & 1 & \\ & & \vdots & & \vdots & \\ & & 1 & \cdots & 0 & \\ & & & & & \ddots \end{bmatrix},$$
where the two off-diagonal 1s lie in the $i$th and $j$th rows. Note that $E$ is invertible and $E^{-1} = E$.

Definition 16.2 We call $E$ an elementary matrix of the second kind if $E$ is obtained from the identity matrix $I$ by multiplying one of its rows by a real number $a \neq 0$. $\blacksquare$

The elementary matrix of the second kind formed from $I$ by multiplying the $i$th row by $a \neq 0$ has the form
$$E = \operatorname{diag}(1, \ldots, 1, a, 1, \ldots, 1),$$
with $a$ in the $i$th diagonal position. Note that $E$ is invertible and $E^{-1}$ has the same form, with $1/a$ in place of $a$.

Definition 16.3 We call $E$ an elementary matrix of the third kind if $E$ is obtained from the identity matrix $I$ by adding $\beta$ times one row to another row of $I$. $\blacksquare$

An elementary matrix of the third kind obtained from $I$ by adding $\beta$ times the $j$th row to the $i$th row is the identity matrix with an extra $\beta$ in the $(i, j)$th location. Note that $E$ is invertible, and $E^{-1}$ is the identity matrix with $-\beta$ in the $(i, j)$th location.

Definition 16.4 An elementary row operation (of first, second, or third kind) on a given matrix is a premultiplication of the given matrix by a corresponding elementary matrix of the respective kind. $\blacksquare$

Because elementary matrices are invertible, we can define the corresponding inverse elementary row operations.

Consider a system of $n$ linear equations in $n$ unknowns $x_1, x_2, \ldots, x_n$ with right-hand sides $b_1, b_2, \ldots, b_n$. In matrix form this system may be written as $Ax = b$, where
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}, \qquad A \in \mathbb{R}^{n \times n}.$$
If $A$ is invertible, then
$$x = A^{-1}b.$$
Thus, the problem of solving the system of equations $Ax = b$, with $A \in \mathbb{R}^{n \times n}$ invertible, is related to the problem of computing $A^{-1}$. We now show that $A^{-1}$ can be computed effectively using elementary row operations. In particular, we prove the following theorem.

Theorem 16.1 Let $A \in \mathbb{R}^{n \times n}$ be a given matrix. Then, $A$ is nonsingular (invertible) if and only if there exist elementary matrices $E_i$, $i = 1, \ldots, t$, such that
$$E_t \cdots E_2 E_1 A = I. \qquad \square$$

Proof. $\Rightarrow$: If $A$ is nonsingular, then its first column must have at least one nonzero element, say $a_{j1} \neq 0$. Premultiplying $A$ by an elementary matrix of the first kind $E_1$ that interchanges the first and $j$th rows brings the nonzero element $a_{j1}$ to the location $(1,1)$. Hence, in the matrix $E_1 A$, the $(1,1)$th element is nonzero. Note that since $E_1$ is nonsingular, $E_1 A$ is also nonsingular. Next, we premultiply $E_1 A$ by an elementary matrix of the second kind that multiplies the first row by the reciprocal of its leading element. The result of this operation is the matrix $E_2 E_1 A$ with unity in the location $(1,1)$. We next apply a sequence of elementary row operations of the third kind on the matrix $E_2 E_1 A$. Specifically, we premultiply $E_2 E_1 A$ by $n - 1$ elementary matrices of the third kind, $E_3, \ldots, E_r$, each subtracting the appropriate multiple of the first row so as to annihilate one of the remaining entries of the first column, where $r = 2 + (n - 1) = n + 1$. The result of these operations is the nonsingular matrix
$$E_r E_{r-1} \cdots E_2 E_1 A = \begin{bmatrix} 1 & \bar a_{12} & \cdots & \bar a_{1n} \\ 0 & \bar a_{22} & \cdots & \bar a_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & \bar a_{n2} & \cdots & \bar a_{nn} \end{bmatrix}.$$
Because the matrix $E_r \cdots E_1 A$ is nonsingular, its submatrix
$$\begin{bmatrix} \bar a_{22} & \cdots & \bar a_{2n} \\ \vdots & & \vdots \\ \bar a_{n2} & \cdots & \bar a_{nn} \end{bmatrix}$$
must also be nonsingular. This implies that there is a nonzero element $\bar a_{j2}$, where $2 \le j \le n$. Using an elementary operation of the first kind, we bring this element to the location $(2,2)$. Thus, in the matrix $E_{r+1}E_r \cdots E_1 A$ the $(2,2)$th element is nonzero. Premultiplying the matrix by an elementary matrix of the second kind yields the matrix $E_{r+2}E_{r+1} \cdots E_1 A$, in which the element in the location $(2,2)$ is unity.
As before, we premultiply this matrix by $n - 1$ elementary row operations of the third kind, to get a matrix of the form
$$E_s \cdots E_r \cdots E_1 A = \begin{bmatrix} 1 & 0 & \tilde a_{13} & \cdots & \tilde a_{1n} \\ 0 & 1 & \tilde a_{23} & \cdots & \tilde a_{2n} \\ 0 & 0 & \tilde a_{33} & \cdots & \tilde a_{3n} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & \tilde a_{n3} & \cdots & \tilde a_{nn} \end{bmatrix},$$
where $s = r + 2 + (n - 1) = 2(n + 1)$. This matrix is nonsingular. Hence, there is a nonzero element $\tilde a_{j3}$, $3 \le j \le n$. Proceeding in a similar fashion as before, we obtain
$$E_t \cdots E_s \cdots E_r \cdots E_1 A = I,$$
where $t = n(n + 1)$.

$\Leftarrow$: If there exist elementary matrices $E_1, \ldots, E_t$ such that $E_t \cdots E_1 A = I$, then clearly $A$ is invertible, with $A^{-1} = E_t \cdots E_1$. $\blacksquare$

Theorem 16.1 suggests the following procedure for finding $A^{-1}$, if it exists. We first form the matrix $[A, I]$. We then apply elementary row operations to $[A, I]$ so that $A$ is transformed into $I$; that is, we obtain
$$E_t \cdots E_1 [A, I] = [I, B].$$
It then follows that $B = E_t \cdots E_1 = A^{-1}$.

Example 16.1 Given a nonsingular matrix $A \in \mathbb{R}^{4 \times 4}$, we find $A^{-1}$ using this procedure. We form the matrix $[A, I]$ and perform row operations on it. Applying row operations of the first and third kinds zeros out the subdiagonal entries of the first column. We then interchange the second and fourth rows and apply elementary row operations of the second and third kinds to clear the second column. Continuing in this way (for instance, multiplying the third row by the reciprocal of its pivot and then performing a sequence of elementary operations of the third kind), we arrive at a matrix of the form $[I, B]$, and $A^{-1} = B$ can be read off from the right half of the final array. $\blacksquare$

We now return to the general problem of solving the system of equations $Ax = b$, $A \in \mathbb{R}^{n \times n}$. If $A^{-1}$ exists, then the solution is $x = A^{-1}b$. However, we do not need an explicit expression for $A^{-1}$ to find the solution. Indeed, let $A^{-1}$ be expressed as a product of elementary matrices
$$A^{-1} = E_t \cdots E_1.$$
Thus, $E_t \cdots E_1 A x = E_t \cdots E_1 b$, and hence $x = E_t \cdots E_1 b$.

The discussion above leads to the following procedure for solving the system $Ax = b$. Form the augmented matrix $[A, b]$. Then, perform a sequence of elementary row operations on this augmented matrix until we obtain $[I, \tilde b]$. From the above we have that if $x$ is a solution to $Ax = b$, then it is also a solution to $EAx = Eb$, where $E = E_t \cdots E_1$ represents a sequence of elementary row operations. Because $EA = I$ and $Eb = \tilde b$, it follows that $x = \tilde b$ is the solution to $Ax = b$, $A \in \mathbb{R}^{n \times n}$ invertible.

Suppose now that $A \in \mathbb{R}^{m \times n}$, where $m < n$, and $\operatorname{rank} A = m$. Then, $A$ is not a square matrix. Clearly, in this case the system of equations $Ax = b$ has infinitely many solutions. Without loss of generality, we can assume that the first $m$ columns of $A$ are linearly independent. Then, if we perform a sequence of elementary row operations on the augmented matrix $[A, b]$ as before, we obtain $[I_m, D, \tilde b]$, where $D$ is an $m \times (n - m)$ matrix. Let $x \in \mathbb{R}^n$ be a solution to $Ax = b$ and write $x = [x_B^T, x_D^T]^T$, where $x_B \in \mathbb{R}^m$, $x_D \in \mathbb{R}^{n-m}$. Then, $[I_m, D]x = \tilde b$, which we can rewrite as $x_B + Dx_D = \tilde b$, or $x_B = \tilde b - Dx_D$. Note that for an arbitrary $x_D \in \mathbb{R}^{n-m}$, if $x_B = \tilde b - Dx_D$, then the resulting vector $x = [x_B^T, x_D^T]^T$ is a solution to $Ax = b$. In particular, $[\tilde b^T, 0^T]^T$ is a solution to $Ax = b$. We often refer to the basic solution $[\tilde b^T, 0^T]^T$ as a particular solution to $Ax = b$. Note that $[-(Dx_D)^T, x_D^T]^T$ is a solution to $Ax = 0$. Any solution to $Ax = b$ has the form
$$x = \begin{bmatrix} \tilde b \\ 0 \end{bmatrix} + \begin{bmatrix} -Dx_D \\ x_D \end{bmatrix}$$
for some $x_D \in \mathbb{R}^{n-m}$.
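The $[A, I] \to [I, A^{-1}]$ procedure of Theorem 16.1 and Example 16.1 translates directly into code. The following is a minimal sketch in Python/NumPy; the helper name and the small test matrix are illustrative assumptions, and the row interchange chooses the largest available pivot (a partial-pivoting variant of the "first kind" step in the proof):

```python
import numpy as np

def invert_by_row_operations(A, tol=1e-12):
    """Transform [A, I] into [I, B] by elementary row operations;
    then B = A^{-1}, as guaranteed by Theorem 16.1."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    T = np.hstack([A, np.eye(n)])          # the matrix [A, I]
    for j in range(n):
        # First kind: interchange rows to bring a nonzero element to (j, j).
        p = j + int(np.argmax(np.abs(T[j:, j])))
        if abs(T[p, j]) < tol:
            raise ValueError("A is singular")
        T[[j, p]] = T[[p, j]]
        # Second kind: scale row j so that the (j, j) entry is unity.
        T[j] /= T[j, j]
        # Third kind: add multiples of row j to zero out column j elsewhere.
        for i in range(n):
            if i != j:
                T[i] -= T[i, j] * T[j]
    return T[:, n:]                         # the block B = A^{-1}

A = np.array([[2.0, 1.0], [1.0, 1.0]])      # illustrative test matrix
print(invert_by_row_operations(A) @ A)      # approximately the identity
```

Each loop iteration uses exactly the three kinds of elementary row operations defined above, so the whole computation is a product $E_t \cdots E_1$ applied to $[A, I]$.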
16.2 The Canonical Augmented Matrix

Consider the system of simultaneous linear equations $Ax = b$, $\operatorname{rank} A = m$. Using a sequence of elementary row operations and reordering the variables if necessary, we transform the system $Ax = b$ into the following canonical form:
$$\begin{aligned}
x_1 \phantom{ {}+ x_2 + \cdots} + y_{1\,m+1}x_{m+1} + \cdots + y_{1n}x_n &= y_{10}\\
x_2 \phantom{ {}+ \cdots} + y_{2\,m+1}x_{m+1} + \cdots + y_{2n}x_n &= y_{20}\\
&\;\;\vdots\\
x_m + y_{m\,m+1}x_{m+1} + \cdots + y_{mn}x_n &= y_{m0}.
\end{aligned}$$
This can be represented in matrix notation as
$$[I_m, Y_{m,n-m}]x = y_0.$$
Formally, we define the canonical form as follows.

Definition 16.5 A system $Ax = b$ is said to be in canonical form if among the $n$ variables there are $m$ variables with the property that each appears in only one equation, and its coefficient in that equation is unity. $\blacksquare$

A system is in canonical form if by some reordering of the equations and the variables it takes the form $[I_m, Y_{m,n-m}]x = y_0$. If a system of equations $Ax = b$ is not in canonical form, we can transform the system into canonical form by a sequence of elementary row operations. The system in canonical form has the same solutions as the original system $Ax = b$ and is called the canonical representation of the system with respect to the basis $a_1, \ldots, a_m$. There are, in general, many canonical representations of a given system, depending on which columns of $A$ we transform into the columns of $I_m$ (i.e., basic columns). We call the augmented matrix $[I_m, Y_{m,n-m}, y_0]$ of the canonical representation of a given system the canonical augmented matrix of the system with respect to the basis $a_1, \ldots, a_m$. Of course, there may be many canonical augmented matrices of a given system, depending on which columns of $A$ are chosen as basic columns.

The variables corresponding to basic columns in a canonical representation of a given system are the basic variables, whereas the other variables are the nonbasic variables. In particular, in the canonical representation $[I_m, Y_{m,n-m}]x = y_0$ of a given system, the variables $x_1, \ldots, x_m$ are the basic variables and the other variables are the nonbasic variables. Note that in general the basic variables need not be the first $m$ variables. However, in the following discussion we assume, for convenience and without loss of generality, that the basic variables are indeed the first $m$ variables in the system. Having done so, the corresponding basic solution is
$$x_1 = y_{10}, \ldots, x_m = y_{m0}, \quad x_{m+1} = 0, \ldots, x_n = 0;$$
that is,
$$x = \begin{bmatrix} y_0 \\ 0 \end{bmatrix}.$$

Given a system of equations $Ax = b$, consider the associated canonical augmented matrix
$$[I_m, Y_{m,n-m}, y_0] = \begin{bmatrix} 1 & 0 & \cdots & 0 & y_{1\,m+1} & \cdots & y_{1n} & y_{10} \\ 0 & 1 & \cdots & 0 & y_{2\,m+1} & \cdots & y_{2n} & y_{20} \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & y_{m\,m+1} & \cdots & y_{mn} & y_{m0} \end{bmatrix}.$$
From the arguments above we conclude that
$$b = y_{10}a_1 + y_{20}a_2 + \cdots + y_{m0}a_m.$$
In other words, the entries in the last column of the canonical augmented matrix are the coordinates of the vector $b$ with respect to the basis $\{a_1, \ldots, a_m\}$. The entries of all the other columns of the canonical augmented matrix have a similar interpretation. Specifically, the entries of the $j$th column of the canonical augmented matrix, $j = 1, \ldots, n$, are the coordinates of $a_j$ with respect to the basis $\{a_1, \ldots, a_m\}$. To see this, note that the first $m$ columns of the augmented matrix form a basis (the standard basis). Every other vector in the augmented matrix can be expressed as a linear combination of these basis vectors by reading the coefficients down the corresponding column. Specifically, let $a'_i$, $i = 1, \ldots, n+1$, be the $i$th column in the augmented matrix above. Clearly, since $a'_1, \ldots, a'_m$ form the standard basis, then for $m < j \le n$,
$$a'_j = y_{1j}a'_1 + y_{2j}a'_2 + \cdots + y_{mj}a'_m.$$
Let $a_i$, $i = 1, \ldots, n$, be the $i$th column of $A$, and $a_{n+1} = b$. Now,
$$a'_i = Ea_i, \quad i = 1, \ldots, n+1,$$
where $E$ is a nonsingular matrix that represents the elementary row operations needed to transform $[A, b]$ into $[I_m, Y_{m,n-m}, y_0]$. Therefore, for $m < j \le n$, we also have
$$a_j = y_{1j}a_1 + y_{2j}a_2 + \cdots + y_{mj}a_m.$$
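A canonical augmented matrix with respect to a chosen basis can be computed mechanically: premultiply $[A, b]$ by $B^{-1}$, where $B$ is the matrix of basic columns. The following is a minimal sketch under the assumptions of this section (full row rank, linearly independent chosen columns); the function name and the use of an explicit matrix inverse are illustrative simplifications rather than anything prescribed by the text:

```python
import numpy as np

def canonical_augmented_matrix(A, b, basic_cols):
    """Return the canonical augmented matrix [I_m, Y, y0] of Ax = b
    with respect to the basis given by the column indices basic_cols.
    Columns are reordered so that the basic columns come first."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = A.shape
    nonbasic = [j for j in range(n) if j not in basic_cols]
    Binv = np.linalg.inv(A[:, basic_cols])
    Y = Binv @ A[:, nonbasic]   # coordinates of nonbasic columns in the basis
    y0 = Binv @ b               # coordinates of b in the basis
    return np.hstack([np.eye(m), Y, y0.reshape(-1, 1)])
```

Reading down the $j$th column of the result gives exactly the coordinates of $a_j$ with respect to the chosen basis, as interpreted above; the last column gives the coordinates of $b$.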
16.3 Updating the Augmented Matrix

To summarize Section 16.2, the canonical augmented matrix of a given system $Ax = b$ specifies the representations of the columns $a_j$, $m < j \le n$, in terms of the basic columns $a_1, \ldots, a_m$. Thus, the elements of the $j$th column of the canonical augmented matrix are the coordinates of the vector $a_j$ with respect to the basis $a_1, \ldots, a_m$. The coordinates of $b$ are given in the last column.

Suppose that we are given the canonical representation of a system $Ax = b$. We now consider the following question: If we replace a basic variable by a nonbasic variable, what is the new canonical representation corresponding to the new set of basic variables? Specifically, suppose that we wish to replace the basis vector $a_p$, $1 \le p \le m$, by the vector $a_q$, $m < q \le n$. Provided that the first $m$ vectors with $a_p$ replaced by $a_q$ are linearly independent, these vectors constitute a basis, and every vector can be expressed as a linear combination of the new basic columns. Let us now find the coordinates of the vectors $a_1, \ldots, a_n$ with respect to the new basis. These coordinates form the entries of the canonical augmented matrix of the system with respect to the new basis.

In terms of the old basis, we can express $a_q$ as
$$a_q = \sum_{i=1}^{m} y_{iq}a_i = \sum_{\substack{i=1 \\ i \neq p}}^{m} y_{iq}a_i + y_{pq}a_p.$$
Note that the set of vectors $\{a_1, \ldots, a_{p-1}, a_q, a_{p+1}, \ldots, a_m\}$ is linearly independent if and only if $y_{pq} \neq 0$. Solving the equation above for $a_p$, we get
$$a_p = \frac{1}{y_{pq}}a_q - \sum_{\substack{i=1 \\ i \neq p}}^{m} \frac{y_{iq}}{y_{pq}}a_i.$$
Recall that in terms of the old augmented matrix, any vector $a_j$, $m < j \le n$, can be expressed as
$$a_j = y_{1j}a_1 + y_{2j}a_2 + \cdots + y_{mj}a_m.$$
Combining the last two equations yields
$$a_j = \sum_{\substack{i=1 \\ i \neq p}}^{m}\left(y_{ij} - \frac{y_{pj}}{y_{pq}}y_{iq}\right)a_i + \frac{y_{pj}}{y_{pq}}a_q.$$
Denoting the entries of the new augmented matrix by $y'_{ij}$, we obtain
$$y'_{ij} = y_{ij} - \frac{y_{pj}}{y_{pq}}y_{iq}, \quad i \neq p, \qquad y'_{pj} = \frac{y_{pj}}{y_{pq}}.$$
Therefore, the entries of the new canonical augmented matrix can be obtained from the entries of the old canonical augmented matrix via the formulas above. These equations are often called the pivot equations, and $y_{pq}$ the pivot element. We refer to the operation on a given matrix by the formulas above as pivoting about the $(p, q)$th element. Note that pivoting about the $(p, q)$th element results in a matrix whose $q$th column has all zero entries, except the $(p, q)$th entry, which is unity. The pivoting operation can be accomplished via a sequence of elementary row operations, as was done in the proof of Theorem 16.1.
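The pivot equations amount to a two-line computation. Here is a minimal sketch in Python/NumPy (the function name is an assumption, and indices are zero-based as is idiomatic in Python) that pivots an arbitrary array about its $(p, q)$th element exactly as in the formulas above:

```python
import numpy as np

def pivot(T, p, q):
    """Pivot the array T about its (p, q)th element: divide row p by
    the pivot element T[p, q], then subtract T[i, q] times the new row p
    from every other row i (the pivot equations of Section 16.3)."""
    T = np.asarray(T, dtype=float).copy()
    T[p] /= T[p, q]
    for i in range(T.shape[0]):
        if i != p:
            T[i] -= T[i, q] * T[p]
    return T
```

After pivot(T, p, q), column q of the result is the unit vector with a 1 in row p, in agreement with the observation above.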
16.4 The Simplex Algorithm

The essence of the simplex algorithm is to move from one basic feasible solution to another until an optimal basic feasible solution is found. The canonical augmented matrix discussed in Section 16.3 plays a central role in the simplex algorithm. Suppose that we are given the basic feasible solution $x = [x_1, \ldots, x_m, 0, \ldots, 0]^T$, or equivalently
$$x_1 a_1 + \cdots + x_m a_m = b, \qquad x_i \ge 0, \quad i = 1, \ldots, m.$$
In Section 16.3 we saw how to update the canonical augmented matrix if we wish to replace a basic column by a nonbasic column, that is, if we wish to change from one basis to another by replacing a single basic column. The values of the basic variables in a basic solution corresponding to a given basis are given in the last column of the canonical augmented matrix with respect to that basis; that is, $x_i = y_{i0}$, $i = 1, \ldots, m$. Basic solutions are not necessarily feasible: the values of the basic variables may be negative. In the simplex method we want to move from one basic feasible solution to another. This means that we want to change basic columns in such a way that the last column of the canonical augmented matrix remains nonnegative. In this section we discuss a systematic method for doing this.

In the remainder of this chapter we assume that every basic feasible solution of
$$Ax = b, \quad x \ge 0$$
is a nondegenerate basic feasible solution. We make this assumption primarily for convenience; all arguments can be extended to include degeneracy.

Let us start with the basic columns $a_1, \ldots, a_m$, and assume that the corresponding basic solution $x = [y_{10}, \ldots, y_{m0}, 0, \ldots, 0]^T$ is feasible; that is, the entries $y_{i0}$, $i = 1, \ldots, m$, in the last column of the canonical augmented matrix are positive. Suppose that we now decide to make the vector $a_q$, $q > m$, a basic column. We first represent $a_q$ in terms of the current basis as
$$a_q = y_{1q}a_1 + y_{2q}a_2 + \cdots + y_{mq}a_m.$$
Multiplying the above by $\varepsilon > 0$ yields
$$\varepsilon a_q = \varepsilon y_{1q}a_1 + \varepsilon y_{2q}a_2 + \cdots + \varepsilon y_{mq}a_m.$$
We combine this equation with $y_{10}a_1 + \cdots + y_{m0}a_m = b$ to get
$$(y_{10} - \varepsilon y_{1q})a_1 + (y_{20} - \varepsilon y_{2q})a_2 + \cdots + (y_{m0} - \varepsilon y_{mq})a_m + \varepsilon a_q = b.$$
Note that the vector
$$[y_{10} - \varepsilon y_{1q}, \ldots, y_{m0} - \varepsilon y_{mq}, 0, \ldots, \varepsilon, \ldots, 0]^T,$$
where $\varepsilon$ appears in the $q$th position, is a solution to $Ax = b$. If $\varepsilon = 0$, then we obtain the old basic feasible solution. As $\varepsilon$ is increased from zero, the $q$th component of the vector above increases. All other entries of this vector will increase or decrease linearly as $\varepsilon$ is increased, depending on whether the corresponding $y_{iq}$ is negative or positive. For small enough $\varepsilon$, we have a feasible but nonbasic solution. If any of the components decreases as $\varepsilon$ increases, we choose $\varepsilon$ to be the smallest value where one (or more) of the components vanishes. That is,
$$\varepsilon = \min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}.$$
With this choice of $\varepsilon$ we have a new basic feasible solution, with the vector $a_q$ replacing $a_p$, where $p$ corresponds to the minimizing index
$$p = \arg\min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}.$$
So, we now have a new basis $a_1, \ldots, a_{p-1}, a_{p+1}, \ldots, a_m, a_q$. As we can see, $a_p$ was replaced by $a_q$ in the new basis. We say that $a_q$ enters the basis and $a_p$ leaves the basis.

If the minimum in $\min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}$ is achieved by more than a single index, then the new solution is degenerate and any of the zero components can be regarded as the component corresponding to the basic column that leaves the basis. If none of the $y_{iq}$ are positive, then all components in the vector $[y_{10} - \varepsilon y_{1q}, \ldots, y_{m0} - \varepsilon y_{mq}, 0, \ldots, \varepsilon, \ldots, 0]^T$ increase (or remain constant) as $\varepsilon$ is increased, and no new basic feasible solution is obtained, no matter how large we make $\varepsilon$. In this case there are feasible solutions having arbitrarily large components, which means that the set $\Omega$ of feasible solutions is unbounded.
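The choice of $\varepsilon$ and of the leaving index $p$ just described is the ratio test. A minimal sketch of it (an illustrative helper with zero-based indices, not code from the text):

```python
def min_ratio_test(y0, yq):
    """Return (epsilon, p), where epsilon = min{y_i0 / y_iq : y_iq > 0}
    and p is the smallest minimizing index; return (None, None) if no
    y_iq is positive, i.e., the feasible set is unbounded in this
    direction and no new basic feasible solution exists."""
    ratios = [(y0[i] / yq[i], i) for i in range(len(yq)) if yq[i] > 0]
    if not ratios:
        return None, None
    return min(ratios)   # ties broken by the smallest index i
```

Tuple comparison in min makes ties resolve to the smallest index, matching the tie-breaking convention used in the algorithm statement below.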
So far, we have discussed how to change from one basis to another while preserving feasibility of the corresponding basic solution, assuming that we have already chosen a nonbasic column to enter the basis. To complete our development of the simplex method, we need to consider two more issues. The first issue concerns the choice of which nonbasic column should enter the basis. The second issue is to find a stopping criterion, that is, a way to determine whether a basic feasible solution is optimal or not.

To this end, suppose that we have found a basic feasible solution. The main idea of the simplex method is to move from one basic feasible solution (extreme point of the set $\Omega$) to another basic feasible solution at which the value of the objective function is smaller. Because there is only a finite number of extreme points of the feasible set, the optimal point will be reached after a finite number of steps. We already know how to move from one extreme point of the set $\Omega$ to a neighboring one by updating the canonical augmented matrix. To see which neighboring solution we should move to and when to stop moving, consider the following basic feasible solution:
$$[x_B^T, 0^T]^T = [y_{10}, \ldots, y_{m0}, 0, \ldots, 0]^T,$$
together with the corresponding canonical augmented matrix, having an identity matrix appearing in the first $m$ columns. The value of the objective function for any solution $x$ is
$$z = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n.$$
For our basic solution, the value of the objective function is
$$z = z_0 = c_B^T x_B = c_1 y_{10} + \cdots + c_m y_{m0},$$
where $c_B^T = [c_1, c_2, \ldots, c_m]$.

To see how the value of the objective function changes when we move from one basic feasible solution to another, suppose that we choose the $q$th column, $m < q \le n$, to enter the basis. To update the canonical augmented matrix, let $p = \arg\min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}$ and $\varepsilon = y_{p0}/y_{pq}$. The new basic feasible solution is
$$[y_{10} - \varepsilon y_{1q}, \ldots, y_{m0} - \varepsilon y_{mq}, 0, \ldots, \varepsilon, \ldots, 0]^T.$$
Note that the single $\varepsilon$ appears in the $q$th component, whereas the $p$th component is zero. Observe that we could have arrived at the basic feasible solution above simply by updating the canonical augmented matrix using the pivot equations from Section 16.3:
$$y'_{ij} = y_{ij} - \frac{y_{pj}}{y_{pq}}y_{iq}, \quad i \neq p, \qquad y'_{pj} = \frac{y_{pj}}{y_{pq}},$$
where the $q$th column enters the basis and the $p$th column leaves [i.e., we pivot about the $(p, q)$th element]. The values of the basic variables are entries in the last column of the updated canonical augmented matrix.

The cost for this new basic feasible solution is
$$z = c_1(y_{10} - \varepsilon y_{1q}) + \cdots + c_m(y_{m0} - \varepsilon y_{mq}) + c_q \varepsilon = z_0 + [c_q - (c_1 y_{1q} + \cdots + c_m y_{mq})]\varepsilon,$$
where $z_0 = c_1 y_{10} + \cdots + c_m y_{m0}$. Let
$$z_q = c_1 y_{1q} + \cdots + c_m y_{mq}.$$
Then,
$$z = z_0 + (c_q - z_q)\varepsilon.$$
Thus, if
$$z - z_0 = (c_q - z_q)\varepsilon < 0,$$
then the objective function value at the new basic feasible solution above is smaller than the objective function value at the original solution (i.e., $z < z_0$). Therefore, if $c_q - z_q < 0$, then the new basic feasible solution with $a_q$ entering the basis has a lower objective function value.

On the other hand, if the given basic feasible solution is such that for all $q = m+1, \ldots, n$,
$$c_q - z_q \ge 0,$$
then we can show that this solution is in fact an optimal solution. To show this, recall from Section 16.1 that any solution to $Ax = b$ can be represented as
$$x = \begin{bmatrix} y_0 \\ 0 \end{bmatrix} + \begin{bmatrix} -Y_{m,n-m}x_D \\ x_D \end{bmatrix}$$
for some $x_D = [x_{m+1}, \ldots, x_n]^T \in \mathbb{R}^{n-m}$. Using manipulations similar to the above, we obtain
$$c^T x = z_0 + \sum_{i=m+1}^{n}(c_i - z_i)x_i,$$
where $z_i = c_1 y_{1i} + \cdots + c_m y_{mi}$, $i = m+1, \ldots, n$. For a feasible solution we have $x_i \ge 0$, $i = 1, \ldots, n$.
Therefore, if $c_i - z_i \ge 0$ for all $i = m+1, \ldots, n$, then any feasible solution $x$ will have objective function value $c^T x$ no smaller than $z_0$.

Let $r_i = 0$ for $i = 1, \ldots, m$ and $r_i = c_i - z_i$ for $i = m+1, \ldots, n$. We call $r_i$ the $i$th reduced cost coefficient or relative cost coefficient. Note that the reduced cost coefficients corresponding to basic variables are zero. We summarize the discussion above with the following result.

Theorem 16.2 A basic feasible solution is optimal if and only if the corresponding reduced cost coefficients are all nonnegative. $\square$

At this point we have all the necessary steps for the simplex algorithm.

Simplex Algorithm

1. Form a canonical augmented matrix corresponding to an initial basic feasible solution.
2. Calculate the reduced cost coefficients corresponding to the nonbasic variables.
3. If $r_j \ge 0$ for all $j$, stop: the current basic feasible solution is optimal.
4. Select a $q$ such that $r_q < 0$.
5. If no $y_{iq} > 0$, stop: the problem is unbounded; else, calculate $p = \arg\min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}$. (If more than one index $i$ minimizes $y_{i0}/y_{iq}$, we let $p$ be the smallest such index.)
6. Update the canonical augmented matrix by pivoting about the $(p, q)$th element.
7. Go to step 2.

We state the following result for the simplex algorithm, which we have already proved in the foregoing discussion.

Theorem 16.3 Suppose that we have an LP problem in standard form that has an optimal feasible solution. If the simplex method applied to this problem terminates and the reduced cost coefficients in the last step are all nonnegative, then the resulting basic feasible solution is optimal. $\square$
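The seven steps above assemble into a compact implementation. The following hedged sketch (Python/NumPy; the function name is an assumption) operates on a canonical tableau whose last row holds the reduced cost coefficients; as a design choice it uses the smallest-index (Bland) rule in step 4 rather than the most-negative rule, anticipating the cycling discussion later in this chapter. The appended test data reproduce Example 16.2 below:

```python
import numpy as np

def simplex(tableau, basic, tol=1e-9):
    """Simplex algorithm on a canonical tableau.

    tableau: (m+1) x (n+1) array; rows 0..m-1 hold the canonical
             augmented matrix, the last row holds the reduced cost
             coefficients and the negative of the objective value.
    basic:   list of the m current basic column indices (zero-based).
    Returns an optimal basic feasible solution, or None if unbounded."""
    T = np.asarray(tableau, dtype=float).copy()
    m, n = T.shape[0] - 1, T.shape[1] - 1
    while True:
        eligible = [q for q in range(n) if T[-1, q] < -tol]
        if not eligible:                     # step 3: all r_j >= 0
            x = np.zeros(n)
            x[basic] = T[:m, -1]
            return x
        q = min(eligible)                    # step 4 (Bland's rule)
        if np.all(T[:m, q] <= tol):
            return None                      # step 5: unbounded
        ratios = [(T[i, -1] / T[i, q], i) for i in range(m) if T[i, q] > tol]
        _, p = min(ratios)                   # ratio test, smallest index
        T[p] /= T[p, q]                      # step 6: pivot about (p, q)
        for i in range(m + 1):
            if i != p:
                T[i] -= T[i, q] * T[p]
        basic[p] = q

# Test data from Example 16.2 (standard form, slack variables x3, x4, x5).
T = np.array([[ 1.0,  0.0, 1.0, 0.0, 0.0, 4.0],
              [ 0.0,  1.0, 0.0, 1.0, 0.0, 6.0],
              [ 1.0,  1.0, 0.0, 0.0, 1.0, 8.0],
              [-2.0, -5.0, 0.0, 0.0, 0.0, 0.0]])
print(simplex(T, [2, 3, 4]))                 # expected: [2. 6. 2. 0. 0.]
```

The pivot path may differ from the one worked by hand below (Bland's rule versus the most-negative rule), but both terminate at the same optimal basic feasible solution.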
Example 16.2 Consider the following linear program (see also Exercise 15.10):

maximize $2x_1 + 5x_2$
subject to $x_1 \le 4$
    $x_2 \le 6$
    $x_1 + x_2 \le 8$
    $x_1, x_2 \ge 0$.

We solve this problem using the simplex method. Introducing slack variables, we transform the problem into standard form:

minimize $-2x_1 - 5x_2 - 0x_3 - 0x_4 - 0x_5$
subject to $x_1 + x_3 = 4$
    $x_2 + x_4 = 6$
    $x_1 + x_2 + x_5 = 8$
    $x_1, x_2, x_3, x_4, x_5 \ge 0$.

The starting canonical augmented matrix for this problem (with columns labeled $a_1, \ldots, a_5, b$) is
$$\begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 4 \\ 0 & 1 & 0 & 1 & 0 & 6 \\ 1 & 1 & 0 & 0 & 1 & 8 \end{bmatrix}.$$
Observe that the columns forming the identity matrix in the canonical augmented matrix above do not appear at the beginning. We could rearrange the augmented matrix so that the identity matrix would appear first. However, this is not essential from the computational point of view. The starting basic feasible solution to the problem in standard form is $x = [0, 0, 4, 6, 8]^T$. The columns $a_3$, $a_4$, and $a_5$ corresponding to $x_3$, $x_4$, and $x_5$ are basic, and they form the identity matrix. The basis matrix is $B = [a_3, a_4, a_5] = I_3$. The value of the objective function corresponding to this basic feasible solution is $z = 0$.

We next compute the reduced cost coefficients corresponding to the nonbasic variables $x_1$ and $x_2$:
$$r_1 = c_1 - z_1 = c_1 - (c_3 y_{11} + c_4 y_{21} + c_5 y_{31}) = -2,$$
$$r_2 = c_2 - z_2 = c_2 - (c_3 y_{12} + c_4 y_{22} + c_5 y_{32}) = -5.$$
We would now like to move to an adjacent basic feasible solution for which the objective function value is lower. Naturally, if there is more than one such solution, it is desirable to move to the adjacent basic feasible solution with the lowest objective value. A common practice is to select the most negative value of $r_j$ and then to bring the corresponding column into the basis (see Exercise 16.18 for an alternative rule for choosing the column to bring into the basis).

In our example, we bring $a_2$ into the basis; that is, we choose $a_2$ as the new basic column. We then compute
$$p = \arg\min_i\{y_{i0}/y_{i2} : y_{i2} > 0\} = 2.$$
We now update the canonical augmented matrix by pivoting about the $(2, 2)$th entry using the pivot equations:
$$y'_{ij} = y_{ij} - \frac{y_{2j}}{y_{22}}y_{i2}, \quad i \neq 2, \qquad y'_{2j} = \frac{y_{2j}}{y_{22}}.$$
The resulting updated canonical augmented matrix is
$$\begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 4 \\ 0 & 1 & 0 & 1 & 0 & 6 \\ 1 & 0 & 0 & -1 & 1 & 2 \end{bmatrix}.$$
Note that $a_2$ entered the basis and $a_4$ left the basis. The corresponding basic feasible solution is $x = [0, 6, 4, 0, 2]^T$. We now compute the reduced cost coefficients for the nonbasic columns:
$$r_1 = c_1 - z_1 = -2, \qquad r_4 = c_4 - z_4 = 5.$$
Because $r_1 = -2 < 0$, the current solution is not optimal, and a lower objective function value can be obtained by bringing $a_1$ into the basis. Proceeding to update the canonical augmented matrix by pivoting about the $(3, 1)$th element, we obtain
$$\begin{bmatrix} 0 & 0 & 1 & 1 & -1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 6 \\ 1 & 0 & 0 & -1 & 1 & 2 \end{bmatrix}.$$
The corresponding basic feasible solution is $x = [2, 6, 2, 0, 0]^T$. The reduced cost coefficients are
$$r_4 = c_4 - z_4 = 3, \qquad r_5 = c_5 - z_5 = 2.$$
Because no reduced cost coefficient is negative, the current basic feasible solution $x = [2, 6, 2, 0, 0]^T$ is optimal. The solution to the original problem is therefore $x_1 = 2$, $x_2 = 6$, and the objective function value is 34. $\blacksquare$

We can see from Example 16.2 that we can solve a linear programming problem of any size using the simplex algorithm. To make the calculations in the algorithm more efficient, we discuss the matrix form of the simplex method in the next section.

16.5 Matrix Form of the Simplex Method

Consider a linear programming problem in standard form:

minimize $c^T x$
subject to $Ax = b$
    $x \ge 0$.

Let the first $m$ columns of $A$ be the basic columns. These columns form a square $m \times m$ nonsingular matrix $B$. The nonbasic columns of $A$ form an $m \times (n - m)$ matrix $D$. We partition the cost vector correspondingly as $c^T = [c_B^T, c_D^T]$. Then, the original linear program can be represented as follows:

minimize $c_B^T x_B + c_D^T x_D$
subject to $Bx_B + Dx_D = b$
    $x_B \ge 0, \quad x_D \ge 0$.

If $x_D = 0$, then the solution $x = [x_B^T, x_D^T]^T = [x_B^T, 0^T]^T$ is the basic feasible solution corresponding to the basis $B$. It is clear that for this to be a solution, we need $x_B = B^{-1}b$; that is, the basic feasible solution is
$$x = \begin{bmatrix} B^{-1}b \\ 0 \end{bmatrix}.$$
The corresponding objective function value is
$$z_0 = c_B^T B^{-1} b.$$
If, on the other hand, $x_D \neq 0$, then the solution $x = [x_B^T, x_D^T]^T$ is not basic. In this case $x_B$ is given by
$$x_B = B^{-1}b - B^{-1}Dx_D,$$
and the corresponding objective function value is
$$z = c_B^T x_B + c_D^T x_D = c_B^T(B^{-1}b - B^{-1}Dx_D) + c_D^T x_D = c_B^T B^{-1}b + (c_D^T - c_B^T B^{-1}D)x_D.$$
Defining
$$r_D^T = c_D^T - c_B^T B^{-1}D,$$
we obtain
$$z = z_0 + r_D^T x_D.$$
The elements of the vector $r_D$ are the reduced cost coefficients corresponding to the nonbasic variables. If $r_D \ge 0$, then the basic feasible solution corresponding to the basis $B$ is optimal. If, on the other hand, a component of $r_D$ is negative, then the value of the objective function can be reduced by increasing a corresponding component of $x_D$, that is, by changing the basis.

We now use the foregoing observations to develop a matrix form of the simplex method. To this end, we first add the cost coefficient vector $c^T$ to the bottom of the augmented matrix $[A, b]$ as follows:
$$\begin{bmatrix} A & b \\ c^T & 0 \end{bmatrix} = \begin{bmatrix} B & D & b \\ c_B^T & c_D^T & 0 \end{bmatrix}.$$
We refer to this matrix as the tableau of the given LP problem. The tableau contains all relevant information about the linear program.
Suppose that we now apply elementary row operations to the tableau such that the top part of the tableau corresponding to the augmented matrix $[A, b]$ is transformed into canonical form. This corresponds to premultiplying the tableau by the matrix
$$\begin{bmatrix} B^{-1} & 0 \\ 0^T & 1 \end{bmatrix}.$$
The result of this operation is
$$\begin{bmatrix} B^{-1} & 0 \\ 0^T & 1 \end{bmatrix}\begin{bmatrix} B & D & b \\ c_B^T & c_D^T & 0 \end{bmatrix} = \begin{bmatrix} I_m & B^{-1}D & B^{-1}b \\ c_B^T & c_D^T & 0 \end{bmatrix}.$$
We now apply elementary row operations to the tableau above so that the entries of the last row corresponding to the basic columns become zero. Specifically, this corresponds to premultiplication of the tableau by the matrix
$$\begin{bmatrix} I_m & 0 \\ -c_B^T & 1 \end{bmatrix}.$$
The result is
$$\begin{bmatrix} I_m & 0 \\ -c_B^T & 1 \end{bmatrix}\begin{bmatrix} I_m & B^{-1}D & B^{-1}b \\ c_B^T & c_D^T & 0 \end{bmatrix} = \begin{bmatrix} I_m & B^{-1}D & B^{-1}b \\ 0^T & c_D^T - c_B^T B^{-1}D & -c_B^T B^{-1}b \end{bmatrix}.$$
We refer to the resulting tableau as the canonical tableau corresponding to the basis $B$. Note that the first $m$ entries of the last column of the canonical tableau, $B^{-1}b$, are the values of the basic variables corresponding to the basis $B$. The entries of $c_D^T - c_B^T B^{-1}D$ in the last row are the reduced cost coefficients. The last element in the last row of the tableau, $-c_B^T B^{-1}b$, is the negative of the value of the objective function corresponding to the basic feasible solution.

Given an LP problem, we can in general construct many different canonical tableaus, depending on which columns are basic. Suppose that we have a canonical tableau corresponding to a particular basis. Consider the task of computing the tableau corresponding to another basis that differs from the previous basis by a single vector. This can be accomplished by applying elementary row operations to the tableau in a similar fashion as discussed above. We refer to this operation as updating the canonical tableau. Note that updating the tableau involves using exactly the same update equations as we used before in updating the canonical augmented matrix, namely, for $i = 1, \ldots, m+1$,
$$y'_{ij} = y_{ij} - \frac{y_{pj}}{y_{pq}}y_{iq}, \quad i \neq p, \qquad y'_{pj} = \frac{y_{pj}}{y_{pq}},$$
where $y_{ij}$ and $y'_{ij}$ are the $(i, j)$th entries of the original and updated canonical tableaus, respectively.

Working with the tableau is a convenient way of implementing the simplex algorithm, since updating the tableau immediately gives us the values of both the basic variables and the reduced cost coefficients. In addition, the (negative of the) value of the objective function can be found in the lower right-hand corner of the tableau. We illustrate the use of the tableau in the following example.
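The canonical tableau derived above can be assembled in a few lines. A minimal sketch in Python/NumPy; explicit inversion is used for clarity rather than efficiency, and the function name is an assumption:

```python
import numpy as np

def canonical_tableau(A, b, c, basic_cols):
    """Assemble the canonical tableau for min c'x s.t. Ax = b, x >= 0,
    with respect to the basis B formed by the columns basic_cols:
        [ B^{-1} A            | B^{-1} b       ]
        [ c' - c_B' B^{-1} A  | -c_B' B^{-1} b ]
    (for the basic columns, c' - c_B'B^{-1}A reduces to zero)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    Binv = np.linalg.inv(A[:, basic_cols])
    lam = c[basic_cols] @ Binv                     # the vector c_B' B^{-1}
    top = np.hstack([Binv @ A, (Binv @ b).reshape(-1, 1)])
    bottom = np.hstack([c - lam @ A, [-(lam @ b)]])
    return np.vstack([top, bottom])

# Sanity check with the data of Example 16.3 below (basis {a3, a4}).
A = np.array([[2.0, 1.0, 1.0, 0.0], [1.0, 4.0, 0.0, 1.0]])
b = np.array([3.0, 4.0])
c = np.array([-7.0, -6.0, 0.0, 0.0])
print(canonical_tableau(A, b, c, [2, 3]))
```

With the slack columns as the basis, $B^{-1} = I$, so the function should return the initial tableau written out in Example 16.3 below.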
Example 16.3 Consider the following linear programming problem:

maximize $7x_1 + 6x_2$
subject to $2x_1 + x_2 \le 3$
    $x_1 + 4x_2 \le 4$
    $x_1, x_2 \ge 0$.

We first transform the problem into standard form so that the simplex method can be applied. To do this we change the maximization to minimization by multiplying the objective function by $-1$. We then introduce two nonnegative slack variables, $x_3$ and $x_4$, and construct the tableau for the problem (columns $a_1, a_2, a_3, a_4, b$; last row $c^T$):
$$\begin{bmatrix} 2 & 1 & 1 & 0 & 3 \\ 1 & 4 & 0 & 1 & 4 \\ -7 & -6 & 0 & 0 & 0 \end{bmatrix}.$$
Notice that this tableau is already in canonical form with respect to the basis $[a_3, a_4]$. Hence, the last row contains the reduced cost coefficients, and the rightmost column contains the values of the basic variables. Because $r_1 = -7$ is the most negative reduced cost coefficient, we bring $a_1$ into the basis. We then compute the ratios $y_{10}/y_{11} = 3/2$ and $y_{20}/y_{21} = 4$. Because $y_{10}/y_{11} < y_{20}/y_{21}$, we get $p = \arg\min_i\{y_{i0}/y_{i1} : y_{i1} > 0\} = 1$. We pivot about the $(1, 1)$th element of the tableau to obtain
$$\begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{2} & 0 & \frac{3}{2} \\ 0 & \frac{7}{2} & -\frac{1}{2} & 1 & \frac{5}{2} \\ 0 & -\frac{5}{2} & \frac{7}{2} & 0 & \frac{21}{2} \end{bmatrix}.$$
In the second tableau above, only $r_2$ is negative. Therefore, $q = 2$ (i.e., we bring $a_2$ into the basis). Because
$$\frac{y_{10}}{y_{12}} = 3, \qquad \frac{y_{20}}{y_{22}} = \frac{5}{7},$$
we have $p = 2$. We thus pivot about the $(2, 2)$th element of the second tableau to obtain the third tableau:
$$\begin{bmatrix} 1 & 0 & \frac{4}{7} & -\frac{1}{7} & \frac{8}{7} \\ 0 & 1 & -\frac{1}{7} & \frac{2}{7} & \frac{5}{7} \\ 0 & 0 & \frac{22}{7} & \frac{5}{7} & \frac{86}{7} \end{bmatrix}.$$
Because the last row of the third tableau above has no negative elements, we conclude that the basic feasible solution corresponding to the third tableau is optimal. Thus, $x_1 = 8/7$, $x_2 = 5/7$, $x_3 = 0$, $x_4 = 0$ is the solution to our LP in standard form, and the corresponding objective value is $-86/7$. The solution to the original problem is simply $x_1 = 8/7$, $x_2 = 5/7$, and the corresponding objective value is $86/7$. $\blacksquare$

Degenerate basic feasible solutions may arise in the course of applying the simplex algorithm. In such a situation, the minimum ratio $y_{i0}/y_{iq}$ is 0. Therefore, even though the basis changes after we pivot about the $(p, q)$th element, the basic feasible solution does not (and remains degenerate). It is possible that if we start with a basis corresponding to a degenerate solution, several iterations of the simplex algorithm will involve the same degenerate solution, and eventually the original basis will recur. The entire process will then repeat indefinitely, leading to what is called cycling. Such a scenario, although rare in practice, is clearly undesirable. Fortunately, there is a simple rule for choosing $q$ and $p$, due to Bland, that eliminates the cycling problem (see Exercise 16.18):
$$q = \min\{i : r_i < 0\},$$
$$p = \min\{j : y_{j0}/y_{jq} = \min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}\}.$$
16.6 Two-Phase Simplex Method

The simplex method requires starting with a tableau for the problem in canonical form; that is, we need an initial basic feasible solution. A brute-force approach to finding a starting basic feasible solution is to choose $m$ basic columns arbitrarily and transform the tableau for the problem into canonical form. If the rightmost column is positive, then we have a legitimate (initial) basic feasible solution. Otherwise, we would have to pick another candidate basis. Potentially, this brute-force procedure requires $\binom{n}{m}$ tries and is therefore not practical.

Certain LP problems have obvious initial basic feasible solutions. For example, if we have constraints of the form $Ax \le b$ and we add $m$ slack variables $z_1, \ldots, z_m$, then the constraints in standard form become
$$[A, I_m]\begin{bmatrix} x \\ z \end{bmatrix} = b, \qquad \begin{bmatrix} x \\ z \end{bmatrix} \ge 0,$$
where $z = [z_1, \ldots, z_m]^T$. The obvious initial basic feasible solution is $[0^T, b^T]^T$, and the basic variables are the slack variables. This was the case in the example in Section 16.5.

Suppose that we are given a linear program in standard form:

minimize $c^T x$
subject to $Ax = b$
    $x \ge 0$.

In general, an initial basic feasible solution is not always apparent. We therefore need a systematic method for finding an initial basic feasible solution for general LP problems so that the simplex method can be initialized. For this purpose, suppose that we are given an LP problem in standard form. Consider the following associated artificial problem:

minimize $y_1 + y_2 + \cdots + y_m$
subject to $[A, I_m]\begin{bmatrix} x \\ y \end{bmatrix} = b$, $\quad \begin{bmatrix} x \\ y \end{bmatrix} \ge 0$,

where $y = [y_1, \ldots, y_m]^T$. We call $y$ the vector of artificial variables. Note that the artificial problem has an obvious initial basic feasible solution:
$$\begin{bmatrix} 0 \\ b \end{bmatrix}.$$
We can therefore solve this problem by the simplex method.

Proposition 16.1 The original LP problem has a basic feasible solution if and only if the associated artificial problem has an optimal feasible solution with objective function value zero. $\square$

Proof. $\Rightarrow$: If the original problem has a basic feasible solution $x$, then the vector $[x^T, 0^T]^T$ is a basic feasible solution to the artificial problem. Clearly, this solution has an objective function value of zero. This solution is therefore optimal for the artificial problem, since there can be no feasible solution with negative objective function value.

$\Leftarrow$: Suppose that the artificial problem has an optimal feasible solution with objective function value zero. Then, this solution must have the form $[x^T, 0^T]^T$, where $x \ge 0$. Hence, we have $Ax = b$, and $x$ is a feasible solution to the original problem. By the fundamental theorem of LP, there also exists a basic feasible solution. $\blacksquare$

Assume that the original LP problem has a basic feasible solution. Suppose that the simplex method applied to the associated artificial problem has terminated with an objective function value of zero. Then, as indicated in the proof above, the solution to the artificial problem will have all $y_i = 0$, $i = 1, \ldots, m$. Hence, assuming nondegeneracy, the basic variables are among the first $n$ components; that is, none of the artificial variables are basic. Therefore, the first $n$ components form a basic feasible solution to the original problem. We can then use this basic feasible solution (resulting from the artificial problem) as the initial basic feasible solution for the original LP problem (after deleting the components corresponding to artificial variables). Thus, using artificial variables, we can attack a general linear programming problem by applying the two-phase simplex method. In phase I we introduce artificial variables and the artificial objective function and find a basic feasible solution. In phase II we use the basic feasible solution resulting from phase I to initialize the simplex algorithm to solve the original LP problem. The two-phase simplex method is illustrated in Figure 16.1.

[Figure 16.1: Illustration of the two-phase simplex method: phase I finds a basic feasible solution, which phase II then uses to find an optimal solution.]

Example 16.4 Consider the following linear programming problem:

minimize $2x_1 + 3x_2$
subject to $4x_1 + 2x_2 \ge 12$
    $x_1 + 4x_2 \ge 6$
    $x_1, x_2 \ge 0$.

First, we express the problem in standard form by introducing surplus variables:

minimize $2x_1 + 3x_2$
subject to $4x_1 + 2x_2 - x_3 = 12$
    $x_1 + 4x_2 - x_4 = 6$
    $x_1, \ldots, x_4 \ge 0$.

For the LP problem above there is no obvious basic feasible solution that we can use to initialize the simplex method. Therefore, we use the two-phase method.

Phase I. We introduce artificial variables $x_5, x_6 \ge 0$ and an artificial objective function $x_5 + x_6$. We form the corresponding tableau for the problem (columns $a_1, \ldots, a_6, b$; last row $c^T$):
$$\begin{bmatrix} 4 & 2 & -1 & 0 & 1 & 0 & 12 \\ 1 & 4 & 0 & -1 & 0 & 1 & 6 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}.$$
To initiate the simplex procedure, we must update the last row of this tableau to transform it into canonical form. We obtain
$$\begin{bmatrix} 4 & 2 & -1 & 0 & 1 & 0 & 12 \\ 1 & 4 & 0 & -1 & 0 & 1 & 6 \\ -5 & -6 & 1 & 1 & 0 & 0 & -18 \end{bmatrix}.$$
The basic feasible solution corresponding to this tableau is not optimal. Therefore, we proceed with the simplex method to obtain the next tableau:
$$\begin{bmatrix} \frac{7}{2} & 0 & -1 & \frac{1}{2} & 1 & -\frac{1}{2} & 9 \\ \frac{1}{4} & 1 & 0 & -\frac{1}{4} & 0 & \frac{1}{4} & \frac{3}{2} \\ -\frac{7}{2} & 0 & 1 & -\frac{1}{2} & 0 & \frac{3}{2} & -9 \end{bmatrix}.$$
We still have not reached an optimal basic feasible solution. Performing another iteration, we get
$$\begin{bmatrix} 1 & 0 & -\frac{2}{7} & \frac{1}{7} & \frac{2}{7} & -\frac{1}{7} & \frac{18}{7} \\ 0 & 1 & \frac{1}{14} & -\frac{2}{7} & -\frac{1}{14} & \frac{2}{7} & \frac{6}{7} \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}.$$
Both of the artificial variables have been driven out of the basis, and the current basic feasible solution is optimal. We now proceed to phase II.
Phase II. We start by deleting the columns corresponding to the artificial variables in the last tableau of phase I and revert to the original objective function. We obtain (columns $a_1, \ldots, a_4, b$; last row $c^T$)
$$\begin{bmatrix} 1 & 0 & -\frac{2}{7} & \frac{1}{7} & \frac{18}{7} \\ 0 & 1 & \frac{1}{14} & -\frac{2}{7} & \frac{6}{7} \\ 2 & 3 & 0 & 0 & 0 \end{bmatrix}.$$
We transform the last row so that zeros appear in the basic columns; that is, we transform the tableau above into canonical form:
$$\begin{bmatrix} 1 & 0 & -\frac{2}{7} & \frac{1}{7} & \frac{18}{7} \\ 0 & 1 & \frac{1}{14} & -\frac{2}{7} & \frac{6}{7} \\ 0 & 0 & \frac{5}{14} & \frac{4}{7} & -\frac{54}{7} \end{bmatrix}.$$
All the reduced cost coefficients are nonnegative. Hence, the optimal solution is
$$x = \left[\frac{18}{7}, \frac{6}{7}, 0, 0\right]^T,$$
and the optimal cost is $54/7$. $\blacksquare$
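Example 16.4 can be reproduced end to end by a small two-phase driver. The following is a hedged sketch (Python/NumPy; function names are assumptions): it uses Bland's rule for pivot selection, and, mirroring the nondegeneracy assumption in the text, it assumes that no artificial variable remains basic at the end of phase I:

```python
import numpy as np

def _solve(T, basic, tol=1e-9):
    # Simplex iterations on a canonical tableau T, Bland's rule throughout.
    m, n = T.shape[0] - 1, T.shape[1] - 1
    while True:
        eligible = [q for q in range(n) if T[-1, q] < -tol]
        if not eligible:
            return T, basic
        q = min(eligible)
        rows = [i for i in range(m) if T[i, q] > tol]
        if not rows:
            raise ValueError("unbounded")
        p = min(rows, key=lambda i: T[i, -1] / T[i, q])
        T[p] /= T[p, q]
        for i in range(m + 1):
            if i != p:
                T[i] -= T[i, q] * T[p]
        basic[p] = q

def two_phase_simplex(A, b, c, tol=1e-9):
    """Two-phase simplex for min c'x s.t. Ax = b, x >= 0 (with b >= 0).
    Assumes no artificial variable remains basic after phase I."""
    A, b, c = (np.asarray(v, dtype=float) for v in (A, b, c))
    m, n = A.shape
    # Phase I: artificial problem  min 1'y  s.t.  [A, I][x; y] = b.
    T = np.zeros((m + 1, n + m + 1))
    T[:m, :n], T[:m, n:n + m], T[:m, -1] = A, np.eye(m), b
    T[-1, n:n + m] = 1.0
    T[-1] -= T[:m].sum(axis=0)        # canonical form: zero basic costs
    T, basic = _solve(T, list(range(n, n + m)))
    if T[-1, -1] < -tol:
        raise ValueError("infeasible: artificial cost is positive")
    # Phase II: drop the artificial columns, restore the original costs.
    T2 = np.hstack([T[:m, :n], T[:m, -1:]])
    T2 = np.vstack([T2, np.append(c, 0.0)])
    for i, j in enumerate(basic):     # zero out costs of basic columns
        T2[-1] -= T2[-1, j] * T2[i]
    T2, basic = _solve(T2, basic)
    x = np.zeros(n)
    x[basic] = T2[:m, -1]
    return x, -T2[-1, -1]

# Example 16.4: min 2x1 + 3x2, 4x1 + 2x2 - x3 = 12, x1 + 4x2 - x4 = 6.
A = [[4, 2, -1, 0], [1, 4, 0, -1]]
print(two_phase_simplex(A, [12, 6], [2, 3, 0, 0]))
# expected: x = [18/7, 6/7, 0, 0], optimal cost 54/7
```

Note how the phase I setup reproduces the "update the last row" step of the worked example: subtracting the sum of the constraint rows from the artificial cost row zeros out the reduced costs of the artificial basis and places $-\sum_i b_i$ (here $-18$) in the corner.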
16.7 Revised Simplex Method

Consider an LP problem in standard form with a matrix $A$ of size $m \times n$. Suppose that we use the simplex method to solve the problem. Experience suggests that if $m$ is much smaller than $n$, then, in most instances, pivots will occur in only a small fraction of the columns of the matrix $A$. The operation of pivoting involves updating all the columns of the tableau. However, if a particular column of $A$ never enters any basis during the entire simplex procedure, then computations performed on this column are never used. Therefore, if $m$ is much smaller than $n$, the effort expended on performing operations on many of the columns of $A$ may be wasted. The revised simplex method reduces the amount of computation leading to an optimal solution by eliminating operations on columns of $A$ that do not enter the bases.

To be specific, suppose that we are at a particular iteration in the simplex algorithm. Let $B$ be the matrix composed of the columns of $A$ forming the current basis, and let $D$ be the matrix composed of the remaining columns of $A$. The sequence of elementary row operations on the tableau leading to this iteration (represented by matrices $E_1, \ldots, E_k$) corresponds to premultiplying $B$, $D$, and $b$ by $B^{-1} = E_k \cdots E_1$. In particular, the vector of current values of the basic variables is $B^{-1}b$. Observe that computation of the current basic feasible solution does not require computation of $B^{-1}D$; all we need is the matrix $B^{-1}$. In the revised simplex method we do not compute $B^{-1}D$. Instead, we keep track only of the basic variables and the revised tableau, which is the tableau $[B^{-1}, B^{-1}b]$. Note that this tableau is only of size $m \times (m + 1)$ [compared to the tableau in the original simplex method, which is $m \times (n + 1)$].

To see how to update the revised tableau, suppose that we choose the column $a_q$ to enter the basis. Let $y_q = B^{-1}a_q$, $y_0 = [y_{10}, \ldots, y_{m0}]^T = B^{-1}b$, and $p = \arg\min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}$ (as in the original simplex method). Then, to update the revised tableau, we form the augmented revised tableau $[B^{-1}, y_0, y_q]$ and pivot about the $p$th element of the last column. We claim that the first $m + 1$ columns of the resulting matrix constitute the updated revised tableau (i.e., we simply remove the last column of the updated augmented revised tableau to obtain the updated revised tableau). To see this, write $B^{-1}$ as $B^{-1} = E_k \cdots E_1$, and let the matrix $E_{k+1}$ represent the pivoting operation above (i.e., $E_{k+1}y_q = e_p$, the $p$th column of the $m \times m$ identity matrix). The matrix $E_{k+1}$ is the identity matrix with its $p$th column replaced by
$$\left[-\frac{y_{1q}}{y_{pq}}, \ldots, \frac{1}{y_{pq}}, \ldots, -\frac{y_{mq}}{y_{pq}}\right]^T,$$
where the entry $1/y_{pq}$ occupies the $p$th position. Then, the updated augmented revised tableau resulting from the pivoting operation above is $[E_{k+1}B^{-1}, E_{k+1}y_0, e_p]$. Let $B_{\text{new}}$ be the new basis. Then, we have $B_{\text{new}}^{-1} = E_{k+1} \cdots E_1$. But notice that $B_{\text{new}}^{-1} = E_{k+1}B^{-1}$, and the values of the basic variables corresponding to $B_{\text{new}}$ are given by $y_{0\,\text{new}} = E_{k+1}y_0$. Hence, the updated tableau is indeed
$$[B_{\text{new}}^{-1}, y_{0\,\text{new}}] = [E_{k+1}B^{-1}, E_{k+1}y_0].$$
We summarize the foregoing discussion in the following algorithm.

Revised Simplex Method

1. Form a revised tableau $[B^{-1}, y_0]$ corresponding to an initial basic feasible solution.
2. Calculate the current reduced cost coefficient vector via
$$r_D^T = c_D^T - \lambda^T D, \quad \text{where} \quad \lambda^T = c_B^T B^{-1}.$$
3. If $r_j \ge 0$ for all $j$, stop: the current basic feasible solution is optimal.
4. Select a $q$ such that $r_q < 0$ (e.g., the $q$ corresponding to the most negative $r_q$), and compute $y_q = B^{-1}a_q$.
5. If no $y_{iq} > 0$, stop: the problem is unbounded; else, compute $p = \arg\min_i\{y_{i0}/y_{iq} : y_{iq} > 0\}$.
6. Form the augmented revised tableau $[B^{-1}, y_0, y_q]$, and pivot about the $p$th element of the last column. Form the updated revised tableau by taking the first $m + 1$ columns of the resulting augmented revised tableau (i.e., remove the last column).
7. Go to step 2.

The reason for computing $r_D^T$ in two steps as indicated in step 2 is as follows. We first note that $r_D^T = c_D^T - c_B^T B^{-1}D$. To compute $c_B^T B^{-1}D$, we can do the multiplication in the order either $(c_B^T B^{-1})D$ or $c_B^T(B^{-1}D)$. The former involves two vector-matrix multiplications, whereas the latter involves a matrix-matrix multiplication followed by a vector-matrix multiplication. Clearly, the former is more efficient.

As in the original simplex method, we can use the two-phase method to solve a given LP problem using the revised simplex method. In particular, we use the revised tableau from the final step of phase I as the initial revised tableau in phase II. We illustrate the method in the following example.
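The pricing computation of step 2, $\lambda^T = c_B^T B^{-1}$ followed by $r_D^T = c_D^T - \lambda^T D$, is the heart of the method. The following is a minimal sketch of one such iteration (Python/NumPy; the function name is an assumption). For brevity this illustrative helper recomputes $B^{-1}$ with an explicit inverse instead of updating the revised tableau by pivoting, so it shows the pricing logic rather than the tableau bookkeeping:

```python
import numpy as np

def revised_simplex_step(A, b, c, basic, tol=1e-9):
    """One pricing step of the revised simplex method: price the
    nonbasic columns via lambda' = c_B' B^{-1}, then return the
    entering/leaving pair (q, p), or None if the basis is optimal."""
    A, b, c = (np.asarray(v, dtype=float) for v in (A, b, c))
    Binv = np.linalg.inv(A[:, basic])
    y0 = Binv @ b                          # current basic variable values
    lam = c[basic] @ Binv                  # two vector-matrix products...
    nonbasic = [j for j in range(A.shape[1]) if j not in basic]
    r = {j: c[j] - lam @ A[:, j] for j in nonbasic}  # ...then pricing
    q = min((j for j in nonbasic if r[j] < -tol), default=None)
    if q is None:
        return None                        # all reduced costs nonnegative
    yq = Binv @ A[:, q]                    # only column q is ever formed
    rows = [i for i in range(len(basic)) if yq[i] > tol]
    if not rows:
        raise ValueError("unbounded")
    p = min(rows, key=lambda i: y0[i] / yq[i])
    return q, p                            # a_q enters; basic[p] leaves
```

Notice that no column of $A$ other than $a_q$ is multiplied by $B^{-1}$, which is precisely the economy the revised method is designed to exploit when $m \ll n$.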
Example 16.5 Consider solving the following LP problem using the revised simplex method:

maximize $3x_1 + 5x_2$
subject to $x_1 + x_2 \le 4$
    $5x_1 + 3x_2 \ge 8$
    $x_1, x_2 \ge 0$.

First, we express the problem in standard form by introducing one slack and one surplus variable, to obtain

minimize $-3x_1 - 5x_2$
subject to $x_1 + x_2 + x_3 = 4$
    $5x_1 + 3x_2 - x_4 = 8$
    $x_1, \ldots, x_4 \ge 0$.

There is no obvious basic feasible solution to this LP problem. Therefore, we use the two-phase method.

Phase I. We introduce one artificial variable $x_5$ and an artificial objective function $x_5$. The tableau for the artificial problem (columns $a_1, \ldots, a_5, b$; last row $c^T$) is
$$\begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 4 \\ 5 & 3 & 0 & -1 & 1 & 8 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}.$$
We start with an initial basic feasible solution and corresponding $B^{-1}$, as shown in the following revised tableau:

Variable | $B^{-1}$ | $y_0$
$x_3$ | 1  0 | 4
$x_5$ | 0  1 | 8

We compute
$$\lambda^T = c_B^T B^{-1} = [0, 1], \qquad r_D^T = c_D^T - \lambda^T D = [0, 0, 0] - [5, 3, -1] = [-5, -3, 1] = [r_1, r_2, r_4].$$
Because $r_1$ is the most negative reduced cost coefficient, we bring $a_1$ into the basis. To do this, we first compute $y_1 = B^{-1}a_1$. In this case, $y_1 = a_1$. We get the augmented revised tableau:

Variable | $B^{-1}$ | $y_0$ | $y_1$
$x_3$ | 1  0 | 4 | 1
$x_5$ | 0  1 | 8 | 5

We then compute $p = \arg\min_i\{y_{i0}/y_{iq} : y_{iq} > 0\} = 2$ and pivot about the second element of the last column to get the updated revised tableau:

Variable | $B^{-1}$ | $y_0$
$x_3$ | 1  $-\frac{1}{5}$ | $\frac{12}{5}$
$x_1$ | 0  $\frac{1}{5}$ | $\frac{8}{5}$

We next compute
$$\lambda^T = c_B^T B^{-1} = [0, 0], \qquad r_D^T = c_D^T - \lambda^T D = [0, 0, 1] = [r_2, r_4, r_5] \ge 0^T.$$
The reduced cost coefficients are all nonnegative. Hence, the solution to the artificial problem is $[8/5, 0, 12/5, 0, 0]^T$. The initial basic feasible solution for phase II is therefore $[8/5, 0, 12/5, 0]^T$.

Phase II. The tableau for the original problem (in standard form), with columns $a_1, \ldots, a_4, b$ and last row $c^T$, is
$$\begin{bmatrix} 1 & 1 & 1 & 0 & 4 \\ 5 & 3 & 0 & -1 & 8 \\ -3 & -5 & 0 & 0 & 0 \end{bmatrix}.$$
As the initial revised tableau for phase II, we take the final revised tableau from phase I. We then compute
$$\lambda^T = c_B^T B^{-1} = [0, -3]\begin{bmatrix} 1 & -\frac{1}{5} \\ 0 & \frac{1}{5} \end{bmatrix} = \left[0, -\frac{3}{5}\right],$$
$$r_D^T = c_D^T - \lambda^T D = [-5, 0] - \left[-\frac{9}{5}, \frac{3}{5}\right] = \left[-\frac{16}{5}, -\frac{3}{5}\right] = [r_2, r_4].$$
We bring $a_2$ into the basis and compute
$$y_2 = B^{-1}a_2 = \begin{bmatrix} \frac{2}{5} \\ \frac{3}{5} \end{bmatrix},$$
to get

Variable | $B^{-1}$ | $y_0$ | $y_2$
$x_3$ | 1  $-\frac{1}{5}$ | $\frac{12}{5}$ | $\frac{2}{5}$
$x_1$ | 0  $\frac{1}{5}$ | $\frac{8}{5}$ | $\frac{3}{5}$

In this case we get $p = 2$. We update this tableau by pivoting about the second element of the last column to get

Variable | $B^{-1}$ | $y_0$
$x_3$ | 1  $-\frac{1}{3}$ | $\frac{4}{3}$
$x_2$ | 0  $\frac{1}{3}$ | $\frac{8}{3}$

We compute
$$\lambda^T = c_B^T B^{-1} = [0, -5]\begin{bmatrix} 1 & -\frac{1}{3} \\ 0 & \frac{1}{3} \end{bmatrix} = \left[0, -\frac{5}{3}\right],$$
$$r_D^T = c_D^T - \lambda^T D = [-3, 0] - \left[-\frac{25}{3}, \frac{5}{3}\right] = \left[\frac{16}{3}, -\frac{5}{3}\right] = [r_1, r_4].$$
We now bring $a_4$ into the basis:

Variable | $B^{-1}$ | $y_0$ | $y_4$
$x_3$ | 1  $-\frac{1}{3}$ | $\frac{4}{3}$ | $\frac{1}{3}$
$x_2$ | 0  $\frac{1}{3}$ | $\frac{8}{3}$ | $-\frac{1}{3}$

We update the tableau to obtain

Variable | $B^{-1}$ | $y_0$
$x_4$ | 3  $-1$ | 4
$x_2$ | 1  $0$ | 4

We compute
$$\lambda^T = c_B^T B^{-1} = [0, -5]\begin{bmatrix} 3 & -1 \\ 1 & 0 \end{bmatrix} = [-5, 0],$$
$$r_D^T = c_D^T - \lambda^T D = [-3, 0] - [-5, -5] = [2, 5] = [r_1, r_3].$$
The reduced cost coefficients are all positive. Hence, $[0, 4, 0, 4]^T$ is optimal. The optimal solution to the original problem is $[0, 4]^T$. $\blacksquare$

EXERCISES

16.1 This question is concerned with elementary row operations and rank.

a. For the matrix
$$\begin{bmatrix} 1 & 2 & 2 & -1 & 3 \\ 1 & 1 & 2 & -1 & 3 \\ 2 & 3 & 3 & 0 & 3 \\ 1 & 2 & 1 & 3 & 1 \end{bmatrix},$$
find its rank by first transforming the matrix using elementary row operations into an upper triangular form.

b. Find the rank of the following matrix for different values of the parameter $\gamma$ by first transforming the matrix using elementary row operations into an upper triangular form:
$$A = \begin{bmatrix} 1 & 2 & 1 & \gamma \\ -1 & 10 & -1 & \gamma \\ -6 & 2 & 5 & 1 \end{bmatrix}.$$

16.2 Consider the following standard form LP problem:

minimize $2x_1 - x_2 - x_3$
subject to $3x_1 + x_2 + x_4 = 4$
    $6x_1 + 2x_2 + x_3 + x_4 = 5$
    $x_1, x_2, x_3, x_4 \ge 0$.

a. Write down the $A$, $b$, and $c$ matrices/vectors for the problem.

b. Consider the basis consisting of the third and fourth columns of $A$, ordered according to $[a_4, a_3]$. Compute the canonical tableau corresponding to this basis.
Find all entries in the rightmost column of the tableau. 16.6 Consider the optimization problem minimize subject to ci|xi| + c2\x2\ H Ax = b. l· cn\xn\ We can convert this problem into an equivalent standard form linear programming problem by introducing the new variables Xi = x\ — x~l where xf > 0, x~ > 0, and \xi\ = xf +x~, i = 1,2,..., n i = 1,2, . . . , n (See also Exercise 15.3.) Then we can apply the simplex method to solve the equivalent problem. Explain, in two or three sentences, why we will always 372 SIMPLEX METHOD have that only either xf or x{ can be positive but never both xf and xi can be positive. In other words, we will always have, x~\x~ = 0. 16.7 Suppose that we are given a linear programming problem in standard form (written in the usual notation) and are told that the vector x = [1,0,2,3,0] T is a basic feasible solution with corresponding relative cost coefficient vector r — [0,1,0,0, — 1] T and objective function value 6. We are also told that the vector [—2,0,0,0,4] T lies in the nullspace of A. a. Write down the canonical tableau corresponding to the given basic feasible solution above, filling in as many values of entries as possible (use the symbol * for entries that cannot be determined from the information given). Clearly indicate the dimensions of the tableau. b . Find a feasible solution with an objective function value that is strictly less than 6. 16.8 Consider a standard form linear programming problem (with the usual A, 6, and c). Suppose that it has the following canonical tableau: 0 1 0 1 1 2 0 0 0 3 1 0 0 4 0 0 - 1 2 3 4 5 6 7 8 a. Find the basic feasible solution corresponding to this canonical tableau and the corresponding value of the objective function. b . Find all the reduced cost coefficient values associated with the tableau. c. Does the given linear programming problem have feasible solutions with arbitrarily negative objective function values? d. Suppose that column a^ enters the basis. Find the canonical tableau for the new basis. e. Find a feasible solution with objective function value equal to —100. f. Find a basis for the nullspace of A. 16.9 Consider the problem maximize subject to — x\ — 2^2 x\ > 0 X2 > I- EXERCISES 373 a. Convert the problem into a standard form linear programming problem. b . Use the two-phase simplex method to compute the solution to this problem and the value of the objective function at the optimal solution of the problem. 16.10 Consider the linear programming problem minimize subject to — x\ x\ — X2 = 1 ^1,^2 > 0. a. Write down the basic feasible solution for x\ as a basic variable. b . Compute the canonical augmented matrix corresponding to the basis in part a. c. If we apply the simplex algorithm to this problem, under what circumstance does it terminate? (In other words, which stopping criterion in the simplex algorithm is satisfied?) d. Show that in this problem, the objective function can take arbitrarily negative values over the constraint set. 16.11 Find the solution and the value of the optimal cost for the following problem using the revised simplex method: minimize subject to x\ + x2 x\ + 2^2 > 3 2xi + x2 > 3 X\,X2 > 0. Hint: Start with x\ and x2 as basic variables. 16.12 Solve the following linear programs using the revised simplex method: a. Maximize — 4#i — Zx2 subject to 5^i + X2 > 11 - 2 x i - x2 < - 8 xi + 2x 2 > 7 χ\,χ2 > o. 374 SIMPLEX METHOD b. Maximize 6x1 + 4x2 + 7x3 -f 5x4 subject to xi + 2x2 + x3 + 2x4 < 20 6xi H- 5x2 + 3x3 + 2x4 < 100 3xi + 4x2 + 9x3 + 12x4 < 75 Xl,X 2 ,X3,#4 > 0. 
16.13 Consider a standard form linear programming problem with ~7l 0 2 0 1 1 1 0 0 0 3 1 0 Suppose that we are told that the reduced cost coefficient vector corresponding to some basis is r T = [0,1,0,0]. a. Find an optimal feasible solution to the problem. b . Find c 2 . 16.14 Consider the linear programming problem minimize subject to c\X\ -f c 2 x 2 2xi 4- x 2 = 2 #1,#2 :> 0, where c\, c2 G R. Suppose that the problem has an optimal feasible solution that is not basic. a. Find all basic feasible solutions. b . Find all possible values of c\ and c 2 . c. At each basic feasible solution, compute the reduced cost coefficients for all nonbasic variables. 16.15 Suppose that we apply the simplex method to a given linear programming problem and obtain the following canonical tableau: 0 / 3 0 1 4 1 7 0 0 5 0 -3 1 0 6 0 2-a 0 0 δ EXERCISES 375 For each of the following conditions, find the set of all parameter values a, /?, 7, δ that satisfy the condition. a. The problem has no solution because the objective function values are unbounded. b . The current basic feasible solution is optimal, and the corresponding objective function value is 7. c. The current basic feasible solution is not optimal, and the objective function value strictly decreases if we remove the first column of A from the basis. 16.16 You are given a linear programming problem in standard form. Suppose that you use the two-phase simplex method and arrive at the following canonical tableau in phase I: "? 0 1 1 ? ? 0 6" ? 0 0 ? ? ? 1 a ? 1 0 ? ? ? 0 5 7 0 0 δ ? ? β 0 The variables a, /?, 7, and δ are unknowns to be determined. Those entries marked with "?" are unspecified. The only thing you are told is that the value of 7 is either 2 or —1. a. Determine the values of a, /?, 7, and δ. b . Does the given linear programming problem have a feasible solution? If yes, find it. If not, explain why. 16.17 Suppose we are given a matrix A e Wrnxn and a vector b e R m such that b > 0. We are interested in an algorithm that, given this A and 6, is guaranteed to produce one of following two outputs: (1) If there exists x such that Ax > 6, then the algorithm produces one such x. (2) If no such x exists, then the algorithm produces an output to declare so. Describe in detail how to design this algorithm based on the simplex method. 376 SIMPLEX METHOD 16.18 Consider the following linear programming problem (attributed to Beale—see [42, p. 43]): minimize subject to 3 1 — -X4 + 2OX5 — -XQ + 6x7 4 2 x\ + -£4 — 8x5 — #6 + 9x7 = 0 %2 + ~#4 - 12^5 - ~#6 + 3X7 = 0 X3 + X6 = 1 x i , . . . ,£7 > 0. a. Apply the simplex algorithm to the problem using the rule that q is the index corresponding to the most negative rq. (As usual, if more than one index i minimizes yio/yiq, let p be the smallest such index.) Start with Xi, X2, and X3 as initial basic variables. Notice that cycling occurs. b . Repeat part a using Bland's rule for choosing q and p: q — min{i : Vi < 0}, p = min{j : yj0/yjq = min{yi0/yiq : yiq > 0}}. Note that Bland's rule for choosing p corresponds to our usual rule that if more than one index i minimizes yio/yiq, we let p be the smallest such index. 16.19 Consider a standard form LP problem. Suppose that we start with an initial basic feasible solution x^ and apply one iteration of the simplex algorithm to obtain x^\ We can express x^ in terms of x^ as where ao minimizes φ(α) = f(x(°)+aS0') is feasible. a. Show that d (0) over all a > 0 such that x^+ad^ βλί(Α). b . 
As usual, assume that the initial basis is the first m columns of A, and the first iteration involves inserting aq into the basis, where q > m. Let the qth column of the canonical augmented matrix be yq = [yiq,..., y m g ] T . Express £Τ0^ in terms of yq. 16.20 Write a simple MATLAB function that implements the simplex algorithm. The inputs are c, A, 6, and v, where v is the vector of indices of basic EXERCISES 377 columns. Assume that the augmented matrix [A, b] is already in canonical form; that is, the i^th column of A is [ 0 , . . . , 1 , . . . , 0] T , where 1 occurs in the ith position. The function should output the final solution and the vector of indices of basic columns. Test the MATLAB function on the problem in Example 16.2. 16.21 Write a MATLAB routine that implements the two-phase simplex method. It may be useful to use the MATLAB function of Exercise 16.20. Test the routine on the problem in Example 16.5. 16.22 Write a simple MATLAB function that implements the revised simplex algorithm. The inputs are c, A, 6, v, and B " 1 , where v is the vector of indices of basic columns; that is, the zth column of B is the t^th column of A. The function should output the final solution, the vector of indices of basic columns, and the final Β~λ. Test the MATLAB function on the problem in Example 16.2. 16.23 Write a MATLAB routine that implements the two-phase revised simplex method. It may be useful to use the MATLAB function of Exercise 16.22. Test the routine on the problem in Example 16.5. CHAPTER 17 DUALITY 17.1 Dual Linear Programs Associated with every linear programming problem is a corresponding dual linear programming problem. The dual problem is constructed from the cost and constraints of the original, or primal, problem. Being an LP problem, the dual can be solved using the simplex method. However, as we shall see, the solution to the dual can also be obtained from the solution of the primal problem, and vice versa. Solving an LP problem via its dual may be simpler in certain cases, and also often provides further insight into the nature of the problem. In this chapter we study basic properties of duality and provide an interpretive example of duality. Duality can be used to improve the performance of the simplex algorithm (leading to the primal-dual algorithm), as well as to develop nonsimplex algorithms for solving LP problems (such as Khachiyan's algorithm and Karmarkar's algorithm). We do not discuss this aspect of duality further in this chapter. For an in-depth discussion of the primal-dual method, as well as other aspects of duality, see, for example, [88]. For a description of Khachiyan's algorithm and Karmarkar's algorithm, see Chapter 18. An Introduction to Optimization, Fourth Edition. By E. K. P. Chong and S. H. Zak. Copyright © 2013 John Wiley & Sons, Inc. 379 380 DUALITY Suppose that we are given a linear programming problem of the form minimize subject to cTx Ax > b x>0. We refer to the above as the primal problem. We define the corresponding dual problem as maximize λ b subject to λτ A < cT \>0. We refer to the variable λ G Mm as the dual vector. Note that the cost vector c in the primal has moved to the constraints in the dual. The vector b on the right-hand side of Ax > b becomes part of the cost in the dual. Thus, the roles of b and c are reversed. The form of duality defined above is called the symmetric form of duality. To define the dual of an arbitrary linear programming problem, we use the following procedure. 
First, we convert the given linear programming problem into an equivalent problem of the primal form shown above. Then, using the symmetric form of duality, we construct the dual to the equivalent problem. We call the resulting problem the dual of the original problem.

Note that based on the definition of duality above, the dual of the dual problem is the primal problem. To see this, we first represent the dual problem in the form

minimize   λ^T(-b)
subject to λ^T(-A) ≥ -c^T
           λ ≥ 0.

Therefore, by the symmetric form of duality, the dual to the above is

maximize   (-c^T)x
subject to (-A)x ≤ -b
           x ≥ 0.

Upon rewriting, we get the original primal problem.

Now consider an LP problem in standard form. This form has equality constraints Ax = b. To formulate the corresponding dual problem, we first convert the equality constraints into equivalent inequality constraints. Specifically, observe that Ax = b is equivalent to

Ax ≥ b,
-Ax ≥ -b.

Thus, the original problem with the equality constraints can be written in the form

minimize   c^T x
subject to [A; -A] x ≥ [b; -b]
           x ≥ 0.

The LP problem above is in the form of the primal problem in the symmetric form of duality. The corresponding dual is therefore

maximize   [u^T, v^T] [b; -b]
subject to [u^T, v^T] [A; -A] ≤ c^T
           u, v ≥ 0.

After simple manipulation, the dual above can be represented as

maximize   (u - v)^T b
subject to (u - v)^T A ≤ c^T
           u, v ≥ 0.

Let λ = u - v. Then, the dual problem becomes

maximize   λ^T b
subject to λ^T A ≤ c^T.

Note that since λ = u - v and u, v ≥ 0, the dual vector λ is not restricted to be nonnegative. We have now derived the dual for a primal in standard form. The form of duality above is referred to as the asymmetric form of duality. We summarize the forms of duality in Tables 17.1 and 17.2.

Note that in the asymmetric form of duality, the dual of the dual is also the primal. We can show this by reversing the arguments we used to arrive at the asymmetric form of duality and using the symmetric form of duality. Recall that at the beginning of this chapter we defined the dual of an arbitrary linear programming problem by first transforming the problem into an equivalent problem of the form of the primal in the symmetric form of duality. We then derived the asymmetric form of duality based on the symmetric form. In both forms of duality the dual of the dual is the primal. Therefore, we now have four forms of primal-dual linear programming pairs: Each of the four linear programming problems in Tables 17.1 and 17.2 is a primal in these four pairs. So, given an arbitrary linear programming problem, we can obtain its dual by converting the problem into any of the four problems in Tables 17.1 and 17.2.

Example 17.1 Suppose that we are given the linear programming problem

minimize   c^T x
subject to Ax ≤ b.

This problem is already close to the form of the dual in Table 17.2. In particular, let us rewrite the above as

maximize   x^T(-c)
subject to x^T A^T ≤ b^T.

Its associated dual is then given by the primal in Table 17.2, which has the form

minimize   b^T λ
subject to A^T λ = -c
           λ ≥ 0,

which can be written in the equivalent form

maximize   -λ^T b
subject to λ^T A = -c^T
           λ ≥ 0.

If we change the sign of the dual variable, we can rewrite the above in a more "natural" form:

maximize   λ^T b
subject to λ^T A = c^T
           λ ≤ 0.  ■

Example 17.2 This example is adapted from [88]. Recall the diet problem (see Example 15.2). We have n different types of food. Our goal is to create the most economical diet and at the same time meet or exceed nutritional requirements.
Specifically, let α^ be the amount of the ith nutrient per unit of the jth food, bi the amount of the ith nutrient required, 1 < i < m, Cj the DUAL LINEAR PROGRAMS Table 17.1 Symmetric Form of Duality Primal minimize cTx subject to Ax>b x>0 Table 17.2 383 Dual maximize λ τ 6 subject to λ τ Α < c T λ>0 Asymmetric Form of Duality Primal minimize cTx subject to Ax = b x>0 Dual maximize λ τ & subject to λ τ Λ < c T cost per unit of the j t h food, and Xi the number of units of food i in the diet. Then, the diet problem can be stated as follows: minimize C\X\ + c^xi + · · · + cnxn subject to anXi + «12^2 + l· ^21^1 + 022^2 H a mlxl + am2x2 Oj\nxn > b\ 1" 02n^n > &2 + * * ' + Q"mn%n ^ ^m X\, ..., XJI ^_ U. Now, consider a health food store that sells nutrient pills (all m types of nutrients are available). Let λ^ be the price of a unit of the zth nutrient in the form of nutrient pills. Suppose that we purchase nutrient pills from the health food store at this price such that we exactly meet our nutritional requirements. Then, XTb is the total revenue to the store. Note that since prices are nonnegative, we have λ > 0. Consider now the task of substituting nutrient pills for natural food. The cost of buying pills to create the nutritional equivalent of the ith food synthetically is simply XIÜU H h A m a m i. Because C{ is the cost per unit of the ith food, if ^lali "l· ' ' * H" Xmami ^ Q? then the cost of the unit of the zth food made synthetically from nutrient pills is less than or equal to the market price of a unit of the real food. Therefore, 384 DUALITY for the health food store to be competitive, the following must hold: λιαιι 4- h A m a m i < c\ ^ι^ΐη + · · · + A m a m n < cn. The problem facing the health food store is to choose the prices λ ι , . . . , Xm such that its revenue is maximized. This problem can be stated as maximize λ b subject to λ τ Α < cT λ>0. Note that this is simply the dual of the diet problem. I E x a m p l e 17.3 Consider the following linear programming problem: maximize 2xi + 5^2 + X3 subject to 2a: 1 — X2 + 7x3 < 6 x\ + 3x2 + 4^3 < 9 3xi + 6x2 + %3 < 3 ^1,^2,^3 > 0. Find the corresponding dual problem and solve it. We first write the primal problem in standard form by introducing slack variables X4,X5,X6- This primal problem in standard form is [c T , 0 T ] x [A, I]x = b x>0, minimize subject to where x — [ x i , . . . , x 6 ] T and "2 - 1 1 3 L3 6 7" 4 , 1 "-2I "β" 6 = 9 , 3 c = The corresponding dual problem (asymmetric form) is maximize λ b subject to λ τ [ Λ , I] < [c T , 0 T ] . -5 -1 DUAL LINEAR PROGRAMS 385 Note that the constraints in the dual can be written as λ τ Α < cT λ<0. To solve the dual problem above, we use the simplex method. For this, we need to express the problem in standard form. We substitute λ by — λ and introduce surplus variables to get minimize subject to 6λι + 9λ 2 + 3λ 3 2λι + λ 2 4- 3λ 3 - λ 4 - λ ι + 3λ 2 + 6Α3 - λ5 7λι + 4λ 2 + λ 3 =2 =5 - λ6 = 1 λι,...,λ6>0. There is no obvious basic feasible solution. Thus, we use the two-phase simplex method to solve the problem. Phase I. We introduce artificial variables \γ, Ag, A9 and the artificial objective function A 7 + As + X9. The tableau for the artificial problem is Ai Cost 2 -1 7 0 λ2 1 3 4 0 λ3 3 6 1 0 λ4 -1 0 0 0 λ5 0 -1 0 0 λ6 0 0 -1 0 λ7 1 0 0 1 λ8 0 1 0 1 λ9 0 0 1 1 C 2 5 1 0 We start with an initial feasible solution and corresponding B~ Variable λ7 λ8 λ9 B~l 1 0 0 0 1 0 0 0 1 y0 2 5 1 compute rl = [0,0,0,0,0,0]- [8,8, 10, - 1 . 
,-ι 5 1] = [ - 8 , - 8 , - 1 0 , 1 , 1 , 1 ] = [ri,r2,r3,r4,r5 ,re}. Because r 3 is the most negative reduced cost coefficient, we bring the third column into the basis. In this case, y3 = [3,6,1] T . We have Variable A7 A8 A9 B λ y0 y3 1 0 0 2 3 0 1 0 5 6 0 0 1 1 1 386 DUALITY By inspection, p = 1, so we pivot about the first element of the last column. The updated tableau is Variable λ3 λ8 λ9 I/o 1 3 2 3 0 -2 _ι 3 0 1 i I We compute D _ _7 10 14 3 ' 3' ' ' 3 4 3' [n,r2,r4,r5,r6,r7], We bring the second column into the basis to get Variable 1 B 1 3 0 0 -2 1 0 1 3 0 1 λ3 λ8 λ9 I/o 2 3 2/2 1 3 1 3 11 3 1 1 We update the tableau to get Variable 4 11 21 11 1 11 λ3 λ8 λ2 1 11 3 11 3 11 0 1 0 2/o 7 11 10 11 1 11 We compute D 74 _ 2 1 ll'"ll' ' __3_ 32 14 = 11' 11' 11 [ri,U,r5,r6,r7,r9]. We bring the fourth column into the basis: Variable λ3 λ8 λ2 4 11 21 11 1 11 0 1 0 1 11 3 11 3 11 2/o 7 11 10 11 1 11 2/4 4 11 21 11 1 11 The updated tableau becomes Variable B λ y0 PROPERTIES OF DUAL PROBLEMS We compute 387 r j = [0,0,0,1,1,1] = [ r i , r 5 , r 6 , r 7 , r 8 , r 9 ] . Because all the reduced cost coefficients are nonnegative, we terminate phase I. Phase II. We use the last tableau in phase I (where none of the artificial variables are basic) as the initial tableau in phase II. Note that we now revert back to the original cost of the dual problem in standard form. We compute 62 1 15 [ri,rb,r6] "T'7'T r D We bring the first column into the basis to obtain the augmented revised tableau Variable B~l y0 yl λ3 4 oi 21 11 21 1 1_ 21 oi U λ4 λ2 We update the tableau to get p| ^ Variable λ3 λ4 λι We compute D 3 oi 21 3 21 6 _2_ 21 oi B 0 w 1 25 oi 21 74 21 43 45 21 oi Vo 43 19 43 ^ 17 oi 21 10 21 1 J^ 21 oi 43 15 43 6_ J43 ΛΟ ΛΟ Art - J43/IQ 36 AQ 43 24 43 J_ -L 43 ΛΟ 186 15 39 = fc2,r5,r6]. 43 ' 4 3 ' 4 3 Because all the reduced cost coefficients are nonnegative, the current basic feasible solution is optimal for the dual in standard form. Thus, an optimal solution to the original dual problem is 1 0, 43 17.2 36 Ί τ 43 Properties of Dual Problems In this section we present some basic results on dual linear programs, We begin with the weak duality lemma. Lemma 17.1 Weak Duality Lemma. Suppose that x and X are feasible solutions to primal and dual LP problems, respectively (either in the symmetric or asymmetric form). Then, cTx > X b. □ 388 DUALITY Proof. We prove this lemma only for the asymmetric form of duality. The proof for the symmetric form involves only a slight modification (see Exercise 17.1). Because x and λ are feasible, we have Ax = 6, x > 0, and λ A < c T . Postmultiplying both sides of the inequality λ A < c T by x > 0 yields λ τ A x < cTx. But Ax = 6, hence λ τ 6 < cTx. I The weak duality lemma states that a feasible solution to either problem yields a bound on the optimal cost of the other problem. The cost in the dual is never above the cost in the primal. In particular, the optimal cost of the dual is less than or equal to the optimal cost of the primal, that is, "maximum < minimum." Hence, if the cost of one of the problems is unbounded, then the other problem has no feasible solution. In other words, if "minimum = - c o " or "maximum = +oo," then the feasible set in the other problem must be empty. Example 17.4 Consider the problem minimize x subject to x < 1, which is clearly unbounded. By Example 17.1, the dual is maximize λ subject to λ = 1 λ<0, which is clearly infeasible. 
■

It follows from the weak duality lemma that if we are given feasible primal and dual solutions with equal cost, then these solutions must be optimal in their respective problems.

Theorem 17.1 Suppose that x_0 and λ_0 are feasible solutions to the primal and dual, respectively (either in symmetric or asymmetric form). If c^T x_0 = λ_0^T b, then x_0 and λ_0 are optimal solutions to their respective problems. □

Proof. Let x be an arbitrary feasible solution to the primal problem. Because λ_0 is a feasible solution to the dual, by the weak duality lemma, c^T x ≥ λ_0^T b. So, if c^T x_0 = λ_0^T b, then c^T x_0 = λ_0^T b ≤ c^T x. Hence, x_0 is optimal for the primal. On the other hand, let λ be an arbitrary feasible solution to the dual problem. Because x_0 is a feasible solution to the primal, by the weak duality lemma, c^T x_0 ≥ λ^T b. Therefore, if c^T x_0 = λ_0^T b, then λ^T b ≤ c^T x_0 = λ_0^T b. Hence, λ_0 is optimal for the dual. ■

We can interpret Theorem 17.1 as follows. The primal seeks to minimize its cost, and the dual seeks to maximize its cost. Because the weak duality lemma states that "maximum ≤ minimum," each problem "seeks to reach the other." When their costs are equal for a pair of feasible solutions, both solutions are optimal, and we have "maximum = minimum."

It turns out that the converse of Theorem 17.1 is also true; that is, "maximum = minimum" always holds. In fact, we can prove an even stronger result, known as the duality theorem.

Theorem 17.2 Duality Theorem. If the primal problem (either in symmetric or asymmetric form) has an optimal solution, then so does the dual, and the optimal values of their respective objective functions are equal. □

Proof. We first prove the result for the asymmetric form of duality. Assume that the primal has an optimal solution. Then, by the fundamental theorem of LP, there exists an optimal basic feasible solution. As is our usual notation, let B be the matrix of the corresponding m basic columns, D the matrix of the n - m nonbasic columns, c_B the vector of elements of c corresponding to basic variables, c_D the vector of elements of c corresponding to nonbasic variables, and r_D the vector of reduced cost coefficients. Then, by Theorem 16.2,

r_D^T = c_D^T - c_B^T B^{-1} D ≥ 0^T.

Hence,

c_B^T B^{-1} D ≤ c_D^T.

Define

λ^T = c_B^T B^{-1}.

Then,

c_B^T B^{-1} D = λ^T D ≤ c_D^T.

We claim that λ is a feasible solution to the dual. To see this, assume for convenience (and without loss of generality) that the basic columns are the first m columns of A. Then,

λ^T A = λ^T [B, D] = [c_B^T, λ^T D] ≤ [c_B^T, c_D^T] = c^T.

Hence, λ^T A ≤ c^T, and thus λ^T = c_B^T B^{-1} is feasible. We claim that λ is also an optimal feasible solution to the dual. To see this, note that

λ^T b = c_B^T B^{-1} b = c_B^T x_B.

Thus, by Theorem 17.1, λ is optimal.

We now prove the symmetric case. First, we convert the primal problem for the symmetric form into the equivalent standard form by adding surplus variables y:

minimize   [c^T, 0^T] [x; y]
subject to [A, -I] [x; y] = b
           [x; y] ≥ 0.

Note that x is optimal for the original primal problem if and only if [x^T, (Ax - b)^T]^T is optimal for the primal in standard form. The dual to the primal in standard form is equivalent to the dual to the original primal in symmetric form. Therefore, the result above for the asymmetric case applies also to the symmetric case. This completes the proof. ■

Example 17.5 Recall Example 17.2, where we formulated the dual of the diet problem.
Prom the duality theorem, the maximum revenue for the health food store is the same as the minimum cost of a diet that satisfies all of the nutritional requirements; that is, cTx = XTb. I Consider a primal-dual pair in asymmetric form. Suppose that we solve the primal problem using the simplex method. The proof of the duality theorem suggests a way of obtaining an optimal solution to the dual by using the last row of the final simplex tableau for the primal. First, we write the tableau for the primal problem: A b cT 0 B D -D Suppose that the matrix B is the basis for an optimal basic feasible solution. Then, the final simplex tableau is I 0' BlD D B-'b -cTBB-lb l where r ^D — c = b ~ c 3B-1 D. In the proof of the duality theorem we have shown that λ = CQB is an optimal solution to the dual. The vector λ can be obtained from the final tableau above. Specifically, if rankZ) = m, then we can solve for λ using the vector r ^ , via the equation XTD = cl-rl. PROPERTIES OF DUAL PROBLEMS 391 Of course, it may turn out that rank D < m. In this case as we now show, we have additional linear equations that allow us to solve for λ. To this end, recall that λ B = c j . Therefore, if we define rT = [0 T , r J ] , then combining the equations XT D = cj rj) and \T B yields The vector λ may be easy to obtain from the equation λ D = c]D — r]D if D takes certain special forms. In particular, this is the case if D has an mxm identity matrix embedded in it; that is, by rearranging the positions of the columns of D, if necessary, D has the form D = [J m , G], where G is an rax (n — 2ra) matrix. In this case we can write the equation λ D = Cp—rJ) as [XT,XTG] = {cJ,cl}-[rJ,rl]. Hence, λ is given by In other words, the solution to the dual is obtained by subtracting the reduced costs coefficients corresponding to the identity matrix in D from the corresponding elements in the vector c (i.e., c/). For example, if we have a problem where we introduced slack variables, and the basic variables for the optimal basic feasible solution do not include any of the slack variables, then the matrix D has an identity matrix embedded in it. In addition, in this case we have c/ = 0. Therefore, λ = — τι is an optimal solution to the dual. E x a m p l e 17.6 In Example 17.3, the tableau for the primal in standard form IS Oi ei2 15 43 74 43 19 43 24 7? 43 0 0 1 0 03 05 04 Ct6 b 2 - 1 7 1 0 0 6 1 1 3 4 0 0 9 0 3 6 1 0 1 3 0 cT - 2 - 5 - 1 0 0 0 If we now solve the problem using the simplex method, we get the following final simplex tableau: r 1 0 0 0 6 43 21 43 1 43 1 43 0 1 0 0 1 43 25 43 7 43 36 43 39 43 186 43 15 43 114 43 We can now find the solution of the dual from the above simplex tableau using the equation XT D = c j — rT . [λι,λ2,λ3] "2 i οΊ 24 1 0 0 = [-2,0,0]43' 43'43 3 0 1 392 DUALITY Solving the above, we get 43 36 43 ,0, which agrees with our solution in Example 17.3. I We now summarize our discussion relating the solutions of the primal and dual problems. If one has unbounded objective function values, then the other has no feasible solution. If one has an optimal feasible solution, then so does the other (and their objective function values are equal). One final case remains: What can we say if one (the primal, say) has no feasible solution? In this case clearly the other (the dual, say) cannot have an optimal solution. However, is it necessarily the case that the dual is unbounded? The answer is no: If one of the problems has no feasible solution, then the other may or may not have a feasible solution. 
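These relationships are easy to check numerically. For instance, consider the LP problem of Exercise 16.11 and its symmetric-form dual. A minimal MATLAB sketch, assuming the Optimization Toolbox function linprog is available:

    % The LP problem of Exercise 16.11 and its symmetric-form dual, solved
    % with linprog (which minimizes f'*x subject to A*x <= b and bounds).
    c = [1; 1];  A = [1 2; 2 1];  b = [3; 3];
    x      = linprog(c, -A, -b, [], [], zeros(2, 1));   % primal: Ax >= b, x >= 0
    lambda = linprog(-b, A', c, [], [], zeros(2, 1));   % dual: A'*lambda <= c, lambda >= 0
    fprintf('c''*x = %g, b''*lambda = %g\n', c' * x, b' * lambda)

Both printed values equal 2, as the duality theorem requires.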
The next example shows that there exists a primal-dual pair of problems for which both have no feasible solution.

Example 17.7 Consider the primal problem

minimize   [1, -2] x
subject to [1 -1; -1 1] x ≥ [2; -1]
           x ≥ 0.

The problem has no feasible solution, because the constraints require that x_1 - x_2 ≥ 2 and x_1 - x_2 ≤ 1. Based on symmetric duality, the dual is

maximize   λ^T [2; -1]
subject to λ^T [1 -1; -1 1] ≤ [1, -2]
           λ ≥ 0.

The dual also has no feasible solution, because the constraints require that λ_1 - λ_2 ≤ 1 and λ_1 - λ_2 ≥ 2. ■

We end this chapter by presenting the following theorem, which describes an alternative form of the relationship between the optimal solutions to the primal and dual problems.

Theorem 17.3 Complementary Slackness Condition. The feasible solutions x and λ to a dual pair of problems (either in symmetric or asymmetric form) are optimal if and only if:

1. (c^T - λ^T A) x = 0.
2. λ^T (Ax - b) = 0. □

Proof. We first prove the result for the asymmetric case. Note that condition 2 holds trivially for this case. Therefore, we consider only condition 1.
⇒: If the two solutions are optimal, then by Theorem 17.2, c^T x = λ^T b. Because Ax = b, we also have (c^T - λ^T A)x = 0.
⇐: If (c^T - λ^T A)x = 0, then c^T x = λ^T Ax = λ^T b. Therefore, by Theorem 17.1, x and λ are optimal.

We now prove the result for the symmetric case.
⇒: We first show condition 1. If the two solutions are optimal, then by Theorem 17.2, c^T x = λ^T b. Because Ax ≥ b and λ ≥ 0, we have

(c^T - λ^T A)x = c^T x - λ^T Ax = λ^T b - λ^T Ax = λ^T (b - Ax) ≤ 0.

On the other hand, since λ^T A ≤ c^T and x ≥ 0, we have (c^T - λ^T A)x ≥ 0. Hence, (c^T - λ^T A)x = 0. To show condition 2, note that since Ax ≥ b and λ ≥ 0, we have λ^T (Ax - b) ≥ 0. On the other hand, since λ^T A ≤ c^T and x ≥ 0, we have λ^T (Ax - b) = (λ^T A - c^T)x ≤ 0. Hence, λ^T (Ax - b) = 0.
⇐: Combining conditions 1 and 2, we get c^T x = λ^T Ax = λ^T b. Hence, by Theorem 17.1, x and λ are optimal. ■

Note that if x and λ are feasible solutions for the dual pair of problems, we can write condition 1, that is, (c^T - λ^T A)x = 0, as "x_i > 0 implies that λ^T a_i = c_i, i = 1, ..., n"; that is, for any component of x that is positive, the corresponding constraint for the dual must be an equality at λ. Also, observe that the statement "x_i > 0 implies that λ^T a_i = c_i" is equivalent to "λ^T a_i < c_i implies that x_i = 0." A similar representation can be written for condition 2.

Consider the asymmetric form of duality. Recall that for the case of an optimal basic feasible solution x, r^T = c^T - λ^T A is the vector of reduced cost coefficients. Therefore, in this case, the complementary slackness condition can be written as r^T x = 0.

Example 17.8 Suppose that you have 26 dollars and you wish to purchase some gold. You have a choice of four vendors, with prices (in dollars per ounce) of 1/2, 1, 1/7, and 1/4, respectively. You wish to spend your entire 26 dollars by purchasing gold from these four vendors, where x_i is the number of dollars you spend on vendor i, i = 1, 2, 3, 4.

a. Formulate the linear programming problem (in standard form) that reflects your desire to obtain the maximum weight in gold.

b. Write down the dual of the linear programming problem in part a, and find the solution to the dual.

c. Use the complementary slackness condition together with part b to find the optimal values of x_1, ..., x_4.

Solution:
a. The corresponding linear programming problem is

minimize   -(2x_1 + x_2 + 7x_3 + 4x_4)
subject to x_1 + x_2 + x_3 + x_4 = 26
           x_1, x_2, x_3, x_4 ≥ 0.

b.
The dual problem is maximize subject to 26λ λ < —2 λ<-1 λ<-7 λ<-4. The solution is clearly λ = — 7. (Note: It is equally valid to have a dual problem with variable λ' = — λ.) c. By the complementary slackness condition, we know that if we can find a vector x that is feasible in the primal and satisfies (—[2,1,7,4] — (—7)[l,l,l,l])x = 0, then this x is optimal in the primal (original) problem. We can rewrite the conditions above as [1,1,1,1]« = 26, x>0, [5,6,0,3]® = 0. By x > 0 and [5,6,0,3]a? = 0, we conclude that x\ = X2 = XA = 0, and by [1,1,1, l]x = 26 we then conclude that x = [0,0,26,0] T . | EXERCISES 17.1 Prove the weak duality lemma for the symmetric form of duality. 17.2 Find the dual of the optimization problem in Exercise 15.8. 17.3 Consider the following linear program: maximize subject to 2#i + 3α?2 x\+ 2x2 < 4 2a: 1 + X2 < 5 X\,X2 > 0. EXERCISES 395 a. Use the simplex method to solve the problem. b . Write down the dual of the linear program and solve the dual. 17.4 Consider the linear program minimize 4χχ + 3x2 subject to 5xi -\- X2 > H 2xi + x2 > 8 xi + 2x 2 > 7 X\,X2 > 0. Write down the corresponding dual problem and find the solution to the dual. (Compare this problem with the one in Exercise 16.12, part a.) 17.5 Consider the following primal problem: maximize subject to x\ + 2^2 — 2xi + x2 + X3 — X\ + 2x 2 =2 4- X4 =7 + X5 = 3 x\ Xi > 0 , i = 1,2,3,4,5. a. Construct the dual problem corresponding to the primal problem above. b . It is known that the solution to the primal above is x* = [3,5,3,0,0] T . Find the solution to the dual. 17.6 Consider the linear programming problem minimize subject to cTx Ax < b. a. Find the dual to this problem. b . Suppose that b = 0 and there exists a vector y > 0 such that yT A+cT = 0 . Does this problem have an optimal feasible solution? If yes, find it. If no, explain why not. Give complete explanations. 17.7 Convert the following optimization problem into a linear programming problem and solve it: maximize " Fil - \χ2\ subject to 1 0 1 -1 \χ3\ X\ X2 _X3_ — 2 1 L J 396 DUALITY Then construct its dual program and solve it. Hint: Introduce two sets of nonnegative variables: x^ > 0, x~ > 0. Then represent the optimization problem using the variables above. Note that only one xf and x^ can be nonzero at a time. If Xi > 0 then xf = Xi and x~ = 0. On the other hand, if Xi < 0 then xf = 0 and Xi = —x~. See Exercise 16.6. 17.8 Consider the linear program minimize subject to x\ Λ h xn, # i , . . . , xn G R a\X\ H h anxn = 1 a ? i , . . . , xn t> U, where 0 < a\ < a,2 < · · · < an. a. Write down the dual to the problem and find a solution to the dual in terms of α ι , . . . , a n . b . State the duality theorem and use it to find a solution to the primal problem above. c. Suppose that we apply the simplex algorithm to the primal problem. Show that if we start at a nonoptimal initial basic feasible solution, the algorithm terminates in one step if and only if we use the rule where the next nonbasic column to enter the basis is the one with the most negative reduced cost coefficient. 17.9 You are given the following linear programming problem: maximize subject to c\X\ Λ X\ Λ h cnxn V xn = 1 X\, . . . , Xn ^_ U, where c i , . . . , cn G R are constants. a. Write down the dual linear program for the primal problem. b . Suppose you know that C4 > Q for all i φ 4. Use this information to solve the dual. c. Use part b to solve the linear programming problem. 
17.10 Consider the linear programming problem maximize subject to cTx Ax < 0 x > 0, EXERCISES 397 where c = [ 1 , 1 , . . . , 1] T . Assume that the problem has a solution. a. Write down the dual of this problem. b . Find the solution to the problem. c. What can you say about the constraint set for the problem? 17.11 Consider a given linear programming problem in standard form (written in the usual notation). a. Write down the associated artificial problem for the problem (used in the two-phase method). b . Write down the dual to the artificial problem from part a. c. Prove that if the original linear programming problem has a feasible solution, then the dual problem in part b has an optimal feasible solution. 17.12 Consider a pair of primal and dual linear programming problems (either in symmetric or asymmetric form). Identify which of the following situations are possible (depending on the particular primal-dual pair) and which are impossible (regardless of the primal-dual pair). In each case, justify your answer (citing results such as the weak duality lemma and the duality theorem whenever needed). a. The primal has a feasible solution, and the dual has no feasible solution. b. The primal has an optimal feasible solution, and the dual has no optimal feasible solution. c. The primal has a feasible solution but no optimal feasible solution, and the dual has an optimal feasible solution. 17.13 Consider an LP problem in standard form. Suppose that x is a feasible solution to the problem. Show that if there exist λ and μ such that Ατ\ +μ =ο μτχ =0 μ>0, then x is an optimal feasible solution to the problem and λ is an optimal feasible solution to the dual. The conditions above, called the Karush-KuhnTucker optimality conditions for LP, are discussed in detail in Chapters 21 and 22. 398 DUALITY 17.14 Consider the linear program cTx, Ax < 6, maximize subject to where c G Mn, b G R m , and A e Rmxn. Use the symmetric form of duality to derive the dual of this linear program and show that the constraint in the dual involving A can be written as an equality constraint. Hint: Write x — u — v^ with u, v > 0. 17.15 Consider the linear program minimize subject to x\ + x2 x\ + 2x2 > 3 2#i + X2 > 3 £ l , £ 2 > 0. The solution to the problem is [1,1] T (see Exercise 16.11). Write down the dual to the problem, solve the dual, and verify that the duality theorem holds. 17.16 Consider the problem minimize subject to cTx, x > 0. xGRn For this problem we have the following theorem. Theorem: A solution to the foregoing problem exists if and only if c > 0. Moreover, if a solution exists, 0 is a solution. Use the duality theorem to prove this theorem (see also Exercise 22.15). 17.17 Let A be a given matrix and b a given vector. Show that there exists a vector x such that Ax > b and x > 0 if and only if for any vector y satisfying ATy < 0 and y > 0, we have bTy < 0. 17.18 Let A be a given matrix and b a given vector. We wish to prove the following result: There exists a vector x such that Ax = b and x > 0 if and only if for any given vector y satisfying ATy < 0, we have b y < 0. This result is known as Farkas's transposition theorem. Our argument is based on duality theory, consisting of the following parts. a. Consider the primal linear program minimize subject to 0Tx Ax = b x > 0, EXERCISES 399 Write down the dual of this problem using the notation y for the dual variable. b . Show that the feasible set of the dual problem is guaranteed to be nonempty. Hint: Think about an obvious feasible point. c. 
Suppose that for any y satisfying ATy < 0, we have b y < 0. In this case what can you say about whether or not the dual has an optimal feasible solution? Hint: Think about the obvious feasible point in part b. d. Suppose that for any y satisfying ATy < 0, we have b y < 0. Use parts b and c to show that there exists x such that Ax = b and x > 0. (This proves one direction of Farkas's transposition theorem.) e. Suppose that x satisfies Ax = b and x > 0. Let y be an arbitrary vector satisfying A y < 0. Show that b y < 0. (This proves the other direction of Farkas's transposition theorem.) 17.19 Let A be a given matrix and b a given vector. Show that there exists a vector x such that Ax < b if and only if for any given vector y satisfying ATy = 0 and y > 0, we have b y > 0. This result is known as Gale's transposition theorem. 17.20 Let A be a given matrix. Show that there exists a vector x such that Ax < 0 if and only if for any given vector y satisfying ATy = 0 and y > 0, we have y = 0 (i.e., y — 0 is the only vector satisfying ATy = 0 and y > 0). This result is known as Gordan's transposition theorem. 17.21 Let P G R n x n be a matrix with the property that each element is in the real interval [0,1], and the sum of the elements of each row is equal to 1; call such a matrix a stochastic matrix. Now consider a vector x > 0 such that xTe = 1, where e— [ 1 , . . . , 1] T ; call such a vector x a probability vector. We wish to prove the following result: For any stochastic matrix P , there exists a probability vector x such that xTP = xT. Although this is a key result in probability theory (under the topic of Markov chains), our argument is based on duality theory (for linear programming), consisting of the following parts. a. Consider the primal linear program: maximize subject to xTe xTP x>0. = xT 400 DUALITY Write down the dual of this problem. b . Show that the dual is not feasible (i.e., there does not exist a feasible solution to the dual). Hint: Derive a contradiction based on Py > y; think about the largest element of y (call it yi). c. Is the primal feasible? What can you deduce about whether or not the primal is unbounded? d. Use part c to deduce the desired result: that there exists a vector x > 0 such that xT P = xT and xTe = 1. 17.22 Suppose that you are presented with a "black box" that implements a function φ defined as follows: Given positive integers m and n, a matrix A G R m x n , and a vector b G R m , the value of </>(m, n, A, b) is a vector x = </>(ra, n, A, b) that satisfies Ax > 6, if such a vector exists. In other words, the black box solves a linear feasibility problem. Now, given A G R m x n , b G R m , and c G R n , consider the linear programming problem minimize subject to cTx Ax > b x>0. Express a solution to this problem in terms of the function φ given above. In other words, show how we can use the black box to solve this linear programming problem. Hint: Find the appropriate inputs to the black box such that the output immediately gives a solution to the linear programming problem. You should use the black box only once. 17.23 This exercise illustrates the use of duality to compute the sensitivity of the optimal objective function value with respect to perturbations in the constraint. Consider a primal linear programming problem and its dual (in either symmetric or asymmetric form). 
Let us view the b vector in the primal as a parameter that we can vary, and suppose that we wish to calculate the change in the optimal objective function value if we perturb b by a small perturbation Δb (i.e., replace b by b + Δb).

a. To make the problem precise, let z(b) be the optimal value of the primal objective function. Let λ denote the corresponding optimal dual vector. Calculate the gradient of z at b: ∇z(b). Write the answer in terms of λ. You may assume that the optimal dual vector remains fixed in a neighborhood of b; but if you do, you must explain why this assumption is reasonable. Hint: Use the duality theorem to see how z(b) depends on b.

b. Suppose that the first component of the optimal dual vector is λ_1 = 3. Now suppose that we increase b_1 by a very small amount Δb_1. Determine the amount by which the optimal objective function value will change.

17.24 Consider the quadratic programming problem

minimize   (1/2) x^T x
subject to Ax ≤ b,

where A ∈ R^{m×n} and b ∈ R^m. Call this problem the primal problem. Consider the associated dual quadratic programming problem

maximize   -(1/2) y^T (A A^T) y - b^T y
subject to y ≥ 0.

Let f_1 and f_2 be the objective functions of the primal and dual, respectively.

a. State and prove a weak duality lemma in this setting.

b. Show that if x_0 and y_0 are feasible points in the primal and dual, and f_1(x_0) = f_2(y_0), then x_0 and y_0 are optimal solutions to the primal and dual, respectively.

Hint: The techniques used in the linear programming duality results are applicable in this exercise.

CHAPTER 18

NONSIMPLEX METHODS

18.1 Introduction

In previous chapters we studied the simplex method and its variant, the revised simplex method, for solving linear programming problems. The method remains widely used in practice for solving LP problems. However, the amount of time required to compute a solution using the simplex method grows rapidly as the number of components n of the variable x ∈ R^n increases. Specifically, it turns out that the relationship between the required amount of time for the algorithm to find a solution and the size n of x is exponential in the worst case. An example of an LP problem for which this relationship is evident was devised by Klee and Minty in 1972 [76]. Below, we give a version of the Klee-Minty example, taken from [9]. Let n be given. Let

c = [10^{n-1}, 10^{n-2}, ..., 10^1, 1]^T,
b = [1, 10^2, 10^4, ..., 10^{2(n-1)}]^T,

A = [ 1             0             0    ...   0
      2×10^1        1             0    ...   0
      2×10^2        2×10^1        1    ...   0
      ...                                   ...
      2×10^{n-1}    2×10^{n-2}    ...  2×10^1   1 ].

Consider the following LP problem:

maximize   c^T x
subject to Ax ≤ b
           x ≥ 0.

The simplex algorithm applied to the LP problem above requires 2^n - 1 steps to find the solution. Clearly, in this example the relationship between the required amount of time for the simplex algorithm to find a solution and the size n of the variable x is exponential. This relationship is also called the complexity of the algorithm. The simplex algorithm is therefore said to have exponential complexity. The complexity of the simplex algorithm is also often written as O(2^n - 1).

Naturally, we would expect that any algorithm that solves LP problems would have the property that the time required to arrive at a solution increases with the size n of the variable x. However, the issue at hand is the rate at which this increase occurs. As we have seen above, the simplex algorithm has the property that this rate of increase is exponential.
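For experimentation with this family of problems, the data above can be generated for any given n. A minimal MATLAB sketch (the function name kleeminty is ours):

    function [c, A, b] = kleeminty(n)
    % Construct the Klee-Minty data (c, A, b) displayed above.
        c = 10.^(n-1:-1:0)';                  % c = [10^(n-1), ..., 10, 1]'
        b = 100.^(0:n-1)';                    % b = [1, 10^2, ..., 10^(2(n-1))]'
        A = eye(n);
        for i = 2:n
            A(i, 1:i-1) = 2 * 10.^(i-1:-1:1); % a_ij = 2*10^(i-j) for j < i
        end
    end

Feeding the resulting c, A, and b into a simplex routine such as the one of Exercise 16.20 (after adding slack variables), with the most-negative-reduced-cost rule, exhibits the exponential growth in the number of steps.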
For a number of years, computer scientists have distinguished between exponential complexity and polynomial complexity. If an algorithm for solving LP problems has polynomial complexity, then the time required to obtain the solution is bounded by a polynomial in n. Obviously, polynomial complexity is more desirable than exponential complexity. Therefore, the existence of an algorithm for solving LP problems with polynomial complexity is an important issue. This issue was partially resolved in 1979 by Khachiyan (also transliterated as Hacijan) [74], who proposed an algorithm that has a complexity O(n^4 L), where, roughly speaking, L represents the number of bits used in the computations. The reason that we consider Khachiyan's algorithm (also called the ellipsoid algorithm) as only a partial resolution of this issue is that the complexity depends on L, which implies that the time required to solve a given LP problem increases with the required accuracy of the computations. The existence of a method for solving LP problems with a polynomial complexity bound based only on the size of the variable n (and possibly the number of constraints) remains a difficult open problem [55]. In any case, computational experience with Khachiyan's algorithm has shown that it is not a practical alternative to the simplex method [14]. The theoretical complexity advantage of Khachiyan's method relative to the simplex method remains to be demonstrated in practice.

Another nonsimplex algorithm for solving LP problems was proposed in 1984 by Karmarkar [71]. Karmarkar's algorithm has a complexity of O(n^{3.5} L), which is lower than that of Khachiyan's algorithm. The algorithm is superior to the simplex algorithm from a complexity viewpoint but has its drawbacks. Improved methods along similar lines, called interior-point methods, have received considerable interest since Karmarkar's original paper. Well-implemented versions of these methods are very efficient, especially when the problem involves a large number of variables [55].

This chapter is devoted to a discussion of nonsimplex methods for solving LP problems. In the next section we discuss some ideas underlying Khachiyan's algorithm. We then present Karmarkar's algorithm in the section to follow.

18.2 Khachiyan's Method

Our description of Khachiyan's algorithm is based on [8] and [9]. The method relies on the concept of duality (see Chapter 17). Our exposition of Khachiyan's algorithm is geared toward a basic understanding of the method. For a detailed rigorous treatment of the method, we refer the reader to [101].

Consider the (primal) linear programming problem

minimize   c^T x
subject to Ax ≥ b
           x ≥ 0.

We write the corresponding dual problem:

maximize   λ^T b
subject to λ^T A ≤ c^T
           λ ≥ 0.

Recall that the two LP problems above constitute the symmetric form of duality. From Theorem 17.1, if x and λ are feasible solutions to the primal and dual problems, respectively, and c^T x = λ^T b, then x and λ are optimal solutions to their respective problems. Using this result, we see that to solve the primal problem it is enough to find a vector [x^T, λ^T]^T that satisfies the following set of relations:

c^T x = b^T λ,
Ax ≥ b,
A^T λ ≤ c,
x ≥ 0,
λ ≥ 0.

Note that the equality c^T x = b^T λ is equivalent to the two inequalities

c^T x - b^T λ ≤ 0,
-c^T x + b^T λ ≤ 0.

Taking this into account, we can represent the previous set of relations as

[  c^T    -b^T  ]            [  0 ]
[ -c^T     b^T  ]   [ x ]    [  0 ]
[ -A       O    ]   [   ] ≤  [ -b ]
[  O       A^T  ]   [ λ ]    [  c ]
[ -I_n     O    ]            [  0 ]
[  O      -I_m  ]            [  0 ].

Therefore, we have reduced the problem of finding an optimal solution to the primal-dual pair into one of finding a vector [x^T, λ^T]^T that satisfies the system of inequalities above. In other words, if we can find a vector that satisfies the system of inequalities, then this vector gives an optimal solution to the primal-dual pair. On the other hand, if there does not exist a vector satisfying the system of inequalities, then the primal-dual pair has no optimal feasible solution.
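The block system above can be assembled mechanically from A, b, and c. A minimal MATLAB sketch (the function name buildPq is ours):

    function [P, q] = buildPq(A, b, c)
    % Assemble P*z <= q with z = [x; lambda], following the block
    % structure displayed above. A is m x n, b is m x 1, c is n x 1.
        [m, n] = size(A);
        P = [  c.'          -b.'          ;   %  c'x - b'lambda <= 0
              -c.'           b.'          ;   % -c'x + b'lambda <= 0
              -A             zeros(m, m)  ;   % -Ax <= -b, i.e., Ax >= b
               zeros(n, n)   A.'          ;   % A'lambda <= c
              -eye(n)        zeros(n, m)  ;   % -x <= 0
               zeros(m, n)  -eye(m)      ];   % -lambda <= 0
        q = [0; 0; -b; c; zeros(n, 1); zeros(m, 1)];
    end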
In the subsequent discussion, we simply represent the system of inequalities as

Pz ≤ q,

where

P = [  c^T    -b^T
      -c^T     b^T
      -A       O
       O       A^T
      -I_n     O
       O      -I_m ],     z = [ x
                                λ ],     q = [ 0; 0; -b; c; 0; 0 ].

In our discussion of Khachiyan's algorithm, we will not be using these forms of P, q, and z specifically; we simply treat Pz ≤ q as a generic matrix inequality, with P, q, and z as generic entities. Let r and s be the sizes of q and z, respectively; that is, P ∈ R^{r×s}, z ∈ R^s, and q ∈ R^r.

Khachiyan's method solves the LP problem by first determining if there exists a vector z that satisfies the inequality Pz ≤ q; that is, the algorithm decides if the system of linear inequalities above is consistent. If the system is consistent, then the algorithm finds a vector z satisfying the system. In the following we refer to any vector satisfying the system as a solution to the system. We assume that the entries in P and q are all rational numbers. This is not a restriction in practice, since any representation of our LP problem on a digital computer will involve only rational numbers. In fact, we assume further that the entries in P and q are all integers. We can do this without loss of generality, since we can always multiply both sides of the inequality Pz ≤ q by a sufficiently large number to get only integer entries on both sides.

Before discussing Khachiyan's algorithm, we introduce the idea of an ellipsoid. To this end, let z ∈ R^s be a given vector and let Q be an s × s nonsingular matrix. Then, the ellipsoid associated with Q centered at z is defined as the set

E_Q(z) = {z + Qy : y ∈ R^s, ‖y‖ ≤ 1}.

The main idea underlying Khachiyan's algorithm is as follows. Khachiyan's algorithm is an iterative procedure, where at each iteration we update a vector z^(k) and a matrix Q_k. Associated with z^(k) and Q_k is an ellipsoid E_{Q_k}(z^(k)). At each step of the algorithm, the associated ellipsoid contains a solution to the given system of linear inequalities. The algorithm updates z^(k) and Q_k in such a way that the ellipsoid at the next step is "smaller" than that of the current step but at the same time is guaranteed to contain a solution to the given system of inequalities, if one exists. If we find that the current point z^(k) satisfies Pz^(k) ≤ q, then we terminate the algorithm and conclude that z^(k) is a solution. Otherwise, we continue to iterate. The algorithm has a fixed prespecified maximum number of iterations N to be performed, where N is a number that depends on L and s. Note that we are not free to choose N; it is computed using a formula that uses the values of L and s. The constant L is itself a quantity that we have to compute beforehand using a formula that involves P and q. When we have completed N iterations without finding a solution in an earlier step, we terminate the algorithm. The associated ellipsoid will then have shrunk to the extent that it is smaller than the precision of computation. At this stage we will either discover a solution inside the ellipsoid, if indeed a solution exists, or we will find no solution inside the ellipsoid, in which case we conclude that no solution exists.
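The update equations for z^(k) and Q_k are not spelled out here. For concreteness, the following MATLAB sketch shows the standard central-cut ellipsoid update, written in terms of the matrix B_k = Q_k Q_k^T; Khachiyan's precise update, with its rational arithmetic and precision bookkeeping, differs in details:

    function [zc, B] = ellipsoid_step(zc, B, p)
    % One central-cut update of the ellipsoid {z : (z-zc)'*inv(B)*(z-zc) <= 1},
    % given a violated row p of P (i.e., p'*zc > corresponding entry of q).
    % The new ellipsoid still contains {z in the old one : p'*z <= p'*zc}.
        s  = length(zc);
        g  = p / sqrt(p' * B * p);     % scale the cutting-plane normal
        Bg = B * g;
        zc = zc - Bg / (s + 1);        % shift the center away from the cut
        B  = (s^2 / (s^2 - 1)) * (B - (2 / (s + 1)) * (Bg * Bg'));
    end

Each such step shrinks the ellipsoid's volume by a fixed factor depending only on s, which is what drives the polynomial bound on N.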
As we can see from the description above, Khachiyan's approach is a radical departure from the classical simplex method for solving LP problems. The method has attracted a lot of attention, and many studies have been devoted to it. However, as we pointed out earlier, the algorithm is of little practical value for solving real-world LP problems. Therefore, we do not delve any further into the details of Khachiyan's algorithm. We refer the interested reader to [101].

Despite its practical drawbacks, Khachiyan's method has inspired other researchers to pursue the development of computationally efficient algorithms for solving LP problems with polynomial complexity. One such algorithm is attributed to Karmarkar, which we discuss in Section 18.4.

18.3 Affine Scaling Method

Basic Algorithm

In this section we describe a simple algorithm, called the affine scaling method, for solving linear programming problems. This description is to prepare the reader for our discussion of Karmarkar's method in the next section. The affine scaling method is an interior-point method. Such methods differ fundamentally from the classical simplex method in one main respect: An interior-point method starts inside the feasible set and moves within it toward an optimal vertex. In contrast, the simplex method jumps from vertex to vertex of the feasible set seeking an optimal vertex.

To begin our description of the affine scaling method, consider the LP problem

minimize   c^T x
subject to Ax = b
           x ≥ 0.

Note that the feasibility constraints have two parts: Ax = b and x ≥ 0. Suppose that we have a feasible point x^(0) that is strictly interior (by strictly interior we mean that all of the components of x^(0) are strictly positive). We wish to find a new point x^(1) by searching in a direction d^(0) that decreases the objective function. In other words, we set

x^(1) = x^(0) + α_0 d^(0),

where α_0 is a step size.

In the gradient method (Chapter 8) we used the negative gradient of the objective function for the search direction. For the LP problem, the negative gradient of the objective function is -c. However, if we set d^(0) = -c, the point x^(1) may not lie inside the feasible set. For x^(1) to lie inside the feasible set, it is necessary that d^(0) be a vector in the nullspace of A. Indeed, because x^(0) is feasible, we have Ax^(0) = b. We also require that Ax^(1) = b. Combining these two equations yields

A(x^(1) - x^(0)) = α_0 A d^(0) = 0.

To choose a direction d^(0) that lies in the nullspace of A but is still "close" to -c, we orthogonally project -c onto the nullspace of A and take the resulting projection as d^(0). The orthogonal projection of any vector onto the nullspace of A involves multiplication by the following matrix P, called the orthogonal projector (see Section 3.3 and Example 12.5):

P = I_n - A^T (A A^T)^{-1} A.

We set d^(0) to be in the direction of the orthogonal projection of -c onto the nullspace of A:

d^(0) = -Pc.

It is easy to check that APc = 0 and hence Ax^(1) = b. In summary, given a feasible point x^(0), we find a new feasible point x^(1) using

x^(1) = x^(0) - α_0 Pc,

where the choice of the step size α_0 is discussed later in the section. The choice of x^(1) above can be viewed as one iteration of a projected gradient algorithm, discussed in Section 23.3.
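Computing d^(0) = -Pc does not require forming P explicitly; we can instead solve a linear system involving A A^T. A minimal MATLAB sketch, with illustrative data of our own choosing:

    % Search direction d = -P*c without forming the projector P explicitly.
    A = [1 1 1];  c = [2; 3; 1];
    d = -(c - A' * ((A * A') \ (A * c)));   % d = -(I - A'*inv(A*A')*A)*c
    disp(norm(A * d))                       % ~0: d lies in the nullspace of A
    disp(c' * d)                            % <= 0: d is a descent direction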
We now make the observation that the point x^(0) should be chosen close to the center of the feasible set. Figure 18.1 illustrates this observation.

Figure 18.1 Results of projected gradient steps from center and noncenter points.

Comparing the center and noncenter starting points in the figure, we can see that if we start at the center of the feasible set, we can take a larger step in the search direction. This larger step from the center point should yield a lower-cost value for the new point compared with the step originating from the noncenter point.

Suppose that we are given a point x^(0) that is feasible but is not a center point. We can transform the point to the center by applying what is called an affine scaling. For simplicity, suppose that A = [1, 1, ..., 1]/n and b = [1]. It is easy to see that the center of this feasible set is the point e = [1, ..., 1]^T. To transform x^(0) to e, we use the affine-scaling transformation

e = D_0^{-1} x^(0),

where D_0 is a diagonal matrix whose diagonal entries are the components of the vector x^(0):

D_0 = diag[x_1^(0), ..., x_n^(0)] = [ x_1^(0)                 ]
                                    [          ...            ]
                                    [                 x_n^(0) ].

Note that D_0 is invertible because we assumed that x^(0) is strictly interior. For general A and b we will still use the same affine-scaling transformation as above. In general, we may not be at precisely the center of the feasible set, but we hope that the transformed point will be "close" to the center. At least the point e is equidistant from the boundaries of the positive orthant {x : x > 0}.

Once the starting point is at (or close to) the center of the feasible set after performing the affine-scaling transformation, we can proceed as described before. Because we have transformed the original vector x^(0) via premultiplication by D_0^{-1}, effectively changing the coordinate system, we also need to represent the original LP problem in the new coordinates. Specifically, the LP problem in the transformed coordinates takes the form

minimize   c̄_0^T x̄
subject to Ā_0 x̄ = b
           x̄ ≥ 0,

where c̄_0 = D_0 c and Ā_0 = A D_0. In the new (x̄) coordinate system we construct the orthogonal projector

P_0 = I_n - Ā_0^T (Ā_0 Ā_0^T)^{-1} Ā_0

and set d̄^(0) to be in the direction of the orthogonal projection of -c̄_0 onto the nullspace of Ā_0:

d̄^(0) = -P_0 c̄_0.

Then, we compute x̄^(1) using

x̄^(1) = x̄^(0) - α_0 P_0 c̄_0,

where x̄^(0) = D_0^{-1} x^(0). To obtain a point in the original coordinates, we perform the transformation

x^(1) = D_0 x̄^(1).
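Putting the pieces together, one iteration of the method can be sketched in MATLAB as follows (the function name affine_scaling_step is ours, and the step-size rule uses the quantity r_k defined in the next paragraph):

    function x1 = affine_scaling_step(A, c, x0, alpha)
    % One affine scaling iteration. x0 must be strictly interior, and
    % alpha is in (0,1), e.g., 0.9. Assumes the problem is bounded, so
    % the direction has at least one negative component.
        D    = diag(x0);                    % D_k = diag(x_k)
        Abar = A * D;                       % constraints in scaled coordinates
        cbar = D * c;                       % cost in scaled coordinates
        dbar = -(cbar - Abar' * ((Abar * Abar') \ (Abar * cbar)));  % -P_k*cbar
        d    = D * dbar;                    % d_k = -D_k*P_k*D_k*c
        r    = min(x0(d < 0) ./ -d(d < 0)); % r_k: largest step keeping x >= 0
        x1   = x0 + alpha * r * d;
    end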
Unlike the simplex method, the affine scaling method will not reach the optimal solution in a finite number of steps. Therefore, we need a stopping criterion. For this, we can use any of the stopping criteria discussed in Section 8.2. For example, we can stop if

    |c^T x^{(k+1)} - c^T x^{(k)}| / max{1, |c^T x^{(k)}|} < \varepsilon,

where \varepsilon > 0 is a prespecified threshold (see also [96, p. 572] for a similar stopping criterion, as well as an alternative criterion involving duality).

Two-Phase Method

To implement the affine scaling method described above, we need an initial feasible starting point that is strictly interior. We now describe a method to find such a starting point. After the starting point is found, we can then proceed to search for an optimal solution to the problem. This approach involves two phases: in phase I we find an initial strictly interior feasible point, and in phase II we use the result of phase I to initialize the affine scaling algorithm to find an optimal solution. This procedure is analogous to the two-phase simplex algorithm described in Section 16.6.

We now describe phase I of the two-phase affine scaling method. Let u be an arbitrary vector with positive components, and let v = b - Au. If v = 0, then u is a strictly interior feasible point. We can then set x^{(0)} = u and proceed to phase II, where we apply the affine scaling method as described before. On the other hand, if v ≠ 0, we construct the following associated artificial problem:

    minimize y
    subject to [A, v] [x; y] = b
               [x; y] ≥ 0.

The artificial problem above has an obvious strictly interior feasible point: [u^T, 1]^T. Using this point as the initial point, we can apply the affine scaling algorithm to the artificial problem. Because the objective function in the artificial problem is bounded below by 0, the affine scaling method will terminate with some optimal solution.

Proposition 18.1 The original LP problem has a feasible solution if and only if the associated artificial problem has an optimal feasible solution with objective function value zero. □

Proof. ⇒: If the original problem has a feasible solution x, then the vector [x^T, 0]^T is a feasible solution to the artificial problem. Clearly, this solution has an objective function value of zero. This solution is therefore optimal for the artificial problem, since there can be no feasible solution with negative objective function value.

⇐: Suppose that the artificial problem has an optimal feasible solution with objective function value zero. Then, this solution must have the form [x^T, 0]^T, where x ≥ 0. Hence, we have Ax = b, and x is a feasible solution to the original problem. ■
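The phase I construction takes only a few lines in MATLAB; the sketch below sets up the artificial problem data, assuming A and b are given. The choice u = e is ours (any positive u works), and affine_scaling_step refers to the illustrative function shown earlier.

    % Phase I of the two-phase affine scaling method (a sketch).
    % Build the artificial problem
    %   minimize y  subject to [A, v]*[x; y] = b, [x; y] >= 0,  v = b - A*u.
    u = ones(size(A, 2), 1);                % any u > 0 will do
    v = b - A * u;
    if norm(v) == 0
        x0 = u;                             % u itself is strictly interior feasible
    else
        A_art = [A, v];                     % artificial constraint matrix
        c_art = [zeros(size(A, 2), 1); 1];  % objective: minimize the component y
        z = [u; 1];                         % strictly interior feasible start
        % iterate, e.g., z = affine_scaling_step(A_art, c_art, z, 0.99),
        % until the objective (the last component of z) is near zero.
    end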
Suppose that the original LP problem has a feasible solution. By Proposition 18.1, if we apply the affine scaling method to the artificial problem (with initial point [u^T, 1]^T), the algorithm will terminate with objective function value zero. The optimal solution will be of the form [x^T, 0]^T. We argue that x will in general be a strictly interior feasible point. It is easy to see that x ≥ 0. To convince ourselves that each component of x will be positive in general, note that the subset of optimal feasible solutions of the artificial problem in which one or more among the first n components are zero is a very small or thin subset of the set of all optimal feasible solutions. By small or thin we mean in the sense that a two-dimensional plane in R^3 is small or thin; in particular, the volume of a two-dimensional plane in R^3 is zero. Thus, it is very unlikely that the affine scaling algorithm will terminate with an optimal feasible solution in which one or more among the first n components are zero.

Having completed phase I as described above, we then use the first n components of the terminal optimal feasible solution for the artificial problem as our initial point for the affine scaling method applied to the original LP problem. This second application of the affine scaling algorithm constitutes phase II.

In theory, phase I generates a feasible point to initiate phase II. However, because of the finite precision of typical computer implementations, the solution obtained from phase I may not, in fact, be feasible. Moreover, even if the initial point in phase II is feasible, in practice the iterates may lose feasibility, owing to finite-precision computations. Special procedures for dealing with such problems are available. For a discussion of numerical implementation of affine scaling algorithms, see [42, Section 7.1.2].

18.4 Karmarkar's Method

Basic Ideas

Like the affine scaling method, Karmarkar's method for solving LP problems differs fundamentally from the classical simplex method in various respects. First, Karmarkar's method is an interior-point method. Another difference between Karmarkar's method and the simplex method is that the latter stops when it finds an optimal solution, whereas Karmarkar's method stops when it finds a solution with an objective function value less than or equal to a prespecified fraction of the value at the original guess. A third difference between the two methods is that the simplex method starts with LP problems in standard form, whereas Karmarkar's method starts with LP problems in a special canonical form, which we call Karmarkar's canonical form. We discuss this canonical form in the next subsection. Although more recent interior-point methods are recognized to be superior to Karmarkar's original algorithm in efficiency and robustness, a study of Karmarkar's method provides an informative introduction to the study of more advanced interior-point methods.
Karmarkar's Canonical Form

To apply Karmarkar's algorithm to a given LP problem, we must first transform the given problem into a particular form, which we refer to as Karmarkar's canonical form:

    minimize c^T x
    subject to Ax = 0
               \sum_{i=1}^{n} x_i = 1
               x ≥ 0,

where x = [x_1, ..., x_n]^T. As in our discussion of Khachiyan's method, we assume without loss of generality that the entries of A and c are integers.

We now introduce some notation that allows convenient manipulation of the canonical form. First, let e = [1, ..., 1]^T be the vector in R^n with each component equal to 1. Let Ω denote the nullspace of A, that is, the subspace

    Ω = {x ∈ R^n : Ax = 0}.

Define the simplex in R^n by

    Δ = {x ∈ R^n : e^T x = 1, x ≥ 0}.

We denote the center of the simplex Δ by

    a_0 = e/n.

Clearly, a_0 ∈ Δ. With the notation above, Karmarkar's canonical form can be rewritten as

    minimize c^T x
    subject to x ∈ Ω ∩ Δ.

Note that the constraint set (or feasible set) Ω ∩ Δ can be represented as

    Ω ∩ Δ = {x ∈ R^n : Ax = 0, e^T x = 1, x ≥ 0}
          = {x ∈ R^n : [A; e^T] x = [0; 1], x ≥ 0},

where [A; e^T] denotes the matrix A with the row e^T appended.

Example 18.1 Consider the following LP problem, taken from [125]:

    minimize 5x_1 + 4x_2 + 8x_3
    subject to x_1 + x_2 + x_3 = 1
               x_1, x_2, x_3 ≥ 0.

Clearly, this problem is already in Karmarkar's canonical form, with c^T = [5, 4, 8] and A = O. The feasible set for this example is illustrated in Figure 18.2. ■

Figure 18.2 Feasible set for Example 18.1.

Example 18.2 Consider the following LP problem, taken from [110]:

    minimize 3x_1 + 3x_2 - x_3
    subject to 2x_1 - 3x_2 + x_3 = 0
               x_1 + x_2 + x_3 = 1
               x_1, x_2, x_3 ≥ 0.

This problem is in Karmarkar's canonical form, with c^T = [3, 3, -1] and A = [2, -3, 1]. The feasible set for this example is illustrated in Figure 18.3 (adapted from [110]). ■

Figure 18.3 The feasible set for Example 18.2.

We show later that any LP problem can be converted into an equivalent problem in Karmarkar's canonical form.

Karmarkar's Restricted Problem

Karmarkar's algorithm solves LP problems in Karmarkar's canonical form, with the following assumptions:

A. The center a_0 of the simplex Δ is a feasible point: a_0 ∈ Ω.
B. The minimum value of the objective function over the feasible set is zero.
C. The (m+1) × n matrix [A; e^T] has rank m + 1.
D. We are given a termination parameter q > 0, such that if we obtain a feasible point x satisfying c^T x / c^T a_0 ≤ 2^{-q}, then we consider the problem solved.

Any LP problem that is in Karmarkar's canonical form and that also satisfies the four assumptions above is called a Karmarkar's restricted problem. In the following we discuss the assumptions and their interpretations.

We begin by looking at assumption A. We point out that this assumption is not restrictive, since any LP problem that has an optimal feasible solution can be converted into a problem in Karmarkar's canonical form that satisfies assumption A. We discuss this in the next subsection.

We next turn our attention to assumption B. Any LP problem in Karmarkar's canonical form can be converted into one that satisfies assumption B, provided that we know beforehand the minimum value of its objective function over the feasible set. Specifically, suppose that we are given an LP problem where the minimum value of the objective function is M. As in [110], consider the function f(x) = c^T x - M. Then, using the property that e^T x = 1 on the feasible set, we have that for any feasible x,

    f(x) = c^T x - M = c^T x - M e^T x = (c^T - M e^T) x = \bar{c}^T x,

where \bar{c}^T = c^T - M e^T. Notice that the objective function above has a minimum value of zero and is a linear function of x. We can replace the original objective function with the new objective function above without altering the solution.

Example 18.3 Recall the LP problem in Example 18.1:

    minimize 5x_1 + 4x_2 + 8x_3
    subject to x_1 + x_2 + x_3 = 1
               x_1, x_2, x_3 ≥ 0.

The problem satisfies assumption A (and assumption C) but not assumption B, since the minimum value of the objective function over the feasible set is 4. To convert the above into a problem that satisfies assumption B, we replace c^T = [5, 4, 8] by \bar{c}^T = [1, 0, 4]. ■

Example 18.4 The reader can easily verify that the LP problem in Example 18.2 satisfies assumptions A, B, and C. ■

Assumption C is a technical assumption that is required in the implementation of the algorithm. Its significance will be clear when we discuss the update equation in Karmarkar's algorithm.

Assumption D is the basis for the stopping criterion of Karmarkar's algorithm. In particular, we stop when we have found a feasible point satisfying c^T x / c^T a_0 ≤ 2^{-q}. Such a stopping criterion is inherent in any algorithm that uses finite-precision arithmetic. Observe that the stopping criterion above depends on the value of c^T a_0. It will turn out that Karmarkar's algorithm uses a_0 as the starting point. Therefore, we can see that the accuracy of the final solution in the algorithm is influenced by the starting point.
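The objective shift used for assumption B is easy to check numerically; the following MATLAB fragment reproduces the conversion in Example 18.3 (the variable names are ours, and M = 4 is the known minimum value quoted in that example):

    % Shifting the objective so that its minimum over the feasible set is
    % zero (assumption B), as in Example 18.3.
    c = [5; 4; 8];
    M = 4;                     % minimum value of c'*x over the feasible set
    e = ones(size(c));
    c_bar = c - M * e          % gives [1; 0; 4], as in the text
    % On the simplex, e'*x = 1, so c_bar'*x = c'*x - M: the two objectives
    % have the same minimizers, and c_bar'*x has minimum value zero.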
From General Form to Karmarkar's Canonical Form

We now show how any LP problem can be converted into an equivalent problem in Karmarkar's canonical form. By equivalent we mean that the solution to one can be used to determine the solution to the other, and vice versa. To this end, recall that any LP problem can be transformed into an equivalent problem in standard form. Therefore, it suffices to show that any LP problem in standard form can be transformed into an equivalent problem in Karmarkar's canonical form. In fact, the transformation given below (taken from [71]) will also guarantee that assumption A of the preceding subsection is satisfied.

To proceed, consider a given LP problem in standard form:

    minimize c^T x,  x ∈ R^n
    subject to Ax = b
               x ≥ 0.

We first present a simple way to convert this problem into Karmarkar's canonical form, ignoring the requirement to satisfy assumption A. For this, define a new variable z ∈ R^{n+1} by z = [x^T, 1]^T. Also define c' = [c^T, 0]^T and A' = [A, -b]. Using this notation, we can rewrite the LP problem above as

    minimize c'^T z,  z ∈ R^{n+1}
    subject to A'z = 0
               z ≥ 0.

We need one more step to transform the problem into one that includes the constraint that the decision variables sum to 1. For this, let y = [y_1, ..., y_n, y_{n+1}]^T ∈ R^{n+1}, where

    y_i = x_i / (x_1 + ... + x_n + 1),  i = 1, ..., n,
    y_{n+1} = 1 / (x_1 + ... + x_n + 1).

This transformation from x to y is called a projective transformation. It can be shown that (see later)

    c^T x = 0 ⟺ c'^T y = 0,
    Ax = b ⟺ A'y = 0,
    x ≥ 0 ⟺ y ≥ 0.

Therefore, we have transformed the given LP problem in standard form into the following problem, which is in Karmarkar's canonical form:

    minimize c'^T y,  y ∈ R^{n+1}
    subject to A'y = 0
               e^T y = 1
               y ≥ 0.

The transformation technique above can be modified slightly to ensure that assumption A holds. We follow the treatment of [71]. We first assume that we are given a point a = [a_1, ..., a_n]^T that is a strictly interior feasible point; that is, Aa = b and a > 0. We show later how this assumption can be enforced. Let P_+ denote the positive orthant of R^n, given by P_+ = {x ∈ R^n : x > 0}. Let Δ = {x ∈ R^{n+1} : e^T x = 1, x ≥ 0} be the simplex in R^{n+1}. Define the map T : P_+ → Δ by T(x) = [T_1(x), ..., T_{n+1}(x)]^T with

    T_i(x) = (x_i/a_i) / (x_1/a_1 + ... + x_n/a_n + 1),  i = 1, ..., n,
    T_{n+1}(x) = 1 / (x_1/a_1 + ... + x_n/a_n + 1).

We call the map T a projective transformation of the positive orthant P_+ into the simplex Δ (for an introduction to projective transformations, see [68]). The transformation T has several interesting properties (see Exercises 18.4, 18.5, and 18.6). In particular, we can find a vector c' ∈ R^{n+1} and a matrix A' ∈ R^{m×(n+1)} such that for each x ∈ R^n,

    c^T x = 0 ⟺ c'^T T(x) = 0 and
    Ax = b ⟺ A' T(x) = 0

(see Exercises 18.5 and 18.6 for the forms of A' and c'). Note that for each x ∈ R^n, we have e^T T(x) = 1, which means that T(x) ∈ Δ. Furthermore, note that for each x ∈ R^n,

    x > 0 ⟺ T(x) > 0.

Taking this into account, consider the following LP problem (where y is the decision variable):

    minimize c'^T y
    subject to A'y = 0
               e^T y = 1
               y ≥ 0.

Note that this LP problem is in Karmarkar's canonical form. Furthermore, in light of the definitions of c' and A', the above LP problem is equivalent to the original LP problem in standard form. Hence, we have converted the LP problem in standard form into an equivalent problem in Karmarkar's canonical form. In addition, because a is a strictly interior feasible point and a_0 = T(a) is the center of the simplex Δ (see Exercise 18.4), the point a_0 is a feasible point of the transformed problem. Hence, assumption A of the preceding subsection is satisfied for the problem above.
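The map T and its inverse are simple enough to state in a few lines of MATLAB; the sketch below follows the formulas above (the function names are ours):

    % Projective transformation T of the positive orthant of R^n into the
    % simplex in R^(n+1), built around a strictly interior feasible point a.
    function y = proj_transform(x, a)
        s = sum(x ./ a) + 1;
        y = [x ./ a; 1] / s;           % sum(y) = 1, and y > 0 whenever x > 0
    end

    % Inverse of T on {y in the simplex : y(n+1) > 0} (see Exercise 18.4):
    function x = proj_transform_inv(y, a)
        x = a .* y(1:end-1) / y(end);  % x_i = a_i * y_i / y_{n+1}
    end

As a quick consistency check, proj_transform(a, a) returns ones(n+1, 1)/(n+1), the center of the simplex, matching property 4 of Exercise 18.4.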
We started this discussion with the assumption that we are given a point a that is a strictly interior feasible point of the original LP problem in standard form. To see how this assumption can be made to hold, we now show that we can transform any given LP problem into an equivalent problem in standard form where such a point a is explicitly given. To this end, consider a given LP problem of the form

    minimize c^T x
    subject to Ax ≥ b
               x ≥ 0.

Note that any LP problem can be converted into an equivalent problem of the above form. To see this, recall that any LP problem can be transformed into an equivalent problem in standard form. But any problem in standard form can be represented as above, since the constraint Ax = b can be written as Ax ≥ b, -Ax ≥ -b. We next write the dual to the problem above:

    maximize λ^T b
    subject to λ^T A ≤ c^T
               λ ≥ 0.

As we did in our discussion of Khachiyan's algorithm, we now combine the primal and dual problems to get

    c^T x - b^T λ = 0,
    Ax ≥ b,
    A^T λ ≤ c,
    x ≥ 0, λ ≥ 0.

As we pointed out in the earlier section on Khachiyan's algorithm, the original LP problem is solved if and only if we can find a pair (x, λ) that satisfies the set of relations above. This follows from Theorem 17.1. We now introduce slack and surplus variables u and v to get the following equivalent set of relations:

    c^T x - b^T λ = 0,
    Ax - v = b,
    A^T λ + u = c,
    x, λ, u, v ≥ 0.

Let x_0 ∈ R^n, λ_0 ∈ R^m, u_0 ∈ R^n, and v_0 ∈ R^m be points that satisfy x_0 > 0, λ_0 > 0, u_0 > 0, and v_0 > 0. For example, we could choose x_0 = [1, ..., 1]^T, and likewise with λ_0, u_0, and v_0. Consider the LP problem

    minimize z
    subject to c^T x - b^T λ + (-c^T x_0 + b^T λ_0) z = 0
               Ax - v + (b - A x_0 + v_0) z = b
               A^T λ + u + (c - A^T λ_0 - u_0) z = c
               x, λ, u, v, z ≥ 0.

We refer to the above as Karmarkar's artificial problem, which can be represented in matrix notation as

    minimize \bar{c}^T \bar{x}
    subject to \bar{A} \bar{x} = \bar{b}
               \bar{x} ≥ 0,

where

    \bar{x} = [x^T, λ^T, u^T, v^T, z]^T,
    \bar{c} = [0_{2m+2n}^T, 1]^T,
    \bar{A} = [ c^T       -b^T      0_{1×n}    0_{1×m}    -c^T x_0 + b^T λ_0
                A          O_{m×m}  O_{m×n}    -I_m        b - A x_0 + v_0
                O_{n×n}    A^T      I_n        O_{n×m}     c - A^T λ_0 - u_0 ],
    \bar{b} = [0; b; c]

(the subscripts above refer to the dimensions of the corresponding matrices and vectors). Observe that the following point is a strictly interior feasible point for the problem above:

    \bar{x}_0 = [x_0^T, λ_0^T, u_0^T, v_0^T, 1]^T.

Furthermore, the minimum value of the objective function for Karmarkar's artificial problem is zero if and only if the previous set of relations has a solution, that is, there exist x, λ, u, and v satisfying

    c^T x - b^T λ = 0,  Ax - v = b,  A^T λ + u = c,  x, λ, u, v ≥ 0.

Therefore, Karmarkar's artificial LP problem is equivalent to the original LP problem:

    minimize c^T x
    subject to Ax ≥ b
               x ≥ 0.

Note that the main difference between the original LP problem and Karmarkar's artificial problem is that we have an explicit strictly interior feasible point for Karmarkar's artificial problem, and hence we have satisfied the assumption that we imposed at the beginning of this subsection.
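Assembling \bar{A}, \bar{b}, and \bar{c} is mechanical; here is a hedged MATLAB sketch (the function name and the all-ones choice of x_0, λ_0, u_0, v_0 are ours):

    % Build Karmarkar's artificial problem for: minimize c'*x s.t. A*x >= b, x >= 0.
    % Returns the data (Abar, bbar, cbar) and a strictly interior feasible xbar0.
    function [Abar, bbar, cbar, xbar0] = karmarkar_artificial(A, b, c)
        [m, n] = size(A);
        x0 = ones(n, 1); lam0 = ones(m, 1); u0 = ones(n, 1); v0 = ones(m, 1);
        Abar = [c', -b', zeros(1, n), zeros(1, m), -c'*x0 + b'*lam0;
                A, zeros(m, m), zeros(m, n), -eye(m), b - A*x0 + v0;
                zeros(n, n), A', eye(n), zeros(n, m), c - A'*lam0 - u0];
        bbar = [0; b; c];
        cbar = [zeros(2*m + 2*n, 1); 1];   % objective: minimize z
        xbar0 = [x0; lam0; u0; v0; 1];     % satisfies Abar*xbar0 = bbar, xbar0 > 0
    end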
The Algorithm

We are now ready to describe Karmarkar's algorithm. Keep in mind that the LP problem we are solving is a Karmarkar's restricted problem, that is, a problem that is in Karmarkar's canonical form and satisfies assumptions A, B, C, and D. For convenience, we restate the problem:

    minimize c^T x,  x ∈ R^n
    subject to x ∈ Ω ∩ Δ,

where Ω = {x ∈ R^n : Ax = 0} and Δ = {x ∈ R^n : e^T x = 1, x ≥ 0}. Karmarkar's algorithm is an iterative algorithm that, given an initial point x^{(0)} and parameter q, generates a sequence x^{(1)}, x^{(2)}, ..., x^{(N)}. The algorithm is described by the following steps:

1. Initialize: Set k := 0; x^{(0)} = a_0 = e/n.
2. Update: Set x^{(k+1)} = Φ(x^{(k)}), where Φ is an update map described below.
3. Check the stopping criterion: If the condition c^T x^{(k)} / c^T x^{(0)} < 2^{-q} is satisfied, then stop.
4. Iterate: Set k := k + 1; go to step 2.

We describe the update map Φ as follows. First, consider the first step in the algorithm: x^{(0)} = a_0. To compute x^{(1)}, we use the familiar update equation

    x^{(1)} = x^{(0)} + α d^{(0)},

where α is a step size and d^{(0)} is an update direction. The step size α is chosen to be a value in (0, 1). Karmarkar recommends a value of 1/4 in his original paper [71].

The update direction d^{(0)} is chosen as follows. First, note that the gradient of the objective function is c. Therefore, the direction of maximum rate of decrease of the objective function is -c. However, in general, we cannot simply update along this direction, since x^{(1)} is required to lie in the constraint set

    Ω ∩ Δ = {x ∈ R^n : Ax = 0, e^T x = 1, x ≥ 0} = {x ∈ R^n : B_0 x = [0; 1], x ≥ 0},

where B_0 ∈ R^{(m+1)×n} is given by

    B_0 = [A; e^T].

Note that since x^{(0)} ∈ Ω ∩ Δ, then for x^{(1)} = x^{(0)} + α d^{(0)} also to lie in Ω ∩ Δ, the vector d^{(0)} must be an element of the nullspace of B_0. Hence, we choose d^{(0)} to be in the direction of the orthogonal projection of -c onto the nullspace of B_0. This projection is accomplished by the matrix P_0 given by

    P_0 = I_n - B_0^T (B_0 B_0^T)^{-1} B_0.

Note that B_0 B_0^T is nonsingular by assumption C. Specifically, we choose d^{(0)} to be the vector

    d^{(0)} = -r \hat{c}^{(0)},

where

    \hat{c}^{(0)} = P_0 c / ||P_0 c||

and r = 1/\sqrt{n(n-1)}. The scalar r is incorporated into the update vector d^{(0)} for the following reason. First, observe that r is the radius of the largest sphere inscribed in the simplex Δ (see Exercise 18.7). Therefore, the vector d^{(0)} = -r \hat{c}^{(0)} points in the direction of the projection of -c onto the nullspace of B_0, and x^{(1)} = x^{(0)} + α d^{(0)} is guaranteed to lie in the constraint set Ω ∩ Δ. In fact, x^{(1)} lies in the set Ω ∩ Δ ∩ {x : ||x - a_0|| ≤ r}. Finally, we note that x^{(1)} is a strictly interior point of Δ.

The general update step x^{(k+1)} = Φ(x^{(k)}) is performed as follows. We first give a brief description of the basic idea, which is similar to the update from x^{(0)} to x^{(1)} described above. However, note that x^{(k)} is, in general, not at the center of the simplex. Therefore, let us first transform this point to the center. To do this, let D_k be a diagonal matrix whose diagonal entries are the components of the vector x^{(k)}:

    D_k = diag(x_1^{(k)}, ..., x_n^{(k)}).

It turns out that because x^{(0)} is a strictly interior point of Δ, x^{(k)} is a strictly interior point of Δ for all k (see Exercise 18.10). Therefore, D_k is nonsingular and

    D_k^{-1} = diag(1/x_1^{(k)}, ..., 1/x_n^{(k)}).

Consider the mapping U_k : Δ → Δ given by

    U_k(x) = D_k^{-1} x / (e^T D_k^{-1} x).

Note that U_k(x^{(k)}) = e/n = a_0. We use U_k to change the variable from x to \bar{x} = U_k(x). We do this so that x^{(k)} is mapped into the center of the simplex, as indicated above. Note that U_k is an invertible mapping, with

    x = U_k^{-1}(\bar{x}) = D_k \bar{x} / (e^T D_k \bar{x}).

Letting \bar{x}^{(k)} = U_k(x^{(k)}) = a_0, we can now apply the procedure that we described before for getting x^{(1)} from x^{(0)} = a_0.
Specifically, we update \bar{x}^{(k)} to obtain \bar{x}^{(k+1)} using the update formula

    \bar{x}^{(k+1)} = \bar{x}^{(k)} + α \bar{d}^{(k)}.

To compute \bar{d}^{(k)}, we need to state the original LP problem in the new variable \bar{x}:

    minimize c^T D_k \bar{x}
    subject to A D_k \bar{x} = 0
               \bar{x} ∈ Δ.

The reader can easily verify that the LP problem above in the new variable \bar{x} is equivalent to the original LP problem, in the sense that x* is an optimal solution to the original problem if and only if U_k(x*) is an optimal solution to the transformed problem. To see this, simply note that x = U_k^{-1}(\bar{x}) = D_k \bar{x} / (e^T D_k \bar{x}), and rewrite the objective function and constraints accordingly (see Exercise 18.8).

As before, let

    B_k = [A D_k; e^T].

We choose \bar{d}^{(k)} = -r \hat{c}^{(k)}, where \hat{c}^{(k)} is the normalized projection of -(c^T D_k)^T = -D_k c onto the nullspace of B_k, and r = 1/\sqrt{n(n-1)} as before. To determine \hat{c}^{(k)}, we define the projector matrix P_k by

    P_k = I_n - B_k^T (B_k B_k^T)^{-1} B_k.

Note that B_k B_k^T is nonsingular (see Exercise 18.9). The vector \hat{c}^{(k)} is therefore given by

    \hat{c}^{(k)} = P_k D_k c / ||P_k D_k c||.

The direction vector \bar{d}^{(k)} is then

    \bar{d}^{(k)} = -r P_k D_k c / ||P_k D_k c||.

The updated vector \bar{x}^{(k+1)} = \bar{x}^{(k)} + α \bar{d}^{(k)} is guaranteed to lie in the transformed feasible set {\bar{x} : A D_k \bar{x} = 0} ∩ Δ. The final step is to apply the inverse transformation U_k^{-1} to obtain x^{(k+1)}:

    x^{(k+1)} = U_k^{-1}(\bar{x}^{(k+1)}) = D_k \bar{x}^{(k+1)} / (e^T D_k \bar{x}^{(k+1)}).

Note that x^{(k+1)} lies in the set Ω ∩ Δ. Indeed, we have already seen that U_k and U_k^{-1} map Δ into Δ. To see that A x^{(k+1)} = 0, we simply premultiply the foregoing expression by A and use the fact that A D_k \bar{x}^{(k+1)} = 0.

We now summarize the update x^{(k+1)} = Φ(x^{(k)}):

1. Compute the matrices

    D_k = diag(x_1^{(k)}, ..., x_n^{(k)}),  B_k = [A D_k; e^T].

2. Compute the orthogonal projector onto the nullspace of B_k:

    P_k = I_n - B_k^T (B_k B_k^T)^{-1} B_k.

3. Compute the normalized orthogonal projection of c onto the nullspace of B_k:

    \hat{c}^{(k)} = P_k D_k c / ||P_k D_k c||.

4. Compute the direction vector

    \bar{d}^{(k)} = -r \hat{c}^{(k)},

where r = 1/\sqrt{n(n-1)}.

5. Compute \bar{x}^{(k+1)} using

    \bar{x}^{(k+1)} = a_0 + α \bar{d}^{(k)},

where α is the prespecified step size, α ∈ (0, 1).

6. Compute x^{(k+1)} by applying the inverse transformation U_k^{-1}:

    x^{(k+1)} = U_k^{-1}(\bar{x}^{(k+1)}) = D_k \bar{x}^{(k+1)} / (e^T D_k \bar{x}^{(k+1)}).

The matrix P_k in step 2 is needed solely for computing P_k D_k c in step 3. In fact, the two steps can be combined in an efficient way without having to compute P_k explicitly, as follows. We first solve the set of linear equations

    B_k B_k^T y = B_k D_k c

(for the variable y) and then compute P_k D_k c using the expression P_k D_k c = D_k c - B_k^T y.
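The six steps above translate directly into MATLAB; the sketch below uses the efficient computation of P_k D_k c just described (the function name is ours, and α = 1/4 would follow Karmarkar's recommendation):

    % One iteration x^(k+1) = Phi(x^(k)) of Karmarkar's algorithm (a sketch).
    function xnew = karmarkar_step(A, c, x, alpha)
        n = length(x);
        e = ones(n, 1);
        r = 1 / sqrt(n * (n - 1));            % radius of largest inscribed sphere
        a0 = e / n;                           % center of the simplex
        D = diag(x);                          % step 1
        B = [A * D; e'];
        Dc = D * c;
        y = (B * B') \ (B * Dc);              % steps 2-3 without forming P_k
        PDc = Dc - B' * y;                    % P_k * D_k * c
        d = -r * PDc / norm(PDc);             % step 4
        xbar = a0 + alpha * d;                % step 5
        xnew = D * xbar / (e' * (D * xbar));  % step 6: inverse transformation
    end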
For more details on Karmarkar's algorithm, see [42], [55], [71], and [124]. For an informal introduction to the algorithm, see [110]. For further reading on other nonsimplex methods in linear programming, see [42], [55], [96], and [119]. A continuous gradient system for solving linear programming problems is discussed in [26]. An interesting three-article series on developments in the linear programming area before and after 1984 appeared in SIAM News, Vol. 22, No. 2, March 1989. The first article in this journal issue contains an account by Wright on recent progress and a history of linear programming from the early 1800s. The second article, by Anstreicher, focuses on interior-point algorithms developed since 1984. Finally, in the third article in the series, Monma surveys computational implementations of interior-point methods.

EXERCISES

18.1 Write a simple MATLAB function to implement the affine scaling algorithm. The inputs are c, A, b, and x^{(0)}, where x^{(0)} is a strictly feasible initial point. Test the function on the problem in Example 16.2; use x^{(0)} = [2, 3, 2, 3, 3]^T.

18.2 Write a MATLAB routine that implements the two-phase affine scaling method. It may be useful to use the MATLAB function of Exercise 18.1. Test the routine on the problem in Example 16.5.

18.3 For a given linear programming problem of the form

    minimize c^T x
    subject to Ax ≥ b
               x ≥ 0,

the associated Karmarkar's artificial problem can be solved directly using the affine scaling method. Write a simple MATLAB program to solve problems of the form above by using the affine scaling algorithm applied to the associated Karmarkar's artificial problem. It may be useful to use the MATLAB function of Exercise 18.1. Test your program on the problem in Example 15.15.

18.4 Let a ∈ R^n, a > 0. Let T = [T_1, ..., T_{n+1}]^T be the projective transformation of the positive orthant P_+ of R^n into the simplex Δ in R^{n+1}, given by

    T_i(x) = (x_i/a_i) / (x_1/a_1 + ... + x_n/a_n + 1) if 1 ≤ i ≤ n,
    T_{n+1}(x) = 1 / (x_1/a_1 + ... + x_n/a_n + 1).

Prove the following properties of T (see [71]):

1. T is a one-to-one mapping; that is, T(x) = T(y) implies that x = y.
2. T maps P_+ onto Δ \ {x : x_{n+1} = 0} = {x ∈ Δ : x_{n+1} > 0}; that is, for each y ∈ {x ∈ Δ : x_{n+1} > 0}, there exists x ∈ P_+ such that y = T(x).
3. The inverse transformation of T exists on {x ∈ Δ : x_{n+1} > 0} and is given by T^{-1} = [T_1^{-1}, ..., T_n^{-1}]^T, with T_i^{-1}(y) = a_i y_i / y_{n+1}.
4. T maps a to the center of the simplex Δ; that is, T(a) = e/(n+1) = [1/(n+1), ..., 1/(n+1)]^T ∈ R^{n+1}.
5. Suppose that x satisfies Ax = b, and y = T(x). Let x' = [y_1 a_1, ..., y_n a_n]^T. Then, Ax' = b y_{n+1}.

18.5 Let T be the projective transformation in Exercise 18.4 and A ∈ R^{m×n} be a given matrix. Prove that there exists a matrix A' ∈ R^{m×(n+1)} such that Ax = b if and only if A'T(x) = 0. Hint: Let the ith column of A' be given by a_i times the ith column of A, i = 1, ..., n, and the (n+1)th column of A' be given by -b.

18.6 Let T be the projective transformation in Exercise 18.4 and c ∈ R^n be a given vector. Prove that there exists a vector c' ∈ R^{n+1} such that c^T x = 0 if and only if c'^T T(x) = 0. Hint: Use property 3 in Exercise 18.4, with c' = [c'_1, ..., c'_{n+1}]^T given by c'_i = a_i c_i, i = 1, ..., n, and c'_{n+1} = 0.

18.7 Let Δ = {x ∈ R^n : e^T x = 1, x ≥ 0} be the simplex in R^n, n > 1, and let a_0 = e/n be its center. A sphere of radius r centered at a_0 is the set {x ∈ R^n : ||x - a_0|| ≤ r}. The sphere is said to be inscribed in Δ if {x ∈ R^n : ||x - a_0|| = r, e^T x = 1} ⊂ Δ. Show that the largest such sphere has radius r = 1/\sqrt{n(n-1)}.

18.8 Consider the following Karmarkar's restricted problem:

    minimize c^T x
    subject to Ax = 0
               x ∈ Δ.

Let x_0 ∈ Δ be a strictly interior point of Δ, and let D be a diagonal matrix whose diagonal entries are the components of x_0. Define the map U : Δ → Δ by U(x) = D^{-1}x/(e^T D^{-1}x). Let \bar{x} = U(x) represent a change of variable. Show that the following transformed LP problem in the variable \bar{x},

    minimize c^T D \bar{x}
    subject to A D \bar{x} = 0
               \bar{x} ∈ Δ,

is equivalent to the original LP problem above in the sense that x* is an optimal solution to the original problem if and only if \bar{x}* = U(x*) is an optimal solution to the transformed problem.

18.9 Let A ∈ R^{m×n}, m < n, and Ω = {x : Ax = 0}. Suppose that A satisfies rank [A; e^T] = m + 1. Let x_0 ∈ Δ ∩ Ω be a strictly interior point of Δ ⊂ R^n and let D be a diagonal matrix whose diagonal entries are the components of x_0. Consider the matrix B = [AD; e^T]. Show that rank B = m + 1, and hence BB^T is nonsingular.

18.10 Show that in Karmarkar's algorithm, x^{(k)} is a strictly interior point of Δ.
CHAPTER 19

INTEGER LINEAR PROGRAMMING

19.1 Introduction

This chapter is devoted to linear programs with the additional constraint that the solution components be integers. Such problems are called integer linear programming (ILP) (or simply integer programming) problems, and they arise naturally in many practical situations. For example, in Example 15.1, the decision variables represent production levels, which we allowed to take real values. If production levels correspond to actual numbers of products, then it is natural to impose the constraint that they be integer valued. If we expect solutions that are very large in magnitude, then ignoring the integer constraint might have little practical consequence. However, in cases where the solution is a relatively small integer (on the order of 10, say), ignoring the integer constraint could lead to dramatically erroneous solutions.

Throughout this chapter, we use the notation Z for the set of integers, Z^n for the set of vectors with n integer components, and Z^{m×n} for the set of m × n matrices with integer entries. Using this notation, we can express an ILP problem in the following form:

    minimize c^T x
    subject to Ax = b
               x ≥ 0
               x ∈ Z^n.

19.2 Unimodular Matrices

There is a class of ILP problems that can be solved using standard linear programming methods. To proceed, we need some definitions and background results. The reader should recall the definition of a minor from Section 2.2.

Definition 19.1 An m × n integer matrix A ∈ Z^{m×n}, m < n, is unimodular if all its nonzero mth-order minors are ±1 (i.e., either 1 or -1). ■

Unimodular matrices play a special role in the context of linear equations and integer basic solutions. Consider the linear equation Ax = b with A ∈ Z^{m×n}, m < n. Let B be a corresponding basis matrix (an m × m matrix consisting of m linearly independent columns of A). Then, the unimodularity of A is equivalent to |det B| = 1 for any such B. The following lemma connects unimodularity with integer basic solutions.

Lemma 19.1 Consider the linear equation Ax = b, where A ∈ Z^{m×n}, m < n, is unimodular and b ∈ Z^m. Then, all basic solutions have integer components. □

Proof. As usual, suppose that the first m columns of A constitute a basis, and that B is the invertible m × m matrix composed of these columns. Then the corresponding basic solution is

    x* = [(B^{-1}b)^T, 0^T]^T.

Because all the elements of A are integers, B is an integer matrix. Moreover, because A is unimodular, |det B| = 1. This implies that the inverse B^{-1} is also an integer matrix (see [62, p. 21]). Therefore, x* is an integer vector. ■

Corollary 19.1 Consider the LP constraint

    Ax = b
    x ≥ 0,

where A is unimodular, A ∈ Z^{m×n}, m < n, and b ∈ Z^m. Then, all basic feasible solutions have integer components. □

Unimodularity allows us to solve ILP problems using the simplex method. Specifically, consider the ILP problem

    minimize c^T x
    subject to Ax = b
               x ≥ 0
               x ∈ Z^n,

where A ∈ Z^{m×n}, m < n, is unimodular and b ∈ Z^m. The corollary above tells us that if we consider the associated LP problem

    minimize c^T x
    subject to Ax = b
               x ≥ 0,

the optimal basic feasible solution is an integer vector. This means that we can apply the simplex method to the LP problem above to obtain a solution to the original ILP problem.
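For small matrices, Definition 19.1 can be checked directly by enumerating all mth-order minors; the following MATLAB sketch does exactly that (the function name is ours, and the brute-force enumeration is practical only for small n, such as the matrix in the next example):

    % Check whether an integer matrix A (m < n) is unimodular, i.e.,
    % every nonzero m-th order minor equals +1 or -1.
    function tf = is_unimodular(A)
        [m, n] = size(A);
        cols = nchoosek(1:n, m);                % all choices of m columns
        tf = true;
        for i = 1:size(cols, 1)
            d = round(det(A(:, cols(i, :))));   % minor (integer for integer A)
            if d ~= 0 && abs(d) ~= 1
                tf = false;
                return;
            end
        end
    end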
Example 19.1 Consider the following ILP problem:

    maximize 2x_1 + 5x_2
    subject to x_1 + x_3 = 4
               x_2 + x_4 = 6
               x_1 + x_2 + x_5 = 8
               x_1, ..., x_5 ≥ 0
               x_1, ..., x_5 ∈ Z.

We can write this problem in matrix form with

    A = [1 0 1 0 0
         0 1 0 1 0
         1 1 0 0 1],    b = [4; 6; 8].

Notice that b ∈ Z^3. Moreover, it is easy to check that A is unimodular. Hence, the ILP problem above can be solved by solving the LP problem

    maximize 2x_1 + 5x_2
    subject to x_1 + x_3 = 4
               x_2 + x_4 = 6
               x_1 + x_2 + x_5 = 8
               x_1, ..., x_5 ≥ 0.

This was done in Example 16.2 using the simplex method, yielding the optimal solution [2, 6, 2, 0, 0]^T, which is an integer vector. ■

In general, when the matrix A is not unimodular, the simplex method applied to the associated LP problem may yield a noninteger optimal solution. However, in some cases, even if A is not unimodular, the simplex method still produces an integer optimal basic feasible solution. To see this, suppose that we are given A ∈ Z^{m×n}, m < n, and b ∈ Z^m. As long as each m × m basis matrix B consisting of columns of A corresponding to a basic feasible solution has the property that |det B| = 1, we can use the argument in the proof of Lemma 19.1 to conclude that the basic feasible solution is an integer vector. Equivalently, we can draw this conclusion if each basis submatrix B of A such that |det B| ≠ 1 corresponds to a nonfeasible basic solution. We illustrate this in the following example.

Example 19.2 Consider the ILP problem

    minimize -x_1 - 2x_2
    subject to -2x_1 + x_2 + x_3 = 2
               -x_1 + x_2 + x_4 = 3
               x_1 + x_5 = 3
               x_i ≥ 0, i = 1, ..., 5
               x_i ∈ Z, i = 1, ..., 5.

Can this ILP problem be solved using the simplex method? We can easily verify that the matrix

    A = [-2 1 1 0 0
         -1 1 0 1 0
          1 0 0 0 1]

is not unimodular. Indeed, it has one (and only one) basis submatrix with determinant other than ±1, consisting of the first, fourth, and fifth columns of A: if we write B = [a_1, a_4, a_5], then det B = -2. However, a closer examination of this matrix and the vector b = [2, 3, 3]^T reveals that the corresponding basic solution is not feasible: B^{-1}b = [-1, 2, 4]^T (which, coincidentally, happens to be an integer vector). Therefore, for this problem, applying the simplex method to the associated LP problem will produce an integer optimal basic feasible solution, which also solves the original ILP problem.

We begin by forming the first tableau:

          a1   a2   a3   a4   a5    b
          -2    1    1    0    0    2
          -1    1    0    1    0    3
           1    0    0    0    1    3
    c^T   -1   -2    0    0    0    0

We have r_2 = -2. Therefore, we introduce a_2 into the new basis. We calculate the ratios y_{i0}/y_{i2}, y_{i2} > 0, to determine the pivot element:

    y_{10}/y_{12} = 2/1 and y_{20}/y_{22} = 3/1.

We will use y_{12} as the pivot. Performing elementary row operations, we obtain the second tableau:

          a1   a2   a3   a4   a5    b
          -2    1    1    0    0    2
           1    0   -1    1    0    1
           1    0    0    0    1    3
    r^T   -5    0    2    0    0    4

We now have r_1 = -5 < 0. Therefore, we introduce a_1 into the new basis. We next calculate the ratios y_{i0}/y_{i1}, y_{i1} > 0, to determine the pivot element:

    y_{20}/y_{21} = 1/1 and y_{30}/y_{31} = 3/1.

We will use y_{21} as the pivot. Performing elementary row operations, we obtain the third tableau:

          a1   a2   a3   a4   a5    b
           0    1   -1    2    0    4
           1    0   -1    1    0    1
           0    0    1   -1    1    2
    r^T    0    0   -3    5    0    9

We have r_3 = -3 < 0. Therefore, we introduce a_3 into the new basis. We next calculate the ratios y_{i0}/y_{i3}, y_{i3} > 0, to determine the pivot element:

    y_{30}/y_{33} = 2/1.

We will use y_{33} as the pivot.
Performing elementary row operations, we obtain the fourth tableau:

          a1   a2   a3   a4   a5    b
           0    1    0    1    1    6
           1    0    0    0    1    3
           0    0    1   -1    1    2
    r^T    0    0    0    2    3   15

All reduced cost coefficients are now positive, which means that the current solution is optimal. This solution is [3, 6, 2, 0, 0]^T. ■

Next, we consider ILP problems of the form

    minimize c^T x
    subject to Ax ≤ b
               x ≥ 0
               x ∈ Z^n.

We saw in Section 15.5 that we can transform the inequality constraint Ax ≤ b into standard form by introducing slack variables. Doing so leads to a new problem in standard form in which the constraint has the form [A, I]y = b (where the vector y contains x and the slack variables). To deal with matrices of the form [A, I], we need another definition.

Definition 19.2 An m × n integer matrix A ∈ Z^{m×n} is totally unimodular if all its nonzero minors are ±1. ■

By minors here we mean pth-order minors for p ≤ min(m, n). Equivalently, a matrix A ∈ Z^{m×n} is totally unimodular if and only if all its square invertible submatrices have determinant ±1. By a submatrix of A we mean a matrix obtained by removing some columns and rows of A. It is easy to see from this definition that if an integer matrix is totally unimodular, then each of its entries is 0, 1, or -1.

The next proposition relates the total unimodularity of A to the unimodularity of [A, I] (see also Exercise 19.3).

Proposition 19.1 If an m × n integer matrix A ∈ Z^{m×n} is totally unimodular, then the matrix [A, I] is unimodular. □

Proof. Let A satisfy the assumptions of the proposition. We will show that any m × m invertible submatrix of [A, I] has determinant ±1. We first note that any m × m invertible submatrix of [A, I] that consists only of columns of A has determinant ±1 because A is totally unimodular. Moreover, the m × m submatrix I satisfies det I = 1.

Consider now an m × m invertible submatrix of [A, I] composed of k columns of A and m - k columns of I. Without loss of generality, suppose that this submatrix is composed of the last k columns of A and the first m - k columns of I; that is, the m × m invertible submatrix is

    B = [a_{n-k+1} ··· a_n  e_1 ··· e_{m-k}] = [H_{m-k,k}  I_{m-k}
                                                 B_{k,k}    O],

where e_i is the ith column of the identity matrix. This choice of columns is without loss of generality because we can exchange rows and columns to arrive at this form, and each exchange only changes the sign of the determinant. Moreover, note that det B = ± det B_{k,k} (see also Exercises 19.4 and 2.4). Thus, B_{k,k} is invertible because B is invertible. Moreover, because B_{k,k} is a submatrix of A and A is totally unimodular, det B_{k,k} = ±1. Hence, det B = ±1 also. Thus, any m × m invertible submatrix of [A, I] has determinant ±1, which implies that [A, I] is unimodular. ■
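Total unimodularity can likewise be checked by brute force on small examples, now over square submatrices of every order; the sketch below is purely illustrative (the function name is ours, and the enumeration cost grows combinatorially with the matrix size):

    % Check whether an integer matrix A is totally unimodular: every
    % square submatrix must have determinant 0, +1, or -1.
    function tf = is_totally_unimodular(A)
        [m, n] = size(A);
        tf = true;
        for p = 1:min(m, n)
            rows = nchoosek(1:m, p);
            cols = nchoosek(1:n, p);
            for i = 1:size(rows, 1)
                for j = 1:size(cols, 1)
                    d = round(det(A(rows(i, :), cols(j, :))));
                    if abs(d) > 1        % a minor other than 0, +1, -1
                        tf = false;
                        return;
                    end
                end
            end
        end
    end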
Combining the result above with Lemma 19.1, we obtain the following corollary.

Corollary 19.2 Consider the LP constraint

    [A, I]x = b
    x ≥ 0,

where A ∈ Z^{m×n} is totally unimodular and b ∈ Z^m. Then, all basic feasible solutions have integer components. □

Total unimodularity of A allows us to solve ILP problems of the following form using the simplex method:

    minimize c^T x
    subject to Ax ≤ b
               x ≥ 0
               x ∈ Z^n,

where b ∈ Z^m. Specifically, we first consider the associated LP problem

    minimize c^T x
    subject to Ax ≤ b
               x ≥ 0.

If A is totally unimodular, then the corollary above tells us that once we convert this problem into standard form by introducing a slack-variable vector z,

    minimize c^T x
    subject to [A, I] [x; z] = b
               x, z ≥ 0,

the optimal basic feasible solution is an integer vector. This means that we can apply the simplex method to the LP problem above to obtain a solution to the original ILP problem. Note that although we only needed the x part of the solution to be integer, the slack-variable vector z is automatically integer for any integer x, because both A and b contain only integers (see also Exercise 19.5).

Example 19.3 Consider the following ILP problem:

    maximize 2x_1 + 5x_2
    subject to x_1 ≤ 4
               x_2 ≤ 6
               x_1 + x_2 ≤ 8
               x_1, x_2 ≥ 0
               x_1, x_2 ∈ Z.

This problem can be written in the matrix form above with

    A = [1 0
         0 1
         1 1],    b = [4; 6; 8].

It is easy to check that A is totally unimodular. Hence, the ILP problem above can be solved by solving the LP problem

    maximize 2x_1 + 5x_2
    subject to x_1 + x_3 = 4
               x_2 + x_4 = 6
               x_1 + x_2 + x_5 = 8
               x_1, ..., x_5 ≥ 0,

as was done in Example 16.2. ■

As discussed before, even if [A, I] is not unimodular, the simplex algorithm might still yield a solution to the original ILP. In particular, even if A is not totally unimodular, the method above might still work, as illustrated in the following example.

Example 19.4 Consider the following ILP problem:

    maximize x_1 + 2x_2
    subject to -2x_1 + x_2 ≤ 2
               x_1 - x_2 ≥ -3
               x_1 ≤ 3
               x_1 ≥ 0, x_2 ≥ 0, x_1, x_2 ∈ Z.

We first express the given problem in the equivalent form

    minimize -x_1 - 2x_2
    subject to -2x_1 + x_2 ≤ 2
               -x_1 + x_2 ≤ 3
               x_1 ≤ 3
               x_1 ≥ 0, x_2 ≥ 0, x_1, x_2 ∈ Z.

We next represent the problem above in standard form by introducing slack variables x_3, x_4, and x_5 to obtain

    minimize -x_1 - 2x_2
    subject to -2x_1 + x_2 + x_3 = 2
               -x_1 + x_2 + x_4 = 3
               x_1 + x_5 = 3
               x_i ≥ 0, i = 1, ..., 5.

This problem is now of the form in Example 19.2, where the simplex method was used. Recall that the solution is [3, 6, 2, 0, 0]^T. Thus, the solution to the original problem is x* = [3, 6]^T. Note that the matrix

    A = [-2  1
         -1  1
          1  0]

is not totally unimodular, because it has an entry (-2) not equal to 0, 1, or -1. Indeed, the matrix [A, I] is not unimodular. However, in this case, the simplex method still produced an optimal solution to the ILP, as explained in Example 19.2. ■

19.3 The Gomory Cutting-Plane Method

In 1958, Ralph E. Gomory [54] proposed a method in which noninteger optimal solutions obtained using the simplex method are successively removed from the feasible set by adding constraints that exclude these noninteger solutions from the feasible set. The additional constraints, referred to as Gomory cuts, do not eliminate integer feasible solutions from the feasible set. The process is repeated until the optimal solution is an integer vector. To describe Gomory cuts, we use the floor operator, defined next.

Definition 19.3 The floor of a real number x, denoted ⌊x⌋, is the integer obtained by rounding x toward -∞. ■

For example, ⌊3.4⌋ = 3 and ⌊-3.4⌋ = -4.

Consider the ILP problem

    minimize c^T x
    subject to Ax = b
               x ≥ 0
               x ∈ Z^n.

We begin by applying the simplex method to obtain an optimal basic feasible solution to the LP problem

    minimize c^T x
    subject to Ax = b
               x ≥ 0.

As usual, suppose that the first m columns form the basis for the optimal basic feasible solution. The corresponding canonical augmented matrix is

    a1   a2   ···   am   a_{m+1}     ···   an          b
     1    0   ···    0   y_{1,m+1}   ···   y_{1,n}   y_{10}
     0    1   ···    0   y_{2,m+1}   ···   y_{2,n}   y_{20}
     ⋮    ⋮          ⋮       ⋮                ⋮          ⋮
     0    0   ···    1   y_{m,m+1}   ···   y_{m,n}   y_{m0}

Consider the ith component of the optimal basic feasible solution, y_{i0}, and suppose that y_{i0} is not an integer.
Note that any feasible vector x satisfies the equality constraint (taken from the ith row)

    x_i + \sum_{j=m+1}^{n} y_{ij} x_j = y_{i0}.

We use this equation to derive an additional constraint that eliminates the current optimal noninteger solution from the feasible set without eliminating any integer feasible solution. To see how, consider the inequality constraint

    x_i + \sum_{j=m+1}^{n} ⌊y_{ij}⌋ x_j ≤ y_{i0}.

Because ⌊y_{ij}⌋ ≤ y_{ij}, any x ≥ 0 that satisfies the first equality constraint above also satisfies this inequality constraint. Thus, any feasible x satisfies this inequality constraint. Moreover, for any integer feasible vector x, the left-hand side of the inequality constraint is an integer. Therefore, any integer feasible vector x also satisfies

    x_i + \sum_{j=m+1}^{n} ⌊y_{ij}⌋ x_j ≤ ⌊y_{i0}⌋.

Subtracting this inequality from the equation above, we deduce that any integer feasible vector satisfies

    \sum_{j=m+1}^{n} (y_{ij} - ⌊y_{ij}⌋) x_j ≥ y_{i0} - ⌊y_{i0}⌋.

Next, notice that the optimal basic feasible solution above does not satisfy this inequality, because its left-hand side at the optimal basic feasible solution is 0, whereas the right-hand side is a positive number. Therefore, if we impose the additional inequality constraint above on the original LP problem, the new constraint set is such that the current optimal basic feasible solution is no longer feasible, yet every integer feasible vector remains feasible. This new constraint is called a Gomory cut.

To transform the new LP problem into standard form, we introduce the surplus variable x_{n+1} to obtain the equality constraint

    \sum_{j=m+1}^{n} (y_{ij} - ⌊y_{ij}⌋) x_j - x_{n+1} = y_{i0} - ⌊y_{i0}⌋.

For convenience, we will also call this equality constraint a Gomory cut. By augmenting this equation into A and b, or canonical versions of them (e.g., in the form of a simplex tableau), we obtain a new LP problem in standard form. We can then solve the new problem using the simplex method and examine the resulting optimal basic feasible solution. If the solution satisfies the integer constraints, then we are done; this vector gives an optimal solution to the original ILP problem by extracting the appropriate components. If the solution does not satisfy the integer constraints, we introduce another Gomory cut and repeat the process. We call this procedure the Gomory cutting-plane method.

Note that in applying the Gomory cutting-plane method, we only need to introduce enough cuts to satisfy the integer constraints of the original ILP problem. The additional variables introduced by slack variables or by the Gomory cuts are not constrained to be integers. In the following two examples, we illustrate how the Gomory cutting-plane method can be implemented by incorporating Gomory cuts directly into the simplex tableau.
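Before turning to the examples, note that computing the cut coefficients from a tableau row takes essentially one line of MATLAB; the sketch below is a minimal helper (the function name is ours):

    % Gomory cut from the tableau row  x_i + sum_j y_ij*x_j = y_i0  with
    % y_i0 noninteger: the cut is  sum_j frac(y_ij)*x_j >= frac(y_i0),
    % where frac(t) = t - floor(t).
    function [cut, rhs] = gomory_cut(y_row, y_i0)
        cut = y_row - floor(y_row);   % fractional parts of nonbasic coefficients
        rhs = y_i0 - floor(y_i0);     % fractional part of the right-hand side
    end

For instance, the row x_2 + (10/14)x_3 - (10/14)x_4 = 20/14 appearing in Example 19.5 below gives cut coefficients [10/14, 4/14] and right-hand side 6/14, since floor(-10/14) = -1 and floor(20/14) = 1.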
Example 19.5 Consider the following ILP problem (we thank David Schvartzman Cohenca for his solution to this problem):

    maximize 3x_1 + 4x_2
    subject to (2/5)x_1 + x_2 ≤ 3
               (2/5)x_1 - (2/5)x_2 ≤ 1
               x_1, x_2 ≥ 0
               x_1, x_2 ∈ Z.

We first solve the problem graphically. The constraint set Ω for the associated LP problem (without integer constraints) can be found by calculating its extreme points:

    x^{(1)} = [0, 0]^T,  x^{(2)} = [5/2, 0]^T,  x^{(3)} = [0, 3]^T,  x^{(4)} = [55/14, 10/7]^T.

In Figure 19.1 we show the feasible set Ω. In Figure 19.2 we show the feasible set for the ILP problem, which allows us to solve the problem graphically. The solution is obtained by finding the straight line f = 3x_1 + 4x_2 with the largest f that passes through a feasible point with integer components. This can be accomplished by first drawing the line f = 3x_1 + 4x_2 for f = 0 and then gradually increasing the value of f, which corresponds to sliding the line across the feasible region until it passes through the "last" integer feasible point yielding the largest value of the objective function. From Figure 19.2, we can see that the optimal solution to the ILP problem is [2, 2]^T.

Figure 19.1 Feasible set Ω for the LP problem in Example 19.5.

Figure 19.2 Graphical solution for the ILP problem in Example 19.5.

We now solve the problem using the Gomory cutting-plane method. First we represent the associated LP problem in standard form:

    maximize 3x_1 + 4x_2
    subject to (2/5)x_1 + x_2 + x_3 = 3
               (2/5)x_1 - (2/5)x_2 + x_4 = 1
               x_1, x_2, x_3, x_4 ≥ 0.

Note that we only need the first two components of the solution to be integers. We can start the simplex method immediately because we have an obvious basic feasible solution (to handle maximization, we apply the simplex method to the equivalent problem of minimizing -3x_1 - 4x_2). The first tableau is

          a1      a2     a3   a4    b
          2/5      1      1    0    3
          2/5    -2/5     0    1    1
    c^T    -3     -4      0    0    0

We bring a_2 into the basis and pivot about the element (1, 2) to obtain

          a1      a2    a3    a4     b
          2/5      1     1     0     3
        14/25      0    2/5    1   11/5
    r^T  -7/5      0     4     0    12

Next, we pivot about the element (2, 1) to obtain

          a1   a2     a3       a4       b
           0    1   10/14   -10/14   20/14
           1    0   10/14    25/14   55/14
    r^T    0    0      5      5/2     35/2

The corresponding optimal basic feasible solution is

    [55/14, 10/7, 0, 0]^T,

which does not satisfy the integer constraints. We start by introducing the Gomory cut corresponding to the first row of the tableau. We obtain

    (10/14)x_3 + (4/14)x_4 - x_5 = 6/14.

We add this constraint to our tableau:

          a1   a2     a3       a4      a5     b
           0    1   10/14   -10/14     0    20/14
           1    0   10/14    25/14     0    55/14
           0    0   10/14     4/14    -1     6/14
    r^T    0    0      5       5/2     0    35/2

Pivoting about the element (3, 3) gives

          a1   a2   a3    a4      a5      b
           0    1    0    -1       1      1
           1    0    0    3/2      1     7/2
           0    0    1    2/5    -7/5    3/5
    r^T    0    0    0    1/2      7    29/2

The corresponding optimal basic feasible solution is [7/2, 1, 3/5, 0, 0]^T, which still does not satisfy the integer constraints. Next, we construct the Gomory cut for the second row of the tableau:

    (1/2)x_4 - x_6 = 1/2.

We add this constraint to our tableau to obtain

          a1   a2   a3    a4      a5    a6     b
           0    1    0    -1       1     0     1
           1    0    0    3/2      1     0    7/2
           0    0    1    2/5    -7/5    0    3/5
           0    0    0    1/2      0    -1    1/2
    r^T    0    0    0    1/2      7     0   29/2

Pivoting about (4, 4), we get

          a1   a2   a3   a4    a5     a6     b
           0    1    0    0     1     -2     2
           1    0    0    0     1      3     2
           0    0    1    0   -7/5    4/5   1/5
           0    0    0    1     0     -2     1
    r^T    0    0    0    0     7      1    14

In this optimal basic feasible solution, the first two components are integers. Thus, we conclude that the solution to our ILP problem is [2, 2]^T, which agrees with the graphical solution in Figure 19.2. ■

In Example 19.5, the final solution to the LP problem after applying the Gomory cutting-plane method is not an integer vector. Only the first two components are integers, as these are the only two components in the original ILP problem. As pointed out earlier, the slack variables and the variables introduced by the Gomory cuts are not constrained to be integers. However, if we are given an ILP problem with inequality constraints as in Example 19.5 but with only integer values in the constraint data, then the slack variables and those introduced by the Gomory cuts are automatically integer valued (see also Exercise 19.9). We illustrate this in the following example.

Example 19.6 Consider the following ILP problem:

    maximize 3x_1 + 4x_2
    subject to 3x_1 - x_2 ≤ 12
               3x_1 + 11x_2 ≤ 66
               x_1, x_2 ≥ 0
               x_1, x_2 ∈ Z.

A graphical solution to this ILP problem is shown in Figure 19.3.
As in Example 19.5, the solution is obtained by finding the straight line f = 3x_1 + 4x_2 with the largest f that passes through a feasible point with integer components. This point is [5, 4]^T.

Figure 19.3 Graphical solution of the ILP problem in Example 19.6, where integer feasible solutions are marked with heavy dots.

We now solve the ILP problem above using the simplex method with Gomory cuts. We first represent the associated LP problem in standard form by introducing slack variables x_3 and x_4. The initial tableau has the form

          a1    a2   a3   a4    b
           3    -1    1    0   12
           3    11    0    1   66
    c^T   -3    -4    0    0    0

In this case there is an obvious initial basic feasible solution available, which allows us to initialize the simplex method to solve the problem. After two iterations of the simplex algorithm, the final tableau is

          a1   a2     a3      a4      b
           1    0   11/36    1/36   11/2
           0    1   -1/12    1/12    9/2
    r^T    0    0    7/12    5/12   69/2

with optimal solution

    x* = [11/2, 9/2, 0, 0]^T.

Both basic components are noninteger. Let us construct a Gomory cut for the first basic component x_1 = 11/2. From the first row of the tableau, the associated constraint equation is

    x_1 + (11/36)x_3 + (1/36)x_4 = 11/2.

If we apply the floor operator to this equation as explained before, we get the inequality constraint x_1 ≤ 5. A graphical solution of the problem after adding this inequality constraint to the original LP problem is shown in Figure 19.4. We can see that in this new problem, the first component of the optimal solution is an integer, but not the second. This means that a single Gomory cut will not suffice.

Figure 19.4 Graphical solution of the ILP in Example 19.6 after adding the constraint x_1 ≤ 5 to the original constraints.

To continue with the Gomory procedure using the simplex method, we first write down the Gomory cut

    (11/36)x_3 + (1/36)x_4 - x_5 = 1/2.

We now obtain a new tableau by augmenting the previous tableau with this constraint:

          a1   a2     a3      a4     a5     b
           1    0   11/36    1/36    0    11/2
           0    1   -1/12    1/12    0     9/2
           0    0   11/36    1/36   -1     1/2
    r^T    0    0    7/12    5/12    0    69/2

At this point, there is no obvious basic feasible solution. However, we can easily use the two-phase method. This yields

          a1   a2   a3    a4      a5       b
           1    0    0     0       1       5
           0    1    0    1/11   -3/11   51/11
           0    0    1    1/11  -36/11   18/11
    r^T    0    0    0    4/11   21/11  369/11

which has all nonnegative reduced cost coefficients. Hence, we obtain the optimal basic feasible solution

    [5, 51/11, 18/11, 0, 0]^T.

As expected, the second component does not satisfy the integer constraint. Next, we write down the Gomory cut for the basic component x_2 = 51/11 using the numbers in the second row of the tableau:

    (1/11)x_4 + (8/11)x_5 - x_6 = 7/11.

Updating the tableau gives

          a1   a2   a3    a4      a5     a6     b
           1    0    0     0       1      0     5
           0    1    0    1/11   -3/11    0   51/11
           0    0    1    1/11  -36/11    0   18/11
           0    0    0    1/11    8/11   -1    7/11
    r^T    0    0    0    4/11   21/11    0  369/11

Again, there is no obvious basic feasible solution. Applying the two-phase method gives

          a1   a2   a3   a4    a5     a6      b
           1    0    0  -1/8    0    11/8    33/8
           0    1    0   1/8    0    -3/8    39/8
           0    0    1   1/2    0    -9/2     9/2
           0    0    0   1/8    1   -11/8     7/8
    r^T    0    0    0   1/8    0    21/8   255/8

The corresponding optimal basic feasible solution still does not satisfy the integer constraints; neither the first nor the second component is an integer. Next, we introduce the Gomory cut using the numbers in the second row of the previous tableau to obtain

    (1/8)x_4 + (5/8)x_6 - x_7 = 7/8.
Augmenting the tableau with this constraint gives

          a1   a2   a3   a4    a5     a6    a7     b
           1    0    0  -1/8    0    11/8    0    33/8
           0    1    0   1/8    0    -3/8    0    39/8
           0    0    1   1/2    0    -9/2    0     9/2
           0    0    0   1/8    1   -11/8    0     7/8
           0    0    0   1/8    0     5/8   -1     7/8
    r^T    0    0    0   1/8    0    21/8    0   255/8

Applying the two-phase method again gives

          a1   a2   a3   a4    a5    a6    a7     b
           1    0    0    0     1     0     0     5
           0    1    0    0   -1/2    0    1/2    4
           0    0    1    0   -7/2    0    1/2    1
           0    0    0    1    5/2    0  -11/2    7
           0    0    0    0   -1/2    1   -1/2    0
    r^T    0    0    0    0     1     0     2    31

(Note that this basic feasible solution is degenerate; the corresponding basis is not unique.) The corresponding optimal basic feasible solution is

    [5, 4, 1, 7, 0, 0, 0]^T,

which satisfies the integer constraints. From this, we see that the integer optimal solution to the original ILP problem is [5, 4]^T, which agrees with our graphical solution in Figure 19.3. In this example, the final solution to the LP problem, after introducing slack variables and using the Gomory cutting-plane method, is an integer vector. The reason for this, in contrast with Example 19.5, is that the original ILP inequality constraint data contain only integers. ■

A linear programming problem in which not all of the components are required to be integers is called a mixed integer linear programming (MILP) problem. Gomory cuts are also relevant to solving MILP problems. In fact, Example 19.5 is an instance of an MILP problem, because the slack variables in the standard form of the problem are not constrained to be integers. Moreover, the cutting-plane idea has also been applied to nonsimplex methods and nonlinear programming algorithms. For other methods for solving ILPs, see [119].

EXERCISES

19.1 Show that if A is totally unimodular, then so is any submatrix of it.

19.2 Show that if A is totally unimodular, then so is A^T.

19.3 Show that A is totally unimodular if and only if [A, I] is totally unimodular. This result is stronger than Proposition 19.1.

19.4 Consider the matrix B in the proof of Proposition 19.1:

    B = [H_{m-k,k}  I_{m-k}
         B_{k,k}    O].

Show that det B = ± det B_{k,k}.

19.5 Consider the constraint

    Ax ≤ b,  x ∈ Z^n,

where A and b contain only integers. Suppose that we introduce the slack-variable vector z to obtain the equivalent constraint

    [A, I] [x; z] = b,  x ∈ Z^n,  z ≥ 0.

Show that if z satisfies this constraint (with some x), then z is an integer vector.

19.6 Write a MATLAB program to generate Figures 19.1 and 19.2.

19.7 Consider the constraint in standard form Ax = b. Suppose that we augment this with a Gomory cut to obtain the new standard-form constraint in the variables x and x_{n+1}, as described in Section 19.3. Let x_{n+1} satisfy this constraint with an integer vector x. Show that x_{n+1} is an integer.

19.8 Consider the ILP problem in standard form

    minimize c^T x
    subject to Ax = b
               x ≥ 0
               x ∈ Z^n.

Show that if we use the Gomory cutting-plane method with the simplex algorithm, then the final optimal basic feasible solution, including the variables introduced by the Gomory method, is an integer vector. (Use Exercise 19.7.)

19.9 Consider the ILP problem

    minimize c^T x
    subject to Ax ≤ b
               x ≥ 0
               x ∈ Z^n.

Suppose that we introduce slack variables to convert the problem into standard form, and we use the Gomory cutting-plane method with the simplex algorithm to solve the resulting problem. Show that the final optimal basic feasible solution, including the slack variables and the variables introduced by the Gomory method, is an integer vector. (Use Exercises 19.5 and 19.8.)

19.10 Use a graphical method to find an integer solution to the dual of the ILP problem in Example 19.5.
PART IV

NONLINEAR CONSTRAINED OPTIMIZATION

CHAPTER 20

PROBLEMS WITH EQUALITY CONSTRAINTS

20.1 Introduction

In this part we discuss methods for solving a class of nonlinear constrained optimization problems that can be formulated as

    minimize f(x)
    subject to h_i(x) = 0, i = 1, ..., m
               g_j(x) ≤ 0, j = 1, ..., p,

where x ∈ R^n, f : R^n → R, h_i : R^n → R, g_j : R^n → R, and m ≤ n. In vector notation, the problem above can be represented in the following standard form:

    minimize f(x)
    subject to h(x) = 0
               g(x) ≤ 0,

where h : R^n → R^m and g : R^n → R^p. As usual, we adopt the following terminology.

Definition 20.1 Any point satisfying the constraints is called a feasible point. The set of all feasible points,

    {x ∈ R^n : h(x) = 0, g(x) ≤ 0},

is called a feasible set. ■

Optimization problems of the above form are not new to us. Indeed, linear programming problems of the form

    minimize c^T x
    subject to Ax = b
               x ≥ 0,

which we studied in Part III, are of this type. As we remarked in Part II, there is no loss of generality in considering only minimization problems, for if we are confronted with a maximization problem, it can easily be transformed into a minimization problem by observing that maximizing f is equivalent to minimizing -f.

We illustrate the problems we study in this part by considering the following simple numerical example.

Example 20.1 Consider the following optimization problem:

    minimize (x_1 - 1)^2 + x_2 - 2
    subject to x_2 - x_1 = 1
               x_1 + x_2 ≤ 2.

This problem is already in the standard form given earlier, with f(x_1, x_2) = (x_1 - 1)^2 + x_2 - 2, h(x_1, x_2) = x_2 - x_1 - 1, and g(x_1, x_2) = x_1 + x_2 - 2. This problem turns out to be simple enough to be solved graphically (see Figure 20.1). In the figure, the set of points that satisfy the constraints (the feasible set) is marked by the heavy solid line. The inverted parabolas represent level sets of the objective function f; the lower the level set, the smaller the objective function value. Therefore, the solution can be obtained by finding the lowest-level set that intersects the feasible set. In this case, the minimizer lies on the level set with f = -1/4. The minimizer of the objective function is x* = [1/2, 3/2]^T. ■

Figure 20.1 Graphical solution to the problem in Example 20.1.

In the remainder of this chapter we discuss constrained optimization problems with only equality constraints. The general constrained optimization problem is discussed in the chapters to follow.
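The graphical solution in Example 20.1 is easy to confirm numerically, because eliminating the equality constraint reduces the problem to one variable. The following MATLAB fragment carries out this check (the substitution x_2 = x_1 + 1, valid on the feasible set, is ours):

    % Check of Example 20.1 by eliminating the equality constraint:
    % substituting x2 = x1 + 1 gives f(x1) = (x1 - 1)^2 + (x1 + 1) - 2.
    f = @(x1) (x1 - 1).^2 + (x1 + 1) - 2;
    % Setting the derivative 2*(x1 - 1) + 1 to zero gives x1 = 1/2, so
    % x2 = 3/2; this point satisfies x1 + x2 <= 2 (with equality).
    x1 = 1/2; x2 = x1 + 1;
    fval = f(x1)    % returns -0.25, matching the level set f = -1/4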
Figure 20.2 Two-dimensional surface $S = \{[x_1, x_2, x_3]^T : x_2 - x_3^2 = 0\}$ in $\mathbb{R}^3$.

Assuming that the points in $S$ are regular, the dimension of the surface $S$ is $n - m$.

Example 20.2 Let $n = 3$ and $m = 1$ (i.e., we are operating in $\mathbb{R}^3$). Assuming that all points in $S$ are regular, the set $S$ is a two-dimensional surface. For example, let $h_1(x) = x_2 - x_3^2 = 0$. Note that $\nabla h_1(x) = [0, 1, -2x_3]^T$, and hence $\nabla h_1(x) \ne 0$ for any $x \in \mathbb{R}^3$. In this case,
$$\dim S = \dim\{x : h_1(x) = 0\} = n - m = 2.$$
See Figure 20.2 for a graphical illustration.

Example 20.3 Let $n = 3$ and $m = 2$. Assuming regularity, the feasible set $S$ is a one-dimensional object (i.e., a curve in $\mathbb{R}^3$). For example, let $h_1(x) = x_1$ and $h_2(x) = x_2 - x_3^2$. In this case, $\nabla h_1(x) = [1, 0, 0]^T$ and $\nabla h_2(x) = [0, 1, -2x_3]^T$. Hence, the vectors $\nabla h_1(x)$ and $\nabla h_2(x)$ are linearly independent in $\mathbb{R}^3$. Thus,
$$\dim S = \dim\{x : h_1(x) = 0, h_2(x) = 0\} = n - m = 1.$$
See Figure 20.3 for a graphical illustration.

20.3 Tangent and Normal Spaces

In this section we discuss the notions of the tangent space and the normal space at a point on a surface. We begin by defining a curve on a surface $S$.

Figure 20.3 One-dimensional surface in $\mathbb{R}^3$.

Definition 20.3 A curve $C$ on a surface $S$ is a set of points $\{x(t) \in S : t \in (a, b)\}$, continuously parameterized by $t \in (a, b)$; that is, $x : (a, b) \to S$ is a continuous function.

A graphical illustration of the definition of a curve is given in Figure 20.4. The definition of a curve implies that all the points on the curve satisfy the equation describing the surface. The curve $C$ passes through a point $x^*$ if there exists $t^* \in (a, b)$ such that $x(t^*) = x^*$. Intuitively, we can think of a curve $C = \{x(t) : t \in (a, b)\}$ as the path traversed by a point $x$ traveling on the surface $S$. The position of the point at time $t$ is given by $x(t)$.

Figure 20.4 Curve on a surface.

Definition 20.4 The curve $C = \{x(t) : t \in (a, b)\}$ is differentiable if
$$\dot{x}(t) = \frac{dx}{dt}(t) = [\dot{x}_1(t), \ldots, \dot{x}_n(t)]^T$$
exists for all $t \in (a, b)$. The curve $C$ is twice differentiable if
$$\ddot{x}(t) = \frac{d^2 x}{dt^2}(t) = [\ddot{x}_1(t), \ldots, \ddot{x}_n(t)]^T$$
exists for all $t \in (a, b)$.

Figure 20.5 Geometric interpretation of the differentiability of a curve.

Note that both $\dot{x}(t)$ and $\ddot{x}(t)$ are $n$-dimensional vectors. We can think of $\dot{x}(t)$ and $\ddot{x}(t)$ as the velocity and acceleration, respectively, of a point traversing the curve $C$ with position $x(t)$ at time $t$. The vector $\dot{x}(t)$ points in the direction of the instantaneous motion of $x(t)$. Therefore, the vector $\dot{x}(t^*)$ is tangent to the curve $C$ at $x^*$ (see Figure 20.5).

We are now ready to introduce the notion of a tangent space. For this, recall the set $S = \{x \in \mathbb{R}^n : h(x) = 0\}$, where $h \in \mathcal{C}^1$. We think of $S$ as a surface in $\mathbb{R}^n$.

Definition 20.5 The tangent space at a point $x^*$ on the surface $S = \{x \in \mathbb{R}^n : h(x) = 0\}$ is the set $T(x^*) = \{y : Dh(x^*) y = 0\}$.

Note that the tangent space $T(x^*)$ is the nullspace of the matrix $Dh(x^*)$:
$$T(x^*) = \mathcal{N}(Dh(x^*)).$$
The tangent space is therefore a subspace of $\mathbb{R}^n$. Assuming that $x^*$ is regular, the dimension of the tangent space is $n - m$, where $m$ is the number of equality constraints $h_i(x^*) = 0$.

Note that the tangent space passes through the origin. However, it is often convenient to picture the tangent space as a plane that passes through the point $x^*$. For this, we define the tangent plane at $x^*$ to be the set
$$TP(x^*) = T(x^*) + x^* = \{x + x^* : x \in T(x^*)\}.$$
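Since $T(x^*) = \mathcal{N}(Dh(x^*))$, a basis for the tangent space can be computed as a nullspace basis of the Jacobian. A short sketch, using the surface of Example 20.2 and an assumed sample point $x^* = [1, 4, 2]^T$ on it:

import numpy as np
from scipy.linalg import null_space

# Surface S = {x in R^3 : h(x) = x2 - x3^2 = 0} from Example 20.2.
def Dh(x):
    return np.array([[0.0, 1.0, -2.0*x[2]]])   # gradient of h, as a row

x_star = np.array([1.0, 4.0, 2.0])             # h(x_star) = 4 - 2^2 = 0
T_basis = null_space(Dh(x_star))               # columns span T(x_star)
print(T_basis.shape)                           # (3, 2): dim T = n - m = 2
print(np.allclose(Dh(x_star) @ T_basis, 0))    # True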
Figure 20.6 Tangent plane to the surface $S$ at the point $x^*$.

Figure 20.6 illustrates the notion of a tangent plane, and Figure 20.7 the relationship between the tangent plane and the tangent space.

Example 20.4 Let
$$S = \{x \in \mathbb{R}^3 : h_1(x) = x_1 = 0, \; h_2(x) = x_1 - x_2 = 0\}.$$
Then, $S$ is the $x_3$-axis in $\mathbb{R}^3$ (see Figure 20.8). We have
$$Dh(x) = \begin{bmatrix} \nabla h_1(x)^T \\ \nabla h_2(x)^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & -1 & 0 \end{bmatrix}.$$
Because $\nabla h_1$ and $\nabla h_2$ are linearly independent when evaluated at any $x \in S$, all the points of $S$ are regular. The tangent space at an arbitrary point of $S$ is
$$T(x) = \{y : \nabla h_1(x)^T y = 0, \; \nabla h_2(x)^T y = 0\} = \{[0, 0, \alpha]^T : \alpha \in \mathbb{R}\},$$
that is, the $x_3$-axis in $\mathbb{R}^3$. In this example, the tangent space $T(x)$ at any point $x \in S$ is a one-dimensional subspace of $\mathbb{R}^3$.

Figure 20.7 Tangent spaces and planes in $\mathbb{R}^2$ and $\mathbb{R}^3$.

Intuitively, we would expect the tangent space at a point on a surface to be the collection of all "tangent vectors" to the surface at that point. We have seen that the derivative of a curve on a surface at a point is a tangent vector to the curve, and hence to the surface. This intuition agrees with our definition whenever $x^*$ is regular, as stated in the theorem below.

Theorem 20.1 Suppose that $x^* \in S$ is a regular point and $T(x^*)$ is the tangent space at $x^*$. Then, $y \in T(x^*)$ if and only if there exists a differentiable curve in $S$ passing through $x^*$ with derivative $y$ at $x^*$.

Proof. $\Leftarrow$: Suppose that there exists a curve $\{x(t) : t \in (a, b)\}$ in $S$ such that $x(t^*) = x^*$ and $\dot{x}(t^*) = y$ for some $t^* \in (a, b)$. Then, $h(x(t)) = 0$ for all $t \in (a, b)$. If we differentiate the function $h(x(t))$ with respect to $t$ using the chain rule, we obtain
$$\frac{d}{dt} h(x(t)) = Dh(x(t)) \dot{x}(t) = 0$$
for all $t \in (a, b)$. Therefore, at $t^*$ we get $Dh(x^*) y = 0$, and hence $y \in T(x^*)$.

$\Rightarrow$: To prove this, we need to use the implicit function theorem. We refer the reader to [88, p. 325]. ∎

Figure 20.8 The surface $S = \{x \in \mathbb{R}^3 : x_1 = 0, \; x_1 - x_2 = 0\}$.

We now introduce the notion of a normal space.

Definition 20.6 The normal space $N(x^*)$ at a point $x^*$ on the surface $S = \{x \in \mathbb{R}^n : h(x) = 0\}$ is the set
$$N(x^*) = \{x \in \mathbb{R}^n : x = Dh(x^*)^T z, \; z \in \mathbb{R}^m\}.$$

We can express the normal space $N(x^*)$ as $N(x^*) = \mathcal{R}(Dh(x^*)^T)$, that is, the range of the matrix $Dh(x^*)^T$. Note that the normal space $N(x^*)$ is the subspace of $\mathbb{R}^n$ spanned by the vectors $\nabla h_1(x^*), \ldots, \nabla h_m(x^*)$; that is,
$$N(x^*) = \operatorname{span}[\nabla h_1(x^*), \ldots, \nabla h_m(x^*)] = \{x \in \mathbb{R}^n : x = z_1 \nabla h_1(x^*) + \cdots + z_m \nabla h_m(x^*), \; z_1, \ldots, z_m \in \mathbb{R}\}.$$

Figure 20.9 Normal space in $\mathbb{R}^3$.

Note that the normal space contains the zero vector. Assuming that $x^*$ is regular, the dimension of the normal space $N(x^*)$ is $m$. As in the case of the tangent space, it is often convenient to picture the normal space $N(x^*)$ as passing through the point $x^*$ (rather than through the origin of $\mathbb{R}^n$). For this, we define the normal plane at $x^*$ as the set
$$NP(x^*) = N(x^*) + x^* = \{x + x^* \in \mathbb{R}^n : x \in N(x^*)\}.$$
Figure 20.9 illustrates the normal space and plane in $\mathbb{R}^3$ (i.e., $n = 3$ and $m = 1$).

We now show that the tangent space and normal space are orthogonal complements of each other (see Section 3.3).

Lemma 20.1 We have $T(x^*) = N(x^*)^\perp$ and $T(x^*)^\perp = N(x^*)$.

Proof. By definition of $T(x^*)$, we may write
$$T(x^*) = \{y \in \mathbb{R}^n : x^T y = 0 \text{ for all } x \in N(x^*)\}.$$
Hence, $T(x^*) = N(x^*)^\perp$.
By Exercise 3.11, we also have $T(x^*)^\perp = N(x^*)$. ∎

By Lemma 20.1, we can write $\mathbb{R}^n$ as the direct sum decomposition (see Section 3.3)
$$\mathbb{R}^n = N(x^*) \oplus T(x^*);$$
that is, given any vector $v \in \mathbb{R}^n$, there are unique vectors $w \in N(x^*)$ and $y \in T(x^*)$ such that $v = w + y$.

20.4 Lagrange Condition

In this section we present a first-order necessary condition for extremum problems with constraints. The result is the well-known Lagrange's theorem. To better understand the idea underlying this theorem, we first consider functions of two variables and only one equality constraint. Let $h : \mathbb{R}^2 \to \mathbb{R}$ be the constraint function. Recall that at each point $x$ of the domain, the gradient vector $\nabla h(x)$ is orthogonal to the level set that passes through that point. Indeed, let us choose a point $x^* = [x_1^*, x_2^*]^T$ such that $h(x^*) = 0$, and assume that $\nabla h(x^*) \ne 0$. The level set through the point $x^*$ is the set $\{x : h(x) = 0\}$. We then parameterize this level set in a neighborhood of $x^*$ by a curve $\{x(t)\}$, that is, a continuously differentiable vector function $x : \mathbb{R} \to \mathbb{R}^2$ such that
$$x(t) = [x_1(t), x_2(t)]^T, \quad t \in (a, b), \quad x^* = x(t^*), \quad \dot{x}(t^*) \ne 0, \quad t^* \in (a, b).$$

We can now show that $\nabla h(x^*)$ is orthogonal to $\dot{x}(t^*)$. Indeed, because $h$ is constant on the curve $\{x(t) : t \in (a, b)\}$, we have $h(x(t)) = 0$ for all $t \in (a, b)$. Hence, for all $t \in (a, b)$,
$$\frac{d}{dt} h(x(t)) = 0.$$
Applying the chain rule, we get
$$\frac{d}{dt} h(x(t)) = \nabla h(x(t))^T \dot{x}(t) = 0.$$
Therefore, $\nabla h(x^*)$ is orthogonal to $\dot{x}(t^*)$.

Now suppose that $x^*$ is a minimizer of $f : \mathbb{R}^2 \to \mathbb{R}$ on the set $\{x : h(x) = 0\}$. We claim that $\nabla f(x^*)$ is orthogonal to $\dot{x}(t^*)$. To see this, it is enough to observe that the composite function of $t$ given by $\phi(t) = f(x(t))$ achieves a minimum at $t^*$. Consequently, the first-order necessary condition for the unconstrained extremum problem implies that
$$\frac{d\phi}{dt}(t^*) = 0.$$
Applying the chain rule yields
$$0 = \frac{d\phi}{dt}(t^*) = \nabla f(x(t^*))^T \dot{x}(t^*) = \nabla f(x^*)^T \dot{x}(t^*).$$
Thus, $\nabla f(x^*)$ is orthogonal to $\dot{x}(t^*)$. The fact that $\dot{x}(t^*)$ is tangent to the curve $\{x(t)\}$ at $x^*$ means that $\nabla f(x^*)$ is orthogonal to the curve at $x^*$ (see Figure 20.10).

Figure 20.10 The gradient $\nabla f(x^*)$ is orthogonal to the curve $\{x(t)\}$ at the point $x^*$ that is a minimizer of $f$ on the curve.

Recall that $\nabla h(x^*)$ is also orthogonal to $\dot{x}(t^*)$. Therefore, the vectors $\nabla h(x^*)$ and $\nabla f(x^*)$ are parallel; that is, $\nabla f(x^*)$ is a scalar multiple of $\nabla h(x^*)$. These observations allow us to formulate Lagrange's theorem for functions of two variables with one constraint.

Theorem 20.2 Lagrange's Theorem for $n = 2$, $m = 1$. Let the point $x^*$ be a minimizer of $f : \mathbb{R}^2 \to \mathbb{R}$ subject to the constraint $h(x) = 0$, $h : \mathbb{R}^2 \to \mathbb{R}$. Then, $\nabla f(x^*)$ and $\nabla h(x^*)$ are parallel. That is, if $\nabla h(x^*) \ne 0$, then there exists a scalar $\lambda^*$ such that
$$\nabla f(x^*) + \lambda^* \nabla h(x^*) = 0.$$

In Theorem 20.2, we refer to $\lambda^*$ as the Lagrange multiplier. Note that the theorem also holds for maximizers. Figure 20.11 gives an illustration of Lagrange's theorem for the case where $x^*$ is a maximizer of $f$ over the set $\{x : h(x) = 0\}$.

Figure 20.11 Lagrange's theorem for $n = 2$, $m = 1$.

Lagrange's theorem provides a first-order necessary condition for a point to be a local minimizer. This condition, which we call the Lagrange condition, consists of two equations:
$$\nabla f(x^*) + \lambda^* \nabla h(x^*) = 0,$$
$$h(x^*) = 0.$$
Note that the Lagrange condition is necessary but not sufficient. In Figure 20.12 we illustrate a variety of points where the Lagrange condition is satisfied, including a case where the point is not an extremizer (neither a maximizer nor a minimizer). A numerical sketch of solving the Lagrange condition follows.
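The two equations of the Lagrange condition can be solved numerically. A minimal sketch, applied to the equality constraint of Example 20.1 (the inequality constraint there turns out to be inactive, so this recovers the minimizer $x^* = [1/2, 3/2]^T$); the use of scipy.optimize.fsolve is an illustrative choice, not the book's method:

import numpy as np
from scipy.optimize import fsolve

# Lagrange condition for Example 20.1 with only the equality constraint:
# f(x) = (x1 - 1)^2 + x2 - 2,  h(x) = x2 - x1 - 1.
def lagrange_system(z):
    x1, x2, lam = z
    return [2*(x1 - 1) + lam*(-1),   # d/dx1 of f + lam*h
            1 + lam*1,               # d/dx2 of f + lam*h
            x2 - x1 - 1]             # h(x) = 0

print(fsolve(lagrange_system, x0=[0.0, 0.0, 0.0]))  # -> [0.5, 1.5, -1.0]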
We now generalize Lagrange's theorem to the case where $f : \mathbb{R}^n \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}^m$, $m \le n$.

Theorem 20.3 Lagrange's Theorem. Let $x^*$ be a local minimizer (or maximizer) of $f : \mathbb{R}^n \to \mathbb{R}$, subject to $h(x) = 0$, $h : \mathbb{R}^n \to \mathbb{R}^m$, $m \le n$. Assume that $x^*$ is a regular point. Then, there exists $\lambda^* \in \mathbb{R}^m$ such that
$$Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T.$$

Proof. We need to prove that $\nabla f(x^*) = -Dh(x^*)^T \lambda^*$ for some $\lambda^* \in \mathbb{R}^m$; that is, $\nabla f(x^*) \in \mathcal{R}(Dh(x^*)^T) = N(x^*)$. But by Lemma 20.1, $N(x^*) = T(x^*)^\perp$. Therefore, it remains to show that $\nabla f(x^*) \in T(x^*)^\perp$.

We proceed as follows. Suppose that $y \in T(x^*)$. Then, by Theorem 20.1, there exists a differentiable curve $\{x(t) : t \in (a, b)\}$ such that $h(x(t)) = 0$ for all $t \in (a, b)$, and there exists $t^* \in (a, b)$ satisfying $x(t^*) = x^*$, $\dot{x}(t^*) = y$. Now consider the composite function $\phi(t) = f(x(t))$. Note that $t^*$ is a local minimizer of this function. By the first-order necessary condition for unconstrained local minimizers (see Theorem 6.1),
$$\frac{d\phi}{dt}(t^*) = 0.$$
Applying the chain rule yields
$$\frac{d\phi}{dt}(t^*) = Df(x^*) \dot{x}(t^*) = Df(x^*) y = \nabla f(x^*)^T y = 0.$$
So all $y \in T(x^*)$ satisfy $\nabla f(x^*)^T y = 0$; that is, $\nabla f(x^*) \in T(x^*)^\perp$. This completes the proof. ∎

Figure 20.12 Four examples where the Lagrange condition is satisfied: (a) maximizer, (b) minimizer, (c) minimizer, (d) not an extremizer. (Adapted from [120].)

Lagrange's theorem states that if $x^*$ is an extremizer, then the gradient of the objective function $f$ can be expressed as a linear combination of the gradients of the constraints. We refer to the vector $\lambda^*$ in Theorem 20.3 as the Lagrange multiplier vector, and to its components as Lagrange multipliers.

From the proof of Lagrange's theorem, we see that a compact way to write the necessary condition is $\nabla f(x^*) \in N(x^*)$. If this condition fails, then $x^*$ cannot be an extremizer. This situation is illustrated in Figure 20.13.

Figure 20.13 Example where the Lagrange condition does not hold.

Notice that regularity is stated as an assumption in Lagrange's theorem. This assumption plays an essential role, as illustrated in the following example.

Example 20.5 Consider the following problem:

minimize $f(x)$
subject to $h(x) = 0$,

where $f(x) = x$ and
$$h(x) = \begin{cases} x^2 & \text{if } x < 0, \\ 0 & \text{if } 0 \le x \le 1, \\ (x - 1)^2 & \text{if } x > 1. \end{cases}$$
The feasible set is evidently $[0, 1]$. Clearly, $x^* = 0$ is a local minimizer. However, $f'(x^*) = 1$ and $h'(x^*) = 0$. Therefore, $x^*$ does not satisfy the necessary condition in Lagrange's theorem. Note, however, that $x^*$ is not a regular point, which is why Lagrange's theorem does not apply here.

It is convenient to introduce the Lagrangian function $l : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$, given by
$$l(x, \lambda) = f(x) + \lambda^T h(x).$$
The Lagrange condition for a local minimizer $x^*$ can be represented using the Lagrangian function as
$$Dl(x^*, \lambda^*) = 0^T$$
for some $\lambda^*$, where the derivative operation $D$ is with respect to the entire argument $[x^T, \lambda^T]^T$. In other words, the necessary condition in Lagrange's theorem is equivalent to the first-order necessary condition for unconstrained optimization applied to the Lagrangian function.

To see this, denote the derivative of $l$ with respect to $x$ as $D_x l$ and the derivative of $l$ with respect to $\lambda$ as $D_\lambda l$. Then,
$$Dl(x, \lambda) = [D_x l(x, \lambda), D_\lambda l(x, \lambda)].$$
Note that $D_x l(x, \lambda) = Df(x) + \lambda^T Dh(x)$ and $D_\lambda l(x, \lambda) = h(x)^T$. Therefore, Lagrange's theorem for a local minimizer $x^*$ can be stated as
$$D_x l(x^*, \lambda^*) = 0^T, \quad D_\lambda l(x^*, \lambda^*) = 0^T$$
for some $\lambda^*$, which is equivalent to $Dl(x^*, \lambda^*) = 0^T$.
In other words, the Lagrange condition can be expressed as the single equation $Dl(x^*, \lambda^*) = 0^T$.

The Lagrange condition is used to find possible extremizers. This entails solving the equations
$$D_x l(x, \lambda) = 0^T, \quad D_\lambda l(x, \lambda) = 0^T.$$
These represent $n + m$ equations in $n + m$ unknowns. Keep in mind that the Lagrange condition is necessary but not sufficient; that is, a point $x^*$ satisfying the equations above need not be an extremizer.

Example 20.6 Given a fixed area of cardboard, we wish to construct a closed cardboard box with maximum volume. We can formulate and solve this problem using the Lagrange condition. Denote the dimensions of the box with maximum volume by $x_1$, $x_2$, and $x_3$, and let the given fixed area of cardboard be $A$. The problem can then be formulated as

maximize $x_1 x_2 x_3$
subject to $x_1 x_2 + x_2 x_3 + x_3 x_1 = A/2$.

We denote $f(x) = -x_1 x_2 x_3$ and $h(x) = x_1 x_2 + x_2 x_3 + x_3 x_1 - A/2$. We have $\nabla f(x) = -[x_2 x_3, x_1 x_3, x_1 x_2]^T$ and $\nabla h(x) = [x_2 + x_3, x_1 + x_3, x_1 + x_2]^T$. Note that all feasible points are regular in this case. By the Lagrange condition, the dimensions of the box with maximum volume satisfy
$$x_2 x_3 - \lambda(x_2 + x_3) = 0,$$
$$x_1 x_3 - \lambda(x_1 + x_3) = 0,$$
$$x_1 x_2 - \lambda(x_1 + x_2) = 0,$$
$$x_1 x_2 + x_2 x_3 + x_3 x_1 = \frac{A}{2},$$
where $\lambda \in \mathbb{R}$.

We now solve these equations. First, we show that $x_1$, $x_2$, $x_3$, and $\lambda$ are all nonzero. Suppose that $x_1 = 0$. By the constraint, we have $x_2 x_3 = A/2$. However, the second and third equations in the Lagrange condition yield $\lambda x_3 = \lambda x_2 = 0$, which together with the first equation implies that $x_2 x_3 = 0$. This contradicts the constraint. A similar argument applies to $x_2$ and $x_3$. Next, suppose that $\lambda = 0$. Then, the sum of the three Lagrange equations gives $x_2 x_3 + x_1 x_3 + x_1 x_2 = 0$, which contradicts the constraint.

We now solve for $x_1$, $x_2$, and $x_3$ in the Lagrange equations. First, multiply the first equation by $x_1$ and the second by $x_2$, and subtract one from the other. We arrive at $x_3 \lambda (x_1 - x_2) = 0$. Because neither $x_3$ nor $\lambda$ can be zero (as shown above), we conclude that $x_1 = x_2$. We similarly deduce that $x_2 = x_3$. From the constraint equation, we obtain $x_1 = x_2 = x_3 = \sqrt{A/6}$.

Notice that we have ignored the constraints that $x_1$, $x_2$, and $x_3$ be positive so that we could solve the problem using Lagrange's theorem. However, there is only one solution to the Lagrange equations, and that solution is positive. Therefore, if a solution exists for the problem with positivity constraints on the variables, then this solution must necessarily be equal to the solution above obtained by ignoring the positivity constraints.

Next we provide an example with a quadratic objective function and a quadratic constraint.

Example 20.7 Consider the problem of extremizing the objective function
$$f(x) = x_1^2 + x_2^2$$
on the ellipse
$$\{[x_1, x_2]^T : h(x) = x_1^2 + 2x_2^2 - 1 = 0\}.$$
We have
$$\nabla f(x) = [2x_1, 2x_2]^T, \quad \nabla h(x) = [2x_1, 4x_2]^T.$$
Thus,
$$D_x l(x, \lambda) = D_x [f(x) + \lambda h(x)] = [2x_1 + 2\lambda x_1, \; 2x_2 + 4\lambda x_2]$$
and
$$D_\lambda l(x, \lambda) = h(x) = x_1^2 + 2x_2^2 - 1.$$
Setting $D_x l(x, \lambda) = 0^T$ and $D_\lambda l(x, \lambda) = 0$, we obtain three equations in three unknowns:
$$2x_1 + 2\lambda x_1 = 0, \quad 2x_2 + 4\lambda x_2 = 0, \quad x_1^2 + 2x_2^2 = 1.$$
All feasible points in this problem are regular. From the first of the equations above, we get either $x_1 = 0$ or $\lambda = -1$. For the case where $x_1 = 0$, the second and third equations imply that $\lambda = -1/2$ and $x_2 = \pm 1/\sqrt{2}$. For the case where $\lambda = -1$, the second and third equations imply that $x_1 = \pm 1$ and $x_2 = 0$.
Thus, the points that satisfy the Lagrange condition for extrema are
$$x^{(1)} = \begin{bmatrix} 0 \\ 1/\sqrt{2} \end{bmatrix}, \quad x^{(2)} = \begin{bmatrix} 0 \\ -1/\sqrt{2} \end{bmatrix}, \quad x^{(3)} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad x^{(4)} = \begin{bmatrix} -1 \\ 0 \end{bmatrix}.$$
Because
$$f(x^{(1)}) = f(x^{(2)}) = \frac{1}{2} \quad \text{and} \quad f(x^{(3)}) = f(x^{(4)}) = 1,$$
we conclude that if there are minimizers, then they are located at $x^{(1)}$ and $x^{(2)}$, and if there are maximizers, then they are located at $x^{(3)}$ and $x^{(4)}$. It turns out that, indeed, $x^{(1)}$ and $x^{(2)}$ are minimizers and $x^{(3)}$ and $x^{(4)}$ are maximizers. This problem can be solved graphically, as illustrated in Figure 20.14.

Figure 20.14 Graphical solution of the problem in Example 20.7.

In the example above, both the objective function $f$ and the constraint function $h$ are quadratic functions. In the next example we take a closer look at a class of problems where both the objective function $f$ and the constraint $h$ are quadratic functions of $n$ variables.

Example 20.8 Consider the following problem:
$$\text{maximize} \quad \frac{x^T Q x}{x^T P x},$$
where $Q = Q^T > 0$ and $P = P^T > 0$. Note that if a point $x = [x_1, \ldots, x_n]^T$ is a solution to the problem, then so is any nonzero scalar multiple of it, $tx = [tx_1, \ldots, tx_n]^T$, $t \ne 0$. Indeed,
$$\frac{(tx)^T Q (tx)}{(tx)^T P (tx)} = \frac{t^2 x^T Q x}{t^2 x^T P x} = \frac{x^T Q x}{x^T P x}.$$
Therefore, to avoid this multiplicity of solutions, we further impose the constraint $x^T P x = 1$. The optimization problem becomes

maximize $x^T Q x$
subject to $x^T P x = 1$.

Let us write $f(x) = x^T Q x$ and $h(x) = 1 - x^T P x$. Any feasible point for this problem is regular (see Exercise 20.13). We now apply Lagrange's method. We first form the Lagrangian function
$$l(x, \lambda) = x^T Q x + \lambda(1 - x^T P x).$$
Applying the Lagrange condition yields
$$D_x l(x, \lambda) = 2x^T Q - 2\lambda x^T P = 0^T,$$
$$D_\lambda l(x, \lambda) = 1 - x^T P x = 0.$$
The first of the equations above can be represented as
$$Qx - \lambda P x = 0 \quad \text{or} \quad (\lambda P - Q)x = 0.$$
This representation is possible because $P = P^T$ and $Q = Q^T$. By assumption, $P > 0$, and hence $P^{-1}$ exists. Premultiplying $(\lambda P - Q)x = 0$ by $P^{-1}$, we obtain
$$(\lambda I_n - P^{-1} Q)x = 0$$
or, equivalently,
$$P^{-1} Q x = \lambda x.$$
Therefore, the solution, if it exists, is an eigenvector of $P^{-1} Q$, and the Lagrange multiplier is the corresponding eigenvalue. As usual, let $x^*$ and $\lambda^*$ be the optimal solution. Because $x^{*T} P x^* = 1$ and $P^{-1} Q x^* = \lambda^* x^*$, we have
$$\lambda^* = x^{*T} Q x^*.$$
Hence, $\lambda^*$ is the maximum of the objective function, and is, in fact, the maximal eigenvalue of $P^{-1} Q$. It is also called the maximal generalized eigenvalue.
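Numerically, the maximal generalized eigenvalue and its eigenvector can be obtained directly from a generalized symmetric eigensolver. A sketch using scipy.linalg.eigh, with the matrices that appear in Example 20.9 below; the normalization $V^T P V = I$ returned by eigh makes the constraint $x^T P x = 1$ automatic:

import numpy as np
from scipy.linalg import eigh

# maximize x^T Q x subject to x^T P x = 1 (data of Example 20.9).
Q = np.diag([4.0, 1.0])
P = np.diag([2.0, 1.0])
w, V = eigh(Q, P)            # generalized eigenproblem Q v = w P v
lam_star = w[-1]             # maximal generalized eigenvalue (ascending order)
x_star = V[:, -1]            # normalized so that x_star^T P x_star = 1
print(lam_star)              # 2.0
print(x_star @ P @ x_star)   # 1.0  (feasible)
print(x_star @ Q @ x_star)   # 2.0  = lam_star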
In the problems above, we are able to find points that are candidates for extremizers of the given objective function subject to equality constraints. These critical points are the only candidates because they are the only points that satisfy the Lagrange condition. To classify such critical points as minimizers, maximizers, or neither, we need stronger conditions—possibly a necessary and sufficient condition. In the next section we discuss a second-order necessary condition and a second-order sufficient condition for minimizers.

20.5 Second-Order Conditions

We assume that $f : \mathbb{R}^n \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}^m$ are twice continuously differentiable: $f, h \in \mathcal{C}^2$. Let
$$l(x, \lambda) = f(x) + \lambda^T h(x) = f(x) + \lambda_1 h_1(x) + \cdots + \lambda_m h_m(x)$$
be the Lagrangian function. Let $L(x, \lambda)$ be the Hessian matrix of $l(x, \lambda)$ with respect to $x$:
$$L(x, \lambda) = F(x) + \lambda_1 H_1(x) + \cdots + \lambda_m H_m(x),$$
where $F(x)$ is the Hessian matrix of $f$ at $x$ and $H_k(x)$ is the Hessian matrix of $h_k$ at $x$, $k = 1, \ldots, m$, given by
$$H_k(x) = \begin{bmatrix} \frac{\partial^2 h_k}{\partial x_1^2}(x) & \cdots & \frac{\partial^2 h_k}{\partial x_n \partial x_1}(x) \\ \vdots & & \vdots \\ \frac{\partial^2 h_k}{\partial x_1 \partial x_n}(x) & \cdots & \frac{\partial^2 h_k}{\partial x_n^2}(x) \end{bmatrix}.$$
We introduce the notation $[\lambda H(x)]$:
$$[\lambda H(x)] = \lambda_1 H_1(x) + \cdots + \lambda_m H_m(x).$$
Using this notation, we can write $L(x, \lambda) = F(x) + [\lambda H(x)]$.

Theorem 20.4 Second-Order Necessary Conditions. Let $x^*$ be a local minimizer of $f : \mathbb{R}^n \to \mathbb{R}$ subject to $h(x) = 0$, $h : \mathbb{R}^n \to \mathbb{R}^m$, $m \le n$, and $f, h \in \mathcal{C}^2$. Suppose that $x^*$ is regular. Then, there exists $\lambda^* \in \mathbb{R}^m$ such that:

1. $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$.
2. For all $y \in T(x^*)$, we have $y^T L(x^*, \lambda^*) y \ge 0$.

Proof. The existence of $\lambda^* \in \mathbb{R}^m$ such that $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$ follows from Lagrange's theorem. It remains to prove the second part of the result.

Suppose that $y \in T(x^*)$; that is, $y$ belongs to the tangent space to $S = \{x \in \mathbb{R}^n : h(x) = 0\}$ at $x^*$. Because $h \in \mathcal{C}^2$, following the argument of Theorem 20.1, there exists a twice-differentiable curve $\{x(t) : t \in (a, b)\}$ on $S$ such that
$$x(t^*) = x^*, \quad \dot{x}(t^*) = y$$
for some $t^* \in (a, b)$. Observe that by assumption, $t^*$ is a local minimizer of the function $\phi(t) = f(x(t))$. From the second-order necessary condition for unconstrained minimization (see Theorem 6.2), we obtain
$$\frac{d^2 \phi}{dt^2}(t^*) \ge 0.$$
Using the formula
$$\frac{d}{dt}\left(y(t)^T z(t)\right) = \dot{y}(t)^T z(t) + y(t)^T \dot{z}(t)$$
and applying the chain rule yields
$$\frac{d^2 \phi}{dt^2}(t^*) = \dot{x}(t^*)^T F(x^*) \dot{x}(t^*) + Df(x^*) \ddot{x}(t^*) = y^T F(x^*) y + Df(x^*) \ddot{x}(t^*) \ge 0.$$
Because $h(x(t)) = 0$ for all $t \in (a, b)$, we have $\frac{d^2}{dt^2} \lambda^{*T} h(x(t)) = 0$ for all $t \in (a, b)$. Expanding,
$$\frac{d^2}{dt^2} \lambda^{*T} h(x(t)) = \frac{d}{dt}\left(\lambda^{*T} Dh(x(t)) \dot{x}(t)\right) = \dot{x}(t)^T [\lambda^* H(x(t))] \dot{x}(t) + \lambda^{*T} Dh(x(t)) \ddot{x}(t) = 0.$$
In particular, the above is true for $t = t^*$; that is,
$$y^T [\lambda^* H(x^*)] y + \lambda^{*T} Dh(x^*) \ddot{x}(t^*) = 0.$$
Adding this equation to the inequality $y^T F(x^*) y + Df(x^*) \ddot{x}(t^*) \ge 0$ yields
$$y^T \left(F(x^*) + [\lambda^* H(x^*)]\right) y + \left(Df(x^*) + \lambda^{*T} Dh(x^*)\right) \ddot{x}(t^*) \ge 0.$$
But, by Lagrange's theorem, $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$. Therefore,
$$y^T \left(F(x^*) + [\lambda^* H(x^*)]\right) y = y^T L(x^*, \lambda^*) y \ge 0,$$
which proves the result. ∎

Observe that $L(x, \lambda)$ plays a role similar to that of the Hessian matrix $F(x)$ of the objective function $f$ in the unconstrained minimization case. However, we now require that $L(x^*, \lambda^*) \ge 0$ only on $T(x^*)$ rather than on $\mathbb{R}^n$.

The conditions above are necessary, but not sufficient, for a point to be a local minimizer. We now present, without a proof, sufficient conditions for a point to be a strict local minimizer.

Theorem 20.5 Second-Order Sufficient Conditions. Suppose that $f, h \in \mathcal{C}^2$ and there exist a point $x^* \in \mathbb{R}^n$ and $\lambda^* \in \mathbb{R}^m$ such that:

1. $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$.
2. For all $y \in T(x^*)$, $y \ne 0$, we have $y^T L(x^*, \lambda^*) y > 0$.

Then, $x^*$ is a strict local minimizer of $f$ subject to $h(x) = 0$.

Proof. The interested reader can consult [88, p. 334] for a proof of this result. ∎

Theorem 20.5 states that if $x^*$ satisfies the Lagrange condition and $L(x^*, \lambda^*)$ is positive definite on $T(x^*)$, then $x^*$ is a strict local minimizer. A similar result holds for a strict local maximizer, the only difference being that $L(x^*, \lambda^*)$ must be negative definite on $T(x^*)$. A numerical sketch of this test is given next, and we then illustrate the condition in the following example.
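In computations, condition 2 of Theorem 20.5 is typically checked by projecting $L(x^*, \lambda^*)$ onto a basis of $T(x^*)$ and testing the eigenvalues of the projected matrix (this is the content of Exercise 20.19). A sketch for the candidate $x^{(1)} = [0, 1/\sqrt{2}]^T$, $\lambda = -1/2$ of Example 20.7:

import numpy as np
from scipy.linalg import null_space

# SOSC check for Example 20.7: f = x1^2 + x2^2, h = x1^2 + 2 x2^2 - 1,
# at the candidate x(1) = [0, 1/sqrt(2)], lambda = -1/2.
x = np.array([0.0, 1.0/np.sqrt(2)]); lam = -0.5

F = 2.0*np.eye(2)                      # Hessian of f
H = np.diag([2.0, 4.0])                # Hessian of h
L = F + lam*H                          # L(x, lam) = diag(1, 0): only PSD on R^2
Dh = np.array([[2*x[0], 4*x[1]]])      # Jacobian of h at x
B = null_space(Dh)                     # basis of the tangent space T(x)
print(np.linalg.eigvalsh(B.T @ L @ B)) # [1.]: positive definite on T(x)
                                       # -> strict local minimizer, as claimed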
Example 20.9 Consider the following problem:
$$\text{maximize} \quad \frac{x^T Q x}{x^T P x}, \quad \text{where} \quad Q = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}, \quad P = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}.$$
As pointed out earlier, we can represent this problem in the equivalent form

maximize $x^T Q x$
subject to $x^T P x = 1$.

The Lagrangian function for the transformed problem is given by
$$l(x, \lambda) = x^T Q x + \lambda(1 - x^T P x).$$
The Lagrange condition yields $(\lambda I - P^{-1} Q)x = 0$, where
$$P^{-1} Q = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}.$$
There are only two values of $\lambda$ for which $(\lambda I - P^{-1} Q)x = 0$ has a nonzero solution, namely, the eigenvalues of $P^{-1} Q$: $\lambda_1 = 2$, $\lambda_2 = 1$. We recall from our previous discussion of this problem that the Lagrange multiplier corresponding to the solution is the maximum eigenvalue of $P^{-1} Q$, namely, $\lambda^* = \lambda_1 = 2$. The corresponding eigenvector is the maximizer, that is, the solution to the problem. The eigenvector corresponding to the eigenvalue $\lambda^* = 2$ and satisfying the constraint $x^T P x = 1$ is $\pm x^*$, where
$$x^* = \begin{bmatrix} 1/\sqrt{2} \\ 0 \end{bmatrix}.$$

At this point, all we have established is that the pairs $(\pm x^*, \lambda^*)$ satisfy the Lagrange condition. We now show that the points $\pm x^*$ are, in fact, strict local maximizers. We do this for the point $x^*$; a similar procedure applies to $-x^*$.

We first compute the Hessian matrix of the Lagrangian function. We have
$$L(x^*, \lambda^*) = 2Q - 2\lambda^* P = \begin{bmatrix} 0 & 0 \\ 0 & -2 \end{bmatrix}.$$
The tangent space $T(x^*)$ to $\{x : 1 - x^T P x = 0\}$ is
$$T(x^*) = \{y \in \mathbb{R}^2 : x^{*T} P y = 0\} = \{y : [\sqrt{2}, 0] y = 0\} = \{[0, \alpha]^T : \alpha \in \mathbb{R}\}.$$
Note that for each $y \in T(x^*)$, $y \ne 0$,
$$y^T L(x^*, \lambda^*) y = [0, \alpha] \begin{bmatrix} 0 & 0 \\ 0 & -2 \end{bmatrix} \begin{bmatrix} 0 \\ \alpha \end{bmatrix} = -2\alpha^2 < 0.$$
Hence, $L(x^*, \lambda^*) < 0$ on $T(x^*)$, and thus $x^* = [1/\sqrt{2}, 0]^T$ is a strict local maximizer. The same is true for the point $-x^*$. Note that
$$\frac{x^{*T} Q x^*}{x^{*T} P x^*} = 2,$$
which, as expected, is the value of the maximal eigenvalue of $P^{-1} Q$. Finally, we point out that any scalar multiple $tx^*$ of $x^*$, $t \ne 0$, is a solution to the original problem of maximizing $x^T Q x / x^T P x$.

20.6 Minimizing Quadratics Subject to Linear Constraints

Consider the problem

minimize $\frac{1}{2} x^T Q x$
subject to $Ax = b$,

where $Q > 0$, $A \in \mathbb{R}^{m \times n}$, $m < n$, $\operatorname{rank} A = m$. This problem is a special case of what is called a quadratic programming problem (the general form of a quadratic programming problem includes the constraint $x \ge 0$). Note that the constraint set contains an infinite number of points (see Section 2.3). We now show, using Lagrange's theorem, that the optimization problem above has a unique solution. Following that, we provide an example illustrating the application of this solution to an optimal control problem.

To solve the problem, we first form the Lagrangian function
$$l(x, \lambda) = \frac{1}{2} x^T Q x + \lambda^T (b - Ax).$$
The Lagrange condition yields
$$D_x l(x^*, \lambda^*) = x^{*T} Q - \lambda^{*T} A = 0^T.$$
Rewriting, we get $x^* = Q^{-1} A^T \lambda^*$. Premultiplying both sides by $A$ gives
$$Ax^* = A Q^{-1} A^T \lambda^*.$$
Using the fact that $Ax^* = b$, and noting that $A Q^{-1} A^T$ is invertible because $Q > 0$ and $\operatorname{rank} A = m$, we can solve for $\lambda^*$ to obtain
$$\lambda^* = (A Q^{-1} A^T)^{-1} b.$$
Therefore, we obtain
$$x^* = Q^{-1} A^T (A Q^{-1} A^T)^{-1} b.$$
The point $x^*$ is the only candidate for a minimizer. To establish that $x^*$ is indeed a minimizer, we verify that $x^*$ satisfies the second-order sufficient conditions. For this, we first find the Hessian matrix of the Lagrangian function at $(x^*, \lambda^*)$. We have
$$L(x^*, \lambda^*) = Q,$$
which is positive definite. Thus, the point $x^*$ is a strict local minimizer. We will see in Chapter 22 that $x^*$ is, in fact, a global minimizer.

The special case where $Q = I_n$, the $n \times n$ identity matrix, reduces to the problem considered in Section 12.3. Specifically, the problem in Section 12.3 is to minimize the norm $\|x\|$ subject to $Ax = b$. The objective function there is $f(x) = \|x\|$, which is not differentiable at $x = 0$. This precludes the use of Lagrange's theorem, because the theorem requires differentiability of the objective function. We can overcome this difficulty by considering the equivalent optimization problem

minimize $\frac{1}{2}\|x\|^2$
subject to $Ax = b$.

The objective function $\|x\|^2/2$ has the same minimizer as the objective function $\|x\|$. Indeed, if $x^*$ is such that $\|x^*\| \le \|x\|$ for all $x \in \mathbb{R}^n$ satisfying $Ax = b$, then $\|x^*\|^2/2 \le \|x\|^2/2$, and conversely. Because the problem of minimizing $\|x\|^2/2$ subject to $Ax = b$ is simply the problem considered above with $Q = I_n$, we easily deduce the solution to be $x^* = A^T (A A^T)^{-1} b$, which agrees with the solution in Section 12.3.
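The closed-form solution above is straightforward to evaluate with two linear solves (no explicit inverses). A minimal sketch with small, hypothetical data $Q$, $A$, $b$ chosen only for illustration:

import numpy as np

# minimize (1/2) x^T Q x subject to A x = b  (Q > 0, A full row rank).
Q = np.array([[2.0, 0.0], [0.0, 1.0]])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

Qinv_At = np.linalg.solve(Q, A.T)          # Q^{-1} A^T
lam = np.linalg.solve(A @ Qinv_At, b)      # lam* = (A Q^{-1} A^T)^{-1} b
x_star = Qinv_At @ lam                     # x* = Q^{-1} A^T lam*
print(x_star)                              # [1/3, 2/3]
print(A @ x_star)                          # [1.]  (constraint satisfied)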
Example 20.10 Consider the discrete-time linear system model
$$x_k = a x_{k-1} + b u_k, \quad k \ge 1,$$
with initial condition $x_0$ given. We can think of $\{x_k\}$ as a discrete-time signal that is controlled by an external input signal $\{u_k\}$. In the control literature, $x_k$ is called the state at time $k$. For a given $x_0$, our goal is to choose the control signal $\{u_k\}$ so that the state remains "small" over a time interval $[1, N]$, but at the same time the control signal is "not too large." To express the desire to keep the state $\{x_k\}$ small, we choose the control sequence to minimize
$$\frac{1}{2} \sum_{i=1}^{N} x_i^2.$$
On the other hand, to maintain a control signal that is not too large, we minimize
$$\frac{1}{2} \sum_{i=1}^{N} u_i^2.$$
The two objectives above are conflicting in the sense that they cannot, in general, be achieved simultaneously: minimizing the first may result in a large control effort, while minimizing the second may result in large states. This is clearly a problem that requires compromise. One way to approach the problem is to minimize a weighted sum of the two functions above. Specifically, we can formulate the problem as

minimize $\frac{1}{2} \sum_{i=1}^{N} \left( q x_i^2 + r u_i^2 \right)$
subject to $x_k = a x_{k-1} + b u_k$, $k = 1, \ldots, N$, $x_0$ given,

where the parameters $q$ and $r$ reflect the relative importance of keeping the state small versus keeping the control effort not too large. This problem is an instance of the linear quadratic regulator (LQR) problem (see, e.g., [15], [20], [85], [86], or [99]). Combining the two conflicting objectives of keeping the state small while keeping the control effort small is an instance of the weighted sum approach (see Section 24.4).

To solve the problem above, we can rewrite it as a quadratic programming problem. Define
$$\bar{Q} = \begin{bmatrix} q I_N & O \\ O & r I_N \end{bmatrix}, \qquad \bar{A} = \begin{bmatrix} 1 & & & & -b & & & \\ -a & 1 & & & & -b & & \\ & \ddots & \ddots & & & & \ddots & \\ & & -a & 1 & & & & -b \end{bmatrix}, \qquad \bar{b} = \begin{bmatrix} a x_0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$
and $z = [x_1, \ldots, x_N, u_1, \ldots, u_N]^T$. With these definitions, the problem reduces to the previously considered quadratic programming problem

minimize $\frac{1}{2} z^T \bar{Q} z$
subject to $\bar{A} z = \bar{b}$,

where $\bar{Q}$ is $2N \times 2N$, $\bar{A}$ is $N \times 2N$, and $\bar{b} \in \mathbb{R}^N$. The solution is
$$z^* = \bar{Q}^{-1} \bar{A}^T (\bar{A} \bar{Q}^{-1} \bar{A}^T)^{-1} \bar{b}.$$
The first $N$ components of $z^*$ represent the optimal state signal over the interval $[1, N]$, whereas the second $N$ components represent the optimal control signal. In practice, computing the matrix inverses in the formula for $z^*$ may be too costly. There are other ways to tackle the problem that exploit its special structure; this is the study of optimal control (see, e.g., [15], [20], [85], [86], or [99]). A numerical sketch of this construction follows.
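The construction of $\bar{Q}$, $\bar{A}$, and $\bar{b}$ and the evaluation of $z^*$ can be coded in a few lines. A sketch, using the credit-card data of Example 20.11 below ($a = 1.02$ and $b = -1$, so that $x_k = 1.02 x_{k-1} - u_k$, with $q = 1$, $r = 10$, $x_0 = 10{,}000$, $N = 10$); the quantities plotted in Figure 20.15 are the vectors x and u computed here:

import numpy as np

# LQR of Example 20.10 as the QP: min (1/2) z^T Qbar z  s.t.  Abar z = bvec,
# with z = [x_1..x_N, u_1..u_N].
a, b, q, r, x0, N = 1.02, -1.0, 1.0, 10.0, 10000.0, 10

Qbar = np.diag([q]*N + [r]*N)
X = np.eye(N) - a*np.eye(N, k=-1)         # coefficients of x_1..x_N
Abar = np.hstack([X, -b*np.eye(N)])       # coefficients of u_1..u_N
bvec = np.zeros(N); bvec[0] = a*x0

M = np.linalg.solve(Qbar, Abar.T)         # Qbar^{-1} Abar^T
z = M @ np.linalg.solve(Abar @ M, bvec)   # z* = Qbar^{-1}A^T (A Qbar^{-1}A^T)^{-1} b
x, u = z[:N], z[N:]                       # optimal balances and payments
print(u[0], x[-1])                        # first payment, final balance (cf. Fig. 20.15)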
The following example illustrates an application of the discussion above.

Example 20.11 Credit-Card Holder Dilemma. Suppose that we currently have a credit-card debt of $10,000. Credit-card debts are subject to a monthly interest rate of 2%, and the account balance is increased by the interest amount every month. Each month we have the option of reducing the account balance by contributing a payment to the account. Over the next 10 months, we plan to contribute a payment every month in such a way as to minimize the overall debt level while minimizing the hardship of making monthly payments.

We solve our problem using the LQR framework described in Example 20.10. Let the current time be 0, $x_k$ the account balance at the end of month $k$, and $u_k$ our payment in month $k$. We have
$$x_k = 1.02 x_{k-1} - u_k, \quad k = 1, \ldots, 10;$$
that is, the account balance in a given month is equal to the account balance in the previous month plus the monthly interest on that balance minus our payment that month. Our optimization problem is then

minimize $\frac{1}{2} \sum_{i=1}^{10} \left( q x_i^2 + r u_i^2 \right)$
subject to $x_k = 1.02 x_{k-1} - u_k$, $k = 1, \ldots, 10$, $x_0 = 10{,}000$,

which is an instance of the LQR problem. The parameters $q$ and $r$ reflect our priority in trading off between debt reduction and hardship in making payments. The more anxious we are to reduce our debt, the larger the value of $q$ relative to $r$. On the other hand, the more reluctant we are to make payments, the larger the value of $r$ relative to $q$.

The solution to the problem above is given by the formula derived in Example 20.10. In Figure 20.15 we plot the monthly account balances and payments over the next 10 months using $q = 1$ and $r = 10$. We can see that our debt is reduced to less than $1000 after 10 months, but with a first payment close to $3000. If we feel that a payment of $3000 is too high, then we can try to reduce this amount by increasing the value of $r$ relative to $q$. However, going too far along these lines can lead to trouble. Indeed, if we use $q = 1$ and $r = 300$ (see Figure 20.16), although the monthly payments do not exceed $400, the account balance is never reduced by much below $10,000. In this case, the interest on the account balance eats up a significant portion of our monthly payments. In fact, our debt after 10 months will be higher than $10,000.

Figure 20.15 Plots for Example 20.11 with $q = 1$ and $r = 10$.

Figure 20.16 Plots for Example 20.11 with $q = 1$ and $r = 300$.

For a treatment of optimization problems with quadratic objective functions, subject to linear or quadratic constraints, arising in communication and signal processing, see [105] and [106].

EXERCISES

20.1 Consider the following constraints on $\mathbb{R}^2$:
$$h(x_1, x_2) = (x_1 - 2)^2 = 0$$
and
$$g(x_1, x_2) = (x_2 + 1)^3 \le 0.$$
Find the set of feasible points. Are the feasible points regular? Justify your answer.

20.2 Find local extremizers for the following optimization problems:

a. Minimize $x_1^2 + 2x_1 x_2 + 3x_2^2 + 4x_1 + 5x_2 + 6x_3$ subject to $x_1 + 2x_2 = 3$, $4x_1 + 5x_3 = 6$.

b. Maximize $4x_1 + x_2^2$ subject to $x_1^2 + x_2^2 = 9$.

c. Maximize $x_1 x_2$ subject to $x_1^2 + 4x_2^2 = 1$.

20.3 Find minimizers and maximizers of the function
$$f(x) = (a^T x)(b^T x), \quad x \in \mathbb{R}^3,$$
subject to $x_1 + x_2 = 0$, $x_2 + x_3 = 0$, where $a = [0, 1, 0]^T$ and $b = [1, 0, 1]^T$.

20.4 Consider the problem: minimize $f(x)$ subject to $h(x) = 0$, where $f : \mathbb{R}^2 \to \mathbb{R}$, $h : \mathbb{R}^2 \to \mathbb{R}$, and $\nabla f(x) = [x_1, x_1 + 4]^T$. Suppose that $x^*$ is an optimal solution and $\nabla h(x^*) = [1, 4]^T$. Find $\nabla f(x^*)$.

20.5 Consider the problem

minimize $\|x - x_0\|^2$
subject to $\|x\|^2 = 9$,

where $x_0 = [1, \sqrt{3}]^T$.

a. Find all points satisfying the Lagrange condition for the problem.

20.6 We wish to construct a closed box with minimum surface area that encloses a volume of $V$ cubic feet, where $V > 0$.

a. Let $a$, $b$, and $c$ denote the dimensions of the box with minimum surface area (with volume $V$). Derive the Lagrange condition that must be satisfied by $a$, $b$, and $c$.

b. What does it mean for a point $x^*$ to be a regular point in this problem?
Is the point $x^* = [a, b, c]^T$ a regular point?

c. Find $a$, $b$, and $c$.

d. Does the point $x^* = [a, b, c]^T$ found in part c satisfy the second-order sufficient condition?

(For Exercise 20.5, part b: using second-order conditions, determine whether or not each of the points in part a is a local minimizer.)

20.7 Find local extremizers of:

a. $f(x_1, x_2, x_3) = x_1^2 + 3x_2^2 + x_3$ subject to $x_1^2 + x_2^2 + x_3^2 = 16$.

b. $f(x_1, x_2) = x_1^2 + x_2^2$ subject to $3x_1^2 + 4x_1 x_2 + 6x_2^2 = 140$.

20.8 Consider the problem

minimize $2x_1 + 3x_2 - 4$, $x_1, x_2 \in \mathbb{R}$
subject to $x_1 x_2 = 6$.

a. Use Lagrange's theorem to find all possible local minimizers and maximizers.

b. Use the second-order sufficient conditions to specify which points are strict local minimizers and which are strict local maximizers.

c. Are the points in part b global minimizers or maximizers? Explain.

20.9 Find all maximizers of the function
$$f(x_1, x_2) = \frac{18x_1^2 - 8x_1 x_2 + 12x_2^2}{2 x^T x}.$$

20.10 Find all solutions to the problem

maximize $x^T \begin{bmatrix} 3 & 4 \\ 0 & 3 \end{bmatrix} x$
subject to $\|x\|^2 = 1$.

20.11 Consider a matrix $A$ with the property that $A^T A$ has eigenvalues ranging from 1 to 20 (i.e., the smallest eigenvalue is 1 and the largest is 20). Let $x$ be a vector such that $\|x\| = 1$, and let $y = Ax$. Use Lagrange multiplier methods to find the range of values that $\|y\|$ can take. Hint: What is the largest value that $\|y\|$ can take? What is the smallest value that $\|y\|$ can take?

20.12 Consider a matrix $A \in \mathbb{R}^{m \times n}$. Define the induced 2-norm of $A$, denoted $\|A\|_2$, to be the number
$$\|A\|_2 = \max\{\|Ax\| : x \in \mathbb{R}^n, \|x\| = 1\},$$
where the norm $\| \cdot \|$ on the right-hand side is the usual Euclidean norm. Suppose that the eigenvalues of $A^T A$ are $\lambda_1, \ldots, \lambda_n$ (ordered from largest to smallest). Use Lagrange's theorem to express $\|A\|_2$ in terms of these eigenvalues (cf. Theorem 3.8).

20.13 Let $P = P^T$ be a positive definite matrix. Show that any point $x$ satisfying $1 - x^T P x = 0$ is a regular point.

20.14 Consider the problem

maximize $a x_1 + b x_2$, $x_1, x_2 \in \mathbb{R}$
subject to $x_1^2 + x_2^2 = 2$,

where $a, b \in \mathbb{R}$. Show that if $[1, 1]^T$ is a solution to the problem, then $a = b$.

20.15 Consider the problem

minimize $x_1 x_2 - 2x_1$, $x_1, x_2 \in \mathbb{R}$
subject to $x_1^2 - x_2^2 = 0$.

a. Apply Lagrange's theorem directly to the problem to show that if a solution exists, it must be either $[1, 1]^T$ or $[-1, 1]^T$.

b. Use the second-order necessary conditions to show that $[-1, 1]^T$ cannot possibly be the solution.

c. Use the second-order sufficient conditions to show that $[1, 1]^T$ is a strict local minimizer.

20.16 Let $A \in \mathbb{R}^{m \times n}$, $m < n$, $\operatorname{rank} A = m$, and $x_0 \in \mathbb{R}^n$. Let $x^*$ be the point on the nullspace of $A$ that is closest to $x_0$ (in the sense of the Euclidean norm).

a. Show that $x^*$ is orthogonal to $x^* - x_0$.

b. Find a formula for $x^*$ in terms of $A$ and $x_0$.

20.17 Consider the problem

minimize $\frac{1}{2}\|Ax - b\|^2$
subject to $Cx = d$,

where $A \in \mathbb{R}^{m \times n}$, $m > n$, $C \in \mathbb{R}^{p \times n}$, $p < n$, and both $A$ and $C$ are of full rank. We wish to find an expression for the solution (in terms of $A$, $b$, $C$, and $d$).

a. Apply Lagrange's theorem to solve this problem.

b. As an alternative, rewrite the given optimization problem in the form of a quadratic programming problem and apply the formula in Section 20.6 to obtain the solution.

20.18 Consider the problem of minimizing a general quadratic function subject to a linear constraint:

minimize $\frac{1}{2} x^T Q x - c^T x + d$
subject to $Ax = b$,

where $Q = Q^T > 0$, $A \in \mathbb{R}^{m \times n}$, $m < n$, $\operatorname{rank} A = m$, and $d$ is a constant. Derive a closed-form solution to the problem.

20.19 Let $L$ be an $n \times n$ real symmetric matrix, and let $M$ be a subspace of $\mathbb{R}^n$ with dimension $m < n$. Let $\{b_1, \ldots, b_m\} \subset \mathbb{R}^n$ be a basis for $M$, and let $B$ be the $n \times m$ matrix with $b_i$ as the $i$th column. Let $L_M$ be the $m \times m$ matrix defined by $L_M = B^T L B$. Show that $L$ is positive semidefinite (definite) on $M$ if and only if $L_M$ is positive semidefinite (definite). Note: This result is useful for checking that the Hessian of the Lagrangian function at a point is positive definite on the tangent space at that point.
20.20 Consider the sequence $\{x_k\}$, $x_k \in \mathbb{R}$, generated by the recursion
$$x_{k+1} = a x_k + b u_k, \quad k \ge 0 \quad (a, b \in \mathbb{R}, \; a, b \ne 0),$$
where $u_0, u_1, u_2, \ldots$ is a sequence of "control inputs," and the initial condition $x_0 \ne 0$ is given. The recursion above is also called a discrete-time linear system. We wish to find values of the control inputs $u_0$ and $u_1$ such that $x_2 = 0$, and the average input energy $(u_0^2 + u_1^2)/2$ is minimized. Denote the optimal inputs by $u_0^*$ and $u_1^*$.

a. Find expressions for $u_0^*$ and $u_1^*$ in terms of $a$, $b$, and $x_0$.

b. Use the second-order sufficient conditions to show that the point $u^* = [u_0^*, u_1^*]^T$ in part a is a strict local minimizer.

20.21 Consider the discrete-time linear system $x_k = \frac{1}{2} x_{k-1} + u_k$, $k \ge 1$, with $x_0 = 1$. Find the values of the control inputs $u_1$ and $u_2$ to minimize
$$x_2^2 + \frac{1}{2} u_1^2 + \frac{1}{3} u_2^2.$$

20.22 Consider the discrete-time linear system $x_{k+1} = x_k + 2u_k$, $0 \le k \le 2$, with $x_0 = 3$. Use the Lagrange multiplier approach to calculate the optimal control sequence $\{u_0, u_1, u_2\}$ that transfers the initial state $x_0$ to $x_3 = 9$ while minimizing
$$\frac{1}{2} \sum_{k=0}^{2} u_k^2.$$

CHAPTER 21

PROBLEMS WITH INEQUALITY CONSTRAINTS

21.1 Karush-Kuhn-Tucker Condition

In Chapter 20 we analyzed constrained optimization problems involving only equality constraints. In this chapter we discuss extremum problems that also involve inequality constraints. The treatment in this chapter parallels that of Chapter 20. In particular, as we shall see, problems with inequality constraints can also be treated using Lagrange multipliers.

We consider the following problem:

minimize $f(x)$
subject to $h(x) = 0$, $g(x) \le 0$,

where $f : \mathbb{R}^n \to \mathbb{R}$, $h : \mathbb{R}^n \to \mathbb{R}^m$, $m \le n$, and $g : \mathbb{R}^n \to \mathbb{R}^p$. For the general problem above, we adopt the following definitions.

Definition 21.1 An inequality constraint $g_j(x) \le 0$ is said to be active at $x^*$ if $g_j(x^*) = 0$. It is inactive at $x^*$ if $g_j(x^*) < 0$.
Before proving this theorem, let us first discuss its meaning. Observe that μj > 0 (by condition 1) and gj{x*) < 0. Therefore, the condition μ* τ 0(χ*) = ßlgi(x*) + · · ■ + μ^ρί**) = 0 implies that if gj(x*) < 0, then μ^ = 0; that is, for all j 0 J(as*), we have μ^ = 0. In other words, the KKT multipliers μ!· corresponding to inactive constraints are zero. The other KKT multipliers, μ*, i G J(x*), are nonnegative; they may or may not be equal to zero. Example 21.1 A graphical illustration of the KKT theorem is given in Figure 21.1. In this two-dimensional example, we have only inequality constraints KARUSH-KUHN-TUCKER CONDITION Figure 21.1 489 Illustration of the Karush-Kuhn-Tucker (KKT) theorem. 9j(&) < 0, j = 1,2,3. Note that the point x* in the figure is indeed a minimizer. The constraint gz(x) < 0 is inactive: g${x*) < 0; hence μ£ = 0. By the KKT theorem, we have V/Or*) + μΐν9ι(χ*) + μ*2ν92(χ*) = 0, or, equivalently, V/(x*) = -μΐν9ι(χ*) - M5V^(X*), where μ\ > 0 and μ\ > 0. It is easy to interpret the KKT condition graphically for this example. Specifically, we can see from Figure 21.1 that V/(x*) must be a linear combination of the vectors — S7gi(x*) and — V#2(#*) with positive coefficients. This is reflected exactly in the equation above, where the coefficients μ\ and μ\ are the KKT multipliers. I We apply the KKT condition in the same way that we apply any necessary condition. Specifically, we search for points satisfying the KKT condition and treat these points as candidate minimizers. To summarize, the KKT condition consists of five parts (three equations and two inequalities): 1. μ* > 0. 2. Df(x*) + \*TDh{x*) 3. μ*τ9{χ*)=0. 4- ß*TDg(x*) = 0T. 490 PROBLEMS WITH INEQUALITY CONSTRAINTS 4. fe(x*) = 0. 5. g(x*) < 0. We now prove the KKT theorem. Proof of the Karush-Kuhn-Tucker Theorem. Let x* be a regular local minimizer of / on the set {x : h(x) = 0,g(x) < 0}. Then, x* is also a regular local minimizer of / on the set {x : h(x) = 0,gj(x) = 0,j G J(x*)} (see Exercise 21.16). Note that the latter constraint set involves only equality constraints. Therefore, from Lagrange's theorem, it follows that there exist vectors λ* G R m and μ* G W such that Df(x*) + X*TDh(x*) + ß*TDg(x*) = 0T, where for all j 0 J{x*), we have μ!· = 0. To complete the proof it remains to show that for all j G J(x*), we have μ!· > 0 (and hence for all j = 1 , . . . ,p, we have μ!· > 0, i.e., μ* > 0). We use a proof by contradiction. So suppose that there exists j G J(x*) such that μ!· < 0. Let S and T(x*) be the surface and tangent space defined by all other active constraints at x*. Specifically, S = {x : h(x) = 0, gi(x) = 0, i G J(x*),i and f (x*) = {2/ : L > M * > = 0, D9i(x*)y Φ j} = 0, ί G J(x*), ί ^ j}· We claim that by the regularity of x*, there exists y G T(cc*) such that Dft(x*)2/^0. To see this, suppose that for all y G T(x*), V^j(x*) T y = Dgj(x*)y = 0. This implies that V^j(x*) G T^cc*)-1. By Lemma 20.1, this, in turn, implies that Vgj(x*) G span[V/ifc(x*), fc = 1 , . . . , m, V^(x*), z G J(x*), i ^ j ] . But this contradicts the fact that x* is a regular point, which proves our claim. Without loss of generality, we assume that we have y such that Dgj(x*)y < 0. Consider the Lagrange condition, rewritten as Df(x*) + \*TDh(x*) + ß*aD9j{x*) + Σμ*ίΌ9ί{χ*) = 0T. ίφύ If we postmultiply the above by y and use the fact that y G T(x*), we get Df(x*)y Because Dgj(x*)y = -μ)Ό9ί{χ*)ν. < 0 and we have assumed that μ^ < 0, we have Df(x*)y < 0. 
KARUSH-KUHN-TUCKER CONDITION 20 V ^ = Figure 21.2 491 10 Ω Circuit in Example 21.2. Because y G T(x*}, by Theorem 20.1 we can find a differentiable curve {x(t) : t G (a, b)} on S such that there exists t* G (a, b) with x(i*) = x* and i(t*) = y. Now, p(x(t*)) = Df{x*)y < 0. The above means that there is a δ > 0 such that for all t G (t*, t* + 5], we have /(*(<))</(x(t*)) = /0O· On the other hand, d_ (x(t*)) = Dgj(x*)y<0, gj dt and for some ε > 0 and all ί G [t*, ί*+ε], we have that gj(x(t)) < 0. Therefore, for all t G (ί*,Γ +πύη{<ϊ,ε}], we have that gj(x(t)) < 0 and f(x(t)) < f{x*). Because the points x(t), t G (t*,t* + min{£,ε}], are in 5, they are feasible points with lower objective function values than x*. This contradicts the assumption that x* is a local minimizer, which completes the proof. I E x a m p l e 21.2 Consider the circuit in Figure 21.2. Formulate and solve the KKT condition for the following problems. a. Find the value of the resistor R > 0 such that the power absorbed by this resistor is maximized. b . Find the value of the resistor R > 0 such that the power delivered to the 10-Ω resistor is maximized. Solution: a. The power absorbed by the resistor R is p = i2R, where i = optimization problem can be represented as minimize subject to 400i? (10 + R)2 — R < 0. 10 2 5^. The 492 PROBLEMS WITH INEQUALITY CONSTRAINTS The derivative of the objective function is 400(10 + R)2 - 800^(10 + R) (10 + # ) 4 400(10 - R) " " (10 + R)3 ' Thus, the KKT condition is 400(10 - R) (io + i?) 3 μ "0, μ>0, μR = 0, -R<0. We consider two cases. In the first case, suppose that μ > 0. Then, R = 0. But this contradicts the first condition above. Now suppose that μ = 0. Then, by the first condition, we have R = 10. Therefore, the only solution to the KKT condition is R = 10, μ = 0. b . The power absorbed by the 10-Ω resistor is p = i 2 10, where z = 20/(10 + i2). The optimization problem can be represented as . . . minimize 4000 — -—- —77 (10 + # ) 2 — R < 0. subject to The derivative of the objective function is 8000 (10 + R)3' Thus, the KKT condition is 8000 (10 + Ä) 3 μ " ' μ>0, μ β = 0, -Ä<0. As before, we consider two cases. In the first case, suppose that μ > Then, # = 0, which is feasible. For the second case, suppose that μ = But this contradicts the first condition. Therefore, the only solution the KKT condition is R = 0, μ = 8. 0. 0. to I In the case when the objective function is to be maximized, that is, when the optimization problem has the form maximize subject to f(x) h(x) = 0 9(x) < 0, KARUSH-KUHN-TUCKER CONDITION 493 the KKT condition can be written as 1. μ* > 0. 2. -Df{x*) + \*TDh{x*) 3. μ*τ9{χ*) = 0. + μ* τ £>0(α*) = 0 T . 4. h(x*) = 0. 5. »(«*) < 0. The above is easily derived by converting the maximization problem above into a minimization problem, by multiplying the objective function by —1. It can be further rewritten as 1. μ* < 0. 2. Df(x*) + \*TDh(x*) + μ*ΎΌ9{χ*) = 0T. 3. μ* τ 0(α*) = 0. 4. /ι(χ*) = 0. 5. <?(**) < 0. The form shown above is obtained from the preceding one by changing the signs of μ* and λ* and multiplying condition 2 by —1. We can similarly derive the KKT condition for the case when the inequality constraint is of the form g(x) > 0. Specifically, consider the problem minimize subject to f(x) h(x) = 0 9(x) > 0. We multiply the inequality constraint function by —1 to obtain — g(x) < 0. Thus, the KKT condition for this case is 1. μ* > 0. 2. Df(x*) + \*TDh(x*) - μ* τ £>0(χ*) = 0 T . 3. μ* Τ 0(χ*)=Ο. 4. h{x*) = 0. 5. g(x*) > 0. 
Changing the sign of μ* as before, we obtain 1. μ* < 0. 494 PROBLEMS WITH INEQUALITY CONSTRAINTS 2. Df(x*) + A* T Dft(x*) + μ* τ £>0(**) = <>T 3. μ* τ ^(χ*) = 0. 4. h(x*) = 0. 5. g(x*) > 0. For the problem maximize subject to f(x) h(x) = 0 9{x) > 0, the KKT condition is exactly the same as in Theorem 21.1, except for the reversal of the inequality constraint. E x a m p l e 21.3 In Figure 21.3, the two points X\ and #2 are feasible points; that is, g(x\) > 0 and #(#2) > 0, and they satisfy the KKT condition. The point X\ is a maximizer. The KKT condition for this point (with KKT multiplier μ\) is 1. μι > 0. 2. V/(a5i)+MiV^(«i) = 0. 3. μι9(χι) = 0. 4. 0(*i) > 0. The point X2 is a minimizer of / . The KKT condition for this point (with KKT multiplier /X2) is 1. μ 2 < 0. 2. V/(a? 2 ) + /x2V£(tt2) = 0. 3. μ29{Χ2) = 0. 4. <?(*2) > 0. ■ Example 21.4 Consider the problem minimize / (x 1, Χ2) subject to £i,#2 > 0, where f(xi, x 2 ) = x\ + x\ + xiX2 — 3#i. KARUSH-KUHN-TUCKER CONDITION 495 WVf( Xl ) Figure 21.3 minimizer). Points satisfying the KKT condition (x± is a maximizer and X2 is a The KKT condition for this problem is 1. μ = [μι,μ2]Τ < 0. 2. £ > / ( * ) + μ τ = 0 T . 3. μτχ = 0. 4. χ > 0. We have Df{x) = [2xi + x2 - 3, zi + 2x2]· This gives 2xi + X2 + Mi = 3, xi + 2x 2 + M2 = 0, μιΧι + μ2^2 = 0. We now have four variables, three equations, and the inequality constraints on each variable. To find a solution (as*,/x*), we first try * Mi which gives = 0, ,* _ 3 1 — 2' x 2 μ>2 ~ = = o, 3 2 496 PROBLEMS WITH INEQUALITY CONSTRAINTS The above satisfies all the KKT and feasibility conditions. In a similar fashion, we can try μ* = 0 , x\ = 0, which gives x*=0, μί=3. This point clearly violates the nonpositivity constraint on μ\. The feasible point above satisfying the KKT condition is only a candidate for a minimizer. However, there is no guarantee that the point is indeed a minimizer, because the KKT condition is, in general, only necessary. A sufficient condition for a point to be a minimizer is given in the next section. ■ Example 21.4 is a special case of a more general problem of the form minimize subject to f(x) x > 0. The KKT condition for this problem has the form μ<0, ν/(*)+μ = 0, μτχ = 0, x >0. Prom the above, we can eliminate μ to obtain V / ( * ) > 0, T x V / 0 r ) = 0, x >0. Some possible points in 1R2 that satisfy these conditions are depicted in Figure 21.4. For further results related to the KKT condition, we refer the reader to [90, Chapter 7]. 21.2 Second-Order Conditions As in the case of extremum problems with equality constraints, we can also give second-order necessary and sufficient conditions for extremum problems involving inequality constraints. For this, we need to define the following matrix: £ ( χ , λ , μ ) - F(x) + [XH(x)} + [μβ(χ)], SECOND-ORDER CONDITIONS 497 Figure 21.4 Some possible points satisfying the KKT condition for problems with positive constraints. (Adapted from [13].) where F(x) is the Hessian matrix of / at cc, and the notation [XH(x)] represents [XH(x)] = λ ι Η ι ( α ) + · · · + A m ff m (x), as before. Similarly, the notation [/xG(x)] represents [μβ(χ)] = μιβι(χ) Η h pbpGp{x), where Gk{x) is the Hessian of g^ at £c, given by r Gk(x) I d29k („\ &M. dxndkXl ^w d 9k (χ) ... \X) d gk In the following theorem, we use T(x*) = {y€Rn: Dh(x*)y = 0, D9j(x')y = 0, j € J(x*)}, 498 PROBLEMS WITH INEQUALITY CONSTRAINTS that is, the tangent space to the surface defined by active constraints. Theorem minimizer m < n, g there exist 21.2 Second-Order Necessary Conditions. 
Let x* be a local of f : R n -> R subject to h(x) = 0, g(x) < 0, h : R n -» R m , : M.n ^> R p , and f,h,g G C2. Suppose that x* is regular. Then, λ* G R m and μ* G R p such that: 1. μ* > 0, Df{x*) + λ* τ £>/ι(χ*) + μ* τ Dg(x*) = 0 T , μ* τ ^(»*) = 0. £ For a// y G T(sc*) we ftave y1'L(x*, λ*, ß*)y > 0. D Proof. Part 1 is simply a result of the KKT theorem. To prove part 2, we note that because the point as* is a local minimizer over {x : h(x) = 0, 0(215) < 0}, it is also a local minimizer over {x : h(x) = 0, Qj{x) = 0, j e J(cc*)}; that is, the point x* is a local minimizer with active constraints taken as equality constraints (see Exercise 21.16). Hence, the second-order necessary conditions for equality constraints (Theorem 20.4) are applicable here, which completes the proof. I We now state the second-order sufficient conditions for extremum problems involving inequality constraints. In the formulation of the result, we use the following set: Τ(χ*,μ*) = {y : Dh(x*)y = 0, Dgi(x*)y = 0,i G J{x\ μ*)}, where </(χ*,μ*) = {i : g%{x*) = Ο,μ* > 0}. Note that J(x*,ß*) is a subset of J(x*): </(χ*,μ*) C J(x*). This, in turn, implies that T(x*) is a subset of f(as*,M*):r(x*)cf(e»,M*). Theorem 21.3 Second-Order Sufficient Conditions. Suppose that / , ^ , / i e C2 and there exist a feasible point x* G R n and vectors λ* G R m and μ* G R p such that: 1. μ* > 0, £>/(**) + λ* τ Ζ)/ι(χ*) + μ*τΌ9{χ*) 2. For all y G T(cc*, μ*), y ^0, we have yTL(x*, = 0 T , μ* τ ^(χ*) = 0. λ*, μ*)?/ > 0. 77ien, x* zs a sinci /ocaZ minimizer of f subject to h(x) = 0, g(x) < 0. Proof For a proof of this theorem, we refer the reader to [88, p. 345]. □ I A result similar to Theorem 21.3 holds for a strict local maximizer, the only difference being that we need μ* < 0 and that L(x*,A*) be negative definite on Γ(χ*,μ*). Example 21.5 Consider the following problem: minimize subject to x\X2 x\ + X2 > 2 x2 > x\. SECOND-ORDER CONDITIONS 499 a. Write down the KKT condition for this problem. b . Find all points (and KKT multipliers) satisfying the KKT condition. In each case, determine if the point is regular. c. Find all points in part b that also satisfy the SONC. d. Find all points in part c that also satisfy the SOSC. e. Find all points in part c that are local minimizers. Solution: a. Write f(x) = X1X2, gi{x) = 2 — xi—x2, condition is and g2(x) =x\— X2· The KKT X2 - μι + M2 = 0, χι - μι - μ2 = 0, μι(2 - χ ι - χ2) + μ2{χ\ - x2) = 0, μι,μ2 > 0, 2 — χι — χ2 < 0, Χι — %2 < 0. b . It is easy to check that μ\ φ 0 and μ2 ^ 0. This leaves us with only one solution to the KKT condition: x\ = x\ — 1, μ\ = 1, μ2 = 0. For this point, we have Dgi(x*) = [—1,-1] and Dg2(x*) = [1,-1]· Hence, x* is regular. c. Both constraints are active. Hence, because x* is regular, T(x*) = {0}. This implies that the SONC is satisfied. d. Now, Σ(χ\μ*)= I o . Moreover, Τ(χ*,μ*) = {y : [-1,-1]*/ = 0} = {y : yx = -y2}. Pick y = [ 1 , - 1 ] τ β Τ(χ*,μ*). We have yTL(x*^*)y = - 2 < 0, which means that the SOSC fails. e. In fact, the point x* is not a local minimizer. To see this, draw a picture of the constraint set and level sets of the objective function. Moving in the feasible direction [1,1] T , the objective function increases; but moving in the feasible direction [—1,1]T, the objective function decreases. ■ We now solve analytically the problem in Example 20.1 that we solved graphically earlier. 500 PROBLEMS WITH INEQUALITY CONSTRAINTS E x a m p l e 21.6 We wish to minimize f(x) = {x\ — l ) 2 + X2 — 2 subject to h(x) = X2 — x\ — 1 = 0, g{x) = xi + X2 — 2 < 0. 
For all a: 6 R 2 , we have Dh(x) = [-1,1], Dg(x) = [1,1]. Thus, Vft(aj) and Vg(x) are linearly independent and hence all feasible points are regular. We first write the KKT condition. Because Df(x) = [2#i — 2,1], we have Df(x) + XDh(x) + ßDg(x) = [2xx - 2 - λ + μ,1 + λ + μ ] = 0 τ , μ(χι + X2 - 2) = 0, μ>0, ^2 - Xi - 1 = 0, xi + x 2 - 2 < 0. To find points that satisfy the conditions above, we first try μ > 0, which implies that Χχ + X2 — 2 = 0. Thus, we are faced with a system of four linear equations 2xi - 2 - λ + μ = 0, 1 + λ + μ = 0, x 2 - #1 - 1 = 0, Xl + X2 ~ 2 = 0. Solving the system of equations above, we obtain 1 31 = 2' X2 = 3 2' λ==_1, ^ = 0 · However, the above is not a legitimate solution to the KKT condition, because we obtained μ = 0, which contradicts the assumption that μ > 0. In the second try, we assume that μ = 0. Thus, we have to solve the system of equations 2xi - 2 - A = 0, 1 + A = 0, X2 - xi - 1 = 0, and the solutions must satisfy g(xi, x 2 ) = xi + #2 - 2 < 0. EXERCISES 501 Solving the equations above, we obtain 1 3 ^ ^ Note that cc* = [1/2,3/2] satisfies the constraint g{x*) < 0. The point x* satisfying the KKT necessary condition is therefore the candidate for being a minimizer. We now verify if x* = [1/2,3/2] τ , λ* = —1, μ* = 0, satisfy the secondorder sufficient conditions. For this, we form the matrix L(x\ λ*, μ*) - F(x*) + \*H{x*) + "2 0" 0 0" 0 0 + (-1) 0 ß*G{x*) 0 + (0) 0 0" 0 0 "2 0' 0 0 We then find the subspace f(x*,ß*) = {y:Dh(x*)y = 0}. Note that because μ* = 0, the active constraint g(x*) = 0 does not enter the computation of Γ(χ*,μ*). Note also that in this case, T(x*) = {0}. We have Τ(χ*,μ*) = {y : [-1, l]y = 0} = {[a,a]T : a G R}. We then check for positive definiteness of L(x*, λ*, μ*) on T(x*, μ*). We have y T L(cc*,A*^*)t/ = [a, a] 2 0 0 0 = 2a 2 . Thus, L(iE*,A*,/i*) is positive definite on Γ(χ*,μ*). Observe that L(x*, λ*,μ*) is, in fact, only positive semidefinite on R 2 . By the second-order sufficient conditions, we conclude that x* = [1/2,3/2] is a strict local minimizer. I EXERCISES 21.1 Consider the optimization problem minimize subject to x\ + Ax\ x\ + 2x\ > 4. a. Find all the points that satisfy the KKT conditions. 502 PROBLEMS WITH INEQUALITY CONSTRAINTS b . Apply the SOSC to determine the nature of the critical points from the previous part. 21.2 Find local extremizers for: a. x\ + x\ — 2x\ — 10x2 + 26 subject to jrx2 — x\ < 0, 5xi + \x2 < 5. b . x\ + x\ subject to x\ > 0, X2 > 0, x\ + x^ > 5. c. #i + 6x1X2 — 4xi — 2x2 subject to x\ + 2x 2 < 1, 2xi — 2x2 < 1. 21.3 Find local minimizers for x\ + x\ subject to x\ + 2xiX2 + x | = 1, #i - ^2 < 0. 21.4 Write down the Karush-Kuhn-Tucker condition for the optimization problem in Exercise 15.8. 21.5 Consider the problem minimize subject to X2 — (xi — 2) 3 + 3 x 2 > 1, where x\ and X2 are real variables. Answer each of the following questions, making sure that you give complete reasoning for your answers. a. Write down the KKT condition for the problem, and find all points that satisfy the condition. Check whether or not each point is regular. b . Determine whether or not the point(s) in part a satisfy the second-order necessary condition. c. Determine whether or not the point(s) in part b satisfy the second-order sufficient condition. 21.6 Consider the problem minimize X2 subject to X2 > — (xi — l ) 2 + 3. a. Find all points satisfying the KKT condition for the problem. b . For each point x* in part a, find T(x*), 7V(cc*), and T(x*). c. 
EXERCISES

21.1 Consider the optimization problem

minimize $x_1^2 + 4x_2^2$
subject to $x_1^2 + 2x_2^2 \ge 4$.

a. Find all the points that satisfy the KKT conditions.
b. Apply the SOSC to determine the nature of the critical points from the previous part.

21.2 Find local extremizers for:

a. $x_1^2 + x_2^2 - 2x_1 - 10x_2 + 26$ subject to $\frac{1}{2}x_2 - x_1 \le 0$, $5x_1 + \frac{1}{2}x_2 \le 5$;
b. $x_1^2 + x_2^2$ subject to $x_1 \ge 0$, $x_2 \ge 0$, $x_1 + x_2 \ge 5$;
c. $x_1^2 + 6x_1x_2 - 4x_1 - 2x_2$ subject to $x_1 + 2x_2 \le 1$, $2x_1 - 2x_2 \le 1$.

21.3 Find local minimizers for $x_1^2 + x_2^2$ subject to $x_1^2 + 2x_1x_2 + x_2^2 = 1$, $x_1 - x_2 \le 0$.

21.4 Write down the Karush-Kuhn-Tucker condition for the optimization problem in Exercise 15.8.

21.5 Consider the problem

minimize $x_2 - (x_1 - 2)^3 + 3$
subject to $x_2 \ge 1$,

where $x_1$ and $x_2$ are real variables. Answer each of the following questions, making sure that you give complete reasoning for your answers.

a. Write down the KKT condition for the problem, and find all points that satisfy the condition. Check whether or not each point is regular.
b. Determine whether or not the point(s) in part a satisfy the second-order necessary condition.
c. Determine whether or not the point(s) in part b satisfy the second-order sufficient condition.

21.6 Consider the problem

minimize $x_2$
subject to $x_2 \ge -(x_1 - 1)^2 + 3$.

a. Find all points satisfying the KKT condition for the problem.
b. For each point $x^*$ in part a, find $T(x^*)$, $N(x^*)$, and $\tilde{T}(x^*)$.
c. Find the subset of points from part a that satisfy the second-order necessary condition.

21.7 Consider the problem of optimizing (either minimizing or maximizing) $(x_1 - 2)^2 + (x_2 - 1)^2$ subject to $2 - x_1 - x_2 \ge 0$, $x \ge 0$. The point $x^* = 0$ satisfies the KKT conditions.

a. Does $x^*$ satisfy the FONC for minimization or maximization? What are the KKT multipliers?
b. Does $x^*$ satisfy the SOSC? Carefully justify your answer.

21.8 Consider the optimization problem

minimize $f(x)$
subject to $x \in \Omega$,

where $f(x) = x_1 x_2$, $x = [x_1, x_2]^T$, and $\Omega = \{x \in \mathbb{R}^2 : x_1 = x_2,\ x_1 \ge 0\}$.

a. Find all points satisfying the KKT condition.
b. Do each of the points found in part a satisfy the second-order necessary condition?
c. Do each of the points found in part a satisfy the second-order sufficient condition?

21.9 Consider the problem

minimize $\frac{1}{2}\|Ax - b\|^2$
subject to $x_1 + \cdots + x_n = 1$, $x_1, \ldots, x_n \ge 0$.

a. Write down the KKT condition for the problem.
b. Define what it means for a feasible point $x^*$ to be regular in this particular problem. Are there any feasible points in this problem that are not regular? If yes, find them. If not, explain why not.

21.10 Let $g : \mathbb{R}^n \to \mathbb{R}$ and $x_0 \in \mathbb{R}^n$ be given, where $g(x_0) > 0$. Consider the problem

minimize $\frac{1}{2}\|x - x_0\|^2$
subject to $g(x) \le 0$.

Suppose that $x^*$ is a solution to the problem and $g \in \mathcal{C}^1$. Use the KKT theorem to decide which of the following equations/inequalities hold:

i. $g(x^*) < 0$.
ii. $g(x^*) = 0$.
iii. $(x^* - x_0)^T \nabla g(x^*) < 0$.
iv. $(x^* - x_0)^T \nabla g(x^*) = 0$.
v. $(x^* - x_0)^T \nabla g(x^*) > 0$.

21.11 Consider a square room with corners located at $[0,0]^T$, $[0,2]^T$, $[2,0]^T$, and $[2,2]^T$ (in $\mathbb{R}^2$). We wish to find the point in the room that is closest to the point $[3,4]^T$.

a. Guess which point in the room is the closest point in the room to the point $[3,4]^T$.
b. Use the second-order sufficient conditions to prove that the point you have guessed is a strict local minimizer. Hint: Minimizing the distance is the same as minimizing the square distance.

21.12 Consider the quadratic programming problem

minimize $\frac{1}{2}x^T Q x$
subject to $Ax \le b$,

where $Q = Q^T > 0$, $A \in \mathbb{R}^{m \times n}$, and $b \ge 0$. Find all points satisfying the KKT condition.

21.13 Consider the linear programming problem

minimize $ax_1 + bx_2$
subject to $cx_1 + dx_2 = e$, $x_1, x_2 \ge 0$,

where $a, b, c, d, e \in \mathbb{R}$ are all nonzero constants. Suppose that $x^*$ is an optimal basic feasible solution to the problem.

a. Write down the Karush-Kuhn-Tucker condition involving $x^*$ (specifying clearly the number of Lagrange and KKT multipliers).
b. Is $x^*$ regular? Explain.
c. Find the tangent space $T(x^*)$ (defined by the active constraints) for this problem.
d. Assume that the relative cost coefficients of all nonbasic variables are strictly positive. Does $x^*$ satisfy the second-order sufficient condition? Explain.

21.14 Consider the problem

minimize $c^T x$
subject to $Ax \le 0$,

where $A \in \mathbb{R}^{m \times n}$, $m < n$, is of full rank. Use the KKT theorem to show that if there exists a solution, then the optimal objective function value is 0.

21.15 Consider a linear programming problem in standard form (see Chapter 15).

a. Write down the Karush-Kuhn-Tucker condition for the problem.
b. Use part a to show that if there exists an optimal feasible solution to the linear program, then there exists a feasible solution to the corresponding dual problem that achieves an objective function value that is the same as the optimal value of the primal (compare this with Theorem 17.1).
c.
Use parts a and b to prove that if $x^*$ is an optimal feasible solution of the primal, then there exists a feasible solution $\lambda^*$ to the dual such that $(c^T - \lambda^{*T}A)x^* = 0$ (compare this with Theorem 17.3).

21.16 Consider the constraint set $S = \{x : h(x) = 0,\ g(x) \le 0\}$. Let $x^* \in S$ be a regular local minimizer of $f$ over $S$ and $J(x^*)$ the index set of active inequality constraints. Show that $x^*$ is also a regular local minimizer of $f$ over the set $S' = \{x : h(x) = 0,\ g_j(x) = 0,\ j \in J(x^*)\}$.

21.17 Solve the following optimization problem using the second-order sufficient conditions:

minimize $x_1^2 + x_2^2$
subject to $x_1^2 - x_2 - 4 \le 0$, $x_2 - x_1^2 - 2 \le 0$.

See Figure 22.1 for a graphical illustration of the problem.

21.18 Solve the following optimization problem using the second-order sufficient conditions:

minimize $x_1^2 + x_2^2$
subject to $x_1 - x_2^2 - 4 \ge 0$, $x_1 - 10 \le 0$.

See Figure 22.2 for a graphical illustration of the problem.

21.19 Consider the problem

minimize $x_1^2 + x_2^2$
subject to $4 - x_1 - x_2^2 \le 0$, $3x_2 - x_1 \le 0$, $-3x_2 - x_1 \le 0$.

Figure 22.3 gives a graphical illustration of the problem. Deduce from the figure that the problem has two strict local minimizers, and use the second-order sufficient conditions to verify the graphical solutions.

21.20 Consider the following optimization problem with an inequality constraint:

minimize $3x_1$
subject to $x_1^2 + x_2^2 \ge 2$.

a. Does the point $x^* = [\sqrt{2}, 0]^T$ satisfy the KKT (first-order necessary) condition?
b. Does the point $x^* = [\sqrt{2}, 0]^T$ satisfy the second-order necessary condition (for problems with inequality constraints)?
c. Is the point $x^* = [\sqrt{2}, 0]^T$ a local minimizer? (See Exercise 6.15 for a similar problem treated using set-constrained methods.)

21.21 Consider the problem

minimize $\frac{1}{2}\|x\|^2$
subject to $a^T x = b$, $x \ge 0$,

where $a \in \mathbb{R}^n$, $a > 0$, and $b \in \mathbb{R}$, $b > 0$. Show that if a solution to the problem exists, then it is unique, and find an expression for it in terms of $a$ and $b$.

21.22 Consider the problem

minimize $(x_1 - a)^2 + (x_2 - b)^2$, $x_1, x_2 \in \mathbb{R}$
subject to $x_1^2 + x_2^2 \le 1$,

where $a, b \in \mathbb{R}$ are given constants satisfying $a^2 + b^2 > 1$.

a. Let $x^* = [x_1^*, x_2^*]^T$ be a solution to the problem. Use the first-order necessary conditions for unconstrained optimization to show that $(x_1^*)^2 + (x_2^*)^2 = 1$.
b. Use the KKT theorem to show that the solution $x^* = [x_1^*, x_2^*]^T$ is unique and has the form $x_1^* = \alpha a$, $x_2^* = \alpha b$, where $\alpha \in \mathbb{R}$ is a positive constant.
c. Find an expression for $\alpha$ (from part b) in terms of $a$ and $b$.

21.23 Consider the problem

minimize $x_1^2 + (x_2 + 1)^2$, $x_1, x_2 \in \mathbb{R}$
subject to $x_2 \ge \exp(x_1)$

[$\exp(x) = e^x$ is the exponential of $x$]. Let $x^* = [x_1^*, x_2^*]^T$ be the solution to the problem.

a. Write down the KKT condition that must be satisfied by $x^*$.
b. Prove that $x_2^* = \exp(x_1^*)$.
c. Prove that $-2 < x_1^* < 0$.

21.24 Consider the problem

minimize $c^T x + 8$
subject to $\frac{1}{2}\|x\|^2 \le 1$,

where $c \in \mathbb{R}^n$, $c \ne 0$. Suppose that $x^* = \alpha e$ is a solution to the problem, where $\alpha \in \mathbb{R}$ and $e = [1, \ldots, 1]^T$, and the corresponding objective value is 4.

a. Show that $\|x^*\|^2 = 2$.
b. Find $\alpha$ and $c$ (they may depend on $n$).

21.25 Consider the problem with equality constraint

minimize $f(x)$
subject to $h(x) = 0$.

We can convert this into the equivalent optimization problem

minimize $f(x)$
subject to $\frac{1}{2}\|h(x)\|^2 \le 0$.

Write down the KKT condition for the equivalent problem (with inequality constraint) and explain why the KKT theorem cannot be applied in this case.
CHAPTER 22

CONVEX OPTIMIZATION PROBLEMS

22.1 Introduction

The optimization problems posed at the beginning of this part are, in general, very difficult to solve. The source of these difficulties may be in the objective function or the constraints. Even if the objective function is simple and "well-behaved," the nature of the constraints may make the problem difficult to solve. We illustrate some of these difficulties in the following examples.

Example 22.1 Consider the optimization problem

minimize $x_1^2 + x_2^2$
subject to $x_2 - x_1^2 - 2 \le 0$, $x_1^2 - x_2 - 4 \le 0$.

The problem is depicted in Figure 22.1, where, as we can see, the constrained minimizer is the same as the unconstrained minimizer. At the minimizer, all the constraints are inactive. If we had only known this fact, we could have approached this problem as an unconstrained optimization problem using techniques from Part II.

Figure 22.1 Situation where the constrained and the unconstrained minimizers are the same.

Example 22.2 Consider the optimization problem

minimize $x_1^2 + x_2^2$
subject to $x_1 - 10 \le 0$, $x_1 - x_2^2 - 4 \ge 0$.

The problem is depicted in Figure 22.2. At the solution, only one constraint is active. If we had only known about this, we could have handled this problem as a constrained optimization problem using the Lagrange multiplier method.

Example 22.3 Consider the optimization problem

minimize $x_1^2 + x_2^2$
subject to $4 - x_1 - x_2^2 \le 0$, $3x_2 - x_1 \le 0$, $-3x_2 - x_1 \le 0$.

The problem is depicted in Figure 22.3. This example illustrates the situation where the constraints introduce local minimizers, even though the objective function itself has only one unconstrained global minimizer.

Figure 22.2 Situation where only one constraint is active.

Figure 22.3 Situation where the constraints introduce local minimizers.

Some of the difficulties illustrated in the examples above can be eliminated if we restrict our problems to convex feasible regions. Admittedly, some important real-life problems do not fit into this framework. On the other hand, it is possible to give results of a global nature for this class of optimization problems. In the next section, we introduce the notion of a convex function, which plays an important role in our subsequent treatment of such problems.

22.2 Convex Functions

We begin with a definition of the graph of a real-valued function.

Definition 22.1 The graph of $f : \Omega \to \mathbb{R}$, $\Omega \subset \mathbb{R}^n$, is the set of points in $\Omega \times \mathbb{R} \subset \mathbb{R}^{n+1}$ given by
$$\left\{ \begin{bmatrix} x \\ f(x) \end{bmatrix} : x \in \Omega \right\}.$$

We can visualize the graph of $f$ as simply the set of points on a "plot" of $f(x)$ versus $x$ (see Figure 22.4). We next define the epigraph of a real-valued function.

Definition 22.2 The epigraph of a function $f : \Omega \to \mathbb{R}$, $\Omega \subset \mathbb{R}^n$, denoted $\mathrm{epi}(f)$, is the set of points in $\Omega \times \mathbb{R}$ given by
$$\mathrm{epi}(f) = \left\{ \begin{bmatrix} x \\ \beta \end{bmatrix} : x \in \Omega,\ \beta \in \mathbb{R},\ \beta \ge f(x) \right\}.$$

The epigraph $\mathrm{epi}(f)$ of a function $f$ is simply the set of points in $\Omega \times \mathbb{R}$ on or above the graph of $f$ (see Figure 22.4). We can also think of $\mathrm{epi}(f)$ as a subset of $\mathbb{R}^{n+1}$.

Recall that a set $\Omega \subset \mathbb{R}^n$ is convex if for every $x_1, x_2 \in \Omega$ and $\alpha \in (0,1)$, $\alpha x_1 + (1-\alpha)x_2 \in \Omega$ (see Section 4.3). We now introduce the notion of a convex function.

Definition 22.3 A function $f : \Omega \to \mathbb{R}$, $\Omega \subset \mathbb{R}^n$, is convex on $\Omega$ if its epigraph is a convex set.

Theorem 22.1 If a function $f : \Omega \to \mathbb{R}$, $\Omega \subset \mathbb{R}^n$, is convex on $\Omega$, then $\Omega$ is a convex set.
Proof. We prove this theorem by contraposition. Suppose that $\Omega$ is not a convex set. Then, there exist two points $y_1$ and $y_2$ such that for some $\alpha \in (0,1)$,
$$z = \alpha y_1 + (1-\alpha)y_2 \notin \Omega.$$
Let $\beta_1 = f(y_1)$, $\beta_2 = f(y_2)$. Then, the pairs
$$\begin{bmatrix} y_1 \\ \beta_1 \end{bmatrix}, \quad \begin{bmatrix} y_2 \\ \beta_2 \end{bmatrix}$$
belong to the graph of $f$, and hence also to the epigraph of $f$. Let
$$w = \alpha \begin{bmatrix} y_1 \\ \beta_1 \end{bmatrix} + (1-\alpha)\begin{bmatrix} y_2 \\ \beta_2 \end{bmatrix}.$$
We have
$$w = \begin{bmatrix} z \\ \alpha\beta_1 + (1-\alpha)\beta_2 \end{bmatrix}.$$
But note that $w \notin \mathrm{epi}(f)$, because $z \notin \Omega$. Therefore, $\mathrm{epi}(f)$ is not convex, and hence $f$ is not a convex function. ∎

Figure 22.4 Graph and epigraph of a function $f : \mathbb{R} \to \mathbb{R}$.

The next theorem gives a very useful characterization of convex functions. This characterization is often used as a definition for a convex function.

Theorem 22.2 A function $f : \Omega \to \mathbb{R}$ defined on a convex set $\Omega \subset \mathbb{R}^n$ is convex if and only if for all $x, y \in \Omega$ and all $\alpha \in (0,1)$, we have
$$f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y).$$

Proof. ⇐: Assume that for all $x, y \in \Omega$ and $\alpha \in (0,1)$,
$$f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y).$$
Let $[x^T, a]^T$ and $[y^T, b]^T$ be two points in $\mathrm{epi}(f)$, where $a, b \in \mathbb{R}$. From the definition of $\mathrm{epi}(f)$ it follows that
$$f(x) \le a, \quad f(y) \le b.$$
Therefore, using the first inequality above, we have
$$f(\alpha x + (1-\alpha)y) \le \alpha a + (1-\alpha)b.$$
Because $\Omega$ is convex, $\alpha x + (1-\alpha)y \in \Omega$. Hence,
$$\begin{bmatrix} \alpha x + (1-\alpha)y \\ \alpha a + (1-\alpha)b \end{bmatrix} \in \mathrm{epi}(f),$$
which implies that $\mathrm{epi}(f)$ is a convex set, and hence $f$ is a convex function.

⇒: Assume that $f : \Omega \to \mathbb{R}$ is a convex function. Let $x, y \in \Omega$ and
$$f(x) = a, \quad f(y) = b.$$
Thus,
$$\begin{bmatrix} x \\ a \end{bmatrix}, \begin{bmatrix} y \\ b \end{bmatrix} \in \mathrm{epi}(f).$$
Because $f$ is a convex function, its epigraph is a convex subset of $\mathbb{R}^{n+1}$. Therefore, for all $\alpha \in (0,1)$, we have
$$\alpha \begin{bmatrix} x \\ a \end{bmatrix} + (1-\alpha)\begin{bmatrix} y \\ b \end{bmatrix} = \begin{bmatrix} \alpha x + (1-\alpha)y \\ \alpha a + (1-\alpha)b \end{bmatrix} \in \mathrm{epi}(f).$$
The above implies that for all $\alpha \in (0,1)$,
$$f(\alpha x + (1-\alpha)y) \le \alpha a + (1-\alpha)b = \alpha f(x) + (1-\alpha)f(y).$$
This completes the proof. ∎

A geometric interpretation of Theorem 22.2 is given in Figure 22.5. The theorem states that if $f : \Omega \to \mathbb{R}$ is a convex function over a convex set $\Omega$, then for all $x, y \in \Omega$, the points on the line segment in $\mathbb{R}^{n+1}$ connecting $[x^T, f(x)]^T$ and $[y^T, f(y)]^T$ must lie on or above the graph of $f$.

Figure 22.5 Geometric interpretation of Theorem 22.2.

Using Theorem 22.2, it is straightforward to show that any nonnegative scaling of a convex function is convex, and that the sum of convex functions is convex.

Theorem 22.3 Suppose that $f$, $f_1$, and $f_2$ are convex functions. Then, for any $a \ge 0$, the function $af$ is convex. Moreover, $f_1 + f_2$ is convex.

Proof. Let $x, y \in \Omega$ and $\alpha \in (0,1)$. Fix $a \ge 0$. For convenience, write $\tilde{f} = af$. We have
$$\tilde{f}(\alpha x + (1-\alpha)y) = a f(\alpha x + (1-\alpha)y)$$
$$\le a(\alpha f(x) + (1-\alpha)f(y)) \quad \text{because $f$ is convex and $a \ge 0$}$$
$$= \alpha(af(x)) + (1-\alpha)(af(y)) = \alpha\tilde{f}(x) + (1-\alpha)\tilde{f}(y),$$
which implies that $\tilde{f}$ is convex.

Next, write $f_s = f_1 + f_2$. We have
$$f_s(\alpha x + (1-\alpha)y) = f_1(\alpha x + (1-\alpha)y) + f_2(\alpha x + (1-\alpha)y)$$
$$\le (\alpha f_1(x) + (1-\alpha)f_1(y)) + (\alpha f_2(x) + (1-\alpha)f_2(y)) \quad \text{by convexity of $f_1$ and $f_2$}$$
$$= \alpha(f_1(x) + f_2(x)) + (1-\alpha)(f_1(y) + f_2(y)) = \alpha f_s(x) + (1-\alpha)f_s(y),$$
which implies that $f_s$ is convex. ∎

Theorem 22.3 implies that for any given collection of convex functions $f_1, \ldots, f_\ell$ and nonnegative numbers $c_1, \ldots, c_\ell$, the function $c_1 f_1 + \cdots + c_\ell f_\ell$ is convex. Using a method of proof similar to that used in Theorem 22.3, it is similarly straightforward to show that the function $\max\{f_1, \ldots, f_\ell\}$ is convex (see Exercise 22.6). We now define the notion of strict convexity.
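Theorem 22.2 also suggests a quick computational sanity check: sampling random pairs of points and values of $\alpha$ can disprove convexity (a single violation suffices), although passing the test proves nothing. A minimal MATLAB sketch of this idea, with an example function of our own choosing:

    % Randomized probe of the inequality in Theorem 22.2.
    f = @(x) x(1)^2 + 4*x(2)^2;        % sample function (a convex quadratic)
    violated = false;
    for k = 1:1000
        x = randn(2,1); y = randn(2,1); a = rand;
        if f(a*x + (1-a)*y) > a*f(x) + (1-a)*f(y) + 1e-12
            violated = true; break     % one violation disproves convexity
        end
    end
    disp(violated)                     % false here: no violation found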
Definition 22.4 A function $f : \Omega \to \mathbb{R}$ on a convex set $\Omega \subset \mathbb{R}^n$ is strictly convex if for all $x, y \in \Omega$, $x \ne y$, and $\alpha \in (0,1)$, we have
$$f(\alpha x + (1-\alpha)y) < \alpha f(x) + (1-\alpha)f(y).$$

From this definition, we see that for a strictly convex function, all points on the open line segment connecting the points $[x^T, f(x)]^T$ and $[y^T, f(y)]^T$ lie (strictly) above the graph of $f$.

Definition 22.5 A function $f : \Omega \to \mathbb{R}$ on a convex set $\Omega \subset \mathbb{R}^n$ is (strictly) concave if $-f$ is (strictly) convex.

Note that the graph of a strictly concave function always lies above the line segment connecting any two points on its graph.

To show that a function is not convex, we need only produce a pair of points $x, y \in \Omega$ and an $\alpha \in (0,1)$ such that the inequality in Theorem 22.2 is violated.

Example 22.4 Let $f(x) = x_1 x_2$. Is $f$ convex over $\Omega = \{x : x_1 \ge 0,\ x_2 \ge 0\}$? The answer is no. Take, for example, $x = [1,2]^T \in \Omega$ and $y = [2,1]^T \in \Omega$. Then,
$$\alpha x + (1-\alpha)y = \begin{bmatrix} 2-\alpha \\ 1+\alpha \end{bmatrix}.$$
Hence,
$$f(\alpha x + (1-\alpha)y) = (2-\alpha)(1+\alpha) = 2 + \alpha - \alpha^2$$
and
$$\alpha f(x) + (1-\alpha)f(y) = 2.$$
If, for example, $\alpha = 1/2 \in (0,1)$, then
$$f\left(\frac{1}{2}x + \frac{1}{2}y\right) = \frac{9}{4} > 2 = \frac{1}{2}f(x) + \frac{1}{2}f(y),$$
which shows that $f$ is not convex over $\Omega$.

Example 22.4 is an illustration of the following general result.

Proposition 22.1 A quadratic form $f : \Omega \to \mathbb{R}$, $\Omega \subset \mathbb{R}^n$, given by $f(x) = x^T Q x$, $Q \in \mathbb{R}^{n \times n}$, $Q = Q^T$, is convex on $\Omega$ if and only if for all $x, y \in \Omega$,
$$(x - y)^T Q (x - y) \ge 0.$$

Proof. The result follows from Theorem 22.2. Indeed, the function $f(x) = x^T Q x$ is convex if and only if for every $\alpha \in (0,1)$ and every $x, y \in \Omega$, we have
$$f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y),$$
or, equivalently,
$$\alpha f(x) + (1-\alpha)f(y) - f(\alpha x + (1-\alpha)y) \ge 0.$$
Substituting for $f$ into the left-hand side of this inequality yields
$$\alpha x^T Q x + (1-\alpha)y^T Q y - (\alpha x + (1-\alpha)y)^T Q (\alpha x + (1-\alpha)y)$$
$$= \alpha x^T Q x + (1-\alpha)y^T Q y - \alpha^2 x^T Q x - 2\alpha(1-\alpha)x^T Q y - (1-\alpha)^2 y^T Q y$$
$$= \alpha(1-\alpha)x^T Q x + \alpha(1-\alpha)y^T Q y - 2\alpha(1-\alpha)x^T Q y$$
$$= \alpha(1-\alpha)(x-y)^T Q (x-y).$$
Therefore, $f$ is convex if and only if
$$\alpha(1-\alpha)(x-y)^T Q (x-y) \ge 0,$$
which proves the result. ∎

Example 22.5 In Example 22.4, $f(x) = x_1 x_2$, which can be written as $f(x) = x^T Q x$, where
$$Q = \frac{1}{2}\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$
Let $\Omega = \{x : x \ge 0\}$, and $x = [2,2]^T \in \Omega$, $y = [1,3]^T \in \Omega$. We have
$$y - x = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$
and
$$(y-x)^T Q (y-x) = \frac{1}{2}[-1, 1]\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = -1 < 0.$$
Hence, by Proposition 22.1, $f$ is not convex on $\Omega$.
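The test in Proposition 22.1 is a one-line computation; the following sketch (ours) reproduces the numbers of Example 22.5:

    % Proposition 22.1 applied to f(x) = x1*x2 = x'*Q*x (Example 22.5).
    Q = [0 0.5; 0.5 0];
    x = [2; 2];  y = [1; 3];
    d = y - x;
    disp(d'*Q*d)      % prints -1 < 0, so f is not convex on Omega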
Differentiable convex functions can be characterized using the following theorem.

Theorem 22.4 Let $f : \Omega \to \mathbb{R}$, $f \in \mathcal{C}^1$, be defined on an open convex set $\Omega \subset \mathbb{R}^n$. Then, $f$ is convex on $\Omega$ if and only if for all $x, y \in \Omega$,
$$f(y) \ge f(x) + Df(x)(y - x).$$

Proof. ⇒: Suppose that $f : \Omega \to \mathbb{R}$ is differentiable and convex. Then, by Theorem 22.2, for any $y, x \in \Omega$ and $\alpha \in (0,1)$ we have
$$f(\alpha y + (1-\alpha)x) \le \alpha f(y) + (1-\alpha)f(x).$$
Rearranging terms yields
$$f(x + \alpha(y - x)) - f(x) \le \alpha(f(y) - f(x)).$$
Upon dividing both sides of this inequality by $\alpha$, we get
$$\frac{f(x + \alpha(y - x)) - f(x)}{\alpha} \le f(y) - f(x).$$
If we now take the limit as $\alpha \to 0$ and apply the definition of the directional derivative of $f$ at $x$ in the direction $y - x$ (see Section 6.2), we get
$$Df(x)(y - x) \le f(y) - f(x),$$
or
$$f(y) \ge f(x) + Df(x)(y - x).$$

⇐: Assume that $\Omega$ is convex, $f : \Omega \to \mathbb{R}$ is differentiable, and for all $x, y \in \Omega$,
$$f(y) \ge f(x) + Df(x)(y - x).$$
Let $u, v \in \Omega$ and $\alpha \in (0,1)$. Because $\Omega$ is convex, $w = \alpha u + (1-\alpha)v \in \Omega$. We also have
$$f(u) \ge f(w) + Df(w)(u - w)$$
and
$$f(v) \ge f(w) + Df(w)(v - w).$$
Multiplying the first of these inequalities by $\alpha$ and the second by $(1-\alpha)$ and then adding them together yields
$$\alpha f(u) + (1-\alpha)f(v) \ge f(w) + Df(w)(\alpha u + (1-\alpha)v - w).$$
But $w = \alpha u + (1-\alpha)v$. Hence,
$$\alpha f(u) + (1-\alpha)f(v) \ge f(\alpha u + (1-\alpha)v).$$
Hence, by Theorem 22.2, $f$ is a convex function. ∎

In Theorem 22.4, the assumption that $\Omega$ be open is not necessary, as long as $f \in \mathcal{C}^1$ on some open set that contains $\Omega$ (e.g., $f \in \mathcal{C}^1$ on $\mathbb{R}^n$).

A geometric interpretation of Theorem 22.4 is given in Figure 22.6. To explain the interpretation, let $x_0 \in \Omega$. The function $\ell(x) = f(x_0) + Df(x_0)(x - x_0)$ is the linear approximation to $f$ at $x_0$. The theorem says that the graph of $f$ always lies above its linear approximation at any point. In other words, the linear approximation to a convex function $f$ at any point of its domain lies below $\mathrm{epi}(f)$.

Figure 22.6 Geometric interpretation of Theorem 22.4.

This geometric idea leads to a generalization of the gradient to the case where $f$ is not differentiable. Let $f : \Omega \to \mathbb{R}$ be defined on an open convex set $\Omega \subset \mathbb{R}^n$. A vector $g \in \mathbb{R}^n$ is said to be a subgradient of $f$ at a point $x \in \Omega$ if for all $y \in \Omega$,
$$f(y) \ge f(x) + g^T(y - x).$$
As in the case of the standard gradient, if $g$ is a subgradient, then for a given $x_0 \in \Omega$, the function $\ell(x) = f(x_0) + g^T(x - x_0)$ lies below $\mathrm{epi}(f)$.

For functions that are twice continuously differentiable, the following theorem gives another possible characterization of convexity.

Theorem 22.5 Let $f : \Omega \to \mathbb{R}$, $f \in \mathcal{C}^2$, be defined on an open convex set $\Omega \subset \mathbb{R}^n$. Then, $f$ is convex on $\Omega$ if and only if for each $x \in \Omega$, the Hessian $F(x)$ of $f$ at $x$ is a positive semidefinite matrix.

Proof. ⇐: Let $x, y \in \Omega$. Because $f \in \mathcal{C}^2$, by Taylor's theorem there exists $\alpha \in (0,1)$ such that
$$f(y) = f(x) + Df(x)(y - x) + \frac{1}{2}(y - x)^T F(x + \alpha(y - x))(y - x).$$
Because $F(x + \alpha(y - x))$ is positive semidefinite,
$$(y - x)^T F(x + \alpha(y - x))(y - x) \ge 0.$$
Therefore, we have
$$f(y) \ge f(x) + Df(x)(y - x),$$
which implies that $f$ is convex, by Theorem 22.4.

⇒: We use contraposition. Assume that there exists $x \in \Omega$ such that $F(x)$ is not positive semidefinite. Therefore, there exists $d \in \mathbb{R}^n$ such that $d^T F(x) d < 0$. By assumption, $\Omega$ is open; thus, the point $x$ is an interior point. By the continuity of the Hessian matrix, there exists a nonzero $s \in \mathbb{R}$ such that $x + sd \in \Omega$, and if we write $y = x + sd$, then for all points $z$ on the line segment joining $x$ and $y$, we have $d^T F(z) d < 0$. By Taylor's theorem there exists $\alpha \in (0,1)$ such that
$$f(y) = f(x) + Df(x)(y - x) + \frac{1}{2}(y - x)^T F(x + \alpha(y - x))(y - x) = f(x) + Df(x)(y - x) + \frac{1}{2}s^2 d^T F(x + \alpha s d)d.$$
Because $\alpha \in (0,1)$, the point $x + \alpha s d$ is on the line segment joining $x$ and $y$, and therefore $d^T F(x + \alpha s d)d < 0$. Because $s \ne 0$, we have $s^2 > 0$, and hence
$$f(y) < f(x) + Df(x)(y - x).$$
Therefore, by Theorem 22.4, $f$ is not a convex function. ∎

Theorem 22.5 can be strengthened to include nonopen sets by modifying the condition to be $(y - x)^T F(x)(y - x) \ge 0$ for all $x, y \in \Omega$ (and assuming that $f \in \mathcal{C}^2$ on some open set that contains $\Omega$; for example, $f \in \mathcal{C}^2$ on $\mathbb{R}^n$). A proof similar to that above can be used in this case.

Note that by the definition of concavity, a function $f : \Omega \to \mathbb{R}$, $f \in \mathcal{C}^2$, is concave over the convex set $\Omega \subset \mathbb{R}^n$ if and only if for all $x \in \Omega$, the Hessian $F(x)$ of $f$ is negative semidefinite.
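For functions with constant Hessians, such as those in the example that follows, the check of Theorem 22.5 reduces to inspecting eigenvalue signs. A short MATLAB sketch (ours), using the Hessians computed in the example below:

    % Hessian tests for parts 2 and 3 of Example 22.6 (constant Hessians).
    F2 = [8 6 1; 6 6 0; 1 0 10];   % Hessian of the function in part 2
    disp(eig(F2)')                  % all eigenvalues positive: convex
    F3 = [-2 2; 2 -2];              % Hessian of the function in part 3
    disp(eig(F3)')                  % eigenvalues -4 and 0: concave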
Example 22.6 Determine whether the following functions are convex, concave, or neither:

1. $f : \mathbb{R} \to \mathbb{R}$, $f(x) = -8x^2$.
2. $f : \mathbb{R}^3 \to \mathbb{R}$, $f(x) = 4x_1^2 + 3x_2^2 + 5x_3^2 + 6x_1x_2 + x_1x_3 - 3x_1 - 2x_2 + 15$.
3. $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x) = 2x_1x_2 - x_1^2 - x_2^2$.

Solution:

1. We use Theorem 22.5. We first compute the Hessian, which in this case is just the second derivative: $(d^2f/dx^2)(x) = -16 < 0$ for all $x \in \mathbb{R}$. Hence, $f$ is concave over $\mathbb{R}$.

2. The Hessian matrix of $f$ is
$$F(x) = \begin{bmatrix} 8 & 6 & 1 \\ 6 & 6 & 0 \\ 1 & 0 & 10 \end{bmatrix}.$$
The leading principal minors of $F(x)$ are
$$\Delta_1 = 8 > 0, \quad \Delta_2 = \det\begin{bmatrix} 8 & 6 \\ 6 & 6 \end{bmatrix} = 12 > 0, \quad \Delta_3 = \det F(x) = 114 > 0.$$
Hence, $F(x)$ is positive definite for all $x \in \mathbb{R}^3$. Therefore, $f$ is a convex function over $\mathbb{R}^3$.

3. The Hessian of $f$ is
$$F(x) = \begin{bmatrix} -2 & 2 \\ 2 & -2 \end{bmatrix},$$
which is negative semidefinite for all $x \in \mathbb{R}^2$. Hence, $f$ is concave on $\mathbb{R}^2$.

22.3 Convex Optimization Problems

In this section we consider optimization problems where the objective function is a convex function and the constraint set is a convex set. We refer to such problems as convex optimization problems or convex programming problems. Optimization problems that can be classified as convex programming problems include linear programs and optimization problems with a quadratic objective function and linear constraints. Convex programming problems are interesting for several reasons. Specifically, as we shall see, local minimizers are global for such problems. Furthermore, first-order necessary conditions become sufficient conditions for minimization.

Our first theorem below states that in convex programming problems, local minimizers are also global.

Theorem 22.6 Let $f : \Omega \to \mathbb{R}$ be a convex function defined on a convex set $\Omega \subset \mathbb{R}^n$. Then, a point is a global minimizer of $f$ over $\Omega$ if and only if it is a local minimizer of $f$.

Proof. ⇒: This is obvious.

⇐: We prove this by contraposition. Suppose that $x^*$ is not a global minimizer of $f$ over $\Omega$. Then, for some $y \in \Omega$, we have $f(y) < f(x^*)$. By assumption, the function $f$ is convex, and hence for all $\alpha \in (0,1)$,
$$f(\alpha y + (1-\alpha)x^*) \le \alpha f(y) + (1-\alpha)f(x^*).$$
Because $f(y) < f(x^*)$, we have
$$\alpha f(y) + (1-\alpha)f(x^*) = \alpha(f(y) - f(x^*)) + f(x^*) < f(x^*).$$
Thus, for all $\alpha \in (0,1)$,
$$f(\alpha y + (1-\alpha)x^*) < f(x^*).$$
Hence, there exist points that are arbitrarily close to $x^*$ and have lower objective function value. For example, the sequence $\{y_n\}$ of points given by $y_n = \frac{1}{n}y + \left(1 - \frac{1}{n}\right)x^*$ converges to $x^*$, and $f(y_n) < f(x^*)$. Hence, $x^*$ is not a local minimizer, which completes the proof. ∎

We now show that the set of global optimizers is convex. For this, we need the following lemma.

Lemma 22.1 Let $g : \Omega \to \mathbb{R}$ be a convex function defined on a convex set $\Omega \subset \mathbb{R}^n$. Then, for each $c \in \mathbb{R}$, the set
$$\Gamma_c = \{x \in \Omega : g(x) \le c\}$$
is a convex set.

Proof. Let $x, y \in \Gamma_c$. Then, $g(x) \le c$ and $g(y) \le c$. Because $g$ is convex, for all $\alpha \in (0,1)$,
$$g(\alpha x + (1-\alpha)y) \le \alpha g(x) + (1-\alpha)g(y) \le c.$$
Hence, $\alpha x + (1-\alpha)y \in \Gamma_c$, which implies that $\Gamma_c$ is convex. ∎

Corollary 22.1 Let $f : \Omega \to \mathbb{R}$ be a convex function defined on a convex set $\Omega \subset \mathbb{R}^n$. Then, the set of all global minimizers of $f$ over $\Omega$ is a convex set.

Proof. The result follows immediately from Lemma 22.1 by setting $c = \min_{x \in \Omega} f(x)$. ∎

We now show that if the objective function is continuously differentiable and convex, then the first-order necessary condition (see Theorem 6.1) for a point to be a minimizer is also sufficient. We use the following lemma.

Lemma 22.2 Let $f : \Omega \to \mathbb{R}$ be a convex function defined on the convex set $\Omega \subset \mathbb{R}^n$, with $f \in \mathcal{C}^1$ on an open convex set containing $\Omega$. Suppose that the point $x^* \in \Omega$ is such that for all $x \in \Omega$, $x \ne x^*$, we have
$$Df(x^*)(x - x^*) \ge 0.$$
Then, $x^*$ is a global minimizer of $f$ over $\Omega$.
Proof. Because the function $f$ is convex, by Theorem 22.4, for all $x \in \Omega$, we have
$$f(x) \ge f(x^*) + Df(x^*)(x - x^*).$$
Hence, the condition $Df(x^*)(x - x^*) \ge 0$ implies that $f(x) \ge f(x^*)$. ∎

Observe that for any $x \in \Omega$, the vector $x - x^*$ can be interpreted as a feasible direction at $x^*$ (see Definition 6.2). Using Lemma 22.2, we have the following theorem (cf. Theorem 6.1).

Theorem 22.7 Let $f : \Omega \to \mathbb{R}$ be a convex function defined on the convex set $\Omega \subset \mathbb{R}^n$, with $f \in \mathcal{C}^1$ on an open convex set containing $\Omega$. Suppose that the point $x^* \in \Omega$ is such that for any feasible direction $d$ at $x^*$, we have
$$d^T \nabla f(x^*) \ge 0.$$
Then, $x^*$ is a global minimizer of $f$ over $\Omega$.

Proof. Let $x \in \Omega$, $x \ne x^*$. By convexity of $\Omega$,
$$x^* + \alpha(x - x^*) = \alpha x + (1-\alpha)x^* \in \Omega$$
for all $\alpha \in (0,1)$. Hence, the vector $d = x - x^*$ is a feasible direction at $x^*$ (see Definition 6.2). By assumption,
$$Df(x^*)(x - x^*) = d^T \nabla f(x^*) \ge 0.$$
Hence, by Lemma 22.2, $x^*$ is a global minimizer of $f$ over $\Omega$. ∎

From Theorem 22.7, we easily deduce the following corollary (compare this with Corollary 6.1).

Corollary 22.2 Let $f : \Omega \to \mathbb{R}$, $f \in \mathcal{C}^1$, be a convex function defined on the convex set $\Omega \subset \mathbb{R}^n$. Suppose that the point $x^* \in \Omega$ is such that
$$\nabla f(x^*) = 0.$$
Then, $x^*$ is a global minimizer of $f$ over $\Omega$. ∎

We now consider the constrained optimization problem

minimize $f(x)$
subject to $h(x) = 0$.

We assume that the feasible set is convex. An example where this is the case is when $h(x) = Ax - b$. The following theorem states that provided the feasible set is convex, the Lagrange condition is sufficient for a point to be a minimizer.

Theorem 22.8 Let $f : \mathbb{R}^n \to \mathbb{R}$, $f \in \mathcal{C}^1$, be a convex function on the set of feasible points
$$\Omega = \{x \in \mathbb{R}^n : h(x) = 0\},$$
where $h : \mathbb{R}^n \to \mathbb{R}^m$, $h \in \mathcal{C}^1$, and $\Omega$ is convex. Suppose that there exist $x^* \in \Omega$ and $\lambda^* \in \mathbb{R}^m$ such that
$$Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T.$$
Then, $x^*$ is a global minimizer of $f$ over $\Omega$.

Proof. By Theorem 22.4, for all $x \in \Omega$, we have
$$f(x) \ge f(x^*) + Df(x^*)(x - x^*).$$
Substituting $Df(x^*) = -\lambda^{*T} Dh(x^*)$ into the inequality above yields
$$f(x) \ge f(x^*) - \lambda^{*T} Dh(x^*)(x - x^*).$$
Because $\Omega$ is convex, $(1-\alpha)x^* + \alpha x \in \Omega$ for all $\alpha \in (0,1)$. Thus,
$$h(x^* + \alpha(x - x^*)) = h((1-\alpha)x^* + \alpha x) = 0$$
for all $\alpha \in (0,1)$. Premultiplying by $\lambda^{*T}$, subtracting $\lambda^{*T} h(x^*) = 0$, and dividing by $\alpha$, we get
$$\frac{\lambda^{*T} h(x^* + \alpha(x - x^*)) - \lambda^{*T} h(x^*)}{\alpha} = 0$$
for all $\alpha \in (0,1)$. If we now take the limit as $\alpha \to 0$ and apply the definition of the directional derivative of $\lambda^{*T} h$ at $x^*$ in the direction $x - x^*$ (see Section 6.2), we get
$$\lambda^{*T} Dh(x^*)(x - x^*) = 0.$$
Hence,
$$f(x) \ge f(x^*),$$
which implies that $x^*$ is a global minimizer of $f$ over $\Omega$. ∎
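When $f$ is quadratic and $h$ is affine, the Lagrange condition of Theorem 22.8 is a linear system, so a global minimizer can be computed directly. A minimal MATLAB sketch with hypothetical data (our own choice of $Q$, $A$, and $b$):

    % Global minimizer of (1/2)x'Qx subject to Ax = b via Theorem 22.8:
    % the Lagrange condition Qx + A'*lambda = 0 together with Ax = b.
    Q = [4 1; 1 3];          % Q = Q' > 0, so the objective is convex
    A = [1 1];  b = 1;       % h(x) = Ax - b gives a convex feasible set
    K = [Q A'; A 0];         % Lagrange condition + feasibility as one system
    z = K \ [0; 0; b];
    x_star = z(1:2)          % global minimizer, by Theorem 22.8
    lambda_star = z(3)       % Lagrange multiplier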
Consider the general constrained optimization problem

minimize $f(x)$
subject to $h(x) = 0$, $g(x) \le 0$.

As before, we assume that the feasible set is convex. This is the case if, for example, the two sets $\{x : h(x) = 0\}$ and $\{x : g(x) \le 0\}$ are convex, because the feasible set is the intersection of these two sets (see also Theorem 4.1). We have already seen an example where the set $\{x : h(x) = 0\}$ is convex. On the other hand, an example where the set $\{x : g(x) \le 0\}$ is convex is when the components of $g = [g_1, \ldots, g_p]^T$ are all convex functions. Indeed, the set $\{x : g(x) \le 0\}$ is the intersection of the sets $\{x : g_i(x) \le 0\}$, $i = 1, \ldots, p$. Because each of these sets is convex (by Lemma 22.1), their intersection is also convex.

We now prove that the Karush-Kuhn-Tucker (KKT) condition is sufficient for a point to be a minimizer of the problem above.

Theorem 22.9 Let $f : \mathbb{R}^n \to \mathbb{R}$, $f \in \mathcal{C}^1$, be a convex function on the set of feasible points
$$\Omega = \{x \in \mathbb{R}^n : h(x) = 0,\ g(x) \le 0\},$$
where $h : \mathbb{R}^n \to \mathbb{R}^m$, $g : \mathbb{R}^n \to \mathbb{R}^p$, $h, g \in \mathcal{C}^1$, and $\Omega$ is convex. Suppose that there exist $x^* \in \Omega$, $\lambda^* \in \mathbb{R}^m$, and $\mu^* \in \mathbb{R}^p$ such that:

1. $\mu^* \ge 0$.
2. $Df(x^*) + \lambda^{*T} Dh(x^*) + \mu^{*T} Dg(x^*) = 0^T$.
3. $\mu^{*T} g(x^*) = 0$.

Then, $x^*$ is a global minimizer of $f$ over $\Omega$.

Proof. Suppose that $x \in \Omega$. By convexity of $f$ and Theorem 22.4,
$$f(x) \ge f(x^*) + Df(x^*)(x - x^*).$$
Using condition 2, we get
$$f(x) \ge f(x^*) - \lambda^{*T} Dh(x^*)(x - x^*) - \mu^{*T} Dg(x^*)(x - x^*).$$
As in the proof of Theorem 22.8, we can show that $\lambda^{*T} Dh(x^*)(x - x^*) = 0$. We now claim that $\mu^{*T} Dg(x^*)(x - x^*) \le 0$. To see this, note that because $\Omega$ is convex, $(1-\alpha)x^* + \alpha x \in \Omega$ for all $\alpha \in (0,1)$. Thus,
$$g(x^* + \alpha(x - x^*)) = g((1-\alpha)x^* + \alpha x) \le 0$$
for all $\alpha \in (0,1)$. Premultiplying by $\mu^{*T} \ge 0^T$ (by condition 1), subtracting $\mu^{*T} g(x^*) = 0$ (by condition 3), and dividing by $\alpha$, we get
$$\frac{\mu^{*T} g(x^* + \alpha(x - x^*)) - \mu^{*T} g(x^*)}{\alpha} \le 0.$$
We now take the limit as $\alpha \to 0$ to obtain $\mu^{*T} Dg(x^*)(x - x^*) \le 0$. From the above, we have
$$f(x) \ge f(x^*) - \lambda^{*T} Dh(x^*)(x - x^*) - \mu^{*T} Dg(x^*)(x - x^*) \ge f(x^*)$$
for all $x \in \Omega$, which completes the proof. ∎

Example 22.7 A bank account starts out with 0 dollars. At the beginning of each month, we deposit some money into the bank account. Denote by $x_k$ the amount deposited in the $k$th month, $k = 1, 2, \ldots$. Suppose that the monthly interest rate is $r > 0$ and the interest is paid into the account at the end of each month (and compounded). We wish to maximize the total amount of money accumulated at the end of $n$ months, such that the total money deposited during the $n$ months does not exceed $D$ dollars (where $D > 0$).

To solve this problem we first show that the problem can be posed as a linear program, and is therefore a convex optimization problem. Let $y_k$ be the total amount in the bank at the end of the $k$th month. Then, $y_k = (1+r)(y_{k-1} + x_k)$, $k \ge 1$, with $y_0 = 0$. Therefore, we want to maximize $y_n$ subject to the constraints that $x_k \ge 0$, $k = 1, \ldots, n$, and $x_1 + \cdots + x_n \le D$. It is easy to deduce that
$$y_n = (1+r)^n x_1 + (1+r)^{n-1} x_2 + \cdots + (1+r)x_n.$$
Let $c^T = [(1+r)^n, (1+r)^{n-1}, \ldots, (1+r)]$, $e^T = [1, \ldots, 1]$, and $x = [x_1, \ldots, x_n]^T$. Then, we can write the problem as

maximize $c^T x$
subject to $e^T x \le D$, $x \ge 0$,

which is a linear program.

It is intuitively clear that the optimal strategy is to deposit $D$ dollars in the first month. To show that this strategy is indeed optimal, we use Theorem 22.9. Let $x^* = [D, 0, \ldots, 0]^T \in \mathbb{R}^n$. Because the problem is a convex programming problem, it suffices to show that $x^*$ satisfies the KKT condition (see Theorem 22.9). The KKT condition for this problem is
$$-c^T + \mu^{(1)} e^T - \mu^{(2)T} = 0^T,$$
$$\mu^{(1)}(e^T x^* - D) = 0,$$
$$\mu^{(2)T} x^* = 0,$$
$$e^T x^* - D \le 0, \quad -x^* \le 0,$$
$$\mu^{(1)} \ge 0, \quad \mu^{(2)} \ge 0,$$
where $\mu^{(1)} \in \mathbb{R}$ and $\mu^{(2)} \in \mathbb{R}^n$. Let $\mu^{(1)} = (1+r)^n$ and $\mu^{(2)} = (1+r)^n e - c$. Then, it is clear that the KKT condition is satisfied. Therefore, $x^*$ is a global solution.
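For a concrete instance, the linear program above can be handed to an LP solver. The sketch below assumes MATLAB's Optimization Toolbox function linprog (which minimizes, so the objective is negated), with hypothetical values of $n$, $r$, and $D$:

    % Example 22.7 solved numerically for n = 3, r = 0.01, D = 100.
    n = 3; r = 0.01; D = 100;
    c = (1 + r).^(n:-1:1)';               % c = [(1+r)^n; ...; (1+r)]
    x = linprog(-c, ones(1,n), D, [], [], zeros(n,1));
    disp(x')                              % expect [D 0 0]: deposit D at once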
An entire book devoted to the vast topic of convexity and optimization is [7]. For extensions of the theory of convex optimization, we refer the reader to [136, Chapter 10]. The study of convex programming problems also serves as a prerequisite to nondifferentiable optimization (see, e.g., [38]).

22.4 Semidefinite Programming

Semidefinite programming is a subfield of convex optimization concerned with minimizing a linear objective function subject to a linear matrix inequality. The linear matrix inequality constraint defines a convex feasible set over which the linear objective function is to be minimized. Semidefinite programming can be viewed as an extension of linear programming, where the componentwise inequalities on vectors are replaced by matrix inequalities (see Exercise 22.20). For further reading on the subject of semidefinite programming, we recommend an excellent survey paper by Vandenberghe and Boyd [128].

Linear Matrix Inequalities and Their Properties

Consider $n + 1$ real symmetric matrices $F_i = F_i^T \in \mathbb{R}^{m \times m}$, $i = 0, 1, \ldots, n$, and a vector $x = [x_1, \ldots, x_n]^T \in \mathbb{R}^n$. Then,
$$F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n = F_0 + \sum_{i=1}^n x_i F_i$$
is an affine function of $x$, because $F(x)$ is composed of a linear term $\sum_{i=1}^n x_i F_i$ and a constant term $F_0$.

Consider now an inequality constraint of the form
$$F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n \ge 0.$$
The inequality constraint above is to be interpreted as the set of vectors $x$ such that $z^T F(x) z \ge 0$ for all $z \in \mathbb{R}^m$; that is, $F(x)$ is positive semidefinite [or, in the usual notation, $F(x) \ge 0$]. Recall that the terms $F_i$ represent constant matrices, $x$ is unknown, and $F(x) = F(x)^T$ is an affine function of $x$. The expression $F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n \ge 0$ is referred to in the literature as a linear matrix inequality (LMI), although the term affine matrix inequality would seem to be more appropriate. It is easy to verify that the set $\{x : F(x) \ge 0\}$ is convex (see Exercise 22.20). We can speak similarly of LMIs of the form $F(x) > 0$, where the requirement is for $F(x)$ to be positive definite (rather than just positive semidefinite). It is again easy to see that the set $\{x : F(x) > 0\}$ is convex.

A system of LMIs
$$F_1(x) > 0, \quad F_2(x) > 0, \quad \ldots, \quad F_k(x) > 0$$
can be represented as one single LMI:
$$F(x) = \begin{bmatrix} F_1(x) & & \\ & \ddots & \\ & & F_k(x) \end{bmatrix} > 0.$$
As an example, a linear inequality involving an $m \times n$ real constant matrix $A$ of the form $Ax \le b$ can be represented as $m$ LMIs:
$$b_i - a_i^T x \ge 0, \quad i = 1, 2, \ldots, m,$$
where $a_i^T$ is the $i$th row of the matrix $A$. We can view each scalar inequality as an LMI. We then represent the $m$ LMIs as one LMI:
$$F(x) = \begin{bmatrix} b_1 - a_1^T x & & \\ & \ddots & \\ & & b_m - a_m^T x \end{bmatrix} \ge 0.$$
With the foregoing facts as background, we can now give an example of semidefinite programming:

minimize $c^T x$
subject to $F(x) \ge 0$.
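Checking whether a given $x$ satisfies an LMI amounts to forming the affine combination and testing the sign of its smallest eigenvalue; a small MATLAB sketch with made-up data $F_0$, $F_1$, $F_2$:

    % Feasibility test for F(x) = F0 + x1*F1 + x2*F2 >= 0 (hypothetical data).
    F0 = eye(2); F1 = [1 0; 0 -1]; F2 = [0 1; 1 0];
    x = [0.3; 0.2];
    F = F0 + x(1)*F1 + x(2)*F2;
    disp(min(eig(F)) >= 0)    % true iff x satisfies the LMI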
Hence, "An ^■21 A2i A22 An O > 0 if and only if O >0; An that is, An A^ > 0 if and only if An A22 A2I Given An A21 > 0 and A n > 0. A12 A22 we can similarly define the Schur complement of A22, assuming that A22 is invertible. We have I 0 -A12A2J I An A2\ I An A22 ~A22 A21 O I Δ22 O O A22 where Δ22 = A n — A12A221 A 2 i is the Schur complement of A22· So, for the case where A12 = A 2 i, An A21 Aji > 0 if and only if A22 > 0 and Δ22 > 0. A22 Many problems of optimization, control design, and signal processing can be formulated in terms of LMIs. To determine whether or not there exists a point x such that F(x) > 0 is called a feasibility problem. We say that the LMI is nonfeasible if no such solution exists. SEMIDEFINITE PROGRAMMING 531 Example 22.8 We now present a simple example illustrating the LMI feasibility problem. Let A G R m x m be a given real constant square matrix. Suppose that we wish to determine if A has all its eigenvalues in the open left half-complex plane. It is well known that this condition is true if and only if there exists a real symmetric positive definite matrix P such that ATP + PA<0, or, equivalently, -ATP -PA>0 (also called the Lyapunov inequality; see [16]). Thus, the location of all eigenvalues of A being in the open left half-complex plane is equivalent to feasibility of the following matrix inequality: |> [O O 1 -ATP-PA\ >0; that is, the existence of P = PT > 0 such that ATP + PA < 0. We now show that finding P = PT > 0 such that AT P + PA < 0 is indeed an LMI. For this, let where X\ X2 %2 %m+l yx<m %2m—l #2ra-l ' ' ' m(m + 1) %n J 532 CONVEX OPTIMIZATION PROBLEMS We next define the following matrices: Pi P2 = Pn 1 0 0 0 0 0 0 0 0 ·· • ·· • ·· • 0 0 0 0 0 0 ·· • o 0 1 0 1 0 0 0 0 0 ·· • ·· • · • 0' 0 0 0 0 0 0 0 0··· 0 0 0 0 ··♦ 0 0 0 0 ··· 0 [0 0 0 ··· 1| Note that Pi has only nonzero elements corresponding to Xi in P. Let Fi = -ATPi i = 1,2,..., n. - PiA, We can then write ATP + PA = xx ( A T P i + ΡλΑ} + xn(ATPn = -XlFi + x2 ( A T P 2 + P 2 A ) + ·· · + PnA) - X2F2 XnFn <0. Let Then, if and only if F(x) = x i F i 4- X2F2 + · · · + P = P T >0 and ΑΎΡ + xnFn. ΡΑ<0 F(x) > 0. Note that this LMI involves a strict inequality. Most numerical solvers do not handle strict inequalities. Such solvers simply treat a strict inequality (>) as a non-strict inequality (>). I SEMIDEFINITE PROGRAMMING 533 LMI Solvers The inequality F(x) = Fo + xiFi + · · · + xnFn > 0 is called the canonical representation of an LMI. Numerical LMI solvers do not deal directly with LMIs in canonical form because of various inefficiencies. Instead, LMI solvers use a structured representation of LMIs. We can use MATLAB's LMI toolbox to solve LMIs efficiently. This toolbox has three types of LMI solvers, which we discuss next. Finding a Feasible Solution Under LMI Constraints First, we discuss MATLAB's LMI solver for solving the feasibility problem defined by a given system of LMI constraints. Using this solver, we can solve any system of LMIs of the form NTC(XU . . . , Xk)N < MTK(XU ..., Xk)M, where X\,..., Xk are matrix variables, N is the left outer factor, M is the right outer factor, C{X\,..., Xk) is the left inner factor, and 7 £ ( X i , . . . , Xk) is the right inner factor. The matrices £(·) and 7£(·) are, in general, symmetric block matrices. We note that the term left-hand side refers to what is on the "smaller" side of the inequality 0 < X. Thus in X > 0, the matrix X is still on the right-hand side because it is on the "larger " side of the inequality. 
We now provide a description of an approach that can be used to solve the given LMI system feasibility problem. To initialize the LMI system description, we type s e t l m i s ( [ ] ) . Then we declare matrix variables using the command lmivar. The command lmiterm allows us to specify LMIs that constitute the LMI system under consideration. Next, we need to obtain an internal representation using the command getlmis. We next compute a feasible solution to the LMI system using the command f easp. After that, we extract matrix variable values with the command dec2mat. In summary, a general structure of a MATLAB program for finding a feasible solution to the set of LMIs could have the form setlmis([]) lmivar lmiterm lmiterm getlmis feasp dec2mat We now analyze these commands in some detail so that the reader can write simple MATLAB programs for solving LMIs after completing this section. 534 CONVEX OPTIMIZATION PROBLEMS First, to create a new matrix-valued variable, say, X, in the given LMI system, we use the command X = lmivar(type,structure) The input type specifies the structure of the variable X. There may be three structures of matrix variables. When type=l, we have a symmetric block diagonal matrix variable. The input type=2 refers to a full rectangular matrix variable. Finally, type=3 refers to other cases. The second input s t r u c t u r e gives additional information on the structure of the matrix variable X. For example, the matrix variable X could have the form |\Di O \ O D2 ··· ··· [O O ··· Ol 0\ Dr\ where each Di is a square symmetric matrix. For the example above we would use type=l. The matrix variable above has r blocks. The input s t r u c t u r e is then an r x 2 matrix whose iih row describes the ith block, where the first component of each row gives the corresponding block size, while the second element of each row specifies the block type. For example, X = l m i v a r ( l , [ 3 1]) specifies a full symmetric 3 x 3 matrix variable. On the other hand, X = l m i v a r ( 2 , [ 2 3]) specifies a rectangular 2 x 3 matrix variable. Finally, a matrix variable S of the form 0 0 0" " «1 0 0 0 Sl 0 _0 0 0 S2 S3 S3 s4 . can be declared as follows: S = l m i v a r ( l , [ 2 0;2 1]) Note above that the second component of the first row of the second input has the value of zero; that is, s t r u c t u r e d , 2 ) =0. This describes a scalar block matrix of the form £>i = β ι / 2 . Note that the second block is a 2 x 2 symmetric full block. SEMIDEFINITE PROGRAMMING 535 We next take a closer look at a command whose purpose is to specify the terms of the LMI system of interest. This command has the form lmiterm(termid,A,B,flag) We briefly describe each of the four inputs of this command. The first input, termid, is a row with four elements that specify the terms of each LMI of the LMI system. We have t e r m i d ( l ) = n to specify the left-hand side of the nth LMI. We use t e r m i d ( l ) = - n to specify the right-hand side of the nth LMI. The middle two elements of the input termid specify the block location. Thus termid(2,3) = [i j ] refers to the term that belongs to the (i,j) block of the LMI specified by the first component. Finally, termid(4) =0 for the constant term, termid (4) =X for the variable term in the form AXB, while termid(4)=-X for the variable term in the form ΑΧΎΒ. The second and third inputs of the command lmiterm give the values of the left and right outer factors; that is, A and B give the values of the constant outer factors in the variable terms AXB and AX B. 
Finally, the fourth input to lmiterm serves as a compact way to specify the expression AXB (AXB)T. + Thus, f l a g = , s ' can be used to denote a symmetrized expression. We now illustrate the command above on the following LMI: PA + (PA)T < 0. We have one LMI with two terms. We could use the following description of this single LMI: lmiterm ([1 1 1 P],1,A) lmiterm([l 1 1 -P],Α',1) On the other hand, we can describe this LMI compactly using the f l a g as follows: lmiterm([l 1 1 Ρ ϋ , Ι , Α , ' β ' ) Now, to solve the feasibility problem we could have typed [tmin,xfeas] = feas(lmis) In general, for a given LMI feasibility problem of the form find such that x L(x) < R(x), the command f easp solves the auxiliary convex problem minimize subject to t L(x) < R(x) -f 11. 536 CONVEX OPTIMIZATION PROBLEMS The system of LMIs is feasible if the minimal t is negative. We add that the current value of t is displayed by f easp at each iteration. Finally, we convert the output of the LMI solver into matrix variables using the command P = dec2mat(lmis,xfeas,P). Example 22.9 Let Ai = -1 0 0 -1 and A2 = -2 1 0 -1 We use the commands of the LMI Control Toolbox discussed above to write a program that finds P such that P > O.5J2 and A[P + PAX <O, Ä%P + PA2 <0. The program is as follows: A_l = [-1 0;0 - 1 ] ; A_2 = [-2 0;1 - 1 ] ; setlmis( [] ) P = lmivar(l,[2,l]) lmiterm([l 1 1 P] , Α _ 1 ' , 1 , ' s ' ) lmiterm([2 1 1 P ] , A _ 2 ' , 1 , ' s ' ) lmiterm([3 1 1 0 ] , . 5 ) lmiterm([-3 1 1 P ] , l , l ) lmis=getlmis; [tmin,xfeas] = feasp(lmis); P = dec2mat(lmis,xfeas,P) Minimizing a Linear Objective Under LMI Constraints The next solver we discuss solves the convex optimization problem minimize subject to cTx A(x) < B(x). The notation A(x) < B{x) is shorthand notation for a general structured LMI system. This solver is invoked using the function mincx. Thus, to solve a mincx problem, in addition to specifying the LMI constraints as in the f easp problem, we also declare the linear objective function. Then we invoke the function SEMIDEFINITE PROGRAMMING 537 mincx. We illustrate and contrast the f easp and mincx solvers in the following example. Example 22.10 Consider the optimization problem minimize subject to cTx Ax < 6, where 4 5J "l 5 's] l" 1 3 , 2 1 b = 18 U\ We first solve the feasibility problem; that is, we find an x such that Ax < 6, using the f easp solver. After that, we solve the minimization problem above using the mincx solver. A simple MATLAB code accomplishing these tasks is shown below. °/e Enter problem d a t a A = [1 1;1 3 ; 2 1] ; b = [8 18 14] ' ; c = [-4 - 5 ] >; setlmis([]) ; X = lmivar(2,[2 1]); lmiterm([l 1 1 X],A(1,:) l ) ; lmiterm([l 1 1 0 ] , - b ( D ) lmiterm([l 2 2 X],A(2,:) l ) ; lmiterm([l 2 2 0 ] , - b ( 2 ) ) lmiterm([l 3 3 X],A(3,:) 1); lmiterm([l 3 3 0 ] , - b ( 3 ) ) lmis = getlmis; disp(' feasp r e s u l t ') [tmin,xfeas] = f e a s p ( l m i s ) ; x_feasp = dec2mat(lmis,xfeas,X) disp(' mincx r e s u l t ') [objective,x_mincx] = mincx(lmis,c,[0.0001 1000 0 0 1]) The f easp function produces •Efeasp — -64.3996 -25.1712 538 CONVEX OPTIMIZATION PROBLEMS The mincx function produces Xri 3.0000 5.0000 In the next example, we discuss the function def ex, which we can use to construct the vector c used by the LMI solver mincx. Example 22.11 Suppose that we wish to solve the optimization problem minimize subject to trace(P) ATP + PA < 0 P>0 where trace(P) is the sum of the diagonal elements of P . We can use the function mincx to solve this problem. 
However, to use mincx, we need a vector c such that cTx = trace (P). After specifying the LMIs and obtaining their internal representation using, for example, the command lmisys=getlmis, we can obtain the desired c with the following MATLAB code, q = decnbr(lmisys); c = zeros(q,l); for j = l:q Pj = defcxClmisys,j,P); c(j) = trace(Pj); end Having obtained the vector c, we can use the function mincx to solve the optimization problem. I Minimizing a Generalized Eigenvalue Under LMI Constraints This problem can be stated as minimize subject to λ C(x) < D(x) 0 < B{x) A(x) < XB{x). Here, we need to distinguish between standard LMI constraints of the form C(x) < D(x) and linear-fractional LMIs of the form A(x) < XB(x), which SEMIDEFINITE PROGRAMMING 539 are concerned with the generalized eigenvalue λ. The generalized eigenvalue minimization problem under LMI constraints can be solved using the solver gevp. The basic structure of the gevp solver has the form [ l o p t , x o p t ] = gep{lmisys,nflc} which returns lopt, the global minimum of the generalized eigenvalue, and xopt, the optimal decision vector variable. The argument lmisys is the system of LMIs, C{x) < D(x), C(x) < D(x), and A(x) < XB(x) for λ = 1. As in the previous solvers, the corresponding optimal values of the matrix variables are obtained using dec2mat. The number of linear-fractional constraints is specified with nf l c . There are other inputs to gevp but they are optional. For more information on this type of the LMI solver, we refer the reader to the LMI Lab in MATLAB's Robust Control Toolbox user's guide. E x a m p l e 22.12 Consider the problem of finding the smallest a such that P>0 T AP + PA< -αΡ, where ^-1.1853 0.9134 0.2785 0.9058 -1.3676 0.5469 0.1270 0.0975 -3.0000 This problem is related to finding the decay rate of the stable linear differential equation x = Ax. Finding a that solves the optimization problem above can be accomplished using the following LMIs: A = [-1.1853 0.9134 0.2785 0.9058 -1.3676 0.5469 0.1270 0.0975 -3.0000]; setlmis( [] ) ; P = lmivar(l,[3 1]) lmiterm([-l 1 1 P ] , 1,1) °/0 P lmiterm([l 1 1 0],.01) 7, P >= 0.01*1 lmiterm([2 1 1 Ρΐ,Ι,Α,'ε') °/0 linear fractional constraint—--LHS lmiterm( [-2 1 1 P ] , 1,1) °/0 linear fractional constraint—--RHS lmis = getlmis; [gamma,P_opt] = gevp(lmis,1); P = dec2mat(lmis,P_opt,P) alpha = -gamma The result is a = 0.6561 and P = 0.6996 -0.7466 -0.0296 -0.7466 0.8537 -0.2488 -0.0296 -0.2488 3.2307 540 CONVEX OPTIMIZATION PROBLEMS Notice that we used P > 0.011 in place of P > 0. I More examples of linear matrix inequalities in system and control theory can be found in the book by Boyd et al. [16]. A quick introduction to MATLAB's LMI toolbox is the tutorial that can be accessed with the command Imidem within MATLAB. In addition to the MATLAB's LMI toolbox, there is another toolbox for solving LMIs called LMITOOL, a built-in software package in Scilab toolbox, developed at INRIA in Prance. Scilab offers free software for numerical optimization. There is a version of LMITOOL for MATLAB that can be obtained from the website of the Scilab Consortium. Yet Another LMI Package, YALMIP, for solving LMIs was developed in Switzerland in the Automatic Control Laboratory at ETH. YALMIP is an "intuitive and flexible modelling language for solving optimization problems in MATLAB." LMIs are tools of modern optimization. The following quote on numerical linear algebra from Gill, Murray, and Wright [52, p. 
"At the heart of modern optimization methods are techniques associated with linear algebra. Numerical linear algebra applies not simply in optimization, but in all fields of scientific computation, including approximation, ordinary differential equations, and partial differential equations. The importance of numerical linear algebra to modern scientific computing cannot be overstated. Without fast and reliable linear algebraic building blocks, it is impossible to develop effective optimization methods; without some knowledge of the fundamental issues in linear algebra, it is impossible to understand what happens during the transition from equations in a textbook to actual computation."

EXERCISES

22.1 Find the range of values of the parameter $a$ for which the function
$$f(x_1, x_2, x_3) = 2x_1x_3 - x_1^2 - x_2^2 - 5x_3^2 - 2ax_1x_2 - 4x_2x_3$$
is concave.

22.2 Consider the function
$$f(x) = \frac{1}{2}x^T Q x - x^T b,$$
where $Q = Q^T > 0$ and $x, b \in \mathbb{R}^n$. Define the function $\phi : \mathbb{R} \to \mathbb{R}$ by $\phi(\alpha) = f(x + \alpha d)$, where $x, d \in \mathbb{R}^n$ are fixed vectors and $d \ne 0$. Show that $\phi(\alpha)$ is a strictly convex quadratic function of $\alpha$.
Are the points (or point) global minimizers for this problem? 22.13 Let / : R n —► R, / G C1, be a convex function on the set of feasible points Ω = {x G R n :ajx + bi > 0, i = l , . . . , p } , where o i , . . . , ap G R n , and 6 i , . . . , bp G R. Suppose that there exist x* G 5, and μ* G R p , μ* < 0, such that Df(x*)+ £ μ>Τ=Ο τ , jeJ(x*) where J(x*) = {i : a ^ « * -f 6» = 0}. Show that x* is a global minimizer of / over Ω. 22.14 Consider the problem: minimize \\x\\2 (x G R n ) subject to aTx > &, where a G R n is a nonzero vector and b G R, b > 0. Suppose that x* is a solution to the problem. a. Show that the constraint set is convex. b . Use the KKT theorem to show that α τ χ * = b. EXERCISES 543 c. Show that x* is unique, and find an expression for x* in terms of a and b. 22.15 Consider the problem minimize subject to cTcc, x > 0. a;Gln For this problem we have the following theorem (see also Exercise 17.16). Theorem: A solution to this problem exists if and only if c > 0. Moreover, if a solution exists, 0 is a solution. a. Show that the problem is a convex programming problem. b . Use the first-order necessary condition (for set constraints) to prove the theorem. c. Use the KKT condition to prove the above theorem. 22.16 Consider a linear programming problem in standard form. a. Derive the KKT condition for the problem. b . Explain precisely why the KKT condition is sufficient for optimality in this case. c. Write down the dual to the standard form primal problem (see Chapter 17). d. Suppose that x* and λ* are feasible solutions to the primal and dual, respectively. Use the KKT condition to prove that if the complementary slackness condition (c T — λ* Α)χ* = 0 holds, then x* is an optimal solution to the primal problem. Compare the above with Exercise 21.15. 22.17 Consider two real-valued discrete-time signals, s^ and s^2\ defined over the time interval [l,n]. Let s\ ' and s\ ' be the values at time i of the signals s^ and s^2\ respectively. Assume that the energies of the two signals are 1 [i.e., ( s ^ ) 2 + · · · + (s^)2 = 1 and (s^)2 + · · · + (ώ 2 ) ) 2 = 1]. Let Sa be the set of all signals that are linear combinations of s^ and s^ with the property that for each signal in 5 a , the value of the signal over all time is no smaller than a e R. For each s G 5 a , if s = Xi8^ + # 2 ^ 2 \ we call x\ and #2 the coefficients of s. We wish to find a signal in Sa such that the sum of the squares of its coefficients is minimized. a. Formulate the problem as an optimization problem. 544 CONVEX OPTIMIZATION PROBLEMS b . Derive the Karush-Kuhn-Tucker conditions for the problem. c. Suppose that you have found a point satisfying the Karush-Kuhn-Tucker conditions. Does this point satisfy the second-order sufficient condition? d. Is this problem a convex optimization problem? 22.18 Let a probability vector be any vector p G R n satisfying pi > 0, i = 1 , . . . , n, and p\ Λ V pn = 1. Let p e M n and q G R n be two probability vectors. Define D(p, q) = Pl log ( V ) + · · · + Pn log {^j , where "log" is the natural logarithm function. a. Let Ω be the set of all probability vectors (with fixed n). Show that Ω is convex. b . Show that for each fixed p, the function / defined by f(q) = D(p, q) is convex over Ω. c. Show the following: D{p,q) > 0 for any probability vectors p and q. Moreover, D(p, q) = 0 if and only if p — q. d. Describe an application of the result of part c. 22.19 Let Ω C R n be a nonempty closed convex set and z G R n be a given point such that z 0 Ω. 
Consider the optimization problem minimize subject to \\x — z\\ x G Ω. Does this problem have an optimal solution? If so, is it unique? Whatever your assertion, prove it. Hint: (i) If X\ and #2 are optimal solutions, what can you say about X3 — (xi + X2V2? (ii) The triangle inequality states that ||x-fi/||<||ic||H-||i/||, with equality holding if and only if x = ay for some a > 0 (or x = 0 or y = 0). 22.20 This exercise is about semidefinite programming. a. Show that if A G R n X n and B G R n X n are symmetric and A > 0, B > 0, then for any a G (0,1), we have a A + (1 — a)B > 0. As usual, the notation "> 0" denotes positive semidefiniteness. EXERCISES 545 Consider the following semidefinite programming problem, that is, an optimization problem with linear objective function and linear matrix inequality constraints: minimize subject to cTx P 0 + Y^ XjFj > 0> where x = [xi,... , x n ] T € M.n is the decision variable, c G Mn, and P 0 , -PI, . . . , Fn G R m x m are symmetric. Show that this problem is a convex optimization problem. c. Consider the linear programming problem minimize subject to c x Ax > 6, where A G R m x n , b G R m , and the inequality Ax > b has the usual elementwise interpretation. Show that this linear programming problem can be converted to the problem in part b. Hint: First consider diagonal Fj. 22.21 Suppose that you have a cake and you need to divide it among n different children. Suppose that the ith child receives a fraction xi of the cake. We will call the vector x = [ # i , . . . , x n ] T an allocation. We require that every child receives at least some share of the cake, and that the entire cake is completely used up in the allocation. We also impose the additional condition that the first child (i = 1) is allocated a share that is at least twice that of any other child. We say that the allocation is feasible if it meets all these requirements. A feasible allocation x is said to be proportionally fair if for any other allocation y, i=i Xl a. Let Ω be the set of all feasible allocations. Show that Ω is convex. b . Show that a feasible allocation is proportionally fair if and only if it solves the following optimization problem: maximize y^log(a^) i=l subject to x G Ω. 546 CONVEX OPTIMIZATION PROBLEMS 22.22 Let Ui : R —> R, Ui G C1, i = 1 , . . . , n, be a set of concave increasing functions. Consider the optimization problem n maximize 2_\ Ui (xi) i=l n subject to 2_] χΐ — C-> i=\ where C > 0 is a given constant. a. Show that the optimization problem above is a convex optimization problem. b . Show that cc* = [#*,... , # * ] T is an optimal solution to the optimization problem if and only if there exists a scalar μ* > 0 such that x* = argmax x (£/*(#) — μ*χ). [The quantity Ui(x) has the interpretation of the "utility" of x, whereas μ* has the interpretation of a "price" per unit of x.] c. Show that £ ? = i x* - C. 22.23 Give an example of a function / : R 2 —► R, a set Ω = {# : g(x) < 0}, and a regular point x* e Ω, such that the following all hold simultaneously: 1. x* satisfies the FONC for set constraint Ω (Theorem 6.1). 2. x* satisfies the KKT condition for inequality constraint g(x) < 0 (Theorem 21.1). 3. x* satisfies the SONC for set constraint Ω (Theorem 6.2). 4. x* does not satisfy the SONC for inequality constraint g{x) < 0 (Theorem 21.2). Be sure to show carefully that your choice of/, Ω = {x : g(x) < 0}, and x* satisfies all the conditions above simultaneously. 
22.24 This question is on duality theory for nonlinear programming problems, analogous to the theory for linear programming (Chapter 17). (A version for quadratic programming is considered in Exercise 17.24.) Consider the following optimization problem:

minimize f(x)
subject to g(x) ≤ 0,

where f : R^n → R is convex, each component of g : R^n → R^m is convex, and f, g ∈ C^1. Let us call this problem the primal problem. Define the dual of the problem above as

maximize q(μ)
subject to μ ≥ 0,

where q is defined by q(μ) = min_{x ∈ R^n} l(x, μ), with l(x, μ) = f(x) + μ^T g(x) the Lagrangian at (x, μ). Prove the following results:

a. If x_0 and μ_0 are feasible points in the primal and dual, respectively, then f(x_0) ≥ q(μ_0). This is the weak duality lemma for nonlinear programming, analogous to Lemma 17.1.

b. If x_0 and μ_0 are feasible points in the primal and dual, and f(x_0) = q(μ_0), then x_0 and μ_0 are optimal solutions to the primal and dual, respectively.

c. If the primal has an optimal (feasible) solution, then so does the dual, and their objective function values are equal. (You may assume regularity.) This is the duality theorem for nonlinear programming, analogous to Theorem 17.2.

22.25 Consider the matrix

M = [ 1   γ  -1
      γ   1   2
     -1   2   5 ],

where γ is a parameter.

a. Find the Schur complement of M(1,1).

b. Find the Schur complement of M(2:3, 2:3) (the bottom-right 2×2 submatrix of M, using MATLAB notation).

22.26 Represent the Lyapunov inequality A^T P + PA < 0, where

A = [ 0   1
     -1  -2 ],

as a canonical LMI.

22.27 Let A, B, and R be given matrices such that R = R^T > 0. Suppose that we wish to find a symmetric positive definite matrix P satisfying the following quadratic inequality:

A^T P + PA + PBR^{-1}B^T P < 0.

Represent this inequality in the form of LMIs. (This inequality should not be confused with the algebraic Riccati inequality, which has a negative sign in front of the third term.)

22.28 Let

A = [ -0.9501  -0.4860  -0.4565
      -0.2311  -0.8913  -0.0185
      -0.6068  -0.7621  -0.8214 ].

Write a MATLAB program that finds a matrix P satisfying 0.1 I_3 ≤ P ≤ I_3 and A^T P + PA < 0.

CHAPTER 23

ALGORITHMS FOR CONSTRAINED OPTIMIZATION

23.1 Introduction

In Part II we discussed algorithms for solving unconstrained optimization problems. In this chapter we present some simple algorithms for solving special constrained optimization problems. The methods here build on those of Part II. We begin our presentation in the next section with a discussion of projections, including a treatment of projected gradient methods for problems with linear equality constraints. We then consider Lagrangian methods. Finally, we consider penalty methods. This chapter is intended as an introduction to ideas underlying methods for solving constrained optimization problems. For an in-depth coverage of the subject, we refer the reader to [11].

23.2 Projections

The optimization algorithms considered in Part II have the general form

x^{(k+1)} = x^{(k)} + α_k d^{(k)},

where d^{(k)} is typically a function of ∇f(x^{(k)}). The value of x^{(k)} is not constrained to lie inside any particular set. Such an algorithm is not immediately applicable to solving constrained optimization problems in which the decision variable is required to lie within a prespecified constraint set. Consider the optimization problem

minimize f(x)
subject to x ∈ Ω.
If we use the algorithm above to solve this constrained problem, the iterates x^{(k)} may not satisfy the constraints. Therefore, we need to modify the algorithms to take into account the presence of the constraints. A simple modification involves the introduction of a projection. The idea is as follows. If x^{(k)} + α_k d^{(k)} is in Ω, then we set x^{(k+1)} = x^{(k)} + α_k d^{(k)} as usual. If, on the other hand, x^{(k)} + α_k d^{(k)} is not in Ω, then we "project" it back into Ω before setting x^{(k+1)}.

To illustrate the projection method, consider the case where the constraint set Ω ⊂ R^n is given by

Ω = {x : l_i ≤ x_i ≤ u_i, i = 1,...,n}.

In this case, Ω is a "box" in R^n; for this reason, this form of Ω is called a box constraint. Given a point x ∈ R^n, define y = Π[x] ∈ R^n by

y_i = u_i  if x_i > u_i,
y_i = x_i  if l_i ≤ x_i ≤ u_i,
y_i = l_i  if x_i < l_i.

The point Π[x] is called the projection of x onto Ω. Note that Π[x] is actually the "closest" point in Ω to x. Using the projection operator Π, we can modify the previous unconstrained algorithm as follows:

x^{(k+1)} = Π[x^{(k)} + α_k d^{(k)}].

Note that the iterates x^{(k)} now all lie inside Ω. We call the algorithm above a projected algorithm.

In the more general case, we can define the projection onto Ω:

Π[x] = arg min_{z ∈ Ω} ||z - x||.

In this case, Π[x] is again the "closest" point in Ω to x. This projection operator is well-defined only for certain types of constraint sets: for example, closed convex sets (see Exercise 22.19). For some sets Ω, the "arg min" above is not well-defined. If the projection Π is well-defined, we can similarly apply the projected algorithm

x^{(k+1)} = Π[x^{(k)} + α_k d^{(k)}].

In some cases, there is a formula for computing Π[x]. For example, if Ω represents a box constraint as described above, then the formula given previously can be used. Another example is where Ω is a linear variety, which is discussed in the next section. In general, even if the projection Π is well-defined, the computation of Π[x] for a given x may not be easy. Often, the projection Π[x] may have to be computed numerically. However, the numerical computation of Π[x] itself entails solving an optimization problem. Indeed, the computation of Π[x] may be as difficult as the original optimization problem, as is the case in the following example:

minimize ||x||^2
subject to x ∈ Ω.

Note that the solution to this problem can be written as Π[0]. Therefore, if 0 ∉ Ω, the computation of a projection is equivalent to solving the given optimization problem.

As an example, consider the projection method applied specifically to the gradient algorithm (see Chapter 8). Recall that the vector -∇f(x) points in the direction of maximum rate of decrease of f at x. This was the basis for gradient methods for unconstrained optimization, which have the form

x^{(k+1)} = x^{(k)} - α_k ∇f(x^{(k)}),

where α_k is the step size. The choice of the step size α_k depends on the particular gradient algorithm. For example, recall that in the steepest descent algorithm, α_k = arg min_{α ≥ 0} f(x^{(k)} - α∇f(x^{(k)})). The projected version of the gradient algorithm has the form

x^{(k+1)} = Π[x^{(k)} - α_k ∇f(x^{(k)})].

We refer to the above as the projected gradient algorithm.
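Because the box projection is just a componentwise clamp, the projected gradient algorithm takes only a few lines of MATLAB. The following minimal sketch illustrates it on a small problem of our own choosing; the objective, bounds, and step size are illustrative assumptions, not data from the text.

    l = [-1; -1]; u = [2; 2];          % box: l_i <= x_i <= u_i
    Pi = @(x) min(max(x, l), u);       % componentwise projection onto the box
    gradf = @(x) 2*(x - [3; 0]);       % gradient of f(x) = ||x - [3;0]||^2
    x = [0; 0];                        % starting point
    alpha = 0.1;                       % fixed step size
    for k = 1:100
        x = Pi(x - alpha*gradf(x));    % x^(k+1) = Pi[x^(k) - alpha*grad f(x^(k))]
    end
    % x is now close to [2; 0], the projection of the unconstrained
    % minimizer [3; 0] onto the box.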
Example 23.1 Consider the problem

minimize (1/2) x^T Q x
subject to ||x||^2 = 1,

where Q = Q^T > 0. Suppose that we apply a fixed-step-size projected gradient algorithm to this problem.

a. Derive a formula for the update equation for the algorithm (i.e., write down an explicit formula for x^{(k+1)} as a function of x^{(k)}, Q, and the fixed step size α). You may assume that the argument in the projection operator to obtain x^{(k+1)} is never zero.

b. Is it possible for the algorithm not to converge to an optimal solution even if the step size α > 0 is taken to be arbitrarily small?

c. Show that for 0 < α < 1/λ_max (where λ_max is the largest eigenvalue of Q), the fixed-step-size projected gradient algorithm (with step size α) converges to an optimal solution, provided that x^{(0)} is not orthogonal to the eigenvectors of Q corresponding to the smallest eigenvalue. (Assume that the eigenvalues are distinct.)

Solution:

a. The projection operator in this case simply maps any nonzero vector to the closest point on the unit sphere. Therefore, the projection operator is given by Π[x] = x/||x||, provided that x ≠ 0. The update equation is

x^{(k+1)} = β_k (x^{(k)} - αQx^{(k)}) = β_k (I - αQ) x^{(k)},

where β_k = 1/||(I - αQ)x^{(k)}|| (i.e., it is whatever constant scaling is needed to make x^{(k+1)} have unit norm).

b. If we start with x^{(0)} being an eigenvector of Q, then x^{(k)} = x^{(0)} for all k. Therefore, if the corresponding eigenvalue is not the smallest, then clearly the algorithm is stuck at a point that is not optimal.

c. Let v_1,...,v_n be orthonormal eigenvectors of Q corresponding to the eigenvalues λ_1 < λ_2 < ··· < λ_n, and write x^{(k)} = y_1^{(k)} v_1 + ··· + y_n^{(k)} v_n. We have

x^{(k+1)} = β_k (I - αQ) x^{(k)} = β_k (y_1^{(k)} (I - αQ)v_1 + ··· + y_n^{(k)} (I - αQ)v_n).

But (I - αQ)v_i = (1 - αλ_i)v_i, where λ_i is the eigenvalue corresponding to v_i. Hence,

x^{(k+1)} = β_k (y_1^{(k)}(1 - αλ_1)v_1 + ··· + y_n^{(k)}(1 - αλ_n)v_n),

which means that y_i^{(k+1)} = β_k y_i^{(k)}(1 - αλ_i). In other words, y_i^{(k)} = β^{(k)} y_i^{(0)} (1 - αλ_i)^k, where β^{(k)} = β_0 β_1 ··· β_{k-1}. Assuming that y_1^{(0)} ≠ 0, we obtain

y_i^{(k)} / y_1^{(k)} = (y_i^{(0)} (1 - αλ_i)^k) / (y_1^{(0)} (1 - αλ_1)^k) = (y_i^{(0)}/y_1^{(0)}) ((1 - αλ_i)/(1 - αλ_1))^k.

Using the fact that (1 - αλ_i)/(1 - αλ_1) < 1 for i > 1 (because λ_i > λ_1 and α < 1/λ_max), we deduce that y_i^{(k)}/y_1^{(k)} → 0 for each i > 1. Because ||x^{(k)}|| = 1 for all k, this implies that x^{(k)} → v_1 (up to sign), an eigenvector corresponding to the smallest eigenvalue of Q; that is, x^{(k)} converges to an optimal solution, as required. ■
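The update derived in part a is easily checked numerically. The following MATLAB sketch uses an illustrative Q, step size, and starting point of our own choosing:

    Q = [2 0; 0 1];                  % Q = Q' > 0, eigenvalues 2 and 1
    alpha = 0.4;                     % 0 < alpha < 1/lambda_max = 0.5
    x = [1; 1]/sqrt(2);              % unit-norm start, y_1^(0) ~= 0
    for k = 1:100
        y = (eye(2) - alpha*Q)*x;    % unprojected step x - alpha*Q*x
        x = y/norm(y);               % projection back onto the unit sphere
    end
    % x is now close to [0; 1] (up to sign), the eigenvector for the
    % smallest eigenvalue of Q, as predicted by part c.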
23.3 Projected Gradient Methods with Linear Constraints

In this section we consider optimization problems of the form

minimize f(x)
subject to Ax = b,

where f : R^n → R, A ∈ R^{m×n}, m < n, rank A = m, and b ∈ R^m. We assume throughout that f ∈ C^1. In the problem above, the constraint set is Ω = {x : Ax = b}. The specific structure of the constraint set allows us to compute the projection operator Π using the orthogonal projector (see Section 3.3). Specifically, Π[x] can be defined using the orthogonal projector matrix P given by

P = I_n - A^T (AA^T)^{-1} A

(see Example 12.5). Two important properties of the orthogonal projector P that we use in this section are (see Theorem 3.5):

1. P = P^T.
2. P^2 = P.

Another property of the orthogonal projector that we need in our discussion is given in the following lemma.

Lemma 23.1 Let v ∈ R^n. Then, Pv = 0 if and only if v ∈ R(A^T). In other words, N(P) = R(A^T). Moreover, Av = 0 if and only if v ∈ R(P); that is, N(A) = R(P). □

Proof. ⇒: We have

Pv = (I_n - A^T(AA^T)^{-1}A)v = v - A^T(AA^T)^{-1}Av.

If Pv = 0, then v = A^T(AA^T)^{-1}Av, and hence v ∈ R(A^T).

⇐: Suppose that there exists u ∈ R^m such that v = A^T u. Then,

Pv = (I_n - A^T(AA^T)^{-1}A)A^T u = A^T u - A^T(AA^T)^{-1}AA^T u = 0.

Hence, we have proved that N(P) = R(A^T). Using an argument similar to that above, we can show that N(A) = R(P). ■

Recall that in unconstrained optimization, the first-order necessary condition for a point x^* to be a local minimizer is ∇f(x^*) = 0 (see Section 6.2). In optimization problems with equality constraints, the Lagrange condition plays the role of the first-order necessary condition (see Section 20.4). When the constraint set takes the form {x : Ax = b}, the Lagrange condition can be written as P∇f(x^*) = 0, as stated in the following proposition.

Proposition 23.1 Let x^* ∈ R^n be a feasible point. Then, P∇f(x^*) = 0 if and only if x^* satisfies the Lagrange condition. □

Proof. By Lemma 23.1, P∇f(x^*) = 0 if and only if ∇f(x^*) ∈ R(A^T). This is equivalent to the condition that there exists λ^* ∈ R^m such that ∇f(x^*) + A^T λ^* = 0, which, together with the feasibility equation Ax^* = b, constitutes the Lagrange condition. ■

Recall that the projected gradient algorithm has the form

x^{(k+1)} = Π[x^{(k)} - α_k ∇f(x^{(k)})].

For the case where the constraints are linear, it turns out that we can express the projection Π in terms of the matrix P as follows:

Π[x^{(k)} - α_k ∇f(x^{(k)})] = x^{(k)} - α_k P∇f(x^{(k)}),

assuming that x^{(k)} ∈ Ω. Although the formula above can be derived algebraically (see Exercise 23.4), it is more insightful to derive the formula using a geometric argument, as follows. In our constrained optimization problem, the vector -∇f(x) is not necessarily a feasible direction. In other words, if x^{(k)} is a feasible point and we apply the algorithm x^{(k+1)} = x^{(k)} - α_k ∇f(x^{(k)}), then x^{(k+1)} need not be feasible. This problem can be overcome by replacing -∇f(x^{(k)}) by a vector that points in a feasible direction. Note that the set of feasible directions is simply the nullspace N(A) of the matrix A. Therefore, we should first project the vector -∇f(x) onto N(A). This projection is equivalent to multiplication by the matrix P. In summary, in the projected gradient algorithm, we update x^{(k)} according to the equation

x^{(k+1)} = x^{(k)} - α_k P∇f(x^{(k)}).
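In MATLAB, the projector P and the resulting iteration can be coded directly. The following minimal sketch uses illustrative data of our own choosing for A, b, the objective, and the step size:

    A = [1 1 1]; b = 1;                  % a single equality constraint in R^3
    P = eye(3) - A'*((A*A')\A);          % orthogonal projector onto N(A)
    gradf = @(x) x;                      % gradient of f(x) = (1/2)||x||^2
    x = [1; 0; 0];                       % feasible starting point: A*x = b
    alpha = 0.5;                         % fixed step size
    for k = 1:50
        x = x - alpha*P*gradf(x);        % each iterate remains feasible
    end
    % x is now close to [1/3; 1/3; 1/3], the minimum-norm point on Ax = b.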
The projected gradient algorithm has the following property.

Proposition 23.2 In a projected gradient algorithm, if x^{(0)} is feasible, then each x^{(k)} is feasible; that is, for each k ≥ 0, Ax^{(k)} = b. □

Proof. We proceed by induction. The result holds for k = 0 by assumption. Suppose now that Ax^{(k)} = b. We now show that Ax^{(k+1)} = b. To show this, first observe that P∇f(x^{(k)}) ∈ N(A). Therefore,

Ax^{(k+1)} = A(x^{(k)} - α_k P∇f(x^{(k)})) = Ax^{(k)} - α_k AP∇f(x^{(k)}) = b,

which completes the proof. ■

The projected gradient algorithm updates x^{(k)} in the direction of -P∇f(x^{(k)}). This vector points in the direction of maximum rate of decrease of f at x^{(k)} along the surface defined by Ax = b, as described in the following argument. Let x be any feasible point and d a feasible direction such that ||d|| = 1. The rate of increase of f at x in the direction d is ⟨∇f(x), d⟩. Next, we note that because d is a feasible direction, it lies in N(A), and hence by Lemma 23.1 we have d ∈ R(P) = R(P^T). So there exists v such that d = Pv. Hence,

⟨∇f(x), d⟩ = ⟨∇f(x), Pv⟩ = ⟨P^T ∇f(x), v⟩ = ⟨P∇f(x), v⟩.

By the Cauchy-Schwarz inequality,

⟨P∇f(x), v⟩ ≤ ||P∇f(x)|| ||v||,

with equality if and only if the direction of v is parallel with the direction of P∇f(x). Therefore, the vector -P∇f(x) points in the direction of maximum rate of decrease of f at x among all feasible directions.

Following the discussion in Chapter 8 for gradient methods in unconstrained optimization, we suggest the following gradient method for our constrained problem. Suppose that we have a starting point x^{(0)}, which we assume is feasible; that is, Ax^{(0)} = b. Consider the point x = x^{(0)} - αP∇f(x^{(0)}), where α ∈ R. As usual, the scalar α is called the step size. By the discussion above, x is also a feasible point. Using a Taylor series expansion of f about x^{(0)} and the fact that P = P^2 = P^T P, we get

f(x^{(0)} - αP∇f(x^{(0)})) = f(x^{(0)}) - α∇f(x^{(0)})^T P∇f(x^{(0)}) + o(α)
                          = f(x^{(0)}) - α||P∇f(x^{(0)})||^2 + o(α).

Thus, if P∇f(x^{(0)}) ≠ 0, that is, x^{(0)} does not satisfy the Lagrange condition, then we can choose α sufficiently small such that f(x) < f(x^{(0)}), which means that x = x^{(0)} - αP∇f(x^{(0)}) is an improvement over x^{(0)}. This is the basis for the projected gradient algorithm

x^{(k+1)} = x^{(k)} - α_k P∇f(x^{(k)}),

where the initial point x^{(0)} satisfies Ax^{(0)} = b and α_k is some step size. As for unconstrained gradient methods, the choice of α_k determines the behavior of the algorithm. For small step sizes the algorithm progresses slowly, while large step sizes may result in a zigzagging path. A well-known variant of the projected gradient algorithm is the projected steepest descent algorithm, where α_k is given by

α_k = arg min_{α ≥ 0} f(x^{(k)} - αP∇f(x^{(k)})).

The following theorem states that the projected steepest descent algorithm is a descent algorithm, in the sense that at each step the value of the objective function decreases.

Theorem 23.1 If {x^{(k)}} is the sequence of points generated by the projected steepest descent algorithm and if P∇f(x^{(k)}) ≠ 0, then f(x^{(k+1)}) < f(x^{(k)}). □

Proof. First, recall that

x^{(k+1)} = x^{(k)} - α_k P∇f(x^{(k)}),

where α_k ≥ 0 is the minimizer of

φ_k(α) = f(x^{(k)} - αP∇f(x^{(k)}))

over all α ≥ 0. Thus, for α ≥ 0, we have φ_k(α_k) ≤ φ_k(α). By the chain rule,

φ_k'(0) = -∇f(x^{(k)})^T P∇f(x^{(k)}).

Using the fact that P = P^2 = P^T P, we get

φ_k'(0) = -∇f(x^{(k)})^T P^T P∇f(x^{(k)}) = -||P∇f(x^{(k)})||^2 < 0,

because P∇f(x^{(k)}) ≠ 0 by assumption. Thus, there exists ᾱ > 0 such that φ_k(0) > φ_k(α) for all α ∈ (0, ᾱ]. Hence,

f(x^{(k+1)}) = φ_k(α_k) ≤ φ_k(ᾱ) < φ_k(0) = f(x^{(k)}),

which completes the proof of the theorem. ■

In Theorem 23.1 we needed the assumption that P∇f(x^{(k)}) ≠ 0 to prove that the algorithm possesses the descent property. If for some k we have P∇f(x^{(k)}) = 0, then by Proposition 23.1 the point x^{(k)} satisfies the Lagrange condition. This condition can be used as a stopping criterion for the algorithm. Note that in this case, x^{(k+1)} = x^{(k)}. For the case where f is a convex function, the condition P∇f(x^{(k)}) = 0 is, in fact, equivalent to x^{(k)} being a global minimizer of f over the constraint set {x : Ax = b}. We show this in the following proposition.

Proposition 23.3 The point x^* ∈ R^n is a global minimizer of a convex function f over {x : Ax = b} if and only if P∇f(x^*) = 0. □

Proof. We first write h(x) = Ax - b. Then the constraints can be written as h(x) = 0, and the problem is of the form considered in earlier chapters. Note that Dh(x) = A. Hence, x^* ∈ R^n is a global minimizer of f if and only if the Lagrange condition holds (see Theorem 22.8). By Proposition 23.1, this is true if and only if P∇f(x^*) = 0, and this completes the proof. ■

For an application of the projected steepest descent algorithm to minimum fuel and minimum amplitude control problems in linear discrete systems, see [78].

23.4 Lagrangian Algorithms

In this section we consider an optimization method based on the Lagrangian function (see Section 20.4). The basic idea is to use gradient algorithms to update the decision variable and the Lagrange multiplier vector simultaneously.
We consider first the case with equality constraints, followed by inequality constraints.

Lagrangian Algorithm for Equality Constraints

Consider the following optimization problem with equality constraints:

minimize f(x)
subject to h(x) = 0,

where h : R^n → R^m. Recall that for this problem the Lagrangian function is given by l(x, λ) = f(x) + λ^T h(x). Assume that f, h ∈ C^2; as usual, denote the Hessian of the Lagrangian at (x, λ) by L(x, λ). The Lagrangian algorithm for this problem is given by

x^{(k+1)} = x^{(k)} - α_k (∇f(x^{(k)}) + Dh(x^{(k)})^T λ^{(k)}),
λ^{(k+1)} = λ^{(k)} + β_k h(x^{(k)}).

Notice that the update equation for x^{(k)} is a gradient algorithm for minimizing the Lagrangian with respect to its x argument, and the update equation for λ^{(k)} is a gradient algorithm for maximizing the Lagrangian with respect to its λ argument. Because only the gradient is used, the method is also called the first-order Lagrangian algorithm.

The following lemma establishes that if the algorithm converges, the limit must satisfy the Lagrange condition. More specifically, the lemma states that any fixed point of the algorithm must satisfy the Lagrange condition. A fixed point of an update algorithm is simply a point with the property that when updated using the algorithm, the resulting point is equal to the given point. For the case of the Lagrangian algorithm, which updates both the x^{(k)} and λ^{(k)} vectors, a fixed point is a pair of vectors. If the Lagrangian algorithm converges, the limit must be a fixed point. We omit the proof of the lemma because it follows easily by inspection.

Lemma 23.2 For the Lagrangian algorithm for updating x^{(k)} and λ^{(k)}, the pair (x^*, λ^*) is a fixed point if and only if it satisfies the Lagrange condition. □
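As an illustration, the following MATLAB sketch implements the first-order Lagrangian algorithm for a small equality-constrained problem; the problem data, step sizes, and iteration count are our own illustrative assumptions:

    gradf = @(x) x;                    % gradient of f(x) = (1/2)||x||^2
    h  = @(x) x(1) + x(2) - 2;         % equality constraint h(x) = 0
    Dh = [1 1];                        % Jacobian of h (constant here)
    x = [0; 0]; lam = 0;               % initial point and multiplier
    alpha = 0.1; beta = 0.1;
    for k = 1:500
        xnew = x - alpha*(gradf(x) + Dh'*lam);   % descent step in x
        lam  = lam + beta*h(x);                  % ascent step in lambda
        x    = xnew;
    end
    % (x, lam) approaches the Lagrange pair x* = [1; 1], lam* = -1.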
Below, we use (x^*, λ^*) to denote a pair satisfying the Lagrange condition. Assume that L(x^*, λ^*) > 0. Also assume that x^* is a regular point. With these assumptions, we are now ready to state and prove that the algorithm is locally convergent. For simplicity, we will take α_k and β_k to be fixed constants (not depending on k), denoted α and β, respectively.

Theorem 23.2 For the Lagrangian algorithm for updating x^{(k)} and λ^{(k)}, provided that α and β are sufficiently small, there is a neighborhood of (x^*, λ^*) such that if the pair (x^{(0)}, λ^{(0)}) is in this neighborhood, then the algorithm converges to (x^*, λ^*) with at least a linear order of convergence. □

Proof. We can rescale x and λ by appropriate constants (so that the assumptions are preserved) and effectively change the relative values of the step sizes for the update equations. Therefore, without loss of generality, we can take β = α.

We will set up our proof by introducing some convenient notation. Given a pair (x, λ), let w = [x^T, λ^T]^T be the (n+m)-vector constructed by concatenating x and λ. Similarly define w^{(k)} = [x^{(k)T}, λ^{(k)T}]^T and w^* = [x^{*T}, λ^{*T}]^T. Define the map U : R^{n+m} → R^{n+m} by

U(w) = [ x - α(∇f(x) + Dh(x)^T λ)
         λ + αh(x)               ].

Then the Lagrangian algorithm can be rewritten as w^{(k+1)} = U(w^{(k)}).

We now write ||w^{(k+1)} - w^*|| in terms of ||w^{(k)} - w^*||, where ||·|| denotes the usual Euclidean norm. By Lemma 23.2, w^* = [x^{*T}, λ^{*T}]^T is a fixed point of w^{(k+1)} = U(w^{(k)}). Therefore,

||w^{(k+1)} - w^*|| = ||U(w^{(k)}) - U(w^*)||.

Let DU be the (matrix) derivative of U:

DU(w) = I + α [ -L(x, λ)   -Dh(x)^T
                 Dh(x)       O      ].

By the mean value theorem (see Theorem 5.9),

U(w^{(k)}) - U(w^*) = G(w^{(k)})(w^{(k)} - w^*),

where G(w^{(k)}) is a matrix whose rows are the rows of DU evaluated at points that lie on the line segment joining w^{(k)} and w^* (these points may differ from row to row). Taking norms of both sides of the equation above,

||U(w^{(k)}) - U(w^*)|| ≤ ||G(w^{(k)})|| ||w^{(k)} - w^*||.

Finally, combining the above, we have

||w^{(k+1)} - w^*|| ≤ ||G(w^{(k)})|| ||w^{(k)} - w^*||.

We now claim that for sufficiently small α > 0, ||DU(w^*)|| < 1. Our argument here follows [11, Section 4.4]. Let

M = [ -L(x^*, λ^*)   -Dh(x^*)^T
       Dh(x^*)         O         ],

so that DU(w^*) = I + αM. Hence, to prove the claim, it suffices to show that the eigenvalues of M all lie in the open left-half complex plane. For any complex vector y, let y^H represent its complex conjugate transpose (or Hermitian) and ℜ(y) its real part. Let σ be an eigenvalue of M and w = [x^T, λ^T]^T ≠ 0 a corresponding eigenvector. Then

ℜ(w^H M w) = ℜ(σ)||w||^2.

However, from the structure of M, we can readily see that

ℜ(w^H M w) = -ℜ(x^H L(x^*, λ^*)x) - ℜ(x^H Dh(x^*)^T λ) + ℜ(λ^H Dh(x^*)x) = -ℜ(x^H L(x^*, λ^*)x),

because -x^H Dh(x^*)^T λ + λ^H Dh(x^*)x is purely imaginary. By the assumption that L(x^*, λ^*) > 0, we know that ℜ(x^H L(x^*, λ^*)x) > 0 if x ≠ 0. Therefore, comparing the two equations above, we deduce that ℜ(σ) < 0, as required, provided that x is nonzero, as we now demonstrate. Suppose that x = 0. Because w is an eigenvector of M, we have Mw = σw. Extracting the first n components, we have Dh(x^*)^T λ = 0. By the regularity assumption, we deduce that λ = 0. This contradicts the assumption that w ≠ 0. Hence, x ≠ 0, which completes the proof of our claim that for sufficiently small α > 0, ||DU(w^*)|| < 1.

The result of the foregoing claim allows us to pick constants η > 0 and κ < 1 such that for all w = [x^T, λ^T]^T satisfying ||w - w^*|| ≤ η, we have ||G(w)|| ≤ κ (this follows from the continuity of DU and norms). To complete the proof, suppose that ||w^{(0)} - w^*|| ≤ η. We will show by induction that for all k ≥ 0, ||w^{(k)} - w^*|| ≤ η and ||w^{(k+1)} - w^*|| ≤ κ||w^{(k)} - w^*||, from which we conclude that w^{(k)} converges to w^* with at least a linear order of convergence. For k = 0, the result follows because ||w^{(0)} - w^*|| ≤ η by assumption, and

||w^{(1)} - w^*|| ≤ ||G(w^{(0)})|| ||w^{(0)} - w^*|| ≤ κ||w^{(0)} - w^*||.

So suppose that the result holds for k. Then ||w^{(k+1)} - w^*|| ≤ κ||w^{(k)} - w^*|| ≤ η, and hence ||G(w^{(k+1)})|| ≤ κ, from which we can write

||w^{(k+2)} - w^*|| ≤ ||G(w^{(k+1)})|| ||w^{(k+1)} - w^*|| ≤ κ||w^{(k+1)} - w^*||.

This completes the proof. ■

Lagrangian Algorithm for Inequality Constraints

Consider the following optimization problem with inequality constraints:

minimize f(x)
subject to g(x) ≤ 0,

where g : R^n → R^p. Recall that for this problem the Lagrangian function is given by l(x, μ) = f(x) + μ^T g(x). As before, assume that f, g ∈ C^2; as usual, denote the Hessian of the Lagrangian by L(x, μ). The Lagrangian algorithm for this problem is given by

x^{(k+1)} = x^{(k)} - α_k (∇f(x^{(k)}) + Dg(x^{(k)})^T μ^{(k)}),
μ^{(k+1)} = [μ^{(k)} + β_k g(x^{(k)})]_+,

where [·]_+ = max{·, 0} (applied componentwise). Notice that, as before, the update equation for x^{(k)} is a gradient algorithm for minimizing the Lagrangian with respect to its x argument. The update equation for μ^{(k)} is a projected gradient algorithm for maximizing the Lagrangian with respect to its μ argument. The reason for the projection is that the KKT multiplier vector is required to be nonnegative to satisfy the KKT condition.
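A minimal MATLAB sketch of the inequality-constrained version is given below; the one-dimensional problem, step sizes, and iteration count are our own illustrative assumptions:

    gradf = @(x) 2*(x - 2);            % gradient of f(x) = (x - 2)^2
    g  = @(x) x - 1;                   % constraint g(x) = x - 1 <= 0
    Dg = 1;                            % derivative of g
    x = 0; mu = 0;                     % initial point and multiplier
    alpha = 0.1; beta = 0.1;
    for k = 1:500
        xnew = x - alpha*(gradf(x) + Dg'*mu);  % descent step in x
        mu   = max(mu + beta*g(x), 0);         % projected ascent step [.]+
        x    = xnew;
    end
    % (x, mu) approaches the KKT pair x* = 1, mu* = 2.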
The following lemma establishes that if the algorithm converges, the limit must satisfy the KKT condition. As before, we use the notion of a fixed point to state the result formally. The proof is omitted because the result follows easily by inspection.

Lemma 23.3 For the Lagrangian algorithm for updating x^{(k)} and μ^{(k)}, the pair (x^*, μ^*) is a fixed point if and only if it satisfies the KKT condition. □

As before, we use the notation (x^*, μ^*) to denote a pair satisfying the KKT condition. Assume that L(x^*, μ^*) > 0. Also assume that x^* is a regular point. With these assumptions, we are now ready to state and prove that the algorithm is locally convergent. As before, we will take α_k and β_k to be fixed constants (not depending on k), denoted α and β, respectively. Our analysis examines the behavior of the algorithm in two phases. In the first phase, the "nonactive" multipliers decrease to zero in finite time and remain at zero thereafter. In the second phase, the x^{(k)} iterates and the "active" multipliers converge jointly to their respective solutions, with at least a linear order of convergence.

Theorem 23.3 For the Lagrangian algorithm for updating x^{(k)} and μ^{(k)}, provided that α and β are sufficiently small, there is a neighborhood of (x^*, μ^*) such that if the pair (x^{(0)}, μ^{(0)}) is in this neighborhood, then (1) the nonactive multipliers reduce to zero in finite time and remain at zero thereafter and (2) the algorithm converges to (x^*, μ^*) with at least a linear order of convergence. □

Proof. As in the proof of Theorem 23.2, we can rescale x and μ by appropriate constants (so that the assumptions are preserved) and effectively change the relative values of the step sizes for the update equations. Therefore, without loss of generality, we can take β = α.

We set up our proof using the same vector notation as before. Given a pair (x, μ), let w = [x^T, μ^T]^T be the (n+p)-vector constructed by concatenating x and μ. Similarly define w^{(k)} = [x^{(k)T}, μ^{(k)T}]^T and w^* = [x^{*T}, μ^{*T}]^T. Define the map U as

U(w) = [ x - α(∇f(x) + Dg(x)^T μ)
         μ + αg(x)               ].

Also, define the map Π by

Π[w] = [ x
         [μ]_+ ].

Then the update equations can be rewritten as

w^{(k+1)} = Π[U(w^{(k)})].

Because Π is a projection onto the convex set {w = [x^T, μ^T]^T : μ ≥ 0}, it is a nonexpansive map (see [12, Proposition 3.2]), which means that ||Π[v] - Π[w]|| ≤ ||v - w||.

We now write ||w^{(k+1)} - w^*|| in terms of ||w^{(k)} - w^*||, where ||·|| denotes the usual Euclidean norm. By Lemma 23.3, w^* = [x^{*T}, μ^{*T}]^T is a fixed point of w^{(k+1)} = Π[U(w^{(k)})]. Therefore,

||w^{(k+1)} - w^*|| = ||Π[U(w^{(k)})] - Π[U(w^*)]|| ≤ ||U(w^{(k)}) - U(w^*)||

by the nonexpansiveness of Π. Let DU be the (matrix) derivative of U:

DU(w) = I + α [ -L(x, μ)   -Dg(x)^T
                 Dg(x)       O      ].

By the mean value theorem,

U(w^{(k)}) - U(w^*) = G(w^{(k)})(w^{(k)} - w^*),

where G(w^{(k)}) is a matrix whose rows are the rows of DU evaluated at points that lie on the line segment joining w^{(k)} and w^* (these points may differ from row to row). Taking norms of both sides of the equation above yields

||U(w^{(k)}) - U(w^*)|| ≤ ||G(w^{(k)})|| ||w^{(k)} - w^*||.

Finally, combining the above, we obtain

||w^{(k+1)} - w^*|| ≤ ||G(w^{(k)})|| ||w^{(k)} - w^*||.

Let g_A represent those rows of g corresponding to active constraints (at x^*) and g_Ā represent the remaining rows of g.
[Recall that by regularity, Dg_A(x^*) has full rank.] Given a vector μ, we divide it into two subvectors μ_A and μ_Ā, according to active and nonactive components, respectively. (Note that μ*_Ā = 0, the zero vector.) Next, write w_A = [x^T, μ_A^T]^T and

U_A(w_A) = [ x - α(∇f(x) + Dg_A(x)^T μ_A)
             μ_A + αg_A(x)               ],

so that

DU_A(w_A) = I + α [ -L(x, μ_A)   -Dg_A(x)^T
                     Dg_A(x)       O        ].

Finally, let G_A be such that

U_A(w_A^{(k)}) - U_A(w_A^*) = G_A(w_A^{(k)})(w_A^{(k)} - w_A^*)

(by the mean value theorem as before). We organize the remainder of our proof into four claims.

Claim 1: For sufficiently small α > 0, ||DU_A(w_A^*)|| < 1.

The argument here parallels that of the proof of Theorem 23.2. So, for the sake of brevity, we omit the details.

The result of claim 1 allows us to pick constants η > 0, δ > 0, and κ_A < 1 such that for all w = [x^T, μ^T]^T satisfying ||w - w^*|| ≤ η, we have ||G_A(w_A)|| ≤ κ_A and g_Ā(x) ≤ -δe, where e is the vector with all components equal to 1. The first inequality follows from claim 1 and the continuity of DU_A(·) and ||·||. The second follows from the fact that the components of g_Ā(x^*) are negative. Let κ = max{||G(w)|| : ||w - w^*|| ≤ η}, which we assume to be at least 1; otherwise, set κ = 1. Now pick ε > 0 such that εκ^{ε/(αδ)} ≤ η. We can do this because the left side of this inequality goes to 0 as ε → 0. Assume for convenience that k_0 = ε/(αδ) is an integer; otherwise, replace all instances of ε/(αδ) by the smallest integer that exceeds it (i.e., round it up to the closest integer). For the remainder of this proof, let w^{(0)} satisfy ||w^{(0)} - w^*|| ≤ ε.

Claim 2: For k = 0,...,k_0, ||w^{(k)} - w^*|| ≤ η.

To prove the claim, we show by induction that ||w^{(k)} - w^*|| ≤ εκ^k (which is bounded above by η provided that k ≤ k_0). For k = 0, by assumption ||w^{(0)} - w^*|| ≤ ε = εκ^0, as required. For the inductive step, suppose that ||w^{(k)} - w^*|| ≤ εκ^k for k < k_0. Now, using ||w^{(k+1)} - w^*|| ≤ ||G(w^{(k)})|| ||w^{(k)} - w^*|| and the fact that ||w^{(k)} - w^*|| ≤ η,

||w^{(k+1)} - w^*|| ≤ ||G(w^{(k)})|| ||w^{(k)} - w^*|| ≤ κ(εκ^k) = εκ^{k+1},

and the result now follows by induction.

Claim 3: For k = 0,...,k_0, μ_Ā^{(k)} is monotonically nonincreasing in k, and μ_Ā^{(k_0)} = 0 (which is equal to μ*_Ā).

By claim 2, g_Ā(x^{(k)}) ≤ -δe for all k = 0,...,k_0. Hence, for k < k_0,

μ_Ā^{(k+1)} = [μ_Ā^{(k)} + αg_Ā(x^{(k)})]_+ ≤ [μ_Ā^{(k)} - αδe]_+ ≤ μ_Ā^{(k)},

which establishes nonincreasing monotonicity. To show that μ_Ā^{(k_0)} = 0, suppose that for some nonactive component l, we have μ_l^{(k_0)} > 0. By the monotonicity above, μ_l^{(k)} > 0 for k = 0,...,k_0. Hence, the projection [·]_+ is inactive for this component, and

μ_l^{(k_0)} = μ_l^{(k_0-1)} + αg_l(x^{(k_0-1)}) = ··· = μ_l^{(0)} + α Σ_{k=0}^{k_0-1} g_l(x^{(k)}).

But by claim 2, g_l(x^{(k)}) ≤ -δ for all k = 0,...,k_0 - 1, and μ_l^{(0)} ≤ ε. Hence, μ_l^{(k_0)} ≤ ε - k_0 αδ ≤ 0, which is a contradiction.

Finally, we will state and prove claim 4, which completes the proof of the theorem.

Claim 4: For k ≥ k_0, we have μ_Ā^{(k)} = 0 = μ*_Ā, ||w_A^{(k+1)} - w_A^*|| ≤ κ_A ||w_A^{(k)} - w_A^*||, and ||w^{(k)} - w^*|| ≤ η.

We use induction. For k = k_0, we have ||w^{(k_0)} - w^*|| ≤ η by claim 2 and μ_Ā^{(k_0)} = 0 by claim 3. Because μ_Ā^{(k_0)} = 0, the nonactive multipliers contribute nothing to the x-update, and

w_A^{(k_0+1)} = Π[U_A(w_A^{(k_0)})],

where Π here denotes the projection [·]_+ applied to the multiplier components. Thus,

||w_A^{(k_0+1)} - w_A^*|| = ||Π[U_A(w_A^{(k_0)})] - Π[U_A(w_A^*)]|| ≤ ||U_A(w_A^{(k_0)}) - U_A(w_A^*)|| ≤ ||G_A(w_A^{(k_0)})|| ||w_A^{(k_0)} - w_A^*||,

where ||G_A(w_A^{(k_0)})|| ≤ κ_A because ||w^{(k_0)} - w^*|| ≤ η. Hence, ||w_A^{(k_0+1)} - w_A^*|| ≤ κ_A ||w_A^{(k_0)} - w_A^*||, as required. For the inductive step, suppose that the result holds for k ≥ k_0. Now, g_Ā(x^{(k)}) ≤ -δe and

μ_Ā^{(k+1)} = [μ_Ā^{(k)} + αg_Ā(x^{(k)})]_+ ≤ [0 - αδe]_+ = 0,

which implies that μ_Ā^{(k+1)} = 0.
It follows that w_A^{(k+2)} = Π[U_A(w_A^{(k+1)})], and now using the same argument as in the case of k = k_0 above we get

||w_A^{(k+2)} - w_A^*|| ≤ κ_A ||w_A^{(k+1)} - w_A^*||.

Finally,

||w^{(k+1)} - w^*|| = ||w_A^{(k+1)} - w_A^*|| ≤ κ_A ||w_A^{(k)} - w_A^*|| ≤ η.

Because κ_A < 1, claim 4 implies that w^{(k)} converges to w^* with at least a linear order of convergence. ■

An application of Lagrangian algorithms to a problem in decentralized rate control for sensor networks appears in [24], [25], and [93]. The proof above is based on [25].

23.5 Penalty Methods

Consider a general constrained optimization problem

minimize f(x)
subject to x ∈ Ω.

We now discuss a method for solving this problem using techniques from unconstrained optimization. Specifically, we approximate the constrained optimization problem above by the unconstrained optimization problem

minimize f(x) + γP(x),

where γ ∈ R is a positive constant and P : R^n → R is a given function. We then solve the associated unconstrained optimization problem and use the solution as an approximation to the minimizer of the original problem. The constant γ is called the penalty parameter, and the function P is called the penalty function. Formally, we define a penalty function as follows.

Definition 23.1 A function P : R^n → R is called a penalty function for the constrained optimization problem above if it satisfies the following three conditions:

1. P is continuous.
2. P(x) ≥ 0 for all x ∈ R^n.
3. P(x) = 0 if and only if x is feasible (i.e., x ∈ Ω). ■

Clearly, for the unconstrained problem above to be a good approximation to the original problem, the penalty function P must be chosen appropriately. The role of the penalty function is to "penalize" points that are outside the feasible set.
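In practice, the associated unconstrained problem can be handed to any unconstrained routine; MATLAB's fminsearch is one possibility. The following minimal sketch, with an increasing sequence of penalty parameters, uses a problem and penalty function of our own choosing:

    f = @(x) x(1)^2 + 2*x(2)^2;             % objective
    P = @(x) (x(1) + x(2) - 1)^2;           % penalty for the constraint x1 + x2 = 1
    x = [0; 0];                             % initial guess
    for gamma = [1 10 100 1000]             % increasing penalty parameters
        q = @(x) f(x) + gamma*P(x);         % associated unconstrained problem
        x = fminsearch(q, x);               % warm-start from the previous solution
    end
    % x approaches the constrained minimizer [2/3; 1/3] as gamma grows.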
To illustrate how we choose penalty functions, consider a constrained optimization problem of the form

minimize f(x)
subject to g_i(x) ≤ 0, i = 1,...,p,

where f : R^n → R and g_i : R^n → R, i = 1,...,p. Considering only inequality constraints is not restrictive, because an equality constraint of the form h(x) = 0 is equivalent to the inequality constraint ||h(x)||^2 ≤ 0 (however, see Exercise 21.25 for a caveat). For the constrained problem above, it is natural that the penalty function be defined in terms of the constraint functions g_1,...,g_p. A possible choice for P is

P(x) = Σ_{i=1}^p g_i^+(x),

where

g_i^+(x) = max{0, g_i(x)} = 0 if g_i(x) ≤ 0, and g_i(x) if g_i(x) > 0.

We refer to this penalty function as the absolute value penalty function, because it is equal to Σ|g_i(x)|, where the summation is taken over all constraints that are violated at x. We illustrate this penalty function in the following example.

Example 23.2 Let g_1, g_2 : R → R be defined by g_1(x) = x - 2, g_2(x) = -(x + 1)^3. The feasible set defined by {x ∈ R : g_1(x) ≤ 0, g_2(x) ≤ 0} is simply the interval [-1, 2]. In this example, we have

g_1^+(x) = max{0, g_1(x)} = x - 2 if x > 2, and 0 otherwise,
g_2^+(x) = max{0, g_2(x)} = -(x + 1)^3 if x < -1, and 0 otherwise,

and

P(x) = g_1^+(x) + g_2^+(x) = x - 2 if x > 2; 0 if -1 ≤ x ≤ 2; -(x + 1)^3 if x < -1.

[Figure 23.1: g_1^+ + g_2^+ for Example 23.2.]

Figure 23.1 provides a graphical illustration of P for this example. ■

The absolute value penalty function may not be differentiable at points x where g_i(x) = 0, as is the case at the point x = 2 in Example 23.2 (notice, though, that in Example 23.2, P is differentiable at x = -1). Therefore, in such cases we cannot use techniques for optimization that involve derivatives. A form of the penalty function that is guaranteed to be differentiable is the Courant-Beltrami penalty function, given by

P(x) = Σ_{i=1}^p (g_i^+(x))^2.

In the following discussion we do not assume any particular form of the penalty function P. We only assume that P satisfies conditions 1 to 3 given in Definition 23.1.

The penalty function method for solving constrained optimization problems involves constructing and solving an associated unconstrained optimization problem and using the solution to the unconstrained problem as the solution to the original constrained problem. Of course, the solution to the unconstrained problem (the approximated solution) may not be exactly equal to the solution to the constrained problem (the true solution). Whether or not the approximated solution is a good approximation to the true solution depends on the penalty parameter γ and the penalty function P. We would expect that the larger the value of the penalty parameter γ, the closer the approximated solution will be to the true solution, because points that violate the constraints are penalized more heavily. Ideally, in the limit as γ → ∞, the penalty method should yield the true solution to the constrained problem. In the remainder of this section we analyze this property of the penalty function method.

Example 23.3 Consider the problem

minimize x^T Q x
subject to ||x||^2 = 1,

where Q = Q^T > 0.

a. Using the penalty function P(x) = (||x||^2 - 1)^2 and penalty parameter γ, write down an unconstrained optimization problem whose solution x_γ approximates the solution to this problem.

b. Show that for any γ, x_γ is an eigenvector of Q.

c. Show that ||x_γ||^2 - 1 = O(1/γ) as γ → ∞.

Solution:

a. The unconstrained problem based on the given penalty function is

minimize x^T Q x + γ(||x||^2 - 1)^2.

b. By the FONC, x_γ satisfies

2Qx_γ + 4γ(||x_γ||^2 - 1)x_γ = 0.

Rearranging, we obtain Qx_γ = λ_γ x_γ, where λ_γ = 2γ(1 - ||x_γ||^2) is a scalar. Hence, x_γ is an eigenvector of Q. (This agrees with Example 20.8.)

c. Now, λ_γ = 2γ(1 - ||x_γ||^2) ≤ λ_max, where λ_max is the largest eigenvalue of Q; moreover, λ_γ > 0 because it is an eigenvalue of Q > 0. Hence,

| ||x_γ||^2 - 1 | = λ_γ/(2γ) ≤ λ_max/(2γ),

so ||x_γ||^2 - 1 = O(1/γ) as γ → ∞. ■
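Example 23.3 can be checked numerically along the same lines. The following MATLAB sketch uses an illustrative Q, starting point, and penalty parameters of our own choosing; fminsearch is only one of many unconstrained routines that could be used here:

    Q = [3 0; 0 1];                                % Q = Q' > 0
    for gamma = [10 100 1000]
        q = @(x) x'*Q*x + gamma*(norm(x)^2 - 1)^2; % penalized objective
        xg = fminsearch(q, [1; 1]);                % approximate solution x_gamma
        fprintf('gamma = %5g: norm(xg)^2 - 1 = %g\n', gamma, norm(xg)^2 - 1);
    end
    % The constraint violation shrinks like O(1/gamma), and xg lines up
    % with the eigenvector for the smallest eigenvalue of Q, here [0; 1].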
We now analyze the penalty method in a more general setting. In our analysis, we adopt the following notation. Denote by x^* a solution (global minimizer) to the problem. Let P be a penalty function for the problem. For each k = 1, 2,..., let γ_k ∈ R be a given positive constant. Define an associated function q(γ_k, ·) : R^n → R by

q(γ_k, x) = f(x) + γ_k P(x).

For each k, we can write the following associated unconstrained optimization problem:

minimize q(γ_k, x).

Denote by x^{(k)} a minimizer of q(γ_k, x). The following technical lemma describes certain useful relationships between the constrained problem and the associated unconstrained problems.

Lemma 23.4 Suppose that {γ_k} is a nondecreasing sequence; that is, for each k we have γ_k ≤ γ_{k+1}. Then, for each k we have

1. q(γ_{k+1}, x^{(k+1)}) ≥ q(γ_k, x^{(k)}).
2. P(x^{(k+1)}) ≤ P(x^{(k)}).
3. f(x^{(k+1)}) ≥ f(x^{(k)}).
4. f(x^*) ≥ q(γ_k, x^{(k)}) ≥ f(x^{(k)}). □

Proof. We first prove part 1. From the definition of q, the fact that {γ_k} is a nondecreasing sequence, and P ≥ 0, we have

q(γ_{k+1}, x^{(k+1)}) = f(x^{(k+1)}) + γ_{k+1}P(x^{(k+1)}) ≥ f(x^{(k+1)}) + γ_k P(x^{(k+1)}).

Now, because x^{(k)} is a minimizer of q(γ_k, x),

f(x^{(k+1)}) + γ_k P(x^{(k+1)}) ≥ q(γ_k, x^{(k)}) = f(x^{(k)}) + γ_k P(x^{(k)}).

Combining the above, we get part 1.

We next prove part 2. Because x^{(k)} and x^{(k+1)} minimize q(γ_k, x) and q(γ_{k+1}, x), respectively, we can write

f(x^{(k)}) + γ_k P(x^{(k)}) ≤ f(x^{(k+1)}) + γ_k P(x^{(k+1)}),
f(x^{(k+1)}) + γ_{k+1}P(x^{(k+1)}) ≤ f(x^{(k)}) + γ_{k+1}P(x^{(k)}).

Adding the inequalities above yields

γ_k P(x^{(k)}) + γ_{k+1}P(x^{(k+1)}) ≤ γ_{k+1}P(x^{(k)}) + γ_k P(x^{(k+1)}).

Rearranging, we get

(γ_{k+1} - γ_k)P(x^{(k+1)}) ≤ (γ_{k+1} - γ_k)P(x^{(k)}).

We know by assumption that γ_{k+1} ≥ γ_k. If γ_{k+1} > γ_k, then we get P(x^{(k+1)}) ≤ P(x^{(k)}). If, on the other hand, γ_{k+1} = γ_k, then clearly x^{(k+1)} = x^{(k)}, and so P(x^{(k+1)}) = P(x^{(k)}). Therefore, in either case, we arrive at part 2.

We now prove part 3. Because x^{(k)} is a minimizer of q(γ_k, x), we obtain

q(γ_k, x^{(k)}) = f(x^{(k)}) + γ_k P(x^{(k)}) ≤ f(x^{(k+1)}) + γ_k P(x^{(k+1)}).

Therefore,

f(x^{(k+1)}) ≥ f(x^{(k)}) + γ_k (P(x^{(k)}) - P(x^{(k+1)})).

From part 2 we have P(x^{(k)}) - P(x^{(k+1)}) ≥ 0, and γ_k > 0 by assumption; therefore, we get f(x^{(k+1)}) ≥ f(x^{(k)}).

Finally, we prove part 4. Because x^{(k)} is a minimizer of q(γ_k, x), we get

f(x^*) + γ_k P(x^*) ≥ q(γ_k, x^{(k)}) = f(x^{(k)}) + γ_k P(x^{(k)}).

Because x^* is a minimizer for the constrained optimization problem, it is feasible, and so P(x^*) = 0. Therefore,

f(x^*) ≥ q(γ_k, x^{(k)}).

Because P(x^{(k)}) ≥ 0 and γ_k > 0,

f(x^*) ≥ q(γ_k, x^{(k)}) ≥ f(x^{(k)}),

which completes the proof. ■

With the above lemma, we are now ready to prove the following theorem.

Theorem 23.4 Suppose that the objective function f is continuous and γ_k → ∞ as k → ∞. Then, the limit of any convergent subsequence of the sequence {x^{(k)}} is a solution to the constrained optimization problem. □

Proof. Suppose that {x^{(m_k)}} is a convergent subsequence of the sequence {x^{(k)}}. (See Section 5.1 for a discussion of sequences and subsequences.) Let x̄ be the limit of {x^{(m_k)}}. By Lemma 23.4, the sequence {q(γ_k, x^{(k)})} is nondecreasing and bounded above by f(x^*). Therefore, the sequence {q(γ_k, x^{(k)})} has a limit q^* = lim_{k→∞} q(γ_k, x^{(k)}) such that q^* ≤ f(x^*) (see Theorem 5.3). Because the function f is continuous and f(x^{(m_k)}) ≤ f(x^*) by Lemma 23.4, we have

lim_{k→∞} f(x^{(m_k)}) = f(lim_{k→∞} x^{(m_k)}) = f(x̄) ≤ f(x^*).

Because the sequences {f(x^{(m_k)})} and {q(γ_{m_k}, x^{(m_k)})} both converge, the sequence {γ_{m_k}P(x^{(m_k)})} = {q(γ_{m_k}, x^{(m_k)}) - f(x^{(m_k)})} also converges, with

lim_{k→∞} γ_{m_k}P(x^{(m_k)}) = q^* - f(x̄).
It turns out that indeed this can be accomplished, in which case we say that the penalty function is exact However, it is necessary that exact penalty functions be nondifferentiable, as shown in [10], and illustrated in the following example. Example 23.4 Consider the problem minimize subject to f(x) x G [0,1], where f(x) = 5 — 3x. Clearly, the solution is x* = 1. Suppose that we use the penalty method to solve the problem, with a penalty function P that is differentiable at x* = 1. Then, P'(x*) = 0, because P(x) = 0 for all x G [0,1]. Hence, if we let g = f + 7 P , then g'{x*) = f'(x*) + 7P / (x*) Φ 0 for all finite 7 > 0. Hence, x* = 1 does not satisfy the first-order necessary condition to be a local minimizer of g. Thus, P is not an exact penalty function. I Here, we prove a result on the necessity of nondifferentlability of exact penalty functions for a special class of problems. Proposition 23.4 Consider the problem minimize subject to f(x) x G Ω, EXERCISES 571 with Ω C M.n convex. Suppose that the minimizer x* lies on the boundary of Ω and there exists a feasible direction d at x* such that d V/(ic*) > 0. If P is an exact penalty function, then P is not differentiable at x*. □ Proof We use contraposition. Suppose that P is differentiable at x*. Then, dT VP(x*) = 0, because P(x) = 0 for all x e Ω. Hence, if we let g = / + 7 P , then d Vg(x*) > 0 for all finite 7 > 0, which implies that Vg(x*) φ 0. Hence, x* is not a local minimizer of g, and thus P is not an exact penalty function. I Note that the result of Proposition 23.4 does not hold if we remove the assumption that d Vf(x*) > 0. Indeed, consider a convex problem where V/(x*) = 0. Choose P to be differentiable. Clearly, in this case we have Vg(x*) = V/(x*) + 7VP(x*) = 0. The function P is therefore an exact penalty function, although differentiable. For further reading on the subject of optimization of nondifferentiable functions, see, for example, [38]. References [11] and [96] provide further discussions on the penalty method, including nondifferentiable exact penalty functions. These references also discuss exact penalty methods involving differentiable functions; these methods go beyond the elementary type of penalty method introduced in this chapter. EXERCISES 23.1 Consider the constrained optimization problem maximize subject to f(x) ||x|| = 1, where /(2c) = \χΊQx and Q = QT. We wish to apply a fixed-step-size projected gradient algorithm to this problem: where a > 0 and Π is the usual projection operator defined by H[x] = argmin z G i 2 \\z — x\\ and Ω is the constraint set. a. Find a simple formula for Π[χ] in this problem (an explicit expression in terms of x), assuming that x φ 0. b . For the remainder of the question, suppose that Q = 1 0 0 2 572 ALGORITHMS FOR CONSTRAINED OPTIMIZATION Find the solution(s) to this optimization problem. · Derive an expression for y(fc+1) in terms of yW and c. Let yW = x[ /x2 a. d. Assuming that x2 φ 0, use parts b and c to show that for any a > 0, x^ converges to a solution to the optimization problem (i.e., the algorithm works). e. In part d, what if x2 — 0? 23.2 Consider the problem minimize subject to f(x) x G Ω, where f(x) = cTx and c G R n is a given nonzero vector. (Linear programming is a special case of this problem.) We wish to apply a fixed-step-size projected gradient algorithm χ<*+1> = n [ a j ' * ' - V / ( x ( f c ) ) ] , where, as usual, Π is the projection operator onto Ω (assume that for any y, U[y] = argmin a ; € Q \\y - x\\2 is unique). a. 
Suppose that for some k, x^{(k)} is a global minimizer of the problem. Is it necessarily the case that x^{(k+1)} = x^{(k)}? Explain fully.

b. Suppose that for some k, x^{(k+1)} = x^{(k)}. Is it necessarily the case that x^{(k)} is a local minimizer of the problem? Explain fully.

23.3 Consider the optimization problem

minimize f(x)
subject to x ∈ Ω,

where f : R^2 → R, f ∈ C^1, and Ω = [-1, 1]^2 = {x : -1 ≤ x_i ≤ 1, i = 1, 2}. Consider the projected steepest descent algorithm applied to this problem:

x^{(k+1)} = Π[x^{(k)} - α_k ∇f(x^{(k)})],

where Π represents the projection operator with respect to Ω and α_k = arg min_{α ≥ 0} f(x^{(k)} - α∇f(x^{(k)})). Our goal is to prove the following statement:

x^{(k+1)} = x^{(k)} if and only if x^{(k)} satisfies the first-order necessary condition.

We will do this in two parts.

a. Prove the statement above for the case where x^{(k)} is an interior point of Ω.

b. Prove the statement for the case where x^{(k)} is a boundary point of Ω. Hint: Consider two further subcases: (i) x^{(k)} is a corner point, and (ii) x^{(k)} is not a corner point. For subcase (i) it suffices to take x^{(k)} = [1, 1]^T. For subcase (ii) it suffices to take x^{(k)} ∈ {x : x_1 = 1, -1 < x_2 < 1}.

23.4 Let A ∈ R^{m×n}, m < n, rank A = m, and b ∈ R^m. Define Ω = {x : Ax = b} and let x_0 ∈ Ω. Show that for any y ∈ R^n,

Π[x_0 + y] = x_0 + Py,

where P = I - A^T(AA^T)^{-1}A. Hint: Use Exercise 6.7 and Example 12.5.

23.5 Let f : R^n → R be given by f(x) = (1/2)x^T Q x - x^T c, where Q = Q^T > 0. We wish to minimize f over {x : Ax = b}, where A ∈ R^{m×n}, m < n, and rank A = m. Show that the projected steepest descent algorithm for this case takes the form

x^{(k+1)} = x^{(k)} - ((g^{(k)T} P g^{(k)}) / (g^{(k)T} P Q P g^{(k)})) P g^{(k)},

where

g^{(k)} = ∇f(x^{(k)}) = Qx^{(k)} - c

and P = I_n - A^T(AA^T)^{-1}A.

23.6 Consider the problem

minimize (1/2)||x||^2
subject to Ax = b,

where A ∈ R^{m×n}, m < n, and rank A = m. Show that if x^{(0)} ∈ {x : Ax = b}, then the projected steepest descent algorithm converges to the solution in one step.

23.7 Show that in the projected steepest descent algorithm, we have the following for each k:

a. g^{(k+1)T} P g^{(k)} = 0.

b. The vector x^{(k+1)} - x^{(k)} is orthogonal to the vector x^{(k+2)} - x^{(k+1)}.

23.8 Consider the optimization problem

minimize f(x)
subject to x ∈ Ω,

where Ω ⊂ R^n. Suppose that we apply the penalty method to this problem, which involves solving an associated unconstrained optimization problem with penalty function P and penalty parameter γ > 0.

a. Write down the unconstrained problem associated with penalty function P and penalty parameter γ.

b. Let x^* be a global minimizer of the given constrained problem, and let x_γ be a global minimizer of the associated unconstrained optimization problem (in part a) with penalty parameter γ. Show that if x_γ ∉ Ω, then f(x_γ) < f(x^*).

23.9 Use the penalty method to solve the following problem:

minimize x_1^2 + 2x_2^2
subject to x_1 + x_2 = 3.

Hint: Use the penalty function P(x) = (x_1 + x_2 - 3)^2. The solution you find must be exact, not approximate.

23.10 Consider the simple optimization problem

minimize x
subject to x ≥ a,

where a ∈ R. Suppose that we use the penalty method to solve this problem, with penalty function P(x) = (max{a - x, 0})^2 (the Courant-Beltrami penalty function). Given a number ε > 0, find the smallest value of the penalty parameter γ such that the solution obtained using the penalty method is no further than ε from the true solution to the given problem. (Think of ε as the desired accuracy.)

23.11 Consider the problem

minimize (1/2)||x||^2
subject to Ax = b,
23.11 Consider the problem minimize subject to 2I|X|1 Ax = 6, EXERCISES 575 where A G R m x n , b G R m , m < n, and rank A = m. Let x* be the solution. Suppose that we solve the problem using the penalty method, with the penalty function P(x) = \\Ax-bf. Let X* be the solution to the associated unconstrained problem with the penalty parameter 7 > 0; that is, x* is the solution to minimize - | | # | | 2 + 7||Acc — 6|| 2 . Δ a. Suppose that A = 1 l], &=[!]· Verify that x* converges to the solution x* of the original constrained problem as 7 —> 00. b . Prove that #*—►#* as7—>oo holds in general. Hint: Use the following result: There exist orthogonal matrices U G R m x m and VT G Rnxn such that A= U[S,0}VT, where S = diag (J\1(AAT),..., ^\m(AAr)) is a diagonal matrix with diagonal elements that are the square roots of the eigenvalues of AAT. The result above is called the singular value decomposition (see, e.g., [62, p. 411]). CHAPTER 24 MULTIOBJECTIVE OPTIMIZATION 24.1 Introduction When an optimization problem involves only one objective function, it is a single-objective optimization. Most engineering problems require the designer to optimize a number of conflicting objectives. The objectives are in conflict with each other if an improvement in one objective leads to deterioration in another. Multiobjective problems in which there is competition between objectives may have no single, unique optimal solution. Multiobjective optimization problems are also referred to as multicriteria or vector optimization problems. We can formulate a multiobjective optimization problem as follows: Find a decision variable that satisfies the given constraints and optimizes a vector function whose components are objective functions. Formally, the mul- An Introduction to Optimization, Fourth Edition. By E. K. P. Chong and S. H. Zak. Copyright © 2013 John Wiley & Sons, Inc. 577 578 MULTIOBJECTIVE OPTIMIZATION tiobjective optimization problem is stated as follows: /l(xi,£2,..-,Zn) minimize subject to f(x) = /2(Χΐ,£2,.·.,Ζη) x G Ω, where / : R n —► R^ and Ω C R n . For example, the constraint set Ω can have the form Ω = {x : h(x) = 0, g(x) < 0} , where h : R n -> R m , g : R n -> R p , m < n. In general, we may have three different types of multiobjective optimization problems: ■ Minimize all the objective functions. ■ Maximize all the objective functions. ■ Minimize some and maximize others. However, as usual, any of these can be converted into an equivalent minimization problem. 24.2 Pareto Solutions Note that multiobjective function assigns to each decision variable a multiobjective vector function value in the objective function space. A graphical illustration of this statement is illustrated in Figures 24.1 and 24.2. In Figure 24.1 the decision variable is a point x G R 2 while the vector of objective functions is given by / : R 2 —> R 2 . In Figure 24.2 the decision variable is a point x G R 2 while the vector of objective functions is / : R 2 —> R 3 . In single-objective optimization problems our goal is to find a single solution, where we focus mainly on the decision variable space. On the other hand, in multiobjective problems we are usually more interested in the objective space. As pointed out by Miettinen [92, p. 11], multiobjective problems are in a sense ill-defined because there is no natural ordering in the objective space. Miettinen [92] illustrates this statement with the following simple example. One can say that [1,1] T is less than [3,3] T . But how do we compare [1,3] T and [3,1] T ? 
In general, in multiobjective optimization problems our goal is to find good compromises. Roughly speaking, in a multiobjective optimization problem, a solution is optimal if there exists no other solution, within the feasible set, that gives improved performance with regard to all the objectives. A formal definition of an optimal point for a multiobjective optimization problem was proposed by Francis Y. Edgeworth in 1881 and generalized by Vilfredo Pareto in 1896. It is customary now to refer to an optimal point of a multiobjective optimization problem as a Pareto minimizer, whose formal definition is given next.

[Figure 24.2: Three-dimensional illustration of a multiobjective vector function assigning to each decision variable a multiobjective vector function value.]

Definition 24.1 Let f : R^n → R^ℓ and Ω ⊂ R^n be given. For the optimization problem

minimize f(x)
subject to x ∈ Ω,

a point x^* ∈ Ω is called a Pareto minimizer if there exists no x ∈ Ω such that f_i(x) ≤ f_i(x^*) for i = 1, 2,...,ℓ and f_i(x) < f_i(x^*) for at least one i. ■

In other words, the point x^* is a Pareto minimizer, or a nondominated solution, if there exists no other feasible decision variable x that would decrease some objectives without causing a simultaneous increase in at least one other objective. The set of Pareto minimizers (optimizers) is called the Pareto front, as illustrated in Figure 24.3.

[Figure 24.3: The Pareto front is marked with a heavy line.]

Most multiobjective optimization algorithms use the concept of domination. A solution is said to be nondominated if it is Pareto optimal. In Figure 24.4 we show different combinations of two-objective optimization and the corresponding Pareto fronts. In particular, in the upper left we show the Pareto front for the case when we are minimizing both components of the objective function vector, which we represent by "min-min." Similarly, "min-max" represents the case when we are minimizing the first objective function and maximizing the second; and so forth.

[Figure 24.4: Pareto fronts for four possible cases of two-objective optimization.]
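The domination test underlying Definition 24.1 is a one-line check. In the following MATLAB sketch, which is our own illustration (the function name dominates is not from the text), fa and fb are objective vectors, and fa dominates fb if it is no worse in every component and strictly better in at least one:

    % true if objective vector fa dominates fb (minimization sense)
    dominates = @(fa, fb) all(fa <= fb) && any(fa < fb);

    % Miettinen's ordering question revisited: dominates([1;1],[3;3]) is
    % true, while neither of [1;3] and [3;1] dominates the other.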
24.3 Computing the Pareto Front

When computing the Pareto front, two solutions are compared and the dominated solution is eliminated from the set of candidate Pareto optimizers. Thus, the Pareto front consists of nondominated solutions. To proceed, we need some notation. Let

x^{*r} = [x_1^{*r}, x_2^{*r}, ..., x_n^{*r}]^T

be the rth candidate Pareto optimal solution, r = 1, 2,...,R, where R is the number of current candidate Pareto solutions. Let

f(x^{*r}) = [f_1(x^{*r}), f_2(x^{*r}), ..., f_ℓ(x^{*r})]^T

be the corresponding value of the objective function vector. For any new solution candidate x^j, we evaluate the objective function vector f(x^j). We then compare the new solution candidate with the existing Pareto solutions. We need to consider three cases:

■ x^j dominates at least one candidate solution.
■ x^j does not dominate any existing candidate solution.
■ x^j is dominated by a candidate solution.

If x^j dominates at least one candidate solution, we delete the dominated solutions from the set and add the new solution x^j to the set of candidates. In the second case, when the new candidate solution x^j does not dominate any of the existing candidate Pareto solutions (and is not itself dominated), we add this new solution to the set of candidate Pareto solutions. Finally, in the third case, when the new candidate solution is dominated by at least one of the existing candidate Pareto solutions, we do not change the set of existing candidate Pareto solutions.

Example 24.1 Consider the two-objective minimization problem whose data are as follows:

x^{(j)T}      f(x^{(j)})^T
[5, 6]        [30, 45]
[4, 5]        [22, 29]
[3, 7]        [19, 53]
[6, 8]        [41, 75]
[1, 4]        [13, 45]
[6, 7]        [42, 55]
[2, 5]        [37, 46]
[3, 6]        [28, 37]
[2, 7]        [12, 51]
[4, 7]        [41, 67]

Suppose that we wish to find the nondominated pairs for this problem. Recall that a point x^* is a nondominated point if there is no x among the candidates such that f_i(x) ≤ f_i(x^*) for all i and f_j(x) < f_j(x^*) for at least one component j of the objective vector.

To find the Pareto front, we start with the first pair as a candidate Pareto optimal solution and then compare the other pairs against this first pair, replacing the first pair as necessary. We then continue with the other pairs, building up a set of candidate Pareto solutions and modifying this set when appropriate. The result of the search gives the following Pareto optimal set:

x^{(i)T}      f(x^{(i)})^T
[4, 5]        [22, 29]
[1, 4]        [13, 45]
[2, 7]        [12, 51]
■
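The search in Example 24.1 can be reproduced with a double loop over the objective vectors. The following MATLAB sketch is our own illustration, with one objective vector per row of F:

    F = [30 45; 22 29; 19 53; 41 75; 13 45; ...
         42 55; 37 46; 28 37; 12 51; 41 67];   % data from Example 24.1
    keep = true(size(F,1), 1);
    for i = 1:size(F,1)
        for j = 1:size(F,1)
            if j ~= i && all(F(j,:) <= F(i,:)) && any(F(j,:) < F(i,:))
                keep(i) = false;               % row i is dominated by row j
            end
        end
    end
    F(keep, :)   % returns the rows [22 29], [13 45], [12 51], as in the text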
Example 24.2 We apply the algorithm above to generate the Pareto front for a two-objective optimization problem of the form

    minimize    f(x) = [f_1(x), f_2(x)]^T
    subject to  2 <= x1 <= 6
                5 <= x2 <= 9,

with f_1(x) = -(x1 + x2). We performed 100 iterations. At each iteration we randomly generated 50 feasible points. Then we applied the algorithm above to extract from this set of feasible points candidate Pareto optimal solutions. In Figure 24.5 we show the Pareto optimal points obtained after 100 iterations of the algorithm. We also show level sets of the objective functions in the (x1, x2)-space. In Figure 24.6 we show the Pareto front in the objective function space after 100 iterations of the algorithm. The Pareto optimal points are marked with x's. The points marked with dots are the candidate solutions generated randomly at the beginning of the last iteration of the algorithm.

Figure 24.5 Pareto optimal points in the decision space along with the level sets of the objective functions f_1 and f_2.

Figure 24.6 Pareto front for the problem of Example 24.2. Also marked are the objective vector values for the remaining candidate points generated in the last iteration.

We have described a simple approach to computing the Pareto front. Alternative methods include those that apply genetic algorithms to solving multiobjective optimization problems, as discussed in Deb [37], Coello Coello et al. [31], and Osyczka [98].

24.4 From Multiobjective to Single-Objective Optimization

In some cases it is possible to deal with a multiobjective optimization problem by converting the problem into a single-objective optimization problem, so that standard optimization methods can be brought to bear. Here, we discuss four techniques to convert a multiobjective problem to a single-objective problem. We assume throughout that an objective function vector f(x) = [f_1(x), ..., f_l(x)]^T is given.

The first method is to form a single objective function by taking a linear combination, with positive coefficients, of the components of the objective function vector. Equivalently, we form a convex combination of the components of the objective function vector. In other words, we use

    f(x) = c^T f(x)

as the (single) objective function, where c is a vector with positive components. This method is also called the weighted-sum method, where the coefficients of the linear combination (i.e., the components of c) are called weights. These weights reflect the relative importance of the individual components of the objective vector. Of course, it might be difficult to determine suitable weight values. A small numerical sketch of this conversion is given below.

A second method is to form a single objective function by taking the maximum of the components of the objective vector:

    f(x) = max{f_1(x), ..., f_l(x)}.

In other words, we convert the multiobjective minimization problem into one of minimizing the maximum of the components. For this reason, it is also called the minimax method. Note that this method applies to situations where the components of the objective vector are comparable or compatible, in the sense that they are in the same units (e.g., they are all lengths measured in meters, or masses in kilograms). A limitation of this method is that the resulting single objective function might not be differentiable, thereby precluding the use of optimization methods that rely on differentiability (e.g., gradient algorithms). However, as we show in the following, a minimax problem with linear objective vector components and linear constraints can be reduced to a linear programming problem.
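Before turning to that reduction, here is a minimal MATLAB sketch of the weighted-sum conversion (our illustration; the two objectives, the weights, and the box constraints are made-up example data, and fmincon is the general constrained solver from the Optimization Toolbox).

    % Weighted-sum scalarization of a two-objective problem over a box.
    f1 = @(x) (x(1) - 1)^2 + x(2)^2;      % example objective 1 (illustrative)
    f2 = @(x) x(1)^2 + (x(2) - 2)^2;      % example objective 2 (illustrative)
    c  = [0.7; 0.3];                      % positive weights
    fw = @(x) c(1)*f1(x) + c(2)*f2(x);    % scalarized objective c'*f(x)
    lb = [-5; -5];                        % box constraints
    ub = [ 5;  5];
    x0 = [0; 0];
    xstar = fmincon(fw, x0, [], [], [], [], lb, ub);

Sweeping the weight vector c over positive values (say, with c1 + c2 = 1) and solving once per choice of c traces out points of the Pareto front, one scalarized solve at a time.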
Example 24.3 Given vectors v1, ..., vp in R^n and scalars u1, ..., up, consider the minimax problem

    minimize    max{v1^T x + u1, ..., vp^T x + up}
    subject to  Ax <= b,

where A is an m-by-n matrix and b is in R^m. Call this problem P1.

a. Consider the optimization problem

    minimize    y
    subject to  Ax <= b
                y >= vi^T x + ui,  i = 1, ..., p,

where the decision variable is the vector [x^T, y]^T. Call this problem P2. Show that x* solves P1 if and only if [x*^T, y*]^T with y* = max{v1^T x* + u1, ..., vp^T x* + up} solves P2. Hint: y >= max{a, b, c} if and only if y >= a, y >= b, and y >= c.

b. Use part a to derive a linear programming problem

    minimize    c~^T z
    subject to  A~ z <= b~

that is equivalent to P1 (by "equivalent" we mean that the solution to one gives us the solution to the other). Explain how a solution to the linear programming problem above gives a solution to P1.

Solution:

a. First suppose that x* is optimal in P1. Let y* = max{v1^T x* + u1, ..., vp^T x* + up}. Then [x*^T, y*]^T is feasible in P2. Let [x^T, y]^T be any feasible point in P2. Then (by the hint) y >= max{v1^T x + u1, ..., vp^T x + up}. Moreover, x is feasible in P1, and hence

    y >= max{v1^T x + u1, ..., vp^T x + up} >= max{v1^T x* + u1, ..., vp^T x* + up} = y*.

Hence, [x*^T, y*]^T is optimal in P2.

To prove the converse, suppose that x* is not optimal in P1. Then there is some x' that is feasible in P1 such that

    y' = max{v1^T x' + u1, ..., vp^T x' + up} < max{v1^T x* + u1, ..., vp^T x* + up} = y*.

But [x'^T, y']^T is evidently feasible in P2 and has objective function value y' lower than that of [x*^T, y*]^T. Hence, [x*^T, y*]^T is not optimal in P2.

b. Define z = [x^T, y]^T, c~ = [0, ..., 0, 1]^T, and

    A~ = [ A     0  ]        b~ = [  b  ]
         [ v1^T  -1 ]             [ -u1 ]
         [  ...  ...]             [ ... ]
         [ vp^T  -1 ]             [ -up ]

Then the equivalent problem can be written as

    minimize    c~^T z
    subject to  A~ z <= b~.

By part a, if we obtain a solution to this linear programming problem, then the first n components form a solution to the original minimax problem.

A third method to convert a multiobjective problem to a single-objective problem, assuming that the components of the objective vector are nonnegative, is to form a single objective function by taking the p-norm of the objective vector:

    f(x) = ||f(x)||_p.

The minimax method can be viewed as a special case of this method, with p = infinity. The weighted-sum method with uniform weights can be viewed as this method with p = 1. To make the objective function differentiable in the case where p is finite (so that we can apply gradient methods, for example), we replace it by its pth power:

    f(x) = ||f(x)||_p^p = (f_1(x))^p + ... + (f_l(x))^p.

A fourth method is to minimize one of the components of the objective vector subject to constraints on the other components. For example, given f, we solve

    minimize    f_1(x)
    subject to  f_2(x) <= b2, ..., f_l(x) <= bl,

where b2, ..., bl are given constants that reflect satisfactory values for the objectives f_2, ..., f_l, respectively. Of course, this approach is suitable only in situations where these satisfactory values can be determined.
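Returning to the reduction in Example 24.3, the construction of part b is easy to carry out numerically. The sketch below is ours (the data V, u, A, b are made-up, and the box bounds are added only to keep this particular instance bounded); linprog is the linear programming solver from the Optimization Toolbox.

    % Minimax problem: minimize max_i (v_i'*x + u_i) subject to A*x <= b,
    % solved as the LP of Example 24.3(b) in the variable z = [x; y].
    V = [1 2; -1 1; 3 -1];        % rows are v_i' (p = 3, n = 2); example data
    u = [0; 1; -2];
    A = [1 1];                    % one linear constraint, x1 + x2 <= 4
    b = 4;
    [p, n] = size(V);
    m = size(A, 1);
    Atil = [A, zeros(m, 1); V, -ones(p, 1)];  % [A 0; v_i' -1]
    btil = [b; -u];                           % [b; -u]
    ctil = [zeros(n, 1); 1];                  % minimize y
    lb = [-10; -10; -Inf];        % box bounds on x keep this instance bounded
    ub = [ 10;  10;  Inf];
    z = linprog(ctil, Atil, btil, [], [], lb, ub);
    x = z(1:n);                   % x solves the minimax problem P1
    y = z(end);                   % y is the optimal minimax value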
24.5 Uncertain Linear Programming Problems

In this section we show how multiobjective optimization methods can be used to solve linear programming problems with uncertain coefficients, including uncertain constraints and uncertain objective functions.

Uncertain Constraints

We first consider a generalization of linear programming to problems with uncertain constraints. Our exposition is based on a discussion of fuzzy linear programming by Wang [131, Chapter 30]. We consider the following general linear programming problem:

    minimize    c^T x
    subject to  Ax <= b
                x >= 0.

We can represent the constraints in the form (Ax)_i <= b_i, i = 1, 2, ..., m. Suppose that the constraints' bounds are uncertain in the sense that they can vary within given tolerance values and can be represented as

    (Ax)_i <= b_i + theta*t_i,  i = 1, 2, ..., m,

where theta is in [0,1] and t_i > 0, i = 1, 2, ..., m.

We now discuss a method to solve the problem above. First, solve the following two linear programming problems:

    minimize    c^T x
    subject to  (Ax)_i <= b_i,  i = 1, 2, ..., m
                x >= 0

and

    minimize    c^T x
    subject to  (Ax)_i <= b_i + t_i,  i = 1, 2, ..., m
                x >= 0.

Denote the solutions to the two programs by x^(1) and x^(0), respectively, and let z1 = c^T x^(1) and z0 = c^T x^(0). Using these definitions, we construct a function that characterizes the degree of penalty associated with the uncertain constraints in the linear programming problem:

    mu_0(x) = 0,                              if c^T x < z0
              (c^T x - z0)/(z1 - z0),         if z0 <= c^T x <= z1
              1,                              if c^T x > z1.

A plot of this function is given in Figure 24.7. Note that when c^T x < z0, then mu_0(x) = 0, which represents the minimum degree of penalty. On the other hand, when c^T x > z1, then mu_0(x) = 1, and we have the maximum degree of penalty. When z0 <= c^T x <= z1, the degree of penalty varies linearly from 0 to 1.

Figure 24.7 Plot of the function mu_0(x).

Next, we introduce a function that describes the degree of penalty for violating the ith constraint:

    mu_i(x) = 0,                         if (Ax)_i - b_i < 0
              ((Ax)_i - b_i)/t_i,        if 0 <= (Ax)_i - b_i <= t_i
              1,                         if (Ax)_i - b_i > t_i.

A plot of this function is shown in Figure 24.8.

Figure 24.8 Plot of the function mu_i(x).

Using the definitions above we can reformulate the original linear programming problem as a multiobjective optimization problem, with the goal of minimizing the functions that penalize constraint violations:

    minimize    [mu_0(x), mu_1(x), ..., mu_m(x)]^T
    subject to  x >= 0.

We can employ the minimax method to solve this multiobjective optimization problem as the single-objective problem

    minimize    max{mu_0(x), mu_1(x), ..., mu_m(x)}
    subject to  x >= 0.

As shown in Example 24.3, the problem above can be stated equivalently as

    minimize    theta
    subject to  mu_0(x) <= theta
                mu_i(x) <= theta,  i = 1, 2, ..., m
                theta in [0,1], x >= 0.

Using now the definitions of mu_0 and mu_i, i = 1, ..., m, we restate the optimization problem above as

    minimize    theta
    subject to  c^T x <= z0 + theta*(z1 - z0)
                (Ax)_i <= b_i + theta*t_i,  i = 1, 2, ..., m
                theta in [0,1], x >= 0.

Example 24.4 Consider the following linear programming problem:

    minimize    -(1/2)x1 - x2
    subject to  x1 + x2 <= 5
                x2 <= 3
                x1 >= 0, x2 >= 0,

where the tolerances are t1 = 2 and t2 = 1.

a. Solve the two linear programming problems to obtain x^(1) and x^(0) using the data above. Then find z1 and z0.

b. Construct the equivalent optimization problem (involving theta) using the data above.

c. Express the optimization problem as a linear programming problem in standard form.

Solution:

a. We can solve these problems graphically to obtain x^(1) = [2, 3]^T and x^(0) = [3, 4]^T. Hence, z1 = c^T x^(1) = -4 and z0 = c^T x^(0) = -11/2.

b. The optimization problem has the form

    minimize    theta
    subject to  mu_0(x) <= theta
                mu_1(x) <= theta
                mu_2(x) <= theta
                theta in [0,1], x >= 0,

where

    mu_0(x) = 0,                                       if -(1/2)x1 - x2 < -11/2
              (-(1/2)x1 - x2 + 11/2)/(3/2),            if -11/2 <= -(1/2)x1 - x2 <= -4
              1,                                       if -(1/2)x1 - x2 > -4,

    mu_1(x) = 0,                                       if x1 + x2 - 5 < 0
              (x1 + x2 - 5)/2,                         if 0 <= x1 + x2 - 5 <= 2
              1,                                       if x1 + x2 - 5 > 2,

    mu_2(x) = 0,                                       if x2 - 3 < 0
              x2 - 3,                                  if 0 <= x2 - 3 <= 1
              1,                                       if x2 - 3 > 1.
c. We have

    minimize    theta
    subject to  c^T x <= z0 + theta*(z1 - z0)
                (Ax)_i <= b_i + theta*t_i,  i = 1, 2
                theta in [0,1], x >= 0.

Using our data, we obtain

    minimize    theta
    subject to  (1/2)x1 + x2 >= 11/2 - (3/2)theta
                x1 + x2 <= 5 + 2*theta
                x2 <= 3 + theta
                theta in [0,1], x >= 0.

Write x3 = theta. Then the problem above can be represented as

    minimize    x3
    subject to  x1 + 2x2 + 3x3 >= 11
                x1 + x2 - 2x3 <= 5
                x2 - x3 <= 3
                x3 <= 1
                xi >= 0,  i = 1, 2, 3.

This linear program expressed in the form of a linear programming problem in standard form is

    minimize    x3
    subject to  x1 + 2x2 + 3x3 - x4 = 11
                x1 + x2 - 2x3 + x5 = 5
                x2 - x3 + x6 = 3
                x3 + x7 = 1
                xi >= 0,  i = 1, 2, ..., 7.

Uncertain Objective Function Coefficients

We now consider a linear programming problem with uncertain objective function coefficients. We assume that uncertainties of the objective coefficients are modeled by the following triangular function:

    mu(x; a, b, c) = 0,                  if x < a
                     (x - a)/(b - a),    if a <= x < b
                     (c - x)/(c - b),    if b <= x < c
                     0,                  if x >= c.

A plot of this function for a = 1, b = 2, and c = 6 is shown in Figure 24.9.

Figure 24.9 Plot of the triangular function mu(x; a, b, c) for a = 1, b = 2, and c = 6.

In other words, the uncertain objective coefficients will be represented by triangular functions of the form given above. Following Wang [131, p. 386], we use the notation c~_i = (c_i^-, c_i^0, c_i^+) to denote the uncertain coefficient c~_i represented by the triangular function mu(x; c_i^-, c_i^0, c_i^+). Then the linear programming problem

    minimize    c^T x
    subject to  Ax <= b
                x >= 0

becomes

    minimize    [c^- x, c^0 x, c^+ x]^T
    subject to  Ax <= b
                x >= 0,

where c^- = [c_1^-, ..., c_n^-], and similarly for c^0 and c^+. This is a multiobjective optimization problem. Wang [131] suggests that instead of minimizing the three values c^- x, c^0 x, and c^+ x simultaneously, the center, c^0 x, be minimized; the left leg, (c^0 - c^-) x, be maximized; and the right leg, (c^+ - c^0) x, be minimized. This results in pushing the triangular functions to the left in the minimization process. Thus, the multiobjective optimization problem above can be changed to the following multiobjective optimization problem:

    minimize    [-(c^0 - c^-) x, c^0 x, (c^+ - c^0) x]^T
    subject to  Ax <= b
                x >= 0.

Uncertain Constraint Coefficients

We may be faced with solving a linear programming problem with uncertain constraint coefficients. In this case the coefficients of the constraint matrix A would be represented by triangular functions of the form given in the preceding section. That is, the coefficient a~_ij of the constraint matrix A would be modeled by the function a~_ij = mu(x; a_ij^-, a_ij^0, a_ij^+). Then the linear programming problem with uncertain constraint coefficients would take the form

    minimize    c^T x
    subject to  A^- x <= b
                A^0 x <= b
                A^+ x <= b
                x >= 0,

where A^- = [a_ij^-], A^0 = [a_ij^0], and A^+ = [a_ij^+].

General Uncertainties

Finally, we may be faced with solving an uncertain linear programming problem that is a combination of the basic uncertain linear programming problems discussed above. For example, suppose that we are asked to solve the following quite general uncertain linear programming problem:

    minimize    c~^T x
    subject to  A~ x <= b~
                x >= 0,

where the tilde symbols refer to the uncertain data; that is, we have c~ = (c^-, c^0, c^+), A~ = (A^-, A^0, A^+), and b~ = (b^-, b^0, b^+). We can represent the uncertain linear programming problem above as a multiobjective optimization problem of the form

    minimize    [-(c^0 - c^-) x, c^0 x, (c^+ - c^0) x]^T
    subject to  A^- x <= b^-
                A^0 x <= b^0
                A^+ x <= b^+
                x >= 0.
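The final linear program of Example 24.4 can be checked numerically. The following MATLAB sketch is ours; it solves the inequality form of the problem (before slack variables are added) with linprog from the Optimization Toolbox.

    % Example 24.4(c): minimize x3 (= theta) subject to
    %   x1 + 2*x2 + 3*x3 >= 11,  x1 + x2 - 2*x3 <= 5,
    %   x2 - x3 <= 3,  0 <= x3 <= 1,  x1, x2 >= 0.
    f  = [0; 0; 1];
    A  = [-1 -2 -3;    % -(x1 + 2*x2 + 3*x3) <= -11
           1  1 -2;    %   x1 + x2 - 2*x3    <=   5
           0  1 -1];   %   x2 - x3           <=   3
    b  = [-11; 5; 3];
    lb = [0; 0; 0];
    ub = [Inf; Inf; 1];
    x = linprog(f, A, b, [], [], lb, ub);
    % For these data the minimizing degree of constraint violation works
    % out to theta = x(3) = 0.5, attained at x1 = 2.5 and x2 = 3.5.

In other words, the best compromise for these data uses exactly half of each constraint's allowed tolerance.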
EXERCISES

24.1 Write a MATLAB program that implements the algorithm for generating a Pareto front, and test it on the problem in Example 24.1.

24.2 Consider the multiobjective problem

    minimize    f(x)
    subject to  x in Omega,

where f : R^n -> R^l.

a. Suppose that we solve the single-objective problem

    minimize    c^T f(x)
    subject to  x in Omega,

where c is in R^l, c > 0 (i.e., we use the weighted-sum approach). Show that if x* is a global minimizer for the single-objective problem above, then x* is a Pareto minimizer for the given multiobjective problem. Then show that it is not necessarily the case that if x* is a Pareto minimizer for the multiobjective problem, then there exists a c > 0 such that x* is a global minimizer for the single-objective (weighted-sum) problem.

b. Assuming that for all x in Omega, f(x) >= 0, suppose that we solve the single-objective problem

    minimize    (f_1(x))^p + ... + (f_l(x))^p
    subject to  x in Omega,

where p is in R, p > 0 (i.e., we use the minimum-norm approach). Show that if x* is a global minimizer for the single-objective problem above, then x* is a Pareto minimizer for the given multiobjective problem. Then show that it is not necessarily the case that if x* is a Pareto minimizer for the multiobjective problem, then there exists a p > 0 such that x* is a global minimizer for the single-objective (minimum-norm) problem.

c. Suppose that we solve the single-objective problem

    minimize    max{f_1(x), ..., f_l(x)}
    subject to  x in Omega

(i.e., we use the minimax approach). Show that it is not necessarily the case that if x* is a Pareto minimizer for the given multiobjective problem, then x* is a global minimizer for the single-objective (minimax) problem. Then show that it also is not necessarily the case that if x* is a global minimizer for the single-objective problem, then x* is a Pareto minimizer for the multiobjective problem.

24.3 Let f : R^n -> R^l be given. Consider the following constrained multiobjective problem:

    minimize    f(x)
    subject to  x in Omega.

Suppose that f is C^1, all the components of f are convex, and Omega is convex. Suppose that there exist x* and c* > 0 such that for any feasible direction d at x*, we have c*^T Df(x*) d >= 0. Show that x* is a Pareto minimizer.

24.4 Let f : R^n -> R^l and h : R^n -> R^m be given. Consider the following multiobjective problem with equality constraints:

    minimize    f(x)
    subject to  h(x) = 0.

Suppose that f and h are C^1, all the components of f are convex, and the constraint set is convex. Show that if there exist x*, c* > 0, and lambda* such that

    c*^T Df(x*) + lambda*^T Dh(x*) = 0^T,
    h(x*) = 0,

then x* is a Pareto minimizer. We can think of the above as a Lagrange condition for the constrained multiobjective problem.

24.5 Let f : R^n -> R^l and g : R^n -> R^p be given. Consider the following multiobjective problem with inequality constraints:

    minimize    f(x)
    subject to  g(x) <= 0.

Suppose that f and g are C^1, all the components of f are convex, and the constraint set is convex. Show that if there exist x*, c* > 0, and mu* such that

    mu* >= 0,
    c*^T Df(x*) + mu*^T Dg(x*) = 0^T,
    mu*^T g(x*) = 0,
    g(x*) <= 0,

then x* is a Pareto minimizer. We can think of the above as a KKT condition for the constrained multiobjective problem.

24.6 Let f : R^n -> R^l, h : R^n -> R^m, and g : R^n -> R^p be given. Consider the general constrained multiobjective problem

    minimize    f(x)
    subject to  h(x) = 0
                g(x) <= 0.

Suppose that f, g, and h are C^1, all the components of f are convex, and the constraint set is convex. Show that if there exist x*, c* > 0, lambda*, and mu* such that

    mu* >= 0,
    c*^T Df(x*) + lambda*^T Dh(x*) + mu*^T Dg(x*) = 0^T,
    mu*^T g(x*) = 0,
    h(x*) = 0,
    g(x*) <= 0,

then x* is a Pareto minimizer.
24.7 Let f_1 : R^n -> R and f_2 : R^n -> R, with f_1, f_2 in C^1. Consider the minimax problem

    minimize    max{f_1(x), f_2(x)}.

Show that if x* is a local minimizer, then there exist mu_1*, mu_2* in R such that mu_1*, mu_2* >= 0,

    mu_1* grad f_1(x*) + mu_2* grad f_2(x*) = 0,

mu_1* + mu_2* = 1, and mu_i* = 0 if f_i(x*) < max{f_1(x*), f_2(x*)}. Hint: Consider the following problem:

    minimize    z
    subject to  z >= f_i(x),  i = 1, 2.

REFERENCES

1. J. S. Arora, Introduction to Optimum Design. New York: McGraw-Hill Book Co., 1989.
2. R. G. Bartle, The Elements of Real Analysis, 2nd ed. New York: Wiley, 1976.
3. M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear Programming: Theory and Algorithms, 2nd ed. New York: Wiley, 1993.
4. A. Bhaya and E. Kaszkurewicz, Control Perspectives on Numerical Algorithms and Matrix Problems. Philadelphia: Society for Industrial and Applied Mathematics, 2006.
5. B. Beliczynski, A. Dzielinski, M. Iwanowski, and B. Ribeiro, Eds., Adaptive and Natural Computing Algorithms, vol. 4431 of Lecture Notes in Computer Science. Berlin: Springer, 2007.
6. A. Ben-Israel and T. N. E. Greville, Generalized Inverses: Theory and Applications. New York: Wiley-Interscience, 1974.
7. L. D. Berkovitz, Convexity and Optimization in R^n. Hoboken, NJ: Wiley, 2002.
8. C. C. Berresford, A. M. Rockett, and J. C. Stevenson, "Khachiyan's algorithm, Part 1: A new solution to linear programming problems," Byte, vol. 5, no. 8, pp. 198-208, Aug. 1980.
9. C. C. Berresford, A. M. Rockett, and J. C. Stevenson, "Khachiyan's algorithm, Part 2: Problems with the algorithm," Byte, vol. 5, no. 9, pp. 242-255, Sept. 1980.
10. D. P. Bertsekas, "Necessary and sufficient conditions for a penalty method to be exact," Mathematical Programming, vol. 9, no. 1, pp. 87-99, Aug. 1975.
11. D. P. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA: Athena Scientific, 1999.
12. D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Belmont, MA: Athena Scientific, 1997.
13. K. G. Binmore, Calculus. Cambridge, England: Cambridge University Press, 1986.
14. R. G. Bland, D. Goldfarb, and M. J. Todd, "The ellipsoid method: A survey," Operations Research, vol. 29, pp. 1039-1091, 1981.
15. V. G. Boltyanskii, Mathematical Methods of Optimal Control. New York: Holt, Rinehart and Winston, 1971.
16. S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan, Linear Matrix Inequalities in System and Control Theory. Philadelphia, PA: SIAM, 1994.
17. R. P. Brent, Algorithms for Minimization without Derivatives. Englewood Cliffs, NJ: Prentice Hall, 1973.
18. L. Brickman, Mathematical Introduction to Linear Programming and Game Theory. New York: Springer-Verlag, 1989.
19. C. G. Broyden, "Quasi-Newton methods," in Optimization Methods in Electronics and Communications (K. W. Cattermole and J. J. O'Reilly, Eds.), vol. 1 of Mathematical Topics in Telecommunications. New York: Wiley, 1984, pp. 105-110.
20. A. E. Bryson and Y.-C. Ho, Applied Optimal Control: Optimization, Estimation, and Control, rev. print. Washington, DC: Hemisphere Publishing Corporation, 1975.
21. B. D. Bunday, Basic Optimization Methods. London: Edward Arnold, 1984.
22. J. Campbell, The Improbable Machine. New York: Simon and Schuster, 1989.
23. S. L. Campbell and C. D. Meyer, Jr., Generalized Inverses of Linear Transformations. New York: Dover Publications, 1991.
Brewington, "Distributed communications resource management for tracking and surveillance networks," in Proceedings of the Conference on Signal and Data Processing of Small Targets 2005 (SPIE Vol. 5913), part of the SPIE Symposium on Optics & Photonics, San Diego, California, July 31-Aug. 4, 2005, pp. 280-291. 25. E. K. P. Chong and B. E. Brewington, "Decentralized rate control for tracking and surveillance networks," Ad Hoc Networks, special issue on Recent Advances in Wireless Sensor Networks, vol. 5, no. 6, pp. 910-928, Aug. 2007. 26. E. K. P. Chong, S. Hui, and S. H. Zak, "An analysis of a class of neural networks for solving linear programming problems," IEEE Transactions on Automatic Control, special section on Neural Networks in Control, Identification, and Decision Making, vol. 44, no. 11, pp. 1995-2006, Nov. 1999. 27. E. K. P. Chong and S. H. Zak, "Single-dimensional search methods," in Wiley Encyclopedia of Operations Research and Management Science, 2011, ISBN: 978-0-470-40063-0. REFERENCES 601 28. A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing. Chichester, England: Wiley, 1993. 29. M. Clerc, "The swarm and the queen: Towards a deterministic and adaptive particle swarm optimization," in Proceedings of the Congress of Evolutionary Computation, Washington, DC, July 1999, pp. 1951-1957. 30. M. Clerc and J. Kennedy, "The particle swarm: Explosion, stability and convergence in a multidimensional complex space," IEEE Transactions on Evolutionary Computation, vol. 6, pp. 58-73, Feb. 2002. 31. C. A. Coello Coello, D. A. Van Veldhuizen, and G. B. Lamont, Evolutionary Algorithms for Solving Multi-Objective Problems. New York: Kluwer Academic/Plenum Publishers, 2002. 32. S. D. Conte and C. de Boor, Elementary Numerical Analysis: An Approach, 3rd ed. New York: McGraw-Hill Book Co., 1980. Algorithmic 33. M. A. Dahleh and I. J. Diaz-Bobillo, Control of Uncertain Systems: A Linear Programming Approach. Upper Saddle River, NJ: Prentice Hall, 1995. 34. G. B. Dantzig, Linear Programming and Extensions. Princeton, NJ: Princeton University Press, 1963. 35. G. B. Dantzig and M. N. Thapa, Linear Programming, vol. 1, New York: Springer-Verlag, 1997. Introduction. 36. L. Davis, Ed., Genetic Algorithms and Simulated Annealing, Research Notes in Artificial Intelligence. London: Pitman, 1987. 37. K. Deb, Multi-objective Optimization ester, England: Wiley, 2001. Using Evolutionary Algorithms. 38. V. F. Dem'yanov and L. V. Vasil'ev, Nondifferentiable Optimization. York: Optimization Software, Inc., Publications Division, 1985. ChichNew 39. J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Englewood Cliffs, NJ: Prentice Hall, 1983. 40. J. Dongarra and F. Sullivan, "The top 10 algorithms," Computing in Science and Engineering, pp. 22-23, Jan./Feb. 2000. 41. V. N. Faddeeva, Computational Methods of Linear Algebra. New York: Dover Publications, 1959. 42. S.-C. Fang and S. Puthenpura, Linear Optimization and Extensions: and Algorithms. Englewood Cliffs, NJ: Prentice Hall, 1993. Theory 43. R. Fletcher, Practical Methods of Optimization, 2nd ed. Chichester, England: Wiley, 1987. 44. F. R. Gantmacher, The Theory of Matrices, vol. 1. New York: Chelsea Publishing Co., 1959. 45. F. R. Gantmacher, The Theory of Matrices, 2nd ed. Moscow: Nauka, revised 1966. In Russian. 46. S. I. Gass, An Illustrated Guide to Linear Programming. New York: McGrawHill Book Co., 1970. 602 REFERENCES 47. I. M. 
47. I. M. Gel'fand, Lectures on Linear Algebra. New York: Interscience Publishers, 1961.
48. S. Geman and D. Geman, "Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721-741, 1984.
49. P. E. Gill and W. Murray, "Safeguarded steplength algorithms for optimization using descent methods," Tech. Rep. NPL NAC 37, National Physical Laboratory, Division of Numerical Analysis and Computing, Teddington, England, Aug. 1974.
50. P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright, "Two step-length algorithms for numerical optimization," Tech. Rep. SOL 79-25, Systems Optimization Laboratory, Department of Operations Research, Stanford University, Stanford, CA, Dec. 1979.
51. P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization. London: Academic Press, 1981.
52. P. E. Gill, W. Murray, and M. H. Wright, Numerical Linear Algebra and Optimization. Redwood City, CA: Addison-Wesley, 1991.
53. G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: The Johns Hopkins University Press, 1983.
54. R. E. Gomory, "Outline of an algorithm for integer solutions to linear programs," Bulletin of the American Mathematical Society, vol. 64, no. 5, pp. 275-278, Sept. 1958.
55. C. C. Gonzaga, "Path-following methods for linear programming," SIAM Review, vol. 34, no. 2, pp. 167-224, June 1992.
56. B. Hajek, "Cooling schedules for optimal annealing," Mathematics of Operations Research, vol. 13, no. 2, pp. 311-329, 1988.
57. J. Hannig, E. K. P. Chong, and S. R. Kulkarni, "Relative frequencies of generalized simulated annealing," Mathematics of Operations Research, vol. 31, no. 1, pp. 199-216, Feb. 2006.
58. R. L. Harvey, Neural Network Principles. Englewood Cliffs, NJ: Prentice Hall, 1994.
59. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ: Prentice Hall, 1999.
60. J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation, vol. 1 of Santa Fe Institute Studies in the Sciences of Complexity. Redwood City, CA: Addison-Wesley, 1991.
61. J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Cambridge, MA: MIT Press, 1992.
62. R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, England: Cambridge University Press, 1985.
63. A. S. Householder, The Theory of Matrices in Numerical Analysis. New York: Dover Publications, 1975.
64. S. Hui and S. H. Zak, "The Widrow-Hoff algorithm for McCulloch-Pitts type neurons," IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 924-929, Nov. 1994.
65. D. R. Hush and B. G. Horne, "Progress in supervised neural networks: What's new since Lippmann," IEEE Signal Processing Magazine, pp. 8-39, Jan. 1993.
66. S. Isaak and M. N. Manougian, Basic Concepts of Linear Algebra. New York: W. W. Norton & Co., 1976.
67. J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Upper Saddle River, NJ: Prentice Hall, 1997.
68. W. E. Jenner, Rudiments of Algebraic Geometry. New York: Oxford University Press, 1963.
69. E. M. Johansson, F. U. Dowla, and D. M. Goodman, "Backpropagation learning for multi-layer feed-forward neural networks using the conjugate gradient method," International Journal of Neural Systems, vol. 2, no. 4, pp. 291-301, 1992.
Kaczmarz, "Approximate solution of systems of linear equations," International Journal of Control, vol. 57, no. 6, pp. 1269-1271, 1993. A reprint of the original paper: S. Kaczmarz, "Angenäherte Auflösung von Systemen linearer Gleichunger," Bulletin International de VAcademie Polonaise des Sciences et des Lettres, Serie A, pp. 355-357, 1937. 71. N. Karmarkar, "A new polynomial-time algorithm for linear programming," Combinatorica, vol. 4, no. 4, pp. 373-395, 1984. 72. M. F. Kelly, P. A. Parker, and R. N. Scott, "The application of neural networks to myoelectric signal analysis: A preliminary study," IEEE Transactions on Biomedical Engineering, vol. 37, no. 3, pp. 221-230, Mar. 1990. 73. J. Kennedy and R. C. Eberhart, with Y. Shi, Swarm Intelligence. San Francisco: Morgan Kaufmann, 2001. 74. L. G. Khachiyan, "A polynomial algorithm in linear programming," Soviet Mathematics Doklady, vol. 20, no. 1, pp. 191-194, 1979. 75. S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, 1983. 76. V. Klee and G. J. Minty, "How good is the simplex algorithm?" in InequalitiesIll (O. Shisha, Ed.), New York: Academic Press, 1972, pp. 159-175. 77. D. E. Knuth, The Art of Computer Programming, vol. 1, Fundamental Algorithms, 2nd ed. Reading, MA: Addison-Wesley, 1973. 78. L. Kolev, "Iterative algorithm for the minimum fuel and minimum amplitude problems for linear discrete systems," International Journal of Control, vol. 21, no. 5, pp. 779-784, 1975. 79. J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press, 1992. 80. T. Kozek, T. Roska, and L. O. Chua, "Genetic algorithm for CNN template learning," IEEE Transactions on Circuits and Systems, I: Fundamental Theory and Applications, vol. 40, no. 6, pp. 392-402, June 1993. 604 REFERENCES 81. K. Kuratowski, Introduction to Calculus, 2nd ed., vol. 17 of International Series of Monographs in Pure and Applied Mathematics. Warsaw, Poland: Pergamon Press, 1969. 82. J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, "Convergence properties of the Neider-Mead simplex method in low dimensions," SI AM Journal on Optimization, vol. 9, no. 1, pp. 112-147, 1998. 83. S. Lang, Calculus of Several Variables, 3rd ed. New York: Springer-Verlag, 1987. 84. J. M. Layton, Multivariable Control Theory. Stevenage, England: Peter Peregrinus on behalf of the Institution of Electrical Engineers, 1976. 85. E. B. Lee and L. Markus, Foundations of Optimal Control Theory. Malabar, FL: Robert E. Krieger Publishing Company, 1986. 86. G. Leitmann, The Calculus of Variations and Optimal Control: An Introduction. New York: Plenum Press, 1981. 87. D. G. Luenberger, Optimization by Vector Space Methods. New York: Wiley, 1969. 88. D. G. Luenberger and Y. Ye, Linear and Nonlinear Programming, 3rd ed. New York, NY: Springer Science + Business Media, 2008. 89. I. J. Maddox, Elements of Functional Analysis, 2nd ed. Cambridge, England: Cambridge University Press, 1988. 90. O. L. Mangasarian, Nonlinear Programming. New York: McGraw-Hill Book Co., 1969. 91. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics, vol. 21, no. 6, pp. 1087-1092, 1953. 92. K. M. Miettinen, Nonlinear Multiobjective Optimization. Norwell, MA: Kluwer Academic Publishers, 1998. 93. S. A. Miller and E. K. P. 
Chong, "Flow-rate control for managing communications in tracking and surveillance networks," in Proceedings of the Conference on Signal and Data Processing of Small Targets 2007 (SPIE Vol. 6699), part of the SPIE Symposium on Optics & Photonics, San Diego, California, Aug. 26-30, 2007. 94. M. Mitchell, An Introduction to Genetic Algorithms. Press, 1996. Cambridge, MA: MIT 95. A. Mostowski and M. Stark, Elements of Higher Algebra. Warsaw, Poland: PWN—Polish Scientific Publishers, 1958. 96. S. G. Nash and A. Sofer, Linear and Nonlinear Programming. McGraw-Hill Book Co., 1996. New York: 97. J. A. Neider and R. Mead, "A simplex method for function minimization," Computer Journal, vol. 7, no. 4, pp. 308-313, 1965. 98. A. Osyczka, Evolutionary Algorithms for Single and Multicriteria Design Optimization. Heidelberg, Germany: Physica-Verlag, 2002. REFERENCES 605 99. D. H. Owens, Multivariable and Optimal Systems. London: Academic Press, 1981. 100. T. M. Ozan, Applied Mathematical Programming for Production and Engineering Management. Englewood Cliffs, NJ: Prentice Hall, 1986. 101. C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: rithms and Complexity. Englewood Cliffs, NJ: Prentice Hall, 1982. 102. P. C. Parks, "S. Kaczmarz (1895-1939)," International vol. 57, no. 6, pp. 1263-1267, 1993. Algo- Journal of Control, 103. R. J. Patton and G. P. Liu, "Robust control design via eigenstructure assignment, genetic algorithms and gradient-based optimisation," IEE Proceedings on Control Theory and Applications, vol. 141, no. 3, pp. 202-208, May 1994. 104. A. L. Peressini, F. E. Sullivan, and J. J. Uhl, Jr., The Mathematics of Nonlinear Programming. New York: Springer-Verlag, 1988. 105. A. Pezeshki, L. L. Scharf, M. Lundberg, and E. K. P. Chong, "Constrained quadratic minimizations for signal processing and communications," in Proceedings of the Joint 44th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC05), Seville, Spain, Dec. 12-15, 2005, pp. 7949-7953. 106. A. Pezeshki, L. L. Scharf, and E. K. P. Chong, "The geometry of linearly and quadratically constrained optimization problems for signal processing and communications," Journal of the Franklin Institute, special issue on Modelling and Simulation in Advanced Communications, vol. 347, no. 5, pp. 818-835, June 2010. 107. M. J. D. Powell, "Convergence properties of algorithms for nonlinear optimization," SIAM Review, vol. 28, no. 4, pp. 487-500, Dec. 1986. 108. S. S. Rangwala and D. A. Dornfeld, "Learning and optimization of machining operations using computing abilities of neural networks," IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 2, pp. 299-314, Mar./Apr. 1989. 109. G. V. Reklaitis, A. Ravindran, and K. M. Ragsdell, Engineering Optimization: Methods and Applications. New York: Wiley-Interscience, 1983. 110. A. M. Rockett and J. C. Stevenson, "Karmarkar's algorithm: A method for solving large linear programming problems," Byte, vol. 12, no. 10, pp. 146-160, Sept. 1987. 111. H. L. Royden, Real Analysis, 3rd ed. New York: Macmillan Company, 1988. 112. W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York: McGrawHill Book Co., 1976. 113. D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Micro structure of Cognition, vol. 1, Foundations. Cambridge, MA: MIT Press, 1986. 114. D. Russell, Optimization Theory. New York: W. A. Benjamin, 1970. 115. S. L. Salas and E. Hille, Calculus: One and Several Variables, 4th ed. 
116. L. L. Scharf, L. T. McWhorter, E. K. P. Chong, J. S. Goldstein, and M. D. Zoltowski, "Algebraic equivalence of conjugate direction and multistage Wiener filters," in Proceedings of the Eleventh Annual Workshop on Adaptive Sensor Array Processing (ASAP), Lexington, Massachusetts, Mar. 11-13, 2003.
117. L. L. Scharf, E. K. P. Chong, and Z. Zhang, "Algebraic equivalence of matrix conjugate direction and matrix multistage filters for estimating random vectors," in Proceedings of the 43rd IEEE Conference on Decision and Control (CDC'04), Atlantis Resort, Paradise Island, Bahamas, Dec. 14-17, 2004, pp. 4175-4179.
118. L. L. Scharf, E. K. P. Chong, M. D. Zoltowski, J. S. Goldstein, and I. S. Reed, "Subspace expansion and the equivalence of conjugate direction and multistage Wiener filters," IEEE Transactions on Signal Processing, vol. 56, no. 10, pp. 5013-5019, Oct. 2008.
119. A. Schrijver, Theory of Linear and Integer Programming. New York: Wiley, 1986.
120. R. T. Seeley, Calculus of Several Variables: An Introduction. Glenview, IL: Scott, Foresman and Co., 1970.
121. J. R. Silvester, "Determinants of block matrices," The Mathematical Gazette, vol. 84, no. 501, pp. 460-467, Nov. 2000.
122. W. Spendley, G. R. Hext, and F. R. Himsworth, "Sequential application of simplex designs in optimization and evolutionary operation," Technometrics, vol. 4, pp. 441-461, 1962.
123. W. A. Spivey, Linear Programming: An Introduction. New York: Macmillan Company, 1963.
124. R. E. Stone and C. A. Tovey, "The simplex and projective scaling algorithms as iteratively reweighted least squares methods," SIAM Review, vol. 33, no. 2, pp. 220-237, June 1991.
125. G. Strang, Introduction to Applied Mathematics. Wellesley, MA: Wellesley-Cambridge Press, 1986.
126. G. Strang, Linear Algebra and Its Applications. New York: Academic Press, 1980.
127. T. W. Then and E. K. P. Chong, "Genetic algorithms in noisy environments," in Proceedings of the 9th IEEE Symposium on Intelligent Control, pp. 225-230, Aug. 1994.
128. L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Review, vol. 38, no. 1, pp. 49-95, Mar. 1996.
129. P. P. Varaiya, Notes on Optimization. New York: Van Nostrand Reinhold Co., 1972.
130. D. J. Velleman, How To Prove It: A Structured Approach. Cambridge, England: Cambridge University Press, 1994.
131. L.-X. Wang, A Course in Fuzzy Systems and Control. Upper Saddle River, NJ: Prentice Hall, 1999.
132. B. Widrow and M. A. Lehr, "30 years of adaptive neural networks: Perceptron, madaline, and backpropagation," Proceedings of the IEEE, vol. 78, no. 9, pp. 1415-1442, Sept. 1990.
133. D. J. Wilde, Optimum Seeking Methods. Englewood Cliffs, NJ: Prentice Hall, 1964.
134. R. E. Williamson and H. F. Trotter, Multivariable Mathematics, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1979.
135. W. I. Zangwill, Nonlinear Programming: A Unified Approach. Englewood Cliffs, NJ: Prentice Hall, 1969.
136. G. Zoutendijk, Mathematical Programming Methods. Amsterdam, The Netherlands: North-Holland, 1976.
137. J. M. Zurada, Introduction to Artificial Neural Systems. St. Paul, MN: West Publishing Co., 1992.
Index Absolute value, 19 Absolute value penalty function, 561 Activation function, 254 Active constraint, 483 Adaline, 257, 271 Adaptive linear element, 257 Additivity, 19 Affine function, 62, 63, 247, 524 Affine matrix inequality, 524 Affine scaling, 406, 407 Affine scaling method, 406-411 artificial problem, 410 stopping criterion, 409 strictly interior feasible point, 406 Algebraic Riccati inequality, 544 Algorithm affine scaling, 406-411 backpropagation, 253, 258-269 BFGS, 207-211 Broyden-Fletcher-GoldfarbShanno, see BFGS algorithm complexity of, 306, 402 conjugate gradient, see Conjugate gradient algorithm convergence of, see Convergence Davidon-Fletcher-Powell, see D F P algorithm DFP, 202-207 ellipsoid, see Khachiyan's method exponential complexity, 402 fixed step size, 146, 151, 261 for constrained optimization, 545567 genetic, 285-298 globally monotone, 157 gradient, 131-153 Gram-Schmidt, 177, 189 interior-point, 307, 403, 406, 411, 423 iterative, 124, 159. See also Search methods Kaczmarz's, 232-236, 257 Karmarkar's, see Karmarkar's method Khachiyan's, 306, 402-405, 418 Lagrangian, 553-560 naive random search, 279 Nelder-Mead, 274-278 particle swarm optimization, 282285 polynomial complexity, 402 An Introduction to Optimization, Fourth Edition. By E. K. P. Chong and S. H. Zak. Copyright © 2013 John Wiley & Sons, Inc. 609 610 INDEX probabilistic search, 278 projected, 546, 556 projected gradient, 407, 547, 549553 projected steepest descent, 551 quasi-Newton, see Quasi-Newton methods randomized search, 278 rank one, 197-202 rank two, 202 RLS, 227-232, 250 secant method, 120-122, 172 simplex, see Simplex method simulated annealing, 278-282 single-rank symmetric, 197 SRS, 197 steepest descent, 133-141 symmetric Huang family, 215 variable metric, 202 Widrow-Hoff, 257, 271 zero finding, 118, 155 Allocation, 541 Alphabet in genetic algorithm, 286 Argmin, 82 Armijo backtracking algorithm, 126 Armijo condition, 125 Armijo-Goldstein condition, 125 Artificial neural networks, see Feedforward neural networks Artificial problem in affine scaling method, 410 in Karmarkar's method, 418 in simplex method, 361 Associative, 8, 9 Asymmetric duality, 379 Augmented matrix, 325 Backpropagation algorithm, 253, 258269 as a gradient algorithm, 261 forward pass, 263 reverse pass, 263 Ball, 50 Banana (Rosenbrock's) function, 68, 159, 173, 191, 215 Basic columns, 325 Basic feasible solution, 305, 325, 327, 331, 349 Basic solutions, 324-327 Basic variables, 325, 347 Basis definition of, 11 entering, 351 in linear equations, 325, 347, 348 leaving, 351 natural, 11 orthogonal, 29 Beltrami, 562 Best-so-far, 280, 283, 290 BFGS algorithm, 207-211 Big-oh notation, 74, 149 Bisection method, 116 Bland's rule, 360, 375 Boltzmann, 281 Bolzano-Weierstrass theorem, 58, 234 Boundary, 51 Boundary point, 51 Bounded above, 57 Bounded below, 57 Bounded sequence, 56, 57 Bounded set, 51 Box constraint, 546 Bracketing, 123, 129 Brent's method, 122 Broyden, 207 Broyden-Fletcher-Goldfarb-Shanno algorithm, see BFGS algorithm Canonical augmented matrix, 346-348 Canonical form, 346 Canonical representation, 347 Canonical representation of LMI, 529 Canonical tableau, 358 Carrier of polyhedron, 53 Cauchy-Schwarz inequality, 20, 132, 207, 551 Center of gravity, 102, 275 Centroid, 102, 274 Chain rule, 67 Characteristic equation, 26 Characteristic polynomial, 26 Chromosome in genetic algorithm, 286 Circuit, 121, 253, 254, 311, 487 Citation style, 6 Clairaut's theorem, 65 Closed set, 51 Column vector, 7 Combinatorial optimization, 273, 
282, 297 Commutative, 8 Compact set, 51 Compatible matrix noi>m, 36 Complementarity, 207 Complementary slackness, 390, 539 Complex inner product, 21 Complex vector space, 12 Complexity of algorithm, 306, 402 exponential, 402 INDEX polynomial, 402 Component of vector, 7 Composite function, 67, 85, 90, 459 Concave function, 512, see Convex function Condition number, 148 Conjugate direction methods, 175-188 Conjugate gradient algorithm Fletcher-Reeves formula, 187 Hestenes-Stiefel formula, 187 nonquadratic problems, 186-188 Polak-Ribiere formula, 187 Powell formula, 188 quadratic problems, 182-186 stopping criterion, 188 Consistent linear inequalities, 404 Constrained optimization, 82, 305, 449 Constraint active, 483 box, 546 convex, 517 equality, 450, 504 functional, 82 inactive, 483 inequality, 483, 561 set, 82 Constraint set, 81. See also Feasible set Continuity, 21, 60, 453, 565 Continuous function, 21, 60, 453, 565 Continuously different iable function, 65, 451, 513 Contradiction, proof, 5 Contraposition, proof, 4 Contrapositive, 4 Control system, 102, 472, 474, 475, 481, 553 Convergence fixed-step-size gradient algorithm, 146 globally convergent, 141 gradient algorithms, 143 Kaczmarz's algorithm, 233 linear, 148 locally convergent, 141, 554, 556 Newton's method, 165 of sequence of matrices, 59 order of, 148, 149, 152, 153, 157, 165 penalty method, 565 quadratic (second-order), 148 rate of, 141, 148 ratio, 148 steepest descent algorithm, 145 sublinear, 148 611 superlinear, 148 Convergent sequence, 56 Convex combination, 48, 297, 582 Convex constraint, 517 Convex function, 508-517 definition of, 509 different iable, 513 equivalent definition of, 509 minimizers of, 517 optimization of, 517-536 quadratic, 512 strict, 512 twice differentiable, 515 Convex optimization, 517-536 Convex programming, see Convex optimization Convex set, 48-50 definition of, 48, 330, 509 extreme point, 50, 331, 332, 335 in definition of convex function, 509 polyhedron, 52, 317 polytope, 52, 316 properties of, 49 supporting hyperplane, 52, 316 Cooling schedule, 281 Coordinates, 11 Cost function, 81 Courant-Beltrami penalty function, 562, 570 Cramer's rule, 33 Crossing site, 287 Crossover in genetic algorithm, 287 crossing site, 287 multiple-point crossover, 287, 300 one-point crossover, 287 Cubic fit, 122 Curve, 453 Cutting-plane method, 435 Cycling in simplex method, 360, 375 Dantzig, 306 Davidon, 202 Davidon-Fletcher-Powell algorithm, see D F P algorithm Decision variable, 81, 305, 541 Decomposition direct sum, 30, 458 orthogonal, 30, 226 Decreasing sequence, 55 Degenerate basic feasible solution, 325, 328, 360 DeMorgan's law, 3 Derivative, 62, 83 partial, 64 612 INDEX Derivative descent search, 154 Derivative matrix, 64 Descent property, 135, 141, 144, 167, 168, 193, 552 Determinant, 14 D F P algorithm, 202-207 Diagonal matrix, 27, 407, 421, 541, 571 Diet problem, 308, 381, 388 Differentiable curve, 453 Differentiable function, 62, 63, 453 Dimension, 11, 452 Direct sum decomposition, 30, 458 Directional derivative, 84 Discrete Fourier series, 225 Discrete-time linear system, 102, 248, 474, 481 Distributive, 9 Domination, 575 Dual linear program, 378, 403 Dual nonlinear program, 543 Dual quadratic program, 399 Duality asymmetric, 379 dual nonlinear program, 543 dual problem, 378, 403 dual quadratic program, 399 dual vector, 378 duality theorem, 387, 543 in quasi-Newton methods, 207 Karush-Kuhn-Tucker conditions, 395, 539 linear programming, 377, 539 nonlinear programming, 542 primal nonlinear program, 542 primal 
problem, 378, 403 primal quadratic program, 399 quadratic programming, 399 symmetric, 378, 403 weak duality lemma, 385, 399, 543 Duality theorem, 387, 543 Dyadic product, 197 Eberhart, Russell, 282 Edge of polyhedron, 53 Eigenvalue definition of, 26 maximal, 144 minimal, 144 of symmetric matrix, 28, 35 Eigenvector definition of, 26 of symmetric matrix, 28 orthogonal, 28 relation to Q-conjugacy, 189 Electric circuit, 311, 487 Elementary matrix elementary row operation, 341 first kind, 340 second kind, 340 third kind, 341 Elementary row operation, 341 Elitism in genetic algorithm, 290 Ellipsoid, 405 Ellipsoid algorithm, see Khachiyan's method Encoding in genetic algorithm, 285, 290, 297 Entry of matrix, 12 Epigraph, 508 Equality constraint, 450, 504 Estimation, 121, 223, 244 Euclidean inner product, 19 Euclidean norm, 20 Evolution in genetic algorithm, 287 Exact penalty function, 566 Exclusive OR, see XOR Expanding subspace theorem, 181 Exponential complexity, 402 Extreme point, 50, 331, 332, 334 Extremizer, 82 Face of polyhedron, 53, 316 Farkas's transposition theorem, 396 Feasibility problem, 398, 526, 529 Feasible direction, 84, 519, 566 Feasible point, 305, 450 Feasible set, 81, 450 Feedforward neural networks, 253-269 activation function, 254 Adaline, 257 backpropagation algorithm, 253, 258-269 function approximation, 255 hidden layer, 254 input layer, 254 learning, 253 neuron, 253 output layer, 254 single-neuron training, 256-258 supervised learning, 256 training, 253 training set, 255 unsupervised learning, 256 weights, 253, 254 Fibonacci method, 108-115 Fibonacci sequence, 109 First-order Lagrangian algorithm, 553 First-order necessary condition INDEX equality constraint (Lagrange), 460, 461 in convex optimization, 518 inequality constraint (KKT), 484 interior case, 86 set constraint, 85 Fitness in genetic algorithm, 286 Fitting straight line, 101, 221, 227, 246, 247 Fixed point, 553 Fixed step size, 146, 151, 261 Fletcher, 187, 202, 207 Fletcher-Reeves formula, 187 Floor, 435 FONC, see First-order necessary condition Fourier series, 225 Frobenius norm, 36 Full-rank factorization, 236 Function affine, 62, 63, 247, 524 banana, 68, 159, 173, 191, 215 composite, 67, 85, 90, 459 concave, 512, see Convex function continuous, 21, 60, 453, 565 continuously differentiable, 65, 451, 513 convex, 508-517 cost, 81 derivative matrix of, 64 derivative of, 62, 83 differentiable, 62, 63, 453 directional derivative of, 84 gradient of, 65, 71, 83, 131 graph of, 71, 508 Jacobian matrix of, 64 Lagrangian, 464, 468, 543, 553 linear, see Linear transformation matrix-valued, 60 maximum rate of decrease, 132 maximum rate of increase, 71, 131 notation, 5 objective, 81 partial derivative of, 64 penalty, 560 Powell, 162 Rosenbrock's, 68, 159, 173, 191, 215 sigmoid, 266 twice continuously differentiable, 65, 515 twice differentiable, 65, 454 uniformly continuous, 23 unimodal, 104 613 utility, 542 Function approximation, 255 Functional constraint, 82 Fundamental theorem of algebra, 26 Fundamental theorem of linear algebra, 41 Fundamental theorem of LP, 328 Fuzzy linear programming, 585 Gale's transposition theorem, 397 Gauss-Newton method, 171 Generalized eigenvalue, 468, 534 Generalized inverse, 236, 238 Genetic algorithm, 285-298 alphabet, 286 analysis of, 291-297 best-so-far chromosome, 290 chromosome, 286 crossover, 287 elitism, 290 encoding, 285, 290, 297 evolution, 287 fitness, 286 initial population, 285 length of schema, 294 mating pool, 286 mutation, 288 offspring, 287 order of schema, 294 parents, 287 
population size, 286 real-number, 297-298 representation scheme, 286 roulette-wheel scheme, 286 schema, 292 selection, 286 stopping criterion, 290 tournament scheme, 286 Gibbs, 281 Global minimizer, 82, 94, 473, 517-519, 521, 537, 552 Globally convergent, 141 Globally monotone algorithm, 157 Golden section, 106 Golden section search, 104-108 Goldfarb, 207 Goldstein condition, 125 Gomory cut, 435 Gomory cutting-plane method, 435 Gordan's transposition theorem, 397 Gradient, 65, 71, 83, 131 Gradient descent algorithm, see Algorithm, gradient Gradient methods, 131-153 614 INDEX backpropagation algorithm, 253, 258-269 constrained optimization, see Projected gradient method convergence of, 141-147 convergence rate of, 147-153 descent property, 135, 141, 144 equality constraints, see Lagrangian algorithms fixed step size, 145 inequality constraints, see Lagrangian algorithms Lagrangian, 553-560 order of convergence, 152 projected, 547, 549-553 stopping criterion, 135 Gram matrix, 220 Gram-Schmidt, 177, 189 Grammian, 220 Graph, 71, 508 Greatest lower bound, 57 Hacijan, see Khachiyan Hadamard product, 283 Hajek, 282 Half-space, 46, 316 negative, 46 positive, 46 Hessian, 65, 468, 515 Hessian matrix, 83 Hestenes, Magnus, 187, 188 Hestenes-Stiefel formula, 187 Hidden layer in neural network, 254 Hoff, 257 Holland, John, 285 Homogeneity, 19, 21 Huang family, 215 Hyperplane definition of, 46 supporting, 52, 316 tangent to graph, 71 Identity matrix, 16 ILP, see Integer linear programming Image of matrix, see Range of matrix Implicit function theorem, 457 Impulse response, 248 Inactive constraint, 483 Inconsistent system of equations, 217 Increasing sequence, 55 Indefinite matrix, 35 Induced matrix norm, 36, 480 Induction, principle of, 5 Inequality constraint, 483, 561 Infimum, see Greatest lower bound Inner product complex, 21 Euclidean, 19 properties of, 19 Innovation, 229 Input layer in neural network, 254 Integer linear programming, 427-444 Integer programming, see Integer linear programming Interior, 51 Interior point, 51 Interior-point method, 307, 403, 406, 411, 423 Inverse continuity of, 60 matrix, 16 Inverse Hessian, 194 Inverse parabolic interpolation, 122 Invertible matrix, see Nonsingular matrix Iterative algorithm, 124, see Search methods, 159 Jacobian matrix, 64 Jordan form, 59 Kaczmarz's algorithm, 232-236, 257 Kantorovich, 306 Karmarkar, 306, 403 Karmarkar's method, 306, 403, 411-423 artificial problem, 418 complexity, 403 Karmarkar's canonical form, 4 1 1 413, 415 Karmarkar's restricted problem, 414-415 projective transformation, 416, 424 simplex, 412 stopping criterion, 415, 420 strictly interior feasible point, 407, 416 Karush-Kuhn-Tucker condition, see K K T condition Karush-Kuhn-Tucker multiplier, see K K T multiplier Karush-Kuhn-Tucker theorem, 484 Kennedy, James, 282 Kernel of matrix, see Nullspace of matrix Khachiyan, 306, 402 Khachiyan's method, 306, 402-405, 418 K K T condition, 484, 485, 489, 521, 556, 594 KKT multiplier, 484, 490 INDEX K K T theorem, 484 Klee-Minty problem, 401 Koopmans, 306 Krylov subspace, 188 Kuhn-Tucker condition, see KKT condition Lagrange condition, 460, 464, 520, 550, 553, 593 Lagrange multiplier, 460, 463 Lagrange's theorem, 460, 461 Lagrangian algorithms, 553-560 Lagrangian function, 464, 468, 543, 553 Lanczos, Cornelius, 188 Leading principal minor, 31 Learning in neural network, 253 Least squares, 217-227, 238 nonlinear, 169 Least upper bound, 57 Left pseudoinverse, 238 Level set, 68, 131, 134 Levenberg-Marquardt algorithm, 171 Levenberg-Marquardt 
modification, 168 Limit of sequence, 55 Line fitting, 101, 221, 227, 246, 247 Line search, 103, 124, 133, 167, 186, 188, 194, 209 Line segment, 45, 48 Linear combination, 10 Linear convergence, 148 Linear dynamical system, see Discretetime linear system Linear equations augmented matrix, 325 basic solution, 325 basis, 325, 347, 348 canonical augmented matrix, 347 canonical form, 346 degenerate basic solutions, 325 existence of solution, 17 inconsistent, 217 Kaczmarz's algorithm, 232-236 least-squares solution, 217, 218, 221 minimum-norm solution, 231, 241, 257, 473 overdetermined, 217 particular solution, 346 pivot, 349, 352, 364 solving in general, 217-243 solving using row operations, 339346 Linear function, see Linear transformation Linear inequalities 615 consistent, 404 in linear programming, 305, 307, 316 Linear least squares, 217-227, 238 Linear matrix inequality, 524, 541 Linear programming affine scaling method, 406-411 artificial problem in affine scaling method, 410 artificial problem in Karmarkar's method, 418 artificial problem in simplex method, 361 artificial variables in simplex method, 361 as constrained problem, 450 asymmetric duality, 379 basic columns, 325 basic feasible solution, 305, 325, 327, 331, 349 basic solutions, 324-327 basic variables, 325, 347 Bland's rule, 360, 375 brief history of LP, 305 canonical augmented matrix, 347 canonical tableau, 358 complementary slackness, 390, 539 cycling, 360, 375 degenerate basic feasible solution, 325, 328, 360 dual problem, 378, 403 duality, see Duality duality theorem, 387 examples of, 100, 307-314 extreme point, 331, 332, 334 feasible solution, 325 fundamental theorem of LP, 328 fuzzy, 585 geometric view of, 330 integer linear programming, 427444 interior-point method, 307, 403, 406, 411, 423 Karmarkar's method, see Karmarkar's method Karush-Kuhn-Tucker condition, 395, 501, 539 Khachiyan's method, 306, 403-405, 418 Klee-Minty problem, 401 optimal basic feasible solution, 328 optimal feasible solution, 328 primal problem, 378, 403 616 INDEX reduced cost coefficient, 353, 357, 358, 391 revised simplex method, 364-368 sensitivity, 398 simplex method, 306, 339-368 slack variable, 319 standard form, 318, 324 surplus variable, 319 symmetric duality, 378, 403 tableau, 358 two-dimensional, 314 two-phase affine scaling method, 409 two-phase simplex method, 360364 uncertain, 584-591 weak duality lemma, 385, 399 Linear quadratic regulator, 474 Linear regression, see Line fitting Linear space, see Vector space Linear transformation, 25, 63 Linear variety, 47 Linear-fractional LMIs, 534 Linearly dependent, 10 Linearly independent, 9, 176, 324, 451, 484 Little-oh notation, 74, 85 LMI, see Linear matrix inequality LMI solvers, 529 LMI toolbox for MATLAB, 529, 536 LMITOOL, 536 Local minimizer, 82, 83, 85, 90, 92, 461, 469, 471, 484, 494, 517 Locally convergent, 141, 554, 556 Location parameter, 267 Lower bound, 57 LP, see Linear programming LQR, 474 Lyapunov inequality, 527, 543 MacDuffee, 241 Markov chain, 397 Mating pool in genetic algorithm, 286 MATLAB, xiii, 127, 128, 136, 139, 159, 191, 215, 271, 290, 291, 299, 301, 337, 376, 424, 543 LMI toolbox, 529, 536 Matrix affine matrix inequality, 524 compatible norm, 36 condition number, 148 continuous, 60 convergence of sequence, 59 definition of, 12 derivative, 64 determinant, 14 diagonal, 27, 407, 421, 541, 571 eigenvalue of, see Eigenvalue eigenvector of, see Eigenvector elementary, see Elementary matrix entry of, 12 full-rank factorization, 236 function, matrix-valued, 60 game theory, 
314 generalized inverse, 236, 238 Gram, 220 Hadamard product, 283 Hessian, 65, 83, 468, 515 identity, 16 image of, see Range of matrix indefinite, 35 induced norm, 36, 480 inverse, 16 invertible, see Nonsingular matrix Jacobian, 64 Jordan form, 59 kernel of, see Nullspace of matrix leading principal minor of, 31 left pseudoinverse, 238 linear matrix inequality, 524, 541 minor of, 15, 428 Moore-Penrose inverse, 236, 237 negative definite, 35 negative semidefinite, 35 nonsingular, 16, 208, 218, 229, 325, 342 notation, 12 nullspace of, 30, 41, 227, 372, 406, 454 orthogonal, 29, 571 orthogonal projector, 30, 226, 406, 408, 423, 549 Penrose generalized inverse, 243 positive definite, 35 positive semidefinite, 35 principal minor of, 31 pseudoinverse, 236, 238 range of, 30, 41 rank of, 13-16 representation of linear transformation, 25 right pseudoinverse, 238 Schur complement, 526 Schur product, 283 sequence of, 59 series of, 60 similar, 26 square, 14 INDEX stochastic, 397 submatrix of, 432 Sylvester's criterion, 31 symmetric, 28, 35, 139 totally unimodular, 432 trace, 534 transformation, 26 transpose of, 12 unimodular, 428 Matrix norm, 35-39 Matrix-valued function, 60 Max, 21 Maximizer, 82 Mean value theorem, 76, 554, 558 MILP, see Mixed integer linear programming Min, 15, 82 Minimax, 582, 587, 592 Minimizer description of, 81 global, 82, 94, 473, 517-519, 521, 537, 552 local, 82, 83, 85, 90, 461, 469, 484, 494, 517 Pareto, 575 strict global, 82 strict local, 82, 92, 102, 471, 494 Minimum norm, 231, 242, 257, 584, 592 Minor definition of, 15, 428 leading principal, 31 principal, 31 Minty, 401 Mixed integer linear programming, 444 Monotone sequence, 55, 57 Moore-Penrose inverse, 236, 237 Morrison, 208, 229 Multicriteria optimization, 573 Multiobjective optimization, 573, 586 Mutation in genetic algorithm, 288 Naive random search, 279 Natural basis, 11 Negative definite matrix, 35 quadratic form, 31 Negative half-space, 46 Negative semidefinite matrix, 35 quadratic form, 31 Neighborhood, 50 Nelder-Mead algorithm, 274-278 centroid, 274 contraction, 276 617 expansion, 275 Neural networks, see Feedforward neural networks Neuron, 253 Newton's method convergence of, 165 descent direction, 167 descent property, 167 for nonlinear least squares, 168-171 Gauss-Newton method, 171 general, 161-171 Levenberg-Marquardt modification of, 168 modification of, 167 of tangents, 119 one-dimensional, 116-119 order of convergence, 165 Newton-Raphson method, see Newton's method Non-strict inequality, 528 Nondecreasing sequence, 55 Nondifferentiable optimization, 523 Nondifferentiable penalty function, 566 Nonincreasing sequence, 55 Nonlinear least squares, 169 Nonsingular matrix, 16, 208, 218, 229, 325, 342 Norm compatible, 36 Euclidean, 20 Frobenius, 36 general vector norm, 21 induced, 36, 480 matrix, 35-39 p-norm, 21, 584 properties of, 21 Normal, 47, 70 Normal plane, 458 Normal space, 457, 458 Notation, 5 Nullspace of matrix, 30, 41, 227, 372, 406, 454 Objective function, 81 Offspring in genetic algorithm, 287 One-dimensional search methods, 103126 Open set, 51 Optimal basic feasible solution, 328 Optimal control, 472, 475, 481, 482, 553 Optimal feasible solution in LP, 328 Optimization combinatorial, 273, 282, 297 constrained, 82, 305, 449 618 INDEX convex, 517-536 linear, see Linear programming multicriteria, 573 multiobjective, 573, 586 nondifferentiable, 523 semidefinite, 523 unconstrained, see Unconstrained optimization vector, 573 with equality constraints, 449, 553 with inequality constraints, 483, 556 with 
Optimization algorithm, see Search methods
Order of convergence, 148, 149, 152, 153, 157, 165
Order symbol, 73, 149
Orthant, 416
Orthogonal, 70
Orthogonal basis, 29
Orthogonal complement, 29, 226, 458
Orthogonal decomposition, 30, 226
Orthogonal matrix, 29, 571
Orthogonal projection, 30, 219, 406, 408, 423
Orthogonal projector, 30, 226, 406, 408, 423, 549
Orthogonal vectors, 20
Outer product, 197
Output layer in neural network, 254
Overdetermined system of equations, 217
Parents in genetic algorithm, 287
Pareto front, 575
Pareto minimizer, 575
Partial derivative, 64
Particle swarm optimization, 282-285
Particular solution, 346
Penalty function, 560
Penalty method, 560-567
  absolute value penalty function, 561
  convergence, 565
  Courant-Beltrami penalty function, 562, 570
  exact penalty function, 566
  nondifferentiable penalty function, 566
  penalty function, 560
  penalty parameter, 560
Penalty parameter, 560
Penrose, see Moore-Penrose inverse
Penrose generalized inverse, 243
Perp, see Orthogonal complement
Pivot, 349, 352, 364
Polak-Ribiere formula, 187
Polyhedron
  carrier of, 53
  definition of, 52
  edge of, 53
  face of, 53, 316
  in linear programming, 316-318
  vertex of, 53
Polynomial, characteristic, 26
Polynomial complexity, 402
Polytope
  definition of, 52
  in linear programming, 316
Population in genetic algorithm, 285, 286
Positive definite
  matrix, 35
  quadratic form, 31
  relation to eigenvalues, 35
  Sylvester's criterion, 31
Positive half-space, 46
Positive orthant, 416
Positive semidefinite
  matrix, 35
  quadratic form, 31
  relation to eigenvalues, 35
  relation to principal minors, 35
Positivity, 19, 21
Powell, 162, 188, 202
Powell formula, 188
Powell function, 162
Primal linear program, 378, 403
Primal nonlinear program, 542
Primal quadratic program, 399
Primal-dual method, 377
Principal minor, 31
Principle of induction, 5
Probabilistic search, 278
Probability vector, 397, 540
Product
  dyadic, 197
  inner, 19, 21
  outer, 197
Product rule, 68
Projected algorithm, 546, 556
Projected gradient method, 407, 547, 549-553
  stopping criterion, 552
Projected steepest descent algorithm, 551
Projection, 297, 546, see also Orthogonal projection
Projective transformation, 416, 424
Proof
  contradiction (reductio ad absurdum), 5
  contraposition, 4
  direct method, 4
  methods of, 3-5
  principle of induction, 5
Proportional fairness, 541
Pseudoinverse, 236, 238
Pythagorean theorem, 21
Q-conjugate
  definition of, 176
  linear independence, 176
  relation to eigenvectors, 189
  relation to orthogonality, 189
Quadratic convergence, 148
Quadratic fit, 122, 128
Quadratic form
  convex, 512
  definition of, 31
  maximizing, 467, 471
  negative definite, 31
  negative semidefinite, 31
  positive definite, 31
  positive semidefinite, 31, 35
  Sylvester's criterion, 31
Quadratic programming, 399, 472, 481, 500
Quasi-Newton methods, 193-211
  approximating inverse Hessian, 194
  BFGS algorithm, 207-211
  complementarity, 207
  conjugate direction property, 196
  descent property, 193
  DFP algorithm, 202-207
  duality, 207
  rank one formula, 197-202
  rank two update, 202
  single-rank symmetric, 197
  symmetric Huang family, 215
  variable metric algorithm, 202
Randomized search, 278
Range of matrix, 30, 41
Rank of matrix, 13-16
Rank one formula, 197-202
Rank two update, 202
Rate of convergence, 141, 148
Ratio of convergence, 148
Rayleigh's inequality, 38, 92, 144, 146, 152
Real vector space, 7
Recursive least-squares, see RLS algorithm
Reduced cost coefficient, 353, 357, 358, 391
Reductio ad absurdum, 5
Reeves, 187
Regular point, 451, 456, 461, 464, 484, 554, 556
Relative cost coefficient, see Reduced cost coefficient
Representation scheme in genetic algorithm, 286
Revised simplex method, 364-368
Revised tableau, 365
Ribiere, 187
Riccati inequality, 544
Right pseudoinverse, 238
RLS algorithm, 227-232, 250
Rosenbrock's function, 68, 159, 173, 191, 215
Roulette-wheel scheme, 286
Row operations, 339-346
Row vector, 7
Scalar, 9
Scale parameter, 267
Schema in genetic algorithm, 292
  length of, 294
  order of, 294
Schmidt, see Gram-Schmidt
Schur complement, 526
Schur product, 283
Schwarz, see Cauchy-Schwarz inequality
Schwarz's theorem, 65
Scilab Consortium, 536
Search direction, 124, 128, 167, 168
Search methods
  bisection method, 116
  conjugate direction methods, 175-188
  conjugate gradient algorithm, 182-188
  constrained optimization, 545-567
  derivative descent search, 154
  Fibonacci, 108-115
  general algorithm, 211
  genetic algorithm, 285-298
  golden section, 104-108
  gradient methods, 131-153
  Kaczmarz's algorithm, 232-236, 257
  Lagrangian, 553-560
  line search, 103, 124, 133, 167, 186, 188, 194, 209
  naive random search, 279
  Nelder-Mead algorithm, 274-278
  neural network training, 255
  Newton's method, 116-119, 161-171
  Newton-Raphson method, see Newton's method
  one-dimensional, 103-126
  particle swarm optimization, 282-285
  penalty method, 560-567
  probabilistic, 278
  projected, 546, 556
  projected gradient methods, 547, 549-553
  quasi-Newton methods, 193-211
  randomized, 278
  secant method, 120-122, 172
  simulated annealing algorithm, 278-282
  steepest descent method, 133-141
Secant method, 120-122, 172
Second-order necessary condition
  equality constraints, 469
  inequality constraints, 494
  interior case, 90
  set constraint, 90
Second-order sufficient condition
  equality constraints, 470
  inequality constraints, 494
  interior case, 92
  set constraint, 102
Selection in genetic algorithm, 286
Semidefinite programming, 523
Sensitivity, 398
Sequence
  bounded, 56, 57
  bounded above, 57
  bounded below, 57
  convergent, 56
  decreasing, 55
  Fibonacci, 109
  greatest lower bound, 57
  increasing, 55
  least upper bound, 57
  limit, 55
  lower bound, 57
  monotone, 55, 57
  nondecreasing, 55, 564
  nonincreasing, 55
  of matrices, 59
  of real numbers, 55
  order of convergence, 148, 149, 152, 153, 157, 165
  subsequence of, 58
  upper bound, 57
Set
  boundary of, 51
  bounded, 51
  closed, 51
  compact, 51
  constraint, 81, see Feasible set
  convex, see Convex set
  feasible, 81, 450
  interior of, 51
  minus, 5
  notation, 5
  open, 51
  simplex, 274, 412
  subset of, 5
Set constraint, 82, 502
Shanno, 207
Sherman-Morrison formula, 208, 229
Sherman-Morrison-Woodbury formula, 229
Shift parameter, 267
Sigmoid, 266
Signal-to-interference ratio, 88, 98
Similar matrices, 26
Simplex, 274, 412
Simplex algorithm, see Simplex method
Simplex method, 306, 339-368
  algorithm, 349-356
  artificial problem, 361
  artificial variables, 361
  Bland's rule, 360, 375
  canonical augmented matrix, 346-348
  canonical tableau, 358
  complexity, 402
  cycling, 360, 375
  exponential complexity, 402
  integer linear programming, 427-444
  matrix form, 356-360
  reduced cost coefficient, 353, 357, 358, 391
  revised simplex method, 364-368
  revised tableau, 365
  row operations, 339-346
  stopping criterion, 351, 372
  tableau, 358
  two-phase, 360-364
  updating augmented matrix, 348-349
  updating canonical tableau, 358
Simulated annealing algorithm, 278-282
Simultaneous equations, see Linear equations
Single-rank symmetric algorithm, 197
Singular value decomposition, 571
Slack variable, 319
SONC, see Second-order necessary condition
SOSC, see Second-order sufficient condition
Span, 10
Sphere, 425
Square matrix, 14
SRS algorithm, 197
Standard form linear program, 318, 324
Statement
  biconditional, 4
  conditional, 3
Steepest ascent, 71
Steepest ascent method, see Steepest descent method
Steepest descent
  order of convergence, 152
Steepest descent method, 133-141
  for constrained optimization, 551
  for quadratic, 139
  projected, 551
Step response, 248
Step size, 132, 146, 156, 168, 233, 261, 547, 551
Stiefel, Eduard, 187, 188
Stochastic matrix, 397
Stopping criterion
  affine scaling method, 409
  conjugate gradient method, 188
  genetic algorithm, 290
  gradient method, 135
  Karmarkar's method, 415, 420
  line search, 128
  projected gradient method, 552
  simplex method, 351, 372
Strict inequality, 528
Strictly interior feasible point, 406, 407, 416
Strong Wolfe condition, 126
Structured representation of LMI, 529
Subgradient, 515, 537
Sublinear convergence, 148
Submatrix, 432
Subsequence, 58
Subset, 5
Subspace, 10
Superlinear convergence, 148
Supervised learning, 256
Supporting hyperplane, 52, 316
Supremum, see Least upper bound
Surface, 451, 454
Surplus variable, 319
SVD, see Singular value decomposition
Sylvester's criterion, 31
Symmetric duality, 378, 403
Symmetric Huang family, 215
Symmetric matrix, 28, 35, 139
Symmetry, 19
Tableau in linear programming, 358
Tangent line, 70
Tangent plane, 70, 454
Tangent space, 454, 456, 458
Tangent vector, 65, 70, 454, 456
Taylor series, 72-76, 162, 165, 551. See also Taylor's theorem
Taylor's formula, 72, 75. See also Taylor's theorem
Taylor's theorem, 72, 85, 90, 92, 132, 515, 516
Temperature schedule, 281
Termination criterion, see Stopping criterion
Third-order necessary condition, 94
Third-order sufficient condition, 94
Threshold, 267
Totally unimodular, 432
Tournament scheme, 286
Trace, 534
Training of neural network, 253
Training set, 255
Transformation
  affine scaling, 407
  linear, 25, 63
  matrix, 26
  matrix representation of, 25
  projective, 416, 424
Transportation problem, 306, 310
Transpose
  matrix, 12
  vector, 8
Transposition theorems, 396
Traveling salesperson problem, 282
Triangle inequality, 21
Truth table, 3
Tucker, see KKT condition
Twice continuously differentiable function, 65, 515
Twice differentiable function, 65, 454
Two-dimensional linear program, 314
Two-phase affine scaling method, 409
Two-phase simplex method, 360-364
Uncertainty range, 106
Unconstrained optimization
  basics of, 82
  conditions for, 83-92
Uniform continuity, 23
Uniformly continuous function, 23
Unimodal, 104
Unimodular, 428
Unimodular, totally, 432
Unsupervised learning, 256
Upper bound, 57
Utility function, 542
Variable metric algorithm, 202
Variety, linear, 47
Vector
  column, 7
  complex, 12
  component of, 7
  convex combination, 48, 297, 582
  definition of, 7
  difference, 8
  field, 65
  linear combination, 10
  linearly dependent, 10
  linearly independent, 9, 176, 324, 451, 484
  normal, 47
  orthogonal, 20
  probability, 397, 540
  row, 7
  tangent, 65, 70, 454, 456
  transpose of, 8
  zero vector, 8
Vector field, 65
Vector optimization, 573
Vector space
  basis for, 11
  complex, 12
  definition of, 7
  dimension of, 11
  real, 7
  subspace of, 10
Vertex, 53, 274, 406
Weak duality lemma, 385, 399, 543
Weierstrass theorem, 36, 51
Weighted sum, 474, 582, 592
Weights in neural network, 253, 254
Widrow, 257
Widrow-Hoff algorithm, 257, 271
Wiener filter, 188, 245
Wolfe condition, 125
Woodbury, 229
XOR, 268, 271
YALMIP, 536
Yet Another LMI Package (YALMIP), 536
Zero finding, 118, 155
Zero matrix, 22, 30, 36
Zero vector, 8

WILEY SERIES IN DISCRETE MATHEMATICS AND OPTIMIZATION
AARTS AND KORST · Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing
AARTS AND LENSTRA · Local Search in Combinatorial Optimization
ALON AND SPENCER · The Probabilistic Method, Third Edition
ANDERSON AND NASH · Linear Programming in Infinite-Dimensional Spaces: Theory and Application
ARLINGHAUS, ARLINGHAUS, AND HARARY · Graph Theory and Geography: An Interactive View E-Book
AZENCOTT · Simulated Annealing: Parallelization Techniques
BARTHÉLEMY AND GUENOCHE · Trees and Proximity Representations
BAZARAA, JARVIS, AND SHERALI · Linear Programming and Network Flows
BRUEN AND FORCINITO · Cryptography, Information Theory, and Error-Correction: A Handbook for the 21st Century
CHANDRU AND HOOKER · Optimization Methods for Logical Inference
CHONG AND ZAK · An Introduction to Optimization, Fourth Edition
COFFMAN AND LUEKER · Probabilistic Analysis of Packing and Partitioning Algorithms
COOK, CUNNINGHAM, PULLEYBLANK, AND SCHRIJVER · Combinatorial Optimization
DASKIN · Network and Discrete Location: Models, Algorithms and Applications
DINITZ AND STINSON · Contemporary Design Theory: A Collection of Surveys
DU AND KO · Theory of Computational Complexity
ERICKSON · Introduction to Combinatorics
GLOVER, KLINGMAN, AND PHILLIPS · Network Models in Optimization and Their Applications in Practice
GOLSHTEIN AND TRETYAKOV · Modified Lagrangians and Monotone Maps in Optimization
GONDRAN AND MINOUX · Graphs and Algorithms (Translated by S. Vajda)
GRAHAM, ROTHSCHILD, AND SPENCER · Ramsey Theory, Second Edition
GROSS AND TUCKER · Topological Graph Theory
HALL · Combinatorial Theory, Second Edition
HOOKER · Logic-Based Methods for Optimization: Combining Optimization and Constraint Satisfaction
IMRICH AND KLAVŽAR · Product Graphs: Structure and Recognition
JANSON, LUCZAK, AND RUCINSKI · Random Graphs
JENSEN AND TOFT · Graph Coloring Problems
KAPLAN · Maxima and Minima with Applications: Practical Optimization and Duality
LAWLER, LENSTRA, RINNOOY KAN, AND SHMOYS, Editors · The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization
LAYWINE AND MULLEN · Discrete Mathematics Using Latin Squares
LEVITIN · Perturbation Theory in Mathematical Programming and Its Applications
MAHMOUD · Evolution of Random Search Trees
MAHMOUD · Sorting: A Distribution Theory
MARTELLI · Introduction to Discrete Dynamical Systems and Chaos
MARTELLO AND TOTH · Knapsack Problems: Algorithms and Computer Implementations
McALOON AND TRETKOFF · Optimization and Computational Logic
MERRIS · Combinatorics, Second Edition
MERRIS · Graph Theory
MINC · Nonnegative Matrices
MINOUX · Mathematical Programming: Theory and Algorithms (Translated by S. Vajda)
MIRCHANDANI AND FRANCIS, Editors · Discrete Location Theory
NEMHAUSER AND WOLSEY · Integer and Combinatorial Optimization
NEMIROVSKY AND YUDIN · Problem Complexity and Method Efficiency in Optimization (Translated by E. R. Dawson)
PACH AND AGARWAL · Combinatorial Geometry
PLESS · Introduction to the Theory of Error-Correcting Codes, Third Edition
ROOS, TERLAKY, AND VIAL · Theory and Algorithms for Linear Optimization: An Interior Point Approach
SCHEINERMAN AND ULLMAN · Fractional Graph Theory: A Rational Approach to the Theory of Graphs
SCHIFF · Cellular Automata: A Discrete View of the World
SCHRIJVER · Theory of Linear and Integer Programming
SPALL · Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control
STIEBITZ, SCHEIDE, TOFT, AND FAVRHOLDT · Graph Edge Coloring: Vizing's Theorem and Goldberg's Conjecture
SZPANKOWSKI · Average Case Analysis of Algorithms on Sequences
TOMESCU · Problems in Combinatorics and Graph Theory (Translated by R. A. Melter)
TUCKER · Applied Combinatorics, Second Edition
WOLSEY · Integer Programming
YE · Interior Point Algorithms: Theory and Analysis