Texts in Applied Mathematics
12
Editors
J.E. Marsden
L. Sirovich
M. Golubitsky
S.S. Antman
Advisors
G. Iooss
P. Holmes
D. Barkley
M. Dellnitz
P. Newton
Springer Science+Business Media, LLC
Texts in Applied Mathematics
1. Sirovich: Introduction to Applied Mathematics.
2. Wiggins: Introduction to Applied Nonlinear Dynamical Systems and Chaos.
3. Hale/Koçak: Dynamics and Bifurcations.
4. Chorin/Marsden: A Mathematical Introduction to Fluid Mechanics, 3rd ed.
5. Hubbard/West: Differential Equations: A Dynamical Systems Approach: Ordinary Differential Equations.
6. Sontag: Mathematical Control Theory: Deterministic Finite Dimensional Systems, 2nd ed.
7. Perko: Differential Equations and Dynamical Systems, 3rd ed.
8. Seaborn: Hypergeometric Functions and Their Applications.
9. Pipkin: A Course on Integral Equations.
10. Hoppensteadt/Peskin: Modeling and Simulation in Medicine and the Life Sciences, 2nd ed.
11. Braun: Differential Equations and Their Applications, 4th ed.
12. Stoer/Bulirsch: Introduction to Numerical Analysis, 3rd ed.
13. Renardy/Rogers: An Introduction to Partial Differential Equations.
14. Banks: Growth and Diffusion Phenomena: Mathematical Frameworks and Applications.
15. Brenner/Scott: The Mathematical Theory of Finite Element Methods, 2nd ed.
16. Van de Velde: Concurrent Scientific Computing.
17. Marsden/Ratiu: Introduction to Mechanics and Symmetry, 2nd ed.
18. Hubbard/West: Differential Equations: A Dynamical Systems Approach: Higher-Dimensional Systems.
19. Kaplan/Glass: Understanding Nonlinear Dynamics.
20. Holmes: Introduction to Perturbation Methods.
21. Curtain/Zwart: An Introduction to Infinite-Dimensional Linear Systems Theory.
22. Thomas: Numerical Partial Differential Equations: Finite Difference Methods.
23. Taylor: Partial Differential Equations: Basic Theory.
24. Merkin: Introduction to the Theory of Stability of Motion.
25. Naber: Topology, Geometry, and Gauge Fields: Foundations.
26. Polderman/Willems: Introduction to Mathematical Systems Theory: A Behavioral Approach.
27. Reddy: Introductory Functional Analysis with Applications to Boundary-Value Problems and Finite Elements.
28. Gustafson/Wilcox: Analytical and Computational Methods of Advanced Engineering Mathematics.
29. Tveito/Winther: Introduction to Partial Differential Equations: A Computational Approach.
30. Gasquet/Witomski: Fourier Analysis and Applications: Filtering, Numerical Computation, Wavelets.

(continued after index)
J. Stoer
R. Bulirsch
Introduction to
Numerical Analysis
Third Edition
Translated by R. Bartels, W. Gautschi, and C. Witzgall
With 39 Illustrations
Springer
J. Stoer
Institut für Angewandte Mathematik
Universität Würzburg
Am Hubland
D-97074 Würzburg
Germany
R. Bartels
Department of Computer
Science
University of Waterloo
Waterloo, Ontario N2L 3G1
Canada
R. Bulirsch
Institut für Mathematik
Technische Universität
8000 München
Germany
W. Gautschi
Department of Computer
Sciences
Purdue University
West Lafayette, IN 47907
USA
C. Witzgall
Center for Applied Mathematics
National Bureau of Standards
Washington, DC 20234
USA
Series Editors
J.E. Marsden
Control and Dynamical Systems, 107-81
California Institute of Technology
Pasadena, CA 91125
USA
L. Sirovich
Division of Applied Mathematics
Brown University
Providence, RI 02912
USA
M. Golubitsky
Department of Mathematics
University of Houston
Houston, TX 77204-3476
USA
S.S. Antman
Department of Mathematics
and
Institute for Physical Science
and Technology
University of Maryland
College Park, MD 20742-4015
USA
Mathematics Subject Classification (2000): 44-01, 44A10, 44A15, 65R10
Library of Congress Cataloging-in-Publication Data
Stoer, Josef.
[Einführung in die numerische Mathematik. English]
Introduction to numerical analysis / J. Stoer, R. Bulirsch. - 3rd ed.
p. cm. - (Texts in applied mathematics ; 12)
Includes bibliographical references and index.
ISBN 978-1-4419-3006-4
ISBN 978-0-387-21738-3 (eBook)
DOI 10.1007/978-0-387-21738-3
1. Numerical analysis. I. Bulirsch, Roland. II. Title. III. Series.
QA297 .S8213 2002
519.4 - dc21
2002019729
Printed on acid-free paper.
© 2002, 1980, 1993 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 2002
Softcover reprint of the hardcover 3rd edition 2002
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.
Production managed by Timothy Taylor; manufacturing supervised by Jacqui Ashri.
Photocomposed copy prepared from the author's LaTeX files using Springer's svsing.sty macro.
9 8 7 6 5 4 3 2 1
ISBN 978-1-4419-3006-4
SPIN 10867658
Series Preface
Mathematics is playing an ever more important role in the physical and
biological sciences, provoking a blurring of boundaries between scientific
disciplines and a resurgence of interest in the modern as well as the classical
techniques of applied mathematics. This renewal of interest, both in research and teaching, has led to the establishment of the series Texts in
Applied Mathematics (TAM).
The development of new courses is a natural consequence of a high level
of excitement on the research frontier as newer techniques, such as numerical and symbolic computer systems, dynamical systems, and chaos, mix
with and reinforce the traditional methods of applied mathematics. Thus,
the purpose of this textbook series is to meet the current and future needs
of these advances and to encourage the teaching of new courses.
TAM will publish textbooks suitable for use in advanced undergraduate
and beginning graduate courses, and will complement the Applied Mathematical Sciences (AMS) series, which will focus on advanced textbooks and
research-level monographs.
Pasadena, California
Providence, Rhode Island
Houston, Texas
College Park, Maryland
J.E. Marsden
L. Sirovich
M. Golubitsky
S.S. Antman
Preface to the Third Edition
A new edition of a text presents not only an opportunity for corrections
and minor changes but also for adding new material. Thus we strove to improve the presentation of Hermite interpolation and B-splines in Chapter 2, and we added a new Section 2.4.6 on multi-resolution methods and B-splines, using, in particular, low order B-splines for purposes of illustration. The intent is to draw attention to the role of B-splines in this area, and to familiarize the reader with, at least, the principles of multi-resolution methods, which are fundamental to modern applications in signal and image processing.
The chapter on differential equations was enlarged, too. A new Section
7.2.18 describes solving differential equations in the presence of discontinuities whose locations are not known at the outset. Such discontinuities
occur, for instance, in optimal control problems where the character of a
differential equation is affected by control function changes in response to
switching events.
Many applications, such as parameter identification, lead to differential
equations which depend on additional parameters. Users then would like
to know how sensitively the solution reacts to small changes in these parameters. Techniques for such sensitivity analyses are the subject of the
new Section 7.2.19.
Multiple shooting methods are among the most powerful for solving
boundary value problems for ordinary differential equations. We dedicated,
therefore, a new Section 7.3.8 to new advanced techniques in multiple shooting, which especially enhance the efficiency of these methods when applied
to solve boundary value problems with discontinuities, which are typical
for optimal control problems.
Among the many iterative methods for solving large sparse linear equations, Krylov space methods keep growing in importance. We therefore
treated these methods in Section 8.7 more systematically by adding new
subsections dealing with the GMRES method (Section 8.7.2), the biorthogonalization method of Lanczos and the (principles of the) QMR method
(Section 8.7.3), and the Bi-CG and Bi-CGSTAB algorithms (Section 8.7.4).
Correspondingly, the final Section 8.10 on the comparison of iterative methods was updated in order to incorporate the findings for all Krylov space
methods described before.
The authors are greatly indebted to the many who have contributed
to the new edition. We thank R. Grigorieff for many critical remarks on
earlier editions, M. v. Golitschek for his recommendations concerning B-splines and their application in multi-resolution methods, and Ch. Pflaum
for his comments on the chapter dealing with the iterative solution of linear
equations. T. Kronseder and R. Callies helped substantially to establish
the new sections 7.2.18, 7.2.19, and 7.3.8. Suggestions by Ch. Witzgall,
who had helped translate a previous edition, were highly appreciated and
went beyond issues of language. Our co-workers M. Preiss and M. Wenzel
helped us read and correct the original German version. In particular, we
appreciate the excellent work done by J. Launer and Mrs. W. Wrschka
who were in charge of transcribing the full text of the new edition in TEX.
Finally we thank the Springer-Verlag for the smooth cooperation and
expertise that led to a quick realization of the new edition.
Würzburg, München
January 2002
J. Stoer
R. Bulirsch
Preface to the Second Edition
On the occasion of the new edition, the text was enlarged by several new
sections. Two sections on B-splines and their computation were added to
the chapter on spline functions: due to their special properties, their flexibility, and the availability of well tested programs for their computation,
B-splines play an important role in many applications.
Also, the authors followed suggestions by many readers to supplement
the chapter on elimination methods by a section dealing with the solution
of large sparse systems of linear equations. Even though such systems
are usually solved by iterative methods, the realm of elimination methods
has been widely extended due to powerful techniques for handling sparse
matrices. We will explain some of these techniques in connection with the
Cholesky algorithm for solving positive definite linear systems.
The chapter on eigenvalue problems was enlarged by a section on the
Lanczos algorithm; the sections on the LR and QR algorithms were rewritten and now also contain a description of implicit shift techniques.
In order to take account of the progress in the area of ordinary differential equations to some extent, a new section on implicit differential equations and differential-algebraic systems was added, and the section on
stiff differential equations was updated by describing further methods to
solve such equations.
Also the last chapter on the iterative solution of linear equations was
improved. The modern view of the conjugate gradient algorithm as an
iterative method was stressed by adding an analysis of its convergence
rate and a description of some preconditioning techniques. Finally, a new
section on multigrid methods was incorporated: It contains a description
of their basic ideas in the context of a simple boundary value problem for
ordinary differential equations.
Many of the changes were suggested by several colleagues and readers. In
particular, we would like to thank R. Seydel, P. Rentrop and A. Neumaier
for detailed proposals, and our translators R. Bartels, W. Gautschi and C.
Witzgall for their valuable work and critical commentaries. The original
German version was handled by F. Jarre, and I. Brugger was responsible
for the expert typing of the many versions of the manuscript.
Finally we thank the Springer-Verlag for the encouragement, patience
and close cooperation leading to this new edition.
Würzburg, München
May 1991
J. Stoer
R. Bulirsch
Contents
Preface to the Third Edition  vii
Preface to the Second Edition  ix

1  Error Analysis  1
1.1  Representation of Numbers  2
1.2  Roundoff Errors and Floating-Point Arithmetic  4
1.3  Error Propagation  9
1.4  Examples  21
1.5  Interval Arithmetic; Statistical Roundoff Estimation  27
Exercises for Chapter 1  33
References for Chapter 1  36

2  Interpolation  37
2.1  Interpolation by Polynomials  38
2.1.1  Theoretical Foundation: The Interpolation Formula of Lagrange  38
2.1.2  Neville's Algorithm  40
2.1.3  Newton's Interpolation Formula: Divided Differences  43
2.1.4  The Error in Polynomial Interpolation  48
2.1.5  Hermite Interpolation  51
2.2  Interpolation by Rational Functions  59
2.2.1  General Properties of Rational Interpolation  59
2.2.2  Inverse and Reciprocal Differences. Thiele's Continued Fraction  64
2.2.3  Algorithms of the Neville Type  68
2.2.4  Comparing Rational and Polynomial Interpolation  73
2.3  Trigonometric Interpolation  74
2.3.1  Basic Facts  74
2.3.2  Fast Fourier Transforms  80
2.3.3  The Algorithms of Goertzel and Reinsch  88
2.3.4  The Calculation of Fourier Coefficients. Attenuation Factors  92
2.4  Interpolation by Spline Functions  97
2.4.1  Theoretical Foundations  97
2.4.2  Determining Interpolating Cubic Spline Functions  101
2.4.3  Convergence Properties of Cubic Spline Functions  107
2.4.4  B-Splines  111
2.4.5  The Computation of B-Splines  117
2.4.6  Multi-Resolution Methods and B-Splines  121
Exercises for Chapter 2  134
References for Chapter 2  143

3  Topics in Integration  145
3.1  The Integration Formulas of Newton and Cotes  146
3.2  Peano's Error Representation  151
3.3  The Euler-Maclaurin Summation Formula  156
3.4  Integration by Extrapolation  160
3.5  About Extrapolation Methods  165
3.6  Gaussian Integration Methods  171
3.7  Integrals with Singularities  181
Exercises for Chapter 3  184
References for Chapter 3  188

4  Systems of Linear Equations  190
4.1  Gaussian Elimination. The Triangular Decomposition of a Matrix  190
4.2  The Gauss-Jordan Algorithm  200
4.3  The Choleski Decomposition  204
4.4  Error Bounds  207
4.5  Roundoff-Error Analysis for Gaussian Elimination  215
4.6  Roundoff Errors in Solving Triangular Systems  221
4.7  Orthogonalization Techniques of Householder and Gram-Schmidt  223
4.8  Data Fitting  231
4.8.1  Linear Least Squares. The Normal Equations  232
4.8.2  The Use of Orthogonalization in Solving Linear Least-Squares Problems  235
4.8.3  The Condition of the Linear Least-Squares Problem  236
4.8.4  Nonlinear Least-Squares Problems  241
4.8.5  The Pseudoinverse of a Matrix  243
4.9  Modification Techniques for Matrix Decompositions  247
4.10  The Simplex Method  256
4.11  Phase One of the Simplex Method  268
4.12  Appendix: Elimination Methods for Sparse Matrices  272
Exercises for Chapter 4  280
References for Chapter 4  286

5  Finding Zeros and Minimum Points by Iterative Methods  289
5.1  The Development of Iterative Methods  290
5.2  General Convergence Theorems  293
5.3  The Convergence of Newton's Method in Several Variables  298
5.4  A Modified Newton Method  302
5.4.1  On the Convergence of Minimization Methods  303
5.4.2  Application of the Convergence Criteria to the Modified Newton Method  308
5.4.3  Suggestions for a Practical Implementation of the Modified Newton Method. A Rank-One Method Due to Broyden  313
5.5  Roots of Polynomials. Application of Newton's Method  316
5.6  Sturm Sequences and Bisection Methods  328
5.7  Bairstow's Method  333
5.8  The Sensitivity of Polynomial Roots  335
5.9  Interpolation Methods for Determining Roots  338
5.10  The Δ²-Method of Aitken  344
5.11  Minimization Problems without Constraints  349
Exercises for Chapter 5  358
References for Chapter 5  361

6  Eigenvalue Problems  364
6.0  Introduction  364
6.1  Basic Facts on Eigenvalues  366
6.2  The Jordan Normal Form of a Matrix  369
6.3  The Frobenius Normal Form of a Matrix  375
6.4  The Schur Normal Form of a Matrix; Hermitian and Normal Matrices; Singular Values of Matrices  379
6.5  Reduction of Matrices to Simpler Form  386
6.5.1  Reduction of a Hermitian Matrix to Tridiagonal Form: The Method of Householder  388
6.5.2  Reduction of a Hermitian Matrix to Tridiagonal or Diagonal Form: The Methods of Givens and Jacobi  394
6.5.3  Reduction of a Hermitian Matrix to Tridiagonal Form: The Method of Lanczos  398
6.5.4  Reduction to Hessenberg Form  402
6.6  Methods for Determining the Eigenvalues and Eigenvectors  405
6.6.1  Computation of the Eigenvalues of a Hermitian Tridiagonal Matrix  405
6.6.2  Computation of the Eigenvalues of a Hessenberg Matrix. The Method of Hyman  407
6.6.3  Simple Vector Iteration and Inverse Iteration of Wielandt  408
6.6.4  The LR and QR Methods  415
6.6.5  The Practical Implementation of the QR Method  425
6.7  Computation of the Singular Values of a Matrix  436
6.8  Generalized Eigenvalue Problems  440
6.9  Estimation of Eigenvalues  441
Exercises for Chapter 6  455
References for Chapter 6  462

7  Ordinary Differential Equations  465
7.0  Introduction  465
7.1  Some Theorems from the Theory of Ordinary Differential Equations  467
7.2  Initial-Value Problems  471
7.2.1  One-Step Methods: Basic Concepts  471
7.2.2  Convergence of One-Step Methods  477
7.2.3  Asymptotic Expansions for the Global Discretization Error of One-Step Methods  480
7.2.4  The Influence of Rounding Errors in One-Step Methods  483
7.2.5  Practical Implementation of One-Step Methods  485
7.2.6  Multistep Methods: Examples  492
7.2.7  General Multistep Methods  495
7.2.8  An Example of Divergence  498
7.2.9  Linear Difference Equations  501
7.2.10  Convergence of Multistep Methods  504
7.2.11  Linear Multistep Methods  508
7.2.12  Asymptotic Expansions of the Global Discretization Error for Linear Multistep Methods  513
7.2.13  Practical Implementation of Multistep Methods  517
7.2.14  Extrapolation Methods for the Solution of the Initial-Value Problem  521
7.2.15  Comparison of Methods for Solving Initial-Value Problems  524
7.2.16  Stiff Differential Equations  525
7.2.17  Implicit Differential Equations. Differential-Algebraic Equations  531
7.2.18  Handling Discontinuities in Differential Equations  536
7.2.19  Sensitivity Analysis of Initial-Value Problems  538
7.3  Boundary-Value Problems  539
7.3.0  Introduction  539
7.3.1  The Simple Shooting Method  542
7.3.2  The Simple Shooting Method for Linear Boundary-Value Problems  548
7.3.3  An Existence and Uniqueness Theorem for the Solution of Boundary-Value Problems  550
7.3.4  Difficulties in the Execution of the Simple Shooting Method  552
7.3.5  The Multiple Shooting Method  557
7.3.6  Hints for the Practical Implementation of the Multiple Shooting Method  561
7.3.7  An Example: Optimal Control Program for a Lifting Reentry Space Vehicle  565
7.3.8  Advanced Techniques in Multiple Shooting  572
7.3.9  The Limiting Case m → ∞ of the Multiple Shooting Method (General Newton's Method, Quasilinearization)  577
7.4  Difference Methods  582
7.5  Variational Methods  586
7.6  Comparison of the Methods for Solving Boundary-Value Problems for Ordinary Differential Equations  596
7.7  Variational Methods for Partial Differential Equations. The Finite-Element Method  600
Exercises for Chapter 7  607
References for Chapter 7  613

8  Iterative Methods for the Solution of Large Systems of Linear Equations. Additional Methods  619
8.0  Introduction  619
8.1  General Procedures for the Construction of Iterative Methods  621
8.2  Convergence Theorems  623
8.3  Relaxation Methods  629
8.4  Applications to Difference Methods - An Example  639
8.5  Block Iterative Methods  645
8.6  The ADI-Method of Peaceman and Rachford  647
8.7  Krylov Space Methods for Solving Linear Equations  657
8.7.1  The Conjugate-Gradient Method of Hestenes and Stiefel  658
8.7.2  The GMRES Algorithm  667
8.7.3  The Biorthogonalization Method of Lanczos and the QMR Algorithm  680
8.7.4  The Bi-CG and Bi-CGSTAB Algorithms  686
8.8  Buneman's Algorithm and Fourier Methods for Solving the Discretized Poisson Equation  691
8.9  Multigrid Methods  702
8.10  Comparison of Iterative Methods  712
Exercises for Chapter 8  719
References for Chapter 8  727

General Literature on Numerical Methods  730

Index  732
1 Error Analysis
Assessing the accuracy of the results of calculations is a paramount goal
in numerical analysis. One distinguishes several kinds of errors which may
limit this accuracy:
(1) errors in the input data,
(2) roundoff errors,
(3) approximation errors.
Input or data errors are beyond the control of the calculation. They
may be due, for instance, to the inherent imperfections of physical measurements. Roundoff errors arise if one calculates with numbers whose representation is restricted to a finite number of digits, as is usually the case.
As for the third kind of error, many methods will not yield the exact solution of the given problem P, even if the calculations are carried out without rounding, but rather the solution of another simpler problem P̃ which approximates P. For instance, the problem P of summing an infinite series, e.g.,

    e = 1 + 1/1! + 1/2! + 1/3! + ...,

may be replaced by the simpler problem P̃ of summing only up to a finite number of terms of the series. The resulting approximation error is commonly called a truncation error (however, this term is also used for the roundoff-related error committed by deleting any last digit of a number representation). Many approximating problems P̃ are obtained by "discretizing" the original problem P: definite integrals are approximated by finite sums, differential quotients by difference quotients, etc. In such cases, the approximation error is often referred to as discretization error.
Some authors extend the term "truncation error" to cover discretization errors.
Some authors extend the term "truncation error" to cover discretization
errors.
In this chapter, we will examine the general effect of input and roundoff
errors on the result of a calculation. Approximation errors will be discussed
in later chapters as we deal with individual methods. For a comprehensive
treatment of roundoff errors in floating-point computation see Sterbenz
(1974).
1.1 Representation of Numbers
Based on their fundamentally different ways of representing numbers, two
categories of computing machinery can be distinguished:
(1) analog computers,
(2) digital computers.
Examples of analog computers are slide rules and mechanical integrators as
well as electronic analog computers. When using these devices one replaces
numbers by physical quantities, e.g., the length of a bar or the intensity of a
voltage, and simulates the mathematical problem by a physical one, which
is solved through measurement, yielding a solution for the original mathematical problem as well. The scales of a slide rule, for instance, represent
numbers x by line segments of length k ln x. Multiplication is simulated by
positioning line segments contiguously and measuring the combined length
for the result.
It is clear that the accuracy of analog devices is directly limited by the
physical measurements they employ.
Digital computers express the digits of a number representation by a
sequence of discrete physical quantities. Typical instances are desk calculators and electronic digital computers.
EXAMPLE.

    | 1 | 2 | 3 | 1 | 0 | 1 |
Each digit is represented by a specific physical quantity. Since only a
small finite number of different digits have to be encoded - in the decimal
number system, for instance, there are only 10 digits - the representation of
digits in digital computers need not be quite as precise as the representation
of numbers in analog computers. Thus one might tolerate voltages between,
say, 7.8 and 8.2 when aiming at a representation of the digit 8 by 8 volts.
Consequently, the accuracy of digital computers is not directly limited
by the precision of physical measurements.
For technical reasons, most modern electronic digital computers represent numbers internally in binary rather than decimal form. Here the
coefficients or bits αᵢ of a decomposition by powers of 2 play the role of digits in the representation of a number x:

    x = ±(αₙ·2ⁿ + αₙ₋₁·2ⁿ⁻¹ + ... + α₀·2⁰ + α₋₁·2⁻¹ + α₋₂·2⁻² + ...),    αᵢ = 0 or αᵢ = 1.
In order not to confuse decimal and binary representations of numbers, we denote the bits of a binary number representation by 0 and L, respectively.

EXAMPLE. The number x = 18.5 admits the decomposition

    18.5 = 1·2⁴ + 0·2³ + 0·2² + 1·2¹ + 0·2⁰ + 1·2⁻¹

and has therefore the binary representation

    LOOLO.L.
We will use mainly the decimal system, pointing out differences between
the two systems whenever it is pertinent to the examination at hand.
As the example 3.999 ... = 4 shows, the decimal representation of a
number may not be unique. The same holds for binary representations. To
exclude such ambiguities, we will always refer to the finite representation
unless otherwise stated.
In general, digital computers must make do with a fixed finite number
of places, the word length, when internally representing a number. This
number n is determined by the make of the machine, although some machines have built-in extensions to integer multiples 2n, 3n, ... (double word
length, triple word length, ... ) of n to offer greater precision if needed. A
word length of n places can be used in several different fashions to represent a number. Fixed-point representation specifies a fixed number nl of
places before and a fixed number n2 after the decimal (binary) point, so
that n = nl + n2 (usually nl = 0 or nl = n).
EXAMPLE. For n = 10, n₁ = 4, n₂ = 6:

    30.421 → | 0030 | 421000 |
    0.0437 → | 0000 | 043700 |
In this representation, the position of the decimal (binary) point is
fixed. A few simple digital devices, mainly for accounting purposes, are still
restricted to fixed-point representation. Much more important, in particular for scientific calculations, are digital computers featuring floating-point
representation of numbers. Here the decimal (binary) point is not fixed at
the outset; rather its position with respect to the first digit is indicated for
each number separately. This is done by specifying a so-called exponent. In
other words, each real number can be represented in the form
(1.1.1)    x = a × 10^b    (x = a × 2^b)    with |a| < 1, b integer

(say, 30.421 by 0.30421 × 10²), where the exponent b indicates the position of the decimal point with respect to the mantissa a. Rutishauser proposed the following "semilogarithmic" notation, which displays the basis of the
number system at the subscript level and moves the exponent up to the
level of the mantissa:
    0.30421₁₀2.

Analogously,

    0.LOOLOL₂LOL
denotes the number 18.5 in the binary system. On any digital computer
there are, of course, only fixed finite numbers t and e, n = t + e, of places
available for the representation of mantissa and exponent, respectively.
EXAMPLE. For t = 4, e = 2 one would have the floating-point representation

    | 0.5420 |₁₀| 04 |    or more concisely    | 5420 | 04 |

for the number 5420 in the decimal system.
The floating-point representation of a number need not be unique. Since 5420 = 0.5420·10⁴ = 0.0542·10⁵, one could also have the floating-point representation

    | 0.0542 |₁₀| 05 |    or    | 0542 | 05 |

instead of the one given in the above example.
A floating-point representation is normalized if the first digit (bit) of
the mantissa is different from 0 (O). Then |a| ≥ 10⁻¹ (|a| ≥ 2⁻¹) holds
in (1.1.1). The significant digits (bits) of a number are the digits of the
mantissa not counting leading zeros.
In what follows, we will only consider normalized floating-point representations and the corresponding floating-point arithmetic. The numbers t
and e determine - together with the basis B = 10 or B = 2 of the number
representation - the set A ⊆ ℝ of real numbers which can be represented
exactly within a given machine. The elements of A are called machine
numbers.
While normalized floating-point arithmetic is prevalent on current
electronic digital computers, unnormalized arithmetic has been proposed
to ensure that only truly significant digits are carried [Ashenhurst and
Metropolis (1959)].
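The normalized representation (1.1.1) can be inspected directly on a binary machine. The following small Python sketch (our illustration, not part of the original text) splits an IEEE double-precision number into its mantissa a and exponent b using math.frexp, which returns exactly the pair with x = a·2^b and 0.5 ≤ |a| < 1.

import math

# Illustrative sketch only: decompose x into a normalized binary mantissa and
# exponent, x = a * 2**b with 0.5 <= |a| < 1, as in (1.1.1) for the basis 2.
for x in [18.5, 0.1, -3.0]:
    a, b = math.frexp(x)                 # mantissa a and exponent b
    print(x, "=", a, "* 2 **", b, "->", math.ldexp(a, b))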
1.2 Roundoff Errors and Floating-Point Arithmetic
The set A of numbers which are representable in a given machine is only
finite. The question therefore arises of how to approximate a number x rf. A
which is not a machine number by a number g E A which is. This problem
is encountered not only when reading data into a computer, but also when
representing intermediate results within the computer during the course
of a calculation. Indeed, straightforward examples show that the results of
elementary arithmetic operations x ± y, x × y, x/y need not belong to A, even if both operands x, y ∈ A are machine numbers.
It is natural to postulate that the approximation of any number x ∉ A by a machine number rd(x) ∈ A should satisfy

(1.2.1)    |x − rd(x)| ≤ |x − g|    for all g ∈ A.
Such a machine-number approximation rd(x) can be obtained in most cases
by rounding.
EXAMPLE 1 (t = 4).

    rd(0.14285₁₀0)  = 0.1429₁₀0,
    rd(3.14159₁₀0)  = 0.3142₁₀1,
    rd(0.142842₁₀2) = 0.1428₁₀2.
In general, one can proceed as follows in order to find rd(x) for a t-digit computer: x ∉ A is first represented in normalized form x = a × 10^b, so that |a| ≥ 10⁻¹. Suppose the decimal representation of |a| is given by

    |a| = 0.α₁α₂α₃ ...,    0 ≤ αᵢ ≤ 9,    α₁ ≠ 0.

Then one forms

    a' := 0.α₁α₂ ... αₜ            if 0 ≤ αₜ₊₁ ≤ 4,
    a' := 0.α₁α₂ ... αₜ + 10⁻ᵗ     if αₜ₊₁ ≥ 5,

that is, one increases αₜ by 1 if the (t + 1)st digit αₜ₊₁ ≥ 5, and deletes all digits after the t-th one. Finally one puts

    r̃d(x) := sign(x) · a' × 10^b.

Since |a| ≥ 10⁻¹, the "relative error" of r̃d(x) admits the following bound (Scarborough, 1950):

    | (r̃d(x) − x)/x | ≤ 5 × 10^−(t+1) / |a| ≤ 5 × 10⁻ᵗ.

With the abbreviation eps := 5 × 10⁻ᵗ, this can be written as

(1.2.2)    r̃d(x) = x(1 + ε),    where |ε| ≤ eps.

The quantity eps = 5 × 10⁻ᵗ is called the machine precision. In the binary system, r̃d(x) is defined analogously: Starting with a decomposition x = a · 2^b satisfying 2⁻¹ ≤ |a| < 1 and the binary representation of |a|, one forms

    a' := 0.α₁ ... αₜ           if αₜ₊₁ = 0,
    a' := 0.α₁ ... αₜ + 2⁻ᵗ     if αₜ₊₁ = L,

    r̃d(x) := sign(x) · a' × 2^b.

Again (1.2.2) holds, provided one defines the machine precision by eps := 2⁻ᵗ.
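The decimal rounding rule just described is easy to simulate. The sketch below (ours, not from the book) rounds to t = 4 significant decimal digits with ties rounded away from zero and checks the bound (1.2.2) with eps = 5·10⁻ᵗ; the helper rd here is a hypothetical illustration, not a library function.

import math
from decimal import Decimal, ROUND_HALF_UP

def rd(x, t):
    # Round x to t significant decimal digits as in the text: normalize
    # x = a * 10**b with 0.1 <= |a| < 1, keep t mantissa digits, and round
    # the (t+1)st digit away from zero on ties.
    if x == 0.0:
        return 0.0
    b = math.floor(math.log10(abs(x))) + 1
    a = Decimal(repr(x)) / (Decimal(10) ** b)
    a_rounded = a.quantize(Decimal(10) ** (-t), rounding=ROUND_HALF_UP)
    return float(a_rounded * Decimal(10) ** b)

t, eps = 4, 5 * 10 ** (-4)
for x in [0.14285, 3.14159, 14.2842]:
    r = rd(x, t)
    rel_err = abs(r - x) / abs(x)
    print(x, "->", r, "relative error", rel_err, "<= eps:", rel_err <= eps)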
Whenever r̃d(x) ∈ A is a machine number, then r̃d has the property (1.2.1) of a correct rounding process, and we may define

    rd(x) := r̃d(x)    for all x with r̃d(x) ∈ A.

Because only a finite number e of places are available to express the exponent in a floating-point representation, there are unfortunately always numbers x ∉ A with r̃d(x) ∉ A.

EXAMPLE 2 (t = 4, e = 2).

    a) r̃d(0.31794₁₀110)  = 0.3179₁₀110  ∉ A.
    b) r̃d(0.99997₁₀99)   = 0.1000₁₀100  ∉ A.
    c) r̃d(0.012345₁₀−99) = 0.1235₁₀−100 ∉ A.
    d) r̃d(0.54321₁₀−110) = 0.5432₁₀−110 ∉ A.
In cases (a) and (b) the exponent is too greatly positive to fit the allotted
space: These are instances of exponent overflow. Case (b) is particularly
pathological: Exponent overflow happens only after rounding. Cases (c) and
(d) are instances of exponent underflow, i.e. the exponent of the number
represented is too greatly negative. In cases (c) and (d) exponent underflow
may be prevented by defining
(1.2.3)    rd(0.012345₁₀−99) := 0.0123₁₀−99 ∈ A,
           rd(0.54321₁₀−110) := 0 ∈ A.
But then rd does not satisfy (1.2.2), that is, the relative error of rd(x) may exceed eps. Digital computers treat occurrences of exponent overflow and underflow as irregularities of the calculation. In the case of exponent underflow, rd(x) may be formed as indicated in (1.2.3). Exponent overflow may cause a halt in calculations. In the remaining regular cases (but not for all makes of computers), rounding is defined by

    rd(x) := r̃d(x).
Exponent overflow and underflow can be avoided to some extent by suitable
scaling of the input data and by incorporating special checks and rescalings
during computations. Since each different numerical method will require its
own special protection techniques, and since overflow and underflow do not
happen very frequently, we will make the idealized assumption that e = ∞ in our subsequent discussions, so that rd := r̃d does indeed provide a rule for rounding which ensures

(1.2.4)    rd : ℝ → A,    rd(x) = x(1 + ε) with |ε| ≤ eps    for all x ∈ ℝ.

In further examples we will, correspondingly, give the length t of the mantissa only. The reader must bear in mind, however, that subsequent statements regarding roundoff errors may be invalid if overflows or underflows
are allowed to happen.
We have seen that the results of arithmetic operations x ± y, x × y, x/y need not be machine numbers, even if the operands x and y are. Thus one cannot expect to reproduce the arithmetic operations exactly on a digital computer. One will have to be content with substitute operations +*, −*, ×*, /*, so-called floating-point operations, which approximate the arithmetic operations as well as possible [v. Neumann and Goldstine (1947)]. Such operations may be defined, for instance, with the help of the rounding map rd as follows:

(1.2.5)    x +* y := rd(x + y),
           x −* y := rd(x − y),
           x ×* y := rd(x × y),      for x, y ∈ A,
           x /* y := rd(x/y),

so that (1.2.4) implies

(1.2.6)    x +* y = (x + y)(1 + ε₁),
           x −* y = (x − y)(1 + ε₂),
           x ×* y = (x × y)(1 + ε₃),      |εᵢ| ≤ eps.
           x /* y = (x/y)(1 + ε₄),

On many modern computer installations, the floating-point operations ±*, ... are not defined by (1.2.5), but instead in such a way that (1.2.6) holds with only a somewhat weaker bound, say |εᵢ| ≤ k·eps, k ≥ 1 being a small integer. Since small deviations from (1.2.6) are not significant for our examinations, we will assume for simplicity that the floating-point operations are in fact defined by (1.2.5) and hence satisfy (1.2.6).
It should be pointed out that the floating-point operations do not satisfy the well-known laws for arithmetic operations. For instance,

    x +* y = x,    if |y| < (eps/B)·|x|,    x, y ∈ A,

where B is the basis of the number system. The machine precision eps could indeed be defined as the smallest positive machine number g for which 1 +* g > 1:

    eps = min { g ∈ A | 1 +* g > 1 and g > 0 }.
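This characterization of eps can be tested directly in IEEE double-precision arithmetic; the loop below (a sketch of ours, not from the book) halves g until 1 + g no longer exceeds 1.

# Illustrative sketch only: find the smallest power of two g with 1 + g > 1
# in IEEE double precision (the '+' here plays the role of +* in the text).
g = 1.0
while 1.0 + g / 2.0 > 1.0:
    g = g / 2.0
print(g)                       # 2.220446049250313e-16, i.e. 2**(-52)

import sys
print(sys.float_info.epsilon)  # the same value, as reported by Python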
Furthermore, floating-point operations need not be associative or distributive.
EXAMPLE 3 (t = 8). With

    a :=  0.23371258₁₀−4,
    b :=  0.33678429₁₀2,
    c := −0.33677811₁₀2,

one has

    a +* (b +* c) = 0.23371258₁₀−4 +* 0.61800000₁₀−3 = 0.64137126₁₀−3,
    (a +* b) +* c = 0.33678452₁₀2 −* 0.33677811₁₀2 = 0.64100000₁₀−3.

The exact result is

    a + b + c = 0.641371258₁₀−3.
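The loss of associativity is just as easy to observe in IEEE double precision; a tiny sketch (ours, not tied to the numbers above):

# Illustrative sketch only: floating-point addition is not associative.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                   # 0.6000000000000001
print(a + (b + c))                   # 0.6
print((a + b) + c == a + (b + c))    # False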
When subtracting two numbers x, y ∈ A of the same sign, one has to watch out for cancellation. This occurs if x and y agree in one or more leading digits with respect to the same exponent, e.g.,

    x = 0.315876₁₀1,    y = 0.314289₁₀1.
The subtraction causes the common leading digits to disappear. The exact
result x-y is consequently a machine number, so that no new roundoff error
arises, x -* y = x - y. In this sense, subtraction in the case of cancellation
is a quite harmless operation. We will see in the next section, however,
that cancellation is extremely dangerous concerning the propagation of old
errors, which stem from the calculation of x and y prior to carrying out
the subtraction x - y.
For expressing the result of floating-point calculations, a convenient
but slightly imprecise notation has been widely accepted, and we will use
it frequently ourselves: If it is clear from the context how to evaluate an
arithmetic expression E (if need be this can be specified by inserting suitable parentheses), then fl( E) denotes the value of the expression as obtained
by floating-point arithmetic.
EXAMPLE 4.
    fl(x × y) := x ×* y,
    fl(a + (b + c)) := a +* (b +* c),
    fl((a + b) + c) := (a +* b) +* c.

We will also use the notation fl(√x), fl(cos(x)), etc., whenever the digital computer approximates the functions √, cos, etc., by substitutes √*, cos*, etc. Thus fl(√x) := √*x, and so on.
The arithmetic operations +, −, ×, /, together with those basic functions like √, cos, for which floating-point substitutes √*, cos*, etc., have been specified, will be called elementary operations.
1.3 Error Propagation
We have seen in the previous section (Example 3) that two different but
mathematically equivalent methods (a + b) + c,a + (b + c) for evaluating
the same expression a + b + c may lead to different results if floating-point
arithmetic is used. For numerical purposes it is therefore important to distinguish between different evaluation schemes even if they are mathematically equivalent. Thus we call a finite sequence of elementary operations
(as given for instance by consecutive computer instructions) which prescribes how to calculate the solution of a problem from given input data,
an algorithm.
We will formalize the notion of an algorithm somewhat. Suppose a problem consists of calculating desired result numbers y₁, ..., yₘ from input numbers x₁, ..., xₙ. If we introduce the vectors

    x = (x₁, ..., xₙ)^T,    y = (y₁, ..., yₘ)^T,

then solving the above problem means determining the value y = φ(x) of a certain multivariate vector function φ : D → ℝ^m, D ⊆ ℝⁿ, where φ is given by m real functions φᵢ,

    yᵢ = φᵢ(x₁, ..., xₙ),    i = 1, ..., m.
At each stage of a calculation there is an operand set of numbers, which
either are original input numbers Xi or have resulted from previous operations. A single operation calculates a new number from one or more
elements of the operand set. The new number is either an intermediate or
a final result. In any case, it is adjoined to the operand set, which then
is purged of all entries that will not be needed as operands during the remainder of the calculation. The final operand set will consist of the desired
results y₁, ..., yₘ.
Therefore, an operation corresponds to a transformation of the operand
set. Writing consecutive operand sets as vectors,

    x^(i) = (x₁^(i), ..., x_{n_i}^(i))^T,

we can associate with an elementary operation an elementary map

    φ^(i) : D_i → ℝ^{n_{i+1}},    D_i ⊆ ℝ^{n_i},

so that

    φ^(i)(x^(i)) = x^(i+1),

where x^(i+1) is a vector of the transformed operand set. The elementary map φ^(i) is uniquely defined except for inconsequential permutations of x^(i) and x^(i+1) which stem from the arbitrariness involved in arranging the corresponding operand sets in the form of vectors.
Given an algorithm, then its sequence of elementary operations gives rise to a decomposition of φ into a sequence of elementary maps

(1.3.1)    φ^(i) : D_i → D_{i+1},    D_j ⊆ ℝ^{n_j},
           φ = φ^(r) ∘ φ^(r−1) ∘ ... ∘ φ^(0),    D_0 = D,    D_{r+1} ⊆ ℝ^{n_{r+1}} = ℝ^m,

which characterize the algorithm.
EXAMPLE 1. For φ(a, b, c) = a + b + c consider the two algorithms η := a + b, y := c + η and η := b + c, y := a + η. The decompositions (1.3.1) are

    φ^(0)(a, b, c) := (a + b, c)^T ∈ ℝ²,    φ^(1)(u, v) := u + v ∈ ℝ

and

    φ^(0)(a, b, c) := (a, b + c)^T ∈ ℝ²,    φ^(1)(u, v) := u + v ∈ ℝ.
EXAMPLE 2. Since a² − b² = (a + b)(a − b), one has for the calculation of φ(a, b) := a² − b² the two algorithms

    Algorithm 1:  η₁ := a × a,   η₂ := b × b,   y := η₁ − η₂,
    Algorithm 2:  η₁ := a + b,   η₂ := a − b,   y := η₁ × η₂.

Corresponding decompositions (1.3.1) are

    Algorithm 1:  φ^(0)(a, b) := (a·a, b)^T,      φ^(1)(u, v) := (u, v·v)^T,      φ^(2)(u, v) := u − v,
    Algorithm 2:  φ^(0)(a, b) := (a, b, a + b)^T,  φ^(1)(a, b, v) := (v, a − b)^T,  φ^(2)(u, v) := u·v.

Note that the decomposition of φ(a, b) := a² − b² corresponding to Algorithm 1 above can be telescoped into a simpler decomposition:

    φ̄^(0)(a, b) := (a², b²)^T,    φ̄^(1)(u, v) := u − v.

Strictly speaking, however, the map φ̄^(0) is not elementary. Moreover the decomposition does not determine the algorithm uniquely, since there is still a choice, however numerically insignificant, of what to compute first, a² or b².
Hoping to find criteria for judging the quality of algorithms, we will
now examine the reasons why different algorithms for solving the same
problem generally yield different results. Error propagation, for one, plays
a decisive role, as the example of the sum y = a + b + c shows (see Example
3 in Section 1.2). Here floating-point arithmetic yields an approximation
ỹ = fl((a + b) + c) to y which, according to (1.2.6), satisfies

    η := fl(a + b) = (a + b)(1 + ε₁),
    ỹ := fl(η + c) = (η + c)(1 + ε₂)
       = [(a + b)(1 + ε₁) + c](1 + ε₂)
       = (a + b + c)·[1 + (a + b)/(a + b + c) · ε₁(1 + ε₂) + ε₂].

For the relative error ε_y := (ỹ − y)/y of ỹ,

    ε_y = (a + b)/(a + b + c) · ε₁(1 + ε₂) + ε₂,

or, disregarding terms of order higher than 1 in the ε's such as ε₁ε₂,

    ε_y ≐ (a + b)/(a + b + c) · ε₁ + 1 · ε₂.

The amplification factors (a + b)/(a + b + c) and 1, respectively, measure the effect of the roundoff errors ε₁, ε₂ on the error ε_y of the result. The factor (a + b)/(a + b + c) is critical: depending on whether |a + b| or |b + c| is the smaller of the two, it is better to compute a + b + c via (a + b) + c or via a + (b + c), respectively.
In Example 3 of the previous section,

    (a + b)/(a + b + c) = 0.33...·10² / 0.64...·10⁻³ ≈ (1/2)·10⁵,
    (b + c)/(a + b + c) = 0.618...·10⁻³ / 0.64...·10⁻³ ≈ 0.97,

which explains the higher accuracy of fl(a + (b + c)).
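For the numbers of Example 3 these amplification factors can also be evaluated directly; the following sketch (ours, not from the book) confirms the estimates above.

# Illustrative sketch only: amplification factors of the two summation orders
# for the numbers of Example 3 in Section 1.2.
a, b, c = 0.23371258e-4, 0.33678429e2, -0.33677811e2
s = a + b + c
print("(a + b)/(a + b + c) =", (a + b) / s)   # roughly 5 * 10**4: eps_1 is strongly amplified
print("(b + c)/(a + b + c) =", (b + c) / s)   # roughly 0.96: harmless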
The above method of examining the propagation of particular errors
while disregarding higher-order terms can be extended systematically to
provide a differential error analysis of an algorithm for computing φ(x) if
this function is given by a decomposition (1.3.1):
To this end we must investigate how the input errors Δx of x as well as the roundoff errors accumulated during the course of the algorithm affect the final result y = φ(x). We start this investigation by considering the input errors Δx alone, and we will apply any insights we gain to the analysis of the propagation of roundoff errors. We suppose that the function

    φ : D → ℝ^m

is defined on an open subset D of ℝⁿ, and that its component functions φᵢ, i = 1, ..., m, have continuous first derivatives on D. Let x̃ be an approximate value for x. Then we denote by

    Δxᵢ := x̃ᵢ − xᵢ,    Δx := x̃ − x

the absolute error of x̃ᵢ and x̃, respectively. The relative error of x̃ᵢ is defined as the quantity

    ε_{xᵢ} := (x̃ᵢ − xᵢ)/xᵢ    if xᵢ ≠ 0.
Replacing the input data x by x̃ leads to the result ỹ := φ(x̃) instead of y = φ(x). Expanding in a Taylor series and disregarding higher-order terms gives

(1.3.2)    Δyᵢ := ỹᵢ − yᵢ = φᵢ(x̃) − φᵢ(x) ≐ Σ_{j=1}^{n} (x̃ⱼ − xⱼ)·∂φᵢ(x)/∂xⱼ
                 = Σ_{j=1}^{n} ∂φᵢ(x)/∂xⱼ · Δxⱼ,    i = 1, ..., m,

or in matrix notation,

(1.3.3)    Δy ≐ Dφ(x)·Δx

with the Jacobian matrix Dφ(x).
The notation "≐" instead of "=", which has been used occasionally before, is meant to indicate that the corresponding equations are only a first-order approximation, i.e., they do not take quantities of higher order (in ε's or Δ's) into account.
The quantity ∂φᵢ(x)/∂xⱼ in (1.3.3) represents the sensitivity with which yᵢ reacts to absolute perturbations Δxⱼ of xⱼ. If yᵢ ≠ 0 for i = 1, ..., m and xⱼ ≠ 0 for j = 1, ..., n, then a similar error propagation formula holds for relative errors:

(1.3.4)    ε_{yᵢ} ≐ Σ_{j=1}^{n} (xⱼ/φᵢ(x)) · (∂φᵢ(x)/∂xⱼ) · ε_{xⱼ},    i = 1, ..., m.

Again the quantity (xⱼ/φᵢ)·∂φᵢ/∂xⱼ indicates how strongly a relative error in xⱼ affects the relative error in yᵢ. The amplification factors (xⱼ/φᵢ)·∂φᵢ/∂xⱼ for the relative error have the advantage of not depending on the scales of yᵢ and xⱼ. The amplification factors for relative errors are generally called condition numbers. If any condition numbers are present which have large absolute values, then one speaks of an ill-conditioned problem; otherwise, of a well-conditioned problem. For ill-conditioned problems, small relative errors in the input data x can cause large relative errors in the results y = φ(x).
The above concept of condition numbers suffers from the fact that it is meaningful only for nonzero yᵢ, xⱼ. Moreover, it is impractical for many purposes, since the condition of φ is described by m·n numbers. For these
purposes, since the condition of <p is described by mn numbers. For these
reasons, the conditions of special classes of problems are frequently defined
in a more convenient fashion. In linear algebra, for example, it is customary
to call numbers c condition numbers if, in conjunction with a suitable norm
II· II,
~<p(7:--'x)'---:-----:<p~(x--,-)-".I < cII x - x II
-,,-II
II <p(x) II
-
II x II
(see Section 4.4).
EXAMPLE 3. For y = φ(a, b, c) := a + b + c, (1.3.4) gives

    ε_y ≐ a/(a + b + c) · ε_a + b/(a + b + c) · ε_b + c/(a + b + c) · ε_c.

The problem is well conditioned if every summand a, b, c is small compared to a + b + c.
EXAMPLE 4. Let y = φ(p, q) := −p + √(p² + q). Then

    ∂φ/∂p = −1 + p/√(p² + q) = −y/√(p² + q),    ∂φ/∂q = 1/(2√(p² + q)),

so that

    ε_y ≐ (p/φ)·(∂φ/∂p)·ε_p + (q/φ)·(∂φ/∂q)·ε_q
        = −p/√(p² + q) · ε_p + (p + √(p² + q))/(2√(p² + q)) · ε_q.

Since

    |p/√(p² + q)| < 1,    |(p + √(p² + q))/(2√(p² + q))| ≤ 1    for q > 0,

φ is well conditioned if q > 0, and badly conditioned if q ≈ −p².
For the arithmetic operations (1.3.4) specializes to (x ≠ 0, y ≠ 0)

(1.3.5)    φ(x, y) := x·y :     ε_{x·y} ≐ ε_x + ε_y,
           φ(x, y) := x/y :     ε_{x/y} ≐ ε_x − ε_y,
           φ(x, y) := x ± y :   ε_{x±y} ≐ x/(x ± y) · ε_x ± y/(x ± y) · ε_y,    if x ± y ≠ 0,
           φ(x) := √x :         ε_{√x} ≐ (1/2)·ε_x.

It follows that multiplication, division, and square root are not dangerous: The relative errors of the operands don't propagate strongly into the result. This is also the case for addition, provided the operands x and y have the same sign. Indeed, the condition numbers x/(x + y), y/(x + y) then lie between 0 and 1, and they add up to 1, whence

    |ε_{x+y}| ≤ max(|ε_x|, |ε_y|).

If one operand is small compared to the other, but carries a large relative error, the result x + y will still have a small relative error so long as the other operand has only a small relative error: error damping results. If, however, two operands of different sign are to be added, then at least one of the factors

    |x/(x + y)|,    |y/(x + y)|

is bigger than 1, and at least one of the relative errors ε_x, ε_y will be amplified. This amplification is drastic if x ≈ −y holds and therefore cancellation occurs.
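The amplification caused by cancellation is easy to demonstrate; the sketch below (ours) subtracts two nearly equal numbers whose relative errors are about 10⁻⁷ and shows that, although the subtraction itself is carried out exactly, the result has a relative error of several percent.

# Illustrative sketch only: the subtraction itself is (essentially) exact, but
# the relative errors already present in the operands are strongly amplified.
x_exact, y_exact = 1.2345678, 1.2345611
x_tilde = x_exact * (1 + 1e-7)     # operands known only up to ~1e-7 relative error
y_tilde = y_exact * (1 - 1e-7)

d_exact = x_exact - y_exact        # 6.7e-06
d_tilde = x_tilde - y_tilde
print(abs(d_tilde - d_exact) / abs(d_exact))   # about 0.037: a 3.7 % relative error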
We will now employ the formula (1.3.3) to describe the propagation of roundoff errors for a given algorithm. An algorithm for computing the function φ : D → ℝ^m, D ⊆ ℝⁿ, for a given x = (x₁, ..., xₙ)^T ∈ D corresponds to a decomposition of the map φ into elementary maps φ^(i) [see (1.3.1)], and leads from x^(0) := x via a chain of intermediate results

(1.3.6)    x = x^(0) → φ^(0)(x^(0)) = x^(1) → x^(2) → ... → x^(r+1) = y

to the result y. Again we assume that every φ^(i) is continuously differentiable on D_i.
Now let us denote by ψ^(i) the "remainder map"

    ψ^(i) = φ^(r) ∘ φ^(r−1) ∘ ... ∘ φ^(i) : D_i → ℝ^m,    i = 0, 1, 2, ..., r.

Then ψ^(0) ≡ φ. Dφ^(i) and Dψ^(i) are the Jacobian matrices of the maps φ^(i) and ψ^(i). Since Jacobian matrices are multiplicative with respect to function composition,

    D(f ∘ g)(x) = Df(g(x)) · Dg(x),

we note for further reference that for i = 0, 1, ..., r

(1.3.7)    Dφ(x) = Dφ^(r)(x^(r)) · Dφ^(r−1)(x^(r−1)) ··· Dφ^(0)(x),
           Dψ^(i)(x^(i)) = Dφ^(r)(x^(r)) · Dφ^(r−1)(x^(r−1)) ··· Dφ^(i)(x^(i)).
With floating-point arithmetic, input and roundoff errors will perturb the intermediate (exact) results x^(i), so that approximate values x̃^(i) with x̃^(i+1) = fl(φ^(i)(x̃^(i))) will be obtained instead. For the absolute errors Δx^(i) := x̃^(i) − x^(i),

(1.3.8)    Δx^(i+1) = [fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i))] + [φ^(i)(x̃^(i)) − φ^(i)(x^(i))].

By (1.3.3) (disregarding higher-order error terms),

(1.3.9)    φ^(i)(x̃^(i)) − φ^(i)(x^(i)) ≐ Dφ^(i)(x^(i))·Δx^(i).

If φ^(i) is an elementary map, or if it involves only independent elementary operations, the floating-point evaluation of φ^(i) will yield the rounding of the exact value:

(1.3.10)    fl(φ^(i)(u)) = rd(φ^(i)(u)).

Note, in this context, that the map φ^(i) : D_i → D_{i+1} ⊆ ℝ^{n_{i+1}} is actually a vector of component functions φⱼ^(i) : D_i → ℝ,

    φ^(i) = (φ₁^(i), ..., φ_{n_{i+1}}^(i))^T.

Thus (1.3.10) must be interpreted componentwise:

(1.3.11)    fl(φⱼ^(i)(u)) = rd(φⱼ^(i)(u)) = (1 + εⱼ)·φⱼ^(i)(u),    |εⱼ| ≤ eps,    j = 1, 2, ..., n_{i+1}.

Here εⱼ is the new relative roundoff error generated during the calculation of the j-th component of φ^(i) in floating-point arithmetic. Plainly, (1.3.10) can also be written in the form

    fl(φ^(i)(x̃^(i))) = (I + E_{i+1})·φ^(i)(x̃^(i))

with the identity matrix I and the diagonal error matrix

    E_{i+1} = diag(ε₁, ..., ε_{n_{i+1}}).

This yields the following expression for the first bracket in (1.3.8):

    fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i)) = E_{i+1}·φ^(i)(x̃^(i)).

Furthermore E_{i+1}·φ^(i)(x̃^(i)) ≐ E_{i+1}·φ^(i)(x^(i)), since the error terms by which φ^(i)(x̃^(i)) and φ^(i)(x^(i)) differ are multiplied by the error terms on the diagonal of E_{i+1}, giving rise to higher-order error terms. Therefore

(1.3.12)    fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i)) ≐ E_{i+1}·φ^(i)(x^(i)) = E_{i+1}·x^(i+1) =: α_{i+1}.

The quantity α_{i+1} can be interpreted as the absolute roundoff error newly created when φ^(i) is evaluated in floating-point arithmetic, and the diagonal elements of E_{i+1} can be similarly interpreted as the corresponding relative roundoff errors. Thus by (1.3.8), (1.3.9), and (1.3.12), Δx^(i+1) can be expressed in first-order approximation as follows:

    Δx^(i+1) ≐ α_{i+1} + Dφ^(i)(x^(i))·Δx^(i) = E_{i+1}·x^(i+1) + Dφ^(i)(x^(i))·Δx^(i),    i ≥ 0,    Δx^(0) := Δx.

Consequently

    Δx^(1) ≐ Dφ^(0)·Δx + α₁,
    Δx^(2) ≐ Dφ^(1)·[Dφ^(0)·Δx + α₁] + α₂,
    ...

In view of (1.3.7), we finally arrive at the following formulas which describe the effect of the input errors Δx and the roundoff errors αᵢ on the result y = x^(r+1) = φ(x):

(1.3.13)    Δy ≐ Dφ(x)·Δx + Dψ^(1)(x^(1))·α₁ + ... + Dψ^(r)(x^(r))·α_r + α_{r+1}
               = Dφ(x)·Δx + Dψ^(1)(x^(1))·E₁·x^(1) + ... + Dψ^(r)(x^(r))·E_r·x^(r) + E_{r+1}·y.

It is therefore the size of the Jacobian matrix Dψ^(i) of the remainder map ψ^(i) which is critical for the effect of the intermediate roundoff errors αᵢ or Eᵢ on the final result.
EXAMPLE 5. For the two algorithms for computing y = φ(a, b) = a² − b² given in Example 2 we have for Algorithm 1:

    x = x^(0) = (a, b)^T,    x^(1) = (a², b)^T,    x^(2) = (a², b²)^T,    x^(3) = y = a² − b²,
    ψ^(1)(u, v) = u − v²,    ψ^(2)(u, v) = u − v,
    Dφ(x) = (2a, −2b),    Dψ^(1)(x^(1)) = (1, −2b),    Dψ^(2)(x^(2)) = (1, −1).

Moreover

    α₁ = E₁·x^(1) = (ε₁·a², 0)^T,    E₁ = [ ε₁  0 ]
                                          [ 0   0 ],    |ε₁| ≤ eps,

because of

    x̃^(1) = fl(φ^(0)(x)) = (a·a·(1 + ε₁), b)^T,

and likewise

    α₂ = (0, ε₂·b²)^T,    α₃ = ε₃·(a² − b²),    |εᵢ| ≤ eps for i = 2, 3.

From (1.3.13) with Δx = (Δa, Δb)^T,

(1.3.14)    Δy ≐ 2a·Δa − 2b·Δb + ε₁·a² − ε₂·b² + ε₃·(a² − b²).

Analogously for Algorithm 2:

    x = x^(0) = (a, b)^T,    x^(1) = (a + b, a − b)^T,    x^(2) = y = a² − b²,
    ψ^(1)(u, v) = u·v,
    Dφ(x) = (2a, −2b),    Dψ^(1)(x^(1)) = (a − b, a + b),
    α₁ = (ε₁·(a + b), ε₂·(a − b))^T,    α₂ = ε₃·(a² − b²),
    E₁ = [ ε₁  0  ]
         [ 0   ε₂ ],    |εᵢ| ≤ eps,

and therefore (1.3.13) again yields

(1.3.15)    Δy ≐ 2a·Δa − 2b·Δb + (ε₁ + ε₂ + ε₃)·(a² − b²).
If one selects a different algorithm for calculating the same result φ(x) (in other words, a different decomposition of φ into elementary maps), then Dφ remains unchanged; the Jacobian matrices Dψ^(i), which measure the propagation of roundoff, will be different, however, and so will be the total effect of rounding,

(1.3.16)    Dψ^(1)(x^(1))·α₁ + ... + Dψ^(r)(x^(r))·α_r + α_{r+1}.

An algorithm is called numerically more trustworthy than another algorithm for calculating φ(x) if, for a given set of data x, the total effect of rounding, (1.3.16), is less for the first algorithm than for the second one.
EXAMPLE 6. The total effect of rounding using Algorithm 1 in Example 2 is, by (1.3.14),

(1.3.17)    |ε₁·a² − ε₂·b² + ε₃·(a² − b²)| ≤ (a² + b² + |a² − b²|)·eps,

and that of Algorithm 2, by (1.3.15),

(1.3.18)    |(ε₁ + ε₂ + ε₃)·(a² − b²)| ≤ 3·|a² − b²|·eps.

Algorithm 2 is numerically more trustworthy than Algorithm 1 whenever 1/3 ≤ |a/b|² ≤ 3; otherwise Algorithm 1 is more trustworthy. This follows from the equivalence of the two relations 1/3 ≤ |a/b|² ≤ 3 and 3|a² − b²| ≤ a² + b² + |a² − b²|.
For a := 0.3237, b := 0.3134 using four places (t = 4), we obtain the following results:

    Algorithm 1:    a ×* a = 0.1048,    b ×* b = 0.9822₁₀−1,
                    (a ×* a) −* (b ×* b) = 0.6580₁₀−2,
    Algorithm 2:    a +* b = 0.6371,    a −* b = 0.1030₁₀−1,
                    (a +* b) ×* (a −* b) = 0.6562₁₀−2,
    Exact result:   a² − b² = 0.656213₁₀−2.
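Example 6 can be replayed with Python's decimal module, which allows arithmetic with a prescribed number of significant digits; the following sketch (ours, not from the book) sets the precision to t = 4.

from decimal import Decimal, localcontext

# Illustrative sketch only: Algorithms 1 and 2 of Example 2 carried out with
# t = 4 significant decimal digits.
a, b = Decimal("0.3237"), Decimal("0.3134")

with localcontext() as ctx:
    ctx.prec = 4                       # every operation rounds to 4 digits
    alg1 = a * a - b * b               # Algorithm 1
    alg2 = (a + b) * (a - b)           # Algorithm 2

print("Algorithm 1:", alg1)            # 0.00658
print("Algorithm 2:", alg2)            # 0.006562
print("Exact:      ", a * a - b * b)   # 0.00656213 (default 28-digit precision)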
In the error propagation formula (1.3.13), the last term admits the following bound:¹

    |E_{r+1}·y| ≤ |y|·eps,

no matter what algorithm has been used for computing y = φ(x). Hence an error Δy of magnitude |y|·eps has to be expected for any algorithm. Note, moreover, when using mantissas of t places, that the rounding of the input data x = (x₁, ..., xₙ)^T will cause an input error Δ^(0)x with

    |Δ^(0)x| ≤ eps·|x|,

unless the input data are already machine numbers and therefore representable exactly. Since the latter cannot be counted on, any algorithm for computing y = φ(x) will have to be assumed to incur the error Dφ(x)·Δ^(0)x, so that altogether for every such algorithm an error of magnitude

(1.3.19)    Δ^(0)y := [|Dφ(x)|·|x| + |y|]·eps

must be expected. We call Δ^(0)y the inherent error of y. Since this error will have to be reckoned with in any case, it would be unreasonable to ask that the influence of intermediate roundoff errors on the final result be considerably smaller than Δ^(0)y. We therefore call roundoff errors αᵢ or Eᵢ harmless if their contribution in (1.3.13) towards the total error Δy is of at most the same order of magnitude as the inherent error Δ^(0)y from (1.3.19).

¹ The absolute values of vectors and matrices are to be understood componentwise, e.g., |y| = (|y₁|, ..., |yₘ|)^T.
If all roundoff errors of an algorithm are harmless, then the algorithm is said to be well behaved or numerically stable. This particular notion of numerical stability has been promoted by Bauer et al. (1965); Bauer also uses the term benign (1974). Finding numerically stable algorithms is a primary task of numerical analysis.

EXAMPLE 7. Both algorithms of Example 2 are numerically stable. Indeed, the inherent error Δ^(0)y is as follows:

    Δ^(0)y = [2a² + 2b² + |a² − b²|]·eps.

Comparing this with (1.3.17) and (1.3.18) even shows that the total error of each of the two algorithms cannot exceed Δ^(0)y.
Let us pause to review our usage of terms. Numerical trustworthiness,
which we will use as a comparative term, relates to the roundoff errors
associated with two or more algorithms for the same problem. Numerical
stability, which we will use as an absolute term, relates to the inherent error
and the corresponding harmlessness of the roundoff errors associated with a
single algorithm. Thus one algorithm may be numerically more trustworthy
than another, yet neither may be numerically stable. If both are numerically
stable, the numerically more trustworthy algorithm is to be preferred. We attach the qualifier "numerically" because of the widespread use of the term
"stable" without that qualifier in other contexts such as the terminology
of differential equations, economic models, and linear multistep iterations,
where it has different meanings. Further illustrations of the concept which
we have introduced above will be found in the next section.
A general technique for establishing the numerical stability of an algorithm, the so-called backward analysis, has been introduced by Wilkinson
(1960) for the purpose of examining algorithms in linear algebra. He tries
to show that the floating-point result ỹ = y + Δy of an algorithm for computing y = φ(x) may be written in the form ỹ = φ(x + Δx), that is, as the result of an exact calculation based on perturbed input data x + Δx. If Δx turns out to have the same order of magnitude as |Δ^(0)x| ≤ eps·|x|, then the algorithm is indeed numerically stable.
Bauer (1974) associates graphs with algorithms in order to illuminate
their error patterns. For instance, Algorithms 1 and 2 of Example 2 give
rise to the graphs in Figure 1. The nodes of these graphs correspond to
the intermediate results. Node i is linked to node j by a directed arc if the
intermediate result corresponding to node i is an operand of the elementary
operation which produces the result corresponding to node j. At each node
there arises a new relative roundoff error, which is written next to its node.
Amplification factors for the relative errors are similarly associated with,
and written next to, the arcs of the graph. Tracing through the graph of
Algorithm 1, for instance, one obtains the following error relations:

    ε_{η₁} ≐ 2·ε_a + ε₁,    ε_{η₂} ≐ 2·ε_b + ε₂,
    ε_y ≐ a²/(a² − b²) · ε_{η₁} − b²/(a² − b²) · ε_{η₂} + ε₃.
[Figure 1 shows the graphs for Algorithm 1 and Algorithm 2, with the roundoff errors written next to the nodes and the amplification factors written next to the arcs.]

Fig. 1. Graphs Representing Algorithms and Their Error Propagation
To find the factor by which to multiply the roundoff error at node i in
order to get its contribution to the error at node j, one multiplies all arc
factors for each directed path from i to j and adds these products. The
graph of Algorithm 2 thus indicates that the input error ε_a contributes

    ( a/(a + b) · 1 + a/(a − b) · 1 ) · ε_a

to the error ε_y.
1.4 Examples
EXAMPLE 1. This example follows up Example 4 of the previous section: given p > 0, q > 0, p ≫ q, determine the root

    y = −p + √(p² + q)

with smallest absolute value of the quadratic equation

    y² + 2py − q = 0.

Input data: p, q. Result: y = φ(p, q) = −p + √(p² + q).
The problem was seen to be well conditioned for p > 0, q > 0. It was also shown that the relative input errors ε_p, ε_q make the following contribution to the relative error of the result y = φ(p, q):

    ε_y ≐ −p/√(p² + q) · ε_p + (p + √(p² + q))/(2√(p² + q)) · ε_q.

Since

    |p/√(p² + q)| < 1,    |(p + √(p² + q))/(2√(p² + q))| < 1,

the inherent error Δ^(0)y satisfies

    |ε_y^(0)| = |Δ^(0)y / y| ≤ 3·eps.

We will now consider two algorithms for computing y = φ(p, q).
Algorithm 1:    s := p²,    t := s + q,    u := √t,    y := −p + u.

Obviously, p ≫ q causes cancellation when y := −p + u is evaluated, and it must therefore be expected that the roundoff error

    Δu := ε·√t = ε·√(p² + q),

generated during the floating-point calculation of the square root,

    fl(√t) = √t·(1 + ε),    |ε| ≤ eps,

will be greatly amplified. Indeed, the above error contributes the following term to the error of y:

    (1/y)·Δu = √(p² + q) / (−p + √(p² + q)) · ε = (1/q)·(p·√(p² + q) + p² + q)·ε = k·ε.
Since p, q > 0, the amplification factor k admits the following lower bound:

    k > 2p²/q > 0,

which is large, since p ≫ q by hypothesis. Therefore, the proposed algorithm is not numerically stable, because the influence of rounding √(p² + q) alone exceeds that of the inherent error ε_y^(0) by an order of magnitude.
Algorithm 2:    s := p²,
                t := s + q,
                u := √t,
                v := p + u,
                y := q/v.

This algorithm does not cause cancellation when calculating v := p + u. The
roundoff error Δu = ε·√(p² + q), which stems from rounding √(p² + q), will be
amplified according to the remainder map ψ(u):

    u → p + u → q/(p + u) =: ψ(u).

Thus it contributes the following term to the relative error of y:

    (1/y)·(∂ψ/∂u)·Δu = ( −q / (y·(p + u)²) )·Δu
                     = −q·√(p² + q)·ε / ( (−p + √(p² + q))(p + √(p² + q))² )
                     = −( √(p² + q) / (p + √(p² + q)) )·ε = k·ε.

The amplification factor k remains small; indeed, |k| < 1, and Algorithm 2 is
therefore numerically stable.
The following numerical results illustrate the difference between Algorithms
1 and 2. They were obtained using floating-point arithmetic of 40 binary mantissa
places - about 13 decimal places - as will be the case in subsequent numerical
examples.
p = 1000, q = 0.018 000 000 081

    Result y according to Algorithm 1:    0.900 030 136 108 · 10⁻⁵
    Result y according to Algorithm 2:    0.899 999 999 998 · 10⁻⁵
    Exact value of y:                     0.900 000 000 000 · 10⁻⁵
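To make the comparison easy to reproduce, here is a small Python sketch (an editorial illustration, not part of the original text) that evaluates both algorithms in IEEE double precision for the data p = 1000, q = 0.018000000081; the cancellation in Algorithm 1 is clearly visible.

    import math

    def algorithm1(p: float, q: float) -> float:
        # y := -p + sqrt(p^2 + q): cancellation for p >> q
        s = p * p
        t = s + q
        u = math.sqrt(t)
        return -p + u

    def algorithm2(p: float, q: float) -> float:
        # y := q / (p + sqrt(p^2 + q)): no cancellation
        s = p * p
        t = s + q
        u = math.sqrt(t)
        v = p + u
        return q / v

    p, q = 1000.0, 0.018000000081
    print(algorithm1(p, q))   # loses roughly half of the significant digits
    print(algorithm2(p, q))   # accurate to machine precision, y ~ 0.9e-5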
EXAMPLE 2. For given fixed x, the value of cos kx may be computed recursively
using for m = 1, 2, ..., k − 1 the formula

    cos(m + 1)x = 2 cos x cos mx − cos(m − 1)x.
In this case, a trigonometric-function evaluation has to be carried out only once,
to find c = cos x. Now let |x| ≠ 0 be a small number. The calculation of c causes
a small roundoff error:

    c̃ = (1 + ε) cos x,    |ε| ≤ eps.

How does this roundoff error affect the calculation of cos kx?
cos kx can be expressed in terms of c:  cos kx = cos(k arccos c) =: f(c). Since

    df/dc = k sin kx / sin x,

the error ε cos x of c causes, to first approximation, an absolute error

(1.4.1)    Δ cos kx ≐ (ε cos x / sin x)·k sin kx = ε·k cot x sin kx

in cos kx.
On the other hand, the inherent error Δ⁽⁰⁾c_k (1.3.19) of the result c_k := cos kx
is

    Δ⁽⁰⁾c_k = [ k|x sin kx| + |cos kx| ]·eps.

Comparing this with (1.4.1) shows that Δ cos kx may be considerably larger than
Δ⁽⁰⁾c_k for small |x|; hence the algorithm is not numerically stable.
EXAMPLE 3. For given x and a "large" positive integer k, the numbers cos kx and
sin kx are to be computed recursively using

    cos mx = cos x cos(m − 1)x − sin x sin(m − 1)x,
    sin mx = sin x cos(m − 1)x + cos x sin(m − 1)x,    m = 1, 2, ..., k.

How do small errors ε_c cos x, ε_s sin x in the calculation of cos x, sin x affect the
final results cos kx, sin kx? Abbreviating c_m := cos mx, s_m := sin mx, c := cos x,
s := sin x, and putting

    U := [ c  −s ]
         [ s   c ],

we have

    [ c_m ]  =  U [ c_{m−1} ]
    [ s_m ]       [ s_{m−1} ],    m = 1, ..., k.

Here U is a unitary matrix, which corresponds to a rotation by the angle x.
Repeated application of the formula above gives

    [ c_k ]  =  U^k [ c₀ ]  =  U^k [ 1 ]
    [ s_k ]         [ s₀ ]         [ 0 ].

Now

    ∂U/∂c = [ 1  0 ],        ∂U/∂s = [ 0  −1 ]  =:  A,
            [ 0  1 ]                 [ 1   0 ]

and therefore

    ∂U^k/∂c = k·U^{k−1},
    ∂U^k/∂s = A·U^{k−1} + U·A·U^{k−2} + ··· + U^{k−1}·A = k·A·U^{k−1},

because A commutes with U. Since U describes a rotation in ℝ² by the angle x,

    ∂U^k/∂c = k [ cos(k−1)x   −sin(k−1)x ]
                [ sin(k−1)x    cos(k−1)x ],

    ∂U^k/∂s = k [ −sin(k−1)x   −cos(k−1)x ]
                [  cos(k−1)x   −sin(k−1)x ].

The relative errors ε_c, ε_s of c̃ = cos x, s̃ = sin x effect the following absolute errors
of cos kx, sin kx:

(1.4.2)    [ Δc_k ]  ≐  ∂U^k/∂c [ 1 ]·ε_c cos x  +  ∂U^k/∂s [ 1 ]·ε_s sin x
           [ Δs_k ]             [ 0 ]                       [ 0 ]

                     =  ε_c·k cos x [ cos(k−1)x ]  +  ε_s·k sin x [ −sin(k−1)x ]
                                    [ sin(k−1)x ]                 [  cos(k−1)x ].

The inherent errors Δ⁽⁰⁾c_k and Δ⁽⁰⁾s_k of c_k = cos kx and s_k = sin kx, respectively, are given by

(1.4.3)    Δ⁽⁰⁾c_k = [ k|x sin kx| + |cos kx| ]·eps,
           Δ⁽⁰⁾s_k = [ k|x cos kx| + |sin kx| ]·eps.

Comparison of (1.4.2) and (1.4.3) reveals that for big k and |kx| ≈ 1 the influence
of the roundoff error ε_c is considerably bigger than the inherent errors, while
the roundoff error ε_s is harmless. The algorithm is not numerically stable, albeit
numerically more trustworthy than the algorithm of Example 2 as far as the
computation of c_k alone is concerned.
EXAMPLE 4. For small |x|, the recursive calculation of

    c_m = cos mx,    s_m = sin mx,    m = 1, 2, ... ,

based on

    cos(m + 1)x = cos x cos mx − sin x sin mx,
    sin(m + 1)x = sin x cos mx + cos x sin mx,
as in Example 3, may be further improved numerically. To this end, we express
the differences ds_{m+1} and dc_{m+1} of subsequent sine and cosine values as follows:

    dc_{m+1} := cos(m + 1)x − cos mx
             = 2(cos x − 1) cos mx − sin x sin mx − cos x cos mx + cos mx
             = −4 (sin² x/2) cos mx + [ cos mx − cos(m − 1)x ],

    ds_{m+1} := sin(m + 1)x − sin mx
             = 2(cos x − 1) sin mx + sin x cos mx − cos x sin mx + sin mx
             = −4 (sin² x/2) sin mx + [ sin mx − sin(m − 1)x ].

This leads to a more elaborate recursive algorithm for computing c_k, s_k in the
case x > 0:
    dc₁ := −2 sin²(x/2),
    ds₁ := √( −dc₁·(2 + dc₁) ),
    s₀ := 0,    c₀ := 1,
    t := 2·dc₁ = −4 sin²(x/2),

and for m := 1, 2, ..., k:

    c_m := c_{m−1} + dc_m,
    s_m := s_{m−1} + ds_m,
    dc_{m+1} := t·c_m + dc_m,
    ds_{m+1} := t·s_m + ds_m.

For the error analysis, note that c_k and s_k are functions of s̄ = sin(x/2):

    c_k = cos(2k arcsin s̄) =: φ₁(s̄),
    s_k = sin(2k arcsin s̄) =: φ₂(s̄).

An error Δs̄ = ε_s sin(x/2) in the calculation of s̄ therefore causes, to a first-order
approximation, the following errors in c_k and s_k:

    (∂φ₁/∂s̄)·ε_s sin(x/2) = ( −2k sin kx / cos(x/2) )·ε_s sin(x/2) = −2k tan(x/2) sin kx · ε_s,
    (∂φ₂/∂s̄)·ε_s sin(x/2) = (  2k cos kx / cos(x/2) )·ε_s sin(x/2) =  2k tan(x/2) cos kx · ε_s.

Comparison with the inherent errors (1.4.3) shows these errors to be harmless for
small |x|. The algorithm is then numerically stable, at least as far as the influence
of the roundoff error ε_s is concerned.
Again we illustrate our analytical considerations with some numerical results.
Let x = 0.001, k = 1000.
    Algorithm      Result for cos kx            Relative error
    Example 2      0.540 302 121 124            −0.34 · 10⁻⁶
    Example 3      0.540 302 305 776            −0.17 · 10⁻⁹
    Example 4      0.540 302 305 865            −0.58 · 10⁻¹¹
    Exact value    0.540 302 305 868 140 ...
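The three recursions are easy to compare numerically. The following Python sketch (an editorial illustration, not from the original) implements the recursions of Examples 2, 3, and 4 in double precision and prints cos kx for x = 0.001, k = 1000 next to the exact value.

    import math

    def example2(x: float, k: int) -> float:
        # cos(m+1)x = 2 cos x * cos mx - cos(m-1)x
        c = math.cos(x)
        cm_prev, cm = 1.0, c            # cos 0, cos x
        for _ in range(1, k):
            cm_prev, cm = cm, 2.0 * c * cm - cm_prev
        return cm

    def example3(x: float, k: int) -> float:
        # simultaneous recursion for cos mx and sin mx (rotation by the angle x)
        c, s = math.cos(x), math.sin(x)
        cm, sm = 1.0, 0.0
        for _ in range(k):
            cm, sm = c * cm - s * sm, s * cm + c * sm
        return cm

    def example4(x: float, k: int) -> float:
        # recursion for the differences of successive cosine and sine values
        dc = -2.0 * math.sin(x / 2.0) ** 2
        ds = math.sqrt(-dc * (2.0 + dc))
        t = 2.0 * dc
        cm, sm = 1.0, 0.0
        for _ in range(k):
            cm, sm = cm + dc, sm + ds
            dc, ds = t * cm + dc, t * sm + ds
        return cm

    x, k = 0.001, 1000
    for f in (example2, example3, example4):
        print(f.__name__, f(x, k))
    print("exact   ", math.cos(k * x))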
EXAMPLE 5. We will derive some results which will be useful for the analysis of
algorithms for solving linear equations in Section 4.5. Given the quantities c, a₁,
..., a_n, b₁, ..., b_{n−1} with a_n ≠ 0, we want to find the solution β_n of the linear
equation

(1.4.4)    a₁b₁ + ··· + a_{n−1}b_{n−1} + a_nβ_n = c.

Floating-point arithmetic yields the approximate solution

(1.4.5)    b_n = fl( (c − a₁b₁ − ··· − a_{n−1}b_{n−1}) / a_n )

as follows:

    s₀ := c;
    for j := 1, 2, ..., n − 1:
(1.4.6)    s_j := fl(s_{j−1} − a_j b_j) = ( s_{j−1} − a_j b_j (1 + μ_j) )·(1 + α_j),    j = 1, ..., n − 1,
           b_n := fl(s_{n−1}/a_n) = (1 + δ)·s_{n−1}/a_n,

with |μ_j|, |α_j|, |δ| ≤ eps. If a_n = 1, as is frequently the case in applications, then
δ = 0, since b_n := s_{n−1}.
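A minimal Python sketch of this elimination step (an editorial illustration; the variable names are ours) computes b_n and the residual c − Σ a_j b_j − a_n b_n, which the estimates below bound:

    def solve_last_unknown(c, a, b):
        """Solve a[0]*b[0] + ... + a[n-2]*b[n-2] + a[n-1]*beta_n = c for beta_n
        by the summation algorithm (1.4.6); a and b are sequences of floats,
        len(b) == len(a) - 1."""
        s = c
        for aj, bj in zip(a[:-1], b):
            s = s - aj * bj          # s_j := fl(s_{j-1} - a_j b_j)
        return s / a[-1]             # b_n := fl(s_{n-1} / a_n)

    a = [0.3, -1.7, 2.5, 1.0]        # a_n = 1, so the final division is exact
    b = [1.1, 0.9, -0.4]
    c = 2.0
    bn = solve_last_unknown(c, a, b)
    residual = c - sum(aj * bj for aj, bj in zip(a[:-1], b)) - a[-1] * bn
    print(bn, residual)              # the residual is of the order of eps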
We will now describe two useful estimates for the residual

    c − a₁b₁ − ··· − a_{n−1}b_{n−1} − a_nb_n.

From (1.4.6) follow the equations

    s_j − s_{j−1} + a_jb_j = s_j·α_j/(1 + α_j) − a_jb_j·μ_j,    j = 1, 2, ..., n − 1,
    a_nb_n − s_{n−1} = δ·s_{n−1}.

Summing these equations yields

    a_nb_n + Σ_{j=1}^{n−1} a_jb_j − c = Σ_{j=1}^{n−1} [ s_j·α_j/(1 + α_j) − a_jb_j·μ_j ] + δ·s_{n−1},

and thereby the first one of the promised estimates,

(1.4.7)    | c − Σ_{j=1}^{n−1} a_jb_j − a_nb_n | ≤ eps·[ σ·|s_{n−1}| + Σ_{j=1}^{n−1} ( |s_j|/(1 − eps) + |a_jb_j| ) ],

    σ := 0 if a_n = 1, 1 otherwise.
The second estimate is cruder than (1.4.7). Unwinding (1.4.6) yields (1.4.8), an
expression for a_nb_n in terms of c and the a_jb_j, which can be solved for c:

(1.4.9)    c = Σ_{j=1}^{n−1} a_jb_j (1 + μ_j) Π_{k=1}^{j−1} (1 + α_k)⁻¹ + a_nb_n (1 + δ)⁻¹ Π_{k=1}^{n−1} (1 + α_k)⁻¹.

A simple induction argument over m shows that

    (1 + σ) = Π_{k=1}^{m} (1 + σ_k)^{±1},    |σ_k| ≤ eps,    m·eps < 1,

implies

    |σ| ≤ m·eps / (1 − m·eps).
In view of (1.4.9) this ensures the existence of quantities ε_j with

(1.4.10)    c = Σ_{j=1}^{n−1} a_jb_j (1 + j·ε_j) + a_nb_n (1 + (n − 1 + σ')·ε_n),

(1.4.11)    |ε_j| ≤ eps / (1 − n·eps),    σ' := 0 if a_n = 1, 1 otherwise.
In particular, (1.4.8) reveals the numerical stability of our algorithm for computing β_n. The roundoff error α_m contributes the amount

    (s_m/a_n)·α_m ≐ ( (c − Σ_{i=1}^{m} a_ib_i) / a_n )·α_m

to the absolute error in β_n. This, however, is at most equal to

    ( |c| + Σ_{i=1}^{m} |a_ib_i| )·eps / |a_n|,

which represents no more than the influence of the input errors ε_c and ε_{a_i} of
c and a_i, i = 1, ..., m, respectively, provided |ε_c|, |ε_{a_i}| ≤ eps. The remaining
roundoff errors μ_k and δ are similarly shown to be harmless.
The numerical stability of the above algorithm is often shown by interpreting
(1.4.10) in the sense of backward analysis: The computed approximate solution
b_n is the exact solution of the equation

    ā₁b₁ + ··· + ā_{n−1}b_{n−1} + ā_nb_n = c,

whose coefficients

    ā_j := a_j(1 + j·ε_j),    1 ≤ j ≤ n − 1,
    ā_n := a_n(1 + (n − 1 + σ')·ε_n),

have changed only slightly from their original values a_j. This kind of analysis,
however, involves the difficulty of having to define how large n can be so that
errors of the form n·ε, |ε| ≤ eps, can still be considered as being of the same order
of magnitude as the machine precision eps.
1.5 Interval Arithmetic; Statistical Roundoff
Estimation
The effect of a few roundoff errors can be quite readily estimated, to a
first-order approximation, by the methods of Section 1.3. For a typical
numerical method, however, the number of the arithmetic operations, and
consequently the number of individual roundoff errors, is very large, and
the corresponding algorithm is too complicated to permit the estimation
of the total effect of all roundoff errors in this fashion.
A technique known as interval arithmetic offers an approach to determining exact upper bounds for the absolute error of an algorithm, taking
into account all roundoff and data errors. Interval arithmetic is based on
the realization that the exact values for all real numbers a E IR which either enter an algorithm or are computed as intermediate or final results are
usually not known. At best one knows small intervals which contain a. For
this reason, the interval-arithmetic approach is to calculate systematically
in terms of such intervals
    ā = [a′, a″],

bounded by machine numbers a′, a″ ∈ A, rather than in terms of single
real numbers a. Each unknown number a is represented by an interval ā =
[a′, a″] with a ∈ ā. The arithmetic operations ⊛ ∈ {⊕, ⊖, ⊗, ⊘} between
intervals are defined so as to be compatible with the above interpretation.
That is, c̄ := ā ⊛ b̄ is defined as an interval (as small as possible) satisfying

    c̄ ⊇ { a ∗ b | a ∈ ā and b ∈ b̄ }

and having machine-number endpoints.
In the case of addition, for instance, this holds if ⊕ is defined as follows:

    [c′, c″] := [a′, a″] ⊕ [b′, b″],

where

    c′ := max{ γ′ ∈ A | γ′ ≤ a′ + b′ },
    c″ := min{ γ″ ∈ A | γ″ ≥ a″ + b″ },

with A denoting again the set of machine numbers. In the case of multiplication ⊗, assuming, say, a′ > 0, b′ > 0,

    [c′, c″] := [a′, a″] ⊗ [b′, b″]

can be defined by letting

    c′ := max{ γ′ ∈ A | γ′ ≤ a′ × b′ },
    c″ := min{ γ″ ∈ A | γ″ ≥ a″ × b″ }.
Replacing, in these and similar fashions, every quantity by an interval and
every arithmetic operation by its corresponding interval operation - this is
readily implemented on computers - we obtain interval algorithms which
produce intervals guaranteed to contain the desired exact solutions. The
data for these interval algorithms will be again intervals, chosen to allow
for data errors.
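To make the definitions concrete, here is a small Python sketch (an editorial illustration, not part of the text) of interval addition, subtraction, and multiplication. True directed rounding of the endpoints is approximated here by widening each endpoint by one unit in the last place via Python 3.9's math.nextafter; the helper names are ours.

    import math

    def _down(x: float) -> float:
        # next machine number <= x (crude stand-in for rounding the lower endpoint down)
        return math.nextafter(x, -math.inf)

    def _up(x: float) -> float:
        # next machine number >= x
        return math.nextafter(x, math.inf)

    def iadd(a, b):
        return (_down(a[0] + b[0]), _up(a[1] + b[1]))

    def isub(a, b):
        return (_down(a[0] - b[1]), _up(a[1] - b[0]))

    def imul(a, b):
        products = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
        return (_down(min(products)), _up(max(products)))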
It has been found, however, that an uncritical utilization of interval
arithmetic techniques leads to error bounds which, while certainly reliable,
are in most cases much too pessimistic. It is not enough to simply substitute
interval operations for arithmetic operations without taking into account
how the particular roundoff or data enter into the respective results. For
example, it happens quite frequently that a certain roundoff error ε impairs
some intermediate results u₁, ..., u_n of an algorithm considerably,

    | ∂u_i/∂ε | ≫ 1    for i = 1, ..., n,

while the final result y = f(u₁, ..., u_n) is not strongly affected,

    | ∂y/∂ε | ≈ 1,
even though it is calculated from the highly inaccurate intermediate values
Ul, ... , Un: the algorithm shows error damping.
EXAMPLE 1. Evaluate y = φ(x) = x³ − 3x² + 3x = ((x − 3) × x + 3) × x using
Horner's scheme:

    u := x − 3,
    v := u × x,
    w := v + 3,
    y := w × x.

The value x is known to lie in the interval

    x ∈ x̄ := [0.9, 1.1].

Starting with this interval and using straight interval arithmetic, we find

    ū = x̄ ⊖ [3, 3] = [−2.1, −1.9],
    v̄ = ū ⊗ x̄ = [−2.31, −1.71],
    w̄ = v̄ ⊕ [3, 3] = [0.69, 1.29],
    ȳ = w̄ ⊗ x̄ = [0.621, 1.419].

The interval ȳ is much too large compared to the interval

    { φ(x) | x ∈ x̄ } = [0.999, 1.001],

which describes the actual effect of an error in x on φ(x).
EXAMPLE 2. Using just ordinary 2-digit arithmetic gives considerably more accurate results than the interval arithmetic suggests:

                 u       v       w       y
    x = 0.9    −2.1    −1.9     1.1     0.99
    x = 1.1    −1.9    −2.1     0.9     0.99
For the successful application of interval arithmetic, therefore, it is not
sufficient merely to replace the arithmetic operations of commonly used
algorithms by interval operations: It is necessary to develop new algorithms
producing the same final results but having an improved error-dependence
pattern for the intermediate results.
EXAMPLE 3. In Example 1 a simple transformation of φ(x) suffices:

    y = φ(x) = 1 + (x − 1)³.

When applied to the corresponding evaluation algorithm and the same starting
interval x̄ = [0.9, 1.1], interval arithmetic now produces the optimal result:

    ū₁ := x̄ ⊖ [1, 1] = [−0.1, 0.1],
    ū₂ := ū₁ ⊗ ū₁ = [−0.01, 0.01],
    ū₃ := ū₂ ⊗ ū₁ = [−0.001, 0.001],
    ȳ := ū₃ ⊕ [1, 1] = [0.999, 1.001].
As far as ordinary arithmetic is concerned, there is not much difference between
the two evaluation algorithms of Example 1 and Example 3. Using two digits
again, the results are practically identical to those in Example 2:
                 u₁      u₂       u₃       y
    x = 0.9    −0.1    0.01    −0.001    1.0
    x = 1.1     0.1    0.01     0.001    1.0
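The following Python sketch (an editorial addition, with naive interval operations and no directed rounding) reproduces the two interval evaluations of Examples 1 and 3 and shows the drastic difference in the width of the resulting intervals.

    def iadd(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def isub(a, b):
        return (a[0] - b[1], a[1] - b[0])

    def imul(a, b):
        p = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
        return (min(p), max(p))

    x = (0.9, 1.1)

    # Horner form ((x - 3)*x + 3)*x  (Example 1)
    u = isub(x, (3.0, 3.0))
    v = imul(u, x)
    w = iadd(v, (3.0, 3.0))
    y_horner = imul(w, x)

    # transformed form 1 + (x - 1)^3  (Example 3)
    u1 = isub(x, (1.0, 1.0))
    u2 = imul(u1, u1)
    u3 = imul(u2, u1)
    y_transformed = iadd(u3, (1.0, 1.0))

    print(y_horner)       # roughly [0.621, 1.419]
    print(y_transformed)  # roughly [0.999, 1.001]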
For an in-depth treatment of interval arithmetic the reader should consult, for instance, Moore (1966) or Kulisch (1969).
In order to obtain statistical roundoff estimates [Rademacher (1948)]' we
assume that the relative roundoff error [see (1.2.6)] which is caused by
an elementary operation is a random variable with values in the interval
[−eps, eps]. Furthermore we assume that the roundoff errors ε attributable
to different operations are independent random variables. By μ_ε we denote
the expected value and by σ_ε² the variance of the above roundoff distribution. They satisfy the general relationship

    σ_ε² = E(ε²) − (E(ε))² = E(ε²) − μ_ε².

Assuming a uniform distribution in the interval [−eps, eps], we get

(1.5.1)    μ_ε = E(ε) = 0,
           σ_ε² = E(ε²) = (1/(2·eps)) ∫_{−eps}^{eps} t² dt = eps²/3 =: ε̄².
Closer examinations show the roundoff distribution to be not quite uniform
[see Sterbenz (1974), Exercise 22, p. 122]. It should also be kept in mind
that the ideal roundoff pattern is only an approximation to the roundoff
patterns observed in actual computing machinery, so that the quantities μ_ε
and σ_ε² may have to be determined empirically.
The results x of algorithms subjected to random roundoff errors become
random variables themselves with expected values μ_x and variances σ_x²,
connected again by the basic relation

    σ_x² = E(x²) − (E(x))² = E(x²) − μ_x².
The propagation of previous roundoff effects through elementary operations
is described by the following formulas for arbitrary independent random
variables x, y and constants α, β ∈ ℝ:

(1.5.2)    μ_{αx±βy} = E(αx ± βy) = αE(x) ± βE(y) = αμ_x ± βμ_y,
           σ²_{αx±βy} = E((αx ± βy)²) − (E(αx ± βy))²
                      = α²E(x − E(x))² + β²E(y − E(y))² = α²σ_x² + β²σ_y².

The first of the above formulas follows by the linearity of the expected-value
operator. It holds for arbitrary random variables x, y. The second formula
is based on the relation E(xy) = E(x)E(y), which holds whenever x and
y are independent. Similarly, we obtain for independent x and y

(1.5.3)    μ_{x×y} = E(x × y) = E(x)E(y) = μ_x·μ_y,
           σ²_{x×y} = E[ x × y − E(x)E(y) ]² = E(x²)E(y²) − μ_x²μ_y²
                    = σ_x²σ_y² + μ_x²σ_y² + μ_y²σ_x².
EXAMPLE. For calculating y = a² − b² (see Example 2 in Section 1.3) we find,
under the assumptions (1.5.1), E(a) = a, σ_a² = 0, E(b) = b, σ_b² = 0 and, using
(1.5.2) and (1.5.3), that

    η₁ = a²(1 + ε₁),            E(η₁) = a²,    σ²_{η₁} = a⁴·ε̄²,
    η₂ = b²(1 + ε₂),            E(η₂) = b²,    σ²_{η₂} = b⁴·ε̄²,
    y = (η₁ − η₂)(1 + ε₃),      E(y) = E(η₁ − η₂)·E(1 + ε₃) = a² − b²

(η₁, η₂, ε₃ are assumed to be independent),

    σ_y² = σ²_{η₁−η₂}·σ²_{1+ε₃} + μ²_{η₁−η₂}·σ²_{1+ε₃} + μ²_{1+ε₃}·σ²_{η₁−η₂}
         = (σ²_{η₁} + σ²_{η₂})·ε̄² + (a² − b²)²·ε̄² + 1·(σ²_{η₁} + σ²_{η₂})
         = (a⁴ + b⁴)·ε̄⁴ + [ (a² − b²)² + a⁴ + b⁴ ]·ε̄².

Neglecting ε̄⁴ compared to ε̄² yields

    σ_y² ≈ ( (a² − b²)² + a⁴ + b⁴ )·ε̄².

For a := 0.3237, b := 0.3134, eps = 5 × 10⁻⁴ (see Example 5 in Section 1.3), we
find

    σ_y ≈ 0.144·ε̄ = 0.000 0415,

which is close in magnitude to the true error Δy = 0.000 017 87 for 4-digit arithmetic. Compare this with the error bound 0.000 104 78 furnished by (1.3.17).
We denote by M(x) the set of all quantities which, directly or indirectly,
have entered the calculation of the quantity x. If M(x) ∩ M(y) ≠ ∅ for the
algorithm in question, then the random variables x and y are in general
dependent.
The statistical roundoff error analysis of an algorithm becomes extremely complicated if dependent random variables are present. It becomes
quite easy, however, under the following simplifying assumptions:
(1.5.4)
(a) The operands of each arithmetic operation are independent random
variables.
(b) In calculating variances all terms of an order higher than the smallest
one are neglected.
(c) All variances are so small that for elementary operations ∗, in first-order
approximation, E(x ∗ y) ≈ E(x) ∗ E(y) = μ_x ∗ μ_y.

If in addition the expected values μ_x are replaced by the estimated values
x, and relative variances ε_x² := σ_x²/μ_x² ≈ σ_x²/x² are introduced, then from
(1.5.2) and (1.5.3) [compare (1.2.6), (1.3.5)]:

(1.5.5)    z = fl(x ± y):    ε_z² ≈ ( x²ε_x² + y²ε_y² )/(x ± y)² + ε̄²,
           z = fl(x × y):    ε_z² ≈ ε_x² + ε_y² + ε̄²,
           z = fl(x / y):    ε_z² ≈ ε_x² + ε_y² + ε̄².
It should be kept in mind, however, that these results are valid only if the
hypotheses (1.5.4), in particular (1.5.4a), are met.
It is possible to evaluate the above formulas in the course of a numerical
computation and thereby to obtain an estimate of the error of the final
results. As in the case of interval arithmetic, this leads to an arithmetic of
paired quantities (x, ε_x²) for which elementary operations are defined with
the help of the above or similar formulas. Error bounds for the final results
r̃ are then obtained from the relative variance ε_r², assuming that the final
error distribution is normal. This assumption is justified inasmuch as the
distributions of propagated errors alone tend to become normal if subjected
to many elementary operations. At each such operation the nonnormal
roundoff error distribution is superimposed on the distribution of previous
errors. However, after many operations, the propagated errors are large
compared to the newly created roundoff errors, so that the latter do not
appreciably affect the normality of the total error distribution. Assuming
the final error distribution to be normal, the actual relative error of the
final result r̃ is bounded with probability 0.9 by 2ε_r.
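The pairing of a value with a relative variance is easy to mimic. The following Python sketch (an editorial illustration with our own helper names) propagates (value, relative variance) pairs through multiplication and subtraction according to (1.5.5) and prints an approximate 90% error bound 2ε_r for a² − b², using the data of the example above.

    import math

    EPS = 5.0e-4                 # 4-digit decimal arithmetic, as in the example above
    VAR_ROUND = EPS ** 2 / 3.0   # variance of a single roundoff error, see (1.5.1)

    def mul(a, b):
        # (1.5.5): relative variances add for multiplication, plus one new roundoff
        (x, ex2), (y, ey2) = a, b
        return (x * y, ex2 + ey2 + VAR_ROUND)

    def sub(a, b):
        # relative variance of x - y from the absolute variances, plus one roundoff
        (x, ex2), (y, ey2) = a, b
        z = x - y
        ez2 = (x * x * ex2 + y * y * ey2) / (z * z) + VAR_ROUND
        return (z, ez2)

    a = (0.3237, 0.0)            # input data assumed exact
    b = (0.3134, 0.0)
    y = sub(mul(a, a), mul(b, b))
    print(y[0], 2.0 * math.sqrt(y[1]) * abs(y[0]))   # value and ~90% error bound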
EXERCISES FOR CHAPTER 1
1.  Show that with floating-point arithmetic of t decimal places

        rd(a) = a/(1 + ε)    with |ε| ≤ 5·10⁻ᵗ

    holds in analogy to (1.2.2). [In parallel with (1.2.6), as a consequence,
    fl(a ∗ b) = (a ∗ b)/(1 + ε) with |ε| ≤ 5·10⁻ᵗ for all arithmetic operations
    ∗ = +, −, ×, /.]
2.  Let a, b, c be fixed-point numbers with N decimal places after the decimal
    point, and suppose 0 < a, b, c < 1. The substitute product a ∗ b is defined as
    follows: Add 10⁻ᴺ/2 to the exact product a·b, and delete the (N + 1)-st and
    subsequent digits.
    (a) Give a bound for |(a ∗ b) ∗ c − abc|.
    (b) By how many units of the N-th place can (a ∗ b) ∗ c and a ∗ (b ∗ c) differ?
3.
    Evaluating Σ_{i=1}^{n} a_i in floating-point arithmetic may lead to an arbitrarily
    large relative error. If, however, all summands a_i are of the same sign, then
    this relative error is bounded. Derive a crude bound for this error, disregarding terms of higher order.
4.  Show how to evaluate the following expressions in a numerically stable fashion:

        1/(1 + 2x) − (1 − x)/(1 + x)       for |x| ≪ 1,

        √(x + 1/x) − √(x − 1/x)            for x ≫ 1,

        (1 − cos x)/x                      for x ≠ 0, |x| ≪ 1.

5.
Suppose a computer program is available which yields values for arcsin y in
    floating-point representation with t decimal mantissa places and for |y| ≤ 1,
    subject to a relative error ε with |ε| ≤ 5 × 10⁻ᵗ. In view of the relation

        arctan x = arcsin( x / √(1 + x²) ),

this program could also be used to evaluate arctan x. Determine for which
values x this procedure is numerically stable by estimating the relative error.
6.  For given z, the function tan z/2 can be computed according to the formula

        tan(z/2) = ± ( (1 − cos z)/(1 + cos z) )^{1/2}.

    Is this method of evaluation numerically stable for z ≈ 0, z ≈ π/2? If
    necessary, give numerically stable alternatives.

7.  The function

        f(φ, k_c) := 1/√( cos²φ + k_c² sin²φ )
    is to be evaluated for 0 ≤ φ ≤ π/2, 0 < k_c ≤ 1.
    The method

        k² := 1 − k_c²,    f(φ, k_c) := 1/√(1 − k² sin²φ)

    avoids the calculation of cos φ and is faster. Compare this with the direct
    evaluation of the original expression for f(φ, k_c) with respect to numerical
    stability.
8.  For the linear function f(x) := a + bx, where a ≠ 0, b ≠ 0, compute the first
    derivative D_h f(0) = f′(0) = b by the formula

        D_h f(0) := ( f(h) − f(−h) ) / (2h)

    in binary floating-point arithmetic. Suppose that a and b are binary machine
    numbers, and h a power of 2. Multiplication by h and division by 2h can
    therefore be carried out exactly. Give a bound for the relative error of D_h f(0).
    What is the behavior of this bound as h → 0?
9.  The square root ±(u + iv) of a complex number x + iy with y ≠ 0 may be
    calculated from the formulas

        u = ± √( (x + √(x² + y²)) / 2 ),    v = y/(2u).

    Compare the cases x ≥ 0 and x < 0 with respect to their numerical stability. Modify the formulas if necessary to ensure numerical stability.
10. The variance s² of a set of observations x₁, ..., x_n is to be determined. Which
    of the formulas

        s² = (1/(n − 1)) ( Σ_{i=1}^{n} x_i² − n·x̄² ),

        s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²,    with  x̄ := (1/n) Σ_{i=1}^{n} x_i,

    is numerically more trustworthy?
11. The coefficients a_r, b_r (r = 0, ..., n) are, for fixed x, connected recursively:

    (*)    b_n := a_n;  for r = n − 1, n − 2, ..., 0:  b_r := x·b_{r+1} + a_r.

    (a) Show that the polynomials

        A(z) := Σ_{r=0}^{n} a_r z^r,    B(z) := Σ_{r=1}^{n} b_r z^{r−1}

    satisfy

        A(z) = (z − x)·B(z) + b₀.

    (b) Suppose A(x) = b₀ is to be calculated by the recursion (*) for fixed x in
    floating-point arithmetic, the result being b̂₀. Show, using the formulas
    (compare Exercise 1)

        fl(u + v) = (u + v)/(1 + σ),    |σ| ≤ eps,
        fl(u · v) = (u · v)/(1 + π),    |π| ≤ eps,

    the inequality

        |A(x) − b̂₀| ≤ ( eps/(1 − eps) )·( 2e₀ − |b̂₀| ),

    where e₀ is defined by the following recursion:

        for r = n − 1, n − 2, ..., 0:    e_r := |x|·e_{r+1} + |b̂_r|.

    Hint: From

        ...    (r = n − 1, ..., 0),

    derive

        ...    (r = n − 1, ..., 0).
12. Assuming the earth to be spherical, two points on its surface can be expressed
    in Cartesian coordinates

        P_i = [x_i, y_i, z_i] = [ r cos α_i cos β_i, r sin α_i cos β_i, r sin β_i ],    i = 1, 2,

    where r is the earth's radius and α_i, β_i are the longitudes and latitudes
    of the two points P_i, respectively. If

        cos σ = P₁ᵀP₂ / r² = cos(α₁ − α₂) cos β₁ cos β₂ + sin β₁ sin β₂,

    then rσ is the great-circle distance between the two points.
    (a) Show that using the arccos function to determine σ from the above
        expression is not numerically stable.
    (b) Derive a numerically stable expression for σ.
References for Chapter 1
Ashenhurst, R. L., Metropolis, N. (1959): Unnormalized floating-point arithmetic.
Assoc. Comput. Mach. 6, 415-428.
Bauer, F. L. (1974): Computational graphs and rounding error. SIAM J. Numer.
Anal. 11, 87-96.
- - - , Heinhold, J., Samelson, K., Sauer, R. (1965): Moderne Rechenanlagen.
Stuttgart: Teubner.
Henrici, P. (1963): Error propagation for difference methods. New York: Wiley.
Knuth, D. E. (1969): The Art of Computer Programming. Vol. 2. Seminumerical
Algorithms. Reading, Mass.: Addison-Wesley.
Kulisch, U. (1969): Grundzüge der Intervallrechnung. In: D. Laugwitz: Überblicke
Mathematik 2, 51-98. Mannheim: Bibliographisches Institut.
Moore, R. E. (1966): Interval analysis. Englewood Cliffs, N.J.: Prentice-Hall.
Neumann, J. von, Goldstine, H. H. (1947): Numerical inverting of matrices. Bull.
Amer. Math. Soc. 53, 1021-1099.
Rademacher, H. A. (1948): On the accumulation of errors in processes of integration on high-speed calculating machines. Proceedings of a symposium on
large-scale digital calculating machinery. Annals Comput. Labor. Harvard Univ.
16, 176-185.
Scarborough, J. B. (1950): Numerical Mathematical Analysis, 2nd edition. Baltimore: Johns Hopkins Press.
Sterbenz, P. H. (1974): Floating Point Computation. Englewood Cliffs, N.J.:
Prentice-Hall.
Wilkinson, J.H. (1960): Error analysis of floating-point computation. Numer.
Math. 2, 219-340.
- - - (1963): Rounding Errors in Algebraic Processes. New York: Wiley.
- - - (1965): The Algebraic Eigenvalue Problem. Oxford: Clarendon Press.
2 Interpolation
Consider a family of functions of a single variable x,

    Φ(x; a₀, ..., a_n),

having n + 1 parameters a₀, ..., a_n, whose values characterize the individual functions in this family. The interpolation problem for Φ consists
of determining these parameters a_i so that for n + 1 given real or complex
pairs of numbers (x_i, f_i), i = 0, ..., n, with x_i ≠ x_k for i ≠ k,

    Φ(x_i; a₀, ..., a_n) = f_i,    i = 0, ..., n,

holds. We will call the pairs (x_i, f_i) support points, the locations x_i support
abscissas, and the values f_i support ordinates. Occasionally, the values of
derivatives of Φ are also prescribed.
The above is a linear interpolation problem if Φ depends linearly on the
parameters a_i:

    Φ(x; a₀, ..., a_n) ≡ a₀Φ₀(x) + a₁Φ₁(x) + ··· + a_nΦ_n(x).

This class of problems includes the classical one of polynomial interpolation
[Section 2.1],

    Φ(x; a₀, ..., a_n) ≡ a₀ + a₁x + ··· + a_nxⁿ,

as well as trigonometric interpolation [Section 2.3],

    Φ(x; a₀, ..., a_n) ≡ a₀ + a₁e^{xi} + a₂e^{2xi} + ··· + a_ne^{nxi}    (i² = −1).
In the past, polynomial interpolation was frequently used to interpolate
function values gathered from tables. The availability of modern computing machinery has almost eliminated the need for extensive table lookups.
However, polynomial interpolation is also important as the basis of several
types of numerical integration formulas in general use. In a more modern development, polynomial and rational interpolation (see below) are employed
in the construction of "extrapolation methods" for integration, differential
equations, and related problems [see for instance Sections 3.3 and 3.4].
Trigonometric interpolation is used extensively for the numerical Fourier
analysis of time series and cyclic phenomena in general. In this context, the
so-called "fast Fourier transforms" are particularly important and successful [Section 2.3.2].
The class of linear interpolation problems also contains spline interpolation [Section 2.4]. In the special case of cubic splines, the functions tJ>
are assumed to be twice continuously differentiable for x E [xo, xn] and to
coincide with some cubic polynomial on every subinterval [Xi,.Ti+l] of a
given partition Xo < Xl < ... < X n .
Spline interpolation is a fairly new development of growing importance.
It provides a valuable tool for representing empirical curves and for approximating complicated mathematical functions. It is increasingly used when
dealing with ordinary or partial differential equations.
Two nonlinear interpolation schemes are of importance: rational interpolation,

    Φ(x; a₀, ..., a_μ, b₀, ..., b_ν) ≡ (a₀ + a₁x + ··· + a_μx^μ) / (b₀ + b₁x + ··· + b_νx^ν),

and exponential interpolation,

    Φ(x; a₀, ..., a_n, λ₀, ..., λ_n) ≡ a₀e^{λ₀x} + a₁e^{λ₁x} + ··· + a_ne^{λ_nx}.
Rational interpolation (Section 2.2) plays a role in the process of best
approximating a given function by one which is readily evaluated on a
digital computer. Exponential interpolation is used, for instance, in the
analysis of radioactive decay.
Interpolation is a basic tool for the approximation of given functions.
For a comprehensive discussion of these and related topics consult Davis
(1965).
2.1 Interpolation by Polynomials
2.1.1 Theoretical Foundation: The Interpolation Formula of
Lagrange
In what follows, we denote by Π_n the set of all real or complex polynomials
P whose degrees do not exceed n:

    P(x) = a₀ + a₁x + ··· + a_nxⁿ.
(2.1.1.1) Theorem. For n + 1 arbitrary support points

    (x_i, f_i),    i = 0, ..., n,    x_i ≠ x_k for i ≠ k,

there exists a unique polynomial P ∈ Π_n with

    P(x_i) = f_i,    i = 0, 1, ..., n.

PROOF.
Uniqueness: For any two polynomials P₁, P₂ ∈ Π_n with

    P₁(x_i) = P₂(x_i) = f_i,    i = 0, 1, ..., n,

the polynomial P := P₁ − P₂ ∈ Π_n has degree at most n, and it has at
least n + 1 different zeros, namely x_i, i = 0, ..., n. P must therefore vanish
identically, and P₁ = P₂.

Existence: We will construct the interpolating polynomial P explicitly
with the help of polynomials L_i ∈ Π_n, i = 0, ..., n, for which

(2.1.1.2)    L_i(x_k) = δ_ik := { 1 if i = k, 0 if i ≠ k }.

The following Lagrange polynomials satisfy the above conditions:

(2.1.1.3)    L_i(x) :≡ ( (x − x₀) ··· (x − x_{i−1})(x − x_{i+1}) ··· (x − x_n) ) /
                       ( (x_i − x₀) ··· (x_i − x_{i−1})(x_i − x_{i+1}) ··· (x_i − x_n) )
                     = ω(x) / ( (x − x_i)·ω′(x_i) ),    with ω(x) := Π_{i=0}^{n} (x − x_i).
Note that our proof so far shows that the Lagrange polynomials are
uniquely determined by (2.1.1.2).
The solution P of the interpolation problem can now be expressed directly in terms of the polynomials L i , leading to the Lagrange interpolation
formula:

(2.1.1.4)    P(x) ≡ Σ_{i=0}^{n} f_i L_i(x) ≡ Σ_{i=0}^{n} f_i Π_{k=0, k≠i}^{n} (x − x_k)/(x_i − x_k).    □
The above interpolation formula shows that the coefficients of P depend linearly on the support ordinates fi. While theoretically important,
Lagrange's formula is, in general, not as suitable for actual calculations
as some other methods to be described below, particularly for large numbers n of support points. Lagrange's formula may, however, be useful in
some situations in which many interpolation problems are to be solved for
the same support abscissae Xi, i = 0, ... ,n, but different sets of support
ordinates 1;, i = 0, ... , n.
EXAMPLE. Given for n = 2:

    x_i | 0 | 1 | 3
    f_i | 1 | 3 | 2

Wanted: P(2), where P ∈ Π₂, P(x_i) = f_i for i = 0, 1, 2.
Solution:

    L₀(x) ≡ (x − 1)(x − 3) / ((0 − 1)(0 − 3)),
    L₁(x) = (x − 0)(x − 3) / ((1 − 0)(1 − 3)),
    L₂(x) = (x − 0)(x − 1) / ((3 − 0)(3 − 1)),

    P(2) = 1·L₀(2) + 3·L₁(2) + 2·L₂(2) = 1·(−1/3) + 3·1 + 2·(1/3) = 10/3.
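The Lagrange formula (2.1.1.4) translates directly into code. The following Python sketch (an editorial illustration, not part of the text) evaluates the interpolating polynomial at a point and reproduces P(2) = 10/3 for the support points above.

    def lagrange_eval(xs, fs, x):
        """Evaluate the interpolating polynomial through (xs[i], fs[i]) at x,
        using the Lagrange formula (2.1.1.4)."""
        n = len(xs)
        p = 0.0
        for i in range(n):
            li = 1.0
            for k in range(n):
                if k != i:
                    li *= (x - xs[k]) / (xs[i] - xs[k])
            p += fs[i] * li
        return p

    print(lagrange_eval([0.0, 1.0, 3.0], [1.0, 3.0, 2.0], 2.0))   # 10/3 = 3.333...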
2.1.2 Neville's Algorithm
Instead of solving the interpolation problem all at once, one might consider
solving the problem for smaller sets of support points first and then updating these solutions to obtain the solution to the full interpolation problem.
This idea will be explored in the following two sections.
For a given set of support points (x_i, f_i), i = 0, 1, ..., n, we denote by

    P_{i₀i₁...i_k} ∈ Π_k

that polynomial in Π_k for which

    P_{i₀i₁...i_k}(x_{i_j}) = f_{i_j},    j = 0, 1, ..., k.

These polynomials are linked by the following recursion:

(2.1.2.1a)    P_i(x) ≡ f_i,

(2.1.2.1b)    P_{i₀i₁...i_k}(x) ≡ ( (x − x_{i₀})·P_{i₁i₂...i_k}(x) − (x − x_{i_k})·P_{i₀i₁...i_{k−1}}(x) ) / (x_{i_k} − x_{i₀}).
PROOF. (2.1.2.1a) is trivial. To prove (2.1.2.1b), we denote its right-hand
side by R(x), and go on to show that R has the characteristic properties of
P_{i₀i₁...i_k}. The degree of R is clearly not greater than k. By the definitions
of P_{i₁...i_k} and P_{i₀...i_{k−1}},

    R(x_{i₀}) = P_{i₀...i_{k−1}}(x_{i₀}) = f_{i₀},
    R(x_{i_k}) = P_{i₁...i_k}(x_{i_k}) = f_{i_k},

and

    R(x_{i_j}) = f_{i_j}

for j = 1, 2, ..., k − 1. Thus R = P_{i₀i₁...i_k}, in view of the uniqueness of
polynomial interpolation [Theorem (2.1.1.1)].    □
Neville's algorithm aims at determining the value of the interpolating
polynomial P for a single value of x. It is less suited for determining the
interpolating polynomial itself. Algorithms that are more efficient for the
latter task, and also more efficient if values of P are sought for several
arguments x simultaneously, will be described in Section 2.1.3.
Based on the recursion (2.1.2.1), Neville's algorithm constructs a symmetric tableau of the values of some of the partially interpolating polynomials P_{i₀i₁...i_k} for fixed x:

(2.1.2.2)
                  k = 0            1            2              3
    x₀    f₀ = P₀(x)
                            P₀₁(x)
    x₁    f₁ = P₁(x)                      P₀₁₂(x)
                            P₁₂(x)                      P₀₁₂₃(x)
    x₂    f₂ = P₂(x)                      P₁₂₃(x)
                            P₂₃(x)
    x₃    f₃ = P₃(x)
The first column of the tableau contains the prescribed support ordinates
f_i. Subsequent columns are filled by calculating each entry recursively from
its two "neighbors" in the previous column according to (2.1.2.1b). The
entry P₁₂₃(x), for instance, is given by

    P₁₂₃(x) = ( (x − x₁)·P₂₃(x) − (x − x₃)·P₁₂(x) ) / (x₃ − x₁).
EXAMPLE. Determine P₀₁₂(2) for the same support points as in Section 2.1.1.

                    k = 0               1               2
    x₀ = 0    f₀ = P₀(2) = 1
                                  P₀₁(2) = 5
    x₁ = 1    f₁ = P₁(2) = 3                     P₀₁₂(2) = 10/3
                                  P₁₂(2) = 5/2
    x₂ = 3    f₂ = P₂(2) = 2

    P₀₁(2) = ( (2 − 0)·3 − (2 − 1)·1 ) / (1 − 0) = 5,
    P₁₂(2) = ( (2 − 1)·2 − (2 − 3)·3 ) / (3 − 1) = 5/2,
    P₀₁₂(2) = ( (2 − 0)·5/2 − (2 − 3)·5 ) / (3 − 0) = 10/3.
We will now discuss slight variants of Neville's algorithm, employing a
frequently used abbreviation,

(2.1.2.3)    T_{i,k} := P_{i−k, i−k+1, ..., i}.
The tableau (2.1.2.2) becomes

(2.1.2.4)
    x₀    f₀ = T₀₀
                        T₁₁
    x₁    f₁ = T₁₀                T₂₂
                        T₂₁                 T₃₃
    x₂    f₂ = T₂₀                T₃₂
                        T₃₁
    x₃    f₃ = T₃₀

The arrows indicate how the additional upward diagonal T_{i,0}, T_{i,1}, ..., T_{i,i}
can be constructed if one more support point x_i, f_i is added.
The recursion (2.1.2.1) may be modified for more efficient evaluation:

(2.1.2.5a)    T_{i,0} := f_i,    i ≥ 0,

(2.1.2.5b)    T_{i,k} := ( (x − x_{i−k})·T_{i,k−1} − (x − x_i)·T_{i−1,k−1} ) / (x_i − x_{i−k})
                       = T_{i,k−1} + ( T_{i,k−1} − T_{i−1,k−1} ) / ( (x − x_{i−k})/(x − x_i) − 1 ),    1 ≤ k ≤ i.
The following ALGOL algorithm is based on this modified recursion:
for i := 0 step 1 until n do
begin t[i] := f[i];
for j := i - 1 step -1 until 0 do
t[j] := t[j + 1] + (t[j + 1] - t[j]) x (x - x[i])/(x[i]- x[j])
end;
After the inner loop has terminated, t[j] = T_{i,i−j}, 0 ≤ j ≤ i. The
desired value T_{nn} = P₀₁...n(x) of the interpolating polynomial can be
found in t[0].
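For comparison, here is an equivalent Python sketch of the same loop (an editorial addition, not part of the text):

    def neville(xs, fs, x):
        """Return P_{0,1,...,n}(x) by Neville's algorithm, overwriting a single
        array t exactly as in the ALGOL fragment above."""
        t = []
        for i in range(len(xs)):
            t.append(fs[i])
            for j in range(i - 1, -1, -1):
                t[j] = t[j + 1] + (t[j + 1] - t[j]) * (x - xs[i]) / (xs[i] - xs[j])
        return t[0]

    print(neville([0.0, 1.0, 3.0], [1.0, 3.0, 2.0], 2.0))   # 10/3, as in the example above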
Still another modification of Neville's algorithm serves to improve somewhat the accuracy of the interpolated polynomial value. For i = 0, 1, ... ,n,
let the quantities Qik, Dik be defined by
    Q_{i0} := D_{i0} := f_i,
    Q_{ik} := T_{ik} − T_{i,k−1},
    D_{ik} := T_{ik} − T_{i−1,k−1},    1 ≤ k ≤ i.
The recursion (2.1.2.5) then translates into

(2.1.2.6)    Q_{ik} := ( D_{i,k−1} − Q_{i−1,k−1} )·(x_i − x)/(x_{i−k} − x_i),
             D_{ik} := ( D_{i,k−1} − Q_{i−1,k−1} )·(x_{i−k} − x)/(x_{i−k} − x_i),
             1 ≤ k ≤ i,  i = 0, 1, ... .

Starting with Q_{i0} := D_{i0} := f_i, one calculates Q_{ik}, D_{ik} from the above
recursion. Finally

    T_{nn} := f_n + Σ_{k=1}^{n} Q_{nk}.
If the values fo, ... ,fn are close to each other, the quantities Qik will be
small compared to fi. This suggests forming the sum of the "corrections"
Qnl,"" Qnn first [contrary to (2.1.2.5)] and then adding it to fn, thereby
avoiding unnecessary roundoff errors.
Note finally that for x = 0 the recursion (2.1.2.5) takes a particularly
simple form,

(2.1.2.7a)    T_{i0} := f_i,

(2.1.2.7b)    T_{ik} := T_{i,k−1} + ( T_{i,k−1} − T_{i−1,k−1} ) / ( x_{i−k}/x_i − 1 ),

as does its analog (2.1.2.6). These forms are encountered when applying
extrapolation methods.
For historical reasons mainly, we mention Aitken's algorithm. It is also
based on (2.1.2.1), but uses different intermediate polynomials. Its tableau
is of the form

    x₀    f₀ = P₀(x)
    x₁    f₁ = P₁(x)    P₀₁(x)
    x₂    f₂ = P₂(x)    P₀₂(x)    P₀₁₂(x)
    x₃    f₃ = P₃(x)    P₀₃(x)    P₀₁₃(x)    P₀₁₂₃(x)

The first column again contains the prescribed values f_i. Each subsequent
entry derives from the previous entry in the same row and the top entry in
the previous column according to (2.1.2.1b).
2.1.3 Newton's Interpolation Formula: Divided Differences
Neville's algorithm is geared towards determining interpolating values
rather than polynomials. If the interpolating polynomial itself is needed,
or if one wants to find interpolating values for several arguments ξ_j simultaneously, then Newton's interpolation formula is to be preferred. Here
we write the interpolating polynomial P ∈ Π_n, P(x_i) = f_i, i = 0, 1, ..., n,
in the form

(2.1.3.1)    P(x) ≡ P₀₁...n(x)
                  = a₀ + a₁(x − x₀) + a₂(x − x₀)(x − x₁) + ···
                    + a_n(x − x₀) ··· (x − x_{n−1}).
Note that the evaluation of (2.1.3.1) for x = ξ may be done recursively
as indicated by the following expression:

    P(ξ) = ( ··· ( a_n(ξ − x_{n−1}) + a_{n−1} )(ξ − x_{n−2}) + ··· + a₁ )(ξ − x₀) + a₀.

This requires fewer operations than evaluating (2.1.3.1) term by term. It
corresponds to the so-called Horner scheme for evaluating polynomials
which are given in the usual form, i.e. in terms of powers of x, and it
shows that the representation (2.1.3.1) is well suited for evaluation.
It remains to determine the coefficients a_i in (2.1.3.1). In principle, they
can be calculated successively from

    f₀ = P(x₀) = a₀,
    f₁ = P(x₁) = a₀ + a₁(x₁ − x₀),
    f₂ = P(x₂) = a₀ + a₁(x₂ − x₀) + a₂(x₂ − x₀)(x₂ − x₁),
    ⋮

This can be done with n divisions and n(n − 1) multiplications. There is,
however, a better way, which requires only n(n + 1)/2 divisions and which
produces useful intermediate results.
Observe that the two polynomials P_{i₀i₁...i_{k−1}}(x) and P_{i₀i₁...i_k}(x) differ
by a polynomial of degree k with k zeros x_{i₀}, x_{i₁}, ..., x_{i_{k−1}}, since both
polynomials interpolate the corresponding support points. Therefore there
exists a unique coefficient

(2.1.3.2)    f_{i₀i₁...i_k}

such that

(2.1.3.3)    P_{i₀i₁...i_k}(x) = P_{i₀i₁...i_{k−1}}(x) + f_{i₀i₁...i_k}(x − x_{i₀})(x − x_{i₁}) ··· (x − x_{i_{k−1}}).

From this and from the identity P_{i₀} ≡ f_{i₀}, it follows immediately that

(2.1.3.4)    P_{i₀i₁...i_k}(x) = f_{i₀} + f_{i₀i₁}(x − x_{i₀}) + ···
                               + f_{i₀i₁...i_k}(x − x_{i₀})(x − x_{i₁}) ··· (x − x_{i_{k−1}})
is a Newton representation of the interpolating polynomial P_{i₀,...,i_k}(x). The
coefficients (2.1.3.2) are called k-th divided differences.
The recursion (2.1.2.1) for the partially interpolating polynomials
translates into the recursion

(2.1.3.5)    f_{i₀i₁...i_k} = ( f_{i₁...i_k} − f_{i₀...i_{k−1}} ) / (x_{i_k} − x_{i₀})

for the divided differences, since by (2.1.3.3), f_{i₁...i_k} and f_{i₀...i_{k−1}} are the
coefficients of the highest terms of the polynomials P_{i₁i₂...i_k} and P_{i₀i₁...i_{k−1}},
respectively. The above recursion starts for k = 0 with the given support
ordinates f_i, i = 0, ..., n. It can be used in various ways for calculating divided differences f_{i₀}, f_{i₀i₁}, ..., f_{i₀i₁...i_n}, which then characterize the desired
interpolating polynomial P = P_{i₀i₁...i_n}.
Because the polynomial P_{i₀i₁...i_k} is uniquely determined by the support
points it interpolates [Theorem (2.1.1.1)], the polynomial is invariant to any
permutation of the indices i₀, i₁, ..., i_k, and so is its coefficient f_{i₀i₁...i_k} of
x^k. Thus:

(2.1.3.6). The divided differences f_{i₀i₁...i_k} are invariant to permutations of
the indices i₀, i₁, ..., i_k: If

    (j₀, j₁, ..., j_k)

is a permutation of the indices i₀, i₁, ..., i_k, then

    f_{j₀j₁...j_k} = f_{i₀i₁...i_k}.
If we choose to calculate the divided differences in analogy to Neville's
method - instead of, say, Aitken's method - then we are led to the
following tableau, called the divided-difference scheme:
(2.1.3.7)
                k = 0       1         2
    x₀    f₀
                          f₀₁
    x₁    f₁                        f₀₁₂
                          f₁₂
    x₂    f₂
The entries in the second column are of the form

    f₀₁ = (f₁ − f₀)/(x₁ − x₀),    ...,

those in the third column,

    f₀₁₂ = (f₁₂ − f₀₁)/(x₂ − x₀),    ... .
Clearly,

    P(x) ≡ P₀₁...n(x)
         ≡ f₀ + f₀₁(x − x₀) + ··· + f₀₁...n(x − x₀)(x − x₁) ··· (x − x_{n−1})

is the desired solution to the interpolation problem at hand. The coefficients
of the above expansion are found in the top descending diagonal of the
divided-difference scheme (2.1.3.7).

EXAMPLE. With the numbers of the example in Sections 2.1.1 and 2.1.2, we have:

    x₀ = 0    f₀ = 1
                          f₀₁ = 2
    x₁ = 1    f₁ = 3                   f₀₁₂ = −5/6
                          f₁₂ = −1/2
    x₂ = 3    f₂ = 2

    P₀₁₂(x) = 1 + 2·(x − 0) − (5/6)(x − 0)(x − 1),
    P₀₁₂(2) = ( −(5/6)·(2 − 1) + 2 )(2 − 0) + 1 = 10/3.
Instead of building the divided-difference scheme column by column,
one might want to start with the upper left corner and add successive
ascending diagonal rows. This amounts to adding new support points one
at a time after having interpolated the previous ones. In the following
ALGOL procedure, the entries in an ascending diagonal of (2.1.3.7) are
found, after each increase of i, in the top portion of array t, and the first i
coefficients fOl ... i are found in array a.
for i := 0 step 1 until n do
begin t[i] := f[i];
for j := i - 1 step -1 until 0 do
t[j] := (t[j + 1] - t[j])/(x[i] - x[j]);
a[i] := t[0]
end;
Afterwards, the interpolating polynomial (2.1.3.1) may be evaluated for
any desired argument z:

p := a[n];
for i := n - 1 step -1 until 0 do
p := p x (z - x[i]) + a[i];
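The same two loops in Python (an editorial sketch, not part of the text): the first computes the Newton coefficients a_i = f_{01...i}, the second evaluates (2.1.3.1) by the Horner-like recursion.

    def newton_coefficients(xs, fs):
        """Return the divided differences f_0, f_01, ..., f_01...n."""
        a, t = [], []
        for i in range(len(xs)):
            t.append(fs[i])
            for j in range(i - 1, -1, -1):
                t[j] = (t[j + 1] - t[j]) / (xs[i] - xs[j])
            a.append(t[0])
        return a

    def newton_eval(xs, a, z):
        """Evaluate the Newton form (2.1.3.1) at z."""
        p = a[-1]
        for i in range(len(a) - 2, -1, -1):
            p = p * (z - xs[i]) + a[i]
        return p

    xs, fs = [0.0, 1.0, 3.0], [1.0, 3.0, 2.0]
    a = newton_coefficients(xs, fs)
    print(a)                        # [1.0, 2.0, -0.8333...]
    print(newton_eval(xs, a, 2.0))  # 10/3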
Some Newton representations of the same polynomial are numerically
more trustworthy to evaluate than others. Choosing the permutation so that

    |ξ − x_{i₀}| ≤ |ξ − x_{i₁}| ≤ ··· ≤ |ξ − x_{i_n}|

dampens the error (see Section 1.3) during the Horner evaluation of

(2.1.3.8)    P(ξ) ≡ P_{i₀...i_n}(ξ) ≡ f_{i₀} + f_{i₀i₁}(ξ − x_{i₀}) + ··· + f_{i₀i₁...i_n}(ξ − x_{i₀}) ··· (ξ − x_{i_{n−1}}).
All Newton representations of the above kind can be found in the single divided-difference scheme which arises if the support arguments x_i,
i = 0, ..., n, are ordered by size: x_i < x_{i+1} for i = 0, ..., n − 1. Then the
preferred sequence of indices i₀, i₁, ..., i_k is such that each index i_k is
"adjacent" to some previous index. More precisely, either i_k = min{ i_l |
0 ≤ l < k } − 1 or i_k = max{ i_l | 0 ≤ l < k } + 1. Therefore the coefficients
of (2.1.3.8) are found along a zigzag path - instead of the upper descending diagonal - of the divided-difference scheme. Starting with f_{i₀}, the path
proceeds to the upper right neighbor if i_k < i_{k−1}, or to the lower right
neighbor if i_k > i_{k−1}.

EXAMPLE. In the previous example, a preferred sequence for ξ = 2 is

    i₀ = 1,  i₁ = 2,  i₂ = 0.
The corresponding path in the divided-difference scheme is indicated below:

    x₀ = 0    f₀ = 1
                          f₀₁ = 2
    x₁ = 1    f₁ = 3                   f₀₁₂ = −5/6
                          f₁₂ = −1/2
    x₂ = 3    f₂ = 2

The desired Newton representation is:

    P₁₂₀(x) = 3 − (1/2)(x − 1) − (5/6)(x − 1)(x − 3),
    P₁₂₀(2) = ( −(5/6)(2 − 3) − 1/2 )(2 − 1) + 3 = 10/3.
Frequently, the support ordinates f_i are values f(x_i) = f_i of a given
function f(x), which one wants to approximate by interpolation. In this
case, the divided differences may be considered as multivariate functions
of the support arguments x_i, and are historically written as

    f[x_{i₀}, x_{i₁}, ..., x_{i_k}] := f_{i₀i₁...i_k}.

These functions satisfy (2.1.3.5). For instance,

    f[x₀] = f(x₀),

    f[x₀, x₁] = ( f[x₁] − f[x₀] )/(x₁ − x₀) = ( f(x₁) − f(x₀) )/(x₁ − x₀),

    f[x₀, x₁, x₂] = ( f[x₁, x₂] − f[x₀, x₁] )/(x₂ − x₀)
                  = ( f(x₀)(x₁ − x₂) + f(x₁)(x₂ − x₀) + f(x₂)(x₀ − x₁) ) / ( (x₁ − x₀)(x₂ − x₁)(x₀ − x₂) ).
Also, (2.1.3.6) gives immediately:

(2.1.3.9) Theorem. The divided differences f[x_{i₀}, ..., x_{i_k}] are symmetric
functions of their arguments, i.e., they are invariant to permutations of the
arguments.

If the function f(x) is itself a polynomial, then we have the

(2.1.3.10) Theorem. If f(x) is a polynomial of degree N, then

    f[x₀, ..., x_k] = 0    for k > N.

PROOF. Because of the unique solvability of the interpolation problem
[Theorem (2.1.1.1)], P₀₁...k(x) ≡ f(x) for k ≥ N. The coefficient of x^k
in P₀₁...k(x) must therefore vanish for k > N. This coefficient, however, is
given by f[x₀, ..., x_k] according to (2.1.3.3).    □
EXAMPLE. f(x) = x²:

    x_i    k = 0     1     2     3     4
    0        0
                     1
    1        1             1
                     3             0
    2        4             1              0
                     5             0
    3        9             1
                     7
    4       16
If the function f(x) is sufficiently often differentiable, then its divided
differences f[x₀, ..., x_k] can also be defined if some of the arguments x_i
coincide. For instance, if f(x) has a derivative at x₀, then it makes sense
for certain purposes to define

    f[x₀, x₀] := f′(x₀).
For a corresponding modification of the divided-difference scheme (2.l.3.7)
see Section 2.1.5 on Hermite interpolation.
2.1.4 The Error in Polynomial Interpolation
Once again we consider a given function f(x) and certain of its values

    f_i = f(x_i),    i = 0, 1, ..., n,

which are to be interpolated. We wish to ask how well the interpolating
polynomial P(x) ≡ P₀...n(x) ∈ Π_n with

    P(x_i) = f_i,    i = 0, 1, ..., n,
reproduces f(x) for arguments different from the support arguments x_i.
The error

    f(x̄) − P(x̄),

where x̄ ≠ x_i, i = 0, 1, ..., n, can clearly become arbitrarily large for suitable functions f unless some restrictions are imposed on f. Under certain
conditions, however, it is possible to bound the error. We have, for instance:

(2.1.4.1) Theorem. If the function f has an (n + 1)st derivative, then
for every argument x̄ there exists a number ξ in the smallest interval
I[x₀, ..., x_n, x̄] which contains x̄ and all support abscissas x_i, satisfying

    f(x̄) − P₀₁...n(x̄) = ω(x̄)·f^(n+1)(ξ) / (n + 1)!,

where

    ω(x) ≡ (x − x₀)(x − x₁) ··· (x − x_n).
PROOF. Let P(x) ≡ P₀₁...n(x) be the polynomial which interpolates the
function at x_i, i = 0, 1, ..., n, and suppose x̄ ≠ x_i (for x̄ = x_i there is
nothing to show). We can find a constant K such that the function

    F(x) ≡ f(x) − P(x) − K·ω(x)

vanishes for x = x̄:

    F(x̄) = 0.

Consequently, F(x) has at least the n + 2 zeros

    x₀, x₁, ..., x_n, x̄

in the interval I[x₀, ..., x_n, x̄]. By Rolle's theorem, applied repeatedly,
F′(x) has at least n + 1 zeros in the above interval, F″(x) at least n zeros,
and finally F^(n+1)(x) at least one zero ξ ∈ I[x₀, ..., x_n, x̄].
Since P^(n+1)(x) ≡ 0,

    F^(n+1)(ξ) = f^(n+1)(ξ) − K·(n + 1)! = 0,

or

    K = f^(n+1)(ξ) / (n + 1)!.

This proves the proposition

    f(x̄) − P(x̄) = K·ω(x̄) = ω(x̄)·f^(n+1)(ξ) / (n + 1)!.    □
A different error term can be derived from Newton's interpolation formula (2.1.3.4):

    P(x) ≡ P₀₁...n(x) ≡ f[x₀] + f[x₀, x₁](x − x₀) + ···
                        + f[x₀, x₁, ..., x_n](x − x₀) ··· (x − x_{n−1}).

Here f[x₀, x₁, ..., x_k] are the divided differences of the given function f. If
in addition to the n + 1 support points

    (x_i, f_i),    i = 0, ..., n,

we introduce an (n + 2)nd support point

    (x_{n+1}, f_{n+1})    with x_{n+1} := x̄,  f_{n+1} := f(x̄),    where x̄ ≠ x_i,  i = 0, ..., n,

then by Newton's formula

    f(x̄) = P₀...n+1(x̄) = P₀...n(x̄) + f[x₀, ..., x_n, x̄]·ω(x̄),

or

(2.1.4.2)    f(x̄) − P₀...n(x̄) = ω(x̄)·f[x₀, ..., x_n, x̄].

The difference on the left-hand side appears in Theorem (2.1.4.1), and since
ω(x̄) ≠ 0, we must have

    f[x₀, ..., x_n, x̄] = f^(n+1)(ξ) / (n + 1)!    for some ξ ∈ I[x₀, ..., x_n, x̄].

This also yields

(2.1.4.3)    f[x₀, ..., x_n] = f^(n)(ξ) / n!    for some ξ ∈ I[x₀, ..., x_n],
which relates derivatives and divided differences.
EXAMPLE. f(x) = sin x,  x_i = … · i,  i = 0, 1, 2, 3, 4, 5,  n = 5:

    sin x − P(x) = (x − x₀)(x − x₁) ··· (x − x₅) · ( −sin ξ )/720,    ξ = ξ(x),

    |sin x − P(x)| ≤ |(x − x₀)(x − x₁) ··· (x − x₅)| / 720 = |ω(x)|/720.
We end this section with two brief warnings, one against trusting the
interpolating polynomial outside of I[xo, ... ,X n ], and one against expecting
too much of polynomial interpolation inside I[xo, . .. , x n ].
In the exterior of the interval I[xo, ... , x n ], the value of Iw(x) I in Theorem (2.1.4.1) grows very fast. The use of the interpolation polynomial P
for approximating f at some location outside the interval I[x₀, ..., x_n] - called extrapolation - should be avoided if possible.
Within I[xo, ... ,x n ] on the other hand, it should not be assumed that
finer and finer samplings of the function f will lead to better and better
approximations through interpolation.
Consider a real function f which is infinitely often differentiable in a
given interval [a, b]. To every interval partition Δ = {a = x₀ < x₁ <
··· < x_n = b} there exists an interpolating polynomial P_Δ ∈ Π_n with
P_Δ(x_i) = f_i for x_i ∈ Δ. A sequence of interval partitions

    Δ_m = { a = x₀^(m) < x₁^(m) < ··· < x_{n_m}^(m) = b }

gives rise to a sequence of interpolating polynomials P_{Δ_m}. One might expect
the polynomials P_{Δ_m} to converge toward f if the fineness

    ||Δ_m|| := max_i |x_{i+1}^(m) − x_i^(m)|

of the partitions tends to 0 as m → ∞. In general this is not true. For
example, it has been shown for the functions

    f(x) := 1/(1 + x²),    [a, b] = [−5, 5],

and

    f(x) := √x,    [a, b] = [0, 1],

that the polynomials P_{Δ_m} do not converge pointwise to f for arbitrarily
fine uniform partitions Δ_m, x_i^(m) = a + i(b − a)/m, i = 0, ..., m.
2.1.5 Hermite-Interpolation
Consider the real numbers Xi, fi(k) , i = 0,1, ... , m, k = 0,1, ... , ni - 1,
with
Xo < Xl < ... < Xm ·
The Hermite interpolation problem for these data consists of determining
a polynomial P whose degree does not exceed n, P ∈ Π_n, where

    n + 1 := Σ_{i=0}^{m} n_i,
and which satisfies the following interpolation conditions:
(2.1.5.1)    P^(k)(x_i) = f_i^(k),    i = 0, 1, ..., m,   k = 0, 1, ..., n_i − 1.
This problem differs from the usual interpolation problem for polynomials
in that it prescribes at each support abscissa x_i not only the value but
also the first n_i − 1 derivatives of the desired polynomial. The polynomial
interpolation of Section 2.1.1 is the special case ni = 1, i = 0, 1, ... , m.
There are exactly Σ n_i = n + 1 conditions (2.1.5.1) for the n + 1
coefficients of the interpolating polynomial, leading us to expect that the
Hermite interpolation problem can be solved uniquely:
(2.1.5.2) Theorem. For arbitrary numbers x₀ < x₁ < ··· < x_m and f_i^(k),
i = 0, 1, ..., m, k = 0, 1, ..., n_i − 1, there exists precisely one polynomial

    P ∈ Π_n,    n + 1 := Σ_{i=0}^{m} n_i,
which satisfies (2.1.5.1).
PROOF.
We first show uniqueness. Consider the difference polynomial
Q(x) := Pi (X) - P2(x) of two polynomials Pi, P 2 E IIn for which (2.1.5.1)
holds. Since
Xi is at least an n;-fold root of Q, so that Q has altogether L ni = n+ 1 roots
each counted according to its multiplicity. Thus Q must vanish identically,
since its degree is less than n + 1.
Existence is a consequence of uniqueness: For (2.1.5.1) is a system of
linear equations for n unknown coefficients Cj of P(x) = CO+C1X+' . ·+cnx n .
The matrix of this system is not singular, because of the uniqueness of
its solutions. Hence the linear system (2.1.5.1) has a unique solution for
arbitrary right-hand sides f;(k).
D
Hermite interpolating polynomials can be given explicitly in a form
analogous to the interpolation formula of Lagrange (2.1.1.4). The polynomial P E IIn given by
(2.1.5.3)
P(x)
m
n~-l
i=O
k=O
2: 2: fY) L;k(x).
satisfies (2.1.5.1). The polynomials L;k E IIn are generalized Lagrange polynomials.They are defined as follows: Starting with the auxiliary polynomials
0<:::: i <:::: m,
0<:::: k <:::: n;,
[compare (2.1.1.3)]' put
Li,n,-I(X)
:=
li,ni-dx),
i=O,l, ... ,m .
and recursively for k = ni - 2, ni - 3, ... , 0,
ni- 1
Lik(X) := lidx) -
L l;~)(Xi)Liv(X).
v=k+l
By induction
if i = j and k = a,
otherwise.
Thus P in (2.1.5.3) is indeed the desired Hermite interpolating polynomial.
An alternative way to describe Hermite interpolation is important for
Newton- and Neville-type algorithms to construct the Hermite interpolating polynomial, and, in particular, for developing the theory of Bsplines [Section 2.4.4]. The approach is to generalize divided differences
f[xo, xl,·· ., Xk] [see Section 2.1.3] so as to accommodate repeated abscissae. To this end, we expand the sequence of abscissae Xo < Xl < ... < Xm
occuring in (2.1.5.1) by replacing each Xi by ni copies of itself:
Xo = ... = Xo < Xl = ... = Xl < ... < Xm = ... = Xm .
'---v-----'
no
'---v-----'
'-v-----"
The n + 1 = I:z:,o ni elements in this sequence are then renamed
and will be referred to as virtual abscissae.
We now wish to reformulate the Hermite interpolation problem (2.1.5.1)
in terms of the virtual abscissae, without reference to the numbers Xi and
ni, i = 0, ... , m. Clearly, this is possible since the virtual abscissae determine the "true" abscissae Xi and the integers ni, i = 0, 1, ... , m. In order
to stress the dependence of the Hermite interpolant P(.) on to, tl, ... , tn
we write POl .. n (.) for P(.). Also it will be convenient to write f(r)(xi) instead of fi(r) in (2.1.5.1). The interpolant POL.n is uniquely defined by the
n + 1 = I:z:,o ni interpolation conditions (2.1.5.1), which are as many as
there are index pairs (i, k) with i = 0, 1, ... , m, k = 0, 1, ... , ni - 1, and
are as many as there are virtual abscissae. An essential observation is that
the interpolation conditions in (2.1. 5.1) belonging to the following linear
ordering of the index pairs (i, k),
(0,0), (0,1), ... , (O,no-1), (1,0), ... , (1,nl-1), ... , (m,n m -1),
have the form
(2.1.5.4)    P₀₁...n^(s_j−1)(t_j) = f^(s_j−1)(t_j),    j = 0, 1, ..., n,

if we define s_j, j = 0, 1, ..., n, as the number of times the value of t_j
occurs in the subsequence

    t₀ ≤ t₁ ≤ ··· ≤ t_j.

The equivalence of (2.1.5.1) and (2.1.5.4) follows directly from

    x₀ = t₀ = t₁ = ··· = t_{n₀−1} < x₁ = t_{n₀} = ··· = t_{n₀+n₁−1} < ···,

and the definition of the s_j,

    s₀ = 1, s₁ = 2, ..., s_{n₀−1} = n₀;  s_{n₀} = 1, ..., s_{n₀+n₁−1} = n₁;  ... .
We now use this new formulation in order to express the existence
and uniqueness result of Theorem (2.1.5.2) algebraically. Note that any
polynomial pet) E IIn can be written in the form
where II (t) is the row vector
II(t) := [ 1, t, ... , ~~] .
Therefore by Theorem (2.1.5.2) and (2.1.5.4), the following system of linear
equations
j = 0, 1, ... , n,
has a unique solution b. This proves the following corollary, which is equivalent to Theorem (2.1.5.2):
(2.1.5.5) Corollary. For" any non decreasing finite sequence
of n + 1 real numbers, the (n + 1) x (n + 1) matrix
is nonsingular.
The matrix V;,(to, t l , .... t n ) is related to the well-known Vandermonde
matrix if the numbers tj are distinct (then Sj = 1 for all j):
to
EXAMPLE 1. For to = tl
< t2, one finds
to
1
t~t~ 1.
t~
2
In preparation for a Neville type algorithm for Hermite interpolation,
we associate with each segment
o :s: i :s: i + k :s: n,
of virtual abscissae the solution Pi.Hl.. .. ,Hk E Ih of the partial Hermite
interpolation problem belonging to this subsequence, that is the solution
of [see (2.1.5.4)]
j = i, i + 1, ... ,i + k.
Here of course, the integers S j, i :s: .i :s: i + k, are defined with respect to
the subsequence, that is 5j is the number of times the value of tj occurs
within the sequence ti, ti+l, ... , tj.
EXAMPLE 2. Suppose n₀ = 2, n₁ = 3 and

    x₀ = 0,  f₀^(0) = −1,  f₀^(1) = −2,
    x₁ = 1,  f₁^(0) = 0,   f₁^(1) = 10,  f₁^(2) = 40.

This leads to virtual abscissae t_j, j = 0, 1, ..., 4, with

    t₀ = t₁ = 0 < t₂ = t₃ = t₄ = 1.

For the subsequence t₁ ≤ t₂ ≤ t₃, that is i = 1 and k = 2, one has

    s₁ = 1,  s₂ = 1,  s₃ = 2.

It is associated with the interpolating polynomial P₁₂₃(x) ∈ Π₂, satisfying

    P₁₂₃^(s₁−1)(t₁) = P₁₂₃(0) = f^(s₁−1)(t₁) = f^(0)(0) = −1,
    P₁₂₃^(s₂−1)(t₂) = P₁₂₃(1) = f^(s₂−1)(t₂) = f^(0)(1) = 0,
    P₁₂₃^(s₃−1)(t₃) = P₁₂₃′(1) = f^(s₃−1)(t₃) = f^(1)(1) = 10.

This illustrates (2.1.5.4).
Using this notation the following analogs to Neville's formulas (2.1.2.1)
hold: If t_i = t_{i+1} = ··· = t_{i+k} = x_l, then

(2.1.5.6a)    P_{i,i+1,...,i+k}(x) = Σ_{r=0}^{k} ( f^(r)(x_l) / r! )·(x − x_l)^r,

and if t_i < t_{i+k},

(2.1.5.6b)    P_{i,i+1,...,i+k}(x) = ( (x − t_i)·P_{i+1,...,i+k}(x) − (x − t_{i+k})·P_{i,i+1,...,i+k−1}(x) ) / (t_{i+k} − t_i).

The first of these formulas follows directly from definition (2.1.5.4); the second is proved in the same way as its counterpart (2.1.2.1b): One verifies first
that the right-hand side of (2.1.5.6b) satisfies the same uniquely solvable
[Theorem (2.1.5.2)] interpolation conditions (2.1.5.4) as P_{i,i+1,...,i+k}(x).
The details of the proof are left to the reader.    □
In analogy to (2.1.3.2), we now define the generalized divided difference

    f[t_i, t_{i+1}, ..., t_{i+k}]

as the coefficient of x^k in the polynomial P_{i,i+1,...,i+k}(x) ∈ Π_k. By comparing the coefficients of x^k in (2.1.5.6) [cf. (2.1.3.5)] we find, if t_i = ··· = t_{i+k} = x_l,

(2.1.5.7a)    f[t_i, t_{i+1}, ..., t_{i+k}] = f^(k)(x_l) / k!,

and otherwise

(2.1.5.7b)    f[t_i, t_{i+1}, ..., t_{i+k}] = ( f[t_{i+1}, ..., t_{i+k}] − f[t_i, t_{i+1}, ..., t_{i+k−1}] ) / (t_{i+k} − t_i).

By means of the generalized divided differences

    a_k := f[t₀, t₁, ..., t_k],    k = 0, 1, ..., n,

the solution P(x) = P₀₁...n(x) ∈ Π_n of the Hermite interpolation problem
(2.1.5.1) can be represented explicitly in its Newton form [cf. (2.1.3.1)]

(2.1.5.8)    P₀₁...n(x) = a₀ + a₁(x − t₀) + a₂(x − t₀)(x − t₁) + ···
                          + a_n(x − t₀)(x − t₁) ··· (x − t_{n−1}).
This follows from the observation that the difference polynomial
Q(X) := Pol ... n(x) - POl ... (n-l) (x) = J[to, tl, ... , tnlxn + ...
is of degree at most n and has, because of (2.1.5.1) and (2.1.5.4), for i :::; m1, the abscissa Xi as zero of multiplicity ni, and Xm as zero of multiplicity
(nm - 1). The polynomial of degree n
(x - to)(x - td··· (x - tn-d,
has the same zeros with the same multiplicities. Hence
Q(x) = f[t o, t1,"" tn](X - to)(X - h)'" (x - tn-d,
o
which proves (2.1.5.8).
EXAMPLE 3. We illustrate the calculation of the generalized divided differences
with the data of Example 2 (m = 1, n₀ = 2, n₁ = 3). The following difference
scheme results:

    t₀ = 0    −1* = f[t₀]
                            −2* = f[t₀, t₁]
    t₁ = 0    −1* = f[t₁]                      3 = f[t₀, t₁, t₂]
                             1  = f[t₁, t₂]                        6 = f[t₀, ..., t₃]
    t₂ = 1     0* = f[t₂]                      9 = f[t₁, t₂, t₃]                       5 = f[t₀, ..., t₄]
                            10* = f[t₂, t₃]                       11 = f[t₁, ..., t₄]
    t₃ = 1     0* = f[t₃]                     20* = f[t₂, t₃, t₄]
                            10* = f[t₃, t₄]
    t₄ = 1     0* = f[t₄]

The entries marked * have been calculated using (2.1.5.7a) rather than (2.1.5.7b).
The coefficients of the Hermite interpolating polynomial can be found in the
upper diagonal of the difference scheme:

    P(x) = −1 − 2(x − 0) + 3(x − 0)(x − 0) + 6(x − 0)(x − 0)(x − 1)
           + 5(x − 0)(x − 0)(x − 1)(x − 1)
         = −1 − 2x + 3x² + 6x²(x − 1) + 5x²(x − 1)².
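The generalized divided-difference scheme lends itself to a short program. The following Python sketch (an editorial illustration; it assumes the derivative data are supplied per virtual abscissa, and the helper names are ours) computes the Newton coefficients of (2.1.5.8) via (2.1.5.7) and reproduces the coefficients −1, −2, 3, 6, 5 of Example 3.

    from math import factorial

    def hermite_newton_coefficients(t, derivs):
        """Newton coefficients a_k = f[t_0,...,t_k] for nondecreasing virtual
        abscissae t; derivs[x][r] holds f^(r)(x)."""
        n = len(t)
        # table[i][k] = f[t_i, ..., t_{i+k}]
        table = [[0.0] * n for _ in range(n)]
        for k in range(n):
            for i in range(n - k):
                if t[i] == t[i + k]:
                    # (2.1.5.7a): all abscissae equal, use the k-th derivative
                    table[i][k] = derivs[t[i]][k] / factorial(k)
                else:
                    # (2.1.5.7b): ordinary divided-difference recursion
                    table[i][k] = (table[i + 1][k - 1] - table[i][k - 1]) / (t[i + k] - t[i])
        return [table[0][k] for k in range(n)]

    # data of Example 3: f(0) = -1, f'(0) = -2, f(1) = 0, f'(1) = 10, f''(1) = 40
    t = [0.0, 0.0, 1.0, 1.0, 1.0]
    derivs = {0.0: [-1.0, -2.0], 1.0: [0.0, 10.0, 40.0]}
    print(hermite_newton_coefficients(t, derivs))   # [-1.0, -2.0, 3.0, 6.0, 5.0]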
The interpolation error incurred by Hermite interpolation can be estimated in the same fashion as for the usual interpolation by polynomials.
In particular, the proof of the following theorem is entirely analogous to
the proof of Theorem (2.1.4.1):
(2.1.5.9) Theorem. Let the real function f be n + 1 times differentiable
on the interval [a, b], and consider m + 1 support abscissae x_i ∈ [a, b],

    x₀ < x₁ < ··· < x_m.

If the polynomial P(x) is of degree at most n,

    Σ_{i=0}^{m} n_i = n + 1,

and satisfies the interpolation conditions

    P^(k)(x_i) = f^(k)(x_i),    i = 0, 1, ..., m,   k = 0, 1, ..., n_i − 1,

then for every x ∈ [a, b] there exists ξ ∈ I[x₀, ..., x_m, x] such that

    f(x) − P(x) = ω(x)·f^(n+1)(ξ) / (n + 1)!,

where

    ω(x) := (x − x₀)^{n₀} (x − x₁)^{n₁} ··· (x − x_m)^{n_m}.
Hermite interpolation is frequently used to approximate a given real
function f by a piecewise polynomial function φ. Given a partition

    Δ: a = x₀ < x₁ < ··· < x_m = b

of an interval [a, b], the corresponding Hermite function space H_Δ^(ν) is defined as consisting of all functions φ : [a, b] → ℝ with the following properties:
(2.1.5.10)
(a) φ ∈ C^{ν−1}[a, b]: the (ν − 1)st derivative of φ exists and
    is continuous on [a, b].
(b) φ|I_i ∈ Π_{2ν−1}: on each subinterval I_i := [x_i, x_{i+1}],
    i = 0, 1, ..., m − 1, φ agrees with a polynomial of
    degree at most 2ν − 1.

Thus the function φ consists of polynomial pieces of degree 2ν − 1 or less
which are ν − 1 times differentiable at the "knots" x_i. In order to approximate a given real function f ∈ C^{ν−1}[a, b] by a function φ ∈ H_Δ^(ν), we choose
the component polynomials P_i = φ|I_i of φ so that P_i ∈ Π_{2ν−1} and so that
the Hermite interpolation conditions

    P_i^(k)(x_i) = f^(k)(x_i),    P_i^(k)(x_{i+1}) = f^(k)(x_{i+1}),    k = 0, 1, ..., ν − 1,

are satisfied.
Under the more stringent condition f E C 2u [a, b], Theorem (2.1.5.9)
provides a bound to the interpolation error for X E Ii, which arises if the
component polynomial Pi replaces f:
If(x) - Pi(x)1 :::; I(x - Xi)/X ~ Xi+lW' max If(2V)(~)1
(2.1.5.11)
2// .
(El,
< IXHI - xil 2V max If(2V)(~)I.
-
22v (2//!)
(Eli
Combining these results for i = 0, 1, ... , m gives for the function cp E Hr;),
which was defined earlier,
2.2 Interpolation by Rational Functions
where
11.::111 =
max
0:::;i:::;rn-1
59
IX i+l - xii
is the "fineness" of the partition .::1.
The approximation error goes to zero with the 21/ 1;h power of the
fineness II.::1 j II if we consider a sequence of partitions .::1} of the interval
[a, b] with II.::1 j II --+ O. Contrast this with the case of ordinary polynomial
interpolation, where the approximation error does not necessarily go to
zero as II.::1 j II --+ 0 [sec Section 2.1.4].
Ciarlet, Schultz, and Varga (1967) were able to show also that the first 1/
derivatives of <p are a good approximation to the corresponding derivatives
of I:
(2.1.5.13)
=
II(k)(x) - p?)(x)1 ::: I(x - ~:t~x ~~)~W-k (Xi+l - Xi)'" max II(2v)(01
. 1/.
f,EI,
for all X E Ii, i = 0, 1, ... , m - 1, k = 0, 1, ... , 1/, and therefore
(2.1.5.14)
III(2v) I
I I(k) - <p(k) 11= <- 22v - 2k11 .::111
k! (21/ - 2k)!
2V - k
CX!
for k = 0, 1, ... , 1/.
2.2 Interpolation by Rational Functions
2.2.1 General Properties of Rational Interpolation
Consider again a given set of support points (Xi, Ii), i = 0, 1, .... \Ve will
now examine the use of rational functions
pl"'V(x) == PI"·V(x) == ao + a1x + ... + al"xl"
QIL,v(X)
bo + b1x + ... + bvxlJ
for interpolating these support points. Here the integers JJ and 1/ denote the
maximum degrees of the polynomials in the numerator and denominator,
respectively. We call the pair of integers (JJ, 1/) the degree type of the rational
interpolation problem.
The rational function pl",V is determined by its JJ + 1/ + 2 coefficients
On the other hand, <p1",V determines these coefficients only up to a common
factor p oJ O.This suggests that pl",V is fully determined by the JJ + 1/ + 1
interpolation conditions
60
2 Interpolation
i = 0,1, ... ,p, + v.
(2.2.1.1)
We denote by A/l·v the problem of calculating the rational function p/l,v
from (2.2.1.1).
It is clearly necessary that the coefficients an bs: of p/l,V solve the
homogeneous system of linear equations
i = 0, 1. ... ,IL + v,
(2.2.1.2)
or written out in full,
We denote this linear system by S/l,v.
At first glace. substituting S/l,V for A/l,v does not seem to present a
problem. The next example will show, however, that this is not the case, and
that rational interpolation is inherently more complicated than polynomial
interpolation.
EXAMPLE. For support poiIlt~
and fJ, = 1/ = 1:
x,:
0
fi:
1
2
2
2
ao
- 1 . bo
= 0,
aO+a1 -2(bo +b1) =0,
ao + 2a1 - 2(bo + 2b1) = O.
Up to a common nonzero factor, solving the above system 5 1 ,1 yields the coefficients
ao = 0, bo = 0, a1 = 2, b) = 1,
and therefore the rational expression
which for x = 0 leads to the indeterminate expression 0/0. After cancelling the
factor x, we arrive at the rational expression
Both expressions p1,] and ,p1,1 represent the same rational function, namely
the constant function of value 2. This function misses the first support point
(xo, fo) = (0,1). Therefore it does not solve A 1.1. Since solving 5 1 ,1 is necessary
for any solution of A 1,1, we conclude that no such solution exists.
The above example shows that the rational interpolation problem A/l,V
need not be solvable. Indeed, if S/l,V has a solution which leads to a rational
function that does not solve A/l,V -- as was the case in the example - then
2.2 Interpolation by Rational Functions
61
the rational interpolation problem is not solvable. In order to examine this
situation more closely, we have to distinguish between different representations of the same rational function <Jj!-',l/, which arise from each other by
canceling or by introducing a common polynomial factor in numerator and
denominator. We say that two rational expressions (that is pairs of two
polynomials, the numerator and denominator polynomials)
are equivalent, and write
if
This is precisely when the two rational expressions represent the same
rational function.
A rational expression is called relatively prime if its numerator and
denominator are relatively prime, i.e., not both divisible by the same polynomial of positive degree. If a rational expression is not relatively prime,
then canceling all common polynomial factors leads to an equivalent rational which is.
Finally we say that a rational expression <Jj!-',v is a solution of S!-"v if its
coefficients solve S!-,,l/. As noted before, <Jj!-',V solves S!-"v if it solves A!-'·v.
Rational interpolation is complicated by the fact that the converse need
not hold.
(2.2.1.3) Theorem. The homogeneous linear system of equations S!-"v
always has nontrivial solutions. For each such solution
QIJ.V(X) =j.
°
holds, i.e., all nontrivial solutions define rational functions.
PROOF. The homogeneous linear system S!-"v has p, + v -+- 1 equations for
p, + v + 2 unknowns. As a homogeneous linear system with more unknowns
than equations, S!-'·v has nontrivial solutions
(aO,a1, ... ,a!-', bo, ... ,bv ) i- (0, ... ,0, 0, ... ,0).
For any such solution, Q!-',V(x) =j. 0, since
Q!-',V(x) == bo + b1x + ... + bvxv ==
°
would imply that the polynomial p!-"V(x) == au + al + .. , + a!-,x!-' has the
zeros
i = 0,1, ... ,p, + v.
62
2 Interpolation
It would follow that pll,V (x) == 0, since the polynomial pll,v has at most
degree tl, and vanishes at tL + v + 1 ~ tL + 1 different locations, contradicting
(ao,al, ... ,a ll , bo,···,b v ) =1= (0, ... ,0).
o
The following theorem shows that the rational interpolation problem
has a unique solution if it has a solution at all.
(2.2.1.4) Theorem. If PI and P2 are both (nontrivial) solutions of the
homogeneous linear system SIl.V, then they are equivalent (PI ~ P2), that
is, they df?termine the same rational function.
PROOF. IfbothPI(x) == PI(X)/QI(X) and P2(X) == P 2(X)/Q2(X) solve SIl.V,
then the polynomial
has tL + v + 1 different zeros
P(X;) = P 1(Xi)Q2(Xi) - P 2(Xi)QI(Xi)
=
fi QI(Xi)Q2(Xi) - fi Q2(Xi)QI(X;)
i = 0, ... , tL + v.
= 0,
Since the degree of polynomial P does not exceed tL + v, it must vanish
identically, and it follows that PI (x) ~ P2 (x).
0
Note that the converse of the above theorem does not hold: a rational expression PI may well solve SIl,v whereas some equivalent rational expression
P2 does not. The previously considered example furnishes a case in point.
Combining Theorems (2.2.1.3) and (2.2.1.4), we find that there exists
for each rational interpolation problem AIl.v a unique rational function,
which is represented by any rational expression pll.V that solves the corresponding linear system SIl,V. Either this rational function satisfies (2.2.1.1),
thereby solving A,t.I/, or AllY is not solvable at all. In the latter case, there
must be some support point (Xi, fi) which is "missed" by the rational function. Such a support point is called inaccessible. Thus AI"Y is solvable if
there are no inaccessible points.
Suppose pll,V(x) == PI",V(X)/QIl,V(x) is a solution to SIl.V. For any i E
{O, 1, ... ,tL + v} we distinguish the two cases:
1)
2)
QIl,V(Xi) =1= 0,
QIl,V(Xi) = 0.
In the first case, clearly pll,L'(Xi) = fi. In the second case, however, the
support point (Xi, fi) may be inaccessible. Here
2.2 Interpolation by Rational Functions
63
must hold by (2.2.1.2). Therefore, both PfJ.,V and QfJ.,V contain the factor
x ~ Xi and are consequently not relatively prime. Thus:
(2.2.1.5). If SfJ.,V has a solution pfJ.,v which is relatively prime, then there
are no inaccessible points: AfJ.,V is solvable.
Given PfJ.. v , let /pfJ.,V be an equivalent rational expression which is relatively
prime. We then have the general result:
(2.2.1.6) Theorem. Suppose PfJ.,V solves SfJ.· v . Then A,L,u is solvable and PfJ.,V represents the solution -- if and only if /pfJ.,V solves S",I/.
PROOF. If /P/I,I/ solves SfJ.,V, then AfJ.,V is solvable by (2.2.1.5). If /pfJ.,V does
not solve S,L,V, its corresponding rational function does not solve AfJ.· v . D
Even if the linear system SfJ.· v has full rank 11 + V + 1, the rational interpolation problem AfJ.,V may not solvable. However, since the solutions of
SfJ.,u are, in this case, uniquely determined up to a common constant factor
p =J 0, we have:
(2.2.1.7) Corollary to (2.2.1.6). If SfJ.,V has full rank, then AfJ.,V is solvable if and only if the solution PfJ.. V of SfJ.,V is relatively prime.
We say that the support points (Xi, fd, i = 0,1, ... , (J' are in special position if they are interpolated by a rational expression of degree type (r;, A)
with r; + A < (J'. In other words, the interpolation problem is solvable for
a smaller combined degree of numerator and denominator than suggested
hy the nnmher of support points. We observe that
(2.2.1. 8) Theorem. The accessible support points of a nonsolvable 'interpolation problem AfJ.,v are in special position.
PROOF. Let i 1 , ... , ia be the subscripts of the inaccessible points, and let
PfJ.,v be a solution of S/l.V. The numerator and the denominator of PfJ.,V
were seen above to have the common factors X ~ Xi" ... , X ~ X 1n , whose
cancellation leads to an equivalent rational expression pK,A with r; = /1 ~ G,
A = v ~ G. pK,A solves the interpolation problem AK,A which just consists
of the 11 + V + 1 ~ G accessible points. As
r; + A + 1 = I). + V + 1 ~ 2G < M + v + 1 ~ G,
the accessible points of AfJ.,V are clearly in special position.
D
The observation (2.2.1.8) makes it clear that nonsolvability of the rational interpolation problem is a degeneracy phenomenon: solvability can
be restored by arbitrarily sIllall perturbations of the support points. In
what follows, we will therefore restrict our attention to fully nondegenerate
problems, that is, problems for which no subset of the support points is in
special position. Not only is AfJ.,v solvable in this case, but so are all problems AK,A of r; + A + 1 of the original support points where r; + A ::; M + v.
For further details see Milne (1950) and Maehly and Witzgall (1960).
64
2 Interpolation
Most of the following discussion will be of recursive procedures for solving rational interpolation problems AIL,v. With each step of such recursions
there will be associated a rational expression J[IL,v, of degree type (/-l, v)
with /1 :s; m and v :s; n, and either the numerator or the denominator of
([JIL,v will be increased by l. Because of the availability of this choice, the
recursion methods for rational interpolation are more varied than those for
polynomial interpolation. It will be helpful to plot the sequence of degree
types (/-l, v) which are encountered in a particular recursion as paths in a
diagram:
/-l=0
1
2
3
1/=0
1
2
3
We will distinguish two kinds of algorithms. The first kind is analogous
to Newton's method of interpolation: A tableau of quantities analogous
to divided differences is generated from which coefficients are gathered for
an interpolating rational expression. The second kind corresponds to the
Neville-Aitken approach of generating a tableau of values of intermediate
rational functions ([JIL,V. These values rclate to each other directly.
2.2.2 Inverse and Reciprocal Differences. Thiele's Continued
Fraction
The algorithms to be described in this section calculate rational expressions
along the main diagonal of the (/-l, 1/ )-plane:
/-l=0
1
2
3
v=O
1
2
3
(2.2.2.1)
Starting from the support points (Xi, j,), i = 0, 1, ... , we build the following
tableau of inverse differences:
2.2 Interpolation by Rational Functions
Xi
fi
0
Xo
fo
1
2
Xl
3
X2
X3
h
h
h
rp(XO, xd
rp(xo, X2)
rp(xo, X3)
rp(xo, Xl, X2)
rp(XO,Xl,X3)
65
rp(xo, :rl, X2, X3)
The inverse differences are defined recursively as follows:
(2.2.2.2)
On occasion, certain inverse differences become 00 because the denominators in (2.2.2.2) vanish.
Note that the inverse differences are, in general, not symmetric functions of their arguments.
Let PI-', QV be polynomials whose degree is bounded by /1 and /J, respectively. We will now try to use inverse differences in order to find a
rational expression
pn(X)
pn,n(x) = _ _
Qn(x)
with
pn,n(Xi) = fi
for i = 0,1, ... ,2n.
We must therefore have
pn(x) = fo + pn(x) _ pn(xo)
Qn(x)
Qn(x)
Qn(xo)
pn-l(x)
(x-xo)
= fo + (x - xo) Qn(x) = fo + Qn(x)/pn-l(x)"
The rational expression Qn (x) / pn-l (x) satisfies
Qn(Xi)
pn-l(Xi)
for i = 1, 2, ... , 2n. It follows that
66
2 Interpolation
and therefore
pn-l(Xi)
Qn-I(X;)
Continuing in this fashion, we arrive at the following expression for pn,n(X):
pn(X)
f
n n()
x = Qn (x) =
P'
x - Xo
°+ Qn (x) / pn - X)
I (
x - Xo
x - xl
= fo + - - - - - - - - - - ip(XO, Xl) + pn-I(X)/Qn-I(X)
Xo
x - Xl
= fo + - - - - - - - - - - - - - - - X -
ip(XO, Xl) + - - - - - - - - - - - -
1: - X2n-1
ip(XO, ' , . ,X2n) .
pn.n(X) is thus srepresented by a continued fraction:
(2.2.2.3)
pn,T!(X) =fo + X - xo/P(XO,XI) + ~ + ...
+ X - X2n-I/ip(XO, Xl,"" X2n)'
It is readily seen that the partial fractions of this continued fraction are
nothing but the rational expressions PJ",,"(x) and p'" +1 ,," (x) , p, = 0, 1,
... , n - 1, which satisfy (2.2.1.1) and which are indicated in the diagram
(2.2.2.1) .
pO,O(X)
fo,
p1.0(x)
fo + x - xo/p(xo, Xl)'
p1.1 (x)
fa + X - xo/p(xo, Xl) + X - xI/ip(xo, Xl, X2) ,
EXAMPLE
2.2 Interpolation by Rational Functions
Xi
Ji
0
1
0
1
0
-1
2
2
3
3
'P(XQ, x;)
'P(XQ,XI,X,)
2
3
-1
-3
-2
9
I
:3
3
67
'P(XQ, Xl, X2, x;)
I
2
I
2
Because the inverse differences lack symmetry, the so-called reciprocal
differences
P(Xi' XH1,"" Xi+k)
are often preferred. They are defined by the recursions
P(Xi) := fi'
(2.2.2.4)
P(Xi, Xi+l
)
Xi - Xi+l
f
'
:= f
(
)
PXi,Xi+I,···,Xi+k :=
i-HI
Xi - Xi+k
(
)
(
)+
P Xi,' .. , XHk-1 - P Xi+l,···, Xi+k
For a proof that the reciprocal differences are indeed symmetrical, see
Milne-Thompson (1951).
The reciprocal differences are closely related to the inverse differences.
(2.2.2.5) Theorem. For p = 1, 2, ... [letting p(xo, ... , X p -2) = 0 for
p = 1],
PROOF. The proposition is correct for p = l. Assuming it true for p, we
conclude from
that
By (2.2.2.4),
68
2 Interpolation
p(Xp. Xo, ...• xp~d - p(xo, ... ,Xp-l, Xp+l)
= p(Xp, XO,"" Xp-l, Xp+l) - p(XO," ., Xp~l)'
Since the p( ... ) are symmetric,
'P(Xo, Xl,···, :.c:p+l) = p(Xo,···, xp+d - p(Xo, ... , Xp-l),
whence (2.2.2.5) has been established for p + 1.
The reciprocal differences can be arranged in the tableau
Xo
fa
Xl
h
o
p(xo, Xl)
(2.2.2.6)
p(Xo, Xl, :r'2)
P(Xl,X2)
X2
12
X3
h
P(XO,Xl,X2,X3)
P(Xl, X2, X3)
P(X2,X3)
Using (2.2.2.5) to substitute reciprocal differences for inverse differences
in (2.2.2.3) yields Thiele's continued fraction:
(2.2.2.7)
pn.n(x) =fo + X - xo/p(Xo, Xl) + ~p(xo. Xl, X2) - p(Xo) + ...
+ X - X2n-dp(Xo, ... , X2n) - p(Xo, .... X2n~2)'
2.2.3 Algorithms of the Neville Type
We proceed to derive an algorithm for rational interpolation which is analogous to Neville's algorithm for polynomial interpolation.
A quick reminder that, after discussing possible degeneracy effects in
rational interpolation problems [Section 2.2.1]' we have assumed that such
effects are absent in the problems whose solution we are discussing. Indeed,
such degeneracies are not likely to occur in numerical problems.
We use
plL.V(X) = Pf'V(x)
s
-
Q~'V(x)
to denote the rational expression with
p~,v(x;)=fi
fori=s,s+l, ... ,S+M+V.
Pf'V, Q~'v being polynomials of degrees not exceeding M and v, respectively. Let p~'v and q~'v be the leading coefficients of these polynomials:
2.2 Interpolation by Rational Functions
69
For brevity we put
Qi :=
x - Xi and T/:'V(x, y) := p/:,V(x) - y. C;)~,V(x),
noting that
(2.2.3.1) Theorem. Starting with
P~,O(x) = fs,
the following recursions hold:
(a) Transition (M - 1,1/) -+ (M,I/):
(b) Transition (M,I/ - 1) -+ (M,I/):
PROOF. We show only (a), the proof of (b) being analogous. Suppose the
rational expressions q>~-l,v and q>~+;'v meet the interpolation requirements
(2.2.3.2)
T/:-1,V(Xi, fi) = 0 for i = s, ... , s + M+ 1/ - 1,
T:;/,V(Xi,fi) =0 for i=s+l, ... ,S+M+I/.
If we define Pt,V(x), Q~'V(x) by (a), then the degree of Pt,V clearly does
not exceed M. The polynomial expression for Q~'v contains formally a term
with x v+1, whose coefficient, however, vanishes. The polynomial Q~'v is
therefore of degree at most 1/. Finally,
From this and (2.2.3.2),
T/:,V(Xi, fi) = 0 for i = s, ... , s + M+ II.
Under the general hypothesis that no combination (M, 1/, s) has inaccessible points, the above result shows that (a) indeed defines the numerator
and denominator of q>~,v .
0
70
2 Interpolation
Unfortunately, the recursions (2.2.3.1) still contain the coefficients
The formulas are therefore not yet suitable for t.he calculation of cP';'''(x) for a prescribed value of .T. However, we can eliminate
t.hese coefficients on the basis of the following theorem.
p~,v-1, q~-1,v.
(2.2.3.3) Theorem.
(a)
cPfJ.-1,v(x) _ cPfJ.-l,v-1(x) = k (:1: - xs+d'" (x - xS+I"+v-d
8
8+1
1
Q~-1'V(x)Q~~rV-1(x)
fJ.- 1,1/-1 qs1"-1,v ,
WZ'th k 1 -- _ Ps+1
(b)
cPl"-l,v(X) _ cPl"-l,v-1(,T) = k (x - X8 H)'" (x - x"+I"+v-1)
8+1
s+1
2
QI"-l,v( )QI"-1,V-1( )
8+1 x 8+1
x
.h k 1"-1,v-1 l"-l,v
q8+1'
wzt 2 - -P s+1
PROOF. The numerator polynomial of the rational expression
"'-1"-1,v(
",-fJ.-l,v-1( x)_
'¥s
X) _ '¥s+l
Pt'-1'V(x)Q~~:,v-1(X) _ p:;/,v-1(X)Q~-1,v(X)
Q~-l,v (x )Q~~:,v-1 (x)
is at most of degree 11 - 1 + v and has fl, + v-I different. zeros
Xi,
i=s+1,s+2, ... ,s+f1+ v - 1
by definition of cP~-l,v and cP~~l,v-l. It must therefore be of the form
k"1 ( X - .T 8 +1 ) ... (x - xS+fJ.+v-1 )
'tl1 k 1 -- _ P fJ.-1,v-1
q"l"-l,v .
s+1
WI
o
This proves (a). (b) is shown analogously.
(2.2.3.4) Theorem. For f1 ;;0. 1, v ;;0. 1,
cP~~ll,v(x) _ cP~-1,v(X)
(a)
cP~,v(x) = cP~+:'V(x) + - - - - - - - - - - - - - - - as
[
cP~~:,V(x) - cP~-I,v(x) ]
1-1
as+J-l+v
cP~~11,v(X) - cP~~11,v-1(X)
....
~
2.2 Interpolation by Rational Functions
71
PROOF. By Theorem (2.2.3.1),
J.L-1,v pJ.L-l,v( )
J.L-1,v p/J.-1,v( )
cpJ.L,V(x) _ a 8q8
8+1
x - a 8+J.L+v q8+1
8
X
8
V( )
J.L-1 vQ,,-l
V( ) .
a 8q8J.L-1 'vQJ.L-1
8+1' X - a 8+J.L+v q8+1 '
8'
, X
We now assume that p~+t,V-1 i- 0 and multiply numerator and denominator of the above fraction by
J.L-1,v-1( x - X8+1 )( X - X8+2 ) ... ( x - x S +J.L+v-1 )
-P8+1
)QJ.L-1,v( )QJ.L-1,v-1( )
Q J.L-1,v(
s+l x s
X s+l
X
Taking Theorem (2.2.3.3) in account, we arrive at
II - as+J.L+vcp~-l,v(x) [
II - as+J.L+v [ b
(2.2.3.5)
where
_ cpJ.L-1,v-1(x)
11 = cpJ.L-1,v(X)
s
s+l
,
b = cp~+t,v (x) - cp~+t,V-1 (x) .
(a) follows by a straightforward transformation. (b) is derived analogously.
o
The formulas in Theorem (2.2.3.4) can now be used to calculate the
values of rational expressions for prescribed x successivel:y, alternately increasing the degrees of numerators and denominators. This corresponds to
a zigzag path in the (IL, v)-diagram:
1
2
3
v=O
1
(2.2.3.6)
2
L
Special recursive rules are still needed for initial straight portions-vertically and horizontally-of such paths.
As long as v = 0 and only IL is being increased, one has a case of pure
polynomial interpolation. One uses Neville's formulas [see (2.1.2.1)l
cp~,O(x) := is,
IL = 1,2, ....
72
2 Interpolation
Actually these are specializations of Theorem (2.2.3.4a) for v = 0, provided
the convention q>~+;,-l := 00 is adopted, which causes the quotient marked
* in (2.2.3.4) to vanish.
If J-l = and only v is being increased, then this case relates to polynomial interpolation with the support points (Xi, 1/ Ii), and one can use the
formulas
°
(2.2.3.7)
q>~'V(x)
as - a
+
:= _ _ _ _ _s_v_ __
as
a s +v
q>~;;-l (x)
q>~,v-l (x)
v = 1,2, ... ,
which arise from Theorem (2.2.3.4) if one defines q>-;~i-l(x) := 0.
Experience has shown that the (J-l, v)-sequence
(0,0) --+ (0,1) --+ (1,1) --+ (1,2) --+ (2,2) --+ ...
- indicated by the dotted line in the diagram (2.2.3.6) - holds particular
advantages, especially in the important application area of extrapolation
methods [Sections 3.4 and 3.5], where interest focuses on the values q>~,,,(x)
for x = 0. If we refer to this particular sequence, then it suffices to indicate
J-l + v, instead of both J-l and v, and this permits the shorter notation
Ti,k := q>~'''(x)
with i = S + J-l + v, k = J-l + v.
The formulas (2.2.3.4) combine with (2.2.3.7) to yield the algorithm
Ti,-l := 0,
(2.2.3.8)
Ti,k := Ti,k-l
Ti,k-l - Ti-1,k-l
+ --------------X -
Xi-k
x - Xi
[1 _
Ti,k-l - Ti-1,k-l]
Ti,k-l - T i - 1,k-2
-1
for 1 ::; k ::; i, i = 0, 1, .... Note that this recursion formula differs from
the corresponding polynomial formula (2.1.2.5) only by the expression in
brackets [... j, which assumes the value 1 in the polynomial case.
If we display the values Tik in the tableau below, letting i count the
ascending diagonals and k the columns, then each instance of the recursion
formula (2.2.3.8) interrelates the four corners of a rhombus:
2.2 Interpolation by Rational Functions
(M, v)
(0,0)
(0,1)
(1,1)
(1,2)
73
...
fo = To,o
0= TO,-l
(2.2.3.9)
TI,1
h = TI,o
0= TI,-l
T 2 ,2 \ .
T 2,1
12 = T 2 ,0
0= T 2,-1
--+
T3,2
/"
T33
T 3,1
h = T3,0
If one is interested in the rational function itself, i.e. its coefficients, then
the methods of Section 2.2.2, involving inverse or reciprocal differences, are
suitable. However, if one desires the value of the interpolating function for
just one single argument, then algorithms of the Neville type based on the
formulas of Theorem (2.2.3.4) and (2.2.3.8) are to be preferred. The formula
(2.2.3.8) is particularly useful in the context of extrapolation methods [see
Sections 3.4, 3.5, 7.2.3, 7.2.14].
2.2.4 Comparing Rational and Polynomial Interpolation
Interpolation, as mentioned before, is frequently used for the purpose of
approximating a given function f (x). In many such instances, interpolation by polynomials is entirely satisfactory. The situation is different if the
location x for which one desires an approximate value of f(x) lies in the
proximitiy of a pole or some other singularity of f(x) - like the value
of tan x for x close to 7r /2. In such cases, polynomial interpolation does
not give satisfactory results, whereas rational interpolation does, because
rational functions themselves may have poles.
EXAMPLE [taken from Bulirsch and Rutishauser (1968)]. For the function f(x) =
cot x the values cot 1 0, cot 2°, ... have been tabulated. The problem is to determine an approximate value for cot 2° 30'.
Polynomial interpolation of order 4, using the formulas (2.1.2.4), yields the
tableau
74
2 Interpolation
ctg(x;)
Xi
Ji =
1°
57.28996163
2°
28.63625328
14.30939911
21.47137102
23.85869499
3°
19.08113669
4°
14.30066626
22.36661762
23.26186421
21.47137190
22.63519158
23.08281486
22.18756808
18.60658719
5°
11.43005230
Rational interpolation with (fL, 1/) = (2,2) using the formulas (2.2.3.8) in contrast
gives
1°
57.28996163
2°
28.63625328
3°
19.08113669
4°
14.30066626
22.90760673
22.90341624
22.90201805
22.90369573
22.90411487
22.91041916
22.9037655~
22.90384141
22.90201975
22.94418151
5°
11.43005230
The exact value is cot 2°30' = 22.9037655484 ... ; incorrect digits are underlined.
A similar situation is encountered in extrapolation methods [see Sections 3.4, 3.5, 7.2.3, 7.2.14]. Here a function T(h) of the step length h is
interpolated at small positive values of h.
2.3 Trigonometric Interpolation
2.3.1 Basic Facts
Trigonometric interpolation uses combinations of the trigonometric functions cos hx and sin hx for integer h. We will confine ourselves to linear
interpolation, that is, interpolation by one of the trigonometric expressions
(2.3.1.1a)
1A + ~)AhCOS hx+Bh sin hx),
AI
l/i'(x):=
h=l
2.3 Trigonometric Interpolation
1P(x):= ~O +
(2.3.1.1b)
75
A
M-l
L (Ah cos hx + Bh sin hx) + : cos Mx
h=l
of, respectively, N = 2M + 1 or N = 2M support points (Xk' ik), k = 0,
1, ... , N - 1. Interpolation by such expressions is suitable for data which
are periodic of known period. Indeed, the expressions 1P(x) in (2.3.1.1)
represent periodic functions of x with the period 271".2
Considerable conceptual and algebraic simplifications are achieved by
using complex numbers and invoking De Moivre's formula
e ikx = cos kx + i sin kx .
Here and in what follows, i denotes the imaginary unit. If c = a + i b, a, b
real, then c = a - ibis its complex conjugate, a is the real part of c, bits
imaginary part, and Icl := (cc)1/2 = (a 2 + b2 )1/2 its absolute value.
Particularly important are uniform partitions of the interval [0,271"]
k = 0,1, ... ,N - 1,
Xk := 271"k/N,
to which we now restrict our attention. For such partitions, the trigonometric interpolation problem can be transformed into the problem of finding a
phase polynomial of order N (Le. with N coefficients)
(2.3.1.2)
with complex coefficients (3j such that
P(Xk) = ik,
Indeed
e-hixk
k = 0,1, ... ,N - 1.
= e-21rihk/N = e2rri(N-h)k/N = e(N-h)ixk,
and therefore
sin hx k =
(2.3.1.3)
ehixk _ e(N-h)ixk
---::-2""'"i- - -
Making these substitutions in expressions (2.3.1.1) for 1P(x) and then collecting the powers of e iXk produces a phase polynomial p(:r) , (2.3.1.2), with
coefficients (3j, j = 0, ... , N - 1, which are related to the coefficients A h ,
Bh of 1P(s) as follows:
2
If sin u and cos u have to be both evaluated for the same argument u, then it
may be advantageous to evaluate t = tan( u/2) and express sin u and cos u in
terms of t :
.
2t
sm u = 1 + t 2 '
1 - t2
cos U = 1 + t 2 .
This procedure is numerically stable for 0 ::; u ::; 7r / 4, and the problem can
always be transformed so that the argument falls into that range.
76
2 Interpolation
(2.3.1.4) (a) If N is odd, then N = 2M + 1 and
Ao
(30 = 2' (3j = ~(Aj - iBj ), (3N-j = ~(Aj + iBj ), j = 1, ... , M,
Ao = 2(30,
Ah = (3h + (3N-h,
Bh = i((3h - (3N-h),
h = 1, ... , M.
(b) If N is even, then N = 2M and
(30 = ~o, (3j = ~(Aj - iBj ), (3N-j = ~(Aj + iBj ), j = 1, ... , M - 1,
AM
(3M = 2 '
Ao = 2(30, Ah = (3h + (3N-h, Bh = i((3h - (3N-h), h = 1, ... , M - 1,
AM = 2(3M.
The trigononometric expression lP(x) and its corresponding phase polynomial p( x) agree for all support arguments Xk = 27r / N of an equidistant
partition of the interval [0,27r]:
k = 0, 1, ... ,N - 1.
However lP(x) = p(x) need not hold at intermediate points x -I- Xk. The
two interpolation problems are equivalent only insofar as a solution to one
problem will produce a solution to the other via the coefficient relations
(2.3.1.4).
The phase polynomials p(x) in (2.3.1.2) are structurally simpler than
the trigonometric expressions lP(x) in (2.3.1.1). Upon abbreviating
°: :;
and since Wj -I- Wk for j -I- k,
j, k :::; N -1, it becomes clear that we are
faced with just a standard polynomial interpolation problem in disguise:
find the (complex) algebraic polynomial P of degree less than N with
P(Wk) =
ik,
k = 0, 1, ... ,N - 1.
The uniqueness of polynomial interpolation immediately gives the following
(2.3.1.5) Theorem. For any support points (Xk' ik), k = 0, ... , N - 1,
with ik complex and Xk = 27rk/N, there exists a unique phase polynomial
p(x) = (30 + (31e ix + ... + (3N_le(N-l)ix
with
for k = 0, 1, ... , N - 1.
p(xd = ik
2.3 Trigonometric Interpolation
77
The coefficients (3j of the interpolating phase polynomial can be expressed in closed form. To this end, we note that, for 0 ::; j, h ::; N - 1
(2.3.1.6)
More importantly, however, we have for 0 ::; j, h ::; N - 1
=
N-1
~ wj w- h
(2.3.1. 7)
~
k=O
k
{£'
or = h ,
N
J
for j =1= h.
0
k
PROOF. Wj-h is a root of the polynomial
wN -
1 = (w -
1)(W N - 1
+ w N - 2 + ... + 1),
from which either Wj-h = 1, and therefore j = h, or
N-1
N-1
~
j
-h
~ wkw k
~
j-h
~ wk
=
k=O
N-1
k=O
k
0
= ~
~ Wj_h = .
D
k=O
Introducing in the N-dimensional complex vector space (CN of all Ntuples (uo, U1, ... ,UN -1) of complex numbers Uk the usual scalar product
N-1
L
[u, v] :=
UjVj,
j=O
then (2.3.1.7) says that the special N-vectors
W
(h) _
-
(1 ,W1"",WN-1,
h
h)
h = 0, ... ,N -1,
form an orthogonal basis of (CN,
( .)
(2.3.1.8)
[w J ,w
(h)
]=
{
N
0
for j = h,
for j =1= h.
Note that the vectors are of length [w(h), w(h)j1/2 = VN instead of length
1, however. From the orthogonality of the vectors w(h) follows:
N-1
..
(2.3.1.9) Theorem. The phase polynomial p(x) = 2: j =o (3je J 'X satisfies
k = 0,1, ... , N -1,
for fk complex and Xk = 27f/N, if and only if
N-1
N-1
k=O
k=O
~ f w- j = ~
~ f e-2rrijk/N ,
(3J. = ~
N~kk
N~k
j = 0, ... ,N -1.
78
2 Interpolation
PROOF. Because of fk = p(Xk) the N-vector f := (fo, iI,···, fN-d satis-
fies
N-l
f = L (3jw(j)
j=O
so that
N-l
LfkW-';/= [f,w(j)] = [(3ow(O)+···+(3N_IW(N-l),w(j)] =N(3j. D
k=O
The mapping F : QjN ---+ QjN,
f = (fo, iI,···, fN-l) H (3 := «(30, (31, ... ,(3N-l) =: F(f),
defined by (2.3.1.9) is called discrete Fouriertransformation (DFT). Its
inverse (3 H f = F- 1((3) corresponds to Fourier synthesis, namely the
evaluation ofthe phase polynomial p(x) (2.3.1.2) at the equidistant abscissas Xk = 27rk/N, k = 0, 1, ... , N - 1,
N-l
N-l
fk = L (3je27rijk/N = L (3jwf
j=O
j=O
Because of
A = L.f=~1 {JjW;j the mapping F- 1 can be expressed by F,
(2.3.1.10)
Therefore the algorithms to compute F [e.g. the fast Fourier transform
methods of the next section 2.3.2] can also be used for Fourier synthesis.
For phase polynomials q(x) of order at most s, s :::::: N - 1 given, it is
in general not possible to make all residuals
k =O, ... ,N -1,
vanish, as they would for the interpolating phase polynomial of order N.
In this context, the s-segments
Ps(x) = (30 + (31eix + ... + (3se six
of the interpolating polynomial p( x) have an interesting best-approximation property:
(2.3.1.11) Theorem. The s-segment Ps(x), 0::; s < N, of the interpolating phase polynomial p(x) minimizes the sum
N-l
S(q) := L Ifk - q(xk)1 2
k=O
2.3 Trigonometric Interpolation
79
[note that S (p) = 0] of the squared absolute values of the residuals over all
phase polynomials
The phase polynomial Ps (x) is uniquely determined by this minimum property.
PROOF. We introduce the N-vectors
Then S(q) can be written as the scalar product
S(q) = [f-q,f-q].
By Theorem (2.3.l.9), N,6j = [f, w(j)] for j = 0, ... , N -- l. So for j ~ s,
s
[f - Ps, w(j)] = [f - L
f3h W (h) , w(j)] = N f3j - N f3j = 0,
h=O
and
[f - Ps,Ps - q] =
L [J - Ps, (f3j - I'j
)w(j)] = 0.
j=O
But then we have
S(q) = [f - q, f - q]
[(f - Ps) + (Ps - q), (f - Ps) + (Ps - q)]
= [f - Ps, f - Ps] + [Ps - q,ps - q]
::::: [f - Ps, f - Ps]
=
= S(Ps)'
Equality holds only if [Ps - q, Ps - q] = 0, i.e., if the vectors Ps and q
are equal. Then the phase polynomials Ps (x) and q( x) are identical by the
uniqueness theorem (2.3.l.5).
D
Returning to the original trigonometric expressions (2.3.l.1), we note
that Theorems (2.3.l.5) and (2.3.l.9) translate into the following:
(2.3.1.12) Theorem. The trigonometric expressions
iJr(x) :=
A
-f
+ L(Ah cos hx + Bh sin hx),
iJr(x) :=
i + L (Ah cos hx + Bh sin hx) + A;! cos Mx,
M
h=l
A
M-l
h=l
80
2 Interpolation
where N = 2M + 1 and N = 2M, respectively, satisfy
tJr(Xk) = ik,
k = 0, 1, ... N - 1,
for Xk = 21Tk/N if and only if the coefficients of tJr(x) are given by
2 N-l
Ah = N
2 N-l
21Thk
L fk cos hXk = N L ik cos ---y:;- .
k=U
k=O
2 N-l
2 N-l
21Thk
Bh = N
ik sin hXk = N
ik sin ---y:;-'
k=O
k=O
L
PROOF.
L
Only the expressions for A h . Bh remain to be verified. For by
(2.3.1.4)
Ah = Ph + PN-h = ~
N-l
L fk(e- hiXk + e-(N-h)iXk) ,
k=O
. N-l
Bh = i(Ph - PN-h) = ~
L fde- hixk - e-(N-h)ixk) ,
k=O
and the substitutions (2.3.l.3) yield the desired expressions.
D
Note that if the support ordinates ik are real, then so are the coefficients
A h , Bh in (2.3.l.l2).
2.3.2 Fast Fourier Transforms
The interpolation of support points
(Xk, ik), k = 0,1, ... , N -1, with equidistant Xk = 21Tk/N. by a phase
N-l
-polynomial p(x) = L:j=o Pje)'X leads to expressions of the form [Theorem
(2.3.l.9)]
N-l
(2.3.2.1)
(-J = ~ '" f e-2rrijk/N
{-')
N L
k
,
k=O
j = 0,1, ... ,N - 1.
The evaluation of such expressions is of prime importance in Fourier analysis. The expressions occur also as discrete approximations - for N equidistant arguments s - to the Fourier transform
H(s) =
I:
f(t)e-2rrist dt,
which pervades many areas of applied mathematics. However, the numerical
evaluation of expressions (2.3.2.1) had long appeared to require on the order
2.3 Trigonometric Interpolation
81
of N 2 multiplications, putting it out of reach for even high-speed electronic
computers for those large values of N necessary for a sufficiently accurate
discrete representation of the above integrals. The discovery [Cooley and
Tukey (1965); it is remarkable that similar techniques were already used
by Gauss] of a method for rapidly evaluating (on the order of NlogN
multiplications) all expressions (2.3.2.1) for large special values of N has
therefore opened up vast new areas of applications. This method and its
variations are called fast Fourier transforms. For a detailed treatment see
Brigham (1974) and Bloomfield (1976).
There are two main approaches, the original Cooley- Tukey method and
one described by Gentleman and Sande (1966), commonly ealled the SandeTukey method.
Both approaches rely on an integer factorization of ]Ii" and decompose
the problem accordingly into subproblems of lower degree. These decompositions are then carried out recursively. This works best when
N = 2n,
n> 0 integer.
We restrict our presentation to this most important and most straightforward case, although analogous techniques will clearly work for the more
genereal case N = N1N2 ... N n , N m integer.
The Cooley-Tukey approach is best understood in terms of the interpolation problem described in the previous section 2.3.1. Suppose N = 2M
and consider the two interpolating phase polynomials q(x) and r(x) of order
M = NI2 with
h = 0, ... , ivl - 1.
The phase polynomial q(x) interpolates all support points of even index,
whereas the phase polynomial r(x) = r(x - 21fIN) = r(x -1fIM) interpolates all those of odd index. Since
keven,
k odd,
the complete interpolating phase polynomial p(x) is now readily expressed
in terms of the two lower-order phase polynomials q(x) and r(x):
(2.3.2.2)
p(x) = q(x) (
l+eM~)
2
+ r(x -1fIM) (l-eM~)
--2- .
This suggests the following n step recursive scheme. For Tn :S n, let
M = 2m - 1
and
R = 2n - m .
Step Tn then consists of determining R phase polynomials
(.I('/TI) + (.I('/TI) eix + ... + (.I(m)
e(2M -l)ix
Pr(m) = Pr,O
fJr,l
fJ r ,2AI-1
,
r ,= 0, ... , R - 1,
82
2 Interpolation
from 2R phase polynomials p~m-l) (x), r = 0, ... , 2R-l, using the recursion
(2.3.2.2):
2p~m) (x) = p~m-l) (x)(1 + e Mix ) + pkn;..~l) (x - 7T /M)(1 _ e Mix ).
This relation gives rise to the following recursion between the coefficients
of the above phase polynomials:
(2.3.2.3)
1 j
2(3(m)
+ (3(mr,] = (3(m-l)
r,]
R+r,]lcm
2(3(m)
r,M +j -
(3(m-l)
r,j
+ (3(m-l)
cj
R+r,j m
}
r = 0, ... ,R - 1,
j = 0, ... ,!vI - 1,
where
m=O, ... ,n.
The recursion is initiated by putting
(0) . - f
(3k,O'k,
k=O, ... ,N -1,
. - (3(n)
(3 j.O,j'
j = 0, ... ,N-1.
and terminates with
This recursion typifies the Cooley-Tukey method.
The Sande-Tukey approach chooses a clever sequence of additions
in the sums
ike- jixk . Again with M = N/2, we assign to each
term ike- jixk an opposite termik+Me-jixk+M. Summing respective opposite terms in (2.3.2.1) produces N sums of M = N /2 terms each. Splitting
those N sums into two sets, one for even indices j = 2h and one for odd
indices j = 2h + 1, will lead to two problems of evaluating expressions of
the form (2.3.2.1), each problem being of reduced degree M = N/2.
Using the abbreviation
2:.::-01
again, we can write the expressions (2.3.2.1) in the form
N-l
~k
N(3j = L..
fkc~ ,
k=O
j = 0,1, ... ,N - 1.
Here n is such that N = 2n. Distinguishing between even and odd values
of j and combining opposite terms gives
M-l
N-l
N (32h =
=
k=O
=:
k=O
k=O
M-l
N-l
N(32h+1 =
M-l
L ikc~hk L (ik + fk+M )C~~l L f£C~~l
M-l
L ikc~2h+1)k L (Uk - fk+M)C~)c~~l L ffc~~l
=
k=O
=:
k=O
k=O
2.3 Trigonometric Interpolation
83
for h = 0, ... , .M - 1 and AI := N/2, since 10;' = IOn-I, c~I = -1. Here
f~ = ik + fk+M
}
f~ = (ik - fk+M )c~
k = 0, ... ,M -1.
In order to iterate this process for m = n, n - 1, ... , 0, we let AI :=
2rn - 1 , R:= 217 - rn and introduce the notation
f;.r;:) ,
°
r = 0, ... ,R - 1,
k = 0, ... , 2M - 1,
. h JO.k
.(n) -- f k, k -J\T
.(n-l) an d f(n-l)
' .
Wlt
, ... ,"'
- 1. JO.k
l,k
represent th
e quantltles
f£ and f~, respectively, which were introduced above. In g;eneral we have,
with 1\;1 = 2rn - l and R = 2n - m ,
(2.3.2.4)
2M-l
N(~
fJjR+r =
~
~
r = 0,1, ... ,R - 1,
.(m) jk
Jr,k 10 m ,
j = 0, 1, ... 2M - 1.
k=O
with the quantities f;~) satisfying the recursions:
(2.3.2.5)
lrn-l) -lrn)
r,k
r,k
+ lrn)
r,k+JH
(m-l) _
(m)
fr+R.k - (fr.k -
} {
(m)
k
fr,k+M )c m
m = n, n -- 1, ... , 1.
r = 0, 1. .... R - 1.
'"
k = 0, 1, ... ,1\1 - 1.
PROOF. Suppose (2.3.2.4) is correct for some m ::; n, and let M' := M/2 =
2m - 2 , R' := 2R = 2n - m + 1 . For j = 2h and j = 2h 1, respectively, we
+
find by combining opposite terms
M-l
2M'-1
~ (f;.r;:) + f;,r;:~M)C~r~ = ~ f;.r;:-I)c~_I'
N{3hR'+r = N{3JR+r =
k=O
k=O
M-l
N{3hR'+r+R = N{3jR+r' = ~ (f;~) -
f;~LI)Ct::
k=O
M-l
=
~ (f(rn)
~
r,k k=O
2M'-1
f(rn)
) k hk
r,k+M c m c m - l =
~
~
.(m-l) hk
Jr+R,k c m-l'
k=O
where r = 0, ... , R - 1, j = 0, ... , 2lvl - 1.
o
The recursion (2.3.2.5) typifies the Sande-Tukey method. It is initiated
by
and terminates with
.(n) . - f
JO.k . - k·
k = 0, ... , 1V - 1,
84
2 Interpolation
._ 1 (0)
f3r .- N fr,o ,
r = 0, L ... , N - l.
Returning to the Cooley-Tukey method for a more detailed algorithmic
formulation, we are faced with the problem of arranging the quantities
in a one-dimensional array:
fJ?r;)
1'£
= O, ... ,N-l.
Among suitahle maps (m,r,j) --+ I'£(m,r,j), the following IS the most
straightforward:
1'£
= 2m r + j;
m = 0, ... ,11" r = 0, ... , 2n - m - 1, j = 0, ... , 2m - l.
It has the advantage, that the final results are automatically in the correct
order. However, two arrays {il[ ], {i[ ] are necessary to accommodate the
left- and right-hand sides of the recusion (2.3.2.3). We can make do with
only one array {i[ ] if we execute the transformations "in place", that is,
if we let each pair of quantities ,[3~7), f3~~~l +j occupy the same positions in
8[ ] as the pair of quantities (3~7-1), ,[31:~,~), from which the former are
computed. In this case, however, the entries in the array {i[ ] are being
permuted, and the maps which assign the positions in {i[ ] as a function of
integers m, r, j become more complicated. Let
T = T(m,r,j)
be a map with the above mentioned replacement properties, namely
with
(2.3.2.6)
T(m, r,j) = T(m - 1, r.j)
T(m, r,j + 2m - I ) = T(m -1, r + 2n - m ,j)
m=L ... ,n,
r = 0, ... , 2 n - m - 1,
j = 0, ... ,2 m - 1 - 1,
and
(2.3.2.7)
T(n, 0, j) = j,
j = 0, ... , N - l.
The last condition means that the final result f3j will be found in position
j in the array {i: f3j = {i[j].
The conditions (2.3.2.6) and (2.3.2.7) define the map T recursively. It
remains to determine it explicitly. To this end, let
t = ao + al ·2+ ... + an-I' 2n - 1 ,
aj E {O, I},
2.3 Trigonometric Interpolation
be the binary representation of an integer t,
°:S: t <
85
2n. Then putting
pet) := an-I + a n -2 ·2+ ... + ao ·2 n - l
(2.3.2.8)
defines a permutation of the integers t = 0, ... , 2n -1 called bit reversal. The
bit-reversal permutation is symmetric, i.e. p(p(t)) = t.
In terms of the bit-reversal permutation p, we can express T(m, r,j)
explicitly:
T(m, r,j) = per) + j,
(2.3.2.9)
for all m = 0, 1, ... , n, r = 0, 1, ... , 2n - m - 1, j = 0, 1, ... , 2m - 1.
PROOF. If again
t = ao + al ·2+ ... + an-I· 2n - \
aj E {O, 1 },
then by (2.3.2.6) and (2.3.2.7)
T(n-1,0,t)
( 0)
t = T n, ,t = { T(n _ 1,0, t _ 2n-l)
Thus
if an-l. = 0,
if an-l. = 1.
t = T(n, 0, t) = T(n - 1, an-I, aO + ... + a n -2 ·2 n - 2 ),
and, more generally,
t = T(n, 0, t) = T(m, an-I + ... + am . 2n - m - l , aO + ... + am-I· 2m - I )
for all m = n - 1, n - 2, ... , 0. For r = an-I + ... + am· 2n - m - l , we find
per) = am . 2m + ... + an-I· 2n - l
and t = per) + j, where j = ao + ... + am-I· 2m - I .
By the symmetry of bit reversal,
D
T(m,p(r),j) = r + j,
where r is a multiple of 2m ,
o :S: j < 2m - I , then
°:S: r <
2n , and
°:S: <
j
2m . Observe that if
t = T(m, p(r),j) = T(m -l,p(r),j) = r + j,
t = T(m, p(r),j + 2m - I ) = T(m - 1, per) + 2n - m ,j) = r + j + 2m - I
mark a pair of positions in ,6[ ] which contain quantities connected by the
Cooley-Thkey recursions (2.3.2.3).
In the following pseUdO-ALGOL formulation of the classical CooleyThkey method, we assume that the array ,6[ ] is initialized by putting
,6[p(k)] := ik,
k = O, ... ,N -1,
86
2 Interpolation
where p is the bit-reversal permutation (2.3.2.8). This "scrambling" of the
initial values can also be carried out "in place", because the bit-reversal
permutation is symmetric and consists, therefore, of a sequence of pairwise
interchanges or "transpositions". In addition, we have deleted the factor 2
which is carried along in the formulas (2.3.2.3), so that finally
1 -
(Jj := NP[j],
j = 0, ... ,N-1.
The algorithm then takes the form
°
for m := 1 step 1 until n do
begin for j :=
step 1 until 2m - I - 1 do
begin e := cjn;
for F :=
step 2m until 2n - 1 do
begin u := ~[F + j]; v:= ~[F + j + 2m - I ] X e;
B[F + j] := u + v; ~[F + j + 2 m - I ] := u - v
end
end
end:
°
If the Sande-Tukey recursions (2.3.2.5) are used, there is again no problem if two arrays of length N are available for new and old values, respectively. However, if the recursions are to be carried out "in place" in a single
array j[ ], then we must again map index triples m, T, j into single indices
T. This index map has to satisfy the relations
T(m - I, T, k) = T(m, T, k),
T(m - 1, T
+ 2n - m , k) = T(m, T, k + 2m - I )
for m = n, n - L ... , L T = 0, 1, ... , 2 n - m - 1, k = 0, 1, ... , 2m - I - 1.
If we assume
T(n, 0, k) = k for k = 0, ... , N - 1,
that is. if we start out with the natural order, then these conditions are
precisely the conditions (2.3.2.6) and (2.3.2.7) written in reverse. Thus
T = T(m,T, k) is identical to the index map T considered for the CooleyTukey method.
In the following pseUdO-ALGOL formulation of the Sande-Tukey method,
we assume that the array j[ ] has been initialized directly with the values
h:
j[k] := h,
k = 0, 1, ... ,N - 1.
However, the final results have to be "unscrambled" using bit reversal,
j = 0, ... ,N - 1:
2.3 Trigonometric Interpolation
°
87
for m:= n step -1 until 1 do
begin for k :=
step 1 until 2m - 1 - 1 do
begin e := c~;
for r:=
step 2m until 2n - 1 do
begin
u := i[r + k]; v:= i[r + k + 2m - I ];
i[r + k] := u + v; i[r + k + 2m - I ] :== (u - v) X e
end
end
end;
°
If all values ik, k = 0, ... , N - 1, are real and N = 2M is even, then
the problem of evaluating the expressions (2.3.2.1) can be reduced in size
by putting
h = 0,1, ... , M - 1,
and evaluating the expressions
M-l
~
9 e-2rrijh/M
IJ·- M ~
h
,
'"Y . • -
j = 0, 1, ... ,loll - 1.
'"
h=O
The desired values (3j, j = 0, ... , N - 1, can be expressed in terms of the
values 'Yj, j = 0, ... , M - 1. Indeed, one has with 'YM := 'Yo
(2.3.2.10)
- ) + 4i1 ('Yj -'YM-j
)e -2rrij/N ,
(3j = 4:1 ('Yj + 'YM-j
J. =
°1
'1
, , ... ,d
j=1,2, ... ,M-1.
(3N-j=i]j,
PROOF. It is readily verified that
M-l
1 Yj + 1M-j) = N
1 '~
" hhe- 2 rrtJ.. 2h/N ,
4:('
h=O
1\[-1
1 ('V
4i
;y
diI - j
,j -
) _
-
1 '" f
e- 2rrij (2h+l)/N+2rrijjN
N ~ 2h+l
.
D
h=O
In many cases, particularly if all values fk are real, one is actually
interested in the expressions
2 N-l
Aj:= N
2 N-l
27rjk
2·k
L ikcos~, B := N L fk sin ~
j
k=O
k=O
88
2 Interpolation
for j = 0,1, ... , Ai, which occur, for instance, in Theorem (2.3.1.12). The
values A j , B j are connected with the corresponding values for (3j via the
relations (2.3.l.4).
2.3.3 The Algorithms of Goertzel and Reinsch
The problem of evaluating phase polynomials p(x) from (2.3.1.2) or trigonometric expressions If/(x) from (2.3.1.1) for some aTbitmry argument x = f.
is called FouTieT synthesis. For phase polynomials, there are Horner-type
evaluation schemes, as there are for expressions (2.3.l.1a) when written in
the form If/(x) = L~~-l\I /3 j e jix . The numerical behavior of such evaluation
schemes, however, should be examined carefully.
For example, Goertzel (1958) proposed an algorithm for a problem
closely related to Fourier synthesis, namely, for simultaneously evaluating
the two sums
N-l
L Yk cos kf. ,
N-l
k=O
k=1
L Yk sink';
for a given argument f. and given values Yk, k = 0, ... , N -1. This algorithm
is not numerically stable unless it is suitably modified. The algoithm is
based on the following:
(2.3.3.1) Theorem. FaT f. '" nr, T = 0, ±1, ±2, ... , define the quantities
1 N-l
Vj := ~
Yk sin(k - j + 1)f.,
sm" k=j
L
j = 0, 1, ... , N - 1,
V N := V N + 1 := 0.
These quantities satisfy the TecuTsions
VJ = Yj + 2Vj+1 cosf. - Uj+2,
(2.3.3.1a)
j = N -l,N - 2, ... ,0.
In paTticulaT
N-l
L Yk sin kf. = VI sin f.,
(2.3.3.1b)
N-l
(2.3.3.1c)
L Yk cos k';
=
Yo + VI cos f. - V 2 .
°
k=O
PROOF. For
S j S N - 1, let
By the definition of Vj+l, Vj+2,
2.3 Trigonometric Interpolation
L
L
N-l
N-l
A = YJ + _.1_ { 2cos~
Yk sin(k - j)~ Yk sin(k - j sm ~
k=j+l
k=j+2
89
1)~
}
N-l
=Yj+_.l_
L yd2cos~sin(k-j)~-sin(k-j-l)~l·
sm~ k=j+l
Kow
2 cos~ sin(k - j)~ = sin(k - j + 1)~ + sin(k - j - 1)~.
Hence
A = _._1_ [ Yj sin~ +
sm~
L Yk sin(k - j + 1)~ == U;-
N-l
]
k=j+l
This proves (2.3.3.1a). (2.3.3.1b) restates the definition of U 1 . To verify
(2.3.3.1c), note that
1
1
N-l.
L
N-l
.
U 2 = -.- ~ Yk sm(k - 1)~ = -.Yk sm(k - 1)~,
sm~ ~
sm~
k=2
k=l
and
sin(k - 1)~ = cos~ sink~ - sin~ cosk~.
D
Goertzel's algorithm applies the recursions (2.3.3.1) directly:
UN := [h.J+l := 0;
c:= cos~;
for j := N - 1, N - 2, ... , 1 :
cc:= 2· c:
Uj := Yj + cc· Uj+l - Uj +2;
51 := Yo + U 1 . C - U2 ;
52 := U1 . sin~;
to find the desired results 51 = L~:-Ol Yk cos k( 82 = L~:ll Yk sin k~.
This algorithm is unfortunately not numerically stable for small absolute values of~, I~I «1. Indeed, having calculated c = cosk~, the quantity
81 = L~=~l Yk cos k~ will depend solely on c and the values Yk. We can
write 51 = cp(c, Yo, ... , YN-l), where
cp(c, Yo, ... ,YN -d =
N-l
L Yk cos(k arccos c).
k=O
As in Section 1.2, we denote by eps the machine precision. The roundoff
error Llc = ccc, Iccl :::; eps, which occurs during the calculation of c, causes
an absolute error Ll c 51 in 51, which in first-order approximation amounts
to
90
2 Interpolation
N-l
cAp
E e cos ~ """""
.1 e s1 ~ -;:;-.1c = -.-- L k Yk sin k~
sm~
uC
k=O
N-l
= Ec(COtO L kYksink~.
k=O
An error .1~ = EE~' lEE I
s: eps in ~, on the other hand, causes only the error
L
fJ ( N - l
)
.1.;s1 ~ fJ~
Yk cos k~ . .1~
k=O
N-l
= -EE~ L kYksink~
k=O
in s1. Now cot~ ~ 1/~ for small I~I. The influence of the roundoff error
in c is consequently an order of magnitude more serious than that of a
corresponding error in ~. In other words, the algorithm is not numerically
stable.
In order to overcome these numerical difficulties, Reinsch has modified
Goertzel's algorithm [see Bulirsch and Stoer (1968)]. He distinguishes the
two cases cos ~ > 0 and cos ~ O.
Case (a): cos~ > O. The recursion (2.3.3.1a) yields for the difference
s:
bUJ := Uj - UH1
the relation
i5Uj = Uj - UH1 = Yj + (2 cos ~ - 2) UH1 + UH1 - UH2
= Yj + AUH1 + i5UH1 ,
where
A := 2(cos~ - 1) = -4sin2(~/2).
This suggests the algorithm
A := -4sin2(~/2);
UN+l := i5UN := 0;
for j := N - 1. N - 2, ... , 0:
UH1 := i5UH1 + Uj+2;
i5Uj := A' UJ +1 + i5UH1 + Yj;
s1 := i5Uo - A . UI/2:
s2 := U1 . sin~:
This algorithm is well behaved as far as the propagation of the error .1A =
E\A, 110\1
eps, in A is concerned. The latter causes only the following error
.1\s1 in s1:
s:
2.3 Trigonometric Interpolation
91
8s1
8S1/8)..
,1),s1 ="= a:\,1)" = E)')..7i{ 8~
= -E A
L
N-l
sin 2( ~ / 2 )
kYksink~
.
sin(U2) . cos(U2) k=O
I tan(U2)I is small for small I~I. Besides, I tan(U2)1 < 1 for cos~ > o.
Case (b): cos~ :::; O. Here we put
and find
5Uj = Uj + Uj+l = Yj + (2 cos ~ + 2)Uj+l - Uj+l - Uj+2
= Yj + )..Uj+l - 5UJ +1 ,
where now
).. := 2(cos~
+ 1) = 4cos 2 (U2).
This leads to the following algorithm:
).. := 4COS2(~/2);
UN+l := 5UN := 0;
for j := N - L N - 2, ... , 0:
Uj+l := 5Uj+l - Uj+2;
5Uj := ).. . UJ + 1 - 5Uj + 1 + Yj;
s1 := 5Uo - U 1 . )../2;
s2 := U 1 • sin~;
It is readily confirmed that a roundoff error ,1)" = E),).., IE)'I :::; eps, in )..
causes an error of at most
in s1, and I cot(~/2)1 :::; 1 for cos~ :::; o. The algorithm is therefore well
behaved as far as the propagation of the error ,1)" is concerned.
92
2 Interpolation
2.3.4 The Calculation of Fourier Coefficients. Attenuation
Factors
Let K be the set of all absolutely continuous 3 real functions f : IR -+ IR
which are periodic with period 27r. It is well known [see for instance Achieser
(1956)] that every function f E K can be expanded into a Fourier series
oc
(2.3.4.1)
L cje
f(x) =
jix
j=~oo
which converges towards f(x) for every x E IR. The coefficients Cj = Cj(f)
of this series are given by
Cj -- Cj (1'),.- - 1
(2.3.4.2)
27r
1271" f( x )e ~jixdx,
0
j = 0, ±1, ±2, ....
In practice, frequently all one knows of a function f are its values !k :=
f(Xk) at equidistant arguments Xk := 27rkjN, where N is a given fixed
positive integer. The problem then is to find, under these circumstances,
reasonable approximate values for the Fourier coefficients Cj (f). We will
show how the methods of trigonometric interpolation can be applied to
this problem.
By Theorem (2.3.1.9), the coefficients (3j of the interpolating phase
polynomial
with
for k = 0, ±1, ±2, ... , are given by
N~l
jixk
6J = ~
N 'L" f k e,
j = 0, 1, .. . ,N - 1.
k=O
Since fo = fN, the quantities (3j can be thought of as a "trapezoidal sum"
[compare (3.1.7)]
3
A real function f : [a, b] --t IR is absolutely continuous on the interval [a, b] if for
every EO > 0 there exists 0 < 0 such that Li If (b i ) - f (ai) I < EO for every finite set
of intervals [a" bi ] with a::; a1 < b1 < ... < an < bn ::; band Li Ib i -ail < 8. If
the function f is differentiable everywhere on the closed interval [a, b] or, more
generally, if it satisfies a "Lipschitz condition" If(xi) - f(x2)1 ::; elx1 - x21 on
[a, b], then f is absolutely continuous, but not conversely: there are absolutely
continuous functions with unbounded derivatives. If the function is absolutely
continuous, then it is continuous and its derivative l' exists almost everywhere.
1'"lorcover, f(x) = f(a) + Jax1'(t) dt for x E [a, b]. The absolute continuity of
the functions f, 9 also ensures that integration by parts can be carried out
safely,
f(x )g' (x) dx = f(x )g(x) I~ l' (x )g(x) dx .
J:
J:
2.3 Trigonometric Interpolation
;3j = ~ [~ + !Ie- jiX1 + ... + hV_le-jixN-l +
93
f; e- jiXN ]
approximating the integral (2.3.4.2), so that one might think of using the
sums
;3j(J) =;3j := ~
(2.3.4.3)
N-l
L ike- jixk
k=O
for all integers j
0, ±1, ±2, ... as approximate values to the desired
Fourier coefficients Cj (J). This approach appears attractive, since fast
Fourier transforms can be utilized to calculate the quantities ;3j (J) efficiently. However, for large indices j the value ;3j (J) is a very poor approximation to Cj (J). Indeed, ;3j+kN = ;3j holds for all integers k, j, while on
the other hand limlJl--+CXJ Cj = O. [This follows immediately from the convergence of the Fourier series (2.3.4.1) for the argument :c = 0.] A closer
look also reveals that the asymptotic behavior of the Fourier coefficients
c] (J) depends on the degree of differentiability of f:
(2.3.4.4) Theorem. If the 27r-periodic function f has an absolutely continuous r th derivative f(r), then
PROOF. Successive integration by parts yields
C
]
..
= -1 1271" f(x)e-PXdx
27r
= ~
0
2 !'(x)e-jiXdx
r
J
71"
27rJ2 o
in view of the periodicity of f. This proves the proposition.
0
To approximate the Fourier coefficients Cj (J) by values which display
the right asymptotic behavior, the following approach suggests itself: Determine for given values ik, k = 0, ±1, ±2, ... , as simple a function 9 E 1C
as possible which approximates f in some sense (e.g., interpolates f for Xk)
and share with f some degree of differentiability. The Fourier coefficients
94
2 Interpolation
Cj (g) of 9 are then chosen to approximate the Fourier coefficients Cj (f) of
the given function 1. In pursuing this idea, it comes as a pleasant surprise
that even for quite general methods of approximating the function 1 by
a suitable function g, the Fourier coefficients Cj (g) of 9 can be calculated
in a straightforward manner from the coefficients (3j(f) in (2.3.4.3). More
precisely, there are so-called attenuation lactorsTj, j integer, which depend
only on the choice of the approximation method and not on the particular
function values lk, k = 0, ±1, ... , and for which
To clarify what we mean by an "approximation method" , we consider
besides the set K of all absolutely continuous 21r-periodic functions 1 :
JR -+ JR - the set
-
IF = { (ik)kEZ I lk E JR, lk+N = lk for k E 7l.}, 7l. := {k I k intcger},
of all N-periodic sequences of real numbers
1 = (···,1-1,10, h, .. .).
For convenience, we denote by 1 both the function 1 E K and its corresponding sequence (fk)kElL with lk = l(xk). The meaning of 1 will follow
from the context.
Any method of approximation assigns to each sequence 1 E IF a function 9 = P(f) in K; it can therefore be described by a map
P: IF -+ K.
K and IF are real vector spaces with the addition of elements and the
multiplication by scalars defined in the usual straightforward fashion. It
therefore makes sense to distinguish linear approximation methods P. The
vector space IF is of finite dimension N, a basis being formed by the sequences
(2.3.4.5)
elk) = (e(k))
J
jElL'
k = 0,1, ... ,N - 1,
where
if k == j mod N,
otherwise.
In both IF and K we now introduce translation operators E: IF -+ IF
and E: K -+ K, respectively, by
(Ej)k := lk-l
(Eg)(x) := g(x - h)
for all k E 7l., if 1 E IF,
for all x E JR, if 9 E K, h:= 21r/N = Xl.
(For convenience, we use the same symbol for both kinds of translation operators.) We call an approximation method P: IF -+ K translation invariant
if
2.3 Trigonometric Interpolation
95
P(E(f)) = E(P(f))
for all f E IF, that is, a "shifted" sequence is approximated by a "shifted"
function. P(E(f)) = E(P(f)) yields P(Ek(f)) = Ek(P(f)), where E2 =
Eo E, E3 = Eo E 0 E, etc. We can now prove the following theorem by
Gautschi and Reinsch [for further details see Gautschi (1972)]:
(2.3.4.6) Theorem. For each approximation method P: IF -+ K there
exist attenuation factors Tj, jEll, for which
(2.3.4.7)
Cj (P f)
= Tj /3j (f) for all jEll and arbitrary f E IF
if and only if the approximation method P is linear and translation invariant.
PROOF. Suppose that P is linear and translation invariant. Every f E IF
can be expressed in terms of the basis (2.3.4.5):
f =
N-l
N-l
k=O
k=O
L fke(k) = L IkEke(O).
Therefore
g:= Pf =
L
N-l
fkEkpe(O),
k=O
by the linearity and the translation invariance of P. Equivalently,
N-l
g(x) =
L IkTJo(x - Xk),
k=O
where TJo := Pe(O) is the function which approximates the sequence e(O).
The periodicity of 9 yields
where
(2.3.4.8)
We have thus found expressions for the attenuation factors Tj which depend
only on the approximation method P and the number N of given function
96
2 Interpolation
values fk for arguments Xk, 0:::: Xk < 271". This proves the "if" direction of
the theorem.
Suppose now that (2.3.4.7) holds for arbitrary f ElF. Since all functions
in K can be represented by their Fourier series, and in particular P f E K,
(2.3.4.7) implies
(2.3.4.9)
j=-oc
j=-oc
By the definition (2.3.4.3) of !3j (f), !3j is a linear operator on IF and, in
addition,
N-l
N-l
k=O
k=O
jixk = ~e-jih ' " f e- jixk = e- jih , !3(f)
!3(Ef) = ~
. J
N 'L" f k-l·eN
L
k·
'J
.
Thus (2.3.4.9) yields the linearity and the translation invariance of P:
L Tj!3j(f)e j
DC
(P(E(f)))(x) =
,(x-h) =
(Pf)(x - h) = (E(P(f)))(x).
D
j=-oc
As a by-product of the above proof, we obtained an explicit formula
(2.3.4.8) for the attenuation factors. An alternative way of determining the
attenuation factors Tj for a given approximation method P is to evaluate
the formula
(2.3.4.10)
for a suitable f ElF.
EXAMPLE 1. For a given sequence f ∈ F, let g := Pf be the piecewise linear interpolation of f, that is, g is continuous and linear on each subinterval [x_k, x_{k+1}], and satisfies g(x_k) = f_k for k = 0, ±1, .... This function g = Pf is clearly absolutely continuous and has period 2π. It is also clear that the approximation method P is linear and translation invariant. Hence Theorem (2.3.4.6) ensures the existence of attenuation factors. In order to calculate them, we note that for the special sequence f = e^{(0)} of (2.3.4.5)

    β_j(f) = 1/N,

    Pf(x) = 1 − |x − x_{kN}|/h   if |x − x_{kN}| ≤ h, k = 0, ±1, ...,   Pf(x) = 0 otherwise,

    c_j(Pf) = (1/2π) ∫_0^{2π} Pf(x) e^{−ijx} dx = (1/2π) ∫_{−h}^{h} (1 − |x|/h) e^{−ijx} dx.

Utilizing the symmetry properties of the above integrand, we find

    c_j(Pf) = (1/π) ∫_0^{h} (1 − x/h) cos jx dx = (2/(j²πh)) sin²(jh/2).

With h = 2π/N the formula (2.3.4.10) gives

    τ_j = (sin z / z)²   with   z := πj/N,   j = 0, ±1, ....

EXAMPLE 2. Let g := Pf be the periodic cubic spline function [see Section 2.4] with g(x_k) = f_k, k = 0, ±1, .... Again, P is linear and translation invariant. Using the same technique as in the previous example, we find the following attenuation factors:

    τ_j = (sin z / z)^4 · 3/(1 + 2 cos² z),   where z := πj/N.
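In practice the discrete coefficients β_j(f) can be obtained with a single FFT, after which (2.3.4.7) gives the Fourier coefficients of the interpolating function. The following short sketch is not part of the book; the helper name and the sample data are illustrative. It applies the attenuation factors of Example 1 (a commented line shows the factors of Example 2).

    import numpy as np

    def linear_interpolant_coefficients(samples, j_values):
        # c_j(Pf) = tau_j * beta_j(f), tau_j = (sin z / z)^2, z = pi*j/N (Example 1)
        N = len(samples)
        beta = np.fft.fft(samples) / N        # beta_j(f) = (1/N) sum_k f_k e^{-i j x_k}
        coeffs = {}
        for j in j_values:
            z = np.pi * j / N
            tau = 1.0 if j == 0 else (np.sin(z) / z) ** 2
            # for the periodic cubic spline of Example 2 use instead:
            # tau = (np.sin(z)/z)**4 * 3.0 / (1.0 + 2.0*np.cos(z)**2)
            coeffs[j] = tau * beta[j % N]     # beta_j is N-periodic in j
        return coeffs

    # example: coefficients c_j, |j| <= 5, for an arbitrary periodic sample sequence
    c = linear_interpolant_coefficients(np.arange(16) % 4, range(-5, 6))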
2.4 Interpolation by Spline Functions

Spline functions yield smooth interpolating curves which are less likely to exhibit the large oscillations characteristic of high-degree polynomials. They are finding applications in graphics and, increasingly, in numerical methods. For instance, spline functions may be used as trial functions in connection with the Rayleigh–Ritz–Galerkin method for solving boundary-value problems of ordinary and partial differential equations. More recently, they are also used in signal processing.
Spline functions (splines) are connected with a partition

    Δ: a = x_0 < x_1 < ... < x_n = b

of an interval [a, b] by knots x_i, i = 0, 1, ..., n. They are piecewise polynomial functions S: [a, b] → ℝ with certain smoothness properties: the restrictions S|I_i of S to I_i := (x_{i−1}, x_i), i = 1, 2, ..., n, are polynomials. In Sections 2.4.1–2.4.3 we describe only the simple but already representative case of cubic spline functions (splines of degree 3), which are composed of cubic polynomials, S|I_i ∈ Π_3. The general case of splines of degree k and of arbitrary piecewise polynomial functions is treated in Sections 2.4.4 and 2.4.5. The final Section 2.4.6 deals with the basic ideas behind the application of splines in signal processing.
A detailed description of splines can be found in many monographs, for instance in Greville (1969), Schultz (1973), Böhmer (1974), de Boor (1978), Schumaker (1981), Korneichuk (1984), and Nürnberger (1989).

2.4.1 Theoretical Foundations

Let Δ := {a = x_0 < x_1 < ... < x_n = b} be a partition of the interval [a, b].
(2.4.1.1) Definition. A cubic spline (function) S_Δ on Δ is a real function S_Δ: [a, b] → ℝ with the properties:
(a) S_Δ ∈ C²[a, b], that is, S_Δ is twice continuously differentiable on [a, b].
(b) S_Δ coincides on every subinterval [x_i, x_{i+1}], i = 0, 1, ..., n − 1, with a polynomial of degree (at most) three.

Thus a cubic spline consists of cubic polynomials pieced together in such a fashion that their values and those of their first two derivatives coincide at the interior knots x_i, i = 1, ..., n − 1.
Consider a finite sequence Y := (y_0, y_1, ..., y_n) of n + 1 real numbers. We denote by

    S_Δ(Y; ·)

an interpolating spline function S_Δ with S_Δ(Y; x_i) = y_i for i = 0, 1, ..., n.
Such an interpolating cubic spline function S_Δ(Y; ·) is not uniquely determined by the sequence Y of support ordinates. Roughly speaking, there are still two degrees of freedom left, calling for suitable additional requirements. The following three additional requirements are most commonly considered:

(2.4.1.2)
(a)  S″_Δ(Y; a) = S″_Δ(Y; b) = 0,
(b)  S_Δ^{(k)}(Y; a) = S_Δ^{(k)}(Y; b) for k = 0, 1, 2:  S_Δ(Y; ·) is periodic,
(c)  S′_Δ(Y; a) = y′_0,  S′_Δ(Y; b) = y′_n  for given numbers y′_0, y′_n.

We will confirm that each of these three sets of conditions by itself ensures uniqueness of the interpolating spline function S_Δ(Y; ·). A prerequisite of the condition (2.4.1.2b) is, of course, that y_n = y_0.
For this purpose, and to establish a characteristic minimum property of spline functions, we consider the sets

(2.4.1.3)    K^m[a, b],

m > 0 integer, of real functions f: [a, b] → ℝ for which f^{(m−1)} is absolutely continuous⁴ on [a, b] and f^{(m)} ∈ L²[a, b].⁵ By

    K^m_p[a, b]

we denote the set of all functions in K^m[a, b] with f^{(k)}(a) = f^{(k)}(b) for k = 0, 1, ..., m − 1. We call such functions periodic, because they arise as restrictions to [a, b] of functions which are periodic with period b − a.

⁴ See footnote 2 in Section 2.3.4.
⁵ The set L²[a, b] denotes the set of all real functions whose squares are integrable on the interval [a, b], i.e., ∫_a^b |f(t)|² dt exists and is finite.
Note that S_Δ ∈ K³[a, b], and that S_Δ(Y; ·) ∈ K³_p[a, b] if (2.4.1.2b) holds. If f ∈ K²[a, b], then we can define

    ‖f‖² := ∫_a^b |f″(x)|² dx.

Note that ‖f‖ ≥ 0. However, ‖·‖ is not a norm [see (4.4.1)] but only a seminorm, because ‖f‖ = 0 may hold for functions which are not identically zero, for instance, for all linear functions f(x) ≡ cx + d.
We proceed to show a fundamental identity due to Holladay [see for instance Ahlberg, Nilson, and Walsh (1967)].

(2.4.1.4) Theorem. If f ∈ K²[a, b], Δ = {a = x_0 < x_1 < ... < x_n = b} is a partition of the interval [a, b], and if S_Δ is a spline function with knots x_i ∈ Δ, then

    ‖f − S_Δ‖² = ‖f‖² − ‖S_Δ‖² − 2 [ (f′(x) − S′_Δ(x)) S″_Δ(x) |_a^b − Σ_{i=1}^{n} (f(x) − S_Δ(x)) S‴_Δ(x) |_{x_{i−1}^+}^{x_i^−} ].

Here g(x)|_v^u stands for g(u) − g(v), as it is commonly understood in the calculus of integrals. It should be realized, however, that S‴_Δ(x) is piecewise constant with possible discontinuities at the knots x_1, ..., x_{n−1}. Hence we have to use the left and right limits of S‴_Δ(x) at the locations x_i and x_{i−1}, respectively, in the above formula. This is indicated by the notation x_i^−, x_{i−1}^+.
PROOF. By the definition of ‖·‖,

    ‖f − S_Δ‖² = ∫_a^b |f″(x) − S″_Δ(x)|² dx
               = ‖f‖² − 2 ∫_a^b f″(x) S″_Δ(x) dx + ‖S_Δ‖²
               = ‖f‖² − 2 ∫_a^b (f″(x) − S″_Δ(x)) S″_Δ(x) dx − ‖S_Δ‖².

Integration by parts gives for i = 1, 2, ..., n

    ∫_{x_{i−1}}^{x_i} (f″(x) − S″_Δ(x)) S″_Δ(x) dx
        = (f′(x) − S′_Δ(x)) S″_Δ(x) |_{x_{i−1}}^{x_i} − ∫_{x_{i−1}}^{x_i} (f′(x) − S′_Δ(x)) S‴_Δ(x) dx
        = (f′(x) − S′_Δ(x)) S″_Δ(x) |_{x_{i−1}}^{x_i} − (f(x) − S_Δ(x)) S‴_Δ(x) |_{x_{i−1}^+}^{x_i^−}
          + ∫_{x_{i−1}}^{x_i} (f(x) − S_Δ(x)) S_Δ^{(4)}(x) dx.

Now S_Δ^{(4)}(x) ≡ 0 on the subintervals (x_{i−1}, x_i), and f′, S′_Δ, S″_Δ are continuous on [a, b]. Adding these formulas for i = 1, 2, ..., n yields the proposition of the theorem, since

    Σ_{i=1}^{n} (f′(x) − S′_Δ(x)) S″_Δ(x) |_{x_{i−1}}^{x_i} = (f′(x) − S′_Δ(x)) S″_Δ(x) |_a^b.    □
With the help of this theorem we will prove the important minimum-norm property of spline functions.

(2.4.1.5) Theorem. Given a partition Δ := {a = x_0 < x_1 < ... < x_n = b} of the interval [a, b], values Y := (y_0, ..., y_n), and a function f ∈ K²[a, b] with f(x_i) = y_i for i = 0, 1, ..., n, then ‖f‖ ≥ ‖S_Δ(Y; ·)‖, and more precisely

    ‖f − S_Δ(Y; ·)‖² = ‖f‖² − ‖S_Δ(Y; ·)‖² ≥ 0

holds for every spline function S_Δ(Y; ·), provided one of the conditions [compare (2.4.1.2)]

(a)  S″_Δ(Y; a) = S″_Δ(Y; b) = 0,
(b)  f ∈ K²_p[a, b],  S_Δ(Y; ·) periodic,
(c)  S′_Δ(Y; a) = f′(a),  S′_Δ(Y; b) = f′(b),

is met. In each of these cases, the spline function S_Δ(Y; ·) is uniquely determined.
The existence of such spline functions will be shown in Section 2.4.2.
PROOF. In each of the above three cases (2.4.1.5a, b, c), the expression

    (f′(x) − S′_Δ(x)) S″_Δ(x) |_a^b − Σ_{i=1}^{n} (f(x) − S_Δ(x)) S‴_Δ(x) |_{x_{i−1}^+}^{x_i^−}

vanishes in the Holladay identity (2.4.1.4) if S_Δ ≡ S_Δ(Y; ·). This proves the minimum property of the spline function S_Δ(Y; ·). Its uniqueness can be seen as follows: suppose S̄_Δ(Y; ·) is another spline function having the same properties as S_Δ(Y; ·). Letting S̄_Δ(Y; ·) play the role of the function f ∈ K²[a, b] in the theorem, the minimum property of S_Δ(Y; ·) requires that

    ‖S̄_Δ(Y; ·) − S_Δ(Y; ·)‖² = ‖S̄_Δ(Y; ·)‖² − ‖S_Δ(Y; ·)‖² ≥ 0,

and since S_Δ(Y; ·) and S̄_Δ(Y; ·) may switch roles,

    ‖S̄_Δ(Y; ·) − S_Δ(Y; ·)‖² = ∫_a^b |S″_Δ(Y; x) − S̄″_Δ(Y; x)|² dx = 0.
Since S″_Δ(Y; ·) and S̄″_Δ(Y; ·) are both continuous,

    S″_Δ(Y; x) ≡ S̄″_Δ(Y; x),

from which

    S_Δ(Y; x) ≡ S̄_Δ(Y; x) + cx + d

follows by integration. But S_Δ(Y; x) = S̄_Δ(Y; x) holds for x = a, b, and this implies c = d = 0.    □

The minimum-norm property of the spline function expressed in Theorem (2.4.1.5) implies in case (2.4.1.2a) that, among all functions f in K²[a, b] with f(x_i) = y_i, i = 0, 1, ..., n, it is precisely the spline function S_Δ(Y; ·) with S″_Δ(Y; x) = 0 for x = a, b that minimizes the integral

    ∫_a^b |f″(x)|² dx.

The spline function of case (2.4.1.2a) is therefore often referred to as the natural spline function. In cases (2.4.1.2b) and (2.4.1.2c), the corresponding spline functions S_Δ(Y; ·) minimize ‖f‖ over the more restricted sets K²_p[a, b] and {f ∈ K²[a, b] | f′(a) = y′_0, f′(b) = y′_n} ∩ {f | f(x_i) = y_i for i = 0, 1, ..., n}, respectively.
The expression f″(x)(1 + f′(x)²)^{−3/2} indicates the curvature of the function f(x) at x ∈ [a, b]. If |f′(x)| is small compared to 1, then the curvature is approximately equal to f″(x). The value ‖f‖ therefore provides an approximate measure of the total curvature of the function f in the interval [a, b]. In this sense, the natural spline function is the "smoothest" function to interpolate given support points (x_i, y_i), i = 0, 1, ..., n.
2.4.2 Determining Interpolating Cubic Spline Functions

In this section, we will describe computational methods for determining cubic spline functions which assume prescribed values at their knots and satisfy one of the side conditions (2.4.1.2). In the course of this, we will also have proved the existence of such spline functions; their uniqueness has already been established by Theorem (2.4.1.5).
In what follows, Δ = {x_i | i = 0, 1, ..., n} will be a fixed partition of the interval [a, b] by knots a = x_0 < x_1 < ... < x_n = b, and Y = (y_i)_{i=0,1,...,n} will be a sequence of n + 1 prescribed real numbers. In addition let I_j be the subinterval I_j := [x_j, x_{j+1}], j = 0, 1, ..., n − 1, and h_{j+1} := x_{j+1} − x_j its length.
We refer to the values of the second derivatives at the knots x_j ∈ Δ,

(2.4.2.1)    M_j := S″_Δ(Y; x_j),   j = 0, 1, ..., n,
of the desired spline function S_Δ(Y; ·) as the moments M_j of S_Δ(Y; ·). We will show that spline functions are readily characterized by their moments, and that the moments of the interpolating spline function can be calculated as the solution of a system of linear equations.
Note that the second derivative S″_Δ(Y; ·) of the spline function coincides with a linear function in each interval [x_j, x_{j+1}], j = 0, ..., n − 1, and that these linear functions can be described in terms of the moments M_j of S_Δ(Y; ·):

    S″_Δ(Y; x) = M_j (x_{j+1} − x)/h_{j+1} + M_{j+1} (x − x_j)/h_{j+1}   for x ∈ [x_j, x_{j+1}].
By integration,

(2.4.2.2)
    S′_Δ(Y; x) = −M_j (x_{j+1} − x)²/(2h_{j+1}) + M_{j+1} (x − x_j)²/(2h_{j+1}) + A_j,
    S_Δ(Y; x)  =  M_j (x_{j+1} − x)³/(6h_{j+1}) + M_{j+1} (x − x_j)³/(6h_{j+1}) + A_j (x − x_j) + B_j,

for x ∈ [x_j, x_{j+1}], j = 0, 1, ..., n − 1, where A_j, B_j are constants of integration. From S_Δ(Y; x_j) = y_j, S_Δ(Y; x_{j+1}) = y_{j+1} we obtain the following equations for these constants A_j and B_j:

    M_j h_{j+1}²/6 + B_j = y_j,
    M_{j+1} h_{j+1}²/6 + A_j h_{j+1} + B_j = y_{j+1}.

Consequently,

(2.4.2.3)    B_j = y_j − M_j h_{j+1}²/6,
             A_j = (y_{j+1} − y_j)/h_{j+1} − (h_{j+1}/6)(M_{j+1} − M_j).

This yields the following representation of the spline function in terms of its moments:

(2.4.2.4)    S_Δ(Y; x) = α_j + β_j (x − x_j) + γ_j (x − x_j)² + δ_j (x − x_j)³   for x ∈ [x_j, x_{j+1}],

where
    α_j := y_j,    γ_j := M_j/2,

    β_j := S′_Δ(Y; x_j) = −M_j h_{j+1}/2 + A_j = −(2M_j + M_{j+1}) h_{j+1}/6 + (y_{j+1} − y_j)/h_{j+1},

    δ_j := S‴_Δ(Y; x_j^+)/6 = (M_{j+1} − M_j)/(6 h_{j+1}).

Thus S_Δ(Y; ·) has been characterized by its moments M_j. The task of calculating these moments will now be addressed.
The continuity of S′_Δ(Y; ·) at the interior knots x = x_j, j = 1, 2, ..., n − 1 [namely, the relations S′_Δ(Y; x_j^−) = S′_Δ(Y; x_j^+)] yields n − 1 equations for the moments M_j. Substituting the values (2.4.2.3) for A_j and B_j in (2.4.2.2) gives for x ∈ [x_j, x_{j+1}]
    S′_Δ(Y; x) = −M_j (x_{j+1} − x)²/(2h_{j+1}) + M_{j+1} (x − x_j)²/(2h_{j+1}) + (y_{j+1} − y_j)/h_{j+1} − (h_{j+1}/6)(M_{j+1} − M_j).

For j = 1, 2, ..., n − 1, we have therefore

    S′_Δ(Y; x_j^−) = (y_j − y_{j−1})/h_j + (h_j/3) M_j + (h_j/6) M_{j−1},
    S′_Δ(Y; x_j^+) = (y_{j+1} − y_j)/h_{j+1} − (h_{j+1}/3) M_j − (h_{j+1}/6) M_{j+1},

and since S′_Δ(Y; x_j^−) = S′_Δ(Y; x_j^+),

(2.4.2.5)    (h_j/6) M_{j−1} + ((h_j + h_{j+1})/3) M_j + (h_{j+1}/6) M_{j+1} = (y_{j+1} − y_j)/h_{j+1} − (y_j − y_{j−1})/h_j

for j = 1, 2, ..., n − 1. These are n − 1 equations for the n + 1 unknown moments. Two further equations can be gained separately from each of the side conditions (a), (b), and (c) listed in (2.4.1.2).
Case (a):

    S″_Δ(Y; a) = M_0 = 0 = M_n = S″_Δ(Y; b).

Case (b):

    S″_Δ(Y; a) = S″_Δ(Y; b)  ⟹  M_0 = M_n,
    S′_Δ(Y; a) = S′_Δ(Y; b)   ⟹  (h_n/6) M_{n−1} + ((h_n + h_1)/3) M_n + (h_1/6) M_1 = (y_1 − y_0)/h_1 − (y_n − y_{n−1})/h_n.
The latter condition is identical with (2.4.2.5) for j = n if we put

    h_{n+1} := h_1,   y_{n+1} := y_1,   M_{n+1} := M_1.

Recall that (2.4.1.2b) requires y_n = y_0.

Case (c):

    S′_Δ(Y; a) = y′_0  ⟹  (h_1/3) M_0 + (h_1/6) M_1 = (y_1 − y_0)/h_1 − y′_0,
    S′_Δ(Y; b) = y′_n  ⟹  (h_n/6) M_{n−1} + (h_n/3) M_n = y′_n − (y_n − y_{n−1})/h_n.

The last two equations, as well as those in (2.4.2.5), can be written in a common format:

    μ_j M_{j−1} + 2 M_j + λ_j M_{j+1} = d_j,   j = 1, 2, ..., n − 1,

upon introducing the abbreviations (j = 1, 2, ..., n − 1)

(2.4.2.6)    λ_j := h_{j+1}/(h_j + h_{j+1}),   μ_j := 1 − λ_j = h_j/(h_j + h_{j+1}),
             d_j := (6/(h_j + h_{j+1})) ((y_{j+1} − y_j)/h_{j+1} − (y_j − y_{j−1})/h_j).

In case (a), we define in addition

(2.4.2.7)    λ_0 := 0,   d_0 := 0,   μ_n := 0,   d_n := 0,

and in case (c)

(2.4.2.8)    λ_0 := 1,   d_0 := (6/h_1) ((y_1 − y_0)/h_1 − y′_0),
             μ_n := 1,   d_n := (6/h_n) (y′_n − (y_n − y_{n−1})/h_n).
This leads in cases (a) and (c) to a system of linear equations for the moments M_j that reads in matrix notation:

(2.4.2.9)
    \begin{pmatrix}
      2     & \lambda_0 &           &           &               \\
      \mu_1 & 2         & \lambda_1 &           &               \\
            & \ddots    & \ddots    & \ddots    &               \\
            &           & \mu_{n-1} & 2         & \lambda_{n-1} \\
            &           &           & \mu_n     & 2
    \end{pmatrix}
    \begin{pmatrix} M_0 \\ M_1 \\ \vdots \\ M_{n-1} \\ M_n \end{pmatrix}
    =
    \begin{pmatrix} d_0 \\ d_1 \\ \vdots \\ d_{n-1} \\ d_n \end{pmatrix}.
The periodic case (b) also requires further definitions,
(2.4.2.10)    λ_n := h_1/(h_n + h_1),   μ_n := 1 − λ_n = h_n/(h_n + h_1),
              d_n := (6/(h_n + h_1)) { (y_1 − y_n)/h_1 − (y_n − y_{n−1})/h_n },

which then lead to the following linear system of equations for the moments M_1, M_2, ..., M_n (= M_0):

(2.4.2.11)
    \begin{pmatrix}
      2         & \lambda_1 &           &           & \mu_1         \\
      \mu_2     & 2         & \lambda_2 &           &               \\
                & \ddots    & \ddots    & \ddots    &               \\
                &           & \mu_{n-1} & 2         & \lambda_{n-1} \\
      \lambda_n &           &           & \mu_n     & 2
    \end{pmatrix}
    \begin{pmatrix} M_1 \\ M_2 \\ \vdots \\ M_{n-1} \\ M_n \end{pmatrix}
    =
    \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_{n-1} \\ d_n \end{pmatrix}.
The coefficients λ_i, μ_i, d_i in (2.4.2.9) and (2.4.2.11) are well defined by (2.4.2.6) and the additional definitions (2.4.2.7), (2.4.2.8), and (2.4.2.10), respectively. Note in particular that in (2.4.2.9) and (2.4.2.11)

(2.4.2.12)    λ_i ≥ 0,   μ_i ≥ 0,   λ_i + μ_i = 1

for all coefficients λ_i, μ_i, and that these coefficients depend only on the location of the knots x_j ∈ Δ and not on the prescribed values y_i ∈ Y or on y′_0, y′_n in case (c). We will use this observation when proving the following:

(2.4.2.13) Theorem. The systems (2.4.2.9) and (2.4.2.11) of linear equations are nonsingular for any partition Δ of [a, b].

This means that the above systems of linear equations have unique solutions for arbitrary right-hand sides, and that consequently the problem of interpolation by cubic splines has a unique solution in each of the three cases (a), (b), (c) of (2.4.1.2).
PROOF. Consider the (n + 1) × (n + 1) matrix

    A = \begin{pmatrix}
      2     & \lambda_0 &           &           &               \\
      \mu_1 & 2         & \lambda_1 &           &               \\
            & \ddots    & \ddots    & \ddots    &               \\
            &           & \mu_{n-1} & 2         & \lambda_{n-1} \\
            &           &           & \mu_n     & 2
    \end{pmatrix}

of the linear system (2.4.2.9). This matrix has the following property:

(2.4.2.14)    Az = w  ⟹  max_i |z_i| ≤ max_i |w_i|

for every pair of vectors z = (z_0, ..., z_n)^T, w = (w_0, ..., w_n)^T, z, w ∈ ℝ^{n+1}. Indeed, let r be such that |z_r| = max_i |z_i|. From Az = w,
    μ_r z_{r−1} + 2 z_r + λ_r z_{r+1} = w_r   (μ_0 := λ_n := 0).

By the definition of r and because μ_r + λ_r = 1,

    max_i |w_i| ≥ |w_r| ≥ 2|z_r| − μ_r|z_{r−1}| − λ_r|z_{r+1}|
                ≥ 2|z_r| − μ_r|z_r| − λ_r|z_r|
                = (2 − μ_r − λ_r)|z_r|
                = |z_r| = max_i |z_i|.

Suppose the matrix A were singular. Then there would exist a solution z ≠ 0 of Az = 0, and (2.4.2.14) would lead to the contradiction

    0 < max_i |z_i| ≤ 0.

The nonsingularity of the matrix in (2.4.2.11) is shown similarly.    □
To solve the equations (2.4.2.9), we may proceed as follows: subtract μ_1/2 times the first equation from the second, thereby annihilating μ_1, and then a suitable multiple of the second equation from the third to annihilate μ_2, and so on. This leads to a "triangular" system of equations which can be solved in a straightforward fashion [note that this method is the Gaussian elimination algorithm applied to (2.4.2.9); compare Section 4.1]:

(2.4.2.15)
    q_0 := −λ_0/2;   u_0 := d_0/2;   λ_n := 0;
    for k := 1, 2, ..., n:
        p_k := μ_k q_{k−1} + 2;
        q_k := −λ_k/p_k;
        u_k := (d_k − μ_k u_{k−1})/p_k;
    M_n := u_n;
    for k := n − 1, n − 2, ..., 0:
        M_k := q_k M_{k+1} + u_k;

[It can be shown that p_k > 0, so that (2.4.2.15) is well defined; see Exercise 25.] The linear system (2.4.2.11) can be solved in a similar, but not quite as straightforward, fashion. An ALGOL program by C. Reinsch can be found in Bulirsch and Rutishauser (1968).
The reader can find more details in Greville (1969) and de Boor (1972), ALGOL programs in Herriot and Reinsch (1971), and FORTRAN programs in de Boor (1978). These references also contain information and algorithms for spline functions of degree k ≥ 3 and B-splines, which are treated here in Sections 2.4.4 and 2.4.5.
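The elimination (2.4.2.15) is easy to program. The following sketch is a minimal Python transcription for the natural boundary conditions of case (2.4.1.2a); the function names and the evaluation helper are illustrative and are not the ALGOL/FORTRAN codes cited above. The spline is evaluated through the local representation (2.4.2.4).

    import numpy as np

    def natural_spline_moments(x, y):
        # moments M_j of the natural cubic spline: system (2.4.2.9), case (a)
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x) - 1
        h = np.diff(x)                       # h[j] = h_{j+1} = x_{j+1} - x_j
        lam, mu, d = np.zeros(n + 1), np.zeros(n + 1), np.zeros(n + 1)
        for j in range(1, n):                # (2.4.2.6); boundary rows stay 0 in case (a)
            lam[j] = h[j] / (h[j - 1] + h[j])
            mu[j] = 1.0 - lam[j]
            d[j] = 6.0 / (h[j - 1] + h[j]) * ((y[j + 1] - y[j]) / h[j]
                                              - (y[j] - y[j - 1]) / h[j - 1])
        q, u, M = np.zeros(n + 1), np.zeros(n + 1), np.zeros(n + 1)
        q[0], u[0] = -lam[0] / 2.0, d[0] / 2.0
        for k in range(1, n + 1):            # forward elimination (2.4.2.15)
            p = mu[k] * q[k - 1] + 2.0
            q[k] = -lam[k] / p
            u[k] = (d[k] - mu[k] * u[k - 1]) / p
        M[n] = u[n]
        for k in range(n - 1, -1, -1):       # back substitution
            M[k] = q[k] * M[k + 1] + u[k]
        return M

    def spline_eval(x, y, M, xx):
        # evaluate S_D(Y; xx) via (2.4.2.4) on the subinterval containing xx
        x, y = np.asarray(x, float), np.asarray(y, float)
        j = int(np.clip(np.searchsorted(x, xx, side='right') - 1, 0, len(x) - 2))
        h = x[j + 1] - x[j]
        beta = (y[j + 1] - y[j]) / h - (2.0 * M[j] + M[j + 1]) * h / 6.0
        delta = (M[j + 1] - M[j]) / (6.0 * h)
        dx = xx - x[j]
        return y[j] + beta * dx + 0.5 * M[j] * dx**2 + delta * dx**3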
2.4.3 Convergence Properties of Cubic Spline Functions

Interpolating polynomials may not converge to a function f whose values they interpolate, even if the partitions Δ are chosen arbitrarily fine [see Section 2.1.4]. In contrast, we will show in this section that, under mild conditions on the function f and the partitions Δ, the interpolating spline functions do converge towards f as the fineness of the underlying partitions approaches zero.
We will show first that the moments (2.4.2.1) of the interpolating spline function converge to the second derivatives of the given function. More precisely, consider a fixed partition Δ = {a = x_0 < x_1 < ... < x_n = b} of [a, b], and let

    M := (M_0, M_1, ..., M_n)^T

be the vector of moments M_j of the spline function S_Δ(Y; ·) with f(x_j) = y_j for j = 0, ..., n as well as

    S′_Δ(Y; a) = f′(a),   S′_Δ(Y; b) = f′(b).

We are thus dealing with case (c) of (2.4.1.2). The vector M of moments satisfies the equation

    AM = d,

which expresses the linear system of equations (2.4.2.9) in matrix form. The components d_j of d are given by (2.4.2.6) and (2.4.2.8). Let F and r be the vectors

    F := (f″(x_0), f″(x_1), ..., f″(x_n))^T,    r := d − AF = A(M − F).

Writing ‖z‖ := max_i |z_i| for vectors z, and ‖Δ‖ for the fineness

(2.4.3.1)    ‖Δ‖ := max_j |x_{j+1} − x_j|

of the partition Δ, we have:

(2.4.3.2)    If f ∈ C⁴[a, b] and |f^{(4)}(x)| ≤ L for x ∈ [a, b], then

    ‖M − F‖ ≤ ‖r‖ ≤ (3/4) L ‖Δ‖².

PROOF. By definition, r_0 = d_0 − 2f″(x_0) − f″(x_1), and by (2.4.2.8),

    r_0 = (6/h_1) ((y_1 − y_0)/h_1 − y′_0) − 2f″(x_0) − f″(x_1).
Using Taylor's theorem to express y_1 = f(x_1) and f″(x_1) in terms of the value and the derivatives of the function f at x_0, one finds |r_0| ≤ (3/4) L h_1² ≤ (3/4) L ‖Δ‖². Analogously, we find for

    r_n = (6/h_n) (y′_n − (y_n − y_{n−1})/h_n) − f″(x_{n−1}) − 2f″(x_n)

that |r_n| ≤ (3/4) L ‖Δ‖². For the remaining components of r = d − AF, we find similarly

    r_j = d_j − μ_j f″(x_{j−1}) − 2f″(x_j) − λ_j f″(x_{j+1})
        = (6/(h_j + h_{j+1})) ((y_{j+1} − y_j)/h_{j+1} − (y_j − y_{j−1})/h_j)
          − (h_j/(h_j + h_{j+1})) f″(x_{j−1}) − 2f″(x_j) − (h_{j+1}/(h_j + h_{j+1})) f″(x_{j+1}).

Taylor's formula at x_j then shows that the third-derivative terms cancel and only fourth-derivative remainders survive,

    r_j = (h_{j+1}³ f^{(4)}(τ_1) + h_j³ f^{(4)}(τ_2))/(4(h_j + h_{j+1})) − (h_j³ f^{(4)}(τ_3) + h_{j+1}³ f^{(4)}(τ_4))/(2(h_j + h_{j+1})).

Here the intermediate points τ_i lie in [x_{j−1}, x_{j+1}]. Therefore

    |r_j| ≤ (3/4) L (h_j³ + h_{j+1}³)/(h_j + h_{j+1}) ≤ (3/4) L ‖Δ‖²

for j = 1, 2, ..., n − 1. In sum, ‖r‖ ≤ (3/4) L ‖Δ‖², and since r = A(M − F), (2.4.2.14) implies

    ‖M − F‖ ≤ ‖r‖ ≤ (3/4) L ‖Δ‖².    □
(2.4.3.3) Theorem.⁶ Suppose f ∈ C⁴[a, b] and |f^{(4)}(x)| ≤ L for x ∈ [a, b]. Let Δ be a partition Δ = {a = x_0 < x_1 < ... < x_n = b} of the interval [a, b], and K a constant such that

    ‖Δ‖ / |x_{j+1} − x_j| ≤ K   for j = 0, 1, ..., n − 1.

If S_Δ is the spline function which interpolates the values of the function f at the knots x_0, ..., x_n ∈ Δ and satisfies S′_Δ(x) = f′(x) for x = a, b, then there exist constants c_k ≤ 2, which do not depend on the partition Δ, such that for x ∈ [a, b],

    |f^{(k)}(x) − S_Δ^{(k)}(x)| ≤ c_k L ‖Δ‖^{4−k}   for k = 0, 1, 2,
    |f^{(3)}(x) − S_Δ^{(3)}(x)| ≤ c_3 L K ‖Δ‖       for k = 3.

Note that the constant K ≥ 1 bounds the deviation of the partition Δ from uniformity.

⁶ The estimates of Theorem (2.4.3.3) have been improved by Hall and Meyer (1976): |f^{(k)}(x) − S_Δ^{(k)}(x)| ≤ c_k L ‖Δ‖^{4−k}, k = 0, 1, 2, 3, with c_0 := 5/384, c_1 := 1/24, c_2 := 3/8, c_3 := (K + K^{−1})/2. Here c_0 and c_1 are optimal.

PROOF. We prove the proposition first for k = 3. For x ∈ [x_{j−1}, x_j],

    S‴_Δ(x) − f‴(x) = (M_j − M_{j−1})/h_j − f‴(x)
        = (M_j − f″(x_j))/h_j − (M_{j−1} − f″(x_{j−1}))/h_j
          + [ (f″(x_j) − f″(x)) − (f″(x_{j−1}) − f″(x)) ]/h_j − f‴(x).

Using (2.4.3.2) and Taylor's theorem at x, we conclude that
    |S‴_Δ(x) − f‴(x)| ≤ (3/4) L ‖Δ‖²/h_j + (3/4) L ‖Δ‖²/h_j
        + (1/h_j) | (x_j − x) f‴(x) + ((x_j − x)²/2) f^{(4)}(η_1)
                    − (x_{j−1} − x) f‴(x) − ((x_{j−1} − x)²/2) f^{(4)}(η_2) − h_j f‴(x) |
        ≤ (3/2) L ‖Δ‖²/h_j + (1/2) L ‖Δ‖²/h_j = 2 L ‖Δ‖²/h_j ≤ 2 L K ‖Δ‖,

where η_1, η_2 ∈ [x_{j−1}, x_j]. Thus |f‴(x) − S‴_Δ(x)| ≤ 2 L K ‖Δ‖.
To prove the proposition for k = 2, we observe: for each x ∈ [a, b] there exists a closest knot x_j = x_j(x). Assume without loss of generality x ≤ x_j(x) = x_j, so that x ∈ [x_{j−1}, x_j] and |x_j(x) − x| ≤ h_j/2 ≤ ‖Δ‖/2. From

    f″(x) − S″_Δ(x) = f″(x_j(x)) − S″_Δ(x_j(x)) + ∫_{x_j(x)}^{x} (f‴(t) − S‴_Δ(t)) dt,

(2.4.3.2) and the estimate just proved we find

    |f″(x) − S″_Δ(x)| ≤ (3/4) L ‖Δ‖² + (h_j/2) · 2 L ‖Δ‖²/h_j = (7/4) L ‖Δ‖².

We consider k = 1 next. In addition to the boundary points ξ_0 := a, ξ_{n+1} := b, there exist, by Rolle's theorem, n further points ξ_j ∈ (x_{j−1}, x_j), j = 1, 2, ..., n, with

    f′(ξ_j) = S′_Δ(ξ_j),   j = 0, 1, ..., n + 1.

For any x ∈ [a, b] there exists a closest one of the above points ξ_j = ξ_j(x), for which consequently |x − ξ_j(x)| ≤ ‖Δ‖. Thus

    f′(x) − S′_Δ(x) = ∫_{ξ_j(x)}^{x} (f″(t) − S″_Δ(t)) dt,

and

    |f′(x) − S′_Δ(x)| ≤ (7/4) L ‖Δ‖³,   x ∈ [a, b].

The case k = 0 remains. Since

    f(x) − S_Δ(x) = ∫_{x_j(x)}^{x} (f′(t) − S′_Δ(t)) dt,

it follows from the above result for k = 1 that

    |f(x) − S_Δ(x)| ≤ (7/8) L ‖Δ‖⁴,   x ∈ [a, b].    □
Clearly, (2.4.3.3) implies that for sequences

    Δ_m = {a = x_0^{(m)} < x_1^{(m)} < ... < x_{n_m}^{(m)} = b}

of partitions with ‖Δ_m‖ → 0 the corresponding cubic spline functions S_{Δ_m} and their first two derivatives converge to f and its corresponding derivatives uniformly on [a, b]. If in addition

    sup_{m,j} ‖Δ_m‖ / |x_{j+1}^{(m)} − x_j^{(m)}| ≤ K < +∞,

even the third derivative f‴ is uniformly approximated by S‴_{Δ_m}, a usually discontinuous sequence of step functions.
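A quick numerical experiment illustrates the O(‖Δ‖⁴) behavior asserted by Theorem (2.4.3.3). The sketch below is not from the book: it uses SciPy's clamped cubic spline for case (2.4.1.2c), and the test function and uniform partitions are arbitrary choices.

    import numpy as np
    from scipy.interpolate import CubicSpline

    f, fp = np.sin, np.cos
    for n in (8, 16, 32, 64):
        x = np.linspace(0.0, np.pi, n + 1)
        s = CubicSpline(x, f(x), bc_type=((1, fp(x[0])), (1, fp(x[-1]))))
        xx = np.linspace(0.0, np.pi, 2001)
        err = np.max(np.abs(f(xx) - s(xx)))
        print(n, err, err * n**4)   # err * n^4 should stay roughly constant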
2.4.4 B-Splines

Spline functions are instances of piecewise polynomial functions associated with a partition

    Δ = {a = x_0 < x_1 < ... < x_n = b}

of an interval [a, b] by abscissae or knots x_i. In general, a real function f: [a, b] → ℝ is called a piecewise polynomial function of order r or degree r − 1 if, for each i = 0, ..., n − 1, the restriction of f to the subinterval (x_i, x_{i+1}) agrees with a polynomial p_i(x) of degree ≤ r − 1, p_i ∈ Π_{r−1}.
In order to achieve a 1–1 correspondence between f and the sequence (p_0(x), p_1(x), ..., p_{n−1}(x)), we define f at the knots x_i, i = 0, ..., n − 1, so that it becomes continuous from the right, f(x_i) := f(x_i + 0), 0 ≤ i ≤ n − 1, and f(x_n) = f(b) := f(x_n − 0).
Then, generalizing cubic splines, we define spline functions S_Δ of degree k as piecewise polynomial functions of degree k that are k − 1 times differentiable at the interior knots x_i, 1 ≤ i ≤ n − 1, of Δ. By S_{Δ,k} we denote the set of all spline functions S_Δ of degree k, which is easily seen to be a real vector space of dimension n + k. In fact, the polynomial S_Δ|[x_0, x_1] is uniquely determined by its k + 1 coefficients, and this in turn already fixes the first k − 1 derivatives (= k conditions) of the next polynomial S_Δ|[x_1, x_2] at x_1, so that only one degree of freedom is left for choosing S_Δ|[x_1, x_2]. As the same holds for all subsequent polynomials S_Δ|[x_i, x_{i+1}], i = 2, ..., n − 1, one finds dim S_{Δ,k} = k + 1 + (n − 1)·1 = k + n.
B-splines are special piecewise polynomial functions with remarkable properties: they are nonnegative and vanish everywhere except on a few contiguous intervals [x_i, x_{i+1}]. Moreover, the function space S_{Δ,k} has a basis
consisting of B-splines. As a consequence, B-splines provide the framework
for an efficient and numerically stable calculation of splines.
In order to define B-splines, we introduce the function f_x: ℝ → ℝ,

    f_x(t) := (t − x)_+ := max(t − x, 0) = { t − x for t > x,  0 for t ≤ x },

and its powers f_x^r, r ≥ 0, in particular for r = 0,

    f_x^0(t) := { 1 for t > x,  0 for t ≤ x }.

The function f_x^r(·) is composed of two polynomials of degree ≤ r: the 0-polynomial p_0(t) := 0 for t ≤ x and the polynomial p_1(t) := (t − x)^r for t > x. Note that f_x^r depends on a real parameter x, and the definition of f_x^r(t) is such that it is a function of x which is continuous from the right for each fixed t. Clearly, for r ≥ 1, f_x^r(t) is (r − 1) times continuously differentiable both with respect to t and with respect to x.
Recall further that under certain conditions the generalized divided difference f[t_i, t_{i+1}, ..., t_{i+r}] of a real function f(t), f: ℝ → ℝ, is well defined for any segment t_i ≤ t_{i+1} ≤ ... ≤ t_{i+r} of real numbers, even if the t_j are not mutually distinct [see Section 2.1.5, in particular (2.1.5.4)]. The only requirement is that f be (s_j − 1)-times differentiable at t = t_j, j = i, i + 1, ..., i + r, if the value of t_j occurs s_j times in the subsegment t_i ≤ t_{i+1} ≤ ... ≤ t_j terminating with t_j. In this case, by (2.1.5.7)

(2.4.4.1)    f[t_i, ..., t_{i+r}] := f^{(r)}(t_i)/r!   if t_i = t_{i+r},
             f[t_i, ..., t_{i+r}] := (f[t_{i+1}, ..., t_{i+r}] − f[t_i, ..., t_{i+r−1}])/(t_{i+r} − t_i)   otherwise.

It follows by induction over r that the divided differences of the function f are linear combinations of its derivatives at the points t_j:

(2.4.4.2)    f[t_i, t_{i+1}, ..., t_{i+r}] = Σ_{j=i}^{i+r} α_j f^{(s_j−1)}(t_j),

where, in particular, α_{i+r} ≠ 0.
Now let r ≥ 1 be an integer and t = {t_i}_{i∈ℤ} any infinite nondecreasing unbounded sequence of reals t_i with

    inf_i t_i = −∞,   sup_i t_i = +∞,   and   t_i < t_{i+r} for all i ∈ ℤ.

Then the i-th B-spline of order r associated with t is defined as the following function in x:

(2.4.4.3)    B_{i,r,t}(x) := (t_{i+r} − t_i) · f_x^{r−1}[t_i, t_{i+1}, ..., t_{i+r}],
for which we also sometimes write more briefly B_i or B_{i,r} [see Figure 2 for simple examples].
Clearly, B_{i,r,t}(x) is well defined for all x ≠ t_i, t_{i+1}, ..., t_{i+r} and, by (2.4.4.2), (2.4.4.3) is a linear combination of the functions

(2.4.4.4a)    (t_j − x)_+^{r−s_j},   i ≤ j ≤ i + r,

if the value of t_j occurs s_j times within the subsegment t_i ≤ t_{i+1} ≤ ... ≤ t_j. If in addition the value of t_j occurs n_j times within the full segment t_i ≤ t_{i+1} ≤ ... ≤ t_{i+r}, then the definition of s_j shows that for each index σ with 1 ≤ σ ≤ n_j there exists exactly one integer l = l(σ) satisfying

    t_l = t_j,   s_l = σ,   i ≤ l ≤ i + r.

Thus B_{i,r,t} is also a linear combination of the functions

(2.4.4.4b)    (t_j − x)_+^{s},   where r − n_j ≤ s ≤ r − 1,  i ≤ j ≤ i + r.

Hence the function B_{i,r,t} coincides with a polynomial of degree at most r − 1 on each of the open intervals in the set

    { (−∞, t_i), (t_i, t_{i+1}), ..., (t_{i+r−1}, t_{i+r}), (t_{i+r}, +∞) }.

That is, B_{i,r,t}(x) is a piecewise polynomial function in x of order r with respect to the partition of the real axis given by the t_k with k ∈ T_{i,r}, where T_{i,r} denotes the set of indices of the distinct values among t_i, t_{i+1}, ..., t_{i+r}.
Only at the knots x = t_k, k ∈ T_{i,r}, may the function B_{i,r,t}(x) have jump discontinuities, but it is continuous from the right, because the functions (t_j − x)_+^{s} occurring in (2.4.4.4) are functions of x that are continuous from the right everywhere, even when s = 0. Also by (2.4.4.4), for given t = (t_j), the B-spline B_{i,r}(x) ≡ B_{i,r,t}(x) is (r − n_j − 1) times differentiable at x = t_j, if t_j occurs n_j times within t_i, t_{i+1}, ..., t_{i+r} (if n_j = r then B_{i,r}(x) has a jump discontinuity at x = t_j). Hence the order of differentiability of B_{i,r,t}(x) at x = t_j is governed by the number of repetitions of the value of t_j.
We note several important properties of B-splines:

(2.4.4.5) Theorem. (a) B_{i,r,t}(x) = 0 for x ∉ [t_i, t_{i+r}].
(b) B_{i,r,t}(x) > 0 for t_i < x < t_{i+r}.
(c) For all x ∈ ℝ

    Σ_{i∈ℤ} B_{i,r,t}(x) = 1,

and the sum contains only finitely many nonzero terms.
By that theorem, the functions B_{i,r} ≡ B_{i,r,t} are nonnegative weight functions with sum 1 and support [t_i, t_{i+r}],

    supp B_{i,r} := {x | B_{i,r}(x) ≠ 0} = [t_i, t_{i+r}];

they form a "partition of unity."

EXAMPLE. For r = 5 and a sequence t with ... ≤ t_0 < t_1 = t_2 = t_3 < t_4 = t_5 < t_6 ≤ ..., the B-spline B_{1,5} = B_{1,5,t} of order 5 is a piecewise polynomial function of degree 4 with respect to the partition t_3 < t_5 < t_6 of ℝ. For x = t_3, t_5, t_6 it has continuous derivatives up to the orders 1, 2, and 3, respectively, as n_3 = 3, n_5 = 2, and n_6 = 1.

Fig. 2. Some simple B-splines.

PROOF. (a) For x < t_i ≤ t ≤ t_{i+r}, f_x^{r−1}(t) = (t − x)^{r−1} is a polynomial of degree r − 1 in t, which has a vanishing r-th divided difference:

    f_x^{r−1}[t_i, t_{i+1}, ..., t_{i+r}] = 0   ⟹   B_{i,r}(x) = 0

by Theorem (2.1.3.10). On the other hand, if t_i ≤ t ≤ t_{i+r} < x, then f_x^{r−1}(t) = (t − x)_+^{r−1} = 0 trivially, so that again B_{i,r}(x) = 0.
(b) For r = 1 and t_i < x < t_{i+1}, the assertion follows from the definition:

    B_{i,1}(x) = (t_{i+1} − x)_+^0 − (t_i − x)_+^0 = 1 − 0 = 1;

for r > 1 it follows from the recursion (2.4.5.2) for the functions N_{i,r} := B_{i,r}/(t_{i+r} − t_i), which will be derived later on.
(c) Assume first t_j < x < t_{j+1}. Then by (a), B_{i,r}(x) = 0 for all i with i + r ≤ j and for all i ≥ j + 1, so that

    Σ_{i∈ℤ} B_{i,r}(x) = Σ_{i=j−r+1}^{j} B_{i,r}(x).

Now, (2.4.4.1) implies
    (t_{i+r} − t_i) f_x^{r−1}[t_i, ..., t_{i+r}] = f_x^{r−1}[t_{i+1}, ..., t_{i+r}] − f_x^{r−1}[t_i, ..., t_{i+r−1}].

Therefore, the sum telescopes:

    Σ_{i=j−r+1}^{j} B_{i,r}(x) = f_x^{r−1}[t_{j+1}, ..., t_{j+r}] − f_x^{r−1}[t_{j−r+1}, ..., t_j] = 1 − 0 = 1.

The last equality holds because the function f_x^{r−1}(t) = (t − x)^{r−1} is a polynomial of degree r − 1 in t for t_j < x < t_{j+1} ≤ t ≤ t_{j+r}, for which f_x^{r−1}[t_{j+1}, ..., t_{j+r}] = 1 by (2.1.4.3), and the function f_x^{r−1}(t) = (t − x)_+^{r−1} = 0 vanishes for t_{j−r+1} ≤ t ≤ t_j < x < t_{j+1}. For x = t_j, the assertion follows because of the continuity of B_{i,r} from the right.    □

We now return to the space S_{Δ,k} of splines of degree k ≥ 0 and construct a sequence t = (t_j) so that the B-splines B_{i,k+1,t}(x) of order k + 1 form a basis of S_{Δ,k}. For this purpose, we associate the partition

    Δ = {a = x_0 < x_1 < ... < x_n = b}

with any infinite unbounded nondecreasing sequence t = (t_j)_{j∈ℤ} satisfying

    t_j < t_{j+k+1}   for all j,
    t_{−k} ≤ ... ≤ t_0 := x_0 < t_1 := x_1 < ... < t_n := x_n.

In the following only its finite subsequence (t_{−k}, t_{−k+1}, ..., t_{n+k}) will matter. Then a (special case of a) famous result of Curry and Schoenberg (1966) is

(2.4.4.6) Theorem. The restrictions B_{i,k+1,t}|[a, b] of the n + k B-splines B_i ≡ B_{i,k+1,t}, i = −k, −k + 1, ..., n − 1, to [a, b] form a basis of S_{Δ,k}.

PROOF. Each B_i, −k ≤ i ≤ n − 1, agrees with a polynomial of degree ≤ k on the subintervals [x_j, x_{j+1}] of [a, b], and has derivatives up to the order k − n_j = k − 1 at the interior knots t_j = x_j, j = 1, 2, ..., n − 1, of Δ, since these t_j occur only once in t, n_j = 1. This shows B_i|[a, b] ∈ S_{Δ,k} for all i = −k, ..., n − 1. On the other hand we have seen dim S_{Δ,k} = n + k. Hence it remains to show only that the functions B_i, −k ≤ i ≤ n − 1, are linearly independent on [a, b].
Assume Σ_{i=−k}^{n−1} a_i B_i(x) = 0 for all x ∈ [a, b]. Consider the first subinterval [x_0, x_1]. Then
    Σ_{i=−k}^{0} a_i B_i(x) = 0   for x ∈ [x_0, x_1],

because B_i|[x_0, x_1] = 0 for i ≥ 1 by Theorem (2.4.4.5)(a). We now prove that B_{−k}, ..., B_0 are linearly independent on [x_0, x_1], which implies a_{−k} = a_{−k+1} = ... = a_0 = 0.
We claim that for x ∈ [x_0, x_1]

    span {B_{−k}(x), ..., B_0(x)} = span {(t_1 − x)^{k+1−s_1}, ..., (t_{k+1} − x)^{k+1−s_{k+1}}},

if the number t_j occurs exactly s_j times in the sequence t_{−k} ≤ t_{−k+1} ≤ ... ≤ t_j. This is easily seen by induction since by (2.4.4.2), (2.4.4.4a) each B_i(x), −k ≤ i ≤ 0, has the form

    B_i(x) = b_1 (t_1 − x)^{k+1−s_1} + ... + b_{i+k+1} (t_{i+k+1} − x)^{k+1−s_{i+k+1}},   where b_{i+k+1} ≠ 0.

Therefore it suffices to establish the linear independence of the functions

    (t_1 − x)^{k−(s_1−1)}, ..., (t_{k+1} − x)^{k−(s_{k+1}−1)}

on [x_0, x_1], or equivalently, of the normalized functions

(2.4.4.7)    p^{(s_j−1)}(t_j − x),   j = 1, 2, ..., k + 1,

defined by

    p(t) := t^k / k!.

To prove this, assume for some constants c_1, c_2, ..., c_{k+1}

    Σ_{j=1}^{k+1} c_j p^{(s_j−1)}(t_j − x) = 0   for all x ∈ [x_0, x_1].

Then by Taylor expansion the polynomial

    Σ_{j=1}^{k+1} c_j Σ_{l=0}^{k} p^{(l+s_j−1)}(t_j) (−x)^l / l!

vanishes on [x_0, x_1] and therefore must be the zero polynomial, so that for all l = 0, 1, ..., k

    Σ_{j=1}^{k+1} c_j p^{(l+s_j−1)}(t_j) = 0.

But this means that

    Σ_{j=1}^{k+1} c_j Π^{(s_j−1)}(t_j) = 0,
where Π(t) is the row vector

    Π(t) := [p^{(k)}(t), p^{(k−1)}(t), ..., p(t)] = [1, t, ..., t^k/k!].

However, the vectors Π^{(s_j−1)}(t_j), j = 1, 2, ..., k + 1, are linearly independent [see Corollary (2.1.5.5)] because of the unique solvability of the Hermite interpolation problem with respect to the sequence t_1 ≤ t_2 ≤ ... ≤ t_{k+1}. Hence c_1 = ... = c_{k+1} = 0, and therefore also a_i = 0 for i = −k, −k + 1, ..., 0.
Repeating the same reasoning for each subinterval [x_j, x_{j+1}], j = 1, 2, ..., n − 1, of [a, b] we find a_i = 0 for all i = −k, −k + 1, ..., n − 1, so that all B_i, −k ≤ i ≤ n − 1, are linearly independent on [a, b] and their restrictions to [a, b] form a basis of S_{Δ,k}.    □

2.4.5 The Computation of B-Splines

B-splines can be computed recursively. The recursions are based on a remarkable generalization of the Leibniz formula for the derivatives of the product of two functions. Indeed, in terms of divided differences, we find the following.

(2.4.5.1) Product Rule for Divided Differences. Suppose t_i ≤ t_{i+1} ≤ ... ≤ t_{i+k}. Assume further that the function f(t) = g(t)h(t) is a product of two functions that are sufficiently often differentiable at t = t_j, j = i, ..., i + k, so that g[t_i, t_{i+1}, ..., t_{i+k}] and h[t_i, t_{i+1}, ..., t_{i+k}] are defined by (2.4.4.1). Then

    f[t_i, t_{i+1}, ..., t_{i+k}] = Σ_{r=i}^{i+k} g[t_i, t_{i+1}, ..., t_r] h[t_r, t_{r+1}, ..., t_{i+k}].

PROOF. From (2.1.5.8) the polynomials

    Σ_{r=i}^{i+k} g[t_i, ..., t_r] (t − t_i) ⋯ (t − t_{r−1})   and   Σ_{s=i}^{i+k} h[t_s, ..., t_{i+k}] (t − t_{s+1}) ⋯ (t − t_{i+k})

interpolate the functions g and h, respectively, at the points t = t_i, t_{i+1}, ..., t_{i+k} (in the sense of Hermite interpolation, see Section 2.1.5, if the t_j are not mutually distinct). Therefore, the product polynomial
    F(t) := ( Σ_{r=i}^{i+k} g[t_i, ..., t_r] (t − t_i) ⋯ (t − t_{r−1}) ) · ( Σ_{s=i}^{i+k} h[t_s, ..., t_{i+k}] (t − t_{s+1}) ⋯ (t − t_{i+k}) )

also interpolates the function f(t) at t = t_i, ..., t_{i+k}. This product can be written as the sum of two polynomials

    F(t) = Σ_{r,s=i}^{i+k} ... = Σ_{r≤s} ... + Σ_{r>s} ... = P_1(t) + P_2(t).

Since each term of the second sum Σ_{r>s} is a multiple of the polynomial Π_{j=i}^{i+k} (t − t_j), the polynomial P_2(t) interpolates the 0-function at t = t_i, ..., t_{i+k}. Therefore, the polynomial P_1(t), which is of degree ≤ k, interpolates f(t) at t = t_i, ..., t_{i+k}. Hence, P_1(t) is the unique (Hermite) interpolant of f(t) of degree ≤ k.
By (2.1.5.8) the highest coefficient of P_1(t) is f[t_i, ..., t_{i+k}]. A comparison of the coefficients of t^k on both sides of the sum representation P_1(t) = Σ_{r≤s} ... of P_1 proves the desired formula

    f[t_i, ..., t_{i+k}] = Σ_{r=i}^{i+k} g[t_i, ..., t_r] h[t_r, ..., t_{i+k}].    □

Now we use (2.4.5.1) to derive a recursion for the B-splines B_{i,r}(x) ≡ B_{i,r,t}(x) of (2.4.4.3). To do this, it will be convenient to renormalize the functions B_{i,r}(x), letting

    N_{i,r}(x) := B_{i,r}(x)/(t_{i+r} − t_i) = f_x^{r−1}[t_i, ..., t_{i+r}],

for which the following simple recursion holds. For r ≥ 2 and t_i < t_{i+r}:

(2.4.5.2)    N_{i,r}(x) = (x − t_i)/(t_{i+r} − t_i) · N_{i,r−1}(x) + (t_{i+r} − x)/(t_{i+r} − t_i) · N_{i+1,r−1}(x).

PROOF. Suppose first x ≠ t_j for all j. We apply rule (2.4.5.1) to the product

    f_x^{r−1}(t) = (t − x)_+^{r−1} = (t − x)(t − x)_+^{r−2} = g(t) f_x^{r−2}(t).

Noting that g(t) is a linear polynomial in t for which by (2.1.4.3)

    g[t_i] = t_i − x,   g[t_i, t_{i+1}] = 1,   g[t_i, ..., t_j] = 0 for j > i + 1,

we obtain

    N_{i,r}(x) = (t_i − x) f_x^{r−2}[t_i, ..., t_{i+r}] + f_x^{r−2}[t_{i+1}, ..., t_{i+r}]
              = (x − t_i)/(t_{i+r} − t_i) · N_{i,r−1}(x) + (t_{i+r} − x)/(t_{i+r} − t_i) · N_{i+1,r−1}(x),
and this proves (2.4.5.2) for x ≠ t_i, ..., t_{i+r}. The result is furthermore true for all x, since all B_{i,r}(x) are continuous from the right and t_i < t_{i+r}.    □

The proof of (2.4.4.5)(b) can now be completed: By (2.4.5.2), the value N_{i,r}(x) is a convex linear combination of N_{i,r−1}(x) and N_{i+1,r−1}(x) for t_i < x < t_{i+r} with positive weights λ_i(x) = (x − t_i)/(t_{i+r} − t_i) > 0, 1 − λ_i(x) > 0. Also N_{i,r}(x) and B_{i,r}(x) have the same sign, and we already know that B_{i,1}(x) = 0 for x ∉ [t_i, t_{i+1}] and B_{i,1}(x) > 0 for t_i < x < t_{i+1}. Induction over r using (2.4.5.2) shows that B_{i,r}(x) > 0 for t_i < x < t_{i+r}.
The formula

(2.4.5.3)    B_{i,r}(x) = (x − t_i)/(t_{i+r−1} − t_i) · B_{i,r−1}(x) + (t_{i+r} − x)/(t_{i+r} − t_{i+1}) · B_{i+1,r−1}(x)

is equivalent to (2.4.5.2), and represents B_{i,r}(x) directly as a positive linear combination of B_{i,r−1}(x) and B_{i+1,r−1}(x). It can be used to compute the values of all B-splines B_{i,r}(x) = B_{i,r,t}(x) for a given fixed value of x.
To show this, let x be given. Then there is a t_j ∈ t with t_j ≤ x < t_{j+1}. By (2.4.4.5)(a) we know B_{i,r}(x) = 0 for all i, r with x ∉ [t_i, t_{i+r}], i.e., for i ≤ j − r and for i ≥ j + 1. Therefore, in the following tableau of B_{i,r} := B_{i,r}(x), the B_{i,r} vanish at all positions other than those shown:

(2.4.5.4)
    r = 1:   B_{j,1}
    r = 2:   B_{j−1,2},  B_{j,2}
    r = 3:   B_{j−2,3},  B_{j−1,3},  B_{j,3}
    r = 4:   B_{j−3,4},  B_{j−2,4},  B_{j−1,4},  B_{j,4}
    (all entries B_{i,r} with i ≤ j − r or i ≥ j + 1 are 0).

By definition B_{j,1} = B_{j,1}(x) = 1 for t_j ≤ x < t_{j+1}, which determines the first column of (2.4.5.4). The remaining columns can be computed consecutively using recursion (2.4.5.3): each element B_{i,r} is derived from its two left neighbors, B_{i,r−1} and B_{i+1,r−1}.
This method is numerically very stable because only nonnegative multiples
of nonnegative numbers are added together.
EXAMPLE. For t_i = i, i = 0, 1, ..., and x = 3.5 ∈ [t_3, t_4] the following tableau of values B_{i,r} = B_{i,r}(x) is obtained:

    i \ r    1      2      3      4
    0        0      0      0      1/48
    1        0      0      1/8    23/48
    2        0      1/2    6/8    23/48
    3        1      1/2    1/8    1/48
    4        0      0      0      0

For instance, B_{2,4} is obtained from

    B_{2,4} = B_{2,4}(3.5) = ((3.5 − 2)/(5 − 2)) · (6/8) + ((6 − 3.5)/(6 − 3)) · (1/8) = 23/48.
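The tableau is easily generated by a few lines of code. The following sketch (illustrative, not the book's program) evaluates a single B-spline by the recursion (2.4.5.3), with the usual convention that a term with vanishing denominator is dropped (repeated knots), and reproduces B_{2,4}(3.5) = 23/48 as well as the partition of unity.

    def bspline_value(i, r, t, x):
        # B_{i,r,t}(x) via the recursion (2.4.5.3); B_{i,1} is the indicator of [t_i, t_{i+1})
        if r == 1:
            return 1.0 if t[i] <= x < t[i + 1] else 0.0
        left = right = 0.0
        if t[i + r - 1] > t[i]:
            left = (x - t[i]) / (t[i + r - 1] - t[i]) * bspline_value(i, r - 1, t, x)
        if t[i + r] > t[i + 1]:
            right = (t[i + r] - x) / (t[i + r] - t[i + 1]) * bspline_value(i + 1, r - 1, t, x)
        return left + right

    t = list(range(10))                                           # knots t_i = i
    print(bspline_value(2, 4, t, 3.5))                            # 0.47916... = 23/48
    print(sum(bspline_value(i, 4, t, 3.5) for i in range(6)))     # partition of unity: 1.0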
We now consider the interpolation problem for spline functions, namely the problem of finding a spline S that assumes prescribed values at given locations. We may proceed as follows. Assume that r ≥ 1 is an integer and t = (t_i)_{1≤i≤N+r} a finite sequence of real numbers satisfying

    t_1 ≤ t_2 ≤ ... ≤ t_{N+r}

and t_i < t_{i+r} for i = 1, 2, ..., N. Denote by B_i(x) ≡ B_{i,r,t}(x), i = 1, ..., N, the associated B-splines, and by

    S_{r,t} = { Σ_{i=1}^{N} α_i B_i(x) | α_i ∈ ℝ }

the vector space spanned by the B_i, i = 1, ..., N. Further, assume that we are given N pairs (ξ_j, f_j), j = 1, ..., N, of interpolation points with

    ξ_1 < ξ_2 < ... < ξ_N.

These are the data for the interpolation problem of finding a function S ∈ S_{r,t} satisfying

(2.4.5.5)    S(ξ_j) = f_j,   j = 1, ..., N.

Since any S ∈ S_{r,t} can be written as a linear combination of the B_i, i = 1, ..., N, this is equivalent to the problem of solving the linear equations

(2.4.5.6)    Σ_{i=1}^{N} α_i B_i(ξ_j) = f_j,   j = 1, ..., N.

The matrix of this system
    A = \begin{pmatrix} B_1(ξ_1) & \cdots & B_N(ξ_1) \\ \vdots & & \vdots \\ B_1(ξ_N) & \cdots & B_N(ξ_N) \end{pmatrix}

has a special structure: A is a band matrix, because by (2.4.4.5) the functions B_i(x) = B_{i,r,t}(x) have support [t_i, t_{i+r}], so that within the j-th row of A all elements B_i(ξ_j) with t_{i+r} < ξ_j or t_i > ξ_j are zero. Therefore, each row of A contains at most r elements different from 0, and these elements are in consecutive positions. The components B_i(ξ_j) of A can be computed by the recursion described previously. The system (2.4.5.6), and thereby the interpolation problem, is uniquely solvable if A is nonsingular. The nonsingularity of A can be checked by means of the following simple criterion due to Schoenberg and Whitney (1953), which is quoted without proof.

(2.4.5.7) Theorem. The matrix A = (B_i(ξ_j)) of (2.4.5.6) is nonsingular if and only if all its diagonal elements B_i(ξ_i) are nonzero.

It is possible to show [see Karlin (1968)] that the matrix A is totally positive in the following sense: all r × r submatrices B of A of the form

    B = (a_{i_p j_q})_{p,q=1}^{r}   with r ≥ 1, i_1 < i_2 < ... < i_r, j_1 < j_2 < ... < j_r,

have a nonnegative determinant, det(B) ≥ 0. As a consequence, solving (2.4.5.6) for nonsingular A by Gaussian elimination without pivoting is numerically stable [see de Boor and Pinkus (1977)]. Also the band structure of A can be exploited for additional savings.
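In practice one rarely assembles A by hand; library routines set up and solve the banded collocation system (2.4.5.6) directly. The following sketch uses SciPy's make_interp_spline as one such routine (the interpolation points are arbitrary illustrative values); it returns the knot sequence t and the coefficients α_i of the B-spline expansion.

    import numpy as np
    from scipy.interpolate import make_interp_spline

    xi = np.array([0.0, 1.0, 2.5, 3.0, 4.5, 6.0])   # interpolation points xi_j
    fi = np.sin(xi)                                  # prescribed values f_j
    spl = make_interp_spline(xi, fi, k=3)            # cubic B-spline interpolant
    print(spl.t)                                     # knot sequence t
    print(spl.c)                                     # coefficients alpha_i
    print(np.allclose(spl(xi), fi))                  # interpolation conditions hold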
For further properties of B-Splines, their applications, and algorithms
the reader is referred to the literature, in particular to de Boor (1978),
where one can also find numerous FORTRAN programs.
2.4.6 Multi-Resolution Methods and B-Splines

B-splines also play an important role in multi-resolution methods in signal processing. These methods deal with the approximation of real or complex functions f: ℝ → ℂ. In what follows, we consider only square integrable functions f, ∫_{−∞}^{∞} |f(x)|² dx < ∞. It is well known that the set of these functions, L²(ℝ), is a Euclidean vector space (even a Hilbert space) with respect to the scalar product

    (f, g) := ∫_{−∞}^{∞} f(x) \overline{g(x)} dx

and the associated Euclidean norm ‖f‖ := (f, f)^{1/2}. By ℓ² we denote the set of all sequences (c_k)_{k∈ℤ} of complex numbers c_k with Σ_{k∈ℤ} |c_k|² < ∞.
In the multi-resolution analysis (MRA) of functions f ∈ L²(ℝ), closed linear subspaces V_j, j ∈ ℤ, of L²(ℝ) with the following properties play a central role:
(S1)  ... ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ V_2 ⊂ ...,
(S2)  f(x) ∈ V_j  ⟺  f(2x) ∈ V_{j+1},   j ∈ ℤ,
(S3)  f(x) ∈ V_j  ⟺  f(x + 2^{−j}) ∈ V_j,   j ∈ ℤ,
(S4)  the closure of ∪_{j∈ℤ} V_j is L²(ℝ),
(S5)  ∩_{j∈ℤ} V_j = {0}.

Usually, such subspaces are generated by a function Φ(x) ∈ L²(ℝ) as follows: By translation and dilatation of Φ, we first define additional functions (we call them "derivates" of Φ)

    Φ_{j,k}(x) := 2^{j/2} Φ(2^j x − k),   j, k ∈ ℤ.

The space V_j is then defined as the closed linear subspace of L²(ℝ) that is generated by the functions Φ_{j,k}, k ∈ ℤ,

    V_j := closure of span{ Φ_{j,k} | k ∈ ℤ } = closure of { Σ_{|k|≤n} c_k Φ_{j,k} | c_k ∈ ℂ, n ≥ 0 }.

Then (S2) and (S3) are clearly satisfied.

(2.4.6.1) Definition. The function Φ ∈ L²(ℝ) is called a scaling function if the associated spaces V_j satisfy conditions (S1)–(S5) and if, in addition, the functions Φ_{0,k}, k ∈ ℤ, form a Riesz basis of V_0, that is, if there are positive constants 0 < A ≤ B with

    A Σ_{k∈ℤ} |c_k|² ≤ ‖ Σ_{k∈ℤ} c_k Φ_{0,k} ‖² ≤ B Σ_{k∈ℤ} |c_k|²

for all sequences (c_k)_{k∈ℤ} ∈ ℓ² (then also for each j ∈ ℤ, the functions Φ_{j,k}, k ∈ ℤ, will form a Riesz basis of V_j).

If Φ is a scaling function then, for all j ∈ ℤ, each function f ∈ V_j can be written as a convergent (in L²(ℝ)) series

(2.4.6.2)    f = Σ_{k∈ℤ} c_{j,k} Φ_{j,k}

with uniquely determined coefficients c_{j,k}, k ∈ ℤ, with (c_{j,k})_{k∈ℤ} ∈ ℓ².
Also for a scaling function Φ, condition (S1) is equivalent to the condition that Φ satisfies a so-called two-scale relation
(2.4.6.3)    Φ(x) = √2 Σ_{k∈ℤ} h_k Φ(2x − k)

with coefficients (h_k) ∈ ℓ². For, (2.4.6.3) is equivalent to Φ ∈ V_0 ⊂ V_1. But then also V_j ⊂ V_{j+1} for all j ∈ ℤ because of

    Φ_{j,l}(x) = 2^{j/2} Φ(2^j x − l) = 2^{j/2} √2 Σ_{k∈ℤ} h_k Φ(2^{j+1} x − 2l − k) = Σ_{k∈ℤ} h_k Φ_{j+1,k+2l}(x).

EXAMPLE. A particularly simple scaling function is the Haar function

(2.4.6.4)    Φ(x) := { 1 for 0 ≤ x < 1,  0 otherwise }.

The associated two-scale relation is

    Φ(x) = Φ(2x) + Φ(2x − 1).

See Figure 3 for an illustration of some of its derivates Φ_{j,k}.

Fig. 3. Some derivates Φ_{j,k} of the Haar function.

Scaling functions, and the associated spaces V_j they generate, play a central role in modern signal- and image-processing methods. For example, any function f ∈ V_j, f = Σ_k c_{j,k} Φ_{j,k}, can be considered as a function with a finite resolution (e.g. a signal or, in two dimensions, a picture obtained by scanning) where only details of size 2^{−j} are resolved (this is illustrated most clearly by the space V_j belonging to the Haar function). A common application is to approximate a high resolution function f, that is a function f ∈ V_j with j large, by a coarser function f̃ (say an f̃ ∈ V_k with k < j) without losing too much information. This is the purpose of multi-resolution methods, whose concepts we wish to explain in this section.
We first describe additional scaling functions based on splines. The Haar function Φ is the simplest B-spline of order r = 1 with respect to the special sequence t := (k)_{k∈ℤ} of all integers: Definition (2.4.4.3) shows immediately

    B_{0,1,t}(x) = f_x^0[0, 1] = (1 − x)_+^0 − (0 − x)_+^0 = Φ(x).

Moreover, for all k ∈ ℤ, their translates are

    B_{k,1,t}(x) = f_x^0[k, k + 1] = B_{0,1,t}(x − k) = Φ(x − k) = Φ_{0,k}(x).

This suggests considering the general B-splines

(2.4.6.5)    M_r(x) := B_{0,r,t}(x) = r · f_x^{r−1}[0, 1, ..., r]

of arbitrary order r ≥ 1 as candidates for scaling functions, because they also generate all B-splines

    B_{k,r,t}(x) = r · f_x^{r−1}[k, k + 1, ..., k + r],   k ∈ ℤ,

by translation, B_{k,r,t}(x) = B_{0,r,t}(x − k) = M_r(x − k). Advantages are that their smoothness increases with r since, by definition, M_r(x) = B_{0,r,t}(x) is r − 2 times continuously differentiable, and that they have compact support [see Theorem (2.4.4.5)]. It can be shown that all M_r, r ≥ 1, are in fact scaling functions in the sense of Definition (2.4.6.1); for example, their two-scale relation reads

    M_r(x) = 2^{−r+1} Σ_{l=0}^{r} \binom{r}{l} M_r(2x − l),   x ∈ ℝ.

For proofs we refer the reader to the literature [see Chui (1992)]. In the interest of simplicity, we only consider low order instances of these functions and their properties.

EXAMPLES. For low order r one finds the following B-splines M_r := B_{0,r,t}: M_1(x) is the Haar function (2.4.6.4), and for r = 2, 3 one obtains:

    M_2(x) = { x       for 0 ≤ x ≤ 1,
               2 − x   for 1 ≤ x ≤ 2,
               0       otherwise,

    M_3(x) = (1/2) · { x²              for 0 ≤ x ≤ 1,
                       −2x² + 6x − 3   for 1 ≤ x ≤ 2,
                       (3 − x)²        for 2 ≤ x ≤ 3,
                       0               otherwise }.
The translate H(x) := M_2(x + 1) of M_2(x),

(2.4.6.6)    H(x) = { 1 − |x|  for −1 ≤ x ≤ 1,   0  otherwise },

is known as the hat function.
These formulas are special cases of the following representation of M_r(x) for general r ≥ 1:

(2.4.6.7)    M_r(x) = (1/(r−1)!) Σ_{k=0}^{r} (−1)^k \binom{r}{k} (x − k)_+^{r−1}.

This is readily verified by induction over r, taking the recursion (k = 1, 2, ..., r) (2.1.3.5) for divided differences

    f_x^{r−1}[i, i+1, ..., i+k] = (1/k) ( f_x^{r−1}[i+1, ..., i+k] − f_x^{r−1}[i, ..., i+k−1] )

and f_x^{r−1}(t) = (t − x)_+^{r−1} into account.
We now describe multi-resolution methods in more detail. As already stated, their aim is to approximate, as well as possible, a given "high resolution" function v_{j+1} ∈ V_{j+1} by a function of lower resolution, say by a function v_j ∈ V_j with

    ‖v_{j+1} − v_j‖ = min{ ‖v_{j+1} − v‖ | v ∈ V_j }.

Since v_{j+1} − v_j is then orthogonal to V_j, and v_j ∈ V_j ⊂ V_{j+1}, the orthogonal subspace W_j of V_j in V_{j+1} comes into play,

    W_j := { w ∈ V_{j+1} | (w, v) = 0 for all v ∈ V_j },   j ∈ ℤ.

We write V_{j+1} = V_j ⊕ W_j to indicate that every v_{j+1} ∈ V_{j+1} has a unique representation as a sum v_{j+1} = v_j + w_j of two orthogonal functions v_j ∈ V_j and w_j ∈ W_j.
The spaces W_j are mutually orthogonal, W_j ⊥ W_k for j ≠ k (i.e. (v, w) = 0 for v ∈ W_j, w ∈ W_k). If, say, j < k, this follows from W_j ⊂ V_{j+1} ⊂ V_k and W_k ⊥ V_k. More generally, (S4) and (S5) then imply

    L²(ℝ) = ⊕_{j∈ℤ} W_j = ... ⊕ W_{−1} ⊕ W_0 ⊕ W_1 ⊕ ....

For a given v_{j+1} ∈ V_{j+1}, multi-resolution methods compute for m ≤ j the orthogonal decompositions v_{m+1} = v_m + w_m, v_m ∈ V_m, w_m ∈ W_m, according to the following scheme:

(2.4.6.8)    v_{j+1} → v_j → v_{j−1} → ... → v_m
                   ↘       ↘                ↘
                    w_j     w_{j−1}    ...    w_m
Then for m ≤ j, v_m ∈ V_m is also the best approximation of v_{j+1} by an element of V_m, that is,

    ‖v_{j+1} − v_m‖ = min{ ‖v_{j+1} − v‖ | v ∈ V_m }.

This is because the vectors w_k are mutually orthogonal and w_k ⊥ V_m for k ≥ m. Hence for arbitrary v ∈ V_m, v_m − v ∈ V_m and

    v_{j+1} − v = w_j + w_{j−1} + ... + w_m + (v_m − v),

so that

    ‖v_{j+1} − v‖² = ‖w_j‖² + ... + ‖w_m‖² + ‖v_m − v‖² ≥ ‖v_{j+1} − v_m‖².

In order to compute the orthogonal decomposition v_{j+1} = v_j + w_j of an arbitrary v_{j+1} ∈ V_{j+1}, we need the dual function Φ̃ of the scaling function Φ. As will be shown later, Φ̃ is uniquely determined by the property

(2.4.6.9)    (Φ_{0,k}, Φ̃) = ∫_{−∞}^{∞} Φ(x − k) Φ̃(x) dx = δ_{k,0},   k ∈ ℤ.

With the help of this dual function Φ̃, we are able to compute the coefficients c_k of the representation (2.4.6.2) of a function f ∈ V_0, f = Σ_l c_l Φ_{0,l}, as scalar products:

    (f(x), Φ̃(x − k)) = Σ_l c_l ∫_{−∞}^{∞} Φ(x − l) Φ̃(x − k) dx = Σ_l c_l ∫_{−∞}^{∞} Φ(x − l + k) Φ̃(x) dx = c_k.

Also, for any j ∈ ℤ, the functions Φ̃_{j,k}(x) := 2^{j/2} Φ̃(2^j x − k), k ∈ ℤ, satisfy

(2.4.6.10)    (Φ_{j,k}, Φ̃_{j,l}) = 2^j ∫_{−∞}^{∞} Φ(2^j x − k) Φ̃(2^j x − l) dx = ∫_{−∞}^{∞} Φ(x − k + l) Φ̃(x) dx = δ_{k−l,0},   k, l ∈ ℤ.

The dual function Φ̃ may thus be used to compute also the coefficients c_{j,l} of the series f = Σ_{k∈ℤ} c_{j,k} Φ_{j,k} representing a function f ∈ V_j:

(2.4.6.11)    (f, Φ̃_{j,l}) = Σ_k c_{j,k} (Φ_{j,k}, Φ̃_{j,l}) = c_{j,l}.
The following theorem ensures the existence of dual functions:
(2.4.6.12) Theorem. To each scaling function Φ there exists exactly one dual function Φ̃ ∈ V_0.

PROOF. Since the Φ_{0,k}, k ∈ ℤ, form a Riesz basis of V_0, the function Φ_{0,0} = Φ is not contained in the subspace

    V := closure of span{ Φ_{0,k} | |k| ≥ 1 }

of V_0. Therefore, there exists exactly one element g = Φ − u ≠ 0 with u ∈ V, so that

    ‖g‖ = min{ ‖Φ − v‖ | v ∈ V }.

u is the (unique!) orthogonal projection of Φ onto V, hence g ⊥ V, i.e. (v, g) = 0 for all v ∈ V, and in particular,

    (Φ_{0,k}, g) = 0   for all |k| ≥ 1,

and 0 < ‖g‖² = (Φ − u, Φ − u) = (Φ − u, g) = (Φ, g). The function Φ̃ := g/(Φ, g) therefore has the properties of a dual function.    □

EXAMPLE. The Haar function (2.4.6.4) is self-dual, Φ̃ = Φ. This follows from the orthonormality property (Φ, Φ_{0,k}) = δ_{k,0}, k ∈ ℤ.

It is, furthermore, possible to compute the best approximation v_j ∈ V_j of a given v_{j+1} ∈ V_{j+1} using the dual function Φ̃. We assume that v_{j+1} ∈ V_{j+1} is given by its series representation v_{j+1} = Σ_{k∈ℤ} c_{j+1,k} Φ_{j+1,k}. Because of v_j ∈ V_j, the function v_j we are looking for has the form v_j = Σ_{k∈ℤ} c_{j,k} Φ_{j,k}. Its coefficients c_{j,k} have to be determined so that v_{j+1} − v_j ⊥ V_j,

    (v_j, v) = (v_{j+1}, v)   for all v ∈ V_j.

Now, since Φ̃ ∈ V_0, all functions Φ̃_{j,k}, k ∈ ℤ, belong to V_j, so that by (2.4.6.10)

    c_{j,l} = (v_j, Φ̃_{j,l}) = (v_{j+1}, Φ̃_{j,l})
            = Σ_{k∈ℤ} c_{j+1,k} (Φ_{j+1,k}, Φ̃_{j,l})
            = Σ_{k∈ℤ} c_{j+1,k} 2^{j+1/2} ∫_{−∞}^{∞} Φ(2^{j+1} x − k) Φ̃(2^j x − l) dx
            = Σ_{k∈ℤ} c_{j+1,k} √2 ∫_{−∞}^{∞} Φ(2x + 2l − k) Φ̃(x) dx.

This leads to the following method for computing the coefficients c_{j,k} from the coefficients c_{j+1,k}: First, we compute the numbers (they do not depend on j)
(2.4.6.13)    γ_i := √2 ∫_{−∞}^{∞} Φ(2x + i) Φ̃(x) dx,   i ∈ ℤ.

The coefficients c_{j,l} are then obtained by means of the formula

(2.4.6.14)    c_{j,l} = Σ_{k∈ℤ} c_{j+1,k} γ_{2l−k},   l ∈ ℤ.

EXAMPLE. Because of Φ̃ = Φ for the Haar function, the numbers γ_i are given by

    γ_i := { 1/2  for i = 0, −1,   0  otherwise }.

Since Φ has compact support, only finitely many γ_i are nonzero, so that also the sums (2.4.6.14) are finite in this case:

    c_{j,l} = (c_{j+1,2l} + c_{j+1,2l+1})/2,   l ∈ ℤ.

In general, the sums in (2.4.6.14) are infinite and thus have to be approximated by finite sums (truncation after finitely many terms). The efficiency of the method thus depends on how fast the numbers γ_i converge to 0 as |i| → ∞.
It is possible to iterate the method following the scheme (2.4.6.8). Note that the numbers γ_i have to be computed only once since they are independent of j. In this way we obtain the following multi-resolution algorithm for computing the coefficients c_{m,k}, k ∈ ℤ, of the series v_m = Σ_{k∈ℤ} c_{m,k} Φ_{m,k} for all m ≤ j:

(2.4.6.15)  Given: c_{j+1,k}, k ∈ ℤ, and γ_i, i ∈ ℤ [see (2.4.6.13)].
            For m = j, j − 1, ...:
                for l ∈ ℤ:
                    c_{m,l} := Σ_{k∈ℤ} c_{m+1,k} γ_{2l−k}.

We have seen that this algorithm is particularly simple for the Haar function as scaling function.
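For the Haar scaling function one step of (2.4.6.15) reduces to averaging pairs of coefficients, since γ_0 = γ_{−1} = 1/2 and all other γ_i vanish. A minimal sketch (the input coefficients are an arbitrary example, not taken from the text):

    import numpy as np

    def coarsen_haar(c_fine):
        # c_{m,l} = sum_k c_{m+1,k} gamma_{2l-k}; for Haar only k = 2l, 2l+1 contribute
        c_fine = np.asarray(c_fine, float)
        return 0.5 * (c_fine[0::2] + c_fine[1::2])

    c3 = np.array([4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 3.0, 9.0])   # coefficients c_{j+1,k}
    c2 = coarsen_haar(c3)      # coefficients of the best approximation in V_j
    print(c2)                  # [3. 6. 1. 6.]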
We now consider scaling functions that are given by higher order B-splines, Φ(x) = Φ_r(x) := M_r(x), r > 1. For reasons of simplicity we treat only the case r = 2, which is already typical. Since the scaling function M_2(x) and the hat function H(x) = M_2(x + 1) generate the same spaces V_j, j ∈ ℤ, the investigation of Φ(x) := H(x) is sufficient.
First we show the property (S1), V_j ⊂ V_{j+1}, for the linear spaces generated by Φ(x) = H(x) and its derivates Φ_{j,k}(x) = 2^{j/2} Φ(2^j x − k), j, k ∈ ℤ. This follows from the two-scale relation of H(x),

    H(x) = (1/2) ( H(2x + 1) + 2 H(2x) + H(2x − 1) ),

which is immediately verified using the definition (2.4.6.6) of H(x). We leave it to the reader to prove that H(x) also has the remaining properties required for a scaling function [see Exercise 32].
The functions f ∈ V_j have a simple structure: they are continuous and piecewise linear functions on ℝ with respect to the partition Δ_j of fineness 2^{−j} given by the knots

(2.4.6.16)    Δ_j := { k · 2^{−j} | k ∈ ℤ }.

Up to a common factor, the coefficients c_{j,k} of the series representation f = Σ_{k∈ℤ} c_{j,k} Φ_{j,k} are given by the values of f on Δ_j:

(2.4.6.17)    c_{j,k} = 2^{−j/2} f(k · 2^{−j}),   k ∈ ℤ.

Unfortunately, the functions Φ_{0,k}(x) = H(x − k), k ∈ ℤ, do not form an orthogonal system of functions: A short direct calculation shows

(2.4.6.18)    (Φ_{0,k}, Φ_{0,l}) = ∫_{−∞}^{∞} H(x − k + l) H(x) dx = { 2/3 for k = l,   1/6 for |k − l| = 1,   0 otherwise }.

As a consequence, the dual function H̃ will be different from H, but it can be calculated explicitly. According to the proof of Theorem (2.4.6.12), H̃ is a scalar multiple of the function

    g(x) = H(x) − Σ_{|k|≥1} a_k H(x − k),

where the sequence (a_k)_{|k|≥1} is the unique solution of the following equations:

    (g, Φ_{0,k}) = ( H(x) − Σ_{|l|≥1} a_l H(x − l), H(x − k) ) = 0   for all |k| ≥ 1

that satisfies Σ_k |a_k|² < ∞. By (2.4.6.18), this leads for k ≥ 1 to the equations

(2.4.6.19)    4a_1 + a_2 = 1                       (k = 1),
              a_{k−1} + 4a_k + a_{k+1} = 0         (k ≥ 2),

and for k ≤ −1 to
    4a_{−1} + a_{−2} = 1                       (k = −1),
    a_{k+1} + 4a_k + a_{k−1} = 0               (k ≤ −2).

It is easily seen that with (a_k) also (a_{−k}) is a solution of these equations, so that by the uniqueness of the solution a_k = a_{−k} for all |k| ≥ 1. Therefore it is sufficient to find the solutions a_k, k ≥ 1, of (2.4.6.19).
According to these equations, the sequence (a_k)_{k≥1} is a solution of the homogeneous linear difference equation

    c_{k−1} + 4c_k + c_{k+1} = 0,   k = 2, 3, ....

The sequence c_k := θ^k, k ≥ 1, is clearly a solution of this difference equation if θ is a zero of the polynomial x² + 4x + 1 = 0. This polynomial, however, has the two different zeros

    λ := −2 + √3 = −1/(2 + √3),   μ := −2 − √3 = 1/λ,

with |λ| < 1 < |μ|. The general solution of the difference equation is an arbitrary linear combination of these special solutions (λ^k)_{k≥1} and (μ^k)_{k≥1}, that is, the desired solution a_k has the form a_k = Cλ^k + Dμ^k, k ≥ 1, with appropriately chosen constants C, D. The condition Σ_k |a_k|² < ∞ implies D = 0 because of |μ| > 1. The constant C has to be chosen such that also the first equation (2.4.6.19), 4a_1 + a_2 = 1, is satisfied, that is,

    1 = Cλ(4 + λ) = Cλ(2 + √3) = −C.

This implies a_k = a_{−k} = −λ^k = −(−1)^k (2 + √3)^{−k} for k ≥ 1, so that

    g(x) = H(x) + Σ_{k≥1} (−1)^k (2 + √3)^{−k} ( H(x + k) + H(x − k) ).

The dual function H̃(x) = γ g(x) we are looking for is a scalar multiple of g, where γ is determined by the condition (H, H̃) = 1. Using (2.4.6.18) this leads to

    γ^{−1} = (H, g) = (H, H) − (2 + √3)^{−1} ( (H(x), H(x − 1)) + (H(x), H(x + 1)) )
           = 2/3 − 1/(3(2 + √3)) = 1/√3,

so that γ = √3. Therefore, the dual function H̃(x) is given by the infinite series

    H̃(x) = Σ_{k∈ℤ} b_k H(x − k)

with coefficients

    b_k := (−1)^k √3 / (2 + √3)^{|k|},   k ∈ ℤ.
The computation of the numbers γ_i of (2.4.6.13) for the multi-resolution method (2.4.6.15) leads to simple finite sums with known terms,

(2.4.6.20)    γ_i = √2 (H(2x + i), H̃(x)) = √2 Σ_{k∈ℤ} b_k (H(2x + i), H(x − k))
                 = √2 Σ_{k∈ℤ} b_k (H(2x + i + 2k), H(x))
                 = √2 Σ_{k: |2k+i| ≤ 2} b_k (H(2x + i + 2k), H(x)),

because an elementary calculation shows

    (H(2x + j), H(x)) = { 5/12 for j = 0,   1/4 for |j| = 1,   1/24 for |j| = 2,   0 otherwise }.

Since 2 + √3 = 3.73..., the numbers b_k = O((2 + √3)^{−|k|}) and, consequently, the numbers γ_k converge rapidly enough to zero as |k| → ∞, so that the sums (2.4.6.14) of the multi-resolution method are well approximated by relatively short finite sums.
EXAMPLE. For small i, one finds the following values of γ_i:

    i    γ_i                i    γ_i
    0    0.96592 58262      6   −0.01178 07514
    1    0.44828 77361      7   −0.00862 41086
    2   −0.16408 46996      8    0.00315 66428
    3   −0.12011 83369      9    0.00231 08229
    4    0.04396 63628     10   −0.00084 58199
    5    0.03218 56114

Incidentally, the symmetry relation γ_{−i} = γ_i holds for all i ∈ ℤ [see Exercise 33].
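The entries of this table can be reproduced directly from (2.4.6.20), the coefficients b_k, and the three inner-product values listed above. A small sketch (not the book's program; function names are illustrative):

    import math

    def inner(j):                       # (H(2x + j), H(x))
        return {0: 5/12, 1: 1/4, 2: 1/24}.get(abs(j), 0.0)

    def b(k):                           # b_k = (-1)^k sqrt(3)/(2+sqrt(3))^{|k|}
        return (-1)**k * math.sqrt(3) / (2 + math.sqrt(3))**abs(k)

    def gamma(i, kmax=60):              # (2.4.6.20); terms with |2k+i| > 2 vanish
        return math.sqrt(2) * sum(b(k) * inner(i + 2*k) for k in range(-kmax, kmax + 1))

    print([gamma(i) for i in range(4)])  # compare with the table of gamma_i above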
There is a simple interpretation of this multi-resolution method: A function f = Σ_{k∈ℤ} c_{j+1,k} Φ_{j+1,k} ∈ V_{j+1} is a continuous function which is piecewise linear with respect to the partition Δ_{j+1} = { k · 2^{−(j+1)} | k ∈ ℤ } of ℝ with [see (2.4.6.17)]

    c_{j+1,k} = 2^{−(j+1)/2} f(k · 2^{−(j+1)}),   k ∈ ℤ.

It is approximated optimally by a continuous function f̃ = Σ_{k∈ℤ} c_{j,k} Φ_{j,k} which is piecewise linear with respect to the coarser partition Δ_j = { l · 2^{−j} | l ∈ ℤ } if we set for l ∈ ℤ

    f̃(l · 2^{−j}) = 2^{j/2} c_{j,l},   c_{j,l} = Σ_{k∈ℤ} c_{j+1,k} γ_{2l−k}.

The numbers γ_i thus have the following nice interpretation: The continuous piecewise linear (with respect to Δ_0 = { l | l ∈ ℤ }) function f̄(x) with f̄(l) := γ_{2l}/√2, l ∈ ℤ, is the best approximation in V_0 of the compressed hat function f(x) := H(2x) ∈ V_1. The translated function h(x) := H(2x − 1) ∈ V_1 is best approximated by f̄_1 ∈ V_0 where f̄_1(l) := γ_{2l−1}/√2, l ∈ ℤ [see Figure 4].

Fig. 4. The functions f and f̄.
We now return to general scaling functions. Within the framework of multi-resolution methods it is essential that the functions Φ_{j,k}, k ∈ ℤ, form a Riesz basis of the spaces V_j. It is of equal importance that also the orthogonal complements W_j of V_j in V_{j+1} have similarly simple Riesz bases: One can show [proofs can be found e.g. in Daubechies (1992), Louis et al. (1994)] that to any scaling function Φ generating the linear spaces V_j there is a function Ψ ∈ W_0 with the following properties:
a) The functions Ψ_{0,k}(x) := Ψ(x − k), k ∈ ℤ, form a Riesz basis [see (2.4.6.1)] of W_0, and for each j ∈ ℤ its derivates Ψ_{j,k}(x) = 2^{j/2} Ψ(2^j x − k), k ∈ ℤ, form a Riesz basis of W_j.
b) The functions Ψ_{j,k}, j, k ∈ ℤ, form a Riesz basis of L²(ℝ).

Each function Ψ with these properties is called a wavelet belonging to the scaling function Φ (wavelets are not uniquely determined by Φ). Ψ is called an orthonormal wavelet if, in addition, the Ψ_{0,k}, k ∈ ℤ, form an orthonormal basis of W_0, (Ψ_{0,k}, Ψ_{0,l}) = δ_{k,l}. In this case an orthonormal basis of L²(ℝ) is given by the derivates Ψ_{j,k}, j, k ∈ ℤ, of Ψ:

    (Ψ_{j,k}, Ψ_{l,m}) = δ_{j,l} δ_{k,m}.

EXAMPLE. To the Haar function Φ (2.4.6.4) belongs the Haar wavelet defined by

    Ψ(x) := { 1 for 0 ≤ x < 1/2,   −1 for 1/2 ≤ x < 1,   0 otherwise }.

Indeed, the relation Ψ(x) = Φ(2x) − Φ(2x − 1) first implies Ψ_{0,k} ∈ V_1 for all k ∈ ℤ; then the orthogonality

    (Φ_{0,k}, Ψ_{0,l}) = 0   for all k, l ∈ ℤ

gives Ψ_{0,k} ∈ W_0 for all k ∈ ℤ, and finally

    Φ(2x) = (Φ(x) + Ψ(x))/2,   Φ(2x − 1) = (Φ(x) − Ψ(x))/2

proves V_1 = V_0 ⊕ W_0. Also, since (Ψ, Ψ_{0,k}) = δ_{k,0}, the Haar wavelet is orthonormal.

One can show that every orthonormal scaling function Φ also has an orthonormal wavelet, which can even be given explicitly by means of the two-scale relation of Φ [see e.g. the literature cited above]. It is very difficult, however, to determine an associated wavelet for a non-orthonormal Φ. The situation is better for the scaling functions Φ = Φ_r := M_r given by B-splines. Here, explicit formulas are known for the special wavelets Ψ_r that have a minimal compact support (namely the interval [0, 2r − 1]) among all wavelets belonging to M_r: they are given by finite sums of the form

    Ψ_r(x) = Σ_{k=0}^{3r−2} q_k M_r(2x − k).

However, these wavelets are not orthonormal for r > 1, but one still knows explicitly their associated dual functions Ψ̃_r ∈ W_0 with their usual properties (Ψ_r(x − k), Ψ̃_r(x)) = δ_{k,0} for k ∈ ℤ [see Chui (1992) for a comprehensive account].
The example of the Haar function shows how simple the situation becomes if both the scaling function Φ and the wavelet Ψ are orthonormal and if both functions have compact support. A drawback of the Haar function is that it is not even continuous, whereas already M_2 is continuous, and the smoothness of the B-splines M_r with r ≥ 3 and their wavelets Ψ_r even increases with r. Also, for all r ≥ 1, Φ_r = M_r has compact support, and has a wavelet Ψ_r with compact support. But unfortunately neither Φ_r nor Ψ_r are orthonormal for r > 1.
Therefore the following deep result of Daubechies is very remarkable: she was the first to give explicit examples of scaling functions and associated wavelets that have an arbitrarily high order of differentiability, have compact support, and are orthonormal. For a detailed exposition of these important results and their applications, we refer the reader to the relevant special literature [see e.g. Louis et al. (1994), Mallat (1998), and, in particular, Daubechies (1992)].
EXERCISES FOR CHAPTER 2
1.
Let L; (x) be the Lagrange polynomials (2.1.1.3) for pairwise different support
abscissas Xo, . .. , Xn , and let C; := L;(O). Show that
(a)
n
LCiXi =
;=0
{I
0
n
(-1) XOX1·· .Xn
for j = 0,
for j = 1,2, ... ,n,
for j = n + 1,
n
(b)
L Li(x) == 1.
i=O
2.
Interpolate the function In x by a quadratic polynomial at x = 10, 11, 12.
(a) Estimate the error committed for x = 11.1 when approximating lnx by
the interpolating polynomial.
(b) How does the sign of the error depend on x?
3.
Consider a function f which is twice continuously differentiable on the interval I = [-1, 1]. Interpolate the function by a linear polynomial through the
support points (Xi, f(Xi)), i = 0, 1, xo, Xl E I. Verify that
0: =
~ max If" (~)I max I(x - xo)(X - Xl)1
~E[
xE[
is an upper bound for the maximal absolute interpolation error on the interval
I. Which values xo, Xl minimize o:? What is the connection between (x xo)(x - Xl) and cos(2 arccos x)?
4.
Suppose a function f(x) is interpolated on the interval [a, b] by a polynomial
Pn(X) whose degree does not exceed n. Suppose further that f is arbitrarily
often differentiable on [a, b] and that there exists M such that If(i)(x)1 ::; M
for i = 0, 1, : .. and any x E [a, b]. Can it be shown, without additional
hypotheses about the location of the support abscissas Xi E [a, b], that Pn(x)
converges uniformly on [a, b] to f (x) as n ~ oo?
EXERCISES FOR CHAPTER 2
5.
135
(a) The Bessel function of order zero,
I11f
Jo(x) = -
cos(xsint)dt,
0
11:
is to be tabulated at equidistant arguments Xi = Xo -+- ih, i = 0, 1, 2, ...
. How small must the increment h be chosen so that the interpolation
error remains below 10- 6 if linear interpolation is used?
(b) What is the behavior of the maximal interpolation error
as n --+ 00, if Pn(X) interpolates Jo(x) at X = x;n) ;=, i/n, i = 0,1, ... ,
n?
Hint: It suffices to show that IJ~k)(x)1 <:: 1 for k = 0,1, ...
(c) Compare the above result with the behavior of the error
as n --+ 00, where St:J. n is the interpolating spline function with knot set
Ll n = {
} and S~n (x) = J~(x) for x = 0, 1.
6.
Interpolation on product spaces: Suppose every linear interpolation problem
stated in terms of functions 'Po, 'PI, ... , 'Pn has a unique solution
p(x) = L
Cl;i'Pi(X)
i=O
with P(Xk) = h, k = 0, ... , n, for prescribed support arguments Xo, ... ,
Xn with Xi i= Xj, i i= j. Show the following: If 1/;0, ... , 'Pm is also a set of
functions for which every linear interpolation problem has a unique solution,
then for every choice of abscissas
x, i= Xj for i i= j,
Yi i= YJ for i i= j,
XO,Xl, .. 'lXn~
yo, YI , ... , Yrn,
and support ordinates
f'k,
i = 0, ... ,n,
k = 0, ... ,m,
there exists a unique function of the form
n
p(X,y) = L
LCl;I/Il'P,/(x)1/;I'(Y)
1/=0 1l=0
with <p(x" Yk) = Jik, i = 0, ... , n, k = 0, ... , TTL
7.
Specialize the general result of Exercise 6 to interpolation by polynomials.
Give the explicit form of the function p(x, y) in this case.
8.
Given the abscissas
Yo, YI, ... , Ym,
Yi i= Yj for i i= j,
136
2 Interpolation
and, for each k = 0, ... ,m, the values
of x;kl for i of j,
and support ordinates
fik,
i = 0, ... , nk,
k = 0, ... , m,
suppose without loss of generality that the Yk are numbered in such a fashion
that
no ::0: nl ::0: ... ::0: nm .
Prove by induction over m that exactly one polynomial
"p
P(x,y) == L
Lctv/"XVyli
11-=0 v=O
exists with
9.
Is it possible to solve the interpolation problem of Exercise 8 by other polynomials
M
N
"
P(x,y) = L Lctv/"XVyl',
/"=0 v=o
requiring only that the number of parameters ct u /" agree with the number of
support points, that is,
AI
L(n/" + 1) = L(N/" + I)?
Hint: Study a few simple examples.
10. Calculate the inverse and reciprocal differences for the support points
Xi:
°
1
-1
2
-2
fi:
1
3
3/5
3
3/5
and use them to determine the rational espression p2.2(X) whose numerator
and denominator are quadratic polynomials and for which p2,2(Xi) = fi, first
in continued-fraction form and then as the ratio of polynomials.
11. Let <p m .n be the rational function which solves the system sm." for given
support points (Xk, fk), k = 0, 1, ... , m + n:
(ao + alXk + ... + amX~') - h(bo + b1xk + ... + bnx~) = 0,
k = 0, 1, ... ,m+ n.
Show that <pm,n (x) can be represented as follows by determinants:
pm.,,(X) = Ifk, Xk - X, ... , (Xk - x)m, (Xk - X)fk,"" (Xk - x)" fkl;;"=+d'
11,xk - X, ... , (Xk - x)m, (Xk - x)h, ... , (Xk - x)n hl;;":on
EXERCISES FOR CHAPTER 2
137
Here the following abbreviation has been used:
I
Im+n = det
Ctk,· .. , (k k=O
12. Generalize Theorem (2.3.1.12):
(a) For 2n + 1 support abscissas Xk with
a ::; Xo
< Xl < ... < X2n < a + 21r
and support ordinates yo, . .. , Y2n, there exists a unique trigonometric
polynomial
T(x) = ~ao +
2)aj cosjx + bj sinj:r)
j=l
with
T(Xk) = Yk
for k = 0,1, ... ,2n.
(b) If Yo, ... , Y2n are real numbers, then so are the coefficients a J , bj •
Hint: Reduce the interpolation by trigonometric polynomials in (a) to
(complex) interpolation by polynomials using the transformation T(x) =
cje
Then show C- J = Cj to establish (b).
2:.7=-n
ijx .
13. (a) Show that, for real Xl, ... , X2n, the function
II sin - 2 2n
t(x) =
X -
Xk
k=l
is a trigonometric polynomial
n
~ao + 2)aj cosjx + bj sinjx)
j=l
aj,
with real coefficients
bj .
Hint: Substitute sin'P = (1/2i)(e i 'P - e-''P).
(b) Prove, using (a) that the interpolating trigonometric polynomial with
support abscissas Xk,
a ::; Xo < Xl ... < X2n < a + 21r,
and support ordinates Yo, ... ,Y2n is identical with
2n
T(x) = LYJtj(x),
j=O
where
138
2 Interpolation
tj(X) :=
g
2n
X - Xk /
sin --2-
k-,fj
g
2n
X· - Xk
sin _1_2-_,
k-,fj
14. Show that for n + 1 support abscissas Xk with
a :S Xo < Xl < ... < Xn < a + 7r
and support ordinates yo, ... , Yn, a unique "cosine polynomial"
n
C(x) = Lajcosjx
j=O
exists with C(Xk) = Yk, k = 0, 1, ... , n.
Hint: See Exercise 12.
15. (a) Show that for any integer j
2m
2m
COSjXk = (2m + 1) h(j),
L
k=O
with
Xk:=
and
27rk
2m+ l'
Lsinjxk = 0,
k=Q
k = 0, 1, ... ,2m,
for j = 0 mod 2m + 1,
otherwise.
h(j) := {~
(b) Use (a) to derive for integers j, k the following orthogonality relations:
2m
L
sinjxi sin kXi = 2mt 1 [h(j - k) - h(j + k)],
i=O
2m
COSjXi COSkXi = 2m2+ 1 [h(j - k) + h(j + k)],
L
i=O
2m
LCOsjXisinkxi = O.
i=O
16. Suppose the 27r-periodic function I: IR. -+ IR. has an absolutely convergent
Fourier series
00
I(x) = ~ao + L(aj cosjx + bj sinjx).
j=l
Let
1li(X) = ~AQ + L(Aj cosjx + Bj sinjx)
j=l
EXERCISES FOR CHAPTER 2
139
be trigonometric polynomials with
27rk
2m+ l '
Xk:=
for k = 0, 1, ... , 2m.
Show that
00
Ak = ak
+ 2 ) ap (2rn+1)+k + a p (2rn+1)-k], 0::; k ::; m,
p=l
00
Bk
= bk + L[bp (2rn+1)+k - bp (2m+1)-k],
1::; k ::; m.
p=l
17. Formulate a Cooley-Thkey method in which the array ,a[ 1 is initialized directly (,B[j] = Ii) rather than in bit-reversed fashion.
Hint: Define and determine explicitly a map u = u( m, r, j) with the same
replacement properties as (2.3.2.6) but u(O, r, 0) = r instead of (2.3.2.7).
18. Let N = 2n : Consider the N-vectors
(2.3.2.1) expresses a linear transformation between these two vectors, f3 =
(ljN)TJ, where T = [tjk], tjk := e-27rijk/N.
(a) Show that T can be factored as follows:
T = QSP(Dn-ISP)·· . (DISP),
where S is the N x N matrix
1
-1
1
1
.
_.
(
(I)
(I)
(l)
)
_
The matnces Dl - dlag 1,°1 ,1, 03 ,···,1, ON_I' l - 1, ... , n - 1, are
diagonal matrices with
i' = [~],
r odd.
Q is the matrix of the bit-reversal permutation (2.3.2.8), and P is the
matrix of the following bit-cycling permutation (:
(b) Show that the Sande-Thkey method for fast Fourier transforms corresponds to multiplying the vector J from the left by the (sparse) matrices
in the above factorization.
140
2 Interpolation
(c) Which factorization of T corresponds to the Cooley-Thkey method?
Hint: TH differs from T by a permutation.
19. Investigate the numerical stability of the methods for fast Fourier transforms
described in Section 2.3.2.
Hint: The matrices in the factorization of Exercise 18 are almost orthogonal.
20. Given a set of knots .d = {xo < Xl < ... < Xn} and values Y := (YO, ... , Yn),
prove independently of Theorem (2.4.1.5) that the spline function S.<l(Y;.)
with S~(Y;xo) = S~(Y;Xn) = 0 is unique.
Hint: Examine the number of zeros of the difference S~ - S~ of two such
spline functions. This number will be incompatible with S.<l - S.<l '# O.
21. The existence of a spline function S.<l (Y; .) in cases (a), (b), and (c) of
(2.4.1.2) can be established without explicitly calculating it, as was done
in Section 2.4.2.
(a) The representation of S.<l(Y;.) requires 4n parameters aj, (3j, '"'Ij, 8j .
Show that in each ofthe cases (a), (b), (c) a linear system of 4n equations
results (n + 1 = number of knots).
(b) Use the uniqeness of S.<l(Y;.) (Exercise 20) to show that the system of
linear equations is not singular, which ensures the existence of a solution.
22. Show that the quantities d j of (2.4.2.6) and (2.4.2.8) satisfy
j
and even
= 0,1, ... ,n,
j = 1,2, ... , n - 1,
in the case of n + 1 equidistant knots Xj E .d.
23. Show that Theorem (2.4.1.5) implies: If the set of knots .d' c la, bj contains
the set of knots .d, .d' ::) .d, then in each of the cases (a), (b), and (c),
24. Suppose S.<l(x) is a spline function with the set of knots
.d = { a = Xo < Xl < ... < Xn = b}
interpolating ! E K4 (a, b). Show that
if anyone of the following additional conditions is met:
(a) f'(x) = S~(x) for X = a, b.
(b) rex) = S~(x) for X = a, b.
(c) S.<l is periodic and! E K;(a, b).
25. Prove that Pk > 1 holds for the quantities Pk, k = 1, ... , n, encountered in
solution method (2.4.2.15) of the system of linear equations (2.4.2.9). All the
divisions required by this method can therefore be carried out.
26. Define the spline functions Sj for equidistant knots Xi = a + ih, h > 0, i = 0,
... , n, by
EXERCISES FOR CHAPTER 2
141
Verify that the moments M I , ... , M n - l von Sj are as follows:
1
i=I, ... ,j-2,
Mi=--Mi+l,
Pi
1
Mi = - - - Mi - l ,
pn-i
M
i = j + 2, ... ,n -1,
_ -6. 2 + I/Pj-l + I/Pn-j-l
J -
h2
4 - 1/ pj-l - 1/ pn-j-l
M j - l = _1_(6h- 2 - M j )
Pj-l
for j # 0, 1, n - 1, n,
where the numbers Pi are recursively defined by PI := 4 and
Pi := 4 - I/Pi-1
i = 2,3, ....
It is readily seen that they satisfy the inequalities
4 = PI > P2 > ... > Pi > Pi+I > 2 + v'3 > 3.7,
0.25 < 1/ Pi < 0.3.
27. Show for the functions Sj defined in Exercise 26 that for j ,= 2, 3, ... , n - 2
and either x E [Xi, Xi+l], j + 1 :'::: i :'::: n - 1 or X E [Xi-I, x;}, 1 :'::: i :'::: j - 1,
28. Let Sa,! denote the spline function which interpolates the function f at
prescribed knots x E Ll and for which
S~;f (xo) = S~;f (Xn) = O.
The map f -+ Sa;! is linear, that is,
The effect on Sa;! of changing a single function value f(xj) is therefore that
of adding a corresponding multiple of the function Sj which was defined in
Exercise 26. Show, for equidistant knots and using the results of Exercise
27, that a perturbation of a function value subsides quickly as one moves
away from the location of the perturbation. Consider the analogously defined Lagrange polynomials (2.1.1.2) in order to compare the behavior of
interpolating polynomials.
29. Let Ll = {xo < Xl < ... < Xn }.
(a) Show that a spline function Sa with the boundary conditions
(*)
142
2 Interpolation
vanishes identically for n < 4.
(b) For n = 4, the spline function with (*) is uniquely determined for any
value c and the normative condition
(SLl is a multiple of the B-spline with support [xo, X4J.)
Hint: Prove the uniqueness of S Ll for c = 0 by determining the number of
zeros of S~ in the open interval (xo, X4). Deduce from this the existence
of SLl by the reasoning employed in exercise 21.
(c) Calculate SLl explicitly in the following special case of (b):
Xi
:= -2, -1,0, 1,2,
c:= 1.
30. Let 5 be the linear space of all spline functions SLl with knot set .:1 = {xo <
Xl < ... < Xn} and S~(xo) = S~(Xn) = O. The spline functions So, ... , Sn
are the ones defined in Exercise 26. Show that for Y := (YO, .. . , Yn),
n
SLl{Y;X) == LYjSj (x).
j=O
What is the dimension of 5?
Exponentialspline
kubischer Spline
Fig. 5. Comparison of spline functions.
31. Let ELl,f{X) denote the spline-like function which, for given Ai, minimizes
the functional
EXERCISES FOR CHAPTER 2
143
over K?(a, b). [Compare Theorem (2.4.1.5).]
(a) Show that ELl,! is between knots of the form
ELl.J(X) = Qi + !3i(X - Xi) + ii'¢(X - Xi) + OiC,,(X - Xi),
i = 0, ... ,N -1,
2
'¢i(X) = A2 [cosh(Aix ) - 1],
,
'Pi(X) =
~3 [sinh(A;x) - Ai X],
,
with constants Qi, !3i, ii, Oi. ELl,! is called exponential spline function.
(b) Examine the limit as Ai -+ O.
(c) Figure 5 illustrates the qualitative behavior of cubic and exponential
spline functions interpolating the same set of support points.
32. Show that the hat-function H(x) (2.4.6.6) has all properties (Definition.
(2.4.6.1)) of a scaling function.
33. Show for the numbers ik defined by (2.4.6.20) the properties
(a) ik = i-k for all k E Zl,
(b)
ik = y'2,
(c) i2k-I + 2i2k + i2k+I = 0 for all k E Zl, k f= O.
Hint: The multi-resolution method (2.4.6.8) approximates every function
VI E VI with VI E Va by itself, Va = VI EVa.
2:kEZ
References for Chapter 2
Achieser, N. I. (1953): Vorlesungen iiber Approximationsthe01"ie. Berlin: Akademie-Verlag.
Ahlberg, J., Nilson, E., Walsh, J. (1967): The Theory of Splines and Their Applications. New York: Academic Press.
Bloomfield, P. (1976): Fourier Analysis of Time Series. New York: Wiley.
Bohmer, E. O. (1974): Spline-Funktionen. Stuttgart: Teubner.
Brigham, E. O. (1974): The Fast Fourier Transform. Englewood Cliffs, N.J.:
Prentice-Hall.
Bulirsch, R., Rutishauser, H. (1968): Interpolation und genaherte Quadratur. In:
Sauer, SzabO (1968).
- - - , Stoer, J. (1968): Darstellung von Funktionen in Rechenautomaten. In:
Sauer, Szabo (1968).
Chui, C. K. (1992): An Introduction to Wavelets. San Diego, CA: Academic Press.
Ciarlet, P. G., Schultz, M. H., Varga, R. S. (1967): Numerical methods of highorder accuracy for nonlinear boundary value problems I. One dimensional problems. Numer. Math. 9, 294-430.
Cooley, J. W., Tukey, J. W. (1965): An algorithm for the machine calculation of
complex Fourier series. Math. Comput. 19, 297-301.
Curry, H. B., Schoenberg, I. J. (1966): On Polya frequency functions, IV: The
fundamental spline functions and their limits. J. d'Analyse Math. 17, 73-82.
Daubechies, I. (1992): Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
Davis, P. J. (1963): Interpolation and Approximation. New York: Blaisdell, 2d
printing (1965).
144
2 Interpolation
de Boor, C. (1972): On calculating with B-splines. J. Approximation Theory 6,
50-62.
- - - (1978): A Practical Guide to Splines. Berlin, Heidelberg, New York:
Springer-Verlag.
- - - , Pinkus, A. (1977): Backward error analysis for totally positive linear
systems. Numer. Math. 27,485-490.
Gautschi, W. (1972): Attenuation factors in practical Fourier analysis. Numer.
Math. 18, 373-400.
Gentleman, W. M., Sande, G. (1966): Fast Fourier transforms - For fun and
profit. In: Proc. AFIPS 1966 Fall Joint Computer Conference, 29, 503-578.
Washington, D.C.: Spartan Books.
Goertzel, G. (1958): An algorithm for the evaluation of finite trigonometric series.
Amer. Math. Monthly 65, 34-35.
Greville, T. N. E. (1969): Introduction to spline functions. In: Theory and Applications of Spline Functions. Edited by T. N. E. Greville. New York: Academic
Press.
Hall, C. A., Meyer, W. W. (1976): Optimal error bounds for cubic spine interpolation. J. Approximation Theory 16, 105-122.
Herriot, J. G., Reinsch, C. (1971): ALGOL 60 procedures for the calculation of interpolating natural spline functions. Technical Report STAN-CS-71-200, Computer Science Department, Stanford University, CA.
Karlin, S. (1968): Total Positivity, Vol. 1. Stanford: Stanford University Press.
Korneichuk, N. P. (1984): Splines in Approximation Theory. Moscow: Nauka (in
Russian).
Kuntzmann, J. (1959): Methodes Numeriques, Interpolation - Derivees. Paris:
Dunod.
Louis, A., MaaB, P., Rieder, A. (1994): Wavelets. Stuttgart: Teubner.
Maehly, H., Witzgall, Ch. (1960): Tschebyscheff-Approximationen in kleinen Intervallen II. Numer. Math. 2, 293-307.
Mallat, St. (1997): A Wavelet Tour of Signal Processing. San Diego, CA.: Academic Press.
Milne, E. W. (1949): Numerical Calculus. Princeton, N.J.: Princeton University
Press, 2d printing (1951).
Milne-Thomson, L. M. (1933): The Calculus of Finite Differences. London:
Macmillan, reprinted (1951).
Nurnberger, G. (1989): Approximation by Spline Functions. Berlin: SpringerVerlag.
Reinsch, C.: Unpublished manuscript.
Sauer, R., Szabo, I. (Eds.) (1968): Mathematische Hilfsmittel des Ingenieurs, Part
III. Berlin, Heidelberg, New York: Springer-Verlag.
Schoenberg, 1. J., Whitney, A. (1953): On Polya frequency functions, III: The
positivity of translation determinants with an application to the interpolation
problem by spline curves. Trans. Amer. Math. Soc. 74, 246-259.
Schultz, M. H. (1973): Spline Analysis. Englewood Cliffs, N.J.: Prentice-Hall.
Schumaker, L.L. (1981): Spline Functions. Basic Theory. New York, Chichester,
Brisbane, Toronto: Wiley.
Singleton, R. C. (1967): On computing the fast Fourier transform. Comm. ACM
10, 647-654.
- - - (1968): Algorithm 338: ALGOL procedures for the fast Fourier transform.
Algorithm 339: An ALGOL procedure for the fast Fourier transform with arbitrary factors. Comm. ACM 11, 773-779.
3 Topics in Integration
Calculating the definite integral of a given real function f (x),
lb
f(x)dx,
is a classic problem. For some simple integrands f(x), the indefinite integral
jt f(x) dx = F(t), F'(x) = f(x),
can be obtained in closed form as an algebraic expression in t and wellknown transcendental functions of t. Then
lb
f(x) dx = F(b) - F(a).
See Grabner and Hofreiter (1961) for a comprehensive collection of formulas
describing such indefinite integrals and many important definite integrals.
As a rule, however, definite integrals are computed using discretization
methods which approximate the integral by finite sums corresponding to
some partition of the interval of integration [a, b] ("numerical quadrature").
A typical representative of this class of methods is Simpson's rule, which
is still the best-known and most widely used integration method. It is
described in Section 3.1, together with some other elementary integration
methods. Peano's elegant and systematic representation of the error terms
of integration rules is described in Section 3.2.
A closer investigation of the trapewidal sum in Section 3.3 reveals that
its deviation from the true value of the integral admits an asymptotic expansion in terms of powers of the discretization step h. This expansion is
the classical summation formula of Euler and Maclaurin. Asymptotic expansions of this form are exploited in so-called "extrapolation methods",
which increase the accuracy of a large class of discretization methods. An
application of extrapolation methods to integration ("Romberg integration") is studied in Section 3.4. The general scheme is described in Section
3.5.
146
3 Topics in Integration
A description of Gaussian integration rules follows in Section 3.6. The
chapter closes with remarks on the integration of functions with singularities. For a comprehensive treatment of integration, the reader is referred
to Davis and Rabinowitz (1975).
3.1 The Integration Formulas of Newton and Cotes
The integration formulas of Newton and Cotes are obtained if the integrand is replaced by a suitable interpolating polynomial P(x) and if then
P(x)dx is taken as an approximate value for
f(x)dx. Consider a uniform partition of the closed interval [a, b] given by
J:
J:
Xi = a + i h,
i = 0, 1, ... , n,
of step length h := (b - a) / n, n > 0 integer, and let P n be the interpolating
polynomial of degree n or less with
Pn(x;) = fi := f(Xi)
for
i = 0, L ... , n.
By Lagrange's interpolation formula (2.1.1.4),
'L fiLi(X), L;(x) II Xi - xk
n
n
Pn(x) ==
=
X -
k~O
i=O
k¥i
or, introducing the new variable t such that X = a + ht
Li(X) = tpi(t) :=
t- k
II i _ k'
n
k=D
k#i
Integration gives
n
=h'Lfi D;.
i=O
Note that the coefficients or weights
Di:=
r tpi(t)dt
./0
Xk
--,
3.1 The Integration Formulas of Newton and Cotes
147
depend solely on n; in particular, they do not depend on the function f to
be integrated, or on the boundaries a, b of the integral.
If n = 2 for instance, then
aO = [2 t - 1 t - 2 dt = ~ [2(t2 _ 3t + 2)dt = ~ (~ _ 12 +
10 0 - 1 0 - 2
al = [2 t -
2
10
2 3
2
4) =~,
3
0t - 2dt = _ 10[2 (t2 _ 2t)dt = _ (~3 _ 4) = i,3
10 1 - 0 1 - 2
i) 3'
a2 = [2 t - 0 t - 1 dt = ~ [\t2 _ t)dt = ~ (~ _
2- 02- 1
2 3
2
10
210
1
and we obtain the following approximate value:
J:
f(x)dx. This is Simpson's rule.
for the integral
For any natural number n, the Newton-Cotes formulas
(3.1.1)
Ib
a
Pn(x)dx = h
t
i=O
fiai,
fi:= f(a + ih),
b-a
h:=--,
n
J:
provide approximate values for
f(x)dx. The weights ai, i = 0, 1, ... , n,
have been tabulated. They are rational numbers with the property
n
(3.1.2)
Lai =n.
i=O
This follows from (3.1.1) when applied to f(x) :=::= 1 for which Pn(x) =::= 1.
If s is a common denominator for the fractional weights ai so that the
numbers
(Ji := s ai,
i = 0, 1, ... , n,
are integers, then (3.1.1) becomes
(3.1.3)
For sufficiently smooth functions f(x) on the closed interval [a, b] it can
be shown [see Steffensen (1950)] that the approximation error may be expressed as follows:
(3.1.4)
148
3 Topics in Integration
Here (a, b) denotes the open interval from a to b. The values of p and K
depend only on n but not on the integrand f.
For n = 1,2, ... ,6 we find the Newton-Cotes formulas given in the following table. For larger n, some of the values (Ji become negative and the
corresponding formulas are unsuitable for numerical purposes, as cancellations tend to occur in computing the sum (3.1.3).
n
O"i
1
1
1
2
1
4
1
3
1
3
3
1
4
7
32 12
32
5
19
75 50
50 75
6
41 216 27 272 27 216 41
ns
Error
h3
f2 j<2) (~)
Name
2
Trapezoidal rule
6
h 5 /of(4)W
Simpson's rule
8
h5iof(4)(~)
3/8-rule
90
h7 9~5f(6)(~)
Milne's rule
288
h7 1;~~6f(6)(~)
840
h g 1:oof(8)(~)
7
19
Weddle's rule
Additional integration rules may be found by Hermite interpolation
[see Section 2.1.5] of the integrand f by a polynomial P E IIn of degree n
or less. In the simplest case, a polynomial P E II3 with
P(a) = f(a),
P(b) = f(b),
P'(a) = 1'(a),
P'(b) = l' (b)
is substituted for the integrand f. The generalized Lagrange formula
(2.1.5.3) yields for P in the special case a = 0, b = 1,
P(t) = f(O)[(t - 1)2 + 2t(t - 1)2] + f(l)[t 2 - 2t 2 (t - 1)]
+1'(O)t(t _1)2 + 1'(I)t2 (t -1),
integration of which gives
11
P(t)dt = ~(f(O) + f(I)) + A(f'(O) - 1'(1)).
From this, we obtain by a simple variable transformation the following
integration rule for general a < b (h := b - a):
(3.1.5)
j f(x)dx M(h)
b
a
~
h
h2
:= 2 (f(a) + f(b)) + 12 (f'(a) -
f'(b)).
If f E C 4 [a, b] then ~ using methods to be described in Section 3.2 ~ the
approximation error of the above rule can be expressed as follows:
3.1 The Integration Formulas of Newton and Cotes
(3.1.6)
M(h)
-lb
f(x)dx = ;~; f(4)(~),
~ E (a, b),
149
h:= (b - a).
If the support abscissas Xi, i = 0, ... , n, Xo = a, Xn = b., are not equally
spaced, then interpolating the integrand f(x) will lead to different integration rules, among them the ones given by Gauss. These will be described
in Section 3.6.
The Newton-Cotes and related formulas are usually not applied to the
entire interval of integration [a, b], but are instead used 'in each one of a
collection of subintervals into which the interval [a, b] has been divided. The
full integral is then approximated by the sum of the approximations to the
subintegrals. The locally used integration rule is said to have been extended,
giving rise to a corresponding composite rule. We proceed to examine some
composite rules of this kind.
The trapezoidal rule (n = 1) provides the approximate value
in the subinterval [Xi, Xi+1] of the partition Xi = a + ih, 1 = 0, 1, ... , N,
h := (b - a)j N. For the entire interval [a, b], we obtain the approximation
(3.1.7)
N-l
T(h):=
L Ii
i=O
=h [f~a) +f(a+h)+f(a+2h)+"'+f(b-h)+ f;b)] ,
which is the trapezoidal sum for step length h. In each subinterval [Xi, Xi+1]
the error
is incurred, assuming f E C 2 [a, b]. Summing these individual error terms
gives
Since
and f(2) (x) is continuous, there exists ~ E [min~i,max~i]
C (a, b) with
,
,
150
3 Topics in Integration
Thus
T(h)
-l
b
f(x)dx = b ~ a h2f(2)(~),
~ E (a, b).
Upon reduction of the step length h (increase of n) the approximation error
approaches zero as fast as h 2 , so we have a method of order 2.
If N is even, then Simpson's rule may be applied to each subinterval
[X2i' X2i+2], i = 0, 1, ... , (N/2) -1, individually yielding the approximation
(h/3) (f(X2i) + 4f(X2i+l) + f(X2i+2)). Summing these N/2 approximations
results in the composite version of Simpson's rule
h
S(h) := 3[f(a) + 4f(a + h) + 2f(a + 2h) + 4f(a + 3h) + ...
+ 2f(b - 2h) + 4f(b - h) + f(b)],
for the entire interval. The error of S(h) is the sum of all N/2 individual
errors
and we conclude, just as we did for the trapezoidal sum, that
S(h)
-lb
a
f(x)dx = b - ah4f(4)(~),
180
~ E (a,b),
provided f E C 4 [a, b]. The method is therefore of order 4.
Extending the rule of integration M(h) in (3.1.5) has a remarkable
effect: when the approximation to the individual subintegrals
l
Xi
+1
f(x)dx
for
i = 0, 1, ... , N - 1
x,
are added up, all the "interior" derivatives f'(Xi), 0 < i < N, cancel. The
following approximation to the entire integral is obtained:
U(h) : = h [f~a) + f(a + h) + ... + f(b - h) + f;b)] + ~~ [f'(a) - f'(b)]
= T(h) + ~; [f'(a) - f'(b)].
This formula can be considered as a correction to the trapezoidal sum T(h).
It relates closely to the Euler-Maclaurin summation formula, which will be
discussed in Section 3.3 [see also Schoenberg (1969)]. The error formula
3.2 Peano's Error Representation
151
(3.1.6) for M(h) can be extended to an error formula for the composite
rule U(h) in the same fashion as before. Thus
(3.1.8)
provided f E C 4 [a,b]. Comparing this error with that of the trapezoidal
sum, we note that the order of the method has been improved by 2 with a
minimum of additional effort, namely the computation of f'(a) and f'(b).
If these two boundary derivatives are known to agree, e.g. for periodic
functions, then the trapezoidal sum itself provides a method of order at
least 4.
Replacing f'ea), f'(b) by difference quotients with an approximation
error of sufficiently high order, we obtain simple modifications ["end corrections": see Henrici (1964)] of the trapezoidal sum which do not involve
drivatives but still lead to methods of orders higher than 2. The following
variant of the trapezoidal sum is already a method of order 3:
T(h) :=h[152 f (a) + ~U(a + h) + f(a + 2h) + ... + feb - 2h)
+ ~U(b - h) + 152 f (b)].
For many additional integration methods and their systematic examination
see, for instance Davis and Rabinowitz (1975).
3.2 Peano's Error Representation
All integration rules considered so far are of the form
(3.2.1) 1(f):=
rno
ml
mn
k=O
k=O
k=O
L akof(xkO) + L aklf'(xkl) + ... + L aknf(n) (Xkn).
The integration error
R(f) := l(n
(3.2.2)
-l
b
f(x)dx
is a linear operator
R( af + f3g) = aR(f) + f3R(g)
for
f, 9 E V, a, 8 E lR
on some suitable linear function space V. Examples are V = Cn[a,b], the
space of functions with continuous nth derivatives on the interval [a, b], or
V = IIn , the space of all polynomials of degree no greater than n. The
following elegant integral representation of the error R(f) is a classical
result due to Peano:
152
3 Topics in Integration
(3.2.3) Theorem. Suppose R(P) = 0 holds for all polynomials P E IIn ,
that is, every polynomial whose degree does not exceed n is integrated exactly. Then for all functions f E en+! [a, b],
R(J) =
lb
f Cn + I )(t)K(t)dt,
where
K(t) := ~Rx[(X - t):;:'L
n.
(x _ t):;:' := { o(x - t)n
for x ::::: t,
for x < t,
and
Rx[(x - t):;:']
denotes the error of (x - t)+' when the latter is considered as a function in
x.
The function K(t) is called the Peano kernelDf the operator R.
Before proving the theorem, we will discuss its application in the case
of Simpson's rule
R(J) =
~f(-I) + ~f(O) + ~f(l)
-111
f(x)dx.
We note that any polynomial P E II3 is integrated exactly. Indeed, let
Q E II2 be the polynomial with P( -I) = Q( -I), P(O) = Q(O), P( +1) =
Q(+I). Putting S(x) := P(x) - Q(x), we have R(P) = R(S). Since the
degree of S(x) is no greater then 3, and since S(x) has the three roots -I,
0, +1, it must be of the form S(x) = a(x 2 - I)x, and
R(P) = R(S) = -ajI x(x 2 -I)dx = O.
-1
Thus Theorem (3.2.3) can be applied with n = 3. The Peano kernel becomes
K(t) = iRx[(x - t)t]
=i
[~(-I-t)t+~(0-t)t+~(I-t)t-l11(X-t)tdx].
By definition of (x - t)+', we find that for t E [-1, I]
j (x - t)tdx 11 (x - t)3dx (I ~ t)4
I
=
-1
=
t
(-I-t)t=o,
(_t)3 =
+
(l-t)t=(I-t)3,
{O_t
3
0,
if t :::::
if t < O.
,
3.2 Peano's Error Representation
153
The Peano kernel for Simpson's rule in the interval [-1, 1J is then
K(t) = {
(3.2.4)
i2 (1 - t)3(1 + 3t) if 0::; t ::; 1,
if-1::;t::;O.
K(-t)
PROOF OF THEOREM (3.2.3). Consider the Taylor expansion of f(x) at
x = a:
(3.2.5)
f(n)(a)
f(x) = f(a) + J'(a)(x - a) + ... + - - I-(x - a)n + rn(x).
n.
Its remainder term can be expressed in the form
rn(x) = , 1
n.
JX f(n+l)(t)(x - t)ndt = ,1 Jb f(n+l)(t)(x - t)~dt.
a
n.
a
Applying the linear operator R to (3.2.5) gives
(3.2.6)
since R(P) = 0 for P E IIn.
In order to transform this representation of R(f) into the desired one,
we have to interchange the Rx operator with the integration. To prove that
this interchange is legal, we show first that
for 1 ::; k ::; n. For k < n this follows immediately from the fact that
(x - t)~ is n - 1 times continuously differentiable. For k =, n - 1 we have
in particular
and therefore
The latter integral is differentiable as a function of x, since the integrand
is jointly continuous in x and t; hence
154
3 Topics in Integration
Thus (3.2.7) holds also for k = n.
By (3.2.7), the differential operators
dk
k = 1, ... , n,
commute with integration. Because I(f) = Ix(f) is a linear combination of
differential operators, it also commutes with integration.
Finally the continuity properties of the integrand j(n+l) (t)(x - t)+' are
such that the following two integrations can be interchanged:
This then shows the entire operator Rx commutes with integration, and we
obtain the desired result
Note that Peano's integral representation of the error is not restricted to
operators of the form (3.2.1). It holds for all operators R for which Rx
commutes with integration.
For a surprisingly large class of integration rules, the Peano kernel K(t)
has constant sign on [a, b]. In this case, the mean-value theorem of integral
calculus gives
(3.2.8)
R(f) = j(n+l)(o
lb
K(t)dt
for some
e E (a, b).
The above integral of K(t) does not depend on j, and can therefore be
determined by applying R, for instance, to the polynomial j(x) := xn+l.
This gives
(3.2.9)
R(f) = R(xn+1) j(n+1) (0
(n + I)!
for some
~ E (a, b).
3.2 Peano's Error Representation
In the case of Simpson's rule, K(t)
addition
4
R(x
1
- )= 4!
24
0 for -1 :S t :S 1 by (3.2.4). In
~
(1-·1+-·0+-·14 1 11 x 4dx )
3
3
155
3
1
90'
-1
so that we obtain for the error of Simpson's formula
~f( -1) + ~f(O) + ~f(l) - [11 f(t)dt = 910f(4)(~), ~ E (a, b).
In general, the Newton-Cotes formulas of degree n integrate without error polynomials P E IIn if n is odd, and P E IIn+l if n is even [see Exercise
2]. The Peano kernels for the Newton-Cotes formulas are of constant sign
[see for instance Steffensen (1950)], and (3.2.9) confirms the error estimates
in Section 3.1:
Rn(f) =
{
Rn(xn+l)f(n+1)(t)
1
<",
(
R
n+ 1.)
n(x
n+2
(n + 2)!
'f
1
. dd
n IS
0
,
~ E (a,b),
) f(n+2)(o
'
if n is even
.
Finally, we use (3.2.9) to prove the representation (3. L.6) of the error
of integration rule (3.1.5), which was based on cubic Hermite interpolation.
Here
r
h
~
R(f) = 2(f(a) + feb)) + 12 (f'(a) - f'(b)) - Jb f(x)dx,
which vanishes for all polynomials P E II3 . For n
following Peano kernel:
1
h:= b- a,
= 3, we obtain the
3
K(t) = "6Rx((x - t)+)
lb
1
3
h - t)+2 - (b - t)+)2' - a (X - t)+dx
3
= "61 [h2((a - t)+3 +(b - t)+)
+ 4((a
2
=
~[~(b - t)3 _ h2 (b _ t)2 _ ~(b _ t)4]
6 2
4
4
= - 214(b-t)2(a-t)2.
Since K(t) ::; 0 in the interval of integration [a, b], (3.2.9) is applicable.
We find for a = 0, b = 1 that
R(x 4 ) = .L(1.1 +.L. (-4) _ 1) = __
1
4!
24 2
12
5
720 .
Thus
156
3 Topics in Integration
R(f) = -~ {b j(4)(t)(b _ t)2(a _ t)2dt = _ (b - a)5 j(4)(~)
uL
no'
~ E (a, b),
for j E C4[a, b], which was to be shown.
3.3 The Euler-Maclaurin Summation Formula
The error formulas (3.1.6), (3.1.8) are low-order instances of the famous
Euler-Maclaurin summation formula, which in its simplest form reads (for
9 E C 2m+2[0, 1])
(I g(t)dt =g(O) + g(l) +
(3.3.1)
Jo
2
2
f
B21 (g(21-1)(0) _ g(21-1)(1))
1=1 (2l)!
Here Bk are the classical Bernoulli numbers
B_1
2 - 6'
(3.3.2)
B
I B6 = 42'
1
4 = - 30 '
B
_1
8 - - 30' ... ,
whose general definition will be given below. Extending (3.3.1) to its composite form in the same way as (3.1.6) was extended to (3.1.8), we obtain
for 9 E C 2m+2[0, N]
(N g(t)dt =g(O) + g(l) + ... + g(N _ 1) + g(N)
Jo
2
+
f
1=1
_
2
B21 (g(21-1)(0) _ g(21-1)(N))
(21)!
B 2m +2 N (2m+2)(c)
(2m + 2)!
9
., ,
0 < C < N.
.,
Rearranging the terms of the above formula leads to the most frequently
presented form of the Euler-Maclaurin summation formula:
g(O) + g(l) + ... + g(N _ 1) + g(N)
2
2
(3.3.3)
f
= (N g(t)dt +
B21 (g(21-1)(N) _ g(21-1)(0))
Jo
1=1 (21)!
+
B 2m +2 Ng(2m+2)(~)
(2m + 2)!
'
0 < ~ < N.
For a general uniform partition Xi = a + ih, i = 0, ... , N, XN = b, of the
interval [a, b], (3.3.3) becomes
3.3 The Euler-Maclaurin Summation Formula
(3.3.4)
T(h) =
l
b
a
f(t)dt +
157
B
LTn h 21 --;U(21-1)(b)
- f(21-1)(a))
(2l).
1=1
+ h 2Tn +2 B 2Tn +2 (b _ a)f(2m+2)(~),
(2m + 2)!
a < ~ < b,
where T(h) denotes the trapezoidal sum (3.1.7)
f(a)
f(b)]
T(h)=h [-2-+ f (a+h)+···+f(b-h)+-2 .
In this form, the Euler-Maclaurin snmmation formula expands the trapezoidal sum T (h) in terms of the step length h = (b - a) / N, and herein lies
its importance for our purposes: the existence of such an expansion puts at
one's disposal a wide arsenal of powerful general "extrapolation methods",
which will be discussed in subsequent sections.
PROOF OF (3.3.1). We will use integration by parts and successively determine polynomials Bk(x), starting with B 1(x) == x -~, such that
(3.3.5)
r
11
o
r
1
1
Jo g(t)dt = B 1(t)g(t)lo1 - Jo B 1(t)g'(t)dt,
1
1111
Bdt)g'(t)dt = -B2(t)g'(t)dtl -2
0
2
0
B2(t)gl/(t)dt,
where
(3.3.6)
B~+l (x) = (k + l)Bdx),
k = 1, 2, ....
It is clear from (3.3.6) that each polynomial Bk(X) is of degree k and that
its highest-order terms has coefficient 1. Given Bk(x), thE relation (3.3.6)
determines Bk+1(X) up to an arbitrary additive constant. We now select
these constants so as to satisfy the additional conditions
(3.3.7)
which determine the polynomials Bdx) uniquely. Indeed, if
B 21-1 ()
X = X21-1 + C21-2x 21-2 + ... + C1x + CO,
then with integration constants c and d,
B 21+1(X)=x
21+1
(2l + 1 )2l
21
+2l(21_1)C21-2X +···+(2l+1)cx+d.
158
3 Topics in Integration
B 21 + 1 (0) = 0 requires d = 0, and B 21 + 1 (1) = 0 determines c.
The polynomials
B3(X) == :r 3 - ~x2 + ~x,
B4(X) == X4 - 2x 3 + x2 -
do.
are known as Bernoulli polynomials. Their constant terms Bk = Bk(O) are
the Bernoulli numbeT's (3.3.2). All Bernoulli numbers of odd index k > 1
vanish because of (3.3.7).
The Bernoulli polynomials satisfy
(3.3.8)
This folows from the fact that the polynomials (_l)k B k (l- x) satisfy the
same recursion, namely (3.3.6) and (3.3.7). as the Bernoulli polynomials
Bk(X), Since they also start out the same, they must coincide.
The following relation - true for odd indices k > 1 by (3.3.7) _.- can
now be made generaL since (3.3.8) establishes it for even k:
(3.3.9)
This gives
(3.3.10)
We are now able to complete the expansion of
11
g(t)dt.
Combining the first 2m + 1 relations (3.3.5), observing
for k > 1 by (3.3.9), and accounting for B 21 + 1 = 0, we get
where the error term T'm+1 is given by the integral
1'm+1 := (
1
+
-1
2m
1
)'
1.
0
B 2m+ 1 (t)g
(2m+1)
(t)dt.
\Ve use integration by parts once more to transform the error term:
3.3 The Euler-Maclaurin Summation Formula
r
1
10
159
B 2m +1(t)g(2m+1l(t)dt = _1_(B 2m +2(t) _ B 2m +2 )g(2m+l)(t)1 1
2m + 2
0
_ _1_
r
2m+ 210
1(B +2(t) _ B + )g(2m+2l(t)dt.
2m 2
2m
The first term On the right-hand side vanishes again by (~1.3.9). Thus
(3.3.11)
rm+1 = (2m
~ 2)! 10 1(B2m+2(t) - B 2m+2)g(2m+2) (t)dt.
In order to complete the proof of (3.3.1), we need to show that B2m+2(t)B 2m +2 does not change its sign between 0 and 1. We will show by induction
that
(-1)mB 2m _ 1 (X»
(3.3.12b)
(-1)m(B2m(x) - B 2m » 0
(_1)m+lB2m> O.
(3.3.12c)
for 0 < x < ~,
for 0 < x < 1,
0
(3.3.12a)
Indeed, (3.3.12a) holds for m = 1. Suppose it holds for some m ;::: 1. Then
for 0 < x ::::; ~,
(_l)m
- - ( B2m (x) - B 2m ) = (_l)m
2m
lx
0
B 2m - 1(t)dt> O.
By (3.3.8), this extends to the other half of the interval, ~ :::; x < 1, proving
(3.3.12b) for this value of m. In view of (3.3.10), we have
(-lr+ 1B2m = (_l)m
10 1(B2m(t) - B 2m )dt > 0,
which takes care of (3.3.12c). We must now prove (3.3.12a) for m+ 1. Since
B 2m +1(x) vanishes for x = 0 by (3.3.7), and for x = ~ by (3.3.8), it cannot
change its sign without having an inflection point x between 0 and ~. But
then B 2m - 1 (x) = 0, in violation of the induction hypothesis. The sign of
B2m+r(X) in 0 < x < ~ is equal to the sign of its first derivative at zero,
whose value is (2m + 1)B2m(O) = (2m + 1)B2m' The sign of the latter is
(_1)m+1 by (3.3.12c).
Now for the final simplification of the error term (3.3.11). Since the
function B 2m +2(X) - B2m+2 does not change its sign in the interval of
integration, there exists ~, 0 < ~ < 1, such that
From (3.3.10),
B 2m +2 (2m+2) (t)
2m + 2)!g
<" ,
160
3 Topics in Integration
which completes the proof of (3.3.1).
3.4 Integration by Extrapolation
Let f E C 2m+2[a, b] be a real function to be integrated over the interval
[a, b]. Consider the expansion (3.3.4) of the trapezoidal sum T(h) of f in
terms of the step length h = (b - a)/n. It is of the form
(3.4.1)
Here
TO =
lb
f(x)dt
is the integral to be calculated,
and
is the error coefficient. Since j(2m+2) is continuous by hypothesis in the
closed finite interval [a, b], there exists a bound L such that Ipm+2(x)1 ~ L
for all x E [a, b]. Therefore:
(3.4.2) There exists a constant Mm+l such that
lam+l(h)1 ~ Mm+l
for all h = (b - a)/n, n = 1, 2, ....
Expansions of the form (3.4.1) are called asymptotic expansions in h
if the coefficients Tk, k ~ m do not depend on hand am+l (h) satisfies
(3.4.2). The summation formula of Euler and Maclaurin is an example of
an asymptotic expansion. If all derivatives of f exist in [a, b]' then by letting
m = 00, the right-hand side of (3.4.1) becomes an infinite series:
TO
+ T 1 h2 + T2h4 + ....
This power series may diverge for any h f. O. Nevertheless, because of
(3.4.2), asymptotic expansions are capable of yielding for small h results
which are often sufficiently accurate for practical purposes [see, for instance,
Erdelyi (1956), Olver (1974)].
The above result (3.4.2) shows that the error term of the asymptotic
expansion (3.4.1) becomes small relative to the other terms of (3.4.1) as h
3.4 Integration by Extrapolation
161
decreases. The expansion then behaves like a polynomial in h 2 which yields
the value TO of the integral for h = O. This suggests the following method
for finding TO: For each step length hi in a sequence
b-a
h
ho = - - , hl = ~, ... ,
no
nl
where nl, n2, ... , nm are strictly increasing positive integers, determine
the corresponding trapezoidal sums
TiO := T(h i ),
i = 0, 1, ... , m.
Let
be the interpolating polynomial in h 2 with
and take the "extrapolated" value Tmm(O) as the approximation to the
desired integral TO. This method of integration is known as Romberg integration, having been introduced by Romberg (1955) for the special sequence
hi = (b - a)/2i. It has been closely examined by Bauer, Rutishauser, and
Stiefel (1963).
Neville's interpolation algorithm is particularly well suited for calculating Tmm(O). For indices i, k with 1 ::; k ::; i ::; m let Tik(h) be the
polynomial of degree at most k in h 2 for which
and let
The recursion formula (2.1.2.7) becomes for Xi =
(3.4.3)
T. = T
•k
.,k-l
+
Ti,k-l - Ti-1,k-l
[ hi - k ] 2 _ 1
hr
1 ::; k S i ::; m .
hi
It will be advantageous to arrange the intermediate values Tik in the triangular tableau (2.1.2.4), where each element derives from its two left neighbors:
162
3 Topics in Integration
T(ho) = Too
h6
(3.4.4)
EXAMPLE. Calculating the integral
11
t 5 dt
by extrapolation over the step lengths ho = 1, hI = 1/2, h2 = 1/4, we arrive
at the following tableau using (3.4.3) and 6-digit arithmetic (the fractions in
parentheses indicate the true values)
Too = 0.500000(= ~)
_ 1
h 21 -"4
TlO = 0.265625(=
H)
Tn = 0.187 500( = -&)
T22 = 0.166667(=~)
T21 = 0.167969(= 2~6)
T 20 = 0.192383(= 11~;4)
Each entry Tik of the polynomial extrapolation tableau (3.4.4) represents in fact a linear integration rule for a step length hi = (b - a)/mi'
mi > 0 integer:
Some of the rules, but not all, turn out to be of the Newton-Cotes type [see
Section 3.1]. For instance, if hI = h o/2 = (b - a)/2, then Tn is Simpson's
rule. Indeed,
Too = (b - a) (~f(a) + ~f(b)) ,
1
TlO = "2(b - a)
By (3.4.3),
and therefore
(1"2f(a) + f (a+b) + "2f(b)
1 ).
-2-
3.4 Integration by Extrapolation
TIl
163
= ~ (b - a) ( ~ f (a) + ~ f ( a ; b) + ~ f (b)) .
If we go one step further and put h2 = hd2, then T22 becomes Milne's rule.
However this pattern breaks down for h3 = h2/2 since T33 is no longer a
Newton-Cotes formula [see Exercise 10].
In the above cases, T21 and T31 are composite Simpson rules:
b- a I
4
2 (
4 (
I ())
T21 = -4-(-3f(a)
+ 3f(a
+ h 2 ) + 3f
a + 2h2 ) + 3f
a + 3h2) +:d
b
T31 = b ~ a (~f(a) + V(a + h3) + U(a + 2h 3 ) + ... + ~f(b)).
Very roughly speaking, proceeding downward in tableau (3.4.4) corresponds
to extending integration rules, whereas proceeding to the right increases
their order.
The following sequences of step lengths are usually chosen for extrapolation methods:
h 0 -- b - a, h I -!Y2
2"'"
(3 .4 .5a )
(3 . 4 . 5b)
h'• --
h i2- 1 '
ho
h 2 -- "'3'
ho ... , h i -h 0 -- b - a, h I -- 2'"'
Z• -hi- 2
-2-"
2 , 3 , ... ,
Z• --
3 , 4, . . . .
The first sequence is characteristic of Romberg's method [Romberg (1955)].
The second has been proposed by Bulirsch (1964). It has the advantage that
the effort for computing T(h i ) does not increase quite as rapidly as for the
Romberg sequence.
For the sequence (3.4.5a), half of the function values needed for calculating the trapezoidal sum T(hHd have been previously encountered in
the calculation of T(h i ), and their recalculation can be avoided. Clearly
T(hi+l) = T(~hi)
= ~T(hi)
+ hi+1[J(a + hHd + f(a + 3h H d + ... +- feb - hi+1)]'
Similar savings can be realized for the sequence (3.4.5b).
An ALGOL prozedure which calculates the tableau (3.4.4) for given m
and the interval [a, b] using the Romberg sequence (3.4.5a) is given below.
To save memory space, the tableau is built up by adding u.pward diagonals
to the bottom of the tableau. Only the lowest elements in each column need
to be stored for this purpose in the linear array t[O : m].
164
3 Topics in Integration
procedure romber:q (a, b, 1, rn);
value a, b, rn;
integer rn;
real a, b;
real procedure f;
begin real h, s;
integer i, k, n, q;
array t[O: rn];
h := b - a; n := 1;
t[O] := 0.5 x h x (f(a) + f(b));
for k:= 1 step 1 until rn do
begin s:= 0; h:= 0.5 x h; n := 2 x n; q := 1;
for i := 1 step 2 until n - 1 do
s:=s+f(a+ixh);
t[k] := 0.5 x t[k - 1] + s x h;
print (t[k]);
for i := k - 1 step -1 until a do
begin q := q x 4;
t[i] := t[i + 1] + (t[i + 1]- t[i])/(q - 1);
print t([i])
end
end
end;
We emphasize that the above algorithm serves mainly as an illustration
of integration by extrapolation methods. As it stands, it is not well suited
for practical calculations. For one thing, one does not usually know ahead
of time how big the parameter rn should be chosen in order to obtain the
desired accuracy. In practice, one calculates only a few (say seven) columns
of (3.4.4), and stops the calculation as soon as IT;,6 - Ti+1.61 :'S: E s, where
E is a specified tolerance and s is a rough approximation to the integral
lb
If(t)ldt.
Such an approximation s can be obtained concurrently with calculating
one of the trapezoidal sums T(h;). A more general stopping rule will be
described, together with a numerical example, in Section 3.5. Furthermore,
the sequence (3.4.5b) of step lengths is to be preferred over (3.4.5a). Finally, rational interpolation has been found to yield in most applications
considerably better results than polynomial interpolation. A program with
all these improvements can be found in Bulirsch and Stoer (1967).
When we apply rational interpolation [see Section 2.2], then the recursion (2.2.3.8) replaces (2.1.2.7):
3.5 About Extrapolation Methods
(3.4.6)
T
T
Ti,k-l - Ti-l,k-l
ik = i,k-l + - - - " " 2 - - ' - - - - - - - - - ' - - - - - Ti,k-l - Ti-1,k-l] -1
[ hi-k]
hi
Ti,k-l - Ti- l ,k-2
[1-
165
1 :S k :S i :S m.
The same triangular tableau arrangement is used as for polynomial extrapolation: k is the column index, and the recursion (~;.4.6) relates each
tableau element to its left-hand neighbors. The meaning of Tik, however,
is now as follows: The functions Tik (h) are rational functions in h 2 ,
with the interpolation property
We then define
Tik := Tik(O),
and initiate the recursion (3.4.6) by putting TiD := T(h i ) for i = 0, 1,
... , m, and Ti,-l := 0 for i = 0, L ... , m - 1. The observed superiority
of rational extrapolation methods reflects the more flexible approximation
properties of rational functions [see Section 2.2.4].
In Section 3.5, we will illustrate how error estimates for extrapolation
methods can be obtained from asymptotic expansions like (3.4.1). Under
mild restrictions on the sequence of step lengths, it will follow that, for
polynomial extrapolation methods based on even asymptotic expansions,
the errors of Til behave like h;, those of Til like
1 h;l and, in general,
those of Tik like hLkhLk+l ... h; as i -+ 00. For fixed k, consequently, the
sequence Tik i = k, k + 1, ... , approximates the integral like a method of
order 2k + 2. For the sequence (3.4.5a) a stronger result has been found:
hL
h2(-1)kB2k+2j(2k+2)( )
· -lbj( )d - (b- )h 2 h 2
T ,k
(347)
..
X
X a i-k i-k+l'" i (
)'
~ ,
a
2k + 2 .
for a suitable ~ E (a, b) and j E C 2k +2 [a, b] [see Bauer, Rutishauser and
Stiefel (1963), Bulirsch (1964)].
3.5 About Extrapolation Methods
Some of the numerical integration methods discussed in this chapter (as,
for instance, the methods based on the formulas of Newton and Cotes) had
a common feature: they utilized function information only on a discrete set
166
3 Topics in Integration
of points whose distance - and consequently the coarseness of the sample
- was governed by a "step length" . To each such step h i- 0 corresponded
an approximate result T(h), which furthermore admitted an asymptotic
expansion in powers of h. Analogous discretization methods are available
for many other problems, of which the numerical integration of functions is
but one instance. In all these cases, the asymptotic expansion of the result
T(h) is of the form
(3.5.1)
T(h) = TO + T1h'Y1 + T2h'Y2 + ... + Tmh'Ym + am+! (h)h'Ym+l,
o < 1'1 < 1'2 < ... < I'm+!,
where the exponents I'i need not to be integers. The coefficients Ti are
independent of h, the function a m+1(h) is bounded for h -t 0, and TO =
limh--+o T(h) is the exact solution of the problem at hand.
Consider, for example, numerical differentiation. For hi- 0, the central
difference quotient
T(h) = f(x + h) - f(x - h)
2h
is an approximation to f'(x). For functions f E C 2m+3[x - a,x + a] and
Ihl :s; lal, Taylors's theorem gives
1
h2
h 2m +3
T(h) =-{ f(x) + hf'(x) + - f"(x) + ... +
[jC2m+3 l (x) + 0(1)]
2h
2!
(2m + 3)!
h2
h 2m +3
- f(x) + hf'(x) - - f"(x) + ... +
[f C2m+3 l (x) + 0(1)] }
2!
(2m + 3)!
=TO + T1h2 + ... + Tmh2m + h 2m +2a m+1 (h)
where TO = f'(x), Tk = fC2k+ 1l(x)/(2k + I)!, k = 1, 2, ... , m + 1, and
a m+1(h) = Tm+! + 0(1).
U sing the one-sided difference quotient
f(x + h) - f(x)
T(h) .=-.
h
'
--J-
hr 0,
leads to the asymptotic expansion
with
fCk+!l(x)
Tk = (k + I)! '
k = 0, 1, 2, ... , m + 1.
We will see later that the central difference quotient is a better approximation to base an extrapolation method on, as far as convergence
is concerned, because its asymptotic expansion contains only even powers
of the step length h. Other important examples of discretization methods
3.5 About Extrapolation Methods
167
which lead to such asymptotic expansions are those for the solution of ordinary differential equations [see Sections 7.2.3 and 7.2.12]. A systematic
treatment of extrapolation methods is found in Brezinski a,nd Zaglia (1991).
In order to derive an extrapolation method for a given discretization
method with (3.5.1), we select a sequence of step lengths
ho > hI > h2 > ... > 0,
F = {ho, hI, h2' ... },
and calculate the corresponding approximate solutions T(h i ), i = 0, 1, 2,
.... For i 2': k, we introduce the "polynomials"
for which
Tik(hj)=T(h j ),
j=i-k,i-k+1, ... ,i,
and we consider the values
as approximations to the desired value 70. Rational functions Tik(h) are
frequently preferred over polynomials. Also the exponents 'Yk need not be
integer [see Bulirsch and Stoer (1964)].
For the following discussion of the discretization errors, we will assume
that Tik (h) are polynomials with exponents of the form 'Yk: = 'Y k. Romberg
integration [see Section 3.4] is a special case with 'Y = 2.
First, we consider the case i = k. In the sequel, we will use the abbreviations
h'Yj ' j = 0,1, ... , m.
Zj ..Applying Lagrange's interpolation formula (2.1.1.4) to the polynomial
yields for Z =
°
k
k
nk = Pk(o) = LCjPk(Zj) = LCjT(hj )
j=o
j=o
with
k
Cj:=II ~.
u""J Za -
Zj
17=0
Then
(3.5.2)
k
~CjZT
=
~
J
j=O
{I°
k
(-1) ZOZI'" Zk
if 7 = 0,
if 7 = 1, 2, ... , k,
if7=k+1.
168
3 Topics in Integration
PROOF. By Lagrange's interpolation formula (2.1.1.4) and the uniqueness
of polynomial interpolation,
k
p(O) = LCjp(Zj)
j=O
holds for all polynomials p(z) of degree::; k. One obtains (3.5.2) by choosing
p( z) as the polynomials
ZT,
T
= 0, 1, ... , k,
and
D
(3.5.2) can be sharpened for sequences hj for which there exists a constant b > 0 such that
h+l
T
: ; b < 1 for all j.
(3.5.3)
J
In this case, there exists a constant C k which depends only on b and for
which
k
(3.5.4)
L [Cj[zj+l ::; CkZOZl··· Zk·
j=O
We prove (3.5.4) only for the special case of geometric sequences {h j } with
h j = hobl,
0 < b < 1,
j = 0, 1, ....
For the general case see Bulirsch and Stoer (1964). With the abbreviation
(J := b'Y we have
zj = (hobl)'YT = zoBjT.
In view of (3.5.2), the polynomial
k
Pk(Z) := L CjZ j
j=O
satisfies
for T = 0,
for T = 1, 2, ... , k,
so that Pk(z) has the k different roots BT, T = 1, ... , k. Since P k (l) = 1
the polynomial Pk must have the form
3.5 About Extrapolation Methods
169
The coefficients of Pk alternate in sign, so that
k
k
L ICjlzj+1 = Z~+l L IcJ I(g(k+1))j = z~+1IPk( __ gk+1)1
j=O
j=O
- ~k+1
-
"0
gk+1 + gl
IIk ------,-1 _ gl
1=1
gl
II ~
1 _ gl
k
-
k+1g1+2+ ... +k
=
Cdg)Z()Z1 ... Zk
- Zo
1=1
with
(3.5.5)
This proves (3.5.4) for the special case of geometrically decreasing step
lengths hj.
We are now able to make use of the asymptotic expansion (3.5.1) which
gives for k :::; m
k
Tkk = L cjT(hj )
j=O
k
=L
J
Cj[TO + T1Zj + T2 Z + ... + Tkzj + cxk+1(hj)zj+1],
j=O
where of course, for k < m
By (3.5.2)
k
(3.5.6)
Tkk = TO + LCjCXk+l(hj)zj+1.
J=O
Using (3.5.4) and ICXrn+l(hj )1 :::; Mrn+1 for all j 2: 0 [see (3.4.2)] we find for
k=m
(3.5.7)
170
3 Topics in Integration
and for k < m,
(3.5.8)
because of (}:k+l(h j ) = Tk+l + O(h]), (3.5.2), and (3.5.6).
The corresponding estimates for the error of Tik for the general case
i 2: k are obtained by just replacing in (3.5.7) and (3.5.8) Zo, Zl, ... , Zk by
Zi-k, Zi-k+1, ... , Zi, and ho by h i - k respectively, because Tik is obtained
by extrapolating from T(h j ), j = i - k, i - k + 1, ... , i. Since Zj = h] this
leads to
(3.5.9)
and for k < m, i 2: k to
(3.5.10)
Consequently, for fixed k and i -t 00,
Tik - TO = O(h~~tlh).
In other words, the elements Tik of the (k + 1)st column of the tableau
(3.4.4) converge to TO like a method of order (k + 1)')'. Note that the increase of the order of convergence from column to column which can be
achieved by extrapolation methods is equal to 'Y : 'Y = 2 is twice as good
as 'Y = 1. This explains the preference for discretization methods whose
corresponding asymptotic expansions contain only even powers of h, e.g.,
the asymptotic expansion of the trapezoidal sum (3.4.1) or the central difference quotient discussed in this section.
The formula (3.5.10) shows furthermore that the sign of the error remains constant for fixed k < m and sufficiently large i provided Tk+l -I- O.
Then
(3.5.11)
Now in many cases b'Y(k+ 1) < ~. Then the error of the quantity
Uik := 2Ti+l,k - Tik
satisfies
Uik - TO = 2(Ti+l,k - TO) - (Tik - TO)'
For s := sign (Ti+l,k - TO) = sign (Ti,k - TO) we have
S(Uik - TO) = 2ITi,k+1 - Tol-ITik - Tol ::::; -ITik - Tol < O.
Thus Uik converges monotonically to TO, for i -t 00 at roughly the same
rate as Tik but from the opposite direction, so that eventually Tik and Uik
3.6 Gaussian Integration IVIethods
171
will include the limit TO between them. This observation yields a convenient
stopping criterion.
EXAMPLE. The exact value of the integral
1
"/2
o
5(e" - 2)-le 2X cosx dx
is 1. Using the polynomial extrapolation method of Romberg, and carrying 12
digits, we obtain for Tik, Uik , 0 ::; i ::; 6, 0 ::; k ::; 3 the values given in the
following table.
TiO
o 0.185 755 068 924
1 0.724 727 335 089 0.904 384 757 145
2 0.925 565 035 158 0.992 510 935 182 0.998 386 013 717
3 0.981 021 630 069 0.999 507 161 706 0.999 973 576 808 0.999 998 776 222
4 0.995 232 017 388 0.999 968 813 161 0.999 999 589 925 1.000 000 002 83
5 0.998 806 537974 0.999998044836 0.999 999 993 614 1.00000000002
6 0.999 701 542 775 0.999 999 877 709 0.999 999 999 901 1.000 000 000 00
UiO
o 1.263 699 601 26
1 1.126 402 735 23
2 1.036478 22498
3 1.009 442 404 71
4 1.002 381 058 56
5 1.000 596 547 58
6 1.000 149 217 14
Ui1
Ui3
Ui 2
1.080 637 113 22
1.006 503 388 23
1.000 430 464 62
1.000 027 276 51
1.000 001 710 58
1.000 000 107 00
1.001 561 139 90
1.000 025 603 04 1.000 001 229 44
1.000 000 397 30 0.999 999 997 211
1.000 000 006 19 0.999 999 999 978
1.000 000 000 09 1. 000 000 000 00
3.6 Gaussian Integration Methods

In this section, we broaden the scope of our examination by considering integrals of the form

   I(f) := ∫_a^b w(x) f(x) dx,

where w(x) is a given nonnegative weight function on the interval [a, b]. Also, the interval [a, b] may be infinite, e.g., [0, +∞] or [−∞, +∞]. The weight function must meet the following requirements:

(3.6.1)
 a) w(x) ≥ 0 is measurable on the finite or infinite interval [a, b].
 b) All moments μ_k := ∫_a^b x^k w(x) dx, k = 0, 1, ..., exist and are finite.
 c) For polynomials s(x) which are nonnegative on [a, b], ∫_a^b w(x) s(x) dx = 0 implies s(x) ≡ 0.
The conditions (3.6.1) are met, for instance, if w(x) is positive and continuous on a finite interval [a, b]. Condition (3.6.1c) is equivalent to ∫_a^b w(x) dx > 0 [see Exercise 14].
We will again examine integration rules of the type

(3.6.2)   Ĩ(f) := Σ_{i=1}^{n} w_i f(x_i).

The Newton-Cotes formulas [see Section 3.1] are of this form, but the abscissas x_i were required to form a uniform partition of the interval [a, b]. In this section, we relax this restriction and try to choose the x_i as well as the w_i so as to maximize the order of the integration method, that is, to maximize the degree for which all polynomials are exactly integrated by (3.6.2). We will see that this is possible and leads to a class of well-defined so-called Gaussian integration rules or Gaussian quadrature formulas [see for instance Stroud and Secrest (1966)]. These Gaussian integration rules will be shown to be unique and of order 2n − 1. Also w_i > 0 and a < x_i < b for i = 1, ..., n. In order to establish these results and to determine the exact form of the Gaussian integration rules, we need some basic facts about orthogonal polynomials. We introduce the notation

   Π̄_j := { p | p(x) = x^j + a_1 x^{j−1} + ⋯ + a_j }

for the set of normed real polynomials of degree j, and, as before, we denote by

   Π_j := { p | degree(p) ≤ j }

the linear space of all real polynomials whose degree does not exceed j. In addition, we define the scalar product

   (f, g) := ∫_a^b w(x) f(x) g(x) dx

on the linear space L²[a, b] of all functions for which the integral ∫_a^b w(x) f(x)^2 dx is well defined and finite. The functions f, g ∈ L²[a, b] are called orthogonal if (f, g) = 0. The following theorem establishes the existence of a sequence of mutually orthogonal polynomials, the system of orthogonal polynomials associated with the weight function w(x).
(3.6.3) Theorem. There exist polynomials p_j ∈ Π̄_j, j = 0, 1, 2, ..., such that

(3.6.4)   (p_i, p_k) = 0   for i ≠ k.

These polynomials are uniquely defined by the recursions

(3.6.5a)   p_0(x) ≡ 1,
(3.6.5b)   p_{i+1}(x) = (x − δ_{i+1}) p_i(x) − γ_{i+1}^2 p_{i−1}(x)   for i ≥ 0,

where p_{−1}(x) := 0 and¹

(3.6.6a)   δ_{i+1} := (x p_i, p_i)/(p_i, p_i)   for i ≥ 0,
(3.6.6b)   γ_1 := 1,   γ_{i+1}^2 := (p_i, p_i)/(p_{i−1}, p_{i−1})   for i ≥ 1

(the value of γ_1 is immaterial in (3.6.5b), since p_{−1} ≡ 0).

PROOF. The polynomials can be constructed recursively by a technique known as Gram-Schmidt orthogonalization. Clearly p_0(x) ≡ 1. Suppose then, as an induction hypothesis, that all orthogonal polynomials with the above properties have been constructed for j ≤ i and have been shown to be unique. We proceed to show that there exists a unique polynomial p_{i+1} ∈ Π̄_{i+1} with

(3.6.7)   (p_{i+1}, p_j) = 0   for j ≤ i,

and that this polynomial satisfies (3.6.5b). Any polynomial p̄_{i+1} ∈ Π̄_{i+1} can be written uniquely in the form

   p̄_{i+1}(x) = (x − δ_{i+1}) p_i(x) + c_{i−1} p_{i−1}(x) + ⋯ + c_0 p_0(x),

because its leading coefficient and those of the polynomials p_j, j ≤ i, have value 1. Since (p_j, p_k) = 0 for all j, k ≤ i with j ≠ k, (3.6.7) holds if and only if

(3.6.8a)   (p̄_{i+1}, p_i) = (x p_i, p_i) − δ_{i+1}(p_i, p_i) = 0,
(3.6.8b)   (p̄_{i+1}, p_{j−1}) = (x p_{j−1}, p_i) + c_{j−1}(p_{j−1}, p_{j−1}) = 0   for j ≤ i.

The condition (3.6.1c), with p_i^2 and p_{j−1}^2, respectively, in the role of the nonnegative polynomial s, rules out (p_i, p_i) = 0 and (p_{j−1}, p_{j−1}) = 0 for 1 ≤ j ≤ i. Therefore, the equations (3.6.8) can be solved uniquely. (3.6.8a) gives (3.6.6a). By the induction hypothesis,

   p_j(x) = (x − δ_j) p_{j−1}(x) − γ_j^2 p_{j−2}(x)   for j ≤ i.

From this, by solving for x p_{j−1}(x), we have (x p_{j−1}, p_i) = (p_j, p_i) for j ≤ i, so that

   c_{j−1} = −(p_i, p_i)/(p_{i−1}, p_{i−1})   for j = i,
   c_{j−1} = 0                                for j < i,

in view of (3.6.8). Thus (3.6.5b) has been established for i + 1.  □

¹ x p_i denotes the polynomial with values x p_i(x) for all x.
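The recursion (3.6.5) translates directly into a short program. The following Python sketch evaluates p_0(x), ..., p_n(x) from given coefficient arrays; the Legendre coefficients used in the demonstration (all δ_i = 0 and γ_{j+1}² = j²/(4j² − 1)) anticipate Exercise 17 and are quoted here only to have a concrete test case.

   def orthopoly_values(x, delta, gamma2):
       """Values p_0(x), ..., p_n(x) of the monic orthogonal polynomials defined by
       p_{i+1}(x) = (x - delta[i]) * p_i(x) - gamma2[i] * p_{i-1}(x), p_{-1} = 0, p_0 = 1.
       delta[i] plays the role of delta_{i+1}, gamma2[i] the role of gamma_{i+1}^2."""
       values = [1.0]
       prev = 0.0
       for i in range(len(delta)):
           nxt = (x - delta[i]) * values[-1] - gamma2[i] * prev
           prev = values[-1]
           values.append(nxt)
       return values

   n = 4
   delta = [0.0] * n                                              # Legendre: all delta_i = 0
   gamma2 = [0.0] + [j * j / (4.0 * j * j - 1.0) for j in range(1, n)]   # gamma_1^2 is unused
   print(orthopoly_values(0.5, delta, gamma2))                    # p_0(0.5), ..., p_4(0.5)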
Every polynomial p ∈ Π_k is clearly representable as a linear combination of the orthogonal polynomials p_i, i ≤ k. We thus have:

(3.6.9) Corollary. (p, p_n) = 0 for all p ∈ Π_{n−1}.

(3.6.10) Theorem. The roots x_i, i = 1, ..., n, of p_n are real and simple. They all lie in the open interval (a, b).

PROOF. Consider those roots of p_n which lie in (a, b) and which are of odd multiplicity, that is, at which p_n changes sign:

   a < x_1 < ⋯ < x_l < b.

The polynomial

   q(x) := ∏_{j=1}^{l} (x − x_j) ∈ Π̄_l

is such that the polynomial p_n(x) q(x) does not change sign in [a, b], so that

   (p_n, q) = ∫_a^b w(x) p_n(x) q(x) dx ≠ 0

by (3.6.1c). Thus degree(q) = l = n must hold, as otherwise (p_n, q) = 0 by Corollary (3.6.9).  □
Next we have the

(3.6.11) Theorem. The n × n matrix

   A := [ p_0(t_1)      ⋯   p_0(t_n)
          ⋮                  ⋮
          p_{n−1}(t_1)  ⋯   p_{n−1}(t_n) ]

is nonsingular for mutually distinct arguments t_i, i = 1, ..., n.

PROOF. Assume A is singular. Then there is a vector c^T = (c_0, ..., c_{n−1}), c ≠ 0, with c^T A = 0. The polynomial

   q(x) := Σ_{i=0}^{n−1} c_i p_i(x),

with degree(q) < n, has the n distinct roots t_1, ..., t_n and must vanish identically. Since the polynomials p_i(·) are linearly independent, q(x) ≡ 0 implies the contradiction c = 0.  □

Theorem (3.6.11) shows that the interpolation problem of finding a function of the form

   p(x) := Σ_{i=0}^{n−1} c_i p_i(x)

with p(t_i) = f_i, i = 1, 2, ..., n, is always solvable. The condition of the theorem is known as the Haar condition. Any sequence of functions p_0, p_1, ... which satisfies the Haar condition is said to form a Chebyshev system. Theorem (3.6.11) states that sequences of orthogonal polynomials are Chebyshev systems.
Now we arrive at the main result of this section.

(3.6.12) Theorem.
(a) Let x_1, ..., x_n be the roots of the nth orthogonal polynomial p_n(x), and let w_1, ..., w_n be the solution of the (nonsingular) system of equations

(3.6.13)   Σ_{i=1}^{n} p_k(x_i) w_i = (p_0, p_0)   if k = 0,
           Σ_{i=1}^{n} p_k(x_i) w_i = 0            if k = 1, 2, ..., n − 1.

Then w_i > 0 for i = 1, 2, ..., n, and

(3.6.14)   ∫_a^b w(x) p(x) dx = Σ_{i=1}^{n} w_i p(x_i)

holds for all polynomials p ∈ Π_{2n−1}. The positive numbers w_i are called "weights".
(b) Conversely, if the numbers w_i, x_i, i = 1, ..., n, are such that (3.6.14) holds for all p ∈ Π_{2n−1}, then the x_i are the roots of p_n and the weights w_i satisfy (3.6.13).
(c) It is not possible to find numbers x_i, w_i, i = 1, ..., n, such that (3.6.14) holds for all polynomials p ∈ Π_{2n}.

PROOF. By Theorem (3.6.10), the roots x_i, i = 1, ..., n, of p_n are real and mutually distinct numbers in the open interval (a, b). The matrix

(3.6.15)   A := [ p_0(x_1)      ⋯   p_0(x_n)
                  ⋮                  ⋮
                  p_{n−1}(x_1)  ⋯   p_{n−1}(x_n) ]

is nonsingular by Theorem (3.6.11), so that the system of equations (3.6.13) has a unique solution.
Consider an arbitrary polynomial p ∈ Π_{2n−1}. It can be written in the form

(3.6.16)   p(x) = p_n(x) q(x) + r(x),
where q, r are polynomials in Π_{n−1}, which we can express as linear combinations of orthogonal polynomials

   q(x) ≡ Σ_{k=0}^{n−1} α_k p_k(x),   r(x) ≡ Σ_{k=0}^{n−1} β_k p_k(x).

Since p_0(x) ≡ 1, it follows from (3.6.16) and Corollary (3.6.9) that

   ∫_a^b w(x) p(x) dx = (p_n, q) + (r, p_0) = β_0 (p_0, p_0).

On the other hand, by (3.6.16) [since p_n(x_i) = 0] and by (3.6.13),

   Σ_{i=1}^{n} w_i p(x_i) = Σ_{i=1}^{n} w_i r(x_i) = Σ_{k=0}^{n−1} β_k ( Σ_{i=1}^{n} w_i p_k(x_i) ) = β_0 (p_0, p_0).

Thus (3.6.14) is satisfied.
We observe that

(3.6.17). If w_i, x_i, i = 1, ..., n, are such that (3.6.14) holds for all polynomials p ∈ Π_{2n−1}, then w_i > 0 for i = 1, ..., n.

This is readily verified by applying (3.6.14) to the polynomials

   p_j(x) := ∏_{h=1, h≠j}^{n} (x − x_h)^2 ∈ Π_{2n−2},   j = 1, ..., n,

and noting that

   0 < ∫_a^b w(x) p_j(x) dx = Σ_{i=1}^{n} w_i p_j(x_i) = w_j ∏_{h=1, h≠j}^{n} (x_j − x_h)^2

by (3.6.1c). This completes the proof of (3.6.12a).
We prove (3.6.12c) next. Assume that w_i, x_i, i = 1, ..., n, are such that (3.6.14) even holds for all polynomials p ∈ Π_{2n}. Then

   p(x) := ∏_{j=1}^{n} (x − x_j)^2 ∈ Π_{2n}

contradicts this claim, since by (3.6.1c)

   0 < ∫_a^b w(x) p(x) dx = Σ_{i=1}^{n} w_i p(x_i) = 0.

This proves (3.6.12c).
To prove (3.6.12b), suppose that w_i, x_i, i = 1, ..., n, are such that (3.6.14) holds for all p ∈ Π_{2n−1}. Note that the abscissas x_i must be mutually distinct, since otherwise we could formulate the same integration rule using only n − 1 of the abscissas x_i, contradicting (3.6.12c).
Applying (3.6.14) to the orthogonal polynomials p = p_k, k = 0, ..., n − 1, themselves, we find

   Σ_{i=1}^{n} w_i p_k(x_i) = (p_k, p_0) = (p_0, p_0)   if k = 0,
   Σ_{i=1}^{n} w_i p_k(x_i) = 0                         if 1 ≤ k ≤ n − 1.

In other words, the weights w_i must satisfy (3.6.13).
Applying (3.6.14) to p(x) := p_k(x) p_n(x), k = 0, ..., n − 1, gives by (3.6.9)

   0 = (p_k, p_n) = Σ_{i=1}^{n} w_i p_n(x_i) p_k(x_i),   k = 0, ..., n − 1.

In other words, the vector c := (w_1 p_n(x_1), ..., w_n p_n(x_n))^T solves the homogeneous system of equations Ac = 0, with A the matrix (3.6.15). Since the abscissas x_i, i = 1, ..., n, are mutually distinct, the matrix A is nonsingular by Theorem (3.6.11). Therefore c = 0 and w_i p_n(x_i) = 0 for i = 1, ..., n. Since w_i > 0 by (3.6.17), we have p_n(x_i) = 0, i = 1, ..., n. This completes the proof of (3.6.12b).  □

For the most common weight function w(x) ≡ 1 and the interval [−1, 1], the results of Theorem (3.6.12) are due to Gauss. The corresponding orthogonal polynomials are [see Exercise 16]

(3.6.18)   p_j(x) = (j!/(2j)!) (d^j/dx^j) (x^2 − 1)^j.

Indeed, p_k ∈ Π̄_k, and integration by parts establishes (p_i, p_k) = 0 for i ≠ k. Up to a factor, the polynomials (3.6.18) are the Legendre polynomials. In the following table we give some values for w_i, x_i in this important special case. For further values see the National Bureau of Standards Handbook of Mathematical Functions [Abramowitz and Stegun (1964)].
   n     w_i                                       x_i
   2     w_1 = w_2 = 1                             x_2 = −x_1 = 0.577 350 2692 ...
   3     w_1 = w_3 = 5/9 = 0.555 555 5556 ...      x_3 = −x_1 = 0.774 596 6692 ...
         w_2 = 8/9 = 0.888 888 8889 ...            x_2 = 0
   4     w_1 = w_4 = 0.347 854 8451 ...            x_4 = −x_1 = 0.861 136 3116 ...
         w_2 = w_3 = 0.652 145 1549 ...            x_3 = −x_2 = 0.339 981 0436 ...
   5     w_1 = w_5 = 0.236 926 8851 ...            x_5 = −x_1 = 0.906 179 8459 ...
         w_2 = w_4 = 0.478 628 6705 ...            x_4 = −x_2 = 0.538 469 3101 ...
         w_3 = 128/225 = 0.568 888 8889 ...        x_3 = 0
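The entries of the table can be reproduced with a standard library routine; the sketch below uses numpy.polynomial.legendre.leggauss, which returns the Gauss-Legendre abscissas and weights for w(x) ≡ 1 on [−1, 1].

   import numpy as np

   for n in (2, 3, 4, 5):
       x, w = np.polynomial.legendre.leggauss(n)   # abscissas and weights for [-1, 1], w(x) = 1
       print(n, np.round(x, 10), np.round(w, 10))
   # n = 3 gives x = (-0.7745966692, 0, 0.7745966692) and w = (5/9, 8/9, 5/9), as in the table.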
Other important cases which lead to Gaussian integration rules are listed in the following table:

   [a, b]        w(x)                Orthogonal polynomials
   [−1, 1]       (1 − x^2)^{−1/2}    T_n(x), Chebyshev polynomials
   [0, ∞]        e^{−x}              L_n(x), Laguerre polynomials
   [−∞, ∞]       e^{−x^2}            H_n(x), Hermite polynomials

We have characterized the quantities w_i, x_i which enter the Gaussian integration rules for given weight functions, but we have yet to discuss methods for their actual calculation. We will examine this problem under the assumption that the coefficients δ_i, γ_i of the recursion (3.6.5) are given. Golub and Welsch (1969) and Gautschi (1968, 1970) discuss the much harder problem of finding the coefficients δ_i, γ_i.
The theory of orthogonal polynomials ties in with the theory of real tridiagonal matrices

(3.6.19)   J_n = [ δ_1  γ_2
                   γ_2  δ_2  γ_3
                        γ_3  ⋱    ⋱
                             ⋱    ⋱    γ_n
                                  γ_n  δ_n ]

and their principal submatrices

   J_j := [ δ_1  γ_2
            γ_2  δ_2  ⋱
                 ⋱    ⋱    γ_j
                      γ_j  δ_j ].
Such matrices will be studied in Sections 5.5, 5.6, and 6.6.1. In Section 5.5 it will be seen that the characteristic polynomials p_j(x) = det(J_j − xI) of the J_j satisfy the recursions (3.6.5) with the matrix elements δ_j, γ_j as the coefficients. Therefore, p_n is the characteristic polynomial of the tridiagonal matrix J_n. Consequently we have

(3.6.20) Theorem. The roots x_i, i = 1, ..., n, of the nth orthogonal polynomial p_n are the eigenvalues of the tridiagonal matrix J_n in (3.6.19).

The bisection method of Section 5.6, the QR method of Section 6.6.6, and others are available to calculate the eigenvalues of these tridiagonal matrices. With respect to the weights w_i, we have [Szegő (1959), Golub and Welsch (1969)]:

(3.6.21) Theorem. Let v^{(i)} := (v_1^{(i)}, ..., v_n^{(i)})^T be an eigenvector of J_n (3.6.19) for the eigenvalue x_i, J_n v^{(i)} = x_i v^{(i)}. Suppose v^{(i)} is scaled in such a way that

   v^{(i)T} v^{(i)} = (p_0, p_0) = ∫_a^b w(x) dx.

Then the weights are given by

   w_i = (v_1^{(i)})^2,   i = 1, ..., n.
PROOF. We verify that the vector

   v^{(i)} = (ρ_0 p_0(x_i), ρ_1 p_1(x_i), ..., ρ_{n−1} p_{n−1}(x_i))^T,

where

   ρ_j := 1/(γ_1 γ_2 ⋯ γ_{j+1})   for j = 0, 1, ..., n − 1,

is an eigenvector of J_n for the eigenvalue x_i: J_n v^{(i)} = x_i v^{(i)}. By (3.6.5), for any x,

   δ_1 ρ_0 p_0(x) + γ_2 ρ_1 p_1(x) = ρ_0 [δ_1 p_0(x) + p_1(x)] = x ρ_0 p_0(x).

For j = 2, ..., n − 1, similarly

   γ_j ρ_{j−2} p_{j−2}(x) + δ_j ρ_{j−1} p_{j−1}(x) + γ_{j+1} ρ_j p_j(x)
      = ρ_{j−1} [γ_j^2 p_{j−2}(x) + δ_j p_{j−1}(x) + p_j(x)]
      = x ρ_{j−1} p_{j−1}(x),

and finally

   γ_n ρ_{n−2} p_{n−2}(x) + δ_n ρ_{n−1} p_{n−1}(x) = ρ_{n−1} [γ_n^2 p_{n−2}(x) + δ_n p_{n−1}(x)]
      = x ρ_{n−1} p_{n−1}(x) − ρ_{n−1} p_n(x),

so that J_n v^{(i)} = x_i v^{(i)} holds, provided p_n(x_i) = 0.
Since ρ_j ≠ 0, j = 0, 1, ..., n − 1, the system of equations (3.6.13) for the w_i is equivalent to

(3.6.22)   (v^{(1)}, ..., v^{(n)}) w = (p_0, p_0) e_1

with w = (w_1, ..., w_n)^T, e_1 = (1, 0, ..., 0)^T.
Eigenvectors of symmetric matrices for distinct eigenvalues are orthogonal. Therefore, multiplying (3.6.22) by v^{(i)T} from the left yields

   (v^{(i)T} v^{(i)}) w_i = (p_0, p_0) v_1^{(i)}.

Since ρ_0 = 1 and p_0(x) ≡ 1, we have v_1^{(i)} = 1. Thus

(3.6.23)   (v^{(i)T} v^{(i)}) w_i = (p_0, p_0).

Using again the fact that v_1^{(i)} = 1, we find v_1^{(i)} v^{(i)} = v^{(i)}, and multiplying (3.6.23) by (v_1^{(i)})^2 gives

   (v^{(i)T} v^{(i)}) w_i = (v_1^{(i)})^2 (p_0, p_0).

This last relation is unchanged if v^{(i)} is replaced by a nonzero scalar multiple of itself, so it holds in particular for the normalization assumed in the theorem. Since v^{(i)T} v^{(i)} = (p_0, p_0) by hypothesis, we obtain w_i = (v_1^{(i)})^2.  □

If the QR method is employed for determining the eigenvalues of J_n, then the calculation of the first components v_1^{(i)} of the eigenvectors v^{(i)} is readily included in that algorithm: calculating the abscissas x_i and the weights w_i can be done concurrently [Golub and Welsch (1969)].
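Theorems (3.6.20) and (3.6.21) suggest the following computational scheme of Golub and Welsch: form the tridiagonal matrix J_n from the recursion coefficients, take its eigenvalues as abscissas x_i, and take (p_0, p_0) times the squared first components of the normalized eigenvectors as weights w_i. The Python sketch below carries this out for the Legendre case; the coefficients γ_{j+1}² = j²/(4j² − 1) are those of Exercise 17, and the use of a dense symmetric eigensolver is merely a convenience.

   import numpy as np

   def gauss_from_recursion(delta, gamma, mu0):
       """Abscissas and weights from the symmetric tridiagonal matrix J_n with
       diagonal delta_1, ..., delta_n and off-diagonal gamma_2, ..., gamma_n;
       mu0 = (p_0, p_0) = integral of w(x) dx."""
       J = np.diag(delta) + np.diag(gamma, 1) + np.diag(gamma, -1)
       x, V = np.linalg.eigh(J)          # eigenvalues ascending, columns are orthonormal eigenvectors
       w = mu0 * V[0, :] ** 2            # squared first components, scaled by (p_0, p_0)
       return x, w

   n = 5
   delta = np.zeros(n)                                              # Legendre: delta_i = 0
   gamma = np.array([j / np.sqrt(4.0 * j * j - 1.0) for j in range(1, n)])
   x, w = gauss_from_recursion(delta, gamma, mu0=2.0)
   print(x)    # -0.9061798459, -0.5384693101, 0, 0.5384693101, 0.9061798459
   print(w)    # 0.2369268851, 0.4786286705, 0.5688888889, 0.4786286705, 0.2369268851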
Finally, we will estimate the error of Gaussian integration:

(3.6.24) Theorem. If f ∈ C^{2n}[a, b], then

   ∫_a^b w(x) f(x) dx − Σ_{i=1}^{n} w_i f(x_i) = ( f^{(2n)}(ξ) / (2n)! ) (p_n, p_n)

for some ξ ∈ (a, b).

PROOF. Consider the solution h ∈ Π_{2n−1} of the Hermite interpolation problem [see Section 2.1.5]

   h(x_i) = f(x_i),   h'(x_i) = f'(x_i),   i = 1, 2, ..., n.

Since degree h < 2n,

   ∫_a^b w(x) h(x) dx = Σ_{i=1}^{n} w_i h(x_i) = Σ_{i=1}^{n} w_i f(x_i)

by Theorem (3.6.12). Therefore, the error term has the integral representation

   ∫_a^b w(x) f(x) dx − Σ_{i=1}^{n} w_i f(x_i) = ∫_a^b w(x) (f(x) − h(x)) dx.

By Theorem (2.1.5.9), and since the x_i are the roots of p_n(x) ∈ Π̄_n,

   f(x) − h(x) = ( f^{(2n)}(ζ(x)) / (2n)! ) p_n^2(x)

for some ζ = ζ(x) in the interval I[x, x_1, ..., x_n] spanned by x and x_1, ..., x_n. Next,

   ( f(x) − h(x) ) / p_n^2(x) = f^{(2n)}(ζ(x)) / (2n)!

is continuous on [a, b], so that the mean-value theorem of integral calculus applies:

   ∫_a^b w(x)(f(x) − h(x)) dx = (1/(2n)!) ∫_a^b w(x) f^{(2n)}(ζ(x)) p_n^2(x) dx
      = ( f^{(2n)}(ξ) / (2n)! ) (p_n, p_n)

for some ξ ∈ (a, b).  □
Comparing the various integration rules (Newton-Cotes formulas, extrapolation methods, Gaussian integration), we find that, computational efforts being equal, Gaussian integration yields the most accurate results. If only one knew ahead of time how to choose n so as to achieve a specified accuracy for any given integral, then Gaussian integration would be clearly superior to other methods. Unfortunately, it is frequently not possible to use the error formula (3.6.24) for this purpose, because the 2nth derivative is difficult to estimate. For these reasons, one will usually apply Gaussian integration for increasing values of n until successive approximate values agree within the specified accuracy. Since the function values which had been calculated for n cannot be reused for n + 1 (at least not in the classical case w(x) ≡ 1), the apparent advantages of Gaussian integration as compared with extrapolation methods are soon lost. There have been attempts to remedy this situation [e.g. Kronrod (1965)]. A collection of FORTRAN programs is given in Piessens et al. (1983).
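A small experiment along these lines: for the example integral of Section 3.5, the following sketch compares an n-point Gauss-Legendre rule (transformed from [−1, 1] to [0, π/2]) with a trapezoidal sum using the same number of function evaluations; the particular values of n are an arbitrary choice.

   import numpy as np

   f = lambda x: 5.0 / (np.exp(np.pi) - 2.0) * np.exp(2 * x) * np.cos(x)
   a, b = 0.0, np.pi / 2

   for n in (3, 5, 7):
       t, w = np.polynomial.legendre.leggauss(n)
       gauss = 0.5 * (b - a) * np.sum(w * f(0.5 * (b - a) * t + 0.5 * (a + b)))
       xs = np.linspace(a, b, n)                 # trapezoidal sum with n function values
       y = f(xs)
       h = (b - a) / (n - 1)
       trap = h * (y.sum() - 0.5 * (y[0] + y[-1]))
       print(n, abs(gauss - 1.0), abs(trap - 1.0))   # exact value of the integral is 1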
3.7 Integrals with Singularities

Examining some frequently used integration methods in this chapter, we found that their application to a given integral

   ∫_a^b f(x) dx,   a, b finite,

was justified provided the integrand f(x) was sufficiently often differentiable in [a, b]. For many practical problems, however, the function f(x) turns out not to be differentiable at the end points of [a, b], or at some isolated point in its interior. In what follows, we suggest several ways of dealing with this and related situations.

1) f(x) is sufficiently often differentiable on the closed subintervals of a partition a = a_1 < a_2 < ⋯ < a_{m+1} = b. Putting f_i(x) := f(x) on [a_i, a_{i+1}] and defining the derivatives of f_i(x) at a_i as the one-sided right derivative and at a_{i+1} as the one-sided left derivative, we find that standard methods can be applied to integrate the functions f_i(x) separately. Finally

   ∫_a^b f(x) dx = Σ_{i=1}^{m} ∫_{a_i}^{a_{i+1}} f_i(x) dx.
2) Suppose there is a point x̄ ∈ [a, b] at which not even one-sided derivatives of f(x) exist. For instance, the function f(x) = √x sin x is such that f'(x) will not be continuous for any choice of the value f'(0). Nevertheless, the variable transformation t := √x yields

   ∫_0^b √x sin x dx = ∫_0^{√b} 2t^2 sin t^2 dt

and leads to an integral with an integrand which is now arbitrarily often differentiable in [0, √b].
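The effect of the substitution can be checked numerically. The sketch below applies the same composite trapezoidal rule to the original integrand √x sin x and to the transformed integrand 2t² sin t² on [0, 1]; the reference value is obtained from scipy.integrate.quad, which is used here only for comparison.

   import numpy as np
   from scipy.integrate import quad

   def trapezoid(f, a, b, n):
       x = np.linspace(a, b, n + 1)
       y = f(x)
       return (b - a) / n * (y.sum() - 0.5 * (y[0] + y[-1]))

   ref, _ = quad(lambda x: np.sqrt(x) * np.sin(x), 0.0, 1.0)
   for n in (16, 64, 256):
       direct = trapezoid(lambda x: np.sqrt(x) * np.sin(x), 0.0, 1.0, n)
       transformed = trapezoid(lambda t: 2 * t**2 * np.sin(t**2), 0.0, 1.0, n)   # t = sqrt(x)
       print(n, abs(direct - ref), abs(transformed - ref))
   # The transformed integrand is smooth at 0, so its trapezoidal error decays like h^2.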
3) Another way to deal with the previously discussed difficulty is to split the integral:

   ∫_0^b √x sin x dx = ∫_0^ε √x sin x dx + ∫_ε^b √x sin x dx,   ε > 0.

The second integrand is arbitrarily often differentiable in [ε, b]. The first integrand can be developed into a uniformly convergent series on [0, ε] so that integration and summation can be interchanged:

   ∫_0^ε √x sin x dx = ∫_0^ε √x ( x − x^3/3! ± ⋯ ) dx = Σ_{ν=0}^{∞} (−1)^ν ε^{2ν+5/2} / ( (2ν+1)! (2ν + 5/2) ).

For sufficiently small ε, only a few terms of the series need be considered. The difficulty lies in the choice of ε: if ε is selected too small, then the proximity of the singularity at x = 0 causes the speed of convergence to deteriorate when we calculate the remaining integral.
4) Sometimes it is possible to subtract from the integrand f(x) a function whose indefinite integral is known, and which has the same singularities as f(x). For the above example, x√x is such a function:

   ∫_0^b √x sin x dx = ∫_0^b ( √x sin x − x√x ) dx + ∫_0^b x√x dx
                     = ∫_0^b √x ( sin x − x ) dx + (2/5) b^{5/2}.

The new integrand has a continuous third derivative and is therefore better amenable to standard integration methods. In order to avoid cancellation when calculating the difference sin x − x for small x, it is recommended to evaluate the power series

   sin x − x = −x^3 [ 1/3! − x^2/5! ± ⋯ ] = −x^3 Σ_{ν=0}^{∞} (−1)^ν x^{2ν} / (2ν + 3)!.
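The same device in code: the smooth part √x (sin x − x) is integrated numerically, the exactly known term (2/5) b^{5/2} is added back, and sin x − x is evaluated by its power series to avoid cancellation for small x. The truncation after eight series terms is an arbitrary choice for this sketch.

   import math

   def sin_minus_x(x, terms=8):
       """sin x - x = -x^3/3! + x^5/5! - ...; the series avoids cancellation for small x."""
       s, term = 0.0, -x**3 / 6.0
       for nu in range(terms):
           s += term
           term *= -x * x / ((2 * nu + 4) * (2 * nu + 5))
       return s

   def integral_sqrt_sin(b, n=200):
       h = b / n
       xs = [i * h for i in range(n + 1)]
       ys = [math.sqrt(x) * sin_minus_x(x) for x in xs]
       smooth = h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))    # trapezoidal rule, integrand is C^3
       return smooth + 0.4 * b ** 2.5                     # + (2/5) b^{5/2}, integrated exactly

   print(integral_sqrt_sin(1.0))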
5) For certain types of singularities, as in the case of

   I = ∫_0^b x^α f(x) dx,   0 < α < 1,

with f(x) sufficiently often differentiable on [0, b], the trapezoidal sum T(h) does not have an asymptotic expansion of the form (3.4.1), but rather one of the more general form (3.5.1):

   T(h) ~ τ_0 + τ_1 h^{γ_1} + τ_2 h^{γ_2} + ⋯,

where

   {γ_i} = { 1 + α, 2, 2 + α, 4, 4 + α, 6, 6 + α, ... }

[see Bulirsch (1964)]. Suitable step-length sequences for extrapolation methods in this case are discussed in Bulirsch and Stoer (1964).
6) Often the following scheme works surprisingly well: if the integrand of

   I = ∫_a^b f(x) dx

is not, or not sufficiently often, differentiable for x = a, put

   a_j := a + (b − a)/j,   j = 1, 2, ...,

in effect partitioning the half-open interval (a, b] into infinitely many subintervals over which to integrate separately using standard methods. Then

   I = Σ_{j=1}^{∞} ∫_{a_{j+1}}^{a_j} f(x) dx.

The convergence of the sequence of partial sums can often be accelerated using, for instance, Aitken's Δ² method [see Section 5.10]; a sketch follows below. Obviously, this scheme can be adapted to calculating improper integrals

   ∫_a^{∞} f(x) dx.
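A sketch of this scheme, using the partition points a_j = a + (b − a)/j and two sweeps of Aitken's Δ² transformation on the partial sums; the test integral ∫_0^1 x^{−1/2} dx = 2 and the fixed 8-point Gauss rule on each smooth subinterval are choices made only for this illustration.

   import numpy as np

   def gauss_panel(f, a, b, n=8):
       t, w = np.polynomial.legendre.leggauss(n)
       return 0.5 * (b - a) * np.sum(w * f(0.5 * (b - a) * t + 0.5 * (a + b)))

   def aitken(s):
       """One sweep of Aitken's Delta^2 transformation applied to the sequence s."""
       return [s[i] - (s[i+1] - s[i])**2 / (s[i+2] - 2*s[i+1] + s[i]) for i in range(len(s) - 2)]

   f = lambda x: 1.0 / np.sqrt(x)                     # integrable singularity at a = 0
   a, b = 0.0, 1.0
   pts = [a + (b - a) / j for j in range(1, 12)]      # b = a_1 > a_2 > ... -> a
   partial, total = [], 0.0
   for right, left in zip(pts[:-1], pts[1:]):
       total += gauss_panel(f, left, right)
       partial.append(total)
   print(partial[-1], aitken(aitken(partial))[-1])    # raw tail sum vs. accelerated value (exact: 2)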
7) The range of improper integrals can be made finite by suitable transformations. For x = 1/t we have, for instance,

   ∫_1^{∞} f(x) dx = ∫_0^1 t^{−2} f(1/t) dt.

If the new integrand is singular at 0, then one of the above approaches may be tried. Note that the Gaussian integration rules based on Laguerre and Hermite polynomials [see Section 3.6] apply directly to improper integrals of the forms

   ∫_0^{∞} f(x) dx,   ∫_{−∞}^{∞} f(x) dx,

respectively.
EXERCISES FOR CHAPTER 3
1.
Let a s: Xo < Xl < X2 < ... < Xn s: b be an arbitrary fixed partition of the
interval [a, b]. Show that there exist unique /,0, /,1, ... , /'n with
~ /'iP(Xi) =
2.
3.
lb
P(x)dx
for all polynomials P with degree (P) s: n.
Hint: P(x) = 1, X, ... , xn. Compare the resulting system of linear equations
with that representing the polynomial interpolation problem with support
abscissas Xi, i = 0, ... , n.
By construction, the nth Newton-Cotes formula yields the exact value of the
integral for integrands which are polynomials of degree at most n. Show that
for even values of n, polynomials of degree n + 1 are also integrated exactly.
Hint: Consider the integrand xn+1 in the interval [~k, +k], n = 2k + 1.
If f ∈ C²[a, b], then there exists an x̄ ∈ (a, b) such that the error of the trapezoidal rule is expressed as follows:

   (1/2)(b − a)(f(a) + f(b)) − ∫_a^b f(x) dx = (1/12)(b − a)^3 f''(x̄).
4.
5.
6.
Derive this result from the error formula in (2.1.4.1) by showing that !"(e(x))
is continuous in x.
Derive the error formula (3.1.6) using Theorem (2.1.5.9). Hint: See Exercise
3.
Let f ∈ C⁶[−1, +1], and let P ∈ Π_5 be the Hermite interpolation polynomial with P(x_i) = f(x_i), P'(x_i) = f'(x_i), x_i = −1, 0, +1.
(a) Show that

   ∫_{−1}^{+1} P(t) dt = (7/15) f(−1) + (16/15) f(0) + (7/15) f(+1) + (1/15) f'(−1) − (1/15) f'(+1).

(b) By construction, the above formula represents an integration rule which is exact for all polynomials of degree 5 or less. Show that it need not be exact for polynomials of degree 6.
(c) Use Theorem (2.1.5.9) to derive an error formula for the integration rule
in (a).
(d) Given a uniform partition Xi = a + ih, i = 0, ... , 2n, h = (b - a)/2n, of
the interval [a, b], what composite integration rule can be based on the
integration rule in (a)?
Consider an arbitrary partition L1 := {a = Xo < ... < Xn = b} of a given
interval [a, b]. In order to approximate
lb
7.
8.
9.
I(t)dt
using the function values I(Xi), i = 0, ... , n, spline interpolation [see Section
2.4] may be considered. Derive an integration rule in terms of I(Xi) and the
moments (2.4.2.1) of the "natural" spline (2.4.1.2a).
Determine the Peano kernel for Simpson's rule and n = 2 instead of n = 3
in [-1, +1]. Does it change sign in the interval of integration?
Consider the integration rule of Exercise 5.
(a) Show that its Peano kernel does not change its sign in [-1, +1].
(b) Use (3.2.8) to derive an error term.
Prove

   Σ_{k=1}^{n} k^3 = ( n(n + 1)/2 )^2

using the Euler-Maclaurin summation formula.
10. Integration over the interval [0,1] by Romberg's method using Neville's algorithm leads to the tableau
h5 = 1
Too = T(ho)
T11
h 21 -_ 4:1
h~ =
ft
In Section 3.4, it is shown that T11 is Simpson's rule.
(a) Show that T22 is Milne's rule.
(b) Show that T33 is not the Newton-Cotes formula for n = 8.
11. Let ho := b-a, hI := ho/3. Show that extrapolating T(ho) and T(hl) linearly
to h = 0 gives the 3/8-rule.
12. One wishes to approximate the number e by an extrapolation method.
(a) Show that T(h) := (1 + h)l/h, h =1= 0, Ihl < 1, has an expansion of the
form
00
T(h) = e + LTih i
i=l
which converges if Ihl < 1.
(b) Modify T(h) in such a way that extrapolation to h = 0 yields, for a fixed
value x, an approximation to eX.
13. Consider integration by a polynomial extrapolation method based on a geometric step-size sequence hj = ho~, j = 0, 1, ... , 0 < b < 1. Show that
small errors LJTj in the computation of the trapezoidal sums T(hj ), j = 0, 1,
... , m, will cause an error LJTmm in the extrapolated value Tmm satisfying
where Cm(O) is the constant given in (3.5.5). Note that Cm(O) --+ 00 as 0 --+ 1,
so that the stability of the extrapolation method deteriorates sharply as b
approaches 1.
14. Consider a weight function w(x) ≥ 0 which satisfies (3.6.1a) and (3.6.1b). Show that (3.6.1c) is equivalent to

   ∫_a^b w(x) dx > 0.

Hint: The mean-value theorem of integral calculus applied to suitable subintervals of [a, b].
15. The integral

   (f, g) := ∫_{−1}^{+1} f(x) g(x) dx

defines a scalar product for functions f, g ∈ C[−1, +1]. Show that if f and g are polynomials of degree less than n, if x_i, i = 1, 2, ..., n, are the roots of the nth Legendre polynomial (3.6.18), and if

   γ_i := ∫_{−1}^{+1} L_i(x) dx   with   L_i(x) := ∏_{k=1, k≠i}^{n} (x − x_k)/(x_i − x_k),   i = 1, 2, ..., n,

then

   (f, g) = Σ_{i=1}^{n} γ_i f(x_i) g(x_i).
16. Consider the Legendre polynomials pj(x) in (3.6.18).
(a) Show that the leading coefficient of pj(x) has value 1.
(b) Verify the orthogonality of these polynomials: (Pi,Pj) = 0 if i < j.
Hint: Integration by parts, noting that
and that the polynomial
is divisible by X2 - 1 if l < k.
17. Consider Gaussian integration, [a, b] = [-1,1], w(x) == 1.
(a) Show that 8i = 0 for i > 0 in the recursion (3.6.5) for the corresponding
orthogonal polynomials pj(x) (3.6.18). Hint: pj(x) == (_I)i+ 1pj (_x).
(b) Verify

   ∫_{−1}^{+1} (x^2 − 1)^j dx = (−1)^j 2^{2j+1} (j!)^2 / (2j + 1)!.

Hint: Repeated integration by parts of the integrand (x^2 − 1)^j ≡ (x + 1)^j (x − 1)^j.
(c) Calculate (p_j, p_j) using integration by parts [see Exercise 16] and the result (b) of this exercise. Show that

   γ_j^2 = j^2 / ( (2j + 1)(2j − 1) )

for j > 0 in the recursion (3.6.5).
18. Consider Gaussian integration in the interval [−1, +1] with the weight function

   w(x) ≡ 1/√(1 − x^2).

In this case, the orthogonal polynomials p_j(x) are the classical Chebyshev polynomials T_0(x) ≡ 1, T_1(x) ≡ x, T_2(x) ≡ 2x^2 − 1, T_3(x) ≡ 4x^3 − 3x, ..., T_{j+1}(x) ≡ 2x T_j(x) − T_{j−1}(x), up to scalar factors.
(a) Prove that p_j(x) ≡ (1/2^{j−1}) T_j(x) for j ≥ 1. What is the form of the tridiagonal matrix (3.6.19) in this case?
(b) For n = 3, determine the equation system (3.6.13). Verify that w_1 = w_2 = w_3 = π/3. (In the Chebyshev case, the weights w_i are equal for general n.)
19. Denote by T(f; h) the trapezoidal sum of step length h for the integral

   ∫_0^1 f(x) dx.

For α > 1, T(x^α; h) has an asymptotic expansion containing, besides the usual even powers of h, the power h^{1+α}. Show that, as a consequence, every function f(x) which is analytic on a disk |z| ≤ r in the complex plane with r > 1 admits an asymptotic expansion of the form

   T(x^α f(x); h) ~ ∫_0^1 x^α f(x) dx + b_1 h^{1+α} + b_2 h^{2+α} + b_3 h^{3+α} + ⋯.

Hint: Expand f(x) into a power series and apply T(φ + ψ; h) = T(φ; h) + T(ψ; h).
References for Chapter 3

Abramowitz, M., Stegun, I. A. (1964): Handbook of Mathematical Functions. National Bureau of Standards, Applied Mathematics Series 55, Washington, D.C.: US Government Printing Office, 6th printing 1967.
Bauer, F. L., Rutishauser, H., Stiefel, E. (1963): New aspects in numerical quadrature. Proc. of Symposia in Applied Mathematics 15, 199-218, Amer. Math. Soc.
Brezinski, C., Zaglia, R. M. (1991): Extrapolation Methods, Theory and Practice. Amsterdam: North Holland.
Bulirsch, R. (1964): Bemerkungen zur Romberg-Integration. Numer. Math. 6, 6-16.
Bulirsch, R., Stoer, J. (1964): Fehlerabschätzungen und Extrapolation mit rationalen Funktionen bei Verfahren vom Richardson-Typus. Numer. Math. 6, 413-427.
Bulirsch, R., Stoer, J. (1967): Numerical quadrature by extrapolation. Numer. Math. 9, 271-278.
Davis, Ph. J. (1963): Interpolation and Approximation. New York: Blaisdell, 2nd printing (1965).
Davis, Ph. J., Rabinowitz, P. (1975): Methods of Numerical Integration. New York: Academic Press.
Erdélyi, A. (1956): Asymptotic Expansions. New York: Dover.
Gautschi, W. (1968): Construction of Gauss-Christoffel quadrature formulas. Math. Comp. 22, 251-270.
Gautschi, W. (1970): On the construction of Gaussian quadrature rules from modified moments. Math. Comp. 24, 245-260.
Golub, G. H., Welsch, J. H. (1969): Calculation of Gauss quadrature rules. Math. Comp. 23, 221-230.
Gradshteyn, I. S., Ryzhik, I. M. (1980): Table of Integrals, Series and Products. New York: Academic Press.
Gröbner, W., Hofreiter, N. (1961): Integraltafel, 2 vols. Berlin, Heidelberg, New York: Springer-Verlag.
Henrici, P. (1964): Elements of Numerical Analysis. New York: Wiley.
Kronrod, A. S. (1965): Nodes and Weights of Quadrature Formulas. Authorized translation from the Russian. New York: Consultants Bureau.
Olver, F. W. J. (1974): Asymptotics and Special Functions. New York: Academic Press.
Piessens, R., de Doncker, E., Überhuber, C. W., Kahaner, D. K. (1983): Quadpack. A Subroutine Package for Automatic Integration. Berlin, Heidelberg, New York: Springer-Verlag.
Romberg, W. (1955): Vereinfachte numerische Integration. Det. Kong. Norske Videnskabers Selskab Forhandlinger 28, Nr. 7, Trondheim.
Schoenberg, I. J. (1969): Monosplines and quadrature formulae. In: Theory and Applications of Spline Functions. Ed. by T. N. E. Greville. 157-207. New York: Academic Press.
Steffensen, J. F. (1927): Interpolation. New York: Chelsea, 2nd edition (1950).
Stroud, A. H., Secrest, D. (1966): Gaussian Quadrature Formulas. Englewood Cliffs, NJ: Prentice Hall.
Szegő, G. (1959): Orthogonal Polynomials. New York: Amer. Math. Soc.
4 Systems of Linear Equations

In this chapter direct methods for solving systems of linear equations

   Ax = b

will be presented. Here A is a given n × n matrix, and b is a given vector. We assume in addition that A and b are real, although this restriction is inessential in most of the methods. In contrast to the iterative methods (Chapter 8), the direct methods discussed here produce the solution in finitely many steps, assuming computations without roundoff errors.
This problem is closely related to that of computing the inverse A^{-1} of the matrix A, provided this inverse exists. For if A^{-1} is known, the solution x of Ax = b can be obtained by matrix-vector multiplication, x = A^{-1} b. Conversely, the ith column ā_i of A^{-1} = [ā_1, ..., ā_n] is the solution of the linear system Ax = e_i, where e_i = (0, ..., 0, 1, 0, ..., 0)^T is the ith unit vector.
A general introduction to numerical linear algebra is given in Golub and van Loan (1983) and Stewart (1973). ALGOL programs are found in Wilkinson and Reinsch (1971), FORTRAN programs in Dongarra, Bunch, Moler, and Stewart (1979) (LINPACK), and in Anderson et al. (1992) (LAPACK).
4.1 Gaussian Elimination. The Triangular
Decomposition of a Matrix
In the method of Gaussian elimination for solving a system of linear equations
(4.1.1)
Ax = b,
where A is an n × n matrix and b ∈ ℝ^n, the given system (4.1.1) is transformed in steps by appropriate rearrangements and linear combinations of equations into a system of the form

   Rx = c,   R = [ r_11  r_12  ⋯  r_1n
                         r_22  ⋯  r_2n
                               ⋱   ⋮
                                   r_nn ],

which has the same solution as (4.1.1). R is an upper triangular matrix, so that Rx = c can easily be solved by "back substitution" (so long as r_ii ≠ 0, i = 1, ..., n):

   x_i := ( c_i − Σ_{k=i+1}^{n} r_ik x_k ) / r_ii   for i = n, n − 1, ..., 1.
In the first step of the algorithm an appropriate multiple of the first equation is subtracted from all of the other equations in such a way that the coefficients of x_1 vanish in these equations; hence, x_1 remains only in the first equation. This is possible only if a_11 ≠ 0, of course, which can be achieved by rearranging the equations if necessary, as long as at least one a_i1 ≠ 0. Instead of working with the equations themselves, the operations are carried out on the matrix

   (A, b) = [ a_11  ⋯  a_1n  b_1
              ⋮          ⋮    ⋮
              a_n1  ⋯  a_nn  b_n ],

which corresponds to the full system given in (4.1.1). The first step of the Gaussian elimination process leads to a matrix (A', b') of the form

(4.1.2)   (A', b') = [ a'_11  a'_12  ⋯  a'_1n  b'_1
                       0      a'_22  ⋯  a'_2n  b'_2
                       ⋮      ⋮          ⋮     ⋮
                       0      a'_n2  ⋯  a'_nn  b'_n ],

and this step can be described formally as follows:

(4.1.3)
(a) Determine an element a_r1 ≠ 0 and proceed with (b); if no such r exists, A is singular; set (A', b') := (A, b); stop.
(b) Interchange rows r and 1 of (A, b). The result is the matrix (Ā, b̄).
(c) For i = 2, 3, ..., n, subtract the multiple

   l_i1 := ā_i1 / ā_11

of row 1 from row i of the matrix (Ā, b̄). The desired matrix (A', b') is obtained as the result.
The transition (A, b) → (Ā, b̄) → (A', b') can be described by using matrix multiplications:

(4.1.4)   (Ā, b̄) = P_1 (A, b),   (A', b') = G_1 (Ā, b̄) = G_1 P_1 (A, b),

where P_1 is a permutation matrix (the identity matrix with rows 1 and r interchanged), and G_1 is a lower triangular matrix:

(4.1.5)   G_1 = [  1
                  −l_21   1
                   ⋮           ⋱
                  −l_n1             1 ].

Matrices such as G_1, which differ in at most one column from an identity matrix, are called Frobenius matrices. Both matrices P_1 and G_1 are nonsingular; in fact

   G_1^{-1} = [ 1
                l_21   1
                ⋮          ⋱
                l_n1            1 ].

For this reason, the equation systems Ax = b and A'x = b' have the same solution: Ax = b implies A'x = G_1 P_1 Ax = G_1 P_1 b = b', and A'x = b' implies Ax = P_1^{-1} G_1^{-1} A'x = P_1^{-1} G_1^{-1} b' = b.
The element a_r1 = ā_11 which is determined in (a) is called the pivot element (or simply the pivot), and step (a) itself is called pivot selection (or pivoting). In the pivot selection one can, in theory, choose any a_r1 ≠ 0 as the pivot element. For reasons of numerical stability [see Section 4.5] it is not recommended that an arbitrary a_r1 ≠ 0 be chosen. Usually the choice

   |a_r1| = max_i |a_i1|

is made; that is, among all candidate elements the one of largest absolute value is selected. (It is assumed in making this choice, however (see Section 4.5), that the matrix A is "equilibrated", that is, that the orders of magnitude of the elements of A are "roughly equal".) This sort of pivot selection is called partial pivot selection (or partial pivoting), in contrast to complete pivot selection (or complete pivoting), in which the search for a pivot is not restricted to the first column; that is, (a) and (b) in (4.1.3) are replaced by (a') and (b'):

(a') Determine r and s so that

   |a_rs| = max_{i,j} |a_ij|

and continue with (b') if a_rs ≠ 0. Otherwise A is singular; set (A', b') := (A, b); stop.
(b') Interchange rows 1 and r of (A, b), as well as columns 1 and s. Let the resulting matrix be (Ā, b̄).
After the first elimination step, the resulting matrix has the form (4.1.2):
a,11 :I a'T :I b'1
(A' , b') = [ - - - - LI - -_ - JI - -- o :I A :I b
1
with an (n - I)-row matrix A. The next elimination step consists simply
of applying the process described in (4.1.3) for (A, b) to the smaller matrix
(A, b). Carrying on in this fashion, a sequence of matrices
(A,b):= (A(O),b(O)) -+ (A(l),b(l)) -+ ... -+ (A(n-l),b(n-l)) =: (R,c),
is obtained which begins with the given matrix (A, b) (4.1.1) and ends with
the desired matrix (R, c). In this sequence the jth intermediate matrix
(A (j) , b(j)) has the form
(4.1.6)
* ... *
* ... *
*
* ... *
o
o ... 0
I
I
I
*
*
I
* ... * : *
I
o ... 0
(j) :I A(j)
: b(j)
A 11
12 I 1
[ - - - - __ 1- _ _ _ _ 1_ _ _ _
: b(j)
O :I A(j)
22 I 2
I
1
I
I
I
I
* ... *
I
I
I
I
*
with a j-row upper triangular matrix A~~). (A(j), b(j)) is obtained from
(A(j-l), b(j-l)) by applying (4.1.3) on the (n - j + 1) x~n - j + 2) matrix
b(j-l)) . The elements of A(j)
A(j)
b(j) do not change from this
( A(j-l)
22
'2
11'
12' 1
step on; hence they agree with the corresponding elements of (R, c). As in
the first step, (4.1.4) and (4.1.5), the ensuing steps can he described using
matrix multiplication. As can be readily seen
(4.1. 7)
b(j-l)) ,
( A(j) , b(j)) = GP·(A(j-l)
J J
,
with permutation matrices Pj and nonsingular Frobenius matrices G j of
the form
o
1
1
-IHl,j 1
Gj =
(4.1.8)
-I n,].
0
1
0
In the jth elimination step (A (j-l), b(j-l)) -+ (A (j), b(j)) the elements below
the diagonal in the jth column are annihilated. In the implementation of
this algorithm on a computer, the locations which were occupied by these
elements can now be used for the storage of the important quantities lij,
i ?:: j + 1, of G j ; that is, we work with a matrix of the form
I
: r22
1-
___
_
I
I
I
L ___ _
I
I
:L. _rjj
: rj,Hl
_ _ I
-1- -
-
-
-
-
rj,n
-
(j)
Aj+l,j : aH1,Hl
I
-
-
-
-
-
-
-
-
-
Cj
-
(j)
aH1,n
a(j)
n,n
-
-
-
-
b(j)
HI
-
bn(j)
Here the subdiagonal elements Ak+l,k, Ak+2,k, ... , Ank of the kth column
are a certain permutation of the elements lk+l,ko ... , In,k in Gk (4.1.8).
Based on this arrangement, the jth step T(j-l) -+ T(j), j = 1, 2, ... ,
n -1, can be described as follows (for simplicity the elements of T(j-l) are
denoted by tik, and those of T(j) by t~k' 1 :s: i :s: n, 1 :s: k :s: n + 1):
(a) Partial pivot selection: Determine r so that
If t Tj = 0, set T(j) := T(j-l); A is singular: stop. Otherwise carry on
with (b).
(b) Interchange rows rand j ojT(j-l), and denote the result by T = (fik)'
(c) Replace
I
- 1tij
:= Iij := tij
tjj
for i = j + 1, j + 2, ... , n,
for i = j + 1, ... , nand k = j + 1, ... , n + 1,
t~k = tik
otherwise.
We note that in (c) the important elements 1j+I,j, ... , lnj of G j are
stored in their natural order as tj+l,j, ... ,t~j' This order may, however, be
changed in the subsequent elimination steps T(k) -+ T(kH), k ~ j, because
in (b) the rows of the entire matrix T(k) are rearranged. This has the
following effect: The lower triangular matrix L and the upper triangular
matrix R,
L:=
[t!,
tnl
1
tn,n-I
which are contained in the final matrix T(n-I) = (tik), provide a triangular
decomposition of the matrix P A:
(4.1.9)
LR=PA.
In this decomposition P is the product of all of the permutations appearing
in (4.1.7):
We will only show here that a triangular decomposition is produced
if no row interchanges are necessary during the course of the elimination
process, i.e., if PI = ... = Pn - I = P = I. In this case,
1
In,n-I
1
since in all of the minor steps (b) nothing is interchanged. Now, because of
(4.1.7),
therefore
(4.1.10)
Since
o
1
a-:J 1 =
o
1
In,j
it is easily verified that
1
a-1 1a-2 1... a-I
=
n-l
1
[ 121
:
lnl
In,n-l
Then the assertion follows from (4.1.10).
EXAMPLE.

   [ 3  1  6 ] [ x_1 ]   [ 2 ]
   [ 2  1  3 ] [ x_2 ] = [ 7 ].
   [ 1  1  1 ] [ x_3 ]   [ 4 ]

Using the storage scheme described above, Gaussian elimination with partial pivot selection yields

   [ 3  1  6 | 2 ]      [ 3    1    6   | 2    ]      [ 3    1    6     | 2    ]
   [ 2  1  3 | 7 ]  →   [ 2/3  1/3  −1  | 17/3 ]  →   [ 1/3  2/3  −1    | 10/3 ].
   [ 1  1  1 | 4 ]      [ 1/3  2/3  −1  | 10/3 ]      [ 2/3  1/2  −1/2  | 4    ]

The pivot elements are 3, 2/3, and −1/2. The triangular equation system is

   3 x_1 +       x_2 + 6 x_3 = 2,
           (2/3) x_2 −   x_3 = 10/3,
                 −(1/2) x_3 = 4.

Its solution is

   x_3 = −8,   x_2 = (3/2)(10/3 + x_3) = −7,   x_1 = (1/3)(2 − x_2 − 6 x_3) = 19.

Further

   P = [ 1  0  0          P A = [ 3  1  6
         0  0  1 ],               1  1  1
         0  1  0 ]                2  1  3 ],

and the matrix P A has the triangular decomposition P A = L R with

   L = [ 1                  R = [ 3  1    6
         1/3  1      ],               2/3  −1    ].
         2/3  1/2  1 ]                     −1/2
Triangular decompositions (4.1.9) are of great practical importance in solving systems of linear equations. If the decomposition (4.1.9) is known for a matrix A (that is, the matrices L, R, P are known), then the equation system Ax = b can be solved immediately with any right-hand side b; for it follows that

   P A x = L R x = P b,

from which x can be found by solving both of the triangular systems

   L u = P b,   R x = u

(provided all r_ii ≠ 0).
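A compact Python sketch of the whole procedure, triangular decomposition with partial pivoting followed by forward and back substitution; the dense-loop formulation is chosen for readability rather than efficiency, and the test data are those of the example above.

   import numpy as np

   def lu_decompose(A):
       """Return P, L, R with L R = P A (partial pivoting, L unit lower triangular)."""
       A = A.astype(float).copy()
       n = A.shape[0]
       perm = np.arange(n)
       for j in range(n - 1):
           r = j + np.argmax(np.abs(A[j:, j]))        # pivot search in column j
           if A[r, j] == 0.0:
               raise ValueError("matrix is singular")
           A[[j, r]] = A[[r, j]]; perm[[j, r]] = perm[[r, j]]
           A[j+1:, j] /= A[j, j]                      # multipliers l_ij stored below the diagonal
           A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
       L = np.tril(A, -1) + np.eye(n)
       R = np.triu(A)
       P = np.eye(n)[perm]
       return P, L, R

   def solve(P, L, R, b):
       pb = P @ b
       u = np.zeros_like(pb)
       for i in range(len(b)):                        # forward substitution: L u = P b
           u[i] = pb[i] - L[i, :i] @ u[:i]
       x = np.zeros_like(u)
       for i in reversed(range(len(b))):              # back substitution: R x = u
           x[i] = (u[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
       return x

   A = np.array([[3., 1., 6.], [2., 1., 3.], [1., 1., 1.]])
   b = np.array([2., 7., 4.])
   P, L, R = lu_decompose(A)
   print(solve(P, L, R, b))      # expected solution (19, -7, -8)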
Thus, with the help of the Gaussian elimination algorithm, it can be
shown constructively that each square nonsingular matrix A has a triangular decomposition of the form (4.1.9). However, not every such matrix A
has a triangular decomposition in the more narrow sense A = LR, as the
example
A=
[~ ~]
shows. In general, the rows of A must be permuted appropriately at the
outset.
The triangular decomposition (4.l.9) can be obtained directly without
forming the intermediate matrices T(j). For simplicity, we will show this
under the assumption that the rows of A do not have to be permuted in
order for a triangular decomposition A = LR to exist. The equations A =
LR are regarded as n 2 defining equations for the n 2 unknown quantities
(lii = 1);
that is, for all i, k = l. 2, ... , n
min(i,k)
(4.1.11)
aik =
L lijrjk
(lii = 1).
j=l
The order in which the lij, rjk are to be computed remains open. The
following versions are common:
In the Crout method the n x n matrix A = LR is partiotioned as
follows:
1
3
5
2
46~
198
4 Systems of Linear Equations
and the equations A = LR are solved for Land R in an order indicated by
this partitioning:
1
1.
ali =
2.
ail
3.
a2i
L: hjT"ji,
i = 1,2, ... , n,
j=l
1
= L: lijT"j1'
i = 2,3, ... , n,
j=l
2
= L: l2jT"ji,
i = 2,3, ... ,n,
etc.
j=l
And in general, for i = 1, 2, ... , n,
i-I
T"ik := aik -
L
k = i, i + 1, ... , n,
lijT"jk,
j=l
(4.1.12)
lki:= ( aki -
i-I
~ lkjT"ji
)
/
k = i + 1, i + 2, ... ,n.
T"ii,
In all of the steps above lii = 1 for i = 1, 2 .... , n.
In the Banachiewicz method, the partitioning
1
2
4
6
3
I
5
I
7
I
8
I 9
is used; that is, Land R are computed by rows.
The formulas above are valid only if no pivot selections is carried out.
Triangular decomposition by the methods of Crout or Banachiewicz with
pivot selection leads to more complicated algorithms; see Wilkinson (1965).
Gaussian elimination and direct triangular decomposition differ only in
the ordering of operations. Both algorithms are, theoretically and numerically, entirely equivalent. Indeed, the jth partial sums
j
(4.1.13)
(j) .
.- a ik - "'"'
a ik
~ l is T" sk
s=l
of (4.1.12) produce precisely the elements of the matrix AU) in (4.1.6), as
can easily be verified. In Gaussian elimination, therefore, the scalar products (4.1.12) are formed only in pieces, with temporary storing of the intermediate results; direct triangular decomposition, on the other hand, forms
each scalar product as a whole. For these organizational reasons, direct triangular decomposition must be preferred if one chooses to accumulate the
scalar products in double-precision arithmetic in order to reduce roundoff errors (without storing double-precision intermediate results). Further,
these methods of triangular decomposition require about n 3 /3 operations
(1 operation = 1 multiplication + 1 addition). Thus, they also offer a simple
way of evaluating the determinant of a matrix A: From (4.1.9) it follows,
since det(P) = ±1, det(L) = 1, that
det(PA) = ± det(A) = det(R) = rllr22 ... r"n.
Up to its sign, det(A) is exactly the product of the pivot elements. (It
should be noted that the direct evaluation of the formula
n
det(A) =
J.Ll,···,J./.n=l
J.£i#JAk
fur i::;i.k
requires n! » n 3 /3 operations.)
In the case that P = I, the pivot elements rii are representable as
quotients of the determinants of the principal submatrices of A. If, in the
representation LR = A, the matrices are partitioned as follows:
[
Lll
L21
o ] [Rll
A21 ]
A22 '
0
L22
it is found that LnRll = All; hence det(R ll ) = det(All), or
rl1 ... rii = det(An),
if An is an i x i matrix. In general, if Ai denotes the ith leading principal
submatrix of A, then
rii = det(Ai)/ det(A i - 1 ),
i ~ 2,
rll = det(Ad·
A further practical and important property of the method of triangular
decomposition is that, for band matrices with bandwidth m,
*
0
*
A= *
0
aij =
*
*
*
}m
0 for li- jl ~ m,
the matrices Land R of the decomposition LR = P A of A are not full: R
is a band matric with bandwidth 2m - 1,
*
*
0
0
0
R=
*
0
*
,
} 'm-'
and in each column of L there are at most m elements different from zero. In
contrast, the inverses A -1 of band matrices are usually filled with nonzero
entries.
Thus, if m « n, using the triangular decomposition of A to solve
Ax = b results in a considerable saving in computation and storage over
using A-I. Additional savings are possible by making use of the symmetry
of A if A is a positive definite matrix (see Sections 4.3 and 4.A).
4.2 The Gauss-Jordan Algorithm
In practice, the inverse A -1 of a nonsingular n x n matrix A is not frequently
needed. Should a particular situation call for an inverse, however, it may be
readily calculated using the triangular decomposition described in Section
4.1 or using the Gauss-Jordan algorithm, which will be described below.
Both methods require the same amount of work.
If the triangular decomposition PA = LR of (4.1.9) is available, then
the ith column ai of A-I is obtained as the solution of the system
(4.2.1)
where ei is the ith coordinate vector. If the simple structure of the righthand side of (4.2.1), Pei, is taken into account, then the n equation systems
(4.2.1) (i = 1, ... , n) can be solved in about ~n3 operations. Adding
the cost of producing the decomposition gives a total of n 3 operations
to determaine A-I. The Gauss-Jordan method requires this amount of
work, too, and offers advantages only of an organizational nature. The
Gauss-Jordan method is obtained if one attempts to invert the mapping
x I--t Ax = y, x E IRn, y E IRn, determined by A in a systematic manner.
Consider the system Ax = y:
(4.2.2)
In the first step of the Gauss-Jordan method, the variable Xl is exchanged
for one of the variables Yr. To do this, an arl i- 0 is found, for example by
partial pivot selection
larll = max laill,
•
and equations rand 1 of (4.2.2) are interchanged. In this way, a system
(4.2.3)
is obtained in which the variables 1h, ... , Yn are a permutation of YI, ... ,
Yn and all = arI, YI = Yr holds. Now, all i- 0, for otherwise we would have
ail = 0 for all i, making A singular, contrary to assumption. By solving the
first equation of (4.2.3) for Xl and substituting the result into the remaining
equations, the system
a~IYI + a~2x2 + ... + a~nxn = Xl
a~IYI + a~2X2 + ... + a~nXn = Y2
(4.2.4)
is obtained with
(4.2.5)
I
_
aik := aik -
ailalk
-_--
an
for i, k = 2, 3, ... , n.
In the next step, the vaiable X2 is exchanged for one of the variables
Y2, ... , Yn; then X3 is exchanged for one of the remaining Y variables, and
so on. If the successive equation systems are represented by their matrices,
then starting from A(O) := A, a sequence
is obtained. The matrix A(j) = (a~k)) stands for the matrix of a "mixed
equation system" of the form
(j) -
all YI + ... +
ajl YI + ... +
(j) -
(4.2.6)
(j)x
a nn
n
anlYI + ... +
(j) -
In this system (iiI, ... , f11, Yj+l, ... , Yn) is a certain permutation of the
original variables (YI,'" ,Yn)' In the transition A(j-l) --+ A(j) the variable
Xj is exchanged for Yj. Thus, A(j) is obtained from A(j-l) according to the
rules given below. For simplicity, the elements of A(j-l) are denoted by aik,
and those of A (j) are denoted by a~k'
(4.2.7)
(a) Partial pivot selection: Determine r so that
If aTj = 0, the matrix is singular. Stop.
(b) Interchange rows rand j of A(j-l), and call the result A = (aik)'
(c) Compute A(j) = (a~k) according to the formulas [ compare with (4.2.5)]
(4.2.6) implies that
(4.2.8)
Y = YI,···, Yn )T ,
A
(
A
A
where ih, ... , Yn is a certain permutation of the original variables YI, ... ,
Yn, Y = Py which, since it corresponds to the interchange step (4.2.7b),
can easily be determined. From (4.2.8) it follows that
(A(n) P)y
and therefore, since Ax = Y,
= x,
EXAMPLE.
A
= A(O) :=
-7 A(2)
=
[r
[
-i
-1
1
2
3
~l-7
-1
1
2
-~l
I'
A(l)
=
U -~l
-7 A(3) =
-1
I'
2
[-~ -~l
-3
5
-2
= A~l.
The pivot elements are marked.
The following ALGOL program is a formulation of the Gauss-Jordan method with partial pivoting. The inverse of the n × n matrix A is stored back into A. The array p[i] serves to store the information about the row permutations which take place.

   for j := 1 step 1 until n do p[j] := j;
   for j := 1 step 1 until n do
   begin
   pivotsearch:
      max := abs(a[j, j]); r := j;
      for i := j + 1 step 1 until n do
         if abs(a[i, j]) > max then
         begin max := abs(a[i, j]);
            r := i;
         end;
      if max = 0 then goto singular;
   rowinterchange:
      if r > j then
      begin for k := 1 step 1 until n do
         begin
            hr := a[j, k]; a[j, k] := a[r, k];
            a[r, k] := hr;
         end;
         hi := p[j]; p[j] := p[r]; p[r] := hi;
      end;
   transformation:
      hr := 1/a[j, j];
      for i := 1 step 1 until n do a[i, j] := hr × a[i, j];
      a[j, j] := hr;
      for k := 1 step 1 until j − 1, j + 1 step 1 until n do
      begin
         for i := 1 step 1 until j − 1, j + 1 step 1 until n do
            a[i, k] := a[i, k] − a[i, j] × a[j, k];
         a[j, k] := −hr × a[j, k];
      end k;
   end j;
   columninterchange:
   for i := 1 step 1 until n do
   begin
      for k := 1 step 1 until n do hv[p[k]] := a[i, k];
      for k := 1 step 1 until n do a[i, k] := hv[k];
   end;
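An equivalent formulation in Python may be helpful for comparison; like the ALGOL program it overwrites the working array, records the row interchanges in p, and undoes them by a final column interchange. It is meant only as an illustration of the exchange steps (4.2.7).

   import numpy as np

   def gauss_jordan_inverse(A):
       A = A.astype(float).copy()
       n = A.shape[0]
       p = list(range(n))                        # records the row interchanges
       for j in range(n):
           r = j + np.argmax(np.abs(A[j:, j]))   # partial pivot selection
           if A[r, j] == 0.0:
               raise ValueError("matrix is singular")
           A[[j, r]] = A[[r, j]]; p[j], p[r] = p[r], p[j]
           piv = 1.0 / A[j, j]
           A[:, j] *= piv                        # transformation step: scale column j
           A[j, j] = piv
           for k in range(n):
               if k != j:
                   ajk = A[j, k]
                   A[:, k] -= A[:, j] * ajk
                   A[j, k] = -piv * ajk
       return A[:, np.argsort(p)]                # final column interchange undoes the permutation

   A = np.array([[3., 1., 6.], [2., 1., 3.], [1., 1., 1.]])
   print(gauss_jordan_inverse(A) @ A)            # should be close to the identity matrix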
4.3 The Choleski Decomposition

The methods discussed so far for solving equations can fail if no pivot selection is carried out, i.e. if we restrict ourselves to taking the diagonal elements in order as pivots. Even if no failure occurs, as we will show in the next sections, pivot selection is advisable in the interest of numerical stability. However, there is an important class of matrices for which no pivot selection is necessary in computing triangular factors: the choice of each diagonal element in order always yields a nonzero pivot element. Furthermore, it is numerically stable to use these pivots. We refer to the class of positive definite matrices.

(4.3.1) Definition. A (complex) n × n matrix A is said to be positive definite if it satisfies:
(a) A = A^H, i.e. A is a Hermitian matrix.
(b) x^H A x > 0 for all x ∈ ℂ^n, x ≠ 0.

A = A^H is called positive semidefinite if x^H A x ≥ 0 holds for all x ∈ ℂ^n.
(4.3.2) Theorem. For any positive definite matrix A the matrix A-I exists and is also positive definite. All principal submatrices of a positive definite matrix are also positive definite, and all principal minors of a positive
definite matrix are positive.
The inverse of a positive definite matrix A exists: If this were
not the case, an x -=I- 0 would exist with Ax = 0 and x H Ax = 0, in
contradiction to the definiteness of A. A-I is positive definite: We have
(A-l)H = (AH)-1 = A-I, and if y -=I- 0 it follows that x = A-l y -=I- o.
Hence yH A-l y = x H AH A-I Ax = x H Ax> O. Every principal submatrix
PROOF.
_ [ai~il.
A=
aikil
of a positive definite matrix A is also positive definite: Obviously AH = A.
Moreover, every
can be expanded to
for f-£ = i j ,
otherwise,
x"lO,
and it follows that
.i = 1, ... , k,
xH Ax = x H Ax > O.
In order to complete the proof of (4.3.2) then, it suffices to show that
det(A) > 0 for positive definite A. This is shown by using induction on n.
For n = 1 this is true from (4.3.1b). Now assume that the theorem is
true for positive definite matrices of order n - 1, and let A be a positive
definite matrix of order n. According to the preceding parts of the proof,
is positive definite, and consequently all > O. As is well known,
all
a22
a~n 1) / det(A).
an 2
ann
= det ( [ :
By the induction assumption, however,
a~2
a~n. 1) > 0,
an2
ann
de
[ ..
( t
.
and hence det(A) > 0 follows from all > O.
D
(4.3.3) Theorem. For each n x n positive definite matrix A there is a
unique n x n lower triangular matrix L, lik = 0 for k > i, with lii > 0,
i = 1, 2, ... , n, satisfying A = LLH. If A is real, so is L.
Note that lii = 1 is not required. The matrix L is called the Choleski
factor of A, and A = LLH its Choleski decomposition.
PROOF. The theorem is established by induction on n. For n = 1 the
theorem is trivial: A positive definite 1 x 1 matrix A = (a) is a positive
numer a > 0, which can be written uniqely in the form
Assume that the theorem is true for positive matrices of order n - 1 :::: 1.
An n x n positive definite matrix A can be partitioned into
where b E <e n - 1 and An-1 is a positive definite matrix of order n - 1 by
(4.3.2). By the induction hypothesis, there is a unique matrix L n - 1 of order
n - 1 satisfying
for
k > i, Iii> O.
We consider a matrix L of the form
and try to determine e E <en-I, 0 > 0 so that
(4.3.4)
This means that we must have
Ln-Ie = b,
H
e e+0 2 =a nn ,
0> O.
The first equation must have a unique solution e = L-;'~l b, since L n - 1 ,
as a a triangular matrix with positive diagonal entries, has det(Ln-d > O.
As for the second equation, if e H e :::: ann (that is, 0 2 :::: 0), then from (4.3.1)
we would have a contradiction with 0 2 > 0, which follows from
det(A) = I det(Ln_dI 2 0 2 ,
det(A) > 0 (4.3.2), and det(L n - 1 ) > O. Therefore, from (4.3.4), there exists
exactly one 0 > 0 giving LLH = A, namely
o
The decomposition A = LL^H can be determined in a manner similar to the methods given in Section 4.1. If it is assumed that all l_ij are known for j ≤ k − 1, then as defining equations for l_kk and l_ik, i ≥ k + 1, we have

(4.3.5)   a_kk = |l_k1|^2 + |l_k2|^2 + ⋯ + |l_kk|^2,
          a_ik = l_i1 l̄_k1 + l_i2 l̄_k2 + ⋯ + l_ik l̄_kk.

For a real A, the following algorithm results:
   for i := 1 step 1 until n do
   for k := i step 1 until n do
   begin x := a[i, k];
      for j := i − 1 step −1 until 1 do
         x := x − a[k, j] × a[i, j];
      if i = k then
      begin if x ≤ 0 then goto fail;
         p[i] := 1/sqrt(x);
      end else
         a[k, i] := x × p[i];
   end i, k;
Note that only the upper triangular portion of A is used. The lower
triangular matrix L is stored in the lower triangular portion of A, with the
exception of the diagonal elements of L, whose reciprocals are stored in p.
This method is due to Choleski. During the course of computation, n
square roots must be taken. Theorem (4.3.3) assures us that the arguments
of these square roots will be positive. About n 3 /6 operations (multiplications and additions) are needed beyond the n square roots. Further substantial savings are possible for sparse matrices, see Section 4.A. Finally,
note as an important implication of (4.3.5) that

(4.3.6)   |l_kj|^2 ≤ a_kk,   j = 1, ..., k,   k = 1, 2, ..., n.

That is, the elements of L cannot grow too large.
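A direct transcription of (4.3.5) into Python for the real case: the factor L is built column by column, and a nonpositive radicand signals that A is not positive definite. The routine is only a sketch and, unlike the ALGOL program above, does not economize on storage.

   import numpy as np

   def cholesky(A):
       """Lower triangular L with L L^T = A for a real symmetric positive definite A, cf. (4.3.5)."""
       n = A.shape[0]
       L = np.zeros_like(A, dtype=float)
       for k in range(n):
           s = A[k, k] - L[k, :k] @ L[k, :k]           # a_kk - sum_j l_kj^2
           if s <= 0.0:
               raise ValueError("matrix is not positive definite")
           L[k, k] = np.sqrt(s)
           for i in range(k + 1, n):
               L[i, k] = (A[i, k] - L[i, :k] @ L[k, :k]) / L[k, k]
       return L

   A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
   L = cholesky(A)
   print(np.allclose(L @ L.T, A))     # True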
4.4 Error Bounds
If any one of the methods described in the previous sections is used to determine the solution of a linear equation system Ax = b, then in general only an approximation x̃ to the true solution x is obtained, and there arises the question of how the accuracy of x̃ is to be judged. In order to measure the error

   x̃ − x

we have to have the means of measuring the "size" of a vector. To do this a norm

(4.4.1)   ‖x‖

is introduced on ℂ^n; that is, a function

   ‖·‖ : ℂ^n → ℝ

which assigns to each vector x ∈ ℂ^n a real value ‖x‖ serving as a measure for the "size" of x. The function must have the following properties:

(4.4.2)
(a) ‖x‖ > 0 for all x ∈ ℂ^n, x ≠ 0 (positivity),
(b) ‖αx‖ = |α| ‖x‖ for all α ∈ ℂ, x ∈ ℂ^n (homogeneity),
(c) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ ℂ^n (triangle inequality).

In the following we use only the norms

(4.4.3)   ‖x‖_2 := √(x^H x) = ( Σ_{i=1}^{n} |x_i|^2 )^{1/2}   (Euclidean norm),
          ‖x‖_∞ := max_i |x_i|   (maximum norm).
The norm properties (a), (b), (c) are easily verified.
For each norm ‖·‖ the inequality

(4.4.4)   ‖x − y‖ ≥ | ‖x‖ − ‖y‖ |   for all x, y ∈ ℂ^n

holds. From (4.4.2c) it follows that

   ‖x‖ = ‖(x − y) + y‖ ≤ ‖x − y‖ + ‖y‖,

and consequently ‖x − y‖ ≥ ‖x‖ − ‖y‖. By interchanging the roles of x and y and using (4.4.2b), it follows that

   ‖x − y‖ = ‖y − x‖ ≥ ‖y‖ − ‖x‖,

and hence (4.4.4).
It is easy to establish the following:
(4.4.5) Theorem. Each norm 11.11 in IRn (or <en) is a uniformly continuous
function with respect to the metric g(x,y) := maxi IXi - Yil on IRn(<e n ).
PROOF. From (4.4.4) it follows that
IlIx+hll-lIxlll::::: IIhll·
Now h = 2:~=1 hiei, where h = (h 1 , ... , hn)T and ei are the usual cordinate
(unit) vectors of IRn (<en). Therefore
n
n
IIhll ::::: '~,
" Ihilliei ll ::::: max Ihil
i=l
L lIejll = Mmax
, Ihil
j=l
with M := 2:7=1 lIejll. Hence, for each e >
maxi Ihil ::::: elM, the inequality
0 and all h satisfying
Ilix + hll - IIxlll ::::: e
holds. That is, II . II is uniformly continuous.
o
This result is used to show:
(4.4.6) Theorem. All norms on lR"(Cn) are equivalent in the following
sense: For each pair of norms Pl (x)) P2 (x) there are positive constants 177
and M satisfying
PROOF. We will prove this only in the case that P2(X) := Ilxll := maxi IXil.
The general case follows easily from this special result. The set
5 = {x E (Cn I maxlxil
= I}
,
is a compact set in (Cn. Since Pl (x) is continuous by (4.4.EI),
M:= max Pl(X) > 0,
xES
177:= min Pl(X) > I)
xES
exist. Thus for all y cf 0, y/llyll E 5, it follows that
177
~ Pl CI~II) = ~Pl(Y) ~ M,
and therefore mllyll ~ Pl(y) ~ Mllyll·
D
For matrices as well, A E 11/[(177, n) of fixed dimensions, norms IIAII can
be introduced. In analogy to (4.4.2), the properties
IIAII > 0
for all A cf 0, A E M(m, n),
IloA11 = lolllAII,
IIA + BII ~ IIAII + IIBII
arc required. The matrix norm II ·11 is said to be consistent with the vector
norms II . Iia on (C" and II . lib on (Cm if
IIAxllb ~ IIAllllxll a
for all x E <en, A E M(m, n).
A matrix norm 11.11 for square matrices A E M(n, n) is called submultiplicative if
IIABII ~ II All IIBII
for all
A, BE M(n, n).
Choosing B := I then shows 11111 :::: 1 for such norms. Frequently used
matrix norms are
(4.4.7a)   ‖A‖ = max_i Σ_{k} |a_ik|   (row-sum norm),
(4.4.7b)   ‖A‖ = ( Σ_{i,k} |a_ik|^2 )^{1/2}   (Schur norm),
(4.4.7c)   ‖A‖ = max_{i,k} |a_ik|.

(a) and (b) are submultiplicative; (c) is not; (b) is consistent with the Euclidean vector norm. Given a vector norm ‖x‖, a corresponding matrix norm for square matrices, the subordinate matrix norm or least upper bound norm, can be defined by

(4.4.8)   lub(A) := max_{x ≠ 0} ‖Ax‖ / ‖x‖.
Such a matrix norm is consistent with the vector norm 11.11 used to define
it:
(4.4.9)
IIAxl1 :s; lub (A) Ilxll·
Obviously lub (A) is the smallest of all of the matrix norms IIAII which are
consistent with the vector norm Ilxll:
IIAxl1 :s; IIAllllxl1
for all x
=?
lub (A) :s; IIAII·
Each subordinate norm lub(·) is submultiplicative:
IIABxl1
IIBxl1
lub (AB) = max -11-11- :s; max lub (A)-II-II
x#O
x
x#O
x
= lub (A) lub (B),
and furthermore lub(I) = maxx=o IIIx1/11x11 = l.
(4.4.9) shows that lub(A) is the greatest magnification which a vector
may attain under the mapping determined by A: It shows how much IIAxll,
the norm of an image point, can exceed II x II, the norm of a source point.
EXAMPLE.
(a) For the maximum norm ‖x‖_∞ = max_ν |x_ν| the subordinate matrix norm is the row-sum norm

   lub_∞(A) = max_{x ≠ 0} ‖Ax‖_∞ / ‖x‖_∞ = max_{x ≠ 0} ( max_i | Σ_{k=1}^{n} a_ik x_k | ) / ( max_k |x_k| ) = max_i Σ_{k=1}^{n} |a_ik|.

(b) Associated with the Euclidean norm ‖x‖_2 = √(x^H x) we have the subordinate matrix norm

   lub_2(A) = √( λ_max(A^H A) ),

which is expressed in terms of the largest eigenvalue λ_max(A^H A) of the matrix A^H A [see (6.4.7)].
(4.4.10)
for unitary matrices U, that is, for matrices defined by UHU = I.
In the following we assume that Ilxll is an arbitrary vector norm and IIAII
is a consistent submultiplicative matrix norm with 11111 = 1. Specifically,
we can always take the subordinate norm lub(A) as IIAII if we want to
obtain particularly good estimates in the results below. We shall show how
norms can be used to bound the influence due to changes in A and b on
the solution x to a linear equation system
Ax = b.
If the solution x + Δx corresponds to the right-hand side b + Δb,
  A(x + Δx) = b + Δb,
then the relation
  Δx = A⁻¹ Δb
follows from A Δx = Δb, as does the bound
(4.4.11)   ‖Δx‖ ≤ ‖A⁻¹‖ ‖Δb‖.
For the relative change ‖Δx‖/‖x‖, the bound
(4.4.12)   ‖Δx‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖Δb‖/‖b‖ = cond(A) ‖Δb‖/‖b‖
follows from ‖b‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖. In this estimate, cond(A) := ‖A‖ ‖A⁻¹‖. For the special case that cond(A) := lub(A) lub(A⁻¹), this so-called condition of A is a measure of the sensitivity of the relative error in the solution to relative changes in the right-hand side b. Since AA⁻¹ = I, cond(A) satisfies
  lub(I) = 1 ≤ lub(A) lub(A⁻¹) ≤ ‖A‖ ‖A⁻¹‖ = cond(A).
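The bound (4.4.12) is easy to check numerically. The following short Python sketch (not part of the original text; it assumes NumPy and uses an arbitrarily chosen matrix and right-hand-side perturbation) verifies the inequality for one example:

    import numpy as np

    A = np.array([[0.005, 1.0], [1.0, 1.0]])     # example matrix
    b = np.array([0.5, 1.0])
    db = 1e-6 * np.array([1.0, -1.0])            # perturbation of the right-hand side

    x  = np.linalg.solve(A, b)
    dx = np.linalg.solve(A, b + db) - x

    cond = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)  # ||A|| ||A^-1||
    lhs = np.linalg.norm(dx) / np.linalg.norm(x)
    rhs = cond * np.linalg.norm(db) / np.linalg.norm(b)
    print(lhs <= rhs + 1e-15, lhs, rhs)          # relative error obeys the bound (4.4.12)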
The relation (4.4.11) can be interpreted as follows: If x̃ is an approximate solution to Ax = b with residual
  r(x̃) := b − Ax̃ = A(x − x̃),
then x̃ is the exact solution of
  Ax̃ = b − r(x̃),
and the estimate
(4.4.13)   ‖Δx‖ ≤ ‖A⁻¹‖ ‖r(x̃)‖
must hold for the error Δx = x̃ − x.
Next, in order to investigate the influence of changes in the matrix A
upon the solution x of Ax = b, we establish the following:
(4.4.14) Lemma. If F is an n × n matrix with ‖F‖ < 1, then (I + F)⁻¹ exists and satisfies
  ‖(I + F)⁻¹‖ ≤ 1/(1 − ‖F‖).
PROOF. From (4.4.4) the inequality
  ‖(I + F)x‖ = ‖x + Fx‖ ≥ ‖x‖ − ‖Fx‖ ≥ (1 − ‖F‖) ‖x‖
follows for all x. From 1 − ‖F‖ > 0 it follows that ‖(I + F)x‖ > 0 for x ≠ 0; that is, (I + F)x = 0 has only the trivial solution x = 0, and I + F is nonsingular.
Using the abbreviation G := (I + F)⁻¹, it follows that
  1 = ‖I‖ = ‖(I + F)G‖ = ‖G + FG‖ ≥ ‖G‖ − ‖G‖ ‖F‖ = ‖G‖ (1 − ‖F‖) > 0,
from which we have the desired result
  ‖(I + F)⁻¹‖ ≤ 1/(1 − ‖F‖).   □
We can show:
(4.4.15) Theorem. Let A be a nonsingular n × n matrix, B = A(I + F), ‖F‖ < 1, and let x and Δx be defined by Ax = b, B(x + Δx) = b. It follows that
  ‖Δx‖/‖x‖ ≤ ‖F‖/(1 − ‖F‖),
as well as
  ‖Δx‖/‖x‖ ≤ cond(A) (‖B − A‖/‖A‖) / ( 1 − cond(A) ‖B − A‖/‖A‖ )
if cond(A) · ‖B − A‖/‖A‖ < 1.
PROOF. B⁻¹ exists from (4.4.14), and
  Δx = B⁻¹b − A⁻¹b = B⁻¹(A − B)A⁻¹b,   x = A⁻¹b,
so that
  ‖Δx‖/‖x‖ ≤ ‖B⁻¹(A − B)‖ = ‖−(I + F)⁻¹ A⁻¹ A F‖ ≤ ‖(I + F)⁻¹‖ ‖F‖ ≤ ‖F‖/(1 − ‖F‖).
Since F = A⁻¹(B − A) and ‖F‖ ≤ ‖A⁻¹‖ ‖A‖ ‖B − A‖/‖A‖, the rest of the theorem follows.   □
According to theorem (4.4.15), cond(A) also measures the sensitivity
of the solution x of Ax = b to changes in the matrix A.
If the relations
  F = A⁻¹B − I,   G := (I + F)⁻¹ = B⁻¹A
are taken into account, it follows from (4.4.14) that
  ‖B⁻¹A‖ ≤ 1/(1 − ‖I − A⁻¹B‖).
By interchanging A and B, it follows immediately from A⁻¹ = A⁻¹BB⁻¹ that
(4.4.16)   ‖A⁻¹‖ ≤ ‖A⁻¹B‖ ‖B⁻¹‖ ≤ ‖B⁻¹‖/(1 − ‖I − B⁻¹A‖).
In particular, the residual estimate (4.4.13) leads to the bound of Collatz
(4.4.17)   ‖x̃ − x‖ ≤ ( ‖B⁻¹‖/(1 − ‖I − B⁻¹A‖) ) ‖r(x̃)‖,   r(x̃) = b − Ax̃,
where B⁻¹ is an approximate inverse to A with ‖I − B⁻¹A‖ < 1.
The estimates obtained up to this point show the significance of the
quantity cond(A) for determining the influence on the solution of changes
in the given data. These estimates give bounds on the error x - x, but
the evaluation of the bounds requires at least an approximate knowledge
of the inverse A-1 to A. The estimates to be discussed next, due to Prager
and Oettli (1964), are based upon another principle and do not require any
knowledge of A -1.
The results are obtained through the following considerations:
Usually the given data A₀, b₀ of an equation system A₀x = b₀ are inexact, being tainted, for example, by measurement errors ΔA, Δb. Hence, it is reasonable to accept an approximate solution x̃ to the system A₀x = b₀ as "correct" if x̃ is the exact solution to a "neighboring" equation system
  Ax̃ = b
with
(4.4.18)   A ∈ 𝔄 := { A | |A − A₀| ≤ ΔA },   b ∈ 𝔅 := { b | |b − b₀| ≤ Δb }.
The notation used here is
  |A| = (|α_ik|),   |b| = (|β₁|, ..., |βₙ|)^T,   where A = (α_ik),  b = (β₁, ..., βₙ)^T,
and the relation ≤ between vectors and matrices is to be understood as holding componentwise. Prager and Oettli prove:
(4.4.19) Theorem. Let ΔA ≥ 0, Δb ≥ 0, and let 𝔄, 𝔅 be defined by (4.4.18). Associated with any approximate solution x̃ of the system A₀x = b₀ there is a matrix A ∈ 𝔄 and a vector b ∈ 𝔅 satisfying
  Ax̃ = b
if and only if
  |r(x̃)| ≤ ΔA |x̃| + Δb,
where r(x̃) := b₀ − A₀x̃ is the residual of x̃.
PROOF.
(1) We assume first that
  Ax̃ = b
holds for some A ∈ 𝔄, b ∈ 𝔅. Then it follows from
  A = A₀ + δA, where |δA| ≤ ΔA,   b = b₀ + δb, where |δb| ≤ Δb,
that
  |r(x̃)| = |b₀ − A₀x̃| = |b − δb − (A − δA)x̃| = |−δb + (δA)x̃| ≤ |δb| + |δA| |x̃| ≤ Δb + ΔA |x̃|.
(2) On the other hand, if
(4.4.20)   |r(x̃)| ≤ ΔA |x̃| + Δb,
and if we use the notation
  x̃ =: (ξ₁, ..., ξₙ)^T,   b₀ =: (β₁, ..., βₙ)^T,   A₀ =: (α_ij),   ΔA =: (Δα_ij),   Δb =: (Δβ₁, ..., Δβₙ)^T,
  r := r(x̃) = (ρ₁, ..., ρₙ)^T,   σ := Δb + ΔA |x̃| ≥ 0,   σ = (σ₁, ..., σₙ)^T,
then set
  δα_ij := ρ_i Δα_ij sign(ξ_j)/σ_i,   δβ_i := −ρ_i Δβ_i/σ_i,   where ρ_i/σ_i := 0 if σ_i = 0.
From (4.4.20) it follows that |ρ_i/σ_i| ≤ 1, and consequently
  A = A₀ + δA ∈ 𝔄,   b = b₀ + δb ∈ 𝔅,
as well as the following for i = 1, 2, ..., n:
  ρ_i = ρ_i ( Δβ_i + Σ_{j=1}^n Δα_ij |ξ_j| ) / σ_i = −δβ_i + Σ_{j=1}^n δα_ij ξ_j,
or
  Σ_{j=1}^n (α_ij + δα_ij) ξ_j = β_i + δβ_i,
that is,
  Ax̃ = b,
which was to be shown.   □
The criterion expressed in Theorem (4.4.19) permits us to draw conclusions about the fitness of a solution from the smallness of its residual. For
example, if all components of A₀ and b₀ have the same relative accuracy ε, then (4.4.19) is satisfied if
  |A₀x̃ − b₀| ≤ ε ( |b₀| + |A₀| |x̃| ).
From this inequality, the smallest ε can be computed for which a given x̃ can still be accepted as a usable solution.
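A minimal Python sketch of this acceptance test follows (not part of the original text; NumPy and the example data are assumptions). It computes the smallest ε for which |A₀x̃ − b₀| ≤ ε(|b₀| + |A₀||x̃|) holds componentwise:

    import numpy as np

    def smallest_eps(A0, b0, x_tilde):
        # Prager-Oettli criterion with dA = eps*|A0|, db = eps*|b0|:
        # x_tilde is acceptable iff |A0 x - b0| <= eps (|b0| + |A0||x|);
        # the smallest such eps is the maximum of the componentwise ratios.
        r = np.abs(A0 @ x_tilde - b0)
        s = np.abs(b0) + np.abs(A0) @ np.abs(x_tilde)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(s > 0, r / s, np.where(r > 0, np.inf, 0.0))
        return ratio.max()

    A0 = np.array([[2.0, 1.0], [1.0, 3.0]])
    b0 = np.array([3.0, 4.0])
    x_tilde = np.array([1.0000003, 0.9999999])   # an approximate solution
    print(smallest_eps(A0, b0, x_tilde))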
4.5 Roundoff-Error Analysis for Gaussian Elimination
In the discussion of methods for solving linear equations, the pivot selection
played only the following role: it guaranteed for any nonsingular matrix A
that the algorithm would not terminate prematurely if some pivot element
happened to vanish. We will now show that the numerical behavior of the
equation-solving methods which have been covered depends upon the choice
of pivots. To illustrate this, we consider the following simple example:
Solve the system
(4.5.1)   0.005 x + y = 0.5,
                x + y = 1,
with the use of Gaussian elimination. The exact solution is x = 5000/9950 = 0.503..., y = 4950/9950 = 0.497.... If the element a₁₁ = 0.005 is taken as the pivot in the first step, then we obtain
  [ 0.005   1    [ x      [  0.5
    0    −200 ]    ỹ ]  =   −99 ],   ỹ = 0.5,   x̃ = 0,
using 2-place floating-point arithmetic. If the element a₂₁ = 1 is taken as the pivot, then 2-place floating-point arithmetic yields
  [ 1  1    [ x̃      [ 1
    0  1 ]    ỹ ]  =   0.50 ],   ỹ = 0.50,   x̃ = 0.50.
In the second case, the accuracy of the result is considerably higher. This could lead to the impression that the largest element in magnitude should be chosen as a pivot from among the candidates in a column to get the best numerical results. However, a moment's thought shows that this cannot be unconditionally true. If the first row of the equation system (4.5.1) is multiplied by 200, for example, the result is the system
(4.5.2)   x + 200 y = 100,
          x +     y = 1,
which has the same solution as (4.5.1). The element a₁₁ = 1 is now just as large as the element a₂₁ = 1. However, the choice of a₁₁ as pivot element leads to the same inexact result as before. We have replaced the matrix A of (4.5.1) by Ā = DA, where D is the diagonal matrix
  D = [ 200  0
        0    1 ].
Obviously, we can also adjust the column norms of A, i.e., replace A by Ā = AD (where D is a diagonal matrix), without changing the solution x to Ax = b in any essential way. If x is the solution to Ax = b, then y = D⁻¹x is the solution of Āy = (AD)(D⁻¹x) = b. In general, we refer to a scaling of a matrix A if A is replaced by D₁AD₂, where D₁, D₂ are diagonal matrices. The example shows that it is not reasonable to propose a particular choice of pivots unless assumptions about the scaling of a matrix are made. Unfortunately, no one has yet determined satisfactorily how to carry out scaling so that partial pivot selection is numerically stable for any matrix A. Practical experience, however, suggests the following scaling for partial pivoting: Choose D₁ and D₂ so that
  Σ_{k=1}^n |ā_ik| ≈ Σ_{j=1}^n |ā_jl|
holds approximately for all i, l = 1, 2, ..., n in the matrix Ā = D₁AD₂. The sums of the absolute values of the elements in the rows (and the columns) of Ā should all have about the same magnitude. Such matrices are said to be equilibrated. In general, it is quite difficult to determine D₁ and D₂ so that D₁AD₂ is equilibrated. Usually we must get by with the following: Let D₂ = I, D₁ = diag(s₁, ..., sₙ), where
  s_i := 1 / Σ_{k=1}^n |a_ik|;
then for Ā = D₁AD₂ it is true at least that
  Σ_{k=1}^n |ā_ik| = 1   for i = 1, 2, ..., n.
Now, instead of replacing A by Ā, i.e., instead of actually carrying out the transformation, we replace the rule for pivot selection in order to avoid the explicit scaling of A. The pivot selection for the jth elimination step A^(j−1) → A^(j) is given by the following:
(4.5.3). Determine r ≥ j so that
  |a_rj^(j−1)| s_r = max_{i≥j} |a_ij^(j−1)| s_i ≠ 0,
and take a_rj^(j−1) as the pivot.
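A minimal Python sketch of the selection rule (4.5.3) follows (not part of the original text; NumPy is assumed and the matrix is the example (4.5.1)). It returns the row index of the pivot for the jth step, given the row scale factors of the original matrix:

    import numpy as np

    def pivot_row(A_j, s, j):
        # Rule (4.5.3): among rows r >= j choose the one maximizing |a_rj| * s_r,
        # where s_r = 1 / sum_k |a_rk| are the scale factors of the rows of A.
        weights = np.abs(A_j[j:, j]) * s[j:]
        if weights.max() == 0.0:
            raise ValueError("matrix is singular to working precision")
        return j + int(np.argmax(weights))

    A = np.array([[0.005, 1.0], [1.0, 1.0]])
    s = 1.0 / np.abs(A).sum(axis=1)       # scale factors of the rows of A
    print(pivot_row(A, s, 0))             # selects row 1, i.e. the pivot a_21 = 1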
The example above shows that it is not sufficient, in general, to scale A prior to carrying out partial pivot selection by making the largest element in absolute value within each row and each column have roughly the same magnitude:
(4.5.4)   max_k |a_ik| ≈ max_j |a_jl|   for all i, l = 1, 2, ..., n.
For, if the scaling matrices
  D₁ = [ 200  0      D₂ = [ 1  0
         0    1 ],          0  0.005 ]
are chosen, then the matrix A of our example (4.5.1) becomes
  Ā = D₁AD₂ = [ 1  1
                1  0.005 ].
The condition (4.5.4) is satisfied, but inexact answers will be produced, as before, if ā₁₁ = 1 is used as pivot.
We would like to make a detailed study of the effect of the rounding
errors which occur in Gaussian elimination or direct triangular decomposition [see Section 4.1]. We assume that the rows of the n x n matrix A
are already so arranged that A has a triangular decomposition of the form
A = LR. Hence, Land R can be determined using the formulas of (4.1.12).
In fact, we only have to evaluate expressions of the form
which were analyzed in (1.4.4)-(1.4.11). Instead of the exact triangular decomposition LR = A, (4.1.12) shows that the use of floating-point arithmetic will result in matrices L̄ = (l̄_ik), R̄ = (r̄_ik) for which the residual F := (f_ik) = A − L̄R̄ is not zero in general. Since, according to (4.1.13), the jth partial sums
  ā_ik^(j) = fl( a_ik − Σ_{s=1}^j l̄_is r̄_sk )
are exactly the elements of the matrix Ā^(j), which is produced instead of A^(j) of (4.1.6) from the jth step of Gaussian elimination in floating-point arithmetic, the estimates (1.4.7) applied to (4.1.12) yield (we use the abbreviation eps′ := eps/(1 − eps)):
(4.5.5)
  |f_ik| = | a_ik − Σ_{j=1}^i l̄_ij r̄_jk | ≤ eps′ Σ_{j=1}^{i−1} ( |ā_ik^(j)| + |l̄_ij| |r̄_jk| ),   k ≥ i,
  |f_ki| = | a_ki − Σ_{j=1}^i l̄_kj r̄_ji | ≤ eps′ [ |ā_ki^(i−1)| + Σ_{j=1}^{i−1} ( |ā_ki^(j)| + |l̄_kj| |r̄_ji| ) ],   k > i.
Further,
(4.5.6)   r̄_ik = ā_ik^(i−1)   for i ≤ k,
since the first j + 1 rows of Ā^(j) [or A^(j) of (4.1.6)] do not change in the subsequent elimination steps, and so they already agree with the corresponding rows of R̄. We assume, in addition, that |l̄_ik| ≤ 1 for all i, k (which is satisfied, for example, when partial or complete pivoting is used).
Setting
  a_j := max_{i,k} |ā_ik^(j)|,   a := max_{0≤j≤n−1} a_j,
it follows immediately from (4.5.5) and (4.5.6) that
(4.5.7)
  |f_ik| ≤ (eps/(1 − eps)) ( a₀ + 2a₁ + 2a₂ + ... + 2a_{i−2} + a_{i−1} ) ≤ 2(i − 1) a eps/(1 − eps)   for k ≥ i,
  |f_ik| ≤ (eps/(1 − eps)) ( a₀ + 2a₁ + ... + 2a_{k−2} + 2a_{k−1} ) ≤ 2k a eps/(1 − eps)   for k < i.
For the matrix F, then, the inequality
(4.5.8)   |F| ≤ ( 2a eps/(1 − eps) ) ·
          [ 0 0 0 ⋯ ⋯ 0
            1 1 1 ⋯ ⋯ 1
            1 2 2 ⋯ ⋯ 2
            1 2 3 ⋯ ⋯ 3
            ⋮ ⋮ ⋮  ⋱   ⋮
            1 2 3 ⋯ n−1 n−1 ]
holds, where |F| := (|f_ik|). If a has the same order of magnitude as a₀, that is, if the matrices A^(j) do not grow too much, then the computed matrices L̄, R̄ form the exact triangular decomposition of the matrix A − F, which differs little from A. Gaussian elimination is stable in this case.
The value a can be estimated with the help of a₀ = max_{r,s} |a_rs|. For partial pivot selection it can easily be shown that
  a_j ≤ 2^j a₀,
and hence that a ≤ 2^{n−1} a₀. This bound is much too pessimistic in most cases; however, it can be attained, for example, in forming the triangular decomposition of the matrix
  A = [  1   0  ⋯   0  1
        −1   1       0  1
        −1  −1   ⋱      ⋮
         ⋮        ⋱  1  1
        −1  −1  ⋯  −1  1 ].
Better estimates hold for special types of matrices. For example, in the case of upper Hessenberg matrices, that is, matrices of the form
  A = [ x  x  ⋯  x
        x  x  ⋯  x
           ⋱  ⋱  ⋮
        0     x  x ]   (a_ik = 0 for i > k + 1),
the bound a ≤ (n − 1)a₀ can be shown. (Hessenberg matrices arise in eigenvalue problems.)
For tridiagonal matrices
  A = [ α₁ β₂           0
        γ₂ α₂  ⋱
            ⋱  ⋱  β_n
        0      γ_n α_n ]
it can even be shown that
  a_k ≤ 2a₀,   k = 1, 2, ..., n − 1,
holds for partial pivot selection. Hence, Gaussian elimination is quite numerically stable in this case.
For complete pivot selection Wilkinson (1965) has shown that
  a ≤ f(n) a₀
with the function
  f(k) := k^{1/2} [ 2^1 3^{1/2} 4^{1/3} ⋯ k^{1/(k−1)} ]^{1/2}.
This function grows relatively slowly with k:

  k      10   20   50    100
  f(k)   19   67   530   3300

Even this estimate is too pessimistic in practice. Up until now, no real matrix has been found which fails to satisfy
  a_k ≤ (k + 1) a₀,   k = 1, 2, ..., n − 1,
when complete pivot selection is used.
when complete pivot selection is used. This indicates that Gaussian elimination with complete pivot selection is usually a stable process. Despite
this, partial pivot selection is preferred in practice, for the most part, because:
(1) Complete pivot selection is more costly than partial pivot selection. (To compute A^(i), the maximum from among (n − i + 1)² elements must be determined instead of n − i + 1 elements as in partial pivot selection.)
(2) Special structures in a matrix, e.g. the band structure of a tridiagonal matrix, are destroyed in complete pivot selection.
If the weaker estimates (1.4.11) are used instead of (1.4.7), then the following bounds replace those of (4.5.5) for the f_ik:
  |f_ik| ≤ ( eps/(1 − n·eps) ) [ Σ_{j=1}^i j |l̄_ij| |r̄_jk| − |r̄_ik| ]   for k ≥ i,
  |f_ki| ≤ ( eps/(1 − n·eps) ) [ Σ_{j=1}^i j |l̄_kj| |r̄_ji| ]   for k ≥ i + 1,
or
(4.5.9)   |F| ≤ ( eps/(1 − n·eps) ) [ |L̄| D |R̄| − |R̄| ],
where D := diag(1, 2, ..., n).
4.6 Roundoff Errors in Solving Triangular Systems
As a result of applying Gaussian elimination in floating-point arithmetic to the matrix A, a lower triangular matrix L̄ and an upper triangular matrix R̄ are obtained whose product L̄R̄ approximately equals A. Solving the system Ax = b is thereby reduced to solving the two triangular systems
  L̄y = b,   R̄x = y.
In this section we investigate the influence of roundoff errors on the solution of such equation systems. If we use ȳ to denote the solution of L̄y = b obtained using t-digit floating-point arithmetic, then the definition of ȳ gives
(4.6.1)
From (1.4.10), (1.4.11) it follows immediately that
  |b − L̄ȳ| ≤ ( eps/(1 − n·eps) ) ( |L̄| D − I ) |ȳ|,
or
  (L̄ + ΔL) ȳ = b.
In other words, there exists a matrix ΔL satisfying
(4.6.3)   |ΔL| ≤ ( eps/(1 − n·eps) ) ( |L̄| D − I ).
Thus, the computed solution can be interpreted as the exact solution of a slightly changed problem, showing that the process of solving a triangular system is stable. Similarly, the computed solution x̄ of R̄x = ȳ is found to satisfy the bound
  |ȳ − R̄x̄| ≤ ( eps/(1 − n·eps) ) |R̄| E |x̄|,
that is,
(4.6.4)   (R̄ + ΔR) x̄ = ȳ,   |ΔR| ≤ ( eps/(1 − n·eps) ) |R̄| E,   E := diag(n, n − 1, ..., 2, 1).
By combining the estimates (4.5.9), (4.6.3), and (4.6.4), we obtain the following result [due to Sautter (1971)] for the approximate solution x̄ produced by floating-point arithmetic to the linear system Ax = b:
(4.6.5)   |b − Ax̄| ≤ ( 2(n + 1) eps/(1 − n·eps) ) |L̄| |R̄| |x̄|,   if n·eps ≤ 1/2.
PROOF. Using the abbreviation ε := eps/(1 − n·eps), it follows from (4.5.9), (4.6.3) and (4.6.4) that
  |b − Ax̄| = |b − (L̄R̄ + F)x̄| = |−Fx̄ + b − L̄(ȳ − ΔR x̄)|
           = |(−F + ΔL(R̄ + ΔR) + L̄ ΔR) x̄|
           ≤ ε [ 2(|L̄|D − I)|R̄| + |L̄||R̄|E + ε(|L̄|D − I)|R̄|E ] |x̄|.
The (i, k) component of the matrix [...] appearing in the last line above has the form
  Σ_{j=1}^{min(i,k)} |l̄_ij| ( 2j − 2δ_ij + n + 1 − k + ε(j − δ_ij + n + 1 − k) ) |r̄_jk|,   δ_ij := 1 for i = j, 0 for i ≠ j.
It is easily verified for all j ≤ min(i, k), 1 ≤ i, k ≤ n, that
  2j − 2δ_ij + n + 1 − k + ε(j − δ_ij + n + 1 − k) ≤ { 2n − 1 + ε·n   if j ≤ i ≤ k,
                                                       2n + ε(n + 1)  if j ≤ k < i }
                                                   ≤ 2n + 2,
since ε·n ≤ 2n·eps ≤ 1. This completes the proof of (4.6.5).   □
A comparison of (4.6.5) with the result (4.4.19) due to Prager and Oettli (1964) shows, finally, that the computed solution x̄ can be interpreted as the exact solution of a slightly changed equation system, provided that the matrix n|L̄||R̄| has the same order of magnitude as |A|. In that case, computing the solution via Gaussian elimination is a numerically stable algorithm.
4.7 Orthogonalization Techniques of Householder and
Gram-Schmidt
The methods discussed up to this point for solving a system of equations
(4.7.1)
Ax = b
consisted of multiplying (4.7.1) on the left by appropriate matrices P j ,
j = 1, ... , n, so that the system obtained as the final outcome,
A(n)x = b(n),
could be solved directly. The sensitivity of the result x to changes in the arrays A^(j), b^(j) of the intermediate systems
  A^(j) x = b^(j),   [A^(j), b^(j)] = P_j [A^(j−1), b^(j−1)],
is given by [see (4.4.12) and Theorem (4.4.15)]
  cond(A^(j)) = lub(A^(j)) lub((A^(j))⁻¹).
If we denote the roundoff error incurred in the transition from [A^(j−1), b^(j−1)] to [A^(j), b^(j)] by ε^(j), then these roundoff errors are amplified by the factors cond(A^(j)) in their effect on the final result x; here ε^(0) stands for the error in the initial data A, b. If there is an A^(j) with
  cond(A^(j)) ≫ cond(A),
then the sequence of computations is not numerically stable: the roundoff error ε^(j) has a stronger influence on the final result than the initial error ε^(0).
For this reason, it is important to choose the P_j so that the condition numbers cond(A^(j)) do not grow. For condition numbers derived from an arbitrary norm ‖x‖, that is difficult to do. For the Euclidean norm
  ‖x‖ = √(x^H x)
and its subordinate matrix norm
  lub(A) = max_{x≠0} ‖Ax‖/‖x‖,
however, the choice of transformations is more easily made. For this reason, only the above norms will be used in the present section. If U is a unitary matrix, U^H U = I, then the above matrix norm satisfies
  lub(A) = lub(U^H U A) ≤ lub(U^H) lub(UA) = lub(UA) ≤ lub(U) lub(A) = lub(A);
hence
  lub(UA) = lub(A),
and similarly lub(AU) = lub(A). In particular, it follows that
  cond(A) = lub(A) lub(A⁻¹) = cond(UA)
for unitary U. If the transformation matrices Pj are chosen to be unitary,
then the condition numbers associated with the systems A (j) x = b(j) do
not change (so they certainly don't get worse). Furthermore, the matrices
Pj should be chosen so that the matrices A (j) become simpler. As was suggested by Householder, this can be accomplished in the following manner:
The unitary matrix P is chosen to be
  P = I − 2ww^H,   w ∈ ℂⁿ,   w^H w = 1.
Matrices of this form are called Householder matrices. These matrices are Hermitian:
  P^H = (I − 2ww^H)^H = I − 2ww^H = P,
unitary:
  P^H P = PP = P² = (I − 2ww^H)(I − 2ww^H) = I − 2ww^H − 2ww^H + 4w(w^H w)w^H = I,
and therefore involutory:
  P² = I.
Geometrically, the mapping x ↦ y := Px = x − 2(w^H x)w describes a reflection of x with respect to the plane { z | w^H z = 0 }, and x and y then satisfy
(4.7.2)   y^H y = x^H P^H P x = x^H x,
(4.7.3)   x^H y = x^H P x = (x^H P x)^H,
and x^H y is real. We wish to determine a vector w, and thereby P, so that a given
  x = (x₁, ..., xₙ)^T ≠ 0
is transformed into a multiple of the first coordinate vector e₁:
  Px = k e₁.
From (4.7.2) it follows immediately that k satisfies
  |k| = (x^H x)^{1/2} =: σ,
and, since k̄ x^H e₁ = k̄ x₁ must be real according to (4.7.3),
  k = ∓ e^{iα} σ,   if x₁ = e^{iα} |x₁|.
Hence, it follows from
  Px = x − 2(w^H x)w = k e₁
and the requirement w^H w = 1 that
  w = (x − k e₁)/‖x − k e₁‖,
  ‖x − k e₁‖ = ‖x ± σ e^{iα} e₁‖ = √( |x₁ ± σ e^{iα}|² + |x₂|² + ... + |xₙ|² )
             = √( (|x₁| ± σ)² + |x₂|² + ... + |xₙ|² ).
In order that no cancellation may occur in the computation of |x₁| ± σ, we choose the sign in the definition of k to be the negative one, which gives
(4.7.4)   k = −σ e^{iα},   ‖x − k e₁‖² = 2σ(σ + |x₁|).
It follows that
  w = (x − k e₁) / ( 2σ(σ + |x₁|) )^{1/2}.
The matrix P = I − 2ww^H can be written in the form
  P = I − β u u^H
with
(4.7.5)   σ = ( Σ_{i=1}^n |x_i|² )^{1/2},   x₁ = e^{iα}|x₁|,   k = −σ e^{iα},
          u := x − k e₁ = ( e^{iα}(|x₁| + σ), x₂, ..., xₙ )^T,   β := 1/( σ(σ + |x₁|) ).
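A minimal Python sketch of the construction (4.7.5) for the real case follows (not part of the original text; NumPy and the sample vector are assumptions). It builds u, β and k and verifies that (I − βuu^T)x is a multiple of e₁:

    import numpy as np

    def householder(x):
        # u and beta with (I - beta*u*u^T) x = k*e1, cf. (4.7.5); the sign of k
        # is chosen opposite to x[0] so that |x1| + sigma is formed without cancellation.
        sigma = np.linalg.norm(x)
        if sigma == 0.0:
            return np.zeros_like(x), 0.0, 0.0
        k = -sigma if x[0] >= 0 else sigma
        u = x.copy()
        u[0] = x[0] - k                  # first component has magnitude |x1| + sigma
        beta = 1.0 / (sigma * (sigma + abs(x[0])))
        return u, beta, k

    x = np.array([3.0, 4.0, 0.0])
    u, beta, k = householder(x)
    print((np.eye(3) - beta * np.outer(u, u)) @ x)   # approximately (k, 0, 0)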
An n × n matrix A =: A^(0) can be reduced step by step, using such unitary Householder matrices P_j,
  A^(j) = P_j A^(j−1),
into an upper triangular matrix
  A^(n−1) = R = [ r₁₁  ⋯  r₁ₙ
                       ⋱   ⋮
                  0        rₙₙ ].
To do this, the n × n unitary matrix P₁ is determined according to (4.7.5) so that
  P₁ a₁^(0) = k e₁,
where a₁^(0) stands for the first column of A^(0).
If the matrix A^(j−1) obtained after j − 1 steps has the form
(4.7.6)   A^(j−1) = [ x  ⋯  x | x            ⋯  x
                         ⋱    | ⋮                ⋮
                      0     x | x            ⋯  x
                      --------+------------------------
                      0  ⋯  0 | a_jj^(j−1)   ⋯  a_jn^(j−1)
                      ⋮     ⋮ | ⋮                ⋮
                      0  ⋯  0 | a_nj^(j−1)   ⋯  a_nn^(j−1) ]  =  [ B^(j−1)  C^(j−1)
                                                                    0        Ã^(j−1) ],
with a (j−1) × (j−1) upper triangular block B^(j−1) and the remaining (n−j+1) × (n−j+1) block Ã^(j−1) = ( a_ik^(j−1) ), i, k ≥ j,
then we determine the (n − j + 1) × (n − j + 1) unitary matrix P̃_j according to (4.7.5) so that
  P̃_j ã^(j−1) = k e₁,
where ã^(j−1) is the first column of Ã^(j−1). Using P̃_j, the desired n × n unitary matrix is constructed as
  P_j = [ I_{j−1}  0
          0        P̃_j ].
After forming A^(j) = P_j A^(j−1), the elements a_ij^(j) for i > j are annihilated, and the rows in (4.7.6) above the horizontal dashed line remain unchanged.
In this way an upper triangular matrix
  R := A^(n−1)
is obtained after n − 1 steps.
It should be noted, when applying Householder transformations in practice, that the locations which are set to zero beneath the diagonal of A^(j) can be used to store u, which contains the important information about the transformation P̃_j. However, since the vector u which belongs to P̃_j contains n − j + 1 components, while only n − j locations are set free in A^(j), it is usual to store the diagonal elements of A^(j) in a separate array d in order to obtain enough space for u.
The transformation of the remaining block by P̃_j = I − β u u^H is carried out as follows:
  P̃_j Ã^(j−1) = Ã^(j−1) − u y_j^H,   y_j^H := β ( u^H Ã^(j−1) );
that is, the vector y_j is computed first, and then Ã^(j−1) is modified as indicated.
The following ALGOL program contains the essentials of the reduction
of a real matrix, stored in the array a, to triangular form using Householder
transformations:
for j := 1 step 1 until n do
begin
  sigma := 0;
  for i := j step 1 until n do sigma := sigma + a[i, j] ↑ 2;
  if sigma = 0 then goto singular;
  s := d[j] := if a[j, j] < 0 then sqrt(sigma) else -sqrt(sigma);
  beta := 1/(s × a[j, j] - sigma);  a[j, j] := a[j, j] - s;
  for k := j + 1 step 1 until n do
  begin
    sum := 0;
    for i := j step 1 until n do
      sum := sum + a[i, j] × a[i, k];
    sum := beta × sum;
    for i := j step 1 until n do
      a[i, k] := a[i, k] + a[i, j] × sum;
  end;
end;
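For readers who prefer a modern language, the following Python sketch mirrors the ALGOL fragment above (it is not part of the original text; NumPy and the sample matrix are assumptions). Below the diagonal of column j the vector u is stored, and the diagonal of R goes into the array d:

    import numpy as np

    def householder_triangularize(a):
        a = np.array(a, dtype=float)
        n = a.shape[0]
        d = np.zeros(n)
        for j in range(n):
            sigma = np.sum(a[j:, j] ** 2)
            if sigma == 0.0:
                raise ValueError("matrix is singular")
            s = np.sqrt(sigma) if a[j, j] < 0 else -np.sqrt(sigma)
            d[j] = s
            beta = 1.0 / (s * a[j, j] - sigma)
            a[j, j] -= s                       # column j below the diagonal now holds u
            for k in range(j + 1, n):
                f = beta * np.dot(a[j:, j], a[j:, k])
                a[j:, k] += a[j:, j] * f       # apply I + beta*u*u^T to column k
        return a, d

    A = np.array([[4.0, 1.0, 2.0], [2.0, 3.0, 0.0], [1.0, 2.0, 5.0]])
    a, d = householder_triangularize(A)
    R = np.triu(a, 1) + np.diag(d)             # rebuild R from the stored data
    print(np.allclose(np.abs(np.linalg.qr(A)[1]), np.abs(R)))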
The Householder reduction of a n x n matrix to triangular form requires about 2n 3 /3 operations. In this process an n x n unitary matrix
P = Pn - l ... PI consisting of Householder matrices Pi and an n x n upper
triangular matrix R are determined so that
PA=R
or
(4.7.7)
A=P-IR=QR,
Q:=p-l=p H ,
holds: that is we have found a so-called QR-decomposition of the matrix A
into a product of a unitary matrix Q and an upper triangular matrix R,
A=QR.
An upper triangular matrix R, with the property (4.7.7) that AR⁻¹ = Q = (q₁, q₂, ..., qₙ) is a matrix with orthonormal columns q_i, can also be produced directly by the application of Gram-Schmidt orthogonalization to the columns a_i of A = (a₁, ..., aₙ). The equation A = QR shows that the kth column a_k of A,
  a_k = Σ_{i=1}^k r_ik q_i,   k = 1, ..., n,
is a linear combination of the vectors q₁, q₂, ..., q_k, so that, conversely, q_k is a linear combination of the first k columns a₁, ..., a_k of A. The Gram-Schmidt process determines the columns of Q and R recursively as follows: Begin with
  r₁₁ := ‖a₁‖,   q₁ := a₁/r₁₁.
If the orthonormal vectors q₁, ..., q_{k−1} and the elements r_ij with j ≤ k − 1 of R are known, then the numbers r_{1k}, ..., r_{k−1,k} are determined so that the vector
(4.7.8)   b_k := a_k − r_{1k} q₁ − ... − r_{k−1,k} q_{k−1}
is orthogonal to all q_i, i = 1, ..., k − 1. Because
  q_i^H q_j = 1 for i = j, 0 otherwise,   for 1 ≤ i, j ≤ k − 1,
the conditions q_i^H b_k = 0 lead immediately to
(4.7.9)   r_ik = q_i^H a_k,   i = 1, ..., k − 1.
After the r_ik, i ≤ k − 1, and thereby b_k, have been determined, the quantities
(4.7.10)   r_kk := ‖b_k‖,   q_k := b_k / r_kk
are computed, so that (4.7.8) is equivalent to
  a_k = Σ_{i=1}^k r_ik q_i
and moreover
  q_i^H q_j = 1 for i = j, 0 otherwise,   for all 1 ≤ i, j ≤ k.
As given above, the algorithm has a serious disadvantage: it is not numerically stable if the columns of the matrix A are nearly linearly dependent. In this case the vector b̄_k, which is obtained from the formulas (4.7.8), (4.7.9) instead of the vector b_k (due to the influence of roundoff errors), is no longer orthogonal to the vectors q₁, ..., q_{k−1}. The slightest roundoff error incurred in the determination of the r_ik in (4.7.9) destroys orthogonality to a greater or lesser extent.
We will discuss this effect in the special case corresponding to k = 1 in (4.7.8). Let two real vectors a and q be given, which we regard, for simplicity, as normalized: ‖a‖ = ‖q‖ = 1. This means that their scalar product p₀ := q^T a satisfies |p₀| ≤ 1. We assume that |p₀| < 1 holds, that is, a and q are linearly independent. The vector
  b = b(p) := a − p q
is orthogonal to q for p = p₀. In general, the angle α(p) between b(p) and q satisfies
  f(p) := cos α(p) = q^T b(p)/‖b(p)‖ = (q^T a − p)/‖a − pq‖,   α(p₀) = π/2,   f(p₀) = 0.
Differentiating with respect to p, we find
  f′(p₀) = −1/‖a − p₀q‖ = −1/√(1 − p₀²),   α′(p₀) = 1/√(1 − p₀²),
since
  ‖a − p₀q‖² = a^T a − 2p₀ a^T q + p₀² q^T q = 1 − p₀².
Therefore, to a first approximation, we have
  α(p₀ + Δp₀) ≐ π/2 + Δp₀/√(1 − p₀²)
for Δp₀ small. The closer |p₀| lies to 1, i.e., the more linearly dependent a and q are, the more the orthogonality between q and b(p) is destroyed by even tiny errors Δp₀, in particular by the roundoff errors which occur in computing p₀ = q^T a.
Since it is precisely the orthogonality of the vectors q_i that is essential to the process, the following trick is used in practice. The vectors b̄_k which are obtained instead of the exact b_k from the evaluation of (4.7.8), (4.7.9) are subjected to a "reorthogonalization". That is, scalars Δr_ik and a vector b̃_k are computed from
  b̃_k = b̄_k − Δr_{1k} q₁ − ... − Δr_{k−1,k} q_{k−1},   where Δr_ik := q_i^H b̄_k,  i = 1, ..., k − 1,
so that in exact arithmetic q_i^H b̃_k = 0, i = 1, ..., k − 1. Since b̄_k was at least approximately orthogonal to the q_i, the Δr_ik are small, and according to the theory just given, the roundoff errors which are made in the computation of the Δr_ik have only a minor influence on the orthogonality of b̃_k and q_i. This means that the vector
  q_k := b̃_k / ‖b̃_k‖
will be orthogonal within machine precision to the already known vectors q₁, ..., q_{k−1}. The values r_ik which have been found by evaluating (4.7.9) are corrected appropriately:
  r_ik := r_ik + Δr_ik,   i = 1, ..., k − 1.
Clearly, reorthogonalization requires twice as much computing effort as the
straightforward Gram-Schmidt process.
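A minimal Python sketch of Gram-Schmidt with this reorthogonalization step follows (not part of the original text; NumPy and the nearly dependent test matrix are assumptions):

    import numpy as np

    def gram_schmidt(A, reorthogonalize=True):
        # Column-by-column Gram-Schmidt following (4.7.8)-(4.7.10); with
        # reorthogonalize=True each b_k is orthogonalized a second time against
        # q_1,...,q_{k-1} and the r_ik are corrected by the increments Delta r_ik.
        A = np.asarray(A, dtype=float)
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for k in range(n):
            b = A[:, k].copy()
            for i in range(k):
                R[i, k] = Q[:, i] @ A[:, k]          # (4.7.9)
                b -= R[i, k] * Q[:, i]               # (4.7.8)
            if reorthogonalize:
                for i in range(k):
                    dr = Q[:, i] @ b                 # Delta r_ik
                    R[i, k] += dr
                    b -= dr * Q[:, i]
            R[k, k] = np.linalg.norm(b)              # (4.7.10)
            Q[:, k] = b / R[k, k]
        return Q, R

    A = np.array([[1.0, 1.0], [1e-7, 0.0], [0.0, 1e-7]])
    Q, R = gram_schmidt(A)
    print(np.abs(Q.T @ Q - np.eye(2)).max())         # orthogonality near machine precision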
4.8 Data Fitting
In many scientific observations, one is concerned with determining the values of certain constants
  x₁, ..., xₙ.
Often, however, it is exceedingly difficult or impossible to measure the quantities x_i directly. In such cases, the following indirect method is used: instead of observing the x_i, another, more easily measurable quantity y is sampled, which depends in a known way on the x_i and on further controllable "experimental conditions", which we symbolize by z:
  y = f(z; x₁, ..., xₙ).
In order to determine the x_i, experiments are carried out under m different conditions z₁, ..., z_m, and the corresponding results
(4.8.0.1)   y_k = f(z_k; x₁, ..., xₙ),   k = 1, 2, ..., m,
are measured. Values for x₁, ..., xₙ are sought so that the equations (4.8.0.1) are satisfied. In general, of course, at least m experiments, m ≥ n, must be carried out in order that the x_i may be uniquely determined. If m > n, however, the equations (4.8.0.1) form an overdetermined system for the unknown parameters x₁, ..., xₙ, which does not usually have a solution because the observed quantities y_i are perturbed by measurement errors. Consequently, instead of finding an exact solution to (4.8.0.1), the problem becomes one of finding a "best possible solution". Such a solution to (4.8.0.1) is taken to mean a set of values for the unknown parameters for which the expression
(4.8.0.2)   Σ_{k=1}^m ( y_k − f_k(x₁, ..., xₙ) )²,
or the expression
(4.8.0.3)
is minimized. Here we have denoted f(z_k; x₁, ..., xₙ) by f_k(x₁, ..., xₙ).
In the first case, the Euclidean norm of the residuals is minimized, and we are presented with a fitting problem of the special type studied by Gauss (in his "method of least squares"). In mathematical statistics [e.g.
see Guest (1961), Seber (1977), or Grossmann (1969)] it is shown that the
"least-squares solution" has particularly simple statistical properties. It is
the most reasonable point to determine if the errors in the measurements
are independent and normally distributed. More recently, however, points
x which minimize the norm (4.8.0.3) of the residuals have been considered
in cases where the measurement errors are subject to occasional anomalies.
Since the computation of the best solution x is more difficult in this case
than in the method of least squares, we will not pursue this matter.
If the functions f_k(x₁, ..., xₙ) have continuous partial derivatives in all of the variables x_i, then we can readily give a necessary condition for x = (x₁, ..., xₙ)^T to minimize (4.8.0.2):
(4.8.0.4)   Σ_{k=1}^m ( y_k − f_k(x₁, ..., xₙ) ) ∂f_k(x₁, ..., xₙ)/∂x_i = 0,   i = 1, ..., n.
These are the normal equations for x.
An important special case, the linear least-squares problem, is obtained if the functions f_k(x₁, ..., xₙ) are linear in the x_i. In this case there is an m × n matrix A with
  ( f₁(x₁, ..., xₙ), ..., f_m(x₁, ..., xₙ) )^T = A x,
and the normal equations reduce to a linear system
(4.8.0.5)   A^T A x = A^T y.
We will concern ourselves in the following sections (4.8.1-4.8.3) with methods for solving the linear least-squares problem, and in particular, we will show that more numerically stable methods exist for finding the solution than by means of the normal equations.
Least-squares problems are studied in more detail in Bjorck (1990) and
the book by Lawson and Hanson (1974), which also contains FORTRAN
programs; ALGOL programs are found in Wilkinson and Reinsch (1971).
4.8.1 Linear Least Squares. The Normal Equations
In the following sections ‖x‖ will always denote the Euclidean norm ‖x‖ := √(x^T x). Let a real m × n matrix A and a vector y ∈ ℝᵐ be given, and let
(4.8.1.1)   ‖y − Ax‖² = (y − Ax)^T (y − Ax)
be minimized as a function of x. We want to show that x ∈ ℝⁿ is a solution of the normal equations
(4.8.1.2)   A^T A x = A^T y
if and only if x is also a minimum point for (4.8.1.1). We have the following:
(4.8.1.3) Theorem. The linear least-squares problem
  min_{x ∈ ℝⁿ} ‖y − Ax‖
has at least one minimum point x₀. If x₁ is another minimum point, then Ax₀ = Ax₁. The residual r := y − Ax₀ is uniquely determined and satisfies the equation A^T r = 0. Every minimum point x₀ is also a solution of the normal equations (4.8.1.2) and conversely.
PROOF. Let L ⊆ ℝᵐ be the linear subspace
  L := { Ax | x ∈ ℝⁿ }
which is spanned by the columns of A, and let L^⊥ be the orthogonal complement
  L^⊥ := { r | r^T z = 0 for all z ∈ L } = { r | r^T A = 0 }.
Because ℝᵐ = L ⊕ L^⊥, the vector y ∈ ℝᵐ can be written uniquely in the form
(4.8.1.4)   y = s + r,   s ∈ L,   r ∈ L^⊥,
and there is at least one x₀ with Ax₀ = s. Because A^T r = 0, x₀ satisfies
  A^T y = A^T s = A^T A x₀,
that is, x₀ is a solution of the normal equations. Conversely, each solution x₁ of the normal equations corresponds to a representation (4.8.1.4):
  y = s + r,   s := Ax₁ ∈ L,   r := y − Ax₁ ∈ L^⊥.
Because this representation is unique, it follows that Ax₀ = Ax₁ for all solutions x₀, x₁ of the normal equations. Further, each solution x₀ of the normal equations is a minimum point for the problem
  min_{x ∈ ℝⁿ} ‖y − Ax‖.
To see this, let x be arbitrary, and set
  z := Ax − Ax₀,   r := y − Ax₀.
Then, since r^T z = 0,
  ‖y − Ax‖² = ‖r − z‖² = ‖r‖² + ‖z‖² ≥ ‖r‖² = ‖y − Ax₀‖²;
that is, x₀ is a minimum point. This establishes Theorem (4.8.1.3).   □
If the columns of A are linearly independent, that is, if x ≠ 0 implies Ax ≠ 0, then the matrix A^T A is nonsingular (and positive definite). If this were not the case, there would exist an x ≠ 0 satisfying A^T A x = 0, from which
  0 = x^T A^T A x = ‖Ax‖²
would yield a contradiction, since Ax ≠ 0. Therefore the normal equations
  A^T A x = A^T y
have a unique solution x = (AT A)-1 AT y, which can be computed using the
methods of Section 4.3 (that is, using the Choleski factorization of AT A).
However, we will see in the following sections that there are more numerically stable ways of solving the linear least squares problem.
We digress here to go briefly into the statistical meaning of the matrix (A^T A)⁻¹. To do this, we assume that the components y_i, i = 1, ..., m, are independent, normally distributed random variables having μ_i as means and all having the same variance σ²:
  E[y_i] = μ_i,   E[(y_i − μ_i)(y_k − μ_k)] = σ² for i = k, 0 otherwise.
If we set μ := (μ₁, ..., μ_m)^T, then the above can be expressed as
(4.8.1.5)   E[y] = μ,   E[(y − μ)(y − μ)^T] = σ² I.
The covariance matrix of the random vector y is σ² I. The optimum x = (A^T A)⁻¹ A^T y of the least-squares problem is also a random vector, having mean
  E[x] = E[(A^T A)⁻¹ A^T y] = (A^T A)⁻¹ A^T E[y] = (A^T A)⁻¹ A^T μ,
and covariance matrix
  E[(x − E(x))(x − E(x))^T] = E[(A^T A)⁻¹ A^T (y − μ)(y − μ)^T A (A^T A)⁻¹]
                            = (A^T A)⁻¹ A^T E[(y − μ)(y − μ)^T] A (A^T A)⁻¹ = σ² (A^T A)⁻¹.
4.8.2 The Use of Orthogonalization in Solving Linear Least-Squares Problems
The problem of determining an x ∈ ℝⁿ which minimizes
  ‖y − Ax‖,   A ∈ M(m, n),   m ≥ n,
can be solved using the orthogonalization techniques discussed in Section 4.7. Let the matrix A =: A^(0) and the vector y =: y^(0) be transformed by a sequence of Householder transformations P_i, A^(i) = P_i A^(i−1), y^(i) = P_i y^(i−1). The final matrix A^(n) has the form
(4.8.2.1)   A^(n) = [ R    } n
                      0 ]  } m − n,
with an n × n upper triangular matrix R, since m ≥ n. Let the vector h := y^(n) be partitioned correspondingly:
(4.8.2.2)   h = [ h₁    } n
                  h₂ ]  } m − n.
The matrix P = P_n ⋯ P₁, being the product of unitary matrices, is unitary itself, and satisfies
  A^(n) = P A,   h = P y.
Unitary transformations leave the Euclidean norm ‖u‖ of a vector u invariant (‖Pu‖² = u^H P^H P u = u^H u = ‖u‖²), so
  ‖y − Ax‖ = ‖P(y − Ax)‖ = ‖y^(n) − A^(n) x‖.
However, from (4.8.2.1) and (4.8.2.2), the vector y^(n) − A^(n) x has the structure
  y^(n) − A^(n) x = [ h₁ − Rx
                      h₂      ].
Hence ‖y − Ax‖ is minimized if x is chosen so that
(4.8.2.3)   R x = h₁.
The matrix R has an inverse R⁻¹ if and only if the columns a₁, ..., aₙ of A are linearly independent. Az = 0 for z ≠ 0 is equivalent to
  P A z = 0
and therefore to
  R z = 0.
If we assume that the columns of A are linearly independent, then
  R x = h₁,
which is a triangular system, can be solved uniquely for x. This x is, moreover, the unique minimum point for the given least-squares problem. [If the columns of A, and with them the columns of R, are linearly dependent, then, although the value of min_z ‖y − Az‖ is uniquely determined, there are many minimum points x.]
The size ‖y − Ax‖ of the residual of the minimum point is seen to be
(4.8.2.4)   ‖y − Ax‖ = ‖h₂‖.
We conclude by mentioning that instead of using unitary transformations, the Gram-Schmidt technique with reorthogonalization can be used to obtain the solution, as should be evident.
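A minimal Python sketch of this orthogonalization route follows (not part of the original text; NumPy and the small example are assumptions). NumPy's Householder-based QR factorization plays the role of P:

    import numpy as np

    def lstsq_qr(A, y):
        # Minimize ||y - Ax|| as in Section 4.8.2: orthogonal transformation,
        # then the triangular system R x = h1, cf. (4.8.2.3).
        Q, R = np.linalg.qr(A)               # "reduced" factorization, A = Q R
        h1 = Q.T @ y                         # first n components of P y
        return np.linalg.solve(R, h1)

    A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 2.0])
    x = lstsq_qr(A, y)
    print(x, np.linalg.norm(y - A @ x))      # residual norm equals ||h2||, cf. (4.8.2.4)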
4.8.3 The Condition of the Linear Least-Squares Problem
We begin this section by investigating how a minimum point x for the linear least-squares problem
(4.8.3.1)   min_x ‖y − Ax‖
changes if the matrix A and the vector y are perturbed. We assume that the columns of A are linearly independent. If the matrix A is replaced by A + ΔA, and y is replaced by y + Δy, then the solution x = (A^T A)⁻¹ A^T y of (4.8.3.1) changes to
  x + Δx = ( (A + ΔA)^T (A + ΔA) )⁻¹ (A + ΔA)^T (y + Δy).
If ΔA is small relative to A, then ( (A + ΔA)^T (A + ΔA) )⁻¹ exists and satisfies, to a first approximation,
  ( (A + ΔA)^T (A + ΔA) )⁻¹ ≐ ( A^T A ( I + (A^T A)⁻¹[A^T ΔA + ΔA^T A] ) )⁻¹
                             ≐ ( I − (A^T A)⁻¹[A^T ΔA + ΔA^T A] ) (A^T A)⁻¹.
[To a first approximation, (I + F)⁻¹ ≐ I − F if the matrix F is "small" relative to I.] Thus it follows that
(4.8.3.2)   x + Δx ≐ (A^T A)⁻¹ A^T y − (A^T A)⁻¹[A^T ΔA + ΔA^T A](A^T A)⁻¹ A^T y
                     + (A^T A)⁻¹ ΔA^T y + (A^T A)⁻¹ A^T Δy.
Noting that
  x = (A^T A)⁻¹ A^T y
and introducing the residual
  r := y − Ax,
it follows immediately from (4.8.3.2) that
  Δx ≐ −(A^T A)⁻¹ A^T ΔA x + (A^T A)⁻¹ ΔA^T r + (A^T A)⁻¹ A^T Δy.
Therefore, for the Euclidean norm ‖·‖ and the associated matrix norm lub,
(4.8.3.3)   ‖Δx‖ ⪅ lub((A^T A)⁻¹ A^T) lub(A) ( lub(ΔA)/lub(A) ) ‖x‖
                  + lub((A^T A)⁻¹) lub(A^T) lub(A) ( lub(ΔA^T)/lub(A^T) ) ( ‖r‖/‖Ax‖ ) ‖x‖
                  + lub((A^T A)⁻¹ A^T) lub(A) ( ‖y‖/‖Ax‖ ) ( ‖Δy‖/‖y‖ ) ‖x‖.
[Observe that the definition given in (4.4.8) for lub makes sense even for nonsquare matrices.] This approximate bound can be simplified. According to the results of Section 4.8.2, a unitary matrix P and an upper triangular matrix R can be found such that
  P A = [ R
          0 ],
and it follows that
(4.8.3.4)   A^T A = R^T R,   (A^T A)⁻¹ = R⁻¹ (R^T)⁻¹,   (A^T A)⁻¹ A^T = [R⁻¹, 0] P.
If it is observed that
  lub(C^T) = lub(C),   lub(PC) = lub(CP) = lub(C)
holds for the Euclidean norm, where P is unitary, then (4.8.3.3) and (4.8.3.4) imply
  ‖Δx‖/‖x‖ ⪅ cond(R) lub(ΔA)/lub(A) + cond(R)² ( ‖r‖/‖Ax‖ ) lub(ΔA)/lub(A) + cond(R) ( ‖y‖/‖Ax‖ ) ( ‖Δy‖/‖y‖ ).
If we define the angle φ by
  tan φ = ‖r‖/‖Ax‖,   0 ≤ φ < π/2,
then ‖y‖/‖Ax‖ = (1 + tan²φ)^{1/2} because of y = Ax + r, r ⊥ Ax. Therefore
(4.8.3.5)   ‖Δx‖/‖x‖ ⪅ cond(R) lub(ΔA)/lub(A) + cond(R)² tan φ · lub(ΔA)/lub(A) + cond(R) √(1 + tan²φ) · ‖Δy‖/‖y‖.
In summary, the condition of the least-squares problem depends on cond(R) and the angle φ: If φ is small, say if cond(R) tan φ ≤ 1, then the condition is measured by cond(R). With increasing φ ↑ π/2 the condition gets worse: it is then measured by cond(R)² tan φ.
If the minimum point x for the linear least-squares problem is found using the orthogonalization method described in Section 4.8.2, then in exact arithmetic a unitary matrix P (consisting of a product of Householder matrices), an upper triangular matrix R, a vector h^T = (h₁^T, h₂^T), and the solution x satisfying
(4.8.3.6)   P A = [ R
                    0 ],   h = P y,   R x = h₁
are produced. Using floating-point arithmetic with relative machine precision eps, an exactly unitary matrix will not be obtained, nor will R, h, x exactly satisfy (4.8.3.6). Wilkinson (1965) has shown, however, that there exist a unitary matrix P′, matrices ΔA and ΔR, and a vector Δy such that
  lub(P′ − P) ≤ f(m) eps,
  P′(A + ΔA) = [ R         lub(ΔA) ≤ f(m) eps lub(A),
                 0 ],
  P′(y + Δy) = h,          ‖Δy‖ ≤ f(m) eps ‖y‖,
  (R + ΔR)x = h₁,          lub(ΔR) ≤ f(m) eps lub(R).
In these relations f(m) is a slowly growing function of m, roughly f(m) ≈ O(m).
Up to terms of higher order
  lub(R) ≐ lub(A),
and therefore, since
  F := ΔA + P′^H [ ΔR
                   0  ],
it also follows that
  P′(A + F) = [ R + ΔR
                0      ],   lub(F) ≤ 2 f(m) eps lub(A).
In other words, the computed solution x can be interpreted as an exact minimum point of the following linear least-squares problem:
(4.8.3.7)   min_z ‖(y + Δy) − (A + F)z‖,
wherein the matrix A + F and the right-hand-side vector y + Δy differ only slightly from A and y respectively. The technique of orthogonalization is, therefore, numerically stable.
If, on the other hand, one computes the solution x by way of the normal equations
  A^T A x = A^T y,
then the situation is quite different. It is known from Wilkinson's (1965) estimates [see also Section 4.5] that a solution vector x̃ is obtained which satisfies
(4.8.3.8)   (A^T A + G) x̃ = A^T y,   lub(G) ≤ f(n) eps lub(A^T A),
when floating-point computation with the relative machine precision eps is used (even under the assumption that A^T y and A^T A are computed exactly). If x = (A^T A)⁻¹ A^T y is the exact solution, then (4.4.15) shows that
(4.8.3.9)   ‖x̃ − x‖/‖x‖ ⪅ cond(A^T A) lub(G)/lub(A^T A) = cond(R)² lub(G)/lub(A^T A) ≤ f(n) eps cond(R)²
to a first approximation. The roundoff errors, represented here as the matrix G, are amplified by the factor cond(R)².
This shows that the use of the normal equations is not numerically stable if the first term dominates in (4.8.3.5). Another situation holds if the second term dominates. If tan φ = ‖r‖/‖Ax‖ ≈ 1, for example, then the use of the normal equations will be numerically stable and will yield results which are comparable to those obtained through the use of orthogonal transformations.
EXAMPLE 1 (Läuchli). For the 6 × 5 matrix
  A = [ 1  1  1  1  1
        ε  0  0  0  0
        0  ε  0  0  0
        0  0  ε  0  0
        0  0  0  ε  0
        0  0  0  0  ε ]
it follows that
  A^T A = [ 1+ε²  1     1     1     1
            1     1+ε²  1     1     1
            1     1     1+ε²  1     1
            1     1     1     1+ε²  1
            1     1     1     1     1+ε² ].
If ε = 0.5·10⁻⁵, then 10-place decimal arithmetic yields
  fl(A^T A) = [ 1 1 1 1 1
                1 1 1 1 1
                1 1 1 1 1
                1 1 1 1 1
                1 1 1 1 1 ],
since ε² = 0.25·10⁻¹⁰. This matrix has rank 1 and has no inverse. The normal equations cannot be solved, whereas the orthogonalization technique can be applied without difficulty. [Note for A^T A that cond(A^T A) = cond(R)² = (5 + ε²)/ε².]
EXAMPLE 2. The following example is intended to offer a computational comparison between the two methods which have been discussed for the linear least-squares problem.
(4.8.3.7) shows that, for the solution of a least-squares problem obtained using orthogonal transformations, the approximate bound given by (4.8.3.5) holds with ΔA = F and
  lub(ΔA) ≤ 2 f(m) eps lub(A).
(4.8.3.9) shows that the solution obtained directly from the normal equations satisfies a somewhat different bound:
  (A^T A + G)(x + Δx) = A^T y + Δb,
where ‖Δb‖ ≤ eps lub(A) ‖y‖, that is,
(4.8.3.10)   ‖Δx‖_normal / ‖x‖ ⪅ cond(R)² ( lub(G)/lub(A^T A) + ‖Δb‖/‖A^T y‖ ),
which holds because of (4.4.12) and (4.4.15).
Let the fundamental relationship
(4.8.3.11)   y(s) := x₁ (1/s) + x₂ (1/s²) + x₃ (1/s³),   with x₁ = x₂ = x₃ = 1,
be given. A series of observed values y_i = y(s_i) [computed from (4.8.3.11)] were produced on a machine having eps = 10⁻¹¹.
(a) Determine x₁, x₂, x₃ from the data {s_i, y_i}. If exact function values y_i = y(s_i) are used, then the residual will satisfy
  r(x) = 0,   tan φ = 0.
The following table presents some computational results for this example. The sizes of the errors ‖Δx‖_orth from the orthogonalization method and ‖Δx‖_normal from the normal equations are shown together with a lower bound for the condition number of R. The example was solved using s_i = s₀ + i, i = 1, ..., 10, for a number of values of s₀:

  s₀     cond(R)     ‖Δx‖_orth    ‖Δx‖_normal
  10     6.6·10³     8.0·10⁻¹⁰    8.8·10⁻⁷
  50     1.3·10⁶     6.4·10⁻⁷     8.2·10⁻⁵
  100    1.7·10⁷     3.3·10⁻⁶     4.2·10⁻²
  150    8.0·10⁷     1.8·10⁻⁵     6.9·10⁻¹
  200    2.5·10⁸     1.8·10⁻³     2.7·10⁰
(b) We introduce a perturbation into the y_i, replacing y by y + λv, λ ∈ ℝ, where v is chosen so that A^T v = 0. This means that, in theory, the solution should remain unchanged. Now the residual satisfies
  r(x) = y + λv − Ax = λv,   tan φ = λ‖v‖/‖y‖.
The following table presents the errors Δx, as before, and the norm of the residual ‖r(x)‖ together with the corresponding values of λ:
  s₀ = 10,   v = (0.1331, −0.5184, 0.6591, −0.2744, 0, 0, 0, 0, 0, 0)^T,
  lub(A) ≈ 0.22,   eps = 10⁻¹¹.

  λ        ‖r(x)‖    ‖Δx‖_orth    ‖Δx‖_normal
  0        0         8·10⁻¹⁰      8.8·10⁻⁷
  10⁻⁶     9·10⁻⁷    9.5·10⁻⁹     8.8·10⁻⁷
  10⁻⁴     9·10⁻⁵    6.2·10⁻¹⁰    4.6·10⁻⁷
  10⁻²     9·10⁻³    9.1·10⁻⁹     1.3·10⁻⁶
  10⁰      9·10⁻¹    6.1·10⁻⁷     8.8·10⁻⁷
  10⁺²     9·10¹     5.7·10⁻⁵     9.1·10⁻⁶
4.8.4 Nonlinear Least-Squares Problems
Nonlinear data-fitting problems can generally be solved only by iteration, as for instance the problem given in (4.8.0.2) of determining a point x* = (x₁*, ..., xₙ*)^T which, for given functions f_k(x) ≡ f_k(x₁, ..., xₙ) and numbers y_k, k = 1, ..., m,
  y = (y₁, ..., y_m)^T,
minimizes
  ‖y − f(x)‖² = Σ_{k=1}^m ( y_k − f_k(x₁, ..., xₙ) )².
For example, linearization techniques [see Section 5.1] can be used to reduce this problem to a sequence of linear least-squares problems. If each f_k is continuously differentiable, and if
  Df(ξ) = [ ∂f₁/∂x₁   ⋯  ∂f₁/∂xₙ
              ⋮             ⋮
            ∂f_m/∂x₁  ⋯  ∂f_m/∂xₙ ]   evaluated at x = ξ
represents the Jacobian matrix at the point x = ξ, then
  f(x) = f(x̄) + Df(x̄)(x − x̄) + h,   ‖h‖ = o(‖x − x̄‖),
holds. If x̄ is close to the minimum point of the nonlinear least-squares problem, then the solution x̂ of the linear least-squares problem
(4.8.4.1)   min_{z ∈ ℝⁿ} ‖y − f(x̄) − Df(x̄)(z − x̄)‖² = ‖r(x̄) − Df(x̄)(x̂ − x̄)‖²,   r(x̄) := y − f(x̄),
will often be still closer than x̄; that is,
  ‖y − f(x̂)‖² < ‖y − f(x̄)‖².
This relation is not always true in the form given. However, it can be shown that the direction
  s = s(x̄) := x̂ − x̄
satisfies the following:
There is a λ > 0 such that the function
  φ(τ) := ‖y − f(x̄ + τ s)‖²
is strictly monotone decreasing for all 0 ≤ τ ≤ λ. In particular
  φ(λ) = ‖y − f(x̄ + λs)‖² < φ(0) = ‖y − f(x̄)‖².
PROOF. φ is a continuously differentiable function of τ and satisfies
  φ′(0) = d/dτ [ (y − f(x̄ + τs))^T (y − f(x̄ + τs)) ]_{τ=0}
        = −2 (Df(x̄)s)^T (y − f(x̄)) = −2 (Df(x̄)s)^T r(x̄).
Now, by the definition of x̂, s = (x̂ − x̄) is a solution to the normal equations
(4.8.4.2)   Df(x̄)^T Df(x̄) s = Df(x̄)^T r(x̄)
of the linear least-squares problem (4.8.4.1). It follows immediately from (4.8.4.2) that
  ‖Df(x̄)s‖² = s^T Df(x̄)^T Df(x̄) s = (Df(x̄)s)^T r(x̄),
and therefore
  φ′(0) = −2 ‖Df(x̄)s‖² < 0,
so long as rank Df(x̄) = n and s ≠ 0. The existence of a λ > 0 satisfying φ′(τ) < 0 for 0 ≤ τ ≤ λ is a result of continuity, from which observation the assertion follows.   □
This result suggests the following algorithm, called the Gauss-Newton algorithm, for the iterative solution of nonlinear least-squares problems. Beginning with an initial point x^(0), determine successive approximations x^(i), i = 1, 2, ..., as follows (a Python sketch is given after this list):
(1) For x^(i) compute a minimum point s^(i) of the linear least-squares problem
      min_s ‖r(x^(i)) − Df(x^(i)) s‖².
(2) Let φ(τ) := ‖y − f(x^(i) + τ s^(i))‖², and further, let k be the smallest integer k ≥ 0 with
      φ(2^{−k}) < φ(0).
(3) Define x^(i+1) := x^(i) + 2^{−k} s^(i).
In Section 5.4 we will study the convergence of algorithms which are closely related to the process described here.
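The following short Python sketch of the Gauss-Newton iteration with step halving is not part of the original text; NumPy, the model function and the data are assumptions chosen only for illustration:

    import numpy as np

    def gauss_newton(f, Df, y, x0, iterations=20):
        # In each step solve min_s ||r(x) - Df(x) s|| and halve the step length
        # until the residual sum of squares decreases, as in (1)-(3) above.
        x = np.asarray(x0, dtype=float)
        for _ in range(iterations):
            r = y - f(x)
            s = np.linalg.lstsq(Df(x), r, rcond=None)[0]
            phi0 = r @ r
            t = 1.0
            while np.sum((y - f(x + t * s)) ** 2) >= phi0 and t > 1e-16:
                t *= 0.5
            x = x + t * s
        return x

    # Hypothetical model y = exp(c1*z) + c2, data generated from c = (0.3, 1.0).
    z = np.linspace(0.0, 2.0, 11)
    y = np.exp(0.3 * z) + 1.0
    f  = lambda c: np.exp(c[0] * z) + c[1]
    Df = lambda c: np.column_stack([z * np.exp(c[0] * z), np.ones_like(z)])
    print(gauss_newton(f, Df, y, x0=[0.0, 0.0]))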
4.8.5 The Pseudoinverse of a Matrix
For any arbitrary (complex) m × n matrix A there is an n × m matrix A⁺, the so-called pseudoinverse (or Moore-Penrose inverse). It is associated with A in a natural fashion and agrees with the inverse A⁻¹ of A in case m = n and A is nonsingular.
Consider the range space R(A) and the null space N(A) of A,
  R(A) := { Ax ∈ ℂᵐ | x ∈ ℂⁿ },   N(A) := { x ∈ ℂⁿ | Ax = 0 },
together with their orthogonal complement spaces R(A)^⊥ ⊆ ℂᵐ, N(A)^⊥ ⊆ ℂⁿ. Further, let P be the n × n matrix which projects ℂⁿ onto N(A)^⊥, and let P̄ be the m × m matrix which projects ℂᵐ onto R(A):
  Px = 0 ⟺ x ∈ N(A),   P = P^H = P²,
  P̄y = y ⟺ y ∈ R(A),   P̄ = P̄^H = P̄².
For each y ∈ R(A) there is a uniquely determined x₁ ∈ N(A)^⊥ satisfying Ax₁ = y; i.e., there is a well-defined mapping f: R(A) → ℂⁿ with
  A f(y) = y,   f(y) ∈ N(A)^⊥   for all y ∈ R(A).
For, given y ∈ R(A), there is an x which satisfies y = Ax; hence y = A(Px + (I − P)x) = APx = Ax₁, where x₁ := Px ∈ N(A)^⊥, since (I − P)x ∈ N(A). Further, if x₁, x₂ ∈ N(A)^⊥, Ax₁ = Ax₂ = y, it follows that
  x₁ − x₂ ∈ N(A) ∩ N(A)^⊥ = {0},
which implies that x₁ = x₂. f is obviously linear.
The composite mapping f∘P̄: y ∈ ℂᵐ → f(P̄(y)) ∈ ℂⁿ is well defined and linear, since P̄y ∈ R(A); hence it is represented by an n × m matrix, which is precisely A⁺, the pseudoinverse of A: A⁺y = f(P̄(y)) for all y ∈ ℂᵐ. A⁺ has the following properties:
(4.8.5.1) Theorem. Let A be an m × n matrix. The pseudoinverse A⁺ is an n × m matrix satisfying:
(1) A⁺A = P is the orthogonal projector P: ℂⁿ → N(A)^⊥, and AA⁺ = P̄ is the orthogonal projector P̄: ℂᵐ → R(A).
(2) The following formulas hold:
   (a) A⁺A = (A⁺A)^H,  (b) AA⁺ = (AA⁺)^H,  (c) AA⁺A = A,  (d) A⁺AA⁺ = A⁺.
PROOF. According to the definition of A⁺,
  A⁺Ax = f(P̄(Ax)) = f(Ax) = Px   for all x,
so that A⁺A = P. Since P^H = P, (4.8.5.1.2a) is satisfied. Further, from the definition of f,
  AA⁺y = A(f(P̄y)) = P̄y
for all y ∈ ℂᵐ; hence AA⁺ = P̄ = P̄^H, and (4.8.5.1.2b) follows too. Finally, for all x ∈ ℂⁿ,
  (AA⁺)Ax = P̄Ax = Ax
according to the definition of P̄, and for all y ∈ ℂᵐ,
  (A⁺A)A⁺y = PA⁺y = A⁺y;
hence, (4.8.5.1.2c, 2d) hold.   □
The properties (2a)-(2d) of (4.8.5.1) uniquely characterize A⁺:
(4.8.5.2) Theorem. If Z is a matrix satisfying
  (a′) ZA = (ZA)^H,  (b′) AZ = (AZ)^H,  (c′) AZA = A,  (d′) ZAZ = Z,
then Z = A⁺.
PROOF. From (a)-(d) and (a′)-(d′) we have the following chain of equalities:
  Z = ZAZ = Z(AA⁺A)A⁺(AA⁺A)Z                        [from d′), c)]
    = (A^H Z^H A^H A⁺^H) A⁺ (A⁺^H A^H Z^H A^H)      [from a), a′), b), b′)]
    = (A^H A⁺^H) A⁺ (A⁺^H A^H)                      [from c′)]
    = (A⁺A) A⁺ (AA⁺)                                [from a), b)]
    = A⁺                                            [from d)].   □
We note the following
(4.8.5.3) Corollary. For all matrices A,
  A⁺⁺ = A,   (A^H)⁺ = (A⁺)^H.
This holds because Z := A [respectively Z := (A⁺)^H] has the properties of (A⁺)⁺ [respectively (A^H)⁺] in (4.8.5.2).
An elegant representation of the solution to the least-squares problem
  min_x ‖Ax − y‖₂
can be given with the aid of the pseudoinverse A⁺:
(4.8.5.4) Theorem. The vector x̄ := A⁺y satisfies:
(a) ‖Ax − y‖₂ ≥ ‖Ax̄ − y‖₂ for all x ∈ ℂⁿ.
(b) ‖Ax − y‖₂ = ‖Ax̄ − y‖₂ and x ≠ x̄ imply ‖x‖₂ > ‖x̄‖₂.
In other words, x̄ = A⁺y is that minimum point of the least-squares problem which has the smallest Euclidean norm, in the event that the problem does not have a unique minimum point.
PROOF. From (4.8.5.1), AA⁺ is the orthogonal projector on R(A); hence, for all x ∈ ℂⁿ it follows that
  Ax − y = u − v,   u := A(x − A⁺y) ∈ R(A),   v := (I − AA⁺)y = y − Ax̄ ∈ R(A)^⊥.
Consequently, for all x ∈ ℂⁿ,
  ‖Ax − y‖₂² = ‖u‖₂² + ‖v‖₂² ≥ ‖v‖₂² = ‖Ax̄ − y‖₂²,
and ‖Ax − y‖₂ = ‖Ax̄ − y‖₂ holds precisely if
  Ax = AA⁺y.
Now, A⁺A is the projector on N(A)^⊥. Therefore, for all x such that Ax = AA⁺y,
  x = u₁ + v₁,   u₁ := A⁺Ax = A⁺AA⁺y = A⁺y = x̄ ∈ N(A)^⊥,   v₁ := x − u₁ = x − x̄ ∈ N(A),
from which it follows that ‖x‖₂² > ‖x̄‖₂² for all x ∈ ℂⁿ satisfying x − x̄ ≠ 0 and ‖Ax − y‖₂ = ‖Ax̄ − y‖₂.   □
If the m × n matrix A with m ≥ n has maximal rank, rank A = n, then there is an explicit formula for A⁺: It is easily verified that the matrix Z := (A^H A)⁻¹ A^H has all properties given in Theorem (4.8.5.2) characterizing the pseudoinverse A⁺, so that
  A⁺ = (A^H A)⁻¹ A^H.
By means of the QR decomposition (4.7.7) of A, A = QR, this formula for A⁺ is equivalent to
  A⁺ = R⁻¹ Q^H.
This allows a numerically more stable computation of the pseudoinverse, A⁺ = R⁻¹Q^H.
If m < n and rank A = m, then because of (A⁺)^H = (A^H)⁺ the pseudoinverse A⁺ is given by
  A⁺ = Q (R^H)⁻¹
if the matrix A^H has the QR decomposition A^H = QR (4.7.7).
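A minimal Python sketch of these two full-rank formulas follows (not part of the original text; NumPy and the example matrix are assumptions). It is checked against NumPy's SVD-based pseudoinverse:

    import numpy as np

    def pseudoinverse_full_rank(A):
        # A^+ for a real m x n matrix of full rank via QR: for m >= n,
        # A = QR and A^+ = R^{-1} Q^T; for m < n apply this to A^T and transpose.
        m, n = A.shape
        if m >= n:
            Q, R = np.linalg.qr(A)
            return np.linalg.solve(R, Q.T)
        Q, R = np.linalg.qr(A.T)
        return np.linalg.solve(R, Q.T).T

    A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    print(np.allclose(pseudoinverse_full_rank(A), np.linalg.pinv(A)))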
For general m × n matrices A of arbitrary rank the pseudoinverse A⁺ can be computed by means of the singular value decomposition of A [see (6.4.11), (6.4.13)].
4.9 Modification Techniques for Matrix
Decompositions
Given any n × n matrix A, Gaussian elimination [see Section 4.1] produces an n × n upper triangular matrix R and a nonsingular n × n matrix F = G_{n−1}P_{n−1} ⋯ G₁P₁, a product of Frobenius matrices G_j and permutation matrices P_j, which satisfy
  F A = R.
Alternatively, the orthogonalization algorithms of Section 4.7 produce n × n unitary matrices P, Q and an upper triangular matrix R (different from the one above) for which
  P A = R   or   A = Q R
[compare with (4.7.7)]. These algorithms can also be applied to rectangular m × n matrices A, m ≥ n. Then they produce nonsingular m × m matrices F (alternatively m × m unitary matrices P, or m × n matrices Q with orthonormal columns) and n × n upper triangular matrices R satisfying
(4.9.1)   F A = [ R    } n
                  0 ]  } m − n,
or
(4.9.2a)   P A = [ R
                   0 ],
(4.9.2b)   A = Q R.
For m = n, these decompositions were seen to be useful in that they reduced the problem of solving the equation systems
(4.9.3)   Ax = y   or   A^T x = y
to a process of solving triangular systems and carrying out matrix multiplications. When m ≥ n, the orthogonal decompositions mentioned in (4.9.2) permit the solutions x̄ of linear least-squares problems
(4.9.4)   min_x ‖Ax − y‖₂
to be obtained in a similar manner.
Further, these techniques offer an efficient way of obtaining the solutions to linear-equation problems (4.9.3) or to least-squares problems (4.9.4) in which a single matrix A is given with a number of right-hand sides y.
Frequently, a problem involving a matrix A is given and solved, following which a "simple" change A → Ā is made. Clearly it would be desirable to determine a decomposition (4.9.1) or (4.9.2) of Ā, starting with the corresponding decomposition of A, in some less expensive way than by using the algorithms of Sections 4.1 or 4.7. For certain simple changes this is possible.
We will consider:
(1) the change of a row or column of A,
(2) the deletion of a column of A,
(3) the addition of a column to A,
(4) the addition of a row to A,
(5) the deletion of a row of A,
where A is an m × n matrix, m ≥ n. [See Gill, Golub, Murray, and Saunders (1974) and Daniel, Gragg, Kaufman, and Stewart (1976) for a more detailed description of modification techniques for matrix decompositions.]
Our principal tool for devising modification techniques will be certain simple elimination matrices E_ij. These are nonsingular matrices of order m which differ from the identity matrix only in columns i and j and have the following form:
  E_ij = [ 1
             ⋱
               a  ⋯  b        ← row i
                  1
                    ⋱
               c  ⋯  d        ← row j
                        ⋱
                           1 ].
A matrix multiplication y = E_ij x changes only the components x_i and x_j of the vector x = (x₁, ..., x_m)^T ∈ ℝᵐ:
  y_i = a x_i + b x_j,   y_j = c x_i + d x_j,   y_k = x_k  for k ≠ i, j.
Given a vector x and indices i ≠ j, the 2 × 2 matrix
  Ê := [ a  b
         c  d ],
and thereby E_ij, can be chosen in several ways so that the jth component y_j of the result y = E_ij x vanishes and so that E_ij is nonsingular. For numerical reasons Ê should also be chosen so that the condition number cond(E_ij) is not too large. The simplest possibility, which finds application in decompositions of the type (4.9.1), is to construct Ê as a Gaussian elimination of a certain type [see (4.1.8)]:
(4.9.5)
  Ê = [ 1  0           if x_j = 0,
        0  1 ]
  Ê = [ 1        0     if |x_i| ≥ |x_j| > 0,
        −x_j/x_i 1 ]
  Ê = [ 0  1           if |x_i| < |x_j|.
        1  −x_i/x_j ]
In the case of orthogonal decompositions (4.9.2) we choose Ê, and thereby E_ij, to be the unitary matrices known as Givens matrices. One possible form such a matrix may take is given by the following:
(4.9.6)   Ê = [ c   s
                s  −c ],   c := cos φ,   s := sin φ.
Ê and E_ij are Hermitian and unitary, and satisfy det(Ê) = −1. Since Ê is unitary, it follows immediately from the condition
  Ê [ x_i      [ k
      x_j ]  =   0 ]
that y_i = k = ±√(x_i² + x_j²), which can be satisfied by
  c := 1,   s := 0   if x_i = x_j = 0,
or alternatively, if μ := max{|x_i|, |x_j|} > 0, by
  c := x_i/k,   s := x_j/k.
|k| is to be computed as
  |k| := μ √( (x_i/μ)² + (x_j/μ)² ),
and the sign of k is, for the moment, arbitrary. The computation of |k| in this form avoids problems which would arise from exponent overflow or underflow given extreme values for the components x_i and x_j. On numerical grounds the sign of k will be chosen according to
  k = |k| sign(x_i),   sign(x_i) := 1 if x_i ≥ 0, −1 if x_i < 0,
so that the computation of the auxiliary value
  ν := s/(1 + c)
can take place without cancellation. Using ν, the significant components z_i, z_j of the transformation z := E_ij u of the vector u ∈ ℝᵐ can be computed somewhat more efficiently (in which a multiplication is replaced by an addition) as follows:
  z_i := c u_i + s u_j,
  z_j := ν (u_i + z_i) − u_j.
The type of matrix Ê shown in (4.9.6), together with its associated E_ij, is known as a Givens reflection. We can just as easily use a matrix Ê of the form
  Ê = [  c  s
        −s  c ],   c = cos φ,   s = sin φ,
which, together with its associated E_ij, is known as a Givens rotation. In this case Ê and E_ij are orthogonal matrices which describe a rotation of ℝᵐ about the origin through the angle φ in the (i, j) plane; det(Ê) = 1. We will present the following material, however, only in terms of Givens reflections.
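A minimal Python sketch of the construction of a Givens reflection as described above follows (not part of the original text; NumPy and the sample components are assumptions):

    import numpy as np

    def givens_reflection(xi, xj):
        # c, s, k of the reflection (4.9.6) mapping (xi, xj) to (k, 0); |k| is
        # formed via mu = max(|xi|, |xj|) to avoid overflow/underflow, and the
        # sign of k follows sign(xi) so that nu = s/(1+c) has no cancellation.
        mu = max(abs(xi), abs(xj))
        if mu == 0.0:
            return 1.0, 0.0, 0.0
        k = mu * np.sqrt((xi / mu) ** 2 + (xj / mu) ** 2)
        if xi < 0.0:
            k = -k
        return xi / k, xj / k, k

    c, s, k = givens_reflection(3.0, 4.0)
    E = np.array([[c, s], [s, -c]])
    print(E @ np.array([3.0, 4.0]), k)      # approximately (k, 0) with k = 5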
Since modification techniques for the decomposition (4.9.1) differ from those for (4.9.2a) only in that, in place of Givens matrices (4.9.6), corresponding elimination matrices Ê of type (4.9.5) are used, we will only study the orthogonal factorization (4.9.2). The techniques for (4.9.2b) are similar to, but more complicated than, those for (4.9.2a). Consequently, for simplicity's sake, we will restrict our attention to (4.9.2a). The corresponding discussion for (4.9.2b) can be found in Daniel, Gragg, Kaufman, and Stewart (1976).
In the following let A be a real m × n matrix with m ≥ n, and let
  P A = [ R
          0 ]
be a factorization of type (4.9.2a).
(1) Change of a row or column, or more generally, the change of A to Ā := A + v u^T (rank-one modification of A), where v ∈ ℝᵐ, u ∈ ℝⁿ are given vectors. From (4.9.2a) we have
(4.9.7)   P Ā = [ R
                  0 ] + w u^T,   w := P v.
In the first step of the modification we annihilate the successive components m, m−1, ..., 2 of the vector w using appropriate Givens matrices G_{m−1,m}, G_{m−2,m−1}, ..., G_{12}, so that
  G_{12} G_{23} ⋯ G_{m−1,m} w = k e₁,   k = ±‖w‖ = ±‖v‖.
The sketch below in the case m = 4 should clarify the effect of carrying out the successive transformations G_{i,i+1}. We denote by * here, and in the further sketches of this section, those elements which are changed by a transformation, other than the one set to zero:
  w = [ x      [ x      [ x      [ k
        x   →    x   →    *   →    0
        x        *        0        0
        x ]      0 ]      0 ]      0 ].
If (4.9.7) is multiplied on the left as well by G_{m−1,m}, ..., G_{12} in order, we obtain
(4.9.8)   P̃ Ā = R′ + k e₁ u^T,
where
  R′ := G [ R
            0 ],   P̃ := G P,   G := G_{12} G_{23} ⋯ G_{m−1,m}.
Here P̃ is unitary, as are G and P; the upper triangular matrix [R; 0] is changed step by step into an upper Hessenberg matrix R′, that is, a matrix with (R′)_ik = 0 for i > k + 1:
  m = 4, n = 3:
  [ x x x      [ x x x      [ x x x      [ * * *
    0 x x   →    0 x x   →    * * *   →    * x x
    0 0 x        * * *        0 * x        0 x x
    0 0 0 ]      0 0 * ]      0 0 x ]      0 0 x ]   = G [ R; 0 ] = R′.
R̃ := R′ + k e₁ u^T in (4.9.8) is also an upper Hessenberg matrix, since the addition changes only the first row of R′. In the second step of the
252
4 Systems of Linear Equations
modification we annihilate the subdiagonal elements (R)i+Li, i = 1, 2, ... ,
R by means of further Givens matrices H 12 , H 23 ,
... , H/",/"+1, so that [see (4.9.8)]
/1 := min(m -1, n -1) of
--
- [R]
0 '
HPA=HR=:
where R is, again, an n x n upper triangular matrix and P := H P is an
m x m unitary matrix. A decomposition of A of the type (4.9.2a) is provided
once again:
- - [R]
0 .
PA=
The sequence of transformations R --+ H12R --+ H 23 (H 12 R) --+ ... --+ HR
will be sketched for m = 4, n = 3:
R
~ [~ ~ ~l ~ [~ ~ ~l ~ [~ ~ ~l ~ [~ ~ ~l
=
[~] .
(2) Deletion of a column. If A is obained from A by the deletion of the
kth column, then from (4.9.2a) the matrix R := pA is an upper Hessenberg
matrix of the following form (sketched for m = 4, n = 4, k = 2):
R:=pA=
[
X
X~ x~xl
The sub diagonal elements of R are to be annihilated as described before,
using Givens matrices
The decomposition of A is
P:=HP,
H := Hn-1,nHn-2.n-l··· Hk,k+l.
(3) Addition of a column. If A = [A, a), a E IRTn , and A is an m x n
matrix with m > n, then (4.9.2a) implies
4.9 Modification Techniques for Matrix Decompositions
x
x
x
x
x
x
o
253
x
The subdiagonal elements of R, which appear as an "spike" in the last column, can be annihilated by means of a single Householder transformation
H from (4.7.5):
x
x x
x x
HR=
*
o
o
[~] .
o
P:= HP and R are components of the decomposition (4.9.2a) of A:
- - [R]
0 .
PA=
(4) Addition of a row. If
then there is a permutation matrix II of order m + 1 which satisfies
IIA = [~] .
The unitary matrix of order m + 1,
p.-
[~ ~]II
satisfies, according to (4.9.2a),
PA=
[~ ~] [~] [;:]
[~l
=:R=
:r
:z:
x
x
0
x
o
254
4 Systems of Linear Equations
R is an upper Hessenberg matrix whose sub diagonal elements can be annihilated using Givens matrices H 12 , ... , Hn,n+l:
[~] = HR,
as described before, to produce the decompositon
- - [R]
0 .
PA=
(5) Deletion of a row: Let A be an m x n matrix with m > n. We
assume without loss of generality (see the use of the permutation above)
that the last row aT of A is to be dropped:
We partition the matrix
P = [P,p],
accordingly and obtain, from (4.9.2a),
(4.9.9)
We choose Givens matrices Hm,m-l, H m ,m-2, ... , Hml to annihilate successive components m - 1, m - 2, ... , 1 of p: Hml Hm2 ... Hm,m-lP
(0, ... ,0,11')T. A sketch for m = 4 is
(4.9.10)
Now, P is unitary, so Ilpll = 111'1 = 1. Therefore, the transformed matrix
HP, H := Hml H m2 ··· Hm,m-l, has the form
(4.9.11)
HP=[~ ~]=[~ ~],
11' = ±1,
since 111'1 = 1 implies q = 0 because ofthe unitarity of H P. Consequently P
is a unitary matrix of order m - 1. On the other hand, the Hmi transform
the upper triangular matrix [~], with the exception of its last row, into
an upper triangular matrix
4.9 Modification Techniques for Matrix Decompositions
255
Sketching this for m = 4, n = 3, we have
[~l ~ [~ j ~] ~ [~ i ~] ~ [! ~ ~] ~ [~ ~ ;]
=
[zi}]·
From (4.9.9), (4.9.11) it follows that
and therefore
PA= [ RO-]
for the (m -I)-order unitary matrix P and the upper triangular matrix R
of order n. This has produced a decomposition for A of the form (4.9.2a).
It can be shown that the techniques of this section are numerically
stable in the following sense. Let P and R be given matrices with the
following property: There exists an exact unitary matrix P' and a matrix
A' such that
P'A' = [~]
is a decomposition of A' in the sense of (4.9.2a) and the differences IIP-P'II,
IIA - A'II are "small". Then the methods of this section, applied to P, R
in floating-point arithmetic with relative machine precision eps, produce
matrices P, R to which are associated an exact unitary matrix P' and a
matrix A' satisfying:
(a)
liP - P'II, IIA - A'II are "small", and
(b) P' A' = [~] is a decomposition of A' in the sense of (4.9.2a).
By "small" we mean that the differences above satisfy
IILlPIl = O(mCi eps)
or
where 0:' is small (for example 0:' = ~).
IILlAIl = O(mCi eps IIAII),
256
4 Systems of Linear Equations
4.10 The Simplex Method
Linear algebraic techniques can be applied advantageously, in the context of
the simplex method, to solve linear programming problems. These problems
arise frequently in practice, particularly in the areas of economic planning
and management science. Linear programming is also the means by which a
number of important discrete approximation problems (e.g. data fitting in
the Ll and Loo norms) can be solved. At this introductory level of treatment
we can cover only the most basic aspects of linear programming; for a more
thorough treatment, the reader is referred to the special literature on this
subject [e.g. Dantzig (1963), Gass (1969), Hadley (1962), Murty (1976), or
Schrijver (1986)].
A general linear programming problem (or linear program) has the
following form:
(4.10.1)
with respect to all x E lRn which satisfy finitely many constraints of the
form
(4.10.2)
+ ai2 x 2 + ... + ainxn ::; bi ,
ailXI + ai2X2 + ... + ainXn = bi ,
ai1 x l
i=1,2, ... ,ml,
i = ml + 1, ml + 2, ... ,m.
The numbers Ck, aik, bi are given real numbers. The function c T x to be
minimized is called the objective function. Each x E lRn which satisfies all
of the conditions (4.10.2) is said to be a feasible point for the problem.
By introducing additional variables and equations, the linear programming
problem (4.10.1), (4.10.2) can be put in a form in which the only constraints
which appear are equalities or elementary inequalities (inequalities, for example, of the form Xi 2:: 0). It is useful, for various reasons, to require that
the objective function cT x have the form cT x == -xp. In order to bring a
linear programming problem to this form, we replace each nonelementary
inequality (4.10.2)
by an equality and an elementary inequality using a slack variable Xn+i,
If the objective function CIXI + ... + CnXn is not elementary, we introduce
an additional variable x n +m1 +1 and include an additional equation
among the constraints (4.10.2). The minimization of cT x is equivalent to
the maximization of x n +m1 +1 under this extended saystem of constraints.
4.10 The Simplex Method
257
Hence we can assume, without loss of generality, that the linear progamming problem is already given in the following standard form:
LP(I,p) :
maximize
(4.10.3)
x
xp
Ax = b,
E lR n :
Xi
2: 0 for i E I.
In this formulation, I ~ N := {I, 2, ... , n} is a (possibly empty) index set,
p is a fixed index satisfying p E N\I, A = (aI, a2, ... , an) is a real m x n
matrix having columns ai, and bE lRm is a given vector. The variables Xi
for which i E I are restricted variables, while those for which i tJ. I are the
free variables. By
P := {x E lR n I Ax = b & Xi 2: 0 for all i Eo I}
we denote the set of all feasible points of LP(I,p), x* E P is an optimum
point of LP(I,p), if x; = max{x p I x E P}.
As an illustration:
minimize
-Xl -2X2
X:
-Xl +X2
S2
S4
Xl 2: 0, X2 2: o.
Xl +X2
After the introduction of X3, X4 as slack variables and Xs as an objective function
variable, the following standard form is obtained, p = 5, I = {I, 2, 3, 4}:
maximize
x:
Xs
= 2
-Xl +X2+X3
XI+X2
-Xl -X2
Xi
=4
+X4
+Xs= 0
2: 0 for i S 4.
This can be shown graphically in lR? The set P (shaded in Figure 6) is a polygon.
(In higher dimensions P would be a polyhedron.)
We begin by considering the linear equation system Ax = b of LP(I,p).
For any index vector J = (jl, ... ,jr), ji EN, we let AJ:= (ajll ... ,ajJ
denote the submatrix of A having columns aji; XJ denotes the vector
(xJI , ... , X jr) T. For simplicity we will denote the set
{ji I i = 1,2, ... ,r}
of the components of J by J also, and we will write p E J if there exists
an i withp =ii-
258
4 Systems of Linear Equations
Xl
=0
c
"-
,
~-------.',
X3
=0
X2
E
,
~p~""'" I
B
=0
A
, ,= 0
,
",X5
1
X4
1
= 0
D
Fig. 6. Feasible region and objective function.
(4.10.4) Definition. An index vector J = (jl, ... ,jm) of m distinct indices ji E N is called a basis of Ax = b [and of LP (I,p)] if AJ is nonsingular. AJ is also referred to as a basis; the variables Xi for i E J are
referred to as basic variables, while the remaining variables Xk, k (j. J, are
the nonbasic variables. If K = (kl' ... ,kn - m ) is an index vector containing
the nonbasic indices, we will use the notation J EB K = N.
In the above example JA := (3,4,5), JB := (4,5,2) are bases.
Associated with any basis J, J EB K
= N, there is a unique solution
x = x(J) of Ax = b, called the basic solution, with XK = O. Since
X is given by
(4.10.5)
Moreover, given a basis J, each solution x of Ax = b is uniquely determined
by its nonbasic segment XK and the basic solution x. This follows from the
multiplication Ax = AJxJ + AKxK = b by A J1 and from (4.10.5):
4.10 The Simplex Method
259
(4.10.6)
If XK E IR n - m is chosen arbitrarily, and if XJ (and thereby x) is defined
by (4.10.6), then x is a solution of Ax = b. Therefore, (4.10.6) provides a
parametrization of the solution set {x I Ax = b} through the components
of XK E IR n - m .
If the basic solution x of Ax = b associated with the basis J is a feasible
point of LP(I,p), x E P, that is, if
(4.10.7)
Xi ?: 0 for all i E In J,
then J is a feasible basis of LP (I, p) and X is a basic feasible solution.
Finally, a feasible basis is said to be nondegenerate if
(4.10.8)
Xi > 0
for all i E In J.
The linear programming problem as a whole is nondegenerate if all of its
feasible bases are nondegenerate.
Geometrically, the basic feasible solutions given by the various bases of
LP(I,p) correspond to the vertices of the polyhedron P of feasible points, assuming that P does have vertices. In the above example (see Figure 6) the ver-
tex A E P belongs to the feasible basis JA = (3,4,5), since A is determined
by Xl = X2 = 0 and {1,2} is in the complementary set corresponding to JA
in N = {1,2,3,4,5}; B is associated with J B = (4,5,2), C is associated with
Jc = (1,2,5), etc. The basis JE = (1,4,5) is not feasible, since the associated
basic solution E is not a feasible point (E f/. P).
The simplex method for solving linear programming problems, due to
G. B. Dantzig, is a process which begins with a feasible basis Jo of LP(I,p)
satisfying p E J o and recursively generates a sequence {Ji } of feasible bases
Ji of LP(I,p) with p E Ji for each i by means of simplex steps
These steps are designed to ensure that the objective function values x(Ji)p
corresponding to the bases J i are nondecreasing:
In fact, if all of the bases encountered are nondegenerate, this sequence
is strictly increasing, and if LP(I,p) does have an optimum, then the
sequence {J;} terminates after finitely many steps in a basis h.I whose
basic solution x(h.d is an optimum point for LP(I,p) and is such that
x(Ji)p < X(Ji+l)p for all 0 ~ i ~ ]vI - 1. Furthermore, each two successive
bases J = (jl, ... ,jm) and j = (]1, ... ,1m) are neighboring, that is, J and
j have exactly m - 1 components in common. This means that j may be
260
4 Systems of Linear Equations
obtained from J through an index exchange: there are precisely two indices
q, sEN satisfying q E J, s ~ J and q ~ J, s E J, thus J = (JU {s})\{q}.
For nondegenerate problems, neighboring feasible bases correspond geometrically to neighboring vertices of P. In the example above (Figure 6), JA = (3,4,5)
and JB = (4,5,2) are neighboring bases, and A and B are neighboring vertices.
The simplex method, or more precisely, "phase two" of the simplex
method, assumes that a feasible basis J of LP(I,p) satisfying p E J is
already available. Given that LP(I,p) does have feasible bases, one such
basis can be found using a variant known as "phase one" of the simplex
method. We will begin by describing a typical step of phase two of the simplex method, which leads from a feasible basis J to a neighboring feasible
basis J of LP(I, p):
(4.10.9) Simplex Step. Requirements: J = (j1, ... ,jm) is assumed to be
a feasible basis of LP(I,p) having p = jt E J, J ffi K = N.
(1) Compute the vector
b:= A:J1b
which gives the basic feasible solution x corresponding to J, where
xJ := b,
(2) Compute the 'rOw vector
XK:= O.
TA-J1'
Jr := e t
where et = (0, ... ,1, ... , O)T E lRm is the t th coordinate vector for
lRm. Determine the values
kEK,
using Jr.
(3) Determine whether
(4.10.10)
and
Ck
2:: 0 for all k E K n I
Ck
= 0 for all k E K\I.
(a) Ifso, then stop. The basic feasible solution x is optimal for LP(I,p).
(b) If not, determine an s E K such that
Cs
< 0, s E K n I
or
Icsl # 0, s E K\I,
and set ()" := - sign C s .
(4) Calculate the vector
(-)T := A-1
a:=
aI, a2,···,
am
J as·
4.10 The Simplex Method
261
(5) If
(4.10.11)
(]"(5:i =::: 0
for all i with ji E I,
stop: LP(I,p) has no finite optimum. Otherwise,
(6) Determine an index r with jr E I, an r > 0, and
br = mIn
bi
.
{ -_-_an r
ani
I· Ji. E I & - >
ani
Z:
0} .
(7) Take as j any suitable index vector with
j:= (JU{s})\{jr},
for example
or
We wish to justify these rules. Let us assume that J =, (jl, ... ,jm) is a
feasible basis for LP(I, p) with p = jt E J, J ffi K = N. Rule (1) of (4.10.9)
produces the associated basic feasible solution x = x(J), as is implied by
(4.10.5). Since all solutions of Ax = b can be represented in the form given
by (4.10.6), the objective function can be written as
T
-
xp = et XJ = xp -
TA-1A
J
KXK
et
= xp -7rAK xK
(4.10.12)
This uses the fact that p = jt, and it uses the definitions of the vector 7r
and of the components CK, Ck E IR n - m , as given in rule (2). CK is known
as the vector of reduced costs. As a result of (4.10.12), k E K, gives the
amount by which the objective function xp changes as X·k is changed by
one unit. If the condition (4.10.10) is satisfied [see rule (3}J, it follows from
(4.10.12) for each feasible point x of LP(I,p), since Xi ~ (I for i E I, that
Xp
= xp -
L
CkXk =::: xp;
kEKnI
that is, the basic feasible solution x is an optimum point for LP(I,p). This
motivates the test (4.10.10) and statement (a) of rule (3). If (4.10.10) is
not satisfied, there is an index s E K for which either
(4.10.13)
or else
Cs
< 0,
s E KnI,
262
4 Systems of Linear Equations
s E K\I,
(4.10.14)
holds. Let s be such an index. We set a := - signcs . Because (4.10.12)
implies that an increase ofax s leads to an increase of the objective function
x p , we consider the following family of vectors x(B) E lRn, B E lR:
-
1-
x(B)J := b - BaA:J as = b - Baa,
(4.10.15)
x(B)s := Ba,
x(Bh:=O
forkEK,ki=s.
Here a:= A:Jla s is defined as in rule (4) of (4.10.9).
In our example I = {I, 2, 3, 4} and J o = JA = (3,4,5) is a feasible basis;
Ko = (1,2), p = 5 E Jo, to = 3. (4.10.9) produces, starting form Jo,
x(Jo) = (0,0,2,4,0)T (which is point A in Figure 6), and 7rAJo = eia =} 7r =
(0,0,1). The reduced costs are CI = 7ral = -1, C2 = 7ra2 = -2. This implies that
Jo is not optimal. If we choose the index s = 2 for (4.10.9), rule (3)(b), then
-
a =
A-I
Jo a2 =
[J].
The family of vectors x(e) is given by
x(e) = (O,e, 2 - e,4 - e,2ef.
°
Geometrically x(e), e 2: describes a semi-infinite ray in Figure 6 which extends
along the edge of the polyhedron P from vertex A (e = 0) in the direction of the
neighboring vertex B (e = 2).
From (4.10.6) it follows that Ax(B) = b for all () E lR. In particular
(4.10.12) implies, from the fact that i; = x(O) and as a result of the choice
of a, that
(4.10.16)
so that the objective function is a strictly increasing function of () on the
family x( (}). It is reasonable to select the best feasible solution of Ax = b
from among the x((}): that is, we wish to find the largest B 2:: 0 satisfying
X((})l 2:: 0 for alII E I.
From (4.10.15), this is equivalent to finding the largest () 2:: 0 such that
(4.10.17)
4.10 The Simplex Method
263
because X(O)k 2 0 is automatically satisfied, for all k E K n I and 0 2 0,
because of (4.10.15). If aiii sO for all ji E J [see rule (5) of (4.10.9)]' then
x(O) is a feasible point for all 0 2 0, and because of (4.10.17) sup{x(O)p I
o 2 O} = +00: LP (1, p) does not have a finite optimum. This justifies rule
(5) of (4.10.9). Otherwise there is a largest 0 =: Bfor which (4.10.17) holds:
B=
b~ = min { aai
b~
aa r
Ii:
ji E I & aiii >
o'.} .
This determines an index r with jr E I, aii r > 0 and
(4.10.18)
x(B)jr = br - Baii r = 0,
x(B) feasible.
In the example
xCi}) = (0,2,0,2, 4)T corresponds to vertex B of Figure 6.
From the feasibility of J, B20, and it follows from (4.10.16) that
x(B)p 2: xp.
If J is nondegenerate, as defined in (4.10.8), then B> 0 holds and further,
x(B)p > xp.
From (4.10.6), (4.10.15), and (4.10.18), x = x(O) is the uniquely determined
solution of Ax = b having the additional property
Xk = 0 for k E K, k -=1= s,
Xjr = 0,
that is, x k = 0, k := (K U {jr})\ {s}. From the uniqueness of x it follows
that A j , J := (J U {s})\{jr}, is nonsingular; x(O) = x(J) is, therefore,
the basic solution associated with the neighboring feasible basis J, and we
have
(4.10.19)
x(J)p > x(J)p,
if J is nondegenerate,
x(J)p 2: x(J)p,
otherwise.
In our example we obtain the new basis J 1 = (2,4,5) = Js, Kl = (1,3),
which corresponds to the vertex B of Figure 6. With respect to the objective
function X5, B is better than A: x(Jsh = 4 > x(JAh = o.
264
4 Systems of Linear Equations
Since the definition of r implies that iT E I always holds, it follows that
J\I <;;:: J\I;
that is, in the transition J ---+ J only a nonnegatively constrained variable
Xjr' ir E I, leaves the basis; as soon as a free variable x s , s (j. I, becomes
a basis variable, it remains in the basis throughout all subsequent simplex
steps. In particular, p E J, because p E J and p (j. I. Hence, the new
basis J also satisfies the requirements of (4.10.9), so that rules (1)-(7) can
be applied to J in turn. Thus, beginning with a first feasible basis Jo of
LP(I,p) with p E J o, we obtain a sequence
J o ---+ J 1 ---+ h ---+ ...
of feasible bases J i of LP(I,p) with p E J i , for which
x(Jo)p < x(Jdp < x(J2 )p < ...
in case all of the J i are nondegenerate. Since there are only finitely many
index vectors J, and since no J i can be repeated if the above chain of
inequalities holds, the simplex method must terminate after finitely many
steps. Thus, we have proven the following for the simplex method:
(4.10.20) Theorem. Let J o be a feasible basis for LP(I,p) with p E Jo.
If LP(I,p) is nondegenerate, then the simplex method generates a finite
sequence of feasible base J i for LP(I,p) with p E J i which begin with J o
and for which x(Ji)p < x(Ji+1)p. Either the final basic solution is optimal
for LP(I,p), or LP(I,p) has no finite optimum.
We continue with our example: as a result of the first simplex step we have
a new feasible basis J1 = (2,4,5) = JB, Kl = (1,3), tl = 3, so that
AJ1=[j
~ ~l' b=[~l' x(h)=(0,2,0,2,4)T(=B),
7rAJl = e~
'*
7r =
(2,0,1).
The reduced costs are Cl = 7ral = -3, C3 = 7ra3 = 2, hence J 1 is not optimal:
s = 1, a = AJ11 a1
'* a = [ =1 1'* r = 2.
Therefore
h = (2,1,5) = Ja,
K2 = (3,4),
x(J2 ) = (1,3,0,0,7) (= C),
t2 = 3,
4.10 The Simplex Method
265
The reduced costs are C3 = 1ra3 = ~ > 0, C4 = ?Ta4 = ~ > O.
The optimality criterion is satisfied, so x(h) is optimal; i.e., Xl = 1, X2 = 3,
X3 = 0, X4 = 0, X5 = 7. The optimal value of the objective function X5 is X5 = 7.
We observe that the important quantities of each simplex step - b, ?T,
and a - are determined by the following systems of linear equations:
AJb
(4.10.21)
b
=? b
[(4.10.9), rule (1)],
1T
[(4.10.9), rule (2)],
=? a
[(4.10.9), rule (4)J.
e; =?
AJa = as
The computational effort required to solve these systems for successive
bases J -+ j -+ .. , can be significantly reduced if it is noted that the
successive bases are neighboring: each new basis matrix A] is obtained
from its predecessor AJ by the exchange of one column of AJ for one
column of A K . Suppose, for example, a decomposition of the basis matrix
AJ of the form (4.9.1), F AJ = R, F nonsingular, R upper triangular, is
used [see Section 4.9J. On the one hand, such a decomposition can be used
to solve the equation systems (4.10.21) in an efficient manner:
RT Z = et =? z =? 1T = zT F,
Ra = Fa s =? a.
On the other hand, the techniques of Section 4.9 can be used on the decomposition F AJ = R of AJ in each simplex step to obtain the corresponding
decomposition F AJ = R of the neighboring basis
[compare (4.10.9), rule (7)J. The matrix FA], with this choice of index
vector J, is an upper Hessenberg matrix of the form depicted here for
m = 4, r = 2:
[
X
~x ~x ~lx
x
=: R'.
x
The subdiagonal elements can easily be eliminated using transformations
of the form (4.9.5), Er,r+l, E r+1,r+2, ... , Em-I,m, which will change R'
into an upper triangular matrix R:
FA J- = R, F:= EF, R:= ER', E:= E m- l mEm-2m-l'" E rr+l
.
,
,
)
,
Thus it is easy to implement the simplex method in a practical fashion by
taking a quadruple M = {J; t; F, R} with the property that
266
4 Systems of Linear Equations
jt =p,
and changing it at each simplex step J --+ j into an analogous quadruple M = U;l; P, R}. To begin this variant of the simplex method, it is
necessary to find a factorization FoAJo = Ro for AJo of the form given
in (4.9.1) as well as finding a feasible basis J o of LP(I,p) with p E J o.
ALGOL programs for such an implementation of the simplex method are to
be found in Wilkinson and Reinsch (1971), and a roundoff investigation is
to be found in Bartels ((1971).
[In practice, particularly for large problems in which the solution of
anyone of the systems (4.10.21) takes a significant amount of time, the
vector XJ = A:J1b = b of one simplex step is often updated to the vector
Xj := Aj 1 b = x(O)j of the next step by using (4.10.15) with the value
0= br/aii r as given by (4.10.9), rule (6). The chance of incurring errors,
particularly in the selection of the index r, should be borne in mind if this
is done.]
The more usual implementations of the simplex method use other quantities than the decomposition F AJ = R of (4.9.1) in order to solve the equation systems (4.10.21) in an efficient manner. The "basis inverse method"
uses a quintuple of the form
M = {J;t;B,b,1T}
with
jt = p,
B:= A:Jl,
b = A:J1b,
1T:=
ei A:J1.
Another variant uses the quintuple
M = {J;t;.4,b,1T}
with
jt = p,
-
-1
b:= A J b,
1
1T := eTAJ'
t
JffiK = N.
In the simplex step J --+ j efficiency can be gained by observing that
the inverse Aj1 is obtainable from A:J1 through multiplication by an ap-
propriate Frobenius matrix G (see Exercise 4 of Chapter 4): A j1 = GA:J1.
In this manner somewhat more computational effort can be saved than
when the decomposition F AJ = R is used. The disadvantage of updating
the inverse of the basis matrix using Frobenius transformations, however,
lies in its possible numerical instability: it may happen that a Frobenius
matrix which is used in this inverse updating process is ill conditioned,
particularly if the basis matrix itself is ill conditioned. In this case, errors can appear in A:J/, A:J,l AKi for Mi , M i , and they will be propagated
throughout all quintuples M i , M i , j > i. When the factorization F AJ = R
is used, however, it is always possible to update the decomposition with
4.10 The Simplex Method
267
well-conditioned transformations, so that error propagation is not likely to
occur.
The following numerical example is typical of the gain in numerical stability
which one obtains if the triangular factorization of the basis (4.9.1) is used to
implement the simplex method rather than the basis inverse method. Consider a
linear programming problem with constraints of the form
(4.10.22)
Ax=b,
x 2 o.
A = (Al,A2),
The 5 x 10 matrix A is composed of two 5 x 5 submatrices AI, A2:
Al = (al i k),
al i k:= 1/(i + k),
i, k = 1, ... ,5,
A2 : = Is = identity matrix of order 5.
Al is very ill conditioned, and A2 is very well conditioned. The vector
5
bi :=
L i~k'
k=l
that is,
b:= AI· e,
where e = (1,1,1,1,
II E lR5 ,
is chosen as a right-hand side. Hence, the bases J 1 := (1,2, <I, 4, 5) and J 2 :=
(6,7,8,9,10) are both feasible for (4.10.22); the corresponding basic solutions
are
x(JI):= [b~] ,
(4.10.23)
x(h):= [b~] ,
We choose h = (6,7,8,9,10) as a starting basis and transform it, using the basis
inverse method on the one hand and triangular decompositions on the other, via
a sequence of single column exchanges, into the basis Jl. From there another
sequence of single column exchanges is used to return to h:
h -+ ... -+ Jl -+ ... -+ h.
The resulting basic solutions (4.10.23), produced on a machine having relative
precision eps "" 1O- 11 , are shown in the following table (inaccurate digits are
under lined) .
4 Systems of Linear Equations
268
Basis
exact
basic solution
Basis
inverse
Triangular
decomposition
J2
1.4500000000 10 0 1.4500000000 10 0
1.4500000000 10 0
1.092857142810 0 1.092857142810 0
1.092857142810 0
b2 = 8.845238095210-1 = 8.845238095210-1 = 8.845238095210 0
7.456349206310 -1
7.456349206310 -1
7.456349206310 -1
6.456349206310 -1 6.456349206310 -1
6.456349206310-1
J1
1
1
b1 = 1
1
1
1.000000018210 0
9.9999984079 10 -1
1.0000004372 10 0
9.9999952118 10 -1
1.0000001826 10 0
1. 000000078610 0
9.9999916035 10 -1
1.0000027212 10 0
9.9999956491 10 -1
1.0000014837 10 0
h
1.450000000010 0
1.092857142810 0
b2 = 8.845238095210-1
7.456349206310-1
6.456349206310 -1
1.4500010511 10 0
1.0928579972 10 0
8.8452453057 10 -1
7.4563554473 10 -1
6.456354 7103 10 -1
1.4500000000 10 0
1.092857142110 0
8.845238095Q10 -1
7 .456349206Q10-1
6.4563492059 10 -1
The following is to be observed: Since AJ, = h, both computational methods yield the exact solution at the start. For the basis J 1 , both methods give
equally inexact results, which reflects the ill-conditioning of AJ" No computational method could produce better results than these without resorting to
higher-precision arithmetic. After passing through this ill-conditioned basis AJ"
however, the situation changes radically in favor of the triangular decomposition
method. This method yields, once again, the basic solution corresponding to h
essentially to full machine accuracy. The basis inverse method, in contrast, produces a basic solution for h with the same inaccuracy as it did for J 1 . With the
basis inverse method, the effect of ill-conditioning encountered while processing
one basis matrix AJ is felt throughout all further bases; this in not the case using
triangular factorization.
4.11 Phase One of the Simplex Method
In order to start phase two of the simplex method, we require a feasible basis
J o of LP(I,p) with p = jto E J o; alternatively, we must find a quadruple
Mo = {Jo; to; Fo, Ro} in which a nonsingular matrix Fo and a nonsingular
triangular matrix Ro form a decomposition FoAJo = Ro as in (4.9.1) of the
basis matrix AJo '
In some special cases, finding J o (Mo) presents no problem, e.g. if
LP(I,p) results from a linear programming problem of the special form
4.11 Phase One of the Simplex Method
269
minimize
subject to
ailxl + ... + ainXn ~ bi ,
Xi ~ 0
i = 1,2, ... , m,
for i E It c {I, 2, ... , n},
with the additional property
bi ~ 0
for i = 1, 2, ... , m.
Such problems may be transformed by the introduction of slack variables
Xn+l, ... , x n + m into the form
(4.11.1)
maximize
Xn+m+l
subject to
ailXl + ... + ainXn + Xn+i
Xi~O
= bi ,
,~= 1,2, ... , m,
foriEItU{n+1,n+2, ... ,n+m},
Note that Xn+m+l = 1 - CIXI - ... - CnX n . The extra positive constant
(arbitrarily selected to be 1) prevents the initially chosen basis from being
degenerate. (4.11.1) has the standard form LP(I,p) with
1
p:= n + m + 1,
1:= It U {n + 1,n + 2, ... ,n + m},
an initial basis J o with p E J o, and a corresponding Mo :::: (Jo; to; Fo, Ro)
given by
Jo :=(n+1,n+2, ... ,n+m+1),
Fo
to=m+1,
~ flo , d +, ~ [: '. :J (o,d", m + 1).
m
Since bi ~ 0 for i = 1, ... , m, the basis Jo is feasible for (4.11.1).
"Phase one of the simplex method" is a name for a class of techniques
which are applicable in general. Essentially, all of these techniques consist
of applying phase two of the simplex method to some linear programming
problem derived from the given problem. The optimal basis obained from
the derived problem provides a starting basis for the given problem. We will
sketch one such method here in the briefest possible fashion. This sketch is
included for the sake of completeness. For full details on starting techniques
for the simplex method, the reader is referred to the extensive literature on
270
4 Systems of Linear Equations
linear programming, for example to Dantzig (1963), Gass (1969), Hadley
(1962), and Murty (1976).
Consider a general linear programming problem which has been cast
into the form
(4.11.2)
minimize
C1X1 + ... + CnXn
subject to
aj1X1 + ... + ajnXn = bj ,
Xi 2': 0
j = 1,2, ... , m,
for i E I,
where I S;;; {I, 2, ... , n}. We may assume without loss of generality that
bj 2': 0 for j = 1, ... , m (otherwise multiply the nonconforming equations
through by -1).
We begin by extending the constraints of the problem by introducing
artificial variables Xn+l, ... , Xn +m :
(4.11.3)
+amnXn
for i E I U { n + 1, ... , n + m}.
+xn+m
Clearly there is a one-to-one correspondence between feasible points for
(4.11.2) and those feasible points for (4.11.3) which satisfy
Xn+l = Xn+2 = ... = x n+m = O.
(4.11.4)
We will set up a derived problem with constraints (4.11.3) whose maximum
should be taken on, if possible, by a point satisfying (4.11.4). Consider
LP(i,p):
maximize
Xn+m+l
subject to
allxl
+ .. , +alnX n +Xn+l
= bm ,
Xn+l + ... +xn+m +Xn+m+l = 22::::1 bi,
a m lXl+ ... +amnx n
+x n +m
Xi 2': 0, for i E i := I U { n + 1, ... ,n + m },
with p := n + m + 1.
We may take io := (n + 1, ... , n + m + 1) as a starting basis with feasible
basic solution x given by
m
Xj = 0, Xn+i = bi, Xn+m+l =
L bi, for 1 :s: j :s: n, 1 :s: i :s: m.
i=l
4.11 Phase One of the Simplex Method
It is evident that xn+m+1 :S 2
more
271
E7=1 bj ; hence LP(i, p) is bounded. FurtherM
Xn+m+1 = 2
Lb
i
i=l
if and only if (4.11.4) holds.
Corresponding to Jo we have the quadruple Mo
given by
to:= m+ 1,
1
-1
1
Phase two can be applied immediately to LP(i, p), and it will terminate
with one of three possible outcomes for the optimal x:
(1) xn+m+1 < 2 E:1 bi [i.e., (4.11.4) does not hold],
(2) xn+m+1 = 2 E:1 bi and all artificial variables are nonbasic,
(3) xn+m+1 = 2 E:1 bi and some artificial variables are basic.
In case (1), (4.11.2) is not a feasible problem [since any feasible point for
(4.11.2) would correspond to a feasible point for LP(i,p) with X n + m +1 =
2 E:1 bJ In case (2), the optimal basis for LP(i,p) clearly provides a
feasible basis for (4.11.2). Phase two can be applied to the problem as
originally given. In case (3), we are faced with a degenerate problem, since
the artificial variables which are basic must have the value zero. We may
assume without loss of generality (by renumbering equations and artificial
variables as necessary) that X n +1, Xn+2, ... , Xn+k are the artificial variables
which are basic and have value zero. We may replace xn+m+1 by a new
variable X n +k+1 and use the optimal basis for LP(i,p) to provide an initial
feasible basis for the problem
maximize X n +k+2
subject to:
+Xn+k
Xn +1
+ ... +Xn+k +Xn +k+1
= bk ,
= 0,
= bk + 1 ,
=bm ,
+Xn +k+2 = 0,
272
4 Systems of Linear Equations
Xi
2': 0 for i E I u {n + 1, ... ,n + k + I}.
This problem is evidently equivalent to (4.11.2).
4.12 Appendix: Elimination Methods for Sparse
Matrices
Many practical applications require solving systems of linear equations
Ax = b with a matrix A that is very large but sparse, i.e., only a small
fraction of the elements aik of A are nonzero. Such applications include the
solution of partial differential equations by means of discretization methods
[see Sections 7.4, 7.5, 8.4]' network problems, or structural design problems
in engineering. The corresponding linear systems can be solved only if the
sparsity of A is used in order to reduce memory requirements, and if solution methods are designed accordingly. Many sparse linear systems, in
particular, those arising from partial differential equations, are solved by
iterative methods [see Chapter 8]. In this section, we consider only elimination methods, in particular, the Choleski method [see 4.3] for solving linear
systems with a positive definite matrix A, and explain in this context some
basic techniques for exploiting the sparsity of A. For further results, we refer
to the literature, e.g., Reid (1971), Rose and Willoughby (1972), Tewarson
(1973), and Barker (1974). A systematic exposition of these methods for
positive definite systems is found in George and Liu (1981), and for general
sparse linear systems in Duff, Erisman, and Reid (1986).
We first illustrate some basic storage techniques for sparse matrices. Consider,
for instance, the matrix
A=
[~o J ~ ~
-5
0
0
o -6
0
0
1
-0: ]
6
One possibility is to store such a matrix by rows, for instance, in three onedimensional arrays, say a, ja, and ip. Here ark], k = 1, 2, ... , are the values of
the (potentially) nonzero elements of A, ja[k] records the column index of the
matrix component stored in ark]. The array ip holds pointers: if ip[i] = p and
ip[i + 1] = q then the segment of nonzero elements in row i of A begins with alP]
and ends with a[q - 1]. In particular, if ip[i] = ip[i + 1] then row i of A is zero.
So, the matrix could be stored in memory as follows
1
2
3
4
5
6
7
8
9
10
ip[i] = 1
ja[i] = 5
ali] = -2
3
1
1
6
1
3
8
3
2
9
5
1
11
4
7
2
-4
2
-5
2
-6
5
6
i=
11
4.12 Appendix: Elimination Methods for Sparse Matrices
273
Of course, for symmetric matrices A, further savings are possible if one stores
only the nonzero elements aik with i :s: k.
When using this kind of data structure, it is difficult to insert additional
nonzero elements into the rows of A, as might be necessary with elimination
methods. This drawback is avoided if the nonzero elements of A are stored
as a linked list. It requires only one additional array next: if ark] contains
an element of row i of A, then the "next" nonzero element of row i is found
in a[next[k]], if next[k] =f. O. If next[k] = 0 then ark] was the "last" nonzero
component of row i.
Using linked lists the above matrix can be stored as follows:
1
2
3
4
5
6
7
8
9
10
6
ja[i] = 2
ali] = -4
next[i] = 0
4
3
2
7
5
5
-2
0
10
1
3
2
9
4
7
1
1
1
3
5
1
0
5
6
0
2
-6
8
2
-5
0
i=
ip[i] =
Now it is easy to insert a new element into a row of A: e.g., a new element
a31 could be incorporated by extending the vectors a, ja, and next each by one
component a[l1], ja[l1], and next[l1]. The new element a31 is stored in a[l1];
ja[l1] = 1 records its column index; and the vectors next and ip are adjusted
as follows: next [11] := ip[3] ( = 5), ip[3] := 11. On the other hand, with this
technique the vector ip no longer contains information on the number of nonzero
elements in the rows of A as it did before.
Refined storage techniques are also necessary for the efficient implementation of iterative methods to solve large sparse systems [see Chapter 8].
However, with elimination methods there are additional difficulties, if the
storage (data structure) used for the matrix A is also to be used for storing
the factors of the triangular decomposition generated by these methods
[see Sections 4.1 and 4.3]: These factors may contain many more nonzero
elements than A. In particular, the number of these extra nonzeros (the
"fill-in" of A) created during the elimination depends heavily on the choice
of the pivot elements. Thus a bad choice of a pivot may not only lead to
numerical instabilities [see Section 4.5], but also spoil the original sparsity
pattern of the matrix A. It is therefore desirable to find a sequence of pivots
that not only ensures numerical stability but also limits fill-in as much as
possible. In the case of Choleski's method for positive definite matrices A,
the situation is particularly propitious because, in that case, pivot selection
is not crucial for numerical stability: Instead of choosing consecutive diagonal pivots, as described in Section 4.3, one may just as well select them
in any other order without losing numerical stability (see (4.3.6)). One is
therefore free to select the diagonal pivots in any order that tends to minimize the fill-in generated during the elimination. This amounts to finding
274
4 Systems of Linear Equations
a permutation P such that the matrix P ApT = LLT has a Choleski factor
L that is as sparse as possible.
The following example shows that the choice of P may influence the sparsity
of L drastically. Here, the diagonal elements are numbered so as to indicate their
ordering under a permutation of A, and x denotes elements 7= 0: The Choleski
factor L of a positive definite matrix A having the form
is in general a "full" matrix, whereas the permuted matrix P ApT obtained by
interchanging the first and last rows and columns of A has a sparse Choleski
factor
PAp T =
[5 2
3
x
x
x
~l
4 x
x
= LL T ,
1
Efficient elimination methods for sparse positive definite systems therefore consist of three parts:
l. Determine P so that the Choleski factor L of P ApT = LLT is as
sparse as possible, and determine the sparsity structure of L.
2. Compute L numerically.
3. Compute the solution x of Ax = b, i.e., of (PAP)T Px = LLT Px =
Pb by solving the triangular systems Lz = Pb, LT u = z, and letting
x:= pTu.
In step 1, only the sparsity pattern of A as given by index set
Nonz(A):= {(i,j) I j < i
and aij of- O}
is used to find Nonz(L), not the numerical values of the components of
L: In this step only a "symbolic factorization" of P ApT takes place; the
"numerical factorization" is left until step 2.
It is convenient to describe the sparsity structure of a symmetric n x n
matrix A, i.e., the set Nonz(A), by means of an undirected graph C A =
(VA, EA) with a finite set of vertices (nodes) VA = {VI, V2, ... ,vn } and a
finite set
EA = {{Vi,Vj} I (i,j) E Nonz(A)}
of "undirected" edges {Vi, Vj} between the nodes Vi and Vj of- Vi (thus an
edge is a subset of V A containing two elements). The vertex Vi is associated
4.12 Appendix: Elimination Methods for Sparse Matrices
275
with the diagonal element aii (i.e., also with row i and column i) of A, and
the vertices Vi =I- Vj are connected by an edge in G A if and only if aij =I- O.
EXAMPLE 1. The matrix
1
x
x
2
x
3
x
x
x
4
x
x
x
x
5
6
x
x
is associated with the graph C A
V
V
V2- l -
x
x
7
4-T-T
V7- V6
We need a few concepts from graph theory. If G = (V, E) is an undirected graph and S c V a subset of its vertices then AdjG(S), or briefly
Adj(S):= {v E V\S I {s,v} E E
for some s Eo S}
denotes the set of all vertices v E V\S that are connected to (a vertex of)
S by an edge of G. For example, Adj ({ v }) or briefly Adj (v) is the set of
neighbors of the node v in G. The number of neighbors of the vertex v E V
deg v := IAdj ({ v } ) I is called the degree of v. Finally, a su bset MeV of
the vertices is called a clique in G if each vertex x E 11/! of N! is connected
to any other vertex y EM, Y =I- x, by an edge of G.
We return to the discussion of elimination methods. First we try to
find a permutation so that the number of nonzero elements of the Choleski
factor L of P ApT = LLT becomes minimal. Unfortunately, an efficient
method for computing an optimal P is not known, but there is a simple
heuristic method, the minimal degree algorithm of Rose (1972), to find a
P that nearly minimizes the sparsity of L. Its basic idea is to select the
next pivot element in Choleski's method as that diagonal element that is
likely to destroy as few O-elements as possible in the elimination step at
hand.
To make this precise, we have to analyze only the first step of the
Choleski algorithm, since this step is already typical for the procedure
in general. By partitioning the n x n matrices A and L in the equation
A=LLT
A =
[~
f] = [7 ~]. [~ i~] =
LL T ,
aT = (a2' a3, ... ,an),
where d = all and [d, aT] is the first row of A, we find the following relations
276
4 Systems of Linear Equations
Hence, the first column Ll of L and the first row Lf of LT are given by
and the computation of the remaining columns of L, that is, of the columns
of L, is equivalent to the Choleski factorization A = LLT of the (n - 1) x
(n - 1) matrix A = (aik)~k=2
(4.12.1)
_
aiak
aik = aik - d
for all i, k 2': 2.
Disregarding the exceptional case in which the numerical values of aik i- 0
and aiak = alialk i- 0 are such that one obtains aik = 0 by cancellation,
we have
(4.12.2)
Therefore, the elimination step with the pivot d = a11 will generate a number of new nonzero elements to A, which is roughly proportional to the number of nonzero components of the vector aT = (a2' ... ,an) = (a12, ... ,al n ).
The elimination step A --+ A can be described in terms of the graphs
G = (V, E) := G A and G = (V, E) := GA. associated witl]. A and A: The
vertices 1, 2, ... , n of G = G A (resp. 2, 3, ... , n of G = G A ) correspond to
the diagonal elements of A (resp. A); in particular, the pivot vertex 1 of G
belongs to the pivot element a11. By (4.12.2), the vertices i i- k, i, k 2': 2, of
G are connected by an edge in G if and only if they are either connected by
an edge in G (aik i- 0), or both vertices i, k are neighbors of pivot node 1
in G (alialk i- 0, i, k E AdjG(l)). The number of nonzero elements ali i- 0
with i 2': 2 in the first row of A is equal to the degree degG(1) of pivot
vertex 1 in G. Therefore, a11 is a favorable pivot, if vertex 1 is a vertex of
minimum degree in G. Moreover, the set AdjG(l) describes exactly which
nondiagonal elements of the first row of LT (first column of L) are nonzero.
EXAMPLE 2. Choosing au as pivot in the following matrix A leads to fill-in at
the positions denoted by (9 in the matrix A:
4.12 Appendix: Elimination Methods for Sparse Matrices
277
The associated graphs are:
2--3
1--2--3
I~I
G:
5--4-6
=?
G:
5-4--6
Generally, the choice of a diagonal pivot in A corresponds to the choice
of a vertex x E V (pivot vertex) in the graph G = (V E) = G A , and the
elimination step A ---+ A with this pivot corresponds to a transformation
of the graph G(= G A ) into the graph G = (V, E) ( = G A ) (which is also
denoted by G x = G to stress its dependence on x) according to the following
rules
(1) lI:=V\{x}.
(2) Connect the nodes y '" z, y, z E 11 by an undirected edge ({y, z} E E),
if y and z are connected by an edge in G ({y, z} E E) or if y and z are
neighbors ofx in (y,z E AdjG(x)).
For obvious reasons, we say that the graph G x is obtained by "elimination
of the vertex x" from G.
By definition, we have in G x for y E 11 = V \ { x }
(4.12.3)
.
{AdjG(y)
AdJGJy)= (AdjG(x)UAdgdy))\{x,y}
if y tI- Adgdx) ,
otherwise.
Every pair of vertices in AdjG(;1;) is connected by an edge in G x . Those
vertices thus form a clique in G x , the so-called pivot clique, pivot clique
DF associated with the pivot vertex x chosen in G. Moreover AdjG(x)
describes the nonzero off-diagonal elements of the column of L (resp. the
row of LT) that corresponds to the pivot vertex x.
As we have seen, the fill-in newly generated during an elimination step
is probably small if the degree of the vertex selected as pivot vertex for this
step is small. This motivates the minimal degree algorithm of Rose (1972)
for finding a suitable sequence of pivots:
(4.12.4) Minimal Degree Algorithm. Let A be a positive definite n x n
matrix and GO = (Va, EO) := G A be the graph associated to A.
For i = 1, 2, ... , n:
(1) Determine a vertex Xi E Vi-l of minimal degree in C i - 1 .
(2) Set G i := G~-;-l.
Remark. A minimal degree vertex Xi need not be uniquely determined.
EXAMPLE 3. Consider the matrix A of example 2. Here, the nodes 1, 3, and 5
have minimal degree in CO := C = CA. The choice of vertex 1 as pivot vertex Xl
278
4 Systems of Linear Equations
leads to the graph C of Example 2, G ' := G~ = C. The pivot clique associated
with X, = 1 is
Adjco(l) = {2,5}.
In the next step of (4.12 ..4) one may choose vertex 5 as pivot vertex X2, since
vertices 5 and 3 have minimal degree in G ' = C. The pivot clique associated with
X2 = 5 is Adjq,(5) = {2,4}, and by "elimination" of X2 from G ' we obtain the
next graph G := G§. All in all, a possible sequence of pivot elements given by
(4.12.4) is (1,5,4,2,3, 6J, which leads to the series of graphs (Go := G, G ' := C
are as in Example 2, G is the empty graph):
2-3
2*--3
I~I
~I
4*--6
3
G 2 = Gg
The pivot vertices are marked. The associated pivot cliques are
Pivot
Clique
3
6
{2,5} {2,4} {2,6} {3,6} {6}
0
1
5
4
2
The matrix P ApT = LL T arising from a permutation of the rows and columns
corresponding to the sequence of pivots selected and its Choleski factor L have
the following structure, which is determined by the pivot cliques (fill-in is denoted
by 0):
For instance, the position of the nonzero off-diagonal elements of the third row
of L T, which belongs to the pivot vertex X3 = 4, is given by the pivot clique
{ 2, 6 } = Adjc2 (4) of this vertex.
For large problems, any efficient implementation of algorithm (4.12.4)
hinges on the efficient determination of the degree function degci of graphs
G i = G~-;-l, i = 1, 2, ... , n - 1. One could use, for instance,
for all y I- x, y tf. Adjc(x), since by (4.12.3) the degree function does not
change in the transition G --+ G x at such vertices y. Numerous other methods for efficiently implementing (4.12.4) have been proposed [see George
and Liu (1989) for a review]. These proposals also involve an appropriate
4.12 Appendix: Elimination Methods for Sparse Matrices
279
representation of graphs. For our purposes, it proved to be useful to represent the graph G = (V, E) by a set of cliques M = {K 1, K2, ... , Kq} such
that each edge is covered by at least one clique Ki E M. Then the set E
of edges can be recovered:
E = {{x, y} 1 x =I- y & 3i : x, y E Ki}.
Under those conditions, M is called clique representation of G. Such respresentations exist for any graph G = (V, E). Indeed, AI := E is a clique
representation of G since every edge {x, y} E E is a clique of G.
EXAMPLE 4. A clique representation of the graph C = C A of Example 2 is, for
instance,
{{1,5},{4,5},{1,2},{2,3,6},{2,4,6}}.
The degree dego(x) and the sets Adjo(x), x E V, can be computed
from a clique representation AI by
Adjo(x) =
U Ki \ {x},
dego(x) = IAdjo(x)l·
i:xEK,
Let x E V be arbitrary. Then, because of (4.12.3), one can easily compute a
clique representation Mx for the elimination graph G x of G = (V, E) from
such a representation !vI = {K 1, ... , Kq} of G: Denote by {KS" ... , K st }
the set of all cliques in M that contain x, and set K := U~=lKsi\{x}.
Then
Mx = {K1, ... ,Kq,K}\{Ks" ... ,Kst}
gives a clique representation of G x .
Assuming that cliques are represented as lists of their elements, then
Mx takes even up less memory space than M because
t
IKI <
L IKsj I·
j=l
Recall now the steps leading up to this point. A suitable pivot sequence was determined using algorithm (4.12.4). To this pivot sequence
corresponds a permutation matrix P, and we are interested in the Choleski
factor L of the matrix PAp T = LLT and, in particular, in a suitable data
structure for storing L and associated information. Let Nonz(L) denote
the sparsity pattern, that is, the set of locations of nonzero elements of
L. The nonzero elements of L may, for instance, be stored by columns this corresponds to storing the nonzero elements of LT by rows -- in three
arrays, ip, ja, a as described at the beginning of this section. Alternatively,
a linked list next may be employed in addition to such arrays.
280
4 Systems of Linear Equations
Because Nonz(L) :J Nonz(PApT), the data structure for L can also be
used to store the nonzero elements of A.
Next follows the numerical factorization of P ApT = LLT: Here it is
important to organize the computation of LT by rows, that is, the array a
is overwritten step by step with the corresponding data of the consecutive
rows of LT. Also the programs to solve the triangular systems
Lz = Pb,
for z and u, respectively, can be coded based on the data structure by
accessing each row of the matrix LT only once when computing z and
again only once when computing u. Finally, the solution x of Ax = b is
given by x = p T u.
The reader can find more details on methods for solving sparse linear
systems in the literature, e.g., in Duff, Erisman, and Reid (1986) and George
and Liu (1981), where numerous FORTRAN programs for solving positive
definite systems are also given. Large program packages exist, such as the
Harwell package MA27 [see Duff and Reid (1982)]' the Yale sparse matrix
package YSMP [see Eisenstat et al. (1982)]' and SPARSEPAK [see George,
Liu, and Ng (1980)].
EXERCISES FOR CHAPTER 4
1.
Consider the following vector norms defined on 1Rn (or (Cn):
n
Ilxlll := L IxJ
i=l
Show:
(a) that the norm properties are satisfied by each;
(b) that Ilxll oo ::; IIxl12 ::; Ilxlll;
(c) that IIxl12 ::; v'nllxlloo, Ilxlll ::; v'nllxIl2'
Can equality hold in (b), (c)?
(d) Determine what lub(A) is, in general, for the norm 11·1\1.
(e) Starting from the definition
EXERCISES FOR CHAPTER 4
show that
281
IIAxl1
lub(A) =
r;;:lJ- W'
1
lub(A-I)
. IIAyl1
#0 IIYII
--:----::-:- = mIn - -
for nonsingular A.
2.
Consider the following class of vector norms defined on <en:
IlxiiD := IIDxll,
where II ·11 is a fixed vector norm, and D is any member of the class of
nonsingular matrices.
(a) Show that II . liD is, indeed, a vector norm.
(b) Show that mllxll :s:; IlxiiD :s:; Mllxll with
m = l/lub(D- I ),
M = lub(D),
where lub(D) is defined with respect to II· II.
(c) Express lubD(A) in terms of the lub norm defined with respect to
11·11·
(d) For a nonsingular matrix A, cond(A) depends upon the underlying
vector norm used. To see an example of this, show that condD(A)
as defined from II . liD can be arbitrarily large, depending on the
choice of D. Give an estimate of condD(A) using m, M.
(e) How different can cond(A) defined using 11·1100 be from that defined
using 11·112? [Use the results of Exercise l(b, c) above].
3.
Show that, for an n x n nonsingular matrix A and vectors u, v E IRn,
(a)
(A + UVT)-l = A-I _ A-IuVT A-I,
l+v T A-I U
T
if v A-Iu -I -l.
(b) If vT A-I U = -1, then (A + uv T ) is singular.
Hint: Find a vector z -I 0 such that (A + uvT)z = O.
4.
Let A be a nonsingular n x n matrix with columns ai,
A = [al,'" ,an]'
(a) Let A = [al, ... ,ai-l,b,ai+l, ... ,a n ], b E IRn , be the matrix obtained from A by replacing the ith column ai by b. Determine, using
the formula of Exercise 3(a), under what conditions A-I exists, and
show that A-I = FA-l for some Frobenius matrix F.
282
4 Systems of Linear Equations
(b) Let A, be the matrix obtained from A by changing a single element
aik to aik + 0:. For what 0: will A~l exist?
5.
Consider the following theorem [see Theorem (6.4.10)]: If A is a real,
nonsingular n x n matrix, then there exist two real orthogonal matrices
U, V satisfying
UTAV=D
where D = diag(111, ... , I1n) and
111 ::::: 112 ::::: ... ::::: I1n > 0.
Using this theorem, and taking the Euclidian norm as the underlying
vector norm:
(a) Express cond(A) in terms of the quantities l1i.
(b) Give an expression for the vectors band Llb in terms of U for which
the bounds (4.4.11), (4.4.12), and
Ilbll S; lub(A)llxll
are satisfied with equality.
(c) Is there a vector b such that for all Llb in (4.4.12),
IILlxl1
- < IILlbl1
-Ilxll
-
Ilbll
holds? Determine such vectors b with the help of U.
Hint: Look at b satisfying lub(A-1)ll bll = Ilxll.
6.
Let Ax = b be given with
A = [0.780
0.913
0.563]
0.659
and
0.217]
b = [ 0.254 .
The exact solution is x T = (1, -1). Further, let two approximate solutions
= (0.999, -1.001),
xi
xf = (0.341, -0.087)
be given.
(a) Compute the residuals r(x1), r(x2). Does the more accurate solution have a smaller residual?
(b) Determine the exact inverse A-1 of A and cond(A) with respect to
the maximum norm.
EXERCISES FOR CHAPTER 4
283
(c) Express i-x = L1x using r(i), the residual for i. Does this provide
an explanation for the discrepancy observed in (a)? (Compare with
Exercise 5.)
7.
Show that:
(a) The largest elements in magnitude in a positive definite matrix
appear on the diagonal and are positive.
(b) If all leading principal minors of an n x n Hermitian matrix A =
(aik) are positive, i.e.,
det,
([a~.l
a:.
azl
a"
1'
])
> 0 for i = 1, ... , n,
then A is positive definite.
Hint: Study the induction proof for Theorem (4.3.3).
8.
Let A' be a given, n x n, real, positive definite matrix partitioned as
follows:
A' = [ABT B]
C
where A is an m x m mat.rix. First., show:
(a) C - RT A -1 B is positive definite.
[Hint: Partition .1: correspondingly:
x =
[~~],
Xl
E IRm ,
X2
E IRn - m ,
and determine an Xl for fixed X2 such that
holds.]
According to Theorem (4.3.3), A' has a decomposition
A' = RTR,
where R is an upper triangular matrix, which may be partitioned consistently with A':
Now show:
(b) Each matrix AI = NT N, where N is a nonsingular matrix, is positive definite.
284
4 Systems of Linear Equations
(c) R'f2R22 = C - BT A-I B.
(d) The result
i=l, ... ,n,
follows from (a), where 'rii is any diagonal element of R.
(e)
2
'r.
x T A/x
xiO x x
1
lub(R-I)2
.
> mm - - = --,--~
T
" -
for i = 1, ... , n, where lub(R- I ) is defined using the Euclidean
vector norm.
Hint: Exercise l(e).
(f)
xTA'x
lub(R)2 = max - - T - 2: 'rTi'
xiO x x
i = 1, ... ,n,
for lub(R) defined with the Euclidean norm.
(g) cond(R) satisfies
cond(R) 2:
9.
max
1'rii I·
l:'Oi,k:'On 'rkk
A sequence An of real or complex 'r x 'r matrices converges componentwise to a matrix A if and only if the An form a Cauchy sequence; that
is, given any vector norm II . II and any E: > 0, lub(A n - Am) < E: for
m and n sufficiently large. Using this, show that if lub(A) < 1, then
the sequence Anand the series L:~=o An converge. Further, show that
1 - A is nonsingular, and
00
(1 - A)-I = LAn.
n=O
Use this result to prove (4.4.14).
10. Suppose that we attempt to find the inverse of an n x n matrix A using the Gauss-Jordan method and partial pivot selection. Show that
the columns of A are linearly dependent if no nonzero pivot element
can be found by the partial pivot selection process at some step of the
method.
[Caution: The converse of this result is not true if floating-point arithmetic is used. A can have linearly dependent columns, yet the GaussJordan method may never encounter all zeros among the candidate
pivots at any step, due to round-off errors. Similar statements can be
made about any of the decomposition methods we have studied. For
EXERCISES FOR CHAPTER 4
285
a discussion of how the determination of rank (or singularity) is best
made numerically, see Chapter 6, Section 6, of Stewart (1973).]
11. Let A be a positive definite n x n matrix. Let Gaussian elimination be
carried out on A without pivot selection. After k steps of elimination,
A will be reduced to the form
A(k)]
12
(k)
A22
,
where A~~) is an (n - k) x (n - k) matrix. Show by induction that
(a) A~~) is positive definite,
C
k 'Z ::; n, k -- 1, 2 , ... , n - 1.
(b) aii(k) ::; aii(k-1) lor::;
12. In the error analysis of Gaussian elimination (Section 4.5) we used
certain estimates of the growth of the maximal elements of the matrices
A(i). Let
Show that for partial pivot selection:
(a) ak::; 2kao, k = 1, ... , n - 1, for arbitrary A.
(b) ak ::; kao, k = 1, ... , n - 1, for Hessenberg matrices A.
(c) a = max1sksn-1 ak ::; 2ao for tridiagonal matrices A.
13. The following decomposition of a positive definite matrix A,
A = SDS H ,
where S is a lower triangular matrix with Sii = 1 and D is a diagonal
matrix D = diag(di ), gives rise to a variant of the Choleski method.
Show:
(a) that such a decomposition is possible [Theorem (4.3.3)];
(b) that di = (lii)2, where A = LLH and L is a lower triangular matrix;
(c) that this decomposition does not require the n square roots which
are required by the Choleski method.
14. We have the following mathematical model:
which depends upon the two unknown parameters Xl, x2. Moreover,
let a collection of data be given:
{Yl, Zl}t=l, ... ,m
with Zl = l.
286
4 Systems of Linear Equations
Try to determine the parameters Xl, X2 from the data using least-square
fitting.
(a) What are the normal equations?
(b) Carry out the Choleski decomposition of the normal equation ma-
trixB=ATA=LLT.
(c) Give an estimate for cond( L) based upon the Euclidean vector
norm.
[Hint: Use the estimate for cond(L) from Exercise 8(g).]
(d) How does the condition vary as m, the number of data, is increased
[Schwarz, Rutishauser, and Stiefel (1968)]7
15. The straight line
y(x)=n+(3x
is to be fitted to the data
Xi
1-2 -1 o 3.51 3.52
2
so that
is minimized.
Determine the parameters nand (3.
16. Determine nand (3 as in Exercise 15 under the condition that
is to be minimized.
[Hint: Let Pi - (Ii = (Ii = Yi(Xi) - Yi for i = 1, ... ,5, where Pi ~ 0 and
(Ii ~ o. Then I:i IY(Xi) -Yil = I:i(Pi+(Ii). Set up a linear programming
problem in the variables n, (3 (unrestricted) and Pi, (Ii (nonnegative).]
References for Chapter 4
Andersen, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J.J., DuCroz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, A., Sorensen, D. (1992):
LAPACK Users Guide. Philadelphia: SIAM Publications.
Barker, V. A. (Ed.) (1977): Sparse Matrix Techniques. Lecture Notes in Mathematics 572. Berlin, Heidelberg, New York: Springer-Verlag.
Bartels, R. H. (1971): A stabilization of the simplex method. Numer. Math. 16,
414-434.
EXERCISES FOR CHAPTER 4
287
Bauer, F. L. (1966): Genauigkeitsfragen bei der Lasung linearer Gleichungssysteme. ZAMM 46, 409-421.
Bixby, R. E. (1990): Implementing the simplex method: The initial basis. Techn.
Rep. TR 90-32, Department of Mathematical Sciences, Rice University, Houston, Texas.
Bjarck, A. (1990): Least squares methods. In: Ciarlet, Lions, Eds. (1990), 465647.
Businger, P., Golub, G. H. (1965): Linear least squares solutions by Householder
transformations. Numer. Math. 7, 269-276.
Ciarlet, P. G., Lions, J. L. (Ed.) (1990): Handbook of numerical analysis, Vol. I:
Finite difference methods (Part 1), Solution of equation" in lR n (Part 1).
Amsterdam: North Holland.
Collatz, L. (1966): Functional Analysis and Numerical Mathematics. New York:
Academic Press.
Daniel, J. W., Gragg, W. B., Kaufmann, L., Stewart, G. W. (1976): Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization.
Math. Compo 30, 772-795.
Dantzig, G. B. (1963): Linear Programming and Extensions. Princeton, N.J.:
Princeton University Press.
Dongarra, J. J., Bunch, J. R., Moler, C. B., Stewart, G. W. (1979): LINPACK
Users Guide. Philadelphia: SIAM Publications.
Duff, I.S., Erisman, A.M., Reid, J.K (1986): Direct Methods for Sparse Matrices.
Oxford: Oxford University Press.
- - - , Reid, J. K. (1982): MA27: A set of FORTRAN subroutines for solving
sparse symmetric sets of linear equations. Techn. Rep. AERE R 10533, Harwell,
U.K.
Eisenstat, S. C., Gursky, M. C., Schultz, M. H., Sherman, A. H. (1982): The Yale
sparse matrix package. I. The symmetric codes. Internat. J. Numer. Methods
Engrg. 18, 1145-1151.
Forsythe, G. E., Moler, C. B. (1967): Computer Solution of Linear Algebraic
Systems. Series in automatic computation. Englewood Cliffs, N.J.: PrenticeHall.
Gass, S. T. (1969): Linear Programming, 3d edition. New York: McGraw-Hill.
George, J. A., Liu, J. W. (1981): Computer Solution of Large Sparse Positive
Definite Systems. Englewood Cliffs, N.J.: Prentice-Hall.
- - - , - - - (1989): The evolution ofthe minimum degree ordering algorithm.
SIAM Review 31, 1-19.
- - - , - - - , Ng, E. G. (1980): Users guide for SPARSEPAK: Waterloo sparse
linear equations package. Tech. Rep. CS-78-30, Dept. of Computer Science, University of Waterloo, Waterloo.
Gill, P. E., Golub, G.H ., Murray, W., Saunders, M. A. (1974): Methods for
modifying matrix factorizations. Math. Compo 28, 505-535.
Golub, G. H., van Loan, C. F. (1983): Matrix Computations. Baltimore: The John
Hopkins University Press.
Grossmann, W. (1969): Grundzuge der Ausgleichsrechnung. 3. Aufl. Berlin, Heidelberg, New York: Springer-Verlag.
Guest, P.G. (1961): Numerical Methods of Curve Fitting. Cambridge: University
Press.
Hadley, G. (1962): Linear Programming. Reading, MA: Addison-Wesley.
Householder, A. S. (1964): The Theory of Matrices in Numerical Analysis. New
York: Blaisdell.
288
4 Systems of Linear Equations
Lawson, C. L., Hanson, H. J. (1974): Solving Least Squares Problems. Englewood
Cliffs, N.J.: Prentice-Hall.
Murty, K. G. (1976): Linear and Combinatorial Programming. New York: Wiley.
Prager, W., Oettli, W. (1964): Compatibility of approximate solution of linear
equations with given error bounds for coefficients and right hand sides. Num.
Math. 6, 405-409.
Reid, J. K. (Ed) (1971): Large Sparse Sets of Linear Equations. London, New
York: Academic Press.
Rose, D. J. (1972): A graph-theoretic study of the numerical solution of sparse
positive definite systems of linear equations, pp. 183-217. In: Graph theory and
computing, R. C. Read, ed., New York: Academic Press.
- - - , Willoughby, R. A. (Eds.) (1972): Sparse Matrices and Their Applications. New York: Plenum Press.
Sautter, W. (1971): Dissertation TV Munchen.
Schrijver, A. (1986): Theory of Linear and Integer Programming. Chichester:
Wiley.
Schwarz, H. R., Rutishauser, H., Stiefel, E. (1968).: Numerik symmetrischer Matrizen. Leitfiiden der angewandten Mathematik, Bd. 11. Stuttgart: Teubner.
Seber, G. A. F. (1977): Linear Regression Analysis. New York: Wiley.
Stewart, G. W. (1973): Introduction to Matrix Computations. New York: Academic Press.
Tewarson, R. P. (1973): Sparse Matrices. New York: Academic Press.
Wilkinson, J. H. (1965): The Algebraic Eigenvalue Problem. Monographs on Numerical Analysis, Oxford: Clarendon Press.
- - - , Reinsch, Ch. (1971): Linear Algebra. Handbook for Automatic Computation, Vol. II. Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen, Bd. 186. Berlin, Heidelberg, New York: Springer-Verlag.
5 Finding Zeros and Minimum Points by
Iterative Methods
Finding the zeros of a given function f, that is arguments ~ for which
f(~) = 0, is a classical problem. In particular, determining the zeros of a
polynomial (the zeros of a polynomial are also known as its roots)
p ( X ) = aox n + alx n-I + ... + an
has captured the attention of pure and applied mathematicians for centuries. However, much more general problems can be formulated in terms
of finding zeros, depending upon the definition of the function f: E -+ F,
its domain E, and its range F.
For example, if E = F = IRn , then a transformation f: IRn -+ IRn
is described by n real functions fi(xl, ... , xn) (we will use superscripts
in this chapter to denote the components of vectors x Eo IRn , n > 1, and
subscripts to denote elements in a set or sequence of veetors Xi, i = 1, 2,
... );
XT
The problem of solving f(x)
(nonlinear) equations:
= (I
X , ••• ,xn) .
o becomes that of solving a system of
(5.0.1)
Even more general problems result if E and F are linear vector spaces of
infinite dimension, e.g. function spaces.
Problems of finding zeros are closely associated with problems of the
form
(5.0.2)
minimize
x E IR
h (x)
n
for a real function h: IR n -+ IR of n variables h(x) = h( xl, x 2 , ... , xn). For
if h is differentiable and g(x) := (oh/ox!, ... , oh/oxn)T is the gradient of
h, then each minimum point x of h(x) is a zero of the gradient g(x) = O.
290
5 Finding Zeros and Minimum Points by Iterative lVlethods
Conversely, each zero of f(5.0.1) is also the minimum point of some function
h, for example h(x) := Ilf(x)112
The minimization problem described above is an unconstrained minimization problem. More generally, one encounters constrained problems such
as the following:
minimize ho (x)
subject to
hdx):::; 0
for
i = 1,2 .... , mI,
hi(x) = 0
for
i = mi + 1, ml + 2, ... , rr!.
Finding minimum points for functions subject to constraints is one
of the most important problems in applied mathematics. In this chapter,
however, we will consider only unconstrained minimization. The special
case of constrained linear minimization, for which all h t : IRn ---+ IR are linear
(or, more exactly, affine) functions has been discussed in Section 4.10 and
4.11. For a more thorough treatment of finding zeros and minimum points
the reader is referred to the extensive literature on that subject [for example
Ortega, Rheinboldt (1970), Luenberger (1973), Himmelblau (1972), Traub
(1964)].
5.1 The Development of Iterative Methods
In general it is not possible to determine a zero ~ of a function f: E ---+ F
explicitly within a finite number of steps, so we have to resort to approximation methods. These methods are usually iterative and have the following
form: beginning with a starting value Xo, successive approximates Xi, i = 1,
2, ... , to ~ are computed with the aid of an iteration function </J : E ---+ E:
If ~ is a fixed point of </J [</J(O = ~], if all fixed points of </J are also zeros of
f, and if </J is continuous in a neighborhood of each of its fixed points, then
cach limit point of the sequence Xi, i = 1, 2, ... , is a fixed point of </J, and
hence a zero of f.
The following questions arise in this connection:
(1) How is a suitable iteration function </J to be found?
(2) Under what conditions will the sequence Xi converge?
(3) How quickly will the sequence Xi converge?
Our discussion of these questions will be restricted to the finite-dimensional
case E = F = IR n .
5.1 The Development of Iterative Methods
291
Let us examine how iteration functions iP might be constructed. Frequently such functions are suggested by the formulation of the problem.
For example, if the equation x - cos x = 0 is to be solved, then it is natural
to try the iterative process
Xi+l
= cos Xi,
i = 0, 1, 2, ... ,
for which iP(x) := cosx.
More systematically, iteration functions iP can be obtained as follows:
If ~ is the zero of a function f: JR -+ JR, and if f is sufficiently differentiable
in a neighborhood U(~) of this point, then the Taylor series expansion of
f about Xo E U(~) is
f(~) = 0 = f(xo) + (~- .TO)!'(Xo) + (~-2.~O)2 !"(xo) + ...
+ (~-k~O)k f{k)(xO + '!9(~ - xo)),
0 < '!9 < 1.
If the higher powers (~ - xo) v are ignored, we arrive at equations which
must express the point ~ approximately in terms of a given, nearby point
Xo, e.g.
(5.1.1)
0= f(xo) + (~- xo)!'(xo),
or
(5.1.2)
These produce the approximations
and
c _ _ f(xo) ± J(fI(xo))2 - 2f(xo)f"(J.:o)
,,-Xo
f"(xo)
,
respectively. In general, ~, ~ are merely close to the desired zero: they must
be corrected further, for instance by the scheme from which they were
themselves derived. In this manner we arrive at the iteration methods
(5.1.3)
f(x)
iP(x) := x - f'(X) ,
rl'.
(
)._
"'± x .- x
_
f'ex) ± J(fI(X))2 -·2f(x)f"(x)
f"(x)
.
The first is the classical Newton-Raphson method. The E.econd is an obvious extension. In general such methods can be obtained by truncating the
292
5 Finding Zeros and Minimum Points by Iterative Methods
Taylor expansion after the (~ - xo)v term. Geometrically these methods
amount to replacing the function f by a polynomial of degreee v, Pv(x)
(v = 1, 2, ... ), which has the same derivatives f(k)(xO), k = 0, 1, 2, ... ,
v, as f at the point Xo. One of the roots of the polynomial is taken as an
approximation to the desired zero ~ of f [see Figure 7].
f(x)
Fig. 7. Newton-Raphson methods (5.1.3).
The classical Newton-Raphson method is obtained by linearizing f.
Linearization is also a means of constructing iterative methods to solve
equation systems of the form
f(x) = [
(5.1.4)
1I(X 1 , .... ,xn)
:
fn(x1, ... ,xn )
1
=0.
If we assume that x = ~ is a zero for f, that Xo is an approximation to ~,
and that f is differentiable for x = Xo then to a first approximation
o = f(~) ~ f(xo) + Df(xo)(~ - xo),
where
[ ox
8f,1
(5.1.5)
Df(xo) =
oII
ox n
:
ofn
ox 1
~ -
ofn
ox n
Xo = [ "
~ x~ 1
~n X=Xo
Xo
5.2 General Convergence Theorems
293
If the Jacobian Df(xo) is nonsingular, then the equation
f(xo) + D f(XO)(XI - xo) = 0
can be solved for Xl:
and Xl may be taken as a closer approximation to the zero t;. The generalized Newton method for solving systems of equations (5.1.4) is given
by
(5.1.6)
Xi+l = Xi - (D f(X;))-1 f(x;),
i = 0, 1, 2, ....
In addition to Newton's method for such equation systems there are, for
example, generalized secant methods [see (5.9.7)] for functions of many
variables, and there are generalizations to nonlinear systems of the iteration
methods given in Chapter 8 for systems of linear equations. A good survey
can be found in Ortega and Rheinboldt (1970).
5.2 General Convergence Theorems
In this section, we study the convergence behavior of a sequence {x;} which
has been generated by an iteration function <P,
Xi+l := <P(Xi),
i = 0, 1, 2,
in the neighborhood of a fixed point t; of <P. \Ve concentrate on the case
E = IR n rather than considering general normed linear vector spaces. Using
a norm 11·11 on IR n , we measure the size of the difference between two vectors,
x, y E IRn by Ilx - YII. A sequence of vectors Xi E IRn converges to a vector
x, if for each c > 0 there is an integer 1'1 (c) such that
Ilxl - xii < c
for all
I::: N(c).
It can be shown that this definition of the convergence of vectors in IR n is
independent of the chosen norm [see Theorem (4.4.6)]. Finally, it is known
that the space IR n is complete in the sense that the Cauchy convergence
criterion is satisfied:
A sequence Xi E IR n is convergent if and only if for each c > 0 there exists
N(c) such that Ilxl - xmll < c for all I, m ::: N(c).
In order to characterize the speed of convergence of a convergent sequence Xi, limi Xi = X, we say that the sequence converges at least with
order p ::: 1 if there are a constant C ::: 0 (with C < 1 if p = 1) and an
integer 1'1 such that the inequality
294
5 Finding Zeros and Minimum Points by Iterative lVlethods
holds for all i 2: N. If p = 1 or p = 2, one also speaks of linear convergence
and quadratic convergence , respectively.
In the case of linear convergence, the size ei := Ilx, - xii of the error is
reduced at each step of the iteration at least by the factor C, 0 <::: C < 1,
and the speed of convergence increases with decreasing C. Therefore, C is
called convergence factor.
If the sequence converges linearly with, say convergence factor C = 0.1, the
error sequence may look like
In the case of quadratic convergence, the error behavior is quite different. For
C = 1. a typical error sequence may look like
We call the iteration method locally convergent with convergence domain V(~), a neigborhood of ~, if it generates, for all starting points
Xu E V(~), a sequence Xi that converges to t;. It is called globally convergent, if in addition V (~) = Rn.
The following theorem is readily verified:
(5.2.1) Theorem. Let CP: JRn --+ JRn be an iteration function with fixed
point ~. Suppose that there are a ncigborhood U (~) of t;, a number p 2: 1.
and a constant C 2: 0 (with C < 1 if P = 1) so that for all x E U(O
Then there is a neighborhood V (~) <:: U (~) of ~ so that for all starting
points Xo E V(t;) the iteration method defined by cP generates iterates Xi
with Xi E V(~) for all i 2: 0 that converge to ~ at least with order p.
In the I-dimensional case, E = JR, the order of a method defined by cP
can often be determined if cP is sufficiently often differentiable in a convex
neighborhood U(~) of ~. If Xi E U(~) and if cp(k) (~) = 0 for k = 1, 2, ... ,
p - L but cp(p) (~) f- 0, it follows by Taylor expansion
For p = 2, 3, ... , the method then is of (precisely) pth order (since cp(p) (0 f0). A method is of first order if, besides p = 1, it is true that W(OI < l.
5.2 General Convergence Theorems
295
EXAMPLE 1. E = JR, p is differentiable in a neighborhood U(O. If 0< p'(e) < 1,
then convergence will be linear (first order). In fact, the Xi will converge monotonically to
[See Figure 8.]
e.
,
p(x)
X
x
Fig. 8. Monotone convergence.
If -1 < pi (e) < 0, then the Xi will alternate about
Figure 9].
EXAMPLE 2.
eduring convergence [see
E = JR, p(x) = x - f(x)/f'(x) (Newton's method). Assume that
I is sufficiently often continuously differentiable in a neighborhood of the simple
zero e of f, that is f' (e) # 0. If follows that
pte) =
e,
pl(e) = f(e)f" (e)
."
(jl(e))2'
Newton's method is locally at least quadratically (second order) convergent.
EXAMPLE 3. In the more general case that
j<v)(e)=o
for
eis a zero of multiplicity m of f, i.e.,
v=0,1, ... ,m-1,
j<m)(o#O,
then f and I' have a representation of the form
f(x) = (x - e)mg(x),
g(e) # 0,
j'(x) = m(x - ~)m-lg(x) + (x _ ~)mg'(x),
with a differentiable function g. If follows that
p x) = X _ f(x) = x _
(x - ~)g(x)
.
(f'(x) mg(x) + (x - e)gl(x)'
and therefore
296
5 Finding Zeros and Minimum Points by Iterative Methods
x
"-
-+--~- - - - - -'-', - - - - - - - --
<p(x)
o
x
Xi
Fig. 9. Alternating convergence.
<p'(€) = 1 -
2..
m
Thus for m > 1 - that is, for multiple zeros of f - Newton's method is only
linearly convergent. The modified method defined by <P(x) := X - mf(x)/f'(x)
would, in theory, still be quadratically convergent. But for m > 1 both methods
are numerically unstable, since, for x --+ €, both If(x)1 and 1f'(x)1 become small,
usually by cancellation. The computed iteration functions fl(<p(x)), fl(<P(x)) are
thus heavily affected by rounding errors.
The following general convergence theorems show that a sequence Xi
generated by P: E --+ E will converge to a fixed point ~ of P if P is a
contractive mapping.
As usual, I . II represents some norm on E = IRn.
(5.2.2) Theorem. Let the function P: E --+ E, E = IRn , have a fixed point
~: p(~) = ~. Further let Sr(~) := {z Illz - ~II < r} be a neighborhood of ~
such that P is a contractive mapping in Sr(~)' that is,
11<p(x) - p(y)11 :::; Kllx - yll
O:::;K<1.
for all x, y E Sr(~)' Then for any Xo E Sr(~) the generated sequence
Xi+! = P(Xi), i = 0, 1, 2, ... , has the properties
(a) Xi E Sr(~) for all i = 0, 1, ... ,
(b) Ilxi+! - ~II :::; Kllxi - ~II :::; Ki+!llxo - ~II,
i.e., {xd converges at least linearly to ~.
5.2 General ConvergencE Theorems
297
PROOF. The proof follows immediately from the contraction property.
Properties (a) and (b) are true for i = O. If we assume that they are
true for j = 0, I, ... , i, then it follows immediately that
The following theorem (known as the fixed point theorem of Banach) is
more precise. Note that the existence of a fixed point is no longer assumed
a priori.
(5.2.3) Theorem. Let P: E --+ E, E = IRn be an itemtlOn function, Xo E
= p(x;), i = 0, I, ... . Further, let a
neighborhood Sr(XO) = {x I Ilx - xoll < r} of Xo and a constant K, 0 <
K < I, exist such that
E be a starting point and Xi+l
(a) IIp(x) - p(y)11 :s: Kllx - yll for all x, y E S,.(xo) := {x Illx - xoll :s: r},
(b) Ilxl - xoll = IIp(xo) - xoll :s: (1 - K)r < r.
Then it follows that
(1) Xi E S,.(xo) for all i = 0, 1, ... ,
(2) P has exactly one fixed point ( p(f;,) = f;"in S,.(:ro), and
lim Xi = f;"
t--'>oc
II·THl - f;,11 :s: Kllxi - f;,11·
as well as
PROOF. The proof is by induction.
(1): From (b) it follows that Xl E S,.(xo). If it is true that Xj E S,.(xo)
for j = 0, 1, ... , i and i ;::- 1, then (a) implies
and therefore, from the triangle inequality and from (b),
IlxHl - xoll :s: Ilxi+l - xiii + Ilxi - Xi-1II + ... + IIJ;l - xoll
:s: (Ki + K i - l + ... + 1)llx1 - xoll
:s: (1 + K + ... + Ki)(l- K)r = (1 - Ki+l)r < r.
(2): First we show that {x;} is a Cauchy sequence. From (5.2.4) and
from (b) it follows for m > I that
(5.2.5)
Ilxm- XIII :s: Ilx m - xm-lll + Ilxm-1 - xm -211 + ... + IlxHl - XIII
:s: KI(l + K + ... + Km-I-I)llxI - xoll
Kl
< ~K IlxI - Xo I < Klr.
1-
298
5 Finding Zeros and Minimum Points by Iterative IVlethods
Because 0 < K < 1, we have Klr < s for sufficiently large I ~ N(s). Hence
{xd is a Cauchy sequence. Since E = lRn is complete, there exists a limit
lim Xi = (
i-+oo
Because Xi E Sr(XO) for alli, ~ must lie in the closure Sr(XO)' Furthermore
~ is a fixed point of P, because for all i ~ 0
IIp(O - ~II s IIp(O - p(xi)11 + IIp(x;) - ~II
s KII~ - xiii + Ilxi+1 - ~II·
Since limi-+CXJ Ilx, - ~II = 0, it follows at once that IIp(O - ~II = 0, and
hence p(O = ~.
If ~ E Sr(XO) were another fixed point of P, then
II~ - ~II = IIp(~) - p(~) I
s KII~ - ~II,
o < K < 1, which implies II~ - ~II = O.
Finally, (5.2.5) implies
II~ - xIII =
lim
m-+oc
Kl
Ilxm - xIII S K IlxI - Xo I
1-
and
which concludes the proof.
D
5.3 The Convergence of Newton's Method in Several
Variables
Consider the system J (x) = 0 given by the function J: lRTl -+ lRn. Such a
function is said to be different'iable at the point Xo E lR n if an n x n matrix
A exists for which
lim IIJ(x) - J(xo) - A(x - xo)11 = o.
X-+Xo
Ilx - xoll
In this case A agrees with the Jacobian matrix DJ(xo) [see (5.1.5)].
We first note the following
(5.3.1) Lemma. IJ D J (x) exists Jar all X in a convex region Co <;;; lRn)
and iJ a constant 'Y exists with
IIDJ(x) - DJ(y)11 S 'Yllx - yll
then Jar all x) y E Co the estimate
Jar all
X,y E Co,
5.3 The Convergence of Newton's Method in Several Variables
299
Ilf(x) - fey) - Df(y)(x - y)11 ::; ~llx - Yl12
holds.
(Recall that a set it! <;;; IRn is convex if x, y E M implies that the line
segment [.T, V] := {z = AX + (1 - A)y I 0 ::; A ::; 1} is contained within itf.)
PROOF. The function cp: [0, 1] --7 IR n
given by
cp(t) :=f(y+t(x-y))
is differentiable for all a ::; t ::; 1, where x, y E Co are arbitrary. This follows
from the chain rule:
cp/(t) = Df(y + t(x - y))(x - v).
Hence it follows for a ::; t ::; 1 that
Ilcp/(t) - cp'(o) I = II(Df(y + t(X - V)) - Df(y))(x - y)11
::; IIDf(y+t(x-y)) -Df(y)llllx-yll
::; -ytllx _ Y112.
On the other hand,
L1:= f(x) - fey) - Df(y)(x - y) = cp(l) - cp(O) - cp/(O)
=
11
(cp'(t) - 'P'(O))dt,
so the above inequality yields
This complctes the proof.
D
We can now show that Newton's method is quadratically convergent:
(5.3.2) Theorem. Let C <;;; IR n be a given open set. Further, let Co be
a convex set with Co <;;; C, and let f: C --7 IRn be a function which is
differentiable for all x E Co and continuous for all x E C.
For Xo E Co let positive constants r, 0, P, -y, h be given with the
following properties:
Sr(XO) := {x Illx - xoll < r} <;;; Co,
h := op-Y/2 < 1, r:= 0/(1 - h),
and let f (x) have the properties
(a) IIDf(x) - Df(y)11 :s: -yllx - yll for all X,y E Co,
(b) Df(x)-l exists and satisfies II(Df(x))-lll::; P for aU x E Co,
300
5 Finding Zeros and Minimum Points by Iterative Methods
(c) II(Df(xo))-lf(xo)11 <:::: oo.
Then
(1) beginning at Xu, each point
is well defined and satisfies Xk E Sr(XO) for all k :;:: O.
(2) limk-+CXJ Xk = ~ exists and satisfies ~ E Sr(XO) and f(~) = O.
(3) For all k :;:: 0
Sincc 0 < h < 1, Newton's method is at least quadratically convergent.
(1): Since Df(x)-l exists for x E CO, Xk+1 is well defined for all k
if Xk E Sr(XO) for all k :;:: O. This is valid for k = 0 and k = 1 by assumption
(c). ~ow, if Xj E Sr(XO) for j = 0, 1, ... , k, then from assumption (b)
PROOF.
Ilxk+1 - xkll = 11- Df(xk)-l f(xk)11 <:::: ;3llf(xk)11
= fJllf(xk) - f(xk-d - Df(xk-d(xk - xk-dll,
since the definition of Xk implies
But, according to Lemma (5.3.1),
(5.3.3)
and therefore
(5.3.4)
This last inequality is correct for k = 0 because of (c). If it is correct for
k :;:: 0, then it is correct for k + 1, since (5.3.3) implies
/3,
2
fJ~f 2 2"_2
IIXk+l-Xkll<::::21Ixk-Xk-111 <::::2Qh
=ooh
2"_1
.
Furthermore, (5.3.4) implies
Il x k+1 - Xoll <:::: IIXk+1 - xkll + 11:r:k - Xk-tI! + ... + II·T1 - Xoll
<:::: 00(1 + h + h 3 + h7 + ... + h2k_1) < 00/(1 - h) = r,
and consequently Xk+1 E Sr(:ro).
5.3 The Convergence of Newton's Method in Several Variables
301
(2): From (5.3.4) it is easily determined that {Xk} is a Cauchy sequence, since for m ~ n we have
(5.3.5)
Ilxm+l - xnll <:::: Ilxm+l - xmll + Ilxm - xm-lll + ... + Ilxn+l - xnll
<:::: nh2n-l(1 + h 2n + (h2n)2 + ... )
nh 2n - 1
< 1 _ h2n < c
for sufficiently large n ~ N(c), because 0 < h < 1.
Consequently, there is a limit
lim Xk = ~ E Sr(XO),
k-toc
whose inclusion in the closure follows from the fact that Xk E Sr(XO) for
all k ~ O.
By passing to the limit m -+ 00 in (5.3.5) we obtain (3) as a side result:
We must still show that ~ is a zero of f in Sr(XO)'
Because of (a), and because Xk E Sr(XO) for all k ~ 0,
and therefore
IIDf(Xk)11 <:::: IT + IIDf(xo)11 =: K.
The inequality
follows from the equation
Hence
and, since f is continuous at ~,
lim Ilf(Xkll = Ilf(~)11 = 0,
i.e., ~ is a zero of f.
o
Under somewhat stronger assumptions it can be shown that ~ is the
only zero of fin Sr(XO):
302
5 Finding Zeros and Minimum Points by Iterative Methods
(5.3.6) Theorem (Newton-Kantorovich). Given the function f: C ~
lRn -+ lRn and the convex set Co ~ C, let f be continuously differentiable
on Co and satisfy the conditions
(a) IIDf(x) - Df(y)11 ::; I'llx - yll for all x, y E Co,
(b) IIDf(xo)-lll ::; {3,
(c) IIDf(xo)-l f(xo)11 ::; a,
for some Xo E Co. Consider the quantities
h := a{3l',
If h ::; ~ and Sri (XO) C CO, then the sequence {Xk} defined by
Xk+1 := Xk - Df(Xk)-l f(Xk),
k = 0, 1, ... ,
remains in Sri (XO) and converges to the unique zero of f(x) Co n Sr2(XO).
For the proof see Ortega and Rheinboldt (1970) or Collatz (1968).
5.4 A Modified Newton Method
Theorem (5.3.2) guarantees the convergence of Newton's method only if
the starting point Xo of the iteration is chosen "sufficiently close" to the
desired solution ~ of
The following example shows that Newton's method may diverge otherwise.
EXAMPLE. Let f: lR -+ lR be given by f(x) =
of f(x) = O.
arctanx. Then ~ = 0 is a solution
The Newton iteration is defined by
Xk+l
If we choose Xo so that
:= Xk -
(1 + x~) arctan(xk).
21 X ol
arctan(lxol} 2: - - 2 '
1 +xo
then the sequence {IXkl} diverges: limk .... oo IXkl = 00.
We describe a modification of Newton's method for which global convergence can be proven for a large class of functions f. The modification
involves the introduction of an extra parameter >. and a search direction s
to define the sequence
(5.4.0.1)
5.4 A Modified Newton Method
303
where typically Sk := d k := [D f(xk)]-1 f(xk), and the A.k are chosen so that
the sequence {h(Xk)}, h(x) := f(x)T f(x), is strictly monotone decreasing
and the Xk converge to a minimum point of h(x). [Compare this with the
problem of nonlinear least-squares data fitting mentioned in Section 4.8.4.]
Since h(x) 20 for all x,
h(x) = 0 ~ f(x) = o.
Every local minimum point x of h which satisfies h(x) = 0 is also a glohal
minimum point x of h as well as a zero of f.
In the following section we will consider first a few general results about
the convergence of a class of minimization methods for arhitrary functions
h( x). These results will then be used in Section 5.4.2 to investigate the
convergence of the modified Newton method.
5.4.1 On the Convergence of Minimization Methods
Let 11.11 be the Euclidean vector norm and 0 < , :::; 1. We consider the set
(5.4.1.1 )
Db,x) := {s E lRn Illsll = 1 with Dh(x)s::::: ,IIDh(x)ll}
of all directions s forming a not-too-large acute angle with the gradient
Vh(x),
Vh(x)T = Dh(x) = Cl;~~)
,"', 8;~~)),
where (~cl, ... , xn)T.
The following lemma shows, given :r, under which conditions a scalar A. and
an s E lR n exist such that h(x - fJ,s) < h(x) for 0 < /1:::; _\:
(5.4.1.2) Lemma. Let h: lRn ---+ lR be a function which has a continuous
derivative Dh(x) for all x E V(x) in a neighborhood V(x) of x. Suppose
fUT·theT' that Dh(x) # O. and let 1 2 , > O. Then there lS a neighborhood
U(x) ~ V(x) of x and a number A. > 0 such that
h(x - fJ,s) :::; h(x) - fJ,4' IIDh(x)11
for all x E U(x), s E Db,x), and 0 :::; fJ,:::; A..
PROOF. The set
U 1(.f:) := {x E V(x) I I Dh(x) - Dh(x) I :::; ~ IIDh(x) II}
is nonempty and a neighborhood of x, since Dh(x)
continuous on V(x). Now, for x E U 1(x),
# 0 and Dh(x) IS
IIDh(x)11 21IDh(x)ll- hIIDh(x)11 2 ~IIDh(;r)ll,
304
5 Finding Zeros and Minimum Points by Iterative Methods
so that for s E Db, x)
D(h(x))s ~ 'YIIDh(x)11 ~ hIIDh(x)ll·
Choose .A > 0 with
and define
U(x) := B>. = {x Illx - xii ~ .A}.
Then for all x E U(x),
with
° JL .A,
~
~
and S E Db,x) there is a 0 < 0 < 1
h(x) - h(x - JLs) = JLDh(x - OJLs)s,
= JL([Dh(x - OJLs) - Dh(x)]s + Dh(x)s).
Now, x E U(x) = B>. implies x, x - JLS, x - OJLS E B 2 >. C U 1 , and therefore
11[(Dh(x - OJLs) - Dh(x)) + (Dh(x) - Dh(x))]sll ~ hIIDh(x)ll.
Using D(h(x))s ~ hIIDh(x)ll, we finally obtain
h(x) - h(x - JLs) ~ -JL2'YIIDh(x)11 + 3~'YIIDh(x)11
= JL4'Y IIDh(x) II,
D
We consider the following method for minimizing a differentiable function h: JRn -+ JR.
(5.4.1.3).
(a) Choose numbers 'Yk ~ 1, CTk, k = 0, 1, ... , with
inf 'Yk > 0,
k
inf CTk > 0,
k
and choose a starting point Xo E JRn .
(b) For all k = 0, 1, ... , choose an Sk E Dbk,Xk) and set
where .Ak E [0, CTk IIDh(xk) III is such that
h(Xk+l) = min{h(xk - JLSk) I 0 ~ JL ~ CTkIlDh(Xk)II}·
i-'
The convergence properties of this method are given by the following
(5.4.1.4) Theorem. Let h: JRn -+ JR be a junction, and let Xo E JRn be
chosen so that
5.4 A Modified Newton Method
305
(a) K:= {xlh(x) S h(xo)} is compact, and
(b) h is continuously differentiable in some open set containing K.
Then for any sequence {xd defined by a method of the type (5.4.1.3):
(1) Xk E K for all k = 0, 1, ... , {xd has at least one accumulation point
x in K.
(2) Each accumulation point x of {xd is a stationary point of h:
Dh(x) = o.
PROOF. (1): From the definition of the sequence {xd it follows immediately
that the sequence {h(Xk)} is monotone: h(xo) 2 h(Xl) 2 .... Hence Xk E K
for all k. K is compact; therefore {Xk} has at least one accumulation point
x E K.
(2): Assume that x is an accumulation point of {xd but is not a stationary point of h:
Dh(x) 10.
(5.4.1.5)
Without loss of generality, let limk-+cx) Xk = x. Let
"'( := inf"'(k > 0,
k
(J:= inf (Jk > O.
k
According to Lemma (5.4.1.2) there is a neighborhood U(x) of x and
a number .\ > 0 satisfying
(5.4.1.6)
h(x -l1s) s h(x) - 11~IIDh(x)11
for all x E U(x), S E Db, x), and 0 S 11 S .\.
Since limk--+oo Xk = x, the continuity of Dh(x) together with (5.4.1.5),
implies the existence of a ko such that for all k 2 ko
(a) Xk E U(x),
(b) IIDh(Xk)11 2 ~IIDh(x)ll·
Let A := min{.\, ~(JIIDh(x)II}, E := A~IIDh(x)11 > O. Since (Jk 2 (J, it
follows that [0, A] c:: [0, (Jk IIDh(Xk) II] for all k 2 ko. Therefore, from the
definition of Xk+l,
Since A S .\, Xk E U(x), Sk E Dbk' Xk) c:: Db, Xk), (5.4.1.6) implies that
h(Xk+l) S h(Xk) -
A"'(
41IDh(x)11 = h(Xk) - E
306
5 Finding Zeros and Minimum Points by Iterative Methods
for all k 2:: k o . This means that limk--+oo h(xk) = -cx::, which contradicts
h(xk) 2:: h(xk+d 2:: '" 2:: h(x). Hence, x is a stationary point of h.
D
Step (b) of (5.4.l.3) is known as the line search. Even though the
method given by (5.4.l.3) is quite general, its practical application is limited by the fact that the line search must be exact, i.e., it requires that the
exact minimum point of the function
be found on the interval [O,O"kIIDh(Xk)ll] in order to determine Xk+l. Generally a great deal of effort is required to ohtain even an approximate
minimum point. The following variant of (5.4.l.3) has the virtue that in
step (b) the exact minimization is replaced by an inexact line search. in
particular by a finite search process ("Armijo line search"): Armijo line
search DF
(5.4.1.7).
(a) Choose numbers /k :s; 1, O"k, k = 0, 1, ... , so that
inf /k > 0,
k
inf O"k > O.
k
Choose a starting point Xo E IRn .
(b) For each k = 0, L ... , obtain Xk+I from Xk as follows:
(a) Select
define
and determine the smallest integer j 2> 0 such that
UJ) Determine 7 ~ {O, 1, ... ,j}, such that hk(Pk2-i) is minimum and
let Ak := p2- i , Xk+l := Xk - AkSk·
[Note that h(Xk+I) = minI::;i::;j hk(Pk2-i).]
It is easily seen that an integer j 2> 0 exists with the properties
(5.4.l.7ba): If Xk is a stationary point, then j = O. If Xk is not stationary,
then the existence of j follows immediately from Lemma (5.4.l.2) applied
to x := Xk. In any case j (and Ak) can be found after a finite number of
steps.
The modified process (5.4.l.7) satisfies an analog to (5.4.l.4):
5.4 A Modified Newton Method
307
(5.4.1.8) Theorem. Under the hypotheses of Theorem (5.4.l.4) each sequence {xd produced by a method of the type (5.4.l. 7) satisfies the conclu8ions of Theorem (5.4.1.4).
PROOF. We assume as before that
x is an accumulation point of a sequence
{xd defined by (5.4.l.7), but not a stationary point, i.e.,
Dh(x) 1= O.
Again, without loss of generality, let lim Xk = x. Also let (J := infk (Jk >
0, r := infk rk > O. According to Lemma (5.4.l.2) there is a neighborhood
U(x) and a number ..\ > 0 such that
(5.4.l.9)
h(x ~ fJs) ::;; h(x) ~ fJ~IIDh(x)11
for all x E U(x), s E Db, x), 0::;; fJ ::;; ..\. Again, the fact that limk Xk = x,
that Dh(x) is continuous, and that Dh(x) 1= 0 imply the existence of a ko
such that
(5.4.1.10a)
Xk E U(x),
(5.4.1.10b)
IIDh(.1:k)11 ~ ~IIDh(x)ll·
for all k ~ k o.
We need to show that there is an c > 0 for which
Note first that (5.4.l.lO) and rk ~ ~( imply
Consequently, according to the definition of Xk+l and j
(5.4.l.11)
Now let J ~ 0 be the smallest integer satisfying
(5.4.l.12)
According to (5.4.l.11), J ::;; j, and the definition of Xk+l we have
(5.4.1.13)
There are two cases:
308
5 Finding Zeros and Minimum Points by Iterative Methods
Case 1, J = O. Note that Pk = O"kIIDh(Xk)11 ::" 0"/21IDh(x)ll. Then (5.4.1.12)
and (5.4.1.13) imply
h(Xk+1) ::; h(Xk) - PkiIIDh(x)11
O"'Y
2
::; h(Xk) -16IIDh(x)11 = h(Xk) - E1
with E1 > 0 independent of Xk.
Case 2, J > O. From the minimality of J we have
hk(PkTC3-1)) > h(Xk) - Pk2-C3-1)iIIDh(x)11
::" h(Xk) - Pk2-(]-1)~IIDh(x)ll.
Because Xk E U(k) and Sk E Dhk, Xk) ~ Dh, Xk), it follows immediately
from (5.4.1.9) that
PkT(3-1) > A.
Combining this with (5.4.1.12) and (5.4.1.13) yields
A,
c
h(Xk+l) ::; hdpk 2- J ) ::; h(Xk) -16IIDh(x)11 = h(Xk) - E2
with E2 > 0 independent of Xk.
Hence, for E = min( El, E2)
for all k ::" ko, contradicting the fact that h(Xk) ::" h(x) for all k. Therefore
x is a stationary point of h.
D
5.4.2 Application of the Convergence Criteria to the Modified
Newton Method
In order to solve the equation f(x) = 0, we let h(x) := f(x)T f(x) and
apply one of the methods (5.4.1.3) or (5.4.1.7) to minimize h(x). We use
the Newton direction
as the search direction Sk to be taken from the point Xk. This direction
will be defined if Df(Xk)-l exists and f(Xk) i= O. (11.11 denotes the Euclidean norm.) To apply the theorems of the last section requires a little
preparation. vVe show first that, for every x such that
d = d(x) := Df(x)
-1
d
f(x) and S = s(x) = TIdIT
5.4 A Modified Newton Method
309
exist [i.e., Df(x)~l exists and d#- 0], we have
1
s E Db,x), for all 0 < 1':::; ;y(x), ;y(x) := cond(Df(x))'
(5.4.2.1)
In the above,
IIDf(x)11 := lub(Df(x))
and
cond(Df(.1:)) := IIDf(.1:)~lIIIIDf(x)11
are to be defined with respect to the Euclidian norm.
PROOF. Since h(x) =
f(xf f(x), we have
Dh(x) = 2F(x)Df(x).
(5.4.2.2)
The inequalities
IIF(x)Df(x)11 :::; IIDf(x)11 Ilf(x)ll,
IIDf(x)~lf(x)ll:::; IIDf(x)~lllllf(x)11
clearly hold, and consequently
Dh(x)s
IIDh(x)11
-,;---'f-o'(.,-'X).,.-T-c:-D-,:-f.,-'(x-,'-:).,.-Dc'-f~(x:,:-)~-:-l_f-,:-(x-:--)-:--:c > ___
1 _ _ > O.
IIDf(x)~l f(x)11
IlfT(x)Df(x)11 ~ cond(Df(x))
Now, for all "y with 0 < "y:::; l/cond(Df(x)), it follows that s E Db,x)
according to the definition of Db, x) given in (5.4.1.1).
D
As a consequence of (5.4.2.2) we observe: If Df(x)~l exists, then
Dh(x) = 0 ¢} f(x) = 0,
(5.4.2.3)
i.c., x is a stationary point of h if and only if x is a zero of f.
Consider the following modified Newton method [cf. (5.4.1.7)]:
(5.4.2.4).
(a) Select a starting point Xo E IRn.
(b) For each k = 0, 1, ... define Xk+l from Xk as follows:
(0:) Set
1
1'k:= cond(Df(xk)) ,
and let hk(r) := h(Xk ~rdk)! where h(x) := f(x)T f(x). Determine
the smallest integer j 2: 0 satisfying
310
5 Finding Zeros and Minimum Points by Iterative Methods
(p) Determine Ak and thereby Xk+1 := Xk - Akdk so that
h(Xk+d = min hd2-i).
O:Si:Sj
As an analog to Theorem (5.4.1.8) we have
(5.4.2.5) Theorem. Let f: JRn --+ JRn be a g'iven function, and let Xo E JRn
be a point with the following pmperties:
(a) The set K := {xlh(x) ::; h(xo)}, where h(x) := f(x)T f(x), is compact;
(b) f is continuously differentiable on some open set containing K;
(c) D f (x) -1 exists for all x E K.
Then the sequence {xd defined by (5.4.2.4) is well defined and satisfies the
following:
(1) Xk E K for all k = 0, 1, ... , and {xd has at least one accumulation
point x E K.
(2) Each accumulation point j; of {xd is a zem of f, f(x) = o.
PROOF. By construction, {h(Xk)} is monotone:
Hence Xk E K, k = 0, 1, .... Because of assumption (c), d k and lk are
well defined if Xk is defined. From (5.4.2.1)
where Sk := dk/lldkll·
As was the case for (5.4.1.7), there is a j :::: 0 with the properties given
in (5.4.2.4). Hence Xk+1 is defined for each Xk.
Now (5.4.2.4) becomes formally identical to the process given by
(5.4.1.7) if O"k is defined by
The remainder of the theorem follows from (5.4.1.8) as soon as we establish
that
inf lk > 0, inf O"k > O.
k
k
According to assumptions (b) and (c), D f (x) -1 is continuous on the
compact set K. Therefore cond(Df(x)) is continuous, and
1
,:= max cond(D f(x)) > 0
xEK
5.4 A Modified Newton Method
311
exists.
Without loss of generality, we may assume that Xk is not a stationary
point of h; which means that it is no zero of f, because of (5.4.2.3) and
assumption (c). [If f(Xk) = 0, then it follows immediately that Xk = Xk+1 =
Xk+2 = .. " and there is nothing left. to show.] Thus, since .7:k E K, k = 0,
1, ... ,
inf /k 2': / > 0.
On the other hand, from the fact that f(Xk) # 0, from (5.4.2.2), and
from the inequalities
Ildkll = IIDf(Xk)
-1
1
f(Xk)ll2': IIDf(Xk)llllf(Xk)ll,
IIDh(Xk)11 ::; 2 ·IIDf(Xk)11 Ilf(x)ll,
it follows immediately that
[from the continuity of Df(x) in the set K, which is compact]. Thus, all the
results of Theorem (5.4.1.8) [or (5.4.1.4)] apply to the sequence {xd. Since
assumption (c) and (5.4.2.3) together imply that each stationary point of
h is also a zero of f, the proof is complete.
0
The method (5.4.2.4) requires that Dh(Xk) and /k := 1/ cond(D f(Xk))
be computed at each iteration step. The proof of (5.4.1.8), however, shows
that it would be sufficient to replace all ~(k by a lower bou.nd / > 0, /k 2':
/ > 0. In accord with this, Ak is usually determined in practice so that
However, since this only requires that /k > 0, the methods used for the
above proofs are not sufficiently strong to guarantee the convergence of this
variant.
A further remark about the behavior of (5.4.2.4): In a sufficiently small
neighborhood of a zero x (with nonsingular D f(x)) the method chooses
/k = 1 automatically. This means that the method conforms to the ordinary
Newton method and converges quadratically. We can see this as follows:
Since limk->cx) Xk = x and f(x) = 0, there is a neighborhood V1 (x) of x
in which every iteration step Zk --t Zk+ 1 which would be caTIied out by the
ordinary Newton method would satisfy the condition
(5.4.2.6)
and
(5.4.2.7)
312
5 Finding Zeros and Minimum Points by Iterative Methods
Taylor's expansion of f about x gives
f(x) = Df(x)(x - x) + o(llx - xii)·
Since
Ilx - xII/IIDf(x)-lll ::; IIDf(x)(x - x)11 ::; IIDf(x)llllx - xii
and limx--+x o(llx - xll)/llx - xii = 0, there is another neighborhood V2 (x)
of x such that
for all x E V2 (x).
Choose a neighborhood
and let ko be such that
Xk E U(x)
for k ::" k o. This is possible because limk--+oo Xk = x. Considering
and using (5.4.2.6), (5.4.2.7), we are led to
h(Xk+l) ::; 41IDf(x)1121Ixk+l - xl1 2 ::; 16a2r211xk - xI1 2h(Xk)
::; h(xk)(l - ~).
From (5.4.2.4bQ),
This implies
That is, there exists a ko such that for all k ::" ko the choice j = 0 and Ak = 1
will be made in the process given by (5.4.2.4). Thus (5.4.2.4) is identical
to the ordinary Newton method in a sufficiently small neighborhood of x,
which means that it is locally quadratically convergent.
Assumption (a)-(c) in Theorem (5.4.2.5) characterize the class of functions for which the algorithm (5.4.2.4) is applicable. In one of the assigned
problems for this chapter [Exercise 15], two examples will be given of function classes which satisfy (a)-(c).
5.4 A Modified Newton Method
313
5.4.3 Suggestions for a Practical Implementation of the
Modified Newton Method. A Rank-One Method Due to
Broyden
Newton's method for solving the system f(x) = 0, where f:IRn -+ IR n ,
is quite expensive even in its modified form (5.4.2.4), since the Jacobian
D f(Xk) and the solution to the linear system D f(Xk)d == f(Xk) must be
computed at each iteration. The evaluation of explicit formulas for the
components of D f(x) for given x is usually complicated and costly even
if explicit formulas for f(.) and Df(.) are available. In many cases this
work is alleviated by some recent and surprisingly efficient methods for
computing the values of derivatives like D f (x) at a given point x exactly
(methods of automatic differentiation). These methods presuppose that the
function f(x) is built up from simple differentiable functions with known
derivatives which allows the repeated use of the chain rule to compute the
wanted derivatives recursively. For an exposition of these methods we have
to refer the reader to the relevant literature [see e.g. Griewank (2000)].
If these assumptions are not satisfied, if for instance explicit formulas
for f (x) are not available (f (x) could be the result of a measurement or of
another lenghty computation), one could resort to the following techniques:
One is to replace
Df( ) = (af(x)
af(x))
x
axl , ... , ax n
at each x = x k by a matrix
(5.4.3.1)
L1f(x) = (L1d, .. ·, L1nf),
where
L1d(x): = f(xl, ... ,xi + hi, ... ,x~: - f(x 1 , ... ,Xi, ... ,xn)
f(x + hiei) - f(x)
hi
that is, we may replace the partial derivatives a f / axi by suitable difference
quotients L1d. Note that the matrix L1f(x) can be computed with only n
additional evaluations of the function f (beyond that required at the point
x = Xk). However it can be difficult to choose the stepsizes hi. If any hi is
too large, then L1f(x) can be a bad approximation to Df(x), so that the
iteration
(5.4.3.2)
converges, if it converges at all, much more slowly than (5.4.2.4). On the
other hand, if any hi is too small, then f(x+hiei) ~ f(x), and cancellations
can occur which materially reduce the accuracy of the difference quotients.
314
5 Finding Zeros and Minimum Points by Iterative Methods
The following compromise seems to work the best: if we assume that all
components of f(x) can be computed with a relative error of the same order
of magnitude as the machine precision eps, then choose hi so that f(x) and
f(x + hiei) have roughly the first t/2 digits in common, given that i-digit
accuracy is being maintained inside the computer. That is,
Ihil IILld(x)11 ~ v'ePsllf(x) II·
In this case the influence of cancellations is usually not too bad.
If the function f is very complicated, however, even the n addiional
evaluations of f needed to produce Llf(x) can be too expensive to bear
at each iteration. In this case we try replacing D f(Xk) by some matrix
Bk which is even simpler than Llf(Xk). Suitable matrices can be obtained
using the following result due to Broyden (1965).
(5.4.3.3) Theorem. Let A and B be arbitrary n x n matrices; let bE IRn ,
and let F: IRn ---+ IRn be the affine mapping F( u) := Au + b. Suppose x,
x' E IRn are distinct vectors, and define p, q by
p := x' - x,
q:= F(X') - F(x) = Ap.
Then the n x n matrix B' given by
1 ( q - Bp )p T ,
B I := B + --;yp p
satisfies
with respect to the Euclidean norm, and it also satisfies the equation
B'p= Ap= q.
PROOF. The equality (B'_A)p = 0 is immediate from the definition of B'.
Each vector u E IRn satisfying IIul12 = 1 has an orthogonal decomposition
of the form
u = O!p + V,
v T p = 0,
Ilv112::; 1,
O! E IR.
Thus it follows from the definition of B' that, for IIul12 = 1,
II(B' - A)u112 = II(B' - A)v112 = II(B - A)v112
::; lub 2(B - A)llvl12 ::; lub 2(B - A).
Hence
lub 2(B ' - A) =
sup II(B ' - A)u112 ::; lub 2(B - A).
Ilu112=1
0
5.4 A Modified Newton Method
315
This result shows that the Jacobian Df(x) == A of an affine function F
is approximated by B' at least as well as it is approximated by B, and
furthermore B' and DF(x) will both map P into the same vector. Since a
differentiable nonlinear function f: IRn --+ IRn can be approximated to first
order in the neighborhood of one of its zeros x by an affine function, this
suggests using the above construction of B' from B even in the nonlinear
case. Doing so yields an iteration of the form
dk := B-;;l f(Xk),
(5.4.3.4)
Xk+l := Xk - Akdk,
Pk := Xk+1 - Xk, qk:= f(Xk+1) - f(Xk),
1
T
Bk+1 := Bk + --;y-(qk - BkPk)Pk·
Pk Pk
The formula for Bk+1 was suggested by Broyden. Since rank(Bk+1 -Bk) s:
1, it is called Broyden's rank-one update. The stepsizes Ak may be determined from an approximate minimization of Ilf(x)112:
using, for example, a finite search process
(5.4.3.5)
as in (5.4.2.4).
A suitable starting matrix Bo can be obtained using difference quotients: Bo = L1f(xo). It does not make good sense, however, to compute all following matrices B k , k ?:: 1, from the updating formula. Various suggestions have been made about which iterations of (5.4.3.4) are
to be modified by replacing Bk + (1/p[Pk)(qk ~ BkPk)p[ with L1f(Xk)
("reinitialization"). As one possibility we may obtain B~;+l from Bk using Broyden's update only on those iterations where the step produced by
(5.4.3.5) lies in the interval 2- 1 s: A s: 1. A justification for this is given by
observing that the bisection method (5.4.3.5) automatically picks Ak = 1
when Ilf(xk+1)11 < Ilf(Xk - dk)ll· The following result due to Broyden,
Dennis, and More (1973) shows that this will be true for all Xk sufficiently
close to x (in which case making an affine approximation to f is presumably
justified):
Under the assumption that
(a) Df(x) exists and is continuous in a neighborhood U(:i:) of a zero point
x,
(b) IIDf(x) - Df(x) II s: Allx - xii for some A > 0 and all x E U(x),
(c) Df(x)-l exists,
316
5 Finding Zeros and Minimum Points by Iterative Methods
then the iteration (5.4.3.4) is well defined using .Ak = 1 for all k :2:: 0 (i.e.,
all Bk are nonsingular) provided Xo and Bo are "sufficiently close" to x and
D f (x). Moreover, the iteration generates a sequence {x d which converges
superlinearly to x,
lim Ilxk+1 - xii = 0
k--+oo Ilxk - xii
'
if Xk -I- x for all k :2:: o.
The direction dk = Bi: 1f(Xk) appearing in (5.4.3.4) is best obtained by
solving the linear system Bkd = f(Xk) using a decomposition FkBk = Rk
of the kind given in (4.9.1). Observe that the factors H+1, Rk+1, of Bk+1
can be found from the factors Fk , R k , of Bk by employing the techniques
of Section 4.9, since modification of Bk by a rank-one matrix is involved.
We remark that all of the foregoing also has application to function
minimization as well as the location of zeros. Let h: ffin -+ ffi be a given
function. The minimum points of h are among the zeros of f(x) = \lh(x).
Moreover, the Jacobian Df(x) is the Hessian matrix \l2h(x) of h; hence it
can be expected to be positive definite near a strong local minimum point
x of h. This suggests that the matrix Bk which is taken to approximate
\l2h(Xk) should be positive definite. Consequently, for function minimization, the Broyden rank-one updating formula used in (5.4.3.4) should be
replaced by an updating formula which guarantees the positive definiteness
of Bk+l given that Bk is positive definite. A number of such formulas have
been suggested. The most successful are of rank two and can be expressed
as two stage updates:
B k+1/ 2 := Bk + DkUkUI,
Bk+l := Bk+l/2 - (3k v kV[,
Dk > 0,
(3k > 0,
with suitable numbers Dk, (3k and vectors Uk, Vk so that Bk+1/2 is guaranteed to be positive definite as well as B k +1. For such updates the
Cholesky decomposition is the most reasonable to use in solving the system
Bkd = f(Xk). More details on these topics may be found in the reference by
Gill, Golub, Murray, and Saunders (1974). Rank-two updates to positive
definite approximations Hk of the inverse [\l2h(Xk)]-1 of the Hessian are
described in Section 5.11.
5.5 Roots of Polynomials. Application of Newton's
Method
Sections 5.5-5.8 deal with roots of polynomials and some typical methods
for their determination. There are a host of methods available for this
purpose which we will not be covering. See, for example, Bauer (1956),
5.5 Roots of Polynomials. Application of Newton's Method
317
Jenkins and Traub (1970), Nickel (1966), and Henrici (1974), to mention
just a few.
The importance of general methods for determining roots of general
polynomials may sometimes be overrated. Polynomials found in practice
are frequently given in some special form, such as characteristic polynomials
of matrices. In the latter case, the roots are eigenvalues of matrices, and
methods to be described in Chapter 6 are to be preferred.
We proceed to describe how the Newton method applies to finding the
roots of a given polynomial p(x). In order to evaluate the it.eration function
of Newton's method,
P(Xk)
Xk+l := Xk - p'(Xk)'
we have to calculate the value of the polynomial p, as well as the value of
its first derivative, at the point x = Xk. Assume the polynomial p is given
in the form
p(X) = aox n- 1 + ... + an.
Then p(Xk) and P'(Xk) can be calculated as follows: For :r =~,
The multipliers of ~ in this expression are recursively of the form
(G.G.l)
bo : = ao
b; : = bi-l~ + ai
i = 1, 2, ... , n.
The value of the polynomial p at ~ is then given by
The algorithm for evaluating polynomials using the recursion (5.5.1) is
known as Horner's scheme. The quantities bi , thus obtained, are also the
coefficients of the polynomial
Pl(X) := box n - 1 + b1x n - 2 + ... + bn - 1
which results if the polynomial p( x) is divided by x - ~:
(5.5.2)
p(X) = (x - ~)pdx) + bn .
This is readily verified by comparing the coefficients of the powers of x on
both sides of (5.5.2). Furthermore, differentiating the relation (.'>.5.2) with
respect to x and setting x = ~ yields
Therefore, the first derivative p' (~) can be determined by repeating the
Horner scheme, using the results bi of the first as coefficients for the second:
5 Finding Zeros and l\1inimum Points by Iterative Methods
318
p'(O = (- .. (bo~ + bd~ + .. X + bn - 1 ·
Frequently, however, the polynomial p(x) is given in some form other
than
p(x) = aoxn + ... + an.
Particularly important is the case in which p(x) is the characteristic polynomial of a symmetric tridiagonal matrix
["'
,62
0
('h
J=
0:;, ;3i real.
,Bn
0
(in
an
Denoting by Pi(X) the characteristic polynomial
P2(X) := det
([
0
0:1~X
i-h
o
3i
;3i
0:; ~ x
)
of the principal submatrix formed by the first i rows and columns of the
matrix J, we have recursions
Po(x) := l.
(5.5.3)
P1(X) := (0:1 ~ x) . 1,
Pi(X) := (O:i ~ x)Pi-dx) ~ 8;Pi-2(X),i = 2, 3, ... , n,
p(x) := det(J ~ .T1) := Pn(x).
These can be used to calculate p(~) for any x = ~ and any given matrix
elements ai, ;3i. A similar recursion for calculating p' (x) is obtained by
differentiating (5.5.3):
(5.5.4)
p~(x) := 0,
p~(x) := ~l.
p;(x) := ~Pi-l(X) + (0:; ~ X)P;_l(X) ~ 6;P;_2(X),i = 2, 3, ... , n,
p'(x) := p~(x).
The two recursions (5.5.3) and (5.5.4) can be evaluated concurrently.
During our general discussion of the Newton method in Section 5.3 it
became clear that the convergence of a sequence Xk towards a zero'; of
a function is assured only if the starting point Xo is sufficiently close to
~. A bad initial choice Xo may cause the sequence Xk to diverge even for
polynomials. If the real polynomial p(x) has no real roots [e.g., p(x) =
5.5 Roots of Polynomials. Application of Newton's Method
319
X2 + 1], then the Newton method must diverge for any initial value Xo E lR.
There are no known fail-safe rules for selecting initial values in the case of
arbitrary polynomials. However, such a rule exists in an important special
case, namely, if all roots ~, i = 1, 2, ... , n, are real:
In Section 5.6, Theorem (5.6.5), we will show that the polynomials defined
by (5.5.3) have this property if the matrix elements ai, Pi are real.
(5.5.5) Theorem. Let p(x) be a polynomial of degree n > 2 with real
coefficients. If all roots Ei,
ofp(x) are real, then Newton's method yields a strictly decreasing sequence
Xk converging to 6 for any initial value Xo > 6.
PROOF. Without loss of generality, we may assume that p(xo)
p( x) does not change sign for x > ~, we have
> O. Since
p(x) = aoxn + ... + an > 0
for x > 6 and therefore ao > O. The derivative p' has n - 1 real zeros ai
with
6 2: 001 2: 6 2: 002 2: ... 2: a n -1 2: ~n
by Rolle's theorem. Since p' is of degree n - 1 2: 1, these are all its roots,
and p'(x) > 0 for x > 001 because ao > O. Applying Rolle's theorem again,
and recalling that n 2: 2, we obtain
(5.5.6)
p"(x) > 0
for x > 001,
p"'(x) 2: 0
for x 2: 001.
Thus p and p' are convex functions for x 2: 001.
Now Xk > ~ implies that
p(Xk)
Xk+1 = Xk - -,-(-) < Xk,
p Xk
since p'(Xk) > 0, P(Xk) > O. It remains to be shown, that we do not
"overshoot", i.e., that Xk+1 > 6· From (5.5.6), Xk > 6 2: 001, and Taylor's
theorem we conclude that
0= p(6) = P(Xk) + (6 - Xk)p'(Xk) + ~(6 - Xk)2p"(c5) , 6 < c5 < Xk,
> p(Xk) + (6 - Xk)p'(Xk).
P(Xk) = p'(Xk)(Xk - xx+!) holds by the definition of Xk+l' Thus
320
5 Finding Zeros and Minimum Points by Iterative Methods
and Xk+1 > 6 follows, since p'(Xk) > O.
D
For later use we note the following consequence of (5.5.6):
(5.5.7) Lemma. p(x) = aoxn + ... + an, ao > 0, be a real polynomial of
degree n ~ 2 all roots of which are real. If III is the largest root of pi, then
pili ~ 0 for x ~ lll' i. e., pi is a convex function for x ~ lll.
We are still faced with the problem of finding a number Xo > 6, without
knowing 6 beforehand. The following inequalities are available for this
purpose:
(5.5.8) Theorem. For all roots ~i of an arbitrary polynomial p(x)
aoxn + ... + an with ao =1= 0,
Some of these inequalities will be proved in Section 6.9. Compare also
Householder (1970). Additional inequalities can be found in Marden (1949)
Quadratic convergence does not necessarily mean fast convergence. If
the initial value Xo is far from a root, then the sequence Xk obtained by
Newton's method may converge very slowly in the beginning. Indeed, if Xk
is large, then
Xk+l
(1)
+... ~ Xk 1 - = Xk - nXXk
n-l
+...
n
k
,
so that there is little change between Xk and Xk+1. This observation has
led to considering the following double-step method:
5.5 Roots of Polynomials. Application of Newton's Method
Xk+l = Xk -
P(Xk)
2 -'-(-)'
P Xk
321
k = 0,1,2, ... ,
in lieu of the straightforward Newton method.
Of course, there is now the danger of "overshooting". In particular,
in the case of polynomials with real roots only and an initial point Xo >
6, some Xk+l may overshoot 6, negating the benefit of Theorem (5.5.5).
However, this overshooting can be detected, and, due to some remarkable
properties of polynomials, a good initial value y (6 ~ y > ~2) with which
to start a subsequent Newton procedure for the calculation of 6 can be
recovered. The latter is a consequence of the following theorem:
(5.5.9) Theorem. Let p(x) be a real polynomial of degree n ~ 2, all roots
of which are real, 6 ~ 6 ~ ... ~ ~n' Let al be the largest root of p'(x):
For n = 2, we require also that 6 > 6. Then for every z > 6, the numbers
,
p(z)
z :=z- p'(z) ,
p(z)
y:=z-2 p '(z)'
,
p(y)
y :=y- p'(y) ,
(see Figure 10) are well defined and satisfy
< y,
(5.5.10a)
al
(5.5.10b)
6 ::; y' ::; z'.
It is readily verified that n = 2 and 6 = 6 imply y = 6 for any z > 6.
PROOF. Assume again that p(z) > 0 for z > 6. For such values z, we
consider the quantities .:1 0, .:11 (Figure 10), which are defined as follows:
1
z'
.:10 := p(z') = p(z') - p(z) - (z' - z)p'(z) =
1
[P'(t) - p'(z)]dt,
z'
.:11 := p(z') - p(y) - (z' - y)p'(y) =
[P'(t) - p'(y)]dt .
.:10 and .:11 can be interpreted as areas over and under the graph of p'(x),
respectively (Figure 11).
By Lemma (5.5.7), p'(x) is a convex function for x ~ al. Therefore, and
because z' - y = z - z' > 0 - the latter being positive by Theorem (5.5.5)
- we have
(5.5.11)
322
5 Finding Zeros and Minimum Points by Iterative Methods
p(x)
Fig. 10. Geometric interpretation of the double-step method.
pl(X)
y
Zl
z
x
Fig. 11. The quantities Llo and Ll1 interpreted as areas.
with equality .d 1 = .do holding if and only if pI is a linear function, that
is, if p is a polynomial of degree 2. Now we distinguish the three cases
5.5 Roots of Polynomials. Application of Newton's l\Iethod
323
y > 6, y = 6, y < 6· For y > 6, the proposition of the theorem follows
immediately from Theorem (5.5.5). For y = 6, we show first that 6 >
(Xl > 6, that is, 6 is a simple root of p. If y = 6 = 6 = (Xl were a
multiple root, then by hypothesis n :->: 3. and consequently Lll < Llo would
hold in (5.5.11). This would lead to the contradiction
p(z') = p(z') - p(6) - (z' - 6)p'(~J) = Ll1 < Llo = p(z').
Thus 6 must be a simple root; hence (Xl < 6 = y' = y < z', and the
proposition is seen to be correct in the second case, too.
The case y < 6 remains. If (Xl < y, then the validity of the proposition
can be established as follows. Since p(z) > 0 and 6 < (Xl < y < 6, we
have p(y) < 0, p'(y) > O. In particular, y' is well defined. Furthermore,
since p(y) = (y - y')p'(y), and Ll1 ::; Llo, we have
Llo - Lll = p(y) + (z' - y)p'(y) = p'(y)(z' - y') :->: O.
Therefore z' :->: y'. By Taylor's theorem, finally,
p(6) = 0 = p(y) + (~l - y)p'(y) + ~(6 - y)2 pl/(b) ,
y < J < 6,
and since pl/(x) :->: 0 for x :->: (Xl, p(y) = (y - y')p'(y), and p'(y) > 0,
0:->: p(y) + (6 - y)p'(:Ij) = p'(Y)(6 - y').
Therefore ~l ::; y'.
To complete the proof, we proceed to show that
(5.5.12)
y = y(z) > (Xl.
for any z > 6. Again we distinguish two cases, 6
6·
If 6 > (Xl > 6, then (5.5.12) holds whenever
> 01 > 6 and 6
a1 =
This is because Theorem (5.5.5) implies z > z' :->: 6, and therefore
y = z' - (z - z') > 6 - (6 - ad = (Xl
holds by the definition of y = y( z). Hence we can select a Zo with
y(zo) > (Xl· Assume that there exists a Zl > 6 with y(Zl) ::; (Xl. By
the intermediate-value theorem for continuous functions, there exists a
z E [ZO,Zl] withy = y(z) = (Xl. From (5.5.11) for z = z,
Lll = p(z') - p(y) - (z' - y)p'(y) = p(z') - p(y) ::; Llo = p(z'),
324
5 Finding Zeros and Minimum Points by Iterative Methods
and therefore p(y) = p( (};1) ~ 0. On the other hand, p( (};1) < 0, since ~1 is a
simple root, in our case, causing p(x) to change sign. This is a contradiction,
and (5.5.12) must hold for all z > 6.
If 6 = (};1 = 6, then by hypothesis n ~ 3. Assume, without loss of
generality. that
p (X) = X
n
+ a1x n-\ + ... + an.
Then
()
z' = z - ~ = z _ ~
an
Z
zn
n 1 + n - 1 a 1 + ... + an -1
p' (Z )
Therefore
a1
1+-+···+~
n
2
Y = Y(z) = z + (z' - z) = z -
z
nZn-1
2: (1 + 0(~ ))
= z (1-~) +0(1).
Since n ~ 3, the value of y(z) increases indefinitely as z --+ co, and we
conclude again that there exists a Zo > 6 with Yo = y(zo) > (};1. If (5.5.12)
did not hold for all z > 6, then we could conclude, just as before, that
there exists z > 6 with y = a( z) = 0'\. However. the existence of such
a value y = (};1 = 6 = 6 has been shown to be impossible earlier in the
proof of this theorem.
D
The practical significance of this theorem is as follows. If we have started
with Xo > 6, then either the approximate values generated by the doublestep method
satisfy
XO~X1~"'~Xk~Xk+1~"'~6
and
limxk=6,
or there exists a first Xko := y such that
In the first case, all values p(Xk) are of the same sign,
p(xo)p(xkl ~ 0,
for all k,
and the Xk converge monotonically (and faster than for the straightforward
Newton method) towards the root 6. In the second case,
5.5 Roots of Polynomials. Application of Newton's Method
325
Xo > Xl > ... > Xko-l > 6 > Y = Xko > (Xl > 6·
Using Yo := Y as the starting point of a subsequent straightforward Newton
procedure,
P(Yk)
Yk+ 1 = Yk - ---,------() , k = 0, 1, ... ,
P Yk
will also provide monotonic convergence:
lim Yk = 6·
k--+oo
Having found the largest root 6 of a polynomial P, there are still the
other roots 6, 6, ... , ~n to be found. The following idea suggests itself
immediately: "divide off" the known root 6, that is, form the polynomial
p(X)
Pl(X):= - - c
X -
<,,1
of degree n - 1. This process is called deflation.
The largest root of PI (x) is 6, and may be determined by the previously
described procedures. Here 6 or, even better, the value Y = Xko found by
overshooting may serve as a starting point. In this fashion, all roots will be
found eventually.
Deflation, in general, is not without hazard, because roundoff will preclude an exact determination of PI (x). The polynomial actually found in
place of PI will have roots different from 6, 6, ... , ~n. These are then
found by means of further approximations, with the result that the last
roots may be quite inaccurate. However, deflation has been found to be
numerically stable if done with care. In dividing off a root, the coefficients
of the deflated polynomial
, n-l + alx
' n-2 + ... +'
PI (X) = aox
an - l
may be computed in the order a~, a~, ... , a~_l ( !orwa1'd deflation) or in
the reverse order (backward deflation). The former is numerically stable if
the root of smallest absolute value is divided off; the latter is numerically
stable if the root of largest absolute value is divided off. A mixed process of
determining the coefficients will be stable for roots of intermediate absolute
value. See Peters and Wilkinson (1971) for details.
Deflation can be avoided altogether as suggested by Maehly (1954). He
expresses the derivative of the deflated polynomial PI (x) as follows:
p~(X) = p'(x) _
x-6
p(x)
.
(X-~l)2'
and substitutes this expression into the Newton iteration function for PI:
326
5 Finding Zeros and Minimum Points by Iterative Methods
In general, we find for the polynomials
that
p'(x) =
J
p'(x)
_
p(x)
. ~ _1_
(x - 6)··· (x - Ej)
(x - 6)··· (x - ';J)
x -';i·
8
The Maehly version of the (straightforward) Newton method for finding
the root ';j+ 1 is therefore as follows:
(5.5.13)
with
<Pj(x) :=x-
p(x)
( ) .
'x - Li=l
j
~
X - ';i
p ()
The advantage of this formula lies in the fact that the iteration given by
<Pj converges quadratically to ';j+ 1 even if the numbers 6, ... , ';.i in <Pj
are not roots of p (the convergence is only local in this case). Thus the
calculation of ~j+1 is not sensitive to the errors incurred in calculating the
previous roots. This technique is an example of zero suppression as opposed
to deflation [Peters and Wilkinson (1971)].
Note that <Pj(x) is not defined if x = ~k' k = L ... , j, is a previous
root of p(x). Such roots cannot be selected as starting values. Instead one
may use the values found by overshooting if the double-step method is
employed.
The following pseUdO-ALGOL program for finding all roots of a polynomial p having only real roots incorporates these features. Function procedures p(z) and p'(z) for the polynomial and its derivative are presumed
available.
zO := starting point Xo;
for j := 1 step 1 until n do
begin ITt := 2: zs := zO;
Iteration: z:= zs; s := 0;
for i := 1 step 1 until j - 1 do
s:=s+lj(z-';i);
zs := p(z); zs := z - ITt X zsj(p'(z) - zs x s);
if zs < z then goto Iteration;
if ITt = 2 then
begin zs := z; ITt := 1: goto Iteration; end;
Ej := z:
end;
5.5 Roots of Polynomials. Application of Newton's Method
327
EXAMPLE. The following example illustrates the advantages of Maehly's method.
The coefficients ai of the polynomial
II (x - Tj) L aix
13
p(x) :=
14
=
14 - i
i=O
j=O
are calculated in general with a relative error of magnitude E. Considerations in
Section 5.8 will show the roots of p(x) to be well conditioned. The following table
shows that the Newton-Maehly method yields the roots of the above polynomial
up to an absolute error of 40E (E = 10- 12 = machine precision). If forward deflation is employed but the roots are divided off as shown, i.e. in order of decreasing
magnitude, then the fifth root is already completely wrong. The absolute errors
below are understood as multiples of E.
(Absolute error) x 1012
Newton-Maehly
Deflation
o
1.0
0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001 953 125
0.000 976 562 5
0.000488281 25
0.000 244 140 625
0.000 122 070 3125
o
3.7 X 102
1.0 X 106
1.4 X 109
6.8
1.1
0.2
4.5
4.0
3.3
39.8
10.0
5.3
> 10 12
o
o
0.4
o
The polynomial Pl(X),
II(x - 213
Pl(X) := p(x) =
x-I
j ),
j=l
has the roots 2- j , j = 1, 2, ... , 13. If ih is produced by numerically dividing
p(x) by x - 1,
ih(x) :=
n( x-I
p(x) ),
then we observe that the roots of Pl(X) are already quite different from those of
Pl (x):
328
5 Finding Zeros and Minimum Points by Iterative Methods
j
Roots of ih(x) computed
by Newton-Maehly
1
2
3
4
5
6
7
8
0.499999996 335
0.250001 00 .. .
0.123697 .. .
0.0924 .. .
-0.0984 .. .
-0.056 .. .
-0.64 .. .
+1.83 .. .
However, if forward deflation is employed and if the roots are divided off
in the sequence of increasing absolute values, starting with the root of smallest
absolute value, then this process will also yield the roots of p(x) up to small
multiples of machine precision [Wilkinson (1963), Peters and Wilkinson (1971)]:
Sequence of dividing off roots of p(x).
13
j
(Absolute error)
12 11 10 9
0.2 0.4
2
5
3
4
3
2
1 14 6 12 6
2
2 12 1
8
7
6
5
1
0
( x 10 12 )
5.6 Sturm Sequences and Bisection Methods
Let p(x) be a real polynomial of degree n,
p(x) = aoxn + alX n- 1 + ... + an,
ao =1= 0.
It is possible [see Henrici (1974) for a thorough treatment of related results]
to determine the number of real roots of p(x) in a specified region by
examining the number of sign changes w(a) for certain points x = a of
a sequence of polynomials Pi(X), i = 0, 1, ... , m, of descending degrees.
Such a sign change happens whenever the sign of a polynomial value differs
from that of its successor. Furthermore, if Pi(a) = 0, then this entry is to
be removed from the sequence of polynomial values before the sign changes
are counted. Suitable sequences of polynomials are the so-called Sturm
sequences.
(5.6.1) Definition. The sequence
p(x) = Po(x), Pl(X), ... , Prn(x)
of real polynomials is a Sturm sequence for the polynomial p(x) if
5.6 Sturm Sequences and Bisection Methods
329
(a) all real roots of Po(x) are simple.
(b) sign PI (.;) = - signp~(';) is a real root of po(x).
(c) For i = 1,2, ... , Tn - 1,
if'; is a real root of Pi(X).
(d) The last polynomial Pm(x) has no real mots.
For such Sturm sequences we have the following
(5.6.2) Theorem. The number of real roots ofp(x) == Po(x) in the interval
a:S x < b equals w(b) -w(a), where w(x) is the number of sign changes of
a Sturm sequence
Po ( x ), ... , Pm (X)
at location x.
Before proving this theorem, we show briefly how a simple recursion can
be used to construct a Sturm sequence for the polynomial p(x), provided
all its real roots are simple. We define initially
Po(x) := p(x),
PI(,r):= -pG(x) = -p'(x),
and form the remaining polynomials PHI (x) recursively, dividing Pi-I(X)
by Pi(X),
(5.6.3)
where
Pi-I(X) = qi(x)Pi(:r) - CiPHI(X),
i = 1, 2, ... ,
[degree of Pi(X)] > [degree of Pi+I(X)]
and the constants Ci > 0 are positive but otherwise arbitrary. This recursion is the well-known Euclidean algorithm. Because the degree of the
polynomials decreases, the algorithm must terminate after m :S n steps:
The final polynomial Pm (X) is a greatest common divisor of the two initial
polynomials p( x) and PI (x) = -P' (:[). (This is the purpose of the Euclidean
algorithm.) If all real roots of p( x) are simple, then p( x) and pi (x) have no
real roots in common. Thus Pm(x) has no real roots and satisfies (5.6.1d). If
Pi (0 = 0, then (5.6.3) gives Pi-I (.;) = -CiPHdO· Assume that PHl (.;) =
O. Then (5.6.3) would imply P,+1(';) = ... = Pm(O = 0, contradicting
Pm(';) i- O. Thus (5.6.1c) is satisfied. The two remaining conditions of
(5.6.1) are immediate.
330
5 Finding Zeros and Minimum Points by Iterative Methods
PROOF OF THEOREI\! (5.6.2). vVe examine how a perturbation of the value
a E IR affects the number of sign changes w(a) of the sequence
po(a), Pl(a), ... , Pm(a).
So long as a is not a root of any of the polynomials Pi (x), i = 0, 1. ... , m,
there is, of course no change. If a is a root of Pi(X), we consider the two
cases i > 0 and i = O.
In the first case, i < m by (5.6.1d) and Pi+l(a) f 0, Pi-l(a) f 0 by
(5.6.1c). Then for a sufficiently small perturbation h > 0, the signs of the
polynomials Pj (a), j = i ~ 1, i, i + 1, display the behavior illustrated in one
of the following four tables:
a~h
a
a+h
i ~ 1
i +1
i ~ 1
0
±
+
+
+
a~h
a
a+h
i ~ 1
i +1
+
+
±
+
+
a
a+h
+
+
+
0
±
a~h
a
a+h
+
+
+
+
i +1
i ~1
0
a~h
i +1
0
±
In each instance, w(a ~ h) = w(a) = w(a + h): the number of sign changes
remains the same.
In the second case, we conclude from (5.6.1b) that the following sign
patterns pertain:
a~h
0
1
a
a+h
0
+
0
1
a~h
a
+
+
0
+
a+h
+
In each instance, w(a ~ h) = w(a) = w(a + h) ~ 1: exactly one sign change
is gained as we pass through a root of Po(x) == p(x).
For a < b and sufficiently small h > 0,
w(b) ~ w(a) = w(b ~ h) ~ w(a ~ h)
indicates the number of roots of p( x) in the interval a ~ h < x < b ~ h.
Since h > 0 can be chosen arbitrarily small, the above difference indicates
the number of roots also in the interval a ::; x < b.
0
5.6 Sturm Sequences and Bisection Methods
331
An important use of Sturm sequences is in bisection methods for determining the eigenvalues of real symmetric matrices which are tridiagonal:
o
J=
o
Recall the characteristic polynomials pi(X) of the principal submatrix
formed by the first i rows and columns of the matrix J, which were mentioned before as satisfying the recursion (5.5.3):
Pl(X):= al - x,
Po(X) := 1,
Pi(X) := (ai - X)Pi-l(X) - P;Pi-2(X),
i = 2, 3, ... , n.
The key observation is that the polynomials
(5.6.4)
Pn(X), Pn-l(X), ... , po(x)
are a Sturm sequence for the characteristic polynomial Pn(x) = det(J - xl)
[note that the polynomials (5.6.4) are indexed differently than in (5.6.1)]
provided the off-diagonal elements Pi, i = 2, ... , n, of the tridiagonal
matrix J are all nonzero. This is readily apparent from the following
(5.6.5) Theorem. Let ai, Pj be real numbers and Pj =I 0 for j = 2, ... , n.
Suppose the polynomials Pi(X), i = 0, ... , n, are defined by the recursion
(5.5.3). Then all roots x~i), k = 1, ... , i, of Pi, i = 1, ... , n, are real and
simple:
X~i) > X~i) > ... > X~i),
and the roots of Pi-l and Pi, respectively, separate each olher strictly:
Xli) > x(i-l) > Xli) > x(i-l) > ... > x(i-l) > xli)
1
1
2
2
,-1,
.
PROOF. We proceed by induction with respect to i. The theorem is plainly
true for i = l. Assume that it is true for some i 2': 1, that is, that the roots
x k(i) and x k(i-I) 0 f Pi an d Pi-I, respec t'lve 1y, sat'IS f y
(5.6.6)
> Xi'
(i)
Xl(il > Xl(i-I) > X2(i) > X2(i-I) > . . . > Xi(i-I)
- 1
By (5.5.3), Pk is of the form Pj (x) = (-I)J x j + .. '. In particular, the degree
of Pk equals k. Thus Pi-l(X) does not change its sign for x > X~i-ll, and
since the roots x > x~i-ll are all simple, (5.6.6) gives immediately
5 Finding Zeros and Minimum Points by Iterative Methods
332
(5.6.7)
sign Pi-I (xki )) = (_l)Hk
for k = 1, 2, ... , i.
Also by (5.5.3),
PHI ( X k(i)) -_ - (32HI Pi-I (X k(i)) ,
k = 1, 2, ... , i.
Since (3;+1 > 0,
sign Pi+l (xki)) = (_l)i+k+1,
k = 1, 2, ... , i,
sign PHI (+00) = (-l)Hl,
sign PHI (-00) = 1
holds, and PH1(X) changes sign in each of the open intervals (xii),+oo),
( -00, Xi(i)) , ((i)
X + ' X(i)) ' k = 1, ... , z. - 1. The roots X(HI) 0 f Pi+I are
k I
k
k
therefore real and simple, and they separate the roots xki ) of Pi:
o
The polynomials of the above theorem,
Pn(X),Pn-I(X), ... ,Po(x),
indeed form a Sturm sequence: By (5.6.5), Pn(x) has simple real roots
6 > 6> ... > ~n' and by (5.6.7)
signpn-l(~k) = (_1)n+k,
signp~(~k) = (_l)n+k+1 = -signpn-I(~k)'
for k = 1, 2, ... , n.
For x = -00 the Sturm sequence (5.6.4) has the sign pattern
+, +, ... ,+.
Thus w( -(0) = 0. By Theorem (5.6.2), w(JJ) indicates the number of roots
+ 1 - i holds if and only if ~i < JJ.
The bisection method for determining the ith root ~i of Pn(x) (6 >
6 > ... > ~n) now is as follows. Start with an interval
~ of Pn(x) with ~ < JJ: w(JJ) ::::: n
[aD, bol
which is known to contain ~, e.g., choose aD < ~n' 6 < boo Then divide
this interval with its midpont and check by means of the Sturm sequence
which of the two subintervals contains ~i' The subinterval which contains
~i is again divided, and so on. More precisely, we form for j = 0, 1, 2, ...
5.7 Bairstow's Method
333
if w(J-tj) ~ n + 1 - i,
if w(J-tj) < n + 1 - i,
if w(J-tj) ~ n + 1 - i,
if w(J-tj) < n + 1 - i.
Then
[aj+l' bj+ll c [aj,bj ],
laj+l - bj+ll = laj - bj 1/ 2,
~i E [aj+l' bj+ll·
The quantities aj increase, and the quantities bj decrease, to the desired
root ~i. The convergence process is linear with convergence rate 0.5. This
method for determining the roots of a real polynomial all roots of which
are real is relatively slow but very accurate. It has the additional advantage
that each root can be determined independently of the others.
5.7 Bairstow's Method
If a real polynomial has any complex conjugate roots, they cannot be found
using the ordinary Newton's method if it is carried out in real arithmetic
and begun at a real starting point: complex starting points and complex
arithmetic must be used. Bairstow's method avoids complex arithmetic.
The method follows from the observation that the roots of a real quadratic
polynomial
x 2 - rx - q
are roots of a given real polynomial
p(x) = aoxn + ... + an,
ao =f. 0,
if and only if p(x) can be divided by x 2 - rx - q without remainder. Now
generally
(5.7.1)
p(x) = Pl(X)(X 2 - rx - q) + Ax + B,
where the degree of Pl is n - 2, and the remainder has been expressed as
Ax + B. The coefficients of the remainder depend, of course, upon rand
q, that is
A = A(r, q), and B = B(r, q),
and the remainder vanishes when r, q satisfy the system
(5.7.2)
A(r, q) = 0,
B(r, q) = o.
334
5 Finding Zeros and Minimum Points by Iterative Methods
Bairstow's method is nothing more than Newton's method (5.3) applied to
(5.7.2):
ri+l]
[qi+l
(5.7.3)
8A
= [ri] _ [ 8r
qi
8B
8r
8A
8q
8B
8q
]-1 [A(ri,qi)].
B(ri' qi)
r=r;
q=qi
In order to carry out (5.7.3), we must first determine the partial derivatives
8A
Ar= 8r'
8A
Aq = 8q'
8B
Br = 8r'
8B
Bq = 8q.
Now (5.7.1) is an identity in r, q, and x. Hence, differentiating with respect
to rand q,
(5.7.4)
After a further division of PI by (X2 - rx - q) we obtain the representation
(5.7.5)
Assuming that X2 - rx - q = 0 has two distinct roots Xo, Xl, it follows that
for X = Xi, i = 0, 1. Therefore the equations
-xi(AlXi + B l ) + ArXi + Br = 0 } ,
-(AlXi + B l ) + AqXi + Bq = 0
follow from (5.7.4).
From the second of these equations we have
(5.7.6)
since Xo =I Xl, and therefore the first equation yields
-x;Aq + xi(Ar - Bq) + Br = 0,
i = 0, 1.
Since x; = rXi + q it follows that
i = 0, 1,
and therefore
i = 0, 1,
5.8 The Sensitivity of Polynomial Roots
335
Ar - Bq - r Aq = 0,
Br - qAq = 0,
since Xo #- Xl.
Putting this together with (5.7.6) yields
Aq = AI,
Bq = B I ,
Ar = r Al + B I ,
Br = q AI·
The values A, B (or AI, Bd can be found by means of a Horner-type
scheme. Using p(x) = aoxn + ... + an, PI(X) = buxn-2 + ... + bn - 2 and
comparing coefficients, we find the follwoing recursion for A, b, bi from
(5.7.2):
bo := ao,
bI := bo r + aI,
bi := bi - 2 q + bi - 1 r + ai,
for
i = 2, 3, ... , n - 2,
A := bn - 3 q + bn - 2 r + an-I,
B := bn- 2 q + an·
Similarly, (5.7 ..5) shows how the bi can be used to find Al and B 1 .
5.8 The Sensitivity of Polynomial Roots
We will consider the condition of a root ~ of a given polynomial p(x). By
this we mean the influence on ~ of a small perturbation of the coefficients
of the polynomial p(x):
p(x) --+ pc(x) = p(x) + cg(x),
where g(x) 1'- 0 is an arbitrary polynomial.
Later on it will be shown [Theorem (6.9.8)] that if ~ is a simple root
of p, then for sufficiently small absolute values of c there exists an analytic function ~(c), with ~(O) = ~, such that ~(c) is a (simple) root of the
perturbed polynomial Pc (x):
p(~(c))
+ cg(~(c)) == O.
From this, by differentiation with respect to c, we get for k := ((0) the
equation
kp'(~(O)) + g(~(O)) = 0,
so that
k = -g(O
pl(~)
.
336
5 Finding Zeros and Minimum Points by Iterative Methods
Thus, to a first order of approximation (i.e., disregarding terms in powers
of E greater than 1), based on the Taylor expansion of ~(E) we have
.
g(~)
~(E) = ~ - Ep'(~)"
(5.8.1)
In the case of a multiple root ~, of order m, it can be shown that
p(x) + Eg(X) has a root of the form
~(E) = ~ + h(E l/m ),
where h(t) is, for small Itl, an analytic function with h(O) = O. Differentiating m times with respect to t and noting that p(~) = p'(~)
p(m-l)(~) = 0, p(m)(~) =I- 0, we obtain from
o ~ Pc;(~(E)) = p(~ + h(t)) + tmg(~ + h(t)),
t m = E,
for k := h'(O) the equation
so that
k = [_ m!g(O ] 11m
p(m)(~)
Again to a first order of approximation,
(5.8.2)
~(E) == ~ + El/m [_ m! g(~)] 11m
p(m)(~)
For m = 1 this formula reduces to (5.8.1) for simple roots.
Let us assume that the polynomial p(x) is given in the usual form
by its coefficients ai. For
g(x) := aiXn-i
the polynomial Pc; (x) is the one which results if the coefficient ai of p(x) is
replaced by ai(I+E). The formula (5.8.2) then yields the following estimate
of the effect on the root ~ of a relative error E of ai:
(5.8.3)
It thus becomes apparent that in the case of multiple roots the changes
in root values ~(E) - ~ are proportional to El/ m , m > 1, whereas in the case
of simple roots they are proportional to just E: multiple roots are always
5.8 The Sensitivity of Polynomial Roots
337
badly conditioned. But simple roots may be badly conditioned too. This
happens if the factor of s in (5.8.3),
~n-i
k(i,~):= a;,(~)
1
1
,
is large compared to ~, which may be the case for seemingly "harmless"
polynomials.
EXAMPLE.
[Wilkinson (1959)].
(1) The roots C,k = k, k = 1, 2, ... , 20, of the polynomial
p(x) = (x - l)(x - 2) ... (x - 20) =
20
L aix
20 - i
i=O
are well separated. For 60 = 20, we find pi (20) = 19!, and replacing the
coefficient al = -(1 + 2 + ... + 20) = -210 by at(l + E) causes an estimated
change of
10
. 210 . 20 19
60(E) - 60 = E
,
"'" E . 0.9 ·10 .
19.
The most drastic changes are caused in C,16 by perturbations of a5. Since
66 = 16 and a5 "'" _1010,
This means that the roots of the polynomial p are so badly conditioned that
even computing with 14-digit arithmetic will not guarantee any correct digit
in 66.
(2) By contrast, the roots of the polynomial
20
20-i .
p ( x ) -- ~
~ aiX
.i=O
IT( x - 2- ,
20
j )
j=l
while not well separated and "accumulating" at zero, are all well conditioned.
For instance, changing a20 to a20(1 + E) causes a variation of 60 which to a
first order of approximation can be bounded as follows:
60 (E) - 60 1-'-1 E
l l<4 I
1 60
(2- 1 - 1)(2-2 - 1)··· (2- 19 - 1) - IE.
More generally, it can be shown for all roots C,j and changes ai --+ ai (1 + E)
that
However, the roots are well conditioned only with respect to small relative
changes of the coefficients ai, and not for small absolute changes. If we
replace a20 = 2- 210 by (120 = a20 + .da2o, .da20 = 2- 4 "("", 10- 14 )
this
338
5 Finding Zeros and Minimum Points by Iterative Methods
can be con_sidered a small absolute change has roots t;i with
then the modified polynomial
In other words, there exists at least one subscript r with I(,./~rl 2: (2 162 +
1)1/20 > 2 8 = 256.
It should be emphasized that the formula (5.8.3) refers only to the
sensitivity of the roots of a polynomial
n
p(x) :=
L ai;r
n- i
;=0
in its usual representation by coefficients. There are other ways of representing polynomials - for instance, as the characteristic polynomials of
tridiagonal matrices by the elements of these matrices [see (5.5.3)]. The
effect on the roots of a change in the parameters of such an alternative representation may differ by an order of magnitude from that described by the
formula (5.8.3).The condition of roots is defined always with a particular
type of representation in mind.
EXA:vIPLE.
In Theorem (6.9.7) it will be shown that, for each real tridiagonal
matrix
whose characteristic polynomial is p(x) == (x -1 )(x - 2) ... (x - 20), small relative
changes in (li and (3·i cause only small relative changes of the roots t;j = j. With
respect to this representation, all roots are well conditioned, although they are
very badly conditioned with respect to the usual representation by coefficients,
as was shown in the previous example. For detailed discussions of this topic see
Peters and Wilkinson (1969) and Wilkinson (1965).
5.9 Interpolation Methods for Determining Roots
The interpolation methods to be discussed in this section are very useful for
determining zeros of arbitrary real functions f(x). Compared to Newton's
method, they have the advantage that the derivatives of f need not be
computed. l\;Ioreover, in a sense yet to be made precise, they converge even
faster than Newton's method.
5.9 Interpolation Methods for Determining Roots
339
The simplest among these methods is known as the method of false
position or regula falsi. It is similar to the bisection method in that two
numbers Xi and ai with
(5.9.1)
are determined at each step. The interval [Xi, ail contains therefore at least
one zero of f, and the values Xi are determined so that they converge
towards one of these zeros. In order to define Xi+l, ai+l, let JLi be the zero
of the interpolating linear function
a;!(Xi) - x;!(ai)
f(Xi) - f(ai)
(5.9.2)
Since f(xi)f(ai) < 0 implies f(Xi) - f(ai) i- 0, JLi is always well defined
and satisfies either Xi < JLi < ai or ai < JLi < Xi. Unless f(Pi) = 0, put
(5.9.3)
If f(JLi) = 0, then the method terminates with JLi the zero. An improvement
on (5.9.3) is due to Dekker and is described in Peters and Wilkinson (1969).
In order to discuss the convergence behavior of the regula falsi, we
assume for simplicity that f" exists and that for some i
(5.9.4a)
(5.9.4b)
(5.9.4c)
Xi < ai,
f(Xi) < 0 < f(ai),
f"(x) 2': 0 for all X E [xi,ai].
Under these assumptions, either f(JLi) = 0 or
f(JLi)f(Xi) > 0,
and consequently
To see this, note that (5.9.4) and the definition of JLi imply immediately
that
Xi < JLi < ai·
5 Finding Zeros and Minimum Points by Iterative Methods
340
The remainder formula (2.1.4.1) for polynomial interpolation yields, for
x E [Xi, ail and a suitable value is E [Xi, ai], the representation
f(x) - p(x) = (x - Xi)(X - ai)f"(iS)/2,
is E [Xi. ail.
By (5.9.4c), f(x) - p(x) ::; 0 in the interval X E [Xi, ad. Thus f(tl,;) ::; 0
since P(fLi) = 0, which was to be shown.
It is now easy to see that (5.9.4) holds for all i ~ io provided it holds
for some io. Therefore ai = a for i ~ io, the Xi form a monotone increasing
sequence bounded by a, and liIlli--+oc Xi = .; exists. Since f is continuous,
and because of (5.9.4) and (5.9.2),
f(.;) ::; 0,
f(a) > 0,
a f---,-(.;"-'--)_---=-:U'---'-(a---,-)
t; = -'C
f(';) - f(a) .
Thus (.; - a)f(';) = O. But.; =1= a, since f(a) > 0 and f(';) ::; 0, and we
conclude f(.;) = O. The values Xi converge to a zero of the function f.
Under the assumptions (5.9.4) the regula falsi method can be formulated with the help of an iteration function:
p(X) := af(x) - xf(a)
f(x) - f(a)
(5.9.5)
Since f(';) = 0,
p'(';) = -(af'(';) - f(a))f(a) + U(a)f'(';) = 1 _ 1'(';) .; - a
f(a)2
f(.;) - f(a)·
By the mean-value theorem, there exist T]l, T]2 such that
1(,;) - f(a) = f'(
T]l ,
)
.; < T]l < a,
f(Xi) - f(';) = f'( )
C
T]2 ,
Xi -<.,
Xi < T]2 < .;.
t
(5.9.6)
<.,-a
Because f//(x) ~ 0 holds for X E [Xi, a], f'(x) increases monotonically in
this interval. Thus (5.9.6), :ri < .;, and f(Xi) < 0 imply immediately that
and therefore
0::; p'(.;) < 1.
In other words, the regula falsi method converges linearly under the assumptions (5.9.4).
The previom; discussion shows that, under the assumption (5.9.4), regula falsi will eventually utilize only the first two of the recursions (5.9.3).
5.9 Interpolation Methods for Determining Roots
341
We will now describe an important variant of regula fasi, called the secant
method, which is based exclusively on the second pair of recursions (5.9.3):
(5.9.7)
Xi+l =
Xi-l](X,) - x;](x,-d
] (Xi) - j .( Xi-l )
,
i = 0,1, ....
In this case, the linear function always interpolates the two latest points of
the interpolation. While the original method (5.9.3) is numerically stable
because it enforces (5.9.1), this may not be the case for the secant method.
Whenever ](Xi) ;::::; ](Xi-J), digits are lost due to cancellation. Moreover,
XHl need not lie in the interval [Xi, Xi-l], and it is only in a sufficiently
small neighborhood of the zero ~ that the secant method is guaranteed to
converge. We will examine the convergence behavior of the secant method
(5.9.7) in such a neighborhood and show that the method has a superlinear
order of convergence. To this end, we subtract ~ from both sides of (5.9.7),
and obtain, using divided differences (2.1.3.5),
(5.9.8)
If ] is twice differentiable, then (2.1.4.3) gives
(5.9.9)
j[Xi-l,Xi] = j'(T}d,
T}l E I[Xi-l,Xi],
j[Xi-l,Xi'~] = ~1"(172)'
T}2 E I[Xi-l,Xi'~]'
For a simple zero ~ of ], ]' (~) #- 0, and there exist a bound M and an
interval J = {x I Ix - ~I ::; e} such that
(5.9.10)
(T}2) I < '1.
I~2 1"
l' (T}l) - "' .
c
lor
any
T}1,T}2
E J.
Letei:= Allxi - ~I and eo, el < min{ 1, cAl}, so that xo, Xl E J. Then,
using (5.9.8) and (5.9.10), it can be easily shown by induction that
(5.9.11 )
eHl ::; eiei-l
for i = 1, 2, ...
and leil ::; min{ 1, eM}, so that Xi E J for all i ~ O. But note that
(5.9.12)
ei ::; K(q')
for i = 0, 1, ... ,
where q
(1 + J5) /2 = 1.618 ... is the positive root of the equation
p.2 - P. - 1 = 0 and K := max{ eo, vel} < 1. The proof is inductive: This
342
5 Finding Zeros and Minimum Points by Iterative Methods
choice of K makes (5.9.12) valid for i = 0 and i = 1. If (5.9.12) holds for
i - I and i, then (5.9.11) yields
since q2 = q + 1. Thus (5.9.12) holds also for i + 1, and must therefore hold
in general.
According to (5.9.12), the secant method converges at least as well as a
method of order q = 1.618 .... Since one step ofthe secant method requires
only one additional function evaluation, two secant steps are at most as expensive as a single Newton step. Since Kqi+2 = (Kqi )q2 = (Kqi )q+!, two
secant steps lead to a method of order q2 = q + 1 = 2.618 .... With comparable effort, the secant method converges locally faster than the Newton
method, which is of order 2.
The secant method suggests the following generalization. Suppose there
are r + 1 different approximations Xi, Xi-I, ... , Xi-r to a zero ~ of f(x).
Determine the interpolating poynomial Q(x) of degree r with
j = 0, 1, ... ,r,
and choose the root of Q(x) closest to Xi as the new approximation Xi+!.
For r = 1 this is the secant method. For r = 2 we obtain the method of
Muller. The methods for r 2: 3 are rarely considered, because there are no
practical formulas for the roots of the interpolating polynomial.
Muller's method has gained a reputation as an efficient and fairly reliable method for finding a zero of a function defined on the complex plane
and, in particular, for finding a simple or multiple root of a polynomial. It
will find real as well as complex roots.
The approximating values Xi may be complex even if the coefficients
and the roots of the polynomial as well as the starting values Xl, X2, X3
are all real. Our exposition employs divided differences [see Section 2.1.3]'
following Traub (1964).
By Newton's interpolation formula (2.1.3.8), the quadratic polynomial
which interpolates a function f (in our case the given polynomial p) at
Xi-2, Xi-I, Xi can be written as
Qi(X) = f[Xi] + f[Xi-l, Xi](X - Xi) + f[Xi-2, Xi-I, Xi](X - Xi-I)(X - Xi),
or
where
Qi(X) = ai(X - Xi)2 + 2bi (x - Xi) + Ci,
ai := J[ Xi-2, Xi-I, Xi],
bi := ~ (J[Xi-l, x;J + f[Xi-2, Xi-I, Xi](Xi - xi-d),
Ci := f[x;].
5.9 Interpolation Methods for Determining Roots
343
If hi is the root of smallest absolute value of the quadratic equation aih2 +
2b i h + Ci = 0, then Xi+l := Xi + hi is the root of Qi(.7:) closest to Xi.
In order to express the smaller root of a quadratic equation in a numerically stable fashion, the reciprocal of the standard solution formula for
quadratic equations should be used. Then Muller's iteration takes the form
(5.9.13)
X +1
,
= X· -
Ci
-=---~;c===
'bi±Jb;-aici'
where the sign of the square root is chosen so as to maximize the absolute
value of the denominator. If ai = 0, then a linear interpolation step as in
the secant method results. If ai = bi = 0, then f(Xi-2) = f(Xi-d = f(x;),
and the iteration has to be restarted with different initial values. In solving
the quadratic equation, complex arithmetic has to be used, even if the
coefficients of the polynomial are all real, since b; - aici may be negative.
Once a new approximate value Xi+l has been found, the function f is
evaluated at xi+l to find
f[Xi+l] := f(xi+d,
j[Xi, Xi+l] :=
j[Xi+l] - f[Xi]
,
xi+l - Xi
These quantities determine the next quadratic interpolating polynomial
Qi+l(X),
It can be shown that the errors Ci := Xi - ~ of Muller's method in the
proximity of a simple zero ~ of f(x) = 0 satisfy
(5.9.14)
Ei+l = EiEi-lEi-2 ( -
f(3)(~)
)
61'(0 + O(E) ,
E = max(!Ei!, !Ei-l!, !Ei-2!)
[compare (5.9.8)]. By an argument analogous to the one used for the secant
method, it can be shown that Muller's method is at least of order q =
l.84 ... , where q is the largest root of the equation f.L3 - f.L2 - f.L - 1 = O.
The secant method can be generalized in a different direction. Consider
again r + 1 approximations Xi, Xi-I, ... , Xi-r to the zero ,; of the function
f (x). If the inverse 9 of the function f exists in a neighborhood of ~ (this
is the case if ~ is a simple zero),
f(g(u)) = y,
g(f(x)) = X,
g(O) =~,
then determining ~ amounts to calculating g(O). Since
g(f(Xj)) = Xj,
j = i, i - I , ... , i - r,
344
5 Finding Zeros and Minimum Points by Iterative Methods
the following suggests itself: determine the interpolating polynomial Q(y)
of degree r or less with Q(f(Xj)) = Xj, j = i, i-I, ... , i - r, and approximate g(O) by Q(O). Then select Xi+l := Q(O) as the next approximate point
to be included in the interpolation. This method is called determining zeros
by inverse interpolation. Note that it does not require solving polynomial
equations at each step even if r ;:::: 3. For r = 1 the secant method results.
The interpolation formulas of Neville and Aitken [see Section 2.1.2] are particularly useful in implementing inverse interpolation methods. The methods are locally convergent of superlinear order. For details see Ostrowski
(1966) and Brent (1973).
In Section 5.5, in connection with Newton's method, we considered two
techniques for determining additional roots: deflation and zero suppression.
Both are applicable to the root-finding methods discussed in this section.
Here zero suppression simply amounts to evaluating the original polynomial
p(x) and numerically dividing it by the value of the product (X-~i)'" (x~k) to calculate the function whose zero is to be determined next. This
process is safe if computation is restricted away from values x which fall in
close neighborhoods of the previously determined roots 6, ... , ~k. Contrary
to deflation, zero suppression works also if the methods of this section are
applied to finding the zeros of an arbitrary function f(x).
5.10 The ..::P-Method of Aitken
The Ll 2 -method of Aitken is one of first and simplest methods for accelerating the convergence of a given convergent sequence of values Xi,
lim Xi =~.
'-+00
These methods transform the sequence {Xi} into a sequence {x~} which
in general converges faster toward ~ than the original sequence of values Xi,
and they apply to finding zeros inasmuch as they can be used to accelerate
the convergence of sequences {Xi} furnished by one of the methods previously discussed. The extrapolation methods discussed in Section 3.5 are
examples of acceleration methods. A systematic account of these methods
can be found in Brezinski (1978) and in Brezinski and Zaglia (1991).
In order to illustrate the Ll2- method, let us assume that the sequence
{Xi} converges towards ~ like a geometric sequence with factor k, JkJ < 1:
i = 0,1, ....
Then k and ~ can be determined from Xi, Xi+l, Xi+2 using the equations
(5.10.1)
5.10 The iF-Method of Aitken
345
By subtraction of these equations,
k -_
Xi+2 -
Xi+l
Xi+l -
Xi
,
and by substitution into the first equation, since k i= 1,
~ =
(Xi X H2 -
X7+1)
---'-----'-'-=----:-
(XH2 -
2XHl
+ Xi)
Using the difference operator LlXi := Xi+l - Xi and noting that Ll2xi
LlXHl - LlXi = Xi+2 - 2XHl + Xi, this can be written as
(5.10.2)
The method is named after this formula. It is based On the expectation
(5.10.2) provides at least an
improved approximation to the limit of the sequence of values Xi even if the
hypothesis that {Xi} is a geometrically convergent sequence should not be
valid.
The Ll 2 -method of Aitken thus consists of generating from a given
sequence {Xi} the transformed sequence of values
~ to be confirmed below ~ that the value
(5.10.3)
I
Xi := Xi -
(XHl -
Xi)2
2XHl
Xi+2 -
+ Xi
The following theorem shows that the x; converge faster than the Xi to ~
as i ---+ 00, provided {Xi} behaves asymptotically as a geometric sequence:
(5.10.4) Theorem. Suppose there exists k,
sequence {Xi}, xi i= ~,
Ikl < 1, such that for the
holds, then the values x; defined by (5.10.3) all exist for sufficiently large i,
and
lim
= o.
x; -.;
i---+CXJ X-i -
~
PROOF. By hypothesis, the errors ei := Xi - ~ satisfy eHl =
follows that
(k + 5;)ei. It
= ei((k + 5H d(k + 5i ) - 2(k + 5;) + 1)
(5.10.5)
=
Xi+! -
Xi
ei((k - I? + ILi),
where
= eHl - ei = ei((k - 1) + ()i).
ILi ---+ 0,
346
5 Finding Zeros and Minimum Points by Iterative ~Iethods
Therefore
:1:i+2 -
2Xi+l + .ri 10
for sufficiently large i, since ei I 0, k I 1 and tli -+ O. Hence (5.10.3)
guarantees that x; is well defined. By (5.10.3) and (.5.10 ..5),
,
((k - 1) + 5y
x, - c: = ei - ei ( k. - 1) 2 + tIi. '
for sufficiently large i, and consequently
?} =0.
lim x;-~ = lim {1- ((k-1)+5 i
i-+oe
(k - 1)2 + tIi
i-+cx) Xi - ~
D
Consider an iteration method with iteration function tjj( x),
(5.10.6)
Xi+l = tjj(Xi),
i = 0, 1,2, ... ,
for determining the zero ~ of a function f(x) = O. The formula (5.10.3) can
then be used to determine, from triples of successive elements Xi, Xi+l, Xi+2
of the sequence {Xi} generated by the above iteration function tjj, a new
sequence {x; }, which hopefully converges faster. However, it appears to
be advantageous to make use of the improved approximations immediately
by putting
(5.10.7)
Yi := tjj(x,),
Zi = tjj(Yi),
Xi+l := Xi -
(Yi - Xi)2
.
Zi - 2Yi + Xi
i = 0, 1, 2, ...
This method is due to Steffensen. (5.10.7) leads to a new iteration function
W,
:Ti+l =
(5.10.8)
w(x) : =
W(Xi),
:rtjj(tjj(x)) - tjj(x)2 .
tjj(tjj(x)) - 2tjj(x) + X
Both iteration functions tjj and W have, in generaL the same fixed points:
(5.10.9) Theorem. w(~) = ~ implies tjj(~) = C:. Conversely, if tjj(O = ~
and tjj'(O 11 exists, then w(c:) = (
PROOF. By the definition (5.10.8) of W,
Thus w(~) = ~ implies tjj(~) = ~. Next we assume that tjj(~) = ~, tjj is
differentiable for x = C:. and tjj'(~) I l. L'Hopital's rule applied to the
definition (5.10.8) gives
5.10 The LP-Method of Aitken
347
1jI(~) = tp(tp(~)) + ~iP'(iP(~))iP'(O - 2iP(OiP'(~)
tp'(iP(O)iP'(O - 2iP'(O + 1
~ + ~p'(~)2 _ 2~iP'(~)
= 1 + iP'(~)2 - 2iP'(O =~.
D
In order to examine the convergence behavior of 1jI in the neighborhood
of a fixed point ~ of 1jI (and iP), we assume that iP is p+ 1 times differentiable
in a neighvorhood of x = ~ and that it defines a method of order p, that is,
(5.10.10)
tp'(~) = ... = iP(P-l)(~) = 0,
iP(p)(O =: piA i- 0
[see Section 5.2]. For p = 1, we require that in additon
A = iP'(O i- 1,
(5.10.11)
and without loss of generality we assume ~ = O. Then for small x
iP(x) = Ax P +
x p +1
(p + I)!
tp(P+l)(Bx),
0 < B < 1.
Thus
iP(x) = AxP + O(x P+1),
iP(p(x)) = A(AxP + O(xP+1))P + O(Ax P + O(Xp+1))p+l)
_ {O(x P2 ),
A2;:r + O(X2),
if p > 1,
if p = 1,
iP(X)2 = (AxP + O(x P+ 1))2 = A 2 x 2p + O(x 2p +1 ).
For p > 1, because of (5.10.8),
For p = 1, however, since Ai-I,
This proves the following
(5.10.13) Theorem. Let iP be an iteration function defZning a method of
order p for computing its fixed point ~. For p > 1 the corresponding iteration
function 1jI of (5.10.8) determines a method of order 2p -- 1 for computing
~. For p = 1 this method is at least of order 2 provided iP' (~) i- 1.
348
5 Finding Zeros and Minimum Points by Iterative Methods
Note that lfr yields a second-order method, that is, a locally quadratically convergent method, even if 1<p'(01 > 1 and the <P-method diverges as
a consequence [see Section 5.2].
It is precisely in the case p = 1 that the method of Steffensen is important. For p > 1 the lfr-method does generally not improve upon the
<p-method. This is readily seen as follows. Let Xi - ~ = E with E sufficiently
small. Neglecting terms of higher order, we find
<P(Xi) - ~ ~ AE P,
<P(<P(Xi)) - ~ ~ AP+1 EP2 ,
whereas for Xi+1 -~, ;(:i+1:= lfr(.Td,
Xi+1 - ~ = _A 2E2p - 1
by (5.10.12) Now
IAp+1 EP21 «: IA2 E2p - 11
for p > 1 and sufficiently small E, which establishes <P(<P(Xi)) as a much
better approximation to ~ than Xi+1 = lfr(xd. For this reason, Steffensen's
method is only recommended for p = 1.
EXAMPLE. The iteration function p(x)
p'(6) = 0,
= x2 has fixed points 6 = 0,6 = 1 with
pl/(6) = 2,
p'(6) = 2.
The iteration Xi.+! = p(x,) converges quadratically to 6 if Ixol < 1. But for
Ixo I > 1 the sequence {Xi} diverges.
The transformation (5.10.8) yields
.
X3
1]f (X) - --,,---- X2 + x - I
WIth
rl.2 =
-1±V5
2
.
We proceed to show that the iteration Xi+! = 1]f(Xi) reaches both fixed points for
suitable choices of the starting point Xo.
For Ixi ::; 0.5, 1]f(x) is a contraction mapping. If Ixol ::; 0.5 then Xi+1 = 1]f(Xi)
converges towards 6 = O. In sufficient proximity of 6 the iteration behaves as
Xi+! = 1]f(x,) "" xf,
whereas the iteration Xi+l = <t>(<t>(Xi)) has 4th-order convergence for Ixol < 1:
Xi+1 = p(p(x;)) = x7.
that
For Ixol > rl, Xi+1 = 1]f(Xi) converges towards 6
tlf'(l) = 0,
= 1. It is readily verified
1]f1/(1) of 0
and that therefore quadratic convergence holds (in spite of the fact that p(x) did
not provide a convergent iteration).
5.11 Minimization Problems without Constraints
349
5.11 Minimization Problems without Constraints
We will consider the following minimization problem for a real function
h: IRn -+ IR of n variables:
(5.11.1)
determine min {h(x) I x E IRn}.
We assume that h has continuous second partial derivatives with respect
to all of its variables, hE c 2 (IRn), and we denote the gradient of h by
g(x) := Dh(x)
T
=
( 8h(x)
8h(x) )
8x 1 ' " ' ' 8x n
T
and the matrix of second derivatives of h, the Hessian of h, by
8 2 h(X))
H(x):= ( 8x i 8xk.
'L,k=l, ... ,n
.
Almost all minimization methods start with a point Xo E IR n and
generate a sequence of points Xk, k :::.. 0, which arc supposed to approximate
the desired minimum point x. In cach step of the iteration, Xk -+ Xk+l, for
which gk = g(Xk) =I 0 [see (5.4.1.3)], a search direction Sk is determined by
some computation which characterizes the method, and the next point
is obtained by a line search; i.e., the stepsize Ak is determined so that
holds at least approximately for Xk+l' Usually the direction Sk is taken to
be a descent direction for h, i.e.,
(5.11.2)
This ensures that only positive values for A need to be considered in the
minimization of lP k (A).
In Section 5.4.1 we established general convergence theorems [(5.4.1.4)
and (5.4.1.8)] which apply to all methods of this type with only mild restrictions. In this section we wish to become acquainted with a few, special
methods, which means primarily that we will discuss a few specific ways of
selecting Sk which have become important in practice. A (local) minimum
point x of h is a zero of g(x); hence we can use any zero-finding method
on the system g(x) = 0 as an approach to finding minimum points of h.
Most important in this regard is Newton's method (5.1.6), at each step
Xk -+ Xk+l = Xk - AkSk of which the Newton direction
350
5 Finding Zeros and Minimum Points by Iterative Methods
is taken as a search direction. This, when used with the constant step length
Ak = 1, has the advantage of defining a locally quadratically convergent
method [Theorem (5.3.2)]. But it has the disadvantage ofrequiring that the
matrix H(Xk) of all second partial derivatives of h be computed at each
step. If n is large and if h is complicated, the computation of H(Xk) can
be very costly. Therefore methods have been devised wherein the matrices
H(Xk)-1 are replaced by suitable matrices H k ,
which are easy to compute. A method is said to be a quasi-Newton method
if for each k ;::: 0 the matrix Hk+1 satisfies the so-called quasi-Newton
equation
(5.11.3)
This equation causes Hk+1 to behave - in the direction (Xk+1 -Xk) -like
the Newton matrix H(Xk+1)-I. For quadratic functions h(x) = ~xT Ax +
bT X + c, for example, where A is an n x n positive definite matrix, the
Newton matrix H(Xk+1)-1 == A-I satisfies (5.11.3) because g(x) == Ax+b.
Further, it seems reasonable to insist that the matrix Hk be positive definite
for each k ;::: O. This will guarantee that the direction Sk = Hkg k will be a
descent direction for the function h if 9k -=I- 0 [see (5.11. 2)] :
gf Sk = gfHkgk > O.
The above demands can, in fact, be met: Generalizing earlier work by
Davidon (1959), Fletcher and Powell (1963), and Broyden (1965, 1967)),
Oren and Luenberger (1974) have found a two-parameter recursion for
producing Hk+l from Hk with the desired properties. Using the notation
and the parameters 'Yk > 0, (h ;::: 0 this recursion has the form
(5.11.4)
where IJr is the one-parameter function
(5.11.5 )
qTHq)ppT
(I-e)
T
lJr(e,H,p,q):=H+ ( l + e --;Y--~H Hq·q H
Tpq
pq
q
q
e
_ --;y- (pqT H + H qpT)
P q
which is due to Broyden (1970). Oren and Luenberger introduced the additional scaling parameter 'Yk > 0 into (5.11.4) allowing a scaling of the
matrix H k .
5.11 Minimization Problems without Constraints
351
Clearly, if Hk is symmetric then so is Hk+l. It is directly verified from
(5.11.5) that H k+1 satisfies the quasi-Newton equation (5.11.3), Hk+lqk =
Pb and it will be shown later [Theorem (5.11.9)] that Ui.11.4) preserves
the positive definiteness of the matrices H k .
The "update function" l[r is only defined if pT q i- O. qTH q i- O. Observe
that H k+ 1 is obtained from Hk by adding a correction of rank :s: 2 to the
matrix fkHk:
rank(Hk+l - H k ) :s: 2.
Hence, (5.11.4) is said to define a rank-two method.
The following special cases are contained in (5.11.4):
(a) fk == L (h == 0: the method of Davidon (1959) and Fletcher and Powell
(1963) ("DFP method");
e
(b) fk == 1, k == 1: the rank-two method of Broyden. Fletcher, Goldfarb,
and Shanno ("BFGS method") [see. for example, Broyden (1979)]:
(c) rk == 1. ek = P[qk/(p[qk - p[Hkqk): the symmetric rank-one method
of Broyden.
The last method is only defined for pI qk
1= qI Hkqk. It is possible that
(h < 0, in which case Hk+l can be indefinite even when Hk is positive definite
[see Theorem (5.11.9)]. If we substitute the value of (h into (5.11.4) and (5.11.5),
we obtain:
which explains why this is referred to as a rank-one method.
A minimization method of the Oren-Luenberger class has the following
form:
(5.11.6).
(0) Start: Choose a starting point Xo E IRn and an n x n positive definite
matrix Ho (e.g. Ho = 1), and set go := g(xo). For k ,= 0, 1, ... obtain
Xk+l, H k+ 1 from Xk. Hk as follows:
(1) If Yk = 0, stop: Xk is at least a stationary point for h. Otherwise
(2) compute Sk := HkYk.
(3) Determine Xk+1 = Xk - AkSk by means of an (approximate) minimization
and set
5 Finding Zeros and Minimum Points by Iterative Methods
352
(4) Choose suitable parameter values rk > 0, 8k 2: 0, and compute H k+1 =
P(8 k , rkHk,Pk. qk) according to (5.11.4).
The method is uniquely defined through the choice of the sequences
{rk}, {8 k } and through the line-search procedure in step (3). The characteristics of the line search can be described with the aid of the parameter
Ji,k defined by
(5.11.7)
If Sk is a descent direction, g[ Sk > 0, then Ji,k is uniquely determined by
Xk+l. If the line search is exact, then ILk = 0, since g[+lSk = -'P~(>'k) = 0,
'Pk(A) := h(xk - ASk). In what follows we assume that
(5.11.8)
Ji,k < 1.
°
If gk f
and Hk is positive definite, it follows from (5.11.8) that Ak > 0;
therefore
q[ Pk = -Ak(gk+l - gk)Sk = Ak(1 - Ji,k)g[ Sk = Ak(1 - Ji,k)g[ Hkgk > 0,
and also qk f 0, q[ Hkqk > 0: the matrix H k+1 is well defined via (5.11.4)
and (5.11.5). The condition (5.11.8) on the line search cannot be satisfied
only if
'P~(A) = -g(Xk - ASk)Tsk:::; 'P~(o) = -g[Sk <
But then
h(xk - ASk) - h(Xk) =
°
for all A 2: 0.
1A 'P~(T)dT :::; -Ag[ Sk < ° for all
A 2: 0,
so that h(xk - ASk) is not bounded below as A -+ +00. The condition
(5.11.8) does not, therefore, pose any actual restriction.
At this point we have proved the first part of the following theorem,
which states that the method (5.11.6) meets our requirements above:
°
(5.11.9) Theorem. If there is a k 2: in (5.11.6) for which Hk is positive
definite, gk f 0, and for which the line search satisfies Ji,k < 1, then for all
rk > 0, 8k 2: the matrix Hk+l = P(fh,rkHk,Pk,qk) is well defined and
positive definite.
°
PROOF. It suffices to consider only the case rk = 1 and to show the following property of the function P given by (5.11.5): Assuming that
H is positive definite,
pT q > 0,
pT H q > 0,
() 2: 0,
then fI:= P(8,H,p,q) is also positive definite.
Let y E lR n , y f 0, be an arbitrary vector, and let H = LLT be the
Choleski decomposition of H [Theorem (4.3.3)]. Using the vectors
5.11 Minimization Problems without Constraints
353
The Cauchy-Schwarz inequality implies that u T u - (u T z;)2 /v Tv 2: 0, with
equality if and only if 'U = av for some a "I 0 (since y "I 0). If u "I av, then
yT Hy > O. If u = av, it follows from the non singularity of Hand L that
o "I y = aq, so that
Because 0 "I y E IR n was arbitrary, H must be positive definite.
0
The following theorem establishees that the method (5.11.6) yields the
minimum point of any quadratic function h: IR n -+ IR after at most n steps,
provided that the line searches are performed exactly. Since each sufficiently
differentiable function h can be approximated arbitrarily closely in a small
enough neighborhood of a local minimum point by a quadratic function,
this property suggests that the method will be rapidly convergent even
when applied to nonquadratic functions.
(5.11.10) Theorem. Let h(x) = ~xT Ax+bT x+c be a quadratic function,
where A is an n x n positive definite matrix. Further, let Xo E IR n , and
let Ho an n x n positive definite matrix. If the method (5.11.6) is used to
minimize h, starting with xo, Ho and carrying out exact line searches (/Ji =
o for all i 2: 0), then sequences Xi, Hi, gi, Pi := XHI - Xi, qi := gHl - gi
are produced with the properties:
(a) There is a smallest index m :s; n with Xm = X = -A-1b such that
Xm = X is the minimum of h, gm = O.
(b) p[ qk = p[ APk = 0 for 0 :s; i "I k :s; m - 1, p[ qi > 0 for 0 :s; i :s; m - 1.
That is. the vectors Pi are A-conjugate.
(c) p[ gk = 0 for all 0 :s; i < k :s; m.
(d) Hkqi = "Yi.kPi for 0 :s; i < k :s; m, with
"YHI "YH2 ... "Yk-l
{
"Yik
. :=
1
for i < k -- 1,
for i = k -- 1.
354
5 Finding Zeros and Minimum Points by Iterative Methods
(e) If m = n, then additionally
lIm = Hn = PDP-lA-I,
where D := diagbo,n, l1.n,·· ·In-1.n), P := (PU,PI,'" 'Pn-d· And if
Ii == L it follows that Hn = A - I .
PROOF. Consider the following conditions for an arbitrary index 1 2: 0:
pT qk = pT APk = 0 for. 0 S i =J k S 1 - 1,
{ pT qi > 0 for 0 SiS I - 1,
(AI)
HI
is positive definite:
(Bd
pT gk = 0 for all 0 S i < k S 1;
(Cd
Hkqi = A/i.kJii
for 0 S i < k S l.
If these conditions hold, and if in addition gl = g(xd =J 0, then we will
show that (A l+1 , B l+ 1, C l+1 ) hold.
Since HI is positive definite by (Az),
Hlgl > 0 and SI := Hlg l =J 0
follow immediately from gl =J O. Because the line search is exact, Al IS a
zero point of
gf
hence PI = -AlSI =J 0 and
(5.11.11)
pf gl+1 = -Alsf gl+1 = 0,
pf ql = -AISf (gl+l - gz) = Alsf gl = Algf Hig i > O.
According to Theorem (5.11.9), H l + 1 is positive definite. Further,
for i < l, because APk = qk, (Bz) and (Cz) hold. This establishes (A I + 1 ).
To prove (Bl+d, we have to show that
gl+1 = 0 for all i < 1 + 1.
(5.11.11) takes care of the case i = 1. For i < 1, we can write
pT
since qj = gj+l - gj by (5.11.6). The above expression vanishes according
to (Bz) and (Al+d. Thus (B1+d holds.
Using (5.11.5), it is immediate that Hl+1ql = Pl. Further (Al+d, (Cz)
imply
qi = 0,
Hlqi = li,lqf Pi = 0 for i < I, so that
pf
qf
5.11 Minimization Problems without Constraints
355
follows for i < I from (5.11.5). Thus (Cl+d holds, too.
We note that (Ao, Bo, Co) hold trivially. So long as Xl satisfies (b)-(d)
and g(xz) -=/:- 0, we can use the above results to generate, implicitly, a point
XI+1 which also satisfies (b)-(d). The sequence xo, Xl, ... must terminate:
(Az) could only hold for I <::::: n, since the I vectors Po, ... , Pl-l are linearly
independent [if 2:i<I-1 OOiPi = 0, then multiplying by prA for k = 0, ... ,
1- 1, gives D:kPr Apk = 0 ===? D:k = 0 from (AI)]. No more than n vectors
in .IRn can be linearly independent. When the sequence terminates, say at
I = m, 0 <::::: m <::::: n, it must do so because
i.e. (a) holds. In case m = n, (d) implies
for the matrices P:= (po, ... ,Pn-l), Q = (qO, ... ,qn-l). Since AP = Q,
the nonsingularity of P implies
Hn = PDP-1A-1,
which proves (e) and, hence, the theorem.
D
We will now discuss briefly how to choose the parameters "/k, ()k in
order to obtain as good a method as possible. Theorem (5.11.10e) seems to
imply that the choice "/k == 1 would be good, because it appears to ensure
that limi Hi = A- l (in general this is true for nonquadratic functions
only under certain additional assumptions), which suggests that the quasiNewton method will converge "in the same way as the Newton method".
Practical experience indicates that the choice
"/k == 1,
()k == 1
(BFGS method)
is good [see Dixon (1971)]. Oren and Spedicato (1976) have been able to
give an upper bound pbk' ()k) on the quotient cond(Hk+1)/ cond(Hk) of
the condition numbers (with respect to the Euclidean norm) of Hk+l and
Hk. Minimizing P with respect to "/k and ()k leads to the prescription
if
c
<::::: 1
a
-
a
if -::::: 1
T
if
a
c
- <::::: 1 <:::::T
a
Here we have used
c
a
()k := 0;
then choose
"/k := -,
then choose
"/k := -,
()k := 1;
then choose
"/k:= 1,
()k :=
a
T
cr(c - a)
lOT - a 2 .
356
5 Finding Zeros and Minimum Points by Iterative Methods
~._ pTH-lp
C.-
kkk,
Another promising suggestion has been made by Davidon (1975). He was
able to show that the choice
(J(c-(J;
fA := {
lOT -
(J
(J
2cT
if (J < - - c+T'
otherwise,
(J-T
used with rk == 1 in the method (5.1l.5), minimizes the quotient Arnax/ Arnin
of the largest to the smallest eigenvalues satisfying the generalized eigenproblem
determine A E <C, Y ¥ 0 so that Hk+1y = AHky,
that is det(Hk 1 H k + 1 - A1) = O.
Theorem (5.1l.10) suggests that methods of the type (5.11.4) will converge quickly even on nonquadratic functions. This has been proved formally for some individual methods of the Oren-Luenbergcr class. These
results rest, for the most part, on the local behavior in a sufficiently small
neighborhood U(x) of a local minimum x of h under the following assumptions:
(5.11.12a) H(x) is positive definite,
(5.1l.12b) H(x) is Lipschitz continuous at x = x, i.e., there is a A with
IIH(x) - H(x)11 'S Jllix - xii for all x E U(x).
Further, certain mild restrictions have to be placed on the line search, for
example:
(5.11.13) For given constants 0 < Cl < C2 < 1. Cl 'S ~, Xk+l = Xk - AkSk is
chosen so that
h(Xk+l) s; h(Xk) - C1 Akgl Sk,
T
T
gk+l Sk s; c2gk Sk,
or
(5.1l.14)
Under the conditions (5.11.12), (5.11.13) Powell (1975) was able to show
for the BFGS method (rk == 1, 8k == 1): There is a neighborhood
V(x) ~ U(x)
such that the method is superlinearly convergent for all positive definite
initial matrices Ho and all Xo E V (x). That is,
5.11 Minimization Problems without Constraints
357
·
Ilxi+1 - xii = 0
11m
i-+oo Ilxi - xii
'
as long as Xi =f. x for all i ::::0: O. In his result the stepsize Ak == 1 satisfies
(5.11.13) for all k ::::0: 0 sufficiently large.
Another convergence result has been established for the subclass of the
Oren-Luenberger class (5.11.4) which additionally satisfies
TH-· 1
T
Pkqk <
< ~ Pk
qfHkqk - "Yk pfqk .
(5.11.15)
For this subclass, using (5.11.12), (5.11.14), and the additional demand
that the line search be asymptotically exact, i.e.
IILkl ~ cl19kll
for large enough k,
it can be shown [Stoer (1977), Baptist and Stoer (1977)] that
limxk = x,
k
IIXk+n - xii ~ "Yllxk - xW
for all k ::::0: O.
for all positive definite initial matrices Ho and for Ilxo - xii small enough.
The proofs of all of the above convergence results are long and difficult.
The following simple example illustrates the typical beh.aviors of the DFP
method, the BFGS method, and the steepest-descent method (Sk := gk in each
iteration step).
We let
2(
2
2
(2 + a:)2
()
hx,y
:=100(y 3-x)-x (3+x)) +--(--)2'
1 + 2+x
which has the minimum point x := -2, j) := 0.8942719099 ... , hex, y) = O. For
each method we take
Xo := 0.1,
yo := 4.2
as the starting point and let
for the BFGS and DFP methods.
Using the same line-search procedure for each method, we obtained the following results on a machine with eps = 10- 11 :
BFGS
DFP
N
F
54
374
47
568
E
S 10- 11
S 10- 11
Steepest descent
201
1248
0.7
358
5 Finding Zeros and l\linimum Points by Iterative Methods
N denotes the number of iteration steps (Xk, yA:) --t (Xk+l, xk+d: F denotes the
number of evaluations of h: and c:= Ilg(xJV,YN)11 was the accuracy attained at
termination.
The steepest descent method is clearly inferior to both the DFP and BFGS
methods. The DFP method is slightly inferior to the BFGS method. (The line
searches were not done in a particularly efficient manner. More than 6 function
evaluations were needed, on the average, for each iteration step. It is possible to
do better, but the same relative performance would have been evident among the
methods even with a better line-search procedure.)
EXERCISES FOR CHAPTER 5
1.
Let the continuously differentiable iteration function cf>: IR n --t IR n be given.
If
lub(Dcf>(x)) ~ K < 1 for all x E IR n ,
then the conditions for Theorem (5.2.2) are fulfilled for all x, Y E IRn.
2.
Show that the iteration
Xk-l
=
COS(Xk)
converges to the fixed point ~ = cos ~ for all Xo E IR n .
3.
Give a locally convergent method for determining the fixed point ~ = ~ of
cf>(x) := x 3 + x - 2. (Do not use the Aitken transformation.)
4.
The polynomial p(x) = x 3 - x 2 - x-I has its only positive root near ~ =
1.839 .... \Vithout using pi (x), construct an iteration function cf>(x) having
the fixed point ~ = cf>(~) and having the property that the iteration converges
for any starting point Xo > O.
5.
Show that
lim Xi = 2.
i-+oc
where
Xo := O.
6.
.'l:i+1:= /2
+ x.,.
Let the function f: IR2 --t IR2
f(
z)
=
[
exp(x 2 + y2) - 3 ]
x+y-sin(3(x+y))'
be given. Compute the first derivative Df(z). For which z is Df(z) singular?
7.
8.
The polynomial po(x) = X4 -8x 3 +24x2 -32:r+a4 has a quadruple root x = 2
for a4 = 16. To first approximation, where are its roots if a4 = 16 ± 10- 4 ?
Consider the sequence {Zi } with Zi+1 = cf>(z;}, cf>: IR --t IR. Any fixed point
~ of cf> is a zero of
F(z) := z - cf>(z).
Show that if one step of regula falsi is applied to F (z) with
EXERCISES FOR CHAPTER 5
359
then one obtains Steffensen's (or Aitken's) method for transforming the sequence {Zi } into {JLi }.
9.
Let f: IR -+ IR have a single, simple zero. Show that, if <p(~c) := x - f(x) and
the recursion (5.10.7) are used, the result is the quasi-Newton method
n = 0.1, ....
Show that this iteration converges at least quadratically to simple <leros and
linearly to multiple zeros.
Hint: (5.10.13).
10. Calculate x = 1/ a for any given a Ie 0 without using division. For which
starting values Xo will the method converge?
11. Give an iterative method for computing y'a, a > 0, which converges locally
in second order. (The method may only use the four fundamental arithmetic
operations. )
12. Let A be a nonsingular matrix and {Xk }k=O.l, .. be a sequene of matrices
satisfying
(Schulz's method).
(a) Show that lub(I - AXo ) < 1 is sufficient to ensure the convergence of
{Xk} to A-I Further. Ek := I - AXk satisfies
(b) Show that Schulz's method is locally quadratically convergent.
(c) If, in addition, AXo = XoA then
13. Let the function f: IR -+ IR be twice continuously differentiable for all x E
U(~) := {x I Ix - ~I :::; r} in a neighborhood of a simple zero ( f(~) = O.
Show that the sequence {x n } generated by
y:= Xn - f'(Xn)-l f(x n ),
Xn+l := Y - !'(X 71 )-l f(y),
converges locally at least cubically to ~.
14. Let the function f: IR -+ IR have the zero ~. Let f be twice continuously
differentiable and satisfy f' (x) Ie 0 for all x E I := {x IJ: - ~ I <::: r}. Define
the iteration
where q(x) := (f(x + f(x)) - f(x))/ f(x) for x Ie~, x E I. and show:
(a) For all x
360
5 Finding Zeros and Minimum Points by Iterative Methods
f(x+f(x)) -f(x) = f(x)
11
f'(x+tf(x))dt.
(b) The function q (x) can be continuously differentiably extended to x = ~
using
q(x) =
.i
l
f'(x+tf(x))dt.
(c) There is a constant c so that
Iq(x) - f'(x)1 S clf(x)l·
Give an interpretation of this estimate.
(d) The iteration has the form Xk+l = CP(Xk) with a differentiable function
CP. Give cP and show cp(!;) = ~, cp'(O = O. Hence the iteration converges
with second order to ~.
15. Let the function f: IR n --+ IR n satisfy the assumptions
(1) f(x) is continuously differentiable for all x E IRn;
(2) for all x E IR n , Df(x)-l exists:
(3) x T f(x) 2> ,(llxll) Ilxll for all x E 1R", where ,(p) is a continuous function
for p 2> 0 and satisfies ,(p) --+ +00 as p --+ 00;
(4) for all x, hE IR n
hT Df(x)h 2> J.i(llxll)llhI1 2 ,
with /1(p) monotone increasing in P > 0, J.i(0) = 0, and
1=
J.i(p )dp = +00.
Then (1), (2), (3) or (1), (2), (4) are enough to ensure that conditions (a)-(c)
of Theorem (5.4.2.5) are satisfied. For (4), use the Taylor expansion
f(x + h) - f(x) =
11
Df(x + th)h dt.
16. Give the recursion formula for computing the values AI, Bl which appear in
the Bairstow method.
17. (Tornheim 1964). Consider the scalar. multistep iteration function of r + 1
variables 'P( Xo, Xl, ... x r ) defining the iteration
i = 0, 1, ... ,
where Yo, Y-l, ... , Y-r are specified. Let 'P have partial derivatives of at least
order r + 1. y* is called a fixed point of 'P if for all k = 1, ... , l' and arbitrary
Xi, i 1= k, it follows that
(* )
Y* = 'P(xo, ... , Xk-l, y*, Xk+l,···, x,.).
Show that
(a) The partial derivatives
References for Chapter 5
S
(
)
361
alsl<p(Xo, ... ,Xr)
a Sr'
Xo Xs ... Xr
D <p Xo, ... ,Xr := a 80a 81
where s = (so, ... ,Sr), Isl:= L:~=OSj, satisfy DS<p(y*, ... ,yO) = 0 if for
some j, 0 :S j :S r, Sj = O. [Note that (*) holds identically in Xo,
Xk-l, Xk+l, ... , Xr for all k.]
(b) In a suitably small neighborhood of y* the recursion
(**)
holds with Ci := IYi - y* I and with an approporiate constant c.
Further, give:
(c) the general solution of the recursion (**) and the local convergence order
of the sequence Yi.
18. Prove (5.9.14).
References for Chapter 5
Baptist, P., Stoer, J. (1977): On the relation between quadratic termination and
convergence properties of minimization algorithms. Part II. Applications. Numer. Math. 28, 367-391.
Bauer, F.L. (1956): Beitrage zur Entwicklung numerischer Verfahren fUr programmgesteuerte Rechenalagen. II. Direkte Faktorisierung eines Polynoms. Bayer.
Akad. Wiss. Math. Natur. Kl. S.B. 163-203.
Brent, R.P. (1973): Algorithms for Minimization without Derivatives. Englewood
Cliffs, N.J.: Prentice-Hall.
Brezinski, C. (1978): Algorithmes d'Acceleration de la Convergence. Etude Numerique. Paris: Edition Technip.
- - - , Zaglia, R. M. (1991): Extrapolation Methods, Theory and Practice. Amsterdam: North Holland.
Broyden, C.G. (1965): A class of methods for solving nonlinear simultaneous
equations. Math. Comput. 19, 577-593.
- - - (1967): Quasi-Newton-methods and their application to function minimization. Math. Comput. 21, 368-381.
- - - (1970): The convergence of a class of double rank minimization algorithms. 1. General considerations, 2. The new algorithm. J. Inst. Math. Appl.
6, 76-90, 222-231.
- - - , C.G., Dennis, J.E., More, J.J. (1970): On the local and superlinear convergence of quasi-Newton methods. J. Inst. Math. Appl. 12, 223-245.
Collatz, L. (1968): Funktionalanalysis und numerische Mathematik. Die Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen. Bd. 120.
Berlin, Heidelberg, New York: Springer-Verlag. English edition: Functional
Analysis and Numerical Analysis. New York: Academic Press (1966).
Collatz, L., Wetterling, W. (1971): Optimierungsaufgaben. Berlin, Heidelberg,
New York: Springer-Verlag.
Davidon, W.C. (1959): Variable metric methods for minimization. Argonne National Laboratory Report ANL-5990.
362
5 Finding Zeros and Minimum Points by Iterative Methods
- - - (1975): Optimally conditioned optimization algorithms without line
searches. Math. Programming 9, 1-30.
Deuflhard, P. (1974): A modified Newton method for the solution of ill-conditioned
systems of nonlinear equations with applications to multiple shooting. Numer.
Math. 22, 289-315.
Dixon, L.C.W. (1971): The choice of step length, a crucial factor in the performance of variable metric algorithms. In: Numerical Methods for Nonlinear
Optimization. F.A. Lootsma, ed., 149-170. New York: Academic Press.
Fletcher, R., Powell, M.J.D. (1963): A rapidly convergent descent method for
minimization. Comput. J. 6, 163-168.
- - - (1980): Unconstrained Optimization. New York: Wiley.
- - - (1981): Constrained Optimization. New York: Wiley.
Gill, P.E., Golub, G.H., Murray, W., Saunders, M.A. (1974): Methods for modifying matrix factorizations. Math. Comput. 28, 505-535.
Griewank, A. (2000): Evaluating Derivatives, Principles and Techniques of Algorithmic Differentiation. Frontiers in Appl. Math., Vol 19. Philadelphia: SIAM.
Henrici, P. (1974): Applied and Computional Complex Analysis. Vol. 1. New York:
Wiley.
Himmelblau, D.M. (1972): Applied Nonlinear Programming. New York:
McGraw-Hill.
Householder, A.S. (1970): The Numerical Treatment of a Single Non-linear Equation. New York: McGraw-Hill.
Jenkins, M.A., Traub, J.F. (1970): A three-stage variable-shift iteration for polynomial zeros and its relation to generalized Rayleigh iteration. Numer. Math.
14, 252-263.
Luenberger, D.G. (1973): Introduction to Linear and Nonlinear Programming.
Reading, Mass.: Addison-Wesley.
Maehly, H. (1954): Zur iterativen Auflosung algebraischer Gleichungen. Z. Angew.
Math. Physik 5, 260-263.
Marden, M. (1966): Geometry of Polynomials. Providence, R.I.: Amer. Math.
Soc.
Nickel, K. (1966): Die numerische Berechnung der Wurzeln eines Polynoms. Numer. Math. 9, 80-98.
Oren, S.S., Luenberger, D.G. (1974): Self-scaling variable metric (SSVM) algorithms. 1. Criteria and sufficient conditions for scaling a class of algorithms.
Manage. Sci. 20, 845-862.
- - - , Spedicato, E. (1976): Optimal conditioning of self-scaling variable metric
algortihms. Math. Programming 10, 70-90.
Ortega, J.M., Rheinboldt, W.C. (1970): Iterative Solution of Non-linear Equations in Several Variables. New York: Academic Press.
Ostrowski, A.M. (1973): Solution of Equations in Euclidean and Banach Spaces.
New York: Academic Press.
Peters, G., Wilkinson, J.H. (1969): Eigenvalues of Ax = >..Bx with band symmetric A and B. Comput. J. 12,398-404.
- - - , - - - (1971): Practical problems arising in the solution of polynomial
equations. J. Inst. Math. Appl. 8, 16-35.
Powell, M.J.D. (1975): Some global convergence properties of a variable metric
algorithm for minimization without exact line searches. In: Proc. AMS Symposium on Nonlinear Programming 1975. Amer. Math. Soc. 1976.
Spellucci, P. (1993): Numerische Verfahren der nichtlinearen Optimierung. Basel:
Birkhiiuser.
References for Chapter 5
363
Stoer, J. (1975): On the convergence rate of imperfect minimization algorithms
in Broyden's ,B-class. Math. Programming 9, 313-335.
- - - (1977): On the relation between quadratic termination and convergence
properties of minimization algorithms. Part I. Theory. Numer. Math. 28,343366.
Tornheim, L. (1964): Convergence of multipoint methods. J. Assoc. Comput.
Mach. 11, 210-220.
Traub, J.F. (1964): Iterative Methods for the Solution of Equations. Englewood
Cliffs, NJ: Prentice-Hall.
Wilkinson, J .H. (1959): The evaluation ofthe zeros of ill-conditioned polynomials.
Part I. Numer. Math. 1, 150-180.
- - - (1963): Rounding Errors in Algebraic Processes. Englewood Cliffs, N.J.:
Prentice Hall.
- - - (1965): The Algebraic Eigenvalue Problem. Oxford: Clarendon Press.
6 Eigenvalue Problems
6.0 Introduction
l\1any practical problems in engineering and physics lead to eigenvalue
problems. Typically, in all these problems, an overdetermined system of
equations is given, say n + 1 equations for n unknowns ~l' ... , ~n of the
form
(6.0.1)
F(x; A) :=== [
h(6, .... (n; A)
:
1=
0
In+l(6'''''~n: A)
in which the functions Ii also depend on an additional parameter A. Usually,
(6.0.1) has a solution x = [6'''''~n]T only for specific values A = Ai,
i = 1, 2, ... , of this parameter. These values Ai are called eigenvalues of
the eigenvalue problem (6.0.1) and a corresponding solution x = X(A;) of
(6.0.1), eigensolution belonging to the eigenvalue Ai.
Eigenvalue problems of this general form occur, e.g., in the coutext of
boundary value problems for differential equations [see Section 7.3]. In this
chapter we consider only the special class of algebraic eigenval'ue problems,
where all but one of the Ii in (6.0.1) depend linearly on x and A, and which
have the following form: Given real or complex n x n matrices A und B,
find a number A E <e such that the system of n + 1 equations
(A - AB)x = O.
(6.0.2)
has a solution x E <en. Clearly, this problem is equivalent to finding numbers
A E <e such that there is a nontrivial vector x E <en, x i- 0, with
(6.0.3)
Ax = ARl:.
For arbitrary A and B, this problem is still very general, and we treat it
only briefly in Section 6.8. The main portion of Chapter 6 is devoted to the
special case of (6.0.3) where B := I is the identity matrix: For an n x n
6.0 Introduction
365
matrix A, find numbers A E ~ (the eigenvalues of A) and nontrivial vectors
x E ~n (the eigenvectors of A belonging to A) such that
(6.0.4)
Ax = AX,
X
# O.
Sections 6.1-6.4 provide the main theoretical results on the eigenvalue problem (6.0.4) for general matrices A. In particular, we describe various normal
forms of a matrix A connected with its eigenvalues, additional results on
the eigenvalue problem for important special classes of matrices A (such as
Hermitian and normal matrices) and the basic facts on the singular values
(Ji of a matrix A, i.e., the eigenvalues (J; of AHA and AA H , respectively.
The methods for actually computing the eigenvalues and eigenvectors
of a matrix A usually are preceded by a reduction step, in which the matrix
A is transformed to a "similar" matrix B having the same eigenvalues as
A. The matrix (B = (b ik ) has a simpler structure than A. (B is either a
tridiagonal matrix, bik = 0 for Ii - kl > 1, or a Hessenberg matrix, bik = 0
for i :::: k + 2), so that the standard methods for computing eigenvalues and
eigenvectors are computationally less expensive when applied to B than
when applied to A.. Various reduction algorithms are described in Section
6.5 and its subsections.
The main algorithms for actually computing eigenvalues and eigenvectors are presented in Section 6.6, among others the LR algorithm of
Rutishauser (Section 6.6.4) and the powerful QR algorithm of Francis (Sections 6.6.4 and 6.6.5). Related to the QR algorithm is the method of Golub
and Reinsch for computing the singular values of matrices, which is described in Section 6.7. After touching briefly on the more gmeral eigenvalue
problem (6.0.3) in Section 6.8, the chapter closes (Section 6.9) with a description of several useful estimates for eigenvalues. These may serve, e.g,
to locate the eigenvalues of a matrix and to study their sensitivity with respect to small perturbations. A detailed treatment of all numerical aspects
to the eigenvalue problem for matrices is given in the excellent monograph
of Wilkinson (1965), and in Golub and van Loan (1983); the eigenvalue
problem for symmetric matrices is treated in Parlett (1980). ALGOL programs for all algorithms described in this chapter are found in Wilkinson
and Reinsch (1971), and FORTRAN programs in the "EISPACK Guide" of
Smith et a1. (1976) and its extension by Garbow et a1. (1977). All these
methods have been incorporated in the widely used program systems like
MATLAB, MATHEMATICA and ]'v[APLE.
366
6 Eigenvalue Problems
6.1 Basic Facts on Eigenvalues
In the following we study the problem (6.0.4), i.e., given a real or complex
n x n matrix A, find a number A E <e such that the linear homogeneous
system of equations
(6.1.1)
(A - AI)X = D
has a nontrivial solution x f D .
(6.1.2) Definition. A number A E <e is called an eigenvalue of the matrix
A if there is a vector x f 0 such that Ax = AX. Every such vector is called
a (right) eigenvector of A associated with the eigenvalue A. The set of all
eigenvalues is called the spectrum of A.
The set
L(A) := {x I (A - AI)X = D}
forms a linear subspace of <en of dimension
p(A) = n - rank (A - AI),
and a number A E <e is an eigenvalue of A precisely when L(A) f D, i.e.,
when p(A) > 0 and thus A - AI is singular:
det(A - AI) = o.
It is easily seen that ip(fl) := det(A - flI) is an nth-degree polynomial of
the form
It is called the
(6.1.3)
characteristic polynomial
of the matrix A. Its zeros are the eigenvalues of A. If A1, ... , Ak are the
distinct zeros of ip(fl), then ip can be presented in the form
The integer O"i, which we also denote by dAi) = O"i, is called the multiplicity
of the eigenvalue Ai, more precisely, its algebraic multiplicity.
The eigenvectors associated with the eigenvalue A are not uniquely
determined: together with the zero vector, they fill precisely the linear
subspace L(A) of <en. Thus,
(6.1.4). If x and yare eigenvectors belonging to the eigenvalue A of the
matrix A, then so is every linear combination ax + f3y f D.
6.1 Basic Facts on Eigenvalues
367
The integer p(>..) = dim L(>..) specifies the maximum number of linearly
independent eigenvectors associated with the eigenvalue >... It is therefore
also called the
geometric mnltiplicity of the eigenvalne >...
One should not confuse it with the algebraic multiplicity 0'(>").
EXAMPLES. The diagonal matrix of order n.
D := ),1,
has the characteristic polynomial cp(p,) = det(D - p,I) = (). - ~i)n. ). is the only
eigenvalue, and every vector x E <c n , x =1= 0, is an eigenvector: L()') = <c n ;
furthermore. iT().) = n = p().). The nth-order matrix
.\
1
o
).
(6.1.5)
o
1
).
also has the characteristic polynomial cp(~) = ()._~)n and), as its only eigenvalue,
with iT().) = n. The rank of C n ()') - AI, however, is now equal to n - 1; thus
p().) = n - (n - 1) = 1, and
L()') = {aella E <C},
el = 1st coordinate vector.
Among further simple properties of eigenvalues we note:
(6.1.6). Let p(tL) = 10 + 11~ + ... + Im~m be an arbitrary polynomial, and
A a matrix of order n. Defining the matrix p(A) by
p(A) := 101 + 11A + ... + ImAm,
the matrix p(A) has the eigenvector x corresponding to the eigenvalne p(>..)
if>.. is an eigenvalne of A and x a corresponding eigenvector. In particnlar,
etA has the eigenvalne et>.., and A + TIthe eigenvalne >.. + T.
From Ax = >..x one obtains immediately A 2 x = A(Ax) = >..Ax =
>..2x, and in general Ai.1: = >..iX. Thus
PROOF.
p(A)x = (roI + 11A + ... + ,mArn)x
=
(ro + ,1>" + ... + Im>..rn)x = p(>..):r.
Furthermore, from
det(A - >..1) = det«A - >..1)T) = det(A T - >..1),
det(A H - 5.1) = det«A - >..1)H) = det «A - >..1)T) = det(A - >"1),
o
368
6 Eigenvalue Problems
there follows:
(6.1.7). If A is an eigenvalue of A, then A is also an eigenvalue of AT, and
), an eigenvalue of AH.
Between the corresponding eigenvector x, y, z,
Ax = AX,
AT Y = AY,
AH z = ),z,
merely the trivial relationship y = z holds, in view of A H = ;F. In particular, there is no simple relationship, in general, between X and y or x
and z, Because of yT = zH and zH A = AZ H , one calls zH or yT, also
a left eigenvector left eigenvector DF associated with eigenvalue A of A.
Furthermore, if x -lOis an eigenvector corresponding to the eigenvalue A,
Ax = AX,
T an arbitrary nonsingular n x n matrix, and if one defines y .- T-Ix,
then
T- I ATy = T- I Ax = AT-Ix = AY, Y -I 0,
i.e., y is an eigenvector of the transformed matrix
associated with the same eigenvalue A. Such transformations are called
similarity transformations,
and B is said to be similar to A, A rv B. One easily shows that similarity
of matrices is an equivalence relation, i.e.,
ArvA,
A rv
A
B,
rv
B
B
rv
=?
B
C
=?
rv
A,
A
rv
C.
Similar matrices have not only the same eigenvalues, but also the same
characteristic polynomial. Indeed,
det(T- I AT - f..LI) = det(T- I (A - p1)T)
=
det(T- I ) det(A - f..LI) det(T)
= det(A - f..LI).
Moreover, the integers peA), o-(A) remain the same: For o"(A), this follows
from the invariance of the characteristic polynomial; for p( A), from the fact
that, T being nonsingular, the vectors Xl, ... , xp are linearly independent
6.2 The Jordan Normal Form of a Matrix
369
if and only if the corresponding vectors Yi := T-IXi, i = 1, ... , p, are
linearly independent.
In the most important methods for calculating eigenvalues and eigenvectors of a matrix A, one first performs a sequence of similarity transformations.
A(O) :=A,
A(i) := T i- l A(i-l)Ti ,
i = 1, 2, ... ,
in order to gradually transform the matrix A into a matrix of simpler form,
whose eigenvalues and eigenvectors can then be determined more easily.
6.2 The Jordan Normal Form of a Matrix
We remarked already in the previous section that for an eigenvalue A of an
n x n matrix A, the multiplicity a(A) of A as a zero of the characteristic
polynomial need not coincide with p(A), the maximum number of linearly
independent eigenvectors belonging to A. It is possible, however, to prove
the following inequality:
1 :s; p(A) :s; a(A) :s; n.
(6.2.1)
PROOF. We prove only the nontrivial part p(A) :s; a(A). Let P := p(A), and
let Xl, ... , xp be linearly independent eigenvectors associated with A:
AXi = AXi,
i = 1, ... , p.
We select n - P additional linearly independent vectors Xi E (Cn, i = P + 1,
... ) n, such that the Xi, i = 1, ... , n, form a basis in (Cn. Then the square
matrix T := [Xl) ... ,xn ] with columns Xi is nonsingular. For i = 1, ... , p,
in view of Tei = Xi, ei = T-IXi, we now have
T- I ATei = T- I AXi = AT-IXi = Aei.
T- I AT, therefore, has the form
T-IAT =
A
0
*
*
0
A
*
*
*
*
*
*
0
'"-,,--'
p
[ml,
and for the characteristic polynomial of A, or of T- I AT, we obtain
370
6 Eigenvalue Problems
i.p(p.) = det(A - pJ) = det(T- l AT - p.!) = (A - p.)P . det(C - p.!).
i.p is divisible by (A-fl)P; hence A is a zero of 'P of multiplicity at least p.
D
In the example of the previous section we already introduced the 1/ x 1/
matrices [see (6.l.5)]
""(Al ~
l
Ao 1
:1]
A
and showed that 1 = p(A) < cr(A) = 1/ (if v> 1) for the (only) eigenvalue
A of these matrices. The unique eigenvector (up to scalar multiples) is el,
and for the coordinate vectors ei we have generally
(Cv(A) - AI)ei = ei-l,
(6.2.2)
i = v, v - L ... ,2,
(Cv(A) - A!)el = O.
Setting formally ek := 0 for k :::; 0, then for all i. j ~ L
(Cv(A) - AJ)'ej = ej-i,
and thus
(6.2.3)
The significance of the matrices Cv(A) lies in the fact that they are used to
build the so-called Jordan normal form J of a matrix. Indeed, the following
fundamental theorem holds, which we state without proof:
(6.2.4) Theorem. Let A be an arbitrary n x n matri:r and A!.. ... ! Ak its
distinct eigenvalues, with geometric and algebraic multiplicities p(Ai) and
cr(Ai), respectively. i = 1. ... , k. Then for each of the eigenvalues Ai. i = 1,
... , k, there exists p(A,) natural numbers
j = 1, 2, .... p(Ai). with
vj').
\ )
cr ( /\i
= VI(i) + 1/2( i) + ... + v p(i)(>-,)
and there exists a nonsingular n x n matrix T, such that J := T- l AT has
the following form:
C V ,(1) (Ad
(6.2.5)
J =
o
o
6.2 The Jordan Normal Form of a Matrix
371
The numbers vJi), j = 1, ... , P(Ai), (and with them, the matrix J) are
uniquely determined up to order. J is called the Jordan normal form of A.
The matrix T, in general, is not uniquely determined.
If one partitions the matrix T columnwise, in accordance with the Jordan normal form J in (6.2.5),
(1)
T = [Tl
(1)
(k)
, ... , T p(>.,)' ... , Tl
(k)
1
, ... , Tp(>'k) ,
then from T- l AT = J and hence AT = T J, there follow immediately the
relations
(6.2.6)
Denoting the columns of the n x v?) matrix T?) without further indices
briefly by t m , m = 1, 2, ... , VJi) ,
Tj
(i)
= [tl, t2, ... , t",Ci)],
J
it immediately follows from (6.2.6), and the definition of C C,) (Ai) that
"'j
1
or
(6.2.7)
(A - AJ)tm = tm-l,
(i)
m = Vj
(i)
,Vj
-
1
1, ... , 2,
(A - AJ)tl = O.
In particular, tl, the first column of T?), is an eigenvector for the eigen-
value Ai. The remaining t m , m = 2, 3, ... , vJi), are called principal vectors corresponding to Ai, and one sees that with each Jordan block C Ci) (Ai)
"'j
there is associated an eigenvector and a set of principal vectors. Altogether,
for an n x n matrix A, one can thus find a basis of <en (namely, the columns
of T) which consists entirely of eigenvectors and principal vectors of A.
The characteristic polynomials
of the individual Jordan blocks C C,) (Ai) are called the
"'j
(6.2.8)
elementary divisors
372
6 Eigenvalue Problems
of A. Therefore, A has only linear elementary linear elementary divisors
precisely if vji) = 1 for all i and j, i.e., if the Jordan normal form is a
diagonal matrix. One then calls A diagonalizable or also normalizable.
This case is distinguished by the existence of a basis of IJjn consisting
solely of eigenvectors of A; principal vectors do not occur. Otherwise, one
says that A has "higher", i.e., nonlinear elementary divisors.
From Theorem (6.2.4) there follows immediately:
(6.2.9) Theorem. Every n x n matrix A with n distinct eigenvalues is
diagonalizable.
We will get to know further classes of diagonalizable matrices in Section
6.4.
Another extreme case occurs if with each of the distinct eigenvalues Ai,
i = 1, ... , k, of A there is associated only one Jordan block in the Jordan
normal form J of (6.2.5). This is the case precisely if
p(Ai) = 1 for i = 1, 2, ... , k.
The matrix A is then called
(6.2.10)
non derogatory,
otherwise, derogatory (an n x n matrix with n distinct eigenvalues is thus
both diagonalizable and nonderogatory). The class of nonderogatory matrices will be studied more fully in the next section.
A further important concept is that of the minimal polynomial of a
matrix A. By this we mean the polynomial
1j;(f.1) = 1'0 + 1'1f.1 + ... + I'm_lf.1 m- l + f.1m
of smallest degree having the property
1j;(A) = O.
The minimal polynomial can be read off at once from the Jordan normal
form:
(6.2.11) Theorem. Let A be an n x n matrix with the (distinct) eigenvalues AI, ... , Ak and with the Jordan normal form J of (6.2.5), and let
Ti := maxl::Oj::Op().;)
Then
vY).
(6.2.12)
is the minimal polynomial of A. 1j;(f.1) divides every polynomial X(f.1) with
X(A) = O.
PROOF. We first show that all zeros of the minimal polynomial 1j; of A, if
it exists, are eigenvalues of A. Let, say, A be a zero of 1j;. Then
6.2 The Jordan Normal Form of a Matrix
373
where the polynomial g(/1) has smaller degree than 'l/J, and hence by the
definition of the minimal polynomial g(A) i= O. There exists, therefore, a
vector z i= 0 with x := g(A)z i= O. Because of'l/J(A) = 0 it then follows that
0= 'l/J(A)z = (A - AI)g(A)z = (A - AI)x,
i.e., A is eigenvalue of A. If a minimal polynomial exists, it will thus have
the form 'l/J(/1) = (/1- Ad T1 (/1- A2t2 ... (/1- Aktk for certain Ti' We wish
to show nOw that Ti := maxj vjiJ will define a polynomial with 'l/J(A) =
O. With the notation of Theorem (6.2.4), indeed, A = T JT- 1 and thus
'l/J(A) = T'l/J(J)T-l. In view of the diagonal structure of J,
J=diag(C (1)(Al),""C (k)
VI
Vp()'k)
(Ak))i
however, we now have
(6.2.13)
and thus, by virtue of Ti ~
v?) and (6.2.3),
'l/J (Cv (i) (Ai)) = O.
J
Thus, 'l/J(J) = 0, and therefore also 'l/J(A) = O.
At the same time One sees that none of the integers Ti can be chosen
(i).
(i)
smaller than maxj Vj . If there were, say, Ti < Vj ,then, by (6.2.3),
(CV(i) (Ai) - Ai1f' i= O.
J
From g(Ai) i= 0 it would follow at once that
B:= g(CV(i) (Ai)).
J
is nonsingular. Hence, by (6.2.13), also 'l/J(C (i))) i= 0, and neither 'l/J(J)
Vj
nor 'l/J(A) would vanish. This shows that the specified polynomial is the
minimal polynomial of A.
If, finally, X(/1) is a polynomial with X(A) = 0, then X, with the aid of
the minimal polynomial, can be written in the form
374
6 Eigenvalue Problems
where deg r < deg
From X(A) = ¢(A) = 0 we thus get also r(A) = O.
Since Vi is the minimal polynomial of A, we must have identically rep) == 0:
1/J is a divisor of X.
0
By (6.2.4), one has
ptA,)
'"
(i)
O'(Ai)=~l/j
::::Ti=maXl/j(i) ,
J
j=l
i.e., the characteristic polynomial cp(p) = det(A - pI) of A is a multiple
of the minimal polynomial. Equality O'(Ai) = Ti, i = 1, ... , k, prevails
precisely when A nonderogatory. Thus,
(6.2.14) Corollary (Cayley-Hamilton). The character'istic polynomial cp(p,)
of a matrix A satisfies cp(A) = o.
(6.2.15) Corollary. A matrix A is nonderogatory if and only if its minimal polynomial and characteric polynomial coincide (up to a multiplicative
constant).
EXA~IPLE.
The Jordan matrix
1
1
1
0
1
1
1
J
1
1
-1
1
-1
-1
LJ
0
has the eigenvalues >11 = 1, .\2 = 1 with multiplicities
p(.\d = 2,
O'(.\d = 5,
Elementary divisors:
Characteristic polynomial:
Minimal polynomial:
P(.\2) = 3,
0'(.\2) = 4.
6.3 The Frobenius Normal Form of a Matrix
375
To Al = 1 there correspond the linearly independent (right) eigenvectors el, e4;
to A2 = -1 the eigenvectors e6, es, eg.
6.3 The Frobenius Normal Form of a Matrix
In the previous section we studied the matrices Cv(.A), which turned out to
be the building blocks of the Jordan normal form of a matrix. Analogously,
one builds up the Frobenius normal form (also called the rational normal
form) of a matrix from Frobenius matrices F of the form
(6.3.1)
0
0
-1'0
1
0
-1'1
0
1
-1'm-2
0
F=
-1'm-l
whose properties we now wish to discuss. One encounters matrices of this
type in the study of Krylov sequences of vectors. By a Krylov sequence of
vectors for the n x n matrix A and the initial vector to E ~n one means
a sequence of vectors ti E ~n, i = 0, 1, ... , m - 1, with the following
properties:
(6.3.2).
(a) ti = Ati-l, i::::: l.
(b) to, t l , ... , t m - l are linearly independent,
(c) tm:= Atm -1 depends linearly on to, h, ... , t m - l : there are constants
1'i with tm + 1'm-l t m-1 + ... + 1'oto = O.
The length m of Krylov sequence of course depends on to. Clearly,
m :::; n, since more than n vectors in ~n are always linearly dependent. If
one forms the n x m matrix T := [to, ... , tm-d and the matrix F of (6.3.1),
then (6.3.2) is equivalent to
(6.3.3)
Rang T= m,
AT = A[to, ... , tm-l] = [h, ... , tm] = [to, ... , tm-1]F = TF.
Every eigenvalue of F is also eigenvalue of A: From Fz = AZ, Z #- 0, we
indeed obtain for x := Tz, in view of (6.3.3),
x#-O
Moreover, we have:
and
Ax = ATz = T F z = ATz = AX.
376
6 Eigenvalue Problems
(6.3.4) Theorem. The matrix F (6.3.1) is nonderogatory: The minimal
polynomial of F is
'lj;(p,) = /'0 + /'IP, + ... + /'m_Ip,m-l + p,m
= (_l)m det (F - p,I).
PROOF. Expanding cp(p,) := det (F - p,I) by the last column, the characteristic polynomial F is found to be
0
-p,
1
-p,
-/'o
-/'1
cp(p,) = det
1
-p,
1
0
-/'m-2
-/'m-l - P,
By the results (6.2.12), (6.2.14) of the preceding section, the minimum
polynomial 'lj;(p,) of T divides cp(p,). If we had deg 'lj; < m = deg cp, say
'lj;(p,) = 000 + alP, + ... + ar_lp,r-l + p,r,
r < m,
then from 'lj;(F) = 0 and Fei = eHl for 1 ::; i ::; m - 1, the following
contradiction would result at once:
0= 'lj;(F)el = aOel + aIe2 + ... + ar-ler + er+1
= [000,001, ... , ar-l, 1,0, ... , O]T "" O.
Thus, deg 'lj; = m, and hence 'lj;(p,) = (-l)mcp(p,). Because of (6.2.15), the
theorem is proved.
0
Assuming the characteristic polynomial of F to have the zeros Ai with
multiplicities (J'i, i = 1, ... , k,
the Jordan normal of Fin (6.3.1), in view of (6.3.4), is given by
C<7 1 (Ad
C<72 (A2)
[
o
The significance of the Frobenius matrices lies in the fact that they furnish
the building blocks of the so-called Frobenius or rational normal form of a
matrix. Namely:
6.3 The F'robenius Normal Form of a Matrix
377
l
(6.3.5) Theorem. For every n x n matrix A there is a nonsingular n x n
matrix T with
FI
(6.3.6)
T-IAT =
0
where the Pi are Frobenius matrices having the following properties:
(a) If 'Pi(f.l) = det (Fi - f.lI) is the characteristic polynomial of Pi, i = 1,
... , r, then 'Pi(f.l) is a divisor of 'Pi-I (f.l), i = 2,3, ... , r.
(b) 'PI (f.l), up to the multiplicative constant ± 1, is the minimal polynomial
of A.
(c) The matrices Fi are uniquely determined by A.
One calls (6.3.6) the Frobenius normal form of A.
This easily done with the help of the Jordan normal form [see
(6.2.4)]: We assume that J in (6.2.5) is the Jordan normal form of A.
Without loss of generality, let the integers v)J) be ordered,
PROOF.
(6.3.7)
> v(i)
> ... >
v(i)
2 p(Ai)'
V I(i) -
.
Z=
12k
, , ... , .
Define the polynomials 'Pj(f.l), j = 1, ... , r, r := maXi pP'i), by
(1)
'Pj(f.l) = (AI - f.l)"j
(2)
(A2 - f.l)"j
(k)
... (Ak - f.l)"j
[using the convention v?) := 0 if j > p(Ai)]. In view of (6.3.7), 'Pj(f.l)
divides 'Pj-I (f.l) and ±'PI (f.l) is the minimal polynomial of A. Now take
as the Frobenius matrix Pj just the Frobenius matrix whose characteristic
polynomial is 'Pj(f.l). Let Si be the matrix that transforms Fi into its Jordan
normal form J i ,
Si-iPiSi = k
A Jordan normal form of A (the Jordan normal form is unique only up to
permutations of the Jordan blocks) then is
378
6 Eigenvalue Problems
According to Theorem (6.2.4) there is a matrix U with U- 1 AU = J'. The
matrix T:= US- 1 with
transforms A into the desired form (6.3.6).
It is easy to convince oneself of the uniqueness F i .
o
EXAMPLE. For the matrix J in the example of Section 6.2 one has
<Pl(J-t) = (1 - J-t)3( -1 - J-t)2 = _(J-t5 - J-t4 - 2J-t3 + 2J-t 2 + J-t - 1),
<P2(J-t) = (1 - J-t)2( -1 - J-t) = _(J-t3 - J-t2 - J-t + 1),
<P3(J-t) = -(J-t + 1),
and there follows
F,
~ [j
000
000
1 o 0 -2
2
0 1 0
o 0 1 1
-:1
,
F2
[~
0
1 ,
0 -1]
1
1
F3 = [-1].
The significance of the Frobenius normal form lies in its theoretical
properties (Theorem (6.3.5)). Its practical importance for computing eigenvalues is very limited. For example, if the n x n matrix A is nonderogatory,
then computing the Frobenius normal form F (6.3.1) is equivalent to the
computation of the coefficients Ik of the characteristic polynomial
which has the desired eigenvalues of A as zeros. But it is not advisable to
first determine the Ii in order to subsequently compute the eigenvalues as
zeros of <P: In general, the zeros Aj of <P react much more sensitively to small
changes in the coefficients Ii of 'P than to small changes in the elements of
the original matrix A [see Sections 5.8 and 6.9].
EXAMPLE. In Section 6.9, Theorem (6.9.7), it will be shown that the eigenvalue
problem for Hermitian matrices A = A H is well conditioned in the following
sense: For each eigenvalue Ai (A + .6. A ) of A + .6.A there is an eigenvalue Aj (A)
of A such that
If the 20 x 20 matrix A has, say, the eigenvalues Aj = j, j = 1, 2, ... , 20, then
lub 2 (A) = 20 because A = AH (see Exercise 8). Subjecting all elements of A to
a relative error of at most eps, i.e., replacing A by A +.6.A with I.6.AI : : : eps IAI,
it follows that [see Exercise 11]
6.4 The Schur Normal form of a Matrix
379
lub 2 (L.A) ::::: lub 2 (IL.AI) ::::: lub2(eps IAI)
::::: epsv20lub2(A) < 90 eps.
On the other hand, a mere relative error IL. li I = eps of the coefficients Ii of the
characteristic polynomial 'P(/-t) = (/-t -1)(/-t - 2) ... (/-t - 20) produces tremendous
changes in the zeros Aj = j of'P [see Example (1) of Section 5.8].
The situation is particularly bad for Hermitian matrices with clustered
eigenvalues: the eigenvalue problem for these matrices is still well conditioned even for multiple eigenvalues, whereas multiple zeros of a polynomial
'P are always ill-conditioned functions of the coefficients ri.
In addition, many methods for computing the coefficients ri of the characteristic polynomial are numerically unstable. As an example, we mention
the method of Frazer, Duncan, and Collar, which is based on the observation [see (6.3.2) (c)] that the vector c = [rO,rl, ... ,rn-lf is the solution of a linear system of equations Tc = -tn with the nonsingular
matrix T := [to, t l , ... , tn-I], provided that to, ... , tn-l is a Krylov sequence oflength n. Unfortunately, the matrix T is in general ill-conditioned,
cond(T) » 1, so that the computed solution c can be highly incorrect [see
Section 4.4]. Indeed, as k --t 00, the vectors tk = Akto will, when scaled by
suitable factors (J"k, converge toward a nonzero vector t = limk (J"ktk that, in
general, does not depend on the choice of the initial vector to. The columns
of T tend, therefore, to become "more and more linearly dependent" [see
Section 6.6.3 and Exercise 12].
6.4 The Schur Normal Form of a Matrix; Hermitian
and Normal Matrices; Singular Values of Matrices
If one does not admit arbitrary nonsingular matrices T in the similarity
transformation T- l AT, it is in general no longer possible to transform A
to Jordan normal form. However, for unitary matrices T, i.e., matrices T
with THT = I, one has the following result of Schur:
(6.4.1) Theorem. For every n x n matrix A there is a unitary n x n matrix
U with
*
Here Ai, i = 1, ... , n, are the (not necessarily distinct) eigenvalues of A.
= 1 the theorem is trivial.
Suppose the theorem is true for matrices of order n - 1, and let A be an
n x n matrix. Let Al be an eigenvalue of A, and Xl =f. 0 a corresponding
PROOF. We use complete induction on n. For n
380
6 Eigenvalue Problems
eigenvector with Ilxd~ = x{f Xl = 1, AX1 = >'1X1. Then one can find n - 1
additional vectors X2, ... , Xn such that Xl, X2, ... , Xn forms an orthonormal
basis of <en, and the n x n matrix X := [Xl, ... ,xnl with columns Xi is thus
unitary, XH X = I. Since
XH AXel = XH AX1 = A1XHx1 = A1e1,
the matrix X H AX has the form
where A1 is a matrix of order n - 1 and aH E <e n- 1. By the induction
hypothesis, there exists a unitary (n - 1) x (n - 1) matrix U1 such that
The matrix
then is a unitary n x n matrix satisfying
UHAU =
[~ ufo ] XHAX [10
[1o uf0] [A1 a] [1 0]
0
A1
0
U1
The fact that Ai, i = 1, ... , n, are zeros of det(U H AU - p1), and hence
eigenvalues of A, is trivial.
0
Now, if A = AH is a Hermitian matrix, then
is again a Hermitian matrix. Thus, from (6.4.1), there follows immediately
6.4 The Schur Normal form of a Matrix
381
(6.4.2) Theorem. For every Hermitian n x n matrix A = AH there is a
unitary matrix U = [Xl, ... ,xn ] with
U- 1 AU = U H AU = [ AOI
The eigenvalues Ai, i = 1, ... , n, of A are real. A is diagonalizable. The ith
column Xi of U is an eigenvector belonging to the eigenvalue Ai: AXi = Ai xi .
A thus has n linearly independent pairwise orthogonal eigenvectors.
If the eigenvalues Ai of a Hermitian n x n matrix A = AH are arranged
in decreasing order,
Al ;::: A2 ;::: ... ;::: An,
then Al and An can be characterized also in the following manner [see
(6.9.14) for a generalization]:
(6.4.3)
An =
PROOF. If U H AU
x H Ax
xHx
xHAx
min - H - ·
OjixE<cn X X
= A = diag[Al, ... , An], U unitary, then for all x -=f. 0,
(xHU)U H AU(UHx)
(xHU)(UHX)
2:iAi l1]i 12
2:i l1]i 12
< 2:i A1 11]iI 2 = A
- 2:il1]iI 2
1,
where y := U H x = [1]1, ... , 1]n]T -=f. O. Taking for x -=f. 0 in particular an
eigenvector belonging to AI, Ax = Al x, one gets x H Ax/x H x = AI, so that
Al = maxOjixE<cn x H Ax/x Hx. The other assertion in (6.4.3) follows from
what was just proved by replacing A with -A.
From (6.4.3) and the definition (4.3.1) of a positive definite (positive
semidefinite) matrix A, one obtains immediately
(6.4.4). A Hermitian matrix A is positive definite (positive semidefinite)
if and only if all eigenvalues of A are positive (nonnegative).
A generalization of the notion of a Hermitian matrix is that of a normal
matrix: An n x n matrix A is called normal if
i.e., A commutes with AH. For example, all Hermitian, diagonal, skew
Hermitian and unitary matrices are normal.
(6.4.5) Theorem. An n x n matrix A is normal if and only if there exists
a unitary matrix U such that
382
6 Eigenvalue Problems
U- 1 AU = U H AU =
>'1
[0
Normal matrices are diagonalizable and have n linearly independent pairwise orthogonal eigenvectors Xi, i = 1. ... , n, AXi = AiXi, namely the
columns of the matrix U = [Xl, ... ,XnJ.
PROOF. By Schur's theorem (6.4.1), there exists a unitary matrix U with
UHAU = [
A1
* . . **..]
o
An
From AHA = AA H there now follows
RHR= U H AHUU H AU = U H AHAU
= U H AAHU = U H AUU H AHU
=RRH.
For the (1,1) element of RH R = RRH we thus obtain
hence r1k = 0 for k = 2, ... , n. In the same manner one shows that all
nondiagonal elements of R vanish.
Conversely, if A is unitarily diagonalizable, U H AU = D, where D :=
diag(A1, ... , An), UHU = I, there follows at once
AH A = UDHUHUDU H = UIDI 2U H = UDUHUDHU H = AAH.
0
Given an arbitrary m x n matrix A, the n x n matrix AHA is positive
semidefinite, since xH (AH A)x = IIAxll~ 2': 0 for any X E <en. Its eigenvalues
A1 2': A2'" 2': An 2': 0 are nonnegative by (6.4.4) and can therefore be
written in the form Ak = (J~ with (Jk 2': O. The numbers (J1 2': ... 2': (In 2': 0
are called
(6.4.6)
singular values of A.
Replacing the matrix A in (6.4.3) by AH A, one obtains immediately
(6.4.7)
6.4 The Schur Normal form of a Matrix
383
In particular, if m = n and A is nonsingular, one has
(6.4.8)
The smallest singular value an of a square matrix A gives the distance of
A to the "nearest" singular matrix:
(6.4.9) Theorem. Let A and E be arbitrary n x n matrices and let A have
the singular values al 2:: a2 2:: ... 2:: an 2:: O. Then
(a) lub 2 (E) 2:: an if A + E is singular.
(b) There is a matrix E with lub 2 (E) = an such that A + E singular.
PROOF. (a): Let A + E be singular, thus (A + E)x = 0 for some x =I- O.
Then (6.4.7) gives
hence an :::; lub 2 (E).
(b): If an = 0, there is nothing to prove. For, by (6.4.7), one has 0 =
IIAxl12 for some x =I- 0, so that A already is singular. Let, therefore, an > O.
Because of (6.4.7), there exist vectors u, v such that
IIAul12 = an,
IIul12 = 1,
1
IIvl12 = l.
v := -Au,
an
For the special n x n matrix E:= -anvu H , one then has (A+E)u = 0, so
that A + E is singular, and moreover
D
An arbitrary m x n matrix A can be transformed unitarily to a certain
normal form in which the singular values of A appear:
(6.4.10) Theorem. Let A be an arbitrary (complex) m x n matrix. Then:
(a) There exist a unitary m x m matrix U and a unitary n x n matrix V
such that U H AV = E is an m x n "diagonal matrix" of the following
form:
E=[~ ~],
D:=diag(al, ... ,ar ),
al2::a22::···2::ar>0.
Here aI, ... , a r are the nonvanishing singular values of A, and r is
the rank of A.
384
6 Eigenvalue Problems
(b) The nonvanishing singular values of AH are also precisely the number
aI, ... , aT'
The decomposition A = U EV H is called
the singular-value decomposition of A.
(6.4.11)
°
°
PROOF. We show (a) by mathematical induction on m und n. For m =
or n = there is nothing to prove. We assume that the theorem is true for
(m - 1) x (n - 1) matrices and that A is an m x n matrix with m ~ 1,
n ~ 1. Let al be the largest singular value of A. If al = 0, then by (6.4.7)
also A = 0, and there is nothing to show. Let, therefore, al > 0, and let
Xl -=I- be an eigenvector of AHA for the eigenvalue ar, with IIxll12 = 1:
°
(6.4.12)
Then one can find n - 1 additional vectors X2, ... , Xn E (Cn such that the
n x n matrix X := [Xl, X2, ... , xnl with the columns Xi becomes unitary,
X H X = In. By virtue of IIAxlll~ = xf AH AXI = arxf Xl = ar > 0,
the vector Yl:= l/alAxl E (Cm with IIYll12 = 1 is well defined, and one
can find m - 1 additional vectors Y2, ... , Ym E (Cm such that the m x m
matrix Y := [Yl, Y2, ... , Yml is likewise unitary, yHy = 1m. Now from
(6.4.12) and the definition of Yl, X and y, there follows at once, with
el:= [1,O, ... ,OV E (Cn,el:= [1,O, ... ,OV E <em, that
yH AXel = yH AXI = alyH Yl = aIel E (Cm
and
(yH AX)Hel = XH AHYel = XH AHYI = ~XH AH AXI
al
so that the matrix Y H AX has the following form:
Here A is an (m - 1) x (n - 1) matrix.
By the induction hypothesis, there exist a unitary (m - 1) x (m - 1)
matrix {j and a unitary (n - 1) x (n - 1) matrix V such that
with t; a (m - 1) x (n - 1) "diagonal matrix" of the form indicated. The
m x m matrix
6.4 The Schur Normal form of a Matrix
u:= y.
385
[~ &]
is unitary, as is the n x n matrix
and one has
UHAV=[~ [?H]YHAX[~ ~]=[~ UOH][~l ~][~1~]
[~1 ~]=[~~]=E,
D:=diag(a1, ... ,ar ),
E being an m x n diagonal matrix with a2 2: ... 2: a r > 0, ar
Amax(AH A). Evidently, rank A = r, since rank A = rank U H AV = rank
E.
We must still prove that a1 2: a2 and that the <Ti are the singular values
of A. Now from U H AV = E there follows, for the n x n diagonal matrix
EHE,
EH E = diag(ai, ... , a;, 0, ... ,0) = VH AHUU H AV = VH (AH A)V,
a;
so that [see Theorem (6.4.2)] ar, ... ,
are the nonvanishing eigenvalues
of AHA, and hence aI, ... , ar are the nonvanishing singular values of A.
Because of ar = Amax(AH A), one also has a1 2: a2.
0
The unitary matrices U, V in the decomposition U H AV = E have the
following meaning: The columns of U represent m orthonormal eigenvectors of the Hermitian m x m matrix AA H , while those of V represent n
orthonormal eigenvectors of the Hermitian n x n matrix AHA. This follows at once from U H AAHU = EEH,V H AH AV = EH E, and Theorem
(6.4.2). Finally we remark that the pseudoinverse A+ [see Section 4.8.5] of
the m x n matrix A can be immediately obtained from the decomposition
UHAV=E:If
E =
[~ ~],
D = diag(al, ... ,<Tr ), a1 2: ... 2: <Tr > 0,
then the n x m diagonal matrix
0] '
[D-o 0
1
is the pseudoinverse of E, and one verifies at once that the n x m matrix
(6.4.13)
386
6 Eigenvalue Problems
satisfies the conditions of (4.8.5.1) for a pseudoinverse of A, so that A+,
in view of the uniqueness statement of Theorem (4.8.5.2), must be the
pseudoinverse of A.
6.5 Reduction of Matrices to simpler Form
The most common methods for determining the eigenvalues and eigenvectors of a dense matrix A proceed as follows. By means of a finite number
of similarity transformations
i = 1, 2, ... , m,
one first transforms the matrix A into a matrix B of simpler form,
B:= Am = T-1AT,
and then determines the eigenvalues A and eigenvectors y of B, By = Ay.
For x := Ty = T 1 ··· Tmy, since B = T- 1AT, we then have
Ax = Ax,
i.e., to the eigenvalue A of A there belongs the eigenvector x. The matrix
B is chosen in such a way that
(1) the determination of the eigenvalues and eigenvectors of B is as simple
as possible (i.e., requires as few operations as possible) and
(2) the eigenvalue problem for B is not (substantially) worse conditioned
than that for A (i.e., small changes in the matrix B do not impair
the eigenvalues of B, nor therefore those of A, substantially more than
equally small changes in A).
In view of
B = T-1AT,
B + lc,B = T-1(A + lc,A)T,
lc,A:= T lc,BT- 1,
given any vector norm 11·11 and corresponding matrix norm lub (.), one gets
the following estimates:
lub(B) ::::: cond(T) lub(A),
lub(lc,A) ::::: cond(T) lub(lc,B),
and hence
lub(lc,A) < (
d(T))2Iub(lc,B)
lub(A) - con
lub(B) .
6.5 Reduction of Matrices to simpler Form
387
For large cond(T) » 1 even a small error 6B, committed when computing
B, say lub(6B)/lub(B) ::::; eps, has the same effect on the eigenvalues of
A as the equivalent error 6A, which can be as large as
lub(l:>A)
2lub(l:>B)
2
lub(A) ::::; cond(T) lub(B) = cond(T) . eps.
Hence for large cond(T) , no method to compute the eigenvalues of A by
means of the eigenvalues of B will be numerically stable [see Section 1.3].
Since
cond(T) = cond(TI ... Tm) :"::: cond(Td··· cond(Tm) ,
numerical stability can be expected only if one chooses the matrices Ti such
that cond(Ti) does not become too large. This is the case, in particular,
for the maximum norm Ilxll oo = maxi IXil and elimination matrices of the
form [see Section 4.1]
1
0
1
Ti = G j =
with Ilkjl :"::: 1,
1)+I,j
0
lnj
0
1
(6.5.0.1)
1
0
1
1 =
G-:
J
-1)+I,j
-lnj
0
0
1
condoo(Ti) :"::: 4,
and also for the Euclidean norm IIxI12 = v'xHx and unitary matrices Ti = U
(e.g., Householder matrices) for which cond(Ti) = 1. Reduction algorithms
using either unitary matrices Ti or elimination matrices Ii as in (6.5.0.1) are
described in the following sections. The "simple" terminal matrix B = Am
that can be achieved in this manner, for arbitrary matrices, is an upper
Hessenberg matrix, which has the following form:
*
B=
*
*
o
o
bik = 0 for k s: i - 2.
0
* *
388
6 Eigenvalue Problems
For Hermitian matrices A = AH only unitary matrices T i , T i- 1 = TiH , are
used for the reduction. If A i - 1 is Hermitian, then so is Ai = T i- 1A i - 1T i :
A{i = (TiH Ai_ITi)H = TiH AE1 Ti = TiH A i- 1T i = Ai.
For the terminal matrix B one thus obtains a Hermitian Hessenberg matrix,
i.e., a (Hermitian) tridiagonal matrix, or Jacobi matrix:
B=
81
12
'Y2
82
0
8i = 8i .
0
'Yn
1n
8n
6.5.1 Reduction of a Hermitian Matrix to Tridiagonal Form.
The Method of Householder
In the method of Householder for the tridiagonalization of a Hermitian
n x n matrix AH = A =: Ao, one uses suitable Householder matrices [see
Section 4.7]
TiH = T i- 1 = Ti = I - f3iUiU{i
for the transformation
(6.5.1.1)
We assume that the matrix A-I = (ajk) has already the following form:
A i- 1 =
J i- 1
C
0
cH
8i
a ,H
0
ai
Ai- 1
(ajk),
with
[~l
81
12
'Y2
82
0
0
0
1i
8i
0
'Yi-1
1i-1
8i- 1
0
0
'Yi
_[ai~"il
a, -
:
ani
.
6.5 Reduction of Matrices to simpler Form
389
According to Section 4.7, there is a Householder matrix Ti of order n - i
such that
T;ai = k· el E <en-i.
(6.5.1.2)
T; has the form Ti = I - (3uu H , U E <e n - i , and is given by
n
L lajil
8)
(3'= {1/(0'(0' + lai+l,il))
.
0
(6.5.1.3)
k := -0"
U:= [
if 0' =I- 0,
otherwise,
i
if a'+1
1 , , 1 ,' = e <pla'+1
' l . , 2,I ,
ei<p
ei<p
2,
j=i+l
(0' + lai+l,i I)
ai+2,i
.
1
.
ani
Then, for the unitary n x n matrix Ii, partitioned like (6.5.1.1),
Ti :=
I
0
0
0
1
0
0
0
Ti
} i-I
one clearly has TiH = Ti- 1 = Ti and, by (6.5.1.2),
c
o
o
390
6 Eigenvalue Problems
81
"(2
"12
82
0
o
o
"(i-l
o
0
"Ii-1
8i - 1
"(i
0
0
"Ii
8i
"(HI
0
...
0
=: Ai
"IHI
o
o
o
with "IHI := k.
Since Ti = I - (3uu H , one can compute TUii-lTi as follows:
Introducing for brevity the vectors p, q E <D n - i ,
T;.t4 i - 1 T i = A i - 1 - pu H - upH + (3upH uu H
(6.5.1.4)
=Ai_l-U[P_~(pHU)u]H - [p_~(pHU)U]uH
= A i - 1 - uqH - qu H .
The formulas (6.5.1.1)-(6.5.1.4) completely describe the ith transformation
Evidently,
o
B = An - 2 =
"12
o
"In
6.5 Reduction of Matrices to simpler Form
391
is a Hermitian tridiagonal matrix.
A formal ALGOL-like description of the Householder transformation for
a real symmetric matrix A = AT = [ajkJ with n ;:::: 2 is
for i := 1 step 1until n - 2 do
begin lSi := aii;
i----n
L lajil if ai+l,i < 0 then :=
2 ;
S :=
Efl
8
-8;
j=i+l
e := 8 + ai+l,i;
if 8 = 0 then begin aii := 0; goto M Mend;
(3:= aii := 1/(8 x e);
Ui+l := ai+1,i := e;
for j := i + 2 step 1 until n do Uj := aji;
for j := i + 1 step 1 until n do
'Yi+l := -8;
Pj := (
t
t
j=i+1
ajk x Uk
+
t
akj x Uk)
x (3;
k=j+1
8k := ( .
Pj x Uj) x (3/2;
)=,+1
for j := i + 1 step 1 until n do
qj := Pj - 8k x Uj;
for j := i + 1 step 1 until n do
for k := i + 1 step 1 until j do
ajk := ajk - qj x Uk - Uj x qk;
MM:
end;
This program takes advantage of the symmetry of A: Only the elements
need to be given. Moreover, the matrix A is overwritten with
the essential elements (3i, Ui of the transformation matrix Ti = 1- (3iuiuf!,
i = 1, 2, ... , n - 2: Upon exiting from the program, the ith column of A
will be occupied by the vector
ajk with k :::; j
i = 1,2, ... ,n - 2.
(The matrices T i , i = 1,2, ... ,n - 2, are needed for the "back transformation" of the eigenvectors: if y is an eigenvector of A n - 2 for the eigenvalue
A,
A n - 2 y = AY,
then x := T 1 T2 ··· Tn - 2 y is an eigenvector of A, Ax = AX.)
392
6 Eigenvalue Problems
Tested ALGOL programs for the Householder reduction and back transformation of eigenvectors can be found in Martin, Reinsch, and Wilkinson
(1971); FORTRAN programs, in Smith et al. (1976).
Applying the transformations described above to an arbitrary nonHermitian matrix A of order n, the formulas (6.5.1.1), (6.5.1.3) give rise to
a chain of matrices Ai, i = 0, 1, ... , n - 2, of the form
Ao =A,
*
*
o
o
* *
*
o
* * .
· *
* * .
· *
* * .
* *.
· *
. *
* * .
· *
v
i-I
The first i-I columns of A i - I are already those of a Hessenberg matrix .
. A n - 2 is a Hessenberg matrix. During the transition from A i - I to Ai the
elements ajk of A i - I with j, k :::; i remain unchanged.
ALGOL programs for this algorithm can be found in Martin and Wilkinson (1971); FORTRAN programs, in Smith et al. (1976).
In Section 6.5.4 a further algorithm for the reduction of a general matrix
A to Hessenberg form will be described which does not operate with unitary
similarity transformations.
The numerical stability of the Householder reduction can be shown
in the following way: Let Ai and Ti denote the matrices obtained if the
algorithm is carried out in floating-point arithmetic with relative precision
eps; let Ui denote the Householder matrix which, according to the rules
of the algorithm, would have to be taken as transformation matrix for
the transition A i - I --+ Ai in exact arithmetic. Thus, Ui is an exact unitary
matrix, while Ti is an approximate unitary matrix, namely the one obtained
in place of Ui by computing Ui in floating-point arithmetic. The following
relations thus hold:
Ti = fl(Ui ),
Ai = fl(TiAi_ITi).
By means of the methods described in Section 1.3 one can now show [see,
e.g., Wilkinson (1965)] that
6.5 Reduction of Matrices to simpler Form
393
lub 2 (Ti - Ui ) ~ f(n) eps,
Ai = fl(Ti Ai - 1T;) = Ti Ai - 1T i + G i ,
(6.5.1.5)
lub 2 (G i ) ~ f(n) eps lub 2 (Ai-I),
where f(n) is a certain function [for which, generally, f(n) = 0(71"'), a ~ 1].
From (6.5.1.5), since lub 2 (Ui ) = 1, UiH = Ui- 1 = Ui (U i is a Householder
matrix!), there follows at once
Ri := Ti - Ui ,
lub 2 (Ri ) ~ f(n) eps,
-
-1 -
-
-
-
Ai = Ui Ai-lUi + R iA i- 1Ui + Ui Ai- 1R i + RiAi-1Ri + Gi
-1 -
=: Ui
Ai-lUi + F i ,
where
or, since f(n) eps « 3, in first approximation:
lub 2 (F;) ::; 3 eps f( n) lub 2 (Ai-r),
(6.5.1.6)
lub 2 (A;) ::; (1 + 3 eps f(n)) lub 2 (A i -r),
::; (1 + 3 eps f(n))i lub 2 (A).
For the Hessenberg matrix An - 2 , finally, since A = Ao, one obtains
(6.5.1.7)
where
n-2
F :=
L U1U
2 ··· Ui F i Ui- 1 ... U
1 1.
i=l
It thus follows from (6.5.1.6) that
n-2
lub 2 (F) ~
L lub (F
2
i)
i=l
n-2
::; 3 eps f(n) lub 2 (A)
L (1 + 3 eps f(n))i-1,
i=l
or, in first approximation
(6.5.1.8)
lub 2 (F) ::; 3(71 - 2)f(n) eps lub 2 (A).
Provided nf(n) is not too large, the relations (6.5.1.7) and (6.5.1.8) show
that the matrix An - 2 is exactly similar to the matrix A + F, which is only
a slight perturbation of A, and therefore the method is numerically stable.
394
6 Eigenvalue Problems
6.5.2 Reduction of a Hermitian Matrix to Tridiagonal or
Diagonal Form: The Methods of Givens and Jacobi
In Givens' method (1954), a precursor of the Householder method, the
chain
for the transformation of a Hermitian matrix A to tridiagonal form B = Am
is constructed by means of unitary matrices Ti = njk of the form ('P, 'ljJ
real)
1
0
1
cos'P
(6.5.2.1)
-e- i 1j; sin 'P
+-j
cos 'P
+-k
1
njk =
1
ei 1j; sin 'P
1
1
0
In order to describe the method of Givens we assume for simplicity that
A = AH is real; we can choose in this case 'ljJ = 0, and njk is orthogonal.
Note that in the left multiplication A ~ njkl A = nJ{,A only rows j and
k of A undergo changes, while in the right multiplication A ~ Anjk only
columns j and k change. We describe only the first transformation step
A = Ao ~ T I- I Ao =: A~ ~ A~TI = T I- I AoTI =: AI. In the half step
Ao ~ A~ the matrix TI = n 23 , T1- I = n~ is chosen such [see Section
4.9] that the element of A~ = n~Ao in position (3,1) is annihilated; in
the subsequent right multiplication by n 23 , A~ ~ Al = A~n23, the zero
position (3,1) is preserved. Below is a sketch for a 4 x 4 matrix, where
changing elements are denoted by *:
~~ [~
x
x
x
x
x
x
x
x
~l
~ n~Ao =
~ A~n23 =
[~
[~
x
x
* *
* *
x x
0
*
* *
* *
* *
~l
~l
=A~
=: AI.
6.5 Reduction of Matrices to simpler Form
395
Since with Ao, also Al is Hermitian, the transformation Ao -+ Al also
annihilates the element in position (1, 3). After this, the element in position
(4, 1) is transformed to zero by a Givens rotation T2 = [l24, etc. In general,
one takes for Ti successively the matrices
.. , ,
... ,
[In-I,n,
and chooses [ljk, j = 2, 3, ... , n -1, k = j + 1, j + 2, ... , n, so as to annihilate the element in position (k,j - 1). A comparison with Householder's
method shows that this variant of Givens method requires about twice as
many operations. For this reason, Householder's method is usually preferred. There are, however, modern variants ("rational Givens transformations") which are comparable to the Householder method.
The method of Jacobi, too, employs similarity transformations with
the special unitary matrices [ljk (6.5.2.1); however, it no longer produces
a finite sequence ending in a tridiagonal matrix, but an infinite sequence
of matrices A (i), i = 0, 1, ... , converging to a diagonal matrix
Here, the Ai are just the eigenvalues of A. To explain the method, we again
assume for simplicity that A is a real symmetric matrix. In the transformation step
A(i) -+ A(i+I)
= [lJr,A(i)[ljk
the quantities c:= cosrp, s := sinrp of the matrix [ljk, j < k, in (6.5.2.1)
are now determined so that ajk = 0 (we denote the elements of A (i) byars,
those of A(i+ I ) by a~s):
A(i+l)
=
I
a~1
a~.,J
a1,k
a~,n
a j1
,
a( .
J,J
0
a j .n
a~l
0
a~k
a~,n
a~1
an,j
I
,
a~,k
a~.n
Only the entries in rows and columns j and k (in bold face) are changed,
according to the formulas
396
6 Eigenvalue Problems
(6.5.2.2)
, - a 'kj - -es (ajj - akk ) + (2
e - s 2) ajk -~ 0 ,
a jk -
a~k = S2ajj + e2akk - 2eSajk.
From this, one obtains for the angle cp the defining equation
2es
2ajk
tan 2cp = - - - = ----"-e2 - S2
ajj - akk
By means of trigonometric identities, one can compute from this the quantities e and s and, by (6.5.2.2), the a~s'
It is recommended, however, that the following numerically more stable
formulas be used [see Rutishauser (1971), where an ALGOL program can also
be found]. First compute the quantity '13 := cot 2cp from
'13 '= ajj - akk
.
2ajk'
and then t := tan cp as the root of smallest modulus of the quadratic equation
t 2 + 2M - 1 = 0,
that is,
t=
s(iJ)
1'131 +
VI +
'13 2 '
s(iJ) := {~1
if '13 :::: 0,
otherwise,
or t := 1/2'13 if 1'131 is so large that '13 2 would overflow. Obtain the quantities
e and s from
1
c'.- --===
JI+t2' s:= te.
Finally compute the number T := tan(cp/2) from
S
T
:= (1
+ e)'
and with the aid of s, t and T rewrite the formulas (6.5.2.2) in a numerically
more stable way as
+ t ajk,
aj j
ajj
a'jk
a"k j ' -0'
a~k
akk -
t ajk·
6.5 Reduction of Matrices to simpler Form
397
For the proof of convergence, one considers
S(A(i)) := ~]ajkI2,
S(A(i+l)) =
j#
I:lajkl 2
j#
the sums of the squares of the off-diagonal elements of A(i) and A(i+1),
respectively. For these one finds, by (6.5.2.2),
The sequence of nonnegative numbers S (A (i)) therefore decreases monotonically, and thus converges. One can show that limi-toc S(A(i)) = 0 (i.e.,
the A(i) converge to a diagonal matrix), provided the transformations Jl jk
are executed in a suitable order, namely row-wise,
... ,
... ,
Jln-1,n,
and, in this order, cyclically repeated. Under these conditions one can even
prove quadratic convergence of the Jacobi method, if A has only simple
eigenvalues:
'hN
WI t
IS :=
n(n-1)
:= ----'--::-2--'-'
n:ln
IAi(A) - Aj(A)1 > 0
'-r-J
[for the proof, see Wilkinson (1962); further literature: Rutishauser (1971),
Schwarz, Rutishauser and Stiefel (1972), Parlett(1980)].
In spite of this rapid convergence and the additional advantage that an
orthogonal system of eigenvectors of A can easily be obtained from the Jl jk
employed, it is more advantageous in practical situations, particularly for
large n, to reduce the matrix A to a tridiagonal matrix J by means of the
Householder method [see Section 6.5.1] and to compute the eigenvalues and
eigenvectors of J by the QR method, since this method converges cubically.
This all the more so if A has already the form of a band matrix: in the QR
method this form is preserved; the Jacobi method destroys it.
We remark that Eberlein developed a method for non-Hermitian matrices similar to Jacobi's. An ALGOL program for this method, and further
details, can be found in Eberlein (1971).
398
6 Eigenvalue Problems
6.5.3 Reduction of a Hermitian Matrix to Tridiagonal Form:
The Method of Lanczos
Krylov sequences of vectors q, Aq, A2q, ... belonging to an n x n matrix
and a starting vector q E en were already used for the derivation of the
Frobenius normal form of a general matrix in Section 6.3. They also play an
important role in the method of Lanczos (1950) for reducing a Hermitian
matrix to tridiagonal form. Closely related to such a sequence of vectors is
a sequence of subspaces of en
i::::: 1,
Ki(q, A) := span[ q, Aq, ... , Ai-1q J,
Ko(q, A) := {o}.
called Krylov spaces: Ki(q, A) is the subspace spanned by the first i vectors
of the sequence {Ajq}j~o. As in Section 6.3, we denote by m the largest
index i for which q, Aq, ... , Ai-l q are still linearly independent, that is,
dim Ki(q, A) = i. Then m :::; n, Amq E Km(q, A), the vectors q, Aq, ... ,
Am-l q form a basis of Km(q, A), and therefore AKm(q, A) C Km(q, A):
the Krylov space Km(q, A) is A-invariant and the map x H p(x) := Ax
describes a linear map of Km(q, A) into itself.
In Section 6.3 we arrived at the Frobenius matrix (6.3.1) when the map
P was described with respect to the basis q, Aq, ... , Am-l q of Km(q, A).
The idea of the Lanczos method is closely related: Here, the map P is
described with respect to a special orthonormal basis ql, q2, ... , qm of
Km(q, A), where the qj are chosen such that for all i = 1, 2, ... , m, the
vectors ql, q2, ... , qi form an orthonormal basis of Ki(q, A). If A = AH is a
Hermitian n x n matrix, then such a basis is easily constructed for a given
starting vector q. We assume q i- 0 in order to exclude the trivial case and
suppose in addition that Ilqll = 1, where II . II is the Euclidean norm. Then
there is a three-term recursion formula for the vectors qi [similar recursions
are known for orthogonal polynomials, cf. Theorem (3.6.3)]
(6.5.3.1a)
ql := q,
IIqO:= 0,
Aqi = liqi-I + ~iqi + li+1qHl
for i ::::: 1,
where
~i := qf Aqi,
(6.5.3.1b)
IHI := Ilrill
with ri := Aqi - ~iqi - liqi-},
qi+1 := rd'HI'
if IHI i- O.
Here, all coefficients Ii, ~i are real. The recursion breaks off with the first
index i =: io with IHI = 0, and then the following holds
io = m = max dim Ki(q, A).
2
6.5 Reduction of Matrices to simpler Form
399
PROOF. We show (6.5.3.1) by induction over i. Clearly, since Ilqll = 1, the
vector ql := q provides an orthonormal basis for Kl (q, A). Assume now
that for some j 2': 1 vectors ql, ... , qj are given, so that (6.5.3.1) and
hold for all i :::; j, and that ri i- 0 in (6.5.3.1b) for all i < j. We show
first that these statements are also true for j + 1, if rj i- O. In fact, then
'YJ+l i- 0, 6j and qj+1 are well defined by (6.5.3.1b), and IIqJ+11I = 1. The
vector qj+1 is orthogonal to all qi with i ::; j: This holds for i = j, because
'YJ+l i- 0, because
from the definition of 6j, and using the induction hypothesis
'YJ+lqfqJ+l = qf A% - 6jqfqj = O.
For i = j - 1, the same reasoning and A = A H first give
'YJ+lqf-l%+1 = qf-1Aqj - 'Yjqf-lqj-l = (Aqj_dHqj - 'Yj.
The orthogonality ofthe qi for i :::; j and A%-l = 'Yj-lqj-2+ 6j-lqj-l +'Yjqj
then imply (Aqj_l)H qj = "(j = 'Yj, and therefore qJ-l qj+1 = O. For i < j-1
we get the same result with the aid of Aqi = 'Yiqi-l + 6iqi + 'Yi+1qi+l:
'Yj+1qf qj+1 = qf Aqj = (Aqi)H qj = o.
Finally, since span[ ql, ... , qi 1 = Ki(q, A) c Kj(q, A) for i :::; j, we also
have
which implies by (6.5.3.1b)
qJ+l E span[ qj-l, qj, Aqj] C Kj+1 (q, A),
and therefore span[ ql, ... , qJ+l] c K j +1(q, A). Since the orthonormal vectors ql, ... , qJ+l are linearly independent and dim Kj+1 (q, A) ::; j + 1 we
obtain
Kj+1 (q, A) = span[ ql,···, qj+1].
This also shows j + 1 :::; m = maxi dim Ki (q, A), and io :::; m for the breakoff index io of (6.5.3.1). On the other hand, by the definition io
Aqio E span[ qio-l, qio] C span[ ql, ... ,qio 1= Kio (q, A),
so that, because
400
6 Eigenvalue Problems
we get the A-invariance of Kio(q,A), AKio(q,A) c Kio(q,A). Therefore io 2: m, since Km(q, A) is the first A-invariant subspace among the
Ki(q, A). This finally shows io = m, and the proof is complete.
0
The recursion (6.5.3.1) can be written in terms of the matrices
1 ::; i ::; m,
Ii
as a matrix equation
(6.5.3.2)
AQi = Qdi + [0, ... ,0, Ii+Iqi+l]
= QiJi + Ii+Iqi+Ief,
i = 1, 2, ... , m,
where ei := [0, ... ,0, I]T E IRi is the ith axis vector of IRi. This equation
is easily verified by comparing the jth columns, j = 1, ... , i, on both sides.
Note that the n x i matrices Qi have orthonormal columns, Q,fIQi = li(:=
i x i identity matrix) and the J i are real symmetric tridiagonal matrices.
Since i = m is the first index with Im+1 = 0, the matrix J m is irreducible,
and the preceding matrix equation reduces to {cf. (6.3.3)]
AQm = QmJm
where Q;;,Qm = 1m. Any eigenvalue of J m is also an eigenvalue of A, since
Jmz = Az, z t 0 implies x := Qmz t 0 and
If m = n, i.e., if the method does not terminate prematurely with an m < n,
then Qn is a unitary matrix, and the tridiagonal matrix I n = Q;[ AQn is
unitarily similar to A.
Given any vector q =: ql with iiqii = 1, the method of Lanczos consists
of computing the numbers Ii, iSi, i = 1, 2, ... , m, bl := 0), and the
tridiagonal matrix J m by means of (6.5.3.1). Subsequently, one may apply
the methods of Section 6.6 to compute the eigenvalues and eigenvectors
of J m (and thereby those of A). Concerning the implementation of the
method, the following remarks are in order:
1. The number of operations can be reduced by introducing an auxiliary
vector defined by
Ui := Aqi - liqi-I·
Then ri = Ui - iSiqi, and the number
6.5 Reduction of Matrices to simpler Form
401
can also be computed from Ui, since q{f qi-1 = o.
2. It is not necessary to store the vectors qi if one is not interested in
the eigenvectors of A: In order to carry out (6.5.3.1) only two auxiliary
vectors v, W E ~n are needed, where initially v := q is the given starting
vector with iiqii = 1. Within the following program, which implements the
Lanczos algorithm for a given Hermitian n x n matrix A = A H , Vk and Wk,
k = 1, ... , n, denote the components of v and w, respectively:
W
:= 0; /'1 := 1; i := 1;
1: if /'i #- 0 then
begin if i #- 1 then
for k := 1 step 1 until n do
begin t := Vk; Vk := Wk//'i; Wk := -/'it end;
W := Av + w; {)i := v H w; W := W - {)iV;
m := i; i := i + 1; /'i := Vw H w;
goto 1;
end;
Each step i -+ i + 1 requires about 5n scalar multiplications and one multiplication of the matrix A with a vector. Therefore, the method is inexpensive if A is sparse, so that it is particularly valuable for solving the
eigenvalue problem for large sparse matrices A = AH.
3. In theory, the method is finite: it stops with the first index i = m ::::: n
with /'H1 = O. However, because of the influence ofroundoff, one will rarely
find a computed /'i+l = 0 in practice. Yet, it is usually not necessary to
perform many steps of the method until one finds a zero or a very small
/'i+ 1: The reason is that, under weak assumptions, the largest and smallest
eigenvalues of J i converge very rapidly with increasing i toward the largest
and smallest eigenvalues of A [Kaniel-Paige theory: see Kaniel (1966), Paige
(1971), and Saad (1980)]. Therefore, if one is only interested in the extreme
eigenvalues of A (which is quite frequently the case in applications), only
relatively few steps of Lanczos' method are necessary to find a J i , i « n,
with extreme eigenvalues that already approximate the extreme eigenvalues
of A to machine precision.
4. The method of Lanczos will generate orthogonal vectors qi only in
theory: In practice, due to roundoff, the vectors iii actually computed become less and less orthogonal as i increases. This defect could be corrected
by reorthogonalizing a newly computed vector qi+l with respect to all previous vectors iii, j ::::: i, that is, by replacing qi+l by
402
6 Eigenvalue Problems
However, reorthogonalization is quite expensive: The vectors iii have to
be stored, and step i of the Lanczos method now requires O(i . n) operations instead of O( n) operations as before. But it is possible to avoid a full
reorthogonalization to some extent and still obtain very good approximations for the eigenvalues of A in spite of the difficulties mentioned. Details
can be found in the following literature, which also contains a systematic
investigation of the interesting numerical properties of the Lanczos method:
Paige (1971), Parlett and Scott (1979), and Cullum and Willoughby (1985),
where one can also find programs.
6.5.4 Reduction to Hessenberg Form
It was already observed in Section 6.5.1 that one can transform a given
n x n matrix A by means of n - 2 Householder matrices T; similarly to
Hessenberg form B,
A:= Ao ---+ Al ---+ ... ---+ An-2 = B,
We now wish to describe a second algorithm of this kind, in which one uses
as transformation matrices T; permutation matrices
(here Ci is the ith axis vector of IR n ), and elimination matrices of the form
1
Inj
1
These matrices have the property
p7~1 = P~ = Prs )
1
(6.5.4.1)
1
A left multiplication P7-:;I A of A by pr--;I = Prs has the effect of interchanging rows rand 8 of A, whereas a right multiplication APrs interchanges
6.5 Reduction of Matrices to simpler Form
403
columns rand s of A. A left multiplication Gjl A of A by Gjl has the effect of subtracting lrj times row j from row r of the matrix A for r = j + 1,
j + 2, ... , n, while a right multiplication AGj means that lrj times column
r is added to column j of A for r = j + 1, j + 2, ... , n.
In order to successively transform A to Hessenberg form by means of
similarity transformations of the type considered, we proceed similarly as in
Gaussian elimination [see Section 4.1] and as in Householder's method [see
Section 6.5.1]: Starting with Ao := A one generates further matrices Ai,
i = 1,2, ... , n-2, by similarity transformations. Here Ai is a matrix whose
first i columns have already Hessenberg form. In step i, A i - l -t Ai :=
T i- l Ai-ITi , one uses as Ti a product Ti = Pi+I,sGi+I of a permutation
Pi+I,s (where s :::: i + 1) with an elimination matrix Gi+I' Thus step
consists of two half steps
A i - I ---+ A' := Pi+l,sAi-IPi+I,s ---+ Ai := G;-;I A'Gi+I.
The first corresponds to finding a pivot a~+l,i by partial pivoting as in
Gaussian elimination; this pivot is then used in the second half step to
annihilate the nonzero elements below the subdiagonal of column i of A'
by means of the elimination matrix Gi+1.
After n - 2 steps of this type one ends up with a Hessenberg matrix
B = An - 2.
Since the method proceeds as in 6.5.1, we illustrate only the first step A =
Ao ---+ Al = T1- I ATI , TI = P2,sG 2 , for n = 4 (here, * denotes changing elements).
First one determines by column pivoting the element as,1 of maximal modulus
in the first column of A below the diagonal, s 2: 2, which is then shifted to position
(2,1) by a row permutation (in the scetch s = 4, the pivot is marked by 0):
A
~ A" ~ [~
0
x
x
x
x
x
x
~l-~ P,,~ ~ [~
x
x
--> A'
~ (P"A")P" ~
[!
x
x
* *
x x
* *
* xx
* x
* x
*
~l
~l
Then suitable multiples of row 2 of A' are subtracted from rows j > 2 in order
to annihilate the elements below the pivot, A' ---+ A" := G 21 A', and finally one
computes Al := A"G2 :
A' ----+ A" =
XX
[0 x
x
x
Xl
[X
x ----+ A = 0
o * * *
o * * *
1
0
0
ALGOL programs for this algorithm and for the back transformation of
the eigenvectors can be found in Martin and Wilkinson (1971); FORTRAN
programs, in Smith et al. (1976).
404
6 Eigenvalue Problems
The numerical stability of this method can be examined as follows: Let
Ai and Ti be the matrices actually obtained in place of the Ai, Ti during the
course of the algorithm in a floating-point computation. In view (6.5.4.1),
~o further rounding errors are committed in the computation of Ti - 1 from
Ti ,
1
T, = fl (T.-l)
"
and by definition of Ai one has
(6.5.4.3)
With the methods of Section 1.3 one can derive for the error matrix Ri
estimates of the following form:
(6.5.4.4)
where f(n) is a certain function of n with f(n) = O(n<», a ;:::; 1 [ef. Exercise
13]. It finally follows from (6.5.4.3), since A = Ao, that
(6.5.4.5)
A n - 2 = T n - 2 ... Tl (A + F)Tl ... T n - 2
-
--1
--1
--
where
n-2
(6.5.4.6)
F = LTIT2 ... TiRiTi-1 ... T 2- 1T 1- 1.
i=1
This shows that An - 2 is exactly similar to a perturbed initial matrix A + F.
The smaller F is compared to A, the more stable is the method.
For the matrices Ti we now have, in view of (6.5.4.1) and Iliil :s: 1,
so that, rigorously,
(6.5.4.7)
-
-1
-1
-
-
As a consequence, taking note of Ai ;:::; Ti- ... T 1- AT1 ··· T i , (6.5.4.4), and
(6.5.4.6), on can get the estimates
IUboo(Ai) ~Ciluboo(A),
(6.5.4.8)
IUboo(F) :s: Kf(n) eps IUboo(A),
Ci := 22i ,
n-2
K:=
L2
4i - 2 ,
i=l
with a factor K = K (n) which grows rapidly with n. Fortunately, the bound
(6.5.4.7) in most practical cases is much too pessimistic. It is already rather
unlikely that
lub oo (Tl ... Ti) ~ 2i for i ~ 2.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
405
In all these cases the constants C i and K in (6.5.4.8) can be replaced by
substantially smaller constants, which means that in most cases F is small
compared to A and the method is numerically stable.
For non-Hermitian matrices Housholder's reduction method [see Section 6.5.1] requires about twice as many operations as the method described
in this section. Since in most practical cases this method is not essentially
less stable than the Householder method, it is preferred in practice (tests
even show that the Householder method, due to the larger number of arithmetic operations, often produces somewhat less accurate results).
6.6 Methods for Determining the Eigenvalues and
Eigenvectors
In this section we first describe how classical methods for the determination
of zeros of polynomials (see Sections 5.5, 5.6] can be used to determine the
eigenvalues of Hermitian tridiagonal and Hessenberg matrices.
We then note some iterative methods for the solution of the eigenvalue
problem. A prototype of these methods is the simple vector iteration, in
which one iteratively computes a specified eigenvalue and corresponding
eigenvector of a matrix. A refinement of this method is Wielandt's inverse
iteration method, by means of which all eigenvalues and eigenvectors can
be determined, provided one knows sufficiently accurate approximations for
the eigenvalues. To these iterative methods, finally, belong also the LR and
the QR method for the calculation of all eigenvalues. The last two methods,
especially the QR method, are the best methods currently known for the
solution of the eigenvalue problem.
6.6.1 Computation of the Eigenvalues of a Hermitian
Tridiagonal Matrix
For determining the eigenvalues of a Hermitian tridiagonal matrix
(6.6.1.1)
there are (in addition to the most important method, the QR method,
which will be described in Section 6.6.6) two obvious methods which we
now wish to discuss. Without restricting generality, let J be an irreducible
tridiagonal matrix, i.e., 'Yi "10 for all i. Otherwise, J would decompose into
irreducible tridiagonal matrices J(i), i = 1, ... , k,
406
6 Eigenvalue Problems
but the eigenvalues of J are just the eigenvalues of the J(i), i = 1, ... ,
k, so that it suffices to consider irreducible matrices J. The characteristic
polynomial 4J(JL) of J can easily be computed by recursion: indeed, letting
Pi (JL) := det( J i - JLI),
Ii
~l
"
c5 i
'
and expanding det( J i - JLI) by the last column, one finds the recurrence
relations
(6.6.1.2)
Po(JL) := 1,
pdJL) = 15 1 - JL,
Pi(JL) = (15; -IL)Pi-1(JL) -l!iI 2pi-2tJL),
i = 2,3, ... , n,
4J(IL) == Pn (JL).
Since c5 i and I!i 12 are real, the polynomials Pi (11,) form a Sturm sequence
[see Theorem (5.6.5)]' provided that Ii # 0 for i = 2, ... , n.
It is possible, therefore, to determine the eigenvalues of J with the
bisection method descrihed in Section 5.6. This method is recommended
especially if one wants to compute not all, but only certain prespecified
eigenvalues of .1. It can further be recommended on account of its numerical
stability when some eigenvalues of J are lying very close to each other [see
Barth, Martin and Wilkinson (1971) for an ALGOL program].
Since the eigenvalues of J are all real and simple, they can also be determined with Newton's method, say, in Maehly's version [see (5.5.13)]' at least
if the eigenvalues are not excessively clustered. The values Pn (A (j)), p~ (A (j))
of the characteristic polynomial and its derivative required for Newton's
method can be computed recursively: Pn(A(J)) by means of (6.6.1.2), and
p~(A(j)) by means of the following formulas which are obtained by differentiating (6.6.1.2):
p~(JL) = 0,
p~(p) = -1,
p;(/L) = -Pi-1(P) + (Oi - JL)P;-l(/L) -1,iI 2p;_2(JL),
i = 2, ... , n.
A starting value A(0) 2: maxi Ai for Newton's method can be obtained from
6.6 Methods for Determining the Eigenvalues and Eigenvectors
407
(6.6.1.3) Theorem. The eigenvalues Aj of the matrix J in (6.6.1.1) satisfy
the inequality
11 := In+l := 0.
PROOF. For the maximum norm Ilxll oo =
max IXil one has
and from Jx = AjX, x i- 0, there follows at once
IAjlllxll oo = IIJxll oo :S luboo(J) ·llxll oo ,
Ilxll oo i- 0,
and hence IAjl :S luboo(J).
D
6.6.2 Computation of the Eigenvalues of a Hessenberg Matrix.
The Method of Hyman
Besides the QR method [see Section 6.6.6]' which is used most frequently
in practice, one can in principle use all me.thods of Chapter 5 to determine
the zeros of the polynomial p(p,) = det(B - p,I) of a Hessenberg matrix
B = (b ik ). For this, as for example in the application of Newton's method,
one must evaluate the values p(p,) and p'(p,) for given p,. The following
method for computing these quantities is due to Hyman:
We assume that B irreducible, i.e., that bi,i-l i- for i = 2, ... , n.
For fixed p, on can then determine numbers a, Xl, ... , Xn-l such that
x = (Xl, ... ,Xn-l, x n ), Xn := 1, is a solution of the system of equations
°
or, written out in full,
(b ll - P,)Xl + b12X2 + ... + blnxn = a,
(6.6.2.1)
b2lXl + (b 22 - p,)X2 + ... + b2n x n = 0,
bn,n-lXn-l + (bnn - p')xn = 0.
Starting with Xn = 1, one can indeed determine Xn-l from the last equation
Xn-2 from the second to last equation, ... , Xl from the second equation,
and finally a from the first equation. The numbers Xi and a of course
depend on p,. By interpreting (6.6.2.1) as an equation for x, given a, it
follows by Cramer's rule that
1 = Xn =
a( _1)n-l b2l b32 ... bn,n-l
det(B - p,I)
,
408
6 Eigenvalue Problems
or
( 6.6.2.2)
Apart from a constant factor, a = a(p) is thus identical with the characteristic polynomial of B. By differentiation with respect to /1, letting
x; := x;(p) and observing Xn == 1, x~ == 0 one further obtains from (6.6.2.1)
the formulas
(b l l ~ /h)x~ ~ Xl + b 12 X; + ... + b1.n-lX~_1 = a',
b 21 X;
+ (b 22 ~ p)x; ~ X2 + ... + b2n-lX~_1 = 0,
which, together with (6.6.2.1) and starting with the last equation, can be
solved recursively for the Xi, X;, a, and finally a'. In this manner one can
compute a = a(p) and a'(p) for every /h, and thus apply Newton's method
to compute the zero of a(p), i.e., by (6.6.2.2), the eigenvalues of B.
6.6.3 Simple Vector Iteration and Inverse Iteration of Wielandt
A precursor of all iterative methods for determining eigenvalues and eigenvectors of a matrix A is the simple vector iteration: Starting with an arbitrary initial vector to E (Cn, one forms the sequence of vectors {td with
i = 1. 2, ....
Then
t, = Aito.
In order to examine the convergence of this sequence, we first assume A to
be a diagonalizable n x n matrix having eigenvalues Ai,
We further assume that there is no eigenvalue Aj different from Al with
IAjl = IA11, i.e., there is an integer r > 0 such that
(6.6.3.1)
Al = A2 = ... = Ar
The matrix A, being diagonalizable, has n linearly independent eigenvectors
Xi, AXi = AiXi, which forms a basis of (Cn We can therefore write to in the
form
6.6 Methods for Determining the Eigenvalues and Eigenvectors
409
(6.6.3.2)
For ti there follows the representation
(6.6.3.3)
Assuming now further that to satisfies
-this makes precise the requiremant that to be "sufficiently general" follows from (6.6.3.3) and (6.6.3.1) that
(6.6.3.4)
it
1) i
(An) i
A11 ti = P1 X1 + ... + PrXr + Pr+1 (Ar+
~ Xr+1 + ... + Pn ~ Xn
and thus, because IAj/A11 < 1 for j .::,. l' + 1,
(6.6.3.5)
1
lirn - t = P1X1 + ... + P X
i-4CX)
A1 '
r
r·
Normali:.ling the ti =: (Ti i ), ... , T~i») T in any way, for example, by letting
Hii) I = max Hi)l,
s
it follows from (6.6.3.5) that
(HI)
(6.6.3.6)
T·
Ji
\
1· -(-.).1m
= A1,
~-HX)
~
T
lim Zi = a(P1X1 + ... + PrXr),
,-400
],
where a i= 0 is a normalization constant. Under the stated assumptions
the method thus furnishes both the dominant eigenvalue A1 of A and an
eigenvector belonging to A1, namely the vector Z = a(P1x1 + .. ·+Prxr), and
we say that "the vector iteration converges toward A1 and a corresponding
eigenvector" .
Note that for l' = 1 (AJ a simple eigenvalue) the limit vector z is
independent of the choice to to (provided only that P1 i= 0). If Al is a
multiple dominant eigenvalue (1' > 1), then the eigenvector z obtained
depends on the ratios P1 : P2 : ... : Pr, and thus on the initial vector to. In
addition, it can be seen from (6.6.3.4) that we have linear convergence with
convergence factor IAr+ Ii A11. The proof of convergence, at the same time,
shows that the method does not always converge to Al and an associated
eigenvector, but may converge toward an eigenvalue Ak and an eigenvector
belonging to Ak, if in the decomposition (6.6.3.2) of to one has
P1 = ... = Pk-1 = 0,
410
6 Eigenvalue Problems
(and there is no eigenvalue which differs from Ak and has the same modulus
as Ak). This statement, however, has only theoretical significance, for even
if initially one has exactly PI = 0 for to, due to the effect of rounding errors
we will get for the computed [1 = fl(Ato), in general,
[1 = cA1:1'1 + lh A 2 x 2 + ... + PnAnxn
with a small c =f. 0 and Pi :::::; Pi, i = 2, ... , n, so that the method eventually
still converges to Al and a corresponding eigenvector.
Suppose now that A is a nondiagonalizable matrix with a uniquely determined dominant eigenvalue Al (i.e., IA11 = IAil implies)'1 = Ai). Replacing (6.6.3.2) by a representation of to as linear combination of eigenvectors
and principal vectors of A. one can show in the same way that for "sufficiently general" to the vector iteration converges to Al and an associated
eigenvector .
For practical computation the simple vector iteration is only conditionally useful, since it converges slowly if the moduli of the eigenvalues are
not sufficiently separated, and moreover furnishes only one eigenvalue and
associated eigenvector. These disadvantages are avoided in the inverse iteration (also called fractional iteration) of \Vielandt. Here one assumes that
a good approximation A is already known for one of the eigenvalues AI, ... ,
An of A, say Aj:
(6.6.3.7)
Starting with a "sufficiently general" initial vector to E <en. one then forms
the vectors t i , i = 1. 2, ... , according to
(6.6.3.8)
(A - A1)ti = f i - 1.
If A =f. Ai, i = 1, ... , n, then (A - A1)-1 exists and (6.6.3.8) is equivalent
to
ti = (A - A1)-l ti _ 1 ,
i.e., to the simple vector iteration with the matrix (A - A1)-l and the
eigenvalues l/(Ak - A), k = 1, 2, ... , n. Because of (6.6.3.7) one has
IAj~AI»IAk~AI for Ak=f.Aj.
Assuming A again diagonalizable, with eigenvectors Xi, it follows from to =
PIXI + ... + Pn:rn (if Aj is a simple eigenvalue) that
(6.6.3.9)
6.6 Methods for Determining the Eigenvalues and Eigenvectors
411
The smaller IAj - AI/IAk - AI for Ak =I- Aj (i.e., the better the approximation
A), the better will be the convergence.
The relations (6.6.3.9) may give the impression that an initial vector to
is the more suitable for inverse iteration the more closely it agrees with the
eigenvector Xj, to whose eigenvalue Aj one knows a good approximation A.
They further seem to suggest that with the choice to ~ Xj the "accuracy"
of the t; increases uniformly with i. This impression is misleading, and true
in general only for the well-conditioned eigenvalues Aj of a matrix A [see
Section 6.9], i.e., for the eigenvalues Aj which under a small perturbation
of A change only a little:
if
lub(6A)
lub(A) = O(eps).
According to Section 6.9 all eigenvalues of symmetric matrices A are
well conditioned. A poor condition for Aj must be expected in those cases
in which Aj is a multiple zero of the characteristic polynomial of A and A
has nonlinear elementary divisors, or, what often occurs in practice, when
Aj belongs to a cluster of eigenvalues whose eigenvectors are almost linearly
dependent.
Before we study in an example the possible complications for the inverse
iteration caused by an ill-conditioned Aj, let us clarify what we mean by a
numerically acceptable eigenvalue A (in the context of the machine precision
eps used) and corresponding eigenvector X;.. =I- 0 of a matrix A.
The number A is called a numerically acceptable eigenvalue A if there
exists a small matrix 6A with
lub(6A)/lub(A) = O(eps)
such that A is an exact eigenvalue of A + 6A. Such numerically acceptable
eigenvalues A are produced by every numerically stable method for the
determination of eigenvalues.
The vector X;.. =I- 0 is called a numerically acceptable eigenvector
corresponding to a given approximation A of an eigenvalue A if there exists
a small matrix 6 1 A with
lub(61 A)
lub(A) = O(eps).
(6.6.3.10)
EXAMPLE 1. For the matrix
the number>' := 1 + yIePs is a numerical acceptable eigenvalue, for>. is an exact
eigenvalue of
A + 6A
with
6 [0 0].
A:=
eps
0
412
6 Eigenvalue Problems
Although
x)"
:=
[~]
is an exact eigenvector of A, it is not, in the sense of the above definition, a
numerically acceptable eigenvector corresponding to the approximation A = 1 +
because every matrix
vers,
with
has the form
AlA =
'-"
[~oeps ~],
u
(3, is
arbitrary.
For all these matrices, lub=(D. 1 A) 2': ~» O(eps ·lub(A)).
It is thus possible to have the following paradoxical situation: Let .\0
be an exact eigenvalue of a matrix, Xo a corresponding exact eigenvector,
and .\ a numerically acceptable approximation of '\0, Then Xo need not be
a numerically acceptable eigenvector to the given approximation .\ of .\0.
If one has such a numerically acceptable eigenvector .\ of A, then with
inverse iteration one merely tries to find a corresponding numerically acceptable eigenvector x)". A vector t j, j ~ 1, found by means of inverse
iteration from an initial vector to can be accepted as such an ;J:)", provided
Iltj-III = O(eps) ·lub(A ).
It;f
(6.6.3.11)
For then, in view of (A ~ .\I)tj = t j - I , one has
(A + LiA ~ .\I)t] = 0
with
lub 2 (LiA ) =
Iltj-d2
IItjl12 = 0 (cps lub 2( A )) .
It is now surprising that for ill-conditioned eigenvalues we can have an
iterate tj which is numerically acceptable, while its successor t j + 1 is not:
EXAI\IPLE 2. The matrix
A = [~
~],
T}
= O(eps),
has the eigenvalues Al = T} + vIr!, A2 = T} ~ vir! with corresponding eigenvectors
6.6 Methods for Determining the Eigenvalues and Eigenvectors
413
Since 7J is small, A is very ill-conditioned and )\1, A2 form a cluster of eigenvalues whose eigenvectors Xl, X2 are almost linearly dependent [e.g., the slightly
perturbed matrix
- [00 6]
A·=
.
= A+LiA,
has the eigenvalue A(A) = 0 with min IA(A) - Ai I 2': ~ - 1771 » O(eps)J. The
number A = 0 is a numerically acceptable eigenvalue of A to the eigenvector
Indeed
(A + LiA - o· I)x), = 0
Li A := [=~
for
~],
lub=(LiA)/lub=(A) = _1771_ 1;: :; O(eps).
1+ 1
77
Taking this numerically acceptable eigenvector
as the initial vector to for the inverse iteration, and A
eigenvalue of A, one obtains after the first step
o as an approximate
The vector h, is no longer numerically acceptable as an eigenvector of A, every
matrix lilA with
having the form
with
a
=1 +
{3 - '},
,= 6.
These matrices lilA all satisfy lub=(LilA) 2': 1-1771; there no small among them
with lub ou (LiA) ;:::; O( eps).
On the other hand, every initial vector to of the form
to =
[n
with ITI not too large produces for h a numerically acceptable eigenvector. For
example, if
to =
the vector
[i] ,
414
6 Eigenvalue Problems
h~ ~ [1]0
~ I)
is numerically acceptable, as we just showed.
Similar circumstances to those in this example prevail in general whenever the eigenvalue considered has nonlinear f'lementary divisors. Suppose
Aj is an eigenvalue of A for which there arc elementary divisors of degrees
at most k [k = Tj: sec (6.2.11)]. Through a numerically stable method one
can in general [see (6.9.11)] obtain only an approximately value A within
the error A ~ Aj = O(eps1/k) [we assume for simplicity lub(A) = 1].
Let now Xo, Xl, ... , Xk-1 be a chain of principal vectors [see Sections
6.2, 6.3] for the eigenvalue Aj,
(A ~ AjI);ri = Xi-1
for i = k ~ L ... ,0,
(:1:-1 := 0).
It then follows immediately, for A i- Ai, i = 1, ... , n, that
(A ~ AI)-l [A ~ AI + (A ~ Aj)I]xi = (A ~ AI)-1,ri_1,
(A ~ AI)
-1
1
1
.ri = A' ~ A Xi + A ~ A' (A ~ AI)
J
i=k~l,
J
-1
Xi-1,
... ,O,
and from this, by induction,
(A ~ AI)
-1
1
1
1
Xk-1 = AJ ~ A Xk-1 ~ (Aj ~ A)2 Xk-2 + ... ± (Aj ~ A)k Xo·
For the initial vector to := .Tk-1, we thus obtain for the corresponding t1
and therefore Iltoll/lltIil = O(eps), so that t1 is acceptable as an eigenvector
of A [see (6.6.3.11)]. If, however, we had taken as initial vector to := XU,
the exact eigenvector of A, we would get
and thus
Since eps1/k » eps, the vector t1 is not a numerically acceptable eigenvector
to the approximate value A at hand [cf. Example 1].
For this reason, one applies inverse iteration in practice only in a rather
rudimentary form: Having computed (numerically acceptable) approximations A for the exact eigenvalues of A by means of a numerically stable
6.6 lVlethods for Determining the Eigenvalues and Eigenvectors
415
method, one tentatively determines, for a few different initial vectors to,
the corresponding tl and accepts as eigenvector that vector tl for which
Iitoli/lltill best represents a quantity of O(eps lub(A)).
Since the step to ---+ tl requires the solution of a system of linear equations (6.6.3.8), one applies inverse iteration in practice only for tridiagonal and Hessenberg matrices A. To solve the system of equations (6.6.3.8)
(A - AI is almost singular), one decomposes the matrix A - AI into the
product of a lower and an upper triangular matrix, Land R, respectively.
To ensure numerical stability one must use partial pivoting, i.e., one determines a permutation matrix P and matrices L, R with [see Section 4.1]
P(A - M) = LR,
The solution tl of (6.6.3.8) is then obtained from the two triangular
systems of equations
Lz = Pt o,
(6.6.3.12)
Rtl = z.
For tridiagonal and Hessenberg matrices A, the matrix L is very sparse: In
each column of L at most two elements are different from zero. R for Hessenberg matrices A is an upper triangular matrix; for tridiagonal matrices
A, a matrix of the form
* * *
R=
0
*
o
*
*
so that in cach case the solution of (6.6.3.12) requires only a few operation.
Besides, Ilzll :::::0 Iitoll, by the first relation in (6.6.3.12). The work can therefore be simplified further hy determining tl only from Rtl = z, where one
tries to choose the vector z so that Ilzll/lltlll becomes as small as possible.
An ALGOL program for the computation of the eigenvectors of a symmetric tridiagonal matrix using inverse iteration can he found in Peters
and Wilkinson (1971a); FORTRAN programs, in Smith et al. (1976).
6.6.4 The LR and QR Methods
The LR-method of Rutishauser (1958) and the QR method of Francis
(1961/62) and Kuhlallovskaja (1961) are also iterative methods for the
416
6 Eigenvalue Problems
computation of the eigenvalues of an n x n matrix A. We first describe the
historically earlier LR method. Here, one generates a sequence of matrices A;. beginning with Al := A, according to the following rules: Use the
Gauss elimination method [see Section 4.1] to present Ai as the product of
a lower triangular matrix Li = (Ijk) with I jj = 1 and an upper triangular
matrix Ri
(6.6.4.1)
Thereafter, let
i = 1, 2, ....
From Section 4.1 we know that an arbitrary matrix Ai does not always
have a factorization Ai = LiRi. We assume, however, for the following
discussion that Ai can be so factored. We begin by showing the following.
(6.6.4.2) Theorem. If all decompositions Ai = LiRi exist, then
(a) Ai+l is similar to Ai:
i = 1, 2, ....
(b) A i+ 1 = (L I L 2 ·· ·L i )-lA 1 (L I L 2 ·· ·L i ), i = 1, 2, ....
(c) For the lower triangular matrix Ti := Ll ... Li and the upper triangular
matrix Ui := R i · .. Rl one has
i = 1, 2, ....
PROOF. (a) From A; = LiRi we get
(b) follows immediately from (a).
To prove (c), note that (b) implies
i = 1, 2, ....
It follows for i = 1, 2, ... that
TiUi = Ll ... L i - 1 (LiRi)R i - 1 ... Rl
= LI ... Li-IA;Ri - 1 ... Rl
= AILI ... Li-IRi - 1 ••. Rl
= AITi-lUi-I'
With this, the theorem is proved.
o
6.6 Methods for Determining the Eigenvalues and Eigenvectors
417
It is possible to show, under certain conditions, that the matrices Ai
converge to an upper triangular matrix AX) with the eigenvalues Aj of
A as diagonal elements Aj = (Ax;)jj. However, we wish to analyze the
convergence only for the Q R method which is closely related to the LR
method, because the LR method has a serious drawback: it breaks down if
one of the matrices Ai does not have a triangular decomposition, and even
if the decomposition Ai = LiRi exists, the problem of computing Li and
R may be ill conditioned.
In order to avoid these difficulties (and make the method numerically stable),
one could think of forming the LR decomposition with partial pivoting:
PiAi =: LiRi,
Pi permutation matrix, p;-l = pt,
A i + 1 : = RiP;T L; = L;l(p;AiPt)Li'
There are examples, however, in which the modified process fails to converge:
A1 =
[~ ~] ,
>q = 3,
'>"2 = -2,
P1 =
[~ 6] ,
P1A 1 =
[i ~]
A2 = R 1P'[ L1 =
[~ ~] ,
6] ,
P2A 2 =
P2 =
[~
A3 = R 2PiL 2 =
[i ~]
=
[1~2 ~][~ ~] = L1R1,
[
1~3 ~] [~ ~] = L R
2
2,
[~ ~] == A 1.
The LR method without pivoting, on the other hand, would converge for this
example.
These difficulties of the LR method are avoided in the QR method of
Francis (1961/62), which can be regarded as a natural modification of the
LR algorithm. Formally the QR method is obtained if one replaces the LR
decompositions in (6.6.4.1) by QR decompositions [see Section 4.7]. Thus,
again beginning with Al := A, the QR method generates matrices Qi, R i ,
and Ai according to the following rules:
(6.6.4.3)
Ai =: QiRi,
Here, the matrices are factored into a product of a unitary matrix Qi and
an upper triangular matrix R i . Note that a QR decomposition Ai always
exists and that it can be computed with the methods of Section 4.7 in a
418
6 Eigenvalue Problems
numerically stable way: There are n - 1 Householder matrices H?), j = 1,
... , n - 1, satisfying
where Ri is upper triangular. Then, since (H?»)H = (H?»)-l = H?) the
matrix
H(i)
Q ,...-- H(i)
1 ...
n-l
is unitary and satisfies Ai = QiRi' so that Ai+! can be computed as
_
_
(i)
(i)
Ai+l - RiQi - RiHl ... H n - l ·
Note further that the QR decomposition of a matrix is not unique. If S is
an arbitrary unitary diagonal matrix, i.e., a phase matrix of the form
then QiS and SH Ri are unitary and upper triangular, respectively, and
(QiS)(SH R i ) = QiRi' In analogy to Theorem (6.6.4.2), the following holds.
(6.6.4.4) Theorem. The matrices Ai, Qi, Ri in (6.6.4.3) and
Pi := QlQ2'" Qi,
Ui := RiRi - l ... R l ,
satisfy:
(a) Ai+! is unitarly similar to Ai, Ai+l = QfI AiQi'
(b) Ai+l = (Ql'" Qi)H Al(Ql··· Qi) = p iH AlPi ·
(c) Ai = PiUi .
The proof is carried out in much the same way as for Theorem (6.6.4.2)
and is left to the reader.
In order to make the convergence properties [see (6.6.4.12)] of the QR
method plausible, we show that this method can be regarded as a natural
generalization of both the simple and inverse vector iteration [see Section
6.6.3]. As in the proof of (6.6.4.2), starting with Po := I, we obtain from
(6.6.4.4) the relation
(6.6.4.5)
i ~ 1.
If we partition the matrices Pi and Ri as follows
r '(rl]
P i = [Pi ,Pi '
so that pt) is an n x r matrix and R~r) an r x r matrix, then (6.6.4.5)
implies
(6.6.4.6)
AP(r)l = p(r) R(r)
1,-
1,
1,
for i ~ 1, 1:::; r :::; n.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
419
Denoting the linear subspace, which is spanned by the orthonormal columns
of pt) by Pi(r) := R(Pi(r)) = {pi(r) z I Z E <e r }, we obtain
p(r) ~ AP(r)I = R(p(r) R(r))
~
z-
z
for i 2: 1, 1::::: r ::::: n,
z
with equality holding, when A is nonsingular, implying that all R i , R~r)
will also be nonsingular, that is, the QR method can be regarded as a
subspace iteration. The special case r = 1 in (6.6.4.6) is equivalent to the
I ): If we
usual vector iteration [see 6.6.3] with the starting vector el =
write Pi := pP) then
pci
Ilpill = 1,
(6.6.4.7)
i 2: 1,
as follows from (6.6.4.6); also, R?) = ri~, and IIFF) II = 1. The convergence
of the vector Pi can be investigated as in 6.6.3: We assume for simplicity
that A is diagonalizable and has the eigenvalues Ai with
IAII > IA21 2: .. ·IAnl·
Further let X = [Xl, ... ,X n ] = (Xik), Y := X-I = [YI, ... ,YnV = (Yik) be
matrices with
A=XDY,
(6.6.4.8a)
D -_ [AOI
so that Xi and y[ are a right and left eigenvector belonging to Ai:
YiT Xk =
(6.6.4.8b)
i = k,
{I0 for
otherwise.
If PI = I'll =I 0 holds in the decomposition of el
then by the results of 6.6.3 the ordinary vector iteration, tk := Akel satisfies:
(6.6.4.9)
On the other hand (6.6.4.7) implies
Aiel = rWri;) ... ri?Pi.
Since Ilpill = 1, there are phase factors (J"k = ei(Pk, l(J"kl = 1, so that
lim (J"iPi = Xl,
i
.
(i) (J"i-I _ \
1Imr
ll - - -/II,
(J"i
420
6 Eigenvalue Problems
ri?
Thus, the
and Pi "essentially converge" (i.e., up to a phase factor)
toward Al and Xl as i tends to oo. By the results of 6.6.3 the speed of
convergence is determined by the factor IA2/AII < 1,
(6.6.4.10)
Using (6.6.4.10) and A i + l = PiH APi [Theorem (6.6.4.4)(b)] it is easy
to see that the first column Aiel of Ai converges to the vector AIel =
[AI, 0, ... , O]T as i -too, and the smaller IA2/ All < 1 is, the faster the error
IIAiel - Alelll = O((IA21/IAd) tends to zero. We note that the condition
PI = Un i- 0, which ensures the convergence, is satisfied if the matrix Y
has a decomposition Y = L y Ry into a lower triangular matrix L y with
(Ly )j j = 1 and an upper triangular matrix Ry.
It will turn out to be particularly important that the QR method is
also related to the inverse iteration [see 6.6.3] if A is nonsingular. Since
p;HPi = I, we obtain from (6.6.4.5) P/!..lA- l = RilPiH or
Denote the 71 x (71 - r + 1) matrix consisting of the last 71 - r + 1 columns
of Pi by pt), denote the subspace generated by the columns of Pt) by
P?·) = R(Pi(r)), and denote the upper triangular matrix consisting of the
last 71 - r + 1 rows and columns of Ri by R~r). Then the last equation can
be written as a vector space iteration with the inverse A-H = (AH)-l of
the matrix A H :
A -H p(r)l = p(r) (R(r))-H
'[-
z
z
for i 2: 1, 1:S r :S 71,
In the special case r = 71 this iteration reduces to an ordinary inverse vector
iteration for the last column Pi := Pi(n) of the matrices Pi:
A-H'.
_'.-(i)
P,-l - p, Pnn'
(i) ._ (R-1)
inn,
p.nn·-
arising from the starting vector Po = en. Convergence of the iteration can
be investigated as in Section 6.6.3 and as previously. Again, the discussion
is easy if A is diagonalizable and A -H has a unique eigenvalue of maximal
modulus, i.e., if the eigenvalues Ai of A now satisfy IAll 2: ... 2: IAn-II>
IAnl > 0 (note that A- H has the eigenvalues 5. j l, j = 1, ... ,71). Let Xi and
UJ be the right and left eigenvector of A belonging to the eigenvalue Ai,
and consider (6.6.4.8). If the starting vector Po = en is sufficiently general,
the vector Pi will "essentially" converge toward a normalized eigenvector
u, lIull = 1, of A-H for its largest eigenvalue 5.;;-1, A-Hu = 5.;;-lu, or
6.6 Methods for Determining the Eigenvalues and Eigenvectors
421
equivalently u H A = An u H . Therefore the last rows e~ Ai of the matrices
Ai will converge to Ane~ = [0, ... ,0, Anl as i ---+ oc.
Now, the quotient IAn/An-II < 1 determines the convergence speed,
(I D,
Ile~Ai - [0, ... ,0, Anlll = 0 A~:l
(6.6.4.11)
since A- H has the eigenvalues 5. j l satisfying IA~ll > IA;;-~ll ;:: ... ;:: IAIll
by hypothesis. For the formal convergence analysis we have to write the
starting vector Po = en as a linear combination
en = Plfh + ... + PnYn,
of the right eigenvectors Yj of the matrix A-HYj = 5.j l Yj. Convergence
is ensured if the coefficient Pn associated with the eigenvalue 5.~ 1 of A-H
largest in modulus is =I O. Now 1= XHyH = [xl, ... ,xnlH[yl, ... ,Ynl
implies Pn = xI,[ en, so that Pn =I 0 if the element Xnn = e~xn = Pn
of the matrix X is nonzero. We note that this is the case if the matrix
y = Ly R y has an LR decomposition: the reason is that X = Ryl Lyl,
since X = y-l and Ryl and Lfl are upper and lower triangular matrices,
which implies Xnn = e~ Ryl Ly en =I O. This analysis motivates part of the
following theorem, which assures the convergence of the QR method if all
eigenvalues of A have different moduli.
(6.6.4.12) Theorem. Let A =: Al be an n x n matrix satisfying the
following hypotheses
(1) The eigenvalues Ai of A have distinct mod1l1i:
tAll > IA21 > ... > IAnl·
(2) The matrix Y with A = XDY, X = y-l, D = diag(Al, ... , An) = the
Jordan normal form of A, has a triangular decomposition
Then the matrices Ai, Qi, Ri of the QR method (6.6.4.3) have the following
convergence properties: There are phase matrices
Q
0'( (J'l(i) , . . . , (J' n(i)) '
,:>,
-_ d'lab
I= 1
I(J'(i)
k
'
such that lim; Sf-I QiSi = I and
lim sf RiS;-l = lim SE-lAiSi- l = [
'l-+()C)
Al:2
'x~ 1
o
An
t-+<x
422
6 Eigenvalue Problems
} n part'zcu l ar, rImi->00 a jJ
(i) = Aj ' J. = 1 " . "
h
A i = ((il)
n, were
a jk '
Remark. Hypothesis (2) is used only to ensure that the diagonal
elements of Ai converge to the eigenvalues Aj in their natural order,
IAll> IA21 > '" > IAnl,
PROOF [following Wilkinson (1965)], We carry out the proof under the
additional hypothesis An i- 0 so that D- 1 exists, Then from X - I = Y
Ai=XDiy
= QxRxDiLyRy
(6,6.4.13)
= QxRx(DiLyD-i)DiRy,
where Q x Rx = X is a Q R decomposition of the nonsingular matrix X into
a unitary matrix Qx and a (nonsingular) upper triangular matrix R x , Now
DiLyD- i =: (l~~) is lower triangular, with
() (AA: )
lj~ =
i
ljk,
L y =: (I jk ),
ljk =
{I
0
k,
for j =
for j < k,
Since IAj I < IAk I for j > k it follows that lim;l;~ = 0 for j > k, and thus
lim Ei = 0,
i---+oc'
Here the speed of convergence depends on the separation of the moduli of
the eigenvalues, Next, we obtain from (6.6.4.13)
Ai = QxRx(I + Ei)DiRy
(6.6.4.14)
= Qx(I + RxEiR)/)RxDiRy
= Qx(I + Fi)RxDiRy
with Fi := Rx EiR)/, limi Fi = O. Now the positive definite matrix
(I + F;JH (I + F;J = I + Hi,
with limi Hi = 0, has a uniquely determined Choleski decomposition [Theorem (4,3,3)]
where Hi is an upper triangular matrix with positive diagonal elements,
Clearly, the Choleski factor Hz depends continuously on the matrix I + Hi,
as is shown by the formulas of the Choleski method, Therefore lim; Hi = 0
implies limi Hi = I, Also the matrix
is unitary:
6.6 Methods for Determining the Eigenvalues and Eigenvectors
423
Qf Qi = ft;H (I + Fi)H (I + Fi)it;l = R;H (I + Hi)R;l
=
R;H (Rf k)R;l = I.
Therefore, the matrix I + Fi has the QR decomposition I + Fi = Qik,
with limi Qi = limi(I + Fi)R;l = I, limi k = I. Thus, by (6.6.4.14)
where QXQi is unitary, and RiRXD'Ry is an upper triangular matrix.
On the other hand, by Theorem (6.6.4.4)(c), the matrix Ai has the QR
decomposition
Since the QR decomposition for nonsingular A is unique up to a rescaling
of the columns (rows) of Q (resp. R) by phase factors CJ = e i ¢, there are
phase matrices
d',g.( CJ 1(i) , " ' , CJ n(il) ,
Si -- l a
with
i ? 1,
and it follows that
lim FiSi = limQxQi = Qx,
i
i
1
1D- i+ 1R1R- i-1
-1 SH
R i -- Ui Ui-1 -- S i R- i R X DiRY' RY
X
i-1
=
-
-1 - -1
H
SiRiRxDRx R i _ 1S i _ 1,
lim Sf R iS i - 1 = Rx DR)/,
I
and finally, by Ai = QiRi,
limSt~'lAiSi-1
= limS~lQiSisfRiSi-1
= RxDRxl,
,
,
The proof is now complete, since the matrix Rx D R)/ is upper triangular
and has diagonal D
1
Rx DR;/ =
[
>'0
*
D
424
6 Eigenvalue Problems
It can be seen from the proof that the convergence of the Qi, Ri and
Ai is linear, and improves with decreasing "converging factors" IAJ / Ak I,
j > k, i.e., with improved separation of the eigenvalues in absolute value.
The hypotheses of the theorem can be weakend, e.g., our analysis that led
to estimates (6.6.4.10) and (6.6.4.11) already showed that the first column
and the last row of Ai converge under weaker conditions on the separation
of the eigenvalues. In particular, the strong hypotheses of the theorem are
violated in the important case of a real matrix A with a pair AT) Ar+l = '\1'
of conjugate complex eigenvalues. Assuming, for example, that
and that the remaining hypotheses of Theorem (6.6.4.12) are satisfied, the
following can still be shown for the matrices Ai =
(a;2).
(6.6.4.15) (a) limi
a;2 = 0 for all (j, k) =I (r + 1,r) with j > k.
(b) limi a;'i = A] for j =I r, r + 1.
(c) Although the 2 x 2 matrices
(i)
[
a r1'
( i)
a r+1.1'
diverge in general as i --t :)0, their eigenvalues converge toward Ar and
Ar +l.
In other words, the convergence of the matrices Ai takes place in the
positions denoted by Aj and 0 of the following figure, and the eigenvalues
of the 2 x 2 matrix denoted by * converge
Al
X
0
A2
X
X
X
x
X
X
X
X
0
Ai --+
A1'-l
0
i----'tx
* *
* *
0
0
X
Ar+~
0
x
An
For a detailed investigation of the convergence of the QR method, the
reader is referred to the following literature: Parlett (1967), Parlett and
Poole (1973), and Golub and Van Loan (1983).
6.6 Methods for Determining the Eigenvalues and Eigenvectors
425
6.6.5 The Practical Implementation of the QR Method
In its original form (6.6.4.:3), the QR method has some serious drawbacks
that make it hardly competitive with the methods considered so far:
(a) The method is expensive: a complete step Ai -+ A i + 1 for a dense
n x n matrix A requires 0(n 3 ) operations.
(b) Convergence is very slow if some quotients IAj / Ak I, j # k, of the
eigenvalues of A are close to 1.
However, there are remedies for these disadvantages.
(a) Because of the amount of work, one applies the QR method only
to reduced matrices A, namely matrices of Hessenberg form, or in the case
of Hermitian matrices, to Hermitian tridiagonal matrices (i.e., Hermitian
Hessenberg matrices). A general matrix A, therefore must first be reduced
to one of these forms by means of the methods described in Section 6.5. For
this procedure to make sense, one must show that these special matrices are
invariant under the QR transformation: if Ai is a (possibly Hermitian) lIessenberg matrix then so is Ai+ 1. This invariance is easily established. From
Theorem (6.6.4.4)(a), the matrix AH1 = Qf AiQi is unitarily similar to Ai
so that AH1 is Hermitian if Ai is. If Ai is an n x n Hessenberg matrix, then
AH 1 can be computed as follows: First, reduce the subdiagonal elements
of Ai to 0 by means of suitable Givens matrices of type D I2 , ... , D n - I.n
[see Section 4.9]
Ai = QiRi'
nHnH
nH
Q i ._
. - UI2J£23' .. un-l,n,
and then compute A i + 1 by means of
A i + l = RiQi = RiDi!zD~ ... D;;-l.n·
Because of the special structure of the Dj,j+l, the upper triangular matrix
Ri is transformed by the post multiplications with the Dfj+1 into a matrix
Ai+1 of Hessenberg form. Note that A i+ l can be computed from Ai in one
sweep if the matrix multiplications are carried out in the following order
One verifies easily that it takes only 0(n 2 ) operations to transform an n x n
Hessenberg matrix Ai into A i +1 in this way, and only O(n) operations in
the Hermitian case, where Ai is a Hermitian tridiagonal matrix.
We therefore assume for the following discussion that A and thus all Ai
are (possibly Hermitian) Hessenberg matrices. We may also assume that
426
6 Eigenvalue Problems
the Hessenberg matrices Ai are irreducible, i.e., their subdiagonal elements
(i)
.
0 therwise Ai has the form
aj,j_l' J = 2, ... , n, are nonzero.
where A;, A;' are Hessenberg matrices of order lower than n. Since the
eigenvalues of Ai are just the eigenvalues of A; and A;', it is sufficient
to determine the eigenvalues of the matrices A;, A;' separately. So, the
eigenvalue problem for Ai can be reduced to the eigenvalue problem for
smaller irreducible matrices.
Basically, the Q R method for Hessenberg matrices runs as follows: The
Ai are computed according to (6.6.4.3) until one of the last two subdiagonal
elements a~?n-l and a~~1,n-2 of the Hessenberg matrix Ai (which converge
to 0 in general, see (6.6.4.12), (6.6.4.15)) become negligible, that is,
. {I a n(i). n - I I, Ia n(i)- 1 ,n-2 I}
min
(;)1 + Ian-l.
(i)
I)
s: eps (I ann
n- 1 '
eps being, say, the relative machine precision. In case a~:n-l is negligible,
a~~ is a numerically acceptable eigenvalue [see Section 6.6.3] of A, because
it is the exact eigenvalue of a Hessenberg matrix Ai close to Ai IIAi - Ai I
eps IIAil1 : Ai is obtained from Ai by replacing 1L~)n_l
by O. In case 1L~~1 , n-2
,
is negligible, the eigenvalues of the 2 x 2 matrix
s:
(i)
[ an-I.n-l
(i)
an,n-l
(i)
an-,I.n
(i)
ann
1
are two numerically acceptable eigenvalues of Ai, since they are exact eigenvalues of a Hessenberg matrix Ai close to Ai, which is obtained by replacing
the small element a~~l n-2 by O. If a~~ is taken as an eigenvalue, one crosses
out the last row and c~lumn, and if the eigenvalues of the 2 x 2 matrix are
taken, one removes the last two rows and columns of Ai, and continues the
algorithm with the remaining matrix.
By (6.6.4.4) the QR method generates matrices that are unitarily similar to each other. We note an interesting property of "almost" irreducible
Hessenberg matrices that are unitarily similar to each other, a property
that will be important for the implicit shift techniques to be discussed
later.
(6.6.5.1) Theorem. Let Q = [ql"'" qn] and U = [U1 .. ' .. Un] be unitary
matrices with columns qi and 11i, and suppose that H = (h jk ) := QH AQ
and K := U H AU are both Hessenberg matrices that are similar to the same
matrix A using Q und U, respectively, for unitary transformations. Assume
further that h i . i - 1 Ie 0 for i
n - 1, and that 111 = IJ'lql, 11J'11 = l. Then
s:
6.6 Methods for Determining the Eigenvalues and Eigenvectors
427
there is a phase matrix S = diag(ab ... , an), lakl = 1, such that U = QS
andK=SHHS.
That is, if H is "almost" irreducible and Q and U have "essentially"
the same first column, then all columns of these matrices are "essentially"
the same, and the Hessenberg matrices K and H agree up to phase factors,
K = SHHS.
PROOF. In terms of the unitary matrix V = [VI, ... ,Vn ] := U H Q, we have
H = V H KV and therefore
KV=VH.
A comparison of the (i - 1)-th column on both sides shows
i-I
h i ,i-l V i = KVi-l -
Lh
j ,i-l V j,
2 ~ i ~ n.
j=1
With h i ,i-l i= 0 for 2 ~ i ~ n - 1, this recursion permits one to compute
i ~ n - 1 from VI. Since VI = U H ql = alUHul = aIel and K is
a Hessenberg matrix we find that the matrix V = [VI, ... , v n ] is upper
triangular. As V is also unitary, VVH = I, V is a phase matrix, which we
denote by SH := V. The theorem now follows from K = V HV H.
0
Vi,
(b) The slow convergence of the QR method can be improved substantially by means of so-called shift techniques. We know from the relationship
between the QR method and inverse vector iteration [see (6.6.4.11)] that,
for nonsingular A, the last rowe; Ai of the matrices Ai converges in general
to Ane; as i ---+ 00, if the eigenvalues Ai of A satisfy
In particular, the convergence speed of the error
is determined by the quotient IAn/An-II. This suggests that we may accelerate the convergence by the same techniques used with inverse vector
iteration [see Section 6.6.3], namely, by applying the QR method not to A
but to a suitable shifted matrix A = A - kI. Here, the shift parameter k is
chosen as a close approximation to some eigenvalue of A, so that, perhaps
after a reordering of the eigenvalues,
Then the elements a~)n-l of the matrices Ai associated with A will
converge much faster to 0 'as i ---+ 00, namely, as
428
6 Eigenvalue Problems
Note that if A is a Hessenberg matrix, so also are A = A - kI and all Ai,
hence the last row of Ai has the form
TA-·l -- [0 , • • . , 0 ,an,n-l'
-til
-(ill
en
ann'
More generally, if we choose a new shift parameter in each step, the QR
method with shifts is obtained:
AI: = A,
(6.6.5.2)
Ai - k;l =: QiRi
(QR decomposition)
The matrices Ai are still unitarily similar to each other, since
AHI = Qf (Ai - k;l)Qi + kiI = Qf AiQi.
In addition, it can be shown as in the proof of Theorem (6.6.4.4) [see
Exercise 20l
(6.6.5.3)
A i + 1 = P iH APi,
(A - k 1 l)'" (A - k;l) = PiUi ,
where again Pi := QIQ2'" Qi and Ui := RiR i - 1 ... R I · Moreover,
(6.6.5.4)
A i + 1 = RiAiR;1
= U i AUi- 1
holds if all R j are nonsingular. Also, AHI will be a Hessenberg matrix, but
this assertion can be sharpened
(6.6.5.5) Theorem. Let Ai be an irreducible Hessenberg matrix of ordcr
n. If k i is not an eigenvalue of Ai then A i+ 1 will also be an irreducible
Hessenberg matrix. Otherwise, AHI will have the form
wherc Ai+! is an irreducible H essenberg matrix of order (n - 1).
PROOF. It follows from (6.6.5.2) that
(6.6.5.6)
Ri = Qf (A; - kil)o
If ki is not an eigenvalue of Ai. then Ri is nonsingular by (6.6.5.2), and
therefore (6.6.5.4) yields
(HI) _
T
A
_ T R A R- 1 _
aj+l. j - e J +l i+l C j - Cj + 1 , i i ej -
so that AHI is irreducible, too.
(il
(i)
((i») -1 --I- 0
rj+l.j+laj+l.j rjj
(.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
429
On the other hand, if k i is an eigenvalue of A;, then R; is singular.
But since Ai is irreducible, the first n - 1 columns of Ai - kif are linearly
independent, and therefore, by (6.6.5.6) so also are the first n - 1 columns
of R i . Hence, by the singularity of R i , the last row of Ri must vanish
*]
R.=[Hi
"
0
0 '
with Hi being a nonsingular upper triangular matrix of order (n - 1).
Because of the irreducibility of Ai - k;l and the fact that
Qi is also an irreducible Hessenberg matrix. Therefore, since
~ ] Qi + kif = [ Act
1
the submatrix AHI is an irreducible Hessenberg matrix of order n -1, and
AH 1 has the structure asserted in the theorem.
D
We now turn to the problem of choosing the shift parameters k i appropriately. To simplify the discussion, we consider only the case of real
matrices A, which is the most important case in practice, and we assume
that A is a Hessenberg matrix. The earlier analysis suggests choosing close
approximations to the eigenvalues of A as shifts k i , and the problem is finding such approximations. But Theorem (6.6.4.12) and (6.6.4.15) are helpful
here: According to them we have limi
= An, if A has a unique eigenvalue An of smallest absolute value. Therefore, for large i, the last diagonal
element a~~ of Ai will, in general, be a good approximation for An, which
suggests the choice
a)::,
(6.6.5.7a)
if the convergence of the a~i:n has stabilized, say, if
I1 -
(i-I)
ann
I
---t-,) ::: 7] < 1
an,n,
for a small T7 (even the choice T7 = 1/3 leads to acceptable results). It is
possible to show for Hessenberg matrices under weak conditions that the
choice (6.6.5.7a) ensures
la~i.~~11 ::: CE 2
if la~:n-ll ::: E and E is sufficiently small, so that the QR method then
converges locally at a quadratic rate. By direct calculation, this is easily
verified for real 2 x 2 matrices
430
6 Eigenvalue Problems
with a fc 0 and small lei: Using the shift ki := a n .n
matrix
.
~
Iblc 2
~ [Ii b]
--2'
A = ed' wIth Icl = -a2 +e
o one finds the
as the QR successor of A, so that indeed lei = 0(e 2 ) for small 14 For
symmetric matrices, b = e, the same example shows
so that one can expect local convergence at a cubic rate in this case. In fact,
for general Hermitian tridiagonal matrices and shift strategy (6.6.5.7a),
local cubic convergence was shown by Wilkinson [for details sec Wilkinson
(1965), p. 548, and Exercise 23].
A more general shift strategy, which takes account of both (6.6.4.12)
and (6.6.4.15), is the following.
(6.6.5. 7b) Choose k i as the eigenvalue A of the 2 x 2 matrix
[ a~~l,n-I an-.1,n 1.
(i)
(i)
an,n-I
(i)
a n .n
.
for which la~)n - AI is smallest.
This strategy also allows for the possibility that A has several eigenvalues of equal absolute value. But it may happen that k" is a complex number
even for real Ai, if Ai nonsymmetric, a case that will be considered later.
For real symmetric matrices Wilkinson (1968) even showed the following
global convergence theorem.
(6.6.5.8) Theorem. If the QR method with shift strategy (6.6.5.7b) is
applied to a real, irreducible, symmetric tridiagonal n x n matrix in all
iterations, then the elements a~:n-l of the ith iteration matrix Ai converge
to zero at least quadratically toward an eigenvalue of A. Disregarding rare
exceptions, the convergence rate is even cubic.
D
Nu~mRICAL
(6.6.5.9)
EXAMPLE. The spectrum of the matrix
6
1
1
1
~ ~
1
6.6 Methods for Determining the Eigenvalues and Eigenvectors
431
lies symmetrically around 6; in particular, 6 is an eigenvalue of A. For the QR
method with shift strategy (b) we display in the following the elements a~?n_l'
a~:n as well as the shift parameters k i :
(i)
( i)
a 54
1
ki
a 55
2
3
1
- .454544295 102 10 - 2
0
-.316869782391100
-.302775637732100
-.316875874226100
+.106774452090 10 -9
-.316875952616100
-.316875952619100
4
+.918983519419 10 -22
1-.316875952617100 = ..\.51
Processing the 4 x 4 matrix further, we find
ki
4
5
+.143723850633 10 0
-.171156231712 10 -5
+.29906913587510 1
+.298386369683101
6
-.11127768766310-17
1+.298386369682101 = ..\.41
+.29838996772210 1
+.298386369682101
Processing the 3 x 3 matrix further, we get
(i)
(i)
a 32
a 33
ki
6
7
+.78008805287910- 1
-.83885498096110-7
+.600201597254 10 1
+.59999999999610 1
8
+.12718113562310- 19
1+.599999999995101 = ..\.31
+.600000324468101
+.59999999999510 1
The remaining 2 x 2 matrix has the eigenvalues
+.90161363031410 1 = ..\.2
+.123168759526 10 2 = ..\.1
This ought to be compared with the result of eleven QR steps without shifts:
(12)
(12)
k
a k ,k-l
a kk
1
2
3
4
5
+.33745758663710-1
+.11407995142110-1
+.46308675985310-3
-.20218824473310-10
+.123165309125 10 2
+.90164381961h o 1
+ .600004307566 10 1
+.298386376789 10 1
-.316875952617100
432
6 Eigenvalue Problems
Such convergence behavior is to be expected, since
Further processing the 4 x 4 matrix requires as many as 23 iterations in order to
achieve
ai~5) = +.48763746442510-10.
The fact that the eigenvalues in this example come out ordered in size
is accidental: With the use of shifts this is not to be expected in general.
When the QR step (6.6.5.2) with shifts for symmetric tridiagonal matrices is implemented in practice, there is a danger of losing accuracy when
subtracting and adding kiJ in (6.6.5.2), in particular, if Ikil is large. Using
Theorem (6.6.5.1), however, there is way to compute Ai without subtracting and adding kif explicitly. Here, we use the fact the first column ql of
the matrix Qi := [ql,"" qn] can be computed from the first column of
Ai - kif: at the beginning of this section we have seen that
Qi = flifzfl~ ... fl;;-1.n,
where the flj,j+l are Givens matrices. Note that, for j > 1, the first column
of flj,j+l equals the first unit column el, so that ql = Qiel = fl~el is also
the first column of fl~. l\Iatrix fl12 is by definition the Givens matrix that
annihilates the first subdiagonal element of Ai -kif, and it is therefore fully
determined by the first column of that matrix. The matrix B = fl~Aifl12'
then has the form
B=
x
x
x
X
x
x
o
.r
x
x
x
x
x
x
0
x
x
that is, it is symmetric and, except for the potentially nonzero elements
b13 = b3l , tridiagonal. In analogy to the procedure described in Section
6.5.1, one can find a sequence of Givens transformations
-H
-
-H
B --+ fl23Bfl23 --+ ... --+ fl n - 1 ,n'"
-H
fl 23 Bfl23 ··· fln-l,n =
C,
which transform B into a unitarily similar matrix C that is symmetric and
tridiagonal. First, D23 , is chosen so as to annihilate the elements b31 = b13 of
B. The corresponding Givens transformation will produce nonzero elements
in the positions (4, 2) and (2, 4) of D/ABD 23 . These in turn, are annihilated
by a proper choice of i?34, and so on.
So, for n = 5 the matrices B --+ ... --+ C have the following structure, where
again 0 and * denote new zero and potentially nonzero elements, respectively.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
B=
ti34 )
[~
x
x
x
[~
x
x
x
x
x
x
x
x
X
X
0
X
X
X
0
x
x
*
x
J [~
ti 23 )
J
ti 45 )
[~
x
0
X
X
x
x
x
*
x
x
*
x
X
X
X
X
X
X
x
0
x
x
433
;]
~] ~C
x
n
In terms of the unitary matrix U := D 12 23 ··· nn-l,n, we have
C = U H AiU since B = D~AiD12' The matrix U has the same first column as the unitary matrix Qi, which transforms Ai into A i+ 1 = Qf' AiQi
according to (6.6.5.2), namely, ql = D12el. If A; is irreducible then by
Theorem (6.6.5.5) the elements a;:;~i of A i + 1 = (a;~+l)) are nonzero
for j = 2,3, ... , n - 1. Thus, by Theorem (6.6.5.1), the matrices C and
Ai+l = SCSH are essentially the same, S = diag(±l, ... , ±1), U = QS.
Since in the explicit QR step (6.6.5.2), the matrix Qi is unique only up to a
rescaling Qi ---+ QiS by a phase matrix S, we have found an alternate algorithm to determine Ai+l' namely, by computing C. This technique, which
avoids the explicit subtraction and addition of kJ, is called the implicit
shift technique for the computation of Ai+l.
Now, let A be a real nonsymmetric Hessenberg matrix. The shift strategy (6.6.5. 7b) then leads to a complex shift k i , even if the matrix Ai is
real, whenever the 2 x 2 matrix in (6.6.5.7b) has two conjugate complex
eigenvalues k i and k,. In this case, it is possible to avoid complex calculations, if one combines the two QR steps Ai ---+ Ai+l ---+ Ai+2 by choosing
the shift ki in the first, and the shift ki+l := ki in the second. It is easy
to verify that for real Ai the matrix Ai+2 is real again, even though Ai+l
may be complex. Following an elegant proposal by Francis (1961/62) one
can compute Ai+2 directly from Ai using only real arithmetic, provided the
real Hessenberg matrix Ai is irreducible: For such matrices Ai this method
permits the direct calculation of the subsequence
of the sequence of matrices generated by the QR method with shifts. It is
in this context that implicit shift techniques were used first.
In this text, we describe only the basic ideas of Francis' technique. Explicit formulas can be found, for instance, in Wilkinson (1965). For simplicity, we omit the iteration index and write A = (aJk) for the real irreducible
Hessenberg matrix Ai, k for ki' and A for Ai+2. By (6.6.5.2), (6.6.5.3)
there are a unitary matrix Q and an upper triangle matrix R such that
A = QH AQ is a real Hessenberg matrix and (A - kI)(A - kI) = QR. We
try to find a unitary matrix U having essentially the same first column
434
6 Eigenvalue Problems
as Q, Qel = ±Uel, so that U H AU =: K is also a Hessenberg matrix. If
k is not an eigenvalue of A, then Theorems (6.6.5.1) and (6.6.5.5) show
again that the matrices A and K differ only by a phase transformation:
H.
K = S AS, S = dlag(±L ... , ±1).
This can be achieved as follows: The matrix.
n := (A - kJ)(A - kI) = A2 - (k + k)A + kkI
is real, and its first column b = Bel, which is readily computed from A
and k, has the form the form b = [x, x, x, 0, ... ,O]T, since A is a Hessenberg matrix. Choose P [see Section 4.7] as the Householder matrix that
transforms b into a multiple of el, Pb = {11 e1. Then the first column of
P has the same structure as b, that is, Pel = [x, x, x, 0, ... , oV, because
P 2 b = b = {11Pel. Since B = QR, B = p 2 B = P[{1lel, *, ... , *], the matrix P has essentially the same first column Pel = ±Qe1 as the matrix Q.
Next, one computes the matrix pH AP = PAP, which, by the structure of
P (note that Pek = ek for k .2: 4) is a matrix of the form
x
x
x
:1:
x
x
* *
PAP =
x
x
x
x
x
x
x
in other words, except for few additional subdiagonal elements in positions
(3,1), (4,1), (4,2), it is again a Hessenberg matrix. Applying Householder's
method [see the end of Section 6.5.4], PAP is then transformed into a
genuine Hessenherg matrix K using unitary similarity transformations
The Householder matrices T k , k = 1, ... , n - 2, according to Section
6.5.1, differ from the unit matrix only in three columns, Tke, = ei for
i rf. {k + 1, k + 2, k + 3}.
For n = 6 one obtains a sequence of matrices of the following form (0 and *
again denote new zero and nonzero elements, respectively):
x
PAP+
X
x
x
X
X
X
T,
----=-+
X
X
X
X
.r
r~
0
0
x
x
T2
X
X
X
X
X
* *
x
----t
x
x
x
6.6 Methods for Determining the Eigenvalues and Eigenvectors
x
x
x
x
x
x
x
x
o x x
o x x
* *
x
x
x
x
x
x
x
o
o
x
~ [~ ~
435
x
x
x
~
x
x
x
x
x
x
x
x
=K.
x
o x
x
As Tkel = el for all k, the unitary matrix U := PT1 ... T n - 2 with
U H AU = K has the same first column as P, Uel = PT1 ··· Tn-2el = Pel'
Thus, by Theorem (6.6.5.1), the matrix K is (essentially) identical to A.
Note that it is not necessary for the practical implementation of QR
method to compute and store the product matrix Pi := Ql ... Qi, if one
wishes to determine only the eigenvalues of A. This is no longer true in
the Hermitian case if one also wants to find a set of eigenvectors that are
orthogonal. The reason is that for Hermitian A
limpiH APi = D = diag(Al, ... , An),
"
which follows from (6.6.5.3), (6.6.4.12), so that for large i the j-th column
p;i) = Piej, of Pi will be a very good approximation to an eigenvector of A
for the eigenvalue Aj and these approximations p;i), j = 1, ... , n, are orthogonal to each other, since Pi is unitary. If A is not Hermitian, it is better
to determine the eigenvectors by the inverse iteration of Wielandt [see Section 6.6.3], where the usually excellent approximations to the eigenvalues
that have been found by the Q R method can be used.
The QR method converges very quickly, as is also shown by the example
given earlier. For symmetric matrices, experience suggests that the QR
method is about four times as fast as the widely used Jacobi method if
eigenvalues and eigenvectors are to be computed, and about ten times as
fast if only the eigenvalues are desired. There are many programs for the QR
method. ALGOL programs can be found in the handbook of Wilkinson and
Reinsch (1971) [see the contributions by Bowdler, Martin, Peters, Reinsch,
and Wilkinson], FORTRAN programs are in Smith et al. (1976).
436
6 Eigenvalue Problems
6.7 Computation of the Singular Values of a Matrix
The singular values (6.4.6) of an m x n matrix A and its singular-value decomposition (6.4.11) can be computed rapidly, and in a numerically stable
manner, by a method due to Golub and Reinsch (1971), which is closely
related to the QR method. We assume, without loss of generality, that
m::';' n (otherwise. replace A by A H ). The decomposition (6.4.11) can then
be written in the form
(6.7.1)
A = U [~] V H ,
D:= diag(O"l, ... ,O"n),
0"1::';' 0"2::';""::';' O"n ::.;, 0,
where U is a unitary m x m matrix, V is a unitary n x n matrix, and 0"1,
..• , O"n are the singular values of A, i.e., 0"; are the eigenvalues of AH A. In
principle, therefore, one could determine the singular values by solving the
eigenvalue problem for the Hermitian matrix AHA, but this approach can
be subject to loss of accuracy: for the matrix
lei < y'ePs,
eps = machine precision,
for example, the matrix AH A is given by
A H A = [1 +1 e
2
1]
1 + e2
'
and A has the singular values O"l(A) = yi2 + e2 , 0"2(A) = lei. In floatingpoint arithmetic, with precision eps, instead of AHA one obtains the matrix
fl(A H A) =: B =
[1 1]
1
1 '
with eigenvalues '\1 = 2 and '\2 = 0; 0"2(A) = lei does not agree to machine
precision with ~ = O.
In the method of Golub and Reinsch one first reduces A, in a preliminary step, unitarily to bidiagonal form. By the Householder method
[Section 4.7], one begins with determining an m x m Householder matrix
Pi which annihilates the sub diagonal elements in the first column of A.
One thus obtains a matrix A' = PiA of the form shown in the following
sketch for m = 5, n = 4, where changing elements are denoted by *:
A=
x
x
x
x
x
X
.J;
X
x
x
x
x
x
x
x
x
x
x
x
x
-+ HA =A' =
* *
*
*
*
*
0
0
0
0
*
*
*
*
*
J
6.7 Computation of the Singular Values of a Matrix
437
Then, as in Section 4.7, one determines an n x n Househoulder matrix Q1
of the form
such that the elements in the positions (1,3), ... , (1, n) in the first row of
A" := A/Q1 vanish:
A'=
x
0
0
0
0
x
x
x
x
:r;
x
X
X
X
x
x
x
x
x
x
-+ A/Q1 = A" =
x
0
0
0
0
*
*
*
*
*
0
0
*
*
*
*
*
*
*
*
In general, one obtains in the first step a matrix A" of the form
A" =
[*l,
0,1 := [e2, 0, ... ,0]
E <e n - 1 ,
with an (m - 1) x (n - 1) matrix A. One now treats the matrix A in the
same manner as A, etc., and in this way, after n reduction steps, obtains
an m x n bidiagonal matrix J(O) of the form
(0)
q1
J(O) =
[~O 1'
(0)
e2
0
(0)
q2
Jo =
(0)
en
0
(0)
qn
where fO) = Pn ··· P 2 P 1AQ1Q2'" Qn-2 and Pi, Qi are certain Householder matrices.
Since Q := Q1 Q2 ... Qn-2 is unitary and J(O)H J(O) = JI! J o =
QH AH AQ, the matrices J o and A have the same singular values; likewise,
with P := P 1P 2 ... Pn , the decomposition
is the singular-value decomposition (6.7.1) of A if J o = GDH H is the
singular-value decomposition of J o. It suffices therefore to consider the
problem for n x n bidiagonal matrices J.
In the first step, J is multiplied on the right by a certain Givens reflection of the type T 12 , the choice of which we want to leave open for the
moment. Thereby, J changes into matrix J(1) = JT12 of the following form
(sketch for n = 4):
438
6 Eigenvalue Problems
The sub diagonal element in position (2,1) of J(1), generated in this way, is
again annihilated through left multiplication by a suitable Givens reflection
of the type S12. One thus obtains the matrix J(2) = S12J(1):
= .f2).
The element in position (1,3) of J(2) is annihilated by multiplication from
the right with a Givens reflection of the type T2 :l ,
The subdiagonal element in position (3,2) is now annihilated by left multiplication with a Givens reflection S2:" which generates an element in position (2,4), etc. In the end, for each Givens reflection T12 one can thereby
determine a sequence of further Givens reflections Si,i+1, T i ,i+l, i = 1, 2,
... , n - 1, in such a manner that the matrix
is again a bidiagonal matrix. The matrices S := S12S23 ... Sn-l,n, T :=
T 12 T 23 ... T n -1.n are unitary, so that, by virtue of j = SH JT, the matrix
AI := jH j = TH JH SSH JT = TH MT is a tridiagonal matrix which is
unitarily similar to the tridiagonal matrix M := JH J.
Up to now, the choice of T12 was open. We show that T12 can be so
chosen that AI becomes precisely the matrix which is produced from the
tridiagonal matrix AI by the QR method with shift k. To show this, we
may assume without loss of generality that AI = JH J is irreducible, for
otherwise the eigenvalue problem for Ai could be reduced to the eigenvalue
problem for irreducible tridiagonal matrices of smaller order than that of
M. We already know [see (6.6.5.2)] that the QR method, with shift k,
applied to the tridiagonal matrix Ivi yields another tridiagonal matrix if
according to the following rules:
(6.7.2)
AI - kI =: T R.
R upper triangular,
Al:= RT+ kI,
T unitary,
6.7 Computation of the Singular Values of a Matrix
439
Both tridiagonal matrices ]\;1 = TH AIT and J'V1 = TH MT are unitarily
similar to lvI, where the unitary matrix T = T(k) depends, of course, on
the shift k. Now, Theorem (6.6.5.5) shows that for all shifts k the matrix
if = [mij] is almost irreducible, that is, mj,j-1 -I- 0 for j = 2,3, ... , n - 1
since AI is irreducible. Hence we have the situation of Theorem (6.6.5.1): If
the matrices T and T = T(k) have the same first column, then according to
this theorem there is a phase matrix L1 = diag( ei<bl, ei<b2, ... ,ei<bn ), ¢1 = 0,
so that NI = L1H !VI L1, that is, ]\;1 is essentially equal to M. We show next
that it is in fact possible to choose T12 in such a way that T and T(k) have
the same first column.
The first column t1 of T = T 12 T23 ... Tn - Ln is precisely the first column
of T 12 , since (with e1 := [1,0, ... , O]T E (Cn)
and has the form t1 = [c, 8, 0, ... ,O]T E IRn , c2 + 8 2 = 1. On the other
hand, the tridiagonal matrix AI - kI, by virtue of AI = JH J, has the form
61
(6.7.3)
12
62
1'2
M-kI=
0
o
'Yn
if
1
6:
1
61 = Iql1 2 - k, 1'2 = e2ql,
e2
J=
r~ ,~ 1
q2
qn
A unitary matrix T with (6.7.2), TH(M - kI) = R upper triangular, can
be determined as a product of n - 1 Givens reflections of the type Ti ,i+1
[see Section 4.9]' T = T12 T23 ··· Tn-1,n which successively annihilate the
sub diagonal elements of 1\1 - kI. The matrix T12 , in particular, has the
form
(";
8
8
-c
1
T12 =
(6.7.4)
o
o
[i]
1
:= ex [
where 8 -I- 0 if M is irreducible.
~~] = ex [lq11:q~ k],
(";2
+ 82 = 1,
440
6 Eigenvalue Problems
The first column t1 of T agrees with the first column of T 12 , which is
given by c, sin (6.7.4). Since /1.1 = TH MT is a tridiagonal matrix, it follows
from Francis' observation that if essentially agrees with IVI (up to scaling
factors), provided one chooses as first column t1 of T12 (through which T12
is determined) precisely the first column of T12 (if the tridiagonal matrix
!vI is not reducible).
Accordingly, a typical step J --+ J of the method of Golub and Reinsch
consists in first determining a real shift parameter k, for example by means
of one of the strategies in Section 6.6.5; then choosing T12 := T12 with the
aid of the formulas (6.7.4); and finally determining, as described above,
further Givens matrices T i ,i+1, Si,it-l such that
again becomes a bidiagonal matrix. Through this implicit handling of the
shifts, one avoids in particular the loss of accuracy which in (6.7.2) would
occur, say for large k, if the subtraction !'vI --+ !'vI - kI and subsequent
addition RT --+ RT + kI were executed explicitly. The convergence properties of the method are of course the same as those of the QR method: in
particular, using appropriate shift strategies [see Theorem (6.6.5.8)]' one
has cubic convergence as a rule.
An ALGOL program for this method can be found in Golub and Reinsch
(1971).
6.8 Generalized Eigenvalue Problems
In applications one frequently encounters eigenvalue problems of the following form: For given n x n matrices A, B, numbers A are to be found
such that there exists a vector x i= 0 with
(6.8.1)
Ax = ABx.
For nonsingular matrices B, this is equivalent to the classical eigenvalue
problem
(6.8.2)
B- 1 Ax = AX
for the matrix B- 1 A (a similar statement holds if A-I exists). Now usually,
in applications, the matrices A and B are real symmetric, and in addition
B is positive definite. Although in general B- 1 A is not symmetric, it is still
possible to reduce (6.8.1) to a classical eigenvalue problem for symmetric
matrices: If
6.9 Estimation of Eigenvalues
441
is the Choleski decomposition of the positive definite matrix B, then L is
nonsingular and B- 1A similar to the matrix G := L -1 A(L -1 )T,
But now, the matrix G, like A, is symmetric. The eigenvalues A of (6.8.1)
are thus precisely the eigenvalues of the symmetric matrix G.
The computation of G can be simplified as follows: One first computes
F:= A(L-1)T
by solving F LT = A, and then obtains
from the equation LG = F. In view of the symmetry of G it suffices to
determine the elements below the diagonal of G. For this, knowledge of the
lower triangle of F (jik with k :::; i) is sufficient. It suffices, therefore, to
compute only these elements of F from the equation F LT = A.
Together with the Choleski decomposition B = LLT. which requires
about ~n3 multiplications, the computation of G = L -1 A(L -l)T from A,
B thus costs about ~n3 multiplications.
For additional methods, sec Martin and Wilkinson (1971) and Peters
and Wilkinson (1970). A more recent method for the solution of the generalized eigenvalue problem (6.8.1) is the QZ method of J'vIoler and Stewart
(1973).
6.9 Estimation of Eigenvalues
Using the concepts of vector and matrix norm developed in Section 4.4, we
now wish to give some simple estimates for the eigenvalues of a matrix. We
assume that for x E (Cn
Ilxll
is a given vector norm and
lub(A) = ~;t
IIAxl1
W
the associated matrix norm. In particular, we employ the maximum norm
Ilxll exo = max
, lXii,
IUboo(A) = max
,
L lai,kl·
k
We distinguish between two types of eigenvalue estimations:
(1) exclusion theorems,
442
6 Eigenvalue Problems
(2) inclusion theorems.
Exclusion theorems give domains in the complex plane which contain no
eigenvalue (or whose complement contains all eigenvalues); inclusion theorems give domains in which there lies at least one eigenvalue.
An exclusion theorem of the simplest type is
(6.9.1) Theorem (Hirsch). For- all eigenvalues..\. of A one has
1..\.1 :s: lub(A).
PROOF. If x is an eigenvector to the eigenvalue ..\., then from
A.T = ..\.x,
x ~ 0,
there follows
II..\.xll = 1..\.1· 11·1:11 :s: lub(A) . Ilxll,
1..\.1 :s: lub(A).
D
If ..\.i are the eigenvalues of A, then
p(A):= max I..\.il
l::;i::;n
is called the spectml mdius of A. By (6.9.1) we have p(A) < lub(A) for
every vector norm.
(6.9.2) Theorem.
(a) For- ever-y matr-ix A and ever-y E > 0 thcr-e exists a vector norm sitch
that
lub(A) :s: p(A) + E.
(b) If every eigenvalue..\. of A with 1..\.1 = p(A) has only linear elementary
divisors, then there exists a vector norm such that
lub(A) = p(A).
PROOF.
(a): Given A, there exists a nonsingular matrix T such that
TAT- 1 =.J
is the Jordan normal form of A [see (6.2.5)], i.e., .J is made up of diagonal
blocks of the form
..\.i
1
0
1
o
6.9 Estimation of Eigenvalues
443
By means of the transformation J --+ D; I J D '" with diagonal matrix
DE := diag(l, E, E2, ... , En-I),
E
> 0,
one reduces the C,,(.\) to the form
o
From this it follows immediately that
luboo(D;IJDE) = luboo(D;ITAT-ID E):=:; p(A) +E.
Now the following is true in general: If S is a nonsingular matrix and II . II
a vector norm, then p(x) := IISxl1 is also a vector norm, and lUbp(A) =
lub(SAS- I ). For the norm p(x) := IID;ITxlloo it then follows that
lubp(A) = luboo(D;IT AT- I DE) :=:; p(A) + E.
(b): Let the eigenvalues Ai of A be ordered as follows:
Then, by assumption, for 1 :=:; i :=:; s each Jordan box Cv(Ai) in J has
dimension 1, i.e., Cv(Ai) = [AJ Choosing
we therefore have
luboo(D;ITAT-ID E) = p(A).
For the norm p(x) := IID;ITxlloo it follows, as in (a), that
lUbp(A) = p(A).
o
A better estimate than (6.9.1) is given by the following theorem [ef.
Bauer and Fike (1960)]:
(6.9.3) Theorem. If B is an arbitrary n x n matrix, then for all eigenvalues A of A one has
I:=:; lub((AI - B)-I(A - B)) :=:; lub((AI - B)-I) lub(A - B)
unless A is also an eigenvalue of B.
PROOF. If x
identity
is an eigenvector of A for the eigenvalue A, then from the
444
6 Eigenvalue Problems
(A - B)x = ()"I - B)x
it follows immediately, if ).. is not an eigenvalue of B, that
()"I - B)-l(A - B)x = x,
and hence
o
Choosing in particular
B = A D := [
0
..
all
o
1
ann
the diagonal of A, and taking the maximurIl norm, it follows that
n
lubCXJ [()"I - A D )-l(A - AD)] = max
l::;i::;n
1
I).. - aiil
L laikl.
k=l
k'i'"
From Theorem (6.9.3) we now get
t
(6.9.4) Theorem (Gershgorin). The union of all discs
J(i:=
{p, E <C lip, - aiil ::;
laikl},
i = 1, 2, ... , n,
k:;E:-i
contains all eigenvalues of the n x n matrix A = (aik).
Since the disc J(i has center at aii and radius I:~=l,k#i laik I, this estimate will be sharper as A deviates less from a diagonal matrix.
EXAMPLE 1.
A=
[~ ~.1 -~:~l'
-0.2
0
3
Kl = {Ill 111- 11::; 0.2},
K2 = {II 1111 - 21 ::; 0.4},
K3 = {II I 111 - 31 ::; 0.2},
[see Figure 12J.
The preceding theorem can be sharpened as follows:
(6.9.5) Corollary. If the union }.h = U~=l J(i J of k discs J(ij' j = 1, ... ,
k. and the union .fIv12 of the remaining discs are disjoint, then.flvh contains
exactly k eigenvalues of A and !Y!2 exactly n - k eigenvalues.
PROOF. If A =
AD + R, for t E [O,l]lct
6.9 Estimation of Eigenvalues
o
445
4
Fig. 12. Gershgorin circles
A t := AD + tR.
Then
Ao = AD,
The eigenvalues of At are continuous functions of t. Applying the theorem
of Gershgorin to At, one finds that for t = 0 there are exactly k eigenvalues
of Ao in lvh and n - k in lvh (counting multiple eigenvalues according
to their multiplicities as zeros of the characteristic polynomial). Since for
o S; t S; 1 all eigenvalues of At likewise must lie in these discs, it follows
for reasons of continuity that also k eigenvalues of A lie in lvh and the
remaining n - k in ~M2.
0
Since A and AT have the same eigenvalues, one can apply (6.9.4), (6.9.5)
to A as well as to AT and thus, in general, obtain more information about
the location of the eigenvalues.
It is often possible to improve the estimates of Gershgorin's theorem
by first applying a similarity transformation A -t D- 1 AD with a diagonal
matrix D = diag(d 1 , ... , dn ). For the eigenvalues of D- 1 AD, and thus of
A, one thus obtains the discs
Ki =
{/lii/l-aiil S;
t
lai~~kl =:Pi}.
k#-i
By a suitable choice of D one can often substantially reduce the radius
Pi of a disc (the remaining discs, as a rule, become larger) in such a way,
indeed, that the disc Ki in question remains disjoint with the other discs
K j , j i i. Ki then contains exactly one eigenvalue of A.
EXA:VIPLE 2.
0< r:: « 1,
K 1 ={/-lII/-l-1Is2r::},
K2 = K3 = {/-lil/-l- 21 S 2r::},
446
6 Eigenvalue Problems
Transformation with D = diag(l, kE, kE), k > 0, yields
A' = D-1AD =
For A' we have
[1~k
11k
1
P2 = P3 = k + E.
The discs Kl and K2 = K3 for A' are disjoint if
Pl + P2 = 2kE
2
1
+ k + E < 1.
For this to be true we must clearly have k > 1. The optimal value k, for which
Kl and K2 touch one another, is obtained from Pl + P2 = 1. One finds
k=
and thus
2
2
1 - E + J(1 - E)2 - 8E2
= 1+E+0(E),
Pl = 2kE2 = 2E2 + 0(E 3 ).
Through the transformation A -+ A' the radius Pl of Kl can thus be reduced
from the initial 2E to about 2E2.
The estimate of Theorem (6.9.3) can also be interpreted as a perturbation theorem, indicating how much the eigenvalues of A can deviate from
the eigenvalues of B. In order to show this, let us assume that B is normalizable:
It then follows, if .>. is no eigenvalue of B, that
lub( (.AI - B)-I) = lub( (P(.AI - AB)-l p-l)
S lub((.AI - AB)-l) lub(P) lub(P-l)
=
max
l::;i::;n
1
1'>'-'>'i(B)1
cond(P)
1
mm I.>. - '>'i(B) 1cond(P).
l::;i::;n
This estimate is valid for all norms 11·11 satisfying, like the maximum norm
and the Euclidean norm,
lub(D) = max Idil
l::;i::;n
6.9 Estimation of Eigenvalues
447
for all diagonal matrices D = diag(d1, ... ,d n ). Such norms are called absolute; they are also characterized by the condition Illxlll = Ilxll for all
x E (Cn [see Bauer, Stoer, and Witzgall (1961)].
From (6.9.3) we thus obtain [cf. Bauer, Fike (1960), Householder
(1964)] :
(6.9.6) Theorem. If B is a diagonalizable n x n matrix, B = P ABP- 1 ,
AB = diag(Al(B), ... , An(R)), and A an arbitrary n x n matrix, then for
each eigenvalue )'(A) there is an eigenvalue Ai(B) such that
IA(A) - Ai(B)1 ::; cond(P) lub(A - B).
Here cond and lub are to be formed with reference to an absolute norm 11·11.
This estimate shows that for the sensitivity of the eigenvalues of a
matrix B under perturbation, the controlling factor is the condition
cond(P)
of the matrix P. not the condition of B. But the columns of P are just
the (right) eigenvectors of B. For Hermitian, and more generally, normal
matrices normal matrix B one can choose for P a unitary matrix [Theorem (6.4.5)]. With respect to the Euclidean norm IIxl12 one therefore has
cond 2 (P) = 1, and thus:
(6.9.7) Theorem. If B is a nurmal n x n matrix and A an arbitrary n x n
TrwtTix, then for' each eigenvalue A(A) of A there is an eigenvalue )'(B) of
B such that
I)'(A) - A(E)I ::; lub 2 (A - E).
The eigenvalue problem for Hermitian matrices is thus always well conditioned.
The estimates in (6.9.6), (6.9.7) are of global nature. We now wish to
examine in a first approximation the sensitivity of a fixed eigenvalue A of A
under small perturbation A -+ A+EC, E -+ 0. Vie limit ourselves to the case
in which the eigenvalue A under study is a simple zero of the characteristic
polynomial of A. To ), then belong uniquely determined (up to a constant
factor) right and left eigenvectors x and y, respectively:
xi 0,
y i 0.
For these, one has yH x i 0, as one easily shows with the aid of the Jordan
normal form of A, which contains only one Jordan block, of order one, for
the eigenvalue A.
(6.9.8) Theorem. Let), be a simple zero of the characteristic polynomial
of the n x n matrix A, and x and yH corresponding right and left eigenvectors of A, respectively,
x, y i 0,
448
6 Eigenvalue Problems
and let C be an arbitrary n x n matrix. Then there exists a function A(e)
which is analytic for lei sufficiently small, lei S eo., eo > 0, such that
A(O) = A,
A'(O) = yH Cx,
yHx
and A(e) is a simple zero of the characteristic polynomial of A + eC. One
thus has, in first approximation,
PROOF. The characteristic polynomial of the matrix A
+ eC,
is an analytic function of e and 11. Let K be a disc about A,
K = {11 1111 - AI S r},
r > 0,
which does not contain any eigenvalues of A other than A, and S := {11 I
111 - AI = r} its boundary. Then
inf ICPo(l1)I =: m > O.
!J.ES
Since CPo(l1) depends continuously on e, there is an eo > 0 such that also
(6.9.9)
inf IcpE(I1)1 > 0
,LES
for all
lei < eo.
According to a well-known theorem in complex variables, the number of
zeros of CPE (11) within K is given by
In view of (6.9.9), I/(e) is continuous for lei S eo; hence 1 = 1/(0) = I/(e)
for lei SeD, 1/ being integer valued. The simple zero A(e) of CPE(IJ,) inside of K, according to another theorem in complex variables, admits the
representation
(6.9.10)
For lei S eo the integrand of (6.9.10) is an analytic function of c and therefore also A(e), according to a well-known theorem on the interchangeability
of differentiation and integration. For the simple eigenvalue A(e) of A + eC
one can choose right and left eigenvectors X(e) and y(e),
6.9 Estimation of Eigenvalues
(A + eC)x(e) = A(e)X(e),
449
y(e)H (A + eC) = A(e)y(e)H,
in such a way that x(e) and y(e) are analytic function::; of e for lei:::; co.
We may put, e.g., x(e) = [6(e), ... ,';n(e)]T with
';i(e) = (_l)i det(B 1 ;),
where Bli is the (n - 1) x (n - 1) matrix obtained by deleting row 1 and
column i in the matrix A + eC - ).,(e)l. From
(A + eC - ).,(e) 1)x(e) = 0,
differentiating with re::;pect to e at e = 0, one obtains
(C - ).,' (0) l)x(O) + (A - ),,(0) I) x' (0) = 0,
from which, in view of y(O)H (A - ).,(0) 1) = 0,
y(O)H(C - ).,1(0)1)X(O) = 0,
and hence, since y(O)H x(O) i- 0,
A'(O) = yHCx,
yHx
y = y(O),
x = x(O),
as was to be shown.
Denoting, for the Euclidean norm I . 112, by
o
the cosine of the angle between x and y, the preceding result implie::; the
estimate
1>.'(0)1 =
lyHCxl
IIYl12 Ilx 112 Icos(x, y) I
<
IICxl12
< lub 2(C) .
IIxl121 cos(x, y) I - Icos(x, y) I
The sensitivity of )., will thus increase with decreasing I cos(x, y)l. For Hermitian matrices we always have x = y (up to constant multiples), and
hence I cos(x, y)1 = 1. This is in harmony with Theorem (6.9.7), according
to which the eigenvalues of Hermitian matrices are relatively insensitive to
perturbations.
Theorem (6.9.8) asserts that an eigenvalue A of A which is only a simple
zero of the characteristic polynomial is relatively insensitive to perturbations A --+ A + eC, in the sense that for the corresponding eigenvalue A(e)
of A + eC there exists a constant K and an co > 0 such that
I).,(e) - )"1 :::; K lei
for
lei:::; co·
450
6 Eigenvalue Problems
However, for ill-conditioned eigenvalues A, i.e., if the corresponding left and
right eigenvectors are almost orthogonal, the constant K is very large.
This statement remains valid if the eigenvalue A is a multiple zero of
the characteristic polynomial and has only linear elementary divisors. If
to A there belong nonlinear elementary divisors, however, the statement
becomes false. The following can be shown in this case (cf. Exercise 29):
Let (t1-A)Vl, (p.-At 2 , .•• , (Il-At p , VI 2" V2 2" ... 2" vP' be the elementary
divisors belonging to the eigenvalue A of A. Then the matrix A + Ee, for
sufficiently small E, has eigenvalues Ai(E), i = 1, ... , IJ", IJ" := VI + ... + vP'
satisfying, with some constant K,
(6.9.11)
This has the following numerical consequence: If in the practical computation of the eigenvalues of a matrix A [with lub(A) = 1] one applies a stable method, then the rounding errors committed during the course of the
method can be interpreted as having the effect of producing exact results
not for A. but for a perturbed initial matrix A + DA, lub(DA) = O(eps).
If to the eigenvalue Ai of A there belong only linear elementary divisors,
then the computed approximation ),i of Ai agrees with Ai up to an error
of the order eps. If however Ai has elementary divisors of order at most v,
one must expect an error of order of magnitude epsl/v for ),i.
FOr the derivation of a typical inclusion theorem, we limit ourselves to
the case of the Euclidean norm
and begin by proving the formula
. jjUrjj2
.
j
mln-j-j -jj- = mmjdi
x,to
x 2
l
for a diagonal matrix D = diag(d l , ... , d n ). Indeed, for all x i= 0,
jjD.Tjj§ _
jjxjj§ -
Li jX ij2jdi j2 > . jd.j2
Li jX ij2 - mim , .
If jdjj = mini jdij, then the lower bounds is attained for x = ej (the j-th
unit vector). Furthermore, if A is a normal matrix, i.e., a matrix which can
be transformed to diagonal form by a unitary matrix U,
A=UHDU,
D = diag(d], ... ,dn ),
and if f(A) is an arbitrary polynomial, then
f(A) = U H f(D)U,
di = Ai(A),
6.9 Estimation of Eigenvalues
451
and from the unitary invarianee of IIxl12 = IIUxl12 it follows at once, for all
x =I 0, that
Ilf(A)xI12
IIxl12
Ilf(D)UxI12
IIU H f(D)UxI12
IIUxl12
IIxl12
~ min 1f (d i ) 1 = min 1f (Ai (A) ) I·
l~i~n
l~i~n
We therefore have the following [sec, e.g., Householder (1964)]:
(6.9.12) Theorem. If A is a normal matrix, f(A) an arbitrary polynomial,
and x =I 0 an arbitrary vector, then there is an eigenvalue A(A) of A such
that
If(A(A))
1
~ Ilf(A)x I12 .
IIxl12
In particular, choosing for f the linear polynomial
f(A) == A _ x H Ax == A _ /L0l ,
xH X
where
/Lik :=X
H( A H)i Ax.
k
/Loa
i, k = 0, 1, 2, ... ,
we immediately get, in view of fLik = !lh'
Ilf(A)xll~ = x H(AH _ !l01 I) (A _ /1011).7:
/Loa
/LOO
/L1O/L0l
/L10/L01
/L01/L1O
= /111 - - - - - - + -2-/100
/100
/Loo
/100
flOl/L10
= /Lll - - - .
/100
We thus have
(6.9.13) Theorem (Weinstein): If A is normal and x -=I 0 an arbitrary
vector, then the disc
/Lll - /L1O/L01 / /Loa }
/Loa
contains at least one eigenvalue of A.
The quotient /101 / fLoo = x H Ax / x H x, incidentally, is called the
Rayleigh quotient
452
6 Eigenvalue Problems
of A belonging to x. The last theorem is used especially in connection
with the vector iteration: If x is approximately equal to Xl, the eigenvector
corresponding to the eigenvalue AI,
x ~ Xl,
AXI
= AlXl,
then the Rayleigh quotient x H Ax / x H X belonging to x is in general a very
good approximation to AI. Theorem (6.9.13) then shows how much at most
this approximate value x H Ax / x H x deviates from an eigenvalue of A.
The set
G[A] =
{x:H~x I x 1= °}
of all Rayleigh quotients is called the field of values of the matrix A. Choosing for x an eigenvector of A, it follows at once that G[A] contains the
eigenvalues of A. Hausdorff, furthermore, has shown that G[A] is always
convex. For normal matrices
A = UHAU.
one even has
xH U H AU x I
G[A] = { xHUHUx
x 1=
=
{11 I =
J.L
t Ai,
Ti
°} = { - - I 1= }
Ti
yH Ay
yHy
2:: 0,
t
Ti =
y
U
1 }.
That is, for normal matrices, G[A] is the convex hull of the eigenvalues of
A [see Figure 13].
Fig. 13. Field of values
6.9 Estimation of Eigenvalues
453
For a Hermitian matrix H = H H = U H AU with eigenvalues >'1 : : :
A2 ::::: ... ::::: An, we thus recover the result (6.4.3), which characterizes Al
and An by extremal properties. The remaining eigenvalues of H, too, can
be characterized similarly, as is shown by the following theorem:
(6.9.14) Theorem (Courant, Weyl). For the eigenvalues Al ::::: A2 ::::: ... :::::
An of an n x n Hermitian matrix H, one has for i = 0, 1, ... , n - 1
xHHx
PROOF. For arbitrary Yl,' .. , Yi E (Cn define /J(Yl, ... , Yi) by
/J(Yl, ... ,y;) :=
max
xE(f;n: X H Yl="'=X H Yi=O
xHx=l
Further let Xl, ... , Xn be a set of n orthonormal eigenvectors of H for the
eigenvalues Aj [see Theorem (6.4.2)]: HXj = AjXj, XfXk = 6jk for j, k = 1,
2, ... , n. For Yj := Xj, j = L ... , i, all x E (Cn with xHYj = 0, j = 1, ... ,
i, xH X = 1, can then be represented in the form
so that, since Ak :s: Ai+l for k ::::: i + 1, one has for such x
XH Hx = (Pi+lXi+l + ... + Pnxn)H H(Pi+lXi+l + ... + Pnxn)
= IPi+112Ai+l + ... + IPnl 2A n
:s: Ai+l(lpi+11 2 + ... + IPnI 2 ) = Ai+l,
where equality holds for x := Xi+l. Hence,
On the other hand, for arbitrary Yl, ... , Yi E (Cn, the subspaces
E := {x E (Cn I x H Yj = 0 for j :s: i},
F := {~jS;i+l PjXj I Pj E (C},
have dimensions dim E ::::: n - i, dim F = i + 1, so that dim F n E 2: 1,
and there exists a vector Xo E F n E with x{f Xo = 1. Therefore, because
Xo = PlXl + ... + Pi+lXi+l E F,
/J(Yl, ... , Yi) ::::: x{f H Xo = IPll 2 Al + ... + IPi+11 2Ai+1
::::: (IPlI 2 + ... + IPi+11 2)Ai+l = Ai+l.
o
454
6 Eigenvalue Problems
Defining, for an arbitrary matrix A,
then HI, H2 are Hermitian and
( HI, H 2 are also denoted by Re A and 1m A, respectively; but it should be
noted that the elements of Re A are not real in general.)
For every eigenvalue A of A, one has, on account of A E G[A] and
(6.4.3),
xHAx
1 xHAx+XHAHx
ReA S; maxRe-H= max~
2
X",O
x x
X",O X X
= max
X",O
xHHIX
H
x x
= Amax(Hd,
xHAx
ImA S; maxlm-H= Amax(H2)'
X",O
x.r
By estimating Re A, 1m A analogously from below, one obtains
(6.9.15) Theorem (Bendixson). Decomposing an arbitrary matrix A into
A = HI + iH2, where HI and H2 are Hermitian, then for every eigenvalue
A of A one has
Amin(H1 ) S; Re AS; Amax(Hd.
Amin(H2 ) S; ImA S; Arnax (H 2 ).
The estimates of this section lead immediately to estimates for the zeros
of a polynomial
We need only observe that to p there corresponds the Frobenius matrix
o
1
which has the characteristic polynomial (l/a n )( -l)np(A). In particular,
from the estimate (6.9.1) of Hirsch, with luboo(A) = maXi Lk laikl. applied
to F and F T , one obtains the following estimates for all zeros Ak of p(A),
respectively:
EXERCISES FOR CHAPTER 6
455
(a) IAkl ~ max{1 :: I, l~~~a:_l (1 + 1::1) },
(b)
IAkl ~ max{ 1, ~ 1:: 1}.
EXAMPLE 3. For p('\) = ,\3 - 2,\2
+ ,\ - lone obtains
(a) I'\il S max{l, 2, 3} = 3,
(b) I'\il S max{l, 1 + 1 + 2} = 4.
Tn this case (a) gives a better estimate.
EXERCISES FOR CHAPTER 6
1.
For the matrix
2
1
2
o
1
2
2
A=
1
2
1
o
o
find the characteristic polynomial, the minimal polynomial, a system of eigenvectors and principal vectors, and the Frobenius normal form.
2.
How many distinct (i.e., not mutually similar) 6 x 6 matrices are there whose
characteristic polynomial is
3.
What are the properties of the eigenvalues of
positive definite/semidefinite,
orthogonal/ unitary,
real skew-symmetric (AT = -A)
matrices? Determine the minimal polynomial of a projection matrix A = A 2 .
4.
For the matrices
(a) A=UVT,u,VElR",
(b) H = 1- 2ww H • w H W = 1, w E (Cn,
o o
o o
1 o
o 1
il
456
6 Eigenvalue Problems
determine the eigenvalues Ai, the multiplicities O"i and pi, the characteristic
polynomial i.p(JL) , the minimal polynomial 1jJ(JL) , a system of eigenvectors and
principal vectors, and a Jordan normal form J,
5.
For 'U,V,'W,Z E IR n , n > 2, let
(a) With AI, A2 the eigenvalues of
_
A:=
[vT'U
T
Z 'U
show that A has the eigenvalues AI, A2, O.
(b) How many eigenvectors and principal vectors can A have 7 'What types
of Jordan normal form J of A are possible 7 Determine J in particular
for A. = O.
6.
(a) Show: A is an eigenvalue of A if and only if -A is an eigenvalue of B,
where
(b) Suppose the real symmetric tridiagonal matrix
A=
12
o
satisfies
In
,n
On
i = 1, ... , n,
i = 2, ... , n.
Show: If A is an eigenvalue of A, then so is -A. 'What does this imply
for the eigenvalues of the matrix (6.6.5.9)7
(c) Show that eigenvalues of the matrix
are symmetric with respect to the origin, and that
EXERCISES FOR CHAPTER 6
457
if n is even, n = 2k,
otherwise.
7.
Let C be a real n x n matrix and
t
Show: / is stationary at x 0 precisely if x is an eigenvector of ~(C + C T )
with /(x) as corresponding eigenvalue.
S.
Show:
(a) If A is normal, and has the eigenvalue Ai, IAII >
singular values ai, then
i = 1, ... , n,
lub 2 (A) = IA11 = p(A),
cond 2 (A) = :~~: = p(A)p(A-I)
(if A-I exists).
(b) For every n x n matrix,
9.
Show: If A is a normal n x n matrix with the eigenvalues A"
and if U, V are unitary, then for the eigenvalues p" of U AV one has
10. Let B be an n x m matrix. Prove:
M =
[~: 11: 1 positive definite
{o}
p(BH R) < 1
[In' 1m unit matrices, p(BH B) = spectral radius of BH Bl.
11. For n x n matrices A, B show:
s
(a) IAI S IBI =? lub 2 (IAI)
lub 2 (IBI),
(b) lub 2 (A) S lub 2 (IAI) S fo lub 2 (A).
12. The content of this exercise is the assertion in Section 6.3 that a matrix is ill
conditioned if two column vectors are "almost linearly dependent". For the
matrix A = [aJ, ... , an], ai E IR n , i = 1, ... , n, let
O<c:<l
458
6 Eigenvalue Problems
(i.e., al and a2 form an angle ex with 1 - E :::; I cos exl :::; 1). Show: Either A is
singular or cond2(A) 2: liVE.
13. Estimation of the function f(n) in formula (6.5.4.4) for floating-point arithmetic with relative precision eps: Let A be an n x n matrix in floating-point
repretientation and G j an elimination matrix of the type (6.5.4.1) with the
essential elements lij. The Ii] are formed by division of elements of A and
are thus subject to a relative error of at most eps. With the methods of Section 1.3, derive the following estimates (higher powers eps', i 2: 2, are to be
neglected) :
(a) lub=[fl(GjIA) - GjlA] S; 4epslub=(.4),
(b) lub=[fl(AG J ) - .4G j ] S; (n - j + 2) epslub=(A),
(c) lub=[fl(Gjl AG j ) - Gjl AG j ] S; 2(n - j + 6) epslub=(A).
14. Convergence behavior of the vector iteration: Let A be a rcal symmetric n x n
matrix having the eigenvalues Ai with
xT
and the corresponding eigenvectors Xl, ... , Xn with
Xk = l5 ik . Starting
with an initial vector Yo for which xi Yo # O. suppose one computes
for k = 0, 1, ...
with an arbitrary vector norm I . II, and concurrently the quantities
(AYk)i
qki:= (Yk)i '
and the Rayleigh quotient
Prove:
(a) qki = Ad 1 + O((A2/Ad)] for all i with (XI)i # 0,
(b) Tk=>"d1+0((A2/At)2k)].
15. In the real matrix
A=AT =
[1
9
* * *
0
* *
*
*
* * 4
* * *
~ j,
21
stars represent elements of modulus:::; 1/4. Suppose the vector iteration is
carried out with A and the initial vector yo = e5.
(a) Show that e5 is an "appropriate" initial vector, i.e., the sequence Yk in
Exercise 14 does indeed converge toward the eigenvector belonging to
the dominant eigenvalue of A.
(b) Estimate how many correct decimal digits Tk+5 gains compared to Tk·
EXERCISES FOR CHAPTER 6
459
16. Prove: luboo(F) < 1 ==? A := 1+ F admits a triangular factorization A =
L·R.
17. Let the matrix A be nonsingular and admit a triangular factorization A =
L . R (lii = 1). Show:
(a) Land R are uniquely determined.
(b) If A is an upper Hessenberg matrix, then
1
°1] ,
RL upper Hessenberg.
*
(c) If A is tridiagonal, then L is as in (b) and
RL tridiagonal.
18. (a) Which upper triangular matrices are at the same time unitary; which
ones real orthogonal ?
(b) In what do different QR factorizations of a nonsingular matrix differ
from one another? Is the answer valid also for singular matrices?
19. (a) Consider an upper triangular matrix
and determine a Givens rotation fl so that
Hint: flel is an eigenvector of R corresponding to the eigenvalue A2.
(b) How can one transform an upper triangular matrix R with rkk = Ak,
k = 1, ... , n, unitarily into an upper triangular matrix R = U H RU,
UHU = I, with the diagonal diag(Ai, AI, ... ,Ai-I, Ai+1, ... ,An)?
20. QR method with shifts: Prove formula (6.6.5.3).
21. Let A be a normal n x n matrix with the eigenvalues AI, ... , An, A = QR,
QH Q = I, R = (rik) upper triangular. Prove:
minlAil -s; Irjjl -s; max IAil,
i
22. Compute a QR step with the matrix A =
(a) without shift
j = 1, ... , n.
i
[ :c
E
1]
460
6 Eigenvalue Problems
(b) with shift k = 1, i.e., following strategy (a) of Section 6.6.6.
23. Effect of a QR step with shift for tridiagonal matrices:
B
/; 1= 0, i = 2, ... , n.
/n
Suppose we carry out a QR step with A, using the shift parameter k = 5n ,
,
/2
::,1'
5;,
Prove: If d:= min; IAi(B) - 5n l > 0, then
15'n _ 5 1< ITnd 12 .
3
1/n'1<lT
d 21 '
n
11,
-
Hint: Q is a product of suitable Givens rotations; apply Exercise 21.
Example: What does one obtain for
1
5
1
1
5
1
1
5
0.1
24. Is shift strategy (a) in the QR method meaningful for real tridiagonal matrices of the type
5 /2
0
/2
5
?
/n
0
/n
5
Answer this question with the aid of Exercise 6 (c) and the following fact:
If A is real symmetric and tridiagonal, and if all diagonal elements are zero,
then the same is true after one QR step.
25. For the matrix
5.2
A = [ 0.6
2.2
0.6
6.4
0.5
2.2]
0.5
4.7
compute an upper bound for cond 2 (A), using estimates of the eigenvalues by
the method of Gershgorin.
26. Estimate the eigenvalues of the following matrices as accurately as possible:
EXERCISES FOR CHAPTER 6
461
(a) The matrix A in Exercise 15.
10- 3
2
10- 3
(b)
10- 34 ]
103
.
Hint: Use the method of Gershgorin in conjunction with a transformation
A --+ D- 1 AD, D a suitable diagonal matrix.
27. (a) Let A, B be Hermitian square matrices and
Show: For every eigenvalue )"(B) of B there is an eigenvalue )"(H) of H
such that
(b) Apply (a) to the practically important case in which H is a Hermitian,
"almost reducible" tridiagonal matrix of the form
0
* *
* *
*
H=
*
*
E
E
*
*
E small.
*
*
o
*
*
*
How can the eigenvalues of H be estimated in terms of those of A and
B?
28. Show: If A = (aid is Hermitian, then for every diagonal element aii there
exists an eigenvalue )"(A) of A such that
I),.(A)-aiil ~
~)a,jI2.
ji"i
29. (a) For the v x v matrix
462
6 Eigenvalue Problems
prove with the aid of Gershgorin's theorem and appropriate scaling with
diagonal matrices: The eigenvalues Ai(c) ofthe perturbed matrix Cv(A)+
cF satisfy, for c sufficiently small,
with some constant K. Through a special choice of the matrix F show
that the case Ai(c) - A = O(c l / v ) does indeed occur.
(b) Prove the result (6.9.11).
Hint: Transform A to Jordan normal form.
30. Obtain the field of values G[A] = {x H Axlx H X = I} for
1
[1
A:= 0
1
1
0
o 0
o
o
-1
-1
~l·
-1
31. Decompose these matrices into HI + iH2 with HI, H2 Hermitian:
5 +i
1 + 4i
1
32. For the eigenvalues of
3.
A = [ 0.9
0.1
1.1
15.0
-1.9
-0.1]
-2.1 .
19.5
use the theorems of Gershgorin and Bendixson to determine inclusion regions
which are as small as possible.
References for Chapter 6
Barth, W., Martin, R. S., Wilkinson, J. H. (1971): Calculation of the eigenvalues
of a symmetric tridiagonal matrix by the method of bisection. Contribution
Il/5 in Wilkinson and Reinsch (1971).
Bauer, F. L., Fike, C. T. (1960): Norms and exclusion theorems. Numer. Math.
2, 137-141.
- - - , Stoer, J., Witzgall, C. (1961): Absolute and monotonic norms. Numer.
Math. 3, 257-264.
Bowdler, H., Martin, R. S., Wilkinson, J. H. (1971): The QR and QL algorithms
for symmetric matrices. Contribution Il/3 in: Wilkinson and Reinsch (1971).
Bunse, W., Bunse-Gerstner, A. (1985): Numerische Lineare Algebra. Stuttgart:
Teubner.
Cullum, J., Willoughby, R. A. (1985): Lanczos Algorithms for Large Symmetric Eigenvalue Computations. Vol. I: Theory, Vol. II: Programs. Progress in
Scientific Computing, Vol. 3, 4. Basel: Birkhiiuser.
References for Chapter 6
463
Eberlein, P. J. (1971): Solution to the complex eigenproblem by a norm reducing
Jacobi type method. Contribution II/17 in: Wilkinson and Reinsch (1971).
Francis, J. F. G. (1961/62): The QR transformation. A unitary analogue to the
LR transformation. I. Computer J. 4, 265-271. The QR transformation. II.
ibid., 332-345.
Garbow, B. S., et al. (1977): Matrix Eigensystem Computer Routines - EISPACK Guide Extension. Lecture Notes in Computer Science 51. Berlin, Heidelberg, New York: Springer-Verlag.
Givens, J. W. (1954): Numerical computation of the characteristic values of a
real symmetric matrix. Oak Ridge National Laboratory Report ORNL-1574.
Golub, G. H., Reinsch, C. (1971): Singular value decomposition and least squares
solution. Contribution I/1O in: Wilkinson and Reinsch (1971).
- _ .. , Van Loan, C. F. (1983): Matrix Computations. Baltimore: The JohnsHopkins University Press.
- - - , Wilkinson, J. H. (1976): Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Review 18, 578-619.
Householder, A. S. (1964): The Theory of Matrices in Numencal Analysis. New
York: Blaisdell.
Kaniel, S. (1966): Estimates for some computational techniques in linear algebra.
Math. Compo 20, 369-378.
Kielbasinski, A., Schwetlick, H. (1988): Numerische Lineare Algebra. Thun,
Frankfurt /M.: Deutsch.
Kublanovskaya, V. N. (1961): On some algorithms for the solution of the complete
eigenvalue problem. Z. Vycisl. Mat. i Mat. Fiz. 1, 555-570.
Lanczos, C. (1950): An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Nat. Bur. Stand. 45,
255-282.
Martin, R. S., Peters, G., Wilkinson, J. H. (1971): The QR algorithm for real
Hessenberg matrices. Contribution II/14 in: Wilkinson and Reinsch (1971).
- - - , Reinsch, C., Wilkinson. J. H. (1971): Householder's tridiagonalization
of a symmetric matrix. Contribution II/2 in: Wilkinson and Reinsch (1971).
- , Wilkinson, J. H. (1971): Reduction of the symmetric eigenproblem Ax =
)'Bx and related problems to standard form. Contribution II/10 in: Wilkinson
and Reinsch (1971).
- - - , - - - (1971): Similarity reduction of a general matrix to Hessenberg
form. Contribution II/1:1 in: Wilkinson and Reinsch (1971).
Moler, C. B .. Stewart, G. W. (1973): An algorithm for generalized matrix eigenvalue problems. SIAM J. Numer. Anal. 10, 241-256.
Paige, C. C. (1971): The computation of eigenvalues and eigenvectors of very
large sparse matrices. Ph.D. thesis, London University.
Parlett, B. N. (1965): Convergence of the QR algorithm. Numer. Math. 7, 187-193
(corr. in 10, 16:~-164 (lY67)).
- - - (1980): The Symmetric Eigenvalue Problem. Englewood Cliffs, N.J.:
Prcntice-Hall.
- - - , Poole, W. G. (1973): A geometric theory for the QR, LU and power
iterations. SIAM J. Numer. Anal. 10. 389-412.
- - - , Scott, D. S. (1979): The Lanczos algorithm with selective orthogonalization. Math Compo 33, 217-238.
Peters, G., Wilkinson J. H. (1970): Ax = )'Bx and the generalized eigenproblem.
SIAM 1. Numer. Anal. 7, 479-492.
- - - , - - - (1971): Eigenvectors of real and complex matrices by LR and
QR triangularizations. Contribution II/15 in: Wilkinson and Reinsch (1971).
464
6 Eigenvalue Problems
- - - , - - - (1971): The calculation of specified eigenvectors by inverse iteration. Contribution II/18 in: Wilkinson and Reinsch (1971).
Rutishauser, H. (1958): Solution of eigenvalue problems with the LR-transformation. Nat. Bur. Standards Appl. Math. Ser. 49, 47-81.
- - - (1971): The Jacobi method for real symmetric matrices. Contribution
II/I in: Wilkinson and Reinsch (1971).
Saad, Y. (1980): On the rates of convergence of the Lanczos and the block Lanczos
methods. SIAM 1. Num. Anal. 17, 687-706.
Schwarz, H. R., Rutishauscr, H., Stiefel, E. (1972): Numerik symmetrischer Matrizen. 2d ed. Stuttgart: Teuhner. (English translation: Englewood Cliffs, N.J.:
Prentice-Hall (1973).)
Smith, B. T., et al. (1976): Matrix eigensystems routines - EISPACK Guide.
Lecture Notes in Computer Science 6, 2d ed. Berlin, Heidelberg, New York:
Springer-Verlag.
Stewart, G. W. (1973): Introduction to Matrix Computations. New York, London:
Academic Press.
Wilkinson, J. II. (1962): Note on the quadratic convergence of the cyclic Jacobi
process. Numer. Math. 4, 296-300.
- - - (1965): The Algebraic Eigenvalue Problem. Oxford: Clarendon Press.
- - - (1968): Global convergence of tridiagonal QR algorithm with origin
shifts. Linear Algebra and Appl. 1, 409-420.
- - - . Reinsch, C. (1971): Linear Algebra, Handbook for Automatic Computation, Vol. II. Berlin, Heidelberg, New York: Springer-Verlag.
7 Ordinary Differential Equations
7.0 Introduction
Many problems in applied mathematics lead to ordinary differential equations. In the simplest case one seeks a differentiable function Y = y(x) of
one real variable x, whose derivative y'(x) is to satisfy an equation of the
form y'(x) = f(x, y(x)), or more briefly,
(7.0.1)
y' = f(x,y);
one then speaks of an ordinary differential equation. In general there are
infinitely many different functions y which are solutions of (7.0.1). Through
additional requirements one can single out certain solutions from the set
of all solutions. Thus, in an initial-value problem, one seeks a solution y of
(7.0.1) which for given xo, Yo satisfies an initial condition of the form
(7.0.2)
y(xo) = Yo·
More generally, one also considers systems of n ordinary differential
equations
y~ (x) = II (x, Yl(X), ... ,Yn(x)),
y~(x) = h(x, Yl (x), ... ,Yn(x)),
for n unknown real functions Yi(X), i = 1, ... , n, of a real variable. Such
systems can be written analogously to (7.0.1) in vector form:
(7.0.3)
To the initial condition (7.0.2) there now corresponds a condition of the
form
466
(7.0.4)
7 Ordinary Differential Equations
y(xo) = Yo =
[Y~Ol
YnO
In addition to ordinary differential equations of first order' (7.0.1),
(7.0.3), in which there occur only first derivatives of the unknown function
y( x), there are ordinary differential equations of mth order of the form
(7.0.5)
))
y (m)( x ) -_ f( x, y ()
x ,Y (1)()
x, .... y (m-l)( x.
By introducing auxiliary functions
ZI(X) : = y(x),
Z2(X) : = y(1)(x),
Zm(x) : = y(m-l)(x),
(7.0.5) can always be transformed into an equivalent system of first-order
differential equations,
(7.0.6)
By an initial-value problem for the ordinary differential equation of mth
order (7.0.5) one means the problem of finding an m times differentiable
function y(x) which satisfies (7.0.5) and initial conditions of the form
i = 0, 1, ... , m - 1.
Initial-value problems will be treated in Section 7.2.
Besides initial-value problems for systems of ordinary differential equations, boundary-value problems also frequently occur in practice. Here, the
desired solution y(x) of the differential equation (7.0.3) has to satisfy a
boundary condition of the form
r(y(a), y(b)) = 0,
(7.0.7)
where a =f. b are two different numbers and
rl (Ul, ... , Un, VI, ... , V n )
r(u, V) :=
[
:
r n (Ul, ... , Un, VI,""
Vn )
1
7.1 Some Theorems from the Theory of Diffferential Equations
467
is a vector of n given functions ri of 2n variables Ul, ... , Un, Vl, ... , V n ·
Problems of this kind will be considered in Sections 7.3, 7.4, and 7.5.
In the methods which are to be discussed in the following, one does not
construct a closed-form expression for the desired solution y(x) - this is
not even possible, in general - but in correspondence to certain discrete
abscissae Xi, i = 0, 1, ... ,one determines approximate values 'r/i := 'r/(Xi) for
the exact values Yi := Y(Xi). The discrete abscissae are often equidistant,
Xi = Xo + ih. For the appproximate values 'r/i we then also write more
precisely 'r/(Xi; h), since the 'r/i, like the Xi, depend on the stepsize h used.
An important problem, for a given method, will be to examine whether,
and how fast, 'r/(x; (x - xo)jn) converges to y(x) as n -+ 00, i.e., h -+ o.
For a detailed treatment of numerical methods for solving initial- and
boundary-value problems we refer to the literature: In addition to the classical book by Henrici (1963), and the more recent exposition by Hairer et
al. (1987, 1991), we mention the books by Gear (1971), Grigorieff (1972,
1977), Keller (1968), Shampine and Gordon (1975), and Stetter (1973).
7.1 Some Theorems from the Theory of Ordinary
Differential Equations
For later use we list a few results - some without proof - from the theory of ordinary differential equations. We assume throughout that [see (7.0.3)]
    y' = f(x, y)
is a system of n ordinary differential equations, || · || a norm on ℝ^n, and ||A|| an associated consistent multiplicative matrix norm with ||I|| = 1 [see (4.4.8)]. It can then be shown [see, e.g., Henrici (1962), Theorem 3.1] that the initial-value problem (7.0.3), (7.0.4) - and thus in particular (7.0.1), (7.0.2) - has exactly one solution, provided f satisfies a few simple regularity conditions:
(7.1.1) Theorem. Let f be defined and continuous on the strip S := {(x, y) | a ≤ x ≤ b, y ∈ ℝ^n}, a, b finite. Further, let there be a constant L such that
(7.1.2)    ||f(x, y_1) - f(x, y_2)|| ≤ L ||y_1 - y_2||
for all x ∈ [a, b] and all y_1, y_2 ∈ ℝ^n ("Lipschitz condition"). Then for every x_0 ∈ [a, b] and every y_0 ∈ ℝ^n there exists exactly one function y(x) such that
(a) y(x) is continuous and continuously differentiable for x ∈ [a, b];
(b) y'(x) = f(x, y(x)) for x ∈ [a, b];
(c) y(x_0) = y_0.
From the mean-value theorem it easily follows that the Lipschitz condition is satisfied if the partial derivatives ∂f_i/∂y_j, i, j = 1, ..., n, exist on the strip S and are continuous and bounded there. For later use, we denote by
(7.1.3)    F_N(a, b)
the set of functions f for which all partial derivatives up to and including order N exist on the strip S = {(x, y) | a ≤ x ≤ b, y ∈ ℝ^n}, a, b finite, and are continuous and bounded there. The functions f ∈ F_1(a, b) thus fulfill the assumptions of (7.1.1).
In applications, f is usually continuous on S and also continuously differentiable there, but the derivatives ∂f_i/∂y_j are often unbounded on S. Then, while the initial-value problem (7.0.3), (7.0.4) is still solvable, the solution may be defined only in a certain neighborhood U(x_0) (which may even depend on y_0) of the initial point and not on all of [a, b] [see, e.g., Henrici (1962)].
EXAMPLE. The initial-value problem
    y' = y²,    y(0) = y_0 > 0
has the solution y(x) = y_0/(1 - y_0 x), which is defined only for x < 1/y_0.
(7.1.1) is the fundamental existence and uniqueness theorem for the
initial-value problem given in (7.0.3), (7.0.4).
We now show that the solution of an initial-value problem depends
continuously on the initial value:
(7.1.4) Theorem. Let the function f: S → ℝ^n be continuous on the strip S = {(x, y) | a ≤ x ≤ b, y ∈ ℝ^n} and satisfy the Lipschitz condition
    ||f(x, y_1) - f(x, y_2)|| ≤ L ||y_1 - y_2||
for all (x, y_i) ∈ S, i = 1, 2. Let a ≤ x_0 ≤ b. Then for the solution y(x; s) of the initial-value problem
    y' = f(x, y),    y(x_0; s) = s
there holds the estimate
    ||y(x; s_1) - y(x; s_2)|| ≤ e^{L|x - x_0|} ||s_1 - s_2||    for a ≤ x ≤ b.
PROOF. By definition of y(x; s) one has
    y(x; s) = s + ∫_{x_0}^{x} f(t, y(t; s)) dt
for a ≤ x ≤ b. It follows that
    y(x; s_1) - y(x; s_2) = s_1 - s_2 + ∫_{x_0}^{x} [f(t, y(t; s_1)) - f(t, y(t; s_2))] dt,
and thus
(7.1.5)    ||y(x; s_1) - y(x; s_2)|| ≤ ||s_1 - s_2|| + L |∫_{x_0}^{x} ||y(t; s_1) - y(t; s_2)|| dt|.
For the function
    φ(x) := ∫_{x_0}^{x} ||y(t; s_1) - y(t; s_2)|| dt
one has φ'(x) = ||y(x; s_1) - y(x; s_2)|| and thus, by (7.1.5), for x ≥ x_0
    α(x) ≤ ||s_1 - s_2||    with    α(x) := φ'(x) - L φ(x).
The initial-value problem
(7.1.6)    φ'(x) = α(x) + L φ(x),    φ(x_0) = 0
has for x ≥ x_0 the solution
(7.1.7)    φ(x) = e^{L(x - x_0)} ∫_{x_0}^{x} α(t) e^{-L(t - x_0)} dt.
On account of α(x) ≤ ||s_1 - s_2||, one thus obtains the estimate
    0 ≤ φ(x) ≤ e^{L(x - x_0)} ||s_1 - s_2|| ∫_{x_0}^{x} e^{-L(t - x_0)} dt = (1/L) ||s_1 - s_2|| [e^{L(x - x_0)} - 1]    for x ≥ x_0.
The desired result finally follows for x ≥ x_0 from (7.1.5):
    ||y(x; s_1) - y(x; s_2)|| ≤ ||s_1 - s_2|| + L φ(x) ≤ e^{L(x - x_0)} ||s_1 - s_2||.
If x < x_0 one proceeds similarly.    □
The preceding theorem can be sharpened: Under additional assumptions the solution of the initial-value problem actually depends on the initial
value in a continuously differentiable manner. We have the following:
(7.1.8) Theorem. If in addition to the assumptions of Theorem (7.1.4) the Jacobian matrix D_y f(x, y) = [∂f_i/∂y_j] exists on S and is continuous and bounded there,
    ||D_y f(x, y)|| ≤ L    for (x, y) ∈ S,
then the solution y(x; s) of y' = f(x, y), y(x_0; s) = s, is continuously differentiable for all x ∈ [a, b] and all s ∈ ℝ^n. The derivative
    Z(x; s) := D_s y(x; s) = [∂y(x; s)/∂s_1, ..., ∂y(x; s)/∂s_n]
is the solution of the initial-value problem (Z' ≡ D_x Z)
(7.1.9)    Z' = D_y f(x, y(x; s)) Z,    Z(x_0; s) = I.
Note that Z', Z, and D_y f(x, y(x; s)) are n × n matrices. (7.1.9) thus describes an initial-value problem for a system of n² differential equations.
Formally, (7.1.9) can be obtained by differentiating with respect to s the
identities
y'(x; s) == f(x, y(x; s)),
y(xo; s) == s.
A proof of Theorem (7.1.8) can be found, e.g., in Coddington and
Levinson (1955).
For many purposes it is important to estimate the growth of the solution Z of (7.1.9). Suppose, then, that T(x) is an n × n matrix, and that the n × n matrix Y(x) is the solution of the linear initial-value problem
(7.1.10)    Y' = T(x) Y,    Y(a) = I.
One can then show:
(7.1.11) Theorem. If T(x) is continuous on [a, b], and k(x) := ||T(x)||, then the solution Y(x) of (7.1.10) satisfies
    ||Y(x) - I|| ≤ exp(∫_a^x k(t) dt) - 1,    x ≥ a.
PROOF. By definition of Y(x) one has
    Y(x) = I + ∫_a^x T(t) Y(t) dt.
Letting
    φ(x) := ||Y(x) - I||,
by virtue of ||Y(x)|| ≤ φ(x) + ||I|| = φ(x) + 1 there follows for x ≥ a the estimate
(7.1.12)    φ(x) ≤ ∫_a^x k(t)(φ(t) + 1) dt.
Now let c(x) be defined by
(7.1.13)    ∫_a^x k(t)(φ(t) + 1) dt = c(x) exp(∫_a^x k(t) dt) - 1,    c(a) = 1.
Through differentiation of (7.1.13) [c(x) is clearly differentiable] one obtains
    k(x)(φ(x) + 1) = c'(x) exp(∫_a^x k(t) dt) + k(x) c(x) exp(∫_a^x k(t) dt)
                   = c'(x) exp(∫_a^x k(t) dt) + k(x) [1 + ∫_a^x k(t)(φ(t) + 1) dt],
from which, because of k(x) ≥ 0 and (7.1.12), it follows that
    c'(x) exp(∫_a^x k(t) dt) + k(x) ∫_a^x k(t)(φ(t) + 1) dt = k(x) φ(x) ≤ k(x) ∫_a^x k(t)(φ(t) + 1) dt.
One thus obtains finally
    c'(x) ≤ 0
and therefore
(7.1.14)    c(x) ≤ c(a) = 1    for x ≥ a.
The assertion of the theorem now follows immediately from (7.1.12)-(7.1.14).    □
7.2 Initial-Value Problems
7.2.1 One-Step Methods: Basic Concepts
As one can already surmise from Section 7.1, the methods and results for initial-value problems for systems of ordinary differential equations of first order are essentially independent of the number n of unknown functions. In the following we therefore limit ourselves to the case of only one ordinary differential equation of first order for only one unknown function (i.e., n = 1). The results, however, are valid, as a rule, also for systems (i.e., n > 1), provided quantities such as y and f(x, y) are interpreted as vectors, and | · | as the norm || · ||. For the following, we assume that the initial-value problem under consideration is always uniquely solvable.
A first numerical method for the solution of the initial-value problem
(7.2.1.1)    y' = f(x, y),    y(x_0) = y_0
is suggested by the following simple observation: Since f(x, y(x)) is just the slope y'(x) of the desired exact solution y(x) of (7.2.1.1), one has for h ≠ 0 approximately
    (y(x + h) - y(x))/h ≈ f(x, y(x)),
or
(7.2.1.2)    y(x + h) ≈ y(x) + h f(x, y(x)).
Once a steplength h ≠ 0 is chosen, starting with the given initial values x_0, y_0 = y(x_0), one thus obtains at equidistant points x_i = x_0 + ih, i = 1, 2, ..., approximations η_i to the values y_i = y(x_i) of the exact solution y(x) as follows:
(7.2.1.3)    η_0 := y_0;
    for i = 0, 1, 2, ...
        η_{i+1} := η_i + h f(x_i, η_i),
        x_{i+1} := x_i + h.
One arrives at the polygon method of Euler shown in Figure 14.

Fig. 14. Euler's method
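To make the recursion (7.2.1.3) concrete, here is a minimal sketch of Euler's polygon method in Python; the function names euler and f are ours, not part of the text, and the example problem is only an illustration.

    # Euler's polygon method (7.2.1.3): eta_{i+1} = eta_i + h*f(x_i, eta_i).
    def euler(f, x0, y0, h, n):
        x, eta = x0, y0
        polygon = [(x, eta)]
        for _ in range(n):
            eta = eta + h * f(x, eta)
            x = x + h
            polygon.append((x, eta))
        return polygon

    # Example: y' = -2*x*y, y(0) = 1, with exact solution exp(-x**2).
    approximations = euler(lambda x, y: -2.0 * x * y, 0.0, 1.0, 0.01, 100)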
Evidently, the approximate values η_i of y(x_i) depend on the stepsize h. To indicate this, we also write more precisely η(x_i; h) instead of η_i. The "approximate solution" η(x; h) is thus defined only for
    x ∈ R_h := {x_0 + ih | i = 0, 1, 2, ...}
or, alternatively, for
    h ∈ H_x := {(x - x_0)/n | n = 1, 2, ...};
in fact, it is defined recursively by [cf. (7.2.1.3)]
    η(x_0; h) := y_0,
    η(x + h; h) := η(x; h) + h f(x, η(x; h)).
Euler's method is a typical one-step method. In general, such methods are given by a function
    Φ(x, y; h; f).
Starting with the initial values x_0, y_0 of the initial-value problem (7.2.1.1), one now obtains approximate values η_i for the quantities y_i := y(x_i) of the exact solution y(x) by means of
(7.2.1.4)    η_0 := y_0;
    for i = 0, 1, 2, ...
        η_{i+1} := η_i + h Φ(x_i, η_i; h; f),
        x_{i+1} := x_i + h.
In the method of Euler, for example, one has Φ(x, y; h; f) := f(x, y); here, Φ is independent of h.
For simplicity, the argument f in the function Φ will from now on be omitted. As in Euler's method (see above) we also write more precisely η(x_i; h) instead of η_i, in order to indicate the dependence of the approximate values on the stepsize h used.
Let now x and y be arbitrary, but fixed, and let z(t) be the exact solution of the initial-value problem
(7.2.1.5)    z'(t) = f(t, z(t)),    z(x) = y,
with initial values x, y. Then the function
(7.2.1.6)    Δ(x, y; h; f) := (z(x + h) - y)/h  if h ≠ 0,    Δ(x, y; 0; f) := f(x, y)
represents the difference quotient of the exact solution z(t) of (7.2.1.5) for stepsize h, while Φ(x, y; h) is the difference quotient for stepsize h of the approximate solution of (7.2.1.5) produced by Φ. As in Φ, we shall also drop the argument f in Δ.
The magnitude of the difference
    τ(x, y; h) := Δ(x, y; h) - Φ(x, y; h)
indicates how well the value z(x + h) of the exact solution at x + h of the differential equation with the initial data (x, y) obeys the equation of the one-step method: it is a measure of the quality of the approximation method.
One calls τ(x, y; h) the local discretization error at the point (x, y) of the method in question. For a reasonable one-step method one will require that
    lim_{h→0} τ(x, y; h) = 0.
Inasmuch as lim_{h→0} Δ(x, y; h) = f(x, y), this is equivalent to
(7.2.1.7)    lim_{h→0} Φ(x, y; h) = f(x, y).
One calls Φ, and the associated one-step method, consistent if (7.2.1.7) holds for all x ∈ [a, b], y ∈ ℝ and f ∈ F_1(a, b) [see (7.1.3)].
EXAMPLE. Euler's method, Φ(x, y; h) := f(x, y), is obviously consistent. The result can be sharpened: if f has sufficiently many continuous partial derivatives, it is possible to say with what order τ(x, y; h) goes to zero as h → 0. For this, expand the solution z(t) of (7.2.1.5) into a Taylor series about the point t = x:
    z(x + h) = z(x) + h z'(x) + (h²/2) z''(x) + ... + (h^p/p!) z^(p)(x + θh),    0 < θ < 1.
We now have, because z(x) = y, z'(t) ≡ f(t, z(t)),
    z''(x) = (d/dt) f(t, z(t))|_{t=x} = f_x(t, z(t))|_{t=x} + f_y(t, z(t)) z'(t)|_{t=x}
           = f_x(x, y) + f_y(x, y) f(x, y),
    z'''(x) = f_xx(x, y) + 2 f_xy(x, y) f(x, y) + f_yy(x, y) f(x, y)² + f_y(x, y) z''(x),
etc., and thus
(7.2.1.8)    Δ(x, y; h) = z'(x) + (h/2!) z''(x) + ... + (h^{p-1}/p!) z^(p)(x + θh)
                        = f(x, y) + (h/2) [f_x(x, y) + f_y(x, y) f(x, y)] + ... .
For Euler's method, Φ(x, y; h) := f(x, y), it follows that
    τ(x, y; h) = Δ(x, y; h) - Φ(x, y; h) = (h/2) [f_x(x, y) + f_y(x, y) f(x, y)] + ... = O(h).
Generally, one speaks of a method of order p if
(7.2.1.9)    τ(x, y; h) = O(h^p)
for all x ∈ [a, b], y ∈ ℝ and f ∈ F_p(a, b). Euler's method thus is a method of order 1.
The preceding example shows how to obtain methods of order greater than 1. Simply take for Φ(x, y; h) sections of the Taylor series (7.2.1.8) of Δ(x, y; h). For example,
(7.2.1.10)    Φ(x, y; h) := f(x, y) + (h/2) [f_x(x, y) + f_y(x, y) f(x, y)]
produces a method of order 2. The higher-order methods so obtained, however, are hardly useful, since in every step (x_i, η_i) → (x_{i+1}, η_{i+1}) one must compute not only f, but also the partial derivatives f_x, f_y, etc.
Simpler methods of higher order can be obtained, e.g., by means of the construction
(7.2.1.11)    Φ(x, y; h) := a_1 f(x, y) + a_2 f(x + p_1 h, y + p_2 h f(x, y)),
in which the constants a_1, a_2, p_1, p_2 are so chosen that the Taylor expansion of Δ(x, y; h) - Φ(x, y; h) in powers of h starts with the largest possible power. For Φ(x, y; h) in (7.2.1.11), the Taylor expansion is
    Φ(x, y; h) = (a_1 + a_2) f(x, y) + a_2 h [p_1 f_x(x, y) + p_2 f_y(x, y) f(x, y)] + O(h²).
Comparison with (7.2.1.8) yields for a second-order method the conditions
    a_1 + a_2 = 1,    a_2 p_1 = 1/2,    a_2 p_2 = 1/2.
One solution of these equations is
    a_1 = a_2 = 1/2,    p_1 = 1,    p_2 = 1,
and one obtains the method of Heun (1900):
(7.2.1.12)    Φ(x, y; h) = (1/2) [f(x, y) + f(x + h, y + h f(x, y))],
which requires only two evaluations of f per step. Another solution is
    a_1 = 0,    a_2 = 1,    p_1 = 1/2,    p_2 = 1/2,
leading to the modified Euler method [Collatz (1960)]
(7.2.1.13)    Φ(x, y; h) := f(x + (1/2)h, y + (1/2)h f(x, y)),
which again is of second order and requires two evaluations of f per step.
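Both second-order choices (7.2.1.12) and (7.2.1.13) fit directly into the recursion (7.2.1.4). The following short Python sketch, with names of our own choosing, writes them as increment functions Φ together with a generic one-step driver:

    # Heun's method (7.2.1.12) and the modified Euler method (7.2.1.13),
    # each written as an increment function Phi(x, y; h) for (7.2.1.4).
    def phi_heun(f, x, y, h):
        return 0.5 * (f(x, y) + f(x + h, y + h * f(x, y)))

    def phi_modified_euler(f, x, y, h):
        return f(x + 0.5 * h, y + 0.5 * h * f(x, y))

    def one_step_method(phi, f, x0, y0, h, n):
        # Generic recursion (7.2.1.4): eta_{i+1} = eta_i + h*Phi(x_i, eta_i; h).
        x, eta = x0, y0
        for _ in range(n):
            eta = eta + h * phi(f, x, eta, h)
            x = x + h
        return eta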
The Runge-Kutta method [Runge (1895), Kutta (1901)] is obtained from a construction which is somewhat more general than (7.2.1.11). It has the form
(7.2.1.14)    Φ(x, y; h) := (1/6) [k_1 + 2k_2 + 2k_3 + k_4],
where
    k_1 := f(x, y),
    k_2 := f(x + (1/2)h, y + (1/2)h k_1),
    k_3 := f(x + (1/2)h, y + (1/2)h k_2),
    k_4 := f(x + h, y + h k_3).
Through a simple but rather tedious Taylor expansion in h one finds, for f ∈ F_4(a, b),
    Φ(x, y; h) - Δ(x, y; h) = O(h^4).
The Runge-Kutta method, therefore, is a method of fourth order. It requires four evaluations of f per step.
If f(x, y) does not depend on y, then the solution of the initial-value problem
    y' = f(x),    y(x_0) = y_0
is just the integral y(x) = y_0 + ∫_{x_0}^{x} f(t) dt. The method of Heun then corresponds to the approximation of y(x) by means of trapezoidal sums, the modified Euler method to the midpoint rule, and the Runge-Kutta method to Simpson's rule [see Section 3.1].
All methods of this section are examples of multistage Runge-Kutta methods:
(7.2.1.15) Definition. An s-stage Runge-Kutta method is a one-step method given by a function Φ(x, y; h; f) which is defined by finitely many real numbers c_1, c_2, ..., c_s, α_2, α_3, ..., α_s, and β_{i,j} with 2 ≤ i ≤ s and 1 ≤ j ≤ i - 1 as follows:
    Φ(x, y; h; f) := c_1 k_1 + c_2 k_2 + ... + c_s k_s,
where
    k_1 := f(x, y),
    k_2 := f(x + α_2 h, y + h β_{21} k_1),
    k_3 := f(x + α_3 h, y + h (β_{31} k_1 + β_{32} k_2)),
        ...
    k_s := f(x + α_s h, y + h (β_{s1} k_1 + ... + β_{s,s-1} k_{s-1})).
Following Butcher (1964), a method of this type is designated by the scheme
(7.2.1.16)
    0
    α_2 | β_21
    α_3 | β_31  β_32
     .  |   .          .
    α_s | β_s1  β_s2  ...  β_{s,s-1}
        | c_1   c_2   ...  c_{s-1}   c_s
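The definition (7.2.1.15) translates directly into a small routine that evaluates Φ for a given Butcher scheme (7.2.1.16). The data layout below (lists alpha, beta, c) is our own convention, not notation from the text:

    # Evaluate Phi(x, y; h; f) of an s-stage explicit Runge-Kutta method
    # given by its Butcher scheme (7.2.1.16):
    #   alpha = [alpha_2, ..., alpha_s], beta[i] = [beta_{i+2,1}, ..., beta_{i+2,i+1}],
    #   c = [c_1, ..., c_s].
    def runge_kutta_phi(f, x, y, h, alpha, beta, c):
        k = [f(x, y)]                          # k_1
        for a, b_row in zip(alpha, beta):      # stages k_2, ..., k_s
            increment = sum(b * kj for b, kj in zip(b_row, k))
            k.append(f(x + a * h, y + h * increment))
        return sum(ci * ki for ci, ki in zip(c, k))

    # The classical fourth-order method of (7.2.1.14) in this notation:
    alpha = [0.5, 0.5, 1.0]
    beta = [[0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
    c = [1.0/6, 2.0/6, 2.0/6, 1.0/6]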
Butcher (1964) has analysed these methods systematically. Concrete
examples of such methods of order higher than four were given by him,
by Fehlberg (1964, 1966, 1969), Shanks (1966) and many others. We will
describe some methods of this type in Section 7.2.5.
A general exposition of one-step methods is found in Grigorieff (1972)
and Stetter (1973), and in particular in Hairer, Norsett and Wanner (1993).
7.2.2 Convergence of One-Step Methods
In this section we wish to examine the convergence behavior as h → 0 of an approximate solution η(x; h) furnished by a one-step method. We assume that f ∈ F_1(a, b) and denote by y(x) the exact solution of the initial-value problem (7.2.1.1)
    y' = f(x, y),    y(x_0) = y_0.
Let Φ(x, y; h) define a one-step method,
    η_0 := y_0,
    for i = 0, 1, ...
        η_{i+1} := η_i + h Φ(x_i, η_i; h),
        x_{i+1} := x_i + h,
which for x ∈ R_h := {x_0 + ih | i = 0, 1, 2, ...} produces the approximate solution η(x; h):
    η(x; h) := η_i    if x = x_0 + ih.
We are interested in the behavior of the global discretization error
    e(x; h) := η(x; h) - y(x)
for fixed x and h → 0, h ∈ H_x := {(x - x_0)/n | n = 1, 2, ...}. Since e(x; h), like η(x; h), is only defined for h ∈ H_x, this means a study of the convergence of
    e(x; h_n),    h_n := (x - x_0)/n,    as n → ∞.
We say that the one-step method is convergent if
(7.2.2.1)    lim_{n→∞} e(x; h_n) = 0
for all x ∈ [a, b] and all f ∈ F_1(a, b).
We will see that for f ∈ F_p(a, b) methods of order p > 0 [cf. (7.2.1.9)] are convergent, and even satisfy
    e(x; h_n) = O(h_n^p).
The order of the global discretization error is thus equal to the order of the
local discretization error.
We begin by showing the following:
(7.2.2.2) Lemma. If the numbers ξ_i satisfy estimates of the form
    |ξ_{i+1}| ≤ (1 + δ) |ξ_i| + B,    δ > -1,  B ≥ 0,  i = 0, 1, 2, ...,
then
    |ξ_n| ≤ e^{nδ} |ξ_0| + B (e^{nδ} - 1)/δ.
PROOF. From the assumptions we get immediately
    |ξ_1| ≤ (1 + δ) |ξ_0| + B,
    |ξ_2| ≤ (1 + δ)² |ξ_0| + B(1 + δ) + B,
        ...
    |ξ_n| ≤ (1 + δ)^n |ξ_0| + B [1 + (1 + δ) + (1 + δ)² + ... + (1 + δ)^{n-1}]
          = (1 + δ)^n |ξ_0| + B ((1 + δ)^n - 1)/δ
          ≤ e^{nδ} |ξ_0| + B (e^{nδ} - 1)/δ,
since 0 < 1 + δ ≤ e^δ for δ > -1.    □
With this, we can prove the following main theorem:
(7.2.2.3) Theorem. Consider, for x_0 ∈ [a, b], y_0 ∈ ℝ, the initial-value problem
    y' = f(x, y),    y(x_0) = y_0,
having the exact solution y(x). Let the function Φ be continuous on
    G := {(x, y, h) | a ≤ x ≤ b,  |y - y(x)| ≤ γ,  0 ≤ |h| ≤ h_0},    h_0 > 0,  γ > 0,
and let there exist positive constants M and N such that
(7.2.2.4)    |Φ(x, y_1; h) - Φ(x, y_2; h)| ≤ M |y_1 - y_2|
for all (x, y_i, h) ∈ G, i = 1, 2, and
(7.2.2.5)    |τ(x, y(x); h)| = |Δ(x, y(x); h) - Φ(x, y(x); h)| ≤ N |h|^p,    p > 0,
for all x ∈ [a, b], |h| ≤ h_0. Then there exists an h̄, 0 < h̄ ≤ h_0, such that the global discretization error e(x; h) = η(x; h) - y(x) satisfies
    |e(x; h_n)| ≤ N |h_n|^p (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and all h_n = (x - x_0)/n, n = 1, 2, ..., with |h_n| ≤ h̄. If γ = ∞, then h̄ = h_0.
PROOF. The function
    Φ̃(x, y; h) := Φ(x, y; h)            if (x, y, h) ∈ G,
    Φ̃(x, y; h) := Φ(x, y(x) + γ; h)     if x ∈ [a, b], |h| ≤ h_0, y ≥ y(x) + γ,
    Φ̃(x, y; h) := Φ(x, y(x) - γ; h)     if x ∈ [a, b], |h| ≤ h_0, y ≤ y(x) - γ,
is evidently continuous on Ḡ := {(x, y, h) | x ∈ [a, b], y ∈ ℝ, |h| ≤ h_0} and satisfies the condition
(7.2.2.6)    |Φ̃(x, y_1; h) - Φ̃(x, y_2; h)| ≤ M |y_1 - y_2|
for all (x, y_i, h) ∈ Ḡ, i = 1, 2, and, because Φ̃(x, y(x); h) = Φ(x, y(x); h), also the condition
(7.2.2.7)    |Δ(x, y(x); h) - Φ̃(x, y(x); h)| ≤ N |h|^p    for x ∈ [a, b], |h| ≤ h_0.
Let the one-step method generated by Φ̃ furnish the approximate values η̃_i := η̃(x_i; h) for y_i := y(x_i), x_i := x_0 + ih:
    η̃_{i+1} = η̃_i + h Φ̃(x_i, η̃_i; h).
In view of
    y_{i+1} = y_i + h Δ(x_i, y_i; h),
one obtains for the error ẽ_i := η̃_i - y_i, by subtraction, the recurrence formula
(7.2.2.8)    ẽ_{i+1} = ẽ_i + h [Φ̃(x_i, η̃_i; h) - Φ̃(x_i, y_i; h)] + h [Φ̃(x_i, y_i; h) - Δ(x_i, y_i; h)].
Now from (7.2.2.6), (7.2.2.7) it follows that
    |Φ̃(x_i, η̃_i; h) - Φ̃(x_i, y_i; h)| ≤ M |η̃_i - y_i| = M |ẽ_i|,
    |Δ(x_i, y_i; h) - Φ̃(x_i, y_i; h)| ≤ N |h|^p,
and hence from (7.2.2.8) we have the recursive estimate
    |ẽ_{i+1}| ≤ (1 + |h| M) |ẽ_i| + N |h|^{p+1}.
Lemma (7.2.2.2), since ẽ_0 = η̃_0 - y_0 = 0, gives
(7.2.2.9)    |ẽ_k| ≤ N |h|^p (e^{k|h|M} - 1)/M.
Now let x ∈ [a, b], x ≠ x_0, be fixed and h := h_n = (x - x_0)/n, n > 0 an integer. Then x_n = x_0 + nh = x, and from (7.2.2.9) with k = n, since ẽ(x; h_n) = ẽ_n, it follows at once that
(7.2.2.10)    |ẽ(x; h_n)| ≤ N |h_n|^p (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and h_n with |h_n| ≤ h_0. Since |x - x_0| ≤ |b - a| and γ > 0, there exists an h̄, 0 < h̄ ≤ h_0, such that |ẽ(x; h_n)| ≤ γ for all x ∈ [a, b], |h_n| ≤ h̄, i.e., for the one-step method generated by Φ,
    η_0 = y_0,
    η_{i+1} = η_i + h Φ(x_i, η_i; h),
we have for |h| ≤ h̄, according to the definition of Φ̃,
    η̃_i = η_i,    ẽ_i = e_i,    and    Φ̃(x_i, η̃_i; h) = Φ(x_i, η_i; h).
The assertion of the theorem,
    |e(x; h_n)| ≤ N |h_n|^p (e^{M|x - x_0|} - 1)/M,
thus follows for all x ∈ [a, b] and all h_n = (x - x_0)/n, n = 1, 2, ..., with |h_n| ≤ h̄.    □
From the preceding theorem it follows in particular that methods of order p > 0 which in the neighborhood of the exact solution satisfy a Lipschitz condition of the form (7.2.2.4) are convergent in the sense (7.2.2.1). Observe that the condition (7.2.2.4) is fulfilled, e.g., if (∂/∂y)Φ(x, y; h) exists and is continuous in a domain G of the form stated in the theorem.
Theorem (7.2.2.3) also provides an upper bound for the discretization error, which in principle can be evaluated if one knows M and N. One could use it, e.g., to determine the steplength h which is required to compute y(x) within an error ε, given x and ε > 0. Unfortunately, in practice this is doomed by the fact that the constants M and N are not easily accessible, since an estimation of M and N is only possible via estimates of higher derivatives of f. Already for the simple Euler method, Φ(x, y; h) := f(x, y), e.g., one has [see (7.2.1.8) f.]
    N ≈ (1/2) |f_x(x, y(x)) + f_y(x, y(x)) f(x, y(x))|,    M ≈ |∂Φ/∂y| = |f_y(x, y)|.
For the Runge-Kutta method, one would already have to estimate derivatives of f of the fourth order.
7.2.3 Asymptotic Expansions for the Global Discretization
Error of One-Step Methods
It may be conjectured from Theorem (7.2.2.3) that the approximate solution η(x; h), furnished by a method of order p, possesses an asymptotic expansion in powers of h of the form
(7.2.3.1)    η(x; h) = y(x) + e_p(x) h^p + e_{p+1}(x) h^{p+1} + ...
for all h = h_n = (x - x_0)/n, n = 1, 2, ..., with certain coefficient functions e_k(x), k = p, p + 1, ..., that are independent of h. This is indeed true for general one-step methods of order p, provided only that Φ(x, y; h) and f satisfy certain additional regularity conditions. One has [see Gragg (1963)]
(7.2.3.2) Theorem. Let f(x, y) ∈ F_{N+1}(a, b) [cf. (7.1.3)] and let η(x; h) be the approximate solution obtained by a one-step method of order p, p ≤ N, to the solution y(x) of the initial value problem
    y' = f(x, y),    y(x_0) = y_0,    x_0 ∈ [a, b].
Then η(x; h) has an asymptotic expansion of the form
(7.2.3.3)    η(x; h) = y(x) + e_p(x) h^p + e_{p+1}(x) h^{p+1} + ... + e_N(x) h^N + E_{N+1}(x; h) h^{N+1},    with e_k(x_0) = 0 for k ≥ p,
which is valid for all x ∈ [a, b] and all h = h_n = (x - x_0)/n, n = 1, 2, .... The functions e_i(x) therein are differentiable and independent of h, and the remainder term E_{N+1}(x; h) is bounded for fixed x and all h = h_n = (x - x_0)/n, n = 1, 2, ...: sup_n |E_{N+1}(x; h_n)| < ∞.
PROOF. The following elegant proof is due to Hairer and Lubich (1984). Suppose that the one-step method is given by Φ(x, y; h). Since the method has order p, and f ∈ F_{N+1}, there follows [see Section 7.2.1]
    y(x + h) - y(x) - h Φ(x, y; h) = d_{p+1}(x) h^{p+1} + ... + d_{N+1}(x) h^{N+1} + O(h^{N+2}).
First, we use only
    y(x + h) - y(x) - h Φ(x, y; h) = d_{p+1}(x) h^{p+1} + O(h^{p+2})
and show that there is a differentiable function e_p(x) such that
    η(x; h) - y(x) = e_p(x) h^p + O(h^{p+1}),    e_p(x_0) = 0.
To this end, we consider the function
    η̂(x; h) := η(x; h) - e_p(x) h^p,
where the choice of e_p is still left open. It is easy to see that η̂ can be considered as the result of another one-step method,
    η̂(x + h; h) = η̂(x; h) + h Φ̂(x, η̂(x; h); h),
if Φ̂ is defined by
    Φ̂(x, y; h) := Φ(x, y + e_p(x) h^p; h) - (e_p(x + h) - e_p(x)) h^{p-1}.
By Taylor expansion with respect to h, we find
    y(x + h) - y(x) - h Φ̂(x, y; h) = [d_{p+1}(x) - f_y(x, y(x)) e_p(x) + e_p'(x)] h^{p+1} + O(h^{p+2}).
Hence, the one-step method belonging to Φ̂ has order p + 1 if e_p is taken as the solution of the initial value problem
    e_p'(x) = f_y(x, y(x)) e_p(x) - d_{p+1}(x),    e_p(x_0) = 0.
With this choice of e_p, Theorem (7.2.2.3) applied to Φ̂ then shows that
    η̂(x; h) - y(x) = η(x; h) - y(x) - e_p(x) h^p = O(h^{p+1}).
A repetition of these arguments with Φ̂ in place of Φ completes the proof of the theorem.    □
Asymptotic laws of the type (7.2.3.1) or (7.2.3.3) are significant in practice for two reasons. In the first place, one can use them to estimate the global discretization error e(x; h). Suppose the method of order p has an asymptotic expansion of the form (7.2.3.1), so that
    e(x; h) = η(x; h) - y(x) = e_p(x) h^p + O(h^{p+1}).
Having found the approximate value η(x; h) with stepsize h, one computes for the same x, but with another stepsize (say h/2), the approximation η(x; h/2). For sufficiently small h [and e_p(x) ≠ 0] one then has in first approximation
(7.2.3.4)    η(x; h) - y(x) ≈ e_p(x) h^p,
(7.2.3.5)    η(x; h/2) - y(x) ≈ e_p(x) (h/2)^p.
Subtracting the second equation from the first gives
    e_p(x) (h/2)^p ≈ (η(x; h) - η(x; h/2)) / (2^p - 1),
and one obtains, by substitution in (7.2.3.5),
(7.2.3.6)    η(x; h/2) - y(x) ≈ (η(x; h) - η(x; h/2)) / (2^p - 1).
For the Runge-Kutta method one has p = 4 and obtains the frequently used formula
    η(x; h/2) - y(x) ≈ (η(x; h) - η(x; h/2)) / 15.
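For illustration, (7.2.3.6) can be turned into a small step-doubling error estimate. The sketch below, with names of our own choosing, uses the classical fourth-order Runge-Kutta method (p = 4) as the underlying one-step method:

    # Estimate eta(x; h/2) - y(x) via (7.2.3.6) for a one-step method of order p.
    def rk4_step(f, x, y, h):
        k1 = f(x, y)
        k2 = f(x + h/2, y + h/2 * k1)
        k3 = f(x + h/2, y + h/2 * k2)
        k4 = f(x + h, y + h * k3)
        return y + h/6 * (k1 + 2*k2 + 2*k3 + k4)

    def integrate(f, x0, y0, x, n):
        h = (x - x0) / n
        y = y0
        for i in range(n):
            y = rk4_step(f, x0 + i * h, y, h)
        return y

    def estimated_error(f, x0, y0, x, n, p=4):
        eta_h  = integrate(f, x0, y0, x, n)        # stepsize h = (x - x0)/n
        eta_h2 = integrate(f, x0, y0, x, 2 * n)    # stepsize h/2
        return (eta_h - eta_h2) / (2**p - 1)       # approximates eta(x; h/2) - y(x)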
The other, more important, significance of asymptotic expansions lies in the fact that they justify the application of extrapolation methods [see Sections 3.4 and 3.5]. Since a little later [see (7.2.12.7) f.] we will get to know a discretization method for which the asymptotic expansion of η(x; h) contains only even powers of h and which, therefore, is more suitable for extrapolation algorithms [see Section 3.5] than Euler's method, we defer the description of extrapolation algorithms to Section 7.2.14.
7.2.4 The Influence of Rounding Errors in One-Step Methods
If a one-step method
(7.2.4.1)    η_0 := y_0;
    for i = 0, 1, ...
        η_{i+1} := η_i + h Φ(x_i, η_i; h),
        x_{i+1} := x_i + h
is executed in floating-point arithmetic (t decimal digits) with relative precision eps = 5 × 10^{-t}, then instead of the η_i one obtains other numbers η̄_i, which satisfy a recurrence formula of the form
(7.2.4.2)    η̄_0 := y_0;
    for i = 0, 1, ...
        c_i := fl(Φ(x_i, η̄_i; h)),
        d_i := fl(h c_i),
        η̄_{i+1} := fl(η̄_i + d_i) = η̄_i + h Φ(x_i, η̄_i; h) + ε_{i+1},
where the total absolute rounding error ε_{i+1}, in first approximation, is made up of three components,
    ε_{i+1} ≈ h Φ(x_i, η̄_i; h) (α_{i+1} + μ_{i+1}) + η̄_{i+1} σ_{i+1}.
Here
    α_{i+1} = (fl(Φ(x_i, η̄_i; h)) - Φ(x_i, η̄_i; h)) / Φ(x_i, η̄_i; h)
is the relative rounding error committed in the floating-point computation of Φ, μ_{i+1} the relative rounding error committed in the computation of the product h c_i, and σ_{i+1} the relative rounding error which occurs in the addition η̄_i + d_i. Normally, in practice, the stepsize h is so small that |h Φ(x_i, η̄_i; h)| ≪ |η̄_i|, and if |α_{i+1}| ≤ eps and |μ_{i+1}| ≤ eps, one thus has ε_{i+1} ≈ η̄_{i+1} σ_{i+1}, i.e., the influence of rounding errors is determined primarily by the addition error σ_{i+1}.
Remark. It is natural, therefore, to reduce the influence of rounding errors by carrying out the addition in double precision (2t decimal places). Denoting by fl_2(a + b) a double-precision addition, by η̂_i a double-precision number (2t decimal places), and by η̄_i := rd_1(η̂_i) the number η̂_i rounded to single precision, the algorithm, instead of (7.2.4.2), now runs as follows:
(7.2.4.3)    η̂_0 := y_0;
    for i = 0, 1, ...
        η̄_i := rd_1(η̂_i),
        c_i := fl(Φ(x_i, η̄_i; h)),
        d_i := fl(h c_i),
        η̂_{i+1} := fl_2(η̂_i + d_i).
Let us now briefly estimate the total influence of all rounding errors ε_i. For this, let y_i = y(x_i) be the values of the exact solution of the initial-value problem, η_i = η(x_i; h) the discrete solutions produced by the one-step method (7.2.4.1) in exact arithmetic, and finally η̄_i the approximate values of η_i actually obtained in t-digit floating-point arithmetic. The latter satisfy relations of the form
(7.2.4.4)    η̄_0 := y_0;
    for i = 0, 1, ...
        η̄_{i+1} := η̄_i + h Φ(x_i, η̄_i; h) + ε_{i+1}.
For simplicity, we also assume
    |ε_i| ≤ ε    for all i.
We assume further that Φ satisfies a Lipschitz condition of the form (7.2.2.4),
    |Φ(x, y_1; h) - Φ(x, y_2; h)| ≤ M |y_1 - y_2|.
Then, for the error r(x_i; h) := r_i := η̄_i - η_i, there follows by subtraction of (7.2.4.1) from (7.2.4.4)
    r_{i+1} = r_i + h (Φ(x_i, η̄_i; h) - Φ(x_i, η_i; h)) + ε_{i+1},
and thus
(7.2.4.5)    |r_{i+1}| ≤ (1 + |h| M) |r_i| + ε.
Since r_0 = 0, Lemma (7.2.2.2) gives
    |r(x; h)| ≤ (ε/|h|) (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and h = h_n = (x - x_0)/n, n = 1, 2, .... It follows, therefore, that for a method of order p the total error
    v(x; h) := η̄(x; h) - y(x) = r(x; h) + e(x; h),
under the assumptions of Theorem (7.2.2.3), obeys the estimate
(7.2.4.6)    |v(x; h)| ≤ [N |h|^p + ε/|h|] (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and for all sufficiently small h := h_n = (x - x_0)/n.
This formula reveals that, on account of the influence of rounding errors, the total error v(x; h) begins to increase again once h is reduced beyond a certain critical value. The following table shows this behavior. For the initial-value problem
    y' = -200 x y²,    y(-1) = 1/101,
with exact solution y(x) = 1/(1 + 100x²), an approximate value η(0; h) for y(0) = 1 has been computed by the Runge-Kutta method in 12-digit arithmetic:
    h               v(0; h)               h               v(0; h)
    10^-2           -0.276 × 10^-4        10^-4           -0.478 × 10^-6
    0.5 × 10^-2     -0.178 × 10^-5        0.5 × 10^-4     -0.711 × 10^-6
    10^-3           -0.229 × 10^-7        10^-5           -0.227 × 10^-5
    0.5 × 10^-3     -0.192 × 10^-7
The appearance of the term ε/|h| in (7.2.4.6) becomes plausible if one considers that the number of steps to get from x_0 to x, using steplength h, is just (x - x_0)/h and that all essentially independent rounding errors were assumed equal to ε. Nevertheless, the estimate is much too coarse to be practically significant.
7.2.5 Practical Implementation of One-Step Methods
In practice, initial-value problems present themselves mostly in the following form: What is desired is the value which the exact solution y(x) assumes for a certain x ≠ x_0. It is tempting to compute this solution approximately by means of a one-step method in a single step, i.e., choosing stepsize h̄ = x - x_0. For large x - x_0 this of course leads to a large discretization error e(x; h̄); the choice made for h̄ would be entirely inadequate. Normally, therefore, one will introduce suitable intermediate points x_i, i = 1, ..., k - 1, x_0 < x_1 < ... < x_k = x, and, beginning with x_0, y_0 = y(x_0), compute successive approximate values η_i of y(x_i): Having determined an approximation η_i of y(x_i), one computes η_{i+1} by applying a step of the method with stepsize h_i := x_{i+1} - x_i,
    η_{i+1} := η_i + h_i Φ(x_i, η_i; h_i).
There again arises, however, the problem of how the stepsizes h_i are to be chosen. Since the amount of work involved in the method is proportional to the number of individual steps, one will attempt to choose the stepsizes h_i as large as possible. On the other hand, they must not be chosen too large if one wants to keep the discretization error small. In principle, one has the following problem: For given x_0, y_0, determine a stepsize h as large as possible, but such that the discretization error e(x_0 + h; h) after one step with this stepsize still remains below a certain tolerance ε. This tolerance ε should not be selected smaller than K·eps, i.e., ε ≥ K·eps, where eps is the relative machine precision and K a bound for the solution y(x) in the region under consideration,
    K ≈ max { |y(x)| : x ∈ [x_0, x_0 + h] }.
A choice of h corresponding to ε = K·eps then guarantees that the approximate solution η(x_0 + h; h) obtained agrees with the exact solution y(x_0 + h) to machine precision. Such a stepsize h = h(ε), with
    |e(x_0 + h; h)| ≤ ε,    ε ≥ K·eps,
can be found approximately with the methods of Section 7.2.3: For a
method of order p one has in first approximation
(7.2.5.1)    e(x_0 + h; h) ≈ e_p(x_0 + h) h^p.
Now, by (7.2.3.3), e_p(x_0) = 0; thus in first approximation,
(7.2.5.2)    e(x_0 + h; h) ≈ e_p'(x_0) h^{p+1}.
Therefore |e(x_0 + h; h)| ≤ ε will hold if
(7.2.5.3)    |e_p'(x_0)| |h|^{p+1} ≤ ε.
If we know e_p'(x_0), we can compute from this the appropriate value of h. An approximate value for e_p'(x_0), however, can be obtained from (7.2.3.6). Using first the stepsize H to compute η(x_0 + H; H) and η(x_0 + H; H/2), one then has by (7.2.3.6)
(7.2.5.4)    e(x_0 + H; H/2) ≈ (η(x_0 + H; H) - η(x_0 + H; H/2)) / (2^p - 1).
By (7.2.5.1), (7.2.5.2), on the other hand,
    e(x_0 + H; H/2) ≈ e_p(x_0 + H) (H/2)^p ≈ e_p'(x_0) H (H/2)^p.
From (7.2.5.4) thus follows the estimate
    e_p'(x_0) ≈ (2^p / (H^{p+1} (2^p - 1))) [η(x_0 + H; H) - η(x_0 + H; H/2)].
Equation (7.2.5.3) therefore yields for h the formula
(7.2.5.5)    h ≈ H ( (2^p - 1) ε / (2^p |η(x_0 + H; H) - η(x_0 + H; H/2)|) )^{1/(p+1)},
which can be used in the following way: Choose a stepsize H; compute η(x_0 + H; H), η(x_0 + H; H/2), and h from (7.2.5.5). If H/h ≫ 2, then by (7.2.5.4) the error e(x_0 + H; H/2) is much larger than the prescribed ε. It is expedient, therefore, to reduce H. Replace H by 2h; with the new H compute again η(x_0 + H; H), η(x_0 + H; H/2), and from (7.2.5.5) the corresponding h, until finally |H/h| ≤ 2. Once this is the case, one accepts η(x_0 + H; H/2) as an approximation for y(x_0 + H) and proceeds to the next integration step, replacing x_0, y_0, and H by the new starting values x_0 + H, η(x_0 + H; H/2), and 2h [see Fig. 15 for an illustration].
EXAMPLE. We consider the initial-value problem
    y' = -200 x y²,    y(-3) = 1/901,
with the exact solution y(x) = 1/(1 + 100x²). An approximate value η for y(0) = 1 was computed using the Runge-Kutta method and the "step control" procedure just discussed. The criterion H/h ≫ 2 in Fig. 15 was replaced by the test H/h ≥ 3. Computation in 12 digits yields:

    η - y(0)              Number of required        Smallest stepsize H
                          Runge-Kutta steps         observed
    -0.13585 × 10^-6      1476                      0.1226... × 10^-2

For fixed stepsize h the Runge-Kutta method yields:

    h                               η(0; h) - y(0)       Number of Runge-Kutta steps
    3/1476 = 0.2032... × 10^-2      -0.5594 × 10^-6      1476
    0.1226... × 10^-2               -0.4052 × 10^-6      3/(0.1226... × 10^-2) ≈ 2446

A fixed choice of stepsize thus yields worse results with the same, or even greater, computational effort. In the first case, the stepsize h = 0.2032... × 10^-2 is probably too large in the "critical" region near 0; the discretization error becomes too large. The stepsize h = 0.1226... × 10^-2, on the other hand, will be too small in the "harmless" region from -3 to close to 0; one takes unnecessarily many steps and thereby commits more rounding errors.
Our discussion shows that in order to be able to make statements about the size of an approximately optimal steplength h, there must be available two approximate values η(x_0 + H; H) and η(x_0 + H; H/2) - obtained from the same discretization method - for the exact solution y(x_0 + H).

Fig. 15. A stepsize control method: starting from x_0, y_0 and a trial stepsize H, compute η(x_0 + H; H) and η(x_0 + H; H/2) and determine h from (7.2.5.5); if H/h ≫ 2, replace H by 2h and repeat, otherwise accept x_0 := x_0 + H, y_0 := η(x_0 + H; H/2), H := 2h.
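The loop of Fig. 15 can be sketched as follows. This is only an illustration under the notation of this section: integrate(f, x0, y0, H, m) stands for any fixed-step one-step method of order p carrying out m steps of size H/m, and the names and the acceptance test H/h ≥ 3 (as in the example above) are our own choices.

    # Stepsize control in the sense of Fig. 15 / (7.2.5.5) for a method of order p.
    def controlled_step(integrate, f, x0, y0, H, eps, p):
        while True:
            eta_H  = integrate(f, x0, y0, H, 1)      # eta(x0 + H; H)
            eta_H2 = integrate(f, x0, y0, H, 2)      # eta(x0 + H; H/2)
            diff = abs(eta_H - eta_H2)
            if diff == 0.0:
                return x0 + H, eta_H2, 2.0 * H       # no visible error: enlarge the step
            h = H * ((2**p - 1) * eps / (2**p * diff)) ** (1.0 / (p + 1))   # (7.2.5.5)
            if H / h >= 3.0:                         # H/h >> 2: reduce H and try again
                H = 2.0 * h
            else:
                return x0 + H, eta_H2, 2.0 * h       # accept eta(x0 + H; H/2), next H := 2h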
Efficient methods for controlling the stepsize, however, can be obtained in still another manner. Instead of comparing two approximations (with different H) from the same discretization method, one takes, following an idea of Fehlberg's, two approximations (with the same H) which originate from a pair Φ_I, Φ_II of different discretization methods of Runge-Kutta type of orders p and p + 1. This gives rise to so-called Runge-Kutta-Fehlberg methods, which have the form
(7.2.5.6)    ŷ_{i+1} = ŷ_i + h Φ_I(x_i, ŷ_i; h),
             y_{i+1} = ŷ_i + h Φ_II(x_i, ŷ_i; h),
where [cf. (7.2.1.15)]
(7.2.5.7)    Φ_I(x, y; h) = Σ_{k=0}^{p} c_k f_k(x, y; h),
             Φ_II(x, y; h) = Σ_{k=0}^{p+1} ĉ_k f_k(x, y; h)
and
(7.2.5.8)    f_k := f_k(x, y; h) = f(x + α_k h, y + h Σ_{l=0}^{k-1} β_{kl} f_l),    k = 0, 1, ..., p + 1.
Note that in the computation of Φ_II the function values f_0, f_1, ..., f_p used in Φ_I are again utilized, so that only one additional evaluation, f_{p+1}, is required. The methods Φ_I, Φ_II are therefore called embedded methods. The constants α_k, β_{kl}, c_k and ĉ_k are determined so that the methods have orders p and p + 1,
(7.2.5.9a)    Δ(x, y(x); h) - Φ_I(x, y(x); h) = O(h^p),
              Δ(x, y(x); h) - Φ_II(x, y(x); h) = O(h^{p+1}),
and the additional relations
(7.2.5.9b)    α_k = Σ_{j=0}^{k-1} β_{kj},    k = 0, 1, ..., p + 1,
are satisfied.
In principle, suitable coefficients can be determined as in (7.2.1.11). This leads, in general, to a complicated system of nonlinear equations. For this reason we discuss only the special case p = 2, which leads to embedded methods of orders 2 and 3. For p = 2, conditions (7.2.5.9) yield the equations
    Σ_{k=0}^{2} c_k - 1 = 0,    Σ_{k=1}^{2} α_k c_k - 1/2 = 0,
    Σ_{k=0}^{3} ĉ_k - 1 = 0,    Σ_{k=1}^{3} α_k ĉ_k - 1/2 = 0,    Σ_{k=1}^{3} α_k² ĉ_k - 1/3 = 0,
    Σ_{k=2}^{3} ( Σ_{l=1}^{k-1} β_{kl} α_l ) ĉ_k - 1/6 = 0.
This system of equations admits infinitely many solutions. One can therefore impose additional requirements to enhance economy: for example, make the value f_3 from the ith step reusable as f_0 in the (i + 1)st step. Since
    f_3 = f(x + α_3 h, y + h(β_30 f_0 + β_31 f_1 + β_32 f_2))
and
    f_0^{new} = f(x + h, y + h Φ_I(x, y; h)) = f(x + h, y + h(c_0 f_0 + c_1 f_1 + c_2 f_2)),
this implies
    α_3 = 1,    β_30 = c_0,    β_31 = c_1,    β_32 = c_2.
Further requirements can be made concerning the magnitudes of the coefficients in the error terms Δ - Φ_I, Δ - Φ_II. However, we will not pursue this any further and refer, instead, to the papers of Fehlberg (1964, 1966, 1969).
For the set of coefficients, one then finds

    k = 0:  α_0 = 0
    k = 1:  α_1 = 1/4;    β_10 = 1/4
    k = 2:  α_2 = 27/40;  β_20 = -189/800,  β_21 = 729/800
    k = 3:  α_3 = 1;      β_30 = 214/891,   β_31 = 1/33,   β_32 = 650/891

    c_k  (order 2):  c_0 = 214/891,   c_1 = 1/33,  c_2 = 650/891
    ĉ_k  (order 3):  ĉ_0 = 533/2106,  ĉ_1 = 0,     ĉ_2 = 800/1053,  ĉ_3 = -1/78
In applications, however, methods of higher order are more interesting. Dormand and Prince (1980) found the following coefficients for a pair of embedded methods of orders p = 4 and 5 (DOPRI 5(4)):

    k = 0:  α_0 = 0
    k = 1:  α_1 = 1/5;    β_10 = 1/5
    k = 2:  α_2 = 3/10;   β_20 = 3/40,         β_21 = 9/40
    k = 3:  α_3 = 4/5;    β_30 = 44/45,        β_31 = -56/15,      β_32 = 32/9
    k = 4:  α_4 = 8/9;    β_40 = 19372/6561,   β_41 = -25360/2187, β_42 = 64448/6561,  β_43 = -212/729
    k = 5:  α_5 = 1;      β_50 = 9017/3168,    β_51 = -355/33,     β_52 = 46732/5247,  β_53 = 49/176,      β_54 = -5103/18656
    k = 6:  α_6 = 1;      β_60 = 35/384,       β_61 = 0,           β_62 = 500/1113,    β_63 = 125/192,     β_64 = -2187/6784,  β_65 = 11/84

    c_k  (order 4):  5179/57600,  0,  7571/16695,  393/640,  -92097/339200,  187/2100,  1/40
    ĉ_k  (order 5):  35/384,      0,  500/1113,    125/192,  -2187/6784,     11/84,     0

The coefficients are such that the error term of the higher order method Φ_II is minimal.
Step control is now accomplished as follows: Consider the difference y_{i+1} - ŷ_{i+1}. From (7.2.5.6) it follows that
(7.2.5.10)    ŷ_{i+1} - y_{i+1} = h [Φ_I(x_i, ŷ_i; h) - Φ_II(x_i, ŷ_i; h)],
and from (7.2.5.9a), up to higher order terms,
(7.2.5.11)    Φ_I(x_i, ŷ_i; h) - Δ(x_i, ŷ_i; h) ≈ h^p C_I(x_i),
              Φ_II(x_i, ŷ_i; h) - Δ(x_i, ŷ_i; h) ≈ h^{p+1} C_II(x_i).
Hence for small |h|
(7.2.5.12)    ŷ_{i+1} - y_{i+1} ≈ h^{p+1} C_I(x_i).
Suppose the integration from x_i to x_{i+1} was successful, i.e., for given error tolerance ε > 0, it achieved
    |C_I(x_i) h^{p+1}| ≤ ε.
Neglecting the terms of higher order, one then has also
    |ŷ_{i+1} - y_{i+1}| ≤ ε.
If we wish the "new" step size h_new = x_{i+2} - x_{i+1} to be successful, we must have
    |C_I(x_{i+1}) h_new^{p+1}| ≤ ε.
Now, up to errors of the first order, we have
    C_I(x_{i+1}) ≈ C_I(x_i).
For C_I(x_i), however, one has from (7.2.5.12) the approximation
    C_I(x_i) ≈ (ŷ_{i+1} - y_{i+1}) / h^{p+1}.
This yields the approximate relation
    |ŷ_{i+1} - y_{i+1}| |h_new/h|^{p+1} ≤ ε,
which can be used to estimate the new stepsize,
(7.2.5.13)    h_new ≈ h |ε / (ŷ_{i+1} - y_{i+1})|^{1/(p+1)}.
Step-size control according to this formula is relatively cheap: The computation of y_{i+1} and ŷ_{i+1} in (7.2.5.13) requires only p + 1 evaluations of the right-hand side f(x, y) of the differential equation. Compare this with the analogous relation (7.2.5.5): Since the computation of η(x_0 + H; H/2) necessitates further evaluations of f (three times as many, altogether), step-size control according to (7.2.5.13) is more efficient than according to (7.2.5.5).
We remark that many authors, based on extensive numerical experimentation, recommend a modification of (7.2.5.13), viz.,
(7.2.5.14)    h_new ≈ a h |ε h / (ŷ_{i+1} - y_{i+1})|^{1/p},
where a is a suitable adjustment factor: a ≈ 0.9.
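As an illustration of how an embedded pair is used in practice, the following schematic step applies the estimate (7.2.5.13) with the safety factor of the remark above. Here phi_low and phi_high stand for the two increment functions Φ_I, Φ_II of orders p and p + 1; all names, the acceptance test, and the choice to propagate the higher-order value are our own assumptions, not prescriptions from the text.

    # One adaptive step with an embedded pair of orders p and p+1, cf. (7.2.5.13)/(7.2.5.14).
    def embedded_step(phi_low, phi_high, f, x, y, h, eps, p, safety=0.9):
        y_low  = y + h * phi_low(f, x, y, h)     # order p   (y_hat in the text)
        y_high = y + h * phi_high(f, x, y, h)    # order p+1
        err = abs(y_low - y_high)
        if err == 0.0:
            return x + h, y_high, 2.0 * h        # error invisible: enlarge the step
        h_new = safety * h * (eps / err) ** (1.0 / (p + 1))   # cf. (7.2.5.13)
        if err <= eps:
            return x + h, y_high, h_new          # step accepted
        return x, y, h_new                       # step rejected: retry with h_new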
In the context of the graphical representation and the assessment of the solution of an initial value problem the following difficulty arises: A Runge-Kutta method generates approximations y_i for the solution only at discrete points x_i, i = 0, 1, ..., that are determined by the stepsize control mechanism. These data are usually too coarse for a good graphical representation or for detecting irregularities in the solution. On the other hand, an artificial restriction of the stepsize |h| ≤ h_max within the stepsize control decreases the efficiency. A solution to this dilemma is to construct from the data x_i, y_i, i ≥ 0, a continuous approximation for the solution at all intermediate points x(θ) := x_i + θh, 0 ≤ θ ≤ 1, between x_i and x_{i+1}. Here, of course, h := x_{i+1} - x_i. This leads to continuous Runge-Kutta methods.
For the DOPRI 5(4)-method introduced above this is achieved by the representation (here x and y stand for x_i and y_i, respectively):
    y(x + θh) := y + h Σ_{k=0}^{5} c_k(θ) f_k,
where the f_k are determined as in DOPRI 5(4). The following weights c_k(θ), depending on θ, guarantee a method of order four:
    c_0(θ) := θ(1 + θ(-1337/480 + θ(1039/360 + θ(-1163/1152)))),
    c_1(θ) := 0,
    c_2(θ) := 100 θ²(1054/9275 + θ(-4682/27825 + θ(379/5565)))/3,
    c_3(θ) := -5 θ²(27/40 + θ(-9/5 + θ(83/96)))/2,
    c_4(θ) := 18225 θ²(-3/250 + θ(22/375 + θ(-37/600)))/848,
    c_5(θ) := -22 θ²(-3/10 + θ(29/30 + θ(-17/24)))/7.
For θ = 1 the weights c_k(θ) agree with the coefficients of the fifth-order formula of DOPRI 5(4).
7.2.6 Multistep Methods: Examples
In a multistep method for the solution of the initial-value problem
    y' = f(x, y),    y(x_0) = y_0,
one computes an approximate value η_{j+r} of y(x_{j+r}) from r given approximate values η_k of y(x_k), k = j, j + 1, ..., j + r - 1, at equidistant points x_k = x_0 + kh:
(7.2.6.1)    η_j, η_{j+1}, ..., η_{j+r-1} → η_{j+r}    for j = 0, 1, 2, ....
To initiate such methods, it is of course necessary that r starting values η_0, η_1, ..., η_{r-1} be at our disposal; these must be obtained by other means, e.g., with the aid of a one-step method.
We begin by introducing some examples of multistep methods. A class of such methods can be derived from the formula
(7.2.6.2)    y(x_{p+k}) - y(x_{p-j}) = ∫_{x_{p-j}}^{x_{p+k}} f(t, y(t)) dt,
which is obtained by integrating the identity y'(x) = f(x, y(x)). As in the derivation of the Newton-Cotes formulas [see Section 3.1], one now replaces the integrand in (7.2.6.2) by an interpolating polynomial P_q(x) with
(1)  deg P_q(x) ≤ q,
(2)  P_q(x_k) = f(x_k, y(x_k)),    k = p, p - 1, ..., p - q.
We assume here that the x_i are equidistant and h := x_{i+1} - x_i. With the abbreviation y_i := y(x_i), and using the Lagrange interpolation formula (2.1.1.4),
    P_q(x) = Σ_{i=0}^{q} f(x_{p-i}, y_{p-i}) L_i(x),    L_i(x) := Π_{l=0, l≠i}^{q} (x - x_{p-l}) / (x_{p-i} - x_{p-l}),
one obtains the approximate formula
(7.2.6.3)    y_{p+k} - y_{p-j} ≈ h Σ_{i=0}^{q} β_{qi} f(x_{p-i}, y_{p-i})
with
(7.2.6.4)    β_{qi} := (1/h) ∫_{x_{p-j}}^{x_{p+k}} L_i(x) dx = ∫_{-j}^{k} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Replacing in (7.2.6.3) the y_i by approximate values η_i and ≈ by the equality sign, one obtains the formula
    η_{p+k} = η_{p-j} + h Σ_{i=0}^{q} β_{qi} f_{p-i},    f_{p-i} := f(x_{p-i}, η_{p-i}).
For different choices of k, j, and q, one obtains different multistep methods.
For k = 1, j = 0, and q = 0, 1, 2, ..., one obtains the Adams-Bashforth methods:
(7.2.6.5)    η_{p+1} = η_p + h [β_{q0} f_p + β_{q1} f_{p-1} + ... + β_{qq} f_{p-q}]
with
    β_{qi} := ∫_0^1 Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Comparison with (7.2.6.1) shows that here r = q + 1. A few numerical values:

    β_{qi}        i = 0    1       2      3      4
    β_{0i}        1
    2 β_{1i}      3       -1
    12 β_{2i}     23      -16     5
    24 β_{3i}     55      -59     37     -9
    720 β_{4i}    1901    -2774   2616   -1274  251
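For illustration, the row 24 β_{3i} of this table gives the fourth-order Adams-Bashforth formula η_{p+1} = η_p + (h/24)[55 f_p - 59 f_{p-1} + 37 f_{p-2} - 9 f_{p-3}]; a small Python sketch (naming is ours) is:

    # One Adams-Bashforth step of order 4 (q = 3), coefficients 55, -59, 37, -9.
    def adams_bashforth4_step(h, eta_p, f_values):
        # f_values = [f_p, f_{p-1}, f_{p-2}, f_{p-3}]
        fp, fp1, fp2, fp3 = f_values
        return eta_p + h / 24.0 * (55.0 * fp - 59.0 * fp1 + 37.0 * fp2 - 9.0 * fp3)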
For k = 0, j = 1 and q = 0, 1, 2, ..., one obtains the Adams-Moulton formulas:
    η_p = η_{p-1} + h [β_{q0} f_p + β_{q1} f_{p-1} + ... + β_{qq} f_{p-q}].
Replacing p by p + 1 gives
(7.2.6.6)    η_{p+1} = η_p + h [β_{q0} f(x_{p+1}, η_{p+1}) + β_{q1} f_p + ... + β_{qq} f_{p+1-q}]
with
    β_{qi} := ∫_{-1}^{0} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
At first sight it appears that (7.2.6.6) no longer has the form (7.2.6.1), since η_{p+1} occurs on both the left-hand and the right-hand side of the equation in (7.2.6.6). Thus for given η_p, η_{p-1}, ..., η_{p+1-q}, equation (7.2.6.6) represents, in general, a nonlinear equation for η_{p+1}, and the Adams-Moulton method is an implicit method. The following iterative method for determining η_{p+1} suggests itself naturally:
(7.2.6.7)    η_{p+1}^{(i+1)} = η_p + h [β_{q0} f(x_{p+1}, η_{p+1}^{(i)}) + β_{q1} f_p + ... + β_{qq} f_{p+1-q}],    i = 0, 1, 2, ....
This iteration has the form η_{p+1}^{(i+1)} = Ψ(η_{p+1}^{(i)}), and with the methods of Section 5.2 it can easily be shown that for sufficiently small |h| the mapping z → Ψ(z) is contractive [see Exercise 10] and thus has a fixed point η_{p+1} = Ψ(η_{p+1}), which solves (7.2.6.6). This solution of course depends on x_p, η_p, η_{p-1}, ..., η_{p+1-q} and h, and the Adams-Moulton method therefore is indeed a multistep method of the type (7.2.6.1). Here r = q.
For given η_p, η_{p-1}, ..., η_{p+1-q} a good initial value η_{p+1}^{(0)} for the iteration (7.2.6.7) can be found, e.g., with the aid of the Adams-Bashforth method (7.2.6.5). For this reason, one also calls explicit methods like the Adams-Bashforth method predictor methods, and implicit methods like the Adams-Moulton method corrector methods [through the iteration (7.2.6.7) one "corrects" η_{p+1}^{(i)}].
A few numerical values for the coefficients occurring in (7.2.6.6):

    β_{qi}        i = 0    1      2      3     4
    β_{0i}        1
    2 β_{1i}      1        1
    12 β_{2i}     5        8      -1
    24 β_{3i}     9        19     -5     1
    720 β_{4i}    251      646    -264   106   -19
In the method of Nystrom one chooses k = 1, j = 1 in (7.2.6.2) and thus obtains
(7.2.6.8)    η_{p+1} = η_{p-1} + h [β_{q0} f_p + β_{q1} f_{p-1} + ... + β_{qq} f_{p-q}]
with
    β_{qi} := ∫_{-1}^{1} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
This is a predictor method, which obviously has the form (7.2.6.1) with r = q + 1.
Worthy of note is the special case q = 0. Here β_{00} = ∫_{-1}^{1} 1 ds = 2, and (7.2.6.8) becomes
(7.2.6.9)    η_{p+1} = η_{p-1} + 2h f_p.
This is the so-called midpoint rule, which corresponds in the approximation of an integral to "rectangular sums".
The methods of Milne are corrector methods. They are obtained from (7.2.6.2) by taking k = 0, j = 2, and by replacing p with p + 1:
(7.2.6.10)    η_{p+1} = η_{p-1} + h [β_{q0} f(x_{p+1}, η_{p+1}) + β_{q1} f_p + ... + β_{qq} f_{p+1-q}]
with
    β_{qi} := ∫_{-2}^{0} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Analogously to (7.2.6.7), one solves (7.2.6.10) also by iteration.
7.2.7 General Multistep Methods
All multistep methods discussed in Section 7.2.6, as well as the one-step methods of Section 7.2.1, can be written in the following form:
(7.2.7.1)    η_{j+r} + a_{r-1} η_{j+r-1} + ... + a_0 η_j = h F(x_j; η_{j+r}, η_{j+r-1}, ..., η_j; h; f).
Generally, a multistep method given by (7.2.7.1) is called an r-step method. In the examples considered in Section 7.2.6 the function F, in addition, depends linearly on f as follows:
    F(x_j; η_{j+r}, η_{j+r-1}, ..., η_j; h; f) ≡ b_r f(x_{j+r}, η_{j+r}) + ... + b_0 f(x_j, η_j).
The b_i, i = 0, ..., r, are certain constants. One then speaks of a linear r-step method; such methods will be further discussed in Section 7.2.11.
For the Adams-Bashforth method (7.2.6.5), for example, r = q + 1, and
    b_{q-i} = β_{qi} = ∫_0^1 Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Any r initial values η_0, ..., η_{r-1} define, through (7.2.7.1), a unique sequence η_j, j ≥ 0. For the initial values η_j one chooses, as well as possible, certain approximations to the exact solution y_i = y(x_i) of (7.2.1.1) at x_i = x_0 + ih, i = 0, 1, ..., r - 1. These can be obtained, e.g., by means of appropriate one-step methods. We denote the errors in the starting values by
    ε_i := η_i - y(x_i),    i = 0, 1, ..., r - 1.
Further errors, e.g., rounding errors in the evaluation of F, occur during the execution of (7.2.7.1). We want to include the influence of these errors in our study also, and therefore consider, more generally than (7.2.7.1), the following recurrence formulas:
(7.2.7.2)    η_0 := y_0 + ε_0,
                 ...
             η_{r-1} := y_{r-1} + ε_{r-1};
    for j = 0, 1, 2, ...:
             η_{j+r} + a_{r-1} η_{j+r-1} + ... + a_0 η_j := h F(x_j; η_{j+r}, η_{j+r-1}, ..., η_j; h; f) + h ε_{j+r}.
The solution η_i of (7.2.7.2) depends on h and the ε_j, and represents a function
    η(x; ε; h),
which, like the error function ε = ε(x; h), is defined only for x ∈ R_h = {x_0 + ih | i = 0, 1, ...}, or h ∈ H_x = {(x - x_0)/n | n = 1, 2, ...}, through
    η(x_i; ε; h) := η_i,    ε(x_i; h) := ε_i,    x_i := x_0 + ih.
As in one-step methods, one can define the local discretization error τ(x, y; h) of a multistep method (7.2.7.1) at the point x, y. Indeed, let f ∈ F_1(a, b), x ∈ [a, b], y ∈ ℝ, and let z(t) be the solution of the initial-value problem
    z'(t) = f(t, z(t)),    z(x) = y.
Then the local discretization error τ(x, y; h) is defined to be the quantity
(7.2.7.3)    τ(x, y; h) := (1/h) [z(x + rh) + Σ_{i=0}^{r-1} a_i z(x + ih) - h F(x; z(x + rh), z(x + (r - 1)h), ..., z(x); h; f)].
The local discretization error τ(x, y; h) thus indicates how well the exact solution of a differential equation satisfies the recurrence formula (7.2.7.1). From a reasonable method one expects that this error will become small for small |h|. Generalizing (7.2.1.7), one thus defines the consistency of multistep methods by:
(7.2.7.4) Definition. The multistep method is called consistent if for each f ∈ F_1(a, b) there exists a function σ(h) with lim_{h→0} σ(h) = 0 such that
(7.2.7.5)    |τ(x, y; h)| ≤ σ(h)    for all x ∈ [a, b], y ∈ ℝ.
One speaks of a method of order p if, for f ∈ F_p(a, b),
    τ(x, y; h) = O(h^p).
EXAMPLE. For the midpoint rule (7.2.6.9) one has, in view of z'(t) = f(t, z(t)), z(x) = y,
    τ(x, y; h) := (1/h) [z(x + 2h) - z(x) - 2h f(x + h, z(x + h))]
                = (1/h) [z(x + 2h) - z(x) - 2h z'(x + h)].
Through Taylor expansion in h one finds
    τ(x, y; h) = (1/h) [z(x) + 2h z'(x) + 2h² z''(x) + (4h³/3) z'''(x)
                        - z(x) - 2h (z'(x) + h z''(x) + (h²/2) z'''(x))] + O(h³)
               = (h²/3) z'''(x) + O(h³).
The method is thus consistent and of second order.
The order of the methods in Section 7.2.6 can be determined more
conveniently by means of the error estimates for interpolating polynomials
[e.g., (2.1.4.1)] and Newton-Cotes formulas (3.1.1).
For f(x, y) := 0, and z(x) := y, a consistent method gives
    τ(x, y; h) = (1/h) [y (1 + a_{r-1} + ... + a_0) - h F(x; y, y, ..., y; h; 0)],
    |τ(x, y; h)| ≤ σ(h),    lim_{h→0} σ(h) = 0.
For continuous F(x; y, y, ..., y; ·; 0), since y is arbitrary, it follows that
(7.2.7.6)    1 + a_{r-1} + ... + a_0 = 0.
We will often, in the following, impose on F the condition
(7.2.7.7)    F(x; u_r, u_{r-1}, ..., u_0; h; 0) ≡ 0
for all x ∈ [a, b], all h, and all u_i. For linear multistep methods, (7.2.7.7) is certainly always fulfilled. (7.2.7.7), together with (7.2.7.6), guarantees that the exact solution y(x) ≡ y_0 of the trivial differential equation y' = 0, y(x_0) = y_0, is also an exact solution of (7.2.7.2) if ε_i = 0 for all i.
Since the approximate solution η(x; ε; h) furnished by a multistep method (7.2.7.2) also depends on the errors ε_i, the definition of convergence is more complicated than for one-step methods. One cannot expect, of course, that the global discretization error
    e(x; ε; h) := η(x; ε; h) - y(x)
for fixed x and for h = h_n = (x - x_0)/n, n = 1, 2, ..., will converge toward 0, unless the errors ε(x; h) also become arbitrarily small as h → 0. One therefore defines:
(7.2.7.8) Definition. The multistep method given by (7.2.7.2) is called convergent if
    lim_{n→∞} η(x; ε; h_n) = y(x),    h_n := (x - x_0)/n,    n = 1, 2, ...,
for all x ∈ [a, b], all f ∈ F_1(a, b), and all functions ε(z; h) for which there is a ρ(h) such that
(7.2.7.9)    |ε(z; h)| ≤ ρ(h)    for all z ∈ R_h,    lim_{h→0} ρ(h) = 0.
7.2.8 An Example of Divergence
The results of 7.2.2, in particular Theorem (7.2.2.3), may suggest that multistep methods, too, converge faster with increasing order p of the local discretization error [see (7.2.7.4)]. The point of the following example is to show that this conjecture is false. At the same time the example furnishes a method for constructing multistep methods of maximum order.
Suppose we construct a linear multistep method of the type (7.2.7.1) with r = 2, having the form
    η_{j+2} + a_1 η_{j+1} + a_0 η_j = h [b_1 f(x_{j+1}, η_{j+1}) + b_0 f(x_j, η_j)].
The constants a_0, a_1, b_0, b_1 are to be determined so as to yield a method of maximum order. If z'(t) = f(t, z(t)), then for the local discretization error τ(x, y; h) of (7.2.7.3) we have
    h τ(x, y; h) = z(x + 2h) + a_1 z(x + h) + a_0 z(x) - h [b_1 z'(x + h) + b_0 z'(x)].
Now expand the right-hand side in a Taylor series in h,
    h τ(x, y; h) = z(x) [1 + a_1 + a_0] + h z'(x) [2 + a_1 - b_1 - b_0]
                 + h² z''(x) [2 + (1/2)a_1 - b_1] + h³ z'''(x) [4/3 + (1/6)a_1 - (1/2)b_1] + O(h⁴).
The coefficients a_0, a_1, b_0, b_1 are to be determined in such a way as to annihilate as many h-powers as possible, thus producing a method which has the largest possible order. This leads to the equations
    1 + a_1 + a_0 = 0,
    2 + a_1 - b_1 - b_0 = 0,
    2 + (1/2)a_1 - b_1 = 0,
    4/3 + (1/6)a_1 - (1/2)b_1 = 0,
with the solution a_1 = 4, a_0 = -5, b_1 = 4, b_0 = 2, and hence to the method
    η_{j+2} + 4η_{j+1} - 5η_j = h [4 f(x_{j+1}, η_{j+1}) + 2 f(x_j, η_j)]
of order 3 [since h τ(x, y; h) = O(h⁴), i.e., τ(x, y; h) = O(h³)].
If one tries to use this method to solve the initial-value problem
    y' = -y,    y(0) = 1,
with the exact solution y(x) = e^{-x}, computation in 10-digit arithmetic for h = 10^{-2}, even taking as starting values the exact (within machine precision) values η_0 := 1, η_1 := e^{-h}, produces the results given in the following table.

    j       η_j - y(x_j)        [cf. (7.2.8.3)]
    2       -0.164 × 10^-8      -0.753 × 10^-9
    3       +0.501 × 10^-8      +0.378 × 10^-8
    4       -0.300 × 10^-7      -0.190 × 10^-7
    5       +0.144 × 10^-6      +0.958 × 10^-7
    96      -0.101 × 10^58      -0.668 × 10^57
    97      +0.512 × 10^58      +0.336 × 10^58
    98      -0.257 × 10^59      -0.169 × 10^59
    99      +0.129 × 10^60      +0.850 × 10^59
    100     -0.652 × 10^60      -0.427 × 10^60
How does one account for this wildly oscillating behavior of the η_j? If we assume that the exact values η_0 := 1, η_1 := e^{-h} are used as starting values and no rounding errors are committed during the execution of the method (ε_j = 0 for all j), we obtain a sequence of numbers η_j with
    η_0 = 1,    η_1 = e^{-h},
    η_{j+2} + 4η_{j+1} - 5η_j = h [-4η_{j+1} - 2η_j]    for j = 0, 1, ...,
or
(7.2.8.1)    η_{j+2} + 4(1 + h)η_{j+1} + (-5 + 2h)η_j = 0    for j = 0, 1, ....
Such difference equations have special solutions of the form η_j = λ^j. Upon substituting this expression in (7.2.8.1), one finds for λ the equation
    λ² + 4(1 + h)λ + (-5 + 2h) = 0,
which, aside from the trivial solution λ = 0, has the solutions
    λ_1 = -2 - 2h + 3 √(1 + (2/3)h + (4/9)h²),
    λ_2 = -2 - 2h - 3 √(1 + (2/3)h + (4/9)h²).
For small h one has
(7.2.8.2)    λ_1 = 1 - h + (1/2)h² - (1/6)h³ + (1/72)h⁴ + O(h⁵),
             λ_2 = -5 - 3h + O(h²).
One can now show that every solution η_j of (7.2.8.1) can be written as a linear combination
    η_j = α λ_1^j + β λ_2^j
of the two particular solutions λ_1^j, λ_2^j just found [see (7.2.9.9)]. The constants α and β, in our case, are determined by the initial conditions η_0 = 1, η_1 = e^{-h}, which lead to the following system of equations for α, β:
    η_0 = α + β = 1,
    η_1 = α λ_1 + β λ_2 = e^{-h}.
The solution is given by
    α = (e^{-h} - λ_2)/(λ_1 - λ_2),    β = (λ_1 - e^{-h})/(λ_1 - λ_2).
From (7.2.8.2) one easily verifies
    α = 1 + O(h²),    β = -(1/216) h⁴ + O(h⁵).
For fixed x ≠ 0, h = h_n = x/n, n = 1, 2, ..., one therefore obtains for the approximate solution η_n = η(x; h_n)
    η_n = α λ_1^n + β λ_2^n
        = [1 + O((x/n)²)] [1 - x/n + O((x/n)²)]^n - (1/216)(x⁴/n⁴) [1 + O(x/n)] [-5 - 3x/n + O((x/n)²)]^n.
The first term tends to y(x) = e^{-x} as n → ∞; the second term for n → ∞ behaves like
(7.2.8.3)    -(x⁴/216) ((-5)^n/n⁴) e^{3x/5}.
Since lim_{n→∞} 5^n/n⁴ = ∞, this term oscillates more and more violently as n → ∞. This explains the oscillatory behavior and divergence of the method. As is easily seen, the reason for this behavior lies in the fact that -5 is a root of the quadratic equation μ² + 4μ - 5 = 0. It is to be expected that also in the general case (7.2.7.2) the zeros of the polynomial ψ(μ) = μ^r + a_{r-1}μ^{r-1} + ... + a_0 play a significant role for the convergence of the method.
7.2.9 Linear Difference Equations
In the following section we need a few simple results on linear difference equations. By a linear homogeneous difference equation of order r one means an equation of the form
(7.2.9.1)    u_{j+r} + a_{r-1} u_{j+r-1} + ... + a_0 u_j = 0,    j = 0, 1, 2, ....
For every set of starting values u_0, u_1, ..., u_{r-1} one can evidently determine exactly one sequence of numbers u_n, n = 0, 1, ..., which solves (7.2.9.1). In applications to multistep methods a point of interest is the growth behavior of the u_n as n → ∞, in dependence on the starting values u_0, u_1, ..., u_{r-1}. In particular, one would like to know conditions guaranteeing that
(7.2.9.2a)    lim_{n→∞} u_n/n = 0
for all real starting values u_0, u_1, ..., u_{r-1}.
Since the solutions u_n = u_n(U_0), U_0 := [u_0, u_1, ..., u_{r-1}]^T, obviously depend linearly on the starting vector, the restriction to real starting vectors U_0 ∈ ℝ^r is unnecessary, and (7.2.9.2a) is equivalent to
(7.2.9.2b)    lim_{n→∞} u_n/n = 0
for all (complex) starting values u_0, u_1, ..., u_{r-1}.
With the difference equation (7.2.9.1) one associates the polynomial
(7.2.9.3)    ψ(μ) := μ^r + a_{r-1} μ^{r-1} + ... + a_0.
One now says that (7.2.9.1) satisfies the
(7.2.9.4)    stability condition
if for every zero λ of ψ(μ) one has |λ| ≤ 1, and further ψ(λ) = 0 and |λ| = 1 together imply that λ is a simple zero of ψ.
(7.2.9.5) Theorem. The stability condition (7.2.9.4) is necessary and sufficient for (7.2.9.2).
PROOF. (1) Assume (7.2.9.2), and let λ be a zero of ψ in (7.2.9.3). Then the sequence u_j := λ^j, j = 0, 1, ..., is a solution of (7.2.9.1). For |λ| > 1 the sequence u_n/n = λ^n/n diverges, so that from (7.2.9.2) it follows at once that |λ| ≤ 1. Let now λ be a multiple zero of ψ with |λ| = 1. Then
    ψ'(λ) = r λ^{r-1} + (r - 1) a_{r-1} λ^{r-2} + ... + 1·a_1 = 0.
The sequence u_j := j λ^j, j ≥ 0, is thus a solution of (7.2.9.1):
    u_{j+r} + a_{r-1} u_{j+r-1} + ... + a_0 u_j
    = j λ^j (λ^r + a_{r-1} λ^{r-1} + ... + a_0) + λ^{j+1} (r λ^{r-1} + (r - 1) a_{r-1} λ^{r-2} + ... + a_1)
    = 0.
Since u_n/n = λ^n does not converge to zero as n → ∞, λ must be a simple zero.
(2) Conversely, let the stability condition (7.2.9.4) be satisfied. We first use the fact that, with the abbreviations
    U_j := [u_j, u_{j+1}, ..., u_{j+r-1}]^T,
    A := the Frobenius (companion) matrix with rows [0, 1, 0, ..., 0], [0, 0, 1, ..., 0], ..., [0, ..., 0, 1], [-a_0, -a_1, ..., -a_{r-1}],
the difference equation (7.2.9.1) is equivalent to the recurrence formula
(7.2.9.6)    U_{j+1} = A U_j.
A is a Frobenius matrix with the characteristic polynomial ψ(μ) in (7.2.9.3) [Theorem (6.3.4)]. Therefore, if the stability condition (7.2.9.4) is satisfied, one can choose, by Theorem (6.9.2), a norm || · || on ℂ^r such that for the corresponding matrix norm lub(A) ≤ 1. It thus follows, for all U_0 ∈ ℂ^r, that
(7.2.9.7)    ||U_n|| = ||A^n U_0|| ≤ ||U_0||    for n = 0, 1, ....
Since on ℂ^r all norms are equivalent [Theorem (4.4.6)], there exists a k > 0 with (1/k)||U|| ≤ ||U||_∞ ≤ k||U||, and one obtains from (7.2.9.7) in particular
    ||U_n||_∞ ≤ k² ||U_0||_∞,    n = 0, 1, ...,
i.e., one has lim_{n→∞} (1/n)||U_n||_∞ = 0, and hence (7.2.9.2).    □
In the proof of the preceding theorem we exploited the fact that the zeros λ_i of ψ furnish particular solutions of (7.2.9.1) of the form u_j := λ_i^j, j = 0, 1, .... The following theorem shows that one can similarly represent all solutions of (7.2.9.1) in terms of the zeros of ψ:
(7.2.9.8) Theorem. Let the polynomial
    ψ(μ) := μ^r + a_{r-1} μ^{r-1} + ... + a_0
have the k distinct zeros λ_i, i = 1, 2, ..., k, with multiplicities σ_i, i = 1, 2, ..., k, and let a_0 ≠ 0. Then for arbitrary polynomials p_i(t) with deg p_i < σ_i, i = 1, 2, ..., k, the sequence
(7.2.9.9)    u_j := p_1(j) λ_1^j + p_2(j) λ_2^j + ... + p_k(j) λ_k^j,    j = 0, 1, ...,
is a solution of the difference equation (7.2.9.1). Conversely, every solution of (7.2.9.1) can be uniquely represented in the form (7.2.9.9).
PROOF. We only show the first part of the theorem. Since with $\{u_j\}$, $\{v_j\}$ also $\{u_j + v_j\}$ is a solution of (7.2.9.1), it suffices to show that for a $\sigma$-fold zero $\lambda$ of $\psi$ the sequence

$u_j := p(j)\lambda^j, \qquad j = 0, 1, \ldots,$

is a solution of (7.2.9.1) if $p(t)$ is an arbitrary polynomial with $\deg p < \sigma$. For fixed $j \ge 0$, let us represent $p(j+t)$ with the aid of Newton's interpolation formula (2.1.3.1) in the following form:

$p(j+t) = \alpha_0 + \alpha_1 t + \alpha_2 t(t-1) + \cdots + \alpha_r t(t-1)\cdots(t-r+1),$

where $\alpha_\sigma = \alpha_{\sigma+1} = \cdots = \alpha_r = 0$, because $\deg p < \sigma$. With the notation $a_r := 1$, we thus have
$u_{j+r} + a_{r-1}u_{j+r-1} + \cdots + a_0 u_j = \lambda^j \sum_{\rho=0}^{r} a_\rho\lambda^\rho\,p(j+\rho)$
$\qquad = \lambda^j \sum_{\rho=0}^{r} a_\rho\lambda^\rho\Bigl[\alpha_0 + \sum_{\tau=1}^{\sigma-1}\alpha_\tau\,\rho(\rho-1)\cdots(\rho-\tau+1)\Bigr]$
$\qquad = \lambda^j\bigl[\alpha_0\psi(\lambda) + \alpha_1\lambda\psi'(\lambda) + \cdots + \alpha_{\sigma-1}\lambda^{\sigma-1}\psi^{(\sigma-1)}(\lambda)\bigr] = 0,$

since $\lambda$ is a $\sigma$-fold zero of $\psi$ and thus $\psi^{(\tau)}(\lambda) = 0$ for $0 \le \tau \le \sigma-1$. This proves the first part of the theorem.
Now a polynomial $p(t) = c_0 + c_1 t + \cdots + c_{\sigma-1}t^{\sigma-1}$ of degree $< \sigma$ is precisely determined by its $\sigma$ coefficients $c_m$, $m = 0, 1, \ldots, \sigma-1$, so that by $\sigma_1 + \sigma_2 + \cdots + \sigma_k = r$ the representation (7.2.9.9) contains a total of $r$ free parameters, namely the coefficients of the $p_i(t)$. The second part of the theorem therefore asserts that by suitably choosing these $r$ parameters one can obtain every solution of (7.2.9.1), i.e., that for every choice of initial values $u_0, u_1, \ldots, u_{r-1}$ there is a unique solution to the following system of $r$ linear equations for the $r$ unknown coefficients of the $p_i(t)$, $i = 1, \ldots, k$:

$p_1(j)\lambda_1^j + p_2(j)\lambda_2^j + \cdots + p_k(j)\lambda_k^j = u_j, \qquad j = 0, 1, \ldots, r-1.$

The proof of this, although elementary, is tedious. We therefore omit it.  □
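As a brief numerical check (not from the original text) of the representation (7.2.9.9), the sketch below, assuming NumPy, compares the recursion with the closed form for a difference equation with two simple zeros; the starting values 0, 1 are arbitrary.

```python
import numpy as np

# u_{j+2} - u_{j+1} - u_j = 0, i.e. a_1 = a_0 = -1, psi(mu) = mu^2 - mu - 1
a0, a1 = -1.0, -1.0
lam = np.roots([1.0, a1, a0])                 # two simple zeros lambda_1, lambda_2

u = [0.0, 1.0]                                # starting values u_0, u_1
for j in range(20):
    u.append(-a1 * u[-1] - a0 * u[-2])        # recursion (7.2.9.1)

# (7.2.9.9): u_j = p_1 lambda_1^j + p_2 lambda_2^j, p fixed by the starting values
p = np.linalg.solve(np.vander(lam, 2, increasing=True).T, [0.0, 1.0])
closed = [(p[0] * lam[0]**j + p[1] * lam[1]**j).real for j in range(22)]

print(max(abs(u[j] - closed[j]) for j in range(22)))   # ~ rounding level
```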
7.2.10 Convergence of Multistep Methods
We now want to use the results of the previous section in order to study
the convergence behavior of the multistep method (7.2.7.2). It turns out
that for a consistent method [see (7.2.7.4)] the stability condition (7.2.9.4)
is necessary and sufficient for the convergence of the method, provided F
satisfies certain additional regularity conditions [see (7.2.10.3)].
In connection with the stability condition (7.2.9.4), note from (7.2.7.6) that for a consistent method $\psi(\mu) = \mu^r + a_{r-1}\mu^{r-1} + \cdots + a_0$ has $\lambda = 1$ as a zero.
We begin by showing that the stability condition is necessary for convergence:
(7.2.10.1) Theorem. If the multistep method (7.2.7.2) is convergent [Definition (7.2.7.8)] and F obeys the condition (7.2.7.7),
F(x; U r , U r -1, ... , Uo; h; 0) == 0,
then the stability condition (7.2.9.4) holds.
PROOF. Since the method (7.2.7.2) is supposed to be convergent, in the sense of (7.2.7.8), for the integration of the differential equation $y' \equiv 0$, $y(x_0) = 0$, with the exact solution $y(x) \equiv 0$, it must produce an approximate solution $\eta(x; \varepsilon; h)$ satisfying

$\lim_{h\to 0}\eta(x; \varepsilon; h) = 0$

for all $x \in [a,b]$ and all $\varepsilon$ with $|\varepsilon(z;h)| \le \rho(h)$, $\rho(h) \to 0$ for $h \to 0$.
Let $x \ne x_0$, $x \in [a,b]$, be given. For $h = h_n = (x - x_0)/n$, $n = 1, 2, \ldots$, it follows that $x_n = x$ and $\eta(x; \varepsilon; h_n) = \eta_n$, where $\eta_n$, in view of $F(x; u_r, u_{r-1}, \ldots, u_0; h; 0) \equiv 0$, is determined by the recurrence formula

$\eta_i = \varepsilon_i, \qquad i = 0, 1, \ldots, r-1,$
$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h_n\varepsilon_{j+r}, \qquad j = 0, 1, \ldots, n-r,$

with $\varepsilon_j := \varepsilon(x_0 + jh_n; h_n)$. We choose $\varepsilon_{j+r} := 0$, $j = 0, 1, \ldots, n-r$, and $\varepsilon_i := h_n u_i$, $i = 0, 1, \ldots, r-1$, with arbitrary constants $u_0, u_1, \ldots, u_{r-1}$. With

$\rho(h) := |h|\max_{0\le i\le r-1}|u_i|$

one then has

$|\varepsilon_i| \le \rho(h_n), \qquad i = 0, 1, \ldots, n,$

and $\lim_{h\to 0}\rho(h) = 0$.
Now, $\eta_n = h_n u_n$, where $u_n$ is obtained recursively from $u_0, u_1, \ldots, u_{r-1}$ via the difference equation

$u_{j+r} + a_{r-1}u_{j+r-1} + \cdots + a_0 u_j = 0, \qquad j = 0, 1, \ldots, n-r.$

Since, by assumption, the method is convergent, we have

$\lim_{n\to\infty}\eta_n = (x - x_0)\lim_{n\to\infty}\frac{u_n}{n} = 0,$

i.e., (7.2.9.2) follows, the $u_0, u_1, \ldots, u_{r-1}$ having been chosen arbitrarily. The assertion now follows from Theorem (7.2.9.5).  □
We now impose on $F$ the additional condition of being "Lipschitz continuous" in the following sense: For every function $f \in F_1(a,b)$ there exist constants $h_0 > 0$ and $M$ (which may depend on $f$) such that

(7.2.10.2)  $|F(x; u_r, u_{r-1}, \ldots, u_0; h; f) - F(x; v_r, v_{r-1}, \ldots, v_0; h; f)| \le M\sum_{i=0}^{r}|u_i - v_i|$

for all $x \in [a,b]$, $|h| \le h_0$, $u_j, v_j \in \mathbb{R}$ [cf. the analogous condition (7.2.2.4)].
We then show that for consistent methods the stability condition is also
sufficient for convergence:
(7.2.10.3) Theorem. Let the multistep method (7.2.7.2) be consistent in
the sense of (7.2.7.4), and F satisfy the conditions (7.2.7.7) and (7.2.10.2).
Then the method is convergent in the sense of (7.2.7.8) for all f E FI (a, b)
if and only if the stability condition (7.2.9.4) holds.
PROOF. The fact that the stability condition is necessary for convergence
follows from (7.2.10.1). In order to show that it is also sufficient under
the stated assumptions, one proceeds, analogously to the proof of Theorem
(7.2.2.3), in the following way: Let $y(x)$ be the exact solution of $y' = f(x,y)$, $y(x_0) = y_0$, $y_j := y(x_j)$, $x_j = x_0 + jh$, and $\eta_j$ the solution of (7.2.7.2):

$\eta_i = y_i + \varepsilon_i, \qquad i = 0, 1, \ldots, r-1,$
$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = hF(x_j; \eta_{j+r}, \ldots, \eta_j; h; f) + h\varepsilon_{j+r},$

for $j = 0, 1, \ldots$, with $|\varepsilon_j| \le \rho(h)$, $\lim_{h\to 0}\rho(h) = 0$.
For the error $e_j := \eta_j - y_j$ one has

(7.2.10.4)  $e_i = \varepsilon_i, \qquad i = 0, \ldots, r-1,$
  $e_{j+r} + a_{r-1}e_{j+r-1} + \cdots + a_0 e_j = c_{j+r}, \qquad j = 0, 1, \ldots,$

where

$c_{j+r} := h\bigl[F(x_j; \eta_{j+r}, \ldots, \eta_j; h; f) - F(x_j; y_{j+r}, \ldots, y_j; h; f)\bigr] + h(\varepsilon_{j+r} - \tau_{j+r}), \qquad \tau_{j+r} := \tau(x_j, y_j; h).$

By virtue of the consistency (7.2.7.4), one has for a suitable function $\sigma(h)$,

$|\tau_{j+r}| = |\tau(x_j, y_j; h)| \le \sigma(h), \qquad \lim_{h\to 0}\sigma(h) = 0,$

and by (7.2.10.2),

(7.2.10.5)  $|c_{j+r}| \le |h|M\sum_{i=0}^{r}|e_{j+i}| + |h|\bigl[\rho(h) + \sigma(h)\bigr].$
With the help of the vectors

$E_j := \begin{bmatrix} e_j \\ e_{j+1} \\ \vdots \\ e_{j+r-1} \end{bmatrix}, \qquad B := \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix},$

and the matrix

$A := \begin{bmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & 0 & 1 \\ -a_0 & -a_1 & \cdots & -a_{r-1} \end{bmatrix},$

(7.2.10.4) can be written equivalently in the form

(7.2.10.6)  $E_{j+1} = A\,E_j + c_{j+r}B, \qquad E_0 := \begin{bmatrix} \varepsilon_0 \\ \vdots \\ \varepsilon_{r-1} \end{bmatrix}.$
Since by assumption the stability condition (7.2.9.4) is valid, one can choose, by Theorem (6.9.2), a norm $\|\cdot\|$ on $\mathbb{C}^r$ with $\operatorname{lub}(A) \le 1$. Now all norms on $\mathbb{C}^r$ are equivalent [see (4.4.6)], i.e., there is a constant $k > 0$ such that

$\frac{1}{k}\|E_j\| \le \sum_{i=0}^{r-1}|e_{j+i}| \le k\|E_j\|.$

Since

$\sum_{i=0}^{r-1}|e_{j+i}| \le k\|E_j\|, \qquad \sum_{i=1}^{r}|e_{j+i}| \le k\|E_{j+1}\|,$

one thus obtains from (7.2.10.5)

$|c_{j+r}| \le |h|Mk\bigl(\|E_j\| + \|E_{j+1}\|\bigr) + |h|\bigl[\rho(h) + \sigma(h)\bigr].$

Using (7.2.10.6) and $\|B\| \le k$, it follows for $j = 0, 1, \ldots$
(7.2.10.7)  $(1 - |h|Mk^2)\|E_{j+1}\| \le (1 + |h|Mk^2)\|E_j\| + k|h|\bigl[\rho(h) + \sigma(h)\bigr],$

and $\|E_0\| \le kr\rho(h)$. For $|h| \le 1/(2Mk^2)$, we now have $(1 - |h|Mk^2) \ge \tfrac12$ and

$\frac{1 + |h|Mk^2}{1 - |h|Mk^2} \le 1 + 4|h|Mk^2.$

(7.2.10.7), therefore, yields for $|h| \le 1/(2Mk^2)$

$\|E_{j+1}\| \le (1 + 4|h|Mk^2)\|E_j\| + 2k|h|\bigl[\rho(h) + \sigma(h)\bigr], \qquad j = 0, 1, \ldots.$

Because of $\|E_0\| \le kr\rho(h)$, Lemma (7.2.2.2) shows that

$\|E_n\| \le e^{4n|h|Mk^2}kr\rho(h) + \frac{e^{4n|h|Mk^2} - 1}{2Mk}\bigl[\rho(h) + \sigma(h)\bigr],$

i.e., for $x \ne x_0$, $h = h_n = (x - x_0)/n$, $|h_n| \le 1/(2Mk^2)$, one has

$\|E_n\| \le e^{4|x - x_0|Mk^2}kr\rho(h_n) + \frac{e^{4|x - x_0|Mk^2} - 1}{2Mk}\bigl[\rho(h_n) + \sigma(h_n)\bigr].$

There exist, therefore, constants $C_1$ and $C_2$ independent of $h$ such that

(7.2.10.8)  $|\eta(x; \varepsilon; h_n) - y(x)| \le C_1\rho(h_n) + C_2\sigma(h_n)$

for all sufficiently large $n$. Convergence of the method now follows in view of $\lim_{h\to 0}\rho(h) = \lim_{h\to 0}\sigma(h) = 0$.  □
From (7.2.10.8) one immediately obtains the following:

(7.2.10.9) Corollary. If in addition to the assumptions of Theorem (7.2.10.3) the multistep method is a method of order $p$ [see (7.2.7.4)], $\sigma(h) = O(h^p)$, and $f \in F_p(a,b)$, then the global discretization error also satisfies

$|\eta(x; \varepsilon; h_n) - y(x)| = O(h_n^p)$

for all $h_n = (x - x_0)/n$, $n$ sufficiently large, provided the errors

$\varepsilon_i = \varepsilon(x_0 + ih_n; h_n), \qquad i = 0, 1, \ldots, n,$

obey an estimate

$|\varepsilon_i| \le \rho(h_n), \qquad i = 0, 1, \ldots, n,$

with $\rho(h_n) = O(h_n^p)$ as $n \to \infty$.
7.2.11 Linear Multistep Methods
In the following sections we assume that in (7.2.7.2), aside from the starting errors $\varepsilon_i$, $0 \le i \le r-1$, there are no further errors: $\varepsilon_j = 0$ for $j \ge r$. Since it will always be clear from the context what starting values are being used, and hence what the starting errors are, we will simplify further by writing $\eta(x; h)$ in place of $\eta(x; \varepsilon; h)$ for the approximate solution produced by the multistep method (7.2.7.2).
The most commonly used multistep methods are linear ones. For these, the function $F(x; u; h; f)$, $u := [u_r, u_{r-1}, \ldots, u_0]$, in (7.2.7.1) has the following form:

(7.2.11.1)  $F(x; u_r, u_{r-1}, \ldots, u_0; h; f) \equiv b_r f(x + rh, u_r) + b_{r-1}f(x + (r-1)h, u_{r-1}) + \cdots + b_0 f(x, u_0).$

A linear multistep method therefore is determined by listing the coefficients $a_0, \ldots, a_{r-1}$, $b_0, \ldots, b_r$. By means of the recurrence formula

$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h\bigl[b_r f(x_{j+r}, \eta_{j+r}) + \cdots + b_0 f(x_j, \eta_j)\bigr], \qquad x_i := x_0 + ih,$

for each set of starting values $\eta_0, \eta_1, \ldots, \eta_{r-1}$ and for each (sufficiently small) stepsize $h \ne 0$, the method produces approximate values $\eta_j$ for the values $y(x_j)$ of the exact solution $y(x)$ of an initial-value problem $y' = f(x,y)$, $y(x_0) = y_0$.
If $b_r \ne 0$, one deals with a corrector method; if $b_r = 0$, with a predictor method.
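For illustration (not from the original text), the recurrence above is easy to program in the explicit case $b_r = 0$; the following sketch applies a given linear multistep method to a scalar problem, with the midpoint rule as an example.

```python
import math

def explicit_multistep(a, b, f, x0, y_start, h, n_steps):
    """Linear r-step method with b_r = 0 (predictor):
    eta_{j+r} = -(a_{r-1} eta_{j+r-1} + ... + a_0 eta_j)
                + h (b_{r-1} f_{j+r-1} + ... + b_0 f_j).
    a = [a_0, ..., a_{r-1}], b = [b_0, ..., b_{r-1}],
    y_start = [eta_0, ..., eta_{r-1}]."""
    r = len(a)
    eta = list(y_start)
    for j in range(n_steps - r + 1):
        s = -sum(a[i] * eta[j + i] for i in range(r))
        s += h * sum(b[i] * f(x0 + (j + i) * h, eta[j + i]) for i in range(r))
        eta.append(s)
    return eta

# midpoint rule as a 2-step method: a = [-1, 0], b = [0, 2]; y' = -y, y(0) = 1
h = 0.01
eta = explicit_multistep([-1.0, 0.0], [0.0, 2.0], lambda x, y: -y,
                         0.0, [1.0, math.exp(-h)], h, 100)
print(eta[-1], math.exp(-1.0))     # both close to e^{-1}
```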
For $f \in F_1(a,b)$ every linear multistep method evidently satisfies the conditions (7.2.10.2) and (7.2.7.7). According to Theorem (7.2.10.1), therefore, the stability condition (7.2.9.4) for the polynomial

$\psi(\mu) := \mu^r + a_{r-1}\mu^{r-1} + \cdots + a_0$

is necessary for convergence [see (7.2.7.8)] of these methods. By Theorem (7.2.10.3), the stability condition for $\psi$ together with consistency, (7.2.7.4), is also sufficient for convergence.
To check consistency, by Definition (7.2.7.4) one has to examine the behavior of the expression (here $a_r := 1$)

(7.2.11.2)  $L[z(x); h] := \sum_{i=0}^{r} a_i z(x+ih) - h\sum_{i=0}^{r} b_i f(x+ih, z(x+ih)) = \sum_{i=0}^{r} a_i z(x+ih) - h\sum_{i=0}^{r} b_i z'(x+ih) = h\cdot\tau(x, y; h)$

for the solution $z(t)$ with $z'(t) = f(t, z(t))$, $z(x) = y$, $x \in [a,b]$, $y \in \mathbb{R}$. Assuming that $z(t)$ is sufficiently often differentiable (this is the case if $f$ has sufficiently many continuous partial derivatives), one finds by Taylor expansion of $L[z(x); h]$ in powers of $h$ for $f \in F_{q-1}(a,b)$,

$L[z(x); h] = C_0 z(x) + C_1 h z'(x) + \cdots + C_q h^q z^{(q)}(x) + o(|h|^q) = h\,\tau(x, y; h).$
Here, the $C_i$ are independent of $z(\cdot)$, $x$ and $h$, and one has in particular

$C_0 = a_0 + a_1 + \cdots + a_{r-1} + 1,$
$C_1 = a_1 + 2a_2 + \cdots + (r-1)a_{r-1} + r\cdot 1 - (b_0 + b_1 + \cdots + b_r).$

In terms of the polynomial $\psi(\mu)$ and the additional polynomial

(7.2.11.3)  $\chi(\mu) := b_0 + b_1\mu + \cdots + b_r\mu^r, \qquad \mu \in \mathbb{C},$

one can write $C_0$ and $C_1$ in the form

$C_0 = \psi(1), \qquad C_1 = \psi'(1) - \chi(1).$

Now, for $f \in F_1(a,b)$,

$\tau(x, y; h) = \frac{1}{h}L[z(x); h] = \frac{C_0}{h}z(x) + C_1 f(x, y) + O(h),$
and, according to Definition (7.2.7.4), a consistent multistep method requires

$C_0 = C_1 = 0,$

i.e., a consistent linear multistep method has at least order 1. In general, it has order $p$ [see Definition (7.2.7.4)] if, for $f \in F_p(a,b)$,

$C_0 = C_1 = \cdots = C_p = 0.$
In addition to Theorems (7.2.10.1) and (7.2.10.3), we now have for linear
multistep methods the following:
(7.2.11.4) Theorem. A linear multistep method which is convergent is
also consistent.
PROOF. Consider the initial-value problem

$y' = 0, \qquad y(0) = 1,$

with the exact solution $y(x) \equiv 1$. For starting values $\eta_i := 1$, $i = 0, 1, \ldots, r-1$, the method produces values $\eta_{j+r}$, $j = 0, 1, \ldots$, with

(7.2.11.5)  $\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = 0.$

Letting $h_n := x/n$, one gets $\eta(x; h_n) = \eta_n$, and in view of the convergence of the method,

$\lim_{n\to\infty}\eta(x; h_n) = \lim_{n\to\infty}\eta_n = y(x) = 1.$

For $j \to \infty$ it thus follows at once from (7.2.11.5) that

$C_0 = 1 + a_{r-1} + \cdots + a_0 = 0.$
In order to show $C_1 = 0$, we utilize the fact that the method must converge also for the initial-value problem

$y' = 1, \qquad y(0) = 0,$

with the exact solution $y(x) \equiv x$. We already know that $C_0 = \psi(1) = 0$. By Theorem (7.2.10.1) the stability condition (7.2.9.4) holds, hence $\lambda = 1$ is only a simple zero of $\psi$, i.e., $\psi'(1) \ne 0$; the constant

$K := \frac{\chi(1)}{\psi'(1)}$

is thus well defined. With the starting values

$\eta_j := jhK, \qquad j = 0, 1, \ldots, r-1,$

we have for the initial-value problem $y' = 1$, $y(0) = 0$, in view of $y(x_j) = x_j = jh$,

$\eta_j = y(x_j) + \varepsilon_j$ with $\varepsilon_j := jh(K-1)$, $j = 0, 1, \ldots, r-1$,

whereby

$\lim_{h\to 0}\varepsilon_j = 0$ for $j = 0, 1, \ldots, r-1.$
The method, with these starting values, yields a sequence $\eta_j$ for which

(7.2.11.6)  $\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h\chi(1).$

Through substitution in (7.2.11.6), and observing $C_0 = 0$, one easily sees that

$\eta_j = jhK$ for all $j$.

Now, $\eta_n = \eta(x; h_n)$, $h_n := x/n$. By virtue of the convergence of the method, therefore,

$x = y(x) = \lim_{n\to\infty}\eta(x; h_n) = \lim_{n\to\infty}\eta_n = \lim_{n\to\infty}nh_nK = xK.$

Consequently $K = 1$, and thus

$C_1 = \psi'(1) - \chi(1) = 0.$  □
Together with Theorems (7.2.10.1), (7.2.10.3) this yields the following:

(7.2.11.7) Theorem. A linear multistep method is convergent for all $f \in F_1(a,b)$ if and only if it satisfies the stability condition (7.2.9.4) for $\psi$ and is consistent [i.e., $\psi(1) = 0$, $\psi'(1) - \chi(1) = 0$].
The following theorem gives a convenient means of determining the order of a linear multistep method:

(7.2.11.8) Theorem. A linear multistep method is a method of order $p$ if and only if the function $\varphi(\mu) := \psi(\mu)/\ln(\mu) - \chi(\mu)$ has $\mu = 1$ as a $p$-fold zero.

PROOF. Put $z(x) := e^x$ in $L[z(x); h]$ of (7.2.11.2). Then for a method of order $p$,

$L[e^x; h] = O(h^{p+1}).$

On the other hand,

$L[e^x; h] = e^x\bigl[\psi(e^h) - h\chi(e^h)\bigr] = e^x\,h\,\varphi(e^h).$

We thus have a method of order $p$ precisely if

$\varphi(e^h) = O(h^p),$

i.e., if $h = 0$ is a $p$-fold zero of $\varphi(e^h)$, or, in other words, if $\varphi(\mu)$ has the $p$-fold zero $\mu = 1$.  □
This theorem suggests the following procedure: For given constants $a_0, a_1, \ldots, a_{r-1}$, suppose additional constants $b_0, b_1, \ldots, b_r$ are to be determined so that the resulting multistep method has the largest possible order. To this end, the function $\psi(\mu)/\ln(\mu)$, which for consistent methods is holomorphic in a neighborhood of $\mu = 1$, is expanded in a Taylor series about $\mu = 1$:

(7.2.11.9)  $\frac{\psi(\mu)}{\ln(\mu)} = c_0 + c_1(\mu - 1) + \cdots + c_{r-1}(\mu - 1)^{r-1} + c_r(\mu - 1)^r + \cdots.$

Choosing

(7.2.11.10)  $\chi(\mu) := c_0 + c_1(\mu - 1) + \cdots + c_r(\mu - 1)^r =: b_0 + b_1\mu + \cdots + b_r\mu^r$

then gives rise to a corrector method of order at least $r+1$. Taking

$\chi(\mu) := c_0 + c_1(\mu - 1) + \cdots + c_{r-1}(\mu - 1)^{r-1} = b_0 + b_1\mu + \cdots + b_{r-1}\mu^{r-1} + 0\cdot\mu^r$

one obtains a predictor method of order at least $r$.
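This construction lends itself to a short symbolic computation. The sketch below (not from the original text) assumes SymPy is available; the helper name corrector_coefficients is hypothetical. It expands $\psi(\mu)/\ln\mu$ about $\mu = 1$ as in (7.2.11.9) and returns the coefficients $b_0, \ldots, b_r$ of the corrector $\chi$ from (7.2.11.10).

```python
import sympy as sp

mu = sp.symbols('mu')

def corrector_coefficients(a, r):
    """Given a_0, ..., a_{r-1} (with a_r = 1), expand psi(mu)/ln(mu)
    about mu = 1 and read off chi(mu) as in (7.2.11.9)/(7.2.11.10)."""
    psi = mu**r + sum(a[i] * mu**i for i in range(r))
    t = sp.symbols('t')                                   # t = mu - 1
    ser = sp.series(psi.subs(mu, 1 + t) / sp.log(1 + t), t, 0, r + 1).removeO()
    chi = ser.subs(t, mu - 1)                             # keep terms up to (mu-1)^r
    return sp.Poly(sp.expand(chi), mu).all_coeffs()[::-1]  # b_0, ..., b_r

# r = 1, psi(mu) = mu - 1: trapezoidal rule, b = [1/2, 1/2]
print(corrector_coefficients([-1], 1))
# r = 2, psi(mu) = mu^2 - mu: Adams-Moulton, b = [-1/12, 2/3, 5/12]
print(corrector_coefficients([0, -1], 2))
```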
In order to achieve methods of still higher order, one could think of further determining the constants $a_0, \ldots, a_{r-1}$ in such a way that in (7.2.11.9)

(7.2.11.11)  $\psi(1) = 1 + a_{r-1} + \cdots + a_0 = 0, \qquad c_{r+1} = c_{r+2} = \cdots = c_{2r-1} = 0.$

The choice (7.2.11.10) for $\chi(\mu)$ would then lead to a corrector method of order $2r$. Unfortunately, the methods so obtained are no longer convergent, since the polynomials $\psi$ for which (7.2.11.11) holds no longer satisfy the stability condition (7.2.9.4). Dahlquist (1956, 1959) was able to show that an $r$-step method which satisfies the stability condition (7.2.9.4) has order

$p \le \begin{cases} r+1 & \text{if } r \text{ is odd,} \\ r+2 & \text{if } r \text{ is even} \end{cases}$

[cf. Section 7.2.8].
EXAMPLE. The consistent method of maximum order, for $r = 2$, is obtained by setting

$\psi(\mu) := (\mu - 1)(\mu - a) = \mu^2 - (1+a)\mu + a$

with a real parameter $a$. Taylor expansion of $\psi(\mu)/\ln(\mu)$ about $\mu = 1$ yields

$\frac{\psi(\mu)}{\ln(\mu)} = 1 - a + \frac{3-a}{2}(\mu - 1) + \frac{a+5}{12}(\mu - 1)^2 - \frac{1+a}{24}(\mu - 1)^3 + \cdots.$

Putting

$\chi(\mu) := 1 - a + \frac{3-a}{2}(\mu - 1) + \frac{a+5}{12}(\mu - 1)^2,$

the resulting linear multistep method has order 3 for $a \ne -1$, and order 4 for $a = -1$. Since $\psi(\mu) = (\mu - 1)(\mu - a)$, the stability condition (7.2.9.4) is satisfied only for $-1 \le a < 1$. In particular, for $a = 0$, one obtains

$\psi(\mu) = \mu^2 - \mu, \qquad \chi(\mu) = \tfrac{5}{12}\mu^2 + \tfrac{8}{12}\mu - \tfrac{1}{12}.$

This is just the Adams-Moulton method (7.2.6.6) for $q = 2$, which therefore has order 3. For $a = -1$ one obtains

$\psi(\mu) = \mu^2 - 1, \qquad \chi(\mu) = \tfrac{1}{3}\mu^2 + \tfrac{4}{3}\mu + \tfrac{1}{3},$

which corresponds to Milne's method (7.2.6.10) for $q = 2$ and has order 4 [see also Exercise 11].
One should not overlook that for multistep methods of order $p$ the integration error is of order $O(h^p)$ only if the solution $y(x)$ of the differential equation is at least $p+1$ times differentiable [$f \in F_p(a,b)$].
7.2.12 Asymptotic Expansions of the Global Discretization Error for Linear Multistep Methods
In analogy to Section 7.2.3, one can also try to find asymptotic expansions
in the stepsize h for approximate solutions generated by multistep methods.
There arise, however, a number of difficulties.
To begin with, the approximate solution $\eta(x; h)$, and thus certainly also its asymptotic expansion (if it exists), will depend on the starting values used. Furthermore, it is not necessarily true that an asymptotic expansion of the form [cf. (7.2.3.3)]

(7.2.12.1)  $\eta(x; h) = y(x) + h^p e_p(x) + h^{p+1}e_{p+1}(x) + \cdots + h^N e_N(x) + h^{N+1}E_{N+1}(x; h)$

exists for all $h = h_n := (x - x_0)/n$, with functions $e_i(x)$ independent of $h$ and a remainder term $E_{N+1}(x; h)$ which is bounded in $h$ for each $x$.
We show this for a simple linear multistep method, the midpoint rule [see (7.2.6.9)], i.e.,

(7.2.12.2)  $\eta_{j+1} = \eta_{j-1} + 2hf(x_j, \eta_j), \qquad x_j = x_0 + jh, \qquad j = 1, 2, \ldots.$

We will use this method to treat the initial-value problem

$y' = -y, \qquad x_0 = 0, \qquad y_0 = y(0) = 1,$

whose exact solution is $y(x) = e^{-x}$. We take the starting values

$\eta_0 := 1, \qquad \eta_1 := 1 - h$

[$\eta_1$ is the approximate value for $y(x_1) = e^{-h}$ obtained by Euler's polygon method (7.2.1.3)]. Beginning with these starting values, the sequence $\{\eta_j\}$
-- and hence the function $\eta(x; h)$ for all $x \in R_h = \{x_j = jh \mid j = 0, 1, \ldots\}$ -- is then defined by (7.2.12.2) through

$\eta(x; h) := \eta_j = \eta_{x/h}$ if $x = x_j = jh$.

According to (7.2.12.2), since $f(x_j, \eta_j) = -\eta_j$, the $\eta_j$ satisfy the following difference equation:

$\eta_{j+1} + 2h\eta_j - \eta_{j-1} = 0, \qquad j = 1, 2, \ldots.$

With the help of Theorem (7.2.9.8), the $\eta_j$ can be obtained explicitly: The polynomial

$\psi(\mu) = \mu^2 + 2h\mu - 1$

has the zeros

$\lambda_1 = -h + \sqrt{1+h^2}, \qquad \lambda_2 = -h - \sqrt{1+h^2}.$

Therefore, by (7.2.9.8),

(7.2.12.3)  $\eta_j = c_1\lambda_1^j + c_2\lambda_2^j, \qquad j = 0, 1, \ldots,$

where the constants $c_1$, $c_2$ can be determined by means of the starting values $\eta_0 = 1$, $\eta_1 = 1 - h$. One finds

$\eta_0 = 1 = c_1 + c_2, \qquad \eta_1 = 1 - h = c_1\lambda_1 + c_2\lambda_2,$

and thus

$c_1 = c_1(h) = \frac{1 + \sqrt{1+h^2}}{2\sqrt{1+h^2}}, \qquad c_2 = c_2(h) = \frac{\sqrt{1+h^2} - 1}{2\sqrt{1+h^2}}.$

Consequently, for $x \in R_h$, $h \ne 0$,

(7.2.12.4)  $\eta(x; h) = c_1(h)\,[\lambda_1(h)]^{x/h} + (-1)^{x/h}c_2(h)\,[\lambda_1(-h)]^{x/h}.$
One easily verifies that

$\varphi_1(h) := c_1(h)\,[\lambda_1(h)]^{x/h}$

is an analytic function of $h$ in $|h| < 1$. Furthermore, $\varphi_1(-h) = \varphi_1(h)$, since, evidently, $c_1(-h) = c_1(h)$ and $\lambda_1(-h) = 1/\lambda_1(h)$. The second term in (7.2.12.4) exhibits more complicated behavior. We have $(-1)^{x/h}\varphi_2(h)$ with

$\varphi_2(h) := c_2(h)\,[\lambda_1(-h)]^{x/h} = c_2(h)\,[\lambda_1(h)]^{-x/h},$

an analytic function in $|h| < 1$. As above, one sees that $\varphi_2(-h) = \varphi_2(h)$. It follows that $\varphi_1$ and $\varphi_2$, for $|h| < 1$, have convergent power-series expansions of the form

$\varphi_1(h) = u_0(x) + u_1(x)h^2 + u_2(x)h^4 + \cdots,$
$\varphi_2(h) = v_0(x) + v_1(x)h^2 + v_2(x)h^4 + \cdots,$
with certain analytic functions $u_j(x)$, $v_j(x)$. The first terms in these series can easily be found from the explicit formulas for $c_i(h)$, $\lambda_i(h)$:

$v_0(x) = 0, \qquad u_1(x) = \frac{e^{-x}}{4}\Bigl[-1 + \frac{2x}{3}\Bigr], \qquad v_1(x) = \frac{e^{x}}{4}.$
Therefore, $\eta(x; h)$ has an expansion of the form

(7.2.12.5)  $\eta(x; h) = y(x) + \sum_{k=1}^{\infty} h^{2k}\bigl[u_k(x) + (-1)^{x/h}v_k(x)\bigr]$ for $h = \frac{x}{n}$, $n = 1, 2, \ldots.$

Because of the oscillating term $(-1)^{x/h}$, which depends on $h$, this is not an asymptotic expansion of the form (7.2.12.1).
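The oscillation is easy to observe numerically. The following sketch (not from the original text) integrates $y' = -y$ with the midpoint rule (7.2.12.2) and the Euler starting step, and prints the error at $x = 1$ for even and odd $n$; the sign flip reflects the factor $(-1)^{x/h}$ in (7.2.12.5).

```python
import math

def midpoint_rule(x, n):
    """Midpoint rule (7.2.12.2) for y' = -y, y(0) = 1, with Euler start."""
    h = x / n
    eta_prev, eta = 1.0, 1.0 - h            # eta_0, eta_1
    for j in range(1, n):
        eta_prev, eta = eta, eta_prev + 2 * h * (-eta)
    return eta                               # approximation to y(x) = e^{-x}

x = 1.0
for n in (10, 11, 20, 21, 40, 41):
    err = midpoint_rule(x, n) - math.exp(-x)
    # the sign of the leading O(h^2) error alternates between even and odd n
    print(f"n = {n:3d}   error = {err: .3e}")
```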
Restricting the choice of $h$ in such a way that $x/h$ is always even, or always odd, one obtains true asymptotic expansions

(7.2.12.6)  $\eta(x; h) = y(x) + \sum_{k=1}^{\infty} h^{2k}\bigl[u_k(x) + v_k(x)\bigr]$ for $h = \frac{x}{2n}$, $n = 1, 2, \ldots,$

  $\eta(x; h) = y(x) + \sum_{k=1}^{\infty} h^{2k}\bigl[u_k(x) - v_k(x)\bigr]$ for $h = \frac{x}{2n-1}$, $n = 1, 2, \ldots.$
Computing $\eta_1$ with the Runge-Kutta method (7.2.1.14) instead of Euler's method, one obtains the starting values

$\eta_0 := 1, \qquad \eta_1 := 1 - h + \frac{h^2}{2} - \frac{h^3}{6} + \frac{h^4}{24}.$

For $c_1$ and $c_2$ one finds in the same way as before

$c_2 = c_2(h) = \frac{\sqrt{1+h^2} - 1 - \dfrac{h^2}{2} + \dfrac{h^3}{6} - \dfrac{h^4}{24}}{2\sqrt{1+h^2}}, \qquad c_1 = c_1(h) = 1 - c_2(h).$
Since $c_1(h)$ and $c_2(h)$, and thus $\eta(x; h)$, are no longer even functions of $h$, there will be no expansion of the form (7.2.12.5) for $\eta(x; h)$, but merely an expansion of the type

$\eta(x; h) = y(x) + \sum_{k\ge 2} h^{k}\bigl[u_k(x) + (-1)^{x/h}v_k(x)\bigr],$

in which odd powers of $h$ also occur. The form of the asymptotic expansion thus depends critically on the starting values used.
In general, we have the following theorem due to Gragg (1965) [see
Hairer and Lubich (1984) for a short proof]:
(7.2.12.7) Theorem. Let f E F2N +2(a, b), and y(x) be the exact solution
of the initial-value problem
y' = f(x,y),
y(xo) = Yo,
Xo E [a,b].
For $x \in R_h = \{x_0 + ih \mid i = 0, 1, \ldots\}$ let $\eta(x; h)$ be defined by

(7.2.12.8)  $\eta(x_0; h) := y_0,$
  $\eta(x_0 + h; h) := y_0 + hf(x_0, y_0),$
  $\eta(x + h; h) := \eta(x - h; h) + 2hf(x, \eta(x; h)).$

Then $\eta(x; h)$ has an expansion of the form

(7.2.12.9)  $\eta(x; h) = y(x) + \sum_{k=1}^{N} h^{2k}\bigl[u_k(x) + (-1)^{(x-x_0)/h}v_k(x)\bigr] + h^{2N+2}E_{2N+2}(x; h),$

valid for $x \in [a,b]$ and all $h = (x - x_0)/n$, $n = 1, 2, \ldots$. The functions $u_k(x)$, $v_k(x)$ are independent of $h$. The remainder term $E_{2N+2}(x; h)$ for fixed $x$ remains bounded for all $h = (x - x_0)/n$, $n = 1, 2, \ldots$.

We note here explicitly that Theorem (7.2.12.7) holds also for systems of differential equations (7.0.3): $f$, $y_0$, $\eta$, $y$, $u_k$, $v_k$, etc., are then to be understood as vectors.
Under the assumptions of Theorem (7.2.12.7) the error $e(x; h) := \eta(x; h) - y(x)$ in first approximation is equal to

$h^2\bigl[u_1(x) + (-1)^{(x-x_0)/h}v_1(x)\bigr].$

In view of the term $(-1)^{(x-x_0)/h}$, it shows an oscillating behavior. One says, for this reason, that the midpoint rule is "weakly unstable". The principal oscillating term $(-1)^{(x-x_0)/h}v_1(x)$ can be removed by a trick. Set [Gragg (1965)]

(7.2.12.10)  $S(x; h) := \tfrac12\bigl[\eta(x; h) + \eta(x - h; h) + hf(x, \eta(x; h))\bigr],$

where $\eta(x; h)$ is defined by (7.2.12.8). On account of (7.2.12.8), one now has

$\eta(x + h; h) = \eta(x - h; h) + 2hf(x, \eta(x; h)),$

and hence also

(7.2.12.11)  $S(x; h) = \tfrac12\bigl[\eta(x; h) + \eta(x + h; h) - hf(x, \eta(x; h))\bigr].$
Addition of (7.2.12.10) and (7.2.12.11) yields

(7.2.12.12)  $S(x; h) = \tfrac12\bigl[\eta(x; h) + \tfrac12\eta(x - h; h) + \tfrac12\eta(x + h; h)\bigr].$

By (7.2.12.9), one thus obtains for $S$ an expansion of the form

$S(x; h) = \tfrac12\Bigl\{ y(x) + \tfrac12\bigl[y(x+h) + y(x-h)\bigr] + \sum_{k=1}^{N} h^{2k}\Bigl[u_k(x) + \tfrac12\bigl(u_k(x+h) + u_k(x-h)\bigr) + (-1)^{(x-x_0)/h}\Bigl(v_k(x) - \tfrac12\bigl(v_k(x+h) + v_k(x-h)\bigr)\Bigr)\Bigr]\Bigr\} + O(h^{2N+2}).$
Expanding $y(x \pm h)$ and the coefficient functions $u_k(x \pm h)$, $v_k(x \pm h)$ in Taylor series in $h$, one finally obtains for $S(x; h)$ an expansion of the form

(7.2.12.13)  $S(x; h) = y(x) + h^2\bigl[u_1(x) + \tfrac14 y''(x)\bigr] + \sum_{k=2}^{N} h^{2k}\bigl[u_k(x) + (-1)^{(x-x_0)/h}v_k(x)\bigr] + O(h^{2N+2}),$

in which the leading error term no longer contains an oscillating factor.
7.2.13 Practical Implementation of Multistep Methods
The unpredictable behavior of solutions of differential equations forces the
numerical integration to proceed with stepsizes which, in general, must vary
from point to point if a prescribed error bound is to be maintained. Multistep methods which use equidistant nodes Xi and a fixed order, therefore,
are not very suitable in practice. Every change of the steplength requires
a recomputation of starting data [cf. Section 7.2.6] or entails a complicated interpolation process, which severely reduces the efficiency of these
integration methods.
In order to construct efficient multistep methods, the equal spacing of
the nodes Xi must be given up. We briefly sketch the construction of such
methods. Our point of departure is again Equation (7.2.6.2), which was
obtained by formal integration of y' = f(x, y):
As in (7.2.6.3) we replace the integrand by an interpolating polynomial Qq
of degree q, but use here Newton's interpolation formula (2.1.3.4), which, for
the purpose of step control, offers some advantages. With the interpolating
polynomial one first obtains an approximation formula
and from this the recurrence formula
with

$Q_i(x) = (x - x_p)\cdots(x - x_{p-i+1}), \qquad Q_0(x) \equiv 1.$
In the case $k = 1$, $j = 0$, $q = 0, 1, 2, \ldots$, one obtains an "explicit" formula (predictor)

(7.2.13.1)  $\eta_{p+1} = \eta_p + \sum_{i=0}^{q} g_i\, f[x_p, \ldots, x_{p-i}],$

where

$g_i := \int_{x_p}^{x_{p+1}} Q_i(x)\,dx, \qquad i = 0, 1, \ldots, q.$

For $k = 0$, $j = 1$, $q = 0, 1, 2, \ldots$, and replacing $p$ by $p+1$, there results an "implicit" formula (corrector)

(7.2.13.2)  $\eta_{p+1} = \eta_p + \sum_{i=0}^{q} g_i^{*}\, f[x_{p+1}, \ldots, x_{p-i+1}],$

with

$g_0^{*} := \int_{x_p}^{x_{p+1}} dx, \qquad g_i^{*} := \int_{x_p}^{x_{p+1}} (x - x_{p+1})(x - x_p)\cdots(x - x_{p-i+2})\,dx, \quad i > 0.$
The approximation $\eta_{p+1} =: \eta_{p+1}^{(0)}$ from the predictor formula can be used, as in Section 7.2.6, as "starting value" for the iteration on the corrector formula. However, we carry out only one iteration step and denote the resulting approximation by $\eta_{p+1}^{(1)}$. From the difference of the two approximate values we can again derive statements about the error and stepsize.
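The variable-step, variable-order scheme described here is intricate. As a much simplified illustration (not the scheme of the text: equidistant nodes, fixed order $q = 1$), the sketch below shows the predictor-corrector principle with a single corrector iteration and the error indicator obtained from the predictor-corrector difference.

```python
import math

def pece_step(f, x, h, eta, f_prev):
    """One fixed-step PECE step: 2-step Adams-Bashforth predictor,
    one iteration of the Adams-Moulton (trapezoidal) corrector."""
    f_cur = f(x, eta)
    eta_p = eta + h * (1.5 * f_cur - 0.5 * f_prev)       # predictor eta^{(0)}
    eta_c = eta + 0.5 * h * (f(x + h, eta_p) + f_cur)    # corrector eta^{(1)}
    err_est = abs(eta_c - eta_p)     # cf. (7.2.13.3): proportional to the local error
    return eta_c, f_cur, err_est

# y' = -y, y(0) = 1; starting value eta_1 taken from the exact solution
h, eta, f_prev, x = 0.05, math.exp(-0.05), -1.0, 0.05
while x < 1.0 - 1e-12:
    eta, f_prev, err = pece_step(lambda t, y: -y, x, h, eta, f_prev)
    x += h
print(eta, math.exp(-1.0))
```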
Subtracting the predictor formula from the corrector formula gives

$\eta_{p+1}^{(1)} - \eta_{p+1}^{(0)} = \int_{x_p}^{x_{p+1}} \bigl[Q_p^{(1)}(x) - Q_p^{(0)}(x)\bigr]\,dx.$

$Q_p^{(1)}(x)$ is the interpolation polynomial of the corrector formula (7.2.13.2), involving the points

$(x_{p+1}, f_{p+1}^{(1)}),\ (x_p, f_p),\ \ldots,\ (x_{p-q+1}, f_{p-q+1})$ with $f_{p+1}^{(1)} = f(x_{p+1}, \eta_{p+1}^{(0)}).$

$Q_p^{(0)}(x)$ is the interpolation polynomial of the predictor formula, with points $(x_p, f_p), \ldots, (x_{p-q}, f_{p-q})$. Defining $f_{p+1}^{(0)} := Q_p^{(0)}(x_{p+1})$, then $Q_p^{(0)}(x)$ is also determined uniquely by the points

$(x_{p+1}, f_{p+1}^{(0)}),\ (x_p, f_p),\ \ldots,\ (x_{p-q+1}, f_{p-q+1}).$

The difference $Q_p^{(1)}(x) - Q_p^{(0)}(x)$ therefore vanishes at the nodes $x_{p-q+1}, \ldots, x_p$, and at $x_{p+1}$ has the value $f_{p+1}^{(1)} - f_{p+1}^{(0)}$. Therefore,

(7.2.13.3)  $\eta_{p+1}^{(1)} - \eta_{p+1}^{(0)} = c_q\cdot\bigl(f_{p+1}^{(1)} - f_{p+1}^{(0)}\bigr).$
Now let $y(x)$ denote the exact solution of the differential equation $y' = f(x,y)$ with initial value $y(x_p) = \eta_p$. We assume that the "past" approximate values $\eta_{p-i} = y(x_{p-i})$, $i = 0, 1, \ldots, q$, are "exact", and we make the additional assumption that $f(x, y(x))$ is exactly represented by the polynomial $Q_{p+1}^{(1)}(x)$ with interpolation points

$(x_{p+1}, f(x_{p+1}, y(x_{p+1}))),\ (x_p, f_p),\ \ldots,\ (x_{p-q}, f_{p-q}).$

Then

$y(x_{p+1}) = \eta_p + \int_{x_p}^{x_{p+1}} f(x, y(x))\,dx.$
For the error

$E_q := y(x_{p+1}) - \eta_{p+1}^{(1)}$

one obtains a corresponding integral representation. In the first term, $Q_p^{(0)}(x)$ may be replaced by $Q_{p+1}^{(1)}(x)$ if one takes as interpolation points

$(x_{p+1}, f_{p+1}^{(1)}),\ (x_p, f_p),\ \ldots,\ (x_{p-q}, f_{p-q}).$

In analogy to (7.2.13.3) one then obtains an expression for $E_q$. If now the nodes $x_{p+1}, x_p, \ldots, x_{p-q}$ are equidistant and $h := x_{j+1} - x_j$, one has for the error

$E_q = y(x_{p+1}) - \eta_{p+1}^{(1)} = O(h^{q+2}) \approx C\,h^{q+2}.$
The further development now follows the previous pattern [cf. Section 7.2.5]. Let $\varepsilon$ be a prescribed error bound. The "old" step $h_{\text{old}}$ is accepted as "successful" if

$|E_q| \le \varepsilon.$

The new step $h_{\text{new}}$ is considered "successful" if we can expect

$|E_q^{\text{new}}| \approx C\,h_{\text{new}}^{\,q+2} \le \varepsilon.$

Elimination of $C$ again yields

$h_{\text{new}} \approx h_{\text{old}}\Bigl(\frac{\varepsilon}{|E_q|}\Bigr)^{\frac{1}{q+2}}.$

This strategy can still be combined with a change of $q$ (variable-order method). One computes the three quantities

$\Bigl(\frac{\varepsilon}{|E_{q-1}|}\Bigr)^{\frac{1}{q+1}}, \qquad \Bigl(\frac{\varepsilon}{|E_{q}|}\Bigr)^{\frac{1}{q+2}}, \qquad \Bigl(\frac{\varepsilon}{|E_{q+1}|}\Bigr)^{\frac{1}{q+3}},$

and determines their maximum. If the first quantity is the maximum, one reduces $q$ by 1. If the second is the maximum, one holds on to $q$. If the
third is the maximum, one increases $q$ by 1. The important point to note is that the quantities $E_{q-1}$, $E_q$, $E_{q+1}$ can be computed recursively from the table of divided differences. One has

$E_{q-1} = g_{q-1,2}\, f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-q+1}],$
$E_{q} = g_{q,2}\, f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-q}],$
$E_{q+1} = g_{q+1,2}\, f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-q-1}],$

where $f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-i}]$ are the divided differences relative to the interpolation points $(x_{p+1}, f_{p+1}^{(1)})$, $(x_p, f_p)$, ..., $(x_{p-i}, f_{p-i})$, and the $g_{ij}$, $i, j \ge 1$, are defined accordingly. The quantities $g_{ij}$ satisfy the recursion

$g_{ij} = (x_{p+1} - x_{p+1-i})\,g_{i-1,j} + g_{i-1,j+1}, \qquad j = 1, 2, \ldots, q+2-i, \quad i = 2, 3, \ldots, q,$

with the initial values $g_{1j}$, $j = 1, \ldots, q+1$.
The method discussed is "self-starting"; one begins with $q = 0$, then raises $q$ in the next step to $q = 1$, etc. The "initialization" [cf. Section 7.2.6] required for multistep methods with equidistant nodes and fixed $q$ becomes unnecessary.
For a more detailed study we refer to the relevant literature, for example, Hairer, Nørsett, and Wanner (1993); Shampine and Gordon (1975); Gear (1971); and Krogh (1974).
7.2.14 Extrapolation Methods for the Solution of the Initial-Value Problem
As discussed in Section 3.4, asymptotic expansions suggest the application of extrapolation methods. Particularly effective extrapolation methods
result from discretization methods which have asymptotic expansions containing only even powers of h. Observe that this is the case for the midpoint
rule or the modified midpoint rule [see (7.2.12.8), (7.2.12.9) or (7.2.12.10),
(7.2.12.13)].
In practice, one uses especially Gragg's function $S(x; h)$ of (7.2.12.10), whose definition we repeat here because of its importance: Given the triple $(f, x_0, y_0)$, a real number $H$, and a natural number $n > 0$, define $x := x_0 + H$, $h := H/n$. For the initial-value problem
$y' = f(x, y), \qquad y(x_0) = y_0,$

with exact solution $y(x)$, one then defines the function value $S(x; h)$ in the following way:

(7.2.14.1)  $\eta_0 := y_0,$
  $\eta_1 := \eta_0 + hf(x_0, \eta_0), \qquad x_1 := x_0 + h,$
  for $j = 1, 2, \ldots, n-1$:  $\eta_{j+1} := \eta_{j-1} + 2hf(x_j, \eta_j), \qquad x_{j+1} := x_j + h,$
  $S(x; h) := \tfrac12\bigl[\eta_n + \eta_{n-1} + hf(x_n, \eta_n)\bigr].$
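A minimal sketch of (7.2.14.1), not from the original text, for a scalar problem; the function name gragg is a hypothetical helper used again in the extrapolation sketch further below.

```python
import math

def gragg(f, x0, y0, H, n):
    """Gragg's function S(x; h) of (7.2.14.1) for y' = f(x, y), y(x0) = y0,
    with x = x0 + H and h = H/n."""
    h = H / n
    eta_prev = y0
    eta = y0 + h * f(x0, y0)                 # Euler starting step
    x = x0 + h
    for j in range(1, n):                    # midpoint steps eta_2, ..., eta_n
        eta_prev, eta = eta, eta_prev + 2 * h * f(x, eta)
        x += h
    return 0.5 * (eta + eta_prev + h * f(x, eta))

print(gragg(lambda x, y: -y, 0.0, 1.0, 1.0, 8), math.exp(-1.0))
```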
In an extrapolation method for the computation of $y(x)$ [see Sections 3.4, 3.5], one then has to select a sequence of natural numbers

$F = \{n_0, n_1, n_2, \ldots\}, \qquad 0 < n_0 < n_1 < n_2 < \cdots,$

and for $h_i := H/n_i$ to compute the values $S(x; h_i)$. Because of the oscillation term $(-1)^{(x-x_0)/h}$ in (7.2.12.13), however, $F$ must contain only even or only odd numbers. Usually, one takes the sequence

(7.2.14.2)  $F = \{2, 4, 6, 8, 12, 16, \ldots\}, \qquad n_i := 2n_{i-2}$ for $i \ge 3.$

As described in Section 3.4, beginning with the 0th column, one then computes by means of interpolation formulas a tableau of numbers $T_{ik}$, one upward diagonal after another:

(7.2.14.3)  $S(x; h_0) =: T_{00}$
  $S(x; h_1) =: T_{10} \qquad T_{11}$
  $S(x; h_2) =: T_{20} \qquad T_{21} \qquad T_{22}$
  $\cdots$

Here

$T_{ik} := \tilde T_{ik}(0)$

is just the value of the interpolating polynomial (rational functions would be better) $\tilde T_{ik}(h)$ of degree $k$ in $h^2$ with $\tilde T_{ik}(h_j) = S(x; h_j)$ for $j = i, i-1, \ldots, i-k$. As shown in Section 3.5, each column of (7.2.14.3) converges to $y(x)$:
$\lim_{i\to\infty} T_{ik} = y(x)$ for $k = 0, 1, \ldots.$

In particular, for $k$ fixed, the $T_{ik}$ for $i \to \infty$ converge to $y(x)$ like a method of order $2k+2$. We have in first approximation, by virtue of (7.2.12.13) [see (3.5.10)],

$T_{ik} - y(x) \approx (-1)^k\,h_i^2 h_{i-1}^2\cdots h_{i-k}^2\,\bigl[u_{k+1}(x) + v_{k+1}(x)\bigr].$

One can further, as described in Section 3.5, exploit the monotonic behavior of the $T_{ik}$ in order to compute explicit estimates of the error $T_{ik} - y(x)$.
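A small sketch of the tableau computation (not from the original text), reusing the gragg helper from the sketch above and extrapolating polynomially in $h^2$ by a Neville-type recursion; the step-number sequence follows (7.2.14.2).

```python
import math

def extrapolate(f, x0, y0, H, F=(2, 4, 6, 8, 12, 16)):
    """Builds the tableau (7.2.14.3) from S(x; h_i), h_i = H/n_i,
    and returns the last diagonal element T_{ii}."""
    T = []                                        # T[i][k] ~ T_ik
    for i, n in enumerate(F):
        row = [gragg(f, x0, y0, H, n)]            # T_{i0} = S(x; h_i)
        for k in range(1, i + 1):
            q = (F[i] / F[i - k]) ** 2            # (h_{i-k}/h_i)^2
            row.append(row[k - 1] + (row[k - 1] - T[i - 1][k - 1]) / (q - 1))
        T.append(row)
    return T[-1][-1]

print(extrapolate(lambda x, y: -y, 0.0, 1.0, 1.0), math.exp(-1.0))
```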
Once a sufficiently accurate $T_{ik} =: \bar\eta$ is found, $\bar\eta$ will be accepted as an approximate value for $y(x)$. Thereafter, one can compute in the same way an approximation to $y(\bar x)$ at an additional point $\bar x = x + \bar H$ by replacing $x_0$, $y_0$, $H$ by $x$, $\bar\eta$, $\bar H$ and by solving the new initial-value problem again, just as described.
We wish to emphasize that the extrapolation method can be applied also for the solution of an initial-value problem (7.0.3), (7.0.4) for systems of $n$ ordinary differential equations. In this case, $f(x,y)$ and $y(x)$ are vectors of functions, and $y_0$, $\eta_i$, and $S(x; h)$ in (7.2.14.1) are vectors in $\mathbb{R}^n$. The asymptotic expansions (7.2.12.9) and (7.2.12.13) remain valid, and they mean that each component of $S(x; h) \in \mathbb{R}^n$ has an asymptotic expansion of the form indicated. The elements $T_{ik}$ of (7.2.14.3), likewise, are vectors in $\mathbb{R}^n$ and are computed as before, component by component, from the corresponding components of $S(x; h_i)$.
In the practical realization of the method, the problem arises of how large the basic steps $H$ are to be chosen. If one chooses $H$ too large, one must compute a very large tableau (7.2.14.3) before a sufficiently accurate $T_{ik}$ is found; $i$ is a large integer, and to find $T_{ik}$ one has to compute $S(x; h_j)$ for $j = 0, 1, \ldots, i$. The computation of $S(x; h_j)$ requires $n_j + 1$ evaluations of the right-hand side $f(x,y)$ of the differential equation. For the sequence $F$ in (7.2.14.2) chosen above, the numbers $s_i := \sum_{j=0}^{i}(n_j + 1)$, and thus the expenditure of work for a tableau with $i+1$ upward diagonals, grow rapidly with $i$: one has $s_{i+1} \approx 1.4\,s_i$.
If, on the other hand, the step $H$ is too small, one takes unnecessarily small, and hence too many, integration steps $(x_0, y(x_0)) \to (x_0 + H, y(x_0 + H))$. It is therefore very important for the efficiency of the method that one incorporate into the method, as in Section 7.2.5, a mechanism for estimating a reasonable stepsize $H$. This mechanism must accomplish two things:
(1) It must guarantee that any stepsize $H$ which is too large is reduced, so that no unnecessarily large tableau is constructed.
(2) It should propose to the user of the method (of the program) a reasonable stepsize $\bar H$ for the next integration step.
However, we don't want to enter into further discussion of such mechanisms and only remark that one can proceed, in principle, very much as
in Section 7.2.5.
An ALGOL program for the solution of initial-value problems by means
of extrapolation methods can be found in Bulirsch and Stoer (1966).
7.2.15 Comparison of Methods for Solving Initial-Value
Problems
The methods we have described fall into three classes:
(1) one-step methods,
(2) multistep methods,
(3) extrapolation methods.
All methods allow a change of the steplength in each integration step; an adjustment of the respective stepsizes in each of them does not run into any inherent difficulties. The modern multistep methods are not employed with fixed orders; nor are the extrapolation methods. In extrapolation methods, for example, one can easily increase the order by appending another column to the tableau of extrapolated values. One-step methods of the Runge-Kutta-Fehlberg type, by construction, are tied to a fixed order, although methods of variable orders can also be constructed with correspondingly more complicated procedures.
In an attempt to find out the advantages and disadvantages of the various integration schemes, computer programs for the methods mentioned
above have been prepared with the utmost care, and extensive numerical experiments have been conducted with a large number of differential
equations. The result may be described roughly as follows:
The least amount of computation, measured in evaluations of the righthand side of the differential equation, is required by the multistep methods.
In a predictor method the right-hand side of the differential equation must
be evaluated only once per step, while in a corrector method this number
is equal to the (generally small) number of iterations. The expense caused
by the step control in multistep methods, however, destroys this advantage
to a large extent. Multistep methods have the largest amount of overhead
time. Advantages accrue particularly in cases where the right-hand side
of the differential equation is built in a very complicated manner (large
amount of computational work to evaluate the right-hand side). In contrast,
extrapolation methods have the least amount of overhead time, but on the other hand sometimes do not react as "sensitively" as one-step methods or multistep methods to changes in the prescribed accuracy tolerance $\varepsilon$: often results are furnished with more correct digits than necessary. The reliability of extrapolation methods is quite high, but for modest accuracy requirements they no longer work economically (are too expensive).
For modest accuracy requirements, Runge-Kutta-Fehlberg methods with low orders of approximation $p$ are to be preferred. Runge-Kutta-Fehlberg methods of certain orders sometimes react less sensitively to discontinuities in the right-hand side of the differential equation than multistep or extrapolation methods. It is true that in the absence of special precautions, the accuracy at a discontinuity in Runge-Kutta-Fehlberg methods is drastically reduced at first, but afterward these methods continue to
function without disturbances. In certain practical problems, this can be
an advantage.
None of the methods is endowed with such advantages that it can be
preferred over all others (assuming that for all methods computer programs
are used that were developed with the greatest care). The question of which
method ought to be employed for a particular problem depends on so many
factors that it cannot be discussed here in full detail; we must refer for this
to the original papers, e.g., Clark (1968), Crane and Fox (1969), Hull et al.
(1972), Shampine et al. (1976), Diekhoff et al. (1977).
7.2.16 Stiff Differential Equations
In the upper atmosphere, ozone decays under the influence of the radiation
of the sun. This process is described by the formulas
of chemical reaction kinetics. The kinetic parameters k j , j = 1, 2, 3, are
either known from direct measurements or are determined indirectly from
the observed concentrations as functions of time by solving an "inverse
problem". If we denote by Yl = [0 3 ], Y2 = [0], and Y3 = [0 2 ] the concentration of the reacting gases, then according to a simple model of physical
chemistry, the reaction can be described by the following set of ordinary
differential equation [ef. Willoughby (1974)]:
Yl(t) = -k 1 YIY3 + k2Y2Y~ - k3YIY2,
Y2(t) =
k 1YIY3 - k2Y2Y~ - k3YIY2,
Y3(t) = -k 1YIY3 + k2Y2Y~ + k3YIY2.
We simplify this system by assuming that the concentration of molecular oxygen $[\mathrm{O}_2]$ is constant, i.e., $\dot y_3 = 0$, and consider the situation where initially the concentration $[\mathrm{O}]$ of the radical O is zero. Then, after inserting the appropriate constants and rescaling of the $k_j$, one obtains the initial value problem
Yl = -Yl - YIY~ + 294Y2,
Y2 = (Yl - YIY2)/98 - 3Y2,
Yl(O) = 1,
Y2(0) = 0,
Typical for chemical kinetics are the widely different velocities (time scales) of the various elementary reactions. This results in reaction constants $k_j$ and coefficients of the differential equations of very different orders of magnitude. A linearization therefore gives linear differential equations with a matrix having a large spread of eigenvalues [see the Gershgorin theorem (6.9.4)]. One therefore has to expect a solution structure characterized by "boundary layers" and "asymptotic phases". The integration of systems of this kind gives rise to peculiar difficulties. The following example will serve as an illustration [cf. Grigorieff (1972, 1977)].
Suppose we are given the system

(7.2.16.1)  $y_1'(x) = \frac{\lambda_1 + \lambda_2}{2}\,y_1 + \frac{\lambda_1 - \lambda_2}{2}\,y_2,$
  $y_2'(x) = \frac{\lambda_1 - \lambda_2}{2}\,y_1 + \frac{\lambda_1 + \lambda_2}{2}\,y_2,$

with negative constants $\lambda_1$, $\lambda_2$: $\lambda_i < 0$. The general solution is

(7.2.16.2)  $y_1(x) = C_1 e^{\lambda_1 x} + C_2 e^{\lambda_2 x},$
  $y_2(x) = C_1 e^{\lambda_1 x} - C_2 e^{\lambda_2 x}, \qquad x \ge 0,$

where $C_1$, $C_2$ are constants of integration. If (7.2.16.1) is integrated by Euler's method [cf. (7.2.1.3)], the numerical solution can be represented in "closed form" as follows,

(7.2.16.3)  $\eta_{1i} = C_1(1 + h\lambda_1)^i + C_2(1 + h\lambda_2)^i,$
  $\eta_{2i} = C_1(1 + h\lambda_1)^i - C_2(1 + h\lambda_2)^i.$
Evidently, the approximations converge to zero as $i \to \infty$ only if the steplength $h$ is chosen small enough to have

(7.2.16.4)  $|1 + h\lambda_1| < 1, \qquad |1 + h\lambda_2| < 1.$

Now let $|\lambda_2|$ be large compared to $|\lambda_1|$. Since $\lambda_2 < 0$, the influence of the component $e^{\lambda_2 x}$ in (7.2.16.2) is negligibly small in comparison with $e^{\lambda_1 x}$. Unfortunately, this is not true for the numerical integration. In view of (7.2.16.4), indeed, the steplength $h > 0$ must be chosen so small that

$h < \frac{2}{|\lambda_2|}.$

For example, if $\lambda_1 = -1$, $\lambda_2 = -1000$, we must have $h \le 0.002$. Thus, even though $e^{-1000x}$ contributes practically nothing to the solution, the factor 1000 in the exponent severely limits the step size. This behavior in the numerical solution is referred to as "stiffness": It is to be expected if for a system of differential equations $y' = f(x,y)$, $y \in \mathbb{R}^n$, the $n\times n$ matrix $f_y(x,y)$ has eigenvalues $\lambda$ with $\operatorname{Re}\lambda \ll 0$.
Euler's method (7.2.1.3) is hardly suitable for the numerical integration of such systems; the same can be said for the Runge-Kutta-Fehlberg methods, multistep methods, and extrapolation methods discussed earlier.
Appropriate methods for integrating stiff differential equations can be derived from so-called implicit methods. As an example, we may take the implicit Euler method,

(7.2.16.5)  $\eta_{i+1} = \eta_i + h\,f(x_{i+1}, \eta_{i+1}).$

The "new" approximation $\eta_{i+1}$ will have to be determined by iteration. The computational effort thus increases considerably.
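The effect of stiffness on the two Euler methods can be seen directly on the linear example above. The sketch below (not from the original text, assuming NumPy) uses $\lambda_1 = -1$, $\lambda_2 = -1000$ and a step $h$ that violates (7.2.16.4).

```python
import numpy as np

lam1, lam2 = -1.0, -1000.0
A = 0.5 * np.array([[lam1 + lam2, lam1 - lam2],
                    [lam1 - lam2, lam1 + lam2]])    # system (7.2.16.1)
y0 = np.array([2.0, 0.0])                            # C_1 = C_2 = 1
h, n = 0.01, 100                                      # h > 2/|lam2|

I = np.eye(2)
y_expl, y_impl = y0.copy(), y0.copy()
for i in range(n):
    y_expl = y_expl + h * (A @ y_expl)                # eta_{i+1} = (I + hA) eta_i
    y_impl = np.linalg.solve(I - h * A, y_impl)       # eta_{i+1} = (I - hA)^{-1} eta_i

y_exact = np.array([np.exp(lam1) + np.exp(lam2), np.exp(lam1) - np.exp(lam2)])
print("explicit Euler:", y_expl)      # blows up, since |1 + h*lam2| = 9 > 1
print("implicit Euler:", y_impl)      # remains close to the exact solution
print("exact at x = 1:", y_exact)
```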
One finds that many methods, applied with constant steplength h > 0
to the linear system of differential equations
(7.2.16.6)
y' = Ay,
y(O) = Yo,
A a constant n x n matrix, will produce a sequence TJi of approximation
vectors for the solution y(Xi) which satisfy a recurrence formula
(7.2.16.7)
TJi+l = g(hA)TJi.
The function g(z) depends only on the method employed and is usually a
rational function in which it is permissible to substitute a matrix for the
argument.
EXAMPLE 1. For the explicit Euler method (7.2.1.3) one finds

$\eta_{i+1} = \eta_i + hA\eta_i = (I + hA)\eta_i,$ whence $g(z) = 1 + z;$

for the implicit Euler method (7.2.16.5),

$\eta_{i+1} = \eta_i + hA\eta_{i+1}$, i.e., $\eta_{i+1} = (I - hA)^{-1}\eta_i,$ whence $g(z) = \frac{1}{1 - z}.$
If one assumes that the matrix $A$ in (7.2.16.6) has only eigenvalues $\lambda_j$ with $\operatorname{Re}\lambda_j < 0$ [cf. (7.2.16.1)], then the solution $y(x)$ of (7.2.16.6) converges to zero as $x \to \infty$, while the discrete solution $\{\eta_i\}$, by virtue of (7.2.16.7), converges to zero as $i \to \infty$ only for those stepsizes $h > 0$ for which $|g(h\lambda_j)| < 1$ for all eigenvalues $\lambda_j$ of $A$.
Inasmuch as the presence of eigenvalues $\lambda_j$ with $\operatorname{Re}\lambda_j \ll 0$ should not force the employment of small stepsizes $h > 0$, a method will be suitable for the integration of stiff differential equations if it is absolutely stable in the following sense:

(7.2.16.8) Definition. A method (7.2.16.7) is called absolutely stable if $|g(z)| < 1$ for all $z$ with $\operatorname{Re} z < 0$.
A more accurate description of the behavior of a method (7.2.16.7) is provided by its region of absolute stability, by which we mean the set

(7.2.16.9)  $M = \{z \in \mathbb{C} : |g(z)| < 1\}.$

The larger the intersection $M \cap \mathbb{C}_-$ of $M$ with the left half-plane $\mathbb{C}_- = \{z \mid \operatorname{Re} z < 0\}$, the more suitable the method is for the integration of stiff differential equations; it is absolutely stable if $M$ contains $\mathbb{C}_-$.

EXAMPLE 2. The region of absolute stability for the explicit Euler method is

$\{z : |1 + z| < 1\};$

for the implicit Euler method,

$\{z : |1 - z| > 1\}.$

The implicit Euler method is thus absolutely stable; the explicit Euler method is not.
Taking A-stability into account, one can then develop one-step, multistep, and extrapolation methods as in the previous Sections 7.2.1, 7.2.9, and 7.2.14. All these methods are implicit or semi-implicit, since only such methods have a proper rational stability function. All explicit methods considered earlier lead to polynomial stability functions and hence cannot be A-stable. The implicit character of all stable methods for solving stiff differential equations implies that one has to solve a linear system of equations in each step at least once (semi-implicit methods), sometimes even repeatedly, resulting in Newton-type iterative methods. In general, the matrix $E$ of these linear equations contains the Jacobian matrix $f_y = f_y(x,y)$ and usually has the form $E = I - h\gamma f_y$ for some number $\gamma$.
Extrapolation methods

In order to derive an extrapolation method for solving a stiff system of the form* $y' = f(y)$, we note that the stiff part of the solution $y(t)$ can be factored off near $t = x$ by putting $c(t) := e^{-A(t-x)}y(t)$, where $A := f_y(y(x))$. Then

$c'(x) = \bar f(y(x)),$ with $\bar f(y) := f(y) - Ay,$

and Euler's method (7.2.1.3), resp. the midpoint rule (7.2.6.9), lead to the approximations

$c(x+h) \approx y(x) + h\bar f(y(x)),$
$c(x+h) \approx c(x-h) + 2h\bar f(y(x)).$

Using that $c(x \pm h) = e^{\mp Ah}y(x \pm h) \approx (I \mp Ah)\,y(x \pm h)$ we obtain a semi-implicit midpoint rule [cf. (7.2.12.8)]

$\eta(x_0; h) := y_0, \qquad A := f_y(y_0),$
$\eta(x_0 + h; h) := (I - hA)^{-1}\bigl[y_0 + h\bar f(y_0)\bigr],$
$\eta(x + h; h) := (I - hA)^{-1}\bigl[(I + hA)\,\eta(x - h; h) + 2h\bar f(\eta(x; h))\bigr]$

for the computation of an approximate solution $\eta(x; h) \approx y(x)$ of the initial value problem $y' = f(y)$, $y(x_0) = y_0$. This rule was used by Bader and Deuflhard (1983) as a basis for extrapolation methods, and they applied it with great success to problems of chemical reaction kinetics.

* Every differential equation can be reduced to this autonomous form: $\bar y' = \bar f(\bar x, \bar y)$ is equivalent to

$\begin{bmatrix} \bar x \\ \bar y \end{bmatrix}' = \begin{bmatrix} 1 \\ \bar f(\bar x, \bar y) \end{bmatrix}.$
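A minimal sketch of the semi-implicit midpoint rule above (not from the original text, assuming NumPy); for illustration the Jacobian is frozen at $y_0$, and the test problem is a diagonal stiff linear system.

```python
import numpy as np

def semi_implicit_midpoint(f, fy, y0, H, n):
    """Semi-implicit midpoint rule for the autonomous system y' = f(y);
    fy(y) is the Jacobian. Returns eta(x0 + H; h), h = H/n."""
    h = H / n
    y0 = np.asarray(y0, dtype=float)
    A = fy(y0)
    I = np.eye(len(y0))
    fbar = lambda y: f(y) - A @ y
    M = I - h * A
    eta_prev = y0
    eta = np.linalg.solve(M, eta_prev + h * fbar(eta_prev))
    for _ in range(n - 1):
        eta_prev, eta = eta, np.linalg.solve(
            M, (I + h * A) @ eta_prev + 2 * h * fbar(eta))
    return eta

# stiff linear test problem y' = diag(-1, -1000) y, y(0) = (1, 1)
lam = np.array([-1.0, -1000.0])
f = lambda y: lam * y
fy = lambda y: np.diag(lam)
print(semi_implicit_midpoint(f, fy, [1.0, 1.0], 1.0, 10))
print(np.exp(lam))                 # exact solution at x = 1
```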
One-step methods

In analogy to the Runge-Kutta-Fehlberg methods [see Equation (7.2.5.6) ff.], Kaps and Rentrop (1979) construct for stiff autonomous systems $y' = f(y)$ methods that are distinguished by a simple structure, efficiency, and robust step control. They have been tested numerically for stiffness ratios as large as

$\Bigl|\frac{\lambda_{\max}}{\lambda_{\min}}\Bigr| = 10^7$ (with 12-digit computation).

As in (7.2.5.6) one takes

(7.2.16.10)  $y_{i+1} = y_i + h_i\Phi_I(y_i; h), \qquad \hat y_{i+1} = y_i + h_i\Phi_{II}(y_i; h),$

with

(7.2.16.11)  $|\Delta(x, y(x); h) - \Phi_I(y(x); h)| \le N_I\,h^3, \qquad |\Delta(x, y(x); h) - \Phi_{II}(y(x); h)| \le N_{II}\,h^4,$

and

$\Phi_I(y; h) = \sum_{k=1}^{3} c_k f_k(y; h), \qquad \Phi_{II}(y; h) = \sum_{k=1}^{4} \hat c_k f_k(y; h),$

where the $f_k := f_k(y; h)$, $k = 1, 2, 3, 4$, have the form

(7.2.16.12)  $f_k = f\Bigl(y + h\sum_{l=1}^{k-1}\beta_{kl}f_l\Bigr) + h f'(y)\sum_{l=1}^{k}\gamma_{kl}f_l.$

For specified constants, the $f_k$ must be determined iteratively from these systems. The constants obey equations similar to those in (7.2.5.9) ff. Kaps and Rentrop provide the following values:
$\gamma_{kk} = 0.220428410$ for $k = 1, 2, 3, 4$,
$\gamma_{21} = 0.822867461,$
$\gamma_{31} = 0.695700194, \qquad \gamma_{32} = 0,$
$\gamma_{41} = 3.90481342, \qquad \gamma_{42} = 0, \qquad \gamma_{43} = 1,$
$\beta_{21} = -0.554591416,$
$\beta_{31} = 0.252787696, \qquad \beta_{32} = 1,$
$\beta_{41} = \beta_{31}, \qquad \beta_{42} = \beta_{32}, \qquad \beta_{43} = 0,$
$c_1 = -0.162871035, \qquad c_2 = 1.18215360, \qquad c_3 = -0.192825995\times 10^{-1},$
$\hat c_1 = 0.545211088, \qquad \hat c_2 = 0.177064668, \qquad \hat c_3 = 0.301486480, \qquad \hat c_4 = -0.237622363\times 10^{-1}.$
hnew = 0.9h 3
A
elhl
IYi+1 - Yi+11
Multistep methods

It was shown by Dahlquist (1963) that there are no A-stable multistep methods of order $r > 2$ and that the implicit trapezoidal rule

$\eta_{n+1} = \eta_n + \frac{h}{2}\bigl(f(x_n, \eta_n) + f(x_{n+1}, \eta_{n+1})\bigr), \qquad n > 0,$

where $\eta_1$ is given by the implicit Euler method (7.2.16.5), is the A-stable method of order 2 with an error coefficient $c = -\tfrac{1}{12}$ of smallest modulus.
Fig. 16. The stability of BDF methods [sketch of the stable set in the complex plane; axes Re(z), Im(z)].
Gear (1971) also showed that the BDF methods up to order $r = 6$ have good stability properties: Their stability regions (7.2.16.9) contain subsets of $\mathbb{C}_- = \{z \mid \operatorname{Re} z < 0\}$ having the shape sketched in Figure 16. These multistep methods all belong to the choice [see (7.2.11.3)]

$\chi(\mu) = b_r\mu^r$

and can be represented by backward difference formulas, which explains their name. Coefficients of the standard representation

$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h\,b_r f(x_{j+r}, \eta_{j+r})$

are listed in the following table [taken from Gear (1971) and Lambert (1973)]. In this table, $\alpha = \alpha_r$ and $c = c_r$ are the parameters indicated in Figure 16 that correspond to the particular BDF method. Byrne and Hindmarsh (1987) report very favorable numerical results for these methods.
More details on stiff differential equations can be found in e.g., Grigorieff
(1972, 1977), Enright, Hull, and Lindberg (1975), and in Willoughby
(1974). For a recent comprehensive treatment, see Hairer and Wanner
(1991).
r | alpha_r | c_r |  b_r    |   a_0     |   a_1     |   a_2      |   a_3      |   a_4      |   a_5
1 |  90°    | 0   |   1     |   -1      |           |            |            |            |
2 |  90°    | 0   |  2/3    |   1/3     |  -4/3     |            |            |            |
3 |  88°    | 0.1 |  6/11   |  -2/11    |   9/11    |  -18/11    |            |            |
4 |  73°    | 0.7 | 12/25   |   3/25    | -16/25    |   36/25    |  -48/25    |            |
5 |  51°    | 2.4 | 60/137  | -12/137   |  75/137   | -200/137   |  300/137   | -300/137   |
6 |  18°    | 6.1 | 60/147  |  10/147   | -72/147   |  225/147   | -400/147   |  450/147   | -360/147
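As an illustration (not from the original text, assuming NumPy), the sketch below implements the BDF method for $r = 2$ from the table, solving the implicit equation in each step by Newton's method; the test problem, starting step and tolerances are merely illustrative.

```python
import numpy as np

def bdf2(f, fy, x0, y0, y1, h, n):
    """BDF2: eta_{j+2} - (4/3) eta_{j+1} + (1/3) eta_j = (2/3) h f(x_{j+2}, eta_{j+2}),
    the implicit equation solved by Newton's method."""
    etas = [np.atleast_1d(np.asarray(y0, float)), np.atleast_1d(np.asarray(y1, float))]
    I = np.eye(len(etas[0]))
    for j in range(n - 1):
        x_new = x0 + (j + 2) * h
        rhs = (4.0 / 3.0) * etas[-1] - (1.0 / 3.0) * etas[-2]
        eta = etas[-1].copy()                         # predictor: last value
        for _ in range(10):                           # Newton iteration
            F = eta - (2.0 / 3.0) * h * f(x_new, eta) - rhs
            J = I - (2.0 / 3.0) * h * fy(x_new, eta)
            delta = np.linalg.solve(J, -F)
            eta = eta + delta
            if np.linalg.norm(delta) < 1e-12:
                break
        etas.append(eta)
    return etas

# stiff scalar test problem y' = -1000 (y - cos x), y(0) = 0
f  = lambda x, y: -1000.0 * (y - np.cos(x))
fy = lambda x, y: np.array([[-1000.0]])
h = 0.01
y1 = 1000.0 * h * np.cos(h) / (1.0 + 1000.0 * h)      # implicit Euler starting step
sol = bdf2(f, fy, 0.0, [0.0], [y1], h, 100)
print(sol[-1], np.cos(1.0))    # solution is attracted to the slow manifold y ~ cos(x)
```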
7.2.17 Implicit Differential Equations. Differential-Algebraic
Equations
So far we have only considered explicit ordinary differential equations $y' = f(x,y)$ (7.0.1). For many modern applications this is too restrictive, and it is important, for reasons of efficiency, to deal with implicit systems

$F(x, y, y') = 0$

directly, without transforming them to explicit form, which may not even be possible.
Large-scale examples of this type abound in many applications, e.g., in the
design of efficient microchips for modern computers. Thousands of transistors
are densely packed on these chips, so that their economical design is not possible
without using efficient numerical techniques. Such a chip can be modeled as a
large complicated electrical network. Kirchhoff's laws for the electrical potentials
and currents in the nodes of this network lead to a huge system of ordinary
differential equations for these potentials and currents as functions of time t. By
solving this system numerically, the behavior of these chips, and their electrical
properties, can be checked before their actual production, that is, such a chip
can be simulated by a computer [see, e.g., Bank, Bulirsch, and Merten (1990);
Horneber (1985)].
The resulting differential equations are implicit, and the following types,
depending on the complexity of the model for a single transistor, can be distinguished:
Linear implicit systems:
(7.2.17.1)  $C\dot U(t) = BU(t) + f(t),$
Linear-implicit nonlinear systems:
(7.2.17.2)  $C\dot U(t) = f(t, U(t)),$
Quasilinear-implicit systems:
(7.2.17.3)  $C(U)\dot U(t) = f(t, U(t)),$
General implicit systems:
(7.2.17.4)  $F(t, U(t), \dot Q(t)) = 0, \qquad Q(t) = C(U)U(t).$
The vector $U$, which may have a very large number of components, describes the electrical potentials of the nodes of the network. The matrix $C$ contains the capacities of the network, which may depend on the voltages, $C = C(U)$. Usually, this matrix is singular and very sparse.
There are already algorithms for solving systems of type (7.2.17.2) [cf. Deuflhard, Hairer, and Zugck (1987); Petzold (1982); Rentrop (1985); Rentrop, Roche, and Steinebach (1989)]. But the efficient and reliable solution of systems of the types (7.2.17.3) and (7.2.17.4) is still the subject of research [see, e.g., Hairer, Lubich, and Roche (1989); Petzold (1982)].
We now describe some basic differences between initial value problems for implicit and explicit systems (we again denote the independent variable by $x$)

(7.2.17.5)  $y'(x) = f(x, y(x)), \qquad y(x_0) := y_0,$

treated so far.
Concerning systems of type (7.2.17.1),

(7.2.17.6)  $Ay'(x) = By(x) + f(x), \qquad y(x_0) = y_0,$

where $y(x) \in \mathbb{R}^n$ and $A$, $B$ are $n\times n$ matrices, the following cases can be distinguished:

1st Case. $A$ regular, $B$ arbitrary.
The formal solution of the associated homogeneous system is given by

$y(x) = e^{(x - x_0)A^{-1}B}\,y_0.$

Even though the system can be reduced to explicit form $y' = A^{-1}By + A^{-1}f$, this is not feasible for large sparse $A$ in practice, since then the inverse $A^{-1}$ usually is a large full matrix and $A^{-1}B$ would be hard to compute and store.
2nd Case. $A$ singular, $B$ regular.
Then (7.2.17.6) is equivalent to

$B^{-1}Ay' = y + B^{-1}f(x).$

The Jordan normal form $J$ of the singular matrix $B^{-1}A = TJT^{-1}$ has the structure

$J = \begin{bmatrix} W & 0 \\ 0 & N \end{bmatrix},$

where $W$ contains all Jordan blocks belonging to the nonzero eigenvalues of $B^{-1}A$, and $N$ the Jordan blocks corresponding to the eigenvalue zero. We say that $N$ has index $k$ if $N$ is nilpotent of order $k$, i.e., $N^k = 0$, $N^j \ne 0$ for $j < k$. The transformation to Jordan normal form decouples the system: In terms of

$\begin{bmatrix} u(x) \\ v(x) \end{bmatrix} := T^{-1}y(x), \qquad \begin{bmatrix} p(x) \\ q(x) \end{bmatrix} := T^{-1}B^{-1}f(x),$

we obtain the equivalent system

$Wu'(x) = u(x) + p(x), \qquad Nv'(x) = v(x) + q(x).$

The partial system for $u$ has the same structure as in Case 1, so it has a unique solution for any initial value $y_0$. This is not true for the second system for $v$: First, this system can only be solved under an additional smoothness assumption on $f$:

$f \in C^{k-1}[x_0, x_{\text{end}}],$

where $k$ is the index of $N$. Then
$v(x) = -q(x) + Nv'(x) = -(q(x) + Nq'(x)) + N^2v''(x) = \cdots = -(q(x) + Nq'(x) + \cdots + N^{k-1}q^{(k-1)}(x)).$

Since $N^k = 0$, this resolution chain for $v$ finally terminates, proving that $v(x)$ is entirely determined by $q(x)$ and its derivatives. Therefore, the following principal differences to (7.2.17.5) can be noted:
(1) The index of the Jordan normal form of $B^{-1}A$ determines the smoothness requirements for $f(x)$.
(2) Not all components of the initial value $y_0$ can be chosen arbitrarily: $v(x_0)$ is already determined by $q(x)$, and therefore by $f(x)$, and its derivatives at $x = x_0$. For the solvability of the system, the initial values $y_0 = y(x_0)$ have to satisfy a consistency condition, and one speaks of consistent initial values in this context. However, the computation of consistent initial values may be difficult in practice.
3rd Case. $A$ and $B$ singular.
Here it is necessary to restrict the investigation to "meaningful" systems. Since there are matrices (the zero matrices $A := B := 0$ provide a trivial example) for which (7.2.17.6) has no unique solution, one considers only pairs of matrices $(A, B)$ that lead to uniquely solvable systems. It is possible to characterize such pairs by means of the concept of a regular matrix pencil: These are pairs $(A, B)$ of matrices such that $\lambda A + B$ is regular for some $\lambda \in \mathbb{C}$. Since then $\det(\lambda A + B)$ is a nonvanishing polynomial in $\lambda$ of degree $\le n$, there are at most $n$ numbers $\lambda$ -- the eigenvalues of the generalized eigenvalue problem [see Section 6.8] $Bx = -\lambda Ax$ -- for which $\lambda A + B$ is singular.
It is possible to show for regular matrix pencils that (7.2.17.6) has at most one solution. Also, there is a closed formula for this solution in terms of the Drazin inverse of matrices [see, e.g., Wilkinson (1982) or Gantmacher (1969)]. We will not pursue this analysis further, as it does not lead to essentially new phenomena compared with Case 2.
If the capacity matrices $C$ of the systems (7.2.17.1) and (7.2.17.2) are singular, then, after a suitable transformation, these systems decompose into a differential equation and a nonlinear system of equations (we denote the independent variable again by $x$):

(7.2.17.7)  $y'(x) = f(x, y(x), z(x)),$
  $0 = g(x, y(x), z(x)) \in \mathbb{R}^{n_2},$
  $y(x_0) \in \mathbb{R}^{n_1}$, $z(x_0) \in \mathbb{R}^{n_2}$: consistent initial values.

Such a decomposed system is called a differential-algebraic system [see, e.g., Griepentrog and März (1986), Hairer and Wanner (1991)]. By the implicit function theorem, it has a unique local solution if the Jacobian

$\frac{\partial g}{\partial z}$ is regular (Index-1 assumption).

Differential-algebraic systems typically occur in models of the dynamics of multibody systems; however, the Index-1 assumption may not always be satisfied [see Gear (1988)].
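A minimal sketch (not from the original text) of how an index-1 system of the form (7.2.17.7) can be integrated: each implicit Euler step solves the coupled nonlinear system for $(y_{i+1}, z_{i+1})$. It assumes NumPy and SciPy (fsolve) are available; the test problem and consistent starting values are illustrative.

```python
import numpy as np
from scipy.optimize import fsolve

def dae_implicit_euler(f, g, y0, z0, x0, h, n):
    """Implicit Euler for the semi-explicit index-1 system (7.2.17.7):
        y' = f(x, y, z),   0 = g(x, y, z)."""
    ny = len(y0)
    y, z, x = np.asarray(y0, float), np.asarray(z0, float), x0
    for _ in range(n):
        x_new = x + h

        def residual(w):
            yn, zn = w[:ny], w[ny:]
            return np.concatenate([yn - y - h * f(x_new, yn, zn),
                                   g(x_new, yn, zn)])

        w = fsolve(residual, np.concatenate([y, z]))
        y, z, x = w[:ny], w[ny:], x_new
    return y, z

# illustrative index-1 problem: y' = -y + z, 0 = z - sin(x); consistent start z(0) = 0
f = lambda x, y, z: -y + z
g = lambda x, y, z: z - np.array([np.sin(x)])
y, z = dae_implicit_euler(f, g, [1.0], [0.0], 0.0, 0.01, 100)
print(y, z, np.sin(1.0))
```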
Numerical methods

The differential-algebraic system (7.2.17.7) can be viewed as a differential equation on a manifold. Therefore, such systems can be successfully solved by methods combining solvers for differential equations [see Section 7.2], for nonlinear equations [see Section 5.4], and continuation methods.
Extrapolation and Runge-Kutta type methods can be derived by imbedding the nonlinear equation of (7.2.17.7) formally into a differential equation

$\varepsilon z'(x) = g(x, y(x), z(x))$

and then considering the limiting case $\varepsilon = 0$. Efficient and reliable one-step and extrapolation methods for solving differential-algebraic systems are described in Deuflhard, Hairer, and Zugck (1987). The method (7.2.16.10)-(7.2.16.12) for solving stiff systems can be modified into a method for solving implicit systems of the form (7.2.17.2). The following method of order 4 with stepsize control is described in Rentrop, Roche, and Steinebach (1989): In the notation of Section 7.2.16, and for an autonomous implicit system $Cy' = f(y)$, the method has the form [cf. (7.2.16.12)]

$Cf_k = f\Bigl(y + h\sum_{l=1}^{k-1}\beta_{kl}f_l\Bigr) + hf'(y)\sum_{l=1}^{k}\gamma_{kl}f_l, \qquad k = 1, 2, \ldots, 5,$

where the coefficients are given in Table (7.2.17.8).
Multistep methods can also be used for the numerical solution of differential-algebraic systems. Here, we only refer to the program package DASSL [see Petzold (1982)] for solving implicit systems of the form

$F(x, y(x), y'(x)) = 0.$

There, the derivative $y'(x)$ is replaced by a BDF formula, which leads to a nonlinear system of equations. Concerning details of solution strategies and their implementation, we refer to the literature, e.g., Byrne and Hindmarsh (1987), Gear and Petzold (1984), and Petzold (1982).
0.5 for k = 1, 2, 3, 4, 5,
4.0,
{21 =
0.46296 ... ,
{32= -1.5,
{31=
{41 =
2.2083 ... ,
{42 = -l.89583 ... ,
{Sl = -12.00694 ... ,
{52=
7.30324074 ... ,
{S4 = -15.851 ... ,
{kk=
,1321 =
,1331 =
0,
0.25,
,1341 = -0.0052083 ... ,
,1351 =
l.200810185 ... ,
l.0,
,1354=
C1
C4
C1
C4
0.538 ... ,
0.5925 ... ,
0.4523148 ... ,
0.5037 ... ,
0.25,
0.1927083 ... ,
f3S2= -1.950810185 ... ,
{43= -2.25,
{53=
19.0,
,1343=
0.5625,
0.25,
,1332=
,1342=
C2
= -0.13148 ... ,
.f353 =
C3
0.2,
C2 = -0.1560185 ... ,
Cs
0.2.
= -0.2,
C5
0,
Table (7.2.17.8)
7.2.18 Handling Discontinuities in Differential Equations
So far, the methods for solving initial value problems

$y'(x) = f(x, y(x)), \qquad y(x_0) = y_0,$

assumed that the right-hand side $f$ of the differential equation, and therefore its solution $y(x)$, is sufficiently often differentiable. However, there are many applied problems where $f$ and $y$ are discontinuous. These discontinuities are often due to the underlying tasks, e.g., in friction and contact problems of mechanics, or in process engineering if one admits different charges of input. But they may also be caused by unavoidable simplifications of a model, e.g., if one represents discrete tabular data found in experiments by an interpolating function of low order of differentiability.
Here, we consider initial value problems of the form

(7.2.18.1)  $y'(x) = f_n(x, y(x)), \qquad x_{n-1} < x < x_n, \quad n = 1, 2, \ldots,$
  $y(x_0) = y_0.$

We assume that the initial values $y_n^+ = \lim_{x\to x_n^+} y(x)$ at the points of discontinuity, the "switching points" $x_n$, are determined by known transition functions $\phi_n(x, y)$ as follows:

$y_n^+ = \phi_n(x_n, y_n^-), \qquad$ where $y_n^- = \lim_{x\to x_n^-} y(x).$
It would be easy to solve (7.2.18.1), say by one-step methods, if the switching points $x_n$ were known a priori. One would have to choose the integration points $\xi_i$ of these methods, $x_0 =: \xi_0 < \xi_1 < \cdots$, which are usually determined by a stepsize control technique alone [see Section 7.2.5], in such a way that all switching points $x_n$ occur among them, and to compute at $x_n$ the new starting ordinate $y_n^+$ by means of the transition function.
However, there are many problems where the switching points are not known a priori: They are usually characterized as zeros of known real-valued switching functions:

(7.2.18.2)  $q_n(x, y(x)) = 0, \qquad n = 1, 2, \ldots,$

that depend on the value $y(x_n^-)$ of the solution $y$. In what follows we suppose that these zeros are simple zeros. In these cases one has to compute $y(\cdot)$ and the switching points concurrently. Essentially, one has to check for each prospective integration interval $[\xi_i, \xi_{i+1}]$ whether it contains a switching point $x_n$, and, if this is the case, to compute $x_n$, to interrupt the integration at $x_n$ [i.e., replace $\xi_{i+1}$ by $x_n$], to compute the starting ordinate $y_n^+$, and resume the integration with the new starting point $(x_n, y_n^+)$ and the new right-hand side $f_{n+1}$ of (7.2.18.1).
Efficient methods for finding the switching points are as follows [cf., e.g., Eich (1992)]:
Let η^{i-1} and η^i be numerical approximations of the solution of (7.2.18.1) at the points x^{i-1} and x^i, obtained in the (i−1)st and ith integration step, respectively, using the right hand side f_n(x, y(x)). Also let ỹ(x) be a continuous approximation of the solution y(x) for x ∈ [x^{i-1}, x^i], and assume that the switching function q_n has a zero x_n with x^{i-1} < x_n < x^i. Moreover, we assume that η^i exists and therefore also the approximation ỹ(x) on the whole interval [x^{i-1}, x^i].
If y(x) is sufficiently well approximated on [x^{i-1}, x_n + ε] [here ε > 0 accounts for a possible shift of the zero x_n due to roundoff], then the function q_n(x, ỹ(x)), too, has a zero x̃_n in [x^{i-1}, x_n + ε], which can be used as an approximation of the switching point x_n. The zero x̃_n can be determined by an appropriate algorithm, e.g., by the secant method or by inverse interpolation [see Section 5.9]. Suitable generalizations [safeguard methods] of these algorithms are described in Gill et al. (1995).
It would be too expensive to search for a possible zero of a switching function between every two integration steps. If the integration stepsize is small enough so that each switching function has at most one zero in the integration interval, it is sufficient to monitor only the sign changes of the switching functions on these intervals. If more than one switching function changes its sign, the switching of the differential equation (7.2.18.1) is effected by the zero of that switching function which changes its sign first.
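The following Python sketch illustrates this strategy under simplifying assumptions: the continuous approximation ỹ(x) is taken to be the linear interpolant between the two numerical values η^{i-1}, η^i (a proper dense output of the integrator would be used in practice), and the zero is located by plain bisection rather than by the secant method or inverse interpolation of Section 5.9. The function name and the toy switching function are illustrative only, not taken from any library.

import numpy as np

def locate_switching_point(q, x_prev, eta_prev, x_curr, eta_curr, tol=1e-12):
    """Locate a zero of the switching function q(x, y) inside one integration
    step [x_prev, x_curr], assuming q changes sign there.  The continuous
    approximation of y is the linear interpolant between eta_prev and eta_curr."""
    def phi(x):
        t = (x - x_prev) / (x_curr - x_prev)
        return q(x, (1.0 - t) * eta_prev + t * eta_curr)

    a, b = x_prev, x_curr
    fa, fb = phi(a), phi(b)
    if fa * fb > 0.0:
        return None                      # no sign change: no switching point detected
    # plain bisection; a secant or inverse-interpolation step would converge faster
    while b - a > tol:
        c = 0.5 * (a + b)
        fc = phi(c)
        if fa * fc <= 0.0:
            b, fb = c, fc
        else:
            a, fa = c, fc
    return 0.5 * (a + b)

# toy usage: y' = -y, y(0) = 1, switching function q(x, y) = y - 0.5
if __name__ == "__main__":
    q = lambda x, y: float(y[0]) - 0.5
    xs = locate_switching_point(q, 0.6, np.array([0.549]), 0.8, np.array([0.449]))
    print(xs)   # close to ln 2 = 0.693... (up to the interpolation error)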
7.2.19 Sensitivity Analysis of Initial-Value Problems
Initial value problems often depend on real parameters p = (p_1, ..., p_{n_p}),
(7.2.19.1)    y'(x; p) = f(x, y(x; p), p),    y(x_0; p) = y_0(p).
Possibly, even small changes of these parameters may produce large changes in the solution y(x; p). Thus, it is important to carefully investigate the parameter dependence. This may add valuable insight into the process described by the differential equation. Mainly, such an investigation requires computing, along with a solution trajectory y(x; p), also its sensitivity (condition numbers) given by the matrix
    ∂y(x; p) / ∂p.
If the parameters are genuine model parameters, then this information allows one to judge the quality of the model. Sensitivities also play a dominant role in the solution of optimal control problems, such as the parametrization of control functions or parameter identification [Heim and von Stryk (1996), Engl et al. (1999)]. There are several numerical methods to compute these sensitivities:
- approximation by difference quotients,
- solution of the sensitivity equations,
- solution of adjoint equations.
With the first method, the sensitivities are approximated by
    ∂y(x; p)/∂p_i ≈ [ y(x; p + Δp_i e_i) − y(x; p) ] / Δp_i,    i = 1, ..., n_p.
Here e_i is the ith unit vector of ℝ^{n_p}. The implementation is easy and not too expensive. When using an integration method with order and/or stepsize control, the computation of the "perturbed trajectory" y(x; p + Δp_i e_i) must use the same order/stepsize sequence used for the computation of the reference trajectory y(x; p) [see Buchauer et al. (1994), Kiehl (1999)]. If higher accuracy is needed, or if the number n_p of parameters is large, the other two methods are preferable.
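The following Python sketch illustrates the difference-quotient approach under simplifying assumptions: a classical Runge-Kutta integrator with a fixed number of steps is used, so that the reference and perturbed trajectories trivially share the same step sequence (with an order/stepsize-controlled method, the sequence of the reference trajectory would have to be reused explicitly). All function names are illustrative, not part of any library.

import numpy as np

def rk4(f, x0, y0, x_end, n_steps):
    """Classical Runge-Kutta integration with a fixed number of steps, so that
    reference and perturbed trajectories share the same step sequence."""
    h = (x_end - x0) / n_steps
    x, y = x0, np.array(y0, dtype=float)
    for _ in range(n_steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, y + h / 2 * k1)
        k3 = f(x + h / 2, y + h / 2 * k2)
        k4 = f(x + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        x += h
    return y

def sensitivities_fd(f, y0_of_p, p, x0, x_end, n_steps=200, dp_rel=1e-6):
    """Approximate the columns dy(x_end; p)/dp_i by forward difference quotients."""
    p = np.asarray(p, dtype=float)
    y_ref = rk4(lambda x, y: f(x, y, p), x0, y0_of_p(p), x_end, n_steps)
    cols = []
    for i in range(p.size):
        dp = dp_rel * max(abs(p[i]), 1.0)
        pp = p.copy(); pp[i] += dp
        y_per = rk4(lambda x, y: f(x, y, pp), x0, y0_of_p(pp), x_end, n_steps)
        cols.append((y_per - y_ref) / dp)
    return np.column_stack(cols)

# usage: y' = p1*y, y(0) = p2; then dy(1)/dp1 = p2*exp(p1), dy(1)/dp2 = exp(p1)
if __name__ == "__main__":
    f = lambda x, y, p: p[0] * y
    S = sensitivities_fd(f, lambda p: np.array([p[1]]), [0.5, 2.0], 0.0, 1.0)
    print(S)   # approximately [[3.297, 1.649]]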
By Theorem (7.1.8), the following sensitivity equations are associated with (7.2.19.1):
(7.2.19.2)    ∂y'(x; p)/∂p = ∂f(x, y; p)/∂y · ∂y(x; p)/∂p + ∂f(x, y; p)/∂p,
              ∂y(x_0; p)/∂p = ∂y_0(p)/∂p
[see Hairer et al. (1993) for details].
These equations establish the function
    z(x) := ∂y(x; p)/∂p,
which is an n × n_p matrix function if y ∈ ℝⁿ, as the solution of an initial value problem for a system of linear differential equations. This can be used to implement a very efficient method for solving (7.2.19.2).
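A minimal Python sketch of this approach: the sensitivity system Z' = f_y Z + f_p of (7.2.19.2) is appended to the original differential equation and both are integrated together; here the general-purpose integrator scipy.integrate.solve_ivp stands in for a method tailored to the linear structure of the sensitivity equations. Function names and the toy problem are illustrative only.

import numpy as np
from scipy.integrate import solve_ivp

def solve_with_sensitivities(f, fy, fp, y0, dy0_dp, x_span, n, n_p):
    """Integrate y' = f(x, y) together with the sensitivity system
    Z' = f_y Z + f_p, Z(x0) = dy0/dp, by stacking y and the columns of Z."""
    def rhs(x, u):
        y = u[:n]
        Z = u[n:].reshape(n, n_p)
        dZ = fy(x, y) @ Z + fp(x, y)          # linear in Z
        return np.concatenate([f(x, y), dZ.ravel()])

    u0 = np.concatenate([np.asarray(y0, float), np.asarray(dy0_dp, float).ravel()])
    sol = solve_ivp(rhs, x_span, u0, rtol=1e-9, atol=1e-12)
    y_end = sol.y[:n, -1]
    Z_end = sol.y[n:, -1].reshape(n, n_p)
    return y_end, Z_end

# usage: y' = p*y, y(0) = 1 (n = n_p = 1); exact sensitivity dy(1)/dp = e^p
if __name__ == "__main__":
    p = 0.5
    y_end, Z_end = solve_with_sensitivities(
        f=lambda x, y: p * y,
        fy=lambda x, y: np.array([[p]]),
        fp=lambda x, y: np.array([[y[0]]]),
        y0=[1.0], dy0_dp=np.zeros((1, 1)),
        x_span=(0.0, 1.0), n=1, n_p=1)
    print(y_end, Z_end)   # both approximately 1.6487 = e^0.5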
For the method based on adjoint differential equations, we refer the reader to the literature [e.g., Morrison and Sargent (1986)].
High precision approximations of sensitivities can be computed very efficiently both by integrating the sensitivity equations and the system of adjoint equations. A basic tool is internal numerical differentiation (IND), which employs the differentiation of the recursions used for the numerical computation of the reference trajectory [Leis and Kramer (1985), Caracotsios and Stewart (1985), Bock et al. (1995)]. By contrast, external numerical differentiation (END) does not change the integration method (up to the adaptation of the order/stepsize control).
If the initial value problem contains discontinuities [see Section 7.2.18], the sensitivities have to be corrected at the switching points. Such corrections for systems of ordinary differential equations have been described by Rozenvasser (1967), and for initial value problems for differential-algebraic systems of index 1 [see Section 7.2.17] by Galán, Feehery, and Barton (1998).
7.3 Boundary-Value Problems
7.3.0 Introduction
More general than initial-value problems are boundary-value problems. In these one seeks a solution y(x) of a system of n ordinary differential equations,
(7.3.0.1a)    y' = f(x, y),    f(x, y) = [ f_1(x, y_1, ..., y_n)
                                              ⋮
                                           f_n(x, y_1, ..., y_n) ],
satisfying a boundary condition of the form
(7.3.0.1b)    A y(a) + B y(b) = c.
Here, a ≠ b are given numbers, A, B square matrices of order n, and c a vector in ℝⁿ. In practice, the boundary conditions are usually separated:
(7.3.0.1b')    A_1 y(a) = c_1,    B_2 y(b) = c_2,
i.e., in (7.3.0.1b) the rows of the matrix [A, B, c] can be permuted such that the rearranged matrix [Ā, B̄, c̄] has the form
    [Ā, B̄, c̄] = [ A_1   0    c_1
                    0    B_2  c_2 ].
The boundary conditions (7.3.0.1b) are linear (more precisely, affine) in y(a), y(b).
Occasionally, in practice, one also encounters nonlinear boundary conditions of the type
(7.3.0.1b'')    r(y(a), y(b)) = 0,
which are formed by means of a vector r of n functions r_i, i = 1, ..., n, of 2n variables:
    r(u, v) = [ r_1(u_1, ..., u_n, v_1, ..., v_n)
                   ⋮
                r_n(u_1, ..., u_n, v_1, ..., v_n) ].
Even separated linear boundary-value problems are still very general. Thus, initial-value problems, e.g., can be thought of as special boundary-value problems of this type (with A = I, a = x_0, c = y_0, B = 0). Whereas initial-value problems are normally uniquely solvable [see Theorem (7.1.1)], boundary-value problems can also have no solution or several solutions.
EXAMPLE. The differential equation
(7.3.0.2a)    w'' + w = 0
for the real function w: ℝ → ℝ, with the notation y_1(x) := w(x), y_2(x) := w'(x), can be written in the form (7.3.0.1a),
    (y_1, y_2)' = (y_2, −y_1).
It has the general solution
    w(x) = c_1 sin x + c_2 cos x,    c_1, c_2 arbitrary.
The special solution w(x) := sin x is the only solution satisfying the boundary conditions
(7.3.0.2b)    w(0) = 0,    w(π/2) = 1.
All functions w(x) := c_1 sin x, with c_1 arbitrary, satisfy the boundary conditions
(7.3.0.2c)    w(0) = 0,    w(π) = 0,
while there is no solution w(x) of (7.3.0.2a) obeying the boundary conditions
(7.3.0.2d)    w(0) = 0,    w(π) = 1.
[Observe that all boundary conditions (7.3.0.2b-d) have the form (7.3.0.1b') with A_1 = B_2 = [1, 0].]
The preceding example shows that there will be no theorem, such as (7.1.1), of similar generality, for the existence and uniqueness of solutions to boundary-value problems [see, in this connection, Section 7.3.3].
Many practically important problems can be reduced to boundary-value problems (7.3.0.1). Such is the case, e.g., for
    eigenvalue problems for differential equations,
in which the right-hand side f of a system of n differential equations depends on a parameter λ,
(7.3.0.3a)    y' = f(x, y, λ),
and one has to satisfy n + 1 boundary conditions of the form
(7.3.0.3b)    r(y(a), y(b), λ) = 0,
where r(u, v, λ) is a vector of n + 1 functions r_i(u, v, λ), i = 1, ..., n + 1.
The problem (7.3.0.3) is overdetermined and therefore, in general, has no solution for an arbitrary choice of λ. The eigenvalue problem in (7.3.0.3) consists in determining those numbers λ_i, the "eigenvalues" of (7.3.0.3), for which (7.3.0.3) does have a solution. Through the introduction of an additional function
    y_{n+1}(x) := λ
and an additional differential equation
    y'_{n+1}(x) = 0,
(7.3.0.3) is seen to be equivalent to a problem
    ỹ' = f̃(x, ỹ),    r̃(ỹ(a), ỹ(b)) = 0,
which now has the form (7.3.0.1), with the augmented vector
    ỹ := [y, y_{n+1}]ᵀ
and correspondingly augmented right-hand side f̃ and boundary function r̃.
Furthermore, so-called
    boundary-value problems with free boundary
are also reducible to (7.3.0.1). In these problems only one boundary abscissa, a, is prescribed, while b is to be determined so that the system of n ordinary differential equations
(7.3.0.4a)    y' = f(x, y)
has a solution y satisfying the n + 1 boundary conditions
(7.3.0.4b)    r(y(a), y(b)) = 0,    r(u, v) = [ r_1(u, v)
                                                  ⋮
                                               r_{n+1}(u, v) ].
Here, in place of x, one introduces a new independent variable t and a constant z_{n+1} := b − a, yet to be determined, by means of
    x − a = t·z_{n+1},    0 ≤ t ≤ 1,    ż_{n+1} = dz_{n+1}/dt = 0.
[Instead of this choice of parameters, any substitution of the form x − a = Φ(t, z_{n+1}) with Φ(1, z_{n+1}) = z_{n+1} is also suitable.] For z(t) := y(a + t·z_{n+1}), where y(x) is a solution of (7.3.0.4), one then obtains
(7.3.0.5)    z'(t) = z_{n+1} · f(a + t·z_{n+1}, z(t)),    z'_{n+1}(t) = 0,    r(z(0), z(1)) = 0,
and (7.3.0.4) is thus equivalent to a boundary-value problem of the type (7.3.0.1) for the functions z_i(t), i = 1, ..., n + 1.
7.3.1 The Simple Shooting Method
We want to explain the simple shooting method first by means of an example. Suppose we are given the boundary-value problem
(7.3.1.1)    w'' = f(x, w, w'),    w(a) = α,    w(b) = β,
with separated boundary conditions. The initial-value problem
(7.3.1.2)    w'' = f(x, w, w'),    w(a) = α,    w'(a) = s
in general has a uniquely determined solution w(x) ≡ w(x; s), which of course depends on the choice of the initial value s for w'(a). To solve the boundary-value problem (7.3.1.1), we must determine s = s̄ so as to satisfy the second boundary condition, w(b) = w(b; s̄) = β. In other words: one has to find a zero s̄ of the function F(s) := w(b; s) − β. For every argument s the function F(s) can be computed. For this, one has to determine (e.g., with the methods of Section 7.2) the value w(b) = w(b; s) of the solution w(x; s) of the initial-value problem (7.3.1.2) at the point x = b. The computation of F(s) thus amounts to the solution of an initial-value problem.
For determining a zero s̄ of F(s) one can use, in principle, any method of Chapter 5. If one knows, e.g., values s^(0), s^(1) with
    F(s^(0)) < 0,    F(s^(1)) > 0
[see Figure 17], one can compute s̄ by means of a simple bisection method [see Section 5.6].
Fig. 17. Simple shooting.
Since w(b; s), and hence F(s), are in general [see Theorem (7.1.8)] continuously differentiable functions of s, one can also use Newton's method to determine s̄. Starting with an initial approximation s^(0), one then has to iteratively compute values s^(i) according to the prescription
(7.3.1.3)    s^(i+1) = s^(i) − F(s^(i)) / F'(s^(i)).
w(b; s^(i)), and thus F(s^(i)), can be determined by solving the initial-value problem
(7.3.1.4)    w'' = f(x, w, w'),    w(a) = α,    w'(a) = s^(i).
The value of the derivative of F,
    F'(s) = ∂w(b; s)/∂s,
for s = s^(i) can be obtained, e.g., by adjoining an additional initial-value problem. With the aid of Theorem (7.1.8) one easily verifies that the function v(x) := v(x; s) = (∂/∂s) w(x; s) satisfies
(7.3.1.5)    v'' = f_w(x, w, w') v + f_{w'}(x, w, w') v',    v(a) = 0,    v'(a) = 1.
Because of the partial derivatives f_w, f_{w'}, the initial-value problem (7.3.1.5) is in general substantially more complex than (7.3.1.4). For this reason, one often replaces the derivative F'(s^(i)) in Newton's formula (7.3.1.3) by a difference quotient ΔF(s^(i)),
    ΔF(s^(i)) := [ F(s^(i) + Δs^(i)) − F(s^(i)) ] / Δs^(i),
where Δs^(i) is chosen "sufficiently" small. F(s^(i) + Δs^(i)) is computed, like F(s^(i)), by solving an initial-value problem. The following difficulties then arise:
If Δs^(i) is chosen too large, ΔF(s^(i)) is a poor approximation to F'(s^(i)), and the iteration
(7.3.1.3a)    s^(i+1) = s^(i) − F(s^(i)) / ΔF(s^(i))
converges toward s̄ considerably more slowly (if at all) than (7.3.1.3). If Δs^(i) is chosen too small, then F(s^(i) + Δs^(i)) ≈ F(s^(i)), and the subtraction F(s^(i) + Δs^(i)) − F(s^(i)) is subject to cancellation, so that even small errors in the calculation of F(s^(i)) and F(s^(i) + Δs^(i)) strongly impair the result ΔF(s^(i)).
The solution of the initial-value problems (7.3.1.4), i.e., the calculation of F, therefore has to be carried out as accurately as possible. The relative error of F(s^(i)) and F(s^(i) + Δs^(i)) is only allowed to have the order of magnitude of the machine precision eps. Such accuracy can be attained with extrapolation methods [see Section 7.2.14]. If Δs^(i) is then further chosen so that in t-digit computation F(s^(i)) and F(s^(i) + Δs^(i)) have approximately the first t/2 digits in common, Δs^(i) ΔF(s^(i)) ≈ √eps · F(s^(i)), then the effects of cancellation are still tolerable. As a rule, this is the case with the choice Δs^(i) = √eps · s^(i).
EXAMPLE. Consider the boundary-value problem
    w'' = (3/2) w²,    w(0) = 4,    w(1) = 1.
Following (7.3.1.2), one considers the solution of the initial-value problem
    w'' = (3/2) w²,    w(0; s) = 4,    w'(0; s) = s.
The graph of F(s) := w(1; s) − 1 is shown in Figure 18.
Fig. 18. Graph of F(s) = w(1; s) − 1.
It is seen that F(s) has two zeros s̄₁, s̄₂. The iteration according to (7.3.1.3a) yields
    s̄₁ = −8.0000000000,    s̄₂ = −35.8585487278.
Fig. 19. The solutions w₁ and w₂.
Figure 19 shows the graphs of the two solutions w_i(x) = w(x; s̄_i), i = 1, 2, of the boundary-value problem. The solutions were computed to about ten decimal digits. Both solutions, incidentally, can be expressed in closed form in terms of the Jacobian elliptic function cn(ξ | k²) with modulus
    k = √(2 + √3) / 2.
From the theory of elliptic functions, and by using an iterative method of Section 5.2, one obtains for the constants of integration C₁, C₂ the values
    C₁ = 4.30310990...,    C₂ = 2.33464196... .
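For illustration, the following Python sketch carries out the computation of this example with the difference-Newton iteration (7.3.1.3a) and the choice Δs^(i) = √eps · s^(i) discussed above; a fixed-step Runge-Kutta integrator replaces the extrapolation methods recommended in the text, and the starting guesses are assumed to lie near the two zeros. Function names are illustrative only.

import numpy as np

def integrate(s, n_steps=2000):
    """Integrate w'' = (3/2) w^2, w(0) = 4, w'(0) = s with classical RK4
    and return w(1; s)."""
    h = 1.0 / n_steps
    y = np.array([4.0, s])                       # y = (w, w')
    f = lambda y: np.array([y[1], 1.5 * y[0] ** 2])
    for _ in range(n_steps):
        k1 = f(y); k2 = f(y + h / 2 * k1); k3 = f(y + h / 2 * k2); k4 = f(y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y[0]

def shoot(s, eps=1e-12, n_iter=30):
    """Newton iteration (7.3.1.3a) with F'(s) replaced by the difference
    quotient Delta F(s), Delta s = sqrt(eps) * s."""
    F = lambda s: integrate(s) - 1.0
    for _ in range(n_iter):
        ds = np.sqrt(eps) * s if s != 0.0 else np.sqrt(eps)
        dF = (F(s + ds) - F(s)) / ds
        step = F(s) / dF
        s -= step
        if abs(step) < 1e-10:
            break
    return s

if __name__ == "__main__":
    print(shoot(-7.0))    # starting near the first zero: converges to -8.0
    print(shoot(-36.0))   # starting near the second zero: converges to about -35.86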
For the solution of a general boundary-value problem (7.3.0.1) involving n unknown functions y_i(x), i = 1, ..., n,
(7.3.1.6)    y' = f(x, y),    r(y(a), y(b)) = 0,
where f(x, y) and r(u, v) are vectors of n functions, one proceeds as in the example above. One tries again to determine a starting vector s ∈ ℝⁿ for the initial-value problem
(7.3.1.7)    y' = f(x, y),    y(a) = s
in such a way that the solution y(x) = y(x; s) obeys the boundary conditions of (7.3.1.6),
    r(y(a; s), y(b; s)) = r(s, y(b; s)) = 0.
One thus has to find a solution s = [σ₁, σ₂, ..., σₙ]ᵀ of the equation
(7.3.1.8)    F(s) = 0,    F(s) := r(s, y(b; s)).
This can be done, e.g., by means of the general Newton's method (5.1.6),
(7.3.1.9)    s^(i+1) = s^(i) − DF(s^(i))⁻¹ F(s^(i)).
In each iteration step, therefore, one has to compute F(s^(i)), the Jacobian matrix DF(s^(i)),
and the solution d^(i) := s^(i) − s^(i+1) of the linear system of equations DF(s^(i)) d^(i) = F(s^(i)). For the computation of F(s) = r(s, y(b; s)) at s = s^(i) one must determine y(b; s^(i)), i.e., solve the initial value problem (7.3.1.7) for s = s^(i). For the computation of DF(s^(i)) one observes
(7.3.1.10)    DF(s) = D_u r(s, y(b; s)) + D_v r(s, y(b; s)) · Z(b; s),
with the matrices
(7.3.1.11)    D_u r(u, v) = [∂r_i(u, v)/∂u_j],    D_v r(u, v) = [∂r_i(u, v)/∂v_j],
              Z(b; s) = D_s y(b; s) = [∂y_i(b; s)/∂σ_j].
In the case of nonlinear functions r(u, v), however, one will not compute DF(s) by means of these complicated formulas, but instead will approximate it by means of difference quotients. Thus, DF(s) will be approximated by the matrix
(7.3.1.12)    ΔF(s) := [Δ₁F(s), ..., ΔₙF(s)],
              Δ_j F(s) := [ F(σ₁, ..., σ_j + Δσ_j, ..., σₙ) − F(σ₁, ..., σₙ) ] / Δσ_j.
In view of F(s) = r(s, y(b; s)), the calculation of Δ_j F(s) of course will require that y(b; s) = y(b; σ₁, ..., σₙ) and y(b; σ₁, ..., σ_j + Δσ_j, ..., σₙ) be determined through the solution of the corresponding initial-value problems.
For linear boundary conditions (7.3.0.1b),
    r(u, v) ≡ Au + Bv − c,
the formulas simplify somewhat. One has
    F(s) ≡ As + B y(b; s) − c,    DF(s) ≡ A + B Z(b; s).
In this case, in order to form DF(s), one needs to determine the matrix
    Z(b; s) = [ ∂y(b; s)/∂σ₁, ..., ∂y(b; s)/∂σₙ ].
As just described, the jth column ∂y(b; s)/∂σ_j of Z(b; s) is replaced by a difference quotient
    Δ_j y(b; s) := [ y(b; σ₁, ..., σ_j + Δσ_j, ..., σₙ) − y(b; σ₁, ..., σ_j, ..., σₙ) ] / Δσ_j.
One obtains the approximation
(7.3.1.13)    ΔF(s) = A + B Δy(b; s),    Δy(b; s) := [Δ₁y(b; s), ..., Δₙy(b; s)].
Therefore, to carry out the approximate Newton's method
(7.3.1.14)    s^(i+1) = s^(i) − ΔF(s^(i))⁻¹ F(s^(i)),
the following has to be done:
(0) Choose a starting vector s^(0).
For i = 0, 1, 2, ...:
(1) Determine y(b; s^(i)) by solving the initial-value problem (7.3.1.7) for s = s^(i), and compute F(s^(i)) = r(s^(i), y(b; s^(i))).
(2) Choose (sufficiently small) numbers Δσ_j ≠ 0, j = 1, ..., n, and determine y(b; s^(i) + Δσ_j e_j) by solving the n initial-value problems (7.3.1.7) for s = s^(i) + Δσ_j e_j = [σ₁^(i), ..., σ_j^(i) + Δσ_j, ..., σₙ^(i)]ᵀ, j = 1, 2, ..., n.
(3) Compute ΔF(s^(i)) by means of (7.3.1.12) [or (7.3.1.13)], and also the solution d^(i) of the system of linear equations ΔF(s^(i)) d^(i) = F(s^(i)), and put s^(i+1) := s^(i) − d^(i).
In each step of the method one thus has to solve n + 1 initial-value problems and an nth-order system of linear equations.
In view of the merely local convergence of the (approximate) Newton's method (7.3.1.14), the method will in general diverge unless the starting vector s^(0) is already sufficiently close to a solution s̄ of F(s) = 0 [see Theorem (5.3.2)]. Since, as a rule, such initial values are not known, the method in the form (7.3.1.9), or (7.3.1.14), is not very useful in practice. For this reason, one replaces (7.3.1.9), or (7.3.1.14), by the modified Newton method [see (5.4.2.4)], which usually converges even for starting vectors s^(0) that are not particularly good (if the boundary-value problem is at all solvable).
7.3.2 The Simple Shooting Method for Linear Boundary-Value
Problems
By substituting ΔF(s^(i)) for DF(s^(i)) in (7.3.1.9), one generally loses the (local) quadratic convergence of Newton's method. The substitute method (7.3.1.14) as a rule converges only linearly (locally), the convergence being the faster, the better ΔF(s^(i)) approximates DF(s^(i)).
In the special case of linear boundary-value problems, one now has DF(s) = ΔF(s) for all s (and for arbitrary choice of the Δσ_j), so that (7.3.1.9) and (7.3.1.14) become identical. By a linear boundary-value problem one means a problem in which f(x, y) is an affine function of y and the boundary conditions (7.3.0.1b) are linear, i.e.,
(7.3.2.1)    y' = T(x) y + g(x),    A y(a) + B y(b) = c,
with an n × n matrix T(x), a function g: ℝ → ℝⁿ, c ∈ ℝⁿ, and constant n × n matrices A and B. We assume in the following that T(x) and g(x) are continuous functions on [a, b]. By y(x; s) we again denote the solution of the initial-value problem
(7.3.2.2)    y' = T(x) y + g(x),    y(a; s) = s.
For y(x; s) one can give an explicit formula,
(7.3.2.3)    y(x; s) = Y(x) s + y(x; 0),
where the n × n matrix Y(x) is the solution of the initial-value problem
    Y' = T(x) Y,    Y(a) = I.
Denoting the right-hand side of (7.3.2.3) by u(x; s), one indeed has
    u(a; s) = Y(a) s + y(a; 0) = Is + 0 = s,
    D_x u(x; s) = u'(x; s) = Y'(x) s + y'(x; 0)
                = T(x) Y(x) s + T(x) y(x; 0) + g(x)
                = T(x) u(x; s) + g(x),
i.e., u(x; s) is a solution of (7.3.2.2). Since, under the assumptions on T(x) and g(x) made above, the initial-value problem has a unique solution, it follows that u(x; s) ≡ y(x; s). Using (7.3.2.3), one obtains for the function F(s) in (7.3.1.8)
(7.3.2.4)    F(s) = As + B y(b; s) − c = [A + B Y(b)] s + B y(b; 0) − c.
Thus, F(s) is also an affine function of s. Consequently [cf. (7.3.1.13)],
    DF(s) = ΔF(s) = A + B Y(b) = ΔF(0).
The solution s̄ of F(s) = 0 [assuming the existence of ΔF(0)⁻¹] is given by
    s̄ = −[A + B Y(b)]⁻¹ [B y(b; 0) − c] = 0 − ΔF(0)⁻¹ F(0),
or, slightly more generally, by
(7.3.2.5)    s̄ = s^(0) − ΔF(s^(0))⁻¹ F(s^(0)),
where s^(0) ∈ ℝⁿ is arbitrary. In other words, the solution s̄ of F(s) = 0, and hence the solution of the linear boundary-value problem (7.3.2.1), will
be produced by the method (7.3.1.14) in one iteration step, initiated with an arbitrary starting vector s^(0).
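For linear problems this one-step character can be exploited directly, as in the following Python sketch: the matrix initial-value problem Y' = T(x)Y, Y(a) = I, and the particular solution y(x; 0) are integrated simultaneously, and the starting vector is obtained from one linear system, cf. (7.3.2.4), (7.3.2.5). The fixed-step integrator and the function names are illustrative stand-ins.

import numpy as np

def linear_bvp_shooting(T, g, A, B, c, a, b, n_steps=400):
    """Solve y' = T(x)y + g(x), A y(a) + B y(b) = c by computing Y(b) and
    y(b; 0) and solving (A + B Y(b)) s = c - B y(b; 0)."""
    n = A.shape[0]
    h = (b - a) / n_steps

    def rhs(x, U):
        Y, y = U[:, :n], U[:, n]
        return np.column_stack([T(x) @ Y, T(x) @ y + g(x)])

    U = np.column_stack([np.eye(n), np.zeros(n)])    # [Y(a), y(a; 0)]
    x = a
    for _ in range(n_steps):                          # classical RK4
        k1 = rhs(x, U); k2 = rhs(x + h/2, U + h/2*k1)
        k3 = rhs(x + h/2, U + h/2*k2); k4 = rhs(x + h, U + h*k3)
        U = U + h/6*(k1 + 2*k2 + 2*k3 + k4); x += h

    Yb, yb0 = U[:, :n], U[:, n]
    return np.linalg.solve(A + B @ Yb, c - B @ yb0)   # = y(a)

# usage: w'' + w = 0 as first-order system, w(0) = 0, w(pi/2) = 1
if __name__ == "__main__":
    T = lambda x: np.array([[0.0, 1.0], [-1.0, 0.0]])
    g = lambda x: np.zeros(2)
    A = np.array([[1.0, 0.0], [0.0, 0.0]])
    B = np.array([[0.0, 0.0], [1.0, 0.0]])
    c = np.array([0.0, 1.0])
    print(linear_bvp_shooting(T, g, A, B, c, 0.0, np.pi/2))   # ~[0, 1]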
7.3.3 An Existence and Uniqueness Theorem for the Solution
of Boundary-Value Problems
Under very restrictive conditions one can show the unique solvability of certain boundary-value problems. To this end, we consider in the following boundary-value problems with nonlinear boundary conditions:
(7.3.3.1)    y' = f(x, y),    r(y(a), y(b)) = 0.
The problem given in (7.3.3.1) is solvable precisely if the function F(s) in (7.3.1.8) has a zero s̄:
(7.3.3.2)    F(s̄) = r(s̄, y(b; s̄)) = 0.
The latter is certainly true if one can find a nonsingular n × n matrix Q such that
(7.3.3.3)    Φ(s) := s − Q F(s)
is a contractive mapping in ℝⁿ; the zero s̄ of F(s) is then a fixed point of Φ, Φ(s̄) = s̄.
With the help of Theorem (7.1.8), we can now prove the following
result, which for linear boundary conditions is due to Keller (1968).
(7.3.3.4) Theorem. For the boundary-value problem (7.3.3.1), let the following assumptions be satisfied:
(1) f and D_y f are continuous on S := {(x, y) | a ≤ x ≤ b, y ∈ ℝⁿ}.
(2) There is a k ∈ C[a, b] with ||D_y f(x, y)|| ≤ k(x) for all (x, y) ∈ S.
(3) The matrix
    P(u, v) := D_u r(u, v) + D_v r(u, v)
admits for all u, v ∈ ℝⁿ a representation of the form
    P(u, v) = P₀ (I + M(u, v))
with a constant nonsingular matrix P₀ and a matrix M = M(u, v), and there are constants μ and m with
    ||M(u, v)|| ≤ μ,    ||P₀⁻¹ D_v r(u, v)|| ≤ m    for all u, v ∈ ℝⁿ.
(4) There is a number λ > 0 with λ + μ < 1 such that
    exp( ∫ₐᵇ k(t) dt ) ≤ 1 + λ/m.
Then the boundary-value problem (7.3.3.1) has exactly one solution y(x).
PROOF. We show that with a suitable choice of Q, namely Q := P₀⁻¹, the function Φ(s) in (7.3.3.3) satisfies
(7.3.3.5)    ||D_s Φ(s)|| ≤ K < 1 for all s ∈ ℝⁿ, where K := λ + μ < 1.
From this it then follows at once that
    ||Φ(s₁) − Φ(s₂)|| ≤ K ||s₁ − s₂||    for all s₁, s₂ ∈ ℝⁿ,
i.e., Φ is a contractive mapping which, according to Theorem (5.2.3), has exactly one fixed point s̄ = Φ(s̄), which is a zero of F(s), on account of the nonsingularity of Q.
For Φ(s) := s − P₀⁻¹ r(s, y(b; s)) one has
(7.3.3.6)    D_s Φ(s) = I − P₀⁻¹ [ D_u r(s, y(b; s)) + D_v r(s, y(b; s)) Z(b; s) ]
                      = I − P₀⁻¹ [ P(s, y(b; s)) + D_v r(s, y(b; s)) (Z(b; s) − I) ]
                      = I − P₀⁻¹ [ P₀ (I + M) + D_v r (Z − I) ]
                      = −M(s, y(b; s)) − P₀⁻¹ D_v r(s, y(b; s)) (Z(b; s) − I),
where the matrix
    Z(x; s) := D_s y(x; s)
is the solution of the initial-value problem
    Z' = T(x) Z,    Z(a; s) = I,    T(x) := D_y f(x, y(x; s))
[see (7.1.8), (7.1.9)]. From Theorem (7.1.11), by virtue of assumption (2), there thus follows for Z the estimate
    ||Z(b; s) − I|| ≤ exp( ∫ₐᵇ k(t) dt ) − 1,
and further, from (7.3.3.6) and assumptions (3) and (4),
    ||D_s Φ(s)|| ≤ μ + m [ exp( ∫ₐᵇ k(t) dt ) − 1 ] ≤ μ + m [ 1 + λ/m − 1 ] = μ + λ = K < 1.
This proves the theorem.  ☐
Remark. The conditions of the theorem are only sufficient and are also very restrictive. Assumption (3), for example, already in the case n = 2, is not satisfied for such simple separated boundary conditions as
    y₁(a) = c₁,    y₁(b) = c₂
[see Exercise 20]. Even though some of the assumptions, for example (3), can be weakened, one nevertheless obtains only theorems whose conditions are very rarely satisfied in practice.
7.3.4 Difficulties in the Execution of the Simple Shooting
Method
To be useful in practical applications, methods for solving the boundary-value problem
    y' = f(x, y),    r(y(a), y(b)) = 0,
should yield an approximation to y(x₀) for every choice of x₀ in the region of definition of a solution y. In the shooting method discussed so far, an approximate value s̄ for the solution y(a) is computed only at one point, the point x₀ = a. This seems to suggest that the boundary value problem has thus been solved, since the value y(x₀) of the solution at every other point x₀ can be (approximately) determined by solving the initial-value problem
(7.3.4.1)    y' = f(x, y),    y(a) = s̄,
say, with the methods of Section 7.2. This, however, is true only in principle. In practice, there often accrue considerable inaccuracies if the solution y(x) = y(x; s) of (7.3.4.1) depends very sensitively on s, as shown in the following example.
EXAMPLE 1. The linear system of differential equations
(7.3.4.2)    [y₁, y₂]ᵀ' = [[0, 1], [100, 0]] [y₁, y₂]ᵀ
has the general solution
(7.3.4.3)    y(x) = c₁ e^{−10x} [1, −10]ᵀ + c₂ e^{10x} [1, 10]ᵀ,    c₁, c₂ arbitrary.
Let y(x; s) be the solution of (7.3.4.2) satisfying the initial condition
    y(−5) = s = [s₁, s₂]ᵀ.
One verifies at once that
(7.3.4.4)    y(x; s) = (e^{−50}(10 s₁ − s₂)/20) e^{−10x} [1, −10]ᵀ + (e^{50}(10 s₁ + s₂)/20) e^{10x} [1, 10]ᵀ.
We now wish to determine the solution y(x) of (7.3.4.2) which satisfies the linear and separated [see (7.3.0.1b')] boundary conditions
(7.3.4.5)    y₁(−5) = 1,    y₁(5) = 1.
Using (7.3.4.3), the exact solution is
(7.3.4.6)    y(x) = ((e^{50} − e^{−50})/(e^{100} − e^{−100})) e^{−10x} [1, −10]ᵀ + ((e^{50} − e^{−50})/(e^{100} − e^{−100})) e^{10x} [1, 10]ᵀ.
The initial value s = y(−5) of the exact solution is
    s = [ 1,  −10 + 20(1 − e^{−100})/(e^{100} − e^{−100}) ]ᵀ.
Trying to compute s, e.g., in 10-digit floating-point arithmetic, we obtain at best an approximate value s̃ of the form
    s̃ = fl(s) = [ 1·(1 + ε₁),  −10·(1 + ε₂) ]ᵀ,
with |ε_i| ≤ eps = 10⁻¹⁰. Let, e.g., ε₁ = 0, ε₂ = −10⁻¹⁰. To the initial value s̃, however, there corresponds by (7.3.4.4) an exact solution y(x; s̃) with
(7.3.4.7)    y₁(5; s̃) ≈ (10⁻⁹/20) e^{100} ≈ 1.3 × 10³³.
On the other hand, the boundary-value problem (7.3.4.2), (7.3.4.5) is well-conditioned with respect to the influence of the boundary data on the solution y(x). For example, to a perturbation of the first boundary condition,
    y₁(−5) = 1 + ε,    y₁(5) = 1,
corresponds the solution ỹ(x, ε) (with ỹ(x, 0) = y(x)):
    ỹ(x, ε) = ((e^{50} − e^{−50} + e^{50} ε)/(e^{100} − e^{−100})) e^{−10x} [1, −10]ᵀ + ((e^{50} − e^{−50} − e^{−50} ε)/(e^{100} − e^{−100})) e^{10x} [1, 10]ᵀ.
For −5 ≤ x ≤ 5 the factors of ε are small (= O(1)); one can even show, for small |ε|,
    |ỹ₁(x, ε) − ỹ₁(x, 0)| ≈ ε y₁(x, 0) ≤ ε,    |ỹ₂(x, ε) − ỹ₂(x, 0)| ≈ 10 ε y₁(x, 0) ≤ 10 ε,
with factors y₁(x, 0) = y₁(x) that are extremely small in the open interval (−5, 5).
Therefore the roundoff error committed when computing the starting value s = y(−5) of the exact solution by means of the shooting method has a much larger influence on the solution y(·) than comparably small errors in the input data (here the boundary data). In this example, the shooting method is thus not well-conditioned [see Section 1.3].
The above example shows that even the computation of the initial value s = y(a) to full machine accuracy does not guarantee that additional values y(x) can be determined accurately. By (7.3.4.4), the inaccuracy in the initial data propagates exponentially with x, growing like e^{10x} for x > 0 sufficiently large.
Theorem (7.1.4) asserts that this may be true in general: For the solution y(x; s) of the initial-value problem y' = f(x, y), y(a) = s, it follows that
    ||y(x; s₁) − y(x; s₂)|| ≤ ||s₁ − s₂|| e^{L|x−a|},
provided the hypotheses of Theorem (7.1.4) are satisfied.
This estimate, however, also shows that for sufficiently small intervals [a, b], the influence of inaccuracies in the initial data s = y(a) remains small. For large intervals, there is a further problem which severely restricts the practical significance of the simple shooting method: Frequently, the function f in the differential equation y' = f(x, y) has a derivative D_y f(x, y) which is continuous for all x ∈ [a, b] and all y ∈ ℝⁿ, but ||D_y f(x, y)|| is unbounded on S = {(x, y) | a ≤ x ≤ b, y ∈ ℝⁿ}, so that the Lipschitz condition of Theorem (7.1.1) is violated. In this case, the solution y(x) = y(x; s) of the initial-value problem y' = f(x, y), y(a) = s, still exists, but only in a neighborhood U_s(a) of a, whose size may depend on s, and perhaps not for all x ∈ [a, b]. Thus, y(b; s) possibly exists only for values s in a small set M. Besides, M is usually not known. The simple shooting method, therefore, will always break down if one chooses for the starting value in Newton's method a vector s^(0) ∉ M.
EXAMPLE 2. Consider the boundary-value problem [cf. Troesch (1960), (1976)]
(7.3.4.8)    y'' = λ sinh λy,
(7.3.4.9)    y(0) = 0,    y(1) = 1    (λ a fixed parameter).
In order to treat the problem with the simple shooting method, the initial slope y'(0) = s needs to be "estimated" first. When numerically integrating the initial-value problem (7.3.4.8) with λ = 5, y(0) = 0, y'(0) = s, the solution y(x; s) turns out to depend extremely sensitively on s. For s = 0.1, 0.2, ... the computation breaks down even before the right-hand boundary (x = 1) is reached, due to exponent overflow, i.e., y(x; s) has a singular point (which depends on s) at some x_s ≤ 1. The effect of the initial slope y'(0) = s upon the location of the singularity can be estimated in this example:
y" = .\ sinh .\y
possesses the first integral
(y')2 = cosh.\y + C.
(7.3.4.10)
2
The conditions y(O) = 0, y' (0) = s define the constant of integration,
S2
C=2"-1.
Integration of (7.3.4.10) leads to
x =
~ lAY V + 2dC:Shry _ 2
S2
The singular point is then given by
Xs
11
= >: 0
00
dry
vs 2 + 2 cosh ry - 2
For the approximate evaluation of the integral we decompose the interval of integration,
    [0, ∞) = [0, ε) ∪ [ε, ∞),    ε > 0 arbitrary,
and estimate the partial integrals separately. Using 2 cosh η − 2 = 4 sinh²(η/2) ≥ η², we find
    ∫₀^ε dη / √(s² + 2 cosh η − 2) ≤ ∫₀^ε dη / √(s² + η²) = ln( (ε + √(ε² + s²)) / |s| )
and
    ∫_ε^∞ dη / √(s² + 2 cosh η − 2) = ∫_ε^∞ dη / √(s² + 4 sinh²(η/2)) ≤ ∫_ε^∞ dη / (2 sinh(η/2)) = −ln( tanh(ε/4) ).
One thus gets the estimate
    x_s ≤ (1/λ) ln( (ε + √(ε² + s²)) / (|s| tanh(ε/4)) ) =: H(ε, s).
For each ε > 0 the quantity H(ε, s) is an upper bound for x_s; therefore, in particular,
    x_s ≤ H(√|s|, s)    for all s ≠ 0.
The asymptotic behavior of H(√|s|, s) for s → 0 can easily be found: For small |s|, in fact, we have in first approximation
    tanh(√|s|/4) ≈ √|s|/4,
so that, asymptotically for s → 0,
(7.3.4.11)    x_s ≤ H(√|s|, s) ≈ (1/λ) ln(8/|s|).
[One can even show (see below) that, asymptotically for s → 0,
(7.3.4.12)    x_s ≈ (1/λ) ln(8/|s|)
holds.]
If, in the shooting process, one is to arrive at the right-hand boundary x = 1, the value of |s| can thus be chosen at most so large that x_s ≥ 1, i.e., ln(8/|s|) ≥ λ, or |s| ≤ 8 e^{−λ}; for λ = 5, one obtains the small domain
(7.3.4.13)    |s| ≤ 0.05.
For the "connoisseur" we add the following: The initial-value problem
(7.3.4.14)
y" = >. sinh >.y,
has the exact solution
2.
y(x; s) = ); smh
-1
(s2"
y(O) = 0,
y'(O) = s
2
e = 1-~·
4'
sn(>,x lk 2 ) )
cn(>,xlk 2 ) ,
here, sn and cn are the Jacobian elliptic functions with modulus k, which here
depends on the initial slope s. If K(k 2 ) denotes the quarter period of cn, then cn
has a zero at
(7.3.4.15)
and therefore y(x; s) has a logarithmic singularity there. But K(k 2 ) has the expansion
K(k 2 ) = In
4 + -1( v!f=k2
4 )
v!f=k2
or, rewritten in terms of s,
4
In
- 1
(1 - k 2 ) + ...
'
7.3 Boundary-Value Problems
K(k 2 ) = In -8 + -S2 ( In -8 - 1)
lsi 16
lsi
557
+ ... ,
from which (7.3.4.12) follows.
For the solution of the actual boundary-value problem, i.e.,
    y(1; s) = 1,
one finds for λ = 5 the value
    s = 4.57504614 × 10⁻²
[cf. (7.3.4.13)]. This value is obtained from a relation between Jacobi functions combined with an iteration method of Section 5.1. For completeness, we add still another relation, which can be derived from the theory of Jacobi functions. The solution of the boundary-value problem (7.3.4.8), (7.3.4.9) has a logarithmic singularity at
(7.3.4.16)    x_s = 1 + 1/(λ cosh(λ/2)).
For λ = 5 it follows that
    x_s ≈ 1.0326...;
the singularity of the exact solution lies in the immediate neighborhood of the right-hand boundary point. The example sufficiently illuminates the difficulties that can arise in the numerical solution of boundary-value problems.
For a direct numerical solution of the problem without knowledge of the theory of elliptic functions, see Section 7.3.6.
A further example for the occurrence of singularities can be found in Exercise 19.
7.3.5 The Multiple Shooting Method
The multiple shooting method has been described repeatedly in the literature, for example in Keller (1968), Osborne (1969), and Bulirsch (1971). A FORTRAN program can be found in Oberle and Grimm (1989).
In a multiple shooting method, the values
    y(x_k),    k = 1, 2, ..., m,
of the exact solution y(x) of a boundary-value problem
(7.3.5.1)    y' = f(x, y),    r(y(a), y(b)) = 0,
at several points
    a = x₁ < x₂ < ··· < x_m = b
are computed simultaneously by iteration. To this end, let y(x; x_k, s_k) be the solution of the initial-value problem
    y' = f(x, y),    y(x_k) = s_k.
The problem now consists in determining the vectors s_k, k = 1, 2, ..., m, in such a way that the function
    y(x) := y(x; x_k, s_k)    for x ∈ [x_k, x_{k+1}),    k = 1, 2, ..., m − 1,
    y(b) := s_m,
pieced together from the y(x; x_k, s_k), is continuous, and thus a solution of the differential equation y' = f(x, y), and in addition satisfies the boundary conditions r(y(a), y(b)) = 0 (see Figure 20).
Fig. 20. Multiple shooting.
This yields the following n·m conditions:
(7.3.5.2)    y(x_{k+1}; x_k, s_k) = s_{k+1},    k = 1, 2, ..., m − 1,
             r(s₁, s_m) = 0,
for the n·m unknown components σ_{kj}, j = 1, 2, ..., n, k = 1, 2, ..., m, of the vectors
    s_k = [σ_{k1}, σ_{k2}, ..., σ_{kn}]ᵀ.
Altogether, (7.3.5.2) represents a system of equations of the form
(7.3.5.3)    F(s) := [ F₁(s₁, s₂)                  [ y(x₂; x₁, s₁) − s₂
                       F₂(s₂, s₃)                    y(x₃; x₂, s₂) − s₃
                          ⋮                  =          ⋮                          =  0
                       F_{m−1}(s_{m−1}, s_m)         y(x_m; x_{m−1}, s_{m−1}) − s_m
                       F_m(s₁, s_m) ]                r(s₁, s_m) ]
in the unknowns
    s := [s₁, s₂, ..., s_m]ᵀ.
It can be solved iteratively with the help of Newton's method,
(7.3.5.4)    s^(i+1) = s^(i) − DF(s^(i))⁻¹ F(s^(i)),    i = 0, 1, ... .
In order to still induce convergence if at all possible, even for a poor choice of the starting vector s^(0), instead of (7.3.5.4) one takes in practice the modified Newton method (5.4.2.4) [see Section 7.3.6 for further indications concerning the implementation of the method]. In each step of the method one must compute F(s) and DF(s) for s = s^(i). For the computation of F(s) one has to determine the values y(x_{k+1}; x_k, s_k) for k = 1, 2, ..., m − 1 by solving the initial-value problems
    y' = f(x, y),    y(x_k) = s_k,    x ∈ [x_k, x_{k+1}],
and to compute F(s) according to (7.3.5.3). The Jacobian matrix
    DF(s) = [ D_{s_k} F_i(s) ]_{i,k=1,...,m},
in view of the special structure of the F_i in (7.3.5.3), has the form
(7.3.5.5)    DF(s) = [ G₁  −I
                            G₂  −I
                                 ⋱     ⋱
                                       G_{m−1}  −I
                        A                        B ],
where the n × n matrices A, B, G_k, k = 1, ..., m − 1, in turn are Jacobian matrices,
(7.3.5.6)    G_k := D_{s_k} F_k(s) = D_{s_k} y(x_{k+1}; x_k, s_k),    k = 1, 2, ..., m − 1,
             B := D_{s_m} F_m(s) = D_{s_m} r(s₁, s_m),
             A := D_{s₁} F_m(s) = D_{s₁} r(s₁, s_m).
As already described for the simple shooting method, it is expedient in practice to again replace the differential quotients in the matrices A, B, G_k by difference quotients, which can be computed by solving additional (m − 1)n initial-value problems (n initial-value problems for each matrix G₁, ..., G_{m−1}). The computation of s^(i+1) from s^(i) according to (7.3.5.4) can be carried out as follows: With the abbreviations
(7.3.5.7)    s^(i+1) =: s^(i) − Δs,    Δs = [Δs₁, Δs₂, ..., Δs_m]ᵀ,    F_k := F_k(s^(i)),
(7.3.5.4) (or the slightly modified substitute problem) is equivalent to the following system of linear equations:
(7.3.5.8)    G₁ Δs₁ − Δs₂ = −F₁,
             G₂ Δs₂ − Δs₃ = −F₂,
                ⋮
             G_{m−1} Δs_{m−1} − Δs_m = −F_{m−1},
             A Δs₁ + B Δs_m = −F_m.
Beginning with the first equation, one can express all Δs_k successively in terms of Δs₁. One thus finds
(7.3.5.9)    Δs₂ = G₁ Δs₁ + F₁,
             Δs₃ = G₂ G₁ Δs₁ + G₂ F₁ + F₂,
                ⋮
             Δs_m = G_{m−1} ··· G₁ Δs₁ + G_{m−1} ··· G₂ F₁ + ··· + F_{m−1},
and from this finally, by means of the last equation,
(7.3.5.10)    (A + B G_{m−1} G_{m−2} ··· G₁) Δs₁ = w,
where w := −(F_m + B F_{m−1} + B G_{m−1} F_{m−2} + ··· + B G_{m−1} G_{m−2} ··· G₂ F₁).
This is a system of linear equations for the unknown vector Δs₁, which can be solved by means of Gauss elimination. Once Δs₁ is determined, one obtains Δs₂, Δs₃, ..., Δs_m successively from (7.3.5.8), and s^(i+1) from (7.3.5.7).
We further remark that under the assumptions of Theorem (7.3.3.4) one can show that the matrix A + B G_{m−1} ··· G₁ in (7.3.5.10) is nonsingular. Furthermore, there again will be a nonsingular nm × nm matrix Q such that the function
    Φ(s) := s − Q F(s)
is contracting, and thus has exactly one fixed point s̄ with F(s̄) = 0.
It is seen, in addition, that F(s), and essentially also DF(s), is defined for all vectors
    s = [s₁, s₂, ..., s_m]ᵀ ∈ M^(1) × M^(2) × ··· × M^(m−1) × ℝⁿ,
so that the iteration (7.3.5.4) of the multiple shooting method can be executed for all such s. Here M^(k), k = 1, 2, ..., m − 1, is the set of all vectors s_k for which the solution y(x; x_k, s_k) exists (at least) on the small interval [x_k, x_{k+1}]. This set M^(k) includes the set M_k of all s_k for which y(x; x_k, s_k)
exists on all of [a, b]. Now Newton's method for computing s̄_k by means of the simple shooting method can only be executed for s_k ∈ M_k ⊂ M^(k). This shows that the demands of the multiple shooting method upon the quality of the starting vectors in Newton's method are considerably more modest than those of the simple shooting method.
7.3.6 Hints for the Practical Implementation of the Multiple
Shooting Method
The multiple shooting method in the form described in the previous section is still rather expensive. The iteration in the modified Newton's method (5.4.2.4), for example, has the form
(7.3.6.1)    s^(i+1) = s^(i) − λ_i (ΔF(s^(i)))⁻¹ F(s^(i)),
and in each step one has to compute at least the approximation ΔF(s^(i)) of the Jacobian matrix DF(s^(i)) by forming appropriate difference quotients. The computation of ΔF(s^(i)) alone amounts to the solution of n(m − 1) initial-value problems. This enormous amount of work can be substantially reduced by means of the techniques described in Section 5.4.3, recomputing the matrix ΔF(s^(i)) only occasionally, and employing the rank-one procedure of Broyden in all remaining iteration steps in order to obtain approximations for ΔF(s^(i)).
We next wish to consider the problem of how to choose the intermediate points x_k, k = 1, ..., m, in [a, b], provided an approximate solution (starting trajectory) η(x) is known for the boundary-value problem. Put x₁ := a. Having already chosen x_i (< b), integrate the initial-value problem
    η_i' = f(x, η_i),    η_i(x_i) = η(x_i)
by means of the methods of Section 7.2, and terminate the integration at the first point x = ξ for which the solution ||η_i(ξ)|| becomes "too large" compared to ||η(ξ)||, say ||η_i(ξ)|| ≥ 2 ||η(ξ)||; then put x_{i+1} := ξ.
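A crude Python sketch of this node-selection rule, using a fixed-step Euler integrator purely for bookkeeping (the methods of Section 7.2 would be used in practice), an assumed growth factor of 2, and illustrative function names:

import numpy as np

def choose_subdivision(f, eta, a, b, h=1e-3, growth=2.0):
    """Choose multiple shooting nodes x_1 < ... < x_m: starting from x_i,
    integrate y' = f(x, y) with initial value eta(x_i) and cut off as soon as
    the numerical solution has grown to 'growth' times the size of eta."""
    nodes = [a]
    while nodes[-1] < b:
        x = nodes[-1]
        y = np.array(eta(x), dtype=float)
        while x < b:
            y = y + h * f(x, y)               # one explicit Euler step
            x = min(x + h, b)
            if np.linalg.norm(y) > growth * max(np.linalg.norm(eta(x)), 1e-8):
                break
        nodes.append(x)
    return nodes

# usage for (7.3.6.2): y'' = 5 sinh 5y, starting trajectory eta(x) = (x, 1)
if __name__ == "__main__":
    f = lambda x, y: np.array([y[1], 5.0 * np.sinh(5.0 * y[0])])
    eta = lambda x: np.array([x, 1.0])
    print(choose_subdivision(f, eta, 0.0, 1.0))   # nodes cluster near x = 1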
EXAMPLE 1. Consider the boundary-value problem
(7.3.6.2)    y'' = 5 sinh 5y,    y(0) = 0,    y(1) = 1
[cf. Example 2 of Section 7.3.4].
For the starting trajectory η(x) we take the straight line connecting the two boundary points, η(x) := x.
The subdivision 0 = x₁ < x₂ < ··· < x_m = 1 found by the computer is shown in Figure 21. Starting with this subdivision and the values s_k := η(x_k), s_k' := η'(x_k), the multiple shooting method yields the solution to about 9 digits in 7 iterations.
The problem suggests a "more advantageous" starting trajectory: the linearized problem
    η'' = 5·5 η,    η(0) = 0,    η(1) = 1
has the solution η(x) := sinh 5x / sinh 5. This function would have been a somewhat "better" starting trajectory.
Fig. 21. Subdivision in multiple shooting.
In some practical problems the right-hand side f(x, y) of the differential equation is only piecewise continuous on [a, b] as a function of x, or else only piecewise continuously differentiable. In these cases one ought to be careful to include the points of discontinuity among the subdivision points x_i; otherwise there are convergence difficulties [see Example 2].
There remains the problem of how to find a first approximation η(x) for a boundary-value problem. In many cases one knows the qualitative behavior of the solution, e.g., on physical grounds, so that one can easily find at least a rough approximation. Usually this also suffices for convergence, as the modified Newton's method does not impose high demands on the quality of the starting vector.
In complicated cases the so-called continuation method (homotopy method) is usually effective. Here one gradually feels one's way, via the solution of "neighboring" problems, toward the solution of the actually posed problem: Almost all problems contain certain parameters α,
(7.3.6.3)    y' = f(x, y; α),    r(y(a), y(b); α) = 0,
and the problem P_α consists in determining the solution y(x) ≡ y(x; α), which of course depends also on α, for a certain value α = ᾱ. It is usually true that for a certain other value α = α₀ the boundary-value problem P_{α₀} is simpler, so that at least a good approximation η₀(x) is known for the solution y(x; α₀) of P_{α₀}. One then chooses a finite sequence of sufficiently close numbers ε_i with 0 < ε₁ < ε₂ < ··· < ε_l = 1, puts α_i := α₀ + ε_i (ᾱ − α₀), and for i = 0, 1, ..., l − 1 takes the solution y(x; α_i) of P_{α_i} as the starting approximation for the determination of the solution y(x; α_{i+1}) of P_{α_{i+1}}. In y(x; α_l) = y(x; ᾱ) one then finally has the desired solution of P_{ᾱ}. For the method to work, it is important that one not take too large steps ε_i → ε_{i+1}, and that "natural" parameters α, which are intrinsic to the problem, be chosen: The introduction of artificial parameters, say constructs of the type
    f(x, y; α) = α f(x, y) + (1 − α) g(x, y),    ᾱ = 1,    α₀ = 0,
where f(x, y) is the given right-hand side of the differential equation and g(x, y) an arbitrarily chosen "simple" function which has nothing to do with the problem, does not succeed in critical cases.
A more refined variant of the continuation method, which has proved to be useful in the context of multiple shooting methods, is described in Deuflhard (1979).
EXAMPLE 2. (A singular boundary-value problem). The steady-state temperature distribution in the interior of a cylinder of radius 1 is described by the solution y(x; α) of the nonlinear boundary-value problem [cf., e.g., Na and Tang (1969)]
(7.3.6.4)    y'' = −y'/x − α e^y,    y'(0) = y(1) = 0.
Here, α is a "natural" parameter with
    α = (heat generation)/(conductivity),    0 < α ≤ 0.8.
For the numerical computation of the solution for α = 0.8 one could use the subdivision
    α = 0, 0.1, ..., 0.8
and, beginning with the explicitly known solution y(x; 0) ≡ 0, construct the homotopy chain for the computation of y(x; 0.8). The problem, however, exhibits one further difficulty: The right-hand side of the differential equation has a singularity at x = 0. While it is true that y'(0) = 0 and the solution y(x; α) is defined, indeed analytic, on the whole interval 0 ≤ x ≤ 1, there nevertheless arise considerable convergence difficulties. The reason is to be found in the following: A backward analysis [cf. Section 1.3] shows that the result ȳ of the numerical computation can be interpreted as the exact solution of the boundary-value problem (7.3.6.4) with slightly perturbed boundary data
    ȳ'(0) = ε₁,    ȳ(1) = ε₂
[see, e.g., Babuska et al. (1966)]. This solution ȳ(x; α) is "near" the solution y(x; α), but in contrast to y(x; α) has only a continuous first derivative at x = 0, because
    lim_{x→0} ȳ'' = −lim_{x→0} ε₁/x = ±∞.
Since the order of convergence of every numerical method depends not only on the method itself, but also (as is often overlooked) on the existence and boundedness of the higher derivatives of the solution, the order of convergence in the present example will be considerably reduced (to essentially 1). These difficulties, and
others as well, are encountered in many practical boundary-value problems. The
cause of failure is often not recognized, and the "blame" put on the method or
the computer.
Through a simple artifice these difficulties can be avoided. The "neighboring" solutions with y'(0) ≠ 0 are filtered out by means of a power series expansion about x = 0:
(7.3.6.5)    y(x) = y(0) + y'(0) x + y''(0) x²/2! + y⁽³⁾(0) x³/3! + ··· .
The coefficients y^(i)(0), i = 2, 3, 4, ..., can all be expressed in terms of λ := y(0), an unknown constant; through substitution in the differential equation (7.3.6.4) one finds
(7.3.6.6)    x y''(x) + y'(x) = −α x e^{y(x)},
from which, for x → 0,
    2 y''(0) = −α e^λ,    i.e.,    y''(0) = −(α/2) e^λ.
Differentiation of (7.3.6.6) and passage to the limit x → 0 gives
    y^(3)(0) = 0,
and further
    y^(4)(0) = (3/8) α² e^{2λ}.
One can show that y^(5)(0) = 0, and in general, y^(2i+1)(0) = 0.
The singular boundary-value problem can now be treated as follows. For y(x; α) one utilizes in the neighborhood of x = 0 the power-series representation (7.3.6.5), and at a sufficient distance from the singularity x = 0 the differential equation (7.3.6.4) itself, for example,
(7.3.6.7)    y''(x) = { −(α/2) e^λ [ 1 − (3/8) x² α e^λ ]    if 0 ≤ x ≤ 10⁻²,
                      { −y'(x)/x − α e^{y(x)}                if 10⁻² ≤ x ≤ 1.
The error is of the order of magnitude 10⁻⁸. Now the right-hand side still contains the unknown parameter λ = y(0). As shown in Section 7.3.0, however, this can be interpreted as an extended boundary-value problem. One puts
    y₁(x) := y(x),    y₂(x) := y'(x),    y₃(x) := y(0) = λ,
and obtains the system of differential equations on 0 ≤ x ≤ 1,
(7.3.6.8)    y₁' = y₂,
             y₂' = { −(α/2) e^{y₃} [ 1 − (3/8) x² α e^{y₃} ]    if 0 ≤ x ≤ 10⁻²,
                   { −y₂/x − α e^{y₁}                           if 10⁻² ≤ x ≤ 1,
             y₃' = 0,
with the boundary conditions
(7.3.6.9)    y₂(0) = 0,    y₁(1) = 0,    y₁(0) − y₃(0) = 0.
In the multiple shooting method one chooses the "patch point" 10⁻² as one of the subdivision points, say,
    x₁ = 0,    x₂ = 10⁻²,    x₃ = 0.1, ...,    x₁₁ = 0.9,    x₁₂ = 1.
There is no need to worry about the smoothness of the solution at the patch point; it is ensured by construction. With the use of the homotopy [see (7.3.6.3)], beginning with y(x; α₀) ≡ 0, the solution y(x; α_k) is easily obtained from y(x; α_{k−1}) in only 3 iterations to an accuracy of about 6 digits (total computing time on the CDC 3600 computer for all 8 trajectories: 40 seconds). The results are shown in Figure 22.
Fig. 22. The solution y(x; α) for α = 0(0.1)0.8.
7.3.7 An Example: Optimal Control Program for a Lifting Reentry Space Vehicle
The following extensive problem originates in space travel. This concluding
example is to illustrate how nonlinear boundary-value problems are handled
and solved in practice, because experience shows that the actual solution
of such real-life problems causes the greatest difficulties to the beginner.
The numerical solution was carried out with the multiple shooting method;
a program can be found in Oberle and Grimm (1989).
The coordinates of the trajectory of an Apollo type vehicle satisfy, during the flight of the vehicle through the earth's atmosphere, the following differential equations:
(7.3.7.1)    v̇ = v̇(v, γ, ξ, u) = −(S ρ v² / 2m) C_D(u) − g sin γ / (1 + ξ)²,
             γ̇ = γ̇(v, γ, ξ, u) = (S ρ v / 2m) C_L(u) + v cos γ / (R(1 + ξ)) − g cos γ / (v(1 + ξ)²),
             ξ̇ = ξ̇(v, γ, ξ, u) = (v sin γ) / R,
             ζ̇ = ζ̇(v, γ, ξ, u) = (v / (1 + ξ)) cos γ.
The meanings are: v: velocity; γ: flight-path angle; h: altitude above the earth's surface; R: earth radius; ξ = h/R: normalized altitude; ζ: distance on the earth's surface; ρ = ρ₀ exp(−βRξ): atmospheric density; C_D(u) = 1.174 − 0.9 cos u: aerodynamic drag coefficient; C_L(u) = 0.6 sin u: aerodynamic lift coefficient; u: control parameter, which can be chosen arbitrarily as a function of time; g: gravitational acceleration; S/m: (frontal area)/(mass of vehicle). Numerical values are: R = 209 (≙ 209·10⁵ ft); β = 4.26; ρ₀ = 2.704·10⁻³; g = 3.2172·10⁻⁴; S/m = 53,200; (1 ft = 0.3048 m). (See Figure 23.)
Fig. 23. The coordinates of the trajectory.
The differential equations have been simplified somewhat by assuming (1) a spherical earth at rest, (2) a flight trajectory in a great circle plane, (3) that vehicle and astronauts can be subjected to unlimited deceleration. The right-hand sides of the differential equations nevertheless contain all terms which are essential physically. The largest effect is produced by the terms multiplied by C_D(u) and C_L(u), respectively: these are the atmospheric forces, which, in spite of the low air density ρ, are particularly significant by virtue of the high speed of the vehicle; they can be influenced via the parameter u (angle of attack). The terms multiplied by g are the gravitational forces of the earth acting on the vehicle; the remaining terms result from the choice of the coordinate system.
During the flight through the earth's atmosphere the vehicle is heated up considerably. The total stagnation point convective heating per unit area is given by the integral
(7.3.7.2)    J = ∫₀^T q dt.
The range of integration is from t = 0, the time when the vehicle hits the 400,000 ft atmospheric level, until a time instant T. The vehicle is to be maneuvered into an initial position favorable for the final splashdown in the Pacific. Through the freely disposable parameter u the maneuver is to be executed in such a way that the heating J becomes minimal and that the following boundary conditions are satisfied: Data at the moment of entry:
(7.3.7.3)    v(0) = 0.36 (≙ 36,000 ft/sec),
             γ(0) = −8.1° · π/180°,
             ξ(0) = 4/R    [h(0) = 4 (≙ 400,000 ft)].
Data at the end of maneuver:
(7.3.7.4)    v(T) = 0.27 (≙ 27,000 ft/sec),
             γ(T) = 0,
             ξ(T) = 2.5/R    [h(T) = 2.5 (≙ 250,000 ft)].
The terminal time T is free. ζ is ignored in the optimization process.
The calculus of variations [cf., e.g., Hestenes (1966)] now teaches the following: Form, with multiplier functions λ_v, λ_γ, λ_ξ (Lagrange multipliers), the expression
(7.3.7.5)    H = q + λ_v v̇ + λ_γ γ̇ + λ_ξ ξ̇,
where v̇, γ̇, ξ̇ stand for the right-hand sides of (7.3.7.1) and λ_v, λ_γ, λ_ξ satisfy the three differential equations
(7.3.7.6)    λ̇_v = −∂H/∂v,    λ̇_γ = −∂H/∂γ,    λ̇_ξ = −∂H/∂ξ.
The optimal control u is then given by
(7.3.7.7)    sin u = −0.6 λ_γ / D,    cos u = −0.9 v λ_v / D,    D := √( (0.6 λ_γ)² + (0.9 v λ_v)² ).
Note that (7.3.7.6), by virtue of (7.3.7.7), is nonlinear in λ_v, λ_γ.
Since the terminal time T is not subject to any condition, the further boundary condition must be satisfied:
(7.3.7.8)    H|_{t=T} = 0.
The problem is thus reduced to a boundary-value problem for the six differential equations (7.3.7.1), (7.3.7.6) with the seven boundary conditions (7.3.7.3), (7.3.7.4), (7.3.7.8). We are thus dealing with a free boundary-value problem [cf. (7.3.0.4a,b)]. A closed-form solution is impossible; one must use numerical methods.
It would be wrong to construct a starting trajectory η(x) without reference to reality. The inexperienced should not be misled by the innocent-looking form of the right-hand sides of the differential equations (7.3.7.1): During the numerical integration one quickly observes that v, γ, ξ, λ_v, λ_γ, λ_ξ depend in an extremely sensitive way on the initial data. The solution has moving singularities which lie in the immediate neighborhood of the initial point of integration [see for this the comparatively trivial example (7.3.6.2)]. This sensitivity is a consequence of the effect of atmospheric forces, and the physical interpretation of the singularity is a "crash" of the vehicle or a "hurling back" into space. As can be shown by an a posteriori calculation, there exist differentiable solutions of the boundary-value problem only for an extremely narrow domain of boundary data. This is the mathematical formulation of the danger involved in the reentry maneuver.
Construction of a Starting Trajectory. For aerodynamic reasons the graph of the control parameter u will have the shape depicted in Figure 24 (information from space engineers). This function can be approximated, e.g., by
(7.3.7.9)
where
    τ := t/T,    0 < τ < 1,    erf(x) := (2/√π) ∫₀ˣ e^{−σ²} dσ,
and p₁, p₂, p₃ are, for the moment, unknown constants.
Fig. 24. Control parameter (empirical).
For the determination of the p_i one solves the following auxiliary boundary-value problem: the differential equations (7.3.7.1) with u from (7.3.7.9) and in addition
(7.3.7.10)    ṗ₁ = 0,    ṗ₂ = 0,    ṗ₃ = 0,
with the boundary conditions (7.3.7.3), (7.3.7.4) and T = 230.
The boundary value T = 230 sec is an estimate for the duration of the reentry maneuver. With the relatively poor approximations
    p₁ = 1.6,    p₂ = 4.0,    p₃ = 0.5
the auxiliary boundary-value problem can be solved even with the simple shooting method, provided one integrates in the backward direction (initial point τ = 1, terminal point τ = 0). The result after 11 iterations is
(7.3.7.11)    p₁ = 1.09835,    p₂ = 6.48578,    p₃ = 0.347717.
Solution of the Actual Boundary-Value Problem. With the "nonoptimal" control function u from (7.3.7.9), (7.3.7.11) one obtains approximations to v(t), γ(t), ξ(t) by integrating (7.3.7.1). This "incomplete" starting trajectory can be made into a "complete" one as follows: since cos u > 0, it follows from (7.3.7.7) that λ_v < 0; we choose λ_v ≡ −1. Further, from
    tan u = 6 λ_γ / (9 v λ_v)
there follows an approximation for λ_γ.
Fig. 25. The trajectories h, γ, v.
Fig. 26. The trajectories of the adjoint variables λ_ξ, λ_γ, λ_v.
An approximation for λ_ξ can be obtained from the relation H ≡ 0, where H is given by (7.3.7.5), because H = const is a first integral of the equations of motion.
With this approximate trajectory for v, γ, ..., λ_ξ, T (T = 230) and the techniques of Section 7.3.6, one can now determine the subdivision points for the multiple shooting method (m = 6 suffices).
Figures 25 and 26 show the result of the computation (after 14 iterations); the total time of the optimal reentry maneuver comes out to be T = 224.9 sec; the accuracy of the results is about 7 digits.
Fig. 27. Successive approximations to the control u (initial values; after 6 iterations; after 14 iterations, accuracy about 7 digits).
Figure 27 shows the behavior of the control u = arctan( 6 λ_γ / (9 v λ_v) ) during the course of the iteration. It can be seen how the initially large jumps at the subdivision points of the multiple shooting method are "flattened out".
The following table shows the convergence behavior of the modified Newton's method [see Sections 5.4.2 and 5.4.3].
The table lists, for each iteration, the error ||F(s^(i))||₂ [cf. (7.3.5.3)] and the step length λ used in (5.4.3.5). The error decreases monotonically from about 6 × 10⁴ at the start to about 1 × 10⁻⁹ after 14 iterations; the step lengths begin with reduced (trial) values 0.500, 0.250, 0.125 and reach the full Newton step λ = 1.000 in the later iterations.
7.3.8 Advanced Techniques in Multiple Shooting
In Section 7.3.5 we examined a prototype version of the multiple shooting method. In this section, we deal with some advanced techniques for reducing the large computational expense when applying the multiple shooting algorithm to complex problems in science and engineering.
Calculation of the Jacobian Matrix. The calculation of approximations ΔF(s^(i)) of the Jacobian matrices DF(s^(i)), especially of the partial derivatives G_k in (7.3.5.6), is the most expensive part of the algorithm. Those partial derivatives
    G_k = D_{s_k} y(x_{k+1}; x_k, s_k),    k = 1, 2, ..., m − 1,
can be computed by solving at least n(m − 1) additional initial-value problems and forming appropriate difference quotients. This approach is straightforward, but no information concerning the accuracy of the G_k is available, and all initial-value problems have to be solved to high accuracy. Low accuracy of the G_k often leads to poor convergence of the iteration to solve the nonlinear system (7.3.5.4).
In view of these problems, another approach has been developed, based on the integration of the so-called "variational equation" [see Hiltmann (1990), Callies (2000)]
(7.3.8.1)    ∂G_k(x; x_k, s_k)/∂x = f_y(x, y(x; x_k, s_k)) G_k(x; x_k, s_k),    G_k(x_k; x_k, s_k) = I.
To prove (7.3.8.1), we only have to differentiate
(7.3.8.2)    y' = f(x, y),    y(x_k) = s_k
with respect to s_k, observing that y = y(x; x_k, s_k) is also a function of s_k. (7.3.8.1) and (7.3.8.2) are integrated in parallel with completely different integrators specially tailored to the respective differential equation. In the case of (7.3.8.1), full advantage is taken of the linearity of the system to save about half of the function evaluations [see Callies (2001)]. Step size control is done independently, integration accuracy is different for each system but controlled for both systems, and switching points can be located much more easily than in the original approach.
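A Python sketch of the parallel integration of (7.3.8.1) and (7.3.8.2); for simplicity both systems share one fixed-step Runge-Kutta integrator here, whereas the cited approach uses separate, specially tailored integrators with independent step size control. Function names are illustrative only.

import numpy as np

def propagate_with_variational_equation(f, fy, xk, sk, xk1, n_steps=200):
    """Integrate y' = f(x, y), y(x_k) = s_k together with the variational
    equation (7.3.8.1), G' = f_y(x, y) G, G(x_k) = I, over [x_k, x_{k+1}]."""
    n = len(sk)
    h = (xk1 - xk) / n_steps
    def rhs(x, u):
        y, G = u[:n], u[n:].reshape(n, n)
        return np.concatenate([f(x, y), (fy(x, y) @ G).ravel()])
    u = np.concatenate([np.asarray(sk, float), np.eye(n).ravel()])
    x = xk
    for _ in range(n_steps):
        k1 = rhs(x, u); k2 = rhs(x + h/2, u + h/2*k1)
        k3 = rhs(x + h/2, u + h/2*k2); k4 = rhs(x + h, u + h*k3)
        u = u + h/6*(k1 + 2*k2 + 2*k3 + k4); x += h
    return u[:n], u[n:].reshape(n, n)      # y(x_{k+1}; x_k, s_k) and G_k

# usage: the linear system of Example 1 of Section 7.3.4; since it is linear,
# G_k is simply the matrix exponential of (x_{k+1} - x_k) * T
if __name__ == "__main__":
    T = np.array([[0.0, 1.0], [100.0, 0.0]])
    f = lambda x, y: T @ y
    fy = lambda x, y: T
    y_end, G = propagate_with_variational_equation(f, fy, 0.0, [1.0, 0.0], 0.1)
    print(G)     # approximately [[cosh 1, sinh(1)/10], [10 sinh 1, cosh 1]]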
We also note the formula
(7.3.8.3)    ∂y(x_{k+1}; x_k, s_k)/∂x_k = −G_k(x_{k+1}; x_k, s_k) · f(x_k, s_k)
for the partial derivative of y(x_{k+1}; x_k, s_k) with respect to x_k, which follows from the differentiation of (7.3.8.2) and of the identity y(x_k; x_k, s_k) ≡ s_k with respect to x_k.
Piecewise Continuous Right Hand Sides. In many applications, the right hand side f(x, y) of the differential equation is discontinuous or not differentiable [cf. Section 7.2.18]. Even jumps in the solution y(x) are possible. These problems could, in principle, be treated in the framework of the basic multiple shooting algorithm by incorporating the discontinuity points into the set of the multiple shooting abscissae x_i, but the price to be paid is a dramatic increase in the number of shooting points, accompanied by a similar increase in the size of the nonlinear system (7.3.5.3).
These problems can be overcome by working with two types of discretizations [Callies (2000a)], the macro-discretization x₁ < x₂ < ··· < x_m and the micro-discretization x_ν =: x_{ν,1} < x_{ν,2} < ··· < x_{ν,κ_ν} := x_{ν+1}, ν = 1, ..., m − 1.
Consider the following piecewise defined multi-point boundary value
problem
(7.3.8.4)
y' = fV.I"(x,y), X E [xv'l",xv'l"+d}
v = 1, ... ,m - 1, JL = 1, ... ,"'V - 1,
y(xv.l") = SV,I"
v = 1, ... , m - 1, JL = 2, ... , "'v,
Using the abbreviations Sl := Sl,l and
v = 1, ... , m -1,
the solution of (7.3.8.4) is formally equivalent to the solution of a special
system of nonlinear equations [cf. (7.3.5.3)]
574
7 Ordinary Differential Equations
Xl
2 ))
"2(X3, 53, y(X 3 ))
"I (X2. 52, y(X
(7.3.8.5)
51
X2
= 0,
F(z) :=
Z .-
1'm -1(X m , 5m , y(x;))
1'(X1' 51, Xm , 5m )
82
Xm
8m
Here. Y(X;';+l) := Y(X;;+l; XV' 8 v ), v = 1, ... , m - 1, and, for X E
[xv,!" XV.I'+d, y(.) is the solution of the initial value problem
y' = iv.1' (X, y),
X E [XV,I" XV.I'+l), fl = 1, ... , Kv - 1,
y(xv'l') = 8 v.!"
where 8 v .1 .- 8 v , and, for fl = 2, ... , Kv - 1, xV.1' and 8 V .?l satisfy the
equation
"v. I' (xv.I" 5 u .1' , y(x;;'I')) = O.
Note that the switching and jump functions considered in Sections 7.2.18
generate equations of this type.
These problems are rather general, since the XV,I' may not be known
a priori, jumps in the solution y(.) are allowed at x = xV.I" and the right
hand side of the differential equation iv.1' may depend on v and fl. Note that
(7.3.8.4) encompasses the basic problem (7.3.5.1) and (7.3.G.2) treated in
Section 7.3.5 with a fixed i, fixed abscissae Xi in the macro-discretization,
no micro-discretization points, no jumps in the solution, as special case: we
only have to set Kv := 2, iv.1' := i, Xv.1 := Xv, Xv.2 := Xv+1, 8v.1 := 8v ,
"v.2(X. 8, y) := y - 8, and to drop the variables Xl, X2, ... , Xm from z in
(7.3.8.5) .
The nonlinear system (7.3.8.5) is iteratively solved e.g. by a modified
Newton method (iteration of the linearized system). The ~-th iteration step
reads
(7.3.8.6)
z(~+l) := z(EJ -
A(LlF(z(O))-l F(z(O),
LlF(z(O) :=:::; D F(z(E,))
An iteration step is accepted if both of the following tests are valid:
(7.3.8.7)
IIF(z(U 1)) I ~ IIF(z(~)) I
(Test 1)
II (LlF(z(O)) -1 F(Z(~+l)) II ~ I (LlF(z(O)) -1 F(z(O) II
(Test 2)
In particular, calculating the Jacobian matrix D F(z) requires computing the sensitivity matrices
v = 1, 2, ... , m - 1.
7.3 Boundary-Value Problems
575
This presents a problem when the open interval (xv, xv+d contains points
of the micro-discretization, "'v > 2. We will treat this problem, but only
for the most elementary case "'v = 3 and only one such interval, when this
interval contains exactly one point of the micro-discretization. This leads
to the following core problem and a basic assumption.
Assumption (A): Let Xl < X2 < X3 and f > O. Further let h E
CN([XI - f,X2 + f] X IRn,IR n ), h E C N ([X2 - f,X3 + f] X IRn,IRn ), and
q E C 2 ([Xl, .T3] X IRn , IR). Consider the piecewise defined system of ordinary
differential equation8
(7.3.8.8)
y' _ {h(X,. y(x)),.
h(x, y(x)),
if q(x, y(x)) < 0,
if q(x, y(x)) > 0,
and the initial condition Y(Xl) = 81; let X2 exist with q(X2,y(X2";Xl,Sl)) =
o and start the integration of h with the initial value (X2' S2), where S2 is
a solution (a8sumed to e.Tist) of a further equation p(y( x2"), S2) = 0 with a
function p: IR2n -+ IRn. Assume that q(x, y(x)) < 0 for Xl :S X < X2) and
q(x, y(x)) > 0 for X2 < X :S X3. Further let Q(x, y) := [qx + qyh](x, V)·
Finally, assume that Q(X2, y(X2)) > 0, p is twice continuously differentiable
on .'lome open convex set D C IR2n with (y(X2), S2) E D, and that the
Jacobian D S2 P(Y(X2), S2) i8 nonsingular.
If Assumption (A) is satisfied, the function q(x, y) is called a 8witching
function, p(y, 8) a tran8ition function, and X2 a design point.
Interpretation of the core problem: Xl and X3 represent two adjacent points of
the macro-discretization, and X2 the abscissa of the micro-discretization between
Xl and X3. The role of the equation Tv.I'("') = 0 in (7.3.8.4) is played by the
simple system
[ q(X2' Y(X 2 ))] = o.
p(y(x2 ), S2)
In many applications, p has the form p(y, s) = ¢(y) - s [ef. Section 7.2.18].
Assumption (A) allows the application of the implicit function theorem
and guarantees that the design point X2 and the new starting ordinate S2 are
locally unique solutions of smooth equations and depend smoothly on Sl,
X2 = X2(Sl), S2 = S2(sI) . This will be used in the proof of the following
Theorem:
(7.3.8.9) Theorem. Let AS8umption (A) be 8atisfied and let X2 be a de8ign point. Then the sensitivity matrix OY(X3; x2(sd, s2(sd)/OSl can be
computed by
576
7 Ordinary Differential Equations
OY(X3; x2(sd, S2(SI))
OSI
with the abbreviations
Op
Op(y, S2) I
oy =
oy
Y=Y(X 2 )'
Op
Op(y, 82) I
OS2 =
OS2
y=y(x;-)
PROOF. By Assumption (A), X2 and 82 satisfy q(X2,y(X2;Xl,SI)) = 0,
p(y(X 2 ;Xl,Sl),S2) = 0 and thus can be treated as functions of Sl (and
Xl, but the dependence on Xl is not used): X2 = X2(81), S2 = 82(8d. By
differentiating the identity q(x2(sd, y(X2(81); Xl, sd) == 0 we obtain by the
implicit function theorem
Analogously we get from the equation p(y,82) = 0, which defines 82 as
function of y,
082 = _ (op(y, 8 2 ) ) -1 op(y, 82) .
oy
OS2
oy
Using the chain rule and (7.3.8.3) we obtain
OY(X3; X2, S2)
OSl
OY(X3; X2, 82) 0.7:2
082 OY(X2)
- + OY(X3; X2, 82) -OX2
OSl
082
oy OSl
OX2
OS2 OY(X2))
= G 2(X3; X2, 82) ( - h(X2, 82) -0 + -a - a ,
Sl
Y
Sl
--'------::,-----------'--=
where aY(X2)/081 stands for
oy(x2(sd-;x1,sl) = 0y(x 2 ) OX2 + Oy(x;XI,Sdl
OSl
OX2 OSl
OSl
X=X 2
_ OX2
= h(X2,y(X 2 )) -0 +G 1(X2;X1,SI)'
Sl
7.3 Boundary-Value Problems
577
Combination of the equations yields the desired result.
0
The theorem stated above allows to treat piecewise continuous right
hand sides without an increase in the size of the nonlinear system (7.3.5.3),
but with equal precision. The determination of the zeros of the switching
functions q(x,y(x)) = 0 is coupled to the convergence tests (7.3.8.7) of
the shooting algorithm making the algorithm even more efficient (Callies
(200la)) .
7.3.9 The Limiting Case m -+ 00 of the Multiple Shooting
Method (General Newton's Method, Quasilinearization)
If the subdivision [a, bj is made finer and finer (m -+ (0), the multiple shooting method converges toward the general Newton's method for boundaryvalue problems [ef., e.g., Collatz (1966)], also known as quasilinearization.
In this method an approximation 17(X) for the exact solution y(x) of
the nonlinar boundary-value problem (7.3.5.1) is improved by solving linear bondary-value problems. Using Taylor expansion of f(x,y(x)) and
r(y(a), y(b)) about T)(x) one finds in first approximation [ef. the derivation
of Newton's method in Section 5.1]
y'(x) = f(x, y(x)) == f(x, T)(x)) + Dyf(x, T)(x)) (y(:r) - T)(x)),
0= r(y(a),y(b)) == T(17(a),T)(b)) + A(y(a) - 7/(a)) + B(y(b) -17(b)),
where A := Dur(T)(a), T)(b)), B := Dvr(T)(a), T)(b)). One may expect, therefore, that the solution 7)(x) of the linear boundary-value problem
(7.3.9.1)
7)' =
f(x, T)(x)) + Dyf(x, T)(x))(7) - T)(x))
A(7)(a) -T)(a)) +B(7)(b) -T)(b)) = -r(T)(a),T)(b)) ,
will be a better solution of (7.3.5.1) than T)(x). For later use we introduce
the correction function L1T)(x) := 7)(x) -T)(x), which by definition is solution
of the boundary-value problem
(7.3.9.2)
(L1T))' = f(x, T)(x)) - T)'(x) + Dyf(x, 17(x)) L117,
AL1T)(a) + BL1T)(b) = -T(17(a), 17(b)),
so that
(7.3.9.3) L1T)(x) = L117(a) +
l
x
[Dyf(t, T/(t))L1r/(t) + f(t, T)(t)) - T)'(t)] dt.
Replacing T) in (7.3.9.1) by 7), one obtains a further approximation r, for
y, etc. In spite of the simple derivation, this iteration method has serious
disadvantages which cast some doubt on its practical value:
(1) The vector functions T)( x), 7)( x), ... must be stored over their entire
range [a, bj.
578
7 Ordinary Differential Equations
(2) The matrix function Dyf(x, y) must be computed in explicit analytic
form and must likewise be stored over its entire range [a, b]. The realization of both these requirements is next to impossible in the problems
that currently arise in practice (e.g., f with 25 components, 500-1000
arithmetic operations per evaluation of f).
We now wish to show that the multiple shooting method as m -+ 00
converges to the method (7.3.9.1) in the following sense: Let 71(X) be a
sufficiently smooth function on [a, b] and let further (7.3.9.1) be uniquely
solvable with the solution fJ(x), L171(:r) := r)(.1;) -71(X). Then the following is
true: If in the multiple shooting method (7.3.5.3)-(7.3.5.10) one chooses any
subdivision a = Xl < X2 < ... < Xm = b of fineness h := maxk IXk+1 - xkl
and the starting vector
then the solution of (7.3.5.8) will yield a vector of corrections
(7.3.9.4)
L1s =
r
L1S12
L1s
:
1
L1,sm
For the proof we assume for simplicity IXk+l - xkl = h, k = 1, 2, ... ,
m - l. We want to show that to each h there exists a differentiable function
L1s: [a,b] -+ JRn such that maxk IIL1s k - L1S(Xk) I = O(h), L1s(a) = L1s 1
and maxxE[a.bJ IIL1s(x) - L171(X) II = O(h).
To this end, we first show that the products
G k- I G k - 2 ··· GJ+I = Gk- I G k - 2 ··· G 1 (GjGj - 1 ... G I )-1
appearing in (7.3.5.9), (7.3.5.10) have simple limits as h -+ 0 (hk = const,
hj = const). The representation (7.3.5.9) for L1s k can be written in the
form
(7.3.9.5)
k-2
L1s k = Fk-l + Gk-l ... G 1 [L1S 1 + ~ (Gj ... G 1 ) -1 Fj ] ,
J=l
According to (7.3.5.3), Fk is given by
thus
2 <::: k <::: m.
Fk = 8k +
1:
7.3 Boundary-Value Problems
k 1
+
579
f(t, y(t; Xk,Sk)) dt - Sk+l,
and with the help of the mean-value theorem,
Let now Z(x) be the solution of the initial-value problem
Z' = Dyf(x, T/(X) )Z,
(7.3.9.7)
Z(a) = I,
and the matrix Zk solution of
1 S; k S; m - 1,
so that
(7.3.9.8)
For the matrices Zk := Zk(xk+d one now shows easily by induction on k
that
(7.3.9.9)
Indeed, since Zl = Z, this is true for k = 1. If the assertion is true for k - 1,
then the function
Z(x) := Zk(X)Zk-1··· Zl
satisfies the differential equation Z' = Dyf(x,7)(x))Z and the initial conditions
Z(Xk) = Zk(Xk)Zk-1 ... Zl = Zk-1 ... Zl = Z(Xk). By uniqueness of the solution
of initial-value problems there follows Z(x) = Z(x), hence (7.3.9.9).
By
Theorem
(7.1.8)
we
further
have
for
the
matrices
Gk(x) := DskY(x; Xk, Sk),
Gk(x) = I + 1~ Dyf(t, y(t; Xk, Sk) )Gdt ) dt,
(7.3.9.10)
G k = Gk (.Tk+ 1 ).
With the abbreviation
1:
upon subtraction of (7.3.9.8) and (7.3.9.10), one obtains the estimate
'P(X) ::;
(7.3.9.11)
IIDyf(t, 7)(t)) - Dyf(t, y(t; Xk, Sk)) 1IIIZk(t)11 dt
+ l~IIDYf(t,y(t;xk,sk))II'P(t)dt.
580
7 Ordinary Differential Equations
If Dyf(t,y), for all t E [a,b], is uniformly Lipschitz continuous in y, one
easily shows for Xk ::;; t ::;; Xk+l that
IIDyf(t,1](t)) - Dyf(t, y(t; Xk, Sk)) II ::;; LII11(t) - y(t; Xk, sk)11 = O(h),
as well as the uniform boundedness of IIZk(t)ll, IIDyf(t, y(t; Xk,Sk))11 for
t E [a, b], k = 1, 2, ... , m - 1. From (7.3.9.11) and the proof techniques of
Theorems (7.1.4) (7.1.11), one then obtains an estimate of the form
~in particular, for x =
Xk+l,
1::;; k::;; m -1,
(7.3.9.12)
with a constant Cl independent of k and h.
From the identity
ZkZk-l'" Zl - GkGk-l'" G 1 = (Zk - Gk)Zk-l ... Zl
+Gk (Zk-l - Gk-l)Zk-2'" Zl + ... + GkGk-l ... G 2 (Zl - G 1 )
and the further estimates
1::;; k ::;; m - 1,
C2 independent of k and h, which follow from Theorem (7.1.11), we thus
obtain for 1 ::;; k ::;; m - 1, in view of kh ::;; b - a and (7.3.9.9),
(7.3.9.13)
with a constant D independent of k and h.
From (7.3.9.5), (7.3.9.6), (7.3.9.9), (7.3.9.12), (7.3.9.13) one obtains
(7.3.9.14)
Lls k = (Z(Xk) + O(h2))
Here, Lls( x) is the function
(7.3.9.15)
Lls(x) := Z(x)Lls l + Z(x)
Lls(a) = Lls 1 .
IX Z(t)-l [f(t, 1](t)) -1]'(t)] dt,
7.3 Boundary-Value Problems
581
Note that .1s depends also on h, since .1s 1 depends on h. Evidently, .1s is
differentiable with respect to x, and by (7.3.9.7) we have
.1s'(X) =Z'(x) [.1S 1 +
IX Z(t)-l [f(t,1](t)) -1]'(t)] dt] + f(x, 1](X)) -1]'(x)
=Dyf(x, 1](X)).1s(x) + f(x, 1](X)) -1]'(X),
so that
.1s(x) = .1s(a) +
IX [Dyf(t,1](t)).1s(t) + f(t,1](t)) -1]'(t)] dt.
By subtracting this equation from (7.3.9.3), we obtain for the difference
8(x) := .11](X) - .1s(x)
the equation
(7.3.9.16)
8(x) = 8(a) +
IX Dyf(t, 1](t))8(t) dt,
and further, with the aid of (7.3.9.7),
(7.3.9.17)
8(x) = Z(x)8(a),
[[8(x)[[::; K[[8(a)[[
for a ::; x ::; b,
with a suitable constant K.
In view of (7.3.9.14) and (7.3.9.17) it suffices to show [[8(a)[[ = O(h)
in order to complete the proof of (7.3.9.4). With the help of (7.3.5.10) and
(7.3.9.15) one now shows in the same way as (7.3.9.14) that .1s 1 = .1s(a)
satisfies an equation of the form
[A + B(Z(b) + 0(h2))].1s1
= -Pm - BZ(b)
Ib
Z(t)-l (J(t, 1](t)) -1]'(t)) dt + O(h)
= -Pm - B[.1s(b) - Z(b).1S1] + O(h).
From Pm = r(sl' sm) = r(1](a), 1](b)) it follows that
A.1s(a) + B.1s(b) = -r(1](a), 1](b)) + O(h).
Subtraction of (7.3.9.2), in view of (7.3.9.17), yields
A8(a) + B8(b) = [A + BZ(b)]8(a) = O(h),
and thus 8(a) = O(h), since (7.3.9.1) by assumption is uniquely solvable
and hence A + BZ(b) is nonsingular. Because of (7.3.9.14), (7.3.9.17), this
proves (7.3.9.4).
D
582
7 Ordinary Differential Equations
7.4 Difference Methods
The basic idea underlying all difference metods is to replace the differential
quotients in a differential equation by suitable difference quotients and to
solve the discrete equations that are so obtained.
We illustrate this with the following simple boundary-value problem of
second order for a function Y : [a. b] -+ IR:
_y" + q(x)y = g(x).
(7.4.1)
y(a) = n.
y(b) = 8.
Under the assumptions that q, 9 E C[a, b] (i.e., q and 9 are continuous
functions on [a.b]) and q(x) 2': 0 for x E [a.b], it can be shown that (7.4.1)
has a unique solution y(.T).
In order to discretize (7.4.1), we subdivide [a, b] into 71+ 1 equal subintervals.
a
Xj = a + j h,
= Xo < Xl < ... < Xn < .Tn+l = b,
b-a
71+1
h:=--,
and, with the abbreviation Yj := Y(Xj), replace the differential quotient
y;' = y"(Xi) for i = 1, 2, ... ,71 by the second difference quotient
,12
L.l
. ' _ Yi+l -
y, .-
2Yi + Yi-l
h2
We now estimate the error Ti(Y) := Y"(Xi) - Ll 2 Yi. We assume that Y is
four times continuously differentiable on [a, b]' Y E C 4 [a, b]. Then, by Taylor
expansion of Y(Xi ± h) about Xi, one finds
h3
h4
h 2 II
_.
I
_
_
III
_
(4) ( .
±)
Yi±l - y, ± hYi + 2! Yi ± 3! y, + 4! Y
X, ± 0i h ,
o < O'j= < 1.
and thus
Since y(4) is still continuous, it follows that
(7.4.2)
1/)
2
h 2 (4)
Ti(Y ) := Y (Xi - Ll Yi = -12 Y (Xi + Oih)
for some IOil < 1.
Because of (7.4.1), the values Yi = Y(Xi) satisfy the equations
Yo = n
(7.4.3)
2Yi -
Yi-l h2
Yi+l
+ q(X;)Yi = g(Xi) + Ti(Y),
Yn+l = ;3.
i = 1, 2, .... 71,
7.4 Difference Methods
583
With the abbreviations qi ;= q(Xi), gi;= g(Xi), the vectors
k ;=
gn-1
{3
gn
+ h2
and the symmetric n x n tridiagonal matrix
(7.4.4)
-1
the equations (7.4.3) are equivalent to
(7.4.5)
AY=k+T(Y)·
The difference method now consists in dropping the error term T(Y) in
(7.4.5) and taking the solution u = [U1,"" un]T of the resulting system of
linear equations,
(7.4.6)
Au= k,
as an approximation to y.
We first want to show some properties of the matrix A in (7.4.4). \Ve
will write A ::; B for two n x n matrices if Gij ::; bij for i, j = 1, 2, ... , n.
We now have the following;
(7.4.7) Theorem. If qi 2' 0 for i = 1, ... , n, then A in (7.4.4) is positive
definite, and 0 ::; A -1 ::; A-1 with the positive definite n x n matrix
-1
(7.4.8)
-1
PROOF. We begin by showing that A is positive definite. For this, we consider the n x n matrix An ;= h 2 A. According to Gershgorin's theorem
(6.9.4), we have for the eigenvalues the estimate IAi - 21 ::; 2, and hence
o ::; Ai ::; 4. If Ai = 0 were an eigenvalue of An, it would follow that
det(An) = 0; however, det(An) = n + 1, as one easily verifies by means
of the recurrence formula det(An+d = 2det(An) - det(A n - 2 ) (which is
584
7 Ordinary Differential Equations
obtained at once by expanding det(A n+ 1 ) along the first row). Neither is
Ai = 4 an eigenvalue of An, since otherwise An - 41 would be singular. But
-2 -1
An - 41
r-1
=
-1
D := diag(1, -1,1, ... , ±1),
so that 1det(A n - 41)1 = 1det(An)l, and with An, also An - 41 is nons ingular. We thus obtain the estimate 0 < Ai < 4 for the eigenvalues of An.
This shows in particular that An is positive definite. By virtue of
n
zH Az = zH Az + 2:qilziI2,
zH Az > 0
for z f 0 and qi:;:' 0,
i=l
it follows immediately that zH Az > 0 for z f 0; hence the positive definiteness of A. Theorem (4.3.2) shows the existence of A-I and A-1, and
it only remains to prove the inequality 0 ::; A-I ::; A-I. To this end, we
consider the matrices D, D, J, J with
h 2 A = D(I - J),
D = diag(2 + qlh2, ... , 2 + qnh2),
h 2 A = D(I - J),
D = 21.
Since qi :;:. 0, we obviously have the inequalities
0::; D::; D,
1
2 + qlh2
0
(7.4.9)
O::;J=
o
1
2 + q2h2
1
1
0
1
o
0
:2
0
::;J=
[:
1
1
:2
"2
0
In view of J = ~ ( - An + 21) and the estimate 0 < Ai < 4 for the eigenvalues
of An, we have -1 < J1i < 1 for the eigenvalues J1i of J, i.e., p( J) < 1 for
7.4 Difference Methods
585
the spectral radius of 1. From Theorem (6.9.2) there easily follows the
convergence of the series
Since 0 ::; J ::; J, we then also have convergence in
and since, by (7.4.9), 0 ::; D- 1 ::; iJ-l we get
as was to be shown.
D
From the above theorem it follows in particular that the system of
equations (7.4.6) has a solution (if q(x) 2: 0 for x E [a, b]), which can easily
be found, e.g., by means of the Choleski method [see Section 4.3]. Since A
is a tridiagonal matrix, the number of operations for the solution of (7.4.6)
is proportional to n.
We now wish to derive an estimate for the errors
Yi -Ui
of the approximations Ui obtained from (7.4.6) for the exact solutions Yi =
Y(Xi), i = 1, ... , n.
(7.4.10) Theorem. Let the boundary-value problem (7.4.1) have a solution
y(x) E C 4 [a, b], and let ly(4)(x)1 ::; M for x E [a, b]. Also, let q(x) 2: 0 for
x E [a, b] and u = [Ul, ... ,Un]T be the solution of (7.4.6). Then, for i = 1,
2, ... , n,
PROOF. Because of (7.4.5) and (7.4.6) we have for
f) - u the equation
A(fj - u) = T(Y).
U sing the notation
we obtain from Theorem (7.4.7) and the representation (7.4.2) of T(Y),
(7.4.11)
1
-
Mh2 -
If) - ul = IA- T(Y)I ::; A- 1 IT(y)1 ::; 12 A - 1e ,
586
7 Ordinary Differential Equations
where e := [1,1, ... , I]T. The vector A-Ie can be obtained at once by the
following observation: The special boundary-value problem
-y"(x) = 1,
y(a) = y(b) = 0,
°
of the type (7.4.1) has the exact solution y(x) = ~(x - a)(b - x). For this
boundary-value problem, however, we have T(Y) =
by (7.4.2), and the
discrete solution u of (7.4.6) coincides with the exact solution y of (7.4.5).
In addition, for this special boundary-value problem the matrix A in (7.4.4)
is just the matrix A in (7.4.8), and moreover k = e. We thus have A-Ie = u,
Ui = ~(Xi - a)(b - x;). Together with (7.4.11) this yields the assertion of
the theorem.
D
Under the assumptions of Theorem (7.4.lO), the errors go to zero
like h 2 : the difference method has order 2. The method of Stormer and
Numerov, which discretizes the differential equation y"(X) = !(x, y(x)) by
h2
Yi+I - 2Yi + Yi-I = 12 (fi+I + 10!i + !i-d
and leads to tridiagonal matrices, has order 4. All these methods can also
be applied to nonlinear boundary-value problems
y" = !(x,y),
y(a) = ct, y(b) =;3.
We then obtain a system of nonlinear equations for the approximations
y(x;) , which in general can be solved only iteratively. At any rate, one
obtains only methods of low order. To achieve high accuracy, one has to
use a very fine subdivision of the interval [a, b]; in contrast, e.g., with the
multiple shooting method [see Section 7.6 for comparative examples].
For an example of the application of difference methods to partial differential equations, see Section 8.4.
Ui ~
7.5 Variational Methods
Variational methods (Rayleigh-Ritz-Galerkin methods) are based on the
fact that the solutions of some important types of boundary-value problems
possess certain minimality properties. We want to explain these methods for
the following simple boundary-value problem for a function U : [a, b] --+ IR,
(7.5.1)
-(p(x)u'(x))' + q(x)u(x) = g(x, u(x)),
u(a) = ct, u(b) = ;3.
Note that the problem (7.5.1) is somewhat more general than (7.4.1).
Under the assumptions
7.5 Variational Methods
(7.5.2)
P E C 1 [a,b],
q E C[a, b],
587
p(x);:::po>o.
q(x) ;::: 0,
gu(x, u) :::; .Ao,
g E C 1 ([a, b] x lR),
with .Ao the smallest eigenvalue of the eigenvalue problem
-(pz')' - (.A - q)z = 0,
z(a) = z(b) = 0,
it is known that (7.5.1) always has exactly one solution. For the following we
therefore assume (7.5.2) and make the simplifying assumption g(x, u(x)) =
g(x) (no u-dependence of the right-hand side).
If u(x) is the solution of (7.5.1), then y(x) := u(x) -l(x) with
b-x
a-x
l(x) := OOb - + f3--b ,
-a
l(a) = 11',
a-
l(b) = f3,
is the solution of a boundary-value problem of the form
-(py')' + qy = j,
y(a) = 0, y(b) = 0,
(7.5.3)
with vanishing boundary values. Without loss of generality, we can thus
consider, instead of (7.5.1), problems of the form (7.5.3). With the help of
the differential operator
L(v) :=::= -(pv')' + qv
(7.5.4)
associated with (7.5.3), we want to formulate the problem (7.5.3) somewhat
differently. The operator L maps the set
D L := {v E C 2 [a,b] I v(a) = O,v(b) = O}
°
of all real functions that are twice continuously differentiable on [a, b] and
satisfy the boundary conditions v(a) = v(b) =
into the set C[a, b] of
continuous functions on [a, b].The boundary-value problem (7.5.3) is thus
equivalent to finding a solution of
L(y) = j,
(7.5.5)
Y E DL ,
Evidently, D L is a real vector space and L a linear operator on D L: for u, v E
DL also oou+f3v belongs to D L , and one has L(oou+f3v) = ooL(u)+f3L(v) for
all real numbers a, f3. On the set L 2 (a, b) of all square-integrable functions
on [a, b] we now introduce a bilinear form and a norm by means of the
definition
(7.5.6)
(u,v):=
lb
u(x)v(x)dx,
IluI12:= (u,u)1/2.
588
7 Ordinary Differential Equations
The differential operator L in (7.5.4) has a few properties which are important for the understanding of the variational methods. One has thc
following:
(7.5.7) Theorem. L is a symmetric operator on D L , i.e., we have
(u, L(v»
=
(L(u), v)
for all u, v E D L .
PROOF. Through integration by parts one finds
(u, L(v» =
=
=
lb
u(x)[-(p(x)v'(x»' + q(x)v(x)] dx
-u(x)p(x)v'(x)l~ +
lb
lb
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx,
since u (a)
u (b) = 0 for u E D L. For reasons of symmetry it follows
likewise that
1
b
(7.5.8)
(L(u), v) =
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx;
hence the assertion.
D
The right-hand side of (7.5.8) is not only defined for u, v E D L . Indeed,
let D := {u E Kl(a, b)lu(a) = u(b) = O} be the set of all functions u
that are absolutely continuous on [a, b] with u(a) = u(b) = 0, for which u'
on [a, bj (exists almost everwhere and) is square integrable [see Definition
(2.4.1.3)]. In particular, all piecewise continuously differentiable functions
satisfying the boundary conditions belong to D. D is again a real vector space with D :2 D L . The right-hand side of (7.5.8) defines on D the
symmetric bilinear form
(7.5.9)
[u, v] :=
lb
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx,
which for u, v E DL coincides with (u, L(v». As above, one shows for
y E D L , U E D, through integration by parts, that
(7.5.10)
(u, L(y» = [u, yj.
Relative to the scalar product introduced on DL by (7.5.6), L is a positive
definite operator in the following sense:
(7.5.11) Theorem. Under the assumptions (7.5.2) one has
[u,u] = (u,L(u)) >
°
7.5 Variational Methods
589
for all u i- 0, u E D L .
One even has the estimate
'Yllull~ ::; [u, u] ::; nu'll;'
(7.5.12)
for all u E D
with the norm IlullCXl := SUPa:C;x:C;b lu(x)1 and the constants
Po
'Y:= b _ a'
PROOF. In view of'Y >
because u( a) = 0,
r:= Ilplloo(b - a) + Ilqlloo(b - a) 3 .
°
it suffices to show (7.5.12). For u E D we have,
u(x) =
i
x
u'(O dE;,
for x E [a, b].
The Schwarz inequality yields the estimate
U(X)2::;
i
x
x
12 dE;,·l u'(02dE;,= (x-a)
::; (b - a)
·i U'(~)2 d~,
i u'(E;,)2d~
x
b
and thus
(7.5.13)
Now, by virtue of the assumption (7.5.2), we have p(x) ~ Po > 0, q(x) ~
for x E [a, b]; from (7.5.9) and (7.5.13) it thus follows that
[u, u] =
ib
[P(X)U'(X)2 + q(x)u(x)2] dx
~ Po
ib
u'(x)2 dx
°
~ b ~ a Ilull;'·
By (7.5.13), finally, we also have
[u,u] =
i
b
[p(X)U'(X)2+ q(X)U(X)2]dx
::; Ilplloo(b - a)llu'll~ + Ilqll=(b - a)llull~
::; Fllu' II~,
o
as was to be shown.
In particular, from (7.5.11) we may immediately deduce the uniqueness
of the solution Y of (7.5.3) or (7.5.5). If L(Yl) = L(Y2) = f, Yl, Y2 E D L ,
then L(YI-Y2) = and hence = (YI-Y2,L(YI-Y2)) ~ 'YIIYI-Y211;, ~ 0,
which yields at once Yl = Y2.
°
°
590
7 Ordinary Differential Equations
We now define for u E D a quadratic functional F: D -+ JR, by
F(u) := [u, u]- 2(u, f);
(7.5.14)
here f is the right-hand side of (7.5.3) or (7.5.5): F associates with each
function 'U E D a real number F(u). Fundamental for variational methods
is the observation that the function F attains its smallest value exactly for
the solution y of (7.5.5):
(7.5.15) Theorem. Let y E DL be the solution of (7.5.5). Then
F(u) > F(y)
for all u ED,
71
#- y.
PROOF We have L(y) =
F, for u #- y,
11,
f and therefore, by (7.5.10) and the definition of
ED,
F(u) = [u,u]- 2(u,f) = [u,11,]- 2(u,L(y))
= [u, u] - 2[u, y] + [y, y] - [y, y]
= [71. - y, 71. - y] - [y, y]
> -[y, y] = F(y),
since, by Theorem (7.5.11), [71. - y, 11, - y] > 0 for u #- y.
As a side result, we note the identity
(7.5.16)
[71. -
y, 71. - y] = F(u) + [y, y]
for all
o
71. E D.
Theorem (7.5.15) suggests approximating the desired solution y by minimizing F(u) approximately. Such an approximate minimum of F may be
obtained systematically as follows: One chooses a finite-dimensional subspace S of D, SeD. If dimS = m, then relative to a basis 11,1, ... , Urn of
S, every 71. E S admits a representation of the form
(7.5.17)
One then determines the minimum Us E S of Fin S,
(7.5.18)
F(us) = minF(u),
uES
and takes Us to be an approximation for the exact solution y of (7.5.5),
which according to (7.5.15) minimizes F on the whole space D. For the
computation of the approximation Us consider the following representation
of F(u), u E S, obtained via (7.5.17),
7.5 Variational Methods
591
P(61, 62,"" 6m ) : == F(6171.1 + ... + 6m u m )
[t
6i Ui ,
~ 6kUk]- 2 (~ 6k U f)
k,
m
m
i,k=l
k=l
With the help of the vectors 6, t.p and the m x m matrix A,
(7.5.19)
61
6:= [ :
6m
1,t.p [ (U1, j) 1 A
:=
:
,
[[U1' uI]
:=:
[U m ,U1] '"
(um,j)
1
[U1, um ]
:,
[um,um]
one obtains for the quadratic function P: IR m -+ IR
(7.5.20)
The matrix A is positive definite, since A is symmetric by (7.5.9), and for
all vectors 6 i= 0 one also has 'U := 61 u1 + ... + 6m u m i= 0 and thus, by
Theorem (7.5.11),
6T A6 = I)i6k[Ui, Uk] = [u,u] > O.
i,k
The system of linear equations
(7.5.21 )
A6=t.p,
therefore, has a unique solution 6 = 8, which can be computed by means
of the Choleski method [see Section 4.3]. In view of the identity
p( 6) = 6T A6 - 2t.pT 6 = 6T A6 - 28T A6 + 8T A8 - ;sr A8
= (6 - 8f A(6 - 8) - 8T A8
=
(6 - 8)T A(6 - 8) +p(8)
and (6 - 8)T A(6 - 8) > 0, for 6 i= 8, it follows at once that P(6) > p(8) for
6 i= 8, and consequently that the function
belonging to 8 furnishes the minimum (7.5.18) of F(u) on S. With Y
the solution of (7.5.5), it follows immediately from (7.5.16), by virtue of
F(us) = minuEsF(u), that
(7.5.22)
[us - y,Us - y] = min
s' [u - y,u - y].
uE
592
7 Ordinary Differential Equations
\Ve want to use this relation to estimate the error Ilus - ylloo. One has the
following:
(7.5.23) Theorem. Let y be the exact solution of (7.5.3), (7.5.5). Let S c
D be a finite-dimensional subspace of D, and let F( us) = minuEs F( u).
Then the estimate
Ilus - ylloc::; CIIu' - Y'lloo
holds for allu E S. Here C =
Theorem (7.5.11).
PROOF.
jr/r, where r, , are the constants of
(7.5.12) and (7.5.22) yield immediately, for arbitrary u E S ~ D,
,llus - YII~ ::; [us - y, Us - y] ::; [u - y, u - y] ::; rllu' - y'll;'·
o
From this, the assertion follows.
Every upper bound for infuEs[u - y, u - V], or more weakly for
inf Ilu' - Y'lloo,
uES
immediately gives rise to an estimate for Ilus - yllCX). We wish to indicate
such a bound for an important special case. We choose for the subspace S
of D the set
S = SPLl := {SLl I SLl(a) = SLl(b) = O}
of all cubic spline functions SLl [see Definition (2.4.1.1)] which belong to a
fixed subdivision of the interval [a, b],
Ll: a = Xo < Xl < X2 < ... < Xn = b,
and vanish at the end points a, b. Evidently, SPLl ~ DL ~ D. We denote
by IILlII the width of the largest subinterval of the subdivision Ll,
IILlII := max (Xi - Xi-I).
I::;~::;n
The spline function u := SLl with
U(Xi) = y(Xi), i = 0, 1, ... ,n,
u l (() = yl(() for (= a, b,
where y is the exact solution of (7.5.3), (7.5.5), clearly belongs to S = SPLl'
From Theorem (2.4.3.3) and its proof one gets the estimate
provided y E C 4 (a, b). Together with Theorem (7.5.23) this gives the following result:
7.5 Variational Methods
593
(7.5.24) Theorem. Let the exact solution Y of (7.5.3) belong to C 4 (a, b),
and let the assumptions (7.5.2) be satisfied. Let S := Sp d and Us be the
spline function for which
F(us) = minF(u).
uES
Then with the constant C:=
vir;.,. which is independent ofy,
By the footnote following Theorem (2.4.3.3), the estimate can be improved:
The error bound thus goes to zero like the third power of the fineness
of L1; in this respect, therefore, the method is superior to the difference
method of the previous section [see Theorem (7.4.10)].
For the practical implementation of the variational method, say in the
case S = Sp d, one first has to select a basis for Sp d' One easily sees that
m:= dimSPd = n+1 [according to (2.4.1.2) the spline function Sd E SPd
is uniquely determined by n + 1 conditions
Sd(Xi)=Yi,
i=1,2, ... ,n-1,
S~(a) = y~,
S~(b) = y~,
for arbitrary Yi, yb, y~]. As in Exercise 29, Chapter 2, one can find a basis
of spline functions So, S 1 ... , Sn in Sp d such that
(7.5.25)
This basis has the advantage that the corresponding matrix A m
(7.5.19) is a band matrix of the form
x
x
x
x
(7.5.26)
x
o
x
o
x
x
x
x
x
X
.1:
X
since, by virtue of (7.5.9), (7.5.25),
lSi, Sk] =
lb [P(x)S~(x)S~(x) +
q(x)Si(x)Sdx)] dx = 0
594
7 Ordinary Differential Equations
whenever Ii - kl .: : 4. Once the components of the matrix A and of the
vector cp in (7.5.19) for this basis So, ... , SrI have been found by integration,
one solves the system of linear equations (7.5.21) for t5 and so obtains the
approximation Us for the exact solution y.
Of course, instead of SPL\ one can also choose other spaces S <:;; D. For
example, one could take for S the set
71,-2
S = { PI P(x) = (x - a)(x - b)
L aix\ ai arbitrary}
i=O
of all polynomials of degree at most n which vanish at a and b. The matrix
A in this case would be in general no longer be a band matrix (and besides
would be very ill conditioned if the special polynomials
Pi(x):=(x-a)(x-b)x',
i=0,1, ... ,n-2,
were chosen as basis in S).
Similarly to the spline functions, one can also choose for S the set [see
(2.l.5.10)]
S =
H1
m ),
H1
m)
where L1 is again a subdivision a = Xo < Xl < ... < Xn = band
consists of all functions U E C m - l [a, b] which in each subinterval [Xi, Xi+1],
i = 0, ... , n - 1, coincide with a polynomial of degree :S 2m - 1 ("Hermite
function space"). Here again, through a suitable choice of the basis {ud
m ), one can assure that the matrix A = ([Ui' Uk]) is a band matrix.
in
By appropriate choices of the parameter m one can even obtain methods
m ),
of order higher than 3 in this way [ef. Theorem (7.5.24)]. For S =
analogously to Theorem (7.5.24), using the estimate (2.1.5.14) instead of
(2.4.3.3), one obtains
H1
H1
<_ 22m-2(2m
c _ 2)! I Y (2m)11 x 11L1112m-1 ,
I us - YII~,
~
l
provided Y E C 2m [a, b]; we have a method of order at least 2m-I. [Proofs of
these and similar estimates, as well as the generalization of the variational
methods to certain nonlinear boundary-value problems, can be found in
Ciarlet, Schultz, and Varga (1967).] Observe, however, that with increasing
m the matrix A becomes more and more difficult to compute.
We further point out that the variational method can be applied to
considerably more general boundary-value problems than (7.5.3), e.g., to
partial differential equations. The important prerequisites for the crucial
theorems (7.5.15), (7.5.23) and the solvability of (7.5.21) are essentially
only the symmetry of L and the possibility of estimates of the form (7.5.12)
[see Section 7.7 for applications to the Dirichlet problem].
7.5 Variational Methods
595
The bulk of the work with methods of this type consists in computing
the coefficients of the system of linear equations (7.5.21) by means of integration, having made a decision concerning an appropriate basis Ul, ... ,
U m in S. For boundary-value problems in ordinary differential equations
these methods as a rule are too expensive and too inaccurate, and are not
competitive with the multiple shooting method [see the following section
for comparative results]. Its value becomes evident only in boundary-value
problems for partial differential equations.
Finite-dimensional spaces SeD of functions satisfying the boundary
conditions of the boundary-value problem (7.5.5) also playa role in collocation methods. The idea of these methods is quite natural: One tries to
approximate the solution y(x) of (7.5.5) by a function u(x) in S represented
by
U=61Ul+···+6 m u m
in terms of a basis {Uj} of S. To this end, one selects m different collocation
points Xj E (a, b), j = 1, 2, ... , m and determines U E S such that
(7.5.27a)
(Lu)(Xj) = f(xj),
j = 1, 2, ... , m,
that is, the differential equation L( u) = f is to be satisfied exactly at the
collocation points. Of course, this is equivalent to solving the following
linear equations for the coefficients 6k, k = 1, ... , m:
L L(Uk)(Xj)6k = f(xj), j = 1, 2, ... , m.
m
(7.5.27b)
k=l
There are many possibilities of implementing such methods, namely,
by different choices of S, of bases of S, and of collocation points x j. With
a suitable choice, one may obtain very efficient methods.
For instance, it is advantageous for the basis functions Uj, j = 1, 2, ... ,
m, to have compact support, like the B-splines [see 2.4.4 and (7.5.25)]. Then
the matrix A = [L(Uk)(Xj)] of the linear system (7.5.27) is a band matrix.
In the so-called spectral methods, one chooses spaces S of trigonometric
polynomials (2.3.1.1) and (2.3.1.2), which are spanned by finitely many
simple trigonometric functions, say, e ikx , k = 0, ±1, ±2, .... Then, with
a suitable choice of the collocation points Xj, one can often solve (7.5.27)
with the aid of fast Fourier transforms [see Section 2.3.2].
For illustration, consider the simple boundary-value problem
-y"(x) = f(x),
y(O) = y(7r) = 0,
of the form (7.5.3). Here, all functions u(x) = E;:'=l 8k sin kx satisfy the boundary
conditions. If we choose the collocation points Xj := j7r /(m+ 1), j = 1, 2, ... , m,
then by (7.5.27) the numbers "/k := 8kk 2 will solve the trigonometric interpolation
problem [see Section 2.3.1]
596
7 Ordinary Differential Equations
~
'Yk sin kj7r = fj := f(xj),
~
m+1
j = 1, 2, ... , m.
k=l
Therefore, the 'Yk can be computed by using fast Fourier transforms.
These methods play an important role for the solution of initialboundary value problems for partial differential equations. For a detailed
exposition, the reader is referred to the literature, e.g., Gottlieb and Orszag
(1977), Canuto et al. (1987), and Boyd (1989).
7.6 Comparison of the Methods for Solving
Boundary-Value Problems for Ordinary
Differential Equations
The methods described in the previous sections, namely
(1) the simple shooting method,
(2) the multiple shooting method,
(3) the difference method,
(4) the variational method,
will now be compared by means of the following example. We consider the
linear boundary-value problem with linear separated boundary conditions
(7.6.1)
- y" + 400y = -400 cos 2 7rX - 27f2 cos(27fx) ,
y(O) = y(l) = 0,
with the exact solution [see Figure 28]
(7.6.2)
y(x) =
e
-20
1+
e- 20
e 20x
+
1
1 + e- 20
e- 20x _
cos 2 (7fx).
Although this problem must be considered very simple compared with
most problems in practice, the following peculiarities are present which lead
us to expect difficulties in its solution:
(1) The homogeneous differential equation -y" +400y = 0 associated with
(7.6.1) has solutions of the form y(x) = ce±20x which grow or decay at
a rapid exponential rate. This leads to difficulties in the simple shooting
methods.
(2) The derivatives y(i)(x), i = 1, 2, ... , of the exact solution (7.6.2)
are very large for x ~ 0 and x ~ 1. The error estimates (7.4.10) for
the difference method, and (7.5.24) for the variational method, thus
suggest large errors.
7.6 Boundary-Value Problems, Comparison of Methods
597
y
o
1
0.5
x
-0.5
Fig. 28. Exact solution of (7.6.1).
On a 12-digit computer the following results were found [in evaluating
the error, especially the relative error, one ought to observe the behavior
of the solution [see Figure 28]: maxO<x<l ly(x)1 R:! 0.77, y(O) = y(l) = 0,
y(0.5) = 0.907 998 593 370 x 10- 4 ]: - (a) Simple shooting method: The initial values
y(O) = 0,
y' (0) = -20
1 - e- 20
1 + e-
20 =
-19.999 999 9176 ...
of the exact solution were found, after one iteration, with a relative
error::; 3.2.10- 11 . If the initial-value problem corresponding to (7.6.1)
is then solved by means of an extrapolation method [see Section 7.2.14;
ALGOL procedure Diffsys in Bulirsch and Stoer (1966)], taking as initial
values the exact initial values y(O), y'(O) rounded to machine precision,
one obtains instead of the exact solution y(x) in (7.6.2) an approximate
solution y(x) having the absolute error .1y(x) = y(x) -y(x) and relative
error Cy(x) = (y(x) - y(x))jy(x) as shown in the table below. Here
the effect of the exponentially growing solution y(x) = ce 20x of the
homogeneous problem is clearly noticeable [ef. Section 7.3.4].
598
7 Ordinary Differential Equations
x
lL1y(x)l*
IEy(x)1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.9 X 10- 11
1.5 X 10- 10
1.1 X 10- 9
8.1 X 10- 9
6.0 X 10- 8
4.4 X 10- 7
3.3 X 10- 6
2.4 X 10- 5
1.8 X 10- 4
1.3 X 10- 3
2.5 X 1O- 11
2.4 X 10- 10
3.2 X 10- 9
8.6 X 10- 8
6.6 X 10- 4
4.7 X 10- 5
9.6 X 10- 6
3.8 X 10- 5
2.3 X 10- 4
00
• maxx E[O,l] lL1y(X) I ::::; 1.3 X 10- 3 .
(b) Multiple shooting method: To reduce the effect of the exponentially
growing solution y( x) = c e20x of the homogeneous problem, we chose
m large, m = 21:
k-1
0= Xl < X2 < ... < X21 = 1,
Xk=--·
20
X
lL1y(x)l*
IEy(x)1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
9.2 X 10- 13
2.7 X 10- 12
4.4 X 10- 13
3.4 X 10- 13
3.5 X 10- 13
1.3 X 10- 12
1.8 X 10- 12
8.9 X 10- 13
9.2 x 10- 13
5.0 X 10- 12
1.2 X 10- 12
4.3 X 10- 12
1.3 X 10- 12
3.6 X 10- 12
3.9 X 10- 9
1.4 X 1O- 11
5.3 X 10- 12
1.4 X 10- 12
1.2 X 10- 12
00
* maX x E[O,l]lL1y(x)1 ::::; 5 x 10- 12 •
Three iterations of the multiple shooting method [program in Oberle
and Grimm (1989)] gave the absolute and relative errors in the above
table.
(c) Difference method: The method of Section 7.4 for step sizes h =
l/(n + 1) produces the following absolute errors L1y = maxi IL1Y(Xi)l,
Xi = i h [powers of 2 were chosen for h in order that the matrix A in
(7.4.4) could be calculated exactly on the computer]:
7.6 Boundary-Value Problems, Comparison of Methods
h
f1y
2~4
2.0 X 1O~2
l.4 X 1O~3
9.0 X 1O~5
5.6 X 1O~6
2~6
2~8
2~1O
599
These errors are in harmony with the estimate of Theorem (7.4.10).
Halving the step h reduces the error to ~ of the old value. In order to
achieve errors comparable to those of the multiple shooting method,
f1y ::::::: 5 X 1O~ 12, one would have to choose h ::::::: 1O~6!
(d) Variational method: In the method of Section 7.5 we choose for the
spaces S spaces Sp Ll of cubic spline functions corresponding to equidistant subdivisions
f1:
0 = Xo < Xl < ... < Xn = 1,
Xi
= ih,
1
h =-.
n
The following maximum absolute errors f1y = Ilus - yll= [ef. Theorem
(7.5.24)] were found:
h
f1y
1/10
1/20
1/30
1/40
1/50
1/100
6.0 x 1O~3
5.4 x 1O~4
l.3 x 1O~4
4.7 x 1O~5
2.2 x 1O~5
l.8 x 1O~6
In order to achieve errors of the order of magnitude f1y ::::::: 5 x 1O~12 as
in the multiple shooting method, one would have to choose h ::::::: 1O~4.
Merely to compute the band matrix A in (7.5.26), this would require
about 4 x 10 4 integrations for this stepsize.
These results clearly demonstrate the superiority of the multiple shooting method, even for simple linear separated boundary-value problems.
Difference methods and variational methods are feasible, even here, only if
the solution need not be computed very accurately. For the same accuracy,
difference methods require the solution of larger systems of equations than
variational methods. This advantage of the variational methods, however,
is of importance only for smaller stepsizes h, and its significance is considerably reduced by the fact that the computation of the coefficients in the
600
7 Ordinary Differential Equations
systems of equations is much more involved than in the simple difference
methods. For the treatment of nonlinear boundary-value problems for ordinary differential equations the only feasible methods, effectively, are the
multiple shooting method and its modifications.
7.7 Variational Methods for Partial Differential
Equations. The Finite-Element Method
The methods described in Section 7.5 can also be used to solve boundaryvalue problems for partial differential equations. \¥e want to explain this for
a D'iTichlet buundaTy-value pmblem in lR? We are given a (open bounded)
region D c lR? with boundary 8D. We seek a function y: f? ---+ IR, f? :=
D U 8D such that
- L1y(x) + c(x)y(x) = j(x) for x = (XI,X2) E D,
y(x)=O forxE8D,
(7.7.1)
where L1 is the Laplace-Operator
!\
L.l
(
Y X
)
._
.-
8 2y(x)
8 2y(x)
8 2 + 8 2 .
Xl
X2
Here, c, f: f? ---+ IR are given continuous functions with c(x) 2: 0 for x E f?
To simplify the discussion, we assume that (7.7.1) has a solution y E C 2 (f?),
i.e., y has continuous derivatives D'ty := 8ay/8x't on D for Q ::; 2 which are
continuously extendable to f? As domain DL of the differential operator
L(v) := -L1v + cv associated to (7.7.1) we accordingly take DL := {v E
C 2(f? I vex) = 0 for x E 8D}. Thus, the problem is to find a solution of
L(v) = j,
(7.7.2)
v E D L.
We further assume that the region D is such that the integral theorems of
Gauss and Green apply, and moreover, that each line Xi = const, i = 1, 2,
intersects D in at most finitely many segments. With the abbreviations
(u, v) :=
in
u(x)v(x) dx,
IluI12:= (u, U)I/2
we then have [ef. Theorem (7.5.7)]:
(7.7.3) Theorem. L is a symmetTic opemtoT on D L :
(7.7.4)
(u, L(v)) = (L(u), v) = L[DIU(X) Dlv(x) + D 2 u(x) D 2 v(x)
+ c(x)u(x)v(x)] dx
7.7 Variational Methods for Partial Differential Equations
601
for all u, v E D L .
PROOF. One of Green's formulas is
- lS?r uLlvdx = lS?r (tDiUD;V) dx - laS?
r U~~ dw.
i=l
Here OV / ov is the differential quotient in the direction of the outward
normal, and dw the line element of af? Since u E DL, we have u(x) = 0 for
x E af?; thus IaS?u(av/av) dw = O. From this, the assertion of the theorem
follows at once.
D
The right-hand side of (7.7.4) again defines a bilinear form [cf. (7.5.9)]
2
(7.7.5)
[u,v]:= l(~DiUDiV+CU1J)dX'
and there holds an analog of Theorem (7.5.11):
(7.7.6) Theorem. There are constants 'Y > 0, r> 0 such that
(7.7.7)
'Yllull~(1) :::; [u, U] :::; r
2
L IID;ull~ for all E D
U
L.
;=1
Here
is the so-called "Sobolev norm" .
PROOF. The region f? is bounded. There exists, therefore, a square f?1 with
side lenght a which contains f? in its interior. Without loss of generality,
let the origin be a corner of f?1 (Figure 29).
a 1---------,
o
a
Fig. 29. The regions [2 and [21
602
7 Ordinary Differential Equations
Now let u E D L . Since u(x) := a for x E af2, we can extend u continuously onto f2I by letting u(.r) := a for x E f2I \f2. By our assumption on
il, DI U(il' .) is piecewise continuous, so that by u(a, X2) = a
U(Xl' X2) = faXl DIu(il, X2) dt 1 for all x E rh.
The Schwarz inequality therefore gives
[U(XI,X2W:s; Xlfaxl[DIU(tl,X2Wdtl
:s; a faa [D I u(t l ,X2W dt 1 for x E f2 1 .
Integrating this inequality over fh, one obtains
(7.7.8)
r [U(XI,x2)f dx I dx 2:S; a 21 [D Iu(tl,t2)]2dt1 dt 2
J.!?l
1
.!?l
:s; a 2
[(D1U)2 + (D2U)2] dx .
.!?l
Since u( x) = a for x E f21 \ f2, one can restrict the integration to f2, and
because c( x) :::" a for x E f?, it follows immediately that
2
a [u, u] :::" a
2
2
L IID;ull§ :::" Ilull~,
;=1
and from this finally
')'llull~(I) :s; [u, u]
.
1
wIth')' := - 2 - - '
a +1
Again from (7.7.8) it follows that
C := max Ic(x)l,
xE.!?
and thus
2
[11,11] :s;
r L IID;ull~
with r := 1 + Ca 2 .
;=1
This proves the theorem.
D
As in Section 7.5, one can find a larger veetor space D ~ D L of functions
such that definition (7.7.5) of [u, v] remains meaningful for u, v E D, the
statement (7.7.7) of the previous theorem remains valid for u ED, and
(7.7.9)
[u,v] = (u,L(y))
for y E DL and u E D.
7.7 Variational Methods for Partial Differential Equations
603
In what follows, we are not interested in what the largest possible set D with
this property looks like 8 ; we are only interested in special finite-dimensional
vector spaces S of functions for which the validity of (7.7.7) and (7.7.9) for
U E D S , D S the span of Sand D L , can be established individually.
EXAMPLE. Let the set fl be supplied with a triangulation 7 = {TI, . .. , Td, i.e.,
fl is the union of finitely many triangles Ti E 7, fl = U:=I Ti , such that any
two triangles in 7 either are disjoint or have exactly one vertex or one side in
common (Figure 30).
=
Fig. 30. Triangulation of fl
By 5i(i 2 1) we denote the set of all functions u: fl-+ lR with the following
properties:
(1) u is continuous on fl.
(2) u(x) = 0 for x E afl.
(3) On each triangle T of the triangulation 7 of fl the function u coincides with
a polynomial of degree i,
Evidently, each 5i is a real vector space. Even though u E 5 i need not even
be continuously differentiable in fl, it is nevertheless continuously differentiable
in the interior of each triangle T E T Definition (7.7.5) is meaningful for all u,
v E DSi; likewise, one sees immediately that (7.7.7) is valid for all u E DSi. The
function spaces 5 = 5 i , i = 1, 2, 3 - in particular 52 and 53 - are used in the
finite-element method in conjunction with the Rayleigh-Ritz method.
Exactly as in Section 7.5 [cf. (7.5.14), (7.5.15), (7.5.22)] one shows the
following:
8
One obtains such a D by "completion" of the vector space CJ(fl), equipped
with the norm I . Ilw(1) of all once continuously differentiable functions r.p on
fl, whose "support" supp(r.p) := {x E fl I r.p(x) =I=- O} is compact and contained
in fl.
604
7 Ordinary Differential Equations
(7.7.10) Theorem. Let y be the solution of (7.7.2), and S a finitedimensional space of functions for which (7.7.7), (7.7.9) hold for all u E D S .
Let F ( u) be defined by
(7.7.11)
F(u) := [u, uj- 2(u, J)
for U E D S .
Then:
(a) F(u) > fey) fOT allu -=1= y.
(b) There exists a Us E S with F(us) = minuEs F(u).
(c) For this Us one has
[Us - y, Us - yj = min [u - y, u - yj.
uES
The approximate solution Us can be represented as in (7.5.17)-(7.5.21)
and determined by solving a linear system of equations of the form (7.5.21).
For this, one must choose a basis in S and calculate the coefficients (7.5.19)
of the system of linear equations (7.5.21).
A practically useful basis for S2, e.g., is obtained in the following way: Let
P be the set of all vertices and midpoints of sides of the triangles T i , i = 1, 2,
... , k, of the triangulation T which are not located on the boundary af]. One
can show, then, that for each PEP there exists exactly one function Up E S2
~atisfying
up(P) = 1,
up(Q) =
° for Q =f. P, Q P.
E
In addition, these functions have the nice property that
up(x) = 0,
for all x E T with P f/. T;
it follows from this that in the matrix A of (7.5.21), all those elements vanish,
[up,uQ] = 0,
for which there is no triangle T E T to which both P and Q belong: the matrix
A is sparse. Bases with similar properties can be found also for the spaces Sl and
S3 [see Zl<imal (1968)].
The error estimate of Theorem (7.5.23), too, carries over.
(7.7.12) Theorem. Let y be the exact solution of (7.7.1), (7.7.2), and S a
finite-dimensional space of functions such that (7.7.7), (7.7.9) holds for all
u E D S . For the approximate solution Us with F( us) = minuEs F( u) one
then has the estimate
r
Ilus - yIIW(l) :::; inf ( - L IIDju - Djyll~
uES
2
I j=l
We now wish to indicate upper bounds for
)1/2
.
7.7 Variational Methods for Partial Differential Equations
605
in the case of the spaces 8 = 8 i , i = 1, 2, 3, defined in the above example
of triangulated regions fl. The following theorem holds:
(7.7.13) Theorem. Let fl be a triangulated region with triangulation T.
Let h be the maximum side length, and the smallest angle, of all triangles
occurring in T. Then the following is true:
e
then there exists a function U E 8 1 with
(a) Iu(x) - y(x)1 ::; M2h2 for all x E 0,
(b) IDju(x) - Djy(x)1 ::; 6M2h/ sine for j = 1, 2 and for all x ETa,
TET.
(2) If y E C 3 (s7) and
then there exists a U E 8 2 with
(a) lu(x) - y(x)1 ::; M3h3 for x E il,
(b) IDJu(x) - Djy(x)1 ::; 2M3h2/ sine for j = 1, 2, x ETa,
T E T.
(3) If y E C 4 (s7) and
then there exists a u E 8 3 with
(a) lu(x) -y(x)l::; 3M4h 4 /sine for x EO,
(b) IDju(x) - Djy(x)1 ::; 5M4h3/ sine for j = 1, 2, x ETa, T E T.
[Here, TO means the interior of the triangle T. The limitation x E TO,
T E T in the estimates (b) is necessary because the first derivatives Dju
may have jumps along the sides of the triangles T.]
Since for u E 8 i , i = 1, 2, 3, the function Dju is only piecewise continuous, and since
606
7 Ordinary Differential Equations
s C max IDju(x) - Djy(x) 2 ,
1
xETO
C:=
!n1dX,
TET
from each of the estimates given under (b) there immediately follows, via
Theorem (7.7.12), an upper bound for the Sobolev norm Ilus - vllw(l) of
the error Us -y for the special function spaces 5 = 5 i , i = 1. 2, 3, provided
only that the exact solution y satisfies the differentiability assumptions of
(7.7.13):
h
Ius! - yll",(!) s C1Ah -.-,
l
I
n
US 2
-
II u s 3 -
vll
wl !)
vii
Will
SIn ()
h2
S C2 A13 sin()'
17. 3
s C3 M4 ---:--()'
sm
The constants Ci are independent of the triangulation T of n and of v, and
can be specified explicitly; the constants 1'.1;. h, () have the same meaning
as in (7.7.13). These estimates show, e.g. for 5 = 53, that the error (i.e.,
its Sobolev norm) goes to zero like the third power of the fineness 17. of the
triangulation.
We will not prove Theorem (7.7.13) here. Proofs can be found, e.g., in
Zl<imal (1968) and in Ciarlet and Wagschal (1971). We only indicate that
the special functions U E 5 i , i = 1, 2, 3, of the theorem are obtained by
interpolation of y: For 5 = 52, e.g., one obtains the function U E 52 by
interpolation of y at the points PEP of the triangulation (see above):
11(x) =
L y(P)up(X).
PEP
An important advantage of the finite-element method lies in the fact
that boundary-value problems can be treated also for relatively complicated
regions n, provided n can still he triangulated. We remark, in addition,
that not only can problems in JR2 of the special form (7.7.1) be attacked
by these methods, hut also boundary-value problems in higher-dimensional
spaces and for more general differential operators than the Laplace operator.
A detailed exposition of the finite element method and its practical implementation can be found in Strang and Fix (1973), Oden and
Reddy(1976), Schwarz (1988), and Ciarlet and Lions (1991), and Quarteroni
and Valli (1994).
EXERCISES FOR CHAPTER 7
607
EXERCISES FOR CHAPTER 7
1.
Let A be a real diagonalizable n x n matrix with the real eigenvalues Ai, ... ,
An and the eigenvectors Cl, ... , cn . How can the solution set of the system
y' =Ay
be described in terms of the Ak and Ck? How is the special solution y(x)
obtained with y(O) = Yo, yo E lRn ?
2.
Determine the solution of the initial-value problem
y' = Jy,
y(O) = yo
with yo E lR n and the n x n matrix
[1 :1
A E lR.
Hint: Seek the kth component of the solution vector y(x) in the form y(x) =
Pk(x)e AX , Pk a polynomial of degree ::; n - k.
3.
Consider the initial-value problem
,
3
y(O) = O.
Y =x-x,
Suppose we use Euler's method with stepsize h to compute approximate
values T/(Xj; h) for y(Xj), Xj = jh. Find an explicit formula for T/(Xj; h) and
e(xj; h) = T/(Xj; h) - Y(Xj), and show that e(x; h), for x fixed, goes to zero as
h = x/n -+ O.
4.
The initial-value problem
y'
=..;y,
y(O) = 0
has the nontrivial solution y(x) = x 2 /4. Application of Euler's method however yields T/(Xj; h) = 0 for all x and h = x/n, n = 1, 2, .... Explain this
paradox.
5.
Let T/(x; h) be the approximate solution furnished by Euler's method for the
initial-value problem
,
y(O) = 1.
y =y,
(a) One has T/(x; h) = (1 + h)x/h.
(b) Show that T/(x; h) has the expansion
T/(X; h) = LTi(X)hi
with
TO(X) = eX,
i=O
which converges for Ihl < 1; the Ti(X) here are analytic functions independent of h.
608
7 Ordinary Differential Equations
(c) Determine Ti(X) for i = 1, 2, 3.
(d) The Ti(X), i ?': 1, are the solutions of the initial-value problems
i
k+l ( )
'"""' Ti_k X
,
Ti(X)=Ti(X)- L...J (k+l)!'
Ti(O)
= O.
k=l
6.
Show that the modified Euler method furnishes the exact solution of the
differential equation y' = -2ax.
7.
Show that the one-step method given by
¢P(X, y; h) := Hkl + 4k2 + k3],
kl := f(x, y),
k2 := f
(x + ~, y + ~kl) ,
k3 := f(x + h, y + h( -k 1 + 2k2))
("simple Kutta formula") is of third order.
8.
Consider the one-step method given by
h
¢P(x, y; h) := f(x, y) + "2 g(x + ~h, y + ~hf(x, y)),
where
g(x, y) := :xf(x, y) + (:/(x, y)) . f(x, y).
Show that it is a method of order 3.
9.
What does the general solution of the difference equation
Uj+2 = Uj+l + Uj,
j ?': 0,
look like? (For Uo = 0, Ul = lone obtains the "Fibonacci sequence".)
10. In the methods of Adams-Moulton type, given approximate values "T]p-(q-l),
... , "T]p for y(Xp-(q-l)), ... , y(xp), one computes an approximate value
"T]p+l for y(xp+1), Xp+l E [a, b], through the following iterative process [see
(7.2.6.7)]:
(0)
"T]p+l
ar b·Itrary;
for i = 0, 1, ... :
"T]p+1 := 'IF("T]p+l) := "T]p + h[,BqO!(Xp+l, "T]p+l) + ,Bql!p + ... +
(HI)
(i)
(i)
+,Bqqfp+l-q].
Show: For f E H(a,b) there exists an ho > 0 such that for all Ihl ::; ho the
sequence {"T]~~d converges toward an "T]p+l with "T]p+l = 'IF("T]p+I).
11. Use the error estimates for interpolation by polynomials, or for Newton-Cotes
formulas, to show that the Adams-Moulton method for q = 2 is of third order
and the Milne method for q = 2 of fourth order.
EXERCISES FOR CHAPTER 7
609
12. For q = 1 and q = 2 determine the coefficients {3qi in Nystrom's formulas
1]p+l = 1]p-l + h[{3lO!p + {3n!p-l],
1]p+l = 1]p-l + h[{320!p + {321!p-l + (322!p-2].
13. Check whether or not the linear multistep method
h
1]p - 1]p-4 = "3 [8!p-l - 4!p-2 + 8!p-3]
is convergent.
14. Let 1P'(1) = 0, and assume that for the coefficients in
IP'(J-t)
r-l
r-2
--1 = 'Yr-lJ-t
+ 'Yr-2J-t
+ ... + 1'1J-t + 1'0
J-tone has br-ll > br-21 2 ... 2 11'01· Does IP'(J-t) satisfy the stability condition?
15. Determine a, (3 and I' such that the linear multistep method
has order 3. Is the method thus found stable?
16. Consider the predictor method given by
(a) Determine ao, bo and b1 as a function of al such that the method has
order at least 2.
(b) For which aI-values is the method thus found stable?
(c) What special methods are obtained for al = 0 and al = -I?
(d) Can al be so chosen that there results a stable method of order 3?
17. Suppose the method
is applied to the initial-value problem
y' =0,
y(O) = c.
Let the starting values be 1]0 = c and 1]1 = c+eps (eps = machine precision).
What values 1]j are to be expected for arbitrary stepsize h?
18. For the solution of the boundary-value problem
    y'' = 100y,    y(0) = 1,    y(3) = e^{−30},
consider the initial-value problem
    y'' = 100y,    y(0) = 1,    y'(0) = s,
with solution y(x; s), and determine s = s̄ iteratively such that y(3; s̄) = e^{−30}. Assume further that s̄ is computed only within a relative error ε, i.e., instead of s̄ one obtains s̄(1 + ε). How large is y(3; s̄(1 + ε))? In this case, is the simple shooting method (as described above) a suitable method for the solution of the boundary-value problem?
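The amplification asked about here can be checked directly, since the initial-value problem has a closed-form solution. The following minimal sketch (not part of the original text; the function name y is illustrative) evaluates y(3; s̄(1 + ε)) for several relative errors ε:

    # Sketch for Exercise 18 (illustrative, not from the text): the exact solution of
    # y'' = 100 y, y(0) = 1, y'(0) = s is
    #     y(x; s) = (1 + s/10)/2 * exp(10 x) + (1 - s/10)/2 * exp(-10 x),
    # so the shooting condition y(3; s) = exp(-30) forces s_bar = -10.
    import math

    def y(x, s):
        return 0.5 * (1 + s / 10) * math.exp(10 * x) + 0.5 * (1 - s / 10) * math.exp(-10 * x)

    s_bar = -10.0                      # exact shooting parameter
    for eps in (1e-16, 1e-12, 1e-8):   # relative errors in s_bar
        print(eps, y(3.0, s_bar * (1 + eps)), math.exp(-30))
    # Even eps = 1e-16 changes y(3) by roughly (eps/2)*exp(30), i.e. about 5e-4,
    # many orders of magnitude larger than the target value exp(-30) ~ 9.4e-14.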
19. Consider the initial-value problem
    y' = K(y + y³),    y(0) = s.
(a) Determine the solution y(x; s) of this problem.
(b) In which x-neighborhood U_s(0) of 0 is y(x; s) defined?
(c) For a given b ≠ 0 find a k > 0 such that y(b; s) exists for all |s| < k.
20. Show that assumption (3) of Theorem (7.3.3.4) in the case n = 2 is not
satisfied for the boundary conditions
21. Show that the matrix E := A + BG m - 1 ... G 1 in (7.3.5.10) is nonsingular
under the assumptions of Theorem (7.3.3.4).
Hint: E = P_0(I + H), H = M + P_0^{−1} D_m (Z − I).
22. Show by an error analysis (backward analysis) that the solution vector of (7.3.5.10), (7.3.5.9), subject to rounding errors, can be interpreted as the exact result of (7.3.5.8) with slightly perturbed right-hand sides F_j and perturbed last equation
    (A + E_1)Δs_1 + E_2 Δs_2 + ⋯ + E_{m−1} Δs_{m−1} + (B + E_m)Δs_m = −F_m,
lub(E_j) small.
23. Let DF(s) be the matrix in (7.3.5.5). Prove:
    det(DF(s)) = det(A + B G_{m−1} ⋯ G_1).
For the inverse (DF(s))^{−1} find explicitly a decomposition into block matrices R, S, T:
    (DF(s))^{−1} = R S T
with S block-diagonal, R unit lower block triangular, and S, T chosen so that S T · DF(s) is unit lower block triangular.
24. Let Δ ∈ ℝⁿ be arbitrary. Prove: If Δs_1 := Δ and Δs_j, j = 2, …, m, is obtained from (7.3.5.9), then the vector Δs formed from Δs_1, …, Δs_m satisfies
    (DF(s)·F)ᵀ Δs = −FᵀF + F_mᵀ((A + B G_{m−1} ⋯ G_1)Δ − w).
Thus, if Δ is a solution of (7.3.5.10), then always
    (DF(s)·F)ᵀ Δs < 0.
25. Consider the boundary-value problem (y(x) ∈ ℝⁿ)
    y' = f(x, y)
with the separated boundary conditions
    y_i(a) − α_i = 0,    i = 1, …, k,
    y_i(b) − β_i = 0,    i = k + 1, …, n.
There are thus k initial values and n − k terminal values which are known. The information can be used to reduce the dimension of the system of equations (7.3.5.8) (k components of Δs_1 and n − k components of Δs_m are zero). For m = 3 construct the system of equations for the corrections Δs_i. Integrate (using the given information) from x_1 = a to x_2 as well as from x_3 = b to x_2 ("counter shooting"; see Figure 31).

Fig. 31. Counter shooting.
26. The steady concentration of a substrate in a spherical cell of radius 1 in an enzyme-catalyzed reaction is described by the solution y(x; α) of the following singular boundary-value problem [cf. Keller (1968)]:
    y'' = −(2/x) y' + y / (α(y + k)),    y'(0) = 0,    y(1) = 1,
α, k: parameters, k = 0.1, 10⁻³ ≤ α ≤ 10⁻¹. Although f is singular at x = 0, there exists a solution y(x) which is analytic for small |x|. In spite of this, every numerical integration method fails near x = 0. Help yourself as follows: By using the symmetries of y(x), expand y(x) as in Example 2, Section 7.3.6, into a Taylor series about 0 up to and including the term x⁶;
thereby express all coefficients in terms of λ := y(0). By means of p(x; λ) = y(0) + y'(0)x + ⋯ + y⁽⁶⁾(0)x⁶/6! one obtains a modified problem
(∗)    y'' = F(x, y, y') := { d²p(x; λ)/dx²   for 0 ≤ x ≤ 10⁻²,
                               f(x, y, y')     for 10⁻² < x ≤ 1,
       y(0) = λ,    y'(0) = 0,    y(1) = 1,
which is better suited for numerical solution. Consider (∗) an eigenvalue problem for the eigenvalue λ, and formulate (∗) as in Section 7.3.0 as a boundary-value problem (without eigenvalue). For checking purposes we give the derivatives:
    y(0) = λ,    y⁽²⁾(0) = λ / (3α(λ + k)),    y⁽⁴⁾(0) = kλ / (5α²(λ + k)³),
    y⁽⁶⁾(0) = (3k − 10λ) kλ / (21α³(λ + k)⁵),    y⁽ⁱ⁾(0) = 0   for i = 1, 3, 5.
27. Consider the boundary-value problem
y" - p(x)y' - q(x)y = r(x),
y(a) = a,
y(b) = (3
with q(x) 2: qo > 0 for a ::; x ::; b. We are seeking approximate values Ui for
the exact values y(Xi), i = 1, 2, ... , n, at Xi = a + ih, h = (b - a)/(n + 1).
Replacing y'(Xi) by (Ui+l - ui-d/2h and y"(x;) by (Ui-l - 2Ui + ui+d/h 2
for i = 1, 2, ... , n, and putting Uo = a, Un +l = (3, one obtains from the
differential equation a system of equations for the vector U = [Ul, ... , unf
Au = k,
A an n x n-matrix, k E JRn .
(a) Determine A and k.
(b) Show that the system of equations is uniquely solvable for h sufficiently
small.
28. Consider the boundary-value problem
y" = g(x),
with 9 E e[O, 1].
(a) Show:
y(x) =
with
y(O) = y(l) = 0,
11
G(x,
G(x 0 .- {~(x - 1)
, .- x(~ - 1)
~)g(~)d~
for 0 ::; ~ ::; x ::; 1,
for 0 ::; x ::; ~ ::; 1.
(b) Replacing g(x) by g(x) + Llg(x) with ILlg(x) I ::; E: for all x, the solution
y(x) changes to y(x) + Lly(x). Prove:
E:
ILly(x) I ::; "2x(l - x)
for 0 ::; x ::; 1.
(c) The difference method described in Section 7.4 yields the system of equations [see (7.4.6)]
    Au = k,
the solution vector u = [u_1, …, u_n]ᵀ of which provides approximate values u_i for y(x_i), x_i = i/(n + 1), i = 1, 2, …, n. Replacing k by k + Δk with |Δk_i| ≤ ε, i = 1, 2, …, n, the vector u changes to u + Δu. Show:
i = 1, 2, ... , n.
29. Let
    D := {u | u(0) = 0, u ∈ C²[0, 1]}
and
    F(u) := ∫₀¹ { ½(u'(x))² + f(x, u(x)) } dx + p(u(1))
with u ∈ D, f_uu(x, u) ≥ 0, p''(u) ≥ 0. Prove: If y(x) is a solution of
    y'' − f_u(x, y) = 0,    y(0) = 0,    y'(1) + p'(y(1)) = 0,
then
    F(y) < F(u),    u ∈ D, u ≠ y,
and vice versa.
30.
(a) Let T be a triangle in ℝ² with the vertices P_1, P_3 and P_5; let further P_2 ∈ P_1P_3, P_4 ∈ P_3P_5, and P_6 ∈ P_5P_1 be different from P_1, P_3, P_5. Then for arbitrary real numbers y_1, …, y_6 there exists exactly one polynomial of degree at most 2 which assumes the values y_i at the points P_i, i = 1, …, 6.
Hint: It evidently suffices to show this for a single triangle with suitably chosen coordinates.
(b) Is the statement analogous to (a) valid for arbitrary position of the P_i?
(c) Let T_1, T_2 be two triangles of a triangulation with a common side g and u_1, u_2 polynomials of degree at most 2 on T_1 and T_2, respectively. Show: If u_1 and u_2 agree in three distinct points of g, then u_2 is a continuous extension of u_1 onto T_2 (i.e., u_1 and u_2 agree on all of g).
References for Chapter 7
Babuška, I., Práger, M., Vitásek, E. (1966): Numerical Processes in Differential Equations. New York: Interscience.
Bader, G., Deuflhard, P. (1983): A semi-implicit midpoint rule for stiff systems of ordinary differential equations. Numer. Math. 41, 373-398.
Bank, R. E., Bulirsch, R., Merten, K. (1990): Mathematical modelling and simulation of electrical circuits and semiconductor devices. ISNM 93, Basel: Birkhäuser.
Bock, H.G., Schlöder, J.P., Schulz, V.H. (1995): Numerik großer Differentiell-Algebraischer Gleichungen: Simulation und Optimierung, pp. 35-80 in: Schuler, H. (ed.): Prozeßsimulation, Weinheim: VCH.
Boyd, J. P. (1989): Chebyshev and Fourier Spectral Methods. Berlin, Heidelberg,
New York: Springer-Verlag.
Broyden, C.G. (1967): Quasi-Newton methods and their application to function minimization. Math. Comp. 21, 368-381.
Buchauer, O., Hiltmann, P., Kiehl, M. (1994): Sensitivity analysis of initial-value problems with application to shooting techniques. Numerische Mathematik 67, 151-159.
Bulirsch, R. (1971): Die Mehrzielmethode zur numerischen Lösung von nichtlinearen Randwertproblemen und Aufgaben der optimalen Steuerung. Report der Carl-Cranz-Gesellschaft.
- - - , Stoer, J. (1966): Numerical treatment of ordinary differential equations
by extrapolation methods. Numer. Math. 8, 1-13.
Butcher, J. C. (1964): On Runge-Kutta processes of high order. J. Austral. Math. Soc. 4, 179-194.
Byrne, D. G., Hindmarsh, A. C. (1987): Stiff o.d.e.-solvers: A review of current and coming attractions. J. Comp. Phys. 70, 1-62.
Callies, R. (2000): Design Optimization and Optimal Control. Differential-Algebraic Systems, Multigrid Multiple-Shooting Methods and Numerical Realization. Habilitation Thesis, Technische Universität München.
- - - (2000a): Efficient Treatment of Experimental Data in Integrated Design Optimization. In: Advances in Computational Engineering & Sciences (S. N. Atluri, F. W. Brust, Eds.), pp. 1188-1193. Palmdale/CA: Tech Science Press.
- - - (2001): Optimized Runge-Kutta Methods for Linear Differential Equations. Submitted to: J. Appl. Math. Comp.
- - - (2001a): Multidimensional Stepsize Control. ZAMM 81, S3, 743-744.
Canuto, C., Hussaini, M. Y., Quarteroni, A., Zang, T. A. (1987): Spectral Methods
in Fluid Dynamics. Berlin, Heidelberg, New York: Springer-Verlag.
Caracotsios, M., Stewart, W.E. (1985): Sensitivity analysis of initial value problems with mixed ODEs and algebraic equations. Computers and Chemical Engineering 9, 359-365.
Ciarlet, P. G., Lions, J. L., Eds. (1991): Handbook of Numerical Analysis. Vol. II. Finite Element Methods (Part 1). Amsterdam: North Holland.
- - - , Schultz, M. H., Varga, R. S. (1967): Numerical methods of high order
accuracy for nonlinear boundary value problems. Numer. Math. 9, 394-430.
- - - , Wagschal, C. (1971): Multipoint Taylor formulas and applications to the finite element method. Numer. Math. 17, 84-100.
Clark, N. W. (1968): A study of some numerical methods for the integration
of systems of first order ordinary differential equations. Report ANL-7428.
Argonne National Laboratories.
Coddington, E. A., Levinson, N. (1955): Theory of Ordinary Differential Equations. New York: McGraw-Hill.
Collatz, L. (1960): The Numerical Treatment of Differential Equations. Berlin,
Gottingen, Heidelberg: Springer-Verlag.
- - - (1966): Functional Analysis and Numerical Mathematics. New York: Academic Press.
Crane, P.J., Fox, P.A. (1969): A comparative study of computer programs for
integrating differential equations. Num. Math. Computer Program library one
- Basic routines for general use. Vol. 2, issue 2. New Jersey: Bell Telephone
Laboratories Inc.
Dahlquist, G. (1956): Convergence and stability in the numerical integration of
ordinary differential equations. Math. Scand. 4, 33-53.
- - - (1959): Stability and error bounds in the numerical integration of ordinary differential equations. Trans. Roy. Inst. Tech. (Stockholm), No. 130.
- - - (1963): A special stability problem for linear multistep methods. BIT 3,
27-43.
Deuflhard, P. (1979): A stepsize control for continuation methods and its special
application to multiple shooting techniques. Numer. Math. 33, 115-146.
----, Hairer, E., Zugck, J. (1987): One-step and extrapolation methods for
differential-algebraic systems. Numer. Math. 51, 501-516.
Diekhoff, H.-J., Lory, P., Oberle, H. J., Pesch, H.-J., Rentrop, P., Seydel, R. (1977): Comparing routines for the numerical solution of initial value problems of ordinary differential equations in multiple shooting. Numer. Math. 27, 449-469.
Dormand, J. R., Prince, P. J. (1980): A family of embedded Runge-Kutta formulae. J. Compo Appl. Math. 6, 19-26.
Eich, E. (1992): Projizierende Mehrschrittverfahren zur numerischen Lösung von Bewegungsgleichungen technischer Mehrkörpersysteme mit Zwangsbedingungen und Unstetigkeiten. Fortschritt-Berichte VDI, Reihe 18, Nr. 109, VDI-Verlag, Düsseldorf.
Engl, G., Kröner, A., Kronseder, T., von Stryk, O. (1999): Numerical Simulation and Optimal Control of Air Separation Plants, pp. 221-231 in: Bungartz, H.-J., Durst, F., Zenger, Chr. (eds.): High Performance Scientific and Engineering Computing, Lecture Notes in Computational Science and Engineering Vol. 8, New York: Springer-Verlag.
Enright, W. H., Hull, T. E., Lindberg, B. (1975): Comparing numerical methods
for stiff systems of ordinary differential equations. BIT 15, 10-48.
Fehlberg, E. (1964): New high-order Runge-Kutta formulas with stepsize control
for systems of first- and second-order differential equations. Z. Angew. Math.
Mech. 44, T17-T29.
- - - (1966): New high-order Runge-Kutta formulas with an arbitrary small
truncation error. Z. Angew. Math. Mech. 46, 1-16.
- - - (1969): Klassische Runge-Kutta-Formeln fünfter und siebenter Ordnung mit Schrittweiten-Kontrolle. Computing 4, 93-106.
- - - (1970): Klassische Runge-Kutta-Formeln vierter und niedrigerer Ordnung mit Schrittweiten-Kontrolle und ihre Anwendung auf Wärmeleitungsprobleme. Computing 6, 61-71.
Galán, S., Feehery, W.F., Barton, P.I. (1998): Parametric sensitivity functions for hybrid discrete/continuous systems. Preprint submitted to Applied Numerical Mathematics.
Gantmacher, F. R. (1969): Matrizenrechnung II. Berlin: VEB Deutscher Verlag der Wissenschaften.
Gear, C. W. (1971): Numerical initial value problems in ordinary differential equations. Englewood Cliffs, N.J.: Prentice-Hall.
-------- (1988): Differential algebraic equation index transformations. SIAM J.
Sci. Statist. Comput. 9, 39-47.
- - - , Petzold, L. R. (1984): ODE methods for the solution of differential/algebraic systems. SIAM J. Numer. Anal. 21, 716-728.
Gill, P.E., Murray, W., Wright, M.H. (1995): Practical Optimization, 10th printing. London: Academic Press.
Gottlieb, D., Orszag, S. A. (1977): Numerical Analysis of Spectral Methods: Theory and Applications. Philadelphia: SIAM.
Gragg, W. (1963): Repeated extrapolation to the limit in the numerical solution
of ordinary differential equations. Thesis, UCLA.
- - - (1965): On extrapolation algorithms for ordinary initial value problems.
J. SIAM Numer. Anal. Ser. B 2, 384-403.
Griepentrog, E., März, R. (1986): Differential-Algebraic Equations and Their Numerical Treatment. Leipzig: Teubner.
Grigorieff, R.D. (1972, 1977): Numerik gewöhnlicher Differentialgleichungen 1, 2. Stuttgart: Teubner.
Hairer, E., Lubich, Ch. (1984): Asymptotic expansions of the global error of fixed
stepsize methods. Numer. Math. 45, 345-360.
- - - , - - - , Roche, M. (1989): The numerical solution of differentialalgebraic systems by Runge-Kutta methods. In: Lecture Notes in Mathematics
1409. Berlin, Heidelberg, New York: Springer-Verlag.
- - - , Nørsett, S. P., Wanner, G. (1993): Solving Ordinary Differential Equations I. Nonstiff Problems. 2nd ed., Berlin, Heidelberg, New York: Springer-Verlag.
- - - , Wanner, G. (1991): Solving Ordinary Differential Equations II. Stiff
and Differential-Algebraic Problems. Berlin, Heidelberg, New York: SpringerVerlag.
Heim, A., von Stryk, O. (1996): Documentation of PAREST - A multiple shooting code for optimization problems in differential-algebraic equations. Technical Report M9616, Fakultät für Mathematik, Technische Universität München.
Henrici, P. (1962): Discrete Variable Methods in Ordinary Differential Equations.
New York: John Wiley.
Hestenes, M. R. (1966): Calculus of Variations and Optimal Control Theory. New York: John Wiley.
Heun, K. (1900): Neue Methode zur approximativen Integration der Differentialgleichungen einer unabhängigen Variablen. Z. Math. Phys. 45, 23-38.
Hiltmann, P. (1990): Numerische Lösung von Mehrpunkt-Randwertproblemen und Aufgaben der optimalen Steuerung mit Steuerfunktionen über endlichdimensionalen Räumen. PhD Thesis, Mathematisches Institut, Technische Universität München.
Horneber, E. H. (1985): Simulation elektrischer Schaltungen auf dem Rechner.
Fachberichte Simulation, Bd. 5. Berlin, Heidelberg, New York: Springer-Verlag.
Hull, T. E., Enright, W. H., Fellen, B. M., Sedgwick, A. E. (1972): Comparing
numerical methods for ordinary differential equations. SIAM J. Numer. Anal.
9, 603-637. [Errata, ibid. 11, 681 (1974).]
Isaacson, E., Keller, H. B. (1966): Analysis of Numerical Methods. New York:
John Wiley.
Kaps, P., Rentrop, P. (1979): Generalized Runge-Kutta methods of order four
with stepsize control for stiff ordinary differential equations. Numer. Math. 33,
55-68.
Keller, H. B. (1968): Numerical Methods for Two-Point Boundary- Value Problems. London: Blaisdell.
Kiehl, M. (1999): Sensitivity analysis of ODEs and DAEs - theory and implementation guide. Optimization Methods and Software 10,803-821.
Krogh, F. T. (1974): Changing step size in the integration of differential equations
using modified divided differences. In: Proceedings of the Conference on the
Numerical Solution of Ordinary Differential Equations, 22-71. Lecture Notes
in Mathematics 362. Berlin, Heidelberg, New York: Springer-Verlag.
Kutta, W. (1901): Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Z. Math. Phys. 46, 435-453.
Lambert, J. D. (1973): Computational Methods in Ordinary Differential Equations. London, New York, Sydney, Toronto: John Wiley.
Leis, J.R., Kramer, M.A. (1985): Sensitivity analysis of systems of differential and algebraic equations. Computers and Chemical Engineering 9, 93-96.
Morrison, K.R., Sargent, R.W.H. (1986): Optimization of multistage processes described by differential-algebraic equations, pp. 86-102 in: Lecture Notes in Mathematics 1230. New York: Springer-Verlag.
Na, T. Y., Tang, S. C. (1969): A method for the solution of conduction heat
transfer with non-linear heat generation. Z. Angew. Math. Mech. 49, 45-52.
Oberle, H. J., Grimm, W. (1989): BNDSCO - A program for the numerical solution of optimal control problems. Internal Report No. 515-89/22, Institute for Flight Systems Dynamics, DLR, Oberpfaffenhofen, Germany.
Oden, J. T., Reddy, J. N. (1976): An Introduction to the Mathematical Theory of Finite Elements. New York: John Wiley.
Osborne, M. R. (1969): On shooting methods for boundary value problems. J.
Math. Anal. Appl. 27, 417-433.
Petzold, L. R. (1982): A description of DASSL - A differential algebraic system solver. IMACS Trans. Sci. Comp. 1, 65ff., Ed. H. Stepleman. Amsterdam: North Holland.
Quarteroni, A., Valli, A. (1994): Numerical Approximation of Partial Differential
Equations. Berlin, Heidelberg, New York: Springer-Verlag.
Rentrop, P. (1985): Partitioned Runge-Kutta methods with stiffness detection and stepsize control. Numer. Math. 47, 545-564.
- - -, Roche, M., Steinebach, G. (1989): The application of Rosenbrock-Wanner type methods with stepsize control in differential-algebraic equations. Numer. Math. 55, 545-563.
Rozenvasser, E. (1967): General sensitivity equations of discontinuous systems. Automation and Remote Control 28, 400-404.
Runge, C. (1895): Über die numerische Auflösung von Differentialgleichungen. Math. Ann. 46, 167-178.
Schwarz, H. R. (1981): FORTRAN-Programme zur Methode der finiten Elemente. Stuttgart: Teubner.
- - - (1988): Finite Element Methods. London, New York: Academic Press.
Shampine, L. F., Gordon, M. K. (1975): Computer Solution of Ordinary Differential Equations. The Initial Value Problem. San Francisco: Freeman and Company.
- - -, Watts, H. A., Davenport, S. M. (1976): Solving nonstiff ordinary differential equations - The state of the art. SIAM Review 18, 376-411.
Shanks, E. B. (1966): Solution of differential equations by evaluation of functions. Math. Comp. 20, 21-38.
Stetter, H. J. (1973): Analysis of Discretization Methods for Ordinary Differential
Equations. Berlin, Heidelberg, New York: Springer-Verlag.
Strang, G., Fix, G. J. (1973): An Analysis of the Finite Element Method. Englewood Cliffs, N.J.: Prentice-Hall.
Troesch, B. A. (1960): Intrinsic difficulties in the numerical solution of a boundary
value problem. Report NN-142, TRW, Inc. Redondo Beach, CA.
- - - (1976): A simple approach to a sensitive two-point boundary value problem. J. Computational Phys. 21, 279-290.
Wilkinson, J. H. (1982): Note on the practical significance of the Drazin inverse. In: S.L. Campbell, Ed., Recent Applications of Generalized Inverses. Pitman Publ. 66, 82-89.
Willoughby, R. A. (1974): Stiff Differential Systems. New York: Plenum Press.
Zlamal, M. (1968): On the finite element method. Numer. Math. 12, 394-409.
8 Iterative Methods for the Solution of
Large Systems of Linear Equations.
Additional Methods
8.0 Introduction
Many applications require the solution of very large systems of linear equations Ax = b in which the matrix A is fortunately sparse, i.e., has only relatively few nonzero elements. Such systems arise, for instance, if difference
methods or finite element methods are being used for solving boundary
value problems in partial differential equations. The classical elimination
methods [see Chapter 4] are not suitable in this context since they tend to
lead to the formation of dense intermediate matrices, making the number of
arithmetic operations necessary for the solution too large even for presentday computers, not to mention the fact that the memory requirements for
such intermediate matrices exceed available space.
For these reasons, researchers have long since moved to iterative methods for solving such systems of equations. These methods start with an initial vector x^(0) and subsequently generate a sequence of vectors x^(1), x^(2), … which converges toward the desired solution x. A common feature of all these
methods is the fact that the effort required for an individual iteration step
x^(i) → x^(i+1) is essentially equal to that of multiplying a vector by the
matrix A - a very modest effort provided A is sparse. Therefore, one can
still carry out a relatively large number of iterations with a reasonable
amount of effort. This is necessary for the reason alone that these methods
converge only linearly, and sometimes very slowly at that. Iterative methods
are therefore usually inferior to elimination methods if A is small (a 100 x
100 matrix is small in this sense) or not sparse.
Somewhat outside of this framework lie the so-called method of iterative
refinement [see (8.1.9)-(8.1.11)] and the Krylov space methods [see Section
8.7]. Iterative refinement is used frequently to iteratively improve the accuracy of an approximate solution x̃ of a system of equations which has been
obtained by an elimination method using machine arithmetic.
The general characteristics of iterative methods, stated above, also apply to Krylov space methods, with the one exception that these methods, in
exact arithmetic, terminate with the exact solution x after a finite number
of steps. They generate iterates Xk that approximate the solution of Ax = b
best among all vectors belonging to a Krylov space of dimension k. We describe four methods of this type: the classical conjugate-gradient method
(cg-method) of Hestenes and Stiefel (1952) for systems of equations with
a positive definite matrix A [see Section 8.7.1], and, for systems with an
arbitrary nonsingular A, the GMRES-method of Saad and Schultz (1986)
[see Section 8.7.2], a simplified version of the QMR-method of Freund and
Nachtigal (1991) [Section 8.7.3], and the Bi-CGSTAB algorithm of van der
Vorst (1992) [Section 8.7.4]. These methods have the advantage that they
do not depend on additional parameters (as other methods do), the proper
choice of which is often difficult. With regard to the applicability of Krylov
space methods, however, the same remarks apply as for the true iterative
methods. For this reason, we include these methods in the present chapter.
A detailed treatment of iterative methods can be found in the new
edition of the classical book by Varga (2000), and also in Young (1971)
and Axelsson (1994); Krylov space methods are described in depth in Saad
(1996).
For the solution of certain special systems of equations arising with the
solution of the so-called model problem (the Poisson problem on a rectangle) there are some direct methods which give the solution after finitely
many steps and are superior to (most) iterative methods. Two such methods are the algorithm of Buneman (1969), and Fourier methods that use the
FFT-algorithm of trigonometric interpolation. Both methods are described
in Section 8.8.
Nowadays, the very large systems of equations connected with the solution of boundary-value problems of partial differential equations by finite
element techniques are mainly solved by multigrid methods. We describe in
Section 8.9 only concepts of these important iterative methods, using the
context of a boundary value problem for an ordinary differential equation. For
thorough treatments of multigrid methods, which are closely connected to
the numerics of partial differential equations, we refer the reader to the vast
special literature on this subject, e.g. to Hackbusch (1985), Braess (1997),
Bramble (1993), Quarteroni and Valli (1997).
It should be pointed out that elimination techniques for solving linear
systems Ax = b have also been modified in various ways to handle large
sparse matrices A. These techniques concern themselves with appropriate ways of storing such matrices, and with the determination of suitable
permutation matrices P_1, P_2 such that in the application of elimination methods to the equivalent system P_1 A P_2 y = P_1 b, y = P_2^{−1} x, the intermediate matrices generated remain as sparse as possible. Basic methods of
this type were described in Section 4.12. For a relevant exposition, we refer
the reader to the literature cited there, and in addition to George (1973),
who proposed a particularly useful technique for handling the linear equa-
tions arising from the application of finite element techniques to partial
differential equations.
8.1 General Procedures for the Construction of
Iterative Methods
Let a nonsingular n × n matrix A be given, and a system of linear equations
(8.1.1)    Ax = b
with the exact solution x := A⁻¹b. We consider iterative methods of the form [cf. Chapter 5]
(8.1.2)    x^(i+1) := Φ(x^(i)),    i = 0, 1, ….
With the help of an arbitrary nonsingular n × n matrix B such iteration algorithms can be obtained from the equation
    Bx + (A − B)x = b,
by putting
(8.1.3)    B x^(i+1) + (A − B) x^(i) = b,
or, solved for x^(i+1),
(8.1.4)    x^(i+1) = x^(i) − B⁻¹(A x^(i) − b) = (I − B⁻¹A) x^(i) + B⁻¹b,    i = 0, 1, ….
Such iteration methods, in this generality, were first considered by Wittmeyer (1936).
Note that (8.1.4) is identical with the following special vector iteration [see Section 6.6.3]:
    [1; x^(i+1)] = W [1; x^(i)],    W := [ 1  0 ; B⁻¹b  I − B⁻¹A ],
where the matrix W of order n + 1, in correspondence to the eigenvalue λ_0 := 1, has the left eigenvector [1, 0] and the right eigenvector [1; x], x := A⁻¹b. According to the results of Section 6.6.3, the sequence [1; x^(i)] will converge to [1; x] only if λ_0 = 1 is a simple dominant eigenvalue of W, i.e., only if the remaining eigenvalues λ_1, …, λ_n of W (these are the eigenvalues of I − B⁻¹A) are smaller in absolute value than 1.
Each choice of a nonsingular matrix B leads to a potential iterative
method (8.1.4). The better B satisfies the following conditions, the more
useful the method will be:
(1) the system of equations (8.1.3) is easily solved for x^(i+1),
(2) the eigenvalues of I − B⁻¹A have moduli which are as small as possible.
The better B agrees with A, the more likely the latter will be true.
These questions of optimality and convergence will be examined in the next
sections. Here, we only wish to indicate a few important special iterative
methods (8.1.3) obtained by simple choices of B. We introduce, for this
purpose, the following standard decomposition of A:
(8.1.5)    A = D − E − F,
with
    D := diag(a_11, a_22, …, a_nn),
−E the strictly lower triangular part and −F the strictly upper triangular part of A, as well as the abbreviations
(8.1.6)    L := D⁻¹E,    U := D⁻¹F,    J := L + U,    H := (I − L)⁻¹U,
assuming a_ii ≠ 0 for i = 1, 2, …, n.
(1) In the Jacobi method or total-step method one chooses
(8.1.7)    B := D,    I − B⁻¹A = J.
One thus obtains for (8.1.3)
    a_jj x_j^(i+1) + Σ_{k≠j} a_jk x_k^(i) = b_j,    j = 1, 2, …, n,    i = 0, 1, …,
where x^(i) := [x_1^(i), …, x_n^(i)]ᵀ.
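Written componentwise as in the formula above, one Jacobi sweep takes the following form; this short Python sketch is illustrative and not part of the original text (a dense array A is assumed for simplicity, although in practice A would be stored in sparse form):

    import numpy as np

    def jacobi_step(A, b, x):
        """One sweep of (8.1.7): a_jj * x_j_new = b_j - sum_{k != j} a_jk * x_k_old."""
        n = len(b)
        x_new = np.empty_like(x)
        for j in range(n):
            s = A[j, :j] @ x[:j] + A[j, j+1:] @ x[j+1:]   # uses only old components
            x_new[j] = (b[j] - s) / A[j, j]
        return x_new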
(2) In the Gauss-Seidel method or single-step method one chooses
(8.1.8)    B := D − E,    I − B⁻¹A = (I − L)⁻¹U = H.
One thus obtains for (8.1.3)
    Σ_{k<j} a_jk x_k^(i+1) + a_jj x_j^(i+1) + Σ_{k>j} a_jk x_k^(i) = b_j,    j = 1, 2, …, n,    i = 0, 1, ….
(3) The method of iterative refinement is a special case in itself. Here, the following situation is assumed. As the result of an elimination method for the solution of Ax = b one obtains, owing to rounding errors, a (generally) good approximate solution x^(0) for the exact solution x and a lower and an upper triangular matrix L and R, respectively, such that LR ≈ A [see Section 4.5]. The approximate solution x^(0) can then subsequently be improved by means of an iterative method of the form (8.1.3), choosing
    B := LR.
(8.1.3) is equivalent to
(8.1.9)    LR(x^(i+1) − x^(i)) = r^(i)
with the residual
    r^(i) := b − A x^(i).
From (8.1.9) it then follows that
(8.1.10)    x^(i+1) = x^(i) + u^(i),    where u^(i) solves LR u^(i) = r^(i).
Note that u^(i) can be simply computed by solving the triangular systems of equations
(8.1.11)    L z = r^(i),    R u^(i) = z.
In general (if A is not too ill conditioned), the method converges extremely fast. Already x^(1) or x^(2) agrees with the exact solution x to machine accuracy. Since, precisely for this reason, there occurs severe cancellation in the computation of the residuals r^(i) = b − A x^(i), it is extremely important for the proper functioning of the method that the computation of r^(i) be done in double precision. For the subsequent computation of z, u^(i), and x^(i+1) = x^(i) + u^(i) from (8.1.11) and (8.1.10), double precision is not required. [Programs and numerical examples for iterative refinement can be found in Wilkinson and Reinsch (1971), and in Forsythe and Moler (1967).]
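A minimal sketch of the procedure (8.1.9)-(8.1.11) — not part of the original text — is the following; to imitate the precision split it factors A in single precision (playing the role of the rounded factors L, R), while the residuals are accumulated in double precision. It assumes SciPy's lu_factor/lu_solve for the triangular solves:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def iterative_refinement(A, b, steps=3):
        """Sketch of (8.1.9)-(8.1.11): B := LR from a low-precision factorization,
        residuals r = b - A x computed in double precision."""
        lu, piv = lu_factor(A.astype(np.float32))            # L R ~ A (single precision)
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(steps):
            r = b - A @ x                                     # residual in double precision
            u = lu_solve((lu, piv), r.astype(np.float32))     # L z = r, R u = z
            x = x + u                                         # x^(i+1) = x^(i) + u^(i)
        return x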
8.2 Convergence Theorems
The iterative methods considered in (8.1.3), (8.1.4) produce from each initial vector x^(0) a sequence of vectors {x^(i)}_{i=0,1,…}. We now call the method convergent if for all initial vectors x^(0) this sequence converges toward the exact solution x = A⁻¹b. By ρ(G) we again denote in the following the spectral radius [see Section 6.9] of a matrix G. We can then state the following convergence criterion:
(8.2.1) Theorem.
(a) The method (8.1.3) converges if and only if ρ(I − B⁻¹A) < 1.
(b) It is sufficient for the convergence of (8.1.3) that lub(I − B⁻¹A) < 1. Here lub(·) can be taken relative to any norm.
PROOF. (a): For the error f_i := x^(i) − x, from
    x^(i+1) = (I − B⁻¹A) x^(i) + B⁻¹b,    x = (I − B⁻¹A) x + B⁻¹b,
we immediately obtain by subtraction the recurrence formula
(8.2.2)    f_{i+1} = (I − B⁻¹A) f_i,    or    f_i = (I − B⁻¹A)^i f_0,    i = 0, 1, ….
(1) Assume now (8.1.3) is convergent. Then we have lim_{i→∞} f_i = 0 for all f_0. Choosing in particular f_0 to be an eigenvector of I − B⁻¹A corresponding to the eigenvalue λ, it follows from (8.2.2) that
(8.2.3)    f_i = λ^i f_0,
and hence |λ| < 1, since lim_{i→∞} f_i = 0. Therefore, ρ(I − B⁻¹A) < 1.
(2) If, conversely, ρ(I − B⁻¹A) < 1, it follows immediately from Theorem (6.9.2) that lim_{i→∞}(I − B⁻¹A)^i = 0 and thus lim_{i→∞} f_i = 0 for all f_0.
(b): For arbitrary norms one has ρ(I − B⁻¹A) ≤ lub(I − B⁻¹A) [see Theorem (6.9.1)]. This proves the theorem.    □
Theorem (8.2.1) suggests the conjecture that the rate of convergence is larger the smaller ρ(I − B⁻¹A) is. This statement can be made more precise.
(8.2.4) Theorem. For the method (8.1.3) the errors f_i = x^(i) − x satisfy
(8.2.5)    sup_{f_0 ≠ 0} limsup_{i→∞} ( ‖f_i‖ / ‖f_0‖ )^{1/i} = ρ(I − B⁻¹A).
Here ‖·‖ is an arbitrary norm.
PROOF. Let ‖·‖ be an arbitrary norm, and lub(·) the corresponding matrix norm. By k we denote, for short, the left-hand side of (8.2.5). One sees immediately that k ≥ ρ(I − B⁻¹A), by choosing for f_0, as in (8.2.2), (8.2.3), the eigenvectors of I − B⁻¹A. Let now ε > 0 be arbitrary. Then by Theorem (6.9.2) there exists a vector norm N(·) such that for the corresponding matrix norm lub_N(·),
    lub_N(I − B⁻¹A) ≤ ρ(I − B⁻¹A) + ε.
According to Theorem (4.4.6) all norms on ℂⁿ are equivalent, and there exist constants m, M > 0 with
    m‖x‖ ≤ N(x) ≤ M‖x‖.
If now f_0 ≠ 0 is arbitrary, from these inequalities and (8.2.2) it follows that
    ‖f_i‖ ≤ (1/m) N(f_i) = (1/m) N((I − B⁻¹A)^i f_0) ≤ (1/m)(ρ(I − B⁻¹A) + ε)^i N(f_0) ≤ (M/m)(ρ(I − B⁻¹A) + ε)^i ‖f_0‖,
or
    ( ‖f_i‖ / ‖f_0‖ )^{1/i} ≤ (M/m)^{1/i} (ρ(I − B⁻¹A) + ε).
Since lim_{i→∞}(M/m)^{1/i} = 1, one obtains k ≤ ρ(I − B⁻¹A) + ε, and since ε > 0 was arbitrary, k ≤ ρ(I − B⁻¹A). The theorem is now proved.    □
Let us apply these results first to the Jacobi method (8.1.7). We continue using the notation (8.1.5)-(8.1.6) introduced in the previous section. Relative to the maximum norm lub_∞(C) = max_i Σ_k |c_ik|, we then have
    lub_∞(J) = max_i Σ_{k≠i} |a_ik| / |a_ii|.
If |a_ii| > Σ_{k≠i} |a_ik| for all i, we get immediately
    lub_∞(J) < 1.
From Theorem (8.2.1b) we thus obtain at once the first part of the following theorem:
(8.2.6) Theorem.
(a) Strong Row Sum Criterion: The Jacobi method is convergent for all matrices A with
(8.2.7)    |a_ii| > Σ_{k≠i} |a_ik|    for i = 1, 2, …, n.
(b) Strong Column Sum Criterion: The Jacobi method converges for all matrices A with
(8.2.8)    |a_kk| > Σ_{i≠k} |a_ik|    for k = 1, 2, …, n.
A matrix A satisfying (8.2.7) [(8.2.8)] is called strictly row-wise (column-wise) diagonally dominant.
PROOF OF (b). If (8.2.8) is satisfied by A, then (8.2.7) holds for the matrix Aᵀ. The Jacobi method therefore converges for Aᵀ, and thus, by Theorem (8.2.1a), ρ(X) < 1 for X := I − D⁻¹Aᵀ. Now X has the same eigenvalues as Xᵀ, and D⁻¹XᵀD = I − D⁻¹A. Hence also ρ(I − D⁻¹A) < 1, i.e., the Jacobi method is convergent also for the matrix A.    □
For irreducible matrices A the strong row (column) sum criterion can be
improved. Here, a matrix A is called irreducible if there is no permutation
matrix P such that p T AP has the form
where All is a p x p matrix and A22 a q x q matrix with p + q = n, p > 0,
q>
o.
The irreducibility of a matrix A can often be readily tested by means
of the (directed) graph G(A) associated with the matrix A. If A is an
n x n matrix, then G (A) consists of n vertices PI, ... , Pn and there is an
(oriented) arc Pi --+ Pj in G(A) precisely if aij "10.
It is easily shown that A is irreducible if and only if the graph G(A) is
connected in the sense that for each pair of vertices (Pi, Pj ) in G(A) there
is an oriented path from Pi to Pj .
For irreducible matrices one has the following:
(8.2.9) Theorem (Weak Row Sum Criterion). If A is irreducible and
    |a_ii| ≥ Σ_{k≠i} |a_ik|    for all i = 1, 2, …, n,
but |a_{i_0 i_0}| > Σ_{k≠i_0} |a_{i_0 k}| for at least one i_0, then the Jacobi method converges.
A matrix A satisfying the hypotheses of the theorem is called irreducibly
diagonally dominant.
Analogously there is, of course, also a weak column sum criterion for
irreducible A.
PROOF. From the assumptions of the theorem it follows for the Jacobi method, as in the proof of (8.2.6a), that
(8.2.10)    |J| e ≤ e,    |J| e ≠ e,    e := (1, 1, …, 1)ᵀ.
[Absolute value signs |·| and inequalities for vectors or matrices are always to be understood componentwise.]
Now, with A also J is irreducible. In order to prove the theorem, it suffices to show the inequality
    |J|ⁿ e < e,
because from it we have immediately
    ρ(J)ⁿ = ρ(Jⁿ) ≤ lub_∞(Jⁿ) ≤ lub_∞(|J|ⁿ) < 1.
Now, in view of (8.2.10) and |J| ≥ 0,
    |J|² e ≤ |J| e ≤ e,
and generally,
    |J|^{i+1} e ≤ |J|^i e ≤ ⋯ ≤ e,
i.e., the vectors t^(i) := e − |J|^i e satisfy
(8.2.11)    0 ≤ t^(1) ≤ t^(2) ≤ ⋯.
We show that the number r_i of nonvanishing components of t^(i) increases monotonically with i as long as r_i < n: 0 < r_1 < r_2 < ⋯. If this were not the case, there would exist, because of (8.2.11), a first i ≥ 1 with r_i = r_{i+1}. Without loss of generality, let t^(i) have the form
    t^(i) = [a; 0]    with a vector a > 0, a ∈ ℝᵖ, p > 0.
Then, in view of (8.2.11) and r_i = r_{i+1}, also t^(i+1) has the form
    t^(i+1) = [b; 0]    with a vector b > 0, b ∈ ℝᵖ.
Partitioning |J| analogously,
    |J| = [ |J_11|  |J_12| ; |J_21|  |J_22| ],    |J_11| a p × p matrix,
it follows from t^(i+1) ≥ |J| t^(i) that
    0 ≥ |J_21| a.
Because a > 0, this is only possible if |J_21| = 0, i.e., if J is reducible. Hence, 0 < r_1 < r_2 < ⋯, and thus t^(n) = e − |J|ⁿ e > 0. The theorem is now proved.    □
The conditions of Theorems (8.2.6) and (8.2.9) are also sufficient for
the convergence of the Gauss-Seidel method. We show this only for the
strong row sum criterion. We have even more precisely:
(8.2.12) Theorem. If
    |a_ii| > Σ_{k≠i} |a_ik|    for all i = 1, 2, …, n,
the Gauss-Seidel method is convergent, and furthermore [see (8.1.6)]
    lub_∞(H) ≤ lub_∞(J) < 1.
PROOF. Let κ_H := lub_∞(H), κ_J := lub_∞(J). As already exploited repeatedly, the assumption of the theorem implies
    |J| e ≤ κ_J e,    κ_J < 1,    e = (1, …, 1)ᵀ,
for the matrix J = L + U. From this, because |J| = |L| + |U|, one concludes
(8.2.13)    |L| e + |U| e ≤ κ_J e.
Now L and |L| are both lower triangular matrices with vanishing diagonal. For such matrices, as is easily verified,
    Lⁿ = 0,    |L|ⁿ = 0,
so that (I − L)⁻¹ and (I − |L|)⁻¹ exist and
    0 ≤ |(I − L)⁻¹| = |I + L + ⋯ + L^{n−1}| ≤ I + |L| + ⋯ + |L|^{n−1} = (I − |L|)⁻¹.
Multiplying (8.2.13) by the nonnegative matrix (I − |L|)⁻¹, one obtains, because H = (I − L)⁻¹U,
    |H| e ≤ (I − |L|)⁻¹ |U| e ≤ (I − |L|)⁻¹ (I − |L| + (κ_J − 1)I) e = (I + (κ_J − 1)(I − |L|)⁻¹) e.
Now, (I − |L|)⁻¹ ≥ I and κ_J < 1, so that the chain of inequalities can be continued:
    |H| e ≤ (I + (κ_J − 1)I) e = κ_J e.
But this means
    κ_H = lub_∞(H) ≤ κ_J = lub_∞(J) < 1,
as was to be shown.    □
Since lub_∞(H) ≥ ρ(H) and lub_∞(J) ≥ ρ(J), this theorem may suggest the conjecture that under the assumptions of the theorem also ρ(H) ≤ ρ(J) < 1, i.e., in view of Theorem (8.2.4), that the Gauss-Seidel method converges at least as fast as the Jacobi method. This, however, as examples show, is not true in general, but only under further assumptions on A. Thus, e.g., the following theorem is valid, which we state without proof [for a proof, see Varga (2000)].
(8.2.14) Theorem (Stein, Rosenberg). If the matrix J = L + U ≥ 0 is nonnegative, then for J and H = (I − L)⁻¹U precisely one of the following relations holds:
(1) ρ(H) = ρ(J) = 0,
(2) 0 < ρ(H) < ρ(J) < 1,
(3) ρ(H) = ρ(J) = 1,
(4) ρ(H) > ρ(J) > 1.
The assumption J ≥ 0 is satisfied in particular [see (8.1.5), (8.1.6)] if the matrix A has positive diagonal elements and nonpositive off-diagonal elements: a_ii > 0, a_ik ≤ 0 for i ≠ k. Since this condition happens to be satisfied in almost all systems of linear equations which are obtained by difference approximations to linear differential operators [cf., e.g., Section 8.4], this theorem provides in many practical cases the significant information that the Gauss-Seidel method converges faster than the Jacobi method, if one of the two converges at all.
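These relations are easy to check numerically. The following sketch (not from the original text; the 3 × 3 matrix is an illustrative choice) computes ρ(J) and ρ(H) for a strictly diagonally dominant matrix with positive diagonal and nonpositive off-diagonal elements:

    import numpy as np

    A = np.array([[ 4., -1.,  0.],
                  [-1.,  4., -1.],
                  [ 0., -1.,  4.]])
    D = np.diag(np.diag(A))
    E = -np.tril(A, -1)                      # A = D - E - F as in (8.1.5)
    F = -np.triu(A,  1)
    L = np.linalg.solve(D, E)                # L = D^{-1} E
    U = np.linalg.solve(D, F)                # U = D^{-1} F
    J = L + U
    H = np.linalg.solve(np.eye(3) - L, U)    # H = (I - L)^{-1} U
    rho = lambda M: max(abs(np.linalg.eigvals(M)))
    print(rho(J), rho(H))                    # 0 < rho(H) = rho(J)**2 < rho(J) < 1

The output illustrates case (2) of Theorem (8.2.14); that ρ(H) is here exactly the square of ρ(J) is explained by Corollary (8.3.16) below.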
8.3 Relaxation Methods
The results of the previous section suggest looking for simple matrices B
for which the corresponding iterative method (8.1.3) converges perhaps
still faster than the Gauss-Seidel method, ρ(I − B⁻¹A) < ρ(H). More generally, one can consider classes of suitable matrices B(ω) depending on a parameter ω and try to choose the parameter ω in an "optimal" way, i.e., so that ρ(I − B(ω)⁻¹A) as a function of ω becomes as small as possible. In the relaxation methods (SOR methods) one studies the following class of matrices B(ω):
(8.3.1)    B(ω) = (1/ω) D(I − ωL).
Here again we use the notation (8.1.5)-(8.1.6). This choice is obtained
through the following considerations.
Suppose for the (i + 1)st approximation x^(i+1) we already know the components x_k^(i+1), k = 1, 2, …, j − 1. As in the Gauss-Seidel method (8.1.8), we then define an auxiliary quantity x̃_j^(i+1) by
(8.3.2)    a_jj x̃_j^(i+1) = −Σ_{k<j} a_jk x_k^(i+1) − Σ_{k>j} a_jk x_k^(i) + b_j,    1 ≤ j ≤ n,    i ≥ 0,
whereupon x_j^(i+1) is determined through a certain averaging of x_j^(i) and x̃_j^(i+1), viz.
(8.3.3)    x_j^(i+1) := (1 − ω) x_j^(i) + ω x̃_j^(i+1) = x_j^(i) + ω( x̃_j^(i+1) − x_j^(i) ).
Eliminating the auxiliary quantity x̃_j^(i+1) from (8.3.3) by means of (8.3.2), one obtains
    a_jj x_j^(i+1) = a_jj x_j^(i) + ω[ −Σ_{k<j} a_jk x_k^(i+1) − a_jj x_j^(i) − Σ_{k>j} a_jk x_k^(i) + b_j ],    1 ≤ j ≤ n,    i ≥ 0.
In matrix notation this is equivalent to
    B(ω) x^(i+1) = (B(ω) − A) x^(i) + b,
where B(ω) is defined by (8.3.1) and
    B(ω) − A = (1/ω) D((1 − ω)I + ωU).
w
For this method the rate of convergence, therefore, is determined by the
spectral radius p( H (w )) of the matrix
(8.3.4)
H(w) := 1- B(W)-1 A = (I - wL)-1 [(1 - w)1 + wUj.
One calls w the relaxation parameter and speaks of overrelaxation if
w > 1 and underrelaxation if w < 1. For w = lone exactly recovers the
Gauss-Seidel method.
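Written componentwise, one relaxation sweep (8.3.2)-(8.3.3) reads as follows; the Python sketch is illustrative and not part of the original text (ω = 1 reproduces the Gauss-Seidel method, a dense array A is assumed for simplicity):

    import numpy as np

    def sor_step(A, b, x, omega):
        """One relaxation sweep per (8.3.2)-(8.3.3); omega = 1 is Gauss-Seidel."""
        n = len(b)
        x = x.copy()
        for j in range(n):
            # Gauss-Seidel auxiliary value, using already updated components k < j
            x_tilde = (b[j] - A[j, :j] @ x[:j] - A[j, j+1:] @ x[j+1:]) / A[j, j]
            x[j] = (1 - omega) * x[j] + omega * x_tilde     # averaging (8.3.3)
        return x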
We begin by listing, in part without proof, a few qualitative results about ρ(H(ω)). The following theorem shows that in relaxation methods only parameters ω with 0 < ω < 2, at best, lead to convergent methods:
(8.3.5) Theorem (Kahan). For arbitrary matrices A one has
    ρ(H(ω)) ≥ |ω − 1|    for all ω.
PROOF. I − ωL is a lower triangular matrix with 1 as diagonal elements, so that det(I − ωL) = 1 for all ω. For the characteristic polynomial φ(λ) of H(ω) it follows that
    φ(λ) = det(λI − H(ω)) = det((I − ωL)(λI − H(ω))) = det((λ + ω − 1)I − ωλL − ωU).
Up to a sign, the constant term φ(0) of φ(·) is equal to the product of the eigenvalues λ_i(H(ω)) of H(ω):
    (−1)ⁿ Π_{i=1}^{n} λ_i(H(ω)) = φ(0) = det((ω − 1)I − ωU) = (ω − 1)ⁿ.
It follows immediately that ρ(H(ω)) = max_i |λ_i(H(ω))| ≥ |ω − 1|.    □
For matrices A with L ≥ 0, U ≥ 0, only overrelaxation can give faster convergence than the Gauss-Seidel method:
(8.3.6) Theorem. If the matrix A is irreducible and J = L + U ≥ 0, and if the Jacobi method converges, ρ(J) < 1, then the function ρ(H(ω)) is strictly decreasing on the interval 0 < ω ≤ ω̄ for some ω̄ ≥ 1.
For a proof, see Varga (1962) and Householder (1964).
One further shows:
(8.3.7) Theorem (Ostrowski, Reich). For positive definite matrices A one has
    ρ(H(ω)) < 1    for all 0 < ω < 2.
In particular, the Gauss-Seidel method (ω = 1) converges for positive definite matrices.
PROOF. Let 0 < ω < 2, and let A be positive definite. Then F = Eᴴ in the decomposition (8.1.5), A = D − E − F, of A = Aᴴ. For the matrix B = B(ω) in (8.3.1) one has B = (1/ω)D − E, and the matrix
    B + Bᴴ − A = (1/ω)D − E + (1/ω)D − F − (D − E − F) = ((2/ω) − 1) D
is positive definite, since the diagonal elements of the positive definite matrix A are positive [Theorem (4.3.2)] and (2/ω) − 1 > 0.
We first show that the eigenvalues λ of A⁻¹(2B − A) all lie in the interior of the right half plane, Re λ > 0. Indeed, if x is an eigenvector for λ, then
    A⁻¹(2B − A)x = λx,    xᴴ(2B − A)x = λ xᴴAx.
Taking the complex conjugate of the last relation gives, because A = Aᴴ,
    xᴴ(2Bᴴ − A)x = λ̄ xᴴAx.
By addition, it follows that
    xᴴ(B + Bᴴ − A)x = (Re λ) xᴴAx.
But now, A and B + Bᴴ − A are positive definite and thus Re λ > 0. For the matrix Q := A⁻¹(2B − A) = 2A⁻¹B − I one has
    (Q − I)(Q + I)⁻¹ = I − B⁻¹A = H(ω).
[Observe that B is a nonsingular triangular matrix; therefore B⁻¹ and thus (Q + I)⁻¹ exist.] If μ is an eigenvalue of H(ω) and z a corresponding eigenvector, then from
    (Q − I)(Q + I)⁻¹ z = H(ω) z = μz
it follows, for the vector y := (Q + I)⁻¹ z ≠ 0, that
    (Q − I)y = μ(Q + I)y,    (1 − μ)Qy = (1 + μ)y.
Since y ≠ 0, we must have μ ≠ 1, and one finally obtains
    Qy = ((1 + μ)/(1 − μ)) y,
i.e., λ = (1 + μ)/(1 − μ) is an eigenvalue of Q = A⁻¹(2B − A). Hence, μ = (λ − 1)/(λ + 1). For |μ|² = μμ̄ one obtains
    |μ|² = (|λ|² + 1 − 2 Re λ) / (|λ|² + 1 + 2 Re λ),
and since Re λ > 0 for 0 < ω < 2,
    |μ| < 1,    i.e.,    ρ(H(ω)) < 1.    □
For an important class of matrices the more qualitative assertions of
Theorems (8.3.5)-(8.3.7) can be considerably sharpened. This is the class
of matrices with property A introduced by Young [see, e.g., Young (1971)],
or its generalization due to Varga (1962), the class of consistently ordered
matrices [see (8.3.10)].
(8.3.8) Definition. The matrix A has property A if there exists a permutation matrix P such that PAPᵀ has the form
    PAPᵀ = [ D_1  M_1 ; M_2  D_2 ],    D_1, D_2 diagonal matrices.
The most important fact about matrices with property A is given in
the following theorem:
(8.3.9) Theorem. For every n × n matrix with property A and a_ii ≠ 0, i = 1, …, n, there exists a permutation matrix P such that the decomposition (8.1.5), (8.1.6), A = D(I − L − U), of the permuted matrix A := PAPᵀ has the following property: The eigenvalues of the matrices
    J(α) := αL + (1/α)U,    α ∈ ℂ,    α ≠ 0,
are independent of α.
PROOF. By Definition (8.3.8) there exists a permutation P such that
    PAPᵀ = [ D_1  M_1 ; M_2  D_2 ] = D(I − L − U),
with
    D := [ D_1  0 ; 0  D_2 ],    L = −[ 0  0 ; D_2⁻¹M_2  0 ],    U = −[ 0  D_1⁻¹M_1 ; 0  0 ].
Here D_1 and D_2 are nonsingular diagonal matrices. For α ≠ 0, one now has
    J(α) = αL + (1/α)U = −[ 0  α⁻¹D_1⁻¹M_1 ; αD_2⁻¹M_2  0 ] = −S_α [ 0  D_1⁻¹M_1 ; D_2⁻¹M_2  0 ] S_α⁻¹ = S_α J(1) S_α⁻¹
with the nonsingular diagonal matrix
    S_α := [ I_1  0 ; 0  αI_2 ],    I_1, I_2 unit matrices.
The matrices J(α) and J(1) are similar, and hence have the same eigenvalues.    □
Following Varga (1962), a matrix A which, relative to the decomposition (8.1.5), (8.1.6), A = D(I − L − U), has the property that the eigenvalues of the matrices
    J(α) := αL + (1/α)U
for α ≠ 0 are independent of α, is called
(8.3.10)    consistently ordered.
Theorem (8.3.9) asserts that matrices with property A can be ordered consistently, i.e., the rows and columns of A can be rearranged by a permutation P such that there results a consistently ordered matrix PAPᵀ. Consistently ordered matrices A, however, need not at all have the form
    [ D_1  M_1 ; M_2  D_2 ],    D_1, D_2 diagonal matrices.
This is shown by the important example of block tridiagonal matrices A, which have the form
    A = [ D_1   A_12                    ;
          A_21  D_2   A_23              ;
                ⋱     ⋱      ⋱          ;
                      A_{N,N−1}   D_N ],    D_i diagonal matrices.
If all D_i are nonsingular, then the matrices J(α) = αL + (1/α)U obey the relation
    J(α) = S_α J(1) S_α⁻¹,    S_α := diag(I, αI, α²I, …, α^{N−1} I),
that is, A is consistently ordered. In addition, block tridiagonal matrices have property A; we show this only for a special 3 × 3 block matrix, and in the general case one proceeds analogously.
We point out, however, that there are consistently ordered matrices
which do not have property A. This is shown by the example
For irreducible n × n matrices A with nonvanishing diagonal elements a_ii ≠ 0 and decomposition A = D(I − L − U) it is often easy to find out whether or not A has property A by considering the graph G(J) associated with the matrix J = L + U. For this one examines the lengths s_1^(i), s_2^(i), … of all closed oriented paths (oriented cycles)
    P_i → P_{k_1} → P_{k_2} → ⋯ → P_{k_s} = P_i
in G(J) which lead from P_i back to P_i. Denoting by l_i the greatest common divisor of the s_1^(i), s_2^(i), …,
    l_i := gcd(s_1^(i), s_2^(i), …),
the graph G(J) is called 2-cyclic if l_1 = l_2 = ⋯ = l_n = 2, and weakly 2-cyclic if all l_i are even.
The following theorem then holds, which we state without proof.
(8.3.11) Theorem. An irreducible matrix A has property A if and only if
G( J) is weakly 2-cyclic.
EXAMPLE. To the matrix
    A = [  4  −1   0  −1 ;
          −1   4  −1   0 ;
           0  −1   4  −1 ;
          −1   0  −1   4 ]
there belongs the matrix
    J := (1/4) [ 0  1  0  1 ;
                 1  0  1  0 ;
                 0  1  0  1 ;
                 1  0  1  0 ]
with the graph G(J) consisting of the cycle P_1 → P_2 → P_3 → P_4 → P_1 together with the reverse arcs.
G(J) is connected, so that J, and thus also A, is irreducible (see Section 8.2).
Since G(J) is evidently 2-cyclic, A has property A.
The significance of consistently ordered matrices [and therefore, by (8.3.9), indirectly also of matrices with property A] lies in the fact that one can explicitly show how the eigenvalues μ of J = L + U are related to the eigenvalues λ = λ(ω) of H(ω) = (I − ωL)⁻¹((1 − ω)I + ωU):
(8.3.12) Theorem (Young, Varga). Let A be a consistently ordered matrix (8.3.10) and ω ≠ 0. Then:
(a) With μ, also −μ is an eigenvalue of J = L + U.
(b) If μ is an eigenvalue of J and
(8.3.13)    (λ + ω − 1)² = λ ω² μ²,
then λ is an eigenvalue of H(ω).
(c) If λ ≠ 0 is an eigenvalue of H(ω) and (8.3.13) holds, then μ is an eigenvalue of J.
PROOF. (a): Since A is consistently ordered, the matrix J(−1) = −L − U = −J has the same eigenvalues as J(1) = J = L + U.
(b): Because det(I − ωL) = 1 for all ω, one has
(8.3.14)    det(λI − H(ω)) = det[(I − ωL)(λI − H(ω))] = det[λI − λωL − (1 − ω)I − ωU] = det((λ + ω − 1)I − λωL − ωU).
Now let μ be an eigenvalue of J = L + U and λ a solution of (8.3.13). Then λ + ω − 1 = √λ ωμ or λ + ω − 1 = −√λ ωμ. Because of (a) we can assume without loss of generality that
    λ + ω − 1 = √λ ωμ.
If λ = 0, then ω = 1, so that by (8.3.14),
    det(0 · I − H(1)) = det(−ωU) = 0,
i.e., λ is an eigenvalue of H(ω). If λ ≠ 0 it follows from (8.3.14) that
(8.3.15)    det(λI − H(ω)) = det[(λ + ω − 1)I − √λ ω(√λ L + (1/√λ)U)]
                            = (√λ ω)ⁿ det[μI − (√λ L + (1/√λ)U)]
                            = (√λ ω)ⁿ det(μI − (L + U)) = 0,
since the matrix J(√λ) = √λ L + (1/√λ)U has the same eigenvalues as J = L + U and μ is an eigenvalue of J. Therefore, det(λI − H(ω)) = 0, and λ is an eigenvalue of H(ω).
(c): Now, conversely, let λ ≠ 0 be an eigenvalue of H(ω), and μ a number satisfying (8.3.13), i.e., with λ + ω − 1 = ±ω√λ μ. In view of (a) it suffices to show that the number μ with λ + ω − 1 = ω√λ μ is an eigenvalue of J. This, however, follows immediately from (8.3.15).    □
As a side result the following is obtained for ω = 1:
(8.3.16) Corollary. Let A be a consistently ordered matrix (8.3.10). Then for the matrix H = H(1) = (I − L)⁻¹U of the Gauss-Seidel method one has
    ρ(H) = [ρ(J)]².
In view of Theorem (8.2.4) this means that with the Jacobi method one has to carry out about twice as many iteration steps as with the Gauss-Seidel method, to achieve the same accuracy.
We now wish to exhibit in an important special case the optimal relaxation parameter ω_b, characterized by [see Theorem (8.3.5)]
    ρ(H(ω_b)) = min_{ω∈ℝ} ρ(H(ω)) = min_{0<ω<2} ρ(H(ω)).
(8.3.17) Theorem (Young, Varga). Let A be a consistently ordered matrix. Let the eigenvalues of J be real and ρ(J) < 1. Then
    ω_b = 2 / (1 + √(1 − ρ(J)²)),    ρ(H(ω_b)) = ω_b − 1,
and
(8.3.18)    ρ(H(ω)) = { 1 − ω + ½ω²μ² + ωμ √(1 − ω + ¼ω²μ²)   for 0 < ω ≤ ω_b,
                        ω − 1                                     for ω_b ≤ ω ≤ 2,
where the abbreviation μ := ρ(J) is used (see Figure 32).

Fig. 32. Spectral radius ρ(H(ω)).
Note that the left-hand differential quotient of ρ(H(ω)) at ω_b is "−∞". One should therefore preferably take as relaxation parameter ω a number that is slightly too large, rather than one that is too small, if ω_b is not known exactly.
PROOF. The eigenvalues μ_i of the matrix J, by assumption, are real, and
    −ρ(J) ≤ μ_i ≤ ρ(J) < 1.
For fixed ω ∈ (0, 2) [by Theorem (8.3.5) it suffices to consider this domain] to each μ_i there belong two eigenvalues λ_i^(1)(ω), λ_i^(2)(ω) of H(ω), which are obtained by solving the quadratic equation (8.3.13) in λ. Geometrically, λ_i^(1)(ω), λ_i^(2)(ω) are obtained as the abscissae of the points of intersection of the straight line
    g_ω(λ) := (λ + ω − 1)/ω
with the parabola m_i(λ) := ±√λ μ_i (see Figure 33). The line g_ω has the slope 1/ω and passes through the point (1, 1). If it does not intersect the parabola m_i(λ), then λ_i^(1)(ω), λ_i^(2)(ω) are conjugate complex numbers with modulus |ω − 1|, as one finds immediately from (8.3.13). Evidently,
    ρ(H(ω)) = max_i ( |λ_i^(1)(ω)|, |λ_i^(2)(ω)| ) = max( |λ^(1)(ω)|, |λ^(2)(ω)| ),
the λ^(1)(ω), λ^(2)(ω) being obtained by intersecting g_ω(λ) with m(λ) := ±√λ μ, where μ = ρ(J) = max_i |μ_i|. By solving the quadratic equation (8.3.13), with μ = ρ(J), one verifies (8.3.18) immediately, and thus also the remaining assertions of the theorem.    □
Fig. 33. Determination of ω_b.
8.4 Applications to Difference Methods-An Example
In order to illustrate how the iterative methods described can be applied,
we consider the Dirichlet boundary-value problem
(8.4.1)    −u_xx − u_yy = f(x, y)    for 0 < x, y < 1,
           u(x, y) = 0    for (x, y) ∈ ∂Ω,
for the unit square Ω := {(x, y) | 0 < x, y < 1} ⊂ ℝ² with boundary ∂Ω [cf. Section 7.7]. We assume f(x, y) continuous on Ω ∪ ∂Ω. Since the various methods for the solution of boundary-value problems are usually compared on this problem, (8.4.1) is also called the model problem. To solve (8.4.1) by means of a difference method, one replaces the differential operator by a difference operator, as described in Section 7.4 for boundary-value problems in ordinary differential equations. One covers Ω ∪ ∂Ω with a grid Ω_h ∪ ∂Ω_h:
    Ω_h := {(x_i, y_j) | i, j = 1, 2, …, N},
    ∂Ω_h := {(x_i, 0), (x_i, 1), (0, y_j), (1, y_j) | i, j = 0, 1, …, N + 1},
where, for abbreviation, we put [see Figure 34]
    x_i := ih,    y_j := jh,    h := 1/(N + 1),    i, j = 0, 1, …, N + 1,
N ≥ 1 an integer.
Fig. 34. The grid Ω_h.
With the further abbreviation
    u_ij := u(x_i, y_j),    i, j = 0, 1, …, N + 1,
the differential operator −∂²/∂x² − ∂²/∂y² can be replaced for all (x_i, y_j) ∈ Ω_h by the difference operator
(8.4.2)    [4u_ij − u_{i−1,j} − u_{i+1,j} − u_{i,j−1} − u_{i,j+1}] / h²
up to an error τ_ij. The unknowns u_ij, 1 ≤ i, j ≤ N [because of the boundary conditions, the u_ij = 0 are known for (x_i, y_j) ∈ ∂Ω_h] therefore obey a system of linear equations of the form
(8.4.3)    4u_ij − u_{i−1,j} − u_{i+1,j} − u_{i,j−1} − u_{i,j+1} = h² f_ij + h² τ_ij,    (x_i, y_j) ∈ Ω_h,
with f_ij := f(x_i, y_j). Here the errors τ_ij of course depend on the mesh size h. Under appropriate assumptions on the exact solution u, one shows as in Section 7.4 that τ_ij = O(h²). For sufficiently small h one can thus expect that the solution z_ij, i, j = 1, …, N, of the linear system of equations
(8.4.4)    4z_ij − z_{i−1,j} − z_{i+1,j} − z_{i,j−1} − z_{i,j+1} = h² f_ij,    i, j = 1, …, N,
           z_{0j} = z_{N+1,j} = z_{i0} = z_{i,N+1} = 0    for i, j = 0, 1, …, N + 1,
obtained from (8.4.3) by omitting the error τ_ij, agrees approximately with the u_ij. To every grid point (x_i, y_j) of Ω_h there belongs exactly one component z_ij of the solution of (8.4.4). Collecting the N² unknowns z_ij and the right-hand sides h² f_ij row-wise [see Figure 34] into vectors
    z := (z_11, z_21, …, z_N1, z_12, …, z_N2, …, z_1N, …, z_NN)ᵀ,
    b := h² (f_11, …, f_N1, …, f_1N, …, f_NN)ᵀ,
then (8.4.4) is equivalent to a system of linear equations of the form
    Az = b,
with the N² × N² matrix
(8.4.5)    A = [ A_11  A_12                      ;
                A_21  A_22  A_23                 ;
                      ⋱     ⋱       ⋱            ;
                            A_{N,N−1}   A_{NN} ],
a block tridiagonal matrix whose N × N blocks are
    A_jj = [  4  −1          ;
             −1   4   ⋱      ;
                  ⋱   ⋱  −1  ;
                     −1   4 ],    A_{j,j+1} = A_{j+1,j} = −I_N.
A is partitioned in a natural way into blocks A_ij of order N, which are induced by the partitioning of the points (x_i, y_j) of Ω_h into horizontal rows (x_1, y_j), (x_2, y_j), …, (x_N, y_j).
The matrix A is quite sparse. In each row, at most five elements are different from zero. For this reason, in the execution of an iteration step z^(i) → z^(i+1) of, say, the Jacobi method (8.1.7) or the Gauss-Seidel method (8.1.8), one requires only about 5N² operations (1 operation = 1 multiplication or division + 1 addition). (If the matrix A were dense, N⁴ operations per iteration step would be required.) Compare this expenditure with that of a direct method for solving Az = b. If we were to compute a triangular decomposition A = LLᵀ of A, say with the Choleski method (below we will see that A is positive definite), then L would be an N² × N² lower triangular band matrix of bandwidth N + 1. The computation of L alone requires approximately ½N⁴ operations. Since the Jacobi method, e.g., requires approximately (N + 1)² iterations (overrelaxation method: N + 1 iterations) [see (8.4.9)] in order to obtain a result accurate to 2 decimal places, the Choleski method would be less expensive than the Jacobi method. The main difficulty with the Choleski method, however, lies in its storage requirement: the storage of the approximately N³ nonvanishing elements of L requires too much space for larger values of N, say N ≥ 100. Here lie the advantages of iterative methods: their storage requirement is only of the order of magnitude N².
To the Jacobi method belongs the matrix
    J = L + U = ¼(4I − A).
The associated graph G(J) (for N = 3) is connected and 2-cyclic. A is therefore irreducible [see Section 8.2] and has property A [Theorem (8.3.11)]. One easily sees, in addition, that A is already consistently ordered.
The eigenvalues and eigenvectors of J = L + U can be determined explicitly. One verifies at once, by substitution, that the N² vectors z^(k,l), k, l = 1, 2, …, N, with components
    z_ij^(k,l) := sin(kπi/(N + 1)) · sin(lπj/(N + 1)),    1 ≤ i, j ≤ N,
satisfy
    J z^(k,l) = μ^(k,l) z^(k,l)
with
    μ^(k,l) := ½ (cos(kπ/(N + 1)) + cos(lπ/(N + 1))),    1 ≤ k, l ≤ N.
J thus has the eigenvalues μ^(k,l), 1 ≤ k, l ≤ N. The spectral radius of J, therefore, is
(8.4.6)    ρ(J) = max_{k,l} |μ^(k,l)| = cos(π/(N + 1)).
To the Gauss-Seidel method belongs the matrix H = (I − L)⁻¹U with the spectral radius [see (8.3.16)]
    ρ(H) = ρ(J)² = cos²(π/(N + 1)).
According to Theorem (8.3.17) the optimal relaxation parameter ω_b and ρ(H(ω_b)) are given by
(8.4.7)    ω_b = 2 / (1 + √(1 − cos²(π/(N + 1)))) = 2 / (1 + sin(π/(N + 1))),
           ρ(H(ω_b)) = cos²(π/(N + 1)) / (1 + sin(π/(N + 1)))².
The number κ = κ(N) with ρ(J)^κ = ρ(H(ω_b)) indicates how many steps of the Jacobi method produce the same error reduction as one step of the optimal relaxation method. Clearly,
    κ = ln ρ(H(ω_b)) / ln ρ(J).
Now, for small z one has ln(1 + z) = z − z²/2 + O(z³), and for large N
    cos(π/(N + 1)) = 1 − π²/(2(N + 1)²) + O(1/N⁴),
so that
    ln ρ(J) = −π²/(2(N + 1)²) + O(1/N⁴).
Likewise,
    ln ρ(H(ω_b)) = 2 [ ln ρ(J) − ln(1 + sin(π/(N + 1))) ]
                 = 2 [ −π²/(2(N + 1)²) − π/(N + 1) + π²/(2(N + 1)²) + O(1/N³) ]
                 = −2π/(N + 1) + O(1/N³),
so that finally, asymptotically for large N,
(8.4.8)    κ = κ(N) ≈ 4(N + 1)/π,
i.e., the optimal relaxation method is more than N times as fast as the
Jacobi method. The quantities
\[
(8.4.9)\qquad
R_J := \frac{\ln(10)}{-\ln \rho(J)} \approx 0.467\,(N+1)^2,\qquad
R_{GS} := \tfrac{1}{2}R_J \approx 0.234\,(N+1)^2,\qquad
R_{\omega_b} := \frac{\ln(10)}{-\ln \rho(H(\omega_b))} \approx 0.367\,(N+1)
\]
indicate the number of iterations required in the Jacobi method, the Gauss-Seidel method, and the optimal relaxation method, respectively, in order
to reduce the error by a factor of 1/10.
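The counts in (8.4.9) are easy to check numerically. The following short Python snippet (added here as an illustration, not part of the text) evaluates $\rho(J)$, $\omega_b$, $\rho(H(\omega_b))$ and the three iteration counts from the formulas (8.4.6)-(8.4.9) for a given N and compares them with the stated asymptotic values.

import math

N = 50
theta = math.pi / (N + 1)
rho_J = math.cos(theta)                          # (8.4.6)
omega_b = 2.0 / (1.0 + math.sin(theta))          # (8.4.7)
rho_SOR = math.cos(theta)**2 / (1.0 + math.sin(theta))**2

R_J = math.log(10) / (-math.log(rho_J))          # Jacobi iterations per factor 1/10
R_GS = 0.5 * R_J                                 # Gauss-Seidel
R_SOR = math.log(10) / (-math.log(rho_SOR))      # optimal relaxation

print(R_J,  0.467 * (N + 1)**2)                  # the two numbers nearly agree
print(R_GS, 0.234 * (N + 1)**2)
print(R_SOR, 0.367 * (N + 1))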
Finally, we derive an explicit formula for the condition number $\operatorname{cond}_2(A)$
of A with respect to the Euclidean norm, which will be important for the
speed of convergence of the conjugate-gradient method in Section 8.7.1 [see
(8.7.1.9)]: Since $A = 4(I - J)$ and J has the eigenvalues
\[
\mu^{(k,l)} = \tfrac{1}{2}\left(\cos\frac{k\pi}{N+1} + \cos\frac{l\pi}{N+1}\right),\qquad 1 \le k,l \le N,
\]
it follows
\[
\lambda_{\max}(A) = 4\left(1 + \cos\frac{\pi}{N+1}\right),\qquad
\lambda_{\min}(A) = 4\left(1 - \cos\frac{\pi}{N+1}\right).
\]
This implies for $\operatorname{cond}_2(A) := \lambda_{\max}(A)/\lambda_{\min}(A)$
\[
(8.4.10)\qquad \operatorname{cond}_2(A) =
\frac{\cos^2\dfrac{\pi}{2(N+1)}}{\sin^2\dfrac{\pi}{2(N+1)}}
\approx \frac{4}{\pi^2}(N+1)^2 = \frac{4}{\pi^2 h^2}\,.
\]
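As a quick numerical check of (8.4.10), not contained in the text itself, one can compare the exact ratio $\lambda_{\max}(A)/\lambda_{\min}(A)$ with the asymptotic value $4/(\pi^2 h^2)$; already for moderate N the two values differ by well under one percent.

import math

N = 50
h = 1.0 / (N + 1)
theta = math.pi / (N + 1)
cond = (1.0 + math.cos(theta)) / (1.0 - math.cos(theta))   # lambda_max / lambda_min
print(cond, 4.0 / (math.pi**2 * h**2))                     # nearly equal for N = 50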
8.5 Block Iterative Methods
As the example of the previous section shows, the matrices A arising in
the application of difference methods to partial differential equations often
exhibit a natural block structure,
\[
A = \begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1N} \\
A_{21} & A_{22} & \cdots & A_{2N} \\
\vdots & & & \vdots \\
A_{N1} & A_{N2} & \cdots & A_{NN}
\end{pmatrix},
\]
where the $A_{ii}$ are square matrices. If now, in addition, all $A_{ii}$ are nonsingular,
it seems natural to introduce block iterative methods relative to the given
partition $\pi$ of A, in the following way: In analogy to the decomposition
(8.1.5) of A one defines, relative to $\pi$, the decomposition
\[
A = D_\pi - E_\pi - F_\pi,\qquad L_\pi := D_\pi^{-1}E_\pi,\qquad U_\pi := D_\pi^{-1}F_\pi
\]
with
\[
(8.5.1)\qquad
D_\pi := \begin{pmatrix} A_{11} & & \\ & \ddots & \\ & & A_{NN} \end{pmatrix},\quad
E_\pi := -\begin{pmatrix} 0 & & \\ A_{21} & \ddots & \\ A_{N1} & \cdots & 0 \end{pmatrix},\quad
F_\pi := -\begin{pmatrix} 0 & \cdots & A_{1N} \\ & \ddots & \vdots \\ & & 0 \end{pmatrix},
\]
i.e., $-E_\pi$ is the strictly lower and $-F_\pi$ the strictly upper block triangular part of A.
One obtains the block Jacobi method (block total-step method) for the solution of $Ax = b$ by choosing in (8.1.3), analogously to (8.1.7), $B := D_\pi$.
One thus obtains the iteration algorithm
\[
D_\pi x^{(i+1)} = b + (E_\pi + F_\pi)x^{(i)},
\]
or, written out in full,
\[
(8.5.2)\qquad
A_{jj}x_j^{(i+1)} = b_j - \sum_{\substack{k=1\\ k\ne j}}^{N} A_{jk}x_k^{(i)},\qquad
j = 1, 2, \ldots, N,\quad i = 0, 1, \ldots.
\]
Here, the vectors $x^{(i)}$, b are of course partitioned similarly to A. In each
step $x^{(i)} \to x^{(i+1)}$ of the method we now must solve N systems of linear
equations of the form $A_{jj}z = y$, $j = 1, 2, \ldots, N$. This is accomplished by
first obtaining, by the methods of Section 4, a triangular decomposition (or
a Choleski decomposition, if appropriate) $A_{jj} = L_jR_j$ of the $A_{jj}$, and then
reducing $A_{jj}z = y$ to the solution of the two triangular systems of equations
$L_j u = y$, $R_j z = u$.
For the efficiency of the method it is essential that the $A_{jj}$ be simply
structured matrices for which the triangular decomposition is easily carried
out. This is the case, e.g., for the matrix A in (8.4.5) of the model problem.
Here, the $A_{jj}$ are positive definite tridiagonal $N \times N$ matrices
\[
A_{jj} = \begin{pmatrix}
4 & -1 & & \\
-1 & 4 & \ddots & \\
& \ddots & \ddots & -1 \\
& & -1 & 4
\end{pmatrix}
\]
whose Choleski decomposition requires a very modest amount of work
(number of operations proportional to N).
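A minimal sketch of one block Jacobi step (8.5.2) for the model problem follows (added for illustration, not from the text; it uses SciPy's banded solver solve_banded in place of the Choleski factorizations discussed above). For the model problem the diagonal blocks are the tridiagonal matrices $A_{jj}$ shown above, and the only nonzero off-diagonal blocks of (8.4.5) are $A_{j,j\pm1} = -I$, so each of the N block solves costs only $O(N)$ operations.

import numpy as np
from scipy.linalg import solve_banded

def block_jacobi_step(X, B):
    # One block (line) Jacobi step for the model problem.
    # Row j of X holds the subvector x_j^(i); row j of B holds b_j.
    N = X.shape[0]
    ab = np.zeros((3, N))        # banded storage of A_jj = tridiag(-1, 4, -1)
    ab[0, 1:] = -1.0             # superdiagonal
    ab[1, :]  = 4.0              # diagonal
    ab[2, :-1] = -1.0            # subdiagonal
    X_new = np.empty_like(X)
    for j in range(N):
        rhs = B[j].copy()        # b_j - sum_{k != j} A_jk x_k^(i)
        if j > 0:
            rhs += X[j - 1]      # since A_{j,j-1} = -I
        if j < N - 1:
            rhs += X[j + 1]      # since A_{j,j+1} = -I
        X_new[j] = solve_banded((1, 1), ab, rhs)   # tridiagonal solve, O(N) work
    return X_new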
The rate of convergence of (8.5.2) of course is now determined by the
spectral radius $\rho(J_\pi)$ of the matrix
\[
J_\pi := I - D_\pi^{-1}A = L_\pi + U_\pi.
\]
In the same way one can define, analogously to (8.1.8), a block Gauss-Seidel method (block single-step method), through the choice $B := D_\pi - E_\pi$,
or, explicitly,
\[
(8.5.3)\qquad
A_{jj}x_j^{(i+1)} = b_j - \sum_{k<j} A_{jk}x_k^{(i+1)} - \sum_{k>j} A_{jk}x_k^{(i)},\qquad
j = 1, 2, \ldots, N,\quad i = 0, 1, \ldots.
\]
Again, systems of equations with the matrices $A_{jj}$ need to be solved in
each iteration step.
As in Section 8.3, one can also introduce block relaxation methods
through the choice $B := B(\omega) = (1/\omega)D_\pi(I - \omega L_\pi)$. Explicitly [cf. (8.3.3)],
let $\tilde{x}_j^{(i+1)}$ be the solution of (8.5.3); then
\[
x_j^{(i+1)} = x_j^{(i)} + \omega\bigl(\tilde{x}_j^{(i+1)} - x_j^{(i)}\bigr),\qquad j = 1, 2, \ldots, N.
\]
Now, of course,
\[
H_\pi(\omega) := (I - \omega L_\pi)^{-1}\bigl((1-\omega)I + \omega U_\pi\bigr).
\]
Also the theory of consistently ordered matrices of Section 8.3 carries over,
if one defines A as consistently ordered whenever the eigenvalues of the
matrices
\[
J_\pi(\alpha) := \alpha L_\pi + \alpha^{-1}U_\pi,\qquad \alpha \ne 0,
\]
are independent of $\alpha$. Optimal relaxation factors are determined as in Theorem (8.3.17) with the help of $\rho(J_\pi)$.
One expects intuitively that the block methods will converge faster
with increasing coarseness of the block partition $\pi$ of A. This indeed can
be shown under fairly general assumptions on A [see Varga (1962)]. For
the coarsest partition $\pi$ of A into a single block, e.g., the iterative method
"converges" after just one step. It is then equivalent to a direct method.
This example shows that arbitrarily coarse partitions are only of theoretical interest. The reduction in the number of iterations is compensated, to a
certain extent, by the larger computational work for each individual iteration step. For the most common partitions, in which A is block tridiagonal
and the diagonal blocks usually tridiagonal (this, e.g., is the case for the
model problem), one can show, however, that the computational work per
iteration involved in block methods is equal to that in ordinary methods.
In these cases, block methods bring real advantages. For the model problem [see Section 8.4], relative to the partition given in (8.4.5), the spectral
radius $\rho(J_\pi)$ can again be determined explicitly. One finds
\[
\rho(J_\pi) = \frac{\cos\dfrac{\pi}{N+1}}{2 - \cos\dfrac{\pi}{N+1}} < \rho(J).
\]
For the corresponding optimal block relaxation method one finds that, asymptotically for $N \to \infty$, the number of iterations is reduced by a factor $\sqrt{2}$ compared to the ordinary optimal relaxation method [proof: see Exercise 17].
8.6 The ADI Method of Peaceman and Rachford
Still faster convergence than in relaxation methods is obtained with the
ADI methods (alternating-direction implicit methods) for the iterative computation of the solution of a system of linear equations which arises in
difference methods. From amongst the many variants of this method we
describe here only the historically first method of this type, which is due to
Peaceman and Rachford [see Varga (1962) and Young (1971) for an exposition of further variants]. We illustrate the method for the boundary-value
problem
\[
(8.6.1)\qquad
\begin{aligned}
-u_{xx}(x,y) - u_{yy}(x,y) + \sigma\,u(x,y) &= f(x,y) &&\text{for } (x,y) \in \Omega,\\
u(x,y) &= 0 &&\text{for } (x,y) \in \partial\Omega,
\end{aligned}
\qquad \Omega := \{\,(x,y) \mid 0 < x, y < 1\,\},
\]
on the unit square, which on account of the term $\sigma u$ is only slightly more
general than the model problem (8.4.1). We assume in the following that $\sigma$
is constant and non-negative. Using the same discretization and the same
notation as in Section 8.4, in place of (8.4.4) one now obtains the system
of linear equations
(8.6.2)
4z ij - Zi~l,j - Zi+1,j - Zi,j-1 - Zi,j+1 + O'h 2 zij = h 2 fij, 1::; i,j::; H,
ZOj
= ZN+l,j = ZiQ = Zi,N+1 = 0 for 0::; i,j ::; H + 1
for the approximate values $z_{ij}$ of the values $u_{ij} = u(x_i, y_j)$ of the exact
solution. To the decomposition
\[
4z_{ij} - z_{i-1,j} - z_{i+1,j} - z_{i,j-1} - z_{i,j+1} + \sigma h^2 z_{ij}
= \bigl[2z_{ij} - z_{i-1,j} - z_{i+1,j}\bigr] + \bigl[2z_{ij} - z_{i,j-1} - z_{i,j+1}\bigr] + \bigl[\sigma h^2 z_{ij}\bigr]
\]
into variables which are located on the same horizontal and vertical line
[see Figure 34] of $\Omega_h$, respectively, there corresponds a decomposition of
the matrix A of the system of equations (8.6.2), $Az = b$, of the form
\[
A = H + V + E.
\]
Here, H, V, E are defined through their actions on a vector z:
\[
(8.6.3)\qquad
w_{ij} =
\begin{cases}
2z_{ij} - z_{i-1,j} - z_{i+1,j}, & \text{if } w = Hz,\\
2z_{ij} - z_{i,j-1} - z_{i,j+1}, & \text{if } w = Vz,\\
\sigma h^2 z_{ij}, & \text{if } w = Ez.
\end{cases}
\]
E is a diagonal matrix with nonnegative elements; H and V are both
symmetric and positive definite. It suffices to show this for H: If the
$z_{ij}$ are ordered in correspondence to the rows of $\Omega_h$ [see Figure 34],
$z = [z_{11}, z_{21}, \ldots, z_{N1}, \ldots, z_{1N}, z_{2N}, \ldots, z_{NN}]^T$, then
H is block diagonal, consisting of N copies of the $N \times N$ tridiagonal matrix
\[
\begin{pmatrix}
2 & -1 & & \\
-1 & 2 & \ddots & \\
& \ddots & \ddots & -1 \\
& & -1 & 2
\end{pmatrix}.
\]
But according to Theorem (7.4.7) these tridiagonal matrices,
and hence also H, are positive definite. For V one proceeds similarly.
Analogous decompositions of A are also obtained for boundary-value
problems which are considerably more general than (8.6.1).
In the ADI method of Peaceman and Rachford the system of equations
\[
Az = b,
\]
in accordance with the decomposition $A = H + V + E$, is now transformed
equivalently into
\[
\bigl(H + \tfrac{1}{2}E + rI\bigr)z = \bigl(rI - V - \tfrac{1}{2}E\bigr)z + b
\]
and also
\[
\bigl(V + \tfrac{1}{2}E + rI\bigr)z = \bigl(rI - H - \tfrac{1}{2}E\bigr)z + b.
\]
Here r is an arbitrary real number. With the abbreviations $H_1 := H + \tfrac{1}{2}E$,
$V_1 := V + \tfrac{1}{2}E$, one obtains the iteration algorithm of the ADI method,
\[
\begin{aligned}
(8.6.4\text{a})\qquad (H_1 + r_{i+1}I)\,z^{(i+1/2)} &= (r_{i+1}I - V_1)\,z^{(i)} + b,\\
(8.6.4\text{b})\qquad (V_1 + r_{i+1}I)\,z^{(i+1)} &= (r_{i+1}I - H_1)\,z^{(i+1/2)} + b.
\end{aligned}
\]
Given $z^{(i)}$, one first computes $z^{(i+1/2)}$ from the first of these equations, then
substitutes this value into the second and computes $z^{(i+1)}$. The quantity
$r_{i+1}$ is a real parameter which may be chosen differently from step to step.
With suitable ordering of the variables $z_{ij}$ the matrices $(H_1 + r_{i+1}I)$ and
$(V_1 + r_{i+1}I)$ are positive definite tridiagonal matrices (assuming $r_{i+1} \ge 0$),
so that the systems of equations (8.6.4a,b) can easily be solved for $z^{(i+1/2)}$
and $z^{(i+1)}$ via a Choleski decomposition of $H_1 + r_{i+1}I$ and $V_1 + r_{i+1}I$.
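The following Python sketch of one ADI double step (8.6.4a,b) for the model problem is added here for illustration and is not from the text; the names are ad hoc, and a general tridiagonal banded solver stands in for the Choleski decompositions mentioned above. Since $H_1$ couples unknowns that differ in the first index i and $V_1$ those that differ in the second index j, each half step decomposes into N independent tridiagonal solves.

import numpy as np
from scipy.linalg import solve_banded

def adi_step(Z, B, r, sigma, h):
    # One double step (8.6.4a,b) of the Peaceman-Rachford ADI method for (8.6.2).
    # Z[i, j] approximates z_{i+1, j+1}; B holds h^2 f_ij; zero boundary values
    # are implicit.  r is the parameter r_{i+1} of this step.
    N = Z.shape[0]
    e = 0.5 * sigma * h**2                     # contribution of (1/2) E per half step
    ab = np.zeros((3, N))                      # H_1 + rI and V_1 + rI: tridiag(-1, 2+e+r, -1)
    ab[0, 1:], ab[1, :], ab[2, :-1] = -1.0, 2.0 + e + r, -1.0

    def apply_V1(W):                           # V_1 acts along the second index j
        Y = (2.0 + e) * W
        Y[:, 1:]  -= W[:, :-1]
        Y[:, :-1] -= W[:, 1:]
        return Y

    def apply_H1(W):                           # H_1 acts along the first index i
        Y = (2.0 + e) * W
        Y[1:, :]  -= W[:-1, :]
        Y[:-1, :] -= W[1:, :]
        return Y

    rhs = r * Z - apply_V1(Z) + B              # (8.6.4a): tridiagonal solves in the i-direction
    Z_half = np.empty_like(Z)
    for j in range(N):
        Z_half[:, j] = solve_banded((1, 1), ab, rhs[:, j])

    rhs = r * Z_half - apply_H1(Z_half) + B    # (8.6.4b): tridiagonal solves in the j-direction
    Z_new = np.empty_like(Z)
    for i in range(N):
        Z_new[i, :] = solve_banded((1, 1), ab, rhs[i, :])
    return Z_new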
Eliminating $z^{(i+1/2)}$, one obtains from (8.6.4)
\[
(8.6.5)\qquad z^{(i+1)} = T_{r_{i+1}}z^{(i)} + g_{r_{i+1}}(b),\qquad i \ge 0,
\]
with
\[
(8.6.6)\qquad
\begin{aligned}
T_r &:= (V_1 + rI)^{-1}(rI - H_1)(H_1 + rI)^{-1}(rI - V_1),\\
g_r(b) &:= (V_1 + rI)^{-1}\bigl[I + (rI - H_1)(H_1 + rI)^{-1}\bigr]b.
\end{aligned}
\]
For the error $f_i := z^{(i)} - z$ it follows from (8.6.5) and the relation
$z = T_{r_{i+1}}z + g_{r_{i+1}}(b)$, by subtraction, that
\[
(8.6.7)\qquad f_{i+1} = T_{r_{i+1}}f_i,\qquad i = 0, 1, \ldots,
\]
and therefore
\[
(8.6.8)\qquad f_m = T_{r_m}T_{r_{m-1}}\cdots T_{r_1}f_0.
\]
As in relaxation methods, one tries to choose the parameters $r_i$ in such
a way that the method converges as fast as possible. In view of (8.6.7),
(8.6.8), this means that the $r_i$ are to be determined so that the spectral
radius $\rho(T_{r_m}\cdots T_{r_1})$ becomes as small as possible.
We first want to consider the case in which the same parameter $r_i = r$
is chosen for all $i = 1, 2, \ldots$. Here one has the following:
(8.6.9) Theorem. Under the assumption that $H_1$ and $V_1$ are positive definite one has $\rho(T_r) < 1$ for all $r > 0$.
Note that the assumption is satisfied for our special problem (8.6.1).
From Theorem (8.6.9) it follows in particular that each constant choice
ri = r > 0 leads to a convergent ite