Texts in Applied Mathematics
12
Editors
J.E. Marsden
L. Sirovich
M. Golubitsky
S.S. Antman
Advisors
G. Iooss
P. Holmes
D. Barkley
M. Dellnitz
P. Newton
Springer Science+Business Media, LLC
Texts in Applied Mathematics
1. Sirovich: Introduction to Applied Mathematics.
2. Wiggins: Introduction to Applied Nonlinear Dynamical Systems and Chaos.
3. Hale/Koçak: Dynamics and Bifurcations.
4. Chorin/Marsden: A Mathematical Introduction to Fluid Mechanics, 3rd ed.
5. Hubbard/West: Differential Equations: A Dynamical Systems Approach: Ordinary Differential Equations.
6. Sontag: Mathematical Control Theory: Deterministic Finite Dimensional Systems, 2nd ed.
7. Perko: Differential Equations and Dynamical Systems, 3rd ed.
8. Seaborn: Hypergeometric Functions and Their Applications.
9. Pipkin: A Course on Integral Equations.
10. Hoppensteadt/Peskin: Modeling and Simulation in Medicine and the Life Sciences, 2nd ed.
11. Braun: Differential Equations and Their Applications, 4th ed.
12. Stoer/Bulirsch: Introduction to Numerical Analysis, 3rd ed.
13. Renardy/Rogers: An Introduction to Partial Differential Equations.
14. Banks: Growth and Diffusion Phenomena: Mathematical Frameworks and Applications.
15. Brenner/Scott: The Mathematical Theory of Finite Element Methods, 2nd ed.
16. Van de Velde: Concurrent Scientific Computing.
17. Marsden/Ratiu: Introduction to Mechanics and Symmetry, 2nd ed.
18. Hubbard/West: Differential Equations: A Dynamical Systems Approach: Higher-Dimensional Systems.
19. Kaplan/Glass: Understanding Nonlinear Dynamics.
20. Holmes: Introduction to Perturbation Methods.
21. Curtain/Zwart: An Introduction to Infinite-Dimensional Linear Systems Theory.
22. Thomas: Numerical Partial Differential Equations: Finite Difference Methods.
23. Taylor: Partial Differential Equations: Basic Theory.
24. Merkin: Introduction to the Theory of Stability of Motion.
25. Naber: Topology, Geometry, and Gauge Fields: Foundations.
26. Polderman/Willems: Introduction to Mathematical Systems Theory: A Behavioral Approach.
27. Reddy: Introductory Functional Analysis with Applications to Boundary-Value Problems and Finite Elements.
28. Gustafson/Wilcox: Analytical and Computational Methods of Advanced Engineering Mathematics.
29. Tveito/Winther: Introduction to Partial Differential Equations: A Computational Approach.
30. Gasquet/Witomski: Fourier Analysis and Applications: Filtering, Numerical Computation, Wavelets.

(continued after index)
J. Stoer
R. Bulirsch
Introduction to
Numerical Analysis
Third Edition
Translated by R. Bartels, W. Gautschi, and C. Witzgall
With 39 Illustrations
Springer
J. Stoer
Institut für Angewandte Mathematik
Universität Würzburg
Am Hubland
D-97074 Würzburg
Germany
R. Bartels
Department of Computer
Science
University of Waterloo
Waterloo, Ontario N2L 3G1
Canada
R. Bulirsch
Institut für Mathematik
Technische Universität
8000 München
Germany
W. Gautschi
Department of Computer
Sciences
Purdue University
West Lafayette, IN 47907
USA
C. Witzgall
Center for Applied Mathematics
National Bureau of Standards
Washington, DC 20234
USA
Series Editors
J.E. Marsden
Control and Dynamical Systems, 107-81
California Institute of Technology
Pasadena, CA 91125
USA
L. Sirovich
Division of Applied Mathematics
Brown University
Providence, RI 02912
USA
M. Golubitsky
Department of Mathematics
University of Houston
Houston, TX 77204-3476
USA
S.S. Antman
Department of Mathematics
and
Institute for Physical Science
and Technology
University of Maryland
College Park, MD 20742-4015
USA
Mathematics Subject Classification (2000): 44-01, 44A10, 44A15, 65R10
Library of Congress Cataloging-in-Publication Data
Stoer, Josef.
[Einführung in die numerische Mathematik. English]
Introduction to numerical analysis / J. Stoer, R. Bulirsch. - 3rd ed.
p. cm. - (Texts in applied mathematics ; 12)
Includes bibliographical references and index.
ISBN 978-1-4419-3006-4
ISBN 978-0-387-21738-3 (eBook)
DOI 10.1007/978-0-387-21738-3
1. Numerical analysis. I. Bulirsch, Roland. II. Title. III. Series.
QA297 .S8213 2002
519.4 - dc21
2002019729
Printed on acid-free paper.
© 2002, 1980, 1993 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 2002
Softcover reprint of the hardcover 3rd edition 2002
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.
Production managed by Timothy Taylor; manufacturing supervised by Jacqui Ashri.
Photocomposed copy prepared from the author's LaTeX files using Springer's svsing.sty macro.
9 8 7 6 5 4 3 2 1
ISBN 978-1-4419-3006-4
SPIN 10867658
Series Preface
Mathematics is playing an ever more important role in the physical and
biological sciences, provoking a blurring of boundaries between scientific
disciplines and a resurgence of interest in the modern as well as the classical
techniques of applied mathematics. This renewal of interest, both in research and teaching, has led to the establishment of the series Texts in
Applied Mathematics (TAM).
The development of new courses is a natural consequence of a high level
of excitement on the research frontier as newer techniques, such as numerical and symbolic computer systems, dynamical systems, and chaos, mix
with and reinforce the traditional methods of applied mathematics. Thus,
the purpose of this textbook series is to meet the current and future needs
of these advances and to encourage the teaching of new courses.
TAM will publish textbooks suitable for use in advanced undergraduate
and beginning graduate courses, and will complement the Applied Mathematical Sciences (AMS) series, which will focus on advanced textbooks and
research-level monographs.
Pasadena, California
Providence, Rhode Island
Houston, Texas
College Park, Maryland
J.E. Marsden
L. Sirovich
M. Golubitsky
S.S. Antman
Preface to the Third Edition
A new edition of a text presents not only an opportunity for corrections
and minor changes but also for adding new material. Thus we strove to improve the presentation of Hermite interpolation and B-splines in Chapter 2, and we added a new Section 2.4.6 on multi-resolution methods and B-splines, using, in particular, low order B-splines for purposes of illustration. The intent is to draw attention to the role of B-splines in this area, and to familiarize the reader with, at least, the principles of multi-resolution methods, which are fundamental to modern applications in signal and image processing.
The chapter on differential equations was enlarged, too. A new Section
7.2.18 describes solving differential equations in the presence of discontinuities whose locations are not known at the outset. Such discontinuities
occur, for instance, in optimal control problems where the character of a
differential equation is affected by control function changes in response to
switching events.
Many applications, such as parameter identification, lead to differential
equations which depend on additional parameters. Users then would like
to know how sensitively the solution reacts to small changes in these parameters. Techniques for such sensitivity analyses are the subject of the
new Section 7.2.19.
Multiple shooting methods are among the most powerful for solving
boundary value problems for ordinary differential equations. We dedicated,
therefore, a new Section 7.3.8 to new advanced techniques in multiple shooting, which especially enhance the efficiency of these methods when applied
to solve boundary value problems with discontinuities, which are typical
for optimal control problems.
Among the many iterative methods for solving large sparse linear equations, Krylov space methods keep growing in importance. We therefore
treated these methods in Section 8.7 more systematically by adding new
subsections dealing with the GMRES method (Section 8.7.2), the biorthogonalization method of Lanczos and the (principles of the) QMR method
(Section 8.7.3), and the Bi-CG and Bi-CGSTAB algorithms (Section 8.7.4).
Correspondingly, the final Section 8.10 on the comparison of iterative methods was updated in order to incorporate the findings for all Krylov space
methods described before.
The authors are greatly indebted to the many who have contributed
to the new edition. We thank R. Grigorieff for many critical remarks on
earlier editions, M. v. Golitschek for his recommendations concerning B-splines and their application in multi-resolution methods, and Ch. Pflaum
for his comments on the chapter dealing with the iterative solution of linear
equations. T. Kronseder and R. Callies helped substantially to establish
the new sections 7.2.18, 7.2.19, and 7.3.8. Suggestions by Ch. Witzgall,
who had helped translate a previous edition, were highly appreciated and
went beyond issues of language. Our co-workers M. Preiss and M. Wenzel
helped us read and correct the original German version. In particular, we
appreciate the excellent work done by J. Launer and Mrs. W. Wrschka
who were in charge of transcribing the full text of the new edition in TEX.
Finally we thank the Springer-Verlag for the smooth cooperation and
expertise that led to a quick realization of the new edition.
Würzburg, München
January 2002
J. Stoer
R. Bulirsch
Preface to the Second Edition
On the occasion of the new edition, the text was enlarged by several new
sections. Two sections on B-splines and their computation were added to
the chapter on spline functions: due to their special properties, their flexibility, and the availability of well tested programs for their computation,
B-splines play an important role in many applications.
Also, the authors followed suggestions by many readers to supplement
the chapter on elimination methods by a section dealing with the solution
of large sparse systems of linear equations. Even though such systems
are usually solved by iterative methods, the realm of elimination methods
has been widely extended due to powerful techniques for handling sparse
matrices. We will explain some of these techniques in connection with the
Cholesky algorithm for solving positive definite linear systems.
The chapter on eigenvalue problems was enlarged by a section on the
Lanczos algorithm; the sections on the LR and QR algorithms were rewritten and now also contain a description of implicit shift techniques.
In order to take account of the progress in the area of ordinary differential equations to some extent, a new section on implicit differential equations and differential-algebraic systems was added, and the section on
stiff differential equations was updated by describing further methods to
solve such equations.
Also the last chapter on the iterative solution of linear equations was
improved. The modern view of the conjugate gradient algorithm as an
iterative method was stressed by adding an analysis of its convergence
rate and a description of some preconditioning techniques. Finally, a new
section on multigrid methods was incorporated: It contains a description
of their basic ideas in the context of a simple boundary value problem for
ordinary differential equations.
Many of the changes were suggested by several colleagues and readers. In
particular, we would like to thank R. Seydel, P. Rentrop and A. Neumaier
for detailed proposals, and our translators R. Bartels, W. Gautschi and C.
Witzgall for their valuable work and critical commentaries. The original
German version was handled by F. Jarre, and I. Brugger was responsible
for the expert typing of the many versions of the manuscript.
Finally we thank the Springer-Verlag for the encouragement, patience
and close cooperation leading to this new edition.
Würzburg, München
May 1991
J. Stoer
R. Bulirsch
Contents
Preface to the Third Edition  vii
Preface to the Second Edition  ix

1  Error Analysis  1
1.1  Representation of Numbers  2
1.2  Roundoff Errors and Floating-Point Arithmetic  4
1.3  Error Propagation  9
1.4  Examples  21
1.5  Interval Arithmetic; Statistical Roundoff Estimation  27
Exercises for Chapter 1  33
References for Chapter 1  36

2  Interpolation  37
2.1  Interpolation by Polynomials  38
2.1.1  Theoretical Foundation: The Interpolation Formula of Lagrange  38
2.1.2  Neville's Algorithm  40
2.1.3  Newton's Interpolation Formula: Divided Differences  43
2.1.4  The Error in Polynomial Interpolation  48
2.1.5  Hermite Interpolation  51
2.2  Interpolation by Rational Functions  59
2.2.1  General Properties of Rational Interpolation  59
2.2.2  Inverse and Reciprocal Differences. Thiele's Continued Fraction  64
2.2.3  Algorithms of the Neville Type  68
2.2.4  Comparing Rational and Polynomial Interpolation  73
2.3  Trigonometric Interpolation  74
2.3.1  Basic Facts  74
2.3.2  Fast Fourier Transforms  80
2.3.3  The Algorithms of Goertzel and Reinsch  88
2.3.4  The Calculation of Fourier Coefficients. Attenuation Factors  92
2.4  Interpolation by Spline Functions  97
2.4.1  Theoretical Foundations  97
2.4.2  Determining Interpolating Cubic Spline Functions  101
2.4.3  Convergence Properties of Cubic Spline Functions  107
2.4.4  B-Splines  111
2.4.5  The Computation of B-Splines  117
2.4.6  Multi-Resolution Methods and B-Splines  121
Exercises for Chapter 2  134
References for Chapter 2  143

3  Topics in Integration  145
3.1  The Integration Formulas of Newton and Cotes  146
3.2  Peano's Error Representation  151
3.3  The Euler-Maclaurin Summation Formula  156
3.4  Integration by Extrapolation  160
3.5  About Extrapolation Methods  165
3.6  Gaussian Integration Methods  171
3.7  Integrals with Singularities  181
Exercises for Chapter 3  184
References for Chapter 3  188

4  Systems of Linear Equations  190
4.1  Gaussian Elimination. The Triangular Decomposition of a Matrix  190
4.2  The Gauss-Jordan Algorithm  200
4.3  The Choleski Decomposition  204
4.4  Error Bounds  207
4.5  Roundoff-Error Analysis for Gaussian Elimination  215
4.6  Roundoff Errors in Solving Triangular Systems  221
4.7  Orthogonalization Techniques of Householder and Gram-Schmidt  223
4.8  Data Fitting  231
4.8.1  Linear Least Squares. The Normal Equations  232
4.8.2  The Use of Orthogonalization in Solving Linear Least-Squares Problems  235
4.8.3  The Condition of the Linear Least-Squares Problem  236
4.8.4  Nonlinear Least-Squares Problems  241
4.8.5  The Pseudoinverse of a Matrix  243
4.9  Modification Techniques for Matrix Decompositions  247
4.10  The Simplex Method  256
4.11  Phase One of the Simplex Method  268
4.12  Appendix: Elimination Methods for Sparse Matrices  272
Exercises for Chapter 4  280
References for Chapter 4  286

5  Finding Zeros and Minimum Points by Iterative Methods  289
5.1  The Development of Iterative Methods  290
5.2  General Convergence Theorems  293
5.3  The Convergence of Newton's Method in Several Variables  298
5.4  A Modified Newton Method  302
5.4.1  On the Convergence of Minimization Methods  303
5.4.2  Application of the Convergence Criteria to the Modified Newton Method  308
5.4.3  Suggestions for a Practical Implementation of the Modified Newton Method. A Rank-One Method Due to Broyden  313
5.5  Roots of Polynomials. Application of Newton's Method  316
5.6  Sturm Sequences and Bisection Methods  328
5.7  Bairstow's Method  333
5.8  The Sensitivity of Polynomial Roots  335
5.9  Interpolation Methods for Determining Roots  338
5.10  The Δ²-Method of Aitken  344
5.11  Minimization Problems without Constraints  349
Exercises for Chapter 5  358
References for Chapter 5  361

6  Eigenvalue Problems  364
6.0  Introduction  364
6.1  Basic Facts on Eigenvalues  366
6.2  The Jordan Normal Form of a Matrix  369
6.3  The Frobenius Normal Form of a Matrix  375
6.4  The Schur Normal Form of a Matrix; Hermitian and Normal Matrices; Singular Values of Matrices  379
6.5  Reduction of Matrices to Simpler Form  386
6.5.1  Reduction of a Hermitian Matrix to Tridiagonal Form: The Method of Householder  388
6.5.2  Reduction of a Hermitian Matrix to Tridiagonal or Diagonal Form: The Methods of Givens and Jacobi  394
6.5.3  Reduction of a Hermitian Matrix to Tridiagonal Form: The Method of Lanczos  398
6.5.4  Reduction to Hessenberg Form  402
6.6  Methods for Determining the Eigenvalues and Eigenvectors  405
6.6.1  Computation of the Eigenvalues of a Hermitian Tridiagonal Matrix  405
6.6.2  Computation of the Eigenvalues of a Hessenberg Matrix. The Method of Hyman  407
6.6.3  Simple Vector Iteration and Inverse Iteration of Wielandt  408
6.6.4  The LR and QR Methods  415
6.6.5  The Practical Implementation of the QR Method  425
6.7  Computation of the Singular Values of a Matrix  436
6.8  Generalized Eigenvalue Problems  440
6.9  Estimation of Eigenvalues  441
Exercises for Chapter 6  455
References for Chapter 6  462

7  Ordinary Differential Equations  465
7.0  Introduction  465
7.1  Some Theorems from the Theory of Ordinary Differential Equations  467
7.2  Initial-Value Problems  471
7.2.1  One-Step Methods: Basic Concepts  471
7.2.2  Convergence of One-Step Methods  477
7.2.3  Asymptotic Expansions for the Global Discretization Error of One-Step Methods  480
7.2.4  The Influence of Rounding Errors in One-Step Methods  483
7.2.5  Practical Implementation of One-Step Methods  485
7.2.6  Multistep Methods: Examples  492
7.2.7  General Multistep Methods  495
7.2.8  An Example of Divergence  498
7.2.9  Linear Difference Equations  501
7.2.10  Convergence of Multistep Methods  504
7.2.11  Linear Multistep Methods  508
7.2.12  Asymptotic Expansions of the Global Discretization Error for Linear Multistep Methods  513
7.2.13  Practical Implementation of Multistep Methods  517
7.2.14  Extrapolation Methods for the Solution of the Initial-Value Problem  521
7.2.15  Comparison of Methods for Solving Initial-Value Problems  524
7.2.16  Stiff Differential Equations  525
7.2.17  Implicit Differential Equations. Differential-Algebraic Equations  531
7.2.18  Handling Discontinuities in Differential Equations  536
7.2.19  Sensitivity Analysis of Initial-Value Problems  538
7.3  Boundary-Value Problems  539
7.3.0  Introduction  539
7.3.1  The Simple Shooting Method  542
7.3.2  The Simple Shooting Method for Linear Boundary-Value Problems  548
7.3.3  An Existence and Uniqueness Theorem for the Solution of Boundary-Value Problems  550
7.3.4  Difficulties in the Execution of the Simple Shooting Method  552
7.3.5  The Multiple Shooting Method  557
7.3.6  Hints for the Practical Implementation of the Multiple Shooting Method  561
7.3.7  An Example: Optimal Control Program for a Lifting Reentry Space Vehicle  565
7.3.8  Advanced Techniques in Multiple Shooting  572
7.3.9  The Limiting Case m → ∞ of the Multiple Shooting Method (General Newton's Method, Quasilinearization)  577
7.4  Difference Methods  582
7.5  Variational Methods  586
7.6  Comparison of the Methods for Solving Boundary-Value Problems for Ordinary Differential Equations  596
7.7  Variational Methods for Partial Differential Equations. The Finite-Element Method  600
Exercises for Chapter 7  607
References for Chapter 7  613

8  Iterative Methods for the Solution of Large Systems of Linear Equations. Additional Methods  619
8.0  Introduction  619
8.1  General Procedures for the Construction of Iterative Methods  621
8.2  Convergence Theorems  623
8.3  Relaxation Methods  629
8.4  Applications to Difference Methods - An Example  639
8.5  Block Iterative Methods  645
8.6  The ADI-Method of Peaceman and Rachford  647
8.7  Krylov Space Methods for Solving Linear Equations  657
8.7.1  The Conjugate-Gradient Method of Hestenes and Stiefel  658
8.7.2  The GMRES Algorithm  667
8.7.3  The Biorthogonalization Method of Lanczos and the QMR Algorithm  680
8.7.4  The Bi-CG and Bi-CGSTAB Algorithms  686
8.8  Buneman's Algorithm and Fourier Methods for Solving the Discretized Poisson Equation  691
8.9  Multigrid Methods  702
8.10  Comparison of Iterative Methods  712
Exercises for Chapter 8  719
References for Chapter 8  727

General Literature on Numerical Methods  730

Index  732
1 Error Analysis
Assessing the accuracy of the results of calculations is a paramount goal
in numerical analysis. One distinguishes several kinds of errors which may
limit this accuracy:
(1) errors in the input data,
(2) roundoff errors,
(3) approximation errors.
Input or data errors are beyond the control of the calculation. They
may be due, for instance, to the inherent imperfections of physical measurements. Roundoff errors arise if one calculates with numbers whose representation is restricted to a finite number of digits, as is usually the case.
As for the third kind of error, many methods will not yield the exact solution of the given problem P, even if the calculations are carried out without rounding, but rather the solution of another simpler problem P̃ which approximates P. For instance, the problem P of summing an infinite series, e.g.,

    e = 1 + 1/1! + 1/2! + 1/3! + ...,

may be replaced by the simpler problem P̃ of summing only up to a finite number of terms of the series. The resulting approximation error is commonly called a truncation error (however, this term is also used for the roundoff-related error committed by deleting any last digit of a number representation). Many approximating problems P̃ are obtained by "discretizing" the original problem P: definite integrals are approximated by finite sums, differential quotients by difference quotients, etc. In such cases, the approximation error is often referred to as discretization error.
Some authors extend the term "truncation error" to cover discretization errors.
Some authors extend the term "truncation error" to cover discretization
errors.
In this chapter, we will examine the general effect of input and roundoff
errors on the result of a calculation. Approximation errors will be discussed
in later chapters as we deal with individual methods. For a comprehensive
treatment of roundoff errors in floating-point computation see Sterbenz
(1974).
1.1 Representation of Numbers
Based on their fundamentally different ways of representing numbers, two
categories of computing machinery can be distinguished:
(1) analog computers,
(2) digital computers.
Examples of analog computers are slide rules and mechanical integrators as
well as electronic analog computers. When using these devices one replaces
numbers by physical quantities, e.g., the length of a bar or the intensity of a
voltage, and simulates the mathematical problem by a physical one, which
is solved through measurement, yielding a solution for the original mathematical problem as well. The scales of a slide rule, for instance, represent
numbers x by line segments of length k ln x. Multiplication is simulated by
positioning line segments contiguously and measuring the combined length
for the result.
It is clear that the accuracy of analog devices is directly limited by the
physical measurements they employ.
Digital computers express the digits of a number representation by a
sequence of discrete physical quantities. Typical instances are desk calculators and electronic digital computers.
EXAMPLE.

    | 1 | 2 | 3 | 1 | 0 | 1 |
Each digit is represented by a specific physical quantity. Since only a
small finite number of different digits have to be encoded - in the decimal
number system, for instance, there are only 10 digits - the representation of
digits in digital computers need not be quite as precise as the representation
of numbers in analog computers. Thus one might tolerate voltages between,
say, 7.8 and 8.2 when aiming at a representation of the digit 8 by 8 volts.
Consequently, the accuracy of digital computers is not directly limited
by the precision of physical measurements.
For technical reasons, most modern electronic digital computers represent numbers internally in binary rather than decimal form. Here the
coefficients or bits αᵢ of a decomposition by powers of 2 play the role of digits in the representation of a number x:

    x = ±(αₙ·2ⁿ + αₙ₋₁·2ⁿ⁻¹ + ... + α₀·2⁰ + α₋₁·2⁻¹ + α₋₂·2⁻² + ...),    αᵢ = 0 or αᵢ = 1.
In order not to confuse decimal and binary representations of numbers, we denote the bits of a binary number representation by 0 and L, respectively.

EXAMPLE. The number x = 18.5 admits the decomposition

    18.5 = 1·2⁴ + 0·2³ + 0·2² + 1·2¹ + 0·2⁰ + 1·2⁻¹

and has therefore the binary representation

    LOOLO.L.
We will use mainly the decimal system, pointing out differences between
the two systems whenever it is pertinent to the examination at hand.
As the example 3.999 ... = 4 shows, the decimal representation of a
number may not be unique. The same holds for binary representations. To
exclude such ambiguities, we will always refer to the finite representation
unless otherwise stated.
In general, digital computers must make do with a fixed finite number
of places, the word length, when internally representing a number. This
number n is determined by the make of the machine, although some machines have built-in extensions to integer multiples 2n, 3n, ... (double word
length, triple word length, ... ) of n to offer greater precision if needed. A
word length of n places can be used in several different fashions to represent a number. Fixed-point representation specifies a fixed number nl of
places before and a fixed number n2 after the decimal (binary) point, so
that n = nl + n2 (usually nl = 0 or nl = n).
EXAMPLE. For n = 10, n₁ = 4, n₂ = 6:

    30.421 → | 0030 | 421000 |
    0.0437 → | 0000 | 043700 |
In this representation, the position of the decimal (binary) point is
fixed. A few simple digital devices, mainly for accounting purposes, are still
restricted to fixed-point representation. Much more important, in particular for scientific calculations, are digital computers featuring floating-point
representation of numbers. Here the decimal (binary) point is not fixed at
the outset; rather its position with respect to the first digit is indicated for
each number separately. This is done by specifying a so-called exponent. In
other words, each real number can be represented in the form
(1.1.1)    x = a × 10^b    (x = a × 2^b)    with |a| < 1, b integer

(say, 30.421 by 0.30421 × 10²), where the exponent b indicates the position of the decimal point with respect to the mantissa a. Rutishauser proposed the following "semilogarithmic" notation, which displays the basis of the
number system at the subscript level and moves the exponent up to the
level of the mantissa:
    0.30421₁₀2.

Analogously,

    0.LOOLOL₂LOL
denotes the number 18.5 in the binary system. On any digital computer
there are, of course, only fixed finite numbers t and e, n = t + e, of places
available for the representation of mantissa and exponent, respectively.
EXAMPLE. For t = 4, e = 2 one would have the floating-point representation

    | 0.5420 |₁₀| 04 |    or more concisely    | 5420 | 04 |

for the number 5420 in the decimal system.
The floating-point representation of a number need not be unique. Since 5420 = 0.5420·10⁴ = 0.0542·10⁵, one could also have the floating-point representation

    | 0.0542 |₁₀| 05 |    or    | 0542 | 05 |

instead of the one given in the above example.
A floating-point representation is normalized if the first digit (bit) of
the mantissa is different from 0 (O). Then |a| ≥ 10⁻¹ (|a| ≥ 2⁻¹) holds
in (1.1.1). The significant digits (bits) of a number are the digits of the
mantissa not counting leading zeros.
In what follows, we will only consider normalized floating-point representations and the corresponding floating-point arithmetic. The numbers t
and e determine - together with the basis B = 10 or B = 2 of the number
representation - the set A ⊆ ℝ of real numbers which can be represented
exactly within a given machine. The elements of A are called machine
numbers.
While normalized floating-point arithmetic is prevalent on current
electronic digital computers, unnormalized arithmetic has been proposed
to ensure that only truly significant digits are carried [Ashenhurst and
Metropolis (1959)].
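The normalized representation (1.1.1) can be inspected directly on a binary machine. The following small Python sketch (our illustration, not part of the original text) splits an IEEE double-precision number into its mantissa a and exponent b using math.frexp, which returns exactly the pair with x = a·2^b and 0.5 ≤ |a| < 1.

import math

# Illustrative sketch only: decompose x into a normalized binary mantissa and
# exponent, x = a * 2**b with 0.5 <= |a| < 1, as in (1.1.1) for the basis 2.
for x in [18.5, 0.1, -3.0]:
    a, b = math.frexp(x)                 # mantissa a and exponent b
    print(x, "=", a, "* 2 **", b, "->", math.ldexp(a, b))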
1.2 Roundoff Errors and Floating-Point Arithmetic
The set A of numbers which are representable in a given machine is only
finite. The question therefore arises of how to approximate a number x rf. A
which is not a machine number by a number g E A which is. This problem
is encountered not only when reading data into a computer, but also when
representing intermediate results within the computer during the course
of a calculation. Indeed, straightforward examples show that the results of
elementary arithmetic operations x ± y, x × y, x/y need not belong to A, even if both operands x, y ∈ A are machine numbers.
It is natural to postulate that the approximation of any number x ∉ A by a machine number rd(x) ∈ A should satisfy

(1.2.1)    |x − rd(x)| ≤ |x − g|    for all g ∈ A.
Such a machine-number approximation rd(x) can be obtained in most cases
by rounding.
EXAMPLE 1 (t = 4).

    rd(0.14285₁₀0)  = 0.1429₁₀0,
    rd(3.14159₁₀0)  = 0.3142₁₀1,
    rd(0.142842₁₀2) = 0.1428₁₀2.
In general, one can proceed as follows in order to find rd(x) for a t-digit computer: x ∉ A is first represented in normalized form x = a × 10^b, so that |a| ≥ 10⁻¹. Suppose the decimal representation of |a| is given by

    |a| = 0.α₁α₂α₃ ...,    0 ≤ αᵢ ≤ 9,    α₁ ≠ 0.

Then one forms

    a' := 0.α₁α₂ ... αₜ            if 0 ≤ αₜ₊₁ ≤ 4,
    a' := 0.α₁α₂ ... αₜ + 10⁻ᵗ     if αₜ₊₁ ≥ 5,

that is, one increases αₜ by 1 if the (t + 1)st digit αₜ₊₁ ≥ 5, and deletes all digits after the t-th one. Finally one puts

    r̃d(x) := sign(x) · a' × 10^b.

Since |a| ≥ 10⁻¹, the "relative error" of r̃d(x) admits the following bound (Scarborough, 1950):

    | (r̃d(x) − x)/x | ≤ 5 × 10^−(t+1) / |a| ≤ 5 × 10⁻ᵗ.

With the abbreviation eps := 5 × 10⁻ᵗ, this can be written as

(1.2.2)    r̃d(x) = x(1 + ε),    where |ε| ≤ eps.

The quantity eps = 5 × 10⁻ᵗ is called the machine precision. In the binary system, r̃d(x) is defined analogously: Starting with a decomposition x = a · 2^b satisfying 2⁻¹ ≤ |a| < 1 and the binary representation of |a|, one forms

    a' := 0.α₁ ... αₜ           if αₜ₊₁ = 0,
    a' := 0.α₁ ... αₜ + 2⁻ᵗ     if αₜ₊₁ = L,

    r̃d(x) := sign(x) · a' × 2^b.

Again (1.2.2) holds, provided one defines the machine precision by eps := 2⁻ᵗ.
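The decimal rounding rule just described is easy to simulate. The sketch below (ours, not from the book) rounds to t = 4 significant decimal digits with ties rounded away from zero and checks the bound (1.2.2) with eps = 5·10⁻ᵗ; the helper rd here is a hypothetical illustration, not a library function.

import math
from decimal import Decimal, ROUND_HALF_UP

def rd(x, t):
    # Round x to t significant decimal digits as in the text: normalize
    # x = a * 10**b with 0.1 <= |a| < 1, keep t mantissa digits, and round
    # the (t+1)st digit away from zero on ties.
    if x == 0.0:
        return 0.0
    b = math.floor(math.log10(abs(x))) + 1
    a = Decimal(repr(x)) / (Decimal(10) ** b)
    a_rounded = a.quantize(Decimal(10) ** (-t), rounding=ROUND_HALF_UP)
    return float(a_rounded * Decimal(10) ** b)

t, eps = 4, 5 * 10 ** (-4)
for x in [0.14285, 3.14159, 14.2842]:
    r = rd(x, t)
    rel_err = abs(r - x) / abs(x)
    print(x, "->", r, "relative error", rel_err, "<= eps:", rel_err <= eps)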
Whenever r̃d(x) ∈ A is a machine number, then r̃d has the property (1.2.1) of a correct rounding process, and we may define

    rd(x) := r̃d(x)    for all x with r̃d(x) ∈ A.

Because only a finite number e of places are available to express the exponent in a floating-point representation, there are unfortunately always numbers x ∉ A with r̃d(x) ∉ A.

EXAMPLE 2 (t = 4, e = 2).

    a) r̃d(0.31794₁₀110)  = 0.3179₁₀110  ∉ A.
    b) r̃d(0.99997₁₀99)   = 0.1000₁₀100  ∉ A.
    c) r̃d(0.012345₁₀−99) = 0.1235₁₀−100 ∉ A.
    d) r̃d(0.54321₁₀−110) = 0.5432₁₀−110 ∉ A.
In cases (a) and (b) the exponent is too greatly positive to fit the allotted
space: These are instances of exponent overflow. Case (b) is particularly
pathological: Exponent overflow happens only after rounding. Cases (c) and
(d) are instances of exponent underflow, i.e. the exponent of the number
represented is too greatly negative. In cases (c) and (d) exponent underflow
may be prevented by defining
(1.2.3)    rd(0.012345₁₀−99) := 0.0123₁₀−99 ∈ A,
           rd(0.54321₁₀−110) := 0 ∈ A.
But then rd does not satisfy (1.2.2), that is, the relative error of rd(x) may exceed eps. Digital computers treat occurrences of exponent overflow and underflow as irregularities of the calculation. In the case of exponent underflow, rd(x) may be formed as indicated in (1.2.3). Exponent overflow may cause a halt in calculations. In the remaining regular cases (but not for all makes of computers), rounding is defined by

    rd(x) := r̃d(x).
Exponent overflow and underflow can be avoided to some extent by suitable
scaling of the input data and by incorporating special checks and rescalings
during computations. Since each different numerical method will require its
own special protection techniques, and since overflow and underflow do not
happen very frequently, we will make the idealized assumption that e = ∞ in our subsequent discussions, so that rd := r̃d does indeed provide a rule for rounding which ensures

(1.2.4)    rd : ℝ → A,    rd(x) = x(1 + ε) with |ε| ≤ eps    for all x ∈ ℝ.

In further examples we will, correspondingly, give the length t of the mantissa only. The reader must bear in mind, however, that subsequent statements regarding roundoff errors may be invalid if overflows or underflows
are allowed to happen.
We have seen that the results of arithmetic operations x ± y, x × y, x/y need not be machine numbers, even if the operands x and y are. Thus one cannot expect to reproduce the arithmetic operations exactly on a digital computer. One will have to be content with substitute operations +*, −*, ×*, /*, so-called floating-point operations, which approximate the arithmetic operations as well as possible [v. Neumann and Goldstine (1947)]. Such operations may be defined, for instance, with the help of the rounding map rd as follows:

(1.2.5)    x +* y := rd(x + y),
           x −* y := rd(x − y),
           x ×* y := rd(x × y),      for x, y ∈ A,
           x /* y := rd(x/y),

so that (1.2.4) implies

(1.2.6)    x +* y = (x + y)(1 + ε₁),
           x −* y = (x − y)(1 + ε₂),
           x ×* y = (x × y)(1 + ε₃),      |εᵢ| ≤ eps.
           x /* y = (x/y)(1 + ε₄),

On many modern computer installations, the floating-point operations ±*, ... are not defined by (1.2.5), but instead in such a way that (1.2.6) holds with only a somewhat weaker bound, say |εᵢ| ≤ k·eps, k ≥ 1 being a small integer. Since small deviations from (1.2.6) are not significant for our examinations, we will assume for simplicity that the floating-point operations are in fact defined by (1.2.5) and hence satisfy (1.2.6).
It should be pointed out that the floating-point operations do not satisfy the well-known laws for arithmetic operations. For instance,

    x +* y = x,    if |y| < (eps/B)·|x|,    x, y ∈ A,

where B is the basis of the number system. The machine precision eps could indeed be defined as the smallest positive machine number g for which 1 +* g > 1:

    eps = min { g ∈ A | 1 +* g > 1 and g > 0 }.
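This characterization of eps can be tested directly in IEEE double-precision arithmetic; the loop below (a sketch of ours, not from the book) halves g until 1 + g no longer exceeds 1.

# Illustrative sketch only: find the smallest power of two g with 1 + g > 1
# in IEEE double precision (the '+' here plays the role of +* in the text).
g = 1.0
while 1.0 + g / 2.0 > 1.0:
    g = g / 2.0
print(g)                       # 2.220446049250313e-16, i.e. 2**(-52)

import sys
print(sys.float_info.epsilon)  # the same value, as reported by Python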
Furthermore, floating-point operations need not be associative or distributive.
EXAMPLE 3 (t = 8). With

    a :=  0.23371258₁₀−4,
    b :=  0.33678429₁₀2,
    c := −0.33677811₁₀2,

one has

    a +* (b +* c) = 0.23371258₁₀−4 +* 0.61800000₁₀−3 = 0.64137126₁₀−3,
    (a +* b) +* c = 0.33678452₁₀2 −* 0.33677811₁₀2 = 0.64100000₁₀−3.

The exact result is

    a + b + c = 0.641371258₁₀−3.
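The loss of associativity is just as easy to observe in IEEE double precision; a tiny sketch (ours, not tied to the numbers above):

# Illustrative sketch only: floating-point addition is not associative.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                   # 0.6000000000000001
print(a + (b + c))                   # 0.6
print((a + b) + c == a + (b + c))    # False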
When subtracting two numbers x, y ∈ A of the same sign, one has to watch out for cancellation. This occurs if x and y agree in one or more leading digits with respect to the same exponent, e.g.,

    x = 0.315876₁₀1,    y = 0.314289₁₀1.
The subtraction causes the common leading digits to disappear. The exact
result x-y is consequently a machine number, so that no new roundoff error
arises, x -* y = x - y. In this sense, subtraction in the case of cancellation
is a quite harmless operation. We will see in the next section, however,
that cancellation is extremely dangerous concerning the propagation of old
errors, which stem from the calculation of x and y prior to carrying out
the subtraction x - y.
For expressing the result of floating-point calculations, a convenient
but slightly imprecise notation has been widely accepted, and we will use
it frequently ourselves: If it is clear from the context how to evaluate an
arithmetic expression E (if need be this can be specified by inserting suitable parentheses), then fl( E) denotes the value of the expression as obtained
by floating-point arithmetic.
EXAMPLE 4.
    fl(x × y) := x ×* y,
    fl(a + (b + c)) := a +* (b +* c),
    fl((a + b) + c) := (a +* b) +* c.

We will also use the notation fl(√x), fl(cos(x)), etc., whenever the digital computer approximates the functions √, cos, etc., by substitutes √*, cos*, etc. Thus fl(√x) := √*x, and so on.
The arithmetic operations +, −, ×, /, together with those basic functions like √, cos, for which floating-point substitutes √*, cos*, etc., have been specified, will be called elementary operations.
1.3 Error Propagation
We have seen in the previous section (Example 3) that two different but
mathematically equivalent methods (a + b) + c,a + (b + c) for evaluating
the same expression a + b + c may lead to different results if floating-point
arithmetic is used. For numerical purposes it is therefore important to distinguish between different evaluation schemes even if they are mathematically equivalent. Thus we call a finite sequence of elementary operations
(as given for instance by consecutive computer instructions) which prescribes how to calculate the solution of a problem from given input data,
an algorithm.
We will formalize the notion of an algorithm somewhat. Suppose a problem consists of calculating desired result numbers y₁, ..., yₘ from input numbers x₁, ..., xₙ. If we introduce the vectors

    x = (x₁, ..., xₙ)^T,    y = (y₁, ..., yₘ)^T,

then solving the above problem means determining the value y = φ(x) of a certain multivariate vector function φ : D → ℝ^m, D ⊆ ℝⁿ, where φ is given by m real functions φᵢ,

    yᵢ = φᵢ(x₁, ..., xₙ),    i = 1, ..., m.
At each stage of a calculation there is an operand set of numbers, which
either are original input numbers Xi or have resulted from previous operations. A single operation calculates a new number from one or more
elements of the operand set. The new number is either an intermediate or
a final result. In any case, it is adjoined to the operand set, which then
is purged of all entries that will not be needed as operands during the remainder of the calculation. The final operand set will consist of the desired
results y₁, ..., yₘ.
Therefore, an operation corresponds to a transformation of the operand
set. Writing consecutive operand sets as vectors,

    x^(i) = (x₁^(i), ..., x_{n_i}^(i))^T,

we can associate with an elementary operation an elementary map

    φ^(i) : D_i → ℝ^{n_{i+1}},    D_i ⊆ ℝ^{n_i},

so that

    φ^(i)(x^(i)) = x^(i+1),

where x^(i+1) is a vector of the transformed operand set. The elementary map φ^(i) is uniquely defined except for inconsequential permutations of x^(i) and x^(i+1) which stem from the arbitrariness involved in arranging the corresponding operand sets in the form of vectors.
Given an algorithm, then its sequence of elementary operations gives rise to a decomposition of φ into a sequence of elementary maps

(1.3.1)    φ^(i) : D_i → D_{i+1},    D_j ⊆ ℝ^{n_j},
           φ = φ^(r) ∘ φ^(r−1) ∘ ... ∘ φ^(0),    D_0 = D,    D_{r+1} ⊆ ℝ^{n_{r+1}} = ℝ^m,

which characterize the algorithm.
EXAMPLE 1. For φ(a, b, c) = a + b + c consider the two algorithms η := a + b, y := c + η and η := b + c, y := a + η. The decompositions (1.3.1) are

    φ^(0)(a, b, c) := (a + b, c)^T ∈ ℝ²,    φ^(1)(u, v) := u + v ∈ ℝ

and

    φ^(0)(a, b, c) := (a, b + c)^T ∈ ℝ²,    φ^(1)(u, v) := u + v ∈ ℝ.
EXAMPLE 2. Since a² − b² = (a + b)(a − b), one has for the calculation of φ(a, b) := a² − b² the two algorithms

    Algorithm 1:  η₁ := a × a,   η₂ := b × b,   y := η₁ − η₂,
    Algorithm 2:  η₁ := a + b,   η₂ := a − b,   y := η₁ × η₂.

Corresponding decompositions (1.3.1) are

    Algorithm 1:  φ^(0)(a, b) := (a·a, b)^T,      φ^(1)(u, v) := (u, v·v)^T,      φ^(2)(u, v) := u − v,
    Algorithm 2:  φ^(0)(a, b) := (a, b, a + b)^T,  φ^(1)(a, b, v) := (v, a − b)^T,  φ^(2)(u, v) := u·v.

Note that the decomposition of φ(a, b) := a² − b² corresponding to Algorithm 1 above can be telescoped into a simpler decomposition:

    φ̄^(0)(a, b) := (a², b²)^T,    φ̄^(1)(u, v) := u − v.

Strictly speaking, however, the map φ̄^(0) is not elementary. Moreover the decomposition does not determine the algorithm uniquely, since there is still a choice, however numerically insignificant, of what to compute first, a² or b².
Hoping to find criteria for judging the quality of algorithms, we will
now examine the reasons why different algorithms for solving the same
problem generally yield different results. Error propagation, for one, plays
a decisive role, as the example of the sum y = a + b + c shows (see Example
3 in Section 1.2). Here floating-point arithmetic yields an approximation
ỹ = fl((a + b) + c) to y which, according to (1.2.6), satisfies

    η := fl(a + b) = (a + b)(1 + ε₁),
    ỹ := fl(η + c) = (η + c)(1 + ε₂)
       = [(a + b)(1 + ε₁) + c](1 + ε₂)
       = (a + b + c)·[1 + (a + b)/(a + b + c) · ε₁(1 + ε₂) + ε₂].

For the relative error ε_y := (ỹ − y)/y of ỹ,

    ε_y = (a + b)/(a + b + c) · ε₁(1 + ε₂) + ε₂,

or, disregarding terms of order higher than 1 in the ε's such as ε₁ε₂,

    ε_y ≐ (a + b)/(a + b + c) · ε₁ + 1 · ε₂.

The amplification factors (a + b)/(a + b + c) and 1, respectively, measure the effect of the roundoff errors ε₁, ε₂ on the error ε_y of the result. The factor (a + b)/(a + b + c) is critical: depending on whether |a + b| or |b + c| is the smaller of the two, it is better to compute a + b + c via (a + b) + c or via a + (b + c), respectively.
In Example 3 of the previous section,

    (a + b)/(a + b + c) = 0.33...·10² / 0.64...·10⁻³ ≈ (1/2)·10⁵,
    (b + c)/(a + b + c) = 0.618...·10⁻³ / 0.64...·10⁻³ ≈ 0.97,

which explains the higher accuracy of fl(a + (b + c)).
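For the numbers of Example 3 these amplification factors can also be evaluated directly; the following sketch (ours, not from the book) confirms the estimates above.

# Illustrative sketch only: amplification factors of the two summation orders
# for the numbers of Example 3 in Section 1.2.
a, b, c = 0.23371258e-4, 0.33678429e2, -0.33677811e2
s = a + b + c
print("(a + b)/(a + b + c) =", (a + b) / s)   # roughly 5 * 10**4: eps_1 is strongly amplified
print("(b + c)/(a + b + c) =", (b + c) / s)   # roughly 0.96: harmless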
The above method of examining the propagation of particular errors
while disregarding higher-order terms can be extended systematically to
provide a differential error analysis of an algorithm for computing φ(x) if
this function is given by a decomposition (1.3.1):
To this end we must investigate how the input errors Δx of x as well as the roundoff errors accumulated during the course of the algorithm affect the final result y = φ(x). We start this investigation by considering the input errors Δx alone, and we will apply any insights we gain to the analysis of the propagation of roundoff errors. We suppose that the function

    φ : D → ℝ^m

is defined on an open subset D of ℝⁿ, and that its component functions φᵢ, i = 1, ..., m, have continuous first derivatives on D. Let x̃ be an approximate value for x. Then we denote by

    Δxᵢ := x̃ᵢ − xᵢ,    Δx := x̃ − x

the absolute error of x̃ᵢ and x̃, respectively. The relative error of x̃ᵢ is defined as the quantity

    ε_{xᵢ} := (x̃ᵢ − xᵢ)/xᵢ    if xᵢ ≠ 0.
Replacing the input data x by x̃ leads to the result ỹ := φ(x̃) instead of y = φ(x). Expanding in a Taylor series and disregarding higher-order terms gives

(1.3.2)    Δyᵢ := ỹᵢ − yᵢ = φᵢ(x̃) − φᵢ(x) ≐ Σ_{j=1}^{n} (x̃ⱼ − xⱼ)·∂φᵢ(x)/∂xⱼ
                 = Σ_{j=1}^{n} ∂φᵢ(x)/∂xⱼ · Δxⱼ,    i = 1, ..., m,

or in matrix notation,

(1.3.3)    Δy ≐ Dφ(x)·Δx

with the Jacobian matrix Dφ(x).
The notation "≐" instead of "=", which has been used occasionally before, is meant to indicate that the corresponding equations are only a first-order approximation, i.e., they do not take quantities of higher order (in ε's or Δ's) into account.
The quantity ∂φᵢ(x)/∂xⱼ in (1.3.3) represents the sensitivity with which yᵢ reacts to absolute perturbations Δxⱼ of xⱼ. If yᵢ ≠ 0 for i = 1, ..., m and xⱼ ≠ 0 for j = 1, ..., n, then a similar error propagation formula holds for relative errors:

(1.3.4)    ε_{yᵢ} ≐ Σ_{j=1}^{n} (xⱼ/φᵢ(x)) · (∂φᵢ(x)/∂xⱼ) · ε_{xⱼ},    i = 1, ..., m.

Again the quantity (xⱼ/φᵢ)·∂φᵢ/∂xⱼ indicates how strongly a relative error in xⱼ affects the relative error in yᵢ. The amplification factors (xⱼ/φᵢ)·∂φᵢ/∂xⱼ for the relative error have the advantage of not depending on the scales of yᵢ and xⱼ. The amplification factors for relative errors are generally called condition numbers. If any condition numbers are present which have large absolute values, then one speaks of an ill-conditioned problem; otherwise, of a well-conditioned problem. For ill-conditioned problems, small relative errors in the input data x can cause large relative errors in the results y = φ(x).
The above concept of condition numbers suffers from the fact that it is meaningful only for nonzero yᵢ, xⱼ. Moreover, it is impractical for many purposes, since the condition of φ is described by m·n numbers. For these
purposes, since the condition of <p is described by mn numbers. For these
reasons, the conditions of special classes of problems are frequently defined
in a more convenient fashion. In linear algebra, for example, it is customary
to call numbers c condition numbers if, in conjunction with a suitable norm
II· II,
~<p(7:--'x)'---:-----:<p~(x--,-)-".I < cII x - x II
-,,-II
II <p(x) II
-
II x II
(see Section 4.4).
EXAMPLE 3. For y = φ(a, b, c) := a + b + c, (1.3.4) gives

    ε_y ≐ a/(a + b + c) · ε_a + b/(a + b + c) · ε_b + c/(a + b + c) · ε_c.

The problem is well conditioned if every summand a, b, c is small compared to a + b + c.
EXAMPLE 4. Let y = φ(p, q) := −p + √(p² + q). Then

    ∂φ/∂p = −1 + p/√(p² + q) = −y/√(p² + q),    ∂φ/∂q = 1/(2√(p² + q)),

so that

    ε_y ≐ (p/φ)·(∂φ/∂p)·ε_p + (q/φ)·(∂φ/∂q)·ε_q
        = −p/√(p² + q) · ε_p + (p + √(p² + q))/(2√(p² + q)) · ε_q.

Since

    |p/√(p² + q)| < 1,    |(p + √(p² + q))/(2√(p² + q))| ≤ 1    for q > 0,

φ is well conditioned if q > 0, and badly conditioned if q ≈ −p².
For the arithmetic operations (1.3.4) specializes to (x ≠ 0, y ≠ 0)

(1.3.5)    φ(x, y) := x·y :     ε_{x·y} ≐ ε_x + ε_y,
           φ(x, y) := x/y :     ε_{x/y} ≐ ε_x − ε_y,
           φ(x, y) := x ± y :   ε_{x±y} ≐ x/(x ± y) · ε_x ± y/(x ± y) · ε_y,    if x ± y ≠ 0,
           φ(x) := √x :         ε_{√x} ≐ (1/2)·ε_x.

It follows that multiplication, division, and square root are not dangerous: The relative errors of the operands don't propagate strongly into the result. This is also the case for addition, provided the operands x and y have the same sign. Indeed, the condition numbers x/(x + y), y/(x + y) then lie between 0 and 1, and they add up to 1, whence

    |ε_{x+y}| ≤ max(|ε_x|, |ε_y|).

If one operand is small compared to the other, but carries a large relative error, the result x + y will still have a small relative error so long as the other operand has only a small relative error: error damping results. If, however, two operands of different sign are to be added, then at least one of the factors

    |x/(x + y)|,    |y/(x + y)|

is bigger than 1, and at least one of the relative errors ε_x, ε_y will be amplified. This amplification is drastic if x ≈ −y holds and therefore cancellation occurs.
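The amplification caused by cancellation is easy to demonstrate; the sketch below (ours) subtracts two nearly equal numbers whose relative errors are about 10⁻⁷ and shows that, although the subtraction itself is carried out exactly, the result has a relative error of several percent.

# Illustrative sketch only: the subtraction itself is (essentially) exact, but
# the relative errors already present in the operands are strongly amplified.
x_exact, y_exact = 1.2345678, 1.2345611
x_tilde = x_exact * (1 + 1e-7)     # operands known only up to ~1e-7 relative error
y_tilde = y_exact * (1 - 1e-7)

d_exact = x_exact - y_exact        # 6.7e-06
d_tilde = x_tilde - y_tilde
print(abs(d_tilde - d_exact) / abs(d_exact))   # about 0.037: a 3.7 % relative error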
We will now employ the formula (1.3.3) to describe the propagation of roundoff errors for a given algorithm. An algorithm for computing the function φ : D → ℝ^m, D ⊆ ℝⁿ, for a given x = (x₁, ..., xₙ)^T ∈ D corresponds to a decomposition of the map φ into elementary maps φ^(i) [see (1.3.1)], and leads from x^(0) := x via a chain of intermediate results

(1.3.6)    x = x^(0) → φ^(0)(x^(0)) = x^(1) → x^(2) → ... → x^(r+1) = y

to the result y. Again we assume that every φ^(i) is continuously differentiable on D_i.
Now let us denote by ψ^(i) the "remainder map"

    ψ^(i) = φ^(r) ∘ φ^(r−1) ∘ ... ∘ φ^(i) : D_i → ℝ^m,    i = 0, 1, 2, ..., r.

Then ψ^(0) ≡ φ. Dφ^(i) and Dψ^(i) are the Jacobian matrices of the maps φ^(i) and ψ^(i). Since Jacobian matrices are multiplicative with respect to function composition,

    D(f ∘ g)(x) = Df(g(x)) · Dg(x),

we note for further reference that for i = 0, 1, ..., r

(1.3.7)    Dφ(x) = Dφ^(r)(x^(r)) · Dφ^(r−1)(x^(r−1)) ··· Dφ^(0)(x),
           Dψ^(i)(x^(i)) = Dφ^(r)(x^(r)) · Dφ^(r−1)(x^(r−1)) ··· Dφ^(i)(x^(i)).
With floating-point arithmetic, input and roundoff errors will perturb the intermediate (exact) results x^(i), so that approximate values x̃^(i) with x̃^(i+1) = fl(φ^(i)(x̃^(i))) will be obtained instead. For the absolute errors Δx^(i) := x̃^(i) − x^(i),

(1.3.8)    Δx^(i+1) = [fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i))] + [φ^(i)(x̃^(i)) − φ^(i)(x^(i))].

By (1.3.3) (disregarding higher-order error terms),

(1.3.9)    φ^(i)(x̃^(i)) − φ^(i)(x^(i)) ≐ Dφ^(i)(x^(i))·Δx^(i).

If φ^(i) is an elementary map, or if it involves only independent elementary operations, the floating-point evaluation of φ^(i) will yield the rounding of the exact value:

(1.3.10)    fl(φ^(i)(u)) = rd(φ^(i)(u)).

Note, in this context, that the map φ^(i) : D_i → D_{i+1} ⊆ ℝ^{n_{i+1}} is actually a vector of component functions φⱼ^(i) : D_i → ℝ,

    φ^(i) = (φ₁^(i), ..., φ_{n_{i+1}}^(i))^T.

Thus (1.3.10) must be interpreted componentwise:

(1.3.11)    fl(φⱼ^(i)(u)) = rd(φⱼ^(i)(u)) = (1 + εⱼ)·φⱼ^(i)(u),    |εⱼ| ≤ eps,    j = 1, 2, ..., n_{i+1}.

Here εⱼ is the new relative roundoff error generated during the calculation of the j-th component of φ^(i) in floating-point arithmetic. Plainly, (1.3.10) can also be written in the form

    fl(φ^(i)(x̃^(i))) = (I + E_{i+1})·φ^(i)(x̃^(i))

with the identity matrix I and the diagonal error matrix

    E_{i+1} = diag(ε₁, ..., ε_{n_{i+1}}).

This yields the following expression for the first bracket in (1.3.8):

    fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i)) = E_{i+1}·φ^(i)(x̃^(i)).

Furthermore E_{i+1}·φ^(i)(x̃^(i)) ≐ E_{i+1}·φ^(i)(x^(i)), since the error terms by which φ^(i)(x̃^(i)) and φ^(i)(x^(i)) differ are multiplied by the error terms on the diagonal of E_{i+1}, giving rise to higher-order error terms. Therefore

(1.3.12)    fl(φ^(i)(x̃^(i))) − φ^(i)(x̃^(i)) ≐ E_{i+1}·φ^(i)(x^(i)) = E_{i+1}·x^(i+1) =: α_{i+1}.

The quantity α_{i+1} can be interpreted as the absolute roundoff error newly created when φ^(i) is evaluated in floating-point arithmetic, and the diagonal elements of E_{i+1} can be similarly interpreted as the corresponding relative roundoff errors. Thus by (1.3.8), (1.3.9), and (1.3.12), Δx^(i+1) can be expressed in first-order approximation as follows:

    Δx^(i+1) ≐ α_{i+1} + Dφ^(i)(x^(i))·Δx^(i) = E_{i+1}·x^(i+1) + Dφ^(i)(x^(i))·Δx^(i),    i ≥ 0,    Δx^(0) := Δx.

Consequently

    Δx^(1) ≐ Dφ^(0)·Δx + α₁,
    Δx^(2) ≐ Dφ^(1)·[Dφ^(0)·Δx + α₁] + α₂,
    ...

In view of (1.3.7), we finally arrive at the following formulas which describe the effect of the input errors Δx and the roundoff errors αᵢ on the result y = x^(r+1) = φ(x):

(1.3.13)    Δy ≐ Dφ(x)·Δx + Dψ^(1)(x^(1))·α₁ + ... + Dψ^(r)(x^(r))·α_r + α_{r+1}
               = Dφ(x)·Δx + Dψ^(1)(x^(1))·E₁·x^(1) + ... + Dψ^(r)(x^(r))·E_r·x^(r) + E_{r+1}·y.

It is therefore the size of the Jacobian matrix Dψ^(i) of the remainder map ψ^(i) which is critical for the effect of the intermediate roundoff errors αᵢ or Eᵢ on the final result.
EXAMPLE 5. For the two algorithms for computing y = φ(a, b) = a² − b² given in Example 2 we have for Algorithm 1:

    x = x^(0) = (a, b)^T,    x^(1) = (a², b)^T,    x^(2) = (a², b²)^T,    x^(3) = y = a² − b²,
    ψ^(1)(u, v) = u − v²,    ψ^(2)(u, v) = u − v,
    Dφ(x) = (2a, −2b),    Dψ^(1)(x^(1)) = (1, −2b),    Dψ^(2)(x^(2)) = (1, −1).

Moreover

    α₁ = E₁·x^(1) = (ε₁·a², 0)^T,    E₁ = [ ε₁  0 ]
                                          [ 0   0 ],    |ε₁| ≤ eps,

because of

    x̃^(1) = fl(φ^(0)(x)) = (a·a·(1 + ε₁), b)^T,

and likewise

    α₂ = (0, ε₂·b²)^T,    α₃ = ε₃·(a² − b²),    |εᵢ| ≤ eps for i = 2, 3.

From (1.3.13) with Δx = (Δa, Δb)^T,

(1.3.14)    Δy ≐ 2a·Δa − 2b·Δb + ε₁·a² − ε₂·b² + ε₃·(a² − b²).

Analogously for Algorithm 2:

    x = x^(0) = (a, b)^T,    x^(1) = (a + b, a − b)^T,    x^(2) = y = a² − b²,
    ψ^(1)(u, v) = u·v,
    Dφ(x) = (2a, −2b),    Dψ^(1)(x^(1)) = (a − b, a + b),
    α₁ = (ε₁·(a + b), ε₂·(a − b))^T,    α₂ = ε₃·(a² − b²),
    E₁ = [ ε₁  0  ]
         [ 0   ε₂ ],    |εᵢ| ≤ eps,

and therefore (1.3.13) again yields

(1.3.15)    Δy ≐ 2a·Δa − 2b·Δb + (ε₁ + ε₂ + ε₃)·(a² − b²).
If one selects a different algorithm for calculating the same result φ(x) (in other words, a different decomposition of φ into elementary maps), then Dφ remains unchanged; the Jacobian matrices Dψ^(i), which measure the propagation of roundoff, will be different, however, and so will be the total effect of rounding,

(1.3.16)    Dψ^(1)(x^(1))·α₁ + ... + Dψ^(r)(x^(r))·α_r + α_{r+1}.

An algorithm is called numerically more trustworthy than another algorithm for calculating φ(x) if, for a given set of data x, the total effect of rounding, (1.3.16), is less for the first algorithm than for the second one.
EXAMPLE 6. The total effect of rounding using Algorithm 1 in Example 2 is, by (1.3.14),

(1.3.17)    |ε₁·a² − ε₂·b² + ε₃·(a² − b²)| ≤ (a² + b² + |a² − b²|)·eps,

and that of Algorithm 2, by (1.3.15),

(1.3.18)    |(ε₁ + ε₂ + ε₃)·(a² − b²)| ≤ 3·|a² − b²|·eps.

Algorithm 2 is numerically more trustworthy than Algorithm 1 whenever 1/3 ≤ |a/b|² ≤ 3; otherwise Algorithm 1 is more trustworthy. This follows from the equivalence of the two relations 1/3 ≤ |a/b|² ≤ 3 and 3|a² − b²| ≤ a² + b² + |a² − b²|.
For a := 0.3237, b := 0.3134 using four places (t = 4), we obtain the following results:

    Algorithm 1:    a ×* a = 0.1048,    b ×* b = 0.9822₁₀−1,
                    (a ×* a) −* (b ×* b) = 0.6580₁₀−2,
    Algorithm 2:    a +* b = 0.6371,    a −* b = 0.1030₁₀−1,
                    (a +* b) ×* (a −* b) = 0.6562₁₀−2,
    Exact result:   a² − b² = 0.656213₁₀−2.
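Example 6 can be replayed with Python's decimal module, which allows arithmetic with a prescribed number of significant digits; the following sketch (ours, not from the book) sets the precision to t = 4.

from decimal import Decimal, localcontext

# Illustrative sketch only: Algorithms 1 and 2 of Example 2 carried out with
# t = 4 significant decimal digits.
a, b = Decimal("0.3237"), Decimal("0.3134")

with localcontext() as ctx:
    ctx.prec = 4                       # every operation rounds to 4 digits
    alg1 = a * a - b * b               # Algorithm 1
    alg2 = (a + b) * (a - b)           # Algorithm 2

print("Algorithm 1:", alg1)            # 0.00658
print("Algorithm 2:", alg2)            # 0.006562
print("Exact:      ", a * a - b * b)   # 0.00656213 (default 28-digit precision)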
In the error propagation formula (1.3.13), the last term admits the following bound:¹

    |E_{r+1}·y| ≤ |y|·eps,

no matter what algorithm has been used for computing y = φ(x). Hence an error Δy of magnitude |y|·eps has to be expected for any algorithm. Note, moreover, when using mantissas of t places, that the rounding of the input data x = (x₁, ..., xₙ)^T will cause an input error Δ^(0)x with

    |Δ^(0)x| ≤ eps·|x|,

unless the input data are already machine numbers and therefore representable exactly. Since the latter cannot be counted on, any algorithm for computing y = φ(x) will have to be assumed to incur the error Dφ(x)·Δ^(0)x, so that altogether for every such algorithm an error of magnitude

(1.3.19)    Δ^(0)y := [|Dφ(x)|·|x| + |y|]·eps

must be expected. We call Δ^(0)y the inherent error of y. Since this error will have to be reckoned with in any case, it would be unreasonable to ask that the influence of intermediate roundoff errors on the final result be considerably smaller than Δ^(0)y. We therefore call roundoff errors αᵢ or Eᵢ harmless if their contribution in (1.3.13) towards the total error Δy is of at most the same order of magnitude as the inherent error Δ^(0)y from (1.3.19).

¹ The absolute values of vectors and matrices are to be understood componentwise, e.g., |y| = (|y₁|, ..., |yₘ|)^T.
If all roundoff errors of an algorithm are harmless, then the algorithm is said to be well behaved or numerically stable. This particular notion of numerical stability has been promoted by Bauer et al. (1965); Bauer also uses the term benign (1974). Finding numerically stable algorithms is a primary task of numerical analysis.

EXAMPLE 7. Both algorithms of Example 2 are numerically stable. Indeed, the inherent error Δ^(0)y is as follows:

    Δ^(0)y = [2a² + 2b² + |a² − b²|]·eps.

Comparing this with (1.3.17) and (1.3.18) even shows that the total error of each of the two algorithms cannot exceed Δ^(0)y.
Let us pause to review our usage of terms. Numerical trustworthiness,
which we will use as a comparative term, relates to the roundoff errors
associated with two or more algorithms for the same problem. Numerical
stability, which we will use as an absolute term, relates to the inherent error
and the corresponding harmlessness of the roundoff errors associated with a
single algorithm. Thus one algorithm may be numerically more trustworthy
than another, yet neither may be numerically stable. If both are numerically
stable, the numerically more trustworthy algorithm is to be preferred. We attach the qualifier "numerically" because of the widespread use of the term
"stable" without that qualifier in other contexts such as the terminology
of differential equations, economic models, and linear multistep iterations,
where it has different meanings. Further illustrations of the concept which
we have introduced above will be found in the next section.
A general technique for establishing the numerical stability of an algorithm, the so-called backward analysis, has been introduced by Wilkinson
(1960) for the purpose of examining algorithms in linear algebra. He tries
to show that the floating-point result ỹ = y + Δy of an algorithm for computing y = φ(x) may be written in the form ỹ = φ(x + Δx), that is, as the result of an exact calculation based on perturbed input data x + Δx. If Δx turns out to have the same order of magnitude as |Δ^(0)x| ≤ eps·|x|, then the algorithm is indeed numerically stable.
Bauer (1974) associates graphs with algorithms in order to illuminate
their error patterns. For instance, Algorithms 1 and 2 of Example 2 give
rise to the graphs in Figure 1. The nodes of these graphs correspond to
the intermediate results. Node i is linked to node j by a directed arc if the
intermediate result corresponding to node i is an operand of the elementary
operation which produces the result corresponding to node j. At each node
there arises a new relative roundoff error, which is written next to its node.
Amplification factors for the relative errors are similarly associated with,
and written next to, the arcs of the graph. Tracing through the graph of
Algorithm 1, for instance, one obtains the following error relations:

    ε_{η₁} ≐ 2·ε_a + ε₁,    ε_{η₂} ≐ 2·ε_b + ε₂,
    ε_y ≐ a²/(a² − b²) · ε_{η₁} − b²/(a² − b²) · ε_{η₂} + ε₃.
[Figure 1 shows the graphs for Algorithm 1 and Algorithm 2, with the roundoff errors written next to the nodes and the amplification factors written next to the arcs.]

Fig. 1. Graphs Representing Algorithms and Their Error Propagation
To find the factor by which to multiply the roundoff error at node i in
order to get its contribution to the error at node j, one multiplies all arc
factors for each directed path from i to j and adds these products. The
graph of Algorithm 2 thus indicates that the input error ε_a contributes

    ( a/(a + b) · 1 + a/(a − b) · 1 ) · ε_a

to the error ε_y.
1.4 Examples
EXAMPLE 1. This example follows up Example 4 of the previous section: given p > 0, q > 0, p ≫ q, determine the root

    y = −p + √(p² + q)

with smallest absolute value of the quadratic equation

    y² + 2py − q = 0.

Input data: p, q. Result: y = φ(p, q) = −p + √(p² + q).
The problem was seen to be well conditioned for p > 0, q > 0. It was also shown that the relative input errors ε_p, ε_q make the following contribution to the relative error of the result y = φ(p, q):

    ε_y ≐ −p/√(p² + q) · ε_p + (p + √(p² + q))/(2√(p² + q)) · ε_q.

Since

    |p/√(p² + q)| < 1,    |(p + √(p² + q))/(2√(p² + q))| < 1,

the inherent error Δ^(0)y satisfies

    |ε_y^(0)| = |Δ^(0)y / y| ≤ 3·eps.

We will now consider two algorithms for computing y = φ(p, q).
Algorithm 1:    s := p²,    t := s + q,    u := √t,    y := −p + u.

Obviously, p ≫ q causes cancellation when y := −p + u is evaluated, and it must therefore be expected that the roundoff error

    Δu := ε·√t = ε·√(p² + q),

generated during the floating-point calculation of the square root,

    fl(√t) = √t·(1 + ε),    |ε| ≤ eps,

will be greatly amplified. Indeed, the above error contributes the following term to the error of y:

    (1/y)·Δu = √(p² + q) / (−p + √(p² + q)) · ε = (1/q)·(p·√(p² + q) + p² + q)·ε = k·ε.
Since p, q > 0, the amplification factor k admits the following lower bound:

    k > 2p²/q > 0,

which is large, since p ≫ q by hypothesis. Therefore, the proposed algorithm is not numerically stable, because the influence of rounding √(p² + q) alone exceeds that of the inherent error ε_y^(0) by an order of magnitude.
Algorithm 2:    s := p²,
                t := s + q,
                u := √t,
                v := p + u,
                y := q/v.

This algorithm does not cause cancellation when calculating v := p + u. The
roundoff error Δu = ε·√(p² + q), which stems from rounding √(p² + q), will be
amplified according to the remainder map ψ(u):

    u → p + u → q/(p + u) =: ψ(u).

Thus it contributes the following term to the relative error of y:

    (1/y)·(∂ψ/∂u)·Δu = ( −q / (y·(p + u)²) )·Δu
                     = −q·√(p² + q)·ε / ( (−p + √(p² + q))(p + √(p² + q))² )
                     = −( √(p² + q) / (p + √(p² + q)) )·ε = k·ε.

The amplification factor k remains small; indeed, |k| < 1, and Algorithm 2 is
therefore numerically stable.
The following numerical results illustrate the difference between Algorithms
1 and 2. They were obtained using floating-point arithmetic of 40 binary mantissa
places - about 13 decimal places - as will be the case in subsequent numerical
examples.
p = 1000, q = 0.018 000 000 081

    Result y according to Algorithm 1:    0.900 030 136 108 · 10⁻⁵
    Result y according to Algorithm 2:    0.899 999 999 998 · 10⁻⁵
    Exact value of y:                     0.900 000 000 000 · 10⁻⁵
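To make the comparison easy to reproduce, here is a small Python sketch (an editorial illustration, not part of the original text) that evaluates both algorithms in IEEE double precision for the data p = 1000, q = 0.018000000081; the cancellation in Algorithm 1 is clearly visible.

    import math

    def algorithm1(p: float, q: float) -> float:
        # y := -p + sqrt(p^2 + q): cancellation for p >> q
        s = p * p
        t = s + q
        u = math.sqrt(t)
        return -p + u

    def algorithm2(p: float, q: float) -> float:
        # y := q / (p + sqrt(p^2 + q)): no cancellation
        s = p * p
        t = s + q
        u = math.sqrt(t)
        v = p + u
        return q / v

    p, q = 1000.0, 0.018000000081
    print(algorithm1(p, q))   # loses roughly half of the significant digits
    print(algorithm2(p, q))   # accurate to machine precision, y ~ 0.9e-5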
EXAMPLE 2. For given fixed x, the value of cos kx may be computed recursively
using for m = 1, 2, ..., k − 1 the formula

    cos(m + 1)x = 2 cos x cos mx − cos(m − 1)x.
In this case, a trigonometric-function evaluation has to be carried out only once,
to find c = cos x. Now let |x| ≠ 0 be a small number. The calculation of c causes
a small roundoff error:

    c̃ = (1 + ε) cos x,    |ε| ≤ eps.

How does this roundoff error affect the calculation of cos kx?
cos kx can be expressed in terms of c:  cos kx = cos(k arccos c) =: f(c). Since

    df/dc = k sin kx / sin x,

the error ε cos x of c causes, to first approximation, an absolute error

(1.4.1)    Δ cos kx ≐ (ε cos x / sin x)·k sin kx = ε·k cot x sin kx

in cos kx.
On the other hand, the inherent error Δ⁽⁰⁾c_k (1.3.19) of the result c_k := cos kx
is

    Δ⁽⁰⁾c_k = [ k|x sin kx| + |cos kx| ]·eps.

Comparing this with (1.4.1) shows that Δ cos kx may be considerably larger than
Δ⁽⁰⁾c_k for small |x|; hence the algorithm is not numerically stable.
EXAMPLE 3. For given x and a "large" positive integer k, the numbers cos kx and
sin kx are to be computed recursively using

    cos mx = cos x cos(m − 1)x − sin x sin(m − 1)x,
    sin mx = sin x cos(m − 1)x + cos x sin(m − 1)x,    m = 1, 2, ..., k.

How do small errors ε_c cos x, ε_s sin x in the calculation of cos x, sin x affect the
final results cos kx, sin kx? Abbreviating c_m := cos mx, s_m := sin mx, c := cos x,
s := sin x, and putting

    U := [ c  −s ]
         [ s   c ],

we have

    [ c_m ]  =  U [ c_{m−1} ]
    [ s_m ]       [ s_{m−1} ],    m = 1, ..., k.

Here U is a unitary matrix, which corresponds to a rotation by the angle x.
Repeated application of the formula above gives

    [ c_k ]  =  U^k [ c₀ ]  =  U^k [ 1 ]
    [ s_k ]         [ s₀ ]         [ 0 ].

Now

    ∂U/∂c = [ 1  0 ],        ∂U/∂s = [ 0  −1 ]  =:  A,
            [ 0  1 ]                 [ 1   0 ]

and therefore

    ∂U^k/∂c = k·U^{k−1},
    ∂U^k/∂s = A·U^{k−1} + U·A·U^{k−2} + ··· + U^{k−1}·A = k·A·U^{k−1},

because A commutes with U. Since U describes a rotation in ℝ² by the angle x,

    ∂U^k/∂c = k [ cos(k−1)x   −sin(k−1)x ]
                [ sin(k−1)x    cos(k−1)x ],

    ∂U^k/∂s = k [ −sin(k−1)x   −cos(k−1)x ]
                [  cos(k−1)x   −sin(k−1)x ].

The relative errors ε_c, ε_s of c̃ = cos x, s̃ = sin x effect the following absolute errors
of cos kx, sin kx:

(1.4.2)    [ Δc_k ]  ≐  ∂U^k/∂c [ 1 ]·ε_c cos x  +  ∂U^k/∂s [ 1 ]·ε_s sin x
           [ Δs_k ]             [ 0 ]                       [ 0 ]

                     =  ε_c·k cos x [ cos(k−1)x ]  +  ε_s·k sin x [ −sin(k−1)x ]
                                    [ sin(k−1)x ]                 [  cos(k−1)x ].

The inherent errors Δ⁽⁰⁾c_k and Δ⁽⁰⁾s_k of c_k = cos kx and s_k = sin kx, respectively, are given by

(1.4.3)    Δ⁽⁰⁾c_k = [ k|x sin kx| + |cos kx| ]·eps,
           Δ⁽⁰⁾s_k = [ k|x cos kx| + |sin kx| ]·eps.

Comparison of (1.4.2) and (1.4.3) reveals that for big k and |kx| ≈ 1 the influence
of the roundoff error ε_c is considerably bigger than the inherent errors, while
the roundoff error ε_s is harmless. The algorithm is not numerically stable, albeit
numerically more trustworthy than the algorithm of Example 2 as far as the
computation of c_k alone is concerned.
EXAMPLE 4. For small |x|, the recursive calculation of

    c_m = cos mx,    s_m = sin mx,    m = 1, 2, ... ,

based on

    cos(m + 1)x = cos x cos mx − sin x sin mx,
    sin(m + 1)x = sin x cos mx + cos x sin mx,
as in Example 3, may be further improved numerically. To this end, we express
the differences ds_{m+1} and dc_{m+1} of subsequent sine and cosine values as follows:

    dc_{m+1} := cos(m + 1)x − cos mx
             = 2(cos x − 1) cos mx − sin x sin mx − cos x cos mx + cos mx
             = −4 (sin² x/2) cos mx + [ cos mx − cos(m − 1)x ],

    ds_{m+1} := sin(m + 1)x − sin mx
             = 2(cos x − 1) sin mx + sin x cos mx − cos x sin mx + sin mx
             = −4 (sin² x/2) sin mx + [ sin mx − sin(m − 1)x ].

This leads to a more elaborate recursive algorithm for computing c_k, s_k in the
case x > 0:
    dc₁ := −2 sin²(x/2),
    ds₁ := √( −dc₁·(2 + dc₁) ),
    s₀ := 0,    c₀ := 1,
    t := 2·dc₁ = −4 sin²(x/2),

and for m := 1, 2, ..., k:

    c_m := c_{m−1} + dc_m,
    s_m := s_{m−1} + ds_m,
    dc_{m+1} := t·c_m + dc_m,
    ds_{m+1} := t·s_m + ds_m.

For the error analysis, note that c_k and s_k are functions of s̄ = sin(x/2):

    c_k = cos(2k arcsin s̄) =: φ₁(s̄),
    s_k = sin(2k arcsin s̄) =: φ₂(s̄).

An error Δs̄ = ε_s sin(x/2) in the calculation of s̄ therefore causes, to a first-order
approximation, the following errors in c_k and s_k:

    (∂φ₁/∂s̄)·ε_s sin(x/2) = ( −2k sin kx / cos(x/2) )·ε_s sin(x/2) = −2k tan(x/2) sin kx · ε_s,
    (∂φ₂/∂s̄)·ε_s sin(x/2) = (  2k cos kx / cos(x/2) )·ε_s sin(x/2) =  2k tan(x/2) cos kx · ε_s.

Comparison with the inherent errors (1.4.3) shows these errors to be harmless for
small |x|. The algorithm is then numerically stable, at least as far as the influence
of the roundoff error ε_s is concerned.
Again we illustrate our analytical considerations with some numerical results.
Let x = 0.001, k = 1000.
    Algorithm      Result for cos kx            Relative error
    Example 2      0.540 302 121 124            −0.34 · 10⁻⁶
    Example 3      0.540 302 305 776            −0.17 · 10⁻⁹
    Example 4      0.540 302 305 865            −0.58 · 10⁻¹¹
    Exact value    0.540 302 305 868 140 ...
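The three recursions are easy to compare numerically. The following Python sketch (an editorial illustration, not from the original) implements the recursions of Examples 2, 3, and 4 in double precision and prints cos kx for x = 0.001, k = 1000 next to the exact value.

    import math

    def example2(x: float, k: int) -> float:
        # cos(m+1)x = 2 cos x * cos mx - cos(m-1)x
        c = math.cos(x)
        cm_prev, cm = 1.0, c            # cos 0, cos x
        for _ in range(1, k):
            cm_prev, cm = cm, 2.0 * c * cm - cm_prev
        return cm

    def example3(x: float, k: int) -> float:
        # simultaneous recursion for cos mx and sin mx (rotation by the angle x)
        c, s = math.cos(x), math.sin(x)
        cm, sm = 1.0, 0.0
        for _ in range(k):
            cm, sm = c * cm - s * sm, s * cm + c * sm
        return cm

    def example4(x: float, k: int) -> float:
        # recursion for the differences of successive cosine and sine values
        dc = -2.0 * math.sin(x / 2.0) ** 2
        ds = math.sqrt(-dc * (2.0 + dc))
        t = 2.0 * dc
        cm, sm = 1.0, 0.0
        for _ in range(k):
            cm, sm = cm + dc, sm + ds
            dc, ds = t * cm + dc, t * sm + ds
        return cm

    x, k = 0.001, 1000
    for f in (example2, example3, example4):
        print(f.__name__, f(x, k))
    print("exact   ", math.cos(k * x))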
EXAMPLE 5. We will derive some results which will be useful for the analysis of
algorithms for solving linear equations in Section 4.5. Given the quantities c, a₁,
..., a_n, b₁, ..., b_{n−1} with a_n ≠ 0, we want to find the solution β_n of the linear
equation

(1.4.4)    a₁b₁ + ··· + a_{n−1}b_{n−1} + a_nβ_n = c.

Floating-point arithmetic yields the approximate solution

(1.4.5)    b_n = fl( (c − a₁b₁ − ··· − a_{n−1}b_{n−1}) / a_n )

as follows:

    s₀ := c;
    for j := 1, 2, ..., n − 1:
(1.4.6)    s_j := fl(s_{j−1} − a_j b_j) = ( s_{j−1} − a_j b_j (1 + μ_j) )·(1 + α_j),    j = 1, ..., n − 1,
           b_n := fl(s_{n−1}/a_n) = (1 + δ)·s_{n−1}/a_n,

with |μ_j|, |α_j|, |δ| ≤ eps. If a_n = 1, as is frequently the case in applications, then
δ = 0, since b_n := s_{n−1}.
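A minimal Python sketch of this elimination step (an editorial illustration; the variable names are ours) computes b_n and the residual c − Σ a_j b_j − a_n b_n, which the estimates below bound:

    def solve_last_unknown(c, a, b):
        """Solve a[0]*b[0] + ... + a[n-2]*b[n-2] + a[n-1]*beta_n = c for beta_n
        by the summation algorithm (1.4.6); a and b are sequences of floats,
        len(b) == len(a) - 1."""
        s = c
        for aj, bj in zip(a[:-1], b):
            s = s - aj * bj          # s_j := fl(s_{j-1} - a_j b_j)
        return s / a[-1]             # b_n := fl(s_{n-1} / a_n)

    a = [0.3, -1.7, 2.5, 1.0]        # a_n = 1, so the final division is exact
    b = [1.1, 0.9, -0.4]
    c = 2.0
    bn = solve_last_unknown(c, a, b)
    residual = c - sum(aj * bj for aj, bj in zip(a[:-1], b)) - a[-1] * bn
    print(bn, residual)              # the residual is of the order of eps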
We will now describe two useful estimates for the residual

    c − a₁b₁ − ··· − a_{n−1}b_{n−1} − a_nb_n.

From (1.4.6) follow the equations

    s_j − s_{j−1} + a_jb_j = s_j·α_j/(1 + α_j) − a_jb_j·μ_j,    j = 1, 2, ..., n − 1,
    a_nb_n − s_{n−1} = δ·s_{n−1}.

Summing these equations yields

    a_nb_n + Σ_{j=1}^{n−1} a_jb_j − c = Σ_{j=1}^{n−1} [ s_j·α_j/(1 + α_j) − a_jb_j·μ_j ] + δ·s_{n−1},

and thereby the first one of the promised estimates,

(1.4.7)    | c − Σ_{j=1}^{n−1} a_jb_j − a_nb_n | ≤ eps·[ σ·|s_{n−1}| + Σ_{j=1}^{n−1} ( |s_j|/(1 − eps) + |a_jb_j| ) ],

    σ := 0 if a_n = 1, 1 otherwise.
The second estimate is cruder than (1.4.7). Unwinding (1.4.6) yields (1.4.8), an
expression for a_nb_n in terms of c and the a_jb_j, which can be solved for c:

(1.4.9)    c = Σ_{j=1}^{n−1} a_jb_j (1 + μ_j) Π_{k=1}^{j−1} (1 + α_k)⁻¹ + a_nb_n (1 + δ)⁻¹ Π_{k=1}^{n−1} (1 + α_k)⁻¹.

A simple induction argument over m shows that

    (1 + σ) = Π_{k=1}^{m} (1 + σ_k)^{±1},    |σ_k| ≤ eps,    m·eps < 1,

implies

    |σ| ≤ m·eps / (1 − m·eps).
In view of (1.4.9) this ensures the existence of quantities ε_j with

(1.4.10)    c = Σ_{j=1}^{n−1} a_jb_j (1 + j·ε_j) + a_nb_n (1 + (n − 1 + σ')·ε_n),

(1.4.11)    |ε_j| ≤ eps / (1 − n·eps),    σ' := 0 if a_n = 1, 1 otherwise.
In particular, (1.4.8) reveals the numerical stability of our algorithm for computing β_n. The roundoff error α_m contributes the amount

    (s_m/a_n)·α_m ≐ ( (c − Σ_{i=1}^{m} a_ib_i) / a_n )·α_m

to the absolute error in β_n. This, however, is at most equal to

    ( |c| + Σ_{i=1}^{m} |a_ib_i| )·eps / |a_n|,

which represents no more than the influence of the input errors ε_c and ε_{a_i} of
c and a_i, i = 1, ..., m, respectively, provided |ε_c|, |ε_{a_i}| ≤ eps. The remaining
roundoff errors μ_k and δ are similarly shown to be harmless.
The numerical stability of the above algorithm is often shown by interpreting
(1.4.10) in the sense of backward analysis: The computed approximate solution
b_n is the exact solution of the equation

    ā₁b₁ + ··· + ā_{n−1}b_{n−1} + ā_nb_n = c,

whose coefficients

    ā_j := a_j(1 + j·ε_j),    1 ≤ j ≤ n − 1,
    ā_n := a_n(1 + (n − 1 + σ')·ε_n),

have changed only slightly from their original values a_j. This kind of analysis,
however, involves the difficulty of having to define how large n can be so that
errors of the form n·ε, |ε| ≤ eps, can still be considered as being of the same order
of magnitude as the machine precision eps.
1.5 Interval Arithmetic; Statistical Roundoff
Estimation
The effect of a few roundoff errors can be quite readily estimated, to a
first-order approximation, by the methods of Section 1.3. For a typical
numerical method, however, the number of the arithmetic operations, and
consequently the number of individual roundoff errors, is very large, and
the corresponding algorithm is too complicated to permit the estimation
of the total effect of all roundoff errors in this fashion.
A technique known as interval arithmetic offers an approach to determining exact upper bounds for the absolute error of an algorithm, taking
into account all roundoff and data errors. Interval arithmetic is based on
the realization that the exact values for all real numbers a E IR which either enter an algorithm or are computed as intermediate or final results are
usually not known. At best one knows small intervals which contain a. For
this reason, the interval-arithmetic approach is to calculate systematically
in terms of such intervals
    ā = [a′, a″],

bounded by machine numbers a′, a″ ∈ A, rather than in terms of single
real numbers a. Each unknown number a is represented by an interval ā =
[a′, a″] with a ∈ ā. The arithmetic operations ⊛ ∈ {⊕, ⊖, ⊗, ⊘} between
intervals are defined so as to be compatible with the above interpretation.
That is, c̄ := ā ⊛ b̄ is defined as an interval (as small as possible) satisfying

    c̄ ⊇ { a ∗ b | a ∈ ā and b ∈ b̄ }

and having machine-number endpoints.
In the case of addition, for instance, this holds if ⊕ is defined as follows:

    [c′, c″] := [a′, a″] ⊕ [b′, b″],

where

    c′ := max{ γ′ ∈ A | γ′ ≤ a′ + b′ },
    c″ := min{ γ″ ∈ A | γ″ ≥ a″ + b″ },

with A denoting again the set of machine numbers. In the case of multiplication ⊗, assuming, say, a′ > 0, b′ > 0,

    [c′, c″] := [a′, a″] ⊗ [b′, b″]

can be defined by letting

    c′ := max{ γ′ ∈ A | γ′ ≤ a′ × b′ },
    c″ := min{ γ″ ∈ A | γ″ ≥ a″ × b″ }.
Replacing, in these and similar fashions, every quantity by an interval and
every arithmetic operation by its corresponding interval operation - this is
readily implemented on computers - we obtain interval algorithms which
produce intervals guaranteed to contain the desired exact solutions. The
data for these interval algorithms will be again intervals, chosen to allow
for data errors.
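To make the definitions concrete, here is a small Python sketch (an editorial illustration, not part of the text) of interval addition, subtraction, and multiplication. True directed rounding of the endpoints is approximated here by widening each endpoint by one unit in the last place via Python 3.9's math.nextafter; the helper names are ours.

    import math

    def _down(x: float) -> float:
        # next machine number <= x (crude stand-in for rounding the lower endpoint down)
        return math.nextafter(x, -math.inf)

    def _up(x: float) -> float:
        # next machine number >= x
        return math.nextafter(x, math.inf)

    def iadd(a, b):
        return (_down(a[0] + b[0]), _up(a[1] + b[1]))

    def isub(a, b):
        return (_down(a[0] - b[1]), _up(a[1] - b[0]))

    def imul(a, b):
        products = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
        return (_down(min(products)), _up(max(products)))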
It has been found, however, that an uncritical utilization of interval
arithmetic techniques leads to error bounds which, while certainly reliable,
are in most cases much too pessimistic. It is not enough to simply substitute
interval operations for arithmetic operations without taking into account
how the particular roundoff or data enter into the respective results. For
example, it happens quite frequently that a certain roundoff error ε impairs
some intermediate results u₁, ..., u_n of an algorithm considerably,

    | ∂u_i/∂ε | ≫ 1    for i = 1, ..., n,

while the final result y = f(u₁, ..., u_n) is not strongly affected,

    | ∂y/∂ε | ≈ 1,
even though it is calculated from the highly inaccurate intermediate values
Ul, ... , Un: the algorithm shows error damping.
EXAMPLE 1. Evaluate y = φ(x) = x³ − 3x² + 3x = ((x − 3) × x + 3) × x using
Horner's scheme:

    u := x − 3,
    v := u × x,
    w := v + 3,
    y := w × x.

The value x is known to lie in the interval

    x ∈ x̄ := [0.9, 1.1].

Starting with this interval and using straight interval arithmetic, we find

    ū = x̄ ⊖ [3, 3] = [−2.1, −1.9],
    v̄ = ū ⊗ x̄ = [−2.31, −1.71],
    w̄ = v̄ ⊕ [3, 3] = [0.69, 1.29],
    ȳ = w̄ ⊗ x̄ = [0.621, 1.419].

The interval ȳ is much too large compared to the interval

    { φ(x) | x ∈ x̄ } = [0.999, 1.001],

which describes the actual effect of an error in x on φ(x).
EXAMPLE 2. Using just ordinary 2-digit arithmetic gives considerably more accurate results than the interval arithmetic suggests:

                 u       v       w       y
    x = 0.9    −2.1    −1.9     1.1     0.99
    x = 1.1    −1.9    −2.1     0.9     0.99
For the successful application of interval arithmetic, therefore, it is not
sufficient merely to replace the arithmetic operations of commonly used
algorithms by interval operations: It is necessary to develop new algorithms
producing the same final results but having an improved error-dependence
pattern for the intermediate results.
EXAMPLE 3. In Example 1 a simple transformation of φ(x) suffices:

    y = φ(x) = 1 + (x − 1)³.

When applied to the corresponding evaluation algorithm and the same starting
interval x̄ = [0.9, 1.1], interval arithmetic now produces the optimal result:

    ū₁ := x̄ ⊖ [1, 1] = [−0.1, 0.1],
    ū₂ := ū₁ ⊗ ū₁ = [−0.01, 0.01],
    ū₃ := ū₂ ⊗ ū₁ = [−0.001, 0.001],
    ȳ := ū₃ ⊕ [1, 1] = [0.999, 1.001].
As far as ordinary arithmetic is concerned, there is not much difference between
the two evaluation algorithms of Example 1 and Example 3. Using two digits
again, the results are practically identical to those in Example 2:
                 u₁      u₂       u₃       y
    x = 0.9    −0.1    0.01    −0.001    1.0
    x = 1.1     0.1    0.01     0.001    1.0
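The following Python sketch (an editorial addition, with naive interval operations and no directed rounding) reproduces the two interval evaluations of Examples 1 and 3 and shows the drastic difference in the width of the resulting intervals.

    def iadd(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def isub(a, b):
        return (a[0] - b[1], a[1] - b[0])

    def imul(a, b):
        p = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
        return (min(p), max(p))

    x = (0.9, 1.1)

    # Horner form ((x - 3)*x + 3)*x  (Example 1)
    u = isub(x, (3.0, 3.0))
    v = imul(u, x)
    w = iadd(v, (3.0, 3.0))
    y_horner = imul(w, x)

    # transformed form 1 + (x - 1)^3  (Example 3)
    u1 = isub(x, (1.0, 1.0))
    u2 = imul(u1, u1)
    u3 = imul(u2, u1)
    y_transformed = iadd(u3, (1.0, 1.0))

    print(y_horner)       # roughly [0.621, 1.419]
    print(y_transformed)  # roughly [0.999, 1.001]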
For an in-depth treatment of interval arithmetic the reader should consult, for instance, Moore (1966) or Kulisch (1969).
In order to obtain statistical roundoff estimates [Rademacher (1948)]' we
assume that the relative roundoff error [see (1.2.6)] which is caused by
an elementary operation is a random variable with values in the interval
[−eps, eps]. Furthermore we assume that the roundoff errors ε attributable
to different operations are independent random variables. By μ_ε we denote
the expected value and by σ_ε² the variance of the above roundoff distribution. They satisfy the general relationship

    σ_ε² = E(ε²) − (E(ε))² = E(ε²) − μ_ε².

Assuming a uniform distribution in the interval [−eps, eps], we get

(1.5.1)    μ_ε = E(ε) = 0,
           σ_ε² = E(ε²) = (1/(2·eps)) ∫_{−eps}^{eps} t² dt = eps²/3 =: ε̄².
Closer examinations show the roundoff distribution to be not quite uniform
[see Sterbenz (1974), Exercise 22, p. 122]. It should also be kept in mind
that the ideal roundoff pattern is only an approximation to the roundoff
patterns observed in actual computing machinery, so that the quantities μ_ε
and σ_ε² may have to be determined empirically.
The results x of algorithms subjected to random roundoff errors become
random variables themselves with expected values μ_x and variances σ_x²,
connected again by the basic relation

    σ_x² = E(x²) − (E(x))² = E(x²) − μ_x².
The propagation of previous roundoff effects through elementary operations
is described by the following formulas for arbitrary independent random
variables x, y and constants α, β ∈ ℝ:

(1.5.2)    μ_{αx±βy} = E(αx ± βy) = αE(x) ± βE(y) = αμ_x ± βμ_y,
           σ²_{αx±βy} = E((αx ± βy)²) − (E(αx ± βy))²
                      = α²E(x − E(x))² + β²E(y − E(y))² = α²σ_x² + β²σ_y².

The first of the above formulas follows by the linearity of the expected-value
operator. It holds for arbitrary random variables x, y. The second formula
is based on the relation E(xy) = E(x)E(y), which holds whenever x and
y are independent. Similarly, we obtain for independent x and y

(1.5.3)    μ_{x×y} = E(x × y) = E(x)E(y) = μ_x·μ_y,
           σ²_{x×y} = E[ x × y − E(x)E(y) ]² = E(x²)E(y²) − μ_x²μ_y²
                    = σ_x²σ_y² + μ_x²σ_y² + μ_y²σ_x².
EXAMPLE. For calculating y = a² − b² (see Example 2 in Section 1.3) we find,
under the assumptions (1.5.1), E(a) = a, σ_a² = 0, E(b) = b, σ_b² = 0 and, using
(1.5.2) and (1.5.3), that

    η₁ = a²(1 + ε₁),            E(η₁) = a²,    σ²_{η₁} = a⁴·ε̄²,
    η₂ = b²(1 + ε₂),            E(η₂) = b²,    σ²_{η₂} = b⁴·ε̄²,
    y = (η₁ − η₂)(1 + ε₃),      E(y) = E(η₁ − η₂)·E(1 + ε₃) = a² − b²

(η₁, η₂, ε₃ are assumed to be independent),

    σ_y² = σ²_{η₁−η₂}·σ²_{1+ε₃} + μ²_{η₁−η₂}·σ²_{1+ε₃} + μ²_{1+ε₃}·σ²_{η₁−η₂}
         = (σ²_{η₁} + σ²_{η₂})·ε̄² + (a² − b²)²·ε̄² + 1·(σ²_{η₁} + σ²_{η₂})
         = (a⁴ + b⁴)·ε̄⁴ + [ (a² − b²)² + a⁴ + b⁴ ]·ε̄².

Neglecting ε̄⁴ compared to ε̄² yields

    σ_y² ≈ ( (a² − b²)² + a⁴ + b⁴ )·ε̄².

For a := 0.3237, b := 0.3134, eps = 5 × 10⁻⁴ (see Example 5 in Section 1.3), we
find

    σ_y ≈ 0.144·ε̄ = 0.000 0415,

which is close in magnitude to the true error Δy = 0.000 017 87 for 4-digit arithmetic. Compare this with the error bound 0.000 104 78 furnished by (1.3.17).
We denote by M(x) the set of all quantities which, directly or indirectly,
have entered the calculation of the quantity x. If M(x) ∩ M(y) ≠ ∅ for the
algorithm in question, then the random variables x and y are in general
dependent.
The statistical roundoff error analysis of an algorithm becomes extremely complicated if dependent random variables are present. It becomes
quite easy, however, under the following simplifying assumptions:
(1.5.4)
(a) The operands of each arithmetic operation are independent random
variables.
(b) In calculating variances all terms of an order higher than the smallest
one are neglected.
(c) All variances are so small that for elementary operations ∗, in first-order
approximation, E(x ∗ y) ≈ E(x) ∗ E(y) = μ_x ∗ μ_y.

If in addition the expected values μ_x are replaced by the estimated values
x, and relative variances ε_x² := σ_x²/μ_x² ≈ σ_x²/x² are introduced, then from
(1.5.2) and (1.5.3) [compare (1.2.6), (1.3.5)]:

(1.5.5)    z = fl(x ± y):    ε_z² ≈ ( x²ε_x² + y²ε_y² )/(x ± y)² + ε̄²,
           z = fl(x × y):    ε_z² ≈ ε_x² + ε_y² + ε̄²,
           z = fl(x / y):    ε_z² ≈ ε_x² + ε_y² + ε̄².
It should be kept in mind, however, that these results are valid only if the
hypotheses (1.5.4), in particular (1.5.4a), are met.
It is possible to evaluate the above formulas in the course of a numerical
computation and thereby to obtain an estimate of the error of the final
results. As in the case of interval arithmetic, this leads to an arithmetic of
paired quantities (x, ε_x²) for which elementary operations are defined with
the help of the above or similar formulas. Error bounds for the final results
r̃ are then obtained from the relative variance ε_r², assuming that the final
error distribution is normal. This assumption is justified inasmuch as the
distributions of propagated errors alone tend to become normal if subjected
to many elementary operations. At each such operation the nonnormal
roundoff error distribution is superimposed on the distribution of previous
errors. However, after many operations, the propagated errors are large
compared to the newly created roundoff errors, so that the latter do not
appreciably affect the normality of the total error distribution. Assuming
the final error distribution to be normal, the actual relative error of the
final result r̃ is bounded with probability 0.9 by 2ε_r.
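The pairing of a value with a relative variance is easy to mimic. The following Python sketch (an editorial illustration with our own helper names) propagates (value, relative variance) pairs through multiplication and subtraction according to (1.5.5) and prints an approximate 90% error bound 2ε_r for a² − b², using the data of the example above.

    import math

    EPS = 5.0e-4                 # 4-digit decimal arithmetic, as in the example above
    VAR_ROUND = EPS ** 2 / 3.0   # variance of a single roundoff error, see (1.5.1)

    def mul(a, b):
        # (1.5.5): relative variances add for multiplication, plus one new roundoff
        (x, ex2), (y, ey2) = a, b
        return (x * y, ex2 + ey2 + VAR_ROUND)

    def sub(a, b):
        # relative variance of x - y from the absolute variances, plus one roundoff
        (x, ex2), (y, ey2) = a, b
        z = x - y
        ez2 = (x * x * ex2 + y * y * ey2) / (z * z) + VAR_ROUND
        return (z, ez2)

    a = (0.3237, 0.0)            # input data assumed exact
    b = (0.3134, 0.0)
    y = sub(mul(a, a), mul(b, b))
    print(y[0], 2.0 * math.sqrt(y[1]) * abs(y[0]))   # value and ~90% error bound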
EXERCISES FOR CHAPTER 1
1.  Show that with floating-point arithmetic of t decimal places

        rd(a) = a/(1 + ε)    with |ε| ≤ 5·10⁻ᵗ

    holds in analogy to (1.2.2). [In parallel with (1.2.6), as a consequence,
    fl(a ∗ b) = (a ∗ b)/(1 + ε) with |ε| ≤ 5·10⁻ᵗ for all arithmetic operations
    ∗ = +, −, ×, /.]
2.  Let a, b, c be fixed-point numbers with N decimal places after the decimal
    point, and suppose 0 < a, b, c < 1. The substitute product a ∗ b is defined as
    follows: Add 10⁻ᴺ/2 to the exact product a·b, and delete the (N + 1)-st and
    subsequent digits.
    (a) Give a bound for |(a ∗ b) ∗ c − abc|.
    (b) By how many units of the N-th place can (a ∗ b) ∗ c and a ∗ (b ∗ c) differ?
3.
    Evaluating Σ_{i=1}^{n} a_i in floating-point arithmetic may lead to an arbitrarily
    large relative error. If, however, all summands a_i are of the same sign, then
    this relative error is bounded. Derive a crude bound for this error, disregarding terms of higher order.
4.  Show how to evaluate the following expressions in a numerically stable fashion:

        1/(1 + 2x) − (1 − x)/(1 + x)       for |x| ≪ 1,

        √(x + 1/x) − √(x − 1/x)            for x ≫ 1,

        (1 − cos x)/x                      for x ≠ 0, |x| ≪ 1.

5.
Suppose a computer program is available which yields values for arcsin y in
    floating-point representation with t decimal mantissa places and for |y| ≤ 1,
    subject to a relative error ε with |ε| ≤ 5 × 10⁻ᵗ. In view of the relation

        arctan x = arcsin( x / √(1 + x²) ),

this program could also be used to evaluate arctan x. Determine for which
values x this procedure is numerically stable by estimating the relative error.
6.  For given z, the function tan z/2 can be computed according to the formula

        tan(z/2) = ± ( (1 − cos z)/(1 + cos z) )^{1/2}.

    Is this method of evaluation numerically stable for z ≈ 0, z ≈ π/2? If
    necessary, give numerically stable alternatives.

7.  The function

        f(φ, k_c) := 1/√( cos²φ + k_c² sin²φ )
    is to be evaluated for 0 ≤ φ ≤ π/2, 0 < k_c ≤ 1.
    The method

        k² := 1 − k_c²,    f(φ, k_c) := 1/√(1 − k² sin²φ)

    avoids the calculation of cos φ and is faster. Compare this with the direct
    evaluation of the original expression for f(φ, k_c) with respect to numerical
    stability.
8.  For the linear function f(x) := a + bx, where a ≠ 0, b ≠ 0, compute the first
    derivative D_h f(0) = f′(0) = b by the formula

        D_h f(0) := ( f(h) − f(−h) ) / (2h)

    in binary floating-point arithmetic. Suppose that a and b are binary machine
    numbers, and h a power of 2. Multiplication by h and division by 2h can
    therefore be carried out exactly. Give a bound for the relative error of D_h f(0).
    What is the behavior of this bound as h → 0?
9.  The square root ±(u + iv) of a complex number x + iy with y ≠ 0 may be
    calculated from the formulas

        u = ± √( (x + √(x² + y²)) / 2 ),    v = y/(2u).

    Compare the cases x ≥ 0 and x < 0 with respect to their numerical stability. Modify the formulas if necessary to ensure numerical stability.
10. The variance s² of a set of observations x₁, ..., x_n is to be determined. Which
    of the formulas

        s² = (1/(n − 1)) ( Σ_{i=1}^{n} x_i² − n·x̄² ),

        s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²,    with  x̄ := (1/n) Σ_{i=1}^{n} x_i,

    is numerically more trustworthy?
11. The coefficients a_r, b_r (r = 0, ..., n) are, for fixed x, connected recursively:

    (*)    b_n := a_n;  for r = n − 1, n − 2, ..., 0:  b_r := x·b_{r+1} + a_r.

    (a) Show that the polynomials

        A(z) := Σ_{r=0}^{n} a_r z^r,    B(z) := Σ_{r=1}^{n} b_r z^{r−1}

    satisfy

        A(z) = (z − x)·B(z) + b₀.

    (b) Suppose A(x) = b₀ is to be calculated by the recursion (*) for fixed x in
    floating-point arithmetic, the result being b̂₀. Show, using the formulas
    (compare Exercise 1)

        fl(u + v) = (u + v)/(1 + σ),    |σ| ≤ eps,
        fl(u · v) = (u · v)/(1 + π),    |π| ≤ eps,

    the inequality

        |A(x) − b̂₀| ≤ ( eps/(1 − eps) )·( 2e₀ − |b̂₀| ),

    where e₀ is defined by the following recursion:

        for r = n − 1, n − 2, ..., 0:    e_r := |x|·e_{r+1} + |b̂_r|.

    Hint: From

        ...    (r = n − 1, ..., 0),

    derive

        ...    (r = n − 1, ..., 0).
12. Assuming the earth to be spherical, two points on its surface can be expressed
    in Cartesian coordinates

        P_i = [x_i, y_i, z_i] = [ r cos α_i cos β_i, r sin α_i cos β_i, r sin β_i ],    i = 1, 2,

    where r is the earth's radius and α_i, β_i are the longitudes and latitudes
    of the two points P_i, respectively. If

        cos σ = P₁ᵀP₂ / r² = cos(α₁ − α₂) cos β₁ cos β₂ + sin β₁ sin β₂,

    then rσ is the great-circle distance between the two points.
    (a) Show that using the arccos function to determine σ from the above
        expression is not numerically stable.
    (b) Derive a numerically stable expression for σ.
References for Chapter 1
Ashenhurst, R. L., Metropolis, N. (1959): Unnormalized floating-point arithmetic.
Assoc. Comput. Mach. 6, 415-428.
Bauer, F. L. (1974): Computational graphs and rounding error. SIAM J. Numer.
Anal. 11, 87-96.
- - - , Heinhold, J., Samelson, K., Sauer, R. (1965): Moderne Rechenanlagen.
Stuttgart: Teubner.
Henrici, P. (1963): Error propagation for difference methods. New York: Wiley.
Knuth, D. E. (1969): The Art of Computer Programming. Vol. 2. Seminumerical
Algorithms. Reading, Mass.: Addison-Wesley.
Kulisch, U. (1969): Grundzüge der Intervallrechnung. In: D. Laugwitz: Überblicke
Mathematik 2, 51-98. Mannheim: Bibliographisches Institut.
Moore, R. E. (1966): Interval analysis. Englewood Cliffs, N.J.: Prentice-Hall.
Neumann, J. von, Goldstine, H. H. (1947): Numerical inverting of matrices. Bull.
Amer. Math. Soc. 53, 1021-1099.
Rademacher, H. A. (1948): On the accumulation of errors in processes of integration on high-speed calculating machines. Proceedings of a symposium on
large-scale digital calculating machinery. Annals Comput. Labor. Harvard Univ.
16, 176-185.
Scarborough, J. B. (1950): Numerical Mathematical Analysis, 2nd edition. Baltimore: Johns Hopkins Press.
Sterbenz, P. H. (1974): Floating Point Computation. Englewood Cliffs, N.J.:
Prentice-Hall.
Wilkinson, J.H. (1960): Error analysis of floating-point computation. Numer.
Math. 2, 219-340.
- - - (1963): Rounding Errors in Algebraic Processes. New York: Wiley.
- - - (1965): The Algebraic Eigenvalue Problem. Oxford: Clarendon Press.
2 Interpolation
Consider a family of functions of a single variable x,

    Φ(x; a₀, ..., a_n),

having n + 1 parameters a₀, ..., a_n, whose values characterize the individual functions in this family. The interpolation problem for Φ consists
of determining these parameters a_i so that for n + 1 given real or complex
pairs of numbers (x_i, f_i), i = 0, ..., n, with x_i ≠ x_k for i ≠ k,

    Φ(x_i; a₀, ..., a_n) = f_i,    i = 0, ..., n,

holds. We will call the pairs (x_i, f_i) support points, the locations x_i support
abscissas, and the values f_i support ordinates. Occasionally, the values of
derivatives of Φ are also prescribed.
The above is a linear interpolation problem if Φ depends linearly on the
parameters a_i:

    Φ(x; a₀, ..., a_n) ≡ a₀Φ₀(x) + a₁Φ₁(x) + ··· + a_nΦ_n(x).

This class of problems includes the classical one of polynomial interpolation
[Section 2.1],

    Φ(x; a₀, ..., a_n) ≡ a₀ + a₁x + ··· + a_nxⁿ,

as well as trigonometric interpolation [Section 2.3],

    Φ(x; a₀, ..., a_n) ≡ a₀ + a₁e^{xi} + a₂e^{2xi} + ··· + a_ne^{nxi}    (i² = −1).
In the past, polynomial interpolation was frequently used to interpolate
function values gathered from tables. The availability of modern computing machinery has almost eliminated the need for extensive table lookups.
However, polynomial interpolation is also important as the basis of several
types of numerical integration formulas in general use. In a more modern development, polynomial and rational interpolation (see below) are employed
in the construction of "extrapolation methods" for integration, differential
equations, and related problems [see for instance Sections 3.3 and 3.4].
Trigonometric interpolation is used extensively for the numerical Fourier
analysis of time series and cyclic phenomena in general. In this context, the
so-called "fast Fourier transforms" are particularly important and successful [Section 2.3.2].
The class of linear interpolation problems also contains spline interpolation [Section 2.4]. In the special case of cubic splines, the functions tJ>
are assumed to be twice continuously differentiable for x E [xo, xn] and to
coincide with some cubic polynomial on every subinterval [Xi,.Ti+l] of a
given partition Xo < Xl < ... < X n .
Spline interpolation is a fairly new development of growing importance.
It provides a valuable tool for representing empirical curves and for approximating complicated mathematical functions. It is increasingly used when
dealing with ordinary or partial differential equations.
Two nonlinear interpolation schemes are of importance: rational interpolation,

    Φ(x; a₀, ..., a_μ, b₀, ..., b_ν) ≡ (a₀ + a₁x + ··· + a_μx^μ) / (b₀ + b₁x + ··· + b_νx^ν),

and exponential interpolation,

    Φ(x; a₀, ..., a_n, λ₀, ..., λ_n) ≡ a₀e^{λ₀x} + a₁e^{λ₁x} + ··· + a_ne^{λ_nx}.
Rational interpolation (Section 2.2) plays a role in the process of best
approximating a given function by one which is readily evaluated on a
digital computer. Exponential interpolation is used, for instance, in the
analysis of radioactive decay.
Interpolation is a basic tool for the approximation of given functions.
For a comprehensive discussion of these and related topics consult Davis
(1965).
2.1 Interpolation by Polynomials
2.1.1 Theoretical Foundation: The Interpolation Formula of
Lagrange
In what follows, we denote by Π_n the set of all real or complex polynomials
P whose degrees do not exceed n:

    P(x) = a₀ + a₁x + ··· + a_nxⁿ.
(2.1.1.1) Theorem. For n + 1 arbitrary support points

    (x_i, f_i),    i = 0, ..., n,    x_i ≠ x_k for i ≠ k,

there exists a unique polynomial P ∈ Π_n with

    P(x_i) = f_i,    i = 0, 1, ..., n.

PROOF.
Uniqueness: For any two polynomials P₁, P₂ ∈ Π_n with

    P₁(x_i) = P₂(x_i) = f_i,    i = 0, 1, ..., n,

the polynomial P := P₁ − P₂ ∈ Π_n has degree at most n, and it has at
least n + 1 different zeros, namely x_i, i = 0, ..., n. P must therefore vanish
identically, and P₁ = P₂.

Existence: We will construct the interpolating polynomial P explicitly
with the help of polynomials L_i ∈ Π_n, i = 0, ..., n, for which

(2.1.1.2)    L_i(x_k) = δ_ik := { 1 if i = k, 0 if i ≠ k }.

The following Lagrange polynomials satisfy the above conditions:

(2.1.1.3)    L_i(x) :≡ ( (x − x₀) ··· (x − x_{i−1})(x − x_{i+1}) ··· (x − x_n) ) /
                       ( (x_i − x₀) ··· (x_i − x_{i−1})(x_i − x_{i+1}) ··· (x_i − x_n) )
                     = ω(x) / ( (x − x_i)·ω′(x_i) ),    with ω(x) := Π_{i=0}^{n} (x − x_i).
Note that our proof so far shows that the Lagrange polynomials are
uniquely determined by (2.1.1.2).
The solution P of the interpolation problem can now be expressed directly in terms of the polynomials L i , leading to the Lagrange interpolation
formula:

(2.1.1.4)    P(x) ≡ Σ_{i=0}^{n} f_i L_i(x) ≡ Σ_{i=0}^{n} f_i Π_{k=0, k≠i}^{n} (x − x_k)/(x_i − x_k).    □
The above interpolation formula shows that the coefficients of P depend linearly on the support ordinates fi. While theoretically important,
Lagrange's formula is, in general, not as suitable for actual calculations
as some other methods to be described below, particularly for large numbers n of support points. Lagrange's formula may, however, be useful in
some situations in which many interpolation problems are to be solved for
the same support abscissae Xi, i = 0, ... ,n, but different sets of support
ordinates 1;, i = 0, ... , n.
EXAMPLE. Given for n = 2:

    x_i | 0 | 1 | 3
    f_i | 1 | 3 | 2

Wanted: P(2), where P ∈ Π₂, P(x_i) = f_i for i = 0, 1, 2.
Solution:

    L₀(x) ≡ (x − 1)(x − 3) / ((0 − 1)(0 − 3)),
    L₁(x) = (x − 0)(x − 3) / ((1 − 0)(1 − 3)),
    L₂(x) = (x − 0)(x − 1) / ((3 − 0)(3 − 1)),

    P(2) = 1·L₀(2) + 3·L₁(2) + 2·L₂(2) = 1·(−1/3) + 3·1 + 2·(1/3) = 10/3.
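The Lagrange formula (2.1.1.4) translates directly into code. The following Python sketch (an editorial illustration, not part of the text) evaluates the interpolating polynomial at a point and reproduces P(2) = 10/3 for the support points above.

    def lagrange_eval(xs, fs, x):
        """Evaluate the interpolating polynomial through (xs[i], fs[i]) at x,
        using the Lagrange formula (2.1.1.4)."""
        n = len(xs)
        p = 0.0
        for i in range(n):
            li = 1.0
            for k in range(n):
                if k != i:
                    li *= (x - xs[k]) / (xs[i] - xs[k])
            p += fs[i] * li
        return p

    print(lagrange_eval([0.0, 1.0, 3.0], [1.0, 3.0, 2.0], 2.0))   # 10/3 = 3.333...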
2.1.2 Neville's Algorithm
Instead of solving the interpolation problem all at once, one might consider
solving the problem for smaller sets of support points first and then updating these solutions to obtain the solution to the full interpolation problem.
This idea will be explored in the following two sections.
For a given set of support points (x_i, f_i), i = 0, 1, ..., n, we denote by

    P_{i₀i₁...i_k} ∈ Π_k

that polynomial in Π_k for which

    P_{i₀i₁...i_k}(x_{i_j}) = f_{i_j},    j = 0, 1, ..., k.

These polynomials are linked by the following recursion:

(2.1.2.1a)    P_i(x) ≡ f_i,

(2.1.2.1b)    P_{i₀i₁...i_k}(x) ≡ ( (x − x_{i₀})·P_{i₁i₂...i_k}(x) − (x − x_{i_k})·P_{i₀i₁...i_{k−1}}(x) ) / (x_{i_k} − x_{i₀}).
PROOF. (2.1.2.1a) is trivial. To prove (2.1.2.1b), we denote its right-hand
side by R(x), and go on to show that R has the characteristic properties of
P_{i₀i₁...i_k}. The degree of R is clearly not greater than k. By the definitions
of P_{i₁...i_k} and P_{i₀...i_{k−1}},

    R(x_{i₀}) = P_{i₀...i_{k−1}}(x_{i₀}) = f_{i₀},
    R(x_{i_k}) = P_{i₁...i_k}(x_{i_k}) = f_{i_k},

and

    R(x_{i_j}) = f_{i_j}

for j = 1, 2, ..., k − 1. Thus R = P_{i₀i₁...i_k}, in view of the uniqueness of
polynomial interpolation [Theorem (2.1.1.1)].    □
Neville's algorithm aims at determining the value of the interpolating
polynomial P for a single value of x. It is less suited for determining the
interpolating polynomial itself. Algorithms that are more efficient for the
latter task, and also more efficient if values of P are sought for several
arguments x simultaneously, will be described in Section 2.1.3.
Based on the recursion (2.1.2.1), Neville's algorithm constructs a symmetric tableau of the values of some of the partially interpolating polynomials P_{i₀i₁...i_k} for fixed x:

(2.1.2.2)
                  k = 0            1            2              3
    x₀    f₀ = P₀(x)
                            P₀₁(x)
    x₁    f₁ = P₁(x)                      P₀₁₂(x)
                            P₁₂(x)                      P₀₁₂₃(x)
    x₂    f₂ = P₂(x)                      P₁₂₃(x)
                            P₂₃(x)
    x₃    f₃ = P₃(x)
The first column of the tableau contains the prescribed support ordinates
f_i. Subsequent columns are filled by calculating each entry recursively from
its two "neighbors" in the previous column according to (2.1.2.1b). The
entry P₁₂₃(x), for instance, is given by

    P₁₂₃(x) = ( (x − x₁)·P₂₃(x) − (x − x₃)·P₁₂(x) ) / (x₃ − x₁).
EXAMPLE. Determine P₀₁₂(2) for the same support points as in Section 2.1.1.

                    k = 0               1               2
    x₀ = 0    f₀ = P₀(2) = 1
                                  P₀₁(2) = 5
    x₁ = 1    f₁ = P₁(2) = 3                     P₀₁₂(2) = 10/3
                                  P₁₂(2) = 5/2
    x₂ = 3    f₂ = P₂(2) = 2

    P₀₁(2) = ( (2 − 0)·3 − (2 − 1)·1 ) / (1 − 0) = 5,
    P₁₂(2) = ( (2 − 1)·2 − (2 − 3)·3 ) / (3 − 1) = 5/2,
    P₀₁₂(2) = ( (2 − 0)·5/2 − (2 − 3)·5 ) / (3 − 0) = 10/3.
We will now discuss slight variants of Neville's algorithm, employing a
frequently used abbreviation,

(2.1.2.3)    T_{i,k} := P_{i−k, i−k+1, ..., i}.
The tableau (2.1.2.2) becomes

(2.1.2.4)
    x₀    f₀ = T₀₀
                        T₁₁
    x₁    f₁ = T₁₀                T₂₂
                        T₂₁                 T₃₃
    x₂    f₂ = T₂₀                T₃₂
                        T₃₁
    x₃    f₃ = T₃₀

The arrows indicate how the additional upward diagonal T_{i,0}, T_{i,1}, ..., T_{i,i}
can be constructed if one more support point x_i, f_i is added.
The recursion (2.1.2.1) may be modified for more efficient evaluation:

(2.1.2.5a)    T_{i,0} := f_i,    i ≥ 0,

(2.1.2.5b)    T_{i,k} := ( (x − x_{i−k})·T_{i,k−1} − (x − x_i)·T_{i−1,k−1} ) / (x_i − x_{i−k})
                       = T_{i,k−1} + ( T_{i,k−1} − T_{i−1,k−1} ) / ( (x − x_{i−k})/(x − x_i) − 1 ),    1 ≤ k ≤ i.
The following ALGOL algorithm is based on this modified recursion:
for i := 0 step 1 until n do
begin t[i] := f[i];
for j := i - 1 step -1 until 0 do
t[j] := t[j + 1] + (t[j + 1] - t[j]) x (x - x[i])/(x[i]- x[j])
end;
After the inner loop has terminated, t[j] = T_{i,i−j}, 0 ≤ j ≤ i. The
desired value T_{nn} = P₀₁...n(x) of the interpolating polynomial can be
found in t[0].
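For comparison, here is an equivalent Python sketch of the same loop (an editorial addition, not part of the text):

    def neville(xs, fs, x):
        """Return P_{0,1,...,n}(x) by Neville's algorithm, overwriting a single
        array t exactly as in the ALGOL fragment above."""
        t = []
        for i in range(len(xs)):
            t.append(fs[i])
            for j in range(i - 1, -1, -1):
                t[j] = t[j + 1] + (t[j + 1] - t[j]) * (x - xs[i]) / (xs[i] - xs[j])
        return t[0]

    print(neville([0.0, 1.0, 3.0], [1.0, 3.0, 2.0], 2.0))   # 10/3, as in the example above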
Still another modification of Neville's algorithm serves to improve somewhat the accuracy of the interpolated polynomial value. For i = 0, 1, ... ,n,
let the quantities Qik, Dik be defined by
    Q_{i0} := D_{i0} := f_i,
    Q_{ik} := T_{ik} − T_{i,k−1},
    D_{ik} := T_{ik} − T_{i−1,k−1},    1 ≤ k ≤ i.
The recursion (2.1.2.5) then translates into

(2.1.2.6)    Q_{ik} := ( D_{i,k−1} − Q_{i−1,k−1} )·(x_i − x)/(x_{i−k} − x_i),
             D_{ik} := ( D_{i,k−1} − Q_{i−1,k−1} )·(x_{i−k} − x)/(x_{i−k} − x_i),
             1 ≤ k ≤ i,  i = 0, 1, ... .

Starting with Q_{i0} := D_{i0} := f_i, one calculates Q_{ik}, D_{ik} from the above
recursion. Finally

    T_{nn} := f_n + Σ_{k=1}^{n} Q_{nk}.
If the values fo, ... ,fn are close to each other, the quantities Qik will be
small compared to fi. This suggests forming the sum of the "corrections"
Qnl,"" Qnn first [contrary to (2.1.2.5)] and then adding it to fn, thereby
avoiding unnecessary roundoff errors.
Note finally that for x = 0 the recursion (2.1.2.5) takes a particularly
simple form,

(2.1.2.7a)    T_{i0} := f_i,

(2.1.2.7b)    T_{ik} := T_{i,k−1} + ( T_{i,k−1} − T_{i−1,k−1} ) / ( x_{i−k}/x_i − 1 ),

as does its analog (2.1.2.6). These forms are encountered when applying
extrapolation methods.
For historical reasons mainly, we mention Aitken's algorithm. It is also
based on (2.1.2.1), but uses different intermediate polynomials. Its tableau
is of the form

    x₀    f₀ = P₀(x)
    x₁    f₁ = P₁(x)    P₀₁(x)
    x₂    f₂ = P₂(x)    P₀₂(x)    P₀₁₂(x)
    x₃    f₃ = P₃(x)    P₀₃(x)    P₀₁₃(x)    P₀₁₂₃(x)

The first column again contains the prescribed values f_i. Each subsequent
entry derives from the previous entry in the same row and the top entry in
the previous column according to (2.1.2.1b).
2.1.3 Newton's Interpolation Formula: Divided Differences
Neville's algorithm is geared towards determining interpolating values
rather than polynomials. If the interpolating polynomial itself is needed,
or if one wants to find interpolating values for several arguments ξ_j simultaneously, then Newton's interpolation formula is to be preferred. Here
we write the interpolating polynomial P ∈ Π_n, P(x_i) = f_i, i = 0, 1, ..., n,
in the form

(2.1.3.1)    P(x) ≡ P₀₁...n(x)
                  = a₀ + a₁(x − x₀) + a₂(x − x₀)(x − x₁) + ···
                    + a_n(x − x₀) ··· (x − x_{n−1}).
Note that the evaluation of (2.1.3.1) for x = ξ may be done recursively
as indicated by the following expression:

    P(ξ) = ( ··· ( a_n(ξ − x_{n−1}) + a_{n−1} )(ξ − x_{n−2}) + ··· + a₁ )(ξ − x₀) + a₀.

This requires fewer operations than evaluating (2.1.3.1) term by term. It
corresponds to the so-called Horner scheme for evaluating polynomials
which are given in the usual form, i.e. in terms of powers of x, and it
shows that the representation (2.1.3.1) is well suited for evaluation.
It remains to determine the coefficients a_i in (2.1.3.1). In principle, they
can be calculated successively from

    f₀ = P(x₀) = a₀,
    f₁ = P(x₁) = a₀ + a₁(x₁ − x₀),
    f₂ = P(x₂) = a₀ + a₁(x₂ − x₀) + a₂(x₂ − x₀)(x₂ − x₁),
    ⋮

This can be done with n divisions and n(n − 1) multiplications. There is,
however, a better way, which requires only n(n + 1)/2 divisions and which
produces useful intermediate results.
Observe that the two polynomials P_{i₀i₁...i_{k−1}}(x) and P_{i₀i₁...i_k}(x) differ
by a polynomial of degree k with k zeros x_{i₀}, x_{i₁}, ..., x_{i_{k−1}}, since both
polynomials interpolate the corresponding support points. Therefore there
exists a unique coefficient

(2.1.3.2)    f_{i₀i₁...i_k}

such that

(2.1.3.3)    P_{i₀i₁...i_k}(x) = P_{i₀i₁...i_{k−1}}(x) + f_{i₀i₁...i_k}(x − x_{i₀})(x − x_{i₁}) ··· (x − x_{i_{k−1}}).

From this and from the identity P_{i₀} ≡ f_{i₀}, it follows immediately that

(2.1.3.4)    P_{i₀i₁...i_k}(x) = f_{i₀} + f_{i₀i₁}(x − x_{i₀}) + ···
                               + f_{i₀i₁...i_k}(x − x_{i₀})(x − x_{i₁}) ··· (x − x_{i_{k−1}})
is a Newton representation of the interpolating polynomial P_{i₀,...,i_k}(x). The
coefficients (2.1.3.2) are called k-th divided differences.
The recursion (2.1.2.1) for the partially interpolating polynomials
translates into the recursion

(2.1.3.5)    f_{i₀i₁...i_k} = ( f_{i₁...i_k} − f_{i₀...i_{k−1}} ) / (x_{i_k} − x_{i₀})

for the divided differences, since by (2.1.3.3), f_{i₁...i_k} and f_{i₀...i_{k−1}} are the
coefficients of the highest terms of the polynomials P_{i₁i₂...i_k} and P_{i₀i₁...i_{k−1}},
respectively. The above recursion starts for k = 0 with the given support
ordinates f_i, i = 0, ..., n. It can be used in various ways for calculating divided differences f_{i₀}, f_{i₀i₁}, ..., f_{i₀i₁...i_n}, which then characterize the desired
interpolating polynomial P = P_{i₀i₁...i_n}.
Because the polynomial P_{i₀i₁...i_k} is uniquely determined by the support
points it interpolates [Theorem (2.1.1.1)], the polynomial is invariant to any
permutation of the indices i₀, i₁, ..., i_k, and so is its coefficient f_{i₀i₁...i_k} of
x^k. Thus:

(2.1.3.6). The divided differences f_{i₀i₁...i_k} are invariant to permutations of
the indices i₀, i₁, ..., i_k: If

    (j₀, j₁, ..., j_k)

is a permutation of the indices i₀, i₁, ..., i_k, then

    f_{j₀j₁...j_k} = f_{i₀i₁...i_k}.
If we choose to calculate the divided differences in analogy to Neville's
method - instead of, say, Aitken's method - then we are led to the
following tableau, called the divided-difference scheme:
(2.1.3.7)
                k = 0       1         2
    x₀    f₀
                          f₀₁
    x₁    f₁                        f₀₁₂
                          f₁₂
    x₂    f₂
The entries in the second column are of the form

    f₀₁ = (f₁ − f₀)/(x₁ − x₀),    ...,

those in the third column,

    f₀₁₂ = (f₁₂ − f₀₁)/(x₂ − x₀),    ... .
Clearly,

    P(x) ≡ P₀₁...n(x)
         ≡ f₀ + f₀₁(x − x₀) + ··· + f₀₁...n(x − x₀)(x − x₁) ··· (x − x_{n−1})

is the desired solution to the interpolation problem at hand. The coefficients
of the above expansion are found in the top descending diagonal of the
divided-difference scheme (2.1.3.7).

EXAMPLE. With the numbers of the example in Sections 2.1.1 and 2.1.2, we have:

    x₀ = 0    f₀ = 1
                          f₀₁ = 2
    x₁ = 1    f₁ = 3                   f₀₁₂ = −5/6
                          f₁₂ = −1/2
    x₂ = 3    f₂ = 2

    P₀₁₂(x) = 1 + 2·(x − 0) − (5/6)(x − 0)(x − 1),
    P₀₁₂(2) = ( −(5/6)·(2 − 1) + 2 )(2 − 0) + 1 = 10/3.
Instead of building the divided-difference scheme column by column,
one might want to start with the upper left corner and add successive
ascending diagonal rows. This amounts to adding new support points one
at a time after having interpolated the previous ones. In the following
ALGOL procedure, the entries in an ascending diagonal of (2.1.3.7) are
found, after each increase of i, in the top portion of array t, and the first i
coefficients fOl ... i are found in array a.
for i := 0 step 1 until n do
begin t[i] := f[i];
for j := i - 1 step -1 until 0 do
t[j] := (t[j + 1] - t[j])/(x[i] - x[j]);
a[i] := t[0]
end;
Afterwards, the interpolating polynomial (2.1.3.1) may be evaluated for
any desired argument z:

p := a[n];
for i := n - 1 step -1 until 0 do
p := p x (z - x[i]) + a[i];
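The same two loops in Python (an editorial sketch, not part of the text): the first computes the Newton coefficients a_i = f_{01...i}, the second evaluates (2.1.3.1) by the Horner-like recursion.

    def newton_coefficients(xs, fs):
        """Return the divided differences f_0, f_01, ..., f_01...n."""
        a, t = [], []
        for i in range(len(xs)):
            t.append(fs[i])
            for j in range(i - 1, -1, -1):
                t[j] = (t[j + 1] - t[j]) / (xs[i] - xs[j])
            a.append(t[0])
        return a

    def newton_eval(xs, a, z):
        """Evaluate the Newton form (2.1.3.1) at z."""
        p = a[-1]
        for i in range(len(a) - 2, -1, -1):
            p = p * (z - xs[i]) + a[i]
        return p

    xs, fs = [0.0, 1.0, 3.0], [1.0, 3.0, 2.0]
    a = newton_coefficients(xs, fs)
    print(a)                        # [1.0, 2.0, -0.8333...]
    print(newton_eval(xs, a, 2.0))  # 10/3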
Some Newton representations of the same polynomial are numerically
more trustworthy to evaluate than others. Choosing the permutation so that

    |ξ − x_{i₀}| ≤ |ξ − x_{i₁}| ≤ ··· ≤ |ξ − x_{i_n}|

dampens the error (see Section 1.3) during the Horner evaluation of

(2.1.3.8)    P(ξ) ≡ P_{i₀...i_n}(ξ) ≡ f_{i₀} + f_{i₀i₁}(ξ − x_{i₀}) + ··· + f_{i₀i₁...i_n}(ξ − x_{i₀}) ··· (ξ − x_{i_{n−1}}).
All Newton representations of the above kind can be found in the single divided-difference scheme which arises if the support arguments x_i,
i = 0, ..., n, are ordered by size: x_i < x_{i+1} for i = 0, ..., n − 1. Then the
preferred sequence of indices i₀, i₁, ..., i_k is such that each index i_k is
"adjacent" to some previous index. More precisely, either i_k = min{ i_l |
0 ≤ l < k } − 1 or i_k = max{ i_l | 0 ≤ l < k } + 1. Therefore the coefficients
of (2.1.3.8) are found along a zigzag path - instead of the upper descending diagonal - of the divided-difference scheme. Starting with f_{i₀}, the path
proceeds to the upper right neighbor if i_k < i_{k−1}, or to the lower right
neighbor if i_k > i_{k−1}.

EXAMPLE. In the previous example, a preferred sequence for ξ = 2 is

    i₀ = 1,  i₁ = 2,  i₂ = 0.
The corresponding path in the divided-difference scheme is indicated below:

    x₀ = 0    f₀ = 1
                          f₀₁ = 2
    x₁ = 1    f₁ = 3                   f₀₁₂ = −5/6
                          f₁₂ = −1/2
    x₂ = 3    f₂ = 2

The desired Newton representation is:

    P₁₂₀(x) = 3 − (1/2)(x − 1) − (5/6)(x − 1)(x − 3),
    P₁₂₀(2) = ( −(5/6)(2 − 3) − 1/2 )(2 − 1) + 3 = 10/3.
Frequently, the support ordinates f_i are values f(x_i) = f_i of a given
function f(x), which one wants to approximate by interpolation. In this
case, the divided differences may be considered as multivariate functions
of the support arguments x_i, and are historically written as

    f[x_{i₀}, x_{i₁}, ..., x_{i_k}] := f_{i₀i₁...i_k}.

These functions satisfy (2.1.3.5). For instance,

    f[x₀] = f(x₀),

    f[x₀, x₁] = ( f[x₁] − f[x₀] )/(x₁ − x₀) = ( f(x₁) − f(x₀) )/(x₁ − x₀),

    f[x₀, x₁, x₂] = ( f[x₁, x₂] − f[x₀, x₁] )/(x₂ − x₀)
                  = ( f(x₀)(x₁ − x₂) + f(x₁)(x₂ − x₀) + f(x₂)(x₀ − x₁) ) / ( (x₁ − x₀)(x₂ − x₁)(x₀ − x₂) ).
Also, (2.1.3.6) gives immediately:

(2.1.3.9) Theorem. The divided differences f[x_{i₀}, ..., x_{i_k}] are symmetric
functions of their arguments, i.e., they are invariant to permutations of the
arguments.

If the function f(x) is itself a polynomial, then we have the

(2.1.3.10) Theorem. If f(x) is a polynomial of degree N, then

    f[x₀, ..., x_k] = 0    for k > N.

PROOF. Because of the unique solvability of the interpolation problem
[Theorem (2.1.1.1)], P₀₁...k(x) ≡ f(x) for k ≥ N. The coefficient of x^k
in P₀₁...k(x) must therefore vanish for k > N. This coefficient, however, is
given by f[x₀, ..., x_k] according to (2.1.3.3).    □
EXAMPLE. f(x) = x²:

    x_i    k = 0     1     2     3     4
    0        0
                     1
    1        1             1
                     3             0
    2        4             1              0
                     5             0
    3        9             1
                     7
    4       16
If the function f(x) is sufficiently often differentiable, then its divided
differences f[x₀, ..., x_k] can also be defined if some of the arguments x_i
coincide. For instance, if f(x) has a derivative at x₀, then it makes sense
for certain purposes to define

    f[x₀, x₀] := f′(x₀).
For a corresponding modification of the divided-difference scheme (2.l.3.7)
see Section 2.1.5 on Hermite interpolation.
2.1.4 The Error in Polynomial Interpolation
Once again we consider a given function f(x) and certain of its values

    f_i = f(x_i),    i = 0, 1, ..., n,

which are to be interpolated. We wish to ask how well the interpolating
polynomial P(x) ≡ P₀...n(x) ∈ Π_n with

    P(x_i) = f_i,    i = 0, 1, ..., n,
reproduces f(x) for arguments different from the support arguments x_i.
The error

    f(x̄) − P(x̄),

where x̄ ≠ x_i, i = 0, 1, ..., n, can clearly become arbitrarily large for suitable functions f unless some restrictions are imposed on f. Under certain
conditions, however, it is possible to bound the error. We have, for instance:

(2.1.4.1) Theorem. If the function f has an (n + 1)st derivative, then
for every argument x̄ there exists a number ξ in the smallest interval
I[x₀, ..., x_n, x̄] which contains x̄ and all support abscissas x_i, satisfying

    f(x̄) − P₀₁...n(x̄) = ω(x̄)·f^(n+1)(ξ) / (n + 1)!,

where

    ω(x) ≡ (x − x₀)(x − x₁) ··· (x − x_n).
PROOF. Let P(x) ≡ P₀₁...n(x) be the polynomial which interpolates the
function at x_i, i = 0, 1, ..., n, and suppose x̄ ≠ x_i (for x̄ = x_i there is
nothing to show). We can find a constant K such that the function

    F(x) ≡ f(x) − P(x) − K·ω(x)

vanishes for x = x̄:

    F(x̄) = 0.

Consequently, F(x) has at least the n + 2 zeros

    x₀, x₁, ..., x_n, x̄

in the interval I[x₀, ..., x_n, x̄]. By Rolle's theorem, applied repeatedly,
F′(x) has at least n + 1 zeros in the above interval, F″(x) at least n zeros,
and finally F^(n+1)(x) at least one zero ξ ∈ I[x₀, ..., x_n, x̄].
Since P^(n+1)(x) ≡ 0,

    F^(n+1)(ξ) = f^(n+1)(ξ) − K·(n + 1)! = 0,

or

    K = f^(n+1)(ξ) / (n + 1)!.

This proves the proposition

    f(x̄) − P(x̄) = K·ω(x̄) = ω(x̄)·f^(n+1)(ξ) / (n + 1)!.    □
A different error term can be derived from Newton's interpolation formula (2.1.3.4):

    P(x) ≡ P₀₁...n(x) ≡ f[x₀] + f[x₀, x₁](x − x₀) + ···
                        + f[x₀, x₁, ..., x_n](x − x₀) ··· (x − x_{n−1}).

Here f[x₀, x₁, ..., x_k] are the divided differences of the given function f. If
in addition to the n + 1 support points

    (x_i, f_i),    i = 0, ..., n,

we introduce an (n + 2)nd support point

    (x_{n+1}, f_{n+1})    with x_{n+1} := x̄,  f_{n+1} := f(x̄),    where x̄ ≠ x_i,  i = 0, ..., n,

then by Newton's formula

    f(x̄) = P₀...n+1(x̄) = P₀...n(x̄) + f[x₀, ..., x_n, x̄]·ω(x̄),

or

(2.1.4.2)    f(x̄) − P₀...n(x̄) = ω(x̄)·f[x₀, ..., x_n, x̄].

The difference on the left-hand side appears in Theorem (2.1.4.1), and since
ω(x̄) ≠ 0, we must have

    f[x₀, ..., x_n, x̄] = f^(n+1)(ξ) / (n + 1)!    for some ξ ∈ I[x₀, ..., x_n, x̄].

This also yields

(2.1.4.3)    f[x₀, ..., x_n] = f^(n)(ξ) / n!    for some ξ ∈ I[x₀, ..., x_n],
which relates derivatives and divided differences.
EXAMPLE. f(x) = sin x,  x_i = … · i,  i = 0, 1, 2, 3, 4, 5,  n = 5:

    sin x − P(x) = (x − x₀)(x − x₁) ··· (x − x₅) · ( −sin ξ )/720,    ξ = ξ(x),

    |sin x − P(x)| ≤ |(x − x₀)(x − x₁) ··· (x − x₅)| / 720 = |ω(x)|/720.
We end this section with two brief warnings, one against trusting the
interpolating polynomial outside of I[xo, ... ,X n ], and one against expecting
too much of polynomial interpolation inside I[xo, . .. , x n ].
In the exterior of the interval I[xo, ... , x n ], the value of Iw(x) I in Theorem (2.1.4.1) grows very fast. The use of the interpolation polynomial P
for approximating f at some location outside the interval I[x₀, ..., x_n] - called extrapolation - should be avoided if possible.
Within I[xo, ... ,x n ] on the other hand, it should not be assumed that
finer and finer samplings of the function f will lead to better and better
approximations through interpolation.
Consider a real function f which is infinitely often differentiable in a
given interval [a, b]. To every interval partition Δ = {a = x₀ < x₁ <
··· < x_n = b} there exists an interpolating polynomial P_Δ ∈ Π_n with
P_Δ(x_i) = f_i for x_i ∈ Δ. A sequence of interval partitions

    Δ_m = { a = x₀^(m) < x₁^(m) < ··· < x_{n_m}^(m) = b }

gives rise to a sequence of interpolating polynomials P_{Δ_m}. One might expect
the polynomials P_{Δ_m} to converge toward f if the fineness

    ||Δ_m|| := max_i |x_{i+1}^(m) − x_i^(m)|

of the partitions tends to 0 as m → ∞. In general this is not true. For
example, it has been shown for the functions

    f(x) := 1/(1 + x²),    [a, b] = [−5, 5],

and

    f(x) := √x,    [a, b] = [0, 1],

that the polynomials P_{Δ_m} do not converge pointwise to f for arbitrarily
fine uniform partitions Δ_m, x_i^(m) = a + i(b − a)/m, i = 0, ..., m.
2.1.5 Hermite-Interpolation
Consider the real numbers Xi, fi(k) , i = 0,1, ... , m, k = 0,1, ... , ni - 1,
with
Xo < Xl < ... < Xm ·
The Hermite interpolation problem for these data consists of determining
a polynomial P whose degree does not exceed n, P ∈ Π_n, where

    n + 1 := Σ_{i=0}^{m} n_i,
and which satisfies the following interpolation conditions:
(2.1.5.1)    P^(k)(x_i) = f_i^(k),    i = 0, 1, ..., m,   k = 0, 1, ..., n_i − 1.
This problem differs from the usual interpolation problem for polynomials
in that it prescribes at each support abscissa x_i not only the value but
also the first n_i − 1 derivatives of the desired polynomial. The polynomial
interpolation of Section 2.1.1 is the special case ni = 1, i = 0, 1, ... , m.
There are exactly Σ n_i = n + 1 conditions (2.1.5.1) for the n + 1
coefficients of the interpolating polynomial, leading us to expect that the
Hermite interpolation problem can be solved uniquely:
(2.1.5.2) Theorem. For arbitrary numbers x₀ < x₁ < ··· < x_m and f_i^(k),
i = 0, 1, ..., m, k = 0, 1, ..., n_i − 1, there exists precisely one polynomial

    P ∈ Π_n,    n + 1 := Σ_{i=0}^{m} n_i,
which satisfies (2.1.5.1).
PROOF.
We first show uniqueness. Consider the difference polynomial
Q(x) := Pi (X) - P2(x) of two polynomials Pi, P 2 E IIn for which (2.1.5.1)
holds. Since
Xi is at least an n;-fold root of Q, so that Q has altogether L ni = n+ 1 roots
each counted according to its multiplicity. Thus Q must vanish identically,
since its degree is less than n + 1.
Existence is a consequence of uniqueness: For (2.1.5.1) is a system of
linear equations for n unknown coefficients Cj of P(x) = CO+C1X+' . ·+cnx n .
The matrix of this system is not singular, because of the uniqueness of
its solutions. Hence the linear system (2.1.5.1) has a unique solution for
arbitrary right-hand sides f;(k).
D
Hermite interpolating polynomials can be given explicitly in a form
analogous to the interpolation formula of Lagrange (2.1.1.4). The polynomial P E IIn given by
(2.1.5.3)
P(x)
m
n~-l
i=O
k=O
2: 2: fY) L;k(x).
satisfies (2.1.5.1). The polynomials L;k E IIn are generalized Lagrange polynomials.They are defined as follows: Starting with the auxiliary polynomials
0<:::: i <:::: m,
0<:::: k <:::: n;,
[compare (2.1.1.3)]' put
Li,n,-I(X)
:=
li,ni-dx),
i=O,l, ... ,m .
and recursively for k = ni - 2, ni - 3, ... , 0,
ni- 1
Lik(X) := lidx) -
L l;~)(Xi)Liv(X).
v=k+l
By induction
if i = j and k = a,
otherwise.
Thus P in (2.1.5.3) is indeed the desired Hermite interpolating polynomial.
An alternative way to describe Hermite interpolation is important for
Newton- and Neville-type algorithms to construct the Hermite interpolating polynomial, and, in particular, for developing the theory of Bsplines [Section 2.4.4]. The approach is to generalize divided differences
f[xo, xl,·· ., Xk] [see Section 2.1.3] so as to accommodate repeated abscissae. To this end, we expand the sequence of abscissae Xo < Xl < ... < Xm
occuring in (2.1.5.1) by replacing each Xi by ni copies of itself:
Xo = ... = Xo < Xl = ... = Xl < ... < Xm = ... = Xm .
'---v-----'
no
'---v-----'
'-v-----"
The n + 1 = I:z:,o ni elements in this sequence are then renamed
and will be referred to as virtual abscissae.
We now wish to reformulate the Hermite interpolation problem (2.1.5.1)
in terms of the virtual abscissae, without reference to the numbers Xi and
ni, i = 0, ... , m. Clearly, this is possible since the virtual abscissae determine the "true" abscissae Xi and the integers ni, i = 0, 1, ... , m. In order
to stress the dependence of the Hermite interpolant P(.) on to, tl, ... , tn
we write POl .. n (.) for P(.). Also it will be convenient to write f(r)(xi) instead of fi(r) in (2.1.5.1). The interpolant POL.n is uniquely defined by the
n + 1 = I:z:,o ni interpolation conditions (2.1.5.1), which are as many as
there are index pairs (i, k) with i = 0, 1, ... , m, k = 0, 1, ... , ni - 1, and
are as many as there are virtual abscissae. An essential observation is that
the interpolation conditions in (2.1. 5.1) belonging to the following linear
ordering of the index pairs (i, k),
(0,0), (0,1), ... , (O,no-1), (1,0), ... , (1,nl-1), ... , (m,n m -1),
have the form
(2.1.5.4)    P₀₁...n^(s_j−1)(t_j) = f^(s_j−1)(t_j),    j = 0, 1, ..., n,

if we define s_j, j = 0, 1, ..., n, as the number of times the value of t_j
occurs in the subsequence

    t₀ ≤ t₁ ≤ ··· ≤ t_j.

The equivalence of (2.1.5.1) and (2.1.5.4) follows directly from

    x₀ = t₀ = t₁ = ··· = t_{n₀−1} < x₁ = t_{n₀} = ··· = t_{n₀+n₁−1} < ···,

and the definition of the s_j,

    s₀ = 1, s₁ = 2, ..., s_{n₀−1} = n₀;  s_{n₀} = 1, ..., s_{n₀+n₁−1} = n₁;  ... .
We now use this new formulation in order to express the existence
and uniqueness result of Theorem (2.1.5.2) algebraically. Note that any
polynomial pet) E IIn can be written in the form
where II (t) is the row vector
II(t) := [ 1, t, ... , ~~] .
Therefore by Theorem (2.1.5.2) and (2.1.5.4), the following system of linear
equations
j = 0, 1, ... , n,
has a unique solution b. This proves the following corollary, which is equivalent to Theorem (2.1.5.2):
(2.1.5.5) Corollary. For" any non decreasing finite sequence
of n + 1 real numbers, the (n + 1) x (n + 1) matrix
is nonsingular.
The matrix V;,(to, t l , .... t n ) is related to the well-known Vandermonde
matrix if the numbers tj are distinct (then Sj = 1 for all j):
to
EXAMPLE 1. For to = tl
< t2, one finds
to
1
t~t~ 1.
t~
2
In preparation for a Neville type algorithm for Hermite interpolation,
we associate with each segment
o :s: i :s: i + k :s: n,
of virtual abscissae the solution Pi.Hl.. .. ,Hk E Ih of the partial Hermite
interpolation problem belonging to this subsequence, that is the solution
of [see (2.1.5.4)]
j = i, i + 1, ... ,i + k.
Here of course, the integers S j, i :s: .i :s: i + k, are defined with respect to
the subsequence, that is 5j is the number of times the value of tj occurs
within the sequence ti, ti+l, ... , tj.
EXAMPLE 2. Suppose n₀ = 2, n₁ = 3 and

    x₀ = 0,  f₀^(0) = −1,  f₀^(1) = −2,
    x₁ = 1,  f₁^(0) = 0,   f₁^(1) = 10,  f₁^(2) = 40.

This leads to virtual abscissae t_j, j = 0, 1, ..., 4, with

    t₀ = t₁ = 0 < t₂ = t₃ = t₄ = 1.

For the subsequence t₁ ≤ t₂ ≤ t₃, that is i = 1 and k = 2, one has

    s₁ = 1,  s₂ = 1,  s₃ = 2.

It is associated with the interpolating polynomial P₁₂₃(x) ∈ Π₂, satisfying

    P₁₂₃^(s₁−1)(t₁) = P₁₂₃(0) = f^(s₁−1)(t₁) = f^(0)(0) = −1,
    P₁₂₃^(s₂−1)(t₂) = P₁₂₃(1) = f^(s₂−1)(t₂) = f^(0)(1) = 0,
    P₁₂₃^(s₃−1)(t₃) = P₁₂₃′(1) = f^(s₃−1)(t₃) = f^(1)(1) = 10.

This illustrates (2.1.5.4).
Using this notation the following analogs to Neville's formulas (2.1.2.1)
hold: If t_i = t_{i+1} = ··· = t_{i+k} = x_l, then

(2.1.5.6a)    P_{i,i+1,...,i+k}(x) = Σ_{r=0}^{k} ( f^(r)(x_l) / r! )·(x − x_l)^r,

and if t_i < t_{i+k},

(2.1.5.6b)    P_{i,i+1,...,i+k}(x) = ( (x − t_i)·P_{i+1,...,i+k}(x) − (x − t_{i+k})·P_{i,i+1,...,i+k−1}(x) ) / (t_{i+k} − t_i).

The first of these formulas follows directly from definition (2.1.5.4); the second is proved in the same way as its counterpart (2.1.2.1b): One verifies first
that the right-hand side of (2.1.5.6b) satisfies the same uniquely solvable
[Theorem (2.1.5.2)] interpolation conditions (2.1.5.4) as P_{i,i+1,...,i+k}(x).
The details of the proof are left to the reader.    □
In analogy to (2.1.3.2), we now define the generalized divided difference

    f[t_i, t_{i+1}, ..., t_{i+k}]

as the coefficient of x^k in the polynomial P_{i,i+1,...,i+k}(x) ∈ Π_k. By comparing the coefficients of x^k in (2.1.5.6) [cf. (2.1.3.5)] we find, if t_i = ··· = t_{i+k} = x_l,

(2.1.5.7a)    f[t_i, t_{i+1}, ..., t_{i+k}] = f^(k)(x_l) / k!,

and otherwise

(2.1.5.7b)    f[t_i, t_{i+1}, ..., t_{i+k}] = ( f[t_{i+1}, ..., t_{i+k}] − f[t_i, t_{i+1}, ..., t_{i+k−1}] ) / (t_{i+k} − t_i).

By means of the generalized divided differences

    a_k := f[t₀, t₁, ..., t_k],    k = 0, 1, ..., n,

the solution P(x) = P₀₁...n(x) ∈ Π_n of the Hermite interpolation problem
(2.1.5.1) can be represented explicitly in its Newton form [cf. (2.1.3.1)]

(2.1.5.8)    P₀₁...n(x) = a₀ + a₁(x − t₀) + a₂(x − t₀)(x − t₁) + ···
                          + a_n(x − t₀)(x − t₁) ··· (x − t_{n−1}).
This follows from the observation that the difference polynomial
Q(X) := Pol ... n(x) - POl ... (n-l) (x) = J[to, tl, ... , tnlxn + ...
is of degree at most n and has, because of (2.1.5.1) and (2.1.5.4), for i :::; m1, the abscissa Xi as zero of multiplicity ni, and Xm as zero of multiplicity
(nm - 1). The polynomial of degree n
(x - to)(x - td··· (x - tn-d,
has the same zeros with the same multiplicities. Hence
Q(x) = f[t o, t1,"" tn](X - to)(X - h)'" (x - tn-d,
o
which proves (2.1.5.8).
EXAMPLE 3. We illustrate the calculation of the generalized divided differences
with the data of Example 2 (m = 1, n₀ = 2, n₁ = 3). The following difference
scheme results:

    t₀ = 0    −1* = f[t₀]
                            −2* = f[t₀, t₁]
    t₁ = 0    −1* = f[t₁]                      3 = f[t₀, t₁, t₂]
                             1  = f[t₁, t₂]                        6 = f[t₀, ..., t₃]
    t₂ = 1     0* = f[t₂]                      9 = f[t₁, t₂, t₃]                       5 = f[t₀, ..., t₄]
                            10* = f[t₂, t₃]                       11 = f[t₁, ..., t₄]
    t₃ = 1     0* = f[t₃]                     20* = f[t₂, t₃, t₄]
                            10* = f[t₃, t₄]
    t₄ = 1     0* = f[t₄]

The entries marked * have been calculated using (2.1.5.7a) rather than (2.1.5.7b).
The coefficients of the Hermite interpolating polynomial can be found in the
upper diagonal of the difference scheme:

    P(x) = −1 − 2(x − 0) + 3(x − 0)(x − 0) + 6(x − 0)(x − 0)(x − 1)
           + 5(x − 0)(x − 0)(x − 1)(x − 1)
         = −1 − 2x + 3x² + 6x²(x − 1) + 5x²(x − 1)².
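The generalized divided-difference scheme lends itself to a short program. The following Python sketch (an editorial illustration; it assumes the derivative data are supplied per virtual abscissa, and the helper names are ours) computes the Newton coefficients of (2.1.5.8) via (2.1.5.7) and reproduces the coefficients −1, −2, 3, 6, 5 of Example 3.

    from math import factorial

    def hermite_newton_coefficients(t, derivs):
        """Newton coefficients a_k = f[t_0,...,t_k] for nondecreasing virtual
        abscissae t; derivs[x][r] holds f^(r)(x)."""
        n = len(t)
        # table[i][k] = f[t_i, ..., t_{i+k}]
        table = [[0.0] * n for _ in range(n)]
        for k in range(n):
            for i in range(n - k):
                if t[i] == t[i + k]:
                    # (2.1.5.7a): all abscissae equal, use the k-th derivative
                    table[i][k] = derivs[t[i]][k] / factorial(k)
                else:
                    # (2.1.5.7b): ordinary divided-difference recursion
                    table[i][k] = (table[i + 1][k - 1] - table[i][k - 1]) / (t[i + k] - t[i])
        return [table[0][k] for k in range(n)]

    # data of Example 3: f(0) = -1, f'(0) = -2, f(1) = 0, f'(1) = 10, f''(1) = 40
    t = [0.0, 0.0, 1.0, 1.0, 1.0]
    derivs = {0.0: [-1.0, -2.0], 1.0: [0.0, 10.0, 40.0]}
    print(hermite_newton_coefficients(t, derivs))   # [-1.0, -2.0, 3.0, 6.0, 5.0]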
The interpolation error incurred by Hermite interpolation can be estimated in the same fashion as for the usual interpolation by polynomials.
In particular, the proof of the following theorem is entirely analogous to
the proof of Theorem (2.1.4.1):
(2.1.5.9) Theorem. Let the real function f be n + 1 times differentiable
on the interval [a, b], and consider m + 1 support abscissae x_i ∈ [a, b],

    x₀ < x₁ < ··· < x_m.

If the polynomial P(x) is of degree at most n,

    Σ_{i=0}^{m} n_i = n + 1,

and satisfies the interpolation conditions

    P^(k)(x_i) = f^(k)(x_i),    i = 0, 1, ..., m,   k = 0, 1, ..., n_i − 1,

then for every x ∈ [a, b] there exists ξ ∈ I[x₀, ..., x_m, x] such that

    f(x) − P(x) = ω(x)·f^(n+1)(ξ) / (n + 1)!,

where

    ω(x) := (x − x₀)^{n₀} (x − x₁)^{n₁} ··· (x − x_m)^{n_m}.
Hermite interpolation is frequently used to approximate a given real
function f by a piecewise polynomial function φ. Given a partition

    Δ: a = x₀ < x₁ < ··· < x_m = b

of an interval [a, b], the corresponding Hermite function space H_Δ^(ν) is defined as consisting of all functions φ : [a, b] → ℝ with the following properties:
(2.1.5.10)
(a) φ ∈ C^{ν−1}[a, b]: the (ν − 1)st derivative of φ exists and
    is continuous on [a, b].
(b) φ|I_i ∈ Π_{2ν−1}: on each subinterval I_i := [x_i, x_{i+1}],
    i = 0, 1, ..., m − 1, φ agrees with a polynomial of
    degree at most 2ν − 1.

Thus the function φ consists of polynomial pieces of degree 2ν − 1 or less
which are ν − 1 times differentiable at the "knots" x_i. In order to approximate a given real function f ∈ C^{ν−1}[a, b] by a function φ ∈ H_Δ^(ν), we choose
the component polynomials P_i = φ|I_i of φ so that P_i ∈ Π_{2ν−1} and so that
the Hermite interpolation conditions

    P_i^(k)(x_i) = f^(k)(x_i),    P_i^(k)(x_{i+1}) = f^(k)(x_{i+1}),    k = 0, 1, ..., ν − 1,

are satisfied.
Under the more stringent condition f E C 2u [a, b], Theorem (2.1.5.9)
provides a bound to the interpolation error for X E Ii, which arises if the
component polynomial Pi replaces f:
If(x) - Pi(x)1 :::; I(x - Xi)/X ~ Xi+lW' max If(2V)(~)1
(2.1.5.11)
2// .
(El,
< IXHI - xil 2V max If(2V)(~)I.
-
22v (2//!)
(Eli
Combining these results for i = 0, 1, ... , m gives for the function cp E Hr;),
which was defined earlier,
2.2 Interpolation by Rational Functions
where
11.::111 =
max
0:::;i:::;rn-1
59
IX i+l - xii
is the "fineness" of the partition .::1.
The approximation error goes to zero with the 21/ 1;h power of the
fineness II.::1 j II if we consider a sequence of partitions .::1} of the interval
[a, b] with II.::1 j II --+ O. Contrast this with the case of ordinary polynomial
interpolation, where the approximation error does not necessarily go to
zero as II.::1 j II --+ 0 [sec Section 2.1.4].
Ciarlet, Schultz, and Varga (1967) were able to show also that the first 1/
derivatives of <p are a good approximation to the corresponding derivatives
of I:
(2.1.5.13)
=
II(k)(x) - p?)(x)1 ::: I(x - ~:t~x ~~)~W-k (Xi+l - Xi)'" max II(2v)(01
. 1/.
f,EI,
for all X E Ii, i = 0, 1, ... , m - 1, k = 0, 1, ... , 1/, and therefore
(2.1.5.14)
III(2v) I
I I(k) - <p(k) 11= <- 22v - 2k11 .::111
k! (21/ - 2k)!
2V - k
CX!
for k = 0, 1, ... , 1/.
2.2 Interpolation by Rational Functions
2.2.1 General Properties of Rational Interpolation
Consider again a given set of support points (Xi, Ii), i = 0, 1, .... \Ve will
now examine the use of rational functions
pl"'V(x) == PI"·V(x) == ao + a1x + ... + al"xl"
QIL,v(X)
bo + b1x + ... + bvxlJ
for interpolating these support points. Here the integers JJ and 1/ denote the
maximum degrees of the polynomials in the numerator and denominator,
respectively. We call the pair of integers (JJ, 1/) the degree type of the rational
interpolation problem.
The rational function pl",V is determined by its JJ + 1/ + 2 coefficients
On the other hand, <p1",V determines these coefficients only up to a common
factor p oJ O.This suggests that pl",V is fully determined by the JJ + 1/ + 1
interpolation conditions
60
2 Interpolation
i = 0,1, ... ,p, + v.
(2.2.1.1)
We denote by A/l·v the problem of calculating the rational function p/l,v
from (2.2.1.1).
It is clearly necessary that the coefficients an bs: of p/l,V solve the
homogeneous system of linear equations
i = 0, 1. ... ,IL + v,
(2.2.1.2)
or written out in full,
We denote this linear system by S/l,v.
At first glace. substituting S/l,V for A/l,v does not seem to present a
problem. The next example will show, however, that this is not the case, and
that rational interpolation is inherently more complicated than polynomial
interpolation.
EXAMPLE. For support poiIlt~
and fJ, = 1/ = 1:
x,:
0
fi:
1
2
2
2
ao
- 1 . bo
= 0,
aO+a1 -2(bo +b1) =0,
ao + 2a1 - 2(bo + 2b1) = O.
Up to a common nonzero factor, solving the above system 5 1 ,1 yields the coefficients
ao = 0, bo = 0, a1 = 2, b) = 1,
and therefore the rational expression
which for x = 0 leads to the indeterminate expression 0/0. After cancelling the
factor x, we arrive at the rational expression
Both expressions p1,] and ,p1,1 represent the same rational function, namely
the constant function of value 2. This function misses the first support point
(xo, fo) = (0,1). Therefore it does not solve A 1.1. Since solving 5 1 ,1 is necessary
for any solution of A 1,1, we conclude that no such solution exists.
The above example shows that the rational interpolation problem A/l,V
need not be solvable. Indeed, if S/l,V has a solution which leads to a rational
function that does not solve A/l,V -- as was the case in the example - then
2.2 Interpolation by Rational Functions
61
the rational interpolation problem is not solvable. In order to examine this
situation more closely, we have to distinguish between different representations of the same rational function <Jj!-',l/, which arise from each other by
canceling or by introducing a common polynomial factor in numerator and
denominator. We say that two rational expressions (that is pairs of two
polynomials, the numerator and denominator polynomials)
are equivalent, and write
if
This is precisely when the two rational expressions represent the same
rational function.
A rational expression is called relatively prime if its numerator and
denominator are relatively prime, i.e., not both divisible by the same polynomial of positive degree. If a rational expression is not relatively prime,
then canceling all common polynomial factors leads to an equivalent rational which is.
Finally we say that a rational expression <Jj!-',v is a solution of S!-"v if its
coefficients solve S!-,,l/. As noted before, <Jj!-',V solves S!-"v if it solves A!-'·v.
Rational interpolation is complicated by the fact that the converse need
not hold.
(2.2.1.3) Theorem. The homogeneous linear system of equations S!-"v
always has nontrivial solutions. For each such solution
QIJ.V(X) =j.
°
holds, i.e., all nontrivial solutions define rational functions.
PROOF. The homogeneous linear system S!-"v has p, + v -+- 1 equations for
p, + v + 2 unknowns. As a homogeneous linear system with more unknowns
than equations, S!-'·v has nontrivial solutions
(aO,a1, ... ,a!-', bo, ... ,bv ) i- (0, ... ,0, 0, ... ,0).
For any such solution, Q!-',V(x) =j. 0, since
Q!-',V(x) == bo + b1x + ... + bvxv ==
°
would imply that the polynomial p!-"V(x) == au + al + .. , + a!-,x!-' has the
zeros
i = 0,1, ... ,p, + v.
62
2 Interpolation
It would follow that pll,V (x) == 0, since the polynomial pll,v has at most
degree tl, and vanishes at tL + v + 1 ~ tL + 1 different locations, contradicting
(ao,al, ... ,a ll , bo,···,b v ) =1= (0, ... ,0).
o
The following theorem shows that the rational interpolation problem
has a unique solution if it has a solution at all.
(2.2.1.4) Theorem. If PI and P2 are both (nontrivial) solutions of the
homogeneous linear system SIl.V, then they are equivalent (PI ~ P2), that
is, they df?termine the same rational function.
PROOF. IfbothPI(x) == PI(X)/QI(X) and P2(X) == P 2(X)/Q2(X) solve SIl.V,
then the polynomial
has tL + v + 1 different zeros
P(X;) = P 1(Xi)Q2(Xi) - P 2(Xi)QI(Xi)
=
fi QI(Xi)Q2(Xi) - fi Q2(Xi)QI(X;)
i = 0, ... , tL + v.
= 0,
Since the degree of polynomial P does not exceed tL + v, it must vanish
identically, and it follows that PI (x) ~ P2 (x).
0
Note that the converse of the above theorem does not hold: a rational expression PI may well solve SIl,v whereas some equivalent rational expression
P2 does not. The previously considered example furnishes a case in point.
Combining Theorems (2.2.1.3) and (2.2.1.4), we find that there exists
for each rational interpolation problem AIl.v a unique rational function,
which is represented by any rational expression pll.V that solves the corresponding linear system SIl,V. Either this rational function satisfies (2.2.1.1),
thereby solving A,t.I/, or AllY is not solvable at all. In the latter case, there
must be some support point (Xi, fi) which is "missed" by the rational function. Such a support point is called inaccessible. Thus AI"Y is solvable if
there are no inaccessible points.
Suppose pll,V(x) == PI",V(X)/QIl,V(x) is a solution to SIl.V. For any i E
{O, 1, ... ,tL + v} we distinguish the two cases:
1)
2)
QIl,V(Xi) =1= 0,
QIl,V(Xi) = 0.
In the first case, clearly pll,L'(Xi) = fi. In the second case, however, the
support point (Xi, fi) may be inaccessible. Here
2.2 Interpolation by Rational Functions
63
must hold by (2.2.1.2). Therefore, both PfJ.,V and QfJ.,V contain the factor
x ~ Xi and are consequently not relatively prime. Thus:
(2.2.1.5). If SfJ.,V has a solution pfJ.,v which is relatively prime, then there
are no inaccessible points: AfJ.,V is solvable.
Given PfJ.. v , let /pfJ.,V be an equivalent rational expression which is relatively
prime. We then have the general result:
(2.2.1.6) Theorem. Suppose PfJ.,V solves SfJ.· v . Then A,L,u is solvable and PfJ.,V represents the solution -- if and only if /pfJ.,V solves S",I/.
PROOF. If /P/I,I/ solves SfJ.,V, then AfJ.,V is solvable by (2.2.1.5). If /pfJ.,V does
not solve S,L,V, its corresponding rational function does not solve AfJ.· v . D
Even if the linear system SfJ.· v has full rank 11 + V + 1, the rational interpolation problem AfJ.,V may not solvable. However, since the solutions of
SfJ.,u are, in this case, uniquely determined up to a common constant factor
p =J 0, we have:
(2.2.1.7) Corollary to (2.2.1.6). If SfJ.,V has full rank, then AfJ.,V is solvable if and only if the solution PfJ.. V of SfJ.,V is relatively prime.
We say that the support points (Xi, fd, i = 0,1, ... , (J' are in special position if they are interpolated by a rational expression of degree type (r;, A)
with r; + A < (J'. In other words, the interpolation problem is solvable for
a smaller combined degree of numerator and denominator than suggested
hy the nnmher of support points. We observe that
(2.2.1. 8) Theorem. The accessible support points of a nonsolvable 'interpolation problem AfJ.,v are in special position.
PROOF. Let i 1 , ... , ia be the subscripts of the inaccessible points, and let
PfJ.,v be a solution of S/l.V. The numerator and the denominator of PfJ.,V
were seen above to have the common factors X ~ Xi" ... , X ~ X 1n , whose
cancellation leads to an equivalent rational expression pK,A with r; = /1 ~ G,
A = v ~ G. pK,A solves the interpolation problem AK,A which just consists
of the 11 + V + 1 ~ G accessible points. As
r; + A + 1 = I). + V + 1 ~ 2G < M + v + 1 ~ G,
the accessible points of AfJ.,V are clearly in special position.
D
The observation (2.2.1.8) makes it clear that nonsolvability of the rational interpolation problem is a degeneracy phenomenon: solvability can
be restored by arbitrarily sIllall perturbations of the support points. In
what follows, we will therefore restrict our attention to fully nondegenerate
problems, that is, problems for which no subset of the support points is in
special position. Not only is AfJ.,v solvable in this case, but so are all problems AK,A of r; + A + 1 of the original support points where r; + A ::; M + v.
For further details see Milne (1950) and Maehly and Witzgall (1960).
64
2 Interpolation
Most of the following discussion will be of recursive procedures for solving rational interpolation problems AIL,v. With each step of such recursions
there will be associated a rational expression J[IL,v, of degree type (/-l, v)
with /1 :s; m and v :s; n, and either the numerator or the denominator of
([JIL,v will be increased by l. Because of the availability of this choice, the
recursion methods for rational interpolation are more varied than those for
polynomial interpolation. It will be helpful to plot the sequence of degree
types (/-l, v) which are encountered in a particular recursion as paths in a
diagram:
/-l=0
1
2
3
1/=0
1
2
3
We will distinguish two kinds of algorithms. The first kind is analogous
to Newton's method of interpolation: A tableau of quantities analogous
to divided differences is generated from which coefficients are gathered for
an interpolating rational expression. The second kind corresponds to the
Neville-Aitken approach of generating a tableau of values of intermediate
rational functions ([JIL,V. These values rclate to each other directly.
2.2.2 Inverse and Reciprocal Differences. Thiele's Continued
Fraction
The algorithms to be described in this section calculate rational expressions
along the main diagonal of the (/-l, 1/ )-plane:
/-l=0
1
2
3
v=O
1
2
3
(2.2.2.1)
Starting from the support points (Xi, j,), i = 0, 1, ... , we build the following
tableau of inverse differences:
2.2 Interpolation by Rational Functions
Xi
fi
0
Xo
fo
1
2
Xl
3
X2
X3
h
h
h
rp(XO, xd
rp(xo, X2)
rp(xo, X3)
rp(xo, Xl, X2)
rp(XO,Xl,X3)
65
rp(xo, :rl, X2, X3)
The inverse differences are defined recursively as follows:
(2.2.2.2)
On occasion, certain inverse differences become 00 because the denominators in (2.2.2.2) vanish.
Note that the inverse differences are, in general, not symmetric functions of their arguments.
Let PI-', QV be polynomials whose degree is bounded by /1 and /J, respectively. We will now try to use inverse differences in order to find a
rational expression
pn(X)
pn,n(x) = _ _
Qn(x)
with
pn,n(Xi) = fi
for i = 0,1, ... ,2n.
We must therefore have
pn(x) = fo + pn(x) _ pn(xo)
Qn(x)
Qn(x)
Qn(xo)
pn-l(x)
(x-xo)
= fo + (x - xo) Qn(x) = fo + Qn(x)/pn-l(x)"
The rational expression Qn (x) / pn-l (x) satisfies
Qn(Xi)
pn-l(Xi)
for i = 1, 2, ... , 2n. It follows that
66
2 Interpolation
and therefore
pn-l(Xi)
Qn-I(X;)
Continuing in this fashion, we arrive at the following expression for pn,n(X):
pn(X)
f
n n()
x = Qn (x) =
P'
x - Xo
°+ Qn (x) / pn - X)
I (
x - Xo
x - xl
= fo + - - - - - - - - - - ip(XO, Xl) + pn-I(X)/Qn-I(X)
Xo
x - Xl
= fo + - - - - - - - - - - - - - - - X -
ip(XO, Xl) + - - - - - - - - - - - -
1: - X2n-1
ip(XO, ' , . ,X2n) .
pn.n(X) is thus srepresented by a continued fraction:
(2.2.2.3)
pn,T!(X) =fo + X - xo/P(XO,XI) + ~ + ...
+ X - X2n-I/ip(XO, Xl,"" X2n)'
It is readily seen that the partial fractions of this continued fraction are
nothing but the rational expressions PJ",,"(x) and p'" +1 ,," (x) , p, = 0, 1,
... , n - 1, which satisfy (2.2.1.1) and which are indicated in the diagram
(2.2.2.1) .
pO,O(X)
fo,
p1.0(x)
fo + x - xo/p(xo, Xl)'
p1.1 (x)
fa + X - xo/p(xo, Xl) + X - xI/ip(xo, Xl, X2) ,
EXAMPLE
2.2 Interpolation by Rational Functions
Xi
Ji
0
1
0
1
0
-1
2
2
3
3
'P(XQ, x;)
'P(XQ,XI,X,)
2
3
-1
-3
-2
9
I
:3
3
67
'P(XQ, Xl, X2, x;)
I
2
I
2
Because the inverse differences lack symmetry, the so-called reciprocal
differences
P(Xi' XH1,"" Xi+k)
are often preferred. They are defined by the recursions
P(Xi) := fi'
(2.2.2.4)
P(Xi, Xi+l
)
Xi - Xi+l
f
'
:= f
(
)
PXi,Xi+I,···,Xi+k :=
i-HI
Xi - Xi+k
(
)
(
)+
P Xi,' .. , XHk-1 - P Xi+l,···, Xi+k
For a proof that the reciprocal differences are indeed symmetrical, see
Milne-Thompson (1951).
The reciprocal differences are closely related to the inverse differences.
(2.2.2.5) Theorem. For p = 1, 2, ... [letting p(xo, ... , X p -2) = 0 for
p = 1],
PROOF. The proposition is correct for p = l. Assuming it true for p, we
conclude from
that
By (2.2.2.4),
68
2 Interpolation
p(Xp. Xo, ...• xp~d - p(xo, ... ,Xp-l, Xp+l)
= p(Xp, XO,"" Xp-l, Xp+l) - p(XO," ., Xp~l)'
Since the p( ... ) are symmetric,
'P(Xo, Xl,···, :.c:p+l) = p(Xo,···, xp+d - p(Xo, ... , Xp-l),
whence (2.2.2.5) has been established for p + 1.
The reciprocal differences can be arranged in the tableau
Xo
fa
Xl
h
o
p(xo, Xl)
(2.2.2.6)
p(Xo, Xl, :r'2)
P(Xl,X2)
X2
12
X3
h
P(XO,Xl,X2,X3)
P(Xl, X2, X3)
P(X2,X3)
Using (2.2.2.5) to substitute reciprocal differences for inverse differences
in (2.2.2.3) yields Thiele's continued fraction:
(2.2.2.7)
pn.n(x) =fo + X - xo/p(Xo, Xl) + ~p(xo. Xl, X2) - p(Xo) + ...
+ X - X2n-dp(Xo, ... , X2n) - p(Xo, .... X2n~2)'
2.2.3 Algorithms of the Neville Type
We proceed to derive an algorithm for rational interpolation which is analogous to Neville's algorithm for polynomial interpolation.
A quick reminder that, after discussing possible degeneracy effects in
rational interpolation problems [Section 2.2.1]' we have assumed that such
effects are absent in the problems whose solution we are discussing. Indeed,
such degeneracies are not likely to occur in numerical problems.
We use
plL.V(X) = Pf'V(x)
s
-
Q~'V(x)
to denote the rational expression with
p~,v(x;)=fi
fori=s,s+l, ... ,S+M+V.
Pf'V, Q~'v being polynomials of degrees not exceeding M and v, respectively. Let p~'v and q~'v be the leading coefficients of these polynomials:
2.2 Interpolation by Rational Functions
69
For brevity we put
Qi :=
x - Xi and T/:'V(x, y) := p/:,V(x) - y. C;)~,V(x),
noting that
(2.2.3.1) Theorem. Starting with
P~,O(x) = fs,
the following recursions hold:
(a) Transition (M - 1,1/) -+ (M,I/):
(b) Transition (M,I/ - 1) -+ (M,I/):
PROOF. We show only (a), the proof of (b) being analogous. Suppose the
rational expressions q>~-l,v and q>~+;'v meet the interpolation requirements
(2.2.3.2)
T/:-1,V(Xi, fi) = 0 for i = s, ... , s + M+ 1/ - 1,
T:;/,V(Xi,fi) =0 for i=s+l, ... ,S+M+I/.
If we define Pt,V(x), Q~'V(x) by (a), then the degree of Pt,V clearly does
not exceed M. The polynomial expression for Q~'v contains formally a term
with x v+1, whose coefficient, however, vanishes. The polynomial Q~'v is
therefore of degree at most 1/. Finally,
From this and (2.2.3.2),
T/:,V(Xi, fi) = 0 for i = s, ... , s + M+ II.
Under the general hypothesis that no combination (M, 1/, s) has inaccessible points, the above result shows that (a) indeed defines the numerator
and denominator of q>~,v .
0
70
2 Interpolation
Unfortunately, the recursions (2.2.3.1) still contain the coefficients
The formulas are therefore not yet suitable for t.he calculation of cP';'''(x) for a prescribed value of .T. However, we can eliminate
t.hese coefficients on the basis of the following theorem.
p~,v-1, q~-1,v.
(2.2.3.3) Theorem.
(a)
cPfJ.-1,v(x) _ cPfJ.-l,v-1(x) = k (:1: - xs+d'" (x - xS+I"+v-d
8
8+1
1
Q~-1'V(x)Q~~rV-1(x)
fJ.- 1,1/-1 qs1"-1,v ,
WZ'th k 1 -- _ Ps+1
(b)
cPl"-l,v(X) _ cPl"-l,v-1(,T) = k (x - X8 H)'" (x - x"+I"+v-1)
8+1
s+1
2
QI"-l,v( )QI"-1,V-1( )
8+1 x 8+1
x
.h k 1"-1,v-1 l"-l,v
q8+1'
wzt 2 - -P s+1
PROOF. The numerator polynomial of the rational expression
"'-1"-1,v(
",-fJ.-l,v-1( x)_
'¥s
X) _ '¥s+l
Pt'-1'V(x)Q~~:,v-1(X) _ p:;/,v-1(X)Q~-1,v(X)
Q~-l,v (x )Q~~:,v-1 (x)
is at most of degree 11 - 1 + v and has fl, + v-I different. zeros
Xi,
i=s+1,s+2, ... ,s+f1+ v - 1
by definition of cP~-l,v and cP~~l,v-l. It must therefore be of the form
k"1 ( X - .T 8 +1 ) ... (x - xS+fJ.+v-1 )
'tl1 k 1 -- _ P fJ.-1,v-1
q"l"-l,v .
s+1
WI
o
This proves (a). (b) is shown analogously.
(2.2.3.4) Theorem. For f1 ;;0. 1, v ;;0. 1,
cP~~ll,v(x) _ cP~-1,v(X)
(a)
cP~,v(x) = cP~+:'V(x) + - - - - - - - - - - - - - - - as
[
cP~~:,V(x) - cP~-I,v(x) ]
1-1
as+J-l+v
cP~~11,v(X) - cP~~11,v-1(X)
....
~
2.2 Interpolation by Rational Functions
71
PROOF. By Theorem (2.2.3.1),
J.L-1,v pJ.L-l,v( )
J.L-1,v p/J.-1,v( )
cpJ.L,V(x) _ a 8q8
8+1
x - a 8+J.L+v q8+1
8
X
8
V( )
J.L-1 vQ,,-l
V( ) .
a 8q8J.L-1 'vQJ.L-1
8+1' X - a 8+J.L+v q8+1 '
8'
, X
We now assume that p~+t,V-1 i- 0 and multiply numerator and denominator of the above fraction by
J.L-1,v-1( x - X8+1 )( X - X8+2 ) ... ( x - x S +J.L+v-1 )
-P8+1
)QJ.L-1,v( )QJ.L-1,v-1( )
Q J.L-1,v(
s+l x s
X s+l
X
Taking Theorem (2.2.3.3) in account, we arrive at
II - as+J.L+vcp~-l,v(x) [
II - as+J.L+v [ b
(2.2.3.5)
where
_ cpJ.L-1,v-1(x)
11 = cpJ.L-1,v(X)
s
s+l
,
b = cp~+t,v (x) - cp~+t,V-1 (x) .
(a) follows by a straightforward transformation. (b) is derived analogously.
o
The formulas in Theorem (2.2.3.4) can now be used to calculate the
values of rational expressions for prescribed x successivel:y, alternately increasing the degrees of numerators and denominators. This corresponds to
a zigzag path in the (IL, v)-diagram:
1
2
3
v=O
1
(2.2.3.6)
2
L
Special recursive rules are still needed for initial straight portions-vertically and horizontally-of such paths.
As long as v = 0 and only IL is being increased, one has a case of pure
polynomial interpolation. One uses Neville's formulas [see (2.1.2.1)l
cp~,O(x) := is,
IL = 1,2, ....
72
2 Interpolation
Actually these are specializations of Theorem (2.2.3.4a) for v = 0, provided
the convention q>~+;,-l := 00 is adopted, which causes the quotient marked
* in (2.2.3.4) to vanish.
If J-l = and only v is being increased, then this case relates to polynomial interpolation with the support points (Xi, 1/ Ii), and one can use the
formulas
°
(2.2.3.7)
q>~'V(x)
as - a
+
:= _ _ _ _ _s_v_ __
as
a s +v
q>~;;-l (x)
q>~,v-l (x)
v = 1,2, ... ,
which arise from Theorem (2.2.3.4) if one defines q>-;~i-l(x) := 0.
Experience has shown that the (J-l, v)-sequence
(0,0) --+ (0,1) --+ (1,1) --+ (1,2) --+ (2,2) --+ ...
- indicated by the dotted line in the diagram (2.2.3.6) - holds particular
advantages, especially in the important application area of extrapolation
methods [Sections 3.4 and 3.5], where interest focuses on the values q>~,,,(x)
for x = 0. If we refer to this particular sequence, then it suffices to indicate
J-l + v, instead of both J-l and v, and this permits the shorter notation
Ti,k := q>~'''(x)
with i = S + J-l + v, k = J-l + v.
The formulas (2.2.3.4) combine with (2.2.3.7) to yield the algorithm
Ti,-l := 0,
(2.2.3.8)
Ti,k := Ti,k-l
Ti,k-l - Ti-1,k-l
+ --------------X -
Xi-k
x - Xi
[1 _
Ti,k-l - Ti-1,k-l]
Ti,k-l - T i - 1,k-2
-1
for 1 ::; k ::; i, i = 0, 1, .... Note that this recursion formula differs from
the corresponding polynomial formula (2.1.2.5) only by the expression in
brackets [... j, which assumes the value 1 in the polynomial case.
If we display the values Tik in the tableau below, letting i count the
ascending diagonals and k the columns, then each instance of the recursion
formula (2.2.3.8) interrelates the four corners of a rhombus:
2.2 Interpolation by Rational Functions
(M, v)
(0,0)
(0,1)
(1,1)
(1,2)
73
...
fo = To,o
0= TO,-l
(2.2.3.9)
TI,1
h = TI,o
0= TI,-l
T 2 ,2 \ .
T 2,1
12 = T 2 ,0
0= T 2,-1
--+
T3,2
/"
T33
T 3,1
h = T3,0
If one is interested in the rational function itself, i.e. its coefficients, then
the methods of Section 2.2.2, involving inverse or reciprocal differences, are
suitable. However, if one desires the value of the interpolating function for
just one single argument, then algorithms of the Neville type based on the
formulas of Theorem (2.2.3.4) and (2.2.3.8) are to be preferred. The formula
(2.2.3.8) is particularly useful in the context of extrapolation methods [see
Sections 3.4, 3.5, 7.2.3, 7.2.14].
2.2.4 Comparing Rational and Polynomial Interpolation
Interpolation, as mentioned before, is frequently used for the purpose of
approximating a given function f (x). In many such instances, interpolation by polynomials is entirely satisfactory. The situation is different if the
location x for which one desires an approximate value of f(x) lies in the
proximitiy of a pole or some other singularity of f(x) - like the value
of tan x for x close to 7r /2. In such cases, polynomial interpolation does
not give satisfactory results, whereas rational interpolation does, because
rational functions themselves may have poles.
EXAMPLE [taken from Bulirsch and Rutishauser (1968)]. For the function f(x) =
cot x the values cot 1 0, cot 2°, ... have been tabulated. The problem is to determine an approximate value for cot 2° 30'.
Polynomial interpolation of order 4, using the formulas (2.1.2.4), yields the
tableau
74
2 Interpolation
ctg(x;)
Xi
Ji =
1°
57.28996163
2°
28.63625328
14.30939911
21.47137102
23.85869499
3°
19.08113669
4°
14.30066626
22.36661762
23.26186421
21.47137190
22.63519158
23.08281486
22.18756808
18.60658719
5°
11.43005230
Rational interpolation with (fL, 1/) = (2,2) using the formulas (2.2.3.8) in contrast
gives
1°
57.28996163
2°
28.63625328
3°
19.08113669
4°
14.30066626
22.90760673
22.90341624
22.90201805
22.90369573
22.90411487
22.91041916
22.9037655~
22.90384141
22.90201975
22.94418151
5°
11.43005230
The exact value is cot 2°30' = 22.9037655484 ... ; incorrect digits are underlined.
A similar situation is encountered in extrapolation methods [see Sections 3.4, 3.5, 7.2.3, 7.2.14]. Here a function T(h) of the step length h is
interpolated at small positive values of h.
2.3 Trigonometric Interpolation
2.3.1 Basic Facts
Trigonometric interpolation uses combinations of the trigonometric functions cos hx and sin hx for integer h. We will confine ourselves to linear
interpolation, that is, interpolation by one of the trigonometric expressions
(2.3.1.1a)
1A + ~)AhCOS hx+Bh sin hx),
AI
l/i'(x):=
h=l
2.3 Trigonometric Interpolation
1P(x):= ~O +
(2.3.1.1b)
75
A
M-l
L (Ah cos hx + Bh sin hx) + : cos Mx
h=l
of, respectively, N = 2M + 1 or N = 2M support points (Xk' ik), k = 0,
1, ... , N - 1. Interpolation by such expressions is suitable for data which
are periodic of known period. Indeed, the expressions 1P(x) in (2.3.1.1)
represent periodic functions of x with the period 271".2
Considerable conceptual and algebraic simplifications are achieved by
using complex numbers and invoking De Moivre's formula
e ikx = cos kx + i sin kx .
Here and in what follows, i denotes the imaginary unit. If c = a + i b, a, b
real, then c = a - ibis its complex conjugate, a is the real part of c, bits
imaginary part, and Icl := (cc)1/2 = (a 2 + b2 )1/2 its absolute value.
Particularly important are uniform partitions of the interval [0,271"]
k = 0,1, ... ,N - 1,
Xk := 271"k/N,
to which we now restrict our attention. For such partitions, the trigonometric interpolation problem can be transformed into the problem of finding a
phase polynomial of order N (Le. with N coefficients)
(2.3.1.2)
with complex coefficients (3j such that
P(Xk) = ik,
Indeed
e-hixk
k = 0,1, ... ,N - 1.
= e-21rihk/N = e2rri(N-h)k/N = e(N-h)ixk,
and therefore
sin hx k =
(2.3.1.3)
ehixk _ e(N-h)ixk
---::-2""'"i- - -
Making these substitutions in expressions (2.3.1.1) for 1P(x) and then collecting the powers of e iXk produces a phase polynomial p(:r) , (2.3.1.2), with
coefficients (3j, j = 0, ... , N - 1, which are related to the coefficients A h ,
Bh of 1P(s) as follows:
2
If sin u and cos u have to be both evaluated for the same argument u, then it
may be advantageous to evaluate t = tan( u/2) and express sin u and cos u in
terms of t :
.
2t
sm u = 1 + t 2 '
1 - t2
cos U = 1 + t 2 .
This procedure is numerically stable for 0 ::; u ::; 7r / 4, and the problem can
always be transformed so that the argument falls into that range.
76
2 Interpolation
(2.3.1.4) (a) If N is odd, then N = 2M + 1 and
Ao
(30 = 2' (3j = ~(Aj - iBj ), (3N-j = ~(Aj + iBj ), j = 1, ... , M,
Ao = 2(30,
Ah = (3h + (3N-h,
Bh = i((3h - (3N-h),
h = 1, ... , M.
(b) If N is even, then N = 2M and
(30 = ~o, (3j = ~(Aj - iBj ), (3N-j = ~(Aj + iBj ), j = 1, ... , M - 1,
AM
(3M = 2 '
Ao = 2(30, Ah = (3h + (3N-h, Bh = i((3h - (3N-h), h = 1, ... , M - 1,
AM = 2(3M.
The trigononometric expression lP(x) and its corresponding phase polynomial p( x) agree for all support arguments Xk = 27r / N of an equidistant
partition of the interval [0,27r]:
k = 0, 1, ... ,N - 1.
However lP(x) = p(x) need not hold at intermediate points x -I- Xk. The
two interpolation problems are equivalent only insofar as a solution to one
problem will produce a solution to the other via the coefficient relations
(2.3.1.4).
The phase polynomials p(x) in (2.3.1.2) are structurally simpler than
the trigonometric expressions lP(x) in (2.3.1.1). Upon abbreviating
°: :;
and since Wj -I- Wk for j -I- k,
j, k :::; N -1, it becomes clear that we are
faced with just a standard polynomial interpolation problem in disguise:
find the (complex) algebraic polynomial P of degree less than N with
P(Wk) =
ik,
k = 0, 1, ... ,N - 1.
The uniqueness of polynomial interpolation immediately gives the following
(2.3.1.5) Theorem. For any support points (Xk' ik), k = 0, ... , N - 1,
with ik complex and Xk = 27rk/N, there exists a unique phase polynomial
p(x) = (30 + (31e ix + ... + (3N_le(N-l)ix
with
for k = 0, 1, ... , N - 1.
p(xd = ik
2.3 Trigonometric Interpolation
77
The coefficients (3j of the interpolating phase polynomial can be expressed in closed form. To this end, we note that, for 0 ::; j, h ::; N - 1
(2.3.1.6)
More importantly, however, we have for 0 ::; j, h ::; N - 1
=
N-1
~ wj w- h
(2.3.1. 7)
~
k=O
k
{£'
or = h ,
N
J
for j =1= h.
0
k
PROOF. Wj-h is a root of the polynomial
wN -
1 = (w -
1)(W N - 1
+ w N - 2 + ... + 1),
from which either Wj-h = 1, and therefore j = h, or
N-1
N-1
~
j
-h
~ wkw k
~
j-h
~ wk
=
k=O
N-1
k=O
k
0
= ~
~ Wj_h = .
D
k=O
Introducing in the N-dimensional complex vector space (CN of all Ntuples (uo, U1, ... ,UN -1) of complex numbers Uk the usual scalar product
N-1
L
[u, v] :=
UjVj,
j=O
then (2.3.1.7) says that the special N-vectors
W
(h) _
-
(1 ,W1"",WN-1,
h
h)
h = 0, ... ,N -1,
form an orthogonal basis of (CN,
( .)
(2.3.1.8)
[w J ,w
(h)
]=
{
N
0
for j = h,
for j =1= h.
Note that the vectors are of length [w(h), w(h)j1/2 = VN instead of length
1, however. From the orthogonality of the vectors w(h) follows:
N-1
..
(2.3.1.9) Theorem. The phase polynomial p(x) = 2: j =o (3je J 'X satisfies
k = 0,1, ... , N -1,
for fk complex and Xk = 27f/N, if and only if
N-1
N-1
k=O
k=O
~ f w- j = ~
~ f e-2rrijk/N ,
(3J. = ~
N~kk
N~k
j = 0, ... ,N -1.
78
2 Interpolation
PROOF. Because of fk = p(Xk) the N-vector f := (fo, iI,···, fN-d satis-
fies
N-l
f = L (3jw(j)
j=O
so that
N-l
LfkW-';/= [f,w(j)] = [(3ow(O)+···+(3N_IW(N-l),w(j)] =N(3j. D
k=O
The mapping F : QjN ---+ QjN,
f = (fo, iI,···, fN-l) H (3 := «(30, (31, ... ,(3N-l) =: F(f),
defined by (2.3.1.9) is called discrete Fouriertransformation (DFT). Its
inverse (3 H f = F- 1((3) corresponds to Fourier synthesis, namely the
evaluation ofthe phase polynomial p(x) (2.3.1.2) at the equidistant abscissas Xk = 27rk/N, k = 0, 1, ... , N - 1,
N-l
N-l
fk = L (3je27rijk/N = L (3jwf
j=O
j=O
Because of
A = L.f=~1 {JjW;j the mapping F- 1 can be expressed by F,
(2.3.1.10)
Therefore the algorithms to compute F [e.g. the fast Fourier transform
methods of the next section 2.3.2] can also be used for Fourier synthesis.
For phase polynomials q(x) of order at most s, s :::::: N - 1 given, it is
in general not possible to make all residuals
k =O, ... ,N -1,
vanish, as they would for the interpolating phase polynomial of order N.
In this context, the s-segments
Ps(x) = (30 + (31eix + ... + (3se six
of the interpolating polynomial p( x) have an interesting best-approximation property:
(2.3.1.11) Theorem. The s-segment Ps(x), 0::; s < N, of the interpolating phase polynomial p(x) minimizes the sum
N-l
S(q) := L Ifk - q(xk)1 2
k=O
2.3 Trigonometric Interpolation
79
[note that S (p) = 0] of the squared absolute values of the residuals over all
phase polynomials
The phase polynomial Ps (x) is uniquely determined by this minimum property.
PROOF. We introduce the N-vectors
Then S(q) can be written as the scalar product
S(q) = [f-q,f-q].
By Theorem (2.3.l.9), N,6j = [f, w(j)] for j = 0, ... , N -- l. So for j ~ s,
s
[f - Ps, w(j)] = [f - L
f3h W (h) , w(j)] = N f3j - N f3j = 0,
h=O
and
[f - Ps,Ps - q] =
L [J - Ps, (f3j - I'j
)w(j)] = 0.
j=O
But then we have
S(q) = [f - q, f - q]
[(f - Ps) + (Ps - q), (f - Ps) + (Ps - q)]
= [f - Ps, f - Ps] + [Ps - q,ps - q]
::::: [f - Ps, f - Ps]
=
= S(Ps)'
Equality holds only if [Ps - q, Ps - q] = 0, i.e., if the vectors Ps and q
are equal. Then the phase polynomials Ps (x) and q( x) are identical by the
uniqueness theorem (2.3.l.5).
D
Returning to the original trigonometric expressions (2.3.l.1), we note
that Theorems (2.3.l.5) and (2.3.l.9) translate into the following:
(2.3.1.12) Theorem. The trigonometric expressions
iJr(x) :=
A
-f
+ L(Ah cos hx + Bh sin hx),
iJr(x) :=
i + L (Ah cos hx + Bh sin hx) + A;! cos Mx,
M
h=l
A
M-l
h=l
80
2 Interpolation
where N = 2M + 1 and N = 2M, respectively, satisfy
tJr(Xk) = ik,
k = 0, 1, ... N - 1,
for Xk = 21Tk/N if and only if the coefficients of tJr(x) are given by
2 N-l
Ah = N
2 N-l
21Thk
L fk cos hXk = N L ik cos ---y:;- .
k=U
k=O
2 N-l
2 N-l
21Thk
Bh = N
ik sin hXk = N
ik sin ---y:;-'
k=O
k=O
L
PROOF.
L
Only the expressions for A h . Bh remain to be verified. For by
(2.3.1.4)
Ah = Ph + PN-h = ~
N-l
L fk(e- hiXk + e-(N-h)iXk) ,
k=O
. N-l
Bh = i(Ph - PN-h) = ~
L fde- hixk - e-(N-h)ixk) ,
k=O
and the substitutions (2.3.l.3) yield the desired expressions.
D
Note that if the support ordinates ik are real, then so are the coefficients
A h , Bh in (2.3.l.l2).
2.3.2 Fast Fourier Transforms
The interpolation of support points
(Xk, ik), k = 0,1, ... , N -1, with equidistant Xk = 21Tk/N. by a phase
N-l
-polynomial p(x) = L:j=o Pje)'X leads to expressions of the form [Theorem
(2.3.l.9)]
N-l
(2.3.2.1)
(-J = ~ '" f e-2rrijk/N
{-')
N L
k
,
k=O
j = 0,1, ... ,N - 1.
The evaluation of such expressions is of prime importance in Fourier analysis. The expressions occur also as discrete approximations - for N equidistant arguments s - to the Fourier transform
H(s) =
I:
f(t)e-2rrist dt,
which pervades many areas of applied mathematics. However, the numerical
evaluation of expressions (2.3.2.1) had long appeared to require on the order
2.3 Trigonometric Interpolation
81
of N 2 multiplications, putting it out of reach for even high-speed electronic
computers for those large values of N necessary for a sufficiently accurate
discrete representation of the above integrals. The discovery [Cooley and
Tukey (1965); it is remarkable that similar techniques were already used
by Gauss] of a method for rapidly evaluating (on the order of NlogN
multiplications) all expressions (2.3.2.1) for large special values of N has
therefore opened up vast new areas of applications. This method and its
variations are called fast Fourier transforms. For a detailed treatment see
Brigham (1974) and Bloomfield (1976).
There are two main approaches, the original Cooley- Tukey method and
one described by Gentleman and Sande (1966), commonly ealled the SandeTukey method.
Both approaches rely on an integer factorization of ]Ii" and decompose
the problem accordingly into subproblems of lower degree. These decompositions are then carried out recursively. This works best when
N = 2n,
n> 0 integer.
We restrict our presentation to this most important and most straightforward case, although analogous techniques will clearly work for the more
genereal case N = N1N2 ... N n , N m integer.
The Cooley-Tukey approach is best understood in terms of the interpolation problem described in the previous section 2.3.1. Suppose N = 2M
and consider the two interpolating phase polynomials q(x) and r(x) of order
M = NI2 with
h = 0, ... , ivl - 1.
The phase polynomial q(x) interpolates all support points of even index,
whereas the phase polynomial r(x) = r(x - 21fIN) = r(x -1fIM) interpolates all those of odd index. Since
keven,
k odd,
the complete interpolating phase polynomial p(x) is now readily expressed
in terms of the two lower-order phase polynomials q(x) and r(x):
(2.3.2.2)
p(x) = q(x) (
l+eM~)
2
+ r(x -1fIM) (l-eM~)
--2- .
This suggests the following n step recursive scheme. For Tn :S n, let
M = 2m - 1
and
R = 2n - m .
Step Tn then consists of determining R phase polynomials
(.I('/TI) + (.I('/TI) eix + ... + (.I(m)
e(2M -l)ix
Pr(m) = Pr,O
fJr,l
fJ r ,2AI-1
,
r ,= 0, ... , R - 1,
82
2 Interpolation
from 2R phase polynomials p~m-l) (x), r = 0, ... , 2R-l, using the recursion
(2.3.2.2):
2p~m) (x) = p~m-l) (x)(1 + e Mix ) + pkn;..~l) (x - 7T /M)(1 _ e Mix ).
This relation gives rise to the following recursion between the coefficients
of the above phase polynomials:
(2.3.2.3)
1 j
2(3(m)
+ (3(mr,] = (3(m-l)
r,]
R+r,]lcm
2(3(m)
r,M +j -
(3(m-l)
r,j
+ (3(m-l)
cj
R+r,j m
}
r = 0, ... ,R - 1,
j = 0, ... ,!vI - 1,
where
m=O, ... ,n.
The recursion is initiated by putting
(0) . - f
(3k,O'k,
k=O, ... ,N -1,
. - (3(n)
(3 j.O,j'
j = 0, ... ,N-1.
and terminates with
This recursion typifies the Cooley-Tukey method.
The Sande-Tukey approach chooses a clever sequence of additions
in the sums
ike- jixk . Again with M = N/2, we assign to each
term ike- jixk an opposite termik+Me-jixk+M. Summing respective opposite terms in (2.3.2.1) produces N sums of M = N /2 terms each. Splitting
those N sums into two sets, one for even indices j = 2h and one for odd
indices j = 2h + 1, will lead to two problems of evaluating expressions of
the form (2.3.2.1), each problem being of reduced degree M = N/2.
Using the abbreviation
2:.::-01
again, we can write the expressions (2.3.2.1) in the form
N-l
~k
N(3j = L..
fkc~ ,
k=O
j = 0,1, ... ,N - 1.
Here n is such that N = 2n. Distinguishing between even and odd values
of j and combining opposite terms gives
M-l
N-l
N (32h =
=
k=O
=:
k=O
k=O
M-l
N-l
N(32h+1 =
M-l
L ikc~hk L (ik + fk+M )C~~l L f£C~~l
M-l
L ikc~2h+1)k L (Uk - fk+M)C~)c~~l L ffc~~l
=
k=O
=:
k=O
k=O
2.3 Trigonometric Interpolation
83
for h = 0, ... , .M - 1 and AI := N/2, since 10;' = IOn-I, c~I = -1. Here
f~ = ik + fk+M
}
f~ = (ik - fk+M )c~
k = 0, ... ,M -1.
In order to iterate this process for m = n, n - 1, ... , 0, we let AI :=
2rn - 1 , R:= 217 - rn and introduce the notation
f;.r;:) ,
°
r = 0, ... ,R - 1,
k = 0, ... , 2M - 1,
. h JO.k
.(n) -- f k, k -J\T
.(n-l) an d f(n-l)
' .
Wlt
, ... ,"'
- 1. JO.k
l,k
represent th
e quantltles
f£ and f~, respectively, which were introduced above. In g;eneral we have,
with 1\;1 = 2rn - l and R = 2n - m ,
(2.3.2.4)
2M-l
N(~
fJjR+r =
~
~
r = 0,1, ... ,R - 1,
.(m) jk
Jr,k 10 m ,
j = 0, 1, ... 2M - 1.
k=O
with the quantities f;~) satisfying the recursions:
(2.3.2.5)
lrn-l) -lrn)
r,k
r,k
+ lrn)
r,k+JH
(m-l) _
(m)
fr+R.k - (fr.k -
} {
(m)
k
fr,k+M )c m
m = n, n -- 1, ... , 1.
r = 0, 1. .... R - 1.
'"
k = 0, 1, ... ,1\1 - 1.
PROOF. Suppose (2.3.2.4) is correct for some m ::; n, and let M' := M/2 =
2m - 2 , R' := 2R = 2n - m + 1 . For j = 2h and j = 2h 1, respectively, we
+
find by combining opposite terms
M-l
2M'-1
~ (f;.r;:) + f;,r;:~M)C~r~ = ~ f;.r;:-I)c~_I'
N{3hR'+r = N{3JR+r =
k=O
k=O
M-l
N{3hR'+r+R = N{3jR+r' = ~ (f;~) -
f;~LI)Ct::
k=O
M-l
=
~ (f(rn)
~
r,k k=O
2M'-1
f(rn)
) k hk
r,k+M c m c m - l =
~
~
.(m-l) hk
Jr+R,k c m-l'
k=O
where r = 0, ... , R - 1, j = 0, ... , 2lvl - 1.
o
The recursion (2.3.2.5) typifies the Sande-Tukey method. It is initiated
by
and terminates with
.(n) . - f
JO.k . - k·
k = 0, ... , 1V - 1,
84
2 Interpolation
._ 1 (0)
f3r .- N fr,o ,
r = 0, L ... , N - l.
Returning to the Cooley-Tukey method for a more detailed algorithmic
formulation, we are faced with the problem of arranging the quantities
in a one-dimensional array:
fJ?r;)
1'£
= O, ... ,N-l.
Among suitahle maps (m,r,j) --+ I'£(m,r,j), the following IS the most
straightforward:
1'£
= 2m r + j;
m = 0, ... ,11" r = 0, ... , 2n - m - 1, j = 0, ... , 2m - l.
It has the advantage, that the final results are automatically in the correct
order. However, two arrays {il[ ], {i[ ] are necessary to accommodate the
left- and right-hand sides of the recusion (2.3.2.3). We can make do with
only one array {i[ ] if we execute the transformations "in place", that is,
if we let each pair of quantities ,[3~7), f3~~~l +j occupy the same positions in
8[ ] as the pair of quantities (3~7-1), ,[31:~,~), from which the former are
computed. In this case, however, the entries in the array {i[ ] are being
permuted, and the maps which assign the positions in {i[ ] as a function of
integers m, r, j become more complicated. Let
T = T(m,r,j)
be a map with the above mentioned replacement properties, namely
with
(2.3.2.6)
T(m, r,j) = T(m - 1, r.j)
T(m, r,j + 2m - I ) = T(m -1, r + 2n - m ,j)
m=L ... ,n,
r = 0, ... , 2 n - m - 1,
j = 0, ... ,2 m - 1 - 1,
and
(2.3.2.7)
T(n, 0, j) = j,
j = 0, ... , N - l.
The last condition means that the final result f3j will be found in position
j in the array {i: f3j = {i[j].
The conditions (2.3.2.6) and (2.3.2.7) define the map T recursively. It
remains to determine it explicitly. To this end, let
t = ao + al ·2+ ... + an-I' 2n - 1 ,
aj E {O, I},
2.3 Trigonometric Interpolation
be the binary representation of an integer t,
°:S: t <
85
2n. Then putting
pet) := an-I + a n -2 ·2+ ... + ao ·2 n - l
(2.3.2.8)
defines a permutation of the integers t = 0, ... , 2n -1 called bit reversal. The
bit-reversal permutation is symmetric, i.e. p(p(t)) = t.
In terms of the bit-reversal permutation p, we can express T(m, r,j)
explicitly:
T(m, r,j) = per) + j,
(2.3.2.9)
for all m = 0, 1, ... , n, r = 0, 1, ... , 2n - m - 1, j = 0, 1, ... , 2m - 1.
PROOF. If again
t = ao + al ·2+ ... + an-I· 2n - \
aj E {O, 1 },
then by (2.3.2.6) and (2.3.2.7)
T(n-1,0,t)
( 0)
t = T n, ,t = { T(n _ 1,0, t _ 2n-l)
Thus
if an-l. = 0,
if an-l. = 1.
t = T(n, 0, t) = T(n - 1, an-I, aO + ... + a n -2 ·2 n - 2 ),
and, more generally,
t = T(n, 0, t) = T(m, an-I + ... + am . 2n - m - l , aO + ... + am-I· 2m - I )
for all m = n - 1, n - 2, ... , 0. For r = an-I + ... + am· 2n - m - l , we find
per) = am . 2m + ... + an-I· 2n - l
and t = per) + j, where j = ao + ... + am-I· 2m - I .
By the symmetry of bit reversal,
D
T(m,p(r),j) = r + j,
where r is a multiple of 2m ,
o :S: j < 2m - I , then
°:S: r <
2n , and
°:S: <
j
2m . Observe that if
t = T(m, p(r),j) = T(m -l,p(r),j) = r + j,
t = T(m, p(r),j + 2m - I ) = T(m - 1, per) + 2n - m ,j) = r + j + 2m - I
mark a pair of positions in ,6[ ] which contain quantities connected by the
Cooley-Thkey recursions (2.3.2.3).
In the following pseUdO-ALGOL formulation of the classical CooleyThkey method, we assume that the array ,6[ ] is initialized by putting
,6[p(k)] := ik,
k = O, ... ,N -1,
86
2 Interpolation
where p is the bit-reversal permutation (2.3.2.8). This "scrambling" of the
initial values can also be carried out "in place", because the bit-reversal
permutation is symmetric and consists, therefore, of a sequence of pairwise
interchanges or "transpositions". In addition, we have deleted the factor 2
which is carried along in the formulas (2.3.2.3), so that finally
1 -
(Jj := NP[j],
j = 0, ... ,N-1.
The algorithm then takes the form
°
for m := 1 step 1 until n do
begin for j :=
step 1 until 2m - I - 1 do
begin e := cjn;
for F :=
step 2m until 2n - 1 do
begin u := ~[F + j]; v:= ~[F + j + 2m - I ] X e;
B[F + j] := u + v; ~[F + j + 2 m - I ] := u - v
end
end
end:
°
If the Sande-Tukey recursions (2.3.2.5) are used, there is again no problem if two arrays of length N are available for new and old values, respectively. However, if the recursions are to be carried out "in place" in a single
array j[ ], then we must again map index triples m, T, j into single indices
T. This index map has to satisfy the relations
T(m - I, T, k) = T(m, T, k),
T(m - 1, T
+ 2n - m , k) = T(m, T, k + 2m - I )
for m = n, n - L ... , L T = 0, 1, ... , 2 n - m - 1, k = 0, 1, ... , 2m - I - 1.
If we assume
T(n, 0, k) = k for k = 0, ... , N - 1,
that is. if we start out with the natural order, then these conditions are
precisely the conditions (2.3.2.6) and (2.3.2.7) written in reverse. Thus
T = T(m,T, k) is identical to the index map T considered for the CooleyTukey method.
In the following pseUdO-ALGOL formulation of the Sande-Tukey method,
we assume that the array j[ ] has been initialized directly with the values
h:
j[k] := h,
k = 0, 1, ... ,N - 1.
However, the final results have to be "unscrambled" using bit reversal,
j = 0, ... ,N - 1:
2.3 Trigonometric Interpolation
°
87
for m:= n step -1 until 1 do
begin for k :=
step 1 until 2m - 1 - 1 do
begin e := c~;
for r:=
step 2m until 2n - 1 do
begin
u := i[r + k]; v:= i[r + k + 2m - I ];
i[r + k] := u + v; i[r + k + 2m - I ] :== (u - v) X e
end
end
end;
°
If all values ik, k = 0, ... , N - 1, are real and N = 2M is even, then
the problem of evaluating the expressions (2.3.2.1) can be reduced in size
by putting
h = 0,1, ... , M - 1,
and evaluating the expressions
M-l
~
9 e-2rrijh/M
IJ·- M ~
h
,
'"Y . • -
j = 0, 1, ... ,loll - 1.
'"
h=O
The desired values (3j, j = 0, ... , N - 1, can be expressed in terms of the
values 'Yj, j = 0, ... , M - 1. Indeed, one has with 'YM := 'Yo
(2.3.2.10)
- ) + 4i1 ('Yj -'YM-j
)e -2rrij/N ,
(3j = 4:1 ('Yj + 'YM-j
J. =
°1
'1
, , ... ,d
j=1,2, ... ,M-1.
(3N-j=i]j,
PROOF. It is readily verified that
M-l
1 Yj + 1M-j) = N
1 '~
" hhe- 2 rrtJ.. 2h/N ,
4:('
h=O
1\[-1
1 ('V
4i
;y
diI - j
,j -
) _
-
1 '" f
e- 2rrij (2h+l)/N+2rrijjN
N ~ 2h+l
.
D
h=O
In many cases, particularly if all values fk are real, one is actually
interested in the expressions
2 N-l
Aj:= N
2 N-l
27rjk
2·k
L ikcos~, B := N L fk sin ~
j
k=O
k=O
88
2 Interpolation
for j = 0,1, ... , Ai, which occur, for instance, in Theorem (2.3.1.12). The
values A j , B j are connected with the corresponding values for (3j via the
relations (2.3.l.4).
2.3.3 The Algorithms of Goertzel and Reinsch
The problem of evaluating phase polynomials p(x) from (2.3.1.2) or trigonometric expressions If/(x) from (2.3.1.1) for some aTbitmry argument x = f.
is called FouTieT synthesis. For phase polynomials, there are Horner-type
evaluation schemes, as there are for expressions (2.3.l.1a) when written in
the form If/(x) = L~~-l\I /3 j e jix . The numerical behavior of such evaluation
schemes, however, should be examined carefully.
For example, Goertzel (1958) proposed an algorithm for a problem
closely related to Fourier synthesis, namely, for simultaneously evaluating
the two sums
N-l
L Yk cos kf. ,
N-l
k=O
k=1
L Yk sink';
for a given argument f. and given values Yk, k = 0, ... , N -1. This algorithm
is not numerically stable unless it is suitably modified. The algoithm is
based on the following:
(2.3.3.1) Theorem. FaT f. '" nr, T = 0, ±1, ±2, ... , define the quantities
1 N-l
Vj := ~
Yk sin(k - j + 1)f.,
sm" k=j
L
j = 0, 1, ... , N - 1,
V N := V N + 1 := 0.
These quantities satisfy the TecuTsions
VJ = Yj + 2Vj+1 cosf. - Uj+2,
(2.3.3.1a)
j = N -l,N - 2, ... ,0.
In paTticulaT
N-l
L Yk sin kf. = VI sin f.,
(2.3.3.1b)
N-l
(2.3.3.1c)
L Yk cos k';
=
Yo + VI cos f. - V 2 .
°
k=O
PROOF. For
S j S N - 1, let
By the definition of Vj+l, Vj+2,
2.3 Trigonometric Interpolation
L
L
N-l
N-l
A = YJ + _.1_ { 2cos~
Yk sin(k - j)~ Yk sin(k - j sm ~
k=j+l
k=j+2
89
1)~
}
N-l
=Yj+_.l_
L yd2cos~sin(k-j)~-sin(k-j-l)~l·
sm~ k=j+l
Kow
2 cos~ sin(k - j)~ = sin(k - j + 1)~ + sin(k - j - 1)~.
Hence
A = _._1_ [ Yj sin~ +
sm~
L Yk sin(k - j + 1)~ == U;-
N-l
]
k=j+l
This proves (2.3.3.1a). (2.3.3.1b) restates the definition of U 1 . To verify
(2.3.3.1c), note that
1
1
N-l.
L
N-l
.
U 2 = -.- ~ Yk sm(k - 1)~ = -.Yk sm(k - 1)~,
sm~ ~
sm~
k=2
k=l
and
sin(k - 1)~ = cos~ sink~ - sin~ cosk~.
D
Goertzel's algorithm applies the recursions (2.3.3.1) directly:
UN := [h.J+l := 0;
c:= cos~;
for j := N - 1, N - 2, ... , 1 :
cc:= 2· c:
Uj := Yj + cc· Uj+l - Uj +2;
51 := Yo + U 1 . C - U2 ;
52 := U1 . sin~;
to find the desired results 51 = L~:-Ol Yk cos k( 82 = L~:ll Yk sin k~.
This algorithm is unfortunately not numerically stable for small absolute values of~, I~I «1. Indeed, having calculated c = cosk~, the quantity
81 = L~=~l Yk cos k~ will depend solely on c and the values Yk. We can
write 51 = cp(c, Yo, ... , YN-l), where
cp(c, Yo, ... ,YN -d =
N-l
L Yk cos(k arccos c).
k=O
As in Section 1.2, we denote by eps the machine precision. The roundoff
error Llc = ccc, Iccl :::; eps, which occurs during the calculation of c, causes
an absolute error Ll c 51 in 51, which in first-order approximation amounts
to
90
2 Interpolation
N-l
cAp
E e cos ~ """""
.1 e s1 ~ -;:;-.1c = -.-- L k Yk sin k~
sm~
uC
k=O
N-l
= Ec(COtO L kYksink~.
k=O
An error .1~ = EE~' lEE I
s: eps in ~, on the other hand, causes only the error
L
fJ ( N - l
)
.1.;s1 ~ fJ~
Yk cos k~ . .1~
k=O
N-l
= -EE~ L kYksink~
k=O
in s1. Now cot~ ~ 1/~ for small I~I. The influence of the roundoff error
in c is consequently an order of magnitude more serious than that of a
corresponding error in ~. In other words, the algorithm is not numerically
stable.
In order to overcome these numerical difficulties, Reinsch has modified
Goertzel's algorithm [see Bulirsch and Stoer (1968)]. He distinguishes the
two cases cos ~ > 0 and cos ~ O.
Case (a): cos~ > O. The recursion (2.3.3.1a) yields for the difference
s:
bUJ := Uj - UH1
the relation
i5Uj = Uj - UH1 = Yj + (2 cos ~ - 2) UH1 + UH1 - UH2
= Yj + AUH1 + i5UH1 ,
where
A := 2(cos~ - 1) = -4sin2(~/2).
This suggests the algorithm
A := -4sin2(~/2);
UN+l := i5UN := 0;
for j := N - 1. N - 2, ... , 0:
UH1 := i5UH1 + Uj+2;
i5Uj := A' UJ +1 + i5UH1 + Yj;
s1 := i5Uo - A . UI/2:
s2 := U1 . sin~:
This algorithm is well behaved as far as the propagation of the error .1A =
E\A, 110\1
eps, in A is concerned. The latter causes only the following error
.1\s1 in s1:
s:
2.3 Trigonometric Interpolation
91
8s1
8S1/8)..
,1),s1 ="= a:\,1)" = E)')..7i{ 8~
= -E A
L
N-l
sin 2( ~ / 2 )
kYksink~
.
sin(U2) . cos(U2) k=O
I tan(U2)I is small for small I~I. Besides, I tan(U2)1 < 1 for cos~ > o.
Case (b): cos~ :::; O. Here we put
and find
5Uj = Uj + Uj+l = Yj + (2 cos ~ + 2)Uj+l - Uj+l - Uj+2
= Yj + )..Uj+l - 5UJ +1 ,
where now
).. := 2(cos~
+ 1) = 4cos 2 (U2).
This leads to the following algorithm:
).. := 4COS2(~/2);
UN+l := 5UN := 0;
for j := N - L N - 2, ... , 0:
Uj+l := 5Uj+l - Uj+2;
5Uj := ).. . UJ + 1 - 5Uj + 1 + Yj;
s1 := 5Uo - U 1 . )../2;
s2 := U 1 • sin~;
It is readily confirmed that a roundoff error ,1)" = E),).., IE)'I :::; eps, in )..
causes an error of at most
in s1, and I cot(~/2)1 :::; 1 for cos~ :::; o. The algorithm is therefore well
behaved as far as the propagation of the error ,1)" is concerned.
92
2 Interpolation
2.3.4 The Calculation of Fourier Coefficients. Attenuation
Factors
Let K be the set of all absolutely continuous 3 real functions f : IR -+ IR
which are periodic with period 27r. It is well known [see for instance Achieser
(1956)] that every function f E K can be expanded into a Fourier series
oc
(2.3.4.1)
L cje
f(x) =
jix
j=~oo
which converges towards f(x) for every x E IR. The coefficients Cj = Cj(f)
of this series are given by
Cj -- Cj (1'),.- - 1
(2.3.4.2)
27r
1271" f( x )e ~jixdx,
0
j = 0, ±1, ±2, ....
In practice, frequently all one knows of a function f are its values !k :=
f(Xk) at equidistant arguments Xk := 27rkjN, where N is a given fixed
positive integer. The problem then is to find, under these circumstances,
reasonable approximate values for the Fourier coefficients Cj (f). We will
show how the methods of trigonometric interpolation can be applied to
this problem.
By Theorem (2.3.1.9), the coefficients (3j of the interpolating phase
polynomial
with
for k = 0, ±1, ±2, ... , are given by
N~l
jixk
6J = ~
N 'L" f k e,
j = 0, 1, .. . ,N - 1.
k=O
Since fo = fN, the quantities (3j can be thought of as a "trapezoidal sum"
[compare (3.1.7)]
3
A real function f : [a, b] --t IR is absolutely continuous on the interval [a, b] if for
every EO > 0 there exists 0 < 0 such that Li If (b i ) - f (ai) I < EO for every finite set
of intervals [a" bi ] with a::; a1 < b1 < ... < an < bn ::; band Li Ib i -ail < 8. If
the function f is differentiable everywhere on the closed interval [a, b] or, more
generally, if it satisfies a "Lipschitz condition" If(xi) - f(x2)1 ::; elx1 - x21 on
[a, b], then f is absolutely continuous, but not conversely: there are absolutely
continuous functions with unbounded derivatives. If the function is absolutely
continuous, then it is continuous and its derivative l' exists almost everywhere.
1'"lorcover, f(x) = f(a) + Jax1'(t) dt for x E [a, b]. The absolute continuity of
the functions f, 9 also ensures that integration by parts can be carried out
safely,
f(x )g' (x) dx = f(x )g(x) I~ l' (x )g(x) dx .
J:
J:
2.3 Trigonometric Interpolation
;3j = ~ [~ + !Ie- jiX1 + ... + hV_le-jixN-l +
93
f; e- jiXN ]
approximating the integral (2.3.4.2), so that one might think of using the
sums
;3j(J) =;3j := ~
(2.3.4.3)
N-l
L ike- jixk
k=O
for all integers j
0, ±1, ±2, ... as approximate values to the desired
Fourier coefficients Cj (J). This approach appears attractive, since fast
Fourier transforms can be utilized to calculate the quantities ;3j (J) efficiently. However, for large indices j the value ;3j (J) is a very poor approximation to Cj (J). Indeed, ;3j+kN = ;3j holds for all integers k, j, while on
the other hand limlJl--+CXJ Cj = O. [This follows immediately from the convergence of the Fourier series (2.3.4.1) for the argument :c = 0.] A closer
look also reveals that the asymptotic behavior of the Fourier coefficients
c] (J) depends on the degree of differentiability of f:
(2.3.4.4) Theorem. If the 27r-periodic function f has an absolutely continuous r th derivative f(r), then
PROOF. Successive integration by parts yields
C
]
..
= -1 1271" f(x)e-PXdx
27r
= ~
0
2 !'(x)e-jiXdx
r
J
71"
27rJ2 o
in view of the periodicity of f. This proves the proposition.
0
To approximate the Fourier coefficients Cj (J) by values which display
the right asymptotic behavior, the following approach suggests itself: Determine for given values ik, k = 0, ±1, ±2, ... , as simple a function 9 E 1C
as possible which approximates f in some sense (e.g., interpolates f for Xk)
and share with f some degree of differentiability. The Fourier coefficients
94
2 Interpolation
Cj (g) of 9 are then chosen to approximate the Fourier coefficients Cj (f) of
the given function 1. In pursuing this idea, it comes as a pleasant surprise
that even for quite general methods of approximating the function 1 by
a suitable function g, the Fourier coefficients Cj (g) of 9 can be calculated
in a straightforward manner from the coefficients (3j(f) in (2.3.4.3). More
precisely, there are so-called attenuation lactorsTj, j integer, which depend
only on the choice of the approximation method and not on the particular
function values lk, k = 0, ±1, ... , and for which
To clarify what we mean by an "approximation method" , we consider
besides the set K of all absolutely continuous 21r-periodic functions 1 :
JR -+ JR - the set
-
IF = { (ik)kEZ I lk E JR, lk+N = lk for k E 7l.}, 7l. := {k I k intcger},
of all N-periodic sequences of real numbers
1 = (···,1-1,10, h, .. .).
For convenience, we denote by 1 both the function 1 E K and its corresponding sequence (fk)kElL with lk = l(xk). The meaning of 1 will follow
from the context.
Any method of approximation assigns to each sequence 1 E IF a function 9 = P(f) in K; it can therefore be described by a map
P: IF -+ K.
K and IF are real vector spaces with the addition of elements and the
multiplication by scalars defined in the usual straightforward fashion. It
therefore makes sense to distinguish linear approximation methods P. The
vector space IF is of finite dimension N, a basis being formed by the sequences
(2.3.4.5)
elk) = (e(k))
J
jElL'
k = 0,1, ... ,N - 1,
where
if k == j mod N,
otherwise.
In both IF and K we now introduce translation operators E: IF -+ IF
and E: K -+ K, respectively, by
(Ej)k := lk-l
(Eg)(x) := g(x - h)
for all k E 7l., if 1 E IF,
for all x E JR, if 9 E K, h:= 21r/N = Xl.
(For convenience, we use the same symbol for both kinds of translation operators.) We call an approximation method P: IF -+ K translation invariant
if
2.3 Trigonometric Interpolation
95
P(E(f)) = E(P(f))
for all f E IF, that is, a "shifted" sequence is approximated by a "shifted"
function. P(E(f)) = E(P(f)) yields P(Ek(f)) = Ek(P(f)), where E2 =
Eo E, E3 = Eo E 0 E, etc. We can now prove the following theorem by
Gautschi and Reinsch [for further details see Gautschi (1972)]:
(2.3.4.6) Theorem. For each approximation method P: IF -+ K there
exist attenuation factors Tj, jEll, for which
(2.3.4.7)
Cj (P f)
= Tj /3j (f) for all jEll and arbitrary f E IF
if and only if the approximation method P is linear and translation invariant.
PROOF. Suppose that P is linear and translation invariant. Every f E IF
can be expressed in terms of the basis (2.3.4.5):
f =
N-l
N-l
k=O
k=O
L fke(k) = L IkEke(O).
Therefore
g:= Pf =
L
N-l
fkEkpe(O),
k=O
by the linearity and the translation invariance of P. Equivalently,
N-l
g(x) =
L IkTJo(x - Xk),
k=O
where TJo := Pe(O) is the function which approximates the sequence e(O).
The periodicity of 9 yields
where
(2.3.4.8)
We have thus found expressions for the attenuation factors Tj which depend
only on the approximation method P and the number N of given function
96
2 Interpolation
values fk for arguments Xk, 0:::: Xk < 271". This proves the "if" direction of
the theorem.
Suppose now that (2.3.4.7) holds for arbitrary f ElF. Since all functions
in K can be represented by their Fourier series, and in particular P f E K,
(2.3.4.7) implies
(2.3.4.9)
j=-oc
j=-oc
By the definition (2.3.4.3) of !3j (f), !3j is a linear operator on IF and, in
addition,
N-l
N-l
k=O
k=O
jixk = ~e-jih ' " f e- jixk = e- jih , !3(f)
!3(Ef) = ~
. J
N 'L" f k-l·eN
L
k·
'J
.
Thus (2.3.4.9) yields the linearity and the translation invariance of P:
L Tj!3j(f)e j
DC
(P(E(f)))(x) =
,(x-h) =
(Pf)(x - h) = (E(P(f)))(x).
D
j=-oc
As a by-product of the above proof, we obtained an explicit formula
(2.3.4.8) for the attenuation factors. An alternative way of determining the
attenuation factors Tj for a given approximation method P is to evaluate
the formula
(2.3.4.10)
for a suitable f ElF.
EXAMPLE 1. For a given sequence f ∈ F, let g := Pf be the piecewise linear interpolation of f, that is, g is continuous and linear on each subinterval [x_k, x_{k+1}], and satisfies g(x_k) = f_k for k = 0, ±1, .... This function g = Pf is clearly absolutely continuous and has period 2π. It is also clear that the approximation method P is linear and translation invariant. Hence Theorem (2.3.4.6) ensures the existence of attenuation factors. In order to calculate them, we note that for the special sequence f = e^{(0)} of (2.3.4.5)

    β_j(f) = 1/N,

    Pf(x) = 1 − |x − x_{kN}|/h   if |x − x_{kN}| ≤ h, k = 0, ±1, ...,   Pf(x) = 0 otherwise,

    c_j(Pf) = (1/2π) ∫_0^{2π} Pf(x) e^{−ijx} dx = (1/2π) ∫_{−h}^{h} (1 − |x|/h) e^{−ijx} dx.

Utilizing the symmetry properties of the above integrand, we find

    c_j(Pf) = (1/π) ∫_0^{h} (1 − x/h) cos jx dx = (2/(j²πh)) sin²(jh/2).

With h = 2π/N the formula (2.3.4.10) gives

    τ_j = (sin z / z)²   with   z := πj/N,   j = 0, ±1, ....

EXAMPLE 2. Let g := Pf be the periodic cubic spline function [see Section 2.4] with g(x_k) = f_k, k = 0, ±1, .... Again, P is linear and translation invariant. Using the same technique as in the previous example, we find the following attenuation factors:

    τ_j = (sin z / z)^4 · 3/(1 + 2 cos² z),   where z := πj/N.
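In practice the discrete coefficients β_j(f) can be obtained with a single FFT, after which (2.3.4.7) gives the Fourier coefficients of the interpolating function. The following short sketch is not part of the book; the helper name and the sample data are illustrative. It applies the attenuation factors of Example 1 (a commented line shows the factors of Example 2).

    import numpy as np

    def linear_interpolant_coefficients(samples, j_values):
        # c_j(Pf) = tau_j * beta_j(f), tau_j = (sin z / z)^2, z = pi*j/N (Example 1)
        N = len(samples)
        beta = np.fft.fft(samples) / N        # beta_j(f) = (1/N) sum_k f_k e^{-i j x_k}
        coeffs = {}
        for j in j_values:
            z = np.pi * j / N
            tau = 1.0 if j == 0 else (np.sin(z) / z) ** 2
            # for the periodic cubic spline of Example 2 use instead:
            # tau = (np.sin(z)/z)**4 * 3.0 / (1.0 + 2.0*np.cos(z)**2)
            coeffs[j] = tau * beta[j % N]     # beta_j is N-periodic in j
        return coeffs

    # example: coefficients c_j, |j| <= 5, for an arbitrary periodic sample sequence
    c = linear_interpolant_coefficients(np.arange(16) % 4, range(-5, 6))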
2.4 Interpolation by Spline Functions

Spline functions yield smooth interpolating curves which are less likely to exhibit the large oscillations characteristic of high-degree polynomials. They are finding applications in graphics and, increasingly, in numerical methods. For instance, spline functions may be used as trial functions in connection with the Rayleigh–Ritz–Galerkin method for solving boundary-value problems of ordinary and partial differential equations. More recently, they are also used in signal processing.
Spline functions (splines) are connected with a partition

    Δ: a = x_0 < x_1 < ... < x_n = b

of an interval [a, b] by knots x_i, i = 0, 1, ..., n. They are piecewise polynomial functions S: [a, b] → ℝ with certain smoothness properties: the restrictions S|I_i of S to I_i := (x_{i−1}, x_i), i = 1, 2, ..., n, are polynomials. In Sections 2.4.1–2.4.3 we describe only the simple but already representative case of cubic spline functions (splines of degree 3), which are composed of cubic polynomials, S|I_i ∈ Π_3. The general case of splines of degree k and of arbitrary piecewise polynomial functions is treated in Sections 2.4.4 and 2.4.5. The final Section 2.4.6 deals with the basic ideas behind the application of splines in signal processing.
A detailed description of splines can be found in many monographs, for instance in Greville (1969), Schultz (1973), Böhmer (1974), de Boor (1978), Schumaker (1981), Korneichuk (1984), and Nürnberger (1989).

2.4.1 Theoretical Foundations

Let Δ := {a = x_0 < x_1 < ... < x_n = b} be a partition of the interval [a, b].
(2.4.1.1) Definition. A cubic spline (function) S_Δ on Δ is a real function S_Δ: [a, b] → ℝ with the properties:
(a) S_Δ ∈ C²[a, b], that is, S_Δ is twice continuously differentiable on [a, b].
(b) S_Δ coincides on every subinterval [x_i, x_{i+1}], i = 0, 1, ..., n − 1, with a polynomial of degree (at most) three.

Thus a cubic spline consists of cubic polynomials pieced together in such a fashion that their values and those of their first two derivatives coincide at the interior knots x_i, i = 1, ..., n − 1.
Consider a finite sequence Y := (y_0, y_1, ..., y_n) of n + 1 real numbers. We denote by

    S_Δ(Y; ·)

an interpolating spline function S_Δ with S_Δ(Y; x_i) = y_i for i = 0, 1, ..., n.
Such an interpolating cubic spline function S_Δ(Y; ·) is not uniquely determined by the sequence Y of support ordinates. Roughly speaking, there are still two degrees of freedom left, calling for suitable additional requirements. The following three additional requirements are most commonly considered:

(2.4.1.2)
(a)  S″_Δ(Y; a) = S″_Δ(Y; b) = 0,
(b)  S_Δ^{(k)}(Y; a) = S_Δ^{(k)}(Y; b) for k = 0, 1, 2:  S_Δ(Y; ·) is periodic,
(c)  S′_Δ(Y; a) = y′_0,  S′_Δ(Y; b) = y′_n  for given numbers y′_0, y′_n.

We will confirm that each of these three sets of conditions by itself ensures uniqueness of the interpolating spline function S_Δ(Y; ·). A prerequisite of the condition (2.4.1.2b) is, of course, that y_n = y_0.
For this purpose, and to establish a characteristic minimum property of spline functions, we consider the sets

(2.4.1.3)    K^m[a, b],

m > 0 integer, of real functions f: [a, b] → ℝ for which f^{(m−1)} is absolutely continuous⁴ on [a, b] and f^{(m)} ∈ L²[a, b].⁵ By

    K^m_p[a, b]

we denote the set of all functions in K^m[a, b] with f^{(k)}(a) = f^{(k)}(b) for k = 0, 1, ..., m − 1. We call such functions periodic, because they arise as restrictions to [a, b] of functions which are periodic with period b − a.

⁴ See footnote 2 in Section 2.3.4.
⁵ The set L²[a, b] denotes the set of all real functions whose squares are integrable on the interval [a, b], i.e., ∫_a^b |f(t)|² dt exists and is finite.
Note that S_Δ ∈ K³[a, b], and that S_Δ(Y; ·) ∈ K³_p[a, b] if (2.4.1.2b) holds. If f ∈ K²[a, b], then we can define

    ‖f‖² := ∫_a^b |f″(x)|² dx.

Note that ‖f‖ ≥ 0. However, ‖·‖ is not a norm [see (4.4.1)] but only a seminorm, because ‖f‖ = 0 may hold for functions which are not identically zero, for instance, for all linear functions f(x) ≡ cx + d.
We proceed to show a fundamental identity due to Holladay [see for instance Ahlberg, Nilson, and Walsh (1967)].

(2.4.1.4) Theorem. If f ∈ K²[a, b], Δ = {a = x_0 < x_1 < ... < x_n = b} is a partition of the interval [a, b], and if S_Δ is a spline function with knots x_i ∈ Δ, then

    ‖f − S_Δ‖² = ‖f‖² − ‖S_Δ‖² − 2 [ (f′(x) − S′_Δ(x)) S″_Δ(x) |_a^b − Σ_{i=1}^{n} (f(x) − S_Δ(x)) S‴_Δ(x) |_{x_{i−1}^+}^{x_i^−} ].

Here g(x)|_v^u stands for g(u) − g(v), as it is commonly understood in the calculus of integrals. It should be realized, however, that S‴_Δ(x) is piecewise constant with possible discontinuities at the knots x_1, ..., x_{n−1}. Hence we have to use the left and right limits of S‴_Δ(x) at the locations x_i and x_{i−1}, respectively, in the above formula. This is indicated by the notation x_i^−, x_{i−1}^+.
PROOF. By the definition of ‖·‖,

    ‖f − S_Δ‖² = ∫_a^b |f″(x) − S″_Δ(x)|² dx
               = ‖f‖² − 2 ∫_a^b f″(x) S″_Δ(x) dx + ‖S_Δ‖²
               = ‖f‖² − 2 ∫_a^b (f″(x) − S″_Δ(x)) S″_Δ(x) dx − ‖S_Δ‖².

Integration by parts gives for i = 1, 2, ..., n

    ∫_{x_{i−1}}^{x_i} (f″(x) − S″_Δ(x)) S″_Δ(x) dx
        = (f′(x) − S′_Δ(x)) S″_Δ(x) |_{x_{i−1}}^{x_i} − ∫_{x_{i−1}}^{x_i} (f′(x) − S′_Δ(x)) S‴_Δ(x) dx
        = (f′(x) − S′_Δ(x)) S″_Δ(x) |_{x_{i−1}}^{x_i} − (f(x) − S_Δ(x)) S‴_Δ(x) |_{x_{i−1}^+}^{x_i^−}
          + ∫_{x_{i−1}}^{x_i} (f(x) − S_Δ(x)) S_Δ^{(4)}(x) dx.

Now S_Δ^{(4)}(x) ≡ 0 on the subintervals (x_{i−1}, x_i), and f′, S′_Δ, S″_Δ are continuous on [a, b]. Adding these formulas for i = 1, 2, ..., n yields the proposition of the theorem, since

    Σ_{i=1}^{n} (f′(x) − S′_Δ(x)) S″_Δ(x) |_{x_{i−1}}^{x_i} = (f′(x) − S′_Δ(x)) S″_Δ(x) |_a^b.    □
With the help of this theorem we will prove the important minimum-norm property of spline functions.

(2.4.1.5) Theorem. Given a partition Δ := {a = x_0 < x_1 < ... < x_n = b} of the interval [a, b], values Y := (y_0, ..., y_n), and a function f ∈ K²[a, b] with f(x_i) = y_i for i = 0, 1, ..., n, then ‖f‖ ≥ ‖S_Δ(Y; ·)‖, and more precisely

    ‖f − S_Δ(Y; ·)‖² = ‖f‖² − ‖S_Δ(Y; ·)‖² ≥ 0

holds for every spline function S_Δ(Y; ·), provided one of the conditions [compare (2.4.1.2)]

(a)  S″_Δ(Y; a) = S″_Δ(Y; b) = 0,
(b)  f ∈ K²_p[a, b],  S_Δ(Y; ·) periodic,
(c)  S′_Δ(Y; a) = f′(a),  S′_Δ(Y; b) = f′(b),

is met. In each of these cases, the spline function S_Δ(Y; ·) is uniquely determined.
The existence of such spline functions will be shown in Section 2.4.2.
PROOF. In each of the above three cases (2.4.1.5a, b, c), the expression

    (f′(x) − S′_Δ(x)) S″_Δ(x) |_a^b − Σ_{i=1}^{n} (f(x) − S_Δ(x)) S‴_Δ(x) |_{x_{i−1}^+}^{x_i^−}

vanishes in the Holladay identity (2.4.1.4) if S_Δ ≡ S_Δ(Y; ·). This proves the minimum property of the spline function S_Δ(Y; ·). Its uniqueness can be seen as follows: suppose S̄_Δ(Y; ·) is another spline function having the same properties as S_Δ(Y; ·). Letting S̄_Δ(Y; ·) play the role of the function f ∈ K²[a, b] in the theorem, the minimum property of S_Δ(Y; ·) requires that

    ‖S̄_Δ(Y; ·) − S_Δ(Y; ·)‖² = ‖S̄_Δ(Y; ·)‖² − ‖S_Δ(Y; ·)‖² ≥ 0,

and since S_Δ(Y; ·) and S̄_Δ(Y; ·) may switch roles,

    ‖S̄_Δ(Y; ·) − S_Δ(Y; ·)‖² = ∫_a^b |S″_Δ(Y; x) − S̄″_Δ(Y; x)|² dx = 0.
Since S″_Δ(Y; ·) and S̄″_Δ(Y; ·) are both continuous,

    S″_Δ(Y; x) ≡ S̄″_Δ(Y; x),

from which

    S_Δ(Y; x) ≡ S̄_Δ(Y; x) + cx + d

follows by integration. But S_Δ(Y; x) = S̄_Δ(Y; x) holds for x = a, b, and this implies c = d = 0.    □

The minimum-norm property of the spline function expressed in Theorem (2.4.1.5) implies in case (2.4.1.2a) that, among all functions f in K²[a, b] with f(x_i) = y_i, i = 0, 1, ..., n, it is precisely the spline function S_Δ(Y; ·) with S″_Δ(Y; x) = 0 for x = a, b that minimizes the integral

    ∫_a^b |f″(x)|² dx.

The spline function of case (2.4.1.2a) is therefore often referred to as the natural spline function. In cases (2.4.1.2b) and (2.4.1.2c), the corresponding spline functions S_Δ(Y; ·) minimize ‖f‖ over the more restricted sets K²_p[a, b] and {f ∈ K²[a, b] | f′(a) = y′_0, f′(b) = y′_n} ∩ {f | f(x_i) = y_i for i = 0, 1, ..., n}, respectively.
The expression f″(x)(1 + f′(x)²)^{−3/2} indicates the curvature of the function f(x) at x ∈ [a, b]. If |f′(x)| is small compared to 1, then the curvature is approximately equal to f″(x). The value ‖f‖ therefore provides an approximate measure of the total curvature of the function f in the interval [a, b]. In this sense, the natural spline function is the "smoothest" function to interpolate given support points (x_i, y_i), i = 0, 1, ..., n.
2.4.2 Determining Interpolating Cubic Spline Functions

In this section, we will describe computational methods for determining cubic spline functions which assume prescribed values at their knots and satisfy one of the side conditions (2.4.1.2). In the course of this, we will also have proved the existence of such spline functions; their uniqueness has already been established by Theorem (2.4.1.5).
In what follows, Δ = {x_i | i = 0, 1, ..., n} will be a fixed partition of the interval [a, b] by knots a = x_0 < x_1 < ... < x_n = b, and Y = (y_i)_{i=0,1,...,n} will be a sequence of n + 1 prescribed real numbers. In addition let I_j be the subinterval I_j := [x_j, x_{j+1}], j = 0, 1, ..., n − 1, and h_{j+1} := x_{j+1} − x_j its length.
We refer to the values of the second derivatives at the knots x_j ∈ Δ,

(2.4.2.1)    M_j := S″_Δ(Y; x_j),   j = 0, 1, ..., n,
of the desired spline function S_Δ(Y; ·) as the moments M_j of S_Δ(Y; ·). We will show that spline functions are readily characterized by their moments, and that the moments of the interpolating spline function can be calculated as the solution of a system of linear equations.
Note that the second derivative S″_Δ(Y; ·) of the spline function coincides with a linear function in each interval [x_j, x_{j+1}], j = 0, ..., n − 1, and that these linear functions can be described in terms of the moments M_j of S_Δ(Y; ·):

    S″_Δ(Y; x) = M_j (x_{j+1} − x)/h_{j+1} + M_{j+1} (x − x_j)/h_{j+1}   for x ∈ [x_j, x_{j+1}].
By integration,

(2.4.2.2)
    S′_Δ(Y; x) = −M_j (x_{j+1} − x)²/(2h_{j+1}) + M_{j+1} (x − x_j)²/(2h_{j+1}) + A_j,
    S_Δ(Y; x)  =  M_j (x_{j+1} − x)³/(6h_{j+1}) + M_{j+1} (x − x_j)³/(6h_{j+1}) + A_j (x − x_j) + B_j,

for x ∈ [x_j, x_{j+1}], j = 0, 1, ..., n − 1, where A_j, B_j are constants of integration. From S_Δ(Y; x_j) = y_j, S_Δ(Y; x_{j+1}) = y_{j+1} we obtain the following equations for these constants A_j and B_j:

    M_j h_{j+1}²/6 + B_j = y_j,
    M_{j+1} h_{j+1}²/6 + A_j h_{j+1} + B_j = y_{j+1}.

Consequently,

(2.4.2.3)    B_j = y_j − M_j h_{j+1}²/6,
             A_j = (y_{j+1} − y_j)/h_{j+1} − (h_{j+1}/6)(M_{j+1} − M_j).

This yields the following representation of the spline function in terms of its moments:

(2.4.2.4)    S_Δ(Y; x) = α_j + β_j (x − x_j) + γ_j (x − x_j)² + δ_j (x − x_j)³   for x ∈ [x_j, x_{j+1}],

where
    α_j := y_j,    γ_j := M_j/2,

    β_j := S′_Δ(Y; x_j) = −M_j h_{j+1}/2 + A_j = −(2M_j + M_{j+1}) h_{j+1}/6 + (y_{j+1} − y_j)/h_{j+1},

    δ_j := S‴_Δ(Y; x_j^+)/6 = (M_{j+1} − M_j)/(6 h_{j+1}).

Thus S_Δ(Y; ·) has been characterized by its moments M_j. The task of calculating these moments will now be addressed.
The continuity of S′_Δ(Y; ·) at the interior knots x = x_j, j = 1, 2, ..., n − 1 [namely, the relations S′_Δ(Y; x_j^−) = S′_Δ(Y; x_j^+)] yields n − 1 equations for the moments M_j. Substituting the values (2.4.2.3) for A_j and B_j in (2.4.2.2) gives for x ∈ [x_j, x_{j+1}]
    S′_Δ(Y; x) = −M_j (x_{j+1} − x)²/(2h_{j+1}) + M_{j+1} (x − x_j)²/(2h_{j+1}) + (y_{j+1} − y_j)/h_{j+1} − (h_{j+1}/6)(M_{j+1} − M_j).

For j = 1, 2, ..., n − 1, we have therefore

    S′_Δ(Y; x_j^−) = (y_j − y_{j−1})/h_j + (h_j/3) M_j + (h_j/6) M_{j−1},
    S′_Δ(Y; x_j^+) = (y_{j+1} − y_j)/h_{j+1} − (h_{j+1}/3) M_j − (h_{j+1}/6) M_{j+1},

and since S′_Δ(Y; x_j^−) = S′_Δ(Y; x_j^+),

(2.4.2.5)    (h_j/6) M_{j−1} + ((h_j + h_{j+1})/3) M_j + (h_{j+1}/6) M_{j+1} = (y_{j+1} − y_j)/h_{j+1} − (y_j − y_{j−1})/h_j

for j = 1, 2, ..., n − 1. These are n − 1 equations for the n + 1 unknown moments. Two further equations can be gained separately from each of the side conditions (a), (b), and (c) listed in (2.4.1.2).
Case (a):

    S″_Δ(Y; a) = M_0 = 0 = M_n = S″_Δ(Y; b).

Case (b):

    S″_Δ(Y; a) = S″_Δ(Y; b)  ⟹  M_0 = M_n,
    S′_Δ(Y; a) = S′_Δ(Y; b)   ⟹  (h_n/6) M_{n−1} + ((h_n + h_1)/3) M_n + (h_1/6) M_1 = (y_1 − y_0)/h_1 − (y_n − y_{n−1})/h_n.
The latter condition is identical with (2.4.2.5) for j = n if we put

    h_{n+1} := h_1,   y_{n+1} := y_1,   M_{n+1} := M_1.

Recall that (2.4.1.2b) requires y_n = y_0.

Case (c):

    S′_Δ(Y; a) = y′_0  ⟹  (h_1/3) M_0 + (h_1/6) M_1 = (y_1 − y_0)/h_1 − y′_0,
    S′_Δ(Y; b) = y′_n  ⟹  (h_n/6) M_{n−1} + (h_n/3) M_n = y′_n − (y_n − y_{n−1})/h_n.

The last two equations, as well as those in (2.4.2.5), can be written in a common format:

    μ_j M_{j−1} + 2 M_j + λ_j M_{j+1} = d_j,   j = 1, 2, ..., n − 1,

upon introducing the abbreviations (j = 1, 2, ..., n − 1)

(2.4.2.6)    λ_j := h_{j+1}/(h_j + h_{j+1}),   μ_j := 1 − λ_j = h_j/(h_j + h_{j+1}),
             d_j := (6/(h_j + h_{j+1})) ((y_{j+1} − y_j)/h_{j+1} − (y_j − y_{j−1})/h_j).

In case (a), we define in addition

(2.4.2.7)    λ_0 := 0,   d_0 := 0,   μ_n := 0,   d_n := 0,

and in case (c)

(2.4.2.8)    λ_0 := 1,   d_0 := (6/h_1) ((y_1 − y_0)/h_1 − y′_0),
             μ_n := 1,   d_n := (6/h_n) (y′_n − (y_n − y_{n−1})/h_n).
This leads in cases (a) and (c) to a system of linear equations for the moments M_j that reads in matrix notation:

(2.4.2.9)
    \begin{pmatrix}
      2     & \lambda_0 &           &           &               \\
      \mu_1 & 2         & \lambda_1 &           &               \\
            & \ddots    & \ddots    & \ddots    &               \\
            &           & \mu_{n-1} & 2         & \lambda_{n-1} \\
            &           &           & \mu_n     & 2
    \end{pmatrix}
    \begin{pmatrix} M_0 \\ M_1 \\ \vdots \\ M_{n-1} \\ M_n \end{pmatrix}
    =
    \begin{pmatrix} d_0 \\ d_1 \\ \vdots \\ d_{n-1} \\ d_n \end{pmatrix}.
The periodic case (b) also requires further definitions,
(2.4.2.10)    λ_n := h_1/(h_n + h_1),   μ_n := 1 − λ_n = h_n/(h_n + h_1),
              d_n := (6/(h_n + h_1)) { (y_1 − y_n)/h_1 − (y_n − y_{n−1})/h_n },

which then lead to the following linear system of equations for the moments M_1, M_2, ..., M_n (= M_0):

(2.4.2.11)
    \begin{pmatrix}
      2         & \lambda_1 &           &           & \mu_1         \\
      \mu_2     & 2         & \lambda_2 &           &               \\
                & \ddots    & \ddots    & \ddots    &               \\
                &           & \mu_{n-1} & 2         & \lambda_{n-1} \\
      \lambda_n &           &           & \mu_n     & 2
    \end{pmatrix}
    \begin{pmatrix} M_1 \\ M_2 \\ \vdots \\ M_{n-1} \\ M_n \end{pmatrix}
    =
    \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_{n-1} \\ d_n \end{pmatrix}.
The coefficients λ_i, μ_i, d_i in (2.4.2.9) and (2.4.2.11) are well defined by (2.4.2.6) and the additional definitions (2.4.2.7), (2.4.2.8), and (2.4.2.10), respectively. Note in particular that in (2.4.2.9) and (2.4.2.11)

(2.4.2.12)    λ_i ≥ 0,   μ_i ≥ 0,   λ_i + μ_i = 1

for all coefficients λ_i, μ_i, and that these coefficients depend only on the location of the knots x_j ∈ Δ and not on the prescribed values y_i ∈ Y or on y′_0, y′_n in case (c). We will use this observation when proving the following:

(2.4.2.13) Theorem. The systems (2.4.2.9) and (2.4.2.11) of linear equations are nonsingular for any partition Δ of [a, b].

This means that the above systems of linear equations have unique solutions for arbitrary right-hand sides, and that consequently the problem of interpolation by cubic splines has a unique solution in each of the three cases (a), (b), (c) of (2.4.1.2).
PROOF. Consider the (n + 1) × (n + 1) matrix

    A = \begin{pmatrix}
      2     & \lambda_0 &           &           &               \\
      \mu_1 & 2         & \lambda_1 &           &               \\
            & \ddots    & \ddots    & \ddots    &               \\
            &           & \mu_{n-1} & 2         & \lambda_{n-1} \\
            &           &           & \mu_n     & 2
    \end{pmatrix}

of the linear system (2.4.2.9). This matrix has the following property:

(2.4.2.14)    Az = w  ⟹  max_i |z_i| ≤ max_i |w_i|

for every pair of vectors z = (z_0, ..., z_n)^T, w = (w_0, ..., w_n)^T, z, w ∈ ℝ^{n+1}. Indeed, let r be such that |z_r| = max_i |z_i|. From Az = w,
    μ_r z_{r−1} + 2 z_r + λ_r z_{r+1} = w_r   (μ_0 := λ_n := 0).

By the definition of r and because μ_r + λ_r = 1,

    max_i |w_i| ≥ |w_r| ≥ 2|z_r| − μ_r|z_{r−1}| − λ_r|z_{r+1}|
                ≥ 2|z_r| − μ_r|z_r| − λ_r|z_r|
                = (2 − μ_r − λ_r)|z_r|
                = |z_r| = max_i |z_i|.

Suppose the matrix A were singular. Then there would exist a solution z ≠ 0 of Az = 0, and (2.4.2.14) would lead to the contradiction

    0 < max_i |z_i| ≤ 0.

The nonsingularity of the matrix in (2.4.2.11) is shown similarly.    □
To solve the equations (2.4.2.9), we may proceed as follows: subtract μ_1/2 times the first equation from the second, thereby annihilating μ_1, and then a suitable multiple of the second equation from the third to annihilate μ_2, and so on. This leads to a "triangular" system of equations which can be solved in a straightforward fashion [note that this method is the Gaussian elimination algorithm applied to (2.4.2.9); compare Section 4.1]:

(2.4.2.15)
    q_0 := −λ_0/2;   u_0 := d_0/2;   λ_n := 0;
    for k := 1, 2, ..., n:
        p_k := μ_k q_{k−1} + 2;
        q_k := −λ_k/p_k;
        u_k := (d_k − μ_k u_{k−1})/p_k;
    M_n := u_n;
    for k := n − 1, n − 2, ..., 0:
        M_k := q_k M_{k+1} + u_k;

[It can be shown that p_k > 0, so that (2.4.2.15) is well defined; see Exercise 25.] The linear system (2.4.2.11) can be solved in a similar, but not quite as straightforward, fashion. An ALGOL program by C. Reinsch can be found in Bulirsch and Rutishauser (1968).
The reader can find more details in Greville (1969) and de Boor (1972), ALGOL programs in Herriot and Reinsch (1971), and FORTRAN programs in de Boor (1978). These references also contain information and algorithms for spline functions of degree k ≥ 3 and B-splines, which are treated here in Sections 2.4.4 and 2.4.5.
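The elimination (2.4.2.15) is easy to program. The following sketch is a minimal Python transcription for the natural boundary conditions of case (2.4.1.2a); the function names and the evaluation helper are illustrative and are not the ALGOL/FORTRAN codes cited above. The spline is evaluated through the local representation (2.4.2.4).

    import numpy as np

    def natural_spline_moments(x, y):
        # moments M_j of the natural cubic spline: system (2.4.2.9), case (a)
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x) - 1
        h = np.diff(x)                       # h[j] = h_{j+1} = x_{j+1} - x_j
        lam, mu, d = np.zeros(n + 1), np.zeros(n + 1), np.zeros(n + 1)
        for j in range(1, n):                # (2.4.2.6); boundary rows stay 0 in case (a)
            lam[j] = h[j] / (h[j - 1] + h[j])
            mu[j] = 1.0 - lam[j]
            d[j] = 6.0 / (h[j - 1] + h[j]) * ((y[j + 1] - y[j]) / h[j]
                                              - (y[j] - y[j - 1]) / h[j - 1])
        q, u, M = np.zeros(n + 1), np.zeros(n + 1), np.zeros(n + 1)
        q[0], u[0] = -lam[0] / 2.0, d[0] / 2.0
        for k in range(1, n + 1):            # forward elimination (2.4.2.15)
            p = mu[k] * q[k - 1] + 2.0
            q[k] = -lam[k] / p
            u[k] = (d[k] - mu[k] * u[k - 1]) / p
        M[n] = u[n]
        for k in range(n - 1, -1, -1):       # back substitution
            M[k] = q[k] * M[k + 1] + u[k]
        return M

    def spline_eval(x, y, M, xx):
        # evaluate S_D(Y; xx) via (2.4.2.4) on the subinterval containing xx
        x, y = np.asarray(x, float), np.asarray(y, float)
        j = int(np.clip(np.searchsorted(x, xx, side='right') - 1, 0, len(x) - 2))
        h = x[j + 1] - x[j]
        beta = (y[j + 1] - y[j]) / h - (2.0 * M[j] + M[j + 1]) * h / 6.0
        delta = (M[j + 1] - M[j]) / (6.0 * h)
        dx = xx - x[j]
        return y[j] + beta * dx + 0.5 * M[j] * dx**2 + delta * dx**3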
2.4.3 Convergence Properties of Cubic Spline Functions

Interpolating polynomials may not converge to a function f whose values they interpolate, even if the partitions Δ are chosen arbitrarily fine [see Section 2.1.4]. In contrast, we will show in this section that, under mild conditions on the function f and the partitions Δ, the interpolating spline functions do converge towards f as the fineness of the underlying partitions approaches zero.
We will show first that the moments (2.4.2.1) of the interpolating spline function converge to the second derivatives of the given function. More precisely, consider a fixed partition Δ = {a = x_0 < x_1 < ... < x_n = b} of [a, b], and let

    M := (M_0, M_1, ..., M_n)^T

be the vector of moments M_j of the spline function S_Δ(Y; ·) with f(x_j) = y_j for j = 0, ..., n as well as

    S′_Δ(Y; a) = f′(a),   S′_Δ(Y; b) = f′(b).

We are thus dealing with case (c) of (2.4.1.2). The vector M of moments satisfies the equation

    AM = d,

which expresses the linear system of equations (2.4.2.9) in matrix form. The components d_j of d are given by (2.4.2.6) and (2.4.2.8). Let F and r be the vectors

    F := (f″(x_0), f″(x_1), ..., f″(x_n))^T,    r := d − AF = A(M − F).

Writing ‖z‖ := max_i |z_i| for vectors z, and ‖Δ‖ for the fineness

(2.4.3.1)    ‖Δ‖ := max_j |x_{j+1} − x_j|

of the partition Δ, we have:

(2.4.3.2)    If f ∈ C⁴[a, b] and |f^{(4)}(x)| ≤ L for x ∈ [a, b], then

    ‖M − F‖ ≤ ‖r‖ ≤ (3/4) L ‖Δ‖².

PROOF. By definition, r_0 = d_0 − 2f″(x_0) − f″(x_1), and by (2.4.2.8),

    r_0 = (6/h_1) ((y_1 − y_0)/h_1 − y′_0) − 2f″(x_0) − f″(x_1).
Using Taylor's theorem to express y_1 = f(x_1) and f″(x_1) in terms of the value and the derivatives of the function f at x_0, one finds |r_0| ≤ (3/4) L h_1² ≤ (3/4) L ‖Δ‖². Analogously, we find for

    r_n = (6/h_n) (y′_n − (y_n − y_{n−1})/h_n) − f″(x_{n−1}) − 2f″(x_n)

that |r_n| ≤ (3/4) L ‖Δ‖². For the remaining components of r = d − AF, we find similarly

    r_j = d_j − μ_j f″(x_{j−1}) − 2f″(x_j) − λ_j f″(x_{j+1})
        = (6/(h_j + h_{j+1})) ((y_{j+1} − y_j)/h_{j+1} − (y_j − y_{j−1})/h_j)
          − (h_j/(h_j + h_{j+1})) f″(x_{j−1}) − 2f″(x_j) − (h_{j+1}/(h_j + h_{j+1})) f″(x_{j+1}).

Taylor's formula at x_j then shows that the third-derivative terms cancel and only fourth-derivative remainders survive,

    r_j = (h_{j+1}³ f^{(4)}(τ_1) + h_j³ f^{(4)}(τ_2))/(4(h_j + h_{j+1})) − (h_j³ f^{(4)}(τ_3) + h_{j+1}³ f^{(4)}(τ_4))/(2(h_j + h_{j+1})).

Here the intermediate points τ_i lie in [x_{j−1}, x_{j+1}]. Therefore

    |r_j| ≤ (3/4) L (h_j³ + h_{j+1}³)/(h_j + h_{j+1}) ≤ (3/4) L ‖Δ‖²

for j = 1, 2, ..., n − 1. In sum, ‖r‖ ≤ (3/4) L ‖Δ‖², and since r = A(M − F), (2.4.2.14) implies

    ‖M − F‖ ≤ ‖r‖ ≤ (3/4) L ‖Δ‖².    □
(2.4.3.3) Theorem.⁶ Suppose f ∈ C⁴[a, b] and |f^{(4)}(x)| ≤ L for x ∈ [a, b]. Let Δ be a partition Δ = {a = x_0 < x_1 < ... < x_n = b} of the interval [a, b], and K a constant such that

    ‖Δ‖ / |x_{j+1} − x_j| ≤ K   for j = 0, 1, ..., n − 1.

If S_Δ is the spline function which interpolates the values of the function f at the knots x_0, ..., x_n ∈ Δ and satisfies S′_Δ(x) = f′(x) for x = a, b, then there exist constants c_k ≤ 2, which do not depend on the partition Δ, such that for x ∈ [a, b],

    |f^{(k)}(x) − S_Δ^{(k)}(x)| ≤ c_k L ‖Δ‖^{4−k}   for k = 0, 1, 2,
    |f^{(3)}(x) − S_Δ^{(3)}(x)| ≤ c_3 L K ‖Δ‖       for k = 3.

Note that the constant K ≥ 1 bounds the deviation of the partition Δ from uniformity.

⁶ The estimates of Theorem (2.4.3.3) have been improved by Hall and Meyer (1976): |f^{(k)}(x) − S_Δ^{(k)}(x)| ≤ c_k L ‖Δ‖^{4−k}, k = 0, 1, 2, 3, with c_0 := 5/384, c_1 := 1/24, c_2 := 3/8, c_3 := (K + K^{−1})/2. Here c_0 and c_1 are optimal.

PROOF. We prove the proposition first for k = 3. For x ∈ [x_{j−1}, x_j],

    S‴_Δ(x) − f‴(x) = (M_j − M_{j−1})/h_j − f‴(x)
        = (M_j − f″(x_j))/h_j − (M_{j−1} − f″(x_{j−1}))/h_j
          + [ (f″(x_j) − f″(x)) − (f″(x_{j−1}) − f″(x)) ]/h_j − f‴(x).

Using (2.4.3.2) and Taylor's theorem at x, we conclude that
    |S‴_Δ(x) − f‴(x)| ≤ (3/4) L ‖Δ‖²/h_j + (3/4) L ‖Δ‖²/h_j
        + (1/h_j) | (x_j − x) f‴(x) + ((x_j − x)²/2) f^{(4)}(η_1)
                    − (x_{j−1} − x) f‴(x) − ((x_{j−1} − x)²/2) f^{(4)}(η_2) − h_j f‴(x) |
        ≤ (3/2) L ‖Δ‖²/h_j + (1/2) L ‖Δ‖²/h_j = 2 L ‖Δ‖²/h_j ≤ 2 L K ‖Δ‖,

where η_1, η_2 ∈ [x_{j−1}, x_j]. Thus |f‴(x) − S‴_Δ(x)| ≤ 2 L K ‖Δ‖.
To prove the proposition for k = 2, we observe: for each x ∈ [a, b] there exists a closest knot x_j = x_j(x). Assume without loss of generality x ≤ x_j(x) = x_j, so that x ∈ [x_{j−1}, x_j] and |x_j(x) − x| ≤ h_j/2 ≤ ‖Δ‖/2. From

    f″(x) − S″_Δ(x) = f″(x_j(x)) − S″_Δ(x_j(x)) + ∫_{x_j(x)}^{x} (f‴(t) − S‴_Δ(t)) dt,

(2.4.3.2) and the estimate just proved we find

    |f″(x) − S″_Δ(x)| ≤ (3/4) L ‖Δ‖² + (h_j/2) · 2 L ‖Δ‖²/h_j = (7/4) L ‖Δ‖².

We consider k = 1 next. In addition to the boundary points ξ_0 := a, ξ_{n+1} := b, there exist, by Rolle's theorem, n further points ξ_j ∈ (x_{j−1}, x_j), j = 1, 2, ..., n, with

    f′(ξ_j) = S′_Δ(ξ_j),   j = 0, 1, ..., n + 1.

For any x ∈ [a, b] there exists a closest one of the above points ξ_j = ξ_j(x), for which consequently |x − ξ_j(x)| ≤ ‖Δ‖. Thus

    f′(x) − S′_Δ(x) = ∫_{ξ_j(x)}^{x} (f″(t) − S″_Δ(t)) dt,

and

    |f′(x) − S′_Δ(x)| ≤ (7/4) L ‖Δ‖³,   x ∈ [a, b].

The case k = 0 remains. Since

    f(x) − S_Δ(x) = ∫_{x_j(x)}^{x} (f′(t) − S′_Δ(t)) dt,

it follows from the above result for k = 1 that

    |f(x) − S_Δ(x)| ≤ (7/8) L ‖Δ‖⁴,   x ∈ [a, b].    □
Clearly, (2.4.3.3) implies that for sequences

    Δ_m = {a = x_0^{(m)} < x_1^{(m)} < ... < x_{n_m}^{(m)} = b}

of partitions with ‖Δ_m‖ → 0 the corresponding cubic spline functions S_{Δ_m} and their first two derivatives converge to f and its corresponding derivatives uniformly on [a, b]. If in addition

    sup_{m,j} ‖Δ_m‖ / |x_{j+1}^{(m)} − x_j^{(m)}| ≤ K < +∞,

even the third derivative f‴ is uniformly approximated by S‴_{Δ_m}, a usually discontinuous sequence of step functions.
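A quick numerical experiment illustrates the O(‖Δ‖⁴) behavior asserted by Theorem (2.4.3.3). The sketch below is not from the book: it uses SciPy's clamped cubic spline for case (2.4.1.2c), and the test function and uniform partitions are arbitrary choices.

    import numpy as np
    from scipy.interpolate import CubicSpline

    f, fp = np.sin, np.cos
    for n in (8, 16, 32, 64):
        x = np.linspace(0.0, np.pi, n + 1)
        s = CubicSpline(x, f(x), bc_type=((1, fp(x[0])), (1, fp(x[-1]))))
        xx = np.linspace(0.0, np.pi, 2001)
        err = np.max(np.abs(f(xx) - s(xx)))
        print(n, err, err * n**4)   # err * n^4 should stay roughly constant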
2.4.4 B-Splines

Spline functions are instances of piecewise polynomial functions associated with a partition

    Δ = {a = x_0 < x_1 < ... < x_n = b}

of an interval [a, b] by abscissae or knots x_i. In general, a real function f: [a, b] → ℝ is called a piecewise polynomial function of order r or degree r − 1 if, for each i = 0, ..., n − 1, the restriction of f to the subinterval (x_i, x_{i+1}) agrees with a polynomial p_i(x) of degree ≤ r − 1, p_i ∈ Π_{r−1}.
In order to achieve a 1–1 correspondence between f and the sequence (p_0(x), p_1(x), ..., p_{n−1}(x)), we define f at the knots x_i, i = 0, ..., n − 1, so that it becomes continuous from the right, f(x_i) := f(x_i + 0), 0 ≤ i ≤ n − 1, and f(x_n) = f(b) := f(x_n − 0).
Then, generalizing cubic splines, we define spline functions S_Δ of degree k as piecewise polynomial functions of degree k that are k − 1 times differentiable at the interior knots x_i, 1 ≤ i ≤ n − 1, of Δ. By S_{Δ,k} we denote the set of all spline functions S_Δ of degree k, which is easily seen to be a real vector space of dimension n + k. In fact, the polynomial S_Δ|[x_0, x_1] is uniquely determined by its k + 1 coefficients, and this in turn already fixes the first k − 1 derivatives (= k conditions) of the next polynomial S_Δ|[x_1, x_2] at x_1, so that only one degree of freedom is left for choosing S_Δ|[x_1, x_2]. As the same holds for all subsequent polynomials S_Δ|[x_i, x_{i+1}], i = 2, ..., n − 1, one finds dim S_{Δ,k} = k + 1 + (n − 1)·1 = k + n.
B-splines are special piecewise polynomial functions with remarkable properties: they are nonnegative and vanish everywhere except on a few contiguous intervals [x_i, x_{i+1}]. Moreover, the function space S_{Δ,k} has a basis
consisting of B-splines. As a consequence, B-splines provide the framework
for an efficient and numerically stable calculation of splines.
In order to define B-splines, we introduce the function f_x: ℝ → ℝ,

    f_x(t) := (t − x)_+ := max(t − x, 0) = { t − x for t > x,  0 for t ≤ x },

and its powers f_x^r, r ≥ 0, in particular for r = 0,

    f_x^0(t) := { 1 for t > x,  0 for t ≤ x }.

The function f_x^r(·) is composed of two polynomials of degree ≤ r: the 0-polynomial p_0(t) := 0 for t ≤ x and the polynomial p_1(t) := (t − x)^r for t > x. Note that f_x^r depends on a real parameter x, and the definition of f_x^r(t) is such that it is a function of x which is continuous from the right for each fixed t. Clearly, for r ≥ 1, f_x^r(t) is (r − 1) times continuously differentiable both with respect to t and with respect to x.
Recall further that under certain conditions the generalized divided difference f[t_i, t_{i+1}, ..., t_{i+r}] of a real function f(t), f: ℝ → ℝ, is well defined for any segment t_i ≤ t_{i+1} ≤ ... ≤ t_{i+r} of real numbers, even if the t_j are not mutually distinct [see Section 2.1.5, in particular (2.1.5.4)]. The only requirement is that f be (s_j − 1)-times differentiable at t = t_j, j = i, i + 1, ..., i + r, if the value of t_j occurs s_j times in the subsegment t_i ≤ t_{i+1} ≤ ... ≤ t_j terminating with t_j. In this case, by (2.1.5.7)

(2.4.4.1)    f[t_i, ..., t_{i+r}] := f^{(r)}(t_i)/r!   if t_i = t_{i+r},
             f[t_i, ..., t_{i+r}] := (f[t_{i+1}, ..., t_{i+r}] − f[t_i, ..., t_{i+r−1}])/(t_{i+r} − t_i)   otherwise.

It follows by induction over r that the divided differences of the function f are linear combinations of its derivatives at the points t_j:

(2.4.4.2)    f[t_i, t_{i+1}, ..., t_{i+r}] = Σ_{j=i}^{i+r} α_j f^{(s_j−1)}(t_j),

where, in particular, α_{i+r} ≠ 0.
Now let r ≥ 1 be an integer and t = {t_i}_{i∈ℤ} any infinite nondecreasing unbounded sequence of reals t_i with

    inf_i t_i = −∞,   sup_i t_i = +∞,   and   t_i < t_{i+r} for all i ∈ ℤ.

Then the i-th B-spline of order r associated with t is defined as the following function in x:

(2.4.4.3)    B_{i,r,t}(x) := (t_{i+r} − t_i) · f_x^{r−1}[t_i, t_{i+1}, ..., t_{i+r}],
for which we also sometimes write more briefly B_i or B_{i,r} [see Figure 2 for simple examples].
Clearly, B_{i,r,t}(x) is well defined for all x ≠ t_i, t_{i+1}, ..., t_{i+r} and, by (2.4.4.2), (2.4.4.3) is a linear combination of the functions

(2.4.4.4a)    (t_j − x)_+^{r−s_j},   i ≤ j ≤ i + r,

if the value of t_j occurs s_j times within the subsegment t_i ≤ t_{i+1} ≤ ... ≤ t_j. If in addition the value of t_j occurs n_j times within the full segment t_i ≤ t_{i+1} ≤ ... ≤ t_{i+r}, then the definition of s_j shows that for each index σ with 1 ≤ σ ≤ n_j there exists exactly one integer l = l(σ) satisfying

    t_l = t_j,   s_l = σ,   i ≤ l ≤ i + r.

Thus B_{i,r,t} is also a linear combination of the functions

(2.4.4.4b)    (t_j − x)_+^{s},   where r − n_j ≤ s ≤ r − 1,  i ≤ j ≤ i + r.

Hence the function B_{i,r,t} coincides with a polynomial of degree at most r − 1 on each of the open intervals in the set

    { (−∞, t_i), (t_i, t_{i+1}), ..., (t_{i+r−1}, t_{i+r}), (t_{i+r}, +∞) }.

That is, B_{i,r,t}(x) is a piecewise polynomial function in x of order r with respect to the partition of the real axis given by the t_k with k ∈ T_{i,r}, where T_{i,r} denotes the set of indices of the distinct values among t_i, t_{i+1}, ..., t_{i+r}.
Only at the knots x = t_k, k ∈ T_{i,r}, may the function B_{i,r,t}(x) have jump discontinuities, but it is continuous from the right, because the functions (t_j − x)_+^{s} occurring in (2.4.4.4) are functions of x that are continuous from the right everywhere, even when s = 0. Also by (2.4.4.4), for given t = (t_j), the B-spline B_{i,r}(x) ≡ B_{i,r,t}(x) is (r − n_j − 1) times differentiable at x = t_j, if t_j occurs n_j times within t_i, t_{i+1}, ..., t_{i+r} (if n_j = r then B_{i,r}(x) has a jump discontinuity at x = t_j). Hence the order of differentiability of B_{i,r,t}(x) at x = t_j is governed by the number of repetitions of the value of t_j.
We note several important properties of B-splines:

(2.4.4.5) Theorem. (a) B_{i,r,t}(x) = 0 for x ∉ [t_i, t_{i+r}].
(b) B_{i,r,t}(x) > 0 for t_i < x < t_{i+r}.
(c) For all x ∈ ℝ

    Σ_{i∈ℤ} B_{i,r,t}(x) = 1,

and the sum contains only finitely many nonzero terms.
By that theorem, the functions B_{i,r} ≡ B_{i,r,t} are nonnegative weight functions with sum 1 and support [t_i, t_{i+r}],

    supp B_{i,r} := {x | B_{i,r}(x) ≠ 0} = [t_i, t_{i+r}];

they form a "partition of unity."

EXAMPLE. For r = 5 and a sequence t with ... ≤ t_0 < t_1 = t_2 = t_3 < t_4 = t_5 < t_6 ≤ ..., the B-spline B_{1,5} = B_{1,5,t} of order 5 is a piecewise polynomial function of degree 4 with respect to the partition t_3 < t_5 < t_6 of ℝ. For x = t_3, t_5, t_6 it has continuous derivatives up to the orders 1, 2, and 3, respectively, as n_3 = 3, n_5 = 2, and n_6 = 1.

Fig. 2. Some simple B-splines.

PROOF. (a) For x < t_i ≤ t ≤ t_{i+r}, f_x^{r−1}(t) = (t − x)^{r−1} is a polynomial of degree r − 1 in t, which has a vanishing r-th divided difference:

    f_x^{r−1}[t_i, t_{i+1}, ..., t_{i+r}] = 0   ⟹   B_{i,r}(x) = 0

by Theorem (2.1.3.10). On the other hand, if t_i ≤ t ≤ t_{i+r} < x, then f_x^{r−1}(t) = (t − x)_+^{r−1} = 0 trivially, so that again B_{i,r}(x) = 0.
(b) For r = 1 and t_i < x < t_{i+1}, the assertion follows from the definition:

    B_{i,1}(x) = (t_{i+1} − x)_+^0 − (t_i − x)_+^0 = 1 − 0 = 1;

for r > 1 it follows from the recursion (2.4.5.2) for the functions N_{i,r} := B_{i,r}/(t_{i+r} − t_i), which will be derived later on.
(c) Assume first t_j < x < t_{j+1}. Then by (a), B_{i,r}(x) = 0 for all i with i + r ≤ j and for all i ≥ j + 1, so that

    Σ_{i∈ℤ} B_{i,r}(x) = Σ_{i=j−r+1}^{j} B_{i,r}(x).

Now, (2.4.4.1) implies
    (t_{i+r} − t_i) f_x^{r−1}[t_i, ..., t_{i+r}] = f_x^{r−1}[t_{i+1}, ..., t_{i+r}] − f_x^{r−1}[t_i, ..., t_{i+r−1}].

Therefore, the sum telescopes:

    Σ_{i=j−r+1}^{j} B_{i,r}(x) = f_x^{r−1}[t_{j+1}, ..., t_{j+r}] − f_x^{r−1}[t_{j−r+1}, ..., t_j] = 1 − 0 = 1.

The last equality holds because the function f_x^{r−1}(t) = (t − x)^{r−1} is a polynomial of degree r − 1 in t for t_j < x < t_{j+1} ≤ t ≤ t_{j+r}, for which f_x^{r−1}[t_{j+1}, ..., t_{j+r}] = 1 by (2.1.4.3), and the function f_x^{r−1}(t) = (t − x)_+^{r−1} = 0 vanishes for t_{j−r+1} ≤ t ≤ t_j < x < t_{j+1}. For x = t_j, the assertion follows because of the continuity of B_{i,r} from the right.    □

We now return to the space S_{Δ,k} of splines of degree k ≥ 0 and construct a sequence t = (t_j) so that the B-splines B_{i,k+1,t}(x) of order k + 1 form a basis of S_{Δ,k}. For this purpose, we associate the partition

    Δ = {a = x_0 < x_1 < ... < x_n = b}

with any infinite unbounded nondecreasing sequence t = (t_j)_{j∈ℤ} satisfying

    t_j < t_{j+k+1}   for all j,
    t_{−k} ≤ ... ≤ t_0 := x_0 < t_1 := x_1 < ... < t_n := x_n.

In the following only its finite subsequence (t_{−k}, t_{−k+1}, ..., t_{n+k}) will matter. Then a (special case of a) famous result of Curry and Schoenberg (1966) is

(2.4.4.6) Theorem. The restrictions B_{i,k+1,t}|[a, b] of the n + k B-splines B_i ≡ B_{i,k+1,t}, i = −k, −k + 1, ..., n − 1, to [a, b] form a basis of S_{Δ,k}.

PROOF. Each B_i, −k ≤ i ≤ n − 1, agrees with a polynomial of degree ≤ k on the subintervals [x_j, x_{j+1}] of [a, b], and has derivatives up to the order k − n_j = k − 1 at the interior knots t_j = x_j, j = 1, 2, ..., n − 1, of Δ, since these t_j occur only once in t, n_j = 1. This shows B_i|[a, b] ∈ S_{Δ,k} for all i = −k, ..., n − 1. On the other hand we have seen dim S_{Δ,k} = n + k. Hence it remains to show only that the functions B_i, −k ≤ i ≤ n − 1, are linearly independent on [a, b].
Assume Σ_{i=−k}^{n−1} a_i B_i(x) = 0 for all x ∈ [a, b]. Consider the first subinterval [x_0, x_1]. Then
    Σ_{i=−k}^{0} a_i B_i(x) = 0   for x ∈ [x_0, x_1],

because B_i|[x_0, x_1] = 0 for i ≥ 1 by Theorem (2.4.4.5)(a). We now prove that B_{−k}, ..., B_0 are linearly independent on [x_0, x_1], which implies a_{−k} = a_{−k+1} = ... = a_0 = 0.
We claim that for x ∈ [x_0, x_1]

    span {B_{−k}(x), ..., B_0(x)} = span {(t_1 − x)^{k+1−s_1}, ..., (t_{k+1} − x)^{k+1−s_{k+1}}},

if the number t_j occurs exactly s_j times in the sequence t_{−k} ≤ t_{−k+1} ≤ ... ≤ t_j. This is easily seen by induction since by (2.4.4.2), (2.4.4.4a) each B_i(x), −k ≤ i ≤ 0, has the form

    B_i(x) = b_1 (t_1 − x)^{k+1−s_1} + ... + b_{i+k+1} (t_{i+k+1} − x)^{k+1−s_{i+k+1}},   where b_{i+k+1} ≠ 0.

Therefore it suffices to establish the linear independence of the functions

    (t_1 − x)^{k−(s_1−1)}, ..., (t_{k+1} − x)^{k−(s_{k+1}−1)}

on [x_0, x_1], or equivalently, of the normalized functions

(2.4.4.7)    p^{(s_j−1)}(t_j − x),   j = 1, 2, ..., k + 1,

defined by

    p(t) := t^k / k!.

To prove this, assume for some constants c_1, c_2, ..., c_{k+1}

    Σ_{j=1}^{k+1} c_j p^{(s_j−1)}(t_j − x) = 0   for all x ∈ [x_0, x_1].

Then by Taylor expansion the polynomial

    Σ_{j=1}^{k+1} c_j Σ_{l=0}^{k} p^{(l+s_j−1)}(t_j) (−x)^l / l!

vanishes on [x_0, x_1] and therefore must be the zero polynomial, so that for all l = 0, 1, ..., k

    Σ_{j=1}^{k+1} c_j p^{(l+s_j−1)}(t_j) = 0.

But this means that

    Σ_{j=1}^{k+1} c_j Π^{(s_j−1)}(t_j) = 0,
where Π(t) is the row vector

    Π(t) := [p^{(k)}(t), p^{(k−1)}(t), ..., p(t)] = [1, t, ..., t^k/k!].

However, the vectors Π^{(s_j−1)}(t_j), j = 1, 2, ..., k + 1, are linearly independent [see Corollary (2.1.5.5)] because of the unique solvability of the Hermite interpolation problem with respect to the sequence t_1 ≤ t_2 ≤ ... ≤ t_{k+1}. Hence c_1 = ... = c_{k+1} = 0, and therefore also a_i = 0 for i = −k, −k + 1, ..., 0.
Repeating the same reasoning for each subinterval [x_j, x_{j+1}], j = 1, 2, ..., n − 1, of [a, b] we find a_i = 0 for all i = −k, −k + 1, ..., n − 1, so that all B_i, −k ≤ i ≤ n − 1, are linearly independent on [a, b] and their restrictions to [a, b] form a basis of S_{Δ,k}.    □

2.4.5 The Computation of B-Splines

B-splines can be computed recursively. The recursions are based on a remarkable generalization of the Leibniz formula for the derivatives of the product of two functions. Indeed, in terms of divided differences, we find the following.

(2.4.5.1) Product Rule for Divided Differences. Suppose t_i ≤ t_{i+1} ≤ ... ≤ t_{i+k}. Assume further that the function f(t) = g(t)h(t) is a product of two functions that are sufficiently often differentiable at t = t_j, j = i, ..., i + k, so that g[t_i, t_{i+1}, ..., t_{i+k}] and h[t_i, t_{i+1}, ..., t_{i+k}] are defined by (2.4.4.1). Then

    f[t_i, t_{i+1}, ..., t_{i+k}] = Σ_{r=i}^{i+k} g[t_i, t_{i+1}, ..., t_r] h[t_r, t_{r+1}, ..., t_{i+k}].

PROOF. From (2.1.5.8) the polynomials

    Σ_{r=i}^{i+k} g[t_i, ..., t_r] (t − t_i) ⋯ (t − t_{r−1})   and   Σ_{s=i}^{i+k} h[t_s, ..., t_{i+k}] (t − t_{s+1}) ⋯ (t − t_{i+k})

interpolate the functions g and h, respectively, at the points t = t_i, t_{i+1}, ..., t_{i+k} (in the sense of Hermite interpolation, see Section 2.1.5, if the t_j are not mutually distinct). Therefore, the product polynomial
    F(t) := ( Σ_{r=i}^{i+k} g[t_i, ..., t_r] (t − t_i) ⋯ (t − t_{r−1}) ) · ( Σ_{s=i}^{i+k} h[t_s, ..., t_{i+k}] (t − t_{s+1}) ⋯ (t − t_{i+k}) )

also interpolates the function f(t) at t = t_i, ..., t_{i+k}. This product can be written as the sum of two polynomials

    F(t) = Σ_{r,s=i}^{i+k} ... = Σ_{r≤s} ... + Σ_{r>s} ... = P_1(t) + P_2(t).

Since each term of the second sum Σ_{r>s} is a multiple of the polynomial Π_{j=i}^{i+k} (t − t_j), the polynomial P_2(t) interpolates the 0-function at t = t_i, ..., t_{i+k}. Therefore, the polynomial P_1(t), which is of degree ≤ k, interpolates f(t) at t = t_i, ..., t_{i+k}. Hence, P_1(t) is the unique (Hermite) interpolant of f(t) of degree ≤ k.
By (2.1.5.8) the highest coefficient of P_1(t) is f[t_i, ..., t_{i+k}]. A comparison of the coefficients of t^k on both sides of the sum representation P_1(t) = Σ_{r≤s} ... of P_1 proves the desired formula

    f[t_i, ..., t_{i+k}] = Σ_{r=i}^{i+k} g[t_i, ..., t_r] h[t_r, ..., t_{i+k}].    □

Now we use (2.4.5.1) to derive a recursion for the B-splines B_{i,r}(x) ≡ B_{i,r,t}(x) of (2.4.4.3). To do this, it will be convenient to renormalize the functions B_{i,r}(x), letting

    N_{i,r}(x) := B_{i,r}(x)/(t_{i+r} − t_i) = f_x^{r−1}[t_i, ..., t_{i+r}],

for which the following simple recursion holds. For r ≥ 2 and t_i < t_{i+r}:

(2.4.5.2)    N_{i,r}(x) = (x − t_i)/(t_{i+r} − t_i) · N_{i,r−1}(x) + (t_{i+r} − x)/(t_{i+r} − t_i) · N_{i+1,r−1}(x).

PROOF. Suppose first x ≠ t_j for all j. We apply rule (2.4.5.1) to the product

    f_x^{r−1}(t) = (t − x)_+^{r−1} = (t − x)(t − x)_+^{r−2} = g(t) f_x^{r−2}(t).

Noting that g(t) is a linear polynomial in t for which by (2.1.4.3)

    g[t_i] = t_i − x,   g[t_i, t_{i+1}] = 1,   g[t_i, ..., t_j] = 0 for j > i + 1,

we obtain

    N_{i,r}(x) = (t_i − x) f_x^{r−2}[t_i, ..., t_{i+r}] + f_x^{r−2}[t_{i+1}, ..., t_{i+r}]
              = (x − t_i)/(t_{i+r} − t_i) · N_{i,r−1}(x) + (t_{i+r} − x)/(t_{i+r} − t_i) · N_{i+1,r−1}(x),
and this proves (2.4.5.2) for x ≠ t_i, ..., t_{i+r}. The result is furthermore true for all x, since all B_{i,r}(x) are continuous from the right and t_i < t_{i+r}.    □

The proof of (2.4.4.5)(b) can now be completed: By (2.4.5.2), the value N_{i,r}(x) is a convex linear combination of N_{i,r−1}(x) and N_{i+1,r−1}(x) for t_i < x < t_{i+r} with positive weights λ_i(x) = (x − t_i)/(t_{i+r} − t_i) > 0, 1 − λ_i(x) > 0. Also N_{i,r}(x) and B_{i,r}(x) have the same sign, and we already know that B_{i,1}(x) = 0 for x ∉ [t_i, t_{i+1}] and B_{i,1}(x) > 0 for t_i < x < t_{i+1}. Induction over r using (2.4.5.2) shows that B_{i,r}(x) > 0 for t_i < x < t_{i+r}.
The formula

(2.4.5.3)    B_{i,r}(x) = (x − t_i)/(t_{i+r−1} − t_i) · B_{i,r−1}(x) + (t_{i+r} − x)/(t_{i+r} − t_{i+1}) · B_{i+1,r−1}(x)

is equivalent to (2.4.5.2), and represents B_{i,r}(x) directly as a positive linear combination of B_{i,r−1}(x) and B_{i+1,r−1}(x). It can be used to compute the values of all B-splines B_{i,r}(x) = B_{i,r,t}(x) for a given fixed value of x.
To show this, let x be given. Then there is a t_j ∈ t with t_j ≤ x < t_{j+1}. By (2.4.4.5)(a) we know B_{i,r}(x) = 0 for all i, r with x ∉ [t_i, t_{i+r}], i.e., for i ≤ j − r and for i ≥ j + 1. Therefore, in the following tableau of B_{i,r} := B_{i,r}(x), the B_{i,r} vanish at all positions other than those shown:

(2.4.5.4)
    r = 1:   B_{j,1}
    r = 2:   B_{j−1,2},  B_{j,2}
    r = 3:   B_{j−2,3},  B_{j−1,3},  B_{j,3}
    r = 4:   B_{j−3,4},  B_{j−2,4},  B_{j−1,4},  B_{j,4}
    (all entries B_{i,r} with i ≤ j − r or i ≥ j + 1 are 0).

By definition B_{j,1} = B_{j,1}(x) = 1 for t_j ≤ x < t_{j+1}, which determines the first column of (2.4.5.4). The remaining columns can be computed consecutively using recursion (2.4.5.3): each element B_{i,r} is derived from its two left neighbors, B_{i,r−1} and B_{i+1,r−1}.
This method is numerically very stable because only nonnegative multiples
of nonnegative numbers are added together.
EXAMPLE. For t_i = i, i = 0, 1, ..., and x = 3.5 ∈ [t_3, t_4] the following tableau of values B_{i,r} = B_{i,r}(x) is obtained:

    i \ r    1      2      3      4
    0        0      0      0      1/48
    1        0      0      1/8    23/48
    2        0      1/2    6/8    23/48
    3        1      1/2    1/8    1/48
    4        0      0      0      0

For instance, B_{2,4} is obtained from

    B_{2,4} = B_{2,4}(3.5) = ((3.5 − 2)/(5 − 2)) · (6/8) + ((6 − 3.5)/(6 − 3)) · (1/8) = 23/48.
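The tableau is easily generated by a few lines of code. The following sketch (illustrative, not the book's program) evaluates a single B-spline by the recursion (2.4.5.3), with the usual convention that a term with vanishing denominator is dropped (repeated knots), and reproduces B_{2,4}(3.5) = 23/48 as well as the partition of unity.

    def bspline_value(i, r, t, x):
        # B_{i,r,t}(x) via the recursion (2.4.5.3); B_{i,1} is the indicator of [t_i, t_{i+1})
        if r == 1:
            return 1.0 if t[i] <= x < t[i + 1] else 0.0
        left = right = 0.0
        if t[i + r - 1] > t[i]:
            left = (x - t[i]) / (t[i + r - 1] - t[i]) * bspline_value(i, r - 1, t, x)
        if t[i + r] > t[i + 1]:
            right = (t[i + r] - x) / (t[i + r] - t[i + 1]) * bspline_value(i + 1, r - 1, t, x)
        return left + right

    t = list(range(10))                                           # knots t_i = i
    print(bspline_value(2, 4, t, 3.5))                            # 0.47916... = 23/48
    print(sum(bspline_value(i, 4, t, 3.5) for i in range(6)))     # partition of unity: 1.0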
We now consider the interpolation problem for spline functions, namely the problem of finding a spline S that assumes prescribed values at given locations. We may proceed as follows. Assume that r ≥ 1 is an integer and t = (t_i)_{1≤i≤N+r} a finite sequence of real numbers satisfying

    t_1 ≤ t_2 ≤ ... ≤ t_{N+r}

and t_i < t_{i+r} for i = 1, 2, ..., N. Denote by B_i(x) ≡ B_{i,r,t}(x), i = 1, ..., N, the associated B-splines, and by

    S_{r,t} = { Σ_{i=1}^{N} α_i B_i(x) | α_i ∈ ℝ }

the vector space spanned by the B_i, i = 1, ..., N. Further, assume that we are given N pairs (ξ_j, f_j), j = 1, ..., N, of interpolation points with

    ξ_1 < ξ_2 < ... < ξ_N.

These are the data for the interpolation problem of finding a function S ∈ S_{r,t} satisfying

(2.4.5.5)    S(ξ_j) = f_j,   j = 1, ..., N.

Since any S ∈ S_{r,t} can be written as a linear combination of the B_i, i = 1, ..., N, this is equivalent to the problem of solving the linear equations

(2.4.5.6)    Σ_{i=1}^{N} α_i B_i(ξ_j) = f_j,   j = 1, ..., N.

The matrix of this system
    A = \begin{pmatrix} B_1(ξ_1) & \cdots & B_N(ξ_1) \\ \vdots & & \vdots \\ B_1(ξ_N) & \cdots & B_N(ξ_N) \end{pmatrix}

has a special structure: A is a band matrix, because by (2.4.4.5) the functions B_i(x) = B_{i,r,t}(x) have support [t_i, t_{i+r}], so that within the j-th row of A all elements B_i(ξ_j) with t_{i+r} < ξ_j or t_i > ξ_j are zero. Therefore, each row of A contains at most r elements different from 0, and these elements are in consecutive positions. The components B_i(ξ_j) of A can be computed by the recursion described previously. The system (2.4.5.6), and thereby the interpolation problem, is uniquely solvable if A is nonsingular. The nonsingularity of A can be checked by means of the following simple criterion due to Schoenberg and Whitney (1953), which is quoted without proof.

(2.4.5.7) Theorem. The matrix A = (B_i(ξ_j)) of (2.4.5.6) is nonsingular if and only if all its diagonal elements B_i(ξ_i) are nonzero.

It is possible to show [see Karlin (1968)] that the matrix A is totally positive in the following sense: all r × r submatrices B of A of the form

    B = (a_{i_p j_q})_{p,q=1}^{r}   with r ≥ 1, i_1 < i_2 < ... < i_r, j_1 < j_2 < ... < j_r,

have a nonnegative determinant, det(B) ≥ 0. As a consequence, solving (2.4.5.6) for nonsingular A by Gaussian elimination without pivoting is numerically stable [see de Boor and Pinkus (1977)]. Also the band structure of A can be exploited for additional savings.
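In practice one rarely assembles A by hand; library routines set up and solve the banded collocation system (2.4.5.6) directly. The following sketch uses SciPy's make_interp_spline as one such routine (the interpolation points are arbitrary illustrative values); it returns the knot sequence t and the coefficients α_i of the B-spline expansion.

    import numpy as np
    from scipy.interpolate import make_interp_spline

    xi = np.array([0.0, 1.0, 2.5, 3.0, 4.5, 6.0])   # interpolation points xi_j
    fi = np.sin(xi)                                  # prescribed values f_j
    spl = make_interp_spline(xi, fi, k=3)            # cubic B-spline interpolant
    print(spl.t)                                     # knot sequence t
    print(spl.c)                                     # coefficients alpha_i
    print(np.allclose(spl(xi), fi))                  # interpolation conditions hold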
For further properties of B-Splines, their applications, and algorithms
the reader is referred to the literature, in particular to de Boor (1978),
where one can also find numerous FORTRAN programs.
2.4.6 Multi-Resolution Methods and B-Splines

B-splines also play an important role in multi-resolution methods in signal processing. These methods deal with the approximation of real or complex functions f: ℝ → ℂ. In what follows, we consider only square integrable functions f, ∫_{−∞}^{∞} |f(x)|² dx < ∞. It is well known that the set of these functions, L²(ℝ), is a Euclidean vector space (even a Hilbert space) with respect to the scalar product

    (f, g) := ∫_{−∞}^{∞} f(x) \overline{g(x)} dx

and the associated Euclidean norm ‖f‖ := (f, f)^{1/2}. By ℓ² we denote the set of all sequences (c_k)_{k∈ℤ} of complex numbers c_k with Σ_{k∈ℤ} |c_k|² < ∞.
In the multi-resolution analysis (MRA) of functions f ∈ L²(ℝ), closed linear subspaces V_j, j ∈ ℤ, of L²(ℝ) with the following properties play a central role:
(S1)  ... ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ V_2 ⊂ ...,
(S2)  f(x) ∈ V_j  ⟺  f(2x) ∈ V_{j+1},   j ∈ ℤ,
(S3)  f(x) ∈ V_j  ⟺  f(x + 2^{−j}) ∈ V_j,   j ∈ ℤ,
(S4)  the closure of ∪_{j∈ℤ} V_j is L²(ℝ),
(S5)  ∩_{j∈ℤ} V_j = {0}.

Usually, such subspaces are generated by a function Φ(x) ∈ L²(ℝ) as follows: By translation and dilatation of Φ, we first define additional functions (we call them "derivates" of Φ)

    Φ_{j,k}(x) := 2^{j/2} Φ(2^j x − k),   j, k ∈ ℤ.

The space V_j is then defined as the closed linear subspace of L²(ℝ) that is generated by the functions Φ_{j,k}, k ∈ ℤ,

    V_j := closure of span{ Φ_{j,k} | k ∈ ℤ } = closure of { Σ_{|k|≤n} c_k Φ_{j,k} | c_k ∈ ℂ, n ≥ 0 }.

Then (S2) and (S3) are clearly satisfied.

(2.4.6.1) Definition. The function Φ ∈ L²(ℝ) is called a scaling function if the associated spaces V_j satisfy conditions (S1)–(S5) and if, in addition, the functions Φ_{0,k}, k ∈ ℤ, form a Riesz basis of V_0, that is, if there are positive constants 0 < A ≤ B with

    A Σ_{k∈ℤ} |c_k|² ≤ ‖ Σ_{k∈ℤ} c_k Φ_{0,k} ‖² ≤ B Σ_{k∈ℤ} |c_k|²

for all sequences (c_k)_{k∈ℤ} ∈ ℓ² (then also for each j ∈ ℤ, the functions Φ_{j,k}, k ∈ ℤ, will form a Riesz basis of V_j).

If Φ is a scaling function then, for all j ∈ ℤ, each function f ∈ V_j can be written as a convergent (in L²(ℝ)) series

(2.4.6.2)    f = Σ_{k∈ℤ} c_{j,k} Φ_{j,k}

with uniquely determined coefficients c_{j,k}, k ∈ ℤ, with (c_{j,k})_{k∈ℤ} ∈ ℓ².
Also for a scaling function Φ, condition (S1) is equivalent to the condition that Φ satisfies a so-called two-scale relation
(2.4.6.3)    Φ(x) = √2 Σ_{k∈ℤ} h_k Φ(2x − k)

with coefficients (h_k) ∈ ℓ². For, (2.4.6.3) is equivalent to Φ ∈ V_0 ⊂ V_1. But then also V_j ⊂ V_{j+1} for all j ∈ ℤ because of

    Φ_{j,l}(x) = 2^{j/2} Φ(2^j x − l) = 2^{j/2} √2 Σ_{k∈ℤ} h_k Φ(2^{j+1} x − 2l − k) = Σ_{k∈ℤ} h_k Φ_{j+1,k+2l}(x).

EXAMPLE. A particularly simple scaling function is the Haar function

(2.4.6.4)    Φ(x) := { 1 for 0 ≤ x < 1,  0 otherwise }.

The associated two-scale relation is

    Φ(x) = Φ(2x) + Φ(2x − 1).

See Figure 3 for an illustration of some of its derivates Φ_{j,k}.

Fig. 3. Some derivates Φ_{j,k} of the Haar function.

Scaling functions, and the associated spaces V_j they generate, play a central role in modern signal- and image-processing methods. For example, any function f ∈ V_j, f = Σ_k c_{j,k} Φ_{j,k}, can be considered as a function with a finite resolution (e.g. a signal or, in two dimensions, a picture obtained by scanning) where only details of size 2^{−j} are resolved (this is illustrated most clearly by the space V_j belonging to the Haar function). A common application is to approximate a high resolution function f, that is a function f ∈ V_j with j large, by a coarser function f̃ (say an f̃ ∈ V_k with k < j) without losing too much information. This is the purpose of multi-resolution methods, whose concepts we wish to explain in this section.
We first describe additional scaling functions based on splines. The Haar function Φ is the simplest B-spline of order r = 1 with respect to the special sequence t := (k)_{k∈ℤ} of all integers: Definition (2.4.4.3) shows immediately

    B_{0,1,t}(x) = f_x^0[0, 1] = (1 − x)_+^0 − (0 − x)_+^0 = Φ(x).

Moreover, for all k ∈ ℤ, their translates are

    B_{k,1,t}(x) = f_x^0[k, k + 1] = B_{0,1,t}(x − k) = Φ(x − k) = Φ_{0,k}(x).

This suggests considering the general B-splines

(2.4.6.5)    M_r(x) := B_{0,r,t}(x) = r · f_x^{r−1}[0, 1, ..., r]

of arbitrary order r ≥ 1 as candidates for scaling functions, because they also generate all B-splines

    B_{k,r,t}(x) = r · f_x^{r−1}[k, k + 1, ..., k + r],   k ∈ ℤ,

by translation, B_{k,r,t}(x) = B_{0,r,t}(x − k) = M_r(x − k). Advantages are that their smoothness increases with r since, by definition, M_r(x) = B_{0,r,t}(x) is r − 2 times continuously differentiable, and that they have compact support [see Theorem (2.4.4.5)]. It can be shown that all M_r, r ≥ 1, are in fact scaling functions in the sense of Definition (2.4.6.1); for example, their two-scale relation reads

    M_r(x) = 2^{−r+1} Σ_{l=0}^{r} \binom{r}{l} M_r(2x − l),   x ∈ ℝ.

For proofs we refer the reader to the literature [see Chui (1992)]. In the interest of simplicity, we only consider low order instances of these functions and their properties.

EXAMPLES. For low order r one finds the following B-splines M_r := B_{0,r,t}: M_1(x) is the Haar function (2.4.6.4), and for r = 2, 3 one obtains:

    M_2(x) = { x       for 0 ≤ x ≤ 1,
               2 − x   for 1 ≤ x ≤ 2,
               0       otherwise,

    M_3(x) = (1/2) · { x²              for 0 ≤ x ≤ 1,
                       −2x² + 6x − 3   for 1 ≤ x ≤ 2,
                       (3 − x)²        for 2 ≤ x ≤ 3,
                       0               otherwise }.
The translate H(x) := M_2(x + 1) of M_2(x),

(2.4.6.6)    H(x) = { 1 − |x|  for −1 ≤ x ≤ 1,   0  otherwise },

is known as the hat function.
These formulas are special cases of the following representation of M_r(x) for general r ≥ 1:

(2.4.6.7)    M_r(x) = (1/(r−1)!) Σ_{k=0}^{r} (−1)^k \binom{r}{k} (x − k)_+^{r−1}.

This is readily verified by induction over r, taking the recursion (k = 1, 2, ..., r) (2.1.3.5) for divided differences

    f_x^{r−1}[i, i+1, ..., i+k] = (1/k) ( f_x^{r−1}[i+1, ..., i+k] − f_x^{r−1}[i, ..., i+k−1] )

and f_x^{r−1}(t) = (t − x)_+^{r−1} into account.
We now describe multi-resolution methods in more detail. As already stated, their aim is to approximate, as well as possible, a given "high resolution" function v_{j+1} ∈ V_{j+1} by a function of lower resolution, say by a function v_j ∈ V_j with

    ‖v_{j+1} − v_j‖ = min{ ‖v_{j+1} − v‖ | v ∈ V_j }.

Since v_{j+1} − v_j is then orthogonal to V_j, and v_j ∈ V_j ⊂ V_{j+1}, the orthogonal subspace W_j of V_j in V_{j+1} comes into play,

    W_j := { w ∈ V_{j+1} | (w, v) = 0 for all v ∈ V_j },   j ∈ ℤ.

We write V_{j+1} = V_j ⊕ W_j to indicate that every v_{j+1} ∈ V_{j+1} has a unique representation as a sum v_{j+1} = v_j + w_j of two orthogonal functions v_j ∈ V_j and w_j ∈ W_j.
The spaces W_j are mutually orthogonal, W_j ⊥ W_k for j ≠ k (i.e. (v, w) = 0 for v ∈ W_j, w ∈ W_k). If, say, j < k, this follows from W_j ⊂ V_{j+1} ⊂ V_k and W_k ⊥ V_k. More generally, (S4) and (S5) then imply

    L²(ℝ) = ⊕_{j∈ℤ} W_j = ... ⊕ W_{−1} ⊕ W_0 ⊕ W_1 ⊕ ....

For a given v_{j+1} ∈ V_{j+1}, multi-resolution methods compute for m ≤ j the orthogonal decompositions v_{m+1} = v_m + w_m, v_m ∈ V_m, w_m ∈ W_m, according to the following scheme:

(2.4.6.8)    v_{j+1} → v_j → v_{j−1} → ... → v_m
                   ↘       ↘                ↘
                    w_j     w_{j−1}    ...    w_m
Then for m ≤ j, v_m ∈ V_m is also the best approximation of v_{j+1} by an element of V_m, that is,

    ‖v_{j+1} − v_m‖ = min{ ‖v_{j+1} − v‖ | v ∈ V_m }.

This is because the vectors w_k are mutually orthogonal and w_k ⊥ V_m for k ≥ m. Hence for arbitrary v ∈ V_m, v_m − v ∈ V_m and

    v_{j+1} − v = w_j + w_{j−1} + ... + w_m + (v_m − v),

so that

    ‖v_{j+1} − v‖² = ‖w_j‖² + ... + ‖w_m‖² + ‖v_m − v‖² ≥ ‖v_{j+1} − v_m‖².

In order to compute the orthogonal decomposition v_{j+1} = v_j + w_j of an arbitrary v_{j+1} ∈ V_{j+1}, we need the dual function Φ̃ of the scaling function Φ. As will be shown later, Φ̃ is uniquely determined by the property

(2.4.6.9)    (Φ_{0,k}, Φ̃) = ∫_{−∞}^{∞} Φ(x − k) Φ̃(x) dx = δ_{k,0},   k ∈ ℤ.

With the help of this dual function Φ̃, we are able to compute the coefficients c_k of the representation (2.4.6.2) of a function f ∈ V_0, f = Σ_l c_l Φ_{0,l}, as scalar products:

    (f(x), Φ̃(x − k)) = Σ_l c_l ∫_{−∞}^{∞} Φ(x − l) Φ̃(x − k) dx = Σ_l c_l ∫_{−∞}^{∞} Φ(x − l + k) Φ̃(x) dx = c_k.

Also, for any j ∈ ℤ, the functions Φ̃_{j,k}(x) := 2^{j/2} Φ̃(2^j x − k), k ∈ ℤ, satisfy

(2.4.6.10)    (Φ_{j,k}, Φ̃_{j,l}) = 2^j ∫_{−∞}^{∞} Φ(2^j x − k) Φ̃(2^j x − l) dx = ∫_{−∞}^{∞} Φ(x − k + l) Φ̃(x) dx = δ_{k−l,0},   k, l ∈ ℤ.

The dual function Φ̃ may thus be used to compute also the coefficients c_{j,l} of the series f = Σ_{k∈ℤ} c_{j,k} Φ_{j,k} representing a function f ∈ V_j:

(2.4.6.11)    (f, Φ̃_{j,l}) = Σ_k c_{j,k} (Φ_{j,k}, Φ̃_{j,l}) = c_{j,l}.
The following theorem ensures the existence of dual functions:
(2.4.6.12) Theorem. To each scaling function Φ there exists exactly one dual function Φ̃ ∈ V_0.

PROOF. Since the Φ_{0,k}, k ∈ ℤ, form a Riesz basis of V_0, the function Φ_{0,0} = Φ is not contained in the subspace

    V := closure of span{ Φ_{0,k} | |k| ≥ 1 }

of V_0. Therefore, there exists exactly one element g = Φ − u ≠ 0 with u ∈ V, so that

    ‖g‖ = min{ ‖Φ − v‖ | v ∈ V }.

u is the (unique!) orthogonal projection of Φ onto V, hence g ⊥ V, i.e. (v, g) = 0 for all v ∈ V, and in particular,

    (Φ_{0,k}, g) = 0   for all |k| ≥ 1,

and 0 < ‖g‖² = (Φ − u, Φ − u) = (Φ − u, g) = (Φ, g). The function Φ̃ := g/(Φ, g) therefore has the properties of a dual function.    □

EXAMPLE. The Haar function (2.4.6.4) is self-dual, Φ̃ = Φ. This follows from the orthonormality property (Φ, Φ_{0,k}) = δ_{k,0}, k ∈ ℤ.

It is, furthermore, possible to compute the best approximation v_j ∈ V_j of a given v_{j+1} ∈ V_{j+1} using the dual function Φ̃. We assume that v_{j+1} ∈ V_{j+1} is given by its series representation v_{j+1} = Σ_{k∈ℤ} c_{j+1,k} Φ_{j+1,k}. Because of v_j ∈ V_j, the function v_j we are looking for has the form v_j = Σ_{k∈ℤ} c_{j,k} Φ_{j,k}. Its coefficients c_{j,k} have to be determined so that v_{j+1} − v_j ⊥ V_j,

    (v_j, v) = (v_{j+1}, v)   for all v ∈ V_j.

Now, since Φ̃ ∈ V_0, all functions Φ̃_{j,k}, k ∈ ℤ, belong to V_j, so that by (2.4.6.10)

    c_{j,l} = (v_j, Φ̃_{j,l}) = (v_{j+1}, Φ̃_{j,l})
            = Σ_{k∈ℤ} c_{j+1,k} (Φ_{j+1,k}, Φ̃_{j,l})
            = Σ_{k∈ℤ} c_{j+1,k} 2^{j+1/2} ∫_{−∞}^{∞} Φ(2^{j+1} x − k) Φ̃(2^j x − l) dx
            = Σ_{k∈ℤ} c_{j+1,k} √2 ∫_{−∞}^{∞} Φ(2x + 2l − k) Φ̃(x) dx.

This leads to the following method for computing the coefficients c_{j,k} from the coefficients c_{j+1,k}: First, we compute the numbers (they do not depend on j)
(2.4.6.13)    γ_i := √2 ∫_{−∞}^{∞} Φ(2x + i) Φ̃(x) dx,   i ∈ ℤ.

The coefficients c_{j,l} are then obtained by means of the formula

(2.4.6.14)    c_{j,l} = Σ_{k∈ℤ} c_{j+1,k} γ_{2l−k},   l ∈ ℤ.

EXAMPLE. Because of Φ̃ = Φ for the Haar function, the numbers γ_i are given by

    γ_i := { 1/2  for i = 0, −1,   0  otherwise }.

Since Φ has compact support, only finitely many γ_i are nonzero, so that also the sums (2.4.6.14) are finite in this case:

    c_{j,l} = (c_{j+1,2l} + c_{j+1,2l+1})/2,   l ∈ ℤ.

In general, the sums in (2.4.6.14) are infinite and thus have to be approximated by finite sums (truncation after finitely many terms). The efficiency of the method thus depends on how fast the numbers γ_i converge to 0 as |i| → ∞.
It is possible to iterate the method following the scheme (2.4.6.8). Note that the numbers γ_i have to be computed only once since they are independent of j. In this way we obtain the following multi-resolution algorithm for computing the coefficients c_{m,k}, k ∈ ℤ, of the series v_m = Σ_{k∈ℤ} c_{m,k} Φ_{m,k} for all m ≤ j:

(2.4.6.15)  Given: c_{j+1,k}, k ∈ ℤ, and γ_i, i ∈ ℤ [see (2.4.6.13)].
            For m = j, j − 1, ...:
                for l ∈ ℤ:
                    c_{m,l} := Σ_{k∈ℤ} c_{m+1,k} γ_{2l−k}.

We have seen that this algorithm is particularly simple for the Haar function as scaling function.
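For the Haar scaling function one step of (2.4.6.15) reduces to averaging pairs of coefficients, since γ_0 = γ_{−1} = 1/2 and all other γ_i vanish. A minimal sketch (the input coefficients are an arbitrary example, not taken from the text):

    import numpy as np

    def coarsen_haar(c_fine):
        # c_{m,l} = sum_k c_{m+1,k} gamma_{2l-k}; for Haar only k = 2l, 2l+1 contribute
        c_fine = np.asarray(c_fine, float)
        return 0.5 * (c_fine[0::2] + c_fine[1::2])

    c3 = np.array([4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 3.0, 9.0])   # coefficients c_{j+1,k}
    c2 = coarsen_haar(c3)      # coefficients of the best approximation in V_j
    print(c2)                  # [3. 6. 1. 6.]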
We now consider scaling functions that are given by higher order B-splines, Φ(x) = Φ_r(x) := M_r(x), r > 1. For reasons of simplicity we treat only the case r = 2, which is already typical. Since the scaling function M_2(x) and the hat function H(x) = M_2(x + 1) generate the same spaces V_j, j ∈ ℤ, the investigation of Φ(x) := H(x) is sufficient.
First we show the property (S1), V_j ⊂ V_{j+1}, for the linear spaces generated by Φ(x) = H(x) and its derivates Φ_{j,k}(x) = 2^{j/2} Φ(2^j x − k), j, k ∈ ℤ. This follows from the two-scale relation of H(x),

    H(x) = (1/2) ( H(2x + 1) + 2 H(2x) + H(2x − 1) ),

which is immediately verified using the definition (2.4.6.6) of H(x). We leave it to the reader to prove that H(x) also has the remaining properties required for a scaling function [see Exercise 32].
The functions f ∈ V_j have a simple structure: they are continuous and piecewise linear functions on ℝ with respect to the partition Δ_j of fineness 2^{−j} given by the knots

(2.4.6.16)    Δ_j := { k · 2^{−j} | k ∈ ℤ }.

Up to a common factor, the coefficients c_{j,k} of the series representation f = Σ_{k∈ℤ} c_{j,k} Φ_{j,k} are given by the values of f on Δ_j:

(2.4.6.17)    c_{j,k} = 2^{−j/2} f(k · 2^{−j}),   k ∈ ℤ.

Unfortunately, the functions Φ_{0,k}(x) = H(x − k), k ∈ ℤ, do not form an orthogonal system of functions: A short direct calculation shows

(2.4.6.18)    (Φ_{0,k}, Φ_{0,l}) = ∫_{−∞}^{∞} H(x − k + l) H(x) dx = { 2/3 for k = l,   1/6 for |k − l| = 1,   0 otherwise }.

As a consequence, the dual function H̃ will be different from H, but it can be calculated explicitly. According to the proof of Theorem (2.4.6.12), H̃ is a scalar multiple of the function

    g(x) = H(x) − Σ_{|k|≥1} a_k H(x − k),

where the sequence (a_k)_{|k|≥1} is the unique solution of the following equations:

    (g, Φ_{0,k}) = ( H(x) − Σ_{|l|≥1} a_l H(x − l), H(x − k) ) = 0   for all |k| ≥ 1

that satisfies Σ_k |a_k|² < ∞. By (2.4.6.18), this leads for k ≥ 1 to the equations

(2.4.6.19)    4a_1 + a_2 = 1                       (k = 1),
              a_{k−1} + 4a_k + a_{k+1} = 0         (k ≥ 2),

and for k ≤ −1 to
    4a_{−1} + a_{−2} = 1                       (k = −1),
    a_{k+1} + 4a_k + a_{k−1} = 0               (k ≤ −2).

It is easily seen that with (a_k) also (a_{−k}) is a solution of these equations, so that by the uniqueness of the solution a_k = a_{−k} for all |k| ≥ 1. Therefore it is sufficient to find the solutions a_k, k ≥ 1, of (2.4.6.19).
According to these equations, the sequence (a_k)_{k≥1} is a solution of the homogeneous linear difference equation

    c_{k−1} + 4c_k + c_{k+1} = 0,   k = 2, 3, ....

The sequence c_k := θ^k, k ≥ 1, is clearly a solution of this difference equation if θ is a zero of the polynomial x² + 4x + 1 = 0. This polynomial, however, has the two different zeros

    λ := −2 + √3 = −1/(2 + √3),   μ := −2 − √3 = 1/λ,

with |λ| < 1 < |μ|. The general solution of the difference equation is an arbitrary linear combination of these special solutions (λ^k)_{k≥1} and (μ^k)_{k≥1}, that is, the desired solution a_k has the form a_k = Cλ^k + Dμ^k, k ≥ 1, with appropriately chosen constants C, D. The condition Σ_k |a_k|² < ∞ implies D = 0 because of |μ| > 1. The constant C has to be chosen such that also the first equation (2.4.6.19), 4a_1 + a_2 = 1, is satisfied, that is,

    1 = Cλ(4 + λ) = Cλ(2 + √3) = −C.

This implies a_k = a_{−k} = −λ^k = −(−1)^k (2 + √3)^{−k} for k ≥ 1, so that

    g(x) = H(x) + Σ_{k≥1} (−1)^k (2 + √3)^{−k} ( H(x + k) + H(x − k) ).

The dual function H̃(x) = γ g(x) we are looking for is a scalar multiple of g, where γ is determined by the condition (H, H̃) = 1. Using (2.4.6.18) this leads to

    γ^{−1} = (H, g) = (H, H) − (2 + √3)^{−1} ( (H(x), H(x − 1)) + (H(x), H(x + 1)) )
           = 2/3 − 1/(3(2 + √3)) = 1/√3,

so that γ = √3. Therefore, the dual function H̃(x) is given by the infinite series

    H̃(x) = Σ_{k∈ℤ} b_k H(x − k)

with coefficients

    b_k := (−1)^k √3 / (2 + √3)^{|k|},   k ∈ ℤ.
The computation of the numbers γ_i of (2.4.6.13) for the multi-resolution method (2.4.6.15) leads to simple finite sums with known terms,

(2.4.6.20)    γ_i = √2 (H(2x + i), H̃(x)) = √2 Σ_{k∈ℤ} b_k (H(2x + i), H(x − k))
                 = √2 Σ_{k∈ℤ} b_k (H(2x + i + 2k), H(x))
                 = √2 Σ_{k: |2k+i| ≤ 2} b_k (H(2x + i + 2k), H(x)),

because an elementary calculation shows

    (H(2x + j), H(x)) = { 5/12 for j = 0,   1/4 for |j| = 1,   1/24 for |j| = 2,   0 otherwise }.

Since 2 + √3 = 3.73..., the numbers b_k = O((2 + √3)^{−|k|}) and, consequently, the numbers γ_k converge rapidly enough to zero as |k| → ∞, so that the sums (2.4.6.14) of the multi-resolution method are well approximated by relatively short finite sums.
EXAMPLE. For small i, one finds the following values of γ_i:

    i    γ_i                i    γ_i
    0    0.96592 58262      6   −0.01178 07514
    1    0.44828 77361      7   −0.00862 41086
    2   −0.16408 46996      8    0.00315 66428
    3   −0.12011 83369      9    0.00231 08229
    4    0.04396 63628     10   −0.00084 58199
    5    0.03218 56114

Incidentally, the symmetry relation γ_{−i} = γ_i holds for all i ∈ ℤ [see Exercise 33].
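The entries of this table can be reproduced directly from (2.4.6.20), the coefficients b_k, and the three inner-product values listed above. A small sketch (not the book's program; function names are illustrative):

    import math

    def inner(j):                       # (H(2x + j), H(x))
        return {0: 5/12, 1: 1/4, 2: 1/24}.get(abs(j), 0.0)

    def b(k):                           # b_k = (-1)^k sqrt(3)/(2+sqrt(3))^{|k|}
        return (-1)**k * math.sqrt(3) / (2 + math.sqrt(3))**abs(k)

    def gamma(i, kmax=60):              # (2.4.6.20); terms with |2k+i| > 2 vanish
        return math.sqrt(2) * sum(b(k) * inner(i + 2*k) for k in range(-kmax, kmax + 1))

    print([gamma(i) for i in range(4)])  # compare with the table of gamma_i above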
There is a simple interpretation of this multi-resolution method: A function f = Σ_{k∈ℤ} c_{j+1,k} Φ_{j+1,k} ∈ V_{j+1} is a continuous function which is piecewise linear with respect to the partition Δ_{j+1} = { k · 2^{−(j+1)} | k ∈ ℤ } of ℝ with [see (2.4.6.17)]

    c_{j+1,k} = 2^{−(j+1)/2} f(k · 2^{−(j+1)}),   k ∈ ℤ.

It is approximated optimally by a continuous function f̃ = Σ_{k∈ℤ} c_{j,k} Φ_{j,k} which is piecewise linear with respect to the coarser partition Δ_j = { l · 2^{−j} | l ∈ ℤ } if we set for l ∈ ℤ

    f̃(l · 2^{−j}) = 2^{j/2} c_{j,l},   c_{j,l} = Σ_{k∈ℤ} c_{j+1,k} γ_{2l−k}.

The numbers γ_i thus have the following nice interpretation: The continuous piecewise linear (with respect to Δ_0 = { l | l ∈ ℤ }) function f̄(x) with f̄(l) := γ_{2l}/√2, l ∈ ℤ, is the best approximation in V_0 of the compressed hat function f(x) := H(2x) ∈ V_1. The translated function h(x) := H(2x − 1) ∈ V_1 is best approximated by f̄_1 ∈ V_0 where f̄_1(l) := γ_{2l−1}/√2, l ∈ ℤ [see Figure 4].

Fig. 4. The functions f and f̄.
We now return to general scaling functions. Within the framework of multi-resolution methods it is essential that the functions Φ_{j,k}, k ∈ ℤ, form a Riesz basis of the spaces V_j. It is of equal importance that also the orthogonal complements W_j of V_j in V_{j+1} have similarly simple Riesz bases: One can show [proofs can be found e.g. in Daubechies (1992), Louis et al. (1994)] that to any scaling function Φ generating the linear spaces V_j there is a function Ψ ∈ W_0 with the following properties:
a) The functions Ψ_{0,k}(x) := Ψ(x − k), k ∈ ℤ, form a Riesz basis [see (2.4.6.1)] of W_0, and for each j ∈ ℤ its derivates Ψ_{j,k}(x) = 2^{j/2} Ψ(2^j x − k), k ∈ ℤ, form a Riesz basis of W_j.
b) The functions Ψ_{j,k}, j, k ∈ ℤ, form a Riesz basis of L²(ℝ).

Each function Ψ with these properties is called a wavelet belonging to the scaling function Φ (wavelets are not uniquely determined by Φ). Ψ is called an orthonormal wavelet if, in addition, the Ψ_{0,k}, k ∈ ℤ, form an orthonormal basis of W_0, (Ψ_{0,k}, Ψ_{0,l}) = δ_{k,l}. In this case an orthonormal basis of L²(ℝ) is given by the derivates Ψ_{j,k}, j, k ∈ ℤ, of Ψ:

    (Ψ_{j,k}, Ψ_{l,m}) = δ_{j,l} δ_{k,m}.

EXAMPLE. To the Haar function Φ (2.4.6.4) belongs the Haar wavelet defined by

    Ψ(x) := { 1 for 0 ≤ x < 1/2,   −1 for 1/2 ≤ x < 1,   0 otherwise }.

Indeed, the relation Ψ(x) = Φ(2x) − Φ(2x − 1) first implies Ψ_{0,k} ∈ V_1 for all k ∈ ℤ; then the orthogonality

    (Φ_{0,k}, Ψ_{0,l}) = 0   for all k, l ∈ ℤ

gives Ψ_{0,k} ∈ W_0 for all k ∈ ℤ, and finally

    Φ(2x) = (Φ(x) + Ψ(x))/2,   Φ(2x − 1) = (Φ(x) − Ψ(x))/2

proves V_1 = V_0 ⊕ W_0. Also, since (Ψ, Ψ_{0,k}) = δ_{k,0}, the Haar wavelet is orthonormal.

One can show that every orthonormal scaling function Φ also has an orthonormal wavelet, which can even be given explicitly by means of the two-scale relation of Φ [see e.g. the literature cited above]. It is very difficult, however, to determine an associated wavelet for a non-orthonormal Φ. The situation is better for the scaling functions Φ = Φ_r := M_r given by B-splines. Here, explicit formulas are known for the special wavelets Ψ_r that have a minimal compact support (namely the interval [0, 2r − 1]) among all wavelets belonging to M_r: they are given by finite sums of the form

    Ψ_r(x) = Σ_{k=0}^{3r−2} q_k M_r(2x − k).

However, these wavelets are not orthonormal for r > 1, but one still knows explicitly their associated dual functions Ψ̃_r ∈ W_0 with their usual properties (Ψ_r(x − k), Ψ̃_r(x)) = δ_{k,0} for k ∈ ℤ [see Chui (1992) for a comprehensive account].
The example of the Haar function shows how simple the situation becomes if both the scaling function Φ and the wavelet Ψ are orthonormal and if both functions have compact support. A drawback of the Haar function is that it is not even continuous, whereas already M_2 is continuous, and the smoothness of the B-splines M_r with r ≥ 3 and their wavelets Ψ_r even increases with r. Also, for all r ≥ 1, Φ_r = M_r has compact support, and has a wavelet Ψ_r with compact support. But unfortunately neither Φ_r nor Ψ_r are orthonormal for r > 1.
Therefore the following deep result of Daubechies is very remarkable: she was the first to give explicit examples of scaling functions and associated wavelets that have an arbitrarily high order of differentiability, have compact support, and are orthonormal. For a detailed exposition of these important results and their applications, we refer the reader to the relevant special literature [see e.g. Louis et al. (1994), Mallat (1998), and, in particular, Daubechies (1992)].
EXERCISES FOR CHAPTER 2
1.
Let L; (x) be the Lagrange polynomials (2.1.1.3) for pairwise different support
abscissas Xo, . .. , Xn , and let C; := L;(O). Show that
(a)
n
LCiXi =
;=0
{I
0
n
(-1) XOX1·· .Xn
for j = 0,
for j = 1,2, ... ,n,
for j = n + 1,
n
(b)
L Li(x) == 1.
i=O
2.
Interpolate the function In x by a quadratic polynomial at x = 10, 11, 12.
(a) Estimate the error committed for x = 11.1 when approximating lnx by
the interpolating polynomial.
(b) How does the sign of the error depend on x?
3.
Consider a function f which is twice continuously differentiable on the interval I = [-1, 1]. Interpolate the function by a linear polynomial through the
support points (Xi, f(Xi)), i = 0, 1, xo, Xl E I. Verify that
0: =
~ max If" (~)I max I(x - xo)(X - Xl)1
~E[
xE[
is an upper bound for the maximal absolute interpolation error on the interval
I. Which values xo, Xl minimize o:? What is the connection between (x xo)(x - Xl) and cos(2 arccos x)?
4.
Suppose a function f(x) is interpolated on the interval [a, b] by a polynomial
Pn(X) whose degree does not exceed n. Suppose further that f is arbitrarily
often differentiable on [a, b] and that there exists M such that If(i)(x)1 ::; M
for i = 0, 1, : .. and any x E [a, b]. Can it be shown, without additional
hypotheses about the location of the support abscissas Xi E [a, b], that Pn(x)
converges uniformly on [a, b] to f (x) as n ~ oo?
EXERCISES FOR CHAPTER 2
5.
135
(a) The Bessel function of order zero,
I11f
Jo(x) = -
cos(xsint)dt,
0
11:
is to be tabulated at equidistant arguments Xi = Xo -+- ih, i = 0, 1, 2, ...
. How small must the increment h be chosen so that the interpolation
error remains below 10- 6 if linear interpolation is used?
(b) What is the behavior of the maximal interpolation error
as n --+ 00, if Pn(X) interpolates Jo(x) at X = x;n) ;=, i/n, i = 0,1, ... ,
n?
Hint: It suffices to show that IJ~k)(x)1 <:: 1 for k = 0,1, ...
(c) Compare the above result with the behavior of the error
as n --+ 00, where St:J. n is the interpolating spline function with knot set
Ll n = {
} and S~n (x) = J~(x) for x = 0, 1.
6.
Interpolation on product spaces: Suppose every linear interpolation problem
stated in terms of functions 'Po, 'PI, ... , 'Pn has a unique solution
p(x) = L
Cl;i'Pi(X)
i=O
with P(Xk) = h, k = 0, ... , n, for prescribed support arguments Xo, ... ,
Xn with Xi i= Xj, i i= j. Show the following: If 1/;0, ... , 'Pm is also a set of
functions for which every linear interpolation problem has a unique solution,
then for every choice of abscissas
x, i= Xj for i i= j,
Yi i= YJ for i i= j,
XO,Xl, .. 'lXn~
yo, YI , ... , Yrn,
and support ordinates
f'k,
i = 0, ... ,n,
k = 0, ... ,m,
there exists a unique function of the form
n
p(X,y) = L
LCl;I/Il'P,/(x)1/;I'(Y)
1/=0 1l=0
with <p(x" Yk) = Jik, i = 0, ... , n, k = 0, ... , TTL
7.
Specialize the general result of Exercise 6 to interpolation by polynomials.
Give the explicit form of the function p(x, y) in this case.
8.
Given the abscissas
Yo, YI, ... , Ym,
Yi i= Yj for i i= j,
136
2 Interpolation
and, for each k = 0, ... ,m, the values
of x;kl for i of j,
and support ordinates
fik,
i = 0, ... , nk,
k = 0, ... , m,
suppose without loss of generality that the Yk are numbered in such a fashion
that
no ::0: nl ::0: ... ::0: nm .
Prove by induction over m that exactly one polynomial
"p
P(x,y) == L
Lctv/"XVyli
11-=0 v=O
exists with
9.
Is it possible to solve the interpolation problem of Exercise 8 by other polynomials
M
N
"
P(x,y) = L Lctv/"XVyl',
/"=0 v=o
requiring only that the number of parameters ct u /" agree with the number of
support points, that is,
AI
L(n/" + 1) = L(N/" + I)?
Hint: Study a few simple examples.
10. Calculate the inverse and reciprocal differences for the support points
Xi:
°
1
-1
2
-2
fi:
1
3
3/5
3
3/5
and use them to determine the rational espression p2.2(X) whose numerator
and denominator are quadratic polynomials and for which p2,2(Xi) = fi, first
in continued-fraction form and then as the ratio of polynomials.
11. Let <p m .n be the rational function which solves the system sm." for given
support points (Xk, fk), k = 0, 1, ... , m + n:
(ao + alXk + ... + amX~') - h(bo + b1xk + ... + bnx~) = 0,
k = 0, 1, ... ,m+ n.
Show that <pm,n (x) can be represented as follows by determinants:
pm.,,(X) = Ifk, Xk - X, ... , (Xk - x)m, (Xk - X)fk,"" (Xk - x)" fkl;;"=+d'
11,xk - X, ... , (Xk - x)m, (Xk - x)h, ... , (Xk - x)n hl;;":on
EXERCISES FOR CHAPTER 2
137
Here the following abbreviation has been used:
I
Im+n = det
Ctk,· .. , (k k=O
12. Generalize Theorem (2.3.1.12):
(a) For 2n + 1 support abscissas Xk with
a ::; Xo
< Xl < ... < X2n < a + 21r
and support ordinates yo, . .. , Y2n, there exists a unique trigonometric
polynomial
T(x) = ~ao +
2)aj cosjx + bj sinj:r)
j=l
with
T(Xk) = Yk
for k = 0,1, ... ,2n.
(b) If Yo, ... , Y2n are real numbers, then so are the coefficients a J , bj •
Hint: Reduce the interpolation by trigonometric polynomials in (a) to
(complex) interpolation by polynomials using the transformation T(x) =
cje
Then show C- J = Cj to establish (b).
2:.7=-n
ijx .
13. (a) Show that, for real Xl, ... , X2n, the function
II sin - 2 2n
t(x) =
X -
Xk
k=l
is a trigonometric polynomial
n
~ao + 2)aj cosjx + bj sinjx)
j=l
aj,
with real coefficients
bj .
Hint: Substitute sin'P = (1/2i)(e i 'P - e-''P).
(b) Prove, using (a) that the interpolating trigonometric polynomial with
support abscissas Xk,
a ::; Xo < Xl ... < X2n < a + 21r,
and support ordinates Yo, ... ,Y2n is identical with
2n
T(x) = LYJtj(x),
j=O
where
138
2 Interpolation
tj(X) :=
g
2n
X - Xk /
sin --2-
k-,fj
g
2n
X· - Xk
sin _1_2-_,
k-,fj
14. Show that for n + 1 support abscissas Xk with
a :S Xo < Xl < ... < Xn < a + 7r
and support ordinates yo, ... , Yn, a unique "cosine polynomial"
n
C(x) = Lajcosjx
j=O
exists with C(Xk) = Yk, k = 0, 1, ... , n.
Hint: See Exercise 12.
15. (a) Show that for any integer j
2m
2m
COSjXk = (2m + 1) h(j),
L
k=O
with
Xk:=
and
27rk
2m+ l'
Lsinjxk = 0,
k=Q
k = 0, 1, ... ,2m,
for j = 0 mod 2m + 1,
otherwise.
h(j) := {~
(b) Use (a) to derive for integers j, k the following orthogonality relations:
2m
L
sinjxi sin kXi = 2mt 1 [h(j - k) - h(j + k)],
i=O
2m
COSjXi COSkXi = 2m2+ 1 [h(j - k) + h(j + k)],
L
i=O
2m
LCOsjXisinkxi = O.
i=O
16. Suppose the 27r-periodic function I: IR. -+ IR. has an absolutely convergent
Fourier series
00
I(x) = ~ao + L(aj cosjx + bj sinjx).
j=l
Let
1li(X) = ~AQ + L(Aj cosjx + Bj sinjx)
j=l
EXERCISES FOR CHAPTER 2
139
be trigonometric polynomials with
27rk
2m+ l '
Xk:=
for k = 0, 1, ... , 2m.
Show that
00
Ak = ak
+ 2 ) ap (2rn+1)+k + a p (2rn+1)-k], 0::; k ::; m,
p=l
00
Bk
= bk + L[bp (2rn+1)+k - bp (2m+1)-k],
1::; k ::; m.
p=l
17. Formulate a Cooley-Thkey method in which the array ,a[ 1 is initialized directly (,B[j] = Ii) rather than in bit-reversed fashion.
Hint: Define and determine explicitly a map u = u( m, r, j) with the same
replacement properties as (2.3.2.6) but u(O, r, 0) = r instead of (2.3.2.7).
18. Let N = 2n : Consider the N-vectors
(2.3.2.1) expresses a linear transformation between these two vectors, f3 =
(ljN)TJ, where T = [tjk], tjk := e-27rijk/N.
(a) Show that T can be factored as follows:
T = QSP(Dn-ISP)·· . (DISP),
where S is the N x N matrix
1
-1
1
1
.
_.
(
(I)
(I)
(l)
)
_
The matnces Dl - dlag 1,°1 ,1, 03 ,···,1, ON_I' l - 1, ... , n - 1, are
diagonal matrices with
i' = [~],
r odd.
Q is the matrix of the bit-reversal permutation (2.3.2.8), and P is the
matrix of the following bit-cycling permutation (:
(b) Show that the Sande-Thkey method for fast Fourier transforms corresponds to multiplying the vector J from the left by the (sparse) matrices
in the above factorization.
140
2 Interpolation
(c) Which factorization of T corresponds to the Cooley-Thkey method?
Hint: TH differs from T by a permutation.
19. Investigate the numerical stability of the methods for fast Fourier transforms
described in Section 2.3.2.
Hint: The matrices in the factorization of Exercise 18 are almost orthogonal.
20. Given a set of knots .d = {xo < Xl < ... < Xn} and values Y := (YO, ... , Yn),
prove independently of Theorem (2.4.1.5) that the spline function S.<l(Y;.)
with S~(Y;xo) = S~(Y;Xn) = 0 is unique.
Hint: Examine the number of zeros of the difference S~ - S~ of two such
spline functions. This number will be incompatible with S.<l - S.<l '# O.
21. The existence of a spline function S.<l (Y; .) in cases (a), (b), and (c) of
(2.4.1.2) can be established without explicitly calculating it, as was done
in Section 2.4.2.
(a) The representation of S.<l(Y;.) requires 4n parameters aj, (3j, '"'Ij, 8j .
Show that in each ofthe cases (a), (b), (c) a linear system of 4n equations
results (n + 1 = number of knots).
(b) Use the uniqeness of S.<l(Y;.) (Exercise 20) to show that the system of
linear equations is not singular, which ensures the existence of a solution.
22. Show that the quantities d j of (2.4.2.6) and (2.4.2.8) satisfy
j
and even
= 0,1, ... ,n,
j = 1,2, ... , n - 1,
in the case of n + 1 equidistant knots Xj E .d.
23. Show that Theorem (2.4.1.5) implies: If the set of knots .d' c la, bj contains
the set of knots .d, .d' ::) .d, then in each of the cases (a), (b), and (c),
24. Suppose S.<l(x) is a spline function with the set of knots
.d = { a = Xo < Xl < ... < Xn = b}
interpolating ! E K4 (a, b). Show that
if anyone of the following additional conditions is met:
(a) f'(x) = S~(x) for X = a, b.
(b) rex) = S~(x) for X = a, b.
(c) S.<l is periodic and! E K;(a, b).
25. Prove that Pk > 1 holds for the quantities Pk, k = 1, ... , n, encountered in
solution method (2.4.2.15) of the system of linear equations (2.4.2.9). All the
divisions required by this method can therefore be carried out.
26. Define the spline functions Sj for equidistant knots Xi = a + ih, h > 0, i = 0,
... , n, by
EXERCISES FOR CHAPTER 2
141
Verify that the moments M I , ... , M n - l von Sj are as follows:
1
i=I, ... ,j-2,
Mi=--Mi+l,
Pi
1
Mi = - - - Mi - l ,
pn-i
M
i = j + 2, ... ,n -1,
_ -6. 2 + I/Pj-l + I/Pn-j-l
J -
h2
4 - 1/ pj-l - 1/ pn-j-l
M j - l = _1_(6h- 2 - M j )
Pj-l
for j # 0, 1, n - 1, n,
where the numbers Pi are recursively defined by PI := 4 and
Pi := 4 - I/Pi-1
i = 2,3, ....
It is readily seen that they satisfy the inequalities
4 = PI > P2 > ... > Pi > Pi+I > 2 + v'3 > 3.7,
0.25 < 1/ Pi < 0.3.
27. Show for the functions Sj defined in Exercise 26 that for j ,= 2, 3, ... , n - 2
and either x E [Xi, Xi+l], j + 1 :'::: i :'::: n - 1 or X E [Xi-I, x;}, 1 :'::: i :'::: j - 1,
28. Let Sa,! denote the spline function which interpolates the function f at
prescribed knots x E Ll and for which
S~;f (xo) = S~;f (Xn) = O.
The map f -+ Sa;! is linear, that is,
The effect on Sa;! of changing a single function value f(xj) is therefore that
of adding a corresponding multiple of the function Sj which was defined in
Exercise 26. Show, for equidistant knots and using the results of Exercise
27, that a perturbation of a function value subsides quickly as one moves
away from the location of the perturbation. Consider the analogously defined Lagrange polynomials (2.1.1.2) in order to compare the behavior of
interpolating polynomials.
29. Let Ll = {xo < Xl < ... < Xn }.
(a) Show that a spline function Sa with the boundary conditions
(*)
142
2 Interpolation
vanishes identically for n < 4.
(b) For n = 4, the spline function with (*) is uniquely determined for any
value c and the normative condition
(SLl is a multiple of the B-spline with support [xo, X4J.)
Hint: Prove the uniqueness of S Ll for c = 0 by determining the number of
zeros of S~ in the open interval (xo, X4). Deduce from this the existence
of SLl by the reasoning employed in exercise 21.
(c) Calculate SLl explicitly in the following special case of (b):
Xi
:= -2, -1,0, 1,2,
c:= 1.
30. Let 5 be the linear space of all spline functions SLl with knot set .:1 = {xo <
Xl < ... < Xn} and S~(xo) = S~(Xn) = O. The spline functions So, ... , Sn
are the ones defined in Exercise 26. Show that for Y := (YO, .. . , Yn),
n
SLl{Y;X) == LYjSj (x).
j=O
What is the dimension of 5?
Exponentialspline
kubischer Spline
Fig. 5. Comparison of spline functions.
31. Let ELl,f{X) denote the spline-like function which, for given Ai, minimizes
the functional
EXERCISES FOR CHAPTER 2
143
over K?(a, b). [Compare Theorem (2.4.1.5).]
(a) Show that ELl,! is between knots of the form
ELl.J(X) = Qi + !3i(X - Xi) + ii'¢(X - Xi) + OiC,,(X - Xi),
i = 0, ... ,N -1,
2
'¢i(X) = A2 [cosh(Aix ) - 1],
,
'Pi(X) =
~3 [sinh(A;x) - Ai X],
,
with constants Qi, !3i, ii, Oi. ELl,! is called exponential spline function.
(b) Examine the limit as Ai -+ O.
(c) Figure 5 illustrates the qualitative behavior of cubic and exponential
spline functions interpolating the same set of support points.
32. Show that the hat-function H(x) (2.4.6.6) has all properties (Definition.
(2.4.6.1)) of a scaling function.
33. Show for the numbers ik defined by (2.4.6.20) the properties
(a) ik = i-k for all k E Zl,
(b)
ik = y'2,
(c) i2k-I + 2i2k + i2k+I = 0 for all k E Zl, k f= O.
Hint: The multi-resolution method (2.4.6.8) approximates every function
VI E VI with VI E Va by itself, Va = VI EVa.
2:kEZ
References for Chapter 2
Achieser, N. I. (1953): Vorlesungen iiber Approximationsthe01"ie. Berlin: Akademie-Verlag.
Ahlberg, J., Nilson, E., Walsh, J. (1967): The Theory of Splines and Their Applications. New York: Academic Press.
Bloomfield, P. (1976): Fourier Analysis of Time Series. New York: Wiley.
Bohmer, E. O. (1974): Spline-Funktionen. Stuttgart: Teubner.
Brigham, E. O. (1974): The Fast Fourier Transform. Englewood Cliffs, N.J.:
Prentice-Hall.
Bulirsch, R., Rutishauser, H. (1968): Interpolation und genaherte Quadratur. In:
Sauer, SzabO (1968).
- - - , Stoer, J. (1968): Darstellung von Funktionen in Rechenautomaten. In:
Sauer, Szabo (1968).
Chui, C. K. (1992): An Introduction to Wavelets. San Diego, CA: Academic Press.
Ciarlet, P. G., Schultz, M. H., Varga, R. S. (1967): Numerical methods of highorder accuracy for nonlinear boundary value problems I. One dimensional problems. Numer. Math. 9, 294-430.
Cooley, J. W., Tukey, J. W. (1965): An algorithm for the machine calculation of
complex Fourier series. Math. Comput. 19, 297-301.
Curry, H. B., Schoenberg, I. J. (1966): On Polya frequency functions, IV: The
fundamental spline functions and their limits. J. d'Analyse Math. 17, 73-82.
Daubechies, I. (1992): Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
Davis, P. J. (1963): Interpolation and Approximation. New York: Blaisdell, 2d
printing (1965).
144
2 Interpolation
de Boor, C. (1972): On calculating with B-splines. J. Approximation Theory 6,
50-62.
- - - (1978): A Practical Guide to Splines. Berlin, Heidelberg, New York:
Springer-Verlag.
- - - , Pinkus, A. (1977): Backward error analysis for totally positive linear
systems. Numer. Math. 27,485-490.
Gautschi, W. (1972): Attenuation factors in practical Fourier analysis. Numer.
Math. 18, 373-400.
Gentleman, W. M., Sande, G. (1966): Fast Fourier transforms - For fun and
profit. In: Proc. AFIPS 1966 Fall Joint Computer Conference, 29, 503-578.
Washington, D.C.: Spartan Books.
Goertzel, G. (1958): An algorithm for the evaluation of finite trigonometric series.
Amer. Math. Monthly 65, 34-35.
Greville, T. N. E. (1969): Introduction to spline functions. In: Theory and Applications of Spline Functions. Edited by T. N. E. Greville. New York: Academic
Press.
Hall, C. A., Meyer, W. W. (1976): Optimal error bounds for cubic spine interpolation. J. Approximation Theory 16, 105-122.
Herriot, J. G., Reinsch, C. (1971): ALGOL 60 procedures for the calculation of interpolating natural spline functions. Technical Report STAN-CS-71-200, Computer Science Department, Stanford University, CA.
Karlin, S. (1968): Total Positivity, Vol. 1. Stanford: Stanford University Press.
Korneichuk, N. P. (1984): Splines in Approximation Theory. Moscow: Nauka (in
Russian).
Kuntzmann, J. (1959): Methodes Numeriques, Interpolation - Derivees. Paris:
Dunod.
Louis, A., MaaB, P., Rieder, A. (1994): Wavelets. Stuttgart: Teubner.
Maehly, H., Witzgall, Ch. (1960): Tschebyscheff-Approximationen in kleinen Intervallen II. Numer. Math. 2, 293-307.
Mallat, St. (1997): A Wavelet Tour of Signal Processing. San Diego, CA.: Academic Press.
Milne, E. W. (1949): Numerical Calculus. Princeton, N.J.: Princeton University
Press, 2d printing (1951).
Milne-Thomson, L. M. (1933): The Calculus of Finite Differences. London:
Macmillan, reprinted (1951).
Nurnberger, G. (1989): Approximation by Spline Functions. Berlin: SpringerVerlag.
Reinsch, C.: Unpublished manuscript.
Sauer, R., Szabo, I. (Eds.) (1968): Mathematische Hilfsmittel des Ingenieurs, Part
III. Berlin, Heidelberg, New York: Springer-Verlag.
Schoenberg, 1. J., Whitney, A. (1953): On Polya frequency functions, III: The
positivity of translation determinants with an application to the interpolation
problem by spline curves. Trans. Amer. Math. Soc. 74, 246-259.
Schultz, M. H. (1973): Spline Analysis. Englewood Cliffs, N.J.: Prentice-Hall.
Schumaker, L.L. (1981): Spline Functions. Basic Theory. New York, Chichester,
Brisbane, Toronto: Wiley.
Singleton, R. C. (1967): On computing the fast Fourier transform. Comm. ACM
10, 647-654.
- - - (1968): Algorithm 338: ALGOL procedures for the fast Fourier transform.
Algorithm 339: An ALGOL procedure for the fast Fourier transform with arbitrary factors. Comm. ACM 11, 773-779.
3 Topics in Integration
Calculating the definite integral of a given real function f (x),
lb
f(x)dx,
is a classic problem. For some simple integrands f(x), the indefinite integral
jt f(x) dx = F(t), F'(x) = f(x),
can be obtained in closed form as an algebraic expression in t and wellknown transcendental functions of t. Then
lb
f(x) dx = F(b) - F(a).
See Grabner and Hofreiter (1961) for a comprehensive collection of formulas
describing such indefinite integrals and many important definite integrals.
As a rule, however, definite integrals are computed using discretization
methods which approximate the integral by finite sums corresponding to
some partition of the interval of integration [a, b] ("numerical quadrature").
A typical representative of this class of methods is Simpson's rule, which
is still the best-known and most widely used integration method. It is
described in Section 3.1, together with some other elementary integration
methods. Peano's elegant and systematic representation of the error terms
of integration rules is described in Section 3.2.
A closer investigation of the trapewidal sum in Section 3.3 reveals that
its deviation from the true value of the integral admits an asymptotic expansion in terms of powers of the discretization step h. This expansion is
the classical summation formula of Euler and Maclaurin. Asymptotic expansions of this form are exploited in so-called "extrapolation methods",
which increase the accuracy of a large class of discretization methods. An
application of extrapolation methods to integration ("Romberg integration") is studied in Section 3.4. The general scheme is described in Section
3.5.
146
3 Topics in Integration
A description of Gaussian integration rules follows in Section 3.6. The
chapter closes with remarks on the integration of functions with singularities. For a comprehensive treatment of integration, the reader is referred
to Davis and Rabinowitz (1975).
3.1 The Integration Formulas of Newton and Cotes
The integration formulas of Newton and Cotes are obtained if the integrand is replaced by a suitable interpolating polynomial P(x) and if then
P(x)dx is taken as an approximate value for
f(x)dx. Consider a uniform partition of the closed interval [a, b] given by
J:
J:
Xi = a + i h,
i = 0, 1, ... , n,
of step length h := (b - a) / n, n > 0 integer, and let P n be the interpolating
polynomial of degree n or less with
Pn(x;) = fi := f(Xi)
for
i = 0, L ... , n.
By Lagrange's interpolation formula (2.1.1.4),
'L fiLi(X), L;(x) II Xi - xk
n
n
Pn(x) ==
=
X -
k~O
i=O
k¥i
or, introducing the new variable t such that X = a + ht
Li(X) = tpi(t) :=
t- k
II i _ k'
n
k=D
k#i
Integration gives
n
=h'Lfi D;.
i=O
Note that the coefficients or weights
Di:=
r tpi(t)dt
./0
Xk
--,
3.1 The Integration Formulas of Newton and Cotes
147
depend solely on n; in particular, they do not depend on the function f to
be integrated, or on the boundaries a, b of the integral.
If n = 2 for instance, then
aO = [2 t - 1 t - 2 dt = ~ [2(t2 _ 3t + 2)dt = ~ (~ _ 12 +
10 0 - 1 0 - 2
al = [2 t -
2
10
2 3
2
4) =~,
3
0t - 2dt = _ 10[2 (t2 _ 2t)dt = _ (~3 _ 4) = i,3
10 1 - 0 1 - 2
i) 3'
a2 = [2 t - 0 t - 1 dt = ~ [\t2 _ t)dt = ~ (~ _
2- 02- 1
2 3
2
10
210
1
and we obtain the following approximate value:
J:
f(x)dx. This is Simpson's rule.
for the integral
For any natural number n, the Newton-Cotes formulas
(3.1.1)
Ib
a
Pn(x)dx = h
t
i=O
fiai,
fi:= f(a + ih),
b-a
h:=--,
n
J:
provide approximate values for
f(x)dx. The weights ai, i = 0, 1, ... , n,
have been tabulated. They are rational numbers with the property
n
(3.1.2)
Lai =n.
i=O
This follows from (3.1.1) when applied to f(x) :=::= 1 for which Pn(x) =::= 1.
If s is a common denominator for the fractional weights ai so that the
numbers
(Ji := s ai,
i = 0, 1, ... , n,
are integers, then (3.1.1) becomes
(3.1.3)
For sufficiently smooth functions f(x) on the closed interval [a, b] it can
be shown [see Steffensen (1950)] that the approximation error may be expressed as follows:
(3.1.4)
148
3 Topics in Integration
Here (a, b) denotes the open interval from a to b. The values of p and K
depend only on n but not on the integrand f.
For n = 1,2, ... ,6 we find the Newton-Cotes formulas given in the following table. For larger n, some of the values (Ji become negative and the
corresponding formulas are unsuitable for numerical purposes, as cancellations tend to occur in computing the sum (3.1.3).
n
O"i
1
1
1
2
1
4
1
3
1
3
3
1
4
7
32 12
32
5
19
75 50
50 75
6
41 216 27 272 27 216 41
ns
Error
h3
f2 j<2) (~)
Name
2
Trapezoidal rule
6
h 5 /of(4)W
Simpson's rule
8
h5iof(4)(~)
3/8-rule
90
h7 9~5f(6)(~)
Milne's rule
288
h7 1;~~6f(6)(~)
840
h g 1:oof(8)(~)
7
19
Weddle's rule
Additional integration rules may be found by Hermite interpolation
[see Section 2.1.5] of the integrand f by a polynomial P E IIn of degree n
or less. In the simplest case, a polynomial P E II3 with
P(a) = f(a),
P(b) = f(b),
P'(a) = 1'(a),
P'(b) = l' (b)
is substituted for the integrand f. The generalized Lagrange formula
(2.1.5.3) yields for P in the special case a = 0, b = 1,
P(t) = f(O)[(t - 1)2 + 2t(t - 1)2] + f(l)[t 2 - 2t 2 (t - 1)]
+1'(O)t(t _1)2 + 1'(I)t2 (t -1),
integration of which gives
11
P(t)dt = ~(f(O) + f(I)) + A(f'(O) - 1'(1)).
From this, we obtain by a simple variable transformation the following
integration rule for general a < b (h := b - a):
(3.1.5)
j f(x)dx M(h)
b
a
~
h
h2
:= 2 (f(a) + f(b)) + 12 (f'(a) -
f'(b)).
If f E C 4 [a, b] then ~ using methods to be described in Section 3.2 ~ the
approximation error of the above rule can be expressed as follows:
3.1 The Integration Formulas of Newton and Cotes
(3.1.6)
M(h)
-lb
f(x)dx = ;~; f(4)(~),
~ E (a, b),
149
h:= (b - a).
If the support abscissas Xi, i = 0, ... , n, Xo = a, Xn = b., are not equally
spaced, then interpolating the integrand f(x) will lead to different integration rules, among them the ones given by Gauss. These will be described
in Section 3.6.
The Newton-Cotes and related formulas are usually not applied to the
entire interval of integration [a, b], but are instead used 'in each one of a
collection of subintervals into which the interval [a, b] has been divided. The
full integral is then approximated by the sum of the approximations to the
subintegrals. The locally used integration rule is said to have been extended,
giving rise to a corresponding composite rule. We proceed to examine some
composite rules of this kind.
The trapezoidal rule (n = 1) provides the approximate value
in the subinterval [Xi, Xi+1] of the partition Xi = a + ih, 1 = 0, 1, ... , N,
h := (b - a)j N. For the entire interval [a, b], we obtain the approximation
(3.1.7)
N-l
T(h):=
L Ii
i=O
=h [f~a) +f(a+h)+f(a+2h)+"'+f(b-h)+ f;b)] ,
which is the trapezoidal sum for step length h. In each subinterval [Xi, Xi+1]
the error
is incurred, assuming f E C 2 [a, b]. Summing these individual error terms
gives
Since
and f(2) (x) is continuous, there exists ~ E [min~i,max~i]
C (a, b) with
,
,
150
3 Topics in Integration
Thus
T(h)
-l
b
f(x)dx = b ~ a h2f(2)(~),
~ E (a, b).
Upon reduction of the step length h (increase of n) the approximation error
approaches zero as fast as h 2 , so we have a method of order 2.
If N is even, then Simpson's rule may be applied to each subinterval
[X2i' X2i+2], i = 0, 1, ... , (N/2) -1, individually yielding the approximation
(h/3) (f(X2i) + 4f(X2i+l) + f(X2i+2)). Summing these N/2 approximations
results in the composite version of Simpson's rule
h
S(h) := 3[f(a) + 4f(a + h) + 2f(a + 2h) + 4f(a + 3h) + ...
+ 2f(b - 2h) + 4f(b - h) + f(b)],
for the entire interval. The error of S(h) is the sum of all N/2 individual
errors
and we conclude, just as we did for the trapezoidal sum, that
S(h)
-lb
a
f(x)dx = b - ah4f(4)(~),
180
~ E (a,b),
provided f E C 4 [a, b]. The method is therefore of order 4.
Extending the rule of integration M(h) in (3.1.5) has a remarkable
effect: when the approximation to the individual subintegrals
l
Xi
+1
f(x)dx
for
i = 0, 1, ... , N - 1
x,
are added up, all the "interior" derivatives f'(Xi), 0 < i < N, cancel. The
following approximation to the entire integral is obtained:
U(h) : = h [f~a) + f(a + h) + ... + f(b - h) + f;b)] + ~~ [f'(a) - f'(b)]
= T(h) + ~; [f'(a) - f'(b)].
This formula can be considered as a correction to the trapezoidal sum T(h).
It relates closely to the Euler-Maclaurin summation formula, which will be
discussed in Section 3.3 [see also Schoenberg (1969)]. The error formula
3.2 Peano's Error Representation
151
(3.1.6) for M(h) can be extended to an error formula for the composite
rule U(h) in the same fashion as before. Thus
(3.1.8)
provided f E C 4 [a,b]. Comparing this error with that of the trapezoidal
sum, we note that the order of the method has been improved by 2 with a
minimum of additional effort, namely the computation of f'(a) and f'(b).
If these two boundary derivatives are known to agree, e.g. for periodic
functions, then the trapezoidal sum itself provides a method of order at
least 4.
Replacing f'ea), f'(b) by difference quotients with an approximation
error of sufficiently high order, we obtain simple modifications ["end corrections": see Henrici (1964)] of the trapezoidal sum which do not involve
drivatives but still lead to methods of orders higher than 2. The following
variant of the trapezoidal sum is already a method of order 3:
T(h) :=h[152 f (a) + ~U(a + h) + f(a + 2h) + ... + feb - 2h)
+ ~U(b - h) + 152 f (b)].
For many additional integration methods and their systematic examination
see, for instance Davis and Rabinowitz (1975).
3.2 Peano's Error Representation
All integration rules considered so far are of the form
(3.2.1) 1(f):=
rno
ml
mn
k=O
k=O
k=O
L akof(xkO) + L aklf'(xkl) + ... + L aknf(n) (Xkn).
The integration error
R(f) := l(n
(3.2.2)
-l
b
f(x)dx
is a linear operator
R( af + f3g) = aR(f) + f3R(g)
for
f, 9 E V, a, 8 E lR
on some suitable linear function space V. Examples are V = Cn[a,b], the
space of functions with continuous nth derivatives on the interval [a, b], or
V = IIn , the space of all polynomials of degree no greater than n. The
following elegant integral representation of the error R(f) is a classical
result due to Peano:
152
3 Topics in Integration
(3.2.3) Theorem. Suppose R(P) = 0 holds for all polynomials P E IIn ,
that is, every polynomial whose degree does not exceed n is integrated exactly. Then for all functions f E en+! [a, b],
R(J) =
lb
f Cn + I )(t)K(t)dt,
where
K(t) := ~Rx[(X - t):;:'L
n.
(x _ t):;:' := { o(x - t)n
for x ::::: t,
for x < t,
and
Rx[(x - t):;:']
denotes the error of (x - t)+' when the latter is considered as a function in
x.
The function K(t) is called the Peano kernelDf the operator R.
Before proving the theorem, we will discuss its application in the case
of Simpson's rule
R(J) =
~f(-I) + ~f(O) + ~f(l)
-111
f(x)dx.
We note that any polynomial P E II3 is integrated exactly. Indeed, let
Q E II2 be the polynomial with P( -I) = Q( -I), P(O) = Q(O), P( +1) =
Q(+I). Putting S(x) := P(x) - Q(x), we have R(P) = R(S). Since the
degree of S(x) is no greater then 3, and since S(x) has the three roots -I,
0, +1, it must be of the form S(x) = a(x 2 - I)x, and
R(P) = R(S) = -ajI x(x 2 -I)dx = O.
-1
Thus Theorem (3.2.3) can be applied with n = 3. The Peano kernel becomes
K(t) = iRx[(x - t)t]
=i
[~(-I-t)t+~(0-t)t+~(I-t)t-l11(X-t)tdx].
By definition of (x - t)+', we find that for t E [-1, I]
j (x - t)tdx 11 (x - t)3dx (I ~ t)4
I
=
-1
=
t
(-I-t)t=o,
(_t)3 =
+
(l-t)t=(I-t)3,
{O_t
3
0,
if t :::::
if t < O.
,
3.2 Peano's Error Representation
153
The Peano kernel for Simpson's rule in the interval [-1, 1J is then
K(t) = {
(3.2.4)
i2 (1 - t)3(1 + 3t) if 0::; t ::; 1,
if-1::;t::;O.
K(-t)
PROOF OF THEOREM (3.2.3). Consider the Taylor expansion of f(x) at
x = a:
(3.2.5)
f(n)(a)
f(x) = f(a) + J'(a)(x - a) + ... + - - I-(x - a)n + rn(x).
n.
Its remainder term can be expressed in the form
rn(x) = , 1
n.
JX f(n+l)(t)(x - t)ndt = ,1 Jb f(n+l)(t)(x - t)~dt.
a
n.
a
Applying the linear operator R to (3.2.5) gives
(3.2.6)
since R(P) = 0 for P E IIn.
In order to transform this representation of R(f) into the desired one,
we have to interchange the Rx operator with the integration. To prove that
this interchange is legal, we show first that
for 1 ::; k ::; n. For k < n this follows immediately from the fact that
(x - t)~ is n - 1 times continuously differentiable. For k =, n - 1 we have
in particular
and therefore
The latter integral is differentiable as a function of x, since the integrand
is jointly continuous in x and t; hence
154
3 Topics in Integration
Thus (3.2.7) holds also for k = n.
By (3.2.7), the differential operators
dk
k = 1, ... , n,
commute with integration. Because I(f) = Ix(f) is a linear combination of
differential operators, it also commutes with integration.
Finally the continuity properties of the integrand j(n+l) (t)(x - t)+' are
such that the following two integrations can be interchanged:
This then shows the entire operator Rx commutes with integration, and we
obtain the desired result
Note that Peano's integral representation of the error is not restricted to
operators of the form (3.2.1). It holds for all operators R for which Rx
commutes with integration.
For a surprisingly large class of integration rules, the Peano kernel K(t)
has constant sign on [a, b]. In this case, the mean-value theorem of integral
calculus gives
(3.2.8)
R(f) = j(n+l)(o
lb
K(t)dt
for some
e E (a, b).
The above integral of K(t) does not depend on j, and can therefore be
determined by applying R, for instance, to the polynomial j(x) := xn+l.
This gives
(3.2.9)
R(f) = R(xn+1) j(n+1) (0
(n + I)!
for some
~ E (a, b).
3.2 Peano's Error Representation
In the case of Simpson's rule, K(t)
addition
4
R(x
1
- )= 4!
24
0 for -1 :S t :S 1 by (3.2.4). In
~
(1-·1+-·0+-·14 1 11 x 4dx )
3
3
155
3
1
90'
-1
so that we obtain for the error of Simpson's formula
~f( -1) + ~f(O) + ~f(l) - [11 f(t)dt = 910f(4)(~), ~ E (a, b).
In general, the Newton-Cotes formulas of degree n integrate without error polynomials P E IIn if n is odd, and P E IIn+l if n is even [see Exercise
2]. The Peano kernels for the Newton-Cotes formulas are of constant sign
[see for instance Steffensen (1950)], and (3.2.9) confirms the error estimates
in Section 3.1:
Rn(f) =
{
Rn(xn+l)f(n+1)(t)
1
<",
(
R
n+ 1.)
n(x
n+2
(n + 2)!
'f
1
. dd
n IS
0
,
~ E (a,b),
) f(n+2)(o
'
if n is even
.
Finally, we use (3.2.9) to prove the representation (3. L.6) of the error
of integration rule (3.1.5), which was based on cubic Hermite interpolation.
Here
r
h
~
R(f) = 2(f(a) + feb)) + 12 (f'(a) - f'(b)) - Jb f(x)dx,
which vanishes for all polynomials P E II3 . For n
following Peano kernel:
1
h:= b- a,
= 3, we obtain the
3
K(t) = "6Rx((x - t)+)
lb
1
3
h - t)+2 - (b - t)+)2' - a (X - t)+dx
3
= "61 [h2((a - t)+3 +(b - t)+)
+ 4((a
2
=
~[~(b - t)3 _ h2 (b _ t)2 _ ~(b _ t)4]
6 2
4
4
= - 214(b-t)2(a-t)2.
Since K(t) ::; 0 in the interval of integration [a, b], (3.2.9) is applicable.
We find for a = 0, b = 1 that
R(x 4 ) = .L(1.1 +.L. (-4) _ 1) = __
1
4!
24 2
12
5
720 .
Thus
156
3 Topics in Integration
R(f) = -~ {b j(4)(t)(b _ t)2(a _ t)2dt = _ (b - a)5 j(4)(~)
uL
no'
~ E (a, b),
for j E C4[a, b], which was to be shown.
3.3 The Euler-Maclaurin Summation Formula
The error formulas (3.1.6), (3.1.8) are low-order instances of the famous
Euler-Maclaurin summation formula, which in its simplest form reads (for
9 E C 2m+2[0, 1])
(I g(t)dt =g(O) + g(l) +
(3.3.1)
Jo
2
2
f
B21 (g(21-1)(0) _ g(21-1)(1))
1=1 (2l)!
Here Bk are the classical Bernoulli numbers
B_1
2 - 6'
(3.3.2)
B
I B6 = 42'
1
4 = - 30 '
B
_1
8 - - 30' ... ,
whose general definition will be given below. Extending (3.3.1) to its composite form in the same way as (3.1.6) was extended to (3.1.8), we obtain
for 9 E C 2m+2[0, N]
(N g(t)dt =g(O) + g(l) + ... + g(N _ 1) + g(N)
Jo
2
+
f
1=1
_
2
B21 (g(21-1)(0) _ g(21-1)(N))
(21)!
B 2m +2 N (2m+2)(c)
(2m + 2)!
9
., ,
0 < C < N.
.,
Rearranging the terms of the above formula leads to the most frequently
presented form of the Euler-Maclaurin summation formula:
g(O) + g(l) + ... + g(N _ 1) + g(N)
2
2
(3.3.3)
f
= (N g(t)dt +
B21 (g(21-1)(N) _ g(21-1)(0))
Jo
1=1 (21)!
+
B 2m +2 Ng(2m+2)(~)
(2m + 2)!
'
0 < ~ < N.
For a general uniform partition Xi = a + ih, i = 0, ... , N, XN = b, of the
interval [a, b], (3.3.3) becomes
3.3 The Euler-Maclaurin Summation Formula
(3.3.4)
T(h) =
l
b
a
f(t)dt +
157
B
LTn h 21 --;U(21-1)(b)
- f(21-1)(a))
(2l).
1=1
+ h 2Tn +2 B 2Tn +2 (b _ a)f(2m+2)(~),
(2m + 2)!
a < ~ < b,
where T(h) denotes the trapezoidal sum (3.1.7)
f(a)
f(b)]
T(h)=h [-2-+ f (a+h)+···+f(b-h)+-2 .
In this form, the Euler-Maclaurin snmmation formula expands the trapezoidal sum T (h) in terms of the step length h = (b - a) / N, and herein lies
its importance for our purposes: the existence of such an expansion puts at
one's disposal a wide arsenal of powerful general "extrapolation methods",
which will be discussed in subsequent sections.
PROOF OF (3.3.1). We will use integration by parts and successively determine polynomials Bk(x), starting with B 1(x) == x -~, such that
(3.3.5)
r
11
o
r
1
1
Jo g(t)dt = B 1(t)g(t)lo1 - Jo B 1(t)g'(t)dt,
1
1111
Bdt)g'(t)dt = -B2(t)g'(t)dtl -2
0
2
0
B2(t)gl/(t)dt,
where
(3.3.6)
B~+l (x) = (k + l)Bdx),
k = 1, 2, ....
It is clear from (3.3.6) that each polynomial Bk(X) is of degree k and that
its highest-order terms has coefficient 1. Given Bk(x), thE relation (3.3.6)
determines Bk+1(X) up to an arbitrary additive constant. We now select
these constants so as to satisfy the additional conditions
(3.3.7)
which determine the polynomials Bdx) uniquely. Indeed, if
B 21-1 ()
X = X21-1 + C21-2x 21-2 + ... + C1x + CO,
then with integration constants c and d,
B 21+1(X)=x
21+1
(2l + 1 )2l
21
+2l(21_1)C21-2X +···+(2l+1)cx+d.
158
3 Topics in Integration
B 21 + 1 (0) = 0 requires d = 0, and B 21 + 1 (1) = 0 determines c.
The polynomials
B3(X) == :r 3 - ~x2 + ~x,
B4(X) == X4 - 2x 3 + x2 -
do.
are known as Bernoulli polynomials. Their constant terms Bk = Bk(O) are
the Bernoulli numbeT's (3.3.2). All Bernoulli numbers of odd index k > 1
vanish because of (3.3.7).
The Bernoulli polynomials satisfy
(3.3.8)
This folows from the fact that the polynomials (_l)k B k (l- x) satisfy the
same recursion, namely (3.3.6) and (3.3.7). as the Bernoulli polynomials
Bk(X), Since they also start out the same, they must coincide.
The following relation - true for odd indices k > 1 by (3.3.7) _.- can
now be made generaL since (3.3.8) establishes it for even k:
(3.3.9)
This gives
(3.3.10)
We are now able to complete the expansion of
11
g(t)dt.
Combining the first 2m + 1 relations (3.3.5), observing
for k > 1 by (3.3.9), and accounting for B 21 + 1 = 0, we get
where the error term T'm+1 is given by the integral
1'm+1 := (
1
+
-1
2m
1
)'
1.
0
B 2m+ 1 (t)g
(2m+1)
(t)dt.
\Ve use integration by parts once more to transform the error term:
3.3 The Euler-Maclaurin Summation Formula
r
1
10
159
B 2m +1(t)g(2m+1l(t)dt = _1_(B 2m +2(t) _ B 2m +2 )g(2m+l)(t)1 1
2m + 2
0
_ _1_
r
2m+ 210
1(B +2(t) _ B + )g(2m+2l(t)dt.
2m 2
2m
The first term On the right-hand side vanishes again by (~1.3.9). Thus
(3.3.11)
rm+1 = (2m
~ 2)! 10 1(B2m+2(t) - B 2m+2)g(2m+2) (t)dt.
In order to complete the proof of (3.3.1), we need to show that B2m+2(t)B 2m +2 does not change its sign between 0 and 1. We will show by induction
that
(-1)mB 2m _ 1 (X»
(3.3.12b)
(-1)m(B2m(x) - B 2m » 0
(_1)m+lB2m> O.
(3.3.12c)
for 0 < x < ~,
for 0 < x < 1,
0
(3.3.12a)
Indeed, (3.3.12a) holds for m = 1. Suppose it holds for some m ;::: 1. Then
for 0 < x ::::; ~,
(_l)m
- - ( B2m (x) - B 2m ) = (_l)m
2m
lx
0
B 2m - 1(t)dt> O.
By (3.3.8), this extends to the other half of the interval, ~ :::; x < 1, proving
(3.3.12b) for this value of m. In view of (3.3.10), we have
(-lr+ 1B2m = (_l)m
10 1(B2m(t) - B 2m )dt > 0,
which takes care of (3.3.12c). We must now prove (3.3.12a) for m+ 1. Since
B 2m +1(x) vanishes for x = 0 by (3.3.7), and for x = ~ by (3.3.8), it cannot
change its sign without having an inflection point x between 0 and ~. But
then B 2m - 1 (x) = 0, in violation of the induction hypothesis. The sign of
B2m+r(X) in 0 < x < ~ is equal to the sign of its first derivative at zero,
whose value is (2m + 1)B2m(O) = (2m + 1)B2m' The sign of the latter is
(_1)m+1 by (3.3.12c).
Now for the final simplification of the error term (3.3.11). Since the
function B 2m +2(X) - B2m+2 does not change its sign in the interval of
integration, there exists ~, 0 < ~ < 1, such that
From (3.3.10),
B 2m +2 (2m+2) (t)
2m + 2)!g
<" ,
160
3 Topics in Integration
which completes the proof of (3.3.1).
3.4 Integration by Extrapolation
Let f E C 2m+2[a, b] be a real function to be integrated over the interval
[a, b]. Consider the expansion (3.3.4) of the trapezoidal sum T(h) of f in
terms of the step length h = (b - a)/n. It is of the form
(3.4.1)
Here
TO =
lb
f(x)dt
is the integral to be calculated,
and
is the error coefficient. Since j(2m+2) is continuous by hypothesis in the
closed finite interval [a, b], there exists a bound L such that Ipm+2(x)1 ~ L
for all x E [a, b]. Therefore:
(3.4.2) There exists a constant Mm+l such that
lam+l(h)1 ~ Mm+l
for all h = (b - a)/n, n = 1, 2, ....
Expansions of the form (3.4.1) are called asymptotic expansions in h
if the coefficients Tk, k ~ m do not depend on hand am+l (h) satisfies
(3.4.2). The summation formula of Euler and Maclaurin is an example of
an asymptotic expansion. If all derivatives of f exist in [a, b]' then by letting
m = 00, the right-hand side of (3.4.1) becomes an infinite series:
TO
+ T 1 h2 + T2h4 + ....
This power series may diverge for any h f. O. Nevertheless, because of
(3.4.2), asymptotic expansions are capable of yielding for small h results
which are often sufficiently accurate for practical purposes [see, for instance,
Erdelyi (1956), Olver (1974)].
The above result (3.4.2) shows that the error term of the asymptotic
expansion (3.4.1) becomes small relative to the other terms of (3.4.1) as h
3.4 Integration by Extrapolation
161
decreases. The expansion then behaves like a polynomial in h 2 which yields
the value TO of the integral for h = O. This suggests the following method
for finding TO: For each step length hi in a sequence
b-a
h
ho = - - , hl = ~, ... ,
no
nl
where nl, n2, ... , nm are strictly increasing positive integers, determine
the corresponding trapezoidal sums
TiO := T(h i ),
i = 0, 1, ... , m.
Let
be the interpolating polynomial in h 2 with
and take the "extrapolated" value Tmm(O) as the approximation to the
desired integral TO. This method of integration is known as Romberg integration, having been introduced by Romberg (1955) for the special sequence
hi = (b - a)/2i. It has been closely examined by Bauer, Rutishauser, and
Stiefel (1963).
Neville's interpolation algorithm is particularly well suited for calculating Tmm(O). For indices i, k with 1 ::; k ::; i ::; m let Tik(h) be the
polynomial of degree at most k in h 2 for which
and let
The recursion formula (2.1.2.7) becomes for Xi =
(3.4.3)
T. = T
•k
.,k-l
+
Ti,k-l - Ti-1,k-l
[ hi - k ] 2 _ 1
hr
1 ::; k S i ::; m .
hi
It will be advantageous to arrange the intermediate values Tik in the triangular tableau (2.1.2.4), where each element derives from its two left neighbors:
162
3 Topics in Integration
T(ho) = Too
h6
(3.4.4)
EXAMPLE. Calculating the integral
11
t 5 dt
by extrapolation over the step lengths ho = 1, hI = 1/2, h2 = 1/4, we arrive
at the following tableau using (3.4.3) and 6-digit arithmetic (the fractions in
parentheses indicate the true values)
Too = 0.500000(= ~)
_ 1
h 21 -"4
TlO = 0.265625(=
H)
Tn = 0.187 500( = -&)
T22 = 0.166667(=~)
T21 = 0.167969(= 2~6)
T 20 = 0.192383(= 11~;4)
Each entry Tik of the polynomial extrapolation tableau (3.4.4) represents in fact a linear integration rule for a step length hi = (b - a)/mi'
mi > 0 integer:
Some of the rules, but not all, turn out to be of the Newton-Cotes type [see
Section 3.1]. For instance, if hI = h o/2 = (b - a)/2, then Tn is Simpson's
rule. Indeed,
Too = (b - a) (~f(a) + ~f(b)) ,
1
TlO = "2(b - a)
By (3.4.3),
and therefore
(1"2f(a) + f (a+b) + "2f(b)
1 ).
-2-
3.4 Integration by Extrapolation
TIl
163
= ~ (b - a) ( ~ f (a) + ~ f ( a ; b) + ~ f (b)) .
If we go one step further and put h2 = hd2, then T22 becomes Milne's rule.
However this pattern breaks down for h3 = h2/2 since T33 is no longer a
Newton-Cotes formula [see Exercise 10].
In the above cases, T21 and T31 are composite Simpson rules:
b- a I
4
2 (
4 (
I ())
T21 = -4-(-3f(a)
+ 3f(a
+ h 2 ) + 3f
a + 2h2 ) + 3f
a + 3h2) +:d
b
T31 = b ~ a (~f(a) + V(a + h3) + U(a + 2h 3 ) + ... + ~f(b)).
Very roughly speaking, proceeding downward in tableau (3.4.4) corresponds
to extending integration rules, whereas proceeding to the right increases
their order.
The following sequences of step lengths are usually chosen for extrapolation methods:
h 0 -- b - a, h I -!Y2
2"'"
(3 .4 .5a )
(3 . 4 . 5b)
h'• --
h i2- 1 '
ho
h 2 -- "'3'
ho ... , h i -h 0 -- b - a, h I -- 2'"'
Z• -hi- 2
-2-"
2 , 3 , ... ,
Z• --
3 , 4, . . . .
The first sequence is characteristic of Romberg's method [Romberg (1955)].
The second has been proposed by Bulirsch (1964). It has the advantage that
the effort for computing T(h i ) does not increase quite as rapidly as for the
Romberg sequence.
For the sequence (3.4.5a), half of the function values needed for calculating the trapezoidal sum T(hHd have been previously encountered in
the calculation of T(h i ), and their recalculation can be avoided. Clearly
T(hi+l) = T(~hi)
= ~T(hi)
+ hi+1[J(a + hHd + f(a + 3h H d + ... +- feb - hi+1)]'
Similar savings can be realized for the sequence (3.4.5b).
An ALGOL prozedure which calculates the tableau (3.4.4) for given m
and the interval [a, b] using the Romberg sequence (3.4.5a) is given below.
To save memory space, the tableau is built up by adding u.pward diagonals
to the bottom of the tableau. Only the lowest elements in each column need
to be stored for this purpose in the linear array t[O : m].
164
3 Topics in Integration
procedure romber:q (a, b, 1, rn);
value a, b, rn;
integer rn;
real a, b;
real procedure f;
begin real h, s;
integer i, k, n, q;
array t[O: rn];
h := b - a; n := 1;
t[O] := 0.5 x h x (f(a) + f(b));
for k:= 1 step 1 until rn do
begin s:= 0; h:= 0.5 x h; n := 2 x n; q := 1;
for i := 1 step 2 until n - 1 do
s:=s+f(a+ixh);
t[k] := 0.5 x t[k - 1] + s x h;
print (t[k]);
for i := k - 1 step -1 until a do
begin q := q x 4;
t[i] := t[i + 1] + (t[i + 1]- t[i])/(q - 1);
print t([i])
end
end
end;
We emphasize that the above algorithm serves mainly as an illustration
of integration by extrapolation methods. As it stands, it is not well suited
for practical calculations. For one thing, one does not usually know ahead
of time how big the parameter rn should be chosen in order to obtain the
desired accuracy. In practice, one calculates only a few (say seven) columns
of (3.4.4), and stops the calculation as soon as IT;,6 - Ti+1.61 :'S: E s, where
E is a specified tolerance and s is a rough approximation to the integral
lb
If(t)ldt.
Such an approximation s can be obtained concurrently with calculating
one of the trapezoidal sums T(h;). A more general stopping rule will be
described, together with a numerical example, in Section 3.5. Furthermore,
the sequence (3.4.5b) of step lengths is to be preferred over (3.4.5a). Finally, rational interpolation has been found to yield in most applications
considerably better results than polynomial interpolation. A program with
all these improvements can be found in Bulirsch and Stoer (1967).
When we apply rational interpolation [see Section 2.2], then the recursion (2.2.3.8) replaces (2.1.2.7):
3.5 About Extrapolation Methods
(3.4.6)
T
T
Ti,k-l - Ti-l,k-l
ik = i,k-l + - - - " " 2 - - ' - - - - - - - - - ' - - - - - Ti,k-l - Ti-1,k-l] -1
[ hi-k]
hi
Ti,k-l - Ti- l ,k-2
[1-
165
1 :S k :S i :S m.
The same triangular tableau arrangement is used as for polynomial extrapolation: k is the column index, and the recursion (~;.4.6) relates each
tableau element to its left-hand neighbors. The meaning of Tik, however,
is now as follows: The functions Tik (h) are rational functions in h 2 ,
with the interpolation property
We then define
Tik := Tik(O),
and initiate the recursion (3.4.6) by putting TiD := T(h i ) for i = 0, 1,
... , m, and Ti,-l := 0 for i = 0, L ... , m - 1. The observed superiority
of rational extrapolation methods reflects the more flexible approximation
properties of rational functions [see Section 2.2.4].
In Section 3.5, we will illustrate how error estimates for extrapolation
methods can be obtained from asymptotic expansions like (3.4.1). Under
mild restrictions on the sequence of step lengths, it will follow that, for
polynomial extrapolation methods based on even asymptotic expansions,
the errors of Til behave like h;, those of Til like
1 h;l and, in general,
those of Tik like hLkhLk+l ... h; as i -+ 00. For fixed k, consequently, the
sequence Tik i = k, k + 1, ... , approximates the integral like a method of
order 2k + 2. For the sequence (3.4.5a) a stronger result has been found:
hL
h2(-1)kB2k+2j(2k+2)( )
· -lbj( )d - (b- )h 2 h 2
T ,k
(347)
..
X
X a i-k i-k+l'" i (
)'
~ ,
a
2k + 2 .
for a suitable ~ E (a, b) and j E C 2k +2 [a, b] [see Bauer, Rutishauser and
Stiefel (1963), Bulirsch (1964)].
3.5 About Extrapolation Methods
Some of the numerical integration methods discussed in this chapter (as,
for instance, the methods based on the formulas of Newton and Cotes) had
a common feature: they utilized function information only on a discrete set
166
3 Topics in Integration
of points whose distance - and consequently the coarseness of the sample
- was governed by a "step length" . To each such step h i- 0 corresponded
an approximate result T(h), which furthermore admitted an asymptotic
expansion in powers of h. Analogous discretization methods are available
for many other problems, of which the numerical integration of functions is
but one instance. In all these cases, the asymptotic expansion of the result
T(h) is of the form
(3.5.1)
T(h) = TO + T1h'Y1 + T2h'Y2 + ... + Tmh'Ym + am+! (h)h'Ym+l,
o < 1'1 < 1'2 < ... < I'm+!,
where the exponents I'i need not to be integers. The coefficients Ti are
independent of h, the function a m+1(h) is bounded for h -t 0, and TO =
limh--+o T(h) is the exact solution of the problem at hand.
Consider, for example, numerical differentiation. For hi- 0, the central
difference quotient
T(h) = f(x + h) - f(x - h)
2h
is an approximation to f'(x). For functions f E C 2m+3[x - a,x + a] and
Ihl :s; lal, Taylors's theorem gives
1
h2
h 2m +3
T(h) =-{ f(x) + hf'(x) + - f"(x) + ... +
[jC2m+3 l (x) + 0(1)]
2h
2!
(2m + 3)!
h2
h 2m +3
- f(x) + hf'(x) - - f"(x) + ... +
[f C2m+3 l (x) + 0(1)] }
2!
(2m + 3)!
=TO + T1h2 + ... + Tmh2m + h 2m +2a m+1 (h)
where TO = f'(x), Tk = fC2k+ 1l(x)/(2k + I)!, k = 1, 2, ... , m + 1, and
a m+1(h) = Tm+! + 0(1).
U sing the one-sided difference quotient
f(x + h) - f(x)
T(h) .=-.
h
'
--J-
hr 0,
leads to the asymptotic expansion
with
fCk+!l(x)
Tk = (k + I)! '
k = 0, 1, 2, ... , m + 1.
We will see later that the central difference quotient is a better approximation to base an extrapolation method on, as far as convergence
is concerned, because its asymptotic expansion contains only even powers
of the step length h. Other important examples of discretization methods
3.5 About Extrapolation Methods
167
which lead to such asymptotic expansions are those for the solution of ordinary differential equations [see Sections 7.2.3 and 7.2.12]. A systematic
treatment of extrapolation methods is found in Brezinski a,nd Zaglia (1991).
In order to derive an extrapolation method for a given discretization
method with (3.5.1), we select a sequence of step lengths
ho > hI > h2 > ... > 0,
F = {ho, hI, h2' ... },
and calculate the corresponding approximate solutions T(h i ), i = 0, 1, 2,
.... For i 2': k, we introduce the "polynomials"
for which
Tik(hj)=T(h j ),
j=i-k,i-k+1, ... ,i,
and we consider the values
as approximations to the desired value 70. Rational functions Tik(h) are
frequently preferred over polynomials. Also the exponents 'Yk need not be
integer [see Bulirsch and Stoer (1964)].
For the following discussion of the discretization errors, we will assume
that Tik (h) are polynomials with exponents of the form 'Yk: = 'Y k. Romberg
integration [see Section 3.4] is a special case with 'Y = 2.
First, we consider the case i = k. In the sequel, we will use the abbreviations
h'Yj ' j = 0,1, ... , m.
Zj ..Applying Lagrange's interpolation formula (2.1.1.4) to the polynomial
yields for Z =
°
k
k
nk = Pk(o) = LCjPk(Zj) = LCjT(hj )
j=o
j=o
with
k
Cj:=II ~.
u""J Za -
Zj
17=0
Then
(3.5.2)
k
~CjZT
=
~
J
j=O
{I°
k
(-1) ZOZI'" Zk
if 7 = 0,
if 7 = 1, 2, ... , k,
if7=k+1.
168
3 Topics in Integration
PROOF. By Lagrange's interpolation formula (2.1.1.4) and the uniqueness
of polynomial interpolation,
k
p(O) = LCjp(Zj)
j=O
holds for all polynomials p(z) of degree::; k. One obtains (3.5.2) by choosing
p( z) as the polynomials
ZT,
T
= 0, 1, ... , k,
and
D
(3.5.2) can be sharpened for sequences hj for which there exists a constant b > 0 such that
h+l
T
: ; b < 1 for all j.
(3.5.3)
J
In this case, there exists a constant C k which depends only on b and for
which
k
(3.5.4)
L [Cj[zj+l ::; CkZOZl··· Zk·
j=O
We prove (3.5.4) only for the special case of geometric sequences {h j } with
h j = hobl,
0 < b < 1,
j = 0, 1, ....
For the general case see Bulirsch and Stoer (1964). With the abbreviation
(J := b'Y we have
zj = (hobl)'YT = zoBjT.
In view of (3.5.2), the polynomial
k
Pk(Z) := L CjZ j
j=O
satisfies
for T = 0,
for T = 1, 2, ... , k,
so that Pk(z) has the k different roots BT, T = 1, ... , k. Since P k (l) = 1
the polynomial Pk must have the form
3.5 About Extrapolation Methods
169
The coefficients of Pk alternate in sign, so that
k
k
L ICjlzj+1 = Z~+l L IcJ I(g(k+1))j = z~+1IPk( __ gk+1)1
j=O
j=O
- ~k+1
-
"0
gk+1 + gl
IIk ------,-1 _ gl
1=1
gl
II ~
1 _ gl
k
-
k+1g1+2+ ... +k
=
Cdg)Z()Z1 ... Zk
- Zo
1=1
with
(3.5.5)
This proves (3.5.4) for the special case of geometrically decreasing step
lengths hj.
We are now able to make use of the asymptotic expansion (3.5.1) which
gives for k :::; m
k
Tkk = L cjT(hj )
j=O
k
=L
J
Cj[TO + T1Zj + T2 Z + ... + Tkzj + cxk+1(hj)zj+1],
j=O
where of course, for k < m
By (3.5.2)
k
(3.5.6)
Tkk = TO + LCjCXk+l(hj)zj+1.
J=O
Using (3.5.4) and ICXrn+l(hj )1 :::; Mrn+1 for all j 2: 0 [see (3.4.2)] we find for
k=m
(3.5.7)
170
3 Topics in Integration
and for k < m,
(3.5.8)
because of (}:k+l(h j ) = Tk+l + O(h]), (3.5.2), and (3.5.6).
The corresponding estimates for the error of Tik for the general case
i 2: k are obtained by just replacing in (3.5.7) and (3.5.8) Zo, Zl, ... , Zk by
Zi-k, Zi-k+1, ... , Zi, and ho by h i - k respectively, because Tik is obtained
by extrapolating from T(h j ), j = i - k, i - k + 1, ... , i. Since Zj = h] this
leads to
(3.5.9)
and for k < m, i 2: k to
(3.5.10)
Consequently, for fixed k and i -t 00,
Tik - TO = O(h~~tlh).
In other words, the elements Tik of the (k + 1)st column of the tableau
(3.4.4) converge to TO like a method of order (k + 1)')'. Note that the increase of the order of convergence from column to column which can be
achieved by extrapolation methods is equal to 'Y : 'Y = 2 is twice as good
as 'Y = 1. This explains the preference for discretization methods whose
corresponding asymptotic expansions contain only even powers of h, e.g.,
the asymptotic expansion of the trapezoidal sum (3.4.1) or the central difference quotient discussed in this section.
The formula (3.5.10) shows furthermore that the sign of the error remains constant for fixed k < m and sufficiently large i provided Tk+l -I- O.
Then
(3.5.11)
Now in many cases b'Y(k+ 1) < ~. Then the error of the quantity
Uik := 2Ti+l,k - Tik
satisfies
Uik - TO = 2(Ti+l,k - TO) - (Tik - TO)'
For s := sign (Ti+l,k - TO) = sign (Ti,k - TO) we have
S(Uik - TO) = 2ITi,k+1 - Tol-ITik - Tol ::::; -ITik - Tol < O.
Thus Uik converges monotonically to TO, for i -t 00 at roughly the same
rate as Tik but from the opposite direction, so that eventually Tik and Uik
3.6 Gaussian Integration IVIethods
171
will include the limit TO between them. This observation yields a convenient
stopping criterion.
EXAMPLE. The exact value of the integral
1
"/2
o
5(e" - 2)-le 2X cosx dx
is 1. Using the polynomial extrapolation method of Romberg, and carrying 12
digits, we obtain for Tik, Uik , 0 ::; i ::; 6, 0 ::; k ::; 3 the values given in the
following table.
TiO
o 0.185 755 068 924
1 0.724 727 335 089 0.904 384 757 145
2 0.925 565 035 158 0.992 510 935 182 0.998 386 013 717
3 0.981 021 630 069 0.999 507 161 706 0.999 973 576 808 0.999 998 776 222
4 0.995 232 017 388 0.999 968 813 161 0.999 999 589 925 1.000 000 002 83
5 0.998 806 537974 0.999998044836 0.999 999 993 614 1.00000000002
6 0.999 701 542 775 0.999 999 877 709 0.999 999 999 901 1.000 000 000 00
UiO
o 1.263 699 601 26
1 1.126 402 735 23
2 1.036478 22498
3 1.009 442 404 71
4 1.002 381 058 56
5 1.000 596 547 58
6 1.000 149 217 14
Ui1
Ui3
Ui 2
1.080 637 113 22
1.006 503 388 23
1.000 430 464 62
1.000 027 276 51
1.000 001 710 58
1.000 000 107 00
1.001 561 139 90
1.000 025 603 04 1.000 001 229 44
1.000 000 397 30 0.999 999 997 211
1.000 000 006 19 0.999 999 999 978
1.000 000 000 09 1. 000 000 000 00
3.6 Gaussian Integration Methods

In this section, we broaden the scope of our examination by considering integrals of the form

   I(f) := ∫_a^b w(x) f(x) dx,

where w(x) is a given nonnegative weight function on the interval [a, b]. Also, the interval [a, b] may be infinite, e.g., [0, +∞] or [−∞, +∞]. The weight function must meet the following requirements:

(3.6.1)
 a) w(x) ≥ 0 is measurable on the finite or infinite interval [a, b].
 b) All moments μ_k := ∫_a^b x^k w(x) dx, k = 0, 1, ..., exist and are finite.
 c) For polynomials s(x) which are nonnegative on [a, b], ∫_a^b w(x) s(x) dx = 0 implies s(x) ≡ 0.
The conditions (3.6.1) are met, for instance, if w(x) is positive and continuous on a finite interval [a, b]. Condition (3.6.1c) is equivalent to ∫_a^b w(x) dx > 0 [see Exercise 14].
We will again examine integration rules of the type

(3.6.2)   Ĩ(f) := Σ_{i=1}^{n} w_i f(x_i).

The Newton-Cotes formulas [see Section 3.1] are of this form, but the abscissas x_i were required to form a uniform partition of the interval [a, b]. In this section, we relax this restriction and try to choose the x_i as well as the w_i so as to maximize the order of the integration method, that is, to maximize the degree for which all polynomials are exactly integrated by (3.6.2). We will see that this is possible and leads to a class of well-defined so-called Gaussian integration rules or Gaussian quadrature formulas [see for instance Stroud and Secrest (1966)]. These Gaussian integration rules will be shown to be unique and of order 2n − 1. Also w_i > 0 and a < x_i < b for i = 1, ..., n. In order to establish these results and to determine the exact form of the Gaussian integration rules, we need some basic facts about orthogonal polynomials. We introduce the notation

   Π̄_j := { p | p(x) = x^j + a_1 x^{j−1} + ⋯ + a_j }

for the set of normed real polynomials of degree j, and, as before, we denote by

   Π_j := { p | degree(p) ≤ j }

the linear space of all real polynomials whose degree does not exceed j. In addition, we define the scalar product

   (f, g) := ∫_a^b w(x) f(x) g(x) dx

on the linear space L²[a, b] of all functions for which the integral ∫_a^b w(x) f(x)^2 dx is well defined and finite. The functions f, g ∈ L²[a, b] are called orthogonal if (f, g) = 0. The following theorem establishes the existence of a sequence of mutually orthogonal polynomials, the system of orthogonal polynomials associated with the weight function w(x).
(3.6.3) Theorem. There exist polynomials p_j ∈ Π̄_j, j = 0, 1, 2, ..., such that

(3.6.4)   (p_i, p_k) = 0   for i ≠ k.

These polynomials are uniquely defined by the recursions

(3.6.5a)   p_0(x) ≡ 1,
(3.6.5b)   p_{i+1}(x) = (x − δ_{i+1}) p_i(x) − γ_{i+1}^2 p_{i−1}(x)   for i ≥ 0,

where p_{−1}(x) := 0 and¹

(3.6.6a)   δ_{i+1} := (x p_i, p_i)/(p_i, p_i)   for i ≥ 0,
(3.6.6b)   γ_1 := 1,   γ_{i+1}^2 := (p_i, p_i)/(p_{i−1}, p_{i−1})   for i ≥ 1

(the value of γ_1 is immaterial in (3.6.5b), since p_{−1} ≡ 0).

PROOF. The polynomials can be constructed recursively by a technique known as Gram-Schmidt orthogonalization. Clearly p_0(x) ≡ 1. Suppose then, as an induction hypothesis, that all orthogonal polynomials with the above properties have been constructed for j ≤ i and have been shown to be unique. We proceed to show that there exists a unique polynomial p_{i+1} ∈ Π̄_{i+1} with

(3.6.7)   (p_{i+1}, p_j) = 0   for j ≤ i,

and that this polynomial satisfies (3.6.5b). Any polynomial p̄_{i+1} ∈ Π̄_{i+1} can be written uniquely in the form

   p̄_{i+1}(x) = (x − δ_{i+1}) p_i(x) + c_{i−1} p_{i−1}(x) + ⋯ + c_0 p_0(x),

because its leading coefficient and those of the polynomials p_j, j ≤ i, have value 1. Since (p_j, p_k) = 0 for all j, k ≤ i with j ≠ k, (3.6.7) holds if and only if

(3.6.8a)   (p̄_{i+1}, p_i) = (x p_i, p_i) − δ_{i+1}(p_i, p_i) = 0,
(3.6.8b)   (p̄_{i+1}, p_{j−1}) = (x p_{j−1}, p_i) + c_{j−1}(p_{j−1}, p_{j−1}) = 0   for j ≤ i.

The condition (3.6.1c), with p_i^2 and p_{j−1}^2, respectively, in the role of the nonnegative polynomial s, rules out (p_i, p_i) = 0 and (p_{j−1}, p_{j−1}) = 0 for 1 ≤ j ≤ i. Therefore, the equations (3.6.8) can be solved uniquely. (3.6.8a) gives (3.6.6a). By the induction hypothesis,

   p_j(x) = (x − δ_j) p_{j−1}(x) − γ_j^2 p_{j−2}(x)   for j ≤ i.

From this, by solving for x p_{j−1}(x), we have (x p_{j−1}, p_i) = (p_j, p_i) for j ≤ i, so that

   c_{j−1} = −(p_i, p_i)/(p_{i−1}, p_{i−1})   for j = i,
   c_{j−1} = 0                                for j < i,

in view of (3.6.8). Thus (3.6.5b) has been established for i + 1.  □

¹ x p_i denotes the polynomial with values x p_i(x) for all x.
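The recursion (3.6.5) translates directly into a short program. The following Python sketch evaluates p_0(x), ..., p_n(x) from given coefficient arrays; the Legendre coefficients used in the demonstration (all δ_i = 0 and γ_{j+1}² = j²/(4j² − 1)) anticipate Exercise 17 and are quoted here only to have a concrete test case.

   def orthopoly_values(x, delta, gamma2):
       """Values p_0(x), ..., p_n(x) of the monic orthogonal polynomials defined by
       p_{i+1}(x) = (x - delta[i]) * p_i(x) - gamma2[i] * p_{i-1}(x), p_{-1} = 0, p_0 = 1.
       delta[i] plays the role of delta_{i+1}, gamma2[i] the role of gamma_{i+1}^2."""
       values = [1.0]
       prev = 0.0
       for i in range(len(delta)):
           nxt = (x - delta[i]) * values[-1] - gamma2[i] * prev
           prev = values[-1]
           values.append(nxt)
       return values

   n = 4
   delta = [0.0] * n                                              # Legendre: all delta_i = 0
   gamma2 = [0.0] + [j * j / (4.0 * j * j - 1.0) for j in range(1, n)]   # gamma_1^2 is unused
   print(orthopoly_values(0.5, delta, gamma2))                    # p_0(0.5), ..., p_4(0.5)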
Every polynomial p ∈ Π_k is clearly representable as a linear combination of the orthogonal polynomials p_i, i ≤ k. We thus have:

(3.6.9) Corollary. (p, p_n) = 0 for all p ∈ Π_{n−1}.

(3.6.10) Theorem. The roots x_i, i = 1, ..., n, of p_n are real and simple. They all lie in the open interval (a, b).

PROOF. Consider those roots of p_n which lie in (a, b) and which are of odd multiplicity, that is, at which p_n changes sign:

   a < x_1 < ⋯ < x_l < b.

The polynomial

   q(x) := ∏_{j=1}^{l} (x − x_j) ∈ Π̄_l

is such that the polynomial p_n(x) q(x) does not change sign in [a, b], so that

   (p_n, q) = ∫_a^b w(x) p_n(x) q(x) dx ≠ 0

by (3.6.1c). Thus degree(q) = l = n must hold, as otherwise (p_n, q) = 0 by Corollary (3.6.9).  □
Next we have the

(3.6.11) Theorem. The n × n matrix

   A := [ p_0(t_1)      ⋯   p_0(t_n)
          ⋮                  ⋮
          p_{n−1}(t_1)  ⋯   p_{n−1}(t_n) ]

is nonsingular for mutually distinct arguments t_i, i = 1, ..., n.

PROOF. Assume A is singular. Then there is a vector c^T = (c_0, ..., c_{n−1}), c ≠ 0, with c^T A = 0. The polynomial

   q(x) := Σ_{i=0}^{n−1} c_i p_i(x),

with degree(q) < n, has the n distinct roots t_1, ..., t_n and must vanish identically. Since the polynomials p_i(·) are linearly independent, q(x) ≡ 0 implies the contradiction c = 0.  □

Theorem (3.6.11) shows that the interpolation problem of finding a function of the form

   p(x) := Σ_{i=0}^{n−1} c_i p_i(x)

with p(t_i) = f_i, i = 1, 2, ..., n, is always solvable. The condition of the theorem is known as the Haar condition. Any sequence of functions p_0, p_1, ... which satisfies the Haar condition is said to form a Chebyshev system. Theorem (3.6.11) states that sequences of orthogonal polynomials are Chebyshev systems.
Now we arrive at the main result of this section.

(3.6.12) Theorem.
(a) Let x_1, ..., x_n be the roots of the nth orthogonal polynomial p_n(x), and let w_1, ..., w_n be the solution of the (nonsingular) system of equations

(3.6.13)   Σ_{i=1}^{n} p_k(x_i) w_i = (p_0, p_0)   if k = 0,
           Σ_{i=1}^{n} p_k(x_i) w_i = 0            if k = 1, 2, ..., n − 1.

Then w_i > 0 for i = 1, 2, ..., n, and

(3.6.14)   ∫_a^b w(x) p(x) dx = Σ_{i=1}^{n} w_i p(x_i)

holds for all polynomials p ∈ Π_{2n−1}. The positive numbers w_i are called "weights".
(b) Conversely, if the numbers w_i, x_i, i = 1, ..., n, are such that (3.6.14) holds for all p ∈ Π_{2n−1}, then the x_i are the roots of p_n and the weights w_i satisfy (3.6.13).
(c) It is not possible to find numbers x_i, w_i, i = 1, ..., n, such that (3.6.14) holds for all polynomials p ∈ Π_{2n}.

PROOF. By Theorem (3.6.10), the roots x_i, i = 1, ..., n, of p_n are real and mutually distinct numbers in the open interval (a, b). The matrix

(3.6.15)   A := [ p_0(x_1)      ⋯   p_0(x_n)
                  ⋮                  ⋮
                  p_{n−1}(x_1)  ⋯   p_{n−1}(x_n) ]

is nonsingular by Theorem (3.6.11), so that the system of equations (3.6.13) has a unique solution.
Consider an arbitrary polynomial p ∈ Π_{2n−1}. It can be written in the form

(3.6.16)   p(x) = p_n(x) q(x) + r(x),
where q, r are polynomials in Π_{n−1}, which we can express as linear combinations of orthogonal polynomials

   q(x) ≡ Σ_{k=0}^{n−1} α_k p_k(x),   r(x) ≡ Σ_{k=0}^{n−1} β_k p_k(x).

Since p_0(x) ≡ 1, it follows from (3.6.16) and Corollary (3.6.9) that

   ∫_a^b w(x) p(x) dx = (p_n, q) + (r, p_0) = β_0 (p_0, p_0).

On the other hand, by (3.6.16) [since p_n(x_i) = 0] and by (3.6.13),

   Σ_{i=1}^{n} w_i p(x_i) = Σ_{i=1}^{n} w_i r(x_i) = Σ_{k=0}^{n−1} β_k ( Σ_{i=1}^{n} w_i p_k(x_i) ) = β_0 (p_0, p_0).

Thus (3.6.14) is satisfied.
We observe that

(3.6.17). If w_i, x_i, i = 1, ..., n, are such that (3.6.14) holds for all polynomials p ∈ Π_{2n−1}, then w_i > 0 for i = 1, ..., n.

This is readily verified by applying (3.6.14) to the polynomials

   p_j(x) := ∏_{h=1, h≠j}^{n} (x − x_h)^2 ∈ Π_{2n−2},   j = 1, ..., n,

and noting that

   0 < ∫_a^b w(x) p_j(x) dx = Σ_{i=1}^{n} w_i p_j(x_i) = w_j ∏_{h=1, h≠j}^{n} (x_j − x_h)^2

by (3.6.1c). This completes the proof of (3.6.12a).
We prove (3.6.12c) next. Assume that w_i, x_i, i = 1, ..., n, are such that (3.6.14) even holds for all polynomials p ∈ Π_{2n}. Then

   p(x) := ∏_{j=1}^{n} (x − x_j)^2 ∈ Π_{2n}

contradicts this claim, since by (3.6.1c)

   0 < ∫_a^b w(x) p(x) dx = Σ_{i=1}^{n} w_i p(x_i) = 0.

This proves (3.6.12c).
To prove (3.6.12b), suppose that w_i, x_i, i = 1, ..., n, are such that (3.6.14) holds for all p ∈ Π_{2n−1}. Note that the abscissas x_i must be mutually distinct, since otherwise we could formulate the same integration rule using only n − 1 of the abscissas x_i, contradicting (3.6.12c).
Applying (3.6.14) to the orthogonal polynomials p = p_k, k = 0, ..., n − 1, themselves, we find

   Σ_{i=1}^{n} w_i p_k(x_i) = (p_k, p_0) = (p_0, p_0)   if k = 0,
   Σ_{i=1}^{n} w_i p_k(x_i) = 0                         if 1 ≤ k ≤ n − 1.

In other words, the weights w_i must satisfy (3.6.13).
Applying (3.6.14) to p(x) := p_k(x) p_n(x), k = 0, ..., n − 1, gives by (3.6.9)

   0 = (p_k, p_n) = Σ_{i=1}^{n} w_i p_n(x_i) p_k(x_i),   k = 0, ..., n − 1.

In other words, the vector c := (w_1 p_n(x_1), ..., w_n p_n(x_n))^T solves the homogeneous system of equations Ac = 0, with A the matrix (3.6.15). Since the abscissas x_i, i = 1, ..., n, are mutually distinct, the matrix A is nonsingular by Theorem (3.6.11). Therefore c = 0 and w_i p_n(x_i) = 0 for i = 1, ..., n. Since w_i > 0 by (3.6.17), we have p_n(x_i) = 0, i = 1, ..., n. This completes the proof of (3.6.12b).  □

For the most common weight function w(x) ≡ 1 and the interval [−1, 1], the results of Theorem (3.6.12) are due to Gauss. The corresponding orthogonal polynomials are [see Exercise 16]

(3.6.18)   p_j(x) = (j!/(2j)!) (d^j/dx^j) (x^2 − 1)^j.

Indeed, p_k ∈ Π̄_k, and integration by parts establishes (p_i, p_k) = 0 for i ≠ k. Up to a factor, the polynomials (3.6.18) are the Legendre polynomials. In the following table we give some values for w_i, x_i in this important special case. For further values see the National Bureau of Standards Handbook of Mathematical Functions [Abramowitz and Stegun (1964)].
   n     w_i                                       x_i
   2     w_1 = w_2 = 1                             x_2 = −x_1 = 0.577 350 2692 ...
   3     w_1 = w_3 = 5/9 = 0.555 555 5556 ...      x_3 = −x_1 = 0.774 596 6692 ...
         w_2 = 8/9 = 0.888 888 8889 ...            x_2 = 0
   4     w_1 = w_4 = 0.347 854 8451 ...            x_4 = −x_1 = 0.861 136 3116 ...
         w_2 = w_3 = 0.652 145 1549 ...            x_3 = −x_2 = 0.339 981 0436 ...
   5     w_1 = w_5 = 0.236 926 8851 ...            x_5 = −x_1 = 0.906 179 8459 ...
         w_2 = w_4 = 0.478 628 6705 ...            x_4 = −x_2 = 0.538 469 3101 ...
         w_3 = 128/225 = 0.568 888 8889 ...        x_3 = 0
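The entries of the table can be reproduced with a standard library routine; the sketch below uses numpy.polynomial.legendre.leggauss, which returns the Gauss-Legendre abscissas and weights for w(x) ≡ 1 on [−1, 1].

   import numpy as np

   for n in (2, 3, 4, 5):
       x, w = np.polynomial.legendre.leggauss(n)   # abscissas and weights for [-1, 1], w(x) = 1
       print(n, np.round(x, 10), np.round(w, 10))
   # n = 3 gives x = (-0.7745966692, 0, 0.7745966692) and w = (5/9, 8/9, 5/9), as in the table.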
Other important cases which lead to Gaussian integration rules are listed in the following table:

   [a, b]        w(x)                Orthogonal polynomials
   [−1, 1]       (1 − x^2)^{−1/2}    T_n(x), Chebyshev polynomials
   [0, ∞]        e^{−x}              L_n(x), Laguerre polynomials
   [−∞, ∞]       e^{−x^2}            H_n(x), Hermite polynomials

We have characterized the quantities w_i, x_i which enter the Gaussian integration rules for given weight functions, but we have yet to discuss methods for their actual calculation. We will examine this problem under the assumption that the coefficients δ_i, γ_i of the recursion (3.6.5) are given. Golub and Welsch (1969) and Gautschi (1968, 1970) discuss the much harder problem of finding the coefficients δ_i, γ_i.
The theory of orthogonal polynomials ties in with the theory of real tridiagonal matrices

(3.6.19)   J_n = [ δ_1  γ_2
                   γ_2  δ_2  γ_3
                        γ_3  ⋱    ⋱
                             ⋱    ⋱    γ_n
                                  γ_n  δ_n ]

and their principal submatrices

   J_j := [ δ_1  γ_2
            γ_2  δ_2  ⋱
                 ⋱    ⋱    γ_j
                      γ_j  δ_j ].
Such matrices will be studied in Sections 5.5, 5.6, and 6.6.1. In Section 5.5 it will be seen that the characteristic polynomials p_j(x) = det(J_j − xI) of the J_j satisfy the recursions (3.6.5) with the matrix elements δ_j, γ_j as the coefficients. Therefore, p_n is the characteristic polynomial of the tridiagonal matrix J_n. Consequently we have

(3.6.20) Theorem. The roots x_i, i = 1, ..., n, of the nth orthogonal polynomial p_n are the eigenvalues of the tridiagonal matrix J_n in (3.6.19).

The bisection method of Section 5.6, the QR method of Section 6.6.6, and others are available to calculate the eigenvalues of these tridiagonal matrices. With respect to the weights w_i, we have [Szegő (1959), Golub and Welsch (1969)]:

(3.6.21) Theorem. Let v^{(i)} := (v_1^{(i)}, ..., v_n^{(i)})^T be an eigenvector of J_n (3.6.19) for the eigenvalue x_i, J_n v^{(i)} = x_i v^{(i)}. Suppose v^{(i)} is scaled in such a way that

   v^{(i)T} v^{(i)} = (p_0, p_0) = ∫_a^b w(x) dx.

Then the weights are given by

   w_i = (v_1^{(i)})^2,   i = 1, ..., n.
PROOF. We verify that the vector

   v^{(i)} = (ρ_0 p_0(x_i), ρ_1 p_1(x_i), ..., ρ_{n−1} p_{n−1}(x_i))^T,

where

   ρ_j := 1/(γ_1 γ_2 ⋯ γ_{j+1})   for j = 0, 1, ..., n − 1,

is an eigenvector of J_n for the eigenvalue x_i: J_n v^{(i)} = x_i v^{(i)}. By (3.6.5), for any x,

   δ_1 ρ_0 p_0(x) + γ_2 ρ_1 p_1(x) = ρ_0 [δ_1 p_0(x) + p_1(x)] = x ρ_0 p_0(x).

For j = 2, ..., n − 1, similarly

   γ_j ρ_{j−2} p_{j−2}(x) + δ_j ρ_{j−1} p_{j−1}(x) + γ_{j+1} ρ_j p_j(x)
      = ρ_{j−1} [γ_j^2 p_{j−2}(x) + δ_j p_{j−1}(x) + p_j(x)]
      = x ρ_{j−1} p_{j−1}(x),

and finally

   γ_n ρ_{n−2} p_{n−2}(x) + δ_n ρ_{n−1} p_{n−1}(x) = ρ_{n−1} [γ_n^2 p_{n−2}(x) + δ_n p_{n−1}(x)]
      = x ρ_{n−1} p_{n−1}(x) − ρ_{n−1} p_n(x),

so that J_n v^{(i)} = x_i v^{(i)} holds, provided p_n(x_i) = 0.
Since ρ_j ≠ 0, j = 0, 1, ..., n − 1, the system of equations (3.6.13) for the w_i is equivalent to

(3.6.22)   (v^{(1)}, ..., v^{(n)}) w = (p_0, p_0) e_1

with w = (w_1, ..., w_n)^T, e_1 = (1, 0, ..., 0)^T.
Eigenvectors of symmetric matrices for distinct eigenvalues are orthogonal. Therefore, multiplying (3.6.22) by v^{(i)T} from the left yields

   (v^{(i)T} v^{(i)}) w_i = (p_0, p_0) v_1^{(i)}.

Since ρ_0 = 1 and p_0(x) ≡ 1, we have v_1^{(i)} = 1. Thus

(3.6.23)   (v^{(i)T} v^{(i)}) w_i = (p_0, p_0).

Using again the fact that v_1^{(i)} = 1, we find v_1^{(i)} v^{(i)} = v^{(i)}, and multiplying (3.6.23) by (v_1^{(i)})^2 gives

   (v^{(i)T} v^{(i)}) w_i = (v_1^{(i)})^2 (p_0, p_0).

This last relation is unchanged if v^{(i)} is replaced by a nonzero scalar multiple of itself, so it holds in particular for the normalization assumed in the theorem. Since v^{(i)T} v^{(i)} = (p_0, p_0) by hypothesis, we obtain w_i = (v_1^{(i)})^2.  □

If the QR method is employed for determining the eigenvalues of J_n, then the calculation of the first components v_1^{(i)} of the eigenvectors v^{(i)} is readily included in that algorithm: calculating the abscissas x_i and the weights w_i can be done concurrently [Golub and Welsch (1969)].
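Theorems (3.6.20) and (3.6.21) suggest the following computational scheme of Golub and Welsch: form the tridiagonal matrix J_n from the recursion coefficients, take its eigenvalues as abscissas x_i, and take (p_0, p_0) times the squared first components of the normalized eigenvectors as weights w_i. The Python sketch below carries this out for the Legendre case; the coefficients γ_{j+1}² = j²/(4j² − 1) are those of Exercise 17, and the use of a dense symmetric eigensolver is merely a convenience.

   import numpy as np

   def gauss_from_recursion(delta, gamma, mu0):
       """Abscissas and weights from the symmetric tridiagonal matrix J_n with
       diagonal delta_1, ..., delta_n and off-diagonal gamma_2, ..., gamma_n;
       mu0 = (p_0, p_0) = integral of w(x) dx."""
       J = np.diag(delta) + np.diag(gamma, 1) + np.diag(gamma, -1)
       x, V = np.linalg.eigh(J)          # eigenvalues ascending, columns are orthonormal eigenvectors
       w = mu0 * V[0, :] ** 2            # squared first components, scaled by (p_0, p_0)
       return x, w

   n = 5
   delta = np.zeros(n)                                              # Legendre: delta_i = 0
   gamma = np.array([j / np.sqrt(4.0 * j * j - 1.0) for j in range(1, n)])
   x, w = gauss_from_recursion(delta, gamma, mu0=2.0)
   print(x)    # -0.9061798459, -0.5384693101, 0, 0.5384693101, 0.9061798459
   print(w)    # 0.2369268851, 0.4786286705, 0.5688888889, 0.4786286705, 0.2369268851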
Finally, we will estimate the error of Gaussian integration:

(3.6.24) Theorem. If f ∈ C^{2n}[a, b], then

   ∫_a^b w(x) f(x) dx − Σ_{i=1}^{n} w_i f(x_i) = ( f^{(2n)}(ξ) / (2n)! ) (p_n, p_n)

for some ξ ∈ (a, b).

PROOF. Consider the solution h ∈ Π_{2n−1} of the Hermite interpolation problem [see Section 2.1.5]

   h(x_i) = f(x_i),   h'(x_i) = f'(x_i),   i = 1, 2, ..., n.

Since degree h < 2n,

   ∫_a^b w(x) h(x) dx = Σ_{i=1}^{n} w_i h(x_i) = Σ_{i=1}^{n} w_i f(x_i)

by Theorem (3.6.12). Therefore, the error term has the integral representation

   ∫_a^b w(x) f(x) dx − Σ_{i=1}^{n} w_i f(x_i) = ∫_a^b w(x) (f(x) − h(x)) dx.

By Theorem (2.1.5.9), and since the x_i are the roots of p_n(x) ∈ Π̄_n,

   f(x) − h(x) = ( f^{(2n)}(ζ(x)) / (2n)! ) p_n^2(x)

for some ζ = ζ(x) in the interval I[x, x_1, ..., x_n] spanned by x and x_1, ..., x_n. Next,

   ( f(x) − h(x) ) / p_n^2(x) = f^{(2n)}(ζ(x)) / (2n)!

is continuous on [a, b], so that the mean-value theorem of integral calculus applies:

   ∫_a^b w(x)(f(x) − h(x)) dx = (1/(2n)!) ∫_a^b w(x) f^{(2n)}(ζ(x)) p_n^2(x) dx
      = ( f^{(2n)}(ξ) / (2n)! ) (p_n, p_n)

for some ξ ∈ (a, b).  □
Comparing the various integration rules (Newton-Cotes formulas, extrapolation methods, Gaussian integration), we find that, computational efforts being equal, Gaussian integration yields the most accurate results. If only one knew ahead of time how to choose n so as to achieve a specified accuracy for any given integral, then Gaussian integration would be clearly superior to other methods. Unfortunately, it is frequently not possible to use the error formula (3.6.24) for this purpose, because the 2nth derivative is difficult to estimate. For these reasons, one will usually apply Gaussian integration for increasing values of n until successive approximate values agree within the specified accuracy. Since the function values which had been calculated for n cannot be reused for n + 1 (at least not in the classical case w(x) ≡ 1), the apparent advantages of Gaussian integration as compared with extrapolation methods are soon lost. There have been attempts to remedy this situation [e.g. Kronrod (1965)]. A collection of FORTRAN programs is given in Piessens et al. (1983).
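A small experiment along these lines: for the example integral of Section 3.5, the following sketch compares an n-point Gauss-Legendre rule (transformed from [−1, 1] to [0, π/2]) with a trapezoidal sum using the same number of function evaluations; the particular values of n are an arbitrary choice.

   import numpy as np

   f = lambda x: 5.0 / (np.exp(np.pi) - 2.0) * np.exp(2 * x) * np.cos(x)
   a, b = 0.0, np.pi / 2

   for n in (3, 5, 7):
       t, w = np.polynomial.legendre.leggauss(n)
       gauss = 0.5 * (b - a) * np.sum(w * f(0.5 * (b - a) * t + 0.5 * (a + b)))
       xs = np.linspace(a, b, n)                 # trapezoidal sum with n function values
       y = f(xs)
       h = (b - a) / (n - 1)
       trap = h * (y.sum() - 0.5 * (y[0] + y[-1]))
       print(n, abs(gauss - 1.0), abs(trap - 1.0))   # exact value of the integral is 1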
3.7 Integrals with Singularities

Examining some frequently used integration methods in this chapter, we found that their application to a given integral

   ∫_a^b f(x) dx,   a, b finite,

was justified provided the integrand f(x) was sufficiently often differentiable in [a, b]. For many practical problems, however, the function f(x) turns out not to be differentiable at the end points of [a, b], or at some isolated point in its interior. In what follows, we suggest several ways of dealing with this and related situations.

1) f(x) is sufficiently often differentiable on the closed subintervals of a partition a = a_1 < a_2 < ⋯ < a_{m+1} = b. Putting f_i(x) := f(x) on [a_i, a_{i+1}] and defining the derivatives of f_i(x) at a_i as the one-sided right derivative and at a_{i+1} as the one-sided left derivative, we find that standard methods can be applied to integrate the functions f_i(x) separately. Finally

   ∫_a^b f(x) dx = Σ_{i=1}^{m} ∫_{a_i}^{a_{i+1}} f_i(x) dx.
2) Suppose there is a point x̄ ∈ [a, b] at which not even one-sided derivatives of f(x) exist. For instance, the function f(x) = √x sin x is such that f'(x) will not be continuous for any choice of the value f'(0). Nevertheless, the variable transformation t := √x yields

   ∫_0^b √x sin x dx = ∫_0^{√b} 2t^2 sin t^2 dt

and leads to an integral with an integrand which is now arbitrarily often differentiable in [0, √b].
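The effect of the substitution can be checked numerically. The sketch below applies the same composite trapezoidal rule to the original integrand √x sin x and to the transformed integrand 2t² sin t² on [0, 1]; the reference value is obtained from scipy.integrate.quad, which is used here only for comparison.

   import numpy as np
   from scipy.integrate import quad

   def trapezoid(f, a, b, n):
       x = np.linspace(a, b, n + 1)
       y = f(x)
       return (b - a) / n * (y.sum() - 0.5 * (y[0] + y[-1]))

   ref, _ = quad(lambda x: np.sqrt(x) * np.sin(x), 0.0, 1.0)
   for n in (16, 64, 256):
       direct = trapezoid(lambda x: np.sqrt(x) * np.sin(x), 0.0, 1.0, n)
       transformed = trapezoid(lambda t: 2 * t**2 * np.sin(t**2), 0.0, 1.0, n)   # t = sqrt(x)
       print(n, abs(direct - ref), abs(transformed - ref))
   # The transformed integrand is smooth at 0, so its trapezoidal error decays like h^2.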
3) Another way to deal with the previously discussed difficulty is to split the integral:

   ∫_0^b √x sin x dx = ∫_0^ε √x sin x dx + ∫_ε^b √x sin x dx,   ε > 0.

The second integrand is arbitrarily often differentiable in [ε, b]. The first integrand can be developed into a uniformly convergent series on [0, ε] so that integration and summation can be interchanged:

   ∫_0^ε √x sin x dx = ∫_0^ε √x ( x − x^3/3! ± ⋯ ) dx = Σ_{ν=0}^{∞} (−1)^ν ε^{2ν+5/2} / ( (2ν+1)! (2ν + 5/2) ).

For sufficiently small ε, only a few terms of the series need be considered. The difficulty lies in the choice of ε: if ε is selected too small, then the proximity of the singularity at x = 0 causes the speed of convergence to deteriorate when we calculate the remaining integral.
4) Sometimes it is possible to subtract from the integrand f(x) a function whose indefinite integral is known, and which has the same singularities as f(x). For the above example, x√x is such a function:

   ∫_0^b √x sin x dx = ∫_0^b ( √x sin x − x√x ) dx + ∫_0^b x√x dx
                     = ∫_0^b √x ( sin x − x ) dx + (2/5) b^{5/2}.

The new integrand has a continuous third derivative and is therefore better amenable to standard integration methods. In order to avoid cancellation when calculating the difference sin x − x for small x, it is recommended to evaluate the power series

   sin x − x = −x^3 [ 1/3! − x^2/5! ± ⋯ ] = −x^3 Σ_{ν=0}^{∞} (−1)^ν x^{2ν} / (2ν + 3)!.
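The same device in code: the smooth part √x (sin x − x) is integrated numerically, the exactly known term (2/5) b^{5/2} is added back, and sin x − x is evaluated by its power series to avoid cancellation for small x. The truncation after eight series terms is an arbitrary choice for this sketch.

   import math

   def sin_minus_x(x, terms=8):
       """sin x - x = -x^3/3! + x^5/5! - ...; the series avoids cancellation for small x."""
       s, term = 0.0, -x**3 / 6.0
       for nu in range(terms):
           s += term
           term *= -x * x / ((2 * nu + 4) * (2 * nu + 5))
       return s

   def integral_sqrt_sin(b, n=200):
       h = b / n
       xs = [i * h for i in range(n + 1)]
       ys = [math.sqrt(x) * sin_minus_x(x) for x in xs]
       smooth = h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))    # trapezoidal rule, integrand is C^3
       return smooth + 0.4 * b ** 2.5                     # + (2/5) b^{5/2}, integrated exactly

   print(integral_sqrt_sin(1.0))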
5) For certain types of singularities, as in the case of

   I = ∫_0^b x^α f(x) dx,   0 < α < 1,

with f(x) sufficiently often differentiable on [0, b], the trapezoidal sum T(h) does not have an asymptotic expansion of the form (3.4.1), but rather one of the more general form (3.5.1):

   T(h) ~ τ_0 + τ_1 h^{γ_1} + τ_2 h^{γ_2} + ⋯,

where

   {γ_i} = { 1 + α, 2, 2 + α, 4, 4 + α, 6, 6 + α, ... }

[see Bulirsch (1964)]. Suitable step-length sequences for extrapolation methods in this case are discussed in Bulirsch and Stoer (1964).
6) Often the following scheme works surprisingly well: if the integrand of

   I = ∫_a^b f(x) dx

is not, or not sufficiently often, differentiable for x = a, put

   a_j := a + (b − a)/j,   j = 1, 2, ...,

in effect partitioning the half-open interval (a, b] into infinitely many subintervals over which to integrate separately using standard methods. Then

   I = Σ_{j=1}^{∞} ∫_{a_{j+1}}^{a_j} f(x) dx.

The convergence of the sequence of partial sums can often be accelerated using, for instance, Aitken's Δ² method [see Section 5.10]; a sketch follows below. Obviously, this scheme can be adapted to calculating improper integrals

   ∫_a^{∞} f(x) dx.
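A sketch of this scheme, using the partition points a_j = a + (b − a)/j and two sweeps of Aitken's Δ² transformation on the partial sums; the test integral ∫_0^1 x^{−1/2} dx = 2 and the fixed 8-point Gauss rule on each smooth subinterval are choices made only for this illustration.

   import numpy as np

   def gauss_panel(f, a, b, n=8):
       t, w = np.polynomial.legendre.leggauss(n)
       return 0.5 * (b - a) * np.sum(w * f(0.5 * (b - a) * t + 0.5 * (a + b)))

   def aitken(s):
       """One sweep of Aitken's Delta^2 transformation applied to the sequence s."""
       return [s[i] - (s[i+1] - s[i])**2 / (s[i+2] - 2*s[i+1] + s[i]) for i in range(len(s) - 2)]

   f = lambda x: 1.0 / np.sqrt(x)                     # integrable singularity at a = 0
   a, b = 0.0, 1.0
   pts = [a + (b - a) / j for j in range(1, 12)]      # b = a_1 > a_2 > ... -> a
   partial, total = [], 0.0
   for right, left in zip(pts[:-1], pts[1:]):
       total += gauss_panel(f, left, right)
       partial.append(total)
   print(partial[-1], aitken(aitken(partial))[-1])    # raw tail sum vs. accelerated value (exact: 2)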
7) The range of improper integrals can be made finite by suitable transformations. For x = 1/t we have, for instance,

   ∫_1^{∞} f(x) dx = ∫_0^1 t^{−2} f(1/t) dt.

If the new integrand is singular at 0, then one of the above approaches may be tried. Note that the Gaussian integration rules based on Laguerre and Hermite polynomials [see Section 3.6] apply directly to improper integrals of the forms

   ∫_0^{∞} f(x) dx,   ∫_{−∞}^{∞} f(x) dx,

respectively.
EXERCISES FOR CHAPTER 3
1.
Let a s: Xo < Xl < X2 < ... < Xn s: b be an arbitrary fixed partition of the
interval [a, b]. Show that there exist unique /,0, /,1, ... , /'n with
~ /'iP(Xi) =
2.
3.
lb
P(x)dx
for all polynomials P with degree (P) s: n.
Hint: P(x) = 1, X, ... , xn. Compare the resulting system of linear equations
with that representing the polynomial interpolation problem with support
abscissas Xi, i = 0, ... , n.
By construction, the nth Newton-Cotes formula yields the exact value of the
integral for integrands which are polynomials of degree at most n. Show that
for even values of n, polynomials of degree n + 1 are also integrated exactly.
Hint: Consider the integrand xn+1 in the interval [~k, +k], n = 2k + 1.
If f ∈ C²[a, b], then there exists an x̄ ∈ (a, b) such that the error of the trapezoidal rule is expressed as follows:

   (1/2)(b − a)(f(a) + f(b)) − ∫_a^b f(x) dx = (1/12)(b − a)^3 f''(x̄).
4.
5.
6.
Derive this result from the error formula in (2.1.4.1) by showing that !"(e(x))
is continuous in x.
Derive the error formula (3.1.6) using Theorem (2.1.5.9). Hint: See Exercise
3.
Let f ∈ C⁶[−1, +1], and let P ∈ Π_5 be the Hermite interpolation polynomial with P(x_i) = f(x_i), P'(x_i) = f'(x_i), x_i = −1, 0, +1.
(a) Show that

   ∫_{−1}^{+1} P(t) dt = (7/15) f(−1) + (16/15) f(0) + (7/15) f(+1) + (1/15) f'(−1) − (1/15) f'(+1).

(b) By construction, the above formula represents an integration rule which is exact for all polynomials of degree 5 or less. Show that it need not be exact for polynomials of degree 6.
(c) Use Theorem (2.1.5.9) to derive an error formula for the integration rule
in (a).
(d) Given a uniform partition Xi = a + ih, i = 0, ... , 2n, h = (b - a)/2n, of
the interval [a, b], what composite integration rule can be based on the
integration rule in (a)?
Consider an arbitrary partition L1 := {a = Xo < ... < Xn = b} of a given
interval [a, b]. In order to approximate
lb
7.
8.
9.
I(t)dt
using the function values I(Xi), i = 0, ... , n, spline interpolation [see Section
2.4] may be considered. Derive an integration rule in terms of I(Xi) and the
moments (2.4.2.1) of the "natural" spline (2.4.1.2a).
Determine the Peano kernel for Simpson's rule and n = 2 instead of n = 3
in [-1, +1]. Does it change sign in the interval of integration?
Consider the integration rule of Exercise 5.
(a) Show that its Peano kernel does not change its sign in [-1, +1].
(b) Use (3.2.8) to derive an error term.
Prove

   Σ_{k=1}^{n} k^3 = ( n(n + 1)/2 )^2

using the Euler-Maclaurin summation formula.
10. Integration over the interval [0,1] by Romberg's method using Neville's algorithm leads to the tableau
h5 = 1
Too = T(ho)
T11
h 21 -_ 4:1
h~ =
ft
In Section 3.4, it is shown that T11 is Simpson's rule.
(a) Show that T22 is Milne's rule.
(b) Show that T33 is not the Newton-Cotes formula for n = 8.
11. Let ho := b-a, hI := ho/3. Show that extrapolating T(ho) and T(hl) linearly
to h = 0 gives the 3/8-rule.
12. One wishes to approximate the number e by an extrapolation method.
(a) Show that T(h) := (1 + h)l/h, h =1= 0, Ihl < 1, has an expansion of the
form
00
T(h) = e + LTih i
i=l
which converges if Ihl < 1.
(b) Modify T(h) in such a way that extrapolation to h = 0 yields, for a fixed
value x, an approximation to eX.
13. Consider integration by a polynomial extrapolation method based on a geometric step-size sequence hj = ho~, j = 0, 1, ... , 0 < b < 1. Show that
small errors LJTj in the computation of the trapezoidal sums T(hj ), j = 0, 1,
... , m, will cause an error LJTmm in the extrapolated value Tmm satisfying
where Cm(O) is the constant given in (3.5.5). Note that Cm(O) --+ 00 as 0 --+ 1,
so that the stability of the extrapolation method deteriorates sharply as b
approaches 1.
14. Consider a weight function w(x) ≥ 0 which satisfies (3.6.1a) and (3.6.1b). Show that (3.6.1c) is equivalent to

   ∫_a^b w(x) dx > 0.

Hint: The mean-value theorem of integral calculus applied to suitable subintervals of [a, b].
15. The integral

   (f, g) := ∫_{−1}^{+1} f(x) g(x) dx

defines a scalar product for functions f, g ∈ C[−1, +1]. Show that if f and g are polynomials of degree less than n, if x_i, i = 1, 2, ..., n, are the roots of the nth Legendre polynomial (3.6.18), and if

   γ_i := ∫_{−1}^{+1} L_i(x) dx   with   L_i(x) := ∏_{k=1, k≠i}^{n} (x − x_k)/(x_i − x_k),   i = 1, 2, ..., n,

then

   (f, g) = Σ_{i=1}^{n} γ_i f(x_i) g(x_i).
16. Consider the Legendre polynomials pj(x) in (3.6.18).
(a) Show that the leading coefficient of pj(x) has value 1.
(b) Verify the orthogonality of these polynomials: (Pi,Pj) = 0 if i < j.
Hint: Integration by parts, noting that
and that the polynomial
is divisible by X2 - 1 if l < k.
17. Consider Gaussian integration, [a, b] = [-1,1], w(x) == 1.
(a) Show that 8i = 0 for i > 0 in the recursion (3.6.5) for the corresponding
orthogonal polynomials pj(x) (3.6.18). Hint: pj(x) == (_I)i+ 1pj (_x).
(b) Verify

   ∫_{−1}^{+1} (x^2 − 1)^j dx = (−1)^j 2^{2j+1} (j!)^2 / (2j + 1)!.

Hint: Repeated integration by parts of the integrand (x^2 − 1)^j ≡ (x + 1)^j (x − 1)^j.
(c) Calculate (p_j, p_j) using integration by parts [see Exercise 16] and the result (b) of this exercise. Show that

   γ_j^2 = j^2 / ( (2j + 1)(2j − 1) )

for j > 0 in the recursion (3.6.5).
18. Consider Gaussian integration in the interval [−1, +1] with the weight function

   w(x) ≡ 1/√(1 − x^2).

In this case, the orthogonal polynomials p_j(x) are the classical Chebyshev polynomials T_0(x) ≡ 1, T_1(x) ≡ x, T_2(x) ≡ 2x^2 − 1, T_3(x) ≡ 4x^3 − 3x, ..., T_{j+1}(x) ≡ 2x T_j(x) − T_{j−1}(x), up to scalar factors.
(a) Prove that p_j(x) ≡ (1/2^{j−1}) T_j(x) for j ≥ 1. What is the form of the tridiagonal matrix (3.6.19) in this case?
(b) For n = 3, determine the equation system (3.6.13). Verify that w_1 = w_2 = w_3 = π/3. (In the Chebyshev case, the weights w_i are equal for general n.)
19. Denote by T(f; h) the trapezoidal sum of step length h for the integral

   ∫_0^1 f(x) dx.

For α > 1, T(x^α; h) has an asymptotic expansion containing, besides the usual even powers of h, the power h^{1+α}. Show that, as a consequence, every function f(x) which is analytic on a disk |z| ≤ r in the complex plane with r > 1 admits an asymptotic expansion of the form

   T(x^α f(x); h) ~ ∫_0^1 x^α f(x) dx + b_1 h^{1+α} + b_2 h^{2+α} + b_3 h^{3+α} + ⋯.

Hint: Expand f(x) into a power series and apply T(φ + ψ; h) = T(φ; h) + T(ψ; h).
References for Chapter 3

Abramowitz, M., Stegun, I. A. (1964): Handbook of Mathematical Functions. National Bureau of Standards, Applied Mathematics Series 55, Washington, D.C.: US Government Printing Office, 6th printing 1967.
Bauer, F. L., Rutishauser, H., Stiefel, E. (1963): New aspects in numerical quadrature. Proc. of Symposia in Applied Mathematics 15, 199-218, Amer. Math. Soc.
Brezinski, C., Zaglia, R. M. (1991): Extrapolation Methods, Theory and Practice. Amsterdam: North Holland.
Bulirsch, R. (1964): Bemerkungen zur Romberg-Integration. Numer. Math. 6, 6-16.
Bulirsch, R., Stoer, J. (1964): Fehlerabschätzungen und Extrapolation mit rationalen Funktionen bei Verfahren vom Richardson-Typus. Numer. Math. 6, 413-427.
Bulirsch, R., Stoer, J. (1967): Numerical quadrature by extrapolation. Numer. Math. 9, 271-278.
Davis, Ph. J. (1963): Interpolation and Approximation. New York: Blaisdell, 2nd printing (1965).
Davis, Ph. J., Rabinowitz, P. (1975): Methods of Numerical Integration. New York: Academic Press.
Erdélyi, A. (1956): Asymptotic Expansions. New York: Dover.
Gautschi, W. (1968): Construction of Gauss-Christoffel quadrature formulas. Math. Comp. 22, 251-270.
Gautschi, W. (1970): On the construction of Gaussian quadrature rules from modified moments. Math. Comp. 24, 245-260.
Golub, G. H., Welsch, J. H. (1969): Calculation of Gauss quadrature rules. Math. Comp. 23, 221-230.
Gradshteyn, I. S., Ryzhik, I. M. (1980): Table of Integrals, Series and Products. New York: Academic Press.
Gröbner, W., Hofreiter, N. (1961): Integraltafel, 2 vols. Berlin, Heidelberg, New York: Springer-Verlag.
Henrici, P. (1964): Elements of Numerical Analysis. New York: Wiley.
Kronrod, A. S. (1965): Nodes and Weights of Quadrature Formulas. Authorized translation from the Russian. New York: Consultants Bureau.
Olver, F. W. J. (1974): Asymptotics and Special Functions. New York: Academic Press.
Piessens, R., de Doncker, E., Überhuber, C. W., Kahaner, D. K. (1983): Quadpack. A Subroutine Package for Automatic Integration. Berlin, Heidelberg, New York: Springer-Verlag.
Romberg, W. (1955): Vereinfachte numerische Integration. Det. Kong. Norske Videnskabers Selskab Forhandlinger 28, Nr. 7, Trondheim.
Schoenberg, I. J. (1969): Monosplines and quadrature formulae. In: Theory and Applications of Spline Functions. Ed. by T. N. E. Greville. 157-207. New York: Academic Press.
Steffensen, J. F. (1927): Interpolation. New York: Chelsea, 2nd edition (1950).
Stroud, A. H., Secrest, D. (1966): Gaussian Quadrature Formulas. Englewood Cliffs, NJ: Prentice Hall.
Szegő, G. (1959): Orthogonal Polynomials. New York: Amer. Math. Soc.
4 Systems of Linear Equations

In this chapter direct methods for solving systems of linear equations

   Ax = b

will be presented. Here A is a given n × n matrix, and b is a given vector. We assume in addition that A and b are real, although this restriction is inessential in most of the methods. In contrast to the iterative methods (Chapter 8), the direct methods discussed here produce the solution in finitely many steps, assuming computations without roundoff errors.
This problem is closely related to that of computing the inverse A^{-1} of the matrix A, provided this inverse exists. For if A^{-1} is known, the solution x of Ax = b can be obtained by matrix-vector multiplication, x = A^{-1} b. Conversely, the ith column ā_i of A^{-1} = [ā_1, ..., ā_n] is the solution of the linear system Ax = e_i, where e_i = (0, ..., 0, 1, 0, ..., 0)^T is the ith unit vector.
A general introduction to numerical linear algebra is given in Golub and van Loan (1983) and Stewart (1973). ALGOL programs are found in Wilkinson and Reinsch (1971), FORTRAN programs in Dongarra, Bunch, Moler, and Stewart (1979) (LINPACK), and in Anderson et al. (1992) (LAPACK).
4.1 Gaussian Elimination. The Triangular
Decomposition of a Matrix
In the method of Gaussian elimination for solving a system of linear equations
(4.1.1)
Ax = b,
where A is an n × n matrix and b ∈ ℝ^n, the given system (4.1.1) is transformed in steps by appropriate rearrangements and linear combinations of equations into a system of the form

   Rx = c,   R = [ r_11  r_12  ⋯  r_1n
                         r_22  ⋯  r_2n
                               ⋱   ⋮
                                   r_nn ],

which has the same solution as (4.1.1). R is an upper triangular matrix, so that Rx = c can easily be solved by "back substitution" (so long as r_ii ≠ 0, i = 1, ..., n):

   x_i := ( c_i − Σ_{k=i+1}^{n} r_ik x_k ) / r_ii   for i = n, n − 1, ..., 1.
In the first step of the algorithm an appropriate multiple of the first equation is subtracted from all of the other equations in such a way that the coefficients of x_1 vanish in these equations; hence, x_1 remains only in the first equation. This is possible only if a_11 ≠ 0, of course, which can be achieved by rearranging the equations if necessary, as long as at least one a_i1 ≠ 0. Instead of working with the equations themselves, the operations are carried out on the matrix

   (A, b) = [ a_11  ⋯  a_1n  b_1
              ⋮          ⋮    ⋮
              a_n1  ⋯  a_nn  b_n ],

which corresponds to the full system given in (4.1.1). The first step of the Gaussian elimination process leads to a matrix (A', b') of the form

(4.1.2)   (A', b') = [ a'_11  a'_12  ⋯  a'_1n  b'_1
                       0      a'_22  ⋯  a'_2n  b'_2
                       ⋮      ⋮          ⋮     ⋮
                       0      a'_n2  ⋯  a'_nn  b'_n ],

and this step can be described formally as follows:

(4.1.3)
(a) Determine an element a_r1 ≠ 0 and proceed with (b); if no such r exists, A is singular; set (A', b') := (A, b); stop.
(b) Interchange rows r and 1 of (A, b). The result is the matrix (Ā, b̄).
(c) For i = 2, 3, ..., n, subtract the multiple

   l_i1 := ā_i1 / ā_11

of row 1 from row i of the matrix (Ā, b̄). The desired matrix (A', b') is obtained as the result.
The transition (A, b) → (Ā, b̄) → (A', b') can be described by using matrix multiplications:

(4.1.4)   (Ā, b̄) = P_1 (A, b),   (A', b') = G_1 (Ā, b̄) = G_1 P_1 (A, b),

where P_1 is a permutation matrix (the identity matrix with rows 1 and r interchanged), and G_1 is a lower triangular matrix:

(4.1.5)   G_1 = [  1
                  −l_21   1
                   ⋮           ⋱
                  −l_n1             1 ].

Matrices such as G_1, which differ in at most one column from an identity matrix, are called Frobenius matrices. Both matrices P_1 and G_1 are nonsingular; in fact

   G_1^{-1} = [ 1
                l_21   1
                ⋮          ⋱
                l_n1            1 ].

For this reason, the equation systems Ax = b and A'x = b' have the same solution: Ax = b implies A'x = G_1 P_1 Ax = G_1 P_1 b = b', and A'x = b' implies Ax = P_1^{-1} G_1^{-1} A'x = P_1^{-1} G_1^{-1} b' = b.
The element a_r1 = ā_11 which is determined in (a) is called the pivot element (or simply the pivot), and step (a) itself is called pivot selection (or pivoting). In the pivot selection one can, in theory, choose any a_r1 ≠ 0 as the pivot element. For reasons of numerical stability [see Section 4.5] it is not recommended that an arbitrary a_r1 ≠ 0 be chosen. Usually the choice

   |a_r1| = max_i |a_i1|

is made; that is, among all candidate elements the one of largest absolute value is selected. (It is assumed in making this choice, however (see Section 4.5), that the matrix A is "equilibrated", that is, that the orders of magnitude of the elements of A are "roughly equal".) This sort of pivot selection is called partial pivot selection (or partial pivoting), in contrast to complete pivot selection (or complete pivoting), in which the search for a pivot is not restricted to the first column; that is, (a) and (b) in (4.1.3) are replaced by (a') and (b'):

(a') Determine r and s so that

   |a_rs| = max_{i,j} |a_ij|

and continue with (b') if a_rs ≠ 0. Otherwise A is singular; set (A', b') := (A, b); stop.
(b') Interchange rows 1 and r of (A, b), as well as columns 1 and s. Let the resulting matrix be (Ā, b̄).
After the first elimination step, the resulting matrix has the form (4.1.2):
a,11 :I a'T :I b'1
(A' , b') = [ - - - - LI - -_ - JI - -- o :I A :I b
1
with an (n - I)-row matrix A. The next elimination step consists simply
of applying the process described in (4.1.3) for (A, b) to the smaller matrix
(A, b). Carrying on in this fashion, a sequence of matrices
(A,b):= (A(O),b(O)) -+ (A(l),b(l)) -+ ... -+ (A(n-l),b(n-l)) =: (R,c),
is obtained which begins with the given matrix (A, b) (4.1.1) and ends with
the desired matrix (R, c). In this sequence the jth intermediate matrix
(A (j) , b(j)) has the form
(4.1.6)
* ... *
* ... *
*
* ... *
o
o ... 0
I
I
I
*
*
I
* ... * : *
I
o ... 0
(j) :I A(j)
: b(j)
A 11
12 I 1
[ - - - - __ 1- _ _ _ _ 1_ _ _ _
: b(j)
O :I A(j)
22 I 2
I
1
I
I
I
I
* ... *
I
I
I
I
*
with a j-row upper triangular matrix A~~). (A(j), b(j)) is obtained from
(A(j-l), b(j-l)) by applying (4.1.3) on the (n - j + 1) x~n - j + 2) matrix
b(j-l)) . The elements of A(j)
A(j)
b(j) do not change from this
( A(j-l)
22
'2
11'
12' 1
step on; hence they agree with the corresponding elements of (R, c). As in
the first step, (4.1.4) and (4.1.5), the ensuing steps can he described using
matrix multiplication. As can be readily seen
(4.1. 7)
b(j-l)) ,
( A(j) , b(j)) = GP·(A(j-l)
J J
,
with permutation matrices Pj and nonsingular Frobenius matrices G j of
the form
o
1
1
-IHl,j 1
Gj =
(4.1.8)
-I n,].
0
1
0
In the jth elimination step (A (j-l), b(j-l)) -+ (A (j), b(j)) the elements below
the diagonal in the jth column are annihilated. In the implementation of
this algorithm on a computer, the locations which were occupied by these
elements can now be used for the storage of the important quantities lij,
i ?:: j + 1, of G j ; that is, we work with a matrix of the form
I
: r22
1-
___
_
I
I
I
L ___ _
I
I
:L. _rjj
: rj,Hl
_ _ I
-1- -
-
-
-
-
rj,n
-
(j)
Aj+l,j : aH1,Hl
I
-
-
-
-
-
-
-
-
-
Cj
-
(j)
aH1,n
a(j)
n,n
-
-
-
-
b(j)
HI
-
bn(j)
Here the subdiagonal elements Ak+l,k, Ak+2,k, ... , Ank of the kth column
are a certain permutation of the elements lk+l,ko ... , In,k in Gk (4.1.8).
Based on this arrangement, the jth step T(j-l) -+ T(j), j = 1, 2, ... ,
n -1, can be described as follows (for simplicity the elements of T(j-l) are
denoted by tik, and those of T(j) by t~k' 1 :s: i :s: n, 1 :s: k :s: n + 1):
(a) Partial pivot selection: Determine r so that
If t Tj = 0, set T(j) := T(j-l); A is singular: stop. Otherwise carry on
with (b).
(b) Interchange rows rand j ojT(j-l), and denote the result by T = (fik)'
(c) Replace
I
- 1tij
:= Iij := tij
tjj
for i = j + 1, j + 2, ... , n,
for i = j + 1, ... , nand k = j + 1, ... , n + 1,
t~k = tik
otherwise.
We note that in (c) the important elements 1j+I,j, ... , lnj of G j are
stored in their natural order as tj+l,j, ... ,t~j' This order may, however, be
changed in the subsequent elimination steps T(k) -+ T(kH), k ~ j, because
in (b) the rows of the entire matrix T(k) are rearranged. This has the
following effect: The lower triangular matrix L and the upper triangular
matrix R,
L:=
[t!,
tnl
1
tn,n-I
which are contained in the final matrix T(n-I) = (tik), provide a triangular
decomposition of the matrix P A:
(4.1.9)
LR=PA.
In this decomposition P is the product of all of the permutations appearing
in (4.1.7):
We will only show here that a triangular decomposition is produced
if no row interchanges are necessary during the course of the elimination
process, i.e., if PI = ... = Pn - I = P = I. In this case,
1
In,n-I
1
since in all of the minor steps (b) nothing is interchanged. Now, because of
(4.1.7),
therefore
(4.1.10)
Since
o
1
a-:J 1 =
o
1
In,j
it is easily verified that
1
a-1 1a-2 1... a-I
=
n-l
1
[ 121
:
lnl
In,n-l
Then the assertion follows from (4.1.10).
EXAMPLE.

   [ 3  1  6 ] [ x_1 ]   [ 2 ]
   [ 2  1  3 ] [ x_2 ] = [ 7 ].
   [ 1  1  1 ] [ x_3 ]   [ 4 ]

Using the storage scheme described above, Gaussian elimination with partial pivot selection yields

   [ 3  1  6 | 2 ]      [ 3    1    6   | 2    ]      [ 3    1    6     | 2    ]
   [ 2  1  3 | 7 ]  →   [ 2/3  1/3  −1  | 17/3 ]  →   [ 1/3  2/3  −1    | 10/3 ].
   [ 1  1  1 | 4 ]      [ 1/3  2/3  −1  | 10/3 ]      [ 2/3  1/2  −1/2  | 4    ]

The pivot elements are 3, 2/3, and −1/2. The triangular equation system is

   3 x_1 +       x_2 + 6 x_3 = 2,
           (2/3) x_2 −   x_3 = 10/3,
                 −(1/2) x_3 = 4.

Its solution is

   x_3 = −8,   x_2 = (3/2)(10/3 + x_3) = −7,   x_1 = (1/3)(2 − x_2 − 6 x_3) = 19.

Further

   P = [ 1  0  0          P A = [ 3  1  6
         0  0  1 ],               1  1  1
         0  1  0 ]                2  1  3 ],

and the matrix P A has the triangular decomposition P A = L R with

   L = [ 1                  R = [ 3  1    6
         1/3  1      ],               2/3  −1    ].
         2/3  1/2  1 ]                     −1/2
Triangular decompositions (4.1.9) are of great practical importance in solving systems of linear equations. If the decomposition (4.1.9) is known for a matrix A (that is, the matrices L, R, P are known), then the equation system Ax = b can be solved immediately with any right-hand side b; for it follows that

   P A x = L R x = P b,

from which x can be found by solving both of the triangular systems

   L u = P b,   R x = u

(provided all r_ii ≠ 0).
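A compact Python sketch of the whole procedure, triangular decomposition with partial pivoting followed by forward and back substitution; the dense-loop formulation is chosen for readability rather than efficiency, and the test data are those of the example above.

   import numpy as np

   def lu_decompose(A):
       """Return P, L, R with L R = P A (partial pivoting, L unit lower triangular)."""
       A = A.astype(float).copy()
       n = A.shape[0]
       perm = np.arange(n)
       for j in range(n - 1):
           r = j + np.argmax(np.abs(A[j:, j]))        # pivot search in column j
           if A[r, j] == 0.0:
               raise ValueError("matrix is singular")
           A[[j, r]] = A[[r, j]]; perm[[j, r]] = perm[[r, j]]
           A[j+1:, j] /= A[j, j]                      # multipliers l_ij stored below the diagonal
           A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
       L = np.tril(A, -1) + np.eye(n)
       R = np.triu(A)
       P = np.eye(n)[perm]
       return P, L, R

   def solve(P, L, R, b):
       pb = P @ b
       u = np.zeros_like(pb)
       for i in range(len(b)):                        # forward substitution: L u = P b
           u[i] = pb[i] - L[i, :i] @ u[:i]
       x = np.zeros_like(u)
       for i in reversed(range(len(b))):              # back substitution: R x = u
           x[i] = (u[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
       return x

   A = np.array([[3., 1., 6.], [2., 1., 3.], [1., 1., 1.]])
   b = np.array([2., 7., 4.])
   P, L, R = lu_decompose(A)
   print(solve(P, L, R, b))      # expected solution (19, -7, -8)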
Thus, with the help of the Gaussian elimination algorithm, it can be
shown constructively that each square nonsingular matrix A has a triangular decomposition of the form (4.1.9). However, not every such matrix A
has a triangular decomposition in the more narrow sense A = LR, as the
example
A=
[~ ~]
shows. In general, the rows of A must be permuted appropriately at the
outset.
The triangular decomposition (4.l.9) can be obtained directly without
forming the intermediate matrices T(j). For simplicity, we will show this
under the assumption that the rows of A do not have to be permuted in
order for a triangular decomposition A = LR to exist. The equations A =
LR are regarded as n 2 defining equations for the n 2 unknown quantities
(lii = 1);
that is, for all i, k = l. 2, ... , n
min(i,k)
(4.1.11)
aik =
L lijrjk
(lii = 1).
j=l
The order in which the lij, rjk are to be computed remains open. The
following versions are common:
In the Crout method the n x n matrix A = LR is partiotioned as
follows:
1
3
5
2
46~
198
4 Systems of Linear Equations
and the equations A = LR are solved for Land R in an order indicated by
this partitioning:
1
1.
ali =
2.
ail
3.
a2i
L: hjT"ji,
i = 1,2, ... , n,
j=l
1
= L: lijT"j1'
i = 2,3, ... , n,
j=l
2
= L: l2jT"ji,
i = 2,3, ... ,n,
etc.
j=l
And in general, for i = 1, 2, ... , n,
i-I
T"ik := aik -
L
k = i, i + 1, ... , n,
lijT"jk,
j=l
(4.1.12)
lki:= ( aki -
i-I
~ lkjT"ji
)
/
k = i + 1, i + 2, ... ,n.
T"ii,
In all of the steps above lii = 1 for i = 1, 2 .... , n.
In the Banachiewicz method, the partitioning
1
2
4
6
3
I
5
I
7
I
8
I 9
is used; that is, Land R are computed by rows.
The formulas above are valid only if no pivot selections is carried out.
Triangular decomposition by the methods of Crout or Banachiewicz with
pivot selection leads to more complicated algorithms; see Wilkinson (1965).
Gaussian elimination and direct triangular decomposition differ only in
the ordering of operations. Both algorithms are, theoretically and numerically, entirely equivalent. Indeed, the jth partial sums
j
(4.1.13)
(j) .
.- a ik - "'"'
a ik
~ l is T" sk
s=l
of (4.1.12) produce precisely the elements of the matrix AU) in (4.1.6), as
can easily be verified. In Gaussian elimination, therefore, the scalar products (4.1.12) are formed only in pieces, with temporary storing of the intermediate results; direct triangular decomposition, on the other hand, forms
each scalar product as a whole. For these organizational reasons, direct triangular decomposition must be preferred if one chooses to accumulate the
scalar products in double-precision arithmetic in order to reduce roundoff errors (without storing double-precision intermediate results). Further,
these methods of triangular decomposition require about n 3 /3 operations
(1 operation = 1 multiplication + 1 addition). Thus, they also offer a simple
way of evaluating the determinant of a matrix A: From (4.1.9) it follows,
since det(P) = ±1, det(L) = 1, that
det(PA) = ± det(A) = det(R) = rllr22 ... r"n.
Up to its sign, det(A) is exactly the product of the pivot elements. (It
should be noted that the direct evaluation of the formula
n
det(A) =
J.Ll,···,J./.n=l
J.£i#JAk
fur i::;i.k
requires n! » n 3 /3 operations.)
In the case that P = I, the pivot elements rii are representable as
quotients of the determinants of the principal submatrices of A. If, in the
representation LR = A, the matrices are partitioned as follows:
[
Lll
L21
o ] [Rll
A21 ]
A22 '
0
L22
it is found that LnRll = All; hence det(R ll ) = det(All), or
rl1 ... rii = det(An),
if An is an i x i matrix. In general, if Ai denotes the ith leading principal
submatrix of A, then
rii = det(Ai)/ det(A i - 1 ),
i ~ 2,
rll = det(Ad·
A further practical and important property of the method of triangular
decomposition is that, for band matrices with bandwidth m,
*
0
*
A= *
0
aij =
*
*
*
}m
0 for li- jl ~ m,
the matrices Land R of the decomposition LR = P A of A are not full: R
is a band matric with bandwidth 2m - 1,
*
*
0
0
0
R=
*
0
*
,
} 'm-'
and in each column of L there are at most m elements different from zero. In
contrast, the inverses A -1 of band matrices are usually filled with nonzero
entries.
Thus, if m « n, using the triangular decomposition of A to solve
Ax = b results in a considerable saving in computation and storage over
using A-I. Additional savings are possible by making use of the symmetry
of A if A is a positive definite matrix (see Sections 4.3 and 4.A).
4.2 The Gauss-Jordan Algorithm
In practice, the inverse A -1 of a nonsingular n x n matrix A is not frequently
needed. Should a particular situation call for an inverse, however, it may be
readily calculated using the triangular decomposition described in Section
4.1 or using the Gauss-Jordan algorithm, which will be described below.
Both methods require the same amount of work.
If the triangular decomposition PA = LR of (4.1.9) is available, then
the ith column ai of A-I is obtained as the solution of the system
(4.2.1)
where ei is the ith coordinate vector. If the simple structure of the righthand side of (4.2.1), Pei, is taken into account, then the n equation systems
(4.2.1) (i = 1, ... , n) can be solved in about ~n3 operations. Adding
the cost of producing the decomposition gives a total of n 3 operations
to determaine A-I. The Gauss-Jordan method requires this amount of
work, too, and offers advantages only of an organizational nature. The
Gauss-Jordan method is obtained if one attempts to invert the mapping
x I--t Ax = y, x E IRn, y E IRn, determined by A in a systematic manner.
Consider the system Ax = y:
(4.2.2)
In the first step of the Gauss-Jordan method, the variable Xl is exchanged
for one of the variables Yr. To do this, an arl i- 0 is found, for example by
partial pivot selection
larll = max laill,
•
and equations rand 1 of (4.2.2) are interchanged. In this way, a system
(4.2.3)
is obtained in which the variables 1h, ... , Yn are a permutation of YI, ... ,
Yn and all = arI, YI = Yr holds. Now, all i- 0, for otherwise we would have
ail = 0 for all i, making A singular, contrary to assumption. By solving the
first equation of (4.2.3) for Xl and substituting the result into the remaining
equations, the system
a~IYI + a~2x2 + ... + a~nxn = Xl
a~IYI + a~2X2 + ... + a~nXn = Y2
(4.2.4)
is obtained with
(4.2.5)
I
_
aik := aik -
ailalk
-_--
an
for i, k = 2, 3, ... , n.
In the next step, the vaiable X2 is exchanged for one of the variables
Y2, ... , Yn; then X3 is exchanged for one of the remaining Y variables, and
so on. If the successive equation systems are represented by their matrices,
then starting from A(O) := A, a sequence
is obtained. The matrix A(j) = (a~k)) stands for the matrix of a "mixed
equation system" of the form
(j) -
all YI + ... +
ajl YI + ... +
(j) -
(4.2.6)
(j)x
a nn
n
anlYI + ... +
(j) -
In this system (iiI, ... , f11, Yj+l, ... , Yn) is a certain permutation of the
original variables (YI,'" ,Yn)' In the transition A(j-l) --+ A(j) the variable
Xj is exchanged for Yj. Thus, A(j) is obtained from A(j-l) according to the
rules given below. For simplicity, the elements of A(j-l) are denoted by aik,
and those of A (j) are denoted by a~k'
(4.2.7)
(a) Partial pivot selection: Determine r so that
If aTj = 0, the matrix is singular. Stop.
(b) Interchange rows rand j of A(j-l), and call the result A = (aik)'
(c) Compute A(j) = (a~k) according to the formulas [ compare with (4.2.5)]
(4.2.6) implies that
(4.2.8)
Y = YI,···, Yn )T ,
A
(
A
A
where ih, ... , Yn is a certain permutation of the original variables YI, ... ,
Yn, Y = Py which, since it corresponds to the interchange step (4.2.7b),
can easily be determined. From (4.2.8) it follows that
(A(n) P)y
and therefore, since Ax = Y,
= x,
EXAMPLE.
A
= A(O) :=
-7 A(2)
=
[r
[
-i
-1
1
2
3
~l-7
-1
1
2
-~l
I'
A(l)
=
U -~l
-7 A(3) =
-1
I'
2
[-~ -~l
-3
5
-2
= A~l.
The pivot elements are marked.
The following ALGOL program is a formulation of the Gauss-Jordan method with partial pivoting. The inverse of the n × n matrix A is stored back into A. The array p[i] serves to store the information about the row permutations which take place.

   for j := 1 step 1 until n do p[j] := j;
   for j := 1 step 1 until n do
   begin
   pivotsearch:
      max := abs(a[j, j]); r := j;
      for i := j + 1 step 1 until n do
         if abs(a[i, j]) > max then
         begin max := abs(a[i, j]);
            r := i;
         end;
      if max = 0 then goto singular;
   rowinterchange:
      if r > j then
      begin for k := 1 step 1 until n do
         begin
            hr := a[j, k]; a[j, k] := a[r, k];
            a[r, k] := hr;
         end;
         hi := p[j]; p[j] := p[r]; p[r] := hi;
      end;
   transformation:
      hr := 1/a[j, j];
      for i := 1 step 1 until n do a[i, j] := hr × a[i, j];
      a[j, j] := hr;
      for k := 1 step 1 until j − 1, j + 1 step 1 until n do
      begin
         for i := 1 step 1 until j − 1, j + 1 step 1 until n do
            a[i, k] := a[i, k] − a[i, j] × a[j, k];
         a[j, k] := −hr × a[j, k];
      end k;
   end j;
   columninterchange:
   for i := 1 step 1 until n do
   begin
      for k := 1 step 1 until n do hv[p[k]] := a[i, k];
      for k := 1 step 1 until n do a[i, k] := hv[k];
   end;
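An equivalent formulation in Python may be helpful for comparison; like the ALGOL program it overwrites the working array, records the row interchanges in p, and undoes them by a final column interchange. It is meant only as an illustration of the exchange steps (4.2.7).

   import numpy as np

   def gauss_jordan_inverse(A):
       A = A.astype(float).copy()
       n = A.shape[0]
       p = list(range(n))                        # records the row interchanges
       for j in range(n):
           r = j + np.argmax(np.abs(A[j:, j]))   # partial pivot selection
           if A[r, j] == 0.0:
               raise ValueError("matrix is singular")
           A[[j, r]] = A[[r, j]]; p[j], p[r] = p[r], p[j]
           piv = 1.0 / A[j, j]
           A[:, j] *= piv                        # transformation step: scale column j
           A[j, j] = piv
           for k in range(n):
               if k != j:
                   ajk = A[j, k]
                   A[:, k] -= A[:, j] * ajk
                   A[j, k] = -piv * ajk
       return A[:, np.argsort(p)]                # final column interchange undoes the permutation

   A = np.array([[3., 1., 6.], [2., 1., 3.], [1., 1., 1.]])
   print(gauss_jordan_inverse(A) @ A)            # should be close to the identity matrix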
4.3 The Choleski Decomposition

The methods discussed so far for solving equations can fail if no pivot selection is carried out, i.e. if we restrict ourselves to taking the diagonal elements in order as pivots. Even if no failure occurs, as we will show in the next sections, pivot selection is advisable in the interest of numerical stability. However, there is an important class of matrices for which no pivot selection is necessary in computing triangular factors: the choice of each diagonal element in order always yields a nonzero pivot element. Furthermore, it is numerically stable to use these pivots. We refer to the class of positive definite matrices.

(4.3.1) Definition. A (complex) n × n matrix A is said to be positive definite if it satisfies:
(a) A = A^H, i.e. A is a Hermitian matrix.
(b) x^H A x > 0 for all x ∈ ℂ^n, x ≠ 0.

A = A^H is called positive semidefinite if x^H A x ≥ 0 holds for all x ∈ ℂ^n.
(4.3.2) Theorem. For any positive definite matrix A the matrix A-I exists and is also positive definite. All principal submatrices of a positive definite matrix are also positive definite, and all principal minors of a positive
definite matrix are positive.
The inverse of a positive definite matrix A exists: If this were
not the case, an x -=I- 0 would exist with Ax = 0 and x H Ax = 0, in
contradiction to the definiteness of A. A-I is positive definite: We have
(A-l)H = (AH)-1 = A-I, and if y -=I- 0 it follows that x = A-l y -=I- o.
Hence yH A-l y = x H AH A-I Ax = x H Ax> O. Every principal submatrix
PROOF.
_ [ai~il.
A=
aikil
of a positive definite matrix A is also positive definite: Obviously AH = A.
Moreover, every
can be expanded to
for f-£ = i j ,
otherwise,
x"lO,
and it follows that
.i = 1, ... , k,
xH Ax = x H Ax > O.
In order to complete the proof of (4.3.2) then, it suffices to show that
det(A) > 0 for positive definite A. This is shown by using induction on n.
For n = 1 this is true from (4.3.1b). Now assume that the theorem is
true for positive definite matrices of order n - 1, and let A be a positive
definite matrix of order n. According to the preceding parts of the proof,
is positive definite, and consequently all > O. As is well known,
all
a22
a~n 1) / det(A).
an 2
ann
= det ( [ :
By the induction assumption, however,
a~2
a~n. 1) > 0,
an2
ann
de
[ ..
( t
.
and hence det(A) > 0 follows from all > O.
D
(4.3.3) Theorem. For each n x n positive definite matrix A there is a
unique n x n lower triangular matrix L, lik = 0 for k > i, with lii > 0,
i = 1, 2, ... , n, satisfying A = LLH. If A is real, so is L.
Note that lii = 1 is not required. The matrix L is called the Choleski
factor of A, and A = LLH its Choleski decomposition.
PROOF. The theorem is established by induction on n. For n = 1 the
theorem is trivial: A positive definite 1 x 1 matrix A = (a) is a positive
numer a > 0, which can be written uniqely in the form
Assume that the theorem is true for positive matrices of order n - 1 :::: 1.
An n x n positive definite matrix A can be partitioned into
where b E <e n - 1 and An-1 is a positive definite matrix of order n - 1 by
(4.3.2). By the induction hypothesis, there is a unique matrix L n - 1 of order
n - 1 satisfying
for
k > i, Iii> O.
We consider a matrix L of the form
and try to determine e E <en-I, 0 > 0 so that
(4.3.4)
This means that we must have
Ln-Ie = b,
H
e e+0 2 =a nn ,
0> O.
The first equation must have a unique solution e = L-;'~l b, since L n - 1 ,
as a a triangular matrix with positive diagonal entries, has det(Ln-d > O.
As for the second equation, if e H e :::: ann (that is, 0 2 :::: 0), then from (4.3.1)
we would have a contradiction with 0 2 > 0, which follows from
det(A) = I det(Ln_dI 2 0 2 ,
det(A) > 0 (4.3.2), and det(L n - 1 ) > O. Therefore, from (4.3.4), there exists
exactly one 0 > 0 giving LLH = A, namely
o
The decomposition A = LL^H can be determined in a manner similar to the methods given in Section 4.1. If it is assumed that all l_ij are known for j ≤ k − 1, then as defining equations for l_kk and l_ik, i ≥ k + 1, we have

(4.3.5)   a_kk = |l_k1|^2 + |l_k2|^2 + ⋯ + |l_kk|^2,
          a_ik = l_i1 l̄_k1 + l_i2 l̄_k2 + ⋯ + l_ik l̄_kk.

For a real A, the following algorithm results:
   for i := 1 step 1 until n do
   for k := i step 1 until n do
   begin x := a[i, k];
      for j := i − 1 step −1 until 1 do
         x := x − a[k, j] × a[i, j];
      if i = k then
      begin if x ≤ 0 then goto fail;
         p[i] := 1/sqrt(x);
      end else
         a[k, i] := x × p[i];
   end i, k;
Note that only the upper triangular portion of A is used. The lower
triangular matrix L is stored in the lower triangular portion of A, with the
exception of the diagonal elements of L, whose reciprocals are stored in p.
This method is due to Choleski. During the course of computation, n
square roots must be taken. Theorem (4.3.3) assures us that the arguments
of these square roots will be positive. About n 3 /6 operations (multiplications and additions) are needed beyond the n square roots. Further substantial savings are possible for sparse matrices, see Section 4.A. Finally,
note as an important implication of (4.3.5) that

(4.3.6)   |l_kj|^2 ≤ a_kk,   j = 1, ..., k,   k = 1, 2, ..., n.

That is, the elements of L cannot grow too large.
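A direct transcription of (4.3.5) into Python for the real case: the factor L is built column by column, and a nonpositive radicand signals that A is not positive definite. The routine is only a sketch and, unlike the ALGOL program above, does not economize on storage.

   import numpy as np

   def cholesky(A):
       """Lower triangular L with L L^T = A for a real symmetric positive definite A, cf. (4.3.5)."""
       n = A.shape[0]
       L = np.zeros_like(A, dtype=float)
       for k in range(n):
           s = A[k, k] - L[k, :k] @ L[k, :k]           # a_kk - sum_j l_kj^2
           if s <= 0.0:
               raise ValueError("matrix is not positive definite")
           L[k, k] = np.sqrt(s)
           for i in range(k + 1, n):
               L[i, k] = (A[i, k] - L[i, :k] @ L[k, :k]) / L[k, k]
       return L

   A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
   L = cholesky(A)
   print(np.allclose(L @ L.T, A))     # True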
4.4 Error Bounds
If any one of the methods described in the previous sections is used to determine the solution of a linear equation system Ax = b, then in general only an approximation x̃ to the true solution x is obtained, and there arises the question of how the accuracy of x̃ is to be judged. In order to measure the error

   x̃ − x

we have to have the means of measuring the "size" of a vector. To do this a norm

(4.4.1)   ‖x‖

is introduced on ℂ^n; that is, a function

   ‖·‖ : ℂ^n → ℝ

which assigns to each vector x ∈ ℂ^n a real value ‖x‖ serving as a measure for the "size" of x. The function must have the following properties:

(4.4.2)
(a) ‖x‖ > 0 for all x ∈ ℂ^n, x ≠ 0 (positivity),
(b) ‖αx‖ = |α| ‖x‖ for all α ∈ ℂ, x ∈ ℂ^n (homogeneity),
(c) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ ℂ^n (triangle inequality).

In the following we use only the norms

(4.4.3)   ‖x‖_2 := √(x^H x) = ( Σ_{i=1}^{n} |x_i|^2 )^{1/2}   (Euclidean norm),
          ‖x‖_∞ := max_i |x_i|   (maximum norm).
The norm properties (a), (b), (c) are easily verified.
For each norm ‖·‖ the inequality

(4.4.4)   ‖x − y‖ ≥ | ‖x‖ − ‖y‖ |   for all x, y ∈ ℂ^n

holds. From (4.4.2c) it follows that

   ‖x‖ = ‖(x − y) + y‖ ≤ ‖x − y‖ + ‖y‖,

and consequently ‖x − y‖ ≥ ‖x‖ − ‖y‖. By interchanging the roles of x and y and using (4.4.2b), it follows that

   ‖x − y‖ = ‖y − x‖ ≥ ‖y‖ − ‖x‖,

and hence (4.4.4).
It is easy to establish the following:
(4.4.5) Theorem. Each norm 11.11 in IRn (or <en) is a uniformly continuous
function with respect to the metric g(x,y) := maxi IXi - Yil on IRn(<e n ).
PROOF. From (4.4.4) it follows that
IlIx+hll-lIxlll::::: IIhll·
Now h = 2:~=1 hiei, where h = (h 1 , ... , hn)T and ei are the usual cordinate
(unit) vectors of IRn (<en). Therefore
n
n
IIhll ::::: '~,
" Ihilliei ll ::::: max Ihil
i=l
L lIejll = Mmax
, Ihil
j=l
with M := 2:7=1 lIejll. Hence, for each e >
maxi Ihil ::::: elM, the inequality
0 and all h satisfying
Ilix + hll - IIxlll ::::: e
holds. That is, II . II is uniformly continuous.
o
This result is used to show:
(4.4.6) Theorem. All norms on lR"(Cn) are equivalent in the following
sense: For each pair of norms Pl (x)) P2 (x) there are positive constants 177
and M satisfying
PROOF. We will prove this only in the case that P2(X) := Ilxll := maxi IXil.
The general case follows easily from this special result. The set
5 = {x E (Cn I maxlxil
= I}
,
is a compact set in (Cn. Since Pl (x) is continuous by (4.4.EI),
M:= max Pl(X) > 0,
xES
177:= min Pl(X) > I)
xES
exist. Thus for all y cf 0, y/llyll E 5, it follows that
177
~ Pl CI~II) = ~Pl(Y) ~ M,
and therefore mllyll ~ Pl(y) ~ Mllyll·
D
For matrices as well, A E 11/[(177, n) of fixed dimensions, norms IIAII can
be introduced. In analogy to (4.4.2), the properties
IIAII > 0
for all A cf 0, A E M(m, n),
IloA11 = lolllAII,
IIA + BII ~ IIAII + IIBII
arc required. The matrix norm II ·11 is said to be consistent with the vector
norms II . Iia on (C" and II . lib on (Cm if
IIAxllb ~ IIAllllxll a
for all x E <en, A E M(m, n).
A matrix norm 11.11 for square matrices A E M(n, n) is called submultiplicative if
IIABII ~ II All IIBII
for all
A, BE M(n, n).
Choosing B := I then shows 11111 :::: 1 for such norms. Frequently used
matrix norms are
(4.4.7a)   ‖A‖ = max_i Σ_{k} |a_ik|   (row-sum norm),
(4.4.7b)   ‖A‖ = ( Σ_{i,k} |a_ik|^2 )^{1/2}   (Schur norm),
(4.4.7c)   ‖A‖ = max_{i,k} |a_ik|.

(a) and (b) are submultiplicative; (c) is not; (b) is consistent with the Euclidean vector norm. Given a vector norm ‖x‖, a corresponding matrix norm for square matrices, the subordinate matrix norm or least upper bound norm, can be defined by

(4.4.8)   lub(A) := max_{x ≠ 0} ‖Ax‖ / ‖x‖.
Such a matrix norm is consistent with the vector norm 11.11 used to define
it:
(4.4.9)
IIAxl1 :s; lub (A) Ilxll·
Obviously lub (A) is the smallest of all of the matrix norms IIAII which are
consistent with the vector norm Ilxll:
IIAxl1 :s; IIAllllxl1
for all x
=?
lub (A) :s; IIAII·
Each subordinate norm lub(·) is submultiplicative:
IIABxl1
IIBxl1
lub (AB) = max -11-11- :s; max lub (A)-II-II
x#O
x
x#O
x
= lub (A) lub (B),
and furthermore lub(I) = maxx=o IIIx1/11x11 = l.
(4.4.9) shows that lub(A) is the greatest magnification which a vector
may attain under the mapping determined by A: It shows how much IIAxll,
the norm of an image point, can exceed II x II, the norm of a source point.
EXAMPLE.
(a) For the maximum norm ‖x‖_∞ = max_ν |x_ν| the subordinate matrix norm is the row-sum norm

   lub_∞(A) = max_{x ≠ 0} ‖Ax‖_∞ / ‖x‖_∞ = max_{x ≠ 0} ( max_i | Σ_{k=1}^{n} a_ik x_k | ) / ( max_k |x_k| ) = max_i Σ_{k=1}^{n} |a_ik|.

(b) Associated with the Euclidean norm ‖x‖_2 = √(x^H x) we have the subordinate matrix norm

   lub_2(A) = √( λ_max(A^H A) ),

which is expressed in terms of the largest eigenvalue λ_max(A^H A) of the matrix A^H A [see (6.4.7)].
(4.4.10)
for unitary matrices U, that is, for matrices defined by UHU = I.
In the following we assume that Ilxll is an arbitrary vector norm and IIAII
is a consistent submultiplicative matrix norm with 11111 = 1. Specifically,
we can always take the subordinate norm lub(A) as IIAII if we want to
obtain particularly good estimates in the results below. We shall show how
norms can be used to bound the influence due to changes in A and b on
the solution x to a linear equation system
Ax = b.
If the solution x + Δx corresponds to the right-hand side b + Δb,
  A(x + Δx) = b + Δb,
then the relation
  Δx = A⁻¹ Δb
follows from A Δx = Δb, as does the bound
(4.4.11)   ‖Δx‖ ≤ ‖A⁻¹‖ ‖Δb‖.
For the relative change ‖Δx‖/‖x‖, the bound
(4.4.12)   ‖Δx‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖Δb‖/‖b‖ = cond(A) ‖Δb‖/‖b‖
follows from ‖b‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖. In this estimate, cond(A) := ‖A‖ ‖A⁻¹‖. For the special case that cond(A) := lub(A) lub(A⁻¹), this so-called condition of A is a measure of the sensitivity of the relative error in the solution to relative changes in the right-hand side b. Since AA⁻¹ = I, cond(A) satisfies
  lub(I) = 1 ≤ lub(A) lub(A⁻¹) ≤ ‖A‖ ‖A⁻¹‖ = cond(A).
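The bound (4.4.12) is easy to check numerically. The following short Python sketch (not part of the original text; it assumes NumPy and uses an arbitrarily chosen matrix and right-hand-side perturbation) verifies the inequality for one example:

    import numpy as np

    A = np.array([[0.005, 1.0], [1.0, 1.0]])     # example matrix
    b = np.array([0.5, 1.0])
    db = 1e-6 * np.array([1.0, -1.0])            # perturbation of the right-hand side

    x  = np.linalg.solve(A, b)
    dx = np.linalg.solve(A, b + db) - x

    cond = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)  # ||A|| ||A^-1||
    lhs = np.linalg.norm(dx) / np.linalg.norm(x)
    rhs = cond * np.linalg.norm(db) / np.linalg.norm(b)
    print(lhs <= rhs + 1e-15, lhs, rhs)          # relative error obeys the bound (4.4.12)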
The relation (4.4.11) can be interpreted as follows: If x̃ is an approximate solution to Ax = b with residual
  r(x̃) := b − Ax̃ = A(x − x̃),
then x̃ is the exact solution of
  Ax̃ = b − r(x̃),
and the estimate
(4.4.13)   ‖Δx‖ ≤ ‖A⁻¹‖ ‖r(x̃)‖
must hold for the error Δx = x̃ − x.
Next, in order to investigate the influence of changes in the matrix A
upon the solution x of Ax = b, we establish the following:
(4.4.14) Lemma. If F is an n × n matrix with ‖F‖ < 1, then (I + F)⁻¹ exists and satisfies
  ‖(I + F)⁻¹‖ ≤ 1/(1 − ‖F‖).
PROOF. From (4.4.4) the inequality
  ‖(I + F)x‖ = ‖x + Fx‖ ≥ ‖x‖ − ‖Fx‖ ≥ (1 − ‖F‖) ‖x‖
follows for all x. From 1 − ‖F‖ > 0 it follows that ‖(I + F)x‖ > 0 for x ≠ 0; that is, (I + F)x = 0 has only the trivial solution x = 0, and I + F is nonsingular.
Using the abbreviation G := (I + F)⁻¹, it follows that
  1 = ‖I‖ = ‖(I + F)G‖ = ‖G + FG‖ ≥ ‖G‖ − ‖G‖ ‖F‖ = ‖G‖ (1 − ‖F‖) > 0,
from which we have the desired result
  ‖(I + F)⁻¹‖ ≤ 1/(1 − ‖F‖).   □
We can show:
(4.4.15) Theorem. Let A be a nonsingular n × n matrix, B = A(I + F), ‖F‖ < 1, and let x and Δx be defined by Ax = b, B(x + Δx) = b. It follows that
  ‖Δx‖/‖x‖ ≤ ‖F‖/(1 − ‖F‖),
as well as
  ‖Δx‖/‖x‖ ≤ cond(A) (‖B − A‖/‖A‖) / ( 1 − cond(A) ‖B − A‖/‖A‖ )
if cond(A) · ‖B − A‖/‖A‖ < 1.
PROOF. B⁻¹ exists from (4.4.14), and
  Δx = B⁻¹b − A⁻¹b = B⁻¹(A − B)A⁻¹b,   x = A⁻¹b,
so that
  ‖Δx‖/‖x‖ ≤ ‖B⁻¹(A − B)‖ = ‖−(I + F)⁻¹ A⁻¹ A F‖ ≤ ‖(I + F)⁻¹‖ ‖F‖ ≤ ‖F‖/(1 − ‖F‖).
Since F = A⁻¹(B − A) and ‖F‖ ≤ ‖A⁻¹‖ ‖A‖ ‖B − A‖/‖A‖, the rest of the theorem follows.   □
According to theorem (4.4.15), cond(A) also measures the sensitivity
of the solution x of Ax = b to changes in the matrix A.
If the relations
  F = A⁻¹B − I,   G := (I + F)⁻¹ = B⁻¹A
are taken into account, it follows from (4.4.14) that
  ‖B⁻¹A‖ ≤ 1/(1 − ‖I − A⁻¹B‖).
By interchanging A and B, it follows immediately from A⁻¹ = A⁻¹BB⁻¹ that
(4.4.16)   ‖A⁻¹‖ ≤ ‖A⁻¹B‖ ‖B⁻¹‖ ≤ ‖B⁻¹‖/(1 − ‖I − B⁻¹A‖).
In particular, the residual estimate (4.4.13) leads to the bound of Collatz
(4.4.17)   ‖x̃ − x‖ ≤ ( ‖B⁻¹‖/(1 − ‖I − B⁻¹A‖) ) ‖r(x̃)‖,   r(x̃) = b − Ax̃,
where B⁻¹ is an approximate inverse to A with ‖I − B⁻¹A‖ < 1.
The estimates obtained up to this point show the significance of the
quantity cond(A) for determining the influence on the solution of changes
in the given data. These estimates give bounds on the error x - x, but
the evaluation of the bounds requires at least an approximate knowledge
of the inverse A-1 to A. The estimates to be discussed next, due to Prager
and Oettli (1964), are based upon another principle and do not require any
knowledge of A -1.
The results are obtained through the following considerations:
Usually the given data A₀, b₀ of an equation system A₀x = b₀ are inexact, being tainted, for example, by measurement errors ΔA, Δb. Hence, it is reasonable to accept an approximate solution x̃ to the system A₀x = b₀ as "correct" if x̃ is the exact solution to a "neighboring" equation system
  Ax̃ = b
with
(4.4.18)   A ∈ 𝔄 := { A | |A − A₀| ≤ ΔA },   b ∈ 𝔅 := { b | |b − b₀| ≤ Δb }.
The notation used here is
  |A| = (|α_ik|),   |b| = (|β₁|, ..., |βₙ|)^T,   where A = (α_ik),  b = (β₁, ..., βₙ)^T,
and the relation ≤ between vectors and matrices is to be understood as holding componentwise. Prager and Oettli prove:
(4.4.19) Theorem. Let ΔA ≥ 0, Δb ≥ 0, and let 𝔄, 𝔅 be defined by (4.4.18). Associated with any approximate solution x̃ of the system A₀x = b₀ there is a matrix A ∈ 𝔄 and a vector b ∈ 𝔅 satisfying
  Ax̃ = b
if and only if
  |r(x̃)| ≤ ΔA |x̃| + Δb,
where r(x̃) := b₀ − A₀x̃ is the residual of x̃.
PROOF.
(1) We assume first that
  Ax̃ = b
holds for some A ∈ 𝔄, b ∈ 𝔅. Then it follows from
  A = A₀ + δA, where |δA| ≤ ΔA,   b = b₀ + δb, where |δb| ≤ Δb,
that
  |r(x̃)| = |b₀ − A₀x̃| = |b − δb − (A − δA)x̃| = |−δb + (δA)x̃| ≤ |δb| + |δA| |x̃| ≤ Δb + ΔA |x̃|.
(2) On the other hand, if
(4.4.20)   |r(x̃)| ≤ ΔA |x̃| + Δb,
and if we use the notation
  x̃ =: (ξ₁, ..., ξₙ)^T,   b₀ =: (β₁, ..., βₙ)^T,   A₀ =: (α_ij),   ΔA =: (Δα_ij),   Δb =: (Δβ₁, ..., Δβₙ)^T,
  r := r(x̃) = (ρ₁, ..., ρₙ)^T,   σ := Δb + ΔA |x̃| ≥ 0,   σ = (σ₁, ..., σₙ)^T,
then set
  δα_ij := ρ_i Δα_ij sign(ξ_j)/σ_i,   δβ_i := −ρ_i Δβ_i/σ_i,   where ρ_i/σ_i := 0 if σ_i = 0.
From (4.4.20) it follows that |ρ_i/σ_i| ≤ 1, and consequently
  A = A₀ + δA ∈ 𝔄,   b = b₀ + δb ∈ 𝔅,
as well as the following for i = 1, 2, ..., n:
  ρ_i = ρ_i ( Δβ_i + Σ_{j=1}^n Δα_ij |ξ_j| ) / σ_i = −δβ_i + Σ_{j=1}^n δα_ij ξ_j,
or
  Σ_{j=1}^n (α_ij + δα_ij) ξ_j = β_i + δβ_i,
that is,
  Ax̃ = b,
which was to be shown.   □
The criterion expressed in Theorem (4.4.19) permits us to draw conclusions about the fitness of a solution from the smallness of its residual. For
example, if all components of A₀ and b₀ have the same relative accuracy ε, then (4.4.19) is satisfied if
  |A₀x̃ − b₀| ≤ ε ( |b₀| + |A₀| |x̃| ).
From this inequality, the smallest ε can be computed for which a given x̃ can still be accepted as a usable solution.
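A minimal Python sketch of this acceptance test follows (not part of the original text; NumPy and the example data are assumptions). It computes the smallest ε for which |A₀x̃ − b₀| ≤ ε(|b₀| + |A₀||x̃|) holds componentwise:

    import numpy as np

    def smallest_eps(A0, b0, x_tilde):
        # Prager-Oettli criterion with dA = eps*|A0|, db = eps*|b0|:
        # x_tilde is acceptable iff |A0 x - b0| <= eps (|b0| + |A0||x|);
        # the smallest such eps is the maximum of the componentwise ratios.
        r = np.abs(A0 @ x_tilde - b0)
        s = np.abs(b0) + np.abs(A0) @ np.abs(x_tilde)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(s > 0, r / s, np.where(r > 0, np.inf, 0.0))
        return ratio.max()

    A0 = np.array([[2.0, 1.0], [1.0, 3.0]])
    b0 = np.array([3.0, 4.0])
    x_tilde = np.array([1.0000003, 0.9999999])   # an approximate solution
    print(smallest_eps(A0, b0, x_tilde))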
4.5 Roundoff-Error Analysis for Gaussian Elimination
In the discussion of methods for solving linear equations, the pivot selection
played only the following role: it guaranteed for any nonsingular matrix A
that the algorithm would not terminate prematurely if some pivot element
happened to vanish. We will now show that the numerical behavior of the
equation-solving methods which have been covered depends upon the choice
of pivots. To illustrate this, we consider the following simple example:
Solve the system
(4.5.1)   0.005 x + y = 0.5,
                x + y = 1,
with the use of Gaussian elimination. The exact solution is x = 5000/9950 = 0.503..., y = 4950/9950 = 0.497.... If the element a₁₁ = 0.005 is taken as the pivot in the first step, then we obtain
  [ 0.005   1    [ x      [  0.5
    0    −200 ]    ỹ ]  =   −99 ],   ỹ = 0.5,   x̃ = 0,
using 2-place floating-point arithmetic. If the element a₂₁ = 1 is taken as the pivot, then 2-place floating-point arithmetic yields
  [ 1  1    [ x̃      [ 1
    0  1 ]    ỹ ]  =   0.50 ],   ỹ = 0.50,   x̃ = 0.50.
In the second case, the accuracy of the result is considerably higher. This could lead to the impression that the largest element in magnitude should be chosen as a pivot from among the candidates in a column to get the best numerical results. However, a moment's thought shows that this cannot be unconditionally true. If the first row of the equation system (4.5.1) is multiplied by 200, for example, the result is the system
(4.5.2)   x + 200 y = 100,
          x +     y = 1,
which has the same solution as (4.5.1). The element a₁₁ = 1 is now just as large as the element a₂₁ = 1. However, the choice of a₁₁ as pivot element leads to the same inexact result as before. We have replaced the matrix A of (4.5.1) by Ā = DA, where D is the diagonal matrix
  D = [ 200  0
        0    1 ].
Obviously, we can also adjust the column norms of A, i.e., replace A by Ā = AD (where D is a diagonal matrix), without changing the solution x to Ax = b in any essential way. If x is the solution to Ax = b, then y = D⁻¹x is the solution of Āy = (AD)(D⁻¹x) = b. In general, we refer to a scaling of a matrix A if A is replaced by D₁AD₂, where D₁, D₂ are diagonal matrices. The example shows that it is not reasonable to propose a particular choice of pivots unless assumptions about the scaling of a matrix are made. Unfortunately, no one has yet determined satisfactorily how to carry out scaling so that partial pivot selection is numerically stable for any matrix A. Practical experience, however, suggests the following scaling for partial pivoting: Choose D₁ and D₂ so that
  Σ_{k=1}^n |ā_ik| ≈ Σ_{j=1}^n |ā_jl|
holds approximately for all i, l = 1, 2, ..., n in the matrix Ā = D₁AD₂. The sums of the absolute values of the elements in the rows (and the columns) of Ā should all have about the same magnitude. Such matrices are said to be equilibrated. In general, it is quite difficult to determine D₁ and D₂ so that D₁AD₂ is equilibrated. Usually we must get by with the following: Let D₂ = I, D₁ = diag(s₁, ..., sₙ), where
  s_i := 1 / Σ_{k=1}^n |a_ik|;
then for Ā = D₁AD₂ it is true at least that
  Σ_{k=1}^n |ā_ik| = 1   for i = 1, 2, ..., n.
Now, instead of replacing A by Ā, i.e., instead of actually carrying out the transformation, we replace the rule for pivot selection in order to avoid the explicit scaling of A. The pivot selection for the jth elimination step A^(j−1) → A^(j) is given by the following:
(4.5.3). Determine r ≥ j so that
  |a_rj^(j−1)| s_r = max_{i≥j} |a_ij^(j−1)| s_i ≠ 0,
and take a_rj^(j−1) as the pivot.
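A minimal Python sketch of the selection rule (4.5.3) follows (not part of the original text; NumPy is assumed and the matrix is the example (4.5.1)). It returns the row index of the pivot for the jth step, given the row scale factors of the original matrix:

    import numpy as np

    def pivot_row(A_j, s, j):
        # Rule (4.5.3): among rows r >= j choose the one maximizing |a_rj| * s_r,
        # where s_r = 1 / sum_k |a_rk| are the scale factors of the rows of A.
        weights = np.abs(A_j[j:, j]) * s[j:]
        if weights.max() == 0.0:
            raise ValueError("matrix is singular to working precision")
        return j + int(np.argmax(weights))

    A = np.array([[0.005, 1.0], [1.0, 1.0]])
    s = 1.0 / np.abs(A).sum(axis=1)       # scale factors of the rows of A
    print(pivot_row(A, s, 0))             # selects row 1, i.e. the pivot a_21 = 1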
The example above shows that it is not sufficient, in general, to scale A prior to carrying out partial pivot selection by making the largest element in absolute value within each row and each column have roughly the same magnitude:
(4.5.4)   max_k |a_ik| ≈ max_j |a_jl|   for all i, l = 1, 2, ..., n.
For, if the scaling matrices
  D₁ = [ 200  0      D₂ = [ 1  0
         0    1 ],          0  0.005 ]
are chosen, then the matrix A of our example (4.5.1) becomes
  Ā = D₁AD₂ = [ 1  1
                1  0.005 ].
The condition (4.5.4) is satisfied, but inexact answers will be produced, as before, if ā₁₁ = 1 is used as pivot.
We would like to make a detailed study of the effect of the rounding
errors which occur in Gaussian elimination or direct triangular decomposition [see Section 4.1]. We assume that the rows of the n x n matrix A
are already so arranged that A has a triangular decomposition of the form
A = LR. Hence, Land R can be determined using the formulas of (4.1.12).
In fact, we only have to evaluate expressions of the form
which were analyzed in (1.4.4)-(1.4.11). Instead of the exact triangular decomposition LR = A, (4.1.12) shows that the use of floating-point arithmetic will result in matrices L̄ = (l̄_ik), R̄ = (r̄_ik) for which the residual F := (f_ik) = A − L̄R̄ is not zero in general. Since, according to (4.1.13), the jth partial sums
  ā_ik^(j) = fl( a_ik − Σ_{s=1}^j l̄_is r̄_sk )
are exactly the elements of the matrix Ā^(j), which is produced instead of A^(j) of (4.1.6) from the jth step of Gaussian elimination in floating-point arithmetic, the estimates (1.4.7) applied to (4.1.12) yield (we use the abbreviation eps′ := eps/(1 − eps)):
(4.5.5)
  |f_ik| = | a_ik − Σ_{j=1}^i l̄_ij r̄_jk | ≤ eps′ Σ_{j=1}^{i−1} ( |ā_ik^(j)| + |l̄_ij| |r̄_jk| ),   k ≥ i,
  |f_ki| = | a_ki − Σ_{j=1}^i l̄_kj r̄_ji | ≤ eps′ [ |ā_ki^(i−1)| + Σ_{j=1}^{i−1} ( |ā_ki^(j)| + |l̄_kj| |r̄_ji| ) ],   k > i.
Further,
(4.5.6)   r̄_ik = ā_ik^(i−1)   for i ≤ k,
since the first j + 1 rows of Ā^(j) [or A^(j) of (4.1.6)] do not change in the subsequent elimination steps, and so they already agree with the corresponding rows of R̄. We assume, in addition, that |l̄_ik| ≤ 1 for all i, k (which is satisfied, for example, when partial or complete pivoting is used).
Setting
  a_j := max_{i,k} |ā_ik^(j)|,   a := max_{0≤j≤n−1} a_j,
it follows immediately from (4.5.5) and (4.5.6) that
(4.5.7)
  |f_ik| ≤ (eps/(1 − eps)) ( a₀ + 2a₁ + 2a₂ + ... + 2a_{i−2} + a_{i−1} ) ≤ 2(i − 1) a eps/(1 − eps)   for k ≥ i,
  |f_ik| ≤ (eps/(1 − eps)) ( a₀ + 2a₁ + ... + 2a_{k−2} + 2a_{k−1} ) ≤ 2k a eps/(1 − eps)   for k < i.
For the matrix F, then, the inequality
(4.5.8)   |F| ≤ ( 2a eps/(1 − eps) ) ·
          [ 0 0 0 ⋯ ⋯ 0
            1 1 1 ⋯ ⋯ 1
            1 2 2 ⋯ ⋯ 2
            1 2 3 ⋯ ⋯ 3
            ⋮ ⋮ ⋮  ⋱   ⋮
            1 2 3 ⋯ n−1 n−1 ]
holds, where |F| := (|f_ik|). If a has the same order of magnitude as a₀, that is, if the matrices A^(j) do not grow too much, then the computed matrices L̄, R̄ form the exact triangular decomposition of the matrix A − F, which differs little from A. Gaussian elimination is stable in this case.
The value a can be estimated with the help of a₀ = max_{r,s} |a_rs|. For partial pivot selection it can easily be shown that
  a_j ≤ 2^j a₀,
and hence that a ≤ 2^{n−1} a₀. This bound is much too pessimistic in most cases; however, it can be attained, for example, in forming the triangular decomposition of the matrix
  A = [  1   0  ⋯   0  1
        −1   1       0  1
        −1  −1   ⋱      ⋮
         ⋮        ⋱  1  1
        −1  −1  ⋯  −1  1 ].
Better estimates hold for special types of matrices. For example, in the case of upper Hessenberg matrices, that is, matrices of the form
  A = [ x  x  ⋯  x
        x  x  ⋯  x
           ⋱  ⋱  ⋮
        0     x  x ]   (a_ik = 0 for i > k + 1),
the bound a ≤ (n − 1)a₀ can be shown. (Hessenberg matrices arise in eigenvalue problems.)
For tridiagonal matrices
  A = [ α₁ β₂           0
        γ₂ α₂  ⋱
            ⋱  ⋱  β_n
        0      γ_n α_n ]
it can even be shown that
  a_k ≤ 2a₀,   k = 1, 2, ..., n − 1,
holds for partial pivot selection. Hence, Gaussian elimination is quite numerically stable in this case.
For complete pivot selection Wilkinson (1965) has shown that
  a ≤ f(n) a₀
with the function
  f(k) := k^{1/2} [ 2^1 3^{1/2} 4^{1/3} ⋯ k^{1/(k−1)} ]^{1/2}.
This function grows relatively slowly with k:

  k      10   20   50    100
  f(k)   19   67   530   3300

Even this estimate is too pessimistic in practice. Up until now, no real matrix has been found which fails to satisfy
  a_k ≤ (k + 1) a₀,   k = 1, 2, ..., n − 1,
when complete pivot selection is used.
when complete pivot selection is used. This indicates that Gaussian elimination with complete pivot selection is usually a stable process. Despite
this, partial pivot selection is preferred in practice, for the most part, because:
(1) Complete pivot selection is more costly than partial pivot selection. (To compute A^(i), the maximum from among (n − i + 1)² elements must be determined instead of n − i + 1 elements as in partial pivot selection.)
(2) Special structures in a matrix, e.g. the band structure of a tridiagonal matrix, are destroyed in complete pivot selection.
If the weaker estimates (1.4.11) are used instead of (1.4.7), then the following bounds replace those of (4.5.5) for the f_ik:
  |f_ik| ≤ ( eps/(1 − n·eps) ) [ Σ_{j=1}^i j |l̄_ij| |r̄_jk| − |r̄_ik| ]   for k ≥ i,
  |f_ki| ≤ ( eps/(1 − n·eps) ) [ Σ_{j=1}^i j |l̄_kj| |r̄_ji| ]   for k ≥ i + 1,
or
(4.5.9)   |F| ≤ ( eps/(1 − n·eps) ) [ |L̄| D |R̄| − |R̄| ],
where D := diag(1, 2, ..., n).
4.6 Roundoff Errors in Solving Triangular Systems
As a result of applying Gaussian elimination in floating-point arithmetic to the matrix A, a lower triangular matrix L̄ and an upper triangular matrix R̄ are obtained whose product L̄R̄ approximately equals A. Solving the system Ax = b is thereby reduced to solving the two triangular systems
  L̄y = b,   R̄x = y.
In this section we investigate the influence of roundoff errors on the solution of such equation systems. If we use ȳ to denote the solution of L̄y = b obtained using t-digit floating-point arithmetic, then the definition of ȳ gives
(4.6.1)
From (1.4.10), (1.4.11) it follows immediately that
  |b − L̄ȳ| ≤ ( eps/(1 − n·eps) ) ( |L̄| D − I ) |ȳ|,
or
  (L̄ + ΔL) ȳ = b.
In other words, there exists a matrix ΔL satisfying
(4.6.3)   |ΔL| ≤ ( eps/(1 − n·eps) ) ( |L̄| D − I ).
Thus, the computed solution can be interpreted as the exact solution of a slightly changed problem, showing that the process of solving a triangular system is stable. Similarly, the computed solution x̄ of R̄x = ȳ is found to satisfy the bound
  |ȳ − R̄x̄| ≤ ( eps/(1 − n·eps) ) |R̄| E |x̄|,
that is,
(4.6.4)   (R̄ + ΔR) x̄ = ȳ,   |ΔR| ≤ ( eps/(1 − n·eps) ) |R̄| E,   E := diag(n, n − 1, ..., 2, 1).
By combining the estimates (4.5.9), (4.6.3), and (4.6.4), we obtain the following result [due to Sautter (1971)] for the approximate solution x̄ produced by floating-point arithmetic to the linear system Ax = b:
(4.6.5)   |b − Ax̄| ≤ ( 2(n + 1) eps/(1 − n·eps) ) |L̄| |R̄| |x̄|,   if n·eps ≤ 1/2.
PROOF. Using the abbreviation ε := eps/(1 − n·eps), it follows from (4.5.9), (4.6.3) and (4.6.4) that
  |b − Ax̄| = |b − (L̄R̄ + F)x̄| = |−Fx̄ + b − L̄(ȳ − ΔR x̄)|
           = |(−F + ΔL(R̄ + ΔR) + L̄ ΔR) x̄|
           ≤ ε [ 2(|L̄|D − I)|R̄| + |L̄||R̄|E + ε(|L̄|D − I)|R̄|E ] |x̄|.
The (i, k) component of the matrix [...] appearing in the last line above has the form
  Σ_{j=1}^{min(i,k)} |l̄_ij| ( 2j − 2δ_ij + n + 1 − k + ε(j − δ_ij + n + 1 − k) ) |r̄_jk|,   δ_ij := 1 for i = j, 0 for i ≠ j.
It is easily verified for all j ≤ min(i, k), 1 ≤ i, k ≤ n, that
  2j − 2δ_ij + n + 1 − k + ε(j − δ_ij + n + 1 − k) ≤ { 2n − 1 + ε·n   if j ≤ i ≤ k,
                                                       2n + ε(n + 1)  if j ≤ k < i }
                                                   ≤ 2n + 2,
since ε·n ≤ 2n·eps ≤ 1. This completes the proof of (4.6.5).   □
A comparison of (4.6.5) with the result (4.4.19) due to Prager and Oettli (1964) shows, finally, that the computed solution x̄ can be interpreted as the exact solution of a slightly changed equation system, provided that the matrix n|L̄||R̄| has the same order of magnitude as |A|. In that case, computing the solution via Gaussian elimination is a numerically stable algorithm.
4.7 Orthogonalization Techniques of Householder and
Gram-Schmidt
The methods discussed up to this point for solving a system of equations
(4.7.1)
Ax = b
consisted of multiplying (4.7.1) on the left by appropriate matrices P j ,
j = 1, ... , n, so that the system obtained as the final outcome,
A(n)x = b(n),
could be solved directly. The sensitivity of the result x to changes in the arrays A^(j), b^(j) of the intermediate systems
  A^(j) x = b^(j),   [A^(j), b^(j)] = P_j [A^(j−1), b^(j−1)],
is given by [see (4.4.12) and Theorem (4.4.15)]
  cond(A^(j)) = lub(A^(j)) lub((A^(j))⁻¹).
If we denote the roundoff error incurred in the transition from [A^(j−1), b^(j−1)] to [A^(j), b^(j)] by ε^(j), then these roundoff errors are amplified by the factors cond(A^(j)) in their effect on the final result x; here ε^(0) stands for the error in the initial data A, b. If there is an A^(j) with
  cond(A^(j)) ≫ cond(A),
then the sequence of computations is not numerically stable: the roundoff error ε^(j) has a stronger influence on the final result than the initial error ε^(0).
For this reason, it is important to choose the P_j so that the condition numbers cond(A^(j)) do not grow. For condition numbers derived from an arbitrary norm ‖x‖, that is difficult to do. For the Euclidean norm
  ‖x‖ = √(x^H x)
and its subordinate matrix norm
  lub(A) = max_{x≠0} ‖Ax‖/‖x‖,
however, the choice of transformations is more easily made. For this reason, only the above norms will be used in the present section. If U is a unitary matrix, U^H U = I, then the above matrix norm satisfies
  lub(A) = lub(U^H U A) ≤ lub(U^H) lub(UA) = lub(UA) ≤ lub(U) lub(A) = lub(A);
hence
  lub(UA) = lub(A),
and similarly lub(AU) = lub(A). In particular, it follows that
  cond(A) = lub(A) lub(A⁻¹) = cond(UA)
for unitary U. If the transformation matrices Pj are chosen to be unitary,
then the condition numbers associated with the systems A (j) x = b(j) do
not change (so they certainly don't get worse). Furthermore, the matrices
Pj should be chosen so that the matrices A (j) become simpler. As was suggested by Householder, this can be accomplished in the following manner:
The unitary matrix P is chosen to be
  P = I − 2ww^H,   w ∈ ℂⁿ,   w^H w = 1.
Matrices of this form are called Householder matrices. These matrices are Hermitian:
  P^H = (I − 2ww^H)^H = I − 2ww^H = P,
unitary:
  P^H P = PP = P² = (I − 2ww^H)(I − 2ww^H) = I − 2ww^H − 2ww^H + 4w(w^H w)w^H = I,
and therefore involutory:
  P² = I.
Geometrically, the mapping x ↦ y := Px = x − 2(w^H x)w describes a reflection of x with respect to the plane { z | w^H z = 0 }, and x and y then satisfy
(4.7.2)   y^H y = x^H P^H P x = x^H x,
(4.7.3)   x^H y = x^H P x = (x^H P x)^H,
and x^H y is real. We wish to determine a vector w, and thereby P, so that a given
  x = (x₁, ..., xₙ)^T ≠ 0
is transformed into a multiple of the first coordinate vector e₁:
  Px = k e₁.
From (4.7.2) it follows immediately that k satisfies
  |k| = (x^H x)^{1/2} =: σ,
and, since k̄ x^H e₁ = k̄ x₁ must be real according to (4.7.3),
  k = ∓ e^{iα} σ,   if x₁ = e^{iα} |x₁|.
Hence, it follows from
  Px = x − 2(w^H x)w = k e₁
and the requirement w^H w = 1 that
  w = (x − k e₁)/‖x − k e₁‖,
  ‖x − k e₁‖ = ‖x ± σ e^{iα} e₁‖ = √( |x₁ ± σ e^{iα}|² + |x₂|² + ... + |xₙ|² )
             = √( (|x₁| ± σ)² + |x₂|² + ... + |xₙ|² ).
In order that no cancellation may occur in the computation of |x₁| ± σ, we choose the sign in the definition of k to be the negative one, which gives
(4.7.4)   k = −σ e^{iα},   ‖x − k e₁‖² = 2σ(σ + |x₁|).
It follows that
  w = (x − k e₁) / ( 2σ(σ + |x₁|) )^{1/2}.
The matrix P = I − 2ww^H can be written in the form
  P = I − β u u^H
with
(4.7.5)   σ = ( Σ_{i=1}^n |x_i|² )^{1/2},   x₁ = e^{iα}|x₁|,   k = −σ e^{iα},
          u := x − k e₁ = ( e^{iα}(|x₁| + σ), x₂, ..., xₙ )^T,   β := 1/( σ(σ + |x₁|) ).
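A minimal Python sketch of the construction (4.7.5) for the real case follows (not part of the original text; NumPy and the sample vector are assumptions). It builds u, β and k and verifies that (I − βuu^T)x is a multiple of e₁:

    import numpy as np

    def householder(x):
        # u and beta with (I - beta*u*u^T) x = k*e1, cf. (4.7.5); the sign of k
        # is chosen opposite to x[0] so that |x1| + sigma is formed without cancellation.
        sigma = np.linalg.norm(x)
        if sigma == 0.0:
            return np.zeros_like(x), 0.0, 0.0
        k = -sigma if x[0] >= 0 else sigma
        u = x.copy()
        u[0] = x[0] - k                  # first component has magnitude |x1| + sigma
        beta = 1.0 / (sigma * (sigma + abs(x[0])))
        return u, beta, k

    x = np.array([3.0, 4.0, 0.0])
    u, beta, k = householder(x)
    print((np.eye(3) - beta * np.outer(u, u)) @ x)   # approximately (k, 0, 0)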
An n × n matrix A =: A^(0) can be reduced step by step, using such unitary Householder matrices P_j,
  A^(j) = P_j A^(j−1),
into an upper triangular matrix
  A^(n−1) = R = [ r₁₁  ⋯  r₁ₙ
                       ⋱   ⋮
                  0        rₙₙ ].
To do this, the n × n unitary matrix P₁ is determined according to (4.7.5) so that
  P₁ a₁^(0) = k e₁,
where a₁^(0) stands for the first column of A^(0).
If the matrix A^(j−1) obtained after j − 1 steps has the form
(4.7.6)   A^(j−1) = [ x  ⋯  x | x            ⋯  x
                         ⋱    | ⋮                ⋮
                      0     x | x            ⋯  x
                      --------+------------------------
                      0  ⋯  0 | a_jj^(j−1)   ⋯  a_jn^(j−1)
                      ⋮     ⋮ | ⋮                ⋮
                      0  ⋯  0 | a_nj^(j−1)   ⋯  a_nn^(j−1) ]  =  [ B^(j−1)  C^(j−1)
                                                                    0        Ã^(j−1) ],
with a (j−1) × (j−1) upper triangular block B^(j−1) and the remaining (n−j+1) × (n−j+1) block Ã^(j−1) = ( a_ik^(j−1) ), i, k ≥ j,
then we determine the (n − j + 1) × (n − j + 1) unitary matrix P̃_j according to (4.7.5) so that
  P̃_j ã^(j−1) = k e₁,
where ã^(j−1) is the first column of Ã^(j−1). Using P̃_j, the desired n × n unitary matrix is constructed as
  P_j = [ I_{j−1}  0
          0        P̃_j ].
After forming A^(j) = P_j A^(j−1), the elements a_ij^(j) for i > j are annihilated, and the rows in (4.7.6) above the horizontal dashed line remain unchanged.
In this way an upper triangular matrix
  R := A^(n−1)
is obtained after n − 1 steps.
It should be noted, when applying Householder transformations in practice, that the locations which are set to zero beneath the diagonal of A^(j) can be used to store u, which contains the important information about the transformation P̃_j. However, since the vector u which belongs to P̃_j contains n − j + 1 components, while only n − j locations are set free in A^(j), it is usual to store the diagonal elements of A^(j) in a separate array d in order to obtain enough space for u.
The transformation of the remaining block by P̃_j = I − β u u^H is carried out as follows:
  P̃_j Ã^(j−1) = Ã^(j−1) − u y_j^H,   y_j^H := β ( u^H Ã^(j−1) );
that is, the vector y_j is computed first, and then Ã^(j−1) is modified as indicated.
The following ALGOL program contains the essentials of the reduction
of a real matrix, stored in the array a, to triangular form using Householder
transformations:
for j := 1 step 1 until n do
begin
  sigma := 0;
  for i := j step 1 until n do sigma := sigma + a[i, j] ↑ 2;
  if sigma = 0 then goto singular;
  s := d[j] := if a[j, j] < 0 then sqrt(sigma) else -sqrt(sigma);
  beta := 1/(s × a[j, j] - sigma);  a[j, j] := a[j, j] - s;
  for k := j + 1 step 1 until n do
  begin
    sum := 0;
    for i := j step 1 until n do
      sum := sum + a[i, j] × a[i, k];
    sum := beta × sum;
    for i := j step 1 until n do
      a[i, k] := a[i, k] + a[i, j] × sum;
  end;
end;
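For readers who prefer a modern language, the following Python sketch mirrors the ALGOL fragment above (it is not part of the original text; NumPy and the sample matrix are assumptions). Below the diagonal of column j the vector u is stored, and the diagonal of R goes into the array d:

    import numpy as np

    def householder_triangularize(a):
        a = np.array(a, dtype=float)
        n = a.shape[0]
        d = np.zeros(n)
        for j in range(n):
            sigma = np.sum(a[j:, j] ** 2)
            if sigma == 0.0:
                raise ValueError("matrix is singular")
            s = np.sqrt(sigma) if a[j, j] < 0 else -np.sqrt(sigma)
            d[j] = s
            beta = 1.0 / (s * a[j, j] - sigma)
            a[j, j] -= s                       # column j below the diagonal now holds u
            for k in range(j + 1, n):
                f = beta * np.dot(a[j:, j], a[j:, k])
                a[j:, k] += a[j:, j] * f       # apply I + beta*u*u^T to column k
        return a, d

    A = np.array([[4.0, 1.0, 2.0], [2.0, 3.0, 0.0], [1.0, 2.0, 5.0]])
    a, d = householder_triangularize(A)
    R = np.triu(a, 1) + np.diag(d)             # rebuild R from the stored data
    print(np.allclose(np.abs(np.linalg.qr(A)[1]), np.abs(R)))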
The Householder reduction of a n x n matrix to triangular form requires about 2n 3 /3 operations. In this process an n x n unitary matrix
P = Pn - l ... PI consisting of Householder matrices Pi and an n x n upper
triangular matrix R are determined so that
PA=R
or
(4.7.7)
A=P-IR=QR,
Q:=p-l=p H ,
holds: that is we have found a so-called QR-decomposition of the matrix A
into a product of a unitary matrix Q and an upper triangular matrix R,
A=QR.
An upper triangular matrix R, with the property (4.7.7) that AR⁻¹ = Q = (q₁, q₂, ..., qₙ) is a matrix with orthonormal columns q_i, can also be produced directly by the application of Gram-Schmidt orthogonalization to the columns a_i of A = (a₁, ..., aₙ). The equation A = QR shows that the kth column a_k of A,
  a_k = Σ_{i=1}^k r_ik q_i,   k = 1, ..., n,
is a linear combination of the vectors q₁, q₂, ..., q_k, so that, conversely, q_k is a linear combination of the first k columns a₁, ..., a_k of A. The Gram-Schmidt process determines the columns of Q and R recursively as follows: Begin with
  r₁₁ := ‖a₁‖,   q₁ := a₁/r₁₁.
If the orthonormal vectors q₁, ..., q_{k−1} and the elements r_ij with j ≤ k − 1 of R are known, then the numbers r_{1k}, ..., r_{k−1,k} are determined so that the vector
(4.7.8)   b_k := a_k − r_{1k} q₁ − ... − r_{k−1,k} q_{k−1}
is orthogonal to all q_i, i = 1, ..., k − 1. Because
  q_i^H q_j = 1 for i = j, 0 otherwise,   for 1 ≤ i, j ≤ k − 1,
the conditions q_i^H b_k = 0 lead immediately to
(4.7.9)   r_ik = q_i^H a_k,   i = 1, ..., k − 1.
After the r_ik, i ≤ k − 1, and thereby b_k, have been determined, the quantities
(4.7.10)   r_kk := ‖b_k‖,   q_k := b_k / r_kk
are computed, so that (4.7.8) is equivalent to
  a_k = Σ_{i=1}^k r_ik q_i
and moreover
  q_i^H q_j = 1 for i = j, 0 otherwise,   for all 1 ≤ i, j ≤ k.
As given above, the algorithm has a serious disadvantage: it is not numerically stable if the columns of the matrix A are nearly linearly dependent. In this case the vector b̄_k, which is obtained from the formulas (4.7.8), (4.7.9) instead of the vector b_k (due to the influence of roundoff errors), is no longer orthogonal to the vectors q₁, ..., q_{k−1}. The slightest roundoff error incurred in the determination of the r_ik in (4.7.9) destroys orthogonality to a greater or lesser extent.
We will discuss this effect in the special case corresponding to k = 1 in (4.7.8). Let two real vectors a and q be given, which we regard, for simplicity, as normalized: ‖a‖ = ‖q‖ = 1. This means that their scalar product p₀ := q^T a satisfies |p₀| ≤ 1. We assume that |p₀| < 1 holds, that is, a and q are linearly independent. The vector
  b = b(p) := a − p q
is orthogonal to q for p = p₀. In general, the angle α(p) between b(p) and q satisfies
  f(p) := cos α(p) = q^T b(p)/‖b(p)‖ = (q^T a − p)/‖a − pq‖,   α(p₀) = π/2,   f(p₀) = 0.
Differentiating with respect to p, we find
  f′(p₀) = −1/‖a − p₀q‖ = −1/√(1 − p₀²),   α′(p₀) = 1/√(1 − p₀²),
since
  ‖a − p₀q‖² = a^T a − 2p₀ a^T q + p₀² q^T q = 1 − p₀².
Therefore, to a first approximation, we have
  α(p₀ + Δp₀) ≐ π/2 + Δp₀/√(1 − p₀²)
for Δp₀ small. The closer |p₀| lies to 1, i.e., the more linearly dependent a and q are, the more the orthogonality between q and b(p) is destroyed by even tiny errors Δp₀, in particular by the roundoff errors which occur in computing p₀ = q^T a.
Since it is precisely the orthogonality of the vectors q_i that is essential to the process, the following trick is used in practice. The vectors b̄_k which are obtained instead of the exact b_k from the evaluation of (4.7.8), (4.7.9) are subjected to a "reorthogonalization". That is, scalars Δr_ik and a vector b̃_k are computed from
  b̃_k = b̄_k − Δr_{1k} q₁ − ... − Δr_{k−1,k} q_{k−1},   where Δr_ik := q_i^H b̄_k,  i = 1, ..., k − 1,
so that in exact arithmetic q_i^H b̃_k = 0, i = 1, ..., k − 1. Since b̄_k was at least approximately orthogonal to the q_i, the Δr_ik are small, and according to the theory just given, the roundoff errors which are made in the computation of the Δr_ik have only a minor influence on the orthogonality of b̃_k and q_i. This means that the vector
  q_k := b̃_k / ‖b̃_k‖
will be orthogonal within machine precision to the already known vectors q₁, ..., q_{k−1}. The values r_ik which have been found by evaluating (4.7.9) are corrected appropriately:
  r_ik := r_ik + Δr_ik,   i = 1, ..., k − 1.
Clearly, reorthogonalization requires twice as much computing effort as the
straightforward Gram-Schmidt process.
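A minimal Python sketch of Gram-Schmidt with this reorthogonalization step follows (not part of the original text; NumPy and the nearly dependent test matrix are assumptions):

    import numpy as np

    def gram_schmidt(A, reorthogonalize=True):
        # Column-by-column Gram-Schmidt following (4.7.8)-(4.7.10); with
        # reorthogonalize=True each b_k is orthogonalized a second time against
        # q_1,...,q_{k-1} and the r_ik are corrected by the increments Delta r_ik.
        A = np.asarray(A, dtype=float)
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for k in range(n):
            b = A[:, k].copy()
            for i in range(k):
                R[i, k] = Q[:, i] @ A[:, k]          # (4.7.9)
                b -= R[i, k] * Q[:, i]               # (4.7.8)
            if reorthogonalize:
                for i in range(k):
                    dr = Q[:, i] @ b                 # Delta r_ik
                    R[i, k] += dr
                    b -= dr * Q[:, i]
            R[k, k] = np.linalg.norm(b)              # (4.7.10)
            Q[:, k] = b / R[k, k]
        return Q, R

    A = np.array([[1.0, 1.0], [1e-7, 0.0], [0.0, 1e-7]])
    Q, R = gram_schmidt(A)
    print(np.abs(Q.T @ Q - np.eye(2)).max())         # orthogonality near machine precision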
4.8 Data Fitting
In many scientific observations, one is concerned with determining the values of certain constants
  x₁, ..., xₙ.
Often, however, it is exceedingly difficult or impossible to measure the quantities x_i directly. In such cases, the following indirect method is used: instead of observing the x_i, another, more easily measurable quantity y is sampled, which depends in a known way on the x_i and on further controllable "experimental conditions", which we symbolize by z:
  y = f(z; x₁, ..., xₙ).
In order to determine the x_i, experiments are carried out under m different conditions z₁, ..., z_m, and the corresponding results
(4.8.0.1)   y_k = f(z_k; x₁, ..., xₙ),   k = 1, 2, ..., m,
are measured. Values for x₁, ..., xₙ are sought so that the equations (4.8.0.1) are satisfied. In general, of course, at least m experiments, m ≥ n, must be carried out in order that the x_i may be uniquely determined. If m > n, however, the equations (4.8.0.1) form an overdetermined system for the unknown parameters x₁, ..., xₙ, which does not usually have a solution because the observed quantities y_i are perturbed by measurement errors. Consequently, instead of finding an exact solution to (4.8.0.1), the problem becomes one of finding a "best possible solution". Such a solution to (4.8.0.1) is taken to mean a set of values for the unknown parameters for which the expression
(4.8.0.2)   Σ_{k=1}^m ( y_k − f_k(x₁, ..., xₙ) )²,
or the expression
(4.8.0.3)
is minimized. Here we have denoted f(z_k; x₁, ..., xₙ) by f_k(x₁, ..., xₙ).
In the first case, the Euclidean norm of the residuals is minimized, and we are presented with a fitting problem of the special type studied by Gauss (in his "method of least squares"). In mathematical statistics [e.g.
see Guest (1961), Seber (1977), or Grossmann (1969)] it is shown that the
"least-squares solution" has particularly simple statistical properties. It is
the most reasonable point to determine if the errors in the measurements
are independent and normally distributed. More recently, however, points
x which minimize the norm (4.8.0.3) of the residuals have been considered
in cases where the measurement errors are subject to occasional anomalies.
Since the computation of the best solution x is more difficult in this case
than in the method of least squares, we will not pursue this matter.
If the functions f_k(x₁, ..., xₙ) have continuous partial derivatives in all of the variables x_i, then we can readily give a necessary condition for x = (x₁, ..., xₙ)^T to minimize (4.8.0.2):
(4.8.0.4)   Σ_{k=1}^m ( y_k − f_k(x₁, ..., xₙ) ) ∂f_k(x₁, ..., xₙ)/∂x_i = 0,   i = 1, ..., n.
These are the normal equations for x.
An important special case, the linear least-squares problem, is obtained if the functions f_k(x₁, ..., xₙ) are linear in the x_i. In this case there is an m × n matrix A with
  ( f₁(x₁, ..., xₙ), ..., f_m(x₁, ..., xₙ) )^T = A x,
and the normal equations reduce to a linear system
(4.8.0.5)   A^T A x = A^T y.
We will concern ourselves in the following sections (4.8.1-4.8.3) with methods for solving the linear least-squares problem, and in particular, we will show that more numerically stable methods exist for finding the solution than by means of the normal equations.
Least-squares problems are studied in more detail in Bjorck (1990) and
the book by Lawson and Hanson (1974), which also contains FORTRAN
programs; ALGOL programs are found in Wilkinson and Reinsch (1971).
4.8.1 Linear Least Squares. The Normal Equations
In the following sections ‖x‖ will always denote the Euclidean norm ‖x‖ := √(x^T x). Let a real m × n matrix A and a vector y ∈ ℝᵐ be given, and let
(4.8.1.1)   ‖y − Ax‖² = (y − Ax)^T (y − Ax)
be minimized as a function of x. We want to show that x ∈ ℝⁿ is a solution of the normal equations
(4.8.1.2)   A^T A x = A^T y
if and only if x is also a minimum point for (4.8.1.1). We have the following:
(4.8.1.3) Theorem. The linear least-squares problem
  min_{x ∈ ℝⁿ} ‖y − Ax‖
has at least one minimum point x₀. If x₁ is another minimum point, then Ax₀ = Ax₁. The residual r := y − Ax₀ is uniquely determined and satisfies the equation A^T r = 0. Every minimum point x₀ is also a solution of the normal equations (4.8.1.2) and conversely.
PROOF. Let L ⊆ ℝᵐ be the linear subspace
  L := { Ax | x ∈ ℝⁿ }
which is spanned by the columns of A, and let L^⊥ be the orthogonal complement
  L^⊥ := { r | r^T z = 0 for all z ∈ L } = { r | r^T A = 0 }.
Because ℝᵐ = L ⊕ L^⊥, the vector y ∈ ℝᵐ can be written uniquely in the form
(4.8.1.4)   y = s + r,   s ∈ L,   r ∈ L^⊥,
and there is at least one x₀ with Ax₀ = s. Because A^T r = 0, x₀ satisfies
  A^T y = A^T s = A^T A x₀,
that is, x₀ is a solution of the normal equations. Conversely, each solution x₁ of the normal equations corresponds to a representation (4.8.1.4):
  y = s + r,   s := Ax₁ ∈ L,   r := y − Ax₁ ∈ L^⊥.
Because this representation is unique, it follows that Ax₀ = Ax₁ for all solutions x₀, x₁ of the normal equations. Further, each solution x₀ of the normal equations is a minimum point for the problem
  min_{x ∈ ℝⁿ} ‖y − Ax‖.
To see this, let x be arbitrary, and set
  z := Ax − Ax₀,   r := y − Ax₀.
Then, since r^T z = 0,
  ‖y − Ax‖² = ‖r − z‖² = ‖r‖² + ‖z‖² ≥ ‖r‖² = ‖y − Ax₀‖²;
that is, x₀ is a minimum point. This establishes Theorem (4.8.1.3).   □
If the columns of A are linearly independent, that is, if x ≠ 0 implies Ax ≠ 0, then the matrix A^T A is nonsingular (and positive definite). If this were not the case, there would exist an x ≠ 0 satisfying A^T A x = 0, from which
  0 = x^T A^T A x = ‖Ax‖²
would yield a contradiction, since Ax ≠ 0. Therefore the normal equations
  A^T A x = A^T y
have a unique solution x = (AT A)-1 AT y, which can be computed using the
methods of Section 4.3 (that is, using the Choleski factorization of AT A).
However, we will see in the following sections that there are more numerically stable ways of solving the linear least squares problem.
We digress here to go briefly into the statistical meaning of the matrix (A^T A)⁻¹. To do this, we assume that the components y_i, i = 1, ..., m, are independent, normally distributed random variables having μ_i as means and all having the same variance σ²:
  E[y_i] = μ_i,   E[(y_i − μ_i)(y_k − μ_k)] = σ² for i = k, 0 otherwise.
If we set μ := (μ₁, ..., μ_m)^T, then the above can be expressed as
(4.8.1.5)   E[y] = μ,   E[(y − μ)(y − μ)^T] = σ² I.
The covariance matrix of the random vector y is σ² I. The optimum x = (A^T A)⁻¹ A^T y of the least-squares problem is also a random vector, having mean
  E[x] = E[(A^T A)⁻¹ A^T y] = (A^T A)⁻¹ A^T E[y] = (A^T A)⁻¹ A^T μ,
and covariance matrix
  E[(x − E(x))(x − E(x))^T] = E[(A^T A)⁻¹ A^T (y − μ)(y − μ)^T A (A^T A)⁻¹]
                            = (A^T A)⁻¹ A^T E[(y − μ)(y − μ)^T] A (A^T A)⁻¹ = σ² (A^T A)⁻¹.
4.8.2 The Use of Orthogonalization in Solving Linear Least-Squares Problems
The problem of determining an x ∈ ℝⁿ which minimizes
  ‖y − Ax‖,   A ∈ M(m, n),   m ≥ n,
can be solved using the orthogonalization techniques discussed in Section 4.7. Let the matrix A =: A^(0) and the vector y =: y^(0) be transformed by a sequence of Householder transformations P_i, A^(i) = P_i A^(i−1), y^(i) = P_i y^(i−1). The final matrix A^(n) has the form
(4.8.2.1)   A^(n) = [ R    } n
                      0 ]  } m − n,
with an n × n upper triangular matrix R, since m ≥ n. Let the vector h := y^(n) be partitioned correspondingly:
(4.8.2.2)   h = [ h₁    } n
                  h₂ ]  } m − n.
The matrix P = P_n ⋯ P₁, being the product of unitary matrices, is unitary itself, and satisfies
  A^(n) = P A,   h = P y.
Unitary transformations leave the Euclidean norm ‖u‖ of a vector u invariant (‖Pu‖² = u^H P^H P u = u^H u = ‖u‖²), so
  ‖y − Ax‖ = ‖P(y − Ax)‖ = ‖y^(n) − A^(n) x‖.
However, from (4.8.2.1) and (4.8.2.2), the vector y^(n) − A^(n) x has the structure
  y^(n) − A^(n) x = [ h₁ − Rx
                      h₂      ].
Hence ‖y − Ax‖ is minimized if x is chosen so that
(4.8.2.3)   R x = h₁.
The matrix R has an inverse R⁻¹ if and only if the columns a₁, ..., aₙ of A are linearly independent. Az = 0 for z ≠ 0 is equivalent to
  P A z = 0
and therefore to
  R z = 0.
If we assume that the columns of A are linearly independent, then
  R x = h₁,
which is a triangular system, can be solved uniquely for x. This x is, moreover, the unique minimum point for the given least-squares problem. [If the columns of A, and with them the columns of R, are linearly dependent, then, although the value of min_z ‖y − Az‖ is uniquely determined, there are many minimum points x.]
The size ‖y − Ax‖ of the residual of the minimum point is seen to be
(4.8.2.4)   ‖y − Ax‖ = ‖h₂‖.
We conclude by mentioning that instead of using unitary transformations, the Gram-Schmidt technique with reorthogonalization can be used to obtain the solution, as should be evident.
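A minimal Python sketch of this orthogonalization route follows (not part of the original text; NumPy and the small example are assumptions). NumPy's Householder-based QR factorization plays the role of P:

    import numpy as np

    def lstsq_qr(A, y):
        # Minimize ||y - Ax|| as in Section 4.8.2: orthogonal transformation,
        # then the triangular system R x = h1, cf. (4.8.2.3).
        Q, R = np.linalg.qr(A)               # "reduced" factorization, A = Q R
        h1 = Q.T @ y                         # first n components of P y
        return np.linalg.solve(R, h1)

    A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 2.0])
    x = lstsq_qr(A, y)
    print(x, np.linalg.norm(y - A @ x))      # residual norm equals ||h2||, cf. (4.8.2.4)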
4.8.3 The Condition of the Linear Least-Squares Problem
We begin this section by investigating how a minimum point x for the linear least-squares problem
(4.8.3.1)   min_x ‖y − Ax‖
changes if the matrix A and the vector y are perturbed. We assume that the columns of A are linearly independent. If the matrix A is replaced by A + ΔA, and y is replaced by y + Δy, then the solution x = (A^T A)⁻¹ A^T y of (4.8.3.1) changes to
  x + Δx = ( (A + ΔA)^T (A + ΔA) )⁻¹ (A + ΔA)^T (y + Δy).
If ΔA is small relative to A, then ( (A + ΔA)^T (A + ΔA) )⁻¹ exists and satisfies, to a first approximation,
  ( (A + ΔA)^T (A + ΔA) )⁻¹ ≐ ( A^T A ( I + (A^T A)⁻¹[A^T ΔA + ΔA^T A] ) )⁻¹
                             ≐ ( I − (A^T A)⁻¹[A^T ΔA + ΔA^T A] ) (A^T A)⁻¹.
[To a first approximation, (I + F)⁻¹ ≐ I − F if the matrix F is "small" relative to I.] Thus it follows that
(4.8.3.2)   x + Δx ≐ (A^T A)⁻¹ A^T y − (A^T A)⁻¹[A^T ΔA + ΔA^T A](A^T A)⁻¹ A^T y
                     + (A^T A)⁻¹ ΔA^T y + (A^T A)⁻¹ A^T Δy.
Noting that
  x = (A^T A)⁻¹ A^T y
and introducing the residual
  r := y − Ax,
it follows immediately from (4.8.3.2) that
  Δx ≐ −(A^T A)⁻¹ A^T ΔA x + (A^T A)⁻¹ ΔA^T r + (A^T A)⁻¹ A^T Δy.
Therefore, for the Euclidean norm ‖·‖ and the associated matrix norm lub,
(4.8.3.3)   ‖Δx‖ ⪅ lub((A^T A)⁻¹ A^T) lub(A) ( lub(ΔA)/lub(A) ) ‖x‖
                  + lub((A^T A)⁻¹) lub(A^T) lub(A) ( lub(ΔA^T)/lub(A^T) ) ( ‖r‖/‖Ax‖ ) ‖x‖
                  + lub((A^T A)⁻¹ A^T) lub(A) ( ‖y‖/‖Ax‖ ) ( ‖Δy‖/‖y‖ ) ‖x‖.
[Observe that the definition given in (4.4.8) for lub makes sense even for nonsquare matrices.] This approximate bound can be simplified. According to the results of Section 4.8.2, a unitary matrix P and an upper triangular matrix R can be found such that
  P A = [ R
          0 ],
and it follows that
(4.8.3.4)   A^T A = R^T R,   (A^T A)⁻¹ = R⁻¹ (R^T)⁻¹,   (A^T A)⁻¹ A^T = [R⁻¹, 0] P.
If it is observed that
  lub(C^T) = lub(C),   lub(PC) = lub(CP) = lub(C)
holds for the Euclidean norm, where P is unitary, then (4.8.3.3) and (4.8.3.4) imply
  ‖Δx‖/‖x‖ ⪅ cond(R) lub(ΔA)/lub(A) + cond(R)² ( ‖r‖/‖Ax‖ ) lub(ΔA)/lub(A) + cond(R) ( ‖y‖/‖Ax‖ ) ( ‖Δy‖/‖y‖ ).
If we define the angle φ by
  tan φ = ‖r‖/‖Ax‖,   0 ≤ φ < π/2,
then ‖y‖/‖Ax‖ = (1 + tan²φ)^{1/2} because of y = Ax + r, r ⊥ Ax. Therefore
(4.8.3.5)   ‖Δx‖/‖x‖ ⪅ cond(R) lub(ΔA)/lub(A) + cond(R)² tan φ · lub(ΔA)/lub(A) + cond(R) √(1 + tan²φ) · ‖Δy‖/‖y‖.
In summary, the condition of the least-squares problem depends on cond(R) and the angle φ: If φ is small, say if cond(R) tan φ ≤ 1, then the condition is measured by cond(R). With increasing φ ↑ π/2 the condition gets worse: it is then measured by cond(R)² tan φ.
If the minimum point x for the linear least-squares problem is found using the orthogonalization method described in Section 4.8.2, then in exact arithmetic a unitary matrix P (consisting of a product of Householder matrices), an upper triangular matrix R, a vector h^T = (h₁^T, h₂^T), and the solution x satisfying
(4.8.3.6)   P A = [ R
                    0 ],   h = P y,   R x = h₁
are produced. Using floating-point arithmetic with relative machine precision eps, an exactly unitary matrix will not be obtained, nor will R, h, x exactly satisfy (4.8.3.6). Wilkinson (1965) has shown, however, that there exist a unitary matrix P′, matrices ΔA and ΔR, and a vector Δy such that
  lub(P′ − P) ≤ f(m) eps,
  P′(A + ΔA) = [ R         lub(ΔA) ≤ f(m) eps lub(A),
                 0 ],
  P′(y + Δy) = h,          ‖Δy‖ ≤ f(m) eps ‖y‖,
  (R + ΔR)x = h₁,          lub(ΔR) ≤ f(m) eps lub(R).
In these relations f(m) is a slowly growing function of m, roughly f(m) ≈ O(m).
Up to terms of higher order
  lub(R) ≐ lub(A),
and therefore, since
  F := ΔA + P′^H [ ΔR
                   0  ],
it also follows that
  P′(A + F) = [ R + ΔR
                0      ],   lub(F) ≤ 2 f(m) eps lub(A).
In other words, the computed solution x can be interpreted as an exact minimum point of the following linear least-squares problem:
(4.8.3.7)   min_z ‖(y + Δy) − (A + F)z‖,
wherein the matrix A + F and the right-hand-side vector y + Δy differ only slightly from A and y respectively. The technique of orthogonalization is, therefore, numerically stable.
If, on the other hand, one computes the solution x by way of the normal equations
  A^T A x = A^T y,
then the situation is quite different. It is known from Wilkinson's (1965) estimates [see also Section 4.5] that a solution vector x̃ is obtained which satisfies
(4.8.3.8)   (A^T A + G) x̃ = A^T y,   lub(G) ≤ f(n) eps lub(A^T A),
when floating-point computation with the relative machine precision eps is used (even under the assumption that A^T y and A^T A are computed exactly). If x = (A^T A)⁻¹ A^T y is the exact solution, then (4.4.15) shows that
(4.8.3.9)   ‖x̃ − x‖/‖x‖ ⪅ cond(A^T A) lub(G)/lub(A^T A) = cond(R)² lub(G)/lub(A^T A) ≤ f(n) eps cond(R)²
to a first approximation. The roundoff errors, represented here as the matrix G, are amplified by the factor cond(R)².
This shows that the use of the normal equations is not numerically stable if the first term dominates in (4.8.3.5). Another situation holds if the second term dominates. If tan φ = ‖r‖/‖Ax‖ ≈ 1, for example, then the use of the normal equations will be numerically stable and will yield results which are comparable to those obtained through the use of orthogonal transformations.
EXAMPLE 1 (Läuchli). For the 6 × 5 matrix
  A = [ 1  1  1  1  1
        ε  0  0  0  0
        0  ε  0  0  0
        0  0  ε  0  0
        0  0  0  ε  0
        0  0  0  0  ε ]
it follows that
  A^T A = [ 1+ε²  1     1     1     1
            1     1+ε²  1     1     1
            1     1     1+ε²  1     1
            1     1     1     1+ε²  1
            1     1     1     1     1+ε² ].
If ε = 0.5·10⁻⁵, then 10-place decimal arithmetic yields
  fl(A^T A) = [ 1 1 1 1 1
                1 1 1 1 1
                1 1 1 1 1
                1 1 1 1 1
                1 1 1 1 1 ],
since ε² = 0.25·10⁻¹⁰. This matrix has rank 1 and has no inverse. The normal equations cannot be solved, whereas the orthogonalization technique can be applied without difficulty. [Note for A^T A that cond(A^T A) = cond(R)² = (5 + ε²)/ε².]
EXAMPLE 2. The following example is intended to offer a computational comparison between the two methods which have been discussed for the linear least-squares problem.
(4.8.3.7) shows that, for the solution of a least-squares problem obtained using orthogonal transformations, the approximate bound given by (4.8.3.5) holds with ΔA = F and
  lub(ΔA) ≤ 2 f(m) eps lub(A).
(4.8.3.9) shows that the solution obtained directly from the normal equations satisfies a somewhat different bound:
  (A^T A + G)(x + Δx) = A^T y + Δb,
where ‖Δb‖ ≤ eps lub(A) ‖y‖, that is,
(4.8.3.10)   ‖Δx‖_normal / ‖x‖ ⪅ cond(R)² ( lub(G)/lub(A^T A) + ‖Δb‖/‖A^T y‖ ),
which holds because of (4.4.12) and (4.4.15).
Let the fundamental relationship
(4.8.3.11)   y(s) := x₁ (1/s) + x₂ (1/s²) + x₃ (1/s³),   with x₁ = x₂ = x₃ = 1,
be given. A series of observed values y_i = y(s_i) [computed from (4.8.3.11)] were produced on a machine having eps = 10⁻¹¹.
(a) Determine x₁, x₂, x₃ from the data {s_i, y_i}. If exact function values y_i = y(s_i) are used, then the residual will satisfy
  r(x) = 0,   tan φ = 0.
The following table presents some computational results for this example. The sizes of the errors ‖Δx‖_orth from the orthogonalization method and ‖Δx‖_normal from the normal equations are shown together with a lower bound for the condition number of R. The example was solved using s_i = s₀ + i, i = 1, ..., 10, for a number of values of s₀:

  s₀     cond(R)     ‖Δx‖_orth    ‖Δx‖_normal
  10     6.6·10³     8.0·10⁻¹⁰    8.8·10⁻⁷
  50     1.3·10⁶     6.4·10⁻⁷     8.2·10⁻⁵
  100    1.7·10⁷     3.3·10⁻⁶     4.2·10⁻²
  150    8.0·10⁷     1.8·10⁻⁵     6.9·10⁻¹
  200    2.5·10⁸     1.8·10⁻³     2.7·10⁰
(b) We introduce a perturbation into the y_i, replacing y by y + λv, λ ∈ ℝ, where v is chosen so that A^T v = 0. This means that, in theory, the solution should remain unchanged. Now the residual satisfies
  r(x) = y + λv − Ax = λv,   tan φ = λ‖v‖/‖y‖.
The following table presents the errors Δx, as before, and the norm of the residual ‖r(x)‖ together with the corresponding values of λ:
  s₀ = 10,   v = (0.1331, −0.5184, 0.6591, −0.2744, 0, 0, 0, 0, 0, 0)^T,
  lub(A) ≈ 0.22,   eps = 10⁻¹¹.

  λ        ‖r(x)‖    ‖Δx‖_orth    ‖Δx‖_normal
  0        0         8·10⁻¹⁰      8.8·10⁻⁷
  10⁻⁶     9·10⁻⁷    9.5·10⁻⁹     8.8·10⁻⁷
  10⁻⁴     9·10⁻⁵    6.2·10⁻¹⁰    4.6·10⁻⁷
  10⁻²     9·10⁻³    9.1·10⁻⁹     1.3·10⁻⁶
  10⁰      9·10⁻¹    6.1·10⁻⁷     8.8·10⁻⁷
  10⁺²     9·10¹     5.7·10⁻⁵     9.1·10⁻⁶
4.8.4 Nonlinear Least-Squares Problems
Nonlinear data-fitting problems can generally be solved only by iteration, as for instance the problem given in (4.8.0.2) of determining a point x* = (x₁*, ..., xₙ*)^T which, for given functions f_k(x) ≡ f_k(x₁, ..., xₙ) and numbers y_k, k = 1, ..., m,
  y = (y₁, ..., y_m)^T,
minimizes
  ‖y − f(x)‖² = Σ_{k=1}^m ( y_k − f_k(x₁, ..., xₙ) )².
For example, linearization techniques [see Section 5.1] can be used to reduce this problem to a sequence of linear least-squares problems. If each f_k is continuously differentiable, and if
  Df(ξ) = [ ∂f₁/∂x₁   ⋯  ∂f₁/∂xₙ
              ⋮             ⋮
            ∂f_m/∂x₁  ⋯  ∂f_m/∂xₙ ]   evaluated at x = ξ
represents the Jacobian matrix at the point x = ξ, then
  f(x) = f(x̄) + Df(x̄)(x − x̄) + h,   ‖h‖ = o(‖x − x̄‖),
holds. If x̄ is close to the minimum point of the nonlinear least-squares problem, then the solution x̂ of the linear least-squares problem
(4.8.4.1)   min_{z ∈ ℝⁿ} ‖y − f(x̄) − Df(x̄)(z − x̄)‖² = ‖r(x̄) − Df(x̄)(x̂ − x̄)‖²,   r(x̄) := y − f(x̄),
will often be still closer than x̄; that is,
  ‖y − f(x̂)‖² < ‖y − f(x̄)‖².
This relation is not always true in the form given. However, it can be shown that the direction
  s = s(x̄) := x̂ − x̄
satisfies the following:
There is a λ > 0 such that the function
  φ(τ) := ‖y − f(x̄ + τ s)‖²
is strictly monotone decreasing for all 0 ≤ τ ≤ λ. In particular
  φ(λ) = ‖y − f(x̄ + λs)‖² < φ(0) = ‖y − f(x̄)‖².
PROOF. φ is a continuously differentiable function of τ and satisfies
  φ′(0) = d/dτ [ (y − f(x̄ + τs))^T (y − f(x̄ + τs)) ]_{τ=0}
        = −2 (Df(x̄)s)^T (y − f(x̄)) = −2 (Df(x̄)s)^T r(x̄).
Now, by the definition of x̂, s = (x̂ − x̄) is a solution to the normal equations
(4.8.4.2)   Df(x̄)^T Df(x̄) s = Df(x̄)^T r(x̄)
of the linear least-squares problem (4.8.4.1). It follows immediately from (4.8.4.2) that
  ‖Df(x̄)s‖² = s^T Df(x̄)^T Df(x̄) s = (Df(x̄)s)^T r(x̄),
and therefore
  φ′(0) = −2 ‖Df(x̄)s‖² < 0,
so long as rank Df(x̄) = n and s ≠ 0. The existence of a λ > 0 satisfying φ′(τ) < 0 for 0 ≤ τ ≤ λ is a result of continuity, from which observation the assertion follows.   □
This result suggests the following algorithm, called the Gauss-Newton algorithm, for the iterative solution of nonlinear least-squares problems. Beginning with an initial point x^(0), determine successive approximations x^(i), i = 1, 2, ..., as follows (a Python sketch is given after this list):
(1) For x^(i) compute a minimum point s^(i) of the linear least-squares problem
      min_s ‖r(x^(i)) − Df(x^(i)) s‖².
(2) Let φ(τ) := ‖y − f(x^(i) + τ s^(i))‖², and further, let k be the smallest integer k ≥ 0 with
      φ(2^{−k}) < φ(0).
(3) Define x^(i+1) := x^(i) + 2^{−k} s^(i).
In Section 5.4 we will study the convergence of algorithms which are closely related to the process described here.
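The following short Python sketch of the Gauss-Newton iteration with step halving is not part of the original text; NumPy, the model function and the data are assumptions chosen only for illustration:

    import numpy as np

    def gauss_newton(f, Df, y, x0, iterations=20):
        # In each step solve min_s ||r(x) - Df(x) s|| and halve the step length
        # until the residual sum of squares decreases, as in (1)-(3) above.
        x = np.asarray(x0, dtype=float)
        for _ in range(iterations):
            r = y - f(x)
            s = np.linalg.lstsq(Df(x), r, rcond=None)[0]
            phi0 = r @ r
            t = 1.0
            while np.sum((y - f(x + t * s)) ** 2) >= phi0 and t > 1e-16:
                t *= 0.5
            x = x + t * s
        return x

    # Hypothetical model y = exp(c1*z) + c2, data generated from c = (0.3, 1.0).
    z = np.linspace(0.0, 2.0, 11)
    y = np.exp(0.3 * z) + 1.0
    f  = lambda c: np.exp(c[0] * z) + c[1]
    Df = lambda c: np.column_stack([z * np.exp(c[0] * z), np.ones_like(z)])
    print(gauss_newton(f, Df, y, x0=[0.0, 0.0]))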
4.8.5 The Pseudoinverse of a Matrix
For any arbitrary (complex) m × n matrix A there is an n × m matrix A⁺, the so-called pseudoinverse (or Moore-Penrose inverse). It is associated with A in a natural fashion and agrees with the inverse A⁻¹ of A in case m = n and A is nonsingular.
Consider the range space R(A) and the null space N(A) of A,
  R(A) := { Ax ∈ ℂᵐ | x ∈ ℂⁿ },   N(A) := { x ∈ ℂⁿ | Ax = 0 },
together with their orthogonal complement spaces R(A)^⊥ ⊆ ℂᵐ, N(A)^⊥ ⊆ ℂⁿ. Further, let P be the n × n matrix which projects ℂⁿ onto N(A)^⊥, and let P̄ be the m × m matrix which projects ℂᵐ onto R(A):
  Px = 0 ⟺ x ∈ N(A),   P = P^H = P²,
  P̄y = y ⟺ y ∈ R(A),   P̄ = P̄^H = P̄².
For each y ∈ R(A) there is a uniquely determined x₁ ∈ N(A)^⊥ satisfying Ax₁ = y; i.e., there is a well-defined mapping f: R(A) → ℂⁿ with
  A f(y) = y,   f(y) ∈ N(A)^⊥   for all y ∈ R(A).
For, given y ∈ R(A), there is an x which satisfies y = Ax; hence y = A(Px + (I − P)x) = APx = Ax₁, where x₁ := Px ∈ N(A)^⊥, since (I − P)x ∈ N(A). Further, if x₁, x₂ ∈ N(A)^⊥, Ax₁ = Ax₂ = y, it follows that
  x₁ − x₂ ∈ N(A) ∩ N(A)^⊥ = {0},
which implies that x₁ = x₂. f is obviously linear.
The composite mapping f∘P̄: y ∈ ℂᵐ → f(P̄(y)) ∈ ℂⁿ is well defined and linear, since P̄y ∈ R(A); hence it is represented by an n × m matrix, which is precisely A⁺, the pseudoinverse of A: A⁺y = f(P̄(y)) for all y ∈ ℂᵐ. A⁺ has the following properties:
(4.8.5.1) Theorem. Let A be an m × n matrix. The pseudoinverse A⁺ is an n × m matrix satisfying:
(1) A⁺A = P is the orthogonal projector P: ℂⁿ → N(A)^⊥, and AA⁺ = P̄ is the orthogonal projector P̄: ℂᵐ → R(A).
(2) The following formulas hold:
   (a) A⁺A = (A⁺A)^H,  (b) AA⁺ = (AA⁺)^H,  (c) AA⁺A = A,  (d) A⁺AA⁺ = A⁺.
PROOF. According to the definition of A⁺,
  A⁺Ax = f(P̄(Ax)) = f(Ax) = Px   for all x,
so that A⁺A = P. Since P^H = P, (4.8.5.1.2a) is satisfied. Further, from the definition of f,
  AA⁺y = A(f(P̄y)) = P̄y
for all y ∈ ℂᵐ; hence AA⁺ = P̄ = P̄^H, and (4.8.5.1.2b) follows too. Finally, for all x ∈ ℂⁿ,
  (AA⁺)Ax = P̄Ax = Ax
according to the definition of P̄, and for all y ∈ ℂᵐ,
  (A⁺A)A⁺y = PA⁺y = A⁺y;
hence, (4.8.5.1.2c, 2d) hold.   □
The properties (2a)-(2d) of (4.8.5.1) uniquely characterize A⁺:
(4.8.5.2) Theorem. If Z is a matrix satisfying
  (a′) ZA = (ZA)^H,  (b′) AZ = (AZ)^H,  (c′) AZA = A,  (d′) ZAZ = Z,
then Z = A⁺.
PROOF. From (a)-(d) and (a′)-(d′) we have the following chain of equalities:
  Z = ZAZ = Z(AA⁺A)A⁺(AA⁺A)Z                        [from d′), c)]
    = (A^H Z^H A^H A⁺^H) A⁺ (A⁺^H A^H Z^H A^H)      [from a), a′), b), b′)]
    = (A^H A⁺^H) A⁺ (A⁺^H A^H)                      [from c′)]
    = (A⁺A) A⁺ (AA⁺)                                [from a), b)]
    = A⁺                                            [from d)].   □
We note the following
(4.8.5.3) Corollary. For all matrices A,
  A⁺⁺ = A,   (A^H)⁺ = (A⁺)^H.
This holds because Z := A [respectively Z := (A⁺)^H] has the properties of (A⁺)⁺ [respectively (A^H)⁺] in (4.8.5.2).
An elegant representation of the solution to the least-squares problem
  min_x ‖Ax − y‖₂
can be given with the aid of the pseudoinverse A⁺:
(4.8.5.4) Theorem. The vector x̄ := A⁺y satisfies:
(a) ‖Ax − y‖₂ ≥ ‖Ax̄ − y‖₂ for all x ∈ ℂⁿ.
(b) ‖Ax − y‖₂ = ‖Ax̄ − y‖₂ and x ≠ x̄ imply ‖x‖₂ > ‖x̄‖₂.
In other words, x̄ = A⁺y is that minimum point of the least-squares problem which has the smallest Euclidean norm, in the event that the problem does not have a unique minimum point.
PROOF. From (4.8.5.1), AA⁺ is the orthogonal projector on R(A); hence, for all x ∈ ℂⁿ it follows that
  Ax − y = u − v,   u := A(x − A⁺y) ∈ R(A),   v := (I − AA⁺)y = y − Ax̄ ∈ R(A)^⊥.
Consequently, for all x ∈ ℂⁿ,
  ‖Ax − y‖₂² = ‖u‖₂² + ‖v‖₂² ≥ ‖v‖₂² = ‖Ax̄ − y‖₂²,
and ‖Ax − y‖₂ = ‖Ax̄ − y‖₂ holds precisely if
  Ax = AA⁺y.
Now, A⁺A is the projector on N(A)^⊥. Therefore, for all x such that Ax = AA⁺y,
  x = u₁ + v₁,   u₁ := A⁺Ax = A⁺AA⁺y = A⁺y = x̄ ∈ N(A)^⊥,   v₁ := x − u₁ = x − x̄ ∈ N(A),
from which it follows that ‖x‖₂² > ‖x̄‖₂² for all x ∈ ℂⁿ satisfying x − x̄ ≠ 0 and ‖Ax − y‖₂ = ‖Ax̄ − y‖₂.   □
If the m × n matrix A with m ≥ n has maximal rank, rank A = n, then there is an explicit formula for A⁺: It is easily verified that the matrix Z := (A^H A)⁻¹ A^H has all properties given in Theorem (4.8.5.2) characterizing the pseudoinverse A⁺, so that
  A⁺ = (A^H A)⁻¹ A^H.
By means of the QR decomposition (4.7.7) of A, A = QR, this formula for A⁺ is equivalent to
  A⁺ = R⁻¹ Q^H.
This allows a numerically more stable computation of the pseudoinverse, A⁺ = R⁻¹Q^H.
If m < n and rank A = m, then because of (A⁺)^H = (A^H)⁺ the pseudoinverse A⁺ is given by
  A⁺ = Q (R^H)⁻¹
if the matrix A^H has the QR decomposition A^H = QR (4.7.7).
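A minimal Python sketch of these two full-rank formulas follows (not part of the original text; NumPy and the example matrix are assumptions). It is checked against NumPy's SVD-based pseudoinverse:

    import numpy as np

    def pseudoinverse_full_rank(A):
        # A^+ for a real m x n matrix of full rank via QR: for m >= n,
        # A = QR and A^+ = R^{-1} Q^T; for m < n apply this to A^T and transpose.
        m, n = A.shape
        if m >= n:
            Q, R = np.linalg.qr(A)
            return np.linalg.solve(R, Q.T)
        Q, R = np.linalg.qr(A.T)
        return np.linalg.solve(R, Q.T).T

    A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    print(np.allclose(pseudoinverse_full_rank(A), np.linalg.pinv(A)))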
For general m × n matrices A of arbitrary rank the pseudoinverse A⁺ can be computed by means of the singular value decomposition of A [see (6.4.11), (6.4.13)].
4.9 Modification Techniques for Matrix
Decompositions
Given any n × n matrix A, Gaussian elimination [see Section 4.1] produces an n × n upper triangular matrix R and a nonsingular n × n matrix F = G_{n−1}P_{n−1} ⋯ G₁P₁, a product of Frobenius matrices G_j and permutation matrices P_j, which satisfy
  F A = R.
Alternatively, the orthogonalization algorithms of Section 4.7 produce n × n unitary matrices P, Q and an upper triangular matrix R (different from the one above) for which
  P A = R   or   A = Q R
[compare with (4.7.7)]. These algorithms can also be applied to rectangular m × n matrices A, m ≥ n. Then they produce nonsingular m × m matrices F (alternatively m × m unitary matrices P, or m × n matrices Q with orthonormal columns) and n × n upper triangular matrices R satisfying
(4.9.1)   F A = [ R    } n
                  0 ]  } m − n,
or
(4.9.2a)   P A = [ R
                   0 ],
(4.9.2b)   A = Q R.
For m = n, these decompositions were seen to be useful in that they reduced the problem of solving the equation systems
(4.9.3)   Ax = y   or   A^T x = y
to a process of solving triangular systems and carrying out matrix multiplications. When m ≥ n, the orthogonal decompositions mentioned in (4.9.2) permit the solutions x̄ of linear least-squares problems
(4.9.4)   min_x ‖Ax − y‖₂
to be obtained in a similar manner.
Further, these techniques offer an efficient way of obtaining the solutions to linear-equation problems (4.9.3) or to least-squares problems (4.9.4) in which a single matrix A is given with a number of right-hand sides y.
Frequently, a problem involving a matrix A is given and solved, following which a "simple" change A → Ā is made. Clearly it would be desirable to determine a decomposition (4.9.1) or (4.9.2) of Ā, starting with the corresponding decomposition of A, in some less expensive way than by using the algorithms of Sections 4.1 or 4.7. For certain simple changes this is possible.
We will consider:
(1) the change of a row or column of A,
(2) the deletion of a column of A,
(3) the addition of a column to A,
(4) the addition of a row to A,
(5) the deletion of a row of A,
where A is an m × n matrix, m ≥ n. [See Gill, Golub, Murray, and Saunders (1974) and Daniel, Gragg, Kaufman, and Stewart (1976) for a more detailed description of modification techniques for matrix decompositions.]
Our principal tool for devising modification techniques will be certain simple elimination matrices E_ij. These are nonsingular matrices of order m which differ from the identity matrix only in columns i and j and have the following form:
  E_ij = [ 1
             ⋱
               a  ⋯  b        ← row i
                  1
                    ⋱
               c  ⋯  d        ← row j
                        ⋱
                           1 ].
A matrix multiplication y = E_ij x changes only the components x_i and x_j of the vector x = (x₁, ..., x_m)^T ∈ ℝᵐ:
  y_i = a x_i + b x_j,   y_j = c x_i + d x_j,   y_k = x_k  for k ≠ i, j.
Given a vector x and indices i ≠ j, the 2 × 2 matrix
  Ê := [ a  b
         c  d ],
and thereby E_ij, can be chosen in several ways so that the jth component y_j of the result y = E_ij x vanishes and so that E_ij is nonsingular. For numerical reasons Ê should also be chosen so that the condition number cond(E_ij) is not too large. The simplest possibility, which finds application in decompositions of the type (4.9.1), is to construct Ê as a Gaussian elimination of a certain type [see (4.1.8)]:
(4.9.5)
  Ê = [ 1  0           if x_j = 0,
        0  1 ]
  Ê = [ 1        0     if |x_i| ≥ |x_j| > 0,
        −x_j/x_i 1 ]
  Ê = [ 0  1           if |x_i| < |x_j|.
        1  −x_i/x_j ]
In the case of orthogonal decompositions (4.9.2) we choose Ê, and thereby E_ij, to be the unitary matrices known as Givens matrices. One possible form such a matrix may take is given by the following:
(4.9.6)   Ê = [ c   s
                s  −c ],   c := cos φ,   s := sin φ.
Ê and E_ij are Hermitian and unitary, and satisfy det(Ê) = −1. Since Ê is unitary, it follows immediately from the condition
  Ê [ x_i      [ k
      x_j ]  =   0 ]
that y_i = k = ±√(x_i² + x_j²), which can be satisfied by
  c := 1,   s := 0   if x_i = x_j = 0,
or alternatively, if μ := max{|x_i|, |x_j|} > 0, by
  c := x_i/k,   s := x_j/k.
|k| is to be computed as
  |k| := μ √( (x_i/μ)² + (x_j/μ)² ),
and the sign of k is, for the moment, arbitrary. The computation of |k| in this form avoids problems which would arise from exponent overflow or underflow given extreme values for the components x_i and x_j. On numerical grounds the sign of k will be chosen according to
  k = |k| sign(x_i),   sign(x_i) := 1 if x_i ≥ 0, −1 if x_i < 0,
so that the computation of the auxiliary value
  ν := s/(1 + c)
can take place without cancellation. Using ν, the significant components z_i, z_j of the transformation z := E_ij u of the vector u ∈ ℝᵐ can be computed somewhat more efficiently (in which a multiplication is replaced by an addition) as follows:
  z_i := c u_i + s u_j,
  z_j := ν (u_i + z_i) − u_j.
The type of matrix Ê shown in (4.9.6), together with its associated E_ij, is known as a Givens reflection. We can just as easily use a matrix Ê of the form
  Ê = [  c  s
        −s  c ],   c = cos φ,   s = sin φ,
which, together with its associated E_ij, is known as a Givens rotation. In this case Ê and E_ij are orthogonal matrices which describe a rotation of ℝᵐ about the origin through the angle φ in the (i, j) plane; det(Ê) = 1. We will present the following material, however, only in terms of Givens reflections.
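A minimal Python sketch of the construction of a Givens reflection as described above follows (not part of the original text; NumPy and the sample components are assumptions):

    import numpy as np

    def givens_reflection(xi, xj):
        # c, s, k of the reflection (4.9.6) mapping (xi, xj) to (k, 0); |k| is
        # formed via mu = max(|xi|, |xj|) to avoid overflow/underflow, and the
        # sign of k follows sign(xi) so that nu = s/(1+c) has no cancellation.
        mu = max(abs(xi), abs(xj))
        if mu == 0.0:
            return 1.0, 0.0, 0.0
        k = mu * np.sqrt((xi / mu) ** 2 + (xj / mu) ** 2)
        if xi < 0.0:
            k = -k
        return xi / k, xj / k, k

    c, s, k = givens_reflection(3.0, 4.0)
    E = np.array([[c, s], [s, -c]])
    print(E @ np.array([3.0, 4.0]), k)      # approximately (k, 0) with k = 5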
Since modification techniques for the decomposition (4.9.1) differ from those for (4.9.2a) only in that, in place of Givens matrices (4.9.6), corresponding elimination matrices Ê of type (4.9.5) are used, we will only study the orthogonal factorization (4.9.2). The techniques for (4.9.2b) are similar to, but more complicated than, those for (4.9.2a). Consequently, for simplicity's sake, we will restrict our attention to (4.9.2a). The corresponding discussion for (4.9.2b) can be found in Daniel, Gragg, Kaufman, and Stewart (1976).
In the following let A be a real m × n matrix with m ≥ n, and let
  P A = [ R
          0 ]
be a factorization of type (4.9.2a).
(1) Change of a row or column, or more generally, the change of A to Ā := A + v u^T (rank-one modification of A), where v ∈ ℝᵐ, u ∈ ℝⁿ are given vectors. From (4.9.2a) we have
(4.9.7)   P Ā = [ R
                  0 ] + w u^T,   w := P v.
In the first step of the modification we annihilate the successive components m, m−1, ..., 2 of the vector w using appropriate Givens matrices G_{m−1,m}, G_{m−2,m−1}, ..., G_{12}, so that
  G_{12} G_{23} ⋯ G_{m−1,m} w = k e₁,   k = ±‖w‖ = ±‖v‖.
The sketch below in the case m = 4 should clarify the effect of carrying out the successive transformations G_{i,i+1}. We denote by * here, and in the further sketches of this section, those elements which are changed by a transformation, other than the one set to zero:
  w = [ x      [ x      [ x      [ k
        x   →    x   →    *   →    0
        x        *        0        0
        x ]      0 ]      0 ]      0 ].
If (4.9.7) is multiplied on the left as well by G_{m−1,m}, ..., G_{12} in order, we obtain
(4.9.8)   P̃ Ā = R′ + k e₁ u^T,
where
  R′ := G [ R
            0 ],   P̃ := G P,   G := G_{12} G_{23} ⋯ G_{m−1,m}.
Here P̃ is unitary, as are G and P; the upper triangular matrix [R; 0] is changed step by step into an upper Hessenberg matrix R′, that is, a matrix with (R′)_ik = 0 for i > k + 1:
  m = 4, n = 3:
  [ x x x      [ x x x      [ x x x      [ * * *
    0 x x   →    0 x x   →    * * *   →    * x x
    0 0 x        * * *        0 * x        0 x x
    0 0 0 ]      0 0 * ]      0 0 x ]      0 0 x ]   = G [ R; 0 ] = R′.
R̃ := R′ + k e₁ u^T in (4.9.8) is also an upper Hessenberg matrix, since the addition changes only the first row of R′. In the second step of the
252
4 Systems of Linear Equations
modification we annihilate the subdiagonal elements (R)i+Li, i = 1, 2, ... ,
R by means of further Givens matrices H 12 , H 23 ,
... , H/",/"+1, so that [see (4.9.8)]
/1 := min(m -1, n -1) of
--
- [R]
0 '
HPA=HR=:
where R is, again, an n x n upper triangular matrix and P := H P is an
m x m unitary matrix. A decomposition of A of the type (4.9.2a) is provided
once again:
- - [R]
0 .
PA=
The sequence of transformations R --+ H12R --+ H 23 (H 12 R) --+ ... --+ HR
will be sketched for m = 4, n = 3:
R
~ [~ ~ ~l ~ [~ ~ ~l ~ [~ ~ ~l ~ [~ ~ ~l
=
[~] .
(2) Deletion of a column. If A is obained from A by the deletion of the
kth column, then from (4.9.2a) the matrix R := pA is an upper Hessenberg
matrix of the following form (sketched for m = 4, n = 4, k = 2):
R:=pA=
[
X
X~ x~xl
The sub diagonal elements of R are to be annihilated as described before,
using Givens matrices
The decomposition of A is
P:=HP,
H := Hn-1,nHn-2.n-l··· Hk,k+l.
(3) Addition of a column. If A = [A, a), a E IRTn , and A is an m x n
matrix with m > n, then (4.9.2a) implies
4.9 Modification Techniques for Matrix Decompositions
x
x
x
x
x
x
o
253
x
The subdiagonal elements of R, which appear as an "spike" in the last column, can be annihilated by means of a single Householder transformation
H from (4.7.5):
x
x x
x x
HR=
*
o
o
[~] .
o
P:= HP and R are components of the decomposition (4.9.2a) of A:
- - [R]
0 .
PA=
(4) Addition of a row. If
then there is a permutation matrix II of order m + 1 which satisfies
IIA = [~] .
The unitary matrix of order m + 1,
p.-
[~ ~]II
satisfies, according to (4.9.2a),
PA=
[~ ~] [~] [;:]
[~l
=:R=
:r
:z:
x
x
0
x
o
254
4 Systems of Linear Equations
R is an upper Hessenberg matrix whose sub diagonal elements can be annihilated using Givens matrices H 12 , ... , Hn,n+l:
[~] = HR,
as described before, to produce the decompositon
- - [R]
0 .
PA=
(5) Deletion of a row: Let A be an m x n matrix with m > n. We
assume without loss of generality (see the use of the permutation above)
that the last row aT of A is to be dropped:
We partition the matrix
P = [P,p],
accordingly and obtain, from (4.9.2a),
(4.9.9)
We choose Givens matrices Hm,m-l, H m ,m-2, ... , Hml to annihilate successive components m - 1, m - 2, ... , 1 of p: Hml Hm2 ... Hm,m-lP
(0, ... ,0,11')T. A sketch for m = 4 is
(4.9.10)
Now, P is unitary, so Ilpll = 111'1 = 1. Therefore, the transformed matrix
HP, H := Hml H m2 ··· Hm,m-l, has the form
(4.9.11)
HP=[~ ~]=[~ ~],
11' = ±1,
since 111'1 = 1 implies q = 0 because ofthe unitarity of H P. Consequently P
is a unitary matrix of order m - 1. On the other hand, the Hmi transform
the upper triangular matrix [~], with the exception of its last row, into
an upper triangular matrix
4.9 Modification Techniques for Matrix Decompositions
255
Sketching this for m = 4, n = 3, we have
[~l ~ [~ j ~] ~ [~ i ~] ~ [! ~ ~] ~ [~ ~ ;]
=
[zi}]·
From (4.9.9), (4.9.11) it follows that
and therefore
PA= [ RO-]
for the (m -I)-order unitary matrix P and the upper triangular matrix R
of order n. This has produced a decomposition for A of the form (4.9.2a).
It can be shown that the techniques of this section are numerically
stable in the following sense. Let P and R be given matrices with the
following property: There exists an exact unitary matrix P' and a matrix
A' such that
P'A' = [~]
is a decomposition of A' in the sense of (4.9.2a) and the differences IIP-P'II,
IIA - A'II are "small". Then the methods of this section, applied to P, R
in floating-point arithmetic with relative machine precision eps, produce
matrices P, R to which are associated an exact unitary matrix P' and a
matrix A' satisfying:
(a)
liP - P'II, IIA - A'II are "small", and
(b) P' A' = [~] is a decomposition of A' in the sense of (4.9.2a).
By "small" we mean that the differences above satisfy
IILlPIl = O(mCi eps)
or
where 0:' is small (for example 0:' = ~).
IILlAIl = O(mCi eps IIAII),
256
4 Systems of Linear Equations
4.10 The Simplex Method
Linear algebraic techniques can be applied advantageously, in the context of
the simplex method, to solve linear programming problems. These problems
arise frequently in practice, particularly in the areas of economic planning
and management science. Linear programming is also the means by which a
number of important discrete approximation problems (e.g. data fitting in
the Ll and Loo norms) can be solved. At this introductory level of treatment
we can cover only the most basic aspects of linear programming; for a more
thorough treatment, the reader is referred to the special literature on this
subject [e.g. Dantzig (1963), Gass (1969), Hadley (1962), Murty (1976), or
Schrijver (1986)].
A general linear programming problem (or linear program) has the
following form:
(4.10.1)
with respect to all x E lRn which satisfy finitely many constraints of the
form
(4.10.2)
+ ai2 x 2 + ... + ainxn ::; bi ,
ailXI + ai2X2 + ... + ainXn = bi ,
ai1 x l
i=1,2, ... ,ml,
i = ml + 1, ml + 2, ... ,m.
The numbers Ck, aik, bi are given real numbers. The function c T x to be
minimized is called the objective function. Each x E lRn which satisfies all
of the conditions (4.10.2) is said to be a feasible point for the problem.
By introducing additional variables and equations, the linear programming
problem (4.10.1), (4.10.2) can be put in a form in which the only constraints
which appear are equalities or elementary inequalities (inequalities, for example, of the form Xi 2:: 0). It is useful, for various reasons, to require that
the objective function cT x have the form cT x == -xp. In order to bring a
linear programming problem to this form, we replace each nonelementary
inequality (4.10.2)
by an equality and an elementary inequality using a slack variable Xn+i,
If the objective function CIXI + ... + CnXn is not elementary, we introduce
an additional variable x n +m1 +1 and include an additional equation
among the constraints (4.10.2). The minimization of cT x is equivalent to
the maximization of x n +m1 +1 under this extended saystem of constraints.
4.10 The Simplex Method
257
Hence we can assume, without loss of generality, that the linear progamming problem is already given in the following standard form:
LP(I,p) :
maximize
(4.10.3)
x
xp
Ax = b,
E lR n :
Xi
2: 0 for i E I.
In this formulation, I ~ N := {I, 2, ... , n} is a (possibly empty) index set,
p is a fixed index satisfying p E N\I, A = (aI, a2, ... , an) is a real m x n
matrix having columns ai, and bE lRm is a given vector. The variables Xi
for which i E I are restricted variables, while those for which i tJ. I are the
free variables. By
P := {x E lR n I Ax = b & Xi 2: 0 for all i Eo I}
we denote the set of all feasible points of LP(I,p), x* E P is an optimum
point of LP(I,p), if x; = max{x p I x E P}.
As an illustration:
minimize
-Xl -2X2
X:
-Xl +X2
S2
S4
Xl 2: 0, X2 2: o.
Xl +X2
After the introduction of X3, X4 as slack variables and Xs as an objective function
variable, the following standard form is obtained, p = 5, I = {I, 2, 3, 4}:
maximize
x:
Xs
= 2
-Xl +X2+X3
XI+X2
-Xl -X2
Xi
=4
+X4
+Xs= 0
2: 0 for i S 4.
This can be shown graphically in lR? The set P (shaded in Figure 6) is a polygon.
(In higher dimensions P would be a polyhedron.)
We begin by considering the linear equation system Ax = b of LP(I,p).
For any index vector J = (jl, ... ,jr), ji EN, we let AJ:= (ajll ... ,ajJ
denote the submatrix of A having columns aji; XJ denotes the vector
(xJI , ... , X jr) T. For simplicity we will denote the set
{ji I i = 1,2, ... ,r}
of the components of J by J also, and we will write p E J if there exists
an i withp =ii-
258
4 Systems of Linear Equations
Xl
=0
c
"-
,
~-------.',
X3
=0
X2
E
,
~p~""'" I
B
=0
A
, ,= 0
,
",X5
1
X4
1
= 0
D
Fig. 6. Feasible region and objective function.
(4.10.4) Definition. An index vector J = (jl, ... ,jm) of m distinct indices ji E N is called a basis of Ax = b [and of LP (I,p)] if AJ is nonsingular. AJ is also referred to as a basis; the variables Xi for i E J are
referred to as basic variables, while the remaining variables Xk, k (j. J, are
the nonbasic variables. If K = (kl' ... ,kn - m ) is an index vector containing
the nonbasic indices, we will use the notation J EB K = N.
In the above example JA := (3,4,5), JB := (4,5,2) are bases.
Associated with any basis J, J EB K
= N, there is a unique solution
x = x(J) of Ax = b, called the basic solution, with XK = O. Since
X is given by
(4.10.5)
Moreover, given a basis J, each solution x of Ax = b is uniquely determined
by its nonbasic segment XK and the basic solution x. This follows from the
multiplication Ax = AJxJ + AKxK = b by A J1 and from (4.10.5):
4.10 The Simplex Method
259
(4.10.6)
If XK E IR n - m is chosen arbitrarily, and if XJ (and thereby x) is defined
by (4.10.6), then x is a solution of Ax = b. Therefore, (4.10.6) provides a
parametrization of the solution set {x I Ax = b} through the components
of XK E IR n - m .
If the basic solution x of Ax = b associated with the basis J is a feasible
point of LP(I,p), x E P, that is, if
(4.10.7)
Xi ?: 0 for all i E In J,
then J is a feasible basis of LP (I, p) and X is a basic feasible solution.
Finally, a feasible basis is said to be nondegenerate if
(4.10.8)
Xi > 0
for all i E In J.
The linear programming problem as a whole is nondegenerate if all of its
feasible bases are nondegenerate.
Geometrically, the basic feasible solutions given by the various bases of
LP(I,p) correspond to the vertices of the polyhedron P of feasible points, assuming that P does have vertices. In the above example (see Figure 6) the ver-
tex A E P belongs to the feasible basis JA = (3,4,5), since A is determined
by Xl = X2 = 0 and {1,2} is in the complementary set corresponding to JA
in N = {1,2,3,4,5}; B is associated with J B = (4,5,2), C is associated with
Jc = (1,2,5), etc. The basis JE = (1,4,5) is not feasible, since the associated
basic solution E is not a feasible point (E f/. P).
The simplex method for solving linear programming problems, due to
G. B. Dantzig, is a process which begins with a feasible basis Jo of LP(I,p)
satisfying p E J o and recursively generates a sequence {Ji } of feasible bases
Ji of LP(I,p) with p E Ji for each i by means of simplex steps
These steps are designed to ensure that the objective function values x(Ji)p
corresponding to the bases J i are nondecreasing:
In fact, if all of the bases encountered are nondegenerate, this sequence
is strictly increasing, and if LP(I,p) does have an optimum, then the
sequence {J;} terminates after finitely many steps in a basis h.I whose
basic solution x(h.d is an optimum point for LP(I,p) and is such that
x(Ji)p < X(Ji+l)p for all 0 ~ i ~ ]vI - 1. Furthermore, each two successive
bases J = (jl, ... ,jm) and j = (]1, ... ,1m) are neighboring, that is, J and
j have exactly m - 1 components in common. This means that j may be
260
4 Systems of Linear Equations
obtained from J through an index exchange: there are precisely two indices
q, sEN satisfying q E J, s ~ J and q ~ J, s E J, thus J = (JU {s})\{q}.
For nondegenerate problems, neighboring feasible bases correspond geometrically to neighboring vertices of P. In the example above (Figure 6), JA = (3,4,5)
and JB = (4,5,2) are neighboring bases, and A and B are neighboring vertices.
The simplex method, or more precisely, "phase two" of the simplex
method, assumes that a feasible basis J of LP(I,p) satisfying p E J is
already available. Given that LP(I,p) does have feasible bases, one such
basis can be found using a variant known as "phase one" of the simplex
method. We will begin by describing a typical step of phase two of the simplex method, which leads from a feasible basis J to a neighboring feasible
basis J of LP(I, p):
(4.10.9) Simplex Step. Requirements: J = (j1, ... ,jm) is assumed to be
a feasible basis of LP(I,p) having p = jt E J, J ffi K = N.
(1) Compute the vector
b:= A:J1b
which gives the basic feasible solution x corresponding to J, where
xJ := b,
(2) Compute the 'rOw vector
XK:= O.
TA-J1'
Jr := e t
where et = (0, ... ,1, ... , O)T E lRm is the t th coordinate vector for
lRm. Determine the values
kEK,
using Jr.
(3) Determine whether
(4.10.10)
and
Ck
2:: 0 for all k E K n I
Ck
= 0 for all k E K\I.
(a) Ifso, then stop. The basic feasible solution x is optimal for LP(I,p).
(b) If not, determine an s E K such that
Cs
< 0, s E K n I
or
Icsl # 0, s E K\I,
and set ()" := - sign C s .
(4) Calculate the vector
(-)T := A-1
a:=
aI, a2,···,
am
J as·
4.10 The Simplex Method
261
(5) If
(4.10.11)
(]"(5:i =::: 0
for all i with ji E I,
stop: LP(I,p) has no finite optimum. Otherwise,
(6) Determine an index r with jr E I, an r > 0, and
br = mIn
bi
.
{ -_-_an r
ani
I· Ji. E I & - >
ani
Z:
0} .
(7) Take as j any suitable index vector with
j:= (JU{s})\{jr},
for example
or
We wish to justify these rules. Let us assume that J =, (jl, ... ,jm) is a
feasible basis for LP(I, p) with p = jt E J, J ffi K = N. Rule (1) of (4.10.9)
produces the associated basic feasible solution x = x(J), as is implied by
(4.10.5). Since all solutions of Ax = b can be represented in the form given
by (4.10.6), the objective function can be written as
T
-
xp = et XJ = xp -
TA-1A
J
KXK
et
= xp -7rAK xK
(4.10.12)
This uses the fact that p = jt, and it uses the definitions of the vector 7r
and of the components CK, Ck E IR n - m , as given in rule (2). CK is known
as the vector of reduced costs. As a result of (4.10.12), k E K, gives the
amount by which the objective function xp changes as X·k is changed by
one unit. If the condition (4.10.10) is satisfied [see rule (3}J, it follows from
(4.10.12) for each feasible point x of LP(I,p), since Xi ~ (I for i E I, that
Xp
= xp -
L
CkXk =::: xp;
kEKnI
that is, the basic feasible solution x is an optimum point for LP(I,p). This
motivates the test (4.10.10) and statement (a) of rule (3). If (4.10.10) is
not satisfied, there is an index s E K for which either
(4.10.13)
or else
Cs
< 0,
s E KnI,
262
4 Systems of Linear Equations
s E K\I,
(4.10.14)
holds. Let s be such an index. We set a := - signcs . Because (4.10.12)
implies that an increase ofax s leads to an increase of the objective function
x p , we consider the following family of vectors x(B) E lRn, B E lR:
-
1-
x(B)J := b - BaA:J as = b - Baa,
(4.10.15)
x(B)s := Ba,
x(Bh:=O
forkEK,ki=s.
Here a:= A:Jla s is defined as in rule (4) of (4.10.9).
In our example I = {I, 2, 3, 4} and J o = JA = (3,4,5) is a feasible basis;
Ko = (1,2), p = 5 E Jo, to = 3. (4.10.9) produces, starting form Jo,
x(Jo) = (0,0,2,4,0)T (which is point A in Figure 6), and 7rAJo = eia =} 7r =
(0,0,1). The reduced costs are CI = 7ral = -1, C2 = 7ra2 = -2. This implies that
Jo is not optimal. If we choose the index s = 2 for (4.10.9), rule (3)(b), then
-
a =
A-I
Jo a2 =
[J].
The family of vectors x(e) is given by
x(e) = (O,e, 2 - e,4 - e,2ef.
°
Geometrically x(e), e 2: describes a semi-infinite ray in Figure 6 which extends
along the edge of the polyhedron P from vertex A (e = 0) in the direction of the
neighboring vertex B (e = 2).
From (4.10.6) it follows that Ax(B) = b for all () E lR. In particular
(4.10.12) implies, from the fact that i; = x(O) and as a result of the choice
of a, that
(4.10.16)
so that the objective function is a strictly increasing function of () on the
family x( (}). It is reasonable to select the best feasible solution of Ax = b
from among the x((}): that is, we wish to find the largest B 2:: 0 satisfying
X((})l 2:: 0 for alII E I.
From (4.10.15), this is equivalent to finding the largest () 2:: 0 such that
(4.10.17)
4.10 The Simplex Method
263
because X(O)k 2 0 is automatically satisfied, for all k E K n I and 0 2 0,
because of (4.10.15). If aiii sO for all ji E J [see rule (5) of (4.10.9)]' then
x(O) is a feasible point for all 0 2 0, and because of (4.10.17) sup{x(O)p I
o 2 O} = +00: LP (1, p) does not have a finite optimum. This justifies rule
(5) of (4.10.9). Otherwise there is a largest 0 =: Bfor which (4.10.17) holds:
B=
b~ = min { aai
b~
aa r
Ii:
ji E I & aiii >
o'.} .
This determines an index r with jr E I, aii r > 0 and
(4.10.18)
x(B)jr = br - Baii r = 0,
x(B) feasible.
In the example
xCi}) = (0,2,0,2, 4)T corresponds to vertex B of Figure 6.
From the feasibility of J, B20, and it follows from (4.10.16) that
x(B)p 2: xp.
If J is nondegenerate, as defined in (4.10.8), then B> 0 holds and further,
x(B)p > xp.
From (4.10.6), (4.10.15), and (4.10.18), x = x(O) is the uniquely determined
solution of Ax = b having the additional property
Xk = 0 for k E K, k -=1= s,
Xjr = 0,
that is, x k = 0, k := (K U {jr})\ {s}. From the uniqueness of x it follows
that A j , J := (J U {s})\{jr}, is nonsingular; x(O) = x(J) is, therefore,
the basic solution associated with the neighboring feasible basis J, and we
have
(4.10.19)
x(J)p > x(J)p,
if J is nondegenerate,
x(J)p 2: x(J)p,
otherwise.
In our example we obtain the new basis J 1 = (2,4,5) = Js, Kl = (1,3),
which corresponds to the vertex B of Figure 6. With respect to the objective
function X5, B is better than A: x(Jsh = 4 > x(JAh = o.
264
4 Systems of Linear Equations
Since the definition of r implies that iT E I always holds, it follows that
J\I <;;:: J\I;
that is, in the transition J ---+ J only a nonnegatively constrained variable
Xjr' ir E I, leaves the basis; as soon as a free variable x s , s (j. I, becomes
a basis variable, it remains in the basis throughout all subsequent simplex
steps. In particular, p E J, because p E J and p (j. I. Hence, the new
basis J also satisfies the requirements of (4.10.9), so that rules (1)-(7) can
be applied to J in turn. Thus, beginning with a first feasible basis Jo of
LP(I,p) with p E J o, we obtain a sequence
J o ---+ J 1 ---+ h ---+ ...
of feasible bases J i of LP(I,p) with p E J i , for which
x(Jo)p < x(Jdp < x(J2 )p < ...
in case all of the J i are nondegenerate. Since there are only finitely many
index vectors J, and since no J i can be repeated if the above chain of
inequalities holds, the simplex method must terminate after finitely many
steps. Thus, we have proven the following for the simplex method:
(4.10.20) Theorem. Let J o be a feasible basis for LP(I,p) with p E Jo.
If LP(I,p) is nondegenerate, then the simplex method generates a finite
sequence of feasible base J i for LP(I,p) with p E J i which begin with J o
and for which x(Ji)p < x(Ji+1)p. Either the final basic solution is optimal
for LP(I,p), or LP(I,p) has no finite optimum.
We continue with our example: as a result of the first simplex step we have
a new feasible basis J1 = (2,4,5) = JB, Kl = (1,3), tl = 3, so that
AJ1=[j
~ ~l' b=[~l' x(h)=(0,2,0,2,4)T(=B),
7rAJl = e~
'*
7r =
(2,0,1).
The reduced costs are Cl = 7ral = -3, C3 = 7ra3 = 2, hence J 1 is not optimal:
s = 1, a = AJ11 a1
'* a = [ =1 1'* r = 2.
Therefore
h = (2,1,5) = Ja,
K2 = (3,4),
x(J2 ) = (1,3,0,0,7) (= C),
t2 = 3,
4.10 The Simplex Method
265
The reduced costs are C3 = 1ra3 = ~ > 0, C4 = ?Ta4 = ~ > O.
The optimality criterion is satisfied, so x(h) is optimal; i.e., Xl = 1, X2 = 3,
X3 = 0, X4 = 0, X5 = 7. The optimal value of the objective function X5 is X5 = 7.
We observe that the important quantities of each simplex step - b, ?T,
and a - are determined by the following systems of linear equations:
AJb
(4.10.21)
b
=? b
[(4.10.9), rule (1)],
1T
[(4.10.9), rule (2)],
=? a
[(4.10.9), rule (4)J.
e; =?
AJa = as
The computational effort required to solve these systems for successive
bases J -+ j -+ .. , can be significantly reduced if it is noted that the
successive bases are neighboring: each new basis matrix A] is obtained
from its predecessor AJ by the exchange of one column of AJ for one
column of A K . Suppose, for example, a decomposition of the basis matrix
AJ of the form (4.9.1), F AJ = R, F nonsingular, R upper triangular, is
used [see Section 4.9J. On the one hand, such a decomposition can be used
to solve the equation systems (4.10.21) in an efficient manner:
RT Z = et =? z =? 1T = zT F,
Ra = Fa s =? a.
On the other hand, the techniques of Section 4.9 can be used on the decomposition F AJ = R of AJ in each simplex step to obtain the corresponding
decomposition F AJ = R of the neighboring basis
[compare (4.10.9), rule (7)J. The matrix FA], with this choice of index
vector J, is an upper Hessenberg matrix of the form depicted here for
m = 4, r = 2:
[
X
~x ~x ~lx
x
=: R'.
x
The subdiagonal elements can easily be eliminated using transformations
of the form (4.9.5), Er,r+l, E r+1,r+2, ... , Em-I,m, which will change R'
into an upper triangular matrix R:
FA J- = R, F:= EF, R:= ER', E:= E m- l mEm-2m-l'" E rr+l
.
,
,
)
,
Thus it is easy to implement the simplex method in a practical fashion by
taking a quadruple M = {J; t; F, R} with the property that
266
4 Systems of Linear Equations
jt =p,
and changing it at each simplex step J --+ j into an analogous quadruple M = U;l; P, R}. To begin this variant of the simplex method, it is
necessary to find a factorization FoAJo = Ro for AJo of the form given
in (4.9.1) as well as finding a feasible basis J o of LP(I,p) with p E J o.
ALGOL programs for such an implementation of the simplex method are to
be found in Wilkinson and Reinsch (1971), and a roundoff investigation is
to be found in Bartels ((1971).
[In practice, particularly for large problems in which the solution of
anyone of the systems (4.10.21) takes a significant amount of time, the
vector XJ = A:J1b = b of one simplex step is often updated to the vector
Xj := Aj 1 b = x(O)j of the next step by using (4.10.15) with the value
0= br/aii r as given by (4.10.9), rule (6). The chance of incurring errors,
particularly in the selection of the index r, should be borne in mind if this
is done.]
The more usual implementations of the simplex method use other quantities than the decomposition F AJ = R of (4.9.1) in order to solve the equation systems (4.10.21) in an efficient manner. The "basis inverse method"
uses a quintuple of the form
M = {J;t;B,b,1T}
with
jt = p,
B:= A:Jl,
b = A:J1b,
1T:=
ei A:J1.
Another variant uses the quintuple
M = {J;t;.4,b,1T}
with
jt = p,
-
-1
b:= A J b,
1
1T := eTAJ'
t
JffiK = N.
In the simplex step J --+ j efficiency can be gained by observing that
the inverse Aj1 is obtainable from A:J1 through multiplication by an ap-
propriate Frobenius matrix G (see Exercise 4 of Chapter 4): A j1 = GA:J1.
In this manner somewhat more computational effort can be saved than
when the decomposition F AJ = R is used. The disadvantage of updating
the inverse of the basis matrix using Frobenius transformations, however,
lies in its possible numerical instability: it may happen that a Frobenius
matrix which is used in this inverse updating process is ill conditioned,
particularly if the basis matrix itself is ill conditioned. In this case, errors can appear in A:J/, A:J,l AKi for Mi , M i , and they will be propagated
throughout all quintuples M i , M i , j > i. When the factorization F AJ = R
is used, however, it is always possible to update the decomposition with
4.10 The Simplex Method
267
well-conditioned transformations, so that error propagation is not likely to
occur.
The following numerical example is typical of the gain in numerical stability
which one obtains if the triangular factorization of the basis (4.9.1) is used to
implement the simplex method rather than the basis inverse method. Consider a
linear programming problem with constraints of the form
(4.10.22)
Ax=b,
x 2 o.
A = (Al,A2),
The 5 x 10 matrix A is composed of two 5 x 5 submatrices AI, A2:
Al = (al i k),
al i k:= 1/(i + k),
i, k = 1, ... ,5,
A2 : = Is = identity matrix of order 5.
Al is very ill conditioned, and A2 is very well conditioned. The vector
5
bi :=
L i~k'
k=l
that is,
b:= AI· e,
where e = (1,1,1,1,
II E lR5 ,
is chosen as a right-hand side. Hence, the bases J 1 := (1,2, <I, 4, 5) and J 2 :=
(6,7,8,9,10) are both feasible for (4.10.22); the corresponding basic solutions
are
x(JI):= [b~] ,
(4.10.23)
x(h):= [b~] ,
We choose h = (6,7,8,9,10) as a starting basis and transform it, using the basis
inverse method on the one hand and triangular decompositions on the other, via
a sequence of single column exchanges, into the basis Jl. From there another
sequence of single column exchanges is used to return to h:
h -+ ... -+ Jl -+ ... -+ h.
The resulting basic solutions (4.10.23), produced on a machine having relative
precision eps "" 1O- 11 , are shown in the following table (inaccurate digits are
under lined) .
4 Systems of Linear Equations
268
Basis
exact
basic solution
Basis
inverse
Triangular
decomposition
J2
1.4500000000 10 0 1.4500000000 10 0
1.4500000000 10 0
1.092857142810 0 1.092857142810 0
1.092857142810 0
b2 = 8.845238095210-1 = 8.845238095210-1 = 8.845238095210 0
7.456349206310 -1
7.456349206310 -1
7.456349206310 -1
6.456349206310 -1 6.456349206310 -1
6.456349206310-1
J1
1
1
b1 = 1
1
1
1.000000018210 0
9.9999984079 10 -1
1.0000004372 10 0
9.9999952118 10 -1
1.0000001826 10 0
1. 000000078610 0
9.9999916035 10 -1
1.0000027212 10 0
9.9999956491 10 -1
1.0000014837 10 0
h
1.450000000010 0
1.092857142810 0
b2 = 8.845238095210-1
7.456349206310-1
6.456349206310 -1
1.4500010511 10 0
1.0928579972 10 0
8.8452453057 10 -1
7.4563554473 10 -1
6.456354 7103 10 -1
1.4500000000 10 0
1.092857142110 0
8.845238095Q10 -1
7 .456349206Q10-1
6.4563492059 10 -1
The following is to be observed: Since AJ, = h, both computational methods yield the exact solution at the start. For the basis J 1 , both methods give
equally inexact results, which reflects the ill-conditioning of AJ" No computational method could produce better results than these without resorting to
higher-precision arithmetic. After passing through this ill-conditioned basis AJ"
however, the situation changes radically in favor of the triangular decomposition
method. This method yields, once again, the basic solution corresponding to h
essentially to full machine accuracy. The basis inverse method, in contrast, produces a basic solution for h with the same inaccuracy as it did for J 1 . With the
basis inverse method, the effect of ill-conditioning encountered while processing
one basis matrix AJ is felt throughout all further bases; this in not the case using
triangular factorization.
4.11 Phase One of the Simplex Method
In order to start phase two of the simplex method, we require a feasible basis
J o of LP(I,p) with p = jto E J o; alternatively, we must find a quadruple
Mo = {Jo; to; Fo, Ro} in which a nonsingular matrix Fo and a nonsingular
triangular matrix Ro form a decomposition FoAJo = Ro as in (4.9.1) of the
basis matrix AJo '
In some special cases, finding J o (Mo) presents no problem, e.g. if
LP(I,p) results from a linear programming problem of the special form
4.11 Phase One of the Simplex Method
269
minimize
subject to
ailxl + ... + ainXn ~ bi ,
Xi ~ 0
i = 1,2, ... , m,
for i E It c {I, 2, ... , n},
with the additional property
bi ~ 0
for i = 1, 2, ... , m.
Such problems may be transformed by the introduction of slack variables
Xn+l, ... , x n + m into the form
(4.11.1)
maximize
Xn+m+l
subject to
ailXl + ... + ainXn + Xn+i
Xi~O
= bi ,
,~= 1,2, ... , m,
foriEItU{n+1,n+2, ... ,n+m},
Note that Xn+m+l = 1 - CIXI - ... - CnX n . The extra positive constant
(arbitrarily selected to be 1) prevents the initially chosen basis from being
degenerate. (4.11.1) has the standard form LP(I,p) with
1
p:= n + m + 1,
1:= It U {n + 1,n + 2, ... ,n + m},
an initial basis J o with p E J o, and a corresponding Mo :::: (Jo; to; Fo, Ro)
given by
Jo :=(n+1,n+2, ... ,n+m+1),
Fo
to=m+1,
~ flo , d +, ~ [: '. :J (o,d", m + 1).
m
Since bi ~ 0 for i = 1, ... , m, the basis Jo is feasible for (4.11.1).
"Phase one of the simplex method" is a name for a class of techniques
which are applicable in general. Essentially, all of these techniques consist
of applying phase two of the simplex method to some linear programming
problem derived from the given problem. The optimal basis obained from
the derived problem provides a starting basis for the given problem. We will
sketch one such method here in the briefest possible fashion. This sketch is
included for the sake of completeness. For full details on starting techniques
for the simplex method, the reader is referred to the extensive literature on
270
4 Systems of Linear Equations
linear programming, for example to Dantzig (1963), Gass (1969), Hadley
(1962), and Murty (1976).
Consider a general linear programming problem which has been cast
into the form
(4.11.2)
minimize
C1X1 + ... + CnXn
subject to
aj1X1 + ... + ajnXn = bj ,
Xi 2': 0
j = 1,2, ... , m,
for i E I,
where I S;;; {I, 2, ... , n}. We may assume without loss of generality that
bj 2': 0 for j = 1, ... , m (otherwise multiply the nonconforming equations
through by -1).
We begin by extending the constraints of the problem by introducing
artificial variables Xn+l, ... , Xn +m :
(4.11.3)
+amnXn
for i E I U { n + 1, ... , n + m}.
+xn+m
Clearly there is a one-to-one correspondence between feasible points for
(4.11.2) and those feasible points for (4.11.3) which satisfy
Xn+l = Xn+2 = ... = x n+m = O.
(4.11.4)
We will set up a derived problem with constraints (4.11.3) whose maximum
should be taken on, if possible, by a point satisfying (4.11.4). Consider
LP(i,p):
maximize
Xn+m+l
subject to
allxl
+ .. , +alnX n +Xn+l
= bm ,
Xn+l + ... +xn+m +Xn+m+l = 22::::1 bi,
a m lXl+ ... +amnx n
+x n +m
Xi 2': 0, for i E i := I U { n + 1, ... ,n + m },
with p := n + m + 1.
We may take io := (n + 1, ... , n + m + 1) as a starting basis with feasible
basic solution x given by
m
Xj = 0, Xn+i = bi, Xn+m+l =
L bi, for 1 :s: j :s: n, 1 :s: i :s: m.
i=l
4.11 Phase One of the Simplex Method
It is evident that xn+m+1 :S 2
more
271
E7=1 bj ; hence LP(i, p) is bounded. FurtherM
Xn+m+1 = 2
Lb
i
i=l
if and only if (4.11.4) holds.
Corresponding to Jo we have the quadruple Mo
given by
to:= m+ 1,
1
-1
1
Phase two can be applied immediately to LP(i, p), and it will terminate
with one of three possible outcomes for the optimal x:
(1) xn+m+1 < 2 E:1 bi [i.e., (4.11.4) does not hold],
(2) xn+m+1 = 2 E:1 bi and all artificial variables are nonbasic,
(3) xn+m+1 = 2 E:1 bi and some artificial variables are basic.
In case (1), (4.11.2) is not a feasible problem [since any feasible point for
(4.11.2) would correspond to a feasible point for LP(i,p) with X n + m +1 =
2 E:1 bJ In case (2), the optimal basis for LP(i,p) clearly provides a
feasible basis for (4.11.2). Phase two can be applied to the problem as
originally given. In case (3), we are faced with a degenerate problem, since
the artificial variables which are basic must have the value zero. We may
assume without loss of generality (by renumbering equations and artificial
variables as necessary) that X n +1, Xn+2, ... , Xn+k are the artificial variables
which are basic and have value zero. We may replace xn+m+1 by a new
variable X n +k+1 and use the optimal basis for LP(i,p) to provide an initial
feasible basis for the problem
maximize X n +k+2
subject to:
+Xn+k
Xn +1
+ ... +Xn+k +Xn +k+1
= bk ,
= 0,
= bk + 1 ,
=bm ,
+Xn +k+2 = 0,
272
4 Systems of Linear Equations
Xi
2': 0 for i E I u {n + 1, ... ,n + k + I}.
This problem is evidently equivalent to (4.11.2).
4.12 Appendix: Elimination Methods for Sparse
Matrices
Many practical applications require solving systems of linear equations
Ax = b with a matrix A that is very large but sparse, i.e., only a small
fraction of the elements aik of A are nonzero. Such applications include the
solution of partial differential equations by means of discretization methods
[see Sections 7.4, 7.5, 8.4]' network problems, or structural design problems
in engineering. The corresponding linear systems can be solved only if the
sparsity of A is used in order to reduce memory requirements, and if solution methods are designed accordingly. Many sparse linear systems, in
particular, those arising from partial differential equations, are solved by
iterative methods [see Chapter 8]. In this section, we consider only elimination methods, in particular, the Choleski method [see 4.3] for solving linear
systems with a positive definite matrix A, and explain in this context some
basic techniques for exploiting the sparsity of A. For further results, we refer
to the literature, e.g., Reid (1971), Rose and Willoughby (1972), Tewarson
(1973), and Barker (1974). A systematic exposition of these methods for
positive definite systems is found in George and Liu (1981), and for general
sparse linear systems in Duff, Erisman, and Reid (1986).
We first illustrate some basic storage techniques for sparse matrices. Consider,
for instance, the matrix
A=
[~o J ~ ~
-5
0
0
o -6
0
0
1
-0: ]
6
One possibility is to store such a matrix by rows, for instance, in three onedimensional arrays, say a, ja, and ip. Here ark], k = 1, 2, ... , are the values of
the (potentially) nonzero elements of A, ja[k] records the column index of the
matrix component stored in ark]. The array ip holds pointers: if ip[i] = p and
ip[i + 1] = q then the segment of nonzero elements in row i of A begins with alP]
and ends with a[q - 1]. In particular, if ip[i] = ip[i + 1] then row i of A is zero.
So, the matrix could be stored in memory as follows
1
2
3
4
5
6
7
8
9
10
ip[i] = 1
ja[i] = 5
ali] = -2
3
1
1
6
1
3
8
3
2
9
5
1
11
4
7
2
-4
2
-5
2
-6
5
6
i=
11
4.12 Appendix: Elimination Methods for Sparse Matrices
273
Of course, for symmetric matrices A, further savings are possible if one stores
only the nonzero elements aik with i :s: k.
When using this kind of data structure, it is difficult to insert additional
nonzero elements into the rows of A, as might be necessary with elimination
methods. This drawback is avoided if the nonzero elements of A are stored
as a linked list. It requires only one additional array next: if ark] contains
an element of row i of A, then the "next" nonzero element of row i is found
in a[next[k]], if next[k] =f. O. If next[k] = 0 then ark] was the "last" nonzero
component of row i.
Using linked lists the above matrix can be stored as follows:
1
2
3
4
5
6
7
8
9
10
6
ja[i] = 2
ali] = -4
next[i] = 0
4
3
2
7
5
5
-2
0
10
1
3
2
9
4
7
1
1
1
3
5
1
0
5
6
0
2
-6
8
2
-5
0
i=
ip[i] =
Now it is easy to insert a new element into a row of A: e.g., a new element
a31 could be incorporated by extending the vectors a, ja, and next each by one
component a[l1], ja[l1], and next[l1]. The new element a31 is stored in a[l1];
ja[l1] = 1 records its column index; and the vectors next and ip are adjusted
as follows: next [11] := ip[3] ( = 5), ip[3] := 11. On the other hand, with this
technique the vector ip no longer contains information on the number of nonzero
elements in the rows of A as it did before.
Refined storage techniques are also necessary for the efficient implementation of iterative methods to solve large sparse systems [see Chapter 8].
However, with elimination methods there are additional difficulties, if the
storage (data structure) used for the matrix A is also to be used for storing
the factors of the triangular decomposition generated by these methods
[see Sections 4.1 and 4.3]: These factors may contain many more nonzero
elements than A. In particular, the number of these extra nonzeros (the
"fill-in" of A) created during the elimination depends heavily on the choice
of the pivot elements. Thus a bad choice of a pivot may not only lead to
numerical instabilities [see Section 4.5], but also spoil the original sparsity
pattern of the matrix A. It is therefore desirable to find a sequence of pivots
that not only ensures numerical stability but also limits fill-in as much as
possible. In the case of Choleski's method for positive definite matrices A,
the situation is particularly propitious because, in that case, pivot selection
is not crucial for numerical stability: Instead of choosing consecutive diagonal pivots, as described in Section 4.3, one may just as well select them
in any other order without losing numerical stability (see (4.3.6)). One is
therefore free to select the diagonal pivots in any order that tends to minimize the fill-in generated during the elimination. This amounts to finding
274
4 Systems of Linear Equations
a permutation P such that the matrix P ApT = LLT has a Choleski factor
L that is as sparse as possible.
The following example shows that the choice of P may influence the sparsity
of L drastically. Here, the diagonal elements are numbered so as to indicate their
ordering under a permutation of A, and x denotes elements 7= 0: The Choleski
factor L of a positive definite matrix A having the form
is in general a "full" matrix, whereas the permuted matrix P ApT obtained by
interchanging the first and last rows and columns of A has a sparse Choleski
factor
PAp T =
[5 2
3
x
x
x
~l
4 x
x
= LL T ,
1
Efficient elimination methods for sparse positive definite systems therefore consist of three parts:
l. Determine P so that the Choleski factor L of P ApT = LLT is as
sparse as possible, and determine the sparsity structure of L.
2. Compute L numerically.
3. Compute the solution x of Ax = b, i.e., of (PAP)T Px = LLT Px =
Pb by solving the triangular systems Lz = Pb, LT u = z, and letting
x:= pTu.
In step 1, only the sparsity pattern of A as given by index set
Nonz(A):= {(i,j) I j < i
and aij of- O}
is used to find Nonz(L), not the numerical values of the components of
L: In this step only a "symbolic factorization" of P ApT takes place; the
"numerical factorization" is left until step 2.
It is convenient to describe the sparsity structure of a symmetric n x n
matrix A, i.e., the set Nonz(A), by means of an undirected graph C A =
(VA, EA) with a finite set of vertices (nodes) VA = {VI, V2, ... ,vn } and a
finite set
EA = {{Vi,Vj} I (i,j) E Nonz(A)}
of "undirected" edges {Vi, Vj} between the nodes Vi and Vj of- Vi (thus an
edge is a subset of V A containing two elements). The vertex Vi is associated
4.12 Appendix: Elimination Methods for Sparse Matrices
275
with the diagonal element aii (i.e., also with row i and column i) of A, and
the vertices Vi =I- Vj are connected by an edge in G A if and only if aij =I- O.
EXAMPLE 1. The matrix
1
x
x
2
x
3
x
x
x
4
x
x
x
x
5
6
x
x
is associated with the graph C A
V
V
V2- l -
x
x
7
4-T-T
V7- V6
We need a few concepts from graph theory. If G = (V, E) is an undirected graph and S c V a subset of its vertices then AdjG(S), or briefly
Adj(S):= {v E V\S I {s,v} E E
for some s Eo S}
denotes the set of all vertices v E V\S that are connected to (a vertex of)
S by an edge of G. For example, Adj ({ v }) or briefly Adj (v) is the set of
neighbors of the node v in G. The number of neighbors of the vertex v E V
deg v := IAdj ({ v } ) I is called the degree of v. Finally, a su bset MeV of
the vertices is called a clique in G if each vertex x E 11/! of N! is connected
to any other vertex y EM, Y =I- x, by an edge of G.
We return to the discussion of elimination methods. First we try to
find a permutation so that the number of nonzero elements of the Choleski
factor L of P ApT = LLT becomes minimal. Unfortunately, an efficient
method for computing an optimal P is not known, but there is a simple
heuristic method, the minimal degree algorithm of Rose (1972), to find a
P that nearly minimizes the sparsity of L. Its basic idea is to select the
next pivot element in Choleski's method as that diagonal element that is
likely to destroy as few O-elements as possible in the elimination step at
hand.
To make this precise, we have to analyze only the first step of the
Choleski algorithm, since this step is already typical for the procedure
in general. By partitioning the n x n matrices A and L in the equation
A=LLT
A =
[~
f] = [7 ~]. [~ i~] =
LL T ,
aT = (a2' a3, ... ,an),
where d = all and [d, aT] is the first row of A, we find the following relations
276
4 Systems of Linear Equations
Hence, the first column Ll of L and the first row Lf of LT are given by
and the computation of the remaining columns of L, that is, of the columns
of L, is equivalent to the Choleski factorization A = LLT of the (n - 1) x
(n - 1) matrix A = (aik)~k=2
(4.12.1)
_
aiak
aik = aik - d
for all i, k 2': 2.
Disregarding the exceptional case in which the numerical values of aik i- 0
and aiak = alialk i- 0 are such that one obtains aik = 0 by cancellation,
we have
(4.12.2)
Therefore, the elimination step with the pivot d = a11 will generate a number of new nonzero elements to A, which is roughly proportional to the number of nonzero components of the vector aT = (a2' ... ,an) = (a12, ... ,al n ).
The elimination step A --+ A can be described in terms of the graphs
G = (V, E) := G A and G = (V, E) := GA. associated witl]. A and A: The
vertices 1, 2, ... , n of G = G A (resp. 2, 3, ... , n of G = G A ) correspond to
the diagonal elements of A (resp. A); in particular, the pivot vertex 1 of G
belongs to the pivot element a11. By (4.12.2), the vertices i i- k, i, k 2': 2, of
G are connected by an edge in G if and only if they are either connected by
an edge in G (aik i- 0), or both vertices i, k are neighbors of pivot node 1
in G (alialk i- 0, i, k E AdjG(l)). The number of nonzero elements ali i- 0
with i 2': 2 in the first row of A is equal to the degree degG(1) of pivot
vertex 1 in G. Therefore, a11 is a favorable pivot, if vertex 1 is a vertex of
minimum degree in G. Moreover, the set AdjG(l) describes exactly which
nondiagonal elements of the first row of LT (first column of L) are nonzero.
EXAMPLE 2. Choosing au as pivot in the following matrix A leads to fill-in at
the positions denoted by (9 in the matrix A:
4.12 Appendix: Elimination Methods for Sparse Matrices
277
The associated graphs are:
2--3
1--2--3
I~I
G:
5--4-6
=?
G:
5-4--6
Generally, the choice of a diagonal pivot in A corresponds to the choice
of a vertex x E V (pivot vertex) in the graph G = (V E) = G A , and the
elimination step A ---+ A with this pivot corresponds to a transformation
of the graph G(= G A ) into the graph G = (V, E) ( = G A ) (which is also
denoted by G x = G to stress its dependence on x) according to the following
rules
(1) lI:=V\{x}.
(2) Connect the nodes y '" z, y, z E 11 by an undirected edge ({y, z} E E),
if y and z are connected by an edge in G ({y, z} E E) or if y and z are
neighbors ofx in (y,z E AdjG(x)).
For obvious reasons, we say that the graph G x is obtained by "elimination
of the vertex x" from G.
By definition, we have in G x for y E 11 = V \ { x }
(4.12.3)
.
{AdjG(y)
AdJGJy)= (AdjG(x)UAdgdy))\{x,y}
if y tI- Adgdx) ,
otherwise.
Every pair of vertices in AdjG(;1;) is connected by an edge in G x . Those
vertices thus form a clique in G x , the so-called pivot clique, pivot clique
DF associated with the pivot vertex x chosen in G. Moreover AdjG(x)
describes the nonzero off-diagonal elements of the column of L (resp. the
row of LT) that corresponds to the pivot vertex x.
As we have seen, the fill-in newly generated during an elimination step
is probably small if the degree of the vertex selected as pivot vertex for this
step is small. This motivates the minimal degree algorithm of Rose (1972)
for finding a suitable sequence of pivots:
(4.12.4) Minimal Degree Algorithm. Let A be a positive definite n x n
matrix and GO = (Va, EO) := G A be the graph associated to A.
For i = 1, 2, ... , n:
(1) Determine a vertex Xi E Vi-l of minimal degree in C i - 1 .
(2) Set G i := G~-;-l.
Remark. A minimal degree vertex Xi need not be uniquely determined.
EXAMPLE 3. Consider the matrix A of example 2. Here, the nodes 1, 3, and 5
have minimal degree in CO := C = CA. The choice of vertex 1 as pivot vertex Xl
278
4 Systems of Linear Equations
leads to the graph C of Example 2, G ' := G~ = C. The pivot clique associated
with X, = 1 is
Adjco(l) = {2,5}.
In the next step of (4.12 ..4) one may choose vertex 5 as pivot vertex X2, since
vertices 5 and 3 have minimal degree in G ' = C. The pivot clique associated with
X2 = 5 is Adjq,(5) = {2,4}, and by "elimination" of X2 from G ' we obtain the
next graph G := G§. All in all, a possible sequence of pivot elements given by
(4.12.4) is (1,5,4,2,3, 6J, which leads to the series of graphs (Go := G, G ' := C
are as in Example 2, G is the empty graph):
2-3
2*--3
I~I
~I
4*--6
3
G 2 = Gg
The pivot vertices are marked. The associated pivot cliques are
Pivot
Clique
3
6
{2,5} {2,4} {2,6} {3,6} {6}
0
1
5
4
2
The matrix P ApT = LL T arising from a permutation of the rows and columns
corresponding to the sequence of pivots selected and its Choleski factor L have
the following structure, which is determined by the pivot cliques (fill-in is denoted
by 0):
For instance, the position of the nonzero off-diagonal elements of the third row
of L T, which belongs to the pivot vertex X3 = 4, is given by the pivot clique
{ 2, 6 } = Adjc2 (4) of this vertex.
For large problems, any efficient implementation of algorithm (4.12.4)
hinges on the efficient determination of the degree function degci of graphs
G i = G~-;-l, i = 1, 2, ... , n - 1. One could use, for instance,
for all y I- x, y tf. Adjc(x), since by (4.12.3) the degree function does not
change in the transition G --+ G x at such vertices y. Numerous other methods for efficiently implementing (4.12.4) have been proposed [see George
and Liu (1989) for a review]. These proposals also involve an appropriate
4.12 Appendix: Elimination Methods for Sparse Matrices
279
representation of graphs. For our purposes, it proved to be useful to represent the graph G = (V, E) by a set of cliques M = {K 1, K2, ... , Kq} such
that each edge is covered by at least one clique Ki E M. Then the set E
of edges can be recovered:
E = {{x, y} 1 x =I- y & 3i : x, y E Ki}.
Under those conditions, M is called clique representation of G. Such respresentations exist for any graph G = (V, E). Indeed, AI := E is a clique
representation of G since every edge {x, y} E E is a clique of G.
EXAMPLE 4. A clique representation of the graph C = C A of Example 2 is, for
instance,
{{1,5},{4,5},{1,2},{2,3,6},{2,4,6}}.
The degree dego(x) and the sets Adjo(x), x E V, can be computed
from a clique representation AI by
Adjo(x) =
U Ki \ {x},
dego(x) = IAdjo(x)l·
i:xEK,
Let x E V be arbitrary. Then, because of (4.12.3), one can easily compute a
clique representation Mx for the elimination graph G x of G = (V, E) from
such a representation !vI = {K 1, ... , Kq} of G: Denote by {KS" ... , K st }
the set of all cliques in M that contain x, and set K := U~=lKsi\{x}.
Then
Mx = {K1, ... ,Kq,K}\{Ks" ... ,Kst}
gives a clique representation of G x .
Assuming that cliques are represented as lists of their elements, then
Mx takes even up less memory space than M because
t
IKI <
L IKsj I·
j=l
Recall now the steps leading up to this point. A suitable pivot sequence was determined using algorithm (4.12.4). To this pivot sequence
corresponds a permutation matrix P, and we are interested in the Choleski
factor L of the matrix PAp T = LLT and, in particular, in a suitable data
structure for storing L and associated information. Let Nonz(L) denote
the sparsity pattern, that is, the set of locations of nonzero elements of
L. The nonzero elements of L may, for instance, be stored by columns this corresponds to storing the nonzero elements of LT by rows -- in three
arrays, ip, ja, a as described at the beginning of this section. Alternatively,
a linked list next may be employed in addition to such arrays.
280
4 Systems of Linear Equations
Because Nonz(L) :J Nonz(PApT), the data structure for L can also be
used to store the nonzero elements of A.
Next follows the numerical factorization of P ApT = LLT: Here it is
important to organize the computation of LT by rows, that is, the array a
is overwritten step by step with the corresponding data of the consecutive
rows of LT. Also the programs to solve the triangular systems
Lz = Pb,
for z and u, respectively, can be coded based on the data structure by
accessing each row of the matrix LT only once when computing z and
again only once when computing u. Finally, the solution x of Ax = b is
given by x = p T u.
The reader can find more details on methods for solving sparse linear
systems in the literature, e.g., in Duff, Erisman, and Reid (1986) and George
and Liu (1981), where numerous FORTRAN programs for solving positive
definite systems are also given. Large program packages exist, such as the
Harwell package MA27 [see Duff and Reid (1982)]' the Yale sparse matrix
package YSMP [see Eisenstat et al. (1982)]' and SPARSEPAK [see George,
Liu, and Ng (1980)].
EXERCISES FOR CHAPTER 4
1.
Consider the following vector norms defined on 1Rn (or (Cn):
n
Ilxlll := L IxJ
i=l
Show:
(a) that the norm properties are satisfied by each;
(b) that Ilxll oo ::; IIxl12 ::; Ilxlll;
(c) that IIxl12 ::; v'nllxlloo, Ilxlll ::; v'nllxIl2'
Can equality hold in (b), (c)?
(d) Determine what lub(A) is, in general, for the norm 11·1\1.
(e) Starting from the definition
EXERCISES FOR CHAPTER 4
show that
281
IIAxl1
lub(A) =
r;;:lJ- W'
1
lub(A-I)
. IIAyl1
#0 IIYII
--:----::-:- = mIn - -
for nonsingular A.
2.
Consider the following class of vector norms defined on <en:
IlxiiD := IIDxll,
where II ·11 is a fixed vector norm, and D is any member of the class of
nonsingular matrices.
(a) Show that II . liD is, indeed, a vector norm.
(b) Show that mllxll :s:; IlxiiD :s:; Mllxll with
m = l/lub(D- I ),
M = lub(D),
where lub(D) is defined with respect to II· II.
(c) Express lubD(A) in terms of the lub norm defined with respect to
11·11·
(d) For a nonsingular matrix A, cond(A) depends upon the underlying
vector norm used. To see an example of this, show that condD(A)
as defined from II . liD can be arbitrarily large, depending on the
choice of D. Give an estimate of condD(A) using m, M.
(e) How different can cond(A) defined using 11·1100 be from that defined
using 11·112? [Use the results of Exercise l(b, c) above].
3.
Show that, for an n x n nonsingular matrix A and vectors u, v E IRn,
(a)
(A + UVT)-l = A-I _ A-IuVT A-I,
l+v T A-I U
T
if v A-Iu -I -l.
(b) If vT A-I U = -1, then (A + uv T ) is singular.
Hint: Find a vector z -I 0 such that (A + uvT)z = O.
4.
Let A be a nonsingular n x n matrix with columns ai,
A = [al,'" ,an]'
(a) Let A = [al, ... ,ai-l,b,ai+l, ... ,a n ], b E IRn , be the matrix obtained from A by replacing the ith column ai by b. Determine, using
the formula of Exercise 3(a), under what conditions A-I exists, and
show that A-I = FA-l for some Frobenius matrix F.
282
4 Systems of Linear Equations
(b) Let A, be the matrix obtained from A by changing a single element
aik to aik + 0:. For what 0: will A~l exist?
5.
Consider the following theorem [see Theorem (6.4.10)]: If A is a real,
nonsingular n x n matrix, then there exist two real orthogonal matrices
U, V satisfying
UTAV=D
where D = diag(111, ... , I1n) and
111 ::::: 112 ::::: ... ::::: I1n > 0.
Using this theorem, and taking the Euclidian norm as the underlying
vector norm:
(a) Express cond(A) in terms of the quantities l1i.
(b) Give an expression for the vectors band Llb in terms of U for which
the bounds (4.4.11), (4.4.12), and
Ilbll S; lub(A)llxll
are satisfied with equality.
(c) Is there a vector b such that for all Llb in (4.4.12),
IILlxl1
- < IILlbl1
-Ilxll
-
Ilbll
holds? Determine such vectors b with the help of U.
Hint: Look at b satisfying lub(A-1)ll bll = Ilxll.
6.
Let Ax = b be given with
A = [0.780
0.913
0.563]
0.659
and
0.217]
b = [ 0.254 .
The exact solution is x T = (1, -1). Further, let two approximate solutions
= (0.999, -1.001),
xi
xf = (0.341, -0.087)
be given.
(a) Compute the residuals r(x1), r(x2). Does the more accurate solution have a smaller residual?
(b) Determine the exact inverse A-1 of A and cond(A) with respect to
the maximum norm.
EXERCISES FOR CHAPTER 4
283
(c) Express i-x = L1x using r(i), the residual for i. Does this provide
an explanation for the discrepancy observed in (a)? (Compare with
Exercise 5.)
7.
Show that:
(a) The largest elements in magnitude in a positive definite matrix
appear on the diagonal and are positive.
(b) If all leading principal minors of an n x n Hermitian matrix A =
(aik) are positive, i.e.,
det,
([a~.l
a:.
azl
a"
1'
])
> 0 for i = 1, ... , n,
then A is positive definite.
Hint: Study the induction proof for Theorem (4.3.3).
8.
Let A' be a given, n x n, real, positive definite matrix partitioned as
follows:
A' = [ABT B]
C
where A is an m x m mat.rix. First., show:
(a) C - RT A -1 B is positive definite.
[Hint: Partition .1: correspondingly:
x =
[~~],
Xl
E IRm ,
X2
E IRn - m ,
and determine an Xl for fixed X2 such that
holds.]
According to Theorem (4.3.3), A' has a decomposition
A' = RTR,
where R is an upper triangular matrix, which may be partitioned consistently with A':
Now show:
(b) Each matrix AI = NT N, where N is a nonsingular matrix, is positive definite.
284
4 Systems of Linear Equations
(c) R'f2R22 = C - BT A-I B.
(d) The result
i=l, ... ,n,
follows from (a), where 'rii is any diagonal element of R.
(e)
2
'r.
x T A/x
xiO x x
1
lub(R-I)2
.
> mm - - = --,--~
T
" -
for i = 1, ... , n, where lub(R- I ) is defined using the Euclidean
vector norm.
Hint: Exercise l(e).
(f)
xTA'x
lub(R)2 = max - - T - 2: 'rTi'
xiO x x
i = 1, ... ,n,
for lub(R) defined with the Euclidean norm.
(g) cond(R) satisfies
cond(R) 2:
9.
max
1'rii I·
l:'Oi,k:'On 'rkk
A sequence An of real or complex 'r x 'r matrices converges componentwise to a matrix A if and only if the An form a Cauchy sequence; that
is, given any vector norm II . II and any E: > 0, lub(A n - Am) < E: for
m and n sufficiently large. Using this, show that if lub(A) < 1, then
the sequence Anand the series L:~=o An converge. Further, show that
1 - A is nonsingular, and
00
(1 - A)-I = LAn.
n=O
Use this result to prove (4.4.14).
10. Suppose that we attempt to find the inverse of an n x n matrix A using the Gauss-Jordan method and partial pivot selection. Show that
the columns of A are linearly dependent if no nonzero pivot element
can be found by the partial pivot selection process at some step of the
method.
[Caution: The converse of this result is not true if floating-point arithmetic is used. A can have linearly dependent columns, yet the GaussJordan method may never encounter all zeros among the candidate
pivots at any step, due to round-off errors. Similar statements can be
made about any of the decomposition methods we have studied. For
EXERCISES FOR CHAPTER 4
285
a discussion of how the determination of rank (or singularity) is best
made numerically, see Chapter 6, Section 6, of Stewart (1973).]
11. Let A be a positive definite n x n matrix. Let Gaussian elimination be
carried out on A without pivot selection. After k steps of elimination,
A will be reduced to the form
A(k)]
12
(k)
A22
,
where A~~) is an (n - k) x (n - k) matrix. Show by induction that
(a) A~~) is positive definite,
C
k 'Z ::; n, k -- 1, 2 , ... , n - 1.
(b) aii(k) ::; aii(k-1) lor::;
12. In the error analysis of Gaussian elimination (Section 4.5) we used
certain estimates of the growth of the maximal elements of the matrices
A(i). Let
Show that for partial pivot selection:
(a) ak::; 2kao, k = 1, ... , n - 1, for arbitrary A.
(b) ak ::; kao, k = 1, ... , n - 1, for Hessenberg matrices A.
(c) a = max1sksn-1 ak ::; 2ao for tridiagonal matrices A.
13. The following decomposition of a positive definite matrix A,
A = SDS H ,
where S is a lower triangular matrix with Sii = 1 and D is a diagonal
matrix D = diag(di ), gives rise to a variant of the Choleski method.
Show:
(a) that such a decomposition is possible [Theorem (4.3.3)];
(b) that di = (lii)2, where A = LLH and L is a lower triangular matrix;
(c) that this decomposition does not require the n square roots which
are required by the Choleski method.
14. We have the following mathematical model:
which depends upon the two unknown parameters Xl, x2. Moreover,
let a collection of data be given:
{Yl, Zl}t=l, ... ,m
with Zl = l.
286
4 Systems of Linear Equations
Try to determine the parameters Xl, X2 from the data using least-square
fitting.
(a) What are the normal equations?
(b) Carry out the Choleski decomposition of the normal equation ma-
trixB=ATA=LLT.
(c) Give an estimate for cond( L) based upon the Euclidean vector
norm.
[Hint: Use the estimate for cond(L) from Exercise 8(g).]
(d) How does the condition vary as m, the number of data, is increased
[Schwarz, Rutishauser, and Stiefel (1968)]7
15. The straight line
y(x)=n+(3x
is to be fitted to the data
Xi
1-2 -1 o 3.51 3.52
2
so that
is minimized.
Determine the parameters nand (3.
16. Determine nand (3 as in Exercise 15 under the condition that
is to be minimized.
[Hint: Let Pi - (Ii = (Ii = Yi(Xi) - Yi for i = 1, ... ,5, where Pi ~ 0 and
(Ii ~ o. Then I:i IY(Xi) -Yil = I:i(Pi+(Ii). Set up a linear programming
problem in the variables n, (3 (unrestricted) and Pi, (Ii (nonnegative).]
References for Chapter 4
Andersen, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J.J., DuCroz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, A., Sorensen, D. (1992):
LAPACK Users Guide. Philadelphia: SIAM Publications.
Barker, V. A. (Ed.) (1977): Sparse Matrix Techniques. Lecture Notes in Mathematics 572. Berlin, Heidelberg, New York: Springer-Verlag.
Bartels, R. H. (1971): A stabilization of the simplex method. Numer. Math. 16,
414-434.
EXERCISES FOR CHAPTER 4
287
Bauer, F. L. (1966): Genauigkeitsfragen bei der Lasung linearer Gleichungssysteme. ZAMM 46, 409-421.
Bixby, R. E. (1990): Implementing the simplex method: The initial basis. Techn.
Rep. TR 90-32, Department of Mathematical Sciences, Rice University, Houston, Texas.
Bjarck, A. (1990): Least squares methods. In: Ciarlet, Lions, Eds. (1990), 465647.
Businger, P., Golub, G. H. (1965): Linear least squares solutions by Householder
transformations. Numer. Math. 7, 269-276.
Ciarlet, P. G., Lions, J. L. (Ed.) (1990): Handbook of numerical analysis, Vol. I:
Finite difference methods (Part 1), Solution of equation" in lR n (Part 1).
Amsterdam: North Holland.
Collatz, L. (1966): Functional Analysis and Numerical Mathematics. New York:
Academic Press.
Daniel, J. W., Gragg, W. B., Kaufmann, L., Stewart, G. W. (1976): Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization.
Math. Compo 30, 772-795.
Dantzig, G. B. (1963): Linear Programming and Extensions. Princeton, N.J.:
Princeton University Press.
Dongarra, J. J., Bunch, J. R., Moler, C. B., Stewart, G. W. (1979): LINPACK
Users Guide. Philadelphia: SIAM Publications.
Duff, I.S., Erisman, A.M., Reid, J.K (1986): Direct Methods for Sparse Matrices.
Oxford: Oxford University Press.
- - - , Reid, J. K. (1982): MA27: A set of FORTRAN subroutines for solving
sparse symmetric sets of linear equations. Techn. Rep. AERE R 10533, Harwell,
U.K.
Eisenstat, S. C., Gursky, M. C., Schultz, M. H., Sherman, A. H. (1982): The Yale
sparse matrix package. I. The symmetric codes. Internat. J. Numer. Methods
Engrg. 18, 1145-1151.
Forsythe, G. E., Moler, C. B. (1967): Computer Solution of Linear Algebraic
Systems. Series in automatic computation. Englewood Cliffs, N.J.: PrenticeHall.
Gass, S. T. (1969): Linear Programming, 3d edition. New York: McGraw-Hill.
George, J. A., Liu, J. W. (1981): Computer Solution of Large Sparse Positive
Definite Systems. Englewood Cliffs, N.J.: Prentice-Hall.
- - - , - - - (1989): The evolution ofthe minimum degree ordering algorithm.
SIAM Review 31, 1-19.
- - - , - - - , Ng, E. G. (1980): Users guide for SPARSEPAK: Waterloo sparse
linear equations package. Tech. Rep. CS-78-30, Dept. of Computer Science, University of Waterloo, Waterloo.
Gill, P. E., Golub, G.H ., Murray, W., Saunders, M. A. (1974): Methods for
modifying matrix factorizations. Math. Compo 28, 505-535.
Golub, G. H., van Loan, C. F. (1983): Matrix Computations. Baltimore: The John
Hopkins University Press.
Grossmann, W. (1969): Grundzuge der Ausgleichsrechnung. 3. Aufl. Berlin, Heidelberg, New York: Springer-Verlag.
Guest, P.G. (1961): Numerical Methods of Curve Fitting. Cambridge: University
Press.
Hadley, G. (1962): Linear Programming. Reading, MA: Addison-Wesley.
Householder, A. S. (1964): The Theory of Matrices in Numerical Analysis. New
York: Blaisdell.
288
4 Systems of Linear Equations
Lawson, C. L., Hanson, H. J. (1974): Solving Least Squares Problems. Englewood
Cliffs, N.J.: Prentice-Hall.
Murty, K. G. (1976): Linear and Combinatorial Programming. New York: Wiley.
Prager, W., Oettli, W. (1964): Compatibility of approximate solution of linear
equations with given error bounds for coefficients and right hand sides. Num.
Math. 6, 405-409.
Reid, J. K. (Ed) (1971): Large Sparse Sets of Linear Equations. London, New
York: Academic Press.
Rose, D. J. (1972): A graph-theoretic study of the numerical solution of sparse
positive definite systems of linear equations, pp. 183-217. In: Graph theory and
computing, R. C. Read, ed., New York: Academic Press.
- - - , Willoughby, R. A. (Eds.) (1972): Sparse Matrices and Their Applications. New York: Plenum Press.
Sautter, W. (1971): Dissertation TV Munchen.
Schrijver, A. (1986): Theory of Linear and Integer Programming. Chichester:
Wiley.
Schwarz, H. R., Rutishauser, H., Stiefel, E. (1968).: Numerik symmetrischer Matrizen. Leitfiiden der angewandten Mathematik, Bd. 11. Stuttgart: Teubner.
Seber, G. A. F. (1977): Linear Regression Analysis. New York: Wiley.
Stewart, G. W. (1973): Introduction to Matrix Computations. New York: Academic Press.
Tewarson, R. P. (1973): Sparse Matrices. New York: Academic Press.
Wilkinson, J. H. (1965): The Algebraic Eigenvalue Problem. Monographs on Numerical Analysis, Oxford: Clarendon Press.
- - - , Reinsch, Ch. (1971): Linear Algebra. Handbook for Automatic Computation, Vol. II. Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen, Bd. 186. Berlin, Heidelberg, New York: Springer-Verlag.
5 Finding Zeros and Minimum Points by
Iterative Methods
Finding the zeros of a given function f, that is arguments ~ for which
f(~) = 0, is a classical problem. In particular, determining the zeros of a
polynomial (the zeros of a polynomial are also known as its roots)
p ( X ) = aox n + alx n-I + ... + an
has captured the attention of pure and applied mathematicians for centuries. However, much more general problems can be formulated in terms
of finding zeros, depending upon the definition of the function f: E -+ F,
its domain E, and its range F.
For example, if E = F = IRn , then a transformation f: IRn -+ IRn
is described by n real functions fi(xl, ... , xn) (we will use superscripts
in this chapter to denote the components of vectors x Eo IRn , n > 1, and
subscripts to denote elements in a set or sequence of veetors Xi, i = 1, 2,
... );
XT
The problem of solving f(x)
(nonlinear) equations:
= (I
X , ••• ,xn) .
o becomes that of solving a system of
(5.0.1)
Even more general problems result if E and F are linear vector spaces of
infinite dimension, e.g. function spaces.
Problems of finding zeros are closely associated with problems of the
form
(5.0.2)
minimize
x E IR
h (x)
n
for a real function h: IR n -+ IR of n variables h(x) = h( xl, x 2 , ... , xn). For
if h is differentiable and g(x) := (oh/ox!, ... , oh/oxn)T is the gradient of
h, then each minimum point x of h(x) is a zero of the gradient g(x) = O.
290
5 Finding Zeros and Minimum Points by Iterative lVlethods
Conversely, each zero of f(5.0.1) is also the minimum point of some function
h, for example h(x) := Ilf(x)112
The minimization problem described above is an unconstrained minimization problem. More generally, one encounters constrained problems such
as the following:
minimize ho (x)
subject to
hdx):::; 0
for
i = 1,2 .... , mI,
hi(x) = 0
for
i = mi + 1, ml + 2, ... , rr!.
Finding minimum points for functions subject to constraints is one
of the most important problems in applied mathematics. In this chapter,
however, we will consider only unconstrained minimization. The special
case of constrained linear minimization, for which all h t : IRn ---+ IR are linear
(or, more exactly, affine) functions has been discussed in Section 4.10 and
4.11. For a more thorough treatment of finding zeros and minimum points
the reader is referred to the extensive literature on that subject [for example
Ortega, Rheinboldt (1970), Luenberger (1973), Himmelblau (1972), Traub
(1964)].
5.1 The Development of Iterative Methods
In general it is not possible to determine a zero ~ of a function f: E ---+ F
explicitly within a finite number of steps, so we have to resort to approximation methods. These methods are usually iterative and have the following
form: beginning with a starting value Xo, successive approximates Xi, i = 1,
2, ... , to ~ are computed with the aid of an iteration function </J : E ---+ E:
If ~ is a fixed point of </J [</J(O = ~], if all fixed points of </J are also zeros of
f, and if </J is continuous in a neighborhood of each of its fixed points, then
cach limit point of the sequence Xi, i = 1, 2, ... , is a fixed point of </J, and
hence a zero of f.
The following questions arise in this connection:
(1) How is a suitable iteration function </J to be found?
(2) Under what conditions will the sequence Xi converge?
(3) How quickly will the sequence Xi converge?
Our discussion of these questions will be restricted to the finite-dimensional
case E = F = IR n .
5.1 The Development of Iterative Methods
291
Let us examine how iteration functions iP might be constructed. Frequently such functions are suggested by the formulation of the problem.
For example, if the equation x - cos x = 0 is to be solved, then it is natural
to try the iterative process
Xi+l
= cos Xi,
i = 0, 1, 2, ... ,
for which iP(x) := cosx.
More systematically, iteration functions iP can be obtained as follows:
If ~ is the zero of a function f: JR -+ JR, and if f is sufficiently differentiable
in a neighborhood U(~) of this point, then the Taylor series expansion of
f about Xo E U(~) is
f(~) = 0 = f(xo) + (~- .TO)!'(Xo) + (~-2.~O)2 !"(xo) + ...
+ (~-k~O)k f{k)(xO + '!9(~ - xo)),
0 < '!9 < 1.
If the higher powers (~ - xo) v are ignored, we arrive at equations which
must express the point ~ approximately in terms of a given, nearby point
Xo, e.g.
(5.1.1)
0= f(xo) + (~- xo)!'(xo),
or
(5.1.2)
These produce the approximations
and
c _ _ f(xo) ± J(fI(xo))2 - 2f(xo)f"(J.:o)
,,-Xo
f"(xo)
,
respectively. In general, ~, ~ are merely close to the desired zero: they must
be corrected further, for instance by the scheme from which they were
themselves derived. In this manner we arrive at the iteration methods
(5.1.3)
f(x)
iP(x) := x - f'(X) ,
rl'.
(
)._
"'± x .- x
_
f'ex) ± J(fI(X))2 -·2f(x)f"(x)
f"(x)
.
The first is the classical Newton-Raphson method. The E.econd is an obvious extension. In general such methods can be obtained by truncating the
292
5 Finding Zeros and Minimum Points by Iterative Methods
Taylor expansion after the (~ - xo)v term. Geometrically these methods
amount to replacing the function f by a polynomial of degreee v, Pv(x)
(v = 1, 2, ... ), which has the same derivatives f(k)(xO), k = 0, 1, 2, ... ,
v, as f at the point Xo. One of the roots of the polynomial is taken as an
approximation to the desired zero ~ of f [see Figure 7].
f(x)
Fig. 7. Newton-Raphson methods (5.1.3).
The classical Newton-Raphson method is obtained by linearizing f.
Linearization is also a means of constructing iterative methods to solve
equation systems of the form
f(x) = [
(5.1.4)
1I(X 1 , .... ,xn)
:
fn(x1, ... ,xn )
1
=0.
If we assume that x = ~ is a zero for f, that Xo is an approximation to ~,
and that f is differentiable for x = Xo then to a first approximation
o = f(~) ~ f(xo) + Df(xo)(~ - xo),
where
[ ox
8f,1
(5.1.5)
Df(xo) =
oII
ox n
:
ofn
ox 1
~ -
ofn
ox n
Xo = [ "
~ x~ 1
~n X=Xo
Xo
5.2 General Convergence Theorems
293
If the Jacobian Df(xo) is nonsingular, then the equation
f(xo) + D f(XO)(XI - xo) = 0
can be solved for Xl:
and Xl may be taken as a closer approximation to the zero t;. The generalized Newton method for solving systems of equations (5.1.4) is given
by
(5.1.6)
Xi+l = Xi - (D f(X;))-1 f(x;),
i = 0, 1, 2, ....
In addition to Newton's method for such equation systems there are, for
example, generalized secant methods [see (5.9.7)] for functions of many
variables, and there are generalizations to nonlinear systems of the iteration
methods given in Chapter 8 for systems of linear equations. A good survey
can be found in Ortega and Rheinboldt (1970).
5.2 General Convergence Theorems
In this section, we study the convergence behavior of a sequence {x;} which
has been generated by an iteration function <P,
Xi+l := <P(Xi),
i = 0, 1, 2,
in the neighborhood of a fixed point t; of <P. \Ve concentrate on the case
E = IR n rather than considering general normed linear vector spaces. Using
a norm 11·11 on IR n , we measure the size of the difference between two vectors,
x, y E IRn by Ilx - YII. A sequence of vectors Xi E IRn converges to a vector
x, if for each c > 0 there is an integer 1'1 (c) such that
Ilxl - xii < c
for all
I::: N(c).
It can be shown that this definition of the convergence of vectors in IR n is
independent of the chosen norm [see Theorem (4.4.6)]. Finally, it is known
that the space IR n is complete in the sense that the Cauchy convergence
criterion is satisfied:
A sequence Xi E IR n is convergent if and only if for each c > 0 there exists
N(c) such that Ilxl - xmll < c for all I, m ::: N(c).
In order to characterize the speed of convergence of a convergent sequence Xi, limi Xi = X, we say that the sequence converges at least with
order p ::: 1 if there are a constant C ::: 0 (with C < 1 if p = 1) and an
integer 1'1 such that the inequality
294
5 Finding Zeros and Minimum Points by Iterative lVlethods
holds for all i 2: N. If p = 1 or p = 2, one also speaks of linear convergence
and quadratic convergence , respectively.
In the case of linear convergence, the size ei := Ilx, - xii of the error is
reduced at each step of the iteration at least by the factor C, 0 <::: C < 1,
and the speed of convergence increases with decreasing C. Therefore, C is
called convergence factor.
If the sequence converges linearly with, say convergence factor C = 0.1, the
error sequence may look like
In the case of quadratic convergence, the error behavior is quite different. For
C = 1. a typical error sequence may look like
We call the iteration method locally convergent with convergence domain V(~), a neigborhood of ~, if it generates, for all starting points
Xu E V(~), a sequence Xi that converges to t;. It is called globally convergent, if in addition V (~) = Rn.
The following theorem is readily verified:
(5.2.1) Theorem. Let CP: JRn --+ JRn be an iteration function with fixed
point ~. Suppose that there are a ncigborhood U (~) of t;, a number p 2: 1.
and a constant C 2: 0 (with C < 1 if P = 1) so that for all x E U(O
Then there is a neighborhood V (~) <:: U (~) of ~ so that for all starting
points Xo E V(t;) the iteration method defined by cP generates iterates Xi
with Xi E V(~) for all i 2: 0 that converge to ~ at least with order p.
In the I-dimensional case, E = JR, the order of a method defined by cP
can often be determined if cP is sufficiently often differentiable in a convex
neighborhood U(~) of ~. If Xi E U(~) and if cp(k) (~) = 0 for k = 1, 2, ... ,
p - L but cp(p) (~) f- 0, it follows by Taylor expansion
For p = 2, 3, ... , the method then is of (precisely) pth order (since cp(p) (0 f0). A method is of first order if, besides p = 1, it is true that W(OI < l.
5.2 General Convergence Theorems
295
EXAMPLE 1. E = JR, p is differentiable in a neighborhood U(O. If 0< p'(e) < 1,
then convergence will be linear (first order). In fact, the Xi will converge monotonically to
[See Figure 8.]
e.
,
p(x)
X
x
Fig. 8. Monotone convergence.
If -1 < pi (e) < 0, then the Xi will alternate about
Figure 9].
EXAMPLE 2.
eduring convergence [see
E = JR, p(x) = x - f(x)/f'(x) (Newton's method). Assume that
I is sufficiently often continuously differentiable in a neighborhood of the simple
zero e of f, that is f' (e) # 0. If follows that
pte) =
e,
pl(e) = f(e)f" (e)
."
(jl(e))2'
Newton's method is locally at least quadratically (second order) convergent.
EXAMPLE 3. In the more general case that
j<v)(e)=o
for
eis a zero of multiplicity m of f, i.e.,
v=0,1, ... ,m-1,
j<m)(o#O,
then f and I' have a representation of the form
f(x) = (x - e)mg(x),
g(e) # 0,
j'(x) = m(x - ~)m-lg(x) + (x _ ~)mg'(x),
with a differentiable function g. If follows that
p x) = X _ f(x) = x _
(x - ~)g(x)
.
(f'(x) mg(x) + (x - e)gl(x)'
and therefore
296
5 Finding Zeros and Minimum Points by Iterative Methods
x
"-
-+--~- - - - - -'-', - - - - - - - --
<p(x)
o
x
Xi
Fig. 9. Alternating convergence.
<p'(€) = 1 -
2..
m
Thus for m > 1 - that is, for multiple zeros of f - Newton's method is only
linearly convergent. The modified method defined by <P(x) := X - mf(x)/f'(x)
would, in theory, still be quadratically convergent. But for m > 1 both methods
are numerically unstable, since, for x --+ €, both If(x)1 and 1f'(x)1 become small,
usually by cancellation. The computed iteration functions fl(<p(x)), fl(<P(x)) are
thus heavily affected by rounding errors.
The following general convergence theorems show that a sequence Xi
generated by P: E --+ E will converge to a fixed point ~ of P if P is a
contractive mapping.
As usual, I . II represents some norm on E = IRn.
(5.2.2) Theorem. Let the function P: E --+ E, E = IRn , have a fixed point
~: p(~) = ~. Further let Sr(~) := {z Illz - ~II < r} be a neighborhood of ~
such that P is a contractive mapping in Sr(~)' that is,
11<p(x) - p(y)11 :::; Kllx - yll
O:::;K<1.
for all x, y E Sr(~)' Then for any Xo E Sr(~) the generated sequence
Xi+! = P(Xi), i = 0, 1, 2, ... , has the properties
(a) Xi E Sr(~) for all i = 0, 1, ... ,
(b) Ilxi+! - ~II :::; Kllxi - ~II :::; Ki+!llxo - ~II,
i.e., {xd converges at least linearly to ~.
5.2 General ConvergencE Theorems
297
PROOF. The proof follows immediately from the contraction property.
Properties (a) and (b) are true for i = O. If we assume that they are
true for j = 0, I, ... , i, then it follows immediately that
The following theorem (known as the fixed point theorem of Banach) is
more precise. Note that the existence of a fixed point is no longer assumed
a priori.
(5.2.3) Theorem. Let P: E --+ E, E = IRn be an itemtlOn function, Xo E
= p(x;), i = 0, I, ... . Further, let a
neighborhood Sr(XO) = {x I Ilx - xoll < r} of Xo and a constant K, 0 <
K < I, exist such that
E be a starting point and Xi+l
(a) IIp(x) - p(y)11 :s: Kllx - yll for all x, y E S,.(xo) := {x Illx - xoll :s: r},
(b) Ilxl - xoll = IIp(xo) - xoll :s: (1 - K)r < r.
Then it follows that
(1) Xi E S,.(xo) for all i = 0, 1, ... ,
(2) P has exactly one fixed point ( p(f;,) = f;"in S,.(:ro), and
lim Xi = f;"
t--'>oc
II·THl - f;,11 :s: Kllxi - f;,11·
as well as
PROOF. The proof is by induction.
(1): From (b) it follows that Xl E S,.(xo). If it is true that Xj E S,.(xo)
for j = 0, 1, ... , i and i ;::- 1, then (a) implies
and therefore, from the triangle inequality and from (b),
IlxHl - xoll :s: Ilxi+l - xiii + Ilxi - Xi-1II + ... + IIJ;l - xoll
:s: (Ki + K i - l + ... + 1)llx1 - xoll
:s: (1 + K + ... + Ki)(l- K)r = (1 - Ki+l)r < r.
(2): First we show that {x;} is a Cauchy sequence. From (5.2.4) and
from (b) it follows for m > I that
(5.2.5)
Ilxm- XIII :s: Ilx m - xm-lll + Ilxm-1 - xm -211 + ... + IlxHl - XIII
:s: KI(l + K + ... + Km-I-I)llxI - xoll
Kl
< ~K IlxI - Xo I < Klr.
1-
298
5 Finding Zeros and Minimum Points by Iterative IVlethods
Because 0 < K < 1, we have Klr < s for sufficiently large I ~ N(s). Hence
{xd is a Cauchy sequence. Since E = lRn is complete, there exists a limit
lim Xi = (
i-+oo
Because Xi E Sr(XO) for alli, ~ must lie in the closure Sr(XO)' Furthermore
~ is a fixed point of P, because for all i ~ 0
IIp(O - ~II s IIp(O - p(xi)11 + IIp(x;) - ~II
s KII~ - xiii + Ilxi+1 - ~II·
Since limi-+CXJ Ilx, - ~II = 0, it follows at once that IIp(O - ~II = 0, and
hence p(O = ~.
If ~ E Sr(XO) were another fixed point of P, then
II~ - ~II = IIp(~) - p(~) I
s KII~ - ~II,
o < K < 1, which implies II~ - ~II = O.
Finally, (5.2.5) implies
II~ - xIII =
lim
m-+oc
Kl
Ilxm - xIII S K IlxI - Xo I
1-
and
which concludes the proof.
D
5.3 The Convergence of Newton's Method in Several
Variables
Consider the system J (x) = 0 given by the function J: lRTl -+ lRn. Such a
function is said to be different'iable at the point Xo E lR n if an n x n matrix
A exists for which
lim IIJ(x) - J(xo) - A(x - xo)11 = o.
X-+Xo
Ilx - xoll
In this case A agrees with the Jacobian matrix DJ(xo) [see (5.1.5)].
We first note the following
(5.3.1) Lemma. IJ D J (x) exists Jar all X in a convex region Co <;;; lRn)
and iJ a constant 'Y exists with
IIDJ(x) - DJ(y)11 S 'Yllx - yll
then Jar all x) y E Co the estimate
Jar all
X,y E Co,
5.3 The Convergence of Newton's Method in Several Variables
299
Ilf(x) - fey) - Df(y)(x - y)11 ::; ~llx - Yl12
holds.
(Recall that a set it! <;;; IRn is convex if x, y E M implies that the line
segment [.T, V] := {z = AX + (1 - A)y I 0 ::; A ::; 1} is contained within itf.)
PROOF. The function cp: [0, 1] --7 IR n
given by
cp(t) :=f(y+t(x-y))
is differentiable for all a ::; t ::; 1, where x, y E Co are arbitrary. This follows
from the chain rule:
cp/(t) = Df(y + t(x - y))(x - v).
Hence it follows for a ::; t ::; 1 that
Ilcp/(t) - cp'(o) I = II(Df(y + t(X - V)) - Df(y))(x - y)11
::; IIDf(y+t(x-y)) -Df(y)llllx-yll
::; -ytllx _ Y112.
On the other hand,
L1:= f(x) - fey) - Df(y)(x - y) = cp(l) - cp(O) - cp/(O)
=
11
(cp'(t) - 'P'(O))dt,
so the above inequality yields
This complctes the proof.
D
We can now show that Newton's method is quadratically convergent:
(5.3.2) Theorem. Let C <;;; IR n be a given open set. Further, let Co be
a convex set with Co <;;; C, and let f: C --7 IRn be a function which is
differentiable for all x E Co and continuous for all x E C.
For Xo E Co let positive constants r, 0, P, -y, h be given with the
following properties:
Sr(XO) := {x Illx - xoll < r} <;;; Co,
h := op-Y/2 < 1, r:= 0/(1 - h),
and let f (x) have the properties
(a) IIDf(x) - Df(y)11 :s: -yllx - yll for all X,y E Co,
(b) Df(x)-l exists and satisfies II(Df(x))-lll::; P for aU x E Co,
300
5 Finding Zeros and Minimum Points by Iterative Methods
(c) II(Df(xo))-lf(xo)11 <:::: oo.
Then
(1) beginning at Xu, each point
is well defined and satisfies Xk E Sr(XO) for all k :;:: O.
(2) limk-+CXJ Xk = ~ exists and satisfies ~ E Sr(XO) and f(~) = O.
(3) For all k :;:: 0
Sincc 0 < h < 1, Newton's method is at least quadratically convergent.
(1): Since Df(x)-l exists for x E CO, Xk+1 is well defined for all k
if Xk E Sr(XO) for all k :;:: O. This is valid for k = 0 and k = 1 by assumption
(c). ~ow, if Xj E Sr(XO) for j = 0, 1, ... , k, then from assumption (b)
PROOF.
Ilxk+1 - xkll = 11- Df(xk)-l f(xk)11 <:::: ;3llf(xk)11
= fJllf(xk) - f(xk-d - Df(xk-d(xk - xk-dll,
since the definition of Xk implies
But, according to Lemma (5.3.1),
(5.3.3)
and therefore
(5.3.4)
This last inequality is correct for k = 0 because of (c). If it is correct for
k :;:: 0, then it is correct for k + 1, since (5.3.3) implies
/3,
2
fJ~f 2 2"_2
IIXk+l-Xkll<::::21Ixk-Xk-111 <::::2Qh
=ooh
2"_1
.
Furthermore, (5.3.4) implies
Il x k+1 - Xoll <:::: IIXk+1 - xkll + 11:r:k - Xk-tI! + ... + II·T1 - Xoll
<:::: 00(1 + h + h 3 + h7 + ... + h2k_1) < 00/(1 - h) = r,
and consequently Xk+1 E Sr(:ro).
5.3 The Convergence of Newton's Method in Several Variables
301
(2): From (5.3.4) it is easily determined that {Xk} is a Cauchy sequence, since for m ~ n we have
(5.3.5)
Ilxm+l - xnll <:::: Ilxm+l - xmll + Ilxm - xm-lll + ... + Ilxn+l - xnll
<:::: nh2n-l(1 + h 2n + (h2n)2 + ... )
nh 2n - 1
< 1 _ h2n < c
for sufficiently large n ~ N(c), because 0 < h < 1.
Consequently, there is a limit
lim Xk = ~ E Sr(XO),
k-toc
whose inclusion in the closure follows from the fact that Xk E Sr(XO) for
all k ~ O.
By passing to the limit m -+ 00 in (5.3.5) we obtain (3) as a side result:
We must still show that ~ is a zero of f in Sr(XO)'
Because of (a), and because Xk E Sr(XO) for all k ~ 0,
and therefore
IIDf(Xk)11 <:::: IT + IIDf(xo)11 =: K.
The inequality
follows from the equation
Hence
and, since f is continuous at ~,
lim Ilf(Xkll = Ilf(~)11 = 0,
i.e., ~ is a zero of f.
o
Under somewhat stronger assumptions it can be shown that ~ is the
only zero of fin Sr(XO):
302
5 Finding Zeros and Minimum Points by Iterative Methods
(5.3.6) Theorem (Newton-Kantorovich). Given the function f: C ~
lRn -+ lRn and the convex set Co ~ C, let f be continuously differentiable
on Co and satisfy the conditions
(a) IIDf(x) - Df(y)11 ::; I'llx - yll for all x, y E Co,
(b) IIDf(xo)-lll ::; {3,
(c) IIDf(xo)-l f(xo)11 ::; a,
for some Xo E Co. Consider the quantities
h := a{3l',
If h ::; ~ and Sri (XO) C CO, then the sequence {Xk} defined by
Xk+1 := Xk - Df(Xk)-l f(Xk),
k = 0, 1, ... ,
remains in Sri (XO) and converges to the unique zero of f(x) Co n Sr2(XO).
For the proof see Ortega and Rheinboldt (1970) or Collatz (1968).
5.4 A Modified Newton Method
Theorem (5.3.2) guarantees the convergence of Newton's method only if
the starting point Xo of the iteration is chosen "sufficiently close" to the
desired solution ~ of
The following example shows that Newton's method may diverge otherwise.
EXAMPLE. Let f: lR -+ lR be given by f(x) =
of f(x) = O.
arctanx. Then ~ = 0 is a solution
The Newton iteration is defined by
Xk+l
If we choose Xo so that
:= Xk -
(1 + x~) arctan(xk).
21 X ol
arctan(lxol} 2: - - 2 '
1 +xo
then the sequence {IXkl} diverges: limk .... oo IXkl = 00.
We describe a modification of Newton's method for which global convergence can be proven for a large class of functions f. The modification
involves the introduction of an extra parameter >. and a search direction s
to define the sequence
(5.4.0.1)
5.4 A Modified Newton Method
303
where typically Sk := d k := [D f(xk)]-1 f(xk), and the A.k are chosen so that
the sequence {h(Xk)}, h(x) := f(x)T f(x), is strictly monotone decreasing
and the Xk converge to a minimum point of h(x). [Compare this with the
problem of nonlinear least-squares data fitting mentioned in Section 4.8.4.]
Since h(x) 20 for all x,
h(x) = 0 ~ f(x) = o.
Every local minimum point x of h which satisfies h(x) = 0 is also a glohal
minimum point x of h as well as a zero of f.
In the following section we will consider first a few general results about
the convergence of a class of minimization methods for arhitrary functions
h( x). These results will then be used in Section 5.4.2 to investigate the
convergence of the modified Newton method.
5.4.1 On the Convergence of Minimization Methods
Let 11.11 be the Euclidean vector norm and 0 < , :::; 1. We consider the set
(5.4.1.1 )
Db,x) := {s E lRn Illsll = 1 with Dh(x)s::::: ,IIDh(x)ll}
of all directions s forming a not-too-large acute angle with the gradient
Vh(x),
Vh(x)T = Dh(x) = Cl;~~)
,"', 8;~~)),
where (~cl, ... , xn)T.
The following lemma shows, given :r, under which conditions a scalar A. and
an s E lR n exist such that h(x - fJ,s) < h(x) for 0 < /1:::; _\:
(5.4.1.2) Lemma. Let h: lRn ---+ lR be a function which has a continuous
derivative Dh(x) for all x E V(x) in a neighborhood V(x) of x. Suppose
fUT·theT' that Dh(x) # O. and let 1 2 , > O. Then there lS a neighborhood
U(x) ~ V(x) of x and a number A. > 0 such that
h(x - fJ,s) :::; h(x) - fJ,4' IIDh(x)11
for all x E U(x), s E Db,x), and 0 :::; fJ,:::; A..
PROOF. The set
U 1(.f:) := {x E V(x) I I Dh(x) - Dh(x) I :::; ~ IIDh(x) II}
is nonempty and a neighborhood of x, since Dh(x)
continuous on V(x). Now, for x E U 1(x),
# 0 and Dh(x) IS
IIDh(x)11 21IDh(x)ll- hIIDh(x)11 2 ~IIDh(;r)ll,
304
5 Finding Zeros and Minimum Points by Iterative Methods
so that for s E Db, x)
D(h(x))s ~ 'YIIDh(x)11 ~ hIIDh(x)ll·
Choose .A > 0 with
and define
U(x) := B>. = {x Illx - xii ~ .A}.
Then for all x E U(x),
with
° JL .A,
~
~
and S E Db,x) there is a 0 < 0 < 1
h(x) - h(x - JLs) = JLDh(x - OJLs)s,
= JL([Dh(x - OJLs) - Dh(x)]s + Dh(x)s).
Now, x E U(x) = B>. implies x, x - JLS, x - OJLS E B 2 >. C U 1 , and therefore
11[(Dh(x - OJLs) - Dh(x)) + (Dh(x) - Dh(x))]sll ~ hIIDh(x)ll.
Using D(h(x))s ~ hIIDh(x)ll, we finally obtain
h(x) - h(x - JLs) ~ -JL2'YIIDh(x)11 + 3~'YIIDh(x)11
= JL4'Y IIDh(x) II,
D
We consider the following method for minimizing a differentiable function h: JRn -+ JR.
(5.4.1.3).
(a) Choose numbers 'Yk ~ 1, CTk, k = 0, 1, ... , with
inf 'Yk > 0,
k
inf CTk > 0,
k
and choose a starting point Xo E JRn .
(b) For all k = 0, 1, ... , choose an Sk E Dbk,Xk) and set
where .Ak E [0, CTk IIDh(xk) III is such that
h(Xk+l) = min{h(xk - JLSk) I 0 ~ JL ~ CTkIlDh(Xk)II}·
i-'
The convergence properties of this method are given by the following
(5.4.1.4) Theorem. Let h: JRn -+ JR be a junction, and let Xo E JRn be
chosen so that
5.4 A Modified Newton Method
305
(a) K:= {xlh(x) S h(xo)} is compact, and
(b) h is continuously differentiable in some open set containing K.
Then for any sequence {xd defined by a method of the type (5.4.1.3):
(1) Xk E K for all k = 0, 1, ... , {xd has at least one accumulation point
x in K.
(2) Each accumulation point x of {xd is a stationary point of h:
Dh(x) = o.
PROOF. (1): From the definition of the sequence {xd it follows immediately
that the sequence {h(Xk)} is monotone: h(xo) 2 h(Xl) 2 .... Hence Xk E K
for all k. K is compact; therefore {Xk} has at least one accumulation point
x E K.
(2): Assume that x is an accumulation point of {xd but is not a stationary point of h:
Dh(x) 10.
(5.4.1.5)
Without loss of generality, let limk-+cx) Xk = x. Let
"'( := inf"'(k > 0,
k
(J:= inf (Jk > O.
k
According to Lemma (5.4.1.2) there is a neighborhood U(x) of x and
a number .\ > 0 satisfying
(5.4.1.6)
h(x -l1s) s h(x) - 11~IIDh(x)11
for all x E U(x), S E Db, x), and 0 S 11 S .\.
Since limk--+oo Xk = x, the continuity of Dh(x) together with (5.4.1.5),
implies the existence of a ko such that for all k 2 ko
(a) Xk E U(x),
(b) IIDh(Xk)11 2 ~IIDh(x)ll·
Let A := min{.\, ~(JIIDh(x)II}, E := A~IIDh(x)11 > O. Since (Jk 2 (J, it
follows that [0, A] c:: [0, (Jk IIDh(Xk) II] for all k 2 ko. Therefore, from the
definition of Xk+l,
Since A S .\, Xk E U(x), Sk E Dbk' Xk) c:: Db, Xk), (5.4.1.6) implies that
h(Xk+l) S h(Xk) -
A"'(
41IDh(x)11 = h(Xk) - E
306
5 Finding Zeros and Minimum Points by Iterative Methods
for all k 2:: k o . This means that limk--+oo h(xk) = -cx::, which contradicts
h(xk) 2:: h(xk+d 2:: '" 2:: h(x). Hence, x is a stationary point of h.
D
Step (b) of (5.4.l.3) is known as the line search. Even though the
method given by (5.4.l.3) is quite general, its practical application is limited by the fact that the line search must be exact, i.e., it requires that the
exact minimum point of the function
be found on the interval [O,O"kIIDh(Xk)ll] in order to determine Xk+l. Generally a great deal of effort is required to ohtain even an approximate
minimum point. The following variant of (5.4.l.3) has the virtue that in
step (b) the exact minimization is replaced by an inexact line search. in
particular by a finite search process ("Armijo line search"): Armijo line
search DF
(5.4.1.7).
(a) Choose numbers /k :s; 1, O"k, k = 0, 1, ... , so that
inf /k > 0,
k
inf O"k > O.
k
Choose a starting point Xo E IRn .
(b) For each k = 0, L ... , obtain Xk+I from Xk as follows:
(a) Select
define
and determine the smallest integer j 2> 0 such that
UJ) Determine 7 ~ {O, 1, ... ,j}, such that hk(Pk2-i) is minimum and
let Ak := p2- i , Xk+l := Xk - AkSk·
[Note that h(Xk+I) = minI::;i::;j hk(Pk2-i).]
It is easily seen that an integer j 2> 0 exists with the properties
(5.4.l.7ba): If Xk is a stationary point, then j = O. If Xk is not stationary,
then the existence of j follows immediately from Lemma (5.4.l.2) applied
to x := Xk. In any case j (and Ak) can be found after a finite number of
steps.
The modified process (5.4.l.7) satisfies an analog to (5.4.l.4):
5.4 A Modified Newton Method
307
(5.4.1.8) Theorem. Under the hypotheses of Theorem (5.4.l.4) each sequence {xd produced by a method of the type (5.4.l. 7) satisfies the conclu8ions of Theorem (5.4.1.4).
PROOF. We assume as before that
x is an accumulation point of a sequence
{xd defined by (5.4.l.7), but not a stationary point, i.e.,
Dh(x) 1= O.
Again, without loss of generality, let lim Xk = x. Also let (J := infk (Jk >
0, r := infk rk > O. According to Lemma (5.4.l.2) there is a neighborhood
U(x) and a number ..\ > 0 such that
(5.4.l.9)
h(x ~ fJs) ::;; h(x) ~ fJ~IIDh(x)11
for all x E U(x), s E Db, x), 0::;; fJ ::;; ..\. Again, the fact that limk Xk = x,
that Dh(x) is continuous, and that Dh(x) 1= 0 imply the existence of a ko
such that
(5.4.1.10a)
Xk E U(x),
(5.4.1.10b)
IIDh(.1:k)11 ~ ~IIDh(x)ll·
for all k ~ k o.
We need to show that there is an c > 0 for which
Note first that (5.4.l.lO) and rk ~ ~( imply
Consequently, according to the definition of Xk+l and j
(5.4.l.11)
Now let J ~ 0 be the smallest integer satisfying
(5.4.l.12)
According to (5.4.l.11), J ::;; j, and the definition of Xk+l we have
(5.4.1.13)
There are two cases:
308
5 Finding Zeros and Minimum Points by Iterative Methods
Case 1, J = O. Note that Pk = O"kIIDh(Xk)11 ::" 0"/21IDh(x)ll. Then (5.4.1.12)
and (5.4.1.13) imply
h(Xk+1) ::; h(Xk) - PkiIIDh(x)11
O"'Y
2
::; h(Xk) -16IIDh(x)11 = h(Xk) - E1
with E1 > 0 independent of Xk.
Case 2, J > O. From the minimality of J we have
hk(PkTC3-1)) > h(Xk) - Pk2-C3-1)iIIDh(x)11
::" h(Xk) - Pk2-(]-1)~IIDh(x)ll.
Because Xk E U(k) and Sk E Dhk, Xk) ~ Dh, Xk), it follows immediately
from (5.4.1.9) that
PkT(3-1) > A.
Combining this with (5.4.1.12) and (5.4.1.13) yields
A,
c
h(Xk+l) ::; hdpk 2- J ) ::; h(Xk) -16IIDh(x)11 = h(Xk) - E2
with E2 > 0 independent of Xk.
Hence, for E = min( El, E2)
for all k ::" ko, contradicting the fact that h(Xk) ::" h(x) for all k. Therefore
x is a stationary point of h.
D
5.4.2 Application of the Convergence Criteria to the Modified
Newton Method
In order to solve the equation f(x) = 0, we let h(x) := f(x)T f(x) and
apply one of the methods (5.4.1.3) or (5.4.1.7) to minimize h(x). We use
the Newton direction
as the search direction Sk to be taken from the point Xk. This direction
will be defined if Df(Xk)-l exists and f(Xk) i= O. (11.11 denotes the Euclidean norm.) To apply the theorems of the last section requires a little
preparation. vVe show first that, for every x such that
d = d(x) := Df(x)
-1
d
f(x) and S = s(x) = TIdIT
5.4 A Modified Newton Method
309
exist [i.e., Df(x)~l exists and d#- 0], we have
1
s E Db,x), for all 0 < 1':::; ;y(x), ;y(x) := cond(Df(x))'
(5.4.2.1)
In the above,
IIDf(x)11 := lub(Df(x))
and
cond(Df(.1:)) := IIDf(.1:)~lIIIIDf(x)11
are to be defined with respect to the Euclidian norm.
PROOF. Since h(x) =
f(xf f(x), we have
Dh(x) = 2F(x)Df(x).
(5.4.2.2)
The inequalities
IIF(x)Df(x)11 :::; IIDf(x)11 Ilf(x)ll,
IIDf(x)~lf(x)ll:::; IIDf(x)~lllllf(x)11
clearly hold, and consequently
Dh(x)s
IIDh(x)11
-,;---'f-o'(.,-'X).,.-T-c:-D-,:-f.,-'(x-,'-:).,.-Dc'-f~(x:,:-)~-:-l_f-,:-(x-:--)-:--:c > ___
1 _ _ > O.
IIDf(x)~l f(x)11
IlfT(x)Df(x)11 ~ cond(Df(x))
Now, for all "y with 0 < "y:::; l/cond(Df(x)), it follows that s E Db,x)
according to the definition of Db, x) given in (5.4.1.1).
D
As a consequence of (5.4.2.2) we observe: If Df(x)~l exists, then
Dh(x) = 0 ¢} f(x) = 0,
(5.4.2.3)
i.c., x is a stationary point of h if and only if x is a zero of f.
Consider the following modified Newton method [cf. (5.4.1.7)]:
(5.4.2.4).
(a) Select a starting point Xo E IRn.
(b) For each k = 0, 1, ... define Xk+l from Xk as follows:
(0:) Set
1
1'k:= cond(Df(xk)) ,
and let hk(r) := h(Xk ~rdk)! where h(x) := f(x)T f(x). Determine
the smallest integer j 2: 0 satisfying
310
5 Finding Zeros and Minimum Points by Iterative Methods
(p) Determine Ak and thereby Xk+1 := Xk - Akdk so that
h(Xk+d = min hd2-i).
O:Si:Sj
As an analog to Theorem (5.4.1.8) we have
(5.4.2.5) Theorem. Let f: JRn --+ JRn be a g'iven function, and let Xo E JRn
be a point with the following pmperties:
(a) The set K := {xlh(x) ::; h(xo)}, where h(x) := f(x)T f(x), is compact;
(b) f is continuously differentiable on some open set containing K;
(c) D f (x) -1 exists for all x E K.
Then the sequence {xd defined by (5.4.2.4) is well defined and satisfies the
following:
(1) Xk E K for all k = 0, 1, ... , and {xd has at least one accumulation
point x E K.
(2) Each accumulation point j; of {xd is a zem of f, f(x) = o.
PROOF. By construction, {h(Xk)} is monotone:
Hence Xk E K, k = 0, 1, .... Because of assumption (c), d k and lk are
well defined if Xk is defined. From (5.4.2.1)
where Sk := dk/lldkll·
As was the case for (5.4.1.7), there is a j :::: 0 with the properties given
in (5.4.2.4). Hence Xk+1 is defined for each Xk.
Now (5.4.2.4) becomes formally identical to the process given by
(5.4.1.7) if O"k is defined by
The remainder of the theorem follows from (5.4.1.8) as soon as we establish
that
inf lk > 0, inf O"k > O.
k
k
According to assumptions (b) and (c), D f (x) -1 is continuous on the
compact set K. Therefore cond(Df(x)) is continuous, and
1
,:= max cond(D f(x)) > 0
xEK
5.4 A Modified Newton Method
311
exists.
Without loss of generality, we may assume that Xk is not a stationary
point of h; which means that it is no zero of f, because of (5.4.2.3) and
assumption (c). [If f(Xk) = 0, then it follows immediately that Xk = Xk+1 =
Xk+2 = .. " and there is nothing left. to show.] Thus, since .7:k E K, k = 0,
1, ... ,
inf /k 2': / > 0.
On the other hand, from the fact that f(Xk) # 0, from (5.4.2.2), and
from the inequalities
Ildkll = IIDf(Xk)
-1
1
f(Xk)ll2': IIDf(Xk)llllf(Xk)ll,
IIDh(Xk)11 ::; 2 ·IIDf(Xk)11 Ilf(x)ll,
it follows immediately that
[from the continuity of Df(x) in the set K, which is compact]. Thus, all the
results of Theorem (5.4.1.8) [or (5.4.1.4)] apply to the sequence {xd. Since
assumption (c) and (5.4.2.3) together imply that each stationary point of
h is also a zero of f, the proof is complete.
0
The method (5.4.2.4) requires that Dh(Xk) and /k := 1/ cond(D f(Xk))
be computed at each iteration step. The proof of (5.4.1.8), however, shows
that it would be sufficient to replace all ~(k by a lower bou.nd / > 0, /k 2':
/ > 0. In accord with this, Ak is usually determined in practice so that
However, since this only requires that /k > 0, the methods used for the
above proofs are not sufficiently strong to guarantee the convergence of this
variant.
A further remark about the behavior of (5.4.2.4): In a sufficiently small
neighborhood of a zero x (with nonsingular D f(x)) the method chooses
/k = 1 automatically. This means that the method conforms to the ordinary
Newton method and converges quadratically. We can see this as follows:
Since limk->cx) Xk = x and f(x) = 0, there is a neighborhood V1 (x) of x
in which every iteration step Zk --t Zk+ 1 which would be caTIied out by the
ordinary Newton method would satisfy the condition
(5.4.2.6)
and
(5.4.2.7)
312
5 Finding Zeros and Minimum Points by Iterative Methods
Taylor's expansion of f about x gives
f(x) = Df(x)(x - x) + o(llx - xii)·
Since
Ilx - xII/IIDf(x)-lll ::; IIDf(x)(x - x)11 ::; IIDf(x)llllx - xii
and limx--+x o(llx - xll)/llx - xii = 0, there is another neighborhood V2 (x)
of x such that
for all x E V2 (x).
Choose a neighborhood
and let ko be such that
Xk E U(x)
for k ::" k o. This is possible because limk--+oo Xk = x. Considering
and using (5.4.2.6), (5.4.2.7), we are led to
h(Xk+l) ::; 41IDf(x)1121Ixk+l - xl1 2 ::; 16a2r211xk - xI1 2h(Xk)
::; h(xk)(l - ~).
From (5.4.2.4bQ),
This implies
That is, there exists a ko such that for all k ::" ko the choice j = 0 and Ak = 1
will be made in the process given by (5.4.2.4). Thus (5.4.2.4) is identical
to the ordinary Newton method in a sufficiently small neighborhood of x,
which means that it is locally quadratically convergent.
Assumption (a)-(c) in Theorem (5.4.2.5) characterize the class of functions for which the algorithm (5.4.2.4) is applicable. In one of the assigned
problems for this chapter [Exercise 15], two examples will be given of function classes which satisfy (a)-(c).
5.4 A Modified Newton Method
313
5.4.3 Suggestions for a Practical Implementation of the
Modified Newton Method. A Rank-One Method Due to
Broyden
Newton's method for solving the system f(x) = 0, where f:IRn -+ IR n ,
is quite expensive even in its modified form (5.4.2.4), since the Jacobian
D f(Xk) and the solution to the linear system D f(Xk)d == f(Xk) must be
computed at each iteration. The evaluation of explicit formulas for the
components of D f(x) for given x is usually complicated and costly even
if explicit formulas for f(.) and Df(.) are available. In many cases this
work is alleviated by some recent and surprisingly efficient methods for
computing the values of derivatives like D f (x) at a given point x exactly
(methods of automatic differentiation). These methods presuppose that the
function f(x) is built up from simple differentiable functions with known
derivatives which allows the repeated use of the chain rule to compute the
wanted derivatives recursively. For an exposition of these methods we have
to refer the reader to the relevant literature [see e.g. Griewank (2000)].
If these assumptions are not satisfied, if for instance explicit formulas
for f (x) are not available (f (x) could be the result of a measurement or of
another lenghty computation), one could resort to the following techniques:
One is to replace
Df( ) = (af(x)
af(x))
x
axl , ... , ax n
at each x = x k by a matrix
(5.4.3.1)
L1f(x) = (L1d, .. ·, L1nf),
where
L1d(x): = f(xl, ... ,xi + hi, ... ,x~: - f(x 1 , ... ,Xi, ... ,xn)
f(x + hiei) - f(x)
hi
that is, we may replace the partial derivatives a f / axi by suitable difference
quotients L1d. Note that the matrix L1f(x) can be computed with only n
additional evaluations of the function f (beyond that required at the point
x = Xk). However it can be difficult to choose the stepsizes hi. If any hi is
too large, then L1f(x) can be a bad approximation to Df(x), so that the
iteration
(5.4.3.2)
converges, if it converges at all, much more slowly than (5.4.2.4). On the
other hand, if any hi is too small, then f(x+hiei) ~ f(x), and cancellations
can occur which materially reduce the accuracy of the difference quotients.
314
5 Finding Zeros and Minimum Points by Iterative Methods
The following compromise seems to work the best: if we assume that all
components of f(x) can be computed with a relative error of the same order
of magnitude as the machine precision eps, then choose hi so that f(x) and
f(x + hiei) have roughly the first t/2 digits in common, given that i-digit
accuracy is being maintained inside the computer. That is,
Ihil IILld(x)11 ~ v'ePsllf(x) II·
In this case the influence of cancellations is usually not too bad.
If the function f is very complicated, however, even the n addiional
evaluations of f needed to produce Llf(x) can be too expensive to bear
at each iteration. In this case we try replacing D f(Xk) by some matrix
Bk which is even simpler than Llf(Xk). Suitable matrices can be obtained
using the following result due to Broyden (1965).
(5.4.3.3) Theorem. Let A and B be arbitrary n x n matrices; let bE IRn ,
and let F: IRn ---+ IRn be the affine mapping F( u) := Au + b. Suppose x,
x' E IRn are distinct vectors, and define p, q by
p := x' - x,
q:= F(X') - F(x) = Ap.
Then the n x n matrix B' given by
1 ( q - Bp )p T ,
B I := B + --;yp p
satisfies
with respect to the Euclidean norm, and it also satisfies the equation
B'p= Ap= q.
PROOF. The equality (B'_A)p = 0 is immediate from the definition of B'.
Each vector u E IRn satisfying IIul12 = 1 has an orthogonal decomposition
of the form
u = O!p + V,
v T p = 0,
Ilv112::; 1,
O! E IR.
Thus it follows from the definition of B' that, for IIul12 = 1,
II(B' - A)u112 = II(B' - A)v112 = II(B - A)v112
::; lub 2(B - A)llvl12 ::; lub 2(B - A).
Hence
lub 2(B ' - A) =
sup II(B ' - A)u112 ::; lub 2(B - A).
Ilu112=1
0
5.4 A Modified Newton Method
315
This result shows that the Jacobian Df(x) == A of an affine function F
is approximated by B' at least as well as it is approximated by B, and
furthermore B' and DF(x) will both map P into the same vector. Since a
differentiable nonlinear function f: IRn --+ IRn can be approximated to first
order in the neighborhood of one of its zeros x by an affine function, this
suggests using the above construction of B' from B even in the nonlinear
case. Doing so yields an iteration of the form
dk := B-;;l f(Xk),
(5.4.3.4)
Xk+l := Xk - Akdk,
Pk := Xk+1 - Xk, qk:= f(Xk+1) - f(Xk),
1
T
Bk+1 := Bk + --;y-(qk - BkPk)Pk·
Pk Pk
The formula for Bk+1 was suggested by Broyden. Since rank(Bk+1 -Bk) s:
1, it is called Broyden's rank-one update. The stepsizes Ak may be determined from an approximate minimization of Ilf(x)112:
using, for example, a finite search process
(5.4.3.5)
as in (5.4.2.4).
A suitable starting matrix Bo can be obtained using difference quotients: Bo = L1f(xo). It does not make good sense, however, to compute all following matrices B k , k ?:: 1, from the updating formula. Various suggestions have been made about which iterations of (5.4.3.4) are
to be modified by replacing Bk + (1/p[Pk)(qk ~ BkPk)p[ with L1f(Xk)
("reinitialization"). As one possibility we may obtain B~;+l from Bk using Broyden's update only on those iterations where the step produced by
(5.4.3.5) lies in the interval 2- 1 s: A s: 1. A justification for this is given by
observing that the bisection method (5.4.3.5) automatically picks Ak = 1
when Ilf(xk+1)11 < Ilf(Xk - dk)ll· The following result due to Broyden,
Dennis, and More (1973) shows that this will be true for all Xk sufficiently
close to x (in which case making an affine approximation to f is presumably
justified):
Under the assumption that
(a) Df(x) exists and is continuous in a neighborhood U(:i:) of a zero point
x,
(b) IIDf(x) - Df(x) II s: Allx - xii for some A > 0 and all x E U(x),
(c) Df(x)-l exists,
316
5 Finding Zeros and Minimum Points by Iterative Methods
then the iteration (5.4.3.4) is well defined using .Ak = 1 for all k :2:: 0 (i.e.,
all Bk are nonsingular) provided Xo and Bo are "sufficiently close" to x and
D f (x). Moreover, the iteration generates a sequence {x d which converges
superlinearly to x,
lim Ilxk+1 - xii = 0
k--+oo Ilxk - xii
'
if Xk -I- x for all k :2:: o.
The direction dk = Bi: 1f(Xk) appearing in (5.4.3.4) is best obtained by
solving the linear system Bkd = f(Xk) using a decomposition FkBk = Rk
of the kind given in (4.9.1). Observe that the factors H+1, Rk+1, of Bk+1
can be found from the factors Fk , R k , of Bk by employing the techniques
of Section 4.9, since modification of Bk by a rank-one matrix is involved.
We remark that all of the foregoing also has application to function
minimization as well as the location of zeros. Let h: ffin -+ ffi be a given
function. The minimum points of h are among the zeros of f(x) = \lh(x).
Moreover, the Jacobian Df(x) is the Hessian matrix \l2h(x) of h; hence it
can be expected to be positive definite near a strong local minimum point
x of h. This suggests that the matrix Bk which is taken to approximate
\l2h(Xk) should be positive definite. Consequently, for function minimization, the Broyden rank-one updating formula used in (5.4.3.4) should be
replaced by an updating formula which guarantees the positive definiteness
of Bk+l given that Bk is positive definite. A number of such formulas have
been suggested. The most successful are of rank two and can be expressed
as two stage updates:
B k+1/ 2 := Bk + DkUkUI,
Bk+l := Bk+l/2 - (3k v kV[,
Dk > 0,
(3k > 0,
with suitable numbers Dk, (3k and vectors Uk, Vk so that Bk+1/2 is guaranteed to be positive definite as well as B k +1. For such updates the
Cholesky decomposition is the most reasonable to use in solving the system
Bkd = f(Xk). More details on these topics may be found in the reference by
Gill, Golub, Murray, and Saunders (1974). Rank-two updates to positive
definite approximations Hk of the inverse [\l2h(Xk)]-1 of the Hessian are
described in Section 5.11.
5.5 Roots of Polynomials. Application of Newton's
Method
Sections 5.5-5.8 deal with roots of polynomials and some typical methods
for their determination. There are a host of methods available for this
purpose which we will not be covering. See, for example, Bauer (1956),
5.5 Roots of Polynomials. Application of Newton's Method
317
Jenkins and Traub (1970), Nickel (1966), and Henrici (1974), to mention
just a few.
The importance of general methods for determining roots of general
polynomials may sometimes be overrated. Polynomials found in practice
are frequently given in some special form, such as characteristic polynomials
of matrices. In the latter case, the roots are eigenvalues of matrices, and
methods to be described in Chapter 6 are to be preferred.
We proceed to describe how the Newton method applies to finding the
roots of a given polynomial p(x). In order to evaluate the it.eration function
of Newton's method,
P(Xk)
Xk+l := Xk - p'(Xk)'
we have to calculate the value of the polynomial p, as well as the value of
its first derivative, at the point x = Xk. Assume the polynomial p is given
in the form
p(X) = aox n- 1 + ... + an.
Then p(Xk) and P'(Xk) can be calculated as follows: For :r =~,
The multipliers of ~ in this expression are recursively of the form
(G.G.l)
bo : = ao
b; : = bi-l~ + ai
i = 1, 2, ... , n.
The value of the polynomial p at ~ is then given by
The algorithm for evaluating polynomials using the recursion (5.5.1) is
known as Horner's scheme. The quantities bi , thus obtained, are also the
coefficients of the polynomial
Pl(X) := box n - 1 + b1x n - 2 + ... + bn - 1
which results if the polynomial p( x) is divided by x - ~:
(5.5.2)
p(X) = (x - ~)pdx) + bn .
This is readily verified by comparing the coefficients of the powers of x on
both sides of (5.5.2). Furthermore, differentiating the relation (.'>.5.2) with
respect to x and setting x = ~ yields
Therefore, the first derivative p' (~) can be determined by repeating the
Horner scheme, using the results bi of the first as coefficients for the second:
5 Finding Zeros and l\1inimum Points by Iterative Methods
318
p'(O = (- .. (bo~ + bd~ + .. X + bn - 1 ·
Frequently, however, the polynomial p(x) is given in some form other
than
p(x) = aoxn + ... + an.
Particularly important is the case in which p(x) is the characteristic polynomial of a symmetric tridiagonal matrix
["'
,62
0
('h
J=
0:;, ;3i real.
,Bn
0
(in
an
Denoting by Pi(X) the characteristic polynomial
P2(X) := det
([
0
0:1~X
i-h
o
3i
;3i
0:; ~ x
)
of the principal submatrix formed by the first i rows and columns of the
matrix J, we have recursions
Po(x) := l.
(5.5.3)
P1(X) := (0:1 ~ x) . 1,
Pi(X) := (O:i ~ x)Pi-dx) ~ 8;Pi-2(X),i = 2, 3, ... , n,
p(x) := det(J ~ .T1) := Pn(x).
These can be used to calculate p(~) for any x = ~ and any given matrix
elements ai, ;3i. A similar recursion for calculating p' (x) is obtained by
differentiating (5.5.3):
(5.5.4)
p~(x) := 0,
p~(x) := ~l.
p;(x) := ~Pi-l(X) + (0:; ~ X)P;_l(X) ~ 6;P;_2(X),i = 2, 3, ... , n,
p'(x) := p~(x).
The two recursions (5.5.3) and (5.5.4) can be evaluated concurrently.
During our general discussion of the Newton method in Section 5.3 it
became clear that the convergence of a sequence Xk towards a zero'; of
a function is assured only if the starting point Xo is sufficiently close to
~. A bad initial choice Xo may cause the sequence Xk to diverge even for
polynomials. If the real polynomial p(x) has no real roots [e.g., p(x) =
5.5 Roots of Polynomials. Application of Newton's Method
319
X2 + 1], then the Newton method must diverge for any initial value Xo E lR.
There are no known fail-safe rules for selecting initial values in the case of
arbitrary polynomials. However, such a rule exists in an important special
case, namely, if all roots ~, i = 1, 2, ... , n, are real:
In Section 5.6, Theorem (5.6.5), we will show that the polynomials defined
by (5.5.3) have this property if the matrix elements ai, Pi are real.
(5.5.5) Theorem. Let p(x) be a polynomial of degree n > 2 with real
coefficients. If all roots Ei,
ofp(x) are real, then Newton's method yields a strictly decreasing sequence
Xk converging to 6 for any initial value Xo > 6.
PROOF. Without loss of generality, we may assume that p(xo)
p( x) does not change sign for x > ~, we have
> O. Since
p(x) = aoxn + ... + an > 0
for x > 6 and therefore ao > O. The derivative p' has n - 1 real zeros ai
with
6 2: 001 2: 6 2: 002 2: ... 2: a n -1 2: ~n
by Rolle's theorem. Since p' is of degree n - 1 2: 1, these are all its roots,
and p'(x) > 0 for x > 001 because ao > O. Applying Rolle's theorem again,
and recalling that n 2: 2, we obtain
(5.5.6)
p"(x) > 0
for x > 001,
p"'(x) 2: 0
for x 2: 001.
Thus p and p' are convex functions for x 2: 001.
Now Xk > ~ implies that
p(Xk)
Xk+1 = Xk - -,-(-) < Xk,
p Xk
since p'(Xk) > 0, P(Xk) > O. It remains to be shown, that we do not
"overshoot", i.e., that Xk+1 > 6· From (5.5.6), Xk > 6 2: 001, and Taylor's
theorem we conclude that
0= p(6) = P(Xk) + (6 - Xk)p'(Xk) + ~(6 - Xk)2p"(c5) , 6 < c5 < Xk,
> p(Xk) + (6 - Xk)p'(Xk).
P(Xk) = p'(Xk)(Xk - xx+!) holds by the definition of Xk+l' Thus
320
5 Finding Zeros and Minimum Points by Iterative Methods
and Xk+1 > 6 follows, since p'(Xk) > O.
D
For later use we note the following consequence of (5.5.6):
(5.5.7) Lemma. p(x) = aoxn + ... + an, ao > 0, be a real polynomial of
degree n ~ 2 all roots of which are real. If III is the largest root of pi, then
pili ~ 0 for x ~ lll' i. e., pi is a convex function for x ~ lll.
We are still faced with the problem of finding a number Xo > 6, without
knowing 6 beforehand. The following inequalities are available for this
purpose:
(5.5.8) Theorem. For all roots ~i of an arbitrary polynomial p(x)
aoxn + ... + an with ao =1= 0,
Some of these inequalities will be proved in Section 6.9. Compare also
Householder (1970). Additional inequalities can be found in Marden (1949)
Quadratic convergence does not necessarily mean fast convergence. If
the initial value Xo is far from a root, then the sequence Xk obtained by
Newton's method may converge very slowly in the beginning. Indeed, if Xk
is large, then
Xk+l
(1)
+... ~ Xk 1 - = Xk - nXXk
n-l
+...
n
k
,
so that there is little change between Xk and Xk+1. This observation has
led to considering the following double-step method:
5.5 Roots of Polynomials. Application of Newton's Method
Xk+l = Xk -
P(Xk)
2 -'-(-)'
P Xk
321
k = 0,1,2, ... ,
in lieu of the straightforward Newton method.
Of course, there is now the danger of "overshooting". In particular,
in the case of polynomials with real roots only and an initial point Xo >
6, some Xk+l may overshoot 6, negating the benefit of Theorem (5.5.5).
However, this overshooting can be detected, and, due to some remarkable
properties of polynomials, a good initial value y (6 ~ y > ~2) with which
to start a subsequent Newton procedure for the calculation of 6 can be
recovered. The latter is a consequence of the following theorem:
(5.5.9) Theorem. Let p(x) be a real polynomial of degree n ~ 2, all roots
of which are real, 6 ~ 6 ~ ... ~ ~n' Let al be the largest root of p'(x):
For n = 2, we require also that 6 > 6. Then for every z > 6, the numbers
,
p(z)
z :=z- p'(z) ,
p(z)
y:=z-2 p '(z)'
,
p(y)
y :=y- p'(y) ,
(see Figure 10) are well defined and satisfy
< y,
(5.5.10a)
al
(5.5.10b)
6 ::; y' ::; z'.
It is readily verified that n = 2 and 6 = 6 imply y = 6 for any z > 6.
PROOF. Assume again that p(z) > 0 for z > 6. For such values z, we
consider the quantities .:1 0, .:11 (Figure 10), which are defined as follows:
1
z'
.:10 := p(z') = p(z') - p(z) - (z' - z)p'(z) =
1
[P'(t) - p'(z)]dt,
z'
.:11 := p(z') - p(y) - (z' - y)p'(y) =
[P'(t) - p'(y)]dt .
.:10 and .:11 can be interpreted as areas over and under the graph of p'(x),
respectively (Figure 11).
By Lemma (5.5.7), p'(x) is a convex function for x ~ al. Therefore, and
because z' - y = z - z' > 0 - the latter being positive by Theorem (5.5.5)
- we have
(5.5.11)
322
5 Finding Zeros and Minimum Points by Iterative Methods
p(x)
Fig. 10. Geometric interpretation of the double-step method.
pl(X)
y
Zl
z
x
Fig. 11. The quantities Llo and Ll1 interpreted as areas.
with equality .d 1 = .do holding if and only if pI is a linear function, that
is, if p is a polynomial of degree 2. Now we distinguish the three cases
5.5 Roots of Polynomials. Application of Newton's l\Iethod
323
y > 6, y = 6, y < 6· For y > 6, the proposition of the theorem follows
immediately from Theorem (5.5.5). For y = 6, we show first that 6 >
(Xl > 6, that is, 6 is a simple root of p. If y = 6 = 6 = (Xl were a
multiple root, then by hypothesis n :->: 3. and consequently Lll < Llo would
hold in (5.5.11). This would lead to the contradiction
p(z') = p(z') - p(6) - (z' - 6)p'(~J) = Ll1 < Llo = p(z').
Thus 6 must be a simple root; hence (Xl < 6 = y' = y < z', and the
proposition is seen to be correct in the second case, too.
The case y < 6 remains. If (Xl < y, then the validity of the proposition
can be established as follows. Since p(z) > 0 and 6 < (Xl < y < 6, we
have p(y) < 0, p'(y) > O. In particular, y' is well defined. Furthermore,
since p(y) = (y - y')p'(y), and Ll1 ::; Llo, we have
Llo - Lll = p(y) + (z' - y)p'(y) = p'(y)(z' - y') :->: O.
Therefore z' :->: y'. By Taylor's theorem, finally,
p(6) = 0 = p(y) + (~l - y)p'(y) + ~(6 - y)2 pl/(b) ,
y < J < 6,
and since pl/(x) :->: 0 for x :->: (Xl, p(y) = (y - y')p'(y), and p'(y) > 0,
0:->: p(y) + (6 - y)p'(:Ij) = p'(Y)(6 - y').
Therefore ~l ::; y'.
To complete the proof, we proceed to show that
(5.5.12)
y = y(z) > (Xl.
for any z > 6. Again we distinguish two cases, 6
6·
If 6 > (Xl > 6, then (5.5.12) holds whenever
> 01 > 6 and 6
a1 =
This is because Theorem (5.5.5) implies z > z' :->: 6, and therefore
y = z' - (z - z') > 6 - (6 - ad = (Xl
holds by the definition of y = y( z). Hence we can select a Zo with
y(zo) > (Xl· Assume that there exists a Zl > 6 with y(Zl) ::; (Xl. By
the intermediate-value theorem for continuous functions, there exists a
z E [ZO,Zl] withy = y(z) = (Xl. From (5.5.11) for z = z,
Lll = p(z') - p(y) - (z' - y)p'(y) = p(z') - p(y) ::; Llo = p(z'),
324
5 Finding Zeros and Minimum Points by Iterative Methods
and therefore p(y) = p( (};1) ~ 0. On the other hand, p( (};1) < 0, since ~1 is a
simple root, in our case, causing p(x) to change sign. This is a contradiction,
and (5.5.12) must hold for all z > 6.
If 6 = (};1 = 6, then by hypothesis n ~ 3. Assume, without loss of
generality. that
p (X) = X
n
+ a1x n-\ + ... + an.
Then
()
z' = z - ~ = z _ ~
an
Z
zn
n 1 + n - 1 a 1 + ... + an -1
p' (Z )
Therefore
a1
1+-+···+~
n
2
Y = Y(z) = z + (z' - z) = z -
z
nZn-1
2: (1 + 0(~ ))
= z (1-~) +0(1).
Since n ~ 3, the value of y(z) increases indefinitely as z --+ co, and we
conclude again that there exists a Zo > 6 with Yo = y(zo) > (};1. If (5.5.12)
did not hold for all z > 6, then we could conclude, just as before, that
there exists z > 6 with y = a( z) = 0'\. However. the existence of such
a value y = (};1 = 6 = 6 has been shown to be impossible earlier in the
proof of this theorem.
D
The practical significance of this theorem is as follows. If we have started
with Xo > 6, then either the approximate values generated by the doublestep method
satisfy
XO~X1~"'~Xk~Xk+1~"'~6
and
limxk=6,
or there exists a first Xko := y such that
In the first case, all values p(Xk) are of the same sign,
p(xo)p(xkl ~ 0,
for all k,
and the Xk converge monotonically (and faster than for the straightforward
Newton method) towards the root 6. In the second case,
5.5 Roots of Polynomials. Application of Newton's Method
325
Xo > Xl > ... > Xko-l > 6 > Y = Xko > (Xl > 6·
Using Yo := Y as the starting point of a subsequent straightforward Newton
procedure,
P(Yk)
Yk+ 1 = Yk - ---,------() , k = 0, 1, ... ,
P Yk
will also provide monotonic convergence:
lim Yk = 6·
k--+oo
Having found the largest root 6 of a polynomial P, there are still the
other roots 6, 6, ... , ~n to be found. The following idea suggests itself
immediately: "divide off" the known root 6, that is, form the polynomial
p(X)
Pl(X):= - - c
X -
<,,1
of degree n - 1. This process is called deflation.
The largest root of PI (x) is 6, and may be determined by the previously
described procedures. Here 6 or, even better, the value Y = Xko found by
overshooting may serve as a starting point. In this fashion, all roots will be
found eventually.
Deflation, in general, is not without hazard, because roundoff will preclude an exact determination of PI (x). The polynomial actually found in
place of PI will have roots different from 6, 6, ... , ~n. These are then
found by means of further approximations, with the result that the last
roots may be quite inaccurate. However, deflation has been found to be
numerically stable if done with care. In dividing off a root, the coefficients
of the deflated polynomial
, n-l + alx
' n-2 + ... +'
PI (X) = aox
an - l
may be computed in the order a~, a~, ... , a~_l ( !orwa1'd deflation) or in
the reverse order (backward deflation). The former is numerically stable if
the root of smallest absolute value is divided off; the latter is numerically
stable if the root of largest absolute value is divided off. A mixed process of
determining the coefficients will be stable for roots of intermediate absolute
value. See Peters and Wilkinson (1971) for details.
Deflation can be avoided altogether as suggested by Maehly (1954). He
expresses the derivative of the deflated polynomial PI (x) as follows:
p~(X) = p'(x) _
x-6
p(x)
.
(X-~l)2'
and substitutes this expression into the Newton iteration function for PI:
326
5 Finding Zeros and Minimum Points by Iterative Methods
In general, we find for the polynomials
that
p'(x) =
J
p'(x)
_
p(x)
. ~ _1_
(x - 6)··· (x - Ej)
(x - 6)··· (x - ';J)
x -';i·
8
The Maehly version of the (straightforward) Newton method for finding
the root ';j+ 1 is therefore as follows:
(5.5.13)
with
<Pj(x) :=x-
p(x)
( ) .
'x - Li=l
j
~
X - ';i
p ()
The advantage of this formula lies in the fact that the iteration given by
<Pj converges quadratically to ';j+ 1 even if the numbers 6, ... , ';.i in <Pj
are not roots of p (the convergence is only local in this case). Thus the
calculation of ~j+1 is not sensitive to the errors incurred in calculating the
previous roots. This technique is an example of zero suppression as opposed
to deflation [Peters and Wilkinson (1971)].
Note that <Pj(x) is not defined if x = ~k' k = L ... , j, is a previous
root of p(x). Such roots cannot be selected as starting values. Instead one
may use the values found by overshooting if the double-step method is
employed.
The following pseUdO-ALGOL program for finding all roots of a polynomial p having only real roots incorporates these features. Function procedures p(z) and p'(z) for the polynomial and its derivative are presumed
available.
zO := starting point Xo;
for j := 1 step 1 until n do
begin ITt := 2: zs := zO;
Iteration: z:= zs; s := 0;
for i := 1 step 1 until j - 1 do
s:=s+lj(z-';i);
zs := p(z); zs := z - ITt X zsj(p'(z) - zs x s);
if zs < z then goto Iteration;
if ITt = 2 then
begin zs := z; ITt := 1: goto Iteration; end;
Ej := z:
end;
5.5 Roots of Polynomials. Application of Newton's Method
327
EXAMPLE. The following example illustrates the advantages of Maehly's method.
The coefficients ai of the polynomial
II (x - Tj) L aix
13
p(x) :=
14
=
14 - i
i=O
j=O
are calculated in general with a relative error of magnitude E. Considerations in
Section 5.8 will show the roots of p(x) to be well conditioned. The following table
shows that the Newton-Maehly method yields the roots of the above polynomial
up to an absolute error of 40E (E = 10- 12 = machine precision). If forward deflation is employed but the roots are divided off as shown, i.e. in order of decreasing
magnitude, then the fifth root is already completely wrong. The absolute errors
below are understood as multiples of E.
(Absolute error) x 1012
Newton-Maehly
Deflation
o
1.0
0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001 953 125
0.000 976 562 5
0.000488281 25
0.000 244 140 625
0.000 122 070 3125
o
3.7 X 102
1.0 X 106
1.4 X 109
6.8
1.1
0.2
4.5
4.0
3.3
39.8
10.0
5.3
> 10 12
o
o
0.4
o
The polynomial Pl(X),
II(x - 213
Pl(X) := p(x) =
x-I
j ),
j=l
has the roots 2- j , j = 1, 2, ... , 13. If ih is produced by numerically dividing
p(x) by x - 1,
ih(x) :=
n( x-I
p(x) ),
then we observe that the roots of Pl(X) are already quite different from those of
Pl (x):
328
5 Finding Zeros and Minimum Points by Iterative Methods
j
Roots of ih(x) computed
by Newton-Maehly
1
2
3
4
5
6
7
8
0.499999996 335
0.250001 00 .. .
0.123697 .. .
0.0924 .. .
-0.0984 .. .
-0.056 .. .
-0.64 .. .
+1.83 .. .
However, if forward deflation is employed and if the roots are divided off
in the sequence of increasing absolute values, starting with the root of smallest
absolute value, then this process will also yield the roots of p(x) up to small
multiples of machine precision [Wilkinson (1963), Peters and Wilkinson (1971)]:
Sequence of dividing off roots of p(x).
13
j
(Absolute error)
12 11 10 9
0.2 0.4
2
5
3
4
3
2
1 14 6 12 6
2
2 12 1
8
7
6
5
1
0
( x 10 12 )
5.6 Sturm Sequences and Bisection Methods
Let p(x) be a real polynomial of degree n,
p(x) = aoxn + alX n- 1 + ... + an,
ao =1= 0.
It is possible [see Henrici (1974) for a thorough treatment of related results]
to determine the number of real roots of p(x) in a specified region by
examining the number of sign changes w(a) for certain points x = a of
a sequence of polynomials Pi(X), i = 0, 1, ... , m, of descending degrees.
Such a sign change happens whenever the sign of a polynomial value differs
from that of its successor. Furthermore, if Pi(a) = 0, then this entry is to
be removed from the sequence of polynomial values before the sign changes
are counted. Suitable sequences of polynomials are the so-called Sturm
sequences.
(5.6.1) Definition. The sequence
p(x) = Po(x), Pl(X), ... , Prn(x)
of real polynomials is a Sturm sequence for the polynomial p(x) if
5.6 Sturm Sequences and Bisection Methods
329
(a) all real roots of Po(x) are simple.
(b) sign PI (.;) = - signp~(';) is a real root of po(x).
(c) For i = 1,2, ... , Tn - 1,
if'; is a real root of Pi(X).
(d) The last polynomial Pm(x) has no real mots.
For such Sturm sequences we have the following
(5.6.2) Theorem. The number of real roots ofp(x) == Po(x) in the interval
a:S x < b equals w(b) -w(a), where w(x) is the number of sign changes of
a Sturm sequence
Po ( x ), ... , Pm (X)
at location x.
Before proving this theorem, we show briefly how a simple recursion can
be used to construct a Sturm sequence for the polynomial p(x), provided
all its real roots are simple. We define initially
Po(x) := p(x),
PI(,r):= -pG(x) = -p'(x),
and form the remaining polynomials PHI (x) recursively, dividing Pi-I(X)
by Pi(X),
(5.6.3)
where
Pi-I(X) = qi(x)Pi(:r) - CiPHI(X),
i = 1, 2, ... ,
[degree of Pi(X)] > [degree of Pi+I(X)]
and the constants Ci > 0 are positive but otherwise arbitrary. This recursion is the well-known Euclidean algorithm. Because the degree of the
polynomials decreases, the algorithm must terminate after m :S n steps:
The final polynomial Pm (X) is a greatest common divisor of the two initial
polynomials p( x) and PI (x) = -P' (:[). (This is the purpose of the Euclidean
algorithm.) If all real roots of p( x) are simple, then p( x) and pi (x) have no
real roots in common. Thus Pm(x) has no real roots and satisfies (5.6.1d). If
Pi (0 = 0, then (5.6.3) gives Pi-I (.;) = -CiPHdO· Assume that PHl (.;) =
O. Then (5.6.3) would imply P,+1(';) = ... = Pm(O = 0, contradicting
Pm(';) i- O. Thus (5.6.1c) is satisfied. The two remaining conditions of
(5.6.1) are immediate.
330
5 Finding Zeros and Minimum Points by Iterative Methods
PROOF OF THEOREI\! (5.6.2). vVe examine how a perturbation of the value
a E IR affects the number of sign changes w(a) of the sequence
po(a), Pl(a), ... , Pm(a).
So long as a is not a root of any of the polynomials Pi (x), i = 0, 1. ... , m,
there is, of course no change. If a is a root of Pi(X), we consider the two
cases i > 0 and i = O.
In the first case, i < m by (5.6.1d) and Pi+l(a) f 0, Pi-l(a) f 0 by
(5.6.1c). Then for a sufficiently small perturbation h > 0, the signs of the
polynomials Pj (a), j = i ~ 1, i, i + 1, display the behavior illustrated in one
of the following four tables:
a~h
a
a+h
i ~ 1
i +1
i ~ 1
0
±
+
+
+
a~h
a
a+h
i ~ 1
i +1
+
+
±
+
+
a
a+h
+
+
+
0
±
a~h
a
a+h
+
+
+
+
i +1
i ~1
0
a~h
i +1
0
±
In each instance, w(a ~ h) = w(a) = w(a + h): the number of sign changes
remains the same.
In the second case, we conclude from (5.6.1b) that the following sign
patterns pertain:
a~h
0
1
a
a+h
0
+
0
1
a~h
a
+
+
0
+
a+h
+
In each instance, w(a ~ h) = w(a) = w(a + h) ~ 1: exactly one sign change
is gained as we pass through a root of Po(x) == p(x).
For a < b and sufficiently small h > 0,
w(b) ~ w(a) = w(b ~ h) ~ w(a ~ h)
indicates the number of roots of p( x) in the interval a ~ h < x < b ~ h.
Since h > 0 can be chosen arbitrarily small, the above difference indicates
the number of roots also in the interval a ::; x < b.
0
5.6 Sturm Sequences and Bisection Methods
331
An important use of Sturm sequences is in bisection methods for determining the eigenvalues of real symmetric matrices which are tridiagonal:
o
J=
o
Recall the characteristic polynomials pi(X) of the principal submatrix
formed by the first i rows and columns of the matrix J, which were mentioned before as satisfying the recursion (5.5.3):
Pl(X):= al - x,
Po(X) := 1,
Pi(X) := (ai - X)Pi-l(X) - P;Pi-2(X),
i = 2, 3, ... , n.
The key observation is that the polynomials
(5.6.4)
Pn(X), Pn-l(X), ... , po(x)
are a Sturm sequence for the characteristic polynomial Pn(x) = det(J - xl)
[note that the polynomials (5.6.4) are indexed differently than in (5.6.1)]
provided the off-diagonal elements Pi, i = 2, ... , n, of the tridiagonal
matrix J are all nonzero. This is readily apparent from the following
(5.6.5) Theorem. Let ai, Pj be real numbers and Pj =I 0 for j = 2, ... , n.
Suppose the polynomials Pi(X), i = 0, ... , n, are defined by the recursion
(5.5.3). Then all roots x~i), k = 1, ... , i, of Pi, i = 1, ... , n, are real and
simple:
X~i) > X~i) > ... > X~i),
and the roots of Pi-l and Pi, respectively, separate each olher strictly:
Xli) > x(i-l) > Xli) > x(i-l) > ... > x(i-l) > xli)
1
1
2
2
,-1,
.
PROOF. We proceed by induction with respect to i. The theorem is plainly
true for i = l. Assume that it is true for some i 2': 1, that is, that the roots
x k(i) and x k(i-I) 0 f Pi an d Pi-I, respec t'lve 1y, sat'IS f y
(5.6.6)
> Xi'
(i)
Xl(il > Xl(i-I) > X2(i) > X2(i-I) > . . . > Xi(i-I)
- 1
By (5.5.3), Pk is of the form Pj (x) = (-I)J x j + .. '. In particular, the degree
of Pk equals k. Thus Pi-l(X) does not change its sign for x > X~i-ll, and
since the roots x > x~i-ll are all simple, (5.6.6) gives immediately
5 Finding Zeros and Minimum Points by Iterative Methods
332
(5.6.7)
sign Pi-I (xki )) = (_l)Hk
for k = 1, 2, ... , i.
Also by (5.5.3),
PHI ( X k(i)) -_ - (32HI Pi-I (X k(i)) ,
k = 1, 2, ... , i.
Since (3;+1 > 0,
sign Pi+l (xki)) = (_l)i+k+1,
k = 1, 2, ... , i,
sign PHI (+00) = (-l)Hl,
sign PHI (-00) = 1
holds, and PH1(X) changes sign in each of the open intervals (xii),+oo),
( -00, Xi(i)) , ((i)
X + ' X(i)) ' k = 1, ... , z. - 1. The roots X(HI) 0 f Pi+I are
k I
k
k
therefore real and simple, and they separate the roots xki ) of Pi:
o
The polynomials of the above theorem,
Pn(X),Pn-I(X), ... ,Po(x),
indeed form a Sturm sequence: By (5.6.5), Pn(x) has simple real roots
6 > 6> ... > ~n' and by (5.6.7)
signpn-l(~k) = (_1)n+k,
signp~(~k) = (_l)n+k+1 = -signpn-I(~k)'
for k = 1, 2, ... , n.
For x = -00 the Sturm sequence (5.6.4) has the sign pattern
+, +, ... ,+.
Thus w( -(0) = 0. By Theorem (5.6.2), w(JJ) indicates the number of roots
+ 1 - i holds if and only if ~i < JJ.
The bisection method for determining the ith root ~i of Pn(x) (6 >
6 > ... > ~n) now is as follows. Start with an interval
~ of Pn(x) with ~ < JJ: w(JJ) ::::: n
[aD, bol
which is known to contain ~, e.g., choose aD < ~n' 6 < boo Then divide
this interval with its midpont and check by means of the Sturm sequence
which of the two subintervals contains ~i' The subinterval which contains
~i is again divided, and so on. More precisely, we form for j = 0, 1, 2, ...
5.7 Bairstow's Method
333
if w(J-tj) ~ n + 1 - i,
if w(J-tj) < n + 1 - i,
if w(J-tj) ~ n + 1 - i,
if w(J-tj) < n + 1 - i.
Then
[aj+l' bj+ll c [aj,bj ],
laj+l - bj+ll = laj - bj 1/ 2,
~i E [aj+l' bj+ll·
The quantities aj increase, and the quantities bj decrease, to the desired
root ~i. The convergence process is linear with convergence rate 0.5. This
method for determining the roots of a real polynomial all roots of which
are real is relatively slow but very accurate. It has the additional advantage
that each root can be determined independently of the others.
5.7 Bairstow's Method
If a real polynomial has any complex conjugate roots, they cannot be found
using the ordinary Newton's method if it is carried out in real arithmetic
and begun at a real starting point: complex starting points and complex
arithmetic must be used. Bairstow's method avoids complex arithmetic.
The method follows from the observation that the roots of a real quadratic
polynomial
x 2 - rx - q
are roots of a given real polynomial
p(x) = aoxn + ... + an,
ao =f. 0,
if and only if p(x) can be divided by x 2 - rx - q without remainder. Now
generally
(5.7.1)
p(x) = Pl(X)(X 2 - rx - q) + Ax + B,
where the degree of Pl is n - 2, and the remainder has been expressed as
Ax + B. The coefficients of the remainder depend, of course, upon rand
q, that is
A = A(r, q), and B = B(r, q),
and the remainder vanishes when r, q satisfy the system
(5.7.2)
A(r, q) = 0,
B(r, q) = o.
334
5 Finding Zeros and Minimum Points by Iterative Methods
Bairstow's method is nothing more than Newton's method (5.3) applied to
(5.7.2):
ri+l]
[qi+l
(5.7.3)
8A
= [ri] _ [ 8r
qi
8B
8r
8A
8q
8B
8q
]-1 [A(ri,qi)].
B(ri' qi)
r=r;
q=qi
In order to carry out (5.7.3), we must first determine the partial derivatives
8A
Ar= 8r'
8A
Aq = 8q'
8B
Br = 8r'
8B
Bq = 8q.
Now (5.7.1) is an identity in r, q, and x. Hence, differentiating with respect
to rand q,
(5.7.4)
After a further division of PI by (X2 - rx - q) we obtain the representation
(5.7.5)
Assuming that X2 - rx - q = 0 has two distinct roots Xo, Xl, it follows that
for X = Xi, i = 0, 1. Therefore the equations
-xi(AlXi + B l ) + ArXi + Br = 0 } ,
-(AlXi + B l ) + AqXi + Bq = 0
follow from (5.7.4).
From the second of these equations we have
(5.7.6)
since Xo =I Xl, and therefore the first equation yields
-x;Aq + xi(Ar - Bq) + Br = 0,
i = 0, 1.
Since x; = rXi + q it follows that
i = 0, 1,
and therefore
i = 0, 1,
5.8 The Sensitivity of Polynomial Roots
335
Ar - Bq - r Aq = 0,
Br - qAq = 0,
since Xo #- Xl.
Putting this together with (5.7.6) yields
Aq = AI,
Bq = B I ,
Ar = r Al + B I ,
Br = q AI·
The values A, B (or AI, Bd can be found by means of a Horner-type
scheme. Using p(x) = aoxn + ... + an, PI(X) = buxn-2 + ... + bn - 2 and
comparing coefficients, we find the follwoing recursion for A, b, bi from
(5.7.2):
bo := ao,
bI := bo r + aI,
bi := bi - 2 q + bi - 1 r + ai,
for
i = 2, 3, ... , n - 2,
A := bn - 3 q + bn - 2 r + an-I,
B := bn- 2 q + an·
Similarly, (5.7 ..5) shows how the bi can be used to find Al and B 1 .
5.8 The Sensitivity of Polynomial Roots
We will consider the condition of a root ~ of a given polynomial p(x). By
this we mean the influence on ~ of a small perturbation of the coefficients
of the polynomial p(x):
p(x) --+ pc(x) = p(x) + cg(x),
where g(x) 1'- 0 is an arbitrary polynomial.
Later on it will be shown [Theorem (6.9.8)] that if ~ is a simple root
of p, then for sufficiently small absolute values of c there exists an analytic function ~(c), with ~(O) = ~, such that ~(c) is a (simple) root of the
perturbed polynomial Pc (x):
p(~(c))
+ cg(~(c)) == O.
From this, by differentiation with respect to c, we get for k := ((0) the
equation
kp'(~(O)) + g(~(O)) = 0,
so that
k = -g(O
pl(~)
.
336
5 Finding Zeros and Minimum Points by Iterative Methods
Thus, to a first order of approximation (i.e., disregarding terms in powers
of E greater than 1), based on the Taylor expansion of ~(E) we have
.
g(~)
~(E) = ~ - Ep'(~)"
(5.8.1)
In the case of a multiple root ~, of order m, it can be shown that
p(x) + Eg(X) has a root of the form
~(E) = ~ + h(E l/m ),
where h(t) is, for small Itl, an analytic function with h(O) = O. Differentiating m times with respect to t and noting that p(~) = p'(~)
p(m-l)(~) = 0, p(m)(~) =I- 0, we obtain from
o ~ Pc;(~(E)) = p(~ + h(t)) + tmg(~ + h(t)),
t m = E,
for k := h'(O) the equation
so that
k = [_ m!g(O ] 11m
p(m)(~)
Again to a first order of approximation,
(5.8.2)
~(E) == ~ + El/m [_ m! g(~)] 11m
p(m)(~)
For m = 1 this formula reduces to (5.8.1) for simple roots.
Let us assume that the polynomial p(x) is given in the usual form
by its coefficients ai. For
g(x) := aiXn-i
the polynomial Pc; (x) is the one which results if the coefficient ai of p(x) is
replaced by ai(I+E). The formula (5.8.2) then yields the following estimate
of the effect on the root ~ of a relative error E of ai:
(5.8.3)
It thus becomes apparent that in the case of multiple roots the changes
in root values ~(E) - ~ are proportional to El/ m , m > 1, whereas in the case
of simple roots they are proportional to just E: multiple roots are always
5.8 The Sensitivity of Polynomial Roots
337
badly conditioned. But simple roots may be badly conditioned too. This
happens if the factor of s in (5.8.3),
~n-i
k(i,~):= a;,(~)
1
1
,
is large compared to ~, which may be the case for seemingly "harmless"
polynomials.
EXAMPLE.
[Wilkinson (1959)].
(1) The roots C,k = k, k = 1, 2, ... , 20, of the polynomial
p(x) = (x - l)(x - 2) ... (x - 20) =
20
L aix
20 - i
i=O
are well separated. For 60 = 20, we find pi (20) = 19!, and replacing the
coefficient al = -(1 + 2 + ... + 20) = -210 by at(l + E) causes an estimated
change of
10
. 210 . 20 19
60(E) - 60 = E
,
"'" E . 0.9 ·10 .
19.
The most drastic changes are caused in C,16 by perturbations of a5. Since
66 = 16 and a5 "'" _1010,
This means that the roots of the polynomial p are so badly conditioned that
even computing with 14-digit arithmetic will not guarantee any correct digit
in 66.
(2) By contrast, the roots of the polynomial
20
20-i .
p ( x ) -- ~
~ aiX
.i=O
IT( x - 2- ,
20
j )
j=l
while not well separated and "accumulating" at zero, are all well conditioned.
For instance, changing a20 to a20(1 + E) causes a variation of 60 which to a
first order of approximation can be bounded as follows:
60 (E) - 60 1-'-1 E
l l<4 I
1 60
(2- 1 - 1)(2-2 - 1)··· (2- 19 - 1) - IE.
More generally, it can be shown for all roots C,j and changes ai --+ ai (1 + E)
that
However, the roots are well conditioned only with respect to small relative
changes of the coefficients ai, and not for small absolute changes. If we
replace a20 = 2- 210 by (120 = a20 + .da2o, .da20 = 2- 4 "("", 10- 14 )
this
338
5 Finding Zeros and Minimum Points by Iterative Methods
can be con_sidered a small absolute change has roots t;i with
then the modified polynomial
In other words, there exists at least one subscript r with I(,./~rl 2: (2 162 +
1)1/20 > 2 8 = 256.
It should be emphasized that the formula (5.8.3) refers only to the
sensitivity of the roots of a polynomial
n
p(x) :=
L ai;r
n- i
;=0
in its usual representation by coefficients. There are other ways of representing polynomials - for instance, as the characteristic polynomials of
tridiagonal matrices by the elements of these matrices [see (5.5.3)]. The
effect on the roots of a change in the parameters of such an alternative representation may differ by an order of magnitude from that described by the
formula (5.8.3).The condition of roots is defined always with a particular
type of representation in mind.
EXA:vIPLE.
In Theorem (6.9.7) it will be shown that, for each real tridiagonal
matrix
whose characteristic polynomial is p(x) == (x -1 )(x - 2) ... (x - 20), small relative
changes in (li and (3·i cause only small relative changes of the roots t;j = j. With
respect to this representation, all roots are well conditioned, although they are
very badly conditioned with respect to the usual representation by coefficients,
as was shown in the previous example. For detailed discussions of this topic see
Peters and Wilkinson (1969) and Wilkinson (1965).
5.9 Interpolation Methods for Determining Roots
The interpolation methods to be discussed in this section are very useful for
determining zeros of arbitrary real functions f(x). Compared to Newton's
method, they have the advantage that the derivatives of f need not be
computed. l\;Ioreover, in a sense yet to be made precise, they converge even
faster than Newton's method.
5.9 Interpolation Methods for Determining Roots
339
The simplest among these methods is known as the method of false
position or regula falsi. It is similar to the bisection method in that two
numbers Xi and ai with
(5.9.1)
are determined at each step. The interval [Xi, ail contains therefore at least
one zero of f, and the values Xi are determined so that they converge
towards one of these zeros. In order to define Xi+l, ai+l, let JLi be the zero
of the interpolating linear function
a;!(Xi) - x;!(ai)
f(Xi) - f(ai)
(5.9.2)
Since f(xi)f(ai) < 0 implies f(Xi) - f(ai) i- 0, JLi is always well defined
and satisfies either Xi < JLi < ai or ai < JLi < Xi. Unless f(Pi) = 0, put
(5.9.3)
If f(JLi) = 0, then the method terminates with JLi the zero. An improvement
on (5.9.3) is due to Dekker and is described in Peters and Wilkinson (1969).
In order to discuss the convergence behavior of the regula falsi, we
assume for simplicity that f" exists and that for some i
(5.9.4a)
(5.9.4b)
(5.9.4c)
Xi < ai,
f(Xi) < 0 < f(ai),
f"(x) 2': 0 for all X E [xi,ai].
Under these assumptions, either f(JLi) = 0 or
f(JLi)f(Xi) > 0,
and consequently
To see this, note that (5.9.4) and the definition of JLi imply immediately
that
Xi < JLi < ai·
5 Finding Zeros and Minimum Points by Iterative Methods
340
The remainder formula (2.1.4.1) for polynomial interpolation yields, for
x E [Xi, ail and a suitable value is E [Xi, ai], the representation
f(x) - p(x) = (x - Xi)(X - ai)f"(iS)/2,
is E [Xi. ail.
By (5.9.4c), f(x) - p(x) ::; 0 in the interval X E [Xi, ad. Thus f(tl,;) ::; 0
since P(fLi) = 0, which was to be shown.
It is now easy to see that (5.9.4) holds for all i ~ io provided it holds
for some io. Therefore ai = a for i ~ io, the Xi form a monotone increasing
sequence bounded by a, and liIlli--+oc Xi = .; exists. Since f is continuous,
and because of (5.9.4) and (5.9.2),
f(.;) ::; 0,
f(a) > 0,
a f---,-(.;"-'--)_---=-:U'---'-(a---,-)
t; = -'C
f(';) - f(a) .
Thus (.; - a)f(';) = O. But.; =1= a, since f(a) > 0 and f(';) ::; 0, and we
conclude f(.;) = O. The values Xi converge to a zero of the function f.
Under the assumptions (5.9.4) the regula falsi method can be formulated with the help of an iteration function:
p(X) := af(x) - xf(a)
f(x) - f(a)
(5.9.5)
Since f(';) = 0,
p'(';) = -(af'(';) - f(a))f(a) + U(a)f'(';) = 1 _ 1'(';) .; - a
f(a)2
f(.;) - f(a)·
By the mean-value theorem, there exist T]l, T]2 such that
1(,;) - f(a) = f'(
T]l ,
)
.; < T]l < a,
f(Xi) - f(';) = f'( )
C
T]2 ,
Xi -<.,
Xi < T]2 < .;.
t
(5.9.6)
<.,-a
Because f//(x) ~ 0 holds for X E [Xi, a], f'(x) increases monotonically in
this interval. Thus (5.9.6), :ri < .;, and f(Xi) < 0 imply immediately that
and therefore
0::; p'(.;) < 1.
In other words, the regula falsi method converges linearly under the assumptions (5.9.4).
The previom; discussion shows that, under the assumption (5.9.4), regula falsi will eventually utilize only the first two of the recursions (5.9.3).
5.9 Interpolation Methods for Determining Roots
341
We will now describe an important variant of regula fasi, called the secant
method, which is based exclusively on the second pair of recursions (5.9.3):
(5.9.7)
Xi+l =
Xi-l](X,) - x;](x,-d
] (Xi) - j .( Xi-l )
,
i = 0,1, ....
In this case, the linear function always interpolates the two latest points of
the interpolation. While the original method (5.9.3) is numerically stable
because it enforces (5.9.1), this may not be the case for the secant method.
Whenever ](Xi) ;::::; ](Xi-J), digits are lost due to cancellation. Moreover,
XHl need not lie in the interval [Xi, Xi-l], and it is only in a sufficiently
small neighborhood of the zero ~ that the secant method is guaranteed to
converge. We will examine the convergence behavior of the secant method
(5.9.7) in such a neighborhood and show that the method has a superlinear
order of convergence. To this end, we subtract ~ from both sides of (5.9.7),
and obtain, using divided differences (2.1.3.5),
(5.9.8)
If ] is twice differentiable, then (2.1.4.3) gives
(5.9.9)
j[Xi-l,Xi] = j'(T}d,
T}l E I[Xi-l,Xi],
j[Xi-l,Xi'~] = ~1"(172)'
T}2 E I[Xi-l,Xi'~]'
For a simple zero ~ of ], ]' (~) #- 0, and there exist a bound M and an
interval J = {x I Ix - ~I ::; e} such that
(5.9.10)
(T}2) I < '1.
I~2 1"
l' (T}l) - "' .
c
lor
any
T}1,T}2
E J.
Letei:= Allxi - ~I and eo, el < min{ 1, cAl}, so that xo, Xl E J. Then,
using (5.9.8) and (5.9.10), it can be easily shown by induction that
(5.9.11 )
eHl ::; eiei-l
for i = 1, 2, ...
and leil ::; min{ 1, eM}, so that Xi E J for all i ~ O. But note that
(5.9.12)
ei ::; K(q')
for i = 0, 1, ... ,
where q
(1 + J5) /2 = 1.618 ... is the positive root of the equation
p.2 - P. - 1 = 0 and K := max{ eo, vel} < 1. The proof is inductive: This
342
5 Finding Zeros and Minimum Points by Iterative Methods
choice of K makes (5.9.12) valid for i = 0 and i = 1. If (5.9.12) holds for
i - I and i, then (5.9.11) yields
since q2 = q + 1. Thus (5.9.12) holds also for i + 1, and must therefore hold
in general.
According to (5.9.12), the secant method converges at least as well as a
method of order q = 1.618 .... Since one step ofthe secant method requires
only one additional function evaluation, two secant steps are at most as expensive as a single Newton step. Since Kqi+2 = (Kqi )q2 = (Kqi )q+!, two
secant steps lead to a method of order q2 = q + 1 = 2.618 .... With comparable effort, the secant method converges locally faster than the Newton
method, which is of order 2.
The secant method suggests the following generalization. Suppose there
are r + 1 different approximations Xi, Xi-I, ... , Xi-r to a zero ~ of f(x).
Determine the interpolating poynomial Q(x) of degree r with
j = 0, 1, ... ,r,
and choose the root of Q(x) closest to Xi as the new approximation Xi+!.
For r = 1 this is the secant method. For r = 2 we obtain the method of
Muller. The methods for r 2: 3 are rarely considered, because there are no
practical formulas for the roots of the interpolating polynomial.
Muller's method has gained a reputation as an efficient and fairly reliable method for finding a zero of a function defined on the complex plane
and, in particular, for finding a simple or multiple root of a polynomial. It
will find real as well as complex roots.
The approximating values Xi may be complex even if the coefficients
and the roots of the polynomial as well as the starting values Xl, X2, X3
are all real. Our exposition employs divided differences [see Section 2.1.3]'
following Traub (1964).
By Newton's interpolation formula (2.1.3.8), the quadratic polynomial
which interpolates a function f (in our case the given polynomial p) at
Xi-2, Xi-I, Xi can be written as
Qi(X) = f[Xi] + f[Xi-l, Xi](X - Xi) + f[Xi-2, Xi-I, Xi](X - Xi-I)(X - Xi),
or
where
Qi(X) = ai(X - Xi)2 + 2bi (x - Xi) + Ci,
ai := J[ Xi-2, Xi-I, Xi],
bi := ~ (J[Xi-l, x;J + f[Xi-2, Xi-I, Xi](Xi - xi-d),
Ci := f[x;].
5.9 Interpolation Methods for Determining Roots
343
If hi is the root of smallest absolute value of the quadratic equation aih2 +
2b i h + Ci = 0, then Xi+l := Xi + hi is the root of Qi(.7:) closest to Xi.
In order to express the smaller root of a quadratic equation in a numerically stable fashion, the reciprocal of the standard solution formula for
quadratic equations should be used. Then Muller's iteration takes the form
(5.9.13)
X +1
,
= X· -
Ci
-=---~;c===
'bi±Jb;-aici'
where the sign of the square root is chosen so as to maximize the absolute
value of the denominator. If ai = 0, then a linear interpolation step as in
the secant method results. If ai = bi = 0, then f(Xi-2) = f(Xi-d = f(x;),
and the iteration has to be restarted with different initial values. In solving
the quadratic equation, complex arithmetic has to be used, even if the
coefficients of the polynomial are all real, since b; - aici may be negative.
Once a new approximate value Xi+l has been found, the function f is
evaluated at xi+l to find
f[Xi+l] := f(xi+d,
j[Xi, Xi+l] :=
j[Xi+l] - f[Xi]
,
xi+l - Xi
These quantities determine the next quadratic interpolating polynomial
Qi+l(X),
It can be shown that the errors Ci := Xi - ~ of Muller's method in the
proximity of a simple zero ~ of f(x) = 0 satisfy
(5.9.14)
Ei+l = EiEi-lEi-2 ( -
f(3)(~)
)
61'(0 + O(E) ,
E = max(!Ei!, !Ei-l!, !Ei-2!)
[compare (5.9.8)]. By an argument analogous to the one used for the secant
method, it can be shown that Muller's method is at least of order q =
l.84 ... , where q is the largest root of the equation f.L3 - f.L2 - f.L - 1 = O.
The secant method can be generalized in a different direction. Consider
again r + 1 approximations Xi, Xi-I, ... , Xi-r to the zero ,; of the function
f (x). If the inverse 9 of the function f exists in a neighborhood of ~ (this
is the case if ~ is a simple zero),
f(g(u)) = y,
g(f(x)) = X,
g(O) =~,
then determining ~ amounts to calculating g(O). Since
g(f(Xj)) = Xj,
j = i, i - I , ... , i - r,
344
5 Finding Zeros and Minimum Points by Iterative Methods
the following suggests itself: determine the interpolating polynomial Q(y)
of degree r or less with Q(f(Xj)) = Xj, j = i, i-I, ... , i - r, and approximate g(O) by Q(O). Then select Xi+l := Q(O) as the next approximate point
to be included in the interpolation. This method is called determining zeros
by inverse interpolation. Note that it does not require solving polynomial
equations at each step even if r ;:::: 3. For r = 1 the secant method results.
The interpolation formulas of Neville and Aitken [see Section 2.1.2] are particularly useful in implementing inverse interpolation methods. The methods are locally convergent of superlinear order. For details see Ostrowski
(1966) and Brent (1973).
In Section 5.5, in connection with Newton's method, we considered two
techniques for determining additional roots: deflation and zero suppression.
Both are applicable to the root-finding methods discussed in this section.
Here zero suppression simply amounts to evaluating the original polynomial
p(x) and numerically dividing it by the value of the product (X-~i)'" (x~k) to calculate the function whose zero is to be determined next. This
process is safe if computation is restricted away from values x which fall in
close neighborhoods of the previously determined roots 6, ... , ~k. Contrary
to deflation, zero suppression works also if the methods of this section are
applied to finding the zeros of an arbitrary function f(x).
5.10 The ..::P-Method of Aitken
The Ll 2 -method of Aitken is one of first and simplest methods for accelerating the convergence of a given convergent sequence of values Xi,
lim Xi =~.
'-+00
These methods transform the sequence {Xi} into a sequence {x~} which
in general converges faster toward ~ than the original sequence of values Xi,
and they apply to finding zeros inasmuch as they can be used to accelerate
the convergence of sequences {Xi} furnished by one of the methods previously discussed. The extrapolation methods discussed in Section 3.5 are
examples of acceleration methods. A systematic account of these methods
can be found in Brezinski (1978) and in Brezinski and Zaglia (1991).
In order to illustrate the Ll2- method, let us assume that the sequence
{Xi} converges towards ~ like a geometric sequence with factor k, JkJ < 1:
i = 0,1, ....
Then k and ~ can be determined from Xi, Xi+l, Xi+2 using the equations
(5.10.1)
5.10 The iF-Method of Aitken
345
By subtraction of these equations,
k -_
Xi+2 -
Xi+l
Xi+l -
Xi
,
and by substitution into the first equation, since k i= 1,
~ =
(Xi X H2 -
X7+1)
---'-----'-'-=----:-
(XH2 -
2XHl
+ Xi)
Using the difference operator LlXi := Xi+l - Xi and noting that Ll2xi
LlXHl - LlXi = Xi+2 - 2XHl + Xi, this can be written as
(5.10.2)
The method is named after this formula. It is based On the expectation
(5.10.2) provides at least an
improved approximation to the limit of the sequence of values Xi even if the
hypothesis that {Xi} is a geometrically convergent sequence should not be
valid.
The Ll 2 -method of Aitken thus consists of generating from a given
sequence {Xi} the transformed sequence of values
~ to be confirmed below ~ that the value
(5.10.3)
I
Xi := Xi -
(XHl -
Xi)2
2XHl
Xi+2 -
+ Xi
The following theorem shows that the x; converge faster than the Xi to ~
as i ---+ 00, provided {Xi} behaves asymptotically as a geometric sequence:
(5.10.4) Theorem. Suppose there exists k,
sequence {Xi}, xi i= ~,
Ikl < 1, such that for the
holds, then the values x; defined by (5.10.3) all exist for sufficiently large i,
and
lim
= o.
x; -.;
i---+CXJ X-i -
~
PROOF. By hypothesis, the errors ei := Xi - ~ satisfy eHl =
follows that
(k + 5;)ei. It
= ei((k + 5H d(k + 5i ) - 2(k + 5;) + 1)
(5.10.5)
=
Xi+! -
Xi
ei((k - I? + ILi),
where
= eHl - ei = ei((k - 1) + ()i).
ILi ---+ 0,
346
5 Finding Zeros and Minimum Points by Iterative ~Iethods
Therefore
:1:i+2 -
2Xi+l + .ri 10
for sufficiently large i, since ei I 0, k I 1 and tli -+ O. Hence (5.10.3)
guarantees that x; is well defined. By (5.10.3) and (.5.10 ..5),
,
((k - 1) + 5y
x, - c: = ei - ei ( k. - 1) 2 + tIi. '
for sufficiently large i, and consequently
?} =0.
lim x;-~ = lim {1- ((k-1)+5 i
i-+oe
(k - 1)2 + tIi
i-+cx) Xi - ~
D
Consider an iteration method with iteration function tjj( x),
(5.10.6)
Xi+l = tjj(Xi),
i = 0, 1,2, ... ,
for determining the zero ~ of a function f(x) = O. The formula (5.10.3) can
then be used to determine, from triples of successive elements Xi, Xi+l, Xi+2
of the sequence {Xi} generated by the above iteration function tjj, a new
sequence {x; }, which hopefully converges faster. However, it appears to
be advantageous to make use of the improved approximations immediately
by putting
(5.10.7)
Yi := tjj(x,),
Zi = tjj(Yi),
Xi+l := Xi -
(Yi - Xi)2
.
Zi - 2Yi + Xi
i = 0, 1, 2, ...
This method is due to Steffensen. (5.10.7) leads to a new iteration function
W,
:Ti+l =
(5.10.8)
w(x) : =
W(Xi),
:rtjj(tjj(x)) - tjj(x)2 .
tjj(tjj(x)) - 2tjj(x) + X
Both iteration functions tjj and W have, in generaL the same fixed points:
(5.10.9) Theorem. w(~) = ~ implies tjj(~) = C:. Conversely, if tjj(O = ~
and tjj'(O 11 exists, then w(c:) = (
PROOF. By the definition (5.10.8) of W,
Thus w(~) = ~ implies tjj(~) = ~. Next we assume that tjj(~) = ~, tjj is
differentiable for x = C:. and tjj'(~) I l. L'Hopital's rule applied to the
definition (5.10.8) gives
5.10 The LP-Method of Aitken
347
1jI(~) = tp(tp(~)) + ~iP'(iP(~))iP'(O - 2iP(OiP'(~)
tp'(iP(O)iP'(O - 2iP'(O + 1
~ + ~p'(~)2 _ 2~iP'(~)
= 1 + iP'(~)2 - 2iP'(O =~.
D
In order to examine the convergence behavior of 1jI in the neighborhood
of a fixed point ~ of 1jI (and iP), we assume that iP is p+ 1 times differentiable
in a neighvorhood of x = ~ and that it defines a method of order p, that is,
(5.10.10)
tp'(~) = ... = iP(P-l)(~) = 0,
iP(p)(O =: piA i- 0
[see Section 5.2]. For p = 1, we require that in additon
A = iP'(O i- 1,
(5.10.11)
and without loss of generality we assume ~ = O. Then for small x
iP(x) = Ax P +
x p +1
(p + I)!
tp(P+l)(Bx),
0 < B < 1.
Thus
iP(x) = AxP + O(x P+1),
iP(p(x)) = A(AxP + O(xP+1))P + O(Ax P + O(Xp+1))p+l)
_ {O(x P2 ),
A2;:r + O(X2),
if p > 1,
if p = 1,
iP(X)2 = (AxP + O(x P+ 1))2 = A 2 x 2p + O(x 2p +1 ).
For p > 1, because of (5.10.8),
For p = 1, however, since Ai-I,
This proves the following
(5.10.13) Theorem. Let iP be an iteration function defZning a method of
order p for computing its fixed point ~. For p > 1 the corresponding iteration
function 1jI of (5.10.8) determines a method of order 2p -- 1 for computing
~. For p = 1 this method is at least of order 2 provided iP' (~) i- 1.
348
5 Finding Zeros and Minimum Points by Iterative Methods
Note that lfr yields a second-order method, that is, a locally quadratically convergent method, even if 1<p'(01 > 1 and the <P-method diverges as
a consequence [see Section 5.2].
It is precisely in the case p = 1 that the method of Steffensen is important. For p > 1 the lfr-method does generally not improve upon the
<p-method. This is readily seen as follows. Let Xi - ~ = E with E sufficiently
small. Neglecting terms of higher order, we find
<P(Xi) - ~ ~ AE P,
<P(<P(Xi)) - ~ ~ AP+1 EP2 ,
whereas for Xi+1 -~, ;(:i+1:= lfr(.Td,
Xi+1 - ~ = _A 2E2p - 1
by (5.10.12) Now
IAp+1 EP21 «: IA2 E2p - 11
for p > 1 and sufficiently small E, which establishes <P(<P(Xi)) as a much
better approximation to ~ than Xi+1 = lfr(xd. For this reason, Steffensen's
method is only recommended for p = 1.
EXAMPLE. The iteration function p(x)
p'(6) = 0,
= x2 has fixed points 6 = 0,6 = 1 with
pl/(6) = 2,
p'(6) = 2.
The iteration Xi.+! = p(x,) converges quadratically to 6 if Ixol < 1. But for
Ixo I > 1 the sequence {Xi} diverges.
The transformation (5.10.8) yields
.
X3
1]f (X) - --,,---- X2 + x - I
WIth
rl.2 =
-1±V5
2
.
We proceed to show that the iteration Xi+! = 1]f(Xi) reaches both fixed points for
suitable choices of the starting point Xo.
For Ixi ::; 0.5, 1]f(x) is a contraction mapping. If Ixol ::; 0.5 then Xi+1 = 1]f(Xi)
converges towards 6 = O. In sufficient proximity of 6 the iteration behaves as
Xi+! = 1]f(x,) "" xf,
whereas the iteration Xi+l = <t>(<t>(Xi)) has 4th-order convergence for Ixol < 1:
Xi+1 = p(p(x;)) = x7.
that
For Ixol > rl, Xi+1 = 1]f(Xi) converges towards 6
tlf'(l) = 0,
= 1. It is readily verified
1]f1/(1) of 0
and that therefore quadratic convergence holds (in spite of the fact that p(x) did
not provide a convergent iteration).
5.11 Minimization Problems without Constraints
349
5.11 Minimization Problems without Constraints
We will consider the following minimization problem for a real function
h: IRn -+ IR of n variables:
(5.11.1)
determine min {h(x) I x E IRn}.
We assume that h has continuous second partial derivatives with respect
to all of its variables, hE c 2 (IRn), and we denote the gradient of h by
g(x) := Dh(x)
T
=
( 8h(x)
8h(x) )
8x 1 ' " ' ' 8x n
T
and the matrix of second derivatives of h, the Hessian of h, by
8 2 h(X))
H(x):= ( 8x i 8xk.
'L,k=l, ... ,n
.
Almost all minimization methods start with a point Xo E IR n and
generate a sequence of points Xk, k :::.. 0, which arc supposed to approximate
the desired minimum point x. In cach step of the iteration, Xk -+ Xk+l, for
which gk = g(Xk) =I 0 [see (5.4.1.3)], a search direction Sk is determined by
some computation which characterizes the method, and the next point
is obtained by a line search; i.e., the stepsize Ak is determined so that
holds at least approximately for Xk+l' Usually the direction Sk is taken to
be a descent direction for h, i.e.,
(5.11.2)
This ensures that only positive values for A need to be considered in the
minimization of lP k (A).
In Section 5.4.1 we established general convergence theorems [(5.4.1.4)
and (5.4.1.8)] which apply to all methods of this type with only mild restrictions. In this section we wish to become acquainted with a few, special
methods, which means primarily that we will discuss a few specific ways of
selecting Sk which have become important in practice. A (local) minimum
point x of h is a zero of g(x); hence we can use any zero-finding method
on the system g(x) = 0 as an approach to finding minimum points of h.
Most important in this regard is Newton's method (5.1.6), at each step
Xk -+ Xk+l = Xk - AkSk of which the Newton direction
350
5 Finding Zeros and Minimum Points by Iterative Methods
is taken as a search direction. This, when used with the constant step length
Ak = 1, has the advantage of defining a locally quadratically convergent
method [Theorem (5.3.2)]. But it has the disadvantage ofrequiring that the
matrix H(Xk) of all second partial derivatives of h be computed at each
step. If n is large and if h is complicated, the computation of H(Xk) can
be very costly. Therefore methods have been devised wherein the matrices
H(Xk)-1 are replaced by suitable matrices H k ,
which are easy to compute. A method is said to be a quasi-Newton method
if for each k ;::: 0 the matrix Hk+1 satisfies the so-called quasi-Newton
equation
(5.11.3)
This equation causes Hk+1 to behave - in the direction (Xk+1 -Xk) -like
the Newton matrix H(Xk+1)-I. For quadratic functions h(x) = ~xT Ax +
bT X + c, for example, where A is an n x n positive definite matrix, the
Newton matrix H(Xk+1)-1 == A-I satisfies (5.11.3) because g(x) == Ax+b.
Further, it seems reasonable to insist that the matrix Hk be positive definite
for each k ;::: O. This will guarantee that the direction Sk = Hkg k will be a
descent direction for the function h if 9k -=I- 0 [see (5.11. 2)] :
gf Sk = gfHkgk > O.
The above demands can, in fact, be met: Generalizing earlier work by
Davidon (1959), Fletcher and Powell (1963), and Broyden (1965, 1967)),
Oren and Luenberger (1974) have found a two-parameter recursion for
producing Hk+l from Hk with the desired properties. Using the notation
and the parameters 'Yk > 0, (h ;::: 0 this recursion has the form
(5.11.4)
where IJr is the one-parameter function
(5.11.5 )
qTHq)ppT
(I-e)
T
lJr(e,H,p,q):=H+ ( l + e --;Y--~H Hq·q H
Tpq
pq
q
q
e
_ --;y- (pqT H + H qpT)
P q
which is due to Broyden (1970). Oren and Luenberger introduced the additional scaling parameter 'Yk > 0 into (5.11.4) allowing a scaling of the
matrix H k .
5.11 Minimization Problems without Constraints
351
Clearly, if Hk is symmetric then so is Hk+l. It is directly verified from
(5.11.5) that H k+1 satisfies the quasi-Newton equation (5.11.3), Hk+lqk =
Pb and it will be shown later [Theorem (5.11.9)] that Ui.11.4) preserves
the positive definiteness of the matrices H k .
The "update function" l[r is only defined if pT q i- O. qTH q i- O. Observe
that H k+ 1 is obtained from Hk by adding a correction of rank :s: 2 to the
matrix fkHk:
rank(Hk+l - H k ) :s: 2.
Hence, (5.11.4) is said to define a rank-two method.
The following special cases are contained in (5.11.4):
(a) fk == L (h == 0: the method of Davidon (1959) and Fletcher and Powell
(1963) ("DFP method");
e
(b) fk == 1, k == 1: the rank-two method of Broyden. Fletcher, Goldfarb,
and Shanno ("BFGS method") [see. for example, Broyden (1979)]:
(c) rk == 1. ek = P[qk/(p[qk - p[Hkqk): the symmetric rank-one method
of Broyden.
The last method is only defined for pI qk
1= qI Hkqk. It is possible that
(h < 0, in which case Hk+l can be indefinite even when Hk is positive definite
[see Theorem (5.11.9)]. If we substitute the value of (h into (5.11.4) and (5.11.5),
we obtain:
which explains why this is referred to as a rank-one method.
A minimization method of the Oren-Luenberger class has the following
form:
(5.11.6).
(0) Start: Choose a starting point Xo E IRn and an n x n positive definite
matrix Ho (e.g. Ho = 1), and set go := g(xo). For k ,= 0, 1, ... obtain
Xk+l, H k+ 1 from Xk. Hk as follows:
(1) If Yk = 0, stop: Xk is at least a stationary point for h. Otherwise
(2) compute Sk := HkYk.
(3) Determine Xk+1 = Xk - AkSk by means of an (approximate) minimization
and set
5 Finding Zeros and Minimum Points by Iterative Methods
352
(4) Choose suitable parameter values rk > 0, 8k 2: 0, and compute H k+1 =
P(8 k , rkHk,Pk. qk) according to (5.11.4).
The method is uniquely defined through the choice of the sequences
{rk}, {8 k } and through the line-search procedure in step (3). The characteristics of the line search can be described with the aid of the parameter
Ji,k defined by
(5.11.7)
If Sk is a descent direction, g[ Sk > 0, then Ji,k is uniquely determined by
Xk+l. If the line search is exact, then ILk = 0, since g[+lSk = -'P~(>'k) = 0,
'Pk(A) := h(xk - ASk). In what follows we assume that
(5.11.8)
Ji,k < 1.
°
If gk f
and Hk is positive definite, it follows from (5.11.8) that Ak > 0;
therefore
q[ Pk = -Ak(gk+l - gk)Sk = Ak(1 - Ji,k)g[ Sk = Ak(1 - Ji,k)g[ Hkgk > 0,
and also qk f 0, q[ Hkqk > 0: the matrix H k+1 is well defined via (5.11.4)
and (5.11.5). The condition (5.11.8) on the line search cannot be satisfied
only if
'P~(A) = -g(Xk - ASk)Tsk:::; 'P~(o) = -g[Sk <
But then
h(xk - ASk) - h(Xk) =
°
for all A 2: 0.
1A 'P~(T)dT :::; -Ag[ Sk < ° for all
A 2: 0,
so that h(xk - ASk) is not bounded below as A -+ +00. The condition
(5.11.8) does not, therefore, pose any actual restriction.
At this point we have proved the first part of the following theorem,
which states that the method (5.11.6) meets our requirements above:
°
(5.11.9) Theorem. If there is a k 2: in (5.11.6) for which Hk is positive
definite, gk f 0, and for which the line search satisfies Ji,k < 1, then for all
rk > 0, 8k 2: the matrix Hk+l = P(fh,rkHk,Pk,qk) is well defined and
positive definite.
°
PROOF. It suffices to consider only the case rk = 1 and to show the following property of the function P given by (5.11.5): Assuming that
H is positive definite,
pT q > 0,
pT H q > 0,
() 2: 0,
then fI:= P(8,H,p,q) is also positive definite.
Let y E lR n , y f 0, be an arbitrary vector, and let H = LLT be the
Choleski decomposition of H [Theorem (4.3.3)]. Using the vectors
5.11 Minimization Problems without Constraints
353
The Cauchy-Schwarz inequality implies that u T u - (u T z;)2 /v Tv 2: 0, with
equality if and only if 'U = av for some a "I 0 (since y "I 0). If u "I av, then
yT Hy > O. If u = av, it follows from the non singularity of Hand L that
o "I y = aq, so that
Because 0 "I y E IR n was arbitrary, H must be positive definite.
0
The following theorem establishees that the method (5.11.6) yields the
minimum point of any quadratic function h: IR n -+ IR after at most n steps,
provided that the line searches are performed exactly. Since each sufficiently
differentiable function h can be approximated arbitrarily closely in a small
enough neighborhood of a local minimum point by a quadratic function,
this property suggests that the method will be rapidly convergent even
when applied to nonquadratic functions.
(5.11.10) Theorem. Let h(x) = ~xT Ax+bT x+c be a quadratic function,
where A is an n x n positive definite matrix. Further, let Xo E IR n , and
let Ho an n x n positive definite matrix. If the method (5.11.6) is used to
minimize h, starting with xo, Ho and carrying out exact line searches (/Ji =
o for all i 2: 0), then sequences Xi, Hi, gi, Pi := XHI - Xi, qi := gHl - gi
are produced with the properties:
(a) There is a smallest index m :s; n with Xm = X = -A-1b such that
Xm = X is the minimum of h, gm = O.
(b) p[ qk = p[ APk = 0 for 0 :s; i "I k :s; m - 1, p[ qi > 0 for 0 :s; i :s; m - 1.
That is. the vectors Pi are A-conjugate.
(c) p[ gk = 0 for all 0 :s; i < k :s; m.
(d) Hkqi = "Yi.kPi for 0 :s; i < k :s; m, with
"YHI "YH2 ... "Yk-l
{
"Yik
. :=
1
for i < k -- 1,
for i = k -- 1.
354
5 Finding Zeros and Minimum Points by Iterative Methods
(e) If m = n, then additionally
lIm = Hn = PDP-lA-I,
where D := diagbo,n, l1.n,·· ·In-1.n), P := (PU,PI,'" 'Pn-d· And if
Ii == L it follows that Hn = A - I .
PROOF. Consider the following conditions for an arbitrary index 1 2: 0:
pT qk = pT APk = 0 for. 0 S i =J k S 1 - 1,
{ pT qi > 0 for 0 SiS I - 1,
(AI)
HI
is positive definite:
(Bd
pT gk = 0 for all 0 S i < k S 1;
(Cd
Hkqi = A/i.kJii
for 0 S i < k S l.
If these conditions hold, and if in addition gl = g(xd =J 0, then we will
show that (A l+1 , B l+ 1, C l+1 ) hold.
Since HI is positive definite by (Az),
Hlgl > 0 and SI := Hlg l =J 0
follow immediately from gl =J O. Because the line search is exact, Al IS a
zero point of
gf
hence PI = -AlSI =J 0 and
(5.11.11)
pf gl+1 = -Alsf gl+1 = 0,
pf ql = -AISf (gl+l - gz) = Alsf gl = Algf Hig i > O.
According to Theorem (5.11.9), H l + 1 is positive definite. Further,
for i < l, because APk = qk, (Bz) and (Cz) hold. This establishes (A I + 1 ).
To prove (Bl+d, we have to show that
gl+1 = 0 for all i < 1 + 1.
(5.11.11) takes care of the case i = 1. For i < 1, we can write
pT
since qj = gj+l - gj by (5.11.6). The above expression vanishes according
to (Bz) and (Al+d. Thus (B1+d holds.
Using (5.11.5), it is immediate that Hl+1ql = Pl. Further (Al+d, (Cz)
imply
qi = 0,
Hlqi = li,lqf Pi = 0 for i < I, so that
pf
qf
5.11 Minimization Problems without Constraints
355
follows for i < I from (5.11.5). Thus (Cl+d holds, too.
We note that (Ao, Bo, Co) hold trivially. So long as Xl satisfies (b)-(d)
and g(xz) -=/:- 0, we can use the above results to generate, implicitly, a point
XI+1 which also satisfies (b)-(d). The sequence xo, Xl, ... must terminate:
(Az) could only hold for I <::::: n, since the I vectors Po, ... , Pl-l are linearly
independent [if 2:i<I-1 OOiPi = 0, then multiplying by prA for k = 0, ... ,
1- 1, gives D:kPr Apk = 0 ===? D:k = 0 from (AI)]. No more than n vectors
in .IRn can be linearly independent. When the sequence terminates, say at
I = m, 0 <::::: m <::::: n, it must do so because
i.e. (a) holds. In case m = n, (d) implies
for the matrices P:= (po, ... ,Pn-l), Q = (qO, ... ,qn-l). Since AP = Q,
the nonsingularity of P implies
Hn = PDP-1A-1,
which proves (e) and, hence, the theorem.
D
We will now discuss briefly how to choose the parameters "/k, ()k in
order to obtain as good a method as possible. Theorem (5.11.10e) seems to
imply that the choice "/k == 1 would be good, because it appears to ensure
that limi Hi = A- l (in general this is true for nonquadratic functions
only under certain additional assumptions), which suggests that the quasiNewton method will converge "in the same way as the Newton method".
Practical experience indicates that the choice
"/k == 1,
()k == 1
(BFGS method)
is good [see Dixon (1971)]. Oren and Spedicato (1976) have been able to
give an upper bound pbk' ()k) on the quotient cond(Hk+1)/ cond(Hk) of
the condition numbers (with respect to the Euclidean norm) of Hk+l and
Hk. Minimizing P with respect to "/k and ()k leads to the prescription
if
c
<::::: 1
a
-
a
if -::::: 1
T
if
a
c
- <::::: 1 <:::::T
a
Here we have used
c
a
()k := 0;
then choose
"/k := -,
then choose
"/k := -,
()k := 1;
then choose
"/k:= 1,
()k :=
a
T
cr(c - a)
lOT - a 2 .
356
5 Finding Zeros and Minimum Points by Iterative Methods
~._ pTH-lp
C.-
kkk,
Another promising suggestion has been made by Davidon (1975). He was
able to show that the choice
(J(c-(J;
fA := {
lOT -
(J
(J
2cT
if (J < - - c+T'
otherwise,
(J-T
used with rk == 1 in the method (5.1l.5), minimizes the quotient Arnax/ Arnin
of the largest to the smallest eigenvalues satisfying the generalized eigenproblem
determine A E <C, Y ¥ 0 so that Hk+1y = AHky,
that is det(Hk 1 H k + 1 - A1) = O.
Theorem (5.1l.10) suggests that methods of the type (5.11.4) will converge quickly even on nonquadratic functions. This has been proved formally for some individual methods of the Oren-Luenbergcr class. These
results rest, for the most part, on the local behavior in a sufficiently small
neighborhood U(x) of a local minimum x of h under the following assumptions:
(5.11.12a) H(x) is positive definite,
(5.1l.12b) H(x) is Lipschitz continuous at x = x, i.e., there is a A with
IIH(x) - H(x)11 'S Jllix - xii for all x E U(x).
Further, certain mild restrictions have to be placed on the line search, for
example:
(5.11.13) For given constants 0 < Cl < C2 < 1. Cl 'S ~, Xk+l = Xk - AkSk is
chosen so that
h(Xk+l) s; h(Xk) - C1 Akgl Sk,
T
T
gk+l Sk s; c2gk Sk,
or
(5.1l.14)
Under the conditions (5.11.12), (5.11.13) Powell (1975) was able to show
for the BFGS method (rk == 1, 8k == 1): There is a neighborhood
V(x) ~ U(x)
such that the method is superlinearly convergent for all positive definite
initial matrices Ho and all Xo E V (x). That is,
5.11 Minimization Problems without Constraints
357
·
Ilxi+1 - xii = 0
11m
i-+oo Ilxi - xii
'
as long as Xi =f. x for all i ::::0: O. In his result the stepsize Ak == 1 satisfies
(5.11.13) for all k ::::0: 0 sufficiently large.
Another convergence result has been established for the subclass of the
Oren-Luenberger class (5.11.4) which additionally satisfies
TH-· 1
T
Pkqk <
< ~ Pk
qfHkqk - "Yk pfqk .
(5.11.15)
For this subclass, using (5.11.12), (5.11.14), and the additional demand
that the line search be asymptotically exact, i.e.
IILkl ~ cl19kll
for large enough k,
it can be shown [Stoer (1977), Baptist and Stoer (1977)] that
limxk = x,
k
IIXk+n - xii ~ "Yllxk - xW
for all k ::::0: O.
for all positive definite initial matrices Ho and for Ilxo - xii small enough.
The proofs of all of the above convergence results are long and difficult.
The following simple example illustrates the typical beh.aviors of the DFP
method, the BFGS method, and the steepest-descent method (Sk := gk in each
iteration step).
We let
2(
2
2
(2 + a:)2
()
hx,y
:=100(y 3-x)-x (3+x)) +--(--)2'
1 + 2+x
which has the minimum point x := -2, j) := 0.8942719099 ... , hex, y) = O. For
each method we take
Xo := 0.1,
yo := 4.2
as the starting point and let
for the BFGS and DFP methods.
Using the same line-search procedure for each method, we obtained the following results on a machine with eps = 10- 11 :
BFGS
DFP
N
F
54
374
47
568
E
S 10- 11
S 10- 11
Steepest descent
201
1248
0.7
358
5 Finding Zeros and l\linimum Points by Iterative Methods
N denotes the number of iteration steps (Xk, yA:) --t (Xk+l, xk+d: F denotes the
number of evaluations of h: and c:= Ilg(xJV,YN)11 was the accuracy attained at
termination.
The steepest descent method is clearly inferior to both the DFP and BFGS
methods. The DFP method is slightly inferior to the BFGS method. (The line
searches were not done in a particularly efficient manner. More than 6 function
evaluations were needed, on the average, for each iteration step. It is possible to
do better, but the same relative performance would have been evident among the
methods even with a better line-search procedure.)
EXERCISES FOR CHAPTER 5
1.
Let the continuously differentiable iteration function cf>: IR n --t IR n be given.
If
lub(Dcf>(x)) ~ K < 1 for all x E IR n ,
then the conditions for Theorem (5.2.2) are fulfilled for all x, Y E IRn.
2.
Show that the iteration
Xk-l
=
COS(Xk)
converges to the fixed point ~ = cos ~ for all Xo E IR n .
3.
Give a locally convergent method for determining the fixed point ~ = ~ of
cf>(x) := x 3 + x - 2. (Do not use the Aitken transformation.)
4.
The polynomial p(x) = x 3 - x 2 - x-I has its only positive root near ~ =
1.839 .... \Vithout using pi (x), construct an iteration function cf>(x) having
the fixed point ~ = cf>(~) and having the property that the iteration converges
for any starting point Xo > O.
5.
Show that
lim Xi = 2.
i-+oc
where
Xo := O.
6.
.'l:i+1:= /2
+ x.,.
Let the function f: IR2 --t IR2
f(
z)
=
[
exp(x 2 + y2) - 3 ]
x+y-sin(3(x+y))'
be given. Compute the first derivative Df(z). For which z is Df(z) singular?
7.
8.
The polynomial po(x) = X4 -8x 3 +24x2 -32:r+a4 has a quadruple root x = 2
for a4 = 16. To first approximation, where are its roots if a4 = 16 ± 10- 4 ?
Consider the sequence {Zi } with Zi+1 = cf>(z;}, cf>: IR --t IR. Any fixed point
~ of cf> is a zero of
F(z) := z - cf>(z).
Show that if one step of regula falsi is applied to F (z) with
EXERCISES FOR CHAPTER 5
359
then one obtains Steffensen's (or Aitken's) method for transforming the sequence {Zi } into {JLi }.
9.
Let f: IR -+ IR have a single, simple zero. Show that, if <p(~c) := x - f(x) and
the recursion (5.10.7) are used, the result is the quasi-Newton method
n = 0.1, ....
Show that this iteration converges at least quadratically to simple <leros and
linearly to multiple zeros.
Hint: (5.10.13).
10. Calculate x = 1/ a for any given a Ie 0 without using division. For which
starting values Xo will the method converge?
11. Give an iterative method for computing y'a, a > 0, which converges locally
in second order. (The method may only use the four fundamental arithmetic
operations. )
12. Let A be a nonsingular matrix and {Xk }k=O.l, .. be a sequene of matrices
satisfying
(Schulz's method).
(a) Show that lub(I - AXo ) < 1 is sufficient to ensure the convergence of
{Xk} to A-I Further. Ek := I - AXk satisfies
(b) Show that Schulz's method is locally quadratically convergent.
(c) If, in addition, AXo = XoA then
13. Let the function f: IR -+ IR be twice continuously differentiable for all x E
U(~) := {x I Ix - ~I :::; r} in a neighborhood of a simple zero ( f(~) = O.
Show that the sequence {x n } generated by
y:= Xn - f'(Xn)-l f(x n ),
Xn+l := Y - !'(X 71 )-l f(y),
converges locally at least cubically to ~.
14. Let the function f: IR -+ IR have the zero ~. Let f be twice continuously
differentiable and satisfy f' (x) Ie 0 for all x E I := {x IJ: - ~ I <::: r}. Define
the iteration
where q(x) := (f(x + f(x)) - f(x))/ f(x) for x Ie~, x E I. and show:
(a) For all x
360
5 Finding Zeros and Minimum Points by Iterative Methods
f(x+f(x)) -f(x) = f(x)
11
f'(x+tf(x))dt.
(b) The function q (x) can be continuously differentiably extended to x = ~
using
q(x) =
.i
l
f'(x+tf(x))dt.
(c) There is a constant c so that
Iq(x) - f'(x)1 S clf(x)l·
Give an interpretation of this estimate.
(d) The iteration has the form Xk+l = CP(Xk) with a differentiable function
CP. Give cP and show cp(!;) = ~, cp'(O = O. Hence the iteration converges
with second order to ~.
15. Let the function f: IR n --+ IR n satisfy the assumptions
(1) f(x) is continuously differentiable for all x E IRn;
(2) for all x E IR n , Df(x)-l exists:
(3) x T f(x) 2> ,(llxll) Ilxll for all x E 1R", where ,(p) is a continuous function
for p 2> 0 and satisfies ,(p) --+ +00 as p --+ 00;
(4) for all x, hE IR n
hT Df(x)h 2> J.i(llxll)llhI1 2 ,
with /1(p) monotone increasing in P > 0, J.i(0) = 0, and
1=
J.i(p )dp = +00.
Then (1), (2), (3) or (1), (2), (4) are enough to ensure that conditions (a)-(c)
of Theorem (5.4.2.5) are satisfied. For (4), use the Taylor expansion
f(x + h) - f(x) =
11
Df(x + th)h dt.
16. Give the recursion formula for computing the values AI, Bl which appear in
the Bairstow method.
17. (Tornheim 1964). Consider the scalar. multistep iteration function of r + 1
variables 'P( Xo, Xl, ... x r ) defining the iteration
i = 0, 1, ... ,
where Yo, Y-l, ... , Y-r are specified. Let 'P have partial derivatives of at least
order r + 1. y* is called a fixed point of 'P if for all k = 1, ... , l' and arbitrary
Xi, i 1= k, it follows that
(* )
Y* = 'P(xo, ... , Xk-l, y*, Xk+l,···, x,.).
Show that
(a) The partial derivatives
References for Chapter 5
S
(
)
361
alsl<p(Xo, ... ,Xr)
a Sr'
Xo Xs ... Xr
D <p Xo, ... ,Xr := a 80a 81
where s = (so, ... ,Sr), Isl:= L:~=OSj, satisfy DS<p(y*, ... ,yO) = 0 if for
some j, 0 :S j :S r, Sj = O. [Note that (*) holds identically in Xo,
Xk-l, Xk+l, ... , Xr for all k.]
(b) In a suitably small neighborhood of y* the recursion
(**)
holds with Ci := IYi - y* I and with an approporiate constant c.
Further, give:
(c) the general solution of the recursion (**) and the local convergence order
of the sequence Yi.
18. Prove (5.9.14).
References for Chapter 5
Baptist, P., Stoer, J. (1977): On the relation between quadratic termination and
convergence properties of minimization algorithms. Part II. Applications. Numer. Math. 28, 367-391.
Bauer, F.L. (1956): Beitrage zur Entwicklung numerischer Verfahren fUr programmgesteuerte Rechenalagen. II. Direkte Faktorisierung eines Polynoms. Bayer.
Akad. Wiss. Math. Natur. Kl. S.B. 163-203.
Brent, R.P. (1973): Algorithms for Minimization without Derivatives. Englewood
Cliffs, N.J.: Prentice-Hall.
Brezinski, C. (1978): Algorithmes d'Acceleration de la Convergence. Etude Numerique. Paris: Edition Technip.
- - - , Zaglia, R. M. (1991): Extrapolation Methods, Theory and Practice. Amsterdam: North Holland.
Broyden, C.G. (1965): A class of methods for solving nonlinear simultaneous
equations. Math. Comput. 19, 577-593.
- - - (1967): Quasi-Newton-methods and their application to function minimization. Math. Comput. 21, 368-381.
- - - (1970): The convergence of a class of double rank minimization algorithms. 1. General considerations, 2. The new algorithm. J. Inst. Math. Appl.
6, 76-90, 222-231.
- - - , C.G., Dennis, J.E., More, J.J. (1970): On the local and superlinear convergence of quasi-Newton methods. J. Inst. Math. Appl. 12, 223-245.
Collatz, L. (1968): Funktionalanalysis und numerische Mathematik. Die Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen. Bd. 120.
Berlin, Heidelberg, New York: Springer-Verlag. English edition: Functional
Analysis and Numerical Analysis. New York: Academic Press (1966).
Collatz, L., Wetterling, W. (1971): Optimierungsaufgaben. Berlin, Heidelberg,
New York: Springer-Verlag.
Davidon, W.C. (1959): Variable metric methods for minimization. Argonne National Laboratory Report ANL-5990.
362
5 Finding Zeros and Minimum Points by Iterative Methods
- - - (1975): Optimally conditioned optimization algorithms without line
searches. Math. Programming 9, 1-30.
Deuflhard, P. (1974): A modified Newton method for the solution of ill-conditioned
systems of nonlinear equations with applications to multiple shooting. Numer.
Math. 22, 289-315.
Dixon, L.C.W. (1971): The choice of step length, a crucial factor in the performance of variable metric algorithms. In: Numerical Methods for Nonlinear
Optimization. F.A. Lootsma, ed., 149-170. New York: Academic Press.
Fletcher, R., Powell, M.J.D. (1963): A rapidly convergent descent method for
minimization. Comput. J. 6, 163-168.
- - - (1980): Unconstrained Optimization. New York: Wiley.
- - - (1981): Constrained Optimization. New York: Wiley.
Gill, P.E., Golub, G.H., Murray, W., Saunders, M.A. (1974): Methods for modifying matrix factorizations. Math. Comput. 28, 505-535.
Griewank, A. (2000): Evaluating Derivatives, Principles and Techniques of Algorithmic Differentiation. Frontiers in Appl. Math., Vol 19. Philadelphia: SIAM.
Henrici, P. (1974): Applied and Computional Complex Analysis. Vol. 1. New York:
Wiley.
Himmelblau, D.M. (1972): Applied Nonlinear Programming. New York:
McGraw-Hill.
Householder, A.S. (1970): The Numerical Treatment of a Single Non-linear Equation. New York: McGraw-Hill.
Jenkins, M.A., Traub, J.F. (1970): A three-stage variable-shift iteration for polynomial zeros and its relation to generalized Rayleigh iteration. Numer. Math.
14, 252-263.
Luenberger, D.G. (1973): Introduction to Linear and Nonlinear Programming.
Reading, Mass.: Addison-Wesley.
Maehly, H. (1954): Zur iterativen Auflosung algebraischer Gleichungen. Z. Angew.
Math. Physik 5, 260-263.
Marden, M. (1966): Geometry of Polynomials. Providence, R.I.: Amer. Math.
Soc.
Nickel, K. (1966): Die numerische Berechnung der Wurzeln eines Polynoms. Numer. Math. 9, 80-98.
Oren, S.S., Luenberger, D.G. (1974): Self-scaling variable metric (SSVM) algorithms. 1. Criteria and sufficient conditions for scaling a class of algorithms.
Manage. Sci. 20, 845-862.
- - - , Spedicato, E. (1976): Optimal conditioning of self-scaling variable metric
algortihms. Math. Programming 10, 70-90.
Ortega, J.M., Rheinboldt, W.C. (1970): Iterative Solution of Non-linear Equations in Several Variables. New York: Academic Press.
Ostrowski, A.M. (1973): Solution of Equations in Euclidean and Banach Spaces.
New York: Academic Press.
Peters, G., Wilkinson, J.H. (1969): Eigenvalues of Ax = >..Bx with band symmetric A and B. Comput. J. 12,398-404.
- - - , - - - (1971): Practical problems arising in the solution of polynomial
equations. J. Inst. Math. Appl. 8, 16-35.
Powell, M.J.D. (1975): Some global convergence properties of a variable metric
algorithm for minimization without exact line searches. In: Proc. AMS Symposium on Nonlinear Programming 1975. Amer. Math. Soc. 1976.
Spellucci, P. (1993): Numerische Verfahren der nichtlinearen Optimierung. Basel:
Birkhiiuser.
References for Chapter 5
363
Stoer, J. (1975): On the convergence rate of imperfect minimization algorithms
in Broyden's ,B-class. Math. Programming 9, 313-335.
- - - (1977): On the relation between quadratic termination and convergence
properties of minimization algorithms. Part I. Theory. Numer. Math. 28,343366.
Tornheim, L. (1964): Convergence of multipoint methods. J. Assoc. Comput.
Mach. 11, 210-220.
Traub, J.F. (1964): Iterative Methods for the Solution of Equations. Englewood
Cliffs, NJ: Prentice-Hall.
Wilkinson, J .H. (1959): The evaluation ofthe zeros of ill-conditioned polynomials.
Part I. Numer. Math. 1, 150-180.
- - - (1963): Rounding Errors in Algebraic Processes. Englewood Cliffs, N.J.:
Prentice Hall.
- - - (1965): The Algebraic Eigenvalue Problem. Oxford: Clarendon Press.
6 Eigenvalue Problems
6.0 Introduction
l\1any practical problems in engineering and physics lead to eigenvalue
problems. Typically, in all these problems, an overdetermined system of
equations is given, say n + 1 equations for n unknowns ~l' ... , ~n of the
form
(6.0.1)
F(x; A) :=== [
h(6, .... (n; A)
:
1=
0
In+l(6'''''~n: A)
in which the functions Ii also depend on an additional parameter A. Usually,
(6.0.1) has a solution x = [6'''''~n]T only for specific values A = Ai,
i = 1, 2, ... , of this parameter. These values Ai are called eigenvalues of
the eigenvalue problem (6.0.1) and a corresponding solution x = X(A;) of
(6.0.1), eigensolution belonging to the eigenvalue Ai.
Eigenvalue problems of this general form occur, e.g., in the coutext of
boundary value problems for differential equations [see Section 7.3]. In this
chapter we consider only the special class of algebraic eigenval'ue problems,
where all but one of the Ii in (6.0.1) depend linearly on x and A, and which
have the following form: Given real or complex n x n matrices A und B,
find a number A E <e such that the system of n + 1 equations
(A - AB)x = O.
(6.0.2)
has a solution x E <en. Clearly, this problem is equivalent to finding numbers
A E <e such that there is a nontrivial vector x E <en, x i- 0, with
(6.0.3)
Ax = ARl:.
For arbitrary A and B, this problem is still very general, and we treat it
only briefly in Section 6.8. The main portion of Chapter 6 is devoted to the
special case of (6.0.3) where B := I is the identity matrix: For an n x n
6.0 Introduction
365
matrix A, find numbers A E ~ (the eigenvalues of A) and nontrivial vectors
x E ~n (the eigenvectors of A belonging to A) such that
(6.0.4)
Ax = AX,
X
# O.
Sections 6.1-6.4 provide the main theoretical results on the eigenvalue problem (6.0.4) for general matrices A. In particular, we describe various normal
forms of a matrix A connected with its eigenvalues, additional results on
the eigenvalue problem for important special classes of matrices A (such as
Hermitian and normal matrices) and the basic facts on the singular values
(Ji of a matrix A, i.e., the eigenvalues (J; of AHA and AA H , respectively.
The methods for actually computing the eigenvalues and eigenvectors
of a matrix A usually are preceded by a reduction step, in which the matrix
A is transformed to a "similar" matrix B having the same eigenvalues as
A. The matrix (B = (b ik ) has a simpler structure than A. (B is either a
tridiagonal matrix, bik = 0 for Ii - kl > 1, or a Hessenberg matrix, bik = 0
for i :::: k + 2), so that the standard methods for computing eigenvalues and
eigenvectors are computationally less expensive when applied to B than
when applied to A.. Various reduction algorithms are described in Section
6.5 and its subsections.
The main algorithms for actually computing eigenvalues and eigenvectors are presented in Section 6.6, among others the LR algorithm of
Rutishauser (Section 6.6.4) and the powerful QR algorithm of Francis (Sections 6.6.4 and 6.6.5). Related to the QR algorithm is the method of Golub
and Reinsch for computing the singular values of matrices, which is described in Section 6.7. After touching briefly on the more gmeral eigenvalue
problem (6.0.3) in Section 6.8, the chapter closes (Section 6.9) with a description of several useful estimates for eigenvalues. These may serve, e.g,
to locate the eigenvalues of a matrix and to study their sensitivity with respect to small perturbations. A detailed treatment of all numerical aspects
to the eigenvalue problem for matrices is given in the excellent monograph
of Wilkinson (1965), and in Golub and van Loan (1983); the eigenvalue
problem for symmetric matrices is treated in Parlett (1980). ALGOL programs for all algorithms described in this chapter are found in Wilkinson
and Reinsch (1971), and FORTRAN programs in the "EISPACK Guide" of
Smith et a1. (1976) and its extension by Garbow et a1. (1977). All these
methods have been incorporated in the widely used program systems like
MATLAB, MATHEMATICA and ]'v[APLE.
366
6 Eigenvalue Problems
6.1 Basic Facts on Eigenvalues
In the following we study the problem (6.0.4), i.e., given a real or complex
n x n matrix A, find a number A E <e such that the linear homogeneous
system of equations
(6.1.1)
(A - AI)X = D
has a nontrivial solution x f D .
(6.1.2) Definition. A number A E <e is called an eigenvalue of the matrix
A if there is a vector x f 0 such that Ax = AX. Every such vector is called
a (right) eigenvector of A associated with the eigenvalue A. The set of all
eigenvalues is called the spectrum of A.
The set
L(A) := {x I (A - AI)X = D}
forms a linear subspace of <en of dimension
p(A) = n - rank (A - AI),
and a number A E <e is an eigenvalue of A precisely when L(A) f D, i.e.,
when p(A) > 0 and thus A - AI is singular:
det(A - AI) = o.
It is easily seen that ip(fl) := det(A - flI) is an nth-degree polynomial of
the form
It is called the
(6.1.3)
characteristic polynomial
of the matrix A. Its zeros are the eigenvalues of A. If A1, ... , Ak are the
distinct zeros of ip(fl), then ip can be presented in the form
The integer O"i, which we also denote by dAi) = O"i, is called the multiplicity
of the eigenvalue Ai, more precisely, its algebraic multiplicity.
The eigenvectors associated with the eigenvalue A are not uniquely
determined: together with the zero vector, they fill precisely the linear
subspace L(A) of <en. Thus,
(6.1.4). If x and yare eigenvectors belonging to the eigenvalue A of the
matrix A, then so is every linear combination ax + f3y f D.
6.1 Basic Facts on Eigenvalues
367
The integer p(>..) = dim L(>..) specifies the maximum number of linearly
independent eigenvectors associated with the eigenvalue >... It is therefore
also called the
geometric mnltiplicity of the eigenvalne >...
One should not confuse it with the algebraic multiplicity 0'(>").
EXAMPLES. The diagonal matrix of order n.
D := ),1,
has the characteristic polynomial cp(p,) = det(D - p,I) = (). - ~i)n. ). is the only
eigenvalue, and every vector x E <c n , x =1= 0, is an eigenvector: L()') = <c n ;
furthermore. iT().) = n = p().). The nth-order matrix
.\
1
o
).
(6.1.5)
o
1
).
also has the characteristic polynomial cp(~) = ()._~)n and), as its only eigenvalue,
with iT().) = n. The rank of C n ()') - AI, however, is now equal to n - 1; thus
p().) = n - (n - 1) = 1, and
L()') = {aella E <C},
el = 1st coordinate vector.
Among further simple properties of eigenvalues we note:
(6.1.6). Let p(tL) = 10 + 11~ + ... + Im~m be an arbitrary polynomial, and
A a matrix of order n. Defining the matrix p(A) by
p(A) := 101 + 11A + ... + ImAm,
the matrix p(A) has the eigenvector x corresponding to the eigenvalne p(>..)
if>.. is an eigenvalne of A and x a corresponding eigenvector. In particnlar,
etA has the eigenvalne et>.., and A + TIthe eigenvalne >.. + T.
From Ax = >..x one obtains immediately A 2 x = A(Ax) = >..Ax =
>..2x, and in general Ai.1: = >..iX. Thus
PROOF.
p(A)x = (roI + 11A + ... + ,mArn)x
=
(ro + ,1>" + ... + Im>..rn)x = p(>..):r.
Furthermore, from
det(A - >..1) = det«A - >..1)T) = det(A T - >..1),
det(A H - 5.1) = det«A - >..1)H) = det «A - >..1)T) = det(A - >"1),
o
368
6 Eigenvalue Problems
there follows:
(6.1.7). If A is an eigenvalue of A, then A is also an eigenvalue of AT, and
), an eigenvalue of AH.
Between the corresponding eigenvector x, y, z,
Ax = AX,
AT Y = AY,
AH z = ),z,
merely the trivial relationship y = z holds, in view of A H = ;F. In particular, there is no simple relationship, in general, between X and y or x
and z, Because of yT = zH and zH A = AZ H , one calls zH or yT, also
a left eigenvector left eigenvector DF associated with eigenvalue A of A.
Furthermore, if x -lOis an eigenvector corresponding to the eigenvalue A,
Ax = AX,
T an arbitrary nonsingular n x n matrix, and if one defines y .- T-Ix,
then
T- I ATy = T- I Ax = AT-Ix = AY, Y -I 0,
i.e., y is an eigenvector of the transformed matrix
associated with the same eigenvalue A. Such transformations are called
similarity transformations,
and B is said to be similar to A, A rv B. One easily shows that similarity
of matrices is an equivalence relation, i.e.,
ArvA,
A rv
A
B,
rv
B
B
rv
=?
B
C
=?
rv
A,
A
rv
C.
Similar matrices have not only the same eigenvalues, but also the same
characteristic polynomial. Indeed,
det(T- I AT - f..LI) = det(T- I (A - p1)T)
=
det(T- I ) det(A - f..LI) det(T)
= det(A - f..LI).
Moreover, the integers peA), o-(A) remain the same: For o"(A), this follows
from the invariance of the characteristic polynomial; for p( A), from the fact
that, T being nonsingular, the vectors Xl, ... , xp are linearly independent
6.2 The Jordan Normal Form of a Matrix
369
if and only if the corresponding vectors Yi := T-IXi, i = 1, ... , p, are
linearly independent.
In the most important methods for calculating eigenvalues and eigenvectors of a matrix A, one first performs a sequence of similarity transformations.
A(O) :=A,
A(i) := T i- l A(i-l)Ti ,
i = 1, 2, ... ,
in order to gradually transform the matrix A into a matrix of simpler form,
whose eigenvalues and eigenvectors can then be determined more easily.
6.2 The Jordan Normal Form of a Matrix
We remarked already in the previous section that for an eigenvalue A of an
n x n matrix A, the multiplicity a(A) of A as a zero of the characteristic
polynomial need not coincide with p(A), the maximum number of linearly
independent eigenvectors belonging to A. It is possible, however, to prove
the following inequality:
1 :s; p(A) :s; a(A) :s; n.
(6.2.1)
PROOF. We prove only the nontrivial part p(A) :s; a(A). Let P := p(A), and
let Xl, ... , xp be linearly independent eigenvectors associated with A:
AXi = AXi,
i = 1, ... , p.
We select n - P additional linearly independent vectors Xi E (Cn, i = P + 1,
... ) n, such that the Xi, i = 1, ... , n, form a basis in (Cn. Then the square
matrix T := [Xl) ... ,xn ] with columns Xi is nonsingular. For i = 1, ... , p,
in view of Tei = Xi, ei = T-IXi, we now have
T- I ATei = T- I AXi = AT-IXi = Aei.
T- I AT, therefore, has the form
T-IAT =
A
0
*
*
0
A
*
*
*
*
*
*
0
'"-,,--'
p
[ml,
and for the characteristic polynomial of A, or of T- I AT, we obtain
370
6 Eigenvalue Problems
i.p(p.) = det(A - pJ) = det(T- l AT - p.!) = (A - p.)P . det(C - p.!).
i.p is divisible by (A-fl)P; hence A is a zero of 'P of multiplicity at least p.
D
In the example of the previous section we already introduced the 1/ x 1/
matrices [see (6.l.5)]
""(Al ~
l
Ao 1
:1]
A
and showed that 1 = p(A) < cr(A) = 1/ (if v> 1) for the (only) eigenvalue
A of these matrices. The unique eigenvector (up to scalar multiples) is el,
and for the coordinate vectors ei we have generally
(Cv(A) - AI)ei = ei-l,
(6.2.2)
i = v, v - L ... ,2,
(Cv(A) - A!)el = O.
Setting formally ek := 0 for k :::; 0, then for all i. j ~ L
(Cv(A) - AJ)'ej = ej-i,
and thus
(6.2.3)
The significance of the matrices Cv(A) lies in the fact that they are used to
build the so-called Jordan normal form J of a matrix. Indeed, the following
fundamental theorem holds, which we state without proof:
(6.2.4) Theorem. Let A be an arbitrary n x n matri:r and A!.. ... ! Ak its
distinct eigenvalues, with geometric and algebraic multiplicities p(Ai) and
cr(Ai), respectively. i = 1. ... , k. Then for each of the eigenvalues Ai. i = 1,
... , k, there exists p(A,) natural numbers
j = 1, 2, .... p(Ai). with
vj').
\ )
cr ( /\i
= VI(i) + 1/2( i) + ... + v p(i)(>-,)
and there exists a nonsingular n x n matrix T, such that J := T- l AT has
the following form:
C V ,(1) (Ad
(6.2.5)
J =
o
o
6.2 The Jordan Normal Form of a Matrix
371
The numbers vJi), j = 1, ... , P(Ai), (and with them, the matrix J) are
uniquely determined up to order. J is called the Jordan normal form of A.
The matrix T, in general, is not uniquely determined.
If one partitions the matrix T columnwise, in accordance with the Jordan normal form J in (6.2.5),
(1)
T = [Tl
(1)
(k)
, ... , T p(>.,)' ... , Tl
(k)
1
, ... , Tp(>'k) ,
then from T- l AT = J and hence AT = T J, there follow immediately the
relations
(6.2.6)
Denoting the columns of the n x v?) matrix T?) without further indices
briefly by t m , m = 1, 2, ... , VJi) ,
Tj
(i)
= [tl, t2, ... , t",Ci)],
J
it immediately follows from (6.2.6), and the definition of C C,) (Ai) that
"'j
1
or
(6.2.7)
(A - AJ)tm = tm-l,
(i)
m = Vj
(i)
,Vj
-
1
1, ... , 2,
(A - AJ)tl = O.
In particular, tl, the first column of T?), is an eigenvector for the eigen-
value Ai. The remaining t m , m = 2, 3, ... , vJi), are called principal vectors corresponding to Ai, and one sees that with each Jordan block C Ci) (Ai)
"'j
there is associated an eigenvector and a set of principal vectors. Altogether,
for an n x n matrix A, one can thus find a basis of <en (namely, the columns
of T) which consists entirely of eigenvectors and principal vectors of A.
The characteristic polynomials
of the individual Jordan blocks C C,) (Ai) are called the
"'j
(6.2.8)
elementary divisors
372
6 Eigenvalue Problems
of A. Therefore, A has only linear elementary linear elementary divisors
precisely if vji) = 1 for all i and j, i.e., if the Jordan normal form is a
diagonal matrix. One then calls A diagonalizable or also normalizable.
This case is distinguished by the existence of a basis of IJjn consisting
solely of eigenvectors of A; principal vectors do not occur. Otherwise, one
says that A has "higher", i.e., nonlinear elementary divisors.
From Theorem (6.2.4) there follows immediately:
(6.2.9) Theorem. Every n x n matrix A with n distinct eigenvalues is
diagonalizable.
We will get to know further classes of diagonalizable matrices in Section
6.4.
Another extreme case occurs if with each of the distinct eigenvalues Ai,
i = 1, ... , k, of A there is associated only one Jordan block in the Jordan
normal form J of (6.2.5). This is the case precisely if
p(Ai) = 1 for i = 1, 2, ... , k.
The matrix A is then called
(6.2.10)
non derogatory,
otherwise, derogatory (an n x n matrix with n distinct eigenvalues is thus
both diagonalizable and nonderogatory). The class of nonderogatory matrices will be studied more fully in the next section.
A further important concept is that of the minimal polynomial of a
matrix A. By this we mean the polynomial
1j;(f.1) = 1'0 + 1'1f.1 + ... + I'm_lf.1 m- l + f.1m
of smallest degree having the property
1j;(A) = O.
The minimal polynomial can be read off at once from the Jordan normal
form:
(6.2.11) Theorem. Let A be an n x n matrix with the (distinct) eigenvalues AI, ... , Ak and with the Jordan normal form J of (6.2.5), and let
Ti := maxl::Oj::Op().;)
Then
vY).
(6.2.12)
is the minimal polynomial of A. 1j;(f.1) divides every polynomial X(f.1) with
X(A) = O.
PROOF. We first show that all zeros of the minimal polynomial 1j; of A, if
it exists, are eigenvalues of A. Let, say, A be a zero of 1j;. Then
6.2 The Jordan Normal Form of a Matrix
373
where the polynomial g(/1) has smaller degree than 'l/J, and hence by the
definition of the minimal polynomial g(A) i= O. There exists, therefore, a
vector z i= 0 with x := g(A)z i= O. Because of'l/J(A) = 0 it then follows that
0= 'l/J(A)z = (A - AI)g(A)z = (A - AI)x,
i.e., A is eigenvalue of A. If a minimal polynomial exists, it will thus have
the form 'l/J(/1) = (/1- Ad T1 (/1- A2t2 ... (/1- Aktk for certain Ti' We wish
to show nOw that Ti := maxj vjiJ will define a polynomial with 'l/J(A) =
O. With the notation of Theorem (6.2.4), indeed, A = T JT- 1 and thus
'l/J(A) = T'l/J(J)T-l. In view of the diagonal structure of J,
J=diag(C (1)(Al),""C (k)
VI
Vp()'k)
(Ak))i
however, we now have
(6.2.13)
and thus, by virtue of Ti ~
v?) and (6.2.3),
'l/J (Cv (i) (Ai)) = O.
J
Thus, 'l/J(J) = 0, and therefore also 'l/J(A) = O.
At the same time One sees that none of the integers Ti can be chosen
(i).
(i)
smaller than maxj Vj . If there were, say, Ti < Vj ,then, by (6.2.3),
(CV(i) (Ai) - Ai1f' i= O.
J
From g(Ai) i= 0 it would follow at once that
B:= g(CV(i) (Ai)).
J
is nonsingular. Hence, by (6.2.13), also 'l/J(C (i))) i= 0, and neither 'l/J(J)
Vj
nor 'l/J(A) would vanish. This shows that the specified polynomial is the
minimal polynomial of A.
If, finally, X(/1) is a polynomial with X(A) = 0, then X, with the aid of
the minimal polynomial, can be written in the form
374
6 Eigenvalue Problems
where deg r < deg
From X(A) = ¢(A) = 0 we thus get also r(A) = O.
Since Vi is the minimal polynomial of A, we must have identically rep) == 0:
1/J is a divisor of X.
0
By (6.2.4), one has
ptA,)
'"
(i)
O'(Ai)=~l/j
::::Ti=maXl/j(i) ,
J
j=l
i.e., the characteristic polynomial cp(p) = det(A - pI) of A is a multiple
of the minimal polynomial. Equality O'(Ai) = Ti, i = 1, ... , k, prevails
precisely when A nonderogatory. Thus,
(6.2.14) Corollary (Cayley-Hamilton). The character'istic polynomial cp(p,)
of a matrix A satisfies cp(A) = o.
(6.2.15) Corollary. A matrix A is nonderogatory if and only if its minimal polynomial and characteric polynomial coincide (up to a multiplicative
constant).
EXA~IPLE.
The Jordan matrix
1
1
1
0
1
1
1
J
1
1
-1
1
-1
-1
LJ
0
has the eigenvalues >11 = 1, .\2 = 1 with multiplicities
p(.\d = 2,
O'(.\d = 5,
Elementary divisors:
Characteristic polynomial:
Minimal polynomial:
P(.\2) = 3,
0'(.\2) = 4.
6.3 The Frobenius Normal Form of a Matrix
375
To Al = 1 there correspond the linearly independent (right) eigenvectors el, e4;
to A2 = -1 the eigenvectors e6, es, eg.
6.3 The Frobenius Normal Form of a Matrix
In the previous section we studied the matrices Cv(.A), which turned out to
be the building blocks of the Jordan normal form of a matrix. Analogously,
one builds up the Frobenius normal form (also called the rational normal
form) of a matrix from Frobenius matrices F of the form
(6.3.1)
0
0
-1'0
1
0
-1'1
0
1
-1'm-2
0
F=
-1'm-l
whose properties we now wish to discuss. One encounters matrices of this
type in the study of Krylov sequences of vectors. By a Krylov sequence of
vectors for the n x n matrix A and the initial vector to E ~n one means
a sequence of vectors ti E ~n, i = 0, 1, ... , m - 1, with the following
properties:
(6.3.2).
(a) ti = Ati-l, i::::: l.
(b) to, t l , ... , t m - l are linearly independent,
(c) tm:= Atm -1 depends linearly on to, h, ... , t m - l : there are constants
1'i with tm + 1'm-l t m-1 + ... + 1'oto = O.
The length m of Krylov sequence of course depends on to. Clearly,
m :::; n, since more than n vectors in ~n are always linearly dependent. If
one forms the n x m matrix T := [to, ... , tm-d and the matrix F of (6.3.1),
then (6.3.2) is equivalent to
(6.3.3)
Rang T= m,
AT = A[to, ... , tm-l] = [h, ... , tm] = [to, ... , tm-1]F = TF.
Every eigenvalue of F is also eigenvalue of A: From Fz = AZ, Z #- 0, we
indeed obtain for x := Tz, in view of (6.3.3),
x#-O
Moreover, we have:
and
Ax = ATz = T F z = ATz = AX.
376
6 Eigenvalue Problems
(6.3.4) Theorem. The matrix F (6.3.1) is nonderogatory: The minimal
polynomial of F is
'lj;(p,) = /'0 + /'IP, + ... + /'m_Ip,m-l + p,m
= (_l)m det (F - p,I).
PROOF. Expanding cp(p,) := det (F - p,I) by the last column, the characteristic polynomial F is found to be
0
-p,
1
-p,
-/'o
-/'1
cp(p,) = det
1
-p,
1
0
-/'m-2
-/'m-l - P,
By the results (6.2.12), (6.2.14) of the preceding section, the minimum
polynomial 'lj;(p,) of T divides cp(p,). If we had deg 'lj; < m = deg cp, say
'lj;(p,) = 000 + alP, + ... + ar_lp,r-l + p,r,
r < m,
then from 'lj;(F) = 0 and Fei = eHl for 1 ::; i ::; m - 1, the following
contradiction would result at once:
0= 'lj;(F)el = aOel + aIe2 + ... + ar-ler + er+1
= [000,001, ... , ar-l, 1,0, ... , O]T "" O.
Thus, deg 'lj; = m, and hence 'lj;(p,) = (-l)mcp(p,). Because of (6.2.15), the
theorem is proved.
0
Assuming the characteristic polynomial of F to have the zeros Ai with
multiplicities (J'i, i = 1, ... , k,
the Jordan normal of Fin (6.3.1), in view of (6.3.4), is given by
C<7 1 (Ad
C<72 (A2)
[
o
The significance of the Frobenius matrices lies in the fact that they furnish
the building blocks of the so-called Frobenius or rational normal form of a
matrix. Namely:
6.3 The F'robenius Normal Form of a Matrix
377
l
(6.3.5) Theorem. For every n x n matrix A there is a nonsingular n x n
matrix T with
FI
(6.3.6)
T-IAT =
0
where the Pi are Frobenius matrices having the following properties:
(a) If 'Pi(f.l) = det (Fi - f.lI) is the characteristic polynomial of Pi, i = 1,
... , r, then 'Pi(f.l) is a divisor of 'Pi-I (f.l), i = 2,3, ... , r.
(b) 'PI (f.l), up to the multiplicative constant ± 1, is the minimal polynomial
of A.
(c) The matrices Fi are uniquely determined by A.
One calls (6.3.6) the Frobenius normal form of A.
This easily done with the help of the Jordan normal form [see
(6.2.4)]: We assume that J in (6.2.5) is the Jordan normal form of A.
Without loss of generality, let the integers v)J) be ordered,
PROOF.
(6.3.7)
> v(i)
> ... >
v(i)
2 p(Ai)'
V I(i) -
.
Z=
12k
, , ... , .
Define the polynomials 'Pj(f.l), j = 1, ... , r, r := maXi pP'i), by
(1)
'Pj(f.l) = (AI - f.l)"j
(2)
(A2 - f.l)"j
(k)
... (Ak - f.l)"j
[using the convention v?) := 0 if j > p(Ai)]. In view of (6.3.7), 'Pj(f.l)
divides 'Pj-I (f.l) and ±'PI (f.l) is the minimal polynomial of A. Now take
as the Frobenius matrix Pj just the Frobenius matrix whose characteristic
polynomial is 'Pj(f.l). Let Si be the matrix that transforms Fi into its Jordan
normal form J i ,
Si-iPiSi = k
A Jordan normal form of A (the Jordan normal form is unique only up to
permutations of the Jordan blocks) then is
378
6 Eigenvalue Problems
According to Theorem (6.2.4) there is a matrix U with U- 1 AU = J'. The
matrix T:= US- 1 with
transforms A into the desired form (6.3.6).
It is easy to convince oneself of the uniqueness F i .
o
EXAMPLE. For the matrix J in the example of Section 6.2 one has
<Pl(J-t) = (1 - J-t)3( -1 - J-t)2 = _(J-t5 - J-t4 - 2J-t3 + 2J-t 2 + J-t - 1),
<P2(J-t) = (1 - J-t)2( -1 - J-t) = _(J-t3 - J-t2 - J-t + 1),
<P3(J-t) = -(J-t + 1),
and there follows
F,
~ [j
000
000
1 o 0 -2
2
0 1 0
o 0 1 1
-:1
,
F2
[~
0
1 ,
0 -1]
1
1
F3 = [-1].
The significance of the Frobenius normal form lies in its theoretical
properties (Theorem (6.3.5)). Its practical importance for computing eigenvalues is very limited. For example, if the n x n matrix A is nonderogatory,
then computing the Frobenius normal form F (6.3.1) is equivalent to the
computation of the coefficients Ik of the characteristic polynomial
which has the desired eigenvalues of A as zeros. But it is not advisable to
first determine the Ii in order to subsequently compute the eigenvalues as
zeros of <P: In general, the zeros Aj of <P react much more sensitively to small
changes in the coefficients Ii of 'P than to small changes in the elements of
the original matrix A [see Sections 5.8 and 6.9].
EXAMPLE. In Section 6.9, Theorem (6.9.7), it will be shown that the eigenvalue
problem for Hermitian matrices A = A H is well conditioned in the following
sense: For each eigenvalue Ai (A + .6. A ) of A + .6.A there is an eigenvalue Aj (A)
of A such that
If the 20 x 20 matrix A has, say, the eigenvalues Aj = j, j = 1, 2, ... , 20, then
lub 2 (A) = 20 because A = AH (see Exercise 8). Subjecting all elements of A to
a relative error of at most eps, i.e., replacing A by A +.6.A with I.6.AI : : : eps IAI,
it follows that [see Exercise 11]
6.4 The Schur Normal form of a Matrix
379
lub 2 (L.A) ::::: lub 2 (IL.AI) ::::: lub2(eps IAI)
::::: epsv20lub2(A) < 90 eps.
On the other hand, a mere relative error IL. li I = eps of the coefficients Ii of the
characteristic polynomial 'P(/-t) = (/-t -1)(/-t - 2) ... (/-t - 20) produces tremendous
changes in the zeros Aj = j of'P [see Example (1) of Section 5.8].
The situation is particularly bad for Hermitian matrices with clustered
eigenvalues: the eigenvalue problem for these matrices is still well conditioned even for multiple eigenvalues, whereas multiple zeros of a polynomial
'P are always ill-conditioned functions of the coefficients ri.
In addition, many methods for computing the coefficients ri of the characteristic polynomial are numerically unstable. As an example, we mention
the method of Frazer, Duncan, and Collar, which is based on the observation [see (6.3.2) (c)] that the vector c = [rO,rl, ... ,rn-lf is the solution of a linear system of equations Tc = -tn with the nonsingular
matrix T := [to, t l , ... , tn-I], provided that to, ... , tn-l is a Krylov sequence oflength n. Unfortunately, the matrix T is in general ill-conditioned,
cond(T) » 1, so that the computed solution c can be highly incorrect [see
Section 4.4]. Indeed, as k --t 00, the vectors tk = Akto will, when scaled by
suitable factors (J"k, converge toward a nonzero vector t = limk (J"ktk that, in
general, does not depend on the choice of the initial vector to. The columns
of T tend, therefore, to become "more and more linearly dependent" [see
Section 6.6.3 and Exercise 12].
6.4 The Schur Normal Form of a Matrix; Hermitian
and Normal Matrices; Singular Values of Matrices
If one does not admit arbitrary nonsingular matrices T in the similarity
transformation T- l AT, it is in general no longer possible to transform A
to Jordan normal form. However, for unitary matrices T, i.e., matrices T
with THT = I, one has the following result of Schur:
(6.4.1) Theorem. For every n x n matrix A there is a unitary n x n matrix
U with
*
Here Ai, i = 1, ... , n, are the (not necessarily distinct) eigenvalues of A.
= 1 the theorem is trivial.
Suppose the theorem is true for matrices of order n - 1, and let A be an
n x n matrix. Let Al be an eigenvalue of A, and Xl =f. 0 a corresponding
PROOF. We use complete induction on n. For n
380
6 Eigenvalue Problems
eigenvector with Ilxd~ = x{f Xl = 1, AX1 = >'1X1. Then one can find n - 1
additional vectors X2, ... , Xn such that Xl, X2, ... , Xn forms an orthonormal
basis of <en, and the n x n matrix X := [Xl, ... ,xnl with columns Xi is thus
unitary, XH X = I. Since
XH AXel = XH AX1 = A1XHx1 = A1e1,
the matrix X H AX has the form
where A1 is a matrix of order n - 1 and aH E <e n- 1. By the induction
hypothesis, there exists a unitary (n - 1) x (n - 1) matrix U1 such that
The matrix
then is a unitary n x n matrix satisfying
UHAU =
[~ ufo ] XHAX [10
[1o uf0] [A1 a] [1 0]
0
A1
0
U1
The fact that Ai, i = 1, ... , n, are zeros of det(U H AU - p1), and hence
eigenvalues of A, is trivial.
0
Now, if A = AH is a Hermitian matrix, then
is again a Hermitian matrix. Thus, from (6.4.1), there follows immediately
6.4 The Schur Normal form of a Matrix
381
(6.4.2) Theorem. For every Hermitian n x n matrix A = AH there is a
unitary matrix U = [Xl, ... ,xn ] with
U- 1 AU = U H AU = [ AOI
The eigenvalues Ai, i = 1, ... , n, of A are real. A is diagonalizable. The ith
column Xi of U is an eigenvector belonging to the eigenvalue Ai: AXi = Ai xi .
A thus has n linearly independent pairwise orthogonal eigenvectors.
If the eigenvalues Ai of a Hermitian n x n matrix A = AH are arranged
in decreasing order,
Al ;::: A2 ;::: ... ;::: An,
then Al and An can be characterized also in the following manner [see
(6.9.14) for a generalization]:
(6.4.3)
An =
PROOF. If U H AU
x H Ax
xHx
xHAx
min - H - ·
OjixE<cn X X
= A = diag[Al, ... , An], U unitary, then for all x -=f. 0,
(xHU)U H AU(UHx)
(xHU)(UHX)
2:iAi l1]i 12
2:i l1]i 12
< 2:i A1 11]iI 2 = A
- 2:il1]iI 2
1,
where y := U H x = [1]1, ... , 1]n]T -=f. O. Taking for x -=f. 0 in particular an
eigenvector belonging to AI, Ax = Al x, one gets x H Ax/x H x = AI, so that
Al = maxOjixE<cn x H Ax/x Hx. The other assertion in (6.4.3) follows from
what was just proved by replacing A with -A.
From (6.4.3) and the definition (4.3.1) of a positive definite (positive
semidefinite) matrix A, one obtains immediately
(6.4.4). A Hermitian matrix A is positive definite (positive semidefinite)
if and only if all eigenvalues of A are positive (nonnegative).
A generalization of the notion of a Hermitian matrix is that of a normal
matrix: An n x n matrix A is called normal if
i.e., A commutes with AH. For example, all Hermitian, diagonal, skew
Hermitian and unitary matrices are normal.
(6.4.5) Theorem. An n x n matrix A is normal if and only if there exists
a unitary matrix U such that
382
6 Eigenvalue Problems
U- 1 AU = U H AU =
>'1
[0
Normal matrices are diagonalizable and have n linearly independent pairwise orthogonal eigenvectors Xi, i = 1. ... , n, AXi = AiXi, namely the
columns of the matrix U = [Xl, ... ,XnJ.
PROOF. By Schur's theorem (6.4.1), there exists a unitary matrix U with
UHAU = [
A1
* . . **..]
o
An
From AHA = AA H there now follows
RHR= U H AHUU H AU = U H AHAU
= U H AAHU = U H AUU H AHU
=RRH.
For the (1,1) element of RH R = RRH we thus obtain
hence r1k = 0 for k = 2, ... , n. In the same manner one shows that all
nondiagonal elements of R vanish.
Conversely, if A is unitarily diagonalizable, U H AU = D, where D :=
diag(A1, ... , An), UHU = I, there follows at once
AH A = UDHUHUDU H = UIDI 2U H = UDUHUDHU H = AAH.
0
Given an arbitrary m x n matrix A, the n x n matrix AHA is positive
semidefinite, since xH (AH A)x = IIAxll~ 2': 0 for any X E <en. Its eigenvalues
A1 2': A2'" 2': An 2': 0 are nonnegative by (6.4.4) and can therefore be
written in the form Ak = (J~ with (Jk 2': O. The numbers (J1 2': ... 2': (In 2': 0
are called
(6.4.6)
singular values of A.
Replacing the matrix A in (6.4.3) by AH A, one obtains immediately
(6.4.7)
6.4 The Schur Normal form of a Matrix
383
In particular, if m = n and A is nonsingular, one has
(6.4.8)
The smallest singular value an of a square matrix A gives the distance of
A to the "nearest" singular matrix:
(6.4.9) Theorem. Let A and E be arbitrary n x n matrices and let A have
the singular values al 2:: a2 2:: ... 2:: an 2:: O. Then
(a) lub 2 (E) 2:: an if A + E is singular.
(b) There is a matrix E with lub 2 (E) = an such that A + E singular.
PROOF. (a): Let A + E be singular, thus (A + E)x = 0 for some x =I- O.
Then (6.4.7) gives
hence an :::; lub 2 (E).
(b): If an = 0, there is nothing to prove. For, by (6.4.7), one has 0 =
IIAxl12 for some x =I- 0, so that A already is singular. Let, therefore, an > O.
Because of (6.4.7), there exist vectors u, v such that
IIAul12 = an,
IIul12 = 1,
1
IIvl12 = l.
v := -Au,
an
For the special n x n matrix E:= -anvu H , one then has (A+E)u = 0, so
that A + E is singular, and moreover
D
An arbitrary m x n matrix A can be transformed unitarily to a certain
normal form in which the singular values of A appear:
(6.4.10) Theorem. Let A be an arbitrary (complex) m x n matrix. Then:
(a) There exist a unitary m x m matrix U and a unitary n x n matrix V
such that U H AV = E is an m x n "diagonal matrix" of the following
form:
E=[~ ~],
D:=diag(al, ... ,ar ),
al2::a22::···2::ar>0.
Here aI, ... , a r are the nonvanishing singular values of A, and r is
the rank of A.
384
6 Eigenvalue Problems
(b) The nonvanishing singular values of AH are also precisely the number
aI, ... , aT'
The decomposition A = U EV H is called
the singular-value decomposition of A.
(6.4.11)
°
°
PROOF. We show (a) by mathematical induction on m und n. For m =
or n = there is nothing to prove. We assume that the theorem is true for
(m - 1) x (n - 1) matrices and that A is an m x n matrix with m ~ 1,
n ~ 1. Let al be the largest singular value of A. If al = 0, then by (6.4.7)
also A = 0, and there is nothing to show. Let, therefore, al > 0, and let
Xl -=I- be an eigenvector of AHA for the eigenvalue ar, with IIxll12 = 1:
°
(6.4.12)
Then one can find n - 1 additional vectors X2, ... , Xn E (Cn such that the
n x n matrix X := [Xl, X2, ... , xnl with the columns Xi becomes unitary,
X H X = In. By virtue of IIAxlll~ = xf AH AXI = arxf Xl = ar > 0,
the vector Yl:= l/alAxl E (Cm with IIYll12 = 1 is well defined, and one
can find m - 1 additional vectors Y2, ... , Ym E (Cm such that the m x m
matrix Y := [Yl, Y2, ... , Yml is likewise unitary, yHy = 1m. Now from
(6.4.12) and the definition of Yl, X and y, there follows at once, with
el:= [1,O, ... ,OV E (Cn,el:= [1,O, ... ,OV E <em, that
yH AXel = yH AXI = alyH Yl = aIel E (Cm
and
(yH AX)Hel = XH AHYel = XH AHYI = ~XH AH AXI
al
so that the matrix Y H AX has the following form:
Here A is an (m - 1) x (n - 1) matrix.
By the induction hypothesis, there exist a unitary (m - 1) x (m - 1)
matrix {j and a unitary (n - 1) x (n - 1) matrix V such that
with t; a (m - 1) x (n - 1) "diagonal matrix" of the form indicated. The
m x m matrix
6.4 The Schur Normal form of a Matrix
u:= y.
385
[~ &]
is unitary, as is the n x n matrix
and one has
UHAV=[~ [?H]YHAX[~ ~]=[~ UOH][~l ~][~1~]
[~1 ~]=[~~]=E,
D:=diag(a1, ... ,ar ),
E being an m x n diagonal matrix with a2 2: ... 2: a r > 0, ar
Amax(AH A). Evidently, rank A = r, since rank A = rank U H AV = rank
E.
We must still prove that a1 2: a2 and that the <Ti are the singular values
of A. Now from U H AV = E there follows, for the n x n diagonal matrix
EHE,
EH E = diag(ai, ... , a;, 0, ... ,0) = VH AHUU H AV = VH (AH A)V,
a;
so that [see Theorem (6.4.2)] ar, ... ,
are the nonvanishing eigenvalues
of AHA, and hence aI, ... , ar are the nonvanishing singular values of A.
Because of ar = Amax(AH A), one also has a1 2: a2.
0
The unitary matrices U, V in the decomposition U H AV = E have the
following meaning: The columns of U represent m orthonormal eigenvectors of the Hermitian m x m matrix AA H , while those of V represent n
orthonormal eigenvectors of the Hermitian n x n matrix AHA. This follows at once from U H AAHU = EEH,V H AH AV = EH E, and Theorem
(6.4.2). Finally we remark that the pseudoinverse A+ [see Section 4.8.5] of
the m x n matrix A can be immediately obtained from the decomposition
UHAV=E:If
E =
[~ ~],
D = diag(al, ... ,<Tr ), a1 2: ... 2: <Tr > 0,
then the n x m diagonal matrix
0] '
[D-o 0
1
is the pseudoinverse of E, and one verifies at once that the n x m matrix
(6.4.13)
386
6 Eigenvalue Problems
satisfies the conditions of (4.8.5.1) for a pseudoinverse of A, so that A+,
in view of the uniqueness statement of Theorem (4.8.5.2), must be the
pseudoinverse of A.
6.5 Reduction of Matrices to simpler Form
The most common methods for determining the eigenvalues and eigenvectors of a dense matrix A proceed as follows. By means of a finite number
of similarity transformations
i = 1, 2, ... , m,
one first transforms the matrix A into a matrix B of simpler form,
B:= Am = T-1AT,
and then determines the eigenvalues A and eigenvectors y of B, By = Ay.
For x := Ty = T 1 ··· Tmy, since B = T- 1AT, we then have
Ax = Ax,
i.e., to the eigenvalue A of A there belongs the eigenvector x. The matrix
B is chosen in such a way that
(1) the determination of the eigenvalues and eigenvectors of B is as simple
as possible (i.e., requires as few operations as possible) and
(2) the eigenvalue problem for B is not (substantially) worse conditioned
than that for A (i.e., small changes in the matrix B do not impair
the eigenvalues of B, nor therefore those of A, substantially more than
equally small changes in A).
In view of
B = T-1AT,
B + lc,B = T-1(A + lc,A)T,
lc,A:= T lc,BT- 1,
given any vector norm 11·11 and corresponding matrix norm lub (.), one gets
the following estimates:
lub(B) ::::: cond(T) lub(A),
lub(lc,A) ::::: cond(T) lub(lc,B),
and hence
lub(lc,A) < (
d(T))2Iub(lc,B)
lub(A) - con
lub(B) .
6.5 Reduction of Matrices to simpler Form
387
For large cond(T) » 1 even a small error 6B, committed when computing
B, say lub(6B)/lub(B) ::::; eps, has the same effect on the eigenvalues of
A as the equivalent error 6A, which can be as large as
lub(l:>A)
2lub(l:>B)
2
lub(A) ::::; cond(T) lub(B) = cond(T) . eps.
Hence for large cond(T) , no method to compute the eigenvalues of A by
means of the eigenvalues of B will be numerically stable [see Section 1.3].
Since
cond(T) = cond(TI ... Tm) :"::: cond(Td··· cond(Tm) ,
numerical stability can be expected only if one chooses the matrices Ti such
that cond(Ti) does not become too large. This is the case, in particular,
for the maximum norm Ilxll oo = maxi IXil and elimination matrices of the
form [see Section 4.1]
1
0
1
Ti = G j =
with Ilkjl :"::: 1,
1)+I,j
0
lnj
0
1
(6.5.0.1)
1
0
1
1 =
G-:
J
-1)+I,j
-lnj
0
0
1
condoo(Ti) :"::: 4,
and also for the Euclidean norm IIxI12 = v'xHx and unitary matrices Ti = U
(e.g., Householder matrices) for which cond(Ti) = 1. Reduction algorithms
using either unitary matrices Ti or elimination matrices Ii as in (6.5.0.1) are
described in the following sections. The "simple" terminal matrix B = Am
that can be achieved in this manner, for arbitrary matrices, is an upper
Hessenberg matrix, which has the following form:
*
B=
*
*
o
o
bik = 0 for k s: i - 2.
0
* *
388
6 Eigenvalue Problems
For Hermitian matrices A = AH only unitary matrices T i , T i- 1 = TiH , are
used for the reduction. If A i - 1 is Hermitian, then so is Ai = T i- 1A i - 1T i :
A{i = (TiH Ai_ITi)H = TiH AE1 Ti = TiH A i- 1T i = Ai.
For the terminal matrix B one thus obtains a Hermitian Hessenberg matrix,
i.e., a (Hermitian) tridiagonal matrix, or Jacobi matrix:
B=
81
12
'Y2
82
0
8i = 8i .
0
'Yn
1n
8n
6.5.1 Reduction of a Hermitian Matrix to Tridiagonal Form.
The Method of Householder
In the method of Householder for the tridiagonalization of a Hermitian
n x n matrix AH = A =: Ao, one uses suitable Householder matrices [see
Section 4.7]
TiH = T i- 1 = Ti = I - f3iUiU{i
for the transformation
(6.5.1.1)
We assume that the matrix A-I = (ajk) has already the following form:
A i- 1 =
J i- 1
C
0
cH
8i
a ,H
0
ai
Ai- 1
(ajk),
with
[~l
81
12
'Y2
82
0
0
0
1i
8i
0
'Yi-1
1i-1
8i- 1
0
0
'Yi
_[ai~"il
a, -
:
ani
.
6.5 Reduction of Matrices to simpler Form
389
According to Section 4.7, there is a Householder matrix Ti of order n - i
such that
T;ai = k· el E <en-i.
(6.5.1.2)
T; has the form Ti = I - (3uu H , U E <e n - i , and is given by
n
L lajil
8)
(3'= {1/(0'(0' + lai+l,il))
.
0
(6.5.1.3)
k := -0"
U:= [
if 0' =I- 0,
otherwise,
i
if a'+1
1 , , 1 ,' = e <pla'+1
' l . , 2,I ,
ei<p
ei<p
2,
j=i+l
(0' + lai+l,i I)
ai+2,i
.
1
.
ani
Then, for the unitary n x n matrix Ii, partitioned like (6.5.1.1),
Ti :=
I
0
0
0
1
0
0
0
Ti
} i-I
one clearly has TiH = Ti- 1 = Ti and, by (6.5.1.2),
c
o
o
390
6 Eigenvalue Problems
81
"(2
"12
82
0
o
o
"(i-l
o
0
"Ii-1
8i - 1
"(i
0
0
"Ii
8i
"(HI
0
...
0
=: Ai
"IHI
o
o
o
with "IHI := k.
Since Ti = I - (3uu H , one can compute TUii-lTi as follows:
Introducing for brevity the vectors p, q E <D n - i ,
T;.t4 i - 1 T i = A i - 1 - pu H - upH + (3upH uu H
(6.5.1.4)
=Ai_l-U[P_~(pHU)u]H - [p_~(pHU)U]uH
= A i - 1 - uqH - qu H .
The formulas (6.5.1.1)-(6.5.1.4) completely describe the ith transformation
Evidently,
o
B = An - 2 =
"12
o
"In
6.5 Reduction of Matrices to simpler Form
391
is a Hermitian tridiagonal matrix.
A formal ALGOL-like description of the Householder transformation for
a real symmetric matrix A = AT = [ajkJ with n ;:::: 2 is
for i := 1 step 1until n - 2 do
begin lSi := aii;
i----n
L lajil if ai+l,i < 0 then :=
2 ;
S :=
Efl
8
-8;
j=i+l
e := 8 + ai+l,i;
if 8 = 0 then begin aii := 0; goto M Mend;
(3:= aii := 1/(8 x e);
Ui+l := ai+1,i := e;
for j := i + 2 step 1 until n do Uj := aji;
for j := i + 1 step 1 until n do
'Yi+l := -8;
Pj := (
t
t
j=i+1
ajk x Uk
+
t
akj x Uk)
x (3;
k=j+1
8k := ( .
Pj x Uj) x (3/2;
)=,+1
for j := i + 1 step 1 until n do
qj := Pj - 8k x Uj;
for j := i + 1 step 1 until n do
for k := i + 1 step 1 until j do
ajk := ajk - qj x Uk - Uj x qk;
MM:
end;
This program takes advantage of the symmetry of A: Only the elements
need to be given. Moreover, the matrix A is overwritten with
the essential elements (3i, Ui of the transformation matrix Ti = 1- (3iuiuf!,
i = 1, 2, ... , n - 2: Upon exiting from the program, the ith column of A
will be occupied by the vector
ajk with k :::; j
i = 1,2, ... ,n - 2.
(The matrices T i , i = 1,2, ... ,n - 2, are needed for the "back transformation" of the eigenvectors: if y is an eigenvector of A n - 2 for the eigenvalue
A,
A n - 2 y = AY,
then x := T 1 T2 ··· Tn - 2 y is an eigenvector of A, Ax = AX.)
392
6 Eigenvalue Problems
Tested ALGOL programs for the Householder reduction and back transformation of eigenvectors can be found in Martin, Reinsch, and Wilkinson
(1971); FORTRAN programs, in Smith et al. (1976).
Applying the transformations described above to an arbitrary nonHermitian matrix A of order n, the formulas (6.5.1.1), (6.5.1.3) give rise to
a chain of matrices Ai, i = 0, 1, ... , n - 2, of the form
Ao =A,
*
*
o
o
* *
*
o
* * .
· *
* * .
· *
* * .
* *.
· *
. *
* * .
· *
v
i-I
The first i-I columns of A i - I are already those of a Hessenberg matrix .
. A n - 2 is a Hessenberg matrix. During the transition from A i - I to Ai the
elements ajk of A i - I with j, k :::; i remain unchanged.
ALGOL programs for this algorithm can be found in Martin and Wilkinson (1971); FORTRAN programs, in Smith et al. (1976).
In Section 6.5.4 a further algorithm for the reduction of a general matrix
A to Hessenberg form will be described which does not operate with unitary
similarity transformations.
The numerical stability of the Householder reduction can be shown
in the following way: Let Ai and Ti denote the matrices obtained if the
algorithm is carried out in floating-point arithmetic with relative precision
eps; let Ui denote the Householder matrix which, according to the rules
of the algorithm, would have to be taken as transformation matrix for
the transition A i - I --+ Ai in exact arithmetic. Thus, Ui is an exact unitary
matrix, while Ti is an approximate unitary matrix, namely the one obtained
in place of Ui by computing Ui in floating-point arithmetic. The following
relations thus hold:
Ti = fl(Ui ),
Ai = fl(TiAi_ITi).
By means of the methods described in Section 1.3 one can now show [see,
e.g., Wilkinson (1965)] that
6.5 Reduction of Matrices to simpler Form
393
lub 2 (Ti - Ui ) ~ f(n) eps,
Ai = fl(Ti Ai - 1T;) = Ti Ai - 1T i + G i ,
(6.5.1.5)
lub 2 (G i ) ~ f(n) eps lub 2 (Ai-I),
where f(n) is a certain function [for which, generally, f(n) = 0(71"'), a ~ 1].
From (6.5.1.5), since lub 2 (Ui ) = 1, UiH = Ui- 1 = Ui (U i is a Householder
matrix!), there follows at once
Ri := Ti - Ui ,
lub 2 (Ri ) ~ f(n) eps,
-
-1 -
-
-
-
Ai = Ui Ai-lUi + R iA i- 1Ui + Ui Ai- 1R i + RiAi-1Ri + Gi
-1 -
=: Ui
Ai-lUi + F i ,
where
or, since f(n) eps « 3, in first approximation:
lub 2 (F;) ::; 3 eps f( n) lub 2 (Ai-r),
(6.5.1.6)
lub 2 (A;) ::; (1 + 3 eps f(n)) lub 2 (A i -r),
::; (1 + 3 eps f(n))i lub 2 (A).
For the Hessenberg matrix An - 2 , finally, since A = Ao, one obtains
(6.5.1.7)
where
n-2
F :=
L U1U
2 ··· Ui F i Ui- 1 ... U
1 1.
i=l
It thus follows from (6.5.1.6) that
n-2
lub 2 (F) ~
L lub (F
2
i)
i=l
n-2
::; 3 eps f(n) lub 2 (A)
L (1 + 3 eps f(n))i-1,
i=l
or, in first approximation
(6.5.1.8)
lub 2 (F) ::; 3(71 - 2)f(n) eps lub 2 (A).
Provided nf(n) is not too large, the relations (6.5.1.7) and (6.5.1.8) show
that the matrix An - 2 is exactly similar to the matrix A + F, which is only
a slight perturbation of A, and therefore the method is numerically stable.
394
6 Eigenvalue Problems
6.5.2 Reduction of a Hermitian Matrix to Tridiagonal or
Diagonal Form: The Methods of Givens and Jacobi
In Givens' method (1954), a precursor of the Householder method, the
chain
for the transformation of a Hermitian matrix A to tridiagonal form B = Am
is constructed by means of unitary matrices Ti = njk of the form ('P, 'ljJ
real)
1
0
1
cos'P
(6.5.2.1)
-e- i 1j; sin 'P
+-j
cos 'P
+-k
1
njk =
1
ei 1j; sin 'P
1
1
0
In order to describe the method of Givens we assume for simplicity that
A = AH is real; we can choose in this case 'ljJ = 0, and njk is orthogonal.
Note that in the left multiplication A ~ njkl A = nJ{,A only rows j and
k of A undergo changes, while in the right multiplication A ~ Anjk only
columns j and k change. We describe only the first transformation step
A = Ao ~ T I- I Ao =: A~ ~ A~TI = T I- I AoTI =: AI. In the half step
Ao ~ A~ the matrix TI = n 23 , T1- I = n~ is chosen such [see Section
4.9] that the element of A~ = n~Ao in position (3,1) is annihilated; in
the subsequent right multiplication by n 23 , A~ ~ Al = A~n23, the zero
position (3,1) is preserved. Below is a sketch for a 4 x 4 matrix, where
changing elements are denoted by *:
~~ [~
x
x
x
x
x
x
x
x
~l
~ n~Ao =
~ A~n23 =
[~
[~
x
x
* *
* *
x x
0
*
* *
* *
* *
~l
~l
=A~
=: AI.
6.5 Reduction of Matrices to simpler Form
395
Since with Ao, also Al is Hermitian, the transformation Ao -+ Al also
annihilates the element in position (1, 3). After this, the element in position
(4, 1) is transformed to zero by a Givens rotation T2 = [l24, etc. In general,
one takes for Ti successively the matrices
.. , ,
... ,
[In-I,n,
and chooses [ljk, j = 2, 3, ... , n -1, k = j + 1, j + 2, ... , n, so as to annihilate the element in position (k,j - 1). A comparison with Householder's
method shows that this variant of Givens method requires about twice as
many operations. For this reason, Householder's method is usually preferred. There are, however, modern variants ("rational Givens transformations") which are comparable to the Householder method.
The method of Jacobi, too, employs similarity transformations with
the special unitary matrices [ljk (6.5.2.1); however, it no longer produces
a finite sequence ending in a tridiagonal matrix, but an infinite sequence
of matrices A (i), i = 0, 1, ... , converging to a diagonal matrix
Here, the Ai are just the eigenvalues of A. To explain the method, we again
assume for simplicity that A is a real symmetric matrix. In the transformation step
A(i) -+ A(i+I)
= [lJr,A(i)[ljk
the quantities c:= cosrp, s := sinrp of the matrix [ljk, j < k, in (6.5.2.1)
are now determined so that ajk = 0 (we denote the elements of A (i) byars,
those of A(i+ I ) by a~s):
A(i+l)
=
I
a~1
a~.,J
a1,k
a~,n
a j1
,
a( .
J,J
0
a j .n
a~l
0
a~k
a~,n
a~1
an,j
I
,
a~,k
a~.n
Only the entries in rows and columns j and k (in bold face) are changed,
according to the formulas
396
6 Eigenvalue Problems
(6.5.2.2)
, - a 'kj - -es (ajj - akk ) + (2
e - s 2) ajk -~ 0 ,
a jk -
a~k = S2ajj + e2akk - 2eSajk.
From this, one obtains for the angle cp the defining equation
2es
2ajk
tan 2cp = - - - = ----"-e2 - S2
ajj - akk
By means of trigonometric identities, one can compute from this the quantities e and s and, by (6.5.2.2), the a~s'
It is recommended, however, that the following numerically more stable
formulas be used [see Rutishauser (1971), where an ALGOL program can also
be found]. First compute the quantity '13 := cot 2cp from
'13 '= ajj - akk
.
2ajk'
and then t := tan cp as the root of smallest modulus of the quadratic equation
t 2 + 2M - 1 = 0,
that is,
t=
s(iJ)
1'131 +
VI +
'13 2 '
s(iJ) := {~1
if '13 :::: 0,
otherwise,
or t := 1/2'13 if 1'131 is so large that '13 2 would overflow. Obtain the quantities
e and s from
1
c'.- --===
JI+t2' s:= te.
Finally compute the number T := tan(cp/2) from
S
T
:= (1
+ e)'
and with the aid of s, t and T rewrite the formulas (6.5.2.2) in a numerically
more stable way as
+ t ajk,
aj j
ajj
a'jk
a"k j ' -0'
a~k
akk -
t ajk·
6.5 Reduction of Matrices to simpler Form
397
For the proof of convergence, one considers
S(A(i)) := ~]ajkI2,
S(A(i+l)) =
j#
I:lajkl 2
j#
the sums of the squares of the off-diagonal elements of A(i) and A(i+1),
respectively. For these one finds, by (6.5.2.2),
The sequence of nonnegative numbers S (A (i)) therefore decreases monotonically, and thus converges. One can show that limi-toc S(A(i)) = 0 (i.e.,
the A(i) converge to a diagonal matrix), provided the transformations Jl jk
are executed in a suitable order, namely row-wise,
... ,
... ,
Jln-1,n,
and, in this order, cyclically repeated. Under these conditions one can even
prove quadratic convergence of the Jacobi method, if A has only simple
eigenvalues:
'hN
WI t
IS :=
n(n-1)
:= ----'--::-2--'-'
n:ln
IAi(A) - Aj(A)1 > 0
'-r-J
[for the proof, see Wilkinson (1962); further literature: Rutishauser (1971),
Schwarz, Rutishauser and Stiefel (1972), Parlett(1980)].
In spite of this rapid convergence and the additional advantage that an
orthogonal system of eigenvectors of A can easily be obtained from the Jl jk
employed, it is more advantageous in practical situations, particularly for
large n, to reduce the matrix A to a tridiagonal matrix J by means of the
Householder method [see Section 6.5.1] and to compute the eigenvalues and
eigenvectors of J by the QR method, since this method converges cubically.
This all the more so if A has already the form of a band matrix: in the QR
method this form is preserved; the Jacobi method destroys it.
We remark that Eberlein developed a method for non-Hermitian matrices similar to Jacobi's. An ALGOL program for this method, and further
details, can be found in Eberlein (1971).
398
6 Eigenvalue Problems
6.5.3 Reduction of a Hermitian Matrix to Tridiagonal Form:
The Method of Lanczos
Krylov sequences of vectors q, Aq, A2q, ... belonging to an n x n matrix
and a starting vector q E en were already used for the derivation of the
Frobenius normal form of a general matrix in Section 6.3. They also play an
important role in the method of Lanczos (1950) for reducing a Hermitian
matrix to tridiagonal form. Closely related to such a sequence of vectors is
a sequence of subspaces of en
i::::: 1,
Ki(q, A) := span[ q, Aq, ... , Ai-1q J,
Ko(q, A) := {o}.
called Krylov spaces: Ki(q, A) is the subspace spanned by the first i vectors
of the sequence {Ajq}j~o. As in Section 6.3, we denote by m the largest
index i for which q, Aq, ... , Ai-l q are still linearly independent, that is,
dim Ki(q, A) = i. Then m :::; n, Amq E Km(q, A), the vectors q, Aq, ... ,
Am-l q form a basis of Km(q, A), and therefore AKm(q, A) C Km(q, A):
the Krylov space Km(q, A) is A-invariant and the map x H p(x) := Ax
describes a linear map of Km(q, A) into itself.
In Section 6.3 we arrived at the Frobenius matrix (6.3.1) when the map
P was described with respect to the basis q, Aq, ... , Am-l q of Km(q, A).
The idea of the Lanczos method is closely related: Here, the map P is
described with respect to a special orthonormal basis ql, q2, ... , qm of
Km(q, A), where the qj are chosen such that for all i = 1, 2, ... , m, the
vectors ql, q2, ... , qi form an orthonormal basis of Ki(q, A). If A = AH is a
Hermitian n x n matrix, then such a basis is easily constructed for a given
starting vector q. We assume q i- 0 in order to exclude the trivial case and
suppose in addition that Ilqll = 1, where II . II is the Euclidean norm. Then
there is a three-term recursion formula for the vectors qi [similar recursions
are known for orthogonal polynomials, cf. Theorem (3.6.3)]
(6.5.3.1a)
ql := q,
IIqO:= 0,
Aqi = liqi-I + ~iqi + li+1qHl
for i ::::: 1,
where
~i := qf Aqi,
(6.5.3.1b)
IHI := Ilrill
with ri := Aqi - ~iqi - liqi-},
qi+1 := rd'HI'
if IHI i- O.
Here, all coefficients Ii, ~i are real. The recursion breaks off with the first
index i =: io with IHI = 0, and then the following holds
io = m = max dim Ki(q, A).
2
6.5 Reduction of Matrices to simpler Form
399
PROOF. We show (6.5.3.1) by induction over i. Clearly, since Ilqll = 1, the
vector ql := q provides an orthonormal basis for Kl (q, A). Assume now
that for some j 2': 1 vectors ql, ... , qj are given, so that (6.5.3.1) and
hold for all i :::; j, and that ri i- 0 in (6.5.3.1b) for all i < j. We show
first that these statements are also true for j + 1, if rj i- O. In fact, then
'YJ+l i- 0, 6j and qj+1 are well defined by (6.5.3.1b), and IIqJ+11I = 1. The
vector qj+1 is orthogonal to all qi with i ::; j: This holds for i = j, because
'YJ+l i- 0, because
from the definition of 6j, and using the induction hypothesis
'YJ+lqfqJ+l = qf A% - 6jqfqj = O.
For i = j - 1, the same reasoning and A = A H first give
'YJ+lqf-l%+1 = qf-1Aqj - 'Yjqf-lqj-l = (Aqj_dHqj - 'Yj.
The orthogonality ofthe qi for i :::; j and A%-l = 'Yj-lqj-2+ 6j-lqj-l +'Yjqj
then imply (Aqj_l)H qj = "(j = 'Yj, and therefore qJ-l qj+1 = O. For i < j-1
we get the same result with the aid of Aqi = 'Yiqi-l + 6iqi + 'Yi+1qi+l:
'Yj+1qf qj+1 = qf Aqj = (Aqi)H qj = o.
Finally, since span[ ql, ... , qi 1 = Ki(q, A) c Kj(q, A) for i :::; j, we also
have
which implies by (6.5.3.1b)
qJ+l E span[ qj-l, qj, Aqj] C Kj+1 (q, A),
and therefore span[ ql, ... , qJ+l] c K j +1(q, A). Since the orthonormal vectors ql, ... , qJ+l are linearly independent and dim Kj+1 (q, A) ::; j + 1 we
obtain
Kj+1 (q, A) = span[ ql,···, qj+1].
This also shows j + 1 :::; m = maxi dim Ki (q, A), and io :::; m for the breakoff index io of (6.5.3.1). On the other hand, by the definition io
Aqio E span[ qio-l, qio] C span[ ql, ... ,qio 1= Kio (q, A),
so that, because
400
6 Eigenvalue Problems
we get the A-invariance of Kio(q,A), AKio(q,A) c Kio(q,A). Therefore io 2: m, since Km(q, A) is the first A-invariant subspace among the
Ki(q, A). This finally shows io = m, and the proof is complete.
0
The recursion (6.5.3.1) can be written in terms of the matrices
1 ::; i ::; m,
Ii
as a matrix equation
(6.5.3.2)
AQi = Qdi + [0, ... ,0, Ii+Iqi+l]
= QiJi + Ii+Iqi+Ief,
i = 1, 2, ... , m,
where ei := [0, ... ,0, I]T E IRi is the ith axis vector of IRi. This equation
is easily verified by comparing the jth columns, j = 1, ... , i, on both sides.
Note that the n x i matrices Qi have orthonormal columns, Q,fIQi = li(:=
i x i identity matrix) and the J i are real symmetric tridiagonal matrices.
Since i = m is the first index with Im+1 = 0, the matrix J m is irreducible,
and the preceding matrix equation reduces to {cf. (6.3.3)]
AQm = QmJm
where Q;;,Qm = 1m. Any eigenvalue of J m is also an eigenvalue of A, since
Jmz = Az, z t 0 implies x := Qmz t 0 and
If m = n, i.e., if the method does not terminate prematurely with an m < n,
then Qn is a unitary matrix, and the tridiagonal matrix I n = Q;[ AQn is
unitarily similar to A.
Given any vector q =: ql with iiqii = 1, the method of Lanczos consists
of computing the numbers Ii, iSi, i = 1, 2, ... , m, bl := 0), and the
tridiagonal matrix J m by means of (6.5.3.1). Subsequently, one may apply
the methods of Section 6.6 to compute the eigenvalues and eigenvectors
of J m (and thereby those of A). Concerning the implementation of the
method, the following remarks are in order:
1. The number of operations can be reduced by introducing an auxiliary
vector defined by
Ui := Aqi - liqi-I·
Then ri = Ui - iSiqi, and the number
6.5 Reduction of Matrices to simpler Form
401
can also be computed from Ui, since q{f qi-1 = o.
2. It is not necessary to store the vectors qi if one is not interested in
the eigenvectors of A: In order to carry out (6.5.3.1) only two auxiliary
vectors v, W E ~n are needed, where initially v := q is the given starting
vector with iiqii = 1. Within the following program, which implements the
Lanczos algorithm for a given Hermitian n x n matrix A = A H , Vk and Wk,
k = 1, ... , n, denote the components of v and w, respectively:
W
:= 0; /'1 := 1; i := 1;
1: if /'i #- 0 then
begin if i #- 1 then
for k := 1 step 1 until n do
begin t := Vk; Vk := Wk//'i; Wk := -/'it end;
W := Av + w; {)i := v H w; W := W - {)iV;
m := i; i := i + 1; /'i := Vw H w;
goto 1;
end;
Each step i -+ i + 1 requires about 5n scalar multiplications and one multiplication of the matrix A with a vector. Therefore, the method is inexpensive if A is sparse, so that it is particularly valuable for solving the
eigenvalue problem for large sparse matrices A = AH.
3. In theory, the method is finite: it stops with the first index i = m ::::: n
with /'H1 = O. However, because of the influence ofroundoff, one will rarely
find a computed /'i+l = 0 in practice. Yet, it is usually not necessary to
perform many steps of the method until one finds a zero or a very small
/'i+ 1: The reason is that, under weak assumptions, the largest and smallest
eigenvalues of J i converge very rapidly with increasing i toward the largest
and smallest eigenvalues of A [Kaniel-Paige theory: see Kaniel (1966), Paige
(1971), and Saad (1980)]. Therefore, if one is only interested in the extreme
eigenvalues of A (which is quite frequently the case in applications), only
relatively few steps of Lanczos' method are necessary to find a J i , i « n,
with extreme eigenvalues that already approximate the extreme eigenvalues
of A to machine precision.
4. The method of Lanczos will generate orthogonal vectors qi only in
theory: In practice, due to roundoff, the vectors iii actually computed become less and less orthogonal as i increases. This defect could be corrected
by reorthogonalizing a newly computed vector qi+l with respect to all previous vectors iii, j ::::: i, that is, by replacing qi+l by
402
6 Eigenvalue Problems
However, reorthogonalization is quite expensive: The vectors iii have to
be stored, and step i of the Lanczos method now requires O(i . n) operations instead of O( n) operations as before. But it is possible to avoid a full
reorthogonalization to some extent and still obtain very good approximations for the eigenvalues of A in spite of the difficulties mentioned. Details
can be found in the following literature, which also contains a systematic
investigation of the interesting numerical properties of the Lanczos method:
Paige (1971), Parlett and Scott (1979), and Cullum and Willoughby (1985),
where one can also find programs.
6.5.4 Reduction to Hessenberg Form
It was already observed in Section 6.5.1 that one can transform a given
n x n matrix A by means of n - 2 Householder matrices T; similarly to
Hessenberg form B,
A:= Ao ---+ Al ---+ ... ---+ An-2 = B,
We now wish to describe a second algorithm of this kind, in which one uses
as transformation matrices T; permutation matrices
(here Ci is the ith axis vector of IR n ), and elimination matrices of the form
1
Inj
1
These matrices have the property
p7~1 = P~ = Prs )
1
(6.5.4.1)
1
A left multiplication P7-:;I A of A by pr--;I = Prs has the effect of interchanging rows rand 8 of A, whereas a right multiplication APrs interchanges
6.5 Reduction of Matrices to simpler Form
403
columns rand s of A. A left multiplication Gjl A of A by Gjl has the effect of subtracting lrj times row j from row r of the matrix A for r = j + 1,
j + 2, ... , n, while a right multiplication AGj means that lrj times column
r is added to column j of A for r = j + 1, j + 2, ... , n.
In order to successively transform A to Hessenberg form by means of
similarity transformations of the type considered, we proceed similarly as in
Gaussian elimination [see Section 4.1] and as in Householder's method [see
Section 6.5.1]: Starting with Ao := A one generates further matrices Ai,
i = 1,2, ... , n-2, by similarity transformations. Here Ai is a matrix whose
first i columns have already Hessenberg form. In step i, A i - l -t Ai :=
T i- l Ai-ITi , one uses as Ti a product Ti = Pi+I,sGi+I of a permutation
Pi+I,s (where s :::: i + 1) with an elimination matrix Gi+I' Thus step
consists of two half steps
A i - I ---+ A' := Pi+l,sAi-IPi+I,s ---+ Ai := G;-;I A'Gi+I.
The first corresponds to finding a pivot a~+l,i by partial pivoting as in
Gaussian elimination; this pivot is then used in the second half step to
annihilate the nonzero elements below the subdiagonal of column i of A'
by means of the elimination matrix Gi+1.
After n - 2 steps of this type one ends up with a Hessenberg matrix
B = An - 2.
Since the method proceeds as in 6.5.1, we illustrate only the first step A =
Ao ---+ Al = T1- I ATI , TI = P2,sG 2 , for n = 4 (here, * denotes changing elements).
First one determines by column pivoting the element as,1 of maximal modulus
in the first column of A below the diagonal, s 2: 2, which is then shifted to position
(2,1) by a row permutation (in the scetch s = 4, the pivot is marked by 0):
A
~ A" ~ [~
0
x
x
x
x
x
x
~l-~ P,,~ ~ [~
x
x
--> A'
~ (P"A")P" ~
[!
x
x
* *
x x
* *
* xx
* x
* x
*
~l
~l
Then suitable multiples of row 2 of A' are subtracted from rows j > 2 in order
to annihilate the elements below the pivot, A' ---+ A" := G 21 A', and finally one
computes Al := A"G2 :
A' ----+ A" =
XX
[0 x
x
x
Xl
[X
x ----+ A = 0
o * * *
o * * *
1
0
0
ALGOL programs for this algorithm and for the back transformation of
the eigenvectors can be found in Martin and Wilkinson (1971); FORTRAN
programs, in Smith et al. (1976).
404
6 Eigenvalue Problems
The numerical stability of this method can be examined as follows: Let
Ai and Ti be the matrices actually obtained in place of the Ai, Ti during the
course of the algorithm in a floating-point computation. In view (6.5.4.1),
~o further rounding errors are committed in the computation of Ti - 1 from
Ti ,
1
T, = fl (T.-l)
"
and by definition of Ai one has
(6.5.4.3)
With the methods of Section 1.3 one can derive for the error matrix Ri
estimates of the following form:
(6.5.4.4)
where f(n) is a certain function of n with f(n) = O(n<», a ;:::; 1 [ef. Exercise
13]. It finally follows from (6.5.4.3), since A = Ao, that
(6.5.4.5)
A n - 2 = T n - 2 ... Tl (A + F)Tl ... T n - 2
-
--1
--1
--
where
n-2
(6.5.4.6)
F = LTIT2 ... TiRiTi-1 ... T 2- 1T 1- 1.
i=1
This shows that An - 2 is exactly similar to a perturbed initial matrix A + F.
The smaller F is compared to A, the more stable is the method.
For the matrices Ti we now have, in view of (6.5.4.1) and Iliil :s: 1,
so that, rigorously,
(6.5.4.7)
-
-1
-1
-
-
As a consequence, taking note of Ai ;:::; Ti- ... T 1- AT1 ··· T i , (6.5.4.4), and
(6.5.4.6), on can get the estimates
IUboo(Ai) ~Ciluboo(A),
(6.5.4.8)
IUboo(F) :s: Kf(n) eps IUboo(A),
Ci := 22i ,
n-2
K:=
L2
4i - 2 ,
i=l
with a factor K = K (n) which grows rapidly with n. Fortunately, the bound
(6.5.4.7) in most practical cases is much too pessimistic. It is already rather
unlikely that
lub oo (Tl ... Ti) ~ 2i for i ~ 2.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
405
In all these cases the constants C i and K in (6.5.4.8) can be replaced by
substantially smaller constants, which means that in most cases F is small
compared to A and the method is numerically stable.
For non-Hermitian matrices Housholder's reduction method [see Section 6.5.1] requires about twice as many operations as the method described
in this section. Since in most practical cases this method is not essentially
less stable than the Householder method, it is preferred in practice (tests
even show that the Householder method, due to the larger number of arithmetic operations, often produces somewhat less accurate results).
6.6 Methods for Determining the Eigenvalues and
Eigenvectors
In this section we first describe how classical methods for the determination
of zeros of polynomials (see Sections 5.5, 5.6] can be used to determine the
eigenvalues of Hermitian tridiagonal and Hessenberg matrices.
We then note some iterative methods for the solution of the eigenvalue
problem. A prototype of these methods is the simple vector iteration, in
which one iteratively computes a specified eigenvalue and corresponding
eigenvector of a matrix. A refinement of this method is Wielandt's inverse
iteration method, by means of which all eigenvalues and eigenvectors can
be determined, provided one knows sufficiently accurate approximations for
the eigenvalues. To these iterative methods, finally, belong also the LR and
the QR method for the calculation of all eigenvalues. The last two methods,
especially the QR method, are the best methods currently known for the
solution of the eigenvalue problem.
6.6.1 Computation of the Eigenvalues of a Hermitian
Tridiagonal Matrix
For determining the eigenvalues of a Hermitian tridiagonal matrix
(6.6.1.1)
there are (in addition to the most important method, the QR method,
which will be described in Section 6.6.6) two obvious methods which we
now wish to discuss. Without restricting generality, let J be an irreducible
tridiagonal matrix, i.e., 'Yi "10 for all i. Otherwise, J would decompose into
irreducible tridiagonal matrices J(i), i = 1, ... , k,
406
6 Eigenvalue Problems
but the eigenvalues of J are just the eigenvalues of the J(i), i = 1, ... ,
k, so that it suffices to consider irreducible matrices J. The characteristic
polynomial 4J(JL) of J can easily be computed by recursion: indeed, letting
Pi (JL) := det( J i - JLI),
Ii
~l
"
c5 i
'
and expanding det( J i - JLI) by the last column, one finds the recurrence
relations
(6.6.1.2)
Po(JL) := 1,
pdJL) = 15 1 - JL,
Pi(JL) = (15; -IL)Pi-1(JL) -l!iI 2pi-2tJL),
i = 2,3, ... , n,
4J(IL) == Pn (JL).
Since c5 i and I!i 12 are real, the polynomials Pi (11,) form a Sturm sequence
[see Theorem (5.6.5)]' provided that Ii # 0 for i = 2, ... , n.
It is possible, therefore, to determine the eigenvalues of J with the
bisection method descrihed in Section 5.6. This method is recommended
especially if one wants to compute not all, but only certain prespecified
eigenvalues of .1. It can further be recommended on account of its numerical
stability when some eigenvalues of J are lying very close to each other [see
Barth, Martin and Wilkinson (1971) for an ALGOL program].
Since the eigenvalues of J are all real and simple, they can also be determined with Newton's method, say, in Maehly's version [see (5.5.13)]' at least
if the eigenvalues are not excessively clustered. The values Pn (A (j)), p~ (A (j))
of the characteristic polynomial and its derivative required for Newton's
method can be computed recursively: Pn(A(J)) by means of (6.6.1.2), and
p~(A(j)) by means of the following formulas which are obtained by differentiating (6.6.1.2):
p~(JL) = 0,
p~(p) = -1,
p;(/L) = -Pi-1(P) + (Oi - JL)P;-l(/L) -1,iI 2p;_2(JL),
i = 2, ... , n.
A starting value A(0) 2: maxi Ai for Newton's method can be obtained from
6.6 Methods for Determining the Eigenvalues and Eigenvectors
407
(6.6.1.3) Theorem. The eigenvalues Aj of the matrix J in (6.6.1.1) satisfy
the inequality
11 := In+l := 0.
PROOF. For the maximum norm Ilxll oo =
max IXil one has
and from Jx = AjX, x i- 0, there follows at once
IAjlllxll oo = IIJxll oo :S luboo(J) ·llxll oo ,
Ilxll oo i- 0,
and hence IAjl :S luboo(J).
D
6.6.2 Computation of the Eigenvalues of a Hessenberg Matrix.
The Method of Hyman
Besides the QR method [see Section 6.6.6]' which is used most frequently
in practice, one can in principle use all me.thods of Chapter 5 to determine
the zeros of the polynomial p(p,) = det(B - p,I) of a Hessenberg matrix
B = (b ik ). For this, as for example in the application of Newton's method,
one must evaluate the values p(p,) and p'(p,) for given p,. The following
method for computing these quantities is due to Hyman:
We assume that B irreducible, i.e., that bi,i-l i- for i = 2, ... , n.
For fixed p, on can then determine numbers a, Xl, ... , Xn-l such that
x = (Xl, ... ,Xn-l, x n ), Xn := 1, is a solution of the system of equations
°
or, written out in full,
(b ll - P,)Xl + b12X2 + ... + blnxn = a,
(6.6.2.1)
b2lXl + (b 22 - p,)X2 + ... + b2n x n = 0,
bn,n-lXn-l + (bnn - p')xn = 0.
Starting with Xn = 1, one can indeed determine Xn-l from the last equation
Xn-2 from the second to last equation, ... , Xl from the second equation,
and finally a from the first equation. The numbers Xi and a of course
depend on p,. By interpreting (6.6.2.1) as an equation for x, given a, it
follows by Cramer's rule that
1 = Xn =
a( _1)n-l b2l b32 ... bn,n-l
det(B - p,I)
,
408
6 Eigenvalue Problems
or
( 6.6.2.2)
Apart from a constant factor, a = a(p) is thus identical with the characteristic polynomial of B. By differentiation with respect to /1, letting
x; := x;(p) and observing Xn == 1, x~ == 0 one further obtains from (6.6.2.1)
the formulas
(b l l ~ /h)x~ ~ Xl + b 12 X; + ... + b1.n-lX~_1 = a',
b 21 X;
+ (b 22 ~ p)x; ~ X2 + ... + b2n-lX~_1 = 0,
which, together with (6.6.2.1) and starting with the last equation, can be
solved recursively for the Xi, X;, a, and finally a'. In this manner one can
compute a = a(p) and a'(p) for every /h, and thus apply Newton's method
to compute the zero of a(p), i.e., by (6.6.2.2), the eigenvalues of B.
6.6.3 Simple Vector Iteration and Inverse Iteration of Wielandt
A precursor of all iterative methods for determining eigenvalues and eigenvectors of a matrix A is the simple vector iteration: Starting with an arbitrary initial vector to E (Cn, one forms the sequence of vectors {td with
i = 1. 2, ....
Then
t, = Aito.
In order to examine the convergence of this sequence, we first assume A to
be a diagonalizable n x n matrix having eigenvalues Ai,
We further assume that there is no eigenvalue Aj different from Al with
IAjl = IA11, i.e., there is an integer r > 0 such that
(6.6.3.1)
Al = A2 = ... = Ar
The matrix A, being diagonalizable, has n linearly independent eigenvectors
Xi, AXi = AiXi, which forms a basis of (Cn We can therefore write to in the
form
6.6 Methods for Determining the Eigenvalues and Eigenvectors
409
(6.6.3.2)
For ti there follows the representation
(6.6.3.3)
Assuming now further that to satisfies
-this makes precise the requiremant that to be "sufficiently general" follows from (6.6.3.3) and (6.6.3.1) that
(6.6.3.4)
it
1) i
(An) i
A11 ti = P1 X1 + ... + PrXr + Pr+1 (Ar+
~ Xr+1 + ... + Pn ~ Xn
and thus, because IAj/A11 < 1 for j .::,. l' + 1,
(6.6.3.5)
1
lirn - t = P1X1 + ... + P X
i-4CX)
A1 '
r
r·
Normali:.ling the ti =: (Ti i ), ... , T~i») T in any way, for example, by letting
Hii) I = max Hi)l,
s
it follows from (6.6.3.5) that
(HI)
(6.6.3.6)
T·
Ji
\
1· -(-.).1m
= A1,
~-HX)
~
T
lim Zi = a(P1X1 + ... + PrXr),
,-400
],
where a i= 0 is a normalization constant. Under the stated assumptions
the method thus furnishes both the dominant eigenvalue A1 of A and an
eigenvector belonging to A1, namely the vector Z = a(P1x1 + .. ·+Prxr), and
we say that "the vector iteration converges toward A1 and a corresponding
eigenvector" .
Note that for l' = 1 (AJ a simple eigenvalue) the limit vector z is
independent of the choice to to (provided only that P1 i= 0). If Al is a
multiple dominant eigenvalue (1' > 1), then the eigenvector z obtained
depends on the ratios P1 : P2 : ... : Pr, and thus on the initial vector to. In
addition, it can be seen from (6.6.3.4) that we have linear convergence with
convergence factor IAr+ Ii A11. The proof of convergence, at the same time,
shows that the method does not always converge to Al and an associated
eigenvector, but may converge toward an eigenvalue Ak and an eigenvector
belonging to Ak, if in the decomposition (6.6.3.2) of to one has
P1 = ... = Pk-1 = 0,
410
6 Eigenvalue Problems
(and there is no eigenvalue which differs from Ak and has the same modulus
as Ak). This statement, however, has only theoretical significance, for even
if initially one has exactly PI = 0 for to, due to the effect of rounding errors
we will get for the computed [1 = fl(Ato), in general,
[1 = cA1:1'1 + lh A 2 x 2 + ... + PnAnxn
with a small c =f. 0 and Pi :::::; Pi, i = 2, ... , n, so that the method eventually
still converges to Al and a corresponding eigenvector.
Suppose now that A is a nondiagonalizable matrix with a uniquely determined dominant eigenvalue Al (i.e., IA11 = IAil implies)'1 = Ai). Replacing (6.6.3.2) by a representation of to as linear combination of eigenvectors
and principal vectors of A. one can show in the same way that for "sufficiently general" to the vector iteration converges to Al and an associated
eigenvector .
For practical computation the simple vector iteration is only conditionally useful, since it converges slowly if the moduli of the eigenvalues are
not sufficiently separated, and moreover furnishes only one eigenvalue and
associated eigenvector. These disadvantages are avoided in the inverse iteration (also called fractional iteration) of \Vielandt. Here one assumes that
a good approximation A is already known for one of the eigenvalues AI, ... ,
An of A, say Aj:
(6.6.3.7)
Starting with a "sufficiently general" initial vector to E <en. one then forms
the vectors t i , i = 1. 2, ... , according to
(6.6.3.8)
(A - A1)ti = f i - 1.
If A =f. Ai, i = 1, ... , n, then (A - A1)-1 exists and (6.6.3.8) is equivalent
to
ti = (A - A1)-l ti _ 1 ,
i.e., to the simple vector iteration with the matrix (A - A1)-l and the
eigenvalues l/(Ak - A), k = 1, 2, ... , n. Because of (6.6.3.7) one has
IAj~AI»IAk~AI for Ak=f.Aj.
Assuming A again diagonalizable, with eigenvectors Xi, it follows from to =
PIXI + ... + Pn:rn (if Aj is a simple eigenvalue) that
(6.6.3.9)
6.6 Methods for Determining the Eigenvalues and Eigenvectors
411
The smaller IAj - AI/IAk - AI for Ak =I- Aj (i.e., the better the approximation
A), the better will be the convergence.
The relations (6.6.3.9) may give the impression that an initial vector to
is the more suitable for inverse iteration the more closely it agrees with the
eigenvector Xj, to whose eigenvalue Aj one knows a good approximation A.
They further seem to suggest that with the choice to ~ Xj the "accuracy"
of the t; increases uniformly with i. This impression is misleading, and true
in general only for the well-conditioned eigenvalues Aj of a matrix A [see
Section 6.9], i.e., for the eigenvalues Aj which under a small perturbation
of A change only a little:
if
lub(6A)
lub(A) = O(eps).
According to Section 6.9 all eigenvalues of symmetric matrices A are
well conditioned. A poor condition for Aj must be expected in those cases
in which Aj is a multiple zero of the characteristic polynomial of A and A
has nonlinear elementary divisors, or, what often occurs in practice, when
Aj belongs to a cluster of eigenvalues whose eigenvectors are almost linearly
dependent.
Before we study in an example the possible complications for the inverse
iteration caused by an ill-conditioned Aj, let us clarify what we mean by a
numerically acceptable eigenvalue A (in the context of the machine precision
eps used) and corresponding eigenvector X;.. =I- 0 of a matrix A.
The number A is called a numerically acceptable eigenvalue A if there
exists a small matrix 6A with
lub(6A)/lub(A) = O(eps)
such that A is an exact eigenvalue of A + 6A. Such numerically acceptable
eigenvalues A are produced by every numerically stable method for the
determination of eigenvalues.
The vector X;.. =I- 0 is called a numerically acceptable eigenvector
corresponding to a given approximation A of an eigenvalue A if there exists
a small matrix 6 1 A with
lub(61 A)
lub(A) = O(eps).
(6.6.3.10)
EXAMPLE 1. For the matrix
the number>' := 1 + yIePs is a numerical acceptable eigenvalue, for>. is an exact
eigenvalue of
A + 6A
with
6 [0 0].
A:=
eps
0
412
6 Eigenvalue Problems
Although
x)"
:=
[~]
is an exact eigenvector of A, it is not, in the sense of the above definition, a
numerically acceptable eigenvector corresponding to the approximation A = 1 +
because every matrix
vers,
with
has the form
AlA =
'-"
[~oeps ~],
u
(3, is
arbitrary.
For all these matrices, lub=(D. 1 A) 2': ~» O(eps ·lub(A)).
It is thus possible to have the following paradoxical situation: Let .\0
be an exact eigenvalue of a matrix, Xo a corresponding exact eigenvector,
and .\ a numerically acceptable approximation of '\0, Then Xo need not be
a numerically acceptable eigenvector to the given approximation .\ of .\0.
If one has such a numerically acceptable eigenvector .\ of A, then with
inverse iteration one merely tries to find a corresponding numerically acceptable eigenvector x)". A vector t j, j ~ 1, found by means of inverse
iteration from an initial vector to can be accepted as such an ;J:)", provided
Iltj-III = O(eps) ·lub(A ).
It;f
(6.6.3.11)
For then, in view of (A ~ .\I)tj = t j - I , one has
(A + LiA ~ .\I)t] = 0
with
lub 2 (LiA ) =
Iltj-d2
IItjl12 = 0 (cps lub 2( A )) .
It is now surprising that for ill-conditioned eigenvalues we can have an
iterate tj which is numerically acceptable, while its successor t j + 1 is not:
EXAI\IPLE 2. The matrix
A = [~
~],
T}
= O(eps),
has the eigenvalues Al = T} + vIr!, A2 = T} ~ vir! with corresponding eigenvectors
6.6 Methods for Determining the Eigenvalues and Eigenvectors
413
Since 7J is small, A is very ill-conditioned and )\1, A2 form a cluster of eigenvalues whose eigenvectors Xl, X2 are almost linearly dependent [e.g., the slightly
perturbed matrix
- [00 6]
A·=
.
= A+LiA,
has the eigenvalue A(A) = 0 with min IA(A) - Ai I 2': ~ - 1771 » O(eps)J. The
number A = 0 is a numerically acceptable eigenvalue of A to the eigenvector
Indeed
(A + LiA - o· I)x), = 0
Li A := [=~
for
~],
lub=(LiA)/lub=(A) = _1771_ 1;: :; O(eps).
1+ 1
77
Taking this numerically acceptable eigenvector
as the initial vector to for the inverse iteration, and A
eigenvalue of A, one obtains after the first step
o as an approximate
The vector h, is no longer numerically acceptable as an eigenvector of A, every
matrix lilA with
having the form
with
a
=1 +
{3 - '},
,= 6.
These matrices lilA all satisfy lub=(LilA) 2': 1-1771; there no small among them
with lub ou (LiA) ;:::; O( eps).
On the other hand, every initial vector to of the form
to =
[n
with ITI not too large produces for h a numerically acceptable eigenvector. For
example, if
to =
the vector
[i] ,
414
6 Eigenvalue Problems
h~ ~ [1]0
~ I)
is numerically acceptable, as we just showed.
Similar circumstances to those in this example prevail in general whenever the eigenvalue considered has nonlinear f'lementary divisors. Suppose
Aj is an eigenvalue of A for which there arc elementary divisors of degrees
at most k [k = Tj: sec (6.2.11)]. Through a numerically stable method one
can in general [see (6.9.11)] obtain only an approximately value A within
the error A ~ Aj = O(eps1/k) [we assume for simplicity lub(A) = 1].
Let now Xo, Xl, ... , Xk-1 be a chain of principal vectors [see Sections
6.2, 6.3] for the eigenvalue Aj,
(A ~ AjI);ri = Xi-1
for i = k ~ L ... ,0,
(:1:-1 := 0).
It then follows immediately, for A i- Ai, i = 1, ... , n, that
(A ~ AI)-l [A ~ AI + (A ~ Aj)I]xi = (A ~ AI)-1,ri_1,
(A ~ AI)
-1
1
1
.ri = A' ~ A Xi + A ~ A' (A ~ AI)
J
i=k~l,
J
-1
Xi-1,
... ,O,
and from this, by induction,
(A ~ AI)
-1
1
1
1
Xk-1 = AJ ~ A Xk-1 ~ (Aj ~ A)2 Xk-2 + ... ± (Aj ~ A)k Xo·
For the initial vector to := .Tk-1, we thus obtain for the corresponding t1
and therefore Iltoll/lltIil = O(eps), so that t1 is acceptable as an eigenvector
of A [see (6.6.3.11)]. If, however, we had taken as initial vector to := XU,
the exact eigenvector of A, we would get
and thus
Since eps1/k » eps, the vector t1 is not a numerically acceptable eigenvector
to the approximate value A at hand [cf. Example 1].
For this reason, one applies inverse iteration in practice only in a rather
rudimentary form: Having computed (numerically acceptable) approximations A for the exact eigenvalues of A by means of a numerically stable
6.6 lVlethods for Determining the Eigenvalues and Eigenvectors
415
method, one tentatively determines, for a few different initial vectors to,
the corresponding tl and accepts as eigenvector that vector tl for which
Iitoli/lltill best represents a quantity of O(eps lub(A)).
Since the step to ---+ tl requires the solution of a system of linear equations (6.6.3.8), one applies inverse iteration in practice only for tridiagonal and Hessenberg matrices A. To solve the system of equations (6.6.3.8)
(A - AI is almost singular), one decomposes the matrix A - AI into the
product of a lower and an upper triangular matrix, Land R, respectively.
To ensure numerical stability one must use partial pivoting, i.e., one determines a permutation matrix P and matrices L, R with [see Section 4.1]
P(A - M) = LR,
The solution tl of (6.6.3.8) is then obtained from the two triangular
systems of equations
Lz = Pt o,
(6.6.3.12)
Rtl = z.
For tridiagonal and Hessenberg matrices A, the matrix L is very sparse: In
each column of L at most two elements are different from zero. R for Hessenberg matrices A is an upper triangular matrix; for tridiagonal matrices
A, a matrix of the form
* * *
R=
0
*
o
*
*
so that in cach case the solution of (6.6.3.12) requires only a few operation.
Besides, Ilzll :::::0 Iitoll, by the first relation in (6.6.3.12). The work can therefore be simplified further hy determining tl only from Rtl = z, where one
tries to choose the vector z so that Ilzll/lltlll becomes as small as possible.
An ALGOL program for the computation of the eigenvectors of a symmetric tridiagonal matrix using inverse iteration can he found in Peters
and Wilkinson (1971a); FORTRAN programs, in Smith et al. (1976).
6.6.4 The LR and QR Methods
The LR-method of Rutishauser (1958) and the QR method of Francis
(1961/62) and Kuhlallovskaja (1961) are also iterative methods for the
416
6 Eigenvalue Problems
computation of the eigenvalues of an n x n matrix A. We first describe the
historically earlier LR method. Here, one generates a sequence of matrices A;. beginning with Al := A, according to the following rules: Use the
Gauss elimination method [see Section 4.1] to present Ai as the product of
a lower triangular matrix Li = (Ijk) with I jj = 1 and an upper triangular
matrix Ri
(6.6.4.1)
Thereafter, let
i = 1, 2, ....
From Section 4.1 we know that an arbitrary matrix Ai does not always
have a factorization Ai = LiRi. We assume, however, for the following
discussion that Ai can be so factored. We begin by showing the following.
(6.6.4.2) Theorem. If all decompositions Ai = LiRi exist, then
(a) Ai+l is similar to Ai:
i = 1, 2, ....
(b) A i+ 1 = (L I L 2 ·· ·L i )-lA 1 (L I L 2 ·· ·L i ), i = 1, 2, ....
(c) For the lower triangular matrix Ti := Ll ... Li and the upper triangular
matrix Ui := R i · .. Rl one has
i = 1, 2, ....
PROOF. (a) From A; = LiRi we get
(b) follows immediately from (a).
To prove (c), note that (b) implies
i = 1, 2, ....
It follows for i = 1, 2, ... that
TiUi = Ll ... L i - 1 (LiRi)R i - 1 ... Rl
= LI ... Li-IA;Ri - 1 ... Rl
= AILI ... Li-IRi - 1 ••. Rl
= AITi-lUi-I'
With this, the theorem is proved.
o
6.6 Methods for Determining the Eigenvalues and Eigenvectors
417
It is possible to show, under certain conditions, that the matrices Ai
converge to an upper triangular matrix AX) with the eigenvalues Aj of
A as diagonal elements Aj = (Ax;)jj. However, we wish to analyze the
convergence only for the Q R method which is closely related to the LR
method, because the LR method has a serious drawback: it breaks down if
one of the matrices Ai does not have a triangular decomposition, and even
if the decomposition Ai = LiRi exists, the problem of computing Li and
R may be ill conditioned.
In order to avoid these difficulties (and make the method numerically stable),
one could think of forming the LR decomposition with partial pivoting:
PiAi =: LiRi,
Pi permutation matrix, p;-l = pt,
A i + 1 : = RiP;T L; = L;l(p;AiPt)Li'
There are examples, however, in which the modified process fails to converge:
A1 =
[~ ~] ,
>q = 3,
'>"2 = -2,
P1 =
[~ 6] ,
P1A 1 =
[i ~]
A2 = R 1P'[ L1 =
[~ ~] ,
6] ,
P2A 2 =
P2 =
[~
A3 = R 2PiL 2 =
[i ~]
=
[1~2 ~][~ ~] = L1R1,
[
1~3 ~] [~ ~] = L R
2
2,
[~ ~] == A 1.
The LR method without pivoting, on the other hand, would converge for this
example.
These difficulties of the LR method are avoided in the QR method of
Francis (1961/62), which can be regarded as a natural modification of the
LR algorithm. Formally the QR method is obtained if one replaces the LR
decompositions in (6.6.4.1) by QR decompositions [see Section 4.7]. Thus,
again beginning with Al := A, the QR method generates matrices Qi, R i ,
and Ai according to the following rules:
(6.6.4.3)
Ai =: QiRi,
Here, the matrices are factored into a product of a unitary matrix Qi and
an upper triangular matrix R i . Note that a QR decomposition Ai always
exists and that it can be computed with the methods of Section 4.7 in a
418
6 Eigenvalue Problems
numerically stable way: There are n - 1 Householder matrices H?), j = 1,
... , n - 1, satisfying
where Ri is upper triangular. Then, since (H?»)H = (H?»)-l = H?) the
matrix
H(i)
Q ,...-- H(i)
1 ...
n-l
is unitary and satisfies Ai = QiRi' so that Ai+! can be computed as
_
_
(i)
(i)
Ai+l - RiQi - RiHl ... H n - l ·
Note further that the QR decomposition of a matrix is not unique. If S is
an arbitrary unitary diagonal matrix, i.e., a phase matrix of the form
then QiS and SH Ri are unitary and upper triangular, respectively, and
(QiS)(SH R i ) = QiRi' In analogy to Theorem (6.6.4.2), the following holds.
(6.6.4.4) Theorem. The matrices Ai, Qi, Ri in (6.6.4.3) and
Pi := QlQ2'" Qi,
Ui := RiRi - l ... R l ,
satisfy:
(a) Ai+! is unitarly similar to Ai, Ai+l = QfI AiQi'
(b) Ai+l = (Ql'" Qi)H Al(Ql··· Qi) = p iH AlPi ·
(c) Ai = PiUi .
The proof is carried out in much the same way as for Theorem (6.6.4.2)
and is left to the reader.
In order to make the convergence properties [see (6.6.4.12)] of the QR
method plausible, we show that this method can be regarded as a natural
generalization of both the simple and inverse vector iteration [see Section
6.6.3]. As in the proof of (6.6.4.2), starting with Po := I, we obtain from
(6.6.4.4) the relation
(6.6.4.5)
i ~ 1.
If we partition the matrices Pi and Ri as follows
r '(rl]
P i = [Pi ,Pi '
so that pt) is an n x r matrix and R~r) an r x r matrix, then (6.6.4.5)
implies
(6.6.4.6)
AP(r)l = p(r) R(r)
1,-
1,
1,
for i ~ 1, 1:::; r :::; n.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
419
Denoting the linear subspace, which is spanned by the orthonormal columns
of pt) by Pi(r) := R(Pi(r)) = {pi(r) z I Z E <e r }, we obtain
p(r) ~ AP(r)I = R(p(r) R(r))
~
z-
z
for i 2: 1, 1::::: r ::::: n,
z
with equality holding, when A is nonsingular, implying that all R i , R~r)
will also be nonsingular, that is, the QR method can be regarded as a
subspace iteration. The special case r = 1 in (6.6.4.6) is equivalent to the
I ): If we
usual vector iteration [see 6.6.3] with the starting vector el =
write Pi := pP) then
pci
Ilpill = 1,
(6.6.4.7)
i 2: 1,
as follows from (6.6.4.6); also, R?) = ri~, and IIFF) II = 1. The convergence
of the vector Pi can be investigated as in 6.6.3: We assume for simplicity
that A is diagonalizable and has the eigenvalues Ai with
IAII > IA21 2: .. ·IAnl·
Further let X = [Xl, ... ,X n ] = (Xik), Y := X-I = [YI, ... ,YnV = (Yik) be
matrices with
A=XDY,
(6.6.4.8a)
D -_ [AOI
so that Xi and y[ are a right and left eigenvector belonging to Ai:
YiT Xk =
(6.6.4.8b)
i = k,
{I0 for
otherwise.
If PI = I'll =I 0 holds in the decomposition of el
then by the results of 6.6.3 the ordinary vector iteration, tk := Akel satisfies:
(6.6.4.9)
On the other hand (6.6.4.7) implies
Aiel = rWri;) ... ri?Pi.
Since Ilpill = 1, there are phase factors (J"k = ei(Pk, l(J"kl = 1, so that
lim (J"iPi = Xl,
i
.
(i) (J"i-I _ \
1Imr
ll - - -/II,
(J"i
420
6 Eigenvalue Problems
ri?
Thus, the
and Pi "essentially converge" (i.e., up to a phase factor)
toward Al and Xl as i tends to oo. By the results of 6.6.3 the speed of
convergence is determined by the factor IA2/AII < 1,
(6.6.4.10)
Using (6.6.4.10) and A i + l = PiH APi [Theorem (6.6.4.4)(b)] it is easy
to see that the first column Aiel of Ai converges to the vector AIel =
[AI, 0, ... , O]T as i -too, and the smaller IA2/ All < 1 is, the faster the error
IIAiel - Alelll = O((IA21/IAd) tends to zero. We note that the condition
PI = Un i- 0, which ensures the convergence, is satisfied if the matrix Y
has a decomposition Y = L y Ry into a lower triangular matrix L y with
(Ly )j j = 1 and an upper triangular matrix Ry.
It will turn out to be particularly important that the QR method is
also related to the inverse iteration [see 6.6.3] if A is nonsingular. Since
p;HPi = I, we obtain from (6.6.4.5) P/!..lA- l = RilPiH or
Denote the 71 x (71 - r + 1) matrix consisting of the last 71 - r + 1 columns
of Pi by pt), denote the subspace generated by the columns of Pt) by
P?·) = R(Pi(r)), and denote the upper triangular matrix consisting of the
last 71 - r + 1 rows and columns of Ri by R~r). Then the last equation can
be written as a vector space iteration with the inverse A-H = (AH)-l of
the matrix A H :
A -H p(r)l = p(r) (R(r))-H
'[-
z
z
for i 2: 1, 1:S r :S 71,
In the special case r = 71 this iteration reduces to an ordinary inverse vector
iteration for the last column Pi := Pi(n) of the matrices Pi:
A-H'.
_'.-(i)
P,-l - p, Pnn'
(i) ._ (R-1)
inn,
p.nn·-
arising from the starting vector Po = en. Convergence of the iteration can
be investigated as in Section 6.6.3 and as previously. Again, the discussion
is easy if A is diagonalizable and A -H has a unique eigenvalue of maximal
modulus, i.e., if the eigenvalues Ai of A now satisfy IAll 2: ... 2: IAn-II>
IAnl > 0 (note that A- H has the eigenvalues 5. j l, j = 1, ... ,71). Let Xi and
UJ be the right and left eigenvector of A belonging to the eigenvalue Ai,
and consider (6.6.4.8). If the starting vector Po = en is sufficiently general,
the vector Pi will "essentially" converge toward a normalized eigenvector
u, lIull = 1, of A-H for its largest eigenvalue 5.;;-1, A-Hu = 5.;;-lu, or
6.6 Methods for Determining the Eigenvalues and Eigenvectors
421
equivalently u H A = An u H . Therefore the last rows e~ Ai of the matrices
Ai will converge to Ane~ = [0, ... ,0, Anl as i ---+ oc.
Now, the quotient IAn/An-II < 1 determines the convergence speed,
(I D,
Ile~Ai - [0, ... ,0, Anlll = 0 A~:l
(6.6.4.11)
since A- H has the eigenvalues 5. j l satisfying IA~ll > IA;;-~ll ;:: ... ;:: IAIll
by hypothesis. For the formal convergence analysis we have to write the
starting vector Po = en as a linear combination
en = Plfh + ... + PnYn,
of the right eigenvectors Yj of the matrix A-HYj = 5.j l Yj. Convergence
is ensured if the coefficient Pn associated with the eigenvalue 5.~ 1 of A-H
largest in modulus is =I O. Now 1= XHyH = [xl, ... ,xnlH[yl, ... ,Ynl
implies Pn = xI,[ en, so that Pn =I 0 if the element Xnn = e~xn = Pn
of the matrix X is nonzero. We note that this is the case if the matrix
y = Ly R y has an LR decomposition: the reason is that X = Ryl Lyl,
since X = y-l and Ryl and Lfl are upper and lower triangular matrices,
which implies Xnn = e~ Ryl Ly en =I O. This analysis motivates part of the
following theorem, which assures the convergence of the QR method if all
eigenvalues of A have different moduli.
(6.6.4.12) Theorem. Let A =: Al be an n x n matrix satisfying the
following hypotheses
(1) The eigenvalues Ai of A have distinct mod1l1i:
tAll > IA21 > ... > IAnl·
(2) The matrix Y with A = XDY, X = y-l, D = diag(Al, ... , An) = the
Jordan normal form of A, has a triangular decomposition
Then the matrices Ai, Qi, Ri of the QR method (6.6.4.3) have the following
convergence properties: There are phase matrices
Q
0'( (J'l(i) , . . . , (J' n(i)) '
,:>,
-_ d'lab
I= 1
I(J'(i)
k
'
such that lim; Sf-I QiSi = I and
lim sf RiS;-l = lim SE-lAiSi- l = [
'l-+()C)
Al:2
'x~ 1
o
An
t-+<x
422
6 Eigenvalue Problems
} n part'zcu l ar, rImi->00 a jJ
(i) = Aj ' J. = 1 " . "
h
A i = ((il)
n, were
a jk '
Remark. Hypothesis (2) is used only to ensure that the diagonal
elements of Ai converge to the eigenvalues Aj in their natural order,
IAll> IA21 > '" > IAnl,
PROOF [following Wilkinson (1965)], We carry out the proof under the
additional hypothesis An i- 0 so that D- 1 exists, Then from X - I = Y
Ai=XDiy
= QxRxDiLyRy
(6,6.4.13)
= QxRx(DiLyD-i)DiRy,
where Q x Rx = X is a Q R decomposition of the nonsingular matrix X into
a unitary matrix Qx and a (nonsingular) upper triangular matrix R x , Now
DiLyD- i =: (l~~) is lower triangular, with
() (AA: )
lj~ =
i
ljk,
L y =: (I jk ),
ljk =
{I
0
k,
for j =
for j < k,
Since IAj I < IAk I for j > k it follows that lim;l;~ = 0 for j > k, and thus
lim Ei = 0,
i---+oc'
Here the speed of convergence depends on the separation of the moduli of
the eigenvalues, Next, we obtain from (6.6.4.13)
Ai = QxRx(I + Ei)DiRy
(6.6.4.14)
= Qx(I + RxEiR)/)RxDiRy
= Qx(I + Fi)RxDiRy
with Fi := Rx EiR)/, limi Fi = O. Now the positive definite matrix
(I + F;JH (I + F;J = I + Hi,
with limi Hi = 0, has a uniquely determined Choleski decomposition [Theorem (4,3,3)]
where Hi is an upper triangular matrix with positive diagonal elements,
Clearly, the Choleski factor Hz depends continuously on the matrix I + Hi,
as is shown by the formulas of the Choleski method, Therefore lim; Hi = 0
implies limi Hi = I, Also the matrix
is unitary:
6.6 Methods for Determining the Eigenvalues and Eigenvectors
423
Qf Qi = ft;H (I + Fi)H (I + Fi)it;l = R;H (I + Hi)R;l
=
R;H (Rf k)R;l = I.
Therefore, the matrix I + Fi has the QR decomposition I + Fi = Qik,
with limi Qi = limi(I + Fi)R;l = I, limi k = I. Thus, by (6.6.4.14)
where QXQi is unitary, and RiRXD'Ry is an upper triangular matrix.
On the other hand, by Theorem (6.6.4.4)(c), the matrix Ai has the QR
decomposition
Since the QR decomposition for nonsingular A is unique up to a rescaling
of the columns (rows) of Q (resp. R) by phase factors CJ = e i ¢, there are
phase matrices
d',g.( CJ 1(i) , " ' , CJ n(il) ,
Si -- l a
with
i ? 1,
and it follows that
lim FiSi = limQxQi = Qx,
i
i
1
1D- i+ 1R1R- i-1
-1 SH
R i -- Ui Ui-1 -- S i R- i R X DiRY' RY
X
i-1
=
-
-1 - -1
H
SiRiRxDRx R i _ 1S i _ 1,
lim Sf R iS i - 1 = Rx DR)/,
I
and finally, by Ai = QiRi,
limSt~'lAiSi-1
= limS~lQiSisfRiSi-1
= RxDRxl,
,
,
The proof is now complete, since the matrix Rx D R)/ is upper triangular
and has diagonal D
1
Rx DR;/ =
[
>'0
*
D
424
6 Eigenvalue Problems
It can be seen from the proof that the convergence of the Qi, Ri and
Ai is linear, and improves with decreasing "converging factors" IAJ / Ak I,
j > k, i.e., with improved separation of the eigenvalues in absolute value.
The hypotheses of the theorem can be weakend, e.g., our analysis that led
to estimates (6.6.4.10) and (6.6.4.11) already showed that the first column
and the last row of Ai converge under weaker conditions on the separation
of the eigenvalues. In particular, the strong hypotheses of the theorem are
violated in the important case of a real matrix A with a pair AT) Ar+l = '\1'
of conjugate complex eigenvalues. Assuming, for example, that
and that the remaining hypotheses of Theorem (6.6.4.12) are satisfied, the
following can still be shown for the matrices Ai =
(a;2).
(6.6.4.15) (a) limi
a;2 = 0 for all (j, k) =I (r + 1,r) with j > k.
(b) limi a;'i = A] for j =I r, r + 1.
(c) Although the 2 x 2 matrices
(i)
[
a r1'
( i)
a r+1.1'
diverge in general as i --t :)0, their eigenvalues converge toward Ar and
Ar +l.
In other words, the convergence of the matrices Ai takes place in the
positions denoted by Aj and 0 of the following figure, and the eigenvalues
of the 2 x 2 matrix denoted by * converge
Al
X
0
A2
X
X
X
x
X
X
X
X
0
Ai --+
A1'-l
0
i----'tx
* *
* *
0
0
X
Ar+~
0
x
An
For a detailed investigation of the convergence of the QR method, the
reader is referred to the following literature: Parlett (1967), Parlett and
Poole (1973), and Golub and Van Loan (1983).
6.6 Methods for Determining the Eigenvalues and Eigenvectors
425
6.6.5 The Practical Implementation of the QR Method
In its original form (6.6.4.:3), the QR method has some serious drawbacks
that make it hardly competitive with the methods considered so far:
(a) The method is expensive: a complete step Ai -+ A i + 1 for a dense
n x n matrix A requires 0(n 3 ) operations.
(b) Convergence is very slow if some quotients IAj / Ak I, j # k, of the
eigenvalues of A are close to 1.
However, there are remedies for these disadvantages.
(a) Because of the amount of work, one applies the QR method only
to reduced matrices A, namely matrices of Hessenberg form, or in the case
of Hermitian matrices, to Hermitian tridiagonal matrices (i.e., Hermitian
Hessenberg matrices). A general matrix A, therefore must first be reduced
to one of these forms by means of the methods described in Section 6.5. For
this procedure to make sense, one must show that these special matrices are
invariant under the QR transformation: if Ai is a (possibly Hermitian) lIessenberg matrix then so is Ai+ 1. This invariance is easily established. From
Theorem (6.6.4.4)(a), the matrix AH1 = Qf AiQi is unitarily similar to Ai
so that AH1 is Hermitian if Ai is. If Ai is an n x n Hessenberg matrix, then
AH 1 can be computed as follows: First, reduce the subdiagonal elements
of Ai to 0 by means of suitable Givens matrices of type D I2 , ... , D n - I.n
[see Section 4.9]
Ai = QiRi'
nHnH
nH
Q i ._
. - UI2J£23' .. un-l,n,
and then compute A i + 1 by means of
A i + l = RiQi = RiDi!zD~ ... D;;-l.n·
Because of the special structure of the Dj,j+l, the upper triangular matrix
Ri is transformed by the post multiplications with the Dfj+1 into a matrix
Ai+1 of Hessenberg form. Note that A i+ l can be computed from Ai in one
sweep if the matrix multiplications are carried out in the following order
One verifies easily that it takes only 0(n 2 ) operations to transform an n x n
Hessenberg matrix Ai into A i +1 in this way, and only O(n) operations in
the Hermitian case, where Ai is a Hermitian tridiagonal matrix.
We therefore assume for the following discussion that A and thus all Ai
are (possibly Hermitian) Hessenberg matrices. We may also assume that
426
6 Eigenvalue Problems
the Hessenberg matrices Ai are irreducible, i.e., their subdiagonal elements
(i)
.
0 therwise Ai has the form
aj,j_l' J = 2, ... , n, are nonzero.
where A;, A;' are Hessenberg matrices of order lower than n. Since the
eigenvalues of Ai are just the eigenvalues of A; and A;', it is sufficient
to determine the eigenvalues of the matrices A;, A;' separately. So, the
eigenvalue problem for Ai can be reduced to the eigenvalue problem for
smaller irreducible matrices.
Basically, the Q R method for Hessenberg matrices runs as follows: The
Ai are computed according to (6.6.4.3) until one of the last two subdiagonal
elements a~?n-l and a~~1,n-2 of the Hessenberg matrix Ai (which converge
to 0 in general, see (6.6.4.12), (6.6.4.15)) become negligible, that is,
. {I a n(i). n - I I, Ia n(i)- 1 ,n-2 I}
min
(;)1 + Ian-l.
(i)
I)
s: eps (I ann
n- 1 '
eps being, say, the relative machine precision. In case a~:n-l is negligible,
a~~ is a numerically acceptable eigenvalue [see Section 6.6.3] of A, because
it is the exact eigenvalue of a Hessenberg matrix Ai close to Ai IIAi - Ai I
eps IIAil1 : Ai is obtained from Ai by replacing 1L~)n_l
by O. In case 1L~~1 , n-2
,
is negligible, the eigenvalues of the 2 x 2 matrix
s:
(i)
[ an-I.n-l
(i)
an,n-l
(i)
an-,I.n
(i)
ann
1
are two numerically acceptable eigenvalues of Ai, since they are exact eigenvalues of a Hessenberg matrix Ai close to Ai, which is obtained by replacing
the small element a~~l n-2 by O. If a~~ is taken as an eigenvalue, one crosses
out the last row and c~lumn, and if the eigenvalues of the 2 x 2 matrix are
taken, one removes the last two rows and columns of Ai, and continues the
algorithm with the remaining matrix.
By (6.6.4.4) the QR method generates matrices that are unitarily similar to each other. We note an interesting property of "almost" irreducible
Hessenberg matrices that are unitarily similar to each other, a property
that will be important for the implicit shift techniques to be discussed
later.
(6.6.5.1) Theorem. Let Q = [ql"'" qn] and U = [U1 .. ' .. Un] be unitary
matrices with columns qi and 11i, and suppose that H = (h jk ) := QH AQ
and K := U H AU are both Hessenberg matrices that are similar to the same
matrix A using Q und U, respectively, for unitary transformations. Assume
further that h i . i - 1 Ie 0 for i
n - 1, and that 111 = IJ'lql, 11J'11 = l. Then
s:
6.6 Methods for Determining the Eigenvalues and Eigenvectors
427
there is a phase matrix S = diag(ab ... , an), lakl = 1, such that U = QS
andK=SHHS.
That is, if H is "almost" irreducible and Q and U have "essentially"
the same first column, then all columns of these matrices are "essentially"
the same, and the Hessenberg matrices K and H agree up to phase factors,
K = SHHS.
PROOF. In terms of the unitary matrix V = [VI, ... ,Vn ] := U H Q, we have
H = V H KV and therefore
KV=VH.
A comparison of the (i - 1)-th column on both sides shows
i-I
h i ,i-l V i = KVi-l -
Lh
j ,i-l V j,
2 ~ i ~ n.
j=1
With h i ,i-l i= 0 for 2 ~ i ~ n - 1, this recursion permits one to compute
i ~ n - 1 from VI. Since VI = U H ql = alUHul = aIel and K is
a Hessenberg matrix we find that the matrix V = [VI, ... , v n ] is upper
triangular. As V is also unitary, VVH = I, V is a phase matrix, which we
denote by SH := V. The theorem now follows from K = V HV H.
0
Vi,
(b) The slow convergence of the QR method can be improved substantially by means of so-called shift techniques. We know from the relationship
between the QR method and inverse vector iteration [see (6.6.4.11)] that,
for nonsingular A, the last rowe; Ai of the matrices Ai converges in general
to Ane; as i ---+ 00, if the eigenvalues Ai of A satisfy
In particular, the convergence speed of the error
is determined by the quotient IAn/An-II. This suggests that we may accelerate the convergence by the same techniques used with inverse vector
iteration [see Section 6.6.3], namely, by applying the QR method not to A
but to a suitable shifted matrix A = A - kI. Here, the shift parameter k is
chosen as a close approximation to some eigenvalue of A, so that, perhaps
after a reordering of the eigenvalues,
Then the elements a~)n-l of the matrices Ai associated with A will
converge much faster to 0 'as i ---+ 00, namely, as
428
6 Eigenvalue Problems
Note that if A is a Hessenberg matrix, so also are A = A - kI and all Ai,
hence the last row of Ai has the form
TA-·l -- [0 , • • . , 0 ,an,n-l'
-til
-(ill
en
ann'
More generally, if we choose a new shift parameter in each step, the QR
method with shifts is obtained:
AI: = A,
(6.6.5.2)
Ai - k;l =: QiRi
(QR decomposition)
The matrices Ai are still unitarily similar to each other, since
AHI = Qf (Ai - k;l)Qi + kiI = Qf AiQi.
In addition, it can be shown as in the proof of Theorem (6.6.4.4) [see
Exercise 20l
(6.6.5.3)
A i + 1 = P iH APi,
(A - k 1 l)'" (A - k;l) = PiUi ,
where again Pi := QIQ2'" Qi and Ui := RiR i - 1 ... R I · Moreover,
(6.6.5.4)
A i + 1 = RiAiR;1
= U i AUi- 1
holds if all R j are nonsingular. Also, AHI will be a Hessenberg matrix, but
this assertion can be sharpened
(6.6.5.5) Theorem. Let Ai be an irreducible Hessenberg matrix of ordcr
n. If k i is not an eigenvalue of Ai then A i+ 1 will also be an irreducible
Hessenberg matrix. Otherwise, AHI will have the form
wherc Ai+! is an irreducible H essenberg matrix of order (n - 1).
PROOF. It follows from (6.6.5.2) that
(6.6.5.6)
Ri = Qf (A; - kil)o
If ki is not an eigenvalue of Ai. then Ri is nonsingular by (6.6.5.2), and
therefore (6.6.5.4) yields
(HI) _
T
A
_ T R A R- 1 _
aj+l. j - e J +l i+l C j - Cj + 1 , i i ej -
so that AHI is irreducible, too.
(il
(i)
((i») -1 --I- 0
rj+l.j+laj+l.j rjj
(.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
429
On the other hand, if k i is an eigenvalue of A;, then R; is singular.
But since Ai is irreducible, the first n - 1 columns of Ai - kif are linearly
independent, and therefore, by (6.6.5.6) so also are the first n - 1 columns
of R i . Hence, by the singularity of R i , the last row of Ri must vanish
*]
R.=[Hi
"
0
0 '
with Hi being a nonsingular upper triangular matrix of order (n - 1).
Because of the irreducibility of Ai - k;l and the fact that
Qi is also an irreducible Hessenberg matrix. Therefore, since
~ ] Qi + kif = [ Act
1
the submatrix AHI is an irreducible Hessenberg matrix of order n -1, and
AH 1 has the structure asserted in the theorem.
D
We now turn to the problem of choosing the shift parameters k i appropriately. To simplify the discussion, we consider only the case of real
matrices A, which is the most important case in practice, and we assume
that A is a Hessenberg matrix. The earlier analysis suggests choosing close
approximations to the eigenvalues of A as shifts k i , and the problem is finding such approximations. But Theorem (6.6.4.12) and (6.6.4.15) are helpful
here: According to them we have limi
= An, if A has a unique eigenvalue An of smallest absolute value. Therefore, for large i, the last diagonal
element a~~ of Ai will, in general, be a good approximation for An, which
suggests the choice
a)::,
(6.6.5.7a)
if the convergence of the a~i:n has stabilized, say, if
I1 -
(i-I)
ann
I
---t-,) ::: 7] < 1
an,n,
for a small T7 (even the choice T7 = 1/3 leads to acceptable results). It is
possible to show for Hessenberg matrices under weak conditions that the
choice (6.6.5.7a) ensures
la~i.~~11 ::: CE 2
if la~:n-ll ::: E and E is sufficiently small, so that the QR method then
converges locally at a quadratic rate. By direct calculation, this is easily
verified for real 2 x 2 matrices
430
6 Eigenvalue Problems
with a fc 0 and small lei: Using the shift ki := a n .n
matrix
.
~
Iblc 2
~ [Ii b]
--2'
A = ed' wIth Icl = -a2 +e
o one finds the
as the QR successor of A, so that indeed lei = 0(e 2 ) for small 14 For
symmetric matrices, b = e, the same example shows
so that one can expect local convergence at a cubic rate in this case. In fact,
for general Hermitian tridiagonal matrices and shift strategy (6.6.5.7a),
local cubic convergence was shown by Wilkinson [for details sec Wilkinson
(1965), p. 548, and Exercise 23].
A more general shift strategy, which takes account of both (6.6.4.12)
and (6.6.4.15), is the following.
(6.6.5. 7b) Choose k i as the eigenvalue A of the 2 x 2 matrix
[ a~~l,n-I an-.1,n 1.
(i)
(i)
an,n-I
(i)
a n .n
.
for which la~)n - AI is smallest.
This strategy also allows for the possibility that A has several eigenvalues of equal absolute value. But it may happen that k" is a complex number
even for real Ai, if Ai nonsymmetric, a case that will be considered later.
For real symmetric matrices Wilkinson (1968) even showed the following
global convergence theorem.
(6.6.5.8) Theorem. If the QR method with shift strategy (6.6.5.7b) is
applied to a real, irreducible, symmetric tridiagonal n x n matrix in all
iterations, then the elements a~:n-l of the ith iteration matrix Ai converge
to zero at least quadratically toward an eigenvalue of A. Disregarding rare
exceptions, the convergence rate is even cubic.
D
Nu~mRICAL
(6.6.5.9)
EXAMPLE. The spectrum of the matrix
6
1
1
1
~ ~
1
6.6 Methods for Determining the Eigenvalues and Eigenvectors
431
lies symmetrically around 6; in particular, 6 is an eigenvalue of A. For the QR
method with shift strategy (b) we display in the following the elements a~?n_l'
a~:n as well as the shift parameters k i :
(i)
( i)
a 54
1
ki
a 55
2
3
1
- .454544295 102 10 - 2
0
-.316869782391100
-.302775637732100
-.316875874226100
+.106774452090 10 -9
-.316875952616100
-.316875952619100
4
+.918983519419 10 -22
1-.316875952617100 = ..\.51
Processing the 4 x 4 matrix further, we find
ki
4
5
+.143723850633 10 0
-.171156231712 10 -5
+.29906913587510 1
+.298386369683101
6
-.11127768766310-17
1+.298386369682101 = ..\.41
+.29838996772210 1
+.298386369682101
Processing the 3 x 3 matrix further, we get
(i)
(i)
a 32
a 33
ki
6
7
+.78008805287910- 1
-.83885498096110-7
+.600201597254 10 1
+.59999999999610 1
8
+.12718113562310- 19
1+.599999999995101 = ..\.31
+.600000324468101
+.59999999999510 1
The remaining 2 x 2 matrix has the eigenvalues
+.90161363031410 1 = ..\.2
+.123168759526 10 2 = ..\.1
This ought to be compared with the result of eleven QR steps without shifts:
(12)
(12)
k
a k ,k-l
a kk
1
2
3
4
5
+.33745758663710-1
+.11407995142110-1
+.46308675985310-3
-.20218824473310-10
+.123165309125 10 2
+.90164381961h o 1
+ .600004307566 10 1
+.298386376789 10 1
-.316875952617100
432
6 Eigenvalue Problems
Such convergence behavior is to be expected, since
Further processing the 4 x 4 matrix requires as many as 23 iterations in order to
achieve
ai~5) = +.48763746442510-10.
The fact that the eigenvalues in this example come out ordered in size
is accidental: With the use of shifts this is not to be expected in general.
When the QR step (6.6.5.2) with shifts for symmetric tridiagonal matrices is implemented in practice, there is a danger of losing accuracy when
subtracting and adding kiJ in (6.6.5.2), in particular, if Ikil is large. Using
Theorem (6.6.5.1), however, there is way to compute Ai without subtracting and adding kif explicitly. Here, we use the fact the first column ql of
the matrix Qi := [ql,"" qn] can be computed from the first column of
Ai - kif: at the beginning of this section we have seen that
Qi = flifzfl~ ... fl;;-1.n,
where the flj,j+l are Givens matrices. Note that, for j > 1, the first column
of flj,j+l equals the first unit column el, so that ql = Qiel = fl~el is also
the first column of fl~. l\Iatrix fl12 is by definition the Givens matrix that
annihilates the first subdiagonal element of Ai -kif, and it is therefore fully
determined by the first column of that matrix. The matrix B = fl~Aifl12'
then has the form
B=
x
x
x
X
x
x
o
.r
x
x
x
x
x
x
0
x
x
that is, it is symmetric and, except for the potentially nonzero elements
b13 = b3l , tridiagonal. In analogy to the procedure described in Section
6.5.1, one can find a sequence of Givens transformations
-H
-
-H
B --+ fl23Bfl23 --+ ... --+ fl n - 1 ,n'"
-H
fl 23 Bfl23 ··· fln-l,n =
C,
which transform B into a unitarily similar matrix C that is symmetric and
tridiagonal. First, D23 , is chosen so as to annihilate the elements b31 = b13 of
B. The corresponding Givens transformation will produce nonzero elements
in the positions (4, 2) and (2, 4) of D/ABD 23 . These in turn, are annihilated
by a proper choice of i?34, and so on.
So, for n = 5 the matrices B --+ ... --+ C have the following structure, where
again 0 and * denote new zero and potentially nonzero elements, respectively.
6.6 Methods for Determining the Eigenvalues and Eigenvectors
B=
ti34 )
[~
x
x
x
[~
x
x
x
x
x
x
x
x
X
X
0
X
X
X
0
x
x
*
x
J [~
ti 23 )
J
ti 45 )
[~
x
0
X
X
x
x
x
*
x
x
*
x
X
X
X
X
X
X
x
0
x
x
433
;]
~] ~C
x
n
In terms of the unitary matrix U := D 12 23 ··· nn-l,n, we have
C = U H AiU since B = D~AiD12' The matrix U has the same first column as the unitary matrix Qi, which transforms Ai into A i+ 1 = Qf' AiQi
according to (6.6.5.2), namely, ql = D12el. If A; is irreducible then by
Theorem (6.6.5.5) the elements a;:;~i of A i + 1 = (a;~+l)) are nonzero
for j = 2,3, ... , n - 1. Thus, by Theorem (6.6.5.1), the matrices C and
Ai+l = SCSH are essentially the same, S = diag(±l, ... , ±1), U = QS.
Since in the explicit QR step (6.6.5.2), the matrix Qi is unique only up to a
rescaling Qi ---+ QiS by a phase matrix S, we have found an alternate algorithm to determine Ai+l' namely, by computing C. This technique, which
avoids the explicit subtraction and addition of kJ, is called the implicit
shift technique for the computation of Ai+l.
Now, let A be a real nonsymmetric Hessenberg matrix. The shift strategy (6.6.5. 7b) then leads to a complex shift k i , even if the matrix Ai is
real, whenever the 2 x 2 matrix in (6.6.5.7b) has two conjugate complex
eigenvalues k i and k,. In this case, it is possible to avoid complex calculations, if one combines the two QR steps Ai ---+ Ai+l ---+ Ai+2 by choosing
the shift ki in the first, and the shift ki+l := ki in the second. It is easy
to verify that for real Ai the matrix Ai+2 is real again, even though Ai+l
may be complex. Following an elegant proposal by Francis (1961/62) one
can compute Ai+2 directly from Ai using only real arithmetic, provided the
real Hessenberg matrix Ai is irreducible: For such matrices Ai this method
permits the direct calculation of the subsequence
of the sequence of matrices generated by the QR method with shifts. It is
in this context that implicit shift techniques were used first.
In this text, we describe only the basic ideas of Francis' technique. Explicit formulas can be found, for instance, in Wilkinson (1965). For simplicity, we omit the iteration index and write A = (aJk) for the real irreducible
Hessenberg matrix Ai, k for ki' and A for Ai+2. By (6.6.5.2), (6.6.5.3)
there are a unitary matrix Q and an upper triangle matrix R such that
A = QH AQ is a real Hessenberg matrix and (A - kI)(A - kI) = QR. We
try to find a unitary matrix U having essentially the same first column
434
6 Eigenvalue Problems
as Q, Qel = ±Uel, so that U H AU =: K is also a Hessenberg matrix. If
k is not an eigenvalue of A, then Theorems (6.6.5.1) and (6.6.5.5) show
again that the matrices A and K differ only by a phase transformation:
H.
K = S AS, S = dlag(±L ... , ±1).
This can be achieved as follows: The matrix.
n := (A - kJ)(A - kI) = A2 - (k + k)A + kkI
is real, and its first column b = Bel, which is readily computed from A
and k, has the form the form b = [x, x, x, 0, ... ,O]T, since A is a Hessenberg matrix. Choose P [see Section 4.7] as the Householder matrix that
transforms b into a multiple of el, Pb = {11 e1. Then the first column of
P has the same structure as b, that is, Pel = [x, x, x, 0, ... , oV, because
P 2 b = b = {11Pel. Since B = QR, B = p 2 B = P[{1lel, *, ... , *], the matrix P has essentially the same first column Pel = ±Qe1 as the matrix Q.
Next, one computes the matrix pH AP = PAP, which, by the structure of
P (note that Pek = ek for k .2: 4) is a matrix of the form
x
x
x
:1:
x
x
* *
PAP =
x
x
x
x
x
x
x
in other words, except for few additional subdiagonal elements in positions
(3,1), (4,1), (4,2), it is again a Hessenberg matrix. Applying Householder's
method [see the end of Section 6.5.4], PAP is then transformed into a
genuine Hessenherg matrix K using unitary similarity transformations
The Householder matrices T k , k = 1, ... , n - 2, according to Section
6.5.1, differ from the unit matrix only in three columns, Tke, = ei for
i rf. {k + 1, k + 2, k + 3}.
For n = 6 one obtains a sequence of matrices of the following form (0 and *
again denote new zero and nonzero elements, respectively):
x
PAP+
X
x
x
X
X
X
T,
----=-+
X
X
X
X
.r
r~
0
0
x
x
T2
X
X
X
X
X
* *
x
----t
x
x
x
6.6 Methods for Determining the Eigenvalues and Eigenvectors
x
x
x
x
x
x
x
x
o x x
o x x
* *
x
x
x
x
x
x
x
o
o
x
~ [~ ~
435
x
x
x
~
x
x
x
x
x
x
x
x
=K.
x
o x
x
As Tkel = el for all k, the unitary matrix U := PT1 ... T n - 2 with
U H AU = K has the same first column as P, Uel = PT1 ··· Tn-2el = Pel'
Thus, by Theorem (6.6.5.1), the matrix K is (essentially) identical to A.
Note that it is not necessary for the practical implementation of QR
method to compute and store the product matrix Pi := Ql ... Qi, if one
wishes to determine only the eigenvalues of A. This is no longer true in
the Hermitian case if one also wants to find a set of eigenvectors that are
orthogonal. The reason is that for Hermitian A
limpiH APi = D = diag(Al, ... , An),
"
which follows from (6.6.5.3), (6.6.4.12), so that for large i the j-th column
p;i) = Piej, of Pi will be a very good approximation to an eigenvector of A
for the eigenvalue Aj and these approximations p;i), j = 1, ... , n, are orthogonal to each other, since Pi is unitary. If A is not Hermitian, it is better
to determine the eigenvectors by the inverse iteration of Wielandt [see Section 6.6.3], where the usually excellent approximations to the eigenvalues
that have been found by the Q R method can be used.
The QR method converges very quickly, as is also shown by the example
given earlier. For symmetric matrices, experience suggests that the QR
method is about four times as fast as the widely used Jacobi method if
eigenvalues and eigenvectors are to be computed, and about ten times as
fast if only the eigenvalues are desired. There are many programs for the QR
method. ALGOL programs can be found in the handbook of Wilkinson and
Reinsch (1971) [see the contributions by Bowdler, Martin, Peters, Reinsch,
and Wilkinson], FORTRAN programs are in Smith et al. (1976).
436
6 Eigenvalue Problems
6.7 Computation of the Singular Values of a Matrix
The singular values (6.4.6) of an m x n matrix A and its singular-value decomposition (6.4.11) can be computed rapidly, and in a numerically stable
manner, by a method due to Golub and Reinsch (1971), which is closely
related to the QR method. We assume, without loss of generality, that
m::';' n (otherwise. replace A by A H ). The decomposition (6.4.11) can then
be written in the form
(6.7.1)
A = U [~] V H ,
D:= diag(O"l, ... ,O"n),
0"1::';' 0"2::';""::';' O"n ::.;, 0,
where U is a unitary m x m matrix, V is a unitary n x n matrix, and 0"1,
..• , O"n are the singular values of A, i.e., 0"; are the eigenvalues of AH A. In
principle, therefore, one could determine the singular values by solving the
eigenvalue problem for the Hermitian matrix AHA, but this approach can
be subject to loss of accuracy: for the matrix
lei < y'ePs,
eps = machine precision,
for example, the matrix AH A is given by
A H A = [1 +1 e
2
1]
1 + e2
'
and A has the singular values O"l(A) = yi2 + e2 , 0"2(A) = lei. In floatingpoint arithmetic, with precision eps, instead of AHA one obtains the matrix
fl(A H A) =: B =
[1 1]
1
1 '
with eigenvalues '\1 = 2 and '\2 = 0; 0"2(A) = lei does not agree to machine
precision with ~ = O.
In the method of Golub and Reinsch one first reduces A, in a preliminary step, unitarily to bidiagonal form. By the Householder method
[Section 4.7], one begins with determining an m x m Householder matrix
Pi which annihilates the sub diagonal elements in the first column of A.
One thus obtains a matrix A' = PiA of the form shown in the following
sketch for m = 5, n = 4, where changing elements are denoted by *:
A=
x
x
x
x
x
X
.J;
X
x
x
x
x
x
x
x
x
x
x
x
x
-+ HA =A' =
* *
*
*
*
*
0
0
0
0
*
*
*
*
*
J
6.7 Computation of the Singular Values of a Matrix
437
Then, as in Section 4.7, one determines an n x n Househoulder matrix Q1
of the form
such that the elements in the positions (1,3), ... , (1, n) in the first row of
A" := A/Q1 vanish:
A'=
x
0
0
0
0
x
x
x
x
:r;
x
X
X
X
x
x
x
x
x
x
-+ A/Q1 = A" =
x
0
0
0
0
*
*
*
*
*
0
0
*
*
*
*
*
*
*
*
In general, one obtains in the first step a matrix A" of the form
A" =
[*l,
0,1 := [e2, 0, ... ,0]
E <e n - 1 ,
with an (m - 1) x (n - 1) matrix A. One now treats the matrix A in the
same manner as A, etc., and in this way, after n reduction steps, obtains
an m x n bidiagonal matrix J(O) of the form
(0)
q1
J(O) =
[~O 1'
(0)
e2
0
(0)
q2
Jo =
(0)
en
0
(0)
qn
where fO) = Pn ··· P 2 P 1AQ1Q2'" Qn-2 and Pi, Qi are certain Householder matrices.
Since Q := Q1 Q2 ... Qn-2 is unitary and J(O)H J(O) = JI! J o =
QH AH AQ, the matrices J o and A have the same singular values; likewise,
with P := P 1P 2 ... Pn , the decomposition
is the singular-value decomposition (6.7.1) of A if J o = GDH H is the
singular-value decomposition of J o. It suffices therefore to consider the
problem for n x n bidiagonal matrices J.
In the first step, J is multiplied on the right by a certain Givens reflection of the type T 12 , the choice of which we want to leave open for the
moment. Thereby, J changes into matrix J(1) = JT12 of the following form
(sketch for n = 4):
438
6 Eigenvalue Problems
The sub diagonal element in position (2,1) of J(1), generated in this way, is
again annihilated through left multiplication by a suitable Givens reflection
of the type S12. One thus obtains the matrix J(2) = S12J(1):
= .f2).
The element in position (1,3) of J(2) is annihilated by multiplication from
the right with a Givens reflection of the type T2 :l ,
The subdiagonal element in position (3,2) is now annihilated by left multiplication with a Givens reflection S2:" which generates an element in position (2,4), etc. In the end, for each Givens reflection T12 one can thereby
determine a sequence of further Givens reflections Si,i+1, T i ,i+l, i = 1, 2,
... , n - 1, in such a manner that the matrix
is again a bidiagonal matrix. The matrices S := S12S23 ... Sn-l,n, T :=
T 12 T 23 ... T n -1.n are unitary, so that, by virtue of j = SH JT, the matrix
AI := jH j = TH JH SSH JT = TH MT is a tridiagonal matrix which is
unitarily similar to the tridiagonal matrix M := JH J.
Up to now, the choice of T12 was open. We show that T12 can be so
chosen that AI becomes precisely the matrix which is produced from the
tridiagonal matrix AI by the QR method with shift k. To show this, we
may assume without loss of generality that AI = JH J is irreducible, for
otherwise the eigenvalue problem for Ai could be reduced to the eigenvalue
problem for irreducible tridiagonal matrices of smaller order than that of
M. We already know [see (6.6.5.2)] that the QR method, with shift k,
applied to the tridiagonal matrix Ivi yields another tridiagonal matrix if
according to the following rules:
(6.7.2)
AI - kI =: T R.
R upper triangular,
Al:= RT+ kI,
T unitary,
6.7 Computation of the Singular Values of a Matrix
439
Both tridiagonal matrices ]\;1 = TH AIT and J'V1 = TH MT are unitarily
similar to lvI, where the unitary matrix T = T(k) depends, of course, on
the shift k. Now, Theorem (6.6.5.5) shows that for all shifts k the matrix
if = [mij] is almost irreducible, that is, mj,j-1 -I- 0 for j = 2,3, ... , n - 1
since AI is irreducible. Hence we have the situation of Theorem (6.6.5.1): If
the matrices T and T = T(k) have the same first column, then according to
this theorem there is a phase matrix L1 = diag( ei<bl, ei<b2, ... ,ei<bn ), ¢1 = 0,
so that NI = L1H !VI L1, that is, ]\;1 is essentially equal to M. We show next
that it is in fact possible to choose T12 in such a way that T and T(k) have
the same first column.
The first column t1 of T = T 12 T23 ... Tn - Ln is precisely the first column
of T 12 , since (with e1 := [1,0, ... , O]T E (Cn)
and has the form t1 = [c, 8, 0, ... ,O]T E IRn , c2 + 8 2 = 1. On the other
hand, the tridiagonal matrix AI - kI, by virtue of AI = JH J, has the form
61
(6.7.3)
12
62
1'2
M-kI=
0
o
'Yn
if
1
6:
1
61 = Iql1 2 - k, 1'2 = e2ql,
e2
J=
r~ ,~ 1
q2
qn
A unitary matrix T with (6.7.2), TH(M - kI) = R upper triangular, can
be determined as a product of n - 1 Givens reflections of the type Ti ,i+1
[see Section 4.9]' T = T12 T23 ··· Tn-1,n which successively annihilate the
sub diagonal elements of 1\1 - kI. The matrix T12 , in particular, has the
form
(";
8
8
-c
1
T12 =
(6.7.4)
o
o
[i]
1
:= ex [
where 8 -I- 0 if M is irreducible.
~~] = ex [lq11:q~ k],
(";2
+ 82 = 1,
440
6 Eigenvalue Problems
The first column t1 of T agrees with the first column of T 12 , which is
given by c, sin (6.7.4). Since /1.1 = TH MT is a tridiagonal matrix, it follows
from Francis' observation that if essentially agrees with IVI (up to scaling
factors), provided one chooses as first column t1 of T12 (through which T12
is determined) precisely the first column of T12 (if the tridiagonal matrix
!vI is not reducible).
Accordingly, a typical step J --+ J of the method of Golub and Reinsch
consists in first determining a real shift parameter k, for example by means
of one of the strategies in Section 6.6.5; then choosing T12 := T12 with the
aid of the formulas (6.7.4); and finally determining, as described above,
further Givens matrices T i ,i+1, Si,it-l such that
again becomes a bidiagonal matrix. Through this implicit handling of the
shifts, one avoids in particular the loss of accuracy which in (6.7.2) would
occur, say for large k, if the subtraction !'vI --+ !'vI - kI and subsequent
addition RT --+ RT + kI were executed explicitly. The convergence properties of the method are of course the same as those of the QR method: in
particular, using appropriate shift strategies [see Theorem (6.6.5.8)]' one
has cubic convergence as a rule.
An ALGOL program for this method can be found in Golub and Reinsch
(1971).
6.8 Generalized Eigenvalue Problems
In applications one frequently encounters eigenvalue problems of the following form: For given n x n matrices A, B, numbers A are to be found
such that there exists a vector x i= 0 with
(6.8.1)
Ax = ABx.
For nonsingular matrices B, this is equivalent to the classical eigenvalue
problem
(6.8.2)
B- 1 Ax = AX
for the matrix B- 1 A (a similar statement holds if A-I exists). Now usually,
in applications, the matrices A and B are real symmetric, and in addition
B is positive definite. Although in general B- 1 A is not symmetric, it is still
possible to reduce (6.8.1) to a classical eigenvalue problem for symmetric
matrices: If
6.9 Estimation of Eigenvalues
441
is the Choleski decomposition of the positive definite matrix B, then L is
nonsingular and B- 1A similar to the matrix G := L -1 A(L -1 )T,
But now, the matrix G, like A, is symmetric. The eigenvalues A of (6.8.1)
are thus precisely the eigenvalues of the symmetric matrix G.
The computation of G can be simplified as follows: One first computes
F:= A(L-1)T
by solving F LT = A, and then obtains
from the equation LG = F. In view of the symmetry of G it suffices to
determine the elements below the diagonal of G. For this, knowledge of the
lower triangle of F (jik with k :::; i) is sufficient. It suffices, therefore, to
compute only these elements of F from the equation F LT = A.
Together with the Choleski decomposition B = LLT. which requires
about ~n3 multiplications, the computation of G = L -1 A(L -l)T from A,
B thus costs about ~n3 multiplications.
For additional methods, sec Martin and Wilkinson (1971) and Peters
and Wilkinson (1970). A more recent method for the solution of the generalized eigenvalue problem (6.8.1) is the QZ method of J'vIoler and Stewart
(1973).
6.9 Estimation of Eigenvalues
Using the concepts of vector and matrix norm developed in Section 4.4, we
now wish to give some simple estimates for the eigenvalues of a matrix. We
assume that for x E (Cn
Ilxll
is a given vector norm and
lub(A) = ~;t
IIAxl1
W
the associated matrix norm. In particular, we employ the maximum norm
Ilxll exo = max
, lXii,
IUboo(A) = max
,
L lai,kl·
k
We distinguish between two types of eigenvalue estimations:
(1) exclusion theorems,
442
6 Eigenvalue Problems
(2) inclusion theorems.
Exclusion theorems give domains in the complex plane which contain no
eigenvalue (or whose complement contains all eigenvalues); inclusion theorems give domains in which there lies at least one eigenvalue.
An exclusion theorem of the simplest type is
(6.9.1) Theorem (Hirsch). For- all eigenvalues..\. of A one has
1..\.1 :s: lub(A).
PROOF. If x is an eigenvector to the eigenvalue ..\., then from
A.T = ..\.x,
x ~ 0,
there follows
II..\.xll = 1..\.1· 11·1:11 :s: lub(A) . Ilxll,
1..\.1 :s: lub(A).
D
If ..\.i are the eigenvalues of A, then
p(A):= max I..\.il
l::;i::;n
is called the spectml mdius of A. By (6.9.1) we have p(A) < lub(A) for
every vector norm.
(6.9.2) Theorem.
(a) For- ever-y matr-ix A and ever-y E > 0 thcr-e exists a vector norm sitch
that
lub(A) :s: p(A) + E.
(b) If every eigenvalue..\. of A with 1..\.1 = p(A) has only linear elementary
divisors, then there exists a vector norm such that
lub(A) = p(A).
PROOF.
(a): Given A, there exists a nonsingular matrix T such that
TAT- 1 =.J
is the Jordan normal form of A [see (6.2.5)], i.e., .J is made up of diagonal
blocks of the form
..\.i
1
0
1
o
6.9 Estimation of Eigenvalues
443
By means of the transformation J --+ D; I J D '" with diagonal matrix
DE := diag(l, E, E2, ... , En-I),
E
> 0,
one reduces the C,,(.\) to the form
o
From this it follows immediately that
luboo(D;IJDE) = luboo(D;ITAT-ID E):=:; p(A) +E.
Now the following is true in general: If S is a nonsingular matrix and II . II
a vector norm, then p(x) := IISxl1 is also a vector norm, and lUbp(A) =
lub(SAS- I ). For the norm p(x) := IID;ITxlloo it then follows that
lubp(A) = luboo(D;IT AT- I DE) :=:; p(A) + E.
(b): Let the eigenvalues Ai of A be ordered as follows:
Then, by assumption, for 1 :=:; i :=:; s each Jordan box Cv(Ai) in J has
dimension 1, i.e., Cv(Ai) = [AJ Choosing
we therefore have
luboo(D;ITAT-ID E) = p(A).
For the norm p(x) := IID;ITxlloo it follows, as in (a), that
lUbp(A) = p(A).
o
A better estimate than (6.9.1) is given by the following theorem [ef.
Bauer and Fike (1960)]:
(6.9.3) Theorem. If B is an arbitrary n x n matrix, then for all eigenvalues A of A one has
I:=:; lub((AI - B)-I(A - B)) :=:; lub((AI - B)-I) lub(A - B)
unless A is also an eigenvalue of B.
PROOF. If x
identity
is an eigenvector of A for the eigenvalue A, then from the
444
6 Eigenvalue Problems
(A - B)x = ()"I - B)x
it follows immediately, if ).. is not an eigenvalue of B, that
()"I - B)-l(A - B)x = x,
and hence
o
Choosing in particular
B = A D := [
0
..
all
o
1
ann
the diagonal of A, and taking the maximurIl norm, it follows that
n
lubCXJ [()"I - A D )-l(A - AD)] = max
l::;i::;n
1
I).. - aiil
L laikl.
k=l
k'i'"
From Theorem (6.9.3) we now get
t
(6.9.4) Theorem (Gershgorin). The union of all discs
J(i:=
{p, E <C lip, - aiil ::;
laikl},
i = 1, 2, ... , n,
k:;E:-i
contains all eigenvalues of the n x n matrix A = (aik).
Since the disc J(i has center at aii and radius I:~=l,k#i laik I, this estimate will be sharper as A deviates less from a diagonal matrix.
EXAMPLE 1.
A=
[~ ~.1 -~:~l'
-0.2
0
3
Kl = {Ill 111- 11::; 0.2},
K2 = {II 1111 - 21 ::; 0.4},
K3 = {II I 111 - 31 ::; 0.2},
[see Figure 12J.
The preceding theorem can be sharpened as follows:
(6.9.5) Corollary. If the union }.h = U~=l J(i J of k discs J(ij' j = 1, ... ,
k. and the union .fIv12 of the remaining discs are disjoint, then.flvh contains
exactly k eigenvalues of A and !Y!2 exactly n - k eigenvalues.
PROOF. If A =
AD + R, for t E [O,l]lct
6.9 Estimation of Eigenvalues
o
445
4
Fig. 12. Gershgorin circles
A t := AD + tR.
Then
Ao = AD,
The eigenvalues of At are continuous functions of t. Applying the theorem
of Gershgorin to At, one finds that for t = 0 there are exactly k eigenvalues
of Ao in lvh and n - k in lvh (counting multiple eigenvalues according
to their multiplicities as zeros of the characteristic polynomial). Since for
o S; t S; 1 all eigenvalues of At likewise must lie in these discs, it follows
for reasons of continuity that also k eigenvalues of A lie in lvh and the
remaining n - k in ~M2.
0
Since A and AT have the same eigenvalues, one can apply (6.9.4), (6.9.5)
to A as well as to AT and thus, in general, obtain more information about
the location of the eigenvalues.
It is often possible to improve the estimates of Gershgorin's theorem
by first applying a similarity transformation A -t D- 1 AD with a diagonal
matrix D = diag(d 1 , ... , dn ). For the eigenvalues of D- 1 AD, and thus of
A, one thus obtains the discs
Ki =
{/lii/l-aiil S;
t
lai~~kl =:Pi}.
k#-i
By a suitable choice of D one can often substantially reduce the radius
Pi of a disc (the remaining discs, as a rule, become larger) in such a way,
indeed, that the disc Ki in question remains disjoint with the other discs
K j , j i i. Ki then contains exactly one eigenvalue of A.
EXA:VIPLE 2.
0< r:: « 1,
K 1 ={/-lII/-l-1Is2r::},
K2 = K3 = {/-lil/-l- 21 S 2r::},
446
6 Eigenvalue Problems
Transformation with D = diag(l, kE, kE), k > 0, yields
A' = D-1AD =
For A' we have
[1~k
11k
1
P2 = P3 = k + E.
The discs Kl and K2 = K3 for A' are disjoint if
Pl + P2 = 2kE
2
1
+ k + E < 1.
For this to be true we must clearly have k > 1. The optimal value k, for which
Kl and K2 touch one another, is obtained from Pl + P2 = 1. One finds
k=
and thus
2
2
1 - E + J(1 - E)2 - 8E2
= 1+E+0(E),
Pl = 2kE2 = 2E2 + 0(E 3 ).
Through the transformation A -+ A' the radius Pl of Kl can thus be reduced
from the initial 2E to about 2E2.
The estimate of Theorem (6.9.3) can also be interpreted as a perturbation theorem, indicating how much the eigenvalues of A can deviate from
the eigenvalues of B. In order to show this, let us assume that B is normalizable:
It then follows, if .>. is no eigenvalue of B, that
lub( (.AI - B)-I) = lub( (P(.AI - AB)-l p-l)
S lub((.AI - AB)-l) lub(P) lub(P-l)
=
max
l::;i::;n
1
1'>'-'>'i(B)1
cond(P)
1
mm I.>. - '>'i(B) 1cond(P).
l::;i::;n
This estimate is valid for all norms 11·11 satisfying, like the maximum norm
and the Euclidean norm,
lub(D) = max Idil
l::;i::;n
6.9 Estimation of Eigenvalues
447
for all diagonal matrices D = diag(d1, ... ,d n ). Such norms are called absolute; they are also characterized by the condition Illxlll = Ilxll for all
x E (Cn [see Bauer, Stoer, and Witzgall (1961)].
From (6.9.3) we thus obtain [cf. Bauer, Fike (1960), Householder
(1964)] :
(6.9.6) Theorem. If B is a diagonalizable n x n matrix, B = P ABP- 1 ,
AB = diag(Al(B), ... , An(R)), and A an arbitrary n x n matrix, then for
each eigenvalue )'(A) there is an eigenvalue Ai(B) such that
IA(A) - Ai(B)1 ::; cond(P) lub(A - B).
Here cond and lub are to be formed with reference to an absolute norm 11·11.
This estimate shows that for the sensitivity of the eigenvalues of a
matrix B under perturbation, the controlling factor is the condition
cond(P)
of the matrix P. not the condition of B. But the columns of P are just
the (right) eigenvectors of B. For Hermitian, and more generally, normal
matrices normal matrix B one can choose for P a unitary matrix [Theorem (6.4.5)]. With respect to the Euclidean norm IIxl12 one therefore has
cond 2 (P) = 1, and thus:
(6.9.7) Theorem. If B is a nurmal n x n matrix and A an arbitrary n x n
TrwtTix, then for' each eigenvalue A(A) of A there is an eigenvalue )'(B) of
B such that
I)'(A) - A(E)I ::; lub 2 (A - E).
The eigenvalue problem for Hermitian matrices is thus always well conditioned.
The estimates in (6.9.6), (6.9.7) are of global nature. We now wish to
examine in a first approximation the sensitivity of a fixed eigenvalue A of A
under small perturbation A -+ A+EC, E -+ 0. Vie limit ourselves to the case
in which the eigenvalue A under study is a simple zero of the characteristic
polynomial of A. To ), then belong uniquely determined (up to a constant
factor) right and left eigenvectors x and y, respectively:
xi 0,
y i 0.
For these, one has yH x i 0, as one easily shows with the aid of the Jordan
normal form of A, which contains only one Jordan block, of order one, for
the eigenvalue A.
(6.9.8) Theorem. Let), be a simple zero of the characteristic polynomial
of the n x n matrix A, and x and yH corresponding right and left eigenvectors of A, respectively,
x, y i 0,
448
6 Eigenvalue Problems
and let C be an arbitrary n x n matrix. Then there exists a function A(e)
which is analytic for lei sufficiently small, lei S eo., eo > 0, such that
A(O) = A,
A'(O) = yH Cx,
yHx
and A(e) is a simple zero of the characteristic polynomial of A + eC. One
thus has, in first approximation,
PROOF. The characteristic polynomial of the matrix A
+ eC,
is an analytic function of e and 11. Let K be a disc about A,
K = {11 1111 - AI S r},
r > 0,
which does not contain any eigenvalues of A other than A, and S := {11 I
111 - AI = r} its boundary. Then
inf ICPo(l1)I =: m > O.
!J.ES
Since CPo(l1) depends continuously on e, there is an eo > 0 such that also
(6.9.9)
inf IcpE(I1)1 > 0
,LES
for all
lei < eo.
According to a well-known theorem in complex variables, the number of
zeros of CPE (11) within K is given by
In view of (6.9.9), I/(e) is continuous for lei S eo; hence 1 = 1/(0) = I/(e)
for lei SeD, 1/ being integer valued. The simple zero A(e) of CPE(IJ,) inside of K, according to another theorem in complex variables, admits the
representation
(6.9.10)
For lei S eo the integrand of (6.9.10) is an analytic function of c and therefore also A(e), according to a well-known theorem on the interchangeability
of differentiation and integration. For the simple eigenvalue A(e) of A + eC
one can choose right and left eigenvectors X(e) and y(e),
6.9 Estimation of Eigenvalues
(A + eC)x(e) = A(e)X(e),
449
y(e)H (A + eC) = A(e)y(e)H,
in such a way that x(e) and y(e) are analytic function::; of e for lei:::; co.
We may put, e.g., x(e) = [6(e), ... ,';n(e)]T with
';i(e) = (_l)i det(B 1 ;),
where Bli is the (n - 1) x (n - 1) matrix obtained by deleting row 1 and
column i in the matrix A + eC - ).,(e)l. From
(A + eC - ).,(e) 1)x(e) = 0,
differentiating with re::;pect to e at e = 0, one obtains
(C - ).,' (0) l)x(O) + (A - ),,(0) I) x' (0) = 0,
from which, in view of y(O)H (A - ).,(0) 1) = 0,
y(O)H(C - ).,1(0)1)X(O) = 0,
and hence, since y(O)H x(O) i- 0,
A'(O) = yHCx,
yHx
y = y(O),
x = x(O),
as was to be shown.
Denoting, for the Euclidean norm I . 112, by
o
the cosine of the angle between x and y, the preceding result implie::; the
estimate
1>.'(0)1 =
lyHCxl
IIYl12 Ilx 112 Icos(x, y) I
<
IICxl12
< lub 2(C) .
IIxl121 cos(x, y) I - Icos(x, y) I
The sensitivity of )., will thus increase with decreasing I cos(x, y)l. For Hermitian matrices we always have x = y (up to constant multiples), and
hence I cos(x, y)1 = 1. This is in harmony with Theorem (6.9.7), according
to which the eigenvalues of Hermitian matrices are relatively insensitive to
perturbations.
Theorem (6.9.8) asserts that an eigenvalue A of A which is only a simple
zero of the characteristic polynomial is relatively insensitive to perturbations A --+ A + eC, in the sense that for the corresponding eigenvalue A(e)
of A + eC there exists a constant K and an co > 0 such that
I).,(e) - )"1 :::; K lei
for
lei:::; co·
450
6 Eigenvalue Problems
However, for ill-conditioned eigenvalues A, i.e., if the corresponding left and
right eigenvectors are almost orthogonal, the constant K is very large.
This statement remains valid if the eigenvalue A is a multiple zero of
the characteristic polynomial and has only linear elementary divisors. If
to A there belong nonlinear elementary divisors, however, the statement
becomes false. The following can be shown in this case (cf. Exercise 29):
Let (t1-A)Vl, (p.-At 2 , .•• , (Il-At p , VI 2" V2 2" ... 2" vP' be the elementary
divisors belonging to the eigenvalue A of A. Then the matrix A + Ee, for
sufficiently small E, has eigenvalues Ai(E), i = 1, ... , IJ", IJ" := VI + ... + vP'
satisfying, with some constant K,
(6.9.11)
This has the following numerical consequence: If in the practical computation of the eigenvalues of a matrix A [with lub(A) = 1] one applies a stable method, then the rounding errors committed during the course of the
method can be interpreted as having the effect of producing exact results
not for A. but for a perturbed initial matrix A + DA, lub(DA) = O(eps).
If to the eigenvalue Ai of A there belong only linear elementary divisors,
then the computed approximation ),i of Ai agrees with Ai up to an error
of the order eps. If however Ai has elementary divisors of order at most v,
one must expect an error of order of magnitude epsl/v for ),i.
FOr the derivation of a typical inclusion theorem, we limit ourselves to
the case of the Euclidean norm
and begin by proving the formula
. jjUrjj2
.
j
mln-j-j -jj- = mmjdi
x,to
x 2
l
for a diagonal matrix D = diag(d l , ... , d n ). Indeed, for all x i= 0,
jjD.Tjj§ _
jjxjj§ -
Li jX ij2jdi j2 > . jd.j2
Li jX ij2 - mim , .
If jdjj = mini jdij, then the lower bounds is attained for x = ej (the j-th
unit vector). Furthermore, if A is a normal matrix, i.e., a matrix which can
be transformed to diagonal form by a unitary matrix U,
A=UHDU,
D = diag(d], ... ,dn ),
and if f(A) is an arbitrary polynomial, then
f(A) = U H f(D)U,
di = Ai(A),
6.9 Estimation of Eigenvalues
451
and from the unitary invarianee of IIxl12 = IIUxl12 it follows at once, for all
x =I 0, that
Ilf(A)xI12
IIxl12
Ilf(D)UxI12
IIU H f(D)UxI12
IIUxl12
IIxl12
~ min 1f (d i ) 1 = min 1f (Ai (A) ) I·
l~i~n
l~i~n
We therefore have the following [sec, e.g., Householder (1964)]:
(6.9.12) Theorem. If A is a normal matrix, f(A) an arbitrary polynomial,
and x =I 0 an arbitrary vector, then there is an eigenvalue A(A) of A such
that
If(A(A))
1
~ Ilf(A)x I12 .
IIxl12
In particular, choosing for f the linear polynomial
f(A) == A _ x H Ax == A _ /L0l ,
xH X
where
/Lik :=X
H( A H)i Ax.
k
/Loa
i, k = 0, 1, 2, ... ,
we immediately get, in view of fLik = !lh'
Ilf(A)xll~ = x H(AH _ !l01 I) (A _ /1011).7:
/Loa
/LOO
/L1O/L0l
/L10/L01
/L01/L1O
= /111 - - - - - - + -2-/100
/100
/Loo
/100
flOl/L10
= /Lll - - - .
/100
We thus have
(6.9.13) Theorem (Weinstein): If A is normal and x -=I 0 an arbitrary
vector, then the disc
/Lll - /L1O/L01 / /Loa }
/Loa
contains at least one eigenvalue of A.
The quotient /101 / fLoo = x H Ax / x H x, incidentally, is called the
Rayleigh quotient
452
6 Eigenvalue Problems
of A belonging to x. The last theorem is used especially in connection
with the vector iteration: If x is approximately equal to Xl, the eigenvector
corresponding to the eigenvalue AI,
x ~ Xl,
AXI
= AlXl,
then the Rayleigh quotient x H Ax / x H X belonging to x is in general a very
good approximation to AI. Theorem (6.9.13) then shows how much at most
this approximate value x H Ax / x H x deviates from an eigenvalue of A.
The set
G[A] =
{x:H~x I x 1= °}
of all Rayleigh quotients is called the field of values of the matrix A. Choosing for x an eigenvector of A, it follows at once that G[A] contains the
eigenvalues of A. Hausdorff, furthermore, has shown that G[A] is always
convex. For normal matrices
A = UHAU.
one even has
xH U H AU x I
G[A] = { xHUHUx
x 1=
=
{11 I =
J.L
t Ai,
Ti
°} = { - - I 1= }
Ti
yH Ay
yHy
2:: 0,
t
Ti =
y
U
1 }.
That is, for normal matrices, G[A] is the convex hull of the eigenvalues of
A [see Figure 13].
Fig. 13. Field of values
6.9 Estimation of Eigenvalues
453
For a Hermitian matrix H = H H = U H AU with eigenvalues >'1 : : :
A2 ::::: ... ::::: An, we thus recover the result (6.4.3), which characterizes Al
and An by extremal properties. The remaining eigenvalues of H, too, can
be characterized similarly, as is shown by the following theorem:
(6.9.14) Theorem (Courant, Weyl). For the eigenvalues Al ::::: A2 ::::: ... :::::
An of an n x n Hermitian matrix H, one has for i = 0, 1, ... , n - 1
xHHx
PROOF. For arbitrary Yl,' .. , Yi E (Cn define /J(Yl, ... , Yi) by
/J(Yl, ... ,y;) :=
max
xE(f;n: X H Yl="'=X H Yi=O
xHx=l
Further let Xl, ... , Xn be a set of n orthonormal eigenvectors of H for the
eigenvalues Aj [see Theorem (6.4.2)]: HXj = AjXj, XfXk = 6jk for j, k = 1,
2, ... , n. For Yj := Xj, j = L ... , i, all x E (Cn with xHYj = 0, j = 1, ... ,
i, xH X = 1, can then be represented in the form
so that, since Ak :s: Ai+l for k ::::: i + 1, one has for such x
XH Hx = (Pi+lXi+l + ... + Pnxn)H H(Pi+lXi+l + ... + Pnxn)
= IPi+112Ai+l + ... + IPnl 2A n
:s: Ai+l(lpi+11 2 + ... + IPnI 2 ) = Ai+l,
where equality holds for x := Xi+l. Hence,
On the other hand, for arbitrary Yl, ... , Yi E (Cn, the subspaces
E := {x E (Cn I x H Yj = 0 for j :s: i},
F := {~jS;i+l PjXj I Pj E (C},
have dimensions dim E ::::: n - i, dim F = i + 1, so that dim F n E 2: 1,
and there exists a vector Xo E F n E with x{f Xo = 1. Therefore, because
Xo = PlXl + ... + Pi+lXi+l E F,
/J(Yl, ... , Yi) ::::: x{f H Xo = IPll 2 Al + ... + IPi+11 2Ai+1
::::: (IPlI 2 + ... + IPi+11 2)Ai+l = Ai+l.
o
454
6 Eigenvalue Problems
Defining, for an arbitrary matrix A,
then HI, H2 are Hermitian and
( HI, H 2 are also denoted by Re A and 1m A, respectively; but it should be
noted that the elements of Re A are not real in general.)
For every eigenvalue A of A, one has, on account of A E G[A] and
(6.4.3),
xHAx
1 xHAx+XHAHx
ReA S; maxRe-H= max~
2
X",O
x x
X",O X X
= max
X",O
xHHIX
H
x x
= Amax(Hd,
xHAx
ImA S; maxlm-H= Amax(H2)'
X",O
x.r
By estimating Re A, 1m A analogously from below, one obtains
(6.9.15) Theorem (Bendixson). Decomposing an arbitrary matrix A into
A = HI + iH2, where HI and H2 are Hermitian, then for every eigenvalue
A of A one has
Amin(H1 ) S; Re AS; Amax(Hd.
Amin(H2 ) S; ImA S; Arnax (H 2 ).
The estimates of this section lead immediately to estimates for the zeros
of a polynomial
We need only observe that to p there corresponds the Frobenius matrix
o
1
which has the characteristic polynomial (l/a n )( -l)np(A). In particular,
from the estimate (6.9.1) of Hirsch, with luboo(A) = maXi Lk laikl. applied
to F and F T , one obtains the following estimates for all zeros Ak of p(A),
respectively:
EXERCISES FOR CHAPTER 6
455
(a) IAkl ~ max{1 :: I, l~~~a:_l (1 + 1::1) },
(b)
IAkl ~ max{ 1, ~ 1:: 1}.
EXAMPLE 3. For p('\) = ,\3 - 2,\2
+ ,\ - lone obtains
(a) I'\il S max{l, 2, 3} = 3,
(b) I'\il S max{l, 1 + 1 + 2} = 4.
Tn this case (a) gives a better estimate.
EXERCISES FOR CHAPTER 6
1.
For the matrix
2
1
2
o
1
2
2
A=
1
2
1
o
o
find the characteristic polynomial, the minimal polynomial, a system of eigenvectors and principal vectors, and the Frobenius normal form.
2.
How many distinct (i.e., not mutually similar) 6 x 6 matrices are there whose
characteristic polynomial is
3.
What are the properties of the eigenvalues of
positive definite/semidefinite,
orthogonal/ unitary,
real skew-symmetric (AT = -A)
matrices? Determine the minimal polynomial of a projection matrix A = A 2 .
4.
For the matrices
(a) A=UVT,u,VElR",
(b) H = 1- 2ww H • w H W = 1, w E (Cn,
o o
o o
1 o
o 1
il
456
6 Eigenvalue Problems
determine the eigenvalues Ai, the multiplicities O"i and pi, the characteristic
polynomial i.p(JL) , the minimal polynomial 1jJ(JL) , a system of eigenvectors and
principal vectors, and a Jordan normal form J,
5.
For 'U,V,'W,Z E IR n , n > 2, let
(a) With AI, A2 the eigenvalues of
_
A:=
[vT'U
T
Z 'U
show that A has the eigenvalues AI, A2, O.
(b) How many eigenvectors and principal vectors can A have 7 'What types
of Jordan normal form J of A are possible 7 Determine J in particular
for A. = O.
6.
(a) Show: A is an eigenvalue of A if and only if -A is an eigenvalue of B,
where
(b) Suppose the real symmetric tridiagonal matrix
A=
12
o
satisfies
In
,n
On
i = 1, ... , n,
i = 2, ... , n.
Show: If A is an eigenvalue of A, then so is -A. 'What does this imply
for the eigenvalues of the matrix (6.6.5.9)7
(c) Show that eigenvalues of the matrix
are symmetric with respect to the origin, and that
EXERCISES FOR CHAPTER 6
457
if n is even, n = 2k,
otherwise.
7.
Let C be a real n x n matrix and
t
Show: / is stationary at x 0 precisely if x is an eigenvector of ~(C + C T )
with /(x) as corresponding eigenvalue.
S.
Show:
(a) If A is normal, and has the eigenvalue Ai, IAII >
singular values ai, then
i = 1, ... , n,
lub 2 (A) = IA11 = p(A),
cond 2 (A) = :~~: = p(A)p(A-I)
(if A-I exists).
(b) For every n x n matrix,
9.
Show: If A is a normal n x n matrix with the eigenvalues A"
and if U, V are unitary, then for the eigenvalues p" of U AV one has
10. Let B be an n x m matrix. Prove:
M =
[~: 11: 1 positive definite
{o}
p(BH R) < 1
[In' 1m unit matrices, p(BH B) = spectral radius of BH Bl.
11. For n x n matrices A, B show:
s
(a) IAI S IBI =? lub 2 (IAI)
lub 2 (IBI),
(b) lub 2 (A) S lub 2 (IAI) S fo lub 2 (A).
12. The content of this exercise is the assertion in Section 6.3 that a matrix is ill
conditioned if two column vectors are "almost linearly dependent". For the
matrix A = [aJ, ... , an], ai E IR n , i = 1, ... , n, let
O<c:<l
458
6 Eigenvalue Problems
(i.e., al and a2 form an angle ex with 1 - E :::; I cos exl :::; 1). Show: Either A is
singular or cond2(A) 2: liVE.
13. Estimation of the function f(n) in formula (6.5.4.4) for floating-point arithmetic with relative precision eps: Let A be an n x n matrix in floating-point
repretientation and G j an elimination matrix of the type (6.5.4.1) with the
essential elements lij. The Ii] are formed by division of elements of A and
are thus subject to a relative error of at most eps. With the methods of Section 1.3, derive the following estimates (higher powers eps', i 2: 2, are to be
neglected) :
(a) lub=[fl(GjIA) - GjlA] S; 4epslub=(.4),
(b) lub=[fl(AG J ) - .4G j ] S; (n - j + 2) epslub=(A),
(c) lub=[fl(Gjl AG j ) - Gjl AG j ] S; 2(n - j + 6) epslub=(A).
14. Convergence behavior of the vector iteration: Let A be a rcal symmetric n x n
matrix having the eigenvalues Ai with
xT
and the corresponding eigenvectors Xl, ... , Xn with
Xk = l5 ik . Starting
with an initial vector Yo for which xi Yo # O. suppose one computes
for k = 0, 1, ...
with an arbitrary vector norm I . II, and concurrently the quantities
(AYk)i
qki:= (Yk)i '
and the Rayleigh quotient
Prove:
(a) qki = Ad 1 + O((A2/Ad)] for all i with (XI)i # 0,
(b) Tk=>"d1+0((A2/At)2k)].
15. In the real matrix
A=AT =
[1
9
* * *
0
* *
*
*
* * 4
* * *
~ j,
21
stars represent elements of modulus:::; 1/4. Suppose the vector iteration is
carried out with A and the initial vector yo = e5.
(a) Show that e5 is an "appropriate" initial vector, i.e., the sequence Yk in
Exercise 14 does indeed converge toward the eigenvector belonging to
the dominant eigenvalue of A.
(b) Estimate how many correct decimal digits Tk+5 gains compared to Tk·
EXERCISES FOR CHAPTER 6
459
16. Prove: luboo(F) < 1 ==? A := 1+ F admits a triangular factorization A =
L·R.
17. Let the matrix A be nonsingular and admit a triangular factorization A =
L . R (lii = 1). Show:
(a) Land R are uniquely determined.
(b) If A is an upper Hessenberg matrix, then
1
°1] ,
RL upper Hessenberg.
*
(c) If A is tridiagonal, then L is as in (b) and
RL tridiagonal.
18. (a) Which upper triangular matrices are at the same time unitary; which
ones real orthogonal ?
(b) In what do different QR factorizations of a nonsingular matrix differ
from one another? Is the answer valid also for singular matrices?
19. (a) Consider an upper triangular matrix
and determine a Givens rotation fl so that
Hint: flel is an eigenvector of R corresponding to the eigenvalue A2.
(b) How can one transform an upper triangular matrix R with rkk = Ak,
k = 1, ... , n, unitarily into an upper triangular matrix R = U H RU,
UHU = I, with the diagonal diag(Ai, AI, ... ,Ai-I, Ai+1, ... ,An)?
20. QR method with shifts: Prove formula (6.6.5.3).
21. Let A be a normal n x n matrix with the eigenvalues AI, ... , An, A = QR,
QH Q = I, R = (rik) upper triangular. Prove:
minlAil -s; Irjjl -s; max IAil,
i
22. Compute a QR step with the matrix A =
(a) without shift
j = 1, ... , n.
i
[ :c
E
1]
460
6 Eigenvalue Problems
(b) with shift k = 1, i.e., following strategy (a) of Section 6.6.6.
23. Effect of a QR step with shift for tridiagonal matrices:
B
/; 1= 0, i = 2, ... , n.
/n
Suppose we carry out a QR step with A, using the shift parameter k = 5n ,
,
/2
::,1'
5;,
Prove: If d:= min; IAi(B) - 5n l > 0, then
15'n _ 5 1< ITnd 12 .
3
1/n'1<lT
d 21 '
n
11,
-
Hint: Q is a product of suitable Givens rotations; apply Exercise 21.
Example: What does one obtain for
1
5
1
1
5
1
1
5
0.1
24. Is shift strategy (a) in the QR method meaningful for real tridiagonal matrices of the type
5 /2
0
/2
5
?
/n
0
/n
5
Answer this question with the aid of Exercise 6 (c) and the following fact:
If A is real symmetric and tridiagonal, and if all diagonal elements are zero,
then the same is true after one QR step.
25. For the matrix
5.2
A = [ 0.6
2.2
0.6
6.4
0.5
2.2]
0.5
4.7
compute an upper bound for cond 2 (A), using estimates of the eigenvalues by
the method of Gershgorin.
26. Estimate the eigenvalues of the following matrices as accurately as possible:
EXERCISES FOR CHAPTER 6
461
(a) The matrix A in Exercise 15.
10- 3
2
10- 3
(b)
10- 34 ]
103
.
Hint: Use the method of Gershgorin in conjunction with a transformation
A --+ D- 1 AD, D a suitable diagonal matrix.
27. (a) Let A, B be Hermitian square matrices and
Show: For every eigenvalue )"(B) of B there is an eigenvalue )"(H) of H
such that
(b) Apply (a) to the practically important case in which H is a Hermitian,
"almost reducible" tridiagonal matrix of the form
0
* *
* *
*
H=
*
*
E
E
*
*
E small.
*
*
o
*
*
*
How can the eigenvalues of H be estimated in terms of those of A and
B?
28. Show: If A = (aid is Hermitian, then for every diagonal element aii there
exists an eigenvalue )"(A) of A such that
I),.(A)-aiil ~
~)a,jI2.
ji"i
29. (a) For the v x v matrix
462
6 Eigenvalue Problems
prove with the aid of Gershgorin's theorem and appropriate scaling with
diagonal matrices: The eigenvalues Ai(c) ofthe perturbed matrix Cv(A)+
cF satisfy, for c sufficiently small,
with some constant K. Through a special choice of the matrix F show
that the case Ai(c) - A = O(c l / v ) does indeed occur.
(b) Prove the result (6.9.11).
Hint: Transform A to Jordan normal form.
30. Obtain the field of values G[A] = {x H Axlx H X = I} for
1
[1
A:= 0
1
1
0
o 0
o
o
-1
-1
~l·
-1
31. Decompose these matrices into HI + iH2 with HI, H2 Hermitian:
5 +i
1 + 4i
1
32. For the eigenvalues of
3.
A = [ 0.9
0.1
1.1
15.0
-1.9
-0.1]
-2.1 .
19.5
use the theorems of Gershgorin and Bendixson to determine inclusion regions
which are as small as possible.
References for Chapter 6
Barth, W., Martin, R. S., Wilkinson, J. H. (1971): Calculation of the eigenvalues
of a symmetric tridiagonal matrix by the method of bisection. Contribution
Il/5 in Wilkinson and Reinsch (1971).
Bauer, F. L., Fike, C. T. (1960): Norms and exclusion theorems. Numer. Math.
2, 137-141.
- - - , Stoer, J., Witzgall, C. (1961): Absolute and monotonic norms. Numer.
Math. 3, 257-264.
Bowdler, H., Martin, R. S., Wilkinson, J. H. (1971): The QR and QL algorithms
for symmetric matrices. Contribution Il/3 in: Wilkinson and Reinsch (1971).
Bunse, W., Bunse-Gerstner, A. (1985): Numerische Lineare Algebra. Stuttgart:
Teubner.
Cullum, J., Willoughby, R. A. (1985): Lanczos Algorithms for Large Symmetric Eigenvalue Computations. Vol. I: Theory, Vol. II: Programs. Progress in
Scientific Computing, Vol. 3, 4. Basel: Birkhiiuser.
References for Chapter 6
463
Eberlein, P. J. (1971): Solution to the complex eigenproblem by a norm reducing
Jacobi type method. Contribution II/17 in: Wilkinson and Reinsch (1971).
Francis, J. F. G. (1961/62): The QR transformation. A unitary analogue to the
LR transformation. I. Computer J. 4, 265-271. The QR transformation. II.
ibid., 332-345.
Garbow, B. S., et al. (1977): Matrix Eigensystem Computer Routines - EISPACK Guide Extension. Lecture Notes in Computer Science 51. Berlin, Heidelberg, New York: Springer-Verlag.
Givens, J. W. (1954): Numerical computation of the characteristic values of a
real symmetric matrix. Oak Ridge National Laboratory Report ORNL-1574.
Golub, G. H., Reinsch, C. (1971): Singular value decomposition and least squares
solution. Contribution I/1O in: Wilkinson and Reinsch (1971).
- _ .. , Van Loan, C. F. (1983): Matrix Computations. Baltimore: The JohnsHopkins University Press.
- - - , Wilkinson, J. H. (1976): Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Review 18, 578-619.
Householder, A. S. (1964): The Theory of Matrices in Numencal Analysis. New
York: Blaisdell.
Kaniel, S. (1966): Estimates for some computational techniques in linear algebra.
Math. Compo 20, 369-378.
Kielbasinski, A., Schwetlick, H. (1988): Numerische Lineare Algebra. Thun,
Frankfurt /M.: Deutsch.
Kublanovskaya, V. N. (1961): On some algorithms for the solution of the complete
eigenvalue problem. Z. Vycisl. Mat. i Mat. Fiz. 1, 555-570.
Lanczos, C. (1950): An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Nat. Bur. Stand. 45,
255-282.
Martin, R. S., Peters, G., Wilkinson, J. H. (1971): The QR algorithm for real
Hessenberg matrices. Contribution II/14 in: Wilkinson and Reinsch (1971).
- - - , Reinsch, C., Wilkinson. J. H. (1971): Householder's tridiagonalization
of a symmetric matrix. Contribution II/2 in: Wilkinson and Reinsch (1971).
- , Wilkinson, J. H. (1971): Reduction of the symmetric eigenproblem Ax =
)'Bx and related problems to standard form. Contribution II/10 in: Wilkinson
and Reinsch (1971).
- - - , - - - (1971): Similarity reduction of a general matrix to Hessenberg
form. Contribution II/1:1 in: Wilkinson and Reinsch (1971).
Moler, C. B .. Stewart, G. W. (1973): An algorithm for generalized matrix eigenvalue problems. SIAM J. Numer. Anal. 10, 241-256.
Paige, C. C. (1971): The computation of eigenvalues and eigenvectors of very
large sparse matrices. Ph.D. thesis, London University.
Parlett, B. N. (1965): Convergence of the QR algorithm. Numer. Math. 7, 187-193
(corr. in 10, 16:~-164 (lY67)).
- - - (1980): The Symmetric Eigenvalue Problem. Englewood Cliffs, N.J.:
Prcntice-Hall.
- - - , Poole, W. G. (1973): A geometric theory for the QR, LU and power
iterations. SIAM J. Numer. Anal. 10. 389-412.
- - - , Scott, D. S. (1979): The Lanczos algorithm with selective orthogonalization. Math Compo 33, 217-238.
Peters, G., Wilkinson J. H. (1970): Ax = )'Bx and the generalized eigenproblem.
SIAM 1. Numer. Anal. 7, 479-492.
- - - , - - - (1971): Eigenvectors of real and complex matrices by LR and
QR triangularizations. Contribution II/15 in: Wilkinson and Reinsch (1971).
464
6 Eigenvalue Problems
- - - , - - - (1971): The calculation of specified eigenvectors by inverse iteration. Contribution II/18 in: Wilkinson and Reinsch (1971).
Rutishauser, H. (1958): Solution of eigenvalue problems with the LR-transformation. Nat. Bur. Standards Appl. Math. Ser. 49, 47-81.
- - - (1971): The Jacobi method for real symmetric matrices. Contribution
II/I in: Wilkinson and Reinsch (1971).
Saad, Y. (1980): On the rates of convergence of the Lanczos and the block Lanczos
methods. SIAM 1. Num. Anal. 17, 687-706.
Schwarz, H. R., Rutishauscr, H., Stiefel, E. (1972): Numerik symmetrischer Matrizen. 2d ed. Stuttgart: Teuhner. (English translation: Englewood Cliffs, N.J.:
Prentice-Hall (1973).)
Smith, B. T., et al. (1976): Matrix eigensystems routines - EISPACK Guide.
Lecture Notes in Computer Science 6, 2d ed. Berlin, Heidelberg, New York:
Springer-Verlag.
Stewart, G. W. (1973): Introduction to Matrix Computations. New York, London:
Academic Press.
Wilkinson, J. II. (1962): Note on the quadratic convergence of the cyclic Jacobi
process. Numer. Math. 4, 296-300.
- - - (1965): The Algebraic Eigenvalue Problem. Oxford: Clarendon Press.
- - - (1968): Global convergence of tridiagonal QR algorithm with origin
shifts. Linear Algebra and Appl. 1, 409-420.
- - - . Reinsch, C. (1971): Linear Algebra, Handbook for Automatic Computation, Vol. II. Berlin, Heidelberg, New York: Springer-Verlag.
7 Ordinary Differential Equations
7.0 Introduction
Many problems in applied mathematics lead to ordinary differential equations. In the simplest case one seeks a differentiable function Y = y(x) of
one real variable x, whose derivative y'(x) is to satisfy an equation of the
form y'(x) = f(x, y(x)), or more briefly,
(7.0.1)
y' = f(x,y);
one then speaks of an ordinary differential equation. In general there are
infinitely many different functions y which are solutions of (7.0.1). Through
additional requirements one can single out certain solutions from the set
of all solutions. Thus, in an initial-value problem, one seeks a solution y of
(7.0.1) which for given xo, Yo satisfies an initial condition of the form
(7.0.2)
y(xo) = Yo·
More generally, one also considers systems of n ordinary differential
equations
y~ (x) = II (x, Yl(X), ... ,Yn(x)),
y~(x) = h(x, Yl (x), ... ,Yn(x)),
for n unknown real functions Yi(X), i = 1, ... , n, of a real variable. Such
systems can be written analogously to (7.0.1) in vector form:
(7.0.3)
To the initial condition (7.0.2) there now corresponds a condition of the
form
466
(7.0.4)
7 Ordinary Differential Equations
y(xo) = Yo =
[Y~Ol
YnO
In addition to ordinary differential equations of first order' (7.0.1),
(7.0.3), in which there occur only first derivatives of the unknown function
y( x), there are ordinary differential equations of mth order of the form
(7.0.5)
))
y (m)( x ) -_ f( x, y ()
x ,Y (1)()
x, .... y (m-l)( x.
By introducing auxiliary functions
ZI(X) : = y(x),
Z2(X) : = y(1)(x),
Zm(x) : = y(m-l)(x),
(7.0.5) can always be transformed into an equivalent system of first-order
differential equations,
(7.0.6)
By an initial-value problem for the ordinary differential equation of mth
order (7.0.5) one means the problem of finding an m times differentiable
function y(x) which satisfies (7.0.5) and initial conditions of the form
i = 0, 1, ... , m - 1.
Initial-value problems will be treated in Section 7.2.
Besides initial-value problems for systems of ordinary differential equations, boundary-value problems also frequently occur in practice. Here, the
desired solution y(x) of the differential equation (7.0.3) has to satisfy a
boundary condition of the form
r(y(a), y(b)) = 0,
(7.0.7)
where a =f. b are two different numbers and
rl (Ul, ... , Un, VI, ... , V n )
r(u, V) :=
[
:
r n (Ul, ... , Un, VI,""
Vn )
1
7.1 Some Theorems from the Theory of Diffferential Equations
467
is a vector of n given functions ri of 2n variables Ul, ... , Un, Vl, ... , V n ·
Problems of this kind will be considered in Sections 7.3, 7.4, and 7.5.
In the methods which are to be discussed in the following, one does not
construct a closed-form expression for the desired solution y(x) - this is
not even possible, in general - but in correspondence to certain discrete
abscissae Xi, i = 0, 1, ... ,one determines approximate values 'r/i := 'r/(Xi) for
the exact values Yi := Y(Xi). The discrete abscissae are often equidistant,
Xi = Xo + ih. For the appproximate values 'r/i we then also write more
precisely 'r/(Xi; h), since the 'r/i, like the Xi, depend on the stepsize h used.
An important problem, for a given method, will be to examine whether,
and how fast, 'r/(x; (x - xo)jn) converges to y(x) as n -+ 00, i.e., h -+ o.
For a detailed treatment of numerical methods for solving initial- and
boundary-value problems we refer to the literature: In addition to the classical book by Henrici (1963), and the more recent exposition by Hairer et
al. (1987, 1991), we mention the books by Gear (1971), Grigorieff (1972,
1977), Keller (1968), Shampine and Gordon (1975), and Stetter (1973).
7.1 Some Theorems from the Theory of Ordinary
Differential Equations
For later use we list a few results - some without proof - from the theory of ordinary differential equations. We assume throughout that [see (7.0.3)]
    y' = f(x, y)
is a system of n ordinary differential equations, || · || a norm on ℝ^n, and ||A|| an associated consistent multiplicative matrix norm with ||I|| = 1 [see (4.4.8)]. It can then be shown [see, e.g., Henrici (1962), Theorem 3.1] that the initial-value problem (7.0.3), (7.0.4) - and thus in particular (7.0.1), (7.0.2) - has exactly one solution, provided f satisfies a few simple regularity conditions:
(7.1.1) Theorem. Let f be defined and continuous on the strip S := {(x, y) | a ≤ x ≤ b, y ∈ ℝ^n}, a, b finite. Further, let there be a constant L such that
(7.1.2)    ||f(x, y_1) - f(x, y_2)|| ≤ L ||y_1 - y_2||
for all x ∈ [a, b] and all y_1, y_2 ∈ ℝ^n ("Lipschitz condition"). Then for every x_0 ∈ [a, b] and every y_0 ∈ ℝ^n there exists exactly one function y(x) such that
(a) y(x) is continuous and continuously differentiable for x ∈ [a, b];
(b) y'(x) = f(x, y(x)) for x ∈ [a, b];
(c) y(x_0) = y_0.
From the mean-value theorem it easily follows that the Lipschitz condition is satisfied if the partial derivatives ∂f_i/∂y_j, i, j = 1, ..., n, exist on the strip S and are continuous and bounded there. For later use, we denote by
(7.1.3)    F_N(a, b)
the set of functions f for which all partial derivatives up to and including order N exist on the strip S = {(x, y) | a ≤ x ≤ b, y ∈ ℝ^n}, a, b finite, and are continuous and bounded there. The functions f ∈ F_1(a, b) thus fulfill the assumptions of (7.1.1).
In applications, f is usually continuous on S and also continuously differentiable there, but the derivatives ∂f_i/∂y_j are often unbounded on S. Then, while the initial-value problem (7.0.3), (7.0.4) is still solvable, the solution may be defined only in a certain neighborhood U(x_0) (which may even depend on y_0) of the initial point and not on all of [a, b] [see, e.g., Henrici (1962)].
EXAMPLE. The initial-value problem
    y' = y²,    y(0) = y_0 > 0
has the solution y(x) = y_0/(1 - y_0 x), which is defined only for x < 1/y_0.
(7.1.1) is the fundamental existence and uniqueness theorem for the
initial-value problem given in (7.0.3), (7.0.4).
We now show that the solution of an initial-value problem depends
continuously on the initial value:
(7.1.4) Theorem. Let the function f: S → ℝ^n be continuous on the strip S = {(x, y) | a ≤ x ≤ b, y ∈ ℝ^n} and satisfy the Lipschitz condition
    ||f(x, y_1) - f(x, y_2)|| ≤ L ||y_1 - y_2||
for all (x, y_i) ∈ S, i = 1, 2. Let a ≤ x_0 ≤ b. Then for the solution y(x; s) of the initial-value problem
    y' = f(x, y),    y(x_0; s) = s
there holds the estimate
    ||y(x; s_1) - y(x; s_2)|| ≤ e^{L|x - x_0|} ||s_1 - s_2||    for a ≤ x ≤ b.
PROOF. By definition of y(x; s) one has
    y(x; s) = s + ∫_{x_0}^{x} f(t, y(t; s)) dt
for a ≤ x ≤ b. It follows that
    y(x; s_1) - y(x; s_2) = s_1 - s_2 + ∫_{x_0}^{x} [f(t, y(t; s_1)) - f(t, y(t; s_2))] dt,
and thus
(7.1.5)    ||y(x; s_1) - y(x; s_2)|| ≤ ||s_1 - s_2|| + L |∫_{x_0}^{x} ||y(t; s_1) - y(t; s_2)|| dt|.
For the function
    φ(x) := ∫_{x_0}^{x} ||y(t; s_1) - y(t; s_2)|| dt
one has φ'(x) = ||y(x; s_1) - y(x; s_2)|| and thus, by (7.1.5), for x ≥ x_0
    α(x) ≤ ||s_1 - s_2||    with    α(x) := φ'(x) - L φ(x).
The initial-value problem
(7.1.6)    φ'(x) = α(x) + L φ(x),    φ(x_0) = 0
has for x ≥ x_0 the solution
(7.1.7)    φ(x) = e^{L(x - x_0)} ∫_{x_0}^{x} α(t) e^{-L(t - x_0)} dt.
On account of α(x) ≤ ||s_1 - s_2||, one thus obtains the estimate
    0 ≤ φ(x) ≤ e^{L(x - x_0)} ||s_1 - s_2|| ∫_{x_0}^{x} e^{-L(t - x_0)} dt = (1/L) ||s_1 - s_2|| [e^{L(x - x_0)} - 1]    for x ≥ x_0.
The desired result finally follows for x ≥ x_0 from (7.1.5):
    ||y(x; s_1) - y(x; s_2)|| ≤ ||s_1 - s_2|| + L φ(x) ≤ e^{L(x - x_0)} ||s_1 - s_2||.
If x < x_0 one proceeds similarly.    □
The preceding theorem can be sharpened: Under additional assumptions the solution of the initial-value problem actually depends on the initial
value in a continuously differentiable manner. We have the following:
(7.1.8) Theorem. If in addition to the assumptions of Theorem (7.1.4) the Jacobian matrix D_y f(x, y) = [∂f_i/∂y_j] exists on S and is continuous and bounded there,
    ||D_y f(x, y)|| ≤ L    for (x, y) ∈ S,
then the solution y(x; s) of y' = f(x, y), y(x_0; s) = s, is continuously differentiable for all x ∈ [a, b] and all s ∈ ℝ^n. The derivative
    Z(x; s) := D_s y(x; s) = [∂y(x; s)/∂s_1, ..., ∂y(x; s)/∂s_n]
is the solution of the initial-value problem (Z' ≡ D_x Z)
(7.1.9)    Z' = D_y f(x, y(x; s)) Z,    Z(x_0; s) = I.
Note that Z', Z, and D_y f(x, y(x; s)) are n × n matrices. (7.1.9) thus describes an initial-value problem for a system of n² differential equations.
Formally, (7.1.9) can be obtained by differentiating with respect to s the
identities
y'(x; s) == f(x, y(x; s)),
y(xo; s) == s.
A proof of Theorem (7.1.8) can be found, e.g., in Coddington and
Levinson (1955).
For many purposes it is important to estimate the growth of the solution Z of (7.1.9). Suppose, then, that T(x) is an n × n matrix, and that the n × n matrix Y(x) is the solution of the linear initial-value problem
(7.1.10)    Y' = T(x) Y,    Y(a) = I.
One can then show:
(7.1.11) Theorem. If T(x) is continuous on [a, b], and k(x) := ||T(x)||, then the solution Y(x) of (7.1.10) satisfies
    ||Y(x) - I|| ≤ exp(∫_a^x k(t) dt) - 1,    x ≥ a.
PROOF. By definition of Y(x) one has
    Y(x) = I + ∫_a^x T(t) Y(t) dt.
Letting
    φ(x) := ||Y(x) - I||,
by virtue of ||Y(x)|| ≤ φ(x) + ||I|| = φ(x) + 1 there follows for x ≥ a the estimate
(7.1.12)    φ(x) ≤ ∫_a^x k(t)(φ(t) + 1) dt.
Now let c(x) be defined by
(7.1.13)    ∫_a^x k(t)(φ(t) + 1) dt = c(x) exp(∫_a^x k(t) dt) - 1,    c(a) = 1.
Through differentiation of (7.1.13) [c(x) is clearly differentiable] one obtains
    k(x)(φ(x) + 1) = c'(x) exp(∫_a^x k(t) dt) + k(x) c(x) exp(∫_a^x k(t) dt)
                   = c'(x) exp(∫_a^x k(t) dt) + k(x) [1 + ∫_a^x k(t)(φ(t) + 1) dt],
from which, because of k(x) ≥ 0 and (7.1.12), it follows that
    c'(x) exp(∫_a^x k(t) dt) + k(x) ∫_a^x k(t)(φ(t) + 1) dt = k(x) φ(x) ≤ k(x) ∫_a^x k(t)(φ(t) + 1) dt.
One thus obtains finally
    c'(x) ≤ 0
and therefore
(7.1.14)    c(x) ≤ c(a) = 1    for x ≥ a.
The assertion of the theorem now follows immediately from (7.1.12)-(7.1.14).    □
7.2 Initial-Value Problems
7.2.1 One-Step Methods: Basic Concepts
As one can already surmise from Section 7.1, the methods and results for initial-value problems for systems of ordinary differential equations of first order are essentially independent of the number n of unknown functions. In the following we therefore limit ourselves to the case of only one ordinary differential equation of first order for only one unknown function (i.e., n = 1). The results, however, are valid, as a rule, also for systems (i.e., n > 1), provided quantities such as y and f(x, y) are interpreted as vectors, and | · | as the norm || · ||. For the following, we assume that the initial-value problem under consideration is always uniquely solvable.
A first numerical method for the solution of the initial-value problem
(7.2.1.1)    y' = f(x, y),    y(x_0) = y_0
is suggested by the following simple observation: Since f(x, y(x)) is just the slope y'(x) of the desired exact solution y(x) of (7.2.1.1), one has for h ≠ 0 approximately
    (y(x + h) - y(x))/h ≈ f(x, y(x)),
or
(7.2.1.2)    y(x + h) ≈ y(x) + h f(x, y(x)).
Once a steplength h ≠ 0 is chosen, starting with the given initial values x_0, y_0 = y(x_0), one thus obtains at equidistant points x_i = x_0 + ih, i = 1, 2, ..., approximations η_i to the values y_i = y(x_i) of the exact solution y(x) as follows:
(7.2.1.3)    η_0 := y_0;
    for i = 0, 1, 2, ...
        η_{i+1} := η_i + h f(x_i, η_i),
        x_{i+1} := x_i + h.
One arrives at the polygon method of Euler shown in Figure 14.

Fig. 14. Euler's method
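To make the recursion (7.2.1.3) concrete, here is a minimal sketch of Euler's polygon method in Python; the function names euler and f are ours, not part of the text, and the example problem is only an illustration.

    # Euler's polygon method (7.2.1.3): eta_{i+1} = eta_i + h*f(x_i, eta_i).
    def euler(f, x0, y0, h, n):
        x, eta = x0, y0
        polygon = [(x, eta)]
        for _ in range(n):
            eta = eta + h * f(x, eta)
            x = x + h
            polygon.append((x, eta))
        return polygon

    # Example: y' = -2*x*y, y(0) = 1, with exact solution exp(-x**2).
    approximations = euler(lambda x, y: -2.0 * x * y, 0.0, 1.0, 0.01, 100)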
Evidently, the approximate values η_i of y(x_i) depend on the stepsize h. To indicate this, we also write more precisely η(x_i; h) instead of η_i. The "approximate solution" η(x; h) is thus defined only for
    x ∈ R_h := {x_0 + ih | i = 0, 1, 2, ...}
or, alternatively, for
    h ∈ H_x := {(x - x_0)/n | n = 1, 2, ...};
in fact, it is defined recursively by [cf. (7.2.1.3)]
    η(x_0; h) := y_0,
    η(x + h; h) := η(x; h) + h f(x, η(x; h)).
Euler's method is a typical one-step method. In general, such methods are given by a function
    Φ(x, y; h; f).
Starting with the initial values x_0, y_0 of the initial-value problem (7.2.1.1), one now obtains approximate values η_i for the quantities y_i := y(x_i) of the exact solution y(x) by means of
(7.2.1.4)    η_0 := y_0;
    for i = 0, 1, 2, ...
        η_{i+1} := η_i + h Φ(x_i, η_i; h; f),
        x_{i+1} := x_i + h.
In the method of Euler, for example, one has Φ(x, y; h; f) := f(x, y); here, Φ is independent of h.
For simplicity, the argument f in the function Φ will from now on be omitted. As in Euler's method (see above) we also write more precisely η(x_i; h) instead of η_i, in order to indicate the dependence of the approximate values on the stepsize h used.
Let now x and y be arbitrary, but fixed, and let z(t) be the exact solution of the initial-value problem
(7.2.1.5)    z'(t) = f(t, z(t)),    z(x) = y,
with initial values x, y. Then the function
(7.2.1.6)    Δ(x, y; h; f) := (z(x + h) - y)/h  if h ≠ 0,    Δ(x, y; 0; f) := f(x, y)
represents the difference quotient of the exact solution z(t) of (7.2.1.5) for stepsize h, while Φ(x, y; h) is the difference quotient for stepsize h of the approximate solution of (7.2.1.5) produced by Φ. As in Φ, we shall also drop the argument f in Δ.
The magnitude of the difference
    τ(x, y; h) := Δ(x, y; h) - Φ(x, y; h)
indicates how well the value z(x + h) of the exact solution at x + h of the differential equation with the initial data (x, y) obeys the equation of the one-step method: it is a measure of the quality of the approximation method.
One calls τ(x, y; h) the local discretization error at the point (x, y) of the method in question. For a reasonable one-step method one will require that
    lim_{h→0} τ(x, y; h) = 0.
Inasmuch as lim_{h→0} Δ(x, y; h) = f(x, y), this is equivalent to
(7.2.1.7)    lim_{h→0} Φ(x, y; h) = f(x, y).
One calls Φ, and the associated one-step method, consistent if (7.2.1.7) holds for all x ∈ [a, b], y ∈ ℝ and f ∈ F_1(a, b) [see (7.1.3)].
EXAMPLE. Euler's method, Φ(x, y; h) := f(x, y), is obviously consistent. The result can be sharpened: if f has sufficiently many continuous partial derivatives, it is possible to say with what order τ(x, y; h) goes to zero as h → 0. For this, expand the solution z(t) of (7.2.1.5) into a Taylor series about the point t = x:
    z(x + h) = z(x) + h z'(x) + (h²/2) z''(x) + ... + (h^p/p!) z^(p)(x + θh),    0 < θ < 1.
We now have, because z(x) = y, z'(t) ≡ f(t, z(t)),
    z''(x) = (d/dt) f(t, z(t))|_{t=x} = f_x(t, z(t))|_{t=x} + f_y(t, z(t)) z'(t)|_{t=x}
           = f_x(x, y) + f_y(x, y) f(x, y),
    z'''(x) = f_xx(x, y) + 2 f_xy(x, y) f(x, y) + f_yy(x, y) f(x, y)² + f_y(x, y) z''(x),
etc., and thus
(7.2.1.8)    Δ(x, y; h) = z'(x) + (h/2!) z''(x) + ... + (h^{p-1}/p!) z^(p)(x + θh)
                        = f(x, y) + (h/2) [f_x(x, y) + f_y(x, y) f(x, y)] + ... .
For Euler's method, Φ(x, y; h) := f(x, y), it follows that
    τ(x, y; h) = Δ(x, y; h) - Φ(x, y; h) = (h/2) [f_x(x, y) + f_y(x, y) f(x, y)] + ... = O(h).
Generally, one speaks of a method of order p if
(7.2.1.9)    τ(x, y; h) = O(h^p)
for all x ∈ [a, b], y ∈ ℝ and f ∈ F_p(a, b). Euler's method thus is a method of order 1.
The preceding example shows how to obtain methods of order greater than 1. Simply take for Φ(x, y; h) sections of the Taylor series (7.2.1.8) of Δ(x, y; h). For example,
(7.2.1.10)    Φ(x, y; h) := f(x, y) + (h/2) [f_x(x, y) + f_y(x, y) f(x, y)]
produces a method of order 2. The higher-order methods so obtained, however, are hardly useful, since in every step (x_i, η_i) → (x_{i+1}, η_{i+1}) one must compute not only f, but also the partial derivatives f_x, f_y, etc.
Simpler methods of higher order can be obtained, e.g., by means of the construction
(7.2.1.11)    Φ(x, y; h) := a_1 f(x, y) + a_2 f(x + p_1 h, y + p_2 h f(x, y)),
in which the constants a_1, a_2, p_1, p_2 are so chosen that the Taylor expansion of Δ(x, y; h) - Φ(x, y; h) in powers of h starts with the largest possible power. For Φ(x, y; h) in (7.2.1.11), the Taylor expansion is
    Φ(x, y; h) = (a_1 + a_2) f(x, y) + a_2 h [p_1 f_x(x, y) + p_2 f_y(x, y) f(x, y)] + O(h²).
Comparison with (7.2.1.8) yields for a second-order method the conditions
    a_1 + a_2 = 1,    a_2 p_1 = 1/2,    a_2 p_2 = 1/2.
One solution of these equations is
    a_1 = a_2 = 1/2,    p_1 = 1,    p_2 = 1,
and one obtains the method of Heun (1900):
(7.2.1.12)    Φ(x, y; h) = (1/2) [f(x, y) + f(x + h, y + h f(x, y))],
which requires only two evaluations of f per step. Another solution is
    a_1 = 0,    a_2 = 1,    p_1 = 1/2,    p_2 = 1/2,
leading to the modified Euler method [Collatz (1960)]
(7.2.1.13)    Φ(x, y; h) := f(x + (1/2)h, y + (1/2)h f(x, y)),
which again is of second order and requires two evaluations of f per step.
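Both second-order choices (7.2.1.12) and (7.2.1.13) fit directly into the recursion (7.2.1.4). The following short Python sketch, with names of our own choosing, writes them as increment functions Φ together with a generic one-step driver:

    # Heun's method (7.2.1.12) and the modified Euler method (7.2.1.13),
    # each written as an increment function Phi(x, y; h) for (7.2.1.4).
    def phi_heun(f, x, y, h):
        return 0.5 * (f(x, y) + f(x + h, y + h * f(x, y)))

    def phi_modified_euler(f, x, y, h):
        return f(x + 0.5 * h, y + 0.5 * h * f(x, y))

    def one_step_method(phi, f, x0, y0, h, n):
        # Generic recursion (7.2.1.4): eta_{i+1} = eta_i + h*Phi(x_i, eta_i; h).
        x, eta = x0, y0
        for _ in range(n):
            eta = eta + h * phi(f, x, eta, h)
            x = x + h
        return eta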
The Runge-Kutta method [Runge (1895), Kutta (1901)] is obtained from a construction which is somewhat more general than (7.2.1.11). It has the form
(7.2.1.14)    Φ(x, y; h) := (1/6) [k_1 + 2k_2 + 2k_3 + k_4],
where
    k_1 := f(x, y),
    k_2 := f(x + (1/2)h, y + (1/2)h k_1),
    k_3 := f(x + (1/2)h, y + (1/2)h k_2),
    k_4 := f(x + h, y + h k_3).
Through a simple but rather tedious Taylor expansion in h one finds, for f ∈ F_4(a, b),
    Φ(x, y; h) - Δ(x, y; h) = O(h^4).
The Runge-Kutta method, therefore, is a method of fourth order. It requires four evaluations of f per step.
If f(x, y) does not depend on y, then the solution of the initial-value problem
    y' = f(x),    y(x_0) = y_0
is just the integral y(x) = y_0 + ∫_{x_0}^{x} f(t) dt. The method of Heun then corresponds to the approximation of y(x) by means of trapezoidal sums, the modified Euler method to the midpoint rule, and the Runge-Kutta method to Simpson's rule [see Section 3.1].
All methods of this section are examples of multistage Runge-Kutta methods:
(7.2.1.15) Definition. An s-stage Runge-Kutta method is a one-step method given by a function Φ(x, y; h; f) which is defined by finitely many real numbers c_1, c_2, ..., c_s, α_2, α_3, ..., α_s, and β_{i,j} with 2 ≤ i ≤ s and 1 ≤ j ≤ i - 1 as follows:
    Φ(x, y; h; f) := c_1 k_1 + c_2 k_2 + ... + c_s k_s,
where
    k_1 := f(x, y),
    k_2 := f(x + α_2 h, y + h β_{21} k_1),
    k_3 := f(x + α_3 h, y + h (β_{31} k_1 + β_{32} k_2)),
        ...
    k_s := f(x + α_s h, y + h (β_{s1} k_1 + ... + β_{s,s-1} k_{s-1})).
Following Butcher (1964), a method of this type is designated by the scheme
(7.2.1.16)
    0
    α_2 | β_21
    α_3 | β_31  β_32
     .  |   .          .
    α_s | β_s1  β_s2  ...  β_{s,s-1}
        | c_1   c_2   ...  c_{s-1}   c_s
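The definition (7.2.1.15) translates directly into a small routine that evaluates Φ for a given Butcher scheme (7.2.1.16). The data layout below (lists alpha, beta, c) is our own convention, not notation from the text:

    # Evaluate Phi(x, y; h; f) of an s-stage explicit Runge-Kutta method
    # given by its Butcher scheme (7.2.1.16):
    #   alpha = [alpha_2, ..., alpha_s], beta[i] = [beta_{i+2,1}, ..., beta_{i+2,i+1}],
    #   c = [c_1, ..., c_s].
    def runge_kutta_phi(f, x, y, h, alpha, beta, c):
        k = [f(x, y)]                          # k_1
        for a, b_row in zip(alpha, beta):      # stages k_2, ..., k_s
            increment = sum(b * kj for b, kj in zip(b_row, k))
            k.append(f(x + a * h, y + h * increment))
        return sum(ci * ki for ci, ki in zip(c, k))

    # The classical fourth-order method of (7.2.1.14) in this notation:
    alpha = [0.5, 0.5, 1.0]
    beta = [[0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
    c = [1.0/6, 2.0/6, 2.0/6, 1.0/6]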
Butcher (1964) has analysed these methods systematically. Concrete
examples of such methods of order higher than four were given by him,
by Fehlberg (1964, 1966, 1969), Shanks (1966) and many others. We will
describe some methods of this type in Section 7.2.5.
A general exposition of one-step methods is found in Grigorieff (1972)
and Stetter (1973), and in particular in Hairer, Norsett and Wanner (1993).
7.2.2 Convergence of One-Step Methods
In this section we wish to examine the convergence behavior as h → 0 of an approximate solution η(x; h) furnished by a one-step method. We assume that f ∈ F_1(a, b) and denote by y(x) the exact solution of the initial-value problem (7.2.1.1)
    y' = f(x, y),    y(x_0) = y_0.
Let Φ(x, y; h) define a one-step method,
    η_0 := y_0,
    for i = 0, 1, ...
        η_{i+1} := η_i + h Φ(x_i, η_i; h),
        x_{i+1} := x_i + h,
which for x ∈ R_h := {x_0 + ih | i = 0, 1, 2, ...} produces the approximate solution η(x; h):
    η(x; h) := η_i    if x = x_0 + ih.
We are interested in the behavior of the global discretization error
    e(x; h) := η(x; h) - y(x)
for fixed x and h → 0, h ∈ H_x := {(x - x_0)/n | n = 1, 2, ...}. Since e(x; h), like η(x; h), is only defined for h ∈ H_x, this means a study of the convergence of
    e(x; h_n),    h_n := (x - x_0)/n,    as n → ∞.
We say that the one-step method is convergent if
(7.2.2.1)    lim_{n→∞} e(x; h_n) = 0
for all x ∈ [a, b] and all f ∈ F_1(a, b).
We will see that for f ∈ F_p(a, b) methods of order p > 0 [cf. (7.2.1.9)] are convergent, and even satisfy
    e(x; h_n) = O(h_n^p).
The order of the global discretization error is thus equal to the order of the
local discretization error.
We begin by showing the following:
(7.2.2.2) Lemma. If the numbers ξ_i satisfy estimates of the form
    |ξ_{i+1}| ≤ (1 + δ) |ξ_i| + B,    δ > -1,  B ≥ 0,  i = 0, 1, 2, ...,
then
    |ξ_n| ≤ e^{nδ} |ξ_0| + B (e^{nδ} - 1)/δ.
PROOF. From the assumptions we get immediately
    |ξ_1| ≤ (1 + δ) |ξ_0| + B,
    |ξ_2| ≤ (1 + δ)² |ξ_0| + B(1 + δ) + B,
        ...
    |ξ_n| ≤ (1 + δ)^n |ξ_0| + B [1 + (1 + δ) + (1 + δ)² + ... + (1 + δ)^{n-1}]
          = (1 + δ)^n |ξ_0| + B ((1 + δ)^n - 1)/δ
          ≤ e^{nδ} |ξ_0| + B (e^{nδ} - 1)/δ,
since 0 < 1 + δ ≤ e^δ for δ > -1.    □
With this, we can prove the following main theorem:
(7.2.2.3) Theorem. Consider, for x_0 ∈ [a, b], y_0 ∈ ℝ, the initial-value problem
    y' = f(x, y),    y(x_0) = y_0,
having the exact solution y(x). Let the function Φ be continuous on
    G := {(x, y, h) | a ≤ x ≤ b,  |y - y(x)| ≤ γ,  0 ≤ |h| ≤ h_0},    h_0 > 0,  γ > 0,
and let there exist positive constants M and N such that
(7.2.2.4)    |Φ(x, y_1; h) - Φ(x, y_2; h)| ≤ M |y_1 - y_2|
for all (x, y_i, h) ∈ G, i = 1, 2, and
(7.2.2.5)    |τ(x, y(x); h)| = |Δ(x, y(x); h) - Φ(x, y(x); h)| ≤ N |h|^p,    p > 0,
for all x ∈ [a, b], |h| ≤ h_0. Then there exists an h̄, 0 < h̄ ≤ h_0, such that the global discretization error e(x; h) = η(x; h) - y(x) satisfies
    |e(x; h_n)| ≤ N |h_n|^p (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and all h_n = (x - x_0)/n, n = 1, 2, ..., with |h_n| ≤ h̄. If γ = ∞, then h̄ = h_0.
PROOF. The function
    Φ̃(x, y; h) := Φ(x, y; h)            if (x, y, h) ∈ G,
    Φ̃(x, y; h) := Φ(x, y(x) + γ; h)     if x ∈ [a, b], |h| ≤ h_0, y ≥ y(x) + γ,
    Φ̃(x, y; h) := Φ(x, y(x) - γ; h)     if x ∈ [a, b], |h| ≤ h_0, y ≤ y(x) - γ,
is evidently continuous on Ḡ := {(x, y, h) | x ∈ [a, b], y ∈ ℝ, |h| ≤ h_0} and satisfies the condition
(7.2.2.6)    |Φ̃(x, y_1; h) - Φ̃(x, y_2; h)| ≤ M |y_1 - y_2|
for all (x, y_i, h) ∈ Ḡ, i = 1, 2, and, because Φ̃(x, y(x); h) = Φ(x, y(x); h), also the condition
(7.2.2.7)    |Δ(x, y(x); h) - Φ̃(x, y(x); h)| ≤ N |h|^p    for x ∈ [a, b], |h| ≤ h_0.
Let the one-step method generated by Φ̃ furnish the approximate values η̃_i := η̃(x_i; h) for y_i := y(x_i), x_i := x_0 + ih:
    η̃_{i+1} = η̃_i + h Φ̃(x_i, η̃_i; h).
In view of
    y_{i+1} = y_i + h Δ(x_i, y_i; h),
one obtains for the error ẽ_i := η̃_i - y_i, by subtraction, the recurrence formula
(7.2.2.8)    ẽ_{i+1} = ẽ_i + h [Φ̃(x_i, η̃_i; h) - Φ̃(x_i, y_i; h)] + h [Φ̃(x_i, y_i; h) - Δ(x_i, y_i; h)].
Now from (7.2.2.6), (7.2.2.7) it follows that
    |Φ̃(x_i, η̃_i; h) - Φ̃(x_i, y_i; h)| ≤ M |η̃_i - y_i| = M |ẽ_i|,
    |Δ(x_i, y_i; h) - Φ̃(x_i, y_i; h)| ≤ N |h|^p,
and hence from (7.2.2.8) we have the recursive estimate
    |ẽ_{i+1}| ≤ (1 + |h| M) |ẽ_i| + N |h|^{p+1}.
Lemma (7.2.2.2), since ẽ_0 = η̃_0 - y_0 = 0, gives
(7.2.2.9)    |ẽ_k| ≤ N |h|^p (e^{k|h|M} - 1)/M.
Now let x ∈ [a, b], x ≠ x_0, be fixed and h := h_n = (x - x_0)/n, n > 0 an integer. Then x_n = x_0 + nh = x, and from (7.2.2.9) with k = n, since ẽ(x; h_n) = ẽ_n, it follows at once that
(7.2.2.10)    |ẽ(x; h_n)| ≤ N |h_n|^p (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and h_n with |h_n| ≤ h_0. Since |x - x_0| ≤ |b - a| and γ > 0, there exists an h̄, 0 < h̄ ≤ h_0, such that |ẽ(x; h_n)| ≤ γ for all x ∈ [a, b], |h_n| ≤ h̄, i.e., for the one-step method generated by Φ,
    η_0 = y_0,
    η_{i+1} = η_i + h Φ(x_i, η_i; h),
we have for |h| ≤ h̄, according to the definition of Φ̃,
    η̃_i = η_i,    ẽ_i = e_i,    and    Φ̃(x_i, η̃_i; h) = Φ(x_i, η_i; h).
The assertion of the theorem,
    |e(x; h_n)| ≤ N |h_n|^p (e^{M|x - x_0|} - 1)/M,
thus follows for all x ∈ [a, b] and all h_n = (x - x_0)/n, n = 1, 2, ..., with |h_n| ≤ h̄.    □
From the preceding theorem it follows in particular that methods of order p > 0 which in the neighborhood of the exact solution satisfy a Lipschitz condition of the form (7.2.2.4) are convergent in the sense (7.2.2.1). Observe that the condition (7.2.2.4) is fulfilled, e.g., if (∂/∂y)Φ(x, y; h) exists and is continuous in a domain G of the form stated in the theorem.
Theorem (7.2.2.3) also provides an upper bound for the discretization error, which in principle can be evaluated if one knows M and N. One could use it, e.g., to determine the steplength h which is required to compute y(x) within an error ε, given x and ε > 0. Unfortunately, in practice this is doomed by the fact that the constants M and N are not easily accessible, since an estimation of M and N is only possible via estimates of higher derivatives of f. Already for the simple Euler method, Φ(x, y; h) := f(x, y), e.g., one has [see (7.2.1.8) f.]
    N ≈ (1/2) |f_x(x, y(x)) + f_y(x, y(x)) f(x, y(x))|,    M ≈ |∂Φ/∂y| = |f_y(x, y)|.
For the Runge-Kutta method, one would already have to estimate derivatives of f of the fourth order.
7.2.3 Asymptotic Expansions for the Global Discretization
Error of One-Step Methods
It may be conjectured from Theorem (7.2.2.3) that the approximate solution η(x; h), furnished by a method of order p, possesses an asymptotic expansion in powers of h of the form
(7.2.3.1)    η(x; h) = y(x) + e_p(x) h^p + e_{p+1}(x) h^{p+1} + ...
for all h = h_n = (x - x_0)/n, n = 1, 2, ..., with certain coefficient functions e_k(x), k = p, p + 1, ..., that are independent of h. This is indeed true for general one-step methods of order p, provided only that Φ(x, y; h) and f satisfy certain additional regularity conditions. One has [see Gragg (1963)]
(7.2.3.2) Theorem. Let f(x, y) ∈ F_{N+1}(a, b) [cf. (7.1.3)] and let η(x; h) be the approximate solution obtained by a one-step method of order p, p ≤ N, to the solution y(x) of the initial value problem
    y' = f(x, y),    y(x_0) = y_0,    x_0 ∈ [a, b].
Then η(x; h) has an asymptotic expansion of the form
(7.2.3.3)    η(x; h) = y(x) + e_p(x) h^p + e_{p+1}(x) h^{p+1} + ... + e_N(x) h^N + E_{N+1}(x; h) h^{N+1},    with e_k(x_0) = 0 for k ≥ p,
which is valid for all x ∈ [a, b] and all h = h_n = (x - x_0)/n, n = 1, 2, .... The functions e_i(x) therein are differentiable and independent of h, and the remainder term E_{N+1}(x; h) is bounded for fixed x and all h = h_n = (x - x_0)/n, n = 1, 2, ...: sup_n |E_{N+1}(x; h_n)| < ∞.
PROOF. The following elegant proof is due to Hairer and Lubich (1984). Suppose that the one-step method is given by Φ(x, y; h). Since the method has order p, and f ∈ F_{N+1}, there follows [see Section 7.2.1]
    y(x + h) - y(x) - h Φ(x, y; h) = d_{p+1}(x) h^{p+1} + ... + d_{N+1}(x) h^{N+1} + O(h^{N+2}).
First, we use only
    y(x + h) - y(x) - h Φ(x, y; h) = d_{p+1}(x) h^{p+1} + O(h^{p+2})
and show that there is a differentiable function e_p(x) such that
    η(x; h) - y(x) = e_p(x) h^p + O(h^{p+1}),    e_p(x_0) = 0.
To this end, we consider the function
    η̂(x; h) := η(x; h) - e_p(x) h^p,
where the choice of e_p is still left open. It is easy to see that η̂ can be considered as the result of another one-step method,
    η̂(x + h; h) = η̂(x; h) + h Φ̂(x, η̂(x; h); h),
if Φ̂ is defined by
    Φ̂(x, y; h) := Φ(x, y + e_p(x) h^p; h) - (e_p(x + h) - e_p(x)) h^{p-1}.
By Taylor expansion with respect to h, we find
    y(x + h) - y(x) - h Φ̂(x, y; h) = [d_{p+1}(x) - f_y(x, y(x)) e_p(x) + e_p'(x)] h^{p+1} + O(h^{p+2}).
Hence, the one-step method belonging to Φ̂ has order p + 1 if e_p is taken as the solution of the initial value problem
    e_p'(x) = f_y(x, y(x)) e_p(x) - d_{p+1}(x),    e_p(x_0) = 0.
With this choice of e_p, Theorem (7.2.2.3) applied to Φ̂ then shows that
    η̂(x; h) - y(x) = η(x; h) - y(x) - e_p(x) h^p = O(h^{p+1}).
A repetition of these arguments with Φ̂ in place of Φ completes the proof of the theorem.    □
Asymptotic laws of the type (7.2.3.1) or (7.2.3.3) are significant in practice for two reasons. In the first place, one can use them to estimate the global discretization error e(x; h). Suppose the method of order p has an asymptotic expansion of the form (7.2.3.1), so that
    e(x; h) = η(x; h) - y(x) = e_p(x) h^p + O(h^{p+1}).
Having found the approximate value η(x; h) with stepsize h, one computes for the same x, but with another stepsize (say h/2), the approximation η(x; h/2). For sufficiently small h [and e_p(x) ≠ 0] one then has in first approximation
(7.2.3.4)    η(x; h) - y(x) ≈ e_p(x) h^p,
(7.2.3.5)    η(x; h/2) - y(x) ≈ e_p(x) (h/2)^p.
Subtracting the second equation from the first gives
    e_p(x) (h/2)^p ≈ (η(x; h) - η(x; h/2)) / (2^p - 1),
and one obtains, by substitution in (7.2.3.5),
(7.2.3.6)    η(x; h/2) - y(x) ≈ (η(x; h) - η(x; h/2)) / (2^p - 1).
For the Runge-Kutta method one has p = 4 and obtains the frequently used formula
    η(x; h/2) - y(x) ≈ (η(x; h) - η(x; h/2)) / 15.
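For illustration, (7.2.3.6) can be turned into a small step-doubling error estimate. The sketch below, with names of our own choosing, uses the classical fourth-order Runge-Kutta method (p = 4) as the underlying one-step method:

    # Estimate eta(x; h/2) - y(x) via (7.2.3.6) for a one-step method of order p.
    def rk4_step(f, x, y, h):
        k1 = f(x, y)
        k2 = f(x + h/2, y + h/2 * k1)
        k3 = f(x + h/2, y + h/2 * k2)
        k4 = f(x + h, y + h * k3)
        return y + h/6 * (k1 + 2*k2 + 2*k3 + k4)

    def integrate(f, x0, y0, x, n):
        h = (x - x0) / n
        y = y0
        for i in range(n):
            y = rk4_step(f, x0 + i * h, y, h)
        return y

    def estimated_error(f, x0, y0, x, n, p=4):
        eta_h  = integrate(f, x0, y0, x, n)        # stepsize h = (x - x0)/n
        eta_h2 = integrate(f, x0, y0, x, 2 * n)    # stepsize h/2
        return (eta_h - eta_h2) / (2**p - 1)       # approximates eta(x; h/2) - y(x)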
The other, more important, significance of asymptotic expansions lies in the fact that they justify the application of extrapolation methods [see Sections 3.4 and 3.5]. Since a little later [see (7.2.12.7) f.] we will get to know a discretization method for which the asymptotic expansion of η(x; h) contains only even powers of h and which, therefore, is more suitable for extrapolation algorithms [see Section 3.5] than Euler's method, we defer the description of extrapolation algorithms to Section 7.2.14.
7.2.4 The Influence of Rounding Errors in One-Step Methods
If a one-step method
(7.2.4.1)    η_0 := y_0;
    for i = 0, 1, ...
        η_{i+1} := η_i + h Φ(x_i, η_i; h),
        x_{i+1} := x_i + h
is executed in floating-point arithmetic (t decimal digits) with relative precision eps = 5 × 10^{-t}, then instead of the η_i one obtains other numbers η̄_i, which satisfy a recurrence formula of the form
(7.2.4.2)    η̄_0 := y_0;
    for i = 0, 1, ...
        c_i := fl(Φ(x_i, η̄_i; h)),
        d_i := fl(h c_i),
        η̄_{i+1} := fl(η̄_i + d_i) = η̄_i + h Φ(x_i, η̄_i; h) + ε_{i+1},
where the total absolute rounding error ε_{i+1}, in first approximation, is made up of three components,
    ε_{i+1} ≈ h Φ(x_i, η̄_i; h) (α_{i+1} + μ_{i+1}) + η̄_{i+1} σ_{i+1}.
Here
    α_{i+1} = (fl(Φ(x_i, η̄_i; h)) - Φ(x_i, η̄_i; h)) / Φ(x_i, η̄_i; h)
is the relative rounding error committed in the floating-point computation of Φ, μ_{i+1} the relative rounding error committed in the computation of the product h c_i, and σ_{i+1} the relative rounding error which occurs in the addition η̄_i + d_i. Normally, in practice, the stepsize h is so small that |h Φ(x_i, η̄_i; h)| ≪ |η̄_i|, and if |α_{i+1}| ≤ eps and |μ_{i+1}| ≤ eps, one thus has ε_{i+1} ≈ η̄_{i+1} σ_{i+1}, i.e., the influence of rounding errors is determined primarily by the addition error σ_{i+1}.
Remark. It is natural, therefore, to reduce the influence of rounding errors by carrying out the addition in double precision (2t decimal places). Denoting by fl_2(a + b) a double-precision addition, by η̂_i a double-precision number (2t decimal places), and by η̄_i := rd_1(η̂_i) the number η̂_i rounded to single precision, the algorithm, instead of (7.2.4.2), now runs as follows:
(7.2.4.3)    η̂_0 := y_0;
    for i = 0, 1, ...
        η̄_i := rd_1(η̂_i),
        c_i := fl(Φ(x_i, η̄_i; h)),
        d_i := fl(h c_i),
        η̂_{i+1} := fl_2(η̂_i + d_i).
Let us now briefly estimate the total influence of all rounding errors ε_i. For this, let y_i = y(x_i) be the values of the exact solution of the initial-value problem, η_i = η(x_i; h) the discrete solutions produced by the one-step method (7.2.4.1) in exact arithmetic, and finally η̄_i the approximate values of η_i actually obtained in t-digit floating-point arithmetic. The latter satisfy relations of the form
(7.2.4.4)    η̄_0 := y_0;
    for i = 0, 1, ...
        η̄_{i+1} := η̄_i + h Φ(x_i, η̄_i; h) + ε_{i+1}.
For simplicity, we also assume
    |ε_i| ≤ ε    for all i.
We assume further that Φ satisfies a Lipschitz condition of the form (7.2.2.4),
    |Φ(x, y_1; h) - Φ(x, y_2; h)| ≤ M |y_1 - y_2|.
Then, for the error r(x_i; h) := r_i := η̄_i - η_i, there follows by subtraction of (7.2.4.1) from (7.2.4.4)
    r_{i+1} = r_i + h (Φ(x_i, η̄_i; h) - Φ(x_i, η_i; h)) + ε_{i+1},
and thus
(7.2.4.5)    |r_{i+1}| ≤ (1 + |h| M) |r_i| + ε.
Since r_0 = 0, Lemma (7.2.2.2) gives
    |r(x; h)| ≤ (ε/|h|) (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and h = h_n = (x - x_0)/n, n = 1, 2, .... It follows, therefore, that for a method of order p the total error
    v(x; h) := η̄(x; h) - y(x) = r(x; h) + e(x; h),
under the assumptions of Theorem (7.2.2.3), obeys the estimate
(7.2.4.6)    |v(x; h)| ≤ [N |h|^p + ε/|h|] (e^{M|x - x_0|} - 1)/M
for all x ∈ [a, b] and for all sufficiently small h := h_n = (x - x_0)/n.
This formula reveals that, on account of the influence of rounding errors, the total error v(x; h) begins to increase again once h is reduced beyond a certain critical value. The following table shows this behavior. For the initial-value problem
    y' = -200 x y²,    y(-1) = 1/101,
with exact solution y(x) = 1/(1 + 100x²), an approximate value η(0; h) for y(0) = 1 has been computed by the Runge-Kutta method in 12-digit arithmetic:
    h               v(0; h)               h               v(0; h)
    10^-2           -0.276 × 10^-4        10^-4           -0.478 × 10^-6
    0.5 × 10^-2     -0.178 × 10^-5        0.5 × 10^-4     -0.711 × 10^-6
    10^-3           -0.229 × 10^-7        10^-5           -0.227 × 10^-5
    0.5 × 10^-3     -0.192 × 10^-7
The appearance of the term ε/|h| in (7.2.4.6) becomes plausible if one considers that the number of steps to get from x_0 to x, using steplength h, is just (x - x_0)/h and that all essentially independent rounding errors were assumed equal to ε. Nevertheless, the estimate is much too coarse to be practically significant.
7.2.5 Practical Implementation of One-Step Methods
In practice, initial-value problems present themselves mostly in the following form: What is desired is the value which the exact solution y(x) assumes for a certain x ≠ x_0. It is tempting to compute this solution approximately by means of a one-step method in a single step, i.e., choosing stepsize h̄ = x - x_0. For large x - x_0 this of course leads to a large discretization error e(x; h̄); the choice made for h̄ would be entirely inadequate. Normally, therefore, one will introduce suitable intermediate points x_i, i = 1, ..., k - 1, x_0 < x_1 < ... < x_k = x, and, beginning with x_0, y_0 = y(x_0), compute successive approximate values η_i of y(x_i): Having determined an approximation η_i of y(x_i), one computes η_{i+1} by applying a step of the method with stepsize h_i := x_{i+1} - x_i,
    η_{i+1} := η_i + h_i Φ(x_i, η_i; h_i).
There again arises, however, the problem of how the stepsizes h_i are to be chosen. Since the amount of work involved in the method is proportional to the number of individual steps, one will attempt to choose the stepsizes h_i as large as possible. On the other hand, they must not be chosen too large if one wants to keep the discretization error small. In principle, one has the following problem: For given x_0, y_0, determine a stepsize h as large as possible, but such that the discretization error e(x_0 + h; h) after one step with this stepsize still remains below a certain tolerance ε. This tolerance ε should not be selected smaller than K·eps, i.e., ε ≥ K·eps, where eps is the relative machine precision and K a bound for the solution y(x) in the region under consideration,
    K ≈ max { |y(x)| : x ∈ [x_0, x_0 + h] }.
A choice of h corresponding to ε = K·eps then guarantees that the approximate solution η(x_0 + h; h) obtained agrees with the exact solution y(x_0 + h) to machine precision. Such a stepsize h = h(ε), with
    |e(x_0 + h; h)| ≤ ε,    ε ≥ K·eps,
can be found approximately with the methods of Section 7.2.3: For a
method of order p one has in first approximation
(7.2.5.1)    e(x_0 + h; h) ≈ e_p(x_0 + h) h^p.
Now, by (7.2.3.3), e_p(x_0) = 0; thus in first approximation,
(7.2.5.2)    e(x_0 + h; h) ≈ e_p'(x_0) h^{p+1}.
Therefore |e(x_0 + h; h)| ≤ ε will hold if
(7.2.5.3)    |e_p'(x_0)| |h|^{p+1} ≤ ε.
If we know e_p'(x_0), we can compute from this the appropriate value of h. An approximate value for e_p'(x_0), however, can be obtained from (7.2.3.6). Using first the stepsize H to compute η(x_0 + H; H) and η(x_0 + H; H/2), one then has by (7.2.3.6)
(7.2.5.4)    e(x_0 + H; H/2) ≈ (η(x_0 + H; H) - η(x_0 + H; H/2)) / (2^p - 1).
By (7.2.5.1), (7.2.5.2), on the other hand,
    e(x_0 + H; H/2) ≈ e_p(x_0 + H) (H/2)^p ≈ e_p'(x_0) H (H/2)^p.
From (7.2.5.4) thus follows the estimate
    e_p'(x_0) ≈ (2^p / (H^{p+1} (2^p - 1))) [η(x_0 + H; H) - η(x_0 + H; H/2)].
Equation (7.2.5.3) therefore yields for h the formula
(7.2.5.5)    h ≈ H ( (2^p - 1) ε / (2^p |η(x_0 + H; H) - η(x_0 + H; H/2)|) )^{1/(p+1)},
which can be used in the following way: Choose a stepsize H; compute η(x_0 + H; H), η(x_0 + H; H/2), and h from (7.2.5.5). If H/h ≫ 2, then by (7.2.5.4) the error e(x_0 + H; H/2) is much larger than the prescribed ε. It is expedient, therefore, to reduce H. Replace H by 2h; with the new H compute again η(x_0 + H; H), η(x_0 + H; H/2), and from (7.2.5.5) the corresponding h, until finally |H/h| ≤ 2. Once this is the case, one accepts η(x_0 + H; H/2) as an approximation for y(x_0 + H) and proceeds to the next integration step, replacing x_0, y_0, and H by the new starting values x_0 + H, η(x_0 + H; H/2), and 2h [see Fig. 15 for an illustration].
EXAMPLE. We consider the initial-value problem
    y' = -200 x y²,    y(-3) = 1/901,
with the exact solution y(x) = 1/(1 + 100x²). An approximate value η for y(0) = 1 was computed using the Runge-Kutta method and the "step control" procedure just discussed. The criterion H/h ≫ 2 in Fig. 15 was replaced by the test H/h ≥ 3. Computation in 12 digits yields:

    η - y(0)              Number of required        Smallest stepsize H
                          Runge-Kutta steps         observed
    -0.13585 × 10^-6      1476                      0.1226... × 10^-2

For fixed stepsize h the Runge-Kutta method yields:

    h                               η(0; h) - y(0)       Number of Runge-Kutta steps
    3/1476 = 0.2032... × 10^-2      -0.5594 × 10^-6      1476
    0.1226... × 10^-2               -0.4052 × 10^-6      3/(0.1226... × 10^-2) ≈ 2446

A fixed choice of stepsize thus yields worse results with the same, or even greater, computational effort. In the first case, the stepsize h = 0.2032... × 10^-2 is probably too large in the "critical" region near 0; the discretization error becomes too large. The stepsize h = 0.1226... × 10^-2, on the other hand, will be too small in the "harmless" region from -3 to close to 0; one takes unnecessarily many steps and thereby commits more rounding errors.
Our discussion shows that in order to be able to make statements about the size of an approximately optimal steplength h, there must be available two approximate values η(x_0 + H; H) and η(x_0 + H; H/2) - obtained from the same discretization method - for the exact solution y(x_0 + H).

Fig. 15. A stepsize control method: starting from x_0, y_0 and a trial stepsize H, compute η(x_0 + H; H) and η(x_0 + H; H/2) and determine h from (7.2.5.5); if H/h ≫ 2, replace H by 2h and repeat, otherwise accept x_0 := x_0 + H, y_0 := η(x_0 + H; H/2), H := 2h.
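The loop of Fig. 15 can be sketched as follows. This is only an illustration under the notation of this section: integrate(f, x0, y0, H, m) stands for any fixed-step one-step method of order p carrying out m steps of size H/m, and the names and the acceptance test H/h ≥ 3 (as in the example above) are our own choices.

    # Stepsize control in the sense of Fig. 15 / (7.2.5.5) for a method of order p.
    def controlled_step(integrate, f, x0, y0, H, eps, p):
        while True:
            eta_H  = integrate(f, x0, y0, H, 1)      # eta(x0 + H; H)
            eta_H2 = integrate(f, x0, y0, H, 2)      # eta(x0 + H; H/2)
            diff = abs(eta_H - eta_H2)
            if diff == 0.0:
                return x0 + H, eta_H2, 2.0 * H       # no visible error: enlarge the step
            h = H * ((2**p - 1) * eps / (2**p * diff)) ** (1.0 / (p + 1))   # (7.2.5.5)
            if H / h >= 3.0:                         # H/h >> 2: reduce H and try again
                H = 2.0 * h
            else:
                return x0 + H, eta_H2, 2.0 * h       # accept eta(x0 + H; H/2), next H := 2h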
Efficient methods for controlling the stepsize, however, can be obtained in still another manner. Instead of comparing two approximations (with different H) from the same discretization method, one takes, following an idea of Fehlberg's, two approximations (with the same H) which originate from a pair Φ_I, Φ_II of different discretization methods of Runge-Kutta type of orders p and p + 1. This gives rise to so-called Runge-Kutta-Fehlberg methods, which have the form
(7.2.5.6)    ŷ_{i+1} = ŷ_i + h Φ_I(x_i, ŷ_i; h),
             y_{i+1} = ŷ_i + h Φ_II(x_i, ŷ_i; h),
where [cf. (7.2.1.15)]
(7.2.5.7)    Φ_I(x, y; h) = Σ_{k=0}^{p} c_k f_k(x, y; h),
             Φ_II(x, y; h) = Σ_{k=0}^{p+1} ĉ_k f_k(x, y; h)
and
(7.2.5.8)    f_k := f_k(x, y; h) = f(x + α_k h, y + h Σ_{l=0}^{k-1} β_{kl} f_l),    k = 0, 1, ..., p + 1.
Note that in the computation of Φ_II the function values f_0, f_1, ..., f_p used in Φ_I are again utilized, so that only one additional evaluation, f_{p+1}, is required. The methods Φ_I, Φ_II are therefore called embedded methods. The constants α_k, β_{kl}, c_k and ĉ_k are determined so that the methods have orders p and p + 1,
(7.2.5.9a)    Δ(x, y(x); h) - Φ_I(x, y(x); h) = O(h^p),
              Δ(x, y(x); h) - Φ_II(x, y(x); h) = O(h^{p+1}),
and the additional relations
(7.2.5.9b)    α_k = Σ_{j=0}^{k-1} β_{kj},    k = 0, 1, ..., p + 1,
are satisfied.
In principle, suitable coefficients can be determined as in (7.2.1.11). This leads, in general, to a complicated system of nonlinear equations. For this reason we discuss only the special case p = 2, which leads to embedded methods of orders 2 and 3. For p = 2, conditions (7.2.5.9) yield the equations
    Σ_{k=0}^{2} c_k - 1 = 0,    Σ_{k=1}^{2} α_k c_k - 1/2 = 0,
    Σ_{k=0}^{3} ĉ_k - 1 = 0,    Σ_{k=1}^{3} α_k ĉ_k - 1/2 = 0,    Σ_{k=1}^{3} α_k² ĉ_k - 1/3 = 0,
    Σ_{k=2}^{3} ( Σ_{l=1}^{k-1} β_{kl} α_l ) ĉ_k - 1/6 = 0.
This system of equations admits infinitely many solutions. One can therefore impose additional requirements to enhance economy: for example, make the value f_3 from the ith step reusable as f_0 in the (i + 1)st step. Since
    f_3 = f(x + α_3 h, y + h(β_30 f_0 + β_31 f_1 + β_32 f_2))
and
    f_0^{new} = f(x + h, y + h Φ_I(x, y; h)) = f(x + h, y + h(c_0 f_0 + c_1 f_1 + c_2 f_2)),
this implies
    α_3 = 1,    β_30 = c_0,    β_31 = c_1,    β_32 = c_2.
Further requirements can be made concerning the magnitudes of the coefficients in the error terms Δ - Φ_I, Δ - Φ_II. However, we will not pursue this any further and refer, instead, to the papers of Fehlberg (1964, 1966, 1969).
For the set of coefficients, one then finds

    k = 0:  α_0 = 0
    k = 1:  α_1 = 1/4;    β_10 = 1/4
    k = 2:  α_2 = 27/40;  β_20 = -189/800,  β_21 = 729/800
    k = 3:  α_3 = 1;      β_30 = 214/891,   β_31 = 1/33,   β_32 = 650/891

    c_k  (order 2):  c_0 = 214/891,   c_1 = 1/33,  c_2 = 650/891
    ĉ_k  (order 3):  ĉ_0 = 533/2106,  ĉ_1 = 0,     ĉ_2 = 800/1053,  ĉ_3 = -1/78
In applications, however, methods of higher order are more interesting. Dormand and Prince (1980) found the following coefficients for a pair of embedded methods of orders p = 4 and 5 (DOPRI 5(4)):

    k = 0:  α_0 = 0
    k = 1:  α_1 = 1/5;    β_10 = 1/5
    k = 2:  α_2 = 3/10;   β_20 = 3/40,         β_21 = 9/40
    k = 3:  α_3 = 4/5;    β_30 = 44/45,        β_31 = -56/15,      β_32 = 32/9
    k = 4:  α_4 = 8/9;    β_40 = 19372/6561,   β_41 = -25360/2187, β_42 = 64448/6561,  β_43 = -212/729
    k = 5:  α_5 = 1;      β_50 = 9017/3168,    β_51 = -355/33,     β_52 = 46732/5247,  β_53 = 49/176,      β_54 = -5103/18656
    k = 6:  α_6 = 1;      β_60 = 35/384,       β_61 = 0,           β_62 = 500/1113,    β_63 = 125/192,     β_64 = -2187/6784,  β_65 = 11/84

    c_k  (order 4):  5179/57600,  0,  7571/16695,  393/640,  -92097/339200,  187/2100,  1/40
    ĉ_k  (order 5):  35/384,      0,  500/1113,    125/192,  -2187/6784,     11/84,     0

The coefficients are such that the error term of the higher order method Φ_II is minimal.
Step control is now accomplished as follows: Consider the difference y_{i+1} - ŷ_{i+1}. From (7.2.5.6) it follows that
(7.2.5.10)    ŷ_{i+1} - y_{i+1} = h [Φ_I(x_i, ŷ_i; h) - Φ_II(x_i, ŷ_i; h)],
and from (7.2.5.9a), up to higher order terms,
(7.2.5.11)    Φ_I(x_i, ŷ_i; h) - Δ(x_i, ŷ_i; h) ≈ h^p C_I(x_i),
              Φ_II(x_i, ŷ_i; h) - Δ(x_i, ŷ_i; h) ≈ h^{p+1} C_II(x_i).
Hence for small |h|
(7.2.5.12)    ŷ_{i+1} - y_{i+1} ≈ h^{p+1} C_I(x_i).
Suppose the integration from x_i to x_{i+1} was successful, i.e., for given error tolerance ε > 0, it achieved
    |C_I(x_i) h^{p+1}| ≤ ε.
Neglecting the terms of higher order, one then has also
    |ŷ_{i+1} - y_{i+1}| ≤ ε.
If we wish the "new" step size h_new = x_{i+2} - x_{i+1} to be successful, we must have
    |C_I(x_{i+1}) h_new^{p+1}| ≤ ε.
Now, up to errors of the first order, we have
    C_I(x_{i+1}) ≈ C_I(x_i).
For C_I(x_i), however, one has from (7.2.5.12) the approximation
    C_I(x_i) ≈ (ŷ_{i+1} - y_{i+1}) / h^{p+1}.
This yields the approximate relation
    |ŷ_{i+1} - y_{i+1}| |h_new/h|^{p+1} ≤ ε,
which can be used to estimate the new stepsize,
(7.2.5.13)    h_new ≈ h |ε / (ŷ_{i+1} - y_{i+1})|^{1/(p+1)}.
Step-size control according to this formula is relatively cheap: The computation of y_{i+1} and ŷ_{i+1} in (7.2.5.13) requires only p + 1 evaluations of the right-hand side f(x, y) of the differential equation. Compare this with the analogous relation (7.2.5.5): Since the computation of η(x_0 + H; H/2) necessitates further evaluations of f (three times as many, altogether), step-size control according to (7.2.5.13) is more efficient than according to (7.2.5.5).
We remark that many authors, based on extensive numerical experimentation, recommend a modification of (7.2.5.13), viz.,
(7.2.5.14)    h_new ≈ a h |ε h / (ŷ_{i+1} - y_{i+1})|^{1/p},
where a is a suitable adjustment factor: a ≈ 0.9.
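As an illustration of how an embedded pair is used in practice, the following schematic step applies the estimate (7.2.5.13) with the safety factor of the remark above. Here phi_low and phi_high stand for the two increment functions Φ_I, Φ_II of orders p and p + 1; all names, the acceptance test, and the choice to propagate the higher-order value are our own assumptions, not prescriptions from the text.

    # One adaptive step with an embedded pair of orders p and p+1, cf. (7.2.5.13)/(7.2.5.14).
    def embedded_step(phi_low, phi_high, f, x, y, h, eps, p, safety=0.9):
        y_low  = y + h * phi_low(f, x, y, h)     # order p   (y_hat in the text)
        y_high = y + h * phi_high(f, x, y, h)    # order p+1
        err = abs(y_low - y_high)
        if err == 0.0:
            return x + h, y_high, 2.0 * h        # error invisible: enlarge the step
        h_new = safety * h * (eps / err) ** (1.0 / (p + 1))   # cf. (7.2.5.13)
        if err <= eps:
            return x + h, y_high, h_new          # step accepted
        return x, y, h_new                       # step rejected: retry with h_new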
In the context of the graphical representation and the assessment of the solution of an initial value problem the following difficulty arises: A Runge-Kutta method generates approximations y_i for the solution only at discrete points x_i, i = 0, 1, ..., that are determined by the stepsize control mechanism. These data are usually too coarse for a good graphical representation or for detecting irregularities in the solution. On the other hand, an artificial restriction of the stepsize |h| ≤ h_max within the stepsize control decreases the efficiency. A solution to this dilemma is to construct from the data x_i, y_i, i ≥ 0, a continuous approximation for the solution at all intermediate points x(θ) := x_i + θh, 0 ≤ θ ≤ 1, between x_i and x_{i+1}. Here, of course, h := x_{i+1} - x_i. This leads to continuous Runge-Kutta methods.
For the DOPRI 5(4)-method introduced above this is achieved by the representation (here x and y stand for x_i and y_i, respectively):
    y(x + θh) := y + h Σ_{k=0}^{5} c_k(θ) f_k,
where the f_k are determined as in DOPRI 5(4). The following weights c_k(θ), depending on θ, guarantee a method of order four:
    c_0(θ) := θ(1 + θ(-1337/480 + θ(1039/360 + θ(-1163/1152)))),
    c_1(θ) := 0,
    c_2(θ) := 100 θ²(1054/9275 + θ(-4682/27825 + θ(379/5565)))/3,
    c_3(θ) := -5 θ²(27/40 + θ(-9/5 + θ(83/96)))/2,
    c_4(θ) := 18225 θ²(-3/250 + θ(22/375 + θ(-37/600)))/848,
    c_5(θ) := -22 θ²(-3/10 + θ(29/30 + θ(-17/24)))/7.
For θ = 1 the weights c_k(θ) agree with the coefficients of the fifth-order formula of DOPRI 5(4).
7.2.6 Multistep Methods: Examples
In a multistep method for the solution of the initial-value problem
    y' = f(x, y),    y(x_0) = y_0,
one computes an approximate value η_{j+r} of y(x_{j+r}) from r given approximate values η_k of y(x_k), k = j, j + 1, ..., j + r - 1, at equidistant points x_k = x_0 + kh:
(7.2.6.1)    η_j, η_{j+1}, ..., η_{j+r-1} → η_{j+r}    for j = 0, 1, 2, ....
To initiate such methods, it is of course necessary that r starting values η_0, η_1, ..., η_{r-1} be at our disposal; these must be obtained by other means, e.g., with the aid of a one-step method.
We begin by introducing some examples of multistep methods. A class of such methods can be derived from the formula
(7.2.6.2)    y(x_{p+k}) - y(x_{p-j}) = ∫_{x_{p-j}}^{x_{p+k}} f(t, y(t)) dt,
which is obtained by integrating the identity y'(x) = f(x, y(x)). As in the derivation of the Newton-Cotes formulas [see Section 3.1], one now replaces the integrand in (7.2.6.2) by an interpolating polynomial P_q(x) with
(1)  deg P_q(x) ≤ q,
(2)  P_q(x_k) = f(x_k, y(x_k)),    k = p, p - 1, ..., p - q.
We assume here that the x_i are equidistant and h := x_{i+1} - x_i. With the abbreviation y_i := y(x_i), and using the Lagrange interpolation formula (2.1.1.4),
    P_q(x) = Σ_{i=0}^{q} f(x_{p-i}, y_{p-i}) L_i(x),    L_i(x) := Π_{l=0, l≠i}^{q} (x - x_{p-l}) / (x_{p-i} - x_{p-l}),
one obtains the approximate formula
(7.2.6.3)    y_{p+k} - y_{p-j} ≈ h Σ_{i=0}^{q} β_{qi} f(x_{p-i}, y_{p-i})
with
(7.2.6.4)    β_{qi} := (1/h) ∫_{x_{p-j}}^{x_{p+k}} L_i(x) dx = ∫_{-j}^{k} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Replacing in (7.2.6.3) the y_i by approximate values η_i and ≈ by the equality sign, one obtains the formula
    η_{p+k} = η_{p-j} + h Σ_{i=0}^{q} β_{qi} f_{p-i},    f_{p-i} := f(x_{p-i}, η_{p-i}).
For different choices of k, j, and q, one obtains different multistep methods.
For k = 1, j = 0, and q = 0, 1, 2, ..., one obtains the Adams-Bashforth methods:
(7.2.6.5)    η_{p+1} = η_p + h [β_{q0} f_p + β_{q1} f_{p-1} + ... + β_{qq} f_{p-q}]
with
    β_{qi} := ∫_0^1 Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Comparison with (7.2.6.1) shows that here r = q + 1. A few numerical values:

    β_{qi}        i = 0    1       2      3      4
    β_{0i}        1
    2 β_{1i}      3       -1
    12 β_{2i}     23      -16     5
    24 β_{3i}     55      -59     37     -9
    720 β_{4i}    1901    -2774   2616   -1274  251
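For illustration, the row 24 β_{3i} of this table gives the fourth-order Adams-Bashforth formula η_{p+1} = η_p + (h/24)[55 f_p - 59 f_{p-1} + 37 f_{p-2} - 9 f_{p-3}]; a small Python sketch (naming is ours) is:

    # One Adams-Bashforth step of order 4 (q = 3), coefficients 55, -59, 37, -9.
    def adams_bashforth4_step(h, eta_p, f_values):
        # f_values = [f_p, f_{p-1}, f_{p-2}, f_{p-3}]
        fp, fp1, fp2, fp3 = f_values
        return eta_p + h / 24.0 * (55.0 * fp - 59.0 * fp1 + 37.0 * fp2 - 9.0 * fp3)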
For k = 0, j = 1 and q = 0, 1, 2, ..., one obtains the Adams-Moulton formulas:
    η_p = η_{p-1} + h [β_{q0} f_p + β_{q1} f_{p-1} + ... + β_{qq} f_{p-q}].
Replacing p by p + 1 gives
(7.2.6.6)    η_{p+1} = η_p + h [β_{q0} f(x_{p+1}, η_{p+1}) + β_{q1} f_p + ... + β_{qq} f_{p+1-q}]
with
    β_{qi} := ∫_{-1}^{0} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
At first sight it appears that (7.2.6.6) no longer has the form (7.2.6.1), since η_{p+1} occurs on both the left-hand and the right-hand side of the equation in (7.2.6.6). Thus for given η_p, η_{p-1}, ..., η_{p+1-q}, equation (7.2.6.6) represents, in general, a nonlinear equation for η_{p+1}, and the Adams-Moulton method is an implicit method. The following iterative method for determining η_{p+1} suggests itself naturally:
(7.2.6.7)    η_{p+1}^{(i+1)} = η_p + h [β_{q0} f(x_{p+1}, η_{p+1}^{(i)}) + β_{q1} f_p + ... + β_{qq} f_{p+1-q}],    i = 0, 1, 2, ....
This iteration has the form η_{p+1}^{(i+1)} = Ψ(η_{p+1}^{(i)}), and with the methods of Section 5.2 it can easily be shown that for sufficiently small |h| the mapping z → Ψ(z) is contractive [see Exercise 10] and thus has a fixed point η_{p+1} = Ψ(η_{p+1}), which solves (7.2.6.6). This solution of course depends on x_p, η_p, η_{p-1}, ..., η_{p+1-q} and h, and the Adams-Moulton method therefore is indeed a multistep method of the type (7.2.6.1). Here r = q.
For given η_p, η_{p-1}, ..., η_{p+1-q} a good initial value η_{p+1}^{(0)} for the iteration (7.2.6.7) can be found, e.g., with the aid of the Adams-Bashforth method (7.2.6.5). For this reason, one also calls explicit methods like the Adams-Bashforth method predictor methods, and implicit methods like the Adams-Moulton method corrector methods [through the iteration (7.2.6.7) one "corrects" η_{p+1}^{(i)}].
A few numerical values for the coefficients occurring in (7.2.6.6):

    β_{qi}        i = 0    1      2      3     4
    β_{0i}        1
    2 β_{1i}      1        1
    12 β_{2i}     5        8      -1
    24 β_{3i}     9        19     -5     1
    720 β_{4i}    251      646    -264   106   -19
In the method of Nystrom one chooses k = 1, j = 1 in (7.2.6.2) and thus obtains
(7.2.6.8)    η_{p+1} = η_{p-1} + h [β_{q0} f_p + β_{q1} f_{p-1} + ... + β_{qq} f_{p-q}]
with
    β_{qi} := ∫_{-1}^{1} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
This is a predictor method, which obviously has the form (7.2.6.1) with r = q + 1.
Worthy of note is the special case q = 0. Here β_{00} = ∫_{-1}^{1} 1 ds = 2, and (7.2.6.8) becomes
(7.2.6.9)    η_{p+1} = η_{p-1} + 2h f_p.
This is the so-called midpoint rule, which corresponds in the approximation of an integral to "rectangular sums".
The methods of Milne are corrector methods. They are obtained from (7.2.6.2) by taking k = 0, j = 2, and by replacing p with p + 1:
(7.2.6.10)    η_{p+1} = η_{p-1} + h [β_{q0} f(x_{p+1}, η_{p+1}) + β_{q1} f_p + ... + β_{qq} f_{p+1-q}]
with
    β_{qi} := ∫_{-2}^{0} Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Analogously to (7.2.6.7), one solves (7.2.6.10) also by iteration.
7.2.7 General Multistep Methods
All multistep methods discussed in Section 7.2.6, as well as the one-step methods of Section 7.2.1, can be written in the following form:
(7.2.7.1)    η_{j+r} + a_{r-1} η_{j+r-1} + ... + a_0 η_j = h F(x_j; η_{j+r}, η_{j+r-1}, ..., η_j; h; f).
Generally, a multistep method given by (7.2.7.1) is called an r-step method. In the examples considered in Section 7.2.6 the function F, in addition, depends linearly on f as follows:
    F(x_j; η_{j+r}, η_{j+r-1}, ..., η_j; h; f) ≡ b_r f(x_{j+r}, η_{j+r}) + ... + b_0 f(x_j, η_j).
The b_i, i = 0, ..., r, are certain constants. One then speaks of a linear r-step method; such methods will be further discussed in Section 7.2.11.
For the Adams-Bashforth method (7.2.6.5), for example, r = q + 1, and
    b_{q-i} = β_{qi} = ∫_0^1 Π_{l=0, l≠i}^{q} (s + l)/(-i + l) ds,    i = 0, 1, ..., q.
Any r initial values η_0, ..., η_{r-1} define, through (7.2.7.1), a unique sequence η_j, j ≥ 0. For the initial values η_j one chooses, as well as possible, certain approximations to the exact solution y_i = y(x_i) of (7.2.1.1) at x_i = x_0 + ih, i = 0, 1, ..., r - 1. These can be obtained, e.g., by means of appropriate one-step methods. We denote the errors in the starting values by
    ε_i := η_i - y(x_i),    i = 0, 1, ..., r - 1.
Further errors, e.g., rounding errors in the evaluation of F, occur during the execution of (7.2.7.1). We want to include the influence of these errors in our study also, and therefore consider, more generally than (7.2.7.1), the following recurrence formulas:
(7.2.7.2)    η_0 := y_0 + ε_0,
                 ...
             η_{r-1} := y_{r-1} + ε_{r-1};
    for j = 0, 1, 2, ...:
             η_{j+r} + a_{r-1} η_{j+r-1} + ... + a_0 η_j := h F(x_j; η_{j+r}, η_{j+r-1}, ..., η_j; h; f) + h ε_{j+r}.
The solution η_i of (7.2.7.2) depends on h and the ε_j, and represents a function
    η(x; ε; h),
which, like the error function ε = ε(x; h), is defined only for x ∈ R_h = {x_0 + ih | i = 0, 1, ...}, or h ∈ H_x = {(x - x_0)/n | n = 1, 2, ...}, through
    η(x_i; ε; h) := η_i,    ε(x_i; h) := ε_i,    x_i := x_0 + ih.
As in one-step methods, one can define the local discretization error τ(x, y; h) of a multistep method (7.2.7.1) at the point x, y. Indeed, let f ∈ F_1(a, b), x ∈ [a, b], y ∈ ℝ, and let z(t) be the solution of the initial-value problem
    z'(t) = f(t, z(t)),    z(x) = y.
Then the local discretization error τ(x, y; h) is defined to be the quantity
(7.2.7.3)    τ(x, y; h) := (1/h) [z(x + rh) + Σ_{i=0}^{r-1} a_i z(x + ih) - h F(x; z(x + rh), z(x + (r - 1)h), ..., z(x); h; f)].
The local discretization error τ(x, y; h) thus indicates how well the exact solution of a differential equation satisfies the recurrence formula (7.2.7.1). From a reasonable method one expects that this error will become small for small |h|. Generalizing (7.2.1.7), one thus defines the consistency of multistep methods by:
(7.2.7.4) Definition. The multistep method is called consistent if for each f ∈ F_1(a, b) there exists a function σ(h) with lim_{h→0} σ(h) = 0 such that
(7.2.7.5)    |τ(x, y; h)| ≤ σ(h)    for all x ∈ [a, b], y ∈ ℝ.
One speaks of a method of order p if, for f ∈ F_p(a, b),
    τ(x, y; h) = O(h^p).
EXAMPLE. For the midpoint rule (7.2.6.9) one has, in view of z'(t) = f(t, z(t)), z(x) = y,
    τ(x, y; h) := (1/h) [z(x + 2h) - z(x) - 2h f(x + h, z(x + h))]
                = (1/h) [z(x + 2h) - z(x) - 2h z'(x + h)].
Through Taylor expansion in h one finds
    τ(x, y; h) = (1/h) [z(x) + 2h z'(x) + 2h² z''(x) + (4h³/3) z'''(x)
                        - z(x) - 2h (z'(x) + h z''(x) + (h²/2) z'''(x))] + O(h³)
               = (h²/3) z'''(x) + O(h³).
The method is thus consistent and of second order.
The order of the methods in Section 7.2.6 can be determined more
conveniently by means of the error estimates for interpolating polynomials
[e.g., (2.1.4.1)] and Newton-Cotes formulas (3.1.1).
For f(x, y) := 0, and z(x) := y, a consistent method gives
    τ(x, y; h) = (1/h) [y (1 + a_{r-1} + ... + a_0) - h F(x; y, y, ..., y; h; 0)],
    |τ(x, y; h)| ≤ σ(h),    lim_{h→0} σ(h) = 0.
For continuous F(x; y, y, ..., y; ·; 0), since y is arbitrary, it follows that
(7.2.7.6)    1 + a_{r-1} + ... + a_0 = 0.
We will often, in the following, impose on F the condition
(7.2.7.7)    F(x; u_r, u_{r-1}, ..., u_0; h; 0) ≡ 0
for all x ∈ [a, b], all h, and all u_i. For linear multistep methods, (7.2.7.7) is certainly always fulfilled. (7.2.7.7), together with (7.2.7.6), guarantees that the exact solution y(x) ≡ y_0 of the trivial differential equation y' = 0, y(x_0) = y_0, is also an exact solution of (7.2.7.2) if ε_i = 0 for all i.
Since the approximate solution η(x; ε; h) furnished by a multistep method (7.2.7.2) also depends on the errors ε_i, the definition of convergence is more complicated than for one-step methods. One cannot expect, of course, that the global discretization error
    e(x; ε; h) := η(x; ε; h) - y(x)
for fixed x and for h = h_n = (x - x_0)/n, n = 1, 2, ..., will converge toward 0, unless the errors ε(x; h) also become arbitrarily small as h → 0. One therefore defines:
(7.2.7.8) Definition. The multistep method given by (7.2.7.2) is called convergent if
    lim_{n→∞} η(x; ε; h_n) = y(x),    h_n := (x - x_0)/n,    n = 1, 2, ...,
for all x ∈ [a, b], all f ∈ F_1(a, b), and all functions ε(z; h) for which there is a ρ(h) such that
(7.2.7.9)    |ε(z; h)| ≤ ρ(h)    for all z ∈ R_h,    lim_{h→0} ρ(h) = 0.
7.2.8 An Example of Divergence
The results of 7.2.2, in particular Theorem (7.2.2.3), may suggest that multistep methods, too, converge faster with increasing order p of the local discretization error [see (7.2.7.4)]. The point of the following example is to show that this conjecture is false. At the same time the example furnishes a method for constructing multistep methods of maximum order.
Suppose we construct a linear multistep method of the type (7.2.7.1) with r = 2, having the form
    η_{j+2} + a_1 η_{j+1} + a_0 η_j = h [b_1 f(x_{j+1}, η_{j+1}) + b_0 f(x_j, η_j)].
The constants a_0, a_1, b_0, b_1 are to be determined so as to yield a method of maximum order. If z'(t) = f(t, z(t)), then for the local discretization error τ(x, y; h) of (7.2.7.3) we have
    h τ(x, y; h) = z(x + 2h) + a_1 z(x + h) + a_0 z(x) - h [b_1 z'(x + h) + b_0 z'(x)].
Now expand the right-hand side in a Taylor series in h,
    h τ(x, y; h) = z(x) [1 + a_1 + a_0] + h z'(x) [2 + a_1 - b_1 - b_0]
                 + h² z''(x) [2 + (1/2)a_1 - b_1] + h³ z'''(x) [4/3 + (1/6)a_1 - (1/2)b_1] + O(h⁴).
The coefficients a_0, a_1, b_0, b_1 are to be determined in such a way as to annihilate as many h-powers as possible, thus producing a method which has the largest possible order. This leads to the equations
    1 + a_1 + a_0 = 0,
    2 + a_1 - b_1 - b_0 = 0,
    2 + (1/2)a_1 - b_1 = 0,
    4/3 + (1/6)a_1 - (1/2)b_1 = 0,
with the solution a_1 = 4, a_0 = -5, b_1 = 4, b_0 = 2, and hence to the method
    η_{j+2} + 4η_{j+1} - 5η_j = h [4 f(x_{j+1}, η_{j+1}) + 2 f(x_j, η_j)]
of order 3 [since h τ(x, y; h) = O(h⁴), i.e., τ(x, y; h) = O(h³)].
If one tries to use this method to solve the initial-value problem
    y' = -y,    y(0) = 1,
with the exact solution y(x) = e^{-x}, computation in 10-digit arithmetic for h = 10^{-2}, even taking as starting values the exact (within machine precision) values η_0 := 1, η_1 := e^{-h}, produces the results given in the following table.

    j       η_j - y(x_j)        [cf. (7.2.8.3)]
    2       -0.164 × 10^-8      -0.753 × 10^-9
    3       +0.501 × 10^-8      +0.378 × 10^-8
    4       -0.300 × 10^-7      -0.190 × 10^-7
    5       +0.144 × 10^-6      +0.958 × 10^-7
    96      -0.101 × 10^58      -0.668 × 10^57
    97      +0.512 × 10^58      +0.336 × 10^58
    98      -0.257 × 10^59      -0.169 × 10^59
    99      +0.129 × 10^60      +0.850 × 10^59
    100     -0.652 × 10^60      -0.427 × 10^60
How does one account for this wildly oscillating behavior of the η_j? If we assume that the exact values η_0 := 1, η_1 := e^{-h} are used as starting values and no rounding errors are committed during the execution of the method (ε_j = 0 for all j), we obtain a sequence of numbers η_j with
    η_0 = 1,    η_1 = e^{-h},
    η_{j+2} + 4η_{j+1} - 5η_j = h [-4η_{j+1} - 2η_j]    for j = 0, 1, ...,
or
(7.2.8.1)    η_{j+2} + 4(1 + h)η_{j+1} + (-5 + 2h)η_j = 0    for j = 0, 1, ....
Such difference equations have special solutions of the form η_j = λ^j. Upon substituting this expression in (7.2.8.1), one finds for λ the equation
    λ² + 4(1 + h)λ + (-5 + 2h) = 0,
which, aside from the trivial solution λ = 0, has the solutions
    λ_1 = -2 - 2h + 3 √(1 + (2/3)h + (4/9)h²),
    λ_2 = -2 - 2h - 3 √(1 + (2/3)h + (4/9)h²).
For small h one has
(7.2.8.2)    λ_1 = 1 - h + (1/2)h² - (1/6)h³ + (1/72)h⁴ + O(h⁵),
             λ_2 = -5 - 3h + O(h²).
One can now show that every solution η_j of (7.2.8.1) can be written as a linear combination
    η_j = α λ_1^j + β λ_2^j
of the two particular solutions λ_1^j, λ_2^j just found [see (7.2.9.9)]. The constants α and β, in our case, are determined by the initial conditions η_0 = 1, η_1 = e^{-h}, which lead to the following system of equations for α, β:
    η_0 = α + β = 1,
    η_1 = α λ_1 + β λ_2 = e^{-h}.
The solution is given by
    α = (e^{-h} - λ_2)/(λ_1 - λ_2),    β = (λ_1 - e^{-h})/(λ_1 - λ_2).
From (7.2.8.2) one easily verifies
    α = 1 + O(h²),    β = -(1/216) h⁴ + O(h⁵).
For fixed x ≠ 0, h = h_n = x/n, n = 1, 2, ..., one therefore obtains for the approximate solution η_n = η(x; h_n)
    η_n = α λ_1^n + β λ_2^n
        = [1 + O((x/n)²)] [1 - x/n + O((x/n)²)]^n - (1/216)(x⁴/n⁴) [1 + O(x/n)] [-5 - 3x/n + O((x/n)²)]^n.
The first term tends to y(x) = e^{-x} as n → ∞; the second term for n → ∞ behaves like
(7.2.8.3)    -(x⁴/216) ((-5)^n/n⁴) e^{3x/5}.
Since lim_{n→∞} 5^n/n⁴ = ∞, this term oscillates more and more violently as n → ∞. This explains the oscillatory behavior and divergence of the method. As is easily seen, the reason for this behavior lies in the fact that -5 is a root of the quadratic equation μ² + 4μ - 5 = 0. It is to be expected that also in the general case (7.2.7.2) the zeros of the polynomial ψ(μ) = μ^r + a_{r-1}μ^{r-1} + ... + a_0 play a significant role for the convergence of the method.
7.2.9 Linear Difference Equations
In the following section we need a few simple results on linear difference equations. By a linear homogeneous difference equation of order r one means an equation of the form
(7.2.9.1)    u_{j+r} + a_{r-1} u_{j+r-1} + ... + a_0 u_j = 0,    j = 0, 1, 2, ....
For every set of starting values u_0, u_1, ..., u_{r-1} one can evidently determine exactly one sequence of numbers u_n, n = 0, 1, ..., which solves (7.2.9.1). In applications to multistep methods a point of interest is the growth behavior of the u_n as n → ∞, in dependence on the starting values u_0, u_1, ..., u_{r-1}. In particular, one would like to know conditions guaranteeing that
(7.2.9.2a)    lim_{n→∞} u_n/n = 0
for all real starting values u_0, u_1, ..., u_{r-1}.
Since the solutions u_n = u_n(U_0), U_0 := [u_0, u_1, ..., u_{r-1}]^T, obviously depend linearly on the starting vector, the restriction to real starting vectors U_0 ∈ ℝ^r is unnecessary, and (7.2.9.2a) is equivalent to
(7.2.9.2b)    lim_{n→∞} u_n/n = 0
for all (complex) starting values u_0, u_1, ..., u_{r-1}.
With the difference equation (7.2.9.1) one associates the polynomial
(7.2.9.3)    ψ(μ) := μ^r + a_{r-1} μ^{r-1} + ... + a_0.
One now says that (7.2.9.1) satisfies the
(7.2.9.4)    stability condition
if for every zero λ of ψ(μ) one has |λ| ≤ 1, and further ψ(λ) = 0 and |λ| = 1 together imply that λ is a simple zero of ψ.
(7.2.9.5) Theorem. The stability condition (7.2.9.4) is necessary and sufficient for (7.2.9.2).
PROOF. (1) Assume (7.2.9.2), and let λ be a zero of ψ in (7.2.9.3). Then the sequence u_j := λ^j, j = 0, 1, ..., is a solution of (7.2.9.1). For |λ| > 1 the sequence u_n/n = λ^n/n diverges, so that from (7.2.9.2) it follows at once that |λ| ≤ 1. Let now λ be a multiple zero of ψ with |λ| = 1. Then
    ψ'(λ) = r λ^{r-1} + (r - 1) a_{r-1} λ^{r-2} + ... + 1·a_1 = 0.
The sequence u_j := j λ^j, j ≥ 0, is thus a solution of (7.2.9.1):
    u_{j+r} + a_{r-1} u_{j+r-1} + ... + a_0 u_j
    = j λ^j (λ^r + a_{r-1} λ^{r-1} + ... + a_0) + λ^{j+1} (r λ^{r-1} + (r - 1) a_{r-1} λ^{r-2} + ... + a_1)
    = 0.
Since u_n/n = λ^n does not converge to zero as n → ∞, λ must be a simple zero.
(2) Conversely, let the stability condition (7.2.9.4) be satisfied. We first use the fact that, with the abbreviations
    U_j := [u_j, u_{j+1}, ..., u_{j+r-1}]^T,
    A := the Frobenius (companion) matrix with rows [0, 1, 0, ..., 0], [0, 0, 1, ..., 0], ..., [0, ..., 0, 1], [-a_0, -a_1, ..., -a_{r-1}],
the difference equation (7.2.9.1) is equivalent to the recurrence formula
(7.2.9.6)    U_{j+1} = A U_j.
A is a Frobenius matrix with the characteristic polynomial ψ(μ) in (7.2.9.3) [Theorem (6.3.4)]. Therefore, if the stability condition (7.2.9.4) is satisfied, one can choose, by Theorem (6.9.2), a norm || · || on ℂ^r such that for the corresponding matrix norm lub(A) ≤ 1. It thus follows, for all U_0 ∈ ℂ^r, that
(7.2.9.7)    ||U_n|| = ||A^n U_0|| ≤ ||U_0||    for n = 0, 1, ....
Since on ℂ^r all norms are equivalent [Theorem (4.4.6)], there exists a k > 0 with (1/k)||U|| ≤ ||U||_∞ ≤ k||U||, and one obtains from (7.2.9.7) in particular
    ||U_n||_∞ ≤ k² ||U_0||_∞,    n = 0, 1, ...,
i.e., one has lim_{n→∞} (1/n)||U_n||_∞ = 0, and hence (7.2.9.2).    □
In the proof of the preceding theorem we exploited the fact that the zeros λ_i of ψ furnish particular solutions of (7.2.9.1) of the form u_j := λ_i^j, j = 0, 1, .... The following theorem shows that one can similarly represent all solutions of (7.2.9.1) in terms of the zeros of ψ:
(7.2.9.8) Theorem. Let the polynomial
    ψ(μ) := μ^r + a_{r-1} μ^{r-1} + ... + a_0
have the k distinct zeros λ_i, i = 1, 2, ..., k, with multiplicities σ_i, i = 1, 2, ..., k, and let a_0 ≠ 0. Then for arbitrary polynomials p_i(t) with deg p_i < σ_i, i = 1, 2, ..., k, the sequence
(7.2.9.9)    u_j := p_1(j) λ_1^j + p_2(j) λ_2^j + ... + p_k(j) λ_k^j,    j = 0, 1, ...,
is a solution of the difference equation (7.2.9.1). Conversely, every solution of (7.2.9.1) can be uniquely represented in the form (7.2.9.9).
PROOF. We only show the first part of the theorem. Since with $\{u_j\}$, $\{v_j\}$ also $\{u_j + v_j\}$ is a solution of (7.2.9.1), it suffices to show that for a $\sigma$-fold zero $\lambda$ of $\psi$ the sequence

$u_j := p(j)\lambda^j, \qquad j = 0, 1, \ldots,$

is a solution of (7.2.9.1) if $p(t)$ is an arbitrary polynomial with $\deg p < \sigma$. For fixed $j \ge 0$, let us represent $p(j+t)$ with the aid of Newton's interpolation formula (2.1.3.1) in the following form:

$p(j+t) = \alpha_0 + \alpha_1 t + \alpha_2 t(t-1) + \cdots + \alpha_r t(t-1)\cdots(t-r+1),$

where $\alpha_\sigma = \alpha_{\sigma+1} = \cdots = \alpha_r = 0$, because $\deg p < \sigma$. With the notation $a_r := 1$, we thus have
$u_{j+r} + a_{r-1}u_{j+r-1} + \cdots + a_0 u_j = \lambda^j \sum_{\rho=0}^{r} a_\rho\lambda^\rho\,p(j+\rho)$
$\qquad = \lambda^j \sum_{\rho=0}^{r} a_\rho\lambda^\rho\Bigl[\alpha_0 + \sum_{\tau=1}^{\sigma-1}\alpha_\tau\,\rho(\rho-1)\cdots(\rho-\tau+1)\Bigr]$
$\qquad = \lambda^j\bigl[\alpha_0\psi(\lambda) + \alpha_1\lambda\psi'(\lambda) + \cdots + \alpha_{\sigma-1}\lambda^{\sigma-1}\psi^{(\sigma-1)}(\lambda)\bigr] = 0,$

since $\lambda$ is a $\sigma$-fold zero of $\psi$ and thus $\psi^{(\tau)}(\lambda) = 0$ for $0 \le \tau \le \sigma-1$. This proves the first part of the theorem.
Now a polynomial $p(t) = c_0 + c_1 t + \cdots + c_{\sigma-1}t^{\sigma-1}$ of degree $< \sigma$ is precisely determined by its $\sigma$ coefficients $c_m$, $m = 0, 1, \ldots, \sigma-1$, so that by $\sigma_1 + \sigma_2 + \cdots + \sigma_k = r$ the representation (7.2.9.9) contains a total of $r$ free parameters, namely the coefficients of the $p_i(t)$. The second part of the theorem therefore asserts that by suitably choosing these $r$ parameters one can obtain every solution of (7.2.9.1), i.e., that for every choice of initial values $u_0, u_1, \ldots, u_{r-1}$ there is a unique solution to the following system of $r$ linear equations for the $r$ unknown coefficients of the $p_i(t)$, $i = 1, \ldots, k$:

$p_1(j)\lambda_1^j + p_2(j)\lambda_2^j + \cdots + p_k(j)\lambda_k^j = u_j, \qquad j = 0, 1, \ldots, r-1.$

The proof of this, although elementary, is tedious. We therefore omit it.  □
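As a brief numerical check (not from the original text) of the representation (7.2.9.9), the sketch below, assuming NumPy, compares the recursion with the closed form for a difference equation with two simple zeros; the starting values 0, 1 are arbitrary.

```python
import numpy as np

# u_{j+2} - u_{j+1} - u_j = 0, i.e. a_1 = a_0 = -1, psi(mu) = mu^2 - mu - 1
a0, a1 = -1.0, -1.0
lam = np.roots([1.0, a1, a0])                 # two simple zeros lambda_1, lambda_2

u = [0.0, 1.0]                                # starting values u_0, u_1
for j in range(20):
    u.append(-a1 * u[-1] - a0 * u[-2])        # recursion (7.2.9.1)

# (7.2.9.9): u_j = p_1 lambda_1^j + p_2 lambda_2^j, p fixed by the starting values
p = np.linalg.solve(np.vander(lam, 2, increasing=True).T, [0.0, 1.0])
closed = [(p[0] * lam[0]**j + p[1] * lam[1]**j).real for j in range(22)]

print(max(abs(u[j] - closed[j]) for j in range(22)))   # ~ rounding level
```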
7.2.10 Convergence of Multistep Methods
We now want to use the results of the previous section in order to study
the convergence behavior of the multistep method (7.2.7.2). It turns out
that for a consistent method [see (7.2.7.4)] the stability condition (7.2.9.4)
is necessary and sufficient for the convergence of the method, provided F
satisfies certain additional regularity conditions [see (7.2.10.3)].
In connection with the stability condition (7.2.9.4), note from (7.2.7.6) that for a consistent method $\psi(\mu) = \mu^r + a_{r-1}\mu^{r-1} + \cdots + a_0$ has $\lambda = 1$ as a zero.
We begin by showing that the stability condition is necessary for convergence:
(7.2.10.1) Theorem. If the multistep method (7.2.7.2) is convergent [Definition (7.2.7.8)] and F obeys the condition (7.2.7.7),
F(x; U r , U r -1, ... , Uo; h; 0) == 0,
then the stability condition (7.2.9.4) holds.
PROOF. Since the method (7.2.7.2) is supposed to be convergent, in the sense of (7.2.7.8), for the integration of the differential equation $y' \equiv 0$, $y(x_0) = 0$, with the exact solution $y(x) \equiv 0$, it must produce an approximate solution $\eta(x; \varepsilon; h)$ satisfying

$\lim_{h\to 0}\eta(x; \varepsilon; h) = 0$

for all $x \in [a,b]$ and all $\varepsilon$ with $|\varepsilon(z;h)| \le \rho(h)$, $\rho(h) \to 0$ for $h \to 0$.
Let $x \ne x_0$, $x \in [a,b]$, be given. For $h = h_n = (x - x_0)/n$, $n = 1, 2, \ldots$, it follows that $x_n = x$ and $\eta(x; \varepsilon; h_n) = \eta_n$, where $\eta_n$, in view of $F(x; u_r, u_{r-1}, \ldots, u_0; h; 0) \equiv 0$, is determined by the recurrence formula

$\eta_i = \varepsilon_i, \qquad i = 0, 1, \ldots, r-1,$
$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h_n\varepsilon_{j+r}, \qquad j = 0, 1, \ldots, n-r,$

with $\varepsilon_j := \varepsilon(x_0 + jh_n; h_n)$. We choose $\varepsilon_{j+r} := 0$, $j = 0, 1, \ldots, n-r$, and $\varepsilon_i := h_n u_i$, $i = 0, 1, \ldots, r-1$, with arbitrary constants $u_0, u_1, \ldots, u_{r-1}$. With

$\rho(h) := |h|\max_{0\le i\le r-1}|u_i|$

one then has

$|\varepsilon_i| \le \rho(h_n), \qquad i = 0, 1, \ldots, n,$

and $\lim_{h\to 0}\rho(h) = 0$.
Now, $\eta_n = h_n u_n$, where $u_n$ is obtained recursively from $u_0, u_1, \ldots, u_{r-1}$ via the difference equation

$u_{j+r} + a_{r-1}u_{j+r-1} + \cdots + a_0 u_j = 0, \qquad j = 0, 1, \ldots, n-r.$

Since, by assumption, the method is convergent, we have

$\lim_{n\to\infty}\eta_n = (x - x_0)\lim_{n\to\infty}\frac{u_n}{n} = 0,$

i.e., (7.2.9.2) follows, the $u_0, u_1, \ldots, u_{r-1}$ having been chosen arbitrarily. The assertion now follows from Theorem (7.2.9.5).  □
We now impose on $F$ the additional condition of being "Lipschitz continuous" in the following sense: For every function $f \in F_1(a,b)$ there exist constants $h_0 > 0$ and $M$ (which may depend on $f$) such that

(7.2.10.2)  $|F(x; u_r, u_{r-1}, \ldots, u_0; h; f) - F(x; v_r, v_{r-1}, \ldots, v_0; h; f)| \le M\sum_{i=0}^{r}|u_i - v_i|$

for all $x \in [a,b]$, $|h| \le h_0$, $u_j, v_j \in \mathbb{R}$ [cf. the analogous condition (7.2.2.4)].
We then show that for consistent methods the stability condition is also
sufficient for convergence:
(7.2.10.3) Theorem. Let the multistep method (7.2.7.2) be consistent in
the sense of (7.2.7.4), and F satisfy the conditions (7.2.7.7) and (7.2.10.2).
Then the method is convergent in the sense of (7.2.7.8) for all f E FI (a, b)
if and only if the stability condition (7.2.9.4) holds.
PROOF. The fact that the stability condition is necessary for convergence
follows from (7.2.10.1). In order to show that it is also sufficient under
the stated assumptions, one proceeds, analogously to the proof of Theorem
(7.2.2.3), in the following way: Let $y(x)$ be the exact solution of $y' = f(x,y)$, $y(x_0) = y_0$, $y_j := y(x_j)$, $x_j = x_0 + jh$, and $\eta_j$ the solution of (7.2.7.2):

$\eta_i = y_i + \varepsilon_i, \qquad i = 0, 1, \ldots, r-1,$
$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = hF(x_j; \eta_{j+r}, \ldots, \eta_j; h; f) + h\varepsilon_{j+r},$

for $j = 0, 1, \ldots$, with $|\varepsilon_j| \le \rho(h)$, $\lim_{h\to 0}\rho(h) = 0$.
For the error $e_j := \eta_j - y_j$ one has

(7.2.10.4)  $e_i = \varepsilon_i, \qquad i = 0, \ldots, r-1,$
  $e_{j+r} + a_{r-1}e_{j+r-1} + \cdots + a_0 e_j = c_{j+r}, \qquad j = 0, 1, \ldots,$

where

$c_{j+r} := h\bigl[F(x_j; \eta_{j+r}, \ldots, \eta_j; h; f) - F(x_j; y_{j+r}, \ldots, y_j; h; f)\bigr] + h(\varepsilon_{j+r} - \tau_{j+r}), \qquad \tau_{j+r} := \tau(x_j, y_j; h).$

By virtue of the consistency (7.2.7.4), one has for a suitable function $\sigma(h)$,

$|\tau_{j+r}| = |\tau(x_j, y_j; h)| \le \sigma(h), \qquad \lim_{h\to 0}\sigma(h) = 0,$

and by (7.2.10.2),

(7.2.10.5)  $|c_{j+r}| \le |h|M\sum_{i=0}^{r}|e_{j+i}| + |h|\bigl[\rho(h) + \sigma(h)\bigr].$
With the help of the vectors

$E_j := \begin{bmatrix} e_j \\ e_{j+1} \\ \vdots \\ e_{j+r-1} \end{bmatrix}, \qquad B := \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix},$

and the matrix

$A := \begin{bmatrix} 0 & 1 & & \\ & \ddots & \ddots & \\ & & 0 & 1 \\ -a_0 & -a_1 & \cdots & -a_{r-1} \end{bmatrix},$

(7.2.10.4) can be written equivalently in the form

(7.2.10.6)  $E_{j+1} = A\,E_j + c_{j+r}B, \qquad E_0 := \begin{bmatrix} \varepsilon_0 \\ \vdots \\ \varepsilon_{r-1} \end{bmatrix}.$
Since by assumption the stability condition (7.2.9.4) is valid, one can choose, by Theorem (6.9.2), a norm $\|\cdot\|$ on $\mathbb{C}^r$ with $\operatorname{lub}(A) \le 1$. Now all norms on $\mathbb{C}^r$ are equivalent [see (4.4.6)], i.e., there is a constant $k > 0$ such that

$\frac{1}{k}\|E_j\| \le \sum_{i=0}^{r-1}|e_{j+i}| \le k\|E_j\|.$

Since

$\sum_{i=0}^{r-1}|e_{j+i}| \le k\|E_j\|, \qquad \sum_{i=1}^{r}|e_{j+i}| \le k\|E_{j+1}\|,$

one thus obtains from (7.2.10.5)

$|c_{j+r}| \le |h|Mk\bigl(\|E_j\| + \|E_{j+1}\|\bigr) + |h|\bigl[\rho(h) + \sigma(h)\bigr].$

Using (7.2.10.6) and $\|B\| \le k$, it follows for $j = 0, 1, \ldots$
(7.2.10.7)  $(1 - |h|Mk^2)\|E_{j+1}\| \le (1 + |h|Mk^2)\|E_j\| + k|h|\bigl[\rho(h) + \sigma(h)\bigr],$

and $\|E_0\| \le kr\rho(h)$. For $|h| \le 1/(2Mk^2)$, we now have $(1 - |h|Mk^2) \ge \tfrac12$ and

$\frac{1 + |h|Mk^2}{1 - |h|Mk^2} \le 1 + 4|h|Mk^2.$

(7.2.10.7), therefore, yields for $|h| \le 1/(2Mk^2)$

$\|E_{j+1}\| \le (1 + 4|h|Mk^2)\|E_j\| + 2k|h|\bigl[\rho(h) + \sigma(h)\bigr], \qquad j = 0, 1, \ldots.$

Because of $\|E_0\| \le kr\rho(h)$, Lemma (7.2.2.2) shows that

$\|E_n\| \le e^{4n|h|Mk^2}kr\rho(h) + \frac{e^{4n|h|Mk^2} - 1}{2Mk}\bigl[\rho(h) + \sigma(h)\bigr],$

i.e., for $x \ne x_0$, $h = h_n = (x - x_0)/n$, $|h_n| \le 1/(2Mk^2)$, one has

$\|E_n\| \le e^{4|x - x_0|Mk^2}kr\rho(h_n) + \frac{e^{4|x - x_0|Mk^2} - 1}{2Mk}\bigl[\rho(h_n) + \sigma(h_n)\bigr].$

There exist, therefore, constants $C_1$ and $C_2$ independent of $h$ such that

(7.2.10.8)  $|\eta(x; \varepsilon; h_n) - y(x)| \le C_1\rho(h_n) + C_2\sigma(h_n)$

for all sufficiently large $n$. Convergence of the method now follows in view of $\lim_{h\to 0}\rho(h) = \lim_{h\to 0}\sigma(h) = 0$.  □
From (7.2.10.8) one immediately obtains the following:

(7.2.10.9) Corollary. If in addition to the assumptions of Theorem (7.2.10.3) the multistep method is a method of order $p$ [see (7.2.7.4)], $\sigma(h) = O(h^p)$, and $f \in F_p(a,b)$, then the global discretization error also satisfies

$|\eta(x; \varepsilon; h_n) - y(x)| = O(h_n^p)$

for all $h_n = (x - x_0)/n$, $n$ sufficiently large, provided the errors

$\varepsilon_i = \varepsilon(x_0 + ih_n; h_n), \qquad i = 0, 1, \ldots, n,$

obey an estimate

$|\varepsilon_i| \le \rho(h_n), \qquad i = 0, 1, \ldots, n,$

with $\rho(h_n) = O(h_n^p)$ as $n \to \infty$.
7.2.11 Linear Multistep Methods
In the following sections we assume that in (7.2.7.2), aside from the starting errors $\varepsilon_i$, $0 \le i \le r-1$, there are no further errors: $\varepsilon_j = 0$ for $j \ge r$. Since it will always be clear from the context what starting values are being used, and hence what the starting errors are, we will simplify further by writing $\eta(x; h)$ in place of $\eta(x; \varepsilon; h)$ for the approximate solution produced by the multistep method (7.2.7.2).
The most commonly used multistep methods are linear ones. For these, the function $F(x; u; h; f)$, $u := [u_r, u_{r-1}, \ldots, u_0]$, in (7.2.7.1) has the following form:

(7.2.11.1)  $F(x; u_r, u_{r-1}, \ldots, u_0; h; f) \equiv b_r f(x + rh, u_r) + b_{r-1}f(x + (r-1)h, u_{r-1}) + \cdots + b_0 f(x, u_0).$

A linear multistep method therefore is determined by listing the coefficients $a_0, \ldots, a_{r-1}$, $b_0, \ldots, b_r$. By means of the recurrence formula

$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h\bigl[b_r f(x_{j+r}, \eta_{j+r}) + \cdots + b_0 f(x_j, \eta_j)\bigr], \qquad x_i := x_0 + ih,$

for each set of starting values $\eta_0, \eta_1, \ldots, \eta_{r-1}$ and for each (sufficiently small) stepsize $h \ne 0$, the method produces approximate values $\eta_j$ for the values $y(x_j)$ of the exact solution $y(x)$ of an initial-value problem $y' = f(x,y)$, $y(x_0) = y_0$.
If $b_r \ne 0$, one deals with a corrector method; if $b_r = 0$, with a predictor method.
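For illustration (not from the original text), the recurrence above is easy to program in the explicit case $b_r = 0$; the following sketch applies a given linear multistep method to a scalar problem, with the midpoint rule as an example.

```python
import math

def explicit_multistep(a, b, f, x0, y_start, h, n_steps):
    """Linear r-step method with b_r = 0 (predictor):
    eta_{j+r} = -(a_{r-1} eta_{j+r-1} + ... + a_0 eta_j)
                + h (b_{r-1} f_{j+r-1} + ... + b_0 f_j).
    a = [a_0, ..., a_{r-1}], b = [b_0, ..., b_{r-1}],
    y_start = [eta_0, ..., eta_{r-1}]."""
    r = len(a)
    eta = list(y_start)
    for j in range(n_steps - r + 1):
        s = -sum(a[i] * eta[j + i] for i in range(r))
        s += h * sum(b[i] * f(x0 + (j + i) * h, eta[j + i]) for i in range(r))
        eta.append(s)
    return eta

# midpoint rule as a 2-step method: a = [-1, 0], b = [0, 2]; y' = -y, y(0) = 1
h = 0.01
eta = explicit_multistep([-1.0, 0.0], [0.0, 2.0], lambda x, y: -y,
                         0.0, [1.0, math.exp(-h)], h, 100)
print(eta[-1], math.exp(-1.0))     # both close to e^{-1}
```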
For $f \in F_1(a,b)$ every linear multistep method evidently satisfies the conditions (7.2.10.2) and (7.2.7.7). According to Theorem (7.2.10.1), therefore, the stability condition (7.2.9.4) for the polynomial

$\psi(\mu) := \mu^r + a_{r-1}\mu^{r-1} + \cdots + a_0$

is necessary for convergence [see (7.2.7.8)] of these methods. By Theorem (7.2.10.3), the stability condition for $\psi$ together with consistency, (7.2.7.4), is also sufficient for convergence.
To check consistency, by Definition (7.2.7.4) one has to examine the behavior of the expression (here $a_r := 1$)

(7.2.11.2)  $L[z(x); h] := \sum_{i=0}^{r} a_i z(x+ih) - h\sum_{i=0}^{r} b_i f(x+ih, z(x+ih)) = \sum_{i=0}^{r} a_i z(x+ih) - h\sum_{i=0}^{r} b_i z'(x+ih) = h\cdot\tau(x, y; h)$

for the solution $z(t)$ with $z'(t) = f(t, z(t))$, $z(x) = y$, $x \in [a,b]$, $y \in \mathbb{R}$. Assuming that $z(t)$ is sufficiently often differentiable (this is the case if $f$ has sufficiently many continuous partial derivatives), one finds by Taylor expansion of $L[z(x); h]$ in powers of $h$ for $f \in F_{q-1}(a,b)$,

$L[z(x); h] = C_0 z(x) + C_1 h z'(x) + \cdots + C_q h^q z^{(q)}(x) + o(|h|^q) = h\,\tau(x, y; h).$
Here, the $C_i$ are independent of $z(\cdot)$, $x$ and $h$, and one has in particular

$C_0 = a_0 + a_1 + \cdots + a_{r-1} + 1,$
$C_1 = a_1 + 2a_2 + \cdots + (r-1)a_{r-1} + r\cdot 1 - (b_0 + b_1 + \cdots + b_r).$

In terms of the polynomial $\psi(\mu)$ and the additional polynomial

(7.2.11.3)  $\chi(\mu) := b_0 + b_1\mu + \cdots + b_r\mu^r, \qquad \mu \in \mathbb{C},$

one can write $C_0$ and $C_1$ in the form

$C_0 = \psi(1), \qquad C_1 = \psi'(1) - \chi(1).$

Now, for $f \in F_1(a,b)$,

$\tau(x, y; h) = \frac{1}{h}L[z(x); h] = \frac{C_0}{h}z(x) + C_1 f(x, y) + O(h),$
and, according to Definition (7.2.7.4), a consistent multistep method requires

$C_0 = C_1 = 0,$

i.e., a consistent linear multistep method has at least order 1. In general, it has order $p$ [see Definition (7.2.7.4)] if, for $f \in F_p(a,b)$,

$C_0 = C_1 = \cdots = C_p = 0.$
In addition to Theorems (7.2.10.1) and (7.2.10.3), we now have for linear
multistep methods the following:
(7.2.11.4) Theorem. A linear multistep method which is convergent is
also consistent.
PROOF. Consider the initial-value problem

$y' = 0, \qquad y(0) = 1,$

with the exact solution $y(x) \equiv 1$. For starting values $\eta_i := 1$, $i = 0, 1, \ldots, r-1$, the method produces values $\eta_{j+r}$, $j = 0, 1, \ldots$, with

(7.2.11.5)  $\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = 0.$

Letting $h_n := x/n$, one gets $\eta(x; h_n) = \eta_n$, and in view of the convergence of the method,

$\lim_{n\to\infty}\eta(x; h_n) = \lim_{n\to\infty}\eta_n = y(x) = 1.$

For $j \to \infty$ it thus follows at once from (7.2.11.5) that

$C_0 = 1 + a_{r-1} + \cdots + a_0 = 0.$
In order to show $C_1 = 0$, we utilize the fact that the method must converge also for the initial-value problem

$y' = 1, \qquad y(0) = 0,$

with the exact solution $y(x) \equiv x$. We already know that $C_0 = \psi(1) = 0$. By Theorem (7.2.10.1) the stability condition (7.2.9.4) holds, hence $\lambda = 1$ is only a simple zero of $\psi$, i.e., $\psi'(1) \ne 0$; the constant

$K := \frac{\chi(1)}{\psi'(1)}$

is thus well defined. With the starting values

$\eta_j := jhK, \qquad j = 0, 1, \ldots, r-1,$

we have for the initial-value problem $y' = 1$, $y(0) = 0$, in view of $y(x_j) = x_j = jh$,

$\eta_j = y(x_j) + \varepsilon_j$ with $\varepsilon_j := jh(K-1)$, $j = 0, 1, \ldots, r-1$,

whereby

$\lim_{h\to 0}\varepsilon_j = 0$ for $j = 0, 1, \ldots, r-1.$
The method, with these starting values, yields a sequence $\eta_j$ for which

(7.2.11.6)  $\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h\chi(1).$

Through substitution in (7.2.11.6), and observing $C_0 = 0$, one easily sees that

$\eta_j = jhK$ for all $j$.

Now, $\eta_n = \eta(x; h_n)$, $h_n := x/n$. By virtue of the convergence of the method, therefore,

$x = y(x) = \lim_{n\to\infty}\eta(x; h_n) = \lim_{n\to\infty}\eta_n = \lim_{n\to\infty}nh_nK = xK.$

Consequently $K = 1$, and thus

$C_1 = \psi'(1) - \chi(1) = 0.$  □
Together with Theorems (7.2.10.1), (7.2.10.3) this yields the following:

(7.2.11.7) Theorem. A linear multistep method is convergent for all $f \in F_1(a,b)$ if and only if it satisfies the stability condition (7.2.9.4) for $\psi$ and is consistent [i.e., $\psi(1) = 0$, $\psi'(1) - \chi(1) = 0$].
The following theorem gives a convenient means of determining the order of a linear multistep method:

(7.2.11.8) Theorem. A linear multistep method is a method of order $p$ if and only if the function $\varphi(\mu) := \psi(\mu)/\ln(\mu) - \chi(\mu)$ has $\mu = 1$ as a $p$-fold zero.

PROOF. Put $z(x) := e^x$ in $L[z(x); h]$ of (7.2.11.2). Then for a method of order $p$,

$L[e^x; h] = O(h^{p+1}).$

On the other hand,

$L[e^x; h] = e^x\bigl[\psi(e^h) - h\chi(e^h)\bigr] = e^x\,h\,\varphi(e^h).$

We thus have a method of order $p$ precisely if

$\varphi(e^h) = O(h^p),$

i.e., if $h = 0$ is a $p$-fold zero of $\varphi(e^h)$, or, in other words, if $\varphi(\mu)$ has the $p$-fold zero $\mu = 1$.  □
This theorem suggests the following procedure: For given constants $a_0, a_1, \ldots, a_{r-1}$, suppose additional constants $b_0, b_1, \ldots, b_r$ are to be determined so that the resulting multistep method has the largest possible order. To this end, the function $\psi(\mu)/\ln(\mu)$, which for consistent methods is holomorphic in a neighborhood of $\mu = 1$, is expanded in a Taylor series about $\mu = 1$:

(7.2.11.9)  $\frac{\psi(\mu)}{\ln(\mu)} = c_0 + c_1(\mu - 1) + \cdots + c_{r-1}(\mu - 1)^{r-1} + c_r(\mu - 1)^r + \cdots.$

Choosing

(7.2.11.10)  $\chi(\mu) := c_0 + c_1(\mu - 1) + \cdots + c_r(\mu - 1)^r =: b_0 + b_1\mu + \cdots + b_r\mu^r$

then gives rise to a corrector method of order at least $r+1$. Taking

$\chi(\mu) := c_0 + c_1(\mu - 1) + \cdots + c_{r-1}(\mu - 1)^{r-1} = b_0 + b_1\mu + \cdots + b_{r-1}\mu^{r-1} + 0\cdot\mu^r$

one obtains a predictor method of order at least $r$.
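This construction lends itself to a short symbolic computation. The sketch below (not from the original text) assumes SymPy is available; the helper name corrector_coefficients is hypothetical. It expands $\psi(\mu)/\ln\mu$ about $\mu = 1$ as in (7.2.11.9) and returns the coefficients $b_0, \ldots, b_r$ of the corrector $\chi$ from (7.2.11.10).

```python
import sympy as sp

mu = sp.symbols('mu')

def corrector_coefficients(a, r):
    """Given a_0, ..., a_{r-1} (with a_r = 1), expand psi(mu)/ln(mu)
    about mu = 1 and read off chi(mu) as in (7.2.11.9)/(7.2.11.10)."""
    psi = mu**r + sum(a[i] * mu**i for i in range(r))
    t = sp.symbols('t')                                   # t = mu - 1
    ser = sp.series(psi.subs(mu, 1 + t) / sp.log(1 + t), t, 0, r + 1).removeO()
    chi = ser.subs(t, mu - 1)                             # keep terms up to (mu-1)^r
    return sp.Poly(sp.expand(chi), mu).all_coeffs()[::-1]  # b_0, ..., b_r

# r = 1, psi(mu) = mu - 1: trapezoidal rule, b = [1/2, 1/2]
print(corrector_coefficients([-1], 1))
# r = 2, psi(mu) = mu^2 - mu: Adams-Moulton, b = [-1/12, 2/3, 5/12]
print(corrector_coefficients([0, -1], 2))
```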
In order to achieve methods of still higher order, one could think of further determining the constants $a_0, \ldots, a_{r-1}$ in such a way that in (7.2.11.9)

(7.2.11.11)  $\psi(1) = 1 + a_{r-1} + \cdots + a_0 = 0, \qquad c_{r+1} = c_{r+2} = \cdots = c_{2r-1} = 0.$

The choice (7.2.11.10) for $\chi(\mu)$ would then lead to a corrector method of order $2r$. Unfortunately, the methods so obtained are no longer convergent, since the polynomials $\psi$ for which (7.2.11.11) holds no longer satisfy the stability condition (7.2.9.4). Dahlquist (1956, 1959) was able to show that an $r$-step method which satisfies the stability condition (7.2.9.4) has order

$p \le \begin{cases} r+1 & \text{if } r \text{ is odd,} \\ r+2 & \text{if } r \text{ is even} \end{cases}$

[cf. Section 7.2.8].
EXAMPLE. The consistent method of maximum order, for $r = 2$, is obtained by setting

$\psi(\mu) := (\mu - 1)(\mu - a) = \mu^2 - (1+a)\mu + a$

with a real parameter $a$. Taylor expansion of $\psi(\mu)/\ln(\mu)$ about $\mu = 1$ yields

$\frac{\psi(\mu)}{\ln(\mu)} = 1 - a + \frac{3-a}{2}(\mu - 1) + \frac{a+5}{12}(\mu - 1)^2 - \frac{1+a}{24}(\mu - 1)^3 + \cdots.$

Putting

$\chi(\mu) := 1 - a + \frac{3-a}{2}(\mu - 1) + \frac{a+5}{12}(\mu - 1)^2,$

the resulting linear multistep method has order 3 for $a \ne -1$, and order 4 for $a = -1$. Since $\psi(\mu) = (\mu - 1)(\mu - a)$, the stability condition (7.2.9.4) is satisfied only for $-1 \le a < 1$. In particular, for $a = 0$, one obtains

$\psi(\mu) = \mu^2 - \mu, \qquad \chi(\mu) = \tfrac{5}{12}\mu^2 + \tfrac{8}{12}\mu - \tfrac{1}{12}.$

This is just the Adams-Moulton method (7.2.6.6) for $q = 2$, which therefore has order 3. For $a = -1$ one obtains

$\psi(\mu) = \mu^2 - 1, \qquad \chi(\mu) = \tfrac{1}{3}\mu^2 + \tfrac{4}{3}\mu + \tfrac{1}{3},$

which corresponds to Milne's method (7.2.6.10) for $q = 2$ and has order 4 [see also Exercise 11].
One should not overlook that for multistep methods of order $p$ the integration error is of order $O(h^p)$ only if the solution $y(x)$ of the differential equation is at least $p+1$ times differentiable [$f \in F_p(a,b)$].
7.2.12 Asymptotic Expansions of the Global Discretization Error for Linear Multistep Methods
In analogy to Section 7.2.3, one can also try to find asymptotic expansions
in the stepsize h for approximate solutions generated by multistep methods.
There arise, however, a number of difficulties.
To begin with, the approximate solution $\eta(x; h)$, and thus certainly also its asymptotic expansion (if it exists), will depend on the starting values used. Furthermore, it is not necessarily true that an asymptotic expansion of the form [cf. (7.2.3.3)]

(7.2.12.1)  $\eta(x; h) = y(x) + h^p e_p(x) + h^{p+1}e_{p+1}(x) + \cdots + h^N e_N(x) + h^{N+1}E_{N+1}(x; h)$

exists for all $h = h_n := (x - x_0)/n$, with functions $e_i(x)$ independent of $h$ and a remainder term $E_{N+1}(x; h)$ which is bounded in $h$ for each $x$.
We show this for a simple linear multistep method, the midpoint rule [see (7.2.6.9)], i.e.,

(7.2.12.2)  $\eta_{j+1} = \eta_{j-1} + 2hf(x_j, \eta_j), \qquad x_j = x_0 + jh, \qquad j = 1, 2, \ldots.$

We will use this method to treat the initial-value problem

$y' = -y, \qquad x_0 = 0, \qquad y_0 = y(0) = 1,$

whose exact solution is $y(x) = e^{-x}$. We take the starting values

$\eta_0 := 1, \qquad \eta_1 := 1 - h$

[$\eta_1$ is the approximate value for $y(x_1) = e^{-h}$ obtained by Euler's polygon method (7.2.1.3)]. Beginning with these starting values, the sequence $\{\eta_j\}$
-- and hence the function $\eta(x; h)$ for all $x \in R_h = \{x_j = jh \mid j = 0, 1, \ldots\}$ -- is then defined by (7.2.12.2) through

$\eta(x; h) := \eta_j = \eta_{x/h}$ if $x = x_j = jh$.

According to (7.2.12.2), since $f(x_j, \eta_j) = -\eta_j$, the $\eta_j$ satisfy the following difference equation:

$\eta_{j+1} + 2h\eta_j - \eta_{j-1} = 0, \qquad j = 1, 2, \ldots.$

With the help of Theorem (7.2.9.8), the $\eta_j$ can be obtained explicitly: The polynomial

$\psi(\mu) = \mu^2 + 2h\mu - 1$

has the zeros

$\lambda_1 = -h + \sqrt{1+h^2}, \qquad \lambda_2 = -h - \sqrt{1+h^2}.$

Therefore, by (7.2.9.8),

(7.2.12.3)  $\eta_j = c_1\lambda_1^j + c_2\lambda_2^j, \qquad j = 0, 1, \ldots,$

where the constants $c_1$, $c_2$ can be determined by means of the starting values $\eta_0 = 1$, $\eta_1 = 1 - h$. One finds

$\eta_0 = 1 = c_1 + c_2, \qquad \eta_1 = 1 - h = c_1\lambda_1 + c_2\lambda_2,$

and thus

$c_1 = c_1(h) = \frac{1 + \sqrt{1+h^2}}{2\sqrt{1+h^2}}, \qquad c_2 = c_2(h) = \frac{\sqrt{1+h^2} - 1}{2\sqrt{1+h^2}}.$

Consequently, for $x \in R_h$, $h \ne 0$,

(7.2.12.4)  $\eta(x; h) = c_1(h)\,[\lambda_1(h)]^{x/h} + (-1)^{x/h}c_2(h)\,[\lambda_1(-h)]^{x/h}.$
One easily verifies that

$\varphi_1(h) := c_1(h)\,[\lambda_1(h)]^{x/h}$

is an analytic function of $h$ in $|h| < 1$. Furthermore, $\varphi_1(-h) = \varphi_1(h)$, since, evidently, $c_1(-h) = c_1(h)$ and $\lambda_1(-h) = 1/\lambda_1(h)$. The second term in (7.2.12.4) exhibits more complicated behavior. We have $(-1)^{x/h}\varphi_2(h)$ with

$\varphi_2(h) := c_2(h)\,[\lambda_1(-h)]^{x/h} = c_2(h)\,[\lambda_1(h)]^{-x/h},$

an analytic function in $|h| < 1$. As above, one sees that $\varphi_2(-h) = \varphi_2(h)$. It follows that $\varphi_1$ and $\varphi_2$, for $|h| < 1$, have convergent power-series expansions of the form

$\varphi_1(h) = u_0(x) + u_1(x)h^2 + u_2(x)h^4 + \cdots,$
$\varphi_2(h) = v_0(x) + v_1(x)h^2 + v_2(x)h^4 + \cdots,$
with certain analytic functions $u_j(x)$, $v_j(x)$. The first terms in these series can easily be found from the explicit formulas for $c_i(h)$, $\lambda_i(h)$:

$v_0(x) = 0, \qquad u_1(x) = \frac{e^{-x}}{4}\Bigl[-1 + \frac{2x}{3}\Bigr], \qquad v_1(x) = \frac{e^{x}}{4}.$
Therefore, $\eta(x; h)$ has an expansion of the form

(7.2.12.5)  $\eta(x; h) = y(x) + \sum_{k=1}^{\infty} h^{2k}\bigl[u_k(x) + (-1)^{x/h}v_k(x)\bigr]$ for $h = \frac{x}{n}$, $n = 1, 2, \ldots.$

Because of the oscillating term $(-1)^{x/h}$, which depends on $h$, this is not an asymptotic expansion of the form (7.2.12.1).
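The oscillation is easy to observe numerically. The following sketch (not from the original text) integrates $y' = -y$ with the midpoint rule (7.2.12.2) and the Euler starting step, and prints the error at $x = 1$ for even and odd $n$; the sign flip reflects the factor $(-1)^{x/h}$ in (7.2.12.5).

```python
import math

def midpoint_rule(x, n):
    """Midpoint rule (7.2.12.2) for y' = -y, y(0) = 1, with Euler start."""
    h = x / n
    eta_prev, eta = 1.0, 1.0 - h            # eta_0, eta_1
    for j in range(1, n):
        eta_prev, eta = eta, eta_prev + 2 * h * (-eta)
    return eta                               # approximation to y(x) = e^{-x}

x = 1.0
for n in (10, 11, 20, 21, 40, 41):
    err = midpoint_rule(x, n) - math.exp(-x)
    # the sign of the leading O(h^2) error alternates between even and odd n
    print(f"n = {n:3d}   error = {err: .3e}")
```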
Restricting the choice of $h$ in such a way that $x/h$ is always even, or always odd, one obtains true asymptotic expansions

(7.2.12.6)  $\eta(x; h) = y(x) + \sum_{k=1}^{\infty} h^{2k}\bigl[u_k(x) + v_k(x)\bigr]$ for $h = \frac{x}{2n}$, $n = 1, 2, \ldots,$

  $\eta(x; h) = y(x) + \sum_{k=1}^{\infty} h^{2k}\bigl[u_k(x) - v_k(x)\bigr]$ for $h = \frac{x}{2n-1}$, $n = 1, 2, \ldots.$
Computing $\eta_1$ with the Runge-Kutta method (7.2.1.14) instead of Euler's method, one obtains the starting values

$\eta_0 := 1, \qquad \eta_1 := 1 - h + \frac{h^2}{2} - \frac{h^3}{6} + \frac{h^4}{24}.$

For $c_1$ and $c_2$ one finds in the same way as before

$c_2 = c_2(h) = \frac{\sqrt{1+h^2} - 1 - \dfrac{h^2}{2} + \dfrac{h^3}{6} - \dfrac{h^4}{24}}{2\sqrt{1+h^2}}, \qquad c_1 = c_1(h) = 1 - c_2(h).$
Since $c_1(h)$ and $c_2(h)$, and thus $\eta(x; h)$, are no longer even functions of $h$, there will be no expansion of the form (7.2.12.5) for $\eta(x; h)$, but merely an expansion of the type

$\eta(x; h) = y(x) + \sum_{k\ge 2} h^{k}\bigl[u_k(x) + (-1)^{x/h}v_k(x)\bigr],$

in which odd powers of $h$ also occur. The form of the asymptotic expansion thus depends critically on the starting values used.
In general, we have the following theorem due to Gragg (1965) [see
Hairer and Lubich (1984) for a short proof]:
(7.2.12.7) Theorem. Let f E F2N +2(a, b), and y(x) be the exact solution
of the initial-value problem
y' = f(x,y),
y(xo) = Yo,
Xo E [a,b].
For $x \in R_h = \{x_0 + ih \mid i = 0, 1, \ldots\}$ let $\eta(x; h)$ be defined by

(7.2.12.8)  $\eta(x_0; h) := y_0,$
  $\eta(x_0 + h; h) := y_0 + hf(x_0, y_0),$
  $\eta(x + h; h) := \eta(x - h; h) + 2hf(x, \eta(x; h)).$

Then $\eta(x; h)$ has an expansion of the form

(7.2.12.9)  $\eta(x; h) = y(x) + \sum_{k=1}^{N} h^{2k}\bigl[u_k(x) + (-1)^{(x-x_0)/h}v_k(x)\bigr] + h^{2N+2}E_{2N+2}(x; h),$

valid for $x \in [a,b]$ and all $h = (x - x_0)/n$, $n = 1, 2, \ldots$. The functions $u_k(x)$, $v_k(x)$ are independent of $h$. The remainder term $E_{2N+2}(x; h)$ for fixed $x$ remains bounded for all $h = (x - x_0)/n$, $n = 1, 2, \ldots$.

We note here explicitly that Theorem (7.2.12.7) holds also for systems of differential equations (7.0.3): $f$, $y_0$, $\eta$, $y$, $u_k$, $v_k$, etc., are then to be understood as vectors.
Under the assumptions of Theorem (7.2.12.7) the error $e(x; h) := \eta(x; h) - y(x)$ in first approximation is equal to

$h^2\bigl[u_1(x) + (-1)^{(x-x_0)/h}v_1(x)\bigr].$

In view of the term $(-1)^{(x-x_0)/h}$, it shows an oscillating behavior. One says, for this reason, that the midpoint rule is "weakly unstable". The principal oscillating term $(-1)^{(x-x_0)/h}v_1(x)$ can be removed by a trick. Set [Gragg (1965)]

(7.2.12.10)  $S(x; h) := \tfrac12\bigl[\eta(x; h) + \eta(x - h; h) + hf(x, \eta(x; h))\bigr],$

where $\eta(x; h)$ is defined by (7.2.12.8). On account of (7.2.12.8), one now has

$\eta(x + h; h) = \eta(x - h; h) + 2hf(x, \eta(x; h)),$

and hence also

(7.2.12.11)  $S(x; h) = \tfrac12\bigl[\eta(x; h) + \eta(x + h; h) - hf(x, \eta(x; h))\bigr].$
Addition of (7.2.12.10) and (7.2.12.11) yields

(7.2.12.12)  $S(x; h) = \tfrac12\bigl[\eta(x; h) + \tfrac12\eta(x - h; h) + \tfrac12\eta(x + h; h)\bigr].$

By (7.2.12.9), one thus obtains for $S$ an expansion of the form

$S(x; h) = \tfrac12\Bigl\{ y(x) + \tfrac12\bigl[y(x+h) + y(x-h)\bigr] + \sum_{k=1}^{N} h^{2k}\Bigl[u_k(x) + \tfrac12\bigl(u_k(x+h) + u_k(x-h)\bigr) + (-1)^{(x-x_0)/h}\Bigl(v_k(x) - \tfrac12\bigl(v_k(x+h) + v_k(x-h)\bigr)\Bigr)\Bigr]\Bigr\} + O(h^{2N+2}).$
Expanding $y(x \pm h)$ and the coefficient functions $u_k(x \pm h)$, $v_k(x \pm h)$ in Taylor series in $h$, one finally obtains for $S(x; h)$ an expansion of the form

(7.2.12.13)  $S(x; h) = y(x) + h^2\bigl[u_1(x) + \tfrac14 y''(x)\bigr] + \sum_{k=2}^{N} h^{2k}\bigl[u_k(x) + (-1)^{(x-x_0)/h}v_k(x)\bigr] + O(h^{2N+2}),$

in which the leading error term no longer contains an oscillating factor.
7.2.13 Practical Implementation of Multistep Methods
The unpredictable behavior of solutions of differential equations forces the
numerical integration to proceed with stepsizes which, in general, must vary
from point to point if a prescribed error bound is to be maintained. Multistep methods which use equidistant nodes Xi and a fixed order, therefore,
are not very suitable in practice. Every change of the steplength requires
a recomputation of starting data [cf. Section 7.2.6] or entails a complicated interpolation process, which severely reduces the efficiency of these
integration methods.
In order to construct efficient multistep methods, the equal spacing of
the nodes Xi must be given up. We briefly sketch the construction of such
methods. Our point of departure is again Equation (7.2.6.2), which was
obtained by formal integration of y' = f(x, y):
As in (7.2.6.3) we replace the integrand by an interpolating polynomial Qq
of degree q, but use here Newton's interpolation formula (2.1.3.4), which, for
the purpose of step control, offers some advantages. With the interpolating
polynomial one first obtains an approximation formula
and from this the recurrence formula
with

$Q_i(x) = (x - x_p)\cdots(x - x_{p-i+1}), \qquad Q_0(x) \equiv 1.$
In the case $k = 1$, $j = 0$, $q = 0, 1, 2, \ldots$, one obtains an "explicit" formula (predictor)

(7.2.13.1)  $\eta_{p+1} = \eta_p + \sum_{i=0}^{q} g_i\, f[x_p, \ldots, x_{p-i}],$

where

$g_i := \int_{x_p}^{x_{p+1}} Q_i(x)\,dx, \qquad i = 0, 1, \ldots, q.$

For $k = 0$, $j = 1$, $q = 0, 1, 2, \ldots$, and replacing $p$ by $p+1$, there results an "implicit" formula (corrector)

(7.2.13.2)  $\eta_{p+1} = \eta_p + \sum_{i=0}^{q} g_i^{*}\, f[x_{p+1}, \ldots, x_{p-i+1}],$

with

$g_0^{*} := \int_{x_p}^{x_{p+1}} dx, \qquad g_i^{*} := \int_{x_p}^{x_{p+1}} (x - x_{p+1})(x - x_p)\cdots(x - x_{p-i+2})\,dx, \quad i > 0.$
The approximation $\eta_{p+1} =: \eta_{p+1}^{(0)}$ from the predictor formula can be used, as in Section 7.2.6, as "starting value" for the iteration on the corrector formula. However, we carry out only one iteration step and denote the resulting approximation by $\eta_{p+1}^{(1)}$. From the difference of the two approximate values we can again derive statements about the error and stepsize.
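The variable-step, variable-order scheme described here is intricate. As a much simplified illustration (not the scheme of the text: equidistant nodes, fixed order $q = 1$), the sketch below shows the predictor-corrector principle with a single corrector iteration and the error indicator obtained from the predictor-corrector difference.

```python
import math

def pece_step(f, x, h, eta, f_prev):
    """One fixed-step PECE step: 2-step Adams-Bashforth predictor,
    one iteration of the Adams-Moulton (trapezoidal) corrector."""
    f_cur = f(x, eta)
    eta_p = eta + h * (1.5 * f_cur - 0.5 * f_prev)       # predictor eta^{(0)}
    eta_c = eta + 0.5 * h * (f(x + h, eta_p) + f_cur)    # corrector eta^{(1)}
    err_est = abs(eta_c - eta_p)     # cf. (7.2.13.3): proportional to the local error
    return eta_c, f_cur, err_est

# y' = -y, y(0) = 1; starting value eta_1 taken from the exact solution
h, eta, f_prev, x = 0.05, math.exp(-0.05), -1.0, 0.05
while x < 1.0 - 1e-12:
    eta, f_prev, err = pece_step(lambda t, y: -y, x, h, eta, f_prev)
    x += h
print(eta, math.exp(-1.0))
```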
Subtracting the predictor formula from the corrector formula gives

$\eta_{p+1}^{(1)} - \eta_{p+1}^{(0)} = \int_{x_p}^{x_{p+1}} \bigl[Q_p^{(1)}(x) - Q_p^{(0)}(x)\bigr]\,dx.$

$Q_p^{(1)}(x)$ is the interpolation polynomial of the corrector formula (7.2.13.2), involving the points

$(x_{p+1}, f_{p+1}^{(1)}),\ (x_p, f_p),\ \ldots,\ (x_{p-q+1}, f_{p-q+1})$ with $f_{p+1}^{(1)} = f(x_{p+1}, \eta_{p+1}^{(0)}).$

$Q_p^{(0)}(x)$ is the interpolation polynomial of the predictor formula, with points $(x_p, f_p), \ldots, (x_{p-q}, f_{p-q})$. Defining $f_{p+1}^{(0)} := Q_p^{(0)}(x_{p+1})$, then $Q_p^{(0)}(x)$ is also determined uniquely by the points

$(x_{p+1}, f_{p+1}^{(0)}),\ (x_p, f_p),\ \ldots,\ (x_{p-q+1}, f_{p-q+1}).$

The difference $Q_p^{(1)}(x) - Q_p^{(0)}(x)$ therefore vanishes at the nodes $x_{p-q+1}, \ldots, x_p$, and at $x_{p+1}$ has the value $f_{p+1}^{(1)} - f_{p+1}^{(0)}$. Therefore,

(7.2.13.3)  $\eta_{p+1}^{(1)} - \eta_{p+1}^{(0)} = c_q\cdot\bigl(f_{p+1}^{(1)} - f_{p+1}^{(0)}\bigr).$
Now let $y(x)$ denote the exact solution of the differential equation $y' = f(x,y)$ with initial value $y(x_p) = \eta_p$. We assume that the "past" approximate values $\eta_{p-i} = y(x_{p-i})$, $i = 0, 1, \ldots, q$, are "exact", and we make the additional assumption that $f(x, y(x))$ is exactly represented by the polynomial $Q_{p+1}^{(1)}(x)$ with interpolation points

$(x_{p+1}, f(x_{p+1}, y(x_{p+1}))),\ (x_p, f_p),\ \ldots,\ (x_{p-q}, f_{p-q}).$

Then

$y(x_{p+1}) = \eta_p + \int_{x_p}^{x_{p+1}} f(x, y(x))\,dx.$
For the error

$E_q := y(x_{p+1}) - \eta_{p+1}^{(1)}$

one obtains a corresponding integral representation. In the first term, $Q_p^{(0)}(x)$ may be replaced by $Q_{p+1}^{(1)}(x)$ if one takes as interpolation points

$(x_{p+1}, f_{p+1}^{(1)}),\ (x_p, f_p),\ \ldots,\ (x_{p-q}, f_{p-q}).$

In analogy to (7.2.13.3) one then obtains an expression for $E_q$. If now the nodes $x_{p+1}, x_p, \ldots, x_{p-q}$ are equidistant and $h := x_{j+1} - x_j$, one has for the error

$E_q = y(x_{p+1}) - \eta_{p+1}^{(1)} = O(h^{q+2}) \approx C\,h^{q+2}.$
The further development now follows the previous pattern [cf. Section 7.2.5]. Let $\varepsilon$ be a prescribed error bound. The "old" step $h_{\text{old}}$ is accepted as "successful" if

$|E_q| \le \varepsilon.$

The new step $h_{\text{new}}$ is considered "successful" if we can expect

$|E_q^{\text{new}}| \approx C\,h_{\text{new}}^{\,q+2} \le \varepsilon.$

Elimination of $C$ again yields

$h_{\text{new}} \approx h_{\text{old}}\Bigl(\frac{\varepsilon}{|E_q|}\Bigr)^{\frac{1}{q+2}}.$

This strategy can still be combined with a change of $q$ (variable-order method). One computes the three quantities

$\Bigl(\frac{\varepsilon}{|E_{q-1}|}\Bigr)^{\frac{1}{q+1}}, \qquad \Bigl(\frac{\varepsilon}{|E_{q}|}\Bigr)^{\frac{1}{q+2}}, \qquad \Bigl(\frac{\varepsilon}{|E_{q+1}|}\Bigr)^{\frac{1}{q+3}},$

and determines their maximum. If the first quantity is the maximum, one reduces $q$ by 1. If the second is the maximum, one holds on to $q$. If the
third is the maximum, one increases $q$ by 1. The important point to note is that the quantities $E_{q-1}$, $E_q$, $E_{q+1}$ can be computed recursively from the table of divided differences. One has

$E_{q-1} = g_{q-1,2}\, f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-q+1}],$
$E_{q} = g_{q,2}\, f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-q}],$
$E_{q+1} = g_{q+1,2}\, f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-q-1}],$

where $f^{(1)}[x_{p+1}, x_p, \ldots, x_{p-i}]$ are the divided differences relative to the interpolation points $(x_{p+1}, f_{p+1}^{(1)})$, $(x_p, f_p)$, ..., $(x_{p-i}, f_{p-i})$, and the $g_{ij}$, $i, j \ge 1$, are defined accordingly. The quantities $g_{ij}$ satisfy the recursion

$g_{ij} = (x_{p+1} - x_{p+1-i})\,g_{i-1,j} + g_{i-1,j+1}, \qquad j = 1, 2, \ldots, q+2-i, \quad i = 2, 3, \ldots, q,$

with the initial values $g_{1j}$, $j = 1, \ldots, q+1$.
The method discussed is "self-starting"; one begins with $q = 0$, then raises $q$ in the next step to $q = 1$, etc. The "initialization" [cf. Section 7.2.6] required for multistep methods with equidistant nodes and fixed $q$ becomes unnecessary.
For a more detailed study we refer to the relevant literature, for example, Hairer, Nørsett, and Wanner (1993); Shampine and Gordon (1975); Gear (1971); and Krogh (1974).
7.2.14 Extrapolation Methods for the Solution of the Initial-Value Problem
As discussed in Section 3.4, asymptotic expansions suggest the application of extrapolation methods. Particularly effective extrapolation methods
result from discretization methods which have asymptotic expansions containing only even powers of h. Observe that this is the case for the midpoint
rule or the modified midpoint rule [see (7.2.12.8), (7.2.12.9) or (7.2.12.10),
(7.2.12.13)].
In practice, one uses especially Gragg's function $S(x; h)$ of (7.2.12.10), whose definition we repeat here because of its importance: Given the triple $(f, x_0, y_0)$, a real number $H$, and a natural number $n > 0$, define $x := x_0 + H$, $h := H/n$. For the initial-value problem
$y' = f(x, y), \qquad y(x_0) = y_0,$

with exact solution $y(x)$, one then defines the function value $S(x; h)$ in the following way:

(7.2.14.1)  $\eta_0 := y_0,$
  $\eta_1 := \eta_0 + hf(x_0, \eta_0), \qquad x_1 := x_0 + h,$
  for $j = 1, 2, \ldots, n-1$:  $\eta_{j+1} := \eta_{j-1} + 2hf(x_j, \eta_j), \qquad x_{j+1} := x_j + h,$
  $S(x; h) := \tfrac12\bigl[\eta_n + \eta_{n-1} + hf(x_n, \eta_n)\bigr].$
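A minimal sketch of (7.2.14.1), not from the original text, for a scalar problem; the function name gragg is a hypothetical helper used again in the extrapolation sketch further below.

```python
import math

def gragg(f, x0, y0, H, n):
    """Gragg's function S(x; h) of (7.2.14.1) for y' = f(x, y), y(x0) = y0,
    with x = x0 + H and h = H/n."""
    h = H / n
    eta_prev = y0
    eta = y0 + h * f(x0, y0)                 # Euler starting step
    x = x0 + h
    for j in range(1, n):                    # midpoint steps eta_2, ..., eta_n
        eta_prev, eta = eta, eta_prev + 2 * h * f(x, eta)
        x += h
    return 0.5 * (eta + eta_prev + h * f(x, eta))

print(gragg(lambda x, y: -y, 0.0, 1.0, 1.0, 8), math.exp(-1.0))
```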
In an extrapolation method for the computation of $y(x)$ [see Sections 3.4, 3.5], one then has to select a sequence of natural numbers

$F = \{n_0, n_1, n_2, \ldots\}, \qquad 0 < n_0 < n_1 < n_2 < \cdots,$

and for $h_i := H/n_i$ to compute the values $S(x; h_i)$. Because of the oscillation term $(-1)^{(x-x_0)/h}$ in (7.2.12.13), however, $F$ must contain only even or only odd numbers. Usually, one takes the sequence

(7.2.14.2)  $F = \{2, 4, 6, 8, 12, 16, \ldots\}, \qquad n_i := 2n_{i-2}$ for $i \ge 3.$

As described in Section 3.4, beginning with the 0th column, one then computes by means of interpolation formulas a tableau of numbers $T_{ik}$, one upward diagonal after another:

(7.2.14.3)  $S(x; h_0) =: T_{00}$
  $S(x; h_1) =: T_{10} \qquad T_{11}$
  $S(x; h_2) =: T_{20} \qquad T_{21} \qquad T_{22}$
  $\cdots$

Here

$T_{ik} := \tilde T_{ik}(0)$

is just the value of the interpolating polynomial (rational functions would be better) $\tilde T_{ik}(h)$ of degree $k$ in $h^2$ with $\tilde T_{ik}(h_j) = S(x; h_j)$ for $j = i, i-1, \ldots, i-k$. As shown in Section 3.5, each column of (7.2.14.3) converges to $y(x)$:
$\lim_{i\to\infty} T_{ik} = y(x)$ for $k = 0, 1, \ldots.$

In particular, for $k$ fixed, the $T_{ik}$ for $i \to \infty$ converge to $y(x)$ like a method of order $2k+2$. We have in first approximation, by virtue of (7.2.12.13) [see (3.5.10)],

$T_{ik} - y(x) \approx (-1)^k\,h_i^2 h_{i-1}^2\cdots h_{i-k}^2\,\bigl[u_{k+1}(x) + v_{k+1}(x)\bigr].$

One can further, as described in Section 3.5, exploit the monotonic behavior of the $T_{ik}$ in order to compute explicit estimates of the error $T_{ik} - y(x)$.
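A small sketch of the tableau computation (not from the original text), reusing the gragg helper from the sketch above and extrapolating polynomially in $h^2$ by a Neville-type recursion; the step-number sequence follows (7.2.14.2).

```python
import math

def extrapolate(f, x0, y0, H, F=(2, 4, 6, 8, 12, 16)):
    """Builds the tableau (7.2.14.3) from S(x; h_i), h_i = H/n_i,
    and returns the last diagonal element T_{ii}."""
    T = []                                        # T[i][k] ~ T_ik
    for i, n in enumerate(F):
        row = [gragg(f, x0, y0, H, n)]            # T_{i0} = S(x; h_i)
        for k in range(1, i + 1):
            q = (F[i] / F[i - k]) ** 2            # (h_{i-k}/h_i)^2
            row.append(row[k - 1] + (row[k - 1] - T[i - 1][k - 1]) / (q - 1))
        T.append(row)
    return T[-1][-1]

print(extrapolate(lambda x, y: -y, 0.0, 1.0, 1.0), math.exp(-1.0))
```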
Once a sufficiently accurate $T_{ik} =: \bar\eta$ is found, $\bar\eta$ will be accepted as an approximate value for $y(x)$. Thereafter, one can compute in the same way an approximation to $y(\bar x)$ at an additional point $\bar x = x + \bar H$ by replacing $x_0$, $y_0$, $H$ by $x$, $\bar\eta$, $\bar H$ and by solving the new initial-value problem again, just as described.
We wish to emphasize that the extrapolation method can be applied also for the solution of an initial-value problem (7.0.3), (7.0.4) for systems of $n$ ordinary differential equations. In this case, $f(x,y)$ and $y(x)$ are vectors of functions, and $y_0$, $\eta_i$, and $S(x; h)$ in (7.2.14.1) are vectors in $\mathbb{R}^n$. The asymptotic expansions (7.2.12.9) and (7.2.12.13) remain valid, and they mean that each component of $S(x; h) \in \mathbb{R}^n$ has an asymptotic expansion of the form indicated. The elements $T_{ik}$ of (7.2.14.3), likewise, are vectors in $\mathbb{R}^n$ and are computed as before, component by component, from the corresponding components of $S(x; h_i)$.
In the practical realization of the method, the problem arises of how large the basic steps $H$ are to be chosen. If one chooses $H$ too large, one must compute a very large tableau (7.2.14.3) before a sufficiently accurate $T_{ik}$ is found; $i$ is a large integer, and to find $T_{ik}$ one has to compute $S(x; h_j)$ for $j = 0, 1, \ldots, i$. The computation of $S(x; h_j)$ requires $n_j + 1$ evaluations of the right-hand side $f(x,y)$ of the differential equation. For the sequence $F$ in (7.2.14.2) chosen above, the numbers $s_i := \sum_{j=0}^{i}(n_j + 1)$, and thus the expenditure of work for a tableau with $i+1$ upward diagonals, grow rapidly with $i$: one has $s_{i+1} \approx 1.4\,s_i$.
If, on the other hand, the step $H$ is too small, one takes unnecessarily small, and hence too many, integration steps $(x_0, y(x_0)) \to (x_0 + H, y(x_0 + H))$. It is therefore very important for the efficiency of the method that one incorporate into the method, as in Section 7.2.5, a mechanism for estimating a reasonable stepsize $H$. This mechanism must accomplish two things:
(1) It must guarantee that any stepsize $H$ which is too large is reduced, so that no unnecessarily large tableau is constructed.
(2) It should propose to the user of the method (of the program) a reasonable stepsize $\bar H$ for the next integration step.
However, we don't want to enter into further discussion of such mechanisms and only remark that one can proceed, in principle, very much as
in Section 7.2.5.
An ALGOL program for the solution of initial-value problems by means
of extrapolation methods can be found in Bulirsch and Stoer (1966).
7.2.15 Comparison of Methods for Solving Initial-Value
Problems
The methods we have described fall into three classes:
(1) one-step methods,
(2) multistep methods,
(3) extrapolation methods.
All methods allow a change of the steplength in each integration step; an adjustment of the respective stepsizes in each of them does not run into any inherent difficulties. The modern multistep methods are not employed with fixed orders; nor are the extrapolation methods. In extrapolation methods, for example, one can easily increase the order by appending another column to the tableau of extrapolated values. One-step methods of the Runge-Kutta-Fehlberg type, by construction, are tied to a fixed order, although methods of variable orders can also be constructed with correspondingly more complicated procedures.
In an attempt to find out the advantages and disadvantages of the various integration schemes, computer programs for the methods mentioned
above have been prepared with the utmost care, and extensive numerical experiments have been conducted with a large number of differential
equations. The result may be described roughly as follows:
The least amount of computation, measured in evaluations of the righthand side of the differential equation, is required by the multistep methods.
In a predictor method the right-hand side of the differential equation must
be evaluated only once per step, while in a corrector method this number
is equal to the (generally small) number of iterations. The expense caused
by the step control in multistep methods, however, destroys this advantage
to a large extent. Multistep methods have the largest amount of overhead
time. Advantages accrue particularly in cases where the right-hand side
of the differential equation is built in a very complicated manner (large
amount of computational work to evaluate the right-hand side). In contrast,
extrapolation methods have the least amount of overhead time, but on the other hand sometimes do not react as "sensitively" as one-step methods or multistep methods to changes in the prescribed accuracy tolerance $\varepsilon$: often results are furnished with more correct digits than necessary. The reliability of extrapolation methods is quite high, but for modest accuracy requirements they no longer work economically (are too expensive).
For modest accuracy requirements, Runge-Kutta-Fehlberg methods with low orders of approximation $p$ are to be preferred. Runge-Kutta-Fehlberg methods of certain orders sometimes react less sensitively to discontinuities in the right-hand side of the differential equation than multistep or extrapolation methods. It is true that in the absence of special precautions, the accuracy at a discontinuity in Runge-Kutta-Fehlberg methods is drastically reduced at first, but afterward these methods continue to
function without disturbances. In certain practical problems, this can be
an advantage.
None of the methods is endowed with such advantages that it can be
preferred over all others (assuming that for all methods computer programs
are used that were developed with the greatest care). The question of which
method ought to be employed for a particular problem depends on so many
factors that it cannot be discussed here in full detail; we must refer for this
to the original papers, e.g., Clark (1968), Crane and Fox (1969), Hull et al.
(1972), Shampine et al. (1976), Diekhoff et al. (1977).
7.2.16 Stiff Differential Equations
In the upper atmosphere, ozone decays under the influence of the radiation
of the sun. This process is described by the formulas
of chemical reaction kinetics. The kinetic parameters k j , j = 1, 2, 3, are
either known from direct measurements or are determined indirectly from
the observed concentrations as functions of time by solving an "inverse
problem". If we denote by Yl = [0 3 ], Y2 = [0], and Y3 = [0 2 ] the concentration of the reacting gases, then according to a simple model of physical
chemistry, the reaction can be described by the following set of ordinary
differential equation [ef. Willoughby (1974)]:
Yl(t) = -k 1 YIY3 + k2Y2Y~ - k3YIY2,
Y2(t) =
k 1YIY3 - k2Y2Y~ - k3YIY2,
Y3(t) = -k 1YIY3 + k2Y2Y~ + k3YIY2.
We simplify this system by assuming that the concentration of molecular oxygen $[\mathrm{O}_2]$ is constant, i.e., $\dot y_3 = 0$, and consider the situation where initially the concentration $[\mathrm{O}]$ of the radical O is zero. Then, after inserting the appropriate constants and rescaling of the $k_j$, one obtains the initial value problem
Yl = -Yl - YIY~ + 294Y2,
Y2 = (Yl - YIY2)/98 - 3Y2,
Yl(O) = 1,
Y2(0) = 0,
Typical for chemical kinetics are the widely different velocities (time scales) of the various elementary reactions. This results in reaction constants $k_j$ and coefficients of the differential equations of very different orders of magnitude. A linearization therefore gives linear differential equations with a matrix having a large spread of eigenvalues [see the Gershgorin theorem (6.9.4)]. One therefore has to expect a solution structure characterized by "boundary layers" and "asymptotic phases". The integration of systems of this kind gives rise to peculiar difficulties. The following example will serve as an illustration [cf. Grigorieff (1972, 1977)].
Suppose we are given the system

(7.2.16.1)  $y_1'(x) = \frac{\lambda_1 + \lambda_2}{2}\,y_1 + \frac{\lambda_1 - \lambda_2}{2}\,y_2,$
  $y_2'(x) = \frac{\lambda_1 - \lambda_2}{2}\,y_1 + \frac{\lambda_1 + \lambda_2}{2}\,y_2,$

with negative constants $\lambda_1$, $\lambda_2$: $\lambda_i < 0$. The general solution is

(7.2.16.2)  $y_1(x) = C_1 e^{\lambda_1 x} + C_2 e^{\lambda_2 x},$
  $y_2(x) = C_1 e^{\lambda_1 x} - C_2 e^{\lambda_2 x}, \qquad x \ge 0,$

where $C_1$, $C_2$ are constants of integration. If (7.2.16.1) is integrated by Euler's method [cf. (7.2.1.3)], the numerical solution can be represented in "closed form" as follows,

(7.2.16.3)  $\eta_{1i} = C_1(1 + h\lambda_1)^i + C_2(1 + h\lambda_2)^i,$
  $\eta_{2i} = C_1(1 + h\lambda_1)^i - C_2(1 + h\lambda_2)^i.$
Evidently, the approximations converge to zero as $i \to \infty$ only if the steplength $h$ is chosen small enough to have

(7.2.16.4)  $|1 + h\lambda_1| < 1, \qquad |1 + h\lambda_2| < 1.$

Now let $|\lambda_2|$ be large compared to $|\lambda_1|$. Since $\lambda_2 < 0$, the influence of the component $e^{\lambda_2 x}$ in (7.2.16.2) is negligibly small in comparison with $e^{\lambda_1 x}$. Unfortunately, this is not true for the numerical integration. In view of (7.2.16.4), indeed, the steplength $h > 0$ must be chosen so small that

$h < \frac{2}{|\lambda_2|}.$

For example, if $\lambda_1 = -1$, $\lambda_2 = -1000$, we must have $h \le 0.002$. Thus, even though $e^{-1000x}$ contributes practically nothing to the solution, the factor 1000 in the exponent severely limits the step size. This behavior in the numerical solution is referred to as "stiffness": It is to be expected if for a system of differential equations $y' = f(x,y)$, $y \in \mathbb{R}^n$, the $n\times n$ matrix $f_y(x,y)$ has eigenvalues $\lambda$ with $\operatorname{Re}\lambda \ll 0$.
Euler's method (7.2.1.3) is hardly suitable for the numerical integration of such systems; the same can be said for the Runge-Kutta-Fehlberg methods, multistep methods, and extrapolation methods discussed earlier.
Appropriate methods for integrating stiff differential equations can be derived from so-called implicit methods. As an example, we may take the implicit Euler method,

(7.2.16.5)  $\eta_{i+1} = \eta_i + h\,f(x_{i+1}, \eta_{i+1}).$

The "new" approximation $\eta_{i+1}$ will have to be determined by iteration. The computational effort thus increases considerably.
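The effect of stiffness on the two Euler methods can be seen directly on the linear example above. The sketch below (not from the original text, assuming NumPy) uses $\lambda_1 = -1$, $\lambda_2 = -1000$ and a step $h$ that violates (7.2.16.4).

```python
import numpy as np

lam1, lam2 = -1.0, -1000.0
A = 0.5 * np.array([[lam1 + lam2, lam1 - lam2],
                    [lam1 - lam2, lam1 + lam2]])    # system (7.2.16.1)
y0 = np.array([2.0, 0.0])                            # C_1 = C_2 = 1
h, n = 0.01, 100                                      # h > 2/|lam2|

I = np.eye(2)
y_expl, y_impl = y0.copy(), y0.copy()
for i in range(n):
    y_expl = y_expl + h * (A @ y_expl)                # eta_{i+1} = (I + hA) eta_i
    y_impl = np.linalg.solve(I - h * A, y_impl)       # eta_{i+1} = (I - hA)^{-1} eta_i

y_exact = np.array([np.exp(lam1) + np.exp(lam2), np.exp(lam1) - np.exp(lam2)])
print("explicit Euler:", y_expl)      # blows up, since |1 + h*lam2| = 9 > 1
print("implicit Euler:", y_impl)      # remains close to the exact solution
print("exact at x = 1:", y_exact)
```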
One finds that many methods, applied with constant steplength h > 0
to the linear system of differential equations
(7.2.16.6)
y' = Ay,
y(O) = Yo,
A a constant n x n matrix, will produce a sequence TJi of approximation
vectors for the solution y(Xi) which satisfy a recurrence formula
(7.2.16.7)
TJi+l = g(hA)TJi.
The function g(z) depends only on the method employed and is usually a
rational function in which it is permissible to substitute a matrix for the
argument.
EXAMPLE 1. For the explicit Euler method (7.2.1.3) one finds

$\eta_{i+1} = \eta_i + hA\eta_i = (I + hA)\eta_i,$ whence $g(z) = 1 + z;$

for the implicit Euler method (7.2.16.5),

$\eta_{i+1} = \eta_i + hA\eta_{i+1}$, i.e., $\eta_{i+1} = (I - hA)^{-1}\eta_i,$ whence $g(z) = \frac{1}{1 - z}.$
If one assumes that the matrix $A$ in (7.2.16.6) has only eigenvalues $\lambda_j$ with $\operatorname{Re}\lambda_j < 0$ [cf. (7.2.16.1)], then the solution $y(x)$ of (7.2.16.6) converges to zero as $x \to \infty$, while the discrete solution $\{\eta_i\}$, by virtue of (7.2.16.7), converges to zero as $i \to \infty$ only for those stepsizes $h > 0$ for which $|g(h\lambda_j)| < 1$ for all eigenvalues $\lambda_j$ of $A$.
Inasmuch as the presence of eigenvalues $\lambda_j$ with $\operatorname{Re}\lambda_j \ll 0$ should not force the employment of small stepsizes $h > 0$, a method will be suitable for the integration of stiff differential equations if it is absolutely stable in the following sense:

(7.2.16.8) Definition. A method (7.2.16.7) is called absolutely stable if $|g(z)| < 1$ for all $z$ with $\operatorname{Re} z < 0$.
A more accurate description of the behavior of a method (7.2.16.7) is provided by its region of absolute stability, by which we mean the set

(7.2.16.9)  $M = \{z \in \mathbb{C} : |g(z)| < 1\}.$

The larger the intersection $M \cap \mathbb{C}_-$ of $M$ with the left half-plane $\mathbb{C}_- = \{z \mid \operatorname{Re} z < 0\}$, the more suitable the method is for the integration of stiff differential equations; it is absolutely stable if $M$ contains $\mathbb{C}_-$.

EXAMPLE 2. The region of absolute stability for the explicit Euler method is

$\{z : |1 + z| < 1\};$

for the implicit Euler method,

$\{z : |1 - z| > 1\}.$

The implicit Euler method is thus absolutely stable; the explicit Euler method is not.
Taking A-stability into account, one can then develop one-step, multistep, and extrapolation methods as in the previous Sections 7.2.1, 7.2.9, and 7.2.14. All these methods are implicit or semi-implicit, since only such methods have a proper rational stability function. All explicit methods considered earlier lead to polynomial stability functions and hence cannot be A-stable. The implicit character of all stable methods for solving stiff differential equations implies that one has to solve a linear system of equations in each step at least once (semi-implicit methods), sometimes even repeatedly, resulting in Newton-type iterative methods. In general, the matrix $E$ of these linear equations contains the Jacobian matrix $f_y = f_y(x,y)$ and usually has the form $E = I - h\gamma f_y$ for some number $\gamma$.
Extrapolation methods

In order to derive an extrapolation method for solving a stiff system of the form* $y' = f(y)$, we note that the stiff part of the solution $y(t)$ can be factored off near $t = x$ by putting $c(t) := e^{-A(t-x)}y(t)$, where $A := f_y(y(x))$. Then

$c'(x) = \bar f(y(x)),$ with $\bar f(y) := f(y) - Ay,$

and Euler's method (7.2.1.3), resp. the midpoint rule (7.2.6.9), lead to the approximations

$c(x+h) \approx y(x) + h\bar f(y(x)),$
$c(x+h) \approx c(x-h) + 2h\bar f(y(x)).$

Using that $c(x \pm h) = e^{\mp Ah}y(x \pm h) \approx (I \mp Ah)\,y(x \pm h)$ we obtain a semi-implicit midpoint rule [cf. (7.2.12.8)]

$\eta(x_0; h) := y_0, \qquad A := f_y(y_0),$
$\eta(x_0 + h; h) := (I - hA)^{-1}\bigl[y_0 + h\bar f(y_0)\bigr],$
$\eta(x + h; h) := (I - hA)^{-1}\bigl[(I + hA)\,\eta(x - h; h) + 2h\bar f(\eta(x; h))\bigr]$

for the computation of an approximate solution $\eta(x; h) \approx y(x)$ of the initial value problem $y' = f(y)$, $y(x_0) = y_0$. This rule was used by Bader and Deuflhard (1983) as a basis for extrapolation methods, and they applied it with great success to problems of chemical reaction kinetics.

* Every differential equation can be reduced to this autonomous form: $\bar y' = \bar f(\bar x, \bar y)$ is equivalent to

$\begin{bmatrix} \bar x \\ \bar y \end{bmatrix}' = \begin{bmatrix} 1 \\ \bar f(\bar x, \bar y) \end{bmatrix}.$
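A minimal sketch of the semi-implicit midpoint rule above (not from the original text, assuming NumPy); for illustration the Jacobian is frozen at $y_0$, and the test problem is a diagonal stiff linear system.

```python
import numpy as np

def semi_implicit_midpoint(f, fy, y0, H, n):
    """Semi-implicit midpoint rule for the autonomous system y' = f(y);
    fy(y) is the Jacobian. Returns eta(x0 + H; h), h = H/n."""
    h = H / n
    y0 = np.asarray(y0, dtype=float)
    A = fy(y0)
    I = np.eye(len(y0))
    fbar = lambda y: f(y) - A @ y
    M = I - h * A
    eta_prev = y0
    eta = np.linalg.solve(M, eta_prev + h * fbar(eta_prev))
    for _ in range(n - 1):
        eta_prev, eta = eta, np.linalg.solve(
            M, (I + h * A) @ eta_prev + 2 * h * fbar(eta))
    return eta

# stiff linear test problem y' = diag(-1, -1000) y, y(0) = (1, 1)
lam = np.array([-1.0, -1000.0])
f = lambda y: lam * y
fy = lambda y: np.diag(lam)
print(semi_implicit_midpoint(f, fy, [1.0, 1.0], 1.0, 10))
print(np.exp(lam))                 # exact solution at x = 1
```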
One-step methods

In analogy to the Runge-Kutta-Fehlberg methods [see Equation (7.2.5.6) ff.], Kaps and Rentrop (1979) construct for stiff autonomous systems $y' = f(y)$ methods that are distinguished by a simple structure, efficiency, and robust step control. They have been tested numerically for stiffness ratios as large as

$\Bigl|\frac{\lambda_{\max}}{\lambda_{\min}}\Bigr| = 10^7$ (with 12-digit computation).

As in (7.2.5.6) one takes

(7.2.16.10)  $y_{i+1} = y_i + h_i\Phi_I(y_i; h), \qquad \hat y_{i+1} = y_i + h_i\Phi_{II}(y_i; h),$

with

(7.2.16.11)  $|\Delta(x, y(x); h) - \Phi_I(y(x); h)| \le N_I\,h^3, \qquad |\Delta(x, y(x); h) - \Phi_{II}(y(x); h)| \le N_{II}\,h^4,$

and

$\Phi_I(y; h) = \sum_{k=1}^{3} c_k f_k(y; h), \qquad \Phi_{II}(y; h) = \sum_{k=1}^{4} \hat c_k f_k(y; h),$

where the $f_k := f_k(y; h)$, $k = 1, 2, 3, 4$, have the form

(7.2.16.12)  $f_k = f\Bigl(y + h\sum_{l=1}^{k-1}\beta_{kl}f_l\Bigr) + h f'(y)\sum_{l=1}^{k}\gamma_{kl}f_l.$

For specified constants, the $f_k$ must be determined iteratively from these systems. The constants obey equations similar to those in (7.2.5.9) ff. Kaps and Rentrop provide the following values:
$\gamma_{kk} = 0.220428410$ for $k = 1, 2, 3, 4$,
$\gamma_{21} = 0.822867461,$
$\gamma_{31} = 0.695700194, \qquad \gamma_{32} = 0,$
$\gamma_{41} = 3.90481342, \qquad \gamma_{42} = 0, \qquad \gamma_{43} = 1,$
$\beta_{21} = -0.554591416,$
$\beta_{31} = 0.252787696, \qquad \beta_{32} = 1,$
$\beta_{41} = \beta_{31}, \qquad \beta_{42} = \beta_{32}, \qquad \beta_{43} = 0,$
$c_1 = -0.162871035, \qquad c_2 = 1.18215360, \qquad c_3 = -0.192825995\times 10^{-1},$
$\hat c_1 = 0.545211088, \qquad \hat c_2 = 0.177064668, \qquad \hat c_3 = 0.301486480, \qquad \hat c_4 = -0.237622363\times 10^{-1}.$
hnew = 0.9h 3
A
elhl
IYi+1 - Yi+11
Multistep methods

It was shown by Dahlquist (1963) that there are no A-stable multistep methods of order $r > 2$ and that the implicit trapezoidal rule

$\eta_{n+1} = \eta_n + \frac{h}{2}\bigl(f(x_n, \eta_n) + f(x_{n+1}, \eta_{n+1})\bigr), \qquad n > 0,$

where $\eta_1$ is given by the implicit Euler method (7.2.16.5), is the A-stable method of order 2 with an error coefficient $c = -\tfrac{1}{12}$ of smallest modulus.
Fig. 16. The stability of BDF methods [sketch of the stable set in the complex plane; axes Re(z), Im(z)].
Gear (1971) also showed that the BDF methods up to order $r = 6$ have good stability properties: Their stability regions (7.2.16.9) contain subsets of $\mathbb{C}_- = \{z \mid \operatorname{Re} z < 0\}$ having the shape sketched in Figure 16. These multistep methods all belong to the choice [see (7.2.11.3)]

$\chi(\mu) = b_r\mu^r$

and can be represented by backward difference formulas, which explains their name. Coefficients of the standard representation

$\eta_{j+r} + a_{r-1}\eta_{j+r-1} + \cdots + a_0\eta_j = h\,b_r f(x_{j+r}, \eta_{j+r})$

are listed in the following table [taken from Gear (1971) and Lambert (1973)]. In this table, $\alpha = \alpha_r$ and $c = c_r$ are the parameters indicated in Figure 16 that correspond to the particular BDF method. Byrne and Hindmarsh (1987) report very favorable numerical results for these methods.
More details on stiff differential equations can be found in e.g., Grigorieff
(1972, 1977), Enright, Hull, and Lindberg (1975), and in Willoughby
(1974). For a recent comprehensive treatment, see Hairer and Wanner
(1991).
r | alpha_r | c_r |  b_r    |   a_0     |   a_1     |   a_2      |   a_3      |   a_4      |   a_5
1 |  90°    | 0   |   1     |   -1      |           |            |            |            |
2 |  90°    | 0   |  2/3    |   1/3     |  -4/3     |            |            |            |
3 |  88°    | 0.1 |  6/11   |  -2/11    |   9/11    |  -18/11    |            |            |
4 |  73°    | 0.7 | 12/25   |   3/25    | -16/25    |   36/25    |  -48/25    |            |
5 |  51°    | 2.4 | 60/137  | -12/137   |  75/137   | -200/137   |  300/137   | -300/137   |
6 |  18°    | 6.1 | 60/147  |  10/147   | -72/147   |  225/147   | -400/147   |  450/147   | -360/147
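As an illustration (not from the original text, assuming NumPy), the sketch below implements the BDF method for $r = 2$ from the table, solving the implicit equation in each step by Newton's method; the test problem, starting step and tolerances are merely illustrative.

```python
import numpy as np

def bdf2(f, fy, x0, y0, y1, h, n):
    """BDF2: eta_{j+2} - (4/3) eta_{j+1} + (1/3) eta_j = (2/3) h f(x_{j+2}, eta_{j+2}),
    the implicit equation solved by Newton's method."""
    etas = [np.atleast_1d(np.asarray(y0, float)), np.atleast_1d(np.asarray(y1, float))]
    I = np.eye(len(etas[0]))
    for j in range(n - 1):
        x_new = x0 + (j + 2) * h
        rhs = (4.0 / 3.0) * etas[-1] - (1.0 / 3.0) * etas[-2]
        eta = etas[-1].copy()                         # predictor: last value
        for _ in range(10):                           # Newton iteration
            F = eta - (2.0 / 3.0) * h * f(x_new, eta) - rhs
            J = I - (2.0 / 3.0) * h * fy(x_new, eta)
            delta = np.linalg.solve(J, -F)
            eta = eta + delta
            if np.linalg.norm(delta) < 1e-12:
                break
        etas.append(eta)
    return etas

# stiff scalar test problem y' = -1000 (y - cos x), y(0) = 0
f  = lambda x, y: -1000.0 * (y - np.cos(x))
fy = lambda x, y: np.array([[-1000.0]])
h = 0.01
y1 = 1000.0 * h * np.cos(h) / (1.0 + 1000.0 * h)      # implicit Euler starting step
sol = bdf2(f, fy, 0.0, [0.0], [y1], h, 100)
print(sol[-1], np.cos(1.0))    # solution is attracted to the slow manifold y ~ cos(x)
```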
7.2.17 Implicit Differential Equations. Differential-Algebraic
Equations
So far we have only considered explicit ordinary differential equations $y' = f(x,y)$ (7.0.1). For many modern applications this is too restrictive, and it is important, for reasons of efficiency, to deal with implicit systems

$F(x, y, y') = 0$

directly, without transforming them to explicit form, which may not even be possible.
Large-scale examples of this type abound in many applications, e.g., in the
design of efficient microchips for modern computers. Thousands of transistors
are densely packed on these chips, so that their economical design is not possible
without using efficient numerical techniques. Such a chip can be modeled as a
large complicated electrical network. Kirchhoff's laws for the electrical potentials
and currents in the nodes of this network lead to a huge system of ordinary
differential equations for these potentials and currents as functions of time t. By
solving this system numerically, the behavior of these chips, and their electrical
properties, can be checked before their actual production, that is, such a chip
can be simulated by a computer [see, e.g., Bank, Bulirsch, and Merten (1990);
Horneber (1985)].
The resulting differential equations are implicit, and the following types,
depending on the complexity of the model for a single transistor, can be distinguished:
Linear implicit systems:
(7.2.17.1)  $C\dot U(t) = BU(t) + f(t),$
Linear-implicit nonlinear systems:
(7.2.17.2)  $C\dot U(t) = f(t, U(t)),$
Quasilinear-implicit systems:
(7.2.17.3)  $C(U)\dot U(t) = f(t, U(t)),$
General implicit systems:
(7.2.17.4)  $F(t, U(t), \dot Q(t)) = 0, \qquad Q(t) = C(U)U(t).$
The vector $U$, which may have a very large number of components, describes the electrical potentials of the nodes of the network. The matrix $C$ contains the capacities of the network, which may depend on the voltages, $C = C(U)$. Usually, this matrix is singular and very sparse.
There are already algorithms for solving systems of type (7.2.17.2) [cf. Deuflhard, Hairer, and Zugck (1987); Petzold (1982); Rentrop (1985); Rentrop, Roche, and Steinebach (1989)]. But the efficient and reliable solution of systems of the types (7.2.17.3) and (7.2.17.4) is still the subject of research [see, e.g., Hairer, Lubich, and Roche (1989); Petzold (1982)].
We now describe some basic differences between initial value problems for implicit and explicit systems (we again denote the independent variable by $x$)

(7.2.17.5)  $y'(x) = f(x, y(x)), \qquad y(x_0) := y_0,$

treated so far.
Concerning systems of type (7.2.17.1),

(7.2.17.6)  $Ay'(x) = By(x) + f(x), \qquad y(x_0) = y_0,$

where $y(x) \in \mathbb{R}^n$ and $A$, $B$ are $n\times n$ matrices, the following cases can be distinguished:

1st Case. $A$ regular, $B$ arbitrary.
The formal solution of the associated homogeneous system is given by

$y(x) = e^{(x - x_0)A^{-1}B}\,y_0.$

Even though the system can be reduced to explicit form $y' = A^{-1}By + A^{-1}f$, this is not feasible for large sparse $A$ in practice, since then the inverse $A^{-1}$ usually is a large full matrix and $A^{-1}B$ would be hard to compute and store.
2nd Case. $A$ singular, $B$ regular.
Then (7.2.17.6) is equivalent to

$B^{-1}Ay' = y + B^{-1}f(x).$

The Jordan normal form $J$ of the singular matrix $B^{-1}A = TJT^{-1}$ has the structure

$J = \begin{bmatrix} W & 0 \\ 0 & N \end{bmatrix},$

where $W$ contains all Jordan blocks belonging to the nonzero eigenvalues of $B^{-1}A$, and $N$ the Jordan blocks corresponding to the eigenvalue zero. We say that $N$ has index $k$ if $N$ is nilpotent of order $k$, i.e., $N^k = 0$, $N^j \ne 0$ for $j < k$. The transformation to Jordan normal form decouples the system: In terms of

$\begin{bmatrix} u(x) \\ v(x) \end{bmatrix} := T^{-1}y(x), \qquad \begin{bmatrix} p(x) \\ q(x) \end{bmatrix} := T^{-1}B^{-1}f(x),$

we obtain the equivalent system

$Wu'(x) = u(x) + p(x), \qquad Nv'(x) = v(x) + q(x).$

The partial system for $u$ has the same structure as in Case 1, so it has a unique solution for any initial value $y_0$. This is not true for the second system for $v$: First, this system can only be solved under an additional smoothness assumption on $f$:

$f \in C^{k-1}[x_0, x_{\text{end}}],$

where $k$ is the index of $N$. Then
$v(x) = -q(x) + Nv'(x) = -(q(x) + Nq'(x)) + N^2v''(x) = \cdots = -(q(x) + Nq'(x) + \cdots + N^{k-1}q^{(k-1)}(x)).$

Since $N^k = 0$, this resolution chain for $v$ finally terminates, proving that $v(x)$ is entirely determined by $q(x)$ and its derivatives. Therefore, the following principal differences to (7.2.17.5) can be noted:
(1) The index of the Jordan normal form of $B^{-1}A$ determines the smoothness requirements for $f(x)$.
(2) Not all components of the initial value $y_0$ can be chosen arbitrarily: $v(x_0)$ is already determined by $q(x)$, and therefore by $f(x)$, and its derivatives at $x = x_0$. For the solvability of the system, the initial values $y_0 = y(x_0)$ have to satisfy a consistency condition, and one speaks of consistent initial values in this context. However, the computation of consistent initial values may be difficult in practice.
3rd Case. $A$ and $B$ singular.
Here it is necessary to restrict the investigation to "meaningful" systems. Since there are matrices (the zero matrices $A := B := 0$ provide a trivial example) for which (7.2.17.6) has no unique solution, one considers only pairs of matrices $(A, B)$ that lead to uniquely solvable systems. It is possible to characterize such pairs by means of the concept of a regular matrix pencil: These are pairs $(A, B)$ of matrices such that $\lambda A + B$ is regular for some $\lambda \in \mathbb{C}$. Since then $\det(\lambda A + B)$ is a nonvanishing polynomial in $\lambda$ of degree $\le n$, there are at most $n$ numbers $\lambda$ -- the eigenvalues of the generalized eigenvalue problem [see Section 6.8] $Bx = -\lambda Ax$ -- for which $\lambda A + B$ is singular.
It is possible to show for regular matrix pencils that (7.2.17.6) has at most one solution. Also, there is a closed formula for this solution in terms of the Drazin inverse of matrices [see, e.g., Wilkinson (1982) or Gantmacher (1969)]. We will not pursue this analysis further, as it does not lead to essentially new phenomena compared with Case 2.
If the capacity matrices $C$ of the systems (7.2.17.1) and (7.2.17.2) are singular, then, after a suitable transformation, these systems decompose into a differential equation and a nonlinear system of equations (we denote the independent variable again by $x$):

(7.2.17.7)  $y'(x) = f(x, y(x), z(x)),$
  $0 = g(x, y(x), z(x)) \in \mathbb{R}^{n_2},$
  $y(x_0) \in \mathbb{R}^{n_1}$, $z(x_0) \in \mathbb{R}^{n_2}$: consistent initial values.

Such a decomposed system is called a differential-algebraic system [see, e.g., Griepentrog and März (1986), Hairer and Wanner (1991)]. By the implicit function theorem, it has a unique local solution if the Jacobian

$\frac{\partial g}{\partial z}$ is regular (Index-1 assumption).

Differential-algebraic systems typically occur in models of the dynamics of multibody systems; however, the Index-1 assumption may not always be satisfied [see Gear (1988)].
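A minimal sketch (not from the original text) of how an index-1 system of the form (7.2.17.7) can be integrated: each implicit Euler step solves the coupled nonlinear system for $(y_{i+1}, z_{i+1})$. It assumes NumPy and SciPy (fsolve) are available; the test problem and consistent starting values are illustrative.

```python
import numpy as np
from scipy.optimize import fsolve

def dae_implicit_euler(f, g, y0, z0, x0, h, n):
    """Implicit Euler for the semi-explicit index-1 system (7.2.17.7):
        y' = f(x, y, z),   0 = g(x, y, z)."""
    ny = len(y0)
    y, z, x = np.asarray(y0, float), np.asarray(z0, float), x0
    for _ in range(n):
        x_new = x + h

        def residual(w):
            yn, zn = w[:ny], w[ny:]
            return np.concatenate([yn - y - h * f(x_new, yn, zn),
                                   g(x_new, yn, zn)])

        w = fsolve(residual, np.concatenate([y, z]))
        y, z, x = w[:ny], w[ny:], x_new
    return y, z

# illustrative index-1 problem: y' = -y + z, 0 = z - sin(x); consistent start z(0) = 0
f = lambda x, y, z: -y + z
g = lambda x, y, z: z - np.array([np.sin(x)])
y, z = dae_implicit_euler(f, g, [1.0], [0.0], 0.0, 0.01, 100)
print(y, z, np.sin(1.0))
```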
Numerical methods

The differential-algebraic system (7.2.17.7) can be viewed as a differential equation on a manifold. Therefore, such systems can be successfully solved by methods combining solvers for differential equations [see Section 7.2], for nonlinear equations [see Section 5.4], and continuation methods.
Extrapolation and Runge-Kutta type methods can be derived by imbedding the nonlinear equation of (7.2.17.7) formally into a differential equation

$\varepsilon z'(x) = g(x, y(x), z(x))$

and then considering the limiting case $\varepsilon = 0$. Efficient and reliable one-step and extrapolation methods for solving differential-algebraic systems are described in Deuflhard, Hairer, and Zugck (1987). The method (7.2.16.10)-(7.2.16.12) for solving stiff systems can be modified into a method for solving implicit systems of the form (7.2.17.2). The following method of order 4 with stepsize control is described in Rentrop, Roche, and Steinebach (1989): In the notation of Section 7.2.16, and for an autonomous implicit system $Cy' = f(y)$, the method has the form [cf. (7.2.16.12)]

$Cf_k = f\Bigl(y + h\sum_{l=1}^{k-1}\beta_{kl}f_l\Bigr) + hf'(y)\sum_{l=1}^{k}\gamma_{kl}f_l, \qquad k = 1, 2, \ldots, 5,$

where the coefficients are given in Table (7.2.17.8).
Multistep methods can also be used for the numerical solution of differential-algebraic systems. Here, we only refer to the program package DASSL [see Petzold (1982)] for solving implicit systems of the form

$F(x, y(x), y'(x)) = 0.$

There, the derivative $y'(x)$ is replaced by a BDF formula, which leads to a nonlinear system of equations. Concerning details of solution strategies and their implementation, we refer to the literature, e.g., Byrne and Hindmarsh (1987), Gear and Petzold (1984), and Petzold (1982).
0.5 for k = 1, 2, 3, 4, 5,
4.0,
{21 =
0.46296 ... ,
{32= -1.5,
{31=
{41 =
2.2083 ... ,
{42 = -l.89583 ... ,
{Sl = -12.00694 ... ,
{52=
7.30324074 ... ,
{S4 = -15.851 ... ,
{kk=
,1321 =
,1331 =
0,
0.25,
,1341 = -0.0052083 ... ,
,1351 =
l.200810185 ... ,
l.0,
,1354=
C1
C4
C1
C4
0.538 ... ,
0.5925 ... ,
0.4523148 ... ,
0.5037 ... ,
0.25,
0.1927083 ... ,
f3S2= -1.950810185 ... ,
{43= -2.25,
{53=
19.0,
,1343=
0.5625,
0.25,
,1332=
,1342=
C2
= -0.13148 ... ,
.f353 =
C3
0.2,
C2 = -0.1560185 ... ,
Cs
0.2.
= -0.2,
C5
0,
Table (7.2.17.8)
7.2.18 Handling Discontinuities in Differential Equations
So far, the methods for solving initial value problems

$y'(x) = f(x, y(x)), \qquad y(x_0) = y_0,$

assumed that the right-hand side $f$ of the differential equation, and therefore its solution $y(x)$, is sufficiently often differentiable. However, there are many applied problems where $f$ and $y$ are discontinuous. These discontinuities are often due to the underlying tasks, e.g., in friction and contact problems of mechanics, or in process engineering if one admits different charges of input. But they may also be caused by unavoidable simplifications of a model, e.g., if one represents discrete tabular data found in experiments by an interpolating function of low order of differentiability.
Here, we consider initial value problems of the form

(7.2.18.1)  $y'(x) = f_n(x, y(x)), \qquad x_{n-1} < x < x_n, \quad n = 1, 2, \ldots,$
  $y(x_0) = y_0.$

We assume that the initial values $y_n^+ = \lim_{x\to x_n^+} y(x)$ at the points of discontinuity, the "switching points" $x_n$, are determined by known transition functions $\phi_n(x, y)$ as follows:

$y_n^+ = \phi_n(x_n, y_n^-), \qquad$ where $y_n^- = \lim_{x\to x_n^-} y(x).$
It would be easy to solve (7.2.18.1), say by one-step methods, if the switching points $x_n$ were known a priori. One would have to choose the integration points $\xi_i$ of these methods, $x_0 =: \xi_0 < \xi_1 < \cdots$, which are usually determined by a stepsize control technique alone [see Section 7.2.5], in such a way that all switching points $x_n$ occur among them, and to compute at $x_n$ the new starting ordinate $y_n^+$ by means of the transition function.
However, there are many problems where the switching points are not known a priori: They are usually characterized as zeros of known real-valued switching functions:

(7.2.18.2)  $q_n(x, y(x)) = 0, \qquad n = 1, 2, \ldots,$

that depend on the value $y(x_n^-)$ of the solution $y$. In what follows we suppose that these zeros are simple zeros. In these cases one has to compute $y(\cdot)$ and the switching points concurrently. Essentially, one has to check for each prospective integration interval $[\xi_i, \xi_{i+1}]$ whether it contains a switching point $x_n$, and, if this is the case, to compute $x_n$, to interrupt the integration at $x_n$ [i.e., replace $\xi_{i+1}$ by $x_n$], to compute the starting ordinate $y_n^+$, and resume the integration with the new starting point $(x_n, y_n^+)$ and the new right-hand side $f_{n+1}$ of (7.2.18.1).
Efficient methods for finding the switching points are as follows [cf., e.g., Eich (1992)]:
Let η^{i-1} and η^i be numerical approximations of the solution of (7.2.18.1) at the points x^{i-1} and x^i, obtained in the (i−1)st and ith integration step, respectively, using the right hand side f_n(x, y(x)). Also let ỹ(x) be a continuous approximation of the solution y(x) for x ∈ [x^{i-1}, x^i], and assume that the switching function q_n has a zero x_n with x^{i-1} < x_n < x^i. Moreover, we assume that η^i exists and therefore also the approximation ỹ(x) on the whole interval [x^{i-1}, x^i].
If y(x) is sufficiently well approximated on [x^{i-1}, x_n + ε] [here ε > 0 accounts for a possible shift of the zero x_n due to roundoff], then the function q_n(x, ỹ(x)), too, has a zero x̃_n in [x^{i-1}, x_n + ε], which can be used as an approximation of the switching point x_n. The zero x̃_n can be determined by an appropriate algorithm, e.g., by the secant method or by inverse interpolation [see Section 5.9]. Suitable generalizations [safeguard methods] of these algorithms are described in Gill et al. (1995).
It would be too expensive to search for a possible zero of a switching function between every two integration steps. If the integration stepsize is small enough so that each switching function has at most one zero in the integration interval, it is sufficient to monitor only the sign changes of the switching functions on these intervals. If more than one switching function changes its sign, the switching of the differential equation (7.2.18.1) is effected by the zero of that switching function which changes its sign first.
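The following Python sketch illustrates this strategy under simplifying assumptions: the continuous approximation ỹ(x) is taken to be the linear interpolant between the two numerical values η^{i-1}, η^i (a proper dense output of the integrator would be used in practice), and the zero is located by plain bisection rather than by the secant method or inverse interpolation of Section 5.9. The function name and the toy switching function are illustrative only, not taken from any library.

import numpy as np

def locate_switching_point(q, x_prev, eta_prev, x_curr, eta_curr, tol=1e-12):
    """Locate a zero of the switching function q(x, y) inside one integration
    step [x_prev, x_curr], assuming q changes sign there.  The continuous
    approximation of y is the linear interpolant between eta_prev and eta_curr."""
    def phi(x):
        t = (x - x_prev) / (x_curr - x_prev)
        return q(x, (1.0 - t) * eta_prev + t * eta_curr)

    a, b = x_prev, x_curr
    fa, fb = phi(a), phi(b)
    if fa * fb > 0.0:
        return None                      # no sign change: no switching point detected
    # plain bisection; a secant or inverse-interpolation step would converge faster
    while b - a > tol:
        c = 0.5 * (a + b)
        fc = phi(c)
        if fa * fc <= 0.0:
            b, fb = c, fc
        else:
            a, fa = c, fc
    return 0.5 * (a + b)

# toy usage: y' = -y, y(0) = 1, switching function q(x, y) = y - 0.5
if __name__ == "__main__":
    q = lambda x, y: float(y[0]) - 0.5
    xs = locate_switching_point(q, 0.6, np.array([0.549]), 0.8, np.array([0.449]))
    print(xs)   # close to ln 2 = 0.693... (up to the interpolation error)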
7.2.19 Sensitivity Analysis of Initial-Value Problems
Initial value problems often depend on real parameters p = (p_1, ..., p_{n_p}),
(7.2.19.1)    y'(x; p) = f(x, y(x; p), p),    y(x_0; p) = y_0(p).
Possibly, even small changes of these parameters may produce large changes in the solution y(x; p). Thus, it is important to carefully investigate the parameter dependence. This may add valuable insight into the process described by the differential equation. Mainly, such an investigation requires computing, along with a solution trajectory y(x; p), also its sensitivity (condition numbers) given by the matrix
    ∂y(x; p) / ∂p.
If the parameters are genuine model parameters, then this information allows one to judge the quality of the model. Sensitivities also play a dominant role in the solution of optimal control problems, such as the parametrization of control functions or parameter identification [Heim and von Stryk (1996), Engl et al. (1999)]. There are several numerical methods to compute these sensitivities:
- approximation by difference quotients,
- solution of the sensitivity equations,
- solution of adjoint equations.
With the first method, the sensitivities are approximated by
    ∂y(x; p)/∂p_i ≈ [ y(x; p + Δp_i e_i) − y(x; p) ] / Δp_i,    i = 1, ..., n_p.
Here e_i is the ith unit vector of ℝ^{n_p}. The implementation is easy and not too expensive. When using an integration method with order and/or stepsize control, the computation of the "perturbed trajectory" y(x; p + Δp_i e_i) must use the same order/stepsize sequence used for the computation of the reference trajectory y(x; p) [see Buchauer et al. (1994), Kiehl (1999)]. If higher accuracy is needed, or if the number n_p of parameters is large, the other two methods are preferable.
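The following Python sketch illustrates the difference-quotient approach under simplifying assumptions: a classical Runge-Kutta integrator with a fixed number of steps is used, so that the reference and perturbed trajectories trivially share the same step sequence (with an order/stepsize-controlled method, the sequence of the reference trajectory would have to be reused explicitly). All function names are illustrative, not part of any library.

import numpy as np

def rk4(f, x0, y0, x_end, n_steps):
    """Classical Runge-Kutta integration with a fixed number of steps, so that
    reference and perturbed trajectories share the same step sequence."""
    h = (x_end - x0) / n_steps
    x, y = x0, np.array(y0, dtype=float)
    for _ in range(n_steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, y + h / 2 * k1)
        k3 = f(x + h / 2, y + h / 2 * k2)
        k4 = f(x + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        x += h
    return y

def sensitivities_fd(f, y0_of_p, p, x0, x_end, n_steps=200, dp_rel=1e-6):
    """Approximate the columns dy(x_end; p)/dp_i by forward difference quotients."""
    p = np.asarray(p, dtype=float)
    y_ref = rk4(lambda x, y: f(x, y, p), x0, y0_of_p(p), x_end, n_steps)
    cols = []
    for i in range(p.size):
        dp = dp_rel * max(abs(p[i]), 1.0)
        pp = p.copy(); pp[i] += dp
        y_per = rk4(lambda x, y: f(x, y, pp), x0, y0_of_p(pp), x_end, n_steps)
        cols.append((y_per - y_ref) / dp)
    return np.column_stack(cols)

# usage: y' = p1*y, y(0) = p2; then dy(1)/dp1 = p2*exp(p1), dy(1)/dp2 = exp(p1)
if __name__ == "__main__":
    f = lambda x, y, p: p[0] * y
    S = sensitivities_fd(f, lambda p: np.array([p[1]]), [0.5, 2.0], 0.0, 1.0)
    print(S)   # approximately [[3.297, 1.649]]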
By Theorem (7.1.8), the following sensitivity equations are associated with (7.2.19.1):
(7.2.19.2)    ∂y'(x; p)/∂p = ∂f(x, y; p)/∂y · ∂y(x; p)/∂p + ∂f(x, y; p)/∂p,
              ∂y(x_0; p)/∂p = ∂y_0(p)/∂p
[see Hairer et al. (1993) for details].
These equations establish the function
    z(x) := ∂y(x; p)/∂p,
which is an n × n_p matrix function if y ∈ ℝⁿ, as the solution of an initial value problem for a system of linear differential equations. This can be used to implement a very efficient method for solving (7.2.19.2).
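A minimal Python sketch of this approach: the sensitivity system Z' = f_y Z + f_p of (7.2.19.2) is appended to the original differential equation and both are integrated together; here the general-purpose integrator scipy.integrate.solve_ivp stands in for a method tailored to the linear structure of the sensitivity equations. Function names and the toy problem are illustrative only.

import numpy as np
from scipy.integrate import solve_ivp

def solve_with_sensitivities(f, fy, fp, y0, dy0_dp, x_span, n, n_p):
    """Integrate y' = f(x, y) together with the sensitivity system
    Z' = f_y Z + f_p, Z(x0) = dy0/dp, by stacking y and the columns of Z."""
    def rhs(x, u):
        y = u[:n]
        Z = u[n:].reshape(n, n_p)
        dZ = fy(x, y) @ Z + fp(x, y)          # linear in Z
        return np.concatenate([f(x, y), dZ.ravel()])

    u0 = np.concatenate([np.asarray(y0, float), np.asarray(dy0_dp, float).ravel()])
    sol = solve_ivp(rhs, x_span, u0, rtol=1e-9, atol=1e-12)
    y_end = sol.y[:n, -1]
    Z_end = sol.y[n:, -1].reshape(n, n_p)
    return y_end, Z_end

# usage: y' = p*y, y(0) = 1 (n = n_p = 1); exact sensitivity dy(1)/dp = e^p
if __name__ == "__main__":
    p = 0.5
    y_end, Z_end = solve_with_sensitivities(
        f=lambda x, y: p * y,
        fy=lambda x, y: np.array([[p]]),
        fp=lambda x, y: np.array([[y[0]]]),
        y0=[1.0], dy0_dp=np.zeros((1, 1)),
        x_span=(0.0, 1.0), n=1, n_p=1)
    print(y_end, Z_end)   # both approximately 1.6487 = e^0.5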
For the method based on adjoint differential equations, we refer the reader to the literature [e.g., Morrison and Sargent (1986)].
High precision approximations of sensitivities can be computed very efficiently both by integrating the sensitivity equations and the system of adjoint equations. A basic tool is internal numerical differentiation (IND), which employs the differentiation of the recursions used for the numerical computation of the reference trajectory [Leis and Kramer (1985), Caracotsios and Stewart (1985), Bock et al. (1995)]. By contrast, external numerical differentiation (END) does not change the integration method (up to the adaptation of the order/stepsize control).
If the initial value problem contains discontinuities [see Section 7.2.18], the sensitivities have to be corrected at the switching points. Such corrections for systems of ordinary differential equations have been described by Rozenvasser (1967), and for initial value problems for differential-algebraic systems of index 1 [see Section 7.2.17] by Galán, Feehery, and Barton (1998).
7.3 Boundary-Value Problems
7.3.0 Introduction
More general than initial-value problems are boundary-value problems. In these one seeks a solution y(x) of a system of n ordinary differential equations,
(7.3.0.1a)    y' = f(x, y),    f(x, y) = [ f_1(x, y_1, ..., y_n)
                                              ⋮
                                           f_n(x, y_1, ..., y_n) ],
satisfying a boundary condition of the form
(7.3.0.1b)    A y(a) + B y(b) = c.
Here, a ≠ b are given numbers, A, B square matrices of order n, and c a vector in ℝⁿ. In practice, the boundary conditions are usually separated:
(7.3.0.1b')    A_1 y(a) = c_1,    B_2 y(b) = c_2,
i.e., in (7.3.0.1b) the rows of the matrix [A, B, c] can be permuted such that the rearranged matrix [Ā, B̄, c̄] has the form
    [Ā, B̄, c̄] = [ A_1   0    c_1
                    0    B_2  c_2 ].
The boundary conditions (7.3.0.1b) are linear (more precisely, affine) in y(a), y(b).
Occasionally, in practice, one also encounters nonlinear boundary conditions of the type
(7.3.0.1b'')    r(y(a), y(b)) = 0,
which are formed by means of a vector r of n functions r_i, i = 1, ..., n, of 2n variables:
    r(u, v) = [ r_1(u_1, ..., u_n, v_1, ..., v_n)
                   ⋮
                r_n(u_1, ..., u_n, v_1, ..., v_n) ].
Even separated linear boundary-value problems are still very general. Thus, initial-value problems, e.g., can be thought of as special boundary-value problems of this type (with A = I, a = x_0, c = y_0, B = 0). Whereas initial-value problems are normally uniquely solvable [see Theorem (7.1.1)], boundary-value problems can also have no solution or several solutions.
EXAMPLE. The differential equation
(7.3.0.2a)    w'' + w = 0
for the real function w: ℝ → ℝ, with the notation y_1(x) := w(x), y_2(x) := w'(x), can be written in the form (7.3.0.1a),
    (y_1, y_2)' = (y_2, −y_1).
It has the general solution
    w(x) = c_1 sin x + c_2 cos x,    c_1, c_2 arbitrary.
The special solution w(x) := sin x is the only solution satisfying the boundary conditions
(7.3.0.2b)    w(0) = 0,    w(π/2) = 1.
All functions w(x) := c_1 sin x, with c_1 arbitrary, satisfy the boundary conditions
(7.3.0.2c)    w(0) = 0,    w(π) = 0,
while there is no solution w(x) of (7.3.0.2a) obeying the boundary conditions
(7.3.0.2d)    w(0) = 0,    w(π) = 1.
[Observe that all boundary conditions (7.3.0.2b-d) have the form (7.3.0.1b') with A_1 = B_2 = [1, 0].]
The preceding example shows that there will be no theorem, such as (7.1.1), of similar generality, for the existence and uniqueness of solutions to boundary-value problems [see, in this connection, Section 7.3.3].
Many practically important problems can be reduced to boundary-value problems (7.3.0.1). Such is the case, e.g., for
    eigenvalue problems for differential equations,
in which the right-hand side f of a system of n differential equations depends on a parameter λ,
(7.3.0.3a)    y' = f(x, y, λ),
and one has to satisfy n + 1 boundary conditions of the form
(7.3.0.3b)    r(y(a), y(b), λ) = 0,
where r(u, v, λ) is a vector of n + 1 functions r_i(u, v, λ), i = 1, ..., n + 1.
The problem (7.3.0.3) is overdetermined and therefore, in general, has no solution for an arbitrary choice of λ. The eigenvalue problem in (7.3.0.3) consists in determining those numbers λ_i, the "eigenvalues" of (7.3.0.3), for which (7.3.0.3) does have a solution. Through the introduction of an additional function
    y_{n+1}(x) := λ
and an additional differential equation
    y'_{n+1}(x) = 0,
(7.3.0.3) is seen to be equivalent to a problem
    ỹ' = f̃(x, ỹ),    r̃(ỹ(a), ỹ(b)) = 0,
which now has the form (7.3.0.1), with the augmented vector
    ỹ := [y, y_{n+1}]ᵀ
and correspondingly augmented right-hand side f̃ and boundary function r̃.
Furthermore, so-called
    boundary-value problems with free boundary
are also reducible to (7.3.0.1). In these problems only one boundary abscissa, a, is prescribed, while b is to be determined so that the system of n ordinary differential equations
(7.3.0.4a)    y' = f(x, y)
has a solution y satisfying the n + 1 boundary conditions
(7.3.0.4b)    r(y(a), y(b)) = 0,    r(u, v) = [ r_1(u, v)
                                                  ⋮
                                               r_{n+1}(u, v) ].
Here, in place of x, one introduces a new independent variable t and a constant z_{n+1} := b − a, yet to be determined, by means of
    x − a = t·z_{n+1},    0 ≤ t ≤ 1,    ż_{n+1} = dz_{n+1}/dt = 0.
[Instead of this choice of parameters, any substitution of the form x − a = Φ(t, z_{n+1}) with Φ(1, z_{n+1}) = z_{n+1} is also suitable.] For z(t) := y(a + t·z_{n+1}), where y(x) is a solution of (7.3.0.4), one then obtains
(7.3.0.5)    z'(t) = z_{n+1} · f(a + t·z_{n+1}, z(t)),    z'_{n+1}(t) = 0,    r(z(0), z(1)) = 0,
and (7.3.0.4) is thus equivalent to a boundary-value problem of the type (7.3.0.1) for the functions z_i(t), i = 1, ..., n + 1.
7.3.1 The Simple Shooting Method
We want to explain the simple shooting method first by means of an example. Suppose we are given the boundary-value problem
(7.3.1.1)    w'' = f(x, w, w'),    w(a) = α,    w(b) = β,
with separated boundary conditions. The initial-value problem
(7.3.1.2)    w'' = f(x, w, w'),    w(a) = α,    w'(a) = s
in general has a uniquely determined solution w(x) ≡ w(x; s), which of course depends on the choice of the initial value s for w'(a). To solve the boundary-value problem (7.3.1.1), we must determine s = s̄ so as to satisfy the second boundary condition, w(b) = w(b; s̄) = β. In other words: one has to find a zero s̄ of the function F(s) := w(b; s) − β. For every argument s the function F(s) can be computed. For this, one has to determine (e.g., with the methods of Section 7.2) the value w(b) = w(b; s) of the solution w(x; s) of the initial-value problem (7.3.1.2) at the point x = b. The computation of F(s) thus amounts to the solution of an initial-value problem.
For determining a zero s̄ of F(s) one can use, in principle, any method of Chapter 5. If one knows, e.g., values s^(0), s^(1) with
    F(s^(0)) < 0,    F(s^(1)) > 0
[see Figure 17], one can compute s̄ by means of a simple bisection method [see Section 5.6].
Fig. 17. Simple shooting.
Since w(b; s), and hence F(s), are in general [see Theorem (7.1.8)] continuously differentiable functions of s, one can also use Newton's method to determine s̄. Starting with an initial approximation s^(0), one then has to iteratively compute values s^(i) according to the prescription
(7.3.1.3)    s^(i+1) = s^(i) − F(s^(i)) / F'(s^(i)).
w(b; s^(i)), and thus F(s^(i)), can be determined by solving the initial-value problem
(7.3.1.4)    w'' = f(x, w, w'),    w(a) = α,    w'(a) = s^(i).
The value of the derivative of F,
    F'(s) = ∂w(b; s)/∂s,
for s = s^(i) can be obtained, e.g., by adjoining an additional initial-value problem. With the aid of Theorem (7.1.8) one easily verifies that the function v(x) := v(x; s) = (∂/∂s) w(x; s) satisfies
(7.3.1.5)    v'' = f_w(x, w, w') v + f_{w'}(x, w, w') v',    v(a) = 0,    v'(a) = 1.
Because of the partial derivatives f_w, f_{w'}, the initial-value problem (7.3.1.5) is in general substantially more complex than (7.3.1.4). For this reason, one often replaces the derivative F'(s^(i)) in Newton's formula (7.3.1.3) by a difference quotient ΔF(s^(i)),
    ΔF(s^(i)) := [ F(s^(i) + Δs^(i)) − F(s^(i)) ] / Δs^(i),
where Δs^(i) is chosen "sufficiently" small. F(s^(i) + Δs^(i)) is computed, like F(s^(i)), by solving an initial-value problem. The following difficulties then arise:
If Δs^(i) is chosen too large, ΔF(s^(i)) is a poor approximation to F'(s^(i)), and the iteration
(7.3.1.3a)    s^(i+1) = s^(i) − F(s^(i)) / ΔF(s^(i))
converges toward s̄ considerably more slowly (if at all) than (7.3.1.3). If Δs^(i) is chosen too small, then F(s^(i) + Δs^(i)) ≈ F(s^(i)), and the subtraction F(s^(i) + Δs^(i)) − F(s^(i)) is subject to cancellation, so that even small errors in the calculation of F(s^(i)) and F(s^(i) + Δs^(i)) strongly impair the result ΔF(s^(i)).
The solution of the initial-value problems (7.3.1.4), i.e., the calculation of F, therefore has to be carried out as accurately as possible. The relative error of F(s^(i)) and F(s^(i) + Δs^(i)) is only allowed to have the order of magnitude of the machine precision eps. Such accuracy can be attained with extrapolation methods [see Section 7.2.14]. If Δs^(i) is then further chosen so that in t-digit computation F(s^(i)) and F(s^(i) + Δs^(i)) have approximately the first t/2 digits in common, Δs^(i) ΔF(s^(i)) ≈ √eps · F(s^(i)), then the effects of cancellation are still tolerable. As a rule, this is the case with the choice Δs^(i) = √eps · s^(i).
EXAMPLE. Consider the boundary-value problem
    w'' = (3/2) w²,    w(0) = 4,    w(1) = 1.
Following (7.3.1.2), one considers the solution of the initial-value problem
    w'' = (3/2) w²,    w(0; s) = 4,    w'(0; s) = s.
The graph of F(s) := w(1; s) − 1 is shown in Figure 18.
Fig. 18. Graph of F(s) = w(1; s) − 1.
It is seen that F(s) has two zeros s̄₁, s̄₂. The iteration according to (7.3.1.3a) yields
    s̄₁ = −8.0000000000,    s̄₂ = −35.8585487278.
Fig. 19. The solutions w₁ and w₂.
Figure 19 shows the graphs of the two solutions w_i(x) = w(x; s̄_i), i = 1, 2, of the boundary-value problem. The solutions were computed to about ten decimal digits. Both solutions, incidentally, can be expressed in closed form in terms of the Jacobian elliptic function cn(ξ | k²) with modulus
    k = √(2 + √3) / 2.
From the theory of elliptic functions, and by using an iterative method of Section 5.2, one obtains for the constants of integration C₁, C₂ the values
    C₁ = 4.30310990...,    C₂ = 2.33464196... .
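For illustration, the following Python sketch carries out the computation of this example with the difference-Newton iteration (7.3.1.3a) and the choice Δs^(i) = √eps · s^(i) discussed above; a fixed-step Runge-Kutta integrator replaces the extrapolation methods recommended in the text, and the starting guesses are assumed to lie near the two zeros. Function names are illustrative only.

import numpy as np

def integrate(s, n_steps=2000):
    """Integrate w'' = (3/2) w^2, w(0) = 4, w'(0) = s with classical RK4
    and return w(1; s)."""
    h = 1.0 / n_steps
    y = np.array([4.0, s])                       # y = (w, w')
    f = lambda y: np.array([y[1], 1.5 * y[0] ** 2])
    for _ in range(n_steps):
        k1 = f(y); k2 = f(y + h / 2 * k1); k3 = f(y + h / 2 * k2); k4 = f(y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y[0]

def shoot(s, eps=1e-12, n_iter=30):
    """Newton iteration (7.3.1.3a) with F'(s) replaced by the difference
    quotient Delta F(s), Delta s = sqrt(eps) * s."""
    F = lambda s: integrate(s) - 1.0
    for _ in range(n_iter):
        ds = np.sqrt(eps) * s if s != 0.0 else np.sqrt(eps)
        dF = (F(s + ds) - F(s)) / ds
        step = F(s) / dF
        s -= step
        if abs(step) < 1e-10:
            break
    return s

if __name__ == "__main__":
    print(shoot(-7.0))    # starting near the first zero: converges to -8.0
    print(shoot(-36.0))   # starting near the second zero: converges to about -35.86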
For the solution of a general boundary-value problem (7.3.0.1) involving n unknown functions y_i(x), i = 1, ..., n,
(7.3.1.6)    y' = f(x, y),    r(y(a), y(b)) = 0,
where f(x, y) and r(u, v) are vectors of n functions, one proceeds as in the example above. One tries again to determine a starting vector s ∈ ℝⁿ for the initial-value problem
(7.3.1.7)    y' = f(x, y),    y(a) = s
in such a way that the solution y(x) = y(x; s) obeys the boundary conditions of (7.3.1.6),
    r(y(a; s), y(b; s)) = r(s, y(b; s)) = 0.
One thus has to find a solution s = [σ₁, σ₂, ..., σₙ]ᵀ of the equation
(7.3.1.8)    F(s) = 0,    F(s) := r(s, y(b; s)).
This can be done, e.g., by means of the general Newton's method (5.1.6),
(7.3.1.9)    s^(i+1) = s^(i) − DF(s^(i))⁻¹ F(s^(i)).
In each iteration step, therefore, one has to compute F(s^(i)), the Jacobian matrix DF(s^(i)),
and the solution d^(i) := s^(i) − s^(i+1) of the linear system of equations DF(s^(i)) d^(i) = F(s^(i)). For the computation of F(s) = r(s, y(b; s)) at s = s^(i) one must determine y(b; s^(i)), i.e., solve the initial value problem (7.3.1.7) for s = s^(i). For the computation of DF(s^(i)) one observes
(7.3.1.10)    DF(s) = D_u r(s, y(b; s)) + D_v r(s, y(b; s)) · Z(b; s),
with the matrices
(7.3.1.11)    D_u r(u, v) = [∂r_i(u, v)/∂u_j],    D_v r(u, v) = [∂r_i(u, v)/∂v_j],
              Z(b; s) = D_s y(b; s) = [∂y_i(b; s)/∂σ_j].
In the case of nonlinear functions r(u, v), however, one will not compute DF(s) by means of these complicated formulas, but instead will approximate it by means of difference quotients. Thus, DF(s) will be approximated by the matrix
(7.3.1.12)    ΔF(s) := [Δ₁F(s), ..., ΔₙF(s)],
              Δ_j F(s) := [ F(σ₁, ..., σ_j + Δσ_j, ..., σₙ) − F(σ₁, ..., σₙ) ] / Δσ_j.
In view of F(s) = r(s, y(b; s)), the calculation of Δ_j F(s) of course will require that y(b; s) = y(b; σ₁, ..., σₙ) and y(b; σ₁, ..., σ_j + Δσ_j, ..., σₙ) be determined through the solution of the corresponding initial-value problems.
For linear boundary conditions (7.3.0.1b),
    r(u, v) ≡ Au + Bv − c,
the formulas simplify somewhat. One has
    F(s) ≡ As + B y(b; s) − c,    DF(s) ≡ A + B Z(b; s).
In this case, in order to form DF(s), one needs to determine the matrix
    Z(b; s) = [ ∂y(b; s)/∂σ₁, ..., ∂y(b; s)/∂σₙ ].
As just described, the jth column ∂y(b; s)/∂σ_j of Z(b; s) is replaced by a difference quotient
    Δ_j y(b; s) := [ y(b; σ₁, ..., σ_j + Δσ_j, ..., σₙ) − y(b; σ₁, ..., σ_j, ..., σₙ) ] / Δσ_j.
One obtains the approximation
(7.3.1.13)    ΔF(s) = A + B Δy(b; s),    Δy(b; s) := [Δ₁y(b; s), ..., Δₙy(b; s)].
Therefore, to carry out the approximate Newton's method
(7.3.1.14)    s^(i+1) = s^(i) − ΔF(s^(i))⁻¹ F(s^(i)),
the following has to be done:
(0) Choose a starting vector s^(0).
For i = 0, 1, 2, ...:
(1) Determine y(b; s^(i)) by solving the initial-value problem (7.3.1.7) for s = s^(i), and compute F(s^(i)) = r(s^(i), y(b; s^(i))).
(2) Choose (sufficiently small) numbers Δσ_j ≠ 0, j = 1, ..., n, and determine y(b; s^(i) + Δσ_j e_j) by solving the n initial-value problems (7.3.1.7) for s = s^(i) + Δσ_j e_j = [σ₁^(i), ..., σ_j^(i) + Δσ_j, ..., σₙ^(i)]ᵀ, j = 1, 2, ..., n.
(3) Compute ΔF(s^(i)) by means of (7.3.1.12) [or (7.3.1.13)], and also the solution d^(i) of the system of linear equations ΔF(s^(i)) d^(i) = F(s^(i)), and put s^(i+1) := s^(i) − d^(i).
In each step of the method one thus has to solve n + 1 initial-value problems and an nth-order system of linear equations.
In view of the merely local convergence of the (approximate) Newton's method (7.3.1.14), the method will in general diverge unless the starting vector s^(0) is already sufficiently close to a solution s̄ of F(s) = 0 [see Theorem (5.3.2)]. Since, as a rule, such initial values are not known, the method in the form (7.3.1.9), or (7.3.1.14), is not very useful in practice. For this reason, one replaces (7.3.1.9), or (7.3.1.14), by the modified Newton method [see (5.4.2.4)], which usually converges even for starting vectors s^(0) that are not particularly good (if the boundary-value problem is at all solvable).
7.3.2 The Simple Shooting Method for Linear Boundary-Value
Problems
By substituting ΔF(s^(i)) for DF(s^(i)) in (7.3.1.9), one generally loses the (local) quadratic convergence of Newton's method. The substitute method (7.3.1.14) as a rule converges only linearly (locally), the convergence being the faster, the better ΔF(s^(i)) approximates DF(s^(i)).
In the special case of linear boundary-value problems, one now has DF(s) = ΔF(s) for all s (and for arbitrary choice of the Δσ_j), so that (7.3.1.9) and (7.3.1.14) become identical. By a linear boundary-value problem one means a problem in which f(x, y) is an affine function of y and the boundary conditions (7.3.0.1b) are linear, i.e.,
(7.3.2.1)    y' = T(x) y + g(x),    A y(a) + B y(b) = c,
with an n × n matrix T(x), a function g: ℝ → ℝⁿ, c ∈ ℝⁿ, and constant n × n matrices A and B. We assume in the following that T(x) and g(x) are continuous functions on [a, b]. By y(x; s) we again denote the solution of the initial-value problem
(7.3.2.2)    y' = T(x) y + g(x),    y(a; s) = s.
For y(x; s) one can give an explicit formula,
(7.3.2.3)    y(x; s) = Y(x) s + y(x; 0),
where the n × n matrix Y(x) is the solution of the initial-value problem
    Y' = T(x) Y,    Y(a) = I.
Denoting the right-hand side of (7.3.2.3) by u(x; s), one indeed has
    u(a; s) = Y(a) s + y(a; 0) = Is + 0 = s,
    D_x u(x; s) = u'(x; s) = Y'(x) s + y'(x; 0)
                = T(x) Y(x) s + T(x) y(x; 0) + g(x)
                = T(x) u(x; s) + g(x),
i.e., u(x; s) is a solution of (7.3.2.2). Since, under the assumptions on T(x) and g(x) made above, the initial-value problem has a unique solution, it follows that u(x; s) ≡ y(x; s). Using (7.3.2.3), one obtains for the function F(s) in (7.3.1.8)
(7.3.2.4)    F(s) = As + B y(b; s) − c = [A + B Y(b)] s + B y(b; 0) − c.
Thus, F(s) is also an affine function of s. Consequently [cf. (7.3.1.13)],
    DF(s) = ΔF(s) = A + B Y(b) = ΔF(0).
The solution s̄ of F(s) = 0 [assuming the existence of ΔF(0)⁻¹] is given by
    s̄ = −[A + B Y(b)]⁻¹ [B y(b; 0) − c] = 0 − ΔF(0)⁻¹ F(0),
or, slightly more generally, by
(7.3.2.5)    s̄ = s^(0) − ΔF(s^(0))⁻¹ F(s^(0)),
where s^(0) ∈ ℝⁿ is arbitrary. In other words, the solution s̄ of F(s) = 0, and hence the solution of the linear boundary-value problem (7.3.2.1), will
be produced by the method (7.3.1.14) in one iteration step, initiated with an arbitrary starting vector s^(0).
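For linear problems this one-step character can be exploited directly, as in the following Python sketch: the matrix initial-value problem Y' = T(x)Y, Y(a) = I, and the particular solution y(x; 0) are integrated simultaneously, and the starting vector is obtained from one linear system, cf. (7.3.2.4), (7.3.2.5). The fixed-step integrator and the function names are illustrative stand-ins.

import numpy as np

def linear_bvp_shooting(T, g, A, B, c, a, b, n_steps=400):
    """Solve y' = T(x)y + g(x), A y(a) + B y(b) = c by computing Y(b) and
    y(b; 0) and solving (A + B Y(b)) s = c - B y(b; 0)."""
    n = A.shape[0]
    h = (b - a) / n_steps

    def rhs(x, U):
        Y, y = U[:, :n], U[:, n]
        return np.column_stack([T(x) @ Y, T(x) @ y + g(x)])

    U = np.column_stack([np.eye(n), np.zeros(n)])    # [Y(a), y(a; 0)]
    x = a
    for _ in range(n_steps):                          # classical RK4
        k1 = rhs(x, U); k2 = rhs(x + h/2, U + h/2*k1)
        k3 = rhs(x + h/2, U + h/2*k2); k4 = rhs(x + h, U + h*k3)
        U = U + h/6*(k1 + 2*k2 + 2*k3 + k4); x += h

    Yb, yb0 = U[:, :n], U[:, n]
    return np.linalg.solve(A + B @ Yb, c - B @ yb0)   # = y(a)

# usage: w'' + w = 0 as first-order system, w(0) = 0, w(pi/2) = 1
if __name__ == "__main__":
    T = lambda x: np.array([[0.0, 1.0], [-1.0, 0.0]])
    g = lambda x: np.zeros(2)
    A = np.array([[1.0, 0.0], [0.0, 0.0]])
    B = np.array([[0.0, 0.0], [1.0, 0.0]])
    c = np.array([0.0, 1.0])
    print(linear_bvp_shooting(T, g, A, B, c, 0.0, np.pi/2))   # ~[0, 1]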
7.3.3 An Existence and Uniqueness Theorem for the Solution
of Boundary-Value Problems
Under very restrictive conditions one can show the unique solvability of certain boundary-value problems. To this end, we consider in the following boundary-value problems with nonlinear boundary conditions:
(7.3.3.1)    y' = f(x, y),    r(y(a), y(b)) = 0.
The problem given in (7.3.3.1) is solvable precisely if the function F(s) in (7.3.1.8) has a zero s̄:
(7.3.3.2)    F(s̄) = r(s̄, y(b; s̄)) = 0.
The latter is certainly true if one can find a nonsingular n × n matrix Q such that
(7.3.3.3)    Φ(s) := s − Q F(s)
is a contractive mapping in ℝⁿ; the zero s̄ of F(s) is then a fixed point of Φ, Φ(s̄) = s̄.
With the help of Theorem (7.1.8), we can now prove the following
result, which for linear boundary conditions is due to Keller (1968).
(7.3.3.4) Theorem. For the boundary-value problem (7.3.3.1), let the following assumptions be satisfied:
(1) f and D_y f are continuous on S := {(x, y) | a ≤ x ≤ b, y ∈ ℝⁿ}.
(2) There is a k ∈ C[a, b] with ||D_y f(x, y)|| ≤ k(x) for all (x, y) ∈ S.
(3) The matrix
    P(u, v) := D_u r(u, v) + D_v r(u, v)
admits for all u, v ∈ ℝⁿ a representation of the form
    P(u, v) = P₀ (I + M(u, v))
with a constant nonsingular matrix P₀ and a matrix M = M(u, v), and there are constants μ and m with
    ||M(u, v)|| ≤ μ,    ||P₀⁻¹ D_v r(u, v)|| ≤ m    for all u, v ∈ ℝⁿ.
(4) There is a number λ > 0 with λ + μ < 1 such that
    exp( ∫ₐᵇ k(t) dt ) ≤ 1 + λ/m.
Then the boundary-value problem (7.3.3.1) has exactly one solution y(x).
PROOF. We show that with a suitable choice of Q, namely Q := P₀⁻¹, the function Φ(s) in (7.3.3.3) satisfies
(7.3.3.5)    ||D_s Φ(s)|| ≤ K < 1 for all s ∈ ℝⁿ, where K := λ + μ < 1.
From this it then follows at once that
    ||Φ(s₁) − Φ(s₂)|| ≤ K ||s₁ − s₂||    for all s₁, s₂ ∈ ℝⁿ,
i.e., Φ is a contractive mapping which, according to Theorem (5.2.3), has exactly one fixed point s̄ = Φ(s̄), which is a zero of F(s), on account of the nonsingularity of Q.
For Φ(s) := s − P₀⁻¹ r(s, y(b; s)) one has
(7.3.3.6)    D_s Φ(s) = I − P₀⁻¹ [ D_u r(s, y(b; s)) + D_v r(s, y(b; s)) Z(b; s) ]
                      = I − P₀⁻¹ [ P(s, y(b; s)) + D_v r(s, y(b; s)) (Z(b; s) − I) ]
                      = I − P₀⁻¹ [ P₀ (I + M) + D_v r (Z − I) ]
                      = −M(s, y(b; s)) − P₀⁻¹ D_v r(s, y(b; s)) (Z(b; s) − I),
where the matrix
    Z(x; s) := D_s y(x; s)
is the solution of the initial-value problem
    Z' = T(x) Z,    Z(a; s) = I,    T(x) := D_y f(x, y(x; s))
[see (7.1.8), (7.1.9)]. From Theorem (7.1.11), by virtue of assumption (2), there thus follows for Z the estimate
    ||Z(b; s) − I|| ≤ exp( ∫ₐᵇ k(t) dt ) − 1,
and further, from (7.3.3.6) and assumptions (3) and (4),
    ||D_s Φ(s)|| ≤ μ + m [ exp( ∫ₐᵇ k(t) dt ) − 1 ] ≤ μ + m [ 1 + λ/m − 1 ] = μ + λ = K < 1.
This proves the theorem.  ☐
Remark. The conditions of the theorem are only sufficient and are also very restrictive. Assumption (3), for example, already in the case n = 2, is not satisfied for such simple separated boundary conditions as
    y₁(a) = c₁,    y₁(b) = c₂
[see Exercise 20]. Even though some of the assumptions, for example (3), can be weakened, one nevertheless obtains only theorems whose conditions are very rarely satisfied in practice.
7.3.4 Difficulties in the Execution of the Simple Shooting
Method
To be useful in practical applications, methods for solving the boundary-value problem
    y' = f(x, y),    r(y(a), y(b)) = 0,
should yield an approximation to y(x₀) for every choice of x₀ in the region of definition of a solution y. In the shooting method discussed so far, an approximate value s̄ for the solution y(a) is computed only at one point, the point x₀ = a. This seems to suggest that the boundary value problem has thus been solved, since the value y(x₀) of the solution at every other point x₀ can be (approximately) determined by solving the initial-value problem
(7.3.4.1)    y' = f(x, y),    y(a) = s̄,
say, with the methods of Section 7.2. This, however, is true only in principle. In practice, there often accrue considerable inaccuracies if the solution y(x) = y(x; s) of (7.3.4.1) depends very sensitively on s, as shown in the following example.
EXAMPLE 1. The linear system of differential equations
(7.3.4.2)    [y₁, y₂]ᵀ' = [[0, 1], [100, 0]] [y₁, y₂]ᵀ
has the general solution
(7.3.4.3)    y(x) = c₁ e^{−10x} [1, −10]ᵀ + c₂ e^{10x} [1, 10]ᵀ,    c₁, c₂ arbitrary.
Let y(x; s) be the solution of (7.3.4.2) satisfying the initial condition
    y(−5) = s = [s₁, s₂]ᵀ.
One verifies at once that
(7.3.4.4)    y(x; s) = (e^{−50}(10 s₁ − s₂)/20) e^{−10x} [1, −10]ᵀ + (e^{50}(10 s₁ + s₂)/20) e^{10x} [1, 10]ᵀ.
We now wish to determine the solution y(x) of (7.3.4.2) which satisfies the linear and separated [see (7.3.0.1b')] boundary conditions
(7.3.4.5)    y₁(−5) = 1,    y₁(5) = 1.
Using (7.3.4.3), the exact solution is
(7.3.4.6)    y(x) = ((e^{50} − e^{−50})/(e^{100} − e^{−100})) e^{−10x} [1, −10]ᵀ + ((e^{50} − e^{−50})/(e^{100} − e^{−100})) e^{10x} [1, 10]ᵀ.
The initial value s = y(−5) of the exact solution is
    s = [ 1,  −10 + 20(1 − e^{−100})/(e^{100} − e^{−100}) ]ᵀ.
Trying to compute s, e.g., in 10-digit floating-point arithmetic, we obtain at best an approximate value s̃ of the form
    s̃ = fl(s) = [ 1·(1 + ε₁),  −10·(1 + ε₂) ]ᵀ,
with |ε_i| ≤ eps = 10⁻¹⁰. Let, e.g., ε₁ = 0, ε₂ = −10⁻¹⁰. To the initial value s̃, however, there corresponds by (7.3.4.4) an exact solution y(x; s̃) with
(7.3.4.7)    y₁(5; s̃) ≈ (10⁻⁹/20) e^{100} ≈ 1.3 × 10³³.
On the other hand, the boundary-value problem (7.3.4.2), (7.3.4.5) is well-conditioned with respect to the influence of the boundary data on the solution y(x). For example, to a perturbation of the first boundary condition,
    y₁(−5) = 1 + ε,    y₁(5) = 1,
corresponds the solution ỹ(x, ε) (with ỹ(x, 0) = y(x)):
    ỹ(x, ε) = ((e^{50} − e^{−50} + e^{50} ε)/(e^{100} − e^{−100})) e^{−10x} [1, −10]ᵀ + ((e^{50} − e^{−50} − e^{−50} ε)/(e^{100} − e^{−100})) e^{10x} [1, 10]ᵀ.
For −5 ≤ x ≤ 5 the factors of ε are small (= O(1)); one can even show, for small |ε|,
    |ỹ₁(x, ε) − ỹ₁(x, 0)| ≈ ε y₁(x, 0) ≤ ε,    |ỹ₂(x, ε) − ỹ₂(x, 0)| ≈ 10 ε y₁(x, 0) ≤ 10 ε,
with factors y₁(x, 0) = y₁(x) that are extremely small in the open interval (−5, 5).
Therefore the roundoff error committed when computing the starting value s = y(−5) of the exact solution by means of the shooting method has a much larger influence on the solution y(·) than comparably small errors in the input data (here the boundary data). In this example, the shooting method is thus not well-conditioned [see Section 1.3].
The above example shows that even the computation of the initial value s = y(a) to full machine accuracy does not guarantee that additional values y(x) can be determined accurately. By (7.3.4.4), the inaccuracy in the initial data propagates exponentially with x, growing like e^{10x} for x > 0 sufficiently large.
Theorem (7.1.4) asserts that this may be true in general: For the solution y(x; s) of the initial-value problem y' = f(x, y), y(a) = s, it follows that
    ||y(x; s₁) − y(x; s₂)|| ≤ ||s₁ − s₂|| e^{L|x−a|},
provided the hypotheses of Theorem (7.1.4) are satisfied.
This estimate, however, also shows that for sufficiently small intervals [a, b], the influence of inaccuracies in the initial data s = y(a) remains small. For large intervals, there is a further problem which severely restricts the practical significance of the simple shooting method: Frequently, the function f in the differential equation y' = f(x, y) has a derivative D_y f(x, y) which is continuous for all x ∈ [a, b] and all y ∈ ℝⁿ, but ||D_y f(x, y)|| is unbounded on S = {(x, y) | a ≤ x ≤ b, y ∈ ℝⁿ}, so that the Lipschitz condition of Theorem (7.1.1) is violated. In this case, the solution y(x) = y(x; s) of the initial-value problem y' = f(x, y), y(a) = s, still exists, but only in a neighborhood U_s(a) of a, whose size may depend on s, and perhaps not for all x ∈ [a, b]. Thus, y(b; s) possibly exists only for values s in a small set M. Besides, M is usually not known. The simple shooting method, therefore, will always break down if one chooses for the starting value in Newton's method a vector s^(0) ∉ M.
EXAMPLE 2. Consider the boundary-value problem [cf. Troesch (1960), (1976)]
(7.3.4.8)    y'' = λ sinh λy,
(7.3.4.9)    y(0) = 0,    y(1) = 1    (λ a fixed parameter).
In order to treat the problem with the simple shooting method, the initial slope y'(0) = s needs to be "estimated" first. When numerically integrating the initial-value problem (7.3.4.8) with λ = 5, y(0) = 0, y'(0) = s, the solution y(x; s) turns out to depend extremely sensitively on s. For s = 0.1, 0.2, ... the computation breaks down even before the right-hand boundary (x = 1) is reached, due to exponent overflow, i.e., y(x; s) has a singular point (which depends on s) at some x_s ≤ 1. The effect of the initial slope y'(0) = s upon the location of the singularity can be estimated in this example:
y" = .\ sinh .\y
possesses the first integral
(y')2 = cosh.\y + C.
(7.3.4.10)
2
The conditions y(O) = 0, y' (0) = s define the constant of integration,
S2
C=2"-1.
Integration of (7.3.4.10) leads to
x =
~ lAY V + 2dC:Shry _ 2
S2
The singular point is then given by
Xs
11
= >: 0
00
dry
vs 2 + 2 cosh ry - 2
For the approximate evaluation of the integral we decompose the interval of integration,
    [0, ∞) = [0, ε) ∪ [ε, ∞),    ε > 0 arbitrary,
and estimate the partial integrals separately. Using 2 cosh η − 2 = 4 sinh²(η/2) ≥ η², we find
    ∫₀^ε dη / √(s² + 2 cosh η − 2) ≤ ∫₀^ε dη / √(s² + η²) = ln( (ε + √(ε² + s²)) / |s| )
and
    ∫_ε^∞ dη / √(s² + 2 cosh η − 2) = ∫_ε^∞ dη / √(s² + 4 sinh²(η/2)) ≤ ∫_ε^∞ dη / (2 sinh(η/2)) = −ln( tanh(ε/4) ).
One thus gets the estimate
    x_s ≤ (1/λ) ln( (ε + √(ε² + s²)) / (|s| tanh(ε/4)) ) =: H(ε, s).
For each ε > 0 the quantity H(ε, s) is an upper bound for x_s; therefore, in particular,
    x_s ≤ H(√|s|, s)    for all s ≠ 0.
The asymptotic behavior of H(√|s|, s) for s → 0 can easily be found: For small |s|, in fact, we have in first approximation
    tanh(√|s|/4) ≈ √|s|/4,
so that, asymptotically for s → 0,
(7.3.4.11)    x_s ≤ H(√|s|, s) ≈ (1/λ) ln(8/|s|).
[One can even show (see below) that, asymptotically for s → 0,
(7.3.4.12)    x_s ≈ (1/λ) ln(8/|s|)
holds.]
If, in the shooting process, one is to arrive at the right-hand boundary x = 1, the value of |s| can thus be chosen at most so large that x_s ≥ 1, i.e., ln(8/|s|) ≥ λ, or |s| ≤ 8 e^{−λ}; for λ = 5, one obtains the small domain
(7.3.4.13)    |s| ≤ 0.05.
For the "connoisseur" we add the following: The initial-value problem
(7.3.4.14)
y" = >. sinh >.y,
has the exact solution
2.
y(x; s) = ); smh
-1
(s2"
y(O) = 0,
y'(O) = s
2
e = 1-~·
4'
sn(>,x lk 2 ) )
cn(>,xlk 2 ) ,
here, sn and cn are the Jacobian elliptic functions with modulus k, which here
depends on the initial slope s. If K(k 2 ) denotes the quarter period of cn, then cn
has a zero at
(7.3.4.15)
and therefore y(x; s) has a logarithmic singularity there. But K(k 2 ) has the expansion
K(k 2 ) = In
4 + -1( v!f=k2
4 )
v!f=k2
or, rewritten in terms of s,
4
In
- 1
(1 - k 2 ) + ...
'
7.3 Boundary-Value Problems
K(k 2 ) = In -8 + -S2 ( In -8 - 1)
lsi 16
lsi
557
+ ... ,
from which (7.3.4.12) follows.
For the solution of the actual boundary-value problem, i.e.,
    y(1; s) = 1,
one finds for λ = 5 the value
    s = 4.57504614 × 10⁻²
[cf. (7.3.4.13)]. This value is obtained from a relation between Jacobi functions combined with an iteration method of Section 5.1. For completeness, we add still another relation, which can be derived from the theory of Jacobi functions. The solution of the boundary-value problem (7.3.4.8), (7.3.4.9) has a logarithmic singularity at
(7.3.4.16)    x_s = 1 + 1/(λ cosh(λ/2)).
For λ = 5 it follows that
    x_s ≈ 1.0326...;
the singularity of the exact solution lies in the immediate neighborhood of the right-hand boundary point. The example sufficiently illuminates the difficulties that can arise in the numerical solution of boundary-value problems.
For a direct numerical solution of the problem without knowledge of the theory of elliptic functions, see Section 7.3.6.
A further example for the occurrence of singularities can be found in Exercise 19.
7.3.5 The Multiple Shooting Method
The multiple shooting method has been described repeatedly in the literature, for example in Keller (1968), Osborne (1969), and Bulirsch (1971). A FORTRAN program can be found in Oberle and Grimm (1989).
In a multiple shooting method, the values
    y(x_k),    k = 1, 2, ..., m,
of the exact solution y(x) of a boundary-value problem
(7.3.5.1)    y' = f(x, y),    r(y(a), y(b)) = 0,
at several points
    a = x₁ < x₂ < ··· < x_m = b
are computed simultaneously by iteration. To this end, let y(x; x_k, s_k) be the solution of the initial-value problem
    y' = f(x, y),    y(x_k) = s_k.
The problem now consists in determining the vectors s_k, k = 1, 2, ..., m, in such a way that the function
    y(x) := y(x; x_k, s_k)    for x ∈ [x_k, x_{k+1}),    k = 1, 2, ..., m − 1,
    y(b) := s_m,
pieced together from the y(x; x_k, s_k), is continuous, and thus a solution of the differential equation y' = f(x, y), and in addition satisfies the boundary conditions r(y(a), y(b)) = 0 (see Figure 20).
Fig. 20. Multiple shooting.
This yields the following n·m conditions:
(7.3.5.2)    y(x_{k+1}; x_k, s_k) = s_{k+1},    k = 1, 2, ..., m − 1,
             r(s₁, s_m) = 0,
for the n·m unknown components σ_{kj}, j = 1, 2, ..., n, k = 1, 2, ..., m, of the vectors
    s_k = [σ_{k1}, σ_{k2}, ..., σ_{kn}]ᵀ.
Altogether, (7.3.5.2) represents a system of equations of the form
(7.3.5.3)    F(s) := [ F₁(s₁, s₂)                  [ y(x₂; x₁, s₁) − s₂
                       F₂(s₂, s₃)                    y(x₃; x₂, s₂) − s₃
                          ⋮                  =          ⋮                          =  0
                       F_{m−1}(s_{m−1}, s_m)         y(x_m; x_{m−1}, s_{m−1}) − s_m
                       F_m(s₁, s_m) ]                r(s₁, s_m) ]
in the unknowns
    s := [s₁, s₂, ..., s_m]ᵀ.
It can be solved iteratively with the help of Newton's method,
(7.3.5.4)    s^(i+1) = s^(i) − DF(s^(i))⁻¹ F(s^(i)),    i = 0, 1, ... .
In order to still induce convergence if at all possible, even for a poor choice of the starting vector s^(0), instead of (7.3.5.4) one takes in practice the modified Newton method (5.4.2.4) [see Section 7.3.6 for further indications concerning the implementation of the method]. In each step of the method one must compute F(s) and DF(s) for s = s^(i). For the computation of F(s) one has to determine the values y(x_{k+1}; x_k, s_k) for k = 1, 2, ..., m − 1 by solving the initial-value problems
    y' = f(x, y),    y(x_k) = s_k,    x ∈ [x_k, x_{k+1}],
and to compute F(s) according to (7.3.5.3). The Jacobian matrix
    DF(s) = [ D_{s_k} F_i(s) ]_{i,k=1,...,m},
in view of the special structure of the F_i in (7.3.5.3), has the form
(7.3.5.5)    DF(s) = [ G₁  −I
                            G₂  −I
                                 ⋱     ⋱
                                       G_{m−1}  −I
                        A                        B ],
where the n × n matrices A, B, G_k, k = 1, ..., m − 1, in turn are Jacobian matrices,
(7.3.5.6)    G_k := D_{s_k} F_k(s) = D_{s_k} y(x_{k+1}; x_k, s_k),    k = 1, 2, ..., m − 1,
             B := D_{s_m} F_m(s) = D_{s_m} r(s₁, s_m),
             A := D_{s₁} F_m(s) = D_{s₁} r(s₁, s_m).
As already described for the simple shooting method, it is expedient in practice to again replace the differential quotients in the matrices A, B, G_k by difference quotients, which can be computed by solving additional (m − 1)n initial-value problems (n initial-value problems for each matrix G₁, ..., G_{m−1}). The computation of s^(i+1) from s^(i) according to (7.3.5.4) can be carried out as follows: With the abbreviations
(7.3.5.7)    s^(i+1) =: s^(i) − Δs,    Δs = [Δs₁, Δs₂, ..., Δs_m]ᵀ,    F_k := F_k(s^(i)),
(7.3.5.4) (or the slightly modified substitute problem) is equivalent to the following system of linear equations:
(7.3.5.8)    G₁ Δs₁ − Δs₂ = −F₁,
             G₂ Δs₂ − Δs₃ = −F₂,
                ⋮
             G_{m−1} Δs_{m−1} − Δs_m = −F_{m−1},
             A Δs₁ + B Δs_m = −F_m.
Beginning with the first equation, one can express all Δs_k successively in terms of Δs₁. One thus finds
(7.3.5.9)    Δs₂ = G₁ Δs₁ + F₁,
             Δs₃ = G₂ G₁ Δs₁ + G₂ F₁ + F₂,
                ⋮
             Δs_m = G_{m−1} ··· G₁ Δs₁ + G_{m−1} ··· G₂ F₁ + ··· + F_{m−1},
and from this finally, by means of the last equation,
(7.3.5.10)    (A + B G_{m−1} G_{m−2} ··· G₁) Δs₁ = w,
where w := −(F_m + B F_{m−1} + B G_{m−1} F_{m−2} + ··· + B G_{m−1} G_{m−2} ··· G₂ F₁).
This is a system of linear equations for the unknown vector Δs₁, which can be solved by means of Gauss elimination. Once Δs₁ is determined, one obtains Δs₂, Δs₃, ..., Δs_m successively from (7.3.5.8), and s^(i+1) from (7.3.5.7).
We further remark that under the assumptions of Theorem (7.3.3.4) one can show that the matrix A + B G_{m−1} ··· G₁ in (7.3.5.10) is nonsingular. Furthermore, there again will be a nonsingular nm × nm matrix Q such that the function
    Φ(s) := s − Q F(s)
is contracting, and thus has exactly one fixed point s̄ with F(s̄) = 0.
It is seen, in addition, that F(s), and essentially also DF(s), is defined for all vectors
    s = [s₁, s₂, ..., s_m]ᵀ ∈ M^(1) × M^(2) × ··· × M^(m−1) × ℝⁿ,
so that the iteration (7.3.5.4) of the multiple shooting method can be executed for all such s. Here M^(k), k = 1, 2, ..., m − 1, is the set of all vectors s_k for which the solution y(x; x_k, s_k) exists (at least) on the small interval [x_k, x_{k+1}]. This set M^(k) includes the set M_k of all s_k for which y(x; x_k, s_k)
exists on all of [a, b]. Now Newton's method for computing s̄_k by means of the simple shooting method can only be executed for s_k ∈ M_k ⊂ M^(k). This shows that the demands of the multiple shooting method upon the quality of the starting vectors in Newton's method are considerably more modest than those of the simple shooting method.
7.3.6 Hints for the Practical Implementation of the Multiple
Shooting Method
The multiple shooting method in the form described in the previous section is still rather expensive. The iteration in the modified Newton's method (5.4.2.4), for example, has the form
(7.3.6.1)    s^(i+1) = s^(i) − λ_i (ΔF(s^(i)))⁻¹ F(s^(i)),
and in each step one has to compute at least the approximation ΔF(s^(i)) of the Jacobian matrix DF(s^(i)) by forming appropriate difference quotients. The computation of ΔF(s^(i)) alone amounts to the solution of n(m − 1) initial-value problems. This enormous amount of work can be substantially reduced by means of the techniques described in Section 5.4.3, recomputing the matrix ΔF(s^(i)) only occasionally, and employing the rank-one procedure of Broyden in all remaining iteration steps in order to obtain approximations for ΔF(s^(i)).
We next wish to consider the problem of how to choose the intermediate points x_k, k = 1, ..., m, in [a, b], provided an approximate solution (starting trajectory) η(x) is known for the boundary-value problem. Put x₁ := a. Having already chosen x_i (< b), integrate the initial-value problem
    η_i' = f(x, η_i),    η_i(x_i) = η(x_i)
by means of the methods of Section 7.2, and terminate the integration at the first point x = ξ for which the solution ||η_i(ξ)|| becomes "too large" compared to ||η(ξ)||, say ||η_i(ξ)|| ≥ 2 ||η(ξ)||; then put x_{i+1} := ξ.
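A crude Python sketch of this node-selection rule, using a fixed-step Euler integrator purely for bookkeeping (the methods of Section 7.2 would be used in practice), an assumed growth factor of 2, and illustrative function names:

import numpy as np

def choose_subdivision(f, eta, a, b, h=1e-3, growth=2.0):
    """Choose multiple shooting nodes x_1 < ... < x_m: starting from x_i,
    integrate y' = f(x, y) with initial value eta(x_i) and cut off as soon as
    the numerical solution has grown to 'growth' times the size of eta."""
    nodes = [a]
    while nodes[-1] < b:
        x = nodes[-1]
        y = np.array(eta(x), dtype=float)
        while x < b:
            y = y + h * f(x, y)               # one explicit Euler step
            x = min(x + h, b)
            if np.linalg.norm(y) > growth * max(np.linalg.norm(eta(x)), 1e-8):
                break
        nodes.append(x)
    return nodes

# usage for (7.3.6.2): y'' = 5 sinh 5y, starting trajectory eta(x) = (x, 1)
if __name__ == "__main__":
    f = lambda x, y: np.array([y[1], 5.0 * np.sinh(5.0 * y[0])])
    eta = lambda x: np.array([x, 1.0])
    print(choose_subdivision(f, eta, 0.0, 1.0))   # nodes cluster near x = 1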
EXAMPLE 1. Consider the boundary-value problem
(7.3.6.2)    y'' = 5 sinh 5y,    y(0) = 0,    y(1) = 1
[cf. Example 2 of Section 7.3.4].
For the starting trajectory η(x) we take the straight line connecting the two boundary points, η(x) := x.
The subdivision 0 = x₁ < x₂ < ··· < x_m = 1 found by the computer is shown in Figure 21. Starting with this subdivision and the values s_k := η(x_k), s_k' := η'(x_k), the multiple shooting method yields the solution to about 9 digits in 7 iterations.
The problem suggests a "more advantageous" starting trajectory: the linearized problem
    η'' = 5·5 η,    η(0) = 0,    η(1) = 1
has the solution η(x) := sinh 5x / sinh 5. This function would have been a somewhat "better" starting trajectory.
Fig. 21. Subdivision in multiple shooting.
In some practical problems the right-hand side f(x, y) of the differential equation is only piecewise continuous on [a, b] as a function of x, or else only piecewise continuously differentiable. In these cases one ought to be careful to include the points of discontinuity among the subdivision points x_i; otherwise there are convergence difficulties [see Example 2].
There remains the problem of how to find a first approximation η(x) for a boundary-value problem. In many cases one knows the qualitative behavior of the solution, e.g., on physical grounds, so that one can easily find at least a rough approximation. Usually this also suffices for convergence, as the modified Newton's method does not impose high demands on the quality of the starting vector.
In complicated cases the so-called continuation method (homotopy method) is usually effective. Here one gradually feels one's way, via the solution of "neighboring" problems, toward the solution of the actually posed problem: Almost all problems contain certain parameters α,
(7.3.6.3)    y' = f(x, y; α),    r(y(a), y(b); α) = 0,
and the problem P_α consists in determining the solution y(x) ≡ y(x; α), which of course depends also on α, for a certain value α = ᾱ. It is usually true that for a certain other value α = α₀ the boundary-value problem P_{α₀} is simpler, so that at least a good approximation η₀(x) is known for the solution y(x; α₀) of P_{α₀}. One then chooses a finite sequence of sufficiently close numbers ε_i with 0 < ε₁ < ε₂ < ··· < ε_l = 1, puts α_i := α₀ + ε_i (ᾱ − α₀), and for i = 0, 1, ..., l − 1 takes the solution y(x; α_i) of P_{α_i} as the starting approximation for the determination of the solution y(x; α_{i+1}) of P_{α_{i+1}}. In y(x; α_l) = y(x; ᾱ) one then finally has the desired solution of P_{ᾱ}. For the method to work, it is important that one not take too large steps ε_i → ε_{i+1}, and that "natural" parameters α, which are intrinsic to the problem, be chosen: The introduction of artificial parameters, say constructs of the type
    f(x, y; α) = α f(x, y) + (1 − α) g(x, y),    ᾱ = 1,    α₀ = 0,
where f(x, y) is the given right-hand side of the differential equation and g(x, y) an arbitrarily chosen "simple" function which has nothing to do with the problem, does not succeed in critical cases.
A more refined variant of the continuation method, which has proved to be useful in the context of multiple shooting methods, is described in Deuflhard (1979).
EXAMPLE 2. (A singular boundary-value problem). The steady-state temperature distribution in the interior of a cylinder of radius 1 is described by the solution y(x; α) of the nonlinear boundary-value problem [cf., e.g., Na and Tang (1969)]
(7.3.6.4)    y'' = −y'/x − α e^y,    y'(0) = y(1) = 0.
Here, α is a "natural" parameter with
    α = (heat generation)/(conductivity),    0 < α ≤ 0.8.
For the numerical computation of the solution for α = 0.8 one could use the subdivision
    α = 0, 0.1, ..., 0.8
and, beginning with the explicitly known solution y(x; 0) ≡ 0, construct the homotopy chain for the computation of y(x; 0.8). The problem, however, exhibits one further difficulty: The right-hand side of the differential equation has a singularity at x = 0. While it is true that y'(0) = 0 and the solution y(x; α) is defined, indeed analytic, on the whole interval 0 ≤ x ≤ 1, there nevertheless arise considerable convergence difficulties. The reason is to be found in the following: A backward analysis [cf. Section 1.3] shows that the result ȳ of the numerical computation can be interpreted as the exact solution of the boundary-value problem (7.3.6.4) with slightly perturbed boundary data
    ȳ'(0) = ε₁,    ȳ(1) = ε₂
[see, e.g., Babuska et al. (1966)]. This solution ȳ(x; α) is "near" the solution y(x; α), but in contrast to y(x; α) has only a continuous first derivative at x = 0, because
    lim_{x→0} ȳ'' = −lim_{x→0} ε₁/x = ±∞.
Since the order of convergence of every numerical method depends not only on the method itself, but also (as is often overlooked) on the existence and boundedness of the higher derivatives of the solution, the order of convergence in the present example will be considerably reduced (to essentially 1). These difficulties, and
others as well, are encountered in many practical boundary-value problems. The
cause of failure is often not recognized, and the "blame" put on the method or
the computer.
Through a simple artifice these difficulties can be avoided. The "neighboring" solutions with y'(0) ≠ 0 are filtered out by means of a power series expansion about x = 0:
(7.3.6.5)    y(x) = y(0) + y'(0) x + y''(0) x²/2! + y⁽³⁾(0) x³/3! + ··· .
The coefficients y^(i)(0), i = 2, 3, 4, ..., can all be expressed in terms of λ := y(0), an unknown constant; through substitution in the differential equation (7.3.6.4) one finds
(7.3.6.6)    x y''(x) + y'(x) = −α x e^{y(x)},
from which, for x → 0,
    2 y''(0) = −α e^λ,    i.e.,    y''(0) = −(α/2) e^λ.
Differentiation of (7.3.6.6) and passage to the limit x → 0 gives
    y^(3)(0) = 0,
and further
    y^(4)(0) = (3/8) α² e^{2λ}.
One can show that y^(5)(0) = 0, and in general, y^(2i+1)(0) = 0.
The singular boundary-value problem can now be treated as follows. For y(x; α) one utilizes in the neighborhood of x = 0 the power-series representation (7.3.6.5), and at a sufficient distance from the singularity x = 0 the differential equation (7.3.6.4) itself, for example,
(7.3.6.7)    y''(x) = { −(α/2) e^λ [ 1 − (3/8) x² α e^λ ]    if 0 ≤ x ≤ 10⁻²,
                      { −y'(x)/x − α e^{y(x)}                if 10⁻² ≤ x ≤ 1.
The error is of the order of magnitude 10⁻⁸. Now the right-hand side still contains the unknown parameter λ = y(0). As shown in Section 7.3.0, however, this can be interpreted as an extended boundary-value problem. One puts
    y₁(x) := y(x),    y₂(x) := y'(x),    y₃(x) := y(0) = λ,
and obtains the system of differential equations on 0 ≤ x ≤ 1,
(7.3.6.8)    y₁' = y₂,
             y₂' = { −(α/2) e^{y₃} [ 1 − (3/8) x² α e^{y₃} ]    if 0 ≤ x ≤ 10⁻²,
                   { −y₂/x − α e^{y₁}                           if 10⁻² ≤ x ≤ 1,
             y₃' = 0,
with the boundary conditions
(7.3.6.9)    y₂(0) = 0,    y₁(1) = 0,    y₁(0) − y₃(0) = 0.
In the multiple shooting method one chooses the "patch point" 10⁻² as one of the subdivision points, say,
    x₁ = 0,    x₂ = 10⁻²,    x₃ = 0.1, ...,    x₁₁ = 0.9,    x₁₂ = 1.
There is no need to worry about the smoothness of the solution at the patch point; it is ensured by construction. With the use of the homotopy [see (7.3.6.3)], beginning with y(x; α₀) ≡ 0, the solution y(x; α_k) is easily obtained from y(x; α_{k−1}) in only 3 iterations to an accuracy of about 6 digits (total computing time on the CDC 3600 computer for all 8 trajectories: 40 seconds). The results are shown in Figure 22.
Fig. 22. The solution y(x; α) for α = 0(0.1)0.8.
7.3.7 An Example: Optimal Control Program for a Lifting Reentry Space Vehicle
The following extensive problem originates in space travel. This concluding
example is to illustrate how nonlinear boundary-value problems are handled
and solved in practice, because experience shows that the actual solution
of such real-life problems causes the greatest difficulties to the beginner.
The numerical solution was carried out with the multiple shooting method;
a program can be found in Oberle and Grimm (1989).
The coordinates of the trajectory of an Apollo type vehicle satisfy, during the flight of the vehicle through the earth's atmosphere, the following differential equations:
(7.3.7.1)    v̇ = v̇(v, γ, ξ, u) = −(S ρ v² / 2m) C_D(u) − g sin γ / (1 + ξ)²,
             γ̇ = γ̇(v, γ, ξ, u) = (S ρ v / 2m) C_L(u) + v cos γ / (R(1 + ξ)) − g cos γ / (v(1 + ξ)²),
             ξ̇ = ξ̇(v, γ, ξ, u) = (v sin γ) / R,
             ζ̇ = ζ̇(v, γ, ξ, u) = (v / (1 + ξ)) cos γ.
The meanings are: v: velocity; γ: flight-path angle; h: altitude above the earth's surface; R: earth radius; ξ = h/R: normalized altitude; ζ: distance on the earth's surface; ρ = ρ₀ exp(−βRξ): atmospheric density; C_D(u) = 1.174 − 0.9 cos u: aerodynamic drag coefficient; C_L(u) = 0.6 sin u: aerodynamic lift coefficient; u: control parameter, which can be chosen arbitrarily as a function of time; g: gravitational acceleration; S/m: (frontal area)/(mass of vehicle). Numerical values are: R = 209 (≙ 209·10⁵ ft); β = 4.26; ρ₀ = 2.704·10⁻³; g = 3.2172·10⁻⁴; S/m = 53,200; (1 ft = 0.3048 m). (See Figure 23.)
Fig. 23. The coordinates of the trajectory.
The differential equations have been simplified somewhat by assuming (1) a spherical earth at rest, (2) a flight trajectory in a great circle plane, (3) that vehicle and astronauts can be subjected to unlimited deceleration. The right-hand sides of the differential equations nevertheless contain all terms which are essential physically. The largest effect is produced by the terms multiplied by C_D(u) and C_L(u), respectively: these are the atmospheric forces, which, in spite of the low air density ρ, are particularly significant by virtue of the high speed of the vehicle; they can be influenced via the parameter u (angle of attack). The terms multiplied by g are the gravitational forces of the earth acting on the vehicle; the remaining terms result from the choice of the coordinate system.
During the flight through the earth's atmosphere the vehicle is heated up considerably. The total stagnation point convective heating per unit area is given by the integral
(7.3.7.2)    J = ∫₀^T q dt.
The range of integration is from t = 0, the time when the vehicle hits the 400,000 ft atmospheric level, until a time instant T. The vehicle is to be maneuvered into an initial position favorable for the final splashdown in the Pacific. Through the freely disposable parameter u the maneuver is to be executed in such a way that the heating J becomes minimal and that the following boundary conditions are satisfied: Data at the moment of entry:
(7.3.7.3)    v(0) = 0.36 (≙ 36,000 ft/sec),
             γ(0) = −8.1° · π/180°,
             ξ(0) = 4/R    [h(0) = 4 (≙ 400,000 ft)].
Data at the end of maneuver:
(7.3.7.4)    v(T) = 0.27 (≙ 27,000 ft/sec),
             γ(T) = 0,
             ξ(T) = 2.5/R    [h(T) = 2.5 (≙ 250,000 ft)].
The terminal time T is free. ζ is ignored in the optimization process.
The calculus of variations [cf., e.g., Hestenes (1966)] now teaches the following: Form, with multiplier functions λ_v, λ_γ, λ_ξ (Lagrange multipliers), the expression
(7.3.7.5)    H = q + λ_v v̇ + λ_γ γ̇ + λ_ξ ξ̇,
where v̇, γ̇, ξ̇ stand for the right-hand sides of (7.3.7.1) and λ_v, λ_γ, λ_ξ satisfy the three differential equations
(7.3.7.6)    λ̇_v = −∂H/∂v,    λ̇_γ = −∂H/∂γ,    λ̇_ξ = −∂H/∂ξ.
The optimal control u is then given by
(7.3.7.7)    sin u = −0.6 λ_γ / D,    cos u = −0.9 v λ_v / D,    D := √( (0.6 λ_γ)² + (0.9 v λ_v)² ).
Note that (7.3.7.6), by virtue of (7.3.7.7), is nonlinear in λ_v, λ_γ.
Since the terminal time T is not subject to any condition, the further boundary condition must be satisfied:
(7.3.7.8)    H|_{t=T} = 0.
The problem is thus reduced to a boundary-value problem for the six differential equations (7.3.7.1), (7.3.7.6) with the seven boundary conditions (7.3.7.3), (7.3.7.4), (7.3.7.8). We are thus dealing with a free boundary-value problem [cf. (7.3.0.4a,b)]. A closed-form solution is impossible; one must use numerical methods.
It would be wrong to construct a starting trajectory η(x) without reference to reality. The inexperienced should not be misled by the innocent-looking form of the right-hand sides of the differential equations (7.3.7.1): During the numerical integration one quickly observes that v, γ, ξ, λ_v, λ_γ, λ_ξ depend in an extremely sensitive way on the initial data. The solution has moving singularities which lie in the immediate neighborhood of the initial point of integration [see for this the comparatively trivial example (7.3.6.2)]. This sensitivity is a consequence of the effect of atmospheric forces, and the physical interpretation of the singularity is a "crash" of the vehicle or a "hurling back" into space. As can be shown by an a posteriori calculation, there exist differentiable solutions of the boundary-value problem only for an extremely narrow domain of boundary data. This is the mathematical formulation of the danger involved in the reentry maneuver.
Construction of a Starting Trajectory. For aerodynamic reasons the graph of the control parameter u will have the shape depicted in Figure 24 (information from space engineers). This function can be approximated, e.g., by
(7.3.7.9)
where
    τ := t/T,    0 < τ < 1,    erf(x) := (2/√π) ∫₀ˣ e^{−σ²} dσ,
and p₁, p₂, p₃ are, for the moment, unknown constants.
Fig. 24. Control parameter (empirical).
For the determination of the p_i one solves the following auxiliary boundary-value problem: the differential equations (7.3.7.1) with u from (7.3.7.9) and in addition
(7.3.7.10)    ṗ₁ = 0,    ṗ₂ = 0,    ṗ₃ = 0,
with the boundary conditions (7.3.7.3), (7.3.7.4) and T = 230.
The boundary value T = 230 sec is an estimate for the duration of the reentry maneuver. With the relatively poor approximations
    p₁ = 1.6,    p₂ = 4.0,    p₃ = 0.5
the auxiliary boundary-value problem can be solved even with the simple shooting method, provided one integrates in the backward direction (initial point τ = 1, terminal point τ = 0). The result after 11 iterations is
(7.3.7.11)    p₁ = 1.09835,    p₂ = 6.48578,    p₃ = 0.347717.
Solution of the Actual Boundary-Value Problem. With the "nonoptimal" control function u from (7.3.7.9), (7.3.7.11) one obtains approximations to v(t), γ(t), ξ(t) by integrating (7.3.7.1). This "incomplete" starting trajectory can be made into a "complete" one as follows: since cos u > 0, it follows from (7.3.7.7) that λ_v < 0; we choose λ_v ≡ −1. Further, from
    tan u = 6 λ_γ / (9 v λ_v)
there follows an approximation for λ_γ.
Fig. 25. The trajectories h, γ, v.
Fig. 26. The trajectories of the adjoint variables λ_ξ, λ_γ, λ_v.
An approximation for λ_ξ can be obtained from the relation H ≡ 0, where H is given by (7.3.7.5), because H = const is a first integral of the equations of motion.
With this approximate trajectory for v, γ, ..., λ_ξ, T (T = 230) and the techniques of Section 7.3.6, one can now determine the subdivision points for the multiple shooting method (m = 6 suffices).
Figures 25 and 26 show the result of the computation (after 14 iterations); the total time of the optimal reentry maneuver comes out to be T = 224.9 sec; the accuracy of the results is about 7 digits.
Fig. 27. Successive approximations to the control u (initial values; after 6 iterations; after 14 iterations, accuracy about 7 digits).
Figure 27 shows the behavior of the control u = arctan( 6 λ_γ / (9 v λ_v) ) during the course of the iteration. It can be seen how the initially large jumps at the subdivision points of the multiple shooting method are "flattened out".
The following table shows the convergence behavior of the modified Newton's method [see Sections 5.4.2 and 5.4.3].
The table lists, for each iteration, the error ||F(s^(i))||₂ [cf. (7.3.5.3)] and the step length λ used in (5.4.3.5). The error decreases monotonically from about 6 × 10⁴ at the start to about 1 × 10⁻⁹ after 14 iterations; the step lengths begin with reduced (trial) values 0.500, 0.250, 0.125 and reach the full Newton step λ = 1.000 in the later iterations.
7.3.8 Advanced Techniques in Multiple Shooting
In Section 7.3.5 we examined a prototype version of the multiple shooting method. In this section, we deal with some advanced techniques for reducing the large computational expense when applying the multiple shooting algorithm to complex problems in science and engineering.
Calculation of the Jacobian Matrix. The calculation of approximations ΔF(s^(i)) of the Jacobian matrices DF(s^(i)), especially of the partial derivatives G_k in (7.3.5.6), is the most expensive part of the algorithm. Those partial derivatives
    G_k = D_{s_k} y(x_{k+1}; x_k, s_k),    k = 1, 2, ..., m − 1,
can be computed by solving at least n(m − 1) additional initial-value problems and forming appropriate difference quotients. This approach is straightforward, but no information concerning the accuracy of the G_k is available, and all initial-value problems have to be solved to high accuracy. Low accuracy of the G_k often leads to poor convergence of the iteration to solve the nonlinear system (7.3.5.4).
In view of these problems, another approach has been developed, based on the integration of the so-called "variational equation" [see Hiltmann (1990), Callies (2000)]
(7.3.8.1)    ∂G_k(x; x_k, s_k)/∂x = f_y(x, y(x; x_k, s_k)) G_k(x; x_k, s_k),    G_k(x_k; x_k, s_k) = I.
To prove (7.3.8.1), we only have to differentiate
(7.3.8.2)    y' = f(x, y),    y(x_k) = s_k
with respect to s_k, observing that y = y(x; x_k, s_k) is also a function of s_k. (7.3.8.1) and (7.3.8.2) are integrated in parallel with completely different integrators specially tailored to the respective differential equation. In the case of (7.3.8.1), full advantage is taken of the linearity of the system to save about half of the function evaluations [see Callies (2001)]. Step size control is done independently, integration accuracy is different for each system but controlled for both systems, and switching points can be located much more easily than in the original approach.
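A Python sketch of the parallel integration of (7.3.8.1) and (7.3.8.2); for simplicity both systems share one fixed-step Runge-Kutta integrator here, whereas the cited approach uses separate, specially tailored integrators with independent step size control. Function names are illustrative only.

import numpy as np

def propagate_with_variational_equation(f, fy, xk, sk, xk1, n_steps=200):
    """Integrate y' = f(x, y), y(x_k) = s_k together with the variational
    equation (7.3.8.1), G' = f_y(x, y) G, G(x_k) = I, over [x_k, x_{k+1}]."""
    n = len(sk)
    h = (xk1 - xk) / n_steps
    def rhs(x, u):
        y, G = u[:n], u[n:].reshape(n, n)
        return np.concatenate([f(x, y), (fy(x, y) @ G).ravel()])
    u = np.concatenate([np.asarray(sk, float), np.eye(n).ravel()])
    x = xk
    for _ in range(n_steps):
        k1 = rhs(x, u); k2 = rhs(x + h/2, u + h/2*k1)
        k3 = rhs(x + h/2, u + h/2*k2); k4 = rhs(x + h, u + h*k3)
        u = u + h/6*(k1 + 2*k2 + 2*k3 + k4); x += h
    return u[:n], u[n:].reshape(n, n)      # y(x_{k+1}; x_k, s_k) and G_k

# usage: the linear system of Example 1 of Section 7.3.4; since it is linear,
# G_k is simply the matrix exponential of (x_{k+1} - x_k) * T
if __name__ == "__main__":
    T = np.array([[0.0, 1.0], [100.0, 0.0]])
    f = lambda x, y: T @ y
    fy = lambda x, y: T
    y_end, G = propagate_with_variational_equation(f, fy, 0.0, [1.0, 0.0], 0.1)
    print(G)     # approximately [[cosh 1, sinh(1)/10], [10 sinh 1, cosh 1]]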
We also note the formula
(7.3.8.3)    ∂y(x_{k+1}; x_k, s_k)/∂x_k = −G_k(x_{k+1}; x_k, s_k) · f(x_k, s_k)
for the partial derivative of y(x_{k+1}; x_k, s_k) with respect to x_k, which follows from the differentiation of (7.3.8.2) and of the identity y(x_k; x_k, s_k) ≡ s_k with respect to x_k.
Piecewise Continuous Right Hand Sides. In many applications, the right hand side f(x, y) of the differential equation is discontinuous or not differentiable [cf. Section 7.2.18]. Even jumps in the solution y(x) are possible. These problems could, in principle, be treated in the framework of the basic multiple shooting algorithm by incorporating the discontinuity points into the set of the multiple shooting abscissae x_i, but the price to be paid is a dramatic increase in the number of shooting points, accompanied by a similar increase in the size of the nonlinear system (7.3.5.3).
These problems can be overcome by working with two types of discretizations [Callies (2000a)], the macro-discretization x₁ < x₂ < ··· < x_m and the micro-discretization x_ν =: x_{ν,1} < x_{ν,2} < ··· < x_{ν,κ_ν} := x_{ν+1}, ν = 1, ..., m − 1.
Consider the following piecewise defined multi-point boundary value
problem
(7.3.8.4)
y' = fV.I"(x,y), X E [xv'l",xv'l"+d}
v = 1, ... ,m - 1, JL = 1, ... ,"'V - 1,
y(xv.l") = SV,I"
v = 1, ... , m - 1, JL = 2, ... , "'v,
Using the abbreviations Sl := Sl,l and
v = 1, ... , m -1,
the solution of (7.3.8.4) is formally equivalent to the solution of a special
system of nonlinear equations [cf. (7.3.5.3)]
574
7 Ordinary Differential Equations
Xl
2 ))
"2(X3, 53, y(X 3 ))
"I (X2. 52, y(X
(7.3.8.5)
51
X2
= 0,
F(z) :=
Z .-
1'm -1(X m , 5m , y(x;))
1'(X1' 51, Xm , 5m )
82
Xm
8m
Here. Y(X;';+l) := Y(X;;+l; XV' 8 v ), v = 1, ... , m - 1, and, for X E
[xv,!" XV.I'+d, y(.) is the solution of the initial value problem
y' = iv.1' (X, y),
X E [XV,I" XV.I'+l), fl = 1, ... , Kv - 1,
y(xv'l') = 8 v.!"
where 8 v .1 .- 8 v , and, for fl = 2, ... , Kv - 1, xV.1' and 8 V .?l satisfy the
equation
"v. I' (xv.I" 5 u .1' , y(x;;'I')) = O.
Note that the switching and jump functions considered in Sections 7.2.18
generate equations of this type.
These problems are rather general, since the XV,I' may not be known
a priori, jumps in the solution y(.) are allowed at x = xV.I" and the right
hand side of the differential equation iv.1' may depend on v and fl. Note that
(7.3.8.4) encompasses the basic problem (7.3.5.1) and (7.3.G.2) treated in
Section 7.3.5 with a fixed i, fixed abscissae Xi in the macro-discretization,
no micro-discretization points, no jumps in the solution, as special case: we
only have to set Kv := 2, iv.1' := i, Xv.1 := Xv, Xv.2 := Xv+1, 8v.1 := 8v ,
"v.2(X. 8, y) := y - 8, and to drop the variables Xl, X2, ... , Xm from z in
(7.3.8.5) .
The nonlinear system (7.3.8.5) is iteratively solved e.g. by a modified
Newton method (iteration of the linearized system). The ~-th iteration step
reads
(7.3.8.6)
z(~+l) := z(EJ -
A(LlF(z(O))-l F(z(O),
LlF(z(O) :=:::; D F(z(E,))
An iteration step is accepted if both of the following tests are valid:
(7.3.8.7)
IIF(z(U 1)) I ~ IIF(z(~)) I
(Test 1)
II (LlF(z(O)) -1 F(Z(~+l)) II ~ I (LlF(z(O)) -1 F(z(O) II
(Test 2)
In particular, calculating the Jacobian matrix D F(z) requires computing the sensitivity matrices
v = 1, 2, ... , m - 1.
7.3 Boundary-Value Problems
575
This presents a problem when the open interval (xv, xv+d contains points
of the micro-discretization, "'v > 2. We will treat this problem, but only
for the most elementary case "'v = 3 and only one such interval, when this
interval contains exactly one point of the micro-discretization. This leads
to the following core problem and a basic assumption.
Assumption (A): Let Xl < X2 < X3 and f > O. Further let h E
CN([XI - f,X2 + f] X IRn,IR n ), h E C N ([X2 - f,X3 + f] X IRn,IRn ), and
q E C 2 ([Xl, .T3] X IRn , IR). Consider the piecewise defined system of ordinary
differential equation8
(7.3.8.8)
y' _ {h(X,. y(x)),.
h(x, y(x)),
if q(x, y(x)) < 0,
if q(x, y(x)) > 0,
and the initial condition Y(Xl) = 81; let X2 exist with q(X2,y(X2";Xl,Sl)) =
o and start the integration of h with the initial value (X2' S2), where S2 is
a solution (a8sumed to e.Tist) of a further equation p(y( x2"), S2) = 0 with a
function p: IR2n -+ IRn. Assume that q(x, y(x)) < 0 for Xl :S X < X2) and
q(x, y(x)) > 0 for X2 < X :S X3. Further let Q(x, y) := [qx + qyh](x, V)·
Finally, assume that Q(X2, y(X2)) > 0, p is twice continuously differentiable
on .'lome open convex set D C IR2n with (y(X2), S2) E D, and that the
Jacobian D S2 P(Y(X2), S2) i8 nonsingular.
If Assumption (A) is satisfied, the function q(x, y) is called a 8witching
function, p(y, 8) a tran8ition function, and X2 a design point.
Interpretation of the core problem: Xl and X3 represent two adjacent points of
the macro-discretization, and X2 the abscissa of the micro-discretization between
Xl and X3. The role of the equation Tv.I'("') = 0 in (7.3.8.4) is played by the
simple system
[ q(X2' Y(X 2 ))] = o.
p(y(x2 ), S2)
In many applications, p has the form p(y, s) = ¢(y) - s [ef. Section 7.2.18].
Assumption (A) allows the application of the implicit function theorem
and guarantees that the design point X2 and the new starting ordinate S2 are
locally unique solutions of smooth equations and depend smoothly on Sl,
X2 = X2(Sl), S2 = S2(sI) . This will be used in the proof of the following
Theorem:
(7.3.8.9) Theorem. Let AS8umption (A) be 8atisfied and let X2 be a de8ign point. Then the sensitivity matrix OY(X3; x2(sd, s2(sd)/OSl can be
computed by
576
7 Ordinary Differential Equations
OY(X3; x2(sd, S2(SI))
OSI
with the abbreviations
Op
Op(y, S2) I
oy =
oy
Y=Y(X 2 )'
Op
Op(y, 82) I
OS2 =
OS2
y=y(x;-)
PROOF. By Assumption (A), X2 and 82 satisfy q(X2,y(X2;Xl,SI)) = 0,
p(y(X 2 ;Xl,Sl),S2) = 0 and thus can be treated as functions of Sl (and
Xl, but the dependence on Xl is not used): X2 = X2(81), S2 = 82(8d. By
differentiating the identity q(x2(sd, y(X2(81); Xl, sd) == 0 we obtain by the
implicit function theorem
Analogously we get from the equation p(y,82) = 0, which defines 82 as
function of y,
082 = _ (op(y, 8 2 ) ) -1 op(y, 82) .
oy
OS2
oy
Using the chain rule and (7.3.8.3) we obtain
OY(X3; X2, S2)
OSl
OY(X3; X2, 82) 0.7:2
082 OY(X2)
- + OY(X3; X2, 82) -OX2
OSl
082
oy OSl
OX2
OS2 OY(X2))
= G 2(X3; X2, 82) ( - h(X2, 82) -0 + -a - a ,
Sl
Y
Sl
--'------::,-----------'--=
where aY(X2)/081 stands for
oy(x2(sd-;x1,sl) = 0y(x 2 ) OX2 + Oy(x;XI,Sdl
OSl
OX2 OSl
OSl
X=X 2
_ OX2
= h(X2,y(X 2 )) -0 +G 1(X2;X1,SI)'
Sl
7.3 Boundary-Value Problems
577
Combination of the equations yields the desired result.
0
The theorem stated above allows to treat piecewise continuous right
hand sides without an increase in the size of the nonlinear system (7.3.5.3),
but with equal precision. The determination of the zeros of the switching
functions q(x,y(x)) = 0 is coupled to the convergence tests (7.3.8.7) of
the shooting algorithm making the algorithm even more efficient (Callies
(200la)) .
7.3.9 The Limiting Case m -+ 00 of the Multiple Shooting
Method (General Newton's Method, Quasilinearization)
If the subdivision [a, bj is made finer and finer (m -+ (0), the multiple shooting method converges toward the general Newton's method for boundaryvalue problems [ef., e.g., Collatz (1966)], also known as quasilinearization.
In this method an approximation 17(X) for the exact solution y(x) of
the nonlinar boundary-value problem (7.3.5.1) is improved by solving linear bondary-value problems. Using Taylor expansion of f(x,y(x)) and
r(y(a), y(b)) about T)(x) one finds in first approximation [ef. the derivation
of Newton's method in Section 5.1]
y'(x) = f(x, y(x)) == f(x, T)(x)) + Dyf(x, T)(x)) (y(:r) - T)(x)),
0= r(y(a),y(b)) == T(17(a),T)(b)) + A(y(a) - 7/(a)) + B(y(b) -17(b)),
where A := Dur(T)(a), T)(b)), B := Dvr(T)(a), T)(b)). One may expect, therefore, that the solution 7)(x) of the linear boundary-value problem
(7.3.9.1)
7)' =
f(x, T)(x)) + Dyf(x, T)(x))(7) - T)(x))
A(7)(a) -T)(a)) +B(7)(b) -T)(b)) = -r(T)(a),T)(b)) ,
will be a better solution of (7.3.5.1) than T)(x). For later use we introduce
the correction function L1T)(x) := 7)(x) -T)(x), which by definition is solution
of the boundary-value problem
(7.3.9.2)
(L1T))' = f(x, T)(x)) - T)'(x) + Dyf(x, 17(x)) L117,
AL1T)(a) + BL1T)(b) = -T(17(a), 17(b)),
so that
(7.3.9.3) L1T)(x) = L117(a) +
l
x
[Dyf(t, T/(t))L1r/(t) + f(t, T)(t)) - T)'(t)] dt.
Replacing T) in (7.3.9.1) by 7), one obtains a further approximation r, for
y, etc. In spite of the simple derivation, this iteration method has serious
disadvantages which cast some doubt on its practical value:
(1) The vector functions T)( x), 7)( x), ... must be stored over their entire
range [a, bj.
578
7 Ordinary Differential Equations
(2) The matrix function Dyf(x, y) must be computed in explicit analytic
form and must likewise be stored over its entire range [a, b]. The realization of both these requirements is next to impossible in the problems
that currently arise in practice (e.g., f with 25 components, 500-1000
arithmetic operations per evaluation of f).
We now wish to show that the multiple shooting method as m -+ 00
converges to the method (7.3.9.1) in the following sense: Let 71(X) be a
sufficiently smooth function on [a, b] and let further (7.3.9.1) be uniquely
solvable with the solution fJ(x), L171(:r) := r)(.1;) -71(X). Then the following is
true: If in the multiple shooting method (7.3.5.3)-(7.3.5.10) one chooses any
subdivision a = Xl < X2 < ... < Xm = b of fineness h := maxk IXk+1 - xkl
and the starting vector
then the solution of (7.3.5.8) will yield a vector of corrections
(7.3.9.4)
L1s =
r
L1S12
L1s
:
1
L1,sm
For the proof we assume for simplicity IXk+l - xkl = h, k = 1, 2, ... ,
m - l. We want to show that to each h there exists a differentiable function
L1s: [a,b] -+ JRn such that maxk IIL1s k - L1S(Xk) I = O(h), L1s(a) = L1s 1
and maxxE[a.bJ IIL1s(x) - L171(X) II = O(h).
To this end, we first show that the products
G k- I G k - 2 ··· GJ+I = Gk- I G k - 2 ··· G 1 (GjGj - 1 ... G I )-1
appearing in (7.3.5.9), (7.3.5.10) have simple limits as h -+ 0 (hk = const,
hj = const). The representation (7.3.5.9) for L1s k can be written in the
form
(7.3.9.5)
k-2
L1s k = Fk-l + Gk-l ... G 1 [L1S 1 + ~ (Gj ... G 1 ) -1 Fj ] ,
J=l
According to (7.3.5.3), Fk is given by
thus
2 <::: k <::: m.
Fk = 8k +
1:
7.3 Boundary-Value Problems
k 1
+
579
f(t, y(t; Xk,Sk)) dt - Sk+l,
and with the help of the mean-value theorem,
Let now Z(x) be the solution of the initial-value problem
Z' = Dyf(x, T/(X) )Z,
(7.3.9.7)
Z(a) = I,
and the matrix Zk solution of
1 S; k S; m - 1,
so that
(7.3.9.8)
For the matrices Zk := Zk(xk+d one now shows easily by induction on k
that
(7.3.9.9)
Indeed, since Zl = Z, this is true for k = 1. If the assertion is true for k - 1,
then the function
Z(x) := Zk(X)Zk-1··· Zl
satisfies the differential equation Z' = Dyf(x,7)(x))Z and the initial conditions
Z(Xk) = Zk(Xk)Zk-1 ... Zl = Zk-1 ... Zl = Z(Xk). By uniqueness of the solution
of initial-value problems there follows Z(x) = Z(x), hence (7.3.9.9).
By
Theorem
(7.1.8)
we
further
have
for
the
matrices
Gk(x) := DskY(x; Xk, Sk),
Gk(x) = I + 1~ Dyf(t, y(t; Xk, Sk) )Gdt ) dt,
(7.3.9.10)
G k = Gk (.Tk+ 1 ).
With the abbreviation
1:
upon subtraction of (7.3.9.8) and (7.3.9.10), one obtains the estimate
'P(X) ::;
(7.3.9.11)
IIDyf(t, 7)(t)) - Dyf(t, y(t; Xk, Sk)) 1IIIZk(t)11 dt
+ l~IIDYf(t,y(t;xk,sk))II'P(t)dt.
580
7 Ordinary Differential Equations
If Dyf(t,y), for all t E [a,b], is uniformly Lipschitz continuous in y, one
easily shows for Xk ::;; t ::;; Xk+l that
IIDyf(t,1](t)) - Dyf(t, y(t; Xk, Sk)) II ::;; LII11(t) - y(t; Xk, sk)11 = O(h),
as well as the uniform boundedness of IIZk(t)ll, IIDyf(t, y(t; Xk,Sk))11 for
t E [a, b], k = 1, 2, ... , m - 1. From (7.3.9.11) and the proof techniques of
Theorems (7.1.4) (7.1.11), one then obtains an estimate of the form
~in particular, for x =
Xk+l,
1::;; k::;; m -1,
(7.3.9.12)
with a constant Cl independent of k and h.
From the identity
ZkZk-l'" Zl - GkGk-l'" G 1 = (Zk - Gk)Zk-l ... Zl
+Gk (Zk-l - Gk-l)Zk-2'" Zl + ... + GkGk-l ... G 2 (Zl - G 1 )
and the further estimates
1::;; k ::;; m - 1,
C2 independent of k and h, which follow from Theorem (7.1.11), we thus
obtain for 1 ::;; k ::;; m - 1, in view of kh ::;; b - a and (7.3.9.9),
(7.3.9.13)
with a constant D independent of k and h.
From (7.3.9.5), (7.3.9.6), (7.3.9.9), (7.3.9.12), (7.3.9.13) one obtains
(7.3.9.14)
Lls k = (Z(Xk) + O(h2))
Here, Lls( x) is the function
(7.3.9.15)
Lls(x) := Z(x)Lls l + Z(x)
Lls(a) = Lls 1 .
IX Z(t)-l [f(t, 1](t)) -1]'(t)] dt,
7.3 Boundary-Value Problems
581
Note that .1s depends also on h, since .1s 1 depends on h. Evidently, .1s is
differentiable with respect to x, and by (7.3.9.7) we have
.1s'(X) =Z'(x) [.1S 1 +
IX Z(t)-l [f(t,1](t)) -1]'(t)] dt] + f(x, 1](X)) -1]'(x)
=Dyf(x, 1](X)).1s(x) + f(x, 1](X)) -1]'(X),
so that
.1s(x) = .1s(a) +
IX [Dyf(t,1](t)).1s(t) + f(t,1](t)) -1]'(t)] dt.
By subtracting this equation from (7.3.9.3), we obtain for the difference
8(x) := .11](X) - .1s(x)
the equation
(7.3.9.16)
8(x) = 8(a) +
IX Dyf(t, 1](t))8(t) dt,
and further, with the aid of (7.3.9.7),
(7.3.9.17)
8(x) = Z(x)8(a),
[[8(x)[[::; K[[8(a)[[
for a ::; x ::; b,
with a suitable constant K.
In view of (7.3.9.14) and (7.3.9.17) it suffices to show [[8(a)[[ = O(h)
in order to complete the proof of (7.3.9.4). With the help of (7.3.5.10) and
(7.3.9.15) one now shows in the same way as (7.3.9.14) that .1s 1 = .1s(a)
satisfies an equation of the form
[A + B(Z(b) + 0(h2))].1s1
= -Pm - BZ(b)
Ib
Z(t)-l (J(t, 1](t)) -1]'(t)) dt + O(h)
= -Pm - B[.1s(b) - Z(b).1S1] + O(h).
From Pm = r(sl' sm) = r(1](a), 1](b)) it follows that
A.1s(a) + B.1s(b) = -r(1](a), 1](b)) + O(h).
Subtraction of (7.3.9.2), in view of (7.3.9.17), yields
A8(a) + B8(b) = [A + BZ(b)]8(a) = O(h),
and thus 8(a) = O(h), since (7.3.9.1) by assumption is uniquely solvable
and hence A + BZ(b) is nonsingular. Because of (7.3.9.14), (7.3.9.17), this
proves (7.3.9.4).
D
582
7 Ordinary Differential Equations
7.4 Difference Methods
The basic idea underlying all difference metods is to replace the differential
quotients in a differential equation by suitable difference quotients and to
solve the discrete equations that are so obtained.
We illustrate this with the following simple boundary-value problem of
second order for a function Y : [a. b] -+ IR:
_y" + q(x)y = g(x).
(7.4.1)
y(a) = n.
y(b) = 8.
Under the assumptions that q, 9 E C[a, b] (i.e., q and 9 are continuous
functions on [a.b]) and q(x) 2': 0 for x E [a.b], it can be shown that (7.4.1)
has a unique solution y(.T).
In order to discretize (7.4.1), we subdivide [a, b] into 71+ 1 equal subintervals.
a
Xj = a + j h,
= Xo < Xl < ... < Xn < .Tn+l = b,
b-a
71+1
h:=--,
and, with the abbreviation Yj := Y(Xj), replace the differential quotient
y;' = y"(Xi) for i = 1, 2, ... ,71 by the second difference quotient
,12
L.l
. ' _ Yi+l -
y, .-
2Yi + Yi-l
h2
We now estimate the error Ti(Y) := Y"(Xi) - Ll 2 Yi. We assume that Y is
four times continuously differentiable on [a, b]' Y E C 4 [a, b]. Then, by Taylor
expansion of Y(Xi ± h) about Xi, one finds
h3
h4
h 2 II
_.
I
_
_
III
_
(4) ( .
±)
Yi±l - y, ± hYi + 2! Yi ± 3! y, + 4! Y
X, ± 0i h ,
o < O'j= < 1.
and thus
Since y(4) is still continuous, it follows that
(7.4.2)
1/)
2
h 2 (4)
Ti(Y ) := Y (Xi - Ll Yi = -12 Y (Xi + Oih)
for some IOil < 1.
Because of (7.4.1), the values Yi = Y(Xi) satisfy the equations
Yo = n
(7.4.3)
2Yi -
Yi-l h2
Yi+l
+ q(X;)Yi = g(Xi) + Ti(Y),
Yn+l = ;3.
i = 1, 2, .... 71,
7.4 Difference Methods
583
With the abbreviations qi ;= q(Xi), gi;= g(Xi), the vectors
k ;=
gn-1
{3
gn
+ h2
and the symmetric n x n tridiagonal matrix
(7.4.4)
-1
the equations (7.4.3) are equivalent to
(7.4.5)
AY=k+T(Y)·
The difference method now consists in dropping the error term T(Y) in
(7.4.5) and taking the solution u = [U1,"" un]T of the resulting system of
linear equations,
(7.4.6)
Au= k,
as an approximation to y.
We first want to show some properties of the matrix A in (7.4.4). \Ve
will write A ::; B for two n x n matrices if Gij ::; bij for i, j = 1, 2, ... , n.
We now have the following;
(7.4.7) Theorem. If qi 2' 0 for i = 1, ... , n, then A in (7.4.4) is positive
definite, and 0 ::; A -1 ::; A-1 with the positive definite n x n matrix
-1
(7.4.8)
-1
PROOF. We begin by showing that A is positive definite. For this, we consider the n x n matrix An ;= h 2 A. According to Gershgorin's theorem
(6.9.4), we have for the eigenvalues the estimate IAi - 21 ::; 2, and hence
o ::; Ai ::; 4. If Ai = 0 were an eigenvalue of An, it would follow that
det(An) = 0; however, det(An) = n + 1, as one easily verifies by means
of the recurrence formula det(An+d = 2det(An) - det(A n - 2 ) (which is
584
7 Ordinary Differential Equations
obtained at once by expanding det(A n+ 1 ) along the first row). Neither is
Ai = 4 an eigenvalue of An, since otherwise An - 41 would be singular. But
-2 -1
An - 41
r-1
=
-1
D := diag(1, -1,1, ... , ±1),
so that 1det(A n - 41)1 = 1det(An)l, and with An, also An - 41 is nons ingular. We thus obtain the estimate 0 < Ai < 4 for the eigenvalues of An.
This shows in particular that An is positive definite. By virtue of
n
zH Az = zH Az + 2:qilziI2,
zH Az > 0
for z f 0 and qi:;:' 0,
i=l
it follows immediately that zH Az > 0 for z f 0; hence the positive definiteness of A. Theorem (4.3.2) shows the existence of A-I and A-1, and
it only remains to prove the inequality 0 ::; A-I ::; A-I. To this end, we
consider the matrices D, D, J, J with
h 2 A = D(I - J),
D = diag(2 + qlh2, ... , 2 + qnh2),
h 2 A = D(I - J),
D = 21.
Since qi :;:. 0, we obviously have the inequalities
0::; D::; D,
1
2 + qlh2
0
(7.4.9)
O::;J=
o
1
2 + q2h2
1
1
0
1
o
0
:2
0
::;J=
[:
1
1
:2
"2
0
In view of J = ~ ( - An + 21) and the estimate 0 < Ai < 4 for the eigenvalues
of An, we have -1 < J1i < 1 for the eigenvalues J1i of J, i.e., p( J) < 1 for
7.4 Difference Methods
585
the spectral radius of 1. From Theorem (6.9.2) there easily follows the
convergence of the series
Since 0 ::; J ::; J, we then also have convergence in
and since, by (7.4.9), 0 ::; D- 1 ::; iJ-l we get
as was to be shown.
D
From the above theorem it follows in particular that the system of
equations (7.4.6) has a solution (if q(x) 2: 0 for x E [a, b]), which can easily
be found, e.g., by means of the Choleski method [see Section 4.3]. Since A
is a tridiagonal matrix, the number of operations for the solution of (7.4.6)
is proportional to n.
We now wish to derive an estimate for the errors
Yi -Ui
of the approximations Ui obtained from (7.4.6) for the exact solutions Yi =
Y(Xi), i = 1, ... , n.
(7.4.10) Theorem. Let the boundary-value problem (7.4.1) have a solution
y(x) E C 4 [a, b], and let ly(4)(x)1 ::; M for x E [a, b]. Also, let q(x) 2: 0 for
x E [a, b] and u = [Ul, ... ,Un]T be the solution of (7.4.6). Then, for i = 1,
2, ... , n,
PROOF. Because of (7.4.5) and (7.4.6) we have for
f) - u the equation
A(fj - u) = T(Y).
U sing the notation
we obtain from Theorem (7.4.7) and the representation (7.4.2) of T(Y),
(7.4.11)
1
-
Mh2 -
If) - ul = IA- T(Y)I ::; A- 1 IT(y)1 ::; 12 A - 1e ,
586
7 Ordinary Differential Equations
where e := [1,1, ... , I]T. The vector A-Ie can be obtained at once by the
following observation: The special boundary-value problem
-y"(x) = 1,
y(a) = y(b) = 0,
°
of the type (7.4.1) has the exact solution y(x) = ~(x - a)(b - x). For this
boundary-value problem, however, we have T(Y) =
by (7.4.2), and the
discrete solution u of (7.4.6) coincides with the exact solution y of (7.4.5).
In addition, for this special boundary-value problem the matrix A in (7.4.4)
is just the matrix A in (7.4.8), and moreover k = e. We thus have A-Ie = u,
Ui = ~(Xi - a)(b - x;). Together with (7.4.11) this yields the assertion of
the theorem.
D
Under the assumptions of Theorem (7.4.lO), the errors go to zero
like h 2 : the difference method has order 2. The method of Stormer and
Numerov, which discretizes the differential equation y"(X) = !(x, y(x)) by
h2
Yi+I - 2Yi + Yi-I = 12 (fi+I + 10!i + !i-d
and leads to tridiagonal matrices, has order 4. All these methods can also
be applied to nonlinear boundary-value problems
y" = !(x,y),
y(a) = ct, y(b) =;3.
We then obtain a system of nonlinear equations for the approximations
y(x;) , which in general can be solved only iteratively. At any rate, one
obtains only methods of low order. To achieve high accuracy, one has to
use a very fine subdivision of the interval [a, b]; in contrast, e.g., with the
multiple shooting method [see Section 7.6 for comparative examples].
For an example of the application of difference methods to partial differential equations, see Section 8.4.
Ui ~
7.5 Variational Methods
Variational methods (Rayleigh-Ritz-Galerkin methods) are based on the
fact that the solutions of some important types of boundary-value problems
possess certain minimality properties. We want to explain these methods for
the following simple boundary-value problem for a function U : [a, b] --+ IR,
(7.5.1)
-(p(x)u'(x))' + q(x)u(x) = g(x, u(x)),
u(a) = ct, u(b) = ;3.
Note that the problem (7.5.1) is somewhat more general than (7.4.1).
Under the assumptions
7.5 Variational Methods
(7.5.2)
P E C 1 [a,b],
q E C[a, b],
587
p(x);:::po>o.
q(x) ;::: 0,
gu(x, u) :::; .Ao,
g E C 1 ([a, b] x lR),
with .Ao the smallest eigenvalue of the eigenvalue problem
-(pz')' - (.A - q)z = 0,
z(a) = z(b) = 0,
it is known that (7.5.1) always has exactly one solution. For the following we
therefore assume (7.5.2) and make the simplifying assumption g(x, u(x)) =
g(x) (no u-dependence of the right-hand side).
If u(x) is the solution of (7.5.1), then y(x) := u(x) -l(x) with
b-x
a-x
l(x) := OOb - + f3--b ,
-a
l(a) = 11',
a-
l(b) = f3,
is the solution of a boundary-value problem of the form
-(py')' + qy = j,
y(a) = 0, y(b) = 0,
(7.5.3)
with vanishing boundary values. Without loss of generality, we can thus
consider, instead of (7.5.1), problems of the form (7.5.3). With the help of
the differential operator
L(v) :=::= -(pv')' + qv
(7.5.4)
associated with (7.5.3), we want to formulate the problem (7.5.3) somewhat
differently. The operator L maps the set
D L := {v E C 2 [a,b] I v(a) = O,v(b) = O}
°
of all real functions that are twice continuously differentiable on [a, b] and
satisfy the boundary conditions v(a) = v(b) =
into the set C[a, b] of
continuous functions on [a, b].The boundary-value problem (7.5.3) is thus
equivalent to finding a solution of
L(y) = j,
(7.5.5)
Y E DL ,
Evidently, D L is a real vector space and L a linear operator on D L: for u, v E
DL also oou+f3v belongs to D L , and one has L(oou+f3v) = ooL(u)+f3L(v) for
all real numbers a, f3. On the set L 2 (a, b) of all square-integrable functions
on [a, b] we now introduce a bilinear form and a norm by means of the
definition
(7.5.6)
(u,v):=
lb
u(x)v(x)dx,
IluI12:= (u,u)1/2.
588
7 Ordinary Differential Equations
The differential operator L in (7.5.4) has a few properties which are important for the understanding of the variational methods. One has thc
following:
(7.5.7) Theorem. L is a symmetric operator on D L , i.e., we have
(u, L(v»
=
(L(u), v)
for all u, v E D L .
PROOF. Through integration by parts one finds
(u, L(v» =
=
=
lb
u(x)[-(p(x)v'(x»' + q(x)v(x)] dx
-u(x)p(x)v'(x)l~ +
lb
lb
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx,
since u (a)
u (b) = 0 for u E D L. For reasons of symmetry it follows
likewise that
1
b
(7.5.8)
(L(u), v) =
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx;
hence the assertion.
D
The right-hand side of (7.5.8) is not only defined for u, v E D L . Indeed,
let D := {u E Kl(a, b)lu(a) = u(b) = O} be the set of all functions u
that are absolutely continuous on [a, b] with u(a) = u(b) = 0, for which u'
on [a, bj (exists almost everwhere and) is square integrable [see Definition
(2.4.1.3)]. In particular, all piecewise continuously differentiable functions
satisfying the boundary conditions belong to D. D is again a real vector space with D :2 D L . The right-hand side of (7.5.8) defines on D the
symmetric bilinear form
(7.5.9)
[u, v] :=
lb
[P(x)u'(x)v'(x) + q(x)u(x)v(x)] dx,
which for u, v E DL coincides with (u, L(v». As above, one shows for
y E D L , U E D, through integration by parts, that
(7.5.10)
(u, L(y» = [u, yj.
Relative to the scalar product introduced on DL by (7.5.6), L is a positive
definite operator in the following sense:
(7.5.11) Theorem. Under the assumptions (7.5.2) one has
[u,u] = (u,L(u)) >
°
7.5 Variational Methods
589
for all u i- 0, u E D L .
One even has the estimate
'Yllull~ ::; [u, u] ::; nu'll;'
(7.5.12)
for all u E D
with the norm IlullCXl := SUPa:C;x:C;b lu(x)1 and the constants
Po
'Y:= b _ a'
PROOF. In view of'Y >
because u( a) = 0,
r:= Ilplloo(b - a) + Ilqlloo(b - a) 3 .
°
it suffices to show (7.5.12). For u E D we have,
u(x) =
i
x
u'(O dE;,
for x E [a, b].
The Schwarz inequality yields the estimate
U(X)2::;
i
x
x
12 dE;,·l u'(02dE;,= (x-a)
::; (b - a)
·i U'(~)2 d~,
i u'(E;,)2d~
x
b
and thus
(7.5.13)
Now, by virtue of the assumption (7.5.2), we have p(x) ~ Po > 0, q(x) ~
for x E [a, b]; from (7.5.9) and (7.5.13) it thus follows that
[u, u] =
ib
[P(X)U'(X)2 + q(x)u(x)2] dx
~ Po
ib
u'(x)2 dx
°
~ b ~ a Ilull;'·
By (7.5.13), finally, we also have
[u,u] =
i
b
[p(X)U'(X)2+ q(X)U(X)2]dx
::; Ilplloo(b - a)llu'll~ + Ilqll=(b - a)llull~
::; Fllu' II~,
o
as was to be shown.
In particular, from (7.5.11) we may immediately deduce the uniqueness
of the solution Y of (7.5.3) or (7.5.5). If L(Yl) = L(Y2) = f, Yl, Y2 E D L ,
then L(YI-Y2) = and hence = (YI-Y2,L(YI-Y2)) ~ 'YIIYI-Y211;, ~ 0,
which yields at once Yl = Y2.
°
°
590
7 Ordinary Differential Equations
We now define for u E D a quadratic functional F: D -+ JR, by
F(u) := [u, u]- 2(u, f);
(7.5.14)
here f is the right-hand side of (7.5.3) or (7.5.5): F associates with each
function 'U E D a real number F(u). Fundamental for variational methods
is the observation that the function F attains its smallest value exactly for
the solution y of (7.5.5):
(7.5.15) Theorem. Let y E DL be the solution of (7.5.5). Then
F(u) > F(y)
for all u ED,
71
#- y.
PROOF We have L(y) =
F, for u #- y,
11,
f and therefore, by (7.5.10) and the definition of
ED,
F(u) = [u,u]- 2(u,f) = [u,11,]- 2(u,L(y))
= [u, u] - 2[u, y] + [y, y] - [y, y]
= [71. - y, 71. - y] - [y, y]
> -[y, y] = F(y),
since, by Theorem (7.5.11), [71. - y, 11, - y] > 0 for u #- y.
As a side result, we note the identity
(7.5.16)
[71. -
y, 71. - y] = F(u) + [y, y]
for all
o
71. E D.
Theorem (7.5.15) suggests approximating the desired solution y by minimizing F(u) approximately. Such an approximate minimum of F may be
obtained systematically as follows: One chooses a finite-dimensional subspace S of D, SeD. If dimS = m, then relative to a basis 11,1, ... , Urn of
S, every 71. E S admits a representation of the form
(7.5.17)
One then determines the minimum Us E S of Fin S,
(7.5.18)
F(us) = minF(u),
uES
and takes Us to be an approximation for the exact solution y of (7.5.5),
which according to (7.5.15) minimizes F on the whole space D. For the
computation of the approximation Us consider the following representation
of F(u), u E S, obtained via (7.5.17),
7.5 Variational Methods
591
P(61, 62,"" 6m ) : == F(6171.1 + ... + 6m u m )
[t
6i Ui ,
~ 6kUk]- 2 (~ 6k U f)
k,
m
m
i,k=l
k=l
With the help of the vectors 6, t.p and the m x m matrix A,
(7.5.19)
61
6:= [ :
6m
1,t.p [ (U1, j) 1 A
:=
:
,
[[U1' uI]
:=:
[U m ,U1] '"
(um,j)
1
[U1, um ]
:,
[um,um]
one obtains for the quadratic function P: IR m -+ IR
(7.5.20)
The matrix A is positive definite, since A is symmetric by (7.5.9), and for
all vectors 6 i= 0 one also has 'U := 61 u1 + ... + 6m u m i= 0 and thus, by
Theorem (7.5.11),
6T A6 = I)i6k[Ui, Uk] = [u,u] > O.
i,k
The system of linear equations
(7.5.21 )
A6=t.p,
therefore, has a unique solution 6 = 8, which can be computed by means
of the Choleski method [see Section 4.3]. In view of the identity
p( 6) = 6T A6 - 2t.pT 6 = 6T A6 - 28T A6 + 8T A8 - ;sr A8
= (6 - 8f A(6 - 8) - 8T A8
=
(6 - 8)T A(6 - 8) +p(8)
and (6 - 8)T A(6 - 8) > 0, for 6 i= 8, it follows at once that P(6) > p(8) for
6 i= 8, and consequently that the function
belonging to 8 furnishes the minimum (7.5.18) of F(u) on S. With Y
the solution of (7.5.5), it follows immediately from (7.5.16), by virtue of
F(us) = minuEsF(u), that
(7.5.22)
[us - y,Us - y] = min
s' [u - y,u - y].
uE
592
7 Ordinary Differential Equations
\Ve want to use this relation to estimate the error Ilus - ylloo. One has the
following:
(7.5.23) Theorem. Let y be the exact solution of (7.5.3), (7.5.5). Let S c
D be a finite-dimensional subspace of D, and let F( us) = minuEs F( u).
Then the estimate
Ilus - ylloc::; CIIu' - Y'lloo
holds for allu E S. Here C =
Theorem (7.5.11).
PROOF.
jr/r, where r, , are the constants of
(7.5.12) and (7.5.22) yield immediately, for arbitrary u E S ~ D,
,llus - YII~ ::; [us - y, Us - y] ::; [u - y, u - y] ::; rllu' - y'll;'·
o
From this, the assertion follows.
Every upper bound for infuEs[u - y, u - V], or more weakly for
inf Ilu' - Y'lloo,
uES
immediately gives rise to an estimate for Ilus - yllCX). We wish to indicate
such a bound for an important special case. We choose for the subspace S
of D the set
S = SPLl := {SLl I SLl(a) = SLl(b) = O}
of all cubic spline functions SLl [see Definition (2.4.1.1)] which belong to a
fixed subdivision of the interval [a, b],
Ll: a = Xo < Xl < X2 < ... < Xn = b,
and vanish at the end points a, b. Evidently, SPLl ~ DL ~ D. We denote
by IILlII the width of the largest subinterval of the subdivision Ll,
IILlII := max (Xi - Xi-I).
I::;~::;n
The spline function u := SLl with
U(Xi) = y(Xi), i = 0, 1, ... ,n,
u l (() = yl(() for (= a, b,
where y is the exact solution of (7.5.3), (7.5.5), clearly belongs to S = SPLl'
From Theorem (2.4.3.3) and its proof one gets the estimate
provided y E C 4 (a, b). Together with Theorem (7.5.23) this gives the following result:
7.5 Variational Methods
593
(7.5.24) Theorem. Let the exact solution Y of (7.5.3) belong to C 4 (a, b),
and let the assumptions (7.5.2) be satisfied. Let S := Sp d and Us be the
spline function for which
F(us) = minF(u).
uES
Then with the constant C:=
vir;.,. which is independent ofy,
By the footnote following Theorem (2.4.3.3), the estimate can be improved:
The error bound thus goes to zero like the third power of the fineness
of L1; in this respect, therefore, the method is superior to the difference
method of the previous section [see Theorem (7.4.10)].
For the practical implementation of the variational method, say in the
case S = Sp d, one first has to select a basis for Sp d' One easily sees that
m:= dimSPd = n+1 [according to (2.4.1.2) the spline function Sd E SPd
is uniquely determined by n + 1 conditions
Sd(Xi)=Yi,
i=1,2, ... ,n-1,
S~(a) = y~,
S~(b) = y~,
for arbitrary Yi, yb, y~]. As in Exercise 29, Chapter 2, one can find a basis
of spline functions So, S 1 ... , Sn in Sp d such that
(7.5.25)
This basis has the advantage that the corresponding matrix A m
(7.5.19) is a band matrix of the form
x
x
x
x
(7.5.26)
x
o
x
o
x
x
x
x
x
X
.1:
X
since, by virtue of (7.5.9), (7.5.25),
lSi, Sk] =
lb [P(x)S~(x)S~(x) +
q(x)Si(x)Sdx)] dx = 0
594
7 Ordinary Differential Equations
whenever Ii - kl .: : 4. Once the components of the matrix A and of the
vector cp in (7.5.19) for this basis So, ... , SrI have been found by integration,
one solves the system of linear equations (7.5.21) for t5 and so obtains the
approximation Us for the exact solution y.
Of course, instead of SPL\ one can also choose other spaces S <:;; D. For
example, one could take for S the set
71,-2
S = { PI P(x) = (x - a)(x - b)
L aix\ ai arbitrary}
i=O
of all polynomials of degree at most n which vanish at a and b. The matrix
A in this case would be in general no longer be a band matrix (and besides
would be very ill conditioned if the special polynomials
Pi(x):=(x-a)(x-b)x',
i=0,1, ... ,n-2,
were chosen as basis in S).
Similarly to the spline functions, one can also choose for S the set [see
(2.l.5.10)]
S =
H1
m ),
H1
m)
where L1 is again a subdivision a = Xo < Xl < ... < Xn = band
consists of all functions U E C m - l [a, b] which in each subinterval [Xi, Xi+1],
i = 0, ... , n - 1, coincide with a polynomial of degree :S 2m - 1 ("Hermite
function space"). Here again, through a suitable choice of the basis {ud
m ), one can assure that the matrix A = ([Ui' Uk]) is a band matrix.
in
By appropriate choices of the parameter m one can even obtain methods
m ),
of order higher than 3 in this way [ef. Theorem (7.5.24)]. For S =
analogously to Theorem (7.5.24), using the estimate (2.1.5.14) instead of
(2.4.3.3), one obtains
H1
H1
<_ 22m-2(2m
c _ 2)! I Y (2m)11 x 11L1112m-1 ,
I us - YII~,
~
l
provided Y E C 2m [a, b]; we have a method of order at least 2m-I. [Proofs of
these and similar estimates, as well as the generalization of the variational
methods to certain nonlinear boundary-value problems, can be found in
Ciarlet, Schultz, and Varga (1967).] Observe, however, that with increasing
m the matrix A becomes more and more difficult to compute.
We further point out that the variational method can be applied to
considerably more general boundary-value problems than (7.5.3), e.g., to
partial differential equations. The important prerequisites for the crucial
theorems (7.5.15), (7.5.23) and the solvability of (7.5.21) are essentially
only the symmetry of L and the possibility of estimates of the form (7.5.12)
[see Section 7.7 for applications to the Dirichlet problem].
7.5 Variational Methods
595
The bulk of the work with methods of this type consists in computing
the coefficients of the system of linear equations (7.5.21) by means of integration, having made a decision concerning an appropriate basis Ul, ... ,
U m in S. For boundary-value problems in ordinary differential equations
these methods as a rule are too expensive and too inaccurate, and are not
competitive with the multiple shooting method [see the following section
for comparative results]. Its value becomes evident only in boundary-value
problems for partial differential equations.
Finite-dimensional spaces SeD of functions satisfying the boundary
conditions of the boundary-value problem (7.5.5) also playa role in collocation methods. The idea of these methods is quite natural: One tries to
approximate the solution y(x) of (7.5.5) by a function u(x) in S represented
by
U=61Ul+···+6 m u m
in terms of a basis {Uj} of S. To this end, one selects m different collocation
points Xj E (a, b), j = 1, 2, ... , m and determines U E S such that
(7.5.27a)
(Lu)(Xj) = f(xj),
j = 1, 2, ... , m,
that is, the differential equation L( u) = f is to be satisfied exactly at the
collocation points. Of course, this is equivalent to solving the following
linear equations for the coefficients 6k, k = 1, ... , m:
L L(Uk)(Xj)6k = f(xj), j = 1, 2, ... , m.
m
(7.5.27b)
k=l
There are many possibilities of implementing such methods, namely,
by different choices of S, of bases of S, and of collocation points x j. With
a suitable choice, one may obtain very efficient methods.
For instance, it is advantageous for the basis functions Uj, j = 1, 2, ... ,
m, to have compact support, like the B-splines [see 2.4.4 and (7.5.25)]. Then
the matrix A = [L(Uk)(Xj)] of the linear system (7.5.27) is a band matrix.
In the so-called spectral methods, one chooses spaces S of trigonometric
polynomials (2.3.1.1) and (2.3.1.2), which are spanned by finitely many
simple trigonometric functions, say, e ikx , k = 0, ±1, ±2, .... Then, with
a suitable choice of the collocation points Xj, one can often solve (7.5.27)
with the aid of fast Fourier transforms [see Section 2.3.2].
For illustration, consider the simple boundary-value problem
-y"(x) = f(x),
y(O) = y(7r) = 0,
of the form (7.5.3). Here, all functions u(x) = E;:'=l 8k sin kx satisfy the boundary
conditions. If we choose the collocation points Xj := j7r /(m+ 1), j = 1, 2, ... , m,
then by (7.5.27) the numbers "/k := 8kk 2 will solve the trigonometric interpolation
problem [see Section 2.3.1]
596
7 Ordinary Differential Equations
~
'Yk sin kj7r = fj := f(xj),
~
m+1
j = 1, 2, ... , m.
k=l
Therefore, the 'Yk can be computed by using fast Fourier transforms.
These methods play an important role for the solution of initialboundary value problems for partial differential equations. For a detailed
exposition, the reader is referred to the literature, e.g., Gottlieb and Orszag
(1977), Canuto et al. (1987), and Boyd (1989).
7.6 Comparison of the Methods for Solving
Boundary-Value Problems for Ordinary
Differential Equations
The methods described in the previous sections, namely
(1) the simple shooting method,
(2) the multiple shooting method,
(3) the difference method,
(4) the variational method,
will now be compared by means of the following example. We consider the
linear boundary-value problem with linear separated boundary conditions
(7.6.1)
- y" + 400y = -400 cos 2 7rX - 27f2 cos(27fx) ,
y(O) = y(l) = 0,
with the exact solution [see Figure 28]
(7.6.2)
y(x) =
e
-20
1+
e- 20
e 20x
+
1
1 + e- 20
e- 20x _
cos 2 (7fx).
Although this problem must be considered very simple compared with
most problems in practice, the following peculiarities are present which lead
us to expect difficulties in its solution:
(1) The homogeneous differential equation -y" +400y = 0 associated with
(7.6.1) has solutions of the form y(x) = ce±20x which grow or decay at
a rapid exponential rate. This leads to difficulties in the simple shooting
methods.
(2) The derivatives y(i)(x), i = 1, 2, ... , of the exact solution (7.6.2)
are very large for x ~ 0 and x ~ 1. The error estimates (7.4.10) for
the difference method, and (7.5.24) for the variational method, thus
suggest large errors.
7.6 Boundary-Value Problems, Comparison of Methods
597
y
o
1
0.5
x
-0.5
Fig. 28. Exact solution of (7.6.1).
On a 12-digit computer the following results were found [in evaluating
the error, especially the relative error, one ought to observe the behavior
of the solution [see Figure 28]: maxO<x<l ly(x)1 R:! 0.77, y(O) = y(l) = 0,
y(0.5) = 0.907 998 593 370 x 10- 4 ]: - (a) Simple shooting method: The initial values
y(O) = 0,
y' (0) = -20
1 - e- 20
1 + e-
20 =
-19.999 999 9176 ...
of the exact solution were found, after one iteration, with a relative
error::; 3.2.10- 11 . If the initial-value problem corresponding to (7.6.1)
is then solved by means of an extrapolation method [see Section 7.2.14;
ALGOL procedure Diffsys in Bulirsch and Stoer (1966)], taking as initial
values the exact initial values y(O), y'(O) rounded to machine precision,
one obtains instead of the exact solution y(x) in (7.6.2) an approximate
solution y(x) having the absolute error .1y(x) = y(x) -y(x) and relative
error Cy(x) = (y(x) - y(x))jy(x) as shown in the table below. Here
the effect of the exponentially growing solution y(x) = ce 20x of the
homogeneous problem is clearly noticeable [ef. Section 7.3.4].
598
7 Ordinary Differential Equations
x
lL1y(x)l*
IEy(x)1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.9 X 10- 11
1.5 X 10- 10
1.1 X 10- 9
8.1 X 10- 9
6.0 X 10- 8
4.4 X 10- 7
3.3 X 10- 6
2.4 X 10- 5
1.8 X 10- 4
1.3 X 10- 3
2.5 X 1O- 11
2.4 X 10- 10
3.2 X 10- 9
8.6 X 10- 8
6.6 X 10- 4
4.7 X 10- 5
9.6 X 10- 6
3.8 X 10- 5
2.3 X 10- 4
00
• maxx E[O,l] lL1y(X) I ::::; 1.3 X 10- 3 .
(b) Multiple shooting method: To reduce the effect of the exponentially
growing solution y( x) = c e20x of the homogeneous problem, we chose
m large, m = 21:
k-1
0= Xl < X2 < ... < X21 = 1,
Xk=--·
20
X
lL1y(x)l*
IEy(x)1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
9.2 X 10- 13
2.7 X 10- 12
4.4 X 10- 13
3.4 X 10- 13
3.5 X 10- 13
1.3 X 10- 12
1.8 X 10- 12
8.9 X 10- 13
9.2 x 10- 13
5.0 X 10- 12
1.2 X 10- 12
4.3 X 10- 12
1.3 X 10- 12
3.6 X 10- 12
3.9 X 10- 9
1.4 X 1O- 11
5.3 X 10- 12
1.4 X 10- 12
1.2 X 10- 12
00
* maX x E[O,l]lL1y(x)1 ::::; 5 x 10- 12 •
Three iterations of the multiple shooting method [program in Oberle
and Grimm (1989)] gave the absolute and relative errors in the above
table.
(c) Difference method: The method of Section 7.4 for step sizes h =
l/(n + 1) produces the following absolute errors L1y = maxi IL1Y(Xi)l,
Xi = i h [powers of 2 were chosen for h in order that the matrix A in
(7.4.4) could be calculated exactly on the computer]:
7.6 Boundary-Value Problems, Comparison of Methods
h
f1y
2~4
2.0 X 1O~2
l.4 X 1O~3
9.0 X 1O~5
5.6 X 1O~6
2~6
2~8
2~1O
599
These errors are in harmony with the estimate of Theorem (7.4.10).
Halving the step h reduces the error to ~ of the old value. In order to
achieve errors comparable to those of the multiple shooting method,
f1y ::::::: 5 X 1O~ 12, one would have to choose h ::::::: 1O~6!
(d) Variational method: In the method of Section 7.5 we choose for the
spaces S spaces Sp Ll of cubic spline functions corresponding to equidistant subdivisions
f1:
0 = Xo < Xl < ... < Xn = 1,
Xi
= ih,
1
h =-.
n
The following maximum absolute errors f1y = Ilus - yll= [ef. Theorem
(7.5.24)] were found:
h
f1y
1/10
1/20
1/30
1/40
1/50
1/100
6.0 x 1O~3
5.4 x 1O~4
l.3 x 1O~4
4.7 x 1O~5
2.2 x 1O~5
l.8 x 1O~6
In order to achieve errors of the order of magnitude f1y ::::::: 5 x 1O~12 as
in the multiple shooting method, one would have to choose h ::::::: 1O~4.
Merely to compute the band matrix A in (7.5.26), this would require
about 4 x 10 4 integrations for this stepsize.
These results clearly demonstrate the superiority of the multiple shooting method, even for simple linear separated boundary-value problems.
Difference methods and variational methods are feasible, even here, only if
the solution need not be computed very accurately. For the same accuracy,
difference methods require the solution of larger systems of equations than
variational methods. This advantage of the variational methods, however,
is of importance only for smaller stepsizes h, and its significance is considerably reduced by the fact that the computation of the coefficients in the
600
7 Ordinary Differential Equations
systems of equations is much more involved than in the simple difference
methods. For the treatment of nonlinear boundary-value problems for ordinary differential equations the only feasible methods, effectively, are the
multiple shooting method and its modifications.
7.7 Variational Methods for Partial Differential
Equations. The Finite-Element Method
The methods described in Section 7.5 can also be used to solve boundaryvalue problems for partial differential equations. \¥e want to explain this for
a D'iTichlet buundaTy-value pmblem in lR? We are given a (open bounded)
region D c lR? with boundary 8D. We seek a function y: f? ---+ IR, f? :=
D U 8D such that
- L1y(x) + c(x)y(x) = j(x) for x = (XI,X2) E D,
y(x)=O forxE8D,
(7.7.1)
where L1 is the Laplace-Operator
!\
L.l
(
Y X
)
._
.-
8 2y(x)
8 2y(x)
8 2 + 8 2 .
Xl
X2
Here, c, f: f? ---+ IR are given continuous functions with c(x) 2: 0 for x E f?
To simplify the discussion, we assume that (7.7.1) has a solution y E C 2 (f?),
i.e., y has continuous derivatives D'ty := 8ay/8x't on D for Q ::; 2 which are
continuously extendable to f? As domain DL of the differential operator
L(v) := -L1v + cv associated to (7.7.1) we accordingly take DL := {v E
C 2(f? I vex) = 0 for x E 8D}. Thus, the problem is to find a solution of
L(v) = j,
(7.7.2)
v E D L.
We further assume that the region D is such that the integral theorems of
Gauss and Green apply, and moreover, that each line Xi = const, i = 1, 2,
intersects D in at most finitely many segments. With the abbreviations
(u, v) :=
in
u(x)v(x) dx,
IluI12:= (u, U)I/2
we then have [ef. Theorem (7.5.7)]:
(7.7.3) Theorem. L is a symmetTic opemtoT on D L :
(7.7.4)
(u, L(v)) = (L(u), v) = L[DIU(X) Dlv(x) + D 2 u(x) D 2 v(x)
+ c(x)u(x)v(x)] dx
7.7 Variational Methods for Partial Differential Equations
601
for all u, v E D L .
PROOF. One of Green's formulas is
- lS?r uLlvdx = lS?r (tDiUD;V) dx - laS?
r U~~ dw.
i=l
Here OV / ov is the differential quotient in the direction of the outward
normal, and dw the line element of af? Since u E DL, we have u(x) = 0 for
x E af?; thus IaS?u(av/av) dw = O. From this, the assertion of the theorem
follows at once.
D
The right-hand side of (7.7.4) again defines a bilinear form [cf. (7.5.9)]
2
(7.7.5)
[u,v]:= l(~DiUDiV+CU1J)dX'
and there holds an analog of Theorem (7.5.11):
(7.7.6) Theorem. There are constants 'Y > 0, r> 0 such that
(7.7.7)
'Yllull~(1) :::; [u, U] :::; r
2
L IID;ull~ for all E D
U
L.
;=1
Here
is the so-called "Sobolev norm" .
PROOF. The region f? is bounded. There exists, therefore, a square f?1 with
side lenght a which contains f? in its interior. Without loss of generality,
let the origin be a corner of f?1 (Figure 29).
a 1---------,
o
a
Fig. 29. The regions [2 and [21
602
7 Ordinary Differential Equations
Now let u E D L . Since u(x) := a for x E af2, we can extend u continuously onto f2I by letting u(.r) := a for x E f2I \f2. By our assumption on
il, DI U(il' .) is piecewise continuous, so that by u(a, X2) = a
U(Xl' X2) = faXl DIu(il, X2) dt 1 for all x E rh.
The Schwarz inequality therefore gives
[U(XI,X2W:s; Xlfaxl[DIU(tl,X2Wdtl
:s; a faa [D I u(t l ,X2W dt 1 for x E f2 1 .
Integrating this inequality over fh, one obtains
(7.7.8)
r [U(XI,x2)f dx I dx 2:S; a 21 [D Iu(tl,t2)]2dt1 dt 2
J.!?l
1
.!?l
:s; a 2
[(D1U)2 + (D2U)2] dx .
.!?l
Since u( x) = a for x E f21 \ f2, one can restrict the integration to f2, and
because c( x) :::" a for x E f?, it follows immediately that
2
a [u, u] :::" a
2
2
L IID;ull§ :::" Ilull~,
;=1
and from this finally
')'llull~(I) :s; [u, u]
.
1
wIth')' := - 2 - - '
a +1
Again from (7.7.8) it follows that
C := max Ic(x)l,
xE.!?
and thus
2
[11,11] :s;
r L IID;ull~
with r := 1 + Ca 2 .
;=1
This proves the theorem.
D
As in Section 7.5, one can find a larger veetor space D ~ D L of functions
such that definition (7.7.5) of [u, v] remains meaningful for u, v E D, the
statement (7.7.7) of the previous theorem remains valid for u ED, and
(7.7.9)
[u,v] = (u,L(y))
for y E DL and u E D.
7.7 Variational Methods for Partial Differential Equations
603
In what follows, we are not interested in what the largest possible set D with
this property looks like 8 ; we are only interested in special finite-dimensional
vector spaces S of functions for which the validity of (7.7.7) and (7.7.9) for
U E D S , D S the span of Sand D L , can be established individually.
EXAMPLE. Let the set fl be supplied with a triangulation 7 = {TI, . .. , Td, i.e.,
fl is the union of finitely many triangles Ti E 7, fl = U:=I Ti , such that any
two triangles in 7 either are disjoint or have exactly one vertex or one side in
common (Figure 30).
=
Fig. 30. Triangulation of fl
By 5i(i 2 1) we denote the set of all functions u: fl-+ lR with the following
properties:
(1) u is continuous on fl.
(2) u(x) = 0 for x E afl.
(3) On each triangle T of the triangulation 7 of fl the function u coincides with
a polynomial of degree i,
Evidently, each 5i is a real vector space. Even though u E 5 i need not even
be continuously differentiable in fl, it is nevertheless continuously differentiable
in the interior of each triangle T E T Definition (7.7.5) is meaningful for all u,
v E DSi; likewise, one sees immediately that (7.7.7) is valid for all u E DSi. The
function spaces 5 = 5 i , i = 1, 2, 3 - in particular 52 and 53 - are used in the
finite-element method in conjunction with the Rayleigh-Ritz method.
Exactly as in Section 7.5 [cf. (7.5.14), (7.5.15), (7.5.22)] one shows the
following:
8
One obtains such a D by "completion" of the vector space CJ(fl), equipped
with the norm I . Ilw(1) of all once continuously differentiable functions r.p on
fl, whose "support" supp(r.p) := {x E fl I r.p(x) =I=- O} is compact and contained
in fl.
604
7 Ordinary Differential Equations
(7.7.10) Theorem. Let y be the solution of (7.7.2), and S a finitedimensional space of functions for which (7.7.7), (7.7.9) hold for all u E D S .
Let F ( u) be defined by
(7.7.11)
F(u) := [u, uj- 2(u, J)
for U E D S .
Then:
(a) F(u) > fey) fOT allu -=1= y.
(b) There exists a Us E S with F(us) = minuEs F(u).
(c) For this Us one has
[Us - y, Us - yj = min [u - y, u - yj.
uES
The approximate solution Us can be represented as in (7.5.17)-(7.5.21)
and determined by solving a linear system of equations of the form (7.5.21).
For this, one must choose a basis in S and calculate the coefficients (7.5.19)
of the system of linear equations (7.5.21).
A practically useful basis for S2, e.g., is obtained in the following way: Let
P be the set of all vertices and midpoints of sides of the triangles T i , i = 1, 2,
... , k, of the triangulation T which are not located on the boundary af]. One
can show, then, that for each PEP there exists exactly one function Up E S2
~atisfying
up(P) = 1,
up(Q) =
° for Q =f. P, Q P.
E
In addition, these functions have the nice property that
up(x) = 0,
for all x E T with P f/. T;
it follows from this that in the matrix A of (7.5.21), all those elements vanish,
[up,uQ] = 0,
for which there is no triangle T E T to which both P and Q belong: the matrix
A is sparse. Bases with similar properties can be found also for the spaces Sl and
S3 [see Zl<imal (1968)].
The error estimate of Theorem (7.5.23), too, carries over.
(7.7.12) Theorem. Let y be the exact solution of (7.7.1), (7.7.2), and S a
finite-dimensional space of functions such that (7.7.7), (7.7.9) holds for all
u E D S . For the approximate solution Us with F( us) = minuEs F( u) one
then has the estimate
r
Ilus - yIIW(l) :::; inf ( - L IIDju - Djyll~
uES
2
I j=l
We now wish to indicate upper bounds for
)1/2
.
7.7 Variational Methods for Partial Differential Equations
605
in the case of the spaces 8 = 8 i , i = 1, 2, 3, defined in the above example
of triangulated regions fl. The following theorem holds:
(7.7.13) Theorem. Let fl be a triangulated region with triangulation T.
Let h be the maximum side length, and the smallest angle, of all triangles
occurring in T. Then the following is true:
e
then there exists a function U E 8 1 with
(a) Iu(x) - y(x)1 ::; M2h2 for all x E 0,
(b) IDju(x) - Djy(x)1 ::; 6M2h/ sine for j = 1, 2 and for all x ETa,
TET.
(2) If y E C 3 (s7) and
then there exists a U E 8 2 with
(a) lu(x) - y(x)1 ::; M3h3 for x E il,
(b) IDJu(x) - Djy(x)1 ::; 2M3h2/ sine for j = 1, 2, x ETa,
T E T.
(3) If y E C 4 (s7) and
then there exists a u E 8 3 with
(a) lu(x) -y(x)l::; 3M4h 4 /sine for x EO,
(b) IDju(x) - Djy(x)1 ::; 5M4h3/ sine for j = 1, 2, x ETa, T E T.
[Here, TO means the interior of the triangle T. The limitation x E TO,
T E T in the estimates (b) is necessary because the first derivatives Dju
may have jumps along the sides of the triangles T.]
Since for u E 8 i , i = 1, 2, 3, the function Dju is only piecewise continuous, and since
606
7 Ordinary Differential Equations
s C max IDju(x) - Djy(x) 2 ,
1
xETO
C:=
!n1dX,
TET
from each of the estimates given under (b) there immediately follows, via
Theorem (7.7.12), an upper bound for the Sobolev norm Ilus - vllw(l) of
the error Us -y for the special function spaces 5 = 5 i , i = 1. 2, 3, provided
only that the exact solution y satisfies the differentiability assumptions of
(7.7.13):
h
Ius! - yll",(!) s C1Ah -.-,
l
I
n
US 2
-
II u s 3 -
vll
wl !)
vii
Will
SIn ()
h2
S C2 A13 sin()'
17. 3
s C3 M4 ---:--()'
sm
The constants Ci are independent of the triangulation T of n and of v, and
can be specified explicitly; the constants 1'.1;. h, () have the same meaning
as in (7.7.13). These estimates show, e.g. for 5 = 53, that the error (i.e.,
its Sobolev norm) goes to zero like the third power of the fineness 17. of the
triangulation.
We will not prove Theorem (7.7.13) here. Proofs can be found, e.g., in
Zl<imal (1968) and in Ciarlet and Wagschal (1971). We only indicate that
the special functions U E 5 i , i = 1, 2, 3, of the theorem are obtained by
interpolation of y: For 5 = 52, e.g., one obtains the function U E 52 by
interpolation of y at the points PEP of the triangulation (see above):
11(x) =
L y(P)up(X).
PEP
An important advantage of the finite-element method lies in the fact
that boundary-value problems can be treated also for relatively complicated
regions n, provided n can still he triangulated. We remark, in addition,
that not only can problems in JR2 of the special form (7.7.1) be attacked
by these methods, hut also boundary-value problems in higher-dimensional
spaces and for more general differential operators than the Laplace operator.
A detailed exposition of the finite element method and its practical implementation can be found in Strang and Fix (1973), Oden and
Reddy(1976), Schwarz (1988), and Ciarlet and Lions (1991), and Quarteroni
and Valli (1994).
EXERCISES FOR CHAPTER 7
607
EXERCISES FOR CHAPTER 7
1.
Let A be a real diagonalizable n x n matrix with the real eigenvalues Ai, ... ,
An and the eigenvectors Cl, ... , cn . How can the solution set of the system
y' =Ay
be described in terms of the Ak and Ck? How is the special solution y(x)
obtained with y(O) = Yo, yo E lRn ?
2.
Determine the solution of the initial-value problem
y' = Jy,
y(O) = yo
with yo E lR n and the n x n matrix
[1 :1
A E lR.
Hint: Seek the kth component of the solution vector y(x) in the form y(x) =
Pk(x)e AX , Pk a polynomial of degree ::; n - k.
3.
Consider the initial-value problem
,
3
y(O) = O.
Y =x-x,
Suppose we use Euler's method with stepsize h to compute approximate
values T/(Xj; h) for y(Xj), Xj = jh. Find an explicit formula for T/(Xj; h) and
e(xj; h) = T/(Xj; h) - Y(Xj), and show that e(x; h), for x fixed, goes to zero as
h = x/n -+ O.
4.
The initial-value problem
y'
=..;y,
y(O) = 0
has the nontrivial solution y(x) = x 2 /4. Application of Euler's method however yields T/(Xj; h) = 0 for all x and h = x/n, n = 1, 2, .... Explain this
paradox.
5.
Let T/(x; h) be the approximate solution furnished by Euler's method for the
initial-value problem
,
y(O) = 1.
y =y,
(a) One has T/(x; h) = (1 + h)x/h.
(b) Show that T/(x; h) has the expansion
T/(X; h) = LTi(X)hi
with
TO(X) = eX,
i=O
which converges for Ihl < 1; the Ti(X) here are analytic functions independent of h.
608
7 Ordinary Differential Equations
(c) Determine Ti(X) for i = 1, 2, 3.
(d) The Ti(X), i ?': 1, are the solutions of the initial-value problems
i
k+l ( )
'"""' Ti_k X
,
Ti(X)=Ti(X)- L...J (k+l)!'
Ti(O)
= O.
k=l
6.
Show that the modified Euler method furnishes the exact solution of the
differential equation y' = -2ax.
7.
Show that the one-step method given by
¢P(X, y; h) := Hkl + 4k2 + k3],
kl := f(x, y),
k2 := f
(x + ~, y + ~kl) ,
k3 := f(x + h, y + h( -k 1 + 2k2))
("simple Kutta formula") is of third order.
8.
Consider the one-step method given by
h
¢P(x, y; h) := f(x, y) + "2 g(x + ~h, y + ~hf(x, y)),
where
g(x, y) := :xf(x, y) + (:/(x, y)) . f(x, y).
Show that it is a method of order 3.
9.
What does the general solution of the difference equation
Uj+2 = Uj+l + Uj,
j ?': 0,
look like? (For Uo = 0, Ul = lone obtains the "Fibonacci sequence".)
10. In the methods of Adams-Moulton type, given approximate values "T]p-(q-l),
... , "T]p for y(Xp-(q-l)), ... , y(xp), one computes an approximate value
"T]p+l for y(xp+1), Xp+l E [a, b], through the following iterative process [see
(7.2.6.7)]:
(0)
"T]p+l
ar b·Itrary;
for i = 0, 1, ... :
"T]p+1 := 'IF("T]p+l) := "T]p + h[,BqO!(Xp+l, "T]p+l) + ,Bql!p + ... +
(HI)
(i)
(i)
+,Bqqfp+l-q].
Show: For f E H(a,b) there exists an ho > 0 such that for all Ihl ::; ho the
sequence {"T]~~d converges toward an "T]p+l with "T]p+l = 'IF("T]p+I).
11. Use the error estimates for interpolation by polynomials, or for Newton-Cotes
formulas, to show that the Adams-Moulton method for q = 2 is of third order
and the Milne method for q = 2 of fourth order.
EXERCISES FOR CHAPTER 7
609
12. For q = 1 and q = 2 determine the coefficients {3qi in Nystrom's formulas
1]p+l = 1]p-l + h[{3lO!p + {3n!p-l],
1]p+l = 1]p-l + h[{320!p + {321!p-l + (322!p-2].
13. Check whether or not the linear multistep method
h
1]p - 1]p-4 = "3 [8!p-l - 4!p-2 + 8!p-3]
is convergent.
14. Let 1P'(1) = 0, and assume that for the coefficients in
IP'(J-t)
r-l
r-2
--1 = 'Yr-lJ-t
+ 'Yr-2J-t
+ ... + 1'1J-t + 1'0
J-tone has br-ll > br-21 2 ... 2 11'01· Does IP'(J-t) satisfy the stability condition?
15. Determine a, (3 and I' such that the linear multistep method
has order 3. Is the method thus found stable?
16. Consider the predictor method given by
(a) Determine ao, bo and b1 as a function of al such that the method has
order at least 2.
(b) For which aI-values is the method thus found stable?
(c) What special methods are obtained for al = 0 and al = -I?
(d) Can al be so chosen that there results a stable method of order 3?
17. Suppose the method
is applied to the initial-value problem
y' =0,
y(O) = c.
Let the starting values be 1]0 = c and 1]1 = c+eps (eps = machine precision).
What values 1]j are to be expected for arbitrary stepsize h?
18. For the solution of the boundary-value problem
    y'' = 100y,    y(0) = 1,    y(3) = e^{−30},
consider the initial-value problem
    y'' = 100y,    y(0) = 1,    y'(0) = s,
with solution y(x; s), and determine s = s̄ iteratively such that y(3; s̄) = e^{−30}. Assume further that s̄ is computed only within a relative error ε, i.e., instead of s̄ one obtains s̄(1 + ε). How large is y(3; s̄(1 + ε))? In this case, is the simple shooting method (as described above) a suitable method for the solution of the boundary-value problem?
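The amplification asked about here can be checked directly, since the initial-value problem has a closed-form solution. The following minimal sketch (not part of the original text; the function name y is illustrative) evaluates y(3; s̄(1 + ε)) for several relative errors ε:

    # Sketch for Exercise 18 (illustrative, not from the text): the exact solution of
    # y'' = 100 y, y(0) = 1, y'(0) = s is
    #     y(x; s) = (1 + s/10)/2 * exp(10 x) + (1 - s/10)/2 * exp(-10 x),
    # so the shooting condition y(3; s) = exp(-30) forces s_bar = -10.
    import math

    def y(x, s):
        return 0.5 * (1 + s / 10) * math.exp(10 * x) + 0.5 * (1 - s / 10) * math.exp(-10 * x)

    s_bar = -10.0                      # exact shooting parameter
    for eps in (1e-16, 1e-12, 1e-8):   # relative errors in s_bar
        print(eps, y(3.0, s_bar * (1 + eps)), math.exp(-30))
    # Even eps = 1e-16 changes y(3) by roughly (eps/2)*exp(30), i.e. about 5e-4,
    # many orders of magnitude larger than the target value exp(-30) ~ 9.4e-14.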
19. Consider the initial-value problem
    y' = K(y + y³),    y(0) = s.
(a) Determine the solution y(x; s) of this problem.
(b) In which x-neighborhood U_s(0) of 0 is y(x; s) defined?
(c) For a given b ≠ 0 find a k > 0 such that y(b; s) exists for all |s| < k.
20. Show that assumption (3) of Theorem (7.3.3.4) in the case n = 2 is not
satisfied for the boundary conditions
21. Show that the matrix E := A + BG m - 1 ... G 1 in (7.3.5.10) is nonsingular
under the assumptions of Theorem (7.3.3.4).
Hint: E = P_0(I + H), H = M + P_0^{−1} D_m (Z − I).
22. Show by an error analysis (backward analysis) that the solution vector of (7.3.5.10), (7.3.5.9), subject to rounding errors, can be interpreted as the exact result of (7.3.5.8) with slightly perturbed right-hand sides F_j and perturbed last equation
    (A + E_1)Δs_1 + E_2 Δs_2 + ⋯ + E_{m−1} Δs_{m−1} + (B + E_m)Δs_m = −F_m,
lub(E_j) small.
23. Let DF(s) be the matrix in (7.3.5.5). Prove:
    det(DF(s)) = det(A + B G_{m−1} ⋯ G_1).
For the inverse (DF(s))^{−1} find explicitly a decomposition into block matrices R, S, T:
    (DF(s))^{−1} = R S T
with S block-diagonal, R unit lower block triangular, and S, T chosen so that S T · DF(s) is unit lower block triangular.
24. Let Δ ∈ ℝⁿ be arbitrary. Prove: If Δs_1 := Δ and Δs_j, j = 2, …, m, is obtained from (7.3.5.9), then the vector Δs formed from Δs_1, …, Δs_m satisfies
    (DF(s)·F)ᵀ Δs = −FᵀF + F_mᵀ((A + B G_{m−1} ⋯ G_1)Δ − w).
Thus, if Δ is a solution of (7.3.5.10), then always
    (DF(s)·F)ᵀ Δs < 0.
25. Consider the boundary-value problem (y(x) ∈ ℝⁿ)
    y' = f(x, y)
with the separated boundary conditions
    y_i(a) − α_i = 0,    i = 1, …, k,
    y_i(b) − β_i = 0,    i = k + 1, …, n.
There are thus k initial values and n − k terminal values which are known. The information can be used to reduce the dimension of the system of equations (7.3.5.8) (k components of Δs_1 and n − k components of Δs_m are zero). For m = 3 construct the system of equations for the corrections Δs_i. Integrate (using the given information) from x_1 = a to x_2 as well as from x_3 = b to x_2 ("counter shooting"; see Figure 31).

Fig. 31. Counter shooting.
26. The steady concentration of a substrate in a spherical cell of radius 1 in an enzyme-catalyzed reaction is described by the solution y(x; α) of the following singular boundary-value problem [cf. Keller (1968)]:
    y'' = −(2/x) y' + y / (α(y + k)),    y'(0) = 0,    y(1) = 1,
α, k: parameters, k = 0.1, 10⁻³ ≤ α ≤ 10⁻¹. Although f is singular at x = 0, there exists a solution y(x) which is analytic for small |x|. In spite of this, every numerical integration method fails near x = 0. Help yourself as follows: By using the symmetries of y(x), expand y(x) as in Example 2, Section 7.3.6, into a Taylor series about 0 up to and including the term x⁶;
thereby express all coefficients in terms of λ := y(0). By means of p(x; λ) = y(0) + y'(0)x + ⋯ + y⁽⁶⁾(0)x⁶/6! one obtains a modified problem
(∗)    y'' = F(x, y, y') := { d²p(x; λ)/dx²   for 0 ≤ x ≤ 10⁻²,
                               f(x, y, y')     for 10⁻² < x ≤ 1,
       y(0) = λ,    y'(0) = 0,    y(1) = 1,
which is better suited for numerical solution. Consider (∗) an eigenvalue problem for the eigenvalue λ, and formulate (∗) as in Section 7.3.0 as a boundary-value problem (without eigenvalue). For checking purposes we give the derivatives:
    y(0) = λ,    y⁽²⁾(0) = λ / (3α(λ + k)),    y⁽⁴⁾(0) = kλ / (5α²(λ + k)³),
    y⁽⁶⁾(0) = (3k − 10λ) kλ / (21α³(λ + k)⁵),    y⁽ⁱ⁾(0) = 0   for i = 1, 3, 5.
27. Consider the boundary-value problem
y" - p(x)y' - q(x)y = r(x),
y(a) = a,
y(b) = (3
with q(x) 2: qo > 0 for a ::; x ::; b. We are seeking approximate values Ui for
the exact values y(Xi), i = 1, 2, ... , n, at Xi = a + ih, h = (b - a)/(n + 1).
Replacing y'(Xi) by (Ui+l - ui-d/2h and y"(x;) by (Ui-l - 2Ui + ui+d/h 2
for i = 1, 2, ... , n, and putting Uo = a, Un +l = (3, one obtains from the
differential equation a system of equations for the vector U = [Ul, ... , unf
Au = k,
A an n x n-matrix, k E JRn .
(a) Determine A and k.
(b) Show that the system of equations is uniquely solvable for h sufficiently
small.
28. Consider the boundary-value problem
y" = g(x),
with 9 E e[O, 1].
(a) Show:
y(x) =
with
y(O) = y(l) = 0,
11
G(x,
G(x 0 .- {~(x - 1)
, .- x(~ - 1)
~)g(~)d~
for 0 ::; ~ ::; x ::; 1,
for 0 ::; x ::; ~ ::; 1.
(b) Replacing g(x) by g(x) + Llg(x) with ILlg(x) I ::; E: for all x, the solution
y(x) changes to y(x) + Lly(x). Prove:
E:
ILly(x) I ::; "2x(l - x)
for 0 ::; x ::; 1.
(c) The difference method described in Section 7.4 yields the system of equations [see (7.4.6)]
    Au = k,
the solution vector u = [u_1, …, u_n]ᵀ of which provides approximate values u_i for y(x_i), x_i = i/(n + 1), i = 1, 2, …, n. Replacing k by k + Δk with |Δk_i| ≤ ε, i = 1, 2, …, n, the vector u changes to u + Δu. Show:
i = 1, 2, ... , n.
29. Let
    D := {u | u(0) = 0, u ∈ C²[0, 1]}
and
    F(u) := ∫₀¹ { ½(u'(x))² + f(x, u(x)) } dx + p(u(1))
with u ∈ D, f_uu(x, u) ≥ 0, p''(u) ≥ 0. Prove: If y(x) is a solution of
    y'' − f_u(x, y) = 0,    y(0) = 0,    y'(1) + p'(y(1)) = 0,
then
    F(y) < F(u),    u ∈ D, u ≠ y,
and vice versa.
30.
(a) Let T be a triangle in ℝ² with the vertices P_1, P_3 and P_5; let further P_2 ∈ P_1P_3, P_4 ∈ P_3P_5, and P_6 ∈ P_5P_1 be different from P_1, P_3, P_5. Then for arbitrary real numbers y_1, …, y_6 there exists exactly one polynomial of degree at most 2 which assumes the values y_i at the points P_i, i = 1, …, 6.
Hint: It evidently suffices to show this for a single triangle with suitably chosen coordinates.
(b) Is the statement analogous to (a) valid for arbitrary position of the P_i?
(c) Let T_1, T_2 be two triangles of a triangulation with a common side g and u_1, u_2 polynomials of degree at most 2 on T_1 and T_2, respectively. Show: If u_1 and u_2 agree in three distinct points of g, then u_2 is a continuous extension of u_1 onto T_2 (i.e., u_1 and u_2 agree on all of g).
References for Chapter 7
Babuška, I., Práger, M., Vitásek, E. (1966): Numerical Processes in Differential Equations. New York: Interscience.
Bader, G., Deuflhard, P. (1983): A semi-implicit midpoint rule for stiff systems of ordinary differential equations. Numer. Math. 41, 373-398.
Bank, R. E., Bulirsch, R., Merten, K. (1990): Mathematical modelling and simulation of electrical circuits and semiconductor devices. ISNM 93, Basel: Birkhäuser.
Bock, H.G., Schlöder, J.P., Schulz, V.H. (1995): Numerik großer Differentiell-Algebraischer Gleichungen: Simulation und Optimierung, pp. 35-80 in: Schuler, H. (ed.): Prozeßsimulation, Weinheim: VCH.
Boyd, J. P. (1989): Chebyshev and Fourier Spectral Methods. Berlin, Heidelberg,
New York: Springer-Verlag.
Broyden, C.G. (1967): Quasi-Newton methods and their application to function minimization. Math. Comp. 21, 368-381.
Buchauer, O., Hiltmann, P., Kiehl, M. (1994): Sensitivity analysis of initial-value problems with application to shooting techniques. Numerische Mathematik 67, 151-159.
Bulirsch, R. (1971): Die Mehrzielmethode zur numerischen Lösung von nichtlinearen Randwertproblemen und Aufgaben der optimalen Steuerung. Report der Carl-Cranz-Gesellschaft.
- - - , Stoer, J. (1966): Numerical treatment of ordinary differential equations
by extrapolation methods. Numer. Math. 8, 1-13.
Butcher, J. C. (1964): On Runge-Kutta processes of high order. J. Austral. Math. Soc. 4, 179-194.
Byrne, D. G., Hindmarsh, A. C. (1987): Stiff o.d.e.-solvers: A review of current and coming attractions. J. Comp. Phys. 70, 1-62.
Callies, R. (2000): Design Optimization and Optimal Control. Differential-Algebraic Systems, Multigrid Multiple-Shooting Methods and Numerical Realization. Habilitation Thesis, Technische Universität München.
- - - (2000a): Efficient Treatment of Experimental Data in Integrated Design Optimization. In: Advances in Computational Engineering & Sciences (S. N. Atluri, F. W. Brust, Eds.), pp. 1188-1193. Palmdale/CA: Tech Science Press.
- - - (2001): Optimized Runge-Kutta Methods for Linear Differential Equations. Submitted to: J. Appl. Math. Comp.
- - - (2001a): Multidimensional Stepsize Control. ZAMM 81, S3, 743-744.
Canuto, C., Hussaini, M. Y., Quarteroni, A., Zang, T. A. (1987): Spectral Methods
in Fluid Dynamics. Berlin, Heidelberg, New York: Springer-Verlag.
Caracotsios, M., Stewart, W.E. (1985): Sensitivity analysis of initial value problems with mixed ODEs and algebraic equations. Computers and Chemical Engineering 9, 359-365.
Ciarlet, P. G., Lions, J. L., Eds. (1991): Handbook of Numerical Analysis. Vol. II. Finite Element Methods (Part 1). Amsterdam: North Holland.
- - - , Schultz, M. H., Varga, R. S. (1967): Numerical methods of high order
accuracy for nonlinear boundary value problems. Numer. Math. 9, 394-430.
- - - , Wagschal, C. (1971): Multipoint Taylor formulas and applications to the finite element method. Numer. Math. 17, 84-100.
Clark, N. W. (1968): A study of some numerical methods for the integration
of systems of first order ordinary differential equations. Report ANL-7428.
Argonne National Laboratories.
Coddington, E. A., Levinson, N. (1955): Theory of Ordinary Differential Equations. New York: McGraw-Hill.
Collatz, L. (1960): The Numerical Treatment of Differential Equations. Berlin,
Gottingen, Heidelberg: Springer-Verlag.
- - - (1966): Functional Analysis and Numerical Mathematics. New York: Academic Press.
Crane, P.J., Fox, P.A. (1969): A comparative study of computer programs for
integrating differential equations. Num. Math. Computer Program library one
- Basic routines for general use. Vol. 2, issue 2. New Jersey: Bell Telephone
Laboratories Inc.
Dahlquist, G. (1956): Convergence and stability in the numerical integration of
ordinary differential equations. Math. Scand. 4, 33-53.
- - - (1959): Stability and error bounds in the numerical integration of ordinary differential equations. Trans. Roy. Inst. Tech. (Stockholm), No. 130.
- - - (1963): A special stability problem for linear multistep methods. BIT 3,
27-43.
Deuflhard, P. (1979): A stepsize control for continuation methods and its special
application to multiple shooting techniques. Numer. Math. 33, 115-146.
----, Hairer, E., Zugck, J. (1987): One-step and extrapolation methods for
differential-algebraic systems. Numer. Math. 51, 501-516.
Diekhoff, H.-J., Lory, P., Oberle, H. J., Pesch, H.-J., Rentrop, P., Seydel, R. (1977): Comparing routines for the numerical solution of initial value problems of ordinary differential equations in multiple shooting. Numer. Math. 27, 449-469.
Dormand, J. R., Prince, P. J. (1980): A family of embedded Runge-Kutta formulae. J. Compo Appl. Math. 6, 19-26.
Eich, E. (1992): Projizierende Mehrschrittverfahren zur numerischen Lösung von Bewegungsgleichungen technischer Mehrkörpersysteme mit Zwangsbedingungen und Unstetigkeiten. Fortschritt-Berichte VDI, Reihe 18, Nr. 109, VDI-Verlag, Düsseldorf.
Engl, G., Kröner, A., Kronseder, T., von Stryk, O. (1999): Numerical Simulation and Optimal Control of Air Separation Plants, pp. 221-231 in: Bungartz, H.-J., Durst, F., Zenger, Chr. (eds.): High Performance Scientific and Engineering Computing, Lecture Notes in Computational Science and Engineering Vol. 8, New York: Springer-Verlag.
Enright, W. H., Hull, T. E., Lindberg, B. (1975): Comparing numerical methods
for stiff systems of ordinary differential equations. BIT 15, 10-48.
Fehlberg, E. (1964): New high-order Runge-Kutta formulas with stepsize control
for systems of first- and second-order differential equations. Z. Angew. Math.
Mech. 44, T17-T29.
- - - (1966): New high-order Runge-Kutta formulas with an arbitrary small
truncation error. Z. Angew. Math. Mech. 46, 1-16.
- - - (1969): Klassische Runge-Kutta-Formeln fünfter und siebenter Ordnung mit Schrittweiten-Kontrolle. Computing 4, 93-106.
- - - (1970): Klassische Runge-Kutta-Formeln vierter und niedrigerer Ordnung mit Schrittweiten-Kontrolle und ihre Anwendung auf Wärmeleitungsprobleme. Computing 6, 61-71.
Galán, S., Feehery, W.F., Barton, P.I. (1998): Parametric sensitivity functions for hybrid discrete/continuous systems. Preprint submitted to Applied Numerical Mathematics.
Gantmacher, F. R. (1969): Matrizenrechnung II. Berlin: VEB Deutscher Verlag der Wissenschaften.
Gear, C. W. (1971): Numerical initial value problems in ordinary differential equations. Englewood Cliffs, N.J.: Prentice-Hall.
-------- (1988): Differential algebraic equation index transformations. SIAM J.
Sci. Statist. Comput. 9, 39-47.
- - - , Petzold, L. R. (1984): ODE methods for the solution of differential/algebraic systems. SIAM J. Numer. Anal. 21, 716-728.
Gill, P.E., Murray, W., Wright, M.H. (1995): Practical Optimization, 10th printing. London: Academic Press.
Gottlieb, D., Orszag, S. A. (1977): Numerical Analysis of Spectral Methods: Theory and Applications. Philadelphia: SIAM.
Gragg, W. (1963): Repeated extrapolation to the limit in the numerical solution
of ordinary differential equations. Thesis, UCLA.
- - - (1965): On extrapolation algorithms for ordinary initial value problems.
J. SIAM Numer. Anal. Ser. B 2, 384-403.
Griepentrog, E., März, R. (1986): Differential-Algebraic Equations and Their Numerical Treatment. Leipzig: Teubner.
Grigorieff, R.D. (1972, 1977): Numerik gewöhnlicher Differentialgleichungen 1, 2. Stuttgart: Teubner.
Hairer, E., Lubich, Ch. (1984): Asymptotic expansions of the global error of fixed
stepsize methods. Numer. Math. 45, 345-360.
- - - , - - - , Roche, M. (1989): The numerical solution of differentialalgebraic systems by Runge-Kutta methods. In: Lecture Notes in Mathematics
1409. Berlin, Heidelberg, New York: Springer-Verlag.
- - - , Nørsett, S. P., Wanner, G. (1993): Solving Ordinary Differential Equations I. Nonstiff Problems. 2nd ed., Berlin, Heidelberg, New York: Springer-Verlag.
- - - , Wanner, G. (1991): Solving Ordinary Differential Equations II. Stiff
and Differential-Algebraic Problems. Berlin, Heidelberg, New York: SpringerVerlag.
Heim, A., von Stryk, O. (1996): Documentation of PAREST - A multiple shooting code for optimization problems in differential-algebraic equations. Technical Report M9616, Fakultät für Mathematik, Technische Universität München.
Henrici, P. (1962): Discrete Variable Methods in Ordinary Differential Equations.
New York: John Wiley.
Hestenes, M. R. (1966): Calculus of Variations and Optimal Control Theory. New York: John Wiley.
Heun, K. (1900): Neue Methode zur approximativen Integration der Differentialgleichungen einer unabhängigen Variablen. Z. Math. Phys. 45, 23-38.
Hiltmann, P. (1990): Numerische Lösung von Mehrpunkt-Randwertproblemen und Aufgaben der optimalen Steuerung mit Steuerfunktionen über endlichdimensionalen Räumen. PhD Thesis, Mathematisches Institut, Technische Universität München.
Horneber, E. H. (1985): Simulation elektrischer Schaltungen auf dem Rechner.
Fachberichte Simulation, Bd. 5. Berlin, Heidelberg, New York: Springer-Verlag.
Hull, T. E., Enright, W. H., Fellen, B. M., Sedgwick, A. E. (1972): Comparing
numerical methods for ordinary differential equations. SIAM J. Numer. Anal.
9, 603-637. [Errata, ibid. 11, 681 (1974).]
Isaacson, E., Keller, H. B. (1966): Analysis of Numerical Methods. New York:
John Wiley.
Kaps, P., Rentrop, P. (1979): Generalized Runge-Kutta methods of order four
with stepsize control for stiff ordinary differential equations. Numer. Math. 33,
55-68.
Keller, H. B. (1968): Numerical Methods for Two-Point Boundary- Value Problems. London: Blaisdell.
Kiehl, M. (1999): Sensitivity analysis of ODEs and DAEs - theory and implementation guide. Optimization Methods and Software 10,803-821.
Krogh, F. T. (1974): Changing step size in the integration of differential equations
using modified divided differences. In: Proceedings of the Conference on the
Numerical Solution of Ordinary Differential Equations, 22-71. Lecture Notes
in Mathematics 362. Berlin, Heidelberg, New York: Springer-Verlag.
Kutta, W. (1901): Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Z. Math. Phys. 46, 435-453.
Lambert, J. D. (1973): Computational Methods in Ordinary Differential Equations. London, New York, Sydney, Toronto: John Wiley.
Leis, J.R., Kramer, M.A. (1985): Sensitivity analysis of systems of differential and algebraic equations. Computers and Chemical Engineering 9, 93-96.
Morrison, K.R., Sargent, R.W.H. (1986): Optimization of multistage processes described by differential-algebraic equations, pp. 86-102 in: Lecture Notes in Mathematics 1230. New York: Springer-Verlag.
Na, T. Y., Tang, S. C. (1969): A method for the solution of conduction heat
transfer with non-linear heat generation. Z. Angew. Math. Mech. 49, 45-52.
Oberle, H. J., Grimm, W. (1989): BNDSCO - A program for the numerical solution of optimal control problems. Internal Report No. 515-89/22, Institute for Flight Systems Dynamics, DLR, Oberpfaffenhofen, Germany.
Oden, J. T., Reddy, J. N. (1976): An Introduction to the Mathematical Theory of Finite Elements. New York: John Wiley.
Osborne, M. R. (1969): On shooting methods for boundary value problems. J.
Math. Anal. Appl. 27, 417-433.
Petzold, L. R. (1982): A description of DASSL - A differential algebraic system solver. IMACS Trans. Sci. Comp. 1, 65ff., Ed. H. Stepleman. Amsterdam: North Holland.
Quarteroni, A., Valli, A. (1994): Numerical Approximation of Partial Differential
Equations. Berlin, Heidelberg, New York: Springer-Verlag.
Rentrop, P. (1985): Partitioned Runge-Kutta methods with stiffness detection and stepsize control. Numer. Math. 47, 545-564.
- - -, Roche, M., Steinebach, G. (1989): The application of Rosenbrock-Wanner type methods with stepsize control in differential-algebraic equations. Numer. Math. 55, 545-563.
Rozenvasser, E. (1967): General sensitivity equations of discontinuous systems. Automation and Remote Control 28, 400-404.
Runge, C. (1895): Über die numerische Auflösung von Differentialgleichungen. Math. Ann. 46, 167-178.
Schwarz, H. R. (1981): FORTRAN-Programme zur Methode der finiten Elemente. Stuttgart: Teubner.
- - - (1988): Finite Element Methods. London, New York: Academic Press.
Shampine, L. F., Gordon, M. K. (1975): Computer Solution of Ordinary Differential Equations. The Initial Value Problem. San Francisco: Freeman and Company.
- - -, Watts, H. A., Davenport, S. M. (1976): Solving nonstiff ordinary differential equations - The state of the art. SIAM Review 18, 376-411.
Shanks, E. B. (1966): Solution of differential equations by evaluation of functions. Math. Comp. 20, 21-38.
Stetter, H. J. (1973): Analysis of Discretization Methods for Ordinary Differential
Equations. Berlin, Heidelberg, New York: Springer-Verlag.
Strang, G., Fix, G. J. (1973): An Analysis of the Finite Element Method. Englewood Cliffs, N.J.: Prentice-Hall.
Troesch, B. A. (1960): Intrinsic difficulties in the numerical solution of a boundary
value problem. Report NN-142, TRW, Inc. Redondo Beach, CA.
- - - (1976): A simple approach to a sensitive two-point boundary value problem. J. Computational Phys. 21, 279-290.
Wilkinson, J. H. (1982): Note on the practical significance of the Drazin inverse. In: S.L. Campbell, Ed., Recent Applications of Generalized Inverses. Pitman Publ. 66, 82-89.
Willoughby, R. A. (1974): Stiff Differential Systems. New York: Plenum Press.
Zlamal, M. (1968): On the finite element method. Numer. Math. 12, 394-409.
8 Iterative Methods for the Solution of
Large Systems of Linear Equations.
Additional Methods
8.0 Introduction
Many applications require the solution of very large systems of linear equations Ax = b in which the matrix A is fortunately sparse, i.e., has only relatively few nonzero elements. Such systems arise, for instance, if difference
methods or finite element methods are being used for solving boundary
value problems in partial differential equations. The classical elimination
methods [see Chapter 4] are not suitable in this context since they tend to
lead to the formation of dense intermediate matrices, making the number of
arithmetic operations necessary for the solution too large even for presentday computers, not to mention the fact that the memory requirements for
such intermediate matrices exceed available space.
For these reasons, researchers have long since moved to iterative methods for solving such systems of equations. These methods start with an initial vector x^(0) and subsequently generate a sequence of vectors x^(1), x^(2), … which converges toward the desired solution x. A common feature of all these
methods is the fact that the effort required for an individual iteration step
x^(i) → x^(i+1) is essentially equal to that of multiplying a vector by the
matrix A - a very modest effort provided A is sparse. Therefore, one can
still carry out a relatively large number of iterations with a reasonable
amount of effort. This is necessary for the reason alone that these methods
converge only linearly, and sometimes very slowly at that. Iterative methods
are therefore usually inferior to elimination methods if A is small (a 100 x
100 matrix is small in this sense) or not sparse.
Somewhat outside of this framework lie the so-called method of iterative
refinement [see (8.1.9)-(8.1.11)] and the Krylov space methods [see Section
8.7]. Iterative refinement is used frequently to iteratively improve the accuracy of an approximate solution x̃ of a system of equations which has been
obtained by an elimination method using machine arithmetic.
The general characteristics of iterative methods, stated above, also apply to Krylov space methods, with the one exception that these methods, in
exact arithmetic, terminate with the exact solution x after a finite number
of steps. They generate iterates Xk that approximate the solution of Ax = b
best among all vectors belonging to a Krylov space of dimension k. We describe four methods of this type: the classical conjugate-gradient method
(cg-method) of Hestenes and Stiefel (1952) for systems of equations with
a positive definite matrix A [see Section 8.7.1], and, for systems with an
arbitrary nonsingular A, the GMRES-method of Saad and Schultz (1986)
[see Section 8.7.2], a simplified version of the QMR-method of Freund and
Nachtigal (1991) [Section 8.7.3], and the Bi-CGSTAB algorithm of van der
Vorst (1992) [Section 8.7.4]. These methods have the advantage that they
do not depend on additional parameters (as other methods do), the proper
choice of which is often difficult. With regard to the applicability of Krylov
space methods, however, the same remarks apply as for the true iterative
methods. For this reason, we include these methods in the present chapter.
A detailed treatment of iterative methods can be found in the new
edition of the classical book by Varga (2000), and also in Young (1971)
and Axelsson (1994); Krylov space methods are described in depth in Saad
(1996).
For the solution of certain special systems of equations arising with the
solution of the so-called model problem (the Poisson problem on a rectangle) there are some direct methods which give the solution after finitely
many steps and are superior to (most) iterative methods. Two such methods are the algorithm of Buneman (1969), and Fourier methods that use the
FFT-algorithm of trigonometric interpolation. Both methods are described
in Section 8.8.
Nowadays, the very large systems of equations connected with the solution of boundary-value problems of partial differential equations by finite
element techniques are mainly solved by multigrid methods. We describe in
Section 8.9 only concepts of these important iterative methods, using the
context of a boundary value problem for an ordinary differential equation. For
thorough treatments of multigrid methods, which are closely connected to
the numerics of partial differential equations, we refer the reader to the vast
special literature on this subject, e.g. to Hackbusch (1985), Braess (1997),
Bramble (1993), Quarteroni and Valli (1997).
It should be pointed out that elimination techniques for solving linear
systems Ax = b have also been modified in various ways to handle large
sparse matrices A. These techniques concern themselves with appropriate ways of storing such matrices, and with the determination of suitable
permutation matrices P_1, P_2 such that in the application of elimination methods to the equivalent system P_1 A P_2 y = P_1 b, y = P_2^{−1} x, the intermediate matrices generated remain as sparse as possible. Basic methods of
this type were described in Section 4.12. For a relevant exposition, we refer
the reader to the literature cited there, and in addition to George (1973),
who proposed a particularly useful technique for handling the linear equa-
tions arising from the application of finite element techniques to partial
differential equations.
8.1 General Procedures for the Construction of
Iterative Methods
Let a nonsingular n × n matrix A be given, and a system of linear equations
(8.1.1)    Ax = b
with the exact solution x := A⁻¹b. We consider iterative methods of the form [cf. Chapter 5]
(8.1.2)    x^(i+1) := Φ(x^(i)),    i = 0, 1, ….
With the help of an arbitrary nonsingular n × n matrix B such iteration algorithms can be obtained from the equation
    Bx + (A − B)x = b,
by putting
(8.1.3)    B x^(i+1) + (A − B) x^(i) = b,
or, solved for x^(i+1),
(8.1.4)    x^(i+1) = x^(i) − B⁻¹(A x^(i) − b) = (I − B⁻¹A) x^(i) + B⁻¹b,    i = 0, 1, ….
Such iteration methods, in this generality, were first considered by Wittmeyer (1936).
Note that (8.1.4) is identical with the following special vector iteration [see Section 6.6.3]:
    [1; x^(i+1)] = W [1; x^(i)],    W := [ 1  0 ; B⁻¹b  I − B⁻¹A ],
where the matrix W of order n + 1, in correspondence to the eigenvalue λ_0 := 1, has the left eigenvector [1, 0] and the right eigenvector [1; x], x := A⁻¹b. According to the results of Section 6.6.3, the sequence [1; x^(i)] will converge to [1; x] only if λ_0 = 1 is a simple dominant eigenvalue of W, i.e., only if the remaining eigenvalues λ_1, …, λ_n of W (these are the eigenvalues of I − B⁻¹A) are smaller in absolute value than 1.
Each choice of a nonsingular matrix B leads to a potential iterative
method (8.1.4). The better B satisfies the following conditions, the more
useful the method will be:
(1) the system of equations (8.1.3) is easily solved for x^(i+1),
(2) the eigenvalues of I − B⁻¹A have moduli which are as small as possible.
The better B agrees with A, the more likely the latter will be true.
These questions of optimality and convergence will be examined in the next
sections. Here, we only wish to indicate a few important special iterative
methods (8.1.3) obtained by simple choices of B. We introduce, for this
purpose, the following standard decomposition of A:
(8.1.5)    A = D − E − F,
with
    D := diag(a_11, a_22, …, a_nn),
−E the strictly lower triangular part and −F the strictly upper triangular part of A, as well as the abbreviations
(8.1.6)    L := D⁻¹E,    U := D⁻¹F,    J := L + U,    H := (I − L)⁻¹U,
assuming a_ii ≠ 0 for i = 1, 2, …, n.
(1) In the Jacobi method or total-step method one chooses
(8.1.7)    B := D,    I − B⁻¹A = J.
One thus obtains for (8.1.3)
    a_jj x_j^(i+1) + Σ_{k≠j} a_jk x_k^(i) = b_j,    j = 1, 2, …, n,    i = 0, 1, …,
where x^(i) := [x_1^(i), …, x_n^(i)]ᵀ.
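Written componentwise as in the formula above, one Jacobi sweep takes the following form; this short Python sketch is illustrative and not part of the original text (a dense array A is assumed for simplicity, although in practice A would be stored in sparse form):

    import numpy as np

    def jacobi_step(A, b, x):
        """One sweep of (8.1.7): a_jj * x_j_new = b_j - sum_{k != j} a_jk * x_k_old."""
        n = len(b)
        x_new = np.empty_like(x)
        for j in range(n):
            s = A[j, :j] @ x[:j] + A[j, j+1:] @ x[j+1:]   # uses only old components
            x_new[j] = (b[j] - s) / A[j, j]
        return x_new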
(2) In the Gauss-Seidel method or single-step method one chooses
(8.1.8)    B := D − E,    I − B⁻¹A = (I − L)⁻¹U = H.
One thus obtains for (8.1.3)
    Σ_{k<j} a_jk x_k^(i+1) + a_jj x_j^(i+1) + Σ_{k>j} a_jk x_k^(i) = b_j,    j = 1, 2, …, n,    i = 0, 1, ….
(3) The method of iterative refinement is a special case in itself. Here, the following situation is assumed. As the result of an elimination method for the solution of Ax = b one obtains, owing to rounding errors, a (generally) good approximate solution x^(0) for the exact solution x and a lower and an upper triangular matrix L and R, respectively, such that LR ≈ A [see Section 4.5]. The approximate solution x^(0) can then subsequently be improved by means of an iterative method of the form (8.1.3), choosing
    B := LR.
(8.1.3) is equivalent to
(8.1.9)    LR(x^(i+1) − x^(i)) = r^(i)
with the residual
    r^(i) := b − A x^(i).
From (8.1.9) it then follows that
(8.1.10)    x^(i+1) = x^(i) + u^(i),    where u^(i) solves LR u^(i) = r^(i).
Note that u^(i) can be simply computed by solving the triangular systems of equations
(8.1.11)    L z = r^(i),    R u^(i) = z.
In general (if A is not too ill conditioned), the method converges extremely fast. Already x^(1) or x^(2) agrees with the exact solution x to machine accuracy. Since, precisely for this reason, there occurs severe cancellation in the computation of the residuals r^(i) = b − A x^(i), it is extremely important for the proper functioning of the method that the computation of r^(i) be done in double precision. For the subsequent computation of z, u^(i), and x^(i+1) = x^(i) + u^(i) from (8.1.11) and (8.1.10), double precision is not required. [Programs and numerical examples for iterative refinement can be found in Wilkinson and Reinsch (1971), and in Forsythe and Moler (1967).]
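A minimal sketch of the procedure (8.1.9)-(8.1.11) — not part of the original text — is the following; to imitate the precision split it factors A in single precision (playing the role of the rounded factors L, R), while the residuals are accumulated in double precision. It assumes SciPy's lu_factor/lu_solve for the triangular solves:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def iterative_refinement(A, b, steps=3):
        """Sketch of (8.1.9)-(8.1.11): B := LR from a low-precision factorization,
        residuals r = b - A x computed in double precision."""
        lu, piv = lu_factor(A.astype(np.float32))            # L R ~ A (single precision)
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(steps):
            r = b - A @ x                                     # residual in double precision
            u = lu_solve((lu, piv), r.astype(np.float32))     # L z = r, R u = z
            x = x + u                                         # x^(i+1) = x^(i) + u^(i)
        return x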
8.2 Convergence Theorems
The iterative methods considered in (8.1.3), (8.1.4) produce from each initial vector x^(0) a sequence of vectors {x^(i)}_{i=0,1,…}. We now call the method convergent if for all initial vectors x^(0) this sequence converges toward the exact solution x = A⁻¹b. By ρ(G) we again denote in the following the spectral radius [see Section 6.9] of a matrix G. We can then state the following convergence criterion:
(8.2.1) Theorem.
(a) The method (8.1.3) converges if and only if ρ(I − B⁻¹A) < 1.
(b) It is sufficient for the convergence of (8.1.3) that lub(I − B⁻¹A) < 1. Here lub(·) can be taken relative to any norm.
PROOF. (a): For the error f_i := x^(i) − x, from
    x^(i+1) = (I − B⁻¹A) x^(i) + B⁻¹b,    x = (I − B⁻¹A) x + B⁻¹b,
we immediately obtain by subtraction the recurrence formula
(8.2.2)    f_{i+1} = (I − B⁻¹A) f_i,    or    f_i = (I − B⁻¹A)^i f_0,    i = 0, 1, ….
(1) Assume now (8.1.3) is convergent. Then we have lim_{i→∞} f_i = 0 for all f_0. Choosing in particular f_0 to be an eigenvector of I − B⁻¹A corresponding to the eigenvalue λ, it follows from (8.2.2) that
(8.2.3)    f_i = λ^i f_0,
and hence |λ| < 1, since lim_{i→∞} f_i = 0. Therefore, ρ(I − B⁻¹A) < 1.
(2) If, conversely, ρ(I − B⁻¹A) < 1, it follows immediately from Theorem (6.9.2) that lim_{i→∞}(I − B⁻¹A)^i = 0 and thus lim_{i→∞} f_i = 0 for all f_0.
(b): For arbitrary norms one has ρ(I − B⁻¹A) ≤ lub(I − B⁻¹A) [see Theorem (6.9.1)]. This proves the theorem.    □
Theorem (8.2.1) suggests the conjecture that the rate of convergence is larger the smaller ρ(I − B⁻¹A) is. This statement can be made more precise.
(8.2.4) Theorem. For the method (8.1.3) the errors f_i = x^(i) − x satisfy
(8.2.5)    sup_{f_0 ≠ 0} limsup_{i→∞} ( ‖f_i‖ / ‖f_0‖ )^{1/i} = ρ(I − B⁻¹A).
Here ‖·‖ is an arbitrary norm.
PROOF. Let ‖·‖ be an arbitrary norm, and lub(·) the corresponding matrix norm. By k we denote, for short, the left-hand side of (8.2.5). One sees immediately that k ≥ ρ(I − B⁻¹A), by choosing for f_0, as in (8.2.2), (8.2.3), the eigenvectors of I − B⁻¹A. Let now ε > 0 be arbitrary. Then by Theorem (6.9.2) there exists a vector norm N(·) such that for the corresponding matrix norm lub_N(·),
    lub_N(I − B⁻¹A) ≤ ρ(I − B⁻¹A) + ε.
According to Theorem (4.4.6) all norms on ℂⁿ are equivalent, and there exist constants m, M > 0 with
    m‖x‖ ≤ N(x) ≤ M‖x‖.
If now f_0 ≠ 0 is arbitrary, from these inequalities and (8.2.2) it follows that
    ‖f_i‖ ≤ (1/m) N(f_i) = (1/m) N((I − B⁻¹A)^i f_0) ≤ (1/m)(ρ(I − B⁻¹A) + ε)^i N(f_0) ≤ (M/m)(ρ(I − B⁻¹A) + ε)^i ‖f_0‖,
or
    ( ‖f_i‖ / ‖f_0‖ )^{1/i} ≤ (M/m)^{1/i} (ρ(I − B⁻¹A) + ε).
Since lim_{i→∞}(M/m)^{1/i} = 1, one obtains k ≤ ρ(I − B⁻¹A) + ε, and since ε > 0 was arbitrary, k ≤ ρ(I − B⁻¹A). The theorem is now proved.    □
Let us apply these results first to the Jacobi method (8.1.7). We continue using the notation (8.1.5)-(8.1.6) introduced in the previous section. Relative to the maximum norm lub_∞(C) = max_i Σ_k |c_ik|, we then have
    lub_∞(J) = max_i Σ_{k≠i} |a_ik| / |a_ii|.
If |a_ii| > Σ_{k≠i} |a_ik| for all i, we get immediately
    lub_∞(J) < 1.
From Theorem (8.2.1b) we thus obtain at once the first part of the following theorem:
(8.2.6) Theorem.
(a) Strong Row Sum Criterion: The Jacobi method is convergent for all matrices A with
(8.2.7)    |a_ii| > Σ_{k≠i} |a_ik|    for i = 1, 2, …, n.
(b) Strong Column Sum Criterion: The Jacobi method converges for all matrices A with
(8.2.8)    |a_kk| > Σ_{i≠k} |a_ik|    for k = 1, 2, …, n.
A matrix A satisfying (8.2.7) [(8.2.8)] is called strictly row-wise (column-wise) diagonally dominant.
PROOF OF (b). If (8.2.8) is satisfied by A, then (8.2.7) holds for the matrix Aᵀ. The Jacobi method therefore converges for Aᵀ, and thus, by Theorem (8.2.1a), ρ(X) < 1 for X := I − D⁻¹Aᵀ. Now X has the same eigenvalues as Xᵀ, and D⁻¹XᵀD = I − D⁻¹A. Hence also ρ(I − D⁻¹A) < 1, i.e., the Jacobi method is convergent also for the matrix A.    □
For irreducible matrices A the strong row (column) sum criterion can be
improved. Here, a matrix A is called irreducible if there is no permutation
matrix P such that p T AP has the form
where All is a p x p matrix and A22 a q x q matrix with p + q = n, p > 0,
q>
o.
The irreducibility of a matrix A can often be readily tested by means
of the (directed) graph G(A) associated with the matrix A. If A is an
n x n matrix, then G (A) consists of n vertices PI, ... , Pn and there is an
(oriented) arc Pi --+ Pj in G(A) precisely if aij "10.
It is easily shown that A is irreducible if and only if the graph G(A) is
connected in the sense that for each pair of vertices (Pi, Pj ) in G(A) there
is an oriented path from Pi to Pj .
For irreducible matrices one has the following:
(8.2.9) Theorem (Weak Row Sum Criterion). If A is irreducible and
    |a_ii| ≥ Σ_{k≠i} |a_ik|    for all i = 1, 2, …, n,
but |a_{i_0 i_0}| > Σ_{k≠i_0} |a_{i_0 k}| for at least one i_0, then the Jacobi method converges.
A matrix A satisfying the hypotheses of the theorem is called irreducibly
diagonally dominant.
Analogously there is, of course, also a weak column sum criterion for
irreducible A.
PROOF. From the assumptions of the theorem it follows for the Jacobi method, as in the proof of (8.2.6a), that
(8.2.10)    |J| e ≤ e,    |J| e ≠ e,    e := (1, 1, …, 1)ᵀ.
[Absolute value signs |·| and inequalities for vectors or matrices are always to be understood componentwise.]
Now, with A also J is irreducible. In order to prove the theorem, it suffices to show the inequality
    |J|ⁿ e < e,
because from it we have immediately
    ρ(J)ⁿ = ρ(Jⁿ) ≤ lub_∞(Jⁿ) ≤ lub_∞(|J|ⁿ) < 1.
Now, in view of (8.2.10) and |J| ≥ 0,
    |J|² e ≤ |J| e ≤ e,
and generally,
    |J|^{i+1} e ≤ |J|^i e ≤ ⋯ ≤ e,
i.e., the vectors t^(i) := e − |J|^i e satisfy
(8.2.11)    0 ≤ t^(1) ≤ t^(2) ≤ ⋯.
We show that the number r_i of nonvanishing components of t^(i) increases monotonically with i as long as r_i < n: 0 < r_1 < r_2 < ⋯. If this were not the case, there would exist, because of (8.2.11), a first i ≥ 1 with r_i = r_{i+1}. Without loss of generality, let t^(i) have the form
    t^(i) = [a; 0]    with a vector a > 0, a ∈ ℝᵖ, p > 0.
Then, in view of (8.2.11) and r_i = r_{i+1}, also t^(i+1) has the form
    t^(i+1) = [b; 0]    with a vector b > 0, b ∈ ℝᵖ.
Partitioning |J| analogously,
    |J| = [ |J_11|  |J_12| ; |J_21|  |J_22| ],    |J_11| a p × p matrix,
it follows from t^(i+1) ≥ |J| t^(i) that
    0 ≥ |J_21| a.
Because a > 0, this is only possible if |J_21| = 0, i.e., if J is reducible. Hence, 0 < r_1 < r_2 < ⋯, and thus t^(n) = e − |J|ⁿ e > 0. The theorem is now proved.    □
The conditions of Theorems (8.2.6) and (8.2.9) are also sufficient for
the convergence of the Gauss-Seidel method. We show this only for the
strong row sum criterion. We have even more precisely:
(8.2.12) Theorem. If
    |a_ii| > Σ_{k≠i} |a_ik|    for all i = 1, 2, …, n,
the Gauss-Seidel method is convergent, and furthermore [see (8.1.6)]
    lub_∞(H) ≤ lub_∞(J) < 1.
PROOF. Let κ_H := lub_∞(H), κ_J := lub_∞(J). As already exploited repeatedly, the assumption of the theorem implies
    |J| e ≤ κ_J e,    κ_J < 1,    e = (1, …, 1)ᵀ,
for the matrix J = L + U. From this, because |J| = |L| + |U|, one concludes
(8.2.13)    |L| e + |U| e ≤ κ_J e.
Now L and |L| are both lower triangular matrices with vanishing diagonal. For such matrices, as is easily verified,
    Lⁿ = 0,    |L|ⁿ = 0,
so that (I − L)⁻¹ and (I − |L|)⁻¹ exist and
    0 ≤ |(I − L)⁻¹| = |I + L + ⋯ + L^{n−1}| ≤ I + |L| + ⋯ + |L|^{n−1} = (I − |L|)⁻¹.
Multiplying (8.2.13) by the nonnegative matrix (I − |L|)⁻¹, one obtains, because H = (I − L)⁻¹U,
    |H| e ≤ (I − |L|)⁻¹ |U| e ≤ (I − |L|)⁻¹ (I − |L| + (κ_J − 1)I) e = (I + (κ_J − 1)(I − |L|)⁻¹) e.
Now, (I − |L|)⁻¹ ≥ I and κ_J < 1, so that the chain of inequalities can be continued:
    |H| e ≤ (I + (κ_J − 1)I) e = κ_J e.
But this means
    κ_H = lub_∞(H) ≤ κ_J = lub_∞(J) < 1,
as was to be shown.    □
Since lub_∞(H) ≥ ρ(H) and lub_∞(J) ≥ ρ(J), this theorem may suggest the conjecture that under the assumptions of the theorem also ρ(H) ≤ ρ(J) < 1, i.e., in view of Theorem (8.2.4), that the Gauss-Seidel method converges at least as fast as the Jacobi method. This, however, as examples show, is not true in general, but only under further assumptions on A. Thus, e.g., the following theorem is valid, which we state without proof [for a proof, see Varga (2000)].
(8.2.14) Theorem (Stein, Rosenberg). If the matrix J = L + U ≥ 0 is nonnegative, then for J and H = (I − L)⁻¹U precisely one of the following relations holds:
(1) ρ(H) = ρ(J) = 0,
(2) 0 < ρ(H) < ρ(J) < 1,
(3) ρ(H) = ρ(J) = 1,
(4) ρ(H) > ρ(J) > 1.
The assumption J ≥ 0 is satisfied in particular [see (8.1.5), (8.1.6)] if the matrix A has positive diagonal elements and nonpositive off-diagonal elements: a_ii > 0, a_ik ≤ 0 for i ≠ k. Since this condition happens to be satisfied in almost all systems of linear equations which are obtained by difference approximations to linear differential operators [cf., e.g., Section 8.4], this theorem provides in many practical cases the significant information that the Gauss-Seidel method converges faster than the Jacobi method, if one of the two converges at all.
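These relations are easy to check numerically. The following sketch (not from the original text; the 3 × 3 matrix is an illustrative choice) computes ρ(J) and ρ(H) for a strictly diagonally dominant matrix with positive diagonal and nonpositive off-diagonal elements:

    import numpy as np

    A = np.array([[ 4., -1.,  0.],
                  [-1.,  4., -1.],
                  [ 0., -1.,  4.]])
    D = np.diag(np.diag(A))
    E = -np.tril(A, -1)                      # A = D - E - F as in (8.1.5)
    F = -np.triu(A,  1)
    L = np.linalg.solve(D, E)                # L = D^{-1} E
    U = np.linalg.solve(D, F)                # U = D^{-1} F
    J = L + U
    H = np.linalg.solve(np.eye(3) - L, U)    # H = (I - L)^{-1} U
    rho = lambda M: max(abs(np.linalg.eigvals(M)))
    print(rho(J), rho(H))                    # 0 < rho(H) = rho(J)**2 < rho(J) < 1

The output illustrates case (2) of Theorem (8.2.14); that ρ(H) is here exactly the square of ρ(J) is explained by Corollary (8.3.16) below.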
8.3 Relaxation Methods
The results of the previous section suggest looking for simple matrices B
for which the corresponding iterative method (8.1.3) converges perhaps
still faster than the Gauss-Seidel method, ρ(I − B⁻¹A) < ρ(H). More generally, one can consider classes of suitable matrices B(ω) depending on a parameter ω and try to choose the parameter ω in an "optimal" way, i.e., so that ρ(I − B(ω)⁻¹A) as a function of ω becomes as small as possible. In the relaxation methods (SOR methods) one studies the following class of matrices B(ω):
(8.3.1)    B(ω) = (1/ω) D(I − ωL).
Here again we use the notation (8.1.5)-(8.1.6). This choice is obtained
through the following considerations.
Suppose for the (i + 1)st approximation x^(i+1) we already know the components x_k^(i+1), k = 1, 2, …, j − 1. As in the Gauss-Seidel method (8.1.8), we then define an auxiliary quantity x̃_j^(i+1) by
(8.3.2)    a_jj x̃_j^(i+1) = −Σ_{k<j} a_jk x_k^(i+1) − Σ_{k>j} a_jk x_k^(i) + b_j,    1 ≤ j ≤ n,    i ≥ 0,
whereupon x_j^(i+1) is determined through a certain averaging of x_j^(i) and x̃_j^(i+1), viz.
(8.3.3)    x_j^(i+1) := (1 − ω) x_j^(i) + ω x̃_j^(i+1) = x_j^(i) + ω( x̃_j^(i+1) − x_j^(i) ).
Eliminating the auxiliary quantity x̃_j^(i+1) from (8.3.3) by means of (8.3.2), one obtains
    a_jj x_j^(i+1) = a_jj x_j^(i) + ω[ −Σ_{k<j} a_jk x_k^(i+1) − a_jj x_j^(i) − Σ_{k>j} a_jk x_k^(i) + b_j ],    1 ≤ j ≤ n,    i ≥ 0.
In matrix notation this is equivalent to
    B(ω) x^(i+1) = (B(ω) − A) x^(i) + b,
where B(ω) is defined by (8.3.1) and
    B(ω) − A = (1/ω) D((1 − ω)I + ωU).
w
For this method the rate of convergence, therefore, is determined by the
spectral radius p( H (w )) of the matrix
(8.3.4)
H(w) := 1- B(W)-1 A = (I - wL)-1 [(1 - w)1 + wUj.
One calls w the relaxation parameter and speaks of overrelaxation if
w > 1 and underrelaxation if w < 1. For w = lone exactly recovers the
Gauss-Seidel method.
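Written componentwise, one relaxation sweep (8.3.2)-(8.3.3) reads as follows; the Python sketch is illustrative and not part of the original text (ω = 1 reproduces the Gauss-Seidel method, a dense array A is assumed for simplicity):

    import numpy as np

    def sor_step(A, b, x, omega):
        """One relaxation sweep per (8.3.2)-(8.3.3); omega = 1 is Gauss-Seidel."""
        n = len(b)
        x = x.copy()
        for j in range(n):
            # Gauss-Seidel auxiliary value, using already updated components k < j
            x_tilde = (b[j] - A[j, :j] @ x[:j] - A[j, j+1:] @ x[j+1:]) / A[j, j]
            x[j] = (1 - omega) * x[j] + omega * x_tilde     # averaging (8.3.3)
        return x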
We begin by listing, in part without proof, a few qualitative results about ρ(H(ω)). The following theorem shows that in relaxation methods only parameters ω with 0 < ω < 2, at best, lead to convergent methods:
(8.3.5) Theorem (Kahan). For arbitrary matrices A one has
    ρ(H(ω)) ≥ |ω − 1|    for all ω.
PROOF. I − ωL is a lower triangular matrix with 1 as diagonal elements, so that det(I − ωL) = 1 for all ω. For the characteristic polynomial φ(λ) of H(ω) it follows that
    φ(λ) = det(λI − H(ω)) = det((I − ωL)(λI − H(ω))) = det((λ + ω − 1)I − ωλL − ωU).
Up to a sign, the constant term φ(0) of φ(·) is equal to the product of the eigenvalues λ_i(H(ω)) of H(ω):
    (−1)ⁿ Π_{i=1}^{n} λ_i(H(ω)) = φ(0) = det((ω − 1)I − ωU) = (ω − 1)ⁿ.
It follows immediately that ρ(H(ω)) = max_i |λ_i(H(ω))| ≥ |ω − 1|.    □
For matrices A with L ≥ 0, U ≥ 0, only overrelaxation can give faster convergence than the Gauss-Seidel method:
(8.3.6) Theorem. If the matrix A is irreducible and J = L + U ≥ 0, and if the Jacobi method converges, ρ(J) < 1, then the function ρ(H(ω)) is strictly decreasing on the interval 0 < ω ≤ ω̄ for some ω̄ ≥ 1.
For a proof, see Varga (1962) and Householder (1964).
One further shows:
(8.3.7) Theorem (Ostrowski, Reich). For positive definite matrices A one has
    ρ(H(ω)) < 1    for all 0 < ω < 2.
In particular, the Gauss-Seidel method (ω = 1) converges for positive definite matrices.
PROOF. Let 0 < ω < 2, and let A be positive definite. Then F = Eᴴ in the decomposition (8.1.5), A = D − E − F, of A = Aᴴ. For the matrix B = B(ω) in (8.3.1) one has B = (1/ω)D − E, and the matrix
    B + Bᴴ − A = (1/ω)D − E + (1/ω)D − F − (D − E − F) = ((2/ω) − 1) D
is positive definite, since the diagonal elements of the positive definite matrix A are positive [Theorem (4.3.2)] and (2/ω) − 1 > 0.
We first show that the eigenvalues λ of A⁻¹(2B − A) all lie in the interior of the right half plane, Re λ > 0. Indeed, if x is an eigenvector for λ, then
    A⁻¹(2B − A)x = λx,    xᴴ(2B − A)x = λ xᴴAx.
Taking the complex conjugate of the last relation gives, because A = Aᴴ,
    xᴴ(2Bᴴ − A)x = λ̄ xᴴAx.
By addition, it follows that
    xᴴ(B + Bᴴ − A)x = (Re λ) xᴴAx.
But now, A and B + Bᴴ − A are positive definite and thus Re λ > 0. For the matrix Q := A⁻¹(2B − A) = 2A⁻¹B − I one has
    (Q − I)(Q + I)⁻¹ = I − B⁻¹A = H(ω).
[Observe that B is a nonsingular triangular matrix; therefore B⁻¹ and thus (Q + I)⁻¹ exist.] If μ is an eigenvalue of H(ω) and z a corresponding eigenvector, then from
    (Q − I)(Q + I)⁻¹ z = H(ω) z = μz
it follows, for the vector y := (Q + I)⁻¹ z ≠ 0, that
    (Q − I)y = μ(Q + I)y,    (1 − μ)Qy = (1 + μ)y.
Since y ≠ 0, we must have μ ≠ 1, and one finally obtains
    Qy = ((1 + μ)/(1 − μ)) y,
i.e., λ = (1 + μ)/(1 − μ) is an eigenvalue of Q = A⁻¹(2B − A). Hence, μ = (λ − 1)/(λ + 1). For |μ|² = μμ̄ one obtains
    |μ|² = (|λ|² + 1 − 2 Re λ) / (|λ|² + 1 + 2 Re λ),
and since Re λ > 0 for 0 < ω < 2,
    |μ| < 1,    i.e.,    ρ(H(ω)) < 1.    □
For an important class of matrices the more qualitative assertions of
Theorems (8.3.5)-(8.3.7) can be considerably sharpened. This is the class
of matrices with property A introduced by Young [see, e.g., Young (1971)],
or its generalization due to Varga (1962), the class of consistently ordered
matrices [see (8.3.10)].
(8.3.8) Definition. The matrix A has property A if there exists a permutation matrix P such that PAPᵀ has the form
    PAPᵀ = [ D_1  M_1 ; M_2  D_2 ],    D_1, D_2 diagonal matrices.
The most important fact about matrices with property A is given in
the following theorem:
(8.3.9) Theorem. For every n × n matrix with property A and a_ii ≠ 0, i = 1, …, n, there exists a permutation matrix P such that the decomposition (8.1.5), (8.1.6), A = D(I − L − U), of the permuted matrix A := PAPᵀ has the following property: The eigenvalues of the matrices
    J(α) := αL + (1/α)U,    α ∈ ℂ,    α ≠ 0,
are independent of α.
PROOF. By Definition (8.3.8) there exists a permutation P such that
    PAPᵀ = [ D_1  M_1 ; M_2  D_2 ] = D(I − L − U),
with
    D := [ D_1  0 ; 0  D_2 ],    L = −[ 0  0 ; D_2⁻¹M_2  0 ],    U = −[ 0  D_1⁻¹M_1 ; 0  0 ].
Here D_1 and D_2 are nonsingular diagonal matrices. For α ≠ 0, one now has
    J(α) = αL + (1/α)U = −[ 0  α⁻¹D_1⁻¹M_1 ; αD_2⁻¹M_2  0 ] = −S_α [ 0  D_1⁻¹M_1 ; D_2⁻¹M_2  0 ] S_α⁻¹ = S_α J(1) S_α⁻¹
with the nonsingular diagonal matrix
    S_α := [ I_1  0 ; 0  αI_2 ],    I_1, I_2 unit matrices.
The matrices J(α) and J(1) are similar, and hence have the same eigenvalues.    □
Following Varga (1962), a matrix A which, relative to the decomposition (8.1.5), (8.1.6), A = D(I − L − U), has the property that the eigenvalues of the matrices
    J(α) := αL + (1/α)U
for α ≠ 0 are independent of α, is called
(8.3.10)    consistently ordered.
Theorem (8.3.9) asserts that matrices with property A can be ordered consistently, i.e., the rows and columns of A can be rearranged by a permutation P such that there results a consistently ordered matrix PAPᵀ. Consistently ordered matrices A, however, need not at all have the form
    [ D_1  M_1 ; M_2  D_2 ],    D_1, D_2 diagonal matrices.
This is shown by the important example of block tridiagonal matrices A, which have the form
    A = [ D_1   A_12                    ;
          A_21  D_2   A_23              ;
                ⋱     ⋱      ⋱          ;
                      A_{N,N−1}   D_N ],    D_i diagonal matrices.
If all D_i are nonsingular, then the matrices J(α) = αL + (1/α)U obey the relation
    J(α) = S_α J(1) S_α⁻¹,    S_α := diag(I, αI, α²I, …, α^{N−1} I),
that is, A is consistently ordered. In addition, block tridiagonal matrices have property A; we show this only for a special 3 × 3 block matrix, and in the general case one proceeds analogously.
We point out, however, that there are consistently ordered matrices
which do not have property A. This is shown by the example
For irreducible n × n matrices A with nonvanishing diagonal elements a_ii ≠ 0 and decomposition A = D(I − L − U) it is often easy to find out whether or not A has property A by considering the graph G(J) associated with the matrix J = L + U. For this one examines the lengths s_1^(i), s_2^(i), … of all closed oriented paths (oriented cycles)
    P_i → P_{k_1} → P_{k_2} → ⋯ → P_{k_s} = P_i
in G(J) which lead from P_i back to P_i. Denoting by l_i the greatest common divisor of the s_1^(i), s_2^(i), …,
    l_i := gcd(s_1^(i), s_2^(i), …),
the graph G(J) is called 2-cyclic if l_1 = l_2 = ⋯ = l_n = 2, and weakly 2-cyclic if all l_i are even.
The following theorem then holds, which we state without proof.
(8.3.11) Theorem. An irreducible matrix A has property A if and only if
G( J) is weakly 2-cyclic.
EXAMPLE. To the matrix
    A = [  4  −1   0  −1 ;
          −1   4  −1   0 ;
           0  −1   4  −1 ;
          −1   0  −1   4 ]
there belongs the matrix
    J := (1/4) [ 0  1  0  1 ;
                 1  0  1  0 ;
                 0  1  0  1 ;
                 1  0  1  0 ]
with the graph G(J) consisting of the cycle P_1 → P_2 → P_3 → P_4 → P_1 together with the reverse arcs.
G(J) is connected, so that J, and thus also A, is irreducible (see Section 8.2).
Since G(J) is evidently 2-cyclic, A has property A.
The significance of consistently ordered matrices [and therefore, by (8.3.9), indirectly also of matrices with property A] lies in the fact that one can explicitly show how the eigenvalues μ of J = L + U are related to the eigenvalues λ = λ(ω) of H(ω) = (I − ωL)⁻¹((1 − ω)I + ωU):
(8.3.12) Theorem (Young, Varga). Let A be a consistently ordered matrix (8.3.10) and ω ≠ 0. Then:
(a) With μ, also −μ is an eigenvalue of J = L + U.
(b) If μ is an eigenvalue of J and
(8.3.13)    (λ + ω − 1)² = λ ω² μ²,
then λ is an eigenvalue of H(ω).
(c) If λ ≠ 0 is an eigenvalue of H(ω) and (8.3.13) holds, then μ is an eigenvalue of J.
PROOF. (a): Since A is consistently ordered, the matrix J(−1) = −L − U = −J has the same eigenvalues as J(1) = J = L + U.
(b): Because det(I − ωL) = 1 for all ω, one has
(8.3.14)    det(λI − H(ω)) = det[(I − ωL)(λI − H(ω))] = det[λI − λωL − (1 − ω)I − ωU] = det((λ + ω − 1)I − λωL − ωU).
Now let μ be an eigenvalue of J = L + U and λ a solution of (8.3.13). Then λ + ω − 1 = √λ ωμ or λ + ω − 1 = −√λ ωμ. Because of (a) we can assume without loss of generality that
    λ + ω − 1 = √λ ωμ.
If λ = 0, then ω = 1, so that by (8.3.14),
    det(0 · I − H(1)) = det(−ωU) = 0,
i.e., λ is an eigenvalue of H(ω). If λ ≠ 0 it follows from (8.3.14) that
(8.3.15)    det(λI − H(ω)) = det[(λ + ω − 1)I − √λ ω(√λ L + (1/√λ)U)]
                            = (√λ ω)ⁿ det[μI − (√λ L + (1/√λ)U)]
                            = (√λ ω)ⁿ det(μI − (L + U)) = 0,
since the matrix J(√λ) = √λ L + (1/√λ)U has the same eigenvalues as J = L + U and μ is an eigenvalue of J. Therefore, det(λI − H(ω)) = 0, and λ is an eigenvalue of H(ω).
(c): Now, conversely, let λ ≠ 0 be an eigenvalue of H(ω), and μ a number satisfying (8.3.13), i.e., with λ + ω − 1 = ±ω√λ μ. In view of (a) it suffices to show that the number μ with λ + ω − 1 = ω√λ μ is an eigenvalue of J. This, however, follows immediately from (8.3.15).    □
As a side result the following is obtained for ω = 1:
(8.3.16) Corollary. Let A be a consistently ordered matrix (8.3.10). Then for the matrix H = H(1) = (I − L)⁻¹U of the Gauss-Seidel method one has
    ρ(H) = [ρ(J)]².
In view of Theorem (8.2.4) this means that with the Jacobi method one has to carry out about twice as many iteration steps as with the Gauss-Seidel method, to achieve the same accuracy.
We now wish to exhibit in an important special case the optimal relaxation parameter ω_b, characterized by [see Theorem (8.3.5)]
    ρ(H(ω_b)) = min_{ω∈ℝ} ρ(H(ω)) = min_{0<ω<2} ρ(H(ω)).
(8.3.17) Theorem (Young, Varga). Let A be a consistently ordered matrix. Let the eigenvalues of J be real and ρ(J) < 1. Then
    ω_b = 2 / (1 + √(1 − ρ(J)²)),    ρ(H(ω_b)) = ω_b − 1,
and
(8.3.18)    ρ(H(ω)) = { 1 − ω + ½ω²μ² + ωμ √(1 − ω + ¼ω²μ²)   for 0 < ω ≤ ω_b,
                        ω − 1                                     for ω_b ≤ ω ≤ 2,
where the abbreviation μ := ρ(J) is used (see Figure 32).

Fig. 32. Spectral radius ρ(H(ω)).
Note that the left-hand differential quotient of ρ(H(ω)) at ω_b is "−∞". One should therefore preferably take as relaxation parameter ω a number that is slightly too large, rather than one that is too small, if ω_b is not known exactly.
PROOF. The eigenvalues μ_i of the matrix J, by assumption, are real, and
    −ρ(J) ≤ μ_i ≤ ρ(J) < 1.
For fixed ω ∈ (0, 2) [by Theorem (8.3.5) it suffices to consider this domain] to each μ_i there belong two eigenvalues λ_i^(1)(ω), λ_i^(2)(ω) of H(ω), which are obtained by solving the quadratic equation (8.3.13) in λ. Geometrically, λ_i^(1)(ω), λ_i^(2)(ω) are obtained as the abscissae of the points of intersection of the straight line
    g_ω(λ) := (λ + ω − 1)/ω
with the parabola m_i(λ) := ±√λ μ_i (see Figure 33). The line g_ω has the slope 1/ω and passes through the point (1, 1). If it does not intersect the parabola m_i(λ), then λ_i^(1)(ω), λ_i^(2)(ω) are conjugate complex numbers with modulus |ω − 1|, as one finds immediately from (8.3.13). Evidently,
    ρ(H(ω)) = max_i ( |λ_i^(1)(ω)|, |λ_i^(2)(ω)| ) = max( |λ^(1)(ω)|, |λ^(2)(ω)| ),
the λ^(1)(ω), λ^(2)(ω) being obtained by intersecting g_ω(λ) with m(λ) := ±√λ μ, where μ = ρ(J) = max_i |μ_i|. By solving the quadratic equation (8.3.13), with μ = ρ(J), one verifies (8.3.18) immediately, and thus also the remaining assertions of the theorem.    □
Fig. 33. Determination of ω_b.
8.4 Applications to Difference Methods-An Example
In order to illustrate how the iterative methods described can be applied,
we consider the Dirichlet boundary-value problem
(8.4.1)    −u_xx − u_yy = f(x, y)    for 0 < x, y < 1,
           u(x, y) = 0    for (x, y) ∈ ∂Ω,
for the unit square Ω := {(x, y) | 0 < x, y < 1} ⊂ ℝ² with boundary ∂Ω [cf. Section 7.7]. We assume f(x, y) continuous on Ω ∪ ∂Ω. Since the various methods for the solution of boundary-value problems are usually compared on this problem, (8.4.1) is also called the model problem. To solve (8.4.1) by means of a difference method, one replaces the differential operator by a difference operator, as described in Section 7.4 for boundary-value problems in ordinary differential equations. One covers Ω ∪ ∂Ω with a grid Ω_h ∪ ∂Ω_h:
    Ω_h := {(x_i, y_j) | i, j = 1, 2, …, N},
    ∂Ω_h := {(x_i, 0), (x_i, 1), (0, y_j), (1, y_j) | i, j = 0, 1, …, N + 1},
where, for abbreviation, we put [see Figure 34]
    x_i := ih,    y_j := jh,    h := 1/(N + 1),    i, j = 0, 1, …, N + 1,
N ≥ 1 an integer.
Fig. 34. The grid Ω_h.
With the further abbreviation
    u_ij := u(x_i, y_j),    i, j = 0, 1, …, N + 1,
the differential operator −∂²/∂x² − ∂²/∂y² can be replaced for all (x_i, y_j) ∈ Ω_h by the difference operator
(8.4.2)    [4u_ij − u_{i−1,j} − u_{i+1,j} − u_{i,j−1} − u_{i,j+1}] / h²
up to an error τ_ij. The unknowns u_ij, 1 ≤ i, j ≤ N [because of the boundary conditions, the u_ij = 0 are known for (x_i, y_j) ∈ ∂Ω_h] therefore obey a system of linear equations of the form
(8.4.3)    4u_ij − u_{i−1,j} − u_{i+1,j} − u_{i,j−1} − u_{i,j+1} = h² f_ij + h² τ_ij,    (x_i, y_j) ∈ Ω_h,
with f_ij := f(x_i, y_j). Here the errors τ_ij of course depend on the mesh size h. Under appropriate assumptions on the exact solution u, one shows as in Section 7.4 that τ_ij = O(h²). For sufficiently small h one can thus expect that the solution z_ij, i, j = 1, …, N, of the linear system of equations
(8.4.4)    4z_ij − z_{i−1,j} − z_{i+1,j} − z_{i,j−1} − z_{i,j+1} = h² f_ij,    i, j = 1, …, N,
           z_{0j} = z_{N+1,j} = z_{i0} = z_{i,N+1} = 0    for i, j = 0, 1, …, N + 1,
obtained from (8.4.3) by omitting the error τ_ij, agrees approximately with the u_ij. To every grid point (x_i, y_j) of Ω_h there belongs exactly one component z_ij of the solution of (8.4.4). Collecting the N² unknowns z_ij and the right-hand sides h² f_ij row-wise [see Figure 34] into vectors
    z := (z_11, z_21, …, z_N1, z_12, …, z_N2, …, z_1N, …, z_NN)ᵀ,
    b := h² (f_11, …, f_N1, …, f_1N, …, f_NN)ᵀ,
then (8.4.4) is equivalent to a system of linear equations of the form
    Az = b,
with the N² × N² matrix
(8.4.5)    A = [ A_11  A_12                      ;
                A_21  A_22  A_23                 ;
                      ⋱     ⋱       ⋱            ;
                            A_{N,N−1}   A_{NN} ],
a block tridiagonal matrix whose N × N blocks are
    A_jj = [  4  −1          ;
             −1   4   ⋱      ;
                  ⋱   ⋱  −1  ;
                     −1   4 ],    A_{j,j+1} = A_{j+1,j} = −I_N.
A is partitioned in a natural way into blocks A_ij of order N, which are induced by the partitioning of the points (x_i, y_j) of Ω_h into horizontal rows (x_1, y_j), (x_2, y_j), …, (x_N, y_j).
The matrix A is quite sparse. In each row, at most five elements are different from zero. For this reason, in the execution of an iteration step z^(i) → z^(i+1) of, say, the Jacobi method (8.1.7) or the Gauss-Seidel method (8.1.8), one requires only about 5N² operations (1 operation = 1 multiplication or division + 1 addition). (If the matrix A were dense, N⁴ operations per iteration step would be required.) Compare this expenditure with that of a direct method for solving Az = b. If we were to compute a triangular decomposition A = LLᵀ of A, say with the Choleski method (below we will see that A is positive definite), then L would be an N² × N² lower triangular band matrix of bandwidth N + 1. The computation of L alone requires approximately ½N⁴ operations. Since the Jacobi method, e.g., requires approximately (N + 1)² iterations (overrelaxation method: N + 1 iterations) [see (8.4.9)] in order to obtain a result accurate to 2 decimal places, the Choleski method would be less expensive than the Jacobi method. The main difficulty with the Choleski method, however, lies in its storage requirement: the storage of the approximately N³ nonvanishing elements of L requires too much space for larger values of N, say N ≥ 100. Here lie the advantages of iterative methods: their storage requirement is only of the order of magnitude N².
To the Jacobi method belongs the matrix
    J = L + U = ¼(4I − A).
The associated graph G(J) (for N = 3) is connected and 2-cyclic. A is therefore irreducible [see Section 8.2] and has property A [Theorem (8.3.11)]. One easily sees, in addition, that A is already consistently ordered.
The eigenvalues and eigenvectors of J = L + U can be determined explicitly. One verifies at once, by substitution, that the N² vectors z^(k,l), k, l = 1, 2, …, N, with components
    z_ij^(k,l) := sin(kπi/(N + 1)) · sin(lπj/(N + 1)),    1 ≤ i, j ≤ N,
satisfy
    J z^(k,l) = μ^(k,l) z^(k,l)
with
    μ^(k,l) := ½ (cos(kπ/(N + 1)) + cos(lπ/(N + 1))),    1 ≤ k, l ≤ N.
J thus has the eigenvalues μ^(k,l), 1 ≤ k, l ≤ N. The spectral radius of J, therefore, is
(8.4.6)    ρ(J) = max_{k,l} |μ^(k,l)| = cos(π/(N + 1)).
To the Gauss-Seidel method belongs the matrix H = (I − L)⁻¹U with the spectral radius [see (8.3.16)]
    ρ(H) = ρ(J)² = cos²(π/(N + 1)).
According to Theorem (8.3.17) the optimal relaxation parameter ω_b and ρ(H(ω_b)) are given by
(8.4.7)    ω_b = 2 / (1 + √(1 − cos²(π/(N + 1)))) = 2 / (1 + sin(π/(N + 1))),
           ρ(H(ω_b)) = cos²(π/(N + 1)) / (1 + sin(π/(N + 1)))².
The number κ = κ(N) with ρ(J)^κ = ρ(H(ω_b)) indicates how many steps of the Jacobi method produce the same error reduction as one step of the optimal relaxation method. Clearly,
    κ = ln ρ(H(ω_b)) / ln ρ(J).
Now, for small z one has ln(1 + z) = z − z²/2 + O(z³), and for large N
    cos(π/(N + 1)) = 1 − π²/(2(N + 1)²) + O(1/N⁴),
so that
    ln ρ(J) = −π²/(2(N + 1)²) + O(1/N⁴).
Likewise,
    ln ρ(H(ω_b)) = 2 [ ln ρ(J) − ln(1 + sin(π/(N + 1))) ]
                 = 2 [ −π²/(2(N + 1)²) − π/(N + 1) + π²/(2(N + 1)²) + O(1/N³) ]
                 = −2π/(N + 1) + O(1/N³),
so that finally, asymptotically for large N,
(8.4.8)    κ = κ(N) ≈ 4(N + 1)/π,
i.e., the optimal relaxation method is more than N times as fast as the
Jacobi method. The quantities
\[
(8.4.9)\qquad
R_J := \frac{\ln(10)}{-\ln \rho(J)} \approx 0.467\,(N+1)^2,\qquad
R_{GS} := \tfrac{1}{2}R_J \approx 0.234\,(N+1)^2,\qquad
R_{\omega_b} := \frac{\ln(10)}{-\ln \rho(H(\omega_b))} \approx 0.367\,(N+1)
\]
indicate the number of iterations required in the Jacobi method, the Gauss-Seidel method, and the optimal relaxation method, respectively, in order
to reduce the error by a factor of 1/10.
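The counts in (8.4.9) are easy to check numerically. The following short Python snippet (added here as an illustration, not part of the text) evaluates $\rho(J)$, $\omega_b$, $\rho(H(\omega_b))$ and the three iteration counts from the formulas (8.4.6)-(8.4.9) for a given N and compares them with the stated asymptotic values.

import math

N = 50
theta = math.pi / (N + 1)
rho_J = math.cos(theta)                          # (8.4.6)
omega_b = 2.0 / (1.0 + math.sin(theta))          # (8.4.7)
rho_SOR = math.cos(theta)**2 / (1.0 + math.sin(theta))**2

R_J = math.log(10) / (-math.log(rho_J))          # Jacobi iterations per factor 1/10
R_GS = 0.5 * R_J                                 # Gauss-Seidel
R_SOR = math.log(10) / (-math.log(rho_SOR))      # optimal relaxation

print(R_J,  0.467 * (N + 1)**2)                  # the two numbers nearly agree
print(R_GS, 0.234 * (N + 1)**2)
print(R_SOR, 0.367 * (N + 1))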
Finally, we derive an explicit formula for the condition number $\operatorname{cond}_2(A)$
of A with respect to the Euclidean norm, which will be important for the
speed of convergence of the conjugate-gradient method in Section 8.7.1 [see
(8.7.1.9)]: Since $A = 4(I - J)$ and J has the eigenvalues
\[
\mu^{(k,l)} = \tfrac{1}{2}\left(\cos\frac{k\pi}{N+1} + \cos\frac{l\pi}{N+1}\right),\qquad 1 \le k,l \le N,
\]
it follows
\[
\lambda_{\max}(A) = 4\left(1 + \cos\frac{\pi}{N+1}\right),\qquad
\lambda_{\min}(A) = 4\left(1 - \cos\frac{\pi}{N+1}\right).
\]
This implies for $\operatorname{cond}_2(A) := \lambda_{\max}(A)/\lambda_{\min}(A)$
\[
(8.4.10)\qquad \operatorname{cond}_2(A) =
\frac{\cos^2\dfrac{\pi}{2(N+1)}}{\sin^2\dfrac{\pi}{2(N+1)}}
\approx \frac{4}{\pi^2}(N+1)^2 = \frac{4}{\pi^2 h^2}\,.
\]
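As a quick numerical check of (8.4.10), not contained in the text itself, one can compare the exact ratio $\lambda_{\max}(A)/\lambda_{\min}(A)$ with the asymptotic value $4/(\pi^2 h^2)$; already for moderate N the two values differ by well under one percent.

import math

N = 50
h = 1.0 / (N + 1)
theta = math.pi / (N + 1)
cond = (1.0 + math.cos(theta)) / (1.0 - math.cos(theta))   # lambda_max / lambda_min
print(cond, 4.0 / (math.pi**2 * h**2))                     # nearly equal for N = 50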
8.5 Block Iterative Methods
As the example of the previous section shows, the matrices A arising in
the application of difference methods to partial differential equations often
exhibit a natural block structure,
\[
A = \begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1N} \\
A_{21} & A_{22} & \cdots & A_{2N} \\
\vdots & & & \vdots \\
A_{N1} & A_{N2} & \cdots & A_{NN}
\end{pmatrix},
\]
where the $A_{ii}$ are square matrices. If now, in addition, all $A_{ii}$ are nonsingular,
it seems natural to introduce block iterative methods relative to the given
partition $\pi$ of A, in the following way: In analogy to the decomposition
(8.1.5) of A one defines, relative to $\pi$, the decomposition
\[
A = D_\pi - E_\pi - F_\pi,\qquad L_\pi := D_\pi^{-1}E_\pi,\qquad U_\pi := D_\pi^{-1}F_\pi
\]
with
\[
(8.5.1)\qquad
D_\pi := \begin{pmatrix} A_{11} & & \\ & \ddots & \\ & & A_{NN} \end{pmatrix},\quad
E_\pi := -\begin{pmatrix} 0 & & \\ A_{21} & \ddots & \\ A_{N1} & \cdots & 0 \end{pmatrix},\quad
F_\pi := -\begin{pmatrix} 0 & \cdots & A_{1N} \\ & \ddots & \vdots \\ & & 0 \end{pmatrix},
\]
i.e., $-E_\pi$ is the strictly lower and $-F_\pi$ the strictly upper block triangular part of A.
One obtains the block Jacobi method (block total-step method) for the solution of $Ax = b$ by choosing in (8.1.3), analogously to (8.1.7), $B := D_\pi$.
One thus obtains the iteration algorithm
\[
D_\pi x^{(i+1)} = b + (E_\pi + F_\pi)x^{(i)},
\]
or, written out in full,
\[
(8.5.2)\qquad
A_{jj}x_j^{(i+1)} = b_j - \sum_{\substack{k=1\\ k\ne j}}^{N} A_{jk}x_k^{(i)},\qquad
j = 1, 2, \ldots, N,\quad i = 0, 1, \ldots.
\]
Here, the vectors $x^{(i)}$, b are of course partitioned similarly to A. In each
step $x^{(i)} \to x^{(i+1)}$ of the method we now must solve N systems of linear
equations of the form $A_{jj}z = y$, $j = 1, 2, \ldots, N$. This is accomplished by
first obtaining, by the methods of Section 4, a triangular decomposition (or
a Choleski decomposition, if appropriate) $A_{jj} = L_jR_j$ of the $A_{jj}$, and then
reducing $A_{jj}z = y$ to the solution of the two triangular systems of equations
$L_j u = y$, $R_j z = u$.
For the efficiency of the method it is essential that the $A_{jj}$ be simply
structured matrices for which the triangular decomposition is easily carried
out. This is the case, e.g., for the matrix A in (8.4.5) of the model problem.
Here, the $A_{jj}$ are positive definite tridiagonal $N \times N$ matrices
\[
A_{jj} = \begin{pmatrix}
4 & -1 & & \\
-1 & 4 & \ddots & \\
& \ddots & \ddots & -1 \\
& & -1 & 4
\end{pmatrix}
\]
whose Choleski decomposition requires a very modest amount of work
(number of operations proportional to N).
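A minimal sketch of one block Jacobi step (8.5.2) for the model problem follows (added for illustration, not from the text; it uses SciPy's banded solver solve_banded in place of the Choleski factorizations discussed above). For the model problem the diagonal blocks are the tridiagonal matrices $A_{jj}$ shown above, and the only nonzero off-diagonal blocks of (8.4.5) are $A_{j,j\pm1} = -I$, so each of the N block solves costs only $O(N)$ operations.

import numpy as np
from scipy.linalg import solve_banded

def block_jacobi_step(X, B):
    # One block (line) Jacobi step for the model problem.
    # Row j of X holds the subvector x_j^(i); row j of B holds b_j.
    N = X.shape[0]
    ab = np.zeros((3, N))        # banded storage of A_jj = tridiag(-1, 4, -1)
    ab[0, 1:] = -1.0             # superdiagonal
    ab[1, :]  = 4.0              # diagonal
    ab[2, :-1] = -1.0            # subdiagonal
    X_new = np.empty_like(X)
    for j in range(N):
        rhs = B[j].copy()        # b_j - sum_{k != j} A_jk x_k^(i)
        if j > 0:
            rhs += X[j - 1]      # since A_{j,j-1} = -I
        if j < N - 1:
            rhs += X[j + 1]      # since A_{j,j+1} = -I
        X_new[j] = solve_banded((1, 1), ab, rhs)   # tridiagonal solve, O(N) work
    return X_new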
The rate of convergence of (8.5.2) of course is now determined by the
spectral radius $\rho(J_\pi)$ of the matrix
\[
J_\pi := I - D_\pi^{-1}A = L_\pi + U_\pi.
\]
In the same way one can define, analogously to (8.1.8), a block Gauss-Seidel method (block single-step method), through the choice $B := D_\pi - E_\pi$,
or, explicitly,
\[
(8.5.3)\qquad
A_{jj}x_j^{(i+1)} = b_j - \sum_{k<j} A_{jk}x_k^{(i+1)} - \sum_{k>j} A_{jk}x_k^{(i)},\qquad
j = 1, 2, \ldots, N,\quad i = 0, 1, \ldots.
\]
Again, systems of equations with the matrices $A_{jj}$ need to be solved in
each iteration step.
As in Section 8.3, one can also introduce block relaxation methods
through the choice $B := B(\omega) = (1/\omega)D_\pi(I - \omega L_\pi)$. Explicitly [cf. (8.3.3)],
let $\tilde{x}_j^{(i+1)}$ be the solution of (8.5.3); then
\[
x_j^{(i+1)} = x_j^{(i)} + \omega\bigl(\tilde{x}_j^{(i+1)} - x_j^{(i)}\bigr),\qquad j = 1, 2, \ldots, N.
\]
Now, of course,
\[
H_\pi(\omega) := (I - \omega L_\pi)^{-1}\bigl((1-\omega)I + \omega U_\pi\bigr).
\]
Also the theory of consistently ordered matrices of Section 8.3 carries over,
if one defines A as consistently ordered whenever the eigenvalues of the
matrices
\[
J_\pi(\alpha) := \alpha L_\pi + \alpha^{-1}U_\pi,\qquad \alpha \ne 0,
\]
are independent of $\alpha$. Optimal relaxation factors are determined as in Theorem (8.3.17) with the help of $\rho(J_\pi)$.
One expects intuitively that the block methods will converge faster
with increasing coarseness of the block partition $\pi$ of A. This indeed can
be shown under fairly general assumptions on A [see Varga (1962)]. For
the coarsest partition $\pi$ of A into a single block, e.g., the iterative method
"converges" after just one step. It is then equivalent to a direct method.
This example shows that arbitrarily coarse partitions are only of theoretical interest. The reduction in the number of iterations is compensated, to a
certain extent, by the larger computational work for each individual iteration step. For the most common partitions, in which A is block tridiagonal
and the diagonal blocks usually tridiagonal (this, e.g., is the case for the
model problem), one can show, however, that the computational work per
iteration involved in block methods is equal to that in ordinary methods.
In these cases, block methods bring real advantages. For the model problem [see Section 8.4], relative to the partition given in (8.4.5), the spectral
radius $\rho(J_\pi)$ can again be determined explicitly. One finds
\[
\rho(J_\pi) = \frac{\cos\dfrac{\pi}{N+1}}{2 - \cos\dfrac{\pi}{N+1}} < \rho(J).
\]
For the corresponding optimal block relaxation method one finds that, asymptotically for $N \to \infty$, the number of iterations is reduced by a factor $\sqrt{2}$ compared to the ordinary optimal relaxation method [proof: see Exercise 17].
8.6 The ADI Method of Peaceman and Rachford
Still faster convergence than in relaxation methods is obtained with the
ADI methods (alternating-direction implicit methods) for the iterative computation of the solution of a system of linear equations which arises in
difference methods. From amongst the many variants of this method we
describe here only the historically first method of this type, which is due to
Peaceman and Rachford [see Varga (1962) and Young (1971) for an exposition of further variants]. We illustrate the method for the boundary-value
problem
\[
(8.6.1)\qquad
\begin{aligned}
-u_{xx}(x,y) - u_{yy}(x,y) + \sigma\,u(x,y) &= f(x,y) &&\text{for } (x,y) \in \Omega,\\
u(x,y) &= 0 &&\text{for } (x,y) \in \partial\Omega,
\end{aligned}
\qquad \Omega := \{\,(x,y) \mid 0 < x, y < 1\,\},
\]
on the unit square, which on account of the term $\sigma u$ is only slightly more
general than the model problem (8.4.1). We assume in the following that $\sigma$
is constant and non-negative. Using the same discretization and the same
notation as in Section 8.4, in place of (8.4.4) one now obtains the system
of linear equations
(8.6.2)
4z ij - Zi~l,j - Zi+1,j - Zi,j-1 - Zi,j+1 + O'h 2 zij = h 2 fij, 1::; i,j::; H,
ZOj
= ZN+l,j = ZiQ = Zi,N+1 = 0 for 0::; i,j ::; H + 1
for the approximate values $z_{ij}$ of the values $u_{ij} = u(x_i, y_j)$ of the exact
solution. To the decomposition
\[
4z_{ij} - z_{i-1,j} - z_{i+1,j} - z_{i,j-1} - z_{i,j+1} + \sigma h^2 z_{ij}
= \bigl[2z_{ij} - z_{i-1,j} - z_{i+1,j}\bigr] + \bigl[2z_{ij} - z_{i,j-1} - z_{i,j+1}\bigr] + \bigl[\sigma h^2 z_{ij}\bigr]
\]
into variables which are located on the same horizontal and vertical line
[see Figure 34] of $\Omega_h$, respectively, there corresponds a decomposition of
the matrix A of the system of equations (8.6.2), $Az = b$, of the form
\[
A = H + V + E.
\]
Here, H, V, E are defined through their actions on a vector z:
\[
(8.6.3)\qquad
w_{ij} =
\begin{cases}
2z_{ij} - z_{i-1,j} - z_{i+1,j}, & \text{if } w = Hz,\\
2z_{ij} - z_{i,j-1} - z_{i,j+1}, & \text{if } w = Vz,\\
\sigma h^2 z_{ij}, & \text{if } w = Ez.
\end{cases}
\]
E is a diagonal matrix with nonnegative elements; H and V are both
symmetric and positive definite. It suffices to show this for H: If the
$z_{ij}$ are ordered in correspondence to the rows of $\Omega_h$ [see Figure 34],
$z = [z_{11}, z_{21}, \ldots, z_{N1}, \ldots, z_{1N}, z_{2N}, \ldots, z_{NN}]^T$, then
H is block diagonal, consisting of N copies of the $N \times N$ tridiagonal matrix
\[
\begin{pmatrix}
2 & -1 & & \\
-1 & 2 & \ddots & \\
& \ddots & \ddots & -1 \\
& & -1 & 2
\end{pmatrix}.
\]
But according to Theorem (7.4.7) these tridiagonal matrices,
and hence also H, are positive definite. For V one proceeds similarly.
Analogous decompositions of A are also obtained for boundary-value
problems which are considerably more general than (8.6.1).
In the ADI method of Peaceman and Rachford the system of equations
\[
Az = b,
\]
in accordance with the decomposition $A = H + V + E$, is now transformed
equivalently into
\[
\bigl(H + \tfrac{1}{2}E + rI\bigr)z = \bigl(rI - V - \tfrac{1}{2}E\bigr)z + b
\]
and also
\[
\bigl(V + \tfrac{1}{2}E + rI\bigr)z = \bigl(rI - H - \tfrac{1}{2}E\bigr)z + b.
\]
Here r is an arbitrary real number. With the abbreviations $H_1 := H + \tfrac{1}{2}E$,
$V_1 := V + \tfrac{1}{2}E$, one obtains the iteration algorithm of the ADI method,
\[
\begin{aligned}
(8.6.4\text{a})\qquad (H_1 + r_{i+1}I)\,z^{(i+1/2)} &= (r_{i+1}I - V_1)\,z^{(i)} + b,\\
(8.6.4\text{b})\qquad (V_1 + r_{i+1}I)\,z^{(i+1)} &= (r_{i+1}I - H_1)\,z^{(i+1/2)} + b.
\end{aligned}
\]
Given $z^{(i)}$, one first computes $z^{(i+1/2)}$ from the first of these equations, then
substitutes this value into the second and computes $z^{(i+1)}$. The quantity
$r_{i+1}$ is a real parameter which may be chosen differently from step to step.
With suitable ordering of the variables $z_{ij}$ the matrices $(H_1 + r_{i+1}I)$ and
$(V_1 + r_{i+1}I)$ are positive definite tridiagonal matrices (assuming $r_{i+1} \ge 0$),
so that the systems of equations (8.6.4a,b) can easily be solved for $z^{(i+1/2)}$
and $z^{(i+1)}$ via a Choleski decomposition of $H_1 + r_{i+1}I$ and $V_1 + r_{i+1}I$.
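The following Python sketch of one ADI double step (8.6.4a,b) for the model problem is added here for illustration and is not from the text; the names are ad hoc, and a general tridiagonal banded solver stands in for the Choleski decompositions mentioned above. Since $H_1$ couples unknowns that differ in the first index i and $V_1$ those that differ in the second index j, each half step decomposes into N independent tridiagonal solves.

import numpy as np
from scipy.linalg import solve_banded

def adi_step(Z, B, r, sigma, h):
    # One double step (8.6.4a,b) of the Peaceman-Rachford ADI method for (8.6.2).
    # Z[i, j] approximates z_{i+1, j+1}; B holds h^2 f_ij; zero boundary values
    # are implicit.  r is the parameter r_{i+1} of this step.
    N = Z.shape[0]
    e = 0.5 * sigma * h**2                     # contribution of (1/2) E per half step
    ab = np.zeros((3, N))                      # H_1 + rI and V_1 + rI: tridiag(-1, 2+e+r, -1)
    ab[0, 1:], ab[1, :], ab[2, :-1] = -1.0, 2.0 + e + r, -1.0

    def apply_V1(W):                           # V_1 acts along the second index j
        Y = (2.0 + e) * W
        Y[:, 1:]  -= W[:, :-1]
        Y[:, :-1] -= W[:, 1:]
        return Y

    def apply_H1(W):                           # H_1 acts along the first index i
        Y = (2.0 + e) * W
        Y[1:, :]  -= W[:-1, :]
        Y[:-1, :] -= W[1:, :]
        return Y

    rhs = r * Z - apply_V1(Z) + B              # (8.6.4a): tridiagonal solves in the i-direction
    Z_half = np.empty_like(Z)
    for j in range(N):
        Z_half[:, j] = solve_banded((1, 1), ab, rhs[:, j])

    rhs = r * Z_half - apply_H1(Z_half) + B    # (8.6.4b): tridiagonal solves in the j-direction
    Z_new = np.empty_like(Z)
    for i in range(N):
        Z_new[i, :] = solve_banded((1, 1), ab, rhs[i, :])
    return Z_new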
Eliminating $z^{(i+1/2)}$, one obtains from (8.6.4)
\[
(8.6.5)\qquad z^{(i+1)} = T_{r_{i+1}}z^{(i)} + g_{r_{i+1}}(b),\qquad i \ge 0,
\]
with
\[
(8.6.6)\qquad
\begin{aligned}
T_r &:= (V_1 + rI)^{-1}(rI - H_1)(H_1 + rI)^{-1}(rI - V_1),\\
g_r(b) &:= (V_1 + rI)^{-1}\bigl[I + (rI - H_1)(H_1 + rI)^{-1}\bigr]b.
\end{aligned}
\]
For the error $f_i := z^{(i)} - z$ it follows from (8.6.5) and the relation
$z = T_{r_{i+1}}z + g_{r_{i+1}}(b)$, by subtraction, that
\[
(8.6.7)\qquad f_{i+1} = T_{r_{i+1}}f_i,\qquad i = 0, 1, \ldots,
\]
and therefore
\[
(8.6.8)\qquad f_m = T_{r_m}T_{r_{m-1}}\cdots T_{r_1}f_0.
\]
As in relaxation methods, one tries to choose the parameters $r_i$ in such
a way that the method converges as fast as possible. In view of (8.6.7),
(8.6.8), this means that the $r_i$ are to be determined so that the spectral
radius $\rho(T_{r_m}\cdots T_{r_1})$ becomes as small as possible.
We first want to consider the case in which the same parameter $r_i = r$
is chosen for all $i = 1, 2, \ldots$. Here one has the following:
(8.6.9) Theorem. Under the assumption that $H_1$ and $V_1$ are positive definite one has $\rho(T_r) < 1$ for all $r > 0$.
Note that the assumption is satisfied for our special problem (8.6.1).
From Theorem (8.6.9) it follows in particular that each constant choice
ri = r > 0 leads to a convergent ite