11.915 Advanced Mathematics and Statistics for Engineers
Computational Mathematics I

Unit 1 — Errors, Computer Arithmetic and Norms

Contents

1 Disasters due to numerical errors
  1.1 The explosion of Ariane 5
  1.2 Roundoff error and the Patriot missile
  1.3 The sinking of the Sleipner A offshore oil platform
  1.4 Errors in computer modelling of real systems

2 Computer arithmetic
  2.1 Absolute, relative, backward and forward errors
  2.2 Computer representation of floating point numbers
  2.3 Scientific computers
  2.4 Example — adding 0.001 repeatedly
  2.5 Example — cancellation errors
  2.6 Example — cancellation errors in approximate derivatives
  2.7 Example — quadratic roots
  2.8 Example — evaluating polynomials
  2.9 Example — more general functions

3 Vector and matrix norms
  3.1 Vector norms
  3.2 Matrix norms
  3.3 Absolute and relative errors in norms

4 Problems

1 Disasters due to numerical errors

1.1 The explosion of Ariane 5

Adapted from the WWW pages by D N Arnold, http://www.ima.umn.edu/~arnold/disasters

On June 4, 1996 an unmanned Ariane 5 rocket launched by the European Space Agency exploded just forty seconds after its lift-off from Kourou, French Guiana. The rocket was on its first voyage, after a decade of development costing $7 billion. The destroyed rocket and its cargo were valued at $500 million. A board of inquiry investigated the causes of the explosion and in two weeks issued a report. It turned out that the cause of the failure was a software error in the inertial reference system. Specifically, a 64 bit floating point number relating to the horizontal velocity of the rocket with respect to the platform was converted to a 16 bit signed integer. The number was larger than 32,767, the largest integer storeable in a 16 bit signed integer, and thus the conversion failed.

1.2 Roundoff error and the Patriot missile

Article [5] by Robert Skeel, professor of computer science at the University of Illinois at Urbana-Champaign, in SIAM News, 1992.

"The March 13 issue of Science carried an article claiming, on the basis of a report from the General Accounting Office (GAO), that a 'minute mathematical error ... allowed an Iraqi Scud missile to slip through Patriot missile defenses a year ago and hit U.S. Army barracks in Dhahran, Saudi Arabia, killing 28 servicemen.' The article continues with a readable account of what happened.

"The article says that the computer doing the tracking calculations had an internal clock whose values were slightly truncated when converted to floating-point arithmetic. The errors were proportional to the time on the clock: 0.0275 seconds after eight hours and 0.3433 seconds after 100 hours.
A calculation shows each of these relative errors to be very nearly 2^{-20}, which is approximately 0.0001%.

"The GAO report contains some additional information. The internal clock kept time as an integer value in units of tenths of a second, and the computer's registers were only 24 bits long. This and the consistency in the time lags suggested that the error was caused by a fixed-point 24-bit representation of 0.1 in base 2. The base 2 representation of 0.1 is nonterminating; truncated to the first 23 binary digits after the binary point, the value is 0.1 × (1 − 2^{-20}). The use of 0.1 × (1 − 2^{-20}) in obtaining a floating-point value of time in seconds would cause all times to be reduced by 0.0001%.

"This does not really explain the tracking errors, however, because the tracking of a missile should depend not on the absolute clock-time but rather on the time that elapsed between two different radar pulses. And because of the consistency of the errors, this time difference should be in error by only 0.0001%, a truly insignificant amount.

"Further inquiries cleared up the mystery. It turns out that the hypothesis concerning the truncated binary representation of 0.1 was essentially correct. A 24-bit representation of 0.1 was used to multiply the clock-time, yielding a result in a pair of 24-bit registers. This was transformed into a 48-bit floating-point number. The software used had been written in assembly language 20 years ago. When Patriot systems were brought into the Gulf conflict, the software was modified (several times) to cope with the high speed of ballistic missiles, for which the system was not originally designed.

"At least one of these software modifications was the introduction of a subroutine for converting clock-time more accurately into floating-point. This calculation was needed in about half a dozen places in the program, but the call to the subroutine was not inserted at every point where it was needed. Hence, with a less accurate truncated time of one radar pulse being subtracted from a more accurate time of another radar pulse, the error no longer cancelled.

"In the case of the Dhahran Scud, the clock had run up a time of 100 hours, so the calculated elapsed time was too long by 2^{-20} × 100 hours = 0.3433 seconds, during which time a Scud would be expected to travel more than half a kilometer.

"The roundoff error, of course, is not the only problem that has been identified: serious doubts have been expressed about the ability of Patriot missiles to hit Scuds."

End of article by Robert Skeel.

1.3 The sinking of the Sleipner A offshore oil platform

Adapted from the WWW pages by D N Arnold, http://www.ima.umn.edu/~arnold/disasters

The Sleipner A offshore oil platform sank during construction in Gandsfjorden outside Stavanger, Norway on 23 August 1991. The top deck weighed 57,000 tons, and provided accommodation for about 200 people and support for drilling equipment weighing about 40,000 tons. The crash caused a seismic event registering 3.0 on the Richter scale, and left nothing but a pile of debris on the sea bed 220 m down. The failure involved a total loss of about $700 million. The conclusion of the investigation was that the loss was caused by a failure in a cell wall (one of the building blocks of the structure), resulting in a serious crack and a leakage that the pumps were not able to cope with.
The wall failed as a result of a combination of a serious error in the finite element analysis and insufficient anchorage of the reinforcement in a critical zone.

1.4 Errors in computer modelling of real systems

Computer predictions based on mathematical models of a real-world system can fail for all sorts of reasons. The three examples above show some of the pitfalls. Ariane was destroyed by an overflow problem (the number was too big for the place it was to be stored) and Patriot didn't work when it was needed because of rounding error accumulation and mixing up precisions. Sleipner was very different: the structure was destroyed because a finite element package was trusted to make predictions within a few percent and it didn't get things right, perhaps because of modelling problems and perhaps because of errors in approximately solving the equations defined by the model.

A brief summary of sources of error is:

• The coding/hardware is wrong. Not our concern here, but always worth keeping in mind!
• The mathematical model is wrong, or doesn't take account of enough of the factors in the problem.
• The equations that make up the mathematical model are not solved accurately enough.
• The equations that make up the mathematical model cannot be solved accurately enough. For example, a system which is very unstable, in the sense that small changes in input give rise to large changes in output, is going to be tricky to deal with when, say, the measured data fed into it is noisy.

The last point is a major concern in the study of numerical algorithms and linear algebra, and we will examine it in detail.

2 Computer arithmetic

Most standard computer languages and processors are designed to do real arithmetic to about 15 significant digits. This sounds quite impressive until you start doing things like adding very small numbers to big ones, or relying on the result of subtracting two numbers that are very close together. Many texts on numerical analysis and scientific computing cover these topics, but [4] is probably the most comprehensive.

[Figure 1: Backward and forward errors in computing y = f(x). The input x maps to the intended output y = f(x); the computed output y + dy equals f(x + dx). The thick solid line indicates the outcome of the computation.]

2.1 Absolute, relative, backward and forward errors

We need to be clear about these terms when considering errors. If the number x̂ approximates x, then the

   absolute error in x̂ = |x̂ − x|

and, if x ≠ 0,

   relative error in x̂ = |x̂ − x| / |x| ,

where |x| is the absolute value (modulus) of x (Matlab command abs). We will see this generalised below when x is a vector or matrix.

Figure 1 shows a schematic diagram of the computation of y = f(x) given x. The actual computed result is y + dy when it should be y, and dy is the forward error in this calculation. On the other hand y + dy corresponds to the correct calculation of f(x + dx), and dx is the backward error in the calculation.
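As a small illustration (the numbers below are made up for demonstration, not taken from the notes), suppose sqrt(2) is computed rather crudely as 1.4142. Its forward error is the distance to sqrt(2); its backward error is the perturbation dx of the input for which 1.4142 is the exact square root, namely 1.4142^2 − 2.

x = 2;
y_hat = 1.4142;                  % a deliberately rough computed value of sqrt(2)
fwd_err = abs(y_hat - sqrt(x))   % forward error |dy|
bwd_err = abs(y_hat^2 - x)       % backward error |dx|: y_hat is the exact square root of x + dx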
2.2 Computer representation of floating point numbers

There are standards for representing and processing floating point numbers on digital computers (e.g. the IEEE single, double and extended precisions) and, of course, the chips use binary (base 2) arithmetic. To keep life simple let us work with base 10 arithmetic here, but bear in mind the Patriot missile and the example in § 2.4 below when you think in base 10, but compute in base 2.

A normalized decimal floating point number y ≠ 0 has the form

   y = ±0.d1 d2 . . . dk × 10^n ,   1 ≤ d1 ≤ 9 ,   0 ≤ ds ≤ 9 for s = 2 : k     (1)

and the integer n also has a limited range of values

   n = ±f1 . . . fj ,   0 ≤ fs ≤ 9 for s = 1 : j .

In the above form, 0.d1 d2 . . . dk is the mantissa and n is the exponent. Setting all the d's to zero gives the number y = 0. The number requires k + j decimal digits and two sign bits to store it. If for example k = 7 and j = 2, it is easy to show that

   max |y| = 0.9999999 × 10^99 ≈ 10^99   and   min_{y≠0} |y| = 0.1000000 × 10^{-99} = 10^{-100} .

If you try to go outside these bounds you get an overflow or underflow, similar to Ariane.

To convert a real number like y = π, 1/3 or 76 × 10^{-3} to the form fl(y) in (1), two procedures are used. First (notionally) write the real number in its infinite decimal expansion, say

   y = 0.d1 d2 . . . × 10^n ,   1 ≤ d1 ≤ 9 ,   0 ≤ ds ≤ 9 for s ≥ 2 .     (2)

Chopping defines fl(y) as

   fl_chop(y) = 0.d1 d2 . . . dk × 10^n ,

while rounding defines

   fl_round(y) = fl_chop(y + 0.5 × 10^{n−k}) = 0.δ1 δ2 . . . δk × 10^n .

Rounding looks complicated, but it is just the well-known rule: when d_{k+1} < 5 we round down to get fl(y) = 0.d1 d2 . . . dk × 10^n, and when d_{k+1} ≥ 5 we round up by adding 1 to dk. For example, chopping and rounding π = 0.314159265358 . . . × 10^1 to 5 digit floating point form gives

   fl_chop(π) = 0.31415 × 10^1 ,   fl_round(π) = 0.31416 × 10^1

following the conventions above.

The error that results by replacing a real number by its floating point representation is the roundoff error. If the real number y = 0.d1 d2 . . . × 10^n (as in (2) above), then the relative error in chopping to k digits is

   |fl_chop(y) − y| / |y| = (0.d_{k+1} d_{k+2} . . . × 10^{n−k}) / (0.d1 d2 . . . × 10^n) = (0.d_{k+1} d_{k+2} . . . / 0.d1 d2 . . .) × 10^{−k}

and, given the restrictions on the decimal digits in (2),

   |fl_chop(y) − y| / |y| ≤ (0.9999999 . . . / 0.1) × 10^{−k} = 10^{1−k} .

A similar result holds for rounding:

   |fl_round(y) − y| / |y| ≤ 0.5 × 10^{1−k} .

So the relative errors do not depend on the magnitude of the numbers, so long as they fit into the range that can be stored.

Computer arithmetic suffers from these floating point representation problems. Consider the real numbers x = 1/3, y = 5/7 and work in 5-digit chopping arithmetic. First we have

   fl(x) = 0.33333 × 10^0   and   fl(y) = 0.71428 × 10^0

where we have shortened fl_chop to fl. Computer evaluation of x + y (written x ⊕ y) can be thought of as

   x ⊕ y = fl(fl(x) + fl(y)) = fl(0.33333 × 10^0 + 0.71428 × 10^0) = fl(1.04761 × 10^0) = 0.10476 × 10^1 .

In other words, chop x and y to 5 digits, add these results together and chop. The relative error overall is approximately 0.18 × 10^{-4}. This isn't bad since each chop could involve a relative error as big as 1 × 10^{-4}.

This general way of thinking about computer arithmetic operations works for the other common operations:

   x ⊗ y = fl(fl(x) × fl(y)) ,   x ⊖ y = fl(fl(x) − fl(y)) ,   x ⊘ y = fl(fl(x) ÷ fl(y)) .

Of course the exact details vary from processor to processor.

Finally, many implementations of floating point arithmetic satisfy the condition

   fl(x op y) = (x op y)(1 + δ) ,   |δ| ≤ u ,   op = +, −, ×, ÷     (3)

where the unit roundoff u = β^{1−t}/2, β is the number base used (e.g. binary or decimal) and t is the precision (number of digits in the mantissa). For the commonly used IEEE double precision arithmetic these numbers are β = 2, t = 53, u ≈ 10^{-16}. See [4, Ch. 2].
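These ideas can be checked directly in Matlab, which uses IEEE double precision (β = 2, t = 53). A minimal sketch, using only built-in commands:

format long e
eps                       % 2^(-52): the spacing of doubles near 1, i.e. twice the unit roundoff u
fprintf('%.20f\n', 0.1)   % 0.1 is not exactly representable in base 2
0.1 + 0.2 == 0.3          % false: each stored value carries a small representation error

The last line is false precisely because 0.1, 0.2 and 0.3 each pick up their own roundoff error when converted to base 2, so the computed sum need not match the stored 0.3.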
2.3 Scientific computers

Although the mass market for computers has settled onto one common type of processor, the more specialised market is much more diverse. There are RISC chips, parallel computers, vector processors etc., and one has to consider issues like main memory access times, on-chip cache (fast memory attached to the processor) size and access time, the number of arithmetic units on individual processors etc. We will only have a brief look here.

Parallel processing

Parallel computers come in many varieties. They are collections of processors that can be used together to execute a programme. Typical issues here are:

• Can the programme be broken into independent pieces that can be executed together? If 90% of the code can be parallelised and 10% cannot, then the time taken by the code can be reduced to 10% at best.
• Do these independent pieces all take the same time to execute? If not, then many processors will be doing nothing waiting for the slowest to finish. Making this work is called load balancing.
• Do the processors share memory or have independent chunks only accessible to themselves?
• If they don't share memory, how fast do the processors pass information between one another?

Perfect performance is roughly that P processors make your code P times faster. In real life, parallel performance is rarely near perfect, particularly as P gets big. It is quite possible for 256 processors to take longer than 64 on some problems. On the other hand it is also possible to see speed-ups of more than P with P processors, when the data for the problem are split into smaller pieces that fit into the faster cache memory and do not use the slower main memory.

Vector processing

Vector processors like the Cray and some Digital Signal Processing chips have also made a big impact. The basic idea is simple, but effective. Say we want to find the vector sum a + b. One type of hardware goes through the process

for j=1:N
  read a(j) and b(j) from memory
  add them
  write result to memory
end

It doesn't start on the next calculation until it completely finishes the current one. If that process takes say 4 clock cycles, then the whole thing takes 4N clock cycles. A vector pipeline processor is designed to split the calculation (read from memory, add, write result) up into say 7 sections, each taking one clock cycle, but to start a new calculation every clock cycle. See Table 1. It takes 7 clock cycles to get the first result out, but each one after that appears every cycle. The total time to add both vectors in this way is 6 + N cycles. When N is big, then this vector pipeline is 4 times faster than the simple unit. However it isn't faster when adding pairs of scalars.

Cycle | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Step 6 | Step 7 | Result
  1   | a1,b1  |        |        |        |        |        |        |
  2   | a2,b2  | a1,b1  |        |        |        |        |        |
  3   | a3,b3  | a2,b2  | a1,b1  |        |        |        |        |
  4   | a4,b4  | a3,b3  | a2,b2  | a1,b1  |        |        |        |
  5   | a5,b5  | a4,b4  | a3,b3  | a2,b2  | a1,b1  |        |        |
  6   | a6,b6  | a5,b5  | a4,b4  | a3,b3  | a2,b2  | a1,b1  |        |
  7   | a7,b7  | a6,b6  | a5,b5  | a4,b4  | a3,b3  | a2,b2  | a1,b1  | a1+b1
  8   | a8,b8  | a7,b7  | a6,b6  | a5,b5  | a4,b4  | a3,b3  | a2,b2  | a2+b2
 ...  |  ...   |  ...   |  ...   |  ...   |  ...   |  ...   |  ...   |  ...

Table 1: Pipeline processing to find a + b

The simplified example above shows how vector pipelines can speed up calculations of operations involving vectors. One reason to introduce this concept is that, for other reasons, Matlab behaves as if it is a vector pipeline computer. For example,

N = 100000;
x = rand(N,1); y = rand(N,1);
tic
z = x+y;
toc
elapsed_time = 0.2166
tic
for j=1:N
  z(j) = x(j)+y(j);
end
toc
elapsed_time = 4.8452

shows a speed-up by a factor of 22 for the vector calculation (the first part) over the component-wise calculation. These were run on a Sun Ultra 5/270. We will give more Matlab speed tips throughout the course.
2.4 Example — adding 0.001 repeatedly

The following code section should step through adding 0.001 to x at each step. It should take exactly 100 steps for x to be equal to 0.2, and it should stop at that point because of the logical check and break.

x = 0.1;
for k=1:200
  x = x + 0.001;
  if x==0.2, disp('x = 0.2'), break, end
end
disp(['Number of steps = ',int2str(k),' Final x = ',num2str(x)])

OUTPUT
Number of steps = 200 Final x = 0.3

The code never finds that x is exactly equal to 0.2! To see what happens near where x=0.2, modify the code:

format long e % show full numerical precision
x = 0.1;
for k=1:200
  x = x + 0.001;
  if abs(x-0.2) < 1.e-6, break, end
end
disp(['Number of steps = ',int2str(k),' Final x = ',num2str(x),...
      ' x-0.2 = ',num2str(x-0.2)])

OUTPUT
Number of steps = 100 Final x = 0.2 x-0.2 = 8.3267e-17

So adding 0.001 repeatedly to x doesn't do what we expect in floating point arithmetic. That is because the number 0.001 can't be represented exactly in binary (computer numbers). It is better to replace the check x==0.2 in the first version above by abs(x-0.2) < 1.e-6. That is, check that x is within distance 10^{-6} (or another small number) of 0.2. This is very similar to the Patriot missile problem.

2.5 Example — cancellation errors

Start with an integer example. Subtract two 14 digit integers which agree in the first 6 places:

>> format long
>> p = 14567123456629
p = 1.456712345662900e+013
>> q = 14567109642894
q = 1.456710964289400e+013
>> r = p-q
r = 13813735

Not surprisingly, the result r=p-q has 14-6 = 8 digits in it (and it is correct). Next try it with non-integers:

>> format long
>> a = 100000*sin(0.300000001)
a = 2.955202076166761e+004
>> b = 100000*sin(0.299999999)
b = 2.955202057060031e+004

The numbers a and b above are computed to about 16 significant figures and agree in their first 8 significant places. Taking the difference between them should leave a number accurate to about 16-8 = 8 significant decimals.

>> c = a-b
c = 1.910672981466632e-004

Unfortunately the computer lies and pads out the result by adding more digits than it really knows anything about. It gives the impression that the number c is known to 16 significant digits when it cannot possibly be. Now subtract the number d from c. They agree in the first 8 significant figures.

>> d = 1.9106729123456e-4
d = 1.910672912345600e-004
>> c-d
ans = 6.912103241120057e-012

Again we get a result quoted to 15 or 16 decimal places, but none of the digits in this number mean anything, because c was accurate to about 8 significant figures and we subtracted them off using d.

2.6 Example — cancellation errors in approximate derivatives

Derivatives of functions can be approximated by estimating the slope of the function. For example, the derivative of sin x at x = 2 is cos 2. Use a forward difference to approximate the derivative.

>> x=2; h=0.1; aprx = (sin(x+h)-sin(x))/h
aprx = -0.4609
>> exact = cos(2)
exact = -0.4161

Not too far wrong. Making the parameter h smaller usually makes things better.

>> x=2; h=0.05; aprx = (sin(x+h)-sin(x))/h
aprx = -0.4387
>> exact = cos(2)
exact = -0.4161

However, when h gets very small the difference between sin(x+h) and sin(x) is relatively very small and we can have rounding error problems as described earlier. We see in Figure 2 that as we decrease h we get closer to the exact derivative until about h = 1.e-8. For smaller h we slowly drift away from this good result until at h = 1.e-16 we get a completely wrong result aprx=0! This is all due to rounding errors in subtracting two numbers that are very close together.

[Figure 2: Comparison of the absolute errors in the forward difference (f(x+h)-f(x))/h and central difference (f(x+h)-f(x-h))/(2h) approximations of the derivative of sin x at x = 2, plotted against h. See § 2.6.]
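The behaviour in Figure 2 is easy to reproduce. The sketch below (not part of the original notes) sweeps h from 10^{-1} down to 10^{-16} and plots the absolute errors of the forward and central difference approximations to the derivative of sin x at x = 2; rounding errors take over once h becomes very small.

x = 2; exact = cos(x);
h = 10.^(-(1:16));                     % h = 1e-1, 1e-2, ..., 1e-16
fwd  = (sin(x+h) - sin(x))./h;         % forward difference approximations
cent = (sin(x+h) - sin(x-h))./(2*h);   % central difference approximations
loglog(h, abs(fwd - exact), 'o-', h, abs(cent - exact), 's-')
xlabel('h'), ylabel('absolute error'), legend('forward','central')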
2.7 Example — quadratic roots

Find the roots of the quadratic equation

   a x^2 + b x + c = 0 .

The two roots (when a ≠ 0) are

   x± = ( −b ± √(b^2 − 4ac) ) / (2a) .

This seems like a simple computation, but if b^2 is much bigger than 4ac and b > 0, then x+ may be difficult to calculate accurately since it involves

   −b + √(b^2 − 4ac) ≈ −b + b .

In the case b > 0 with b^2 >> 4ac it is better to use an alternative formula for x+ given by

   x+ = ( −b + √(b^2 − 4ac) ) / (2a) = ( −b + √(b^2 − 4ac) ) / (2a) × ( −b − √(b^2 − 4ac) ) / ( −b − √(b^2 − 4ac) ) = −2c / ( b + √(b^2 − 4ac) ) ,

or to compute x− first and then use the identity x+ = c/(a x−). That way there are no cancellation problems to deal with.
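As an illustration with assumed values a = 1, b = 10^8, c = 1 (so b^2 >> 4ac, not an example from the notes), the sketch below compares the naive formula with the two rearrangements; the naive x+ loses essentially all of its significant figures to cancellation.

format long e
a = 1; b = 1e8; c = 1;          % assumed values with b^2 >> 4ac
d = sqrt(b^2 - 4*a*c);
xplus_naive  = (-b + d)/(2*a)   % cancellation: -b plus a number very close to b
xplus_better = -2*c/(b + d)     % rearranged formula, no cancellation
xminus = (-b - d)/(2*a);
xplus_ident  = c/(a*xminus)     % via the product of the roots, x+ x- = c/a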
2.8 Example — evaluating polynomials

Consider the evaluation of the polynomial

   f(x) = x^3 − 6x^2 + 3x − 0.149

for x = 4.71 using 3-digit chopping and rounding arithmetic.

arithmetic |  x    | 3×x   | x×x     | 6×x×x    | x×x×x
exact      | 4.71  | 14.13 | 22.1841 | 133.1046 | 104.487111
3-chop     | 4.71  | 14.1  | 22.1    | 132      | 104
3-round    | 4.71  | 14.1  | 22.2    | 133      | 105

Note that 4.71 × 4.71 × 4.71 in 3-digit rounding arithmetic is

   (4.71 ⊗ 4.71) ⊗ 4.71 = fl_round(22.1841) ⊗ 4.71 = 22.2 ⊗ 4.71 = fl_round(104.562) = 105

and that 105 ≠ fl_round(exact result) = 104. Now add everything up (working from left to right):

         f(4.71)
exact    104.487111 − 133.1046 + 14.13 − 0.149   = −14.636489
3-chop   ((104 ⊖ 132) ⊕ 14.1) ⊖ 0.149            = −14.0
3-round  ((105 ⊖ 133) ⊕ 14.1) ⊖ 0.149            = −14.0

The relative error in both cases is 0.636/14.636 = 0.04348 . . ., i.e. about 4 per cent.

Horner's algorithm is an alternative method for polynomial evaluation that is computationally efficient and usually cuts down on rounding errors. For the general n-th degree polynomial

   f(x) = a_{n+1} x^n + a_n x^{n−1} + · · · + a_2 x + a_1 ,

Horner's algorithm evaluates f(x) for given x and a_1, . . . , a_{n+1} as

w = a(n+1);
for j=n:-1:1
  w = w*x + a(j);
end

so in the cubic example above we get (with x = 4.71, a_4 = 1, a_3 = −6, a_2 = 3, a_1 = −0.149):

algorithm        | w exact     | w 3-chop | w 3-round
w = a(4)         | 1           | 1        | 1
w = w*x + a(3)   | -1.29       | -1.29    | -1.29
w = w*x + a(2)   | -3.0759     | -3.07    | -3.08
w = w*x + a(1)   | -14.636489  | -14.5    | -14.6

The relative errors are now approximately 0.9% and 0.2% respectively, compared to 4% by the direct evaluation above. Also note that the Horner algorithm takes 6 flops (floating point operations +, −, ×, ÷) compared to 9 or 10 for the direct method. The potential improvement in accuracy and computational cost is greater for higher degree polynomials.

2.9 Example — more general functions

Consider the function

   g(x) = (1 − cos x) / x^2

for small values of x. There are difficulties with g(x) because cos x ≈ 1 for small x. On a typical processor working to between 15 and 16 significant figures, the result g(x) will be returned as exactly 0 for all |x| ≠ 0 less than around 10^{-8}. This is not very good since for small x, g(x) ≈ 0.5, which is nothing like 0.

One cure for this is to use Taylor's theorem (see almost any calculus book or [1, 2]) to expand the numerator and/or the denominator of such functions around x = 0. When a function f(x) is smooth enough, Taylor's theorem gives

   f(x) = f(x0) + (x − x0) f'(x0) + ((x − x0)^2 / 2!) f''(x0) + ((x − x0)^3 / 3!) f'''(x0) + · · · .

In the example above with x0 = 0,

   1 − cos x = 1 − cos 0 + (x − 0) sin 0 + ((x − 0)^2 / 2!) cos 0 − ((x − 0)^3 / 3!) sin 0 − · · ·
             = x^2/2 − x^4/4! + · · · ,

which only has even powers of x. Finally, a good approximation of g(x) for small x is obtained by

   g(x) = (1 − cos x) / x^2 ≈ 1/2 − x^2/4! .

The error in this approximation is actually ≤ x^4/6! in exact arithmetic, but we shall leave that to the Computational Mathematics II module.
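A quick Matlab check (a sketch, not from the notes) shows the naive formula collapsing while the two-term Taylor approximation stays close to 1/2:

x = 10.^[-6 -7 -8 -9];          % small values of x
g_naive  = (1 - cos(x))./x.^2   % cancellation in 1 - cos(x); drops to 0 once x is around 1e-8
g_taylor = 1/2 - x.^2/24        % two-term Taylor approximation, stays close to 0.5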
3 Vector and matrix norms

We are going to be studying linear algebra, which works with vectors and matrices. We need a way to measure the size of vectors (and sometimes matrices) and particularly the errors in vector results. Norms are used to give a measure of the "size" of vectors and matrices. They should not be confused with the "dimension", which is the number of elements in a vector and the number of rows and columns in a matrix. See many books on numerical linear algebra for an overview of this topic, for example [1, 2, 3, 4].

Note: We will usually think of vectors as column vectors.

3.1 Vector norms

Perhaps the most familiar vector norm is the distance between two points on a map or a plan. The distance from the origin to the point (2, 6) on a map is then

   √(2^2 + 6^2) ≈ 6.3246

(Pythagoras' theorem for right-angled triangles) and in 3D, the distance from the origin to (2, 6, −1) is

   √(2^2 + 6^2 + (−1)^2) ≈ 6.4031 .

This definition of distance (called the Euclidean distance) is what you would measure with a ruler. For multicomponent vectors, the Euclidean distance from the origin to the point denoted by x = (x1, x2, . . . , xN)^T is

   ‖x‖2 := √(x1^2 + x2^2 + · · · + xN^2) = ( Σ_{j=1}^{N} xj^2 )^{1/2} ,

assuming that x has real (not complex) components. The notation ‖x‖2 is read as "the 2-norm" or the "Euclidean norm" or the "ℓ2 norm" of vector x. Throwing away the ruler, there are many other ways to measure distance when dealing with real vectors. The most common examples for x ∈ R^N (i.e. x is a vector of N real numbers) are the ℓp norms defined by

   ‖x‖p := ( Σ_{j=1}^{N} |xj|^p )^{1/p} ,   1 ≤ p < ∞ ,   x ∈ R^N

and the special case of the infinity or ℓ∞ or maximum norm

   ‖x‖∞ := max_{j=1:N} |xj| .

The most common cases are p = 1, 2 and ∞. The Matlab command norm works for both vector and matrix norms.

To qualify as a vector norm, ‖ · ‖ (however it is defined) must satisfy some rules:

   ‖x‖ ≥ 0   ∀ x ∈ R^N     (4)
   ‖x‖ = 0  ⇔  x = 0     (5)
   ‖x + y‖ ≤ ‖x‖ + ‖y‖   ∀ x, y ∈ R^N     (6)
   ‖α x‖ = |α| ‖x‖   ∀ α ∈ R, x ∈ R^N     (7)

The symbol ∀ means "for all". The p-norms defined above do satisfy these rules. The rules can be interpreted as:

1. The norm is not negative (it is like a distance after all).
2. The norm of x is zero, if and only if x is 0 (again think of distances).
3. The distance travelled to go from the origin to the point x and then on to the point x + y (a further distance ‖y‖) is greater than or equal to the distance going directly from the origin to the point x + y. See Figure 3. This is a generalisation of the more familiar |a + b| ≤ |a| + |b|, with a = 4, b = ±7 for example.
4. If all components of the vector x are multiplied by a constant then the norm/distance is multiplied by the modulus of the same constant.

[Figure 3: Triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖ for vectors x, y.]

As a consequence of these properties, vector norms also satisfy the Hölder inequality

   | Σ_{j=1}^{N} xj yj | = |x^T y| ≤ ‖x‖p ‖y‖q ,   1/p + 1/q = 1     (8)

where x, y ∈ R^N, 1 ≤ p, q ≤ ∞. The special case

   |x^T y| ≤ ‖x‖2 ‖y‖2     (9)

(p = q = 2) is called the Cauchy-Schwarz inequality.

Furthermore, all norms on R^N are equivalent. That is, there exist constants c1, c2 > 0 such that

   c1 ‖x‖r ≤ ‖x‖s ≤ c2 ‖x‖r     (10)

for any vector norms ‖ · ‖r, ‖ · ‖s. The constants c1, c2 do not depend on the contents of x, only on the details of the norms and the length of the vector. For example

   ‖x‖∞ ≤ ‖x‖1 ≤ N ‖x‖∞
   ‖x‖∞ ≤ ‖x‖2 ≤ √N ‖x‖∞
   ‖x‖2 ≤ ‖x‖1 ≤ √N ‖x‖2

for all x ∈ R^N. This means that if we can show that the error in a vector tends to zero in one norm, then it does so in all other norms.

3.2 Matrix norms

Matrices present further complications, but norms on matrices behave in a similar way to norms on vectors. The Matlab command norm works for both vector and matrix norms. We will concentrate on real, N × N square matrices, denoted by A ∈ R^{N×N}. A matrix norm ‖ · ‖ must satisfy

   ‖A‖ ≥ 0   ∀ A ∈ R^{N×N}     (11)
   ‖A‖ = 0  ⇔  A = 0     (12)
   ‖A + B‖ ≤ ‖A‖ + ‖B‖   ∀ A, B ∈ R^{N×N}     (13)
   ‖α A‖ = |α| ‖A‖   ∀ α ∈ R, A ∈ R^{N×N} .     (14)

(Beware! The requirements for a matrix norm vary from book to book, but the main points are in agreement.) The most common matrix norms are the p-norms

   ‖A‖p = max_{‖x‖p = 1} ‖A x‖p = sup_{x ≠ 0} ‖A x‖p / ‖x‖p

for 1 ≤ p ≤ ∞ and the Frobenius norm

   ‖A‖F = ( Σ_{i=1}^{N} Σ_{j=1}^{N} |aij|^2 )^{1/2}

where aij is the element in the ith row and jth column of matrix A. The matrix p-norm is sometimes called the natural or induced matrix norm associated with the vector p-norm. The 1- and ∞-norms are the easiest of the p-norms to evaluate. They are equivalent to

   ‖A‖1 = max_{j=1:N} Σ_{i=1}^{N} |aij|   and   ‖A‖∞ = max_{i=1:N} Σ_{j=1}^{N} |aij| ,

the maximum column sum and the maximum row sum respectively. (Incidentally, this is messy to prove. See e.g. [2, Ch. 7].) The others are not at all straightforward or computationally cheap to evaluate. The p-norms also satisfy the conditions

   ‖A x‖p ≤ ‖A‖p ‖x‖p     (15)

and

   ‖A B‖p ≤ ‖A‖p ‖B‖p     (16)

for A, B ∈ R^{N×N} and x ∈ R^N. One way to think of the p-norm of matrix A is that it measures the maximum "magnification" or "size increase" the matrix can give when multiplying a vector or another matrix. The p-norms and the Frobenius norm are linked together in the same way as vector norms. For example,

   ‖A‖2 ≤ ‖A‖F ≤ √N ‖A‖2
   (1/√N) ‖A‖∞ ≤ ‖A‖2 ≤ √N ‖A‖∞
   (1/√N) ‖A‖1 ≤ ‖A‖2 ≤ √N ‖A‖1
   ‖A‖2 ≤ √( ‖A‖1 ‖A‖∞ )
   max_{i,j=1:N} |aij| ≤ ‖A‖2 ≤ N max_{i,j=1:N} |aij|

for all A ∈ R^{N×N}. In fact all norms on matrices in R^{N×N} are equivalent, just as we have seen in (10) for vectors.

3.3 Absolute and relative errors in norms

Suppose that vector x̂ approximates x and matrix Â approximates A. Then, for the norm chosen,

   absolute error in x̂ = ‖x̂ − x‖ ,   absolute error in Â = ‖Â − A‖ .

If x ≠ 0 and A ≠ 0,

   relative error in x̂ = ‖x̂ − x‖ / ‖x‖ ,   relative error in Â = ‖Â − A‖ / ‖A‖ .

Note that a different choice of norm can give different numerical results.
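For example (a small sketch with assumed numbers, using the built-in norm command mentioned above):

x = [2; 6; -1];
[norm(x,1)  norm(x,2)  norm(x,inf)]                   % 1-, 2- and infinity-norms of a vector
A = [1 2; -3 4];
[norm(A,1)  norm(A,2)  norm(A,inf)  norm(A,'fro')]    % matrix p-norms and the Frobenius norm
xhat = x + 1e-3;                                      % an assumed perturbation of x
rel_err = norm(xhat - x, 2)/norm(x, 2)                % relative error measured in the 2-norm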
4 Problems

1. The number p̂ approximates p. What is the (i) absolute error and (ii) relative error in the approximation of p by p̂?
   (a) p = π, p̂ = 3.1   (b) p = 100/3, p̂ = 33.3 .

2. The number x̂ approximates x with a relative error of at most 10^{-N}. What range of values can x̂ take given x and N below?
   (a) x = 22/7, N = 4   (b) x = 150, N = 3 .

3. How many normalised floating point numbers x with a 5-digit mantissa are there with 700 ≤ x < 900 using base 10 arithmetic? (The number of digits in the exponent doesn't matter here.)

4. What is the smallest (non-zero) normalised floating point number with a 3-digit mantissa and 2-digit exponent in base 10 arithmetic? What is the biggest? (Smallest means closest to zero and biggest means furthest from zero.)

5. Suppose that fl(y) is a k-digit rounding approximation to the real number y ≠ 0. Show that

   |y − fl(y)| / |y| ≤ 0.5 × 10^{1−k} .

6. Perform the following computations (i) exactly, (ii) using 3-digit chopping, (iii) using 3-digit rounding and determine any loss in accuracy, assuming the numbers given below are exact.
   (a) 14.1 + 0.0981   (b) 0.0218 × 179 .

7. Use 3 digit rounding arithmetic to evaluate f(1.53) by (i) direct evaluation and (ii) Horner's algorithm (see § 2.8 and set y = e^x below).

   f(x) = 1.01 e^{4x} − 4.62 e^{3x} − 3.11 e^{2x} + 12.2 e^x − 1.99 .

   Computing the result to high precision and then rounding to 3 digits gives f(1.53) = −7.61.

8. Why is computer or calculator evaluation of f(x) = (e^x − 1 − x)/(1 − cos x) likely to be inaccurate when x > 0 is small? Use Taylor expansions of the numerator and denominator to derive an alternative formula for f(x) which is accurate for small x.

9. Find ‖x‖2 and ‖x‖∞ for x = (3, −4, 0, 1.5)^T and x = (2, 1, −3, 4)^T.

10. Verify that the function ‖ · ‖1 defined on R^N by

    ‖x‖1 = Σ_{i=1}^{N} |xi|

    is a vector norm and evaluate ‖x‖1 for the vectors in the previous question.

11. Find ‖ · ‖∞ for the following matrices:

    ( 1  -1 )      ( 1  1 )      ( 10  15 )
    ( 2   1 ) ,    ( 1  1 ) ,    (  0   1 ) .

12. Show that ‖ · ‖ defined below is a matrix norm.

    ‖A‖ := Σ_{i=1}^{n} Σ_{j=1}^{n} |aij| .

13. Show that ‖ · ‖ defined below is not a matrix norm.

    ‖A‖ := max_{i,j} |aij| .

14. Use Matlab to find the 1-, 2-, ∞- and Frobenius norms of the 100 × 100 Hilbert matrix. Hint: look at the commands hilb, norm.

References

[1] K. E. Atkinson. An Introduction to Numerical Analysis. Wiley, 2nd edition, 1989.
[2] R. L. Burden and J. D. Faires. Numerical Analysis. Brooks Cole, 7th edition, 2001.
[3] G. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1995.
[4] N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1996.
[5] R. Skeel. Roundoff error and the Patriot missile. Society for Industrial and Applied Mathematics (SIAM) News, 25(4):11, 1992.

© D B Duncan & G J Lord, Department of Mathematics, September 2001