CHAPTER 4
Round-Off and Truncation Errors

Numerical Accuracy
- Truncation error: method dependent
  Errors which result from using an approximation rather than an exact procedure
  $f(x_i + h) = f(x_i) + h f'(x_i) + \frac{h^2}{2!} f''(x_i) + \frac{h^3}{3!} f'''(x_i) + \cdots$
- Round-off error: machine dependent
  Errors which result from using an approximate number in place of an exact one, because the true value cannot be represented adequately
  $\pi \approx 3.1416$, $e \approx 2.71828$

Taylor Series Expansion
- Construction of finite-difference formulas
- Numerical accuracy: discretization error
Base point x = a:
  $f(x) = c_0 + c_1 (x-a) + c_2 (x-a)^2 + c_3 (x-a)^3 + \cdots$, so $c_0 = f(a)$
  $f'(x) = c_1 + 2 c_2 (x-a) + 3 c_3 (x-a)^2 + \cdots$, so $c_1 = f'(a)$
  $f''(x) = 2 c_2 + 6 c_3 (x-a) + \cdots$, so $c_2 = f''(a)/2!$
  $f'''(x) = 6 c_3 + \cdots$, so $c_3 = f'''(a)/3!$
  $f^{(m)}(x) = (m!)\, c_m + (m+1)\, m\, (m-1) \cdots 2\; c_{m+1} (x-a) + \cdots$, so $c_m = f^{(m)}(a)/m!$
Therefore
  $f(x) = \sum_{m=0}^{\infty} c_m (x-a)^m = \sum_{m=0}^{\infty} \frac{f^{(m)}(a)}{m!} (x-a)^m$
Taylor series expansion with step h:
  $f(x_{i+1}) = f(x_i + h) = f(x_i) + h f'(x_i) + \frac{h^2}{2!} f''(x_i) + \frac{h^3}{3!} f'''(x_i) + \cdots$

Taylor Series and Remainder
Taylor series (base point x = a):
  $f(x) = \sum_{m=0}^{\infty} \frac{f^{(m)}(a)}{m!} (x-a)^m = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n + R_n$
Remainder (with $\xi$ between a and x):
  $R_n = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x-a)^{n+1}$

Truncation Error
Taylor series expansion:
  $f(x_{i+1}) = f(x_i + h) = f(x_i) + h f'(x_i) + \frac{h^2}{2!} f''(x_i) + \frac{h^3}{3!} f'''(x_i) + \cdots$
Example (higher-order terms truncated), with $x_i = 0$, $h = x$, $x_{i+1} = x$:
  $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots$
  $\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \cdots$

Power series
- Polynomials
- The function becomes more nonlinear as the power m increases

A MATLAB Script
Filename: fun_exp.m

    % fun_exp.m
    % Evaluate the exponential function exp(x)
    % by Taylor series expansion
    % f(x) = 1 + x + x^2/2! + x^3/3! + ... + x^n/n!
    clear
    x = input('enter the value of x = ');
    n = input('enter the order n = ');
    term = 1;                 % current term x^i/i!, starting at i = 0
    total = term;             % running sum (named total so the built-in sum is not shadowed)
    for i = 1:n
        term = term*x/i;      % x^i/i! obtained from x^(i-1)/(i-1)!
        total = total + term;
    end
    disp(total)

MATLAB For Loops
Filename: fun_exp2.m

    % fun_exp2.m
    % Evaluate the exponential function exp(x)
    % by Taylor series expansion, keeping every term and partial sum
    % f(x) = 1 + x + x^2/2! + x^3/3! + ... + x^n/n!
    x = input('enter the value of x = ');
    n = input('enter the order n = ');
    term = zeros(1, n+1);     % term(i+1) holds x^i/i!
    total = zeros(1, n+1);    % total(i+1) holds the partial sum through order i
    term(1) = 1;
    total(1) = term(1);
    for i = 1:n
        term(i+1) = term(i)*x/i;
        total(i+1) = total(i) + term(i+1);
    end
    % Display the results
    disp('i term(i) sum(i)')
    a = 1:n+1;
    disp([a' term' total'])

Truncation Error
x = 10, exact $e^{x} = 22026.4658$

     n        term            sum
     0       1.0000         1.0000
     1      10.0000        11.0000
     2      50.0000        61.0000
     3     166.6667       227.6667
     4     416.6667       644.3334
     5     833.3334      1477.6667
     6    1388.8890      2866.5557
     7    1984.1272      4850.6826
     8    2480.1589      7330.8418
     9    2755.7322     10086.5742
    10    2755.7322     12842.3066
    11    2505.2112     15347.5176
    12    2087.6760     17435.1934
    13    1605.9045     19041.0977
    14    1147.0746     20188.1719
    15     764.7164     20952.8887
    16     477.9478     21430.8359
    17     281.1458     21711.9824
    18     156.1921     21868.1738
    19      82.2064     21950.3809
    20      41.1032     21991.4844
    21      19.5729     22011.0566
    22       8.8968     22019.9531
    23       3.8682     22023.8223
    24       1.6117     22025.4336
    25       0.6447     22026.0781
    26       0.2480     22026.3262
    27       0.0918     22026.4180
    28       0.0328     22026.4512
    29       0.0113     22026.4629
    30       0.0038     22026.4668
    31       0.0012     22026.4688
    32       0.0004     22026.4688
    33       0.0001     22026.4688
    34 to 40  0.0000    22026.4688

Truncation Error
x = -10, exact $e^{x} = 0.45399 \times 10^{-4}$

     n            term              sum
     0        1.0000000        1.0000000
     1      -10.0000000       -9.0000000
     2       50.0000000       41.0000000
     3     -166.6666718     -125.6666718
     4      416.6666870      291.0000000
     5     -833.3333740     -542.3333740
     6     1388.8890381      846.5556641
     7    -1984.1271973    -1137.5715332
     8     2480.1589355     1342.5874023
     9    -2755.7321777    -1413.1447754
    10     2755.7321777     1342.5874023
    11    -2505.2111816    -1162.6237793
    12     2087.6760254      925.0522461
    13    -1605.9045410     -680.8522949
    14     1147.0745850      466.2222900
    15     -764.7164307     -298.4941406
    16      477.9477539      179.4536133
    17     -281.1457520     -101.6921387
    18      156.1920776       54.4999390
    19      -82.2063599      -27.7064209
    20       41.1031799       13.3967590
    21      -19.5729427       -6.1761837
    22        8.8967924        2.7206087
    23       -3.8681707       -1.1475620
    24        1.6117378        0.4641758
    25       -0.6446951       -0.1805193
    26        0.2479596        0.0674404
    27       -0.0918369       -0.0243965
    28        0.0327989        0.0084024
    29       -0.0113100       -0.0029076
    30        0.0037700        0.0008624
    31       -0.0012161       -0.0003537
    32        0.0003800        0.0000263
    33       -0.0001152       -0.0000889
    34        0.0000339       -0.0000550
    35       -0.0000097       -0.0000647
    36        0.0000027       -0.0000620
    37       -0.0000007       -0.0000627
    38        0.0000002       -0.0000625
    39        0.0000000       -0.0000626
    40        0.0000000       -0.0000626

The sum settles at -0.0000626 instead of 0.0000454. The tabulated values carry about 7 significant digits, consistent with single precision, so round-off from adding and subtracting terms as large as 2755.7 swamps the small true value.
How to reduce error? Avoid the alternating series:
  $e^{-10} = 1/e^{10} = 1/22026.4658$

Round-off Errors
- Computers can represent numbers only to a finite precision
- Most important for real numbers; integer math can be exact, but its range is limited
- How do computers represent numbers? Binary representation of the integers and real numbers in computer memory
32 bits (23 digits, 8 signed exponent, 1 sign), $2^8 = 256$:
  smallest: $.100\ldots00 \times 2^{-128} \approx 0.14693 \times 10^{-38}$
  largest: $.111\ldots11 \times 2^{127} \approx 0.17014 \times 10^{39}$
64 bits (52 digits, 11 signed exponent, 1 sign), $2^{11} = 2048$:
  smallest: $.100\ldots00 \times 2^{-1024}$
  largest: $.111\ldots11 \times 2^{1023}$
MATLAB uses double precision

Order of operation
Addition problem: 0.99 + 0.0044 + 0.0042 = 0.9986 (exact result)
With 3-digit arithmetic:
  (0.99 + 0.0044) + 0.0042 = 0.994 + 0.0042 = 0.998
  0.99 + (0.0044 + 0.0042) = 0.99 + 0.0086 = 0.999

Round-off error: Cancellation error
  $x^2 - bx + 1 = 0$, with roots $x_1 = \frac{b + r}{2}$, $x_2 = \frac{b - r}{2}$, where $r = \sqrt{b^2 - 4}$
- If b is large, r is close to b
- Difference of two numbers very close to each other: potential for greater error!
Rationalize:
  $x_2 = \frac{b - r}{2} \cdot \frac{b + r}{b + r} = \frac{b^2 - r^2}{2(b + r)} = \frac{4}{2(b + r)} = \frac{2}{b + r}$
Try b = 97: $x^2 - 97x + 1 = 0$ (r = 96.9794)
  $x_2$ (3 sig. figs.)
  exact: 0.01031
  standard: 0.01050
  rationalized: 0.01031
Corresponding to "cancellation, critical arithmetic"

Significant Figures
48.9 mph? 48.95 mph?
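Two of the round-off effects discussed above (the alternating series for $e^{-10}$ and the cancellation in the smaller root of $x^2 - 97x + 1 = 0$) can be reproduced in MATLAB by forcing single precision. The following is a minimal sketch, not the script that produced the slide tables; all variable names are illustrative, and the exact digits printed may differ from the tabulated values.

```matlab
% Round-off demonstrations in single precision (about 7 significant digits).
% Variable names are illustrative, not taken from the original scripts.

% --- 1. exp(-10): alternating series versus reciprocal of exp(10) ---
n = 40;                                      % number of Taylor terms
x = single(10);
term_neg = single(1);  sum_neg = single(1);  % series for exp(-x)
term_pos = single(1);  sum_pos = single(1);  % series for exp(+x)
for i = 1:n
    term_neg = term_neg * (-x) / i;          % terms alternate in sign
    sum_neg  = sum_neg + term_neg;
    term_pos = term_pos * x / i;             % all terms positive
    sum_pos  = sum_pos + term_pos;
end
recip = 1 / sum_pos;                         % exp(-10) as 1/exp(10)

ref_exp    = exp(-10);                       % double-precision reference
err_direct = abs(double(sum_neg) - ref_exp) / ref_exp;
err_recip  = abs(double(recip)   - ref_exp) / ref_exp;
fprintf('direct series : %.7e  (relative error %.1e)\n', sum_neg, err_direct);
fprintf('1/exp(10)     : %.7e  (relative error %.1e)\n', recip, err_recip);

% --- 2. Smaller root of x^2 - b*x + 1 = 0 with b = 97 ---
b = single(97);
r = sqrt(b^2 - 4);                           % r is close to b
x2_std = (b - r) / 2;                        % subtracts nearly equal numbers
x2_rat = 2 / (b + r);                        % rationalized form, no subtraction

bd      = 97;                                % double-precision reference
ref_x2  = 2 / (bd + sqrt(bd^2 - 4));
err_std = abs(double(x2_std) - ref_x2) / ref_x2;
err_rat = abs(double(x2_rat) - ref_x2) / ref_x2;
fprintf('standard      : %.8f  (relative error %.1e)\n', x2_std, err_std);
fprintf('rationalized  : %.8f  (relative error %.1e)\n', x2_rat, err_rat);
```

In both cases the remedy is the same: rearrange the computation so that large, nearly equal quantities are never subtracted from each other.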
Significant Digits
- The places which can be used with confidence
- 32-bit machine: 7 significant digits
- 64-bit machine: 15 to 17 significant digits
- Double precision: reduces round-off error, but increases CPU time
  $\pi = 3.141592653589793238462643\ldots$
  $\sqrt{2} = 1.41421356237310\ldots$
  $e = 2.71828182845904\ldots$

False Significant Figures
3.25/1.96 = 1.65816326530612... (from MATLAB)
But in practice only report 1.65 (chopping) or 1.66 (rounding)!
Why? Because we don't know what is beyond the second decimal place.
Chopping (each input could be as large as 3.259 or 1.969):
  3.259 / 1.960 = 1.66275510204082...
  3.250 / 1.969 = 1.65058405281869...
Rounding (each input could lie anywhere within half a unit of the last digit):
  3.254 / 1.955 = 1.66445012787724...
  3.245 / 1.964 = 1.65224032586558...

Accuracy and precision
- Accuracy: how closely a measured or computed value agrees with the true value
- Precision: how closely individual measured or computed values agree with each other
(Figure axes: more accurate, more precise)
Accuracy is getting all your shots near the target. Precision is getting them close together.

Numerical Errors
- The difference between the true value and the approximation
- True value = approximation + true error
- True error: $E_t = \text{true value} - \text{approximation} = x^* - x$
- Relative error: $\varepsilon_t = \frac{\text{true error}}{\text{true value}} = \frac{x^* - x}{x^*}$, or in percent $\varepsilon_t = \frac{x^* - x}{x^*} \times 100\%$

Approximate Error
- But the true value is not known. If we knew it, we wouldn't have a problem
- Use the approximate error instead:
  $\varepsilon_a = \frac{\text{approximate error}}{\text{approximation}} \times 100\% = \frac{\text{present approx.} - \text{previous approx.}}{\text{present approx.}} \times 100\%$
- Relative error: $\varepsilon_a = \frac{x_{new} - x_{old}}{x_{new}} \times 100\%$

Number Systems
- Base-10 (Decimal): 0,1,2,3,4,5,6,7,8,9
- Base-8 (Octal): 0,1,2,3,4,5,6,7
- Base-2 (Binary): 0,1 (off/on, close/open, negative/positive charge)
- Other non-decimal systems: 1 lb = 16 oz, 1 ft = 12 in, 1/2", 1/4", ...
Decimal system (base 10):
  $5{,}129 = 5 \times 10^3 + 1 \times 10^2 + 2 \times 10^1 + 9 \times 10^0$
  $0.3125 = 3 \times 10^{-1} + 1 \times 10^{-2} + 2 \times 10^{-3} + 5 \times 10^{-4}$
Binary system (base 2):
  $101101 = 1 \times 2^5 + 0 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 45$
  $0.1011 = 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} + 1 \times 2^{-4} = 11/16$

Integer Representation
Signed magnitude method
- Use the first bit of a word to indicate the sign. 0: negative (off), 1: positive (on)
- Remaining bits are used to store a number
  +  1 0 1 0 0 1 0 1 1 0   (sign, number)
- off / on, close / open, negative / positive

Integer Representation
8-bit word: sign | $2^6\; 2^5\; 2^4\; 2^3\; 2^2\; 2^1\; 2^0$ (sign, number)
- smallest number: $0000000_2 = 0_{10}$
- largest number: $1111111_2 = 127_{10}$
- +0000000 and -0000000 are the same, therefore we may use "-0" to represent "-128"
- Total numbers = $2^8 = 256$ (-128 to 127)

Integer Representation
16-bit word
  $1 \times 2^{14} + 1 \times 2^{13} + \cdots + 1 \times 2^{1} + 1 \times 2^{0} = 32{,}767$
- Range: -32,768 to 32,767
- Overflow: > 32,767 (cannot represent 43,000 A&M students)
- Underflow: < -32,768 (magnitude too large)
32-bit word
- Range: -2,147,483,648 to 2,147,483,647
- 9 significant digits
- Overflow: world population of 6 billion
- Underflow: budget deficit of -$100 billion

Integer Operations
- Integer arithmetic can be exact as long as you don't get remainders in division (7/2 = 3 in integer math) or exceed the maximum integer
- For an 8-bit computer, max = 127 (min = -128)
- So 123 + 45 = overflow and -74 * 2 = underflow

Floating-Point Representation
- Real numbers (also called floating-point numbers) are represented differently
- Used for fractions or very large numbers
- Stored as: sign | signed exponent | mantissa
- sign is 1 or 0 for negative or positive
- exponent is the (positive or negative) power of the base
- mantissa contains the significant digits

Floating-Point Representation
  $\pm \mid e_1 e_2 \ldots e_m \mid d_1 d_2 d_3 \ldots d_p$   (sign of number | signed exponent | mantissa)
  $N = \pm\, .d_1 d_2 d_3 \ldots d_p \times B^e = m B^e$
- m: mantissa
- B: base of the number system
- e: "signed" exponent
- Note: the mantissa is usually "normalized" if the leading digit is zero
(Figure labels: integer representation; floating-point number representation)

Decimal Representation
8-digit word: $\pm \mid \pm\; 10^{1}\; 10^{0} \mid 10^{-1}\; 10^{-2}\; 10^{-3}\; 10^{-4}$   (sign | signed exponent | number)
1|095|1467 (base: B = 10)
- mantissa: $m = -(1 \times 10^{-1} + 4 \times 10^{-2} + 6 \times 10^{-3} + 7 \times 10^{-4}) = -0.1467$
- signed exponent: $e = +(9 \times 10^{1} + 5 \times 10^{0}) = 95$
  $10951467_{10} \rightarrow m B^e = -0.1467 \times 10^{95}$

Floating-Point Representation
8-bit word (without normalization): $\pm \mid \pm\; 2^{1}\; 2^{0} \mid 2^{-1}\; 2^{-2}\; 2^{-3}\; 2^{-4}$   (sign | signed exponent | number)
0|111|0101 (base: B = 2)
- mantissa: $m = +(0 \times 2^{-1} + 1 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4}) = 5/16$
- signed exponent: $e = -(1 \times 2^{1} + 1 \times 2^{0}) = -3$
  $01110101_{2} \rightarrow m B^e = (5/16) \times 2^{-3} = 5/128$

Normalization
  $1\ \text{in}^2 = (1/144)\ \text{ft}^2 = 0.006944\ \text{ft}^2$ (less accurate)
  $1\ \text{in}^2 = 0.694444 \times 10^{-2}\ \text{ft}^2$ (normalized)
- Remove the leading zero by lowering the exponent (in base 2, $d_1 = 1$ for all numbers)
  $\frac{1}{B} \le m < 1$
  base 10: $\frac{1}{10} = 0.1 \le m < 1$
  base 2: $\frac{1}{2} \le m < 1$
- If m < 1/2, multiply by 2 (and lower the exponent by 1) to remove the leading 0
- Floating-point numbers allow fractions and very large numbers to be represented, but take up more memory and CPU time

Binary Representation
8-bit word (with normalization): $\pm \mid \pm\; 2^{1}\; 2^{0} \mid 2^{-1}\; 2^{-2}\; 2^{-3}\; 2^{-4}$   (sign | signed exponent | number)
1|011|1001 (base: B = 2)
- mantissa: $m = -(1 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4}) = -9/16$
- signed exponent: $e = +(1 \times 2^{1} + 1 \times 2^{0}) = 3$
  $10111001_{2} \rightarrow m B^e = -(9/16) \times 2^{3} = -9/2$

Single Precision
- A real variable (number) is stored in four 8-bit words (4 bytes), or 32 bits (64 bits for supercomputers)
- bit (binary digit): 0 or 1
- byte: 8 bits, $2^8 = 256$ possible values (a 4-bit half-byte has $2^4 = 16$ possible values)
- 32 bits: 23 for the digits, 8 for the signed exponent, 1 for the sign
  smallest: $.100\ldots00 \times 2^{-127} \approx 0.29387 \times 10^{-38}$
  largest: $.111\ldots11 \times 2^{128} \approx 0.34028 \times 10^{39}$

Double Precision
- A real variable is stored in eight 8-bit words (8 bytes), or 64 bits
- 16 words, 128 bits for supercomputers
- 64 bits: 52 for the digits, 11 for the signed exponent, 1 for the sign
- signed exponent: $2^{10} = 1024$
  smallest: $.100\ldots00 \times 2^{-1024}$
  largest: $.111\ldots11 \times 2^{1023}$

Round-off Errors
- Floating-point characteristics contribute to round-off error (limited bits for storage)
- Only a limited range of quantities can be represented
- Only a finite number of quantities can be represented
- The interval between numbers increases as the numbers grow
Example with three significant digits:
  0.0100  0.0101  0.0102  ...  0.0999   (0.0001 increment)
  0.100   0.101   0.102   ...  0.999    (0.001 increment)
  1.00    1.01    1.02    ...  9.99     (0.01 increment)

MATLAB
- Only a finite number of real quantities (integers, real numbers or text) can be represented
- For 8-bit, $2^8 = 256$ quantities
- For 16-bit, $2^{16} = 65536$ quantities
- MATLAB uses double precision: 8 bytes = 64 bits, more than $10^{19}$ ($2^{64}$) quantities
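The growing interval between representable numbers can be checked directly in MATLAB. The following is a minimal sketch that relies only on the built-in functions `eps`, `realmax`, and `single`; the sample magnitudes in `x` are arbitrary choices for illustration. Note that MATLAB follows the IEEE 754 layout, so its smallest normalized values differ slightly from the simplified word layouts used in these slides.

```matlab
% Spacing between adjacent double-precision numbers at several magnitudes.
% eps(v) returns the distance from v to the next larger representable number.
x = [1 10 100 1e6 1e15];          % arbitrary sample magnitudes
spacing = zeros(size(x));
for k = 1:numel(x)
    spacing(k) = eps(x(k));
    fprintf('x = %-8g  spacing = %.3e\n', x(k), spacing(k));
end

% Single precision is much coarser at the same magnitude.
eps_single = eps(single(1));      % 2^-23
eps_double = eps(1);              % 2^-52
fprintf('eps single = %.3e, eps double = %.3e\n', eps_single, eps_double);

% Largest representable values in each precision.
fprintf('realmax single = %.5e\n', realmax('single'));
fprintf('realmax double = %.5e\n', realmax);

% A contribution smaller than half the spacing is lost entirely.
lost = (1 + eps_double/4) == 1;   % true: the small addend rounds away
fprintf('1 + eps/4 == 1 is %d\n', lost);
```

The value printed for `realmax('single')` corresponds to the single-precision upper limit of about $0.34028 \times 10^{39}$ quoted above.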