1.6 Round-off errors in floating point computations.

1.6.1 Round-off errors.

When people or computers do computations with floating point numbers, they usually round the result of each arithmetic operation to a fixed number of digits of precision. This introduces additional errors into the final result, called round-off errors. Usually round-off errors are insignificant compared to measurement errors or truncation errors, but sometimes they are larger. This happens when the result of an addition or subtraction is significantly smaller in magnitude than the numbers being added or subtracted. In some cases the round-off errors are serious enough to make the final result meaningless.

Example 1. An object moves along a straight line so that its position $x$ at time $t$ is given by $x = t^3$. Let $t_0 = 10$ and $t_1 = 10 + h$ be two times, and let $x_0 = t_0^3 = 10^3 = 1000$ and $x_1 = t_1^3 = (10+h)^3$ be the corresponding positions. The displacement $\Delta x$ is the change in position, i.e. $\Delta x = x_1 - x_0 = (10+h)^3 - 1000$. Suppose $h = 0.014$.

a. Compute $\Delta x$ exactly.
b. Compute $\Delta x$ using four digit decimal floating point arithmetic. What is the error in the result?
c. An alternative formula for $\Delta x$ is $\Delta x = 3t_0^2 h + 3t_0 h^2 + h^3 = 300h + 30h^2 + h^3$. Compute $\Delta x$ using this alternative formula, again using four digit decimal floating point arithmetic. What is the error in the result? How does this compare with part b?

Solution.

a. Compute $\Delta x$ exactly.

$10 + h = 10 + 0.014 = 10.014$
$(10 + h)^2 = (10.014)^2 = 100.280196$
$(10 + h)^3 = (100.280196)(10.014) = 1004.205882744$
$\Delta x = 1004.205882744 - 1000 = 4.205882744$

b. Compute $\Delta x$, rounding results to four digits after each operation. In the following, $\to$ indicates rounding and a subscript $a$ indicates an approximate value.

$10 + h = 10.014 \to (10 + h)_a = 10.01$
$[(10 + h)_a]^2 = (10.01)^2 = 100.2001 \to [(10 + h)^2]_a = 100.2$
$[(10 + h)^2]_a (10 + h)_a = (100.2)(10.01) = 1003.002 \to [(10 + h)^3]_a = 1003$
$[(10 + h)^3]_a - 1000 = 1003 - 1000 = 3 \to [\Delta x]_a = 3$

Absolute error $= 4.205882744 - 3 = 1.205882744$
Relative error $= 1.205882744/3 \approx 0.4 = 40\%$

c. Compute $\Delta x$ using the alternative formula.

$300h = (300)(0.014) = 4.2$
$h^2 = (0.014)^2 = 0.000196$
$30h^2 = (30)(0.000196) = 0.00588$
$h^3 = 0.000002744$
$300h + 30h^2 + h^3 = 4.205882744 \to [\Delta x]_a = 4.206$

Absolute error $= |4.205882744 - 4.206| = 0.000117256$
Relative error $= 0.000117256/4.206 \approx 0.00003 = 0.003\%$

This is much better than b.

This example illustrates that one formula for computing a quantity can be better than another, mathematically equivalent formula from the standpoint of round-off error. It also raises several questions. Why did the result in part b have such a large relative error while the result in part c did not? Would the same be true if we did the calculations with more digits of precision? Is there a way to predict how large the round-off error might be before we do the computation?

Estimating the round-off error in a computation is often difficult. One way is to repeat the computation with more digits of precision. By comparing the two values one can estimate the round-off error in the value obtained with fewer digits. Another way is to estimate the round-off error at each step of the computation.

If we look carefully, we can see that the computation in part b lost about three digits of precision in the final subtraction $1003 - 1000 = 3$. Until then the intermediate results had almost four digits of precision. The two numbers we subtracted, 1003 and 1000, were close in the relative sense, so the result was a number, 3, that was much smaller than either.
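
The four digit computations of Example 1 can be checked mechanically. The following is a minimal sketch, assuming Python's standard `decimal` module as a stand-in for four digit decimal floating point arithmetic with round-half-up rounding. The function names `displacement_direct` and `displacement_expanded` are hypothetical and chosen only for this illustration.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Assumed model of "four digit decimal arithmetic": every operation
# performed through this context rounds its result to 4 significant digits.
four = Context(prec=4, rounding=ROUND_HALF_UP)


def displacement_direct(h, ctx):
    """Part b: (10 + h)^3 - 1000, rounding after each operation."""
    s = ctx.add(Decimal(10), h)                 # 10.014 -> 10.01
    cube = ctx.multiply(ctx.multiply(s, s), s)  # 100.2, then 1003
    return ctx.subtract(cube, Decimal(1000))    # cancellation happens here


def displacement_expanded(h, ctx):
    """Part c: 300h + 30h^2 + h^3, rounding after each operation."""
    h2 = ctx.multiply(h, h)
    h3 = ctx.multiply(h2, h)
    total = ctx.add(ctx.multiply(Decimal(300), h),
                    ctx.multiply(Decimal(30), h2))
    return ctx.add(total, h3)


h = Decimal("0.014")
exact = (10 + h) ** 3 - 1000   # default 28-digit context, exact for this h

for name, value in [("direct", displacement_direct(h, four)),
                    ("expanded", displacement_expanded(h, four))]:
    # Relative error measured against the computed value, as in the text.
    rel = abs(exact - value) / abs(value)
    print(name, value, float(rel))
```

The direct formula returns 3 and the expanded formula returns 4.206, matching parts b and c. Changing `prec` in the context is an easy way to explore how the two formulas behave with more digits.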
The computation in part c did not involve the subtraction of two nearly equal numbers, so its only round-off errors were small in the relative sense.

Example 2. Redo part b of Example 1 for a general $h$ that is small with respect to 10, estimating the round-off error when the computations are done on a computer with machine epsilon equal to $\epsilon$. (In parts b and c of Example 1 one has $\epsilon = 5 \times 10^{-4} = 0.0005$.) For simplicity you may make approximations in the calculation of the error.

Solution. Recall from section 1.5 that the relative error between a number $x$ and its rounded value is no more than $\epsilon$. The first step in the calculation of $(10+h)^3 - 1000$ is to round $h$. The rounded value of $h$ may have a relative error as large as $\epsilon$ and an absolute error as large as $\epsilon h$.

The next step is to compute $10+h$. Before rounding, the absolute error might be as much as the absolute error in $h$, which is $\epsilon h$, and the relative error as much as $\epsilon h/(10 + h)$, which is about $\epsilon h/10$. Rounding introduces another relative error of approximately $\epsilon$, which is added to $\epsilon h/10$, giving $\epsilon + \epsilon h/10 \approx \epsilon$ since $h$ is small.

Next we multiply $10+h$ by itself to get $(10+h)^2$, and then multiply $(10+h)^2$ by $10+h$ to get $(10+h)^3$. Each step requires a multiplication and a rounding. In a multiplication the relative errors of the factors add (approximately), and the rounding adds a further relative error of $\epsilon$. So $(10+h)^2$ has a relative error of about $\epsilon + \epsilon + \epsilon = 3\epsilon$, and $(10+h)^3$ has a relative error of about $3\epsilon + \epsilon + \epsilon = 5\epsilon$. It follows that the computed value of $(10+h)^3$ may have an absolute error of about $5\epsilon(10+h)^3$, which is about $5000\epsilon$.

Finally, we compute $(10+h)^3 - 1000$. In subtraction the absolute errors add. So before rounding, the value of $(10+h)^3 - 1000$ has an absolute error of about $5000\epsilon$ and a relative error of about $5000\epsilon/[(10+h)^3 - 1000]$. Since $(10+h)^3 - 1000 \approx 300h$, the relative error of $(10+h)^3 - 1000$ is about $5000\epsilon/(300h) \approx 17\epsilon/h$. In part b one had $h = 0.014$, so the worst case relative error is about $1200\epsilon$. If $\epsilon = 5 \times 10^{-4}$, then the relative error might be as large as 0.6. In fact it was only about 0.4, which is about 2/3 of the worst case.

Example 3.
The equation $y = \sqrt{4 - x}$ describes the top half of the parabola $x = 4 - y^2$. If one starts at $x$ on the $x$ axis and goes left to $x = 0$, then the change in the $y$ values is $\Delta y = 2 - \sqrt{4 - x}$. Suppose $x = 0.0016$.

a. Compute $\Delta y$ exactly.
b. Compute $\Delta y$ using four digit decimal floating point arithmetic. What is the error in the result?
c. An alternative formula for $\Delta y$ is $\Delta y = \dfrac{x}{2 + \sqrt{4 - x}}$. Compute $\Delta y$ using this alternative formula, again using four digit decimal floating point arithmetic. What is the error in the result? How does this compare with part b?

Solution.

a. Compute $\Delta y$ exactly.

$4 - x = 4 - 0.0016 = 3.9984$
$\sqrt{4 - x} = \sqrt{3.9984} = 1.999599960\ldots$
$\Delta y = 2 - \sqrt{4 - x} = 2 - 1.999599960\ldots = 0.000400040\ldots$

b. Compute $\Delta y$, rounding results to four digits after each operation.

$4 - x = 3.9984 \to (4 - x)_a = 3.998$
$\sqrt{(4 - x)_a} = \sqrt{3.998} = 1.999499937 \to [\sqrt{4 - x}]_a = 1.999$
$2 - [\sqrt{4 - x}]_a = 2 - 1.999 = 0.001 \to [\Delta y]_a = 0.001$

Absolute error $= |0.000400040\ldots - 0.001| = 0.000599960 \approx 0.0006$
Relative error $\approx 0.0006/0.001 = 0.6 = 60\%$

c. Compute $\Delta y$ using the alternative formula.

$4 - x = 3.9984 \to (4 - x)_a = 3.998$
$\sqrt{(4 - x)_a} = \sqrt{3.998} = 1.999499937 \to [\sqrt{4 - x}]_a = 1.999$
$2 + [\sqrt{4 - x}]_a = 2 + 1.999 = 3.999 \to [2 + \sqrt{4 - x}]_a = 3.999$
$x / [2 + \sqrt{4 - x}]_a = 0.0016/3.999 = 0.0004001000250 \to [\Delta y]_a = 0.0004001$

Absolute error $= |0.000400040\ldots - 0.0004001| \approx 0.00000006$
Relative error $\approx 0.00000006/0.0004 = 0.00015 = 0.015\%$

This is much better than b.

Problems.

1. An empty bowl is placed on a scale and the scale reads 4.62 lb. Some cherries are placed in the bowl, the bowl is placed on the scale again, and the scale now reads 4.96 lb. Suppose the value that the scale reads may be in error by as much as 0.08 lb. Approximately how much do the cherries weigh and what is the relative error in this value?

Answer. The cherries weigh approximately $4.96 - 4.62 = 0.34$ lb. However, both of the numbers 4.96 and 4.62 could be off by as much as 0.08, so the difference could be off by as much as 0.16.
So the relative error in the weight 0.34 is $0.16/0.34 \approx 0.47 = 47\%$.

2. An object moves along a straight line. Its position at time $t_0 = 200$ sec is $x_0 = 700$ ft and its position at time $t_1 = \frac{601}{3}$ sec is $x_1 = \frac{2102}{3}$ ft.

a. What is the exact velocity $v = (x_1 - x_0)/(t_1 - t_0)$?
b. Suppose the velocity is computed using four digit decimal arithmetic. What value is computed for the velocity and what is the relative error in this value?

3. An object moves along a straight line in such a fashion that its position $x$ is given in terms of time $t$ by the formula $x = 2t^2$. If $t_0$ and $t_1$ are two times and $x_0$ and $x_1$ are the corresponding positions, then its average velocity over this time interval is $v = (x_1 - x_0)/(t_1 - t_0)$. Suppose $t_0 = 1$ sec and $t_1 = 1 + \frac{1}{300}$ sec.

a. What is the exact velocity $v = (x_1 - x_0)/(t_1 - t_0)$ over this time interval?
b. Suppose the velocity is computed using four digit decimal arithmetic. What value is computed for the velocity and what is the relative error in this value?
c. Find another way of computing the velocity in this situation using four digit decimal arithmetic that has a smaller relative error.

4. Radioactive carbon-14 decays into nitrogen with a half life of $t_{1/2} = 5{,}730 \pm 30$ years. Suppose at time $t = 0$ one has a sample containing $N = 3.65$ grams of carbon-14. At time $t$ (years) the amount $A$ of nitrogen that has been produced from the decay of the carbon-14 is given by $A = N(1 - e^{-kt})$, where $k = \frac{\ln 2}{t_{1/2}} = 1.21 \times 10^{-4}$ yr$^{-1}$. Find the value of $A$ when $t = \frac{1}{(24)(365.25)}$, which corresponds to a time of 1 hour. Suppose one does the calculations with a calculator with eight digits of precision. What problem occurs in the calculation?

Example 4. Consider the calculation of $y = 1 + x + x^2/2! + x^3/3! + \cdots + x^n/n!$ discussed in section 1.1. Let's estimate the round-off error in the computation when $x = -5.5$ and $n = 25$, and the calculations are done with decimal floating point numbers with six digits of precision. In this case $\epsilon = 5 \times 10^{-6}$.

Solution.
The answer depends somewhat on the algorithm used to compute the sum. Note that $y = y_{25}$, where

$y_j = 1 + x + x^2/2! + x^3/3! + \cdots + x^j/j! = 1 + t_1 + t_2 + t_3 + \cdots + t_j = y_{j-1} + t_j$
$t_j = x^j/j! = q_j/f_j$
$q_j = x^j = x q_{j-1}$
$f_j = j! = j f_{j-1}$

Suppose one uses the following algorithm.

    x = -5.5; n = 25;
    y[0] = 1; q[0] = 1; f[0] = 1;
    for j = 1 to n do
    begin
        q[j] = x * q[j-1];
        f[j] = j * f[j-1];
        t[j] = q[j] / f[j];
        y[j] = y[j-1] + t[j]
    end

One way to estimate the round-off error is to do the computation first using six digits of precision and then using more digits. This is done in Example 9.5.2a in Section 1.9.5 below. With six digits of precision one obtains 0.00405471, with 10 digits one obtains 0.00408674, and with 14 digits one obtains 0.00408673. It appears that the true value is about 0.0040867, so the six digit calculation is off by about $3 \times 10^{-5}$, which is about a 1% error.

Another way to estimate the round-off error is to estimate the error at each stage of the computation. This can be somewhat complicated, as we saw in Example 2.

First consider the error in $q_j = x^j$. The values of $q_0$ and $q_1$ can be represented exactly. In this particular example the values of $q_2$ and $q_3$ can also be represented exactly, but if $x$ had some other value this might not be true. So we shall give an estimate that holds for any value of $x$. To get $q_j$ we multiply $q_{j-1}$ by $x$ and round. Before rounding, the relative error in $x q_{j-1}$ is no more than the relative error in $q_{j-1}$. Rounding introduces an additional relative error of no more than $\epsilon$. So the relative error in $q_j$ is no more than about (relative error in $q_{j-1}$) $+\ \epsilon$. Since $q_1$ is exact, it follows that the relative error in $q_j$ is no more than approximately $(j-1)\epsilon$. Similarly, the relative error in $f_j$ is no more than approximately $(j-1)\epsilon$.

Now consider the error in $t_j = q_j / f_j$. Before rounding, the relative error in $t_j$ is no more than approximately the sum of the relative errors in $q_j$ and $f_j$, which is $2(j-1)\epsilon$.
Rounding introduces an additional relative error of no more than approximately $\epsilon$, so the relative error in $t_j$ is no more than about $(2j-1)\epsilon$. The absolute error in $t_j$ is no more than about $(2j-1)\epsilon |t_j|$.

Finally consider the error in $y_j = y_{j-1} + t_j$. Before rounding, the absolute error in $y_j$ is no more than the sum of the absolute errors in $y_{j-1}$ and $t_j$. Rounding introduces an additional absolute error of no more than $\epsilon |y_j|$. So the absolute error in $y_j$ is no more than about $\epsilon b_j$, where

$b_j = \sum_{k=2}^{j} (2k-1)|t_k| + \sum_{k=2}^{j} |y_k|$.

This value is computed for $j = 25$ in Example 2 in section 1.6.2. One obtains $b_{25} \approx 2559$, so the error in $y_{25}$ is bounded by about $\epsilon b_{25} \approx 0.013$.

The terms $|t_j|$ start at 5.5 for $j = 1$, go 15..., 27..., 38..., 41... for $j = 2, 3, 4, 5$, and then start to decrease. The terms $(2j-1)|t_j|$ start at 45... for $j = 2$, go 138..., 266..., 377..., 422... for $j = 3, 4, 5, 6$, and then start to decrease. The terms $|y_j|$ start at 4.5 for $j = 1$, go 10..., 17..., 21... for $j = 2, 3, 4$, and then start to decrease. It turns out that the main contributions to $b_{25}$ are the terms $(2k-1)|t_k|$ for $k$ between 3 and 12.

This estimate of the error is quite a bit larger than the one obtained above by repeating the computation with more digits. This is because it assumes the worst possible case at each step.
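
Both estimates in Example 4 can be tried out numerically. The following is a minimal sketch, assuming Python's standard `decimal` module as a model of decimal floating point arithmetic with a chosen number of digits and round-half-up rounding. The names `partial_sum` and `worst_case_sum` are hypothetical. The low-precision digits printed need not match those quoted in the text exactly, since they depend on the rounding rule and the software used.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_UP


def partial_sum(x, n, digits):
    """y_n = 1 + x + x^2/2! + ... + x^n/n!, using the q/f/t/y recurrences
    and rounding every operation to `digits` significant digits."""
    ctx = Context(prec=digits, rounding=ROUND_HALF_UP)
    x = ctx.create_decimal(x)
    q = f = y = Decimal(1)
    for j in range(1, n + 1):
        q = ctx.multiply(x, q)           # q_j = x * q_{j-1}
        f = ctx.multiply(Decimal(j), f)  # f_j = j * f_{j-1}
        t = ctx.divide(q, f)             # t_j = q_j / f_j
        y = ctx.add(y, t)                # y_j = y_{j-1} + t_j
    return y


def worst_case_sum(x, n):
    """b_n = sum_{k=2}^n (2k-1)|t_k| + sum_{k=2}^n |y_k|, evaluated in
    ordinary double precision (accurate enough for an error estimate)."""
    t, y, b = 1.0, 1.0, 0.0
    for k in range(1, n + 1):
        t *= x / k
        y += t
        if k >= 2:
            b += (2 * k - 1) * abs(t) + abs(y)
    return b


# First estimate: repeat the computation with more digits and compare.
for digits in (6, 10, 14):
    print(digits, partial_sum("-5.5", 25, digits))

# Second estimate: the worst-case bound eps * b_25 with eps = 5e-6.
b25 = worst_case_sum(-5.5, 25)
print("b25 =", b25, " bound =", 5e-6 * b25)
print("exp(-5.5) =", math.exp(-5.5))
```

Comparing the six digit result with the 14 digit result gives the observed error, which should fall well inside the worst-case bound $\epsilon b_{25}$.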