1.6 Round-off errors in floating point computations.

1.6.1 Round-off errors.
When people or computers do computations with floating point numbers, they usually round the result of each
arithmetic operation to a certain fixed number of digits of precision. This introduces additional errors into the
final result called round-off errors. Usually round-off errors are insignificant compared to errors in
measurement or truncation errors, but sometimes they will actually be larger. This is the case when the result of
an addition or subtraction is significantly smaller in magnitude than the numbers which one is adding or
subtracting. In some cases the round-off errors can be serious enough to cause the final result to be meaningless.
Example 1. An object moves along a straight line so that its position $x$ at time $t$ is given by $x = t^3$. Let $t_0 = 10$ and $t_1 = 10 + h$ be two times and $x_0 = t_0^3 = 10^3 = 1000$ and $x_1 = t_1^3 = (10+h)^3$ be the corresponding positions. The displacement $\Delta x$ is the change in position, i.e. $\Delta x = x_1 - x_0 = (10+h)^3 - 1000$. Suppose $h = 0.014$.
a. Compute $\Delta x$ exactly.
b. Compute $\Delta x$ doing the calculations using four digit decimal floating point arithmetic. What is the error in the result?
c. An alternative formula for $\Delta x$ is $\Delta x = 3t_0^2 h + 3t_0 h^2 + h^3 = 300h + 30h^2 + h^3$. Compute $\Delta x$ using this alternative formula, again doing the calculations using four digit decimal floating point arithmetic. What is the error in the result? How does this compare with part b?
Solution. a. Compute $\Delta x$ exactly.
$10 + h = 10 + 0.014 = 10.014$
$(10 + h)^2 = (10.014)^2 = 100.280196$
$(10 + h)^3 = (100.280196)(10.014) = 1004.205882744$
$\Delta x = 1004.205882744 - 1000 = 4.205882744$
b. Compute $\Delta x$ rounding results to four digits after each operation. In the following, $\to$ indicates rounding and a subscript $a$ indicates an approximate value.
$10 + h = 10.014 \to (10 + h)_a = 10.01$
$[(10 + h)_a]^2 = (10.01)^2 = 100.2001 \to [(10 + h)^2]_a = 100.2$
$[(10 + h)^2]_a (10 + h)_a = (100.2)(10.01) = 1003.002 \to [(10 + h)^3]_a = 1003$
$[(10 + h)^3]_a - 1000 = 1003 - 1000 = 3 \to [\Delta x]_a = 3$
Absolute error $= 4.205882744 - 3 = 1.205882744$
Relative error $= 1.205882744/3 \approx 0.4 = 40\%$
c. Compute $\Delta x$ using the alternative formula.
$300h = (300)(0.014) = 4.2$
$h^2 = (0.014)^2 = 0.000196$
$30h^2 = (30)(0.000196) = 0.00588$
$h^3 = 0.000002744$
$300h + 30h^2 + h^3 = 4.205882744 \to [\Delta x]_a = 4.206$
Absolute error $= |4.205882744 - 4.206| = 0.000117256$
Relative error $= 0.000117256/4.206 \approx 0.00003 = 0.003\%$
This is much better than b.
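The four digit arithmetic of parts b and c can be mimicked with Python's decimal module. The following is only a sketch: the function names are invented for this illustration, and round-half-up is an assumption about how ties are handled (no ties actually occur for h = 0.014).

```python
from decimal import Decimal, Context, ROUND_HALF_UP, localcontext


def displacement_direct(h, digits=4):
    """(10 + h)^3 - 1000, rounded to `digits` significant digits after each operation."""
    with localcontext(Context(prec=digits, rounding=ROUND_HALF_UP)):
        t1 = Decimal(10) + h          # 10.014 -> 10.01
        square = t1 * t1              # 100.2001 -> 100.2
        cube = square * t1            # 1003.002 -> 1003
        return cube - Decimal(1000)   # subtraction of two nearly equal numbers


def displacement_expanded(h, digits=4):
    """300h + 30h^2 + h^3, rounded to `digits` significant digits after each operation."""
    with localcontext(Context(prec=digits, rounding=ROUND_HALF_UP)):
        h2 = h * h
        h3 = h2 * h
        return Decimal(300) * h + Decimal(30) * h2 + h3


if __name__ == "__main__":
    h = Decimal("0.014")
    print(displacement_direct(h))    # value found in part b
    print(displacement_expanded(h))  # value found in part c
```

With h = Decimal("0.014") the first function returns 3 and the second returns 4.206, in agreement with the hand computations.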
This example illustrates that sometimes one formula for computing a quantity is better than another equivalent
formula from the standpoint of round-off error. It also raises several questions. Why did the result in part b
have such a large relative error while the result in part c did not? Would the same be true if we did the calculations
with more digits of precision? Is there a way to describe/predict how large the round-off error might be ahead of
time before we do the computation?
Estimating the round-off error in a given computation is often difficult. One way is to repeat the
computation with more digits of precision. By comparing the two values one can
estimate the round-off error in the value obtained with fewer digits of precision. Another way is to try to
estimate the round-off error at each step of the computation.
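The first approach, repeating the computation with more digits, might be sketched as follows for the direct formula of Example 1. The helper names and the choice of 12 digits for the reference run are assumptions made for this illustration.

```python
from decimal import Decimal, Context, localcontext


def cube_minus_1000(h, digits):
    """(10 + h)^3 - 1000 with every operation rounded to `digits` significant digits."""
    with localcontext(Context(prec=digits)):
        t1 = Decimal(10) + h
        return t1 * t1 * t1 - Decimal(1000)


def estimate_roundoff(h, low_digits=4, high_digits=12):
    """Estimate the round-off error of the low precision run from a higher precision run."""
    low = cube_minus_1000(h, low_digits)
    high = cube_minus_1000(h, high_digits)
    return low, high, abs(high - low)


if __name__ == "__main__":
    low, high, estimate = estimate_roundoff(Decimal("0.014"))
    print(low, high, estimate)
```

For h = 0.014 the estimate is about 1.2059, which agrees with the absolute error found in part b of Example 1.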
If we look carefully we can see that in the computation in part b we lost about three digits of precision when we
did the final subtraction 1003 – 1000 = 3. Until then the intermediate results had almost four digits of precision.
The two numbers we subtracted, 1003 and 1000, were close in the relative sense so the result was a number, 3,
that was much smaller than either. The computation in part c did not involve the subtraction of two nearly equal
numbers, so the only round-off errors were small in the relative sense.
Example 2. Redo part b of Example 1 with a general $h$ which is small with respect to 10, assuming the computations are done on a computer with machine epsilon equal to $\varepsilon$. (In parts b and c of Example 1 one has $\varepsilon = 5 \times 10^{-4} = 0.0005$.) For simplicity you may make approximations in the calculation of the error.
Solution. Recall from section 1.5 that the relative error between a number $x$ and its rounded value is no more than $\varepsilon$. The first thing we do in the calculation of $(10+h)^3 - 1000$ is to round $h$. The rounded value of $h$ may have a relative error as large as $\varepsilon$ and an absolute error as large as $\varepsilon h$. The next thing to do is to compute $10+h$. Before rounding the absolute error might be as much as the absolute error in $h$, which is $\varepsilon h$, and the relative error as much as $\varepsilon h/(10 + h)$, which is about $\varepsilon h/10$. Rounding introduces another relative error of approximately $\varepsilon$, which is added to $\varepsilon h/10$ giving $\varepsilon + \varepsilon h/10 \approx \varepsilon$. Next we multiply $10+h$ by itself to get $(10+h)^2$ and then multiply $(10+h)^2$ by $10+h$ to get $(10+h)^3$. Each step requires a multiplication and a rounding. In the multiplication the relative errors of the things we are multiplying add (approximately) and the rounding adds an additional relative error of $\varepsilon$. So $(10+h)^2$ may have a relative error of about $3\varepsilon$, and it follows that the computed value of $(10+h)^3$ may have a relative error of about $5\varepsilon$ and an absolute error of about $5\varepsilon(10+h)^3$, which is about $5000\varepsilon$. Finally, we compute $(10+h)^3 - 1000$. In subtraction the absolute errors add. So before rounding the value of $(10+h)^3 - 1000$ has an absolute error of about $5000\varepsilon$ and a relative error of about $5000\varepsilon/[(10+h)^3 - 1000]$. Since $(10+h)^3 - 1000 \approx 300h$, the relative error of $(10+h)^3 - 1000$ is about $5000\varepsilon/(300h) \approx 17\varepsilon/h$. In part b one had $h = 0.014$, so the worst case relative error is about $1200\varepsilon$. If $\varepsilon = 5 \times 10^{-4}$, then the relative error might be as large as 0.6. In fact it was only about 0.4, which is about 2/3 of the worst case.
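The estimate just derived is easy to evaluate numerically. The sketch below uses ordinary binary floating point, which is accurate enough for a rough bound; the function name is invented for this illustration, and the factor 5 is the count of accumulated relative errors from the argument above.

```python
def worst_case_relative_error(h, eps):
    """Approximate worst-case relative error of (10 + h)^3 - 1000.

    Uses the estimate from the text: the computed cube carries a relative
    error of about 5*eps, and the subtraction leaves the absolute error
    unchanged while shrinking the result.
    """
    cube = (10.0 + h) ** 3
    abs_error_in_cube = 5.0 * eps * cube      # about 5000 * eps
    return abs_error_in_cube / (cube - 1000.0)


if __name__ == "__main__":
    # h and eps from part b of Example 1
    print(worst_case_relative_error(0.014, 5e-4))
```

For h = 0.014 and eps = 5e-4 this gives a value close to 0.6, compared with the observed relative error of about 0.4.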
Example 3. The equation $y = \sqrt{4 - x}$ describes the top half of the parabola $x = 4 - y^2$. If one starts at $x$ on the $x$ axis and goes left to $x = 0$, then the change in the $y$ values is $\Delta y = 2 - \sqrt{4 - x}$. Suppose $x = 0.0016$.
a. Compute $\Delta y$ exactly.
b. Compute $\Delta y$ doing the calculations using four digit decimal floating point arithmetic. What is the error in the result?
c. An alternative formula for $\Delta y$ is $\Delta y = \dfrac{x}{2 + \sqrt{4 - x}}$. Compute $\Delta y$ using this alternative formula, again doing the calculations using four digit decimal floating point arithmetic. What is the error in the result? How does this compare with part b?
Solution. a. Compute $\Delta y$ exactly.
$4 - x = 4 - 0.0016 = 3.9984$
$\sqrt{4 - x} = \sqrt{3.9984} = 1.999599960\ldots$
$\Delta y = 2 - \sqrt{4 - x} = 2 - 1.999599960\ldots = 0.000400040\ldots$
b. Compute $\Delta y$ rounding results to four digits after each operation.
$4 - x = 3.9984 \to (4 - x)_a = 3.998$
$\sqrt{(4 - x)_a} = \sqrt{3.998} = 1.999499937\ldots \to [\sqrt{4 - x}]_a = 1.999$
$2 - [\sqrt{4 - x}]_a = 2 - 1.999 = 0.001 \to [\Delta y]_a = 0.001$
So
Absolute error $= |0.000400040\ldots - 0.001| = 0.000599960\ldots \approx 0.0006$
Relative error $\approx 0.0006/0.001 = 0.6 = 60\%$
c. Compute $\Delta y$ using the alternative formula.
$4 - x = 3.9984 \to (4 - x)_a = 3.998$
$\sqrt{(4 - x)_a} = \sqrt{3.998} = 1.999499937\ldots \to [\sqrt{4 - x}]_a = 1.999$
$2 + [\sqrt{4 - x}]_a = 2 + 1.999 = 3.999 \to [2 + \sqrt{4 - x}]_a = 3.999$
$\dfrac{x}{[2 + \sqrt{4 - x}]_a} = \dfrac{0.0016}{3.999} = 0.0004001000250\ldots \to [\Delta y]_a = 0.0004001$
Absolute error $= |0.000400040\ldots - 0.0004001| \approx 0.00000006$
Relative error $= 0.00000006/0.0004 \approx 0.00015 = 0.015\%$
This is much better than b.
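The two ways of computing the change in y can be compared in simulated four digit arithmetic. This is only a sketch using Python's decimal module: the function names are invented for the illustration, and round-half-up is an assumed tie rule that plays no role for x = 0.0016.

```python
from decimal import Decimal, Context, ROUND_HALF_UP, localcontext


def delta_y_direct(x, digits=4):
    """2 - sqrt(4 - x), rounded to `digits` significant digits after each operation."""
    with localcontext(Context(prec=digits, rounding=ROUND_HALF_UP)):
        root = (Decimal(4) - x).sqrt()   # 3.998 -> 1.999
        return Decimal(2) - root         # subtraction of two nearly equal numbers


def delta_y_rationalized(x, digits=4):
    """x / (2 + sqrt(4 - x)), rounded to `digits` significant digits after each operation."""
    with localcontext(Context(prec=digits, rounding=ROUND_HALF_UP)):
        root = (Decimal(4) - x).sqrt()
        return x / (Decimal(2) + root)   # no cancellation: the sum is about 4


if __name__ == "__main__":
    x = Decimal("0.0016")
    print(delta_y_direct(x))        # value found in part b
    print(delta_y_rationalized(x))  # value found in part c
```

With x = Decimal("0.0016") the direct form returns 0.001 and the rationalized form returns 0.0004001, matching parts b and c.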
Problems.
1. An empty bowl is placed on a scale and the scale reads 4.62 lb. Some cherries are placed in the bowl and the bowl is again placed on the scale and the scale now reads 4.96 lb. Suppose the value that the scale reads may be in error by as much as 0.08 lb. Approximately how much do the cherries weigh and what is the relative error in this value? Answer. The cherries weigh approximately 4.96 - 4.62 = 0.34 lb. However, both the numbers 4.96 and 4.62 could be off by as much as 0.08, so the difference could be off by as much as 0.16. So the relative error in the weight 0.34 could be as large as 0.16/0.34 ≈ 0.47 = 47%.
2. An object moves along a straight line. Its position at time $t_0 = 200$ sec is $x_0 = 700$ ft and its position at time $t_1 = \frac{601}{3}$ sec is $x_1 = \frac{2102}{3}$ ft.
a. What is the exact velocity $v = (x_1 - x_0)/(t_1 - t_0)$?
b. Suppose the velocity is computed using four digit decimal arithmetic. What value is computed for the velocity and what is the relative error in this value?
3. An object moves along a straight line in such a fashion that its position $x$ is given in terms of time $t$ by the formula $x = 2t^2$. If $t_0$ and $t_1$ are two times and $x_0$ and $x_1$ are the corresponding positions, then its average velocity over this time interval is $v = (x_1 - x_0)/(t_1 - t_0)$. Suppose $t_0 = 1$ sec and $t_1 = 1 + \frac{1}{300}$ sec.
a. What is the exact velocity $v = (x_1 - x_0)/(t_1 - t_0)$ over this time interval?
b. Suppose the velocity is computed using four digit decimal arithmetic. What value is computed for the velocity and what is the relative error in this value?
c. Find another way of computing the velocity in this situation using four digit decimal arithmetic that has a smaller relative error.
4. Radioactive carbon-14 decays into nitrogen with a half life of $t_{1/2} = 5{,}730 \pm 30$ years. Suppose at time $t = 0$ one has a sample containing $N = 3.65$ grams of carbon-14. At time $t$ (years) the amount $A$ of nitrogen that has been produced from the decay of the carbon-14 is given by $A = N(1 - e^{-kt})$ where $k = \frac{\ln 2}{t_{1/2}} = 1.21 \times 10^{-4}$ yr$^{-1}$. Find the value of $A$ when $t = \frac{1}{(24)(365.25)}$, which corresponds to a time of 1 hour. Suppose one does the calculations with a calculator with eight digits of precision. What problem occurs in the calculation?
Example 4. Consider the calculation of $y = 1 + x + x^2/2! + x^3/3! + \cdots + x^n/n!$ discussed in section 1.1. Let's estimate the round-off error in the computation when $x = -5.5$ and $n = 25$, and the calculations are done with decimal floating point numbers with six digits of precision. In this case $\varepsilon = 5 \times 10^{-6}$.
Solution. The answer depends somewhat on the algorithm used to compute the sum. Note that $y = y_{25}$ where
$y_j = 1 + x + x^2/2! + x^3/3! + \cdots + x^j/j! = 1 + t_1 + t_2 + t_3 + \cdots + t_j = y_{j-1} + t_j$
where
$t_j = x^j/j! = q_j/f_j$
$q_j = x^j = x q_{j-1}$
$f_j = j! = j f_{j-1}$
Suppose one uses the following algorithm.
x = -5.5;
n = 25;
y_0 = 1;
q_0 = 1;
f_0 = 1;
for j = 1 to n do
begin
    q_j = x * q_{j-1};
    f_j = j * f_{j-1};
    t_j = q_j / f_j;
    y_j = y_{j-1} + t_j
end
One way to estimate the round-off error is to first do the computation using six digits of precision and then with
more digits of precision. This is done in Example 9.5.2a in Section 1.9.5 below. With six digits of precision
one obtains 0.00405471, with 10 digits one obtains 0.00408674 and with 14 digits one obtains 0.00408673. It
appears that the true value is about 0.0040867, so the six digit calculation is off by about $3 \times 10^{-5}$, which is about
a 1% error.
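The comparison across precisions can be reproduced in outline with Python's decimal module. This is a sketch under stated assumptions: the function name is invented here, and decimal's default round-half-even rule is used, so the six digit value need not agree digit for digit with the one quoted above.

```python
from decimal import Decimal, Context, localcontext


def partial_sum(x, n, digits):
    """Sum 1 + x + x^2/2! + ... + x^n/n!, rounding every operation to `digits` digits.

    Follows the algorithm above: q holds x^j, f holds j!, t = q/f is the j-th term.
    """
    with localcontext(Context(prec=digits)):
        x = +Decimal(x)              # unary plus rounds x to the working precision
        q = Decimal(1)
        f = Decimal(1)
        y = Decimal(1)
        for j in range(1, n + 1):
            q = x * q
            f = j * f
            t = q / f
            y = y + t
        return y


if __name__ == "__main__":
    for digits in (6, 10, 14):
        print(digits, partial_sum("-5.5", 25, digits))
```

The difference between the six digit and fourteen digit results serves as the estimate of the round-off error in the six digit result.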
Another way to estimate the round-off error is to estimate the error at each stage of the computation. This can be somewhat complicated, as we saw in Example 2. First consider the error in $q_j = x^j$. The values of $q_0$ and $q_1$ can be represented exactly. In this particular example the values of $q_2$ and $q_3$ can also be represented exactly, but if $x$ had some other value this might not be true. So we shall give an estimate that holds for any value of $x$. To get $q_j$ we multiply $q_{j-1}$ by $x$ and round. Before rounding the relative error in $x q_{j-1}$ is no more than the relative error in $q_{j-1}$. Rounding introduces an additional relative error of no more than $\varepsilon$. So the relative error in $q_j$ is no more than about the relative error in $q_{j-1}$ plus $\varepsilon$. It follows that the relative error in $q_j$ is no more than approximately $(j-1)\varepsilon$. Similarly the relative error in $f_j$ is no more than approximately $(j-1)\varepsilon$.
Now consider the error in $t_j = q_j / f_j$. Before rounding the relative error in $t_j$ is no more than approximately the sum of the relative errors in $q_j$ and $f_j$, which is $2(j-1)\varepsilon$. Rounding introduces an additional relative error of no more than approximately $\varepsilon$, so the relative error in $t_j$ is no more than about $(2j-1)\varepsilon$. The absolute error in $t_j$ is no more than about $(2j-1)\varepsilon |t_j|$.
Finally consider the error in $y_j = y_{j-1} + t_j$. Before rounding the absolute error in $y_j$ is no more than the sum of the absolute errors in $y_{j-1}$ and $t_j$. Rounding introduces an additional absolute error of no more than $\varepsilon |y_j|$. So the absolute error in $y_j$ is no more than about $\varepsilon b_j$ where
$b_j = \sum_{k=2}^{j} (2k-1) |t_k| + \sum_{k=2}^{j} |y_k|$.
This value is computed for $j = 25$ in Example 2 in section 1.6.2. One obtains $b_{25} \approx 2559$, so the error in $y_{25}$ is bounded by about $\varepsilon b_{25} \approx 0.013$. The terms $|t_j|$ start at 5.5 for $j = 1$ and go 15…, 27…, 38…, 41… for $j = 2, 3, 4, 5$ and then start to decrease. The terms $(2j-1)|t_j|$ start at 45… for $j = 2$ and go 138…, 266…, 377…, 422… for $j = 3, 4, 5, 6$ and then start to decrease. The terms $|y_j|$ start at 4.5 for $j = 1$ and go 10…, 17…, 21… for $j = 2, 3, 4$ and then start to decrease. It turns out that the main contributions to $b_{25}$ are the terms $(2k-1)|t_k|$ for $k$ between 3 and 12. This estimate of the error is quite a bit larger than the one obtained above by repeating the computations using more digits. This is because it assumes the worst possible case at each step.
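The coefficient $b_{25}$ can be checked with a few lines of code. The sketch below is an illustration in ordinary double precision, which is far more accurate than the six digit arithmetic being analyzed; the function name is invented here and is not taken from section 1.6.2.

```python
def error_bound_coefficient(x, n):
    """Return b_n = sum_{k=2}^{n} (2k-1)|t_k| + sum_{k=2}^{n} |y_k|.

    t_k = x^k / k! is the k-th term and y_k the k-th partial sum of the series.
    """
    q = 1.0   # x^j
    f = 1.0   # j!
    y = 1.0   # partial sum y_j
    b = 0.0
    for j in range(1, n + 1):
        q *= x
        f *= j
        t = q / f
        y += t
        if j >= 2:   # both sums in the bound start at k = 2
            b += (2 * j - 1) * abs(t) + abs(y)
    return b


if __name__ == "__main__":
    b25 = error_bound_coefficient(-5.5, 25)
    eps = 5e-6   # machine epsilon for six decimal digits
    print(b25, eps * b25)
```

For x = -5.5 and n = 25 the coefficient comes out near 2559, and multiplying by eps = 5e-6 gives the bound of about 0.013 quoted above.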