1.5 Floating point numbers and round-off errors

1.5.1 Floating point numbers

Round-off errors arise because people, calculators, and computers usually do not store numbers exactly in the course of a series of calculations. Scientific and engineering computations are often done with numbers expressed in floating point form. A nonzero number $x$ is expressed in decimal floating point form if it is written as a signed number with magnitude at least 1 and less than 10, multiplied by an integral power of 10. In other words we write

$$x = \pm\, d_1.d_2 \cdots d_j \cdots \times 10^{q}$$

where the $d_j$ are decimal digits with $d_1 \neq 0$.

Example 1.
168,500 has floating point representation $1.685 \times 10^{5}$
0.0378462 has floating point representation $3.78462 \times 10^{-2}$
$-0.00746$ has floating point representation $-7.46 \times 10^{-3}$
1/3 has floating point representation $3.333\ldots \times 10^{-1}$

The number $d_1.d_2 \cdots d_j \cdots$ is called the mantissa while the power $q$ is called the exponent. Actually, computers often use base 2 for their representation of floating point numbers, an issue that will be discussed in section 1.8. For the moment we restrict our attention to base 10.

Note that $q$ is the largest integer such that $10^{q} \le |x|$, which implies

$$q = \lfloor \log_{10} |x| \rfloor \tag{1}$$

Then the signed mantissa $\pm\, d_1.d_2 \cdots d_j \cdots$ is equal to $10^{-q} x$. Here $\lfloor y \rfloor$ denotes the floor of $y$, which is the largest integer not exceeding $y$.

Note that in the case of $1/3 = 3.333\ldots \times 10^{-1}$ one needs an infinite number of digits in the mantissa to represent 1/3 exactly. It is usually impossible to keep track of an infinite number of digits in the course of a series of computations, so people and computers usually do calculations keeping only a certain fixed number of digits of the mantissa at each step. This is called the number of digits of precision in the computations. We will assume a number $x$ is rounded to $x_a$, although some computers chop $x$ to get $x_a$.

Example 2. If a computation is done using seven decimal digits of precision, then the number $x = 1/3$ would be approximated by $x_a = 3.333333 \times 10^{-1} = 0.3333333$.
The absolute error between the number $x = 1/3$ and $x_a = 0.3333333$ is $\tfrac{1}{3} \times 10^{-7}$ and the relative error (relative to $x$) is $10^{-7}$.

In general, the term round-off error refers to the error that one makes by replacing a number with its floating point approximation rounded off to a certain number of digits. The size of a round-off error will vary. However, there is a close connection between the relative error of the round-off and the number of digits of precision; see Proposition 1 below.

To make this more precise suppose $x = \pm\, d_1.d_2 \cdots d_j \cdots \times 10^{q}$ in floating point form. Then if $x > 0$

$$\operatorname{round}(x, p) = \begin{cases} d_1.d_2 \cdots d_p \times 10^{q} & \text{if } d_{p+1} < 5 \\ d_1.d_2 \cdots d_{p-1}(d_p + 1) \times 10^{q} & \text{if } d_{p+1} \ge 5 \text{ and } d_p < 9 \\ d_1.d_2 \cdots d_{s-1}(d_s + 1)0 \cdots 0 \times 10^{q} & \text{if } d_{p+1} \ge 5 \text{ and } d_s < 9 \text{ and } d_{s+1} = \cdots = d_p = 9 \text{ for some } s < p \\ 1.00 \cdots 0 \times 10^{q+1} & \text{if } d_{p+1} \ge 5 \text{ and } d_1 = \cdots = d_p = 9 \end{cases} \tag{2}$$

denotes $x$ rounded to $p$ decimal places, that is, to $p$ digits of the mantissa. If $x < 0$ then $\operatorname{round}(x, p) = -\operatorname{round}(-x, p)$.

Example 3. If $x = 1.685 \times 10^{5}$ and we round $x$ to 3 decimal places we get $x_a = 1.69 \times 10^{5}$, since $d_4 = 5$ and so the third digit is rounded up according to (2). The absolute error between $x$ and $x_a$ is $0.005 \times 10^{5} = 500$ and the relative error is $0.005 \times 10^{5} / (1.69 \times 10^{5}) \approx 0.003 = 3 \times 10^{-3}$.

On most calculators and computers the numbers are rounded off to the same number of digits after each operation, and to say that a computation is done with $p$ decimal digits of precision means that the inputs and the result of each arithmetic operation are rounded to $p$ digits.

There is the following connection between the number of significant decimal digits and a bound on the relative error.

Proposition 1. Suppose $x = \pm\, d_1.d_2 \cdots d_j \cdots \times 10^{q}$ in floating point form and let $x_a = \operatorname{round}(x, p)$ be the approximation to $x$ obtained by rounding $x$ to $p$ decimal places. Then the absolute error $\delta = |x - x_a|$ is no more than $5 \times 10^{q-p}$ and both relative errors $\varepsilon_t = \delta / |x|$ and $\varepsilon_a = \delta / |x_a|$ are no more than $5 \times 10^{-p}$.

Example 4. It is easiest to see the proof for a specific value of $p$. Suppose $p = 3$ and $x$ is positive. If $x = d_1.d_2 \cdots d_j \cdots \times 10^{q}$, then $x_a$ keeps three digits of the mantissa (with the last digit possibly adjusted by rounding) and $\delta = |x - x_a| \le 0.005 \times 10^{q} = 5 \times 10^{q-3}$. So $\varepsilon_a \le 5 \times 10^{q-3} / |x_a|$. Since $|x_a| \ge 10^{q}$ one has $\varepsilon_a \le 5 \times 10^{q-3} / 10^{q} = 5 \times 10^{-3}$.

Proof for general p.
We shall suppose $x$ is positive; the case where $x$ is negative follows from the case where $x$ is positive and the fact that $\operatorname{round}(x, p) = -\operatorname{round}(-x, p)$. One has $\delta = |x - x_a| \le 0.0 \cdots 05 \times 10^{q}$, where there are $p - 1$ zeros between the decimal point and the 5. This is because we decrease $x$ by no more than this amount to get $x_a$ if we round $x$ down, and we increase $x$ by no more than this amount if we round up. Note that $0.0 \cdots 05 \times 10^{q} = 5 \times 10^{q-p}$. So $\delta \le 5 \times 10^{q-p}$. Also $10^{q} \le |x|$ and $10^{q} \le |x_a|$. So $\varepsilon_t = \delta / |x| \le 5 \times 10^{q-p} / 10^{q} = 5 \times 10^{-p}$, and similarly for $\varepsilon_a = \delta / |x_a|$. //

$\operatorname{round}(x, p)$ can be expressed in terms of the chop operation. For $x > 0$,

$$\operatorname{chop}(x, p) = d_1.d_2 \cdots d_p \times 10^{q} = 10^{q-p+1} \lfloor 10^{p-q-1} x \rfloor$$

denotes $x$ chopped off to $p$ digits. If $x < 0$ then $\operatorname{chop}(x, p) = -\operatorname{chop}(-x, p)$. If $x > 0$ then

$$\operatorname{round}(x, p) = \operatorname{chop}(x + 5 \times 10^{q-p},\, p)$$

In the analysis of round-off errors it is often convenient to work with the machine epsilon, which we denote by $\varepsilon_{\mathrm{mach}}$.

$$\varepsilon_{\mathrm{mach}} = \text{the smallest number which when added to 1 using the given computational method gives a result larger than 1} \tag{3}$$

(Some authors define the machine epsilon to be the largest number which when added to 1 still gives 1 as a result; this is almost equivalent.) If the computations are done with $p$ decimal digits of precision then $\varepsilon_{\mathrm{mach}} = 5 \times 10^{-p}$. Note that Proposition 1 says that the relative error between a number and the approximation obtained by rounding it off to $p$ decimal places is no more than $\varepsilon_{\mathrm{mach}}$.

Computers often use base 2 for their representation of floating point numbers, which will be discussed in more detail in section 1.8. We shall see there that if the computations are done with $p$ binary digits of precision then $\varepsilon_{\mathrm{mach}} = 2^{-p}$ and the relative error between a number $x$ and its rounded value is no more than $\varepsilon_{\mathrm{mach}}$.

By the time a measured value is stored in the computer there are already two sources of error, the error in measurement and the round-off error. The total error is at most about the sum of the two errors. The following proposition states this more precisely.

Proposition 2.
Suppose $x_a$ is an approximation to $x$ with absolute error $\delta(x, x_a) = |x - x_a|$ no more than $h_a$ and $x_{aa}$ is an approximation to $x_a$ with absolute error $\delta(x_a, x_{aa}) = |x_a - x_{aa}|$ no more than $h_{aa}$. Then $x_{aa}$ is an approximation to $x$ with absolute error $\delta(x, x_{aa}) = |x - x_{aa}|$ no more than $h_a + h_{aa}$.

Suppose $x_a$ is an approximation to $x$ with relative error $\varepsilon(x, x_a) = |x - x_a| / |x_a|$ no more than $r_a$ and $x_{aa}$ is an approximation to $x_a$ with relative error $\varepsilon(x_a, x_{aa}) = |x_a - x_{aa}| / |x_{aa}|$ no more than $r_{aa}$, and $x_a \neq 0$ and $x_{aa} \neq 0$. Then $x_{aa}$ is an approximation to $x$ with relative error $\varepsilon(x, x_{aa}) = |x - x_{aa}| / |x_{aa}|$ no more than $r_a + r_{aa} + r_a r_{aa}$. If in addition $r_{aa} < 1$ then $x_{aa}$ is an approximation to $x$ with absolute error no more than $h_a + |x_a|\, r_{aa} / (1 - r_{aa})$. In symbols

$$\delta(x, x_{aa}) \le \delta(x, x_a) + \delta(x_a, x_{aa}) \tag{4}$$

$$\varepsilon(x, x_{aa}) \le \varepsilon(x, x_a) + \varepsilon(x_a, x_{aa}) + \varepsilon(x, x_a)\, \varepsilon(x_a, x_{aa}) \tag{5}$$

$$\delta(x, x_{aa}) \le \delta(x, x_a) + \frac{|x_a|\, \varepsilon(x_a, x_{aa})}{1 - \varepsilon(x_a, x_{aa})} \tag{6}$$

Remark. If $\varepsilon(x_a, x_{aa})$ is small then (5) says

$$\varepsilon(x, x_{aa}) \lesssim \varepsilon(x, x_a) + \varepsilon(x_a, x_{aa}) \tag{7}$$

where $a \lesssim b$ means $a \le c$ where $c \approx b$. Thus, the relative error in successive approximations is approximately bounded by the sum of the relative errors of the individual approximations.

Proof. Using the triangle inequality $|a + b| \le |a| + |b|$ one has

$$\delta(x, x_{aa}) = |(x - x_a) + (x_a - x_{aa})| \le |x - x_a| + |x_a - x_{aa}| = \delta(x, x_a) + \delta(x_a, x_{aa})$$

which proves (4). To prove (5), we divide (4) by $|x_{aa}|$ to get $\varepsilon(x, x_{aa}) \le |x - x_a| / |x_{aa}| + \varepsilon(x_a, x_{aa})$. In the first term on the right, we multiply top and bottom by $|x_a|$ and use the fact that $|x_a| / |x_{aa}|$ is no more than $1 + \varepsilon(x_a, x_{aa})$, which holds because $|x_a| \le |x_{aa}| + |x_a - x_{aa}|$. This gives

$$\varepsilon(x, x_{aa}) \le \varepsilon(x, x_a)\,\bigl(1 + \varepsilon(x_a, x_{aa})\bigr) + \varepsilon(x_a, x_{aa})$$

which proves (5). To prove (6), we first note that $\delta(x_a, x_{aa}) = |x_{aa}|\, \varepsilon(x_a, x_{aa})$. Then we note that $|x_a| / |x_{aa}|$ is no less than $1 - \varepsilon(x_a, x_{aa})$, so that $|x_{aa}| \le |x_a| / (1 - \varepsilon(x_a, x_{aa}))$. So $\delta(x_a, x_{aa}) \le |x_a|\, \varepsilon(x_a, x_{aa}) / (1 - \varepsilon(x_a, x_{aa}))$, which when combined with (4) gives (6). //

Example 4.
Suppose a piece of wood is cut to a length $x$ that is measured to be $x_a = 1/3$ ft with a relative error of at most 1%. Suppose the approximate value $x_a$ is represented by the three digit decimal floating point value $x_{aa} = 0.333$. Approximately what is the relative error of $x_{aa}$ as an approximation to $x$?

Solution. We are given that the relative error of $x_a$ as an approximation to $x$ is no more than 1%. By Proposition 1 with $p = 3$, the relative error of $x_{aa}$ as an approximation to $x_a$ satisfies $\varepsilon(x_a, x_{aa}) \le 0.005$, that is, it is no more than half a percent. By (7) the relative error of $x_{aa}$ as an approximation to $x$ is no more than approximately 1.5%.

The round-off error in storing a measured value in the computer is usually much smaller than the error in measurement. However, it is possible for the round-off error in arithmetic computations to be larger than the error in measurement. We shall see some examples in the next section.
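
The definitions of this section can be checked numerically. The following Python sketch is not part of the original notes. It uses exact rational arithmetic (the standard `fractions` module) so that the decimal rounding is not disturbed by the computer's own binary round-off. The function names `exponent`, `chop`, `round_sig` and `rel_err` are hypothetical names chosen for illustration, and the "true" length used in the wood example is an assumed worst case, not a value given in the text.

```python
from fractions import Fraction
from math import floor


def exponent(x: Fraction) -> int:
    """Return the largest integer q with 10**q <= |x|, as in equation (1)."""
    if x == 0:
        raise ValueError("zero has no normalized floating point form")
    ax, q = abs(x), 0
    while Fraction(10) ** (q + 1) <= ax:   # q is still too small
        q += 1
    while Fraction(10) ** q > ax:          # q is still too large
        q -= 1
    return q


def chop(x: Fraction, p: int) -> Fraction:
    """Keep the first p digits of the mantissa and discard the rest."""
    if x < 0:
        return -chop(-x, p)
    scale = Fraction(10) ** (exponent(x) - p + 1)
    return scale * floor(x / scale)


def round_sig(x: Fraction, p: int) -> Fraction:
    """Round x to p digits via round(x, p) = chop(x + 5 * 10**(q - p), p)."""
    if x < 0:
        return -round_sig(-x, p)
    half_unit = 5 * Fraction(10) ** (exponent(x) - p)
    return chop(x + half_unit, p)


def rel_err(x: Fraction, approx: Fraction) -> Fraction:
    """Relative error measured against the approximation, as in Proposition 2."""
    return abs(x - approx) / abs(approx)


if __name__ == "__main__":
    # Example 2: seven digits of precision for x = 1/3.
    x = Fraction(1, 3)
    xa = round_sig(x, 7)
    print("x_a =", float(xa), " relative error =", float(abs(x - xa) / abs(x)))

    # Proposition 1: both relative errors stay below 5 * 10**(-p).
    for p in range(1, 8):
        y = Fraction(168500)
        ya = round_sig(y, p)
        bound = 5 * Fraction(10) ** (-p)
        assert abs(y - ya) / abs(y) <= bound and rel_err(y, ya) <= bound

    # Wood example: x_a = 1/3 measured, x_aa = x_a rounded to 3 digits.
    # Assumption: the true length is 1% larger than the measured value.
    x_a = Fraction(1, 3)
    x_aa = round_sig(x_a, 3)
    x_true = x_a * Fraction(101, 100)
    r_a, r_aa = rel_err(x_true, x_a), rel_err(x_a, x_aa)
    total = rel_err(x_true, x_aa)
    print("total =", float(total), " bound (5) =", float(r_a + r_aa + r_a * r_aa))
```

Exact fractions are used here only to keep the illustration faithful to base 10; a binary `float` would add its own round-off of the kind discussed in section 1.8.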