1.5 Floating point numbers and round-off errors.
Floating point numbers.
Round-off errors are due to the fact that people, calculators, and computers usually do not keep track of or store
numbers exactly in the course of a series of calculations. Scientific and engineering computations are often done
with numbers expressed in floating point form. A number $x$ is expressed in decimal floating point form if it is
written as a signed number with magnitude at least 1 and less than 10, multiplied by an integral power of 10. In other
words we write $x = \pm d_1.d_2 \cdots d_j \cdots \times 10^{q}$ where the $d_j$ are decimal digits with $d_1 \neq 0$.
Example 1.
$168500$ has floating point representation $1.685 \times 10^{5}$
$0.0378462$ has floating point representation $3.78462 \times 10^{-2}$
$-0.00746$ has floating point representation $-7.46 \times 10^{-3}$
$1/3$ has floating point representation $3.333\ldots \times 10^{-1}$
The number $\pm d_1.d_2 \cdots d_j \cdots$ is called the mantissa while the power $q$ is called the exponent. Actually, computers
often use base 2 for their representation of floating point numbers, an issue that will be discussed in section 1.8.
For the moment we restrict our attention to base 10.
Note that $q$ is the largest integer such that $10^{q} \le |x|$, which implies
$$q = \lfloor \log_{10} |x| \rfloor.$$
Then the mantissa $\pm d_1.d_2 \cdots d_j \cdots$ is equal to $10^{-q} x$. Here $\lfloor y \rfloor$ denotes the floor of $y$, which is the largest
integer not exceeding $y$.
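The exponent and mantissa formulas above can be tried out with a short sketch. It assumes Python's standard decimal module, whose Decimal.adjusted() method returns the floor of log10 |x| for a nonzero decimal; the function name to_floating_point is a hypothetical helper chosen for illustration.

```python
from decimal import Decimal

def to_floating_point(x):
    """Return (mantissa, q) with x = mantissa * 10**q and 1 <= |mantissa| < 10.

    Hypothetical helper: x must be a nonzero Decimal.
    """
    if x == 0:
        raise ValueError("zero has no normalized floating point form")
    q = x.adjusted()         # floor(log10(|x|)) for nonzero x
    mantissa = x.scaleb(-q)  # 10**(-q) * x, computed exactly
    return mantissa, q

for text in ["168500", "0.0378462", "-0.00746"]:
    m, q = to_floating_point(Decimal(text))
    print(f"{text} = {m} x 10^{q}")
```

Working with Decimal rather than the built-in float keeps the digits exactly as written in base 10, which matches the base 10 setting used in this section.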
Note that in the case of $1/3 = 3.333\ldots \times 10^{-1}$ one needs an infinite number of digits in the mantissa to represent
1/3 exactly. It is usually impossible to keep track of an infinite number of digits in the course of a series of
computations, so people and computers usually do calculations keeping only a certain fixed number of digits of
the mantissa at each step. This is called the number of digits of precision in the computations. We will assume
a number $x$ is rounded to $x_a$, although some computers chop $x$ to get $x_a$.
Example 2. If a computation is done using seven decimal digits of precision, then the number $x = 1/3$ would
be approximated by $x_a = 3.333333 \times 10^{-1} = 0.3333333$. The absolute error between the number $x = 1/3$ and
$x_a = 0.3333333$ is $\tfrac{1}{3} \times 10^{-7}$ and the relative error is $10^{-7}$.
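The two error values in Example 2 can be confirmed with exact rational arithmetic. The following is a small sketch using Python's standard fractions module; the variable names are chosen here for illustration.

```python
from fractions import Fraction

x = Fraction(1, 3)
x_a = Fraction(3333333, 10**7)   # 0.3333333, seven digits of precision

abs_err = abs(x - x_a)           # absolute error |x - x_a|
rel_err = abs_err / abs(x)       # relative error, measured against x

# Compare with the values stated in Example 2.
print(abs_err == Fraction(1, 3) * Fraction(1, 10**7))
print(rel_err == Fraction(1, 10**7))
```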
In general, the term round-off error refers to the error that one makes by replacing a number with its floating
point approximation rounded off to a certain number of digits. The size of a round-off error will vary.
However, there is a close connection between the relative size of the round-off error and the number of digits of
precision; see Proposition 1 below.
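To make this more precise suppose $x = \pm d_1.d_2 \cdots d_j \cdots \times 10^{q}$ in floating point form. Then if $x > 0$
$$\mathrm{round}(x, p) = \begin{cases}
d_1.d_2 \cdots d_p \times 10^{q} & \text{if } d_{p+1} < 5 \\
d_1.d_2 \cdots d_{p-1}(d_p + 1) \times 10^{q} & \text{if } d_{p+1} \ge 5 \text{ and } d_p < 9 \\
d_1.d_2 \cdots d_{s-1}(d_s + 1)0 \cdots 0 \times 10^{q} & \text{if } d_{p+1} \ge 5 \text{ and } d_s < 9 \text{ and } d_{s+1} = \cdots = d_p = 9 \\
1.0 \cdots 0 \times 10^{q+1} & \text{if } d_{p+1} \ge 5 \text{ and } d_1 = \cdots = d_p = 9
\end{cases}$$
denotes $x$ rounded to $p$ decimal places, meaning that the $p$ leading digits $d_1, \ldots, d_p$ of the mantissa are kept. If $x < 0$ then $\mathrm{round}(x, p) = -\mathrm{round}(-x, p)$.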
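The case analysis above can be sketched in a few lines, assuming Python's standard decimal module. The function name round_sig is hypothetical. The rounding mode ROUND_HALF_UP rounds ties away from zero, which agrees with the rule that a digit of 5 or more rounds up and with round(x, p) = - round(-x, p).

```python
from decimal import Decimal, ROUND_HALF_UP

def round_sig(x, p):
    """Round the Decimal x to its p leading digits (hypothetical helper)."""
    if x == 0:
        return x
    q = x.adjusted()                        # exponent of x in floating point form
    quantum = Decimal(1).scaleb(q - p + 1)  # 10**(q-p+1), the place value of digit d_p
    return x.quantize(quantum, rounding=ROUND_HALF_UP)

print(round_sig(Decimal("1.2349"), 3))  # next digit below 5: digits kept as they are
print(round_sig(Decimal("1.235"), 3))   # next digit at least 5: last kept digit goes up
print(round_sig(Decimal("1.2996"), 4))  # trailing nines carry into an earlier digit
print(round_sig(Decimal("9.996"), 3))   # all nines: the exponent increases by one
```

The carries in the third and fourth cases are handled by quantize itself, so the four cases of the definition do not need to be coded separately.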
Example 3. If $x = 1.685 \times 10^{5}$ and we round $x$ to 3 decimal places we get $x_a = 1.69 \times 10^{5}$, since the fourth digit is 5. The absolute error
between $x$ and $x_a$ is $0.005 \times 10^{5}$ and the relative error is $0.005 \times 10^{5} / (1.69 \times 10^{5}) \approx 0.003 = 3 \times 10^{-3}$.
On most calculators and computers the numbers are rounded off to the same number of digits after each
operation, and to say that a computation is done with $p$ decimal digits of precision means that the inputs and the
result of each arithmetic operation are rounded to $p$ digits. There is the following connection between the
number of significant decimal digits and a bound on the relative error.
Proposition 1. Suppose $x = \pm d_1.d_2 \cdots d_j \cdots \times 10^{q}$ in floating point form and let $x_a = \mathrm{round}(x, p)$ be the
approximation to $x$ obtained by rounding $x$ to $p$ decimal places. Then the absolute error is no more than
$5 \times 10^{q-p}$ and both relative errors $\delta_t$ and $\delta_a$ are no more than $5 \times 10^{-p}$.
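Example 4. It is easiest to see the proof for a specific value of $p$. Suppose $p = 3$ and $x$ is positive. If
$x = d_1.d_2 \cdots d_j \cdots \times 10^{q}$, then $x_a$ is $x$ rounded to three digits and $\Delta = |x - x_a| \le 0.005 \times 10^{q} = 5 \times 10^{q-3}$. So $\delta_a = \Delta / |x_a| \le 5 \times 10^{q-3} / |x_a|$.
Since $|x_a| \ge 10^{q}$ one has $\delta_a \le 5 \times 10^{q-3} / 10^{q} = 5 \times 10^{-3}$.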
Proof for general $p$. We shall suppose $x$ is positive; the case where $x$ is negative follows from the case where $x$
is positive and the fact that $\mathrm{round}(x, p) = -\mathrm{round}(-x, p)$. One has $\Delta = |x - x_a| \le 0.0 \cdots 05 \times 10^{q}$ where there are
$p-1$ zeros between the decimal point and the 5. This is because we decrease $x$ by no more than this amount to
get $x_a$ if we round $x$ down to get $x_a$ and we increase $x$ by no more than this amount if we round up. Note that
$0.0 \cdots 05 \times 10^{q} = 5 \times 10^{q-p}$. So $\Delta \le 5 \times 10^{q-p}$. Also $10^{q} \le |x|$ and $10^{q} \le |x_a|$. So
$\delta_t = \Delta / |x| \le 5 \times 10^{q-p} / 10^{q} = 5 \times 10^{-p}$, and similarly for $\delta_a$. //
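As a sanity check on Proposition 1 (not a substitute for the proof), the bounds can be tested over every four-digit mantissa for a given exponent. This is a sketch assuming Python's decimal module; round_sig and check_proposition_1 are hypothetical helper names, and the relative errors are compared by cross-multiplying so that no division is needed.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_sig(x, p):
    """Round the nonzero Decimal x to its p leading digits, ties away from zero."""
    quantum = Decimal(1).scaleb(x.adjusted() - p + 1)
    return x.quantize(quantum, rounding=ROUND_HALF_UP)

def check_proposition_1(p, q):
    """Test the bounds for all x = d1.d2d3d4 * 10**q with d1 nonzero."""
    abs_bound = Decimal(5).scaleb(q - p)   # 5 * 10**(q-p)
    rel_bound = Decimal(5).scaleb(-p)      # 5 * 10**(-p)
    for n in range(1000, 10000):
        x = Decimal(n).scaleb(q - 3)       # mantissa n/1000 lies in [1, 10)
        x_a = round_sig(x, p)
        err = abs(x - x_a)
        if err > abs_bound:
            return False
        # err/|x| <= rel_bound and err/|x_a| <= rel_bound, written without division
        if err > rel_bound * abs(x) or err > rel_bound * abs(x_a):
            return False
    return True

print(check_proposition_1(p=2, q=0))
print(check_proposition_1(p=3, q=5))
```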
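$\mathrm{round}(x, p)$ can be expressed in terms of the chop operation. For $x > 0$,
$$\mathrm{chop}(x, p) = d_1.d_2 \cdots d_p \times 10^{q} = 10^{q-p+1} \lfloor 10^{p-q-1} x \rfloor$$
denotes $x$ chopped off to $p$ digits. If $x < 0$ then $\mathrm{chop}(x, p) = -\mathrm{chop}(-x, p)$. If $x > 0$ then
$$\mathrm{round}(x, p) = \mathrm{chop}(x + 5 \times 10^{q-p}, p).$$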
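The chop formula and the expression of round in terms of chop can be sketched as follows, again assuming Python's decimal module. The names chop and round_via_chop are hypothetical. Note that chop recomputes the exponent of its argument, which matters when adding 5 x 10^(q-p) carries the number up to the next power of 10.

```python
from decimal import Decimal, ROUND_FLOOR

def chop(x, p):
    """Keep the p leading digits of the Decimal x and drop the rest."""
    if x == 0:
        return x
    if x < 0:
        return -chop(-x, p)
    q = x.adjusted()
    shifted = x.scaleb(p - q - 1)   # 10**(p-q-1) * x
    return shifted.to_integral_value(rounding=ROUND_FLOOR).scaleb(q - p + 1)

def round_via_chop(x, p):
    """round(x, p) = chop(x + 5 * 10**(q-p), p) for x > 0, extended by symmetry."""
    if x == 0:
        return x
    if x < 0:
        return -round_via_chop(-x, p)
    q = x.adjusted()
    return chop(x + Decimal(5).scaleb(q - p), p)

print(chop(Decimal("2.999"), 2), round_via_chop(Decimal("2.999"), 2))
print(chop(Decimal("9.996"), 3), round_via_chop(Decimal("9.996"), 3))
```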
In the analysis of round-off errors it is often convenient to work with the machine $\varepsilon$.
Machine $\varepsilon$ = the smallest number which when added to 1 using the given computational
method gives a result larger than 1.
(Some authors define the machine $\varepsilon$ to be the largest number which when added to 1 still gives 1 as a result; this
is almost equivalent.) If the computations are done with $p$ decimal digits of precision then $\varepsilon = 5 \times 10^{-p}$. Note
that Proposition 1 says that the relative error between a number and the approximation obtained by rounding it
off to $p$ decimal places is no more than $\varepsilon$.
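The claim that machine epsilon is 5 x 10^(-p) for p decimal digits can be illustrated with a decimal arithmetic context of precision p. This is a sketch assuming Python's decimal module with rounding mode ROUND_HALF_UP (ties round up, as in the definition of round); the function name adds_to_more_than_one is hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

def adds_to_more_than_one(eps, p):
    """Add eps to 1 with the result rounded to p digits; report whether it exceeds 1."""
    ctx = Context(prec=p, rounding=ROUND_HALF_UP)
    return ctx.add(Decimal(1), eps) > 1

p = 3
eps = Decimal(5).scaleb(-p)                          # 5 * 10**(-p) = 0.005
print(adds_to_more_than_one(eps, p))                 # 1.005 rounds up to 1.01
print(adds_to_more_than_one(Decimal("0.00499"), p))  # 1.00499 rounds back to 1.00
```

Here 0.00499 is the largest three-digit number below 0.005, so within three digits of precision no smaller input pushes the sum above 1.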
Computers often use base 2 for their representation of floating point numbers which will be discussed in more
detail in section 1.8. We shall see there that if the computations are done with $p$ binary digits of precision then
$\varepsilon = 2^{-p}$ and the relative error between a number $x$ and its rounded value is no more than $\varepsilon$.
By the time a measured value is stored in the computer there are already two sources of error, the error in
measurement and the round-off error. The total error is at most the sum of the two errors (approximately so for relative errors). The following proposition
states this more precisely.
Proposition 2. Suppose $x_a$ is an approximation to $x$ with absolute error $\Delta(x, x_a) = |x - x_a|$ no more than $h_a$ and
$x_{aa}$ is an approximation to $x_a$ with absolute error $\Delta(x_a, x_{aa}) = |x_a - x_{aa}|$ no more than $h_{aa}$. Then $x_{aa}$ is an
approximation to $x$ with absolute error $\Delta(x, x_{aa}) = |x - x_{aa}|$ no more than $h_a + h_{aa}$. Suppose $x_a$ is an
approximation to $x$ with relative error $\delta(x, x_a) = |x - x_a| / |x_a|$ no more than $r_a$ and $x_{aa}$ is an approximation to $x_a$ with
relative error $\delta(x_a, x_{aa}) = |x_a - x_{aa}| / |x_{aa}|$ no more than $r_{aa}$ and $x_a \neq 0$ and $x_{aa} \neq 0$. Then $x_{aa}$ is an approximation to $x$
with relative error $\delta(x, x_{aa}) = |x - x_{aa}| / |x_{aa}|$ no more than $r_a + r_{aa} + r_a r_{aa}$. If in addition $r_{aa} < 1$ then $x_{aa}$ is an
approximation to $x$ with absolute error no more than $h_a + |x_a| r_{aa} / (1 - r_{aa})$. In symbols
$$\Delta(x, x_{aa}) \le \Delta(x, x_a) + \Delta(x_a, x_{aa}) \qquad (4)$$
$$\delta(x, x_{aa}) \le \delta(x, x_a) + \delta(x_a, x_{aa}) + \delta(x, x_a)\,\delta(x_a, x_{aa}) \qquad (5)$$
$$\Delta(x, x_{aa}) \le \Delta(x, x_a) + \frac{|x_a|\,\delta(x_a, x_{aa})}{1 - \delta(x_a, x_{aa})} \qquad (6)$$
Remark. If $\delta(x_a, x_{aa})$ is small then (5) says
$$\delta(x, x_{aa}) \lesssim \delta(x, x_a) + \delta(x_a, x_{aa}) \qquad (7)$$
where $a \lesssim b$ means $a \le c$ where $c \approx b$. Thus, the relative error in successive approximations is approximately
bounded by the sum of the relative errors of the individual approximations.
Proof. Using the triangle inequality $|a + b| \le |a| + |b|$ one has
$$\Delta(x, x_{aa}) = |(x - x_a) + (x_a - x_{aa})| \le |x - x_a| + |x_a - x_{aa}| = \Delta(x, x_a) + \Delta(x_a, x_{aa})$$
which proves (4). To prove (5), we divide (4) by $|x_{aa}|$ to get $\delta(x, x_{aa}) \le |x - x_a| / |x_{aa}| + \delta(x_a, x_{aa})$. In the first
term on the right, we multiply top and bottom by $|x_a|$ and use the fact that $|x_a| / |x_{aa}|$ is no more than
$1 + \delta(x_a, x_{aa})$. This gives $\delta(x, x_{aa}) \le \delta(x, x_a) (1 + \delta(x_a, x_{aa})) + \delta(x_a, x_{aa})$, which proves (5). To prove (6), we
first note that $\Delta(x_a, x_{aa}) = |x_{aa}|\,\delta(x_a, x_{aa})$. Then we note that $|x_a| / |x_{aa}|$ is no less than $1 - \delta(x_a, x_{aa})$, so that
$|x_{aa}| \le |x_a| / (1 - \delta(x_a, x_{aa}))$. So $\Delta(x_a, x_{aa}) \le |x_a|\,\delta(x_a, x_{aa}) / (1 - \delta(x_a, x_{aa}))$, which when combined with (4)
gives (6). //
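The three inequalities of Proposition 2 can also be checked numerically with exact rational arithmetic. The sketch below assumes, as in the proof, that relative error is measured against the approximation (the second argument); the function names and the grid of test values are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product

def abs_err(u, v):
    """Absolute error |u - v|."""
    return abs(u - v)

def rel_err(u, v):
    """Relative error |u - v| / |v|, measured against the approximation v."""
    return abs(u - v) / abs(v)

def proposition_2_holds(x, x_a, x_aa):
    """Check the three 'in symbols' inequalities for one triple of nonzero numbers."""
    big = abs_err(x, x_aa)
    big1, big2 = abs_err(x, x_a), abs_err(x_a, x_aa)
    d = rel_err(x, x_aa)
    d1, d2 = rel_err(x, x_a), rel_err(x_a, x_aa)
    first = big <= big1 + big2
    second = d <= d1 + d2 + d1 * d2
    # The third bound is only claimed when the second relative error is below 1.
    third = d2 >= 1 or big <= big1 + abs(x_a) * d2 / (1 - d2)
    return first and second and third

values = [Fraction(n, 7) for n in range(-10, 11) if n != 0]
all_ok = all(proposition_2_holds(x, a, b) for x, a, b in product(values, repeat=3))
print(all_ok)
```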
Example 5. Suppose a piece of wood is cut to a length $x$ that is measured to be $x_a = 1/3$ ft with a relative
error of at most 1%. Suppose the approximate value $x_a$ is represented by the three digit decimal floating
point value $x_{aa} = 0.333$. Approximately what is the relative error of $x_{aa}$ as an approximation to $x$?
Solution. We are given that the relative error of $x_a$ as an approximation to $x$ is no more than 1%. By
Proposition 1 one has $\delta(x_a, x_{aa}) \le 0.005$, so the relative error of $x_{aa}$ as an approximation to $x_a$ is no more
than half a percent. By (7) the relative error of $x_{aa}$ as an approximation to $x$ is no more than approximately
$0.01 + 0.005 = 0.015$, or 1.5%.
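The arithmetic in this solution can be reproduced exactly. The sketch below uses Python's fractions module with variable names chosen for illustration; it computes the approximate bound (the sum of the two relative error bounds), the exact bound that also includes the product term, and the actual round-off error in storing 1/3 as 0.333.

```python
from fractions import Fraction

r_a = Fraction(1, 100)     # measurement: relative error at most 1%
r_aa = Fraction(5, 1000)   # rounding to three digits: at most 5 * 10**(-3)

approx_bound = r_a + r_aa               # sum of the two bounds
exact_bound = r_a + r_aa + r_a * r_aa   # sum plus the product term

x_a = Fraction(1, 3)
x_aa = Fraction(333, 1000)
actual_roundoff = abs(x_a - x_aa) / abs(x_aa)   # relative to the stored value

print(float(approx_bound), float(exact_bound), float(actual_roundoff))
```

The product term changes the bound only in the fourth significant digit, and the actual round-off error here is well below its half percent bound.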
The round-off error in storing a measured value in the computer is usually much smaller than the error in
measurement. However, it is possible for the round-off error in arithmetic computations to be larger than the
measurement error. We shall see some examples in the next section.