Suppose that x and y are NMNs (both > 0 or both < 0)

Floating-Point Systems
For $\beta$ = base, $e$ = exponent, and $n$ = significand (or mantissa) length
(as in Atkinson & Han), we consider only 0 and numbers of the form
$x = \pm\,\beta^{e}\,(d_1.d_2 d_3 d_4 \ldots d_n)_\beta$, where $d_1 \ge 1$.
These are called the normalized machine numbers or (as here) NMNs.
For the floating-point system defined above with $L \le e \le U$, $L$ and $U$ integers,
the smallest positive NMN is $\beta^{L}$ while the largest is $\beta^{U+1}(1 - \beta^{-n})$, and the
cardinality of all (+, -, and 0) these NMNs is
$2(U - L + 1)(\beta^{n} - \beta^{n-1}) + 1$.
Note: The $k$th positive NMN, $x_k$ for $k = 1$ to $(U - L + 1)\,\#m$, is
$x_k = \beta^{L+p-1}\,\bigl(1 + (k - 1 - (p - 1)\,\#m)\,\beta^{1-n}\bigr)$,
where $\#m = \beta^{n} - \beta^{n-1}$ is the number of normalized mantissas and $p$ = the 'block'
number for $x_k$, i.e. $p = \lceil k/\#m \rceil$ (so $p$ increases from 1 to $U - L + 1$).
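The counting formulas above can be checked by brute force on a small system. The following is a minimal sketch, assuming a toy system chosen only for illustration ($\beta = 2$, $n = 3$, $L = -1$, $U = 1$); the function names `positive_nmns` and `kth_nmn` are hypothetical and do not come from Atkinson & Han.

```python
from fractions import Fraction

def positive_nmns(beta, n, L, U):
    """List every positive NMN of the toy system in increasing order."""
    values = []
    for e in range(L, U + 1):
        # Integer mantissas beta^(n-1) .. beta^n - 1 are the digit strings
        # d1 d2 ... dn with d1 >= 1; dividing by beta^(n-1) gives d1.d2...dn.
        for M in range(beta ** (n - 1), beta ** n):
            values.append(Fraction(M, beta ** (n - 1)) * Fraction(beta) ** e)
    return values

def kth_nmn(k, beta, n, L):
    """Closed form for the kth positive NMN (k starts at 1)."""
    num_m = beta ** n - beta ** (n - 1)   # '#m' in the notes
    p = -(-k // num_m)                    # integer ceiling of k / #m
    j = k - 1 - (p - 1) * num_m           # position inside block p
    return Fraction(beta) ** (L + p - 1) * (1 + j * Fraction(beta) ** (1 - n))

# Toy parameters (an assumption made for this illustration only).
beta, n, L, U = 2, 3, -1, 1
vals = positive_nmns(beta, n, L, U)
total = 2 * len(vals) + 1                 # both signs plus zero
print(len(vals), total, vals[0], vals[-1])
```

Exact rational arithmetic (`Fraction`) is used so that the comparison between the enumeration and the closed form is not itself affected by rounding.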
The IEEE base-2 double-precision standard calls for $-1022 \le e \le 1023$ and $n = 53$.
In this arithmetic, the least positive NMN is $2^{-1022} \approx 2.225 \times 10^{-308}$ and the largest is
$2^{1024}(1 - 2^{-53}) \approx 1.798 \times 10^{308}$. It also follows that there are
$2046 \cdot 2^{52} = 9{,}214{,}364{,}837{,}600{,}034{,}816$ ($\approx 9.214 \times 10^{18}$)
positive NMNs in the 53-bit arithmetic. The machine epsilon is $2^{-52} \approx 2.22 \times 10^{-16}$,
and the largest integer in the arithmetic, $M$, with the property that all of $M$'s
integer predecessors also belong to the arithmetic is $2^{53} \approx 9.0 \times 10^{15}$.
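These double-precision constants can be read off in Python, whose `float` is an IEEE double on essentially all current platforms (an assumption of this sketch). The variable names below are chosen here for illustration.

```python
import sys
from fractions import Fraction

info = sys.float_info

least_nmn = Fraction(info.min)      # least positive normalized double
largest = Fraction(info.max)        # largest finite double
eps = Fraction(info.epsilon)        # machine epsilon
n = info.mant_dig                   # significand length in bits

# Number of positive NMNs: 2046 exponents times 2^52 normalized mantissas.
count_positive = (1023 - (-1022) + 1) * 2 ** (n - 1)

print(float(least_nmn), float(largest), float(eps), count_positive)
```

Converting each constant to a `Fraction` gives its exact value, so it can be compared with the powers of 2 quoted in the text without any rounding.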
It should be noted that the MNs (machine numbers) in a given finite arithmetic do not include
every integer between that arithmetic's least and greatest positive member:
Assume a base-2 arithmetic with $a \le e \le b$, $a \le 0 \le n \le b$, and $q$ an integer.
Then if $1 \le q \le 2^{n} - 1$, $q$ must be a MN. Why? Since (a proof is in Gilbert)...
$q = c_1 2^{n-1} + c_2 2^{n-2} + c_3 2^{n-3} + \ldots + c_n 2^{0}$
$\;\; = 2^{n-j}\,(c_j.c_{j+1} c_{j+2} \ldots c_n)_2$ [if the first $j-1$ $c$'s are 0],
where the $c_i$'s are 0 or 1 and $c_j$ is the first nonzero $c_i$ (so it's 1). Now, $2^{n}$
is clearly a MN (it's $2^{n}(1.000\ldots0)_2$), but $2^{n} + 1$ is not one, while $2^{n} + 2$ is.
Proof that $2^{n} + 1$ is not a MN: It's $2^{n}(1.000\ldots01)_2$, where the trailing 1 is in the $2^{-n}$ place,
so it would need $n + 1$ mantissa bits.
Proof that $2^{n} + 2$ is a MN: The next MN after $2^{n}$ is $2^{n} + 2 = 2^{n}(1.000\ldots1)_2$, where the 1 is in the $2^{-(n-1)}$ place,
i.e. $2^{n} + 1$ is passed over (the integer gaps widen as we move to the right).
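The integer gap at $2^{n}$ can be observed directly for $n = 53$. This is a small sketch, assuming Python's `float` is an IEEE double; the helper `is_machine_integer` is a hypothetical name introduced here.

```python
n = 53  # significand length of IEEE double precision

def is_machine_integer(q):
    """True if the integer q survives conversion to a double unchanged."""
    return int(float(q)) == q

M = 2 ** n

# Spot checks: small integers and the integers just below 2^n are all MNs.
small_ok = all(is_machine_integer(q) for q in range(1, 1025))
near_top_ok = all(is_machine_integer(M - d) for d in range(0, 1000))

skipped = not is_machine_integer(M + 1)   # 2^n + 1 is passed over
next_ok = is_machine_integer(M + 2)       # 2^n + 2 is the next MN

print(small_ok, near_top_ok, skipped, next_ok)
```

Only a sample of the integers below $2^{53}$ is tested, since checking all of them is not feasible; the argument in the text covers the rest.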
Atkinson's definition of 'significant digits' is as follows:
$x_a$ has, at least, $m$ significant digits of accuracy (or digits
of accuracy) wrt $x_t$ provided the magnitude of the error,
$|x_t - x_a|$, is less than or equal to five units in the $(m+1)$st
digit counting rightward from $x_t$'s first nonzero digit.
This is equivalent to requiring $|x_t - x_a| \le 5 \times 10^{e-m}$, for $e$ = $x_t$'s normalization
exponent, i.e. $x_t = \pm 10^{e}\,(d_1.d_2 d_3 d_4 \ldots)$ with $d_1 \ge 1$. Atkinson claims (pg. 45):
If $|\text{relative error}| \le 5 \times 10^{-(m+1)}$, then $x_a$ has
(at least) $m$ significant digits of accuracy wrt $x_t$.
Proof: Let $x_t = \pm 10^{e}\,(d_1.d_2 d_3 d_4 \ldots d_m d_{m+1} \ldots)$, where $d_1 \ge 1$. Then the bound
on the relative error implies that
$|x_t - x_a| \le 5 \times 10^{e-(m+1)}\,(d_1.d_2 d_3 d_4 \ldots d_m d_{m+1} \ldots) < 5 \times 10^{e-m}$,
since the mantissa is less than 10, and $x_a$ has at least $m$ significant digits of accuracy relative to $x_t$.
From Skeel & Keiper, we also have the following:
decimal places of accuracy, or d-acc, given by $-\log_{10} |x_t - x_a|$
digits of accuracy, or -acc, given by $-\log_{10} (|x_t - x_a|/|x_t|)$
From Conte & de Boor (pg. 10) the significant digits definition:
$x_a$ approximates $x_t$ to at least $d$ significant digits if $|x_t - x_a|/|x_t| \le 5 \times 10^{-d}$
Finally, from an old trigonometry textbook (Greenleaf), the definition:
a digit is significant provided it's known to within 4 possibilities
Examples using Atkinson's definition (the error is $x_t - x_a$):
(1) $x_t = \pi = 3.1415926\ldots$ vs. $x_a = 22/7 = 3.142857142857\ldots$
The absolute error = $-.00126\ldots$, so $|\text{error}| < 5 \times 10^{-3}$
and since $e = 0$, $e - m = -3$ implies that $m$ is 3.
(2) $x_t = 2/9 = .222\ldots$ vs. $x_a = .222$
The absolute error = $.000222\ldots$, so $|\text{error}| < 5 \times 10^{-4}$
and since $e = -1$, $e - m = -4$ implies that $m$ is 3.
(3) $x_t = 23.496$ vs. $x_a = 23.494$
The absolute error = $.002$, so $|\text{error}| < 5 \times 10^{-3}$
and since $e = 1$, $e - m = -3$ implies that $m$ is 4.
(4) $x_t = .02144$ vs. $x_a = .02138$
The absolute error = $.00006 = .6 \times 10^{-4} < 5 \times 10^{-4}$
and since $e = -2$, $e - m = -4$ implies that $m$ is 2.
(5) $x_t = 5$ vs. $x_a = 4.995$
The absolute error = $.005$, so $|\text{error}| = 5 \times 10^{-3}$
and since $e = 0$, $e - m = -3$ implies that $m$ is 3.
Claim: Atkinson's $m$ satisfies $e + \log_{10} 5 - \log_{10}|\text{error}| - 1 < m \le e + \log_{10} 5 - \log_{10}|\text{error}|$.
Proof: Suppose $|\text{error}| \le 5 \times 10^{e-m}$. Then $m \le \log_{10} 5 + e - \log_{10}|\text{error}|$, which implies, taking $m$ as large as possible, that
$\log_{10} 5 + e - \log_{10}|\text{error}| - 1 < m \le \log_{10} 5 + e - \log_{10}|\text{error}|$, with $m$ the unique integer
satisfying $5 \times 10^{e-(m+1)} < |\text{error}| \le 5 \times 10^{e-m}$. Note that $m = \lfloor -\log_{10}(|\text{error}|/(5 \times 10^{e})) \rfloor$.
Suppose $a > 0$ and let $m = \lfloor -\log_{10} a \rfloor$. Then $-\log_{10} a = m + f$, where $0 \le f < 1$,
and $-\log_{10} a - 1 < m \le -\log_{10} a$. This implies $10^{-m-1} < a \le 10^{-m}$, and if $a = |\text{error}|/(5 \times 10^{e})$,
then $5 \times 10^{e-(m+1)} < |\text{error}| \le 5 \times 10^{e-m}$ and thus no larger $m$ satisfies $|\text{error}| \le 5 \times 10^{e-m}$.
Applying this to each of the examples appearing above yields (bounds rounded to one decimal):
(1) $2.6 < m \le 3.6$  (2) $2.4 < m \le 3.4$  (3) $3.4 < m \le 4.4$  (4) $1.9 < m \le 2.9$  (5) $2 < m \le 3$
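The characterization of $m$ as the largest integer with $|\text{error}| \le 5 \times 10^{e-m}$ can be turned into a short routine. The following is a sketch, not Atkinson's own procedure: the names `normalization_exponent` and `atkinson_digits` are hypothetical, and exact rational arithmetic is an implementation choice made here so that boundary cases such as example (5) are not disturbed by rounding. The value used for $\pi$ is a 15-digit truncation.

```python
from fractions import Fraction

def normalization_exponent(x):
    """Integer e with 10**e <= |x| < 10**(e + 1), for x != 0."""
    x = abs(Fraction(x))
    e = 0
    while x >= 10:
        x /= 10
        e += 1
    while x < 1:
        x *= 10
        e -= 1
    return e

def atkinson_digits(xt, xa):
    """Largest integer m with |xt - xa| <= 5 * 10**(e - m)."""
    xt, xa = Fraction(xt), Fraction(xa)
    err = abs(xt - xa)
    if err == 0:
        raise ValueError("xa equals xt; every digit agrees")
    e = normalization_exponent(xt)
    a = err / (5 * Fraction(10) ** e)     # a = |error| / (5 * 10^e)
    m = 0
    while a * Fraction(10) ** (m + 1) <= 1:   # a <= 10^-(m+1): m can grow
        m += 1
    while a * Fraction(10) ** m > 1:          # a > 10^-m: m is too large
        m -= 1
    return m

examples = [
    ("3.14159265358979", Fraction(22, 7)),    # (1) pi vs 22/7
    (Fraction(2, 9), "0.222"),                # (2)
    ("23.496", "23.494"),                     # (3)
    ("0.02144", "0.02138"),                   # (4)
    ("5", "4.995"),                           # (5)
]
digits = [atkinson_digits(xt, xa) for xt, xa in examples]
print(digits)
```

The two `while` loops locate $m = \lfloor -\log_{10} a \rfloor$ without calling a logarithm, which is what keeps the tie in example (5) exact.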
Regarding Miscellaneous Calculators:
The TI-85 uses a 12+2-digit (2 internal) base 10
arithmetic with simple rounding, $-999 \le e \le 999$.
The TI-83 uses a 12+2-digit (2 internal) base 10
arithmetic with simple rounding, $-99 \le e \le 99$.
The TI-89 (& Voyage 200) uses a 12+2-digit (2 internal)
base 10 arithmetic with simple rounding, $-999 \le e \le 999$.
The HP-48G uses a 12-digit (no internal) base 10
arithmetic with simple rounding, $-499 \le e \le 499$.
Example regarding Matlab's "format hex"
>> format hex
>> 0.1
Yields 3fb999999999999a, which in binary is
0011 1111 1011 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010
The first 0 is the algebraic sign (here a +), the next 11 bits code $E = 1019$, which is related
to the exponent $e$ via $e = E - 1023$ (here $e = -4$), and the final 52 bits code the mantissa
(save for the 1 from normalization), which here is the exact mantissa $1.100110011001\ldots$
rounded up to the $2^{-52}$ place. Typing sym(0.1, 'e') gives 1/10 + eps/40, and Matlab is
telling us the floated form of 0.1 in the 53-bit arithmetic is $2^{-52}/40$ more than 1/10.
Finally, typing sym(0.1, 'f') will give (as a fraction) the floated form of 0.1.
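The same bit pattern can be inspected outside Matlab. Below is a minimal Python sketch (Python is a substitution made here, not part of the original example) that unpacks the 64-bit word for 0.1 into sign, biased exponent, and stored mantissa bits, and then recovers the exact excess over 1/10.

```python
import struct
from fractions import Fraction

# Big-endian IEEE double for 0.1, shown as 16 hex digits.
bits = struct.pack(">d", 0.1).hex()
word = int(bits, 16)

sign = word >> 63                      # 1 sign bit
E = (word >> 52) & 0x7FF               # 11-bit biased exponent
frac = word & ((1 << 52) - 1)          # 52 stored mantissa bits
e = E - 1023                           # true exponent

# Rebuild the value exactly: (1 + frac / 2^52) * 2^e.
rebuilt = (1 + Fraction(frac, 2 ** 52)) * Fraction(2) ** e

exact = Fraction(0.1)                  # exact value of the stored double
excess = exact - Fraction(1, 10)       # how far the double lies above 1/10

print(bits, sign, E, e, excess)
```

`Fraction(0.1)` plays the role of `sym(0.1, 'f')` here: it returns the stored double as an exact rational number.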
Assume $x$ and $y$ are NMNs with $xy > 0$ and $0 < |y| < |x|\,b^{-1} u$, for $b$ the base of the
arithmetic and $u$ the unit round-off ($= b^{1-s}$ if chopping, and $.5\,b^{1-s}$ if rounding).
Claim: $y$ has no effect if added to $x$, i.e. $x \oplus y = x$.
Proof: Wlog, take $x, y > 0$ and let $x = m\,b^{e}$ and $y = n\,b^{f}$ with $b > m, n \ge 1$.
If chopping is used, then $y < x\,b^{-1} u$ means that $n\,b^{f} < m\,b^{e-1} b^{1-s} = m\,b^{e-s}$, or
$b^{f} < (m/n)\,b^{e-s} < b^{e-s+1}$, implying $f - e < 1 - s$, i.e. $f - e \le -s$. Therefore $x \oplus y = x$,
since $n\,b^{f-e} \le n\,b^{-s} < b^{1-s}$, which is less than one unit in the last place of $m$.
If rounding is used, then $y < x\,b^{-1} u$ means $n\,b^{f} < (m/2)\,b^{e-s} < (1/2)\,b^{e-s+1}$ (since $m < b$),
so $n\,b^{f-e} < (1/2)\,b^{1-s}$, which is less than half a unit in the last place of $m$. Therefore
$x \oplus y = x$.
Examples: Let $b = 10$ and $s = 3$. Now take...
$x = 1.24$ and $y = .00123$. With chopping, $x\,b^{-1} u = .00124 > y$, and $x \oplus y = 1.24$ (so $y$ is 'ignored').
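A toy 3-digit decimal arithmetic is easy to emulate with Python's `decimal` module. In this sketch, chopping is modeled by `ROUND_DOWN` and rounding by `ROUND_HALF_UP`; both mappings, and the helper names `fl_add` and `negligible_bound`, are assumptions made for illustration.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl_add(x, y, s, rounding):
    """Add x and y, keeping s significant decimal digits."""
    ctx = Context(prec=s, rounding=rounding)
    return ctx.add(Decimal(x), Decimal(y))

def negligible_bound(x, s, chopping):
    """The threshold |x| * b^-1 * u from the claim, for base b = 10."""
    u = Decimal(10) ** (1 - s)         # unit round-off when chopping
    if not chopping:
        u = u / 2                      # unit round-off when rounding
    return abs(Decimal(x)) * u / 10

x = "1.24"
chop_bound = negligible_bound(x, 3, chopping=True)
round_bound = negligible_bound(x, 3, chopping=False)

ignored_chop = fl_add(x, "0.00123", 3, ROUND_DOWN)      # y below the chopping bound
ignored_round = fl_add(x, "0.0006", 3, ROUND_HALF_UP)   # y below the rounding bound
changed = fl_add(x, "0.01", 3, ROUND_DOWN)              # y above the bound

print(chop_bound, round_bound, ignored_chop, ignored_round, changed)
```

Note that the bound is sufficient, not necessary: $y = .00123$ exceeds the rounding bound for $x = 1.24$, yet it would still be ignored under rounding.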
Now suppose that $2 \ge y/x \ge 1/2$. Then
Claim: $x \ominus y = x - y$, i.e. the subtraction is exact (assuming the
exponent of the difference lies within the range of the arithmetic).
Note: For any 2 mantissas $m$ and $n$, we have $n/m < b$ and $m/n < b$, i.e. $1/b < n/m < b$.
Proof: Wlog, take $x, y > 0$ and assume that $x > y$. Let $x = m\,b^{e}$ and $y = n\,b^{f}$, with
$b > m, n \ge 1$. We claim that $f = e - 1$, $e$, or $e + 1$: Suppose $f = e + k$, for some $k \ge 2$.
Then $2 \ge (n/m)\,b^{k} > b^{k-1} \ge b \ge 2$, a contradiction. If instead $f = e - k$, for some $k \ge 2$,
then $.5 \ge b^{-1} \ge b^{1-k} > (n/m)\,b^{-k} \ge .5$, a contradiction. Since $x > y$, $f = e + 1$ is impossible.
If $f = e$, then $x - y = (m - n)\,b^{e} = b^{1-s}(M - N)\,b^{e}$ for $M$ and $N$ the integer mantissas
$m\,b^{s-1}$ and $n\,b^{s-1}$. Since $M - N$ falls between 1 and $(b^{s} - 1) - b^{s-1}$, the difference is
expressible as an at most $s$-digit base-$b$ integer (any integer between 1 and $b^{s} - 1$
is so expressible), implying $m - n$ is expressible as an at most $s$-digit fractional
mantissa, i.e. $x - y$ is exact.
Finally, if $f = e - 1$, then it follows that $x - y = b^{1-s}(bM - N)\,b^{e-1}$. But since
$b^{-1}(n/m) \ge .5$, or $2n \ge bm$, it holds that $N \ge bM - N$. Therefore, $bM - N$ is
an at most $s$-digit base-$b$ integer and $x - y$ is exactly representable in the
arithmetic (if the exponent falls in the range allowed by the arithmetic).
Examples: Let $b = 10$ and $s = 3$. Now take...
$x = 4.24$ and $y = 2.60$. Then $y/x = .6132\ldots$ and $x - y = 1.64$ (is exact)
$x = 2.46$ and $y = 4.28$. Then $y/x = 1.7398\ldots$ and $x - y = -1.82$ (is exact)
$x = 5.24$ and $y = 12.78$. Then $y/x = 2.4389\ldots$ and $x - y = -7.54$ (is exact)
$x = 5.24$ and $y = 62.78$. Then $y/x = 11.9809\ldots$ and $x - y = -57.5$ (not exact)
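The exact-subtraction claim can be tested exhaustively in a small decimal arithmetic. This is a sketch under stated assumptions: base 10, round-half-up as the rounding rule, and hypothetical helper names `subtraction_is_exact` and `all_close_pairs_exact`. The exhaustive loop uses $s = 2$ to keep the search small.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

def subtraction_is_exact(x, y, s=3):
    """True if the s-digit machine difference equals the true x - y."""
    ctx = Context(prec=s, rounding=ROUND_HALF_UP)
    x, y = Decimal(x), Decimal(y)
    # x - y uses Python's default 28-digit context, which is exact here.
    return ctx.subtract(x, y) == x - y

def all_close_pairs_exact(s=2):
    """Check every pair of s-digit NMNs with 1/2 <= y/x <= 2."""
    lo, hi = 10 ** (s - 1), 10 ** s
    for M in range(lo, hi):
        x = Decimal(M).scaleb(1 - s)              # mantissa d1.d2...ds, exponent 0
        for f in (0, 1):                          # y with exponent 0 or 1
            for N in range(lo, hi):
                y = Decimal(N).scaleb(1 - s + f)
                if x <= 2 * y and y <= 2 * x:
                    if not subtraction_is_exact(x, y, s):
                        return False
    return True

print(subtraction_is_exact("4.24", "2.60"))    # ratio inside [1/2, 2]
print(subtraction_is_exact("2.46", "4.28"))    # ratio inside [1/2, 2]
print(subtraction_is_exact("5.24", "62.78"))   # ratio far outside [1/2, 2]
print(all_close_pairs_exact(2))
```

Fixing the exponent of $x$ at 0 loses no generality, since scaling both operands by a power of 10 scales the difference by the same power.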
Regarding the loss-of-significance (LOS) example in Atkinson on page 49:
$x_t = \cos(.01) = 0.99995000041666\ldots$ vs. $x_a = 0.9999500004$ (computed)
The absolute error $< 1.667 \times 10^{-11} < 5 \times 10^{-11}$, and since $e = -1$, $e - m = -11$ and
$x_a$ has the full 10 significant digits of accuracy allowed by the arithmetic.
$x_t = 1 - \cos(.01) = 0.00004999958333\ldots$ vs. $x_a = 0.0000499996$ (computed)
Then $|\text{absolute error}| < 1.667 \times 10^{-11} < 5 \times 10^{-11}$, and since $e = -5$, $e - m = -11$ and
$x_a$ has only 6 significant digits of accuracy, a loss of 4 in this arithmetic.
$x_t = (1 - \cos(.01))/(.01)^{2} = 0.4999958333\ldots$ vs. $x_a = 0.499996$ (computed)
Then $|\text{absolute error}| < 1.667 \times 10^{-7} < 5 \times 10^{-7}$, and since $e = -1$, $e - m = -7$ and
$x_a$ has just 6 significant digits of accuracy, a loss of 4 in this arithmetic.
Remedy #1: Substitute the Taylor polynomial $1 - x^{2}/2! + x^{4}/4! - x^{6}/6!$ for $\cos(x)$.
Then the problematic expression becomes $\tfrac{1}{2} - x^{2}/24 + x^{4}/720$.
Remedy #2: Rewrite it as $\sin^{2}(.01)/[(1 + \cos(.01))(.01)^{2}]$
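The loss of significance and both remedies can be reproduced in an emulated 10-digit decimal arithmetic. The sketch below assumes round-half-up and obtains $\cos(.01)$ and $\sin(.01)$ by rounding double-precision values to 10 digits; both choices are made here for illustration and may differ in detail from the arithmetic behind Atkinson's example.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=10, rounding=ROUND_HALF_UP)   # emulated 10-digit arithmetic

x = Decimal("0.01")
x2 = ctx.multiply(x, x)
cos_x = ctx.plus(Decimal(math.cos(0.01)))        # cos(.01) rounded to 10 digits
sin_x = ctx.plus(Decimal(math.sin(0.01)))        # sin(.01) rounded to 10 digits

# Naive form: (1 - cos x) / x^2, which cancels leading digits.
naive = ctx.divide(ctx.subtract(Decimal(1), cos_x), x2)

# Remedy #1: 1/2 - x^2/24 + x^4/720.
x4 = ctx.multiply(x2, x2)
remedy1 = ctx.add(
    ctx.subtract(Decimal("0.5"), ctx.divide(x2, Decimal(24))),
    ctx.divide(x4, Decimal(720)),
)

# Remedy #2: sin^2 x / ((1 + cos x) x^2).
remedy2 = ctx.divide(
    ctx.multiply(sin_x, sin_x),
    ctx.multiply(ctx.add(Decimal(1), cos_x), x2),
)

true_value = Decimal("0.4999958333472")          # truncated reference value
for name, value in [("naive", naive), ("remedy1", remedy1), ("remedy2", remedy2)]:
    print(name, value, abs(value - true_value))
```

Both remedies avoid subtracting two nearly equal numbers, which is why their errors stay near the last retained digit while the naive form loses about four digits.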