Floating-Point Systems

For $\beta$ = base, $e$ = exponent, and $n$ = significand (or mantissa) length (as in Atkinson & Han), we consider only 0 and numbers of the form $x = \pm\,\beta^{e}\,(d_1.d_2 d_3 d_4 \ldots d_n)_\beta$, where $d_1 \ge 1$. These are called the normalized machine numbers or (as here) NMNs.

For the floating-point system defined above with $L \le e \le U$, $L$ and $U$ integers, the smallest positive NMN is $\beta^{L}$, the largest is $\beta^{U+1}(1 - \beta^{-n})$, and the cardinality of the set of all these NMNs (positive, negative, and 0) is $2(U - L + 1)(\beta^{n} - \beta^{n-1}) + 1$.

Note: The $k$th positive NMN, $x_k$ for $k = 1$ to $(U - L + 1)\,\#m$, is
$x_k = \beta^{L+p-1}\,\bigl(1 + (k - 1 - (p - 1)\,\#m)\,\beta^{1-n}\bigr)$,
where $\#m = \beta^{n} - \beta^{n-1}$ is the number of normalized mantissas and $p$ is the 'block' number for $x_k$, i.e. $p = \lceil k/\#m \rceil$ (so $p$ increases from 1 to $U - L + 1$).

The IEEE base-2 double-precision standard calls for $-1022 \le e \le 1023$ and $n = 53$. In this arithmetic, the least positive NMN is $2^{-1022} \approx 2.225 \times 10^{-308}$ and the largest is $2^{1024}(1 - 2^{-53}) \approx 1.798 \times 10^{308}$. It also follows that there are $2046 \times 2^{52} = 9{,}214{,}364{,}837{,}600{,}034{,}816$ ($\approx 9.214 \times 10^{18}$) positive NMNs in the 53-bit arithmetic. The machine epsilon is $2^{-52} \approx 2.22 \times 10^{-16}$, and the largest integer $M$ in the arithmetic with the property that all of $M$'s positive integer predecessors also belong to the arithmetic is $2^{53} \approx 9.0 \times 10^{15}$.

It should be noted that the MNs in a given finite arithmetic do not include every integer between that arithmetic's least and greatest positive members. Assume a base-2 arithmetic with $a \le e \le b$, $a \le 0 \le n \le b$, and $q$ an integer. Then if $1 \le q \le 2^{n} - 1$, $q$ must be a MN. Why? Since (a proof is in Gilbert)
$q = c_1 2^{n-1} + c_2 2^{n-2} + c_3 2^{n-3} + \cdots + c_n 2^{0} = 2^{n-j}\,(c_j.c_{j+1}c_{j+2}\ldots c_n)_2$,
where the $c_i$ are 0 or 1 and $c_j$ is the first nonzero $c_i$ (so it is 1, and the first $j - 1$ of the $c$'s are 0).

Now, $2^{n}$ is clearly a MN (it is $2^{n}(1.000\ldots0)_2$), but $2^{n} + 1$ is not one, while $2^{n} + 2$ is.
Proof that $2^{n} + 1$ is not a MN: it is $2^{n}(1.000\ldots01)_2$, where the trailing 1 is in the $2^{-n}$ place, one place beyond the $n$-bit significand.
Proof that $2^{n} + 2$ is a MN: the next MN after $2^{n}$ is $2^{n} + 2 = 2^{n}(1.000\ldots1)_2$, where this 1 is in the $2^{-(n-1)}$ place, i.e.
$2^{n} + 1$ is passed over (the integer gaps widen as we move to the right).

Atkinson's definition of 'significant digits' is as follows: $x_a$ has at least $m$ significant digits of accuracy (or digits of accuracy) with respect to $x_t$ provided the magnitude of the error, $|x_t - x_a|$, is less than or equal to five units in the $(m+1)$st digit, counting rightward from $x_t$'s first nonzero digit. This is equivalent to requiring $|x_t - x_a| \le 5 \times 10^{e-m}$, where $e$ is $x_t$'s normalization exponent, i.e. $x_t = 10^{e}\,(d_1.d_2 d_3 d_4 \ldots)$ with $d_1 \ge 1$.

Atkinson claims (p. 45): If $|\text{relative error}| \le 5 \times 10^{-(m+1)}$, then $x_a$ has (at least) $m$ significant digits of accuracy with respect to $x_t$.
Proof: Let $x_t = 10^{e}\,(d_1.d_2 d_3 d_4 \ldots d_m d_{m+1} \ldots)$, where $d_1 \ge 1$. Then the bound on the relative error implies that
$|x_t - x_a| \le 5 \times 10^{e-(m+1)}\,(d_1.d_2 d_3 d_4 \ldots d_m d_{m+1} \ldots) < 5 \times 10^{e-m}$,
since the mantissa is less than 10, and so $x_a$ has at least $m$ significant digits of accuracy relative to $x_t$.

From Skeel & Keiper, we also have the following:
decimal places of accuracy, or d-acc, given by $-\log_{10} |x_t - x_a|$
digits of accuracy, or -acc, given by $-\log_{10} (|x_t - x_a| / |x_t|)$

From Conte & de Boor (p. 10), the significant digits definition: $x_a$ approximates $x_t$ to at least $d$ significant digits if $|x_t - x_a| / |x_t| \le 5 \times 10^{-d}$.

Finally, from an old trigonometry textbook (Greenleaf), the definition: a digit is significant provided it's known to within 4 possibilities.

Examples using Atkinson's definition (the error is $x_t - x_a$):
(1) $x_t = \pi = 3.1415926\ldots$ vs. $x_a = 22/7 = 3.142857142857\ldots$ The absolute error is $-.00126\ldots$, so $|\text{error}| < 5 \times 10^{-3}$, and since $e = 0$, $e - m = -3$ implies that $m$ is 3.
(2) $x_t = 2/9 = .222\ldots$ vs. $x_a = .222$. The absolute error is $.000222\ldots$, so $|\text{error}| < 5 \times 10^{-4}$, and since $e = -1$, $e - m = -4$ implies that $m$ is 3.
(3) $x_t = 23.496$ vs. $x_a = 23.494$. The absolute error is $.002$, so $|\text{error}| < 5 \times 10^{-3}$, and since $e = 1$, $e - m = -3$ implies that $m$ is 4.
(4) $x_t = .02144$ vs. $x_a = .02138$. The absolute error is $.00006 = .6 \times 10^{-4} < 5 \times 10^{-4}$, and since $e = -2$, $e - m = -4$ implies that $m$ is 2.
(5) $x_t = 5$ vs.
$x_a = 4.995$. The absolute error is $.005$, so $|\text{error}| = 5 \times 10^{-3}$, and since $e = 0$, $e - m = -3$ implies that $m$ is 3.

Claim: Atkinson's $m$ satisfies
$e + \log_{10} 5 - \log_{10}|\text{error}| - 1 < m \le e + \log_{10} 5 - \log_{10}|\text{error}|$.
Proof: Suppose $|\text{error}| \le 5 \times 10^{e-m}$. Then $m \le \log_{10} 5 + e - \log_{10}|\text{error}|$, which implies that
$\log_{10} 5 + e - \log_{10}|\text{error}| - 1 < m \le \log_{10} 5 + e - \log_{10}|\text{error}|$,
with $m$ the unique integer satisfying $5 \times 10^{e-(m+1)} < |\text{error}| \le 5 \times 10^{e-m}$.

Note that $m = \lfloor -\log_{10}\bigl(|\text{error}| / (5 \times 10^{e})\bigr) \rfloor$. To see this, suppose $a > 0$ and let $m = \lfloor -\log_{10} a \rfloor$. Then $-\log_{10} a = m + f$, where $0 \le f < 1$, and $-\log_{10} a - 1 < m \le -\log_{10} a$. This implies $10^{-m-1} < a \le 10^{-m}$, and if $a = |\text{error}| / (5 \times 10^{e})$, then $5 \times 10^{e-(m+1)} < |\text{error}| \le 5 \times 10^{e-m}$, and thus no larger $m$ satisfies $|\text{error}| \le 5 \times 10^{e-m}$.

Applying this to each of the examples appearing above yields:
(1) $2.6 < m$
(2) $2.4 < m$
(3) $3.4 < m$
(4) $1.9 < m$
(5) $2 < m$

Regarding Miscellaneous Calculators -
The TI-85 uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, $-999 \le e \le 999$.
The TI-83 uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, $-99 \le e \le 99$.
The TI-89 (& Voyage 200) uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, $-999 \le e \le 999$.
The HP-48G uses a 12-digit (no internal) base 10 arithmetic with simple rounding, $-499 \le e \le 499$.

Example regarding Matlab's "format hex"

    >> format hex
    >> 0.1

yields

    3fb999999999999a

which in binary is

    0011 1111 1011 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010

The first 0 is the algebraic sign (here a +), the next 11 bits code $E = 1019$, which is related to the exponent $e$ via $e = E - 1023$ (here $e = -4$), and the final 52 bits code the mantissa (save for the leading 1 from normalization), which here is the exact mantissa $1.100110011001\ldots$ rounded up to the $2^{-52}$ place.

Typing sym(0.1, 'e') gives 1/10 + eps/40, and Matlab is telling us the floated form of 0.1 in the 53-bit arithmetic is $2^{-52}/40$ more than 1/10. Finally, typing sym(0.1, 'f') will give (as a fraction) the floated form of 0.1.
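The bit-level reading of 0.1 above can be checked outside Matlab. The following is a minimal Python sketch, assuming Python's float is an IEEE double (true on essentially all current platforms); the helper name decode_double is made up for this illustration and does not come from the notes. It splits the 64 bits into sign, biased exponent $E$, and the 52-bit fraction field, then rebuilds the stored value as an exact rational.

```python
import struct
from fractions import Fraction

def decode_double(x):
    """Return (raw 64-bit pattern, sign bit, biased exponent E, 52-bit fraction field)."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return bits, sign, biased_exp, fraction

bits, sign, E, frac = decode_double(0.1)
hex_form = format(bits, "016x")
e = E - 1023  # unbiased exponent, as in the text

# Exact stored value: (1 + fraction / 2^52) * 2^e, valid for normalized numbers.
stored = (1 + Fraction(frac, 2**52)) * Fraction(2) ** e
eps = Fraction(1, 2**52)
excess = stored - Fraction(1, 10)

print(hex_form)            # 3fb999999999999a
print(sign, E, e)          # 0 1019 -4
print(stored)              # 3602879701896397/36028797018963968
print(excess == eps / 40)  # True
```

The last line reproduces, in exact rational arithmetic, the statement that the floated 0.1 exceeds 1/10 by eps/40.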
Assume $x$ and $y$ are NMNs with $xy > 0$ and $0 < |y| < |x|\,b^{-1}u$, for $b$ the base of the arithmetic and $u$ the unit round-off ($= b^{1-s}$ if chopping, and $.5\,b^{1-s}$ if rounding, with $s$ the number of mantissa digits).

Claim: $y$ has no effect if added to $x$, i.e. $x \oplus y = x$.
Proof: Wlog, take $x, y > 0$ and let $x = m\,b^{e}$ and $y = n\,b^{f}$ with $b > m, n \ge 1$.
If chopping is used, then $y < x\,b^{-1}u$ means that $n\,b^{f} < m\,b^{e-1}\,b^{1-s} = m\,b^{e-s}$, or $b^{f} < (m/n)\,b^{e-s} < b^{e-s+1}$, implying $f - e < 1 - s$, or $f - e \le -s$. Therefore $x \oplus y = x$, since $n\,b^{f-e} \le n\,b^{-s} < b^{1-s}$, i.e. $y$ is less than one unit in the last place of $x$ and is chopped away.
If rounding is used, then $y < x\,b^{-1}u$ means $n\,b^{f} < (m/2)\,b^{e-s}$, or $b^{f} < (m/2n)\,b^{e-s} < \tfrac{1}{2}\,b^{e-s+1}$, implying $f - e < 1 - s - \log_b 2 < 1 - s$, so again $f - e \le -s$. Moreover, $y = n\,b^{f} < (m/2)\,b^{e-s} < \tfrac{1}{2}\,b^{e+1-s}$, i.e. $y$ is less than half a unit in the last place of $x$. Therefore $x \oplus y = x$.

Examples: Let $b = 10$ and $s = 3$. Now take $x = 1.24$ and $y = .00123$ (with chopping, $x\,b^{-1}u = .00124 > y$). Then $x \oplus y = 1.24$ (so $y$ is 'ignored').

Now suppose that $2 \ge y/x \ge 1/2$. Then

Claim: $x \ominus y = x - y$, i.e. the subtraction is exact (assuming the exponent of the difference lies within the range of the arithmetic).
Note: For any 2 mantissas $m$ and $n$, we have $n/m < b$ and $m/n < b$, i.e. $1/b < n/m < b$.
Proof: Wlog, take $x, y > 0$ and assume that $x > y$. Let $x = m\,b^{e}$ and $y = n\,b^{f}$, with $b > m, n \ge 1$. We claim that $f = e - 1$, $e$, or $e + 1$:
Suppose $f = e + k$, for some $k \ge 2$. Then $2 \ge (n/m)\,b^{k} > b^{k-1} \ge b \ge 2$, a contradiction.
If instead $f = e - k$, for some $k \ge 2$, then $.5 \ge b^{-1} \ge b^{1-k} > (n/m)\,b^{-k} \ge .5$, a contradiction.
Since $x > y$, $f = e + 1$ is impossible.
If $f = e$, then $x - y = (m - n)\,b^{e} = b^{1-s}\,(M - N)\,b^{e}$ for $M$ and $N$ the integer mantissas $M = m\,b^{s-1}$ and $N = n\,b^{s-1}$. Since $M - N$ falls between 1 and $(b^{s} - 1) - b^{s-1}$, the difference is expressible as an at most $s$-digit base-$b$ integer (any integer between 1 and $b^{s} - 1$ is so expressible), implying $m - n$ is expressible as an at most $s$-digit fractional mantissa, i.e. $x - y$ is exact.
Finally, if $f = e - 1$, then it follows that $x - y = b^{1-s}\,(bM - N)\,b^{e-1}$. But since $b^{-1}(n/m) \ge .5$, or $2n \ge bm$, it holds that $N \ge bM - N$.
Therefore, $bM - N$ is an at most $s$-digit base-$b$ integer (it is positive and at most $N \le b^{s} - 1$), and $x - y$ is exactly representable in the arithmetic (if the exponent falls in the range allowed by the arithmetic).

Examples: Let $b = 10$ and $s = 3$. Now take
$x = 4.24$ and $y = 2.60$. Then $y/x = .6132\ldots$ and $x - y = 1.64$ (is exact).
$x = 2.46$ and $y = 4.28$. Then $y/x = 1.7398\ldots$ and $x - y = -1.82$ (is exact).
$x = 5.24$ and $y = 12.78$. Then $y/x = 2.4389\ldots$ and $x - y = -7.54$ (is exact, although $y/x > 2$ lies outside the hypothesis).
$x = 5.24$ and $y = 62.78$. Then $y/x = 11.9809\ldots$ and $x - y = -57.5$ (not exact; here too $y/x > 2$).

Regarding the LOS (loss-of-significance) example in Atkinson on page 49:

$x_t = \cos(.01) = 0.99995000041666\ldots$ vs. $x_a = 0.9999500004$ (computed). The absolute error is less than $1.667 \times 10^{-11} < 5 \times 10^{-11}$, and since $e = -1$, $e - m = -11$ and $x_a$ has the full 10 significant digits of accuracy allowed by the arithmetic.

$x_t = 1 - \cos(.01) = 0.00004999958333\ldots$ vs. $x_a = 0.0000499996$ (computed). Then $|\text{absolute error}| < 1.667 \times 10^{-11} < 5 \times 10^{-11}$, and since $e = -5$, $e - m = -11$ and $x_a$ has only 6 significant digits of accuracy, a loss of 4 in this arithmetic.

$x_t = (1 - \cos(.01))/(.01)^{2} = 0.4999958333\ldots$ vs. $x_a = 0.499996$ (computed). Then $|\text{absolute error}| < 1.667 \times 10^{-7} < 5 \times 10^{-7}$, and since $e = -1$, $e - m = -7$ and $x_a$ has just 6 significant digits of accuracy, a loss of 4 in this arithmetic.

Remedy #1: Substitute the Taylor polynomial $1 - x^{2}/2! + x^{4}/4! - x^{6}/6!$ for $\cos(x)$. Then the problematic expression becomes $\tfrac{1}{2} - x^{2}/24 + x^{4}/720$.
Remedy #2: Rewrite it as $\sin^{2}(.01) / [(1 + \cos(.01))\,(.01)^{2}]$.
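The same cancellation, and both remedies, can be observed in ordinary double precision. This is a minimal Python sketch, assuming IEEE doubles rather than the 10-digit decimal arithmetic of Atkinson's example, so the digit counts differ from those above; the function names naive, taylor, and rewritten are illustrative only.

```python
import math

def naive(x):
    # Direct evaluation: 1 - cos(x) cancels catastrophically for small x.
    return (1.0 - math.cos(x)) / x**2

def taylor(x):
    # Remedy #1: truncated Taylor form 1/2 - x^2/24 + x^4/720.
    return 0.5 - x**2 / 24 + x**4 / 720

def rewritten(x):
    # Remedy #2: sin^2(x) / ((1 + cos(x)) x^2), which avoids the subtraction.
    return math.sin(x) ** 2 / ((1.0 + math.cos(x)) * x**2)

# As x shrinks, the naive column drifts away from 1/2 while the remedies do not.
for x in (1e-2, 1e-5, 1e-9):
    print(f"x={x:g}  naive={naive(x):.16g}  "
          f"taylor={taylor(x):.16g}  rewritten={rewritten(x):.16g}")
```

For $x = .01$ the three values agree to roughly 12 digits; by $x = 10^{-9}$ the computed cosine is indistinguishable from 1 and the naive form has lost every digit, while both remedies stay near 1/2.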