Suppose that x and y are NMNs (both > 0 or both < 0)

Floating-Point Systems

For $\beta$ = base, $e$ = exponent, and $n$ = significand (or mantissa) length (as in Atkinson & Han), we consider only 0 and numbers of the form $x = \pm \beta^e (d_1.d_2 d_3 d_4 \ldots d_n)_\beta$, where $d_1 \ge 1$. These are called the normalized machine numbers or (as here) NMNs.

For the floating-point system defined above with $L \le e \le U$, $L$ and $U$ integers, the smallest positive NMN is $\beta^L$ while the largest is $\beta^{U+1}(1 - \beta^{-n})$, and the cardinality of all these NMNs (+, -, and 0) is $2(U - L + 1)(\beta^n - \beta^{n-1}) + 1$.

Note: The $k$th positive NMN, $x_k$ for $k = 1$ to $(U - L + 1)\,\#m$, is $\beta^{L+p-1}\left(1 + (k - 1 - (p - 1)\,\#m)\,\beta^{1-n}\right)$, where $\#m = \beta^n - \beta^{n-1}$ is the number of normalized mantissas and $p$ is the 'block' number for $x_k$, i.e. $p = \lceil k/\#m \rceil$ (so $p$ increases from 1 to $U - L + 1$).

The IEEE base-2 double-precision standard calls for $-1022 \le e \le 1023$ and $n = 53$. In this arithmetic, the least positive NMN is $2^{-1022} \approx 2.225 \times 10^{-308}$ and the largest is $2^{1024}(1 - 2^{-53}) \approx 1.798 \times 10^{308}$. It also follows that there are $2046 \cdot 2^{52}$ = 9,214,364,837,600,034,816 ($\approx 9.214 \times 10^{18}$) positive NMNs in the 53-bit arithmetic. The machine epsilon is $2^{-52} \approx 2.22 \times 10^{-16}$, and the largest integer in the arithmetic, $M$, with the property that all of $M$'s integer predecessors also belong to the arithmetic is $2^{53} \approx 9.0 \times 10^{15}$.

It should be noted that the MNs in a given finite arithmetic do not include every integer between that arithmetic's least and greatest positive members. Assume a base-2 arithmetic with $a \le e \le b$, $a \le 0 \le n \le b$, and $q$ an integer. Then if $1 \le q \le 2^n - 1$, $q$ must be an MN. Why? Since (a proof is in Gilbert)

$q = c_1 2^{n-1} + c_2 2^{n-2} + c_3 2^{n-3} + \cdots + c_n 2^0 = 2^{n-j}(c_j.c_{j+1}c_{j+2}\ldots c_n)_2$,

where the $c_i$ are 0 or 1 and $c_j$ is the first nonzero $c_i$ (so $c_j = 1$ and the first $j-1$ of the $c$'s are 0).

Now, $2^n$ is clearly an MN (it's $2^n(1.000\ldots0)_2$), but $2^n + 1$ is not one while $2^n + 2$ is. Proof: $2^n + 1 = 2^n(1.000\ldots01)_2$, where the trailing 1 is in the $2^{-n}$ place, one place beyond the $n$-bit mantissa. The next MN is $2^n + 2 = 2^n(1.000\ldots1)_2$, where this 1 is in the $2^{-(n-1)}$ place, i.e.
$2^n + 1$ is passed over (the integer gaps widen as we move to the right).

Atkinson's definition of 'significant digits' is as follows: $x_a$ has at least $m$ significant digits of accuracy (or digits of accuracy) wrt $x_t$ provided the magnitude of the error, $|x_t - x_a|$, is less than or equal to five units in the $(m+1)$st digit counting rightward from $x_t$'s first nonzero digit. This is equivalent to requiring $|x_t - x_a| \le 5 \times 10^{e-m}$, for $e$ = $x_t$'s normalization exponent, i.e. $x_t = \pm 10^e (d_1.d_2 d_3 d_4 \ldots)$ with $d_1 \ge 1$.

Atkinson claims (pg. 45): If |relative error| $\le 5 \times 10^{-(m+1)}$, then $x_a$ has (at least) $m$ significant digits of accuracy wrt $x_t$.

Proof: Let $x_t = \pm 10^e (d_1.d_2 d_3 d_4 \ldots d_m d_{m+1} \ldots)$, where $d_1 \ge 1$. Then the bound on the relative error implies that $|x_t - x_a| \le 5 \times 10^{e-(m+1)} (d_1.d_2 d_3 d_4 \ldots d_m d_{m+1} \ldots) < 5 \times 10^{e-m}$, and $x_a$ has at least $m$ significant digits of accuracy relative to $x_t$.

From Skeel & Keiper, we also have the following:
decimal places of accuracy, or d-acc, given by $-\log_{10} |x_t - x_a|$
digits of accuracy, or -acc, given by $-\log_{10} (|x_t - x_a|/|x_t|)$

From Conte & de Boor (pg. 10), the significant digits definition: $x_a$ approximates $x_t$ to at least $d$ significant digits if $|x_t - x_a|/|x_t| \le 5 \times 10^{-d}$.

Finally, from an old trigonometry textbook (Greenleaf), the definition: a digit is significant provided it's known to within 4 possibilities.

Examples using Atkinson's definition (the error is $x_t - x_a$):

(1) $x_t = \pi = 3.1415926\ldots$ vs. $x_a = 22/7 = 3.142857142857\ldots$ The absolute error = $-.00126\ldots$, so |error| $< 5 \times 10^{-3}$ and since $e = 0$, $e - m = -3$ implies that $m$ is 3.

(2) $x_t = 2/9 = .222\ldots$ vs. $x_a = .222$. The absolute error = $.000222\ldots$, so |error| $< 5 \times 10^{-4}$ and since $e = -1$, $e - m = -4$ implies that $m$ is 3.

(3) $x_t = 23.496$ vs. $x_a = 23.494$. The absolute error = $.002$, so |error| $< 5 \times 10^{-3}$ and since $e = 1$, $e - m = -3$ implies that $m$ is 4.

(4) $x_t = .02144$ vs. $x_a = .02138$. The absolute error = $.00006 = .6 \times 10^{-4} < 5 \times 10^{-4}$ and since $e = -2$, $e - m = -4$ implies that $m$ is 2.

(5) $x_t = 5$ vs.
$x_a = 4.995$. The absolute error = $.005$, so |error| $= 5 \times 10^{-3}$ and since $e = 0$, $e - m = -3$ implies that $m$ is 3.

Claim: Atkinson's $m$ satisfies $e + \log_{10} 5 - \log_{10}|\text{error}| - 1 < m \le e + \log_{10} 5 - \log_{10}|\text{error}|$.

Proof: Suppose |error| $\le 5 \times 10^{e-m}$. Then $m \le \log_{10} 5 + e - \log_{10}|\text{error}|$, and taking $m$ as large as possible gives $\log_{10} 5 + e - \log_{10}|\text{error}| - 1 < m \le \log_{10} 5 + e - \log_{10}|\text{error}|$, with $m$ the unique integer satisfying $5 \times 10^{e-(m+1)} < |\text{error}| \le 5 \times 10^{e-m}$.

Note that $m = \lfloor -\log_{10}\left(|\text{error}|/(5 \times 10^e)\right) \rfloor$. To see this, suppose $a > 0$ and let $m = \lfloor -\log_{10} a \rfloor$. Then $-\log_{10} a = m + f$, where $0 \le f < 1$, and $-\log_{10} a - 1 < m \le -\log_{10} a$. This implies $10^{-m-1} < a \le 10^{-m}$, and if $a = |\text{error}|/(5 \times 10^e)$, then $5 \times 10^{e-(m+1)} < |\text{error}| \le 5 \times 10^{e-m}$, and thus no larger $m$ satisfies |error| $\le 5 \times 10^{e-m}$.

Applying this to each of the examples appearing above yields (bounds rounded to one decimal place):

(1) $2.6 < m \le 3.6$, so $m = 3$
(2) $2.4 < m \le 3.4$, so $m = 3$
(3) $3.4 < m \le 4.4$, so $m = 4$
(4) $1.9 < m \le 2.9$, so $m = 2$
(5) $2 < m \le 3$, so $m = 3$

Regarding Miscellaneous Calculators

The TI-85 uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, $-999 \le e \le 999$.
The TI-83 uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, $-99 \le e \le 99$.
The TI-89 (& Voyage 200) uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, $-999 \le e \le 999$.
The HP-48G uses a 12-digit (no internal) base 10 arithmetic with simple rounding, $-499 \le e \le 499$.

Example regarding Matlab's "format hex"

>> format hex
>> 0.1

yields 3fb999999999999a, which in binary is

0011 1111 1011 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010

The first 0 is the algebraic sign (here a +), the next 11 bits code $E = 1019$, which is related to the exponent $e$ via $e = E - 1023$ (here $e = -4$), and the final 52 bits code the mantissa (save for the 1 from normalization), which here is the exact mantissa $1.100110011001\ldots$ rounded up to the $2^{-52}$ place. Typing sym(0.1, 'e') gives 1/10 + eps/40, and Matlab is telling us the floated form of 0.1 in the 53-bit arithmetic is $2^{-52}/40$ more than 1/10.
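
The same bit fields can also be inspected outside Matlab. The following is a minimal Python sketch, not part of the original notes: the helper name `double_fields` is made up for illustration, and it assumes the platform's `float` is an IEEE double (true for standard CPython builds).

```python
import struct
from fractions import Fraction

def double_fields(x):
    """Split an IEEE double into (hex string, sign, biased E, e, fraction bits)."""
    # Reinterpret the 8 bytes of the double as a 64-bit unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                  # 1 sign bit
    biased = (bits >> 52) & 0x7FF      # 11 exponent bits, E
    frac = bits & ((1 << 52) - 1)      # 52 stored mantissa bits
    return f"{bits:016x}", sign, biased, biased - 1023, frac

hex_str, sign, E, e, frac = double_fields(0.1)
print(hex_str, sign, E, e)

# Exact rational value of the floated 0.1, and its excess over 1/10.
excess = Fraction(0.1) - Fraction(1, 10)
print(Fraction(0.1), excess == Fraction(1, 40 * 2**52))
```

Here `Fraction(0.1)` plays the role of sym(0.1, 'f'), and the final comparison checks the eps/40 excess reported by sym(0.1, 'e').
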
Finally, typing sym(0.1, 'f') will give (as a fraction) the floated form of 0.1.

Assume $x$ & $y$ are NMNs with $xy > 0$ & $0 < |y| < |x|\, b^{-1} u$, for $b$ the base of the arithmetic and $u$ the unit round-off ($= b^{1-s}$ if chopping, and $.5\, b^{1-s}$ if rounding, with $s$ the number of mantissa digits).

Claim: $y$ has no effect if added to $x$, i.e. $x \oplus y = x$.

Proof: Wlog, take $x$ & $y > 0$ and let $x = m\, b^e$ & $y = n\, b^f$ with $b > m, n \ge 1$.

If chopping is used, then $y < x\, b^{-1} u$ means that $n\, b^f < m\, b^{e-1} b^{1-s} = m\, b^{e-s}$, or $b^f < (m/n)\, b^{e-s} < b^{e-s+1}$, implying $f - e < 1 - s$, or $f - e \le -s$. Therefore $x \oplus y = x$, since $n\, b^{f-e} \le n\, b^{-s} < b^{1-s}$, which is less than one unit in the last place of $m$.

If rounding is used, then $y < x\, b^{-1} u$ means $n\, b^f < (m/2)\, b^{e-s}$, or that $b^f < (m/(2n))\, b^{e-s} < \tfrac12 b^{e-s+1}$, implying $f - e < 1 - s - \log_b 2$ (for $b = 2$ this gives $f - e \le -1 - s$). In particular $y = n\, b^f < \tfrac12 b^{e+1-s}$, i.e. $y$ is less than half a unit in the last place of $x$, so that $x \oplus y = x$.

Examples: Let $b = 10$ and $s = 3$. Now take $x = 1.24$ & $y = .00123$. Then $x \oplus y = 1.24$ (so $y$ is 'ignored').

Now suppose that $2 \ge y/x \ge 1/2$.

Claim: $x \ominus y = x - y$, i.e. the subtraction is exact (assuming the exponent of the difference lies within the range of the arithmetic).

Note: For any 2 mantissas $m$ & $n$, we have $n/m < b$ & $m/n < b$, i.e. $1/b < n/m < b$.

Proof: Wlog, take $x$ & $y > 0$ and assume that $x > y$. Let $x = m\, b^e$ and $y = n\, b^f$, with $b > m, n \ge 1$. We claim that $f = e - 1$, $e$, or $e + 1$: Suppose $f = e + k$, for some $k \ge 2$. Then $2 \ge (n/m)\, b^k > b^{k-1} \ge b \ge 2$, a contradiction. If instead $f = e - k$, for some $k \ge 2$, then $.5 \ge b^{-1} \ge b^{1-k} > (n/m)\, b^{-k} \ge .5$, a contradiction. Since $x > y$, $f = e + 1$ is impossible.

If $f = e$, then $x - y = (m - n)\, b^e = b^{1-s}(M - N)\, b^e$ for $M$ & $N$ the integer mantissas $m\, b^{s-1}$ and $n\, b^{s-1}$. Since $M - N$ falls between 1 and $(b^s - 1) - b^{s-1}$, the difference is expressible as an at most $s$-digit base-$b$ integer (any integer between 1 & $b^s - 1$ is so expressible), implying $m - n$ is expressible as an at most $s$-digit fractional mantissa, i.e. $x - y$ is exact.

Finally, if $f = e - 1$, then it follows that $x - y = b^{1-s}(bM - N)\, b^{e-1}$. But since $b^{-1}(n/m) \ge .5$, or $2n \ge bm$, it holds that $N \ge bM - N$.
Therefore, $bM - N$ is an at most $s$-digit base-$b$ integer and $x - y$ is exactly representable in the arithmetic (if the exponent falls in the range allowed by the arithmetic).

Examples: Let $b = 10$ and $s = 3$. Now take:

$x = 4.24$ & $y = 2.60$. Then $y/x = .6132\ldots$ and $x - y = 1.64$ (is exact)
$x = 2.46$ & $y = 4.28$. Then $y/x = 1.7398\ldots$ and $x - y = -1.82$ (is exact)
$x = 5.24$ & $y = 12.78$. Then $y/x = 2.4389\ldots$ and $x - y = -7.54$ (is exact)
$x = 5.24$ & $y = 62.78$. Then $y/x = 11.9809\ldots$ and $x - y = -57.5$ (not exact)

Regarding the loss-of-significance (LOS) example in Atkinson on page 49:

$x_t = \cos(.01) = 0.99995000041666\ldots$ vs. $x_a = 0.9999500004$ (computed). The absolute error $< 1.667 \times 10^{-11} < 5 \times 10^{-11}$ and since $e = -1$, $e - m = -11$ & $x_a$ has the full 10 significant digits of accuracy allowed by the arithmetic.

$x_t = 1 - \cos(.01) = 0.00004999958333\ldots$ vs. $x_a = 0.0000499996$ (computed). Then |absolute error| $< 1.667 \times 10^{-11} < 5 \times 10^{-11}$ and since $e = -5$, $e - m = -11$ & $x_a$ has only 6 significant digits of accuracy, a loss of 4 in this arithmetic.

$x_t = (1 - \cos(.01))/(.01)^2 = 0.4999958333\ldots$ vs. $x_a = 0.499996$ (computed). Then |absolute error| $< 1.667 \times 10^{-7} < 5 \times 10^{-7}$ and since $e = -1$, $e - m = -7$ & $x_a$ has just 6 significant digits of accuracy, a loss of 4 in this arithmetic.

Remedy #1: Substitute the Taylor polynomial $1 - x^2/2! + x^4/4! - x^6/6!$ for $\cos(x)$. Then the problematic expression becomes $\tfrac12 - x^2/24 + x^4/720$.

Remedy #2: Rewrite it as $\sin^2(.01)/[(1 + \cos(.01))(.01)^2]$.
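
The loss-of-significance figures for $(1 - \cos x)/x^2$ at $x = .01$, and the effect of the two remedies, can be imitated with Python's `decimal` module set to 10 significant digits. This is only a sketch under stated assumptions: the function name `ten_digit_demo` is made up here, cos and sin are taken from double precision and then rounded to 10 digits (standing in for a 10-digit decimal machine), and `decimal`'s default round-half-even rule is used, which need not match any particular calculator.

```python
import math
from decimal import Decimal, localcontext

def ten_digit_demo(x_str="0.01", digits=10):
    """Evaluate (1 - cos x)/x^2 three ways in `digits`-digit decimal arithmetic."""
    with localcontext() as ctx:
        ctx.prec = digits                      # every operation rounds to `digits` digits
        x = Decimal(x_str)
        x2 = x * x
        # Unary plus rounds the double-precision value to the context precision.
        c = +Decimal(math.cos(float(x)))
        s = +Decimal(math.sin(float(x)))
        naive = (1 - c) / x2                   # subtracts two nearly equal numbers
        taylor = Decimal(1) / 2 - x2 / 24 + x2 * x2 / 720   # Remedy #1
        conj = (s * s) / ((1 + c) * x2)        # Remedy #2
    return c, naive, taylor, conj

c, naive, taylor, conj = ten_digit_demo()
print("cos(.01)  :", c)
print("naive     :", naive)
print("Remedy #1 :", taylor)
print("Remedy #2 :", conj)
```

With the default arguments the naive value reproduces the 0.499996 quoted above, while both remedies stay within about a unit in the tenth digit of 0.4999958333.
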
