Numerical Error: Atkinson defines the absolute error to be AE = error = xt - xa, where xa approximates some exact value xt, and the relative error as RE = relative error = (xt - xa)/xt. We will also have occasion to use these logarithmic measures of error: d-acc = “decimal places of accuracy” = - log10|error| -acc = “digits of accuracy” = - log10|RE| Let’s try and justify the d-acc & -acc labeling... Suppose xa is accurate to at least the dth decimal place (i.e. |AE| 10-d), where d is an integer. Then d-acc d, and conversely. In particular, since d-acc [d-acc], with [ ] denoting the ‘greatest integer function’, it holds that xa is accurate to at least [d-acc] decimal places. If |AE| 10-d, then since |xt| 10e, |RE| 10-(e+d) -acc e+d. Here, we define xa to have at least k digits of accuracy* with respect to xt if the magnitude of the error, |AE|, is less than or equal to 1 unit in the (k+1)st digit counting rightward from xt’s first nonzero digit. More precisely, |AE| 10e-k, for e satisfying |xt| = 10e (d1.d2d3...dk...) with d1 1. Note: This also implies |RE| 10-k so that -acc k. *This is a bit more stringent than Atkinson’s definition (1 unit vs. 5 units). Claim: If -acc (a positive integer), then xa is accurate wrt xt to at least the (- 1)st place counting rightward from xt’s first nonzero digit. Proof: Suppose that -acc . Then -log10|RE| |RE| 10- |AE| |xt| 10- < 10e-(-1) and xa has the claimed accuracy. Note: We can’t do better than the (- 1)st place. Try xt = 5.000 & xa = 4.995. Then |AE| = 5 10-3 & -acc = 3, but |AE| isn’t less than or equal to10-3. Claim: Suppose xa is accurate wrt xt to at least the th place counting rightward from xt’s first nonzero digit, i.e. assume |AE| 10e-, where |xt| = (d1.d2d3...) 10e with d1 1. Then -acc . Proof: Since |RE| = |AE|/|xt| 10-, this follows immediately. Atkinson’s #7 on pg. 43, but using the round-to-even rule & an arbitrary : Claim: For x in the range of the arithmetic, that is if ßL |x| ßU( – ß1-n), |(x - fl(x))/x| .5ß1-n/(1 + .5ß1-n) < .5ß1-n, where “fl” is the round-to-even floating point operation. Proof: Let x = sgn(x) ße (d1.d2d3...dndn+1...)ß, z = (d1.d2d3...dn)ß and y = (.dn+1dn+2dn+3…)ß. Then we know that fl(x) (and note that fl(-x) = -fl(x)) is either... (1) sgn(x) ße z + sgn(x) ße ß1-n if y > .5 or if y = .5 & dn is odd, or (2) sgn(x) ße z if y < .5 or if y = .5 and dn is even. Therefore, (1) |x - fl(x)| = ße ß1-n (1 - y) and (2) |x - fl(x)| = ße ß1-n y. Accordingly, in case (1), |(x - fl(x))/x| ße ß1-n (1 - y)/(ße(1 + ß1-n y)) .5 ß1-n /(1 + .5 ß1-n) < .5 ß1-n, while in case (2), |(x - fl(x))/x| ß1-n y/(1 + ß1-n y) .5ß1-n/(1 + .5ß1-n) < .5 ß1-n. * Since g(y) = ß1-n y/(1 + ß1-n y) is strictly increasing for y > 0. Rounding-up-fives (aka simple rounding) yields the same inequality while chopping yields |x - fl(x)| = ße ß1-n y < ße+1-n & therefore |(x - fl(x))/x| < ß1-n. Regarding Addition: Suppose we want to compute S = 1in ai, where we can ignore the input errors occurring when the ai's are floated, but not the errors associated with the intermediate sums, i.e. the values a1 + a2, a1 + a2 + a3, a1 + a2 + a3 + ... + an. Set S2 = fl(a1 + a2) = (a1 + a2)(1 + 2), S3 = fl(S2 + a3) = (S2 + a3)(1 + 3), ..., Sn = fl(Sn-1 + an) = (Sn-1 + an)(1 + n). Then since Sn = (a1 + a2) 2in (1 + i) + a3 3in (1 + i) + a4 4in (1 + i) + … + an-1 (1 + n-1) (1 + n) + an (1 + n) = (a1 + a2) (1 + 2in i + r2) + a3 (1 + 3in i + r3) + … + an-1 (1 + n-1+ n + rn) + an (1 + n), where r2, r3, r4,…, are higher power expressions in 2, 3, …, it follows that if these expressions are ignored, Sn – S is approximately equal to (a1 + a2) 2in i + a3 3in i + … + an-1(n-1+ n) + an n. This suggests that we might minimize the error if we first sort the ai's from least to greatest in magnitude, i.e. |a1| |a2| |a3| . . . |an|, before adding. Example: Add 1.21, 10.4, 30.4, 50.6, & 70.1. The exact value is 162.71, but using 3-digit arithmetic with a round-to-even, yields the following least to greatest: 163. rel. error = .00178... vs. greatest to least: 162. rel. error = .00436...