WEEK 4: REAL ARITHMETIC AND COMPUTERS

1. What is a real number?

As you will learn in Analysis, the answer is long-winded. In essence, the set R includes all rational numbers and, in addition, satisfies the famous Completeness Axiom:

    Every non-empty set of real numbers that is bounded above has a least upper bound.

Example 1.1. Consider the set S = { x ∈ R : x^2 ≤ 2 }. This set is not empty (since 1 ∈ S) and it is bounded above; indeed, for every x ∈ S we have x ≤ 2, so 2 is an upper bound for S. By the Completeness Axiom, S must have a least upper bound. We denote this number by √2. It can be proved that (√2)^2 = 2.

It is a well-known fact that √2 cannot be expressed as a ratio of two integers, i.e. √2 is irrational. At present, there is no satisfactory way of storing, or performing exact calculations with, irrational numbers. In practice, we must be content with using (rational) approximations of real numbers.

Example 1.2. Define a sequence of rational numbers {x_n}_{n∈N} via

    x_0 = 1,    x_{n+1} = (1/2) (x_n + 2/x_n).

It can be shown that this sequence converges very rapidly to √2.

We saw last week how the binary system uses lists of symbols (the symbols 0 or 1) to represent whole numbers. This representation can be used to store a whole number on a computer. The basic unit of storage on modern computers is called a bit. Each bit has a location in the computer "memory" and holds either 0 or 1. So a whole number can be stored by using enough bits. Since a rational number is a ratio of two whole numbers, we can therefore in principle perform calculations in exact arithmetic. Unfortunately, in most applications, the lengths of the numerators and denominators would grow very quickly as the calculation proceeds, taking much time and computer memory.

Floating-point arithmetic is a practical and efficient alternative to exact rational arithmetic. The details of this mode of calculation vary from one computer system to another. Our purpose here is to convey the main ideas, at the risk of some oversimplification.
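The iteration of Example 1.2 is easy to try out. The sketch below uses Python's standard fractions module so that the arithmetic is exact; it also illustrates the point made above about exact rational arithmetic, since the numerators and denominators visibly grow at each step.

```python
from fractions import Fraction

# Example 1.2:  x_0 = 1,  x_{n+1} = (1/2)(x_n + 2/x_n).
# Fraction keeps every x_n as an exact ratio of whole numbers.
x = Fraction(1)
for n in range(4):
    x = (x + 2 / x) / 2
    print(n + 1, x, float(x))

# The fourth iterate is already 665857/470832 = 1.41421356237469...,
# correct to about 11 decimal places, while the denominator has grown
# from 1 digit to 6 digits in only four steps.
```

Each step roughly doubles the number of correct digits (the iteration is Newton's method applied to x^2 − 2 = 0), but it also roughly doubles the lengths of the numerator and denominator.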
Essentially, floating-point arithmetic consists of some operations performed on a certain set of rational numbers.

2. The set of floating-point numbers

First, let us describe the set of numbers. A floating-point number x takes the form

(2.1)    x = s × m × 2^(e−σ).

In this expression, s denotes the sign (i.e. ±), m denotes the so-called mantissa, e is the shifted exponent, and σ is the shift, which is the same for every number. The main point is that every floating-point number occupies the same amount of computer storage (number of bits), regardless of its actual value. In the case of "double-precision" floating-point arithmetic, x occupies precisely 64 bits of memory, as follows:

    s: 1 bit    m: 52 bits    e: 11 bits

The mantissa is a fraction which takes the binary form

    m = 0.m_0 m_1 ... m_51 (base 2) ∈ [0, 1),

where each m_i is either 0 or 1. The shifted exponent is a whole number that takes the binary form

    0 ≤ e = e_10 e_9 ... e_0 (base 2) ≤ 2^11 − 1.

In order to allow for a nearly equal range of positive and negative exponents, the shift is

    σ = 2^10 = 1024.

Hence −1024 ≤ e − σ ≤ 1023, and so every non-zero floating-point number satisfies the inequalities

    2^(−1024) ≤ |x| ≤ 2^(1023).

Attempts to create larger (respectively smaller) non-zero floating-point numbers will therefore result in a processing error called overflow (respectively underflow).

The digits that make up the mantissa are called the significant digits. Hence, in double-precision arithmetic, one uses 52 significant binary digits; this is equivalent to about 16 decimal digits (since 2^10 ≈ 10^3).

3. The floating-point operations

It should be clear by now that the set of floating-point numbers, i.e. of numbers that can be written in the form (2.1), is quite small. We shall denote the set of floating-point numbers by F.

Example 3.1. The number 1/10 has the expansion

    1/10 = 0.0001100110011001100... (base 2).

This expansion does not terminate.
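The 1 + 11 + 52 bit layout described above can be inspected directly on a real machine. The sketch below, in Python, unpacks the three bit fields of a 64-bit double. A caveat: actual IEEE 754 hardware differs slightly from the simplified scheme in these notes (the standard uses an exponent bias of 1023 and an implicit leading 1 in front of the mantissa bits), so the example is meant only to exhibit the fixed-width fields, not the exact formula (2.1).

```python
import struct

def double_bits(x):
    # Reinterpret the 8 bytes of a double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                # 1 sign bit
    e = (bits >> 52) & 0x7FF      # 11 exponent bits
    m = bits & ((1 << 52) - 1)    # 52 mantissa bits
    return s, e, m

print(double_bits(1.0))   # (0, 1023, 0): +1.0 stored with exponent field 1023
print(double_bits(-0.5))  # (1, 1022, 0): sign bit set, exponent field 1022
```

Note how every value, large or small, occupies exactly the same 64 bits; only the contents of the three fields change.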
It follows that 1/10 ∉ F. For the purpose of approximating such numbers, the floating-point arithmetic system includes a rounding function ̺ : Q → F with the following properties.

• For every x ∈ F, ̺(x) = x.
• Let x ∈ Q, and let [a, b] be the smallest interval containing x with endpoints in F. Then ̺(x) equals a or b, whichever is nearest to x. If a and b are equidistant from x, then ̺(x) is determined by a process that varies with the particular computer system.

Example 3.2. ̺(1/10) = 0.10000000000000001 (base 10). See Van Rossum, Appendix B.

The basic operations defined on F are then:

• Floating-point addition: x ⊕ y = ̺(x + y).
• Floating-point subtraction: x ⊖ y = ̺(x − y).
• Floating-point multiplication: x ⊗ y = ̺(x × y).
• Floating-point division: x ⊘ y = ̺(x / y).
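Both the rounding of Example 3.2 and the operation x ⊕ y = ̺(x + y) can be observed in any Python session. The decimal module lets us print the exact value that ̺(1/10) actually stores, and a familiar comparison shows that floating-point addition of already-rounded operands need not agree with exact addition.

```python
from decimal import Decimal

# Decimal(0.1) displays the exact rational value of rho(1/10),
# which is slightly larger than 1/10 itself.
print(Decimal(0.1))

# x (+) y = rho(x + y): rounding at each step means that
# (rho(1/10) (+) rho(2/10)) differs from rho(3/10).
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004
```

This is not a bug in Python: every system using double-precision floating-point arithmetic exhibits exactly the same behaviour, because 1/10, 2/10, and 3/10 all lie outside F.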