Floating Point Arithmetic

In the 1970s, floating-point arithmetic capability was found only in mainframes and supercomputers. Although many microprocessors designed in the 1980s had floating-point coprocessors, their floating-point arithmetic was about three orders of magnitude slower than that of mainframes and supercomputers. With advances in microprocessor technology, many microprocessors designed in the 1990s, such as the Intel Pentium III, AMD Athlon, and Compaq Alpha 21264, have high-performance floating-point capabilities that rival supercomputers. High-speed floating-point arithmetic has become a standard requirement for microprocessors designed today.

The IEEE Floating Point Standard is an effort by computer manufacturers to conform to a common representation and arithmetic convention for floating-point data. Virtually all computer manufacturers in the world have accepted this standard; that is, virtually all microprocessors designed in the future will conform to it. It is therefore very important for a computer designer to understand the concept and the implementation requirements of this standard.

In the IEEE Floating Point Standard, a floating point number consists of three parts: sign (S), exponent (E), and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating point number. The value of such a floating point number can be derived as follows:

value = (-1)^S * M * 2^E, where 1.0 <= M < 2.0

The interpretation of S is simple: S = 0 gives a positive number and S = 1 a negative number. The issue is how to encode M and E. We will use the following example to explain the encoding conventions for M and E.

Example: assume that each floating point number consists of a 1-bit sign, a 3-bit exponent, and a 2-bit mantissa. We will use this 6-bit format to illustrate the issues involved in encoding E and M. Numbers in the decimal system will have subscript D and those in the binary system will have subscript B. For example, 0.5D = 0.1B.

Normalized representation of M

Specifying that 1.0B <= M < 10.0B makes the mantissa value of each floating point number unique. For example, the only mantissa value allowed for 0.5D is M = 1.0:

0.5D = 1.0B * 2^-1

Another potential candidate would be 0.1B * 2^0, but the value of the mantissa would be too small according to the rule. Similarly, 10.0B * 2^-2 is not legal because the value of the mantissa is too large. In general, with the restriction that 1.0B <= M < 10.0B, every floating point number has exactly one legal mantissa value. The numbers that satisfy this restriction are referred to as normalized numbers. Because all mantissa values are of the form 1.XX, the 1. part can be omitted from the representation. Therefore, the mantissa value of 0.5 in a 2-bit mantissa representation is 00, which is derived by omitting the 1. from 1.00.

Excess encoding of E

If n bits are used to represent the exponent, 2^(n-1) - 1 is added to the two's complement representation of the exponent to form its excess representation. For our 3-bit exponent, the bias is 2^2 - 1 = 3:

2's complement    Actual decimal       Excess-3
000               0                    011
001               1                    100
010               2                    101
011               3                    110
100               (reserved pattern)   111
101               -3                   000
110               -2                   001
111               -1                   010

Figure 1 Excess-3 encoding, sorted by 2's complement ordering
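The mapping from an actual exponent value to its excess code is just an addition of the bias. The following Python sketch (the helper name to_excess is ours, for illustration) reproduces the excess-3 column of Figure 1 and checks that comparing the codes as unsigned values preserves the ordering of the signed exponents, which is the property discussed next.

def to_excess(value, n_bits=3):
    """Excess-coded bit pattern of a signed exponent: add the bias 2^(n-1) - 1."""
    bias = 2 ** (n_bits - 1) - 1
    code = value + bias
    assert 0 <= code < 2 ** n_bits, "exponent out of range"
    return format(code, "0{}b".format(n_bits))

for e in range(-3, 4):                      # actual exponents -3 .. 3
    print(e, to_excess(e))

# Fixed-width binary strings compare like unsigned integers, so the ordering
# of the codes matches the ordering of the values they represent:
assert to_excess(-2) < to_excess(1)         # '001' < '100', just as -2 < 1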
The advantage of the excess representation is that an unsigned comparator can be used to compare signed numbers. As shown in Figure 2, the excess-3 code increases monotonically from -3 to 3: the code for -3 is 000 and that for 3 is 110. Thus, if one uses an unsigned comparator to compare the excess-3 codes of any two numbers from -3 to 3, the comparator gives the correct comparison result in terms of which number is larger, smaller, and so on. For example, if one compares the excess-3 codes 001 and 100 with an unsigned comparator, 001 is smaller than 100. This is the right conclusion, since the values they represent, -2 and 1, have exactly the same relation. This is a very desirable quality for hardware implementation, since unsigned comparators are smaller and faster than signed comparators.

2's complement    Actual decimal       Excess-3
100               (reserved pattern)   111
101               -3                   000
110               -2                   001
111               -1                   010
000               0                    011
001               1                    100
010               2                    101
011               3                    110

Figure 2 Excess-3 encoding, sorted by excess-3 ordering

Now we are ready to represent 0.5D in our 6-bit format:

0.5D = 0 010 00, where S = 0, E = 010, and M = (1.)00

With a normalized mantissa and an excess-coded exponent, the value of a number with an n-bit exponent is

value = (-1)^S * 1.M * 2^(E - (2^(n-1) - 1))

Representable Numbers

The representable numbers of a given format are the set of all numbers that can be exactly represented in the format. For example, if one uses a 3-bit unsigned integer format, the representable numbers are:

000  001  010  011  100  101  110  111
 0    1    2    3    4    5    6    7

Figure 3 Representable numbers of a 3-bit unsigned format

Neither -1 nor 9 can be represented in the format given above. One can draw a number line to identify all the representable numbers, as shown in Figure 4, where the representable numbers of the 3-bit unsigned integer format are marked with stars.

Figure 4 Representable numbers of a 3-bit unsigned integer format (number line from -1 to 9, with 0 through 7 marked)

The representable numbers of a floating point format can be visualized in a similar manner. In Table 1, we define all the representable numbers of the format we have developed so far and of two variations. We use a 5-bit format to keep the size of the table manageable: 1-bit S, 2-bit E (excess coded), and 2-bit M (with the 1. part omitted). The no-zero column gives the representable numbers of the format we have discussed thus far. Note that with this format, 0 is not one of the representable numbers.

E    M    No-zero             Abrupt underflow    Gradual underflow
00   00   2^-1                0                   0
00   01   2^-1 + 1*2^-3       0                   1*2^-2
00   10   2^-1 + 2*2^-3       0                   2*2^-2
00   11   2^-1 + 3*2^-3       0                   3*2^-2
01   00   2^0                 2^0                 2^0
01   01   2^0 + 1*2^-2        2^0 + 1*2^-2        2^0 + 1*2^-2
01   10   2^0 + 2*2^-2        2^0 + 2*2^-2        2^0 + 2*2^-2
01   11   2^0 + 3*2^-2        2^0 + 3*2^-2        2^0 + 3*2^-2
10   00   2^1                 2^1                 2^1
10   01   2^1 + 1*2^-1        2^1 + 1*2^-1        2^1 + 1*2^-1
10   10   2^1 + 2*2^-1        2^1 + 2*2^-1        2^1 + 2*2^-1
10   11   2^1 + 3*2^-1        2^1 + 3*2^-1        2^1 + 3*2^-1
11   xx   (reserved pattern)  (reserved pattern)  (reserved pattern)

(Entries show the S = 0 values; for S = 1, each entry is negated.)

Table 1 Representable numbers of the no-zero, abrupt underflow, and gradual underflow formats

Examining how the representable numbers populate the number line, as shown in Figure 5, provides further insight. Figure 5 shows only the positive representable numbers; the negative numbers are symmetric to their positive counterparts.

Figure 5 Number line display of the representable numbers of the no-zero format (positive values only)
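As a cross-check on Table 1, the following sketch (the helper name value and the convention labels are ours, for illustration) enumerates the positive representable numbers of the 5-bit format under each of the three conventions; exact fractions are used so the spacing near 0 can be read off directly.

from fractions import Fraction

def value(e_bits, m_bits, convention):
    """Positive value of the 5-bit format: 2-bit excess-1 exponent, 2-bit mantissa."""
    if e_bits == 0b11:
        return None                              # reserved exponent pattern
    mantissa = Fraction(m_bits, 4)               # the .M part: 0.00B .. 0.11B
    if e_bits == 0:
        if convention == "abrupt":
            return Fraction(0)                   # whole E = 00 row collapses to 0
        if convention == "gradual":
            return mantissa                      # 0.M * 2^0, no implied leading 1
    return (1 + mantissa) * Fraction(2) ** (e_bits - 1)    # 1.M * 2^(E - 1)

for conv in ("no-zero", "abrupt", "gradual"):
    nums = sorted({value(e, m, conv) for e in range(4) for m in range(4)} - {None})
    print(conv + ":", [str(x) for x in nums])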
There are three observations that can be made from Figure 5. First, the number 0 is not representable in this format. Because 0 is one of the most important numbers, not being able to represent 0 in a number system is a serious shortcoming. Second, the representable numbers become closer to each other toward the neighborhood of 0. This is desirable behavior: as the absolute value of these numbers becomes smaller, it is more important to represent them accurately, and having the representable numbers close together makes it possible to represent numbers in that region more accurately. Unfortunately, this trend does not hold in the very vicinity of 0, which leads to the third observation: there is a gap in the representable numbers in the very vicinity of 0, because the range of the normalized mantissa precludes 0.

One method that has been used to accommodate 0 in a normalized number system is the abrupt underflow convention, illustrated by the second column of Table 1. Whenever E is 0, the number is interpreted as 0. This method takes away eight representable numbers (four positive and four negative) in the vicinity of 0 and makes them all 0. Although this method makes 0 representable, it creates an even larger gap between representable numbers in the vicinity of 0, as shown in Figure 6.

Figure 6 Representable numbers of the abrupt underflow format (number line; positive values only)

It is obvious, when Figure 6 is compared with Figure 5, that the gap between representable numbers in the vicinity of 0 has been enlarged significantly. This is very problematic, since many numerical algorithms rely on the fact that the accuracy of number representation is highest for the very small numbers near zero. These algorithms generate small numbers and eventually use them as denominators; any error in the representation can be greatly magnified in the division process.

The method actually adopted by the IEEE standard is called gradual underflow. This method relaxes the normalization requirement for numbers very close to 0: whenever E = 0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is assumed to be 0.XX. In general, if the n-bit exponent is 0, the value is

0.M * 2^(-2^(n-1) + 2)

For example, in Table 1, the gradual underflow value for mantissa 01 is 0.01B * 2^0 = 2^-2. Figure 7 shows the representable numbers of the gradual underflow format. The effect is a uniform spacing of representable numbers in the close vicinity of 0, which eliminates the undesirable gap of the previous two methods.

Figure 7 Representable numbers of the gradual underflow format (number line; positive values only)

IEEE Floating Point Format

The actual IEEE format has more special bit patterns than we have used so far:

exponent    mantissa    meaning
11...1      != 0        NaN
11...1      = 0         (-1)^S * infinity
00...0      != 0        denormalized
00...0      = 0         0

All other numbers are normalized floating-point numbers. Single precision numbers have a 1-bit S, 8-bit E, and 23-bit M. Double precision numbers have a 1-bit S, 11-bit E, and 52-bit M.

All representable numbers fall between -infinity and +infinity. An infinity is created by overflow, e.g., division by zero. Any representable number divided by +infinity or -infinity results in 0. NaN (Not a Number) is generated by operations whose input values do not make sense, for example 0/0, 0*infinity, infinity/infinity, and infinity - infinity. NaNs are also used for data that have not been properly initialized in a program.
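The field layout and the special bit patterns above can be inspected directly. The following sketch (the helper name classify is ours, for illustration) unpacks a value as an IEEE single-precision number and applies the classification table:

import struct

def classify(x):
    """Decode x as an IEEE single-precision number and classify its bit pattern."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                     # 1-bit sign
    e = (bits >> 23) & 0xFF            # 8-bit excess-127 exponent field
    m = bits & 0x7FFFFF                # 23-bit mantissa field (implied 1. omitted)
    if e == 0xFF:
        kind = "NaN" if m != 0 else "infinity"
    elif e == 0:
        kind = "zero" if m == 0 else "denormalized"
    else:
        kind = "normalized"
    return s, e, m, kind

for x in (0.5, 0.0, float("inf"), float("nan"), 1e-45):
    print(x, classify(x))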
There are two types of NaN in the IEEE standard: signaling and quiet. A signaling NaN causes an exception when used as an input to an arithmetic operation; for example, the operation 1.0 + signaling NaN raises an exception. Signaling NaNs are used in situations where the programmer would like to make sure that program execution is interrupted whenever any irregular value is used in a floating-point computation. These situations usually mean that there is something wrong with the execution of the program. In mission-critical applications, the execution cannot continue until the validity of the execution has been verified by separate means. For example, software engineers often mark all uninitialized data as signaling NaN. This practice ensures that the use of uninitialized data is detected during program execution.

A quiet NaN generates another quiet NaN when used as an input to an arithmetic operation; for example, the operation 1.0 + quiet NaN generates a quiet NaN. Quiet NaNs are typically used in applications where the user can review the output and decide whether the application should be re-run with different input for more valid results. When the results are printed, quiet NaNs are printed as "NaN" so that the user can spot them in the output file.

Rounding of Mantissa

We have discussed the floating-point format; now we concentrate on the issues regarding the accuracy of the arithmetic. The most important source of inaccuracy in floating-point arithmetic is rounding. Rounding occurs when the mantissa of a result value needs too many bits to be represented exactly. The cause of rounding is typically pre-shifting in floating-point arithmetic: when the two input operands of a floating-point addition or subtraction have different exponents, the mantissa of the one with the smaller exponent is right-shifted until the exponents are equal. As a result, the final result can have many more bits than the format can accommodate.

When rounding occurs, there are four desirable properties:

1) If an operation produces a result value that can be represented exactly, it must be so represented. → must carry a guard bit in the arithmetic data path.
2) The rounded value of x should be independent of the operation that generates it. → must carry a rounding bit in the arithmetic data path.
3) "Unbiased rounding to the nearest" must be supported. → must carry a sticky bit in the data path.
4) The user should be able to select the desired rounding mode for each specific application. → must provide several modes of rounding.

The IEEE standard provides four major rounding modes (a small sketch of their effect follows below):

Chopped (towards 0), e.g., 1.011 rounds to 1.0 and -1.011 rounds to -1.0.
Rounded towards +infinity, e.g., 1.011 rounds to 1.1 and -1.011 rounds to -1.0. This is desirable for a bank when it calculates the credit card interest to be charged to its customers.
Rounded towards -infinity, e.g., 1.011 rounds to 1.0 and -1.011 rounds to -1.1. This is desirable for a bank when it calculates the savings account interest to be paid to its customers.
Unbiased rounding to the nearest, e.g., 1.011 rounds to 1.1 and -1.011 rounds to -1.1. This is desirable for a scientist who is interested in generating unbiased statistical data.

For an arithmetic unit that supports a format with k bits of mantissa (including the implied 1.), the data path is actually k+3 bits wide. The extra three bits are the guard bit, the rounding bit, and the sticky bit.
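The following is a minimal sketch of the four rounding modes applied to the binary value 1.011B (1.375D), rounded to one fractional bit as in the examples above; the helper round_mantissa and the mode names are ours, for illustration.

from fractions import Fraction
import math

def round_mantissa(x, frac_bits, mode):
    """Round x (an exact Fraction) to frac_bits fractional binary bits under one mode."""
    scaled = x * 2 ** frac_bits
    if mode == "toward_zero":
        n = math.trunc(scaled)
    elif mode == "toward_+inf":
        n = math.ceil(scaled)
    elif mode == "toward_-inf":
        n = math.floor(scaled)
    else:                                   # "nearest_even": Python's round() is round-half-to-even
        n = round(scaled)
    return Fraction(n, 2 ** frac_bits)

x = Fraction(11, 8)                         # 1.011B = 1.375D
for mode in ("toward_zero", "toward_+inf", "toward_-inf", "nearest_even"):
    print(mode, float(round_mantissa(x, 1, mode)), float(round_mantissa(-x, 1, mode)))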
Usage of guard bit

The purpose of the guard bit is to make sure that if the ideal result of an arithmetic operation can be exactly represented, the data path with a limited number of bits in the arithmetic unit will generate that ideal result.

Example: 1.00 * 2^0 - 1.11 * 2^-1 = 1.00 * 2^-3 (ideally)

Without a guard bit, a 3-bit data path would be used:

  1.00 * 2^0
- 0.11 * 2^0    (LSB lost after pre-shift)
____________________________________________
  0.01 * 2^0  =  1.00 * 2^-2

The result is two times the correct result. The result is wrong even though the correct result is exactly representable. That is, in our 6-bit representation, 0 011 00 - 0 010 11 = 0 000 00, where all three values are exactly representable.

With a guard bit, a 4-bit (N+1-bit) data path is provided for the 3-bit (N-bit) mantissa:

  1.000 * 2^0
- 0.111 * 2^0   (LSB not lost after pre-shift)
____________________________________________
  0.001 * 2^0  =  1.00 * 2^-3

The result is the same as that produced by infinitely precise arithmetic logic. In general, if the result is exactly representable, one guard bit is sufficient to ensure that the arithmetic always produces the correct result. The intuitive reason why this always works is that the only situation in which an ideal result is exactly representable yet at risk due to pre-shifting is a subtraction of two numbers that are very close in value. Because their values are close, the pre-shift will not be more than 1 bit in this situation. If the pre-shift is 2 bits or more, the ideal result is not exactly representable, as shown in the following example, and the guard bit then does not always give the same rounding result.

Example: 1.00 * 2^0 - 1.11 * 2^-2 = 1.001 * 2^-1 (ideally)

  1.000 * 2^0
- 0.011 * 2^0   (LSB lost after pre-shift)
____________________________________________
  0.101 * 2^0  =  1.01 * 2^-1

If we round the ideal result by "chopping to 0", the rounded result (1.00 * 2^-1) will be different from that generated by a data path with a guard bit. But since the ideal result is not exactly representable anyway, this is acceptable.

Usage of rounding bit

The purpose of the rounding bit is to make sure that the same ideal result gets rounded to the same representable value regardless of the type of arithmetic operation that generated the result. The rounding bit is useful only if a guard bit is already provided.

Example: the same ideal result 1.001 * 2^-1 can be generated by two operations: (a) 1.00 * 2^0 - 1.11 * 2^-2, or (b) 1.00 * 2^-1 + 1.00 * 2^-4.

With a guard bit but no rounding bit, one has a 4-bit data path for the arithmetic. The subtraction would operate as follows:

  1.000 * 2^0
- 0.011 * 2^0   (LSB lost after pre-shift)
____________________________________________
  0.101 * 2^0  =  1.01 * 2^-1

Using "chopped to 0" rounding, one gets the value 1.01 * 2^-1.

The addition would operate as follows:

  1.000 * 2^-1
+ 0.001 * 2^-1
____________________________________________
  1.001 * 2^-1

Using "chopped to 0" rounding, one gets the value 1.00 * 2^-1.

Although the two operations would generate the same ideal result if there were an unlimited number of bits in the data path, they actually generate different rounded results under the same rounding mode. In this case, the same ideal result becomes a different rounded value depending on the type of operation that generated it (a small sketch reproducing this discrepancy follows below).
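The following sketch models this behavior with exact fractions (the helper names add_chopped and chop are ours, for illustration; a real adder works on bit vectors, but the truncation during pre-shift and the final chop are the same). With only a guard bit, i.e., three fractional bits in the data path, operations (a) and (b) produce different rounded results:

from fractions import Fraction

def chop(x, frac_bits):
    """Truncate x toward zero to frac_bits fractional binary bits."""
    scale = 2 ** frac_bits
    return Fraction(int(x * scale), scale)

def add_chopped(a_m, a_e, b_m, b_e, datapath_frac_bits):
    """Add a_m*2^a_e and b_m*2^b_e (mantissas as exact fractions, negative for
    subtraction). The smaller operand is pre-shifted and chopped to the data-path
    width; the normalized result mantissa is then chopped to 2 fractional bits
    (the 1.xx format of the examples above)."""
    e = max(a_e, b_e)
    a = chop(a_m * Fraction(2) ** (a_e - e), datapath_frac_bits)
    b = chop(b_m * Fraction(2) ** (b_e - e), datapath_frac_bits)
    total = (a + b) * Fraction(2) ** e
    if total == 0:
        return total
    k = 0                                   # normalize to 1.xx * 2^k
    while abs(total) >= 2 * Fraction(2) ** k:
        k += 1
    while abs(total) < Fraction(2) ** k:
        k -= 1
    return chop(total / Fraction(2) ** k, 2) * Fraction(2) ** k

# (a) 1.00*2^0 - 1.11*2^-2  and  (b) 1.00*2^-1 + 1.00*2^-4, ideal result 1.001*2^-1
sub = add_chopped(Fraction(1), 0, Fraction(-7, 4), -2, 3)
add = add_chopped(Fraction(1), -1, Fraction(1), -4, 3)
print(sub, add)     # 5/8 (= 1.01*2^-1) vs 1/2 (= 1.00*2^-1): different rounded results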
With both a guard bit and a rounding bit, one has a 5-bit data path in floating-point arithmetic units, such as adders, that perform addition and subtraction. The subtraction would operate as follows:

  1.0000 * 2^0
- 0.0111 * 2^0
____________________________________________
  0.1001 * 2^0

Using "chopped to 0" rounding, one gets the value 1.00 * 2^-1.

The addition would operate as follows:

  1.0000 * 2^-1
+ 0.0010 * 2^-1
____________________________________________
  1.0010 * 2^-1

Using "chopped to 0" rounding, one gets the value 1.00 * 2^-1.

In general, it can be shown that one rounding bit is sufficient to guarantee that the same ideal result gets rounded to the same value regardless of the type of operation that generated it. The reason is that for an addition and a subtraction to generate the same ideal result, the pre-shift cannot be more than two bits. Thus, having two additional bits in the arithmetic unit's data path allows us to preserve enough bits to achieve the same result.

Sticky bit

The sticky bit is 1 if and only if the infinitely precise result would have bits to the right of the guard and rounding bits. For addition and subtraction, the sticky bit of an input operand is set if any "1" bits of the lesser operand are pre-shifted past the rounding bit; the sticky bit of the operand that is not pre-shifted starts as 0. The sticky bit of the result is then generated from the sticky bits of the input operands. For division, the sticky bit of the result is set if the remainder is not zero.

The sticky bit supports unbiased rounding to the nearest. A half-way number is an unrepresentable number that is exactly equidistant from the two closest representable numbers. All numbers except the half-way numbers have a natural nearest representable number to round to. As for a half-way number, one obvious solution is to always round it to the next larger representable number. This, however, would bias the final results towards +infinity. The solution adopted by the IEEE standard is to round a half-way number to whichever of its two neighboring representable values has an even (0) last bit. For example, the following roundings are performed according to this rule:

1.0011 rounds to 1.01
1.0101 rounds to 1.01
1.0110 rounds to 1.10

The sticky bit supports proper unbiased rounding for numbers that are very close to the half-way numbers: it records the fact that these numbers are not exactly half-way. When a number is slightly larger than a half-way number, the arithmetic unit can round it to the next larger representable number rather than applying the rule for half-way numbers. This makes it possible to apply the half-way rounding rule only to numbers that are truly half-way. A sketch of this rounding rule follows below.
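The following sketch applies round-to-nearest-even using the bits beyond the kept mantissa as guard, round, and sticky bits (the helper round_nearest_even and its string interface are ours, for illustration); it reproduces the three roundings above.

def round_nearest_even(bits, keep=2):
    """Round a binary mantissa string such as '1.0110' to 'keep' fractional bits.

    The first discarded bit acts as the guard bit, the second as the rounding bit,
    and the OR of any remaining bits as the sticky bit."""
    frac = bits.split(".")[1]
    kept = int(bits.replace(".", "")[:1 + keep], 2)     # kept bits as an integer
    extra = frac[keep:]
    guard = extra[0:1] == "1"
    rnd = extra[1:2] == "1"
    sticky = "1" in extra[2:]
    # Round up if we are above the half-way point, or exactly half-way and the
    # kept value is odd (so that the rounded result has an even last bit).
    if guard and (rnd or sticky or kept % 2 == 1):
        kept += 1
    out = format(kept, "b").rjust(1 + keep, "0")
    return out[:-keep] + "." + out[-keep:]

for b in ("1.0011", "1.0101", "1.0110"):
    print(b, "rounds to", round_nearest_even(b))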
Exception in IEEE Standard

What is an exception? An exception is a situation that is not expected by the normal program logic and may have to be handled before execution can resume. There are two kinds of exceptions: faults and traps.

Faults: There is usually no way to finish executing the faulting instruction without first handling the fault. After the fault has been handled, the faulting instruction must be re-executed. An example of a fault is a page fault.

Traps: It is in general possible to finish executing the trapping instruction without handling the trap. The execution of the subsequent instructions, however, does not make sense before the trap is handled. After the trap has been handled, the trapping instruction should not be re-executed. An example of a trap is a system call trap.

The IEEE floating-point traps include divide by zero, overflow (the result is larger than the largest representable number), underflow (the result is too small to be represented as a normalized number), and invalid (an operation that generates NaN). If an exception occurs, the programmer can decide whether a trap will be taken to handle the situation. For any of the exceptions, the programmer should be able to specify a trap handler, request that a handler be disabled or restored, and turn the trapping signals on or off under program control during execution. The trap handler should be able to determine: (1) the type of exception that occurred, (2) the kind of operation performed, (3) the result format (double, single, extended, etc.), (4) the correctly rounded result (for overflow, underflow, and inexact exceptions), and (5) the input operand values.
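As a language-level illustration of trapping these exception classes (not the hardware trap mechanism itself), NumPy's seterr function can turn the IEEE exceptions into Python exceptions; the specific operand values below are chosen only to provoke each class.

import numpy as np

# Request that each IEEE exception class raise a Python exception instead of
# silently producing inf/NaN or a warning.
np.seterr(divide="raise", over="raise", under="raise", invalid="raise")

for label, op in (
    ("divide by zero", lambda: np.float32(1.0) / np.float32(0.0)),
    ("overflow",       lambda: np.float32(1e38) * np.float32(1e38)),
    ("invalid (0/0)",  lambda: np.float32(0.0) / np.float32(0.0)),
):
    try:
        op()
    except FloatingPointError as e:
        print(label, "trapped:", e)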