Floating Point Arithmetic
In the 1970s, floating point arithmetic capability was found only in mainframes and
supercomputers. Although many microprocessors designed in the 1980s started to include
floating point coprocessors, their floating-point arithmetic speed was about three orders of
magnitude slower than that of mainframes and supercomputers. With the advance of
microprocessor technology, many microprocessors designed in the 1990s, such as the Intel
Pentium III, AMD Athlon, and Compaq Alpha 21264, have high-performance floating
point capabilities that rival supercomputers. High-speed floating point arithmetic has
become a standard requirement for microprocessors designed today.
The IEEE Floating Point Standard is an effort to have computer manufacturers conform
to a common representation and arithmetic convention for floating point data. Virtually
all computer manufacturers in the world have accepted this standard, so virtually all
microprocessors designed in the future will conform to the IEEE Floating Point Standard.
It is therefore very important for a computer designer to understand the concepts and
implementation requirements of this standard.
In the IEEE Floating Point Standard, a floating point number consists of three parts: sign
(S), exponent (E), and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating
point number. The value of such a floating point number can be derived as follows:
value = (-1)^S * M * 2^E, where 1.0 ≤ M < 2.0
The interpretation of S is simple: S=0 results in a positive number and S=1 a negative
number. The issue is how to encode M and E. We will use the following example to
explain the encoding convention for M and E.
Example: assume that each floating point number consists of a 1-bit sign, 3-bit exponent,
and 2-bit mantissa. We will use this 6-bit format to illustrate the issues involved in
encoding E and M. Numbers in the decimal system will have subscript D and those in the
binary system will have subscript B. For example, 0.5D = 0.1B.
Normalized representation of M
Specifying that 1.0B ≤ M < 10.0B makes the mantissa value for each floating point
number unique. For example, the only mantissa value allowed for 0.5D is M = 1.0:
0.5D = 1.0B * 2^-1
Another potential candidate would be 0.1B * 2^0, but the value of the mantissa would be too
small according to the rule. Similarly, 10.0B * 2^-2 is not legal because the value of the
mantissa is too large. In general, with the restriction that 1.0B ≤ M < 10.0B, every floating
point number has only one legal mantissa value. Numbers that satisfy this restriction
will be referred to as normalized numbers.
Because all mantissa values are of the form 1.XX, we can omit the "1." part from the
representation. Therefore, the mantissa of 0.5 in a 2-bit mantissa representation is
00, which is derived by omitting the "1." from 1.00.
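To make the encoding direction concrete, here is a small Python sketch (the helper name normalize is ours, not from the text) that uses math.frexp to put a positive value into the 1.XX form and then drops the implied leading 1 to obtain the stored mantissa bits:

# A sketch that normalizes a positive value into M and E with 1.0 <= M < 2.0,
# then drops the implied "1." to get the stored bits of a 2-bit mantissa field.
import math

def normalize(x, man_bits=2):
    m, e = math.frexp(x)        # x = m * 2^e with 0.5 <= m < 1.0
    m, e = m * 2, e - 1         # re-scale so that 1.0 <= m < 2.0
    stored = int((m - 1.0) * 2 ** man_bits)   # bits after the implied "1."
    return stored, e

print(normalize(0.5))     # (0, -1): mantissa bits 00, i.e. 1.00B * 2^-1
print(normalize(0.875))   # (3, -1): mantissa bits 11, i.e. 1.11B * 2^-1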
Excess encoding of E
If n bits are used to represent the exponent, 2^(n-1) - 1 is added to the two's complement
representation of the exponent to form its excess representation. In our 3-bit exponent
representation:
2's complement   Actual decimal       Excess-3
000              0                    011
001              1                    100
010              2                    101
011              3                    110
100              (reserved pattern)   111
101              -3                   000
110              -2                   001
111              -1                   010

Figure 1 Excess-3 encoding, sorted by 2's complement ordering
The advantage of excess representation is that an unsigned comparator can be used to
compare signed numbers. As shown in Figure 2, the excess-3 code increases
monotonically from -3 to 3. The code for -3 is 000 and that for 3 is 110. Thus, if one
uses an unsigned number comparator to compare the excess-3 codes of any two numbers from -3 to
3, the comparator gives the correct comparison result in terms of which number is larger,
smaller, etc. For example, if one compares excess-3 codes 001 and 100 with an unsigned
comparator, 001 is smaller than 100. This is the right conclusion since the values that
they represent, -2 and 1, have exactly the same relation. This is a very desirable property
for hardware implementation since unsigned comparators are smaller and faster than
signed comparators.
2's complement   Actual decimal       Excess-3
100              (reserved pattern)   111
101              -3                   000
110              -2                   001
111              -1                   010
000              0                    011
001              1                    100
010              2                    101
011              3                    110

Figure 2 Excess-3 encoding, sorted by excess-3 ordering
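The comparator property can be checked with a short Python sketch (the helper names to_excess3 and from_excess3 are illustrative, not from the text): comparing the unsigned excess-3 codes gives the same ordering as comparing the signed exponent values directly.

# A minimal sketch of excess-3 encoding for a 3-bit exponent, checking that
# unsigned comparison of the codes matches signed comparison of the values.

def to_excess3(value):
    # Valid actual exponents are -3..3; code 111 (i.e. 7) is the reserved pattern.
    assert -3 <= value <= 3
    return value + 3          # excess (biased) code as an unsigned 3-bit number

def from_excess3(code):
    assert 0 <= code <= 6     # 7 (111) is reserved
    return code - 3

for a in range(-3, 4):
    for b in range(-3, 4):
        # Unsigned comparison of codes must agree with signed comparison of values.
        assert (to_excess3(a) < to_excess3(b)) == (a < b)

print(format(to_excess3(-2), "03b"), format(to_excess3(1), "03b"))   # 001 100: 001 < 100, just as -2 < 1
print(from_excess3(0b001), from_excess3(0b100))                      # -2 1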
Now we are ready to represent 0.5D with our 6-bit format:
0.5D = 0 010 00, where S = 0, E = 010, and M = (1.)00
With a normalized mantissa and excess-coded exponent, the value of a number with an n-bit exponent is
value = (-1)^S * 1.M * 2^(E - (2^(n-1) - 1))
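As a concrete illustration of this formula, the following Python sketch decodes a bit pattern of the running 6-bit example format (assuming, as in Figure 1, that the all-ones exponent pattern is the reserved one); the function name decode_6bit is ours:

# A sketch that decodes the running 6-bit example format
# (1-bit S, 3-bit excess-3 E, 2-bit M with the leading "1." omitted).

def decode_6bit(s, e, m, exp_bits=3, man_bits=2):
    """s, e, m are the raw field values; returns the represented value."""
    bias = 2 ** (exp_bits - 1) - 1           # excess-(2^(n-1) - 1); 3 for a 3-bit exponent
    if e == 2 ** exp_bits - 1:
        raise ValueError("reserved exponent pattern")
    mantissa = 1.0 + m / 2 ** man_bits       # restore the implied "1."
    return (-1) ** s * mantissa * 2.0 ** (e - bias)

print(decode_6bit(0, 0b010, 0b00))   # 0 010 00 -> 0.5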
Representable Numbers
The representable numbers of a given format are the set of all numbers that can be exactly
represented in the format. For example, if one uses a 3-bit unsigned integer format, the
representable numbers would be:
000   0
001   1
010   2
011   3
100   4
101   5
110   6
111   7

Figure 3 Representable numbers of a 3-bit unsigned format
Neither -1 nor 9 can be represented in the format given above. One can draw a number
line to identify all the representable numbers, as shown in Figure 4, where all
representable numbers of the 3-bit unsigned integer format are marked with stars.
Figure 4 Representable numbers of a 3-bit unsigned integer format
The representable numbers of a floating point format can be visualized in a similar
manner. In Table 1, we define all the representable numbers of the format we have discussed so far and
two variations. We use a 5-bit format to keep the size of the table manageable. The
format consists of a 1-bit S, 2-bit E (excess coded), and 2-bit M (with the "1." part omitted). The
no-zero column gives the representable numbers of the format we have discussed thus far.
Note that with this format, 0 is not one of the representable numbers.
E   M    No-zero                            Abrupt underflow                   Gradual underflow
         S=0             S=1                S=0             S=1                S=0             S=1
00  00   2^-1            -(2^-1)            0               0                  0               0
00  01   2^-1+1*2^-3     -(2^-1+1*2^-3)     0               0                  1*2^-2          -1*2^-2
00  10   2^-1+2*2^-3     -(2^-1+2*2^-3)     0               0                  2*2^-2          -2*2^-2
00  11   2^-1+3*2^-3     -(2^-1+3*2^-3)     0               0                  3*2^-2          -3*2^-2
01  00   2^0             -(2^0)             2^0             -(2^0)             2^0             -(2^0)
01  01   2^0+1*2^-2      -(2^0+1*2^-2)      2^0+1*2^-2      -(2^0+1*2^-2)      2^0+1*2^-2      -(2^0+1*2^-2)
01  10   2^0+2*2^-2      -(2^0+2*2^-2)      2^0+2*2^-2      -(2^0+2*2^-2)      2^0+2*2^-2      -(2^0+2*2^-2)
01  11   2^0+3*2^-2      -(2^0+3*2^-2)      2^0+3*2^-2      -(2^0+3*2^-2)      2^0+3*2^-2      -(2^0+3*2^-2)
10  00   2^1             -(2^1)             2^1             -(2^1)             2^1             -(2^1)
10  01   2^1+1*2^-1      -(2^1+1*2^-1)      2^1+1*2^-1      -(2^1+1*2^-1)      2^1+1*2^-1      -(2^1+1*2^-1)
10  10   2^1+2*2^-1      -(2^1+2*2^-1)      2^1+2*2^-1      -(2^1+2*2^-1)      2^1+2*2^-1      -(2^1+2*2^-1)
10  11   2^1+3*2^-1      -(2^1+3*2^-1)      2^1+3*2^-1      -(2^1+3*2^-1)      2^1+3*2^-1      -(2^1+3*2^-1)
11  --   (reserved pattern)

Table 1 Representable numbers of no-zero, abrupt underflow, and gradual underflow formats
Examining how they populate the number line, as shown in Figure 5, provides further
insights about the representable numbers. In Figure 5, we show only the positive
representable numbers. The negative numbers are symmetric to their positive
counterparts.
Figure 5 Number line display of representable numbers of the no-zero representation
There are three observations that can be made here. First, the number 0 is not
representable in this format. Because 0 is one of the most important numbers, not being
able to represent 0 in a number system is a serious shortcoming. Second, the
representable numbers become closer to each other toward the neighborhood of 0. This
is a desirable behavior because as the absolute value of these numbers becomes smaller, it
is more important to represent them accurately. Having the representable numbers close
to each other makes it possible to represent numbers in that region more accurately.
Unfortunately, this trend does not hold in the very vicinity of 0, which leads to the third
observation: there is a gap in the representable numbers in the very vicinity of 0. This is because the
range of the normalized mantissa precludes 0.
One method that has been used to accommodate 0 in a normalized number system is
the abrupt underflow convention, which is illustrated by the second column of Table 1.
Whenever E is 0, the number is interpreted as 0. This method takes away eight
representable numbers (four positive and four negative) in the vicinity of 0 and makes
them all 0. Although this method makes 0 representable, it creates an even larger gap
between representable numbers in 0's vicinity, as shown in Figure 6. It is obvious, when
compared with Figure 5, that the gap between representable numbers has been enlarged
significantly in the vicinity of 0. This is very problematic since many numeric
algorithms rely on the fact that the accuracy of number representation is higher for the
very small numbers near zero. These algorithms generate small numbers and eventually
use them as denominators. Any error in the representation can be greatly magnified in the
division process.
Figure 6 Representable numbers of the abrupt underflow format
The method actually adopted by the IEEE standard is called gradual underflow. The
method relaxes the normalization requirement for numbers very close to 0: whenever
E = 0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is
assumed to be 0.XX. In general, if the n-bit exponent is 0, the value is
0.M * 2^(-2^(n-1) + 2)
For example, in Table 1, the gradual underflow value for mantissa 01 is 0.01B * 2^0 = 2^-2.
The effect is to have a uniform spread of representable numbers in the very close
vicinity of 0, which eliminates the undesirable gap of the previous two methods. This is
illustrated in Figure 7.
Figure 7 Representable numbers of the gradual underflow format
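To make the denormalized interpretation concrete, the following sketch extends the earlier decoder to the 5-bit format of Table 1 under the gradual underflow convention (again, the function name is illustrative):

# A sketch of decoding the 5-bit format of Table 1 (1-bit S, 2-bit excess-1 E,
# 2-bit M) under the gradual underflow convention: when E is all zeros, the
# mantissa is interpreted as 0.M and the exponent is fixed at -2^(n-1) + 2 = 0.

def decode_gradual(s, e, m, exp_bits=2, man_bits=2):
    bias = 2 ** (exp_bits - 1) - 1                  # 1 for a 2-bit exponent
    if e == 2 ** exp_bits - 1:
        raise ValueError("reserved exponent pattern")
    if e == 0:                                      # denormalized: 0.M * 2^(-2^(n-1) + 2)
        return (-1) ** s * (m / 2 ** man_bits) * 2.0 ** (-(2 ** (exp_bits - 1)) + 2)
    return (-1) ** s * (1.0 + m / 2 ** man_bits) * 2.0 ** (e - bias)

print(decode_gradual(0, 0b00, 0b01))   # 0.01B * 2^0 = 0.25, as in Table 1
print(decode_gradual(0, 0b00, 0b00))   # 0.0, now representable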
IEEE Floating Point Format
The actual IEEE format defines more special bit patterns than we have seen so far:

exponent   mantissa   meaning
11…1       ≠ 0        NaN
11…1       = 0        (-1)^S * ∞
00…0       ≠ 0        denormalized
00…0       = 0        0
All other numbers are normalized floating-point numbers. Single precision numbers have
a 1-bit S, 8-bit E, and 23-bit M. Double precision numbers have a 1-bit S, 11-bit E, and 52-bit M.
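For the real single-precision format, the three fields can be pulled out of a value with Python's struct module; the sketch below is one way to do it, and it then reassembles the value from S, E, and M for a normalized number:

# A sketch that extracts the S, E, M fields of an IEEE single-precision number
# and reconstructs its value from the fields (normalized case only).
import struct

def fields_single(x):
    bits, = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit pattern
    s = bits >> 31
    e = (bits >> 23) & 0xFF          # 8-bit exponent, excess-127
    m = bits & 0x7FFFFF              # 23-bit mantissa, leading "1." omitted
    return s, e, m

s, e, m = fields_single(0.5)
print(s, e, m)                                             # 0 126 0
print((-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (e - 127))    # 0.5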
All representable numbers fall between -∞ (negative infinity) and +∞ (positive infinity).
An ∞ is created by overflow, e.g., division by zero. Any representable number divided by
+∞ or -∞ results in 0.
NaN (Not a Number) is generated by operations whose input values do not make sense,
for example, 0/0, 0*∞, ∞/∞, and ∞ - ∞. NaN's are also used for data that have not been
properly initialized in a program. There are two types of NaN's in the IEEE standard:
signaling and quiet.
A signaling NaN causes an exception when used as input to arithmetic operations.
For example, the operation 1.0 + signaling NaN raises an exception. Signaling NaN's are
used in situations where the programmer would like to make sure that program
execution is interrupted whenever any irregular values are used in floating point
computations. These situations usually mean that there is something wrong with the
execution of the program. In mission-critical applications, the execution cannot continue
until the validity of the execution can be verified by separate means. For example,
software engineers often mark all uninitialized data as signaling NaN. This practice
ensures that any use of uninitialized data is detected during program execution.
A quiet NaN generates another quiet NaN when used as input to arithmetic operations. For
example, the operation 1.0 + quiet NaN generates a quiet NaN. Quiet
NaN's are typically used in applications where the user can review the output and decide
whether the application should be re-run with different input for more valid results. When the
results are printed, quiet NaN's are printed as "NaN" so that the user can spot them in the
output file.
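Ordinary Python code only ever sees quiet NaN's, but a short sketch can illustrate the propagation and comparison behavior described above:

# Quiet-NaN behavior as observable from Python (Python's float exposes quiet NaN's
# only; signaling NaN's and their exceptions are not reachable at this level).
import math

nan = float("nan")
print(1.0 + nan)                      # nan: a quiet NaN propagates through arithmetic
print(nan == nan)                     # False: NaN compares unequal even to itself
print(math.isnan(0.0 * math.inf))     # True: 0 * infinity is one of the invalid operations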
Rounding of Mantissa
We have discussed the issue of floating-point format. Now we concentrate on the issues
regarding the accuracy of arithmetic. The most important source of inaccuracy in floating
point arithmetic is rounding. Rounding occurs if the mantissa of the result value needs
too many bits to be represented exactly. The cause of rounding is typically pre-shifting in
floating point arithmetic. When two input operands to a floating-point addition or
subtraction have different exponents, the mantissa of the one with the smaller exponent is
typically right-shifted until the exponents are equal. As a result, the final result can have
many more bits than the format can accommodate.
When rounding occurs, there are four desirable properties:
1) If an operation produces a result value that can be represented exactly, it must be
so represented. → must carry a guard bit in the arithmetic data-path.
2) The rounded value of x should be independent of the operation that generates it.
→ must carry a rounding bit in the arithmetic data-path.
3) The "unbias rounding to the nearest" must be supported. → must carry a sticky bit
in the data-path.
4) The user should be able to select the desired rounding mode for each specific
application. → must provide several modes of rounding.
The IEEE standard provides four major rounding modes:
- Chopped (towards 0), e.g., 1.011 rounds to 1.0, -1.011 rounds to -1.0.
- Rounded towards +∞, e.g., 1.011 rounds to 1.1, -1.011 rounds to -1.0. This is
desirable for a bank when it calculates the credit card interest to be charged to its
customers.
- Rounded towards -∞, e.g., 1.011 rounds to 1.0, -1.011 rounds to -1.1. This is
desirable for a bank when it calculates the savings account interest to be paid to
its customers.
- Unbiased rounding to the nearest, e.g., 1.011 rounds to 1.1, -1.011 rounds to -1.1. This
is desirable for a scientist who is interested in generating unbiased statistical data.
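Python's built-in binary floats do not expose a selectable rounding mode, but the decimal module provides the same four modes and can stand in for a quick illustration (the constant names below are the decimal module's, not IEEE's):

# The four rounding modes illustrated with Python's decimal module, rounding the
# decimal half-way value 1.25 (and -1.25) to one fractional digit.
from decimal import Decimal, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_EVEN

modes = {
    "chopped (towards 0)":     ROUND_DOWN,
    "towards +infinity":       ROUND_CEILING,
    "towards -infinity":       ROUND_FLOOR,
    "unbiased to the nearest": ROUND_HALF_EVEN,
}

for name, mode in modes.items():
    for x in (Decimal("1.25"), Decimal("-1.25")):
        print(name, x, "->", x.quantize(Decimal("0.1"), rounding=mode))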
For an arithmetic unit that supports a format with k bits of mantissa (including the
implied 1.), the data-path is actually k+3 bits wide. The extra three bits are guard bit,
rounding bit, and sticky bit.
Usage of guard bit
The purpose of the guard bit is to make sure that if the ideal result of an arithmetic
operation can be exactly represented, the data path with a limited number of bits in the
arithmetic unit will generate that ideal result.
Example: 1.00 * 2^0 - 1.11 * 2^-1 = 1.0 * 2^-3 (ideally)
Without a guard bit, a 3-bit data-path would be used:
  1.00 * 2^0
- 0.11 * 2^0     (LSB lost after pre-shift)
____________________________________________
  0.01 * 2^0  =  1.00 * 2^-2
The result is two times the correct result. The result is wrong even though the correct
result is exactly representable. That is, in our 6-bit representation:
0 011 00 - 0 010 11 = 0 000 00, where all three values are exactly representable.
With a guard bit, a 4-bit (N+1 bit) data path is provided for a 3-bit (N bit) mantissa:
  1.000 * 2^0
- 0.111 * 2^0     (LSB not lost after pre-shift)
____________________________________________
  0.001 * 2^0  =  1.00 * 2^-3
The result is the same as that produced by infinitely precise arithmetic logic. In
general, if the result is exactly representable, one guard bit is sufficient to ensure that the
arithmetic always produces the correct result. The intuitive reason why this works all the
time is that the situation where an ideal result is exactly representable but at risk due to
pre-shift arises when two numbers that are very close in value are involved in a subtraction.
Their values must be close, thus the pre-shift will not be more than 1 bit in this situation. If
the pre-shift is 2 bits or more, then the ideal result would not be exactly representable, as
shown in the following example. In this example, the guard bit alone does not always give the
same rounding result when the result is not exactly representable.
Example: 1.00 * 2^0 - 1.11 * 2^-2 = 1.001 * 2^-1 (ideally)
  1.000 * 2^0
- 0.011 * 2^0     (LSB lost after pre-shift)
____________________________________________
  0.101 * 2^0  =  1.01 * 2^-1
If we round the ideal result by "chopping to 0" the rounded result will be different from
that generated by a data-path with a guard bit. But since the ideal result is not exactly
representable anyway, this is ACCEPTABLE.
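The effect of the guard bit can also be mimicked in a few lines of Python; the sketch below (a hypothetical helper, built on the operands 1.00 * 2^0 and 1.11 * 2^-1 of the first example above) truncates the pre-shifted significand to the width of the data path and repeats the subtraction with and without the extra bit:

# A sketch that mimics a fixed-width significand data path: the smaller operand is
# pre-shifted right, the bits that fall off the path are dropped, and the subtraction
# is then performed on integers held in the same fixed-point unit.

def truncating_subtract(m_large, m_small, shift):
    """Pre-shift m_small right by `shift` bits (dropping the lost bits), then subtract."""
    return m_large - (m_small >> shift)

# 3-bit data path, significands in units of 1/4: 1.00 -> 0b100, 1.11 -> 0b111.
print(truncating_subtract(0b100, 0b111, 1) / 4)     # 0.25  = 2^-2: twice the true result
# 4-bit data path (one guard bit), units of 1/8: 1.000 -> 0b1000, 1.110 -> 0b1110.
print(truncating_subtract(0b1000, 0b1110, 1) / 8)   # 0.125 = 2^-3: the exact result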
Usage of rounding bit
The purpose of the rounding bit is to make sure that the same ideal result gets rounded to the
same representable value regardless of the type of arithmetic operation that generated the result.
The rounding bit is useful only if a guard bit is already provided.
Example: The same ideal result 1.001 * 2^-1 can be generated by two operations:
(a) 1.00 * 2^0 - 1.11 * 2^-2, or (b) 1.00 * 2^-1 + 1.00 * 2^-4.
With a guard bit but no rounding bit, one would have a 4-bit data path for the arithmetic.
The subtraction would operate as follows:
  1.000 * 2^0
- 0.011 * 2^0     (LSB lost after pre-shift)
____________________________________________
  0.101 * 2^0  =  1.01 * 2^-1
Using a "chopped to 0" rounding, one gets the value 1.01 * 2^-1.
The addition would operate as follows:
  1.000 * 2^-1
+ 0.001 * 2^-1
____________________________________________
  1.001 * 2^-1
Using a "chopped to 0" rounding, one gets the value 1.00 * 2^-1.
Although the two operations would generate the same ideal result if there were an unlimited
number of bits in the data path, they actually generate different rounded results with the
same rounding mode. In this case, the same ideal result becomes a different rounded value
depending on the type of operation that generated it.
With both a guard bit and a rounding bit, one would have a 5-bit data path in floating-point
arithmetic units such as adders that perform addition and subtraction. The subtraction
would operate as follows:
  1.0000 * 2^0
- 0.0111 * 2^0
____________________________________________
  0.1001 * 2^0
Using a "chopped to 0" rounding, one gets the value 1.00 * 2^-1.
The addition would operate as follows:
  1.0000 * 2^-1
+ 0.0010 * 2^-1
____________________________________________
  1.0010 * 2^-1
Using a "chopped to 0" rounding, one gets the value 1.00 * 2^-1.
In general, it can be shown that one rounding bit is sufficient to guarantee that the same
ideal result gets rounded to the same value regardless of the type of operation that
generated the result. The reason is that in order for an addition and a subtraction to
generate the same result, the pre-shift cannot be more than two bits. Thus, having two
additional bits in the arithmetic unit data path allows us to preserve enough bits to
achieve the same results.
Sticky bit
The sticky bit is 1 if and only if the infinitely precise result would have bits to the right of
the guard and rounding bits. For addition and subtraction, it is set if any "1" bits of the lesser
operand are pre-shifted past the rounding bit; the sticky bit of the input operand that was not
pre-shifted starts as 0, and the sticky bit of the result is generated from the sticky bits of the
input operands. For division, the sticky bit of the result is set if the remainder is not zero.
The sticky bit supports unbiased rounding to the nearest.
A half-way number is an unrepresentable number that is exactly equidistant from the two
closest representable numbers. All numbers except the half-way numbers have a natural
nearest representable number to round to. As for a half-way number, one obvious solution is to
always round it to the next larger representable number. This, however, would create a bias of
the final result towards +∞. The solution adopted by the IEEE standard is to round a half-way
number to whichever of the two nearest representable values (0.xxx or 0.xxx + 0.001) has an
even last bit. For example, the following roundings are performed according to the
specification:
1.0011 rounds to 1.01
1.0101 rounds to 1.01
1.0110 rounds to 1.10
The sticky bit supports proper unbiased rounding for those numbers very close to the half-way
numbers. It records the fact that these numbers are not exactly half-way. When a
number is slightly larger than the half-way number, the arithmetic unit can round it to
the next larger representable number rather than applying the rule for half-way
numbers. This makes it possible to apply the half-way rounding rule only to the numbers
which are truly half-way numbers.
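The tie-breaking rule can be written down directly; the following Python sketch (the helper name and the 1/16-units representation are ours) reproduces the three roundings listed above using round-to-nearest with ties going to even:

# A minimal sketch of "round to nearest, ties to even" on a binary mantissa,
# with the value given as an integer count of 2^-4 units (e.g. 1.0110B -> 22)
# and 2 fraction bits kept, matching the 1.xx format used above.

def round_nearest_even(value_in_16ths, drop_bits=2):
    """Round away the lowest `drop_bits` bits, ties going to the even result."""
    step = 1 << drop_bits                 # distance between representable values
    low = value_in_16ths & ~(step - 1)    # truncate toward 0
    rem = value_in_16ths - low            # the bits being dropped
    if rem > step // 2:                   # clearly above half-way: round up
        return low + step
    if rem < step // 2:                   # clearly below half-way: round down
        return low
    # exactly half-way: pick the neighbor whose last kept bit is 0 (even)
    return low if (low >> drop_bits) % 2 == 0 else low + step

# 1.0011B, 1.0101B, 1.0110B expressed in 1/16ths: 19, 21, 22
for x in (19, 21, 22):
    r = round_nearest_even(x)
    print(f"{x / 16:.4f} rounds to {r / 16:.2f}")   # expect 1.25, 1.25, 1.50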
Exception in IEEE Standard
What is an exception? An exception is a situation that is not expected by the normal
program logic and may have to be handled before execution can resume. There are two
kinds of exceptions: faults and traps.
Faults: There is usually no way to finish executing the problem-causing instruction
without handling a fault. After handling a fault, the problem-causing instruction
must be re-executed. An example of a fault is a page fault.
Traps: It is in general possible to finish executing the problem-causing instruction without
handling the trap. The execution of subsequent instructions, however, does
not make sense before the trap is handled. After the handling of a trap, the problem-causing
instruction should not be re-executed. An example of a trap is a system call trap.
IEEE floating point traps include divide by zero, overflow (the result is larger
than the largest representable number), underflow (the result is too small to represent as a
normalized number), and invalid (an operation that generates NaN).
If an exception occurs, the programmer can decide whether a trap will be taken to handle
the situation.
For any of the exceptions, the programmer should be able to specify a trap handler,
request that a handler be disabled or restored, and during execution, turn on or off the
trapping signals through program control.
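As one concrete, library-level illustration (this is NumPy's interface, not the IEEE trap mechanism itself), NumPy lets a program choose per exception type whether the condition is ignored, warned about, or turned into a raised error:

# Per-exception control of floating point error handling using NumPy.
import numpy as np

with np.errstate(divide="ignore"):
    print(np.array([1.0]) / 0.0)          # [inf], produced silently

try:
    with np.errstate(divide="raise", over="raise", invalid="raise"):
        np.array([1.0]) / 0.0             # the divide-by-zero is now trapped
except FloatingPointError as err:
    print("trapped:", err)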
The trap handler should be able to determine: (1) the type of exception that occurred, (2)
the kind of operation performed, (3) the result format (double, single, extended, etc.),
(4) the correctly rounded result (if overflow, underflow, or inexact occurred), and (5) the
input operand values.