Floating-Point Representations
Dr. Shadrokh Samavi
Contents
1. Floating-Point Numbers
2. The ANSI/IEEE Floating-Point Standard
3. Basic Floating-Point Algorithms
4. Conversions and Exceptions
5. Rounding Schemes
6. Logarithmic Number Systems

Textbook: Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, New York, 2000.
Many of the slides are taken from the textbook or from Parhami's slides.
1. Floating-Point Numbers
Floating-Point Numbers
No finite number system can represent all real numbers.
Various systems can be used for a subset of the real numbers:

Fixed-point, ±w.f: low precision and/or range; computations must be scaled
Rational, ±p/q: difficult arithmetic
Floating-point, ±s × b^e: most common scheme
Logarithmic, ±log_b x: low precision but wide dynamic range
Fixed-Point Numbers
• Maximum absolute error is the same for all numbers:
  ±ulp with truncation
  ±ulp/2 with rounding
• Maximum relative error is much worse for small numbers than for large numbers:
  x = (0000 0000 . 0000 1001)two
  y = (1001 0000 . 0000 0000)two
• Small dynamic range: x^2 and y^2 cannot be represented;
  x^2 causes underflow (number too small), y^2 causes overflow (number too large)
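A small Python illustration (mine, not from the slides) of the relative-error point, using the two operands above with an 8-bit fraction; the names are made up for the example.

```python
# Hypothetical illustration: with an 8-bit fraction, truncation error is at most
# ulp = 2**-8 for every value, but the *relative* error bound is far larger for
# the small operand x than for the large operand y.
ulp = 2 ** -8                 # unit in the last place of the 8-bit fraction
x = 0b1001 * 2 ** -8          # (0000 0000.0000 1001)_2 = 0.03515625
y = 0b10010000 * 1.0          # (1001 0000.0000 0000)_2 = 144.0

for name, v in (("x", x), ("y", y)):
    print(f"{name}: max abs error {ulp}, max relative error {ulp / v:.2e}")
```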
Floating-Point Numbers
x = ±s × b^e ; that is, ±significand × base^exponent

Two signs are involved in a floating-point number:
1. The significand (number) sign, usually represented by a separate sign bit.
2. The exponent sign, usually embedded in the biased exponent (when the bias is a power of 2, the exponent sign is the complement of its MSB).

Floating-point trade-off: precision vs. dynamic range
Tradeoff: Allocating more bits to the exponent part widens the number representation range but reduces the precision.

Fig. 17.1 Typical floating-point number format:
• Sign ±: 0 for +, 1 for –
• Exponent e: a signed integer, often represented as an unsigned value by adding a bias; with h bits the range is [–bias, 2^h – 1 – bias]
• Significand s: represented as a fixed-point number, usually normalized by shifting so that the MSB becomes nonzero; in radix 2, the fixed leading 1 can be removed to save one bit (the "hidden 1")
Biased Exponent Format
– Only the significand (mantissa) requires a sign.
– Zero is represented with the smallest biased exponent (0) and an all-zero significand.
– Normalized floating-point numbers can be compared as if they were integers.
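A small Python illustration (not from the slides) of the last point: for same-sign normalized values, comparing the raw IEEE-754 bit patterns as unsigned integers agrees with comparing the values, because the biased exponent occupies the high-order bits. float_bits is an illustrative helper name.

```python
import struct

def float_bits(x: float) -> int:
    """Return the 32-bit IEEE single-precision pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# For positive normalized numbers, ordering the bit patterns as unsigned
# integers matches ordering the floating-point values themselves.
a, b = 1.5, 1024.0
assert (float_bits(a) < float_bits(b)) == (a < b)
print(hex(float_bits(a)), hex(float_bits(b)))
```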
Max = largest significand × b^(largest exponent)
Min = smallest significand × b^(smallest exponent)

Fig. 17.2 Subranges and special values in floating-point number representations: overflow regions beyond max– and max+, representable negative numbers in [max–, min–] and positive numbers in [min+, max+] (denser near ±min, sparser near ±max), and underflow regions between –min and +min around 0. Typical, overflow, underflow, and midway examples fall in the corresponding regions.
2. IEEE Floating-Point Standard
The ANSI / IEEE Floating-Point Standard

Short (32-bit) format: sign bit, 8 exponent bits (bias = 127, exponent range –126 to 127), 23 bits for the fractional part (plus hidden 1 in the integer part).

Long (64-bit) format: sign bit, 11 exponent bits (bias = 1023, exponent range –1022 to 1023), 52 bits for the fractional part (plus hidden 1 in the integer part).

Fig. 17.3 The ANSI/IEEE standard floating-point number representation formats.
The ANSI / IEEE Floating-Point Standard
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Feature              Single / Short             Double / Long
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Word width (bits)    32                         64
Significand bits     23 + 1 hidden              52 + 1 hidden
Significand range    [1, 2 – 2^–23]             [1, 2 – 2^–52]
Exponent bits        8                          11
Exponent bias        127                        1023
Zero (±0)            e + bias = 0, f = 0        e + bias = 0, f = 0
Denormal             e + bias = 0, f ≠ 0        e + bias = 0, f ≠ 0
                     represents ±0.f × 2^–126   represents ±0.f × 2^–1022
Infinity (±∞)        e + bias = 255, f = 0      e + bias = 2047, f = 0
Not-a-number (NaN)   e + bias = 255, f ≠ 0      e + bias = 2047, f ≠ 0
Ordinary number      e + bias ∈ [1, 254]        e + bias ∈ [1, 2046]
                     e ∈ [–126, 127]            e ∈ [–1022, 1023]
                     represents ±1.f × 2^e      represents ±1.f × 2^e
min                  2^–126 ≈ 1.2 × 10^–38      2^–1022 ≈ 2.2 × 10^–308
max                  ≈ 2^128 ≈ 3.4 × 10^38      ≈ 2^1024 ≈ 1.8 × 10^308
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
The ANSI / IEEE Floating-Point Standard (Single Precision)
With s the sign bit, e the biased exponent field, and f the fraction:
1) e = 255 and f ≠ 0:  v = NaN, regardless of s          (NaN)
2) e = 255 and f = 0:  v = (–1)^s × ∞                    (Infinity)
3) 1 ≤ e ≤ 254:        v = (–1)^s × 2^(e–127) × (1.f)    (Normalized)
4) e = 0 and f ≠ 0:    v = (–1)^s × 2^–126 × (0.f)       (Denormalized)
5) e = 0 and f = 0:    v = (–1)^s × 0                    (Zero)
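As a sketch of these five cases (my own Python, not from the slides), a decoder for a 32-bit pattern might look like this; decode_single is a made-up name.

```python
# Decode a 32-bit IEEE single-precision pattern according to the five cases above.
def decode_single(word: int) -> float:
    s = (word >> 31) & 0x1          # sign bit
    e = (word >> 23) & 0xFF         # biased exponent field
    f = word & 0x7FFFFF             # 23-bit fraction field
    sign = -1.0 if s else 1.0
    if e == 255:
        return float("nan") if f else sign * float("inf")   # cases 1 and 2
    if e == 0:
        return sign * (f / 2**23) * 2**-126                  # cases 4 and 5
    return sign * (1 + f / 2**23) * 2**(e - 127)             # case 3

print(decode_single(0x3FC00000))   # 1.5
print(decode_single(0x00000001))   # smallest positive denormal, about 1.4e-45
```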
The ANSI / IEEE Floating-Point Standard (Double Precision)
1) e = 2047 and f ≠ 0:  v = NaN, regardless of s          (NaN)
2) e = 2047 and f = 0:  v = (–1)^s × ∞                    (Infinity)
3) 1 ≤ e ≤ 2046:        v = (–1)^s × 2^(e–1023) × (1.f)   (Normalized)
4) e = 0 and f ≠ 0:     v = (–1)^s × 2^–1022 × (0.f)      (Denormalized)
5) e = 0 and f = 0:     v = (–1)^s × 0                    (Zero)
Special Operands and Denormals
Operations on special operands:
Ordinary number ÷ (+∞) = ±0
(+∞) × Ordinary number = ±∞
NaN + Ordinary number = NaN

"Graceful underflow": denormals fill the gap between 0 and min = 2^–126 with evenly spaced values of the form ±0.f × 2^–126.

Fig. 17.4 Denormals in the IEEE single-precision format.
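A quick Python check (mine, not from the slides) of the denormal range in the single-precision format, reinterpreting bit patterns with the struct module.

```python
import struct

def bits_to_float(word: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

min_normal   = bits_to_float(0x00800000)   # 2**-126, the smallest normalized value
max_denormal = bits_to_float(0x007FFFFF)   # just below min_normal
min_denormal = bits_to_float(0x00000001)   # 2**-149, the smallest positive denormal

print(min_normal, max_denormal, min_denormal)
```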
Extended Formats

Single extended: exponent ≥ 11 bits, significand ≥ 32 bits. The bias is unspecified, but the exponent range must include [–1022, 1023].

Double extended: exponent ≥ 15 bits, significand ≥ 64 bits. The bias is unspecified, but the exponent range must include [–16 382, 16 383].
IEEE 754 Format Parameters
3. Basic Floating-Point Algorithms
Basic Floating-Point Algorithms

Addition
Assume e1 ≥ e2; an alignment shift (preshift) of the operand with the smaller exponent is needed if e1 > e2:
(±s1 × b^e1) + (±s2 × b^e2) = (±s1 × b^e1) + (±s2 / b^(e1–e2)) × b^e1
                            = (±s1 ± s2 / b^(e1–e2)) × b^e1 = ±s × b^e

Example:
Numbers to be added:             x = 2^5 × 1.00101101
                                 y = 2^1 × 1.11101101      (operand with smaller exponent, to be preshifted)
Operands after alignment shift:  x = 2^5 × 1.00101101
                                 y = 2^5 × 0.000111101101
Result of addition:              s = 2^5 × 1.010010111101  (extra bits to be rounded off)
Rounded sum:                     s = 2^5 × 1.01001100
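Below is a rough Python sketch (my own, not from the slides) of this align / add / normalize / round flow for positive, normalized operands, with significands held as integers scaled by 2^8; fp_add and FRAC_BITS are names invented for the illustration, and ties are rounded up for simplicity.

```python
FRAC_BITS = 8

def fp_add(s1: int, e1: int, s2: int, e2: int):
    """Add s1*2**(e1-FRAC_BITS) + s2*2**(e2-FRAC_BITS); inputs positive, normalized."""
    if e1 < e2:                                      # ensure e1 >= e2
        s1, e1, s2, e2 = s2, e2, s1, e1
    shift = e1 - e2
    total = (s1 << shift) + s2                       # exact sum on the e2 scale
    drop = total.bit_length() - (FRAC_BITS + 1)      # extra low-order bits to round off
    e = e2 + drop
    s = (total + (1 << (drop - 1))) >> drop if drop > 0 else total
    if s >> (FRAC_BITS + 1):                         # rounding carried out of the MSB
        s, e = s >> 1, e + 1
    return s, e

# The slide's example: x = 2**5 * 1.00101101, y = 2**1 * 1.11101101
s, e = fp_add(0b100101101, 5, 0b111101101, 1)
print(bin(s), e)                                     # 0b101001100, 5  (= 2**5 * 1.01001100)
```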
Floating-Point Multiplication and Division
Multiplication:
(±s1 × b^e1) × (±s2 × b^e2) = ±(s1 × s2) × b^(e1+e2)
Because s1 × s2 ∈ [1, 4), postshifting may be needed for normalization.
Overflow or underflow can occur during multiplication or normalization.

Division:
(±s1 × b^e1) / (±s2 × b^e2) = ±(s1 / s2) × b^(e1–e2)
Because s1 / s2 ∈ (0.5, 2), postshifting may be needed for normalization.
Overflow or underflow can occur during division or normalization.
Floating-Point Square-Rooting
For e even:  √(s × b^e) = √s × b^(e/2)
For e odd:   √(s × b^e) = √(b·s × b^(e–1)) = √(b·s) × b^((e–1)/2)

After the adjustment of s to b·s and e to e – 1, if needed, we have
√(s* × b^e*) = √s* × b^(e*/2),   with e* even,
where s* ∈ [1, 4) for IEEE 754, so √s* ∈ [1, 2).

Overflow or underflow is impossible; no postnormalization is needed.
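A minimal Python sketch (not from the slides) of the exponent adjustment: when e is odd, fold one factor of the base into the significand so the remaining exponent is even and can simply be halved. fp_sqrt is an illustrative name, assuming base 2 and s in [1, 2).

```python
import math

def fp_sqrt(s: float, e: int):
    """Return (s_out, e_out) with s_out * 2**e_out == sqrt(s * 2**e), for s in [1, 2)."""
    if e % 2:                 # odd exponent: use 2*s in [2, 4) and the even exponent e - 1
        s, e = 2 * s, e - 1
    return math.sqrt(s), e // 2

print(fp_sqrt(1.5625, 5))     # sqrt(1.5625 * 2**5) = sqrt(50); result significand in [1, 2)
```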
4. Conversions and Exceptions
Conversions and Exceptions
Conversions from fixed-point to floating-point
Conversions between floating-point formats
Conversion from higher to lower precision: rounding
Conversion between decimal and floating-point

The ANSI/IEEE standard includes four rounding modes:
• Round to nearest even [default rounding mode]
• Round toward zero (inward)
• Round toward +∞ (upward)
• Round toward –∞ (downward)
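A compact Python sketch (mine, not from the slides) of how these four modes behave when rounding a value to an integer; ieee_round is just an illustrative name.

```python
import math

def ieee_round(x: float, mode: str) -> int:
    if mode == "nearest-even":
        lo, hi = math.floor(x), math.ceil(x)
        if x - lo == 0.5:                      # tie: pick the even candidate
            return lo if lo % 2 == 0 else hi
        return lo if x - lo < 0.5 else hi
    if mode == "toward-zero":
        return math.trunc(x)
    if mode == "upward":
        return math.ceil(x)
    if mode == "downward":
        return math.floor(x)
    raise ValueError(mode)

for m in ("nearest-even", "toward-zero", "upward", "downward"):
    print(m, [ieee_round(v, m) for v in (2.5, -2.5, 1.75, -1.75)])
```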
Exceptions in Floating-Point Arithmetic
Divide by zero
Overflow
Underflow
Inexact exception: rounded value not the same as the original
Invalid operation; examples include:
  Addition        (+∞) + (–∞)
  Multiplication  0 × ∞
  Division        0 / 0 or ∞ / ∞
  Square-rooting  operand < 0
5. Rounding Schemes
Rounding Schemes

Rounding converts a value with a whole part and a fractional part,
xk–1 xk–2 . . . x1 x0 . x–1 x–2 . . . x–l
into an integer
yk–1 yk–2 . . . y1 y0

The simplest possible rounding scheme is chopping, or truncation:
xk–1 xk–2 . . . x1 x0 . x–1 x–2 . . . x–l   →(chop)→   xk–1 xk–2 . . . x1 x0
Truncation or Chopping
chop(x) maps x to the integer obtained by discarding the fractional digits.

Fig. 17.5 Truncation or chopping of a signed-magnitude number (same as round toward 0).
Fig. 17.6 Truncation or chopping of a 2's-complement number: chop(x) = down(x) (same as downward-directed rounding).
Truncation or Chopping

Input    Chopped    Error
00.00    00          0
00.01    00         –1/4
00.10    00         –1/2
00.11    00         –3/4
01.00    01          0
01.01    01         –1/4
01.10    01         –1/2
01.11    01         –3/4
10.00    10          0
10.01    10         –1/4
10.10    10         –1/2
10.11    10         –3/4
11.00    11          0
11.01    11         –1/4
11.10    11         –1/2
11.11    11         –3/4
Total error:        –6
Round to Nearest Number
Rounding has a slight upward bias. Consider rounding
(xk–1 xk–2 ... x1 x0 . x–1 x–2)two
to an integer (yk–1 yk–2 ... y1 y0 .)two.
The four possible cases and their representation errors are:

x–1 x–2    Round    Error
00         down      0
01         down     –0.25
10         up        0.5
11         up        0.25

With equal probability for each case, the mean error is 0.125 (which can cause error accumulation).

Fig. 17.7 Rounding of a signed-magnitude value to the nearest number, rtn(x).
Round to Nearest Number
F1 ≤ x ≤ F2  ⇒  Round(x) = the one of F1, F2 nearest to x

Input    Rounded    Error
00.00    00          0
00.01    00         –1/4
00.10    01          1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    10          1/2
01.11    10          1/4
10.00    10          0
10.01    10         –1/4
10.10    11          1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    00          1/2
11.11    00          1/4
Total error:         2
Round to Nearest Odd/Even Number

Fig. 17.8 Rounding to the nearest even number, rtne(x).
Fig. 17.9 R* rounding, or rounding to the nearest odd number, R*(x).

In both schemes, the "midpoint" values (x–1 x–2 = 10) are rounded up or down with equal probabilities.
R* Rounding to Nearest Odd
F1 ≤ x ≤ F2  ⇒  Round(x) = the one of F1, F2 nearest to x.
In case of a tie (x.10), choose out of F1 and F2 the odd one (with least-significant bit 1).

Input    Rounded    Error
00.00    00          0
00.01    00         –1/4
00.10    01          1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    01         –1/2
01.11    10          1/4
10.00    10          0
10.01    10         –1/4
10.10    11          1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    11         –1/2
11.11    100         1/4
Total error:         0
Rounding to Nearest Even
F1 ≤ x ≤ F2  ⇒  Round(x) = the one of F1, F2 nearest to x.
In case of a tie (x.10), choose out of F1 and F2 the even one (with least-significant bit 0).

Input    Rounded    Error
00.00    00          0
00.01    00         –1/4
00.10    00         –1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    10          1/2
01.11    10          1/4
10.00    10          0
10.01    10         –1/4
10.10    10         –1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    00          1/2
11.11    00          1/4
Total error:         0
A Simple Symmetric Rounding Scheme
jam(x): chop, then force the LSB of the result to 1.

This has the simplicity of chopping, with the near-symmetry of ordinary rounding. The maximum error is comparable to chopping (double that of rounding).

Fig. 17.10 Jamming or von Neumann rounding.
Jamming or von Neumann Rounding

Input    Rounded    Error
00.00    01          1
00.01    01          3/4
00.10    01          1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    01         –1/2
01.11    01         –3/4
10.00    11          1
10.01    11          3/4
10.10    11          1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    11         –1/2
11.11    11         –3/4
Total error:         2
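A tiny Python sketch (not from the slides) of jamming on two fractional bits; it reproduces the "Rounded" column of the table above. jam is a made-up helper name.

```python
def jam(x: int, frac: int) -> int:
    """Von Neumann jamming: truncate the frac low-order bits, then force the LSB to 1."""
    return (x >> frac) | 1

print([bin(jam(v, 2)) for v in range(0b10000)])   # inputs 00.00 through 11.11
```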
ROM Rounding
A small ROM, addressed by the low-order bits, supplies the rounded low-order result digits:
(y3 y2 y1 y0)two = (x3 x2 x1 x0)two        when x–1 = 0 or x3 = x2 = x1 = x0 = 1
(y3 y2 y1 y0)two = (x3 x2 x1 x0)two + 1    otherwise

xk–1 . . . x4 x3 x2 x1 x0 . x–1 x–2 . . . x–l      (the bits x3 x2 x1 x0 . x–1 form the ROM address)
xk–1 . . . x4 y3 y2 y1 y0                          (the ROM data replaces the low-order bits)

Fig. 17.11 ROM rounding with an 8 × 2 table.
ROM Rounding Example

Xn…X4 X3X2X1.X–1    Xn…X4 X3X2X1    Error
0 0 0 . 0            0 0 0           0
0 0 0 . 1            0 0 1           1/2
0 0 1 . 0            0 0 1           0
0 0 1 . 1            0 1 0           1/2
0 1 0 . 0            0 1 0           0
0 1 0 . 1            0 1 1           1/2
0 1 1 . 0            0 1 1           0
0 1 1 . 1            1 0 0           1/2
1 0 0 . 0            1 0 0           0
1 0 0 . 1            1 0 1           1/2
1 0 1 . 0            1 0 1           0
1 0 1 . 1            1 1 0           1/2
1 1 0 . 0            1 1 0           0
1 1 0 . 1            1 1 1           1/2
1 1 1 . 0            1 1 1           0
1 1 1 . 1            1 1 1           1/2
Total error:                         4
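A short Python sketch (my own, not from the slides) that generates the ROM contents implied by the rule above, here for a 3-bit integer field and one fractional address bit as in the example table; rom_rounding_table is a made-up helper name.

```python
def rom_rounding_table(int_bits: int = 3):
    """Map each address (low-order integer bits, x_-1) to the rounded output bits."""
    rom = {}
    for x in range(2 ** int_bits):
        for frac in (0, 1):                     # the single fractional address bit x_-1
            saturate = x == 2 ** int_bits - 1   # the all-1s entry is not incremented (no carry-out)
            rom[(x, frac)] = x if frac == 0 or saturate else x + 1
    return rom

for (x, frac), y in rom_rounding_table().items():
    print(f"{x:03b}.{frac} -> {y:03b}")         # reproduces the example table above
```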
Directed Rounding: Motivation
We may need result errors to be in a known direction.
Example: in computing upper bounds, larger results are acceptable, but results that are smaller than the correct values could invalidate the upper bound.
This leads to the definition of the directed rounding modes: upward-directed rounding (round toward +∞) and downward-directed rounding (round toward –∞). Both are required features of the IEEE floating-point standard.
Directed Rounding: Visualization

Fig. 17.12 Upward-directed rounding, or rounding toward +∞: up(x).
Fig. 17.6 Truncation or chopping of a 2's-complement number: chop(x) = down(x) (same as downward-directed rounding).
6. Logarithmic Number Systems
Logarithmic Number Systems
Sign-and-logarithm number system: the limiting case of floating-point representation
x = ±b^e,   e = log_b |x|
We usually call b the logarithm base, not the exponent base.
Using an integer-valued e wouldn't be very useful, so we consider e to be a fixed-point number.

Fig. 17.13 Logarithmic number representation with sign and fixed-point exponent: a sign bit followed by a fixed-point exponent e with an implied radix point.
Properties of Logarithmic Representation
The logarithm is often represented as a 2's-complement number:
(Sx, Lx) = (sign(x), log2 |x|)
Simple multiplication and division, but harder addition and subtraction:
L(xy) = Lx + Ly
L(x/y) = Lx – Ly

Example: 12-bit, base-2, logarithmic number system
Sign bit = 1; exponent = (1 0 1 1 0 . 0 0 1 0 1 1)two in 2's complement = –9.828125

The bit string above represents –2^–9.828125 ≈ –(0.0011)ten
Number range ≈ (–2^16, 2^16); min = 2^–16
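A rough Python sketch (not from the slides) of such a sign-and-logarithm system with a fixed-point exponent carrying 6 fractional bits, as in the 12-bit example above; lns_encode and lns_decode are illustrative names.

```python
import math

FRAC = 6                                          # fractional exponent bits

def lns_encode(x: float):
    """Return (sign, L) with L = log2|x| quantized to steps of 1/2**FRAC."""
    return (0 if x >= 0 else 1), round(math.log2(abs(x)) * 2**FRAC)

def lns_decode(sign: int, L: int) -> float:
    return (-1) ** sign * 2 ** (L / 2**FRAC)

sx, lx = lns_encode(-0.0011)
print(lx / 2**FRAC, lns_decode(sx, lx))           # about -9.83, back to roughly -0.0011

# Multiplication is just an addition of the log parts (and an XOR of the signs):
sa, la = lns_encode(3.0)
sb, lb = lns_encode(5.0)
print(lns_decode(sa ^ sb, la + lb))               # about 15, up to quantization error
```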