Floating-Point Representations
Dr. Shadrokh Samavi
Contents
1. Floating-Point Numbers
2. The ANSI/IEEE Floating-Point Standard
3. Basic Floating-Point Algorithms
4. Conversions and Exceptions
5. Rounding Schemes
6. Logarithmic Number Systems

Textbook: Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, New York, 2000.
Many of the slides are taken from the textbook or from Parhami's slides.
1. Floating-Point Numbers
Floating-Point Numbers
No finite number system can represent all real numbers.
Various systems can be used for a subset of the real numbers:

Fixed-point, ±w.f: low precision and/or range; computations must be scaled
Rational, ±p/q: difficult arithmetic
Floating-point, ±s × b^e: most common scheme
Logarithmic, ±log_b x: low precision but wide dynamic range
Fixed-Point Numbers
• Maximum absolute error is the same for all numbers:
  ±ulp with truncation
  ±ulp/2 with rounding
• Maximum relative error is much worse for small numbers than for large numbers:
  x = (0000 0000 . 0000 1001)two
  y = (1001 0000 . 0000 0000)two
• Small dynamic range: x^2 and y^2 cannot be represented;
  x^2 causes underflow (number too small), y^2 causes overflow (number too large)
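A small Python illustration (mine, not from the slides) of the relative-error point, using the two operands above with an 8-bit fraction; the names are made up for the example.

```python
# Hypothetical illustration: with an 8-bit fraction, truncation error is at most
# ulp = 2**-8 for every value, but the *relative* error bound is far larger for
# the small operand x than for the large operand y.
ulp = 2 ** -8                 # unit in the last place of the 8-bit fraction
x = 0b1001 * 2 ** -8          # (0000 0000.0000 1001)_2 = 0.03515625
y = 0b10010000 * 1.0          # (1001 0000.0000 0000)_2 = 144.0

for name, v in (("x", x), ("y", y)):
    print(f"{name}: max abs error {ulp}, max relative error {ulp / v:.2e}")
```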
Floating-Point Numbers
x = ±s × b^e ; that is, ±significand × base^exponent

Two signs are involved in a floating-point number:
1. The significand (number) sign, usually represented by a separate sign bit.
2. The exponent sign, usually embedded in the biased exponent (when the bias is a power of 2, the exponent sign is the complement of its MSB).

Floating-point trade-off: precision vs. dynamic range
Tradeoff: Allocating more bits to the exponent part widens the number representation range but reduces the precision.

Fig. 17.1 Typical floating-point number format:
• Sign ±: 0 for +, 1 for –
• Exponent e: a signed integer, often represented as an unsigned value by adding a bias; with h bits the range is [–bias, 2^h – 1 – bias]
• Significand s: represented as a fixed-point number, usually normalized by shifting so that the MSB becomes nonzero; in radix 2, the fixed leading 1 can be removed to save one bit (the "hidden 1")
Biased Exponent Format
– Only the significand (mantissa) requires a sign.
– Zero is represented with the smallest biased exponent (0) and an all-zero significand.
– Normalized floating-point numbers can be compared as if they were integers.
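A small Python illustration (not from the slides) of the last point: for same-sign normalized values, comparing the raw IEEE-754 bit patterns as unsigned integers agrees with comparing the values, because the biased exponent occupies the high-order bits. float_bits is an illustrative helper name.

```python
import struct

def float_bits(x: float) -> int:
    """Return the 32-bit IEEE single-precision pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# For positive normalized numbers, ordering the bit patterns as unsigned
# integers matches ordering the floating-point values themselves.
a, b = 1.5, 1024.0
assert (float_bits(a) < float_bits(b)) == (a < b)
print(hex(float_bits(a)), hex(float_bits(b)))
```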
Max = largest significand × b^(largest exponent)
Min = smallest significand × b^(smallest exponent)

Fig. 17.2 Subranges and special values in floating-point number representations: overflow regions beyond max– and max+, representable negative numbers in [max–, min–] and positive numbers in [min+, max+] (denser near ±min, sparser near ±max), and underflow regions between –min and +min around 0. Typical, overflow, underflow, and midway examples fall in the corresponding regions.
2. IEEE Floating-Point Standard
The ANSI / IEEE Floating-Point Standard

Short (32-bit) format: sign bit, 8 exponent bits (bias = 127, exponent range –126 to 127), 23 bits for the fractional part (plus hidden 1 in the integer part).

Long (64-bit) format: sign bit, 11 exponent bits (bias = 1023, exponent range –1022 to 1023), 52 bits for the fractional part (plus hidden 1 in the integer part).

Fig. 17.3 The ANSI/IEEE standard floating-point number representation formats.
The ANSI / IEEE Floating-Point Standard
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Feature              Single / Short             Double / Long
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Word width (bits)    32                         64
Significand bits     23 + 1 hidden              52 + 1 hidden
Significand range    [1, 2 – 2^–23]             [1, 2 – 2^–52]
Exponent bits        8                          11
Exponent bias        127                        1023
Zero (±0)            e + bias = 0, f = 0        e + bias = 0, f = 0
Denormal             e + bias = 0, f ≠ 0        e + bias = 0, f ≠ 0
                     represents ±0.f × 2^–126   represents ±0.f × 2^–1022
Infinity (±∞)        e + bias = 255, f = 0      e + bias = 2047, f = 0
Not-a-number (NaN)   e + bias = 255, f ≠ 0      e + bias = 2047, f ≠ 0
Ordinary number      e + bias ∈ [1, 254]        e + bias ∈ [1, 2046]
                     e ∈ [–126, 127]            e ∈ [–1022, 1023]
                     represents ±1.f × 2^e      represents ±1.f × 2^e
min                  2^–126 ≈ 1.2 × 10^–38      2^–1022 ≈ 2.2 × 10^–308
max                  ≈ 2^128 ≈ 3.4 × 10^38      ≈ 2^1024 ≈ 1.8 × 10^308
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
The ANSI / IEEE Floating-Point Standard (Single Precision)
With s the sign bit, e the biased exponent field, and f the fraction:
1) e = 255 and f ≠ 0:  v = NaN, regardless of s          (NaN)
2) e = 255 and f = 0:  v = (–1)^s × ∞                    (Infinity)
3) 1 ≤ e ≤ 254:        v = (–1)^s × 2^(e–127) × (1.f)    (Normalized)
4) e = 0 and f ≠ 0:    v = (–1)^s × 2^–126 × (0.f)       (Denormalized)
5) e = 0 and f = 0:    v = (–1)^s × 0                    (Zero)
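As a sketch of these five cases (my own Python, not from the slides), a decoder for a 32-bit pattern might look like this; decode_single is a made-up name.

```python
# Decode a 32-bit IEEE single-precision pattern according to the five cases above.
def decode_single(word: int) -> float:
    s = (word >> 31) & 0x1          # sign bit
    e = (word >> 23) & 0xFF         # biased exponent field
    f = word & 0x7FFFFF             # 23-bit fraction field
    sign = -1.0 if s else 1.0
    if e == 255:
        return float("nan") if f else sign * float("inf")   # cases 1 and 2
    if e == 0:
        return sign * (f / 2**23) * 2**-126                  # cases 4 and 5
    return sign * (1 + f / 2**23) * 2**(e - 127)             # case 3

print(decode_single(0x3FC00000))   # 1.5
print(decode_single(0x00000001))   # smallest positive denormal, about 1.4e-45
```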
The ANSI / IEEE Floating-Point Standard (Double Precision)
1) e = 2047 and f ≠ 0:  v = NaN, regardless of s          (NaN)
2) e = 2047 and f = 0:  v = (–1)^s × ∞                    (Infinity)
3) 1 ≤ e ≤ 2046:        v = (–1)^s × 2^(e–1023) × (1.f)   (Normalized)
4) e = 0 and f ≠ 0:     v = (–1)^s × 2^–1022 × (0.f)      (Denormalized)
5) e = 0 and f = 0:     v = (–1)^s × 0                    (Zero)
Special Operands and Denormals
Operations on special operands:
Ordinary number ÷ (+∞) = ±0
(+∞) × Ordinary number = ±∞
NaN + Ordinary number = NaN

"Graceful underflow": denormals fill the gap between 0 and min = 2^–126 with evenly spaced values of the form ±0.f × 2^–126.

Fig. 17.4 Denormals in the IEEE single-precision format.
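A quick Python check (mine, not from the slides) of the denormal range in the single-precision format, reinterpreting bit patterns with the struct module.

```python
import struct

def bits_to_float(word: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

min_normal   = bits_to_float(0x00800000)   # 2**-126, the smallest normalized value
max_denormal = bits_to_float(0x007FFFFF)   # just below min_normal
min_denormal = bits_to_float(0x00000001)   # 2**-149, the smallest positive denormal

print(min_normal, max_denormal, min_denormal)
```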
Extended Formats

Single extended: exponent ≥ 11 bits, significand ≥ 32 bits. The bias is unspecified, but the exponent range must include [–1022, 1023].

Double extended: exponent ≥ 15 bits, significand ≥ 64 bits. The bias is unspecified, but the exponent range must include [–16 382, 16 383].
IEEE 754 Format Parameters
3. Basic Floating-Point Algorithms
Basic Floating-Point Algorithms

Addition
Assume e1 ≥ e2; an alignment shift (preshift) of the operand with the smaller exponent is needed if e1 > e2:
(±s1 × b^e1) + (±s2 × b^e2) = (±s1 × b^e1) + (±s2 / b^(e1–e2)) × b^e1
                            = (±s1 ± s2 / b^(e1–e2)) × b^e1 = ±s × b^e

Example:
Numbers to be added:             x = 2^5 × 1.00101101
                                 y = 2^1 × 1.11101101      (operand with smaller exponent, to be preshifted)
Operands after alignment shift:  x = 2^5 × 1.00101101
                                 y = 2^5 × 0.000111101101
Result of addition:              s = 2^5 × 1.010010111101  (extra bits to be rounded off)
Rounded sum:                     s = 2^5 × 1.01001100
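Below is a rough Python sketch (my own, not from the slides) of this align / add / normalize / round flow for positive, normalized operands, with significands held as integers scaled by 2^8; fp_add and FRAC_BITS are names invented for the illustration, and ties are rounded up for simplicity.

```python
FRAC_BITS = 8

def fp_add(s1: int, e1: int, s2: int, e2: int):
    """Add s1*2**(e1-FRAC_BITS) + s2*2**(e2-FRAC_BITS); inputs positive, normalized."""
    if e1 < e2:                                      # ensure e1 >= e2
        s1, e1, s2, e2 = s2, e2, s1, e1
    shift = e1 - e2
    total = (s1 << shift) + s2                       # exact sum on the e2 scale
    drop = total.bit_length() - (FRAC_BITS + 1)      # extra low-order bits to round off
    e = e2 + drop
    s = (total + (1 << (drop - 1))) >> drop if drop > 0 else total
    if s >> (FRAC_BITS + 1):                         # rounding carried out of the MSB
        s, e = s >> 1, e + 1
    return s, e

# The slide's example: x = 2**5 * 1.00101101, y = 2**1 * 1.11101101
s, e = fp_add(0b100101101, 5, 0b111101101, 1)
print(bin(s), e)                                     # 0b101001100, 5  (= 2**5 * 1.01001100)
```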
Floating-Point Multiplication and Division
Multiplication:
(±s1 × b^e1) × (±s2 × b^e2) = ±(s1 × s2) × b^(e1+e2)
Because s1 × s2 ∈ [1, 4), postshifting may be needed for normalization.
Overflow or underflow can occur during multiplication or normalization.

Division:
(±s1 × b^e1) / (±s2 × b^e2) = ±(s1 / s2) × b^(e1–e2)
Because s1 / s2 ∈ (0.5, 2), postshifting may be needed for normalization.
Overflow or underflow can occur during division or normalization.
Floating-Point Square-Rooting
For e even:  √(s × b^e) = √s × b^(e/2)
For e odd:   √(s × b^e) = √(b·s × b^(e–1)) = √(b·s) × b^((e–1)/2)

After the adjustment of s to b·s and e to e – 1, if needed, we have
√(s* × b^e*) = √s* × b^(e*/2),   with e* even,
where s* ∈ [1, 4) for IEEE 754, so √s* ∈ [1, 2).

Overflow or underflow is impossible; no postnormalization is needed.
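A minimal Python sketch (not from the slides) of the exponent adjustment: when e is odd, fold one factor of the base into the significand so the remaining exponent is even and can simply be halved. fp_sqrt is an illustrative name, assuming base 2 and s in [1, 2).

```python
import math

def fp_sqrt(s: float, e: int):
    """Return (s_out, e_out) with s_out * 2**e_out == sqrt(s * 2**e), for s in [1, 2)."""
    if e % 2:                 # odd exponent: use 2*s in [2, 4) and the even exponent e - 1
        s, e = 2 * s, e - 1
    return math.sqrt(s), e // 2

print(fp_sqrt(1.5625, 5))     # sqrt(1.5625 * 2**5) = sqrt(50); result significand in [1, 2)
```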
4. Conversions and Exceptions
Conversions and Exceptions
Conversions from fixed-point to floating-point
Conversions between floating-point formats
Conversion from higher to lower precision: rounding
Conversion between decimal and floating-point

The ANSI/IEEE standard includes four rounding modes:
• Round to nearest even [default rounding mode]
• Round toward zero (inward)
• Round toward +∞ (upward)
• Round toward –∞ (downward)
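A compact Python sketch (mine, not from the slides) of how these four modes behave when rounding a value to an integer; ieee_round is just an illustrative name.

```python
import math

def ieee_round(x: float, mode: str) -> int:
    if mode == "nearest-even":
        lo, hi = math.floor(x), math.ceil(x)
        if x - lo == 0.5:                      # tie: pick the even candidate
            return lo if lo % 2 == 0 else hi
        return lo if x - lo < 0.5 else hi
    if mode == "toward-zero":
        return math.trunc(x)
    if mode == "upward":
        return math.ceil(x)
    if mode == "downward":
        return math.floor(x)
    raise ValueError(mode)

for m in ("nearest-even", "toward-zero", "upward", "downward"):
    print(m, [ieee_round(v, m) for v in (2.5, -2.5, 1.75, -1.75)])
```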
Exceptions in Floating-Point Arithmetic
Divide by zero
Overflow
Underflow
Inexact exception: rounded value not the same as the original
Invalid operation; examples include:
  Addition        (+∞) + (–∞)
  Multiplication  0 × ∞
  Division        0 / 0 or ∞ / ∞
  Square-rooting  operand < 0
5. Rounding Schemes
Rounding Schemes

Rounding converts a value with a whole part and a fractional part,
xk–1 xk–2 . . . x1 x0 . x–1 x–2 . . . x–l
into an integer
yk–1 yk–2 . . . y1 y0

The simplest possible rounding scheme is chopping, or truncation:
xk–1 xk–2 . . . x1 x0 . x–1 x–2 . . . x–l   →(chop)→   xk–1 xk–2 . . . x1 x0
Truncation or Chopping
chop(x) maps x to the integer obtained by discarding the fractional digits.

Fig. 17.5 Truncation or chopping of a signed-magnitude number (same as round toward 0).
Fig. 17.6 Truncation or chopping of a 2's-complement number: chop(x) = down(x) (same as downward-directed rounding).
Truncation or Chopping

Input    Chopped    Error
00.00    00          0
00.01    00         –1/4
00.10    00         –1/2
00.11    00         –3/4
01.00    01          0
01.01    01         –1/4
01.10    01         –1/2
01.11    01         –3/4
10.00    10          0
10.01    10         –1/4
10.10    10         –1/2
10.11    10         –3/4
11.00    11          0
11.01    11         –1/4
11.10    11         –1/2
11.11    11         –3/4
Total error:        –6
Round to Nearest Number
Rounding has a slight upward bias. Consider rounding
(xk–1 xk–2 ... x1 x0 . x–1 x–2)two
to an integer (yk–1 yk–2 ... y1 y0 .)two.
The four possible cases and their representation errors are:

x–1 x–2    Round    Error
00         down      0
01         down     –0.25
10         up        0.5
11         up        0.25

With equal probability for each case, the mean error is 0.125 (which can cause error accumulation).

Fig. 17.7 Rounding of a signed-magnitude value to the nearest number, rtn(x).
Round to Nearest Number
F1 ≤ x ≤ F2  ⇒  Round(x) = the one of F1, F2 nearest to x

Input    Rounded    Error
00.00    00          0
00.01    00         –1/4
00.10    01          1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    10          1/2
01.11    10          1/4
10.00    10          0
10.01    10         –1/4
10.10    11          1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    00          1/2
11.11    00          1/4
Total error:         2
Round to Nearest Odd/Even Number

Fig. 17.8 Rounding to the nearest even number, rtne(x).
Fig. 17.9 R* rounding, or rounding to the nearest odd number, R*(x).

In both schemes, the "midpoint" values (x–1 x–2 = 10) are rounded up or down with equal probabilities.
R* Rounding to Nearest Odd
F1 ≤ x ≤ F2  ⇒  Round(x) = the one of F1, F2 nearest to x.
In case of a tie (x.10), choose out of F1 and F2 the odd one (with least-significant bit 1).

Input    Rounded    Error
00.00    00          0
00.01    00         –1/4
00.10    01          1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    01         –1/2
01.11    10          1/4
10.00    10          0
10.01    10         –1/4
10.10    11          1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    11         –1/2
11.11    100         1/4
Total error:         0
Rounding to Nearest Even
F1 ≤ x ≤ F2  ⇒  Round(x) = the one of F1, F2 nearest to x.
In case of a tie (x.10), choose out of F1 and F2 the even one (with least-significant bit 0).

Input    Rounded    Error
00.00    00          0
00.01    00         –1/4
00.10    00         –1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    10          1/2
01.11    10          1/4
10.00    10          0
10.01    10         –1/4
10.10    10         –1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    00          1/2
11.11    00          1/4
Total error:         0
A Simple Symmetric Rounding Scheme
jam(x): chop, then force the LSB of the result to 1.

This has the simplicity of chopping, with the near-symmetry of ordinary rounding. The maximum error is comparable to chopping (double that of rounding).

Fig. 17.10 Jamming or von Neumann rounding.
Jamming or von Neumann Rounding

Input    Rounded    Error
00.00    01          1
00.01    01          3/4
00.10    01          1/2
00.11    01          1/4
01.00    01          0
01.01    01         –1/4
01.10    01         –1/2
01.11    01         –3/4
10.00    11          1
10.01    11          3/4
10.10    11          1/2
10.11    11          1/4
11.00    11          0
11.01    11         –1/4
11.10    11         –1/2
11.11    11         –3/4
Total error:         2
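A tiny Python sketch (not from the slides) of jamming on two fractional bits; it reproduces the "Rounded" column of the table above. jam is a made-up helper name.

```python
def jam(x: int, frac: int) -> int:
    """Von Neumann jamming: truncate the frac low-order bits, then force the LSB to 1."""
    return (x >> frac) | 1

print([bin(jam(v, 2)) for v in range(0b10000)])   # inputs 00.00 through 11.11
```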
ROM Rounding
A small ROM, addressed by the low-order bits, supplies the rounded low-order result digits:
(y3 y2 y1 y0)two = (x3 x2 x1 x0)two        when x–1 = 0 or x3 = x2 = x1 = x0 = 1
(y3 y2 y1 y0)two = (x3 x2 x1 x0)two + 1    otherwise

xk–1 . . . x4 x3 x2 x1 x0 . x–1 x–2 . . . x–l      (the bits x3 x2 x1 x0 . x–1 form the ROM address)
xk–1 . . . x4 y3 y2 y1 y0                          (the ROM data replaces the low-order bits)

Fig. 17.11 ROM rounding with an 8 × 2 table.
ROM Rounding Example

Xn…X4 X3X2X1.X–1    Xn…X4 X3X2X1    Error
0 0 0 . 0            0 0 0           0
0 0 0 . 1            0 0 1           1/2
0 0 1 . 0            0 0 1           0
0 0 1 . 1            0 1 0           1/2
0 1 0 . 0            0 1 0           0
0 1 0 . 1            0 1 1           1/2
0 1 1 . 0            0 1 1           0
0 1 1 . 1            1 0 0           1/2
1 0 0 . 0            1 0 0           0
1 0 0 . 1            1 0 1           1/2
1 0 1 . 0            1 0 1           0
1 0 1 . 1            1 1 0           1/2
1 1 0 . 0            1 1 0           0
1 1 0 . 1            1 1 1           1/2
1 1 1 . 0            1 1 1           0
1 1 1 . 1            1 1 1           1/2
Total error:                         4
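A short Python sketch (my own, not from the slides) that generates the ROM contents implied by the rule above, here for a 3-bit integer field and one fractional address bit as in the example table; rom_rounding_table is a made-up helper name.

```python
def rom_rounding_table(int_bits: int = 3):
    """Map each address (low-order integer bits, x_-1) to the rounded output bits."""
    rom = {}
    for x in range(2 ** int_bits):
        for frac in (0, 1):                     # the single fractional address bit x_-1
            saturate = x == 2 ** int_bits - 1   # the all-1s entry is not incremented (no carry-out)
            rom[(x, frac)] = x if frac == 0 or saturate else x + 1
    return rom

for (x, frac), y in rom_rounding_table().items():
    print(f"{x:03b}.{frac} -> {y:03b}")         # reproduces the example table above
```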
Directed Rounding: Motivation
We may need result errors to be in a known direction.
Example: in computing upper bounds, larger results are acceptable, but results that are smaller than the correct values could invalidate the upper bound.
This leads to the definition of the directed rounding modes: upward-directed rounding (round toward +∞) and downward-directed rounding (round toward –∞). Both are required features of the IEEE floating-point standard.
Directed Rounding: Visualization

Fig. 17.12 Upward-directed rounding, or rounding toward +∞: up(x).
Fig. 17.6 Truncation or chopping of a 2's-complement number: chop(x) = down(x) (same as downward-directed rounding).
6. Logarithmic Number Systems
Logarithmic Number Systems
Sign-and-logarithm number system: the limiting case of floating-point representation
x = ±b^e,   e = log_b |x|
We usually call b the logarithm base, not the exponent base.
Using an integer-valued e wouldn't be very useful, so we consider e to be a fixed-point number.

Fig. 17.13 Logarithmic number representation with sign and fixed-point exponent: a sign bit followed by a fixed-point exponent e with an implied radix point.
Properties of Logarithmic Representation
The logarithm is often represented as a 2's-complement number:
(Sx, Lx) = (sign(x), log2 |x|)
Simple multiplication and division, but harder addition and subtraction:
L(xy) = Lx + Ly
L(x/y) = Lx – Ly

Example: 12-bit, base-2, logarithmic number system
Sign bit = 1; exponent = (1 0 1 1 0 . 0 0 1 0 1 1)two in 2's complement = –9.828125

The bit string above represents –2^–9.828125 ≈ –(0.0011)ten
Number range ≈ (–2^16, 2^16); min = 2^–16
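A rough Python sketch (not from the slides) of such a sign-and-logarithm system with a fixed-point exponent carrying 6 fractional bits, as in the 12-bit example above; lns_encode and lns_decode are illustrative names.

```python
import math

FRAC = 6                                          # fractional exponent bits

def lns_encode(x: float):
    """Return (sign, L) with L = log2|x| quantized to steps of 1/2**FRAC."""
    return (0 if x >= 0 else 1), round(math.log2(abs(x)) * 2**FRAC)

def lns_decode(sign: int, L: int) -> float:
    return (-1) ** sign * 2 ** (L / 2**FRAC)

sx, lx = lns_encode(-0.0011)
print(lx / 2**FRAC, lns_decode(sx, lx))           # about -9.83, back to roughly -0.0011

# Multiplication is just an addition of the log parts (and an XOR of the signs):
sa, la = lns_encode(3.0)
sb, lb = lns_encode(5.0)
print(lns_decode(sa ^ sb, la + lb))               # about 15, up to quantization error
```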