Fractional Numbers
Ver. 1.4
© 2010 - Claudio Fornaro

Fractional Number Notations
Fractional numbers have the form xxxxxxxxx.yyyyyyyyy, where the x's constitute the integer part of the value and the y's the fractional part.
There are two main methods to encode fractional numbers:
  fixed-point notation
  floating-point notation

Fixed-point Notation
Fixed-point notation splits the available n bits into 2 portions:
  one for the integer part
  one for the fractional part
  [ integer | fractional ]
The radix point is not stored (it does not use up bits): its position is simply known.
The number of bits for the integer part and for the fractional part is chosen before making any calculation.

Fixed-point Notation
If needed:
  the integer part must be padded with 0s on the left
  the fractional part must be padded with 0s on the right
Examples (the radix point is implied, it is not stored):
  5.25 in FX on 4+4 bits: 01010100 (read as 0101.0100)
  5.25 in FX on 6+2 bits: 00010101 (read as 000101.01)

Examples
Convert value +12.25 in FX 2C 1+4+3 (sign + integer + fractional bits, as described below)

Fixed-point Notation
For relative (signed) fractional values, both SM and 2C notations can be used.
The n bits are then divided into 3 parts:
  sign (1 bit)
  integer part (m bits)
  fractional part (n-m-1 bits)
E.g.
1+7+8 means 1 bit for the sign, 7 for the integer part and 8 for the fractional part.
Operations are the same as those seen for integer values, provided that the values have the same format.

Examples
Convert value –12.25 in FX 2C 1+4+3
  +12.25 → 01100010
  –12.25 → 01100010 → (2C-operation) → 10011110
Note: when using the 1st 2C-operation method, 1 must be added to the LSB, not to the units place:
  01100010 → (complement) → 10011101 + 1 = 10011110

Exercises
Convert the values as requested:
  –151.0 → FX 2C on 16 bits (1+8+7)
  –151.25 → FX 2C on 16 bits (1+8+7)
  111100101010 from FX 2C (1+7+4) → (...)_{10}
  100110011000 from FX 2C (1+6+5) → (...)_{10}

Exercises
Calculate on FX 2C 16 bits (1+7+8) and identify any overflow:
  (111.6 – 44.57) / 2
  (68.22 – 71.25) * 64

Exercises: Solutions
  –151.0:   0 10010111 0000000 → 1 01101001 0000000
  –151.25:  0 10010111 0100000 → 1 01101000 1100000
  Note that the integer part is not the same.
  111100101010 → 0 0001101 0110 → –13.375_{10}
  100110011000 → 0 110011 01000 → –51.25_{10}

Exercises: Solutions (radix points are implied)
(111.6 – 44.57) / 2
  111.6 ≈ 1101111.10011001_2 → 0110111110011001 (2C)
  44.57 ≈ 101100.10010001_2 → 0010110010010001 (2C)
  –44.57 → 1101001101101111 (2C)
  0110111110011001 + 1101001101101111 = 0100001100001000 (the carry out of the MSB is discarded)
  dividing by 2 (right shift by 1): 0010000110000100 = +33.515625

Exercises: Solutions (radix points are implied)
(68.22 – 71.25) * 64
  68.22 ≈ 1000100.00111000_2 → 0100010000111000 (2C)
  71.25 = 1000111.01_2 → 0100011101000000 (2C)
  –71.25 → 1011100011000000 (2C)
  0100010000111000 + 1011100011000000 = 1111110011111000
  multiplying by 64 (left shift by 6): 0011111000000000, the sign bit has changed: OVERFLOW

Fixed-point Uses
Fixed-point notation is sometimes used by simulating it with the integer notation that microprocessors use (i.e.
2C). This allows faster computations than operations using floating-point notation (intrinsically slower).

Fixed-point Problems
Suppose the following (unsigned) values have to be coded using fixed-point on a total of 8 bits:
  37.25      = 100101.01
  12.625     = 1100.1010
  5.4375     = 101.01110
  1.2890625  = 1.0100101
All of them can be coded in 8 bits, but no single position of the radix point is suitable for all of them.

Fixed-point Problems
Suppose you have to represent some fractional values 0 ≤ x < 8 using a fixed-point coding on 4+4 bits:
  7.2732   2.3748   5.4375   1.2890
The first bit is always 0, and the fractional part is rounded to 4 bits.
If we could move the radix point 1 position to the left, we would have 1 more bit of precision.

Fixed-point Problems
Suppose you have to represent some values with fractional part x.0, x.5 or x.25 only, using a fixed-point coding on 4+4 bits.
The last two bits are always 00, and the integer part is limited to 15.
If we could move the radix point 2 positions to the right, we would have 2 more bits for the integer part (values up to 63).

Fixed-point Problems
The problem with fixed-point notation is the fixed position of the radix point.
To solve this problem, the radix point must be made movable (floating); this requires that its position be stored along with each number.

Exponential Notation
Exponential notation represents a number as a value (mantissa or significand) that multiplies a whole power of the base (exponent).
Examples (in decimal), in the form mantissa · 10^{exponent}:
  123.45678 = 0.12345678·10^{3}
  0.0087654321 = 0.87654321·10^{-2}
  87655678 = 0.87655678·10^{8}

Exponential Notation
Very big and very small values are obtained by just varying the exponent.
The same value can be expressed in many forms:
  123.45 = 0.12345·10^{3} = 12345·10^{-2}
Among these forms, the form 0.x (x≠0) is chosen in order to have a unique representation for each value; this is called the normalized form.

Exponential Notation
When the number of digits is not enough to store the whole number, only the most important
(leftmost) digits are stored.
The most significant digits are thus preserved, but approximation errors are introduced because of truncation.
Example (only 4 decimal digits, normalized form):
  0.001234567 → 0.1234·10^{-2} = 0.001234000
  876543 → 0.8765·10^{6} = 876500

Exponential Notation
The maximum representation error ε with n digits is 10^{-n} relative to the power of the whole part.
If the whole part power is m:
  ε = 10^{-n} · 10^{m} = 10^{m-n}
which is the power of the rightmost digit (LSD).

Exponential Notation
Example: suppose the value has only 4 decimal digits
  876543 → 0.8765·10^{6} (normalized)
The whole part is 0·10^{6}, so m = 6
  ε = 10^{-4} · 10^{6} = 10^{2} (maximum error)
This can also be seen by writing the value as a sum of powers:
  0·10^{6} + 8·10^{5} + 7·10^{4} + 6·10^{3} + 5·10^{2}
For this value, the error is:
  | 876543 – 876500 | = 43 (< 10^{2})

Exponential Notation
Example: suppose the value has only 4 decimal digits
  0.001234567 → 0.1234·10^{-2} = 0.001234000
The whole part is 0·10^{-2}, so m = –2
  ε = 10^{-4} · 10^{-2} = 10^{-6} (maximum error)
Writing the value as a sum of powers:
  0·10^{-2} + 1·10^{-3} + 2·10^{-4} + 3·10^{-5} + 4·10^{-6}
For this value, the error is:
  | 0.001234567 – 0.001234000 | = 0.000000567 (< 10^{-6})

IEEE-P754 Floating-Point
The IEEE-P754 standard describes the most common notations used by computer FPUs (Floating-Point Units) to compute floating-point values.
The two exponential binary floating-point notations described have the form mantissa · 2^{exponent} and are:
  Single precision (SP)
  Double precision (DP)

IEEE-P754 Single Precision
Single precision values use 32 bits divided into 3 parts:
  sign: 1 bit
  exponent field: 8 bits
  mantissa (or significand) field: 23 bits
  [ s | exponent | mantissa ]
The sign bit is defined as follows:
  0 is used for values ≥ 0
  1 is used for values ≤ 0 (negative zero!)
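The 1+8+23 layout just described can be inspected directly. The following is only a minimal Python sketch: the helper name `sp_fields` is hypothetical (it is not part of the slides), and it relies on the standard `struct` module to obtain the single-precision bit pattern of a value. It only separates the three fields; it does not interpret them.

```python
import struct

def sp_fields(value: float) -> tuple[int, int, int]:
    """Return (sign, exponent_field, mantissa_field) of value in single precision.

    Hypothetical helper for illustration only.
    """
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes
    # as an unsigned 32-bit integer to get at the raw bits.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = sp_fields(13.25)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# prints: 0 10000010 10101000000000000000000

print(sp_fields(-0.0))
# prints: (1, 0, 0)  -> negative zero has the sign bit set
```

The second call shows the "negative zero" case: all bits are 0 except the sign bit.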
IEEE-P754 Single Precision
The mantissa (or significand) is in the normalized form 1.xxxxx, where the 1 before the radix point is the leftmost 1 (MSB) in the binary representation.
Only the fractional part of the binary mantissa is stored in the mantissa field: the leftmost 1 is already known to be present (it is called the hidden bit); this allows for one more bit of precision (23 bits stored + 1 hidden = 24 bits effective).

IEEE-P754 Single Precision
The exponent is a relative (signed) integer value on 8 bits. The IEEE-P754 SP standard does not use SM or 2C notations, but a biased notation called "excess 127": the FP exponent field is computed by adding the constant value 127 (bias constant) to the exponent of the normalized value.
Excess notation is efficient, especially for number comparison.
The offset value is 2^{n-1} – 1 (n is the number of bits), so that the first half of the range represents the negative numbers.

IEEE-P754 Single Precision
Example: +13.25_{10} → IEEE-P754 (SP)
  the sign is positive: sign bit = 0
  convert the value to binary: 13.25 = 1101.01
  normalize the value (note: base 2): 1101.01 = 1.10101·2^{3}
  compute the exponent field by adding 127 to the real base-2 exponent: 3+127 = 130 = 10000010
  compose the pieces, adding padding 0s:
    0 10000010 10101000000000000000000

IEEE-P754 Single Precision
Example: convert from IEEE-P754 (SP)
    1 01100000 01000000000000000000000
  sign bit = 1 → –
  extract the mantissa, add the hidden bit, and convert to decimal: 1.01_2 = 1.25_{10}
  compute the real exponent by subtracting 127 from the extracted exponent: 01100000 = 96, 96–127 = –31
  compose the parts: –1.25_{10}·2^{-31} = –5.82·10^{-10}

IEEE-P754 Single Precision
The SP decimal range is: (1.4·10^{-45} … 3.4·10^{+38})
The decimal exponent varies from –45 to +38. Normalized values have a binary exponent from –126 to 127; the smallest magnitudes, down to 1.4·10^{-45}, are reached only by denormalized values.
Values are approximated to 7 decimal digits (corresponding to the 24 bits used by the mantissa).

IEEE-P754 Single Precision
The representation error is the absolute weight of the LSB. This is
computed by multiplying the weight of the integer part (hidden bit) by the relative weight of the mantissa LSB (i.e. the weight of the LSB with respect to the integer part).
This results in adding the exponents:
  1.10010..1 · 2^{0}    LSB weight = 2^{0-23} = 2^{-23}
  1.10010..1 · 2^{5}    LSB weight = 2^{5-23} = 2^{-18}
  1.10010..1 · 2^{94}   LSB weight = 2^{94-23} = 2^{71}

IEEE-P754 Single Precision
The binary exponent varies from –126 to 127, corresponding to excess-127 values from 1 to 254.
Exponent values 00000000 (0) and 11111111 (255) are used for special numbers:
  Zeroes
  Infinities
  NaNs
  Denormalized values

IEEE-P754 Single Precision
Zero
  Exponent=00000000, Mantissa=0
  0/1 00…00 00…00
  by definition, not by computation, because there is no 1 to normalize on
  Positive and negative zero are considered equivalent
Infinity
  Exponent=11111111, Mantissa=0
  0/1 11…11 00…00
  Operations with infinities are well defined

IEEE-P754 Single Precision
Not a Number (NaN)
  Exponent=11111111, Mantissa≠0
  0/1 11…11 <not 00…00>
NaNs are used to indicate values that do not represent real numbers.
There are 2 types of NaNs:
  Quiet NaNs: denote indeterminate operations (mantissa MSB set), i.e. the result of the operation is not mathematically defined
  Signalling NaNs: denote an invalid operation (mantissa MSB clear)

Special Operations
  N / INF    = 0
  INF · INF  = INF
  N / 0      = INF   (N ≠ 0)
  INF + INF  = INF
  0 / 0      = NaN
  INF – INF  = NaN
  INF / INF  = NaN
  INF · 0    = NaN
Any operation with NaN yields a NaN result.

IEEE-P754 Single Precision
The IEEE-P754 standard allows values in non-normalized form too (denormalized):
  Exponent=00000000, Mantissa≠0
  The hidden bit is now 0 and not 1
  The exponent value is considered to be –126
  The value is: 0.mantissa · 2^{-126}

IEEE-P754 Double Precision
Double precision notation just extends the SP notation to use 64 bits.
The differences are:
  exponent bits: 11
  mantissa bits: 52
  bias constant: 1023
  exponent range: –1022, +1023
  equivalent decimal range: (4.9·10^{-324} … 1.7·10^{+308}) with 15 decimal digits
  denormalized exponent: –1022

IEEE-P754
Compact Notation
For ease of writing and copying, floating-point numbers (like any other bit sequence) can be translated to base 16 as if they were (they are not!) pure binary numbers:
  0 10000000 00100…00 → 40100000
  1 01111111 11000…00 → BFE00000
  C3C41000 → 1 10000111 100010000010…0

IEEE-P754 Exercises
Convert the following values to/from IEEE-P754:
  –1324.25 to SP and DP
  0.02324 to SP and DP with an absolute precision of 1/1000
  0 10000000 00100…00 to decimal
  1 01111111 11000…00 to decimal
  EB141000 to decimal

IEEE-P754 Exercises: Solutions
–1324.25 → 10100101100.01 = 1.010010110001·2^{10}
  10+127 = 137 = 10001001
  10+1023 = 1033 = 10000001001
then:
  SP: 1 10001001 0100101100010…0, in compact form: C4A58800
  DP: 1 10000001001 0100101100010…0, in compact form: C094B10000000000

IEEE-P754 Exercises: Solutions
0.02324 with ε = 1/1000 → n = 10 (fractional bits, since 2^{-10} < 1/1000)
  0.0000010111 = 1.0111·2^{-6}
  –6+127 = 121 = 01111001
  –6+1023 = 1017 = 01111111001
then:
  SP: 0 01111001 01110…0, in compact form: 3CB80000
  DP: 0 01111111001 01110…0, in compact form: 3F97000000000000

IEEE-P754 Exercises: Solutions
  0 10000000 00100…00 → +1.001_2·2^{128-127} = 10.01_2 = +2.25
  1 01111111 11000…00 → –1.11_2·2^{127-127} = –1.75
  EB141000 = 1 11010110 001010000010…0
    keeping only the leading mantissa bits: –1.00101_2·2^{214-127} = –1.15625_{10}·2^{87}
    = –1.15625·2^{80}·2^{7} ≈ –1·10^{24}·10^{2} = –10^{26} (approx.)
    without the power-of-ten approximation: –1.15625·2^{87} = –1.78921021302965117856514048·10^{26}

Floating-point Addition
To add two FP values, these must have the same exponent before their mantissas are added: the smaller value is converted to have the same exponent as the greater one (it is de-normalized).
As the exponent is increased (e.g.
by 3), the mantissa must decrease (right shift by 3 bits) so as not to change the overall value:
  1.01000·2^{16} + 1.101000·2^{13}
  1.01000·2^{16} + 0.001101·2^{16}

Underflow
If the conversion of the smaller value shifts away all of the mantissa bits (including the hidden bit), the value is approximated to 0; thus the result of the operation is equal to the greater value, while the smaller one is simply ignored.
There is an underflow condition when, adding 2 values, the result is equal to the greater of them.

Underflow
Example in SP: 1.101·2^{43} + 1.01·2^{18}
  1.01·2^{18} must be converted to the form xxx·2^{43}; this causes a right shift of 25 bits on the mantissa, thus shifting away all the 24 mantissa bits and resulting in 0.
When adding up many small values, a partial sum may become so big that it causes underflow for each of the subsequent values (only the first part of the values is added up).

IEEE-P754 Exercises
Calculate the following operations (IEEE-P754) and express the result in the same compact form; identify any Overflow/Underflow:
  2B1A5F20 + 4F1A3BB0
  C4A58000 + C2B80000
  63AB102F – 709B1BC2
  7F600000 + 7F100000

IEEE-P754 Exercises
Solution N.1:
  2B1A5F20 → 0 01010110 00110100101111100100000   E=01010110=86
  4F1A3BB0 → 0 10011110 00110100011101110110000   E=10011110=158
  Difference of exponents = 72
  72 > 24 → UNDERFLOW
  Result: 4F1A3BB0

IEEE-P754 Exercises
Solution N.2:
  C4A58000 → 1 10001001 01001011000000000000000
  E=10001001=137 (non biased: 10) M=1.
01001011
  C2B80000 → 1 10000101 01110000000000000000000
  E=10000101=133 (non biased: 6) M=1.0111
  Difference of exponents: 137 – 133 = 4

IEEE-P754 Exercises
Solution N.2 (continuation):
  De-normalized mantissa of the 2nd value to have exponent=10 (4 right shifts): 0.00010111
  Addition:
    – 1.01001011 ·2^{10} +
    – 0.00010111 ·2^{10} =
    – 1.01100010 ·2^{10}
  Result: 1 10001001 01100010000000000000000 → C4B10000

IEEE-P754 Exercises
Solution N.3: 63AB102F – 709B1BC2
  63AB102F → 0 11000111 01010110001000000101111   E=199
  709B1BC2 → 0 11100001 00110110001101111000010   E=225
  Difference of exponents: 225 – 199 = 26
  26 > 24 → UNDERFLOW
  Result: F09B1BC2 (SIGN CHANGED!)

IEEE-P754 Exercises
Solution N.4:
  7F600000 → 0 11111110 11000000000000000000000   E=254 (non biased = 127)
  7F100000 → 0 11111110 00100000000000000000000   E=254 (non biased = 127)
  Difference of exponents: 0

IEEE-P754 Exercises
Solution N.4 (continuation):
  1.110·2^{127} + 1.001·2^{127} = 10.111·2^{127}
  Renormalization: 1.0111·2^{128}
  Max exponent is 127 → OVERFLOW
  Result: (+Infinity) 0 11111111 00000000000000000000000 → 7F800000

IEEE-P754 Exercises
Calculate in the IEEE-P754 SP format the following operation with DECIMAL numbers; identify any Overflow/Underflow:
  920000000_{10} – 920000001_{10}

IEEE-P754 Puzzles
Solution:
  The two values differ only in the LSB
  The two numbers have 9 decimal digits, corresponding to about 9·3.32 ≈ 30 bits
  After normalization (1.xxx·2^{29}), the relative weight of the LSB is 2^{-29}
  Having only 24 bits (relative LSB weight 2^{-23}), the power 2^{-29} is discarded
  The two values are considered equal
  Result is 0

IEEE-P754 SP Ranges
Maximum normalized positive number is 1.111…111·2^{127} with 23 fractional bits
If all the bits were present, the value would be 1.111…111·2^{127} with 127 fractional bits:
  1.111…111·2^{127} = 2^{128} – 1
Having just 23 fractional bits, the value is approximated to 1.11…11 00…00 · 2^{127}, with 23 fractional bits set to 1 and the rightmost 127–23 = 104 bits set to 0
104 bits set to 1 have the value 2^{104} – 1

IEEE-P754 SP Ranges
Maximum normalized positive value:
  1.11…11 00…00 · 2^{127} = (2^{128} – 1) – (2^{104} – 1)
  = 2^{128} – 2^{104}
  = 3.4028234663852885981170418348452e+38

IEEE-P754 SP Ranges
Minimum normalized positive number: 1.000…000·2^{-126}
  = 1.1754943508222875079687365372222e–38
Maximum denormalized positive number is 0.111…111·2^{-126} with 23 fractional bits
  the rightmost bit power is: –126–23 = –149
  value (by the same reasoning as for the maximum normalized value): 2^{-126} – 2^{-149}
Minimum denormalized positive number is 0.000…001·2^{-126} with 23 fractional bits
  the rightmost bit power is: –126–23 = –149
  value: 2^{-149}

IEEE-P754 Puzzles
Determine the difference between value 44A58800 and the next one (44A58801)
  Value in binary is: 0 10001001 0100101100010…0 = 1.0100101100010…0·2^{10}
  The next one differs in just the LSB: 1.0100101100010…1·2^{10}
  The difference is 1·LSB weight = 2^{10-23} = 2^{-13}

IEEE-P754 Puzzles
Determine the (absolute) representation error for value N = 6·10^{18} in IEEE-P754 SP.
  N = 6·10^{18} ≈ 6·2^{60}, which requires 63 bits: N = 1.xxx·2^{62}
  In SP there are only 23 bits for the mantissa
  The weight of the LSB is 2^{62-23} = 2^{39}
  The representation error is 2^{39}

IEEE-P754 Puzzles
Determine the range of the consecutive integer values in SP.
  Values are in the form 1.xx…xx with 23 fractional bits (denormals are not integers)
  24 bits (hidden bit included) result in 2^{24} combinations of bits (0 to 2^{24}–1); each corresponds to a value, and an appropriate exponent makes it an integer value
  2^{24} is represented too
  Range: –2^{24} … +2^{24}
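
The three puzzles above can be checked numerically. What follows is only a sketch in Python, not part of the original slides: the helper names `sp_bits`, `sp_value` and `lsb_weight` are hypothetical, and the standard `struct` module is used to round values to single precision and to reinterpret bit patterns. `lsb_weight` is meant for positive finite values below the SP maximum.

```python
import struct

def sp_bits(value: float) -> int:
    """Bit pattern (unsigned 32-bit integer) of value rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def sp_value(bits: int) -> float:
    """Value encoded by a 32-bit single-precision bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def lsb_weight(value: float) -> float:
    """Distance between value (rounded to SP) and the next SP bit pattern."""
    bits = sp_bits(value)
    return sp_value(bits + 1) - sp_value(bits)

# Puzzle 1: 44A58800 and 44A58801 differ by 2^(10-23) = 2^-13
print(lsb_weight(sp_value(0x44A58800)) == 2.0 ** -13)        # True

# Puzzle 2: around 6e18 the LSB weighs 2^(62-23) = 2^39
print(lsb_weight(6e18) == 2.0 ** 39)                         # True

# Puzzle 3: 2^24 is stored exactly, 2^24 + 1 is not
print(sp_value(sp_bits(float(2 ** 24))) == 2 ** 24)          # True
print(sp_value(sp_bits(float(2 ** 24 + 1))) == 2 ** 24 + 1)  # False
```

The last line shows why the run of consecutive integers stops at 2^{24}: the next integer needs a 25th mantissa bit, so it is rounded to a neighbouring representable value.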