Fractional Number Notations
Ver. 1.4
© 2010 - Claudio Fornaro

Fractional Numbers

Fractional numbers have the form:
xxxxxxxxx.yyyyyyyyy
where the x's constitute the integer
part of the value and the y's the
fractional part
There are two main methods to encode
fractional numbers:
  fixed-point notation
  floating-point notation
Fixed-point Notation

Fixed-point notation splits the available
n bits into 2 portions:
  one for the integer part
  one for the fractional part

  | integer | fractional |

The radix point is not stored (it does not
use up bits): its position is simply known
The number of bits for the integer part and
for the fractional part is chosen before
making any calculation

Fixed-point Notation

If needed:
  the integer part must be padded with 0s
  on the left
  the fractional part must be padded with
  0s on the right
Examples
  5.25 in FX on 4+4 bits: 01010100
  5.25 in FX on 6+2 bits: 00010101
The radix point is implied (not stored) in
these bit strings
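
The unsigned encoding can be sketched in a few lines of Python. This is only an illustration: the function name to_fixed_unsigned and the choice to truncate (rather than round) extra fractional bits are assumptions, not part of the slides.

```python
def to_fixed_unsigned(value, int_bits, frac_bits):
    """Encode a non-negative value as an unsigned fixed-point bit string.

    The fractional part is truncated to frac_bits bits; the radix point
    is implied between the integer and the fractional fields.
    """
    total = int_bits + frac_bits
    scaled = int(value * (1 << frac_bits))   # move the radix point right
    if value < 0 or scaled >= (1 << total):
        raise OverflowError("value does not fit the chosen format")
    return format(scaled, "0{}b".format(total))  # zero-padded on the left


print(to_fixed_unsigned(5.25, 4, 4))  # 01010100
print(to_fixed_unsigned(5.25, 6, 2))  # 00010101
```

Multiplying by 2^frac_bits turns the problem into a plain integer conversion, which is why the padding on both sides comes for free.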
Fixed-point Notation

For signed (relative) fractional values, both
SM (sign-magnitude) and 2C (two's complement)
notations can be used
The n bits are then divided into 3 parts:
  sign (1 bit)
  integer part (m bits)
  fractional part (n-m-1 bits)
E.g. 1+7+8 means 1 bit for the sign, 7 for the
integer part and 8 for the fractional part
Operations are the same as those seen for
integer values, provided that the values
have the same format

Fixed-point Notation

Examples
Convert value +12.25 to FX 2C 1+4+3
  01100010
Convert value –12.25 to FX 2C 1+4+3
  01100010 → 10011110
Note: when using the 1st 2C-operation method
(bit inversion followed by +1), the 1 must be
added to the LSB, not to the units place:
  01100010 → 2C operation →
  10011101 +
         1 =
  10011110
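
A minimal Python sketch of the signed (2C) case follows. The helper names to_fixed_2c and from_fixed_2c are hypothetical, and the magnitude is truncated before negation, mirroring the "convert the positive value, then complement" procedure of the example.

```python
def to_fixed_2c(value, int_bits, frac_bits):
    """Encode value in 1+int_bits+frac_bits two's-complement fixed point."""
    total = 1 + int_bits + frac_bits
    scaled = int(abs(value) * (1 << frac_bits))   # magnitude, truncated
    if value < 0:
        scaled = -scaled
    if not -(1 << (total - 1)) <= scaled < (1 << (total - 1)):
        raise OverflowError("value does not fit the chosen format")
    # Masking a negative integer to n bits yields its 2C bit pattern
    return format(scaled & ((1 << total) - 1), "0{}b".format(total))


def from_fixed_2c(bits, frac_bits):
    """Decode a two's-complement fixed-point bit string."""
    raw = int(bits, 2)
    if bits[0] == "1":            # negative value: subtract 2**n
        raw -= 1 << len(bits)
    return raw / (1 << frac_bits)


print(to_fixed_2c(+12.25, 4, 3))     # 01100010
print(to_fixed_2c(-12.25, 4, 3))     # 10011110
print(from_fixed_2c("10011110", 3))  # -12.25
```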
Exercises

Convert the values as requested
  –151.0 → FX 2C on 16 bits (1+8+7)
  –151.25 → FX 2C on 16 bits (1+8+7)
  111100101010 from FX 2C (1+7+4) → ( ) base 10
  100110011000 from FX 2C (1+6+5) → ( ) base 10
Calculate on FX 2C 16 bits (1+7+8) and
identify any overflow
  (111.6 – 44.57) / 2
  (68.22 – 71.25) * 64

Exercises

Solutions
–151.0
  0 10010111 0000000 → 1 01101001 0000000
–151.25
  0 10010111 0100000 → 1 01101000 1100000
  Note that the integer part is not the same
111100101010
  → 0 0001101 0110 → –13.375 (base 10)
100110011000
  → 0 110011 01000 → –51.25 (base 10)

Exercises

Solutions
(111.6 – 44.57) / 2
  111.6 → 1101111.10011001 (base 2) → 0110111110011001 (2C)
  44.57 → 101100.10010001 (base 2) → 0010110010010001 (2C)
  –44.57 → 1101001101101111 (2C)
  0110111110011001 +
  1101001101101111 =
  0100001100001000
  divided by 2 (right shift by 1) → 0010000110000100
  = +33.515625
The radix point is implied in these bit strings

Exercises

Solutions
(68.22 – 71.25) * 64
  68.22 → 1000100.00111000 (base 2) → 0100010000111000 (2C)
  71.25 → 1000111.01 (base 2) → 0100011101000000 (2C)
  –71.25 → 1011100011000000 (2C)
  0100010000111000 +
  1011100011000000 =
  1111110011111000
  multiplied by 64 (left shift by 6) → 0011111000000000
  the sign bit changes: OVERFLOW
The radix point is implied in these bit strings
Fixed-point Uses


Fixed-point notation is sometimes used
by simulating it with the integer notation
that microprocessors use (i.e. 2C)
This allows faster computations than
operations using floating-point notation
(intrinsically slower)
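
As an illustration of simulating fixed point with ordinary integers, the sketch below assumes a hypothetical format with 8 fractional bits; the names FRAC_BITS, to_fx, fx_add, fx_mul and to_float are made up for this example. Addition is a plain integer addition, while multiplication needs one rescaling shift.

```python
FRAC_BITS = 8                # assumed number of fractional bits
SCALE = 1 << FRAC_BITS       # 2**8: weight of the units place


def to_fx(x):
    """Convert to the integer that holds x in fixed point (truncating)."""
    return int(x * SCALE)


def fx_add(a, b):
    return a + b                 # same format: plain integer addition


def fx_mul(a, b):
    # The raw product has 2*FRAC_BITS fractional bits: shift back by FRAC_BITS
    return (a * b) >> FRAC_BITS


def to_float(a):
    return a / SCALE


a, b = to_fx(5.25), to_fx(2.5)
print(to_float(fx_add(a, b)))  # 7.75
print(to_float(fx_mul(a, b)))  # 13.125
```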
Fixed-point Problems

Suppose the following (unsigned) values
have to be coded using Fixed-point on a
total of 8 bits:
  37.25      → 100101.01
  12.625     → 1100.1010
  5.4375     → 101.01110
  1.2890625  → 1.0100101
All of them can be coded in 8 bits, but
no single position of the radix point is
suitable for all of them
Fixed-point Problems

Suppose you have to represent some
fractional values 0 ≤ x < 8 using a
Fixed-point coding 4+4 bits:
  7.2732
  2.3748
  5.4375
  1.2890
The first bit is always 0, and the
fractional part is rounded to 4 bits
If we could move the radix point 1
position to the left, we could have 1
more bit for precision

Fixed-point Problems

Suppose you have to represent some
values with fractional part x.0, x.5 or
x.25 only, using a Fixed-point coding
4+4 bits
The last two bits are always 00, and the
integer part is limited to 15
If we could move the radix point 2
positions to the right, we could have 2
more bits for the integer part (values
up to 63)
Fixed-point Problems

The problem with Fixed-point notation
is the fixed position of the radix point
To solve this problem, the radix point
must be made movable (floating); this
requires that its position be stored
along with each number
Exponential Notation

Exponential notation represents a
number as a value (mantissa or
significand) that multiplies an integer
power of the base (exponent)
Examples (in decimal), in the form mantissa·10^exponent:
  123.45678 = 0.12345678·10^3
  0.0087654321 = 0.87654321·10^-2
  87655678 = 0.87655678·10^8
Exponential Notation

Very big and very small values are
obtained by just varying the exponent
The same value can be expressed in
many forms:
  123.45 = 0.12345·10^3 = 12345·10^-2
Among these forms, the form 0.x (x≠0) is
chosen to have a unique representation
for values; this is called the
normalized form

Exponential Notation

When the number of digits is not enough
to store the whole number, only the most
important (leftmost) digits are stored
The most significant digits are thus
preserved, but approximation errors are
introduced because of truncation
Example (only 4 decimal digits):
  0.001234567 → 0.1234·10^-2 = 0.001234000
  876543 → 0.8765·10^6 = 876500
Exponential Notation

The maximum representation error with
n digits is 10^-n relative to the power of
the whole part
If the whole part power is m:
  ε = 10^-n · 10^m = 10^(m-n)
which is the power of the rightmost
digit (LSD)
Exponential Notation

Example
  Suppose the value has only 4 decimal digits
  876543 → 0.8765·10^6 (normalized)
  The whole part is 0·10^6 → m = 6
  ε = 10^-4 · 10^6 = 10^2 (maximum error)
  This can also be seen by writing the value
  as a sum of powers:
  0·10^6 + 8·10^5 + 7·10^4 + 6·10^3 + 5·10^2
  for this value, the error is:
  | 876543 – 876500 | = 43 (< 10^2)
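
The normalization, truncation and error bound ε = 10^(m-n) can be checked with Python's decimal module. This is a sketch for positive, non-zero values only; the function name normalize_truncate is an assumption of this example.

```python
from decimal import Decimal


def normalize_truncate(value, n):
    """Return (mantissa, m, max_error) for a positive value written as
    0.d1d2...dn * 10**m with d1 != 0, truncating the extra digits.

    Pass the value as a string so that it is taken exactly as written.
    """
    d = Decimal(value)
    m = d.adjusted() + 1                  # exponent of the form 0.xxx * 10**m
    digits = d.as_tuple().digits[:n]      # keep the n leftmost digits
    mantissa = Decimal((0, digits, -len(digits)))
    max_error = Decimal(10) ** (m - n)    # weight of the rightmost kept digit
    return mantissa, m, max_error


for text in ("876543", "0.001234567"):
    mantissa, m, max_error = normalize_truncate(text, 4)
    approx = mantissa * Decimal(10) ** m
    print(mantissa, m, Decimal(text) - approx < max_error)
```

For both inputs the last printed field is True: the truncation error stays below 10^(m-n).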
Exponential Notation

Example
  Suppose the value has only 4 decimal digits
  0.001234567 → 0.1234·10^-2 = 0.001234000
  The whole part is 0·10^-2 → m = –2
  ε = 10^-4 · 10^-2 = 10^-6 (maximum error)
  Writing the value as a sum of powers:
  0·10^-2 + 1·10^-3 + 2·10^-4 + 3·10^-5 + 4·10^-6
  for this value, the error is:
  | 0.001234567 – 0.001234000 | =
  = 0.000000567 (< 10^-6)

IEEE-P754 Floating-Point

The IEEE-P754 standard describes the
most common notations used by
computer FPUs (Floating-Point Units) to
compute floating-point values
The two binary exponential floating-point
notations described have the form
mantissa · 2^exponent and are:
  Single precision (SP)
  Double precision (DP)
IEEE-P754 Single Precision

Single precision values use 32 bits
divided into 3 parts:
  sign: 1 bit
  exponent field: 8 bits
  mantissa (or significand) field: 23 bits

  | s | exponent | mantissa |

The sign bit is defined as follows:
  0 is used for values ≥ 0
  1 is used for values ≤ 0 (negative zero!)
IEEE-P754 Single Precision


The mantissa (or significand ) is in the
normalized form 1.xxxxx, where the 1
before the radix point is the leftmost 1
(MSB) in the binary representation
Only the fractional part of the binary
mantissa is stored in the mantissa field:
the leftmost 1 is already known to be
present (called hidden bit ), this allows
for one more bit of precision (23 bits
stored + 1 hidden = 24 bits effective)
IEEE-P754 Single Precision

The exponent is a signed (relative) integer
value on 8 bits. The IEEE-P754 SP
standard does not use SM or 2C
notations, but a biased notation called
“excess 127”: the FP exponent field is
computed by adding the constant value 127
(bias constant) to the exponent of the
normalized value

IEEE-P754 Single Precision

Excess notation is efficient, especially
for number comparison
The offset value is 2^(n–1) – 1
(n is the number of bits), so that the
first half of the range represents
negative exponents
IEEE-P754 Single Precision

Example: +13.25 (base 10) → IEEE-P754 (SP)
  sign is positive: sign bit = 0
  convert the value to binary
    13.25 = 1101.01
  normalize the value (note: base 2)
    1101.01 = 1.10101·2^3
  compute the exponent field by adding 127 to
  the real base 2 exponent
    3+127 = 130 = 10000010
  compose the pieces, padding the mantissa with 0s
    0 10000010 10101000000000000000000
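
The conversion of +13.25 can be cross-checked with Python's standard struct module, which packs a float into the 32-bit single-precision layout. The helper name sp_fields is hypothetical.

```python
import struct


def sp_fields(x):
    """Split the single-precision encoding of x into the
    (sign, exponent field, mantissa field) bit strings."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))  # 32 bits as an integer
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]


print(sp_fields(13.25))
# ('0', '10000010', '10101000000000000000000')
```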
IEEE-P754 Single Precision

Example: convert from IEEE-P754 (SP)
1 01100000 01000000000000000000000
  sign bit = 1 → –
  extract the mantissa, add the hidden bit,
  and convert to decimal
    1.01 (base 2) = 1.25 (base 10)
  compute the real exponent by subtracting
  127 from the extracted exponent
    01100000 = 96 → 96–127 = –31
  compose the parts: –1.25·2^–31 = –5.82·10^–10
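
The decoding steps for a normalized value translate directly into code. The function name sp_decode is an assumption; special exponent fields (all 0s or all 1s) are deliberately not handled in this sketch.

```python
def sp_decode(sign, exponent, mantissa):
    """Decode a normalized single-precision value from its three bit fields."""
    e = int(exponent, 2) - 127               # remove the bias
    m = 1 + int(mantissa, 2) / (1 << 23)     # restore the hidden bit
    return (-1) ** int(sign) * m * 2.0 ** e


value = sp_decode("1", "01100000", "01000000000000000000000")
print(value)   # about -5.82e-10
```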
IEEE-P754 Single Precision

The SP decimal range is:
  ±(1.4·10^–45 … 3.4·10^+38)
The decimal exponent varies from –45
to +38. Normalized values use a binary
exponent from –126 to 127; the lower end
of the decimal range (10^–45) is reached
only by denormalized values
Values are approximated to about 7 decimal
digits (corresponding to the 24 bits used
by the mantissa)

IEEE-P754 Single Precision

The representation error is the
absolute weight of the LSB
This is computed by multiplying the
weight of the integer part (hidden bit)
times the relative weight of the mantissa
LSB (i.e. the weight of the LSB with
respect to the integer part)
This results in adding the exponents
  1.10010..1 · 2^0   → ε = 2^(0-23)  = 2^-23
  1.10010..1 · 2^5   → ε = 2^(5-23)  = 2^-18
  1.10010..1 · 2^94  → ε = 2^(94-23) = 2^71
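
The absolute weight of the single-precision LSB, 2^(exponent-23), can be computed for any normalized value with math.frexp. The function name sp_lsb_weight is an assumption of this sketch.

```python
import math


def sp_lsb_weight(x):
    """Weight of the mantissa LSB for a normalized single-precision value x."""
    _, e = math.frexp(abs(x))    # abs(x) = m * 2**e with 0.5 <= m < 1
    exponent = e - 1             # exponent of the normalized form 1.xxx
    return 2.0 ** (exponent - 23)


print(sp_lsb_weight(1.5) == 2.0 ** -23)              # True
print(sp_lsb_weight(1.5 * 2.0 ** 5) == 2.0 ** -18)   # True
print(sp_lsb_weight(1.5 * 2.0 ** 94) == 2.0 ** 71)   # True
```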
IEEE-P754 Single Precision


The binary exponent varies from –126
to 127, corresponding to excess 127
values from 1 to 254
Exponent values 00000000 (0) and
11111111 (255) are used for special
numbers:




Zeroes
Infinities
NaNs
Denormalized values
IEEE-P754 Single Precision

Zero
  Exponent=00000000, Mantissa=0
  0/1 00…00 00…00
  by definition, not by computation, because
  there is no 1 to use for normalization
  Positive and negative zero are considered
  equivalent
Infinity
  Exponent=11111111, Mantissa=0
  0/1 11…11 00…00
  Operations with infinities are well defined
IEEE-P754 Single Precision

Not a Number (NaN)
  Exponent=11111111, Mantissa≠0
  0/1 11…11 <not 00…00>
  NaNs are used to indicate values that do
  not represent real numbers
  There are 2 types of NaNs:
    Quiet NaNs: denote indeterminate operations
    (mantissa MSB set), the result of an operation
    is not mathematically defined
    Signalling NaNs: denote an invalid operation
    (mantissa MSB clear)

Special Operations

  N / INF    = 0
  INF · INF  = INF
  N / 0      = INF   (N ≠ 0)
  INF + INF  = INF
  0 / 0      = NaN
  INF – INF  = NaN
  INF / INF  = NaN
  INF · 0    = NaN
Any operation with NaN yields a NaN result
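
Most of these rules can be observed with Python floats (which are double precision, but follow the same special-value rules). One caveat of the language, not of the standard: Python raises ZeroDivisionError for N/0 and 0/0 instead of returning INF or NaN, so those two cases are left out of this sketch.

```python
import math

INF = float("inf")
NAN = float("nan")

print(5.0 / INF)               # 0.0
print(INF * INF)               # inf
print(INF + INF)               # inf
print(math.isnan(INF - INF))   # True
print(math.isnan(INF / INF))   # True
print(math.isnan(INF * 0.0))   # True
print(math.isnan(NAN + 1.0))   # True
```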
IEEE-P754 Single Precision

The IEEE-P754 standard allows values in
non-normalized form too (denormalized)
  Exponent=00000000, Mantissa≠0
  Hidden bit is now 0 and not 1
  The exponent value is considered –126
  Value is:
    0.mantissa · 2^–126
IEEE-P754 Double Precision

Double precision notation just extends
the SP notation to use 64 bits
The differences are:
  exponent bits: 11
  mantissa bits: 52
  bias constant: 1023
  exponent range: –1022 to +1023
  equivalent decimal range:
  ±(4.9·10^-324 … 1.7·10^+308)
  with 15 decimal digits
  denormalized exponent: –1022
IEEE-P754 Compact Notation

For ease of writing and copying,
floating-point numbers (like any other bit
sequence) can be translated to base 16
as if they were (they are not!) pure
binary numbers
  0 10000000 00100…00 → 40100000
  1 01111111 11000…00 → BFE00000
  C3C41000 → 110000111100010000010…0

IEEE-P754 Exercises

Convert the following values to/from
IEEE-P754:
  –1324.25 to SP and DP
  0.02324 to SP and DP with an absolute
  precision of 1/1000
  0 10000000 00100…00 to decimal
  1 01111111 11000…00 to decimal
  EB141000 to decimal
IEEE-P754 Exercises

Solutions
–1324.25
  10100101100.01 = 1.010010110001·2^10
  SP exponent field: 10+127 = 137 = 10001001
  DP exponent field: 10+1023 = 1033 = 10000001001
  then:
  SP: 1 10001001 0100101100010…0
  in compact form: C4A58800
  DP: 1 10000001001 0100101100010…0
  in compact form: C094B10000000000
IEEE-P754 Exercises

Solutions
0.02324
  ε = 1/1000 → n = 10 fractional bits (2^-10 < 1/1000)
  0.0000010111 = 1.0111·2^–6
  SP exponent field: –6+127 = 121 = 01111001
  DP exponent field: –6+1023 = 1017 = 01111111001
  then:
  SP: 0 01111001 01110…0
  in compact form: 3CB80000
  DP: 0 01111111001 01110…0
  in compact form: 3F97000000000000
IEEE-P754 Exercises

Solutions
0 10000000 00100…00
  → +1.001 (base 2)·2^(128-127) = 10.01 (base 2) = +2.25
1 01111111 11000…00
  → –1.11 (base 2)·2^(127-127) = –1.75
EB141000 = 1 11010110 001010000010…0
  → ≈ –1.00101 (base 2)·2^(214-127)
  (the lowest set mantissa bit is neglected)
  = –1.15625·2^87 = –1.15625·2^80·2^7
  ≈ –1·10^24·10^2 = –10^26 (approx.)
  the value of –1.15625·2^87 without the
  power-of-ten approximation is:
  –1.78921021302965117856514048·10^26

Floating-point Addition

To add two FP values, these must have
the same exponent before adding their
mantissas: the smaller value is converted
to have the same exponent as the
greater (it is de-normalized)
As the exponent is increased (e.g. by 3),
the mantissa must decrease (right shift by 3
bits) so as not to change the overall value
  1.01000·2^16 + 1.101000·2^13
  1.01000·2^16 + 0.001101·2^16
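
The alignment step can be sketched with integer mantissas. In this illustration (the function name fp_add and the representation are assumptions) each mantissa is an integer holding 1.xxx scaled by 2^23, only positive values are handled, and bits shifted out are simply truncated, with no rounding.

```python
def fp_add(m1, e1, m2, e2):
    """Add two positive values m * 2**(e - 23), where each mantissa m is a
    24-bit integer (hidden bit included) and e is the real exponent."""
    if e1 < e2:                      # make (m1, e1) the greater exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)                 # de-normalize: right shift the smaller
    m, e = m1 + m2, e1
    if m >= 1 << 24:                 # carry out of the hidden bit
        m >>= 1                      # renormalize to 1.xxx
        e += 1
    return m, e


a = 0b101 << 21      # 1.01  as a 24-bit mantissa
b = 0b1101 << 20     # 1.101 as a 24-bit mantissa
m, e = fp_add(a, 16, b, 13)
print(format(m, "024b"), e)   # 101110100000000000000000 16
```

The result reads 1.011101·2^16, i.e. 1.01000 + 0.001101 with the exponent of the greater operand.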
Underflow


If the conversion of the smaller value
shifts away all of the mantissa bits
(including the hidden bit), the value is
approximated to 0, thus the operation
result is equal to the greater while the
smaller is just ignored
There is an underflow condition when,
adding 2 values, the result is equal to
the greater of them

Underflow

Example in SP
  1.101·2^43 + 1.01·2^18
  1.01·2^18 must be converted to the form
  xxx·2^43; this causes a right shift of 25
  bits on the mantissa, thus shifting away
  all the 24 mantissa bits and resulting in 0
When adding up many small values, a partial
sum may become so big that it causes
underflow for each of the subsequent
values (only the first part of
the values is added up)
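
The SP example can be reproduced by rounding Python floats (double precision) to single precision through the struct module. The helper name to_sp is hypothetical; note that struct rounds to the nearest SP value rather than truncating, which gives the same result here.

```python
import struct


def to_sp(x):
    """Round a Python float (double precision) to the nearest SP value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


big = to_sp(1.625 * 2.0 ** 43)     # 1.101 (base 2) * 2**43
small = to_sp(1.25 * 2.0 ** 18)    # 1.01  (base 2) * 2**18
print(to_sp(big + small) == big)   # True: the smaller value is lost
```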
IEEE-P754 Exercises

Calculate the following operations
(IEEE-P754) and express the result in
the same compact form; identify any
Overflow/Underflow:
  2B1A5F20 + 4F1A3BB0
  C4A58000 + C2B80000
  63AB102F – 709B1BC2
  7F600000 + 7F100000

IEEE-P754 Exercises

Solution N.1:
  2B1A5F20 →
  0 01010110 00110100101111100100000
  E=01010110=86
  4F1A3BB0 →
  0 10011110 00110100011101110110000
  E=10011110=158
  Difference of exponents = 72
  72 > 24 → UNDERFLOW
  Result: 4F1A3BB0
IEEE-P754 Exercises

Solution N.2:
  C4A58000 →
  1 10001001 01001011000000000000000
  E=10001001=137 (non biased: 10)
  M=1.01001011
  C2B80000 →
  1 10000101 01110000000000000000000
  E=10000101=133 (non biased: 6)
  M=1.0111
  Difference of exponents: 137 – 133 = 4
IEEE-P754 Exercises

Solution N.2 (continuation):
  De-normalize the mantissa of the 2nd value to
  have exponent=10 (4 right shifts):
  0.00010111
  Addition:
    – 1.01001011 ·2^10 +
    – 0.00010111 ·2^10 =
    – 1.01100010 ·2^10
  Result:
  1 10001001 01100010000000000000000
  C4B10000
IEEE-P754 Exercises

Solution N.3
  63AB102F – 709B1BC2
  63AB102F →
  0 11000111 01010110001000000101111
  E=199
  709B1BC2 →
  0 11100001 00110110001101111000010
  E=225
  Difference of exponents: 225 – 199 = 26
  26 > 24 → UNDERFLOW
  Result: F09B1BC2 (SIGN CHANGED!)

IEEE-P754 Exercises

Solution N.4
  7F600000 →
  0 11111110 11000000000000000000000
  E=254 (non biased=127)
  7F100000 →
  0 11111110 00100000000000000000000
  E=254 (non biased=127)
  Difference of exponents: 0
IEEE-P754 Exercises

Solution N.4 (continuation)
   1.110 ·2^127 +
   1.001 ·2^127 =
  10.111 ·2^127
  Renormalization: 1.0111·2^128
  Max exponent is 127 → OVERFLOW
  Result: (+Infinity)
  0 11111111 00000000000000000000000
  7F800000
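
The four exercises can be cross-checked in Python by decoding the compact (hexadecimal) form, operating, and re-encoding. The helper names are hypothetical. Two assumptions are worth stating: the arithmetic is done in double precision and then rounded to SP (round to nearest, not truncation), and struct raises OverflowError for results too large for SP, which this sketch maps to the infinity patterns.

```python
import struct


def hex_to_sp(h):
    """Interpret 8 hex digits as a single-precision value."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


def sp_to_hex(x):
    """Encode x in single precision and return the compact form."""
    try:
        return struct.pack(">f", x).hex().upper()
    except OverflowError:            # magnitude beyond the SP range
        return "FF800000" if x < 0 else "7F800000"


def sp_add_hex(a, b):
    return sp_to_hex(hex_to_sp(a) + hex_to_sp(b))


def sp_sub_hex(a, b):
    return sp_to_hex(hex_to_sp(a) - hex_to_sp(b))


print(sp_add_hex("2B1A5F20", "4F1A3BB0"))  # 4F1A3BB0
print(sp_add_hex("C4A58000", "C2B80000"))  # C4B10000
print(sp_sub_hex("63AB102F", "709B1BC2"))  # F09B1BC2
print(sp_add_hex("7F600000", "7F100000"))  # 7F800000
```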
IEEE-P754 Exercises

Calculate in the IEEE-P754 SP format
the following operation with DECIMAL
numbers; identify any
Overflow/Underflow:
  920000000 – 920000001 (base 10)

Solution:
  The values differ only in the last digit
  The two numbers have 9 decimal digits and
  require 30 bits (2^29 < 920000000 < 2^30)
  After normalization, the relative weight of
  the last bit is 2^-29
  Having only 24 bits (weights down to 2^-23),
  the lower powers are discarded
  The two values are considered equal
  Result is 0

IEEE-P754 SP Ranges

Maximum normalized positive number is
1.111…111·2^127 with 23 fractional bits
  If all the bits were available, the value would
  be 1.111…111·2^127 with 127 fractional bits:
  1.111…111·2^127 = 2^128 – 1
  Having just 23 fractional bits, the value is
  1.11…11 00…00 · 2^127
  with 23 fractional bits set to 1 and the
  rightmost 127–23 = 104 bits set to 0
  104 bits set to 1 have value 2^104 – 1
IEEE-P754 SP Ranges

Maximum normalized positive value:
  1.11…11 00…00 · 2^127 =
  (2^128 – 1) – (2^104 – 1) = 2^128 – 2^104
  3.4028234663852885981170418348452e+38

IEEE-P754 SP Ranges

Minimum normalized positive number:
  1.000…000·2^-126
  1.1754943508222875079687365372222e–38
Maximum denormalized positive number
is 0.111…111·2^-126 with 23 fractional bits
  the rightmost bit power is: –126–23 = –149
  value: 2^-126 – 2^-149
Minimum denormalized positive number
is 0.000…001·2^-126 with 23 fractional bits
  the rightmost bit power is: –126–23 = –149
  value: 2^-149
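
The four limits can be verified by decoding the corresponding bit patterns. The hexadecimal patterns below are not given in the slides: they are derived here from the field definitions (for example, exponent field 11111110 with an all-ones mantissa gives 7F7FFFFF). The helper name hex_to_sp is hypothetical.

```python
import struct


def hex_to_sp(h):
    """Interpret 8 hex digits as a single-precision value."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


max_normal = hex_to_sp("7F7FFFFF")     # 0 11111110 11…11
min_normal = hex_to_sp("00800000")     # 0 00000001 00…00
max_denormal = hex_to_sp("007FFFFF")   # 0 00000000 11…11
min_denormal = hex_to_sp("00000001")   # 0 00000000 00…01

print(max_normal == 2.0 ** 128 - 2.0 ** 104)      # True
print(min_normal == 2.0 ** -126)                  # True
print(max_denormal == 2.0 ** -126 - 2.0 ** -149)  # True
print(min_denormal == 2.0 ** -149)                # True
```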
IEEE-P754 Puzzles

Determine the difference between value
44A58800 and the next one (44A58801)
  Value in binary is:
  0 10001001 0100101100010…0 =
  1.0100101100010…0·2^10
  The next one differs in just the LSB:
  1.0100101100010…1·2^10
  Difference is 1·LSB weight = 2^(10-23) = 2^-13

IEEE-P754 Puzzles

Determine the range of the consecutive
integer values in SP.
  Values are in the form 1.xx…xx with 23
  fractional bits (denormals are not integers)
  24 bits (hidden bit included) result in 2^24
  combinations of bits (0 to 2^24–1); each
  corresponds to a value, and an appropriate
  exponent makes it an integer value
  2^24 is represented too
  Range: –2^24 … +2^24

IEEE-P754 Puzzles

Determine the (absolute) representation
error for value N=6·10^18 in IEEE-P754 SP.
  N = 6·10^18 ≈ 6·2^60 → requires 63 bits
  N = 1.xxx·2^62
  In SP there are only 23 bits for the mantissa
  The absolute weight of the LSB is 2^(62-23) = 2^39
  The representation error is 2^39
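
These puzzle results (the 2^24 limit of consecutive integers, the 2^-13 step after 44A58800 and the 2^39 step near 6·10^18) can be checked with a short sketch. The helper names to_sp and next_sp are assumptions; next_sp simply adds 1 to the bit pattern, which is valid only for positive finite values.

```python
import struct


def to_sp(x):
    """Round a Python float (double precision) to the nearest SP value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def next_sp(x):
    """Next SP value above a positive SP value x (bit pattern + 1)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", word + 1))[0]


limit = 2.0 ** 24
print(to_sp(limit - 1) == limit - 1)   # True
print(to_sp(limit) == limit)           # True
print(to_sp(limit + 1) == limit + 1)   # False: 2**24 + 1 is not representable

x = struct.unpack(">f", bytes.fromhex("44A58800"))[0]
print(next_sp(x) - x == 2.0 ** -13)    # True

n = to_sp(6e18)
print(next_sp(n) - n == 2.0 ** 39)     # True
```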