# 11_FP - pantherFILE ```Data Representation: Floating Point for
Real Numbers
Computer Organization and Assembly Language: Module 11
Floating Point Representation


The IEEE-754 Floating Point Standard is a widely used floating
point representation, one of many alternative formats.
The representation of a floating point number contains:

- a mantissa (a variant of a scaled, sign-magnitude integer)
- an exponent (in single precision, an 8-bit, biased-127 integer)

In this way, floating point representation resembles scientific notation.

Any positive number N can be represented as M*10^e, where

e = floor(log10(N))
M = N/(10^e)
1 ≤ M < 10
Floating Point Representation
- A number N represented in floating point is determined by
  its mantissa m, its exponent e, and its sign s:
  N = (-1)^s * m * 2^e
  If the sign is negative, s = 1. If the sign is positive, s = 0.
- The mantissa is normalized, i.e., 1 ≤ m < 2.
- In the IEEE-754 single precision format, the mantissa is
  represented with 23 bits (only the fractional part is stored):
  m = 1.f_22 f_21 … f_1 f_0
- Double precision floating point works the same way, but the
  bit fields are larger: a 1-bit sign, an 11-bit exponent, and 52 bits
  for the fractional part of the mantissa.
Conversion to base-2
1. Break the decimal number into two parts: an
   integer and a fraction.
2. Convert the integer into binary and place it to the
   left of the binary point.
3. Convert the fraction into binary and place it to the
   right of the binary point.
4. Write it in base-2 scientific notation and normalize.
Example
Convert 22.625 to floating point representation
1. Convert 22 to binary: 22_10 = 10110_2
2. Convert .625 to binary:
   2*.625 = 1 + .25
   2*.25  = 0 + .5
   2*.5   = 1 + 0
   .625_10 = .101_2
3. Thus 22.625_10 = 10110.101_2
4. In base-2 scientific notation: 10110.101 * 2^0
   Normalized form: 1.0110101 * 2^4
IEEE-754 SPFP Representation
- Given the floating point representation
  N = (-1)^s * m * 2^e
  where m = 1.f_22 f_21 … f_1 f_0
- we can convert it to the IEEE-754 SPFP format
  using the relations:
  F = (m-1) * 2^23 (hence F is an integer)
  E = e + 127
  S = s
Single-Precision Floating Point
The IEEE-754 single precision format has 32 bits
distributed as

Field:  S   E   F
Bits:   1   8   23

- 0 ≤ E ≤ 255, thus the actual exponent e (interpreted
  as biased-127) is restricted so that -127 ≤ e ≤ 128
- But e = -127 and e = 128 have special meanings
Special values and the hidden bit
- In IEEE-754, zero is represented by setting E = F
  = 0 regardless of the sign bit, thus there are two
  representations for zero: +0 and -0.
- +∞ is represented by S=0, E=255, F=0
- -∞ is represented by S=1, E=255, F=0
- NaN (Not-a-Number) is represented by E=255, F≠0
  (may result from 0 divided by 0)
- The leading 1 of the mantissa is not represented. It is
  the hidden bit.
Converting to IEEE-754 SPFP
1. Convert into a normalized base-2 representation.
2. Bias the exponent. The result will be E.
3. Put the values into the correct fields. Note that only
   the fractional part of the mantissa is stored in F.
Example
Convert 22.625 to IEEE-754 SPFP format
1. In scientific notation: 10110.101 * 2^0
   Normalized form: 1.0110101 * 2^4
2. Bias the exponent: 4 + 127 = 131
   131_10 = 10000011_2
3. Place into the correct fields.
   S = 0
   E = 10000011
   F = 011 0101 0000 0000 0000 0000

S   E          F
0   10000011   01101010000000000000000
Example
Convert 17.15 to IEEE-754 SPFP format
17.15_10 ≈ 10001.0010 0110 0110 0110 011 * 2^0
(the binary fraction repeats; it is shown to 24 significant bits)
1. Normalized form:
   1.0001 0010 0110 0110 0110 011 * 2^4
2. Bias the exponent: 4 + 127 = 131
   131_10 = 10000011_2
3. Place into the correct fields.
   S = 0
   E = 10000011
   F = 000 1001 0011 0011 0011 0011

S   E          F
0   10000011   00010010011001100110011
Example
Convert -83.7 to IEEE-754 SPFP format (single
precision)
Convert .7 to binary:
2*.7 = 1 + .4
2*.4 = 0 + .8
2*.8 = 1 + .6
2*.6 = 1 + .2
2*.2 = 0 + .4
2*.4 = 0 + .8
2*.8 = 1 + .6
2*.6 = 1 + .2
2*.2 = 0 + .4
. . .
-83.7_10 ≈ -1010011.101100110..._2
1. In binary scientific notation:
   -1010011.10110011001100110 * 2^0
   Normalized: -1.01001110110011001100110 * 2^6
2. Bias the exponent: 6 + 127 = 133
   133_10 = 10000101_2
3. Place into the correct fields.
   S = 1
   E = 10000101
   F = 01001110110011001100110

S   E          F
1   10000101   01001110110011001100110
- It is difficult for people to read binary:
  one bit pattern looks much like another.
- Raw data, which is not being interpreted as
  representing a particular data type, is often
  written in hexadecimal.
- The final step in many IEEE-754 SPFP problems
  will be to convert the result to hexadecimal:
  1100 0010 1010 0111 0110 0110 0110 0110_2 = C2A76666_16
Graceful underflow
- Given a single precision floating point number with
  bit fields S, E, and F (interpreted as unsigned
  integers), the value of the number is normally
  calculated as
  N = (-1)^S * (1 + F/2^23) * 2^(E-127)
- This interpretation is not used when
  - E = 255 (+∞, -∞, or NaN)
  - E = 0, F = 0 (+0 or -0)
- What about E = 0, F ≠ 0?
Graceful underflow
- Given a single precision floating point number with
  bit fields S, E = 0, and F (interpreted as unsigned
  integers), the value of the number is calculated as
  N = (-1)^S * (0 + F/2^23) * 2^(-126)
- This allows representation of numbers as small as
  2^-149, though each factor of two below 2^-126
  results in the loss of one bit of precision.
Graceful underflow
0 00000001 00000000000000000000000
- Normal interpretation: N = 2^(1 - 127) = 2^-126
- 24 bits of precision (counting the hidden bit)

0 00000000 10000000000000000000000
- E = 0 interpretation: N = 2^-126 * (.1_2) = 2^-126 * (.5) = 2^-127
- Only 23 bits of precision

0 00000000 00010000000000000000000
- E = 0 interpretation: N = 2^-126 * (.0001_2) = 2^-126 * (.0625) = 2^-130
- Only 20 bits of precision
```
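
The field relations and the E = 0 interpretation described in this module can be checked with a short Python sketch. This is an illustration and not part of the original slides: the function names `float_to_fields`, `fields_to_hex`, and `fields_to_value` are hypothetical helpers, and the sketch assumes that Python's standard `struct` module is used to obtain the 32-bit single precision pattern (which rounds to the nearest representable value).

```python
import struct


def float_to_fields(x):
    """Return the (S, E, F) bit fields of x stored as IEEE-754 single precision."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1-bit sign
    e = (bits >> 23) & 0xFF   # 8-bit biased exponent
    f = bits & 0x7FFFFF       # 23-bit fraction (hidden bit not stored)
    return s, e, f


def fields_to_hex(s, e, f):
    """Reassemble the three fields and format the 32-bit pattern as 8 hex digits."""
    return f"{(s << 31) | (e << 23) | f:08X}"


def fields_to_value(s, e, f):
    """Interpret (S, E, F) using the formulas from the slides."""
    sign = -1.0 if s else 1.0
    if e == 255:
        # F = 0 encodes an infinity; F != 0 encodes NaN.
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:
        # Graceful underflow: no hidden bit, exponent fixed at -126 (also covers +0 and -0).
        return sign * (f / 2**23) * 2.0**-126
    # Normal case: hidden leading 1, exponent biased by 127.
    return sign * (1 + f / 2**23) * 2.0**(e - 127)


if __name__ == "__main__":
    for x in (22.625, 17.15, -83.7):
        s, e, f = float_to_fields(x)
        print(f"{x}: S={s} E={e:08b} F={f:023b} hex={fields_to_hex(s, e, f)}")
```

Running the sketch on the three worked examples gives a quick way to compare against the hand conversions. Calling `fields_to_value(0, 0, 1 << 22)` exercises the E = 0 case for the pattern `0 00000000 10000000000000000000000`.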