# Floating-Point Numbers

```CENG536
Computer Engineering department
Çankaya University
The problem with fixed-point representation is illustrated by
the following examples:
The relative representation error due to truncation is quite
significant for x while it is much less severe for y. On the
other hand, both x^2 and y^2 are unrepresentable, because their
computations lead to underflow (number too small) and
overflow (too large), respectively.
CENG 536 - Spring 2012-2013 Dr. Yuriy ALYEKSYEYENKOV
These numbers can be represented as
The exponent -5 or +7 essentially indicates the direction and
amount by which the radix-point must be moved to produce
the corresponding fixed-point representation shown above.
Hence, the designation is “floating-point numbers”.
A floating-point number has four components: the sign, the
significand (mantissa) s, the exponent base b, and the
exponent e. The exponent base is usually a power of two,
except for decimal arithmetic, where it is 10.
A typical floating-point format. A key point to observe is
that two signs are involved in a floating-point number.
The use of biased exponent format has virtually no effect on
the speed or cost of exponent arithmetic (addition /
subtraction), given the small number of bits involved. It does,
however, facilitate zero detection (zero can be represented
with the smallest biased exponent of 0 and an all-zero
significand) and magnitude comparison (we can compare
normalized floating-point numbers as if they were integers).
The range of values in a floating-point number
representation is composed of the intervals [-max, -min]
and [min, max]:
Number distribution pattern and subranges in representations:
There are three special or singular values: -∞, 0, and +∞. Zero
is special because it cannot be represented with a normalized
mantissa (significand).
Overflow occurs when a result is less than –max or greater
than +max.
Underflow, on the other hand, occurs for results in the range
(–min, 0) or (0, min).
The equation x = ± s × b^e
for the value of a floating-point number suggests that the
range [- max, max] increases if we choose a larger exponent
base b. A larger b also simplifies arithmetic operations on
the exponents, since for the given range, smaller exponents
must be dealt with. However, if the significand is to be kept
in normalized form, effective precision decreases for larger
b. In the past, machines with b = 2, 8, 16, or 256 were built.
The exponent sign is almost always encoded in a biased
format. As for the sign of a floating-point number, alternatives
to the currently dominant signed-magnitude format include
the use of 1’s- or 2’s-complement representation. Several
variations have been tried in the past, including the
complementation of the significand part only and the
complementation of the entire number (including the
exponent part) when the number to be represented is
negative.
The two representation formats in the IEEE standard for binary
floating-point numbers (ANSI/IEEE Std 754-1985) are
depicted:
The standard defines extended formats that allow
implementations to carry higher precision internally
to reduce the effect of accumulated errors. Two
extended formats are defined:
Value = N = (-1)^S × 2^(E-127) × (1.M)
The decimal number 0.75 (base 10) is to be represented in the IEEE 754
single precision format:
0.75 (base 10) = 0.11 (base 2)     (converted to a binary number)
               = 1.1 × 2^(-1)      (normalized binary number; the leading 1 is the hidden bit)
The mantissa is positive, so the sign S is given by S = 0.
The biased exponent E is given by E = e + 127:
E = -1 + 127 = 126 (base 10) = 0111 1110 (base 2)
Fractional part of mantissa: M = .1000...000 (in 23 bits)
The IEEE 754 single precision representation is given by:
Field     Sign     Exponent     Mantissa
Bits      31       30-23        22-0
Width     1 bit    8 bits       23 bits
Value     0        0111 1110    100 0000 0000 0000 0000 0000
The decimal number –2345.125 (base 10) is to be represented in the
IEEE 754 single precision format:
–2345.125 (base 10) = –1001 0010 1001.001 (base 2)           (converted to binary)
                    = –1.0010 0101 0010 01 (base 2) × 2^11   (normalized binary; the leading 1 is the hidden bit)
The mantissa is negative, so the sign S is given by S = 1.
The biased exponent E is given by E = e + 127:
E = 11 + 127 = 138 (base 10) = 1000 1010 (base 2)
Fractional part of mantissa:
M = .0010 0101 0010 0100 ... 000 (in 23 bits)
The IEEE 754 single precision representation is given by:
Field     Sign     Exponent     Mantissa
Bits      31       30-23        22-0
Width     1 bit    8 bits       23 bits
Value     1        1000 1010    001 0010 1001 0010 0000 0000
Basic arithmetic on floating-point numbers is
conceptually simple. However, care must be taken in hardware
implementation to ensure correctness and avoid undue
loss of precision; in addition, it must be possible to handle any
exceptions.
Addition and subtraction are the most difficult of the
elementary operations for floating-point operands. Here, we
deal only with addition, since subtraction can be converted to
addition by flipping the sign of the subtrahend.
Assuming e1 ≥ e2, we begin by aligning the two operands
through right-shifting of the significand (mantissa) s2 of the
number with the smaller exponent.
If the exponent base b and the number representation radix
(base) are the same, we simply shift s2 to the right by e1 – e2
digits. When b = r^a, the shift amount, which is computed
through direct subtraction of the biased exponents, is multiplied
by a. In either case, this step is referred to as alignment shift,
or preshift, (in contrast to normalization shift or postshift
which is needed when the resulting significand s is
unnormalized).
We then perform addition as follows:
(±s1 × b^e1) + (±s2 × b^e2) = (±s1 ± s2 / b^(e1–e2)) × b^e1 = ±s × b^e
Floating-point multiplication is simpler than floating-point
addition; it is performed by multiplying the significands and
adding the exponents:
(±s1 × b^e1) × (±s2 × b^e2) = ±(s1 × s2) × b^(e1+e2)
Postshifting may be needed, since the product s1 × s2 of the
two significands can be unnormalized. For example, with binary
significands in the range [1, 2), we have 1 ≤ s1 × s2 < 4,
leading to the possible need for a single-bit right shift. Also,
the computed exponent needs adjustment
if the exponents are biased or if a normalization shift is
performed. Overflow/underflow is possible during
multiplication if e1 and e2 have like signs. Overflow is also
possible due to normalization.
Similarly, floating-point division is performed by dividing the
significands and subtracting the exponents:
(±s1 × b^e1) / (±s2 × b^e2) = ±(s1 / s2) × b^(e1–e2)
Here, problems to be dealt with are similar to those of
multiplication. The ratio s1 / s2 of the significands may have
to be normalized. For example, with binary significands in the
range [1, 2), we have 1/2 < s1 / s2 < 2,
and a single-bit left shift is always adequate. The computed
exponent needs adjustment if the exponents are biased or if a
normalizing shift is performed. Overflow/underflow is
possible during division if e1 and e2 have unlike signs.
Underflow due to normalization is also possible.
To extract the square root of a positive floating-point number,
we first make its exponent even. This may require subtracting
1 from the exponent and multiplying the significand by b. We
then use the following:
sqrt(s × b^e) = sqrt(s) × b^(e/2)
In the case of IEEE floating-point numbers, the adjusted
significand will be in the range 1 ≤ s < 4, which leads directly
to a normalized significand for the result. Square-rooting
never produces overflow or underflow.
```
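
The two single-precision worked examples above (0.75 and –2345.125) can be checked mechanically. The following is a minimal sketch, assuming Python's standard `struct` module is used to obtain the 32-bit pattern. The helper names `float32_fields` and `decode_normalized` are illustrative and do not come from the slides.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (S, E, M) for x stored in IEEE 754 single precision.

    S is the sign bit, E the 8-bit biased exponent, and M the 23-bit
    fraction (the hidden leading 1 is not stored).
    """
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction


def decode_normalized(sign: int, biased_exponent: int, fraction: int) -> float:
    """Evaluate (-1)^S * 2^(E-127) * (1.M) for a normalized number."""
    return (-1) ** sign * 2.0 ** (biased_exponent - 127) * (1 + fraction / 2**23)


for value in (0.75, -2345.125):
    s, e, m = float32_fields(value)
    print(f"{value}: S={s} E={e:08b} M={m:023b}")
    # Both example values are exactly representable, so decoding is exact.
    assert decode_normalized(s, e, m) == value
```

The decoder handles only normalized numbers; zero, subnormals, infinities, and NaN use reserved exponent values and are deliberately left out of this sketch.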
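
The addition procedure described above (alignment shift, significand addition, normalization shift) can be sketched in a few lines. This is an illustrative model only: it keeps signed significands as exact `Fraction` values, so no rounding, exponent bias, or overflow/underflow handling takes place, all of which a hardware implementation must deal with. The function name `fp_add` and its argument layout are assumptions made for this example.

```python
from fractions import Fraction


def fp_add(s1, e1, s2, e2, base=2):
    """Add s1 * base**e1 and s2 * base**e2; return a normalized (s, e).

    Significands are signed and exact; the result satisfies
    1 <= |s| < base, or is (0, 0) when the sum is zero.
    """
    # Make operand 1 the one with the larger exponent (e1 >= e2).
    if e1 < e2:
        s1, e1, s2, e2 = s2, e2, s1, e1

    # Alignment shift (preshift): move s2 right by e1 - e2 digits.
    s2_aligned = Fraction(s2) / base ** (e1 - e2)

    # Add the aligned significands; the result takes the larger exponent.
    s = Fraction(s1) + s2_aligned
    e = e1
    if s == 0:
        return Fraction(0), 0

    # Normalization shift (postshift): right shift after a carry-out,
    # left shift after cancellation of leading digits.
    while abs(s) >= base:
        s /= base
        e += 1
    while abs(s) < 1:
        s *= base
        e -= 1
    return s, e


# 1.5 * 2**3 + 1.5 * 2**1 = 12 + 3 = 15 = 1.875 * 2**3
print(fp_add(Fraction(3, 2), 3, Fraction(3, 2), 1))
```

Using exact rationals keeps the three steps visible without committing to a particular significand width or rounding mode.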