Lecture 4: Data Representation Floating Point Representation

Representing Floating Point Numbers
● A fixed point notation (e.g. two's complement) allows a range of positive and negative integers centered around 0 to be represented
● By assuming a fixed binary or radix point, this format would allow numbers with a fractional component to be represented
● But this approach has limitations
  – very large numbers or very small fractions cannot be represented
● Floating point representation allows us to represent very large numbers and very small fractions
● Computers are equipped with specialised hardware that performs floating point arithmetic (FPU: Floating Point Unit)
Scientific Notation
● Very large or very small numbers can be represented using scientific notation
      n = ± s x b^e
● defined by
  – ± the sign of the number
  – s the significand or mantissa
  – e the exponent
  – b the base
● examples
  – 0.00000000000000034 = 3.4 x 10^-16
  – 976 000 000 000 000 = 9.76 x 10^14
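As a quick check of the two examples, Python's built-in `e` format type prints a number as a base-10 significand and exponent. This is only a minimal sketch; the helper name `scientific` is made up for illustration.

```python
def scientific(value, digits):
    """Format value as s x 10^e, keeping `digits` digits after the point."""
    return format(value, f".{digits}e")


print(scientific(0.00000000000000034, 1))   # significand 3.4, exponent -16
print(scientific(976_000_000_000_000, 2))   # significand 9.76, exponent 14
```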
Fixed Width
● Assume that
  – s has three digits (base 10) and 0.1 <= |s| <= 0.999
  – e has two digits (base 10)
  – and both have a sign
● We can represent numbers between -0.999 x 10^99 and 0.999 x 10^99
  – with a magnitude that ranges from 0.100 x 10^-99 to 0.999 x 10^99
  – with just 5 digits and two signs!
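The limits of this decimal format can be worked out exactly with rational arithmetic. The sketch below assumes the significand runs from 0.100 to 0.999 and the exponent from -99 to 99, as stated above; the function name `fixed_width_summary` and the value count it returns are illustrative additions, not part of the lecture.

```python
from fractions import Fraction


def fixed_width_summary(sig_digits=3, exp_digits=2):
    """Return (smallest magnitude, largest magnitude, count of non-zero values)."""
    max_exp = 10 ** exp_digits - 1                                    # 99
    smallest_s = Fraction(10 ** (sig_digits - 1), 10 ** sig_digits)   # 0.100
    largest_s = Fraction(10 ** sig_digits - 1, 10 ** sig_digits)      # 0.999
    smallest = smallest_s / 10 ** max_exp                             # 0.100 x 10^-99
    largest = largest_s * 10 ** max_exp                               # 0.999 x 10^99
    significands = 10 ** sig_digits - 10 ** (sig_digits - 1)          # 100..999
    exponents = 2 * max_exp + 1                                       # -99..99
    count = 2 * significands * exponents                              # both signs
    return smallest, largest, count


smallest, largest, count = fixed_width_summary()
print(smallest == Fraction(1, 10 ** 100), largest == 999 * 10 ** 96, count)
```

Under these assumptions five digits and two signs give 2 x 900 x 199 distinct non-zero values, spread over roughly 200 orders of magnitude.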
Floating Point Representation
    | 1 bit | 8 bits   | 23 bits                |
    | S     | Exponent | Significand / mantissa |

● The left-most bit stores the sign of the number (0 = positive, 1 = negative)
● The exponent is stored in the next 8 bits using excess or biased notation
  – a fixed value (usually 2^(k-1) - 1, or 2^(k-1)), known as the bias, is subtracted from the field to obtain the true exponent value (where k is the number of bits in the exponent)
● The significand is stored in the remaining 23 bits
Excess or Biased Representation for n bits
● To convert a decimal number to n-bit excess or biased representation
  – add the bias (2^(n-1) or 2^(n-1) - 1) to the decimal number
  – convert the result to (n-bit) binary
● To convert n-bit excess or biased values to decimal
  – convert the n-bit excess value to its decimal equivalent
  – subtract the bias (2^(n-1) or 2^(n-1) - 1) from the result
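The two conversion procedures can be sketched in a few lines of Python. The function names `to_excess` and `from_excess` are hypothetical, and the default bias of 2^(n-1) - 1 is an assumption; pass `bias=2 ** (n - 1)` for the other convention.

```python
from typing import Optional


def to_excess(value: int, n: int, bias: Optional[int] = None) -> str:
    """Encode a signed integer as an n-bit excess (biased) bit string."""
    if bias is None:
        bias = 2 ** (n - 1) - 1          # assumed default bias
    stored = value + bias                # step 1: add the bias
    if not 0 <= stored < 2 ** n:
        raise ValueError("value does not fit in the requested width")
    return format(stored, f"0{n}b")      # step 2: convert to n-bit binary


def from_excess(bits: str, bias: Optional[int] = None) -> int:
    """Decode an excess (biased) bit string back to a signed integer."""
    if bias is None:
        bias = 2 ** (len(bits) - 1) - 1  # assumed default bias
    return int(bits, 2) - bias           # convert to decimal, subtract the bias


print(to_excess(-10, 8))          # 8-bit pattern for -10 with bias 127
print(from_excess("10010011"))    # true exponent stored in 10010011
```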
Excess or Biased Representation for 4 bits
● The sign of the exponent is encapsulated in the left-most bit of the stored pattern (1 for positive numbers, 0 for negative numbers)
● For k = 4 the two usual biases are 2^(k-1) = 8 and 2^(k-1) - 1 = 7; zero is stored as 1000 with the first and as 0111 with the second

    Binary   Unsigned decimal   Bias 2^(k-1)   Bias 2^(k-1) - 1
    0000      0                 -8             -7
    0001      1                 -7             -6
    0010      2                 -6             -5
    0011      3                 -5             -4
    0100      4                 -4             -3
    0101      5                 -3             -2
    0110      6                 -2             -1
    0111      7                 -1              0
    1000      8                  0              1
    1001      9                  1              2
    1010     10                  2              3
    1011     11                  3              4
    1100     12                  4              5
    1101     13                  5              6
    1110     14                  6              7
    1111     15                  7              8
Normalisation
● Floating point numbers are normalised in order to simplify operations
  – a normalised number is one in which the most significant digit of the significand is nonzero (i.e. 1 for base two)
  – the typical convention is that there is one bit to the left of the radix point
        ± 1.bbb...b x 2^(±e)
  – where b is either binary digit (0 or 1)
  – because the most significant bit is always one, it is unnecessary to store this bit (this bit is implicit)
● a 23-bit field can therefore store a 24-bit significand with a value in the half open interval [1, 2)
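Normalisation can be illustrated with Python's standard `math.frexp`, which splits a float into a mantissa in [0.5, 1) and a power of two. The sketch below shifts that result by one binary place to get the 1.bbb...b convention used here; the function name `normalise` is made up for this example.

```python
import math


def normalise(x):
    """Return (sign, significand, exponent) with 1 <= significand < 2."""
    if x == 0:
        raise ValueError("zero has no normalised form")
    sign = -1 if x < 0 else 1
    m, e = math.frexp(abs(x))    # abs(x) == m * 2**e with 0.5 <= m < 1
    return sign, m * 2, e - 1    # move the radix point one place to the right


print(normalise(6.375))      # 6.375 = 1.59375 x 2^2
print(normalise(-0.15625))   # -0.15625 = -1.25 x 2^-3
```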
Examples
Example – Converting Floating Point to Decimal
● What is the decimal value of the following floating point number? Assume the bias is 2^(n-1) - 1 (for 8 bits, 2^7 - 1 = 127)
      0 10010011 10100000000000000000000
  – Determine the sign: the sign bit is 0 so the number is positive
  – Calculate the exponent
    ● Convert the exponent to decimal: 10010011_excess = 2^7 + 2^4 + 2^1 + 2^0 = 147
    ● Subtract the bias: 147 - 127 = 20, the exponent is 20
  – Add the implicit 1. bit in front of the significand: 1.101
  – Convert the result 1.101_2 x 2^20 to decimal
  – 1.101_2 x 2^20 = 1.625 x 2^20 = 1,703,936
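The decoding steps above can be followed mechanically. This is a minimal sketch that assumes the simple format of this lecture (1 sign bit, 8 exponent bits, 23 significand bits, bias 127, every pattern treated as normalised); the function name `decode` is hypothetical.

```python
def decode(bits, exp_bits=8):
    """Decode a sign/exponent/significand bit string to a Python float."""
    bits = bits.replace(" ", "")
    sign = -1 if bits[0] == "1" else 1
    exponent_field = bits[1:1 + exp_bits]
    fraction_field = bits[1 + exp_bits:]
    bias = 2 ** (exp_bits - 1) - 1                       # 127 for 8 bits
    exponent = int(exponent_field, 2) - bias             # subtract the bias
    # Add the implicit leading 1 in front of the stored fraction bits.
    significand = 1 + int(fraction_field, 2) / 2 ** len(fraction_field)
    return sign * significand * 2.0 ** exponent


# Sign, exponent and 23-bit significand of the worked example.
pattern = "0" + "10010011" + "101" + "0" * 20
print(decode(pattern))   # 1.625 x 2^20
```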
Changing Binary Fractions to Base 10 Fractions
● The integer is dealt with in the normal way
  – 101_2 = 5_10 so 101.1101_2 = 5.????_10
● To sort out the fraction
  – read the figures of the fraction as an integer and convert to base 10
    ● 1101_2 = 13_10
  – divide that number by 2 to the power of the number of fraction columns
    ● 13 / 2^4 = 13 / 16 = 0.8125
● Reassemble the result: 101.1101_2 = 5 + 13/16 or 5.8125_10
Changing Binary Fractions to Base 10 Fractions: an Alternative Method
● The integer is dealt with in the normal way
  – 101_2 = 5_10 so 101.1101_2 = 5.????_10
● To sort out the fraction use the base 2 column headings

      2^-1    2^-2    2^-3    2^-4
      1/2     1/4     1/8     1/16
      0.5     0.25    0.125   0.0625

  – so 0.1101_2 = 0.5 + 0.25 + 0.0625 = 0.8125_10
● Reassemble the result: 101.1101_2 = 5.8125_10
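Both ways of converting a binary fraction (dividing the fraction digits by a power of two, and summing the column weights) can be written out and compared. The function names below are made up for illustration, and the sketch only handles non-negative values.

```python
def by_division(text):
    """Read the fraction digits as an integer and divide by 2^(number of digits)."""
    integer_part, _, fraction_part = text.partition(".")
    value = float(int(integer_part, 2)) if integer_part else 0.0
    if fraction_part:
        value += int(fraction_part, 2) / 2 ** len(fraction_part)
    return value


def by_columns(text):
    """Add the weight 2^-1, 2^-2, ... of every fraction column that holds a 1."""
    integer_part, _, fraction_part = text.partition(".")
    value = float(int(integer_part, 2)) if integer_part else 0.0
    for position, digit in enumerate(fraction_part, start=1):
        if digit == "1":
            value += 2.0 ** -position
    return value


print(by_division("101.1101"), by_columns("101.1101"))   # both give 5 + 13/16
```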
Example – Converting Decimal to Floating Point
● What is the floating point representation of -1.25 x 2^-10? Assume the bias is 2^(n-1) - 1 (for 8 bits, 2^7 - 1 = 127)
  – The sign is negative so the sign bit is 1
  – Convert the number to binary
    ● 1.25_10 = 1_10 + 0.25_10 = 1.01_2
  – Extract the significand (remove the implicit 1. bit and pad with 0s to 23 bits)
    ● significand = 0100 0000 0000 0000 0000 000
  – Convert the exponent -10 to biased or excess notation
    ● add the bias to the exponent: -10 + 127 = 117_10
    ● convert the result to binary: 117_10 = 01110101 (8-bit excess)
  – -1.25 x 2^-10 = 1 01110101 01000000000000000000000
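The encoding steps can be sketched in Python and cross-checked against the machine's own 32-bit float encoding via the standard `struct` module. The names `encode` and `reference_bits` are hypothetical, and the sketch assumes a non-zero value whose exponent fits the 8-bit field and whose significand needs no rounding.

```python
import math
import struct


def encode(x, exp_bits=8, frac_bits=23):
    """Return 'sign exponent significand' bit fields for a non-zero float."""
    if x == 0:
        raise ValueError("zero has no normalised form")
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))                  # abs(x) == m * 2**e, 0.5 <= m < 1
    significand, exponent = m * 2, e - 1       # normalise to 1.bbb x 2^exponent
    bias = 2 ** (exp_bits - 1) - 1             # 127 for 8 bits
    exponent_field = format(exponent + bias, f"0{exp_bits}b")
    # Drop the implicit leading 1 and keep frac_bits bits (extra bits are truncated).
    fraction = int((significand - 1) * 2 ** frac_bits)
    return f"{sign} {exponent_field} {format(fraction, f'0{frac_bits}b')}"


def reference_bits(x):
    """Bit pattern the platform uses for x as a 32-bit float, for comparison."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return format(word, "032b")


x = -1.25 * 2 ** -10
print(encode(x))
print(encode(x).replace(" ", "") == reference_bits(x))
```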
Changing Base 10 Fractions to Binary Fractions
● The integer is dealt with in the normal way
  – 6_10 = 110_2 so 6.375_10 = 110.????_2
● To sort out the fraction, e.g. 0.375
  – double the fraction and note the integer part of the result
  – repeat the process by doubling the fraction part of the result until you have a whole number (or until you run out of space)
  – read the integer parts from top to bottom and place after the binary point
    ● 0.375 x 2 = 0.75   (integer part 0)
    ● 0.75  x 2 = 1.5    (integer part 1)
    ● 0.5   x 2 = 1.0    (integer part 1)
  – fraction part is 0.011_2
● Reassemble the number: 6.375_10 = 110.011_2
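The repeated-doubling procedure translates directly into a loop. This is a sketch for non-negative values only; the function names and the `max_bits` cut-off (standing in for "until you run out of space") are assumptions made for the example.

```python
def fraction_to_binary(fraction, max_bits=23):
    """Binary digits of a fraction in [0, 1), found by repeated doubling."""
    if not 0 <= fraction < 1:
        raise ValueError("expected a value in [0, 1)")
    digits = []
    while fraction != 0 and len(digits) < max_bits:
        fraction *= 2
        integer_part = int(fraction)        # the digit noted at this step
        digits.append(str(integer_part))
        fraction -= integer_part            # carry on with the fraction part
    return "".join(digits)


def decimal_to_binary(value, max_bits=23):
    """Binary string for a non-negative number, e.g. 6.375 -> '110.011'."""
    whole = int(value)
    fraction_digits = fraction_to_binary(value - whole, max_bits)
    text = format(whole, "b")
    return text + "." + fraction_digits if fraction_digits else text


print(decimal_to_binary(6.375))
print(fraction_to_binary(0.1, max_bits=8))   # 0.1 never terminates in binary
```

A fraction such as 0.1 has no finite binary expansion, so the loop stops at `max_bits` and the stored value is only an approximation.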
Expressible Numbers using a 32-bit word
● Using two's complement integer representation
  – -2^31 to 2^31 - 1
● For the previous example floating-point format (1 bit for the sign, 8 bits for the exponent, 23 bits for the significand) the following ranges of numbers are possible
  – negative numbers between -(2 - 2^-23) x 2^128 and -2^-127
  – positive numbers between 2^-127 and (2 - 2^-23) x 2^128
  – only some numbers in these regions can be represented
Numbers which cannot be represented
● Negative overflow
  – negative numbers less than -(2 - 2^-23) x 2^128
● Negative underflow
  – negative numbers greater than -2^-127
● Zero
● Positive underflow
  – positive numbers less than 2^-127
● Positive overflow
  – positive numbers greater than (2 - 2^-23) x 2^128
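The regions listed above can be expressed as a small classifier. The sketch uses the bounds of the lecture's example format (bias 127 with every exponent pattern in use), which is an assumption that differs from formats that reserve some exponent values; the names `LARGEST`, `SMALLEST` and `classify` are illustrative.

```python
LARGEST = (2 - 2.0 ** -23) * 2.0 ** 128   # largest magnitude in the example format
SMALLEST = 2.0 ** -127                    # smallest normalised magnitude


def classify(x):
    """Name the region of the number line that x falls into."""
    if x == 0:
        return "zero"
    side = "positive" if x > 0 else "negative"
    magnitude = abs(x)
    if magnitude > LARGEST:
        return f"{side} overflow"
    if magnitude < SMALLEST:
        return f"{side} underflow"
    return "within range"   # in range, though not necessarily exactly representable


for value in (1e40, -1e40, 1e-40, -1e-40, 0.0, 1.5):
    print(value, classify(value))
```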
Density
● It is important to note that we are not representing more individual values with floating-point notation: the maximum number of different values which can be represented is 2^n where n is the number of bits
● Numbers that are represented using floating-point notation are not spaced evenly along the number line
  – the possible values get closer together near the origin (i.e. 0) and further apart as they move away from the origin
Example (1)
● A representation with b = 2, a 1-bit sign bit, a 2-bit exponent e, and a 2-bit significand s, has 32 normalised numbers (16 positive and 16 negative values)

      1.00 x 2^-1    1.00 x 2^0    1.00 x 2^1    1.00 x 2^2
      1.01 x 2^-1    1.01 x 2^0    1.01 x 2^1    1.01 x 2^2
      1.10 x 2^-1    1.10 x 2^0    1.10 x 2^1    1.10 x 2^2
      1.11 x 2^-1    1.11 x 2^0    1.11 x 2^1    1.11 x 2^2
Example (2)
The 16 positive values written out in binary:

    significand    x 2^-1    x 2^0    x 2^1    x 2^2
    1.00           0.100     1.00     10.0     100
    1.01           0.101     1.01     10.1     101
    1.10           0.110     1.10     11.0     110
    1.11           0.111     1.11     11.1     111
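Enumerating the positive values of this small format makes the uneven spacing visible. The sketch assumes a bias of 2^(k-1) - 1 = 1 for the 2-bit exponent, which gives the exponents -1 to 2 used in the example; the function name `toy_values` is made up.

```python
def toy_values(exp_bits=2, frac_bits=2):
    """All positive normalised values of a tiny floating point format."""
    bias = 2 ** (exp_bits - 1) - 1                    # 1 for a 2-bit exponent
    values = []
    for exponent_field in range(2 ** exp_bits):
        for fraction in range(2 ** frac_bits):
            significand = 1 + fraction / 2 ** frac_bits   # implicit leading 1
            values.append(significand * 2.0 ** (exponent_field - bias))
    return values


values = toy_values()
gaps = [b - a for a, b in zip(values, values[1:])]
print(values)
print(gaps)   # the gap doubles each time the exponent goes up by one
```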
Range and Precision
● The size of the exponent determines the range of numbers that can be represented
  – the range of expressible numbers can be expanded by increasing the number of bits that are used to represent the exponent; for a fixed word size this will decrease precision
● The size of the significand determines the precision of the numbers that can be represented
  – precision can be increased by increasing the number of bits that are used to represent the significand; for a fixed word size this will decrease the range
● The only way to increase both range and precision is to use more bits
  – single-precision numbers, double-precision numbers
IEEE 754 Single Precision Floating Point Format
    | 1 bit | 8 bits   | 23 bits                |
    | S     | Exponent | Significand / mantissa |

● IEEE 754 single precision floating point format contains
  – 1 sign bit (s), 8 bit exponent, 23 bit significand
● the exponent is stored using biased representation
  – the bias is (2^(n-1) - 1) where n is the number of bits in the exponent
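The three fields of a single precision value can be inspected with Python's standard `struct` module, which packs a float into the 32-bit format. This is a minimal sketch; the function name `fields` is hypothetical, and it simply reports the raw fields without interpreting any special exponent values.

```python
import struct


def fields(x):
    """Return (sign bit, 8-bit exponent field, 23-bit significand field)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))   # 32-bit pattern of x
    sign = word >> 31
    exponent_field = (word >> 23) & 0xFF
    fraction_field = word & 0x7FFFFF
    return sign, exponent_field, fraction_field


sign, exponent_field, fraction_field = fields(1703936.0)
bias = 2 ** (8 - 1) - 1
print(sign, exponent_field - bias, format(fraction_field, "023b"))
```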