The IEEE standard for floating point arithmetic

ECE473 Homework 4 (30 points)
Due: Oct. 17 (Friday)
Download an open-source 32-bit floating point unit (FPU) from the following website and use
Quartus II to simulate the operations listed below. The FPU linked here is one example; you may
also try other open-source FPUs.
http://www.opencores.org/projects.cgi/web/fpu100/overview
(The website has Version 19. You can also use Version 18, which is stored on the
department server: cp ~zhu/shared/fpu_v18.zip ~/)
A = 2.71896743
B = -3.14159265

Translate A and B into IEEE 754 single-precision format in binary. You will NOT
receive any credit if you give only the final binary results. You are required to show
your translation procedure in detail (10 points).

Simulate the operations A*B, A/B, A+B, B-A, sqrt(A) and A/0 using the FPU.
Print out the results and justify your simulation results (use the round-down rounding mode)
(10 points).

Summarize the performance and limitations of this open-source FPU, such as resource
utilization and maximum working clock frequency (10 points).
IEEE standard for floating point arithmetic
The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely-used
standard for floating-point computation, and is followed by many CPU and FPU implementations.
The standard defines formats for representing floating-point numbers (including ±zero and
denormals) and special values (infinities and NaNs) together with a set of floating-point operations
that operate on these values. It also specifies four rounding modes and five exceptions (including
when the exceptions occur, and what happens when they do occur).
IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit),
double-precision (64-bit), single-extended precision (≥ 43-bit, not commonly used) and double-extended precision (≥ 79-bit, usually implemented with 80 bits). Only 32-bit values are required
by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic
be implemented, although sometimes it is optional. For example, the C programming language,
which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (the C float type
is typically IEEE single-precision and double is typically IEEE double-precision).
Single Precision
The IEEE single precision floating point standard representation requires a 32 bit word, whose
bits may be numbered from 0 to 31, left to right. The first bit is the sign bit, 'S', the next
eight bits are the exponent bits, 'E', and the final 23 bits are the fraction, 'F':
S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31
The value V represented by the word may be determined as follows:
If E=255 and F is nonzero, then V=NaN ("Not a number")
If E=255 and F is zero and S is 1, then V=-Infinity
If E=255 and F is zero and S is 0, then V=Infinity
If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F) where "1.F" is intended to represent the
binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are denormalized
(subnormal) values.
If E=0 and F is zero and S is 1, then V=-0
If E=0 and F is zero and S is 0, then V=0
In particular,
0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity
0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN
0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149) (smallest positive value)
Double Precision
The IEEE double precision floating point standard representation requires a 64 bit word, whose
bits may be numbered from 0 to 63, left to right. The first bit is the sign bit, 'S', the next
eleven bits are the exponent bits, 'E', and the final 52 bits are the fraction, 'F':
S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1        11 12                                                63
The value V represented by the word may be determined as follows:
If E=2047 and F is nonzero, then V=NaN ("Not a number")
If E=2047 and F is zero and S is 1, then V=-Infinity
If E=2047 and F is zero and S is 0, then V=Infinity
If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F) where "1.F" is intended to represent
the binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F). These are denormalized
(subnormal) values.
If E=0 and F is zero and S is 1, then V=-0
If E=0 and F is zero and S is 0, then V=0
Note: IEEE 754r is an ongoing revision to the IEEE 754 floating point standard. The intent of the
revision is to extend the standard where it has become necessary, to tighten up certain areas of the
original standard which were left undefined, and to merge in IEEE 854 (the radix-independent
floating-point standard).
Reference: ANSI/IEEE Standard 754-1985, Standard for Binary Floating Point Arithmetic
The following function takes a character array of 4 bytes (32 bits), as read from a file with the
most significant byte first, and converts it to a float using the IEEE Standard; ldexp, NAN and
INFINITY come from <math.h> (C99).
#include <math.h>   /* ldexp, NAN, INFINITY */

float arrayToFloat(const unsigned char data[4])
{
    int s, e;
    unsigned long src;
    long f;
    float value;

    /* Assemble the 32-bit word, most significant byte first. */
    src = ((unsigned long)data[0] << 24) |
          ((unsigned long)data[1] << 16) |
          ((unsigned long)data[2] << 8) |
          ((unsigned long)data[3]);
    s = (int)((src & 0x80000000UL) >> 31);   /* sign bit */
    e = (int)((src & 0x7F800000UL) >> 23);   /* biased exponent */
    f = (long)(src & 0x007FFFFFUL);          /* 23-bit fraction */

    if (e == 255 && f != 0) {
        /* NaN - Not a number */
        value = NAN;
    } else if (e == 255 && s == 1) {
        /* Negative infinity */
        value = -INFINITY;
    } else if (e == 255) {
        /* Positive infinity */
        value = INFINITY;
    } else if (e > 0) {
        /* Normal number: restore the implicit leading 1 */
        f += 0x00800000L;
        if (s) f = -f;
        value = (float)ldexp((double)f, e - 127 - 23);
    } else if (f != 0) {
        /* Denormal number: no implicit leading 1 */
        if (s) f = -f;
        value = (float)ldexp((double)f, -126 - 23);
    } else if (s == 1) {
        /* Negative zero */
        value = -0.0f;
    } else {
        /* Positive zero */
        value = 0.0f;
    }
    return value;
}