The IEEE standard for floating point arithmetic

ECE473 Homework 4 (30 points)
Due: Oct. 17 (Friday)

Download an open-source 32-bit floating point unit (FPU) from the following website and use Quartus II to simulate the operations listed below. The FPU linked here is one example; you may try other open-source FPUs.

http://www.opencores.org/projects.cgi/web/fpu100/overview

(The website has Version 19. You can also use Version 18, which is stored on the department server: cp ~zhu/shared/fpu_v18.zip ~/)

A = 2.71896743
B = -3.14159265

- Translate A and B into IEEE 754 single-precision format in binary. You will NOT receive any credit if you give only the final binary results. You are required to show your translation procedure in detail (10 points).
- Simulate the operations A*B, A/B, A+B, B-A, sqrt(A) and A/0 by using the FPU. Print out the results and justify your simulation results (using the round-down mode) (10 points).
- Summarize the performance and limitations of this open-source FPU, such as resource utilization and maximum working clock frequency (10 points).

IEEE standard for floating point arithmetic

The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including ±zero and denormals) and special values (infinities and NaNs), together with a set of floating-point operations that operate on these values. It also specifies four rounding modes and five exceptions (including when the exceptions occur, and what happens when they do occur).

IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (≥ 43-bit, not commonly used) and double-extended precision (≥ 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional.
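Whether a particular C compiler actually maps float and double onto the single and double formats described above can be checked from the macros in <float.h>. The following is a minimal sketch, not a complete conformance test: the function names are hypothetical, and the checks cover only the storage width, radix, significand width and exponent range.

```c
#include <float.h>
#include <limits.h>

/* Hypothetical helper: returns 1 if float has the shape of IEEE 754
   single precision (32 bits, radix 2, 24-bit significand including
   the implicit leading 1, matching exponent range). */
static int float_looks_like_ieee_single(void)
{
    return sizeof(float) * CHAR_BIT == 32
        && FLT_RADIX == 2
        && FLT_MANT_DIG == 24
        && FLT_MIN_EXP == -125
        && FLT_MAX_EXP == 128;
}

/* Hypothetical helper: the same check for double against IEEE 754
   double precision (64 bits, 53-bit significand). */
static int double_looks_like_ieee_double(void)
{
    return sizeof(double) * CHAR_BIT == 64
        && FLT_RADIX == 2
        && DBL_MANT_DIG == 53
        && DBL_MIN_EXP == -1021
        && DBL_MAX_EXP == 1024;
}
```

Both functions are expected to return 1 on most current desktop compilers, but a 0 result is permitted by the C language and simply means the host uses a different floating-point format.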
Many languages specify that IEEE formats and arithmetic be implemented, although sometimes it is optional. For example, the C programming language, which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (the C float type is typically IEEE single precision, and double is typically IEEE double precision).

Single Precision

The IEEE single precision floating point standard representation requires a 32-bit word, whose bits may be numbered from 0 to 31, left to right. The first bit is the sign bit, S, the next eight bits are the exponent bits, E, and the final 23 bits are the fraction, F:

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31

The value V represented by the word may be determined as follows:

- If E=255 and F is nonzero, then V=NaN ("Not a number")
- If E=255 and F is zero and S is 1, then V=-Infinity
- If E=255 and F is zero and S is 0, then V=Infinity
- If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
- If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" (denormalized) values.
- If E=0 and F is zero and S is 1, then V=-0
- If E=0 and F is zero and S is 0, then V=0

In particular,

0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity
0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN
0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149) (Smallest positive value)

Double Precision

The IEEE double precision floating point standard representation requires a 64-bit word, whose bits may be numbered from 0 to 63, left to right. The first bit is the sign bit, S, the next eleven bits are the exponent bits, E, and the final 52 bits are the fraction, F:

S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1        11 12                                                63

The value V represented by the word may be determined as follows:

- If E=2047 and F is nonzero, then V=NaN ("Not a number")
- If E=2047 and F is zero and S is 1, then V=-Infinity
- If E=2047 and F is zero and S is 0, then V=Infinity
- If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
- If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F). These are "unnormalized" (denormalized) values.
- If E=0 and F is zero and S is 1, then V=-0
- If E=0 and F is zero and S is 0, then V=0

Note: IEEE 754r is an ongoing revision to the IEEE 754 floating point standard.
The intent of the revision is to extend the standard where it has become necessary, to tighten up certain areas of the original standard which were left undefined, and to merge in IEEE 854 (the radix-independent floating-point standard).

Reference: ANSI/IEEE Standard 754-1985, Standard for Binary Floating Point Arithmetic

The following function takes a character array of 4 bytes (32 bits), as read from a file with the most significant byte first, and converts it to a float using the IEEE standard; NAN, INFINITY and ldexp come from <math.h>.

```c
#include <math.h>   /* ldexp, NAN, INFINITY (C99) */

float arrayToFloat(unsigned char data[4])
{
    int s, e;
    unsigned long src;
    long f;
    float value;

    /* Assemble the 32-bit word, most significant byte first */
    src = ((unsigned long)data[0] << 24) +
          ((unsigned long)data[1] << 16) +
          ((unsigned long)data[2] << 8) +
          ((unsigned long)data[3]);

    s = (int)((src & 0x80000000UL) >> 31);   /* sign bit */
    e = (int)((src & 0x7F800000UL) >> 23);   /* biased exponent */
    f = (long)(src & 0x007FFFFFUL);          /* 23-bit fraction */

    if (e == 255 && f != 0) {
        /* NaN - Not a number */
        value = NAN;
    } else if (e == 255 && f == 0 && s == 1) {
        /* Negative infinity */
        value = -INFINITY;
    } else if (e == 255 && f == 0 && s == 0) {
        /* Positive infinity */
        value = INFINITY;
    } else if (e > 0 && e < 255) {
        /* Normal number: restore the implicit leading 1 */
        f += 0x00800000L;
        if (s) f = -f;
        value = (float)ldexp((double)f, e - 127 - 23);
    } else if (e == 0 && f != 0) {
        /* Denormal number: no implicit leading 1 */
        if (s) f = -f;
        value = (float)ldexp((double)f, -126 - 23);
    } else if (e == 0 && f == 0 && s == 1) {
        /* Negative zero */
        value = -0.0f;
    } else {
        /* Positive zero (the only remaining case) */
        value = 0.0f;
    }
    return value;
}
```
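The single-precision rules can also be applied in the opposite direction to arrayToFloat, starting from a float and recovering S, E and F. The sketch below assumes the host's float is a 32-bit IEEE single-precision value; the type ieee_single_fields and the functions float_bits, split_float and floatToArray are hypothetical names introduced for illustration only.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a single-precision word. */
struct ieee_single_fields {
    unsigned sign;      /* S: 1 bit */
    unsigned exponent;  /* E: 8 bits, biased by 127 */
    uint32_t fraction;  /* F: 23 bits, without the implicit leading 1 */
};

/* Copy the bit pattern of x into a 32-bit integer.
   Assumption: float is a 32-bit IEEE single on this host. */
static uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Split x into its sign, exponent and fraction fields. */
static struct ieee_single_fields split_float(float x)
{
    uint32_t bits = float_bits(x);
    struct ieee_single_fields out;

    out.sign     = (unsigned)(bits >> 31);
    out.exponent = (unsigned)((bits >> 23) & 0xFFu);
    out.fraction = bits & 0x007FFFFFu;
    return out;
}

/* Store x as 4 bytes, most significant byte first, which is the
   byte order that arrayToFloat reads. */
static void floatToArray(float x, unsigned char data[4])
{
    uint32_t bits = float_bits(x);

    data[0] = (unsigned char)(bits >> 24);
    data[1] = (unsigned char)(bits >> 16);
    data[2] = (unsigned char)(bits >> 8);
    data[3] = (unsigned char)bits;
}
```

For 6.5f, split_float yields S=0, E=129 and F=0x500000, which is the pattern 0 10000001 10100000000000000000000 from the single-precision examples, and floatToArray produces the bytes 40 D0 00 00 in hexadecimal.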
