International Journal of Engineering Trends and Technology (IJETT) – Volume 8 Number 4 - Feb 2014

Floating Point Engine using VHDL

Najib Ghatte #1, Shilpa Patil #2, Deepak Bhoir #3
Fr. Conceicao Rodrigues College of Engineering
Fr. Agnel Ashram, Bandstand, Bandra (W), Mumbai: 400 050, India

Abstract— Floating-point arithmetic is by far the most common way of approximating real-number arithmetic for performing numerical calculations on modern computers. For a long time, each computer had a different arithmetic: bases, significand and exponent sizes, formats, etc. Each company implemented its own model, which hindered portability between different machines, until the IEEE 754 standard appeared, defining a single, universal standard. This paper deals with the implementation of single precision and double precision floating-point adders and multipliers according to the IEEE 754 standard, using the hardware description language VHDL. The VHDL code so designed is simulated for various sets of inputs, and the desired results are obtained. The code is synthesized for device XC3S5000, package FG900, of the Spartan®-3 FPGA family.

Keywords— Single precision, double precision, FPGA, Spartan, IEEE-754, Floating-point arithmetic, VHDL, Xilinx

I. INTRODUCTION
This section briefly introduces the topics that are described in the later parts of the paper. It starts with the representation of real numbers in the fixed-point format and then presents the IEEE-754 standard for expressing numbers in single precision (32-bit) and double precision (64-bit) format. It also discusses the reasons for using floating-point numbers in computation, floating-point arithmetic, and the cost involved in implementing higher precision than the existing floating-point hardware provides. It also explains the differences between the fixed-point and floating-point representations of numbers.
A. Computer Representation of Real Numbers
Real numbers are numbers that can represent a quantity with infinite precision and are used to measure continuous quantities. Almost all computations in physics, chemistry, mathematics and scientific computing involve operations on real numbers. Computers can only approximate real numbers, most commonly with fixed-point and floating-point representations.

In a fixed-point representation, a real number is represented by a fixed number of digits before and after the radix point. Since the radix point is fixed, the range of fixed-point numbers is also limited. Due to this fixed window of representation, it can represent either very small numbers or very large numbers accurately within the available range, but not both at once.

A better way of representing real numbers is floating-point representation. Floating-point numbers represent real numbers in scientific notation. They employ a sort of sliding window of precision, that is, a number of digits suitable to the scale of a particular number, and hence can represent a much wider range of values accurately.

Most processors designed for consumer applications, such as Graphical Processing Units (GPUs) and CELL processors, promise and deliver outstanding floating-point performance for scientific applications while using single precision floating-point arithmetic hardware [1]. Since video games rarely require higher accuracy in floating-point operations, the high cost of the extra hardware needed for such an implementation is not justified. The hardware cost of higher precision arithmetic is far greater than that of single precision arithmetic; for example, one double precision (64-bit) floating-point pipeline has approximately the same cost as two to four 32-bit floating-point pipelines. Most applications nevertheless use 64-bit floating point to avoid losing precision in a long sequence of operations, even though the final result may not be accurate to more than 32-bit precision.
The extra precision is used so that the application developer does not have to worry about having enough precision.

B. Fixed-Point Representation of Real Numbers
In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after (and sometimes also before) the radix point (the decimal point '.' in English decimal notation), as shown in Fig. 1. Fixed-point representation can be compared to the more complicated (and more computationally demanding) floating-point representation.

Fig. 1 8-bit sign-magnitude fixed-point representation

Fixed-point numbers are useful for representing fractional values, usually in base 2 or base 10, when the executing processor has no floating-point unit (FPU), or when fixed-point provides improved performance or accuracy for the application at hand. Most low-cost embedded microprocessors and microcontrollers do not have an FPU. [2]

A value of a fixed-point data type is essentially an integer that is scaled by a specific factor determined by the type. For example, the value 1.25 can be represented as 1250 in a fixed-point data type with a scaling factor of 1/1000, and the value 1,250,000 can be represented as 1250 with a scaling factor of 1000. Unlike floating-point data types, the scaling factor is the same for all values of the same type and does not change during the entire computation.

If all the numbers in a given range are represented by this method and placed on a number line, there is an equal interval between any two adjacent points, as shown in Fig. 2.

Fig. 2 Fixed-point representation on number line

The scaling factor is usually a power of 10 (for human convenience) or a power of 2 (for computational efficiency). However, other scaling factors may be used occasionally, e.g.
a time value in hours may be represented as a fixed-point type with a scale factor of 1/3600 to obtain values with one-second accuracy.

Fig. 3 shows the fixed-point representation of 32-bit numbers (the first bit reserved for the sign, 15 bits for the integer part and the remaining 16 bits for the fractional part) and 64-bit numbers (the first bit reserved for the sign, 31 bits for the integer part and the remaining 32 bits for the fractional part). The highest value that can be represented in the 32-bit format is 32767.9999, and in the 64-bit format 2147483647.9999, as shown.

Fig. 3 Fixed-point representation: 32-bit and 64-bit

Limitations: Since fixed-point operations can produce results that have more bits than the operands, there is a possibility of information loss. For instance, the result of a fixed-point multiplication could have as many bits as the sum of the numbers of bits in the two operands. To fit the result into the same number of bits as the operands, the answer must be rounded or truncated; in that case, the choice of which bits to keep is very important. When multiplying two fixed-point numbers with the same format, for instance with I integer bits and Q fractional bits, the answer could have up to 2I integer bits and 2Q fractional bits.

C. Floating-Point Representation of Real Numbers
The term floating point implies that there is no fixed number of digits before and after the decimal point; i.e., the decimal point can float. Floating-point representations are slower and less accurate than fixed-point representations, but they can handle a larger range of numbers. [3] Because mathematics with floating-point numbers requires a great deal of computing power, many microprocessors come with a dedicated chip, called a floating-point unit (FPU), specialised for performing floating-point arithmetic. FPUs are also called math co-processors or numeric co-processors.
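As an aside, the scaled-integer scheme and the bit growth on multiplication described in Section I-B can be sketched in a few lines of Python. This is an illustrative model only: the Q16.16 layout (16 integer bits, 16 fractional bits) and the helper names are our own, not part of the paper's design.

```python
# Illustrative fixed-point model: values are integers scaled by 2**-16
# (a Q16.16 layout, chosen here for convenience of demonstration).

FRAC_BITS = 16
SCALE = 1 << FRAC_BITS  # scaling factor 2**16 = 65536

def to_fixed(x: float) -> int:
    """Quantise a real number to a Q16.16 scaled integer."""
    return round(x * SCALE)

def to_float(f: int) -> float:
    """Recover the real value of a Q16.16 scaled integer."""
    return f / SCALE

def fixed_mul(a: int, b: int) -> int:
    # The raw product carries 2*16 fractional bits; shifting right by
    # FRAC_BITS truncates back to Q16.16, discarding the low-order
    # bits -- exactly the information loss described in the text.
    return (a * b) >> FRAC_BITS

a = to_fixed(1.25)   # 1.25 * 65536 = 81920
b = to_fixed(2.5)    # 2.5  * 65536 = 163840
print(to_float(fixed_mul(a, b)))   # 3.125
```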
II. IEEE-754 FLOATING-POINT STANDARD
Hardware supporting different floating-point precisions and various formats has been adopted over the years. The MasPar MP-1 supercomputer performed floating-point operations using 4-bit slice operations on the mantissa with special normalisation hardware, and supported the 32-bit and 64-bit IEEE 754 formats. [4]

Floating-point representation has a complex encoding scheme with three basic components: mantissa, exponent and sign. The use of binary numeration and powers of 2 resulted in floating-point numbers being represented as single precision (32-bit) and double precision (64-bit) numbers. Both single and double precision numbers are defined by the IEEE 754 standard. According to the standard, a single precision number has one sign bit, 8 exponent bits and 23 mantissa bits, whereas a double precision number comprises one sign bit, 11 exponent bits and 52 mantissa bits.

A. Special Values
IEEE 754 reserves exponent field values of all 0s and all 1s to denote special values in the floating-point scheme, as explicated in [5]: +0, -0, +∞, -∞, positive denormalised numbers, negative denormalised numbers and Not a Number (NaN).

B. Exceptions in Floating-Point Number Representation
An arithmetic exception arises when an attempted atomic arithmetic operation has no universally acceptable result. IEEE 754 defines five basic types of floating-point exceptions: invalid operation, division by zero, overflow, underflow and inexact, as described in [6] and tabulated in Table I.

C. Rounding Modes
Precision is not infinite, and sometimes rounding a result is necessary. IEEE-754 supports four rounding modes: round to nearest, round towards zero, round towards +∞ and round towards -∞.
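The single precision field layout just described (1 sign bit, 8 exponent bits biased by 127, 23 mantissa bits) can be inspected with a short Python snippet using the standard struct module. This is an illustrative aid, not part of the paper's VHDL design.

```python
# Unpack the sign / exponent / mantissa fields of the IEEE-754 single
# precision encoding of a Python float.

import struct

def fields32(x: float):
    """Return (sign, biased exponent, mantissa) of the float32 encoding of x."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31            # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits (hidden leading 1 not stored)
    return sign, exponent, mantissa

s, e, m = fields32(-6.5)   # -6.5 = -1.625 * 2**2
print(s, e, m)             # 1 129 5242880  (129 = 127 + 2)

# Reassemble the value from the three fields:
print((-1)**s * (1 + m / 2**23) * 2**(e - 127))   # -6.5
```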
[7]

TABLE I
IEEE-754 FLOATING-POINT EXCEPTIONS

Exception          Example                     Default Result
Invalid            0/0, 0 × ∞                  NaN
Division by zero   x/0                         ±∞
Overflow           result > max                ±∞
Underflow          result < min                Subnormals or zero
Inexact            nearest(x op y) ≠ x op y    Rounded result

III. FLOATING POINT ARITHMETIC
Floating-point arithmetic is by far the most common way of approximating real-number arithmetic for performing numerical calculations on modern computers.

A. Floating Point Adder
Following the established plan [8], the way to perform the addition will now be set out. This also serves to explain why these steps are necessary, in order to make the explanation of the code in the next section clearer. The operation computes

(F1 × 2^E1) + (F2 × 2^E2) = F × 2^E

The different steps are as follows:
1. Compare exponents. If not equal, shift the smaller fraction to the right and add 1 to its exponent (repeat).
2. Add fractions.
3. If the result is 0, adjust for proper 0.
4. If fraction overflow occurs, shift right and add 1 to the exponent.
5. If unnormalised, shift left and subtract 1 from the exponent (repeat).
6. Check for exponent overflow.
7. Round the fraction. If not normalised, go to step 4.

1) VHDL Code for Addition of IEEE-754 Single Precision Numbers: VHDL code for the addition of single precision (32-bit) numbers was developed and then simulated using ModelSim SE Plus 6.5. The VHDL code was broken down into seven possible states, where each state represents one step of the addition algorithm. The St signal is asserted to start the process. Overflow and underflow are indicated by an exception signal. As soon as the addition is completed, the done signal is asserted to indicate the termination of the simulation. Various sets of inputs are fed to the block to obtain the results. The remainder of this section deals with the simulation and synthesis results.
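The alignment-and-normalise steps above can be sketched as a small software model. The following Python is our own illustrative simplification (unsigned operands only, single precision significand width, no rounding), not the paper's VHDL.

```python
# Illustrative model of floating-point addition: operands are
# (exponent, significand) pairs, with the significand a 24-bit integer
# that includes the hidden leading 1 (single precision widths).

SIG_BITS = 24  # hidden bit + 23 stored mantissa bits

def fp_add(e1: int, f1: int, e2: int, f2: int):
    """Add f1 * 2**e1 + f2 * 2**e2 for normalised, non-negative operands."""
    # Step 1: compare exponents; shift the smaller significand right.
    if e1 < e2:
        e1, f1, e2, f2 = e2, f2, e1, f1
    f2 >>= (e1 - e2)
    # Step 2: add the significands.
    f, e = f1 + f2, e1
    # Step 4: on significand overflow, shift right and bump the exponent.
    if f >> SIG_BITS:
        f >>= 1
        e += 1
    # Step 5: renormalise (shift left) until the hidden bit is set.
    while f and not (f >> (SIG_BITS - 1)):
        f <<= 1
        e -= 1
    return e, f

# 12.0 = 1.5 * 2**3 and 2.0 = 1.0 * 2**1, as 24-bit significands:
e, f = fp_add(3, 0b110000000000000000000000, 1, 0b100000000000000000000000)
print(f / (1 << 23) * 2**e)   # 14.0
```

Note that steps 3, 6 and 7 (zero handling, exponent overflow and rounding) are omitted here for brevity; the paper's state machine covers all seven steps.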
A. ModelSim Simulation
Consider the addition of two decimal numbers, 12.3 and 1250.36, fed to the algorithm as FPinput1 = 12.3 (0 10000010 10001001100110011001101) and FPinput2 = 1250.36 (0 10001001 00111000100101110000101), giving the desired output FPsum = 1262.66 (0 10001001 00111011101010100011110), as shown in Fig. 4.

B. Xilinx ISE Synthesis
The VHDL code for the addition of IEEE-754 single precision (32-bit) numbers was then synthesized for device XC3S5000, package FG900, of the Spartan®-3 FPGA family. From the datasheet cited in [9], this device has the attributes listed in Table II.

TABLE II
SPARTAN®-3 FPGA FG900 XC3S5000 ATTRIBUTES

Device                               XC3S5000
System Gates                         5M
Equivalent Logic Cells               74,880
CLB Array (one CLB = four slices):
  Rows                               104
  Columns                            80
  Total CLBs                         8,320
Total Slices                         33,280
Max. User I/O                        633
Dedicated Multipliers                104

Table III shows the Device Utilisation Summary of the VHDL code. The device resources used are very few; hence an efficient device utilisation is obtained.

TABLE III
FLOATING POINT ADDITION (SINGLE PRECISION): DEVICE UTILISATION SUMMARY

From the timing report, the maximum frequency at which the design can operate is 67.925 MHz (a minimum clock period of 14.722 ns). The minimum input arrival time before the clock (input setup time) is 9.323 ns, and the maximum output required time after the clock comes out as 7.1 ns.

2) VHDL Code for Addition of IEEE-754 Double Precision Numbers: VHDL code for the addition of double precision (64-bit) numbers was developed and then simulated using ModelSim SE Plus 6.5. The VHDL code was broken down into seven possible states, where each state represents one step of the addition algorithm. The St signal is asserted to start the process. Overflow and underflow are indicated by an exception signal. As soon as the addition is completed, the done signal is asserted to indicate the termination of the simulation.
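As a cross-check, the single precision bit patterns quoted in the simulation above can be reproduced with Python's struct module. This is a verification aid we add for the reader, not part of the paper's design.

```python
# Encode/decode IEEE-754 single precision patterns to cross-check the
# quoted testbench values (e.g. 12.3 and the sum 1262.66).

import struct

def bits32(x: float) -> str:
    """The 32-bit IEEE-754 pattern of x as a binary string."""
    return format(struct.unpack('>I', struct.pack('>f', x))[0], '032b')

def from_bits32(b: str) -> float:
    """Decode a 32-bit pattern (spaces between fields allowed)."""
    return struct.unpack('>f', struct.pack('>I', int(b.replace(' ', ''), 2)))[0]

print(bits32(12.3))   # 01000001010001001100110011001101
print(round(from_bits32('0 10001001 00111011101010100011110'), 2))   # 1262.66
```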
Various sets of inputs are fed to the block to obtain the results. The remainder of this section deals with the simulation and synthesis results.

A. ModelSim Simulation
Consider the addition of two decimal numbers, 1.25 and 6500.12, fed to the algorithm as FPinput1 = 1.25 (0 01111111111 0100000000000000000000000000000000000000000000000000) and FPinput2 = 6500.12 (0 10000001011 1001011001000001111010111000010100011110101110000101), giving the desired output FPsum = 6501.37 (0 10000001011 1001011001010101111010111000010100011110101110000101), as shown in Fig. 5.

B. Xilinx ISE Synthesis
The VHDL code for the addition of IEEE-754 double precision (64-bit) numbers was then synthesized for device XC3S5000, package FG900, of the Spartan®-3 FPGA family. From the datasheet cited in [9], this device has the attributes listed in Table II.

Table IV shows the Device Utilisation Summary of the VHDL code. The device resources used are very few; hence an efficient device utilisation is obtained.

TABLE IV
FLOATING POINT ADDITION (DOUBLE PRECISION): DEVICE UTILISATION SUMMARY

From the timing report, the maximum frequency at which the design can operate is 59.912 MHz (a minimum clock period of 16.691 ns). The minimum input arrival time before the clock (input setup time) is 9.601 ns, and the maximum output required time after the clock comes out as 7.242 ns.

B. Floating Point Multiplier
Following the established plan [10], the way to perform the multiplication will now be set out. This also serves to explain why these steps are necessary, in order to make the explanation of the code in the next section clearer. The different steps are as follows:
1. Multiply the significands; i.e. (F1 × F2).
2. Place the decimal point in the result.
3. Add the exponents; i.e. (E1 + E2 − Bias).
4. Obtain the sign; i.e. S1 xor S2.
5. Normalise the result; i.e. obtain a 1 at the MSB of the result's significand.
6. Round the result to fit in the available bits.
7. Check for underflow/overflow occurrence.

1) VHDL Code for Multiplication of IEEE-754 Single Precision Numbers: VHDL code for the multiplication of single precision (32-bit) numbers was developed and then simulated using ModelSim SE Plus 6.5. The VHDL code was broken down into small components that handle normalisation, rounding, packing and unpacking respectively. The operands are tested for special numbers such as zero, infinity and NaN. The exponents are added and the significands are multiplied. The sign bit is computed with an XOR operation. Various sets of inputs are fed to the block to obtain the results. The remainder of this section deals with the simulation and synthesis results.

A. ModelSim Simulation
Consider the multiplication of two decimal numbers, 102.3 and 1.253, fed to the algorithm as FP_A = 102.3 (0 10000101 10011001001100110011010) and FP_B = 1.253 (0 01111111 01000000110001001001110), giving the desired output FP_Z = 128.1819 (0 10000110 00000000010111010010001), as shown in Fig. 6.

B. Xilinx ISE Synthesis
The VHDL code for the multiplication of IEEE-754 single precision (32-bit) numbers was then synthesized for device XC3S5000, package FG900, of the Spartan®-3 FPGA family. From the datasheet cited in [9], this device has the attributes listed in Table II.

Table V shows the Device Utilisation Summary of the VHDL code. The device resources used are very few; hence an efficient device utilisation is obtained.

TABLE V
FLOATING POINT MULTIPLICATION (SINGLE PRECISION): DEVICE UTILISATION SUMMARY

From the timing report, the maximum combinational path delay is 39.507 ns.
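The multiplication recipe described above (multiply significands, add exponents minus the bias, XOR the signs, then normalise) can likewise be sketched as a software model. This Python fragment is our own illustrative simplification, using single precision widths, truncation instead of rounding, and no special-value handling.

```python
# Illustrative model of floating-point multiplication on
# (sign, biased exponent, 24-bit significand) triples.

BIAS, SIG_BITS = 127, 24

def fp_mul(s1, e1, f1, s2, e2, f2):
    """Multiply (-1)**s1 * (f1/2**23) * 2**(e1-BIAS) by the second operand."""
    s = s1 ^ s2            # sign: S1 xor S2
    e = e1 + e2 - BIAS     # exponents: E1 + E2 - Bias
    f = f1 * f2            # 24 x 24 -> up to 48-bit product
    # Normalise: a product of two significands in [1, 2) lies in [1, 4),
    # so at most one right shift is needed.
    if f >> (2 * SIG_BITS - 1):
        f >>= 1
        e += 1
    f >>= (SIG_BITS - 1)   # truncate back to 24 bits (no rounding here)
    return s, e, f

# (1.5 * 2**1) * (1.25 * 2**2) = 1.875 * 2**3 = 15.0
s, e, f = fp_mul(0, BIAS + 1, 0b110 << 21, 0, BIAS + 2, 0b101 << 21)
print((-1)**s * (f / (1 << 23)) * 2.0**(e - BIAS))   # 15.0
```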
Maximum combinational path delay applies only to paths that start at an input to the design and go to an output of the design without being clocked along the way.

2) VHDL Code for Multiplication of IEEE-754 Double Precision Numbers: VHDL code for the multiplication of double precision (64-bit) numbers was developed and then simulated using ModelSim SE Plus 6.5. The VHDL code was broken down into small components that handle normalisation, rounding, packing and unpacking respectively. The operands are tested for special numbers such as zero, infinity and NaN. The exponents are added and the significands are multiplied. The sign bit is computed with an XOR operation. Various sets of inputs are fed to the block to obtain the results. The remainder of this section deals with the simulation and synthesis results.

A. ModelSim Simulation
Consider the multiplication of two decimal numbers, 1.25 and 6500.12, fed to the algorithm as FP_A = 1.25 (0 01111111111 0100000000000000000000000000000000000000000000000000) and FP_B = 6500.12 (0 10000001011 1001011001000001111010111000010100011110101110000101), giving the desired output FP_Z = 8125.15 (0 10000001011 1111101111010010011001100110011001100110011001100110), as shown in Fig. 7.

B. Xilinx ISE Synthesis
The VHDL code for the multiplication of IEEE-754 double precision (64-bit) numbers was then synthesized for device XC3S5000, package FG900, of the Spartan®-3 FPGA family. From the datasheet cited in [9], this device has the attributes listed in Table II.

Table VI shows the Device Utilisation Summary of the VHDL code. The device resources used are very few; hence an efficient device utilisation is obtained.
TABLE VI
FLOATING POINT MULTIPLICATION (DOUBLE PRECISION): DEVICE UTILISATION SUMMARY

From the timing report, the maximum combinational path delay is 40.608 ns. Maximum combinational path delay applies only to paths that start at an input to the design and go to an output of the design without being clocked along the way.

IV. CONCLUSION
The importance and usefulness of the floating-point format is nowadays beyond dispute: any computer or electronic device that operates on real numbers implements this type of representation and operation. Throughout this paper, the IEEE-754 compliant floating-point representation has been explained, together with floating-point arithmetic, namely addition and multiplication. The VHDL code has been implemented so that the operations are carried out with combinational logic, which achieves a faster response because there are no sequential devices, such as flip-flops, to delay the execution.

VHDL code for addition and multiplication of single precision (32-bit) and double precision (64-bit) numbers has been developed and simulated with ModelSim SE Plus 6.5. The code was tested with special numbers such as zero, infinity (positive as well as negative) and NaN. The desired results were obtained, and thereby an efficient algorithm was developed. The code was synthesized for device XC3S5000, package FG900, of the Spartan®-3 FPGA family. By means of Xilinx ISE 14.5, synthesis results were obtained. The Device Utilisation Summary specifies the number of IOBs and LUTs utilised. The maximum frequency at which the design can operate and the static power dissipation were also estimated.

REFERENCES
[1] William R. Dieter, Akil Kaveti, Henry G. Dietz, "Low-Cost Microarchitectural Support for Improved Floating-Point Accuracy", March 2007.
[2] Pat Kusbel, "Control and Computing System for the Compact Microwave Radiometer for Humidity Profiling", B.Tech thesis, Department of Electrical and Computer Engineering, Colorado State University, March 2006.
[3] Alex N. D. Zamfirescu, "Floating Point Type for Synthesis", CA, USA, 2000.
[4] John J. G. Savard (2012). Floating-Point Formats [Online]. Available: http://www.quadibloc.com/comp/cp0201.htm
[5] Steve Hollasch (2005). IEEE Standard 754 Floating Point Numbers [Online]. Available: http://steve.hollasch.net/cgindex/coding/ieeefloat.html
[6] Sun Microsystems, "Numerical Computation Guide", Ch. 4, Exceptions and Exception Handling.
[7] Nathan Whitehead, Alex Fit-Florea, "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs", 2011.
[8] Arturo Barrabés Castillo, "Design of Single Precision Float Adder (32-bit Numbers) According to IEEE 754 Standard Using VHDL", M.Tech thesis, Slovak University of Technology, Apr. 2012.
[9] Spartan-3 FPGA Family Data Sheet, Xilinx, 2013.
[10] R. Sai Siva Teja, A. Madhusudhan, "FPGA Implementation of Low-Area Floating Point Multiplier Using Vedic Mathematics", International Journal of Emerging Technology and Advanced Engineering, Dec. 2013.

MODELSIM SIMULATION RESULTS

Fig. 4 Floating Point Addition (Single Precision): Timing Diagram, 12.3 + 1250.36 = 1262.66
Fig. 5 Floating Point Addition (Double Precision): Timing Diagram, 1.25 + 6500.12 = 6501.37
Fig. 6 Floating Point Multiplication (Single Precision): Timing Diagram, 102.3 × 1.253 = 128.1819
Fig. 7 Floating Point Multiplication (Double Precision): Timing Diagram, 1.25 × 6500.12 = 8125.15