Field Programmable Gate Array Prototyping of End-Around Carry Parallel Prefix Tree Architectures. ABSTRACT: As an important part of many processors’ floating point unit, fused multiply-add unit performs a multiplication followed immediately by an addition. In IBM POWER6 microprocessor’s fused multiply-add unit, a fast 128-bit floating-point end-around-carry (EAC) adder is proposed. Very few algorithmic details exist in today’s literature about this adder. In this study, a complete designed EAC adder that can work independently as a regular adder is proposed. Details about the proposed EAC adder’s arithmetic algorithms are described. In IBM’s original EAC adder, the Kogge–Stone tree has been chosen for its high performance on ASIC technology. In this study, the authors present a comparative study on different parallel prefix trees which are used in the design of our new EAC adder targeting field programmable gate array (FPGA) technology. Our study highlights the main performance differences among 14 different architecture configurations focusing on the area requirements and the critical path delay. The experimental results show that there is one architecture configuration with the lower area requirement and the higher performance. Key-Words: EAC, Parallel prefix, Kogge-stone INTRODUCTION: Fused multiply-add unit plays an important role in modern microprocessor. It performs floatingpoint multiplication followed immediately by an addition of the product with a third floatingpoint operand. In 2007, a seven-cycle fused multiply-add pipeline unit was proposed as a part of the floating-point unit in IBM’s POWER6 microprocessor. In this fused multiply-add dataflow, the product should be aligned before it is added with the addend. Because the magnitude of the product is unknown in the early stages prior to the combination with the addend, it is difficult to determine a priori which operand is bigger. Even if it was determined early that the product was bigger, there would be a problem on conditionally complementing two intermediate operands, the carry and sum outputs of the counter tree. Thus, an adder needs to be designed to always output a positive magnitude result and preferably only needs to conditionally complement one operand. Therefore a new 128-bit end-around carry (EAC) adder was designed and fabricated in IBM’s fused multiply-add unit. The intention is not to produce an adder with the best stand-alone performance but to provide the one with the best overall floating-point performance. VEDLABS, #112, Oxford Towers, Old airport Road, Kodihalli, Bangalore-08 Page 1 BLOCK DIAGRAM: Fig: Architecture of modified EAC adder Fig. shows the architecture of the proposed EAC adder. In this adder, the inputs are two 129-bit binary addends sum magnitudes of and the output is the They are all in sign magnitude format. are the are the corresponding sign bits. The magnitudes of VEDLABS, #112, Oxford Towers, Old airport Road, Kodihalli, Bangalore-08 Page 2 operands are used to produce the positive magnitude of the sum and the sign bits of operands are used to produce the sign of the sum. HARDWARE AND SOFTWARE REQUIREMENTS: Software Requirement Specification: Operating System: Windows XP with SP2 Synthesis Tool: Xilinx 12.2. Simulation Tool: Modelsim6.3c. Hardware Requirement specification: Minimum Intel Pentium IV Processor Primary memory: 2 GB RAM, Spartan III FPGA Xilinx Spartan III FPGA development board JTAG cable, Power supply REFERENCES: 1] CURRAN B., MCCREDIE B., SIQAL L., ET AL.: ‘4GHz+ low-latency fixed-point and binary floating-point execution units for the POWER 6 processor’. Digest of 2006 IEEE Int. Solid-State Circuits Conf., 2006, pp. 1728–1734 [2] SCHWARZ E.M.: ‘Binary floating-point unit design’, in U.S.S. (ED.): ‘High performance energy efficient microprocessor design’ (Springer, 2006), pp. 189–208 [3] YU X.Y., FLEISCHER B., CHAN Y.H., ET AL.: ‘A 5 GHz+ 128-bit binary floating-point adder for the POWER 6 processor’. Proc. Int. Conf. 32nd European Solid-State Circuits, 2006, pp. 166–169 [4] LEOBANDUNG D.M.E., NAYAKAMA H., ET AL.: ‘High performance 65 nm SOI technology with dual stress liner and low capacitance sram cell’. Digest of 2005 Symp. on VLSI Technology, 2005 [5] KOGGE P.M., STONE H.S.: ‘A parallel algorithm for the efficient solution of a general class of recurrence equations’, IEEE Trans. Comput., 1973, 22, (8), pp. 786–793 [6] LADNER R., FISCHER M.: ‘Parallel prefix computation’, J. ACM, 1980, 27, (4), pp. 831–838 VEDLABS, #112, Oxford Towers, Old airport Road, Kodihalli, Bangalore-08 Page 3