International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 6 – March 2015

Performance Analysis of Parallel Prefix Adder Based on FPGA

Avinash Shrivastava (1), Chandrahas Sahu (2)
(1) M.E. Student, Department of Electronics & Communication, SSCET (CSVTU), Bhilai (C.G.)
(2) Faculty, Department of Electronics & Communication, SSCET (CSVTU), Bhilai (C.G.)

Abstract: Parallel-prefix structures (also known as carry trees) are common in high-performance adders in very large scale integration (VLSI) designs because their delay is logarithmically proportional to the adder width. Such structures can usually be divided into three basic stages: pre-computation, prefix tree and post-computation. However, this performance advantage does not translate directly into FPGA implementations because of constraints on logic block configurations and routing overhead. This paper investigates carry-tree adders (the Kogge-Stone, Brent-Kung, Han-Carlson and Harris adders) and compares them with the simple Ripple Carry Adder (RCA). The designs are written in the Verilog hardware description language using the Xilinx Integrated Software Environment (ISE) 13.2 design suite. They are implemented on Xilinx Spartan 6, Spartan 6 low power, Virtex 6 and Virtex 6 low power Field Programmable Gate Arrays (FPGAs), and delays are measured using XPower Analyzer 13.2. Finally, the slice utilization, number of logic levels and delay of all the adders are compared.

Keywords: parallel prefix adders; carry tree adders; FPGA; logic analyzer; delay; power.

I. INTRODUCTION

Addition is a fundamental operation for any digital system, digital signal processing or control system. The fast and accurate operation of a digital system is greatly influenced by the performance of its resident adders.
Adders are also a very important component of digital systems because of their extensive use in other basic digital operations such as subtraction, multiplication and division. Hence, improving the performance of the digital adder would greatly advance the execution of binary operations inside a circuit comprised of such blocks. The performance of a digital circuit block is gauged by analyzing its power dissipation, layout area and operating speed.

The Parallel Prefix Adder (PPA) is very useful in today's technology because of its implementation in Very Large Scale Integration (VLSI) chips, which rely heavily on fast and reliable arithmetic computation. PPAs can provide both. There are many types of PPA, such as Kogge Stone [1], Brent Kung [2], Ladner Fischer [3], Han Carlson [4], Knowles [5] and Harris. For the purpose of this research, only Brent Kung and Kogge Stone adders will be investigated. Fig. 6 shows the structure of a PPA. A PPA can be divided into three main parts, namely pre-processing, the carry graph and post-processing. The pre-processing part generates the propagate (p) and generate (g) bits. The way the carry bits are obtained is what differentiates a PPA from other types of adders: they are computed in parallel, which makes the addition faster.

In this paper, the practical issues involved in designing and implementing tree-based adders on FPGAs are described. An efficient testing strategy for evaluating the performance of these adders is discussed. Several tree-based adder structures are implemented and characterized on an FPGA and compared with the Ripple Carry Adder (RCA).

ISSN: 2231-5381

II.
TYPES OF ADDERS

• Ripple carry adder, or carry propagate adder
• Carry look-ahead adder
• Carry skip adder
• Manchester chain adder
• Carry select adders
• Prefix adders
• Multi-operand adder
• Carry save adder
• Pipelined parallel adder

Basic Adder Unit: The most basic arithmetic operation is the addition of two binary digits, i.e. bits. A combinational circuit that adds two bits, according to the scheme outlined below, is called a half adder. A full adder is one that adds three bits, the third produced from a previous addition operation. One way of implementing a full adder is to use two half adders. The full adder is the basic unit of addition employed in all the adders studied here.

Half Adder: A half adder is used to add two binary digits, A and B. It produces S, the sum of A and B, and the corresponding carry out Co. Although a half adder is not extremely useful by itself, it can be used as a building block for larger adding circuits such as the full adder (FA). One possible implementation uses two AND gates, two inverters and an OR gate instead of an XOR gate, as shown in Fig. 1.

Boolean equations:
$S = A \oplus B = A\bar{B} + \bar{A}B$
$C_o = A \cdot B$

http://www.ijettjournal.org

Figure 1: Half-adder logic and block diagrams
Table 1: Half-adder truth table
Fig. 2: Full adder

Full Adder: A full adder is a combinational circuit that performs the arithmetic sum of three bits: A, B and a carry in, C, from a previous addition (Fig. 2). As in the case of the half adder, the full adder produces the corresponding sum, S, and a carry out Co. As mentioned previously, a full adder may be designed from two half adders in series, as shown in Fig. 3. The sum of A and B is fed to a second half adder, which then adds it to the carry in C (from a previous addition operation) to generate the final sum S. The carry out, Co, is the result of an OR operation on the carry outs of both half adders.
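The half adder and the two-half-adder full adder described above can be sketched as a small behavioural model. This is a minimal Python sketch, not the Verilog used in the paper; the function names `half_adder`, `full_adder` and `ripple_carry_add` are hypothetical, and the ripple-carry loop is included only to show how full adders chain.

```python
def half_adder(a, b):
    """Return (sum, carry_out) for two input bits."""
    return a ^ b, a & b


def full_adder(a, b, c):
    """Full adder built from two half adders plus an OR gate."""
    s1, c1 = half_adder(a, b)   # first half adder: A + B
    s, c2 = half_adder(s1, c)   # second half adder: (A xor B) + C
    return s, c1 | c2           # carry out is the OR of both carry outs


def ripple_carry_add(x, y, width=16, carry_in=0):
    """Add two unsigned integers bit by bit, rippling the carry upward."""
    carry = carry_in
    total = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry
```

For example, `ripple_carry_add(0xFFFF, 1)` returns `(0, 1)`: the 16-bit sum wraps to zero and the carry out is set, after the carry has passed through every full adder in turn.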
There are a variety of adders in the literature, at both the gate level and the transistor level, each giving different performance.

Boolean equations:
$S = A \oplus B \oplus C$
$C_o = A \cdot B + C \cdot (A \oplus B)$

Fig. 3: Full adder constructed from two half adders
Table 2: Full-adder truth table

Parallel Adders: Parallel adders are digital circuits that add binary strings of equal or different size in parallel. The schematic diagram of a parallel adder is shown in Fig. 4.

Fig. 4: Parallel adder

III. DRAWBACKS OF RIPPLE CARRY AND CARRY LOOKAHEAD ADDER

Fig. 5: 4-bit ripple carry adder

In Fig. 5, the first sum bit must wait until the input carry is given, the second sum bit must wait until the previous carry has propagated, and so on. The final sum output must wait until all previous carries have been generated, which results in delay.

To reduce the delay of the RCA by computing the carries in advance, the carry look-ahead adder (CLA) is used. This adder works on two operations called propagate and generate. The propagate and generate equations are given by

$P_i = A_i \oplus B_i$    (1)
$G_i = A_i \cdot B_i$    (2)

For a 4-bit CLA, the propagated carry equations are given as

$C_1 = G_0 + P_0 C_0$    (3)
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$    (4)
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$    (5)
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$    (6)

Equations (3) to (6) show that the carry logic becomes more complex as the adder width increases, so designing a wide CLA becomes complicated. Wide adders must therefore be built from logic with bounded fan-in rather than unbounded fan-in. To compute the carries in advance without this delay and complexity, the parallel prefix approach is used.

Fig. 6: Parallel-prefix structure with carry save notation

More practically, equations (10) and (11) can be expressed using the symbol "o" introduced by Brent and Kung.
The function of this operator is exactly the same as that of a black cell, as expressed in equation (12).

IV. DIFFERENCE BETWEEN PARALLEL-PREFIX ADDERS AND OTHERS

The pre-computed generate and propagate signals of the PPA are presented in [2]. These signals are combined using the fundamental carry operator (fco) [3], which is denoted by the symbol "o".

V. PARALLEL-PREFIX ADDER STRUCTURE

Parallel-prefix structures are common in high-performance adders because their delay is logarithmically proportional to the adder width [2]. A PPA basically consists of three stages:

• Pre-computation
• Prefix stage
• Final computation

The parallel-prefix structure is shown in Fig. 6.

A. Pre-computation
In the pre-computation stage, the propagate and generate signals are computed for the given inputs using equations (1) and (2).

B. Prefix stage
In the prefix stage, the group generate/propagate signals are computed at each bit using the given equations. The black cell (BC) generates the ordered pair in equation (7); the gray cell (GC) generates only the left signal, following [2].

Fig. 7: Black and gray cell logic definitions

wiring congestion. The operation of the 16-bit Brent Kung adder is given below [3]. This adder uses fewer BCs and GCs than the Kogge

The "o" operation helps in forming the rules for building prefix structures.

C. Final computation
In the final computation, the sum and carry-out are the final outputs, given by equations (12) and (13), where "-1" is the position of the carry input.

The generate/propagate signals can be grouped in different ways to obtain the same correct carries. Based on the different ways of grouping the generate/propagate signals, different prefix architectures can be created. Fig. 7 shows the definitions of the cells that are used in prefix structures, including the BC and GC. For an analysis of various parallel prefix structures, see [2], [3] and [4].
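The three stages can be illustrated with a short behavioural model. The sketch below is in Python rather than the paper's Verilog and uses the standard textbook form of the "o" operator, (g_l, p_l) o (g_r, p_r) = (g_l + p_l·g_r, p_l·p_r), which may differ in indexing from the authors' numbered equations. The names `precompute`, `o` and `prefix_add` are hypothetical, and the prefix stage is deliberately a plain serial chain, not any particular tree.

```python
def precompute(x, y, width):
    """Stage 1: per-bit (generate, propagate) pairs, least significant bit first."""
    return [(((x >> i) & (y >> i)) & 1, ((x >> i) ^ (y >> i)) & 1)
            for i in range(width)]


def o(left, right):
    """Fundamental carry operator; `left` covers the more significant bits.

    This models a black cell. A gray cell would keep only the first element.
    """
    g_l, p_l = left
    g_r, p_r = right
    return g_l | (p_l & g_r), p_l & p_r


def prefix_add(x, y, width=16, carry_in=0):
    """Three-stage addition with a serial prefix stage (for clarity only)."""
    gp = precompute(x, y, width)
    # Stage 2: the carry into bit i is the group generate of bits i-1 down to -1,
    # where position -1 holds the carry input as the pair (carry_in, 0).
    carries = [carry_in]
    group = (carry_in, 0)
    for pair in gp:
        group = o(pair, group)
        carries.append(group[0])
    # Stage 3: each sum bit is the bit propagate XOR the incoming carry.
    total = 0
    for i, (_, p) in enumerate(gp):
        total |= (p ^ carries[i]) << i
    return total, carries[width]
```

Because `o` is associative, that is `o(o(a, b), c) == o(a, o(b, c))` for any pairs `a`, `b`, `c`, the same carries result however the pairs are grouped. That property is what allows the serial chain here to be replaced by the different tree architectures.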
The 16-bit SKA uses black cells and gray cells as well as full adder blocks. This adder computes the carries using the BCs and GCs and terminates with 4-bit RCAs. In total it uses 16 full adders. The 16-bit SKA is shown in figure 4. In this adder, the input bits (a, b) are first converted into propagate and generate signals (p, g). The propagate and generate terms are then given to the BCs and GCs, which propagate the carries in advance. These carries are later given to the full adder blocks.

Fig. 8: 16-bit Kogge Stone adder

Kogge Stone Adder
The Kogge Stone Adder (KSA) has a regular layout, which makes it a favored adder in electronic technology. Another reason the KSA is favored is its minimum fan-out and minimum logic depth. As a result, the KSA is a fast adder but has a large area [9]. The delay of the KSA is equal to log2(n), which is the number of stages of the "o" operator. The KSA has an area (number of "o" operators) of (n*log2(n))-n+1, where n is the number of input bits [2]. The KSA is one of the prefix trees that use the fewest logic levels. A 16-bit KSA is shown in Fig. 8. The 16-bit KSA uses BCs and GCs and does not use full adders. It uses 36 BCs and 15 GCs and operates entirely on generate and propagate blocks, so its delay is less than that of the SKA and STA, which contain full adder blocks [5], [6].

Stone adder and has the better delay performance, which is observed in the Xilinx XPower Analyzer 13.2.
known for its high logic depth with minimum area characteristics [8]. High logic depth here means high fan-out characteristics.

These adders are implemented in Verilog HDL in the Xilinx ISE 13.2 design suite and then verified on a Xilinx Virtex 5 FPGA through the ChipScope analyzer [7], [8], [9]. They were tested using the Xilinx XPower Analyzer 13.2, which allows the adder delays to be measured directly.
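The Kogge Stone structure described earlier in this section can be modelled in the same behavioural style. This is a minimal Python sketch under stated assumptions: a carry input of zero, no distinction between black and gray cells (each is counted as one "o" operator), and a hypothetical function name `kogge_stone_add`. It is not the authors' Verilog.

```python
def kogge_stone_add(x, y, width=16):
    """Return (sum, carry_out, levels, operators) for a Kogge Stone prefix tree."""
    g = [((x >> i) & (y >> i)) & 1 for i in range(width)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(width)]
    half_sum = p[:]              # bit propagate signals are reused for the sum
    levels = operators = 0
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            # combine the group ending at bit i with the group `dist` bits lower
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
            operators += 1
        g, p = new_g, new_p
        levels += 1
        dist *= 2
    # after the last level, g[i] is the carry out of bit i (carry input assumed 0)
    total = half_sum[0]
    for i in range(1, width):
        total |= (half_sum[i] ^ g[i - 1]) << i
    return total, g[width - 1], levels, operators
```

For a 16-bit addition, `kogge_stone_add(0xFFFF, 1)` returns `(0, 1, 4, 49)`: four levels, matching log2(16), and 49 operators, matching (n*log2(n))-n+1 for n = 16.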
The Xilinx XPower Analyzer 13.2 is connected to a PC (Personal Computer) through the Xilinx Virtex 5 FPGA [10]. The test setup is depicted in figure 10.

Brent Kung Adder
The large number of levels in the Brent Kung Adder (BKA), however, reduces its operational speed. The BKA is also power efficient because of its lowest area delay with a large number of input bits [7]. The delay of the BKA is equal to (2*log2(n))-2, which is also the number of stages of the "o" operator. The BKA has an area (number of "o" operators) of (2*n)-2-log2(n), where n is the number of input bits [1]. The BKA also uses BCs and GCs, but fewer than the KSA, so it takes less area to implement. The 16-bit BKA uses 14 BCs and 11 GCs, whereas the Kogge Stone adder uses 36 BCs and 15 GCs. The 16-bit BKA is shown in Fig. 9. The BKA occupies less area than the other three adders (SKA, KSA and STA) because it uses fewer propagate and generate cells. It takes less area to implement than the KSA and has less

Fig. 9: 16-bit Brent Kung adder

VI. METHODOLOGY

Design Specification
• 16-bit full adder
• Algorithm to be adopted

HDL Entry
• Verilog coding for each parallel prefix tree based full adder

Synthesis
• Converting a high-level description of the design into an optimized gate-level representation

Functional Simulation
• To check the adder functionality for each parallel prefix tree based full adder

Comparative analysis of synthesis results for various FPGA chips
• Parameters: number of slices/LUTs and maximum path delay

Power Analysis for Spartan 6 FPGA chip
• Static power comparison for various operating temperatures

VII. DISCUSSION OF RESULT

TABLE: COMPARISON OF SLICE UTILIZATION, NO. OF LOGIC LEVELS REQUIRED & DELAY

Adder name   | No. of slices LUT | No. of logic levels | Delay: Spartan 6 | Delay: Spartan 6 low power | Delay: Virtex 6 | Delay: Virtex 6 low power
Ripple carry | 24                | 10                  | 12.244           | 18.772                     | 4.599           | 4.984
Kogge Stone  | 43                | 10                  | 11.935           | 18.316                     | 4.629           | 4.995
Brent Kung   | 26                | 9                   | 11.447           | 17.506                     | 4.240           | 4.564
Harris       | 39                | 8                   | 10.306           | 15.248                     | 3.512           | 3.821

The delays observed for the adder designs in the Xilinx ISE 13.2 synthesis reports are shown in Fig. 11.

Fig. 11: Simulation results for the adder designs (series: Spartan 6, Spartan 6 low power, Virtex 6, Virtex 6 low power)

The number of slices/LUTs and the number of logic levels observed for the adder designs in the Xilinx ISE 13.2 synthesis reports are compared in Fig. 12.

Fig. 12: Comparative chart for slice utilization (series: slices LUT, no. of logic levels; adders: ripple carry, Kogge Stone, Brent Kung, Harris)

The area of the adder designs, measured in terms of look-up tables (LUTs) and input/output blocks (IOBs) reported by Xilinx ISE 13.2 for the Spartan 6 FPGA chip, is plotted in the figure. According to [1], the ISE software does not give the exact delay of the adders because it is not able to analyze the critical path through the adder. From the comparison table it is clear that, of all the adders, the Harris adder has the least delay. The KSA and BKA have about the same delay. According to the synthesis reports, of the four adders the Harris adder has the best delay because it requires the fewest logic levels, whereas on the basis of delay and area together (slice utilization, i.e. the number of LUTs required) Brent Kung is the best on average.

VIII.
CONCLUSIONS

From the analysis of area and power, we conclude that the efficiency in delay is improved by 6.50 % for the RCA when compared with Brent Kung [2], by 2.53 % for the KSA when compared with [1], and by 18.80 % for the Harris adder, on the Spartan 6 FPGA chip. Harris is therefore the best in terms of delay because it requires the fewest logic levels, whereas on the basis of delay and area together (slice utilization, i.e. the number of LUTs required) Brent Kung is the best on average.

IX. REFERENCES

[1] P. M. Kogge and H. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence relations," IEEE Trans. Computers, vol. C-22, no. 8, pp. 786-793, Aug. 1973.
[2] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Computers, vol. C-31, no. 3, pp. 260-264, Mar. 1982.
[3] D. Harris, "A taxonomy of parallel prefix networks," in Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Nov. 2003, pp. 2213-2217.
[4] R. Ladner and M. Fischer, "Parallel prefix computation," J. ACM, vol. 27, no. 4, pp. 831-838, Oct. 1980.
[5] David H. K. Hoe, Chris Martinez and Sri Jyothsna Vundavalli, "Design and Characterization of Parallel Prefix Adders using FPGAs," 2011 IEEE 43rd Southeastern Symposium, pp. 168-172, 2011.
[6] N. H. E. Weste and D. Harris, CMOS VLSI Design, 4th edition, Pearson-Addison-Wesley, 2011.
[7] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, pp. 260-264, 1982.
[8] D. Gizopoulos, M. Psarakis, A. Paschalis, and Y. Zorian, "Easily Testable Cellular Carry Lookahead Adders," Journal of Electronic Testing: Theory and Applications 19, 285-298, 2003.
[9] T. Lynch and E. E. Swartzlander, "A Spanning Tree Carry Lookahead Adder," IEEE Trans. on Computers, vol. 41, no. 8, pp. 931-939, Aug. 1992.
[10] Beaumont-Smith, A, Cheng-Chew Lim, "Parallel prefix adder design," Computer Arithmetic, 2001.
Proceedings. 15th IEEE Symposium, pp. 218-225, 2001.
[11] K. Vitoroulis and A. J. Al-Khalili, "Performance of Parallel Prefix Adders Implemented with FPGA technology," IEEE Northeast Workshop on Circuits and Systems, pp. 498-501, Aug. 2007.
[12] S. Xing and W. W. H. Yu, "FPGA Adders: Performance Evaluation and Optimal Design," IEEE Design & Test of Computers, vol. 15, no. 1, pp. 24-29, Jan. 1998.
[13] Ladner R, Fischer M, "Parallel prefix computation," J. ACM, vol. 27, no. 4, pp. 831-838, Oct. 1980.
[14] Han T, Carlson D, "Fast area-efficient VLSI adders," Proc. 8th Symp. Comp. Arit., pp. 49-56, Sep. 1987.
[15] Jianhua Liu, Haikun Zhu, Chung-Kuan Cheng, John Lillis, "Optimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space," Proceedings of the 2007 Asia and South Pacific Design Automation Conference, Washington, pp. 609-615, Jan. 2007.
[16] Taeko Matsunaga and Yusuka Matsunaga, "Timing-Constrained Area Minimization Algorithm for Parallel Prefix Adders," IEICE Trans. Fundamentals, vol. E90-A, no. 12, Dec. 2007.
[17] Taeko Matsunaga, Shinji Kimura and Yusuka Matsunaga, "Synthesis of parallel prefix adders considering switching activities," IEEE International Conference on Computer Design, pp. 404-409, 2008.
[18] Giorgos Dimitrakopoulos and Dimitric Nikolos, "High Speed Parallel-Prefix VLSI Ling Adders," IEEE Trans. on Computers, vol. 54, no. 2, Feb. 2005.
[19] V. Choi and E. E. Swartzlander, Jr., "Parallel Prefix adder design with matrix representation," in Proc. 17th IEEE Symp. Comput. Arithmetic (ARITH), pp. 90-98, 2005.
[20] John F. Wakerly, Digital Design Principles and Practices, 4th Edition, Pearson Education, 2009.