International Journal of Engineering Trends and Technology- Volume3Issue5- 2012

Hardware Efficient VLSI Adder For Low Power Applications

Nune.Radika1, K.Rajasekhar2
1. Department of E.C.E, Akula Sree Ramulu Engineering College, Tetali
2. H.O.D, Department of E.C.E, Akula Sree Ramulu Engineering College, Tetali

Abstract— Carry Select Adder (CSLA) is one of the fastest adders used in many data-processing processors to perform fast arithmetic functions. From the structure of the CSLA, it is clear that there is scope for reducing the area and power consumption in the CSLA. This work uses a simple and efficient gate-level modification to significantly reduce the area and power of the CSLA. Based on this modification, 8-, 16-, 32-, and 64-b square-root CSLA (SQRT CSLA) architectures have been developed and compared with the regular SQRT CSLA architecture. The proposed design has reduced area and power as compared with the regular SQRT CSLA, with only a slight increase in the delay. This work evaluates the performance of the proposed designs in terms of delay, area, power, and their products by hand with logical effort and through custom design and layout in 0.18-µm CMOS process technology. The result analysis shows that the proposed CSLA structure is better than the regular SQRT CSLA. A simple approach is proposed in this paper to reduce the area and power of the SQRT CSLA architecture. The reduced number of gates of this work offers a great advantage in the reduction of area and also of the total power. The compared results show that the modified SQRT CSLA has a slightly larger delay (only 3.76%), but the area and power of the 64-b modified SQRT CSLA are significantly reduced, by 17.4% and 15.4% respectively. The power-delay product and also the area-delay product of the proposed design show a decrease for 16-, 32-, and 64-b sizes, which indicates the success of the method and not a mere tradeoff of delay for power and area.
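The product claim can be checked directly from the 64-b percentages quoted in the abstract. The sketch below is only an arithmetic illustration: it assumes the three percentages are relative changes against the regular SQRT CSLA, and the helper name `product_change` is hypothetical.

```python
# Relative changes for the 64-b modified SQRT CSLA, as quoted in the abstract.
delay_change = +0.0376   # delay is 3.76% larger
area_change = -0.174     # area is 17.4% smaller
power_change = -0.154    # power is 15.4% smaller


def product_change(a, b):
    """Relative change of a product whose two factors change by a and b."""
    return (1.0 + a) * (1.0 + b) - 1.0


pdp_change = product_change(power_change, delay_change)  # power-delay product
adp_change = product_change(area_change, delay_change)   # area-delay product

print(f"PDP change: {pdp_change * 100:.2f}%")
print(f"ADP change: {adp_change * 100:.2f}%")
```

Both products come out negative (roughly a 12% and a 14% reduction), which is consistent with the statement that the gain is not a mere tradeoff of delay for power and area.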
Keywords— Modelsim 6.4c, Xilinx ISE 10.1i, Binary to excess3 converter, carry look ahead adder

ISSN: 2231-5381 http://www.internationaljournalssrg.org

I. INTRODUCTION

Design of area- and power-efficient high-speed data path logic systems is one of the most substantial areas of research in VLSI system design. In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder. The sum for each bit position in an elementary adder is generated sequentially only after the previous bit position has been summed and a carry propagated into the next position.

The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [1]. However, the CSLA is not area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate the partial sum and carry for both possible values of the carry input; the final sum and carry are then selected by the multiplexers (mux). The basic idea of this work is to use a Binary to Excess-1 Converter (BEC) instead of an RCA in the regular CSLA to achieve lower area and power consumption [2]–[4]. The main advantage of this BEC logic comes from its lesser number of logic gates than the 3-bit Full Adder (FA) structure. The details of the BEC logic are discussed below.

This brief is structured as follows. Section II deals with the delay and area evaluation methodology of the basic adder blocks. Here we present the detailed structure and the function of the BEC logic. The SQRT CSLA has been chosen for comparison with the proposed design as it has a more balanced delay and requires lower power and area [5], [6].

[...] carry bits before the sum, which reduces the wait time to calculate the result of the larger value bits. We will start by explaining the operation of a one-bit full adder, which will be the basis for [...]

A. BEC

As stated above, the main idea of this work is to use a BEC instead of the RCA in order to reduce the area and power consumption of the regular CSLA. To replace the 4-bit RCA, a 4-bit BEC is required. The structure and the function table of a 4-b BEC are shown in Fig. and Table II, respectively. Fig. 3 illustrates how the basic function of the CSLA is obtained by using the 4-bit BEC together with the mux. One input of the 8:4 mux receives the direct inputs (B3, B2, B1, and B0), and the other input of the mux is the BEC output. This produces the two possible partial results in parallel, and the mux is used to select either the BEC output or the direct inputs according to the control signal Cin. The importance of the BEC logic stems from the large silicon area reduction when CSLAs with a large number of bits are designed. The Boolean expressions of the 4-bit BEC are listed as (note the functional symbols: ~ NOT, & AND, ^ XOR)

X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)

Fig. 4-bit BEC.

B. METHODOLOGY OF REGULAR 16-B SQRT CSLA

The structure of the 16-b regular SQRT CSLA is shown in Fig. 4. It has five groups of different-size RCAs. The delay and area evaluation of each group is shown in Fig. 5, in which the numerals within [] specify the delay values, e.g., sum2 requires 10 gate delays. The steps leading to the evaluation are as follows. Except for group2, the arrival time of the mux selection input is always greater than the arrival time of the data outputs from the RCAs. Thus, the delay of group3 to group5 is determined by the arrival time of the mux selection input and the mux delay.

C. METHODOLOGY OF THE BASIC ADDER BLOCKS

The AND, OR, and Inverter (AOI) implementation of an XOR gate is shown in Fig. The gates between the dotted lines perform their operations in parallel, and the numeric label on each gate indicates the delay contributed by that gate. The delay and area evaluation methodology considers all gates to be made up of AND, OR, and Inverter gates, each having delay equal to 1 unit and area equal to 1 unit.
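The BEC function and the unit-gate cost model described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the Boolean expressions are the standard excess-1 form written with NOT, AND and XOR, the mux polarity (Cin = 1 selects the BEC output) is assumed, and the gate-level decompositions of the XOR, mux, HA and FA are one common choice. The names `bec4`, `bec_with_mux` and the cost tuples are hypothetical.

```python
def bec4(b):
    """4-bit Binary to Excess-1 Converter using only NOT, AND and XOR."""
    b0, b1, b2, b3 = b & 1, (b >> 1) & 1, (b >> 2) & 1, (b >> 3) & 1
    x0 = b0 ^ 1                 # X0 = ~B0
    x1 = b0 ^ b1                # X1 = B0 ^ B1
    x2 = b2 ^ (b0 & b1)         # X2 = B2 ^ (B0 & B1)
    x3 = b3 ^ (b0 & b1 & b2)    # X3 = B3 ^ (B0 & B1 & B2)
    return x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)


def bec_with_mux(b, cin):
    """8:4 mux stage. Assumed polarity: Cin = 0 passes B, Cin = 1 selects BEC(B)."""
    return bec4(b) if cin else b


# Unit-gate model: every AND, OR and inverter costs delay 1 and area 1.
# Each entry is (delay, area); the decompositions are assumptions of this sketch.
XOR = (3, 5)                        # inverter -> AND -> OR; 2 INV + 2 AND + 1 OR
MUX2 = (3, 4)                       # inverter -> AND -> OR; 1 INV + 2 AND + 1 OR
HA = (max(XOR[0], 1), XOR[1] + 1)   # sum = XOR, carry = AND, computed in parallel
FA = (2 * XOR[0], 2 * XOR[1] + 3)   # two cascaded XORs; carry adds 2 AND + 1 OR
BEC4_AREA = 1 + 3 * XOR[1] + 2      # 1 INV, 3 XOR, 2 AND (B0 & B1 is reused for X3)

# The BEC must add one (modulo 16) to every 4-bit input.
assert all(bec4(b) == (b + 1) % 16 for b in range(16))
print("HA", HA, "FA", FA, "BEC4 area", BEC4_AREA)
```

Because the BEC needs no carry chain of full adders, its gate count grows more slowly than that of the RCA it stands in for, which is where the area saving comes from.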
We then add up the number of gates in the longest path of a logic block that contributes to the maximum delay. The area evaluation is done by counting the total number of AOI gates required for each logic block. Based on this approach, the CSLA adder blocks of 2:1 mux, Half Adder (HA), and FA are evaluated.

Fig. BEC with 8:4 mux.

Fig. Regular 16-b SQRT CSLA.

D. METHODOLOGY OF MODIFIED 16-B SQRT CSLA

The structure of the proposed 16-b SQRT CSLA, which uses a BEC in place of an RCA to optimize the area and power, is shown in Fig. 6. We again split the structure into five groups. The delay and area estimation of each group is shown in Fig. The steps leading to the evaluation are given here.

1) In group2, instead of another 2-b RCA, a 3-b BEC is used, which adds one to the output of the 2-b RCA. Thus, sum3 and the final carry (output from the mux) depend on the partial sum and the mux, and on the partial carry (input to the mux) and the mux, respectively. sum2 depends on the mux selection input and the mux.

2) For the remaining groups, the arrival time of the mux selection input is always greater than the arrival time of the data inputs from the BECs. Thus, the delay of the remaining groups depends on the arrival time of the mux selection input and the mux delay.

Fig. The parallel RCA with BEC.

II. IMPLEMENTATION OF ALGORITHM

A primary objective of this project was to develop a synthesizable model for the AES128 encryption algorithm. Synthesis is the process of converting the register transfer level (RTL) representation of a design into an optimized gate-level netlist. This is a major step in the ASIC design flow that takes an RTL model closer to a low-level hardware implementation. Synthesis consists of three main steps. The first step is “translation”, which involves converting the RTL description of a design into a non-optimized intermediate representation that is used by the synthesis tool. The second step is “logic optimization”, which optimizes the internal representation by removing redundant logic and performing Boolean logic optimizations. The third step is “technology mapping & optimization”, which maps the internal representation to an optimized gate-level representation using the technology library cells, based on design constraints [3].

A. SYNTHESIS METHODOLOGY:

The first step in the synthesis process is to read all the components in the design hierarchy. There are three components in the 3-level design hierarchy that need to be synthesized. Since the RTL model utilizes a Verilog “Package”, the synthesis tool needs to enable the semantics of a package. In addition, the synthesis tool needs to know if there are multiple instances of calling an automatic function in the design, to preserve separate values for each instance. After reading the design files, they are “Analyzed” and “Elaborated”, through which the RTL code is converted into the Synopsys Design Compiler (SDC) internal format [6]. The intermediate results are stored in the defined “working library”.

After this step, a 40MHz clock signal is applied to the clock port of the root module, and the synthesis tool is programmed not to modify the clock tree during the optimization phase. In addition, an arbitrary input delay of 5ns with respect to the clock port is applied to all input and output ports (except the clock port itself) to set a safe margin by considering any unintended source of delay, such as the delay associated with the driving module or modules. Then, the design is constrained with a hypothetical maximum area equal to zero to force the tool to make the gate-level netlist as compact as possible.

In the next steps, the tool is programmed to consider a unique design for each cell instance by removing the multiply-instantiated hierarchy in the current design. Then, the synthesis script removes the boundaries from all the components in the design hierarchy and removes all levels of hierarchy. Finally, the tool compiles the design with high effort and reports any warning related to the mapping and final optimization step. At the end, the tool generates reports for the optimized gate-level netlist area, the worst combinational path timing, and any violated design constraint.

Figure RTL Schematic report

B. SYNTHESIS TIMING RESULT:

The synthesis tool optimizes the combinational paths in a design. In general, four types of combinational paths can exist in any design [3]:

1) Input port of the design under test to input of one internal flip-flop
2) Output of an internal flip-flop to input of another flip-flop
3) Output of an internal flip-flop to output port of the design under test
4) A combinational path connecting the input and output ports of the design under test

The last DC command in the script developed in the previous section instructs the tool to report the path with the worst timing. In this case, the path with the worst timing is a combinational path of type two. The delay associated with this path is the summation of the delays of all combinational gates in the path plus the Clock-To-Q delay of the originating flip-flop, which was calculated as 24.09ns. By considering the setup time of the destination flip-flop in this path, which is 0.85ns, the 40MHz clock signal (25ns period) satisfies the worst combinational path delay. The delays of combinational gates, the setup time of flip-flops, and the Clock-To-Q values are derived from the LSI_10k library file that was used for the mapping step during synthesis. The synthesis timing report is shown below:

Figure. Synthesis timing report.
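The clock-period check described in the timing result can be reproduced by hand. The sketch below uses only the three figures quoted in the text (40MHz, 24.09ns, 0.85ns); it assumes an ideal clock with no skew or uncertainty, and the function names are hypothetical rather than part of any tool.

```python
CLOCK_FREQ_MHZ = 40.0    # clock applied to the root module
PATH_DELAY_NS = 24.09    # combinational gates plus Clock-To-Q of the launching flip-flop
SETUP_NS = 0.85          # setup time of the destination flip-flop


def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def setup_slack_ns(period_ns, path_delay_ns, setup_ns):
    """Positive slack means the register-to-register path meets the clock period."""
    return period_ns - (path_delay_ns + setup_ns)


period = clock_period_ns(CLOCK_FREQ_MHZ)
slack = setup_slack_ns(period, PATH_DELAY_NS, SETUP_NS)
max_freq_mhz = 1000.0 / (PATH_DELAY_NS + SETUP_NS)

print(f"period = {period:.2f} ns, slack = {slack:.2f} ns")
print(f"maximum clock frequency = {max_freq_mhz:.2f} MHz")
```

The margin is small (about 0.06ns), so under these assumptions the 40MHz constraint is met but sits close to the limit of this path.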
FIGURE. SIMULATED OUTPUT.

C. SYNTHESIS AREA RESULT:

The synthesis area report shows the total number of cells and nets in the netlist. It also uses the area parameter associated with each cell in the LSI_10K library file to calculate the total combinational and sequential area of the netlist. The total area of the gate-level netlist is unknown since it depends on the total area of the interconnects, which itself is a function of the wiring load model used in physical design. The total cell area in the netlist is reported as 22978 units, which is the sum of the combinational and sequential areas. The synthesis area report is shown below:

Figure. Synthesis area report.

III. CONCLUSION

In this paper we proposed a new adder in order to reduce the area and power of the SQRT CSLA architecture. The reduced number of gates of this work offers a great advantage in the reduction of area and also of the total power. The compared results show that the modified CSLA has lower area and power than the regular SQRT CSLA. The power-delay product and also the area-delay product of the proposed design show a decrease for 16-, 32-, and 64-b sizes, which indicates the success of the method and not a mere tradeoff of delay for power and area. The modified CSLA architecture is therefore low area, low power, simple, and efficient for VLSI hardware implementation.

REFERENCES

[1] A. Avizienis, “Signed Digit Number Representation for Fast Parallel Arithmetic,” IRE Trans. Electron. Computers, vol. EC-10, pp. 389–400, 1961.
[2] R. I. Hartley, “Subexpression sharing in filters using canonic signed digit multipliers,” IEEE Trans. Circuits Syst. II, vol. 43, no. 10, pp. 677–688, Oct. 1996.
[3] T. Solla and O. Vainio, “Comparison of programmable FIR filter architectures for low power,” in Proc. 28th Eur. Solid-State Circuits Conf., Firenze, Italy, pp. 759–762, Sep. 2002.
[4] K. H. Chen and T. D. Chiueh, “A low-power digit-based reconfigurable FIR filter,” IEEE Trans. Circuits Syst. II, vol. 53, no. 8, pp. 617–621, Aug. 2006.
[5] T. Zhangwen, J. Zhang, and H. Min, “A high-speed, programmable, CSD coefficient FIR filter,” IEEE Trans. Consumer Electron., vol. 48, no. 4, pp. 834–837, Nov. 2002.
[6] J. Park, et al., “Computation Sharing Programmable FIR Filter for Low-Power and High-Performance Applications,” IEEE J. Solid state Cir. Sys., vol. 39, no. 2, pp. 348–357, Feb. 2004.
[7] R. Mahesh and A. P. Vinod, “New Reconfigurable Architectures for Implementing FIR Filters with Low Complexity,” IEEE Trans. CAD of Int. Cir., vol. 29, no. 2, pp. 275–288, Feb. 2010.
[8] K. Dabbagh-Sadeghipour, A. Aghagolzadeh, “A New Hardware Efficient, Low Power Fir Digital Filter Implementation Approach,” Proc. IEEE ICECS,