International Journal of Engineering Trends and Technology (IJETT) – Volume 13 Number 8 – Jul 2014
ISSN: 2231-5381, http://www.ijettjournal.org

Multi Purpose and Efficient Data Transferring FPGA Based Applications

M Sudhakar Reddy (1), K. S. N. Vittal (2)
(1) M.Tech (VLSI&ES), QIS Institute of Technology, A.P., India
(2) Assistant Professor, Dept. of ECE, QIS Institute of Technology, A.P., India

ABSTRACT: In this project, asynchronous FPGA blocks are designed and implemented with power optimization techniques. Standby and dynamic power consumption are studied under various gating techniques. In the existing approach, standby power is reduced by autonomous fine-grain power gating, and dynamic power is reduced by the level-encoded dual-rail (LEDR) architecture. This paper presents the circuit design of a low-power delay buffer. The proposed delay buffer uses several new techniques to reduce the power consumed in the lookup table of the FPGA. Since delay buffers are accessed sequentially, the design adopts a ring-counter addressing scheme. In the ring counter, double-edge-triggered (DET) flip-flops are used to reduce the operating frequency by half, and a C-element gated-clock strategy is proposed. The gated-driver-tree idea is also employed at the input and output ports of the memory block to decrease their loading, thus saving even more power.

KEYWORDS: Level encoding dual rail (LEDR), fine grain power gating, gated driver tree, C-element, delay buffer, gated clock, ring counter.

I. INTRODUCTION
Due to the dramatic increase in portable and battery-operated applications, lower power consumption has become a necessity in order to prolong battery life. Power consumption is an important factor in determining the size, weight, and efficiency of the end product. Selecting an appropriate FPGA architecture is critical to achieving the best static and dynamic power consumption. Flash-based FPGAs, such as those from Microsemi, are marketed as low-power leaders in the industry. In addition to exploiting the low-power attributes of flash-based FPGAs, several design techniques can be deployed to further reduce overall power.

The important FPGA power components to consider are:
• Power-up (inrush power): the amount of power drawn by the device during power-up.
• Configuration power: the amount of power required while the FPGA is loaded upon power-up (specific to SRAM-based programmable logic devices).
• Static (standby) power: the amount of power the device consumes when it is powered up but not actively performing any operation.
• Dynamic (active) power: the amount of power the device consumes when it is actively operating.
• Sleep power (low-power mode): some FPGA devices offer low-power or sleep modes. In some cases, this may be different from static power.

This work focuses mainly on reducing the dynamic power. In FPGA design, clock gating and power gating are both important. To implement clock gating, circulation is employed. The idea of circulation is to retain the contents of the flip-flop in the sleep state. Circulation can reduce the dynamic power consumption of the registers and of the gates in the fan-out of the registers. However, it cannot reduce the standby power consumption of the clock network. Standby power is a serious problem because an FPGA has an enormously large number of transistors to achieve its programmability. Low-cost FPGAs consume up to hundreds of milliwatts. Power gating has emerged as the most effective design technique for achieving low standby power. Power gating techniques are based on selectively setting the functional units into a low-leakage mode when they are inactive.

Currently, most circuits adopt static random access memory (SRAM) plus some control/addressing logic to implement delay buffers. For shorter delay buffers, a shift register can be used instead. The former approach is convenient since SRAM compilers are readily available and are optimized to generate memory modules with low power consumption, high operating speed and a compact cell size. Previously, a simplified and thus lower-power sequential addressing scheme for SRAM-based delay buffers was proposed. In this work, double-edge-triggered (DET) flip-flops are used instead of traditional D flip-flops (DFFs) in the ring counter to halve the operating clock frequency. C-elements are used instead of R–S flip-flops in the control logic that generates the clock-gating signals, which avoids increasing the loading of the global clock signal. We also propose gating the drivers in the clock tree. This technique greatly decreases the loading on the distribution network of the clock signal for the ring counter and thus the overall power consumption.

II. FIELD PROGRAMMABLE GATE ARRAYS
The ability to reconfigure the functionality implemented on a chip gives a unique advantage to a designer who builds a system on an FPGA. It reduces the time to market and significantly reduces the cost of production.

Field Programmable Gate Arrays are two-dimensional arrays of logic blocks and flip-flops with electrically programmable interconnections between the logic blocks. The interconnections consist of electrically programmable switches. This is how an FPGA differs from a custom IC, which is programmed using integrated circuit fabrication technology to form metal interconnections between logic blocks. In an FPGA, logic blocks are implemented using multiple levels of low fan-in gates, which gives a more compact design than an implementation with two-level AND-OR logic. A logic block of an FPGA can be configured to provide functionality as simple as that of a transistor or as complex as that of a microprocessor.
A logic block can be used to implement different combinations of combinational and sequential logic functions. Logic blocks of an FPGA can be implemented by any of the following:
• Transistor pairs
• Combinational gates such as basic NAND gates or XOR gates
• N-input lookup tables
• Multiplexers
• Wide fan-in AND-OR structures

Routing in FPGAs consists of wire segments of varying lengths which can be interconnected via electrically programmable switches. The density of logic blocks in an FPGA depends on the length and number of wire segments used for routing. The number of segments used for interconnection is typically a tradeoff between the density of logic blocks and the area used up for routing.

Fig1: FPGA Architecture

III. ASYNCHRONOUS ARCHITECTURE DESIGN
Most digital circuits designed and fabricated today are “synchronous”. In essence, they are based on two fundamental assumptions that greatly simplify their design: (1) all signals are binary, and (2) all components share a common and discrete notion of time, as defined by a clock signal distributed throughout the circuit. Asynchronous circuits are fundamentally different; they also assume binary signals, but there is no common and discrete time. Instead, the circuits use handshaking between their components in order to perform the necessary synchronization, communication, and sequencing of operations. Expressed in ‘synchronous terms’, this results in a behavior that is similar to systematic fine-grain clock gating with local clocks that are not in phase and whose period is determined by actual circuit delays: registers are only clocked where and when needed. This difference gives asynchronous circuits inherent properties that can be (and have been) exploited to advantage in the areas listed and motivated below.

• Low power consumption, due to fine-grain clock gating and zero standby power consumption.
• High operating speed: operating speed is determined by actual local latencies rather than global worst-case latency.
• Less emission of electromagnetic noise: the local clocks tend to tick at random points in time.
• Robustness towards variations in supply voltage, temperature, and fabrication process parameters: timing is based on matched delays (and can even be insensitive to circuit and wire delays).
• Better composability and modularity, because of the simple handshake interfaces and the local timing.
• No clock distribution and clock skew problems: there is no global signal that needs to be distributed with minimal phase skew across the circuit.

The asynchronous architecture detects the activity of a power-gated domain. Its tasks are:
• To determine when a logic block is in the standby state, the sleep state, or the active state
• To compare the phase of the input data and the output data
• To determine the function of the lookup table

To reduce dynamic power, dual-rail encoding [2] and the level-encoded dual-rail architecture [3] have been introduced. To reduce standby power, the autonomous fine-grain power gating technique [3] has been introduced. The registers store the data value and pass the output to the switch block. The sleep controller wakes up the successive block when it receives data. The switch block consists of pass-transistor switches. In a switch block, a wire-set consists of four wires, so there are four signals in the switch block.

IV. AUTONOMOUS FINE GRAIN POWER GATING
Fig2: Control Strategy of the Power Gating Method

An efficient control strategy for autonomous fine-grain power gating is shown in Fig2.
The standby state is used to do the following:
• Wake up the logic block (LB) before the data arrives
• Power OFF the LB only when no data has arrived for quite a while

The use of the standby state has two major advantages. First, the wake-up time can be hidden, since the LB has already been woken up when the data arrives. Second, dynamic power can be saved, since the number of unnecessary switching operations of the sleep transistor is reduced [3].

V. LEDR ENCODING
Fig3: LEDR encoding data transmission

The asynchronous FPGA is based on LEDR encoding. LEDR is one of several two-phase dual-rail encodings. In LEDR encoding, no spacer is required, as shown in Fig3. This results in high throughput and low dynamic power consumption because the number of signal transitions is reduced by half [3].

Table 1 shows the code table of LEDR encoding. In LEDR encoding, each data value has two types of code words with different phases. In the example, the data values "0", "0", and "1" are transferred. The main feature is that the sender sends data values alternately in phase 0 and phase 1. Because no spacer is required, the number of signal transitions is half that of four-phase dual-rail encoding. As a result, the throughput is high and the power consumption is small. Based on this observation, LEDR encoding is employed in the FPGA for implementing the asynchronous architecture to reduce the dynamic power.

Table 1: Code table of LEDR encoding

The four signals of the switch block are:
• Data signal (first bit)
• Data signal (second bit)
• Acknowledgement signal
• Data arrival signal

Of these four signals, the acknowledgement signal and the data arrival signal are connected to the previous logic block. Two pass switches are used for the four wires of the wire-set (the Va, Ra, ack and wakeup signal wires). The pass switches are controlled by the same memory bits.

VI. LOGIC BLOCK DESIGN
Fig4: Logic block design

The logic block contains a lookup table, a sleep controller, registers and programmable delay elements, together with the C-element with ring counter and the gated driver tree. The C-element with ring counter is described in Section VII.

A. Look up table design
This architecture contains four sub-modules. Each sub-module consists of a decoder, a multiplexer and memory bits. The decoder is built from two four-input AND gates. The output of the decoder is given to the multiplexer. The multiplexer is built from four pass transistors and one inverter. The decoders exclude invalid input patterns, that is, patterns whose inputs have different phases. Only valid data are fed to the multiplexer. As a result, the number of multiplexers and the transistor count are reduced compared to the multiplexer-type lookup table. The decoder and multiplexer based lookup table is shown in Fig5.

Fig5: Block Diagram of the Lookup Table

If the combination of inputs is invalid (i.e., the two inputs have different phases), all pass transistors turn OFF according to the output of the decoder, and the previous outputs stored in the latches are held. If the input pattern is valid (i.e., the two inputs have the same phase), the corresponding pass transistors turn ON. The value of the selected memory bit becomes the output, and the outputs are stored in the latches.

The detailed structure of the decoder and multiplexer based lookup table is shown in Fig6. In that diagram, AND gates are used for the decoder and pass transistors are used for the multiplexer logic.

Fig6: The Detail Structure of Look up Table

VII. PROPOSED DESIGN
A. Delay buffers
In the proposed delay buffer, several power reduction techniques are adopted. These circuit techniques are mainly designed to decrease the loading on high fan-out nets, e.g., the clock and the read/write ports. In the control logic, the R–S flip-flop is replaced by a C-element. In addition, the operating frequency is reduced to half by using DET flip-flops.
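As an illustration of ring-counter addressing with double-edge triggering, the following Python sketch models a delay buffer behaviourally. It is a minimal sketch under stated assumptions, not the circuit itself: the class name RingCounterDelayBuffer, the method on_clock_edge and the buffer length of 4 are hypothetical, and one sample is assumed to be processed on every clock edge (rising or falling), so that the clock runs at half the sample rate.

```python
class RingCounterDelayBuffer:
    """Behavioural model of a delay line addressed by a one-hot ring counter.

    Assumption (not from the paper): one sample is handled per clock EDGE,
    which mimics double-edge-triggered (DET) flip-flops in the ring counter.
    """

    def __init__(self, length):
        self.cells = [0] * length                 # memory cells, initially zero
        self.token = [1] + [0] * (length - 1)     # one-hot ring counter state
        self.clock_edges = 0                      # rising + falling edges seen

    def on_clock_edge(self, sample):
        """Process one sample on a rising or falling clock edge."""
        active = self.token.index(1)              # cell selected by the '1' token
        delayed = self.cells[active]              # read the oldest stored sample
        self.cells[active] = sample               # overwrite it with the new one
        # Rotate the '1' token to the next cell (sequential addressing).
        self.token = [self.token[-1]] + self.token[:-1]
        self.clock_edges += 1
        return delayed

    @property
    def clock_cycles(self):
        """Full clock cycles elapsed: two edges per cycle."""
        return self.clock_edges / 2


if __name__ == "__main__":
    buf = RingCounterDelayBuffer(length=4)
    outputs = [buf.on_clock_edge(x) for x in range(1, 9)]
    print(outputs)            # [0, 0, 0, 0, 1, 2, 3, 4]
    print(buf.clock_cycles)   # 4.0
```

In this sketch, eight samples pass through the buffer in four clock cycles, and each output equals the input delayed by the buffer length. No address decoder is modelled, because the position of the '1' token selects the cell directly.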
The major advantage of the C-element is that its output is free of glitches, which is essential for a clock-gating signal. Since the DFFs are replaced by DET flip-flops to run the ring counter at half speed, the gating on–off condition needs to be revised. When the input of the last DET flip-flop in the previous block makes a transition from ‘0’ to ‘1’, the clock signal in the current block is enabled. When the output of the first DET flip-flop in the next block rises from ‘0’ to ‘1’, both inputs of the C-element go to ‘0’ and the clock is turned off in the current block.

Fig7: Diagram of ring counter with clock gated by C-Elements

B. Gated driver tree
We propose applying gating to the driver tree network that delivers the global clock signal to all blocks. Since at most two blocks need the global clock signal at any time, only those drivers along the path from the clock source to the blocks that need the global clock are activated, as shown in Figure 8. The ‘gate’ signal for those drivers can reuse the clock-gating signals of the blocks they drive. Thus, the ‘gate’ signal of a driver should be asserted when the active cell (whose output is ‘1’) in the ring counter is one of its descendants in the quaternary driver tree. Given M blocks, each containing several DET flip-flops, only the drivers on the active path are enabled instead of activating all drivers.

C. C-Element for dynamic power reduction
The Muller C-element, or Muller C-gate, is a commonly used asynchronous logic component originally designed by David E. Muller. It applies logical operations to its inputs and has hysteresis. The output of the C-element reflects the inputs when the states of all inputs match. The output then remains in this state until all inputs transition to the other state.
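The hold behaviour (hysteresis) of a two-input C-element can be sketched in Python as follows. This is a behavioural model only, assuming ideal inputs and an initial output of 0; the class name CElement and the method update are hypothetical and are not part of the design described here.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.q = initial          # stored output state (assumed to start at 0)

    def update(self, a, b):
        """Apply inputs a and b (each 0 or 1) and return the output."""
        if a == b:
            self.q = a            # inputs agree: output follows them
        # Inputs disagree: the previous output is held (hysteresis).
        return self.q


if __name__ == "__main__":
    c = CElement()
    stimulus = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
    print([c.update(a, b) for a, b in stimulus])   # [0, 0, 1, 1, 0]
```

Because the output changes only when both inputs agree, a change on a single input does not propagate to the output, which is consistent with the glitch-free property that makes the C-element suitable for generating a clock-gating signal.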
Fig9: C-Element logic diagram

This model can be extended to the asymmetric C-element, where some inputs only affect the operation in one of the transitions (positive or negative).

A   B   Q
0   0   0
0   1   Q(t-1)
1   0   Q(t-1)
1   1   1

Table2: Truth Table for C-Element

If both inputs are 0, the pull-up network changes the latch's state, and the C-element outputs a 0. If both inputs are 1, the pull-down network changes the latch's state, making the C-element output a 1. Otherwise, the input of the latch is not connected to either Vdd or ground, so the weak inverter (drawn smaller in the diagram) dominates and the latch outputs its previous state.

Fig8: Clock driver tree and gating signal

VIII. RESULTS
A. Simulations
Fig10: Input simulation for logic block in FPGA
Fig11: Output simulation for logic block in FPGA

B. Synthesis Report

Parameter                      Existing method   Proposed method
Power                          702 mW            260 mW
Total equivalent gate count    12,052 cells      10,215 cells
Latency                        31.80 ns          29.00 ns

IX. CONCLUSION
In this paper, we presented a low-power asynchronous FPGA architecture which adopts several novel techniques to reduce power consumption. The ring counter with its clock gated by C-elements can effectively eliminate excessive data transitions without increasing the loading on the global clock signal. The gated-driver-tree technique used for the clock distribution network can eliminate the power wasted on drivers that need not be activated.
In addition, a gated-demultiplexer tree and a gated-multiplexer tree are used for the input and output driving circuitry to decrease the loading of the input and output data buses. All gating signals are easily generated by a C-element taking inputs from some DET flip-flop outputs of the ring counter.

REFERENCES
[1] W. Eberle et al., “80-Mb/s QPSK and 72-Mb/s 64-QAM flexible and scalable digital OFDM transceiver ASICs for wireless local area networks in the 5-GHz band,” IEEE J. Solid-State Circuits, vol. 36, no. 11, pp. 1829–1838, Nov. 2001.
[2] M. L. Liou, P. H. Lin, C. J. Jan, S. C. Lin, and T. D. Chiueh, “Design of an OFDM baseband receiver with space diversity,” IEE Proc. Commun., vol. 153, no. 6, pp. 894–900, Dec. 2006.
[3] N. Rajagopala Krishnan and K. Sivasuparamanyan, “A Reconfigurable Low Power FPGA Design with Autonomous Power Gating and LEDR Encoding,” 978-1-4673-46030/12/$31.00 ©2012 IEEE.
[4] G. Pastuszak, “A high-performance architecture for embedded block coding in JPEG 2000,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 9, pp. 1182–1191, Sep. 2005.
[5] W. Li and L. Wanhammar, “A pipeline FFT processor,” in Proc. Workshop Signal Process. Syst. Design Implement, 1999, pp. 654–662.
[6] E. K. Tsern and T. H. Meng, “A low-power video-rate pyramid VQ decoder,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1789–1794, Nov. 1996.
[7] N. Shibata, M. Watanabe, and Y. Tanabe, “A current-sensed high-speed and low-power first-in-first-out memory using a wordline/bitline-swapped dual-port SRAM cell,” IEEE J. Solid-State Circuits, vol. 37, no. 6, pp. 735–750, Jun. 2002.
[8] I. E. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, no. 6, pp. 720–738, Jun. 1989.
[9] R. Hossain, L. D. Wronski, and A. Albicki, “Low power design using double edge triggered flip-flops,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 2, no. 2, pp. 261–265, Jun. 1994.
[10] K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, and M. Bohr, “SRAM design on 65-nm CMOS technology with dynamic sleep transistor for leakage reduction,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 895–901, Apr. 2005.
[11] Masanori Hariyama, Shota Ishihara, Chang Chia Wei, and Michitaka Kameyama, “A field-programmable VLSI based on an asynchronous bit-serial architecture,” A-SSCC, pp. 380–383, 200

M Sudhakar Reddy was born in A.P., India. He received the B.Tech degree in Electronics & Communications Engineering from Jawaharlal Nehru Technological University in 2012. Presently he is pursuing an M.Tech in VLSI & Embedded Systems at QIS Institute of Technology. His research interests include low power design.

K. S. N. Vittal was born in A.P., India. He received the B.Tech degree in Electronics & Communications Engineering from Jawaharlal Nehru Technological University in 2008. He received the M.Tech degree from K L University in 2012. His research interests include analog design and low power design. Presently he is working as Assistant Professor, Department of E.C.E, at QIS Institute of Technology.