A Dual-VDD Low Power FPGA Architecture

A. Gayasen1, K. Lee1, N. Vijaykrishnan1, M. Kandemir1, M.J. Irwin1, and T. Tuan2

1 Dept. of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802
{gayasen,kiylee,vijay,kandemir,mji}@cse.psu.edu
2 Xilinx Research Labs, 2100 Logic Dr., San Jose, CA 95124
Tim.Tuan@xilinx.com

Abstract. The continuing increase in FPGA size and complexity and the emergence of sub-100nm technology have made FPGA power consumption, both dynamic and static, an important design consideration. In this work, we propose a programmable dual-VDD architecture in which the supply voltages of the logic and routing blocks are programmed to reduce power consumption: low VDD is assigned to the non-critical paths in the design, while high VDD is assigned to the timing-critical paths to meet timing constraints. We evaluate the effectiveness of different VDD assignment algorithms and architectural implementations. Our experimental results show that selectively reducing the supply voltage on non-critical paths provides significant power savings with minimal impact on performance. One of our VDD-assignment techniques provides an average power saving of 61% across different MCNC benchmarks.

1 Introduction

In modern FPGAs, power consumption has become an important design consideration. Increasing performance and complexity have raised the dynamic power consumed per chip, while the use of deep sub-micron processes has resulted in higher static power in the forms of sub-threshold leakage and gate leakage. High power consumption requires expensive packaging and cooling solutions, and in battery-powered applications it may prohibit the use of FPGAs altogether. Consequently, solutions for reducing FPGA power are needed.

Reducing the supply voltage (VDD) is an effective technique for reducing both dynamic and static power. Dynamic power has a quadratic dependence on supply voltage, while both sub-threshold leakage (due to Drain Induced Barrier Lowering, DIBL) and gate leakage depend exponentially on the supply voltage. However, reducing the supply voltage also degrades circuit performance. A well-known technique for reaping the benefits of voltage scaling without the performance penalty is the use of dual VDDs: the timing-critical blocks in the design operate on the normal VDD (VDDH), while non-critical blocks operate on a second supply rail with a lower voltage (VDDL). While dual-VDD ICs have been used successfully in low-power ASICs and custom ICs [17], no commercial FPGA today uses multiple VDDs for power reduction. (Xilinx Virtex-II FPGAs use different supply voltages for the I/O and the core, and the pass transistors used for interconnect are supplied a higher gate voltage to eliminate the VT drop, but this is not targeted at reducing power.)

The difficulty of designing a dual-VDD FPGA is that the optimal VDD assignment changes from one design to another. Consequently, if logic blocks are statically assigned to operate at low or high VDD, the placement and routing algorithms need to be modified accordingly, as in [11]. However, a static assignment of VDD to the blocks may prevent some designs from reducing power consumption or from meeting their timing constraints. In contrast, VDD-programmability for each block makes it possible to tune the number of high- and low-VDD blocks as desired by the application. In this approach, the challenge is in determining the VDD assignment for each block.
The need for level converters wherever a low-VDD logic block drives a high-VDD block, and the associated delay and energy overheads, are an important consideration when performing these VDD assignments. Furthermore, the positioning of the level converters influences the ability to assign lower VDDs to the routing blocks. In this work, we propose a programmable dual-VDD architecture in which the supply voltages of the logic and routing blocks are programmed to reduce power consumption by assigning low VDD to the non-critical paths in the design, while assigning high VDD to the timing-critical paths to meet timing constraints. In our programmable dual-VDD architecture (see Figure 1), the VDD of a circuit block is selected between VDDH and VDDL by using two high-VT transistors (supply transistors) connecting the block to the supplies. The state (ON/OFF) of each supply transistor is controlled by a configuration bit, which is set by the VDD assignment algorithm. The configuration bits are set either to connect the block to one of the power supplies or to disconnect the block from both supply lines entirely when the block is unused or idle. We evaluate the effectiveness of different VDD assignment algorithms and implementation choices for an island-style FPGA architecture designed in 65nm technology. Our results indicate that one of our VDD-assignment techniques provides an average power saving of 61% across different MCNC benchmarks.

Fig. 1. Supply transistors used for programmable VDD

The remainder of this paper is organized as follows. In Section 2, we review related work, focusing in particular on power optimizations for FPGAs. In Section 3, we discuss our dual-VDD FPGA architecture. Section 4 describes the experimental methodology, including the VDD assignment algorithms and the power estimation technique we used. Section 5 presents experimental results, and Section 6 concludes the paper.

2 Related Work

Most of the previous work on power modeling, estimation, and reduction in FPGAs has focused primarily on dynamic power. In [9], the dynamic power of a Xilinx XC4003A FPGA was analyzed by taking measurements of test designs. [15] analyzes dynamic power consumption in the Virtex-II FPGA family. [12, 10] evaluate different FPGA architectures for power efficiency. [16] presents a routability-driven bottom-up clustering technique for area and power reduction in clustered FPGAs. Leakage in FPGAs has captured interest only recently. [18] makes a detailed analysis of leakage power in Xilinx CLBs and concludes that a significant reduction of FPGA leakage is needed to enable the use of FPGAs in mobile applications. [3] presents a fine-grained leakage control scheme using sleep transistors at the gate level. [14] evaluates several low-leakage design techniques for FPGAs and shows that using multiple-VT switch blocks reduces leakage significantly. [1] selects the polarities of logic signals to reduce active leakage power in FPGAs. [5] presents a cut enumeration algorithm targeting low-power technology mapping for FPGA architectures with dual supply voltages. [6] presents a region-constrained placement approach to reduce leakage in FPGAs. Dual-VDD techniques have been proposed previously for ASICs [19, 17]. Recently, a low-power FPGA using pre-defined dual-VDD/dual-VT fabrics was proposed in [11]. However, that work focused on reducing only dynamic power, while keeping the leakage constant.
Further, it used a fixed dual-VDD/dual-VT fabric, keeping all the routing resources at high VDD, which limits the power savings significantly.

3 Architecture

The proposed dual-VDD architecture is built on a cluster-based island-style FPGA architecture, with the configuration stored in SRAM cells. It provides a configurable supply voltage for the logic blocks and routing multiplexers. Figure 2 gives an overview of the architecture. The basic logic element (BLE) consists of a 4-input LUT and a flip-flop. Eight such BLEs are clustered together to form a logic block (CLB). Figure 2(a) shows how the CLB is configured, using high-VT supply transistors, to operate at two different voltages.

Fig. 2. Dual-VDD architecture: (a) dual-VDD CLB; (b) dual-VDD routing mux

As mentioned in Section 1, a dual-VDD design needs level conversion when a low-VDD block drives a block operating at high VDD. In our dual-VDD architecture, level conversion takes place only at CLB pins. For this purpose, CLB pins have level converters (LCs) attached to them, and a multiplexer allows the level converter to be bypassed if level conversion is not needed at that pin. Placing level converters only at CLB pins reduces the complexity of the routing fabric and, at the same time, limits the overheads due to the level converters. We experimented with two architectures that differ in the placement of the level converters: the first places LCs at the CLB output pins, while the second places them at the CLB input pins. Figure 2(a) shows the first case, where only the output pins of a CLB have LCs attached to them. In this case, a net with multiple fanouts operates at high VDD if any one of the CLBs driven by this net is at high VDD (since the signal's voltage level does not change in the routing fabric). This limits the number of routing muxes that can be operated at low VDD and is therefore less effective in reducing routing power than attaching LCs to CLB input pins. The drawback of keeping LCs at the input pins of CLBs (apart from the area penalty) is that a larger number of LCs is needed, which increases the leakage in the logic blocks. Our results support this reasoning, but show that overall leakage is still lower in the second case.

Figure 2(b) shows a routing multiplexer (mux) in the dual-VDD architecture. The multiplexer's output is connected to a level-restoring buffer to restore the VT drop through the NMOS-based multiplexer. Note that the same set of supply transistors controls the voltage of the configuration SRAM cells and the level-restoring buffer. Since the configuration SRAM is not timing critical, the supply transistors need to be sized just large enough to supply the maximum current needed by the level-restoring buffer connected to them.

If a circuit block (CLB or routing multiplexer) is completely unused, it is desirable to switch that block off entirely in order to save leakage. This is achieved by keeping a separate configuration bit for every supply transistor (in the case of a routing mux, the control signals may also need to be pulled down when the mux is unused; these pull-down transistors can be sized very small). Although this incurs more area overhead, it results in significant leakage savings, since resource utilization in an FPGA is typically low [18]. Due to the area overhead of the level converters and supply transistors, the dual-VDD FPGA takes approximately 21% more area than a single-VDD FPGA when LCs are at CLB outputs; when LCs are at CLB inputs, this overhead is estimated to be around 23%.
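As an illustration of the configuration scheme described above, the short Python sketch below models the two configuration bits that drive a block's supply transistors. This is our own illustrative model, not code from the paper; the names SupplyState and config_bits are ours.

from enum import Enum

class SupplyState(Enum):
    VDDH = "connected to the high supply"
    VDDL = "connected to the low supply"
    SLEEP = "disconnected (unused block)"

def config_bits(state):
    # Returns (high_supply_transistor_on, low_supply_transistor_on) for a CLB
    # or routing mux. The two bits are never both 1, which would short the
    # two supply rails together.
    return {
        SupplyState.VDDH:  (1, 0),
        SupplyState.VDDL:  (0, 1),
        SupplyState.SLEEP: (0, 0),
    }[state]

print(config_bits(SupplyState.SLEEP))   # (0, 0): both supply transistors off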
The majority of the leakage in an FPGA occurs in the configuration SRAM cells. It has previously been shown in [6] that by increasing the threshold voltage of the configuration SRAM, its leakage can be reduced by 98%, while increasing the configuration time by 20%. Since configuration time is not critical in most of our target designs, this tradeoff for power savings is reasonable. In order to isolate the effect of dual VDD on power consumption, we neglect the configuration SRAM leakage for both the single-supply and the dual-supply designs (since the reduction of configuration SRAM leakage is achieved by increasing its threshold voltage, which is equally applicable to both).

3.1 Level Conversion

Level converters have been studied widely ever since multi-VDD circuits were proposed [19, 13]. The area, delay, and power overheads of level converters prohibit arbitrary VDD assignment to the logic elements of a circuit. For the present work, we used the level converter circuit shown in Figure 3 and a 65nm BSIM4 SPICE model to simulate it. For an FPGA architecture where level converters are placed at CLB input pins, four level converters are required per BLE. For a VDDH of 1.1V and a VDDL of 0.9V, the LC delay is almost 17% of the delay of a LUT, and as much as 41% of the clock-to-Q delay of the flip-flop. This significant delay prohibits the use of many LCs within a logic path of the circuit. In contrast to its delay, the power consumption of an LC was observed to be negligible (< 1%) compared to a BLE. This allows us to place LCs at all input pins of a CLB and still obtain power savings.

Fig. 3. Level converter circuit

4 Methodology

We used VPR and its power model [2, 12] for this work. MCNC benchmarks were used to evaluate the dual-VDD architecture and the VDD assignment algorithms. The routing architecture that we supplied to VPR closely resembles a modern FPGA, with a routing channel width of 200 and buffered segments of lengths 1, 2, 6, and "long". The LUT size of 4 and the cluster size of 8 LUTs were chosen to match a Xilinx Virtex-II device.

Fig. 4. Experimental flow

Circuit simulations were performed in SPICE using 65nm BSIM4 device models. The delays of the BLE and the LC were obtained from these simulations, as were the static and dynamic power consumption of the LC. Figure 4 shows the experimental flow. The flow deviates from a normal VPR flow after the place-and-route stage. We first assign a voltage to all CLBs using the algorithms discussed below, and then estimate the power of the design placed and routed on the target dual-VDD architecture. Assigning voltages after routing makes the timing analysis more accurate, since all the routing delays are incorporated in the timing graph.

4.1 VDD assignment

In order to be effective, a dual-VDD scheme requires that the paths in the circuit vary in their delays. If all paths had the same delay, then all circuit elements would require high VDD to maintain the performance of the design. Figure 5 shows the distribution of path delays averaged over the MCNC benchmarks used in all our experiments. It is evident from the figure that path delays in a circuit vary considerably; therefore, a dual-VDD scheme can be expected to reduce power consumption significantly. Figure 5 also shows the path delays after applying our dual-VDD assignment algorithms.
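To make this intuition concrete, the small Python sketch below counts how many paths could run entirely at VDDL and still meet the clock period set by the longest path at VDDH. It is our own illustration, not part of the paper's flow; the 1.3x slowdown at VDDL and the delay values are assumed for illustration, not measured.

def count_vddl_tolerant_paths(path_delays_vddh, x=1.0, vddl_slowdown=1.3):
    # T: delay of the longest path with every block at VDDH; Td: target period.
    T = max(path_delays_vddh)
    Td = x * T
    # A path is a VDDL candidate if it still meets Td when slowed down.
    tolerant = [d for d in path_delays_vddh if d * vddl_slowdown <= Td]
    return len(tolerant), len(path_delays_vddh)

# A wide delay spread leaves many paths eligible for VDDL (arbitrary units):
delays = [2.1, 3.5, 4.0, 5.2, 6.8, 9.7, 10.0]
eligible, total = count_vddl_tolerant_paths(delays)
print(f"{eligible}/{total} paths could run entirely at VDDL")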
Optimal assignment of VDD to the gates in a circuit is known to be an NP-complete problem. We use the heuristic shown in Figure 6 for VDD assignment. Initially, we assign low VDD to all CLBs in the FPGA and find those paths whose delays become greater than the desired clock period. We call such paths "critical". CLBs that do not belong to any critical path can be kept at low voltage without affecting the performance of the design. Some of the remaining CLBs and routing muxes need to operate at high VDD so that the design's performance target is met. The order in which these CLBs are analyzed is crucial for the quality of the heuristic. We define the "criticality" of a CLB as the number of critical paths that pass through it (this definition could potentially be improved by assigning priorities to paths depending on their delay or other parameters). The CLBs within a path are analyzed in decreasing order of their criticalities. We start with the CLBs on the most critical path and proceed to the remaining critical paths in decreasing order of their delay. Figure 6 shows the algorithm for the case when LCs are at CLB inputs; in that case, all routing muxes driven by a CLB have the same voltage as the CLB. In the other situation, when LCs are at CLB outputs, the voltage of the routing muxes driving a CLB is the same as that of the CLB.

Fig. 5. Distribution of path delays

Table 1. Comparison of the High-to-Low and Low-to-High algorithms (LCs at CLB inputs, VDDH = 1.1V, VDDL = 0.9V)

Design      # CLBs    # VDDL CLBs (Low-to-High)    # VDDL CLBs (High-to-Low)
alu4        191       51                           65
apex2       235       54                           74
apex4       158       46                           26
bigkey      214       24                           81
des         200       87                           127
dsip        172       12                           31
elliptic    451       339                          327
ex1010      575       177                          185
ex5p        133       30                           24
misex3      175       18                           16
pdc         572       400                          405
s38584.1    806       724                          739
seq         219       61                           58
spla        462       215                          226
tseng       131       102                          114

Assign VDDL to all CLBs and routing muxes
P = list of all paths in the design
T = delay of the longest path when all circuit blocks operate at VDDH
Td = x * T, where x ≥ 1 is a user-defined performance metric
critical_paths = {Pi ∈ P | delay(Pi) > Td}
for each CLB: criticality(CLB) = number of critical paths passing through it
while (critical_paths not empty) {
    Pk = path ∈ critical_paths with maximum delay
    N = all blocks through which Pk flows
    Sort N based on criticality (first entry has the most paths)
    while (delay(Pk) > Td) {
        Ni = first(N)
        N = N - Ni
        Assign VDDH to Ni and to all the routing muxes driven by Ni
        Update the delays of all paths passing through Ni
    }
    critical_paths = critical_paths - {Pk}
}

Fig. 6. Algorithm for VDD assignment: Low-to-High (assuming LCs at CLB input pins)

In order to enumerate all paths whose delays exceed the required clock period, we used the algorithm proposed in [8]. It maintains all paths in a heap data structure with their delays as the keys. Each path also maintains all the branch points in the path in increasing order of their branch slacks (the branch slack is the decrease in path delay if a particular branch point is used to generate a new path). We also experimented with a variant of the above algorithm (High-to-Low), in which all CLBs are initially kept at high voltage and some are then changed to low VDD (see Figure 7). Before changing a CLB to low VDD, we must ensure that doing so will not increase the delay of any other path in the circuit beyond the desired clock period.
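For concreteness, the Python sketch below implements the Low-to-High heuristic of Figure 6 on a simplified data model. This is our own illustration, not the paper's code: a path is a list of block identifiers, per-block delays at the two voltages are given as dictionaries, and the handling of the routing muxes driven by each block is folded into the block delays. The High-to-Low variant of Figure 7 has the same structure, with the initial assignment and the promotion direction reversed.

def assign_vdd_low_to_high(paths, delay_vddl, delay_vddh, Td):
    # Start with every block at VDDL (Figure 6, first line).
    vdd = {b: "VDDL" for p in paths for b in p}

    def path_delay(p):
        return sum(delay_vddl[b] if vdd[b] == "VDDL" else delay_vddh[b] for b in p)

    critical = [p for p in paths if path_delay(p) > Td]
    # criticality(block) = number of critical paths passing through it
    criticality = {}
    for p in critical:
        for b in p:
            criticality[b] = criticality.get(b, 0) + 1

    while critical:
        pk = max(critical, key=path_delay)              # most critical path first
        blocks = sorted(set(pk), key=lambda b: criticality[b], reverse=True)
        for b in blocks:                                # most shared blocks first
            if path_delay(pk) <= Td:
                break
            vdd[b] = "VDDH"                             # promote block to high VDD
        # drop pk and any other paths that now meet the target period
        critical = [p for p in critical if p is not pk and path_delay(p) > Td]
    return vdd

# Tiny example: path A-B-C misses Td at VDDL, so two of its blocks get promoted.
paths = [["A", "B", "C"], ["A", "D"]]
dl = {"A": 3, "B": 3, "C": 3, "D": 1}    # block delays at VDDL
dh = {"A": 2, "B": 2, "C": 2, "D": 1}    # block delays at VDDH
print(assign_vdd_low_to_high(paths, dl, dh, Td=7))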
The number of low-VDD blocks obtained with the two versions, for a VDDH of 1.1V and a VDDL of 0.9V in 65nm technology, is shown in Table 1.

Assign VDDH to all CLBs and routing muxes
P = list of all paths in the design
T = delay of the longest path when all circuit blocks operate at VDDH
Td = x * T, where x ≥ 1 is a user-defined performance metric
vddl_delay(Pi) = delay(Pi) when all blocks in Pi are at VDDL
critical_paths = {Pi ∈ P | vddl_delay(Pi) > Td}
for each CLB: criticality(CLB) = number of critical paths passing through it
while (critical_paths not empty) {
    Pk = path ∈ critical_paths with maximum delay
    N = all blocks through which Pk flows
    Sort N based on criticality (last entry has the most paths)
    while ((delay(Pk) < Td) and (N not empty)) {
        Ni = first(N)
        N = N - Ni
        Assign VDDL to Ni and to all the routing muxes driven by Ni
        Calculate the delays of all paths flowing through Ni
        if any of these delays > Td
            reset Ni and all the routing muxes driven by Ni to VDDH
        else
            update the delays of all paths flowing through Ni
    }
    critical_paths = critical_paths - {Pk}
}

Fig. 7. Algorithm for VDD assignment: High-to-Low (assuming LCs at CLB input pins)

For 10 out of the 15 designs, the High-to-Low (h2l) version performs better than Low-to-High (l2h). This happens because in h2l, when the CLBs on a particular path are analyzed to determine whether they can run at low VDD, the algorithm continues to examine all the other CLBs on the path even after it fails to change the VDD of some CLB. In contrast, in the l2h case, the algorithm keeps changing CLBs on a path to high VDD (in decreasing order of criticality) until the delay of the path is below the required clock period, which sometimes reduces the path's delay more than necessary.

4.2 Power estimation

After all logic blocks have been assigned appropriate supply voltages, we estimate the power consumption of the entire FPGA. We concentrate only on the power consumed in the core of the FPGA, and do not try to optimize or estimate I/O power. Furthermore, we did not estimate the power consumption of the global routing grid used for clock distribution. In order to estimate dynamic power, VPR's power model calculates transition densities at all internal nodes of the FPGA, assuming that all inputs to the FPGA have the same static probability (default: 0.5). Capacitances are estimated from the capacitance values of the MOSFETs, wires, and switches, all of which must be provided in the architecture file that VPR takes as input. We used Berkeley Predictive 65nm technology parameters for our experiments. We modified VPR's dynamic power model to include dual supply voltages and the power consumption of the level converters. Due to the quadratic dependence of dynamic power on supply voltage, the dynamic power of a circuit element reduces by a factor of (VDDL/VDDH)^2 when its voltage is reduced from VDDH to VDDL (for VDDH = 1.1V and VDDL = 0.9V, this is roughly a 33% reduction). The dynamic power of a level converter (obtained from SPICE simulations) was added wherever a level converter was used (using the transition density at that node).
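As a sketch of how the dual-VDD dynamic power of a single node can be computed, the following Python fragment is our own illustration: the 0.5*C*V^2*f*activity form is the usual switching-power expression assumed here for VPR-style models, and the function name and example numbers are ours.

def node_dynamic_power(cap_farads, vdd_volts, clock_hz, transition_density,
                       lc_dynamic_power_watts=0.0):
    # Switching power of one node: 0.5 * C * V^2 * f * expected transitions/cycle.
    switching = 0.5 * cap_farads * vdd_volts**2 * clock_hz * transition_density
    # Add the SPICE-characterized level-converter power if an LC sits at this node.
    return switching + lc_dynamic_power_watts

# Moving a node from VDDH = 1.1V to VDDL = 0.9V scales its switching power by
# (0.9/1.1)**2, roughly 0.67, i.e. about a 33% reduction for that node.
print(node_dynamic_power(5e-15, 0.9, 200e6, 0.2))   # example values, in watts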
VPR has a basic leakage model that calculates sub-threshold leakage due to weak inversion. In 65nm technology, however, two more effects, DIBL and gate leakage, become significant and need to be included in the leakage estimation. We therefore also modified the leakage model to take multiple supply voltages and sleep modes into account. Specifically, the following modifications were made to VPR's leakage estimation.

1. Gate leakage and the sub-threshold leakage due to DIBL were included in the leakage estimation. To estimate the leakage of a single MOSFET, we used results from SPICE simulations with 65nm BSIM4 device models, performed at various supply voltages to obtain leakage numbers for different voltages. These numbers were incorporated into VPR's power model to estimate the gate leakage of the entire FPGA.

2. We estimated the average leakage of a routing multiplexer by halving the worst-case leakage, as discussed in [14]. To verify this, we simulated multiplexers of various sizes and structures and found our leakage estimate to be very close to the SPICE results.

3. In the dual-VDD FPGA, unused logic blocks and routing muxes are kept in a sleep state by switching off both supply transistors. Circuit simulations in SPICE showed that in sleep mode, the leakage of a circuit block reduces to 10% of its original (high-VDD) leakage.

4. To estimate level converter leakage, we obtained the leakage of one level converter from SPICE simulations and multiplied it by the number of level converters in the FPGA.

5 Results and Analysis

The power reduction due to the dual-VDD architecture depends strongly on the values of VDDH and VDDL. In order to understand this dependence, and to arrive at a good voltage choice, we fixed the high VDD at 1.1V and varied VDDL from 0.8V to 1.0V. Figure 8 shows the power consumption for different VDDL values (using the High-to-Low algorithm, with LCs at CLB inputs). Note that for 11 of the 15 designs, a VDDL of 0.9V results in the maximum power savings. When VDDL is increased to 1.0V, the number of CLBs at low VDD increases, but the total power consumption also increases, because the power consumption of the circuit elements at 1.0V is significantly higher than at 0.9V. On the other hand, when we reduce VDDL to 0.8V, power consumption again increases because the number of CLBs and routing muxes that can be placed at low VDD becomes too small. Therefore, for all other results in this section, we use a VDDL of 0.9V. For this case, we obtain close to 61% power saving on average.

Figure 9 shows the power consumption of the designs for the two algorithms, High-to-Low (h2l) and Low-to-High (l2h), and the two level converter placements, at CLB outputs (LCo) or at CLB inputs (LCi); h2lLCi denotes the High-to-Low algorithm with LCs at CLB inputs. Note that for most designs, the High-to-Low algorithm outperforms the Low-to-High algorithm. This is expected, because we showed in Section 4 that the High-to-Low algorithm results in a larger number of low-VDD CLBs. Further, placing the LCs at CLB inputs saves more power (average: 61%) than placing them at the outputs (average: 57%), because the LC leakage is not large enough to overshadow the gains in routing power obtained by placing LCs at CLB inputs. Note, however, that placing the LCs at CLB inputs increases the area of the FPGA.

Figure 10 shows the static and dynamic power consumption in both logic and routing resources for the different algorithms and LC placements. An important observation is that not all components of power are reduced by the same factor: the reduction in dynamic power is much smaller than the reduction in leakage. For example, using the High-to-Low algorithm and placing LCs at CLB inputs saves 24% of the dynamic power and 76% of the leakage power. This can be attributed to two factors.
First, since an FPGA contains a large number of unused circuit elements, their leakage can be reduced by switching them off. Second, leakage varies exponentially with supply voltage, whereas dynamic power varies only quadratically.

Fig. 8. Power consumption for different VDDL's (VDDH = 1.1V)
Fig. 9. Power consumption for different architectures and algorithms
Fig. 10. Average power breakdown between logic and routing resources
Fig. 11. Average power consumption for different critical path delay tolerances

Note that the leakage in routing resources reduces to less than 17% of the original, because in most designs a large number of routing muxes can be put into the sleep state, as they are sparsely used. Another trend to note is that the logic portion of the leakage is larger when LCs are placed at CLB inputs (LCi) than when they are placed at CLB outputs (LCo). This implies that the larger overall power saving for the LCi case comes entirely from the routing resources.

Finally, Figure 11 shows what happens when we modify the VDD assignment algorithm to allow some degradation in the performance of the design. In the figure, a delay value of 110% denotes a 10% performance penalty. Note that these delay values may increase after circuit implementation due to the use of supply transistors and a possible increase in wire lengths (since the total CLB area, and consequently the inter-CLB distance, increases). Using h2lLCi, a 10% decrease in performance increases the average power saving by around 4%, but beyond a 20% penalty the power saving remains almost constant.

6 Conclusion and Future Work

We have presented a dual-VDD FPGA architecture that provides significant power savings with minimal performance penalty. Variations of the VDD assignment algorithm and of the level converter placement were explored. The High-to-Low algorithm coupled with placing the level converters at the CLB input pins resulted in the maximum power savings: an average power saving of 61%, with dynamic power reduced by 24% and static power by close to 76%. Placing the level converters at the CLB output pins instead reduces the area penalty by about 2% and still saves about 57% of the total power. In the present work, the router in VPR is essentially unaware of the multiple supply voltages available for the logic blocks and routing switches. This could be improved by performing dual-VDD-aware routing, which we plan to address in future work.

7 Acknowledgements

This work was supported in part by NSF CAREER Award 0093082, NSF Awards 0130143 and 0202007, and a MARCO/DARPA GSRC Grant.

References

1. J. H. Anderson, F. Najm, and T. Tuan. "Active Leakage Power Optimization for FPGAs". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2004.
2. V. Betz and J. Rose. "VPR: A New Packing, Placement and Routing Tool for FPGA Research". In International Workshop on Field-Programmable Logic and Applications, 1997.
3. B. Calhoun, F. Honore, and A. Chandrakasan. "Design Methodology for Fine-Grained Leakage Control in MTCMOS". In Proceedings of the International Symposium on Low Power Electronics and Design, 2003.
4. A. Chandrakasan, W. Bowhill, and F. Fox. "Design of High-Performance Microprocessor Circuits". IEEE Press, 2001.
5. D. Chen, J. Cong, F. Li, and L. He. "Low-Power Technology Mapping for FPGA Architectures with Dual Supply Voltages". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2004.
6. A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan. "Reducing Leakage Energy in FPGAs Using Region-Constrained Placement". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2004.
7. V. George, H. Zhang, and J. Rabaey. "The Design of a Low Energy FPGA". In Proceedings of the International Symposium on Low Power Electronics and Design, 1999.
8. Y-C. Ju and R. A. Saleh. "Incremental Techniques for the Identification of Statically Sensitizable Critical Paths". In Design Automation Conference, 1991.
9. E. Kusse and J. Rabaey. "Low-Energy Embedded FPGA Structures". In Proceedings of the International Symposium on Low Power Electronics and Design, 1998.
10. F. Li, D. Chen, L. He, and J. Cong. "Architecture Evaluation for Power-Efficient FPGAs". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2003.
11. F. Li, Y. Lin, L. He, and J. Cong. "Low-Power FPGA Using Pre-Defined Dual-Vdd/Dual-Vt Fabrics". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2004.
12. K. Poon, A. Yan, and S. Wilton. "A Flexible Power Model for FPGAs". In Proceedings of the International Conference on Field Programmable Logic and Applications, 2002.
13. R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and S. Kulkarni. "Pushing ASIC Performance in a Power Envelope". In Design Automation Conference, 2003.
14. A. Rahman and V. Polavarapuv. "Evaluation of Low-Leakage Design Techniques for Field Programmable Gate Arrays". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2004.
15. L. Shang, A. S. Kaviani, and K. Bathala. "Dynamic Power Consumption in Virtex-II FPGA Family". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2002.
16. A. Singh and M. Marek-Sadowska. "Efficient Circuit Clustering for Area and Power Reduction in FPGAs". In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2002.
17. M. Takahashi et al. "A 60-mW MPEG4 Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme". IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, Nov. 1998.
18. T. Tuan and B. Lai. "Leakage Power Analysis of a 90nm FPGA". In Custom Integrated Circuits Conference, 2003.
19. K. Usami and M. Horowitz. "Clustered Voltage Scaling Technique for Low-Power Design". In Proceedings of the International Symposium on Low Power Electronics and Design, 1995.