Overcoming the Power Wall: Connecting Voltage Domains in Series Pradeep S. Shenoy, Sai Zhang, Rami A. Abdallah, Philip T. Krein, and Naresh R. Shanbhag ECE Dept., University of Illinois at Urbana-Champaign Urbana, IL pshenoy@ieee.org, krein@illinois.edu Abstract—This paper analyzes an alternative for system-level power configuration of digital circuits. To overcome the “power wall” and enable low supply voltages for digital circuits, the voltage domains of microprocessors or other digital circuit elements are connected in series. This enhances overall system efficiency and performance by allowing both the digital circuits and the power delivery circuits to operate in their region of highest efficiency. Power consumption is dramatically reduced because multiple, independent voltage levels enable each processor to operate at its local minimum energy point. A comparative analysis of series and parallel voltage domains concludes that series voltage domains consume less power. Various voltage regulations schemes ranging from software, firmware, and hardware are presented. Some of the challenges and opportunities of series circuits are discussed. Keywords- voltage regulation, series circuits, microprocessor loads, multi-core, many-core, power delivery, switching converters I. INTRODUCTION The trend toward parallelism to increase computational throughput using many-core microprocessors (i.e. tens or hundreds of cores) motivates a renewed look at power delivery. Even advanced power delivery designs that allow inactive cores to independently reduce supply voltage or even shut down [1], [2] may not be effective for system-level energy reduction. Some reduction in power consumption may be obtained by placing fast, on-chip switching converters in the power path [3], [4]. However, chip-level converters tend to be less efficient than discrete designs. They also add another power conversion stage. While these approaches have some merits, it is not clear whether the overall system efficiency improves. On-chip conversion places a higher thermal load on the processor package. A “nonlinear” breakthrough in power delivery has been sought [5]. With increasing demand for high performance and low cost in modern computing systems, processes continue to scale down for faster devices and smaller area. At the same time, supply voltage scales down. According to International Technology Roadmap for Semiconductors (ITRS), supply voltage will fall below 600 mV by 2024 [6], as shown in Fig. 1. However, the power consumption of high performance computing systems tends to remain the same, limited by heat dissipation from a given package area. Heat removal is not the only challenge associated with the so-called “power wall” [7]. Low-voltage power is challenging and costly to supply. In traditional computing systems, digital loads in parallel share a common supply voltage. In multicore microprocessors (which may draw 100 W or more) with supplies below 1 V, the high current places a heavy burden on dc-dc converters in the Figure 1. Supply voltage scaling trends projected by ITRS [1]. system [8]. As supply voltages decrease, the effective load impedance decreases by a factor of 1/Vdd2, complicating power delivery. The bus capacitance required to keep supply voltage dynamic variation within bounds also increases as Vdd drops. Connecting the voltage domains of digital circuits in series, shown in Fig. 2, has the potential to be the needed breakthrough. The underlying concept was proposed in [9], [10], in which a few digital loads are connected in series and the total supply voltage is increased. In this paper, the basic concept is extended to large numbers of series-connected digital gates, cores, chips, or even boards. The load current provided by the power supply drops in proportion to the number of series blocks, which typically increases the efficiency of power delivery. Here, we also seek to support multiple voltage domains within a series structure, such that each digital device or core can operate at its highest efficiency. Figure 2. Voltage domains of digital circuits connected in series. Authorized licensed use limited to: ShanghaiTech University. Downloaded on September 19,2024 at 07:36:40 UTC from IEEE Xplore. Restrictions apply. The idea of using multiple voltage domains that enable functional blocks to operate at independent voltage levels to drop overall power consumption has been attempted in parallel configurations [11]. Noise performance is known to be superior with stacked load configurations [12]. Here, the technique is not physical stacking (i.e. 3D stacking) of integrated circuits, but rather electrical series connection of multiple voltage domains. Of course, this approach does not exclude 3D stacking [13]. It also applies to non-computational circuits such as memory [14]. This research considers a systems-oriented viewpoint, in which a series connection is extended to large numbers of modules and can be considered across the range from multiple function modules within a single IC up to interconnected boards in a data center. This paper first examines the minimum energy operating point of a CMOS circuit. The series connected voltage domain approach is then presented, analyzed, and compared to the conventional parallel load case. A hierarchy of voltage regulation schemes for the intermediate nodes is also discussed. II. CORE ENERGY OPTIMAL OPERATING POINT Reducing energy in ultra low-power (ULP) systems such as medical portable processors and implants, distributed sensor networks, and active RFIDs is vital for extended battery and design lifetime. Aggressive voltage scaling, where supply voltage (Vdd) is lowered to trade performance and clock frequency for energy, is a typical approach in ULP applications. The intent is to take advantage of modest performance requirements that can be met with clocks ranging from a few hundreds of Hz to a few kHz. Reducing Vdd results in quadratic reduction in dynamic energy at the expense of increased delay. However, as Vdd drops below the device threshold voltage (Vt) (subthreshold region), leakage energy starts to increase due to the exponential increase in delay, and becomes comparable to dynamic energy. This trade-off between leakage and dynamic energy results in a minimum total energy point at an optimal Vdd < Vt. Several studies have addressed this trade-off [15]-[17], and proposed circuit-level design techniques [18], [19] to reduce leakage and lower the optimal energy point. Recently, various subthreshold ASICs [20] and general-purpose processors [16] have emerged for ULP applications. A. Minimum Energy Operation The dominant forms of energy consumed per clock cycle in subthreshold designs are dynamic energy, Edyn, and leakage energy, Elkg. Dynamic energy is caused by parasitic capacitance charging and discharging with a switching current, ION, and leakage energy is due to leakage current, IOFF. Given a processing element with N processing nodes, each with capacitance C, switching activity factor α, operating frequency f, and supply voltage Vdd, the total energy per clock cycle, Eo is expressed as Eo Edyn Elkg The subthreshold current as a function of gate-to-source and drain-to-source voltage is given by VGS Vth VDS VDS S (2) I SUB VGS ,VDS I o 10 (1 e VT ) where Io is the reference current and is proportional to the transistor W/L ratio, S is the swing factor, γ is the DIBL coefficient, Vth is the threshold voltage and VT is the thermal voltage (26 mV at room temperature). Using (2), the switching and leakage current for an NMOS transistor are ION ,V I SUB Vdd ,Vdd and IOFF ,V I SUB 0,Vdd respectively. dd dd Assuming the processing element critical path consists of L processing nodes each with capacitance C, the operating frequency f is given by I ON ,Vdd (3) f LCVdd where β is a fitting parameter. Due to the exponential dependence of ION on Vdd in (2), the subthreshold frequency decreases exponentially with Vdd reduction leading to an exponential increase in leakage energy. This now can be seen by substituting (3) in (1) to get Vdd I OFF ,Vdd (4) Elkg NLCVdd2 NLCVdd2 10 S I ON ,Vdd and the total subthreshold energy is given by Vdd (5) Eo NCVdd2 L10 S Therefore, reducing Vdd into the subthreshold region decreases Edyn but increases Elkg exponentially such that an optimum subthreshold operating voltage exists. B. Subthreshold Multiply-Accumulate (MAC) Unit The subthreshold energy behavior can be illustrated through a 23-bit output multiple-accumulate (MAC) unit which is widely used in various applications. The MAC unit computes y[n] y[n 1] x1[n] x2 [n] , where x1 and x2 are 10-bit input operands and n is the clock-cycle/time index. We use a ripplecarry based architecture with 1-bit full adder as a building block/processing node to study the energy profile per MAC operation (shown in Fig. 3). Fig. 4 shows the MAC unit energy profile based on the analytic model in (5) and SPICE simulations in 130-nm commercial CMOS process at different switching activity factors α. The analytical model behaves very closely to SPICE simulations. As voltage is reduced, Elkg Edyn NCVdd2 Elkg NI OFF ,Vdd Vdd f (1) Figure 3. Ripple-carry based architecture of the 23-bit output MAC unit. Authorized licensed use limited to: ShanghaiTech University. Downloaded on September 19,2024 at 07:36:40 UTC from IEEE Xplore. Restrictions apply. Figure 4. Subthreshold energy of a 23-bit output MAC at α=0.3, 0.1. increases while Edyn decreases and the minimum energy point is reached at Vdd,opt = 0.33V for α = 0.3. Lowering α not only decreases total energy due to Edyn reduction but also pushes optimum Vdd to higher values since Edyn contribution to total system energy Eo when compared to Elkg is decreased. III. POWER CONSUMPTION ANALYSIS While linear regulators offer tight voltage control and can be designed for fast transient response, the energy loss penalty is too large to consider in most designs, particularly for subthreshold applications. There are two types of dc-dc converters generally used in digital circuit supplies. Switched capacitor (SC) converters use capacitors and a switch network to transfer charge and convert voltage levels. They can perform high conversion ratios and are relatively easy to integrate on chip. Non-isolated buck dc-dc converters are more common in low-voltage applications but require an inductor which is difficult to integrate. Conduction loss and switching loss reduce efficiency of any switching converter, especially when the current level is high. Figure 6. (a) Parallel connected load configuration (b) Series connected load configuration r2 represent the on-state resistance of the high and low side switches, respectively. Switching loss can be modeled as 1 1 Psw I loadVin (ton toff ) f sw COSSVin2 f sw (7) 2 2 where Vin, ton, toff, fsw, and COSS are the input voltage, the switch turn-on time, the switch turn-off time, the switching frequency, and the effective output capacitance of the MOSFET. Both power loss contributions increase as load current increases. A parallel connected load will have high load current at low output voltage and might not be the best configuration. B. Power Model for Parallel Connected Loads A parallel connected load is shown in Fig. 6 (a). Here N cores are running at different speeds, therefore the optimum voltages for each core differ and are denoted by Vdd,i. In a parallel configuration, all cores must have the same voltage, set to the maximum value of all optimum operating voltages: Vdd max (Vdd ,i ) (8) i=1...n A. Buck Converter Loss Model Modern high performance circuits usually use buck converters to deliver high power at low output voltage. A simple model is shown in Fig. 5. There are two main loss components associated with buck converters [21]. Conduction loss can be modeled as I 2 2 Pcond ( I load load )(rL Dr1 (1 D)r2 ) (6) 12 where Iload, ΔIload, D, represent the load current, the load current ripple, and the high-side switch duty ratio. Resistances r1 and Figure 5. Model of a buck converter showing on-state resistance of the switches and inductor series resistance. The load current flowing in core i can be expressed as (9) Iload ,i i Cload ,iVdd f sw,i The power consumption of N cores computing in parallel is N N i 1 i 1 Pout , parallel Vdd ( I load ,i ) i Cload ,iVdd2 f sw,i (10) C. Power Model for Series Connected Loads Parallel connections require every core to have the same Vdd, which is not ideal for cores running at slow frequencies. Additional cascaded converters can be used to enable per-core Vdd, but the area and power overhead can reduce the energy savings. As shown in Fig. 6 (b), one way to enable per-core Vdd and reduce the total output current is to connect cores in series so that each core can have its own Vdd, and computational load is balanced so that all cores consume the same current. In this case, the load current can be expressed as Iload 1Cload ,1Vdd ,1 f sw,1 ... N Cload , NVdd , N f sw, N (11) Each core can be configured to its optimum supply voltage Vdd,i, so the total power consumption of N cores in series is Authorized licensed use limited to: ShanghaiTech University. Downloaded on September 19,2024 at 07:36:40 UTC from IEEE Xplore. Restrictions apply. N N i 1 i 1 Pout , series I load ,iVdd ,i i Cload ,iVdd2 ,i f sw,i (12) D. Comparison It is quite easy to show that the series configuration consumes less power than the parallel connected loads due to independent voltage domains. Since the Vdd,i in the parallel configuration is the maximum value of all Vdd,i of each core, we can simplify and compare (10) and (12) to see that (13) Pout , series Pout , parallel Moreover, in the series configuration, the load current Iload supplied by the power delivery circuits is roughly N times less than that in parallel configuration. Since the power loss in the dc-dc converter is (14) Ploss Pcond Psw which is proportional to load current, notice that (15) Ploss , series Ploss , parallel Hence, the power consumed in the digital circuit load and the power loss in the power supply is lower in the series connected system. IV. REGULATION OF SERIES CONNECTED VOLTAGE DOMAINS A key challenge in series voltage domains is regulating the intermediate node voltages. Voltage domains should be regulated to avoid computational errors and limit stress on the transistors. The digital circuits in each voltage domain may not naturally draw the same current at the desired operating voltage. Series circuits, however, must conduct the same current through each element. This causes the voltage at each domain to deviate from the desired level. Just like a simple voltage divider circuit, the steady-state voltage of each domain is determined by the effective impedance of the load elements at that domain. The tradeoff is clear when a potential design is considered. For example, a 64-core processor designed to work at 30 W and 0.3 V requires power at 0.3 V and 100 A when connected in parallel (shown in Fig. 7 (a)). In series, it requires 19.2 V and only 1.6 A (shown in Fig. 7 (b)). This is an increase in the effective load impedance from 3 mΩ to 12.3 Ω (about 4000 times larger). The latter power supply will be much more (a) (b) Figure 7. Comparison of a 64-core, 30 W processor with cores connected in (a) parallel and in (b) series. The effective impedance of series cores is about 4000 times larger than parallel cores. Figure 8. Various methods of load balancing and voltage regulation using software, firmware, and hardware. efficient, with better transient response and lower node-bynode voltage ripple. However, the higher operating voltage introduces 63 intermediate voltage nodes that need regulation. There are several techniques that can be used to actively regulate the voltage domains. Broadly speaking, the goal is to manage the effective impedance of each load in software first and only utilize hardware mechanisms as a backup. This proactive approach reduces loss in the power delivery circuits. A hierarchical list of techniques (shown pictorially in Fig. 8) is as follows: 1. Compiler system/scheduler 2. Run-time system/scheduler 3. Distributed hardware schedule implementer 4. Clock rates 5. Communications (local and external) 6. Task swapping 7. External exchange (local power electronics) 8. Tolerance 9. Global supply 10. Protection The first line of defense in voltage regulation occurs at the software level. If the desired voltage of each voltage domain is equal, the operating system would manage software threads at compile time or run time to evenly balance the computational load of each element. The operating system could also dispatch activity unevenly if unequal voltage levels are desired. Another regulation layer is at the firmware level. A task manager could distribute operations among computational elements without the knowledge of the operating system. Task swapping and task rotation are possible techniques. Clock rates can be adjusted to throttle the activity at each voltage domain and control the effective load impedance. It is important to keep in mind that, even though Fig. 8 abstractly displays one circuit at each voltage level, a voltage domain could consist of several elements in parallel. The activity of all the elements could be managed to control the effective load impedance. If the preceding methods are unable to maintain voltages at the specified targets, hardware based regulation approaches can be used. This includes using local dc-dc converters to actively regulate the voltage [22]. These converters could be integrated on-chip or be placed off chip. Because these circuits process the power delivered to the digital load and incur power loss, it is preferable to avoid voltage regulation by power delivery Authorized licensed use limited to: ShanghaiTech University. Downloaded on September 19,2024 at 07:36:40 UTC from IEEE Xplore. Restrictions apply. This position paper aims to lay down a vision for series connected voltage domains and is not exhaustive. Certain aspects of this vision are elaborated upon, and others are enumerated and may require further study. Other challenges include efficient and effective regulation of intermediate node voltages, coordination between digital circuits in different voltage domains, latency, and packaging. The potential gains are substantial and worth the effort to work towards a complete solution. VI. Figure 9. Experimental platform with microcontrollers, communication I/Os, and voltage regulation circuits. circuits when possible. It may be unavoidable in some situations though. A unique advantage of series connected loads is that the dc-dc converters only need to process a differential fraction of the load power. Instead of processing the entire load current (as would be needed in standard approaches with parallel loads) the voltage regulators only need to provide the difference in current between voltage domains. The effect on voltage overshoot/undershoot caused by load transients will also be less severe since load current step magnitude will be substantially reduced (e.g. by a factor of 1/n where n is the number of voltage domains). In extreme situations, Zener diodes can protect the digital circuits from overvoltage conditions. The experimental prototype shown in Fig. 9 has been used to verify some of the regulation schemes described above. V. CONCLUSION A paradigm shift in system-level design of digital circuits is driven by power concerns. Currently available options will not be sufficient to overcome the “power wall”. Series connection of digital circuit voltage domains has great potential to be the nonlinear breakthrough required. This approach reduces power consumption in digital loads by allowing each element to operate at lower voltages or even its minimum energy operating point. Power delivery complexity and loss is reduced substantially as conducted current decreases. In turn, the number of pins dedicated to power is reduced. The overall system is able to operate efficiently over a variety of conditions. A key to enabling this approach is computational load management and regulation of the intermediate node voltages. A variety of techniques ranging from software based to power electronics are feasible. A general benefit of a series connection it that it facilitates substantial voltage reduction. This technique is especially suited for many-core processors that are able to run at very low voltage (e.g. 0.3 V). FUTURE WORK While series connected voltage domains have considerable promise, level shifting and inter-domain communications complicate implementation. If viable solutions can be found, this paradigm becomes attractive. That said, level conversion is standard in modern SOCs, and microprocessors exist which already employ multiple voltage domains. The inter-core communication bandwidth in a series connected system should be close to that in a typical multicore chip. Viable communication techniques include, but are not limited to, capacitively-driven wires [23], optical interconnects [24], and even wireless transmission [25]. What remains to be seen is if additional communication would be needed. This may motivate rethinking computer architecture to reduce communication or to limit it so that latency and power overhead have minimal impact. Similarly, this paradigm may lend itself to applications where there is little to no interdependence in the data and computations can be made in parallel without significant interaction. The goal may be to develop an initial mapping that is done in such a way as to balance the workloads across the processors and limit communication. Interestingly enough, it may turn out that communication power overhead decreases since a dominant factor is the signal swing (Vmax - Vmin) on the line. If lower voltages are enabled through series connections, the signal swing would be lower thereby reducing power overhead. REFERENCES D.E. Lackey et al., “Managing power and performance for system-onchip designs using Voltage Islands,” in Proc. ACM/IEEE Int. Conf. Computer Aided Design, 2002, pp. 195- 202. [2] J. Hu, Y. Shin, N. Dhanwada, and R. Marculescu, “Architecting voltage islands in core-based system-on-a-chip designs,” in Proc. IEEE Int. Symp. Low Power Electron. Design, 2004, pp. 180-185. [3] H.P. Le, M. Seeman, S.R. Sanders, V. Sathe, S. Naffziger, and E. Alon, “A 32nm fully integrated reconfigurable switched-capacitor DC-DC converter delivering 0.55W/mm2 at 81% efficiency,” in Proc. IEEE Int. Solid-State Circuits Conf., 2010, pp. 210-211. [4] W. Kim, D.M. Brooks, and G.Y. Wei, “A fully-integrated 3-level DC/DC converter for nanosecond-scale DVS with fast shunt regulation,” in Proc. IEEE Int. Solid-State Circuits Conf., 2011, pp. 268-270. [5] S.L. Smith, “Power Optimizatio in the Connected World,” presented at the IEEE Int. Conf. Energy Aware Computing, Cairo, Egypt, Dec. 2010. [6] 2009 ITRS Overall Technology Roadmap Characters. (Key Roadmap Drivers), available, http://www.itrs.net/Links/2009ITRS/Home2009.htm [7] T. Kuroda, “Low-power, high-speed CMOS VLSI design,” in Proc. IEEE Int. Conf. Computer Design: VLSI Computers Processors (ICCD’02), 2002, pp. 310-315. [8] Intel Corp., “Voltage regulator module (VRM) and enterprise voltage regulator-down (EVRD) 11.1 design guidelines,” available, www.intel.com/assets/PDF/designguide/321736.pdf, Sept. 2009. [9] S. Rajapandian, X. Zheng, and K.L. Shepard, “Implicit DC-DC downconversion through charge-recycling,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 846- 852, Apr. 2005. [10] S. Rajapandian, K.L. Shepard, P. Hazucha, and T. Karnik, “High-voltage power delivery through charge recycling,” IEEE J. Solid-State Circuits, vol. 41, no. 6, pp. 1400-1410, June 2006. [1] Authorized licensed use limited to: ShanghaiTech University. Downloaded on September 19,2024 at 07:36:40 UTC from IEEE Xplore. Restrictions apply. [11] M. Nakai et al., “Dynamic voltage and frequency management for a low-power embedded microprocessor,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 28-35, Jan. 2005. [12] G. Jie and C.H. Kim, “Multi-story power delivery for supply noise reduction and low voltage operation,” in Proc. IEEE Int. Symp. Low Power Electron. Design, 2005, pp. 192-197. [13] P. Jain, T.H. Kim, J. Keane, and C.H. Kim, “A multi-story power delivery technique for 3D integrated circuits,” in Proc. Int. Symp. Low Power Electron. Design, 2008, pp. 57-62. [14] A.C. Cabe, Z. Qi, and M.R. Stan, “Stacking SRAM banks for ultra low power standby mode operation,” in Proc. ACM/IEEE Design Automation Conf., 2010, pp. 699-704. [15] B. Zhai, S. Hanson, D. Blauw, and D. Sylvester, “Analysis and mitigation of variability in subthreshold design,” in Proc. Int. Symp. Low-Power Elec. Des., 2005, pp. 20-25. [16] B. Zhai et al., “Energy efficient subthreshold processor design,” IEEE Trans. VLSI Sys., vol. 17, no. 8, Aug. 2009, pp. 1127-1137. [17] L. Nazhandali et al., “Energy optimization of subthreshold-voltage sensor network processors,” in Proc. ACM ISCA, 2005. [18] J. Kwong and A. Chandrakasan, “Variation-driven device sizing for minimum energy sub-threshold circuits,” in Proc. Int. Symp. Low Power Electron. Des., 2006, pp. 8–13. [19] S. Mutoh et al., “1-V power supply high-speed digital circuit technology with multi-threshold voltage CMOS,” IEEE J. of Solid-State Circ., vol. 30, pp. 847 - 853, Aug. 1995. [20] A. Wang and A. Chandrakasan, “A 180-mV subthreshold FFT processor using a minimum energy design methodology,” IEEE J. Solid-State Circ., vol. 40, no. 1, pp. 310-319, Jan. 2005. [21] P.S. Shenoy, V.T. Buyukdegirmenci, A.M. Bazzi, and P.T. Krein, “System Level Trade-Offs of Microprocessor Supply Voltage Reduction,” in Proc. IEEE Int. Conf. Energy Aware Comp., 2010. [22] P.S. Shenoy, I. Fedorov, T. Neyens, and P.T. Krein, “Power Delivery for Series Connected Voltage Domains in Digital Circuits”, in Proc. IEEE Int. Conf. Energy Aware Comp., 2011. [23] R. Ho et al., “High-Speed and Low-Energy Capacitively-Driven OnChip Wires,” in Proc. IEEE Int. Solid-State Circuits Conf., 2007, pp.412-612. [24] M. Haurylau et al., “On-Chip Optical Interconnect Roadmap: Challenges and Critical Directions,” IEEE J. Sel. Topics Quantum Electron., vol. 12, no. 6, pp. 1699-1705, Nov.-Dec. 2006. [25] C. Hu, R. Khanna, J. Nejedlo, K. Hu, H. Liu, and P.Y. Chiang, “A 90 nm-CMOS, 500 Mbps, 3–5 GHz Fully-Integrated IR-UWB Transceiver With Multipath Equalization Using Pulse Injection-Locking for Receiver Phase Synchronization,” IEEE J. Solid-State Circuits, vol. 46, no. 5, pp. 1076-1088, May 2011. Authorized licensed use limited to: ShanghaiTech University. Downloaded on September 19,2024 at 07:36:40 UTC from IEEE Xplore. Restrictions apply.