Published in the Proceedings of the 20th International Conference on Computer Design (ICCD), September 16-18, 2002, Freiburg, Germany

Impact of Scaling on The Effectiveness of Dynamic Power Reduction Schemes

D. Duarte‡, Intel Corporation, david.e.duarte@intel.com
G. McFarland, Intel Corporation, grant.mcfarland@intel.com
N. Vijaykrishnan, M.J. Irwin, H-S Kim, Department of CSE, Penn State University, {vijay, mji, hykim}@cse.psu.edu

Abstract

Power is considered to be the major limiter to the design of faster and more complex processors in the near future. Addressing this challenge requires a combination of process, circuit design and micro-architectural changes. To focus the optimization efforts in the right direction, the models proposed and the studies performed in this work are a first step towards understanding the relative importance of leakage and dynamic energy in future technologies. Further, we analyze the effectiveness of two energy reduction mechanisms that employ voltage scaling, namely supply and threshold voltage selection. We consider the impact of imminent technology changes and packaging improvements, and show that neglecting the impact of temperature may lead to underestimating the power savings by up to 19.5%.

1. Introduction

Energy dissipation has become an important design consideration, which can be attributed to the proliferation of battery-driven mobile systems and to concerns about circuit reliability and packaging costs. In fact, power is widely considered to be the major impediment to more powerful high-performance processors. For CMOS circuits, the major sources of power consumption are dynamic and leakage power, with the latter becoming more significant as threshold voltages scale with technology.
To devise new solutions to the increasingly important power problem, circuit designers and architects need a mechanism to analyze future trends accurately and to understand the relative importance of these components. A large body of literature deals with the impact of technology scaling on various aspects of VLSI circuit design [3, 4], and this paper does not intend to add one more study with the same perspective. Here, we go a step further by providing a systematic approach to analyzing the dynamic and leakage energy trends. Further, we evaluate the anticipated effectiveness of supply voltage scaling, which is widely used for energy optimization in current processors, and compare it to a threshold voltage scaling approach. This is done considering the impact of technology and packaging improvements, as well as the key role of the operating temperature.

2. Effects of scaling on power consumption

The dynamic power consumption of a given design is usually estimated as:

$$P_{act} = N_t \, C_{avg} \, V_{dd}^2 \, (Act) \, f_{clock}$$

where Nt is the number of transistors in the design, Cavg is the average capacitive load, Vdd is the supply voltage, fclock is the operating frequency and Act is the activity factor, which accounts for the number of devices that are actually switching. We calculate Cavg = Cgate_avg + Cdrain_avg + Cwire_local, with gate and diffusion capacitances estimated in the usual way [9] for an average-size device. The interconnect component is calculated as Cwire_local = Cwire/um · Llocal, where Cwire/um is the wire capacitance per unit length and Llocal is set to 10 times the minimum feature size (λ), as connections are only to neighboring cells. Please refer to [1] for more details on the extraction of Cwire/um. The contribution of short-circuit currents will become less important for deep-submicron technologies, in particular since the threshold voltage (Vth) scales down at a slower rate than Vdd [9].
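The dynamic power estimate above can be sketched in a few lines of Python. This is only an illustration of the formula: the function names and every numeric value in the usage section are hypothetical placeholders rather than data from the paper, apart from Act = 0.015, which the paper adopts as its base activity factor for the 0.6um design.

```python
def average_capacitance(c_gate, c_drain, c_wire_per_um, feature_um):
    """Cavg = Cgate_avg + Cdrain_avg + Cwire_local (all in farads)."""
    # The paper sets Llocal to 10 times the minimum feature size.
    l_local_um = 10.0 * feature_um
    return c_gate + c_drain + c_wire_per_um * l_local_um


def dynamic_power(n_t, c_avg, vdd, act, f_clock):
    """Pact = Nt * Cavg * Vdd^2 * Act * fclock, in watts."""
    return n_t * c_avg * vdd ** 2 * act * f_clock


if __name__ == "__main__":
    # Hypothetical device and design numbers, chosen only for illustration.
    c_avg = average_capacitance(c_gate=0.5e-15, c_drain=0.5e-15,
                                c_wire_per_um=0.2e-15, feature_um=0.07)
    p_act = dynamic_power(n_t=1e8, c_avg=c_avg, vdd=0.9,
                          act=0.015, f_clock=2e9)
    print(f"Cavg = {c_avg:.3e} F, Pact = {p_act:.2f} W")
```

Keeping the capacitance model separate from the power formula makes it easy to swap in a different interconnect estimate without touching the rest of the calculation.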
Acknowledgements: This work was supported in part by GSRC grant 98-DT-660, NSF Career Awards 0093085 and 0082064.
‡ D. Duarte was with the Department of Electrical Engineering, Pennsylvania State University while developing this work.

The cycle time is estimated as:

$$T_{cycle} = \frac{1}{f_{clock}} = \frac{LD \cdot C_{avg} \, V_{dd}}{I_{on}}$$

$$P_{act} = \frac{N_t \, V_{dd} \, I_{on} \, (Act)}{LD}$$

where Ion is the drive current for an average-size device and LD is the logic depth (i.e., the number of gate delays) of the slowest pipeline stage. The second expression is the result of replacing fclock in the power equation. Following a starting reference number given in [1] for a 0.6um technology, we have scaled down LD by a constant factor up to the point where deeper pipelining is essentially infeasible, as the latching time becomes comparable to the evaluation time of the logic between the registers. Similarly, we have scaled up the activity factor as a way to capture architectural improvements for enhanced Instruction Level Parallelism (ILP). In [8], a value of Act = 0.015 is used and we choose it as the base value for the 0.6um design. This number may seem very low, but it captures the impact of aggressive clock gating, which is standard in current designs. The scaling factor of about 0.75 for LD was selected for consistency with industry data.

Apart from LD and Act, all other factors in the power equation given above scale with technology at a predictable rate that depends on the scaling laws followed. We have used the scaling models presented in [1] and have found fairly good agreement between the main technology parameters and those presented in the ITRS roadmap [7]. We have assumed that short-channel effects (SCE) dominate, and the effect of Drain Induced Barrier Lowering (DIBL) is captured.
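The cycle-time relation and the substituted power expression given above can be checked numerically with a short sketch. The function names are illustrative; the 0.75 per-generation factor for LD is the value quoted in the text, while the optional floor `ld_min` is a hypothetical stand-in for the point where deeper pipelining stops being feasible (the paper does not state a number for it).

```python
def cycle_time(ld, c_avg, vdd, i_on):
    """Tcycle = LD * Cavg * Vdd / Ion, in seconds."""
    return ld * c_avg * vdd / i_on


def dynamic_power_from_ion(n_t, vdd, i_on, act, ld):
    """Pact = Nt * Vdd * Ion * Act / LD (fclock already substituted)."""
    return n_t * vdd * i_on * act / ld


def scaled_logic_depth(ld_base, generations, factor=0.75, ld_min=None):
    """Scale LD by a constant factor per generation.

    ld_min is an assumed lower bound, not a value from the paper.
    """
    ld = ld_base * factor ** generations
    return max(ld, ld_min) if ld_min is not None else ld


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    n_t, c_avg, vdd, i_on, act, ld = 1e8, 1e-15, 0.9, 5e-4, 0.015, 20
    f_clock = 1.0 / cycle_time(ld, c_avg, vdd, i_on)
    p_direct = n_t * c_avg * vdd ** 2 * act * f_clock
    p_substituted = dynamic_power_from_ion(n_t, vdd, i_on, act, ld)
    print(f"fclock = {f_clock:.3e} Hz")
    print(f"Pact (direct) = {p_direct:.3f} W, "
          f"Pact (substituted) = {p_substituted:.3f} W")
```

Both forms give the same power because Cavg cancels and one factor of Vdd is absorbed into the cycle time, which is why the substituted expression depends on Ion and LD instead of on capacitance and frequency.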
The number of transistors (Nt) is estimated by dividing the total die area by the area of an average-size device with individual contacts and some spare area around it. This approach attempts to balance the effect of very compact structures (such as memories) against that of less regular structures (such as datapaths). Two cases are considered: a constant die size (80 mm²) and a variable die size, with the latter assuming an increase of 14% in die size from one technology generation to the next [3]. The first case can be seen as a low-end or embedded design, where simple clocking mechanisms are desirable, while the second one can be regarded as a high-performance design.

To estimate the leakage power consumption due to subthreshold currents, we use the following expression [2]:

$$P_{leak} = N_t \, V_{dd} \, I_{off} \, K_{design}$$

where Kdesign is a factor that accounts for the distribution and sizing of P and N devices, the stacking effect, the idleness of the devices and the design style used. This factor is defined empirically, and there is no analytical expression for it. In [2], experiments show that Kdesign is around 10 for logic, while for memory structures it is about 1. Based on the area used by logic and memory structures, we estimate an average Kdesign of 2, as the area used for memory structures tends to be 85% in nanometer technologies. For details about how the logic and memory areas were determined, please refer to [6].

Subthreshold conduction is not the only leakage mechanism, but it has by far the largest impact, which is worsened when DIBL effects are considered. The subthreshold current was estimated as [1]:

$$I_{sub} = \frac{\mu C_{ox} W}{L_{eff}} (\eta - 1) \, \phi_T^2 \, \exp\left(\frac{V_{gs} - V_{th}}{\eta \, \phi_T}\right), \quad \text{where} \quad \eta = 1 + \frac{C_{dep}}{C_{ox}}$$

In the above equations, µ is the carrier mobility, φT is the thermal voltage (= kT/q) and Cdep is the capacitance of the channel depletion region. The gate leakage estimates based on direct oxide tunneling were found to be almost completely negligible for the technologies studied.

Figure 1. Active and leakage power (constant die size).

Figure 2. Active and leakage power (increasing die size).

Figures 1 and 2 illustrate how the estimates of dynamic and leakage power (obtained with the equations given) vary across the technologies considered. Note that we have only captured the influence of subthreshold currents, as they are the dominant leakage mechanism. The effect of temperature has also been taken into account, and the plots make clear that it strongly affects how power (leakage power, in particular) behaves. For more details about the modeling of these effects, please refer to [6].

3. Impact of technology and packaging

There are two technology improvements that are expected to become standard in mainstream CMOS products within the next five years [10]. The first replaces SiO2 with high-permittivity materials. The thickness of the inversion layer beneath the oxide makes the apparent electrical thickness significantly larger than the actual physical thickness, with deviations in the range of 0.5nm to 1.0nm [11]. It now seems very likely that in the 0.1um generation and later, gate oxides will be fabricated with high-K materials such that the physical thickness will remain approximately constant while the electrical thickness is reduced. These materials are also expected to dramatically reduce gate leakage due to a higher oxide energy barrier (φB).

The second improvement is the replacement of bulk CMOS by SOI (Silicon On Insulator). SOI has a significant impact on power by virtually eliminating diffusion capacitance and allowing for steeper subthreshold slopes (ST). In particular, in bulk CMOS, ST is approximately 100mV/dec, while in SOI ST becomes 75mV/dec, at 100°C.
It should be noted that the former effect (elimination of diffusion capacitances) is beneficial but brings limited gains, since interconnect capacitance takes over as the second-largest contributor to the total parasitic capacitance for the technologies where SOI is expected to become standard (0.1um and beyond). It was found that, once the two technology improvements are incorporated, subthreshold currents decrease due to the use of SOI, while the use of high-K dielectrics helps keep the impact of gate leakage to a minimum.

Figure 3 captures the impact of these improvements on the total system power, estimated with the equations given earlier. We assume that dynamic power remains the same as in the bulk CMOS case, following the assumptions made earlier. In the optimum case (when DIBL effects are effectively minimized by SOI), leakage is always less than active power for the technologies considered. But as process variations continue to influence the device parameters, the actual effect is not ideal and instead translates into delaying the surge of leakage power by one technology generation (i.e., for this study, from 0.035um to 0.025um, as shown in Figure 3).

Figure 3. Impact of SOI and high-K dielectrics in leakage system power (constant die size).

In parallel with technology improvements, the impact of packaging and cooling mechanisms should be accounted for. In fact, the ITRS roadmap has stated that power consumption will be strongly determined by how effectively heat is removed from the die. The following equation shows how the total power and the die temperature are related to each other [5]:

$$T_j - T_a = \theta_{ja} \cdot Power$$

where θja is the thermal resistance and Tj and Ta are the junction and ambient temperatures, respectively. The thermal resistance captures the thermal behavior of the CPU package, interfaces, heat sink and any forced-air mechanisms, if present. Typical heat-sink thermal resistances vary with the geometry of the sink.
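The junction-temperature relation above can be rearranged to find either the temperature reached for a given package or the thermal resistance needed to hold a target temperature. The following is a minimal sketch with illustrative function names; the 50 W power figure in the usage section is a hypothetical value, not one taken from the paper.

```python
def junction_temperature(t_ambient_c, theta_ja, power_w):
    """Tj = Ta + theta_ja * Power, temperatures in C, theta_ja in C/W."""
    return t_ambient_c + theta_ja * power_w


def required_theta_ja(t_junction_c, t_ambient_c, power_w):
    """Thermal resistance needed to hold Tj at the given power."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return (t_junction_c - t_ambient_c) / power_w


if __name__ == "__main__":
    # Hypothetical operating point: 50 W with a 100 C junction limit
    # and a 55 C ambient.
    theta = required_theta_ja(t_junction_c=100.0, t_ambient_c=55.0,
                              power_w=50.0)
    print(f"required theta_ja = {theta:.2f} C/W")
    print(f"Tj check = {junction_temperature(55.0, theta, 50.0):.1f} C")
```

Because the required thermal resistance is inversely proportional to power, every increase in total power must be matched by a proportionally better package to keep the same junction temperature.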
For mobile devices, extruded heat sinks are in the order of 11.5°C/W while vapor-chamber folded-fin sinks are in the order of 0.2-0.4°C/W. For further details, please refer to [5].

We have used the above equation to determine the θja values required to keep the junction temperature at safe levels. The ITRS roadmap specifies that for mobile designs (constant die size) Tj and Ta should be 100°C and 55°C, while for high-performance designs (increasing die size) Tj and Ta should be 85°C and 45°C, respectively. The bars in Figure 4 show how θja must change to guarantee the Tj given above for the two design cases. This behavior can be described analytically by average reductions in θja of 33% and 48% per generation for the low-end and high-performance cases, respectively. This estimate, however, proved to be overly conservative. It was found that average reductions of 26% and 43% per generation will work well until leakage power becomes significant, as shown by the lines in Figure 4. It must be highlighted that thermal resistance depends strongly on the cost of all associated components and also on the volume of the heat sink [5]. For the study that follows, our default case assumes DIBL effects and an operating temperature of 100°C, as technology improvements and the limitations of efficient cooling mechanisms compensate for each other.

Figure 4. Thermal resistance and non-ideal temperature behavior.

4. Reducing power and temperature

The techniques chosen for this study are based on the dynamic adjustment, at run-time, of some basic operating parameters (such as Vdd and Vth). Since these run-time techniques adversely affect performance, smart policies must be devised in order to apply them wisely in real designs.
Moreover, due to the strong relationship between leakage power and temperature, it is important to accurately model any temperature change associated with the application of a given technique, so that a better estimate is obtained.

4.1. Supply voltage dynamic scaling

Reducing the nominal supply voltage gives a significant reduction in power consumption at the expense of performance, as the drive current capability (Ion) decreases and the operating frequency must be reduced as well. Thus, Dynamic Voltage Scaling (DVS) schemes must be applied whenever the system operating requirements allow it. We now explore whether such schemes will be as useful in future technologies and whether DVS should be implemented in parallel with supply gating schemes as leakage power becomes dominant.

We consider three base technologies, which were selected to provide three different power consumption scenarios. These are summarized in Table 1. We consider the case where the die size has not been scaled up, which can be viewed as an initial step towards lowering power consumption. The results are easily extendable to the case where die scaling takes place.

Table 1. Technologies used for evaluation.

Tech (µm)            0.07    0.05    0.035
Total Power (W)      41      64      126
Dynamic Power (%)    78      56      33
Leakage Power (%)    22      44      67

Figure 5. Power variation as Vdd changes.

Figure 6. Delay variation as Vdd changes.

Figures 5 and 6 present the expected power and performance changes (as estimated with the equation for Tcycle in Section 2) as the nominal Vdd is scaled down by up to about 40%. Figure 5 shows two cases; the dashed lines represent the instantaneous power savings after the change is applied (short-term policy). If the temperature is allowed to settle (long-term policy), the device leakage current decreases, causing a further reduction in the power consumed. This reduction eventually reaches a stable point, given the linear relationship of power versus temperature and the exponential one of leakage versus temperature. The threshold that separates a long-term policy from a short-term policy depends on how effectively the heat is removed from the die, that is, on how closely the die temperature follows any change in power consumption. It should be noted that, in the long-term case, all technologies essentially converge to the same curve in terms of power reduction and temperature (the minimum temperature reached was 58°C). The figures also show that, even though the attainable power reduction is almost linear with the change in Vdd, the negative effect on performance accentuates for the three cases considered when the decrease in Vdd is larger than about 20%.

There are some problems associated with Vdd scaling. In memory structures, as cell capacitances decrease, the amount of charge they can store decreases, which makes them more susceptible to soft errors. Another problem is increased threshold variation in very short channel devices due to random dopant variation in the channel, which affects cell stability during read operations. These two conditions worsen with Vdd scaling. The latter phenomenon might be fixed by increasing the beta ratio of the cell (the ratio of the NMOS pulldown to the NMOS pass device), which unfortunately prevents the memory cells from taking full advantage of process scaling. Thus, it is likely that memory arrays in processors implemented in 0.1um processes and beyond will need a separate power supply, higher than that used by the processor core, or they will simply not be able to scale as the core does, resulting in non-optimum area utilization.

4.2. Threshold voltage impact

Threshold scaling by substrate biasing has been proposed and used as an effective way to reduce leakage power consumption.
Although this technique has so far been applied to reduce leakage only when a unit or the whole system is idle, we now explore the feasibility of applying body bias control at run-time and system-wide. The results of this section can also be used in assessing the impact of implementing a design in a dual-Vth process.

Figure 7. Power variation as Vth changes.

Figure 8. Delay variation as Vth changes.

Figures 7 and 8 present the expected power and performance changes as the nominal Vth is scaled up by up to 70%, which directly impacts the average value of Ion. Larger increments in Vth are possible when the technique is applied to idle units. We observe that increments in Vth for overall power reduction become more effective as technology scales, at the expense of an increased performance penalty. As before, the impact of temperature is significant. For instance, to achieve a 20% reduction in power in a 0.035um design, short-term policies will require an 11% change in Vth while long-term policies only require a 5% change in Vth. The gap between the two cases decreases for larger changes in Vth and less aggressive technologies. This effect is enhanced by a lower operating temperature, which in the higher threshold voltage setting was reported to be 78, 67 and 56°C for 0.07, 0.05 and 0.035um processes, respectively. The figures also show that, although the negative effect on performance is almost linear with the increase in Vth, the attainable power reduction presents a steeper rate of change for the three cases considered for increases in Vth up to about 20%. It was found that the body bias voltage required to change Vth by 70% is lower than the operating voltage of each technology. In threshold voltage selection, it must be guaranteed that Vdd > 3Vth, so that enough current drive is available and performance is not dramatically harmed.

4.3. Supply and threshold voltage scaling

The following experiments assume variations in Vdd and Vth according to the relative contributions of dynamic and leakage power to the total power number, respectively. Figures 9 and 10 present the results obtained when both Vth and Vdd are scaled for a total of 14 steps, with a maximum performance penalty of 16%. The starting Vdd values were the nominal ones, and they were lowered in steps of 15mV, 10.5mV and 5.5mV such that final variations of 23%, 21% and 14% were reached at step 14 for the 0.07, 0.05 and 0.035um technologies, respectively. Similarly, the base Vth value was the nominal one, and steps of 1.9mV, 2.5mV and 2.4mV were used in order to reach final variations at step 14 of 14%, 21% and 23%.

Figure 9. Power variation as Vdd and Vth change.

Figure 10. Delay variation as Vdd and Vth change.

If a short-term policy is implemented, we observe that the attainable power savings converge to a common trend, as shown in Figure 9. But the trend changes in the case of long-term policies, where the savings are largest for the 0.035um technology and decrease for less aggressive processes. The effect is enhanced by a lower operating temperature, which at step 14 was found to be 68, 64 and 61°C for 0.07, 0.05 and 0.035um processes, respectively.

5. Concluding remarks

We have presented a complete framework for estimating the impact of technology scaling on the power behavior of future designs. It also accounts for changes in architecture design and optimizations, an aspect that we have called ‘architectural scaling’. We have used this framework to evaluate the effectiveness of various power reduction techniques.
It was found that supply voltage scaling becomes less effective in providing power savings as leakage power becomes larger, which is reasonable given the quadratic dependence of the dynamic power on Vdd in contrast with the linear dependence of the leakage power. On the other hand, power savings obtained by increasing the threshold voltage are more significant as leakage power becomes dominant. This is also reasonable given the exponential dependence of the leakage power on Vth, in contrast with the linear dependence of the dynamic power. An integrated scheme that uses both supply and threshold voltage scaling will provide the highest savings for the least amount of change in the controllable parameters.

Table 2. Additional percentage power savings provided by temperature feedback.

Additional Savings (%)    0.07um              0.05um              0.035um
                          Average  Maximum    Average  Maximum    Average  Maximum
Vdd Scaling               5.4      7.4        9.8      13.5       14.4     19.5
Vth Scaling               1.2      2.0        3.9      6.2        8.9      13.4
Vdd / Vth Scaling         3.0      4.2        6.5      8.5        11.4     16.7

It was found, however, that the above observations change significantly if a given scheme is applied for a relatively long time (which we called a long-term policy). In such a case, the decrease in power consumption causes a decrease in temperature, which in turn reduces leakage power significantly (temperature feedback). Table 2 shows the additional percentage savings that can be obtained if the die is allowed to cool down after a power reduction scheme is applied, which can be as high as 19.5%. It is clear that the additional savings increase as leakage becomes more important. This result emphasizes the importance of including run-time parameters, such as temperature, if accurate estimates are to be obtained. Also, design-time optimizations such as technology and packaging improvements should be accounted for, as discussed in Section 3.
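The temperature feedback described above can be illustrated with a small fixed-point iteration that couples the thermal relation Tj = Ta + θja · Power with a temperature-dependent leakage term. This is a sketch under stated assumptions, not the paper's model: leakage is approximated here as a simple exponential in temperature with a hypothetical coefficient `k_leak`, and all numeric values in the usage section are made up for illustration (the 56%/44% dynamic/leakage split only mirrors the proportions of the 0.05um row of Table 1). Following the dependences stated in the conclusions, a 10% Vdd reduction is modeled as scaling dynamic power by 0.81 and leakage power by 0.9.

```python
import math


def settle_temperature(p_dyn, p_leak_ref, t_ref, t_amb, theta_ja,
                       k_leak, tol=1e-6, max_iter=1000):
    """Iterate Tj = Ta + theta_ja * P(Tj) until the temperature settles.

    Leakage is modeled as p_leak_ref * exp(k_leak * (T - t_ref)); this
    exponential form is an assumed simplification, not the paper's model.
    Returns (settled junction temperature in C, total power in W).
    """
    def total_power(t):
        return p_dyn + p_leak_ref * math.exp(k_leak * (t - t_ref))

    t = t_ref
    for _ in range(max_iter):
        t_next = t_amb + theta_ja * total_power(t)
        if abs(t_next - t) < tol:
            return t_next, total_power(t_next)
        t = t_next
    raise RuntimeError("temperature iteration did not converge")


def savings_with_feedback(p_dyn, p_leak, dyn_scale, leak_scale, **thermal):
    """Return (short_term, long_term) fractional savings vs. the baseline."""
    baseline = p_dyn + p_leak
    new_dyn, new_leak = p_dyn * dyn_scale, p_leak * leak_scale
    # Short-term: power right after the change, die still at t_ref.
    short_term = 1.0 - (new_dyn + new_leak) / baseline
    # Long-term: power once the die has cooled to its new steady state.
    _, settled_power = settle_temperature(new_dyn, new_leak, **thermal)
    long_term = 1.0 - settled_power / baseline
    return short_term, long_term


if __name__ == "__main__":
    # Hypothetical 50 W design at 100 C, 55 C ambient, theta_ja = 0.9 C/W.
    short, long_ = savings_with_feedback(
        28.0, 22.0, dyn_scale=0.81, leak_scale=0.9,
        t_ref=100.0, t_amb=55.0, theta_ja=0.9, k_leak=0.02)
    print(f"short-term savings: {short:.1%}, long-term savings: {long_:.1%}")
```

The gap between the two returned values corresponds to the kind of additional savings reported in Table 2, although the magnitudes produced by this toy model should not be compared with the paper's numbers.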
We hope that the framework proposed here can be used so that the goal is no longer simply the highest performance, but instead the highest performance within a particular market segment's power budget, taking into account the physical aspects of the real design. For instance, with the estimates given here, it will be possible to balance the benefits of using high-threshold devices in a low-leakage process running at the highest possible frequency at full Vdd against using faster but leakier devices that require more voltage scaling in order to reach the desired power budget. Cases like these might lead the design team to select some optimum percentage of total power to come from leakage, which would be a function of the power budget being targeted. In the extreme case, if a process increases leakage so greatly that Vdd has to be reduced to the extent of making the design slower than the previous generation, then this is clearly a bad choice. It is possible that analyses like the one presented here will lead to the definition of Leff, Vdd, Tox and Vth values that keep leakage power near its optimum percentage for a given processor.

6. References

[1] McFarland, G., “CMOS Technology Scaling and Its Impact on Cache Delay”, PhD Thesis, Stanford University, June 1997, http://umunhum.stanford.edu/~farland/.
[2] Butts, J. and Sohi, G., “A Static Power Model for Architects”, Proceedings of the 33rd Annual IEEE–MICRO 2000, pp. 223-234.
[3] Borkar, S., “Design Challenges of Technology Scaling”, IEEE Micro, July-August 1999, pp. 23-29.
[4] Sylvester, D., et al., “Future Performance Challenges in Nanometer Design”, Proc. of the 38th DAC, pp. 3-8.
[5] Viswanath, R., et al., “Thermal Performance Challenges from Silicon to Systems”, Intel Technology Journal, 3rd quarter, 2000.
[6] Duarte, D., “Clock Network and Phase-Locked Loop Power Estimation and Experimentation”, PhD Thesis, Penn State University, May 2002.
[7] ITRS Roadmap, http://public.itrs.net.
[8] Chen, Z., Diaz, C., et al., “0.18um Dual Vt MOSFET Process and Energy-delay Measurement”, International Electronic Devices Meeting, 1996, pp. 851-854.
[9] Rabaey, J., Chandrakasan, A. and Nikolic, B., “Digital Integrated Circuits: A Design Perspective”, 2nd Ed., Prentice-Hall International, NJ, 2002 (draft).
[10] Intel Corporation, “Intel Announces Breakthrough In Chip Transistor Design”, http://www.intel.com/pressroom/archive/releases/20011126tech.htm.
[11] Hu, C., “Gate Oxide Scaling Limits and Projection”, IEDM, 1996, pp. 319-322.