He X, Yan GH, Han YH et al. Wide operational range processor power delivery design for both super-threshold voltage and near-threshold voltage computing. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(2): 253–266 Mar. 2016. DOI 10.1007/s11390-016-1625-7 Wide Operational Range Processor Power Delivery Design for Both Super-Threshold Voltage and Near-Threshold Voltage Computing Xin He 1,2 , Student Member, CCF, IEEE, Gui-Hai Yan 1 , Member, CCF, ACM, IEEE Yin-He Han 1 , Senior Member, CCF, IEEE, Member, ACM, and Xiao-Wei Li 1 , Senior Member, CCF, IEEE 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China 2 University of Chinese Academy of Sciences, Beijing 100049, China E-mail: {hexin, yan, yinhes, lxw}@ict.ac.cn Received January 4, 2015; revised June 19, 2015. Abstract The load power range of modern processors is greatly enlarged because many advanced power management techniques are employed, such as dynamic voltage frequency scaling, Turbo Boosting, and near-threshold voltage (NTV) technologies. However, because the efficiency of power delivery varies greatly with different load conditions, conventional power delivery designs cannot maintain high efficiency over the entire voltage spectrum, and the gained power saving may be offset by power loss in power delivery. We propose SuperRange, a wide operational range power delivery unit. SuperRange complements the power delivery capability of on-chip voltage regulator and off-chip voltage regulator. On top of SuperRange, we analyze its power conversion characteristics and propose a voltage regulator (VR) aware power management algorithm. Moreover, as more and more cores have been integrated on a singe chip, multiple SuperRange units can serve as basic building blocks to build, in a highly scalable way, more powerful power delivery subsystem with larger power capacity. Experimental results show SuperRange unit offers 1x and 1.3x higher power conversion efficiency (PCE) than other two conventional power delivery schemes at NTV region and exhibits an average 70% PCE over entire operational range. It also exhibits superior resilience to power-constrained systems. Keywords 1 voltage regulator, power delivery, near-threshold computing, multicore processor Introduction Nowadays processor design has been gradually changed from performance-oriented paradigm to power efficiency oriented paradigm because of the approaching end of Dennard’s scaling[1] . To boost efficiency, the state-of-the-art processors provide multiple performance states (P-states) to enable flexible power management, which has even been marked as a selling point by processor vendors. Processor operational range has been greatly enlarged. Implementing performance states (P-states) requires dynamic voltage frequency scaling (DVFS) techniques which dictate a power delivery subsystem sup- porting a wide range of voltage levels. For example, to enable all P-states of an Intelr Pentiumr processor, the power delivery has to provide from 0.9 V at P5, the lowest performance state, to 1.4 V at P0, the highest performance state[2] . The operational voltage range will be further expanded when incorporating some advanced power management techniques such as Turbo Boost[3] and near-threshold voltage (NTV)[4] technology. Turbo Boost requires higher-than-nominal voltage to support instant over-clocking, while NTV uses lower-than-nominal voltage to achieve aggressive power saving. Wide operational range processor design brings new challenges to power delivery design. However, the Regular Paper This work is supported by the National Natural Science Foundation of China under Grant Nos. 61572470, 61532017, 61522406, 61432017, 61376043, and 61221062. ©2016 Springer Science + Business Media, LLC & Science Press, China 254 J. Comput. Sci. & Technol., Mar. 2016, Vol.31, No.2 power delivery subsystem is never ideal. It draws power when transferring power from power source to the loading processor. This decreases the power conversion efficiency (PCE), i.e., the ratio of processor power to the total power drawn from the board power source. Within optimal range, PCE can reach up to 90% with well-designed on-board voltage regulator (VR), referred to as off-chip inductor based VR (Off-VR in this paper). However, out of that optimal range, PCE decreases significantly. This can greatly dampen and even totally eat up the power benefit achieved by wide operational range power management. This argument can be confirmed by Fig.1(a). The result shows PCE of a typical Off-VR on different output voltage levels ranging from 1.1 V to 2.3 V[5] . The PCE reaches the optimal point (89%) at the highest voltage of 2.3 V, but plunges to merely 12% at the lowest 1.1 V. Clearly, in the lowvoltage states, the power efficiency cannot be high because the power loss in power delivery has dominated the total power. Power Efficiency (%) 100 80 60 40 20 Off-VR 0 1.1 1.4 1.7 2.0 Output Voltage (V) 2.3 Power Efficiency (%) 80 60 40 20 Off-VR On-VR LDO-VR 0 0.4 0.6 0.8 1.0 1.2 Output Voltage (V) Fig.1. Power conversion efficiency at different output voltage levels. The optimal PCE of conventional Off-VR resides at the high voltage end, but gradually decreases at low voltage levels. This is not surprising since the power delivery is designed for peak performance. This conventional wisdom becomes less effective when it comes to the state-of-the-art processors with wide operational voltage range. The future processors are likely required to work in both traditional super-threshold voltage (STV) mode and aggressive power-saving NTV mode to achieve ultra-high efficiency[6] . This requirement, however, cannot be served well with current power delivery designs. The looming challenge is how to design a power delivery subsystem that can provide wide operational range and, more importantly, high PCE across the whole range. One intuitive solution is to design multiple Off-VRs whose optimal points are evenly located in the operational range. However, we find that designing an OffVR based power delivery subsystem targeting low voltage range is impractical because of the degraded regulation quality and voltage scaling speed. Also, it is neither physically practical nor scalable because today’s processors have been very power-intensive. Even one Off-VR solution has greatly burdened the board design, not to mention multiple Off-VR solutions. Another traditional solution is to use an on-chip LDOVR to collaborate with Off-VR. Although LDO-VR can be easily integrated on chip, PCE still remains a big problem since LDO-VR’s PCE is limited to output/input voltage ratio, as confirmed in Fig.1(b). Note that Off-VR exhibits higher PCE in Fig.1(b) than the one in Fig.1(a). This is because Off-VR in Fig.1(a) employs larger inductor to aggressively remove output ripple, and thus incurs larger power loss, and Off-VR in Fig.1(b) uses much smaller inductor to maintain output ripple at an acceptable level based on the observation in Subsection 4.1. Although the inductors of two OffVRs vary in size and inductances, the behaviors of PCE share same trends. We resort to on-chip switched capacitor based VR (On-VR in this paper) to complement the Off-VR at the lower-end voltage. The rationale is twofold. 1) On-VR does not dictate on-board resources, but only on-chip silicon. At 20 nm technology node, due to thermal constraint it is estimated that roughly 20% chip area will become dark silicon, this number will grow to more than 50% at 8 nm technology node[7] . On the other side, on-chip switch capacitor voltage regulator has gone increasing mature in recent years[8], and integrated capacitors can be used to implement voltage regulator in current CMOS processes without additional fabrication steps. These imply we could trade increas- Xin He et al.: Wide Operational Range Processor Power Delivery Design ingly cheaper silicon to build On-VR for power delivery. 2) On-VR’s optimal point usually resides in low voltage range compared with Off-VR, and hence it is just right to complement Off-VR in terms of optimal PCE. As Fig.1(b) shows the efficiencies of Off-VR and On-VR change with varying load levels under fixed input voltages, note that the load current is scaled with output voltage. Off-VR can achieve high conversion efficiency at high load levels, but when the load shifts to low levels, Off-VR’s PCE degrades significantly. By contrast, On-VR maintains high PCE at the low load levels, but cannot feed high load levels. This observation provides a unique opportunity to achieve wide operational range delivery by judiciously building the synergy between Off-VR and On-VR. Based on this observation, we explore the design space of wide operational range and examine several design candidates. Finally we propose SuperRange, a wide operation range power delivery unit, to maximize PCE over the whole range. It is worth to note that the power delivery capacity of a SuperRange unit is limited, and under heavy load condition, the PCE of SuperRange would degrade significantly even if SuperRange could afford it. In this case, we propose to use SuperRange units as basic building blocks to form a distributed power delivery network. For example, the typical power of a 64-core processor is 120 W, and the capacity of a SuperRange unit is 30 W. In order to power up the manycore processor, the power delivery design should group the cores into four power domains powered by SuperRange units. In this way, high PCE could be still maintained. The reasons are twofold: 1) processor cores rarely operate at full speed and usually remain under-utilized, and thus power consumption is far less than the typical power; 2) large load can be shared by several SuperRange domains. Taking an 8-thread application as an example, each thread should be assigned to an active core. The thread scheduling algorithm could map each thread to one core in a different SuperRange domain, and thus the burden of large load can be relieved. We resort to hardware and software co-design to optimize PCE for a manycore processor. Specifically, we find an optimal core count of SuperRange domains and propose a load sharing scheduling algorithm. In particular, we make the following contributions. 1) We explore the design space of wide operational range power delivery design. We thoroughly evaluate the feasibility of possible design options, which motivates an optimal design. 255 2) We propose a wide operation range power delivery scheme, called SuperRange, to maximize PCE over the whole range. SuperRange builds the synergy between Off-VR and On-VR. 3) On top of SuperRange, we further propose a VR aware power management algorithm. This algorithm is used to search for optimal processor power states to maximize performance under given power budget. 4) We propose to use SuperRange unit as a basic building block to form a distributed power delivery network design for wide operational range processor. Incorporated with load sharing scheduling algorithm, the proposed design could keep high PCE and exhibit good salability. The rest of the paper is organized as follows. Section 2 gives the background. Section 3 analyzes the optional power delivery scheme and Section 4 details the proposed SuperRange scheme. Section 5 proposes the distributed power delivery network design. Section 6 describes the experiment. We introduce related work in Section 7. Finally Section 8 summarizes this paper. 2 Background Voltage regulator is the key component for delivering power to microprocessors. It has always been challenging to achieve high PCE. The widely used VRs are 1) linear VR and 2) switch-mode VR. The most commonly used linear VR is low-dropout regulator (LDO). Fig.2(a) shows a typical LDO regulator circuit. When the input voltage is higher than the dropout voltage, LDO maintains a specified voltage. In the dropout region, the PMOS pass element is simply a resistor. Voltage regulation is achieved through dropout voltage tuning in LDO part. LDO-VR only provides lower-than-input voltage levels, and its efficiency is limited to the ratio of output to input voltage. Thus LDO-VR can achieve high PCE only when the output voltage is close to the input voltage, and linearly degrades with the increasing gap between input and output voltage. The traditional switch-mode VRs refer to switching inductors VRs. Fig.2(b) shows a typical Off-VR model[9] . The voltage conversion is achieved by two parts: the bridge and the inductor. In the bridge part, the two power switches switch on and off asynchronously at a specified switching frequency. This periodical operation generates a square wave of voltage to charge and discharge the inductor. In the inductor part, a low-pass output filter LC is employed to filter the square wave. From Fourier analysis, finally the 256 J. Comput. Sci. & Technol., Mar. 2016, Vol.31, No.2 output voltage value is equal to the average value of the square wave, Vout = D × Vin , where D is duty cycle which is tunable to support different output voltage levels. D S PMOS Pass Element Vin G R Vp Vo Error Amplifier R Vref (a) Bridge Inductor Vin Re 11 kg Rb Ri Vout Switch Li Cdcap Cb GND GND (b) M Vin M C1 C1 Vout C1 M GND M 3 M C1 C1 M Vout C1 C2 M Vout GND C1 achieves high efficiency at high load level, but degrades fast at low load where the switching loss dominates the total VR power. Moreover, such type of VRs usually is a discrete component and can only reside off chip due to large form factor. In recent years, another type of switch-mode VR design, switch capacitor VR, goes increasingly mature. Switch capacitor VR consists of multiple switches and capacitors organized into specific topologies. The regulation is achieved by using capacitors as energy storage elements and periodically charging/discharging them. Fig.2(c) represents a typical switch capacitor VR model. Basically, it is implemented into a serialparallel topology, and achieves 3:1 voltage conversion ratio. In Fig.2(c), during charging phase, switches M 1, M 5, M 7 are on, others are off, and capacitors C1 and C2 are serially connected and charged, and in discharging phase, switches M 1, M 5, M 7 are off, others are on, and capacitors C1 and C2 are parallel connected and discharged. The circuit is periodically switched between two configurations at a specified switch frequency. Switch capacitor VR can gain high efficiency at a few voltage levels near its target conversion ratio. Compared with inductor mode VRs, switch capacitor VRs can be optimized to deliver power efficiently at low load levels, and they can be built on chip thanks to small physical size and compatible manufacturing process with the host chips. Even though switch capacitor VR lacks the ability to deliver power at all load levels, it is still a promising scheme to complement with other schemes. C1 (c) Fig.2. (a) LDO-VR model. (b) Off-VR model. (c) On-VR model. This type of VRs is still imperfect in conversion efficiency. It suffers from three kinds of power losses: switching loss, the resistive loss in the bridge, and the conductive loss in the inductor. Similar to LDO-VRs, it SuperRange Design Space Exploration In this section, we aim to find an efficient power delivery design towards wide operational range. This design would have the capability to convert the board voltage to a wide output voltage range. In this design, the board voltage is set to 3.7 V which follows conventional cases[10] . The output voltage levels range from NTV 0.4 V to STV 1.2 V[6] . To derive the optimal design is challenging. We therefore first explore the design space and analyze optional designs. These designs are evaluated from two aspects: 1) power conversion efficiency and 2) regulation quality, explained as follows. The power conversion efficiency (η) of a voltage regulator is defined as the ratio of the loading processor power (Pload ) to the power drawn from power source, 257 Xin He et al.: Wide Operational Range Processor Power Delivery Design that is η= Pload 1 = . loss Pload + Ploss 1 + PPload Clearly, we cannot achieve high efficiency if the power loss (Ploss ) is comparable to and even overwhelming the load power. The PCE is obtained based on existing Off-VR and On-VR models[8,11] . The regulation quality refers to output voltage ripple. To maintain high voltage integrity, the upper bound of voltage ripple is usually no more than 10%. 3.1 Off-VR Scheme One intuitive solution is to design multiple Off-VRs whose optimal output points are evenly located in the operational range, as shown in Fig.3(a). There are three sources of power loss: 1) switching loss Pcap , 2) resistive loss Pres in the bridge, and 3) conductive loss Pind in the inductors: ripple VR , because VR ∝ 1/f ; 2) longer response time, i.e., processor power state transition would take longer, as the switching frequency is reduced; 3) increased Pres and Pind , because the effective current Irms and the resistance Ri will increase with the reducing frequency. These trade-offs can be clearly observed in Fig.4. The spice simulation and power modeling results show how voltage ripple and efficiency change with reducing frequency. The initial switch frequency is 500 KHz given suggested f is several hundreds of KHz. To shift the range to low end, f has to be reduced to 133 KHz. Unfortunately, the output ripple goes over 10% guard line, but PCE merely increases 10%∼15%. The power saving is doomed to be offset by the increased load power due to higher voltage to tolerate the large ripple. Moreover, using one more Off-VR further burdens PCB board design and incurs high board area overhead. Hence, the Off-VR scheme is not a recommended design. 30 Ploss = Pcap + Pres + Pind , f=500 KHz where f=133 KHz 25 2 Pind = Ri Irms Board Power Supply 3.7 V Off-VR Off-VR 0.4 V~0.6 V 0.7 V~1.2 V I2 = Rc (IL2 + R ), 3 IR2 2 = Ri (IL + ). 3 Board Power Supply 3.7 V 0.4 V~1.2 V 0.4 0.5 0.6 Output Voltage (V) (a) On-VR CPU CPU Chip Chip Chip (b) 0 0.4 V~0.6 V 0.7 V~1.2 V CPU (a) 15 Max Allowed Ripple 10 Off-VR Off-VR 2 V LDO-VR 20 5 Board Power Supply 3.7 V (c) Fig.3. Top view of optional power delivery designs. (a) Off-VR. (b) LDO-VR. (c) On-VR + Off-VR. In this design, we find the main culprit of low efficiency at low load levels is high switching loss Pcap , while Pres and Pind are relative small. (To reduce the switching loss Pcap , we need to reduce switch capacitance C, input voltage Vin or switching frequency f .) The most effective approach to reducing Pcap is to reduce switching frequency (f ). However, decreasing switching frequency will incur new problems: 1) reduced quality of regulation, i.e., larger output voltage Power Conversion Efficiency (%) 2 Pres = Rc Irms Voltage Ripple (%) Pcap = CVin2 f, 80 f=500 KHz f=133 KHz 70 60 50 40 30 20 10 0 0.4 0.5 Output Voltage (V) 0.6 (b) Fig.4. Voltage ripple and efficiency comparison while reducing frequency from 500 KHz to 133 KHz. (a) Output voltage ripple comparison. (b) PCE comparison. 258 3.2 J. Comput. Sci. & Technol., Mar. 2016, Vol.31, No.2 LDO-VR Scheme Another solution is to use a large range LDO-VR. LDO has the ability to deliver voltage levels below its input voltage. However, the achievable PCE of LDO regulator is limited by the output/input voltages ratio (Vo /Vi ) and the quiescent current (Iq ) as follows: η= Io Vo . (Io + Iq )Vi In this analysis, we ignore the quiescent current’s impact on PCE, because it is two orders of magnitude smaller than the output current Io . Fig.3(b) shows this power delivery scheme. A high efficiency Off-VR serves as the frontend and feeds the step-down voltage to an LDO regulator. The LDO regulator can be built on chip or off chip, and we follow the most conventional design and build it on chip. In this baseline, LDO delivers power from 2 V to the voltage from 0.4 V to 1.2 V. The PCE is limited because of the low ratio of output voltage to input voltage. Especially at near-threshold region, the PCE is less than 30%. Therefore, using LDO regulator is not an efficient solution to support low load levels. 3.3 On-VR Scheme We propose to use On-VR to deliver low output voltage levels as Fig.3(c) shows. To serve NTV efficiently, one should reduce the gap between input and output voltage. An Off-VR first steps 3.7 V power source down to an intermediate value, then the On-VR further converts the voltage to 0.4 V∼0.6 V. Because the PCE is the product of PCEs of On-VR and Off-VR, both of them should be taken into careful consideration. The On-VR model follows a high efficiency design[8] . The power loss is shown as follows: Ploss = PCfly + PRsw + Pbott-cap + Pgate-cap , where PCfly is switch capacitor loss, PRsw is switch conductance loss, Pbott-cap is parasitic capacitor switching loss and Pgate-cap is gate parasitic capacitance switching loss. Fig.1(b) shows PCE of On-VR when OnVR converts voltage from the intermediate level 2 V to 0.4 V∼0.6 V. Clearly, it achieves high conversion efficiency at low output voltage levels because it uses lossless low power MOSFETs. However, On-VR alone lacks the ability to support high voltage range. We therefore need to use the frontend Off-VR as the complement in high voltage range. Specifically, the power delivery steers to Off-VR and shuts down On-VR when the loading processor needs STV, or steers to On-VR when NTV is engaged. The on-chip power switches have been well studied[12-13] . In this scheme, STV is readily supported, but NTV is worthy to be further clarified. The voltage transfer function of On-VR can be formulated as follows[8] : Vo = αVin − Io Ri , where Ri ∝ 1/f. In this equation, Vo is output voltage, Vin is input voltage, Ri is inner resistance, f is switching frequency and α is a topology specific parameter which defines the conversion ratio which is constant given an On-VR design. According to the equation, the output voltage can be modulated by 1) tuning the resistance of On-VR by changing f and keeping the Vin constant, or 2) tuning the input voltage produced by Off-VR and keeping the resistance constant. However, these two approaches are not equivalent to each other, and we find the second one is preferred. The reason is explained as follows. The first way is to adjust the operation state of On-VR fed with fixed input voltage from frontend OffVR. The voltage is firstly converted from 3.7 V to a lower value 2 V by Off-VR. Its PCE can be optimized to 80%∼87%. Then On-VR converts the voltage from fixed voltage 2 V to 0.6 V. Further voltage scaling, i.e., 0.4 V∼0.6 V, is realized by tuning the switching frequency of On-VR. This design can regulate the voltage to near-threshold region with high regulation quality, but the PCE of On-VR suffers from the decreasing switching frequency. The optimal operation point of On-VR is limited in a narrow range near 0.6 V. Simulation result shows this design has an average efficiency 58% to support NTV. Although it is 20% higher than conventional design, the efficiency is still very low at the lowest 0.4 V output voltage. Another alternative is to tune the Off-VR state to generate a variable voltage Vin and use fixed On-VR to step this intermediate voltage to NTV. For example, the Off-VR first changes the duty cycle and steps the source voltage to a value between 1.3 V and 2 V, and then the voltage value is converted to 0.4 V∼0.6 V range by On-VR. Because On-VR does not need to decrease its switching frequency, the PCE can stay around 90%. However, given Vo Io ≈ 90% × Iin Vin in On-VR, the output current of Off-VR, Iout , is very low, which renders the Off-VR less efficient. In our case, the efficiency drops to 44% on average. Although such solution is still not optimal, it sheds light on the way to further 259 Xin He et al.: Wide Operational Range Processor Power Delivery Design optimization: to improve the efficiency of frontend OffVR under a low current mode, which is the key design consideration in SuperRange scheme. Proposed SuperRange Scheme In this section, we use a multi-phase Off-VR (four phases in this paper) as the frontend to build SuperRange. This design solves the low efficiency problem of Off-VR at low current condition when feeding the On-VR. 4.1 Proposed Power Delivery Design Fig.5 shows the top view of the SuperRange scheme and loading processor. The processor working voltage covers NTV and STV. The STV levels are directly provided by 4-phase Off-VR, while the NTV levels are produced by On-VR which uses the Off-VR as its frontend. The On-VR is a serial-parallel single topology VR with 3 : 1 conversion ratio. The Off-VR is a conventional Off1 VR, similar to commercial regulator LTC3733○ . 35 Vo=0.4 V Output Voltage Ripple (%) 4 over 70%. Note that reducing the number of working phase inevitably incurs larger ripple, but the increased current ripple can be easily removed by using a relative large inductance inductor. The increased inductor size will not burden board design since we use only one Off-VR in this design. In our case, we find a 1.5 µH inductor is competent to ease the extra ripple which is confirmed in Fig.6. An inductor with inductances larger than 1.5 µH can reduce the ripple to a level lower than 7%. Thus we can dynamically reduce the number of working phase when more load current is required. 30 Vo=0.5 V Vo=0.6 V 25 20 Max Allowed Ripple 15 10 5 Vin Board Power Supply Vout 0 3.7 V 0.5 1.0 1.5 2.0 2.5 Inductor Inductance (mH) Cdcap Off-VR 0 VID Code Fig.6. Output ripple with varying inductor size. Power Switch 0.4 V~0.6 V Vout 0.7 V~1.2 V PMU Multicore Processor Chip Voltage (V) On-VR 1.3 1.1 STV 1.2 V 0.9 0.7 NTV 0.5 V 0.5 0.3 3.90 3.94 3.98 4.02 4.06 4.10 ms Fig.5. Implementation of proposed power delivery. The multi-phase Off-VR provides an opportunity to improve the efficiency of frontend Off-VR under a low current mode. Multi-phase technology is commonly used to increase the load current. Each phase delivers one equal portion of total load current. Modern Off-VR has the capability to dynamically change the number of working phases[14] . Therefore some phases can be disabled to adapt to the low current mode, without degrading PCE. In our design, when delivering to NTV, we find that a single working phase can afford the load, and other phases can be disabled. The PCE increases Basically, SuperRange has two working modes. 1) Supporting STV. Voltage conversion to STV is performed by Off-VR. Off-VR receives and decodes VID, and the power delivery is directly steered to OffVR. It changes output voltage based on VID. 2) Supporting NTV. The support to NTV region is achieved by a two-step conversion. Off-VR first decodes VID and the power delivery is steered to On-VR. Then Off-VR sets to a single working phase, and the output voltage is initially regulated to a variable intermediate level based on VID. In our case, the intermediate level is 1.3 V, 1.6 V or 2.0 V with the careful consideration of On-VR’s resistance. After this, On-VR steps the voltage to a corresponding near-threshold value. By doing so, the proposed scheme can provide high conversion efficiency over the entire load spectrum. 4.2 VR-Aware Power Management Algorithm Supply voltage scaling decision is made by on-chip power management unit (PMU). PMU is a micro- 1 ○ LTC3733: 3-phase, buck controllers for AMD CPUs. http://cds.linear.com/docs/en/datasheet/3733f.pdf, Nov. 2015. ferent load conditions; 3) processor core count; 4) power budget. 100 Power Conversion Efficiency (%) controller and runs power management routine which makes decisions on core power states (voltage and frequency) and active core count selections. PMU collects power data from on-chip current sensors and performance statistics from performance counters. Power management routine then uses data collected as input and selects supply voltage level to achieve certain power management goals. Once PMU makes a decision on the selection of supply voltage level, PMU sends the corresponding VID (Voltage Identification) code to Off-VR. The detailed PMU design is beyond the scope of this paper. To demonstrate the potential of SuperRange, we present a VR-aware power management algorithm employed by PMU. The goal of this algorithm is to maximize system performance under given power budget. It is worth to note that SuperRange can also be applied to other management algorithms, not limited to this one. We first exploit the efficiency of SuperRange at different load conditions as Fig.7 shows. Figs.7(a) and 7(b) show the PCEs when supporting NTV and STV levels, respectively. Clearly, the efficiency of the SuperRange strongly correlates with output voltage and current. Usually excessively low current cannot yield high efficiency; hence the low current condition should be avoided as much as possible. Then we consider system power efficiency. The power efficiency is defined as the ratio of performance (billion instruction per second) to total power, a.k.a. BIPS/Watt. Without considering the imperfect PCE, the power efficiency increases quadratically with lower voltage. This is because core power consumption reduces cubically while the performance degrades linearly with lower voltage. However, there is a trade-off when taking the imperfect PCE into account. Although high voltage benefits high PCE, it degrades the application power efficiency more. The following algorithm is used to tackle this trade-off and the goal is to maximize the performance under power budget. The load condition, the efficiency of power delivery system and the application performance serve as inputs to the algorithm. The load condition and performance can be measured through on-chip current sensor and performance counter. We assume a homogeneous multi-core processor, but the basic principle is also applicable to heterogenous processors, which is supposed to be future work. The problem can be formulated as follows: • Given: 1) application power and performance in different voltage levels; 2) SuperRange PCE under dif- J. Comput. Sci. & Technol., Mar. 2016, Vol.31, No.2 Vo=0.4 V Vo=0.5 V 80 Vo=0.6 V 60 40 20 0 0 1 2 3 Output Current (A) 4 5 (a) 80 Vo=0.7 V Power Conversion Efficiency (%) 260 Vo=0.9 V 60 Vo=1.1 V 40 20 0 0 5 10 15 Output Current (A) (b) 20 Fig.7. Per-phase efficiency at different load conditions. (a) OnVR & Off-VR model. (b) Off-VR model. • Determine: the number of active cores and the supply voltage level to maximize performance under power budget. The problem is solved with Algorithm 1. Basically, it is a two-step process. Step 1. The algorithm first computes the total power Pv when all cores are active at each voltage level. The power is the sum of measured core power and VR power loss. Then it compares the power values with given power budget, and then selects the lowest level VL under which Pv is higher than PB and the highest level VH under which Pv is lower than PB . The optimal voltage setting will be chosen between the two voltage levels. Step 2. The algorithm then calculates the maximum number of active core at VL which satisfies power budget, and gets the corresponding performance. Then it compares the performance with the performance BVH at 261 Xin He et al.: Wide Operational Range Processor Power Delivery Design VH , and picks the configuration which gives the higher performance. Algorithm 1. Searching for Optimal VF Setting and Active Core Count Input: core power Pvall , performance Bv at voltage level v, processor core count Nall , power budget PB ; SuperRange PCE ηv Output: output voltage level Vo ; active core number No 1: for each v do Pv 2: Pv = all //total power when all cores are active ηv 3: end for 4: Select the highest voltage VH at which Pv is smaller than PB , and get BVH ; 5: Select the lowest voltage VL at which Pv is higher than PB ; 6: for i = Nall ; i > 0; i − − do PLi 7: P = ; //PLi is obtained by current sensor ηv 8: Collect Bi through performance counter; 9: if P <= PB then 10: BVL = Bi ; 11: break; 12: end if 13: end for 14: (Vo , No ) = max(BVH , BVL ); //final configuration gives higher Bips Distributed Power Delivery Network Design Using SuperRange In Section 3 and Section 4, we propose SuperRange design, aiming at delivering power for wide operational range multicore processor at high PCE. However, the power delivering capacity of SuperRange unit is limited. When large load condition is incurred, PCE of SuperRange would degrade significantly although SuperRange could afford such load. In this section, we propose to use SuperRange unit as a basic building block to form a distributed power delivery network. In order to achieve high PCE, we analyze the correlation between PCE and underlying load condition, and observe that optimal PCE could only be attained at a modest load condition. We resort to hardware and software co-design to optimize PCE. PCE should be kept at an acceptable level under proposed power delivery organization, and it could be further increased through load balancing. Firstly we try to find the optimal core organization of SuperRange, specifically, the number of cores that SuperRange delivers to. When applying power delivery to load processor, the most important goal is to maintain PCE an acceptable value, and usually the differ- 75 Power Conversion Efficiency (%) 5 ence of the lowest value to the highest value should be kept less than 25%. We firstly exploit the relation between the power delivery PCE and the underlying core count and take a manycore processor as an example, and the result is shown in Fig.8. The underlying cores are all running in their highest voltage/frequency setting. From this figure, the PCE first rises to an optimal point (nearly 75%) with increasing core count. When core count goes across a certain extent (four cores in this case), PCE drops gradually, to 57% with 16 active cores. Inspired by the result, the most straightforward decision is that each SuperRange unit delivers to four cores to meet the highest PCE point. However, for a manycore system, equipping every a small number of cores with one SuperRange unit will incur large area overhead and lead to over-design. We observe that SuperRange can still keep high PCE even if SuperRange delivers to more cores. This is because conventionally a large number of cores are underutilized in a manycore system, and some cores in SuperRange can be slowed down to regain the PCE. Therefore, we find that deploying each SuperRange for 16 cores is a reasonable solution. It cannot only maintain PCE above 57%, but also incur negligible area overhead and exhibit superior scalability. ferret bodytrack dedup swaptions 70 65 Acceptable PCE 60 55 50 0 5 10 Core Count 15 16 20 Fig.8. SuperRange PCE with increasing active core count. Secondly we propose to use load sharing scheduling algorithm to further increase PCE. The distributed power delivery network for a manycore processor is shown in Fig.9. The processor cores are grouped into several power domains, and each power domain is supported by SuperRange power delivery. This scheme provides a unique opportunity to increase PCE. As we mentioned above, PCE would suffer from excess load 262 J. Comput. Sci. & Technol., Mar. 2016, Vol.31, No.2 SuperRange Domain 1 SuperRange SuperRange Domain 2 SuperRange Domain 2 Domain 1 ... Active Region ... Load Sharing Active Region ... Domain 3 SuperRange ... Domain 4 Domain 3 SuperRange SuperRange Domain 4 SuperRange Fig.9. Application of SuperRange to multicore processor. condition. In this scheme, excess load current can be avoided by load sharing. As shown in Fig.9, without loss of generality, let us assume a multi-programmed workload running on 16 cores. Running 16 cores in one domain would result in a reluctantly accepted PCE value. Fortunately, as shown in this figure, load current can be shared by four power domains. In each domain, four cores are active and the others are shut down, thus the PCE shifts to the high end. In this way, we can also achieve high PCE at large load condition. Note that the load can also be shared by other domains, and the decision making process should compare these results and find the optimal setting. 6 6.1 Experiment Experimental Setup Baseline Architecture. The baseline is a multi-core processor consisting of SuperRange domains. Each domain delivers power to 16 OoO (Out-of-Order) cores, though SuperRange essentially can support any type of architectures. Each domain baseline is deployed on a 32 MB last level cache (LLC). The LLC is organized in eight banks and each bank is 8-way associative. Distributed directory based MESI cache coherence is enabled to maintain the shared data consistency. The processor connects to the main memory through a DDR3-1333 memory controller with 10.6 Gb/s bandwidth. Detailed information is provided in Table 1. Voltage Regulators. The efficiency model of offchip inductor based VR is based on converter model[11] , and inductor and switches parameters are derived from LTC3733’s datasheet. The switch capacitor VR follows an existing high efficiency model[8] , and uses a 3 : 1 serial-parallel topology voltage regulator. We also build their LTspice models to simulate their behavior. Detailed configurations including VR parameters and area cost are listed in Table 2. The area overhead of On-VR is 0.084 mm2 , which is less than 0.5% area of the baseline domain. Table 1. Baseline Architecture Configurations Parameter Core number Power delivery LLC capacity LLC feature Cache coherence On chip interconnection Memory controller Memory bandwidth Area (mm2 ) Value 16 Hybird 32 MB 32 B block, 8-way Distribute directory-based MESI Mesh + router 1 10 Gb/s 253.03 Table 2. Voltage Regulator Parameters Parameter Topology Vin Vout Freq (Mhz) Number of phases L per ph (µH) Intrinsic resistance (mohm) Cfly (nF) Area (mm2 ) Off-VR Buck 3.7 0.7∼2.0 0.5 4 1.5 32 N/A 3.8 On-VR Switch Cap 1.2∼2.0 0.4∼0.7 300 20 N/A 56 20 0.084 Simulation Infrastructure. We use a full-system 2 simulator gem5○ as our simulation infrastructure. In addition, McPAT[15] is plugged in gem5 to evaluate power consumption and area overhead. The technology is configured to 32 nm technology node. 2 ○ The gem5 simulator system. http://www.m5sim.org, Nov. 2015. 263 Xin He et al.: Wide Operational Range Processor Power Delivery Design Table 3. Microarchitectural Parameters Parameter Processor Fetch/decode width Branch-predictor type Reorder buffer size Unified load/store FP/INT register file size Supply voltage (V) Clock rate (GHz) Inst/data cache Access time Miss penalty Area (mm2 ) Value Out-of-Order core 4 Hybrid 2-Level 128 64 64/64 0.4 : 0.1 : 1.2 0.3 : 0.2 : 1.9 32 KB, 2-way 2 ns 12 ns 13.65 Benchmarks. We use Parsec benchmark suite[16] to evaluate our design, because it targets the general purpose processors and aims to represent emerging workloads in the near future. Power Domains. The power domains are powered by SuperRange schemes. For other power consuming components, we follow Intel Nehalem family processor design[17] , i.e., the LLC and memory controllers are powered by their own off-chip VR. Because LLC and memory controller contribute to fixed and relative small portion of system power, our work focuses on core power delivery. 6.2 Experimental Results First, we show the overall power conversion efficiency of SuperRange domain in Fig.10. From the results, we can see that PCE at NTV increases by 40% compared with conventional Off-VR design and the average PCE over entire operational range can be nearly 70%. This is because SuperRange not only implements the synergy between Off-VR and On-VR to improve PCE, but also proposes to optimize the PCE of frontend Off-VR while maintaining On-VR high PCE (around 90%) at NTV levels. Since our scheme fully explores the efficiency benefit of both Off-VR and On-VR, we can conclude that the wide operational range now can be well supported. 100 Power Conversion Efficiency (%) We extend McPAT to calculate the power in different performance states. We build the Out-of-Order (OoO) core resembling Alpha-EV6. The core configuration is detailed in Table 3. The cores have nine performance states, and each state corresponds to a specified voltage/frequency setting: Pstate1(1.2 V, 1.9 GHz), Pstate2(1.1 V, 1.7 GHz), Pstate3(1.0 V, 1.5 GHz), Pstate4(0.9 V, 1.3 GHz), Pstate5(0.8 V, 1.1 GHz), Pstate6(0.7 V, 0.9 GHz), Pstate7(0.6 V, 0.7 GHz), Pstate8(0.5 V, 0.5 GHz), Pstate9(0.4 V, 0.3 GHz). NTV Region STV Region 80 60 40 20 0 SuperRange Off-VR LDO-VR 0.4 0.6 0.8 1.0 1.2 Output Voltage (V) Fig.10. Power conversion efficiency over the entire operational range. Second, we study the performance potential under a constant power budget. To simulate a powerconstrained system, the highest power budget is able to power up eight cores at Pstate1, the highest performance state, and half of the cores have to stay in dark. If more active cores turn out to yield higher performance, we need to reduce the voltage to enable lower performance states, at the risk of large power loss in the delivery. Fig.11 shows the maximum performance delivered by the target processor under three power delivery schemes (as shown in Fig.3) with 25% of the highest power budget. The results show that the proposed SuperRange obviously outperforms the other two schemes. This is because with the highest conversion efficiency over the large voltage range, the processor has more flexibility to tune between the single core performance and the multiple cores parallelism. This flexibility provides the applications more opportunities to successfully enable the optimal multi-threading configurations. By contrast, the LDO-VR and the Off-VR lead to less flexibility because the efficiency loss when delivering to low voltage range may exclude some multi-threading configurations which turn out to be performance optimal. The proposed SuperRange scheme not only outperforms the LDO-VR and the Off-VR at the constant power budget, but also shows superior resilience on even tighter power constraints. Fig.12 shows the normalized performance of the target processor under shrinking 264 J. Comput. Sci. & Technol., Mar. 2016, Vol.31, No.2 power budget. X axis sets the system power budget, Y axis shows the corresponding maximum achievable performance under three delivery schemes. Each box shows the distribution of the performance running Parsec Benchmarks. The results show that at loose power budget, SuperRange behaves as well as Off-VR scheme. When the budget gets tighter, the system prefers NTV levels. The result shows the achievable performance under SuperRange is far higher than the performance in LDO-VR and Off-VR by 171% and 52% on average, respectively. The result demonstrates that SuperRange is a more promising solution in the dark silicon era. 2.5 Τ104 LDO-VR Off-VR SuperRange Performance (MIPS) 2.0 1.5 The experiment is performed on four workloads. We assume each workload is made up of four Parsec applications, and each application has four threads. The workload configuration is detailed in Table 4. And active cores are running on their highest power/performance state. We compare the PCE under four different ways of load sharing. 1) The load is shared by four SuperRange domains, referred to as “4 domain”. 2) The load is balanced in two SuperRange domains, referred to as “2 domain + balance”. 3) The load is imbalanced in two SuperRange domains, referred to as “2 domain + imbalance”. 4) The load is delivered by one SuperRange domain, referred to as “1 domain”. The result is shown in Fig.13. We find that four-domain sharing method outperforms others and two-domain sharing methods behave better than single domain delivering. Under a heavy load condition, load sharing can always reduce resistive loss in power delivery. There is a slight difference of PCE about whether load is balanced or not because of the small absolute difference of load condition. 1.0 Table 4. Workload Formulation 0.5 ps Application Workload 1 blackscholes, bodytrack, canneal, dedup Workload 2 ferret, fluidanimate, freqmine, dedup Workload 3 streamcluster, swaptions, vips, freqmine Workload 4 blackscholes, canneal, swaptions, x264 vi flu fe rr id et an im st at re e am cl us te r fr eq m in sw e ap ti on s de du p bo bl ac ks ch ol es dy tr ac k 0 Workload Fig.11. Performance comparison in power-constrained system. 80 75 Power Conversion Efficiency (%) Normalized Performance 1.0 0.8 0.6 0.4 0.2 0 SuperRange Off-VR LDO-VR 100 90 80 70 60 50 40 Power Budget (%) 20 2 Domain+Balance 1 Domain 70 65 60 55 50 45 40 35 30 30 4 Domain 2 Domain+Imbalance Workload 1 Workload 2 Workload 3 Workload 4 10 Fig.12. Performance under increasingly tighter power constraint. Finally we evaluate the effectiveness of load sharing in the distributed SuperRange power delivery design. Fig.13. PCE comparison of different load sharing schemes. 7 Related Work The design for high efficiency power delivery has been a hot topic for years. Yan et al.[18] presented a Xin He et al.: Wide Operational Range Processor Power Delivery Design hybrid power delivery scheme to explore different power phases, but they did not take the varying PCE into consideration. Ng and Sanders[19] and Le et al.[8] proposed a high efficiency switched-capacitor DC-DC converter. Ng and Sanders’ design aims at large input voltage range, but the output voltage level is limited. Le et al.’s design implements a multiple topology converter with increased design complexity and on-chip area overhead. It is unsuitable for powering microprocessors. Kim et al.[20] designed an integrated 3-level DC-DC converter, and their design targets at fast DVFS and does not have high efficiency at a low load level. He et al.[21] proposed a wide range power delivery design, but like any other power delivery circuit design, the power delivering capability of their design is limited. All the designs stated above lack the ability to support a wide operational range for manycore processor efficiently. Another line of prior work considers system level strategies. Cho et al.[13] proposed a system level power management method based on dynamic voltage regulator scheduling to achieve high conversion efficiency over a large power range. They chose the most efficient voltage regulator with changing load levels. Sinkar et al.[22] proposed a workload aware voltage regulator configuration tuning method to optimize system power. Amelifard and Pedram[23] introduced a reconfigurable power delivery network design by dynamically changing the topology of voltage regulators to support different load levels. Ghasemi et al.[24] implemented per-core voltage domains by using on-chip low dropout converters. Their method imposes low area overhead since the converters share components with power gating circuit. However, the LDO converter has a low conversion efficiency at low load levels. Yan et al.[25] applied pathgrained voltage adaptation to relieve the violation incurred by low voltage, and Han et al.[26] proposed the redundancy-based data salvaging technique for fault recovery to enable near-threshold voltage. Both of Yan et al.’s and Wang et al.’s design are orthogonal to design in this paper. Differing from these designs, our work addresses the efficiency problem through a hybrid power delivery scheme and highlights wide operation range which is missed in prior work. 8 Conclusions In this paper, we proposed a large operational range power delivery scheme, SuperRange, by exploring the advantages of both on-chip and off-chip voltage regulator. We thoroughly analyzed the efficiency beha- 265 vior of existing voltage regulators, and ensure that SuperRange can provide high power conversion efficiency over the entire voltage range. Moreover, we proposed a voltage regulator aware power management algorithm. This algorithm heuristically finds optimal processor configurations (number of active core and voltage/frequency setting) to maximize performance under given power budget. Experimental results showed that SuperRange is well suitable to support wide power range for future power-constrained systems. References [1] Esmaeilzadeh H, Blem E, Amant R S, Sankaralingam K, Burger D. Dark silicon and the end of multicore scaling. In Proc. ISCA, June 2011, pp.365-376. [2] Intel. Enhanced Intelr SpeedStepr Technology for the Intelr Pentiumr M Processor. http://download.intel.com/design/network/papers/30117401.pdf, Aug. 2015. [3] Rotem E, Naveh A, Rajwan D, Ananthakrishnan A, Weissmann E. Power management architecture of the 2nd generation Intelr CoreTM microarchitecture, formerly codenamed Sandy Bridge. http://120.52.72.36/www.hotchips.org/c3pr90ntcsf0/wp-content/uploads/hc archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge Power 10-Rotem-Intel.pdf, August 2011. [4] Dreslinski R G, Wieckowski M, Blaauw D et al. Nearthreshold computing: Reclaiming Moore’s law through energy efficient integrated circuits. Proceedings of the IEEE, 2010, 98(2): 253-266. [5] Kim J, Horowitz M. An efficient digital sliding controller for adaptive power-supply regulation. IEEE Journal of SolidState Circuits (JSSC), 2002, 37(5): 639-647. [6] Jain S, Khare S, Yada S et al. A 280mV-to-1.2V wideoperating-range IA-32 processor in 32nm CMOS. In Proc. IEEE ISSCC, Feb. 2012, pp.66-68. [7] Esmaeilzadeh H, Blem E, Amant R S, Sankaralingam K, Burger D. Power limitations and dark silicon challenge the future of multicore. ACM Transactions on Computer Systems (TOCS), 2012, 30(3): 11:1-11:27. [8] Le H P, Sanders S R, Alon E. Design techniques for fully integrated switched-capacitor DC-DC converters. IEEE Journal of Solid State Circuits (JSSC), 2011, 46(9): 2120-2131. [9] Schrom G, Hazucha P, Paillet F, Gardner D S, Moon S T, Karnik T. Optimal design of monolithic integrated DC-DC converters. In Proc. IEEE ICICDT, May 2006. [10] Kim W, Gupta M S, Wei G Y, Brooks D. System level analysis of fast, per-core DVFS using on-chip switching regulators. In Proc. the 14th HPCA, Feb. 2008, pp.123-134. [11] Choi Y, Chang N, Kim T. DC-DC converter-aware power management for low-power embedded systems. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst., 2007, 26(8): 1367-1381. [12] Amelifard B, Pedram M. Design of an efficient power delivery network in an SoC to enable dynamic power management. In Proc. ACM/IEEE ISLPED, August 2007, pp.328333. [13] Cho Y, Kim Y, Joo Y, Lee K, Chang N. Simultaneous optimization of battery-aware voltage regulator scheduling with dynamic voltage and frequency scaling. In Proc. ACM/IEEE ISLPED, August 2008, pp.309-314. 266 [14] Zumel P, Fernandez C, de Castro A, Garcia O. Efficiency improvement in multiphase converter by changing dynamically the number of phases. In Proc. the 37th PESC, June 2006. [15] Li S, Ahn J H, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Annual IEEE/ACM MICRO, Dec. 2009, pp.469-480. [16] Bienia C, Kumar S, Singh J P, Li K. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. the 17th PACT, October 2008, pp.72-81. [17] Gunther S, Deval A, Burton T, Kumar R. Energy-efficient computing: Power management system on the Nehalem family of processors. Intel Technology Journal, 2010, 14(3): 50-65. [18] Yan G, Li Y, Han Y, Li X, Guo M, Liang X. AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture. In Proc. the 18th IEEE HPCA, Feb. 2012, pp.287-298. [19] Ng V, Sanders S. A 92%-efficiency wide-input-voltagerange switched-capacitor DC-DC converter. In Proc. IEEE ISSCC, Feb. 2012, pp.282-284. [20] KimW, Brooks D, Wei G Y. A fully-integrated 3-level DCDC converter for nanosecond-scale DVFS. IEEE Journal of Solid State Circuits (JSSC), 2012, 47(1): 206-219. [21] He X, Yan G, Han Y, Li X. SuperRange: Wide operational range power delivery design for both STV and NTV computing. In Proc. DATE, March 2014. [22] Sinkar A A,Wang H, Kim N S. Workload-aware voltage regulator optimization for power efficient multi-core processors. In Proc. DATE, March 2012, pp.1134-1137. [23] Amelifard B, Pedram M. Optimal selection of voltage regulator modules in a power delivery network. In Proc. the 44th ACM/IEEE DAC, June 2007, pp.168-173. [24] Ghasemi H R, Sinkar A A, Schulte M J, Kim N S. Costeffective power delivery to support per-core voltage domains for power-constrained processors. In Proc. the 49th ACM/IEEE DAC, June 2012, pp.56-61. [25] Yan G, Han Y, Liu H, Liang X, Li X. MicroFix: Exploiting path-grained timing adaptability for improving powerperformance efficiency. In Proc. ACM/IEEE ISLPED, August 2009, pp.395-400. [26] Han Y, Wang Y, Li H, Li X. Enabling near-threshold voltage (NTV) operation in multi-VDD cache for power reduction. In Proc. IEEE ISCAS, May 2013, pp.337-340. Xin He received his B.E. degree in software engineering from Sichuan University, Chengdu, in 2011, and his M.E. degree in computer architecture from Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 2013. He is currently a Ph.D. candidate in ICT, CAS. His research interests include computer architecture especially on near-threshold computing, low power architecture and approximate computing. He is a student member of CCF/IEEE. J. Comput. Sci. & Technol., Mar. 2016, Vol.31, No.2 Gui-Hai Yan is an associate professor of ICT, CAS. He received his B.S. degree in electronic information science and technology from Peking University, Beijing, in 2005, and Ph.D. degree in computer architecture from ICT, CAS, in 2010. His research focuses on energyefficient and resilient architectures. He was awarded the CCF Outstanding Doctoral Dissertation and Chinese Academy of Sciences Outstanding Doctoral Dissertation for contributions to reliable computer architecture. He is a member of CCF, ACM, and IEEE. Yin-He Han received his B.E. degree from the College of Automatization Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, in 2001, and his M.E. and Ph.D. degrees in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 2003 and 2006, respectively. He is currently a professor at the State Key Laboratory of Computer Architecture, ICT, CAS. His research interests include computer architecture especially on fault-tolerant and low power architecture, VLSI design and test. Prof. Han was a recipient of Best Paper Award at ATS 2003. He is program chair of ATS 2014, finance chair of HPCA 2013, program co-chair of WRTLT 2009, and serves on the Technical Program Committees of several IEEE and ACM conferences, including PACT 2014, HPCA 2013, ASPDAC 2013, Cool Chip 2013, ATS 2008∼2010, GVLSI 2009∼2010, etc. He is a senior member of CCF and IEEE, and a member of ACM. Xiao-Wei Li received his B.E. and M.E. degrees in computer science from Hefei University of Technology, Hefei, in 1985 and 1988 respectively, and his Ph.D. degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 1991. Currently, he is a professor and deputy director of the State Key Laboratory of Computer Architecture, ICT, CAS. His research interests include VLSI testing and design verification, dependable computing, and wireless sensor networks. He is an associate editor-in-chief of the Journal of Computer Science and Technology, and a member of editorial board of Journal of Electronic Testing and Journal of Low Power Electronics. In addition, he serves on the Technical Program Committees of several IEEE and ACM conferences, including VTS, DATE, ASP-DAC, PRDC, etc. He was program co-chair of IEEE Asian Test Symposium in 2003, and general co-chair of ATS 2007.