Memory Power Management via Dynamic Voltage/Frequency Scaling
Howard David (Intel), Eugene Gorbatov (Intel), Ulf R. Hanebutte (Intel), Chris Fallin (CMU), Onur Mutlu (CMU)

Memory Power is Significant
- Power consumption is a primary concern in modern servers
- Many works target the CPU, or take whole-system or cluster-level approaches, but memory power is largely unaddressed
- In our server system*, memory is 19% of system power (avg)
  - Some work notes up to 40% of total system power
- [Chart: system power vs. memory power (W) across SPEC CPU2006 benchmarks]
- Goal: can we reduce this figure?

* Dual 4-core Intel Xeon®, 48 GB DDR3 (12 DIMMs), SPEC CPU2006, all cores active. Measured AC power; analytically modeled memory power.

Existing Solution: Memory Sleep States?
- Most memory energy-efficiency work uses sleep states
  - Shut down DRAM devices when no memory requests are active
- But even low-memory-bandwidth workloads keep memory awake
  - Idle periods between requests diminish in multicore workloads
  - CPU-bound workloads/phases are rarely completely cache-resident
- [Chart: time spent in sleep states across SPEC CPU2006; residency stays in the 0–8% range]

Memory Bandwidth Varies Widely
- Workload memory bandwidth requirements vary widely
- [Chart: per-channel memory bandwidth (GB/s) for SPEC CPU2006]
- The memory system is provisioned for peak capacity → often underutilized

Memory Power can be Scaled Down
- DDR can operate at multiple frequencies → reduced power
  - Lower frequency directly reduces switching power
  - Lower frequency allows for lower voltage
  - Comparable to CPU DVFS:

        CPU voltage/frequency ↓ 15%  →  system power ↓ 9.9%
        Memory frequency      ↓ 40%  →  system power ↓ 7.6%

- Frequency scaling increases latency → reduced performance
  - The memory storage array is asynchronous
  - But bus transfer depends on frequency
  - When bus bandwidth is the bottleneck, performance suffers
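The compounding benefit of scaling frequency and voltage together follows from the standard CMOS dynamic-power relation (a textbook identity, not a model from this talk):

    P_dyn = α · C · V_dd² · f

Lowering f alone reduces switching power linearly; because a lower frequency also permits a lower V_dd, the quadratic V_dd² term yields additional savings. This is why memory DVFS, like CPU DVFS, scales voltage alongside frequency.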
Observations So Far
- Memory power is a significant portion of total power
  - 19% (avg) in our system; up to 40% noted in other works
- Sleep-state residency is low in many workloads
  - Multicore workloads reduce idle periods
  - CPU-bound applications send requests frequently enough to keep memory devices awake
- Memory bandwidth demand is very low in some workloads
- Memory power is reduced by frequency scaling
  - And voltage scaling can give further reductions

DVFS for Memory
- Key idea: observe memory bandwidth utilization, then adjust memory frequency/voltage to reduce power with minimal performance loss
  → Dynamic Voltage/Frequency Scaling (DVFS) for memory
- Goal in this work: implement DVFS in the memory system by
  - developing a simple control algorithm that exploits the opportunity for reduced memory frequency/voltage by observing behavior, and
  - evaluating the proposed algorithm on a real system

Outline
- Motivation
- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm
- Evaluation and Conclusions

DRAM Operation
- Main memory consists of DIMMs of DRAM devices
- Each DIMM is attached to a memory bus (channel): eight x8 devices together drive the 64-bit bus
- Multiple DIMMs can connect to one channel to the memory controller

Inside a DRAM Device: Banks
- Independent storage arrays, each with its own row decoder, sense amps, and column decoder
- Asynchronous: independent of the memory bus speed

Inside a DRAM Device: I/O Circuitry
- Runs at bus speed
- Clock synchronization/distribution
- Bus drivers and receivers
- Buffering/queueing (registers, write FIFO)

Inside a DRAM Device: On-Die Termination (ODT)
- Required by the bus's electrical characteristics for reliable operation
- A resistive element that dissipates power whenever the bus is active

Effect of Frequency Scaling on Power
- A reduced memory bus frequency:
  - Does not affect bank power: constant energy per operation, so bank power depends only on utilized memory bandwidth
  - Decreases I/O power: dynamic power in the bus interface and clock circuitry falls due to less frequent switching
  - Increases termination power: the same data takes longer to transfer, so bus utilization increases
- The tradeoff between I/O and termination results in a net power reduction at lower frequencies (see the sketch below)

Effects of Voltage Scaling on Power
- Voltage scaling further reduces power because all parts of the memory devices draw less current at lower voltage
- Voltage reduction is possible because stable operation requires a lower voltage at a lower frequency
- [Chart: minimum stable voltage for 8 DIMMs in a real system at 800, 1066, and 1333 MHz (DIMM voltage axis 1.0–1.6 V), with the Vdd values used in the power model]
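To make the bank/I/O/termination tradeoff concrete, here is a minimal C sketch of a toy per-DIMM power model. It is not the paper's analytical model; every coefficient is a hypothetical placeholder, chosen only to show the direction of each effect.

    /* Toy per-DIMM power model: NOT the paper's analytical model.
     * All coefficients are hypothetical placeholders. */
    #include <stdio.h>

    /* Power (W) at bus speed f (MT/s), voltage v (V), bandwidth bw (GB/s). */
    static double dimm_power(double f, double v, double bw)
    {
        double peak_bw = f * 8.0 / 1000.0;  /* 64-bit channel: 8 bytes per transfer */
        double util    = bw / peak_bw;      /* bus utilization rises as f drops */

        double p_bank = 0.5 * bw;           /* banks: constant energy/op, tracks BW only */
        double p_io   = 0.002 * v * v * f;  /* I/O + clocking: switching power ~ V^2 * f */
        double p_term = 1.5 * util;         /* ODT: dissipates while the bus is active */

        return p_bank + p_io + p_term;
    }

    int main(void)
    {
        /* Same 2 GB/s workload at two operating points (voltages hypothetical): */
        printf("1333 MT/s: %.2f W\n", dimm_power(1333.0, 1.50, 2.0));
        printf(" 800 MT/s: %.2f W\n", dimm_power(800.0, 1.35, 2.0));
        return 0;
    }

With these placeholder numbers, moving from 1333 to 800 MT/s cuts I/O power far more than it raises termination power, so net DIMM power falls, matching the qualitative argument above.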
Outline → Performance Effects of Frequency Scaling

How Much Memory Bandwidth is Needed?
- [Chart: per-channel memory bandwidth (GB/s) for SPEC CPU2006, spanning near-zero to several GB/s across benchmarks]

Performance Impact of Static Frequency Scaling
- [Chart: performance drop (%) for static 1333→800 and 1333→1066 scaling across SPEC CPU2006, shown at full scale and zoomed to the 0–8% range]
- Performance impact is proportional to bandwidth demand
- Many workloads tolerate the lower frequency with minimal performance drop

Outline → Frequency Control Algorithm

Memory Latency Under Load
- At low load, most time is spent in array access and bus transfer
  → a small constant offset between the latency curves at different bus frequencies
- As load increases, queueing delay begins to dominate
  → bus frequency significantly affects latency
- [Chart: memory latency (ns) as a function of utilized channel bandwidth (MB/s) at 800, 1066, and 1333 MHz]

Control Algorithm: Demand-Based Switching
- [Chart: the same latency curves, with switching thresholds T800 and T1066 marked on the bandwidth axis]
- After each epoch of length Tepoch:
      Measure per-channel bandwidth BW
      if BW < T800:        switch to 800 MHz
      else if BW < T1066:  switch to 1066 MHz
      else:                switch to 1333 MHz
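This policy is simple enough to state directly in code. Below is a minimal C sketch of the epoch loop; the platform hooks (bandwidth counters, frequency control, sleep) are hypothetical stubs, and the thresholds shown match the BW(0.5, 1) variant used later in the evaluation.

    /* Demand-based switching, per the pseudocode above. The extern
     * functions are hypothetical platform hooks, not a real driver API. */
    enum mem_freq { MHZ_800, MHZ_1066, MHZ_1333 };

    #define T_EPOCH_MS  10    /* epoch length used in the evaluation */
    #define T_800_GBS   0.5   /* below this, run at 800 MHz  */
    #define T_1066_GBS  1.0   /* below this, run at 1066 MHz */

    extern double read_channel_bandwidth_gbs(int channel);
    extern void   set_memory_freq(int channel, enum mem_freq f);
    extern void   sleep_ms(int ms);

    static void dvfs_epoch(int channel)
    {
        double bw = read_channel_bandwidth_gbs(channel);

        if (bw < T_800_GBS)
            set_memory_freq(channel, MHZ_800);
        else if (bw < T_1066_GBS)
            set_memory_freq(channel, MHZ_1066);
        else
            set_memory_freq(channel, MHZ_1333);
    }

    void dvfs_control_loop(int nchannels)
    {
        for (;;) {
            for (int ch = 0; ch < nchannels; ch++)
                dvfs_epoch(ch);
            sleep_ms(T_EPOCH_MS);
        }
    }

Keeping the policy a pure threshold on measured bandwidth is what lets it avoid per-application performance models, the contrast drawn with MemScale in the related work.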
Implementing V/F Switching
- Halt memory operations
  - Pause requests
  - Put DRAM in self-refresh
  - Stop the DIMM clock
- Transition voltage/frequency
  - Begin the voltage ramp
  - Relock the memory controller PLL at the new frequency
  - Restart the DIMM clock
  - Wait for the DIMM PLLs to relock
- Begin memory operations
  - Take DRAM out of self-refresh
  - Resume requests
- Practical notes:
  - Memory frequency is already adjustable statically
  - Voltage regulators for CPU DVFS can work for memory DVFS
  - A full transition takes ~20 µs
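As a sketch, this sequence maps onto a single transition routine. Every hardware step below is a hypothetical stub; the point is the ordering of the steps.

    /* V/F transition sequence, per the steps above. All hooks are
     * hypothetical stubs; the full transition takes ~20us in practice. */
    extern void pause_memory_requests(void);
    extern void enter_self_refresh(void);   /* DRAM retains data, no bus clock needed */
    extern void stop_dimm_clock(void);
    extern void ramp_voltage(double volts);
    extern void relock_mc_pll(int freq_mhz);
    extern void start_dimm_clock(void);
    extern void wait_for_dimm_plls(void);
    extern void exit_self_refresh(void);
    extern void resume_memory_requests(void);

    void switch_vf(int freq_mhz, double volts)
    {
        /* 1. Halt memory operations */
        pause_memory_requests();
        enter_self_refresh();
        stop_dimm_clock();

        /* 2. Transition voltage and frequency */
        ramp_voltage(volts);
        relock_mc_pll(freq_mhz);
        start_dimm_clock();
        wait_for_dimm_plls();

        /* 3. Begin memory operations */
        exit_self_refresh();
        resume_memory_requests();
    }

Holding DRAM in self-refresh is what makes it safe to stop and restart the DIMM clock in the middle of the transition.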
Outline → Evaluation and Conclusions

Evaluation Methodology
- Real-system evaluation
  - Dual 4-core Intel Xeon®, 3 memory channels/socket
  - 48 GB of DDR3 (12 DIMMs, 4 GB dual-rank, 1333 MHz)
- Emulating memory frequency for performance
  - Altered memory controller timing registers (tRC, tB2BCAS)
  - Gives performance equivalent to slower memory frequencies
- Modeling power reduction
  - Measure the baseline system (AC power meter, 1 s samples)
  - Compute reductions with an analytical model (see paper)

Evaluation Methodology: Workloads and Parameters
- Workloads
  - SPEC CPU2006: CPU-intensive workloads
  - All cores run a copy of the benchmark
- Parameters
  - Tepoch = 10 ms
  - Two variants of the algorithm with different switching thresholds (expressed as configurations below):
    - BW(0.5, 1): T800 = 0.5 GB/s, T1066 = 1 GB/s
    - BW(0.5, 2): T800 = 0.5 GB/s, T1066 = 2 GB/s → more aggressive frequency/voltage scaling
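For the control-loop sketch shown earlier, the two variants differ only in the T1066 threshold; as data, they might look like this (struct and names are illustrative):

    /* The two evaluated threshold variants (names illustrative). */
    struct dvfs_thresholds {
        double t800_gbs;   /* below this -> 800 MHz  */
        double t1066_gbs;  /* below this -> 1066 MHz */
    };

    static const struct dvfs_thresholds bw_05_1 = { 0.5, 1.0 };  /* BW(0.5, 1) */
    static const struct dvfs_thresholds bw_05_2 = { 0.5, 2.0 };  /* BW(0.5, 2): more aggressive */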
Performance Impact of Memory DVFS
- Minimal performance degradation: 0.2% (avg), 1.7% (max)
- Experimental error is ~1%
- [Chart: performance degradation (%) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006]

Memory Frequency Distribution
- The frequency distribution shifts toward higher memory frequencies for more memory-intensive benchmarks
- [Chart: fraction of time spent at 800, 1066, and 1333 MHz per benchmark]

Memory Power Reduction
- Memory power reduces by 10.4% (avg), 20.5% (max)
- [Chart: memory power reduction (%) for BW(0.5,1) and BW(0.5,2)]

System Power Reduction
- As a result, system power reduces by 1.9% (avg), 3.5% (max)
- [Chart: system power reduction (%) for BW(0.5,1) and BW(0.5,2)]

System Energy Reduction
- System energy reduces by 2.4% (avg), 5.1% (max)
- [Chart: system energy reduction (%) for BW(0.5,1) and BW(0.5,2)]

Related Work
- MemScale [Deng11], concurrent work (ASPLOS 2011)
  - Also proposes memory DVFS
  - Uses an application performance-impact model to decide voltage and frequency: this requires specific modeling for a given system; our bandwidth-based approach avoids that complexity
  - Simulation-based evaluation; our work is a real-system proof of concept
- Memory sleep states: creating opportunity with data placement [Lebeck00, Pandey06], OS scheduling [Delaluz02], or the VM subsystem [Huang05]; making better decisions with better models [Hur08, Fan01]
- Power limiting/shifting: RAPL [David10] uses memory throttling for thermal limits; CPU throttling to shape memory traffic [Lin07, Lin08]; power shifting across the system [Felter05]

Conclusions
- Memory power is a significant component of system power
  - 19% average in our evaluation system; 40% in other work
- Workloads often keep memory active but underutilized
  - Channel bandwidth demands are highly variable
  - Use of memory sleep states is often limited
- Scaling memory frequency/voltage can reduce memory power with minimal system performance impact
  - 10.4% average memory power reduction
  - Yields 2.4% average system energy reduction
- Greater reductions are possible with a wider frequency/voltage range and better control algorithms

Memory Power Management via Dynamic Voltage/Frequency Scaling
Howard David (Intel), Eugene Gorbatov (Intel), Ulf R. Hanebutte (Intel), Chris Fallin (CMU), Onur Mutlu (CMU)

Why Real-System Evaluation?
- Advantages:
  - Captures all effects of altered memory performance: system/kernel code, interactions with I/O and peripherals, etc.
  - Able to run full-length benchmarks (SPEC CPU2006) rather than short instruction traces
  - No concerns about architectural simulation fidelity
- Disadvantages:
  - More limited room for novel algorithms and detailed measurements
  - Inherent experimental error due to background-task noise, real power measurements, and nondeterministic timing effects
- For a proof of concept, we chose to run on a real system in order to have results that capture all potential side effects of altering memory frequency

CPU-Bound Applications in a DRAM-Rich System
- We evaluate CPU-bound workloads with 12 DIMMs: what about smaller memory, or I/O-bound workloads?
- 12 DIMMs (48 GB): are we magnifying the problem?
  - Large servers can have this much memory, especially for database or enterprise applications
  - Memory can be up to 40% of system power [1,2], and reducing its power in general is an academically interesting problem
- CPU-bound workloads: will it matter in real life?
  - Many workloads have CPU-bound phases (e.g., database scans or business logic in server workloads)
  - Focusing on CPU-bound workloads isolates the problem of varying memory bandwidth demand while memory cannot enter sleep states, and our solution applies to any compute phase of a workload

[1] L. A. Barroso and U. Hölzle. "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines." Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2009.
[2] C. Lefurgy et al. "Energy Management for Commercial Servers." IEEE Computer, pp. 39–48, December 2003.

Combining Memory & CPU DVFS?
- Our evaluation did not incorporate CPU DVFS:
  - We need to understand the effect of a single knob (memory DVFS) first
  - Combining with CPU DVFS might produce second-order effects that would need to be accounted for
- Nevertheless, memory DVFS is effective by itself, and mostly orthogonal to CPU DVFS:
  - Each knob reduces power in a different component
  - Our memory DVFS algorithm has negligible performance impact → negligible impact on CPU DVFS
  - CPU DVFS will only further reduce bandwidth demands relative to our evaluations → no negative impact on memory DVFS

Why is this Autonomic Computing?
- Power management in general is autonomic: a system observes its own needs and adjusts its behavior accordingly
  → Much previous work comes from the architecture community, but crossover in ideas and approaches could be beneficial
- This work exposes a new knob for control algorithms to turn, provides a simple model for the power/energy effects of that knob, and observes the opportunity to apply it in a simple way
- It opens future work on:
  - More advanced control algorithms
  - Coordinated energy efficiency across the rest of the system
  - Coordinated energy efficiency across a cluster/datacenter, integrated with memory DVFS, CPU DVFS, etc.