Memory Power Management via
Dynamic Voltage/Frequency Scaling
Howard David (Intel)
Eugene Gorbatov (Intel)
Ulf R. Hanebutte (Intel)
Chris Fallin (CMU)
Onur Mutlu (CMU)
Memory Power is Significant
• Power consumption is a primary concern in modern servers
• Many works: CPU, whole-system, or cluster-level approaches
• But memory power is largely unaddressed
• Our server system*: memory is 19% of system power (avg)
  - Some work notes up to 40% of total system power
[Chart: System power and memory power (W, 0–400) per SPEC CPU2006 benchmark]
• Goal: Can we reduce this figure?

*Dual 4-core Intel Xeon®, 48GB DDR3 (12 DIMMs), SPEC CPU2006, all cores active.
Measured AC power, analytically modeled memory power.
Existing Solution: Memory Sleep States?
• Most memory energy-efficiency work uses sleep states
  - Shut down DRAM devices when no memory requests are active
• But, even low-memory-bandwidth workloads keep memory awake
  - Idle periods between requests diminish in multicore workloads
  - CPU-bound workloads/phases rarely completely cache-resident
[Chart: Time spent in sleep states (sleep-state residency, 0–8%) per SPEC CPU2006 benchmark]
Memory Bandwidth Varies Widely
• Workload memory bandwidth requirements vary widely
[Chart: Memory bandwidth for SPEC CPU2006, bandwidth/channel (GB/s, 0–8) per benchmark]
• Memory system is provisioned for peak capacity → often underutilized
Memory Power can be Scaled Down
• DDR can operate at multiple frequencies → reduced power
  - Lower frequency directly reduces switching power
  - Lower frequency allows for lower voltage
  - Comparable to CPU DVFS:
      CPU voltage/freq. ↓ 15%  →  system power ↓ 9.9%
      Memory freq.      ↓ 40%  →  system power ↓ 7.6%
• Frequency scaling increases latency → reduced performance
  - Memory storage array is asynchronous
  - But bus transfer depends on frequency
  - When bus bandwidth is the bottleneck, performance suffers
Observations So Far
• Memory power is a significant portion of total power
  - 19% (avg) in our system, up to 40% noted in other works
• Sleep-state residency is low in many workloads
  - Multicore workloads reduce idle periods
  - CPU-bound applications send requests frequently enough to keep memory devices awake
• Memory bandwidth demand is very low in some workloads
• Memory power is reduced by frequency scaling
  - And voltage scaling can give further reductions
DVFS for Memory
• Key Idea: observe memory bandwidth utilization, then adjust memory frequency/voltage to reduce power with minimal performance loss
  → Dynamic Voltage/Frequency Scaling (DVFS) for memory
• Goal in this work:
  - Implement DVFS in the memory system, by:
    - Developing a simple control algorithm to exploit opportunity for reduced memory frequency/voltage by observing behavior
    - Evaluating the proposed algorithm on a real system
Outline
• Motivation
• Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
• Performance Effects of Frequency Scaling
• Frequency Control Algorithm
• Evaluation and Conclusions
DRAM Operation
• Main memory consists of DIMMs of DRAM devices
• Each DIMM is attached to a memory bus (channel)
• Multiple DIMMs can connect to one channel
[Figure: eight x8 DRAM devices on a DIMM, connected to a 64-bit memory bus leading to the memory controller]
Inside a DRAM Device
[Figure: DRAM device block diagram — banks (row decoder, sense amps, column decoder) plus I/O circuitry (receivers, drivers, registers, write FIFO, on-die termination)]

• Banks
  - Independent arrays
  - Asynchronous: independent of memory bus speed
• I/O circuitry
  - Runs at bus speed
  - Clock sync/distribution
  - Bus drivers and receivers
  - Buffering/queueing (write FIFO)
• On-die termination (ODT)
  - Required by bus electrical characteristics for reliable operation
  - Resistive element that dissipates power when the bus is active
Effect of Frequency Scaling on Power
Reduced memory bus frequency:
• Does not affect bank power
  - Constant energy per operation
  - Depends only on utilized memory bandwidth
• Decreases I/O power
  - Dynamic power in bus interface and clock circuitry reduces due to less frequent switching
• Increases termination power
  - Same data takes longer to transfer
  - Hence, bus utilization increases
• Tradeoff between I/O and termination results in a net power reduction at lower frequencies (a rough model sketch follows below)
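A rough sketch of how these three components might combine into a first-order memory power estimate. The structure follows the bullets above (bank power tracks utilized bandwidth, I/O power tracks bus rate, termination power tracks bus utilization), but the coefficients and linear forms are illustrative assumptions, not the calibrated model from the paper.

/* First-order memory power sketch; all coefficients are illustrative assumptions. */
#include <stdio.h>

static double memory_power_watts(double bw_gbps, double freq_mts)
{
    double peak_gbps = freq_mts * 8.0 / 1000.0; /* 64-bit bus: 8 bytes per transfer */
    double util      = bw_gbps / peak_gbps;     /* bus utilization                  */
    double k_bank = 1.5;    /* W per GB/s of utilized bandwidth (assumed) */
    double k_io   = 0.004;  /* W per MT/s of bus rate (assumed)           */
    double k_term = 6.0;    /* W at 100% bus utilization (assumed)        */

    return k_bank * bw_gbps     /* bank power: unaffected by frequency        */
         + k_io   * freq_mts    /* I/O power: falls at lower frequency        */
         + k_term * util;       /* termination power: rises with utilization  */
}

int main(void)
{
    /* Same 1 GB/s of traffic at two bus rates: the I/O savings outweigh the
     * extra termination power, giving a net reduction at the lower rate.   */
    printf("1333: %.2f W\n", memory_power_watts(1.0, 1333.0));
    printf(" 800: %.2f W\n", memory_power_watts(1.0, 800.0));
    return 0;
}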
Effects of Voltage Scaling on Power
• Voltage scaling further reduces power because all parts of memory devices will draw less current (at less voltage)
• Voltage reduction is possible because stable operation requires lower voltage at lower frequency (a rough scaling illustration follows below):
[Chart: Minimum stable DIMM voltage (V, 1.0–1.6) for 8 DIMMs in a real system at 800, 1066, and 1333 MHz, with the Vdd values used for the power model]
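As a rough illustration of why voltage scaling compounds the savings: the frequency-dependent portion of power scales roughly as V²·f under an ideal-CMOS-switching assumption. The operating points below are illustrative (nominal DDR3 1.5 V vs. a DDR3L-like 1.35 V), not values read from the measured chart.

/* V^2 * f scaling sketch; operating points are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double v_hi = 1.50, f_hi = 1333.0;  /* assumed nominal point */
    double v_lo = 1.35, f_lo = 800.0;   /* assumed low-V/F point */

    double freq_only = f_lo / f_hi;
    double volt_freq = (v_lo * v_lo * f_lo) / (v_hi * v_hi * f_hi);

    printf("frequency-only scaling:    %.0f%% of baseline switching power\n", 100.0 * freq_only);
    printf("voltage+frequency scaling: %.0f%% of baseline switching power\n", 100.0 * volt_freq);
    return 0;
}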
How Much Memory Bandwidth is Needed?
[Chart: Memory bandwidth for SPEC CPU2006, bandwidth/channel (GB/s, 0–7) per benchmark]
Performance Impact of Static Frequency Scaling
[Chart: Performance loss (%) with static frequency scaling, 1333→800 and 1333→1066, per SPEC CPU2006 benchmark; a zoomed view (0–8%) shows most benchmarks losing very little]
• Performance impact is proportional to bandwidth demand
• Many workloads tolerate lower frequency with minimal performance drop
Memory Latency Under Load
• At low load, most time is in array access and bus transfer
  → small constant offset between bus-frequency latency curves
• As load increases, queueing delay begins to dominate
  → bus frequency significantly affects latency (a toy model sketch follows below)
[Chart: Memory latency (ns, 60–180) as a function of utilized channel bandwidth (MB/s) at 800, 1067, and 1333 MHz]
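A toy model of the latency curves above: a constant array-access component, a frequency-dependent transfer time, and an M/M/1-style queueing term that dominates as utilization approaches one. The constants and the queueing form are illustrative assumptions, not fits to the measured data.

/* Toy load-latency model; constants are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

static double latency_ns(double bw_mbps, double freq_mts)
{
    double t_array    = 45.0;                    /* ns, independent of bus frequency */
    double t_transfer = 8.0 * 1000.0 / freq_mts; /* 8 transfers for a 64 B line, ns  */
    double peak_mbps  = freq_mts * 8.0;          /* 8 bytes per transfer             */
    double util       = bw_mbps / peak_mbps;
    double t_queue    = (util < 1.0) ? t_transfer * util / (1.0 - util) : INFINITY;
    return t_array + t_transfer + t_queue;
}

int main(void)
{
    /* Low load: small constant offset between frequencies.
     * High load: queueing makes the lower frequency far worse. */
    printf("2000 MB/s @ 1333: %.0f ns\n", latency_ns(2000.0, 1333.0));
    printf("2000 MB/s @  800: %.0f ns\n", latency_ns(2000.0,  800.0));
    printf("6000 MB/s @ 1333: %.0f ns\n", latency_ns(6000.0, 1333.0));
    printf("6000 MB/s @  800: %.0f ns\n", latency_ns(6000.0,  800.0));
    return 0;
}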
Control Algorithm: Demand-Based Switching
[Chart: the latency curves above, with switching thresholds T800 and T1066 marked on the bandwidth axis]

After each epoch of length Tepoch:
  Measure per-channel bandwidth BW
  if      BW < T800  : switch to 800MHz
  else if BW < T1066 : switch to 1066MHz
  else               : switch to 1333MHz
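A minimal C sketch of this epoch-based controller. read_channel_bytes() and set_memory_frequency() are hypothetical platform hooks standing in for whatever bandwidth counters and frequency-control registers the memory controller exposes; the thresholds shown correspond to the BW(0.5, 1) configuration used in the evaluation.

/* Demand-based switching (sketch). The two extern functions are hypothetical
 * platform hooks, not a real driver API. */
#include <stdint.h>

#define EPOCH_MS 10ULL                       /* Tepoch = 10 ms           */
#define GBPS(x)  ((uint64_t)((x) * 1e9))     /* GB/s -> bytes per second */

extern uint64_t read_channel_bytes(int channel);            /* bytes since last call (assumed) */
extern void     set_memory_frequency(int channel, int mhz); /* V/F transition hook (assumed)   */

/* Called once per epoch for each channel: BW(0.5, 1) thresholds. */
void dvfs_epoch_tick(int channel)
{
    uint64_t bytes  = read_channel_bytes(channel);
    uint64_t bw_bps = bytes * 1000ULL / EPOCH_MS;  /* average bandwidth over the epoch */

    if (bw_bps < GBPS(0.5))
        set_memory_frequency(channel, 800);        /* low demand: lowest freq/voltage */
    else if (bw_bps < GBPS(1.0))
        set_memory_frequency(channel, 1066);
    else
        set_memory_frequency(channel, 1333);       /* high demand: full speed */
}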
Implementing V/F Switching
• Halt Memory Operations
  - Pause requests
  - Put DRAM in Self-Refresh
  - Stop the DIMM clock
• Transition Voltage/Frequency
  - Begin voltage ramp
  - Relock memory controller PLL at new frequency
  - Restart DIMM clock
  - Wait for DIMM PLLs to relock
• Begin Memory Operations
  - Take DRAM out of Self-Refresh
  - Resume requests

Notes:
• Memory frequency is already adjustable statically
• Voltage regulators for CPU DVFS can work for memory DVFS
• A full transition takes ~20µs (a sketch of this sequence follows below)
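A sketch of the transition sequence as code. Every function called here is a hypothetical stand-in for a memory-controller or platform operation (none of these names come from a real API); the ordering follows the steps listed above, and the whole sequence corresponds to the ~20 µs transition.

/* V/F switching sequence (sketch); all hooks below are hypothetical stand-ins. */
extern void pause_memory_requests(void);
extern void dram_enter_self_refresh(void);   /* DRAM retains data without the bus clock */
extern void stop_dimm_clock(void);
extern void ramp_dimm_voltage(double volts);
extern void relock_mc_pll(int mhz);
extern void restart_dimm_clock(void);
extern void wait_for_dimm_plls(void);
extern void dram_exit_self_refresh(void);
extern void resume_memory_requests(void);

void memory_vf_transition(int target_mhz, double target_volts)
{
    /* 1. Halt memory operations */
    pause_memory_requests();
    dram_enter_self_refresh();
    stop_dimm_clock();

    /* 2. Transition voltage and frequency (~20 us end to end) */
    ramp_dimm_voltage(target_volts);
    relock_mc_pll(target_mhz);
    restart_dimm_clock();
    wait_for_dimm_plls();

    /* 3. Resume memory operations */
    dram_exit_self_refresh();
    resume_memory_requests();
}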
Evaluation Methodology
• Real-system evaluation
  - Dual 4-core Intel Xeon®, 3 memory channels/socket
  - 48 GB of DDR3 (12 DIMMs, 4GB dual-rank, 1333MHz)
• Emulating memory frequency for performance
  - Altered memory controller timing registers (tRC, tB2BCAS)
  - Gives performance equivalent to slower memory frequencies
• Modeling power reduction
  - Measure baseline system (AC power meter, 1s samples)
  - Compute reductions with an analytical model (see paper)
Evaluation Methodology
• Workloads
  - SPEC CPU2006: CPU-intensive workloads
  - All cores run a copy of the benchmark
• Parameters
  - Tepoch = 10ms
  - Two variants of the algorithm with different switching thresholds (see the configuration sketch below):
      BW(0.5, 1): T800 = 0.5GB/s, T1066 = 1GB/s
      BW(0.5, 2): T800 = 0.5GB/s, T1066 = 2GB/s → more aggressive frequency/voltage scaling
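For concreteness, the two variants could be handed to the controller sketched earlier as parameter sets; the struct and names here are illustrative, not from the paper.

/* Threshold configurations for the demand-based controller (illustrative). */
struct dvfs_config {
    double t800_gbps;   /* below this bandwidth, run at 800 MHz  */
    double t1066_gbps;  /* below this bandwidth, run at 1066 MHz */
    int    epoch_ms;    /* Tepoch                                */
};

static const struct dvfs_config bw_0_5_1 = { 0.5, 1.0, 10 };  /* BW(0.5, 1)                      */
static const struct dvfs_config bw_0_5_2 = { 0.5, 2.0, 10 };  /* BW(0.5, 2): more aggressive V/F */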
Performance Impact of Memory DVFS
[Chart: Performance degradation (%) per benchmark and on average, for BW(0.5,1) and BW(0.5,2)]
• Minimal performance degradation: 0.2% (avg), 1.7% (max)
• Experimental error ~1%
Memory Frequency Distribution
[Chart: Fraction of time (0–100%) spent at 800, 1066, and 1333 MHz per benchmark]
• Frequency distribution shifts toward higher memory frequencies with more memory-intensive benchmarks
Memory Power Reduction
[Chart: Memory power reduction (%) per benchmark and on average, for BW(0.5,1) and BW(0.5,2)]
• Memory power reduces by 10.4% (avg), 20.5% (max)
System Power Reduction
[Chart: System power reduction (%) per benchmark and on average, for BW(0.5,1) and BW(0.5,2)]
• As a result, system power reduces by 1.9% (avg), 3.5% (max)
System Energy Reduction
[Chart: System energy reduction (%) per benchmark and on average, for BW(0.5,1) and BW(0.5,2)]
• System energy reduces by 2.4% (avg), 5.1% (max)
Related Work
• MemScale [Deng11], concurrent work (ASPLOS 2011)
  - Also proposes memory DVFS
  - Application performance impact model to decide voltage and frequency: requires specific modeling for a given system; our bandwidth-based approach avoids this complexity
  - Simulation-based evaluation; our work is a real-system proof of concept
• Memory Sleep States (creating opportunity with data placement [Lebeck00, Pandey06], OS scheduling [Delaluz02], VM subsystem [Huang05]; making better decisions with better models [Hur08, Fan01])
• Power Limiting/Shifting (RAPL [David10] uses memory throttling for thermal limits; CPU throttling for memory traffic [Lin07,08]; power shifting across the system [Felter05])
Conclusions
• Memory power is a significant component of system power
  - 19% average in our evaluation system, 40% in other work
• Workloads often keep memory active but underutilized
  - Channel bandwidth demands are highly variable
  - Use of memory sleep states is often limited
• Scaling memory frequency/voltage can reduce memory power with minimal system performance impact
  - 10.4% average memory power reduction
  - Yields 2.4% average system energy reduction
• Greater reductions are possible with a wider frequency/voltage range and better control algorithms
Memory Power Management via
Dynamic Voltage/Frequency Scaling
Howard David (Intel)
Eugene Gorbatov (Intel)
Ulf R. Hanebutte (Intel)
Chris Fallin (CMU)
Onur Mutlu (CMU)
Why Real-System Evaluation?
• Advantages:
  - Capture all effects of altered memory performance
    (system/kernel code, interactions with IO and peripherals, etc.)
  - Able to run full-length benchmarks (SPEC CPU2006) rather than short instruction traces
  - No concerns about architectural simulation fidelity
• Disadvantages:
  - More limited room for novel algorithms and detailed measurements
  - Inherent experimental error due to background-task noise, real power measurements, nondeterministic timing effects
• For a proof of concept, we chose to run on a real system in order to have results that capture all potential side effects of altering memory frequency
CPU-Bound Applications in a DRAM-rich system
• We evaluate CPU-bound workloads with 12 DIMMs: what about smaller memory, or IO-bound workloads?
• 12 DIMMs (48GB): are we magnifying the problem?
  - Large servers can have this much memory, especially for database or enterprise applications
  - Memory can be up to 40% of system power [1,2], and reducing its power in general is an academically interesting problem
• CPU-bound workloads: will it matter in real life?
  - Many workloads have CPU-bound phases (e.g., database scan or business logic in server workloads)
  - Focusing on CPU-bound workloads isolates the problem of varying memory bandwidth demand while memory cannot enter sleep states, and our solution applies to any compute phase of a workload

[1] L. A. Barroso and U. Hölzle. "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines." Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[2] C. Lefurgy et al. "Energy Management for Commercial Servers." IEEE Computer, pp. 39–48, December 2003.
Combining Memory & CPU DVFS?
• Our evaluation did not incorporate CPU DVFS:
  - Need to understand the effect of a single knob (memory DVFS) first
  - Combining with CPU DVFS might produce second-order effects that would need to be accounted for
• Nevertheless, memory DVFS is effective by itself, and mostly orthogonal to CPU DVFS:
  - Each knob reduces power in a different component
  - Our memory DVFS algorithm has negligible performance impact → negligible impact on CPU DVFS
  - CPU DVFS will only further reduce bandwidth demands relative to our evaluations → no negative impact on memory DVFS
Why is this Autonomic Computing?
• Power management in general is autonomic: a system observes its own needs and adjusts its behavior accordingly
  → Much previous work comes from the architecture community, but crossover in ideas and approaches could be beneficial
• This work exposes a new knob for control algorithms to turn, provides a simple model for the power/energy effects of that knob, and observes the opportunity to apply it in a simple way
• Exposes future work for:
  - More advanced control algorithms
  - Coordinated energy efficiency across the rest of the system
  - Coordinated energy efficiency across a cluster/datacenter, integrated with memory DVFS, CPU DVFS, etc.