IBM Research
Guarded Power Gating in a Multi-core Setting
Niti Madan, Alper Buyuktosunoglu, Pradip Bose, Murali Annavaram
USC
IBM T.J. Watson
June 2010
© 2010 IBM Corporation
Outline
• Motivation
• Queuing Model based Methodology
• Results
• Conclusions and Future Work
Power Management through Power Gating
– Use a header or footer transistor to power-gate the idle circuit
– Apply "sleep" to the header or footer => turn off voltage
[Figure: Sleep signal, Vdd, Virtual Vdd, Logic Block]
– Can be applied at unit level (intra-core or small knob)
– Can be applied at core level (per-core or big knob)
Predictive Power Gating
[Figure: energy vs. time after the decision to power gate; cumulative energy savings offset the energy overhead at the break-even point; time-axis marks: 0, decide to power gate, wake-up]
Ex. break-even point = 10 cycles
– Decide to power gate: …10100 0000000000… (correct prediction => save power)
– Decide to power gate: …10100 001………….
• Power-gating algorithms are predictive by nature
• Frequent mis-predictions can burn more power than they save
• The break-even point depends on block size and technology parameters
• A guard mechanism was proposed for unit-level power-gating algorithms by Lungu et al. (ISLPED'09)
• This is a concern for per-core power-gating algorithms, as the break-even point is much higher for cores
Anita Lungu, Pradip Bose, Alper Buyuktosunoglu, Daniel Sorin, "Dynamic power gating with quality guarantees". ISLPED '09
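The break-even reasoning on this slide can be made concrete with a small sketch. It assumes a simple linear energy model that is not in the slides: one gating event costs a fixed overhead equal to `break_even` cycles' worth of idle energy, and each gated idle cycle saves one unit. The function names are illustrative.

```python
def net_savings(idle_cycles: int, break_even: int = 10) -> int:
    """Net energy saved by gating for `idle_cycles` cycles, in units of one
    idle cycle's energy.

    Assumed model (not from the slides): gating costs a fixed overhead of
    `break_even` units, and each gated cycle saves one unit.
    """
    return idle_cycles - break_even


def idle_run_after_decision(activity: str) -> int:
    """Length of the run of '0's (idle cycles) at the start of `activity`,
    i.e. how long the block stays idle after the gating decision."""
    return len(activity) - len(activity.lstrip("0"))


# The two patterns from the slide, starting right after the decision point.
long_idle = "0000000000"   # at least 10 idle cycles
short_idle = "001"         # activity resumes after 2 idle cycles

print(net_savings(idle_run_after_decision(long_idle)))   # 0: overhead just recovered
print(net_savings(idle_run_after_decision(short_idle)))  # -8: net energy loss
```

Under this model any idle run shorter than the break-even point is a net loss, which is why frequent mis-predictions can cost more than they save.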
Power Gating Scenarios
• Exploiting the two dimensions of utilization to power-gate idle units or cores
– System utilization (OS perspective) triggers the big knob
– Resource utilization (core's perspective) triggers the small knob
• Do we power-gate (PG) cores, execution units, or both?
[Figure: activity of Cores 1 to 4 over time; (a) baseline 4-core system, (b) folded 2-core system]
• How can we maximize the power-savings opportunities provided by both the small and big knobs?
Goals of this study
• Explore the trade-offs between unit-level (small-knob) power-gating algorithms and per-core (big-knob) power-gating algorithms for a range of latencies and parameters
• Leverage analytical models for early-stage evaluation
• Make a case for a guard mechanism for per-core power gating
Sriram Vajapeyam, Pradip Bose
Queuing Theory Based Analytical Model
• Representation of multi-processor workloads as a queuing system
– Cores are servers
– Processing tasks are customer requests
– Tasks are processed in FCFS order
– Queuing system tracks average customer waiting time, service time and server utilization
• Evaluate our power-management policies using a C++-based queuing model simulator: "QUTE"
[Figure: customers -> arrivals -> queue -> server(s) -> departures]
Overview of QUTE Framework
• Simulation of queuing models (G/G/N/k/inf/FCFS)
– Faster than cycle-accurate simulations
– Easy to explore the design space early on
• Statistical workload generation parameters:
– Task arrival times: exponential distribution
– Task lengths: normal/exponential/uniform distributions
• Evaluation metrics:
– Performance: average response time
– Power: average number of cores switched on
– Other stats: server utilization, variance in service demand, etc.
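A minimal sketch of this kind of queuing simulation, for illustration only. It is not QUTE (which is a C++ simulator): it assumes each task runs to completion on one core, with no time slicing and no power gating, and all function names are hypothetical.

```python
import heapq
import random


def simulate_fcfs(arrivals, lengths, num_cores):
    """Average response time of an FCFS queue served by `num_cores` servers.

    `arrivals` must be sorted in increasing order. Simplification (an
    assumption of this sketch): each task runs to completion on one core.
    """
    free_at = [0.0] * num_cores          # min-heap of times at which cores become free
    heapq.heapify(free_at)
    total_response = 0.0
    for arrival, length in zip(arrivals, lengths):
        start = max(arrival, heapq.heappop(free_at))  # wait for the earliest free core
        finish = start + length
        heapq.heappush(free_at, finish)
        total_response += finish - arrival
    return total_response / len(arrivals)


def generate_workload(num_tasks, mean_interarrival, mean_length, seed=0):
    """Exponential inter-arrival times and exponential task lengths (in µs)."""
    rng = random.Random(seed)
    t, arrivals, lengths = 0.0, [], []
    for _ in range(num_tasks):
        t += rng.expovariate(1.0 / mean_interarrival)
        arrivals.append(t)
        lengths.append(rng.expovariate(1.0 / mean_length))
    return arrivals, lengths


arrivals, lengths = generate_workload(10_000, mean_interarrival=300.0, mean_length=5000.0)
print(simulate_fcfs(arrivals, lengths, num_cores=32))  # average response time in µs
```

The heap holds one "free at" time per core, so each task starts on whichever core frees up first, which is exactly FCFS service across N identical servers.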
QUTE Framework
[Figure: task arrivals (arrival rate distribution using a random number generator) feed a FIFO task queue (service time or task length drawn from a statistical distribution), which dispatches tasks to cores C1, C2, C3, C4, ...; all cores queue the task back at the end of a time slice]
Big Knob Modeling
Implemented a simple idleness-triggered heuristic:
• Set the idleness threshold (say, to 0.5 msec)
• Every 0.5 msec (i.e. the idleness threshold):
– Scan all cores
– Identify cores idle for > idleness threshold
– Switch off all such cores (except, make sure there is always at least one core ON, either free or active)
• When a task arrives at the head of the task queue:
– If there is no free core and there is a switched-off core, switch it ON
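The idleness-triggered heuristic above can be sketched as follows. This is an illustrative reading of the slide, not the QUTE code: the `Core` fields and function names are hypothetical, and the switch-on latency (OnLat) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Core:
    on: bool = True
    busy: bool = False
    idle_since: float = 0.0   # time at which the core last became idle


def periodic_scan(cores, now, threshold):
    """Switch off every ON core that has been idle longer than `threshold`,
    but always leave at least one core ON (free or active)."""
    on_count = sum(c.on for c in cores)
    for c in cores:
        if on_count <= 1:
            break
        if c.on and not c.busy and now - c.idle_since > threshold:
            c.on = False
            on_count -= 1


def core_for_task(cores):
    """Pick a core for the task at the head of the queue: a free ON core if
    one exists, otherwise wake a switched-off core. Returns None if every
    core is ON and busy (the task keeps waiting)."""
    for c in cores:
        if c.on and not c.busy:
            return c
    for c in cores:
        if not c.on:
            c.on = True       # switch-on latency (OnLat) is not modeled here
            return c
    return None


cores = [Core() for _ in range(4)]
cores[0].busy = True
periodic_scan(cores, now=1.0, threshold=0.5)   # times in ms
print([c.on for c in cores])                   # [True, False, False, False]
```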
Small Knob Modeling
• Cannot directly simulate workload phases
• Each core can have N power states
– 2 states for this version: nominal power state and low power state (75% power)
• Generate a statistical distribution (Gaussian) of each power state duration
• Each task always starts in the nominal power state
– Switch between power states in a given time slice
• Parameters: nominal (Hi) and low (Lo) power state means, transition overhead
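One way to picture the small-knob model is the sketch below, which splits a single time slice into Hi and Lo state time. It is an assumption-laden illustration: the slides give only the state means, so the standard deviation `sigma` and the function name are hypothetical, and durations are clipped at zero.

```python
import random


def time_in_states(time_slice, hi_mean, lo_mean, sigma, overhead, rng):
    """Split one time slice into Hi and Lo power-state time.

    Returns (hi_time, lo_time, overhead_time), which sum to `time_slice`.
    Durations are Gaussian draws clipped at zero; `sigma` is a hypothetical
    parameter, since the slides give only the means.
    """
    hi = lo = lost = 0.0
    remaining = time_slice
    in_hi = True                      # each task starts in the nominal state
    while remaining > 0:
        mean = hi_mean if in_hi else lo_mean
        d = min(max(rng.gauss(mean, sigma), 0.0), remaining)
        if in_hi:
            hi += d
        else:
            lo += d
        remaining -= d
        t = min(overhead, remaining)  # pay the transition overhead
        lost += t
        remaining -= t
        in_hi = not in_hi
    return hi, lo, lost


rng = random.Random(1)
hi, lo, lost = time_in_states(1000.0, 300.0, 100.0, 30.0, 1.0, rng)  # all in µs
print(hi / 1000.0, lo / 1000.0, lost / 1000.0)  # fractions of the slice
```

Shorter state means produce more transitions per slice, so the overhead fraction grows as phases get shorter.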
Simulation Parameters
ρ = λ / (N × µ)
System-level parameters
– Number of cores: 32
– Mean task length: 5 ms
– Mean task inter-arrival rate: 300 µs
– Time slice: 1 ms
– Simulation length: 10000 tasks
Big knob parameters
– Core switch-on latency (OnLat): 500 µs
– Idleness threshold (CT): 500 µs
Small knob parameters
– Hi state mean: 300 µs
– Lo state mean: 100 µs
– Transition overhead: 1 µs
– Power factor: 0.75
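As a worked example of the utilization formula, the sketch below plugs in the system-level parameters from this slide. It assumes the standard queuing interpretation, which the slide does not spell out: λ is the reciprocal of the mean inter-arrival time, µ is the reciprocal of the mean task length (the service rate of one core), and N is the number of cores.

```python
def offered_utilization(mean_interarrival_us, mean_task_length_us, num_cores):
    """rho = lambda / (N * mu), with lambda = 1 / mean inter-arrival time
    and mu = 1 / mean task length (service rate of one core)."""
    lam = 1.0 / mean_interarrival_us
    mu = 1.0 / mean_task_length_us
    return lam / (num_cores * mu)


rho = offered_utilization(300.0, 5000.0, 32)
print(round(rho, 2))  # 0.52
```

This agrees with the measured server utilization of 0.52 reported for a 300 µs inter-arrival rate in the hybrid model results.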
Outline
• Problem Background
• Methodology: Queuing Model
• Results
• Conclusions and Future Work
Big Knob Results
Experiment                     Response time (µs)   Average power (num cores)
Base                           5002.22              32
OnLat = 0.5 ms, CT = 0.5 ms    5038.46              24.99
OnLat = 0.5 ms, CT = 0.3 ms    5070.12              23.33
OnLat = 0.5 ms, CT = 0.1 ms    5158.51              21.83
OnLat = 0.5 ms, CT = 10 µs     5244.43              21.68
OnLat = 10 µs, CT = 0.5 ms     5002.93              24.82
OnLat = 10 µs, CT = 10 µs      5007.07              20.77
• CT controls the degree of power savings (up to 34%)
• OnLat controls the performance loss (up to 5%)
Idle-Time Durations Histogram
[Figure: histogram of the number of durations (y-axis, 0 to 8000) vs. idle-time duration in µs (x-axis, 0 to 1500), with CT marked]
Small Knob Results
Workload behavior   Hi mean   Lo mean   Hi %   Lo %   Response time (µs)   Avg power (effective num cores)
Short phases        100       100       52     48     5050.51              28.16
                    200       200       57     43     5027.36              28.48
High ILP            300       100       79     21     5026.46              30.08
Low ILP             100       300       30     70     5028.23              26.24
Very high ILP       500       100       89     11     5013.67              31.04
Very low ILP        100       500       21     79     5019.95              25.6
System_Power = Num_cores x (%time_in_Hi_state + F x %time_in_Lo_state) x P
where F = 0.75 for this analysis
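The formula can be checked against the first row of the small knob results (Hi 52%, Lo 48%). The sketch below reports power in units of P, assuming P denotes the nominal power of one core (the slide does not define it); the function name is illustrative.

```python
def effective_cores(num_cores, hi_fraction, lo_fraction, power_factor=0.75):
    """Num_cores x (%time_in_Hi_state + F x %time_in_Lo_state), i.e. system
    power expressed in units of P."""
    return num_cores * (hi_fraction + power_factor * lo_fraction)


print(round(effective_cores(32, 0.52, 0.48), 2))  # 28.16
```

A lower power factor F lowers the result only in proportion to the time spent in the Lo state, so workloads with a small Lo share see little benefit.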
[Figure: performance loss (%) vs. transition overhead (µs); x-axis ticks 0.5, 1, 5, 10; y-axis ticks 0, 2, 4]
• Power savings depend on workload behavior
• Short phases increase the number of transitions and the overhead
• Transition overhead is tolerable under our assumptions
Hybrid Model Results (Big + Small Knob)
Inter-arrival rate (µs)   Server utilization (measured)
50                        1.0
100                       1.0
300                       0.52
500                       0.31
1000                      0.16
2000                      0.08
[Figure: two panels, "High ILP Workload" and "Low ILP Workload"]
• High ILP workloads: the big knob is most helpful
• Low ILP workloads: the small knob is helpful even at lower utilization
A Case for Guard Mechanism for Multicore Power Gating
Experiment              Response time (µs)   Core switching ON/OFF frequency
Fixed arrival rate      5043.88              91482
Toggling arrival rate   5111                 226372
• Depending upon workload characteristics, per-core power-gating heuristics are prone to mis-predictions and to dissipating more power
• Aggressive power-gating heuristics also increase the performance overhead of mis-prediction (e.g. lower CT)
Observations
• In a fully loaded system, the small knob is helpful
• In a lightly loaded system, the big knob is most useful
• In an intermediately loaded system, the big knob is useful to have, but the usefulness of the small knob depends upon the workload characteristics
– Lower-ILP or low resource utilization workloads benefit from the small knob
• The small knob is a useful feature to have regardless of system load if we can implement a power state with a lower power factor
– Current power factor is conservative (0.75)
Future Work
• Improve the methodology by supporting real server utilization traces
• Evaluate a system with multiple P-states and DVFS
• Architect guard mechanisms for the per-core power-gating algorithms
• Design an implementation of a hybrid PG system
Thanks and Questions!
Backup Slides
Power Factor Sensitivity Analysis for High ILP Workload
[Figure: series Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1; y-axis 0 to 1.2; x-axis ticks 50, 100, 300, 500, 1000, 2000]
Power Factor Sensitivity Analysis for Low ILP Workload
[Figure: series Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1; y-axis 0 to 1.2; x-axis ticks 50, 100, 300, 500, 1000, 2000]
Two Level Power Gating Algorithms (Lungu et al. ISLPED'09)
• Observations:
– Correctness requirement of power saving schemes (efficiency-wise): save power
– Single level idle prediction algorithms can behave incorrectly and waste power
• Proposed idea:
– Add a second level monitor to control enabling of the power gating scheme
– Improve efficiency of power wasting cases without degrading power saving of the common case
• Per-core power-gating algorithms also rely on such predictive schemes and will require guard mechanisms
– Cost of misprediction is higher in per-core power-gating
[Figure: Level 2 (Monitor & Control) estimates power savings from efficiency counters; if the estimate is > 0 (Yes), Enable = 1, otherwise (No) Enable = 0. Level 1 (Actuate) has states On, Off_U and Off_C, is controlled by Enable, and increments counters Cnt1 and Cnt2.]
Off_U: Power gated, uncompensated
Off_C: Power gated, compensated
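The two-level idea can be illustrated with a minimal sketch of a Level 2 monitor. This is not the mechanism from Lungu et al.: the Off_U/Off_C states and the Cnt1/Cnt2 counters are not modeled, the energy model (a fixed overhead of `break_even` units per gating event, one unit saved per gated idle cycle) is an assumption, and the class and parameter names are hypothetical.

```python
class GuardedGating:
    """Level 2 monitor that enables or disables a Level 1 gating policy.

    Energy is counted in units of one gated idle cycle; `break_even` is the
    overhead of one gating event in the same units (an assumed model).
    """

    def __init__(self, break_even, interval):
        self.break_even = break_even
        self.interval = interval      # idle periods per evaluation window
        self.enabled = True
        self.savings = 0              # estimated net savings in this window
        self.events = 0

    def record_idle_period(self, idle_cycles):
        """Account for one idle period the Level 1 policy gated (or would
        have gated), then re-evaluate the enable flag once per window."""
        self.savings += idle_cycles - self.break_even
        self.events += 1
        if self.events == self.interval:
            self.enabled = self.savings > 0
            self.savings = 0
            self.events = 0


guard = GuardedGating(break_even=10, interval=4)
for idle in [2, 3, 1, 4]:          # short idle periods: gating wastes energy
    guard.record_idle_period(idle)
print(guard.enabled)               # False
for idle in [50, 40, 2, 30]:       # mostly long idle periods: gating pays off
    guard.record_idle_period(idle)
print(guard.enabled)               # True
```

The monitor keeps estimating savings even while gating is disabled, so the policy can be re-enabled once idle periods lengthen again.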