
Recovery-Driven Design:
A Power Minimization Methodology
for Error-Tolerant Processor Modules
Andrew B. Kahng†, Seokhyeong Kang†,
Rakesh Kumar‡ and John Sartori‡
†VLSI CAD LABORATORY, UCSD
‡PASSAT GROUP, UIUC
DAC, June 17, 2010
UCSD VLSI CAD Laboratory and UIUC PASSAT Group
-1-
Outline

Background and Motivation
– Voltage scaling and error-tolerant design
– Error-tolerant design vs. recovery-driven design

Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation

Experimental Framework and Results
– Design methodology
– Results and analysis

Conclusions and Ongoing Work
-2-
Reducing Power with Voltage Scaling

Power is a first-order design constraint
– Moore’s law implies power density of processors continues to escalate

Voltage scaling reduces power but eventually causes massive timing violations

Error-resilience allows deeper voltage scaling

[Figure: power vs. voltage (lower voltage), marking where timing errors begin to occur]
-3-
Error-Tolerance Mechanisms

Hardware error-tolerance
– Errors are detected and corrected during runtime
– Razor (MICRO 2003)

Application-level error-tolerance*
– Errors are allowed to propagate to software, resulting in reduced performance or output quality

Traditional IC design
• No errors allowed
• Overclocking and voltage overscaling not enabled

Error-Tolerant design
• Error correction architecture allows timing errors
• Overclocking and voltage overscaling enabled

[Figure: energy per instruction, error rate, and reduction in computation speed vs. voltage scaling (lower voltage); annotations: energy minimum, ~0.04% error rate, ~0.2% speed reduction]
*Hegde et al. “Energy-Efficient Signal Processing via Algorithmic Noise-Tolerance”, ISLPED 1999
-4-
Our Work: From Error-Tolerance to Recovery-Driven

Error-Tolerant design
• Design still optimized for correct operation
• Design methodology based on STA, workload-agnostic

Recovery-Driven design
• Designed “from ground up” for specific target error rate
• Design methodology exploits functional information
-5-
Recovery-Driven Design

How to minimize power in recovery-driven design?
1. Minimize error rate to extend range of voltage scaling (OptimizePaths)
2. Reduce design power with cell downsizing or Vt swap (ReducePower)

[Figure: error rate and power vs. voltage (lower voltage) for the traditional and optimized designs; at the target error rate, the operating point moves to a new operating point with its own Vmin and Pmin]
-6-
Outline

Background and motivation
– Voltage scaling and error-tolerant processor
– Error-tolerant design vs. recovery-driven design

Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation

Experimental Framework and Results
– Design methodology
– Results and analysis

Conclusions and Ongoing Work
-7-
Related Work: Design-Level Optimizations for Error-Tolerant Processors

BlueShift*
– Increase frequency up to a target error rate
– Speed up error paths with timing overrides and FBB

Slack Optimizer**
– Make gradual slope slack to achieve gracefully increasing error rate
– Estimate error rate using switching activity from SAIF

[Figure: number of paths vs. timing slack, comparing a ‘wall’ of slack with a ‘gradual slope’ slack; frequently exercised and rarely exercised paths are marked relative to zero slack at nominal voltage and zero slack after voltage scaling]
*Greskamp et al. “Blueshift: Designing Processors for Timing Speculation from the Ground up”, HPCA 2009
**Kahng et al. “Slack Redistribution for Graceful Degradation Under Voltage Overscaling”, ASPDAC 2010
-8-
Recovery-Driven Design Methodology

Problem: minimize processor power (leakage + dynamic) for a target error rate

Approach: we use slack redistribution and power reduction enabled by accurate error rate estimation
• Slack redistribution: reshape path slack based on path activity (toggle rate) to minimize error rate and extend voltage scaling (OptimizePaths and ReducePower heuristics)
• Error rate estimation using a simulation dump file (VCD)
-9-
Slack Redistribution

Redistribute slack from paths that rarely toggle to paths that frequently toggle

[Figure: four panels (a) to (d) plotting # paths vs. timing slack, with the zero-slack point after voltage scaling separating P+ and P- paths; panel labels: OptimizePaths (upsize cells), voltage scaling, ReducePower (downsize cells), iterate voltage scaling]
-10-
Slack Redistribution Flow

Flow: Netlist + VCD → Analyze activity → Timing Analysis → OptimizePaths → ReducePower → Compute Error Rate (ER) → ER > ERtarget?
– NO: Reduce Voltage and iterate
– YES: ECO P&R

Toggle Information: simulation dump file is loaded
Path Optimization: minimize error rate to extend range of voltage scaling
Power Reduction: downsize cells to obtain additional power savings
Error Rate Estimation: estimate with toggle info and STA results
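
The outer loop of this flow can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `scale_voltage` and its three callable arguments are hypothetical stand-ins for the flow boxes, and returning the last design that met the target is an assumption about what is handed to ECO P&R.

```python
def scale_voltage(design, v_start, v_step, er_target,
                  optimize_paths, reduce_power, error_rate):
    """Lower the supply voltage until the estimated error rate exceeds the target.

    optimize_paths, reduce_power and error_rate are caller-supplied stand-ins
    for the flow boxes of the same name; none of them is a real tool API.
    Returns the last (design, voltage) pair that met the target, or None.
    """
    accepted = None
    voltage = v_start
    while voltage > 0:
        candidate = optimize_paths(design, voltage)
        candidate = reduce_power(candidate, voltage)
        if error_rate(candidate, voltage) > er_target:
            break  # ER > ERtarget: stop scaling and go to ECO P&R
        accepted = (candidate, voltage)
        design = candidate
        voltage = round(voltage - v_step, 6)  # reduce voltage and iterate
    return accepted


# Toy stand-ins (assumptions): the design is left untouched and
# timing errors appear once the voltage drops below 0.8 V.
result = scale_voltage(
    design="netlist",
    v_start=1.0,
    v_step=0.05,
    er_target=0.01,
    optimize_paths=lambda d, v: d,
    reduce_power=lambda d, v: d,
    error_rate=lambda d, v: 0.0 if v >= 0.8 else 0.05,
)
```

With these toy inputs the loop stops at the first voltage step whose error rate exceeds the target and keeps the previous step.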
-11-
Heuristic Details – OptimizePaths

Main idea: increase slack of frequently-exercised paths in order of decreasing toggle rate

Procedure
1. Pick a critical path p with maximum toggle rate
2. Resize cell instance ci in p
3. If the path slack is not improved, cell change is restored
4. Repeat 2. ~ 3. for all cell instances in path p
5. Repeat 2. ~ 4. for all critical paths

OptimizePaths → ReducePower → Voltage Scaling
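
A minimal sketch of the procedure above, under stated assumptions: "critical" is taken to mean negative slack, "resize" is taken to mean one upsizing step, and `slack_of` is a hypothetical callable standing in for a timing-analysis query. The toy delay model (cell delay = 1/size) exists only to make the example runnable.

```python
def optimize_paths(paths, toggle_rate, sizes, upsized, slack_of):
    """Greedy sketch of the OptimizePaths procedure.

    paths:       {path name: [cell instance, ...]}
    toggle_rate: {path name: fraction of cycles in which the path toggles}
    sizes:       {cell instance: current size}, modified in place
    upsized:     {size: next larger size} (hypothetical library ordering)
    slack_of:    callable(path name, sizes) -> slack, standing in for STA
    """
    critical = [p for p in paths if slack_of(p, sizes) < 0]
    # Visit critical paths in order of decreasing toggle rate.
    for p in sorted(critical, key=lambda q: toggle_rate[q], reverse=True):
        for cell in paths[p]:
            old = sizes[cell]
            if old not in upsized:
                continue
            before = slack_of(p, sizes)
            sizes[cell] = upsized[old]          # resize the cell instance
            if slack_of(p, sizes) <= before:    # slack not improved
                sizes[cell] = old               # restore the change
    return sizes


# Toy timing model (assumption): cell delay is 1/size, clock period is 1.5.
paths = {"p1": ["u1", "u2"], "p2": ["u3"]}
toggle_rate = {"p1": 0.5, "p2": 0.1}
sizes = {"u1": 1, "u2": 1, "u3": 1}


def slack_of(path, s):
    return 1.5 - sum(1.0 / s[c] for c in paths[path])


optimize_paths(paths, toggle_rate, sizes, {1: 2}, slack_of)
```

Only p1 has negative slack in this toy setup, so only its cells are resized; p2 is left alone.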
-12-
Heuristic Details – ReducePower

Main idea: downsize cells on non-critical paths in order of decreasing sensitivity

$\mathrm{Sensitivity}(c) = (\mathrm{power}_c - \mathrm{power}_{c'}) / (\mathrm{slack}_c - \mathrm{slack}_{c'})$

Procedure
1. Pick a cell c with maximum sensitivity
2. Downsize cell c with logically equivalent cell
3. Incremental timing analysis and check error rate
4. If error rate is increased, cell change is restored
5. Repeat 1. ~ 4.

OptimizePaths → ReducePower → Voltage Scaling
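
The ReducePower loop can be sketched as a greedy search. This is an illustration under assumptions, not the authors' code: c' is read as the downsized version of c, `slack_of` and `error_rate` are hypothetical callables standing in for incremental timing analysis and error rate estimation, and a rejected cell is never retried. All cell names and numbers are made up.

```python
def reduce_power(assignment, smaller, power, slack_of, error_rate):
    """Greedy sketch of ReducePower.

    assignment: {cell instance: library cell name}
    smaller:    {cell name: next smaller logically equivalent cell}
    power:      {cell name: power}
    slack_of:   callable(instance, assignment) -> slack, standing in for STA
    error_rate: callable(assignment) -> estimated error rate
    """
    rejected = set()
    while True:
        best, best_sens = None, None
        for inst, cell in assignment.items():
            if inst in rejected or cell not in smaller:
                continue
            trial = {**assignment, inst: smaller[cell]}
            d_power = power[cell] - power[smaller[cell]]
            d_slack = slack_of(inst, assignment) - slack_of(inst, trial)
            # Sensitivity: power saved per unit of slack given up.
            sens = d_power / d_slack if d_slack > 0 else float("inf")
            if best is None or sens > best_sens:
                best, best_sens = inst, sens
        if best is None:
            return assignment
        trial = {**assignment, best: smaller[assignment[best]]}
        if error_rate(trial) > error_rate(assignment):
            rejected.add(best)      # error rate increased: restore the cell
        else:
            assignment = trial      # keep the downsized cell


# Toy library and timing model (all values hypothetical).
power = {"INVX4": 4.0, "INVX2": 2.0, "INVX1": 1.0}
delay = {"INVX4": 1.0, "INVX2": 2.0, "INVX1": 4.0}
margin = {"u1": 2.5, "u2": 10.0}
toggle = {"u1": 0.3, "u2": 0.1}


def slack_of(inst, asg):
    return margin[inst] - delay[asg[inst]]


def error_rate(asg):
    return sum(toggle[i] for i in asg if slack_of(i, asg) < 0)


final = reduce_power({"u1": "INVX4", "u2": "INVX4"},
                     {"INVX4": "INVX2", "INVX2": "INVX1"},
                     power, slack_of, error_rate)
```

In the toy run, u1 stops at INVX2 because one more downsizing step would give it negative slack and raise the error rate, while u2 has enough margin to reach INVX1.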
-13-
Path Extraction for Error Rate Estimation

Instead of simulation, we use toggle information from value change dump (VCD) file

VCD file ([time], then [value, net] entries):
#0
0a
0b
1x
1y
#1
1a
0x
0y
#2
…

Extracted paths
a-x-y (@ cycle 1, 3)
b-y (@ cycle 2, 4)

[Figure: waveform of clock, a, b, x, y over times #0 to #4 → VCD file → list of toggled nets in each cycle time; together with the netlist (nets a, b, x, y) this gives the extracted paths]
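
The extraction idea can be sketched as follows. This is a simplified illustration: the input uses the one-character-value, one-net-name layout of the excerpt above rather than the full VCD format, the function names are hypothetical, and the entries after `#2` in the sample data are made up for the example.

```python
from collections import defaultdict


def toggled_nets(vcd_lines):
    """Return {time: set of nets whose value changed at that time}."""
    last_value = {}
    toggles = defaultdict(set)
    time = None
    for raw in vcd_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            time = int(line[1:])
            continue
        value, net = line[0], line[1:]
        # The first value seen for a net is its initial state, not a toggle.
        if net in last_value and last_value[net] != value:
            toggles[time].add(net)
        last_value[net] = value
    return dict(toggles)


def toggled_paths(fanout, start_nets, end_nets, toggled):
    """List start-to-end paths on which every net toggled in this cycle."""
    found = []

    def walk(net, prefix):
        if net not in toggled:
            return
        prefix = prefix + [net]
        if net in end_nets:
            found.append("-".join(prefix))
        for nxt in fanout.get(net, []):
            walk(nxt, prefix)

    for start in sorted(start_nets):
        walk(start, [])
    return found


# Sample data: times #0 and #1 follow the excerpt; time #2 is hypothetical.
vcd = ["#0", "0a", "0b", "1x", "1y", "#1", "1a", "0x", "0y", "#2", "1b", "1y"]
fanout = {"a": ["x"], "x": ["y"], "b": ["y"]}   # assumed netlist connectivity

per_cycle = toggled_nets(vcd)
paths_by_cycle = {
    t: toggled_paths(fanout, {"a", "b"}, {"y"}, nets)
    for t, nets in per_cycle.items()
}
```

Each cycle maps to the list of paths that toggled end to end, which is the input the toggle rate and error rate estimates need.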
-14-
Toggle and Error Rate Calculation

p: path
χtoggle: set of cycles in which p has toggled
Xtot: total number of cycles

Toggle rate of p: $|\chi_{toggle}| / X_{tot}$

Error rate: (number of cycles which have a timing error) / $X_{tot}$

20X faster than actual simulation and accurate

[Figure: runtime (min) of simulation vs. estimation for lsu_dctl, lsu_qctl1 and lsu_stb_ctl; error rate vs. voltage (1 V down to 0.5 V) for Actual, Estimated - PowerOpt and Estimated - SlackOpt*]
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.
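
A small sketch of both calculations, under one explicit assumption: a cycle is counted as having a timing error when at least one negative-slack path toggles in it. The function names and the slack values are hypothetical; the toggle cycles reuse the a-x-y and b-y example.

```python
def toggle_rate(toggle_cycles, total_cycles):
    """Fraction of cycles in which a path toggled."""
    return len(toggle_cycles) / total_cycles


def estimate_error_rate(path_toggle_cycles, path_slack, total_cycles):
    """Fraction of cycles in which some negative-slack path toggled (assumption)."""
    error_cycles = set()
    for path, cycles in path_toggle_cycles.items():
        if path_slack[path] < 0:
            error_cycles |= set(cycles)
    return len(error_cycles) / total_cycles


# Toggle cycles from the path extraction example; slack values are made up.
path_toggle_cycles = {"a-x-y": {1, 3}, "b-y": {2, 4}}
path_slack = {"a-x-y": -0.05, "b-y": 0.10}

tr = toggle_rate(path_toggle_cycles["a-x-y"], total_cycles=4)
er = estimate_error_rate(path_toggle_cycles, path_slack, total_cycles=4)
```

Taking the union of cycles, rather than summing per-path toggle rates, avoids counting a cycle twice when two failing paths toggle together.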
-15-
Evaluation of Heuristic Design Choices

Path ordering
– (A) toggle rate * slack
– (B) toggle rate

Optimization radius
– (A) path only
– (B) fan-in/out network

Starting netlist
– (A) loosely constrained
– (B) tightly constrained

Voltage step size
– (A) 0.01V and (B) 0.05V

[Figure: four panels (path ordering, optimization radius, starting netlist, voltage step size granularity during optimization), each plotting power and runtime for options (A) and (B) against error rate from 0.13% to 8.00%]
-16-
Outline

Background and motivation
– Voltage scaling and error-tolerant processor
– Error-tolerant design vs. recovery-driven design

Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation

Experimental Framework and Results
– Design methodology
– Results and analysis

Conclusions and Ongoing Work
-17-
Design Methodology

– Benchmark generation (Simics): system-level simulation using Simics with real benchmarks; produces the input vector
– Functional simulation (NC Verilog): gate-level simulation to get signal toggle information; produces the simulation result (.vcd)
– Library characterization (SignalStorm): prepare Synopsys Liberty file (.lib) using Cadence SignalStorm
– Initial design (OpenSPARC T1): design information (.v .spef)
– Power Optimizer: implemented in C++; uses Tcl socket I/F to communicate with PrimeTime; produces the list of swaps
– ECO P&R (SOCEncounter): perform ECO P&R with the cell swap list; produces the final design
-18-
Power Analysis for Real Workloads

– System-level simulation (Simics + Transplant): system-level simulation with a real benchmark binary (bzip, twolf ...); input patterns are captured
– Functional simulation (VCS or NCVerilog): RTL design (OpenSPARC) with the input pattern; produces the VCD
– Design implementation (DC, SOCE): netlist, SPEF
– Memory modeling (MEMGEN, CACTI): estimate power of memory
– Power analysis (PrimeTime-PX): analyze leakage and dynamic power using PT-PX; Liberty (.lib)
-19-
Testbed

Target design: sub-modules of OpenSPARC T1
Benchmark: ammp, bzip2, equake, twolf, sort. Fast-forward, capture vectors
Implementation: TSMC 65GP technology with standard SP&R
Alternative design techniques:
– SP&R with loose constraints and tight constraints
– Slack Optimizer (make a “gradual slope”) [ASPDAC2010]
-20-
Power Consumption of Each Design Technique

Power savings compared to traditional SP&R design

[Figure: power (W) vs. error rate (%), from 0.00 to 8.00, for LSU_STB_CTL; curves: Loose P&R, Tight P&R, Slack Optimizer, Power Optimizer]
25% power savings @ 0.125% error rate (average)

Area overhead and power savings (from loose SP&R)
                                Tight SP&R   Slack Optimizer   Power Optimizer
Area overhead                   25.9%        3.7%              7.7%
Power savings @ 0.125% error    12%          14%               25%
-21-
Power Consumption for HW-Based Error Tolerance

Razor architecture was assumed for error detection and correction – account for Razor overhead (area, power) and power cost of error correction

[Figure: power (W) vs. voltage (V), from 1.00 down to 0.60, for LSU_STB_CTL; curves: Loose P&R, Tight P&R, SlackOpt, PowerOpt 0.125, PowerOpt 1; voltages 0.84V and 0.76V are marked]
21% additional power savings
-22-
Conclusions and Ongoing Work

We propose recovery-driven design which minimizes power for a target timing error rate
– Optimize designs with functional information and iterative voltage scaling
– We also develop a fast and accurate technique for post-layout activity and error rate estimation

We demonstrate significant power benefits – up to 25% power savings compared to traditional P&R at an error rate of 0.125%

Ongoing work
– Recovery-driven design for different error resilience mechanisms, different sources of variation
– Design / architecture co-exploration
-23-
Thank you
-24-
BACKUP
-25-
Related Work: BlueShift

BlueShift*: maximize frequency for a given error rate

Flow: Gate-level simulation → Compute error rate → ER < Target?
– NO: Speed up paths and repeat
– YES: Finish

BlueShift speedup
– Paths with the highest frequency of timing errors
– FBB (forward body-biasing) & Timing override

Limitation
– Repetitive gate level simulation – impractical
– Design overhead of FBB
*Greskamp et al. “Blueshift: Designing processors for timing speculation from the ground up”, HPCA 2009
-26-
Exploiting Error Resilience for Multi-core Design

Design of heterogeneously reliable multi-core processor
• Power-optimized for different reliability target
• Power-optimized for different mixes of workloads

Individual cores are customized for a specific workload class

[Figure: total power consumption (W) for actual workload BZIP at 0% and 0.5% error rate, by target workload for optimization (BZIP, AVERAGE)]
-27-
Lifetime Energy Minimization

Maximizing energy efficiency of DVFS-based designs
– Inefficiency is due to a design optimized for a single power / performance point
– Minimize energy when the processor spends R of its lifetime at high freq. (e.g., talk mode) and (1 – R) of its lifetime at low freq. (e.g., standby mode)
• Replication-based methodology: area overhead vs. power tradeoffs
• Co-optimization methodology: optimize design with two operating constraints – (freq_hi, V_hi) and (freq_lo, V_lo)
• Both methodologies can be applied alternatively in each submodule

[Figure: power at high frequency, power at low frequency, energy at R=0.1, energy at R=0.01, and area for PowerOpt, Replication and CoOpt (axis ticks 0.50 to 1.50)]
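
The lifetime objective can be written as a weighted average: with a fraction R of lifetime at the high-frequency point and (1 – R) at the low-frequency point, average power is R·P_hi + (1 – R)·P_lo, and lifetime energy is that value times the lifetime. The sketch below only illustrates this arithmetic; the function name, design labels and power numbers are hypothetical, not results from the slides.

```python
def lifetime_average_power(r, p_high, p_low):
    """Average power when a fraction r of lifetime is spent at high frequency."""
    return r * p_high + (1.0 - r) * p_low


# Hypothetical normalized (P_hi, P_lo) pairs for two candidate designs.
designs = {
    "design_a": (1.00, 0.30),   # tuned for the high-frequency point only
    "design_b": (1.05, 0.20),   # slightly worse at high freq., better at low
}

r = 0.01  # processor is almost always in the low-frequency mode
average = {name: lifetime_average_power(r, hi, lo)
           for name, (hi, lo) in designs.items()}
best = min(average, key=average.get)
```

When R is small, the low-frequency power dominates the sum, which is why a design that is slightly worse at the high-frequency point can still win on lifetime energy.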
-28-
Sensitivity-Based Optimization Platform

Post-layout stage cell swap
– Cell sizing + ECO
– Multi-Vt swap
– Multi-Lgate swap

[Figure: Lgate biasing]

Swap cell and check STA with PrimeTime socket interface

Cell swap according to the sensitivity S
– For leakage optimization, S = Δleakage x slack
– For timing closure, S = Δslack / (slack – WNS)

MMMC (Multi-Mode Multi-Corner) can be considered with multiple PrimeTime sockets
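
The two sensitivity functions are simple enough to sketch directly. This is an illustration only: the function names, the candidate list and its numbers are hypothetical, and returning infinity when slack equals WNS is an assumption made to avoid a division by zero.

```python
def leakage_sensitivity(delta_leakage, slack):
    """S = delta_leakage x slack, used to rank swaps for leakage optimization."""
    return delta_leakage * slack


def timing_sensitivity(delta_slack, slack, wns):
    """S = delta_slack / (slack - WNS), used to rank swaps for timing closure."""
    gap = slack - wns
    if gap == 0:
        return float("inf")  # assumption: the worst path gets top priority
    return delta_slack / gap


# Hypothetical swap candidates for leakage optimization.
candidates = [
    {"cell": "u1", "delta_leakage": 2.0, "slack": 0.5},
    {"cell": "u2", "delta_leakage": 1.0, "slack": 2.0},
    {"cell": "u3", "delta_leakage": 3.0, "slack": 0.1},
]
ranked = sorted(
    candidates,
    key=lambda c: leakage_sensitivity(c["delta_leakage"], c["slack"]),
    reverse=True,
)
order = [c["cell"] for c in ranked]
```

Ranking by Δleakage x slack favors swaps that save a lot of leakage on cells with slack to spare, so u3 ranks last despite its large leakage saving.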
-29-
Limitations of Traditional CAD Flow

In modern digital design, the vast majority of paths have near-critical slack – a ‘wall of slack’ distribution

Scaling beyond a critical operating point causes massive errors, and power benefits can be limited*

Error rate = (# cycles which have a timing error) / (# total cycles)

[Figure: number of paths vs. timing slack, showing the ‘wall of slack’ at zero slack; error rate vs. operating point (lower voltage or higher frequency), with voltage labels 1.00V, 0.95V and 0.90V]
*Kahng et al. “Slack Redistribution...”, ASPDAC 2010.
-30-