Recovery-Driven Design: A Power Minimization Methodology for Error-Tolerant Processor Modules
Andrew B. Kahng†, Seokhyeong Kang†, Rakesh Kumar‡ and John Sartori‡
†VLSI CAD Laboratory, UCSD  ‡PASSAT Group, UIUC
DAC, June 17, 2010

Outline
Background and Motivation
– Voltage scaling and error-tolerant design
– Error-tolerant design vs. recovery-driven design
Recovery-Driven Design
– Related work
– Heuristic: power minimization
– Error rate estimation
Experimental Framework and Results
– Design methodology
– Results and analysis
Conclusions and Ongoing Work

Reducing Power with Voltage Scaling
Power is a first-order design constraint
– Moore’s law implies power density of processors continues to escalate
Voltage scaling reduces power but eventually causes massive timing violations
Error-resilience allows deeper voltage scaling
[Figure: power vs. voltage (lower voltage); annotation: timing errors begin to occur]

Error-Tolerance Mechanisms
Hardware error-tolerance
– Errors are detected and corrected during runtime
– Razor (MICRO 2003)
Application-level error-tolerance*
– Errors are allowed to propagate to software, resulting in reduced performance or output quality
Traditional IC design
• No errors allowed
• Overclocking and voltage overscaling not enabled
Error-Tolerant design
• Error correction architecture allows timing errors
• Overclocking and voltage overscaling enabled
[Figure: energy per instruction, error rate, and reduction in computation speed vs. voltage scaling (lower voltage); annotations: energy minimum, ~0.04% error rate, ~0.2% speed reduction]
*Hegde et al., “Energy-Efficient Signal Processing via Algorithmic Noise-Tolerance”, ISLPED 1999

Our Work: From Error-Tolerance to Recovery-Driven
Error-Tolerant design
• Design still optimized for correct operation
• Design methodology based on STA, workload-agnostic
Recovery-Driven design
• Designed “from ground up” for specific target error rate
• Design methodology exploits functional information

Recovery-Driven Design
How to minimize power in recovery-driven design?
1. OptimizePaths: minimize error rate to extend range of voltage scaling
2. ReducePower: reduce design power with cell downsizing or Vt swap
[Figure: error rate and power vs. voltage (lower voltage) for the traditional and optimized designs; at the target error rate, the operating point (Vmin, Pmin) moves to a new operating point at lower voltage and lower power]

Related Work: Design-Level Optimizations for Error-Tolerant Processors
BlueShift*
– Increase frequency up to a target error rate
– Speed up error paths with timing overrides and FBB
Slack Optimizer**
– Reshape slack into a ‘gradual slope’ to achieve a gracefully increasing error rate
– Estimate error rate using switching activity from SAIF
[Figure: number of paths vs. timing slack, contrasting a ‘wall’ of slack with ‘gradual slope’ slack; frequently exercised vs. rarely exercised paths; zero slack at nominal voltage and zero slack after voltage scaling]
*Greskamp et al., “Blueshift: Designing Processors for Timing Speculation from the Ground up”, HPCA 2009
**Kahng et al.,
“Slack Redistribution for Graceful Degradation Under Voltage Overscaling”, ASPDAC 2010

Recovery-Driven Design Methodology
• Problem: minimize processor power (leakage + dynamic) for a target error rate
• Approach: we use slack redistribution and power reduction enabled by accurate error rate estimation
– Slack redistribution: reshape path slack based on path activity (toggle rate) to minimize error rate and extend voltage scaling (OptimizePaths and ReducePower heuristics)
– Error rate estimation using a simulation dump file (VCD)

Slack Redistribution
Redistribute slack from paths that rarely toggle to paths that frequently toggle
[Figure: four panels (a) to (d) of number of paths vs. timing slack: zero slack after voltage scaling; OptimizePaths (upsize cells); ReducePower (downsize cells); iterate voltage scaling. Path groups are labeled P+ and P-.]

Slack Redistribution Flow
[Flow: Netlist and VCD → Analyze activity → Timing analysis → OptimizePaths → ReducePower → Compute error rate (ER) → ER > ERtarget? YES: ECO P&R. NO: reduce voltage and repeat.]
Toggle information: simulation dump file is loaded
Path optimization: minimize error rate to extend range of voltage scaling
Power reduction: downsize cells to obtain additional power savings
Error rate estimation: estimate with toggle info and STA results

Heuristic Details – OptimizePaths
Main idea: increase slack of frequently-exercised paths in order of decreasing toggle rate
Procedure
1. Pick a critical path p with maximum toggle rate
2. Resize cell instance ci in p
3. If the path slack is not improved, the cell change is restored
4. Repeat steps 2 to 3 for all cell instances in path p
5. Repeat steps 2 to 4 for all critical paths
OptimizePaths → ReducePower → Voltage Scaling

Heuristic Details – ReducePower
Main idea: downsize cells on non-critical paths in order of decreasing sensitivity
Sensitivity(c) = (power_c – power_c’) / (slack_c – slack_c’)
Procedure
1. Pick a cell c with maximum sensitivity
2. Downsize cell c with a logically equivalent cell
3. Run incremental timing analysis and check the error rate
4. If the error rate has increased, the cell change is restored
5. Repeat steps 1 to 4
OptimizePaths → ReducePower → Voltage Scaling

Path Extraction for Error Rate Estimation
Instead of simulation, we use toggle information from a value change dump (VCD) file
VCD file (each entry is [time] followed by [value, net] pairs):
#0 0a 0b 1x 1y
#1 1a 0x 0y
#2 …
[Figure: netlist with nets a, b, x, y; waveforms of clock, a, b, x, y over times #0 to #4; list of toggled nets in each cycle]
Extracted paths
– a-x-y (@ cycle 1, 3)
– b-y (@ cycle 2, 4)

Toggle and Error Rate Calculation
p: path
χtoggle: set of cycles in which p has toggled
Xtot: total number of cycles
Toggle rate: |χtoggle| / Xtot
Error rate: (number of cycles with a timing error) / Xtot
Estimation is 20X faster than actual simulation and accurate
[Figure: runtime (min) of simulation vs. estimation for lsu_dctl, lsu_qctl1, lsu_stb_ctl; error rate vs. voltage (1 V to 0.5 V) for Actual, Estimated - PowerOpt, and Estimated - SlackOpt*]
*Kahng et al., “Slack Redistribution...”, ASPDAC 2010.

Evaluation of Heuristic Design Choices
Path ordering
– toggle rate * slack
– toggle rate
Optimization radius
– path only
– fan-in/out network
Starting netlist
– loosely constrained
– tightly constrained
Voltage step size
– 0.01V and 0.05V
[Figure: four panels (path ordering, optimization radius, starting netlist, voltage step size), each plotting power and runtime vs. error rate (0.13% to 8.00%) for options (A) and (B): (A) toggle rate * slack vs. (B) toggle rate; (A) path only vs. (B) fanin/fanout network; (A) loose vs. (B) tight; (A) 0.01V vs. (B) 0.05V]
Design Methodology
– Benchmark generation (Simics): system-level simulation using Simics with real benchmarks → input vector
– Functional simulation (NC Verilog): gate-level simulation to get signal toggle information → simulation result (.vcd)
– Library characterization (SignalStorm): prepare Synopsys Liberty file (.lib) using Cadence SignalStorm
– Initial design (OpenSPARC T1) → design information (.v, .spef)
– Power Optimizer: implemented in C++, uses a Tcl socket I/F to communicate with PrimeTime → list of swaps
– ECO P&R (SOCEncounter): perform ECO P&R with the cell swap list → final design

Power Analysis for Real Workloads
– System-level simulation (Simics + Transplant): simulation with real benchmark binaries (bzip, twolf ...); input patterns are captured
– Functional simulation (VCS or NCVerilog): RTL design (OpenSPARC) with the input pattern → VCD
– Design implementation (DC, SOCE) → netlist, SPEF
– Memory modeling (MEMGEN, CACTI): estimate power of memory
– Power analysis (PrimeTime-PX, Liberty .lib): analyze leakage and dynamic power using PT-PX

Testbed
Target design: sub-modules of OpenSPARC T1
Benchmarks: ammp, bzip2, equake, twolf, sort.
Fast-forward, capture vectors
Implementation: TSMC 65GP technology with standard SP&R
Alternative design techniques:
– SP&R with loose constraints and tight constraints
– Slack Optimizer (make a “gradual slope”) [ASPDAC2010]

Power Consumption of Each Design Technique
Power savings compared to traditional SP&R design
[Figure: LSU_STB_CTL power (W) vs. error rate (%) from 0.00 to 8.00 for Loose P&R, Tight P&R, Slack Optimizer, and Power Optimizer; annotation: 25% power savings @ 0.125% error rate (average)]
Area overhead and power savings (from loose SP&R)
                               Tight SP&R   Slack Optimizer   Power Optimizer
Area overhead                  25.9%        3.7%              7.7%
Power savings @ 0.125% error   12%          14%               25%

Power Consumption for HW-Based Error Tolerance
Razor architecture was assumed for error detection and correction
– account for Razor overhead (area, power) and power cost of error correction
[Figure: LSU_STB_CTL power (W) vs. voltage (1.00 V to 0.60 V) for Loose P&R, Tight P&R, SlackOpt, PowerOpt 0.125, and PowerOpt 1; annotations: 21% additional power savings, 0.84V, 0.76V]

Conclusions and Ongoing Work
We propose recovery-driven design, which minimizes power for a target timing error rate
– Optimize designs with functional information and iterative voltage scaling
– We also develop a fast and accurate technique for post-layout activity and error rate estimation
We demonstrate significant power benefits
– up to 25% power savings compared to traditional P&R at an error rate of 0.125%
Ongoing work
– Recovery-driven design for different error resilience mechanisms, different sources of variation
– Design / architecture co-exploration

Thank you

BACKUP

Related Work: BlueShift
BlueShift*: maximize frequency for a given error rate
[Flow: Gate-level simulation → Compute error rate → ER < Target? NO: speed up paths and repeat. YES: finish.]
BlueShift speedup
– Paths with the highest frequency of timing errors
– FBB (forward body-biasing) & timing override
Limitation
– Repetitive gate-level simulation is impractical
– Design overhead of FBB
*Greskamp et al., “Blueshift: Designing processors for timing speculation from the ground up”, HPCA 2009

Exploiting Error Resilience for Multi-core Design
Design of heterogeneously reliable multi-core processor
• Power-optimized for different reliability targets
• Power-optimized for different mixes of workloads
Individual cores are customized for a specific workload class
[Figure: total power consumption (W) for actual workload BZIP at 0% and 0.5% error rates, by target workload for optimization (BZIP vs. AVERAGE)]

Lifetime Energy Minimization
Maximizing energy efficiency of DVFS-based designs
– Inefficiency is due to a design optimized for a single power / performance point
– Minimize energy when the processor spends R of its lifetime at high freq. (e.g., talk mode) and (1 – R) of its lifetime at low freq. (e.g., standby mode)
• Replication-based methodology: area overhead vs. power tradeoffs
• Co-optimization methodology: optimize design with two operating constraints – (freq_hi, V_hi) and (freq_lo, V_lo)
• Either methodology can be applied in each submodule
[Figure: power at high frequency, power at low frequency, energy at R=0.1, energy at R=0.01, and area for PowerOpt, Replication, and CoOpt]

Sensitivity-Based Optimization Platform
Post-layout stage cell swap
– Cell sizing + ECO
– Multi-Vt swap
– Multi-Lgate swap (Lgate biasing)
Swap cell and check STA with PrimeTime socket interface
Cell swap according to the sensitivity S
– For leakage optimization, S = Δleakage x slack
– For timing closure, S = Δslack / (slack – WNS)
MMMC (Multi-Mode Multi-Corner) can be considered with multiple PrimeTime sockets

Limitations of Traditional CAD Flow
In modern digital design, the vast majority of paths have near-critical slack
– ‘wall of slack’ distribution
Scaling beyond a critical operating point causes massive errors, and power benefits can be limited*
[Figure: number of paths vs. timing slack with a ‘wall of slack’ at zero slack; error rate (%) vs. operating point from 1.00V to 0.90V (lower voltage or higher frequency), annotated at 0.95V]
Error rate = (# cycles which have timing error) / (# total cycles)
*Kahng et al., “Slack Redistribution...”, ASPDAC 2010.
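
The error rate definition above can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each extracted path comes with its STA slack and the set of cycles in which it toggled, and that a cycle counts as erroneous when at least one negative-slack path toggles in it. The function names, the path names, and the slack values are illustrative only and are not taken from the authors' tool.

```python
from typing import Dict, Set


def toggle_rate(toggled_cycles: Set[int], total_cycles: int) -> float:
    """Fraction of cycles in which a single path toggled."""
    return len(toggled_cycles) / total_cycles


def error_rate(path_slack: Dict[str, float],
               path_cycles: Dict[str, Set[int]],
               total_cycles: int) -> float:
    """Fraction of cycles with a timing error.

    Assumption: a cycle has a timing error if at least one path with
    negative slack toggles in that cycle.
    """
    error_cycles: Set[int] = set()
    for path, slack in path_slack.items():
        if slack < 0.0:
            # Union, so a cycle hit by several failing paths counts once.
            error_cycles |= path_cycles.get(path, set())
    return len(error_cycles) / total_cycles


# Hypothetical data: two paths with made-up slack values (ns) over 4 cycles.
path_cycles = {"a-x-y": {1, 3}, "b-y": {2, 4}}
path_slack = {"a-x-y": -0.02, "b-y": 0.05}
total = 4

print(toggle_rate(path_cycles["a-x-y"], total))    # 0.5
print(error_rate(path_slack, path_cycles, total))  # 0.5
```

Taking the union of cycle sets rather than summing per-path toggle rates avoids double counting cycles in which several failing paths toggle together.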