Online Low-Cost Defect Tolerance Solutions for Microprocessor Designs
Thesis Proposal, Tue. Dec. 18th 2007
Kypros Constantinides
Advisor: Todd Austin
Department of Electrical Engineering and Computer Science, University of Michigan

Reliability Challenges of Technology Scaling
- Age-related wearout
  - Electromigration
  - Gate-oxide breakdown (TDDB)
- Transient faults (due to natural radiation)
- Parametric process variation (uncertainty in device & environment)
- Manufacturing defects (that escape testing and burn-in)
- Thermal runaway (increased heating, higher transistor leakage, higher power dissipation)
[Figure: transistor cross-section with source, gate, drain, N+ regions, and P region]

Reliability Challenges of Technology Scaling
[Figure: cost versus silicon process technology, with curves for product cost, cost per transistor, and reliability cost; further scaling is not profitable]
Reliability cost:
1) Cost of built-in defect tolerance mechanisms
2) Cost of R&D needed to develop reliable technologies
Suggested Approach
1) Build products out of unreliable components/technologies
2) Provide reliability through very low cost defect-tolerance techniques

Presentation Outline
- Previous Work – Traditional Techniques
- Preliminary Results
  - BulletProof – A Hardware-Based Defect Tolerance Technique
  - ACE Testing – A Software-Based Defect Tolerance Technique
- Future Work
- Timeline

Traditional Defect Tolerance Techniques
- Used in high-end, life-critical systems (e.g., aviation)
- Triple Modular Redundancy (voting scheme)
- N-Version Hardware
[Figure: Triple Modular Redundancy: three modules feeding voting logic]
[Figure: 2-Version Hardware: Processor Type A and Processor Type B feeding a checker]
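The voting scheme behind triple modular redundancy, and the weaker guarantee of a 2-version checker, can be illustrated with a minimal software sketch. The module functions and the injected bit flip below are hypothetical stand-ins for redundant hardware units; nothing here comes from the slides beyond the voting and checking idea.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def majority_vote(outputs: Sequence[int]) -> int:
    """Return the value that a strict majority of modules agree on."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority among module outputs")
    return value


def tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every redundant module on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])


def two_version_check(a: Callable[[int], int],
                      b: Callable[[int], int],
                      x: int) -> Tuple[int, bool]:
    """2-version hardware: the checker can flag a mismatch but cannot correct it."""
    out_a, out_b = a(x), b(x)
    return out_a, out_a == out_b


def good_module(x: int) -> int:
    return x + 1


def faulty_module(x: int) -> int:
    # Hypothetical defect: bit 2 of the result is flipped.
    return (x + 1) ^ 0b100


voted = tmr([good_module, faulty_module, good_module], 10)
_, versions_agree = two_version_check(good_module, faulty_module, 10)
```

With one faulty module out of three, the vote masks the error; with two versions, the mismatch is only detected. Both schemes pay for full replication, which is the cost the rest of the talk tries to avoid.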
Examples of More Recent Research Approaches
- Processor Checking (DIVA – Austin, MICRO’99)
- Task Checking (Argus – Meixner, MICRO’07)
[Figure: Processor Checking: main processor paired with a processor checker]
[Figure: Task Checking: main processor with control-flow, data-flow, computation, and memory checkers]

Shortcomings of Existing Techniques
- Existing techniques continuously check for execution errors
- Redundant computation requires significant extra hardware – high area overhead
- Continuous checking consumes significant energy – pressure on power budget
- Suitable for high-end or life-critical systems, BUT too costly to employ for mainstream systems

Thesis Goal
Thesis: Defect-tolerance techniques can provide the same level of reliability as traditional techniques, but at a much lower cost.
Goals:
- Ultra low-cost solution: < 5% area
- ~99% of defects are detectable and recoverable
- Low runtime performance overhead (due to testing): < 10%
- After recovery the system still operates in degraded performance mode: < 10%
[Figure: Thesis Goal: Provided Reliability ~99%, Area Cost < 5%, Performance < 10%]
BulletProof Pipeline - Overview (ASPLOS’06, DATE’07)
- Distributed testing: on-line distributed testing using checkers
- Checkpoint: speculative state during checkpoint interval
[Figure: five-stage pipeline (IF, ID, EX, MEM, WB), each stage with its own checker; IF/ID, ID/EX, EX/MEM, and MEM/WB latches with SEU detection; I-cache, D-cache, memory and L2 cache; checkpoint scan chain; external interrupts and I/O transfers; speculative versus non-speculative state; trans-epoch and intra-epoch data]
[Figure: checkpoint interval timeline: Checkpoint, COMPUTATION, Checkpoint, COMPUTATION]

BulletProof: Distributed Testing and Recovery
[Figure: local tester and checker attached to each pipeline stage between the IF/ID, ID/EX, EX/MEM, and MEM/WB latches; one component marked faulty (X)]
[Figure: timeline of a computational epoch with labels: State Checkpoint, Computation + Testing, Testing Complete, Computation (No Testing), Fault Manifests, Fault Detected, Checkpoint Recovery, Reconfig]
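The epoch structure above (checkpoint, compute on speculative state, test, then either commit or recover and reconfigure) can be sketched in software. This is a simplified model under stated assumptions: the `Core` class, its unit names, and the way a defect corrupts results are hypothetical, and testing here runs at the end of the epoch rather than concurrently with computation as in the slides.

```python
import copy


class Core:
    """Toy core: an accumulator plus two redundant units (hypothetical names)."""

    def __init__(self):
        self.state = {"acc": 0}
        self.units = {"alu0": True, "alu1": True}  # True means enabled
        self.defective = set()                     # injected defects, for illustration

    def enabled(self):
        return {name for name, on in self.units.items() if on}

    def run_epoch(self, work):
        for value in work:
            result = self.state["acc"] + value
            if self.defective & self.enabled():
                result ^= 1  # an enabled defective unit corrupts the result
            self.state["acc"] = result

    def test_units(self):
        """Return the enabled units that fail their test."""
        return self.defective & self.enabled()


def execute(core, epochs):
    for work in epochs:
        checkpoint = copy.deepcopy(core.state)      # state checkpoint
        while True:
            core.run_epoch(work)                    # speculative computation
            failed = core.test_units()              # testing for this epoch
            if not failed:
                break                               # tests clean: epoch commits
            core.state = copy.deepcopy(checkpoint)  # recovery: roll back
            for name in failed:
                core.units[name] = False            # reconfig: disable faulty unit
            if not core.enabled():
                raise RuntimeError("no working units left")
    return core.state["acc"]


core = Core()
core.defective.add("alu1")
final = execute(core, [[1, 2], [3, 4]])
```

The first epoch is corrupted, rolled back, and re-executed with `alu1` disabled; later epochs run on the degraded but correct configuration.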
Experimental Methodology – Baseline Architecture
Baseline Architecture:
- 5-stage 4-wide VLIW architecture, 32KB I-Cache, 32KB D-Cache
- Embedded designs: need high reliability with high cost sensitivity
Circuit-Level Evaluation:
- Prototype with a physical layout (TSMC 0.18 µm)
- Accurate area overhead estimations
- Accurate fault coverage area estimations
Architecture-Level Evaluation:
- Trimaran toolset & Dinero IV cache simulator
- Average computational epoch size
- Performance while in graceful degradation
- Benchmarks: SPECINT2000, MediaBench, MiBench
[Figure: baseline pipeline: PC, 32KB I-Cache (address, data), IF/ID, four decoders, ID/EX, register file (4-write/8-read), two ALUs, two MULTs, two Agens, EX/MEM, 32KB D-Cache, MEM/WB]

Design Defect Coverage
Defect Coverage: the fraction of the total design area in which a defect can be detected and corrected

Stage     Defect coverage
IF        92.5%
ID        93.6%
EX        97.7%
MEM       92.6%
WB        92.7%
Overall   95.2%

Area Overhead Summary

Component     Area overhead   Share of total overhead
EX            11.09%          86%
RF            1.26%           9.8%
ID            0.22%           1.7%
IF            0.07%           0.6%
WB            0.06%           0.5%
L1 I-Cache    0.08%           0.66%
L1 D-Cache    0.08%           0.66%
Overall design area cost: 12.9%

BulletProof Summary
- Provided Reliability: 95.2%
- Silicon Area Cost: 12.9%
- Runtime Performance Overhead: <1%
- Trade off runtime performance to get lower area overhead and higher reliability
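As a quick cross-check of the 12.9% silicon area cost, the per-component overheads from the Area Overhead Summary can be summed and each component's share recomputed. The numbers are copied from the slide; the recomputed shares may differ from the slide's parenthesized values in the last digit because of rounding.

```python
# Per-component area overhead (% of design area), as listed in the Area Overhead Summary.
overhead = {
    "EX": 11.09,
    "RF": 1.26,
    "ID": 0.22,
    "IF": 0.07,
    "WB": 0.06,
    "L1 I-Cache": 0.08,
    "L1 D-Cache": 0.08,
}

total = sum(overhead.values())

# Share of the total overhead contributed by each component, in percent.
shares = {name: 100.0 * value / total for name, value in overhead.items()}

for name, value in overhead.items():
    print(f"{name:<11} {value:5.2f}%  share {shares[name]:5.1f}%")
print(f"total       {total:5.2f}%")
```

The EX stage dominates the overhead, which is consistent with the summary's note about trading runtime performance for lower area.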
Software-Based Defect Detection (MICRO’07)
1) Move the hardware checking overhead to software
2) Firmware periodically stalls the processor and performs hardware checking
3) Provide architectural support to the software checking routines
[Figure: firmware periodically stalls the processor and runs hardware checking routines; architectural support gives software-based checking accessibility and controllability]
Advantages over hardware-based techniques
- Lower area overhead
- Higher runtime flexibility
  - It can support multiple fault models
  - Dynamic tuning of the testing process
- Easier to upgrade (software patches)

Access-Control Extensions (ACE) Framework
- Architectural support that enables software access to the processor state (ACE Hardware)
- Special instructions can access and control any part of the processor state (ACE Instructions)
- Firmware can periodically run directed hardware tests (ACE Firmware)
[Figure: layered stack. Software: Applications, Operating System, ACE Firmware. ISA: ACE Extension. Hardware: ACE Hardware, Processor State, Processor]

Accessing The Processor State (ACE Hardware)
- We leverage the existing full hold-scan chain infrastructure
- Full hold-scan chains are employed by most modern processors to improve/automate manufacturing testing
[Figure: Scan State (shadow processor state) alongside Processor State]
Accessing The Processor State (ACE Hardware)
- ACE Instructions can move values from the architectural registers to the scan state and vice versa
- ACE Instructions can swap data between the scan state and the processor state
[Figure: ACE tree connecting the register file through ACE nodes to the scan state and processor state]

Software-based Testing & Diagnosis (ACE Firmware)
Step 1: Load test pattern into scan state
Step 2: 3-cycle atomic test operation
  Cycle 1: Swap scan state with processor state
  Cycle 2: Test cycle
  Cycle 3: Swap scan state with processor state
Step 3: Validate test response
ATPG: automatic test pattern & response generation
[Figure: test patterns and test responses held in memory; patterns move through the register file and ACE nodes into the scan state, are swapped into the processor state for the test cycle, and the test response is swapped back out for validation]

Timeline of Software-Based Testing
Software-based testing is coupled with a checkpointing and recovery mechanism
[Figure: checkpoint interval timeline with labels: Checkpoint, COMPUTATION, Functional Test, ACE-based Test, Checkpoint, COMPUTATION]
Functional software test
- Checks whether the core is capable of running ACE-based testing
- Limited fault coverage: 60-70%
- Very fast: < 1000 instructions
Directed ACE-based testing
- High-quality testing (ATPG patterns)
- High fault coverage: ~99%
- Runtime: < 1M instructions
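The three-step ACE test procedure (load pattern, 3-cycle atomic test, validate response) can be modeled with a toy core whose processor state and scan state are short bit vectors. Everything named here is an assumption made for illustration: the `AceCore` class, the 4-bit state, and the `healthy_logic` and `faulty_logic` functions stand in for real ACE instructions, scan chains, and combinational logic.

```python
class AceCore:
    """Toy model: processor state plus a shadow scan state of the same width."""

    def __init__(self, logic, initial_state):
        self.logic = logic                       # next-state function of the core
        self.processor_state = list(initial_state)
        self.scan_state = [0] * len(initial_state)

    def load_scan(self, pattern):
        self.scan_state = list(pattern)

    def read_scan(self):
        return list(self.scan_state)

    def swap(self):
        self.processor_state, self.scan_state = self.scan_state, self.processor_state

    def clock(self):
        self.processor_state = self.logic(self.processor_state)


def run_ace_test(core, pattern, expected_response):
    core.load_scan(pattern)   # Step 1: load test pattern into scan state
    core.swap()               # Cycle 1: pattern enters processor state, live state is parked
    core.clock()              # Cycle 2: test cycle
    core.swap()               # Cycle 3: response lands in scan state, live state is restored
    return core.read_scan() == list(expected_response)  # Step 3: validate


def healthy_logic(bits):
    return [bits[1] ^ bits[0], bits[2], bits[3], bits[0]]


def faulty_logic(bits):
    out = healthy_logic(bits)
    out[0] = 0                # hypothetical stuck-at-0 defect on output bit 0
    return out


pattern = [1, 0, 1, 1]
golden = healthy_logic(pattern)   # stands in for an ATPG-generated response

good_core = AceCore(healthy_logic, [1, 0, 0, 1])
bad_core = AceCore(faulty_logic, [1, 0, 0, 1])
good_passes = run_ace_test(good_core, pattern, golden)
bad_passes = run_ace_test(bad_core, pattern, golden)
```

The second swap is what makes the operation cheap to embed in a running program: the interrupted processor state is back in place after the test, and only the response has to be compared in software.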
Experimental Methodology
- OpenSPARC T1 CMP – based on Sun’s Niagara
- Synopsys Design Compiler to synthesize the OpenSPARC CMP
- Synopsys TetraMAX ATPG tool for test pattern generation
- RTL implementation of ACE framework to get area overhead
- Microarchitectural simulation to get performance overhead
  - SESC cycle-accurate simulator
  - Simulate a SPARC core enhanced with the ACE framework
  - Benchmarks from the SPEC CPU2000 suite

Preliminary Functional Testing
- Fault injection campaign on a gate-level netlist of a SPARC core
- Software functional test – 3 phases (~700 instructions):
  - Control flow check
  - Register access
  - Use all ISA instructions
- Functional testing coverage is low: ~62%
- Undetected faults do not affect the execution of ACE firmware
- Full coverage provided with further ACE-based testing

Fault injection outcome          Share of faults
Undetected Faults                37.86%
Register Access Assertion        23.36%
Incorrect Execution Assertion    21.38%
Control Flow Assertion           7.45%
Memory Error                     6.49%
Early Execution Termination      1.57%
Illegal Execution                1.40%
Timeout                          0.49%

Full-chip Distributed ACE-based Testing
- Chip testing is distributed to the eight SPARC cores
- Testing for stuck-at and path-delay fault models

Cores    Test instructions   Coverage
[0,1]    312K                99.6%
[2,4]    468K                98.7%
[3,5]    405K                98.8%
[6,7]    333K                99.9%
Performance Overhead of ACE-Based Testing
- Performance overhead depends on the fault model used to generate patterns
- The ACE framework is flexible enough to support test patterns from different fault models
[Figure: bar chart of average performance overhead (%), y-axis 0 to 30, SPEC CPU2000 average at a 100M checkpoint interval, for Stuck-at, Stuck-at + Path Delay, N-Detect (N=2) + Path Delay, and N-Detect (N=4) + Path Delay; testing quality increases from left to right]

ACE Framework Area Overhead
- RTL implementation of ACE Framework in Verilog
- Explored several ACE tree configurations
- 8 ACE trees (1 per core) to cover OpenSPARC
- ~230K ACE accessible bits
- Area Overhead: 0.7% each tree, 5.8% for ACE framework

ACE Testing Summary
- Provided Reliability: ~99%
- Silicon Area Cost: 5.8%
- Runtime Performance Overhead: 5-25%
[Figure: summary chart; also labeled: BulletProof Pipeline]

Contributions to Date - Acknowledgements
BulletProof Pipeline (ASPLOS’06, DATE’07)
- Todd Austin and Valeria Bertacco (project supervision)
- Smitha Shyam and Sujay Phadke (ASPLOS’06)
- Mojtaba Mehrara and Mona Attariyan (DATE’07)
- Physical prototype implementation
- Distributed Checkers
- Added soft-error detection to BulletProof pipeline
- Increased the fault coverage of the technique (protection for control logic)
ACE Testing Framework (MICRO’07)
- Todd Austin, Onur Mutlu and Valeria Bertacco (project supervision)
Overview of Future Research Directions
Online Low-cost Defect Tolerance Solutions
- Online Defect Detection & Diagnosis
  - BulletProof Pipeline
  - ACE Testing
  - Add value to already proposed techniques
- Online System Recovery
  - Low overhead periodic checkpoint and recovery
  - Existing mechanisms:
    • ReVive + ReViveI/O
    • SafetyNet
- Online System Repair
- Evaluation Infrastructure
  - Fault Injection Based Analysis Framework

Extend the ACE Framework to Other Applications
Overhead of ACE framework can be amortized by other applications:
- Online Defect Detection & Diagnosis
- Online Performance Monitoring
- Online Design Bug Detection
- Manufacturing Testing
- Post-silicon Debugging
[Figure: ACE Framework (processor plus ACE firmware) providing hardware accessibility & controllability to the applications above]

Flexible Event Monitoring Architecture
- Event monitoring requires real-time signal monitoring/processing
- Event monitoring hardware:
  - Bug signature checkers
  - Performance counters
- Support of monitoring capabilities for all ~230K bits of OpenSPARC is very expensive: ~25-30% area overhead
[Figure: register file, ACE nodes, and a programmable logic core]
Design Bugs - Preliminary Analysis
- Most bugs are in complex control logic
  - Memory subsystem (lsu)
  - Exception/interrupt control (tlu)
- Load/Store Unit (lsu) & Trap Logic Unit (tlu) account for 96% of the design bugs in the OpenSPARC core
- They account for only 49% of the core’s scan cells

Design Bug Distribution (SPARC core)
Unit   Share of design bugs
lsu    51%
tlu    44%
ifu    4%
spu    1%

Scan Cells Distribution (SPARC core)
Unit   Share of scan cells
tlu    26%
lsu    23%
ifu    21%
exu    11%
spu    8%
mul    6%
ffu    5%

Current Software-Based Fault Simulation Framework
- A Monte Carlo-based fault simulation & analysis framework
- Monte Carlo simulation loop: 1000x
- Fault simulation & analysis speed: ~10 kHz
- Supported fault models:
  • Stuck-at
  • Stuck-open
  • Bridge
  • Path-delay
  • Transient (SEU)
- Fault analyzer outcome. The fault is:
  • Logic masked
  • Timing masked
  • Architecture masked
  • Error (fault manifests)
[Figure: fault models (type, time, location, duration) and design stimuli drive a fault-exposed model of the gate-level netlist and a golden model (no faults injected); a fault analyzer compares the two inside the Monte Carlo simulation loop]

Hardware Accelerated Fault Simulation
- Port the software-based fault simulation & analysis framework to the BEE2 hardware emulation platform
- In collaboration with:
  - Andrea Pellegrini
  - Dan Zhang
[Figure: BEE2 Emulation Board]

Online Repair Techniques
- Qualitatively evaluate the effectiveness of graceful degradation that exploits existing resource redundancy
- But different architectures have different degrees of resource redundancy: 2-cores, 8-cores, 80-tiles
- For what defect rates is a given degree of resource redundancy adequate?
- Is graceful degradation enough? Do we need to spare? If yes, what to spare?
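One simple way to frame the question of when a given degree of redundancy is adequate is a binomial model: assume each of n identical units is independently defective with probability p, and ask how likely it is that at least k units still work. This is only an illustrative sketch; the independence assumption, the function name, the defect probability, and the "half the units must survive" threshold are all hypothetical and not taken from the slides.

```python
from math import comb


def prob_enough_units(n: int, k: int, p_defect: float) -> float:
    """Probability that at least k of n units are defect-free,
    assuming each unit is independently defective with probability p_defect."""
    p_ok = 1.0 - p_defect
    return sum(
        comb(n, i) * p_ok**i * p_defect ** (n - i)
        for i in range(k, n + 1)
    )


# Example sweep over the three design points named on the slide.
# The 5% per-unit defect probability and the 50% survival threshold are placeholders.
for n in (2, 8, 80):
    k = max(1, n // 2)
    print(f"{n:>2} units, need {k:>2}: {prob_enough_units(n, k, 0.05):.6f}")
```

A model like this ignores defects in shared structures and the performance lost per disabled unit, which is where the fault injection infrastructure described earlier would be needed.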
Thesis Completion Timeline
[Figure: timeline marked Jan’08, Mar’08, May’08, Jul’08, Sept’08, Nov’08, Jan’09, Mar’09, May’09, with a possible internship and target venues: IEEE Transactions on Computers, MICRO’08 or ASPLOS’09, DAC’09, DSN’09]

Thank You!
Questions?