A R-S M T: A M i c r o a r c h i te ctu ra l A pproa c h t o Fa u l t To l e ra n ce in Microprocessors Eric Rotenberg Computer Sciences Department University of Wisconsin — Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Per fo r m a n c e M i c r o p r o ce sso r s • General-purpose computing success - Proliferation of high-performance single-chip microprocessors • Rapid performance growth fueled by two trends - Technology advances: circuit speed and density - Microarchitecture innovations: exploiting instruction-level parallelism (ILP) in sequential programs • Both technology and microarchitecture performance trends have implications to fault tolerance Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 2 Te ch n ol ogy and Faul t Tole ra n c e • Technology-driven performance improvements - GHz clock rates - Billion-transistor chips • High clock rate, dense designs - low voltages for fast switching and power management - high-performance and “undisciplined” circuit techniques - managing clock skew with GHz clocks - pushing the technology envelope potentially reduces design tolerances in general => Entire chip prone to frequent, arbitrary transient faults Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 3 M ic roarc h i t e c t u r e a n d Fa u lt To le ra n ce • Conventional fault-tolerant techniques - Specialized techniques (e.g. ECC for memory, RESO for ALUs) do not cover arbitrary logic faults - Pervasive self-checking logic is intrusive to design - System-level fault tolerance (e.g. redundant processors) too costly for commodity computers • A microarchitecture-based fault-tolerant approach - Microarchitecture performance trends can be easily leveraged for fault-tolerance goals - Broad coverage of transient faults - Low overhead: performance, area, and design changes Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 4 Ta l k O u t l i n e ✓Technology and microarchitecture trends: implications to fault tolerance • Time redundancy spectrum • AR-SMT • Leveraging microarchitecture trends - Simultaneous Multithreading (SMT) - Speculation concepts - Hierarchical processors (e.g. Trace Processors) • Performance results • Summary and future work Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 5 Time R e d u n d a n c y S p ectr u m Program-level time redundancy Processor Processor Program P Program P’ Processor dynamic parallel scheduling execution units Program P FU Instruction re-execution i instr i i’ FU FU FU Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 6 Time R e d u n d a n c y S p ectr u m Processor Program P AR-SMT time redundancy Eric Rotenberg Program P’ AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 7 AR - S M T H i g h L eve l R-stream A-stream PROCESSOR fetch commit R-stream A-stream DELAY BUFFER • “A” => “Active stream” • “R” => “Redundant stream” • “SMT” => “Simultaneous MultiThreading” Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 8 A R- S M T Fa u l t To l e ra n ce • Delay Buffer - Simple, fast, hardware-only state passing for comparing thread state - Ensures time redundancy: the A- and R-stream copies of an instruction execute at different times - Buffer length adjusted to cover transient fault lifetimes • Transient fault detection and recovery - Fault detected when thread state does not match - Error latency related to length of Delay Buffer - Committed R-stream state is checkpoint for recovery Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 9 Leveragin g M i c r o a r c h i t e c t ur e Tr e n d s • Simultaneous Multithreading • Control Flow and Data Flow Prediction • Hierarchy and Replication Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 10 SMT • Simultaneous multithreading [Tullsen,Eggers,Emer,Levy,Lo,Stamm: ISCA-23] - Multiple contexts space-share a wide-issue superscalar processor - Key ideas • Performance: better overall utilization of highly parallel uniprocessor • Complexity: leverage well-understood superscalar mechanisms to make multiple contexts transparent • SMT is (most likely) in next-generation product plans Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 11 Cont ro l a n d D a t a “ P r e d ictio n ” cycle cycle r1 <= r6-11 r2 <= r1<<2 1 i1 i2 i3 i4 i5 1 i1 r3 <= r1+r2 2 2 i2 branch i5,r3==0 3 3 i3 1 0 4 4 i4 11 00 5 5 11 00 6 6 pipeline latency i5: r5 <= 10 1 0 7 7 8 8 11 00 9 9 i5 11 00 i1: i2: i3: i4: (a) program segment (b) base schedule (c) with prediction • Delay Buffer contains perfect “predictions” for R-stream! • Existing prediction-validation hardware provides fault detection! Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 12 Tra c e P r o c e s s o r s • Trace: a long (16 to 32) dynamic sequence of instructions, and the fundamental unit of operation in Trace Processors • Replicated PEs => inherent, coarse level of hardware redundancy - Permanent fault coverage: A-/R-stream traces go to different PEs - Simple re-configuration: remove a PE from the resource pool Processing Element 0 Global Registers Predicted Local Registers Registers Issue Buffers Processing Element 1 Trace Cache Live-in Value Predict Processing Element 2 Speculative State Next Trace Predict Data Cache Func Units Global Rename Maps Processing Element 3 Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 13 A R-S M T o n a Tra c e P r o ce sso r FETCH DISPATCH ISSUE EXECUTE RETIRE PE A-stream Trace Predictor PE A-stream A-stream Trace Cache R-stream Trace "Prediction" from Delay Buffer Decode Rename Allocate PE PE R-stream Eric Rotenberg Reclaim Resources PE A-stream Reg File time-shared Commit State D-cache Disambig. Unit space-shared AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 14 Pe r fo r m a n c e R e s ults normalized execution time (single thread = 1) 1.3 4 PE 8 PE 1.2 1.1 1 comp Eric Rotenberg gcc go jpeg li AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 15 Pe r fo r m a n c e R e s ults % of all cycles 30 A-stream R-stream 25 20 15 10 5 0 0 1 2 3 4 5 6 7 8 number of PEs used Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 16 Summary • Technology-driven performance improvements - New fault environment: frequent, arbitrary transient faults • Leverage microarchitecture performance trends for broadcoverage, low-overhead fault tolerance - SMT-based time redundancy - Control and data “prediction” - Hierarchical processors • Introducing a second, redundant thread increases execution time by only 10% to 30% Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 17 O t her I n t e r e s t i n g Pe r s pe ctive s • D. Siewiorek. “Niche Successes to Ubiquitous Invisibility: Fault-Tolerant Computing Past, Present, and Future”, FTCS-25 - (Quote) Fault-tolerant architectures have not kept pace with the rate of change in commercial systems. - Fault tolerance must make unconventional in-roads into commodity processors: leverage the commodity microarchitecture. • P. Rubinfeld. “Managing Problems at High Speeds”, Virtual Roundtable on the Challenges and Trends in Processor Design, Computer, Jan. 1998. - Implications of very high clock rate, dense designs Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 18 F u t u r e Wo r k • Performance and design studies - Isolate individual performance impact of SMT-ness and prediction aspects - Debug Buffer size vs. performance: SMT scheduling flexibility - Support more threads - Design and validate a full AR-SMT fault-tolerant system • Derive fault models (collaboration?) • Simulate fault coverage - fault injection - test different parts of processor: coverage varies? - vary duration of transient faults - permanent faults and re-configuration Eric Rotenberg AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Slide 19