Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University 1 Hardware Failures: Major Concern Permanent: our focus Temporary 2 Tolerating Permanent Hardware Failures Detection Diagnosis Recovery Self-repair 3 Tolerating Permanent Hardware Failures Detection Diagnosis Concurrent error detection:Self-repair expensive Recovery 4 Tolerating Permanent Hardware Failures Detection Diagnosis Online Self-Test and Diagnostics Non-continuous Low power Recovery Self-repair Concurrent No visible downtime 5 Tolerating Permanent Hardware Failures Diagnosis Detection Online Self-Test and Diagnostics Localize failures Recovery Diagnosis Self-repair 6 Tolerating Permanent Hardware Failures Replace / bypass faulty component(s) Diagnosis Detection Self-repair Recovery Self-repair 7 Tolerating Permanent Hardware Failures Correction of corrupted dataDiagnosis and states Detection Checkpoint Failure Failure occurs detected Rollback Recovery Self-repair 8 This Lecture Detection Diagnosis Recovery Self-repair 9 Outline Introduction Self-repair techniques Memories Processor cores Uncore components Conclusion 10 Memory Organization en rd/wr logic Row decoder addr Column decoder data_in Column multiplexers Data registers Sense amplifiers data_out 11 Memory Functional Fault Models Memory cell array faults stuck-at, transition, coupling, etc. Address decode faults (AFs) no cell accessed, multiple cells accessed, etc. Read/write logic faults Equivalent to memory cell array faults Single bit / word / column / row faults dominate [Aitken 04, Kurdahi 06, Sridharan 12] 12 Spare Rows / Columns en rd/wr logic Spare column Row decoder Spare rows addr Column decoder data_in Column multiplexers Data registers Sense amplifiers data_out 13 Memory Self-Repair Using Spare Rows en rd/wr logic Row decoder addr Column decoder data_in Column multiplexers Data registers Sense amplifiers data_out Row decoder modified 14 Memory Self-Repair Using Spare Columns en rd/wr logic Row decoder addr Multiplexers Column decoder data_in Column multiplexers Data registers Selfrepair control Sense amplifiers data_out Additional multiplexers and select logic 15 Spare Words en Same inputs as RAM rd/wr logic Spare words pool Row decoder addr Column decoder data_in Column multiplexers Data registers Sense amplifiers Self-repair control data_out [Kim 98] 16 Redundancy in Memories Essential to improve yield and reliability Widely used in commercial RAMs Related topics How much redundancy? Built-in repair analysis: redundancy allocation 17 How Much Redundancy? Yield analysis and yield learning Example: negative binomial memory yield model 𝑌𝑌𝑌𝑌𝑌 = (1 + 𝑌𝑌 𝑌) 𝑌: defect desntiy, 𝑌: memory area 𝑌: defect clustering coefficient (measured to be 2 or 3) For a memory array with N rows and 1 spare row 𝑌𝑌𝑌𝑌𝑌 = 𝑌𝑌𝑌𝑌𝑌+(𝑌 + 1)(𝑌𝑌𝑌𝑌𝑌)(1 − 𝑌𝑌𝑌𝑌𝑌) 18 How Much Redundancy? Yield analysis and yield learning Example: negative binomial memory yield model 19 Built-In Repair Analysis Spare rows and columns NP complete Various algorithms and EDA tools exist 20 Special Self-Repair Techniques for Caches Caches: affects performance, NOT functionality Block address Tag Index Block offset V Tag Data / ECC Decoder ... ... ... =? Hit/miss Data out 21 Cache Line Disable/Delete Exist in commercial systems Intel [Chang 07], IBM [Sanda 08], etc. Block address Tag Index Fault-tolerance bit Block offset V FT Tag Data / ECC Decoder ... ... ... =? Hit/miss Data out 22 Setting the FT Bit Cache line read ECC calculate and check ECC error? No Yes Correct and flag error; log faulty location ECC error at same location already occurred within a time threshold? Done No Yes Set FT bit 23 PADded Cache Reconfigurable cache with programmable decoder Conventional decoder 24 PADded Cache Reconfigurable cache with programmable decoder Additional tag bits needed Programmable decoder ~ 5% area cost [Shirvani 99] 16KB, direct-mapped, 1-level programmability Reduce cost at the price of granularity 25 Cache Line Delete vs. PADded Cache 26 Outline Introduction Self-repair techniques Memories Processor cores Uncore components Conclusion 27 Core Sparing Utilize multi-/many-core architectures Core disabling also possible Already in commercial products IBM BlueGene/Q Nvidia Geforce Cisco Metro Fine-grained approaches? 28 Core Cannibalization Many-core designs with small in-order cores 3.5% area cost (OpenRISC 1200) [Romanescu 08] 29 Core Cannibalization: Discussion What about diagnosis? Routing performance impact Additional wire delay pipeline stages Modified branch resolution, bypass logic, etc. Decreased clock frequency Small vs. large number of faulty cores 30 Microarchitectural Block Disabling Disable “half pipeline way” in superscalar designs 12% area cost (includes diagnosis) Issue queue F D R D R ... ... F Rd E M W C Rd E M W C [Schuchman 05] 31 Microarchitectural Block Disabling Disable “half pipeline way” in superscalar designs 12% area cost (includes diagnosis) Issue queue F S D* R* S D* R* *modified ... ... F S Rd * E* M* W* C* stages S Rd * E* M* W* C* [Schuchman 05] 2 shift stages (S) added and lots design modifications 32 Microarchitectural Block Disabling Disable “half pipeline way” in superscalar designs 12% area cost (includes diagnosis) Issue queue, old half F S D* R* S D* R* *modified ... ... F S Rd * E* M* W* C* stages S Rd * E* M* W* C* [Schuchman 05] Issue queue, new half Split issue queue (same for store buffer, not shown) 33 Microarchitectural Block Disabling: Discussion Expensive, complex, intrusive Diagnosis logic Reconfiguration logic Coverage issues E.g., fetch stage not covered 34 Architectural Core Salvaging Many-core CISC processor designs Application n Advanced ISA Advanced ISA Basic ISA Core 1 ... Application 1 Basic ISA Core n [Powell 09] 35 Architectural Core Salvaging Many-core CISC processor designs Application 1 Basic uops Core 1 Basic uops ... Advanced ISA Basic ISA Application n Advanced ISA Basic ISA Core n [Powell 09] 36 Architectural Core Salvaging Many-core CISC processor designs Application 1 Advanced uops Core 1 Basic uops ... Advanced ISA Basic ISA Application n Advanced ISA Basic ISA Core n [Powell 09] 37 Architectural Core Salvaging Many-core CISC processor designs Application 1 Advanced uops Core 1 Application n Basic uops ... Advanced ISA Basic ISA Thread swap Advanced ISA Basic ISA Core n [Powell 09] 38 Architectural Core Salvaging Many-core CISC processor designs Application n Basic uops Core 1 Application 1 Advanced uops ... Advanced ISA Basic ISA Thread swap Advanced ISA Basic ISA Core n [Powell 09] 39 Architectural Core Salvaging: Discussion What about diagnosis Applicability CISC-like architectures Performance Depends Coverage issues ~50% coverage of execution 40 Self-Repair for Processor Cores Which technique to choose? Software-assisted techniques? 41 Outline Introduction Self-repair techniques Memories Processor cores Uncore components Conclusion 42 Uncore Prevalent in SoCs IBM Power 7 Uncore examples Cache / DRAM controller Accelerators I/O interfaces NVIDIA Tegra Cisco Network Processor 43 Uncore Self-Repair Essential OpenSPARC T2 Uncore 12% Processor cores 12% Memories 76% Cores, memories, networks-on-chip Many existing techniques 44 Key Message [Li ITC13] Existing sparing techniques 20 Chip area impact (%) 0 98 Self-repair coverage (%) 45 Key Message [Li ITC13] Existing sparing techniques 20 Chip area impact (%) 7.5 Our techniques 3.2 0 75 98 Self-repair coverage (%) 46 Key Message [Li ITC13] Existing sparing techniques 20 Chip area impact (%) Performance: 0% (fault-free) 0.3% - 5% (faulty) Power: 3% 7.5 Our techniques 3.2 0 75 98 Self-repair coverage (%) 47 Existing Techniques Inadequate OpenSPARC T2 Uncore 48 Existing Techniques Inadequate OpenSPARC T2 Uncore … Original … … Spare … 49 Existing Techniques Inadequate OpenSPARC T2 Steering logic … … Spare … … … Original 50 Existing Techniques Inadequate OpenSPARC T2 Steering logic … … Original … … Spare … Power gating 20% chip area cost 51 New Uncore Self-Repair Techniques ERRS: Enhanced Resource Reallocation and Sharing SHE: Sparing through Hierarchical Exploration OpenSPARC T2 8 cores, 64 threads 500M transistors 7.5% chip area cost 52 Outline Introduction Self-repair for uncore ERRS SHE Conclusion 53 Basic Resource Reallocation & Sharing (RRS) Already-existing “similar” components No spares 4 cores 3. Reroute … Crossbar blocks Self-repair control L2 mem 0 1. Fault detected L2 bank control 0 8 cycle overhead (faulty case only) L2 bank control 1 2. Resource sharing 4 cores L2 mem 1 54 Basic RRS Performance Impact Fault-free scenario 4 cores … Crossbar blocks Self-repair control L2 mem 0 L2 bank MISS! control 0 Fill buffer L2 bank MISS! control 1 4 cores L2 mem 1 55 Basic RRS Performance Impact Single faulty component: 70% impact 4 cores … Crossbar blocks Self-repair control L2 mem 0 L2 bank control 0 L2 bank control 1 Fill buffer DATA DATA 4 cores L2 mem 1 56 Enhanced RRS (ERRS) Idea Identify Mitigate No Desirable tradeoff? Analyze Yes Done 57 ERRS Mitigates RRS Performance Impact Single faulty component: 3% impact 4 cores … Crossbar blocks Self-repair control L2 mem 0 Fill buffer L2 bank control 0 Fill buffer L2 bank control 1 4 cores L2 mem 1 58 ERRS on OpenSPARC T2 DRAM data return: duplicated L2 hit processing: duplicated DRAM bank control FSM: entries doubled L2 miss fill buffer: entries doubled 3.2% area, 2.7% power 59 Performance Evaluation Setup Simulators GEM5 64 single-issue in-order processor cores Simulated CMP 1KB private L1 data and instruction caches 4MByte shared L2 cache (8 banks) 4 DRAM controllers 60 ERRS vs. Basic RRS: Performance Impact ERRS: 3% performance impact 1 faulty L2 controller 3% 1 faulty DRAM controller 0.7% Average CPI overhead 0% 0.0% RRS Basic RRS ERRS RRS Basic RRS ERRS 64-core CMP PARSEC benchmark programs 61 ERRS vs. Basic RRS: Performance Impact ERRS: 5% performance impact 1 faulty L2 controller 18% 1 faulty DRAM controller 25% Average CPI overhead 0% 0% RRS Basic RRS ERRS RRS Basic RRS ERRS 64-core CMP Stressed programs 62 ERRS for Multiple Faulty Components 25 Graceful performance degradation 4 L2 bank controllers Average CPI impact (%) 4 L2 bank controllers + 2 DRAM controllers 2 L2 bank controllers 2 DRAM controllers 0 Faulty components 64-core CMP PARSEC benchmark programs 63 Which Faults Repairable? delay bridging stuckat All faults inside a component 64 Which Faults Not Repairable? Single points of failure Primary inputs Primary outputs Steering logic Metric: self-repair coverage % non-single points of failure 65 ERRS Component Self-Repair Coverage 97% 97% 97% 98% 97% 97% 97% 97% 96% 97% 97% 97% 98% 97% 97% 66 ERRS Chip Area/Power Impact & Coverage Existing sparing techniques 20 Post-layout chip area impact (%) Power: 3% 3.2 ERRS 0 75 98 Self-repair coverage (%) 67 Outline Introduction Self-repair for uncore ERRS SHE Conclusion 68 SHE: Sparing through Hierarchical Exploration Component RTL SHE Sparing for component Minimize area cost Identical blocks spare shared Balanced coverage vs. area 69 The Design Hierarchy Network controller pio Level 1 rdmc Level 2 chnl_16 chnl_15 chnl_1 Level 3 Lowest level 70 Sparing in Different Levels: Coarse-Grained pio Network controller pio rdmc pio rdmc rdmc Level 2 Original Spare / steering logic 71 Sparing in Different Levels: Mixed Levels Network controller pio pio chnl_16 pio chnl_15 chnl_1 rdmc chnl Level 2 / Level 3 Original Spare / steering logic 72 Sparing in Different Levels: Fine-Grained Network controller pio rdmc Lowest level Original Spare / steering logic 73 How to Obtain “Sweet-Spot”? Balanced tradeoff High Area overhead Self-repair coverage Low Deeper in hierarchy Algorithm details in paper 74 SHE Results Existing sparing techniques 20 Post-layout chip area impact (%) 3.2 ERRS 0 75 98 Self-repair coverage (%) 75 SHE Results Existing sparing techniques 20 Post-layout chip area impact (%) SHE power impact: 0.2% SHE performance impact: 0% 7.5 ERRS + SHE 0 98 Self-repair coverage (%) 76 What About Diagnosis? Traditional fault diagnosis difficult Effect-cause: infeasible RTL unavailable Cause-effect: impractical 7 petabyte fault dictionary [Beckler ITC12] 77 Fault Diagnosis Simplified CASP [Li DATE08, VTS10] Concurrent, Autonomous, Stored test Patterns 78 Fault Diagnosis Simplified CASP [Li DATE08, VTS10] Concurrent, Autonomous, Stored test Patterns Blocks = self-repair granularity … Block 1 Block n 79 Fault Diagnosis Simplified CASP [Li DATE08, VTS10] Concurrent, Autonomous, Stored test Patterns Pass/fail 1 … Pass/fail n Local test logic … … … Fail self-repair needed Block 1 Block n = self-repair granularity Diagnosis = detection 80 Summary: Self-Repair of Uncore Components Existing sparing techniques 20 Post-layout chip area impact (%) Performance: 0% (fault-free) 0.3% - 5% (faulty) Power: 3% 7.5 ERRS + SHE 3.2 ERRS 0 75 98 Self-repair coverage (%) 81 References [Aitken 04] Aitken, R., “A Modular Wrapper Enabling High Speed BIST and Repair for Small Wide Memories,” Proc. Intl. Test Conf., pp. 997-1005, 2004. [Chang 07] Chang, J., et al., “The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series,” IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp. 846-852, 2007. [Kurdahi 06] Kurdahi, F.J., et al., “System-Level SRAM Yield Enhancement,” Proc. Intl. Symp. Quality Electronic Design, 2006 82 References [Li VTS10] Li, Y., et al., “Concurrent Autonomous Self-Test for Uncore Components in System-on-Chips,” Proc. VLSI Test Symposium, 2010. [Li DATE08] Li, Y., S. Makar, and S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test using Stored Test Patterns,” Proc. Design, Automation, and Test in Europe, pp. 885-890, 2008. [Li ITC 13] Li, Y., et al., “Self-Repair of Uncore Components in Robust System-on-Chips: An OpenSPARC T2 Case Study,” Proc. IEEE Intl. Test Conf., pp. 1-10, 2013. 83 References [Powell 09] Powell, M.D., et al., “Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance,” Proc. Intl. Symp. on Computer Architecture, pp. 93-104, 2009. [Romanescu 08] Romanescu, B.F., and D.J. Sorin, “Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults,” Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, pp. 43-51, 2008. 84 References [Sanda 08] Sanda, P.N, et al., “Fault-Tolerant Design of the IBM Power6 Microprocessor,” IEEE Micro, vol. 28, no. 2, pp. 30-38, 2008. [Schuchman 05] Schuchman, E., and T.N. Vijaykumar, “Rescue: A Microarchitecture for Testability and Defect Tolerance,” Proc. Intl. Symp. on Computer Architecture, pp. 160-171, 2005. [Shirvani 99] Shirvani, P.P., and E.J. McCluskey, “PADded Cache: A New Fault-Tolerance Technique for Cache Memories,” Proc. VLSI Test Symp., pp. 440-445, 1999. 85 References [Shivakumar 03] Shivakumar, P., et al., “Exploiting Microarchitectural Redundancy for Defect Tolerance,” Proc. Intl. Conf. on Computer Design, pp. 481-488, 2003. [Sridharan 12] Sridharan, V., et al., “A Study of DRAM Failures in the Field,” Proc. Intl. Conf. High Performance Computing, Networking, Storage and Analysis, pp. 1-11, 2012. [Schuchman 05] Schuchman, E., and T.N. Vijaykumar, “Rescue: A Microarchitecture for Testability and Defect Tolerance,” Proc. Intl. Symp. on Computer Architecture, pp. 160-171, 2005. 86