Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University

Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University 1 Hardware Failures: Major Concern  Permanent: our focus  Temporary 2 Tolerating Permanent Hardware Failures Detection Diagnosis Recovery Self-repair 3 Tolerating Permanent Hardware Failures Detection Diagnosis Concurrent error detection:Self-repair expensive Recovery 4 Tolerating Permanent Hardware Failures Detection Diagnosis Online Self-Test and Diagnostics  Non-continuous  Low power Recovery  Self-repair Concurrent  No visible downtime 5 Tolerating Permanent Hardware Failures Diagnosis Detection Online Self-Test and Diagnostics  Localize failures Recovery Diagnosis Self-repair 6 Tolerating Permanent Hardware Failures Replace / bypass faulty component(s) Diagnosis Detection Self-repair Recovery Self-repair 7 Tolerating Permanent Hardware Failures Correction of corrupted dataDiagnosis and states Detection Checkpoint Failure Failure occurs detected Rollback Recovery Self-repair 8 This Lecture Detection Diagnosis Recovery Self-repair 9 Outline  Introduction  Self-repair techniques  Memories  Processor cores  Uncore components  Conclusion 10 Memory Organization en rd/wr logic Row decoder addr Column decoder data_in Column multiplexers Data registers Sense amplifiers data_out 11 Memory Functional Fault Models  Memory cell array faults  stuck-at, transition, coupling, etc.  Address decode faults (AFs)  no cell accessed, multiple cells accessed, etc.  Read/write logic faults  Equivalent to memory cell array faults Single bit / word / column / row faults dominate [Aitken 04, Kurdahi 06, Sridharan 12] 12 Spare Rows / Columns en rd/wr logic Spare column Row decoder Spare rows addr Column decoder data_in Column multiplexers Data registers Sense amplifiers data_out 13 Memory Self-Repair Using Spare Rows en rd/wr logic Row decoder addr Column decoder data_in Column multiplexers Data registers Sense amplifiers data_out  Row decoder modified 14 Memory Self-Repair Using Spare Columns en rd/wr logic Row decoder addr Multiplexers Column decoder data_in Column multiplexers Data registers Selfrepair control Sense amplifiers data_out  Additional multiplexers and select logic 15 Spare Words en Same inputs as RAM rd/wr logic Spare words pool Row decoder addr Column decoder data_in Column multiplexers Data registers Sense amplifiers Self-repair control data_out [Kim 98] 16 Redundancy in Memories  Essential to improve yield and reliability  Widely used in commercial RAMs  Related topics  How much redundancy?  Built-in repair analysis: redundancy allocation 17 How Much Redundancy?  Yield analysis and yield learning  Example: negative binomial memory yield model 𝑌𝑌𝑌𝑌𝑌 = (1 + 𝑌𝑌 𝑌) 𝑌: defect desntiy, 𝑌: memory area 𝑌: defect clustering coefficient (measured to be 2 or 3)  For a memory array with N rows and 1 spare row 𝑌𝑌𝑌𝑌𝑌 = 𝑌𝑌𝑌𝑌𝑌+(𝑌 + 1)(𝑌𝑌𝑌𝑌𝑌)(1 − 𝑌𝑌𝑌𝑌𝑌) 18 How Much Redundancy?  Yield analysis and yield learning  Example: negative binomial memory yield model 19 Built-In Repair Analysis  Spare rows and columns  NP complete  Various algorithms and EDA tools exist 20 Special Self-Repair Techniques for Caches  Caches: affects performance, NOT functionality Block address Tag Index Block offset V Tag Data / ECC Decoder ... ... ... =? Hit/miss Data out 21 Cache Line Disable/Delete  Exist in commercial systems  Intel [Chang 07], IBM [Sanda 08], etc. Block address Tag Index Fault-tolerance bit Block offset V FT Tag Data / ECC Decoder ... ... ... =? Hit/miss Data out 22 Setting the FT Bit Cache line read ECC calculate and check ECC error? No Yes Correct and flag error; log faulty location ECC error at same location already occurred within a time threshold? Done No Yes Set FT bit 23 PADded Cache  Reconfigurable cache with programmable decoder Conventional decoder 24 PADded Cache  Reconfigurable cache with programmable decoder  Additional tag bits needed Programmable decoder  ~ 5% area cost [Shirvani 99]  16KB, direct-mapped, 1-level programmability  Reduce cost at the price of granularity 25 Cache Line Delete vs. PADded Cache 26 Outline  Introduction  Self-repair techniques  Memories  Processor cores  Uncore components  Conclusion 27 Core Sparing  Utilize multi-/many-core architectures  Core disabling also possible  Already in commercial products  IBM BlueGene/Q  Nvidia Geforce  Cisco Metro  Fine-grained approaches? 28 Core Cannibalization  Many-core designs with small in-order cores  3.5% area cost (OpenRISC 1200) [Romanescu 08] 29 Core Cannibalization: Discussion  What about diagnosis?  Routing  performance impact  Additional wire delay pipeline stages  Modified branch resolution, bypass logic, etc.  Decreased clock frequency  Small vs. large number of faulty cores 30 Microarchitectural Block Disabling  Disable “half pipeline way” in superscalar designs  12% area cost (includes diagnosis) Issue queue F D R D R ... ... F Rd E M W C Rd E M W C [Schuchman 05] 31 Microarchitectural Block Disabling  Disable “half pipeline way” in superscalar designs  12% area cost (includes diagnosis) Issue queue F S D* R* S D* R* *modified ... ... F S Rd * E* M* W* C* stages S Rd * E* M* W* C* [Schuchman 05] 2 shift stages (S) added and lots design modifications 32 Microarchitectural Block Disabling  Disable “half pipeline way” in superscalar designs  12% area cost (includes diagnosis) Issue queue, old half F S D* R* S D* R* *modified ... ... F S Rd * E* M* W* C* stages S Rd * E* M* W* C* [Schuchman 05] Issue queue, new half Split issue queue (same for store buffer, not shown) 33 Microarchitectural Block Disabling: Discussion  Expensive, complex, intrusive  Diagnosis logic  Reconfiguration logic  Coverage issues  E.g., fetch stage not covered 34 Architectural Core Salvaging  Many-core CISC processor designs Application n Advanced ISA Advanced ISA Basic ISA Core 1 ... Application 1 Basic ISA Core n [Powell 09] 35 Architectural Core Salvaging  Many-core CISC processor designs Application 1 Basic uops  Core 1 Basic uops ... Advanced ISA Basic ISA Application n  Advanced ISA Basic ISA Core n [Powell 09] 36 Architectural Core Salvaging  Many-core CISC processor designs Application 1 Advanced uops  Core 1 Basic uops ... Advanced ISA Basic ISA Application n  Advanced ISA Basic ISA Core n [Powell 09] 37 Architectural Core Salvaging  Many-core CISC processor designs Application 1 Advanced uops  Core 1 Application n Basic uops ... Advanced ISA Basic ISA Thread swap  Advanced ISA Basic ISA Core n [Powell 09] 38 Architectural Core Salvaging  Many-core CISC processor designs Application n Basic uops  Core 1 Application 1 Advanced uops ... Advanced ISA Basic ISA Thread swap  Advanced ISA Basic ISA Core n [Powell 09] 39 Architectural Core Salvaging: Discussion  What about diagnosis  Applicability  CISC-like architectures  Performance  Depends  Coverage issues  ~50% coverage of execution 40 Self-Repair for Processor Cores  Which technique to choose?  Software-assisted techniques? 41 Outline  Introduction  Self-repair techniques  Memories  Processor cores  Uncore components  Conclusion 42 Uncore Prevalent in SoCs  IBM Power 7 Uncore examples  Cache / DRAM controller  Accelerators  I/O interfaces NVIDIA Tegra Cisco Network Processor 43 Uncore Self-Repair Essential OpenSPARC T2 Uncore 12% Processor cores 12% Memories 76%  Cores, memories, networks-on-chip  Many existing techniques 44 Key Message [Li ITC13] Existing sparing techniques 20 Chip area impact (%) 0 98 Self-repair coverage (%) 45 Key Message [Li ITC13] Existing sparing techniques 20 Chip area impact (%) 7.5 Our techniques 3.2 0 75 98 Self-repair coverage (%) 46 Key Message [Li ITC13] Existing sparing techniques 20 Chip area impact (%) Performance: 0% (fault-free) 0.3% - 5% (faulty) Power: 3% 7.5 Our techniques 3.2 0 75 98 Self-repair coverage (%) 47 Existing Techniques Inadequate OpenSPARC T2 Uncore 48 Existing Techniques Inadequate OpenSPARC T2 Uncore … Original … … Spare … 49 Existing Techniques Inadequate OpenSPARC T2 Steering logic … … Spare … … … Original 50 Existing Techniques Inadequate OpenSPARC T2 Steering logic … … Original … … Spare … Power gating  20% chip area cost 51 New Uncore Self-Repair Techniques ERRS: Enhanced Resource Reallocation and Sharing SHE: Sparing through Hierarchical Exploration OpenSPARC T2 8 cores, 64 threads 500M transistors  7.5% chip area cost 52 Outline  Introduction  Self-repair for uncore  ERRS  SHE  Conclusion 53 Basic Resource Reallocation & Sharing (RRS)  Already-existing “similar” components  No spares 4 cores 3. Reroute … Crossbar blocks Self-repair control L2 mem 0 1. Fault detected L2 bank control 0 8 cycle overhead (faulty case only) L2 bank control 1 2. Resource sharing 4 cores L2 mem 1 54 Basic RRS Performance Impact Fault-free scenario 4 cores … Crossbar blocks Self-repair control L2 mem 0 L2 bank MISS! control 0 Fill buffer L2 bank MISS! control 1 4 cores L2 mem 1 55 Basic RRS Performance Impact Single faulty component: 70% impact 4 cores … Crossbar blocks Self-repair control L2 mem 0 L2 bank control 0 L2 bank control 1 Fill buffer DATA DATA 4 cores L2 mem 1 56 Enhanced RRS (ERRS) Idea Identify Mitigate No Desirable tradeoff? Analyze Yes Done 57 ERRS Mitigates RRS Performance Impact Single faulty component: 3% impact 4 cores … Crossbar blocks Self-repair control L2 mem 0 Fill buffer L2 bank control 0 Fill buffer L2 bank control 1 4 cores L2 mem 1 58 ERRS on OpenSPARC T2 DRAM data return: duplicated L2 hit processing: duplicated DRAM bank control FSM: entries doubled L2 miss fill buffer: entries doubled  3.2% area, 2.7% power 59 Performance Evaluation Setup Simulators GEM5 64 single-issue in-order processor cores Simulated CMP 1KB private L1 data and instruction caches 4MByte shared L2 cache (8 banks) 4 DRAM controllers 60 ERRS vs. Basic RRS: Performance Impact ERRS: 3% performance impact 1 faulty L2 controller 3% 1 faulty DRAM controller 0.7% Average CPI overhead 0% 0.0% RRS Basic RRS ERRS RRS Basic RRS ERRS 64-core CMP PARSEC benchmark programs 61 ERRS vs. Basic RRS: Performance Impact ERRS: 5% performance impact 1 faulty L2 controller 18% 1 faulty DRAM controller 25% Average CPI overhead 0% 0% RRS Basic RRS ERRS RRS Basic RRS ERRS 64-core CMP Stressed programs 62 ERRS for Multiple Faulty Components 25 Graceful performance degradation 4 L2 bank controllers Average CPI impact (%) 4 L2 bank controllers + 2 DRAM controllers 2 L2 bank controllers 2 DRAM controllers 0 Faulty components 64-core CMP PARSEC benchmark programs 63 Which Faults Repairable? delay bridging stuckat  All faults inside a component 64 Which Faults Not Repairable?  Single points of failure  Primary inputs  Primary outputs  Steering logic  Metric: self-repair coverage  % non-single points of failure 65 ERRS Component Self-Repair Coverage 97% 97% 97% 98% 97% 97% 97% 97% 96% 97% 97% 97% 98% 97% 97% 66 ERRS Chip Area/Power Impact & Coverage Existing sparing techniques 20 Post-layout chip area impact (%) Power: 3% 3.2 ERRS 0 75 98 Self-repair coverage (%) 67 Outline  Introduction  Self-repair for uncore  ERRS  SHE  Conclusion 68 SHE: Sparing through Hierarchical Exploration Component RTL  SHE Sparing for component Minimize area cost  Identical blocks  spare shared  Balanced coverage vs. area 69 The Design Hierarchy Network controller pio Level 1 rdmc Level 2 chnl_16 chnl_15 chnl_1 Level 3 Lowest level 70 Sparing in Different Levels: Coarse-Grained pio Network controller pio rdmc pio rdmc rdmc Level 2 Original Spare / steering logic 71 Sparing in Different Levels: Mixed Levels Network controller pio pio chnl_16 pio chnl_15 chnl_1 rdmc chnl Level 2 / Level 3 Original Spare / steering logic 72 Sparing in Different Levels: Fine-Grained Network controller pio rdmc Lowest level Original Spare / steering logic 73 How to Obtain “Sweet-Spot”? Balanced tradeoff High Area overhead Self-repair coverage Low Deeper in hierarchy Algorithm details in paper 74 SHE Results Existing sparing techniques 20 Post-layout chip area impact (%) 3.2 ERRS 0 75 98 Self-repair coverage (%) 75 SHE Results Existing sparing techniques 20 Post-layout chip area impact (%) SHE power impact: 0.2% SHE performance impact: 0% 7.5 ERRS + SHE 0 98 Self-repair coverage (%) 76 What About Diagnosis?  Traditional fault diagnosis difficult  Effect-cause: infeasible  RTL unavailable  Cause-effect: impractical  7 petabyte fault dictionary  [Beckler ITC12] 77 Fault Diagnosis Simplified  CASP [Li DATE08, VTS10]  Concurrent, Autonomous, Stored test Patterns 78 Fault Diagnosis Simplified  CASP [Li DATE08, VTS10]  Concurrent, Autonomous, Stored test Patterns Blocks = self-repair granularity … Block 1 Block n 79 Fault Diagnosis Simplified  CASP [Li DATE08, VTS10]  Concurrent, Autonomous, Stored test Patterns Pass/fail 1 … Pass/fail n Local test logic … … … Fail  self-repair needed Block 1 Block n = self-repair granularity Diagnosis = detection 80 Summary: Self-Repair of Uncore Components Existing sparing techniques 20 Post-layout chip area impact (%) Performance: 0% (fault-free) 0.3% - 5% (faulty) Power: 3% 7.5 ERRS + SHE 3.2 ERRS 0 75 98 Self-repair coverage (%) 81 References [Aitken 04] Aitken, R., “A Modular Wrapper Enabling High Speed BIST and Repair for Small Wide Memories,” Proc. Intl. Test Conf., pp. 997-1005, 2004. [Chang 07] Chang, J., et al., “The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series,” IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp. 846-852, 2007. [Kurdahi 06] Kurdahi, F.J., et al., “System-Level SRAM Yield Enhancement,” Proc. Intl. Symp. Quality Electronic Design, 2006 82 References [Li VTS10] Li, Y., et al., “Concurrent Autonomous Self-Test for Uncore Components in System-on-Chips,” Proc. VLSI Test Symposium, 2010. [Li DATE08] Li, Y., S. Makar, and S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test using Stored Test Patterns,” Proc. Design, Automation, and Test in Europe, pp. 885-890, 2008. [Li ITC 13] Li, Y., et al., “Self-Repair of Uncore Components in Robust System-on-Chips: An OpenSPARC T2 Case Study,” Proc. IEEE Intl. Test Conf., pp. 1-10, 2013. 83 References [Powell 09] Powell, M.D., et al., “Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance,” Proc. Intl. Symp. on Computer Architecture, pp. 93-104, 2009. [Romanescu 08] Romanescu, B.F., and D.J. Sorin, “Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults,” Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, pp. 43-51, 2008. 84 References [Sanda 08] Sanda, P.N, et al., “Fault-Tolerant Design of the IBM Power6 Microprocessor,” IEEE Micro, vol. 28, no. 2, pp. 30-38, 2008. [Schuchman 05] Schuchman, E., and T.N. Vijaykumar, “Rescue: A Microarchitecture for Testability and Defect Tolerance,” Proc. Intl. Symp. on Computer Architecture, pp. 160-171, 2005. [Shirvani 99] Shirvani, P.P., and E.J. McCluskey, “PADded Cache: A New Fault-Tolerance Technique for Cache Memories,” Proc. VLSI Test Symp., pp. 440-445, 1999. 85 References [Shivakumar 03] Shivakumar, P., et al., “Exploiting Microarchitectural Redundancy for Defect Tolerance,” Proc. Intl. Conf. on Computer Design, pp. 481-488, 2003. [Sridharan 12] Sridharan, V., et al., “A Study of DRAM Failures in the Field,” Proc. Intl. Conf. High Performance Computing, Networking, Storage and Analysis, pp. 1-11, 2012. [Schuchman 05] Schuchman, E., and T.N. Vijaykumar, “Rescue: A Microarchitecture for Testability and Defect Tolerance,” Proc. Intl. Symp. on Computer Architecture, pp. 160-171, 2005. 86

Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University

Related documents

Products

Support

Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib