Slide 1: Concurrent Autonomous Self-Test for Uncore Components in SoCs
  Yanjing Li, Stanford University
  Onur Mutlu, Carnegie Mellon University
  Donald S. Gardner, Intel Corporation
  Subhasish Mitra, Stanford University

Slide 2: Overcoming CMOS Reliability Challenges
  [Figure: failure rate vs. time, with regions Early-life failures, Lifetime, Circuit aging]
  Burn-in difficult
  Soft errors: Built-In Soft Error Resilience (BISER)
  Guardbands expensive
  On-line self-test and diagnostics

Slide 3: Uncore Components Significant in SoCs
  Uncore examples:
    Controllers for cache & DRAM
    Crossbar
    I/O interfaces
  [Figures: IBM Power 7, NVIDIA Tegra, Cisco Network Processing Engine, each with uncore components marked]
  (Images: news.cnet.com, techvishal.wordpress.com, ciscosistemas.org)

Slide 4: Robust Uncore Essential
  OpenSPARC T2 SoC: 8 cores, 64 threads (Image: opensparc.net)
  Breakdown: Uncore 12%, Processor cores 12%, Memories 76%
  Uncore: new on-line self-test
  Processor cores: CASP [Li DATE 08, ICCAD 09]
  Memories: ECC, memory BIST & repair

Slide 5: Challenge 1: High Test Coverage
  CASP: Concurrent, Autonomous, Stored Patterns
    High-coverage patterns in off-chip FLASH
    System-level on-line test access
    FLASH cheap, test compression pervasive

                  CASP       Logic BIST   Roving Emulation
  Coverage        High       ?            Depends
  Cost            Low        High         High
  Design effort   Moderate   High         High

Slide 6: Challenge 2: Power, Performance, Area Costs
  Stall-and-test inadequate
  Results on a 4-core Intel® Core™ i7 system (Image: intel.com)
  [Figure: four cores, caches and interconnects, DRAM controller under on-line self-test]
  Requests from multiple cores
  Multiple cores stall
  Unresponsiveness or system hang

Slide 7: Naïve Approaches Inadequate for Uncore
  Stall-and-test: small area cost, but unresponsiveness or complete hang
  Spare unit for each uncore type: small performance impact, but 12% area overhead*
  Uncore CASP: new techniques required
  * OpenSPARC T2 design

Slide 8: New Uncore On-line Self-Test Principles
  I. Resource reallocation and sharing (RRS)
  II. No-performance-impact testing
  III. Smart backup
  < 1% area impact, < 3% performance impact
  OpenSPARC T2 SoC (Image: opensparc.net)

Slide 9: I. Resource Reallocation and Sharing (RRS)
  Components with "similar" functionality in SoCs
  Temporary reallocation and sharing
  Small performance hit without replication
  [Figure: OpenSPARC T2 with 4 cores + 4 cores, crossbar blocks, L2 banks, CASP controller; one L2 bank under on-line self-test (Image: opensparc.net)]
    1. Stall and drain requests
    2. Transfer dirty lines
    3. Invalidate
    4. Reroute

Slide 10: II. No-Performance-Impact Testing
  Implication relations among SoC components
  Component(s) tested when idle during test of another component
  [Figure: OpenSPARC T2 with 4 cores + 4 cores, crossbar blocks, L2 banks, CASP controller; labels: RRS, IDLE, On-line self-test (Image: opensparc.net)]

Slide 11: III. Smart Backup
  Operations with different requirements
  Backup unit for performance-critical operations
  Absolute minimal additional hardware
  [Figure: OpenSPARC T2 I/O interface; operations: DMA for network, Programmed I/O, DMA for disks; annotations: "Support in smart backup", "Stall or handle slowly via Programmed I/O"]

Slide 12: Application Performance Impact
  Memory-centric:
    [Chart: execution time impact (0% to 10%) across PARSEC benchmarks]
    1.5% performance impact
  I/O-centric on 4-core Intel system:
    Disk access: 3% impact
  No visible unresponsiveness
  4-core Intel® Core™ i7, uncore CASP emulated (Image: intel.com)

Slide 13: Area and Power Impact
  CASP controller (< 0.01% area)
  On-chip buffer (8 KB)
  Off-chip FLASH: 200 MB
  Uncore on-line self-test principles applied
  Minimal area impact: < 1%
  Minimal power impact: < 1%
  (Image: opensparc.net)

Slide 14: Test Results for Uncore Components
  200 MB off-chip FLASH
  10X test compression
  7 ms to 300 ms test time per component

               Total pattern count   Test coverage
  Stuck-at     5,577                 99.2% - 99.9%
  Transition   11,049                92.8% - 97.8%

  Inexpensive FLASH
  Thorough on-line self-test

Slide 15: Uncore CASP vs. Existing Techniques
  [Comparison table]
  Columns: Logic BIST; Concurrent BIST [Saluja IEEE TCAD 88]; Uncore CASP [This work]
  Rows: Coverage; Area cost; Design complexity; Performance impact
  Cell entries, in extraction order: High with high costs; Depends; High; Low; High; Low with our uncore principles; High costs possible; Low; Moderate; Low

Slide 16: CASP Applicable for Other SoCs
  I. RRS
  II. No-performance-impact testing
  III. Smart backup
  IV. Core CASP
  [Figures: IBM Power 7, NVIDIA Tegra, Cisco Network Processing Engine]
  (Images: news.cnet.com, ciscosistemas.org, techvishal.wordpress.com)

Slide 17: Conclusions
  CASP: adaptive on-line self-test & diagnostics
  3 new principles for uncore CASP:
    I. Resource reallocation and sharing (RRS)
    II. No-performance-impact testing
    III. Smart backup
  Effective and practical:
    High test coverage
    1% power, 3% performance, 1% area

Slide 18: Backup Slides

Slide 19: CASP on Actual Intel® Core™ i7 System
  Intel Research collaboration
  Quad-core Intel® Core™ i7 (3.2 GHz)
  Thermoelectric temperature controller
  Debug tool
  Unique real-life experiment
  Development of adaptive self-diagnostics
  [Figure labels: Temperature Controller, Debug Tool Adapter]

Slide 20: CASP Flow
  SoC with CASP controller (multi-core SoC proliferation)
  Inexpensive off-chip FLASH (non-volatile storage technology)
    1. Select uncore or core component
    2. Isolate
    3. Apply / analyze high-quality test patterns (test compression, at-speed test…)
    4. Resume operation
  [Figure label: Scan chain]

Slide 21: RRS Example: L2 Cache Banks
    1. Stall cache controller
    2. Drain outstanding requests
    3a. Invalidate clean blocks; invalidate directory; invalidate L1
    3b. Transfer necessary states (dirty blocks); write back to main memory if necessary
    4. Route packets with destination {bank 0, bank 1} to bank 1
  [Figure: Crossbar above Bank 0 (under test) and Bank 1 (helper), each with Tag, Data, etc. and a Controller; … ; DRAM Controller 0]

Slide 22: No-Performance-Impact Testing Example: CCX (Crossbar)
  8 cores, 64 threads
  Packets reallocated to helper
  CCX: multiplexers and arbitration logic 0 … 7, with separate scan chains
  Test at the same time
  [Figure: L2 Bank 0 … L2 Bank 7]

Slide 23: Smart Backup Example: Non-Cacheable Unit
    1. Stall
    2. Drain outstanding requests
    3. Turn on
    4. Transfer states
    5. Select outputs from backup
  [Figure: Original (under test) and Backup units feeding a MUX; block labels: PIO, Boot ROM interface, Config. status register interface, Interrupt processing, Interrupt status table, Reset]
  Minimize area costs at acceptable performance impact

Slide 24: Naïve Approaches Inadequate for Uncore
  Simple stall-and-test technique
  Demonstration on actual 4-core Intel® Core™ i7 system
  Infrequent test: noticeable unresponsiveness
    [Figure: OS timer interrupt handlers on core 1 … core i issue requests to DRAM and stall while the DRAM controller is stalled]
  Frequent test: system hang
    [Figure: DRAM controller under test]
  Identical backup units: 12% area overhead

Slide 25: Performance Impact
  Tool: GEMS simulator (modified for RRS)
  Workload: PARSEC benchmark suite
  4 threads on 4 cores, CASP runs 1 sec. every 10 sec.
  [Chart: Simulated Latency Overhead (PARSEC Benchmark Suite); series: 1 thread, 2 threads, 4 threads; y-axis 0.0% to 1.5%]

Slide 26: III. Smart Backup
  Operations with different requirements
  Backup unit for performance-critical operations
  Absolute minimal additional hardware
  [Figure: OpenSPARC T2 network interface and I/O interface; labels: Ethernet port interface, Layer 2 packet process, Layers 3 and 4 acceleration, DMA for network, Programmed I/O, DMA for disks, OS orchestration; annotations: "Support in smart backup", "Stall or handle slowly via Programmed I/O"]
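
The four RRS steps on the "RRS Example: L2 Cache Banks" slide (stall, drain, invalidate/transfer, reroute) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware implementation from the talk: `Block`, `L2Bank`, `Crossbar`, `rrs_isolate`, the `capacity` field, and the rule "move a dirty block to the helper if it has room, otherwise write it back" are all hypothetical names and simplifications introduced here. Directory and L1 invalidation are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    data: int
    dirty: bool = False


@dataclass
class L2Bank:
    name: str
    capacity: int = 4                             # hypothetical block capacity
    blocks: dict = field(default_factory=dict)    # address -> Block
    pending: list = field(default_factory=list)   # outstanding (address, data) writes
    stalled: bool = False

    def drain(self):
        """Complete every outstanding write so no request is lost."""
        while self.pending:
            addr, data = self.pending.pop(0)
            self.blocks[addr] = Block(data, dirty=True)


class Crossbar:
    """Maps a destination bank name to the bank that currently serves it."""

    def __init__(self, banks):
        self.route = {bank.name: bank for bank in banks}

    def lookup(self, dest):
        return self.route[dest]


def rrs_isolate(under_test, helper, crossbar, main_memory):
    """Free `under_test` for self-test by handing its traffic to `helper`."""
    under_test.stalled = True                     # 1. stall cache controller
    under_test.drain()                            # 2. drain outstanding requests
    for addr, blk in list(under_test.blocks.items()):
        if blk.dirty:                             # 3b. transfer necessary state
            if len(helper.blocks) < helper.capacity:
                helper.blocks[addr] = blk         #     helper takes the dirty block
            else:
                main_memory[addr] = blk.data      #     else write back to memory
        del under_test.blocks[addr]               # 3a. invalidate (clean blocks are dropped)
    crossbar.route[under_test.name] = helper      # 4. reroute packets to the helper


# Usage: bank0 goes under test, bank1 acts as helper.
bank0 = L2Bank("bank0", blocks={0: Block(10, dirty=True), 2: Block(20)},
               pending=[(4, 40)])
bank1 = L2Bank("bank1", capacity=2, blocks={1: Block(11)})
xbar = Crossbar([bank0, bank1])
memory = {}
rrs_isolate(bank0, bank1, xbar, memory)
print(sorted(bank1.blocks), memory, xbar.lookup("bank0").name)
```

The model draining before invalidating mirrors the slide's ordering: outstanding requests must complete first so that any state they create is either transferred or written back before the bank is handed to the test controller.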