On exploiting redundancy Sandeep Gupta Ming Hsieh Department of Electrical Engineering University of Southern California sandeep@usc.edu Thanks • $s – NSF and SRC • Colleagues – – – – – – – – – – – – – Murali Annavaram Melvin A. Breuer Jaechul Cha Da Cheng Keith Chugg Tong-Yu Hsieh Hsunwei Hsiung Zhigang Jiang K.-J. Lee Mohammad Mirza-Aghatabar Antonio Ortega Shideh Shahidi Doochul Shin 2 Redundancy Historical definitions and uses World English Dictionary (via dictionary.com) • redundancy — n , pl -cies 1. a. the state or condition of being redundant or superfluous, esp superfluous in one's job b. (as modifier): a redundancy payment 2. excessive proliferation or profusion, esp of superfluity 3. duplication of components in electronic or mechanical equipment so that operations can continue following failure of a part 4. repetition of information or inclusion of additional information to reduce errors in telecommunication transmissions and computer processing 4 Computing Dictionary (via dictionary.com) • redundancy definition 1. The provision of multiple interchangeable components to perform a single function in order to provide resilience (to cope with failures and errors). Redundancy normally applies primarily to hardware. For example, a cluster may contain two or three computers doing the same job. They could all be active all the time thus giving extra performance through parallel processing and load balancing; one could be active and the others simply monitoring its activity so as to be ready to take over if it failed ("warm standby"); the "spares" could be kept turned off and only switched on when needed ("cold standby"). Another common form of hardware redundancy is disk mirroring. 2. data redundancy. 
(1995-05-09) 5 Classical forms of redundancy in digital systems • Spares – Circuit modules: Spare copies of modules • Cold or hot standby • Dedicated or shared – Computational tasks: Executing spare copies of tasks • In space or in time – Data to be communicated: Spare bits/bytes • In space or in time; coding is typically used – Data to be stored: Spare bits/bytes/… • In space; coding is typically used 6 Classical motivations for using redundancy • To improve resilience to failures or errors during the system’s operational life – Circuits and computation • To continue operation after a failure occurs • Typically limited to systems that demanded high-reliability, high-availability, … – Data • To ensure integrity despite errors during communication or storage • In memories and disks, redundancy used to compensate for fabrication-induced failures 7 Redundancy Our motivation Yield decreasing with scaling • Lower initial yield • Yield learning rate not improved • Lower mature yield R. C. Leachman and C. N. Berglund, “Systematic Mechanisms Limited Yield (SMLY) Assessment Study”, International SEMATECH, Doc. 
#03034383A-ENG, March 2003 9
Causes of decreasing yield • Continued fabrication of chips with finer feature sizes – widths, separations, and so on • Some key fabrication tools not being scaled due to cost considerations, e.g., the 193nm light source for photolithography • Continual increase in the number of fabrication steps and masks • Continued increase in chip size 10
Yield of a chip with area A • When a chip has zero spares and only chips with zero defects are counted in "yield" – Poisson model: Y = e^(−D₀·A·k) – Negative binomial model: Y = (1 + D₀·A·k/β)^(−β), where D₀ is the defect density, k is the kill ratio, and β is the clustering factor 11
A simple way to view the problem • Each passing year is likely to bring – Increasing defect densities – Increasing variations in values of circuit parameters – Increasing circuit sizes 12
Our motivation • Use redundancy in computational logic to improve yield, despite increasing fabrication challenges – Increasing defect densities – Increasing variations in values of circuit parameters • Redundancy has been used for decades to improve memory yields – New approaches to optimally exploit redundancy for increasing fabrication challenges 13
Our motivation • Yield enhancement via design in the real world – Spare rows/columns used in SRAMs since the 1970s – IBM CELL processor has one spare core – Nvidia GTX 480/470/465 have 1/2/5 disabled processors IBM CELL Nvidia GTX 470 14
Our concept of redundancy • Any functionality that is not necessary for acceptable system operation – Functionality = circuit module – Acceptable = • General purpose computing – Guaranteed correctness of output values, plus – Performance limits • Media applications – Output values within user-specified limits, in terms of PSNR, BER, MSE, … 15
Our primary figures of merit • Yield = % of fabricated chips deemed acceptable and shipped to customers • Die area: a measure of chip fabrication cost • These two are combined to indicate total functionality obtained from each wafer • Yield/area
GOPS/area or GOPS/wafer • Revenue/wafer • Revenue/total-cost 16
Yield-per-area (YPA) • Consider a wafer – Without our approach • For the original design, each die has area A • The wafer contains, say, 100 dies • The yield is, say, 30%, hence – Chips sold: C1, C2, …, C30; and chips discarded: C31, C32, …, C100 – Our approach changes the design • Our new design is larger and hence each die has area, say, 1.25A • Hence, the wafer contains 80 dies • As our new design can tolerate some defects, yield increases, say, to 50% – Chips sold: C1*, C2*, …, C40*; chips discarded: C41*, C42*, …, C80* • Assuming the same cost per wafer – Benefit of our approach = 40/30 = 1.33 – Yield-per-area gain = (50%/1.25A) / (30%/A) = 1.33 17
GOPS-per-area or GOPS-per-wafer • What if our approach produces chips with different performance – Without our approach • For the original design, each die has area A • The wafer contains, say, 100 dies • The yield is, say, 30%, hence – Chips sold: C1, C2, …, C30, each with performance P_orig – Chips discarded: C31, C32, …, C100 • Total performance per wafer: 30 P_orig 18
GOPS-per-area or GOPS-per-wafer • What if our approach produces chips with different performance – Our approach changes the design • Our new design is larger and hence each die has area, say, 1.1A • Hence, the wafer contains 91 dies • As our new design can tolerate some defects, yield increases, say, to 40% – Chips sold: C1*, C2*, …, C36*, with, say, » 26 chips with performance P_orig, and » 10 chips with performance 0.9 P_orig – Chips discarded: C37*, C38*, …, C91* • Total performance per wafer: 26 + 10 × 0.9 = 35 P_orig 19
General purpose processors Key idea • In general purpose computing systems – A large gap exists between • Correctness requirements at the circuit layer • Functional correctness for users – Relax the unnecessarily stringent correctness requirements at the circuit layer • Guarantee the correctness requirements at the – μArch layer – ISA layer – System/software layer – User
layer 21
Module X is redundant in what sense? • We could do everything we can do now, if we did not have module X • Why did we add X? – To improve performance – To reduce power – … 22
Implicit redundancy: Examples • Storage – In the absence of a last-level cache (LLC), main memory can provide alternative storage – One part of the LLC can provide alternative storage for another part • Computation – In the absence of a multiplier, an adder and shifter can provide the functionality – In the absence of a floating point unit (FPU), integer units can provide the functionality • Controllers, predictors, … – … 23
Observation • Anything beyond the Intel 4004 is redundant 24
How to exploit implicit redundancy? • If module X does not work in one particular chip – Can we use the chip as-is? – Or, should we simply 'disable' X and use the chip? – Or, can we find some other way to use the chip? 25
Objectives • To guarantee functional correctness of a microprocessor in the presence of a defect in a module – At negligible cost in circuit area – At minimal impact on circuit performance • Performance defined at various levels – Circuit-level: clock period, in ps – Architecture-level: instructions per cycle, IPC – Combined: instructions per second, IPS; alternatively, operations per second, OPS • Which chips affected? – All chips (including defect-free ones), vs.
– Only chips with the defect 26
Key ideas • Enable sale of some defective chips that would have been thrown away without our method(s) • Explore all layers to identify method(s) that – Guarantee correctness of user results – Minimize performance penalty for all chips • Preferably zero performance penalty for non-defective chips – Minimize area overheads • Typically by exploiting features that exist in modern microprocessors 27
Case study-1: Branch predictor • A commonly used module in all types of microprocessors – Instead of waiting to find out which direction a branch instruction will take, predict the branch and continue with pre-fetch, decode, … • Since prediction is not guaranteed to be correct 100% of the time, any design that uses a branch predictor also includes – A module that detects every mis-prediction – A module that helps recover from a mis-prediction by flushing all incorrect values and continuing along the correct branch • Correctness of any program output is guaranteed even if the branch predictor module is faulty – The above modules identify and recover from any mis-prediction caused by the fault • Hence, a fault in the branch predictor only causes performance degradation; correctness remains guaranteed 28
Estimating performance degradation • A gate-level design of a gshare branch predictor implemented in the SimpleScalar processor simulator • Single stuck-at faults on each line studied for benchmark programs
[Figure: gshare predictor — the branch instruction address is XORed (n 2-input XOR gates) with the branch history table (BHT, an n-bit shift register); an n-to-2^n decoder selects one of the 2^n 2-bit FSMs in the pattern history table (PHT), which produces the taken/not-taken prediction] 29
Degradation in branch-prediction accuracy (min./max. % degradation per module and line type, for twolf and gzip under SA0 and SA1 faults):
Decoder — line type A: 72 faults (0.0002%); D: 1,048,576 (3.63%); E: 524,288 (1.82%)
FSMs — MSB: 13,634,188 (47.24%); LSB: 13,634,188 (47.24%)
Branch Inst. Addr.: 36 (0.0001%); BHT: 36 (0.0001%)
XOR — input: 36 (0.0001%); output: 36 (0.0001%)
P: 2 (0.000006%); A: 2 (0.000006%)
• 97.27% of the faults induce no measurable degradation • 1.82% result in 0–1% degradation • 0.91% induce degradation larger than 1% • Faults inducing less than 1% loss in prediction accuracy result in less than 2% performance degradation 30
Summary for branch prediction • Branch predictors significantly improve performance of architectures • An overwhelming percentage of faults in branch predictors are acceptable • The corresponding faulty versions of the branch predictor provide most of the above performance improvement while still guaranteeing correctness of results • Hence, chips with such faults can be used without any reconfiguration 31
A new type of redundancy • Combinationally redundant fault: CRF – Does not change the IO behavior of a combinational block of logic • Sequentially redundant fault: SRF – Does not change the IO behavior of a sequential logic block • Faults in branch prediction change the IO behavior of the entire processor chip – Yet, at the architecture level, program outputs are guaranteed to be correct – A new category: architecturally redundant faults (ARF) 32
A new type of redundancy [Figure: containment of fault classes — CRF, SRF, ARF, PA-ARF] 33
Case study-2: LLC on chip • LLC (last-level cache on-chip) is the largest individual module (~MB scale) • Capacity is expected to grow in the future multicore
microprocessors • Implemented with densely packed, small devices • Susceptible to process variation and defects • Existing hardware-based approaches – Beyond certain levels, spare rows and columns cause high overheads – Require circuit modification and may be under- or over-designed 34
OS-level LLC defect tolerance [Figure: SRAM cache and DRAM memory — a page-frame in main memory maps to a page-cover, a region of the cache; Ns: number of sets, Nw: number of ways, blocks of SB bytes] • Page-cover: a set of contiguous cache locations that are mapped by one page-frame in main memory • The OS controls the allocation/de-allocation of page frames • Page-cover disabling (PCD): the OS blocks all page-frames that map to faulty page-covers 35
• In PCD, defective page-covers are not used during system operation Defective LLC • LLC: 3MB, 12-way, 64-byte blocks; 48 subarrays, each with one spare row and two spare columns • Page-cover: 64 consecutive cache sets – Total of 64 page-covers in the 3MB cache case study [Figure: cumulative percentage of chips vs. number of defective page-covers after spare repair, for a given bit-cell failure rate; 77% of chips have at most one defective page-cover] • Most defective caches have limited numbers of defective page-covers • Capacity loss in cache and memory is small 36
Efficiency: EGOPS-per-area [Figure: EGOPS-per-area vs. area for designs (N_SA, N_SR, N_SC) including (12,1,1), (48,1,1), (48,1,2), and (96,1,3), at two bit-cell failure rates; the optimum of "spares + PCD" exceeds the optimum of "spares only" by 1.5X to 2.5X] • Enhanced CACTI 32nm implementation – Modified Linux kernel 2.6.32 – Dual-core processor with shared 12-way 3MB LLC • Benchmark: SPEC CPU2000 • High efficiency gain over the traditional approach • High flexibility compared to hardware-based approaches 37
Case study-3: Computational units • Integer adders in ALUs • Floating point units (FPUs) 38
ALU-disabling and adder error-masking • Exploit existing features in modern processors – Multiple ALUs and sophisticated instruction scheduling • ALU-disabling – Prevent the instruction scheduler from issuing instructions to a defective ALU • Adder error-masking – Re-issue to a defect-free ALU any add instruction that was issued to an ALU with a defective adder 39
[Figure: ALU-approach normalized efficiency vs. ALU area (5% and 15% of two ALUs) for ammp, equake, apsi, applu, swim, art, bzip2, gcc, gzip, mcf, vortex, and mgrid, at Y_unenc = 0.3, 0.5, 0.7] • Efficiency: total-performance-per-unit-fabrication-cost – Normalized to the unenhanced processor's efficiency • With negligible area overhead, our approach enables the use of defective chips that would otherwise be discarded 40
ALU approach breakdown [Figure: efficiency breakdown by tier vs. ALU area at Y_unenc = 0.3, 0.5, 0.7] • Tier 1: defect-free • Tier 2: one defective ALU, defects not in the adder of the defective ALU • Tier 3: one defective adder 41
Floating point multiplication emulation [Figure: normalized efficiency vs. FPU area % (10%, 15%, 20%) for ammp, equake, apsi, swim, art, and the INT benchmarks bzip2, gcc, gzip, mcf, vortex, parser, at Y_unenc from 30% to 70%] • No area overheads; efficiency always larger than 1 • More than 10% improvement in some benchmarks when Y_unenc is low • 2%–5% improvement in most cases when Y_unenc is in the higher
ranges 42 How widely applicable? • Can we develop such uniquely efficient approaches for many modules? 43 An example processor Pentium III, Katmai * Hsien-Hsin Lee, Georgia Tech Advanced Computer Architecture Lecture, users.ece.gatech.edu/~leehs/ECE6100/slides/Lec12-p6_P4.ppt, 2001 44 Module area Pentium III [3] Caching modules ~46% Datapath modules ~16% Queuing and speculation modules ~21% AMD Bulldozer [1][2] No on-chip LLC 0% 2MB on-chip LLC 41.20% Fetch+L1i cache 11.14% L1i cache 2.30% L1d cache 6.83% L1d cache 2.15% dTLB 1.80% FPU 4.72% FPU 10.47% ALU (int. cluster) 4.37% ALU (int. cluster) 5.55% L/SU 2.75% L/SU 9.09% BTB 2.17% Branch predictor 5.36% Decoder 7.88% Decoder 5.31% Inst. scheduler & Qs 6.49% Inst. scheduler & Qs 5.30% Fetch 2.15% Physical register 1.91% Rename 1.58% Rename 3.67% • With small or zero overheads, defects in these modules can be tolerated without producing errors [1] [2] [3] Bulldozer x86-64 Core,’’ IEEE Int’l Solid State Circuits Conf., IEEE Press, 2011, pp. 
80-81. T. Fischer et al., "Design Solutions for the Bulldozer 32-nm SOI 2-Core Processor Module in an 8-Core CPU," IEEE Int'l Solid State Circuits Conf., IEEE Press, 2011. Hsien-Hsin Lee, Georgia Tech Advanced Computer Architecture Lecture, users.ece.gatech.edu/~leehs/ECE6100/slides/Lec12-p6_P4.ppt, 2001 45
Practical issues • Design changes – In some cases, only the software/firmware – In other cases, low-overhead changes to hardware • Test and diagnosis – In many cases, need to identify failing modules/locations • Information flow and reconfiguration – In some cases, failure information must be propagated to system assembly, software configuration, … 46
Error tolerance 47
Introduction • Concept – The circuit produces some errors at its outputs – Yet, the application using the circuit produces acceptable results • Applications – Video, images, graphics, audio, communications, error-correcting codes for wireless communications • Discrete Cosine Transform (DCT) module in JPEG encoders – More than 50% of single stuck-at faults (SAFs) are acceptable for a typical architecture • Motion Estimation (ME) module in MPEG encoders – More than 50% of single SAFs are acceptable for a typical architecture 48
A case study • Graphics processors – Side-by-side comparison of • A fault-free chip • Faulty ones from the reject bin – Comparison for several • Graphics demos • Graphics sequences designed to exercise different capabilities • About 40 minutes of game play – About 20% of faulty chips had output errors that were imperceptible to users 49
[Figure: error case with Max(MSE) > 0.1 but acceptable perception — Class 0, chip no. 54, Max(MSE) = 9.177, error is a very tiny dot (Type II)] 50–52
Our other case studies • JPEG encoder • MPEG encoder • Digital answering machine • Decoder for turbo-codes for wireless communication • In all cases: a large percentage of hard faults in the largest modules had negligible impact on user experience 53
Metrics: Circuit level • Error significance (ES): Maximum numerical magnitude by which the values
at outputs of an erroneous circuit deviate from the error-free response • Error rate (ER): Percentage of vectors for which values at outputs deviate from the error-free response – Error rate = (number of vectors that cause an error) / (total number of input vectors applied) 54
Metrics: Circuit to system level [Figure: JPEG pipeline — original image → DCT → quantization → Huffman encoding → encoded JPEG → Huffman decoding → dequantization → IDCT → retrieved image; ES and ER are measured at the faulty DCT module] • ES and ER for a fault are calculated separately for a module – In the above example, we inject faults in the DCT module of a JPEG system • Quality of the image is decided using application-level metrics – PSNR, BER, MSE, … – using a complete system that contains the faulty module – Each fault is marked either FNA (fault non-acceptable) or FA (fault acceptable) 55
How to exploit ET • During testing – Classical design – Classify faults as acceptable and non-acceptable – Develop new ET test techniques that • Detect every unacceptable fault • But do not detect any acceptable fault – Theoretical results • Proved that perfect ET test sets always exist – Practical results • Developed test generators and demonstrated that ET test costs are comparable to normal test costs 56
One approach to exploit ET • Normal design; new analysis and test – Classical flow: fault-free → sell; faulty → throw away – ET flow: fault-free → sell; acceptable faults → sell; unacceptable faults → throw away 57
How to exploit ET • Acceptable faults are indications of over-specification of the functionality • Exploit these during design – Arithmetic circuits • Re-optimization of classic architectures for acceptable approximations – General circuits • A synthesis approach to exploit allowable approximate versions of functions • An approach to re-design a circuit to exploit allowable approximations 58
Results: Approximate redesign (% area reduction at a given ER):
C432 — 17.6% at ER = 1%, 21.8% at ER = 5%
C880 — 6.5% at ER = 1%, 9.2% at ER = 5%
C1908 — 17.0% at ER = 1%, 20.2% at ER = 5% 59
Results: Approximate synthesis [Figure: % reduction in literals (0–30%) vs. error rate (0.000–0.030) for rd73(7/3/903), clip(9/5/793), sao2(10/4/496), 5xp1(7/10/347), Z9sym(9/1/610), sym10(10/1/1470), t481(16/1/5233); legend gives # of inputs / # of outputs / # of literals in the original function] • For each circuit, we performed synthesis for various ER values 60
Results: Approximate adder designs [Figure: yield vs. RS threshold for RCA (MAD=1), CLA (MAD=0.24), CSA(v) (MAD=0.34), CSA(f) (MAD=0.43), KSA (MAD=0.14)] 61
Explicit redundancy: Using spare copies to improve yield-per-area
Objective • To derive fundamental properties – strengths and limitations – of the use of spares for improving yield-per-area (YPA) – Focus on use of spares for random logic (RL) modules, where two modules cannot share spare copies – Focus on functional yield, i.e., correct operation at any speed – Consider all possible ways in which spares can be used 63
Random logic vs. regular memory/logic 64
Objective • To answer questions like – What is the maximum value of original chip yield for which spares can improve YPA? – What is the maximum YPA improvement that can be obtained? – What rules govern the optimal use of spares? • What amount of spares provides maximum YPA improvement? • At what granularity must spares be used to obtain maximum YPA improvement? • How should a given budget for spares be used to obtain maximum YPA improvement?
• Particularly interested in the future, where, without the use of redundancy, yield → 0 65
Theory: Simplifying assumptions • Given system – Decomposed into modules m1, m2, …, mk of areas a1, a2, …, ak – At arbitrary granularity, i.e., we allow • The given circuit to be partitioned in any desired manner – In particular, we often consider equal-area modules • At arbitrary levels of granularity, including k → ∞ • Ideal reconfiguration interconnects, i.e., interconnects required to use spare copies have – Negligible area – Negligible probability of being defective – Negligible delay and power impact • Spare modules not idealized, i.e., every spare module has area, defects, delays, … • Defect and yield model – Given defect density – No clustering 66
Yield models for module mi with area ai • Poisson model: Y_mi = e^(−D₀·ai·k) • Negative binomial model: Y_mi = (1 + D₀·ai·k/β)^(−β), where D₀ is the defect density, k the kill ratio, and β the clustering factor 67
Notation [Figure: (a) the original system of modules m1, m2, …, mk, and configurations with spare copies] 68
Some ways of using spares [Figure: spare copies applied to a fraction α of the system] 69
More ways of using spares 70
Baseline: The original system (no spares) • Consider the original system (no spares) – A_orig = 1 • Area for all cases normalized with respect to A_orig – Y_orig = Y • Where 0 < Y ≤ 1 – YPA_orig = Y_orig/A_orig = Y 71
n-way replication of the entire system – A_new = n – Y_new = 1 − (1 − Y)^n • Where 0 < Y ≤ 1 – YPA_new = (1 − (1 − Y)^n)/n – In terms of yield-per-area, the gain of using spares is • G = YPA_new/YPA_orig = (1 − (1 − Y)^n)/(n·Y) 72
n-way replication of the entire system – YPA gain, G = (1 − (1 − Y)^n)/(n·Y) • Where 0 < Y ≤ 1 – Properties • G ≤ 1 for every n and every Y, i.e., replicating the entire system never improves YPA • As Y → 0, G → 1 73
Duplication at finer granularities • Partition the system into k equal-area modules and duplicate each: A_new = 2, and the yield of each module copy is Y^(1/k) • Y_new = (1 − (1 − Y^(1/k))²)^k – As k → ∞, Y_new → 1 • YPA gain, G = (1 − (1 − Y^(1/k))²)^k / (2·Y) – As Y → 0, G → 2^(k−1) – Extremely useful, if we can find a way to create fine-grain duplication without overheads of reconfiguration interconnects 74
When any way of using spares cannot help • A completely general scenario for spares for RL – Spares used for fraction α (by area) of the original system, where 0 ≤ α ≤ 1 – Without spares, yield for this fraction of
system: Y^α (by the Poisson model, a fraction α of the area yields Y^α) – This fraction of the system may be partitioned into • Any number of modules, and • Each module may have arbitrary size – At least one spare copy used for every module in this fraction of the system – Yield for the remainder of the system: Y^(1−α) 75
When any way of using spares cannot help • Since at least one spare is used for fraction α of the original system – A_new ≥ (1 + α) • Since the area of spares ≥ α – Y_new ≤ Y^(1−α) • Since the yield for the fraction with spares ≤ 1 – Hence, YPA_new ≤ Y^(1−α)/(1 + α) 76
When any way of using spares cannot help • Hence, the gain of using spares is – G ≤ Y^(1−α) / ((1 + α)·Y) = 1 / ((1 + α)·Y^α) – where • 0 < Y ≤ 1, and • 0 ≤ α ≤ 1 77
When any way of using spares cannot help • For the use of spares to be effective, we need – G > 1 – That is, we need (1 + α)·Y^α < 1 – For spares to be effective, the original yield, Y, must be less than or equal to (1 + α)^(−1/α) 78
When any way of using spares cannot help – Hence, for spares to be effective with duplication at granularity k, the original system yield, Y, must satisfy:
k | Maximum value of Y for which G ≥ 1
2 | 0.34
3 | 0.41
4 | 0.43
5 | 0.45
6 | 0.46
7 | 0.46
∞ | 0.50
– Duplication at practical granularities provides YPA gain for values of Y close to a universal bound!
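The break-even yields in the table above can be reproduced numerically. Below is a minimal sketch assuming, as the slides do, the Poisson yield model, k equal-area partitions each duplicated once, and ideal reconfiguration interconnects; the function names are ours. Setting G = 1 for duplication at granularity k gives the closed form Y_k = (2 − 2^(1/k))^k.

```python
def breakeven_yield(k):
    """Largest original yield Y for which duplicating each of k
    equal-area modules still improves yield-per-area (G >= 1),
    assuming the Poisson yield model and ideal reconfiguration
    interconnects.  Closed form: Y_k = (2 - 2**(1/k))**k."""
    return (2 - 2 ** (1 / k)) ** k

def ypa_gain(Y, k):
    """YPA gain G of duplication at granularity k: the duplicated
    design has area 2 and yield (1 - (1 - Y**(1/k))**2)**k."""
    y = Y ** (1 / k)            # yield of one module copy
    y_new = (1 - (1 - y) ** 2) ** k
    return (y_new / 2) / Y      # (Y_new/A_new) / (Y_orig/A_orig)

for k in (2, 3, 4, 5, 6, 7, 1000):
    print(k, round(breakeven_yield(k), 2))
```

Rounding to two digits reproduces the table (0.34, 0.41, 0.43, 0.45, 0.46, 0.46, and ≈0.50 for very large k), and `ypa_gain` confirms that a design with Y below the break-even value gains from duplication while one above it loses.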
79
When any way of using spares cannot help • For Y ≤ 1/2 – The bound on G improves monotonically with α – This suggests that the optimal form of partial duplication is full duplication (α = 1) 80
Summary of theoretical results • Necessary condition for spares to improve YPA – The original yield, Y, must be less than 0.5 • In the future, i.e., as Y decreases (Y → 0) – YPA gain provided by spares increases – YPA gain → 2^(k−1), if the system is partitioned into k equal-area modules (ignoring interconnect overheads) • Partitioning and granularity – YPA gain increases with k (ignoring interconnect overheads) – As k → ∞, duplication is all that is needed to achieve perfect yield (ignoring interconnect overheads) – Equal-area partitions provide maximum YPA gain 81
To reality: Algorithms for design • Given – Scalable switches, forks, and joins, along with their associated yields and areas – An upper bound q_i* on the number of spares for each module m_i • Design: Optimize yield/area by – Identifying an optimal set of modules m_i, where i = 1, 2, …, k, and associated y_i and a_i – Identifying the best number of spares for each module – Identifying locations where steering logic should be inserted among the modules 82
Algorithm for inserting redundancy to maximize YPA • Example: six-module pipeline with module yields m1: 0.97, m2: 0.75, m3: 0.84, m4: 0.45, m5: 0.96, m6: 0.98; Y ≈ 0.26 [Figure: redundant pipeline — low-yield modules replicated, with forks (F), switches (S), and joins (J) inserted] Y_new ≈ 0.87 • Various degrees of replication and concatenation • Forks, joins, and switches of appropriate bus width inserted 83
After testing and configuration [Figure: configured pipeline — good units shown in blue, working transmission shown solid; unused modules and switch segments are powered down] 84
Case study: OpenSPARC T2 • 8 identical cores
Area of each core is around 12mm² • Area of other modules is 246mm² • Without loss of generality we set Y_OM = 1 • G = ((Y/A)_New − (Y/A)_Orig) / (Y/A)_Orig × 100 85
Results: Modules after CLB partitioning • Search space reduced from 188,887 gates and FFs to 1,808 CLBs, i.e., 104 times fewer • CLB sizes vary over a wide range:
Size (# of gates) | # of CLBs | Percentage of area in the core
Size(C) > 10000 | 6 | 59.4%
1000 < Size(C) ≤ 10000 | 11 | 30.5%
100 < Size(C) ≤ 1000 | 25 | 4.78%
Size(C) ≤ 100 | 1766 | 5.32%
Total | 1808 | 100% 86
Results: Modules after clustering small CLBs [Figure: small CLBs grouped into clusters that share spare FFs] • Optimization results after CLB clustering:
| Before optimization | After optimization
Number of CLBs | 1808 | 58
Number of control signals | 1808 | 58
Number of spare FFs | 1830 | 131
Area overhead (FFs) | 2.09% | 0.15% 87
Key characteristics of algorithms • Consider regular as well as irregular logic circuit structures • Allow each module its own pool of spares • Consider all interconnect and reconfiguration overheads • Consider partial redundancy • Improve yield/area significantly, especially at early stages of a new process – Improve time to market for a new process 88
Partitioning dominant CLBs • From our theoretical results [Figure: a dominant CLB C partitioned into C1 and C2, each with extra circuitry for testing] 89
Partitioning dominant CLBs [Figure: design flow — synthesis of the original circuit netlist → CLB generation → CLB clustering (using the area of each gate and FF) → hypergraph generation → partitioning using hMETIS → adding redundancy; iterate while more partitions are needed, then exit with the golden Y/A redundant configuration] 90
Experimental results • Dorig is the original design without redundancy.
• DRed-Core is a redundant design that uses redundancy at the core level (the traditional technique) • DRed-func-modules is a redundant design that uses the original functional modules of the design for redundancy • DRed-CLB+Reg is similar to ours but simply replicates registers • DRed-CLB is the redundant design generated by our design flow without extra partitioning • DRed-Golden is the golden redundant design generated by our design flow 91
Experimental results • Finding 1: For emerging technologies with low yield, the original design without redundancy has a very low Y/A, which leads to a long time to market 92
Experimental results • Finding 2: DRed-Core has the worst yield/area as defect density increases; we need redundancy at finer granularities than the traditional core level for multi-core chips – DRed-CLB has 1.1 to 13.3 times better Y/A than DRed-Core, depending on defect density 93
Experimental results • Finding 3: DRed-func-modules, with 13 unbalanced partitions, underperforms designs with more partitions; this gap increases with defect density 94
Experimental results • Finding 4: FF sharing led to significant Y/A improvement – DRed-CLB has up to 45% better Y/A compared to DRed-CLB+Reg 95
Experimental results • DRed-CLB outperforms all other designs but the golden one.
Its gain is close to the upper bound gain Finding 5 We may not need extra partitioning for yield/area improvement 96 Explicit redundancy configurations for highly-regular architectures Objectives • Show that regularity significantly improves effectiveness of spares • To exploit the benefits and overheads of the use of spares in the context of highly-regular architectures – Focus on use of spares for identical copies of functional modules, where spare copies are shared – Consider all possible ways in which spares used – Level of granularity to add spares – Scope of sharing of spares – Number of spares 98 Proposed redundancy designs • K-way spare core sharing (KSCS) β« Example: Inter-processor spares Scope of sharing 16 8 4 2 1 Number 3 4 4 8 0 Case 1: inter-processor spares Case 2: intra-processor and inter-processor spares Intra-processor spares 99 Impact of scope of sharing • Comparison of different spare configurations 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Yield/area Area 7 6 0.88 5 0.64 4 3.00 3 1.36 2.39 2 1 1.25 0 Configuration IDs from Table 2 Area Yield and Yield/area Yield Config. Number of spare cores of different scope of sharing IDs #1 0 0 0 0 96 0 0 #2 0 0 0 32 64 0 0 #3 0 0 0 64 32 0 0 #4 0 0 0 96 0 0 0 #5 0 0 16 80 0 0 0 #6 0 0 32 64 0 0 0 #7 0 0 48 48 0 0 0 #8 0 0 64 32 0 0 0 #9 0 0 80 16 0 0 0 #10 0 0 96 0 0 0 0 #11 0 16 0 80 0 0 0 #12 0 32 0 64 0 0 0 #13 0 48 0 48 0 0 0 #14 0 64 0 32 0 0 0 #15 0 80 0 16 0 0 0 #16 0 96 0 0 0 0 0 • Config. 
IDs are ordered in such a ways that the scope of sharing increases100 Proposed redundancy designs • Definitions β« Scope of sharing: domain in which any defective module can be replaced by the specific spare copy β« Maximum number of spare cores (MNSC): • Greedy repair process β« Given spares with different scopes, spares with smaller scope is used first, with the maximum remaining MNSC for other cores β« After repair process, the chips with fewer than nominal number of working functional modules are counted as nonworking 101 Proposed redundancy designs • Area overhead 3.1 3.1 De-mux 1 Spare module 3.3 1 2 1 1 3.2 1 1 Area of spare modules 2 Wasted area 3.1 Impact on De -mux height 3.2 Impact on Crossbar 3.3 Impact on De -mux width 3.2 A. Original 8-core B. With one spare core processor and 1-to-2 De-mux C. With four spare cores and 1-to-5 De-mux 102 Experiments CS-P Inter-processor spares Intra-processor and inter-processor spares 1 0.9 0.8 0.84 0.85 0.78 Yield/Area 0.7 0.80 0.75 0.64 0.6 0.71 0.65 0.57 0.65 0.55 0.45 0.5 0.4 0.60 0.49 0.39 0.55 0.44 0.35 0.3 0.2 0.1 0 0.05 0.1 0.15 0.2 0.25 0.3 d CS-P: core-level spares within processors 103 Utilizing flexibilities in CMP binning and e-optimal designs 104 Binning CMPs • Number-of-Processors Binning (NPB) models – We define NPB function as follows, where π and π and the lower and upper bound of number of processors that can be used. 
      f(n) = 0,  if n < l
             n,  if l ≤ n ≤ u
             u,  if n > u
  – NPB Origin: only N, i.e., (l = N, u = N)
  – NPB A: l to N, i.e., (l < N, u = N)
  – NPB B: N or greater, i.e., (l = N, u > N)
  – NPB C: l to max, i.e., (l < N, u > N)
  N denotes the number of processors in the nominal design. 105

Value functions
• We define three value functions based on how the value v(n) of a chip depends on n, the number of (enabled) working processors on the chip
  – Linear: the value of a chip is linear in the number of (enabled) working processors
  – IPC: the value of a chip is evaluated via its instructions per cycle (IPC)
  – Catalog price: a practical value function derived from catalog prices of chips with different numbers of enabled processors
[Chart: normalized values of the Linear, IPC, and Catalog-price value functions vs. number of working processors (11–39)]
106

Utility functions and evaluation metrics
• We define utility functions by composing one NPB function with one value function: u(n) = v(f(n))
• Evaluation metrics:
  – Linear → processors-sold-per-unit-area (PSPA)
  – IPC → instructions-per-cycle-per-area (IPCPA)
  – Catalog price → revenue-per-area (RPA)
107

Optimal redundancy design
• We derive the optimal spare configurations for the different value functions: (a) linear, (b) IPC, and (c) catalog price.
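The NPB, value, and utility functions above can be sketched in Python as follows (a minimal sketch: the function names and the nominal design size N = 40 are illustrative assumptions, and only the linear value function is shown):

```python
def npb(n, l, u):
    """Number-of-Processors Binning (NPB) function f(n):
    chips with fewer than l working processors are discarded (0),
    chips with more than u working processors are capped at u,
    and anything in between is sold with all n processors enabled."""
    if n < l:
        return 0
    if n <= u:
        return n
    return u

def linear_value(m):
    """Linear value function: chip value is proportional to the
    number of enabled working processors."""
    return m

def utility(n, l, u, value=linear_value):
    """Utility u(n) = v(f(n)): compose a value function with an
    NPB function."""
    return value(npb(n, l, u))

# Illustrative nominal design with N = 40 processors (assumed size):
N = 40
print(utility(40, N, N))   # NPB Origin (l = u = N): full chip sells -> 40
print(utility(39, N, N))   # NPB Origin: one dead processor -> chip worth 0
print(utility(36, 32, N))  # NPB A (l < N, u = N): partial chip still sells -> 36
```

Under NPB Origin a single defective processor makes the whole chip worthless; relaxing the bin bounds (NPB A–C) or adding spares is what raises the expected utility per unit area.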
[Tables: optimal spare-processor counts and spare-core counts (4-way, 8-way, 16-way, and 32-way sharing) for NPB Origin/A/B/C, under the no-spares, spare-processors, and spare-cores designs]
[Chart: normalized figure-of-merit of the three designs for each NPB function and value function]

e-optimal redundancy design
• Due to diminishing returns, we always observe scenarios where the benefit increases only slowly with the number of spares
• In such cases, given a margin e within which a close-to-optimal redundancy design is acceptable, we can derive a near-optimal chip design with significantly reduced chip area
[Chart: normalized figure-of-merit (PSPA, IPCPA, RPA) vs. number of spare processors (8–16)]

e-optimal redundancy design
• We derive e-optimal redundancy designs for various utility functions: (a) linear, (b) IPC, and (c) catalog price, with e = 8%.
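The e-optimal selection above can be sketched as a simple rule over a merit-vs.-spares curve (an illustrative sketch: I assume "within margin e" means the figure-of-merit is at least (1 − e) times the optimum, and the merit values below are made up to show diminishing returns):

```python
def e_optimal(merits, e):
    """Given merits[k] = normalized figure-of-merit with k spares and
    a margin e (e.g. 0.08 for 8%), return the smallest number of
    spares whose merit is within a factor (1 - e) of the best.
    With diminishing returns this trades a small merit loss for a
    significantly smaller chip area."""
    best = max(merits)
    for k, m in enumerate(merits):
        if m >= (1 - e) * best:
            return k
    return len(merits) - 1  # never reached: the best entry always qualifies

# Illustrative merit curve with diminishing returns:
merits = [0.40, 0.46, 0.50, 0.53, 0.55, 0.56, 0.56]
print(e_optimal(merits, 0.08))  # 3: merit 0.53 is within 8% of the best (0.56)
```

With e = 0 this reduces to the strictly optimal design (6 spares here); with e = 8% three spares suffice, saving the area of the remaining spares.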
[Tables: e-optimal spare-processor and spare-core designs for NPB Origin/A/B/C]
[Chart: normalized figure-of-merit of the e-optimal designs for each NPB function and value function]

Conclusion
• Three forms of redundancy
  – Implicitly present; no errors in program outputs
  – Implicitly present; some errors in application outputs acceptable
  – Explicitly added; no errors at module outputs
• Implicit forms of redundancy require us to look beyond the circuit level
  – For processors: μArch, ISA, system/software, and user layers
  – For error tolerance: application and user layers
111

Conclusion
• Every form provides an avenue for combating the increasing defect rates and variations of nanoscale technologies
• Exploitation of redundancy becomes increasingly beneficial with increasing defect density
112

Conclusion
• Ordering of methods to exploit redundancy, in terms of how soon they become effective:
• Implicit redundancy in
  – General-purpose processors
  – Error tolerance in medical applications
• Explicit redundancy in
  – Memories (in use)
  – Multi-core/multi-processor chips
  – Random logic
113

Questions?
Sandeep Gupta
Ming Hsieh Department of Electrical Engineering
University of Southern California
sandeep@usc.edu 114