Design Validation and Debugging
Tim Cheng
Department of Electrical & Computer Engineering, UC Santa Barbara
VLSI Design and Education Center (VDEC), Univ. of Tokyo

Harder to Design Robust and Reliable Chips
• First-silicon success rate has been dropping
  – ~30% for complex ASIC/SoC at 0.13 µm (according to an ASIC vendor)
  – Pre-silicon logic bugs have been increasing at 3X-4X per generation for Intel's processors
• Yield has been dropping for volume production, and it takes longer to ramp up
  – IBM's 8-core Cell-Processor chips: ~10-20% yield (July 2006)
• "Better than worst-case" design results in failures without defects
  – Increase in variation of process parameters with scaling
  – Worst-case design is getting far too conservative

In-Field Failures are Common and Costly
• Xbox: 16.4% failure rate
• Additional warranty and refund costs will run Microsoft $1.15B ($86 per $300 item)
• More than financial cost: reputation and market loss
• Non-trivial failure rates in general – 15% on average
http://arstechnica.com/news.ars/post/20080214-xbox-360-failure-rates-worse-than-most-consumer-

Design for Robustness and Reliability
• Systems must be designed to cope with failures
• Efficient silicon debug is becoming a must
  – Need an efficient design validation and debugging methodology
  – Design for debugging will become necessary
• Must have embedded self-test for error detection
  – For both testing in the manufacturing line and in-field testing
  – Both on-line and off-line testing
• Re-configurability and adaptability for error recovery make better sense
  – Using spares to replace defective parts
  – Using redundancy to mask errors
  – Using tuning to compensate for variations

Outline
• Post-Silicon Validation and Debug
• SMT-Based RTL Error Diagnosis [ITC 2008]
• SAT-Based Diagnostic Test Generation [ATS 2007]

Bugs in Silicon
• Manufacturing defects
  – Discovered during manufacturing test (<<1M DPM)
• Functional bugs (AKA logic bugs)
  – Exist in all components
  – ~98% found before tape-out, ~2% post-silicon*
• Circuit bugs (AKA electrical bugs)
  – Not all components exhibit failures
  – Fail in some operating region (voltage, temperature, or frequency)
  – Usually caused by design margin errors, IR drop, crosstalk coupling, L di/dt noise, process variation, ...
  – ~50% found before tape-out, ~50% post-silicon*
* Source: Intel

Validation Domain Characteristics
• Pre-silicon validation
  – Cycle-accurate simulation
  – F_SIM << F_PROD: cycle poor
  – Any signal visible (i.e., white box): debugging is straightforward
  – Limited platform-level interaction
• Post-silicon validation
  – Tests run at F_PROD: cycle rich
  – Component tested in platform configuration
  – Only package pins visible: difficult debug

Post-Si History and Trends
• Functional bugs relatively constant
  – Correlate well with design complexity (amount of new and changed RTL)
  – Late specification changes are contributors
• Circuit and analog bugs growing over time
  – I/O circuit complexity increasing sharply
  – Speedpaths (limiting the F_MAX of a component) dominate CPU-core circuit issues

Post-Si Debug Challenges
• Trend is toward lower observability
  – Integration increasing toward SoC
• Functional and circuit issues require different solutions
• On average, circuit bugs take 3x as much time to root-cause as functional bugs
  – Bugs are found on platforms but are debugged on debug-enabled automatic test equipment (ATE)
  – Often need multiple iterations to reproduce a failure on the tester
  – Often long latency between a circuit issue and its syndrome
Pre-Si Verification vs. Post-Si Debugging
[Flow diagram: Specification → RTL Description → Logic Netlist → Physical Design. Pre-silicon functional debugging inserts corrections at the RTL level; silicon debugging & fault diagnosis deals with faults/errors inserted at the physical level.]

Automated Debugging/Diagnosis
A failed verification/test step is followed by debugging/diagnosis:
[Flow diagram: a testbench or test vectors are applied to the design or silicon under verification/testing; on FAIL, automated debugging/diagnosis produces counterexamples/diagnostic patterns.]

Leveraging Pre-Si Verification & Manufacturing Test Efforts for Post-Si Validation
[Diagram: pre-silicon verification treats the specification and RTL description as a white box; post-silicon validation treats the design as a black box, with a lack of error-propagation analysis/metrics; manufacturing test treats the physical design as a black box, with models at a very low level of abstraction.]

Outline
• Post-Silicon Validation and Debug
• SMT-Based RTL Error Diagnosis [ITC 2008]
• SAT-Based Diagnostic Test Generation [ATS 2007]

SAT-Based Diagnosis
Given an erroneous design and its failing tests:
1. Replicate the circuit for each test
2. Add additional circuitry into the circuit model
3. Add input/output constraints
SAT assignment(s) → fault location(s)!

SAT-Based Diagnosis - Example
• Stuck-at-1 fault on line l1
• Input vector v = (0, 0, 1) detects 1/0 at y
[Circuit diagram: inputs x1 = 0, x2 = 0, x3 = 1; faulty line l1 carries 0/1; output y carries 1/0.]
Courtesy: A. Veneris

SAT-Based Diagnosis – Example (Cont'd)
1. Insert a MUX at each error candidate location
2. Apply input/output vector constraints
[Circuit diagram: a MUX with select s1 and free input w1 is inserted at l1; the constraints x1 = 0, x2 = 0, x3 = 1, y = 0 are imposed.]
Courtesy: A. Veneris

SAT-Based Diagnosis – Multiple Diagnostic Tests
[Circuit diagram: three replicated copies of the MUX-inserted circuit, one per diagnostic test, sharing the select line s1 and free inputs w1, w2, w3; each copy is constrained by its own input vector and expected output.]
Courtesy: A. Veneris

RTL Design Error Diagnosis
• Using Boolean SAT solvers for RTL design error diagnosis is not efficient
  – The translation to Boolean is expensive
  – High-level information is discarded
We propose an SMT-based, automated method for RTL-level design error diagnosis.

Satisfiability Modulo Theories (SMT) Solvers
• Target combined decision procedures (CDP)
• Integrate a Boolean-level approach with higher-level decision procedures, such as ILP
• SHIVA-UIF: an SMT solver developed for RTL circuits
  – Boolean theory
  – Bit-vector theory
  – Equality theory
→ Together these make it a good candidate as the satisfiability engine for hardware designs

RTL Design Error Diagnosis Utilizing SHIVA-UIF
• Extends the main idea of the Boolean-SAT-based diagnosis approach to the word level
  – MUXs are added to word-level signals
[Flow chart: given failing patterns and error candidates, add MUXs to the design and impose each test as constraints, then run SMT. On SAT, add the identified candidate to the possible-candidate list and add constraints to avoid the same solution; on UNSAT, remove the remaining candidates, yielding a reduced candidate list.]

Initialization Steps
• Simple effect-cause analysis is used to limit the potential candidates
• A MUX is inserted at each potential erroneous signal
[Circuit diagram: word-level circuit with inputs X1, X2, X3, an adder and a comparator; a MUX with select S and free input W is inserted at candidate signal L, driving output Y.]

Could Directly Modify HDL Code (at Potential Erroneous Statements)

module full_adder_imp (a1, a2, c_in, s, c_out);
  input a1, a2, c_in;
  output s, c_out;
  wire temp;
  assign s = a1 ^ a2 ^ c_in;
  assign temp = (a1 & a2) | (a1 & c_in);
  assign c_out = temp | (a2 & c_in);
endmodule

module full_adder_muxed (a1, a2, free1, free2, free3, s1, s2, s3, c_in, s, c_out);
  input a1, a2, c_in;
  input free1, free2, free3;
  input s1, s2, s3;
  output s, c_out;
  wire temp, temp_mux, s_mux, c_out_mux;
  assign s_mux = a1 ^ a2 ^ c_in;
  assign s = s1 ? s_mux : free1;
  assign temp_mux = (a1 & a2) | (a1 & c_in);
  assign temp = s2 ? temp_mux : free2;
  assign c_out_mux = temp | (a2 & c_in);
  assign c_out = s3 ? c_out_mux : free3;
endmodule
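To see how this MUX-inserted code would be driven by a failing test, consider the following minimal sketch. It assumes the erroneous implementation misbehaved on the vector shown; the harness module full_adder_diag, the chosen input vector, and the match flag are all illustrative, not from the slides.

// Hypothetical harness: ties the MUX-inserted adder to one failing
// test (a1 = 1, a2 = 1, c_in = 0) and its expected response
// (s = 0, c_out = 1), leaving the select and free inputs open
// for the solver to assign.
module full_adder_diag (free1, free2, free3, s1, s2, s3, match);
  input free1, free2, free3;
  input s1, s2, s3;
  output match;
  wire s, c_out;

  full_adder_muxed dut (
    .a1(1'b1), .a2(1'b1), .c_in(1'b0),   // failing input vector
    .free1(free1), .free2(free2), .free3(free3),
    .s1(s1), .s2(s2), .s3(s3),
    .s(s), .c_out(c_out));

  // Expected-response constraint: the SAT/SMT back end is asked for
  // select/free assignments making match = 1; a solution with si = 0
  // flags statement i as a repair candidate.
  assign match = (s == 1'b0) & (c_out == 1'b1);
endmodule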
Inserting Constraints w.r.t. Failing Test and Expected Response
• Add constraints corresponding to a failing test and its expected response to the MUX-inserted circuit/code
• Example: for the failing inputs X1 = 3, X2 = 3 with expected response Y = 5, the constraint is ((S ? W : (3 + 3)) = 5), which is satisfiable with S = 1, W = 5

Experimental Results

Design | No. of word-level elements | No. of patterns | No. of initial candidates* | No. of final candidates
B03 | 108 | 4 | 72 | 6
B04 | 108 | 5 | 72 | 9
B05 | 9700 | 28 | 12949 | 5
C5 | 115 | 20 | 211 | 13
C10 | 230 | 18 | 561 | 9
C12 | 420 | 12 | 579 | 100
C15 | 345 | 13 | 911 | 25
C16 | 540 | 9 | 595 | 7
C17 | 720 | 8 | 815 | 28
C18 | 1800 | 28 | 2135 | 10
C30 | 2910 | 26 | 3499 | 87

• 11 example circuits (IWLS 2005 benchmarks)
• An error is randomly injected into each circuit
• * After applying simple effect-cause analysis

Experimental Results
[Plots: number of remaining candidates vs. number of failing patterns imposed (1-19) for circuits B03, C15, C5, and C10.]
• 4 sample circuits, each with 1000 random errors
• Average/maximum/minimum number of remaining candidates

Experimental Results – Effect of Applying More Failing Tests
• Average over 4 sample circuits, each with 1000 random errors

Range of failing test indexes | # of erroneous ckt instances in which # of candidates reduced (out of 1000) | Average reduction in size of candidate list (in %)
5 to 200 | 588 | 1.74%
10 to 200 | 418 | 1.16%
20 to 200 | 318 | 0.97%
50 to 200 | 177 | 0.73%
100 to 200 | 102 | 0.62%

Disadvantage of Model-Free Diagnosis
[Diagram: erroneous design vs. golden model, with MUXs (selects S1-S5, free inputs W1-W5) inserted at the candidate signals, including the real error location L.]
• Some errors are indistinguishable from each other
• Example: L is the real error location, but the solver can find satisfying values for all initial error candidates

Advantages of SMT-Based RTL Design Error Diagnosis
• The learned information can be reused
• Candidates are identified in easy-to-difficult order, implicitly by the solver
  – The solver tends to set the MUXs of easy-to-diagnose candidates first, and
  – By the time difficult candidates are checked, the accumulated learned clauses help reduce complexity
• Running All-SAT on this model results in:
  – Eliminating a group of candidates without explicitly targeting them one at a time

Outline
• Post-Silicon Validation and Debug
• SMT-Based RTL Error Diagnosis [ITC 2008]
• SAT-Based Diagnostic Test Generation [ATS 2007]

Diagnostic Test Pattern Generation (DTPG)
• Generates tests that distinguish fault types or locations
• One of the most computationally intensive problems
• Most existing methods are based on modified conventional ATPG or sequential ATPG
• Very complex and tedious implementation

Traditional SAT-Based DTPG
• Use a miter-like model to transform DTPG into a SAT problem
[Diagram: two faulty copies of the circuit, one with fault f1 and one with fault f2, share the primary inputs; their outputs feed a comparator, and the objective M = 1 is solved. SAT → distinguishable; UNSAT → indistinguishable.]

SAT-Based DTPG
• Limitations:
  – Need to build a miter circuit for each fault pair
  – Cannot share learned information between different fault pairs
• Objectives: reduce the number of miter circuits and the computational cost of each DTPG run by reusing learned information from previous runs
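As a minimal, self-contained illustration of the miter-like model, the sketch below builds a miter for one fault pair; the two-gate circuit, the fault sites, and the module name dtpg_miter are invented for this example, not taken from the slides.

// Fault-free function: y = (a & b) | c
module dtpg_miter (a, b, c, M);
  input a, b, c;
  output M;

  // Copy A: internal line w stuck-at-0 (fault f1)
  wire wA = 1'b0;        // fault site replaces w = a & b
  wire yA = wA | c;

  // Copy B: input branch c stuck-at-0 (fault f2)
  wire wB = a & b;
  wire yB = wB | 1'b0;   // fault site replaces the c branch

  // Any assignment achieving M = 1 is a test distinguishing f1
  // from f2 (e.g., a = b = 1, c = 0 gives yA = 0, yB = 1);
  // if M = 1 is UNSAT, the two faults are indistinguishable.
  assign M = yA ^ yB;
endmodule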
DTPG Model for Injecting Multiple Fault Pairs
• Inject the same set of N = 2^n to-be-differentiated faults into each of the two circuits in the miter
• Add an n-to-2^n decoder in each circuit to activate exactly one fault at a time
• The extra sets of primary inputs to the decoders, PI1 and PI2, are extra primary inputs
• Solve the objective M = 1
[Diagram: two faulty circuit copies sharing PI; the select lines sel1..selN and sel'1..sel'N are driven by n-to-2^n decoders fed by PI1 and PI2; the outputs feed the comparator M. Example: the solution Vi differentiates f1 and f6!]

DTPG Procedure Using the Proposed Model
[Flow chart: given a list of fault candidates, build the DTPG model, simplify the circuit, and solve M = 1; on SAT, a diagnostic pattern is found and a SAT constraint is added before re-solving; on UNSAT, the procedure ends.]
• For a SAT solution, the values assigned at PI1 and PI2 represent the indices of the activated fault pair; the values assigned at PI form a diagnostic test
• After a diagnostic test for fault pair (fi, fj) is found, add a blocking clause to avoid generating another test for the same pair
• Upon UNSAT, all remaining fault pairs are indistinguishable

Main Advantages of the DTPG Model
• The learned information can be reused
• The order of target fault-pair selection is automatically determined by SAT solving
  – Easy-to-distinguish fault pairs are implicitly targeted first
• Running All-SAT on this miter model can:
  – Find diagnostic patterns for all pairs of faults
  – Naturally perform diagnostic pattern compaction
• Identify a group of indistinguishable fault pairs without explicitly targeting them one at a time

Finding More Compact Diagnostic Tests
[Diagram: the same miter model with don't cares (x0, x1) at the decoder inputs PI1 and PI2; a single pattern Vij then differentiates a group of faults, e.g., {f0, f2} from {f6, f7}.]

DTPG with Compaction Heuristic
• Solve the objective M = 1 using a SAT solver
• Use existing patterns to guide the SAT solving
• Find don't cares at PI1 and PI2 in the newly generated pattern, so that the corresponding pattern differentiates two groups of faults

DTPG for Multiple Faults
• Need m n-to-2^n decoders in each faulty circuit (m is the cardinality of the multiple faults)
• One output from each decoder is connected to an m-input OR gate
• Can inject m or fewer faults
• Combine existing methods before using the proposed DTPG model

DTPG Results

Circuit | #Initial fault pairs | #D/#E/#A | #Diagnostic patterns | CPU (sec)
S5378 | 66 | 63/3/0 | 13 | 0.3
S13207 | 1225 | 1198/27/0 | 28 | 3.9
S15850 | 231 | 204/27/0 | 7 | 3.3
S35932 | 120 | 106/14/0 | 7 | 2.0
S38417 | 351 | 351/0/0 | 8 | 2.9
S38584 | 1225 | 1205/20/0 | 33 | 7.3

• Initial fault pairs: generated by a critical-path-tracing tool
• All fault pairs injected into one miter circuit
• #D—distinguishable, #E—equivalent, #A—aborted

Summary
• SMT-based RTL design error diagnosis
  – An enhanced model injecting single/multiple design errors
  – Enables sharing of the learned information
  – Identifies false candidates without explicitly targeting them
• SAT-based DTPG
  – Uses an enhanced miter model injecting multiple faults
  – Enables sharing of the learned information
  – Identifies undifferentiable faults efficiently
  – Supports diagnosis between mixed, multiple fault types
  – Combines with diagnostic test pattern compaction
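As a closing illustration of the decoder-based DTPG model described above, here is a minimal sketch for n = 1 (two candidate faults per copy) on the same small invented circuit used earlier; the module and signal names are illustrative, not from the slides.

// Fault-free function: y = (a & b) | c
module dtpg_decoder_miter (a, b, c, p1, p2, M);
  input a, b, c;        // shared functional primary inputs (PI)
  input p1, p2;         // extra PIs (PI1, PI2): fault indices, n = 1
  output M;

  // 1-to-2 decoders: exactly one select line per copy is active
  wire selA0 = ~p1, selA1 = p1;
  wire selB0 = ~p2, selB1 = p2;

  // Candidate faults: fault 0 = w stuck-at-0, fault 1 = c stuck-at-0
  wire wA = selA0 ? 1'b0 : (a & b);
  wire cA = selA1 ? 1'b0 : c;
  wire yA = wA | cA;

  wire wB = selB0 ? 1'b0 : (a & b);
  wire cB = selB1 ? 1'b0 : c;
  wire yB = wB | cB;

  // Solve M = 1: (p1, p2) encodes the activated fault pair and
  // (a, b, c) is the diagnostic test; a side constraint p1 != p2
  // (or blocking clauses) prevents activating the same fault twice.
  assign M = yA ^ yB;
endmodule

For example, the assignment p1 = 0, p2 = 1, a = b = 1, c = 0 yields yA = 0 and yB = 1, so M = 1 and the input (1, 1, 0) distinguishes fault 0 from fault 1; blocking clauses over (p1, p2) then steer the solver toward the remaining pairs.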