Online Design Bug Detection: RTL Analysis, Flexible Mechanisms, and Evaluation

Kypros Constantinides, University of Michigan
Onur Mutlu, Microsoft Research & Carnegie Mellon University
Todd Austin, University of Michigan

MICRO-41, November 11th, 2008

Challenges of Correct Microprocessor Design

Design bugs: deviations from the product specifications.

[Figure: design bugs over time, annotated with rates of 1.2 bugs per month and 3.5 bugs per month, and with the labels "Chip-Multiprocessors" and "New Features"]

New features:
• 64-bit extensions
• Virtualization
• Power Management
• SSE3

*Data compiled from Intel product specification update documents.

More bugs appear as more complex and diverse resources are integrated into a single chip.

Why is Online Design Bug Detection Needed?

Cost of design bugs:
- Lower system performance
- System security: attacks exploit HW design bugs
- Lower customer satisfaction
- Financial loss
- Expensive recalls
- Diminishing brand/company reputation

Microprocessor companies rely on ad-hoc techniques that change the software and hardware configuration to work around design bugs.

Online Design Bug Detection and Avoidance

Online Design Bug Detection
- Bug detection mechanism is updated by firmware with new design bugs

Online System Recovery
- Recover system from design bug effects
- Low-overhead periodic checkpoint and recovery
- Existing mechanisms:
  • ReVive + ReViveI/O
  • SafetyNet

Bug Avoidance Techniques
- Avoid the reoccurrence of the design bug
- Existing mechanisms:
  • Scale down to safe-mode
  • Disable buggy part
  • Hypervisor execution guidance

In this work we focus on online design bug detection.

Microprocessor Errata Documents

From the Intel Pentium 4 Specification Update Document: R31.
Interactions between the Instruction Translation Lookaside Buffer (ITLB) and the Instruction Streaming Buffer May Cause Unpredictable Software Behavior

Problem: Complex interactions within the instruction fetch/decode unit may make it possible for the processor to execute instructions from an internal streaming buffer containing stale or incorrect information.

Implication: When this erratum occurs, an incorrect instruction stream may be executed, resulting in unpredictable software behavior.

Limitations:
- Provide only a high-level description of the design bug
- Hard to relate the design bug to the actual hardware implementation

Characterizing RTL Design Bugs

OpenSPARC T1 (Niagara)

[Figure: OpenSPARC core block diagram showing MUL, IFU, MMU, Trap Logic Unit (TLU), EXU, and Load Store Unit (LSU)]

- RTL design bugs in Verilog code
- Fixed and documented in the code

Load Store Unit (LSU): 157 bugs
Trap Logic Unit (TLU): 139 bugs
Total of 296 bugs in SPARC core

Example of RTL design bug in Verilog code: tlu_ctl.v

    assign intrpt_taken = rstint_taken | hwint_taken | sirint_taken;
    ...
    // Buggy code:
    // modified for bug 3919
    // assign trap_to_redmode = trp_lvl_at_maxtlless1 & ~intrpt_taken;
    // Corrected code:
    assign trap_to_redmode = trp_lvl_at_maxtlless1 &
                             ~(rstint_taken | sirint_taken);

Online Detection of Design Bugs
    assign intrpt_taken = rstint_taken | hwint_taken | sirint_taken;
    ...
    // Buggy code:
    // modified for bug 3919
    // assign trap_to_redmode = trp_lvl_at_maxtlless1 & ~intrpt_taken;
    // Corrected code:
    assign trap_to_redmode = trp_lvl_at_maxtlless1 &
                             ~(rstint_taken | sirint_taken);

[Figure: flip-flops feeding the combinational logic that computes trap_to_redmode, shown for the correct and the buggy implementation. With trp_lvl_at_maxtlless1 = 1, rstint_taken = 0, hwint_taken = 1, and sirint_taken = 0, the correct implementation produces trap_to_redmode = 1 and the buggy implementation produces trap_to_redmode = 0, so the design bug is exposed.]

Monitoring these flip-flop signals can detect the bug occurrence.

Insights from RTL Design Bug Analysis

RTL analysis observations:
- ~20 signals need to be monitored per bug
- >1000 unique signals need to be monitored for all the bugs studied
- Each bug has ~7 source signals not monitored for any other bug
- The set of monitored signals expands with every new bug
- All bug source signals come from control flip-flops
- Monitoring data buffers or data registers will not provide significant benefit

Limitations of online bug detection techniques in the literature:
1. Monitor only a few hundred signals (~200-300)
2.
   Monitored signals are selected at design time

Flexible Bug Detection at the Flip-Flop Level

Monitor ALL control flip-flops in the design.

Flexible bug signature (example: 1 X X 0 X):
- 1: FF needs to be 1 to expose the bug
- 0: FF needs to be 0 to expose the bug
- X: FF is not a bug source signal

[Figure: bug signature encoding. Labels: Operating Flip-Flop, Bug Detection Flip-Flop, Scan Portion, Bug Detection Portion, Detection Value, Monitor Enable; "Load using field programmable scan chains"; output 0: Match, 1: Mismatch.]

Distributed Global Bug Detection

Checking tree table entries are loaded at system startup by firmware.

[Figure: checking tree built over 64 control flip-flops, which are grouped at the flip-flop level into 8-bit bug detection segments. Each tree node holds a table of (Bug ID, Flag, Match-bitvector) entries; edges between nodes are labeled "12, 1". Annotations: "Bug #9 is detected", "Bug #12 is detected".]

Table entries shown in the figure:

Bug ID   Flag   Match-bitvector
12       1      1 1 …

Bug ID   Flag   Match-bitvector
9        1      1 1 X X
12       0      X 1 1 X
…

Bug ID   Flag   Match-bitvector
7        1      X 1 X 1
12       0      1 X X X

Detecting Multiple Design Bugs

[Figure: Design Bug Database (design bugs & triggering conditions) -> Bug Sign. #1, Bug Sign. #2, …, Bug Sign. #N -> Merge Bug Signatures -> System Bug Signature -> Encode & Load -> Bug Detection Flip-Flops. A bug signature conflict (0 in one signature, 1 in another) becomes X in the system bug signature.]

- Use "don't cares" to resolve signal conflicts between bug signatures
- No false negatives, but false positive bug detections are possible

Online Tuning of Coverage/Performance Trade-Off

Adjust the set of design bugs being covered by dynamically updating the system bug signature.

Flowchart elements:
- Firmware loads initial system bug signature
- Design bug detected?
- Execution recovery & design bug avoidance
- Remove bug with highest false positive rate (Bug ID#)
- False positive?
- Yes: update log (physical memory holds a log of the false positive rate of each bug)
- False positive rate > threshold?
- Add bug with lowest false positive rate (Bug ID#)

Area Overhead and Design Bug Coverage

RTL prototype implementation:
- Synthesized with IBM 130nm process technology
- Covers the whole OpenSPARC T1 chip
- 39K control flip-flops monitored (15% of all flip-flops in OpenSPARC T1)
- Bug detection flip-flops have an area overhead of 3%

[Figure: design bug coverage against Total Area Overhead (%). Annotations: 80% coverage at 10% overhead; critical design bugs in 10 commercial processors: ~65% [Sarangi et al., MICRO'06].]

Power Consumption Overhead

OpenSPARC T1 power budget: 58W (IBM 130nm @ 1.2V)

3.5% power overhead:
- Segment checking tree (16 entries per node): 0.74W (1.3%)
- 39K bug detection augmented flip-flops: 0.35W (0.6%)
- Field programmable framework: 0.9W (1.5%)

Other components:
- Cores & L1 Caches; Wires & Repeaters: 10.7W (18.4%); 14.4W (24.7%)
- I/O Pads: 6.9W (12%)
- Misc.
  Units (I/O Bridge, DRAM Ctrl, CTU): 0.9W (1.5%)
- Crossbar: 0.6W (1.1%)
- L2 Cache: 9W (15.4%)
- Leakage: 13.7W (23.5%)

Contributions

RTL-level analysis of the design bugs of a commercial processor
- Bugs have unique source signals that are hard to predict at design time
- Monitored signals need to be selected in the field after bug discovery
- Current techniques are not flexible enough: they select signals at design time

Proposed a flexible online bug detection mechanism
- Monitor all control flip-flops in OpenSPARC T1
- Set of monitored signals can be selected in the field using firmware
- RTL prototype: 80% bug coverage for 10% area overhead

Future Work - Evaluation Challenges

Current infrastructure is insufficient to measure the false positive rate
- Functional simulators: lack of RTL-level detail
- RTL simulators: too slow to run applications

Developing a hardware prototype of our framework on FPGA
- Uncomment design bug fixes in RTL code of OpenSPARC T1
- Evaluate the effectiveness of our framework on real applications
- Measure false positive rate
- Explore trade-off between bug coverage and performance

Thank You! Questions?

Online Bug Detection & Avoidance: A Microprocessor Airbag

- Extra cost without any performance/utility benefits
- The microprocessor designers shouldn't rely on it
- No guarantee of success: doesn't cover all possible design bugs

Car airbags reduce fatalities by 8% when seat belts are worn.

Objective: Reduce the risk of serious implications when critical design bugs are discovered after product release.

RTL Algorithmic Design Bugs

Design bug in Verilog code: lsu_qctl1.v
(Excerpt; intermediate source lines are elided on the slide.)

    //bug4814 - change rrobin_picker1 to rrobin_picker2
    // Choose one among 4 loads.
    //lsu_rrobin_picker1 ld4_rrobin (
    //.events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
    //.se(se),
    //.so()
    //);
    lsu_rrobin_picker2 ld4_rrobin (
      .events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
      .se(se),
      .so()
    );

- Algorithmic deviations from the design specifications
- Require major modifications to be fixed

RTL Timing Design Bugs

Design bug in Verilog code: lsu_qdp1.v

    // Begin - Bug3487.
    dff #(48) ifu_std_d1 (
      .din (tlb_st_data[47:0]),
      .q   (lsu_ifu_stxa_data[47:0]),
      .clk (asi_data_clk),
      .se  (1'b0),
      .si  (),
      .so  ()
    );
    // select is now a stage earlier, which should be
    // fine as selects stay constant.
    //assign lsu_ifu_stxa_data[47:0] = tlb_st_data_d1[47:0] ;
    // End - Bug3487.

- Signals need to be latched a cycle earlier or later to keep correctness
- Addition or removal of flip-flops is the most common fix

RTL OpenSPARC T1 Design Bug Distribution

[Figure: design bug distribution for the Load/Store Unit (LSU), 157 design bugs, and the Trap Logic Unit (TLU), 139 design bugs]

Power Consumption Estimation Methodology

Methodology/tools used, with the design components each was applied to:
- Synopsys Power Compiler: 1) SPARC Cores, 2) Crossbar, 3) FPU, 4) Misc. Units (I/O Bridge, DRAM Controllers, Control & Test Unit), 5) ACE Framework, 6) Online Design Bug Detection Mechanism
- CACTI 4.2: 1) L1 Inst. & Data Caches, 2) L2 Cache
- Taken from *: 1) I/O Pads, 2) Wires & Repeaters

* A. S. Leon, K. W. Tam, J. L. Shin, D. Weisner, and F. Schumacher. A Power-Efficient High-Throughput 32-Thread SPARC Processor. In IEEE Journal of Solid-State Circuits, 42(1), 2006.

RTL Analysis Results

Metrics (LSU / TLU):
- Min./Average/Max.
  number of first-level monitor signals per logic design bug: LSU 2/8/43, TLU 2/12/44
- Min./Average/Max. number of source-level monitor signals per logic design bug: LSU 2/17/97, TLU 2/24/89
- Source-level monitor signal sharing among different design bugs: LSU 68%, TLU 64%
- Average number of unique source-level monitor signals per logic design bug: LSU 6, TLU 9
- Unique source-level monitor signals (for all logic design bugs): LSU 516, TLU 602

Merging Bug Signatures

4-bit bug detection segments:

Design Bug #1
  Bug Signature #1             X X 1 0    X 0 1 X    X X X X
  Bug Signature #2             X X 1 1    X 0 1 X    X X X X
  Intermediate Signature #1    X X 1 X    X 0 1 X    X X X X

Design Bug #2
  Bug Signature #1             X X X X    X 0 X 1    1 X 1 0
  Bug Signature #2             X X X X    X 0 X 1    0 X 1 1
  Intermediate Signature #2    X X X X    X 0 X 1    X X 1 X

                               CASE 2     CASE 1     CASE 2
System Bug Signature           X X 1 X    X 0 X X    X X 1 X

High-Level Overview

1. Bug signature collection: generate the bug signatures (BUG#1, BUG#2, …, BUG#N) based on the design bugs and their triggering conditions
2. Merge the bug signatures into a system bug signature
3. Firmware encodes and loads the system bug signature to the bug detection segments
4. Firmware loads the segment match detection entries into the segment match detection tables
5. Cycle-by-cycle online checking for design bugs against the system state (flip-flops)
6. The segment checking tree aggregates the bug detection segment match/mismatch signals into a global bug detection signal
7. If the global bug detection signal flags a bug, system recovery is triggered (design bug recovery handler)

[Figure: overview diagram of the steps above; example signatures shown include XXX1X0…X1X0XX, XXXX1X…X101XX, X101XX…XX01XX, and XXX0X0…X1X1XX]

OpenSPARC T1 Data & Control Flip-Flops

Columns: Chip Submodule, Data
Signals, Control Signals

SPARC Core (x8)          15632 (79.06%)     4140 (20.94%)
CPU-Cache Crossbar       27283 (98.69%)      362 (1.31%)
Floating-Point Unit       4054 (87.75%)      566 (12.25%)
Control & Test Unit       2325 (55.29%)     1880 (44.71%)
Input/Output Bridge      10251 (95.14%)      524 (4.86%)
DRAM Controller (x4)     13449 (94.70%)      752 (5.30%)
Total                   222765 (84.95%)    39460 (15.05%)

Synergistic Online Bug & Defect Detection

[Figure: execution timeline with state checkpoints 1 to 4.
- System startup: firmware loads the design bug data (bug signature, segment match entries)
- Computation & online design bug checking between state checkpoints
- Design bug detected: state recovery, then avoid design bug
- Firmware test for hardware defects (no design bug checking during the test)
- Hardware defect detected: state recovery, then hardware repair
- Computation & online design bug checking resume]
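
Sketch: bug signatures in software

The bug signature idea from the talk (1 / 0 / X per control flip-flop, with "don't cares" used to resolve conflicts) can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the authors' hardware or firmware: the function names are hypothetical, the flip-flop ordering chosen for bug 3919 is an assumption, and the merge is a flat one that ignores the 4-bit/8-bit segments, the checking tree, and the CASE 1 / CASE 2 segment handling shown on the merging slide. The two merge examples in the demo reuse the per-bug signatures from that slide.

```python
from itertools import product


def matches(signature, state):
    """Return True if `state` satisfies every monitored position of `signature`.

    Both are equal-length strings; `signature` uses '0', '1', and 'X'
    (X = flip-flop is not a bug source signal), `state` uses '0' and '1'.
    """
    return all(s == "X" or s == v for s, v in zip(signature, state))


def merge(signatures):
    """Flat merge (assumption): keep a value only where all signatures agree."""
    merged = []
    for column in zip(*signatures):
        first = column[0]
        agree = first != "X" and all(c == first for c in column)
        merged.append(first if agree else "X")
    return "".join(merged)


# Assumed flip-flop order for bug 3919 (tlu_ctl.v):
# (trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken)
BUG_3919 = "1010"


def buggy(t, r, h, s):
    """trap_to_redmode as in the buggy code: t & ~(r | h | s)."""
    return int(t and not (r or h or s))


def corrected(t, r, h, s):
    """trap_to_redmode as in the corrected code: t & ~(r | s)."""
    return int(t and not (r or s))


def signature_is_exact():
    """Check that BUG_3919 matches exactly the states where the outputs differ."""
    for bits in product("01", repeat=4):
        t, r, h, s = (int(b) for b in bits)
        exposed = buggy(t, r, h, s) != corrected(t, r, h, s)
        if exposed != matches(BUG_3919, "".join(bits)):
            return False
    return True


def no_false_negatives(signatures):
    """Every state matched by some input signature is matched by the merge."""
    merged = merge(signatures)
    for bits in product("01", repeat=len(merged)):
        state = "".join(bits)
        if any(matches(sig, state) for sig in signatures):
            if not matches(merged, state):
                return False
    return True


if __name__ == "__main__":
    bug1 = ["XX10X01XXXXX", "XX11X01XXXXX"]  # Design Bug #1 signatures
    bug2 = ["XXXXX0X11X10", "XXXXX0X10X11"]  # Design Bug #2 signatures
    print(merge(bug1), merge(bug2))
    print(signature_is_exact(), no_false_negatives(bug1))
```

The flat merge can only widen what is matched, which is why it gives no false negatives but may give false positives: any state that differs from the original signatures only in a position turned into X is now flagged as well.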