Online Design Bug Detection:
RTL Analysis, Flexible Mechanisms, and Evaluation
Kypros Constantinides
University of Michigan
Onur Mutlu
Microsoft Research & Carnegie Mellon University
Todd Austin
University of Michigan
Challenges of Correct Microprocessor Design
Design Bugs: Deviations from the product specifications
[Figure annotations: 1.2 bugs per month; 3.5 bugs per month; Chip-Multiprocessors]
New Features:
• 64-bit extensions
• Virtualization
• Power Management
• SSE3
*Data compiled from Intel product specification update documents
More bugs as more complex and diverse resources are integrated into a single chip
Online Design Bug Detection
MICRO-41
November 11th, 2008
Why is Online Design Bug Detection Needed?
Cost of design bugs:
- Lower System Performance
- System Security: Attacks exploit HW design bugs
- Lower Customer Satisfaction
- Financial Loss
- Expensive Recalls
- Diminishing Brand/Company Reputation
Microprocessor companies rely on ad-hoc techniques that change
the software and hardware configuration to work around design bugs
Online Design Bug Detection and Avoidance
Online Design Bug Detection
- Bug detection mechanism is updated by firmware with new design bugs
Online System Recovery
- Recover system from design bug effects
- Low overhead periodic checkpoint and recovery
- Existing mechanisms:
• ReVive + ReViveI/O
• SafetyNet
Bug Avoidance Techniques
- Avoid the reoccurrence of the design bug
- Existing mechanisms:
• Scale down to safe-mode
• Disable buggy part
• Hypervisor execution guidance
In this work we focus on online design bug detection
Microprocessor Errata Documents
From the Intel Pentium 4 Specification Update Document
R31. Interactions between the Instruction Translation Lookaside Buffer (ITLB) and the Instruction Streaming Buffer May Cause Unpredictable Software Behavior
Problem: Complex interactions within the instruction fetch/decode unit may make it possible for the processor to execute instructions from an internal streaming buffer containing stale or incorrect information.
Implication: When this erratum occurs, an incorrect instruction stream may be executed resulting in unpredictable software behavior.
Limitations:
- Provide high-level description of the design bug
- Hard to relate the design bug to the actual hardware implementation
Characterizing RTL Design Bugs
[Figure: OpenSPARC T1 (Niagara) chip and OpenSPARC Core; core units: MUL, IFU, MMU, Trap Logic Unit (TLU), EXU, Load Store Unit (LSU)]
- RTL design bugs in Verilog code
- Fixed and documented in the code
Load Store Unit (LSU): 157 bugs
Trap Logic Unit (TLU): 139 bugs
Total of 296 bugs in SPARC core
Example of RTL design bug in Verilog code – tlu_ctl.v
assign intrpt_taken = rstint_taken | hwint_taken | sirint_taken;
...
// modified for bug 3919
// Buggy Code:
// assign trap_to_redmode = trp_lvl_at_maxtlless1 & ~intrpt_taken;
// Corrected Code:
assign trap_to_redmode = trp_lvl_at_maxtlless1 &
                         ~(rstint_taken | sirint_taken);
Online Detection of Design Bugs
[Figure: the tlu_ctl.v code for bug 3919 (shown above) mapped to flip-flops and combinational logic]
Inputs that expose the design bug:
trp_lvl_at_maxtlless1 = 1, rstint_taken = 0, hwint_taken = 1, sirint_taken = 0
Correct Implementation: trap_to_redmode = 1
Buggy Implementation: trap_to_redmode = 0
Monitoring the flip-flops that drive these signals can detect the bug occurrence
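The triggering condition for bug 3919 can be found by brute force. The sketch below is a minimal Python model of the two Verilog expressions from tlu_ctl.v; the function names are illustrative and not part of the OpenSPARC source. It enumerates all input combinations and keeps those where the buggy and corrected outputs disagree.

```python
from itertools import product

def trap_to_redmode_buggy(trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken):
    # Buggy expression: any taken interrupt, including a hardware interrupt, masks the trap.
    intrpt_taken = rstint_taken | hwint_taken | sirint_taken
    return trp_lvl_at_maxtlless1 & (intrpt_taken ^ 1)

def trap_to_redmode_fixed(trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken):
    # Corrected expression (bug 3919): hwint_taken no longer masks the trap.
    return trp_lvl_at_maxtlless1 & ((rstint_taken | sirint_taken) ^ 1)

def exposing_inputs():
    """Return every input tuple on which the buggy and corrected outputs differ."""
    return [bits for bits in product((0, 1), repeat=4)
            if trap_to_redmode_buggy(*bits) != trap_to_redmode_fixed(*bits)]

# Tuple order: (trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken)
print(exposing_inputs())  # [(1, 0, 1, 0)]
```

Only one of the 16 combinations exposes the bug, which is why monitoring these four signals is enough to detect it.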
Insights from RTL Design Bug Analysis
RTL Analysis Observations:
- ~20 signals need to be monitored per bug
- >1000 unique signals need to be monitored for all the bugs studied
- Each bug has ~7 source signals not monitored for any other bug
- Set of monitored signals is expanding for every new bug
- All bug source signals are coming from control flip-flops
- Monitoring data buffers or data registers will not provide significant benefit
Limitations of online bug detection techniques in the literature:
1. Monitor only a few hundred signals (~200-300)
2. Monitored signals are selected at design time
Flexible Bug Detection at the Flip-Flop Level
Monitor ALL control flip-flops in the design
Flexible Bug Signature (example): 1 X X 0 X …
- 1: FF needs to be 1 to expose bug
- 0: FF needs to be 0 to expose bug
- X: FF is not a bug source signal
Bug Signature Encoding: Detection Value and Monitor Enable bits
[Figure: Bug Detection Flip-Flop = Operating Flip-Flop + Scan Portion + Bug Detection Portion; the signature is loaded using field programmable scan chains; output 0: Match, 1: Mismatch]
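A minimal Python sketch of the per-flip-flop check follows. The exact bit encoding is an assumption, since the slide's encoding table did not survive extraction: here 1 maps to (Detection Value 1, Monitor Enable 1), 0 to (0, 1), and X to (0, 0). The function names are illustrative.

```python
def encode(signature):
    """Encode a signature over {'0', '1', 'X'} as (detection_value, monitor_enable) pairs.

    Assumed encoding: 'X' disables monitoring, so its detection value is irrelevant.
    """
    table = {"1": (1, 1), "0": (0, 1), "X": (0, 0)}
    return [table[symbol] for symbol in signature]

def segment_mismatch(state_bits, encoded):
    """Return 0 (match) if every monitored flip-flop equals its detection value, else 1."""
    if len(state_bits) != len(encoded):
        raise ValueError("state and signature must have the same length")
    for bit, (value, enable) in zip(state_bits, encoded):
        if enable and bit != value:
            return 1
    return 0

signature = encode("1XX0X")
print(segment_mismatch([1, 0, 1, 0, 1], signature))  # 0: the state matches the signature
print(segment_mismatch([1, 0, 1, 1, 1], signature))  # 1: the fourth flip-flop should be 0
```

Because X positions are never compared, the same hardware can be retargeted to a new bug by loading a different signature.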
Distributed Global Bug Detection Checking
[Figure: Checking Tree built over 8-bit Bug Detection segments at the flip-flop level (64 Control Flip-Flops shown). Each tree node holds a table with columns Bug ID, Flag, Match-bitvector; table entries are loaded at system startup by firmware. Example entries (Bug ID, Flag, Match-bitvector): (9, 1, 1 1 X X), (12, 0, X 1 1 X), (7, 1, X 1 X 1), (12, 0, 1 X X X), (12, 1, 1 1 …). Signals labeled "12, 1" are passed up the tree. Outcomes shown: Bug #9 is detected; Bug #12 is detected.]
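One possible reading of the checking tree is sketched below in Python. The semantics are assumptions inferred from the figure labels, not confirmed by the slides: a Flag of 1 means a matching entry reports the bug at that node, a Flag of 0 means the node forwards the Bug ID to its parent, and node inputs are 1 where a segment (or child) matched. The two-input root entry is hypothetical because the slide elides its full bitvector.

```python
def matches(bitvector, inputs):
    """True if every non-'X' position of the match-bitvector equals the input bit."""
    return all(p == "X" or int(p) == bit for p, bit in zip(bitvector, inputs))

def evaluate_node(table, inputs_for):
    """Evaluate one checking-tree node.

    table: list of (bug_id, flag, match_bitvector) entries.
    inputs_for: function mapping a bug ID to this node's input bits for that bug.
    Returns (detected, forwarded) sets of bug IDs.
    """
    detected, forwarded = set(), set()
    for bug_id, flag, bitvector in table:
        if matches(bitvector, inputs_for(bug_id)):
            (detected if flag else forwarded).add(bug_id)
    return detected, forwarded

# Lower-level nodes see segment match bits (1 = segment matched), the same for every bug.
node_a = [(9, 1, "11XX"), (12, 0, "X11X")]
node_b = [(7, 1, "X1X1"), (12, 0, "1XXX")]
det_a, fwd_a = evaluate_node(node_a, lambda bug: [1, 1, 1, 0])
det_b, fwd_b = evaluate_node(node_b, lambda bug: [1, 0, 0, 0])

# The parent sees, per bug, which children forwarded that bug ID.
root = [(12, 1, "11")]  # hypothetical two-input root entry
det_root, _ = evaluate_node(root, lambda bug: [int(bug in fwd_a), int(bug in fwd_b)])

print(det_a, det_root)  # {9} {12}
```

With these inputs, bug 9 is detected locally at the first node, while bug 12 is detected only at the root after both children forward it.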
Detecting Multiple Design Bugs
[Figure: Design Bug Database (Design Bugs & Triggering Conditions) -> Bug Sign.#1, Bug Sign.#2, …, Bug Sign.#N -> Merge Bug Signatures -> System Bug Signature -> Encode & Load -> Bug Detection Flip-Flops]
Bug Signature Conflict: a 0 in one bug signature and a 1 in another become X in the System Bug Signature
- Use “Don’t cares” to resolve signal conflicts between bug signatures
- No false negatives, but false positive bug detections are possible
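The "no false negatives, but false positives" property can be checked exhaustively on a small case. The sketch below uses two hypothetical 4-bit signatures and a position-wise merge that keeps a symbol only where both signatures agree; the signature values and function names are illustrative.

```python
from itertools import product

def merge(sig_a, sig_b):
    """Position-wise merge: keep a symbol only where both signatures agree, else 'X'."""
    return "".join(a if a == b else "X" for a, b in zip(sig_a, sig_b))

def matches(signature, state):
    """True if the state agrees with every non-'X' position of the signature."""
    return all(s == "X" or s == bit for s, bit in zip(signature, state))

sig1, sig2 = "10X1", "11X0"  # hypothetical bug signatures
system = merge(sig1, sig2)

states = ["".join(bits) for bits in product("01", repeat=4)]
# A false negative would be a state that triggers a bug but escapes the merged signature.
false_negatives = [s for s in states
                   if (matches(sig1, s) or matches(sig2, s)) and not matches(system, s)]
# A false positive matches the merged signature without triggering either bug.
false_positives = [s for s in states
                   if matches(system, s) and not (matches(sig1, s) or matches(sig2, s))]

print(system)           # 1XXX
print(false_negatives)  # []
print(false_positives)  # ['1000', '1010', '1101', '1111']
```

Every conflict turned into a don't-care widens the set of matching states, which is where the false positives come from.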
Online Tuning of Coverage/Performance Trade-Off
Adjust the design bugs being covered by dynamically updating the system bug signature
1. Firmware loads initial system bug signature
2. Execution continues until a design bug is detected
3. False positive? If no: recovery & design bug avoidance. If yes: update log (Bug ID#)
4. False positive rate > threshold? If yes: remove bug with highest false positive rate. If no: add bug with lowest false positive rate
Physical Memory holds the log of the false positive rate of each bug
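The tuning loop can be sketched as firmware-side bookkeeping in Python. Several details are assumptions not stated on the slide: the false positive rate is taken as false positives per detection, the threshold test uses the overall rate across all logged bugs, and only previously removed bugs are candidates for being added back. The class and method names are illustrative.

```python
class CoverageTuner:
    """Tracks which bug IDs the system bug signature covers (illustrative only)."""

    def __init__(self, covered, threshold):
        self.covered = set(covered)   # bug IDs currently in the system bug signature
        self.removed = set()          # bug IDs dropped because of false positives
        self.threshold = threshold    # tolerated overall false positive rate (assumed metric)
        # Log of each bug: bug ID -> [false positives, detections]
        self.log = {bug: [0, 0] for bug in covered}

    def rate(self, bug):
        false_positives, detections = self.log[bug]
        return false_positives / detections if detections else 0.0

    def overall_rate(self):
        false_positives = sum(entry[0] for entry in self.log.values())
        detections = sum(entry[1] for entry in self.log.values())
        return false_positives / detections if detections else 0.0

    def on_detection(self, bug, false_positive):
        """Handle one detection and return the (action, bug ID) taken."""
        if bug not in self.covered:
            raise ValueError(f"bug {bug} is not in the system bug signature")
        self.log[bug][1] += 1
        if not false_positive:
            return ("recover_and_avoid", bug)
        self.log[bug][0] += 1
        if self.overall_rate() > self.threshold:
            worst = max(self.covered, key=self.rate)
            self.covered.remove(worst)
            self.removed.add(worst)
            return ("remove", worst)
        if self.removed:
            best = min(self.removed, key=self.rate)
            self.removed.remove(best)
            self.covered.add(best)
            return ("add", best)
        return ("keep", bug)

tuner = CoverageTuner([7, 9, 12], threshold=0.5)
first = tuner.on_detection(9, false_positive=True)    # overall rate 1/1 exceeds 0.5
for _ in range(3):
    tuner.on_detection(7, false_positive=False)       # three real detections
second = tuner.on_detection(7, false_positive=True)   # overall rate 2/5 is below 0.5
print(first, second)  # ('remove', 9) ('add', 9)
```

After each change to the covered set, firmware would rebuild and reload the system bug signature.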
Area Overhead and Design Bug Coverage
RTL prototype implementation:
- Synthesized with IBM 130nm process technology
- Covers the whole OpenSPARC T1 Chip
- 39K control flip-flops monitored (15% of all Flip-flops in OpenSPARC T1)
- Bug detection flip-flops have an area overhead of 3%
[Figure: Total Area Overhead (%), axis range 0 to 25; annotations: 80% Coverage at 10% Overhead; Critical Design Bugs in 10 commercial processors ~65% [Sarangi et al., MICRO’06]]
Power Consumption Overhead
OpenSPARC T1 Power Budget: 58W (IBM 130nm @ 1.2V)
3.5% Power Overhead
[Pie chart: power breakdown; some label-to-value pairings are not recoverable]
- Bug detection slices: Segment Checking Tree (16 entries per node), Field Programmable Framework, 39K Bug Detection Augmented Flip-Flops; values: (0.74W) 1.3%, (0.35W) 0.6%, (0.9W) 1.5%
- Cores & L1 Caches and Wires & Repeaters; values: (10.7W) 18.4%, (14.4W) 24.7%
- I/O Pads: (6.9W) 12%
- Misc. Units (I/O Bridge, DRAM Ctrl, CTU): (0.9W) 1.5%
- Crossbar: (0.6W) 1.1%
- L2 Cache: (9W) 15.4%
- Leakage: (13.7W) 23.5%
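The slice values can be cross-checked with a few lines of arithmetic. The sketch below assumes the three small slices (0.74W, 0.35W, 0.9W) make up the bug detection overhead and that the slide's 3.5% figure is measured relative to the chip without the mechanism; both are interpretations of the chart, not statements from the slide.

```python
# Bug detection slices, in watts (which label owns which value does not matter here).
overhead_w = [0.74, 0.35, 0.9]
# Remaining slices, in watts: the two largest slices, I/O pads, misc. units,
# crossbar, L2 cache, leakage.
baseline_w = [10.7, 14.4, 6.9, 0.9, 0.6, 9.0, 13.7]

total_w = sum(overhead_w) + sum(baseline_w)
share_of_total = 100 * sum(overhead_w) / total_w
relative_to_baseline = 100 * sum(overhead_w) / sum(baseline_w)

print(round(total_w, 2))               # 58.19, consistent with the 58W budget
print(round(share_of_total, 1))        # 3.4
print(round(relative_to_baseline, 1))  # 3.5
```

The per-slice percentages on the chart (1.3% + 0.6% + 1.5%) sum to 3.4% of the total, while 3.5% corresponds to the overhead relative to the baseline.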
Contributions

RTL-level analysis of the design bugs of a commercial processor
- Bugs have unique source signals that are hard to predict at design time
- Monitored signals need to be selected in the field after bug discovery
- Current techniques not flexible enough - select signals at design time
Proposed a flexible online bug detection mechanism
- Monitor all control flip-flops in OpenSPARC T1
- Set of monitored signals can be selected in the field using firmware
- RTL prototype: 80% bug coverage for 10% area overhead
Future Work - Evaluation Challenges

Current infrastructure insufficient to measure false positive rate
- Functional simulators: Lack of RTL level detail
- RTL simulators: Too slow to run applications
Developing a hardware prototype of our framework on FPGA
- Uncomment design bug fixes in RTL code of OpenSPARC T1
- Evaluate the effectiveness of our framework on real applications
- Measure false positive rate
- Explore trade-off between bug coverage and performance
Thank You!
Questions?
Online Bug Detection & Avoidance:
A Microprocessor Airbag

- Extra cost without any performance/utility benefits
- The microprocessor designers shouldn’t rely on it
- No guarantee of success - Doesn’t cover all possible design bugs
- Car airbags reduce fatalities by 8% when seat belts are worn
Objective: Reduce the risk of serious implications when critical design bugs are discovered after product release
RTL Algorithmic Design Bugs
Design bug in Verilog code – lsu_qctl1.v
//bug4814 - change rrobin_picker1 to rrobin_picker2
// Choose one among 4 loads.
//lsu_rrobin_picker1 ld4_rrobin (
...
//.events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
//.se(se),
//.so()
//);
lsu_rrobin_picker2 ld4_rrobin (
    .events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
    ...
    .se(se),
    .so()
);
- Algorithmic deviations from the design specifications
- Require major modifications to be fixed
RTL Timing Design Bugs
Design bug in Verilog code – lsu_qdp1.v
// Begin - Bug3487.
...
dff #(48) ifu_std_d1 (
    .din (tlb_st_data[47:0]),
    .q   (lsu_ifu_stxa_data[47:0]),
    .clk (asi_data_clk),
    .se  (1'b0),
    .si  (),
    .so  ()
);
// select is now a stage earlier, which should be
// fine as selects stay constant.
//assign lsu_ifu_stxa_data[47:0] = tlb_st_data_d1[47:0] ;
// End - Bug3487.
- Signals need to be latched a cycle earlier or later to keep correctness
- Addition or removal of flip-flops is the most common fix
RTL OpenSPARC T1 Design Bug Distribution
Load/Store Unit (LSU): 157 Design Bugs
Trap Logic Unit (TLU): 139 Design Bugs
Power Consumption Estimation Methodology
Methodology/Tools Used | Design Components
Synopsys Power Compiler | 1) SPARC Cores, 2) Crossbar, 3) FPU, 4) Misc. Units (I/O Bridge, DRAM Controllers, Control & Test Unit), 5) ACE Framework, 6) Online Design Bug Detection Mechanism
CACTI 4.2 | 1) L1 Inst. & Data Caches, 2) L2 Cache
Taken from * | 1) I/O Pads, 2) Wires & Repeaters
* A. S. Leon, K. W. Tam, J. L. Shin, D. Weisner, and F. Schumacher. A Power-Efficient High-Throughput 32-Thread SPARC Processor. In IEEE Journal of Solid-State Circuits, 42(1), 2006
RTL Analysis Results
Metrics | LSU | TLU
Min./Average/Max. number of first-level monitor signals per logic design bug | 2/8/43 | 2/12/44
Min./Average/Max. number of source-level monitor signals per logic design bug | 2/17/97 | 2/24/89
Source-level monitor signal sharing among different design bugs | 68% | 64%
Average number of unique source-level monitor signals per logic design bug | 6 | 9
Unique source-level monitor signals (for all logic design bugs) | 516 | 602
Merging Bug Signatures
4-bit Bug Detection Segments
Design Bug #1
Bug Signature #1:          X X 1 0 | X 0 1 X | X X X X
Bug Signature #2:          X X 1 1 | X 0 1 X | X X X X
Intermediate Signature #1: X X 1 X | X 0 1 X | X X X X
Design Bug #2
Bug Signature #1:          X X X X | X 0 X 1 | 1 X 1 0
Bug Signature #2:          X X X X | X 0 X 1 | 0 X 1 1
Intermediate Signature #2: X X X X | X 0 X 1 | X X 1 X
System Bug Signature:      X X 1 X | X 0 X X | X X 1 X
[Figure labels on the segments, left to right: CASE 2, CASE 1, CASE 2]
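The example above is consistent with a simple segment-wise rule, sketched here in Python. The rule is inferred from the slide's numbers rather than stated on it: if one signature leaves a segment entirely unmonitored (all X), the other signature's segment is kept; otherwise a position keeps its symbol only where both signatures agree. Keeping the other bug's pattern in an unmonitored segment presumably relies on the checking tree ignoring that segment for the first bug, which is also an assumption.

```python
def is_unmonitored(segment):
    """True if the segment consists only of don't-care symbols."""
    return set(segment) == {"X"}

def merge_segment(a, b):
    """Merge two same-width segments under the inferred rule."""
    if is_unmonitored(a):
        return b
    if is_unmonitored(b):
        return a
    return "".join(x if x == y else "X" for x, y in zip(a, b))

def merge(sig_a, sig_b, width=4):
    """Merge two signatures segment by segment (4-bit segments, as on the slide)."""
    return "".join(merge_segment(sig_a[i:i + width], sig_b[i:i + width])
                   for i in range(0, len(sig_a), width))

# Signatures from the slide, written without spaces or segment separators.
intermediate_1 = merge("XX10X01XXXXX", "XX11X01XXXXX")
intermediate_2 = merge("XXXXX0X11X10", "XXXXX0X10X11")
system = merge(intermediate_1, intermediate_2)

print(intermediate_1)  # XX1XX01XXXXX
print(intermediate_2)  # XXXXX0X1XX1X
print(system)          # XX1XX0XXXX1X
```

All three results agree with the intermediate and system signatures shown on the slide.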
High-Level Overview
1. Generate the bug signatures based on bug triggering conditions (input: Design Bugs & Triggering Conditions; output: Bug Signature Collection for BUG#1, BUG#2, …, BUG#N)
2. Merge Bug Signatures into the System Bug Signature
3. Firmware encodes and loads the system bug signature to the bug detection segments
4. Firmware loads the segment match detection entries (Segment Match Detection Tables in the Segment Checking Tree)
5. Cycle-by-cycle online checking for design bugs (Bug Detection Segments over the System State (Flip-Flops))
6. Aggregate bug detection segment match/mismatch signals to a global bug detection signal
7. If the global bug detection signal flags a bug, system recovery is triggered (Design Bug Recovery Handler)
[Figure: example signature strings XXX1X0…X1X0XX, XXXX1X…X101XX, X101XX…XX01XX, XXX0X0…X1X1XX]
OpenSPARC T1 Data & Control Flip-Flops
Chip Submodule | Data Signals | Control Signals
SPARC Core (x8) | 15632 (79.06%) | 4140 (20.94%)
CPU-Cache Crossbar | 27283 (98.69%) | 362 (1.31%)
Floating-Point Unit | 4054 (87.75%) | 566 (12.25%)
Control & Test Unit | 2325 (55.29%) | 1880 (44.71%)
Input/Output Bridge | 10251 (95.14%) | 524 (4.86%)
DRAM Controller (x4) | 13449 (94.70%) | 752 (5.30%)
Total | 222765 (84.95%) | 39460 (15.05%)
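The Total row follows from the per-submodule counts once the x8 and x4 instance multipliers are applied. A short Python check, with the table values copied in and illustrative variable names:

```python
# (submodule, instances, data flip-flops per instance, control flip-flops per instance)
SUBMODULES = [
    ("SPARC Core", 8, 15632, 4140),
    ("CPU-Cache Crossbar", 1, 27283, 362),
    ("Floating-Point Unit", 1, 4054, 566),
    ("Control & Test Unit", 1, 2325, 1880),
    ("Input/Output Bridge", 1, 10251, 524),
    ("DRAM Controller", 4, 13449, 752),
]

data_total = sum(count * data for _, count, data, _ in SUBMODULES)
control_total = sum(count * control for _, count, _, control in SUBMODULES)
control_share = 100 * control_total / (data_total + control_total)

print(data_total)               # 222765
print(control_total)            # 39460
print(round(control_share, 2))  # 15.05
```

The 39460 control flip-flops are the "39K control flip-flops monitored (15% of all Flip-flops)" quoted for the RTL prototype.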
Synergistic Online Bug & Defect Detection
[Figure: system operation timeline with stages 1 to 4. Labels: System Startup; Firmware Load Design Bug Data (Bug Signature, Segment Match Entries); Computation & Online Design Bug Checking; State Checkpoint; Firmware Test for Hardware Defects; No Design Bug; Design Bug Detected; State Recovery; Avoid Design Bug; Defect Detected; Hardware Defect; Hardware Repair]