
Self-Repair for Robust System Design
Yanjing Li
Intel Labs
Stanford University
1
Hardware Failures: Major Concern

Permanent: our focus

Temporary
2
Tolerating Permanent Hardware Failures
Detection
Diagnosis
Recovery
Self-repair
3
Tolerating Permanent Hardware Failures
Detection
Diagnosis
Concurrent error detection: expensive
Recovery
Self-repair
4
Tolerating Permanent Hardware Failures
Detection
Diagnosis
Online Self-Test and Diagnostics
 Non-continuous → low power
 Concurrent → no visible downtime
Recovery
Self-repair
5
Tolerating Permanent Hardware Failures
Detection
Diagnosis
Online Self-Test and Diagnostics
 Localize failures
Recovery
Self-repair
6
Tolerating Permanent Hardware Failures
Detection
Diagnosis
Recovery
Self-repair
 Replace / bypass faulty component(s)
7
Tolerating Permanent Hardware Failures
Detection
Diagnosis
Recovery
 Correction of corrupted data and states
 Checkpoint → failure occurs → failure detected → rollback
Self-repair
8
This Lecture
Detection
Diagnosis
Recovery
Self-repair
9
Outline

Introduction

Self-repair techniques
 Memories
 Processor cores
 Uncore components
 Conclusion
10
Memory Organization
[Figure: RAM block diagram. Inputs: en, rd/wr, addr, data_in. Blocks: logic (driven by en and rd/wr), row decoder, column decoder, column multiplexers, data registers, sense amplifiers. Output: data_out.]
11
Memory Functional Fault Models

Memory cell array faults
 stuck-at, transition, coupling, etc.

Address decode faults (AFs)
 no cell accessed, multiple cells accessed, etc.

Read/write logic faults
 Equivalent to memory cell array faults
Single bit / word / column / row faults dominate
[Aitken 04, Kurdahi 06, Sridharan 12]
12
Spare Rows / Columns
[Figure: the RAM block diagram extended with spare rows and a spare column.]
13
Memory Self-Repair Using Spare Rows
[Figure: the RAM block diagram used for self-repair with spare rows.]

Row decoder modified
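
One way to picture the modified row decoder is as a small remap table that is consulted before the normal decode. The following is a behavioral sketch only: the class and method names are hypothetical, and an actual memory implements the remapping in hardware rather than software.

```python
class SpareRowDecoder:
    """Behavioral model: faulty row addresses are steered to spare rows."""

    def __init__(self, n_rows, n_spares):
        self.n_rows = n_rows
        self.n_spares = n_spares
        self.remap = {}  # faulty row address -> spare row index

    def repair(self, faulty_row):
        """Allocate a spare for faulty_row; return False if none is left."""
        if not 0 <= faulty_row < self.n_rows:
            raise ValueError("row address out of range")
        if faulty_row in self.remap:
            return True  # already repaired
        if len(self.remap) >= self.n_spares:
            return False  # all spare rows are in use
        self.remap[faulty_row] = len(self.remap)
        return True

    def decode(self, row):
        """Return ("spare", i) for a repaired row, else ("main", row)."""
        if row in self.remap:
            return ("spare", self.remap[row])
        return ("main", row)
```

With one spare row, a first faulty row is redirected to the spare, while a second faulty row cannot be repaired and would need another mechanism (for example a spare column).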
14
Memory Self-Repair Using Spare Columns
[Figure: the RAM block diagram with additional multiplexers on the data path, steered by a self-repair control block.]

Additional multiplexers and select logic
15
Spare Words
[Figure: the RAM block diagram plus a spare words pool that receives the same inputs as the RAM; a self-repair control block sits on the data_out path.]
[Kim 98]
16
Redundancy in Memories
 Essential to improve yield and reliability
 Widely used in commercial RAMs
 Related topics
 How much redundancy?
 Built-in repair analysis: redundancy allocation
17
How Much Redundancy?
 Yield analysis and yield learning
 Example: negative binomial memory yield model
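$Yield = (1 + A D / \alpha)^{-\alpha}$
$D$: defect density, $A$: memory area
$\alpha$: defect clustering coefficient (measured to be 2 or 3)
 For a memory array with N rows and 1 spare row
$Yield_{array} = Yield_{row}^{N+1} + (N + 1)\, Yield_{row}^{N}\, (1 - Yield_{row})$
where $Yield_{row}$ is the yield of a single row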
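
A small numeric sketch of the yield model on this slide. It assumes that a single row follows the same negative binomial form using the row's own area, and that rows fail independently, so the array is usable when at most one of its N + 1 rows is defective. The function names and all numeric values are illustrative assumptions, not figures from the lecture.

```python
def negative_binomial_yield(area, defect_density, alpha):
    """Yield = (1 + A*D/alpha) ** (-alpha)."""
    return (1.0 + area * defect_density / alpha) ** (-alpha)


def yield_with_one_spare_row(row_yield, n_rows):
    """Yield of an array with n_rows rows plus 1 spare row.

    The array is usable if all n_rows + 1 rows are good, or if exactly
    one of them is bad (the spare covers it). Independent row failures
    are assumed, which is a simplification.
    """
    all_good = row_yield ** (n_rows + 1)
    one_bad = (n_rows + 1) * row_yield ** n_rows * (1.0 - row_yield)
    return all_good + one_bad


# Hypothetical numbers, for illustration only.
row_yield = negative_binomial_yield(area=0.001, defect_density=0.5, alpha=2)
no_spare = row_yield ** 256
one_spare = yield_with_one_spare_row(row_yield, n_rows=256)
print(f"row yield {row_yield:.6f}, no spare {no_spare:.4f}, one spare {one_spare:.4f}")
```

Comparing `no_spare` with `one_spare` for a candidate design gives a first-order feel for how much yield a single spare row buys.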
18
Built-In Repair Analysis
 Spare rows and columns
 Optimal allocation of spare rows and columns is NP-complete
 Various algorithms and EDA tools exist
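
Because exact allocation is hard, heuristics are common. The sketch below is one simple greedy heuristic, shown only to illustrate the problem shape: it repeatedly replaces the row or column that covers the most remaining faulty cells. It is not any specific published or commercial algorithm, and the function name and data layout are assumptions.

```python
from collections import Counter


def allocate_spares(faults, spare_rows, spare_cols):
    """Greedy spare allocation for a set of faulty cells.

    faults: iterable of (row, col) faulty cells.
    Returns (rows_to_replace, cols_to_replace), or None if this
    heuristic fails to cover all faults. It is not optimal: a None
    result does not prove that the memory is unrepairable.
    """
    remaining = set(faults)
    rows, cols = set(), set()
    while remaining:
        row_count = Counter(r for r, _ in remaining)
        col_count = Counter(c for _, c in remaining)
        best_row, n_row = row_count.most_common(1)[0]
        best_col, n_col = col_count.most_common(1)[0]
        rows_left = spare_rows - len(rows)
        cols_left = spare_cols - len(cols)
        # Prefer the line that covers the most remaining faults,
        # as long as a spare of that kind is still available.
        if rows_left > 0 and (n_row >= n_col or cols_left == 0):
            rows.add(best_row)
            remaining = {(r, c) for r, c in remaining if r != best_row}
        elif cols_left > 0:
            cols.add(best_col)
            remaining = {(r, c) for r, c in remaining if c != best_col}
        else:
            return None
    return rows, cols
```

For example, three faults in row 0 plus one isolated fault at (5, 7) are covered by one spare row and one spare column.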
20
Special Self-Repair Techniques for Caches
 Caches: disabling faulty lines affects performance, NOT functionality
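[Figure: cache lookup. The block address is split into tag, index, and block offset; the index drives a decoder that selects a line holding V (valid), tag, and data / ECC; a tag comparison (=?) produces hit/miss and data out.]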
21
Cache Line Disable/Delete
 Exists in commercial systems
 Intel [Chang 07], IBM [Sanda 08], etc.
[Figure: the cache lookup diagram with a fault-tolerance (FT) bit added to each line next to the valid (V) bit.]
22
Setting the FT Bit
Cache line read → ECC calculate and check → ECC error?
 No: done
 Yes: correct and flag error; log faulty location
ECC error at same location already occurred within a time threshold?
 No: done
 Yes: set FT bit
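
The flow on this slide can be sketched in software as follows. This is a behavioral model under stated assumptions: the class name, the per-line bookkeeping, and the time units are hypothetical, and the actual ECC correction step is left out.

```python
class LineFaultTracker:
    """Tracks repeated ECC errors per cache line and sets FT bits."""

    def __init__(self, time_threshold):
        self.time_threshold = time_threshold
        self.last_error_time = {}  # line index -> time of last logged ECC error
        self.ft_bits = set()       # lines whose FT bit is set (line disabled)

    def on_read(self, line, ecc_error, now):
        """Process one cache line read; return True if the FT bit is set."""
        if not ecc_error:
            return line in self.ft_bits
        previous = self.last_error_time.get(line)
        self.last_error_time[line] = now  # log the faulty location
        # A repeat error at the same location within the threshold is
        # treated as a permanent fault, so the line is disabled.
        if previous is not None and now - previous <= self.time_threshold:
            self.ft_bits.add(line)
        return line in self.ft_bits
```

A single ECC error leaves the line enabled; a second error on the same line within the threshold sets its FT bit.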
23
PADded Cache
 Reconfigurable cache with programmable decoder
Conventional
decoder
24
PADded Cache
 Reconfigurable cache with programmable decoder
 Additional tag bits needed
 ~5% area cost [Shirvani 99]
 16KB, direct-mapped, 1-level programmability
 Reduce cost at the price of granularity
[Figure: cache with a programmable decoder.]
25
Cache Line Delete vs. PADded Cache
26
Outline

Introduction

Self-repair techniques
 Memories
 Processor cores
 Uncore components
 Conclusion
27
Core Sparing
 Utilize multi-/many-core architectures
 Core disabling also possible
 Already in commercial products
 IBM BlueGene/Q
 NVIDIA GeForce
 Cisco Metro
 Fine-grained approaches?
28
Core Cannibalization

Many-core designs with small in-order cores
 3.5% area cost (OpenRISC 1200)
[Romanescu 08]
29
Core Cannibalization: Discussion

What about diagnosis?

Routing → performance impact
 Additional wire delay pipeline stages
 Modified branch resolution, bypass logic, etc.
 Decreased clock frequency

Small vs. large number of faulty cores
30
Microarchitectural Block Disabling

Disable “half pipeline way” in superscalar designs
 12% area cost (includes diagnosis)
[Figure: superscalar pipeline with duplicated ways. Stages F, D, R precede the issue queue; stages Rd, E, M, W, C follow it.]
[Schuchman 05]
31
Microarchitectural Block Disabling

Disable “half pipeline way” in superscalar designs
 12% area cost (includes diagnosis)
[Figure: the same pipeline with shift stages (S) inserted; stages marked * (D, R, Rd, E, M, W, C) are modified.]
[Schuchman 05]
2 shift stages (S) added and lots of design modifications
32
Microarchitectural Block Disabling

Disable “half pipeline way” in superscalar designs
 12% area cost (includes diagnosis)
[Figure: the modified pipeline (shift stages S; stages marked * are modified) with the issue queue split into an old half and a new half.]
[Schuchman 05]
Split issue queue (same for store buffer, not shown)
33
Microarchitectural Block Disabling: Discussion

Expensive, complex, intrusive
 Diagnosis logic
 Reconfiguration logic

Coverage issues
 E.g., fetch stage not covered
34
Architectural Core Salvaging

Many-core CISC processor designs
[Figure: Applications 1 to n run on Cores 1 to n; each core is drawn with an advanced ISA portion and a basic ISA portion.]
[Powell 09]
35
Architectural Core Salvaging

Many-core CISC processor designs
[Figure: Application 1 on Core 1 and Application n on Core n both issue basic uops.]
[Powell 09]
36
Architectural Core Salvaging

Many-core CISC processor designs
[Figure: Application 1 on Core 1 now issues advanced uops; Application n on Core n issues basic uops.]
[Powell 09]
37
Architectural Core Salvaging

Many-core CISC processor designs
[Figure: thread swap between Core 1 (Application 1, advanced uops) and Core n (Application n, basic uops).]
[Powell 09]
38
Architectural Core Salvaging

Many-core CISC processor designs
[Figure: after the thread swap, Application n (basic uops) runs on Core 1 and Application 1 (advanced uops) runs on Core n.]
[Powell 09]
39
Architectural Core Salvaging: Discussion

What about diagnosis?

Applicability
 CISC-like architectures

Performance
 Depends

Coverage issues
 ~50% coverage of execution
40
Self-Repair for Processor Cores

Which technique to choose?

Software-assisted techniques?
41
Outline

Introduction

Self-repair techniques
 Memories
 Processor cores
 Uncore components
 Conclusion
42
Uncore Prevalent in SoCs

IBM Power 7
Uncore examples
 Cache / DRAM controller
 Accelerators
 I/O interfaces
NVIDIA Tegra
Cisco Network Processor
43
Uncore Self-Repair Essential
[Chart: OpenSPARC T2 breakdown. Uncore: 12%, processor cores: 12%, memories: 76%.]

Cores, memories, networks-on-chip
 Many existing techniques
44
Key Message [Li ITC13]
[Chart: chip area impact (%) vs. self-repair coverage (%). Existing sparing techniques: 20% area at 98% coverage.]
45
Key Message [Li ITC13]
[Chart: chip area impact (%) vs. self-repair coverage (%). Existing sparing techniques: 20% area at 98% coverage. Our techniques: 3.2% area at 75% coverage and 7.5% area at 98% coverage.]
46
Key Message [Li ITC13]
[Chart: chip area impact (%) vs. self-repair coverage (%). Existing sparing techniques: 20% area at 98% coverage. Our techniques: 3.2% area at 75% coverage and 7.5% area at 98% coverage.]
Performance: 0% (fault-free), 0.3% - 5% (faulty)
Power: 3%
47
Existing Techniques Inadequate
OpenSPARC T2
Uncore
48
Existing Techniques Inadequate
OpenSPARC T2
Uncore
[Figure: an original uncore component and a spare copy of it.]
49
Existing Techniques Inadequate
OpenSPARC T2
[Figure: original and spare components connected through steering logic.]
50
Existing Techniques Inadequate
OpenSPARC T2
[Figure: original and spare components connected through steering logic, with power gating.]
 20% chip area cost
51
New Uncore Self-Repair Techniques
ERRS: Enhanced Resource Reallocation and Sharing
SHE: Sparing through Hierarchical Exploration
OpenSPARC T2
8 cores, 64 threads
500M transistors
 7.5% chip area cost
52
Outline

Introduction

Self-repair for uncore
 ERRS
 SHE
 Conclusion
53
Basic Resource Reallocation & Sharing (RRS)

Already-existing “similar” components
 No spares
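[Figure: two groups of 4 cores connect through crossbar blocks to L2 bank control 0 / L2 mem 0 and L2 bank control 1 / L2 mem 1, under a self-repair control block.]
 1. Fault detected (L2 bank control 0)
 2. Resource sharing (L2 bank control 1)
 3. Reroute (crossbar blocks)
 8 cycle overhead (faulty case only)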
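
The three steps on this slide can be sketched as a routing decision. This is a toy model: the class name, the pairing of neighbouring controllers (0 with 1, 2 with 3, and so on), and the error handling are assumptions; only the 8-cycle overhead in the faulty case comes from the slide.

```python
REROUTE_OVERHEAD_CYCLES = 8  # from the slide: paid only in the faulty case


class BankRouter:
    """Routes L2 requests to a bank controller, sharing a neighbour on faults."""

    def __init__(self, n_banks):
        self.n_banks = n_banks
        self.faulty = set()

    def mark_faulty(self, controller):
        """Step 1: record that a bank controller has a detected fault."""
        self.faulty.add(controller)

    def route(self, bank):
        """Return (controller, extra_cycles) for a request to `bank`."""
        if bank not in self.faulty:
            return bank, 0
        # Steps 2 and 3: share the neighbouring controller and reroute.
        partner = bank ^ 1  # assumed pairing: 0<->1, 2<->3, ...
        if partner >= self.n_banks or partner in self.faulty:
            raise RuntimeError(f"no healthy controller can serve bank {bank}")
        return partner, REROUTE_OVERHEAD_CYCLES
```

In the fault-free case every request goes to its own controller with no extra cycles, which matches the slide's note that the overhead applies to the faulty case only.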
54
Basic RRS Performance Impact
Fault-free scenario
[Figure: the basic RRS system in the fault-free case. L2 bank control 0 and L2 bank control 1 each take a miss (MISS!); a fill buffer is shown between the two controllers.]
55
Basic RRS Performance Impact
Single faulty component: 70% impact
[Figure: the basic RRS system with one faulty component. DATA returning for both L2 memories passes through one fill buffer.]
56
Enhanced RRS (ERRS) Idea
Identify → Mitigate → Analyze → Desirable tradeoff?
 No: iterate again
 Yes: done
57
ERRS Mitigates RRS Performance Impact
Single faulty component: 3% impact
[Figure: the same system with ERRS; a fill buffer is drawn next to each L2 bank control.]
58
ERRS on OpenSPARC T2
 DRAM data return: duplicated
 L2 hit processing: duplicated
 DRAM bank control FSM: entries doubled
 L2 miss fill buffer: entries doubled
 3.2% area, 2.7% power
59
Performance Evaluation Setup
Simulator: GEM5
Simulated CMP:
 64 single-issue in-order processor cores
 1KB private L1 data and instruction caches
 4MByte shared L2 cache (8 banks)
 4 DRAM controllers
60
ERRS vs. Basic RRS: Performance Impact
ERRS: 3% performance impact
[Chart: average CPI overhead, basic RRS vs. ERRS, 64-core CMP, PARSEC benchmark programs. Bar labels: 1 faulty L2 controller, 3% and 0%; 1 faulty DRAM controller, 0.7% and 0.0%.]
61
ERRS vs. Basic RRS: Performance Impact
ERRS: 5% performance impact
[Chart: average CPI overhead, basic RRS vs. ERRS, 64-core CMP, stressed programs. Bar labels: 1 faulty L2 controller, 18% and 0%; 1 faulty DRAM controller, 25% and 0%.]
62
ERRS for Multiple Faulty Components
Graceful performance degradation
[Chart: average CPI impact (%), axis 0 to 25, for these sets of faulty components: 2 L2 bank controllers; 4 L2 bank controllers; 2 DRAM controllers; 4 L2 bank controllers + 2 DRAM controllers. 64-core CMP, PARSEC benchmark programs.]
63
Which Faults Repairable?
 Stuck-at, bridging, delay
 All faults inside a component
64
Which Faults Not Repairable?

Single points of failure
 Primary inputs
 Primary outputs
 Steering logic

Metric: self-repair coverage
 % non-single points of failure
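
A minimal sketch of how such a coverage figure could be computed. The slide does not say what the percentage is weighted by, so the use of a per-block size (gate count or area) is an assumption of this example, as is the function name.

```python
def self_repair_coverage(blocks):
    """Return self-repair coverage as a percentage.

    blocks: iterable of (size, is_single_point_of_failure).
    `size` is any consistent weight (gate count or area; the choice is
    an assumption of this sketch).
    """
    blocks = list(blocks)
    total = sum(size for size, _ in blocks)
    if total == 0:
        raise ValueError("empty design")
    repairable = sum(size for size, spof in blocks if not spof)
    return 100.0 * repairable / total
```

For instance, a component where 3 of 100 size units belong to primary inputs, primary outputs, or steering logic has 97% coverage under this weighting.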
65
ERRS Component Self-Repair Coverage
[Chart: self-repair coverage per component; the 15 reported values range from 96% to 98%, most of them 97%.]
66
ERRS Chip Area/Power Impact & Coverage
[Chart: post-layout chip area impact (%) vs. self-repair coverage (%). Existing sparing techniques: 20% area at 98% coverage. ERRS: 3.2% area at 75% coverage.]
Power: 3%
67
Outline

Introduction

Self-repair for uncore
 ERRS
 SHE
 Conclusion
68
SHE: Sparing through Hierarchical Exploration
Component RTL → SHE → sparing for component
Minimize area cost
 Identical blocks → spare shared
Balanced coverage vs. area
69
The Design Hierarchy
Level 1: network controller
Level 2: pio, rdmc
Level 3: chnl_1, ..., chnl_15, chnl_16 (inside rdmc)
...
Lowest level
70
Sparing in Different Levels: Coarse-Grained
[Figure: network controller with spares and steering logic added at Level 2 for pio and rdmc.]
71
Sparing in Different Levels: Mixed Levels
[Figure: network controller with a spare pio (Level 2) and, inside rdmc, a spare chnl for chnl_1 to chnl_16 (Level 3), plus steering logic.]
72
Sparing in Different Levels: Fine-Grained
[Figure: network controller with spares and steering logic added at the lowest level inside pio and rdmc.]
73
How to Obtain “Sweet-Spot”?
[Chart: area overhead and self-repair coverage (low to high) as sparing moves deeper in the hierarchy; the sweet spot is a balanced tradeoff between the two.]
Algorithm details in paper
74
SHE Results
[Chart: post-layout chip area impact (%) vs. self-repair coverage (%). Existing sparing techniques: 20% area at 98% coverage. ERRS: 3.2% area at 75% coverage.]
75
SHE Results
[Chart: post-layout chip area impact (%) vs. self-repair coverage (%). Existing sparing techniques: 20% area at 98% coverage. ERRS + SHE: 7.5% area at 98% coverage.]
SHE power impact: 0.2%
SHE performance impact: 0%
76
What About Diagnosis?

Traditional fault diagnosis difficult

Effect-cause: infeasible
 RTL unavailable

Cause-effect: impractical
 7 petabyte fault dictionary
 [Beckler ITC12]
77
Fault Diagnosis Simplified

CASP [Li DATE08, VTS10]
 Concurrent, Autonomous, Stored test Patterns
78
Fault Diagnosis Simplified

CASP [Li DATE08, VTS10]
 Concurrent, Autonomous, Stored test Patterns
Blocks = self-repair granularity
…
Block 1
Block n
79
Fault Diagnosis Simplified

CASP [Li DATE08, VTS10]
 Concurrent, Autonomous, Stored test Patterns
[Figure: Block 1 ... Block n, each with local test logic that reports Pass/fail 1 ... Pass/fail n.]
 Blocks = self-repair granularity
 Fail → self-repair needed
 Diagnosis = detection
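
When test blocks line up with the self-repair granularity, diagnosis reduces to reading the per-block pass/fail results. A minimal sketch, assuming the results are available as a mapping from block name to pass/fail (the function name and data layout are hypothetical):

```python
def blocks_needing_repair(pass_fail):
    """Return the sorted names of blocks whose self-test failed.

    pass_fail: dict mapping block name -> True (pass) or False (fail).
    Each failing block is exactly the unit that self-repair replaces,
    so no further fault localization is needed.
    """
    return sorted(name for name, passed in pass_fail.items() if not passed)
```

The returned list can be handed directly to the self-repair control, since each entry names a block at the repair granularity.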
80
Summary: Self-Repair of Uncore Components
[Chart: post-layout chip area impact (%) vs. self-repair coverage (%). Existing sparing techniques: 20% area at 98% coverage. ERRS: 3.2% area at 75% coverage. ERRS + SHE: 7.5% area at 98% coverage.]
Performance: 0% (fault-free), 0.3% - 5% (faulty)
Power: 3%
81
References
[Aitken 04] Aitken, R., “A Modular Wrapper Enabling High Speed
BIST and Repair for Small Wide Memories,” Proc. Intl. Test
Conf., pp. 997-1005, 2004.
[Chang 07] Chang, J., et al., “The 65-nm 16-MB Shared On-Die
L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series,”
IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp. 846-852,
2007.
[Kurdahi 06] Kurdahi, F.J., et al., “System-Level SRAM Yield
Enhancement,” Proc. Intl. Symp. Quality Electronic Design,
2006
82
References
[Li VTS10] Li, Y., et al., “Concurrent Autonomous Self-Test for
Uncore Components in System-on-Chips,” Proc. VLSI Test
Symposium, 2010.
[Li DATE08] Li, Y., S. Makar, and S. Mitra, “CASP: Concurrent
Autonomous Chip Self-Test using Stored Test Patterns,” Proc.
Design, Automation, and Test in Europe, pp. 885-890, 2008.
[Li ITC13] Li, Y., et al., “Self-Repair of Uncore Components in
Robust System-on-Chips: An OpenSPARC T2 Case Study,”
Proc. IEEE Intl. Test Conf., pp. 1-10, 2013.
83
References
[Powell 09] Powell, M.D., et al., “Architectural Core Salvaging
in a Multi-Core Processor for Hard-Error Tolerance,” Proc. Intl.
Symp. on Computer Architecture, pp. 93-104, 2009.
[Romanescu 08] Romanescu, B.F., and D.J. Sorin, “Core
Cannibalization Architecture: Improving Lifetime Chip
Performance for Multicore Processors in the Presence of Hard
Faults,” Proc. Intl. Conf. on Parallel Architectures and
Compilation Techniques, pp. 43-51, 2008.
84
References
[Sanda 08] Sanda, P.N., et al., “Fault-Tolerant Design of the
IBM Power6 Microprocessor,” IEEE Micro, vol. 28, no. 2, pp.
30-38, 2008.
[Schuchman 05] Schuchman, E., and T.N. Vijaykumar,
“Rescue: A Microarchitecture for Testability and Defect
Tolerance,” Proc. Intl. Symp. on Computer Architecture, pp.
160-171, 2005.
[Shirvani 99] Shirvani, P.P., and E.J. McCluskey, “PADded
Cache: A New Fault-Tolerance Technique for Cache
Memories,” Proc. VLSI Test Symp., pp. 440-445, 1999.
85
References
[Shivakumar 03] Shivakumar, P., et al., “Exploiting
Microarchitectural Redundancy for Defect Tolerance,” Proc.
Intl. Conf. on Computer Design, pp. 481-488, 2003.
[Sridharan 12] Sridharan, V., et al., “A Study of DRAM Failures
in the Field,” Proc. Intl. Conf. High Performance Computing,
Networking, Storage and Analysis, pp. 1-11, 2012.
86