ELEC 7770 Advanced VLSI Design Spring 2010 Soft Errors and Fault-Tolerant Design

ELEC 7770
Advanced VLSI Design
Spring 2010
Soft Errors and Fault-Tolerant Design
Vishwani D. Agrawal
James J. Danaher Professor
ECE Department, Auburn University
Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr10
Spring 2010, Apr 14 . . .
ELEC 7770: Advanced VLSI Design (Agrawal)
1
Soft Errors
• Soft errors are errors caused by the operating environment.
• They are not due to a permanent hardware fault.
• Soft errors are intermittent or random, which makes testing for them unreliable.
• One way to deal with soft errors is to make hardware robust:
  - Capable of detecting soft errors
  - Capable of correcting soft errors
  - Both measures are probabilistic
Some Early References
• J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
• M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
• T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.
Causes of Soft Errors
• Interconnect coupling (crosstalk).
• Power supply noise: IR-drop, power droop, ground bounce.
• Ignition noise.
• Electromagnetic pulse (EMP).
• Effects generally attributed to alpha-particles:
  - Charged particles: electrons, protons, ions.
  - Radiation (photons): X-rays, gamma-rays, ultra-violet light.
Sources of Alpha-Particles
• Radioactive contamination in VLSI packaging material.
• Ionosphere, magnetosphere and solar radiation.
• Other electromagnetic radiation.
Alpha-Particle
• Helium nucleus: two protons and two neutrons, mass = $6.65 \times 10^{-27}$ kg, charge = +2e ($e = 1.6 \times 10^{-19}$ C).
• Rest energy ($mc^2$) = 3.73 GeV
Soft Error Rate (SER)
• Failures in time (FIT): One FIT is 1 error per billion hours of operation.
• Alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
1 year MTBF = $10^9/(365 \times 24)$ = 114,155 FIT
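The FIT and MTBF units are reciprocals of each other up to the factor of one billion hours. A minimal sketch of the conversion follows; the function names are illustrative and not part of the course material.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years as in the slide


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a rate in FIT (errors per 1e9 hours)."""
    return 1e9 / mtbf_hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Convert a rate in FIT back to an MTBF in hours."""
    return 1e9 / fit


# A one-year MTBF expressed in FIT, rounded to the nearest integer.
print(round(mtbf_hours_to_fit(HOURS_PER_YEAR)))
```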
Particle Strike
[Figure: an ion or charged particle strikes an n+ region in a p- substrate; positive and negative charges are generated along its track]
Induced Current
[Figure: plot of induced current versus time]
$I(t) = I_0 (e^{-t/a} - e^{-t/b})$, $a \gg b$
Voltage Induced at a Node
$V = Q/C$
where $Q = \int I(t)\,dt$ and $C$ = node capacitance.
Smaller node capacitance will result in larger voltage swing.
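For the double-exponential current pulse $I(t) = I_0 (e^{-t/a} - e^{-t/b})$, integrating from 0 to infinity gives $Q = I_0 (a - b)$. The sketch below checks that closed form numerically and evaluates $V = Q/C$. All numeric values (peak current, time constants, capacitances) are hypothetical, chosen only to show the trend with capacitance.

```python
import math


def induced_current(t, i0, a, b):
    """Double-exponential current pulse, with a >> b."""
    return i0 * (math.exp(-t / a) - math.exp(-t / b))


def collected_charge(i0, a, b):
    """Closed-form integral of the pulse over [0, infinity)."""
    return i0 * (a - b)


def numeric_charge(i0, a, b, steps=100_000):
    """Trapezoidal integration over [0, 20a], where the pulse has decayed."""
    dt = 20 * a / steps
    total = 0.0
    for k in range(steps):
        left = induced_current(k * dt, i0, a, b)
        right = induced_current((k + 1) * dt, i0, a, b)
        total += 0.5 * (left + right) * dt
    return total


def node_voltage(i0, a, b, c_node):
    """Voltage disturbance V = Q / C at a node of capacitance c_node."""
    return collected_charge(i0, a, b) / c_node


# Hypothetical values: 100 uA scale factor, a = 200 ps, b = 50 ps.
I0, A, B = 100e-6, 200e-12, 50e-12
for c_node in (10e-15, 5e-15):  # 10 fF and 5 fF, illustrative only
    print(c_node, node_voltage(I0, A, B, c_node))
```

Halving the node capacitance doubles the voltage swing for the same collected charge, which is the point made on the slide.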
Effect on Digital Circuit
[Figure: sequential circuit with input IN, combinational logic, output OUT and clock CK; charged particles strike at two locations]
An SRAM Cell
[Figure: SRAM cell with word line WL, supply VDD, complementary bit lines BL, and complementary storage nodes holding 0 and 1]
SRAM Cell Struck by Alpha-Particle
Single-Event Upset (SEU)
[Figure: charged particles strike the SRAM cell; the stored values flip, 0→1 and 1→0]
A Resistor Hardened SRAM Cell
[Figure: resistor-hardened SRAM cell with word line WL, supply VDD, complementary bit lines BL, and storage nodes holding 0 and 1]
D-Latch
[Figure: D-latch with input D and complementary outputs Q, holding 1 and 0 while CK = 0]
SEU in D-Latch
[Figure: charged particles strike the D-latch while CK = 0; the stored values flip, 1→0 and 0→1]
Single Event Transients in Combinational Logic
[Figure: combinational logic between flip-flops clocked by CK, with logic values marked on the signals; charged particles strike an internal gate]
Effects of Transients
• Error correcting effects
  - Transient pulse is filtered by gate inertia
  - Transient is blocked by an unsensitized path
  - Transient is blocked by an inactive clock
• Error enhancing effects
  - Large number of gates can produce multiple pulses
  - Fanouts can multiply error pulses
Typical Soft Error Distribution
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust
System Design with Built-In Soft-Error Resilience,” Computer,
vol. 38, no. 2, pp. 43-52, February 2005.
Soft Error Simulation
• F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on VLSI Design, January 2009, pp. 459-464.
• F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010.
SEUs in FPGA
• Parts that can be affected
  - Look-up table (LUT)
  - Configuration memory cell
  - Flip-flop
  - Block RAM
• F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.
LUT
[Figure: four-input look-up table; inputs F1, F2, F3 and F4 address 16 memory cells holding 0s and 1s, and the selected cell drives out]
SEU in LUT
[Figure: a charged particle strikes one LUT memory cell, changing its stored 1 to 0 and so altering the logic function seen at out]
Four Types of SEU in FPGA
[Figure: FPGA logic block showing a LUT with inputs F1 to F4, a flip-flop (FF), configuration memory cells (M) and block RAM, with SEU locations labeled Type 1 through Type 4]
SEU Detection Methods
• Hardware redundancy
• Time redundancy
• Error detection codes (EDC)
• Self-checker techniques
SEU Mitigation Techniques
• Triple modular redundancy (TMR)
• Multiple redundancy with voting
• Error detection and correction codes (EDAC)
• Hardened memory cells
• FPGA-specific methods
  - Reconfiguration
  - Partial configuration
  - Rerouting design
Hardware Redundancy for Detection
[Figure: the inputs drive the combinational logic and a duplicated copy of it; the first copy provides the output, and a comparison of the two copies gives logic 1 to indicate an error]
Hardware overhead is high (~100%).
Performance penalty is negligible.
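Duplication with comparison can be sketched behaviorally as two evaluations of the same function followed by an XOR. The `logic` function below is a hypothetical stand-in for the combinational block, and `flip_copy` is an illustrative way to inject a single soft error into one copy.

```python
def logic(a, b, c):
    """Hypothetical combinational block, chosen only for illustration."""
    return (a & b) | c


def duplicate_and_compare(inputs, flip_copy=None):
    """Evaluate two copies of the logic and compare their outputs.

    flip_copy = 1 or 2 inverts the output of that copy to model a soft
    error. Returns (output, error), where error == 1 flags a mismatch.
    """
    out1 = logic(*inputs)
    out2 = logic(*inputs)
    if flip_copy == 1:
        out1 ^= 1
    elif flip_copy == 2:
        out2 ^= 1
    return out1, out1 ^ out2


print(duplicate_and_compare((1, 1, 0)))               # fault-free
print(duplicate_and_compare((1, 1, 0), flip_copy=2))  # error in the duplicate
```

The comparison only detects the error; it cannot tell which copy is wrong, which is why correction needs a third copy or a retry.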
Time Redundancy for Detection
[Figure: the combinational logic output is captured by two flip-flops, one clocked by CK and one by CK + d; the first provides the output, and a comparison of the two gives logic 1 to indicate an error]
Hardware overhead is low.
Performance penalty is about d, which is also the maximum detectable pulse width.
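The limit on detectable pulse width can be seen in a small timing model: a transient is caught only if it corrupts one of the two samples and not the other. The sketch below is a simplified model with ideal sampling instants; the function names and time values are assumptions for illustration.

```python
def sampled_value(t, correct, glitch_start, glitch_width):
    """Logic value seen at time t when a transient inverts the signal
    during [glitch_start, glitch_start + glitch_width)."""
    in_glitch = glitch_start <= t < glitch_start + glitch_width
    return correct ^ 1 if in_glitch else correct


def time_redundant_capture(correct, ck_edge, d, glitch_start, glitch_width):
    """Sample at CK and at CK + d; return (first, second, error flag)."""
    first = sampled_value(ck_edge, correct, glitch_start, glitch_width)
    second = sampled_value(ck_edge + d, correct, glitch_start, glitch_width)
    return first, second, first ^ second


# Narrow pulse (width 0.5 < d = 1.0): only the first sample is corrupted.
print(time_redundant_capture(1, 10.0, 1.0, 9.8, 0.5))
# Wide pulse (width 2.0 > d): both samples are corrupted, so no mismatch.
print(time_redundant_capture(1, 10.0, 1.0, 9.5, 2.0))
```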
Repeat on Error Detection
[Figure: the combinational logic output is captured by two flip-flops clocked by CK and CK + d; both feed a C-element (C) that drives the output, and a comparison of the two gives logic 1 to indicate an error]
Operation:
If error is detected, then output retains its previous value.
Repeating the computation can produce correct result.
Muller C-Element
A   B   output
0   0   0
0   1   Old output
1   0   Old output
1   1   1
[Figure: C-element symbol with inputs A and B, and an implementation built around an SR latch (S, R, Q)]
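The truth table above describes a state-holding element: the output follows the inputs when they agree and keeps its old value when they differ. A minimal behavioral sketch, with the class name `CElement` chosen for illustration:

```python
class CElement:
    """Behavioral model of a non-inverting Muller C-element."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        """Apply inputs a and b; return the (possibly held) output."""
        if a == b:
            self.output = a
        return self.output


c = CElement(initial=0)
for a, b in [(1, 1), (1, 0), (0, 0), (0, 1)]:
    print(a, b, c.update(a, b))
```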
Dynamic CMOS C-Element
A   B   output
0   0   1
0   1   Old output
1   0   Old output
1   1   0
[Figure: dynamic CMOS implementation of the C-element with inputs A and B]
Pseudostatic CMOS C-Element
A   B   output
0   0   1
0   1   Old output
1   0   Old output
1   1   0
[Figure: CMOS C-element with inputs A and B and a weak keeper on the output]
Built-In Soft Error Resilience (BISER)
[Figure: data from combinational logic is captured by a flip-flop and a duplicate flip-flop driven by the same clock; their outputs A and B feed a C-element with a weak keeper, which produces the output]
A   B   output
0   0   1
0   1   Old output
1   0   Old output
1   1   0
BISER
• Assumptions:
  - Most soft errors in combinational logic are eliminated by inertial and logic masking.
  - Soft error pulse generated in flip-flop is much shorter than clock period.
  - Probability of either a master or slave latch being struck by soft error exactly at clock edge is small.
• Flip-flop is duplicated and outputs fed to C-element.
• A twenty-fold reduction in soft errors was observed.
• Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.
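The masking behavior of the duplicated flip-flops and the inverting C-element can be checked exhaustively for single upsets. This is a behavioral sketch only, not the circuit from the cited paper; the `upset` argument is a hypothetical way of flipping one or both flip-flops.

```python
def inverting_c_element(a, b, old_output):
    """Inverting C-element with keeper: output changes only when a == b."""
    return 1 - a if a == b else old_output


def biser_output(stored, upset=None):
    """Output after an optional upset of flip-flop "A", "B" or "both".

    Both flip-flops first hold the same stored value, which settles the
    C-element output; the upset is then applied.
    """
    settled = inverting_c_element(stored, stored, old_output=0)
    a = stored ^ 1 if upset in ("A", "both") else stored
    b = stored ^ 1 if upset in ("B", "both") else stored
    return inverting_c_element(a, b, settled)


for stored in (0, 1):
    for upset in (None, "A", "B", "both"):
        print(stored, upset, biser_output(stored, upset))
```

A single upset leaves the output unchanged because the keeper holds the old value; only a simultaneous upset of both flip-flops corrupts it.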
Triple Modular Redundancy (TMR)
[Figure: the inputs drive three copies of the combinational logic (copy 1, copy 2, copy 3); a majority voter combines the three results into the output]
TMR Error Reduction
• Voter input error probability = E, assumed independent for each input.
• Output error probability,
$e = \mathrm{Prob}(\text{two errors or three errors}) = \binom{3}{2} E^2 (1 - E) + \binom{3}{3} E^3 = 3E^2 - 3E^3 + E^3 = 3E^2 - 2E^3$
• For very small E, $E^3 \ll E^2$, so $e \approx 3E^2$.
TMR Error Probability
Input error probability, E    Output error probability, e
0.0                           0.0
0.001                         0.000002998
0.01                          0.000298
0.1                           0.028
0.2                           0.104
0.3                           0.216
0.4                           0.352
0.5                           0.5
0.6                           0.648
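The table entries follow directly from $e = 3E^2 - 2E^3$. The sketch below recomputes them and cross-checks the polynomial against the binomial form; the function names are illustrative.

```python
from math import comb


def tmr_output_error(e_in):
    """Voter output error probability for independent input error e_in."""
    return 3 * e_in**2 - 2 * e_in**3


def tmr_output_error_binomial(e_in):
    """Same quantity as Prob(exactly two errors) + Prob(three errors)."""
    return comb(3, 2) * e_in**2 * (1 - e_in) + comb(3, 3) * e_in**3


for e_in in (0.0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"{e_in:<6} {tmr_output_error(e_in):.9f}")
```

Above E = 0.5 the voter output is worse than a single module, since two wrong inputs become more likely than two right ones.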
Majority Voter Circuit
A   B   C   output
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   1
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
[Figure: gate-level majority voter with inputs A, B and C, and its block symbol]
Alternative Implementations of Voter
[Figure: transistor-level voter connected to VDD with inputs A, B and C]
[Figure: LUT-based voter addressed by ABC, with contents 0, 0, 0, 1, 0, 1, 1, 1]
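The gate-level and LUT forms of the voter compute the same function. A minimal sketch that checks this for all eight input combinations, assuming the LUT is addressed with A as the most significant bit:

```python
def majority(a, b, c):
    """Majority function in sum-of-products form: AB + BC + AC."""
    return (a & b) | (b & c) | (a & c)


# LUT contents indexed by the 3-bit address ABC (A is the MSB; an assumption).
VOTER_LUT = [0, 0, 0, 1, 0, 1, 1, 1]


def majority_lut(a, b, c):
    """Majority function read from the look-up table."""
    return VOTER_LUT[(a << 2) | (b << 1) | c]


# One faulty copy out of three is outvoted.
print(majority(1, 0, 1))
```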
Triple Modular Redundancy (TMR)
[Figure: time-redundant TMR; the combinational logic output is captured by three flip-flops clocked at CK, CK + d and CK + 2d, a majority voter combines them, and an output flip-flop clocked at CK + 3d captures the voted result]
TMR for Memory Cells
[Figure: the combinational logic output feeds three flip-flops, all clocked by CK, whose outputs go to a majority voter that produces the output]
Problems:
1. Accumulation of errors in flip-flops.
2. Voter is not protected.
FF Refresh and TMR for Memory Cells
[Figure: three flip-flops clocked by CK, each refreshed through its own majority voter (signals r1, r2, r3), with a further majority voter producing the output]
Reliability Analysis
• Determine how long a system will work without failure.
• Find:
  - Mean time to failure (MTTF)
  - Mean time to repair (MTTR)
  - FIT rate
Reliability Function
• Reliability function of a system,
R(t) = Probability of survival at time t
• Determined from failure rates of components,
λ(t) = Number of failures per unit time
Generally varies with time.
Failure Rate, λ(t)
[Figure: bathtub curve of failure rate λ(t), in failures per second on a logarithmic axis from $10^{0}$ down to $10^{-12}$, versus time t; three regions: infant mortality, constant failure rate λ(t) = λ (useful life), and wearout or aging]
Deriving R(t)
• R(t) is the probability of no error in interval [0, t].
• Divide the interval into a large number (n) of subintervals of duration t/n. Let x be the probability of error in one subinterval.
• Assume that duration t/n is so small that either no error occurs or at most one error can occur. Then, average errors in a subinterval = $0 \cdot (1 - x) + 1 \cdot x = x = \lambda t/n$.
• Probability of no error in interval [0, t] is
$R(t) = (1 - x)^n = (1 - \lambda t/n)^n \to e^{-\lambda t}$ as $n \to \infty$, using the limit $\lim_{n \to \infty} (1 - a/n)^n = e^{-a}$.
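The convergence of $(1 - \lambda t/n)^n$ to $e^{-\lambda t}$ can be observed numerically. The values of λ and t below are arbitrary and chosen so that λt = 1.

```python
import math


def reliability_discrete(lam, t, n):
    """Probability of no error in n independent subintervals of length t/n."""
    return (1 - lam * t / n) ** n


lam, t = 1e-3, 1000.0  # illustrative values, lam * t = 1
for n in (10, 100, 10_000, 1_000_000):
    print(n, reliability_discrete(lam, t, n), math.exp(-lam * t))
```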
R(t) and MTBF
$R(t) = e^{-\lambda t}$
$\mathrm{MTBF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = 1/\lambda$
Reliability and MTBF
[Figure: reliability R(t) from 0.0 to 1.0 versus time t in multiples of MTBF (1, 2, 3 MTBF); at t = 1 MTBF, R(t) = 1/e = 0.368]
Example: First Generation Computer
• 10,000 electron tubes.
• Average burn out rate: 5 tubes per 100,000 hours.
Reliability of TMR
• R(TMR) = Prob(all three modules correct) + Prob(any two modules correct)
$= R^3 + 3R^2 (1 - R)$
$= 3R^2 - 2R^3$
$= 3e^{-2\lambda t} - 2e^{-3\lambda t}$
MTBF of TMR
$R(\mathrm{TMR}) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
$\mathrm{MTBF} = \int_0^\infty R(\mathrm{TMR})\,dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$
MTBF of TMR
[Figure: reliability R(t) versus time t for TMR and for a single module; the TMR curve lies above the single-module curve over the mission duration]
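TMR has a shorter MTBF than a single module (5/(6λ) versus 1/λ) yet a higher reliability early in life. Since $3R^2 - 2R^3 > R$ exactly when $R > 0.5$, the two curves cross at $t = \ln 2 / \lambda$. A sketch with an illustrative λ = 1:

```python
import math


def r_single(lam, t):
    """Reliability of one module with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """Reliability of TMR with a perfect voter: 3R^2 - 2R^3."""
    r = r_single(lam, t)
    return 3 * r**2 - 2 * r**3


def mtbf_single(lam):
    return 1 / lam


def mtbf_tmr(lam):
    """Integral of 3e^(-2 lam t) - 2e^(-3 lam t) over [0, infinity)."""
    return 3 / (2 * lam) - 2 / (3 * lam)


lam = 1.0  # illustrative failure rate
crossover = math.log(2) / lam
print(mtbf_single(lam), mtbf_tmr(lam), crossover)
print(r_single(lam, 0.1), r_tmr(lam, 0.1))  # short mission: TMR is better
print(r_single(lam, 1.0), r_tmr(lam, 1.0))  # long mission: TMR is worse
```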
Error Detection Code
• Errors: Bits can flip due to noise in circuits and in communication.
• Extra bits used for error detection.
• Example: a parity bit in ASCII code
7-bit ASCII code for A: 1000001
Even parity code for A (even number of 1s): 01000001
Odd parity code for A (odd number of 1s): 11000001
The leading bit is the parity bit.
A single-bit error in the 7-bit code of “A”, e.g., 1000101, will change the symbol to “E”, or 1000000 to “@”. But the error will be detected in the 8-bit code because the error changes the specified parity.
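Parity generation and checking take only a few lines. The sketch below follows the slide's convention of placing the parity bit in front of the 7-bit code; the function names are illustrative.

```python
def add_parity(code7: str, odd: bool = False) -> str:
    """Prefix a parity bit so the total number of 1s is even (or odd)."""
    parity = code7.count("1") % 2  # bit that makes the total even
    if odd:
        parity ^= 1
    return str(parity) + code7


def parity_ok(code8: str, odd: bool = False) -> bool:
    """Check that an 8-bit word has the expected parity."""
    return code8.count("1") % 2 == (1 if odd else 0)


word = add_parity("1000001")       # "A" with even parity
corrupted = word[:5] + "1" + word[6:]  # flip one bit, turning "A" into "E"
print(word, parity_ok(word))
print(corrupted, parity_ok(corrupted))
```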
Richard W. Hamming (1915-1998)
• Error-correcting codes (ECC).
• Also known for
  - Hamming distance: HD = number of bits in which two binary vectors differ
  - Example: HD(1101, 1010) = 3
  - Hamming Medal, 1988
The Idea of Hamming Code
Code space contains $2^N$ possible N-bit code words.
[Figure: code word 1010 (”A”) and its four neighbors at HD = 1: 0010 (”2”), 1110 (”E”), 1000 (”8”) and 1011 (”B”); a 1-bit error in “A” produces another valid code word]
Error not correctable. Reason: No redundancy.
Hamming’s idea: Increase HD between valid code words.
Hamming’s Distance ≥ 3 Code
[Figure: code words 1110100 (”E”), 0010101 (”2”), 1010010 (”A”), 0011110 (”3”), 1000111 (”8”) and 1011001 (”B”), each at HD = 3 or 4 from ”A”; a 1-bit error in “A” gives 0010010 (”?”), which is at HD = 1 from ”A” and HD = 2 from ”3”]
Shortest distance decoding eliminates the error.
Minimum Distance-3 Hamming Code
Symbol   Original code   Odd-parity code   ECC, HD ≥ 3
0        0000            10000             0000000
1        0001            00001             0001011
2        0010            00010             0010101
3        0011            10011             0011110
4        0100            00100             0100110
5        0101            10101             0101101
6        0110            10110             0110011
7        0111            00111             0111000
8        1000            01000             1000111
9        1001            11001             1001100
A        1010            11010             1010010
B        1011            01011             1011001
C        1100            11100             1100001
D        1101            01101             1101010
E        1110            01110             1110100
F        1111            11111             1111111
Original code: Symbol “0” with a single-bit error will be interpreted as “1”, “2”, “4” or “8”.
Reason: Hamming distance between codes is 1. A code with any bit error will map onto another valid code.
Remedy: Design codes with HD ≥ 2.
Example: Parity code. Single bit error detected but not correctable.
Remedy: Design codes with HD ≥ 3.
For single bit error correction, decode as the valid code at HD = 1.
For more error bit detection or correction, design code with HD ≥ 4.
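Shortest-distance decoding with the ECC column of the table can be sketched as a search for the nearest valid code word. This is a brute-force illustration, not an efficient syndrome decoder; the names `ECC`, `hamming_distance`, `minimum_distance` and `decode` are chosen here.

```python
from itertools import combinations

# ECC column of the table: symbol -> 7-bit code word.
ECC = {
    "0": "0000000", "1": "0001011", "2": "0010101", "3": "0011110",
    "4": "0100110", "5": "0101101", "6": "0110011", "7": "0111000",
    "8": "1000111", "9": "1001100", "A": "1010010", "B": "1011001",
    "C": "1100001", "D": "1101010", "E": "1110100", "F": "1111111",
}


def hamming_distance(x: str, y: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(x, y))


def minimum_distance(codes) -> int:
    """Smallest pairwise Hamming distance in a set of code words."""
    return min(hamming_distance(x, y) for x, y in combinations(codes, 2))


def decode(word: str) -> str:
    """Return the symbol whose code word is nearest to the received word."""
    return min(ECC, key=lambda s: hamming_distance(ECC[s], word))


print(hamming_distance("1101", "1010"))
print(minimum_distance(ECC.values()))
print(decode("0010010"))  # "A" (1010010) with its first bit flipped
```

With a minimum distance of 3, any single-bit error leaves the received word closer to the original code word than to any other, so the nearest-word search recovers the symbol.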
A Book on Coding Theory
R. W. Hamming, Coding and Information Theory,
Englewood Cliffs, New Jersey: Prentice-Hall,
1980.
Byzantine Empire, 527-565
Emperor Justinian and General Belisarius
Byzantine General’s Problem
• In a war a general needs to communicate an attack (a) or retreat (r) order to subordinates in the field.
• For success a perfect agreement is necessary.
• Byzantine Fault:
  - Subordinates can be unreliable or malicious.
  - Communication (messengers) can be unreliable or malicious.
Example 1: Single Fault
• General: D; Subordinates: A, B and C
[Figure: D sends the order r to A, B and C; a single fault changes one order (shown at A) from r to a]
Example 1: Majority Agreement
• General: D; Subordinates: A, B and C
[Figure: the subordinates exchange the orders they received; by majority, A, B and C all decide Retreat]
Example 2: Two Faults
• General: D; Subordinates: A, B and C
[Figure: D sends the order a to A, B and C]
Example 2: Byzantine Failure
• General: D; Subordinates: A, B and C
[Figure: with two faults, some exchanged orders are changed from a to r; the subordinates reach different decisions (Attack, Attack, Retreat), so agreement fails]
Byzantine Resilient System
• A system that can function correctly in the presence of Byzantine faults.
• Byzantine protocol for an n-node system:
  - Any node can initiate a message broadcast.
  - Every node rebroadcasts the received message to all nodes it has not heard from.
  - After communications end, nodes take a majority decision.
• Ref.: L. Lamport, R. Shostak and M. Pease, “The Byzantine General’s Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
Byzantine Resilience Conditions
• In order to tolerate t failures:
  - The system must have at least 3t + 1 nodes.
  - There must be at least 2t + 1 disjoint communication paths between nodes.
  - A node must exchange messages at least t + 1 times.
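The three conditions translate directly into minimum resource counts for a given number of tolerated failures. A small sketch, with function names chosen for illustration:

```python
def byzantine_requirements(t: int) -> dict:
    """Minimum resources needed to tolerate t Byzantine failures."""
    return {
        "nodes": 3 * t + 1,
        "disjoint_paths": 2 * t + 1,
        "message_rounds": t + 1,
    }


def max_tolerable_failures(nodes: int) -> int:
    """Largest t satisfying nodes >= 3t + 1."""
    return (nodes - 1) // 3


print(byzantine_requirements(1))
print(max_tolerable_failures(4))
```

A four-node system therefore tolerates one Byzantine failure, while three nodes tolerate none.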
Four-Core Processor System
[Figure: four processor cores A, B, C and D with communication links between them]
Example 1: C Initiates Message m, Sends n to A and m to B and D
Processor   First round   Second round   Decoded message
A           n             mm             m
B           m             mn             m
D           m             mn             m
Example 2: C Initiates Message m, B Sends p to A and D
Processor   First round   Second round   Decoded message
A           m             mp             m
B           m             mm             m
D           m             mp             m
Example 3: C Initiates Message m, A and B Generate Faulty Message q
Processor   First round   Second round   Decoded message
A           m             mq             m
B           m             mq             m
D           m             qq             q
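The decoded message in each table row is the majority of the three values a processor holds: one from the first round and two from the second. A minimal sketch of that vote follows; the function name `decode_majority` is illustrative, and ties (three different values) are not handled specially.

```python
from collections import Counter


def decode_majority(first_round: str, second_round: str) -> str:
    """Majority of the first-round message and the second-round messages.

    Each character of second_round is one rebroadcast message.
    """
    votes = Counter([first_round, *second_round])
    return votes.most_common(1)[0][0]


# Rows taken from the tables: (processor, first round, second round).
one_fault = [("A", "n", "mm"), ("B", "m", "mn"), ("D", "m", "mn")]
two_faults = [("A", "m", "mq"), ("B", "m", "mq"), ("D", "m", "qq")]

for name, rows in (("one fault", one_fault), ("two faults", two_faults)):
    print(name, {p: decode_majority(first, second) for p, first, second in rows})
```

With one fault among four nodes, every processor decodes m; with two faults, D decodes q, consistent with the 3t + 1 node requirement.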
References
• L. Lamport, R. Shostak and M. Pease, “The Byzantine General’s Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
• D. K. Pradhan, Fault-Tolerant Computer System Design, Upper Saddle River, New Jersey: Prentice Hall PTR, 1996.
• P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, San Francisco: Morgan Kaufmann, 2001.