Fault Tolerance, Lecture 07 - Electrical and Computer Engineering

advertisement
Fault-Tolerant Computing
Dealing with
Low-Level
Impairments
Oct. 2007
Fault Masking
Slide 1
About This Presentation
This presentation has been prepared for the graduate
course ECE 257A (Fault-Tolerant Computing) by
Behrooz Parhami, Professor of Electrical and Computer
Engineering at University of California, Santa Barbara.
The material contained herein can be used freely in
classroom teaching or any other educational setting.
Unauthorized uses are prohibited. © Behrooz Parhami
Oct. 2007
Edition
Released
Revised
First
Oct. 2006
Oct. 2007
Fault Masking
Revised
Slide 2
Fault Masking
Oct. 2007
Fault Masking
Slide 3
Oct. 2007
Fault Masking
Slide 4
Multilevel Model
Component
Logic
Ideal
Defective
Legend:
Legned:
Today Low-Level
Impaired
Faulty
Initial
Entry
Entry
Last
lecture
Information
System
Service
Result
Oct. 2007
Erroneous
Mid-Level
Impaired
Deviation
Malfunctioning
Remedy
Degraded
Failed
Tolerance
Fault Masking
High-Level
Impaired
Slide 5
Handling Faults
Fault
Avoid
Tolerate
Quality Assurance
Testing
Prevent
Repair
Yes
Fixed
Full?
Test
Miss
Monitor
Mask
Detect
Reconfigure
Abort
No
No
Screened
Faulty
Component
Oct. 2007
Miss
Discard
Injured
Static Redundancy
Expose
Remove
Detect
Perfect
Dynamic Redundancy
or
Faulty-safe
System
Fault Masking
Full?
Degraded
Yes
Restored
Unaffected
State
Slide 6
Some Options for Fault Tolerance
1. Detect and replace
Dynamic redundancy (cold/hot standby)
Detection via
-- coding, watchdog timer, self-checking
-- duplication (pair-and-spares)
Detector
1
Spare
2. Mask
Static redundancy
May revert to simplex instead of duplex
Design challenges include
-- synchronization for voting
-- voting on imprecise results
3. Mask, diagnose, and reconfigure
Hybrid redundancy
Fault masked at output, but diagnosed
-- e.g., via comparison with voter output
Faulty circuit is replaced by spare
Becomes static upon spare exhaustion
Oct. 2007
Fault Masking
D
2
1
Voter
2
V
3
1
2
S
V
3
Spare
4
Switch-voter
Slide 7
Comparing Fault Tolerance Schemes
Advantages
Drawbacks
Less power
(cold standby)
Long life
(just add spares)
Coverage factor
Immediate masking
Power/area penalty
High safety
Detector
1
Tolerance latency
Spare
D
2
1
Voter
2
Voter critical
V
3
Immediate masking
Power/area penalty
1
Long life and
high safety
Switch-voter critical
2
S
3
Spare
Oct. 2007
V
Fault Masking
4
Switch-voter
Slide 8
Inherent Fault Masking in Logic Circuits
a
b
c
d
1
e
0
g
0
0
0
0
f
0
0  1 fault in b is critical
h
0
z
1
0  1 fault in c or d is
not critical (it is masked)
1  0 fault in a or h is
not critical (it is masked)
Even nonredundant circuits have some masking capability
Is there a way to exploit the inherent masking capabilities
of logic gates to achieve fault tolerance?
Oct. 2007
Fault Masking
Slide 9
Interwoven Redundant Logic
1
a
b
c
d
1
e
g
1
z
f
10
a1
a2
b1
b2
a1
a2
b1
b2
a3
a4
b3
b4
a3
a4
b3
b4
1  0 change is critical for AND,
subcritical for OR
e1
0  1 change is critical for OR,
subcritical for AND
10
e2
Alternating layers of ANDs and ORs
can mask each other’s critical faults
e3
1
e4
f1
f2
f3
f4
Oct. 2007
h
Let x1, x2, x3, and x4 be
4 copies of the signal x
e1
e4
f1
f4
To mask h critical faults:
Number of gates multiplied by (h + 1)2
Gate inputs multiplied by h + 1
For h = 1, the scheme is known as
Quadded logic
Fault Masking
Slide 10
Interwoven Logic for Nanoelectronics
Half-adder implemented in quadded logic
a
b
s
c
a
b
s
c
IEEE D&T
July-Aug. 2005
pp. 328-339
From: http://ieeexplore.ieee.org/iel5/54/32070/01492293.pdf
Oct. 2007
Fault Masking
Slide 11
Highly Reliable Logic with “Crummy” Relays
Moore & Shannon, 1956
a: prob [contact made | energized]
c: prob [contact made | not energized]
No matter how crummy the relays
(i.e., how close the values of a and c),
one can interconnect many of them in
a redundant series-parallel structure
to achieve arbitrarily high reliability
“Make” contact
(normally open)
a>c
“Break” contact
(normally closed)
a<c
1
xy
x
y
prob [connection made | energized] =
2a2 – a4
(> a if a > 0.62)
x
x
prob [connection made | not energized] =
2c2 – c4
(always < c)
x
x
Oct. 2007
Fault Masking
a > 0.5, c < 0.5
Slide 12
TMR with Perfect Voter
R =
3Rm2
–
2Rm3
?
> Rm
1
Condition on the module reliability:
R = Rm [1 + (1 – Rm)(2Rm – 1)]
(1 – Rm)(2Rm – 1) > 0  Rm > 1/2
2
R
1.0
TMR better
0.5
Simplex better
0.5
1.0
Rm
RIFTMR/Simplex = (1 – Rm)/(1 – R)
= 1/[1 – Rm(2Rm – 1)]
Oct. 2007
TMR
0.5
0.0
0.0
V
3
R
1.0
Voter
Fault Masking
Simplex
0.0
0
ln 2
lt
MTTF: TMR 5/(6l)
Simplex 1/l
Slide 13
TMR with Imperfect Voter
?
R = Rv(3Rm2 – 2Rm3) > Rm
1
Condition on the voter reliability
Rv > 1 / [3Rm – 2Rm2]
2
Voter
V
3
dRvmin/ dRm = (–3 + 4Rm) / (3Rm – 2Rm2)2
Rv
Condition on the module reliability
3 – 9 – 8/Rv
3 + 9 – 8/Rv
< Rm <
4
4
0.95
TMR better
0.885
Example: Rv = 0.95 requires that
0.56 < Rm < 0.94
Oct. 2007
Fault Masking
Simplex better
0.5
0.56
0.75
1.0
0.94
Slide 14
Rm
TMR with Compensating Faults
Rm = 1 – p0 – p1 (0- and 1-fault probabilities)
1
R = (3Rm2 – 2Rm3) + 6p0p1Rm
2
Example: Rm = 0.998, p0 = p1 = 0.001
3
Voter
V
R = 0.999,984 + 0.000,006 = 0.999,990
Basic TMR
Compensation
RIFTMR/Simplex = 0.002 / 0.000,016 = 125
RIFCompen/TMR = 0.000,016 / 0.000,010 = 1.6
Oct. 2007
Fault Masking
Slide 15
Implementing a Bit-Voter
TMR bit-voting: y = x1x2  x2x3  x3x1
(carry output of a single-bit full-adder)
What about 5MR, 7MR?
1
Gate-level design quickly explodes in size
3
2
x1
x2
Bit-voter
V
y
x3
Other designs are also possible
Arithmetic: add the bits, compare to threshold
Mux-based
Selection-based (majority of bit values is their median)
3-out-of-5 voter built of 2-input gates
Oct. 2007
Two mux-based designs for a 3-out-of-5 bit-voter
Fault Masking
Slide 16
Complexity of Different Bit-Voter Designs
Cost of majority bit-voters as a function of the number n of inputs
Oct. 2007
Fault Masking
Slide 17
Voting at the Word Level
Using bit-by-bit voting may be dangerous
One might think that in this example, any of the
module outputs could be correct, so that
producing 1 0 at the output isn’t all that wrong
However, with bit-by-bit voting, the output may
be different from all inputs
x1 = 0
x2 = 1
x3 = 1
y =1
x1 = 0
x2 = 1
x3 = 1
y =1
0
0
1
0
0
0
1
0
0
1
0
0
Design of bit- and word-voting networks discussed in:
Parhami, B., “Voting Networks,” IEEE TR, Aug. 1991
Oct. 2007
Fault Masking
Slide 18
Some Simple Voter Designs
If in the case of 3-way disagreement
any of the inputs can be chosen,
then a simple design is possible
1
This design can be readily generalized
to a larger number of inputs
3
2
x1
Compare
x2
x3
Disagree
0
1
One can perform pseudo voting that yields the median of 3 analog
signals (Dennis, N.G., Microelectronics and Reliability, Aug. 1974)
Median and mean voting are also possible with digital signals
Oct. 2007
Fault Masking
Slide 19
y
A TMR Application and Its Bit-Voter
Single-event upset (SEU) = Soft error
Change of state caused by a high-energy particle strike
n+ diffusion layer
Data
pn junction field
due to impact
D
Q
C
D
SEU effect on DRAMs
(from SANYO website)
Q
C
TMR flip-flop for SEU tolerance
0
D
Q
1
Clock
Oct. 2007
Fault Masking
C
0
1
2
3
Output
Mux
Slide 20
Example: SEU Hardened Flip-Flop
AFB
D
ANQ
B
A
A
Y A
A
Y
Y
A
S
A
A
B
B
C
C
A
A
B
B
C
C
A
A
B
B
C
C
Y
BFB
BNQ
B
A
Y B
A
Y
Y
A
B
S
Y
A
Y
CFB
B
A
CNQ
A
Y C
A
Y
Y
S
G
C
Y
For list of flip-flop hardening methods and their comparison, see:
http://klabs.org/richcontent/fpga_content/pages/notes/seu_hardening.htm
Oct. 2007
Fault Masking
Slide 21
Switch for Standby Redundancy
Standby redundancy requires an
n-to-1 switch to select the output
of the currently active module
The detectors use various info
to deduce fault conditions
-- Error coding
-- Reasonableness checks
-- Watchdog timer
Detector
1
Spare
Once a fault has been detected,
the switch reconfigures the system
by flagging the faulty unit and
activating next spare in sequence
D
2
1
D
2
D
n-to-1
switch
Spares
3
D
If we use an n-to-2 switch and compare the two selected outputs,
the configuration is known as “pair-and-spares”
Oct. 2007
Fault Masking
Slide 22
Switch for Hybrid Redundancy
Hybrid redundancy with n active
and s spare modules requires an
(n + s)-to-n switch to select the
outputs of the active modules
Self-purging redundancy is a variant
of hybrid redundancy in which all
modules are active at the outset,
but they are purged as they
disagree with the majority output
Voter in self-purging redundancy
is a threshold voter that considers
the inputs with weights of 1 (active)
or 0 (purged)
Oct. 2007
Fault Masking
1
2
S
V
3
Spare
4
Switch-voter
..
.
Slide 23
Applications of nMR and Hybrid Redundancy
The Space Shuttle:
Uses 5-way redundancy in hardware
Originally, 3 operational units + 2 spares
(one warm, one cold)
More recently, 4 operational + 1 spare
Also, uses 2 independently developed
software systems (Design diversity)
Japanese Shinkansen “Bullet” Train
Triple-duplex system (6-fold redundancy)
Oct. 2007
Fault Masking
Slide 24
Download