Fault Tolerance
Fault tolerance terminology
- “dependability” - extent to which reliance can justifiably be placed on a service.
  - General concept
- “reliability” - continuity of service
  - Metric: mean time between failures (MTBF)
- “availability” - readiness for usage
- “safety” - avoidance of catastrophic effects on the environment
- “security” - resistance to unauthorized access.

Faults, errors, failures
- “fault” - component malfunction
- “error” - system state is wrong
- “failure” - system departs from specification
[Diagram: fault → error → failure]

[Diagram: a system and its components within an environment, with labels “fault” and “failure”]

Coping with faults
- Reduce/eliminate faults in components.
- Fault tolerance
  - Prevent faults from becoming failures, usually through redundancy.

Types of faults (fault models)
Fault tolerance algorithms depend on fault models.
- “Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component.
- “Timing fault” - response is too early or too late.
- “Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine the worst case).

The agreement problem
- Processors may fail
- … so, use multiple processors
- … but then, processors may disagree, causing failures.
- Need a principled approach to distributed agreement

Example: AFTI 16 (from J. Rushby)
- “Advanced Fighter Technology Integration F16”
- Triple-redundant digital flight-control system (DFCS) with analog backup
- DFCS design was “asynchronous”
  - Processors ran independently
  - Sample sensor, evaluate control law, send command to actuator
  - Actuator averages or selects from commands
- General Dynamics felt synchronization would introduce a single point of failure.

AFTI 16 problems
- Processors can get widely varying sensor readings because of timing differences
- Reconfiguration can cause sudden changes in control (“thumps”).
  - Need to allow wide range of “plausible values” before declaring a processor “bad”
  - Bad sensor reading drags average down
  - Sensor finally crosses threshold and is called “bad”
  - Average suddenly snaps back when sensor is excluded.

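The "thump" effect can be reproduced with a small simulation. This is a minimal sketch for illustration only: the median-based plausibility test, the threshold of 4.0 and the sensor values are assumptions made for the example, not the actual AFTI 16 redundancy-management logic.

```python
from statistics import mean, median

def voted_output(readings, threshold):
    """Average the channels that lie within `threshold` of the median.

    The median-based plausibility test is an assumption for this sketch.
    """
    mid = median(readings)
    plausible = [r for r in readings if abs(r - mid) <= threshold]
    return mean(plausible)

def simulate_drift(true_value=10.0, threshold=4.0, steps=7):
    """Two healthy sensors read true_value; the third drifts down by 1.0 per step."""
    outputs = []
    for drift in range(steps):
        readings = [true_value, true_value, true_value - drift]
        outputs.append(voted_output(readings, threshold))
    return outputs

outputs = simulate_drift()
# Step-to-step change in the voted output; the large positive jump is the "thump".
jumps = [later - earlier for earlier, later in zip(outputs, outputs[1:])]
```

While the drifting sensor is still inside the plausible range it pulls the average down a little at each step. Once it crosses the threshold it is excluded and the output jumps straight back to the healthy value. A tighter threshold shrinks the jump but makes false "bad" declarations more likely.
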
AFTI 16 problems (cont)
- Processor states can diverge rapidly
  - Especially when different processors go into different control modes.
- Design complexity
  - 70% of application code was for redundancy management
  - Control laws had to be modified to ramp changes in and out smoothly

AFTI 16 flight test, Flight 36
- “Departure” from control laws for 3 seconds
- Acceleration exceeded -4g, then +7g
- Angle of attack went to -10 degrees, then +20 degrees
- Aircraft rolled 360 degrees
- Cause: side air probe cut out at high angle of attack
- Analysis showed this would cause complete failure of DFCS for several areas of flight envelope

AFTI 16 flight 44
- Each channel declared the others failed
  - Asynchronous operation, timing skew, sensor noise
- Analog backup not selected
  - Simultaneous failure of two channels not anticipated
- Aircraft flown home on a single digital channel (not designed for this)
- There were no hardware failures.

AFTI 16 Analysis (NASA)
- Nearly all failure indications were design oversights related to asynchronous operation
- Failures due to lack of understanding of interactions among
  - Air data system
  - Redundancy management software
  - Flight control laws (decision points, thumps, ramp-in/out)
- Moral of the story: Reliability through redundancy is a lot harder than it looks.

Distributed consensus
- Goal: multiple processors agree on something in the presence of various kinds of faults and errors
- Intellectually difficult
  - Algorithms are tricky
  - Proofs are subtle
  - Sensitive to assumptions
    - Synchronous vs. asynchronous
    - Communication mechanism
    - Fault models
- Many papers written

Synchronous vs. asynchronous
- Synchronous: Processors run in lock-step
  - Hard to implement - model may be unrealistic
    - Requires clock synchronization.
  - Consensus is easier
- Asynchronous: Processors run at arbitrary speed
  - Easier to implement - model is conservative
  - In most models, consensus problem is provably unsolvable.

Synchronous vs. asynchronous (cont)
- Semi-synchronous
  - Bounds on how far out-of-sync processors can get
  - Model is fairly realistic
  - Consensus is almost as easy as synchronous

Fault models
- Goal: Make claims such as: “the system will continue to function if any single processor stops.”
- More conservative fault models:
  - Fault tolerance is harder
  - But, if successful, stronger claims can be made
  - Fewer assumptions = simpler FMEA, easier “certification”
- A lot of models have been proposed.

Process fault models
- “Stopping fault” - process stops sending messages
  - Does not restart
  - Does not send wrong messages
  - Liberal (easy) model
- “Byzantine fault” - process behaves arbitrarily
  - Name comes from cute “Byzantine generals” metaphor
  - May send arbitrary messages, enter arbitrary states
  - Equivalent to “evil” behavior, for our purposes

Synchronous agreement with stopping faults
- Multiple processes want to “agree” on a value
- Applications
  - Sensor readings among redundant processors
  - Decide what time it is
  - Decide which of a group of processors are broken and should be removed from system.

Synchronous agreement - properties
- Each process starts with an initial value; processes end with a decision value.
- Agreement: all good processes decide on the same value.
- Validity: if all processes start with the same value, that value is the final decision value.
- Termination: all good processes eventually decide.

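The three properties can be written as checks on the outcome of a run. This is a minimal sketch: the function names and the dictionary representation (process id mapped to value, with `decisions` holding only the good processes) are assumptions chosen for illustration.

```python
def check_agreement(decisions):
    """Agreement: every good process decided on the same value."""
    return len(set(decisions.values())) <= 1

def check_validity(initial_values, decisions):
    """Validity: if all processes start with the same value, that value is decided."""
    starts = set(initial_values.values())
    if len(starts) != 1:
        return True  # precondition not met, so validity holds vacuously
    (common,) = starts
    return all(decided == common for decided in decisions.values())

def check_termination(good_processes, decisions):
    """Termination (for a finished run): every good process has a decision."""
    return all(p in decisions for p in good_processes)
```

Checks like these are useful as assertions when testing an agreement algorithm against many fault patterns.
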
Flood set algorithm
- Assumption: There is a dedicated link between each pair of processes
- No more than f processes can stop
- Each process has an initial value v
- Each process accumulates a set W of all the values it has ever seen.
  - On each round, every process sends its W set to every other process
  - Every process sets W to the union of the old value and all the new values coming in from others.

Flood set
- After f+1 rounds, every process looks at W.
  - If W has only one value, choose that value.
  - Else, choose 0 (a predetermined default).

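The flood set steps can be sketched as a round-by-round simulation. This is a minimal sketch, not a reference implementation: it runs f+1 rounds, and the `crashes` schedule (process id mapped to the round in which it stops and the recipients it still reaches in that round) is a hypothetical representation invented for the example.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate flood set under stopping faults.

    initial: dict pid -> initial value
    crashes: dict pid -> (round, recipients reached in that round before stopping)
             (hypothetical crash-schedule format for this sketch)
    Returns dict pid -> decision for the processes that never stopped.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for rnd in range(1, f + 2):  # rounds 1 .. f+1
        inbox = {p: set() for p in alive}
        for sender in alive:
            if sender in crashes and crashes[sender][0] == rnd:
                recipients = crashes[sender][1]  # partial send, then stop
            else:
                recipients = alive - {sender}
            for receiver in recipients:
                if receiver in inbox:
                    inbox[receiver] |= W[sender]
        # Processes that stop in this round take no further part.
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            W[p] |= inbox[p]  # union of old W and everything received
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive}

# Three processes, one fault: process 3 reaches only process 1 before stopping.
decisions = flood_set({1: "A", 2: "A", 3: "B"}, f=1, crashes={3: (1, {1})})
```

In this run, process 1 learns B in the first round and passes it to process 2 in the second. Both then hold {A, B} and both choose the default.
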
Flood set correctness
- In f+1 rounds, there must be at least one round in which no processes stop
  - At most f processes can stop, and processes cannot stop more than once.
- If no process stops in round r, W will be the same in all good processes in subsequent rounds.
  - All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round.
  - After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.

Flood set correctness (cont)
- So, after f+1 rounds, all non-stopped processes have same W sets
  - If W has only one value, all processes pick this value.
  - Else all processes pick 0 (the default).

Flood set example
- 3 processes, 1 fault, default value = 0
- P3 dies after sending W to P1 but not P2

                P1      P2      P3
V0              A       A       B
W in round 0    {A}     {A}     {B}
W in round 1    {A,B}   {A}     -
W in round 2    {A,B}   {A,B}   -
final           0       0       -

- W sets for P1, P2 are same in round 2
- Choose default because |W|>1

Flood set efficiency
- O((f+1)n^2) messages
  - f+1 rounds
  - n processes send n messages per round
- O((f+1)n^3) values are sent (each message may have a set of up to n values)

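As a rough worked count, assuming each of the n processes sends to the n-1 others in every one of the f+1 rounds, and that a set W never holds more than n values:

```latex
\begin{align*}
\text{messages} &\le (f+1)\,n\,(n-1) = O\big((f+1)\,n^2\big),\\
\text{values}   &\le n \cdot (f+1)\,n\,(n-1) = O\big((f+1)\,n^3\big).
\end{align*}
```

For n = 3 and f = 1 this gives at most 2 · 3 · 2 = 12 messages. Processes that stop send fewer.
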
Optimized flood set
- Note: If W has more than one element, process doesn’t need to know what is in it.
- Idea: Every process sends only first two distinct values.
  - Every process sends its initial value on first round
  - If process receives a different value, it sends it out on next round
- Correctness proof: run Flood and OptFlood in parallel
  - Same initial values, stopping pattern
  - W sets have more than one value iff OptFlood process gets two values.

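The optimization can be sketched the same way. This is a minimal sketch under stated assumptions: it runs f+1 rounds, values are never `None`, and the `crashes` schedule (process id mapped to the round in which it stops and the recipients it still reaches) is a hypothetical format made up for the example. It also counts messages so the saving is visible.

```python
def opt_flood_set(initial, f, crashes=None, default=0):
    """Flood set variant in which each process broadcasts at most two values.

    crashes: dict pid -> (round, recipients reached in that round before stopping)
             (hypothetical crash-schedule format for this sketch)
    Returns (decisions for non-stopped processes, number of messages delivered).
    """
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}  # at most two distinct values kept
    outbox = dict(initial)                       # value to broadcast next, or None
    alive = set(initial)
    messages = 0
    for rnd in range(1, f + 2):  # rounds 1 .. f+1
        inbox = {p: [] for p in alive}
        for sender in alive:
            value = outbox[sender]
            if value is None:
                continue  # nothing new to report this round
            if sender in crashes and crashes[sender][0] == rnd:
                recipients = crashes[sender][1]
            else:
                recipients = alive - {sender}
            for receiver in recipients:
                if receiver in inbox:
                    inbox[receiver].append(value)
                    messages += 1
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            outbox[p] = None
            for value in inbox[p]:
                if len(seen[p]) < 2 and value not in seen[p]:
                    seen[p].append(value)
                    outbox[p] = value  # forward the first different value once
    decisions = {p: (seen[p][0] if len(seen[p]) == 1 else default) for p in alive}
    return decisions, messages

decisions, messages = opt_flood_set({1: "A", 2: "A", 3: "B"}, f=1, crashes={3: (1, {1})})
```

Each process sends its initial value once and at most one further value. That caps the per-process traffic at two broadcasts, however many rounds are run.
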
OptFlood efficiency
- 2n^2 messages
  - n processes send at most two messages to n other processes.
- O(n^2) values are sent

Byzantine agreement
- Goal: non-faulty processes should agree on a value.
  - E.g., message received
  - E.g., sensor value
- Faults may cause arbitrary behavior
  - Arbitrary values communicated
  - Different values communicated to different receivers
- Advantage: reduces fault analysis
- Disadvantage: hard or impossible to do.

Byzantine agreement properties
- Agreement: All good processes agree on a value
- Validity: If the source of the value was non-faulty, the agreed-upon value is the value the source sent.

Asynchronous agreement
- Asynchronous model:
  - Message transmission takes arbitrary time.
  - Processes run at arbitrary speeds.
- Theorem: There is no algorithm that reaches agreement in an asynchronous model with even one Byzantine failure
  - Fine print: Details of conditions, communication
- This is one of the most important results about distributed systems.

Synchronous agreement
- Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins.
- The agreement problem is solvable in this model.
- Theorem: Tolerating k Byzantine faults requires > 3k processes.
- So “Triple modular redundancy” can’t handle Byzantine faults.
- Practical case: 1 Byzantine fault, 4 processes.
- Assumes full connectivity (connections between each pair of processors).

Synchronous agreement with one fault
- Single transmitter communicates value to all processes.
- Round 0: Transmitter sends value to n-1 receivers.
  - Values are sent correctly if transmitter is not faulty.
- Round 1: Each receiver sends value to n-2 other receivers.
  - Receivers record all values separately.
  - Intuition: receivers compare notes on what transmitter told them.
- Each receiver chooses majority value of all values it received.
  - If no majority, use pre-arranged default value.

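The two rounds can be sketched for the receivers' side as follows. This is a minimal sketch: the function names, the `round0` dictionary (what the transmitter told each receiver) and the `relay` override used to model a faulty receiver are assumptions chosen for illustration.

```python
from collections import Counter

def majority(values, default=0):
    """Return the strict-majority value, or the pre-arranged default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def one_fault_agreement(round0, relay=None, default=0):
    """Each receiver decides from its own round 0 value plus the relayed ones.

    round0: dict receiver -> value the transmitter sent it in round 0
    relay:  dict (sender, recipient) -> value a faulty receiver forwards in
            round 1 instead of its true round 0 value (hypothetical format)
    """
    relay = relay or {}
    decisions = {}
    for me in round0:
        collected = [round0[me]]  # what the transmitter told me directly
        for other in round0:
            if other != me:
                # Honest receivers forward their round 0 value unchanged.
                collected.append(relay.get((other, me), round0[other]))
        decisions[me] = majority(collected, default)
    return decisions

# Faulty transmitter that tells P3 something different from P1 and P2.
faulty_xmtr = one_fault_agreement({"P1": 1, "P2": 1, "P3": 2})
# Faulty receiver P1 relays bogus values to P2 and P3.
faulty_rcvr = one_fault_agreement(
    {"P1": 1, "P2": 1, "P3": 1},
    relay={("P1", "P2"): 2, ("P1", "P3"): 3},
)
```

With a faulty transmitter all receivers see the same multiset of values, so they reach the same decision. With one faulty receiver the two honest receivers still hold a two-out-of-three majority for the transmitter's value. The decision computed for the faulty receiver itself is meaningless and should be ignored.
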
Example 1 - faulty transmitter
- Round 0: faulty xmtr sends varying results to rcvrs: P1 gets 1, P2 gets 1, P3 gets 2.
- Round 1: rcvrs exchange values (reliably)
- Finally, receivers take majority of all answers

Rcvr \ Xmtr    P1    P2    P3    consensus
P1             1     1     2     1
P2             1     1     2     1
P3             1     1     2     1

- The diagonal entries are the round 0 values

Example 2 - faulty transmitter
- Round 0: faulty xmtr sends varying results to rcvrs: P1 gets 1, P2 gets 2, P3 gets 3.
- Round 1: rcvrs exchange values (reliably)
- There is no majority, so rcvrs use default

Rcvr \ Xmtr    P1    P2    P3    consensus
P1             1     2     3     0
P2             1     2     3     0
P3             1     2     3     0

- The diagonal entries are the round 0 values

Example 3 - faulty receiver
- Round 0: non-faulty xmtr sends the same value to all rcvrs: P1 gets 1, P2 gets 1, P3 gets 1.
- Round 1: Process 1 sends bogus values
- Majority computes correct values for processes 2, 3
- Process 1 is broken, so its result is not required to be correct

Rcvr \ Xmtr    P1    P2    P3    consensus
P1             1     1     1     5
P2             2     1     1     1
P3             3     1     1     1

- The diagonal entries are the round 0 values

General case
- Previous algorithm can be generalized to handle more Byzantine faults.
- General results: k faults require k+1 rounds (k rounds if the first transmission is not counted), 3k+1 processors
- Number of messages grows exponentially with number of rounds
  - Intuition: “Pn said that Pn-1 said that ... P1 said that P0 said that the value was x”
  - There are exponentially many chains Pn ... P0.

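The growth can be made concrete with a short calculation. This is a minimal sketch, assuming the generalization relays every chain of distinct processes in the style of the classic recursive "oral messages" scheme; the function names are made up for illustration.

```python
def byzantine_requirements(k):
    """Minimum processes and rounds for k Byzantine faults (rounds include round 0)."""
    return {"processes": 3 * k + 1, "rounds": k + 1}

def message_count(n, k):
    """Total messages when every chain of distinct processes is relayed.

    Round 0 has n-1 messages; each later round multiplies by the number of
    processes not yet on the chain (assumed relay pattern for this sketch).
    """
    total, per_round = 0, 1
    for r in range(k + 1):
        per_round *= n - 1 - r
        total += per_round
    return total

one_fault = message_count(4, 1)   # 3 + 3*2
two_faults = message_count(7, 2)  # 6 + 6*5 + 6*5*4
```

Moving from one tolerated fault to two, at the minimum process counts, raises the total from 9 messages to 156.
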
Hybrid Byzantine agreement
- Idea: Free bonus reliability with the purchase of Byzantine agreement.
- Handles Byzantine faults, plus some simpler faults
  - Symmetric fault: process sends same wrong value to everyone.
  - Nonmalicious fault: process sends a recognizable error value.
- Advantages:
  - If processors have these faults, we can tolerate more faulty processors
  - These faults are more probable than true Byzantine faults - so this increases reliability

Hybrid Byzantine agreement (cont)
- Modify previous algorithm by adding special error value “E”.
  - Nonmalicious faults send E value (other faults may send E, also).
  - Majority algorithm first removes E values.
- Theorem: Algorithm reaches agreement if
  - n > 2a + 2s + b + r
  - a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission).
- Previous case: a=1, s=0, b=0, r=1, so n > 3
- With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults.
  - or 1 Byzantine and 1 symmetric
  - ... but just 1 Byzantine in previous algorithm

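The bound is easy to check mechanically. This is a minimal sketch that encodes only the inequality n > 2a + 2s + b + r as stated; the function names are illustrative, and any further side conditions of the theorem are not modeled.

```python
def hybrid_agreement_ok(n, a=0, s=0, b=0, r=1):
    """True if n processors satisfy n > 2a + 2s + b + r.

    a = Byzantine faults, s = symmetric faults, b = nonmalicious faults,
    r = number of rounds, excluding the first transmission.
    """
    return n > 2 * a + 2 * s + b + r

def min_processors(a=0, s=0, b=0, r=1):
    """Smallest n that satisfies the inequality."""
    return 2 * a + 2 * s + b + r + 1

plain_byzantine = min_processors(a=1)               # the earlier one-fault case
mixed_faults = hybrid_agreement_ok(6, a=1, b=2)     # 1 Byzantine + 2 nonmalicious
with_symmetric = hybrid_agreement_ok(6, a=1, s=1)   # 1 Byzantine + 1 symmetric
```

The nonmalicious term has weight 1 and the Byzantine and symmetric terms have weight 2. That is why six processors cover one Byzantine fault plus two nonmalicious faults, but only one additional symmetric fault.
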
Variations
- Synchronous communication is difficult
  - Compromise between synchronous and asynchronous: real-time constraints.
- “Authentication” - agreement can be made less costly by using digital signatures
  - Transmitter digitally signs messages
  - Processes can’t lie about who said what.
  - Can handle any number of faults (in synchronous model).
- May assume different network connectivity
  - Some links in network missing

Summary
- Fault tolerance is tricky. Redundancy does not necessarily buy reliability.
- Byzantine models can account for unforeseen fault types.
- Byzantine agreement is impossible in some models.
- There exist practical algorithms for Byzantine agreement if synchronous communication is available.
- There are deep theoretical results in this area.