Fault Tolerance

Fault tolerance terminology
- “dependability” - extent to which reliance can justifiably be placed on service. General concept.
- “reliability” - continuity of service
  - metric: mean time between failures (MTBF)
- “availability” - readiness for usage
- “safety” - avoidance of catastrophic effects on environment
- “security” - resistance to unauthorized access.

Faults, errors, failures
- “fault” - component malfunction
- “error” - system state is wrong
- “failure” - system departs from specification
- fault -> error -> failure

[Figure labels: System, System components, fault, failure, Environment]

Coping with faults
- Reduce/eliminate faults in components.
- Fault tolerance
  - Prevent faults from becoming failures
  - usually through redundancy.

Types of faults (fault models)
- Fault tolerance algorithms depend on fault models.
- “Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component.
- “Timing fault” - response is too early or late.
- “Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine worst case).

The agreement problem
- Processors may fail
- … so, use multiple processors
- … but then, processors may disagree, causing failures.
- Need a principled approach to distributed agreement.

Example: AFTI 16 (from J. Rushby)
- “Advanced Fighter Technology Integration” F16
- Triple-redundant digital flight-control system (DFCS) with analog backup
- DFCS design was “asynchronous”
  - processors ran independently
  - sample sensor, evaluate control law, send command to actuator
  - actuator averages or selects from commands
- General Dynamics felt synchronization would introduce a single point of failure.

AFTI 16 problems
- Processors can get widely varying sensor readings because of timing differences
- Reconfiguration can cause sudden changes in control (“thumps”).
- Need to allow wide range of “plausible values” before declaring a processor “bad”
- Bad sensor reading drags average down
- Sensor finally crosses threshold and is called “bad”
- Average suddenly snaps back when sensor is excluded.

AFTI 16 problems (cont)
- Processor states can diverge rapidly
  - especially when different processors go into different control modes.
- Design complexity
  - 70% of application code was for redundancy management
  - Control laws had to be modified to ramp changes in and out smoothly

AFTI 16 flight test, Flight 36
- “Departure” from control laws for 3 seconds
  - acceleration exceeded -4g, then +7g
  - Angle of attack went to -10 degrees, then +20 degrees
  - Aircraft rolled 360 degrees
- Cause: side air probe cut out at high angle of attack
- Analysis showed this would cause complete failure of DFCS for several areas of flight envelope

AFTI 16 flight 44
- Each channel declared the others failed
  - asynchronous operation, timing skew, sensor noise
- Analog backup not selected
  - simultaneous failure of two channels not anticipated
- Aircraft flown home on a single digital channel (not designed for this)
- There were no hardware failures.

AFTI 16 Analysis (NASA)
- Nearly all failure indications were design oversights related to asynchronous operation
- Failures due to lack of understanding of interactions among
  - Air data system
  - redundancy management software
  - flight control laws (decision points, thumps, ramp-in/out)
- Moral of the story: Reliability through redundancy is a lot harder than it looks.

Distributed consensus
- Goal: multiple processors agree on something in the presence of various kinds of faults and errors
- Intellectually difficult
  - Algorithms are tricky
  - Proofs are subtle
- Sensitive to assumptions
  - Synchronous vs. asynchronous
  - Communication mechanism
  - Fault models
- Many papers written

Synchronous vs. asynchronous
- Synchronous: Processors run in lock-step
  - Hard to implement - model may be unrealistic
  - Requires clock synchronization.
  - Consensus is easier
- Asynchronous: Processors run at arbitrary speed
  - Easier to implement - model is conservative
  - In most models, consensus problem is provably unsolvable.

Synchronous vs. asynchronous (cont)
- Semi-synchronous
  - Bounds on how far out-of-sync processors can get
  - Model is fairly realistic
  - Consensus is almost as easy as synchronous

Fault models
- Goal: Make claims such as: “the system will continue to function if any single processor stops.”
- More conservative fault models:
  - Fault tolerance is harder
  - But, if successful, stronger claims can be made
  - Fewer assumptions = simpler FMEA, easier “certification”
- A lot of models have been proposed.

Process fault models
- “Stopping fault” - process stops sending messages
  - does not restart
  - does not send wrong messages
  - liberal (easy) model
- “Byzantine fault” - process behaves arbitrarily
  - Name comes from cute “Byzantine generals” metaphor
  - May send arbitrary messages, enter arbitrary states
  - Equivalent to “evil” behavior, for our purposes

Synchronous agreement with stopping faults
- Multiple processes want to “agree” on a value
- Applications
  - sensor readings among redundant processors
  - decide what time it is
  - decide which of a group of processors are broken and should be removed from system.

Synchronous agreement - properties
- Each process starts with an initial value; processes end with a decision value.
- Agreement: all good processes decide on the same value.
- Validity: if all processes start with the same value, that value is the final decision value.
- Termination: All good processes eventually decide.

Flood set algorithm
- Assumptions:
  - There is a dedicated link between each pair of processes
  - No more than f processes can stop
- Each process has an initial value v
- Each process accumulates a set W of all the values it has ever seen.
- On each round, every process sends its W set to every other process
- Every process sets W to the union of the old value and all the new values coming in from others.

Flood set
- After f+1 rounds, every process looks at W.
- If W has only one value, choose that value.
- Else, choose 0 (a predetermined default).

Flood set correctness
- In f+1 rounds, there must be at least one round in which no processes stop
  - At most f processes can stop, and processes cannot stop more than once.
- If no process stops in round r, W will be the same in all good processes in subsequent rounds.
  - All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round.
  - After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.

Flood set correctness (cont)
- So, after f+1 rounds, all non-stopped processes have same W sets
- If W has only one value, all processes pick this value.
- Else all processes pick 0 (the default).

Flood set example
3 processes, 1 fault, default value = 0

                 P1      P2      P3
V0               A       A       B
W in round 0     {A}     {A}     {B}
W in round 1     {A,B}   {A}     -       P3 dies after sending W to P1 but not P2
W in round 2     {A,B}   {A,B}   -       W sets for P1, P2 are same
final            0       0       -       Choose default because |W|>1

Flood set efficiency
- O((f+1) n^2) messages
  - f+1 rounds
  - n processes send n messages per round
- O((f+1) n^3) values are sent (each message may have a set of up to n values)

Optimized flood set
- Note: If W has more than one element, process doesn’t need to know what is in it.
- Idea: Every process sends only first two distinct values.
  - Every process sends its initial value on first round
  - If process receives a different value, it sends it out on next round
- Correctness proof: run Flood and OptFlood in parallel
  - same initial values, stopping pattern
  - W sets have more than one value iff OptFlood process gets two values.

OptFlood efficiency
- 2n^2 messages
  - n processes send at most two messages to n other processes.
- O(n^2) values are sent

Byzantine agreement
- Goal: non-faulty processes should agree on a value.
  - e.g., message received
  - e.g., sensor value
- Faults may cause arbitrary behavior
  - arbitrary values communicated
  - different values communicated to different receivers
- Advantage: reduces fault analysis
- Disadvantage: hard or impossible to do.

Byzantine agreement properties
- Agreement: All good processes agree on a value
- Validity: If source of value was non-faulty, the agreed-upon value is the source’s value.

Asynchronous agreement
- Asynchronous model:
  - Message transmission takes arbitrary time.
  - Processes run at arbitrary speeds.
- Theorem: There is no algorithm that reaches agreement in an asynchronous model with even one Byzantine failure
  - Fine print: Details of conditions, communication
- This is one of the most important results about distributed systems.

Synchronous agreement
- Synchronous model:
  - Processes can communicate in a sequence of rounds.
  - All processes complete a round before next round begins.
- The agreement problem is solvable in this model.
- Theorem: Tolerating k Byzantine faults requires > 3k processes.
  - So “Triple modular redundancy” can’t handle Byzantine faults.
  - Practical case: 1 Byzantine fault, 4 processes.
- Assumes full connectivity (connections between each pair of processors).

Synchronous agreement with one fault
- Single transmitter communicates value to all processes.
- Round 0: Transmitter sends value to n-1 receivers.
  - Values are sent correctly if transmitter is not faulty.
- Round 1: Each receiver sends value to n-2 other receivers.
  - Receivers record all values separately.
  - Intuition: receivers compare notes on what transmitter told them.
- Each receiver chooses majority value of all values it received.
  - If no majority, use pre-arranged default value.

Example 1 - faulty transmitter
- Round 0: faulty xmtr sends varying results to rcvrs.
  - Round 0 values: P1 = 1, P2 = 1, P3 = 2
- Round 1: rcvrs exchange values (reliably)
- Finally, receivers take majority of all answers

             value from
Rcvr      P1    P2    P3    consensus
P1        1     1     2     1
P2        1     1     2     1
P3        1     1     2     1

Example 2 - faulty transmitter
- Round 0: faulty xmtr sends varying results to rcvrs.
  - Round 0 values: P1 = 1, P2 = 2, P3 = 3
- Round 1: rcvrs exchange values (reliably)
- There is no majority, so rcvrs use default

             value from
Rcvr      P1    P2    P3    consensus
P1        1     2     3     0
P2        1     2     3     0
P3        1     2     3     0

Example 3 - faulty receiver
- Round 0: non-faulty xmtr sends the same value to all rcvrs.
  - Round 0 values: P1 = 1, P2 = 1, P3 = 1
- Round 1: Process 1 sends bogus values
- Majority computes correct values for processes 2, 3
- Process 1 is broken, so its result is not required to be correct

             value from
Rcvr      P1    P2    P3    consensus
P1        1     1     1     5
P2        2     1     1     1
P3        3     1     1     1

General case
- Previous algorithm can be generalized to handle more Byzantine faults.
- General results: k faults require k+1 rounds (counting round 0), 3k+1 processors
- Number of messages grows exponentially with number of rounds
  - Intuition: “pn said that pn-1 said that ... p1 said that p0 said that the value was x”
  - There are exponentially many chains pn ... p0.

Hybrid Byzantine agreement
- Idea: Free bonus reliability with the purchase of Byzantine agreement.
- Handles Byzantine faults, plus some simpler faults
  - Symmetric fault: process sends same wrong value to everyone.
  - Nonmalicious fault: process sends a recognizable error value.
- Advantages:
  - If processors have these faults, we can tolerate more faulty processors
  - These faults are more probable than true Byzantine faults - so this increases reliability

Hybrid Byzantine agreement (cont)
- Modify previous algorithm by adding special error value “E”.
  - Nonmalicious faults send E value (other faults may send E, also).
  - Majority algorithm first removes E values.
- Theorem: Algorithm reaches agreement if n > 2a + 2s + b + r
  - a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission).
- Previous case: a=1, s=0, b=0, r=1, so n > 3
- With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults (2 + 0 + 2 + 1 = 5 < 6),
- or 1 Byzantine and 1 symmetric (2 + 2 + 0 + 1 = 5 < 6)
- ... but just 1 Byzantine in previous algorithm

Variations
- Synchronous communication is difficult
  - Compromise between synchronous and asynchronous: real-time constraints.
- “Authentication” - agreement can be made less costly by using digital signatures
  - transmitter digitally signs messages
  - processes can’t lie about who said what.
  - can handle any number of faults (in synchronous model).
- May assume different network connectivity
  - Some links in network missing

Summary
- Fault tolerance is tricky.
- Redundancy does not necessarily buy reliability.
- Byzantine models can account for unforeseen fault types.
- Byzantine agreement is impossible in some models.
- There exist practical algorithms for Byzantine agreement if synchronous communication is available.
- There are deep theoretical results in this area.
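
As a concrete instance of the last summary point, the one-fault synchronous algorithm from the earlier slides (round 0: transmitter sends; round 1: receivers exchange; then majority with a default) can be sketched in a few lines of Python. This is only an illustrative simulation, not an implementation from the lecture: the names `majority`, `one_fault_agreement`, and the `relay` parameter are hypothetical, receivers are indexed from 0, and a faulty receiver is modelled simply by overriding what it relays.

```python
from collections import Counter

DEFAULT = 0  # pre-arranged default value, as in the slide examples


def majority(values, default=DEFAULT):
    """Return the value held by a strict majority of `values`, else `default`."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else default


def one_fault_agreement(sent, relay=None):
    """Simulate round 0 and round 1 for one transmitter and len(sent) receivers.

    sent[i] : value the transmitter sends to receiver i in round 0
              (a faulty transmitter may send different values).
    relay   : hypothetical override {(i, j): v}, meaning receiver i tells
              receiver j the value v in round 1 instead of what it really
              received (models a faulty receiver).
    Returns one decision per receiver. The decision computed for a faulty
    receiver is meaningless, since a broken process may output anything.
    """
    relay = relay or {}
    n = len(sent)
    decisions = []
    for j in range(n):
        # Receiver j records its own round 0 value plus what every other
        # receiver claims to have been told by the transmitter.
        heard = [sent[j] if i == j else relay.get((i, j), sent[i])
                 for i in range(n)]
        decisions.append(majority(heard))
    return decisions


if __name__ == "__main__":
    print(one_fault_agreement([1, 1, 2]))  # Example 1
    print(one_fault_agreement([1, 2, 3]))  # Example 2
    print(one_fault_agreement([1, 1, 1], relay={(0, 1): 2, (0, 2): 3}))  # Example 3
```

Run on the three slide examples, the sketch gives every receiver 1 in Example 1, the default 0 in Example 2, and 1 for the two good receivers in Example 3. It does not model the transmitter and a receiver being faulty at once, which is outside the one-fault assumption.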