Chapter 6
Self-Stabilization, Shlomi Dolev, MIT Press, 2000
Shlomi Dolev, All Rights Reserved
chapter 6 - Convergence in the Presence of Faults 1-1

Chapter 6: Convergence in the Presence of Faults - Motivation
• Processors crash, and software (and hardware) may contain flaws
• Byzantine and crash failures are both well-studied models
• Algorithms that tolerate failures are of practical interest
• The focus of this presentation is the integration of self-stabilization with other fault models

Byzantine Faults
• “Byzantine”: permanent faults
• The type of faults is not known in advance
• Processors can exhibit arbitrary “malicious”, “two-faced” behavior
• Models the possibility of code corruption

Byzantine Fault Model
• A Byzantine processor “fights” against the rest of the processors in order to prevent them from reaching their goal
• A Byzantine processor can send any message at any time to each of its neighbors
• If 1/3 or more of the processors are Byzantine, it is impossible to achieve basic tasks such as consensus in distributed systems

At least 1/3 of the processors are Byzantine: no convergence
• Assume there is a distributed algorithm AL that achieves consensus in the presence of a single Byzantine processor in a system with 3 processors
• Note that AL is designed to be executed on a system with only 3 processors
• We will examine a six-processor ring: P1, P2, P3 with input i=0 and P’1, P’2, P’3 with input i=1
• When the non-faulty processors have the same input, that input must be chosen
• P2 and P3 have the same input, and P1 may be Byzantine: they must choose 0 (c=0)
• P’1 and P’2 have the same input, and P’3 may be Byzantine: they must choose 1 (c=1)
• P3 and P’1 must decide on one value, that is, choose the same value (c=?)
• BUT P3 must choose 0 and P’1 must choose 1. Contradiction!!
• Legend: i = input value, c = consensus value

At least 1/3 of the processors are Byzantine: no convergence
• We have just seen the impossibility result for 3 processors, but is it a special case?
• Is it possible to reach consensus when the number of processors is 3f, where f>1 is the number of Byzantine processors?
• No!

At least 1/3 of the processors are Byzantine: no convergence
• Proof (by reduction):
• Divide the system into 3 clusters (groups) of processors, one of which contains all the Byzantine processors
• Replace each cluster by a super processor that simulates the execution of the cluster
• The existence of an algorithm for the case 3f, f>1, implies the existence of an algorithm for f=1, which we have proved impossible

The Use of Self-Stabilization
• What happens if...
  - For a short period, 1/3 or more of the processors are faulty or perhaps temporarily crashed?
  - Messages from a non-faulty processor are lost?
• Such temporary violations can be viewed as leaving the system in an arbitrary initial state
• Self-stabilizing algorithms that cope with Byzantine and transient faults, and stabilize in spite of these faults, are presented; they demonstrate the generality of the self-stabilization concept!
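
The clustering step of the reduction proof above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `clusters_for_reduction` and the choice to number processors 0 to n-1 are hypothetical and do not come from the slides.

```python
from typing import List, Set


def clusters_for_reduction(n: int, byzantine: Set[int]) -> List[List[int]]:
    """Split n = 3f processors into 3 clusters of f processors each,
    placing all Byzantine processors in the first cluster (hypothetical helper)."""
    f = len(byzantine)
    if f < 1 or n != 3 * f:
        raise ValueError("the reduction is stated for n = 3f with f >= 1")
    if not byzantine <= set(range(n)):
        raise ValueError("Byzantine ids must be in the range 0..n-1")
    # All remaining processors are non-faulty; split them into two equal clusters.
    honest = [p for p in range(n) if p not in byzantine]
    return [sorted(byzantine), honest[:f], honest[f:]]


# n = 6, f = 2: each cluster becomes one super processor, exactly one of them faulty.
print(clusters_for_reduction(6, {1, 4}))  # [[1, 4], [0, 2], [3, 5]]
```

Each cluster plays the role of one super processor, so the three super processors form exactly the 3-processor, single-Byzantine case already shown to be impossible.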

Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in Asynchronous Systems

Digital Clock Synchronization - Motivation
• Multi-processor computers
• Synchronization is needed for coordination: clocks
  - Global clock pulse & global clock value
  - Global clock pulse & individual clock values
  - Individual clock pulse & individual clock values
• Fault-tolerant clock synchronization

Digital Clock Synchronization
• In every pulse each processor reads the values of its neighbors’ clocks and uses these values to calculate its new clock value
• The goal:
  (1) identical clock values
  (2) the clock values are incremented by one in every pulse

Digital Clock Sync – Unbounded version
    upon a pulse
        forall Pj ∈ N(i) do send(j, clock_i)
        max := clock_i
        forall Pj ∈ N(i) do
            receive(clock_j)
            if clock_j > max then max := clock_j
        od
        clock_i := max + 1
• A simple induction proves that this version of the algorithm is correct: if Pm holds the maximal clock value, then by the i’th pulse every processor at distance i from Pm holds the maximal clock value

Digital Clock Synchronization – Bounded version
• Unbounded clocks are a drawback in self-stabilizing systems
• The use of $2^{64}$ possible values does not help in creating the illusion of “unbounded” clocks: a single transient fault may cause the clock to reach the maximal clock value ...

Digital Clock Sync – Bounded version (max)
    upon a pulse
        forall Pj ∈ N(i) do send(j, clock_i)
        max := clock_i
        forall Pj ∈ N(i) do
            receive(clock_j)
            if clock_j > max then max := clock_j
        od
        clock_i := (max + 1) mod ((n+1)d + 1)
• The boundary: M = (n+1)d + 1
• Why is this algorithm correct? The number of different clock values can only decrease, and is reduced to a single clock value

For example
• With three processors p1, p2, p3: M = (n+1)d + 1 = 4*2 + 1 = 9
[Figure: a clock dial with the values 0 to 8, and a table of the clock values of p1, p2 and p3 in successive pulses]

Digital Clock Sync – Bounded version (max)
• Why is this algorithm correct?
• If all the clock values are less than M-d, we achieve sync before the modulo operation is applied
• After d pulses there must be convergence, and the max value is less than M
[Figure: a clock dial marked 0, 1, 2, 3, ..., M-d-i, ..., M-d, ..., M-i, ..., M-2, M-1]

Digital Clock Sync – Bounded version (max)
• Why is this algorithm correct?
• If not all the clock values are less than M-d:
  - By the pigeonhole principle, in any configuration there must be 2 clock values x and y such that $y - x \ge d + 1$, and there is no other clock value between them
  - After M-y+1 pulses the system reaches a configuration in which all clock values are less than M-d

Digital Clock Sync – Bounded version (min)
• The boundary: M = 2d + 1
• Why is this algorithm correct?
• If no processor assigns 0 during the first d pulses, sync is achieved (can be shown by simple induction)
• Else, a processor assigns 0 during the first d pulses; d pulses after this point a configuration c is reached such that there is no clock value greater than d, and the first case holds
    upon a pulse
        forall Pj ∈ N(i) do send(j, clock_i)
        min := clock_i
        forall Pj ∈ N(i) do
            receive(clock_j)
            if clock_j < min then min := clock_j
        od
        clock_i := (min + 1) mod (2d + 1)

Digital clocks with a constant number of states are impossible
• Consider only deterministic algorithms
• There is no uniform digital clock-synchronization algorithm that uses only a constant number of states per processor
• Thus, the number of clock values in a uniform system must be related to the number of processors or to the diameter

Digital clocks with a constant number of states are impossible
• A special case will imply a lower bound for the general case
• A processor can read only the clocks of a subset of its neighbors
• In an undirected ring every processor has a left and a right neighbor, and can read the state of its left neighbor
• $s_i^{t+1} = f(s_{i-1}^t, s_i^t)$, where
  - $s_i^t$ is the state of Pi at time t
  - f is the transition function
  - |S| is the constant number of states of a processor
• The proof shows that in every step, the state of every processor is changed to the state of its right neighbor

Digital clocks with a constant number of states are impossible
• Use $s_1$ and $s_2$ to construct an infinite sequence of states such that $s_{i+2} = f(s_i, s_{i+1})$:
  $s_1,\ s_2,\ s_3 = f(s_1, s_2),\ \ldots,\ s_{l+2} = f(s_l, s_{l+1}),\ \ldots$
• Since there are at most $|S|^2$ distinct pairs of consecutive states, some pair must repeat
• There must be a sequence of states $s_j, s_{j+1}, \ldots, s_{k-1}, s_k$, that is, a subset of this infinite sequence, such that $f(s_{k-1}, s_k) = s_j$ and $f(s_k, s_j) = s_{j+1}$
• In other words, $s_{k+1} = s_j$ and $s_{k+2} = s_{j+1}$
• In each pulse, the states are rotated one place left
[Figure: a ring of processors holding the states $s_j, s_{j+1}, \ldots, s_{k-1}, s_k$, before and after a pulse]

Digital clocks with a constant number of states are impossible
• Since the states of the processors encode the clock values, and the set of states just rotates around the ring, we must assume that all the states encode the same clock value
• On the other hand, the clock value must be incremented in every pulse
• Contradiction

Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in Asynchronous Systems

Stabilizing in Spite of Napping
• A wait-free self-stabilizing clock-synchronization algorithm is a clock-synchronization
algorithm that copes with transient and napping faults
• Each non-faulty operating processor ignores the faulty processors and increments its clock value by one in every pulse
• Given a fixed integer k, once a processor Pi works correctly for at least k time units and continues working correctly, the following properties hold:
  - Adjustment: Pi does not adjust its clock
  - Agreement: Pi’s clock agrees with the clock of every other processor that has also been working correctly for at least k time units

Algorithms that fulfill the adjustment-agreement – unbounded clocks
• Simple example for k=1, using unbounded clocks
• In every step, each processor reads the clock values of the other processors, chooses the maximal value (denoted by x) and assigns x+1 to its clock
• After an execution of P1, its clock holds the maximal clock value, and it won’t adjust its clock as long as it doesn’t crash
• Note that this approach won’t work using bounded clock values
• The clock value never changes until the napping processor with the max value starts to work
[Figure: processors and their clock values, with P1 adopting the maximal value]

Algorithms that fulfill the adjustment-agreement – bounded clock values
• Using bounded clock values (M)
• The idea: identifying crashed processors and ignoring their values
• Each processor P has:
  - P.clock ∈ {0, ..., M-1}
  - for every processor Q, P.count[Q] ∈ {0, 1, 2}
• P is behind Q if P.count[Q] + 1 (mod 3) = Q.count[P]

  P is behind Q when:
  P.count[Q]   Q.count[P]
      0            1
      1            2
      2            0

Algorithms that fulfill the adjustment-agreement – bounded solution
• The implementation is based on the concept of the “rock, paper, scissors” children’s game
[Figure: the values 0, 1 and 2 arranged in a cycle]

Algorithms that fulfill the adjustment-agreement – bounded solution
The program for P:
1) Read every count and clock
2) Find the set R of processors that are not behind any other processor
3) If R ≠ ∅ then P finds a processor K with the maximal clock value in R and assigns P.clock := K.clock + 1 (mod M)
4) For every processor Q, if Q is not behind P then P.count[Q] := P.count[Q] + 1 (mod 3)

Self-stabilizing Wait-free Bounded Solution – Run Sample
[Figure: run sample with four processors P1 to P4, their clock values and the set R; legend: active processor, simple connection, “behind” connection; K=2]

The algorithm presented is wait-free and self-stabilizing
• The algorithm presented is a wait-free self-stabilizing clock-synchronization algorithm with k=2 (Theorem 6.1)
• All processors that take a step at the same pulse see the same view
• Each processor that executes a single step belongs to R, in which all the clock values are the same, so the agreement requirement holds
• Every processor chooses the maximal clock value of a processor in R and increments it by 1 mod M, so the adjustment requirement holds
• The proof assumes an arbitrary start configuration, so the algorithm is both wait-free and self-stabilizing

Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in Asynchronous Systems

Enhancing the fault tolerance
• Using a self-stabilizing algorithm: if temporary violations of the assumptions on the system occur, the system synchronizes the clocks when the assumptions hold again
• A Byzantine processor may exhibit two-faced behavior, sending different messages to its neighbors
• If, starting in an arbitrary configuration, more than 2/3 of the processors are non-faulty during the future execution, the system will reach a configuration within k rounds in which the agreement and adjustment properties hold

Self-Stabilizing clock synchronization algorithm
• Complete communication graph
• f = number of Byzantine faults
• Basic rules:
  - Increment: Pi finds n-f-1 clock values identical to its own. The action: (increment clock value by 1) mod M
  - Reset: fewer than n-f-1 are found. The action: set Pi’s clock value to 0
• After the 2nd pulse, there are no more than 2 distinct clock values among the non-faulty processors
• No distinct supporting groups for 2 values may coexist

No distinct supporting groups for 2 values may coexist
• Suppose 2 such values exist: x (held by p1) and y (held by p2)
• n-f processors gave (x-1), and n-f processors gave (y-1)
• There are at least n-2f non-faulty processors with (x-1)
• There are at least n-2f non-faulty processors with (y-1)
• There are at least 2n-4f non-faulty processors
• Since n>3f, the number of non-faulty processors is at least 2n-4f > 2n-n-f = n-f, a contradiction, since only n-f processors are non-faulty

How can a Byzantine processor prevent the non-faulty processors from reaching 0 simultaneously, even after M-1 rounds?
[Figure: four processors P1 to P4 with clock values 0 and 1; f = 1, n-f-1 = 2; captions mark the processors that will reset]
• This strategy can yield an infinite execution in which the clock values of the non-faulty processors will never be synchronized

The randomized algorithm
• Randomization is a tool to ensure that the set of clock values of the non-faulty processors will eventually, with high probability, include only a single clock value
• If a processor reaches 0 using “reset”, and has the possibility to increment its value, it tosses a coin
[Figure: four processors P1 to P4, marked “randomized”, with their clock values in successive pulses]
• Note that NO reset was done; the values were incremented automatically

Digital clocks in the presence of Byzantine processors
    upon a pulse
        forall Pj ∈ N(i) do send(j, clock_i)
        forall Pj ∈ N(i) do
            receive(clock_j) (* unless a timeout; used since a Byzantine neighbor may not send a message *)
        if |{j | i ≠ j, clock_i = clock_j}| < n - f - 1 then
            clock_i := 0
            LastIncrement_i := false (* LastIncrement_i indicates a reset or an increment operation *)
        else
            if clock_i ≠ 0 then
                clock_i := (clock_i + 1) mod M
                LastIncrement_i := true
            else
                if LastIncrement_i = true then clock_i := 1
                else clock_i := random({0,1})
                if clock_i = 1 then LastIncrement_i := true
                else LastIncrement_i := false

The randomized algorithm
• If no sync is gained after a sequence of at most M successive pulses all non-faulty processors hold the value 0
• At least 1 non-faulty processor assigns 1 to its clock every M successive pulses
• In expected $M \cdot 2^{2(n-f)}$ pulses, the system reaches a configuration in which the value of every non-faulty processor’s clock is 1 (Theorem 6.2)
• Proved using the scheduler-luck game
• The expected convergence time depends on M
• What if M = $2^{64}$?

Parallel Composition for Fast Convergence
• The purpose: achieving an exponentially better convergence rate while keeping the max clock value no smaller than $2^{64}$
• The technique can be used in a synchronous system
• In every step Pi will:
  I. execute several independent versions of a self-stabilizing algorithm
  II. compute its output using the outputs of all versions

Parallel Composition for Fast Convergence
• Using the Chinese remainder theorem (D. E. Knuth, The Art of Computer Programming, vol. 2, Addison-Wesley, 1981):
Let $m_1, m_2, \ldots, m_r$ be positive integers that are relatively prime in pairs, i.e., $\gcd(m_j, m_k) = 1$ when $j \ne k$. Let $m = m_1 m_2 \cdots m_r$, and let $a, u_1, u_2, \ldots, u_r$ be integers.
Then there is exactly one integer u that satisfies the conditions $a \le u < a + m$ and $u \equiv u_j \pmod{m_j}$ for $1 \le j \le r$ (Theorem 6.3)

Parallel Composition for Fast Convergence
• Choose a = 0 and r primes $2, 3, \ldots, m_r$ such that $2 \cdot 3 \cdot 5 \cdots m_{r-1} < M \le 2 \cdot 3 \cdot 5 \cdots m_r$
• The l-th version uses the l-th prime $m_l$ for the value $M_l$ (as the clock bound)
• A message sent by Pi contains r clock values (one for each version)
• The expected convergence time for all the versions to be synchronized is less than $(m_1 + m_2 + \cdots + m_r) \cdot 2^{2(n-f)}$

Parallel Composition for Fast Convergence
• The Chinese remainder theorem states that every combination of the parallel versions’ clocks corresponds to a unique clock value in the range 2·3·5·7···
[Figure: four parallel clocks with the moduli 2, 3, 5 and 7]

Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in Asynchronous Systems

Stabilization in the Presence of Faults in Asynchronous Systems
• For some tasks a single faulty processor may cause the system not to stabilize
• Example: counting the number of processors in a ring communication graph, in the presence of exactly 1 crashed processor. Eventually each processor should encode n-1
• Assume the existence of a self-stabilizing algorithm AL that does the job in the presence of exactly one crashed processor
• Let’s consider a system with 4 processors
• We can stop P4 until P2 and P3 encode 2: the system will reach a configuration c’ in which they encode 2
• Conclusion: c’ is not a safe configuration, and the system never reaches a safe configuration
• It is NOT possible to design a self-stabilizing algorithm for the counting task at all!
[Figure: ring of processors P1 to P4 with registers r12 = x, r13 = z and r43 = z]
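
Returning to the parallel composition for fast convergence in Section 6.3: the way r small clocks, one per prime, jointly encode one large clock value can be sketched as follows. This is a minimal illustration, not code from the book. The helper names `first_primes_covering`, `combine` and `pulse` are hypothetical, a = 0 is taken as on the slides, and the sketch assumes every version has already stabilized, so all versions are incremented together in each pulse.

```python
from math import prod
from typing import List, Sequence


def first_primes_covering(M: int) -> List[int]:
    """Shortest prefix 2, 3, 5, ... of the primes whose product is at least M."""
    primes: List[int] = []
    candidate = 2
    while prod(primes) < M:
        # candidate is prime iff no smaller prime divides it
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def combine(residues: Sequence[int], moduli: Sequence[int]) -> int:
    """The unique u in [0, m_1 * ... * m_r) with u = residues[j] (mod moduli[j])."""
    u, m = 0, 1
    for r_j, m_j in zip(residues, moduli):
        # Step by m so the congruences already satisfied are preserved;
        # pairwise coprimality guarantees a hit within m_j steps.
        while u % m_j != r_j % m_j:
            u += m
        m *= m_j
    return u


def pulse(residues: Sequence[int], moduli: Sequence[int]) -> List[int]:
    """One synchronized pulse: every version increments its own small clock."""
    return [(r + 1) % m for r, m in zip(residues, moduli)]


moduli = first_primes_covering(100)      # 2*3*5 < 100 <= 2*3*5*7
clocks = [52 % m for m in moduli]        # the r small clocks encoding the value 52
print(moduli, combine(clocks, moduli))   # [2, 3, 5, 7] 52
print(combine(pulse(clocks, moduli), moduli))  # 53
```

Each version only has to converge over a clock of size $m_l$, which is why the bound depends on the sum of the primes rather than on M itself. In this example the sum is 2+3+5+7 = 17, while the combined clock ranges over 210 values.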