Convergence in the Presence of Faults

Chapter 6
Self-Stabilization
Shlomi Dolev
MIT Press, 2000
Shlomi Dolev, All Rights Reserved
chapter 6 - Convergence in the Presence of Faults
1-1
Chapter 6: Convergence in the Presence of Faults - Motivation
• Processors crash and software (and hardware) may contain flaws
• Byzantine and crash failures are both well-studied models
• Algorithms that tolerate failures are of practical interest
• The focus of this presentation is the integration of self-stabilization with other fault models
Byzantine Faults
“Byzantine” – permanent faults
• The type of fault is not known in advance
• Processors can exhibit arbitrary “malicious”, “two-faced” behavior
• Models the possibility of code corruption
Byzantine Fault Model
• A Byzantine processor “fights” against the rest of the processors in order to prevent them from reaching their goal
• A Byzantine processor can send any message at any time to each of its neighbors
• If 1/3 or more of the processors are Byzantine, it is impossible to achieve basic tasks such as consensus in distributed systems
At least 1/3 of the processors are Byzantine ⇒ No convergence
• Assume there is a distributed algorithm AL that achieves consensus in a system with 3 processors in the presence of a single Byzantine processor
• Note that AL is designed to be executed in a 3-processor system. We will examine a six-processor ring that runs AL: P1, P2, P3 with input i = 0 and P’1, P’2, P’3 with input i = 1 (i = input value, c = consensus value). No processor in the ring is faulty
• P2 and P3 have the same input; from their point of view the rest of the ring may be a single Byzantine processor, and when the non-faulty processors have the same input, that input must be chosen. So P2 and P3 must decide c = 0
• P’1 and P’2 have the same input, so by the same argument they must decide c = 1
• BUT P3 and P’1 must also agree, since from their point of view the rest of the ring may be a single Byzantine processor. P3 must choose 0 and P’1 must choose 1. Contradiction!!
At least 1/3 of the processors are Byzantine ⇒ No convergence
We have just seen the impossibility result for 3 processors, but is it a special case?
Is it possible to reach consensus when the number of processors is 3f, where f > 1 is the number of Byzantine processors? No!
At least 1/3 of the processors are Byzantine ⇒ No convergence
Proof: (by reduction)
• Divide the system into 3 clusters (groups) of processors, one of which contains all the Byzantine processors.
• Replace each cluster by a super-processor that simulates the execution of the cluster.
• The existence of an algorithm for the case of 3f processors, f > 1, implies existence for f = 1, which we have proved impossible.
The Use of Self-Stabilization
• What happens if…
  • for a short period, 1/3 or more of the processors are faulty or perhaps temporarily crashed?
  • messages from a non-faulty processor are lost?
• Such temporary violations can be viewed as leaving a system in an arbitrary initial state
• Self-stabilizing algorithms that cope with Byzantine and transient faults and stabilize in spite of these faults are presented, and demonstrate the generality of the self-stabilization concept!
Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in
Asynchronous Systems
Digital Clock Synchronization - Motivation
• Multi-processor computers
• Synchronization is needed for coordination – clocks
  • Global clock pulse & global clock value
  • Global clock pulse & individual clock values
  • Individual clock pulse & individual clock values
• Fault-tolerant clock synchronization
Digital Clock Synchronization
• In every pulse each processor reads the values of its neighbors' clocks and uses these values to calculate its new clock value.
• The Goal:
(1) identical clock values
(2) the clock values are incremented by one in every pulse
Digital Clock Sync – Unbounded version

upon a pulse
  forall Pj ∈ N(i) do send (j, clocki)
  max := clocki
  forall Pj ∈ N(i) do
    receive(clockj)
    if clockj > max then max := clockj
  od
  clocki := max + 1

• A simple induction proves that this version of the algorithm is correct:
• If Pm holds the maximal clock value, then by the i-th pulse every processor at distance i from Pm holds the maximal clock value
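The max-propagation rule is easy to watch in simulation. The sketch below (Python, illustrative, not from the book; the names `pulse` and `neighbors` are mine) runs the unbounded version on a 4-processor line starting from an arbitrary post-fault configuration:

```python
def pulse(clocks, neighbors):
    # One synchronous pulse: every processor takes the maximum of its own
    # clock and its neighbors' clocks, then increments by one.
    return [max([clocks[i]] + [clocks[j] for j in neighbors[i]]) + 1
            for i in range(len(clocks))]

# A 4-processor line P0 - P1 - P2 - P3.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clocks = [5, 0, 9, 2]          # arbitrary initial (post-fault) clock values

for _ in range(3):             # the max reaches distance-i processors by pulse i
    clocks = pulse(clocks, neighbors)

print(clocks)                  # all four clocks are now equal
```

After the first pulse the maximal value held by P2 has reached its neighbors, and within the diameter every clock agrees and ticks in lockstep.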
Digital Clock Synchronization – Bounded version
• Unbounded clocks are a drawback in self-stabilizing systems
• The use of 2^64 possible values does not help create the illusion of “unbounded”:
• A single transient fault may cause the clock to reach the maximal clock value …
Digital Clock Sync – Bounded version (max)

upon a pulse
  forall Pj ∈ N(i) do send (j, clocki)
  max := clocki
  forall Pj ∈ N(i) do
    receive(clockj)
    if clockj > max then max := clockj
  od
  clocki := (max + 1) mod ((n+1)d + 1)

• The bound M = ((n+1)d + 1)
• Why is this algorithm correct?
• The number of different clock values can only decrease, and is eventually reduced to a single clock value
For Example:
A system of three processors p1, p2, p3 with M = ((n+1)d+1) = 4·2+1 = 9.
[Figure: the clock values of p1, p2, p3 over successive pulses; the number of distinct values shrinks until all three clocks agree and are then incremented together mod 9.]
Digital Clock Sync – Bounded version (max)
• Why is this algorithm correct?
• If all the clock values are less than M−d, we achieve sync before the modulo operation is applied: after d pulses there must be convergence, and since the maximal value started below M−d it grows by at most d, so it is still less than M and no wrap-around occurs
Digital Clock Sync – Bounded version (max)
• … Why is this algorithm correct? If not all the clock values are less than M−d:
• By the pigeonhole principle, in any such configuration there must be 2 clock values x and y such that y−x ≥ d+1 and there is no other clock value between them
• After M−y+1 pulses the system reaches a configuration in which all clock values are less than M−d
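Both cases of the argument can be observed in a small simulation. The sketch below (Python, illustrative, not from the book) runs the bounded max version with M = (n+1)d+1 on a 4-processor line, starting from a configuration in which one clock wraps around 0 on the very first pulse:

```python
n, d = 4, 3                      # 4 processors on a line, diameter 3
M = (n + 1) * d + 1              # clock bound: 16

def pulse(clocks, neighbors):
    # Max of own and neighbors' clocks, plus one, modulo M.
    return [(max([clocks[i]] + [clocks[j] for j in neighbors[i]]) + 1) % M
            for i in range(n)]

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clocks = [15, 3, 8, 0]           # 15 is about to wrap around to 0

distinct = []
for _ in range(6):
    clocks = pulse(clocks, neighbors)
    distinct.append(len(set(clocks)))

print(distinct)                  # the number of distinct values never grows
```

The count of distinct clock values is non-increasing and drops to 1, after which all clocks tick together modulo M.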
Digital Clock Sync – Bounded version (min)
• The bound M = 2d+1
• Why is this algorithm correct?
If no processor assigns 0 during the first d pulses – sync is achieved (can be shown by simple induction)
Else
• A processor assigns 0 during the first d pulses; d pulses after this point a configuration c is reached in which there is no clock value greater than d, and the first case holds

upon a pulse
  forall Pj ∈ N(i) do send (j, clocki)
  min := clocki
  forall Pj ∈ N(i) do
    receive(clockj)
    if clockj < min then min := clockj
  od
  clocki := (min + 1) mod (2d + 1)
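The min rule can be sketched the same way (Python, illustrative, not from the book), with M = 2d+1 on a 4-processor line. In this run no processor assigns 0 during the first d pulses, so sync follows by the induction of the first case:

```python
n, d = 4, 3                      # 4 processors on a line, diameter 3
M = 2 * d + 1                    # clock bound: 7

def pulse(clocks, neighbors):
    # Min of own and neighbors' clocks, plus one, modulo M.
    return [(min([clocks[i]] + [clocks[j] for j in neighbors[i]]) + 1) % M
            for i in range(n)]

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clocks = [6, 2, 5, 1]            # arbitrary initial clock values

for _ in range(d):               # the minimal value dominates within d pulses
    clocks = pulse(clocks, neighbors)

print(clocks)                    # all clocks equal, then tick modulo 7
```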
Digital clocks with a constant number of states are impossible
Considering only deterministic algorithms:
There is no uniform digital clock-synchronization algorithm that uses only a constant number of states per processor.
Thus, the number of clock values in a uniform system must be related to the number of processors or to the diameter.
Digital clocks with a constant number of states are impossible
• A special case will imply a lower bound for the general case
• A processor can read only the clocks of a subset of its neighbors
• In a unidirectional ring every processor has a left and a right neighbor, and can read the state of its left neighbor
• s_i^{t+1} = f(s_{i-1}^t, s_i^t)
• s_i^t – the state of Pi at time t; f – the transition function
• |S| – the constant number of states of a processor
• The proof shows that in every step, the state of every processor is changed to the state of its neighbor
Digital clocks with a constant number of states are impossible
• Use s1 and s2 to construct an infinite sequence of states such that s_{i+2} = f(s_i, s_{i+1})
• Since |S| is constant, there must be a sequence of states s_j, s_{j+1}, …, s_{k-1}, s_k that is a subset of this infinite sequence such that s_{k+1} = f(s_{k-1}, s_k) = s_j and s_{k+2} = f(s_k, s_j) = s_{j+1}
• Place the states s_j, s_{j+1}, …, s_k around the ring
[Figure: the ring labeled with s_j, …, s_k; in each pulse, the states are rotated one place around the ring.]
Digital clocks with a constant number of states are impossible
o Since the states of the processors encode the clock values, and the set of states just rotates around the ring, all the states must encode the same clock value.
o On the other hand, the clock value must be incremented in every pulse.
Contradiction.
Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in
Asynchronous Systems
Stabilizing in Spite of Napping
• A wait-free self-stabilizing clock-synchronization algorithm is a clock-synchronization algorithm that copes with transient and napping faults
• Each non-faulty operating processor ignores the faulty processors and increments its clock value by one in every pulse
• Given a fixed integer k, once a processor Pi works correctly for at least k time units and continues working correctly, the following properties hold:
  • Adjustment: Pi does not adjust its clock
  • Agreement: Pi's clock agrees with the clock of every other processor that has also been working correctly for at least k time units
Algorithms that fulfill the adjustment-agreement – unbounded clocks
• A simple example for k = 1, using unbounded clocks:
In every step each processor reads the clock values of the other processors, chooses the maximal value (denote it x), and assigns x+1 to its clock
Note that this approach won't work using bounded clock values
[Figure: after an execution step of P1, its clock holds the maximal clock value, and it won't adjust its clock as long as it doesn't crash; the clock value of a napping processor never changes until it starts to work again.]
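The k = 1 behaviour is easy to reproduce in shared memory. In the sketch below (Python, illustrative, not from the book; `step` and `active` are my names), all processors that step in the same pulse read the same snapshot, take the global max, and add one, while napping processors keep their stale values:

```python
def step(clocks, active):
    # All active processors read the same snapshot and write max+1;
    # napping processors (not in `active`) keep their stale clocks.
    m = max(clocks)
    for i in active:
        clocks[i] = m + 1

clocks = [10, 7, 5]        # arbitrary initial values
step(clocks, {0, 2})       # P1 naps: P0 and P2 both write 11
step(clocks, {0, 1, 2})    # P1 wakes up: everyone reads 11, writes 12
print(clocks)              # [12, 12, 12]
```

After its first step each processor holds the maximum and never needs to adjust again, which is exactly the k = 1 adjustment and agreement properties.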
Algorithms that fulfill the adjustment-agreement – bounded clock values
• Using bounded clock values (M)
The idea – identifying crashed processors and ignoring their values
• Each processor P has:
  • P.clock ∈ {0, …, M−1}
  • for every Q: P.count[Q] ∈ {0, 1, 2}
• P is behind Q if P.count[Q] + 1 (mod 3) = Q.count[P]

  P.count[Q]  Q.count[P]
      0           1
      1           2
      2           0
Algorithms that fulfill the adjustment-agreement – bounded solution
• The implementation is based on the concept of the “rock, paper, scissors” children's game:
[Figure: the three counter values play a cyclic game in which 1 beats 0, 2 beats 1, and 0 beats 2.]
Algorithms that fulfill the adjustment-agreement – bounded solution
The program for P:
1) Read every count and clock
2) Find the set R of processors that are not behind any other processor
3) If R ≠ ∅ then P finds a processor K with the maximal clock value in R and assigns P.clock := K.clock + 1 (mod M)
4) For every processor Q, if Q is not behind P then P.count[Q] := P.count[Q] + 1 (mod 3)
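The four steps above can be sketched in Python (illustrative, not from the book; the dict representation, the snapshot-per-pulse semantics, and the choice M = 8 are my assumptions). A crashed processor stops incrementing its counters, so after two pulses it looks "behind" everyone and drops out of R:

```python
import copy

M = 8  # clock bound (illustrative choice)

def behind(p, q, procs):
    # P is behind Q iff P.count[Q]+1 (mod 3) = Q.count[P].
    return (procs[p]["count"][q] + 1) % 3 == procs[q]["count"][p]

def pulse(active, procs):
    # Every active processor steps on the same snapshot of the state.
    view = copy.deepcopy(procs)
    ids = list(view)
    for i in active:
        # 2) R: processors that are not behind any other processor
        R = [p for p in ids
             if not any(behind(p, q, view) for q in ids if q != p)]
        # 3) adopt the maximal clock in R, plus one, modulo M
        if R:
            k = max(R, key=lambda p: view[p]["clock"])
            procs[i]["clock"] = (view[k]["clock"] + 1) % M
        # 4) advance the counter toward every processor not behind me
        for q in ids:
            if q != i and not behind(q, i, view):
                procs[i]["count"][q] = (procs[i]["count"][q] + 1) % 3

procs = {
    0: {"clock": 5, "count": {1: 0, 2: 0}},
    1: {"clock": 3, "count": {0: 0, 2: 0}},
    2: {"clock": 7, "count": {0: 0, 1: 0}},   # P2 has crashed
}
pulse({0, 1}, procs)   # P0, P1 still adopt the stale maximum of P2
pulse({0, 1}, procs)   # P2 now looks behind both and is ignored
print(procs[0]["clock"], procs[1]["clock"])   # equal after k = 2 steps
```

In the first pulse both active processors still copy P2's stale maximum; in the second, P2 is behind both and is excluded from R, so P0 and P1 agree and keep ticking together.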
Self-stabilizing Wait-free Bounded Solution – Run Sample
[Figure: four processors P1, …, P4 with their clock values and count arrays; active processors, simple connections, and “behind” connections are marked; k = 2.]
The algorithm presented is wait-free and self-stabilizing
• The algorithm presented is a wait-free self-stabilizing clock-synchronization algorithm with k = 2 (Theorem 6.1)
• All processors that take a step at the same pulse see the same view
• Each processor that executes a single step belongs to R, in which all the clock values are the same ⇒ the agreement requirement holds
• Every processor chooses the maximal clock value of a processor in R and increments it by 1 mod M ⇒ the adjustment requirement holds
• The proof assumes an arbitrary start configuration ⇒ the algorithm is both wait-free and self-stabilizing
Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in
Asynchronous Systems
Enhancing the fault tolerance
• Using a self-stabilizing algorithm ⇒ if temporary violations of the assumptions on the system occur, the system synchronizes the clocks when the assumptions hold again
• A Byzantine processor may exhibit a two-faced behavior, sending different messages to its neighbors
• If, starting from an arbitrary configuration, more than 2/3 of the processors are non-faulty during the future execution, the system will reach a configuration within k rounds in which the agreement and adjustment properties hold
Self-stabilizing clock synchronization algorithm
• Complete communication graph
• f = # of Byzantine faults
• Basic rules:
  • Increment – Pi finds n−f−1 clock values identical to its own
    The action – increment the clock value by 1, mod M
  • Reset – fewer than n−f−1 identical values are found
    The action – set Pi's clock value to 0
• After the 2nd pulse, there are no more than 2 distinct clock values among the non-faulty processors
• No distinct supporting groups for 2 values may coexist
No distinct supporting groups for 2 values may coexist
Suppose 2 such values exist: x and y.
• n−f processors gave (x−1), so there are at least n−2f non-faulty processors with (x−1)
• n−f processors gave (y−1), so there are at least n−2f non-faulty processors with (y−1)
• A non-faulty processor holds a single clock value, so there are at least 2n−4f non-faulty processors. Since n > 3f: 2n−4f > 2n−n−f = n−f, contradicting the fact that there are exactly n−f non-faulty processors.
How can a Byzantine processor prevent reaching 0 simultaneously, even after M−1 rounds?
[Figure: four processors with n−f−1 = 2 and f = 1; in each pulse the Byzantine processor supports some of the non-faulty processors, so P1 and P2 reset to 0 while P3 and P4 increment to 1, and vice versa in the next pulse.]
This strategy can yield an infinite execution in which the clock values of the non-faulty processors will never be synchronized.
The randomized algorithm
• A tool to ensure that the set of clock values of the non-faulty processors will eventually, with high probability, include only a single clock value
• If a processor reaches 0 using “reset”, and then has the possibility to increment its value, it tosses a coin
[Figure: four processors P1, …, P4 running the randomized algorithm; note that NO reset was done ⇒ the values were incremented automatically.]
Digital clocks in the presence of Byzantine processors

upon a pulse
  forall Pj ∈ N(i) do send (j, clocki)
  forall Pj ∈ N(i) do
    receive(clockj)  (* unless a timeout occurs – used since a Byzantine neighbor may not send a message *)
  if |{j | i ≠ j, clocki = clockj}| < n − f − 1 then
    clocki := 0
    LastIncrementi := false  (* LastIncrementi indicates a reset or an increment operation *)
  else
    if clocki ≠ 0 then
      clocki := (clocki + 1) mod M
      LastIncrementi := true
    else
      if LastIncrementi = true then clocki := 1
      else clocki := random({0,1})
      if clocki = 1 then LastIncrementi := true
      else LastIncrementi := false
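A single pulse of a non-faulty processor can be restated in Python (illustrative, not from the book; the function name `step` and the explicit `rng` parameter are mine). `received` holds the clock values read from the other n−1 processors, some of which a Byzantine processor may have forged:

```python
import random

def step(clock, last_increment, received, n, f, M, rng):
    # One pulse of a non-faulty processor; `received` holds the clock
    # values read from the other n-1 processors (Byzantine ones may lie).
    support = sum(1 for c in received if c == clock)
    if support < n - f - 1:            # reset
        return 0, False
    if clock != 0:                     # supported: increment modulo M
        return (clock + 1) % M, True
    if last_increment:                 # 0 reached by wrap-around: go on to 1
        return 1, True
    coin = rng.choice([0, 1])          # 0 reached by reset: toss a coin
    return coin, coin == 1

n, f, M = 4, 1, 12
print(step(5, True, [5, 5, 9], n, f, M, random.Random(0)))   # (6, True)
print(step(5, True, [4, 9, 9], n, f, M, random.Random(0)))   # (0, False)
print(step(0, True, [0, 0, 0], n, f, M, random.Random(0)))   # (1, True)
```

Only the last branch is randomized, which is exactly what defeats the Byzantine strategy of the previous slide: the adversary can no longer predict which processors will leave 0.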
The randomized algorithm
• If no sync is gained, then within a sequence of at most M successive pulses all non-faulty processors hold the value 0
• At least 1 non-faulty processor assigns 1 to its clock every M successive pulses
• In expected M·2^{2(n−f)} pulses, the system reaches a configuration in which the value of every non-faulty processor's clock is 1 (Theorem 6.2)
  • Proved using the scheduler-luck game
• The expected convergence time depends on M. What if M = 2^64?
Parallel Composition for Fast Convergence
• The purpose: achieving an exponentially better convergence rate while keeping a max clock value no smaller than 2^64
• The technique can be used in a synchronous system
• In every step Pi will:
  I. execute several independent versions of a self-stabilizing algorithm
  II. compute its output using the outputs of all versions
Parallel Composition for Fast Convergence
• Using the Chinese remainder theorem:
(D. E. Knuth. The Art of Computer Programming, vol. 2. Addison-Wesley, 1981)
• Let m1, m2, …, mr be positive integers that are relatively prime in pairs, i.e., gcd(mj, mk) = 1 when j ≠ k. Let m = m1·m2···mr, and let a, u1, u2, …, ur be integers. Then there is exactly one integer u that satisfies the conditions a ≤ u < a+m and u ≡ uj (mod mj) for 1 ≤ j ≤ r (Theorem 6.3)
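The existence-and-uniqueness claim of Theorem 6.3 can be checked by brute force for small moduli. A minimal Python sketch (the helper `crt_unique` is my name, not the book's):

```python
from math import gcd, prod
from itertools import combinations

def crt_unique(a, residues, moduli):
    # Brute-force check of the theorem: exactly one u in [a, a+m)
    # satisfies u = uj (mod mj) for every j.
    assert all(gcd(x, y) == 1 for x, y in combinations(moduli, 2))
    m = prod(moduli)
    hits = [u for u in range(a, a + m)
            if all((u - uj) % mj == 0 for uj, mj in zip(residues, moduli))]
    assert len(hits) == 1              # existence and uniqueness
    return hits[0]

# With moduli 2, 3, 5, 7 the residues (1, 2, 3, 4) determine exactly
# one value in [0, 210):
print(crt_unique(0, [1, 2, 3, 4], [2, 3, 5, 7]))   # 53
```

Shifting the window (any `a`) just picks the unique representative of the same residue class in the new interval.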
Parallel Composition for Fast Convergence
• Choose:
  • a = 0
  • the first r primes m1 = 2, m2 = 3, m3 = 5, …, mr such that 2·3·5···m(r−1) < M ≤ 2·3·5···mr
  • The l-th version uses the l-th prime ml for the value Ml (as the clock bound)
• A message sent by Pi contains r clock values (one for each version)
• The expected convergence time for all the versions to be synchronized is less than (m1+m2+…+mr)·2^{2(n−f)}
Parallel Composition for Fast Convergence
The Chinese remainder theorem states that every combination of the parallel version clocks corresponds to a unique clock value in the range 0, …, 2·3·5·7−1
[Figure: a table with one column per version clock (mod 2, 3, 5, and 7), filled pulse by pulse.]
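Once the versions are synchronized, ticking each version clock modulo its own prime is the same as ticking one combined clock modulo the product. A short Python sketch (illustrative, not from the book; `tick` and `combined` are my names):

```python
from math import prod

moduli = [2, 3, 5, 7]
m = prod(moduli)                 # combined clock bound: 210

def tick(version_clocks):
    # Each synchronized version increments modulo its own prime.
    return [(c + 1) % mj for c, mj in zip(version_clocks, moduli)]

def combined(version_clocks):
    # The unique value in [0, m) agreeing with every version clock
    # (found by brute force rather than the CRT formula, for clarity).
    return next(u for u in range(m)
                if all(u % mj == c for c, mj in zip(version_clocks, moduli)))

clocks = [0, 0, 0, 0]            # all versions synchronized at 0
for _ in range(12):
    clocks = tick(clocks)
print(clocks, combined(clocks))  # [0, 0, 2, 5] 12
```

The design point is the convergence-time tradeoff: the versions synchronize in time proportional to the *sum* of the primes, while the combined clock range is their *product*.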
Chapter 6: roadmap
6.1 Digital Clock Synchronization
6.2 Stabilization in Spite of Napping
6.3 Stabilization in Spite of Byzantine Faults
6.4 Stabilization in the Presence of Faults in
Asynchronous Systems
Stabilization in the Presence of Faults in Asynchronous Systems
• For some tasks a single faulty processor may cause the system not to stabilize
• Example – counting the number of processors in a ring communication graph, in the presence of exactly 1 crashed processor. Eventually each processor should encode n−1
[Figure: a ring of four processors P1, …, P4 with communication registers r12 = x, r13 = z, r43 = z, …]
• Assume the existence of a self-stabilizing algorithm AL that does the job in the presence of exactly one crashed processor, and consider a system with 4 processors
• We can stop P4 until P2 and P3 encode 2, as they must when the execution is indistinguishable from one in a smaller ring; the system then reaches a configuration c' that is not safe, since with 4 processors P2 and P3 should encode 3
• Repeating this schedule, the system never reaches a safe configuration
• Conclusion: It is NOT possible to design a self-stabilizing algorithm for the counting task!