lecture notes

DISTRIBUTED SYSTEMS
Department of Computing Science
Umea University
Fundamental Concepts
Distributed Systems - D N Ranasinghe
• About Distributed Computing
– devising algorithms for a set of processes that seek to achieve
some form of a ‘cooperative goal’
– quoting Leslie Lamport: ‘a distributed system is one in which
the failure of a computer you did not even know existed can
render your own computer unusable’
Distributed Algorithm
• has no shared global information: each process decides only on its
local state and the messages it receives
• has no shared global time frame: observes progress of
computation through at best a partial order of events
• non deterministic behaviour: cannot predict the exact sequence
of global states from the study of the algorithm
Design challenges from a systems perspective
• heterogeneity: in hardware, OS, mode of interaction (c-s, p2p etc),
middleware provisioning for developers
• security: involves eavesdropping, deliberate corruption, process
compromise, denial of service, etc.
• scalability: robustness, performance bottlenecks
• process failures: detecting/suspecting, masking, tolerating, recovery,
redundancy in the presence of partial process failure
• concurrency
• transparency:
– access (local and remote resources accessed through identical operations)
– location (resources accessed independently of physical location)
– concurrency (process concurrency on shared resources)
– replication (maintaining replicas with consistency)
– failure (concealment of failures)
– mobility (movement of resources and clients)
Role of middleware
• software layer with services provided to the applications
designer
• consisting of processes and objects
• mechanisms:
– Remote Method Invocation
– object brokering
– Service Oriented Architecture
– event notification
– distributed shared memory…
[Figure: layered architecture, top to bottom: Applications/Services, Middleware, Operating System, Computer & Network H/W]
Motivating application domains
• information dissemination (publish-subscribe paradigm): by event
registration and notification with time-space decoupling property,
based on reliable broadcast and agreement abstractions
• process control in automation, in industrial systems etc., where
consensus may have to be reached on a multitude of sensor inputs
• cooperative work: multi-user cooperation in editing etc., based on
shared persistent space paradigm employing ordered broadcast
abstractions
• distributed databases: need for atomic commitment abstraction on
acceptance or rejection of serialized transactions
Motivating application domains
• software based fault tolerance through replication: uses the so
called state machine replication paradigm
– when a centralized server is required to be made highly available by
executing several copies of it whose consistency is guaranteed by total
order broadcast abstraction
Modeling of distributed systems
• abstraction:
– to capture properties that are common to a large range of systems, so that
the fundamental can be distinguished from the accessory
– to avoid reinventing the wheel for every minor variant of the problem
• a model abstracts the key components and the way they
interact
• purpose:
– to make explicit all relevant assumptions about the system
– to express behaviour through algorithms
– to make impossibility observations etc. through logical analysis, including proofs
Modeling of distributed systems
• abstracting the physical model: processes, links and failure detectors
(latter an indirect measurement of time)
[Figure: five processes, numbered 1 to 5, connected by links]
Modeling of distributed systems
• component properties:
– channel (a communication resource) - message delays, message loss
– process (a computational resource, has only local state) – can incur
process failure, be infinitely slow or corrupt
• low level models of interaction: synchronous message passing,
asynchronous message passing
[Figure: a process consisting of internal computation (modules of the process), with a receive interface for incoming messages and a send interface for outgoing messages]
Modeling of distributed systems
• failure detector abstraction: a possible way to capture the notion of
process and link failures based on their timing behaviour
• each process incorporates a failure detector, a specialized module
which emits a heartbeat to the others
• a failure detector can be considered an indirect abstraction of
time: a timeout is simply an indication of a failure, mostly unreliable,
with an outcome of either suspected or unsuspected
• a synchronous system => a ‘perfect failure detector’
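A minimal sketch of the timeout idea above, assuming heartbeats are reported to a local detector object together with a local clock reading. The class name `TimeoutFailureDetector`, its methods and the `timeout` value are illustrative assumptions, not part of the notes.

```python
class TimeoutFailureDetector:
    """Suspects a process when no heartbeat arrived within `timeout` time units."""

    def __init__(self, processes, timeout):
        self.timeout = timeout
        # Local time of the last heartbeat seen from each monitored process.
        self.last_heartbeat = {p: 0.0 for p in processes}

    def heartbeat(self, process, now):
        """Record a heartbeat received from `process` at local time `now`."""
        self.last_heartbeat[process] = now

    def suspected(self, now):
        """Return the set of processes whose heartbeat is overdue at time `now`."""
        return {p for p, t in self.last_heartbeat.items() if now - t > self.timeout}


fd = TimeoutFailureDetector(["p1", "p2", "p3"], timeout=2.0)
fd.heartbeat("p1", now=1.0)
fd.heartbeat("p2", now=1.0)
fd.heartbeat("p1", now=3.0)            # p2 and p3 stay silent from here on
print(sorted(fd.suspected(now=4.0)))   # ['p2', 'p3']
```

A suspicion is only a guess: in an asynchronous system p2 may simply be slow, which is why the outcome is "suspected" rather than "crashed". Only with a known bound on delays can the timeout be chosen so that the detector is perfect.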
Modeling of distributed systems
• clock: physical and logical
• abstracting a process: by the process failure model
[Figure: process failure classes: crashes, omissions, crashes & recoveries, arbitrary]
Modeling of distributed systems
• crashes: a faulty process, as opposed to a correct process (which
executes an infinite number of steps), does no further local
computation, generates no messages and does not respond to messages
– a crash does not preclude a recovery later but this is considered another
category
– also the correctness of any algorithm may depend on a maximally admissible
number of faulty processes
• arbitrary faults: a process that deviates arbitrarily from
the algorithm assigned to it
– also known as malicious or Byzantine faults; in practice they may simply be
due to a bug in the program
– under such conditions some algorithmic abstractions may be
‘impossible’
Modeling of distributed systems
• omission failure: due to network congestion or buffer overflow,
resulting in a process being unable to send messages
• crash-recovery: a process either crashes and stays down (fail-stop), or
crashes and recovers infinitely often
– every process that recovers is assumed to have stable storage (also called a
log) accessible through some primitives, which stores the most recent local
state with time stamps
– alternatively, processes that never crash could also act as virtual stable
storage
Modeling of distributed systems
• abstracting communication: by loss or corruption of messages, also
known as communication omission
• usually resolved through end-to-end network protocol support
unless of course there is a network partition
• Desirable properties for ‘reliable’ delivery of messages
• liveness: any message in the outgoing buffer of sender is
‘eventually’ delivered to the incoming message buffer of receiver
• safety: the message received is identical to the one sent, and no
messages are delivered twice
• Abstracting other higher level interactions
• e.g., capturing recurring patterns of interaction in the form of
– distributed agreement (on an event, a sequence of events, etc.)
– atomic commitment (whether to take an irrevocable step or not)
– total order broadcast (i.e., agreeing on the order of actions)
• these abstractions lead to a wide range of algorithms
Modeling of distributed systems
• Predicting impossibility results in higher level interactions
• in some cases due to the indistinguishability of network failures from process
failures, or of a slow process from a network delay
• e.g., agreement in the presence of message loss, agreement in the
presence of process failures in asynchronous situations
• Impossibility of agreement in the presence of message loss
• leads to a widely used assumption in almost all models
• typical two army problem
• formal model described below
Formal model of the two army problem
• processes A and B communicate by sending and receiving
messages on a bidirectional channel; A sends a message to B,
then B sends a message to A and so on
• A and B can execute two actions, α and β
• neither process can fail but the channel can lose messages
• desired outcome is that both processes take the same action and
neither takes both actions
• proof by contradiction: let there be a protocol P that solves
the problem using the fewest rounds, the last message sent
by A being m
• Observe that the action taken by A cannot depend on m, since its
receipt could never be learned by A
• The action taken by B cannot depend on m either, because B must make
the same choice of action as A even if m is lost
• Since the actions of both A and B do not depend on m, m can be
discarded
• Hence P without m also solves the problem, so m is not really the last
message and P does not use the fewest rounds: a contradiction
Formal models for message passing algorithms
• processes and channels: channels can be unidirectional or
bidirectional
• topology represented by an undirected graph G(V, E)
[Figure: example topology of five processes, P0 to P4]
Formal models for message passing algorithms
• System has n processes, p0 to p(n-1), where i is the index of
process pi
• The algorithm run by each pi is modeled as a process automaton, a
formal description of a sequential algorithm, and is associated with
a node in the topology.
Formal models for message passing algorithms
• A process automaton is a description of the process state machine
• consists of a 5-tuple: {message alphabet, process states, initial
states, message generation function, state transition function}
– message_alphabet: content of messages exchanged
– process_states: the finite set of states that a process can be in
– initial_state: the start state of a process
– message_gen_function: given the current process state, how the next message
is to be generated
– state_trans_function: on the receipt of messages, and based on the current
state, the next state to which the process should transit
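The 5-tuple can be written down directly as a small data structure. The sketch below is only one possible encoding: the field names follow the slide, while the `step` helper and the toy `counter` automaton (states 0, 1, 2; it sends its state and adds what it receives modulo 3) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable, Optional

State = Hashable
Message = Hashable


@dataclass(frozen=True)
class ProcessAutomaton:
    message_alphabet: FrozenSet[Message]
    process_states: FrozenSet[State]
    initial_state: State
    message_gen_function: Callable[[State], Optional[Message]]
    state_trans_function: Callable[[State, Optional[Message]], State]

    def step(self, state, incoming):
        """One step: generate the outgoing message, then apply the transition."""
        outgoing = self.message_gen_function(state)
        next_state = self.state_trans_function(state, incoming)
        if next_state not in self.process_states:
            raise ValueError(f"transition left the state set: {next_state!r}")
        return outgoing, next_state


# Hypothetical toy automaton, used only to exercise the structure.
counter = ProcessAutomaton(
    message_alphabet=frozenset({0, 1, 2}),
    process_states=frozenset({0, 1, 2}),
    initial_state=1,
    message_gen_function=lambda s: s,
    state_trans_function=lambda s, m: s if m is None else (s + m) % 3,
)

out, nxt = counter.step(counter.initial_state, incoming=2)
print(out, nxt)   # 1 0
```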
Description of system state
• A configuration is a vector C = (q0, …, q(n-1)) where qi is a state of pi
• In message passing systems two kinds of events can take place: a computation event
of process pi (application of the so-called state transition function), and a
delivery event, the delivery of message m from process pi to process pj,
consisting of a message sending event and a corresponding receiving
event
• Each message is uniquely identified by its sender process, a sequence
number and possibly a local clock value
• The behaviour of the system over time is modeled as an execution which is
a sequence of configurations alternating with events.
Formal models for message passing algorithms
[Figure: a process consisting of internal computation (modules of the process), with a receive interface for incoming messages and a send interface for outgoing messages]
• All possible executions of a distributed abstraction must satisfy two
conditions: safety and liveness.
Formal models for message passing algorithms
• Safety: ‘nothing bad has/can happen (yet)’
• e.g., ‘every step by a process pi immediately follows a step by
process p0’, or, ‘no process should receive a message unless the
message was indeed sent’
• Safety is a property that can be violated at some time t and never
be satisfied thereafter; doing nothing will also ensure safety!
Formal models for message passing algorithms
• Liveness: ‘eventually something good happens’
• a condition that must hold a number of times (possibly infinite),
e.g., ‘eventually p1 terminates’ => p1’s termination happens once,
or, liveness for a perfect link will require that if a correct process
(one which is alive and well behaved) sends a message to a correct
destination process, then the destination process should eventually
deliver the message
• Liveness is a property such that, for any time t, there is some hope that
the property can be satisfied at some time t’ ≥ t
Asynchronous systems
• there is no fixed upper bound for message delivery time or, the
time elapse between consecutive steps of a process
• notion of ordering of events, local computation, message send or
message receive are based on logical clocks
• an execution α of an asynchronous message passing system is a
finite or infinite sequence of the form C0, φ1, C1, φ2, C2, …, where Ck
is a configuration of process states, C0 is an initial configuration and
φk is an event, which may be a message send, a computation or a
message receive
• A schedule σ is the sequence of events in the execution, e.g., φ1, φ2,
…; if the local processes are deterministic, then the
execution is uniquely defined by (C0, σ).
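The slide says events are ordered by logical clocks without giving a rule. A common choice is Lamport's scalar clock, sketched below as an assumption: increment on every local or send event, and on receipt jump past the timestamp carried by the message. The class and method names are illustrative.

```python
class LamportClock:
    """Scalar logical clock: orders events without any shared physical time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local computation event."""
        self.time += 1
        return self.time

    def send(self):
        """Send event: returns the timestamp to attach to the message."""
        return self.tick()

    def receive(self, msg_time):
        """Receive event: move past both the local time and the message timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()            # p: local event, clock 1
t = p.send()        # p: send event, clock 2, message carries 2
q.tick()            # q: local event, clock 1
r = q.receive(t)    # q: receive event, clock max(1, 2) + 1 = 3
print(t, r)         # 2 3
```

The send is timestamped before the matching receive, so the clock values respect the partial order of events even though p and q never compare physical clocks.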
Synchronous systems
• There is a known upper bound on message transmission and
processing delays
• processes execute in lock step; execution is partitioned into
‘rounds’: C0, φ1 | C1, φ2 | C2, …
• very convenient for designing algorithms, but not very practical
• leads to some useful possibilities: e.g., timed failure detection –
every process crash can be detected by all correct processes, can
implement a lease abstraction
• in a synchronous system with no failures, only the C0 matters for a
given algorithm, but in an asynchronous system, there can be many
executions for a given algorithm
• synchronous message passing
[Figure: timeline of processes P, Q and R over rounds 1, 2 and 3; in each round a process performs recv(), a state transition from current state to new state, and send(), within an upper bound on time]
Properties of algorithms
• validity and agreement: specific to the objective of the algorithm
• termination: an algorithm has terminated when all processes are
terminated and there are no messages in transit
• an execution can still be infinite, but once terminated, the process
stays there taking ‘dummy’ steps
• complexity: message (maximum number of messages sent over all possible
executions) and time (equal to the maximum number of rounds if synchronous;
in the asynchronous case this is less straightforward)
Properties of algorithms
• Interaction algorithms are possible for each process failure model
• fail-stop – processes can fail by crashing but the crashes can be
reliably detected by all other processes
• fail-silent – where process crashes can never be reliably detected
• fail-noisy – processes can fail by crashing, and the crashes can be
detected, but not always in a reliable manner
• fail-recovery – where processes can crash and later recover and still
participate in the algorithm
• Byzantine – processes deviate from the intended behaviour in an
unpredictable manner
• not every interaction abstraction has a solution in every failure model
Coordination and Agreement
• under this broad topic we will discuss
– Leader election
– Consensus
– Distributed mutual exclusion
• common or uniform decisions by participating processes in response to various
internal and external stimuli are often required, in the presence of
failures and under synchrony considerations
Leader election (LE)
• a process that is correct and which acts as the coordinator in some
steps of a distributed algorithm, is a leader; e.g., commit manager
in a distributed database, central server in distributed mutual
exclusion
• LE abstraction can be straightforwardly implemented using a
perfect failure detector (that is in a synchronous situation)
• Hierarchical LE: assumes the existence of a ranking order agreed
among processes a priori, s.t. a function O associates, with every
process, those that precede it in ranking, i.e., O(p1) = ∅, p1 leader by
default; O(p2) = {p1}, if p1 dies p2 becomes leader; O(p3) = {p1, p2},
etc.
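The ranking rule can be sketched in a few lines, assuming a perfect failure detector supplies the set of crashed processes. The function names `predecessors` and `elect_leader` are illustrative, and the ranking is simply a list ordered from highest to lowest rank.

```python
def predecessors(ranking, process):
    """The function O from the slide: every process ranked before `process`."""
    return set(ranking[:ranking.index(process)])


def elect_leader(ranking, crashed):
    """Leader = the process all of whose predecessors have crashed
    (equivalently, the highest-ranked process that has not crashed)."""
    for p in ranking:
        if p not in crashed and predecessors(ranking, p) <= crashed:
            return p
    return None   # every process has crashed


ranking = ["p1", "p2", "p3"]
print(predecessors(ranking, "p3"))           # {'p1', 'p2'} (set order may vary)
print(elect_leader(ranking, set()))          # p1
print(elect_leader(ranking, {"p1"}))         # p2
print(elect_leader(ranking, {"p1", "p2"}))   # p3
```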
Leader election (LE)
LCR algorithm (LeLann-Chang-Roberts): a simple ring based
algorithm
• assumptions: n processes each with a hard coded uid in a logical
ring topology, unidirectional message passing from process pi to p(i+1) mod n,
processes are not aware of ring size, asynchronous, no process
failures, no message loss
• leader is defined to be the process with the highest uid
Leader election (LE)
algorithm in prose:
• each process sends its uid to its neighbour
• if received uid < own uid, then discard, else if received uid > own
uid, forward received uid to neighbour, else if received uid = own uid
then declare self as leader
[Figure: unidirectional ring of processes P1, P2, P3, P4, …, Pn with identifiers uid1, uid2, uid3, uid4, …, uidn]
Leader election (LE)
• process automaton:
message_alphabet: set U of uid’s
for each pi
  statei: defined by three state variables
    u ∈ U, initially uidi
    send ∈ U ∪ {null}, initially uidi
    status ∈ {leader, unknown}, initially unknown
  msgi: place value of send on output channel;
  transi: {send = null;
    receive v ∈ U on input channel;
    if v = null or v < u then exit;
    if v > u then send = v;
    if v = u then status = leader;}
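The automaton can be exercised with a small simulation. The sketch below runs it in synchronous rounds for simplicity (the slides assume an asynchronous ring, so the round structure is an assumption of this sketch) and also counts messages. The function name `lcr` and the list-based ring are illustrative.

```python
def lcr(uids):
    """Simulate LCR on a unidirectional ring where process i sends to (i + 1) % n.

    Assumes unique uids and at least one process.
    Returns (index of the leader, total number of messages sent).
    """
    n = len(uids)
    send = list(uids)              # send_i, initially uid_i
    status = ["unknown"] * n
    messages = 0
    while "leader" not in status:
        in_transit = send          # msg_i: place the value of send on the output channel
        send = [None] * n          # trans_i starts with send = null
        for i in range(n):
            v = in_transit[(i - 1) % n]    # value arriving from the predecessor
            if v is None:
                continue
            messages += 1
            if v > uids[i]:
                send[i] = v                # forward the larger uid
            elif v == uids[i]:
                status[i] = "leader"       # own uid travelled the whole ring
            # v < uids[i]: discarded
    return status.index("leader"), messages


print(lcr([3, 1, 4, 2]))   # (2, 8)
print(lcr([4, 3, 2, 1]))   # (0, 10)
```

The second call is a worst-case ring: the uid at distance d from the maximum survives d hops before being discarded, giving 1 + 2 + ... + n = n(n+1)/2 messages.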
Leader election (LE)
• expected properties: validity – if a process decides, then the decided value
is the largest uid of a process
• termination – every correct process eventually decides
• agreement – no two correct processes decide differently
• message complexity: O(n^2)
• time complexity: if synchronous, then n rounds until leader is discovered;
2n rounds until terminates
• other possible scenarios: synchronous and processes are aware of ring size
n (useful if processes fail), bidirectional ring (for a more efficient version of
the algorithm)
Leader election (LE)
• an O(n log n) message complexity algorithm (Hirschberg-Sinclair)
• assumptions: bidirectional ring, where for every i, 0 ≤ i < n, pi has a channel
to its left to p(i+1) mod n and a channel to its right to p(i-1) mod n; n processes each
with a hard coded uid in a logical ring topology, processes are not aware of
ring size, asynchronous, no process failures, no message loss
[Figure: bidirectional ring of processes P1, P2, P3, P4, …, Pk with identifiers uid1, uid2, uid3, uid4, …, uidk]
Leader election (LE)
algorithm in prose:
• as before, a process sends its identifier around the ring and the
message of the process with the highest identifier traverses the
whole ring and returns
• define a k-neighbourhood of a process pi to be the set of
processes at distance at most k from pi in either direction, left
and right
• algorithm operates in phases starting from 0
• in the kth phase a process tries to become a winner for that
phase, for which it must have the largest uid in its 2^k-neighbourhood
• only processes that are winners in the kth phase can go to
(k+1)th phase
• to start with, in phase 0 each process attempts to become a
phase 0 winner and sends probe messages to its left and right
neighbours
• if the identifier of the neighbour receiving the probe is higher,
then it swallows the probe, else it sends back a reply message if
it is at the edge of the neighbourhood, else forwards the probe to the next in
line
• a process that receives replies from both its neighbours is a
winner in phase 0
• similarly, in a 2^k-neighbourhood the kth phase winner will receive
replies from the farthest two processes in either direction
• a process which receives its own probe message declares itself
the leader
Leader election (LE)
pseudo code for pi:
send <probe, uidi, phase, hop_count> to left and to right; initially phase = 0 and hop_count = 1
upon receiving <probe, j, k, d> from left (or right) {
    if j = uidi then terminate as leader;
    if j > uidi and d < 2^k then
        send <probe, j, k, d+1> to right (or left);   // forward msg and increase hop count
    if j > uidi and d ≥ 2^k then                      // if reached edge, do not forward but
        send <reply, j, k> to left (or right);        // reply; if j < uidi, msg is swallowed
}
upon receiving <reply, j, k> from left (or right) {
    if j ≠ uidi then send <reply, j, k> to right (or left)   // forward
    else                                                     // reply is for own probe
        if already received <reply, j, k> from right (or left) then
            send <probe, uidi, k+1, 1> to left and to right;   // phase k winner
}
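The probe/reply rules of the pseudo code can be checked with a small simulation. This is a sketch under stated assumptions: a single FIFO queue stands in for the asynchronous network (one particular schedule among many), uids are unique, and a new phase sends its probes in both directions. The function name `hirschberg_sinclair` and the message tuples are illustrative.

```python
from collections import deque


def hirschberg_sinclair(uids):
    """Simulate the probe/reply algorithm on a bidirectional ring.

    Process i's left neighbour is (i + 1) % n, its right neighbour (i - 1) % n.
    Returns (index of the leader, total number of messages sent).
    """
    n = len(uids)
    replies = [set() for _ in range(n)]   # (phase, direction) pairs of replies received
    queue = deque()                       # in-flight (destination, direction, message)
    messages = 0

    def send(src, direction, msg):
        nonlocal messages
        messages += 1
        queue.append(((src + direction) % n, direction, msg))

    for i in range(n):                    # phase 0: probe both neighbours
        for direction in (+1, -1):
            send(i, direction, ("probe", uids[i], 0, 1))

    while queue:
        i, direction, msg = queue.popleft()
        if msg[0] == "probe":
            _, j, k, d = msg
            if j == uids[i]:
                return i, messages                               # own probe came back
            if j > uids[i]:
                if d < 2 ** k:
                    send(i, direction, ("probe", j, k, d + 1))   # forward the probe
                else:
                    send(i, -direction, ("reply", j, k))         # edge reached: reply
            # j < uids[i]: the probe is swallowed
        else:
            _, j, k = msg
            if j != uids[i]:
                send(i, direction, ("reply", j, k))              # forward the reply
            else:
                replies[i].add((k, direction))
                if {(k, +1), (k, -1)} <= replies[i]:             # phase k winner
                    for new_direction in (+1, -1):
                        send(i, new_direction, ("probe", uids[i], k + 1, 1))
    return None, messages


leader, sent = hirschberg_sinclair([3, 1, 4, 2])
print(leader)   # 2, the index of the process holding uid 4
```

Only the maximum uid can survive every phase, since any other probe is eventually swallowed by a process with a higher uid.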
Leader election (LE)
• other possible scenarios:
– synchronous with alternative ‘swallowing’ rules – anything higher
than the minimum uid seen so far, etc., with tweaking of uid usage
– leads to a synchronous leader election algorithm whose message
complexity is at most 4n
DME (distributed mutual exclusion)
• shared memory mutual exclusion is a well known aspect in
operating systems when there is a need for concurrent threads to
access a shared variable or object for read/write purposes
• the shared resource is made a critical section with access to it
controlled by atomic lock or semaphore operations
• the lock or the semaphore variable is seen by all threads
consistently
• asynchronous shared memory is an alternative possibility: say, P1, P2
and P3 share M1 and, P2 and P3 share M2
DME
• in a distributed system there will be no shared lock variable to
look at
• processes will have to agree on the process eligible to access
the shared resource at any given time, by message passing
• assumptions: system of n processes, pi, i=1..n; a process
wishing to access an external shared resource must obtain
permission to enter the critical section (CS); asynchronous,
processes do not fail, messages are reliably delivered
• correctness properties
• ME1 safety: at most one process may execute in the CS at any
given time
• ME2 liveness: requests to enter and exit CS eventually succeed
• ME3 ordering: if one request to enter the CS ‘happened-before’
another, then entry to the CS is granted in that order
• ME2 ensures freedom from both starvation and deadlock
DME
• several algorithms exist: Central Server version, Ring, Ricart-Agrawala
Central Server version
[Figure: a server holding the token and a queue of requests, with processes P1 to P4; message steps: 1. Request Token, 2. Release Token, 3. Grant Token]
DME
• In this scenario, there is a central server S that grants permission to
the processes to enter CS based on a token request
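• ME1 and ME2 are satisfied under the stated assumptions (no process failures,
reliable message delivery)
• ME3 is not satisfied, since arbitrary message delays may cause requests to
arrive at S out of happened-before order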
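The server's behaviour can be sketched as a token plus a FIFO queue of waiting requests. The class name `CentralServer` and its `request`/`release` methods are illustrative assumptions; message transport is left out, so each call stands for a message arriving at S.

```python
from collections import deque


class CentralServer:
    """Grants a single token; requests arriving while it is held are queued."""

    def __init__(self):
        self.holder = None      # process currently holding the token
        self.queue = deque()    # pending requests, in arrival order at S

    def request(self, process):
        """Handle a token request; return the process granted the token, if any."""
        if self.holder is None:
            self.holder = process
            return process
        self.queue.append(process)
        return None

    def release(self, process):
        """Handle a token release; return the next process granted the token, if any."""
        if process != self.holder:
            raise ValueError(f"{process} does not hold the token")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


s = CentralServer()
grants = [s.request("P3"), s.request("P4"), s.request("P2"),
          s.release("P3"), s.release("P4"), s.release("P2")]
print(grants)   # ['P3', None, None, 'P4', 'P2', None]
```

The queue is ordered by arrival at S, not by when the requests were issued, which is exactly why ME3 can fail when a request that happened earlier is delayed in the network.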
DME
Ring algorithm:
• assumptions: processes are ordered in a logical ring with
unidirectional communication where each process pi communicates
only with p(i+1) mod n.; system of n processes, pi, i=1..n; asynchronous,
processes do not fail, messages are reliably delivered
• mutual exclusion is obtained by sole possession of a token
• ME1 and ME2 satisfied
• correctness may not be guaranteed under violations of assumptions
DME
Ricart-Agrawala algorithm:
• assumptions: each process pi has a unique identifier, uidi and
maintains a logical scalar clock LCi; system of n processes, pi, i=1..n;
asynchronous, processes do not fail, messages are reliably delivered
algorithm in prose:
• a process pi wishing to access the CS multicasts a request
message containing its (uid, timestamp) pair to the whole group
• a process receiving such a request responds to pi, unless it is already
in the CS, or it also wants to enter the CS and its own request has a
smaller timestamp than pi’s (ties broken by uid)
• if pi receives responses from all others then it can enter the CS
DME
On initialization
    state := RELEASED;
To enter the critical section
    state := WANTED;
    multicast request to all processes;
    Ti := request’s timestamp;
    wait until (number of replies received = (N-1));
    state := HELD;
On receipt of a request <Ti, pi> at pj (i != j)
    if (state = HELD or (state = WANTED and (Tj, pj) < (Ti, pi)))
    then
        queue request from pi without replying;
    else
        reply immediately to pi;
    end if
To exit the critical section
    state := RELEASED;
    reply to any queued requests;
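The pseudo code above can be traced with a small simulation. This is a sketch under stated assumptions: a FIFO queue stands in for reliable message delivery, processes are numbered 0 to 2, and the clock values are preset so that the two competing requests carry timestamps 41 and 34. The `Process` and `Network` classes are illustrative, not part of the notes.

```python
from collections import deque

RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"


class Process:
    def __init__(self, pid, network):
        self.pid, self.network = pid, network
        self.state = RELEASED
        self.clock = 0          # logical scalar clock
        self.request = None     # (timestamp, pid) of the pending request
        self.replies = 0
        self.deferred = []      # senders whose requests were queued without reply

    def enter(self):
        self.state = WANTED
        self.clock += 1
        self.request = (self.clock, self.pid)
        self.replies = 0
        for other in self.network.peers(self.pid):
            self.network.send(other, ("request", self.request))

    def on_message(self, msg):
        kind, payload = msg
        if kind == "request":
            ts, sender = payload
            self.clock = max(self.clock, ts) + 1
            if self.state == HELD or (self.state == WANTED and self.request < payload):
                self.deferred.append(sender)                   # queue without replying
            else:
                self.network.send(sender, ("reply", self.pid))
        else:
            self.replies += 1
            if self.state == WANTED and self.replies == self.network.n - 1:
                self.state = HELD
                self.network.log.append(self.pid)              # entered the CS

    def exit(self):
        self.state = RELEASED
        for sender in self.deferred:
            self.network.send(sender, ("reply", self.pid))
        self.deferred = []


class Network:
    def __init__(self, n):
        self.n, self.queue, self.log = n, deque(), []
        self.procs = [Process(i, self) for i in range(n)]

    def peers(self, pid):
        return [i for i in range(self.n) if i != pid]

    def send(self, dest, msg):
        self.queue.append((dest, msg))

    def run(self):
        while self.queue:
            dest, msg = self.queue.popleft()
            self.procs[dest].on_message(msg)


net = Network(3)
a, b, c = net.procs
a.clock, b.clock = 40, 33      # so the requests carry timestamps 41 and 34
a.enter(); b.enter(); net.run()
first = list(net.log)          # [1]: the request with timestamp 34 wins
b.exit(); net.run()
print(first, net.log)          # [1] [1, 0]
```

Process 1 defers its reply to process 0 because its own request (34, 1) is smaller than (41, 0); process 0 only enters after process 1 exits and releases the deferred reply.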
DME
[Figure: processes P1, P2 and P3; requests with timestamps 41 and 34 are multicast and answered with Reply messages]
• ME1,ME2, ME3 satisfied
DME
• Message complexity is easily derivable
• In all three DME algorithms above, i.e., server based, ring based and
R-A, process failures might violate termination requirements
• message losses are not acceptable
• even a perfect failure detector is not applicable since two amongst
three algorithms are asynchronous
Fault tolerant consensus
• generally speaking, agreement or consensus by participating
processes may be on a common value, on a message delivery order,
on abort or commit, on a leader etc.,
• consensus is specified in terms of two primitives: propose and
decide
• properties to be satisfied:
• termination – every correct process eventually decides some value
• validity – if a process decides v, then v was proposed by some
process
• integrity - no process decides twice
• agreement – no two correct processes decide differently
Fault tolerant consensus
• validity + integrity + agreement = safety
• termination = liveness
• key features: best effort broadcast with no message loss as a
mechanism to convey to community of processes, synchronous,
process failures – fail stop and Byzantine with key parameter f, the
maximum number of processes that can fail, where the system is
known as f-resilient
• uncertainty in consensus in this failure model arises from
the possibility that only a subset of a process’s messages is
delivered in a round
Flooding consensus – version 1
• assumptions: n processes in a strongly connected undirected graph,
processes aware of group size, synchronous, maximally f fail stop
processes (hard coded), no message loss, the set of possible
decision values {V} is made of all proposed values, each process has
exactly one proposed value, objective is ‘uniform’ decision
Flooding consensus – version 1
algorithm in prose:
• processes execute in rounds
• each process maintains the set of proposals it has seen, built by
merging the sets it receives; this set is augmented when moving from one round to the
next
• in each round every process disseminates its augmented set to all
others using best effort broadcast
• a process decides a specific value in its set when the number of
rounds equals (f+1)
[Figure: timeline t of processes p1 to p4 over rounds 1, 2 and 3, labelled ‘Consensus round (f+1)’]
Flooding consensus – version 1
process automaton:
message_alphabet: subsets of {V}
for each pi
  statei: defined by three state variables
    rounds ∈ N, initially 0
    decision ∈ {V} ∪ {unknown}, initially unknown
    W ⊆ V, initially the singleton set consisting of vi, pi’s proposal
  msgi: if rounds ≤ f then broadcast W to all other processes;
  transi: {rounds = rounds + 1;
    receive value xj on input channel j;
    W = W ∪ ⋃j xj;
    if rounds = f + 1 then
      if |W| = 1 then decision = v, where W = {v}
      else decision = default;}
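The automaton can be simulated to see why f+1 rounds are needed. In this sketch a crash is modelled as a partial broadcast: the crashing process reaches only a chosen subset of receivers in its crash round and then stops. The function name `flooding_consensus`, the `crashes` parameter and the process names are assumptions made for illustration.

```python
def flooding_consensus(proposals, f, crashes=None, default=None):
    """Synchronous flooding consensus with the 'uniform' decision rule.

    proposals: dict pid -> proposed value.
    crashes:   dict pid -> (round, set of pids that still receive the crashing
               process's broadcast in that round).
    Returns dict pid -> decision for the processes that did not crash.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, f + 2):                 # rounds 1 .. f+1
        inbox = {p: set() for p in alive}
        crashed_now = set()
        for p in alive:
            if p in crashes and crashes[p][0] == rnd:
                receivers = crashes[p][1]       # partial broadcast, then crash
                crashed_now.add(p)
            else:
                receivers = alive
            for q in receivers:
                if q in inbox and q != p:
                    inbox[q] |= W[p]
        alive -= crashed_now
        for p in alive:
            W[p] |= inbox[p]                    # W = W ∪ received sets
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive}


proposals = {"p1": "a", "p2": "b", "p3": "b", "p4": "b"}
crash = {"p1": (1, {"p2"})}                     # p1 reaches only p2, then crashes

decisions = flooding_consensus(proposals, f=1, crashes=crash, default="default")
too_few = flooding_consensus(proposals, f=0, crashes=crash, default="default")
print(decisions)   # every survivor decides 'default'
print(too_few)     # p2 decides 'default' but p3 and p4 decide 'b'
```

With two rounds, p2 relays the value it alone received from p1, so all survivors end with the same W. With a single round the same crash leaves p2 with a different set from p3 and p4, and agreement is lost.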
Flooding consensus – version 1
proof sketch:
• termination- all correct processes decide at the end of round f+1,
whatever that decision may be
• validity – suppose all initial proposals are identical to v; then
W has only one element v, and v is the only possible decision
• agreement – suppose no process fails; then the algorithm runs for 1
round only and, by the basic broadcast property, the sets W seen by all are
identical
• in the worst case the f failures can be spread one per round,
but there is still one final round to make the decision uniform
• performance: time complexity: (f+1) rounds
• a particular feature of all consensus algorithms
• message complexity: (f+1)n^2
• other possible decision functions apart from uniform are
majority, minimum, maximum, etc.
Flooding consensus – version 2
• assumptions: n processes in a fully connected undirected graph, processes
aware of group size, synchronous, fail stop crashes with perfect failure
detector, no message loss, the set of possible decision values {V} is made of all
proposed values, each process has exactly one proposed value, any
deterministic decision function can be applied
Flooding consensus – version 2
algorithm in prose:
• processes execute in rounds
• each process maintains the set of proposals it has seen, built by
merging the sets it receives; this set is augmented when moving from one round to the
next
• in each round every process disseminates its set to all others using
best effort broadcast
• a process decides a specific value in its set when it knows it has
gathered all proposals that will ever be seen by any correct process
or when it has detected no new failures in two successive rounds
• a process that so decides broadcasts its decision to the rest in the next
round; all correct processes that have not yet decided will decide on
the receipt of a decide message
Flooding consensus – version 2
• agreement is strictly not violated: but correct processes must
decide a value that must be consistent with values decided by
processes that might have decided before crashing
• suppose a process that receives messages from all others decides
but crashes immediately afterwards, before broadcasting to the others
• the rest move to the next round, detecting a failure, and then to the next,
where there may be no further failures, and may then decide on a
different outcome
• the problem can be mitigated by employing a reliable broadcast
mechanism: a process must not decide as soon as it is able to, but only
after a reliable form of broadcast
Flooding consensus – version 2
• performance: worst case n rounds if (n-1) processes crash in
sequence
• impossibility of consensus under asynchronous fail-stop conditions
• important result by Fischer, Lynch and Paterson: ‘no algorithm can
guarantee to reach consensus in an asynchronous system even with
one process crash failure’
• outcome is mainly due to the indistinguishability of a crashed
process from a slow process in an asynchronous system
Flooding consensus – version 2
• proof is complicated, but follows the argument that among the many
possible executions there may be at least one that avoids
consensus being reached
• any alternative?
• with ‘unreliable failure detectors’ – consensus can be solved in an
asynchronous system with an unreliable failure detector if fewer
than n/2 processes crash (Chandra and Toueg)
Byzantine fault tolerance
• Consensus in a synchronous system in the presence of malicious
and/or ad hoc process failures, known by the metaphor Byzantine
failure
• Generals commanding divisions of the Byzantine army
communicate using reliable messengers
• generals should decide on a common plan of action
• some generals may be traitors and may prevent loyal generals
from agreeing by sending conflicting messages to different generals
Byzantine fault tolerance
Four Generals scenario
[Figure: four armies, commanded by General 1 to General 4, positioned around a city]
Byzantine fault tolerance
• assumptions: n processes in a fully connected undirected graph,
processes aware of group size, synchronous, at most f Byzantine
faulty processes (hard coded): a faulty process may send any message
with any value at any time or keep silent, no message loss, a correct
process detecting the absence of a message associates it with a
‘null’ value, one designated process initiates messages to the other
processes, messages are unsigned (oral), the set of possible
decision values {V} is made of the value proposed by the designated
process, objective is ‘majority’ decision
Byzantine fault tolerance
• properties to be satisfied:
• termination – every correct process eventually decides
• validity – if the sending process is correct then the message
received is identical to the message sent (or, if the commanding
general is loyal, then every loyal general obeys the order sent)
• agreement – correct processes receive the same message (or, all
loyal generals receive the same order)
• impossibility with three processes
Byzantine fault tolerance
• G2 is a traitor
[Figure: CG sends ‘attack’ to G1 and G2; G2 relays ‘retreat’ to G1]
• CG is a traitor
[Figure: CG sends ‘attack’ to G1 and ‘retreat’ to G2; G1 relays ‘attack’ and G2 relays ‘retreat’]
Byzantine fault tolerance
• algorithm in prose: processes execute in rounds; the designated
process initiates by best effort broadcast of its message to the others;
each correct process maintains the set of proposals it has seen,
merging in the sets it receives, so the set grows from one
round to the next; in each round every correct process disseminates its
set to all others except the designated process using best effort
broadcast; a correct process decides on the majority value in its set (or
falls back to a default) when the number of rounds equals (f+1)
• case (a) – three processes with participating general p3 as traitor,
case (b) – three processes with commanding general p1 as traitor
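The two-round exchange described above (f = 1) can be sketched as a small simulation. This is only an illustration under stated assumptions: the names run, majority, honest and p3_lies are hypothetical, "default" stands for the fall-back value, and a traitor is modelled simply as a relay function that may alter what it forwards.

```python
from collections import Counter

DEFAULT = "default"  # assumed fall-back value when no majority exists


def majority(values):
    """Return the value held by a strict majority, else DEFAULT."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else DEFAULT


def run(n, sent_by_p1, relay):
    """Simulate the f + 1 = 2 rounds for p1..pn, with p1 as designated process.

    sent_by_p1[i] is the value p1 sends to pi in round 1.
    relay(j, i, v) is what pj forwards to pi in round 2 after receiving v.
    Returns the decision of every participating process p2..pn
    (the entry of a traitor is meaningless and should be ignored).
    """
    others = range(2, n + 1)
    decisions = {}
    for i in others:
        seen = [sent_by_p1[i]]
        seen += [relay(j, i, sent_by_p1[j]) for j in others if j != i]
        decisions[i] = majority(seen)
    return decisions


def honest(j, i, v):
    """Every process forwards exactly what it received."""
    return v


def p3_lies(j, i, v):
    """p3 is the traitor and forwards a bogus value u."""
    return "u" if j == 3 else v


three_a = run(3, {2: "V", 3: "V"}, p3_lies)          # p3 traitor
three_b = run(3, {2: "W", 3: "X"}, honest)           # p1 traitor
four_a = run(4, {2: "V", 3: "V", 4: "V"}, p3_lies)   # p3 traitor
four_b = run(4, {2: "V", 3: "W", 4: "X"}, honest)    # p1 traitor

print(three_a[2])            # default: p2 does not follow the correct p1
print(four_a[2], four_a[4])  # V V: validity and agreement hold
print(set(four_b.values()))  # {'default'}: all fall back, agreement holds
```

With three processes the correct p2 sees one V and one u and cannot form a majority, whereas with four processes the extra correct relay outvotes the traitor.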
Byzantine fault tolerance
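[Figure: three-process message exchange. Case (a): P1 sends 1:V to P2 and P3; P2 relays 2:1:V; traitor P3 relays 3:1:u. Case (b): traitor P1 sends 1:W to P2 and 1:X to P3; P2 relays 2:1:W; P3 relays 3:1:X]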
• outcome: termination – satisfied by definition, whatever that decision is;
validity – not satisfied for case (a) (p2 does not follow p1) and not
applicable for case (b); agreement – satisfied for case (b) (p2 and p3 fall
back on default) and not applicable for case (a)
Byzantine fault tolerance
• consensus with four processes: case (a) – four processes with
participating general p3 as traitor, case (b) four processes with
commanding general p1 as traitor
Byzantine fault tolerance
• outcome: case (a) – validity and agreement satisfied; case (b) –
validity not applicable, agreement – satisfied (p2, p3 and p4 fall back
on default)
• scenario with signed messages: digitally signing a message uniquely
identifies a message and its originator
• revisit the three process consensus: case (a) – traitor cannot alter
commanding general’s message but can stay silent: validity satisfied
(p2 discards bogus message from p3); case (b) – agreement satisfied
(p2 and p3 fall back on default)
• Byzantine agreement is solvable with three processes with one
failure if processes digitally sign the messages
Byzantine fault tolerance
• complexity: time – (f+1) rounds
• message – O(n^(f+1)), an exponential message complexity
• generic result: Byzantine agreement is solvable with at least (3f+1)
processes in (f+1) rounds where f is the maximum number of
Byzantine failures
• a constant message size BFT consensus alternative exists, provided
n > 4f; it runs for 2(f+1) rounds
Time and Global states
• a distributed system by nature has no single clock and it is
practically difficult to synchronise physical clocks across a system
• a mechanism to globally order events in an asynchronous
system is an important requirement for replica management,
consensus, etc.
Logical clocks
• Leslie Lamport introduced the concept of causal relationship
observable in a message passing distributed system
Logical clocks
• A potential causal ordering can be established by looking at
‘happened-before’ relationships (indicated by an arrow →)
between local events within a process as well as sending and
receiving events across processes: e.g., p1: a → b, p2: c → d, p3: e → f,
p1 and p2: b → c, p2 and p3: d → f, etc.
• transitivity property: if x → y and y → z then x → z
• concurrency definition: if ¬(x → y) and ¬(y → x) then we say (x || y)
• it can be easily established that for p1 and p3: a → f and a || e.
Logical clocks
• it is possible to time stamp the events of a distributed system such
that
• rule 1 – if e1 and e2 are local events in pi and e1 → e2 then Ci(e1) <
Ci(e2)
• rule 2 – if e1 is the sending event of a message by pi and e2 is the
corresponding receiving of the message by pj then Ci(e1) < Cj(e2)
• generalised notation: eij as event #j of process pi
• local history (possibly an infinite sequence of events) of pi as hi = ei1
ei2 ei3 …, and the global history of the system as H = h1 ∪ h2 ∪ … ∪ hn
Logical clocks
• Lamport clock timestamp rules:
– given that LC(ei) = logical time stamp of event ei and LCi = value of
logical clock of pi then
LC(ei) = LCi + 1 if ei is an internal event or a send event
LC(ei) = max(LCi, TS(m)) + 1 if ei is a receive event
– where TS(m) is the time stamp of the received message
– after occurrence of event ei on pi, the logical clock of pi is updated as
LCi ← LC(ei)
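The timestamp rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the class name LamportClock and its method names are assumptions, and the events a to f follow the p1, p2, p3 example of these notes (b sends m1, received at c; d sends m2, received at f).

```python
class LamportClock:
    """Logical clock LCi of one process (names are illustrative)."""

    def __init__(self):
        self.time = 0

    def local_or_send(self):
        # Internal or send event: LC(e) = LCi + 1, then LCi <- LC(e).
        self.time += 1
        return self.time

    def receive(self, ts):
        # Receive event: LC(e) = max(LCi, TS(m)) + 1, then LCi <- LC(e).
        self.time = max(self.time, ts) + 1
        return self.time


p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()
a = p1.local_or_send()   # 1
b = p1.local_or_send()   # 2, send of m1
c = p2.receive(b)        # 3, receive of m1
d = p2.local_or_send()   # 4, send of m2
e = p3.local_or_send()   # 1
f = p3.receive(d)        # 5, receive of m2

print(a, b, c, d, e, f)  # 1 2 3 4 1 5
```

Note that LC(e) = 1 < LC(b) = 2 even though e and b are concurrent, so a smaller timestamp does not imply happened-before.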
Logical clocks
• properties: e → e’ ⇒ LC(e) < LC(e’); but note that
LC(e) < LC(e’) ⇏ e → e’
[Figure: time lines of Pi (events ei1, ei2, ei3) and Pj (events ej1, ej2, ej3), each against its local clock T, with a message m sent on Pj and received on Pi]
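Lamport’s logical clocks respect the happened-before partial ordering of
events, but two timestamps alone cannot tell whether the events are
causally related or concurrent
How can the causal order of events be captured? Vector Clocks
by Mattern and Fidge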
Logical clocks
• specification: VC(ei) = vector time stamp of event ei on pi is a vector of size
n: each element is VC(ei)[j]; j = 1..n, where n is the group size
• for i = j, corresponds to the number of events on pi up to and including ei
• for i ≠ j, corresponds to the number of events on pj that happened before ei
[Figure: vector timestamps against physical time. p1: a = (1,0,0), b = (2,0,0); p2: c = (2,1,0), d = (2,2,0); p3: e = (0,0,1), f = (2,2,2); message m1 from b to c, message m2 from d to f]
Logical clocks
• Vector clock timestamp rules:
– VCi = vector clock of pi
– if ei is an internal event or send(m) on pi then, ∀ j ≠ i, VC(ei)[j] ← VCi[j]
and VC(ei)[i] ← VCi[i] + 1
– else {if ei is a receive event on pi of message m with vector timestamp
VT(m)} then, VC(ei) ← max(VCi, VT(m)) taken element by element, and
VC(ei)[i] ← VC(ei)[i] + 1
– after occurrence of event ei on pi, its vector clock is updated as VCi ←
VC(ei)
– comparing two vector clocks:
– VC(e) < VC(e’) iff ((VC(e) ≤ VC(e’)) and (VC(e) ≠ VC(e’))) where
• VC(e) ≤ VC(e’) iff ∀ j, VC(e)[j] ≤ VC(e’)[j] and
• VC(e) ≠ VC(e’) iff ∃ j s.t. VC(e)[j] ≠ VC(e’)[j]
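The update and comparison rules above can be sketched as follows. The class VectorClock and the helpers less_than and concurrent are illustrative names, processes are indexed from 0 rather than 1, and the run replays the events a to f of the p1, p2, p3 example in these notes.

```python
class VectorClock:
    """Vector clock VCi of process i in a group of n (names are illustrative)."""

    def __init__(self, i, n):
        self.i = i
        self.v = [0] * n

    def local_or_send(self):
        # Internal or send event: only the own entry advances.
        self.v[self.i] += 1
        return tuple(self.v)

    def receive(self, vt):
        # Receive event: element-wise max with VT(m), then advance own entry.
        self.v = [max(x, y) for x, y in zip(self.v, vt)]
        self.v[self.i] += 1
        return tuple(self.v)


def less_than(u, v):
    """VC(e) < VC(e') iff u <= v in every entry and u != v."""
    return all(x <= y for x, y in zip(u, v)) and u != v


def concurrent(u, v):
    """e || e' iff neither timestamp is less than the other."""
    return not less_than(u, v) and not less_than(v, u)


p1, p2, p3 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
a = p1.local_or_send()   # (1, 0, 0)
b = p1.local_or_send()   # (2, 0, 0), send of m1
c = p2.receive(b)        # (2, 1, 0), receive of m1
d = p2.local_or_send()   # (2, 2, 0), send of m2
e = p3.local_or_send()   # (0, 0, 1)
f = p3.receive(d)        # (2, 2, 2), receive of m2

print(less_than(a, f))   # True:  a happened before f
print(concurrent(a, e))  # True:  a || e
```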
Logical clocks
• Vector clock properties: e → e’ ⇔ VC(e) < VC(e’)
• e || e’ ⇔ ¬(VC(e) < VC(e’)) and ¬(VC(e’) < VC(e))
• Vector clocks impose a causal order of events
Global property of a distributed computation
properties to look for in a distributed system
• garbage collection – an object with no remaining references to it
can be discarded
• deadlock detection – cyclic waiting for resources
• termination detection – not only has each process halted but
there are also no messages in transit
• debugging – ensuring, for example, that variables across processes
remain within defined limits
Global property of a distributed computation
• among these are the class of stable properties
• stable: if once true, then remains true forever
• there is no omniscient observer who can
record an instantaneous snapshot of the system state
• useful concept if the system is asynchronous
Global property of a distributed computation
• first a few notations and definitions
• let qik be the state of a process pi after the occurrence of event eik, and qi0
the initial state of pi
[Figure: time lines of P1 (events e11 to e14), P2 (events e21, e22) and P3 (events e31 to e34)]
Global property of a distributed computation
• the global state of a distributed computation at any given instant is
defined by the tuple (q1k1, q2k2, …, qnkn): global state does not
include the state of the channels
• a cut of a distributed computation is defined as a subset of the global
history H given by C = h1c1 ∪ h2c2 ∪ … ∪ hncn where hici = ei1 ei2 … eici is the
local event history of pi up to event eici
• cut C is defined by the tuple (c1, c2, …, cn)
• the global state (q1c1, q2c2, …, qncn) corresponds to cut C
Global property of a distributed computation
[Figure: time lines of P1 (events e11 to e14), P2 (events e21, e22) and P3 (events e31 to e34), crossed by a cut C]
• Usefulness of a cut C:
– it is possible to express a global property of a distributed computation
such as deadlock, computation terminated, etc. as a global state
predicate Φ which evaluates an observed state to true or false
Global property of a distributed computation
• suppose a process p0, outside of the system, asks each process pi for its
local state qi; process p0 builds the global state Q = (q1k1, q2k2, …,
qnkn) and evaluates a predicate Φ on Q to give {true, false}
• consider some predicate Φ, evaluated on a consistent cut C
expressed by state (q1c1, q2c2, …, qncn) such that Φ(C) = value of Φ on
C ∈ {T, F}
• a cut C precedes a cut C’ iff C ⊆ C’
• a predicate Φ is said to be stable iff the following property holds:
Φ(C) ⇒ Φ(C’) for all C’ such that C ⊆ C’
Global property of a distributed computation
[Figure: bank transfer example. Account balances (in CHF) of p1 to p4 over time, transfers of 100 between accounts, and a cut marked ‘Problem…!’]
• invariant for the bank transfer example: there should not be more
money in the accounts than there was originally
• global state defined by cut C’: (400, 650, 400, 600); total amount =
2050 > 1550; cut C’ is not consistent
• definition: a cut C is consistent iff for all events e, e’:
(e’ ∈ C and e → e’) ⇒ e ∈ C
• definition: a consistent global state is a global state defined by a
consistent cut
• vector clocks can be used to determine if a cut is consistent or not
Consider
• VC(ei)[j] – number of events on pj that happened before ei (on
pi )
• VC(ej)[j] – number of events on pj before and including ej
• therefore if VC(ei)[j] > VC(ej)[j] then ei is aware of more events
on pj than ej itself
• that is, there was a subsequent event after ej on pj which
caused ei
• this is exactly the situation in an inconsistent cut
• a cut C is consistent if and only if ∀ i, j: VC(ejcj)[j] ≥ VC(eici)[j]
that is, cut event eici cannot be aware of more events on pj
than ejcj itself
[Figure: cut events ei on pi and ej on pj]
• consistency test for cut (c1, c2, c3):
• for a three process situation, proceed as
– VC(e1c1)[1] ≥ VC(e2c2)[1] and VC(e1c1)[1] ≥ VC(e3c3)[1] for p1
– VC(e2c2)[2] ≥ VC(e1c1)[2] and VC(e2c2)[2] ≥ VC(e3c3)[2] for p2
– VC(e3c3)[3] ≥ VC(e1c1)[3] and VC(e3c3)[3] ≥ VC(e2c2)[3] for p3
• a monitor process may establish whether an observed global
state is consistent using collected vector time stamps of the
processes
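A monitor's consistency test can be sketched directly from the condition that no cut event may know of more events on pj than pj's own cut event. The function name is_consistent is an assumption, processes are indexed from 0, and the timestamps are those of the events a, b, c, e from the vector clock example in these notes.

```python
def is_consistent(cut):
    """cut[i] is the vector timestamp of the last event of process i in the cut.

    The cut is consistent iff, for all i and j, cut[j][j] >= cut[i][j].
    """
    n = len(cut)
    return all(cut[j][j] >= cut[i][j] for i in range(n) for j in range(n))


a, b = (1, 0, 0), (2, 0, 0)   # events on the first process
c = (2, 1, 0)                 # receive of the message sent at b
e = (0, 0, 1)                 # local event on the third process

print(is_consistent([b, c, e]))  # True:  the send b is inside the cut
print(is_consistent([a, c, e]))  # False: c was received but b is outside
```

The second cut fails because c records two events of the first process while that process's own cut event a records only one.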
Computing a consistent global state (snapshot)
Chandy-Lamport algorithm
• a snapshot is a record of process states and channel states
which is consistent
• assumptions: FIFO channels (can be imposed using sequence
numbers); recorded states may be collected by a designated
process; no process fails; no message loss; graph is strongly
connected; any one of the processes can initiate a global
snapshot
algorithm in prose:
• process p1 (initiator of the snapshot) saves its state q1c1 and
broadcasts the message snapshot to P (set of all processes)
• let pi receive the snapshot message for the first time from some
process pj (pj can be different from p1), at which time pi saves
its state qici and broadcasts the snapshot message to P (no
application event can take place between the reception of a
snapshot message and the rebroadcast)
• when pi has received snapshot from all in P the computation
of the snapshot is terminated
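The steps above can be sketched as a small deterministic simulation over FIFO channels. This is an illustration under assumptions rather than the algorithm as published: all names are hypothetical, the application is a toy bank with three accounts, and the sketch additionally logs money that arrives after a process has saved its state but before the sender's snapshot message, which is how the channel state gets recorded.

```python
from collections import deque


class Process:
    def __init__(self, balance):
        self.balance = balance   # application state qi
        self.recorded = None     # saved state, None until the snapshot arrives
        self.awaiting = set()    # peers whose snapshot message is still expected
        self.in_transit = {}     # peer -> amounts recorded as channel state


def make_system(balances):
    procs = [Process(b) for b in balances]
    n = len(procs)
    # One FIFO channel per ordered pair of distinct processes.
    channels = {(s, d): deque() for s in range(n) for d in range(n) if s != d}
    return procs, channels


def transfer(procs, channels, src, dst, amount):
    procs[src].balance -= amount
    channels[(src, dst)].append(("money", amount))


def record(procs, channels, pid):
    """Save the local state and broadcast the snapshot message."""
    p = procs[pid]
    peers = [q for q in range(len(procs)) if q != pid]
    p.recorded = p.balance
    p.awaiting = set(peers)
    p.in_transit = {q: [] for q in peers}
    for q in peers:
        channels[(pid, q)].append(("snapshot", None))


def deliver(procs, channels, src, dst):
    kind, amount = channels[(src, dst)].popleft()
    p = procs[dst]
    if kind == "money":
        p.balance += amount
        if p.recorded is not None and src in p.awaiting:
            p.in_transit[src].append(amount)  # sent before src recorded
    else:
        if p.recorded is None:
            record(procs, channels, dst)      # first snapshot message seen
        p.awaiting.discard(src)


def run_to_completion(procs, channels):
    while any(channels.values()):
        for key in sorted(channels):
            if channels[key]:
                deliver(procs, channels, *key)


procs, channels = make_system([100, 200, 300])
transfer(procs, channels, 1, 0, 100)   # 100 is in transit when the snapshot starts
record(procs, channels, 0)             # process 0 initiates
run_to_completion(procs, channels)

states = [p.recorded for p in procs]
in_flight = sum(sum(v) for p in procs for v in p.in_transit.values())
print(states, in_flight)               # [100, 100, 300] 100
```

The recorded process states alone add up to 500, and only together with the recorded channel content does the snapshot account for the original 600.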
Chandy-Lamport snapshot algorithm
[Figure: snapshot messages propagating among p1 to p4, with the recorded cut points c1 to c4 and two events e and e’]
• as depicted, p4 receives its first snapshot message from p3 and
not from p1
proof sketch:
• global state Q = (q1c1, q2c2, …, qncn) is consistent
• consider the cut C(c1, c2, …, cn)
• due to send(snapshot) → receive(snapshot) and FIFO
channels, for all i, j, if cj ∈ C and ci → cj, then ci ∈ C