Operating Systems & Concurrent Programming
Lecturer: Xu Qiwen
qwxu@umac.mo
Textbook: Distributed Operating Systems & Algorithms
Randy Chow and Theodore Johnson
Addison-Wesley, 1997
In this course, we study
• operating systems, network and distributed, in particular the algorithms
  used in these systems
• concurrent programming, mainly the analysis of distributed
  algorithms, such as simulation and verification of the algorithms
Spin system
• Modelling language Promela
  concurrent processes
  communication via message channels, either
    synchronous (handshaking) or
    asynchronous (buffered)
• Simulation
• Verification by model checking
  exhaustive search of the state space to check whether
  properties are satisfied or not
  - system invariants
  - progress
  - linear temporal logic
Spin was developed by G.J. Holzmann at AT&T
http://netlib.bell-labs.com/netlib/spin/whatisspin.html
Formal methods library
www.afm.sbu.ac.uk/fm/
A spectrum of operating systems
(decreasing degree of hardware and software coupling)
1st generation: centralized operating system
2nd generation: network operating system
3rd generation: distributed operating system
4th generation: cooperative autonomous system
A comparison of features in modern operating systems

Generation | System                        | Characteristics                                                              | Goals
first      | centralized operating system  | process management, memory management, I/O management, file management      | resource management, extended machine (virtuality)
second     | network operating system      | remote access, information exchange, network browsing                       | resource sharing (interoperability)
third      | distributed operating system  | global view of: file system, name space, time, security, computational power | single-computer view of a multiple-computer system (transparency)
fourth     | cooperative autonomous system | open and cooperative distributed applications                               | cooperative work (autonomicity)
Causality
A fundamental property of a distributed system:
the lack of a global system state
This is due to
- noninstantaneous communication
    propagation delay
    contention for network resources
    lost messages
- imperfect clock synchronization
    clock drift
- unpredictable execution
    CPU contention
    interrupts
    page faults
    garbage collection
Therefore, in distributed systems, we can only talk
about causality
Causal:
the cause precedes the effect;
sending precedes receipt
E: the set of all events
Ep: the set of all events that occur at processor p
e1 <p e2: e1 precedes e2 at processor p
  for e1, e2 in Ep, either e1 <p e2 or e2 <p e1
e1 <m e2: e1 is the sending of message m, e2 is the receipt of message m
Happens-before, e1 <H e2:
1. if e1 <p e2, then e1 <H e2
2. if e1 <m e2, then e1 <H e2
3. if e1 <H e2 and e2 <H e3, then e1 <H e3
A happens-before relation, H-DAG
[Figure: an H-DAG over processors p1, p2, p3 with events e1 to e8;
message edges run from e1 to e3 and from e5 to e8]
e1 <p1 e4 <p1 e7
e1 <m e3
e5 <m e8
e3 <p2 e5
e1 <H e8
Lamport Timestamps Algorithm
global time does not exist
global `clock`: a total order on the events
must be consistent with the happens-before relation <H
the algorithm runs on the fly
e.TS: timestamp attached to event e
my_TS: local timestamp of the processor

Initially my_TS = 0
On event e
  if e is the receipt of message m
    my_TS = max(m.TS, my_TS)
  my_TS++
  e.TS = my_TS
  if e is the sending of message m
    m.TS = my_TS
if e1 <H e2, then e1.TS < e2.TS
to break ties between identical timestamps, Lamport suggests using
the processor address for the low-order bits of the timestamp
no guarantee of the converse: if e1.TS < e2.TS, it does not follow that
e1 <H e2. Therefore, Lamport timestamps cannot be used to detect,
for example, causality violations
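As a running illustration, here is a minimal executable sketch of the rules above in Python (the class and method names are illustrative, not from the textbook; a message is assumed to carry just its timestamp):

class LamportClock:
    def __init__(self):
        self.my_ts = 0                      # local timestamp, initially 0

    def on_event(self, msg_ts=None):
        """Handle one event; msg_ts is set when the event is a receipt."""
        if msg_ts is not None:              # receipt of message m
            self.my_ts = max(msg_ts, self.my_ts)
        self.my_ts += 1                     # every event advances the clock
        return self.my_ts                   # e.TS

    def send(self):
        """A send is an event; the message carries the new timestamp."""
        return self.on_event()              # m.TS = my_TS

p, q = LamportClock(), LamportClock()
m_ts = p.send()                 # p sends m with m.TS = 1
q.on_event()                    # an internal event at q: TS = 1
r_ts = q.on_event(msg_ts=m_ts)  # q receives m: TS = max(1, 1) + 1 = 2
assert m_ts < r_ts              # sending precedes receipt in timestamps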
Causality violation
s(m): the event of sending m
r(m): the event of receiving m
a causality violation occurs if s(m1) <H s(m2), but r(m2) <H r(m1)
Vector timestamps have the property
  e1.VT <v e2.VT iff e1 <H e2
Must be able to tell which events of every processor an event
causally follows
VT: an array of integers
VT[i] = k: causally follows the first k events at processor i
e1.VT <=v e2.VT : e1.VT[i] <= e2.VT[i] for every i = 1…M
e1.VT <v e2.VT : e1.VT <=v e2.VT and e1.VT ≠ e2.VT
Vector timestamp algorithm
Initially
  my_VT = [0,…,0]
On event e
  if e is the receipt of message m
    for i = 1 to M
      my_VT[i] = max(m.VT[i], my_VT[i])
  my_VT[self]++
  e.VT = my_VT
  if e is the sending of message m
    m.VT = my_VT

We show that if e1.VT <v e2.VT, then e1 <H e2. Suppose e1
is at processor i and e2 is at processor j. From e1.VT <v e2.VT,
e1.VT[i] <= e2.VT[i]. The value of e2.VT[i] can only have been obtained,
directly or through intermediate messages, from an event at processor i;
therefore e1 <H e2.
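A corresponding sketch of the vector timestamp rules, again in Python with illustrative names (M is the number of processors, self_id this processor's index):

class VectorClock:
    def __init__(self, M, self_id):
        self.my_vt = [0] * M                # my_VT = [0,...,0]
        self.self_id = self_id

    def on_event(self, msg_vt=None):
        if msg_vt is not None:              # receipt of message m
            self.my_vt = [max(a, b) for a, b in zip(self.my_vt, msg_vt)]
        self.my_vt[self.self_id] += 1
        return list(self.my_vt)             # e.VT (a copy)

def vt_leq(v1, v2):
    """e1.VT <=v e2.VT : componentwise <=."""
    return all(a <= b for a, b in zip(v1, v2))

def vt_lt(v1, v2):
    """e1.VT <v e2.VT : <=v and not equal; holds iff e1 <H e2."""
    return vt_leq(v1, v2) and v1 != v2

p, q = VectorClock(2, 0), VectorClock(2, 1)
e1 = p.on_event()                # e1.VT = [1, 0]; p sends m with m.VT = e1
e2 = q.on_event(msg_vt=e1)       # e2.VT = [1, 1], receipt of m
assert vt_lt(e1, e2)             # e1 <H e2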
Causal communication
ensure no causality violation
assume point-to-point
messages delivered in the order sent
main idea:
hold back message m until no message m' <H m remains to be delivered
from any other processor

earliest[1,…,M]
earliest[k]: the timestamp of the earliest message that can still be
delivered from processor k
initially the smallest timestamp 1_k
(1 in Lamport timestamps, (0…010…0) with the 1 in position k in vector
timestamps)
blocked[1,…,M]
blocked[k]: queue of blocked messages from processor k
Causal Message delivery algorithm
Initially
  each earliest[k] is set to 1_k, k = 1,…,M
  each blocked[k] is set to {}, k = 1,…,M
On the receipt of message m from processor p
  delivery_list = {}
  if (blocked[p] is empty)
    earliest[p] = m.timestamp
  Add m to the tail of blocked[p]
  while (there is a k such that blocked[k] is not empty,
         and for every i ≠ k, self: not_earlier(earliest[i], earliest[k], i))
    remove the message at the head of blocked[k], put it in delivery_list
    if blocked[k] is not empty
      set earliest[k] to m'.timestamp, where m' is the head of blocked[k]
    else
      increment earliest[k] by 1_k
  Deliver the messages in delivery_list, in causal order
Deadlock in the algorithm
if one processor does not send messages, the other processors can remain
blocked, unable to deliver the messages they have received
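A simplified Python sketch of the hold-back idea, using vector timestamps instead of the earliest[]/blocked[] bookkeeping above (names are illustrative; as in the algorithm above, a message can be held indefinitely if a causally earlier message never arrives):

class CausalDelivery:
    def __init__(self, M):
        self.delivered = [0] * M            # messages delivered per sender
        self.held = []                      # held-back (sender, vt, payload)

    def _deliverable(self, sender, vt):
        # next message from sender, and all causal predecessors delivered
        return (vt[sender] == self.delivered[sender] + 1 and
                all(vt[i] <= self.delivered[i]
                    for i in range(len(vt)) if i != sender))

    def receive(self, sender, vt, payload):
        """Hold back the message, then deliver everything now enabled."""
        self.held.append((sender, vt, payload))
        out = []
        progress = True
        while progress:                     # repeat until no message frees up
            progress = False
            for m in list(self.held):
                s, v, data = m
                if self._deliverable(s, v):
                    self.held.remove(m)
                    self.delivered[s] = v[s]
                    out.append(data)
                    progress = True
        return out                          # delivery_list, in causal order

cd = CausalDelivery(3)
# m2 (causally after m1) arrives first and is held back:
assert cd.receive(1, [1, 1, 0], "m2") == []
assert cd.receive(0, [1, 0, 0], "m1") == ["m1", "m2"]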
Multicast communication
Every processor receives the same set of messages
if p receives m1 and m2 <H m1,
then p will eventually receive m2
Distributed Snapshots
no global state
distributed snapshot:
a global view of the system that is consistent with causality
Si: state of processor Pi
S = (S1, S2,…,SM)
channel Cij: communication channel from Pi to Pj
C = {Cij | i,j ∈ 1,…,M}
Lij = (m1, m2,…,mk):
messages sent by Pi but yet to be received by Pj
L = {Lij | i,j ∈ 1,…,M}
Global state G = (S, L)
Consistent Cut
observations of different processors should be concurrent
snapshot token: a special message indicating a state to be recorded
[Figure: timelines of processors p and q, with snapshot tokens t exchanged
between them and three observation cuts O1, O2, O3]
O1 and O2 are concurrent
O1 and O3 are not concurrent
(in the original system, i.e.
without the snapshot tokens)
Distributed Snapshot Algorithm
Variables
integer         my_version
integer         current_snap[1…M]
integer         tokens_received[1…M]
processor_state S[1…M]
channel_state   L[1…M][1…M]

S[r] contains processor self's state, and
L[r][q] contains Lq,self, in the snapshot requested by processor r
Initially
my_version = 0
for each processor p
  current_snap[p] = 0
execute_snapshot()
  Wait for a snapshot request or a token
  Snapshot_Request:
    my_version++
    S[self] = current state
    current_snap[self] = my_version
    for each q in O_self
      send(q, TOKEN, self, my_version)
    tokens_received[self] = 0
  TOKEN(q; r, version):
    if current_snap[r] < version
      S[r] = current state
      current_snap[r] = version
      L[r][q] = ()
      for every p in O_self
        send(p, TOKEN, r, version)
      tokens_received[r] = 1
    else if (current_snap[r] = version)
      tokens_received[r]++
      put the messages received from q since first receiving
      TOKEN(r, version) into L[r][q]
    if tokens_received[r] = |I_self|
      the local snapshot for (r, version) is finished
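A compact Python sketch of the token scheme, for a single snapshot only (the my_version/current_snap bookkeeping for concurrent snapshots is dropped, the FIFO channels are simulated as in-memory queues, and all names are illustrative):

from collections import deque

class Snapshotter:
    """One processor's view; channels are FIFO deques in `net`."""
    def __init__(self, pid, peers, net):
        self.pid, self.peers, self.net = pid, peers, net
        self.state = f"state-of-{pid}"      # local state to be recorded
        self.S = None                       # recorded processor state
        self.L = {q: [] for q in peers}     # recorded channel states
        self.open = set()                   # channels still being recorded

    def _record_and_flood(self):
        self.S = self.state
        self.open = set(self.peers)         # record every incoming channel
        for q in self.peers:
            self.net[(self.pid, q)].append("TOKEN")

    def initiate(self):
        if self.S is None:
            self._record_and_flood()

    def on_message(self, frm, msg):
        if msg == "TOKEN":
            if self.S is None:              # first token: record and flood
                self._record_and_flood()
            self.open.discard(frm)          # channel frm -> self is closed
            # the local snapshot is finished once self.open is empty
        elif frm in self.open:
            self.L[frm].append(msg)         # in transit across the cut

# two processors; channels keyed by (src, dst)
net = {(0, 1): deque(), (1, 0): deque()}
p0 = Snapshotter(0, [1], net)
p1 = Snapshotter(1, [0], net)

net[(1, 0)].append("m")                     # m is in flight from p1 to p0
p0.initiate()                               # p0 records, token goes to p1
p1.on_message(0, net[(0, 1)].popleft())     # p1 gets token, records, replies
p0.on_message(1, net[(1, 0)].popleft())     # p0 receives m while recording
p0.on_message(1, net[(1, 0)].popleft())     # p1's token closes the channel
assert p0.L[1] == ["m"]                     # m belongs to channel state L(1,0)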
Distributed Mutual Exclusion
Timestamp Algorithms
record timestamps
send requests to the other processors; the other processors
grant or defer the request using the timestamp information
Variables
timestamp current_timestamp
timestamp my_timestamp
integer   reply_pending
boolean   is_requesting
boolean   reply_deferred[1…M]
Requesting the critical section
Request_CS()
  my_timestamp = current_timestamp
  is_requesting = True
  reply_pending = M - 1
  for every other processor q
    send(q, remote_request, my_timestamp)
  wait until reply_pending = 0
  (CS)
Monitoring
CS_monitor()
  Wait for a remote_request or a reply message
  remote_request(q, request_timestamp):
    if (not is_requesting or my_timestamp > request_timestamp)
      send(q, reply)
    else reply_deferred[q] = True
  reply(q):
    reply_pending--
Releasing the critical section
Release_CS()
  (leave CS)
  is_requesting = False
  for q = 1 to M
    if reply_deferred[q] = True
      send(q, reply)
      reply_deferred[q] = False
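A Python sketch of the timestamp algorithm above, with message delivery simulated by direct method calls (names follow the pseudocode; (clock, pid) pairs implement Lamport's tie-breaking rule):

class Node:
    def __init__(self, pid, M):
        self.pid, self.M = pid, M
        self.clock = 0                      # Lamport clock source
        self.my_timestamp = None
        self.is_requesting = False
        self.reply_pending = 0
        self.reply_deferred = [False] * M
        self.peers = {}                     # pid -> Node, filled in below

    def request_cs(self):
        self.clock += 1
        self.my_timestamp = (self.clock, self.pid)   # pid breaks ties
        self.is_requesting = True
        self.reply_pending = self.M - 1
        for q in self.peers.values():
            q.on_remote_request(self, self.my_timestamp)
        # caller now waits until reply_pending == 0, then enters the CS

    def on_remote_request(self, q, request_timestamp):
        self.clock = max(self.clock, request_timestamp[0]) + 1
        if not self.is_requesting or self.my_timestamp > request_timestamp:
            q.on_reply()                    # grant
        else:
            self.reply_deferred[q.pid] = True   # defer until we leave CS

    def on_reply(self):
        self.reply_pending -= 1

    def release_cs(self):
        self.is_requesting = False
        for q in self.peers.values():
            if self.reply_deferred[q.pid]:
                self.reply_deferred[q.pid] = False
                q.on_reply()

nodes = [Node(i, 2) for i in range(2)]
for n in nodes:
    n.peers = {m.pid: m for m in nodes if m is not n}
a, b = nodes
a.request_cs()                  # b is not requesting, so it replies at once
assert a.reply_pending == 0     # a may enter the critical section
b.request_cs()                  # a holds the CS with an older timestamp
assert b.reply_pending == 1     # a defers its reply
a.release_cs()                  # leaving the CS releases deferred replies
assert b.reply_pending == 0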
Voting
Processors compete for votes to enter critical sections
Naive Voting Algorithm
Naive_Voting_Enter_CS()
  Send a vote request to all processors
  Wait until (M+1)/2 votes are received
  (CS)
Voting with districts
Sp: voting district of processor p
Si ∩ Sj ≠ {} for all 1 <= i,j <= M
Variables used in the voting-based algorithm
S_self: voting district
current_timestamp
my_timestamp
yes_votes
have_voted
candidate: candidate voted for
candidate_timestamp
have_inquired: true if we have tried to recall a vote
waitingQ
Requesting the critical section
Request_CS()
  my_timestamp = current_timestamp
  for every processor r in S_self
    send(r, REQUEST, my_timestamp)
  while (yes_votes < |S_self|)
    Wait for a YES, NO or INQUIRE message
    YES(q):
      yes_votes++
    INQUIRE(q, inquire_timestamp):
      if my_timestamp = inquire_timestamp
        send(q, RELINQUISH)
        yes_votes--
Monitor the critical section
Voter()
  while True
    wait for a REQUEST, RELEASE, or RELINQUISH message
    REQUEST(q, request_timestamp):
      if have_voted is False
        send(q, YES)
        candidate_timestamp = request_timestamp
        candidate = q
        have_voted = True
      else
        add (q, request_timestamp) to waitingQ
        if request_timestamp < candidate_timestamp and not have_inquired
          have_inquired = True
          send(candidate, INQUIRE, candidate_timestamp)
    RELINQUISH(q):
      add (candidate, candidate_timestamp) to waitingQ
      remove the (s, timestamp) from waitingQ such that
        timestamp is the minimum
      send(s, YES)
      candidate_timestamp = timestamp
      candidate = s
      have_inquired = False
    RELEASE(q):
      if waitingQ is not empty
        remove the (s, timestamp) from waitingQ such that
          timestamp is the minimum
        send(s, YES)
        candidate_timestamp = timestamp
        candidate = s
      else have_voted = False
      have_inquired = False
Fixed Logical Structure
A processor can enter the critical section if it possesses a token
ring structure
tree structure
Variables used by the fixed-structure algorithm
Token_hldr: true if this processor holds the token
Incs: true if this processor is in the critical section
current_dir: the neighbour in the direction of the token
request_Q: queue of pending requests
operations on request_Q
  Nq(q): enqueue q
  Dq(): dequeue
  ismt(): true if the queue is empty
Raymond's algorithm
Requesting and releasing the critical section
Request_CS()
  if not Token_hldr
    if ismt()
      send(current_dir, REQUEST)
    Nq(self)
    wait until Token_hldr is True
  Incs = True
Release_CS()
  Incs = False
  if not ismt()
    current_dir = Dq()
    send(current_dir, TOKEN)
    Token_hldr = False
    if not ismt()
      send(current_dir, REQUEST)
Monitor_CS()
  while True
    wait for a REQUEST or a TOKEN
    REQUEST(q):
      if Token_hldr
        if Incs
          Nq(q)
        else
          current_dir = q
          send(current_dir, TOKEN)
          Token_hldr = False
      else
        if ismt()
          send(current_dir, REQUEST)
        Nq(q)
    TOKEN:
      current_dir = Dq()
      if current_dir = self
        Token_hldr = True
      else
        send(current_dir, TOKEN)
        if not ismt()
          send(current_dir, REQUEST)
Path compression
Token_hldr
Incs
IsRequesting
current_dir
next: the next processor to receive the token; nil if the
processor is at the end of the waiting list (i.e., if the processor
was the last to request)
Request_CS()
  IsRequesting = True
  if not Token_hldr
    send(current_dir, REQUEST, self)
    current_dir = self
    next = NIL
    wait until Token_hldr is True
  Incs = True
Release_CS()
  Incs = False
  IsRequesting = False
  if next ≠ NIL
    Token_hldr = False
    send(next, TOKEN)
    next = NIL
Monitor_CS()
  while True
    wait for a REQUEST or a TOKEN
    REQUEST(requester):
      if IsRequesting
        if next = NIL
          next = requester
        else
          send(current_dir, REQUEST, requester)
      else if Token_hldr
        Token_hldr = False
        send(requester, TOKEN)
      else
        send(current_dir, REQUEST, requester)
      current_dir = requester
    TOKEN:
      Token_hldr = True
Leader Election
coordinator / participant(s)
The Bully Algorithm
Assumptions
1. message propagation time Tm
2. message handling time Tp
Failure detector
  timeout T = 2Tm + Tp
Variables
  state: {Down, Election, Reorganization, Normal}
  coordinator
  definition
  up
  halted
Correctness Assertions
1. If statei ∈ {Normal, Reorganization}
   and statej ∈ {Normal, Reorganization},
   then coordinatori = coordinatorj
2. If statei = statej = Normal,
   then definitioni = definitionj
3. (liveness) eventually
   • statei = Normal and coordinatori = i
   • for every other non-failed node j,
     statej = Normal and coordinatorj = i
Idea of the Bully Algorithm
• Each node has a priority
• In an election, a node first checks whether all higher-priority nodes
  have failed; if so, the node knows it should be the leader
• The leader "bullies" the other nodes into accepting its leadership
• An election is initiated by Coordinator_Timeout, if a node does
  not hear from the coordinator for a long time, or by Recovery, when
  the node recovers from a failure
• The leader calls an election if it detects that a processor has failed
  or a failed processor has recovered
Algorithm to initiate an election by a node
Coordinator_Timeout()
  if state = Normal or state = Reorganization
    send(coordinator, AreYouUp)
    wait until the coordinator sends AYU_answer, timeout = T
    on timeout: Election()
Recovery()
  state = Down
  Election()
Algorithm by the coordinator to check the state of the other
processors
Check()
  if state = Normal and coordinator = self
    for every other node j
      send(j, AreYouNormal)
      wait until j sends (AYN_answer, status), timeout = T
      if (j ∈ up and status = False) or j ∉ up
        Election()
        return()
Bully election algorithm
Election()
  highest = True
  for every higher-priority processor p
    send(p, AreYouUp)
  wait up to T seconds for AYU_answer messages
  AYU_answer(sender):
    highest = False
  if highest = False
    return()
  state = Election
  halted = self
  up = {}
  for every lower-priority processor p
    send(p, Enter_Election)
  wait up to T for EE_answer messages
  EE_answer(sender): up = up ∪ {sender}
Bully election algorithm continued
Election()
  ……
  num_answers = 0
  coordinator = self
  state = Reorganization
  for each p in up
    send(p, Set_Coordinator, self)
  wait up to T for SC_answer messages
  SC_answer(sender):
    num_answers++
  if num_answers < |up|
    Election()
    return()
  num_answers = 0
  for each p in up
    send(p, New_State, definition)
  wait up to T for NS_answer messages
  NS_answer(sender):
    num_answers++
  if num_answers < |up|
    Election()
    return()
  state = Normal
Monitoring the election
Monitor_Election()
  while True
    wait for a message
    case AreYouUp(sender):
      send(sender, AYU_answer)
    case AreYouNormal(sender):
      if state = Normal
        send(sender, AYN_answer, True)
      else
        send(sender, AYN_answer, False)
    case Enter_Election(sender):
      state = Election
      stop_processing()
      stop the election procedure if it is executing
      halted = sender
      send(sender, EE_answer)
    case Set_Coordinator(sender, newleader):
      if state = Election and halted = newleader
        coordinator = newleader
        state = Reorganization
        send(sender, SC_answer)
    case New_State(sender, newdef):
      if coordinator = sender and state = Reorganization
        definition = newdef
        state = Normal
The Invitation Algorithm
Assumption:
delays can be arbitrary; there is no global coordinator
Processors are organized into groups; different groups have different
coordinators; groups are merged into larger groups.
Correctness assertions
1. If statei ∈ {Normal, Reorganization},
   statej ∈ {Normal, Reorganization}, and Groupi = Groupj,
   then Coordinatori = Coordinatorj
2. If statei = statej = Normal and Groupi = Groupj,
   then Definitioni = Definitionj
Check()
  if state = Normal and Coordinator = self
    others = {}
    for every other node p
      send(p, AreYouCoordinator)
    wait up to T seconds for AYC_answer messages
    AYC_answer(sender, is_coordinator):
      if is_coordinator
        others = others ∪ {sender}
    if others = {}
      return()
    wait for a time inversely proportional to your priority
    Merge(others)
Timeout()
  if Coordinator = self
    return()
  send(Coordinator, AreYouThere, Group)
  wait for AYT_answer, timeout is T
  on timeout
    is_coordinator = False
  AYT_answer(sender, is_coordinator): skip
  if is_coordinator = False
    Recovery()
Merge(Coordinator_set)
  if Coordinator = self and state = Normal
    state = Election
    stop_processing()
    counter++
    Group = (self, counter)
    Coordinator = self
    UpSet = Up          { is this necessary, or is there a problem with }
    Up = {}             { interleaving with Invitation()? * }
    for each p in Coordinator_set
      send(p, Invitation, self, Group)
    for each p in UpSet
      send(p, Invitation, self, Group)
    wait for T seconds
    /* Answers are collected by the Monitor_Election thread */
    * Invitation() contains Coordinator = new_coordinator
Merge(Coordinator_Set)
  ……
  state = Reorganization
  num_answer = 0
  for each p in Up
    send(p, Ready, Group, Definition)
  wait up to T seconds for Ready_answer messages
  Ready_answer(sender, in_group, new_group):
    if in_group and new_group = Group
      num_answer++
  if num_answer < |Up|
    Recovery()
  else state = Normal
Invitation()
  while True
    wait for Invitation(new_coordinator, new_group)
    if state = Normal
      stop_processing()
    old_coordinator = Coordinator
    UpSet = Up
    state = Election
    Coordinator = new_coordinator
    Group = new_group
    if old_coordinator = self
      for each p in UpSet
        send(p, Invitation, Coordinator, Group)
    send(Coordinator, Accept, Group)
    ……
    Question: is this better put in the Monitor thread?
    wait up to T seconds for an Accept_answer(sender, accepted)
    on timeout
      accepted = False
    if accepted = False
      Recovery()
    state = Reorganization
Election_Monitor()
  while True
    wait for a message
    Ready(sender, new_group, new_definition):
      if Group = new_group and state = Reorganization
        Definition = new_definition
        state = Normal
        send(Coordinator, Ready_answer, True, Group)
      else
        send(sender, Ready_answer, False)
    ……
    AreYouCoordinator(sender):
      if state = Normal and Coordinator = self
        send(sender, AYC_answer, True)
      else
        send(sender, AYC_answer, False)
    AreYouThere(sender, old_group):
      if Group = old_group and Coordinator = self and sender ∈ Up
        send(sender, AYT_answer, True)
      else
        send(sender, AYT_answer, False)
    Accept(sender, new_group):
      if state = Election and Coordinator = self and Group = new_group
        Up = Up ∪ {sender}
        send(sender, Accept_answer, True)
      else
        send(sender, Accept_answer, False)
Recovery()
  state = Election
  stop_processing()
  counter++
  Group = (self, counter)
  Coordinator = self
  Up = {}
  state = Reorganization
  Definition = {a single-node task description}
  state = Normal
Data Management
The ACID properties
Atomicity: either all of the operations in a transaction are
performed or none are, in spite of failures
Consistency (serializability): the execution of interleaved
transactions is equivalent to a serial execution of the
transactions in some order
Isolation: partial results of an incomplete transaction are
not visible to others before the transaction is
successfully committed
Durability: the system guarantees that the results of a committed
transaction are made permanent even if a failure
occurs after the commitment
Data Replication
ACID properties are more difficult to ensure
Atomicity
All processors involved in the transaction must agree to either commit
or abort the transaction
Naive protocol: the coordinator completes its execution, commits, and
sends commit messages to the other processors
Problem with the naive protocol: if a participant processor fails, it will
not successfully commit (therefore, not all processors commit)
Database Technique
Two-phase Commit
2PC_Coordinator()
  precommit the transaction
  for every participant p
    send(p, VOTE_REQ)
  wait up to T for VOTE messages
  VOTE(sender, vote_response):
    if vote_response = YES
      increment the number of yes votes
  if each participant responded with a YES vote
    commit the transaction
    for every participant p
      send(p, COMMIT)
  else
    abort the transaction
    for every participant p
      send(p, ABORT)
2PC_Participant()
  while True
    wait for a message from the coordinator
    VOTE_REQ(coordinator):
      if I can commit the transaction
        precommit the transaction
        write a YES vote to the log
        send(coordinator, YES)
      else
        abort the transaction
        send(coordinator, NO)
    COMMIT(coordinator):
      commit the transaction
    ABORT(coordinator):
      abort the transaction
• If any processor fails prior to the vote request, abort.
• If the coordinator fails after precommitting but before committing,
  abort after recovery (the textbook also says "in practice, the coordinator
  will attempt to commit". My understanding of this is that the
  coordinator will perform another round of vote requests).
• If a participant fails after precommitting but before committing, it
  contacts the other processors after recovery to decide (the transaction
  may or may not have committed).
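A minimal Python sketch of the two-phase exchange (participants are simulated as local objects; a real implementation would also write a log at each step, as noted above; names are illustrative):

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "init"

    def vote_req(self):
        if self.can_commit:
            self.state = "precommitted"     # also: write YES vote to the log
            return "YES"
        self.state = "aborted"
        return "NO"

    def decide(self, decision):             # COMMIT or ABORT from coordinator
        self.state = "committed" if decision == "COMMIT" else "aborted"

def two_phase_commit(participants):
    votes = [p.vote_req() for p in participants]      # phase 1: vote request
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    for p in participants:                            # phase 2: decision
        p.decide(decision)
    return decision

assert two_phase_commit([Participant(True), Participant(True)]) == "COMMIT"
assert two_phase_commit([Participant(True), Participant(False)]) == "ABORT"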
Disadvantage of two-phase commit
if the coordinator fails after a participant has voted YES, the
participant must wait until the coordinator recovers; the protocol
cannot complete: it is blocked
Three-Phase Commit
avoids blocking if a majority of processors agree on the action
Serializability (consistency)
the result of an execution is equivalent to that of some serial execution
Example
t0: bt  Write A=100, Write B=20  et
t1: bt  Read A, Read B
        1: Write sum in C
        2: Write diff in D
    et
t2: bt  Read A, Read B
        3: Write diff in C
        4: Write sum in D
    et
Conflict: Write-Write, Write-Read, Read-Write
Interleaving schedules, with t0 < t1 < t2 (Wi denotes the write by transaction ti):

Schedule | log in C      | log in D      | Result (C,D)           | 2PL          | Timestamp
1,2,3,4  | W1=120, W2=80 | W1=80, W2=120 | (80,120) consistent    | feasible     | feasible
3,4,1,2  | W2=80, W1=120 | W2=120, W1=80 | (120,80) consistent    | feasible     | t1 aborts and restarts
1,3,2,4  | W1=120, W2=80 | W1=80, W2=120 | (80,120) consistent    | not feasible | feasible
3,1,4,2  | W2=80, W1=120 | W2=120, W1=80 | (120,80) consistent    | not feasible | t1 aborts and restarts
1,3,4,2  | W1=120, W2=80 | W2=120, W1=80 | (80,80) inconsistent   | not feasible | cascade aborts
3,1,2,4  | W2=80, W1=120 | W1=80, W2=120 | (120,120) inconsistent | not feasible | t1 aborts and restarts
Two-Phase Locking (2PL)
A growing phase of acquiring locks, a shrinking phase of releasing them
An extreme case: lock all objects at the beginning and release them all
at the end. Serialization is trivial, but there is no concurrency;
suitable only for simple applications
2PL:
1. A transaction must obtain a read or a write lock on data d before
   reading d, and must obtain a write lock on d before updating d
2. After a transaction relinquishes a lock, it may not acquire any
   new locks
* Many transactions can hold read locks on a data item, but if one
  transaction holds a write lock, no other transaction can hold any lock on it
2PL:
concurrency is limited
deadlock is possible (e.g., if t2 writes D and then writes C)
strict 2PL: locks are released only at the commit or abort point
non-strict 2PL is difficult to implement; it is difficult to know when the
last lock is requested
strict 2PL sacrifices some concurrency
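A small Python sketch that checks rule 2 above on a schedule of lock/unlock events (the event format (txn, action, item) is an assumption made for this example):

def obeys_2pl(schedule):
    shrinking = set()                       # transactions past their lock point
    for txn, action, item in schedule:
        if action == "lock":
            if txn in shrinking:
                return False                # acquired after releasing: not 2PL
        elif action == "unlock":
            shrinking.add(txn)
    return True

ok = [("t1", "lock", "A"), ("t1", "lock", "B"),
      ("t1", "unlock", "A"), ("t1", "unlock", "B")]
bad = [("t1", "lock", "A"), ("t1", "unlock", "A"),
       ("t1", "lock", "B")]                # lock after unlock
assert obeys_2pl(ok) and not obeys_2pl(bad)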
Timestamp ordering
1. when an operation on a shared object is invoked, the object records
   the timestamp of the invoking transaction
2. when a (different) transaction invokes a conflicting operation on the
   object: if it has a larger timestamp than the one recorded by the
   object, let the transaction proceed (and record the new
   timestamp); otherwise abort the transaction (it restarts with a
   larger timestamp)
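A Python sketch of this rule for conflicting writes on a single object (names are illustrative):

class TSObject:
    def __init__(self):
        self.last_ts = 0                    # timestamp recorded by the object

    def write(self, txn_ts):
        if txn_ts > self.last_ts:
            self.last_ts = txn_ts           # proceed, record new timestamp
            return True
        return False                        # caller must abort and restart
                                            # with a larger timestamp

x = TSObject()
assert x.write(txn_ts=1)        # t1 writes first
assert x.write(txn_ts=2)        # t2 has a larger timestamp: proceeds
assert not x.write(txn_ts=1)    # t1 arrives late: abort and restart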
Optimistic Concurrency Control
execution phase
validation phase
update phase
One-copy serializability
The result of an execution is equivalent to that of a serial execution on
nonreplicated objects
Read-one-primary
Read-one
Read-quorum
Write-one-primary
Write-all
Write-all-available
Write-quorum
Write-gossip
Read-one / Write-all-available
Example
t0: bt W(X) W(Y) et
t1: bt R(X) W(Y) et
t2: bt R(Y) W(X) et
t0 is initialization, followed by t1 and t2. Only the serial schedules
(t1 t2 or t2 t1) are consistent.
Now replicate X to Xa and Xb, and Y to Yc and Yd.
Suppose Xa and Yd fail:
t1: bt R(Xa) (Yd fails) W(Yc) et
t2: bt R(Yd) (Xa fails) W(Xb) et
No conflict is detected, but the execution is not one-copy serializable
Quorum Voting
Read quorum: each read operation on a replicated object d must obtain
a read quorum R(d)
Write quorum: W(d)
Quorums must overlap
V(d): total number of copies
Write-Write conflict: 2W(d) > V(d)
Read-Write conflict: R(d) + W(d) > V(d)
R(d) = 1, W(d) = V(d): Read-one / Write-all
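The two overlap constraints can be checked directly; a small Python sketch:

def valid_quorums(V, R, W):
    """V copies; quorums must satisfy 2W > V and R + W > V."""
    return 2 * W > V and R + W > V          # write-write, read-write overlap

assert valid_quorums(V=5, R=3, W=3)         # majority quorums
assert valid_quorums(V=5, R=1, W=5)         # read-one / write-all
assert not valid_quorums(V=5, R=2, W=3)     # read quorum too small: 2+3 = 5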
Gossip Update Propagation
Many applications do not need one-copy serializability
Basic Gossip Protocol
TSi: last update time of the data object
(maintained by Replica Manager RMi)
TSf: timestamp of the last successful access operation
(maintained by the File Service Agent, FSA)
Read:
  TSf is compared with TSi
  if TSf <= TSi (the replica's data is at least as recent)
    return the value
    TSf is set to TSi
  else
    wait until the data is updated by gossip
Update:
  TSf++
  if TSf > TSi
    the update is executed
    TSi = TSf
    propagate the new data by gossip
  else
    (the update is too late; possible actions:
    overwrite, or first become more up to date by a read)
Gossip:
  a gossip message carrying a data value from replica j to replica i
  is accepted if TSj > TSi
In the Basic Gossip Protocol, updates are simple overwrites
(they do not depend on the current state).
To handle read-modify updates (which depend on the current state),
use the Causal Order Gossip Protocol.
Example of causal order gossip: Figure 6.12
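A Python sketch of the basic protocol's read/update/gossip rules, with one FSA and two replica managers (names are illustrative; a real system would exchange gossip messages asynchronously):

class Replica:                              # one RM
    def __init__(self):
        self.ts_i = 0                       # TSi: last update time of data
        self.value = None

    def gossip_in(self, ts_j, value):       # accept gossip only if newer
        if ts_j > self.ts_i:
            self.ts_i, self.value = ts_j, value

class Agent:                                # one FSA
    def __init__(self):
        self.ts_f = 0                       # TSf: last successful access

    def read(self, rm):
        if self.ts_f <= rm.ts_i:            # replica is recent enough
            self.ts_f = rm.ts_i
            return rm.value
        return None                         # must wait for gossip

    def update(self, rm, value):
        self.ts_f += 1
        if self.ts_f > rm.ts_i:             # the update is not stale
            rm.ts_i, rm.value = self.ts_f, value
            return True                     # then propagate by gossip
        return False                        # too late: overwrite or read first

rm1, rm2, fsa = Replica(), Replica(), Agent()
assert fsa.update(rm1, "v1")                # executed at rm1
rm2.gossip_in(rm1.ts_i, rm1.value)          # propagated to rm2 by gossip
assert fsa.read(rm2) == "v1"                # rm2 is now recent enough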
Distributed Agreement
A number of processors, some of them faulty, try to agree on a value.
Assumption: faulty processors may do anything, including the worst
(Byzantine failures).
Aim: a protocol that allows all the non-faulty processors to reach
agreement.
Byzantine agreement
In an ancient war, some Byzantine generals are loyal,
but some are disloyal. The loyal generals need to decide whether to
attack together or retreat together.
Question: suppose there are 3 generals, 2 loyal and 1 disloyal. Can the
loyal generals reach agreement?
[Figure: the 3-general scenarios. Whether the disloyal general is the
commander (sending "attack" to one lieutenant and "retreat" to the other)
or a lieutenant (relaying a forged order), a loyal lieutenant ends up
seeing 1 attack and 1 retreat and cannot decide.]
Question: can the loyal generals reach agreement if there are 4
generals, 3 loyal and 1 disloyal?
[Figure: the 4-general scenarios. Each loyal lieutenant collects the order
received directly from the commander together with the values relayed by
the other two lieutenants. Whether the traitor is the commander or a
lieutenant, every loyal lieutenant sees 2 attacks and 1 retreat (2A, 1R),
decides attack, and so the loyal generals agree.]
Theorem
Suppose there are M generals, t of them disloyal. If M <= 3t, then the
generals cannot reach agreement.
Proof idea:
Suppose the theorem is not true. Let each of the three generals in the
three-general problem simulate up to t generals; then the three-general
problem could also be solved. Contradiction.
Byzantine generals' broadcast
BG_Send(k, v, I)
  send v to every general in I
BG_Receive(k)
  Let v be the value received, or "Retreat" if no value is received
  before the timeout
  Let I be the set of generals who have not yet broadcast v
  (the delivery list for this message)
  BG_Send(k-1, v, I - self)
  Use BG_Receive(k-1) to receive v(i) for every i in I - self
  return majority(v, v(1), …, v(|I|-1))
Majority and default decision
majority(v1, v2, …, vn)
  Return the majority value v among v1, v2, …, vn, or
  "Retreat" if no majority exists
Base case
BG_Send(0, v, I)
  The commanding general broadcasts v to every other
  general in I
BG_Receive(0)
  Return the value received, or "Retreat" if no message
  is received
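A recursive Python sketch of BG(k), computed "omnisciently" rather than by real message passing; the traitor model (a traitor flips the order for every second recipient) is an assumption made for this example:

def flip(v):
    return "Retreat" if v == "Attack" else "Attack"

def majority(values):
    return ("Attack" if values.count("Attack") > values.count("Retreat")
            else "Retreat")                 # "Retreat" is the default on ties

def bg(k, commander, order, lieutenants, traitors):
    """Return {lieutenant: decided value} for BG(k)."""
    # what the commander actually sends to each lieutenant:
    sent = {p: (flip(order) if commander in traitors and i % 2 else order)
            for i, p in enumerate(lieutenants)}
    if k == 0:
        return sent                         # base case: accept as received
    decided = {}
    for p in lieutenants:
        received = [sent[p]]                # the order p got directly
        for q in lieutenants:               # each q relays via BG(k-1)
            if q != p:
                others = [r for r in lieutenants if r != q]
                received.append(bg(k - 1, q, sent[q], others, traitors)[p])
        decided[p] = majority(received)
    return decided

# 4 generals, 1 traitor (general 3), loyal commander 0 orders Attack:
result = bg(1, 0, "Attack", [1, 2, 3], traitors={3})
assert result[1] == result[2] == "Attack"   # loyal lieutenants agree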
[Figure: the BG protocol with commander C and generals 1 to 6. C sends
orders O1,…,O6; general 1 relays its order to generals 2 to 6 as L1:O1,
general 6 relays that again as L6:L1:O1, and so on.]
General 2 decides the value from general 1 by
majority(L1:O1, L3:L1:O1, L4:L1:O1, L5:L1:O1, L6:L1:O1).
In a similar way, general 2 decides the values from generals 3, 4, 5, 6.
Finally, general 2 decides on the order by taking the majority of these
values together with the one received directly from C.
Lemma
For any t and k, if the commanding general is loyal, the BG(k)
protocol is correct if there are no more than t traitors and at
least 2t+k+1 generals (2t+k in the textbook; a mistake?).
Proof.
By induction on k. Base case k = 0: BG(0) works because the
loyal generals just accept the order from the commanding
general, which is assumed to be loyal.
Assume BG(k-1) works for 2t+k generals and t traitors. Consider
the case of 2t+k+1 generals and t traitors.
[Figure: the loyal commander C sends the same order to all lieutenants:
O1 = O2 = … = O(t+k) = … = O(2t+k)]
After receiving the order from the commanding general, each
of the t+k loyal lieutenants will rebroadcast the correct order. There
are t traitors. By the induction hypothesis, a loyal general will decide
on the correct values of the other t+k-1 loyal lieutenants. Together with
the order from the commanding general, there are t+k > t correct orders
and at most t incorrect orders, so the loyal general will decide on the
right order.
Theorem
For any k, the BG(k) protocol is correct if there are more than 3k
generals and no more than k traitors.
Proof:
Induction on k. Base case k = 0: the protocol is correct, because
there are no traitors.
Assume BG(k-1) works if there are more than 3(k-1) generals and
no more than k-1 traitors. Consider 3k+1 generals and k
traitors. If the commander is loyal, then the Lemma says the protocol
is correct, because there are 3k+1 = 2k+k+1 generals. If the
commander is disloyal, then when any other general rebroadcasts,
there are 3k > 3(k-1) generals and at most k-1 traitors among them, so
the loyal generals agree on the rebroadcast orders, and therefore agree
on the final order.
Distributed Shared Memory (DSM)
Process Communication Paradigms
message passing
remote procedure call (RPC)
distributed shared memory
first introduced by K. Li in his PhD thesis, 1986
RPC and DSM provide abstractions, and they are implemented by
message passing in distributed systems.
DSM has a mapping and management software layer between the DSM
abstraction and the message passing mechanism
Shared Memory
tightly coupled systems
memory accessed via a common bus or network
direct information sharing
programming is similar to conventional shared memory programming
(a logical shared memory)
memory management problems: efficiency, coherence/consistency
A generic NUMA architecture
[Figure: processors, each with local memory and a memory coherence
controller, connected by buses or a network]
NUMA: Nonuniform Memory Access
local/remote accesses, with nonuniform access times
NUMA Architectures
Memory Consistency Models
Process viewpoint (compared to the data viewpoint of a distributed
file system)
A spectrum of models:
  weak consistency: more concurrency, difficult to program
  strong consistency: less concurrency, easy to program
General Access Consistency Models
R(X)v: read variable X, returning value v
W(X)v: write variable X with value v
Atomic (strict) consistency:
All reads and writes must appear to be executed atomically and
sequentially. All processors observe the same ordering of event
execution, which coincides with the real-time order of occurrence.

P1: W(X)1
P2:        R(X)1
atomically consistent

P1: W(X)1
P2:        R(X)0 R(X)1
not atomically consistent

This is the strictest consistency model. It has high implementation
complexity and is usually used only as a baseline to evaluate the
performance of other consistency models.
Sequential consistency
Defined by Lamport: the result of any execution is the same as
if the operations of all processors were executed in some sequential
order, and the operations of each individual processor appear in
this sequence in the order specified by its program.
Interleaving; the real-time order is not required.

P1: W(X)1
P2:        R(X)1 R(X)1
atomically consistent

P1: W(X)1
P2:        R(X)0 R(X)1
not atomically consistent

both are sequentially consistent
Programming friendly, but poor performance.
Causal consistency
Writes that are potentially causally related must be seen in the same
order by all processors.
Concurrent writes may be seen in a different order on different
processors (therefore they may not lead to a global sequential order).

P1: W(X)1                 W(X)3
P2:    R(X)1 W(X)2
P3:    R(X)1                     R(X)3 R(X)2
P4:    R(X)1                     R(X)2 R(X)3
causally consistent, not sequentially consistent
Causal consistency (continued)

P1: W(X)1
P2:    R(X)1 W(X)2
P3:                R(X)2 R(X)1
P4:                R(X)1 R(X)2
not causally consistent

If we remove R(X)1 at P2, then W(X)1 and W(X)2 are concurrent:

P1: W(X)1
P2:    W(X)2
P3:          R(X)2 R(X)1
P4:          R(X)1 R(X)2
causally consistent
Processor consistency
Writes from the same processor are performed and observed in the
order they were issued. Writes from different processors can be observed
in any order.

P1: W(X)1
P2:    R(X)1 W(X)2
P3:                R(X)1 R(X)2
P4:                R(X)2 R(X)1
processor consistent, not causally consistent
Slow memory consistency
Writes to the same location by the same processor must be observed in order.

P1: W(X)1 W(Y)2 W(X)3
P2:                    R(Y)2 R(X)1 R(X)3
slow memory consistent
Consistency models with synchronization access
use information from the user to relax consistency
synchronization access: read/write operations on synchronization
variables, only by special instructions
Weak consistency
• Accesses to synchronization variables are sequentially consistent
• No access to a synchronization variable is issued by a processor
  before all previous read/write operations have been performed
• No read/write data access is issued by a processor before a
  previous access to a synchronization variable has been performed

P1: W(X)1 W(X)2 S
P2:    R(X)1 R(X)2 S
P3:    R(X)2 R(X)1 S
weakly consistent

P1: W(X)1 W(X)2 S
P2:                S R(X)1
not weakly consistent
Release consistency
Uses a pair of synchronization operations: acquire(S) and release(S)
• No future access can be performed until the acquire operation has
  completed
• All previous operations must have been performed before the
  completion of the release operation
• The ordering of synchronization accesses follows the processor
  consistency model (acquire behaves like a read, release like a write)
Entry consistency
Locks objects, instead of locking critical sections
• For each shared variable X, associate acquire(X) and release(X)
• acquire(X) locks the shared variable X for the subsequent exclusive
  operations on X until X is unlocked by a release(X)