Operating Systems & Concurrent Programming
Lecturer: Xu Qiwen, qwxu@umac.mo
Textbook: Distributed Operating Systems & Algorithms, Randy Chow and Theodore Johnson, Addison Wesley, 1997

In this course, we study
• OS, network and distributed, in particular the algorithms used in these systems
• Concurrent programming, mainly the analysis of distributed algorithms, such as simulation and verification of the algorithms

The Spin system
• Modelling language Promela: concurrent processes communicate via message channels, either synchronous (hand-shaking) or asynchronous (buffered)
• Simulation
• Verification by model checking: exhaustive search of the state space to check whether properties are satisfied or not
  - system invariants
  - progress
  - Linear Temporal Logic
Spin was developed by G.J. Holzmann at AT&T: http://netlib.bell-labs.com/netlib/spin/whatisspin.html
Formal methods library: www.afm.sbu.ac.uk/fm/

A spectrum of operating systems
In order of decreasing degree of hardware and software coupling: 1st centralized operating system, 2nd network operating system, 3rd distributed operating system, 4th cooperative autonomous system.

A comparison of features in modern operating systems
first generation: centralized operating system
  characteristics: process management, memory management, I/O management, file management
  goals: resource management; extended machine (virtuality)
second generation: network operating system
  characteristics: remote access, information exchange, network browsing
  goals: resource sharing (interoperability)
third generation: distributed operating system
  characteristics: global view of file system, name space, time, security, computational power
  goals: single-computer view of a multiple-computer system (transparency)
fourth generation: cooperative autonomous system
  characteristics: open and cooperative distributed applications
  goals: cooperative work (autonomicity)

Causality
A fundamental property of a distributed system is the lack of a global system state. This is due to
- noninstantaneous communication: propagation delay, contention for network resources, lost messages
- clock synchronization: clock drift
- unpredictable execution: CPU contention, interrupts, page faults, garbage collection
Therefore, in distributed systems we can only talk about causality.
Causal: the cause precedes the effect; sending precedes receipt.

E: the set of all events
Ep: the set of all events that occur at processor p
e1 <p e2: e1 precedes e2 at processor p; for any e1, e2 in Ep, either e1 <p e2 or e2 <p e1
e1 <m e2: e1 is the sending of message m, e2 is the receipt of message m

Happens-before relation <H:
1. if e1 <p e2, then e1 <H e2
2. if e1 <m e2, then e1 <H e2
3. if e1 <H e2 and e2 <H e3, then e1 <H e3

A happens-before relation can be drawn as an H-DAG.
[Figure: an H-DAG over processors p1, p2, p3 with events e1–e8; for example, e1 <p1 e4 <p1 e7, e1 <m e3, e5 <m e8, e3 <p2 e5, and hence e1 <H e8.]

Lamport Timestamps Algorithm
Global time does not exist. A global "clock" is a total order on the events that must be consistent with the happens-before relation <H; it is computed by an algorithm on the fly.
e.TS: the timestamp attached to event e
my_TS: the local timestamp of the processor

Initially
  my_TS = 0
On event e
  if e is the receipt of message m
    my_TS = max(m.TS, my_TS)
  my_TS++
  e.TS = my_TS
  if e is the sending of message m
    m.TS = my_TS

If e1 <H e2, then e1.TS < e2.TS. To break ties between identical timestamps, Lamport suggests using the processor address for the low-order bits of the timestamp.
There is no guarantee that if e1.TS < e2.TS then e1 <H e2. Therefore, Lamport timestamps cannot be used to detect, for example, causality violations.
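As a concrete illustration, here is a minimal sketch of the Lamport clock in Python (the class and method names are mine, not from the textbook); the hosting process is assumed to call these methods at the corresponding events.

```python
class LamportClock:
    """A minimal sketch of Lamport's logical clock."""

    def __init__(self, address):
        self.my_ts = 0
        self.address = address   # used only to break ties

    def local_event(self):
        self.my_ts += 1          # every event ticks the clock
        return self.my_ts        # e.TS

    def send(self):
        self.my_ts += 1
        return self.my_ts        # m.TS, attached to the outgoing message

    def receive(self, m_ts):
        self.my_ts = max(m_ts, self.my_ts)  # first catch up with the sender
        self.my_ts += 1
        return self.my_ts        # e.TS of the receive event

    def total_order_key(self, ts):
        # Lamport's tie-breaking: the processor address as the low-order part
        return (ts, self.address)
```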
Causality violation
s(m): the event of sending m
r(m): the event of receiving m
A causality violation occurs if s(m1) <H s(m2) but r(m2) <H r(m1).

Vector timestamps have the property
  e1.VT <v e2.VT iff e1 <H e2
A timestamp must therefore be able to tell which events of every processor an event causally follows.
VT: an array of integers
VT[i] = k: the event causally follows the first k events at processor i
e1.VT <=v e2.VT: e1.VT[i] <= e2.VT[i] for every i = 1..M
e1.VT <v e2.VT: e1.VT <=v e2.VT and e1.VT ≠ e2.VT

Vector timestamp algorithm
Initially
  my_VT = [0,...,0]
On event e
  if e is the receipt of message m
    for i = 1 to M
      my_VT[i] = max(m.VT[i], my_VT[i])
  my_VT[self]++
  e.VT = my_VT
  if e is the sending of message m
    m.VT = my_VT

We show that if e1.VT <v e2.VT, then e1 <H e2. Suppose e1 is at processor i and e2 is at processor j. From e1.VT <v e2.VT, e1.VT[i] <= e2.VT[i]. Since e1.VT[i] counts e1 itself, e2 causally knows of e1, and this knowledge can only have propagated through a chain of messages from processor i; therefore e1 <H e2.
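Here is a matching sketch of vector clocks in Python (illustrative names; M is the number of processors):

```python
class VectorClock:
    """Minimal sketch of the vector timestamp algorithm."""

    def __init__(self, self_id, M):
        self.self_id = self_id
        self.my_vt = [0] * M

    def on_event(self, m_vt=None):
        # on a receipt, first merge component-wise with the message's vector
        if m_vt is not None:
            self.my_vt = [max(a, b) for a, b in zip(self.my_vt, m_vt)]
        self.my_vt[self.self_id] += 1
        return list(self.my_vt)  # e.VT (and m.VT if this event is a send)

def vt_less(vt1, vt2):
    """e1.VT <v e2.VT, which holds iff e1 <H e2."""
    return all(a <= b for a, b in zip(vt1, vt2)) and vt1 != vt2
```

Two events are concurrent exactly when neither vt_less(a, b) nor vt_less(b, a) holds; this is the information Lamport timestamps cannot express.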
Causal communication
Goal: ensure that there are no causality violations. Assume point-to-point messages are delivered in the order sent.
Main idea: hold back a message m until no message m' with s(m') <H s(m) can still be delivered from any other processor.
earliest[1..M]
  earliest[k]: the timestamp of the earliest message that can still be delivered from processor k; initially the smallest timestamp 1_k (the value 1 for Lamport timestamps, the vector (0,...,0,1,0,...,0) with the 1 in position k for vector timestamps)
blocked[1..M]
  blocked[k]: the queue of blocked messages from processor k

Causal message delivery algorithm
Initially
  earliest[k] = 1_k, for k = 1..M
  blocked[k] = {}, for k = 1..M
On the receipt of message m from processor p
  delivery_list = {}
  if blocked[p] is empty
    earliest[p] = m.timestamp
  add m to the tail of blocked[p]
  while there is a k such that blocked[k] is not empty and
        for every i ∉ {k, self}: not_earlier(earliest[i], earliest[k], i)
    remove the message at the head of blocked[k] and put it in delivery_list
    if blocked[k] is not empty
      set earliest[k] to m'.timestamp, where m' is the new head of blocked[k]
    else
      increment earliest[k] by 1_k
  deliver the messages in delivery_list, in causal order

Deadlock in the algorithm: if one processor does not send messages, deliveries from the other processors can be blocked forever.

Multicast communication
Every processor receives the same set of messages: if p receives m1 and m2 <H m1, then p will eventually receive m2.

Distributed Snapshots
There is no global state; a distributed snapshot is a global view of the system that is consistent with causality.
Si: the state of processor Pi
S = (S1, S2, ..., SM)
Cij: the communication channel from Pi to Pj
C = {Cij | i, j ∈ 1..M}
Lij = (m1, m2, ..., mk): the messages sent by Pi but not yet received by Pj
L = {Lij | i, j ∈ 1..M}
Global state G = (S, L)

Consistent cut
The observations of the different processors should be concurrent.
Snapshot token: a special message indicating that a state is to be recorded.
[Figure: observations O1, O2 and O3 cutting across processors p and q; O1 and O2 are concurrent, O1 and O3 are not concurrent (in the original system, i.e. without the snapshot tokens).]

Distributed Snapshot Algorithm
Variables
  integer my_version
  integer current_snap[1..M]
  integer tokens_received[1..M]
  processor_state S[1..M]
  channel_state L[1..M][1..M]
S[r] contains processor self's state, and L[r][q] contains L(q,self), in the snapshot requested by processor r. (O_self and I_self below denote self's sets of outgoing and incoming channels.)

Initially
  my_version = 0
  for each processor p
    current_snap[p] = 0

execute_snapshot()
  wait for a snapshot request or a token
  Snapshot_Request:
    my_version++
    S[self] = current state
    current_snap[self] = my_version
    for each q in O_self
      send(q, TOKEN, self, my_version)
    tokens_received[self] = 0
  TOKEN(q; r, version):
    if current_snap[r] < version
      S[r] = current state
      current_snap[r] = version
      L[r][q] = ()
      for every p in O_self
        send(p, TOKEN, r, version)
      tokens_received[r] = 1
    else if current_snap[r] = version
      tokens_received[r]++
      put the messages received from q since first receiving TOKEN(r, version) into L[r][q]
    if tokens_received[r] = |I_self|
      the local snapshot for (r, version) is finished

Distributed Mutual Exclusion
Timestamp algorithms: a requester records a timestamp and sends requests to the other processors; the other processors grant or defer the request using the timestamp information.

Variables
  timestamp current_time
  timestamp my_timestamp
  integer   reply_pending
  boolean   is_requesting
  boolean   reply_deferred[1..M]

Requesting the critical section
Request_CS()
  my_timestamp = current_time
  is_requesting = True
  reply_pending = M - 1
  for every other processor q
    send(q, REMOTE_REQUEST, my_timestamp)
  wait until reply_pending = 0
  (critical section)

Monitoring
CS_Monitor()
  wait for a REMOTE_REQUEST or a REPLY message
  REMOTE_REQUEST(q, request_timestamp):
    if not is_requesting or my_timestamp > request_timestamp
      send(q, REPLY)
    else
      reply_deferred[q] = True
  REPLY(q):
    reply_pending--

Releasing the critical section
Release_CS()
  (leave the critical section)
  is_requesting = False
  for q = 1 to M
    if reply_deferred[q] = True
      send(q, REPLY)
      reply_deferred[q] = False

Voting
Processors compete for votes to enter the critical section.

Naive voting algorithm
Naive_Voting_Enter_CS()
  send a vote request to all processors
  wait until (M+1)/2 votes have been received (a majority)
  (critical section)

Voting with districts
Sp: the voting district of processor p
Requirement: Si ∩ Sj ≠ {} for all 1 <= i, j <= M

Variables used in the voting-based algorithm
  S_self: the voting district
  current_timestamp
  my_timestamp
  yes_votes
  have_voted
  candidate: the candidate voted for
  candidate_timestamp
  have_inquired: true if this voter has tried to recall its vote
  waitingQ

Requesting the critical section
Request_CS()
  my_timestamp = current_timestamp
  for every processor r in S_self
    send(r, REQUEST, my_timestamp)
  while yes_votes < |S_self|
    wait for a YES, NO or INQUIRE message
    YES(q):
      yes_votes++
    INQUIRE(q, inquire_timestamp):
      if my_timestamp = inquire_timestamp
        send(q, RELINQUISH)
        yes_votes--

Monitoring the critical section
Voter()
  while True
    wait for a REQUEST, RELEASE, or RELINQUISH message
    REQUEST(q, request_timestamp):
      if have_voted is False
        send(q, YES)
        candidate_timestamp = request_timestamp
        candidate = q
        have_voted = True
      else
        add (q, request_timestamp) to waitingQ
        if request_timestamp < candidate_timestamp and not have_inquired
          have_inquired = True
          send(candidate, INQUIRE, candidate_timestamp)
    RELINQUISH(q):
      add (candidate, candidate_timestamp) to waitingQ
      remove the (s, timestamp) with the minimum timestamp from waitingQ
      send(s, YES)
      candidate_timestamp = timestamp
      candidate = s
      have_inquired = False
    RELEASE(q):
      if waitingQ is not empty
        remove the (s, timestamp) with the minimum timestamp from waitingQ
        send(s, YES)
        candidate_timestamp = timestamp
        candidate = s
      else
        have_voted = False
      have_inquired = False
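The notes require only that the districts pairwise intersect. One standard way to construct such districts (a grid construction in the style of Maekawa's quorums; this is an illustration, not from the lecture) is sketched below:

```python
import math

def grid_districts(M):
    """Voting districts with the property S_i ∩ S_j ≠ {}: each processor's
    district is its row plus its column in a sqrt(M) x sqrt(M) grid."""
    side = math.isqrt(M)
    assert side * side == M, "for simplicity, assume M is a perfect square"
    S = {}
    for p in range(M):
        row, col = divmod(p, side)
        S[p] = ({row * side + c for c in range(side)}      # p's row
                | {r * side + col for r in range(side)})   # p's column
    return S

# Any two districts intersect: the row of one always meets the column of
# the other. Each district has only about 2*sqrt(M) members.
S = grid_districts(16)
assert all(S[i] & S[j] for i in range(16) for j in range(16))
```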
Fixed Logical Structure
A processor can enter the critical section if it possesses a token. The logical structure can be a ring or a tree.

Variables used by the fixed-structure algorithm
  Token_hldr: true if this processor holds the token
  Incs: true if this processor is in the critical section
  current_dir: the neighbour in whose direction the token lies
  Request_Q: the queue of pending requests
Operations on Request_Q: Nq(q) enqueues q, Dq() dequeues the head, ismt() tests for emptiness.

Raymond's algorithm
Requesting and releasing the critical section
Request_CS()
  if not Token_hldr
    if ismt()
      send(current_dir, REQUEST)
    Nq(self)
    wait until Token_hldr is True
  Incs = True

Release_CS()
  Incs = False
  if not ismt()
    current_dir = Dq()
    send(current_dir, TOKEN)
    Token_hldr = False
    if not ismt()
      send(current_dir, REQUEST)

Monitor_CS()
  while True
    wait for a REQUEST or a TOKEN
    REQUEST(q):
      if Token_hldr
        if Incs
          Nq(q)
        else
          current_dir = q
          send(current_dir, TOKEN)
          Token_hldr = False
      else
        if ismt()
          send(current_dir, REQUEST)
        Nq(q)
    TOKEN:
      current_dir = Dq()
      if current_dir = self
        Token_hldr = True
      else
        send(current_dir, TOKEN)
        if not ismt()
          send(current_dir, REQUEST)

Path compression
Variables: Token_hldr, Incs, IsRequesting, current_dir, and
  next: the next processor to receive the token; NIL if the processor is at the end of the waiting list (i.e. it has just requested)

Request_CS()
  IsRequesting = True
  if not Token_hldr
    send(current_dir, REQUEST, self)
    current_dir = self
    next = NIL
    wait until Token_hldr is True
  Incs = True

Release_CS()
  Incs = False
  IsRequesting = False
  if next ≠ NIL
    Token_hldr = False
    send(next, TOKEN)
    next = NIL

Monitor_CS()
  while True
    wait for a REQUEST or a TOKEN
    REQUEST(requester):
      if IsRequesting
        if next = NIL
          next = requester
        else
          send(current_dir, REQUEST, requester)
      else if Token_hldr
        Token_hldr = False
        send(requester, TOKEN)
      else
        send(current_dir, REQUEST, requester)
      current_dir = requester
    TOKEN:
      Token_hldr = True

Leader Election
One node acts as the coordinator, the others as participants.

The Bully Algorithm
Assumptions
1. the message propagation time is bounded by Tm
2. the message handling time is bounded by Tp
Failure-detector timeout: T = 2Tm + Tp

Variables
  state: {Down, Election, Reorganization, Normal}
  coordinator: the current coordinator
  definition: the task description of the current computation
  up: the set of nodes known by the coordinator to be up
  halted: the node whose Enter_Election message this node last accepted

Correctness assertions
1. If statei ∈ {Normal, Reorganization} and statej ∈ {Normal, Reorganization}, then coordinatori = coordinatorj.
2. If statei = statej = Normal, then definitioni = definitionj.
3. (liveness) Eventually:
   • statei = Normal and coordinatori = i, and
   • for every other non-failed node j, statej = Normal and coordinatorj = i.

Idea of the Bully Algorithm
• Each node has a priority.
• In an election, a node first checks whether the higher-priority nodes have failed; if so, the node knows it should be the leader.
• The leader "bullies" the other nodes into accepting its leadership.
• An election is initiated by Coordinator_Timeout, when a node has not heard from the coordinator for a long time, or by Recovery, when a node recovers from a failure.
• The leader calls an election if it detects that a processor has failed or that a failed processor has recovered.

Algorithm to initiate an election by a node
Coordinator_Timeout()
  if state = Normal or state = Reorganization
    send(coordinator, AreYouUp)
    wait until the coordinator sends AYU_answer, timeout = T
    on timeout
      Election()

Recovery()
  state = Down
  Election()

Algorithm by the coordinator to check the state of the other processors
Check()
  if state = Normal and coordinator = self
    for every other node j
      send(j, AreYouNormal)
      wait until j sends (AYN_answer, status), timeout = T
      if (j ∈ up and status = False) or (j ∉ up and an answer arrived)
        Election()
        return()

Bully election algorithm
Election()
  highest = True
  for every higher-priority processor p
    send(p, AreYouUp)
  wait up to T seconds for AYU_answer messages
    AYU_answer(sender): highest = False
  if highest = False
    return()
  state = Election
  halted = self
  up = {}
  for every lower-priority processor p
    send(p, Enter_Election)
  wait up to T for EE_answer messages
    EE_answer(sender): up = up ∪ {sender}

Bully election algorithm, continued
Election()
  ......
  num_answers = 0
  coordinator = self
  state = Reorganization
  for each p in up
    send(p, Set_Coordinator, self)
  wait up to T for SC_answer messages
    SC_answer(sender): num_answers++
  if num_answers < |up|
    Election()
    return()

Bully election algorithm, continued
Election()
  ......
  num_answers = 0
  for each p in up
    send(p, New_State, definition)
  wait up to T for NS_answer messages
    NS_answer(sender): num_answers++
  if num_answers < |up|
    Election()
    return()
  state = Normal
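The control flow above is spread over several callbacks; as a compact illustration, here is one election attempt in Python, with the message layer abstracted into an ask(node, msg, T) call that returns the answer or None on timeout (all names are illustrative, not from the textbook):

```python
def election(self_id, nodes, ask, T, definition):
    """One bully election attempt (sketch). Higher id = higher priority.
    ask(node, msg, T) sends msg and returns the reply, or None on timeout."""
    # Step 1: if any higher-priority node is up, it will take over.
    for p in (n for n in nodes if n > self_id):
        if ask(p, "AreYouUp", T) is not None:
            return None  # not the highest live node; abandon this attempt
    # Step 2: halt the lower-priority nodes that answer.
    up = [p for p in nodes if p < self_id
          and ask(p, "Enter_Election", T) is not None]
    # Step 3: bully them into accepting this node as coordinator.
    for p in up:
        if ask(p, ("Set_Coordinator", self_id), T) is None:
            return election(self_id, nodes, ask, T, definition)  # restart
    # Step 4: distribute the new task definition.
    for p in up:
        if ask(p, ("New_State", definition), T) is None:
            return election(self_id, nodes, ask, T, definition)  # restart
    return up  # this node is now the coordinator of these nodes
```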
Monitoring the election
Monitor_Election()
  while True
    wait for a message
    case AreYouUp(sender):
      send(sender, AYU_answer)
    case AreYouNormal(sender):
      if state = Normal
        send(sender, AYN_answer, True)
      else
        send(sender, AYN_answer, False)
    case Enter_Election(sender):
      state = Election
      stop_processing()    -- stop the election procedure if it is executing
      halted = sender
      send(sender, EE_answer)

Monitoring the election, continued
Monitor_Election()
  ......
    case Set_Coordinator(sender, newleader):
      if state = Election and halted = newleader
        coordinator = newleader
        state = Reorganization
        send(sender, SC_answer)
    case New_State(sender, newdef):
      if coordinator = sender and state = Reorganization
        definition = newdef
        state = Normal

The Invitation Algorithm
Assumption: message delays can be arbitrary, so there is no global coordinator. Processors are organized into groups; different groups have different coordinators, and groups are merged into larger groups.

Correctness assertions
1. If statei ∈ {Normal, Reorganization}, statej ∈ {Normal, Reorganization}, and Groupi = Groupj, then Coordinatori = Coordinatorj.
2. If statei = statej = Normal and Groupi = Groupj, then Definitioni = Definitionj.

Check()
  if state = Normal and Coordinator = self
    others = {}
    for every other node p
      send(p, AreYouCoordinator)
    wait up to T seconds for AYC_answer messages
      AYC_answer(sender, is_coordinator):
        if is_coordinator
          others = others ∪ {sender}
    if others = {}
      return()
    wait for a time inversely proportional to your priority
    Merge(others)

Timeout()
  if Coordinator = self
    return()
  send(Coordinator, AreYouThere, Group)
  wait for AYT_answer, timeout = T
    on timeout: is_coordinator = False
    AYT_answer(sender, is_coordinator): skip
  if is_coordinator = False
    Recovery()

Merge(Coordinator_set)
  if Coordinator = self and state = Normal
    state = Election
    stop_processing()
    counter++
    Group = (self | counter)
    Coordinator = self      { not necessary, or a problem with interleaving with Invitation()?* }
    UpSet = Up
    Up = {}
    for each p in Coordinator_set
      send(p, Invitation, self, Group)
    for each p in UpSet
      send(p, Invitation, self, Group)
    wait for T seconds    /* answers are collected by the Monitor_Election thread */
* Invitation() contains Coordinator = new_coordinator.

Merge(Coordinator_set), continued
  ......
  state = Reorganization
  num_answer = 0
  for each p in Up
    send(p, Ready, Group, Definition)
  wait up to T seconds for Ready_answer messages
    Ready_answer(sender, in_group, new_group):
      if in_group and new_group = Group
        num_answer++
  if num_answer < |Up|
    Recovery()
  else
    state = Normal
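Both election algorithms are built on the same probe-with-timeout primitive (AreYouUp, AreYouThere, AreYouCoordinator, ...). A minimal sketch of such a helper, using queues as a stand-in for the network (illustrative only):

```python
import queue

def ask(outbox, inbox, node, msg, T):
    """Send msg to node and wait up to T seconds for the answer.
    Returns the reply, or None on timeout (the node is presumed failed).
    Under the Bully assumptions, T = 2*Tm + Tp suffices."""
    outbox.put((node, msg))           # send(node, msg)
    try:
        return inbox.get(timeout=T)   # wait for the answer
    except queue.Empty:
        return None                   # timeout
```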
Invitation()
  while True
    wait for Invitation(new_coordinator, new_group)
    if state = Normal
      stop_processing()
    old_coordinator = Coordinator
    UpSet = Up
    state = Election
    Coordinator = new_coordinator
    Group = new_group
    if old_coordinator = self
      for each p in UpSet
        send(p, Invitation, Coordinator, Group)
    send(Coordinator, Accept, Group)
    wait up to T seconds for an Accept_answer(sender, accepted)
      on timeout: accepted = False
    if accepted = False
      Recovery()
    state = Reorganization
Question: is this better put in the Monitor thread?

Election_Monitor()
  while True
    wait for a message
    Ready(sender, new_group, new_definition):
      if Group = new_group and state = Reorganization
        Definition = new_definition
        state = Normal
        send(Coordinator, Ready_answer, True, Group)
      else
        send(sender, Ready_answer, False)
    AreYouCoordinator(sender):
      if state = Normal and Coordinator = self
        send(sender, AYC_answer, True)
      else
        send(sender, AYC_answer, False)
    AreYouThere(sender, old_group):
      if Group = old_group and Coordinator = self and sender ∈ Up
        send(sender, AYT_answer, True)
      else
        send(sender, AYT_answer, False)
    Accept(sender, new_group):
      if state = Election and Coordinator = self and Group = new_group
        Up = Up ∪ {sender}
        send(sender, Accept_answer, True)
      else
        send(sender, Accept_answer, False)

Recovery()
  state = Election
  stop_processing()
  counter++
  Group = (self | counter)
  Coordinator = self
  Up = {}
  state = Reorganization
  Definition = {a single-node task description}
  state = Normal

Data Management
The ACID properties
Atomicity: either all of the operations in a transaction are performed or none of them is, in spite of failures.
Consistency (serializability): the execution of interleaved transactions is equivalent to a serial execution of the transactions in some order.
Isolation: the partial results of an incomplete transaction are not visible to others before the transaction is successfully committed.
Durability: the system guarantees that the results of a committed transaction are made permanent, even if a failure occurs after the commitment.

Data replication makes the ACID properties more difficult to ensure.

Atomicity
All processors involved in a transaction must agree on whether to commit or abort it.
Naive protocol: the coordinator completes its execution, commits, and sends commit messages to the other processors.
Problem with the naive protocol: if a participant processor fails, it will not successfully commit, and therefore not all processors commit.

Database technique: Two-Phase Commit (2PC)
2PC_Coordinator()
  precommit the transaction
  for every participant p
    send(p, VOTE_REQ)
  wait up to T for VOTE messages
    VOTE(sender, vote_response):
      if vote_response = YES
        increment the number of yes votes
  if every participant responded with a YES vote
    commit the transaction
    for every participant p
      send(p, COMMIT)
  else
    abort the transaction
    for every participant p
      send(p, ABORT)

2PC_Participant()
  while True
    wait for a message from the coordinator
    VOTE_REQ(coordinator):
      if I can commit the transaction
        precommit the transaction
        write a YES vote to the log
        send(coordinator, YES)
      else
        abort the transaction
        send(coordinator, NO)
    COMMIT(coordinator):
      commit the transaction
    ABORT(coordinator):
      abort the transaction

Failure handling
• If any processor fails prior to the vote request, abort.
• If the coordinator fails after precommitting but before committing, abort after recovery. (The textbook also says "in practice, the coordinator will attempt to commit"; my understanding is that the coordinator will perform another round of vote requests.)
• If a participant fails after precommitting but before committing, it contacts the other processors after recovery to decide (the transaction may or may not have committed).

Disadvantage of two-phase commit: if the coordinator fails after a participant has voted YES, the participant must wait until the coordinator recovers. The protocol cannot complete: it is blocked.
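As a sketch of the two message rounds, here is the coordinator side in Python, with the transport and the local transaction actions passed in as hypothetical helpers (none of these names come from the textbook):

```python
def two_pc_coordinator(participants, send, recv_votes, do_commit, do_abort, T):
    """Sketch of the 2PC coordinator. send and recv_votes abstract the
    network; recv_votes(T) returns the (sender, vote) pairs that arrive
    within T seconds, so a missing answer counts as a NO vote."""
    # Phase 1: solicit votes (the transaction is already precommitted).
    for p in participants:
        send(p, "VOTE_REQ")
    votes = recv_votes(T)
    all_yes = (len(votes) == len(participants)
               and all(v == "YES" for _, v in votes))
    # Phase 2: decide and announce.
    decision = "COMMIT" if all_yes else "ABORT"
    (do_commit if all_yes else do_abort)()
    for p in participants:
        send(p, decision)
    return decision
```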
Three-Phase Commit
Avoids blocking if a majority of the processors agree on the action.

Serializability (consistency)
An interleaved execution is serializable if its result is equivalent to that of some serial execution.

Example
t0: bt Write A=100, Write B=20 et
t1: bt Read A, Read B; 1: Write sum in C; 2: Write diff in D et
t2: bt Read A, Read B; 3: Write diff in C; 4: Write sum in D et
Conflicts: Write-Write, Write-Read, Read-Write.

Interleaved schedules, with t0 < t1 < t2 (writes numbered 1–4 as above; W1 and W2 denote the values written by t1 and t2 respectively):
  1,2,3,4: log in C: W1=120, W2=80; log in D: W1=80, W2=120; result (C,D) = (80,120), consistent; 2PL: feasible; timestamp ordering: feasible
  3,4,1,2: log in C: W2=80, W1=120; log in D: W2=120, W1=80; result (120,80), consistent; 2PL: feasible; timestamp ordering: t1 aborts and restarts
  1,3,2,4: log in C: W1=120, W2=80; log in D: W1=80, W2=120; result (80,120), consistent; 2PL: not feasible; timestamp ordering: feasible
  3,1,4,2: log in C: W2=80, W1=120; log in D: W2=120, W1=80; result (120,80), consistent; 2PL: not feasible; timestamp ordering: t1 aborts and restarts
  1,3,4,2: log in C: W1=120, W2=80; log in D: W2=120, W1=80; result (80,80), inconsistent; 2PL: not feasible; timestamp ordering: cascade of aborts
  3,1,2,4: log in C: W2=80, W1=120; log in D: W1=80, W2=120; result (120,120), inconsistent; 2PL: not feasible; timestamp ordering: t1 aborts and restarts

Two-Phase Locking (2PL)
A growing phase of acquiring locks is followed by a shrinking phase of releasing them.
An extreme case: lock all objects at the beginning and release them all at the end. Serialization is trivial, but there is no concurrency; suitable only for simple applications.
2PL rules:
1. A transaction must obtain a read or a write lock on data item d before reading d, and must obtain a write lock on d before updating d.
2. After a transaction relinquishes a lock, it may not acquire any new locks.
* Many transactions can hold read locks on a data item, but if one transaction holds a write lock, no other transaction can hold any lock on that item.

Under 2PL, concurrency is limited and deadlock is possible (e.g., if t2 instead wrote D and then C, while t1 writes C and then D).
Strict 2PL: locks are released together, usually at the commit or abort point.
Non-strict 2PL is difficult to implement: it is difficult to know when the last lock has been requested.
Strict 2PL sacrifices some concurrency.

Timestamp ordering
1. When an operation on a shared object is invoked, the object records the timestamp of the invoking transaction.
2. When a (different) transaction invokes a conflicting operation on the object: if it has a larger timestamp than the one recorded by the object, the transaction proceeds (and the object records the new timestamp); otherwise the transaction is aborted (and restarts with a larger timestamp).

Optimistic concurrency control
Three phases: an execution phase, a validation phase, and an update phase.
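The timestamp-ordering rule above fits in a few lines. A sketch in Python (illustrative names; a single timestamp per object, as in the simplified rule of these notes, rather than separate read/write timestamps):

```python
class TimestampOrderedObject:
    """Simplified timestamp ordering: conflicting operations must arrive
    in increasing transaction-timestamp order."""

    def __init__(self):
        self.last_ts = 0  # the timestamp recorded by the object

    def access(self, txn_ts):
        if txn_ts > self.last_ts:
            self.last_ts = txn_ts  # record the new timestamp
            return True            # the transaction may proceed
        return False               # the transaction must abort and restart

# Schedule 3,4,1,2 from the table: t2 (timestamp 2) touches C and D first,
# so t1 (timestamp 1) is rejected and restarts with a larger timestamp.
C, D = TimestampOrderedObject(), TimestampOrderedObject()
assert C.access(2) and D.access(2)   # t2's writes proceed
assert not C.access(1)               # t1 aborts and restarts
assert C.access(3) and D.access(3)   # t1 retries with timestamp 3
```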
One-copy serializability
The result of an execution on replicated objects is equivalent to that of a serial execution on non-replicated objects.

Replica control policies
Read: read-one-primary, read-one, read-quorum
Write: write-one-primary, write-all, write-all-available, write-quorum, write-gossip

Read-one / Write-all-available
Example
t0: bt W(X) W(Y) et
t1: bt R(X) W(Y) et
t2: bt R(Y) W(X) et
t0 is the initialization, followed by t1 and t2. Only the serial schedules (t1 t2 or t2 t1) are consistent.
Now replicate X to Xa and Xb, and Y to Yc and Yd, and suppose Xa and Yd fail:
t1: bt R(Xa) (Yd fails) W(Yc) et
t2: bt R(Yd) (Xa fails) W(Xb) et
There is no conflict between t1 and t2 on any copy, yet the execution is not one-copy serializable.

Quorum voting
Read-quorum: each read operation on a replicated object d must obtain a read quorum R(d).
Write-quorum: each write operation on d must obtain a write quorum W(d).
Quorums must overlap. With V(d) the total number of copies of d:
  Write-Write conflict: 2W(d) > V(d)
  Read-Write conflict: R(d) + W(d) > V(d)
R(d) = 1, W(d) = V(d) gives read-one / write-all.

Gossip update propagation
Many applications do not need one-copy serializability.

Basic gossip protocol
TSi: the last update time of the data object, maintained by replica manager RMi
TSf: the timestamp of the last successful access operation, maintained by the file service agent (FSA)

Read:
  TSf is compared with TSi
  if TSf <= TSi (the replica's data is at least as recent as the FSA has seen)
    return the value
    TSf is set to TSi
  else
    wait until the data is updated by gossip

Update:
  TSf++
  if TSf > TSi
    the update is executed
    TSi = TSf
    the new data is propagated by gossip
  else
    the update is too late; possible actions: overwrite, or first become more up to date by a read

Gossip:
  a gossip message carrying a data value from replica j to replica i is accepted if TSj > TSi

In the basic gossip protocol, updates are simple overwrites (they do not depend on the current state). To handle read-modify updates (which depend on the current state), the Causal Order Gossip Protocol is used.
Example of causal order gossip: Figure 6.12 in the textbook.
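A sketch of the basic gossip timestamp checks in Python (a single data object; the class and method names are mine):

```python
class GossipReplica:
    """Sketch of the basic gossip protocol's timestamp rules."""

    def __init__(self):
        self.ts_i = 0      # TSi: the last update time at this replica
        self.value = None

    def read(self, ts_f):
        # serve the read only if this replica is at least as recent
        # as the client (FSA) has already seen
        if ts_f <= self.ts_i:
            return self.value, self.ts_i  # the client sets its TSf to TSi
        return None, ts_f                 # the client waits for gossip

    def update(self, ts_f, value):
        ts_f += 1                         # the FSA advances its timestamp
        if ts_f > self.ts_i:              # fresh update: apply it
            self.value, self.ts_i = value, ts_f
            return True, ts_f             # the caller propagates by gossip
        return False, ts_f                # too late: overwrite or re-read

    def gossip_in(self, ts_j, value):
        # accept a gossiped value only if it is newer than ours
        if ts_j > self.ts_i:
            self.value, self.ts_i = value, ts_j
```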
Distributed Agreement
A number of processors, some of them faulty, try to agree on a value.
Assumption: faulty processors may do anything, including the worst (Byzantine faults).
Aim: a protocol that allows all the non-faulty processors to reach agreement.

Byzantine agreement
In an ancient war in Byzantium, some generals are loyal and some are disloyal. The loyal generals need to decide together whether to attack or to retreat.
Question: if there are 3 generals, 2 loyal and 1 disloyal, can the loyal generals reach agreement?
[Figure: two 3-general scenarios. If the disloyal commander sends "attack" to one loyal general and "retreat" to the other, each loyal general sees 1 attack and 1 retreat after the exchange. If the commander is loyal and sends "attack" to both, the disloyal general can relay "retreat", so the loyal general again sees 1 attack and 1 retreat. The two cases are indistinguishable, so the loyal generals cannot decide.]

Question: can the loyal generals reach agreement if there are 4 generals, 3 loyal and 1 disloyal?
[Figure: two 4-general scenarios. Whether the disloyal general is the commander (sending a mix of attack and retreat orders) or a lieutenant (relaying false orders), each loyal general ends up seeing 2 attacks and 1 retreat (2A, 1R) and decides correctly by majority.]

Theorem
Suppose there are M generals, t of them disloyal. If M <= 3t, then the generals cannot reach agreement.
Proof idea: suppose the theorem were not true. Let each of three generals simulate up to t generals; then the three-general problem could also be solved. Contradiction.

Byzantine generals' broadcast
BG_Send(k, v, I)
  send v to every general in I
BG_Receive(k)
  let v be the value received, or "Retreat" if no value is received before the timeout
  let I be the set of generals who have not yet broadcast v (the delivery list for this message)
  BG_Send(k-1, v, I - self)
  use BG_Receive(k-1) to receive v(i) for every i in I - self
  return Majority(v, v(1), ..., v(|I|-1))

Majority and the default decision
Majority(v1, v2, ..., vn)
  return the majority value v among v1, v2, ..., vn, or "Retreat" if no majority exists

Base case
BG_Send(0, v, I)
  the commanding general broadcasts v to every other general in I
BG_Receive(0)
  return the value received, or "Retreat" if no message is received

[Figure: the commanding general C sends orders O1, ..., O6 to lieutenants 1–6; lieutenant 1 relays its order to lieutenants 2–6 as L1:O1, lieutenant 6 relays that as L6:L1:O1 to lieutenants 2–5, and so on.]
General 2 decides the value from general 1 by majority(L1:O1, L3:L1:O1, L4:L1:O1, L5:L1:O1, L6:L1:O1). In a similar way, general 2 decides the values from generals 3, 4, 5 and 6. Finally, general 2 decides the overall value by taking the majority of these values together with the one received directly from C.

Lemma
For any t and k, if the commanding general is loyal, the BG(k) protocol is correct if there are no more than t traitors and at least 2t+k+1 generals (the textbook says 2t+k; a mistake?).
Proof. By induction on k. Base case k = 0: BG(0) works because the loyal generals just accept the order from the commanding general, which is assumed to be loyal. Assume BG(k-1) works for 2t+k generals and t traitors, and consider the case of 2t+k+1 generals and t traitors.
[Figure: the loyal commander C sends the same order to all 2t+k lieutenants: O1 = O2 = ... = O2t+k.]
After receiving the command from the commanding general, each loyal lieutenant rebroadcasts the correct command using BG(k-1) among the remaining 2t+k generals. By the induction hypothesis, a loyal general decides on the correct values of the other t+k-1 loyal lieutenants. Together with the order received directly from the commanding general, it has t+k > t correct orders and at most t incorrect orders, so it decides on the right order.

Theorem
For any k, the BG(k) protocol is correct if there are more than 3k generals and no more than k traitors.
Proof: by induction on k. Base case k = 0: the protocol is correct because there are no traitors. Assume BG(k-1) works if there are more than 3(k-1) generals and no more than k-1 traitors, and consider 3k+1 generals and k traitors. If the commander is loyal, the Lemma says the protocol is correct, because there are 3k+1 = 2k+k+1 generals. If the commander is disloyal, then when any other general rebroadcasts, there are 3k > 3(k-1) generals of which at most k-1 are traitors, so by the induction hypothesis the loyal generals agree on each rebroadcast order, and therefore agree on the final order.
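The recursive broadcast can be simulated in a few dozen lines. Below is a runnable Python sketch of the oral-messages recursion, deciding by majority with "Retreat" as the default; the traitors' behaviour (alternating orders by receiver id) is an arbitrary choice for the demo, and all names are mine:

```python
from collections import Counter

RETREAT = "R"

def majority(values):
    """The majority value, or the default "Retreat" if none exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else RETREAT

def bg(commander, lieutenants, order, m, traitors):
    """BG(m) oral-messages broadcast (exponential; demo sizes only).
    Returns {lieutenant: decided value}."""
    def sent(src, dst, v):
        # a traitor may send anything; this demo alternates by receiver id
        return ("A" if dst % 2 == 0 else RETREAT) if src in traitors else v

    received = {lt: sent(commander, lt, order) for lt in lieutenants}
    if m == 0:
        return received
    decided = {}
    for lt in lieutenants:
        relayed = []
        for x in lieutenants:
            if x == lt:
                continue
            # x acts as commander in BG(m-1), relaying what it received
            sub = bg(x, [o for o in lieutenants if o != x],
                     received[x], m - 1, traitors)
            relayed.append(sub[lt])
        decided[lt] = majority([received[lt]] + relayed)
    return decided

# 4 generals, 1 traitor (M > 3t): the loyal lieutenants always agree.
print(bg("C", [1, 2, 3], "A", 1, traitors={"C"}))  # all loyal decide "R"
print(bg("C", [1, 2, 3], "A", 1, traitors={3}))    # 1 and 2 decide "A"
```

With 3 generals and 1 traitor, by the theorem, no such protocol can make the loyal lieutenants agree.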
Distributed Shared Memory (DSM)
Process communication paradigms: message passing, remote procedure call (RPC), and distributed shared memory, first introduced by K. Li in his PhD thesis, 1986.
RPC and DSM provide abstraction; in distributed systems, both are implemented on top of message passing. DSM requires a mapping and management software layer between the DSM abstraction and the message-passing mechanism.

Shared memory
In tightly coupled systems, memory is accessed via a common bus or network, giving direct information sharing. Programming is similar to conventional shared-memory programming (a logical shared memory). Memory management problems: efficiency, coherence/consistency.

A generic NUMA architecture
[Figure: processors with local memories behind memory coherence controllers, connected by buses or a network.]
NUMA: Nonuniform Memory Access; local and remote accesses have different costs, i.e. access time is not uniform.

Memory consistency models
From the process viewpoint (compared with the data viewpoint of a distributed file system):
  weak consistency: more concurrency, difficult to program
  strong consistency: less concurrency, easy to program

General access consistency models
Notation: R(X)v is a read of variable X returning value v; W(X)v is a write of value v to variable X.

Atomic (strict) consistency
All reads and writes must appear to be executed atomically and sequentially. All processors observe the same ordering of event execution, which coincides with the real-time order of occurrence.
  P1: W(X)1
  P2:        R(X)1
atomically consistent
  P1: W(X)1
  P2:        R(X)0 R(X)1
not atomically consistent
This is the strictest consistency model. Its implementation complexity is high; it is usually used only as a baseline to evaluate the performance of other consistency models.

Sequential consistency
Defined by Lamport: the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. Any interleaving is allowed; the real-time order is not required.
  P1: W(X)1
  P2:        R(X)1 R(X)1
atomically consistent
  P1: W(X)1
  P2:        R(X)0 R(X)1
not atomically consistent
Both executions are sequentially consistent.
Sequential consistency is programming friendly, but has poor performance.

Causal consistency
Writes that are potentially causally related must be seen in the same order by all processors. Concurrent writes may be seen in different orders on different processors (therefore they may not lead to a global sequential order).
  P1: W(X)1               W(X)3
  P2:   R(X)1 W(X)2
  P3:   R(X)1                   R(X)3 R(X)2
  P4:   R(X)1                   R(X)2 R(X)3
causally consistent, but not sequentially consistent (W(X)2 and W(X)3 are concurrent, so they may be seen in different orders)

Causal consistency (continued)
  P1: W(X)1
  P2:   R(X)1 W(X)2
  P3:                R(X)2 R(X)1
  P4:                R(X)1 R(X)2
not causally consistent: W(X)2 causally follows W(X)1 (through P2's read), yet P3 sees them in the opposite order.
If we remove R(X)1 at P2, then W(X)1 and W(X)2 are concurrent:
  P1: W(X)1
  P2:   W(X)2
  P3:        R(X)2 R(X)1
  P4:        R(X)1 R(X)2
causally consistent

Processor consistency
Writes from the same processor are performed and observed in the order in which they were issued. Writes from different processors can be observed in any order.
  P1: W(X)1
  P2:   R(X)1 W(X)2
  P3:                R(X)1 R(X)2
  P4:                R(X)2 R(X)1
processor consistent, but not causally consistent

Slow memory consistency
Writes to the same location by the same processor must be observed in order.
  P1: W(X)1 W(Y)2 W(X)3
  P2:   R(Y)2 R(X)1 R(X)3
slow-memory consistent

Consistency models with synchronization accesses
These models use information from the user to relax consistency. Synchronization accesses are read/write operations on synchronization variables, issued only by special instructions.

Weak consistency
• Accesses to synchronization variables are sequentially consistent.
• No access to a synchronization variable is issued by a processor before all of its previous read/write operations have been performed.
• No read/write data access is issued by a processor before all of its previous accesses to synchronization variables have been performed.
  P1: W(X)1 W(X)2 S
  P2:   R(X)1 R(X)2 S
  P3:   R(X)2 R(X)1 S
weakly consistent
  P1: W(X)1 W(X)2 S
  P2:               S R(X)1
not weakly consistent

Release consistency
Uses a pair of synchronization operations: acquire(S) and release(S).
• No future access may be performed until the acquire operation has completed.
• All previous operations must have been performed before the release operation completes.
• The ordering of synchronization accesses follows the processor consistency model (acquire acts as a read, release as a write).

Entry consistency
Locks objects instead of locking critical sections.
• Each shared variable X is associated with acquire(X) and release(X).
• acquire(X) locks the shared variable X for the subsequent exclusive operations on X, until X is unlocked by release(X).
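The litmus tests above can be checked mechanically. Below is a brute-force sequential consistency checker in Python: it searches for a single interleaving that respects every processor's program order and in which each read returns the latest preceding write (initial values 0). All names are mine; the search is exponential and is meant only for tiny histories like these.

```python
from itertools import permutations

def sequentially_consistent(history, initial=0):
    """history: {processor: [('W'|'R', var, value), ...]} in program order."""
    tagged = [(p, i, op) for p, seq in history.items()
              for i, op in enumerate(seq)]
    for candidate in permutations(tagged):
        nxt = {p: 0 for p in history}  # program-order position per processor
        mem = {}
        ok = True
        for p, i, (kind, var, val) in candidate:
            if i != nxt[p]:            # violates program order
                ok = False
                break
            nxt[p] += 1
            if kind == "W":
                mem[var] = val
            elif mem.get(var, initial) != val:
                ok = False             # the read misses the latest write
                break
        if ok:
            return True
    return False

# The two examples from the sequential consistency slide:
h1 = {"P1": [("W", "X", 1)], "P2": [("R", "X", 1), ("R", "X", 1)]}
h2 = {"P1": [("W", "X", 1)], "P2": [("R", "X", 0), ("R", "X", 1)]}
assert sequentially_consistent(h1) and sequentially_consistent(h2)

# The non-causally-consistent example is not sequentially consistent either:
h3 = {"P1": [("W", "X", 1)],
      "P2": [("R", "X", 1), ("W", "X", 2)],
      "P3": [("R", "X", 2), ("R", "X", 1)],
      "P4": [("R", "X", 1), ("R", "X", 2)]}
assert not sequentially_consistent(h3)
```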