CSS490 Replication & Fault Tolerance
Textbook Ch9 (p440 – 484)
Instructor: Munehiro Fukuda
These slides were compiled from the course textbook and the reference books.
Winter, 2004 CSS490 Fault Tolerance

File Replication Concepts
Difference between replication and caching
• A replica is associated with a server, whereas a cache is associated with a client.
• A replica focuses on availability, while a cache focuses on locality.
• A replica is more persistent than a cache.
• A cache is contingent upon a replica.
Advantages
• Increased availability/reliability
• Performance enhancement (response time and network traffic)
• Scalability and autonomous operation
Requirements
• Naming: clients need not be aware of multiple replicas.
• Consistency: data consistency among replicated files.
• Replication control: explicit vs. implicit/lazy replication
• ACID: Atomicity, Consistency, Isolation, and Durability

File Replication Basic Architectural Model
[Figure: clients connect through front ends to a group of replica managers. Ex: DNS, Web server]
1. Request: send a client request to a server.
2. Coordination: deliver the request to each replica manager in some order.
3. Execution: process a client request but do not permanently commit it.
4. Agreement: agree on whether the execution will be committed.
5. Response: respond to the front end.

Group Communication
[Figure: a client addresses a group of replica managers.]
Group membership service
• Create and destroy a group.
• Add or withdraw a replica manager to/from a group.
• Detect a failure.
• Notify members of group membership changes.
• Provide clients with a group address.
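The membership duties listed above can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration: the class name MembershipService, its method names, and the idea of numbering each membership change as a new view are not taken from the slides or from any real group communication library.

```python
from typing import Callable, Dict, List, Tuple

# A view is (view number, members in join order); hypothetical representation.
View = Tuple[int, Tuple[str, ...]]


class MembershipService:
    """Toy, single-process group membership service (illustrative only)."""

    def __init__(self) -> None:
        self._groups: Dict[str, List[str]] = {}
        self._view_ids: Dict[str, int] = {}
        self._listeners: List[Callable[[str, View], None]] = []

    def subscribe(self, listener: Callable[[str, View], None]) -> None:
        # Listeners stand in for members that are notified of view changes.
        self._listeners.append(listener)

    def create(self, address: str) -> str:
        # Create a group and hand back the address clients will use.
        if address in self._groups:
            raise ValueError(f"group {address!r} already exists")
        self._groups[address] = []
        self._view_ids[address] = 0
        return address

    def destroy(self, address: str) -> None:
        del self._groups[address]
        del self._view_ids[address]

    def join(self, address: str, rm: str) -> None:
        if rm not in self._groups[address]:
            self._groups[address].append(rm)
            self._install_view(address)

    def leave(self, address: str, rm: str) -> None:
        if rm in self._groups[address]:
            self._groups[address].remove(rm)
            self._install_view(address)

    def report_failure(self, address: str, rm: str) -> None:
        # A detected crash is handled like a withdrawal from the group.
        self.leave(address, rm)

    def view(self, address: str) -> View:
        return (self._view_ids[address], tuple(self._groups[address]))

    def _install_view(self, address: str) -> None:
        # Every membership change produces a new numbered view and a notification.
        self._view_ids[address] += 1
        new_view = self.view(address)
        for listener in self._listeners:
            listener(address, new_view)
```

With this sketch, three joins followed by one reported failure leave the group at view number 4 with the two surviving replica managers. Failure detection itself (timeouts, heartbeats) is left out.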
Message delivery
• Absolute ordering
• Consistent ordering

Absolute Ordering
Linearizability
[Figure: message mi is sent at time Ti and message mj at time Tj, with Ti < Tj.]
Rule: mi must be delivered before mj if Ti < Tj.
Implementation:
• A clock synchronized among machines
• A sliding time window used to commit the delivery of messages whose timestamps fall within the window
Example: distributed simulation
Drawback:
• Too strict a constraint
• No absolutely synchronized clock
• No guarantee of catching all tardy messages

Consistent (Total) Ordering
Sequential Consistency
[Figure: message mi is sent at time Ti and message mj at time Tj, with Ti < Tj.]
Rule: messages are received in the same order at every receiver (regardless of their timestamps).
Implementation:
• A message is sent to a sequencer, assigned a sequence number, and finally multicast to receivers.
• A receiver retrieves messages in increasing sequence-number order.
Example: replicated database update
Drawback: a centralized algorithm

Two-Phase Commit Protocol
[Figure: state diagrams of the coordinator, worker 1, and worker 2.]
Coordinator:
• INIT → WAIT: on Commit, send Vote-request.
• WAIT → ABORT: on Vote-abort, send Global-abort.
• WAIT → COMMIT: on Vote-commit from all workers, send Global-commit.
Worker:
• INIT → READY: on Vote-request, send Vote-commit.
• INIT → ABORT: on Vote-request, send Vote-abort.
• READY → ABORT: on Global-abort, send Ack.
• READY → COMMIT: on Global-commit, send Ack.
Other possible cases:
• The coordinator didn't receive all vote-commits. → Time out and send a global-abort.
• A worker didn't receive a vote-request. → All workers eventually receive a global-abort.
• A worker didn't receive a global-commit. → Time out and check the other workers' status.

Multi-copy Update Problem
• Read-only replication: allow the replication of only immutable files.
• Primary backup replication: designate one copy as the primary copy and all the others as secondary copies.
• Active backup replication: access any or all of the replicas.
  • Read-any-write-all protocol
  • Available-copies protocol
  • Quorum-based consensus

Primary-Copy Replication
[Figure: clients connect through front ends to the primary replica manager, which updates the backup replica managers.]
1. Request: the front end sends a request to the primary replica.
2. Coordination: the primary takes the request atomically.
3. Execution: the primary executes the request and stores the results.
4. Agreement: the primary sends the updates to all the backups and receives an ack from them.
5. Response: the primary replies to the front end.
Advantage: an easy implementation, linearizable, coping with n-1 crashes.
Disadvantage: large overhead, especially if the failing primary must be replaced with a backup.

Active Replication
[Figure: clients connect through front ends to all replica managers.]
1. Request: the front end multicasts the request to all replicas.
2. Coordination: all replicas take the request in the same sequential order.
3. Execution: every replica executes the request.
4. Agreement: no agreement is needed.
5. Response: each replica replies to the front end.
Advantage: achieves sequential consistency; copes with (n/2 – 1) byzantine failures.
Disadvantage: no longer linearizable.

Read-Any-Write-All Protocol
[Figure: one front end reads from any one replica manager; another writes to all of them.]
• Read: lock any one of the replicas for a read.
• Write: lock all of the replicas for a write.
• Sequential consistency
• A write cannot tolerate even one failing replica.
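A minimal sketch of the read-any-write-all rule, assuming a single-process model in which each replica carries a boolean lock flag and an up/down flag. The names Replica, read_any, write_all, and ReplicaUnavailable are hypothetical and chosen only for this illustration; a real system would use distributed locks and messages.

```python
from dataclasses import dataclass
from typing import List


class ReplicaUnavailable(Exception):
    """Raised when the required replicas cannot be locked."""


@dataclass
class Replica:
    name: str
    value: str = ""
    up: bool = True       # False models a crashed replica
    locked: bool = False  # toy stand-in for a real lock


def read_any(replicas: List[Replica]) -> str:
    """Lock any one reachable replica, read it, and release the lock."""
    for rep in replicas:
        if rep.up and not rep.locked:
            rep.locked = True
            try:
                return rep.value
            finally:
                rep.locked = False
    raise ReplicaUnavailable("no replica available for a read")


def write_all(replicas: List[Replica], value: str) -> None:
    """Lock every replica before writing; abort if any one cannot be locked."""
    acquired: List[Replica] = []
    try:
        for rep in replicas:
            if not rep.up or rep.locked:
                raise ReplicaUnavailable(f"{rep.name} cannot be locked; write aborted")
            rep.locked = True
            acquired.append(rep)
        # All locks are held, so the update is applied everywhere.
        for rep in acquired:
            rep.value = value
    finally:
        for rep in acquired:
            rep.locked = False
```

In this sketch a write leaves no replica modified when it aborts, and reads keep working as long as one replica is up, which mirrors the availability gap between reads and writes noted above.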
Available-Copies Protocol
[Figure: one front end reads from any one replica manager; another writes to all available replica managers, skipping a failed one (X).]
• Read: lock any one of the replicas for a read.
• Write: lock all available replicas for a write.
• Recovering replica: bring itself up to date by copying from other servers before accepting any user request.
• Better availability
• Cannot cope with network partition (inconsistency between the two sub-divided network groups).

Quorum-Based Protocols
#replicas in read quorum + #replicas in write quorum > n
[Figure: two front ends access overlapping sets of replica managers, a read quorum and a write quorum.]
Read-any-write-all: r = 1, w = n
Read:
1. Retrieve the read quorum.
2. Select the replica with the latest version.
3. Perform a read on it.
Write:
1. Retrieve the write quorum.
2. Find the latest version and increment it.
3. Perform a write on the entire write quorum.
If a sufficient number of replicas cannot be retrieved for the read/write quorum, the operation must be aborted.
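The read and write steps above can be sketched as follows. This is an illustrative single-process model, not the textbook's implementation: QuorumStore, Replica, and QuorumUnavailable are hypothetical names, a quorum is chosen at random among the replicas that are up, and the sketch also requires 2w > n (a common extra condition that the slide does not state) so that any two write quorums overlap and version numbers keep increasing.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


class QuorumUnavailable(Exception):
    """Raised when too few replicas are reachable to form a quorum."""


@dataclass
class Replica:
    version: int = 0
    value: Optional[str] = None
    up: bool = True  # False models a crashed or unreachable replica


class QuorumStore:
    def __init__(self, n: int, r: int, w: int) -> None:
        if r + w <= n:
            raise ValueError("need r + w > n so read and write quorums overlap")
        if 2 * w <= n:
            # Assumption added by this sketch, not stated on the slide.
            raise ValueError("need 2w > n so write quorums overlap")
        self.replicas: List[Replica] = [Replica() for _ in range(n)]
        self.r, self.w = r, w

    def _gather(self, size: int) -> List[Replica]:
        available = [rep for rep in self.replicas if rep.up]
        if len(available) < size:
            raise QuorumUnavailable(f"only {len(available)} of {size} replicas reachable")
        return random.sample(available, size)

    def read(self) -> Optional[str]:
        quorum = self._gather(self.r)                     # 1. retrieve the read quorum
        latest = max(quorum, key=lambda rep: rep.version)  # 2. pick the latest version
        return latest.value                               # 3. read from it

    def write(self, value: str) -> int:
        quorum = self._gather(self.w)                     # 1. retrieve the write quorum
        new_version = max(rep.version for rep in quorum) + 1  # 2. increment latest version
        for rep in quorum:                                # 3. write to the whole quorum
            rep.version = new_version
            rep.value = value
        return new_version
```

For example, with n = 5, r = 2, w = 4, a read after two writes always returns the second value, because any 2 replicas must include at least one of the 4 that took the latest write. Setting r = 1 and w = n gives the read-any-write-all behavior.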
ISIS System
• Process group: see page 4 of this ppt file (Group Communication)
• Group view
  [Figure: timeline of processes p1 to p4 exchanging multicasts while one process joins the group and another crashes and later rejoins.]
  • Partially multicast messages must be discarded.
  • Multicast to available processes
• Reliable multicast
• Causal multicast: see pages 5 & 6 of MPI ppt file
• Atomic broadcast: see page 7 of this ppt file (Two-Phase Commit Protocol)

Gossip Architecture
[Figure: clients issue queries and updates through front ends (FE, timestamp Tf) to replica managers RMi (Ti), RMj (Tj), and RMk, which exchange gossip messages.]
Query: the FE sends (Query, Tf) to RMi, which answers with (Value, Ti).
  If (Tf < Ti) return value
  else { wait for RMi to be updated, or query RMj/RMk }
Update: the FE sends (Update, Tf) to RMj, which answers with an update id.
  If (Tf > Tj) update RMj
  else { update or ignore and update RMj }
Gossip: RMj sends a gossip message to RMk.
  If (Tj > Tk) update RMk
  else discard the gossip message

Bayou System
[Figure: clients send updates through front ends (FE) to replica managers (RM). The primary RM keeps committed updates C0, C1, C2, ..., CN followed by tentative updates T0, T1, T2, T3, ..., Tn, Tn+1. Labels: sent first, sent later.]
Example: a secretary and other employees book 3pm; an executive books 3pm.
To make a tentative update committed:
• Perform a dependency check
• Check conflicts
• Check priority
• Merge procedure
• Cancel tentative updates
• Change tentative updates

Coda File System
1. Normal case:
• Read-any, write-all protocol
• Whenever a client writes back its file, it increments the file version at each server.
2. Network disconnection:
• A client writes back its file to only the available servers.
• Version conflicts are detected and resolved automatically when the network is reconnected.
3. Client disconnection:
• A client caches as many files as possible (in hoard walking).
• A client works locally if disconnected (in emulation mode).
• A client writes back updated files to servers (in reintegration mode).
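Version conflict detection after a network disconnection can be sketched with per-server version vectors. This is a hedged illustration: the function names write_back and compare are hypothetical, the element-wise comparison is the usual version-vector rule and is assumed rather than quoted from the slides, and the sketch only detects conflicts without resolving them.

```python
from typing import List, Sequence


def write_back(vectors: List[List[int]], reachable: Sequence[int]) -> None:
    """Apply one client write that reaches only the servers in `reachable`.

    vectors[s][t] is server s's count of the writes it knows server t has seen.
    Each reachable server increments the entries of all reachable servers.
    """
    for s in reachable:
        for t in reachable:
            vectors[s][t] += 1


def compare(a: Sequence[int], b: Sequence[int]) -> str:
    """Return 'equal', 'a_newer', 'b_newer', or 'conflict' for two version vectors."""
    if len(a) != len(b):
        raise ValueError("version vectors must have the same length")
    a_ahead = any(x > y for x, y in zip(a, b))
    b_ahead = any(y > x for x, y in zip(a, b))
    if a_ahead and b_ahead:
        return "conflict"  # neither copy has seen all of the other's writes
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


# Three servers start at [1, 1, 1]; indexes 0..2 stand for servers 1..3.
vectors = [[1, 1, 1] for _ in range(3)]
write_back(vectors, [0, 1, 2])  # normal case: all servers reach [2, 2, 2]
write_back(vectors, [0, 1])     # partition: servers 1 and 2 reach [3, 3, 2]
write_back(vectors, [2])        # other side: server 3 reaches [2, 2, 3]
status = compare(vectors[0], vectors[2])  # "conflict"
```

Servers 1 and 2 compare as equal, while either of them compared with server 3 yields a conflict, which is the situation that must be resolved when the network is reconnected.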
[Figure: Coda version vectors after successive writes (W). Server 1: Version[1,1,1], Version[2,2,2], Version[3,3,2]. Server 2: Version[1,1,1], Version[2,2,2], Version[3,3,2]. Server 3: Version[1,1,1], Version[2,2,2], Version[2,2,3]. Client states: hoard, emulation, reintegration.]

Paper Review by Students
• ISIS System
• Gossip Architecture
• Bayou System
• Coda