Virtual Synchrony
Justin W. Hart
CS 614
11/17/2005

Papers
  - The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.
  - Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.

Background
  - Chandy-Lamport
  - Logical Clocks
  - Consistent Cuts
  - Distributed Snapshots
  - Publish/Subscribe
  - Fail-Stop

Fail Stop
  - Group Membership Service
  - Processes appear to fail by halting
  - How does this affect the FLP result?

Motivation
  - Information Backplane
  - Customization
  - Hierarchical Structure
  - Fault-Tolerance
  - Reliability

Process Groups
  - Types of groups
    - Anonymous groups
    - Explicit groups
  - Implementation Requirements
    - Group communication
    - Group membership as input
    - Synchronization

Anonymous Groups
  - Group addressing
  - Messages sent exactly once to all or no recipients
  - Ordering
  - Logging

Explicit Groups
  - Group members cooperate directly
  - May execute algorithms based on membership knowledge
  - Communication is sensitive to membership changes

Building groups over conventional technology
  - Conventional message passing technologies
  - Group addressing
  - Logical time & causal dependency
  - Message delivery ordering
  - State transfer
  - Fault tolerance

Close Synchrony
  - 100% lock-step execution model

A synchronous execution
  [Figure: time-lines for processes p, q, r, s, t, u]
  - With true synchrony, executions run in genuine lock-step.

So… what’s wrong with that?
  - Under close synchrony, execution is limited by the slowest process in the group!

Virtual Synchrony
  - Relax synchronization requirements where possible
  - Benefit by allowing for asynchronous interactions
  - Do this where the result is identical to close synchrony

A few protocols…
  - fbcast
  - cbcast
  - abcast
  - gbcast

Four protocols!?!?
  - …but Justin. The paper only discussed 2 protocols… you’re getting off-topic!

A few protocols…
  - fbcast
    - Simple protocol upon which we’ll build the others.
    - Delivery is FIFO ordered, with respect to the original sender
    - Accomplished easily with a logical timestamp
  - cbcast
  - abcast
  - gbcast

Single updater
  - If p is the only update source, the need is a bit like the TCP “fifo” ordering
  [Figure: updates 1, 2, 3, 4 sent by p and delivered at r, s, t]
  - fbcast is a good choice for this case

A few protocols…
  - fbcast
  - cbcast
    - Receipt is causally ordered
    - Protocol in paper uses token passing
    - Another simple protocol uses vector timestamps
  - abcast
  - gbcast

Causally ordered updates
  - Simple protocol based on token passing

Causally ordered updates
  [Figure: time-lines for processes p, r, s, t with messages a, b, c]
  - Example: messages from p and s arrive out of order at t
  - VT(a) = [0,0,0,1], VT(b) = [1,0,0,1], VT(c) = [1,0,1,1]
  - c is early: VT(c) = [1,0,1,1] but VT(t) = [0,0,0,1]: clearly we are missing one message from s
  - When b arrives, we can deliver both it and message c, in order

Causally ordered updates
  - Each thread corresponds to a different lock
  [Figure: time-lines for processes p, r, s, t with two numbered threads of updates, one red and one green]
  - In effect: red “events” never conflict with green ones!

Hey… that sped things up!
  - Now I get it! Processes only have to wait for processes that they depend on. Not the slowest in the group!

A few protocols…
  - fbcast
  - cbcast
  - abcast
    - Atomic delivery ordering
    - With respect to other abcasts
    - More costly than cbcast, but with a stronger ordering property
    - ISIS builds abcast over cbcast
  - gbcast

A few protocols…
  - fbcast
  - cbcast
  - abcast
  - gbcast
    - Atomic delivery ordering
    - With respect to everything

Three Round Multicast
  - As a time-line picture
  [Figure: the 2PC initiator asks “Vote?” in Phase 1 and announces “Commit!” in Phase 2]
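The vector-timestamp delivery rule from the causally ordered updates slides can be sketched as follows. This is a minimal illustration, not the ISIS implementation: the names `Message` and `CausalReceiver` are hypothetical, processes are identified by integer index, and each vector entry is assumed to count the multicasts delivered from that sender. A message is held back until it is the next one expected from its sender and every message it causally depends on has already been delivered.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: int      # index of the sending process (hypothetical numbering)
    vt: tuple        # vector timestamp attached by the sender
    payload: str


class CausalReceiver:
    """Holds back messages until causal delivery order is satisfied."""

    def __init__(self, n):
        self.vt = [0] * n        # messages delivered so far, per sender
        self.pending = []        # received but not yet deliverable
        self.delivered = []      # payloads in delivery order

    def _deliverable(self, m):
        # Next message expected from its sender...
        if m.vt[m.sender] != self.vt[m.sender] + 1:
            return False
        # ...and nothing it depends on from other senders is still missing.
        return all(m.vt[k] <= self.vt[k]
                   for k in range(len(self.vt)) if k != m.sender)

    def receive(self, m):
        self.pending.append(m)
        progress = True
        while progress:          # one delivery may unblock others
            progress = False
            for q in list(self.pending):
                if self._deliverable(q):
                    self.pending.remove(q)
                    self.vt[q.sender] += 1
                    self.delivered.append(q.payload)
                    progress = True


# a is sent first, b is sent after a was delivered, c after b was delivered.
receiver = CausalReceiver(3)
receiver.receive(Message(1, (1, 1, 1), "c"))   # early: held back
receiver.receive(Message(0, (1, 0, 1), "b"))   # still waiting for a
receiver.receive(Message(2, (0, 0, 1), "a"))   # unblocks a, then b, then c
print(receiver.delivered)
```

The receiver never waits for the slowest group member, only for the specific messages named by the timestamp, which is the point the slides make about cbcast.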
  [Figure, continued: time-lines for processes p, q, r, s, t; all vote “commit”]

Just one more… Flush protocol
  - We say that a message is unstable if some receiver has it but (perhaps) others don’t
    - For example, q’s message is unstable at process r
  - If q fails we want to “flush” unstable messages out of the system

Styles of groups
  - Peer Groups
    - Processes cooperate closely
  - Client-Server Groups
    - Group acts as a server
    - Client multicasts repeatedly to the group
  - Diffusion Groups
    - Group serves information
    - Clients connect to receive data from group
  - Hierarchical Groups
    - Offer scalability through a hierarchy of connected groups

Historical Aside
  - Two major classes of real systems
  - Virtual synchrony
    - Weaker properties – not quite “FLP consensus”
    - Much higher performance (orders of magnitude)
    - Requires that majority of system remain connected. Partitioning failures force protocols to wait for repair
  - Quorum-based state machine protocols are
    - Closer to FLP definition of consensus
    - Slower (by orders of magnitude)
    - Sometimes can make progress in partitioning situations where virtual synchrony can’t

Names of some famous systems
  - Isis was first practical virtual synchrony system
  - Paxos was first major state machine system
  - Later followed by Transis, Totem, Horus
  - Today: Best options are JGroups, Spread, Ensemble
  - Technology is now used in IBM WebSphere and Microsoft Windows Clusters products!
  - BASE and other Byzantine Quorum systems now getting attention from the security community

(End of Historical aside)

Sounds good… what’s wrong with it?
  - Tries to solve state problems at communication level
  - This violates the end-to-end argument!
  - Consistency requirements are typically stated with respect to application state

Stable vs Durable
  - Stable – messages are buffered until received by all group members
  - Durable – message will be delivered, even if the sender dies

Ordering semantics
  - Incidental Ordering
  - Semantic Ordering
  - Prescriptive Ordering

The problem with CATOCS
  - It can’t say “for sure”
  - It can’t say the “whole story”
  - It can’t say “together”
  - It can’t say it efficiently

It can’t say “for sure”
  - Processes communicating over a “hidden” channel
  - Common database
  - Shared memory
  - Two threads reacting to external event

It can’t say “together”
  - Standard solution – locking
  - Transaction models allow for abort and rollback
  - Higher level conditions… what happens if a message arrives, but is not successfully processed?
  - Stock trading example

Can’t say the “whole story”
  - Not everything can be expressed through the “happens-before” relationship
  - Semantic ordering constraints
  - Causal memory, the weakest of these, cannot be expressed in causal multicast
  - Total ordering helps some of these, but is far too expensive
  - Inexpensive, state-level protocols with logical clocks can solve these

It can’t say it efficiently
  - False causality
    - Potential causality != Actual causality
  - Memory requirements for buffering “unstable” messages
  - Ordering information during transmission and reception

And… what of the end to end argument?
  - All of this considers our communication channels… isn’t the application-level check far more important?

Classes of distributed applications
  - Data dissemination
    - Netnews
    - Trading application example
  - Global predicate evaluation
  - Transactional applications
  - Replicated data
  - Replication in the large
  - Distributed real-time applications

Implementing only part of the messaging?
  - Can you cut down on overhead by implementing only part of the messaging using CATOCS?

Semantics
  - Are the semantics of state-based approaches superior to those of virtual synchrony?
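To make the question about state-based approaches concrete, here is a minimal sketch of a state-level ordering check in the spirit of the stock trading example. Everything in it is an assumption for illustration: the class `PriceBoard`, its method names, and the idea that each derived value carries the version number of the stock price it was computed from. The receiver uses those version numbers, rather than the order in which messages arrive, to decide what it may show.

```python
class PriceBoard:
    """Keeps a stock price and a value derived from it consistent by version."""

    def __init__(self):
        self.stock = {}     # symbol -> (version, price)
        self.derived = {}   # symbol -> (based_on_version, value)

    def on_stock(self, symbol, version, price):
        # Accept only updates newer than what is already stored.
        current = self.stock.get(symbol)
        if current is None or version > current[0]:
            self.stock[symbol] = (version, price)
            return True
        return False        # stale or duplicate update is ignored

    def on_derived(self, symbol, based_on_version, value):
        # A derived value is tagged with the stock version it depends on.
        current = self.derived.get(symbol)
        if current is None or based_on_version > current[0]:
            self.derived[symbol] = (based_on_version, value)
            return True
        return False

    def consistent_view(self, symbol):
        # Show the pair only when the derived value matches the stored price.
        s = self.stock.get(symbol)
        d = self.derived.get(symbol)
        if s is None or d is None or s[0] != d[0]:
            return None
        return (s[1], d[1])


board = PriceBoard()
board.on_stock("XYZ", 2, 26.75)     # new price arrives first
board.on_derived("XYZ", 1, 4.10)    # derived value for the old price arrives late
print(board.consistent_view("XYZ")) # mismatch detected, nothing shown
board.on_derived("XYZ", 2, 4.40)
print(board.consistent_view("XYZ"))
```

The check costs one integer per message and works regardless of delivery order, which is the kind of trade-off the slide asks the audience to weigh against ordered multicast.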
Scalability
  - N processes
  - Time T to propagate a message across the system
    - Grows roughly in proportion to the square root of the number of processes
  - Arcs in the active causal graph grow quadratically

Quadratic causal graph
  - Buffering grows
  - Quadratic arcs
  - Linear communication of causal dependencies
  - Linear growth in required buffering
  - Changing topologies doesn’t help
  - CATOCS would require separate process groups for read and write to accomplish optimization of updates vs queries

Group membership protocols
  - Must enforce atomic delivery semantics
    - Run our most expensive protocol… gbcast
  - Failures increase with the size of the system, increasing load on the GMS

Who uses ISIS?
  - Brokerage
  - Database replication and triggers

ISIS-based utilities
  - NEWS
    - A pub/sub application that will replay histories
  - NMGR
    - Manages batch-style jobs and performs load sharing
  - Parallel make

ISIS-based utilities
  - DECEIT
    - NFS compatible file system
  - META/LOMITA
    - Sensors & actuators
    - Abstract sensors
    - Specify control actions in high-level terms
  - SPOOLER/LONG-HAUL FACILITY

Now… somewhat supported
  - ISIS/Horus/Ensemble/QuickSilver
  - JGroups
  - Spread
  - Totem
  - Transis
  - WebSphere & Windows Cluster (internally)

…and people actually use it.
  - NYSE
  - French ATC System
  - AEGIS

An ongoing debate
  - The effort continues here at Cornell with the QuickSilver effort
  - You’ve been presented the options… what are your conclusions?

References
  - Some slides borrowed from Ken Birman’s CS 614 slide sets on Virtual Synchrony: http://www.cs.cornell.edu/courses/cs514/2005sp/Slide%20Sets.htm
  - Images have been borrowed from The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.
  - Images have been borrowed from Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.
  - Statements and ideas have been borrowed verbatim from both papers, including section headings, and statements in notes.
    This has been mostly for coherence between the slides and papers.
  - Also sourced data from http://www.cs.cornell.edu/ken/