CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 12: Oct 3

Reliable Group Comm.
• We know how to build a GMS and implement the primary partition model
  – It can't guarantee availability in all situations
  – But it is much less likely to block than a transactional replication service
• Use it to implement group membership views for process groups
• Use views as input to reliable multicast
• Build ordered multicast on this:
  – fbcast, cbcast: easy and cheap
  – abcast: more costly, several approaches
• Synchronize with new view delivery

Process groups with joins, failures
[Figure: timeline of views for processes p, q, r, s, t. G0={p,q}; r and s request to join and are added with state transfer, giving G1={p,q,r,s}; p crashes, giving G2={q,r,s}; t requests to join and is added with state transfer, giving G3={q,r,s,t}]

Notes on Virtual Synchrony
• In fact, must extend reliability to also exclude "gaps" in the causal past
  – I.e. if multicast m → n, then if n is delivered after a failure, m should be delivered too
  – Ordering alone doesn't guarantee this
• Turns out this isn't hard to implement if the developer is careful
• "Gap freedom" has no significant cost implications

Notes on Virtual Synchrony
• Compared to quorum schemes with 2PC, one could argue that
  – View membership is "like" a quorum update with 2PC at the end
  – Multicast is "like" reading a list of members transactionally
• Insight: virtual synchrony remembers membership from multicast to multicast. Transactional schemes must rediscover membership on each operation they do!

Asynchrony
• Notice that fbcast and cbcast can be used asynchronously, while abcast always "stutters"
  – Insight is that fbcast and cbcast can always be delivered to the sender at the time the multicast is sent
  – Abcast delivery ordering usually isn't known until a round of message exchange has been completed
• Results in a tremendous performance difference

Asynchronous Multicast
• Lets the sender avoid blocking
[Figure: sender issues a stream of updates (X=7, Y=23, X=X-1, ...) without waiting for their delivery]
• But should be used with care!
  – Sender isn't blocking, but multicasts are accumulating in the sender subsystem
  – Sender could get far ahead of other processes in the group

Abcast is more synchronous
• Only a lucky sender avoids blocking
[Figure: concurrent senders issue updates (Y=33, X=7, Y=23, X=X-1, ...); each must wait until the agreed delivery order is known]
• Even the sender needs to wait to know the delivery ordering
• Allows concurrent updates, but at the cost of much higher multicast latency!
  – Most senders will wait for the equivalent of an RPC; only the token holder avoids this
  – Whole group moves in lock-step

Tradeoff
• With asynchronous cbcast or fbcast, we gain concurrency at the sender side, but this helps mostly if the remainder of the group is idle or doing a non-conflicting task
• With abcast we can concurrently do conflicting tasks, but at the cost of waiting to know the delivery order for our own multicasts

Asynchrony: grain of salt
• A good thing... in moderation
• Too much asynchrony
  – Means things pile up in output buffers
  – If a failure occurs, much is lost
  – And we could consume a lot of sender-side buffering space

Concatenation
[Figure: the application sends 3 asynchronous cbcasts; the message layer of the multicast subsystem combines them into a single packet]

Avoiding Trouble?
• First, the system itself should limit the amount of asynchronous buffering (a small sketch follows below)
• Also, we add a "flush" primitive
  – It delays until asynchronous messages get delivered remotely
• Useful in two ways
  – When the sender wants to do something that must "survive" even if the sender fails
  – If the sender is worried about buildup of asynchronously buffered messages (rare)
• Former case is like "durability" (ACID)
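To make the buffering and concatenation ideas concrete, here is a minimal sketch of a sender-side layer that packs small asynchronous multicasts into one outgoing packet, transmitting when a byte limit would be exceeded or when the application asks for a flush. The names (async_cbcast, flush_pending, send_packet) and the limit are illustrative assumptions, not the Horus or Isis interface, and a real flush would also wait for remote delivery.

```c
/* Sketch of sender-side concatenation with bounded buffering.
 * Hypothetical names; not the Horus/Isis API.  Small asynchronous
 * multicasts are copied into one packet buffer; the packet goes out
 * when it would overflow or when the application calls flush_pending(). */
#include <stdio.h>
#include <string.h>

#define PACKET_LIMIT 1024          /* cap on buffered, not-yet-sent bytes */

static char packet[PACKET_LIMIT];
static size_t used = 0;
static int batched = 0;

/* Stand-in for handing a packed packet to the transport layer. */
static void send_packet(void)
{
    if (used == 0)
        return;
    printf("sending packet: %zu bytes, %d multicasts\n", used, batched);
    used = 0;
    batched = 0;
}

/* Asynchronous multicast: returns to the caller right after copying
 * the message into the concatenation buffer. */
static void async_cbcast(const void *msg, size_t len)
{
    if (used + len > PACKET_LIMIT)  /* bound the sender-side buildup */
        send_packet();
    memcpy(packet + used, msg, len);
    used += len;
    batched++;
}

/* Forces buffered multicasts out; the real primitive would also block
 * until they have been delivered remotely ("durability"). */
static void flush_pending(void)
{
    send_packet();
}

int main(void)
{
    async_cbcast("X=7",   3);
    async_cbcast("Y=23",  4);
    async_cbcast("X=X-1", 5);
    flush_pending();               /* e.g. before an externally visible action */
    return 0;
}
```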
Effect?
• Overall, the system might surge ahead using asynchronous multicast
• But in fact, no process ever gets far ahead of the remainder of the group
• Analogous to a TCP sliding window, but here the window is a window of pending multicasts
• With abcast the whole issue is much less evident, but on the other hand, the system runs much slower

Presenting to user
• We can just offer the user the "raw" GCS, but this is uncommon
  – pg_join(…), pg_leave(…)
  – cbcast(…), pg_flush(…)
  – Upcalls for msg_rcv, new_view
• Instead, we usually build "tools" to make the user's life simpler
• Tools package the common functions in a standard way

Sample tools
• Replicated data with locking and state transfer
• Load balancing
• Fault-tolerance through primary-backup or coordinator-cohort
• Task subdivision schemes, based on work partitioning

Building a tool
• Tools simply map your request to the underlying group primitives
• They exploit
  – View notification and the current view
  – Multicast of various flavors
  – Synchronization properties
• Tools are optimized to perform well

Tools Challenge
• Efficiency: a generic interface tends to lose performance opportunities associated with knowing application semantics
  – For example, we might know about an update pattern
  – Or we could know that locking ensures that concurrent threads must be doing non-conflicting actions
• How much can we assume about the application?

Active Replication
• Simplest use of process groups
• Members replicate
  – Data (they maintain local copies)
  – Actions (all perform the same operations in the same order)
• Basic idea is to use totally ordered multicast to send updates to all the group members. They can perform read operations using a local copy of the data.

Synchronization
• Many replication schemes will require some form of locking
  – Application interface: lock/unlock("x")
  – Think of lock "state" as replicated data!
• Architecture is simplified if locking is not needed (if each "operation" is fully described in a single multicast)
• Otherwise, implement lock/unlock using a token-passing algorithm

Locks as tokens
[Figure: p initially holds the lock; later it moves to q, then back to p, then back to q when p crashes]
• The lock holder can update replicated data managed by the group

Locks as tokens
• Can have multiple locks per group
• Easily implemented (a small sketch follows below)
  – For example, can use cbcast to request the lock
  – Unlock is also a cbcast. It designates the process that will receive the lock, probably the oldest pending lock request
  – Notice that since lock-req → lock-grant, any process receiving a grant will find the corresponding request on its queue!
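A minimal, single-address-space sketch of the token-passing lock just described. It assumes the cbcast layer has already delivered lock requests in the same causal order at every member, so the sketch only keeps the resulting FIFO of pending requests and grants the token to the oldest one on unlock. The process ids and function names (lock_request, unlock_grant) are illustrative, not the Isis/Horus tool interface.

```c
/* Sketch of a token-style lock, assuming every member has already seen the
 * same causally ordered stream of lock requests (delivered by cbcast).
 * One shared queue stands in for the identical queues each member keeps. */
#include <stdio.h>

#define MAXREQ 32

static int pending[MAXREQ];   /* FIFO of processes waiting for the token */
static int head = 0, tail = 0;
static int holder = 0;        /* process 0 holds the token initially */

/* Delivered lock request (in real life, the body of a cbcast). */
static void lock_request(int p)
{
    pending[tail++ % MAXREQ] = p;
    printf("p%d requests the lock\n", p);
}

/* Delivered unlock: the holder designates the oldest pending requester.
 * This too would be a cbcast, so lock-req -> lock-grant holds and every
 * member finds the matching request already on its queue. */
static void unlock_grant(void)
{
    if (head == tail) {
        printf("p%d releases the lock; no one is waiting\n", holder);
        return;
    }
    int next = pending[head++ % MAXREQ];
    printf("p%d passes the lock to p%d\n", holder, next);
    holder = next;
}

int main(void)
{
    lock_request(1);
    lock_request(2);
    unlock_grant();   /* token moves from p0 to p1 */
    unlock_grant();   /* token moves from p1 to p2 */
    return 0;
}
```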
State Transfer
• This is the problem of providing a joining group member with initial values for the replicated data
• Needs to be synchronized with incoming updates so that the state will reflect each update, exactly once.

State Transfer
• Tool intercepts the new view
• Just when it would have been delivered, we instead do an upcall to the application
  – Application writes down its state
  – We capture it in messages and send them to the joining processes
  – After the last state message, allow the new view to be delivered

State Transfer Algorithm
[Figure: views G0={p,q} and G1={p,q,r}. To the application, the state transfer to r looks instantaneous at the view boundary; actually it has a concealed structure, with state messages sent to r before the new view is delivered]

State Merge after Partitioning
• A merge is a form of state transfer done when two partitions combine after a link is fixed
• Usually, state transfer is employed to take the state of the primary partition and copy it to the non-primary side.
• Sometimes the non-primary side will then reissue updates that occurred while it was separated

Active Replication
• This adds up to active replication
  – We use asynchronous cbcast for locking and updates
  – State transfer to initialize joining processes
  – Performance limited by the degree of asynchronous communication we tolerate

Active Replication
[Figure: the earlier timeline built up step by step: G0={p,q}; r and s request to join and are added with state transfer in G1={p,q,r,s}; p fails, giving G2={q,r,s}; t requests to join and is added with state transfer in G3={q,r,s,t}]

How cheap is it?
• As described, this is the case where Horus reaches 75,000 or more updates per second (the sketch below suggests why)
• In contrast, transactional replication rarely pushes beyond 200 per second, and 50 is more common
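To see where the speed comes from, here is a minimal sketch of actively replicating a small data store, assuming the multicast layer hands every replica the same updates in the same order. Reads touch only the local copy and need no communication; the only per-update cost is one multicast. The types and names (struct replica, apply_update, read_local) are illustrative assumptions, not taken from Horus.

```c
/* Sketch of active replication, assuming the group communication layer
 * delivers the same update stream in the same order to every replica.
 * Each replica applies the updates to its local copy; reads are purely local. */
#include <stdio.h>

#define NKEYS 4

struct replica {
    const char *name;
    int value[NKEYS];           /* local copy of the replicated data */
};

/* Upcall the multicast layer would make on delivery of an update. */
static void apply_update(struct replica *r, int key, int value)
{
    r->value[key] = value;
}

/* Local read: no multicast, no locking needed in this simple case. */
static int read_local(const struct replica *r, int key)
{
    return r->value[key];
}

int main(void)
{
    struct replica group[3] = { { "p", {0} }, { "q", {0} }, { "r", {0} } };
    /* The same ordered update log, as every member would see it. */
    int updates[][2] = { {0, 7}, {1, 23}, {0, 6} };   /* X=7, Y=23, X=X-1 */
    for (size_t i = 0; i < sizeof updates / sizeof updates[0]; i++)
        for (int m = 0; m < 3; m++)
            apply_update(&group[m], updates[i][0], updates[i][1]);
    for (int m = 0; m < 3; m++)
        printf("%s: X=%d Y=%d\n", group[m].name,
               read_local(&group[m], 0), read_local(&group[m], 1));
    return 0;
}
```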
Cheap can be "too cheap"
• Database applications may need stronger properties
• Dynamic uniformity is required if actions leave external traces and must be consistent with them
• ... but when this happens, expect to lose a factor of one hundred in performance. Limited to applications that are not very sensitive to latency

Uses for replicated data
• Replicated file system: local copies are always "safe", never inconsistent
• Replicated or "cloned" web pages: don't overload the server. Load-balance queries
• Replicated control or management policy: used to supervise a component in a way that is consistent with the rest of the system
• Replicated security keys: for authorization

Groupware Uses of Replication
• Might want to replicate the display for a conferencing system: multicast to/between Java applets
• Replicate the slides being shown by a speaker
• Let individuals keep copies of bulky documents or other information that might be slow to transfer if we wait until the last minute. Only need to send out the updates.

Financial Example of Replication
• Many "trading" systems show bankers or brokers stock prices as they change
• Each stock is like a small process group
• Each new price is like an update
• Benefit of our "model" is that traders see exactly the same input. This avoids the risk of inconsistent decisions
• Also replicate critical servers for availability

Distributed Trading System
[Figure: market data feeds and historical pricing DB's connect over a long-haul WAN, through spoolers holding current pricing, to trader clients and analytics in Tokyo, London, Zurich, ...  Key points: 1. availability for historical data; 2. load balancing and consistent message delivery for price distribution; 3. parallel execution for analytics]

Publish-Subscribe Paradigm
• A popular way to present replicated data
• Processes publish and subscribe to "subjects", which can be any ascii pathname
  – news_post("subject", message, length)
  – news_subscribe("subject", procedure)
• State transfer is by "playback" of prior postings
  – news_subscribe_pb("subject", procedure)

Conceptually, a message "bus"
[Figure: boxes are publishers (blue/green subjects), circles are subscribers, disks represent spoolers used for playback]
• Flexible and easily extended over time
• Supports huge numbers of subjects

Implementation of message bus
• Map subjects to process groups (could do a 1-1 mapping, but many-1 is more efficient)
• Spoolers join all groups for which spooling is desired and record messages in disk files
• State transfer from the spooler is used for playback. New messages are handled like "updates"

Need for speed?
• The New York Stock Exchange generates about 25-50 trades per second; the peak is 100
• Trades can be described in 512-byte records
• ... so we could potentially send every trade on the stock exchange to an individual workstation. With hardware multicast, we could send to a whole trading floor.

Applications with more demanding requirements?
• Page memory over a network (replicated, consistent, DSM)
• Implement an in-memory file system using a set of workstations, XFS-style
• Transmit MPEG-encoded video frames
• Interactively control a robot or some other remote device
• Warn that an earthquake is about to happen

But there are also scaling limits
• Scaling: large numbers of destinations, high data rates.
• With many receivers some may lag behind, overload, or become "lossy": how do we deal with this case?
• As networks get larger, bridges and WAN links make performance very variable
• ... all of which makes flow control very hard

Replicating a Server
• Client/server computing is a widespread standard
  – Database servers and OLTP applications
  – File servers
  – Web or Java servers
  – Special-purpose servers (example: compute the theoretical price of a stock under some assumptions about the economy)
• Replication for load-balancing, fault-tolerance

Basic idea is simple
• If we just replicate the inputs to the server, the copies will stay in "sync"
• What makes it hard?
  – We want the servers to share the work (hence they do different things)
  – Fault-tolerance (we don't want the same input to crash both servers)
  – Server restart (don't want to transfer a huge state)

The usual approach
• Replicate the updates but not the queries.
• Load-balance the queries
  – Bind each client to a different server, or
  – Periodically "publish" load values and use these as the basis of a randomized algorithm, or
  – Multicast all requests; servers split the work using a deterministic scheme (e.g. even/odd)

Randomized Load Balancing
• Track loads on servers through periodic updates: load0 = 2, load1 = 4, load2 = 4
• Think of 1/load as an interval on a line: 1/2, 1/4, 1/4
• Toss a random coin for each request and send it to the corresponding server (a small sketch follows below)
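A minimal sketch of the coin toss just described: server i is chosen with probability proportional to 1/load[i], so with loads 2, 4, 4 the probabilities come out to 1/2, 1/4, 1/4. The function name pick_server and the use of rand() are illustrative only; a client library would refresh the load vector from the periodic "published" load values.

```c
/* Sketch of randomized load balancing: choose server i with probability
 * proportional to 1/load[i].  With loads {2, 4, 4} the weights normalize
 * to 1/2, 1/4, 1/4 as in the example above.  Assumes n <= 16 for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int pick_server(const int *load, int n)
{
    double weight[16], total = 0.0;
    for (int i = 0; i < n; i++) {
        weight[i] = 1.0 / load[i];      /* lighter-loaded servers get more */
        total += weight[i];
    }
    double coin = (double)rand() / RAND_MAX * total;   /* point on the line */
    for (int i = 0; i < n; i++) {
        coin -= weight[i];
        if (coin <= 0.0)
            return i;
    }
    return n - 1;                        /* guard against rounding */
}

int main(void)
{
    int load[3] = { 2, 4, 4 };
    int hits[3] = { 0, 0, 0 };
    srand((unsigned)time(NULL));
    for (int r = 0; r < 100000; r++)
        hits[pick_server(load, 3)]++;
    printf("server0=%d server1=%d server2=%d (expect roughly 2:1:1)\n",
           hits[0], hits[1], hits[2]);
    return 0;
}
```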
Load-Balanced, Replicated Server
• Clients distribute queries using randomized load-balancing and reissue them if the server fails or is too slow
• Updates are multicast to all servers, and all execute them in parallel

Handling Server Recovery
• If the state is small, use state transfer
• If the state is medium-sized, transfer it before the join request, then do the join and transfer only the updates that occurred at the last moment
• If the state is huge, use periodic checkpoints and keep logs of incremental changes. State-transfer only the log contents.

Fault-Tolerance Options
• On the "client side", can try a request and then reissue it if the server fails.
• On the server side, can use primary-backup or coordinator-cohort methods
• When there are real-time constraints on the system, use 2 primaries for each request to be sure that 1 reply will be received in time

Primary-Backup Scheme
• Primary server handles all the work. Backup is passive but takes over if the primary fails.

Primary-Backup issues
• Non-determinism: must "control" it; may need to ship costly "traces" from the primary to the backup so that the backup can reproduce the actions of the primary
• Potential for a window of lost actions: if we wait for stability of the trace data, response time suffers, but if we don't wait, we can lose actions when the primary fails

Coordinator-Cohort Scheme
• Each server is the primary (coordinator) for some requests. It acts as the backup (cohort) for others.
• This approach is well matched to load-balancing
• Servers do "different things" and hence are less likely to fail simultaneously
• (A minimal sketch of one way to assign these roles appears at the end of these notes)

Coordinator-Cohort Scheme
[Figure: one server is the coordinator for green clients and backup for red clients; the other is the coordinator for red clients and backup for green clients]

Beyond single groups
• We've talked about groups one at a time
• But what if groups were cheap?
  – We could use a separate group for each data item in a system
  – Vision: groups as a first-class programming tool

Multigroup concepts
• Circus: Berkeley, around 1988
  – Uses groups for replication
  – Idea is to replicate at a fine-grained level
  – But costs were high
• Isis: Cornell, 1987
  – Groups as a distributed structuring and management tool

Lightweight groups
• Normal groups have overheads
  – Notably, membership change
• With lots of groups, these costs become prohibitive
• Leads to lightweight groups
  – One big group to track membership
  – Lightweight subgroups used by the application; they map to the big group

Issues to think about
• Won't a replicated server just reproduce the failure of the primary?
• Does it make more sense to do replication in hardware (e.g. a RAID file system or paired hardware fault-tolerance)?
• Homework: how would you do reliability for NASA's cheap space missions?
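Returning to the coordinator-cohort scheme above, here is a minimal sketch of one deterministic way the roles might be assigned, assuming every server sees the same request (multicast to the group) and the same ranked membership view. The hash rule and the names (coordinator_for, cohort_for) are illustrative assumptions, not the Isis algorithm.

```c
/* Sketch of deterministic coordinator/cohort assignment.  Every server sees
 * the same multicast request and the same ranked view, so all of them agree,
 * without extra communication, on who executes the request and who stands by. */
#include <stdio.h>

#define GROUP_SIZE 3

/* Coordinator: executes the request and replies to the client. */
static int coordinator_for(unsigned request_id)
{
    return (int)(request_id % GROUP_SIZE);
}

/* Cohort: next member in view rank; takes over if the coordinator fails
 * before its reply becomes stable. */
static int cohort_for(unsigned request_id)
{
    return (coordinator_for(request_id) + 1) % GROUP_SIZE;
}

int main(void)
{
    for (unsigned req = 100; req < 106; req++)
        printf("request %u: coordinator s%d, cohort s%d\n",
               req, coordinator_for(req), cohort_for(req));
    return 0;
}
```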