CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 21, Nov. 7

Bimodal Multicast
• Work is joint:
– Ken Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu, Yaron Minsky
• Paper appeared in ACM Transactions on Computer Systems, May 1999

Reliable Multicast
• Family of protocols for 1-n communication
• Increasingly important in modern networks
• “Reliability” tends to be at one of two extremes of the spectrum:
– Very strong guarantees, for example to replicate a controller position in an ATC system
– Best-effort guarantees, for applications like Internet conferencing or radio

Reliable Multicast
• Common to try to reduce problems such as this to theory
• Issue is to specify the goal of the multicast protocol and then demonstrate that a particular implementation satisfies its spec
• Each form of reliability has its own specification and its own pros and cons

Classical “multicast” problems
• Related to “coordinated attack” and “consensus”
• The two “schools” of reliable multicast:
– Virtual synchrony model (Isis, Horus, Ensemble)
• All-or-nothing message delivery with ordering
• Membership managed on behalf of the group
• State transfer to a joining member
– Scalable reliable multicast
• Local repair of problems but no end-to-end guarantees

Virtual Synchrony Model
[Figure: a timeline of group views. G0={p,q}; r and s request to join and are added with state transfer, giving G1={p,q,r,s}; p fails (crash), giving G2={q,r,s}; t requests to join and is added with state transfer, giving G3={q,r,s,t}]
… to date, the only widely adopted model for consistency and fault-tolerance in highly available networked applications

Virtual Synchrony Tools
• Various forms of replication:
– Replicated data, replicate an object, state transfer for starting new replicas…
– 1-many event streams (“network news”)
• Load-balanced and fault-tolerant request execution
• Management of groups of nodes or machines in a network setting

ATC system (simplified)
[Figure: onboard radar, an X.500 directory, controller consoles, and an air traffic database (flight plans, etc.)]

Roles for reliable multicast?
• Replicate the state of controller consoles so that a group of consoles can support a team of controllers fault-tolerantly

Roles for reliable multicast?
• Distribute routine updates from the radar system every 3-5 seconds (two kinds: images and digital tracking info)

Roles for reliable multicast?
• Distribute flight-plan updates and decisions rarely (very critical information)

Other settings that use reliable multicast
• Basis of several stock exchanges
• Could be used to implement enterprise database systems and wide-area network services
• Could be the basis of web caching tools

But vsync protocols don’t scale well
• For small systems, performance is great
• For large systems we can demonstrate very good performance… but only if applications “cooperate”
• In large systems, under heavy load, with applications that may act a little flaky, performance is very hard to maintain

Stock Exchange Problem: vsync multicast is too “fragile”
[Figure: average throughput of virtually synchronous Ensemble multicast protocols on non-perturbed members, for group sizes 32, 64, and 96, as the perturb rate rises from 0 to 0.9. Most members are healthy… but one is slow. Throughput collapses as one member of the multicast group is “perturbed” by forcing it to sleep for varying amounts of time.]

Why does this happen?
• Superficially, because data for the slow process piles up in the sender’s buffer, causing flow control to kick in (prematurely)
• But why does the problem grow worse as a function of group size, with just one “red” process?
• Our theory is that small perturbations happen all the time (but more study is needed)

Dilemma confronting developers
• Application is extremely critical: stock market, air traffic control, medical system
• Hence need a strong model and guarantees
– SRM, RMTP, etc.: model is too weak
– Anyhow, we’ve seen that they don’t scale either!
• But these applications often have a soft-realtime subsystem
– Steady data generation
– May need to deliver over a large scale

Today introduce a new design point
• Bimodal multicast is reliable in a sense that can be formalized, at least for some networks
– Generalization to a larger class of networks should be possible, but maybe not easy
• Protocol is also very stable under steady load even if 25% of processes are perturbed
• Scalable in much the same way as SRM

Bimodal Attack
• Consensus is impossible! Attack!
• General commands soldiers
• If almost all attack, victory is certain
• If very few attack, they will perish but the Empire survives
• If many attack, the Empire is lost

Worse and worse…
• General sends orders via couriers, who might perish or get lost in the forest (no spies or traitors)
• Some of the armies are sick with flu. They won’t fight and may not even have healthy couriers to participate in the protocol. (Message omission, crash and timing faults.)
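The General's dilemma can be quantified with a toy binomial calculation. This helper, `p_disaster`, is my own illustration, not anything from the paper: it asks, if each soldier independently receives the attack order with probability p, how likely is the disastrous middle ground in which many, but not most, attack?

```python
from math import comb

def p_disaster(n, p):
    """Probability that more than 10% but fewer than 90% of n soldiers
    attack, when each independently gets the order with probability p.
    (Illustrative sketch only; the paper's analysis uses an epidemic
    model, not independent deliveries.)"""
    lo, hi = int(0.1 * n), int(0.9 * n)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(lo + 1, hi))
```

With independent deliveries the middle ground dominates (for n=10 and p=0.5 the disaster probability is about 0.98); the point of pbcast's gossip phase is to drive the outcome toward the two safe extremes, making such outcomes astronomically unlikely.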
Properties Desired
• Bimodal distribution of multicast delivery
• Ordering: FIFO on a per-sender basis
• Stability: lets us garbage collect without violating the bimodal distribution
• Detection of lost messages

More desired properties
• A formal model allows us to reason about how the primitive will behave in real systems
• Scalable protocol: costs (background messages, throughput) do not degrade as a function of system size
• Extremely stable throughput

Naming our new protocol
• “Bimodal” doesn’t lend itself to traditional contractions (bcast is taken; bicast and bimcast sound pretty awful)
• Focus on the probabilistic nature of the guarantees
• Call it pbcast
• Historical footnote: these are multicast protocols, but everyone uses “bcast” in names

Environment
• Will assume that most links have known throughput and loss properties
• Also assume that most processes are responsive to messages in bounded time
• But can tolerate some flaky links and some crashed or slow processes

How pbcast works
• Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
• Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn’t include them.
• The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
• Processes respond to solicitations received during a round of gossip by retransmitting the requested messages. The round lasts much longer than a typical RPC time.

Delivery? Garbage Collection?
• Deliver a message when it is in FIFO order
• Garbage collect a message when you believe that no “healthy” process could still need a copy (we used to wait 10 rounds, but now use gossip to detect this condition)
• Match parameters to the intended environment

Need to bound costs
• Worries:
– Someone could fall behind and never catch up, endlessly loading everyone else
– What if some process has lots of stuff others want, and they bombard him with requests?
– What about scalability in buffering and in the list of members of the system, or the cost of updating that list?

Optimizations
• Request retransmissions of the most recent multicast first
• Idea is to “catch up quickly,” leaving at most one gap in the retrieved sequence

Optimizations
• Participants bound the amount of data they will retransmit during any given round of gossip; if too much is solicited, they ignore the excess requests

Optimizations
• Label each gossip message with the sender’s gossip round number
• Ignore solicitations with an expired round number, reasoning that they arrived very late and hence are probably no longer correct

Optimizations
• Don’t retransmit the same message twice in a row to any given destination (the copy may still be in transit, hence the request may be redundant)

Optimizations
• Use IP multicast when retransmitting a message if several processes lack a copy
– For example, if solicited twice
– Also, if a retransmission is received from “far away”
– Tradeoff: excess messages versus low latency
• Use a regional TTL to restrict multicast scope

Scalability
• Protocol is scalable except for its use of the membership of the full process group
• Updates could be costly
• Size of the list could be costly
• In large groups, would also prefer not to gossip over long high-latency links

Can extend pbcast to solve both
• Could use IP multicast to send the initial message.
(Right now we have a tree-structured alternative, but to use it we need to know the membership)
• Tell each process only about some subset k of the processes, k << N
• Keeps costs constant

Router overload problem
• Random gossip can overload a central router
• Yet the information flowing through this router is of diminishing quality as the rate of gossip rises
• Insight: a constant rate of gossip is achievable and adequate

[Figure: two clusters of senders and receivers joined by a router with limited bandwidth (1500 Kbits) and a high noise rate; other links run at 10 Mbits. Graphs compare messages/sec over time and versus group size for uniform and hierarchical gossip.]

Hierarchical Gossip
• Weight gossip so that the probability of gossiping to a remote cluster is smaller
• Can adjust the weight to place a constant load on the router
• Now propagation delays rise… but just increase the rate of gossip to compensate

[Figure: propagation time (ms) and messages/sec versus group size for uniform, hierarchical, and fast hierarchical gossip]

Remainder of talk
• Show results of formal analysis
– We developed a model (won’t do the math here -- nothing very fancy)
– Used the model to solve for expected reliability
• Then show more experimental data
• Real question: what would pbcast “do” in the Internet? Our experience: it works!
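The weighted peer selection behind hierarchical gossip can be sketched in a few lines. This is my own illustrative sketch, with invented function and parameter names, not code from the Spinglass implementation:

```python
import random

def pick_gossip_peer(pid, local_cluster, remote_members, p_remote, rng=random):
    """Choose a gossip target. With small probability p_remote, the gossip
    crosses the router to a remote cluster; otherwise it stays local.
    Lowering p_remote keeps the router load roughly constant as the group
    grows, at the cost of slower cross-cluster propagation, which can be
    offset by gossiping more often. (Illustrative sketch only.)"""
    if remote_members and rng.random() < p_remote:
        pool = remote_members
    else:
        pool = [m for m in local_cluster if m != pid]
    return rng.choice(pool)
```

Tuning p_remote is the knob the slides describe: a small fixed value bounds the cross-router message rate regardless of how many processes sit on each side.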
Idea behind analysis
• Can use the mathematics of epidemic theory to predict the reliability of the protocol
• Assume an initial state
• Now look at the result of running B rounds of gossip: converges exponentially quickly towards atomic delivery

[Figure: pbcast bimodal delivery distribution, plotting p{#processes=k} (from 1E+00 down to 1E-30) against the number of processes to deliver pbcast (0 to 50)]

Failure analysis
• Suppose someone tells me what they hope to “avoid”
• Model it as a predicate on the final system state
• Can compute the probability that pbcast would terminate in that state, again from the model

Two predicates
• Predicate I: A faulty outcome is one where more than 10% but less than 90% of the processes get the multicast
… call this the “General’s predicate,” because he views it as a disaster if many but not most of his troops attack

Two predicates
• Predicate II: A faulty outcome is one where roughly half get the multicast and failures might “conceal” the true outcome
… this would make sense if using pbcast to distribute quorum-style updates to replicated data. The costly, hence undesired, outcome is the one where we need to roll back because the outcome is “uncertain”

Two predicates
• Predicate I: More than 10% but less than 90% of the processes get the multicast
• Predicate II: Roughly half get the multicast, but crash failures might “conceal” the outcome
• Easy to add your own predicate.
Our methodology supports any predicate over the final system state

[Figure 5: Graphs of analytical results. (a) Pbcast bimodal delivery distribution: p{#processes=k} versus the number of processes to deliver pbcast; (b) scalability of pbcast reliability: P{failure} versus the number of processes in the system (10-60), for Predicates I and II; (c) effects of fanout on reliability: P{failure} versus fanout (1-10); (d) fanout required for a specified reliability: Predicate I at 1E-8 and Predicate II at 1E-12, versus system size]

Discussion
• We see that pbcast is indeed bimodal even in the worst case, when the initial multicast fails
• Can easily tune the parameters to obtain the desired guarantees of reliability
• Protocol is suitable for use in applications where a bounded risk of an undesired outcome is sufficient

Model makes assumptions...
• These are rather simplistic
• Yet the model seems to predict behavior in real networks anyhow
• In effect, the protocol is not merely robust to process perturbation and message loss, but also to perturbation of the model itself
• Speculate that this is due to the incredible power of exponential convergence...
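The exponential convergence the model relies on is easy to see in a toy simulation of the anti-entropy rounds. This is my own sketch, not the paper's model or code: digests and solicitations are abstracted into a single "the peer pulls a copy if the gossiper has one" step.

```python
import random

def run_pbcast(n, initial_loss, rounds, seed=0):
    """Return how many of n processes hold the message after the initial
    (lossy) unreliable multicast plus `rounds` rounds of gossip.
    Illustrative sketch only."""
    rng = random.Random(seed)
    # Initial state: the unreliable multicast reached each process
    # independently with probability 1 - initial_loss; the sender has it.
    has_msg = [rng.random() >= initial_loss for _ in range(n)]
    has_msg[0] = True
    for _ in range(rounds):
        for p in range(n):
            # Gossip a digest to one randomly selected other member.
            peer = rng.randrange(n - 1)
            if peer >= p:
                peer += 1
            # The digest names the message; the peer solicits a copy.
            if has_msg[p] and not has_msg[peer]:
                has_msg[peer] = True
    return sum(has_msg)
```

Even with half the initial multicasts lost, a handful of rounds typically delivers to everyone; over many runs the outcome distribution is bimodal, with the "almost none" and "almost all" extremes dominating.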
Experimental work
• The SP2 is a large network
– Nodes are basically UNIX workstations
– Interconnect is basically an ATM network
– Software is the standard Internet stack (TCP, UDP)
• We obtained access to as many as 128 nodes on the Cornell SP2 in the Theory Center
• Ran pbcast on this, and also ran a second implementation on a real network

Example of a question
• Create a group of 8 members
• Perturb one member in the style of Figure 1
• Now look at “stability” of throughput
– Measure the rate of received messages during periods of 100ms each
– Plot a histogram over the life of the experiment

[Figure: histograms of throughput (probability of occurrence versus inter-arrival spacing in ms) for Ensemble’s FIFO virtual synchrony protocol and for pbcast, each with sleep probabilities of .05 and .45]

Now revisit Figure 1 in detail
• Take 8 machines
• Perturb 1
• Pump data in at varying rates, look at the rate of received messages

[Figure: throughput with perturbed processes. High- and low-bandwidth comparisons of pbcast and the traditional protocol, measured at perturbed and unperturbed hosts, as the perturb rate rises from 0.1 to 0.9]

Throughput variation as a function of scale
[Figure: mean and standard deviation of pbcast throughput for 16-, 96-, and 128-member groups as the perturb rate rises from 0 to 0.5, plus the standard deviation of pbcast throughput versus process group size]

Impact of packet loss on reliability and retransmission rate
[Figure: average throughput of receivers under system-wide message loss at high and low bandwidth for 8-96 nodes, versus the system-wide drop rate; and pbcast background overhead (percentage of retransmitted messages) with 25% of processes perturbed, for 8-128 nodes. Notice that when the network becomes overloaded, healthy processes experience packet loss!]

What about growth of overhead?
• Look at messages other than the original data-distribution multicast
• Measure the worst-case scenario: costs at the main generator of multicasts
• Side remark: all of these graphs look identical with multiple senders or if overhead is measured elsewhere…

[Figure: average overhead (msgs/sec) versus perturb rate for 8 nodes with 2 perturbed, 16 nodes with 4 perturbed, 64 nodes with 16 perturbed, and 128 nodes with 32 perturbed processes]

Growth of Overhead?
• Clearly, overhead does grow
• We know it will be bounded except for probabilistic phenomena
• At peak, load is still fairly low

Pbcast versus SRM, 0.1% packet loss rate on all links
[Figure: requests/sec and repairs/sec received versus group size (10-100) for pbcast, pbcast-IPMC, SRM, and adaptive SRM, with system-wide constant noise, on tree and star topologies]

Pbcast versus SRM: link utilization
[Figure: link utilization on an outgoing and an incoming link at the sender, versus group size, for pbcast, pbcast-IPMC, SRM, and adaptive SRM, on a tree topology with system-wide constant noise]

Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate
[Figure: requests/sec and repairs/sec received versus group size for pbcast, pbcast-IPMC, SRM, and adaptive SRM]

Pbcast versus SRM: interarrival spacing (500 nodes, 300 members, 1.0% packet loss)
[Figure: interarrival-spacing histograms (percentage of occurrences versus latency in seconds) with 1% system-wide noise: pbcast-grb latency at the node level, pbcast-grb latency after FIFO ordering, and SRM latency at the node level]

Real Data: Spinglass on a 10Mbit ethernet (35 UltraSPARCs)
[Figure: messages/sec over time with injected noise, first with the retransmission limit disabled, then re-enabled]

Networks structured as clusters
[Figure: two clusters of senders and receivers joined by a link with limited bandwidth (1500 Kbits) and a high noise rate; other links run at 10 Mbits]

Delivery latency in a 2-cluster LAN, 50% noise between clusters, 1% elsewhere
[Figure: latency histograms for n=80 with 50% inter-cluster noise and 1% system-wide noise: pbcast-grb at the node level and after FIFO ordering, versus SRM and adaptive SRM at the node level]

Requests/repairs and latencies with bounded router bandwidth
[Figure: requests/sec and repairs/sec received versus group size for pbcast-grb and SRM with limited router bandwidth and 1% system-wide constant noise; plus latency histograms for n=100: pbcast-grb at the node level and after FIFO ordering, and SRM at the node level]

Discussion
• Saw that the stability of the protocol is exceptional even under heavy perturbation
• Overhead is low and stays low with system size, bounded even under heavy perturbation
• Throughput is extremely steady
• In contrast, virtual synchrony and SRM are both fragile under this sort of attack

Programming with pbcast?
• Most often would want to split the application into multiple subsystems
– Use pbcast for subsystems that generate a regular flow of data and can tolerate infrequent loss if the risk is bounded
– Use stronger properties for subsystems with less load that need high availability and consistency at all times

Programming with pbcast?
• In a stock exchange, use pbcast for pricing but abcast for “control” operations
• In a hospital, use pbcast for telemetry data but abcast when changing medication
• In an air traffic system, use pbcast for routine radar track updates but abcast when a pilot registers a flight-plan change

Our vision: one protocol side-by-side with the other
• Use virtual synchrony for replicated data and control actions, where strong guarantees are needed for safety
• Use pbcast for high data rates and steady flows of information, where longer-term properties are critical but an individual multicast is less critical

Next steps
• Currently exploring memory-use patterns and scheduling issues for the protocol
• Will study:
– How much memory is really needed? Can we configure some processes with small buffers?
– When to use IP multicast and what TTL to use
– Exploiting network topology information
– Application feedback options

Summary
• New data point in the spectrum:
– Virtual synchrony protocols (atomic multicast as normally implemented)
– Bimodal probabilistic multicast
– Scalable reliable multicast
• Demonstrated that pbcast is suitable for analytic work
• Saw that it has exceptional stability

Epilogue
• Baron: “Although successful, your attack was reckless! It is impossible to achieve consensus on attack under such conditions!”
• General: “I merely took a calculated risk…”