CS514: Intermediate Course in Operating Systems
Professor Ken Birman; Krzys Ostrowski: TA

Tools for SOA robustness
Starting with this lecture we'll develop some basic "tools" we can use to enhance the robustness of web service applications. Today we'll focus on a very simple way to replicate data and track node status, for use in RACS and for other replication purposes in data centers.

Gossip: a valuable tool...
So-called gossip protocols can be robust even when everything else is malfunctioning. The idea is to build a distributed protocol a bit like gossip among humans: "Did you hear that Sally and John are going out?" Gossip spreads like lightning...

Gossip: basic idea
Node A encounters a "randomly selected" node B (the choice might not be totally random). Gossip push ("rumor mongering"): A tells B something B doesn't know. Gossip pull ("anti-entropy"): A asks B for something it is trying to "find". Push-pull gossip combines both mechanisms.

Definition: a gossip protocol...
Uses random pairwise state merge, runs at a steady rate (and this rate is much slower than the network RTT), uses bounded-size messages, and does not depend on messages getting through reliably.

Gossip benefits... and limitations
Benefits: information flows around disruptions, the approach scales very well, it typically reacts to new events in log(N) rounds, and it can be made self-repairing. Limitations: it is rather slow, very redundant, its guarantees are at best probabilistic, and it depends heavily on the randomness of the peer selection.

For example
We could use gossip to track the health of system components, or to report when something important happens. In the remainder of today's talk we'll focus on event notifications; next week we'll look at some of these other uses.

Typical push-pull protocol
Nodes have some form of database of participating machines. We could have a hacked bootstrap, then use gossip to keep this database up to date! Set a timer, and when it goes off, select a peer within the database and send it some form of "state digest". It responds with the data you need and its own state digest, and you respond with the data it needs. (A small code sketch of one such round appears at the end of this introduction.)

Where did the "state" come from?
The data eligible for gossip is usually kept in some sort of table accessible to the gossip protocol. This way a separate thread can run the gossip protocol; it does upcalls to the application when incoming information is received.

Gossip often runs over UDP
Recall that UDP is an "unreliable" datagram protocol supported in the Internet. Unlike with TCP, data can be lost. Packets also have a maximum size, usually 4 KB or 8 KB (you can control this), and larger packets are more likely to get lost! What if a packet would get too large? The gossip layer needs to pick the most valuable stuff to include, and leave out the rest!

Algorithms that use gossip
Gossip is a hot topic! It can be used to notify applications about some event, track the status of applications in a system, organize the nodes in some way (like into a tree, or even sorted by some index), or find "things" (like files). Let's look closely at an example.

Bimodal multicast
This is Cornell work from about 10 years ago. The goal is to combine gossip with UDP (also called IP) multicast to make a very robust multicast protocol.

Stock Exchange Problem: sometimes, someone is slow...
Most members are healthy... but one is slow, i.e. something is contending with the slow ("red") process, delaying its handling of incoming messages. With classical reliable multicast, throughput collapses as the system scales up! Even if we have just one slow receiver, as the group gets larger (hence has more healthy receivers), the impact of a performance perturbation becomes more and more evident.
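To make the push-pull exchange concrete, here is a minimal Python sketch of one gossip round, assuming a node keeps its gossipable state in a table of key -> (version, value) pairs. The class and method names (Node, digest, merge, and so on) are hypothetical, not from any particular implementation; a real deployment would drive gossip_round from a timer and carry the digests in UDP packets rather than direct method calls.

```python
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.membership = []      # the "database" of participating machines
        self.table = {}           # key -> (version, value): the gossiped state

    def digest(self):
        # State digest: lists what we hold (keys and versions), not the values.
        return {key: version for key, (version, _) in self.table.items()}

    def missing(self, their_digest):
        # Keys where the other side is ahead of us (or we lack the key entirely).
        return [k for k, v in their_digest.items()
                if v > self.table.get(k, (-1, None))[0]]

    def updates_for(self, keys):
        return {k: self.table[k] for k in keys if k in self.table}

    def merge(self, updates):
        # Pairwise state merge: keep whichever version is newer.
        for k, (version, value) in updates.items():
            if version > self.table.get(k, (-1, None))[0]:
                self.table[k] = (version, value)

    def gossip_round(self):
        # Timer fired: pick a peer from the membership database, exchange digests,
        # then each side sends the data the other is missing (push-pull).
        peer = random.choice([n for n in self.membership if n is not self])
        peer_digest = peer.digest()
        peer.merge(self.updates_for(peer.missing(self.digest())))   # push
        self.merge(peer.updates_for(self.missing(peer_digest)))     # pull

# Tiny demo: one node learns a fact; repeated rounds spread it to everyone.
nodes = [Node(f"n{i}") for i in range(8)]
for n in nodes:
    n.membership = nodes
nodes[0].table["weather"] = (1, "sunny")
for _ in range(6):                      # roughly log(N) rounds usually suffice
    for n in nodes:
        n.gossip_round()
print(sum(1 for n in nodes if "weather" in n.table), "of", len(nodes), "nodes know")
```

Running the demo, all eight nodes typically learn the update within a handful of rounds, which is the log(N)-style convergence mentioned above.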
[Figure 1: average throughput on non-perturbed members vs. perturb rate, for virtually synchronous Ensemble multicast protocols with group sizes 32, 64, and 96; throughput falls off sharply as group size and perturb rate grow.]

Why does this happen?
Most reliable multicast protocols are based on an ACK/NAK scheme (like TCP, but with multiple receivers): the sender retransmits lost packets. As the number of receivers gets large, ACKs and NAKs pile up, so the sender has more and more work to do. Hence it needs longer to discover problems, which causes it to buffer messages for longer and longer... so flow control kicks in, and the whole group slows down.

Start by using unreliable UDP multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty, so the initial state involves only partial distribution of the multicast(s). Periodically (e.g. every 100 ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them. The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip. Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.

Delivery? Garbage collection?
Deliver a message when it is in FIFO order. Report an unrecoverable loss if a gap persists for so long that recovery is deemed "impractical". Garbage collect a message when you believe that no "healthy" process could still need a copy (we used to wait 10 rounds, but now we use gossip to detect this condition).

Match parameters to intended environment
We need to bound costs. Worries: someone could fall behind and never catch up, endlessly loading everyone else. What if some process has lots of stuff others want and they bombard it with requests? What about scalability of buffering and of the list of members of the system, or the costs of updating that list?

Optimizations
Request retransmissions of the most recent multicast first; the idea is to "catch up quickly", leaving at most one gap in the retrieved sequence. Participants bound the amount of data they will retransmit during any given round of gossip; if too much is solicited, they ignore the excess requests. (The sketch below illustrates the digest-and-solicit round together with these two optimizations.)
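Here is an illustrative Python sketch of the digest-and-solicit repair round described above, incorporating the two optimizations just listed (most-recent-first solicitation and a bound on data retransmitted per round). The names and the RETRANSMIT_BUDGET value are assumptions for the example; the initial unreliable IP multicast, FIFO delivery, and garbage collection are not shown.

```python
import random

RETRANSMIT_BUDGET = 5        # bound on messages resent per solicitation per round

class Member:
    def __init__(self, name):
        self.name = name
        self.messages = {}       # seqno -> payload received so far

    def digest(self):
        # The digest identifies messages; it does not include them.
        return set(self.messages)

    def solicit(self, sender_digest):
        # Ask for what we are missing, most recent multicast first, so we
        # "catch up quickly", leaving at most one gap in the retrieved sequence.
        return sorted(sender_digest - set(self.messages), reverse=True)

    def retransmit(self, wanted):
        # Answer a solicitation, ignoring requests beyond the per-round bound.
        return {s: self.messages[s] for s in wanted[:RETRANSMIT_BUDGET]
                if s in self.messages}

def repair_round(members):
    # Each member gossips a digest of its state to one randomly chosen peer;
    # the peer solicits copies of anything it is missing, and the gossiper resends.
    for m in members:
        peer = random.choice([p for p in members if p is not m])
        wanted = peer.solicit(m.digest())
        peer.messages.update(m.retransmit(wanted))

# Demo: a lossy initial multicast of seqnos 0..19, then a few repair rounds.
members = [Member(f"p{i}") for i in range(8)]
for seq in range(20):
    for m in members:
        if random.random() > 0.2:            # 20% loss on the initial multicast
            m.messages[seq] = f"msg-{seq}"
for _ in range(5):
    repair_round(members)
print(min(len(m.messages) for m in members), "of 20 messages at the worst-off member")
```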
Optimizations, continued
Label each gossip message with the sender's gossip round number, and ignore solicitations carrying an expired round number, reasoning that they arrived very late and hence are probably no longer correct. Don't retransmit the same message twice in a row to any given destination (the copy may still be in transit, hence the request may be redundant). Use UDP multicast when retransmitting a message if several processes lack a copy, for example if it is solicited twice, or if a retransmission is received from "far away"; the tradeoff is excess messages versus low latency, and a regional TTL can be used to restrict the multicast scope.

Scalability
The protocol is scalable except for its use of the membership of the full process group: updates could be costly, and the size of the list could be costly. In large groups, we would also prefer not to gossip over long, high-latency links.

Router overload problem
Random gossip can overload a central router, yet the information flowing through this router is of diminishing quality as the rate of gossip rises. Insight: a constant rate of gossip is achievable and adequate.

[Figure: two-cluster topology in which the clusters are joined by a limited-bandwidth (1500 Kbit) link with a high noise rate, while other links run at 10 Mbit; plots compare uniform and hierarchical gossip on load on the WAN link (msgs/sec) and delivery latency (ms) as group size grows.]

Hierarchical gossip
Weight gossip so that the probability of gossiping to a remote cluster is smaller; the weight can be adjusted to keep a constant load on the router. Propagation delays now rise... but we can just increase the rate of gossip to compensate.

[Figure: messages/sec on the WAN link and propagation time (ms) vs. group size for uniform, hierarchical, and fast hierarchical gossip.]

Idea behind analysis
We can use the mathematics of epidemic theory to predict the reliability of the protocol. Assume an initial state, then look at the result of running B rounds of gossip: it converges exponentially quickly towards atomic delivery.

[Figure: Pbcast bimodal delivery distribution, p{#processes = k} vs. the number of processes that deliver the pbcast; either the sender fails and almost no process delivers, or the data gets through with high probability and almost every process delivers.]

Failure analysis
Suppose someone tells me what they hope to "avoid". Model it as a predicate on the final system state; we can then compute the probability that pbcast would terminate in that state, again from the model.

Two predicates
Predicate I: a faulty outcome is one where more than 10% but less than 90% of the processes get the multicast. Think of a probabilistic Byzantine Generals problem: it is a disaster if many, but not most, troops attack.
Predicate II: a faulty outcome is one where roughly half get the multicast and failures might "conceal" the true outcome. This would make sense if we were using pbcast to distribute quorum-style updates to replicated data; the costly, hence undesired, outcome is the one where we need to roll back because the outcome is "uncertain".
It is easy to add your own predicate; our methodology supports any predicate over the final system state. (An illustrative coding of the two predicates follows.)
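As a concrete illustration, the two predicates can be written as simple functions over the final system state. This is only a sketch: the formalization of Predicate II (crash failures that could swing the count across the quorum threshold) and the example numbers are our own interpretation, not values taken from the analysis.

```python
def predicate_I(delivered, n):
    """Faulty outcome: more than 10% but less than 90% of the processes delivered."""
    frac = delivered / n
    return 0.10 < frac < 0.90

def predicate_II(delivered, crashed, n):
    """Faulty outcome: roughly half delivered, and crash failures could conceal
    which side of the quorum threshold the group actually ended up on."""
    quorum = n // 2 + 1
    # If the crashed processes could swing the count across the quorum line,
    # the outcome is "uncertain" and a costly rollback would be needed.
    return delivered < quorum <= delivered + crashed

# Example: in a 50-process group, 24 processes delivered and 3 crashed.
print(predicate_I(24, 50))        # True: between 10% and 90% delivered
print(predicate_II(24, 3, 50))    # True: the 3 crashes conceal the true outcome
```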
[Figure: scalability of Pbcast reliability, P{failure} vs. the number of processes in the system, for Predicates I and II.]
[Figure: effect of fanout on reliability, P{failure} vs. fanout, for Predicates I and II.]
[Figure: fanout required for a specified reliability vs. the number of processes in the system, for Predicate I at 1E-8 reliability and Predicate II at 1E-12 reliability.]
[Figure 5: graphs of analytical results, collecting the bimodal delivery distribution, the scalability of Pbcast reliability, the effect of fanout on reliability, and the fanout required for a specified reliability.]

Experimental work
The SP2 is a large network: nodes are basically UNIX workstations, the interconnect is basically an ATM network, and the software is the standard Internet stack (TCP, UDP). We obtained access to as many as 128 nodes on the Cornell SP2 in the Theory Center. We ran pbcast on this, and also ran a second implementation on a real network.

Example of a question
Create a group of 8 members and perturb one member in the style of Figure 1. Now look at the "stability" of throughput: measure the rate of received messages during periods of 100 ms each, and plot a histogram over the life of the experiment. (A small code sketch of this binning appears below.)

[Figure: source-to-destination latency distributions; histograms of throughput (probability of occurrence vs. inter-arrival spacing in ms) for Pbcast and for the traditional protocol (Ensemble's FIFO virtual synchrony protocol), each with sleep probabilities of .05 and .45. Notice that in practice, bimodal multicast is fast!]
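A small sketch of that measurement, assuming we have a receiver's log of message arrival times; the arrival data below is synthetic, but the binning into 100 ms periods and the histogram over the experiment's life are the computation described above.

```python
import random
from collections import Counter

WINDOW = 0.100          # 100 ms measurement periods
DURATION = 30.0         # length of the (synthetic) experiment, in seconds

# Synthetic arrival times; in the real experiment these come from the receiver's log.
arrivals = [random.uniform(0, DURATION) for _ in range(6000)]

# Rate of received messages during each 100 ms period.
per_window = Counter(int(t / WINDOW) for t in arrivals)
num_windows = round(DURATION / WINDOW)
rates = [per_window.get(w, 0) for w in range(num_windows)]

# Histogram of those per-window rates over the life of the experiment: a stable
# protocol gives one tight peak; a perturbed traditional protocol gives a wide spread.
histogram = Counter(rates)
for rate in sorted(histogram):
    print(f"{rate:3d} msgs per 100 ms: {'#' * histogram[rate]}")
```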
Now revisit Figure 1 in detail
Take 8 machines and perturb 1. Pump data in at varying rates and look at the rate of received messages.

Revisit our original scenario with perturbations (32 processes)
[Figure: high-bandwidth and low-bandwidth comparisons of pbcast and traditional multicast performance at faulty and correct hosts; average throughput vs. perturb rate, measured at perturbed and unperturbed hosts.]

Throughput variation as a function of scale
[Figure: mean and standard deviation of pbcast throughput (msgs/sec) vs. perturb rate for 16-, 96-, and 128-member groups, plus standard deviation of pbcast throughput vs. process group size.]

Impact of packet loss on reliability and retransmission rate
[Figure: pbcast with system-wide message loss at high and low bandwidth (average throughput of receivers vs. system-wide drop rate, for 8 to 96 nodes), and pbcast background overhead with 25% of processes perturbed (retransmitted messages (%) vs. perturb rate, for 8 to 128 nodes).] Notice that when the network becomes overloaded, healthy processes experience packet loss!

What about growth of overhead?
Look at messages other than the original data-distribution multicast, and measure the worst-case scenario: costs at the main generator of multicasts. Side remark: all of these graphs look identical with multiple senders, or if overhead is measured elsewhere.
[Figure: average overhead (msgs/sec) vs. perturb rate for 8, 16, 64, and 128 nodes, with 2, 4, 16, and 32 perturbed processes respectively.]

Growth of overhead?
Clearly, overhead does grow. We know it will be bounded, except for probabilistic phenomena, and at its peak the load is still fairly low.

Pbcast versus SRM, 0.1% packet loss rate on all links
[Figure: requests/sec and repairs/sec received vs. group size for Pbcast, Pbcast-IPMC, SRM, and adaptive SRM, with system-wide constant noise, on tree and star topologies.]

Pbcast versus SRM: link utilization
[Figure: link utilization on an outgoing link from the sender and on an incoming link to the sender vs. group size, for Pbcast, Pbcast-IPMC, SRM, and adaptive SRM, with system-wide constant noise on a tree topology.]

Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate
[Figure: requests/sec and repairs/sec received vs. group size for Pbcast, Pbcast-IPMC, SRM, and adaptive SRM, with 0.1% system-wide constant noise on a 1000-node tree topology.]

Pbcast versus SRM: inter-arrival spacing (500 nodes, 300 members, 1.0% packet loss)
[Figure: percentage of occurrences vs. latency for pbcast-grb (at node level and after FIFO ordering) and for SRM (at node level), with 1% system-wide noise, 500 nodes, and 300 members.]

Real data: Bimodal Multicast on a 10 Mbit Ethernet (35 UltraSPARCs)
[Figure: messages/sec over time with injected noise, first with the retransmission limit disabled and then with it re-enabled.]

Networks structured as clusters
[Figure: two clusters of senders and receivers joined by a limited-bandwidth (1500 Kbit) link with a high noise rate; the bandwidth of the other links is 10 Mbit.]

Delivery latency in a 2-cluster LAN, 50% noise between clusters, 1% elsewhere
[Figure: percentage of occurrences vs. latency (at node level and after FIFO ordering) for pbcast-grb and for SRM and adaptive SRM, in a partitioned network with 50% noise between clusters and 1% system-wide noise, n = 80.]
Requests/repairs and latencies with bounded router bandwidth
[Figure: requests/sec and repairs/sec received vs. group size for pbcast-grb and SRM with limited bandwidth on the router and 1% system-wide constant noise, plus histograms of latency (at node level and after FIFO ordering) for pbcast-grb and SRM with n = 100.]

Summary
Gossip is a valuable tool for addressing some of the needs of modern autonomic computing. We would use it in a RACS, and for distributing updates to parameters. It is often paired with other mechanisms, e.g. anti-entropy paired with UDP multicast. Solutions scale well (if well designed!).