Reliable Distributed Systems: Peer to Peer

Peer-to-Peer (p2p) Systems. The term refers to a kind of distributed computing system in which the "main" service is provided by having the client systems talk directly to one another. In contrast, traditional systems are structured with servers at the core and clients around the edges: standard systems are client/server structured, while in p2p systems the clients help one another out.

An "important" topic, or at least one that gets a lot of press. The recording industry claims that p2p downloads are killing profits. The traffic used to be mostly file sharing, but now online radio feeds (RSS feeds) are a big deal too. A University of Washington study showed that 80% of their network bandwidth was spent on music/video downloads. DVDs were the largest objects and accounted for the lion's share; a great many objects were downloaded many times; strangely, many downloads took months to complete; and most went to a tiny handful of machines in dorm rooms.

Where has all the bandwidth gone? [Figure: breakdown of UW TCP bandwidth into HTTP components (May 2002), in Mbps over one week, broken out into WWW, Akamai, P2P, and non-HTTP TCP. WWW = 14% of TCP traffic; P2P = 43% of TCP traffic; P2P dominates WWW in bandwidth consumed. Source: Hank Levy. See http://www.cs.washington.edu/research/networking/websys/pubs/osdi_2002/osdi.pdf]

Bandwidth consumed by UW servers (outbound traffic). [Figure: bandwidth consumed by UW servers over the same week, in Mbps, broken out by Kazaa, Gnutella, and WWW traffic. Source: Hank Levy, same paper.]

Object type for different systems. [Figure: byte breakdown per content delivery system, by object type (text, images, audio, video, other), for WWW, Akamai, Gnutella, and Kazaa. Source: Hank Levy, same paper.]

Today: an overview. Today we'll look at the area as a whole. Origins: illegal file sharing. Early academic work: "distributed hash tables." Subsequent spread of the field into many other areas: steganographic storage, erasure codes, gossip protocols and epidemic data dissemination, etc. In upcoming lectures we'll look at details of some research systems.

An old idea. If you think about it, most of the protocols we've discussed are "peer to peer" in a broad sense: pretty much everything Lamport was interested in uses direct client-to-client communication, and group communication systems often do have servers, though not all need them. But the term really has a stronger meaning: it denotes systems where the "data that matters" is passed among cooperating client systems, and there may be huge numbers of clients. It evokes an image of resistance fighters working to overthrow an evil IP empire.

Attributes of p2p systems. They can be enormous: we often talk about hundreds of thousands or millions of client nodes, coming and going rapidly. If there are servers, they are small in number and have limited roles. These clients are everywhere, even in Kenya or Nepal, places with lousy network connectivity, and often behind firewalls or NAT boxes. Some are supercomputers.
But many are slow.

The issue with NAT boxes. When a system uses firewalls or NAT boxes, client systems inside the network can usually talk to servers outside it: the NAT knows about the TCP 3-way handshake and "creates a tunnel" on the fly, remapping the (IP address, port) pair as packets pass by, so it looks as if the NAT (not the client) is making the connection and receiving the replies. But connectivity from outside to inside is blocked. In fact, because the client's IP address is remapped, the client simply can't be addressed other than through the NAT.

The first peer-to-peer system. The term, and the intuition, emerged from the Napster file sharing service. In fact Napster has a set of servers, but these just keep a directory on behalf of clients and orchestrate publicity inserts. The servers build the web pages users see; actual music and DVD downloads are done from client to client.

Napster. Having obtained a top-level page listing peers with copies of the music or other content desired, a client can download the files directly from a peer. The exchange looks roughly like this: the client asks the directory "Where can I find a copy of 'Sting: Fields of Barley'?" and gets back "try 167.26.16.89 or 221.18.71.36"; it then contacts one of those peers directly ("Got 'Sting'? Can I have a copy?" ... "no problem, dude"). The data center only builds the pages users see when they access Napster. (A toy sketch of such a directory service appears at the end of this subsection.)

Quick aside: should "intellectual property" be free? This is a topic of much debate right now. Lessig contrasts "East Coast Code" and "West Coast Code": East Coast Code is a term for laws on the books, West Coast Code is a term for software. His point? We need to evolve a balance between what we demand (law), what we can implement (code), and what will promote the general welfare. What regime gives the most benefit for the most people?

Why did Napster go this route? When the service launched, its developers hoped to work around legal limits on sharing media. They reasoned: let client systems advertise "stuff." If some of that stuff happens to be music, that's the responsibility of the person who shares it. The directory system "helps clients advertise wares" but doesn't "endorse" the sharing of protected intellectual property; a client who chooses to do so is violating the law. Napster makes its money on the advertising it inserts. Judges saw it differently: "Napster's clear purpose is to facilitate theft of IP."

Characteristics of big populations. With huge numbers of users, surprisingly many "come and go" on short time scales. One study found that the mean residence time in Freenet was just a few seconds, and many clients were never heard from again. British Telecom reassigns IP addresses for all its networked users every few hours!

List of (technical) issues with Napster. Many clients just aren't accessible: firewalls can limit incoming connections to clients. Many client systems come and go (churn). Round trip times to Nepal are slow. Slow "upload" speeds are common. Clients might withdraw a file unexpectedly, e.g. if low on disk space, or if they download something on top of a song they aren't listening to anymore.

More (technical) issues with Napster. The industry has been attacking the service, and not just in courts of law: denial of service assaults on the core servers; clients that lie about content (e.g. serving Frank Sinatra in response to a download request for Eminem); hacking Napster "clients" to run the protocol in various broken (disruptive) ways; and trying to figure out who is serving which files, in order to sue those people.
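To make the directory-plus-direct-download structure concrete, here is a toy sketch of a Napster-style directory service. The class and method names (Directory, register, lookup, withdraw) are illustrative assumptions, not Napster's actual protocol; the point is only that the server maps titles to peers, while the bytes move peer to peer.

```python
# Toy sketch of a Napster-style directory: the server only maps titles to
# peers that claim to hold a copy; the download itself is client-to-client.
# Names and API are illustrative, not Napster's real protocol.
from collections import defaultdict

class Directory:
    def __init__(self):
        self.sources = defaultdict(set)     # title -> set of "ip:port" strings

    def register(self, peer, titles):
        """A peer advertises the titles it is willing to serve."""
        for t in titles:
            self.sources[t].add(peer)

    def withdraw(self, peer):
        """Peer leaves (or deletes files): drop it from every listing."""
        for peers in self.sources.values():
            peers.discard(peer)

    def lookup(self, title):
        """Return candidate peers; the client downloads from one of them directly."""
        return list(self.sources[title])

d = Directory()
d.register("167.26.16.89:6699", ["Sting: Fields of Barley"])
d.register("221.18.71.36:6699", ["Sting: Fields of Barley"])
print(d.lookup("Sting: Fields of Barley"))   # the client then contacts a peer itself
```

The sketch also makes the later technical issues easy to see: nothing here checks that a registered peer is reachable, still holds the file, or is telling the truth.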
What problems are "fundamental"? If we assume clients serve up the same stuff people download, the number of sources for a less popular item will be very small. Under the assumption that churn is constant, these less popular items will generally not be accessible. But experiments show that clients fall into two categories, well-connected clients that hang around and poorly-connected clients that also churn, and this confuses the question.

What problems are fundamental? "One can have, some claim, as many electronic personas as one has the time and energy to create." (Judith S. Donath) This is the so-called "Sybil attack": an attacker buys a high performance computer cluster and registers with Napster many times (maybe tens of thousands of times) using a variety of IP addresses. Thinking these are real, Napster lists them in download pages; real clients get poor service or even get snared. Studies show that no p2p system can easily defend against Sybil attacks.

Refined Napster structure. Early Napster just listed anything. Later versions enhanced the directory servers to probe clients and track their health, used automated reporting of download problems to trim "bad sources" from listings, and ranked data sources to preferentially list clients that have been up for a long time, seem to have fast connections, and appear to be "close" to the client doing the download (using a notion of "Internet distance"). They also implemented parallel downloads and even an experimental method for doing "striped" downloads (first block from source A, second from source B, third from C, etc.), which leverages asymmetric downlink/uplink speeds.

Meanwhile, p2p took off. By the time Napster was ruled illegal, it had 15 million users; 5 million of them joined in just a few months. With Napster out of business, a vacuum arose, and some users teamed up to define an open standard called "Gnutella" and to develop many protocol implementations.

Gnutella eliminates the server. The judge singled the directory servers out in deciding that Napster was illegal, and a true peer-to-peer network seems harder to defeat than one that is only partly peer-to-peer. The credo: "All information should be free."

How Gnutella works, in rough outline. A user joins the network using a broadcast with increasing TTL values ("Is anyone out there?") and links itself to the first Gnutella node to respond. To find content, the protocol searches in a similar way: it broadcasts "I'm looking for Eminem:WhackHer" and keeps increasing the TTL value, eventually giving up if no system responds. Hopefully, popular content will turn up nearby.

Self-organized "overlay" network. A query ("I'm looking for Sting: Fields...") floods through the overlay; the TTL determines how far the search will "flood" in the network (in the example, a TTL of 2 reached 10 nodes). Nodes with a copy send back a message that offers a copy, basically a URL for the file. The client then downloads the file from the first node that offered it; hopefully this is a nearby source with good connectivity.

Gnutella has "issues." In experimental studies of the system, very high rates of join requests and queries are sometimes observed; departures (churn) were found to disrupt the Gnutella communication graph; and requests for rare or misspelled content turn into world-wide broadcasts. Rare is, um, rare. Misspellings are common.
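A minimal sketch of the TTL-limited flooding search just described, over an arbitrary overlay graph. The graph and function names are assumptions for illustration only; real Gnutella also assigns query IDs so nodes can suppress duplicates, and responses travel back along the query path.

```python
# Toy sketch of a Gnutella-style TTL-limited flood. 'overlay' is an adjacency
# map; 'holdings' maps node -> set of titles it serves. Names are illustrative.
def flood_search(overlay, holdings, start, title, ttl):
    frontier = {start}
    visited = {start}
    hits = []
    for hops in range(1, ttl + 1):
        next_frontier = set()
        for node in frontier:
            for nbr in overlay[node]:
                if nbr in visited:
                    continue                     # real Gnutella dedups by query ID
                visited.add(nbr)
                next_frontier.add(nbr)
                if title in holdings.get(nbr, ()):
                    hits.append((nbr, hops))     # "I have a copy" flows back to the requester
        frontier = next_frontier
    return hits   # an empty list models "gave up": rare content turns into a wide flood

overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
           "d": ["b", "c", "e"], "e": ["d"]}
holdings = {"e": {"Sting: Fields of Gold"}}
print(flood_search(overlay, holdings, "a", "Sting: Fields of Gold", ttl=2))  # [] -> retry, larger TTL
print(flood_search(overlay, holdings, "a", "Sting: Fields of Gold", ttl=3))  # [('e', 3)]
```

The retry-with-larger-TTL pattern in the usage lines is exactly why rare or misspelled content is so expensive: each unsuccessful attempt widens the flood.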
Berkeley, MIT research in p2p. Universities were the first to view p2p as an interesting research area. CAN, the "content addressable network," was proposed at Berkeley; Chord is MIT's "distributed hash table." Both systems separate the "indexing" problem from actual storage.

Distributed hash tables (DHTs). The idea is to support a simple index with the API Insert(key, value), which saves a (key, value) tuple, and Lookup(key), which looks up the key and returns the value, and to implement it in a p2p network, not on a server. Exactly how we implement it varies; normally, each p2p client has just part of the tuples, so a query must be routed to the right place. (A minimal sketch of the abstraction appears at the end of this subsection.)

Distributed indexing. Once some client has called Insert("Sting:Fields", 128.64.72.13), any client can call Lookup("Sting:Fields") and get back 128.64.72.13. The abstraction of an index makes it look like one big server; the implementation spreads the index over many peers, and we can implement this one abstraction in many ways.

Some details to keep in mind. There are lots of protocols that can solve this problem: the protocol used is not part of the problem statement. Some DHTs allow updates (e.g. if data moves, or nodes crash); others are write-once. Most DHTs allow many tuples with the same key and can return the whole list, or a random subset of size k, etc.

So what can we insert? Normally, we want to keep the values small, like an IP address. So the (key, value) pairs might tell us where to look for something, but probably not hold the actual thing; the value could be (and often is) a URL. Once we have the DHT running we can use it to build a p2p file system.

DHTs: the area quickly took off. CAN and Chord are the DHTs already mentioned. Pastry, from Rice and MSR, uses "Plaxton trees" (a kind of lookup tree); Tapestry, from Berkeley, is similar to Pastry; Kelips and Beehive, from Cornell, use replication to get much faster responses; and there are too many more to list.

Representative research topics. Can we make a DHT "resilient" to churn? Can it hide content and guarantee anonymity? Can it be secure and robust against attack? Can it support high quality parallel striped downloads? Can we use a DHT to support scalable content distribution (IP multicast isn't popular with ISPs)? Or to implement a new style of Internet addressing (i.e. to replace IP routing or multicast)?

Are there legitimate uses of p2p file systems? One thought: corporations might want to index "everything in their file store" or to archive material. Digital libraries might use p2p to avoid keeping extra copies of special or extremely big objects. The risk of "bit rot" is a big concern: suppose some huge set of PCs collaborates to preserve important documents, perhaps encrypting them (various options exist). How many replicas are needed to avoid the risk that "rare events" will destroy all copies simultaneously? This is a topic of study in Oceanstore and at UCSD.

Are there legitimate uses of p2p file systems? p2p could be a great way to legally share information within a team of collaborators at work, or some other "interest group." Think of these as little groups superimposed on a massive p2p network using the same technology; the idea would be "we help each other out." Some argue that p2p systems could be valuable in resisting repressive political regimes, like the "coffee house" meetings in pre-revolutionary Russia. Can repressive regimes survive if they can't control the flow of information?
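Returning to the Insert/Lookup abstraction above, here is a minimal sketch of the idea that each peer holds only the tuples whose keys hash to it. Everything here (class names, the SHA-1 plus modulo mapping, the fixed node list) is an illustrative assumption; a real DHT replaces the fixed membership list with the routing structures discussed in the next lecture.

```python
# Minimal sketch of the DHT abstraction: each peer holds only the tuples whose
# keys hash to it. Fixed node list and names are illustrative assumptions.
import hashlib
from collections import defaultdict

class ToyDHT:
    def __init__(self, node_ids):
        self.nodes = sorted(node_ids)
        self.store = {n: defaultdict(list) for n in self.nodes}   # per-node tuple store

    def _owner(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]     # a real DHT would route to this node

    def insert(self, key, value):
        self.store[self._owner(key)][key].append(value)   # many tuples may share a key

    def lookup(self, key):
        return self.store[self._owner(key)].get(key, [])

dht = ToyDHT(["node-13", "node-33", "node-58", "node-81", "node-97"])
dht.insert("Sting:Fields", "128.64.72.13")     # the value is small: where to look, not the file
print(dht.lookup("Sting:Fields"))              # ['128.64.72.13']
```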
Spyware: the real thing. Imagine a popular p2p system that encrypts content (you need the key to make sense of it) and achieves a high degree of anonymity: pretty much everyone helps to serve each request, but nobody actually has a copy of the whole file on their drive (I have a few bits, you have a few bits); the real sources, and the nodes accessing content, are concealed from intruders; and the system is robust against disruptive attack. It needs to be popular: spies hide in crowds.

Philosophical debate: is technology "political"? Here we have a technology invented to rip off IP from its owners and to conceal crime from law enforcement, and it is pretty much unstoppable without incredibly intrusive oversight mechanisms. What's the story here? Are we all anarchists? Some people believe technology is negative, some positive, some neutral. What about p2p technology? Are we allowed to answer "all of the above"?

p2p outside of file sharing. The key idea was that p2p systems could "gossip" about replicated data. Now and then, each node picks some "peer" (at random, more or less) and sends it a snapshot of its own data ("push" gossip), or asks for a snapshot of the peer's data ("pull" gossip), or both: a push-pull interaction.

Gossip "epidemics." At t=0, suppose that I know something. At t=1, I pick you; now two of us know it. At t=2, we each pick a peer; now 4 know it. Information spreads at an exponential rate; due to re-infection (gossip to an already-infected node) it spreads as roughly 1.8^k after k rounds, but in O(log N) time, all N nodes are infected. (A small simulation sketch appears at the end of this subsection.)

Gossip epidemics. An unlucky node may just "miss" the gossip for a long time, but gossip scales very nicely: participants' loads are independent of system size, network load is linear in system size, and data spreads in log(system size) time. [Figure: fraction of nodes infected versus time, an S-shaped curve; time to infection is O(log n).]

Facts about gossip epidemics. They are extremely robust: data travels on exponentially many paths, so the spread is hard to even slow down. Suppose 50% of our packets are simply lost; we'll need one additional round, a trivial delay. Push-pull works best: with push-only or pull-only gossip, a few nodes can remain uninfected for a long time. Later we'll see that many optimizations are needed in practice, but the approach works.

Uses of gossip epidemics: to robustly multicast data (slow, but very sure of getting through); to repair inconsistency in replicas; to support "all to all" monitoring and distributed management; and for distributed data mining and discovery.

A contemporary perspective. p2p computing has many pros and many cons, and for most purposes the cons outweigh the pros: it is a "hard to control" technology, firewalls cause many annoyances, and it is rather slow to propagate updates. But at the same time it is incredibly robust against disruption.
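A small simulation sketch of the push-pull gossip described above, showing that the number of rounds needed to infect everyone grows roughly like log(N). The model is an illustrative simplification I am assuming here (synchronous rounds, one random peer per node per round), not any particular system's protocol.

```python
# Toy synchronous push-pull gossip: in each round every node contacts one
# random peer; if either of the pair knows the rumor, both end up knowing it.
import random, math

def rounds_to_spread(n, seed=None):
    rng = random.Random(seed)
    infected = [False] * n
    infected[0] = True
    rounds = 0
    while not all(infected):
        rounds += 1
        snapshot = infected[:]                  # everyone gossips "simultaneously"
        for node in range(n):
            peer = rng.randrange(n - 1)
            if peer >= node:
                peer += 1                       # a random peer other than self
            if snapshot[node] or snapshot[peer]:   # push-pull: the rumor moves both ways
                infected[node] = infected[peer] = True
    return rounds

for n in (10, 100, 1000, 10000):
    avg = sum(rounds_to_spread(n) for _ in range(20)) / 20
    print(f"N={n:6d}  avg rounds={avg:5.1f}  log2(N)={math.log2(n):.1f}")
```

Multiplying N by 10 adds only a few rounds, which is the scalability claim in the slides: per-node load is constant, total network load is linear, and latency grows logarithmically.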
Contemporary response? Use p2p techniques, but mostly in data centers or LANs where there are no firewalls, and in uses where slow update times aren't an issue. This often means that we need to marry the p2p mechanism to a more "urgent" protocol, like our multicast protocols.

Peek ahead. We'll look at several p2p technologies: Chord, Pastry, and Kelips, three DHTs; Bimodal Multicast, which uses gossip in a multicast protocol to get superior scalability; and Astrolabe, which uses gossip to implement a scalable monitoring, management, and control infrastructure (also great for data mining).

Reliable Distributed Systems: Distributed Hash Tables

Reminder: Distributed Hash Table (DHT). A service, distributed over multiple machines, with hash table semantics: Insert(key, value), value(s) = Lookup(key). It is designed to work in a peer-to-peer (P2P) environment, with no central control and nodes under different administrative control, but of course it can also operate in an "infrastructure" sense.

P2P "environment." Nodes come and go at will, possibly quite frequently (every few minutes). Nodes have heterogeneous capacities: bandwidth, processing, and storage. Nodes may behave badly: they may promise to do something (store a file) and not do it (free-loaders), or attack the system.

Several flavors, each with variants. Tapestry (Berkeley) is based on Plaxton trees, similar to hypercube routing; it was the first* DHT, and it is complex and hard to maintain (hard to understand, too). CAN (ACIRI), Chord (MIT), and Pastry (Rice/MSR Cambridge) form the second wave of DHTs, contemporary with and independent of each other. (*Landmark Routing, 1988, used a form of DHT called Assured Destination Binding, ADB.)

Basics of all DHTs. The goal is to build some "structured" overlay network with the following characteristics: node IDs can be mapped to the hash key space; given a hash key as a "destination address," you can route through the network to a given node; and you always route to the same node no matter where you start from.

Simple example (doesn't scale). Take a circular number space 0 to 127, with nodes at IDs 13, 33, 58, 81, 97, 111, and 127. The routing rule is to move clockwise until the current node's ID ≥ key and the last hop's node ID < key. Example: for key = 42 you will obviously route to node 58, no matter where you start; node 58 "owns" the keys in [34, 58].

Building any DHT. The newcomer always starts with at least one known member. It searches for "self" in the network, using its own node ID as the hash key; the search ends at a node in the vicinity of where the newcomer needs to be. Links are then added and removed to satisfy the properties of the network, and objects that now hash to the new node are transferred to it.

Insertion/lookup for any DHT. Hash the name of the object to produce a key (e.g. foo.htm might hash to key 93). Use the key as a destination address to route through the network; there is a well-known way to do this, and the route ends at the target node. Insert the object, or retrieve it, at that target node.
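A minimal sketch of the "doesn't scale" ring example above: clockwise successor ownership on a 0-127 identifier space. Routing here is a direct computation over the full membership list, which is exactly what real DHTs avoid; the names are illustrative.

```python
# Toy ring from the example: IDs live in a circular space 0..127 and a key is
# owned by the first node clockwise at or after it (so node 58 owns [34, 58]).
import bisect

NODES = sorted([13, 33, 58, 81, 97, 111, 127])
SPACE = 128

def owner(key):
    i = bisect.bisect_left(NODES, key % SPACE)   # first node ID >= key
    return NODES[i % len(NODES)]                 # wrap back to the smallest ID if needed

print(owner(42))    # 58, no matter where routing "starts"
print(owner(34))    # 58
print(owner(127))   # 127
print(owner(0))     # 13
```

Because every node would need the full sorted membership list, this version cannot scale; the DHTs below keep only O(log N) (or, for Kelips, O(sqrt N)) of this state per node and route in multiple hops instead.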
Properties of all DHTs. Memory requirements grow (something like) logarithmically with N (exception: Kelips). Routing path length grows (something like) logarithmically with N (several exceptions). The cost of adding or removing a node grows (something like) logarithmically with N. Each has caching, replication, etc.

DHT issues: resilience to failures; load balance (heterogeneity, number of objects at each node, routing hot spots, lookup hot spots); locality (a performance issue); churn (a performance and correctness issue); security.

We're going to look at four DHTs, at varying levels of detail: Chord, from MIT (Stoica et al.); Kelips, from Cornell (Gupta, Linga, Birman); Pastry, from Rice/Microsoft Cambridge (Druschel, Rowstron); and Beehive, from Cornell (Ramasubramanian, Sirer).

Things we're going to look at: What is the structure? How does routing work in the structure? How does it deal with node departures? How does it scale? How does it deal with locality? What are the security issues?

Chord uses a circular ID space. Keys and nodes are hashed into the same circular ID space, and each key is stored at its successor, the node with the next highest ID (in the figure, keys K5 and K10 live at node N10; K11 and K30 at N32; K33, K40, and K52 at N60; K65 and K70 at N80; K100 at N100). Chord slides care of Robert Morris, MIT.

Basic lookup. A query ("Where is key 50?") is forwarded around the ring until it reaches the key's successor ("Key 50 is at N60"). Lookups find the ID's successor, and are correct as long as the ring's successor and predecessor pointers are correct.

Successor lists ensure robust lookup. Each node remembers its next r successors (r = 3 in the figure, e.g. N40 remembers 60, 80, 99). A lookup can skip over dead nodes to find blocks, and nodes periodically check their successor and predecessor links.

The Chord "finger table" accelerates lookups. Node N80, for example, keeps fingers pointing roughly 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring. To build its finger table, a new node searches for the key value of each finger; to do this efficiently, new nodes obtain their successor's finger table and use it as a hint to optimize the search.

Chord lookups take O(log N) hops: each finger hop roughly halves the remaining distance to the key (in the figure, a lookup for K19 reaches K19's successor, N20, in a few hops).

Drill down on Chord reliability. We are interested in maintaining a correct routing table (successors, predecessors, and fingers). The primary invariant is correctness of the successor pointers; the fingers, while important for performance, do not have to be exactly correct for routing to work, because the algorithm is to "get closer" to the target, and successor hops always do this.

Maintaining successor pointers. Each node periodically runs the "stabilize" algorithm: it finds its successor's predecessor and repairs its pointers if that node isn't itself. This algorithm is also run at join; eventually routing will repair itself. Fix_finger is also run periodically, for a randomly selected finger.

Join example. Initially, 25 wants to join a correct ring, between 20 and 30: 25 finds its successor and tells that successor (30) of itself; then 20 runs "stabilize," asks 30 for 30's predecessor, 30 returns 25, and 20 tells 25 of itself. Now suppose 28 joins before 20 runs "stabilize": 28 finds its successor and tells that successor (30) of itself; when 20 runs "stabilize" it asks 30 for 30's predecessor, 30 returns 28, and 20 tells 28 of itself; then 25 runs "stabilize," and 20 runs "stabilize" again, after which the ring 20, 25, 28, 30 is correct.

Chord problems? With intense "churn" the ring may be very disrupted. Worst case: a partition can provoke formation of two distinct rings (think "Cornell ring" and "MIT ring"), and they could have finger pointers to each other but not link pointers; this might never heal itself, though the scenario would be hard to induce. Chord is also unable to resist Sybil attacks.
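A minimal sketch of Chord-style finger tables and greedy lookup over an m-bit identifier circle. It assumes a static, fully known membership (so the fingers are computed directly rather than built through the join and stabilize protocols above); the 7-bit space and node IDs echo the figures but are otherwise illustrative.

```python
# Toy Chord routing: m-bit circular ID space, finger[i] = successor(n + 2^i).
# Static membership is assumed; real Chord builds fingers via lookups and
# keeps them fresh with the stabilize / fix_fingers protocols described above.
M = 7                                    # 7-bit space: IDs 0..127
NODES = sorted([5, 10, 20, 32, 60, 80, 99, 110])

def successor(k):
    k %= 2 ** M
    for n in NODES:
        if n >= k:
            return n
    return NODES[0]                      # wrap around the circle

def fingers(n):
    return [successor(n + 2 ** i) for i in range(M)]

def lookup(start, key):
    """Greedy routing: hop to the farthest finger that does not pass the key."""
    node, hops = start, 0
    while successor(key) != node:
        best = node
        for f in fingers(node):
            if (f - node) % 2 ** M <= (key - node) % 2 ** M:
                best = f                 # fingers are in clockwise order; keep the farthest
        if best == node:
            best = successor(node + 1)   # no finger helps: fall back to the plain successor
        node, hops = best, hops + 1
    return node, hops

print(fingers(80))       # [99, 99, 99, 99, 99, 5, 20]
print(lookup(32, 19))    # key 19 belongs to node 20; reached in a few hops from node 32
```

Each hop covers at least half of the remaining clockwise distance, which is where the O(log N) bound comes from; and even if the fingers were stale, the successor fallback still makes progress, matching the "fingers only need to get closer" observation above.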
Pastry also uses a circular number space. The difference is in how the "fingers" are created: Pastry uses prefix-match overlap rather than binary splitting, which allows more flexibility in neighbor selection. (In the figure, Route(d46a1c) starting from node 65a1fc passes through nodes such as d13da3, d4213f, d462ba, and d467c4, each hop sharing a longer prefix with the destination.) Each node keeps a routing table (the example in the slides is for node 65a1fc), and Pastry nodes also have a "leaf set" of immediate neighbors up and down the ring, similar to Chord's list of successors.

Pastry join. Let X be the new node, A the bootstrap node, and Z the node nearest to X. A finds Z for X; in the process, A, Z, and all nodes on the path send their state tables to X. X settles on its own table, possibly after contacting other nodes, and then tells everyone who needs to know about itself. The Pastry paper (18th IFIP/ACM, Nov 2001) doesn't give enough information to understand how concurrent joins work.

Pastry leave. A departure is noticed by the leaf-set neighbors when the leaving node doesn't respond; they ask the highest and lowest nodes in their leaf sets for a new leaf set. It is noticed by routing neighbors when a message forward fails; they can immediately route via another neighbor, and they fix the broken entry by asking another neighbor in the same "row" of the routing table for its entry, or, if this fails, by asking somebody a level up. For instance, if a node's 655x entry fails, it asks its other neighbors: try asking some neighbor in the same row for its 655x entry, and if none has one, try asking some neighbor in the row below, etc.

Kelips takes a different approach. The network is partitioned into √N "affinity groups"; a hash of the node ID determines which affinity group a node is in. Each node knows one or more nodes in every group and all objects and nodes in its own group, but this knowledge is soft state, spread through peer-to-peer "gossip" (epidemic multicast).

Kelips, step by step. Take a collection of "nodes" (say 110, 230, 30, 202). Map nodes to affinity groups: peer membership is computed with a consistent hash into groups 0 through √N−1, giving roughly √N members per affinity group. Node 110 knows about the other members of its own group (230, 30, ...): its affinity group view records, for each member, an id, a heartbeat count, and an RTT estimate (e.g. node 30 with heartbeat 234 and 90 ms; node 230 with heartbeat 322 and 30 ms). Node 110 also keeps a contacts table with an entry per foreign affinity group (e.g. group 2, contact node 202, meaning 202 is one of 110's contacts in group 2); in practice a node keeps k contacts per group, for some small constant k. Now, "cnn.com" maps to group 2, so 110 tells group 2 to "route" inquiries about cnn.com to it: the members of group 2 add a resource tuple (cnn.com, 110) to their soft state, and the gossip protocol replicates this data cheaply. To look up "cnn.com," just ask some contact in group 2; it returns "110" (or forwards your request). (Kelips: IP2P, ACM TOIS, submitted.)

Kelips gossip operates at a constant "background" rate, independent of the frequency of changes in the system. Average overhead may be higher than in other DHTs, but it is not bursty. If churn is too high the system performs poorly (failed lookups), but it does not collapse.
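A minimal sketch of the Kelips structure just described: nodes hashed into √N affinity groups, full knowledge within a group, and a one-hop lookup through a contact in the key's group. The group sizing, hashing choices, and names are illustrative assumptions; the real system maintains all of this as gossiped soft state with heartbeats and RTT estimates rather than building it directly.

```python
# Toy Kelips: sqrt(N) affinity groups; a node knows everyone in its own group
# plus one "contact" per foreign group, so lookups take O(1) hops.
import hashlib, math, random

def h(s, mod):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % mod

class KelipsNode:
    def __init__(self, node_id, all_nodes, num_groups):
        self.id = node_id
        self.num_groups = num_groups
        self.group = h(node_id, num_groups)
        by_group = {}
        for n in all_nodes:
            by_group.setdefault(h(n, num_groups), []).append(n)
        self.group_view = by_group[self.group]              # everyone in my own group
        self.contacts = {g: random.choice(members)          # one contact per other group
                         for g, members in by_group.items() if g != self.group}
        self.tuples = {}                                    # resource name -> home node

    def lookup(self, resource, network):
        g = h(resource, self.num_groups)
        if g == self.group:
            return self.tuples.get(resource)
        contact = self.contacts.get(g)
        if contact is None:                                 # toy-model corner case: empty group
            return None
        return network[contact].tuples.get(resource)        # one hop via the contact

nodes = [f"node{i}" for i in range(25)]
groups = int(math.sqrt(len(nodes)))                         # 5 affinity groups
network = {n: KelipsNode(n, nodes, groups) for n in nodes}

# "cnn.com" is homed at node0; every member of cnn.com's group learns the
# resource tuple (via gossip in the real system, directly here).
g = h("cnn.com", groups)
for node in network.values():
    if node.group == g:
        node.tuples["cnn.com"] = "node0"
print(network["node3"].lookup("cnn.com", network))          # 'node0', in at most one hop
```

The memory trade-off is visible in the sketch: each node tracks about √N group members plus k contacts per group, which is why Kelips is the exception to the "logarithmic memory" rule noted earlier.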
Beehive. A DHT intended to support high-performance infrastructure services with proactive caching. The focus of the "real system" is on DNS, but it has other applications. Proactive caching is a form of replication: DNS already caches, but Beehive pushes updates to improve hit rates.

Domain Name Service. DNS translates textual names to Internet addresses ("www.cnn.com" -> 1.2.3.4). It relies on a static hierarchy of servers, is prone to DoS attacks, is fragile and expensive to maintain, and is slow and sluggish. In 2002, a DDoS attack almost disabled the root name servers.

A self-organizing solution: store the mapping in a DHT. (In the figure, a lookup for the object with id 0122, www.cnn.com with IP = 1.2.3.4, is routed through nodes such as 2012, 0021, and 0112 toward its home node.) A great idea, but O(log N) hops is too slow on the Internet. The Beehive intuition: by replicating a (key, value) tuple we can shorten the worst-case search by one hop, and with a greater degree of replication the search goes even faster. The optimization problem: minimize the total number of replicas subject to the average lookup performance being at most C.

Beehive summary: a general replication framework suitable for structured DHTs; decentralization, self-organization, and resilience properties; high performance (O(1) average lookup time); scalable (it minimizes the number of replicas and reduces storage, bandwidth, and network load); and adaptive (it promptly responds to changes in popularity, such as flash crowds).

To finish up. Various applications have been designed over DHTs: file systems, DNS-like services, pub/sub systems. DHTs are elegant and promising tools, with lingering concerns about churn and security.

Reliable Distributed Systems: Scalability

Scalability. Today we'll focus on how things scale. Basically: look at a property that matters, make something "bigger" (the network, the number of groups, the number of members, the data rate), then measure the property and see the impact. Often we can "hope" that no slowdown would occur. But what really happens?

Stock exchange problem: sometimes, someone is slow. Most members are healthy, but one is slow; i.e. something is contending with that process, delaying its handling of incoming messages. With a slow receiver, throughput collapses as the system scales up. [Figure: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members versus perturb rate, for group sizes 32, 64, and 96. Throughput collapses as the perturb rate grows, and the larger the group, the worse the collapse.]

Why does this happen? Superficially, because data for the slow process piles up in the sender's buffer, causing flow control to kick in (prematurely). But why does the problem grow worse as a function of group size, with just one "red" process? Small perturbations happen all the time. (A toy model of the slow-receiver effect appears at the end of this subsection.)

Broad picture? Virtual synchrony works well under bursty loads, and it scales to fairly large systems (SWX uses a hierarchy to reach ~500 users). From what we've seen so far, this is about as good as it gets for reliability; recall that stronger reliability models like Paxos are costly and scale far worse. Desired: steady throughput under heavy load and stress.

Protocols famous for scalability: Scalable Reliable Multicast (SRM); Reliable Multicast Transport Protocol (RMTP); On-Tree Efficient Recovery using Subcasting (OTERS); several others: TMP, MFTP, MFTP/EC. But when stability is tested under stress, every one of these protocols collapses just like virtual synchrony!
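A toy model of the slow-receiver effect described above, under simplifying assumptions I am introducing here: a fixed sender window, one receiver that stalls with some probability each tick, and all other receivers keeping up. It only illustrates why one perturbed member can drag a flow-controlled group down to the perturbed member's pace; it does not reproduce the Ensemble measurements or the group-size dependence.

```python
# Toy model: the sender may run at most WINDOW messages ahead of the slowest
# receiver (flow control). One "perturbed" receiver stalls with probability p
# each tick; all other receivers always keep up. Purely illustrative numbers.
import random

def throughput(perturb_rate, ticks=10000, window=50, seed=1):
    rng = random.Random(seed)
    sent = 0
    slow_acked = 0                               # progress of the one perturbed receiver
    for _ in range(ticks):
        if rng.random() >= perturb_rate:         # perturbed receiver drains only when not stalled
            slow_acked = min(sent, slow_acked + 1)
        if sent - slow_acked < window:           # healthy receivers ack at once; buffer is bounded
            sent += 1                            # otherwise the sender is blocked
    return sent / ticks

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"perturb rate {p:.1f} -> relative throughput {throughput(p):.2f}")
```

Once the window fills, everyone in the group runs at the perturbed member's rate, which is the "prematurely triggered flow control" effect the slides describe.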
Example: Scalable Reliable Multicast (SRM). SRM originated in work on Wb and the Mbone. The idea is to do "local repair" if messages are lost; various optimizations keep load low and repair costs localized. It was wildly popular for Internet "push" and seen as the solution for Internet radio and TV, but its receiver-driven reliability model lacks "strong" reliability guarantees.

Local repair concept. A packet is lost somewhere in the multicast distribution tree. Receipt of a subsequent packet triggers a NACK for the missing packet, and some node that has a copy retransmits the repair. In practice, though, several receivers may NACK the same loss and several nodes may retransmit it, so participants receive useless NAKs and duplicate repairs.

Limitations? SRM runs in the application, not the router, so IP multicast of NACKs and retransmissions tends to reach many or all processes. Lacking knowledge of who should receive each message, SRM has no simple way to know when a message can be garbage collected at the application layer. It relies on probabilistic rules to suppress duplicates.

In practice? As the system grows large, the "probabilistic suppression" fails: more and more NAKs are sent in duplicate, and more and more duplicate data messages are sent as multiple receivers respond to the same NAK. Why does this happen?

Visualizing how SRM collapses. Think of the sender as the hub of a wheel, with messages departing in all directions. Loss can occur at many places "out there," and those places could be far apart; hence NAK suppression won't work, causing multiple NAKs, and the same reasoning explains why any one NAK is likely to trigger multiple retransmissions. Experiments have confirmed that SRM overheads soar with deployment size: every message triggers many NAKs and many retransmissions until the network finally melts down.

Dilemma confronting developers. The application is extremely critical: a stock market, air traffic control, a medical system. Hence we need a strong model and guarantees. But these applications often have a soft-realtime subsystem with steady data generation that may need to deliver over a large scale.

Today we introduce a new design point. Bimodal multicast (pbcast) is reliable in a sense that can be formalized, at least for some networks; generalization to a larger class of networks should be possible, but maybe not easy. The protocol is also very stable under steady load even if 25% of processes are perturbed, and it is scalable in much the same way as SRM.

Environment. We will assume that most links have known throughput and loss properties, and that most processes are responsive to messages in bounded time, but we can tolerate some flaky links and some crashed or slow processes.

Pbcast starts by using unreliable multicast to rapidly distribute the message. Some messages may not get through, and some processes may be faulty, so the initial state involves partial distribution of the multicast(s). Periodically (e.g. every 100 ms) each process sends a digest describing its state to some randomly selected group member; the digest identifies messages, it doesn't include them. The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip. Processes respond to solicitations received during a round of gossip by retransmitting the requested message; the round lasts much longer than a typical RPC time.
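A minimal sketch of the anti-entropy exchange just described: each process periodically sends a digest of the message IDs it holds to a random peer, the peer solicits anything it is missing, and the gossiping process retransmits it. The data structures and method names are illustrative; the real protocol also runs the initial unreliable multicast, bounds the retransmission work per round, and garbage-collects as described next.

```python
# Toy pbcast-style anti-entropy: digests identify messages (they do not carry
# them); the peer that receives a digest solicits what it is missing, and the
# gossiping process retransmits it. Names and structure are illustrative.
import random

class Process:
    def __init__(self, name):
        self.name = name
        self.msgs = {}                               # msg_id -> payload

    def digest(self):
        return set(self.msgs)                        # just the message IDs

    def gossip_round(self, peers):
        """Send my digest to one random peer; it solicits what it lacks; I retransmit."""
        peer = random.choice(peers)
        solicited = self.digest() - peer.digest()    # peer compares the digest with its history
        for msg_id in solicited:
            peer.msgs[msg_id] = self.msgs[msg_id]    # retransmission of the requested messages

procs = [Process(f"p{i}") for i in range(8)]
procs[0].msgs = {1: "trade tick 1", 2: "trade tick 2"}   # lossy initial multicast reached only p0
for _ in range(6):                                       # e.g. one gossip round every 100 ms
    for p in procs:
        p.gossip_round([q for q in procs if q is not p])
print([sorted(p.msgs) for p in procs])                   # most or all now hold messages 1 and 2
```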
Delivery? Garbage collection? Deliver a message when it is in FIFO order. Garbage collect a message when you believe that no "healthy" process could still need a copy (we used to wait 10 rounds, but now we use gossip to detect this condition). Match the parameters to the intended environment.

We need to bound costs. Worries: someone could fall behind and never catch up, endlessly loading everyone else. What if some process has lots of stuff others want and they bombard it with requests? What about scalability in buffering and in the list of members of the system, or the cost of updating that list?

Optimizations. Request retransmissions most-recent-multicast first; the idea is to "catch up quickly," leaving at most one gap in the retrieved sequence. Participants bound the amount of data they will retransmit during any given round of gossip; if too much is solicited, they ignore the excess requests. Label each gossip message with the sender's gossip round number, and ignore solicitations that carry an expired round number, reasoning that they arrived very late and hence are probably no longer correct. Don't retransmit the same message twice in a row to any given destination (the copy may still be in transit, so the request may be redundant). Use IP multicast when retransmitting a message if several processes lack a copy, for example if it was solicited twice, or if a retransmission is received from "far away"; the tradeoff is excess messages versus low latency, and a regional TTL can restrict the multicast scope.

Scalability. The protocol is scalable except for its use of the membership of the full process group: updates could be costly, and the size of the list could be costly. In large groups, we would also prefer not to gossip over long high-latency links. We can extend pbcast to solve both problems. We could use IP multicast to send the initial message (right now, we have a tree-structured alternative, but to use it we need to know the membership), and we can tell each process only about some subset k of the processes, k << N, which keeps costs constant.

Router overload problem. Random gossip can overload a central router, yet the information flowing through this router is of diminishing quality as the rate of gossip rises. Insight: a constant rate of gossip is achievable and adequate. Hierarchical gossip: weight the gossip so that the probability of gossiping to a remote cluster is smaller; the weights can be adjusted to put a constant load on the router. Propagation delays now rise, but we just increase the rate of gossip to compensate.

Remainder of the talk: show the results of a formal analysis. We developed a model (we won't do the math here; nothing very fancy) and used it to solve for the expected reliability; then we show more experimental data. The real question: what would pbcast "do" in the Internet? Our experience: it works!

Idea behind the analysis. We can use the mathematics of epidemic theory to predict the reliability of the protocol: assume an initial state, then look at the result of running B rounds of gossip. The protocol converges exponentially quickly towards atomic delivery. [Figure: pbcast bimodal delivery distribution, p{# processes that deliver = k} versus k; either the sender fails and almost no process delivers, or the data gets through to nearly all processes w.h.p.]
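A Monte Carlo sketch in the spirit of the analysis above: run a (possibly failed) initial multicast plus B rounds of lossy push gossip many times and histogram how many processes end up delivering. All parameters here (N, the loss rates, the number of rounds, the sender crash probability) are illustrative choices of mine, not the values used in the paper's model.

```python
# Monte Carlo sketch: a (possibly failed) initial multicast followed by B rounds
# of lossy push gossip. Parameters (N, loss, rounds, crash prob) are illustrative.
import random
from collections import Counter

def trial(n=50, rounds=8, link_loss=0.2, sender_crash=0.05, rng=random):
    if rng.random() < sender_crash:
        return 1                                    # sender failed before anything got out
    severity = rng.random()                         # how badly the initial multicast did
    has = [i == 0 or rng.random() > severity for i in range(n)]
    for _ in range(rounds):                         # B rounds of push gossip, fanout 1
        for i in [p for p in range(n) if has[p]]:
            j = rng.randrange(n)
            if j != i and rng.random() > link_loss: # gossip messages can be lost too
                has[j] = True
    return sum(has)

counts = Counter(trial() for _ in range(2000))
for k in sorted(counts):
    print(f"{k:3d} delivered in {counts[k]:4d} runs")
# The histogram is bimodal: a spike at k = 1 (sender failure) and a cluster near
# k = 50; intermediate outcomes are rare because gossip converges exponentially.
```

This is the "bimodal" shape the protocol is named for: tuning the fanout and the number of rounds squeezes the probability mass out of the middle of the distribution, which is what the failure analysis below quantifies.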
Failure analysis. Suppose someone tells me what they hope to "avoid." Model it as a predicate on the final system state; we can then compute the probability that pbcast would terminate in that state, again from the model.

Two predicates. Predicate I: a faulty outcome is one where more than 10% but less than 90% of the processes get the multicast. Think of a probabilistic Byzantine Generals problem: a disaster if many, but not most, troops attack. Predicate II: a faulty outcome is one where roughly half get the multicast and crash failures might "conceal" the true outcome. This would make sense if pbcast were used to distribute quorum-style updates to replicated data; the costly, hence undesired, outcome is the one where we need to roll back because the outcome is "uncertain." It is easy to add your own predicate: the methodology supports any predicate over the final system state.

[Figure 5: graphs of analytical results. (a) Scalability of pbcast reliability: P{failure} versus the number of processes in the system, for Predicates I and II. (b) Effects of fanout on reliability: P{failure} versus fanout. (c) Fanout required for a specified reliability (1E-8 for Predicate I, 1E-12 for Predicate II) versus the number of processes in the system. (d) The pbcast bimodal delivery distribution, p{# processes = k} versus the number of processes to deliver pbcast.]

Discussion. We see that pbcast is indeed bimodal even in the worst case, when the initial multicast fails. We can easily tune parameters to obtain desired guarantees of reliability; the protocol is suitable for use in applications where a bounded risk of an undesired outcome is sufficient. The model makes assumptions, and these are rather simplistic, yet the model seems to predict behavior in real networks anyhow. In effect, the protocol is not merely robust to process perturbation and message loss, but also to perturbation of the model itself; we speculate that this is due to the incredible power of exponential convergence.
Experimental work. The SP2 is a large network: nodes are basically UNIX workstations, the interconnect is basically an ATM network, and the software is a standard Internet stack (TCP, UDP). We obtained access to as many as 128 nodes on the Cornell SP2 in the Theory Center, ran pbcast on it, and also ran a second implementation on a real network.

Example of a question. Create a group of 8 members and perturb one member in the style of Figure 1. Now look at the "stability" of throughput: measure the rate of received messages during periods of 100 ms each and plot a histogram over the life of the experiment. Then revisit Figure 1 in detail: take 8 machines, perturb 1, pump data in at varying rates, and look at the rate of received messages. Finally, revisit our original scenario with perturbations (32 processes).

[Figures: high-bandwidth and low-bandwidth comparisons of pbcast performance at faulty and correct hosts; average throughput versus perturb rate for the traditional (virtually synchronous) protocol and for pbcast, measured at perturbed and at unperturbed hosts, with one perturbed member. Pbcast's throughput stays steady across perturb rates, while the traditional protocol's throughput falls off.]

Discussion. We saw that the stability of the protocol is exceptional even under heavy perturbation; overhead is low and stays low with system size, bounded even for heavy perturbation; and throughput is extremely steady. In contrast, virtual synchrony and SRM are both fragile under this sort of attack.

Programming with pbcast? Most often we would want to split the application into multiple subsystems: use pbcast for subsystems that generate a regular flow of data and can tolerate infrequent loss if the risk is bounded, and use stronger properties for subsystems with less load that need high availability and consistency at all times. In a stock exchange, use pbcast for pricing but abcast for "control" operations; in a hospital, use pbcast for telemetry data but abcast when changing medication; in an air traffic system, use pbcast for routine radar track updates but abcast when a pilot registers a flight plan change.

Our vision: one protocol side-by-side with the other. Use virtual synchrony for replicated data and control actions, where strong guarantees are needed for safety; use pbcast for high data rates and steady flows of information, where longer-term properties are critical but an individual multicast is of less critical importance.

Summary. A new data point in a familiar spectrum: virtual synchrony, bimodal probabilistic multicast, scalable reliable multicast. We demonstrated that pbcast is suitable for analytic work and saw that it has exceptional stability.