Navigating in the Dark: New Options for Building Self-Configuring Embedded Systems
Ken Birman, Cornell University

A sea change…
We're looking at a massive change in the way we use computers.
Today, we're still very client-server oriented (Web Services will continue this).
Tomorrow, many important applications will use vast numbers of small sensors.
And even standard wired systems should sometimes be treated as sensor networks.

Characteristics?
Large numbers of components and a substantial rate of churn.
… failure is common in any large deployment, small nodes are fragile (flawed assumption?), and connectivity may be poor.
Such systems need to self-configure, for obvious, practical reasons.

Can we map this problem to the previous one?
Not clear how we could do so.
Sensors can often capture lots of data (think: Foveon 6Mb optical chip, 20 fps)…
… and may even be able to process that data on chip.
But they rarely have the capacity to ship the data to a server (power, signal limits).

This spells trouble!
The way we normally build distributed systems is mismatched to the need.
The clients of a Web Services or similar system are second-tier citizens.
And they operate in the dark: about one another, and about network/system "state".
Building sensor networks this way won't work.

Is there an alternative?
We see hope in what are called peer-to-peer and "epidemic" communication protocols!
Inspired by work on (illegal) file sharing, but we'll aim at other kinds of sharing.
Goals: scalability, stability despite churn, low loads/power consumption.
We must overcome the tendency of many P2P technologies to be disabled by churn.

Astrolabe
Intended as help for applications adrift in a sea of information.
Structure emerges from a randomized peer-to-peer protocol.
This approach is robust and scalable even under extreme stress that cripples more traditional approaches.
Developed at Cornell by Robbert van Renesse, with many others helping…
Just an example of the kind of solution we need.

Astrolabe
Astrolabe builds a hierarchy using a P2P protocol that "assembles the puzzle" without any servers.
Dynamically changing query output is visible system-wide; an SQL query "summarizes" each region's data into one row:

  Name   Avg Load   WL contact    SMTP contact
  SF     2.6        123.45.61.3   123.45.61.17
  NJ     1.8        127.16.77.6   127.16.77.11
  Paris  3.1        14.66.71.8    14.66.71.12

San Francisco region:
  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

New Jersey region:
  Name      Load  Weblogic?  SMTP?  Word Version
  gazelle   1.7   0          0      4.5
  zebra     3.2   0          1      6.2
  gnu       .5    1          0      6.2

Astrolabe in a single domain
Each node owns a single tuple, like a management information base (MIB).
Nodes discover one another through a simple broadcast scheme ("anyone out there?") and gossip about membership.
Nodes also keep replicas of one another's rows.
Periodically (uniformly at random), each node merges its state with someone else's…

State Merge: Core of the Astrolabe epidemic
Each row carries a timestamp (the Time column); when two nodes gossip, fresher rows replace staler ones.

Before the exchange, swift.cs.cornell.edu holds:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2004  4.5   1          0      6.0

and cardinal.cs.cornell.edu holds:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2003  .67   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

After they gossip, swift adopts cardinal's fresher cardinal row (time 2201, load 3.5), and cardinal adopts swift's fresher swift row (time 2011, load 2.0).
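To make the merge step concrete, here is a minimal Python sketch of a timestamp-based row merge, assuming each replica is simply a dict from node name to row. It illustrates the idea above; it is not the actual Astrolabe implementation, whose gossip exchange is more selective than a full-table merge.

```python
# Minimal sketch of the timestamp-based state merge, assuming each replica is a
# dict mapping node name -> row and each row carries a "time" field.
# Illustrative only; the real Astrolabe gossip exchange differs in detail.

def merge_replicas(mine, theirs):
    """For every node, keep whichever copy of the row has the fresher timestamp."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["time"] > merged[name]["time"]:
            merged[name] = row
    return merged

swift_view = {
    "swift":    {"time": 2011, "load": 2.0,  "weblogic": 0, "smtp": 1, "word": 6.2},
    "falcon":   {"time": 1971, "load": 1.5,  "weblogic": 1, "smtp": 0, "word": 4.1},
    "cardinal": {"time": 2004, "load": 4.5,  "weblogic": 1, "smtp": 0, "word": 6.0},
}
cardinal_view = {
    "swift":    {"time": 2003, "load": 0.67, "weblogic": 0, "smtp": 1, "word": 6.2},
    "falcon":   {"time": 1976, "load": 2.7,  "weblogic": 1, "smtp": 0, "word": 4.1},
    "cardinal": {"time": 2201, "load": 3.5,  "weblogic": 1, "smtp": 1, "word": 6.0},
}

# One gossip exchange: swift picks up cardinal's fresher row (time 2201);
# cardinal would symmetrically pick up swift's own row stamped 2011.
swift_view = merge_replicas(swift_view, cardinal_view)
print(swift_view["cardinal"]["load"])   # -> 3.5
```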
Observations
The merge protocol has constant cost: one message sent and one received (on average) per unit time.
The data changes slowly, so there is no need to run it quickly – we usually run it every five seconds or so.
Information spreads in O(log N) time, but this assumes bounded region size.
In Astrolabe, we limit regions to 50-100 rows, so a big system will have many regions.
Astrolabe is usually configured by a manager who places each node in some region, but we are also playing with ways to discover structure automatically.

A big system could have many regions
It looks like a pile of spreadsheets.
A node only replicates data from its neighbors within its own region.

Scaling up… and up…
With a stack of domains, we don't want every system to "see" every domain: the cost would be huge.
So instead, each system will see a summary.
[Animation: the cardinal.cs.cornell.edu leaf table (swift, falcon, cardinal rows) is built up step by step; detail omitted.]
Recall the earlier hierarchy slide: Astrolabe builds the hierarchy with a P2P protocol, without servers, and the dynamically changing query output (the SF/NJ/Paris summary table) is visible system-wide.

Large scale: "fake" regions
These are computed by queries that summarize a whole region as a single row.
They are gossiped in a read-only manner within a leaf region.
But who runs the gossip? Each region elects "k" members to run gossip at the next level up.
We can play with the selection criteria and with "k".

Hierarchy is virtual… data is replicated
  Name   Avg Load   WL contact    SMTP contact
  SF     2.6        123.45.61.3   123.45.61.17
  NJ     1.8        127.16.77.6   127.16.77.11
  Paris  3.1        14.66.71.8    14.66.71.12

San Francisco region:
  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

New Jersey region:
  Name      Load  Weblogic?  SMTP?  Word Version
  gazelle   1.7   0          0      4.5
  zebra     3.2   0          1      6.2
  gnu       .5    1          0      6.2
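A minimal sketch of how such a summary query might behave is shown below. Field names, the spare contact address, and the choice of contacts are illustrative assumptions; the real Astrolabe aggregation is written in SQL and evaluated by its own engine.

```python
# Sketch of an aggregation query that condenses a leaf region into the single
# summary row gossiped at the next level up. Field names and addresses are
# illustrative assumptions, not Astrolabe's actual API.

def summarize_region(region_name, rows):
    """Roughly: SELECT AVG(load), plus one Weblogic contact and one SMTP contact."""
    avg_load = sum(r["load"] for r in rows) / len(rows)
    wl_contact = next((r["addr"] for r in rows if r["weblogic"]), None)
    smtp_contact = next((r["addr"] for r in rows if r["smtp"]), None)
    return {"name": region_name, "avg_load": round(avg_load, 1),
            "wl_contact": wl_contact, "smtp_contact": smtp_contact}

sf_rows = [
    {"name": "swift",    "addr": "123.45.61.17", "load": 2.0, "weblogic": 0, "smtp": 1},
    {"name": "falcon",   "addr": "123.45.61.3",  "load": 1.5, "weblogic": 1, "smtp": 0},
    {"name": "cardinal", "addr": "123.45.61.9",  "load": 4.5, "weblogic": 1, "smtp": 0},
]
print(summarize_region("SF", sf_rows))
# -> {'name': 'SF', 'avg_load': 2.7, 'wl_contact': '123.45.61.3', 'smtp_contact': '123.45.61.17'}
```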
Worst case load?
A small number of nodes end up participating in O(log_fanout N) epidemics; here the fanout is something like 50.
In each epidemic, a message is sent and received roughly every 5 seconds.
We limit message size so that, even during periods of turbulence, no message can become huge; instead, data would just propagate more slowly.
We haven't really looked hard at this case. (A back-of-the-envelope sketch appears with the extra slides below.)

Astrolabe is a good fit
No central server; the hierarchical abstraction "emerges".
Moreover, this abstraction is very robust: it scales well, localized disruptions won't destabilize the system, and it is consistent in the eyes of varied beholders.
Each individual participant runs a trivial P2P protocol.
Supports distributed data aggregation and data mining; adaptive and self-repairing…

Data Mining
We can data mine using Astrolabe.
The "configuration" and "aggregation" queries can be dynamically adjusted.
So we can use the former to query sensors in customizable ways (in parallel)…
… and then use the latter to extract an efficient summary that won't cost an arm and a leg to transmit.

Costs are basically constant!
Unlike many systems that experience load surges and other kinds of load variability, Astrolabe is stable under all conditions.
Stress doesn't provoke surges of message traffic, and Astrolabe remains active even while those disruptions are happening.

Other such abstractions
Scalable, probabilistically reliable multicast based on peer-to-peer epidemics: Bimodal Multicast.
Some of the work on P2P indexing structures and file storage: Kelips.
Overlay networks for end-to-end IP-style multicast: Willow.

Challenges
We need to do more work on:
Real-time issues: right now Astrolabe is highly predictable but somewhat slow.
Security (including protection against malfunctioning components).
Scheduled downtime (sensors do this quite often today; maybe less of an issue in the future).
Communication locality: important in sensor networks, where messages to distant machines need costly relaying.
Astrolabe does most of its communication with neighbors, close to Kleinberg's small-worlds structure for remote gossip.

Conclusions?
We're near a breakthrough: sensor networks that behave like sentient infrastructure.
They sense their own state and adapt; they self-configure and self-repair.
Incredibly exciting opportunities lie ahead.
Cornell has focused on probabilistically scalable technologies and built real systems while also exploring theoretical analyses.

Extra slides
Just to respond to questions.
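Extra: a back-of-the-envelope check of the "Worst case load?" estimate above, under the assumptions stated there (fanout of roughly 50, one gossip round about every 5 seconds); the helper names and node counts below are just for illustration.

```python
# Rough arithmetic for the worst-case gossip load sketched earlier: a node
# elected as representative at every level of the hierarchy participates in
# one epidemic per level, i.e. about log_fanout(N) epidemics.
import math

def worst_case_epidemics(n_nodes, fanout=50):
    """Number of hierarchy levels = epidemics a maximally loaded node joins."""
    return max(1, math.ceil(math.log(n_nodes, fanout)))

def messages_per_second(n_nodes, fanout=50, round_seconds=5.0):
    """One message sent and one received per epidemic per round, on average."""
    return 2 * worst_case_epidemics(n_nodes, fanout) / round_seconds

print(worst_case_epidemics(100_000))    # -> 3 levels for 100,000 nodes at fanout 50
print(messages_per_second(100_000))     # -> 1.2 messages/second even for that node
```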
Bimodal Multicast
A technology we developed several years ago.
Our goal was to get better scalability without abandoning reliability.

Multicast historical timeline
1980s: IP multicast, anycast, and other best-effort models (Cheriton: the V system).
IP multicast became a standard Internet protocol; anycast never really made it.

Multicast historical timeline
1987-1993: Virtually synchronous multicast takes off (Isis, Horus, Ensemble, but also many other systems, like Transis, Totem, etc.).
Used in many settings today, but no single system "won".
The Isis Toolkit was used by the New York Stock Exchange, the Swiss Exchange, the French air traffic control system, and AEGIS radar control and communications.

Multicast historical timeline
1995-2000: Scalability issues prompt a new generation of scalable solutions (SRM, RMTP, etc.).
Cornell's contribution was Bimodal Multicast, aka "pbcast".

Virtual Synchrony Model
[Diagram: processes p, q, r, s, t and a sequence of group views. G0={p,q}; r and s request to join and are added with state transfer, giving G1={p,q,r,s}; p crashes, giving G2={q,r,s}; t requests to join and is added with state transfer, giving G3={q,r,s,t}.]
… to date, the only widely adopted model for consistency and fault-tolerance in highly available networked applications.

Virtual Synchrony scaling issue
[Figure: virtually synchronous Ensemble multicast – average throughput on non-perturbed members vs. perturb rate (0 to 0.9), for group sizes 32, 64, and 96.]

Bimodal Multicast
Uses some sort of best-effort dissemination protocol to get the message "seeded": e.g. IP multicast, or our own tree-based scheme running on TCP but willing to drop packets if congestion occurs.
But some nodes log messages; we use a DHT scheme.
Detect a missing message? Recover it from a log server that should have it…

Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So the initial state involves partial distribution of the multicast(s).
Periodically (e.g. every 100 ms), each process sends a digest describing its state to some randomly selected group member. The digest identifies messages; it doesn't include them.
The recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip.
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.

Scalability of pbcast reliability
[Figure 5: graphs of analytical results – the bimodal delivery distribution (probability that k processes deliver a pbcast), P{failure} vs. number of processes in the system for Predicates I and II, the effect of fanout on reliability, and the fanout required for a specified reliability (Predicate I at 1E-8, Predicate II at 1E-12).]

Our original degradation scenario
[Figure: throughput measured at an unperturbed process – high-bandwidth measurements with varying numbers of "sleepers", comparing a traditional protocol and pbcast with 1, 3, and 5 sleepers as the probability of a sleep event varies.]
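The pbcast digest/solicit/retransmit round described above can be sketched as follows. The in-memory "network", loss rate, message ids, and round count are assumptions for illustration; this is not the real Bimodal Multicast implementation.

```python
# Minimal sketch of the pbcast anti-entropy round: seed with lossy multicast,
# then gossip digests and retransmit solicited messages. Illustrative only.
import random

class Process:
    def __init__(self, name):
        self.name = name
        self.messages = {}          # msg_id -> payload (partial after the seed phase)

    def digest(self):
        """A digest identifies messages but does not include them."""
        return set(self.messages)

    def gossip_with(self, peer):
        """One round: send a digest; the peer solicits what it lacks; we retransmit."""
        missing = self.digest() - peer.digest()
        for msg_id in missing:
            peer.messages[msg_id] = self.messages[msg_id]

# Seed phase: an unreliable multicast leaves each process with a random subset.
group = [Process(f"p{i}") for i in range(8)]
for msg_id in range(20):
    for p in group:
        if random.random() < 0.7:   # assume ~30% loss during the best-effort seed
            p.messages[msg_id] = f"payload-{msg_id}"

# Anti-entropy: in each round (~100 ms), every process gossips with one random peer.
for round_no in range(10):
    for p in group:
        p.gossip_with(random.choice([q for q in group if q is not p]))

print(min(len(p.messages) for p in group))  # converges toward 20 with high probability
```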
Distributed Indexing
The goal is to find a copy of Norah Jones' "I don't know".
The index contains, for example, (machine-name, object) pairs, with operations to search and update.

History of the problem
Very old: the Internet DNS does this, but DNS lookup is by machine name; we want the inverted map.
Napster was the first really big hit: 5 million users at one time.
The index itself was centralized; peer-to-peer file copying was used once a copy was found (many issues arose…).
It is a hot academic topic today. Many systems are based on a virtual ring and use Plaxton-style radix search:
MIT: the Chord system (Karger, Kaashoek).
Rice: Pastry (Druschel, Rowstron).
Berkeley: Tapestry.
Cornell: Kelips, a scheme that uses replication.

Kelips idea?
Treat the system as sqrt(N) "affinity" groups of size sqrt(N) each.
Any given index entry is mapped to a group and replicated within it: O(log N) time delay to do an update, which we could accelerate with an unreliable multicast.
To do a lookup, find a group member (or a few of them) and ask for the item: O(1) lookup.
(A small sketch of this mapping appears at the end of the deck.)

Why Kelips?
Other schemes have O(log N) lookup delay, which is quite a high cost in practical settings.
Others also have fragile data structures: background reorganization costs soar under stress, churn, and flash loads.
Kelips has a completely constant load!

Solutions that share properties
Scalable; robust against localized disruption.
They have emergent behavior we can reason about and exploit in the application layer.
Think of the way a hive of insects organizes itself or reacts to stimuli; there are many similarities.

Revisit our goals
Are these potential components for sentient systems?
Middleware that perceives the state of the network and represents this knowledge in a form smart applications can exploit.
Although built from large numbers of rather dumb components, the emergent behavior is intelligent.
These applications are more robust, more secure, and more responsive than any individual component.
When something unexpected occurs, they can diagnose the problem and trigger a coordinated distributed response.
They repair themselves after damage.
We seem to have the basis from which to work!

Brings us full circle
Our goal should be a new form of very stable "sentient middleware". Have we accomplished this goal?
Probabilistically reliable, scalable primitives solve many problems and are now gaining much attention from industry and the academic research community.
The fundamental issue is skepticism about peer-to-peer as a computing model.

Conclusions?
We're at the verge of a breakthrough – networks that behave like sentient infrastructure on behalf of smart applications.
Incredibly exciting opportunities if we can build these.
Cornell's angle has focused on probabilistically scalable technologies and has tried to mix real systems and experimental work with stochastic analyses.
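Appendix sketch: the affinity-group mapping behind Kelips, as referenced above. The hash choice, table layout, and object names are simplifying assumptions, not the actual Kelips protocol, which replicates entries within a group via gossip.

```python
# Toy sketch of the Kelips affinity-group idea: ~sqrt(N) groups, index entries
# replicated within their group, O(1) lookup by asking a group contact.
import hashlib
import math

def group_of(key, num_groups):
    """Map a key (object name or node id) to one of ~sqrt(N) affinity groups."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % num_groups

N = 100                                    # nodes in the system (assumed)
num_groups = max(1, round(math.sqrt(N)))   # ~sqrt(N) groups of ~sqrt(N) nodes

# Every member of an object's affinity group replicates its index entry;
# here one shared dict per group stands in for those replicas.
index = {g: {} for g in range(num_groups)}   # group -> {object: machine}

def insert(obj, machine):
    index[group_of(obj, num_groups)][obj] = machine

def lookup(obj):
    """O(1): ask any contact in the object's affinity group."""
    return index[group_of(obj, num_groups)].get(obj)

insert("I dont know.mp3", "swift.cs.cornell.edu")   # hypothetical object/host
print(lookup("I dont know.mp3"))                    # -> swift.cs.cornell.edu
```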