Navigating in the Dark: New Options for Building Self-Configuring Embedded Systems

Ken Birman
Cornell University
A sea change…

We’re looking at a massive change in
the way we use computers



Today, we’re still very client-server oriented
(Web Services will continue this)
Tomorrow, many important applications will
use vast numbers of small sensors
And even standard wired systems should
sometimes be treated as sensor networks
Characteristics?

Large numbers of components,
substantial rate of churn


… failure is common in any large
deployment, small nodes are fragile
(flawed assumption?) and connectivity may
be poor
Need to self-configure

For obvious, practical reasons
Can we map this problem to
the previous one?




Not clear how we could do so
Sensors can often capture lots of data
(think: Foveon 6Mb optical chip, 20 fps)
… and may even be able to process that
data on chip
But rarely have the capacity to ship the
data to a server (power, signal limits)
This spells trouble!

The way we normally build distributed
systems is mismatched to the need!


The clients of a Web Services or similar
system are second-tier citizens
And they operate in the dark



About one-another
And about network/system “state”
Building sensor networks this way won’t
work
Is there an alternative?

We see hope in what are called peer-to-peer
and “epidemic” communications protocols!




Inspired by work on (illegal) file sharing
But we’ll aim at other kinds of sharing
Goals: scalability, stability despite churn, low
loads/power consumption
Must overcome the tendency of many P2P
technologies to be disabled by churn
Astrolabe

Intended as help for
applications adrift in a
sea of information

Structure emerges
from a randomized
peer-to-peer protocol

This approach is robust
and scalable even
under extreme stress
that cripples more
traditional approaches
Developed at Cornell

By Robbert van
Renesse, with many
others helping…

Just an example of the
kind of solutions we
need
Astrolabe
Astrolabe builds a hierarchy using a P2P
protocol that “assembles the puzzle” without
any servers
Dynamically changing
query output is visible
system-wide
An SQL query “summarizes” each region’s data into one row of the top-level table:

  Name   Avg Load  WL contact    SMTP contact
  SF     2.6       123.45.61.3   123.45.61.17
  NJ     1.8       127.16.77.6   127.16.77.11
  Paris  3.1       14.66.71.8    14.66.71.12

San Francisco region:

  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

New Jersey region:

  Name     Load  Weblogic?  SMTP?  Word Version
  gazelle  1.7   0          0      4.5
  zebra    3.2   0          1      6.2
  gnu      .5    1          0      6.2
Astrolabe in a single domain


Each node owns a single tuple, like the
management information base (MIB)
Nodes discover one-another through a
simple broadcast scheme (“anyone out
there?”) and gossip about membership


Nodes also keep replicas of one-another’s
rows
Periodically (uniformly at random) merge
your state with someone else…
State Merge: Core of Astrolabe epidemic
Before gossiping, the two nodes hold slightly different replicas:

swift.cs.cornell.edu:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2004  4.5   1          0      6.0

cardinal.cs.cornell.edu:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2003  .67   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0
State Merge: Core of Astrolabe epidemic

During the gossip exchange, swift’s fresher row (time 2011, load 2.0) flows to
cardinal, and cardinal’s fresher row (time 2201, load 3.5) flows to swift. Each
node keeps whichever copy of a row carries the newer timestamp, so afterwards:

swift.cs.cornell.edu:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2201  3.5   1          0      6.0

cardinal.cs.cornell.edu:

  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0
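The merge rule in the exchange above can be sketched in a few lines of Python. This is a minimal illustration, assuming each replica is a dict keyed by node name and each row carries a logical timestamp in a "time" field; the names merge_states and gossip_round are invented for the sketch and are not Astrolabe's actual API.

```python
import random

def merge_states(mine: dict, theirs: dict) -> dict:
    """Epidemic state merge: for each row, keep the copy with the newer timestamp.

    A state maps a node name to a row dict such as
    {"time": 2011, "load": 2.0, "weblogic": 0, "smtp": 1, "word_version": 6.2}.
    """
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["time"] > merged[name]["time"]:
            merged[name] = row
    return merged

def gossip_round(replicas: dict) -> None:
    """One round: every node merges state with one peer chosen uniformly at random."""
    nodes = list(replicas)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = merge_states(replicas[node], replicas[peer])
        replicas[node] = merged        # both sides end up with the merged view,
        replicas[peer] = dict(merged)  # mirroring the swift/cardinal exchange above
```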
Observations

Merge protocol has constant cost




One message sent, received (on avg) per
unit time.
The data changes slowly, so no need to
run it quickly – we usually run it every five
seconds or so
Information spreads in O(log N) time
But this assumes bounded region size

In Astrolabe, we limit them to 50-100 rows
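A rough way to see the O(log N) spreading claim is to simulate push gossip in which every informed node contacts one random peer per round. This is a simplified model, not Astrolabe's exact protocol, and the sizes below are chosen only for illustration.

```python
import math
import random

def rounds_to_spread(n: int) -> int:
    """Rounds until a rumor started at node 0 reaches all n nodes, when every
    informed node pushes to one uniformly random peer per round."""
    informed = {0}
    rounds = 0
    while len(informed) < n:
        for node in list(informed):
            informed.add(random.randrange(n))
        rounds += 1
    return rounds

for n in (100, 1_000, 10_000):
    print(f"n={n}: {rounds_to_spread(n)} rounds (log2 n = {math.log2(n):.1f})")
```

The observed round counts grow roughly with log2(n), which is why the five-second gossip period is usually fast enough for slowly changing data.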
Big system will have many
regions


Astrolabe usually configured by a
manager who places each node in some
region, but we are also playing with
ways to discover structure automatically
A big system could have many regions


Looks like a pile of spreadsheets
A node only replicates data from its
neighbors within its own region
Scaling up… and up…

With a stack of domains, we don’t want
every system to “see” every domain


Cost would be huge
So instead, we’ll see a summary
[Figure: a stack of per-region tables (rows for swift, falcon and cardinal with
Time, Load, Weblogic?, SMTP? and Word Version columns), illustrating that no
node sees every domain’s table — each sees its own region plus summaries.]
Large scale: “fake” regions

These are



Computed by queries that summarize a
whole region as a single row
Gossiped in a read-only manner within a
leaf region
But who runs the gossip?


Each region elects “k” members to run
gossip at the next level up.
Can play with selection criteria and “k”
Hierarchy is virtual… data is replicated
[Figure: the same three-level picture as before — the San Francisco and New
Jersey leaf tables plus the summary table (SF, NJ, Paris) — emphasizing that
the inner nodes of the hierarchy are virtual: they exist only as replicated
aggregates, not as servers.]
Worst case load?

A small number of nodes end up participating
in O(log_fanout N) epidemics



Here the fanout is something like 50
In each epidemic, a message is sent and received
roughly every 5 seconds
We limit message size so even during periods
of turbulence, no message can become huge.


Instead, data would just propagate slowly
Haven’t really looked hard at this case
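As a worked example of the load claim (the deployment size below is assumed purely for illustration): with a fanout of about 50, a million-node hierarchy is only log_50(10^6) ≈ 3.5 levels deep, so even a node elected to gossip at every level handles just a few messages per five-second period.

```python
import math

fanout = 50        # region size assumed above (regions limited to 50-100 rows)
n = 1_000_000      # hypothetical deployment size, for illustration only

levels = math.log(n, fanout)                 # depth of the Astrolabe hierarchy
print(f"hierarchy depth = {levels:.1f} levels")
# One message sent and one received per epidemic, roughly every 5 seconds:
print(f"worst case: about {2 * math.ceil(levels)} messages per gossip period")
```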
Astrolabe is a good fit


No central server
Hierarchical abstraction “emerges”




Moreover, this abstraction is very robust
It scales well… disruptions won’t disrupt the
system … consistent in eyes of varied beholders
Individual participant runs trivial p2p protocol
Supports distributed data aggregation, data
mining. Adaptive and self-repairing…
Data Mining

We can data mine using Astrolabe



The “configuration” and “aggregation”
queries can be dynamically adjusted
So we can use the former to query sensors
in customizable ways (in parallel)
… and then use the latter to extract an
efficient summary that won’t cost an arm
and a leg to transmit
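A sketch of what such an aggregation might look like, assuming rows shaped like the earlier example tables. The summarize_region helper is illustrative only; Astrolabe's actual aggregation queries are written in SQL and installed dynamically.

```python
# Hypothetical region snapshot, using the column names from the earlier tables.
san_francisco = [
    {"name": "swift",    "load": 2.0, "weblogic": 0, "smtp": 1},
    {"name": "falcon",   "load": 1.5, "weblogic": 1, "smtp": 0},
    {"name": "cardinal", "load": 4.5, "weblogic": 1, "smtp": 0},
]

def summarize_region(rows):
    """Aggregation in miniature: collapse a whole region into one row
    (average load plus one contact per service), roughly what an SQL
    'SELECT AVG(load), ...' over the region table would produce."""
    wl = next((r["name"] for r in rows if r["weblogic"]), None)
    smtp = next((r["name"] for r in rows if r["smtp"]), None)
    return {
        "avg_load": sum(r["load"] for r in rows) / len(rows),
        "wl_contact": wl,
        "smtp_contact": smtp,
    }

print(summarize_region(san_francisco))   # one summary row, cheap to gossip upward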
Costs are basically constant!



Unlike many systems that experience
load surges and other kinds of load
variability, Astrolabe is stable under all
conditions
Stress doesn’t provoke surges of
message traffic
And Astrolabe remains active even
while those disruptions are happening
Other such abstractions



Scalable probabilistically reliable
multicast based on P2P (peer-to-peer)
epidemics: Bimodal Multicast
Some of the work on P2P indexing
structures and file storage: Kelips
Overlay networks for end-to-end IP-style multicast: Willow
Challenges

We need to do more work on



Real-time issues: right now Astrolabe is
highly predictable but somewhat slow
Security (including protection against
malfunctioning components)
Scheduled downtime (sensors do this quite
often today; maybe less an issue in the
future)
Communication locality



Important in sensor networks, where
messages to distant machines need
costly relaying
Astrolabe does most of its
communication with neighbors
Close to Kleinberg’s small worlds
structure for remote gossip
Conclusions?

We’re near a breakthrough: sensor networks
that behave like sentient infrastructure




They sense their own state and adapt
Self-configure and self-repair
Incredibly exciting opportunities ahead
Cornell has focused on probabilistically
scalable technologies and built real systems
while also exploring theoretical analyses
Extra slides

Just to respond to questions
Bimodal Multicast


A technology we developed several
years ago
Our goal was to get better scalability
without abandoning reliability
Multicast historical timeline
TIME
1980’s: IP multicast, anycast,
other best-effort models



Cheriton: V system
IP multicast is a
standard Internet
protocol
Anycast never really
made it
Multicast historical timeline
TIME
1980’s: IP multicast, anycast,
other best-effort models
1987-1993: Virtually synchronous multicast takes off
(Isis, Horus, Ensemble but also many other systems,
like Transis, Totem, etc). Used in many settings
today but no single system “won”

Isis Toolkit was used by



New York Stock Exchange, Swiss Exchange
French Air Traffic Control System
AEGIS radar control and communications
Multicast historical timeline
TIME
1980’s: IP multicast, anycast,
other best-effort models
1987-1993: Virtually synchronous multicast takes off
(Isis, Horus, Ensemble but also many other systems,
like Transis, Totem, etc). Used in many settings
today but no single system “won”
1995-2000: Scalability issues prompt a new generation
of scalable solutions (SRM, RMTP, etc). Cornell’s
contribution was Bimodal Multicast, aka “pbcast”
Virtual Synchrony Model
[Figure: timeline of processes p, q, r, s, t. Group views advance as
G0={p,q} → G1={p,q,r,s} → G2={q,r,s} → G3={q,r,s,t}, driven by the events:
r, s request to join; r, s added, state xfer; p fails (crash); t requests to
join; t added, state xfer.]
... to date, the only widely adopted model for consistency and
fault-tolerance in highly available networked applications
Virtual Synchrony scaling issue
[Chart: “Virtually synchronous Ensemble multicast protocols: average
throughput on nonperturbed members” (0-250) versus perturb rate (0-0.9),
for group sizes 32, 64 and 96.]
Bimodal Multicast

Uses some sort of best effort dissemination
protocol to get the message “seeded”


E.g. IP multicast, or our own tree-based scheme
running on TCP but willing to drop packets if
congestion occurs
But some nodes log messages


We use a DHT scheme
Detect a missing message? Recover from a log
server that should have it…
1. Start by using unreliable multicast to rapidly distribute the message. But
   some messages may not get through, and some processes may be faulty, so the
   initial state involves partial distribution of the multicast(s).
2. Periodically (e.g. every 100 ms) each process sends a digest describing its
   state to some randomly selected group member. The digest identifies
   messages; it doesn’t include them.
3. The recipient checks the gossip digest against its own history and solicits
   a copy of any missing message from the process that sent the gossip.
4. Processes respond to solicitations received during a round of gossip by
   retransmitting the requested message. The round lasts much longer than a
   typical RPC time.
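The digest/solicit/retransmit loop above can be sketched as follows, under simplifying assumptions: an in-memory “network”, one gossip partner per round, and a symmetric exchange that collapses solicitation and retransmission into a single step. The class and method names are invented for the sketch and are not the pbcast implementation.

```python
import random

class PbcastNode:
    def __init__(self, name):
        self.name = name
        self.messages = {}            # msg_id -> payload (this node's message log)

    def digest(self):
        """Summary of state: message ids only, not the messages themselves."""
        return set(self.messages)

    def gossip(self, peer: "PbcastNode"):
        """One round: compare digests, then fetch/retransmit whatever is missing."""
        missing_here = peer.digest() - self.digest()
        missing_there = self.digest() - peer.digest()
        for msg_id in missing_here:   # solicit copies this node lacks
            self.messages[msg_id] = peer.messages[msg_id]
        for msg_id in missing_there:  # retransmit what the peer lacks
            peer.messages[msg_id] = self.messages[msg_id]

# Unreliable "seeding": each message initially reaches only ~80% of the group.
nodes = [PbcastNode(i) for i in range(50)]
for msg_id in range(10):
    for node in nodes:
        if random.random() < 0.8:
            node.messages[msg_id] = f"payload-{msg_id}"

# A few gossip rounds repair the gaps with high probability.
for _ in range(5):
    for node in nodes:
        node.gossip(random.choice([n for n in nodes if n is not node]))

print(min(len(n.messages) for n in nodes), "of 10 messages at the worst-off node")
```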
Scalability of Pbcast reliability
[Figure 5: graphs of analytical results. Panels: “Pbcast bimodal delivery
distribution” (probability that a given number of processes deliver the
pbcast); P{failure} versus #processes in system, for Predicate I and
Predicate II; “Effects of fanout on reliability” (P{failure} versus fanout);
“Fanout required for a specified reliability” (fanout versus #processes in
system, Predicate I at 1E-8 reliability, Predicate II at 1E-12 reliability).]
Our original degradation scenario
Throughput measured at unperturbed process.
[Chart: “High Bandwidth measurements with varying numbers of sleepers” —
throughput (0-200) versus probability of sleep event (0.05-0.95), comparing
Traditional and Pbcast with 1, 3 and 5 sleepers.]
Distributed Indexing


Goal is to find a copy of Norah Jones’ “I
don’t know”
Index contains, for example


(machine-name, object)
Operations to search, update
History of the problem


Very old: Internet DNS does this
But DNS lookup is by machine name


We want the inverted map
Napster was the first really big hit



5 million users at one time
The index itself was centralized.
Used peer-to-peer file copying once a copy
was found (many issues arose…)
Hot academic topic today

System based on a virtual ring
  MIT: Chord system (Karger, Kaashoek)
Many systems use Plaxton radix search
  Rice: Pastry (Druschel, Rowstron)
  Berkeley: Tapestry
Cornell: a scheme that uses replication
  Kelips
Kelips idea?


Treat the system as sqrt(N) “affinity” groups
of size sqrt(N) each
Any given index entry is mapped to a group
and replicated within it



O(log N) time delay to do an update
Could accelerate this with an unreliable multicast
To do a lookup, find a group member (or a
few of them) and ask for item

O(1) lookup
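A sketch of the Kelips mapping, assuming a simple hash picks one of roughly sqrt(N) affinity groups; the names below are illustrative and the details of membership, contacts and replication within a group are omitted.

```python
import hashlib
import math

def affinity_group(key: str, n_nodes: int) -> int:
    """Map a node name or index key to one of ~sqrt(N) affinity groups."""
    n_groups = max(1, math.isqrt(n_nodes))
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % n_groups

N = 10_000                               # hypothetical system size
# An index entry (object -> machine) is replicated within the group its key maps to.
group = affinity_group("some-song.mp3", N)
print(f"replicate the index entry in affinity group {group} of {math.isqrt(N)}")
# Lookup: hash the key the same way, then ask any contact in that group -- O(1).
```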
Why Kelips?

Other schemes have O(log N) lookup
delay


This is quite a high cost in practical
settings
Others also have fragile data structures


Background reorganization costs soar
under stress, churn, flash loads
Kelips has a completely constant load!
Solutions that share properties




Scalable
Robust against localized disruption
Have emergent behavior we can reason
about, exploit in the application layer
Think of the way a hive of insects
organizes itself or reacts to stimuli.
There are many similarities
Revisit our goals

Are these potential components for sentient systems?






Middleware that perceives the state of the network
It represents this knowledge in a form smart applications can
exploit
Although built from large numbers of rather dumb
components, the emergent behavior is intelligent. These
applications are more robust, more secure, more responsive
than any individual component
When something unexpected occurs, they can diagnose the
problem and trigger a coordinated distributed response
They repair themselves after damage
We seem to have the basis from which to work!
Brings us full circle


Our goal should be a new form of very stable
“sentient middleware”
Have we accomplished this goal?




Probabilistically reliable, scalable primitives
They solve many problems
Gaining much attention now from industry,
academic research community
Fundamental issue is skepticism about peer-to-peer as a computing model
Conclusions?



We’re on the verge of a breakthrough –
networks that behave like sentient
infrastructure on behalf of smart applications
Incredibly exciting opportunities if we can
build these
Cornell’s angle has focused on
probabilistically scalable technologies and
tried to mix real systems and experimental
work with stochastic analyses