RON: Resilient Overlay Networks David Andersen, Hari Balakrishnan, Frans Kaashoek, Robert Morris

MIT Laboratory for Computer Science
http://nms.lcs.mit.edu/ron/
Fault-tolerant Networking
[Figure: hosts A, B, C, and D connected through a network]
Any-to-any communication, routing around failures
The Internet
[Figure: Autonomous Systems (ASes), ranging from a mom-and-pop ISP to a big ISP to a really-big ISP everyone's afraid of, connected by transit and peering relationships and exchanging routes via BGP4]
Scalability via aggressive aggregation and information hiding
Commercial reality via peering & transit relationships
How Robust is Internet Routing?
Paxson 95-97
• 3.3% of all routes had serious problems
Labovitz 97-00
• 10% of routes available < 95% of the time
• 65% of routes available < 99.9% of the time
• 3-min minimum detection+recovery time;
often 15 mins
• 40% of outages took 30+ mins to repair
Chandra 01
• 5% of faults last more than 2.75 hours
1. Slow outage detection and recovery
2. Inability to detect badly performing paths
3. Inability to efficiently leverage redundant paths
4. Inability to perform application-specific routing
5. Inability to express sophisticated routing policy
Our Goal
To improve communication availability for small groups
by at least a factor of 10
• Many applications
– Collaboration and conferencing
– Virtual Private Networks (VPNs) across public Internet
– Overlay Internet Service
RON: Routing Using Overlays
• Cooperating end-systems in different routing domains can
conspire to do better than scalable wide-area protocols
[Figure: overlay nodes providing reliability via path monitoring and re-routing, layered above the scalable BGP-based IP routing substrate]
• Types of failures
– Outages: Configuration/operational errors, backhoes, etc.
– Performance failures: Severe congestion, denial-of-service
attacks, etc.
RON Design
[Figure: RON nodes in different routing domains (ASes). On each node, the RON library connects a Conduit to a Forwarder, Prober, and Router, backed by application-specific routing tables, a policy routing module, and a Performance Database. A link-state routing protocol disseminates info using RON!]
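The routing idea in this design can be illustrated with a minimal sketch. It assumes each node holds a probed loss rate for every virtual link and compares the direct path with paths through one intermediate node. The names `path_loss` and `best_path`, the loss-only metric, and the independence assumption are illustrative choices, not the RON library's API.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: probed loss rate per directed virtual link.
Loss = Dict[Tuple[str, str], float]


def path_loss(path: List[str], loss: Loss) -> float:
    """Loss rate of a path, assuming losses on each virtual link are independent."""
    delivered = 1.0
    for src, dst in zip(path, path[1:]):
        delivered *= 1.0 - loss[(src, dst)]
    return 1.0 - delivered


def best_path(src: str, dst: str, nodes: List[str], loss: Loss) -> List[str]:
    """Return the direct path or the best path through one intermediate node.

    The direct path is listed first, so it wins ties.
    """
    candidates = [[src, dst]]
    candidates += [[src, hop, dst] for hop in nodes if hop not in (src, dst)]
    return min(candidates, key=lambda p: path_loss(p, loss))


if __name__ == "__main__":
    # Made-up probe results: the direct A->B path is badly congested.
    loss = {("A", "B"): 0.50, ("A", "C"): 0.01, ("C", "B"): 0.02}
    print(best_path("A", "B", ["A", "B", "C"], loss))
```

Restricting candidates to a single intermediate node keeps the search linear in the number of overlay nodes.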
Many Research Questions
• Does the RON approach work at all?
• Each RON is small in size, no more than 50 or 100
nodes
– How fast can failure detection & recovery happen?
• Policy routing
– Doesn’t RON violate AUPs and other policies?
• Routing behavior
– Can stable routing be achieved?
– Implementing efficient multi-criteria routing
• Is it safe to deploy a large number of (small)
interacting RONs on the Internet?
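The failure-detection question can be made concrete with a small sketch. It assumes a link is declared down after a fixed number of consecutive unanswered probes. The `LinkMonitor` class, the threshold, and the probe interval are hypothetical values for illustration, not parameters taken from RON.

```python
from dataclasses import dataclass


@dataclass
class LinkMonitor:
    """Tracks probe results for one virtual link between two overlay nodes."""

    outage_threshold: int = 4    # hypothetical: consecutive losses before declaring an outage
    consecutive_losses: int = 0

    def record_probe(self, answered: bool) -> None:
        # Any answered probe resets the loss streak.
        self.consecutive_losses = 0 if answered else self.consecutive_losses + 1

    @property
    def down(self) -> bool:
        return self.consecutive_losses >= self.outage_threshold


def worst_case_detection_seconds(probe_interval: float, threshold: int) -> float:
    """Upper bound on failure-to-detection time with evenly spaced probes.

    Assumes the link fails just after an answered probe and ignores probe timeouts.
    """
    return probe_interval * threshold


if __name__ == "__main__":
    monitor = LinkMonitor(outage_threshold=4)
    for answered in (True, False, False, False, False):
        monitor.record_probe(answered)
    print(monitor.down, worst_case_detection_seconds(3.0, 4))
```

Under these assumptions, detection time scales with the probe interval times the threshold, so faster detection costs more probe traffic.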
RON Deployment (19 sites)
[Figure: map of RON sites, including OR-DSL, CMU, CCI, Aros, Utah, MIT, MA-Cable, Cisco, Cornell, CA-T1, and NYU, with links to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve]
.com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx),
cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit,
vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela
RON Experiments
• Measure loss, latency, and throughput with
and without RON
• 13 hosts in the US and Europe
• 3 days of measurements from data collected
in March 2001
• 30-minute average loss rates
– A 30-minute outage is very serious!
• Note: Experiments done with “No-Internet2-for-commercial-use” policy
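The 30-minute average loss rate can be computed as sketched below. The sketch assumes each probe is recorded as a `(timestamp_in_seconds, delivered)` pair; that layout and the function name `windowed_loss_rates` are assumptions for illustration, not the format of the RON measurement data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 30 * 60  # 30-minute averaging window


def windowed_loss_rates(samples: Iterable[Tuple[float, bool]]) -> Dict[int, float]:
    """Map each 30-minute window index to the fraction of probes lost in it.

    Each sample is (timestamp_in_seconds, delivered).
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for timestamp, delivered in samples:
        window = int(timestamp // WINDOW_SECONDS)
        sent[window] += 1
        if not delivered:
            lost[window] += 1
    return {w: lost[w] / sent[w] for w in sorted(sent)}


if __name__ == "__main__":
    probes = [(0, True), (10, False), (1799, True), (1800, False)]
    print(windowed_loss_rates(probes))
```

Averaging over a window this long means a high value reflects sustained loss rather than a brief burst.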
RON greatly improves loss-rate
[Figure: scatter plot ("loss.jit") of the 30-min average loss rate on the Internet against the 30-min average loss rate with RON, both axes from 0 to 1; 13,000 samples; RON loss rate never more than 30%]
An order-of-magnitude fewer failures
30-minute average loss rates
Loss Rate   RON Better   No Change   RON Worse
10%         479          57          47
20%         127          4           15
30%         32           0           0
50%         20           0           0
80%         14           0           0
100%        10           0           0
6,825 “path hours” represented here
12 “path hours” of essentially complete outage
76 “path hours” of TCP outage
RON routed around all of these!
One indirection hop provides almost all the benefit!
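As a worked check on the counts in the loss-rate table, the sketch below computes, for each loss-rate row, the share of samples in which RON was better. The counts are copied from the slide; the `TABLE` layout and the `better_share` helper are illustrative names only.

```python
from typing import Dict, Tuple

# (RON better, no change, RON worse) counts per loss-rate row, copied from the slide.
TABLE: Dict[str, Tuple[int, int, int]] = {
    "10%": (479, 57, 47),
    "20%": (127, 4, 15),
    "30%": (32, 0, 0),
    "50%": (20, 0, 0),
    "80%": (14, 0, 0),
    "100%": (10, 0, 0),
}


def better_share(row: Tuple[int, int, int]) -> float:
    """Fraction of the row's samples in which RON was better."""
    better, same, worse = row
    return better / (better + same + worse)


if __name__ == "__main__":
    for rate, row in TABLE.items():
        print(f"{rate:>4}: RON better in {better_share(row):.1%} of samples")
```

In the rows for 30% and above, every counted sample falls in the "RON Better" column.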
Resilience Against DoS Attacks
Conclusion
• Improved availability of Internet communication paths
using small overlays
– Layered above scalable IP substrate
– RON provides a set of libraries and programs to facilitate this
application-specific routing
• Experimental data suggest that this approach works
– Over 10X improvement in availability
– Outage detection and recovery in about 15 seconds
– Able to route around certain denial-of-service attacks
• Many interesting questions remain…
http://nms.lcs.mit.edu/ron/