RON: Resilient Overlay Networks David Andersen, Hari Balakrishnan, Frans Kaashoek, Robert Morris

MIT Laboratory for Computer Science
http://nms.lcs.mit.edu/ron/

Fault-tolerant Networking
[Figure: nodes A, B, C, and D connected through a network]
Any-to-any communication, routing around failures

The Internet
[Figure: Autonomous Systems (ASes) interconnected by peering and transit relationships and running BGP4, ranging from a "mom-and-pop ISP" through a "big ISP" to a "really-big ISP everyone’s afraid of"]
• Scalability via aggressive aggregation and information hiding
• Commercial reality via peering & transit relationships

How Robust is Internet Routing?
Paxson 95-97
• 3.3% of all routes had serious problems
Labovitz 97-00
• 10% of routes available < 95% of the time
• 65% of routes available < 99.9% of the time
• 3-min minimum detection+recovery time; often 15 mins
• 40% of outages took 30+ mins to repair
Chandra 01
• 5% of faults last more than 2.75 hours

1. Slow outage detection and recovery
2. Inability to detect badly performing paths
3. Inability to efficiently leverage redundant paths
4. Inability to perform application-specific routing
5. Inability to express sophisticated routing policy

Our Goal
To improve communication availability for small groups by at least a factor of 10
• Many applications
– Collaboration and conferencing
– Virtual Private Networks (VPNs) across public Internet
– Overlay Internet Service

RON: Routing Using Overlays
• Cooperating end-systems in different routing domains can conspire to do better than scalable wide-area protocols
[Figure: reliability via path monitoring and re-routing, layered above the scalable BGP-based IP routing substrate]
• Types of failures
– Outages: Configuration/operational errors, backhoes, etc.
– Performance failures: Severe congestion, denial-of-service attacks, etc.
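To put the availability figures above in concrete terms, the following minimal sketch converts an availability fraction into expected downtime per year. It assumes that "a factor of 10" means cutting unavailability tenfold; that reading, and the function names, are illustrative assumptions rather than anything stated in the slides.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year that a path is unavailable."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def improved_availability(availability: float, factor: float = 10.0) -> float:
    """Availability after dividing unavailability by `factor`.

    Assumption: this is one plausible reading of the stated goal.
    """
    return 1.0 - (1.0 - availability) / factor


# The two thresholds quoted from Labovitz 97-00.
for a in (0.95, 0.999):
    better = improved_availability(a)
    print(
        f"availability {a:.3%}: {downtime_hours_per_year(a):.2f} h/yr down; "
        f"after 10x: {downtime_hours_per_year(better):.2f} h/yr down"
    )
```

Under this reading, a route that is available 95% of the time is down about 438 hours per year, and a tenfold improvement would bring that to roughly 44 hours.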
RON Design
Nodes in different routing domains (ASes)
[Figure: each node runs the RON library, made up of Conduits, a Forwarder, a Prober, and a Router; the Router holds application-specific routing tables, a policy routing module, and a Performance Database]
Link-state routing protocol, disseminates info using RON!

Many Research Questions
• Does the RON approach work at all?
• Each RON is small in size, no more than 50 or 100 nodes
– How fast can failure detection & recovery happen?
• Policy routing
– Doesn’t RON violate AUPs and other policies?
• Routing behavior
– Can stable routing be achieved?
– Implementing efficient multi-criteria routing
• Is it safe to deploy a large number of (small) interacting RONs on the Internet?

RON Deployment (19 sites)
[Figure: map of the deployed sites]
.com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx), cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit, vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela

RON Experiments
• Measure loss, latency, and throughput with and without RON
• 13 hosts in the US and Europe
• 3 days of measurements from data collected in March 2001
• 30-minute average loss rates
– A 30 minute outage is very serious!
• Note: Experiments done with “No-Internet2-for-commercial-use” policy

RON greatly improves loss-rate
[Figure: scatter plot of 30-min average loss rate on Internet against 30-min average loss rate with RON, both axes from 0 to 1; 13,000 samples]
RON loss rate never more than 30%

An order-of-magnitude fewer failures
30-minute average loss rates

Loss Rate   RON Better   No Change   RON Worse
10%         479          57          47
20%         127          4           15
30%         32           0           0
50%         20           0           0
80%         14           0           0
100%        10           0           0

• 6,825 “path hours” represented here
• 12 “path hours” of essentially complete outage
• 76 “path hours” of TCP outage
• RON routed around all of these!
• One indirection hop provides almost all the benefit!
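The observation that one indirection hop provides almost all the benefit suggests a simple selection rule: compare the direct path against every path through a single intermediate node and keep the one with the lowest loss. The sketch below illustrates that idea only. The node names, the loss table, and the assumption that losses on the two legs are independent are all hypothetical, and this is not presented as RON's actual routing code.

```python
from typing import Dict, Iterable, List, Tuple

# (src, dst) -> measured loss rate in [0, 1]; pairs with no entry are treated as unusable.
LossTable = Dict[Tuple[str, str], float]


def path_loss(link_losses: Iterable[float]) -> float:
    """End-to-end loss of a path, assuming independent losses on each link."""
    delivered = 1.0
    for loss in link_losses:
        delivered *= 1.0 - loss
    return 1.0 - delivered


def best_path(src: str, dst: str, nodes: Iterable[str],
              loss: LossTable) -> Tuple[List[str], float]:
    """Return the lowest-loss path among the direct path and all one-hop detours."""
    best = ([src, dst], loss.get((src, dst), 1.0))
    for hop in nodes:
        if hop in (src, dst):
            continue
        candidate = path_loss([loss.get((src, hop), 1.0),
                               loss.get((hop, dst), 1.0)])
        if candidate < best[1]:
            best = ([src, hop, dst], candidate)
    return best


# Hypothetical three-node overlay: the direct A -> C path is in complete outage.
example_loss: LossTable = {
    ("A", "C"): 1.0,
    ("A", "B"): 0.01,
    ("B", "C"): 0.02,
}
route, route_loss = best_path("A", "C", ["A", "B", "C"], example_loss)
print(route, round(route_loss, 4))
```

Restricting the search to one intermediate keeps the cost per source-destination pair linear in the number of nodes, which fits overlays of the small size discussed above.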
Resilience Against DoS Attacks

Conclusion
• Improved availability of Internet communication paths using small overlays
– Layered above scalable IP substrate
– RON provides a set of libraries and programs to facilitate this application-specific routing
• Experimental data suggest that this approach works
– Over 10X availability
– Outage detection and recovery in about 15 seconds
– Able to route around certain denial-of-service attacks
• Many interesting questions remain…

http://nms.lcs.mit.edu/ron/
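As a rough illustration of how probing can bound outage detection time, the sketch below marks a path as down after a run of consecutive lost probes. The probe interval and loss threshold are hypothetical values chosen for the example, not RON's actual parameters, and the class name is invented here.

```python
from dataclasses import dataclass


@dataclass
class PathMonitor:
    """Tracks probe results for one overlay path (illustrative only)."""

    probe_interval_s: float = 3.0  # hypothetical spacing between probes
    loss_threshold: int = 5        # hypothetical run of losses that signals an outage
    consecutive_losses: int = 0

    def record(self, probe_ok: bool) -> None:
        """Record one probe result; any success resets the loss run."""
        self.consecutive_losses = 0 if probe_ok else self.consecutive_losses + 1

    @property
    def is_down(self) -> bool:
        """True once the loss run reaches the threshold."""
        return self.consecutive_losses >= self.loss_threshold

    @property
    def detection_delay_s(self) -> float:
        """Approximate time from outage onset to detection."""
        return self.probe_interval_s * self.loss_threshold


# Simulated probe results: two successes, then the path fails.
monitor = PathMonitor()
for ok in [True, True, False, False, False, False, False]:
    monitor.record(ok)
print(monitor.is_down, monitor.detection_delay_s)
```

The detection delay is approximately the probe interval times the threshold, so probing more often or lowering the threshold speeds up detection at the cost of more probe traffic or more false alarms.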
