Towards an Internet that “Never Fails”
Hari Balakrishnan
MIT
Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru

What We Should Aim Toward
• Carrier airlines (2002 FAA Fact Book)
  41 accidents, 6.7 million flights (five “nines” availability)
• 911 phone service (1993 NRIC report)
  29 minutes downtime per year per line (four “nines” availability)
• Standard phone service (various sources)
  53 minutes downtime per year per line (four “nines” availability)
• The Internet?
  One to two “nines”

Example Catastrophic Failures
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”
-- news.com, April 25, 1997

“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.”
-- wired.com, January 25, 2001

“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."”
-- cnn.com, October 3, 2002

"A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004

NANOG List Failure “Analysis”
[Figure: bar chart. Y-axis: # Threads over Stated Period (0 to 90). Categories: Filtering, Route Leaks, Route Hijacks, Route Instability, Routing Loops, Blackholes. Periods: 1994-1997, 1998-2001, 2001-2004.]
More than 70% of threads discussing failures related to router configuration or route announcement problems
Note: Only includes problems openly discussed on this list.
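The “nines” figures on the “What We Should Aim Toward” slide can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, a year is assumed to be 365 days, and the airline number treats the fraction of accident-free flights as an availability, as the slide does.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def availability_from_downtime(minutes_down_per_year: float) -> float:
    """Fraction of the year a service is up, given its yearly downtime in minutes."""
    return 1.0 - minutes_down_per_year / MINUTES_PER_YEAR


def nines(availability: float) -> float:
    """Number of 'nines' in an availability: 0.999 -> 3.0, 0.9999 -> 4.0."""
    return -math.log10(1.0 - availability)


def downtime_minutes_per_year(n_nines: float) -> float:
    """Yearly downtime budget, in minutes, for a given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-n_nines)


if __name__ == "__main__":
    # Downtime figures quoted on the slide (minutes per year per line).
    for label, minutes in [("911 phone service", 29), ("Standard phone service", 53)]:
        a = availability_from_downtime(minutes)
        print(f"{label}: {a:.4%} available, about {nines(a):.2f} nines")

    # Airline figure from the slide: 41 accidents in 6.7 million flights.
    airline = 1.0 - 41 / 6.7e6
    print(f"Carrier airlines: about {nines(airline):.2f} nines")

    # Downtime budgets for the levels discussed in the talk.
    for n in (1, 2, 4, 5):
        print(f"{n} nines allows {downtime_minutes_per_year(n):.1f} minutes down per year")
```

Under these assumptions, 53 minutes of downtime per year lands just short of an exact four nines (the four-nines budget is about 52.6 minutes), and one to two nines corresponds to roughly 36.5 to 3.65 days of downtime per year.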

Faults and Failures
• Fault = Underlying defect in a component that causes it to violate a specification
  Latent or Active (i.e., cause errors)
• Unmasked faults (errors) cause failures
  Failure of subsystem (spec violation) causes fault in system
• Internet faults occur for complex reasons
  Hardware, software, protocol, design, implementation, operational faults: could be triggered by malice
• Internet failure: A cannot communicate with B

Three Directions
• Configuration as programming
  Defines BGP behavior
  Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  Prefix-based routing considered harmful
• End-to-end routing
  Exposing multiple paths to end systems (and stubs)

Today: Reactive Operation
What happens if I tweak this policy…?
• Problems cause downtime
• Problems often not immediately apparent

Coping with Complexity
• View configuration as (distributed) programming
  Large-scale: over 1M lines of code in some networks
• Programming tools to reduce fault frequency
  Static analysis can detect many faults [rcc]
  Sandboxing to overcome current “stimulus-response” reasoning [FR03]
• Centralize configuration platform
  More “intentional” config specs
  Push configs to routers
  Push routes to routers [RCP:F+04]
  Use static analysis and sandboxing tools

Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
[Diagram labels: Configure; rcc; Detect Faults; Deploy]
[Diagram labels: Distributed router configurations (Single AS); Normalized Representation; Correctness Specification; Constraints; Faults]
• Represent complex, distributed configuration
• Define a correctness specification
• Map specification to constraints

Correctness Specification
Path Visibility
  Every destination with a usable path has a route advertisement
  If there exists a path, then there exists a route
  Example violation: Signaling partition
Route Validity
  Every route advertisement corresponds to a usable path
  If there exists a route, then there exists a path
  Example violation: Routing loop

Results: Faults across 17 ASes
Every AS had faults, regardless of network size
Most faults can be attributed to distributed configuration
[Figure: bar chart. Y-axis: Number of ASes (0 to 10). Fault types, grouped under Path Visibility and Route Validity: Incomplete Filter, Undefined Filter, Transit Between Peers, Inconsistent Import, Inconsistent Export, Incomplete iBGP Session, Duplicate Loopback, iBGP Signaling Partition.]

Three Directions
• Configuration as programming
  Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  Prefix-based routing considered harmful
• End-to-end routing
  Exposing multiple paths to end systems

Prefixes are too coarse-grained
Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
70% of intra-AS failures not visible in BGP [FABK03]

…but they are also too fine-grained!
• ~70% of discontiguous prefix pairs from the same AS are announced from the same location
• Allocation explains about 60% of these cases: Registries often allocate discontiguous address blocks to a single AS on the same day
• Routes for these prefixes will “flap” together.
  135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
Route objects should correspond to an “atom” of hosts that share fate

Proposal: Atomic Interdomain Protocol (AIP)
• Exterminate prefixes
• Name “atomic domains” (AD) directly
  Addressing, forwarding and routing on ADs
  Like current AS numbers, but finer-grained
  Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
• Flat AD IDs can carry cryptographic meaning
  Self-certifying (hash of public key)
• End-system addresses have the form [AD : LocalID]

Summary
It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
It’s possible to get four or five nines of Internet availability, if we:
  Develop tools to cope with configuration complexity
  Develop a failure-atomic routing system
  Expose multiple IP-layer paths to higher layers
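
The AIP proposal above says that flat AD IDs can be self-certifying (a hash of a public key) and that end-system addresses take the form [AD : LocalID]. The following is a minimal sketch of that idea, not the protocol's actual encoding: the choice of SHA-256, the hex representation, the function names, and the placeholder key bytes are all assumptions made for illustration.

```python
import hashlib
import hmac


def make_ad_id(public_key: bytes) -> str:
    """Derive a flat AD identifier from a public key.

    Assumption: SHA-256 and a hex string; the slides only say "hash of public key".
    """
    return hashlib.sha256(public_key).hexdigest()


def format_address(ad_id: str, local_id: str) -> str:
    """Render an end-system address in the [AD : LocalID] form from the slides."""
    return f"[{ad_id} : {local_id}]"


def is_self_certifying(ad_id: str, presented_key: bytes) -> bool:
    """Check that a presented public key really hashes to the claimed AD ID.

    No third party is consulted: the identifier itself certifies the key.
    """
    return hmac.compare_digest(ad_id, make_ad_id(presented_key))


if __name__ == "__main__":
    # Placeholder byte strings standing in for real public keys (hypothetical).
    genuine_key = b"hypothetical-public-key-of-an-atomic-domain"
    forged_key = b"hypothetical-public-key-of-an-impostor"

    ad = make_ad_id(genuine_key)
    print(format_address(ad, "host-17"))
    print("genuine key accepted:", is_self_certifying(ad, genuine_key))
    print("forged key accepted:", is_self_certifying(ad, forged_key))
```

The point of the design is that anyone holding an AD ID can verify a claimed key locally by rehashing it, which is what lets a flat identifier "carry cryptographic meaning" without a separate registry lookup.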