Interdomain Routing Correctness and Stability
Nick Feamster

Is correctness really that important?
• The Internet is increasingly becoming part of the mission-critical infrastructure (a public utility!).
Big problem: Very poor understanding of how to manage it.

Why does routing go wrong?
• Complex policies
  – Competing / cooperating networks
  – Each with only limited visibility
• Large scale
  – Tens of thousands of networks
  – …each with hundreds of routers
  – …each routing to hundreds of thousands of IP prefixes

What can go wrong?
Some things are out of the hands of networking research.
But… two-thirds of the problems are caused by configuration of the routing protocol.

Review: Simple operation…
[Figure labels: Autonomous Systems (ASes), Route Advertisement, Session, MIT.]

Destination    Next-hop         AS Path
18.0.0.0/8     192.5.89.89      1 3356 3
18.0.0.0/8     66.250.252.44    174 3

…but complex configuration!
Flexibility for realizing goals in a complex business landscape:
• Which neighboring networks can send traffic
• Where traffic enters and leaves the network
• How routers within the network learn routes to external destinations
[Figure labels: Traffic, Route, No Route; Flexibility, Complexity.]

Configuration Semantics
• Filtering: route advertisement (figure labels: Customer, Competitor)
• Ranking: route selection (figure labels: Primary, Backup)
• Dissemination: internal route advertisement

What types of problems does configuration cause?
• Persistent oscillation (today’s reading)
• Forwarding loops
• Partitions
• “Blackholes”
• Route instability
• …

These problems are real
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”
-- news.com, April 25, 1997
“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.”
-- wired.com, January 25, 2001
“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."”
-- cnn.com, October 3, 2002
“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004
[Figure labels: UUNet, Sprint, Florida Internet Barn.]

Several “Big” Problems a Week
[Figure: bar chart of the number of threads over the stated period (axis 0 to 90) by problem type: Filtering, Route Leaks, Route Hijacks, Route Instability, Routing Loops, Blackholes; periods 1994-1997, 1998-2001, 2001-2004.]

Why is routing hard to get right?
• Defining correctness is hard
• Interactions cause unintended consequences
  – Each network independently configured
  – Unintended policy interactions
• Operators make mistakes
  – Configuration is difficult
  – Complex policies, distributed configuration

Correctness Specification
Safety: The protocol converges to a stable path assignment for every possible initial state and message ordering.
(The protocol does not oscillate.)

Safety: No Persistent Oscillation
[Figure: ASes 1, 2, and 3 around AS 0, annotated with ranked paths: AS 1: 1 3 0, 1 0; AS 2: 2 1 0, 2 0; AS 3: 3 2 0, 3 0.]
Varadhan, Govindan, & Estrin, “Persistent Route Oscillations in Interdomain Routing”, 1996

Strawman: Global Policy Check
• Require each AS to publish its policies
• Detect and resolve conflicts
Problems:
• ASes typically unwilling to reveal policies
• Checking for convergence is NP-complete
• Failures may still cause oscillations

Think Globally, Act Locally
• Key features of a good solution
  – Safety: guaranteed convergence
  – Expressiveness: allow diverse policies for each AS
  – Autonomy: do not require revelation/coordination
  – Backwards-compatibility: no changes to BGP
• Local restrictions on configuration semantics
  – Ranking
  – Filtering

Main Idea of Today’s Paper
• Permit only two business arrangements
  – Customer-provider
  – Peering
• Constrain both filtering and ranking based on these arrangements to guarantee safety
• Surprising result: these arrangements correspond to today’s (common) behavior
Gao & Rexford, “Stable Internet Routing without Global Coordination”, IEEE/ACM ToN, 2001

Relationship #1: Customer-Provider
Filtering
  – Routes from customer: to everyone
  – Routes from provider: only to customers
[Figure: two panels, “From other destinations to the customer” and “From the customer to other destinations”, showing advertisements and traffic between providers and a customer.]

Relationship #2: Peering
Filtering
  – Routes from peer: only to customers
  – No routes from other peers or providers
[Figure: advertisements and traffic between two peers and their customers.]

Rankings
• Routes from customers over routes from peers
• Routes from peers over routes from providers
[Figure labels: provider, peer, customer.]

Additional Assumption: Hierarchy
[Figure: example AS graph that violates the hierarchy assumption, labeled “Disallowed!”]

Safety: Proof Sketch
• System state: the current route at each AS
• Activation sequence: revisit some router’s selection based on those of neighboring ASes

Activation Sequence: Intuition
• Activation: emulates a message ordering
  – Activated router has received and processed all messages corresponding to the system state
• “Fair” activation: all routers receive and process outstanding messages

Safety: Proof Sketch (continued)
• Goal: find an activation sequence that leads to a stable state
• Safety: satisfied if that activation sequence is contained within any “fair” activation sequence

Proof, Step 1: Customer Routes
• Activate ASes from customer to provider
  – AS picks a customer route if one exists
  – Decision of one AS cannot cause an earlier AS to change its mind
[Figure caption: An AS picks a customer route when one exists.]

Proof, Step 2: Peer & Provider Routes
• Activate remaining ASes from provider to customer
  – Decision of one Step-2 AS cannot cause an earlier Step-2 AS to change its mind
  – Decision of Step-2 AS cannot affect a Step-1 AS
[Figure caption: AS picks a peer or provider route when no customer route is available.]

Ranking and Filtering Interactions
• Allowing more flexibility in ranking
  – Allow same preference for peer and customer routes
  – Never choose a peer route over a shorter customer route
• … at the expense of stricter AS graph assumptions
  – Hierarchical provider-customer relationship (as before)
  – No private peering with (direct or indirect) providers
[Figure label: Peering.]

Some problems
• Requires acyclic hierarchy (global condition)
• Cannot express many business relationships
[Figure labels: Sprint, Abovenet, Verio, PSINet, Customer.]
Question: Can we relax the constraints on filtering? What happens to rankings?
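The customer-provider and peering rules above can be written down as two small local functions. The following is only a minimal sketch: the names `Rel`, `may_export`, and `best_route` are hypothetical, and breaking ties by shorter AS path is an assumption added for illustration, not something the slides specify.

```python
from enum import IntEnum


class Rel(IntEnum):
    """Relationship to the neighbor a route was learned from (higher = preferred)."""
    PROVIDER = 0
    PEER = 1
    CUSTOMER = 2


def may_export(learned_from: Rel, export_to: Rel) -> bool:
    """Filtering: customer routes go to everyone; peer and provider
    routes are advertised only to customers."""
    return learned_from == Rel.CUSTOMER or export_to == Rel.CUSTOMER


def best_route(routes):
    """Ranking: customer routes over peer routes over provider routes.

    routes: list of (Rel, as_path) pairs, as_path being a tuple of AS numbers.
    Ties are broken by shorter AS path (an illustrative assumption).
    """
    return max(routes, key=lambda r: (r[0], -len(r[1])))


# A route learned from a peer is not re-advertised to another peer or a provider.
print(may_export(Rel.PEER, Rel.PEER))      # False
print(may_export(Rel.PEER, Rel.CUSTOMER))  # True

# A longer customer route still wins over a shorter peer route.
print(best_route([(Rel.PEER, (2, 0)), (Rel.CUSTOMER, (5, 6, 0))]))
```

Both functions use only information local to one AS, which is the point of the approach: no AS has to reveal its policies or coordinate with others.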
Other Possible Local Rankings
• Accept only next-hop rankings
  – Captures most routing policies
  – Generalizes customer/peer/provider
  – Problem: system not safe
• Accept only shortest hop count rankings
  – Guarantees safety under filtering
  – Problem: not expressive
[Figure: ASes 1, 2, and 3 annotated with next-hop rankings 3*, 2*, 0*; 1*, 3*, 0*; and 2*, 1*, 0*.]
Feamster, Johari, & Balakrishnan, “Implications of Autonomy for the Expressiveness of Policy Routing”, SIGCOMM 2005

What Rankings Violate Safety?
Theorem. Permitting paths of length n+2 over paths of length n will violate safety under filtering.
Theorem. Permitting paths of length n+1 over paths of length n will result in a dispute wheel.
Feamster, Johari, & Balakrishnan, “Implications of Autonomy for the Expressiveness of Policy Routing”, SIGCOMM 2005

What about properties of resulting paths, after the protocol has converged?
We need additional correctness properties.

Correctness Specification
• Safety: The protocol converges to a stable path assignment for every possible initial state and message ordering. (The protocol does not oscillate.)
• Path Visibility: Every destination with a usable path has a route advertisement. (If there exists a path, then there exists a route.)
  Example violation: Network partition
• Route Validity: Every route advertisement corresponds to a usable path. (If there exists a route, then there exists a path.)
  Example violation: Routing loop

Path Visibility: Internal BGP (iBGP)
• Default: “Full mesh” iBGP. Doesn’t scale.
• Large ASes use “route reflection”
  – Route reflector: non-client routes over client sessions; client routes over all sessions
  – Client: don’t re-advertise iBGP routes

iBGP Signaling: Static Check
Theorem. Suppose the iBGP reflector-client relationship graph contains no cycles. Then, path visibility is satisfied if, and only if, the set of routers that are not route reflector clients forms a clique.
Condition is easy to check with static analysis.

How do we guarantee these additional properties in practice?
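The condition in the “iBGP Signaling: Static Check” theorem can be tested directly on a description of the iBGP sessions. The sketch below is illustrative only: the function names and the input representation (a dict from reflector to clients, plus a set of sessions) are assumptions, and it takes for granted that every reflector-client edge is backed by a configured session.

```python
from itertools import combinations

WHITE, GREY, BLACK = 0, 1, 2  # DFS colors: unvisited, on stack, finished


def has_cycle(clients_of):
    """Detect a cycle in the reflector -> client graph with a depth-first search."""
    color = {}

    def visit(u):
        color[u] = GREY
        for v in clients_of.get(u, ()):
            c = color.get(v, WHITE)
            if c == GREY or (c == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(clients_of))


def satisfies_path_visibility(routers, clients_of, sessions):
    """Apply the theorem: with an acyclic reflector-client graph, path visibility
    holds iff the routers that are nobody's client form a clique of iBGP sessions.

    routers:    set of router names
    clients_of: dict mapping each route reflector to the set of its clients
    sessions:   set of frozenset({a, b}) iBGP sessions
    """
    if has_cycle(clients_of):
        raise ValueError("theorem does not apply: reflector-client graph has a cycle")
    all_clients = set().union(*clients_of.values())
    top_level = routers - all_clients
    return all(frozenset(pair) in sessions
               for pair in combinations(sorted(top_level), 2))


routers = {"A", "B", "C", "c1"}
clients_of = {"A": {"c1"}}
full = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("A", "c1")]}
print(satisfies_path_visibility(routers, clients_of, full))                           # True
print(satisfies_path_visibility(routers, clients_of, full - {frozenset(("B", "C"))}))  # False
```

Because the check needs only the session list and the reflector-client roles, it can run on configuration files before anything is deployed.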
Today: Reactive Operation
[Figure: flowchart. “What happens if I tweak this policy…?” Configure -> Observe -> Desired effect? If no: Revert. If yes: Wait for next problem.]
• Problems cause downtime
• Problems often not immediately apparent

Goal: Proactive Operation
• Idea: Analyze configuration before deployment
[Figure: Configure -> Detect Faults (rcc) -> Deploy.]
Many faults can be detected with static analysis.
Feamster & Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis”, NSDI 2005

rcc Overview
[Figure: distributed router configurations (single AS) are converted to a normalized representation; a correctness specification is mapped to constraints; “rcc” checks the two and reports faults.]
Challenges
• Analyzing complex, distributed configuration
• Defining a correctness specification
• Mapping specification to constraints
Feamster & Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis”, NSDI 2005

rcc Implementation
[Figure: distributed router configurations (Cisco, Avici, Juniper, Procket, etc.) -> Preprocessor -> Parser -> Relational Database (mySQL) -> Verifier (with Constraints) -> Faults.]
Feamster & Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis”, NSDI 2005

Summary: Faults across 17 ASes
• Every AS had faults, regardless of network size
• Most faults can be attributed to distributed configuration
[Figure: bar chart of the number of ASes (axis 0 to 10) exhibiting each fault type, grouped under Route Validity and Path Visibility. Fault types: Incomplete Filter, Undefined Filter, Transit Between Peers, Inconsistent Import, Inconsistent Export, Incomplete iBGP Session, Duplicate Loopback, iBGP Signaling Partition.]

rcc: Take-home lessons
• Static configuration analysis uncovers many errors
• Major causes of error:
  – Distributed configuration
  – Intra-AS dissemination is too complex
  – Mechanistic expression of policy
http://nms.csail.mit.edu/rcc/
About 100 downloads (70 network operators)

Two Philosophies
• This lecture: Accept the Internet as is. Devise “band-aids”.
• Another direction: Redesign Internet routing to guarantee safety, route validity, and path visibility

Preventing Errors in the First Place
• Before: conventional iBGP
• After: RCP gets “best” iBGP routes (and IGP topology)
[Figure labels: eBGP, iBGP, RCP.]
Feamster et al., “The Case for Separating Routing from Routers”, SIGCOMM FDNA, 2004
Caesar et al., “Design and Implementation of a Routing Control Platform”, NSDI, 2005

Configuration Syntax (Example)
router bgp 7018
 neighbor 192.0.2.10 remote-as 65000
 neighbor 192.0.2.10 route-map IMPORT in
 neighbor 192.0.2.20 remote-as 7018
 neighbor 192.0.2.20 route-reflector-client     [annotation: Dissemination]
!
route-map IMPORT permit 1
 match ip address 199
 set local-preference 80                         [annotation: Ranking]
!
route-map IMPORT permit 2
 match as-path 99
 set local-preference 110
!
route-map IMPORT permit 3
 set community 7018:1000
!
ip as-path access-list 99 permit ^65000$
access-list 199 permit ip host 192.0.2.0 host 255.255.255.0
access-list 199 permit ip host 10.0.0.0 host 255.0.0.0

Why is Routing Hard to Get Right?
• Defining correctness is hard
• Operators make mistakes
  – Configuration is difficult
  – Complex policies, distributed configuration
• Interactions cause unintended consequences
  – Each network independently configured
  – Unintended policy interactions

Which faults does rcc detect?
[Figure labels: Faults found by rcc, Latent faults, Potentially active faults, End-to-end failures.]

Normalizing Router Configuration
Challenge:
• Hundreds of routers distributed across an AS
• Thousands to tens of thousands of lines per router
• Many ways to express identical policy
Solution:
• Express configuration with centralized tables
• Check constraints by issuing queries on tables
[Figure labels (tables): Sessions, Filters, Rankings.]

Route Validity: Consistent Export
Possible causes:
• Malice/deception
• iBGP signaling partition
• Inconsistent export policy

Neighbor     AS     Export
10.1.2.3     456    1
10.4.5.6     456    2

Export    Clause    Prepend
1         1         123
2         1         123 123

Policy normalization makes comparison easy.

neighbor 10.1.2.3 route-map PEER permit 10 set prepend 123
neighbor 10.4.5.6 route-map PEER permit 10 set prepend 123 123

Inconsistent Export Observed at AT&T
[Figure: percentage of destinations with inconsistent routes (y-axis) against percentage of time (x-axis). Annotation: 15% of destinations inconsistent for >4 days.]
Feamster et al., “BorderGuard: Detecting Cold Potatoes from Peers”. ACM IMC, October 2004.

Example: “Bogon” Routes
Feamster et al., “An Empirical Study of ‘Bogon’ Route Advertisements”. ACM CCR, January 2005.

rcc Interface

Parsing Configuration

List of Faults
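As a sketch of the consistent-export check from the “Route Validity: Consistent Export” slide, the normalized session and export tables can be compared per neighbor AS. The slides say rcc issues queries against a relational database; the in-memory Python version below is a stand-in for illustration, and the function name and data layout are assumptions.

```python
from collections import defaultdict


def inconsistent_exports(sessions, export_policies):
    """Find neighbor ASes whose sessions carry different normalized export policies.

    sessions:        list of (neighbor_ip, neighbor_as, export_id) rows
    export_policies: dict mapping export_id to a tuple of (clause, prepend) rows
    Returns a dict {neighbor_as: sorted list of neighbor IPs} for offending ASes.
    """
    by_as = defaultdict(list)
    for ip, asn, export_id in sessions:
        by_as[asn].append((ip, export_policies[export_id]))
    return {
        asn: sorted(ip for ip, _ in entries)
        for asn, entries in by_as.items()
        if len({policy for _, policy in entries}) > 1
    }


# Rows taken from the tables on the slide.
sessions = [("10.1.2.3", 456, 1), ("10.4.5.6", 456, 2)]
export_policies = {
    1: ((1, "123"),),
    2: ((1, "123 123"),),
}
print(inconsistent_exports(sessions, export_policies))
# {456: ['10.1.2.3', '10.4.5.6']}
```

Comparing normalized policy contents rather than route-map names is what makes the check robust: two sessions may both reference a route map called PEER on different routers and still export differently.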