Internet Routing Instability
Craig Labovitz, G. Robert Malan, Farnam Jahanian
Appeared: SIGCOMM '97
Presenters: Supranamaya Ranjan, Mohammed Ahamed

Internet Structure
• Many small ISPs at the lowest level
• Small number of big ISPs at the core

The Core of the Internet
[Figure: Sprint, Verio, UUNet, and rice.edu]
• Routing done using BGP at the core
• Intra-domain routing could be RIP, OSPF, etc.

BGP Overview
[Figure: Sprint, Verio, and UUNet exchanging routes for the prefixes 92.92.x.x, 100.100.x.x, 128.42.x.x, and 196.29.x.x]

BGP Overview (contd.)
• Path vector protocol
• Similar to distance vector routing
• Loop detection done using the AS_PATH field
[Figure: routers R1 and R2 joined by a peering session (TCP)]
• Exchange full routing table at start
• Updates sent incrementally

Key Point
The volume of BGP messages exchanged is abnormally high
• Most messages are redundant or unnecessary and do not correspond to any topology or policy changes

Consequence: Instability
• Normal data packets are handled by dedicated hardware
• BGP packet processing consumes CPU time
• Severe CPU processing overhead takes the router offline

Route Flap Storm
[Figure: routers A, B, and C]
• Router A temporarily fails
• When A comes back up, B and C send it their full routing tables
• B and C fail in turn: a cascading effect

How do we avoid or lessen the impact of these problems?
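As a small illustration of the AS_PATH loop detection mentioned in the BGP overview, the sketch below shows the basic idea: each AS prepends its own number before passing a route on, and rejects any route whose path already contains its own number. The function names and AS numbers are hypothetical and chosen only for this example; real BGP speakers do considerably more.

```python
from typing import List


def accept_route(local_as: int, as_path: List[int]) -> bool:
    """Return False if our own AS number already appears in AS_PATH (a loop)."""
    return local_as not in as_path


def advertise(local_as: int, as_path: List[int]) -> List[int]:
    """Prepend our AS number before passing the route to an external peer."""
    return [local_as] + as_path


# Hypothetical chain: AS 1 originates a route, which travels 1 -> 2 -> 3 -> 1.
path = advertise(1, [])        # route as sent by AS 1
path = advertise(2, path)      # AS 2 accepted it and passes it on
path = advertise(3, path)      # AS 3 accepted it and passes it on
loop_detected = not accept_route(1, path)  # AS 1 sees itself in the path
```

Because the whole path travels with the route, the check is a simple membership test, which is the main practical difference from plain distance vector routing.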
Route Dampening
• Router does not accept frequent route updates to a destination
• Frequent updates might signal that the network has erratic connectivity
• Increment a counter for the destination when its route changes
• When the counter exceeds a threshold, stop accepting updates
• Decrement the counter with time
Problem:
• Future legitimate announcements are accepted only after a delay

Prefix Aggregation / Super-netting
• Core router advertises a less specific network prefix
• Reduces the size of the routing tables exchanged
Problems: prefix aggregation is not effective because:
- Internet addresses are largely assigned non-hierarchically
- Domains are not renumbered when changing ISPs
- 25% of prefixes are multi-homed
- Multi-homed prefixes should be exposed at the core

Route Servers
• Without a route server: O(N) peering sessions per router
• With a route server: 1 peering session per router
[Figure: routers peering through a Route Server]

In spite of all these measures, the BGP message overhead is unexpectedly high

Evaluation Methodology
• Data from the Route Server at the Mae-East (Washington, D.C.) peering point
• Peering point for more than 60 major ISPs
• Nine-month log
• Time series analysis of message exchange events

Observation: Lots of redundant updates
• Duplicate route withdrawals

ISP   Number of withdrawals   Unique   Ratio
A                     23276     4344       5
F                     86417    12435       7
I                   2479023    14112     175

One reason:
- Stateless BGP
- No state of previous withdrawals maintained

Observation: Instability Proportional to Activity
After removing duplicate messages:
[Figure: instability density with time; axis: time of day (6:00 to 24:00); annotations: "fewer messages" regions, 10:00 AM, ISP infrastructure upgrade]

Evidence from Fine Grained Structure
[Figure: power spectral density of the number of instability events versus frequency (1/hour), with peaks marked at 7 days and 24 hours]
Conjecture: BGP packets are competing with data packets during high bandwidth activity.
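Returning to the route dampening scheme described earlier (increment a per-destination counter on each route change, stop accepting updates above a threshold, decrement the counter with time), the following is a minimal sketch of that logic. The class name, the threshold, the penalty per change, and the linear decay rate are all assumptions made for illustration; deployed route flap damping (RFC 2439) uses exponential decay and separate suppress and reuse limits.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Dampener:
    threshold: float = 3.0           # assumed: suppress above this penalty
    penalty_per_change: float = 1.0  # assumed: added on every route change
    decay_per_second: float = 0.01   # assumed: simple linear decay
    _penalty: Dict[str, float] = field(default_factory=dict)
    _last_seen: Dict[str, float] = field(default_factory=dict)

    def _decayed(self, prefix: str, now: float) -> float:
        """Penalty for prefix after decaying since the last update."""
        last = self._last_seen.get(prefix, now)
        penalty = self._penalty.get(prefix, 0.0)
        return max(0.0, penalty - self.decay_per_second * (now - last))

    def on_update(self, prefix: str, now: float) -> bool:
        """Record a route change at time `now`; return True if it is accepted."""
        penalty = self._decayed(prefix, now)
        accepted = penalty <= self.threshold
        # Suppressed updates still add to the penalty, so a route that keeps
        # flapping stays suppressed.
        self._penalty[prefix] = penalty + self.penalty_per_change
        self._last_seen[prefix] = now
        return accepted


d = Dampener()
# A hypothetical prefix flapping once per second, then going quiet.
rapid = [d.on_update("192.0.2.0/24", float(t)) for t in range(5)]
later = d.on_update("192.0.2.0/24", 300.0)
```

In this run the fifth rapid update is rejected, and the update at 300 s is accepted again only because the penalty has decayed in the meantime. That is the delay for legitimate announcements listed as the problem with dampening.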
Observation: Instability & size uncorrelated
[Figure: three panels (AADiff, WADiff, WADup), each plotting proportion of announcements against proportion of routing table]
• ISPs serving more network prefixes do not necessarily contribute more to instability

Observation: Instability distributed over routes
[Figure: cumulative proportion versus number of announcements per prefix+AS; 75% median marked at 10]
• 20% to 90% of routes change 10 times or fewer
• No single route contributes significantly to instability

Observation: Synchronized updates
• 30 s and 1 minute patterns
• Inter-arrival times of updates show periodicity
• Some routers collect and send updates once every 30 s
[Figure: inter-arrival time distribution for AADiffs (proportion), with peaks at 30 s and 1 min]
Possible reasons:
• Routers get synchronized
• Interaction between border router and internal router misconfigured?

End-to-End Perspective
Chinoy: “Dynamics of Internet routing information” (SIGCOMM 93)
Measurements on NSFNET showed:
- Processing and forwarding latency of a BGP update is 3 orders of magnitude more than the latency incurred in forwarding data packets
- This will lead to packet drops during the intervening period
Paxson: “End-to-End routing behavior in the internet” (SIGCOMM 96)
- Routing loops introduce loops into other routers' routing tables
- An end-to-end route changes every 1.5 hours on average

End-to-End Perspective (Paxson)

Pathology type               Probability in 1995   Probability in 1996
Long-lived routing loops     ~ 0.14%               same
Short-lived routing loops    ~ 0.065%              same
Outage > 30 s                0.96%                 2.2%
Total                        1.5%                  3.4%

Summary and Conclusions
• Redundant routing information flows in the core
• Instability is distributed across autonomous systems
Possible reasons for instability:
- Stateless BGP updates
- Misconfigured routers
- Synchronization
- Clocks driving the links not synchronized (link “flaps”)

Follow-up work & impact
“Origins of Internet Routing Instability” (1999)
• Migration from stateless to stateful BGP decreased duplicate withdrawals by an order of magnitude
• But duplicate announcements (AADup) doubled
• Reason: non-transitive attribute filtering not implemented
- BGP specification: “never propagate non-transitive attributes”
- AS_PATH is a transitive attribute
- MED (Multi Exit Discriminator) is NOT transitive
Propagating MEDs causes oscillations
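To make the stateless versus stateful distinction concrete, here is a minimal sketch of why remembering per-peer state removes duplicate withdrawals. Both classes, the peer names, and the message tuples are hypothetical and are not taken from any router implementation: the stateless speaker sends a withdrawal to every peer regardless of history, while the stateful one sends it only to peers that were actually sent the announcement.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Set, Tuple

Message = Tuple[str, str, str]  # (peer, "A" or "W", prefix)


class StatelessSpeaker:
    """Keeps no record of what was sent: withdraws to every peer, every time."""

    def __init__(self, peers: Iterable[str]) -> None:
        self.peers = list(peers)

    def withdraw(self, prefix: str) -> List[Message]:
        return [(p, "W", prefix) for p in self.peers]


class StatefulSpeaker:
    """Remembers which prefixes each peer currently holds from us."""

    def __init__(self, peers: Iterable[str]) -> None:
        self.peers = list(peers)
        self.advertised: DefaultDict[str, Set[str]] = defaultdict(set)

    def announce(self, prefix: str, to_peers: Iterable[str]) -> List[Message]:
        out = []
        for p in to_peers:
            self.advertised[p].add(prefix)
            out.append((p, "A", prefix))
        return out

    def withdraw(self, prefix: str) -> List[Message]:
        out = []
        for p in self.peers:
            if prefix in self.advertised[p]:  # only if the peer was told
                self.advertised[p].discard(prefix)
                out.append((p, "W", prefix))
        return out


peers = ["peerA", "peerB", "peerC"]
stateless = StatelessSpeaker(peers)
stateless_msgs = stateless.withdraw("10.0.0.0/8") + stateless.withdraw("10.0.0.0/8")

stateful = StatefulSpeaker(peers)
stateful.announce("10.0.0.0/8", ["peerA"])
stateful_msgs = stateful.withdraw("10.0.0.0/8") + stateful.withdraw("10.0.0.0/8")
```

For the same two withdrawal events, the stateless speaker emits six messages and the stateful one emits a single withdrawal to the one peer that had received the route. The cost is the extra memory for the per-peer table.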