PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services
Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang
Princeton University

Motivation
• Routing anomalies are common on the Internet
    Maintenance
    Power outage
    Fiber cut
    Misconfiguration
    …
• Anomalies can affect end-to-end performance
    Packet losses
    Packet delays
    Disconnections

Background
• Anomaly detection and diagnosis are nontrivial
    Asymmetric paths
    Failure information propagation
    Highly varied durations
    Limited coverage

Contributions
• New techniques for
    Anomaly detection
    Anomaly isolation
    Anomaly classification
• Large-scale study of anomalies
    Broad coverage
    High detection rate, low overhead
    Characterization of anomalies
    End-to-end effects
    Benefits to host service

Outline
• State of the Art
• PlanetSeer Components
    MonD: passive monitoring
    ProbeD: active probing
• Anomaly Analysis
    Loop-based anomaly
    Non-loop anomaly
• Bypassing Anomalies
• Summary

State of the Art
• Routing messages
    BGP: AS-level diagnosis
    IS-IS, OSPF: within a single ISP
• Router/link traffic statistics
    SNMP, NetFlow: proprietary
• End-to-end measurement
    Ping, traceroute

End-to-End Probing
• All-pairs probes among n nodes
    O(n^2) measurement cost
    Not scalable as n grows

Key Observation
• Combine passive monitoring with active probing
• Peer-to-Peer (P2P), Content Distribution Network (CDN)
    Large client population
    Geographically distributed nodes
    Large traffic volume
    Highly diverse paths
• The traffic generated by the services reveals information about the network.
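The O(n^2) cost noted on the End-to-End Probing slide can be made concrete with a small calculation. The sketch below counts directed probes when every node probes every other node; the function name and the sample values of n are illustrative assumptions, not part of PlanetSeer (353 is the node count reported elsewhere in this talk).

```python
def all_pairs_probe_count(n: int) -> int:
    """Directed source->destination probes among n nodes: n * (n - 1)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1)


if __name__ == "__main__":
    # Hypothetical deployment sizes; the count grows quadratically with n.
    for n in (10, 100, 353):
        print(f"{n} nodes -> {all_pairs_probe_count(n)} probes per round")
```

Passive monitoring avoids this cost by probing only the paths on which existing traffic already hints at a problem.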
Our Approach
• Host service
    CDN
• Components
    Passive monitoring
    Active probing
• Advantages
    Low overhead
    Wide coverage
[Figure: network diagram; labels: Client, C, R1, R2, B, A]

MonD: Anomaly Detection
• Anomaly indicators
    Time-to-live (TTL) change
        • Routing change
    n consecutive timeouts (n = 4 in current system)
        • Idling period of 3 to 16 seconds
        • Most congestion periods < 220 ms

ProbeD Operation
• Baseline probes
    When a new IP appears
    From local node
• Forward probes
    When a possible anomaly is detected
    From multiple nodes (including local node)
• Reprobes
    At 0.5, 1.5, 3.5 and 7.5 hours later
    From local node

ProbeD Groups
[Figure: bar chart; y-axis: Number of Groups (0 to 11); categories: US (edu), US (nonedu), Canada, Europe, Asia & MidE, Other]
• 353 nodes, 145 sites, 30 groups
    According to geographic location
    One traceroute per group

Estimating Scope
[Figure: diagram; labels: Local ProbeD, Remote ProbeD, Client, routers ra, rb, rc, rd]
• Which routers might be affected?
    Routers which possibly change their next hops
    Traceroutes from multiple locations can narrow the scope

Path Diversity
[Figure: Tier Coverage; y-axis: 0% to 100%; series: Core, Edge; categories: Tier 1 (22 ASes), Tier 2 (215 ASes), Tier 3 (1392 ASes), Tier 4 (1420 ASes), Tier 5 (13872 ASes)]
• Monitoring period: 02/2004 to 05/2004
• Unique IPs: 887,521
• Traversed ASes: 10,090

Confirming Anomalies
• Reported anomalies
    2,259,588
• Conditions
    Loops
    Route change
    Partial unreachability
    ICMP unreachable
• Very conservative confirmation
[Figure: pie chart; Undecided 22%, Anomaly 12%, Non-anomaly 66%]

Confirmed Anomaly Breakdown
• Confirmed anomalies
    271,898
    2 per minute
    100x more
• Temp anomalies
    Inconsistent probes
[Figure: pie chart; Path Change 44%, Other Outage 23%, Temp Anomalies 16%, Fwd Outage 9%, Persist Loop 7%, Temp Loop 1%]

Scope of Loops
[Figure: x-axis categories: 2, 3, 4, 5, 6+; y-axis: 0% to 100%; series: Persistent, Temp; annotations: 1% of persistent loops cross ASes, 15% of temp loops cross ASes]
• How many routers or ASes are involved?
    Temp loops involve more routers than persistent loops
    97% of persistent loops and 51% of temp loops contain 2 hops

Distribution of Loops
[Figure: bar chart; y-axis: 0% to 50%; series: Persistent, Temp, Traffic; categories: Tier 1, Tier 2, Tier 3, Tier 4, Tier 5]
• Many persistent loops in tier-3, few in tier-1
• Worst 10% of tier-1 ASes: implications for largest ISPs
    20% traffic
    35% persistent loops

Duration of Persistent Loops
[Figure: bar chart; y-axis: 0% to 60%; categories: <0.5 hrs, <1.5 hrs, <3.5 hrs, <7.5 hrs, >= 7.5 hrs]
• How long do persistent loops last?
    Either resolve quickly or last for an extended period

Scope of Forward Anomalies
[Figure: cumulative distribution; x-axis: hops (0 to 14); y-axis: fraction (0 to 1); series: change, outage; annotations: 78% of outages within 2 ASes, 57% of changes within 2 ASes]
• How many routers or ASes are affected?
    60% of outages within 1 hop
    75% of outages and 68% of changes within 4 hops

Location of Forward Anomalies
[Figure: cumulative distribution; x-axis: hops (0 to 10); y-axis: fraction (0 to 1); series: change, outage]
• How close are the anomalies to the edges of the network?
    44% of outages at the last hop
    72% of outages and 40% of changes within 4 hops

Distribution of Forward Anomalies
[Figure: bar chart; y-axis: 0% to 50%; series: Change, Outage, Traffic; categories: Tier 1, Tier 2, Tier 3, Tier 4, Tier 5]
• Which ASes are affected?
    Tier-1 ASes most stable
    Tier-3 ASes most likely to be affected

Overlay Routing
• Use alternate path when default path fails
[Figure: diagram; labels: source, intermediate, destination]

Bypassing Anomalies
[Figure: cumulative distribution; x-axis: bypass ratio (0.1 to 100, log scale); y-axis: fraction (0 to 1)]
• How useful is overlay routing for bypassing failures?
    Effective in 43% of 62,815 failures, lower than previous studies
    32% of bypass paths inflate RTTs by more than a factor of two

Summary
• Confirmed 272,000 anomalies in 3 months
• Persistent and temporary loops
    Persistent loops have narrower scope, and either resolve quickly or last for a long time
• Path outages and changes
    Outages are closer to the edge and have narrower scope
• Anomaly distribution
    Skewed. Tier-1 most stable.
    Tier-3 most problematic.
• Overlay routing
    Bypasses 43% of failures, with latency inflation

More Information
• In the paper
    More details about anomaly characteristics
    End-to-end impacts
    Classification methodology
    Optimizations to reduce overheads & improve confirmation rate
• mzhang@cs.princeton.edu
• http://www.cs.princeton.edu/nsg/infoplane

Classifying Anomalies
• Temporary vs. persistent loops
    Whether the probe exits the loop by the maximum hop
• Path changes vs. outages
    Changes: follow different paths to clients
    Outages: stop at intermediate hops
[Figure: diagram; labels: ProbeD, Client]

Non-anomalies
• Non-anomalies
    Ultrashort anomalies
    Path-based TTL
    Aggressive timeout

Identifying Forward Outages
• Forward outages
    Route change
    ICMP dest unreachable
    Forward timeout
[Figure: pie chart; Route Change 53%, Fwd Timeout 35%, ICMP Unreach 12%]

Loop Effect on RTT
• How do loops affect RTTs?
    Loops can incur high latency inflation
[Figure: cumulative distribution; x-axis: RTT (seconds), 0 to 4; y-axis: fraction (0 to 1); series: Persist loop, Persist loop normal, Temp loop, Temp loop normal]

Loop Effect on Loss Rate
• How do loops affect loss rates?
    65% of temporary and 55% of persistent loops are preceded by loss rates exceeding 30%
[Figure: cumulative distribution; x-axis: loss rate (0 to 1); y-axis: fraction (0 to 1); series: Persistent, Temp]

Forward Anomaly Effect on RTT
• How do forward anomalies affect RTTs?
    Outages and changes can incur latency inflation
    Outages have more negative effect on RTTs
[Figure: cumulative distribution; x-axis: RTT (seconds), 0 to 4; y-axis: fraction (0 to 1); series: change, change normal, outage, outage normal]

Forward Anomaly Effect on Loss Rate
• How do forward anomalies affect loss rates?
    45% of outages and 40% of changes are preceded by loss rates exceeding 30%
[Figure: cumulative distribution; x-axis: loss rate (0 to 1); y-axis: fraction (0 to 1); series: change, outage]

Reducing Measurement Overhead
• Can we reduce the number of probes?
    15 probes can achieve the same accuracy in 80% of cases
    Flow-based TTL
[Figure: cumulative distribution; x-axis: Number of Probes (0 to 30); y-axis: fraction (0 to 1)]

Traffic Breakdown By Tiers
[Figure: pie chart; Tier 1 20%, Tier 2 23%, Tier 3 24%, Tier 4 7%, Tier 5 26%]
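The two MonD anomaly indicators listed in this talk (a TTL change, or n = 4 consecutive timeouts) can be sketched as a small per-flow state machine. This is a minimal illustration, not PlanetSeer's implementation: the class and method names, the flow identifier, and the event interface are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TIMEOUT_THRESHOLD = 4  # n = 4 consecutive timeouts, as stated on the MonD slide


@dataclass
class FlowState:
    """Per-flow state kept by a hypothetical passive monitor."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


class AnomalyDetector:
    """Flags possible anomalies from passively observed flow events (assumed interface)."""

    def __init__(self, timeout_threshold: int = TIMEOUT_THRESHOLD) -> None:
        self.timeout_threshold = timeout_threshold
        self.flows: Dict[str, FlowState] = {}

    def on_packet(self, flow_id: str, ttl: int) -> Optional[str]:
        """Record a received packet; report a TTL change relative to the previous packet."""
        state = self.flows.setdefault(flow_id, FlowState())
        state.consecutive_timeouts = 0  # traffic is flowing again, so reset the counter
        changed = state.last_ttl is not None and ttl != state.last_ttl
        state.last_ttl = ttl
        return "ttl_change" if changed else None

    def on_timeout(self, flow_id: str) -> Optional[str]:
        """Record a retransmission timeout; report once when the threshold is reached."""
        state = self.flows.setdefault(flow_id, FlowState())
        state.consecutive_timeouts += 1
        if state.consecutive_timeouts == self.timeout_threshold:
            return "timeouts"
        return None
```

In a design like this, a non-None return value would be the trigger for forward probes from ProbeD; how events are actually passed between the two components is not specified in the slides.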