PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services
Ming Zhang, Chi Zhang
Vivek Pai, Larry Peterson, Randy Wang
Princeton University
Motivation
• Routing anomalies are common on the Internet
  ▪ Maintenance
  ▪ Power outage
  ▪ Fiber cut
  ▪ Misconfiguration
  ▪ …
• Anomalies can affect end-to-end performance
  ▪ Packet losses
  ▪ Packet delays
  ▪ Loss of connectivity
Background
• Anomaly detection and diagnosis are nontrivial
  ▪ Asymmetric paths
  ▪ Failure information propagation
  ▪ Highly varied durations
  ▪ Limited coverage
Contributions
• New techniques for
  ▪ Anomaly detection
  ▪ Anomaly isolation
  ▪ Anomaly classification
• Large-scale study of anomalies
  ▪ Broad coverage
  ▪ High detection rate, low overhead
  ▪ Characterization of anomalies
  ▪ End-to-end effects
  ▪ Benefits to host service
Outline
• State of the Art
• PlanetSeer Components
  ▪ MonD – passive monitoring
  ▪ ProbeD – active probing
• Anomaly Analysis
  ▪ Loop-based anomaly
  ▪ Non-loop anomaly
• Bypassing Anomalies
• Summary
State of the Art
• Routing messages
  ▪ BGP: AS-level diagnosis
  ▪ IS-IS, OSPF: within a single ISP
• Router/link traffic statistics
  ▪ SNMP, NetFlow: proprietary
• End-to-end measurement
  ▪ Ping, traceroute
End-to-End Probing
• All-pairs probes among n nodes
  ▪ O(n^2) measurement cost
  ▪ Not scalable as n grows
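To make the quadratic growth concrete, the short sketch below counts ordered (source, destination) pairs for a few values of n. Treating each ordered pair as one probe is a simplifying assumption, and the function name is hypothetical; 353 is the node count reported for the ProbeD deployment in this talk.

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of ordered (source, destination) pairs among n nodes."""
    return n * (n - 1)


# Growth in probe count as the number of nodes increases.
for n in (10, 100, 353):
    print(n, all_pairs_probe_count(n))
```

With 353 nodes this already gives 124,256 ordered pairs per probing round.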
Key Observation
• Combine passive monitoring with active probing
• Peer-to-Peer (P2P), Content Distribution Network (CDN)
  ▪ Large client population
  ▪ Geographically distributed nodes
  ▪ Large traffic volume
  ▪ Highly diverse paths
• The traffic generated by the services reveals information about the network.
Our Approach
• Host service
  ▪ CDN
• Components
  ▪ Passive monitoring
  ▪ Active probing
• Advantages
  ▪ Low overhead
  ▪ Wide coverage
[Figure: diagram with labels Client, C, R1, R2, A, B]
MonD: Anomaly Detection
• Anomaly indicators
  ▪ Time-to-live (TTL) change
    • Routing change
  ▪ n consecutive timeouts (n = 4 in current system)
    • Idling period of 3 to 16 seconds
    • Most congestion periods < 220 ms
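The two indicators above can be sketched as a small per-flow state machine. This is only an illustration, not MonD's actual code: the names `FlowState`, `on_packet`, and `on_timeout` are hypothetical, and resetting the timeout counter whenever a packet arrives is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_THRESHOLD = 4  # n consecutive timeouts (n = 4 per the slide)


@dataclass
class FlowState:
    """Per-flow state kept by a hypothetical passive monitor."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


def on_packet(state: FlowState, ttl: int) -> Optional[str]:
    """Record a received packet; report a TTL change if one is seen."""
    state.consecutive_timeouts = 0  # assumption: progress resets the counter
    changed = state.last_ttl is not None and ttl != state.last_ttl
    state.last_ttl = ttl
    return "ttl_change" if changed else None


def on_timeout(state: FlowState) -> Optional[str]:
    """Record a timeout; report once the threshold is reached."""
    state.consecutive_timeouts += 1
    if state.consecutive_timeouts == TIMEOUT_THRESHOLD:
        return "timeouts"
    return None


flow = FlowState()
events = [on_packet(flow, 52), on_packet(flow, 52), on_packet(flow, 49)]
events += [on_timeout(flow) for _ in range(4)]
print(events)
```

Either returned indicator would mark a possible anomaly worth handing to active probing.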
ProbeD Operation
• Baseline probes
  ▪ When a new IP appears
  ▪ From local node
• Forward probes
  ▪ When a possible anomaly is detected
  ▪ From multiple nodes (including local node)
• Reprobes
  ▪ At 0.5, 1.5, 3.5 and 7.5 hours later
  ▪ From local node
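The reprobe times of 0.5, 1.5, 3.5 and 7.5 hours are consistent with a gap that doubles after each reprobe (0.5, 1, 2, 4 hours). The slides do not state this rule, so the sketch below is an inference, and `reprobe_offsets` is a hypothetical helper.

```python
from typing import List


def reprobe_offsets(first_gap_hours: float = 0.5, count: int = 4) -> List[float]:
    """Cumulative reprobe times, assuming the gap doubles after each reprobe."""
    offsets = []
    elapsed = 0.0
    gap = first_gap_hours
    for _ in range(count):
        elapsed += gap
        offsets.append(elapsed)
        gap *= 2
    return offsets


print(reprobe_offsets())  # [0.5, 1.5, 3.5, 7.5]
```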
ProbeD Groups
[Figure: bar chart of the number of groups (0 to 11) per region: US (edu), US (nonedu), Canada, Europe, Asia & Middle East, Other]
• 353 nodes, 145 sites, 30 groups
  ▪ According to geographic location
  ▪ One traceroute per group
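One way to read "one traceroute per group" together with forward probes coming from multiple nodes including the local one is sketched below. The selection rule (first listed member of each group) and the names `pick_probers`, `groups`, and `local_node` are assumptions for illustration only.

```python
from typing import Dict, List


def pick_probers(groups: Dict[str, List[str]], local_node: str) -> List[str]:
    """Choose one node per group, always including the local node."""
    chosen = [local_node]
    for name in sorted(groups):
        members = groups[name]
        # The local node already covers its own group; skip empty groups.
        if local_node in members or not members:
            continue
        chosen.append(members[0])  # assumption: any member represents the group
    return chosen


example = {"europe": ["e1", "e2"], "us_edu": ["u1", "u2"], "asia": ["a1"]}
print(pick_probers(example, "u2"))  # ['u2', 'a1', 'e1']
```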
Estimating Scope
[Figure: local ProbeD and remote ProbeD probing a client; routers labeled ra, rb, rc, rd]
• Which routers might be affected?
  ▪ Routers which possibly change their next hops
  ▪ Traceroutes from multiple locations can narrow the scope
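A minimal sketch of the narrowing idea, under stated assumptions: every router on the baseline path starts as a suspect, and a router is cleared once some traceroute shows it still forwarding to its baseline next hop. This is an illustrative rule, not necessarily PlanetSeer's exact one; `estimate_scope` is hypothetical, and `rx`, `ry` are made-up routers (`ra` to `rd` come from the figure labels).

```python
from typing import Dict, List, Optional, Set

Path = List[str]


def next_hops(path: Path) -> Dict[str, Optional[str]]:
    """Map each router on a path to the hop after it (None for the last hop)."""
    return {hop: (path[i + 1] if i + 1 < len(path) else None)
            for i, hop in enumerate(path)}


def estimate_scope(baseline: Path, observed: List[Path]) -> Set[str]:
    """Baseline routers not seen keeping their baseline next hop."""
    base_next = next_hops(baseline)
    suspects = set(baseline[:-1])  # the final hop has no next hop to change
    for path in observed:
        for router, nxt in next_hops(path).items():
            if router in suspects and base_next[router] == nxt:
                suspects.discard(router)
    return suspects


baseline = ["ra", "rb", "rc", "rd", "client"]
local = ["ra", "rb", "rx", "client"]   # traceroute from the local node
remote = ["ry", "rd", "client"]        # traceroute from a remote node
print(sorted(estimate_scope(baseline, [local])))          # ['rb', 'rc', 'rd']
print(sorted(estimate_scope(baseline, [local, remote])))  # ['rb', 'rc']
```

Adding the remote traceroute clears `rd`, which is the narrowing effect the slide describes.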
Path Diversity
[Figure: tier coverage (0% to 100%) per AS tier, with Core and Edge labels: Tier 1 (22 ASes), Tier 2 (215 ASes), Tier 3 (1392 ASes), Tier 4 (1420 ASes), Tier 5 (13872 ASes)]
• Monitoring Period: 02/2004 – 05/2004
• Unique IPs: 887,521
• Traversed ASes: 10,090
Confirming Anomalies
• Reported anomalies
  ▪ 2,259,588
• Conditions
  ▪ Loops
  ▪ Route change
  ▪ Partial unreachability
  ▪ ICMP unreachable
• Very conservative confirmation
[Figure: pie chart of reported anomalies: Nonanomaly 66%, Undecided 22%, Anomaly 12%]
Confirmed Anomaly Breakdown
• Confirmed anomalies
  ▪ 271,898
  ▪ 2 per minute
  ▪ 100x more
• Temp anomalies
  ▪ Inconsistent probes
[Figure: pie chart of confirmed anomalies: Path change 44%, Other outage 23%, Temp anomalies 16%, Fwd outage 9%, Persistent loop 7%, Temp loop 1%]
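As a quick arithmetic check, the reported and confirmed counts from the slides give the confirmed share and the per-minute rate. The 90-day length is an assumption standing in for the roughly three-month monitoring period.

```python
REPORTED = 2_259_588   # reported anomalies (from the slides)
CONFIRMED = 271_898    # confirmed anomalies (from the slides)
ASSUMED_DAYS = 90      # assumption: "3 months" taken as 90 days

confirmed_share = CONFIRMED / REPORTED
per_minute = CONFIRMED / (ASSUMED_DAYS * 24 * 60)

print(f"{confirmed_share:.1%} confirmed, {per_minute:.1f} per minute")
```

Under that assumption the figures agree with the 12% share and the rate of about 2 per minute quoted on the slides.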
Scope of Loops
[Figure: distribution (0% to 100%) of persistent and temporary loops over the bins 2, 3, 4, 5, 6+. Annotations: 1% of persistent loops cross ASes; 15% of temp loops cross ASes]
• How many routers or ASes are involved?
  ▪ Temp loops involve more routers than persistent loops
  ▪ 97% of persistent loops and 51% of temp loops contain 2 hops
Distribution of Loops
[Figure: share (0% to 50%) of persistent loops, temp loops, and traffic across Tier 1 to Tier 5]
• Many persistent loops in tier-3, few in tier-1
• Worst 10% of tier-1 ASes – implications for largest ISPs
  ▪ 20% traffic
  ▪ 35% persistent loops
Duration of Persistent Loops
[Figure: distribution (0% to 60%) of persistent loop durations over the bins <0.5 hrs, <1.5 hrs, <3.5 hrs, <7.5 hrs, >= 7.5 hrs]
• How long do persistent loops last?
  ▪ Either resolve quickly or last for an extended period
Scope of Forward Anomalies
[Figure: fraction (0 to 1) of changes and outages versus hops (0 to 14). Annotations: 78% of outages within 2 ASes; 57% of changes within 2 ASes]
• How many routers or ASes are affected?
  ▪ 60% of outages within 1 hop
  ▪ 75% of outages and 68% of changes within 4 hops
Location of Forward Anomalies
[Figure: fraction (0 to 1) of changes and outages versus hops (0 to 10)]
• How close are the anomalies to the edges of the network?
  ▪ 44% of outages at the last hop
  ▪ 72% of outages and 40% of changes within 4 hops
Distribution of Forward Anomalies
[Figure: share (0% to 50%) of changes, outages, and traffic across Tier 1 to Tier 5]
• Which ASes are affected?
  ▪ Tier-1 ASes most stable
  ▪ Tier-3 ASes most likely to be affected
Overlay Routing
• Use alternate path when default path fails
[Figure: diagram with source, intermediate, and destination nodes]
Bypassing Anomalies
[Figure: fraction (0 to 1) versus bypass ratio (logarithmic axis, 0.1 to 100)]
• How useful is overlay routing for bypassing failures?
  ▪ Effective in 43% of 62,815 failures, lower than previous studies
  ▪ 32% of bypass paths inflate RTTs by more than a factor of two
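A minimal sketch of choosing a detour when the default path fails. It assumes the bypass ratio is the RTT of the bypass path divided by the baseline RTT of the default path, which the slides do not define explicitly; `pick_bypass` and the node names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def pick_bypass(baseline_rtt_ms: float,
                via_rtts_ms: Dict[str, Optional[float]]
                ) -> Optional[Tuple[str, float]]:
    """Return (intermediate, bypass_ratio) for the fastest working detour.

    A value of None in via_rtts_ms means the destination was unreachable
    through that intermediate node.
    """
    reachable = {node: rtt for node, rtt in via_rtts_ms.items()
                 if rtt is not None}
    if not reachable:
        return None  # overlay routing cannot bypass this failure
    best = min(reachable, key=reachable.get)
    return best, reachable[best] / baseline_rtt_ms


print(pick_bypass(80.0, {"node_a": None, "node_b": 200.0, "node_c": 120.0}))
# ('node_c', 1.5)
```

A ratio above 2 would correspond to the "more than a factor of two" RTT inflation mentioned above.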
Summary
• Confirm 272,000 anomalies in 3 months
• Persistent and temporary loops
  ▪ Persistent loops have narrower scope and either resolve quickly or last for a long time
• Path outages and changes
  ▪ Outages are closer to the edge and have narrower scope
• Anomaly distribution
  ▪ Skewed: tier-1 most stable, tier-3 most problematic
• Overlay routing
  ▪ Bypasses 43% of failures, with latency inflation
More Information
• In the paper
  ▪ More details about anomaly characteristics
  ▪ End-to-end impacts
  ▪ Classification methodology
  ▪ Optimizations to reduce overheads & improve confirmation rate
• mzhang@cs.princeton.edu
• http://www.cs.princeton.edu/nsg/infoplane
Classifying Anomalies
• Temporary vs. persistent loops
  ▪ Whether probes exit the loop before reaching the maximum hop
• Path changes vs. outages
  ▪ Changes: probes follow different paths to clients
  ▪ Outages: probes stop at intermediate hops
[Figure: ProbeD probing a client]
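The classification rules above can be sketched on a traceroute result given as a list of router names. This is a simplified illustration, not the paper's methodology: the hop limit of 30, the function names, and the test for "still looping at the hop limit" are assumptions.

```python
from typing import List

MAX_HOPS = 30  # assumed traceroute hop limit


def has_loop(hops: List[str]) -> bool:
    """True if any router appears more than once in the trace."""
    return len(set(hops)) < len(hops)


def classify(hops: List[str], baseline: List[str], client: str,
             max_hops: int = MAX_HOPS) -> str:
    """Label a trace as a loop, outage, path change, or no anomaly."""
    reached = bool(hops) and hops[-1] == client
    if has_loop(hops):
        # Persistent: the probe never left the loop before the hop limit.
        stuck = not reached and len(hops) >= max_hops
        return "persistent_loop" if stuck else "temporary_loop"
    if not reached:
        return "outage"  # probes stop at an intermediate hop
    return "path_change" if hops != baseline else "no_anomaly"


baseline = ["r1", "r2", "r3", "c"]
print(classify(["r1", "r4", "r3", "c"], baseline, "c"))   # path_change
print(classify(["r1", "r2"], baseline, "c"))              # outage
print(classify(["r1", "r2"] * 15, baseline, "c"))         # persistent_loop
```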
Non-anomalies
• Non-anomalies
  ▪ Ultrashort anomalies
  ▪ Path-based TTL
  ▪ Aggressive timeout
Identifying Forward Outages
• Forward outages
  ▪ Route change
  ▪ ICMP dest unreachable
  ▪ Forward timeout
[Figure: pie chart of forward outages: Route change 53%, Fwd timeout 35%, ICMP unreachable 12%]
Loop Effect on RTT
• How do loops affect RTTs?
  ▪ Loops can incur high latency inflation
[Figure: fraction (0 to 1) versus RTT in seconds (0 to 4); curves: persist loop, persist loop normal, temp loop, temp loop normal]
Loop Effect on Loss Rate
• How do loops affect loss rates?
  ▪ 65% of temporary and 55% of persistent loops preceded by loss rates exceeding 30%
[Figure: fraction (0 to 1) versus loss rate (0 to 1); curves: persistent, temp]
Forward Anomaly Effect on RTT
• How do forward anomalies affect RTTs?
  ▪ Outages and changes can incur latency inflation
  ▪ Outages have more negative effect on RTTs
[Figure: fraction (0 to 1) versus RTT in seconds (0 to 4); curves: change, change normal, outage, outage normal]
Forward Anomaly Effect on Loss Rate
• How do forward anomalies affect loss rates?
  ▪ 45% of outages and 40% of changes preceded by loss rates exceeding 30%
[Figure: fraction (0 to 1) versus loss rate (0 to 1); curves: change, outage]
Reducing Measurement Overhead
• Can we reduce the number of probes?
  ▪ 15 probes can achieve the same accuracy in 80% of cases
  ▪ Flow-based TTL
[Figure: fraction (0 to 1) versus number of probes (0 to 30)]
Traffic Breakdown By Tiers
[Figure: pie chart of traffic by tier: Tier 1 20%, Tier 2 23%, Tier 3 24%, Tier 4 7%, Tier 5 26%]