Network Tomography for Fault Diagnosis Renata Teixeira LIP6 Computer Laboratory

CNRS and UPMC Paris Universitas
The Internet is great, but problems happen
[Figure: the LIP6 network and networks Net1, Net2, and Net3, annotated with the questions below.]
– Is my connection ok?
– Is the problem in one of the networks in the path?
– Is it Google?
How to automatically detect and identify problems?

Current alarms are not enough
Network equipment already has many alarms
– SNMP traps
– Anomaly detection systems
But alarms may not reflect the user’s experience
– Hard to map users’ complaints to alarms
– The user’s problem may not appear as an alarm
Network admins often resort to active measurements
– Active monitoring servers inside their network
– Subscribe to third-party monitoring services
• E.g., Keynote or RIPE TTM

End-hosts can collaborate to troubleshoot problems
[Figure: the LIP6 network and networks Net1, Net2, and Net3.]
Detection: continuous path monitoring
Identification: tomography

End-host troubleshooting in two different contexts
Network admins deploy monitoring services
– Verify the performance of their networks
– Assist in troubleshooting
End-users can collaborate
– Identify and bypass problems
– Rank providers

Detection techniques
For network admins
– Deploy dedicated monitors
– Need to inject probes to measure paths
For end-users
– Monitoring at end-users’ machines
– Tapping users’ traffic is promising
Challenge: cannot continuously overload the network or the end-user’s machine to detect faults

Minimizing probing cost for detecting interface failures: Algorithms and scalability analysis
with Hung X. Nguyen (Univ. of Adelaide), Patrick Thiran (EPFL), Christophe Diot (Thomson)

Active monitoring system to detect faults
[Figure: monitors M1 and M2, target hosts T1, T2, and T3, and a target network with elements A, B, C, and D.]
Goal: detect failures of any of the interfaces in the subscriber’s network with minimum probing overhead

Simple solution: Coverage problem
[Figure: the same example network with monitors M1 and M2, targets T1, T2, and T3, and elements A, B, C, and D.]
Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network
Coverage problem is NP-hard
– Solution: greedy set-cover heuristic

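The greedy set-cover heuristic named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the path names, and the interface sets are all made up for the example.

```python
def greedy_path_cover(paths):
    """Pick paths until every interface seen on any path is covered.

    `paths` maps a path name to the set of interfaces it traverses.
    """
    uncovered = set().union(*paths.values())
    chosen = []
    while uncovered:
        # Take the path that covers the most still-uncovered interfaces;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        chosen.append(best)
        uncovered -= paths[best]
    return chosen


# Hypothetical input: path names and interface sets are illustrative only.
example_paths = {
    "M1-T1": {"A", "C"},
    "M1-T2": {"A", "D"},
    "M2-T2": {"B", "D"},
    "M2-T3": {"B"},
}
selected = greedy_path_cover(example_paths)
print(selected)
```

Here two of the four paths are enough to touch all four interfaces, so only those two would be probed.
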
Coverage solution doesn’t detect all types of failures
Detects fail-stop failures
– Failures that affect all packets that traverse the faulty interface
• E.g., interface or router crashes, fiber cuts, bugs
But not path-specific failures
– Failures that affect only a subset of paths that cross the faulty interface
• E.g., router misconfigurations

New formulation of failure detection problem
Select the frequency to probe each path
– Lower frequency per-path probing can achieve a high frequency probing of each interface
[Figure: the example network with monitors M1 and M2, targets T1, T2, and T3, and elements A, B, C, and D, labeled "1 every 9 mins" and "1 every 3 mins".]

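The arithmetic behind "1 every 9 mins" versus "1 every 3 mins" can be illustrated with a small sketch. It assumes, hypothetically, that three paths cross the same interface and that each path is probed once every 9 minutes; the path and interface names are made up.

```python
from fractions import Fraction


def interface_probe_rates(path_rates, path_interfaces):
    """Sum the probing rates of all paths that cross each interface."""
    rates = {}
    for path, rate in path_rates.items():
        for iface in path_interfaces[path]:
            rates[iface] = rates.get(iface, Fraction(0)) + rate
    return rates


# Assumption: three hypothetical paths all cross interface "A".
path_interfaces = {"p1": {"A"}, "p2": {"A"}, "p3": {"A"}}
path_rates = {p: Fraction(1, 9) for p in path_interfaces}  # probes per minute

rates = interface_probe_rates(path_rates, path_interfaces)
minutes_between_probes = 1 / rates["A"]
print(rates["A"], minutes_between_probes)
```

The interface is only probed at regular 3-minute intervals if the three per-path probes are evenly staggered, which is one reason coordination among monitors matters.
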
Properties of solution
Failure detection problem is no longer NP-hard
– Can find optimal solution using linear programming
– Parameters: Duration of path-specific and fail-stop failures
Needs synchronization among monitors
– Monitors need to collaborate to probe an interface
– Alternative probabilistic solution avoids synchronization overhead
Probing cost scales almost linearly with the size of the target network
– In random power-law graphs like inferred internet graphs

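One plausible way to write such a linear program is sketched below. The notation is an assumption made for illustration and is not taken from the talk: \lambda_p is the probing rate of path p, P is the set of paths, I is the set of interfaces, P_i is the set of paths that cross interface i, and T_ps and T_fs are the durations of path-specific and fail-stop failures that must be detected. The authors' actual formulation may differ.

```latex
\begin{aligned}
\min_{\lambda}\quad & \sum_{p \in P} \lambda_p \\
\text{s.t.}\quad & \lambda_p \ge \frac{1}{T_{\mathrm{ps}}} && \forall p \in P \\
& \sum_{p \in P_i} \lambda_p \ge \frac{1}{T_{\mathrm{fs}}} && \forall i \in I
\end{aligned}
```

Under this reading, every path is probed often enough to catch a path-specific failure, while the paths crossing an interface jointly probe it often enough to catch a fail-stop failure.
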
Evaluation
Paths obtained using traceroutes
– From 750 PlanetLab nodes to 3,000 DNS servers
– From 12 RON nodes to 60,000 targets
Target networks are probed ASes
– Map IPs to ASes using Mao et al.’s technique
– 1,366 ASes in PlanetLab
– 6,517 ASes in RON
Compute probing costs varying parameters
– Set of paths, failure durations, target network

Probing costs varying size of subscriber network in PlanetLab
[Plot not reproduced. Failure durations: path-specific = 1000 sec, fail-stop = 1 sec.]

Summary
Practical formulation of failure detection problem
– Incorporates both fail-stop and path-specific failures
Solution minimizes probing cost
– Using linear programming
Probing scales almost linearly with network size
– Inferred internet graphs are among the most expensive to probe
Next step
– Deploy a system based on these probing techniques

ConnectionWatch: Passive monitoring of round-trip times at end-hosts
with Diana Zeaiter Joumblatt (LIP6), Nina Taft (Intel)

Goal
Automatic detection of performance degradations
– Only care about problems that impact applications
– Focus on detecting “large” round-trip times (RTT)
– Detection should be fast and lightweight

ConnectionWatch
[Figure: ConnectionWatch architecture. Components shown: sniffer (TCP packets), packet trace, extract flow ID, RTT estimation, flow statistics, high RTT detector (alarms), ping daemon, upload to central server.]

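The slides do not say how the RTT estimation step works. One common passive approach, shown here only as an assumed sketch and not necessarily what ConnectionWatch does, is to time the gap between an outgoing SYN and the matching SYN-ACK. The `Packet` type and its fields are hypothetical stand-ins for whatever the sniffer produces.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    time: float  # capture timestamp in seconds
    src: str     # "ip:port" of the sender
    dst: str     # "ip:port" of the receiver
    flags: str   # "S" for SYN, "SA" for SYN-ACK


def flow_id(pkt):
    """Direction-independent flow ID: the sorted pair of endpoints."""
    return tuple(sorted((pkt.src, pkt.dst)))


def handshake_rtts(packets):
    """Map each flow ID to the SYN -> SYN-ACK delay seen at the capture point."""
    syn_times, rtts = {}, {}
    for pkt in packets:
        fid = flow_id(pkt)
        if pkt.flags == "S":
            syn_times.setdefault(fid, pkt.time)  # keep the first SYN only
        elif pkt.flags == "SA" and fid in syn_times and fid not in rtts:
            rtts[fid] = pkt.time - syn_times[fid]
    return rtts


# Made-up two-packet trace for one connection.
trace = [
    Packet(1.0, "10.0.0.1:5000", "192.0.2.7:80", "S"),
    Packet(1.25, "192.0.2.7:80", "10.0.0.1:5000", "SA"),
]
rtts = handshake_rtts(trace)
print(rtts)
```

Keeping only the first SYN means a retransmitted SYN inflates the estimate, so a fuller version would need to handle retransmissions.
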
Insights from preliminary experiments
Datasets from five students during three days
– 44,715 TCP connections over 3,584 paths to 2,242 IPs
Some observations
– More complete measurements than ping
• 16.5% of 1,072 addresses don’t reply to pings
– Transfer of traces to server is main bottleneck
Hurdles
– Portability of system to other OSes
– Privacy concerns with capturing user’s traffic
– Incentives for large-scale deployment

Which RTT variations correspond to performance degradations?
Our datasets are still too small to answer
– Performance degradations are rare events
Simple technique based on outlier threshold
– What is a good threshold?
– Should the threshold be for all users, per user, per path, per app?
Do outliers correspond to real performance degradations?
– ConnectionWatch should get user’s feedback
• “I’m annoyed” button

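An outlier threshold of the kind mentioned above could look like the sketch below. The rule used here, median plus k times the median absolute deviation (MAD), and the value k = 3 are assumptions chosen for illustration; the slides leave both the threshold and its scope (all users, per user, per path, per app) as open questions.

```python
from statistics import median


def rtt_outliers(rtts_ms, k=3.0):
    """Return the RTT samples above median + k * MAD."""
    med = median(rtts_ms)
    mad = median(abs(x - med) for x in rtts_ms)
    threshold = med + k * mad
    return [x for x in rtts_ms if x > threshold]


# Made-up RTT samples (ms) for a single path.
samples = [20, 22, 21, 19, 23, 20, 180]
flagged = rtt_outliers(samples)
print(flagged)
```

When all samples are nearly identical the MAD is zero and anything above the median is flagged, so a practical detector would likely also need a minimum absolute margin.
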
Practical issues with using network tomography for fault diagnosis
with Italo Scota Cunha (LIP6, Thomson), Amogh Dhamdhere, Yiyi Huang, Nick Feamster, Constantine Dovrolis (Georgia Tech), Christophe Diot (Thomson)

The binary tomography solution by Duffield
[Figure: tree with nodes labeled m, t1, and t2.]
Given
– Complete network topology
– End-to-end reachability measurements
Find the smallest set of links that explains the observations
– Assumes single-source tree, access to targets

Extending binary tomography
Multi-network setting: topology not known
– Periodic traceroutes determine topologies
Extension to multiple-sources, multiple-targets
– Minimum hitting set problem (NP-hard)
Tomo
– Iterative poly-time greedy heuristic
Intuition: Iteratively choose the link that explains the max number of failures

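The greedy intuition above can be sketched as a hitting-set heuristic. This is an illustrative simplification, not the Tomo implementation: the function name and link names are made up, and the use of working paths to exonerate links is an assumption about how reachability measurements would be applied.

```python
def greedy_hitting_set(failed_paths, working_paths=()):
    """Blame links until every failed path contains a blamed link.

    failed_paths:  maps a path name to the set of links on that path.
    working_paths: iterable of link sets for paths that still work;
                   links on these paths are treated as healthy.
    """
    good = set().union(*working_paths)
    unexplained = {p: set(links) - good for p, links in failed_paths.items()}
    # Drop failed paths whose links were all exonerated (inconsistent input).
    unexplained = {p: links for p, links in unexplained.items() if links}
    blamed = []
    while unexplained:
        counts = {}
        for links in unexplained.values():
            for link in links:
                counts[link] = counts.get(link, 0) + 1
        # Link on the most unexplained failed paths; ties broken by name.
        best = max(sorted(counts), key=counts.get)
        blamed.append(best)
        unexplained = {p: l for p, l in unexplained.items() if best not in l}
    return blamed


# Hypothetical observations: two failed paths share the link "c-t1".
failed = {
    "M1-T1": {"m1-a", "a-c", "c-t1"},
    "M2-T1": {"m2-b", "b-c", "c-t1"},
}
working = [{"m1-a", "a-d", "d-t2"}]
blamed = greedy_hitting_set(failed, working)
print(blamed)
```

A single shared link explains both failures here, so the heuristic blames it and stops.
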
Some problems
Dynamics
– Loss can be transient, topology can change
Ambiguity
– Losses are one-way but don’t always have access to both ends of the path
Lack of synchronization
– Different monitors see different conditions

Approach
Transient losses
– Triggered confirmation of failed paths
Dynamic routing
– Periodic snapshots of the network topology
One-way packet loss
– Algorithm based on IP spoofing
Lack of synchronization
– Correlation of probes from different monitors

Failure confirmation
Upon detection of a failure, trigger extra probes
Number of probes
– Confirm failures with a target false positive rate
– Assume independence and a given loss rate
Time between probes
– Reduce chance that probes fall on the same loss burst
– Assume link losses follow a Gilbert process
[Figure: timeline of packets on a path, with a loss burst and a false positive marked.]

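Both choices above can be illustrated with a short calculation. This is a sketch under stated assumptions, not the authors' procedure: the first function assumes independent losses, and the second uses a discrete-time two-state Gilbert chain in which every packet sent in the bad state is lost and none in the good state. All parameter values are made up.

```python
import math


def probes_needed(loss_rate, target_fp):
    """Smallest n with loss_rate ** n <= target_fp, for independent losses.

    Both arguments must lie strictly between 0 and 1.
    """
    return math.ceil(math.log(target_fp) / math.log(loss_rate))


def prob_lost_again(p_gb, p_bg, steps):
    """P(probe is lost `steps` slots after a lost probe) in a Gilbert chain.

    p_gb: per-slot probability of moving from the good to the bad state.
    p_bg: per-slot probability of moving from the bad to the good state.
    """
    stationary_bad = p_gb / (p_gb + p_bg)
    return stationary_bad + (1 - stationary_bad) * (1 - p_gb - p_bg) ** steps


# Hypothetical numbers: 20% loss rate, 1% target false positive rate.
n = probes_needed(0.2, 0.01)

# Hypothetical chain with 10% long-run loss and bursty losses.
back_to_back = prob_lost_again(0.01, 0.09, 1)
spaced_out = prob_lost_again(0.01, 0.09, 50)
print(n, back_to_back, spaced_out)
```

With these made-up parameters a probe sent right after a loss is very likely to be lost too, while a well-spaced probe sees a loss probability close to the long-run rate, which is the reason for spacing confirmation probes.
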
Disambiguating one-way losses: Spoofing
Monitor sends request to spoofer to send probe
Probe carries the IP address of the monitor as its source address
If reply reaches the monitor, reverse path is working
[Figure: monitor M, target T, and a spoofer that sends a spoofed packet with the source address of M.]

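The reasoning on this slide can be written as a small decision function. It is a sketch with made-up names and labels, and it assumes the path from the spoofer to the target is itself working, which the slide does not state.

```python
def classify_loss(direct_reply, spoofed_reply):
    """Classify the path between monitor M and target T.

    direct_reply:  M probed T itself and received a reply.
    spoofed_reply: the spoofer probed T with M's source address and
                   the reply reached M.
    """
    if direct_reply:
        return "path working"
    if spoofed_reply:
        # T's reply reached M, so T -> M works; the loss is on M -> T.
        return "forward path (M -> T) failing"
    # Nothing came back: T -> M may be failing, but this sketch cannot
    # rule out a dead target or a broken spoofer -> T path.
    return "reverse path (T -> M) suspected"


print(classify_loss(direct_reply=False, spoofed_reply=True))
```
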
Evaluation
Evaluation is challenging
– Need ground truth and realistic environment
Controlled experiments on the VINI testbed
– Allow us to inject failures
– Problem: hard to argue about false positives
Experiments on Emulab
– More control: dedicated nodes and links
– Emulate the Abilene network
– Selected LA and NY as monitors

Failure confirmation reduces false positives
Emulab experiment setup
– 10% loss rates in each direction
– No persistent failures
Both schemes use three probes to confirm a failure
False positives by confirmation interval and burst factor:
Confirmation interval   Burst factor 90%   Burst factor 96%
Back-to-back            15%                25%
0.2 secs                0.8%               0.8%
Low false positives, because an interval of 0.2 secs guarantees a small probability of probes being correlated

Correlation is important to get a consistent view
Emulab and VINI experiments with short failures
– More false positives
– Lower detection rate
In real deployments, can we get a consistent view?
– More noise because of losses and routing dynamics
– Monitors are less synchronized
– Monitors may not be able to reach the coordinator
Next steps
– Online correlation
– Minimize communication with coordinator

Summary
Continuous monitoring for detection
– At management hosts: active measurements
• Reduce probing overhead, still detect failures
– At end-users: passive measurements
• Lightweight detection of problems that affect apps
Network tomography for identification
– Many challenges to get consistent inputs for tomography
• Network dynamics and transient losses
• Ambiguity of forward and reverse failures
• Monitors may observe different conditions