Network Tomography for Fault Diagnosis Renata Teixeira LIP6 Computer Laboratory CNRS and UPMC Paris Universitas The Internet is great, but problems happen Net2 Net1 LIP6 network Is it google? Net3 Is the problem in one of the networks in path? Is my connection ok? How to automatically detect and identify problems? 1 Current alarms are not enough Network equipments already have many alarms – SNMP traps – Anomaly detection systems But, alarms may not reflect user’s experience – Hard to map users’ complaints to alarms – The user’s problem may not appear as an alarm Network admins often resort to active measurements – Active monitoring servers inside their network – Subscribe to third-party monitoring services • Eg. Keynote or RIPE TTM 2 End-hosts can collaborate to troubleshoot problems Net2 Net1 LIP6 network Net3 Detection: continuous path monitoring Identification: tomography 3 End-host troubleshooting in two different contexts Network admins deploy monitoring services – Verify the performance of their networks – Assist in troubleshooting End-users can collaborate – Identify and bypass problems – Rank providers 4 Detection techniques For – – network admins For Deploy dedicated monitors – Need to inject probes to measure paths – end-users Monitoring at endusers’ machine Tapping users’ traffic is promising Challenge cannot continuously overload the network or end-user’s machine to detect faults 5 Minimizing probing cost for detecting interface failures: Algorithms and scalability analysis with Hung X. Nguyen (Univ. of Adelaide) Patrick Thiran (EPFL) Christophe Diot (Thomson) Active monitoring system to detect faults M1 T1 target network C A T2 D T3 B monitors M2 target hosts Goal detect failures of any of the interfaces in the subscriber’s network with minimum probing overhead 7 Simple solution: Coverage problem T1 M1 C A T2 D T3 B M2 Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network Coverage problem is NP-hard – Solution: greedy set-cover heuristic 8 Coverage solution doesn’t detect all types of failures Detects – Failures that affect all packets that traverse the faulty interface • But – fail-stop failures Eg., interface or router crashes, fiber cuts, bugs not path-specific failures Failures that affect only a subset of paths that cross the faulty interface • Eg., router misconfigurations 9 New formulation of failure detection problem Select – the frequency to probe each path Lower frequency per-path probing can achieve a high frequency probing of each interface T1 1 every 9 mins M1 1 every 3 mins C A T2 D T3 B M2 10 Properties of solution Failure detection problem is no longer NP-hard – Can find optimal solution using linear programming – Parameters: Duration of path-specific and fail-stop failures Needs synchronization among monitors – – Monitors need collaborate to probe an interface Alternative probabilistic solution avoids synchronization overhead Probing cost scales almost linearly with the size of the target network – In random power-law graphs like inferred internet graphs 11 Evaluation Paths obtained using traceroutes – From 750 PlanetLab nodes to 3,000 DNS servers – From 12 RON nodes to 60,000 targets Target networks are probed ASes – Map IPs to ASes using Mao et al.’s technique – 1,366 ASes in PlanetLab – 6,517 ASes in RON Compute probing costs varying parameters – Set of paths, failure durations, target network 12 Probing costs varying size of subscriber network in PlanetLab Duration Path-specific = 1000 sec Fail-stop = 1 sec 13 Summary Practical formulation of failure detection problem – Solution minimizes probing cost – Using linear programming Inferred internet graphs are among the most expensive to probe – Incorporates both fail-stop and path-specific failures Probing scales almost linearly with network size Next step – Deploy a system based on these probing techniques 14 ConnectionWatch: Passive monitoring of round-trip times at end-hosts with Diana Zeaiter Joumblatt (LIP6) Nina Taft (Intel) Goal Automatic detection of performance degradations – Only care about problems that impact applications – Focus on detecting “large” round-trip times (RTT) – Detection should be fast and lightweight 16 ConnectionWatch Upload to central server Packet Trace Ping Daemon Flow statistics Sniffer Extract flow ID RTT estimation TCP packets 17 Alarms High RTT detector Insights from preliminary experiments Datasets from five students during three days – 44,715 TCP connections over 3,584 paths to 2,242 IPs Some observations – More complete measurements than ping • – 16.5% of 1,072 addresses don’t reply to pings Transfer of traces to server is main bottleneck Hurdles – Portability of system to other OSes – Privacy concerns with capturing user’s traffic – Incentives for large-scale deployment 18 Which RTT variations correspond to performance degradations? Our datasets are still too small to answer – Simple technique based on outlier threshold – – Performance degradations are rare events What is a good threshold? Should it the threshold be for all users, per user, per path, per app? Do outliers correspond to real performance degradations? – ConnectionWatch should get user’s feedback • “I’m annoyed button” 19 Practical issues with using network tomography for fault diagnosis with Italo Scota Cunha (LIP6, Thomson) Amogh Dhamdhere, Yiyi Huang, Nick Feamster, Constantine Dovrolis (Georgia Tech) Christophe Diot (Thomson) The binary tomography solution by Duffield m t2 Given – – t1 Complete network topology End-to-end reachability measurements Find the smallest set of links that explain observations – Assumes single-source tree, access to targets 21 Extending binary tomography Multi-network – Periodic traceroutes determine topologies Extension – to multiple-sources, multiple-targets Minimum hitting set problem (NP-hard) Tomo: – setting: topology not known Iterative poly-time greedy heuristic Intuition: Iteratively choose link that explains the max number of failures 22 Some problems Dynamics – Loss can be transient, topology can change Ambiguity – Losses are one-way but don’t always have access to both ends of the path Lack – of synchronization Different monitors see different conditions 23 Approach Transient – Triggered confirmation of failed paths Dynamic – losses Algorithm based on IP spoofing Lack – routing Periodic snapshots of the network topology One-way – packet loss of synchronization Correlation of probes from different monitors 24 Failure confirmation Upon detection of a failure, trigger extra probes Number of probes – – Confirm failures with a target false positive rate Assume independence and a given a loss rate Time between probes – – Reduce chance that probes fall on the same loss burst Assume link losses follow a Gilbert process loss burst packets on a path false positive 25 time Disambiguating one-way losses: Spoofing Monitor sends request to spoofer to send probe Probe has IP address of the monitor If reply reaches the monitor, reverse path is working T M Spoofer: Send spoofed packet with source address of M 26 Evaluation Evaluation is challenging – Need ground truth and realistic environment Controlled experiments on the VINI testbed – Allow us to inject failures – Problem: hard to argue about false positive Experiments on Emulab – More control: dedicated nodes and links – Emulate the Abilene network – Selected LA and NY as monitors 27 Failure confirmation reduces false positives Emulab experiment setup – – 10% loss rates in each direction No persistent failures Both schemes use three probes to confirm a failure Confirmation interval Back-to-back 0.2 secs Burst factor 90% 96% 15% 25% 0.8% 0.8% low false positives, because an interval of 0.2 secs guarantees a small probability of probes being correlated 28 Correlation is important to get a consistent view Emulab and VINI experiments with short failures – More false positives – Lower detection rate In real deployments, can we get a consistent view? – More noise because of losses and routing dynamics – Monitors are less synchronized – Monitors may not be able to reach the coordinator Next steps – Online correlation – Minimize communication with coordinator 29 Summary Continuous monitoring for detection – At management hosts: active measurements • – At end-users: passive measurements • Reduce probing overhead, still detect failures Lightweight detection of problems that affect apps Network tomography for identification – Many challenges to get consistent inputs for tomography • Network dynamics and transient losses • Ambiguity of forward and reverse failures • Monitors may observe different conditions 30