Internet Measurement
Huaiyu Zhu, Rim Kaddah
CS538 Fall 2011

OUTLINE
• California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
• A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

Why Study Failure
• Failure is a reality for large networks
• Achieving high availability requires engineering the network to be robust to failure
• Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures

CENIC Network
• Serves California educational institutions
• Over 200 routers
• 5 years of data
• Three types of components:
◦ The Digital California (DC) network
◦ The High-Performance Research (HPR) network
◦ Customer-premises equipment (CPE)

Contribution
• Methodology to reconstruct historical failure events of the CENIC network
• Uses only commonly available data; no additional instrumentation is needed
• Analysis of the network based on the failure measurements

Reconstruction
What data are available to reconstruct a failure 4 years later?
• Syslog
◦ Describes interface state changes
• Router configuration files
◦ Map interfaces to links
• Operational announcements on the mailing list
The data are not intended for failure reconstruction!

Validation
• Internal consistency
◦ Use the administrator announcements to validate the reconstructed event history.
• External consistency
◦ CAIDA Skitter project (now Ark) to validate UP states.
◦ Route Views project to validate DOWN states.
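
The reconstruction idea above (syslog gives interface state changes, configuration files map interfaces to links) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the router and interface names, the pre-parsed event tuples, and the rule that a link counts as failed while at least one of its endpoints reports down. It is not the authors' actual pipeline.

```python
from datetime import datetime

# Hypothetical mapping (router, interface) -> link, standing in for what
# would be extracted from router configuration files.
INTERFACE_TO_LINK = {
    ("rtr-a", "Gi0/1"): "link-1",
    ("rtr-b", "Gi0/2"): "link-1",
}

# Hypothetical, already-parsed syslog state changes:
# (timestamp, router, interface, new state).
EVENTS = [
    ("2010-01-01 00:00:10", "rtr-a", "Gi0/1", "down"),
    ("2010-01-01 00:00:11", "rtr-b", "Gi0/2", "down"),
    ("2010-01-01 00:05:10", "rtr-a", "Gi0/1", "up"),
    ("2010-01-01 00:05:12", "rtr-b", "Gi0/2", "up"),
]


def reconstruct_failures(events, interface_to_link):
    """Return (link, start, end) tuples.

    Assumption: a link is failed from the first endpoint going down
    until every endpoint has reported up again.
    """
    down_endpoints = {}  # link -> endpoints currently reported down
    started = {}         # link -> start time of the ongoing failure
    failures = []
    # ISO-style timestamps sort chronologically as strings.
    for ts, router, iface, state in sorted(events):
        link = interface_to_link.get((router, iface))
        if link is None:
            continue  # interface not mapped to any known link
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        endpoints = down_endpoints.setdefault(link, set())
        if state == "down":
            if not endpoints:
                started[link] = when
            endpoints.add((router, iface))
        elif state == "up":
            endpoints.discard((router, iface))
            if not endpoints and link in started:
                failures.append((link, started.pop(link), when))
    return failures


if __name__ == "__main__":
    for link, start, end in reconstruct_failures(EVENTS, INTERFACE_TO_LINK):
        print(link, start, end, (end - start).total_seconds(), "seconds")
```

An "up" message with no preceding "down" is ignored in this sketch; a real reconstruction would also have to handle lost or duplicated syslog messages, which is one reason the validation step matters.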

Overview of Link Failures
• Vertical banding
◦ V1: a network-wide IS-IS configuration change requiring a router restart
◦ V2: a network-wide software upgrade
◦ V3: a network-wide configuration change in preparation for IPv6
• Horizontal banding
◦ H1: a series of failures on a link between a core router and a County of Education office (hardware)
◦ H2: this link experienced over 33,000 short-duration failures (fiber cut)

CDFs of Individual Failure Events

Various Link Hardware Types

Cause of Failure

Failure Events

Summary
• Engineering for failure requires real data
◦ Data has historically been difficult to obtain
• Methodology to perform historical failure analysis with low-quality data sources
• Shared our findings in the CENIC network
◦ Reliability of individual components
◦ Causes of failures
◦ Impact of failure

A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

Key Questions
• How can routing events degrade end-to-end path performance?
• How do topological properties and routing policies affect performance degradation?

Approach
• Study end-to-end performance under realistic topologies.
• Investigate several metrics to characterize end-to-end loss, delay, and out-of-order packets.
• Characterize the kinds of routing changes that impact end-to-end path performance.
• Analyze the impact of topology, routing policies, the MRAI timer, and iBGP configurations on end-to-end path performance.

Experiment Methodology
• A multi-homed prefix
• BGP Beacon prefix: 192.83.230.0/24
• Controlled routing changes
• Failover events: the Beacon changes from being connected to both providers to being connected to a single provider.
• Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers.
[Figure: the Beacon connected to ISP 1 and ISP 2; panels: failover event, recovery event]

Controlled Routing Changes
• 12 routing events every day
◦ 8 beacon events: failover events and recovery events
◦ 4 for resetting the Beacon connectivity
[Table: time schedule (GMT) for BGP Beacon routing transitions]

Active Probing
• Goal: capture the impact of routing changes on end-to-end performance.
• From 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix).
[Figure: probe hosts A, B, and C reach the Beacon host across the Internet through ISP 1 and ISP 2]
• Three probing methods:
◦ Back-to-back traceroutes
◦ Back-to-back pings
◦ UDP probing (50 msec interval)
• Data-plane performance metrics:
◦ Packet loss
◦ Delay
◦ Out-of-order packets
[Table: active probing methods (traceroute, ping, UDP probing) against the performance metrics]

Packet Loss
Loss burst: consecutive UDP probing packets lost during a routing change event.
[Figure panels: Failover, Recovery]

Packet Delay
Round-trip delays from the probe host to the Beacon host (one-way delays suffer from clock skew).
[Figure panels: Failover, Recovery]

Out-of-order Packets
• Number of reorderings (number of packets out of order)
• Reordering offset
[Figure panels: Failover, Recovery]

How Routing Failures Occur (Failover)?
Prefer-customer routing policy: routes received from a provider's customers are always preferred over those received from its peers.
[Figure: example topology with Provider 1, Provider 2, routers R1 to R6, a peer link, a customer link, and Beacon AS 0]

How Routing Failures Occur (Failover)? (contd.)
No-valley routing policy: peers do not transit traffic from one peer to another.
[Figure: example topology with Provider 1, Provider 2, Provider 3, routers R1 to R9, peer links, and Beacon AS 0]

How Routing Failures Occur? (Recovery)
iBGP constraint: a route received from an iBGP router cannot be passed on to another iBGP router.
1. Recovery: path (0) becomes available at R3
2. R3 sends the path to R2
3. R2 sends a withdrawal, Withdraw (2 0), to R1
4. R3 sends the recovery path to R1
5. R1 regains its connection to the Beacon
[Figure: Provider 1, Provider 2, routers R1 to R4, and Beacon AS 0, with path (0) at R2 and R3 and R2's withdrawal sent to R1]

Summary
• During failover and recovery events:
◦ Routing events impact packet loss significantly.
◦ Routing failures contribute to end-to-end packet loss significantly.
◦ Routing events can lead to long packet round-trip delays and reordering.
• Routing policies and iBGP configuration play a major role in causing packet loss during routing events.

Discussion
• How could we prevent packet loss during path exploration? Would storing an alternative path in each router be a good idea? What are the downsides?
• How could we exploit the previous results to improve end-to-end performance?
• How realistic is the topology in the second paper?

References
• Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006.
• Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006.
• Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures. SIGCOMM 2010.
• Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.