Internet Measurement
Huaiyu Zhu, Rim Kaddah
CS538
Fall 2011
OUTLINE
•  California Fault Lines: Understanding the Causes and Impact of Network Failures.
   Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage
•  A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance.
   Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush
Why Study Failure
•  Failure is a reality for large networks
•  Achieving high availability requires engineering the network to be robust to failures
•  Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures
CENIC Network
•  Serving California educational institutions
•  Over 200 routers
•  5 years of data
•  Three types of components:
   ◦ The Digital California (DC) network
   ◦ The High-Performance Research (HPR) network
   ◦ Customer-premises equipment (CPE)
Contribution
•  A methodology to reconstruct historical failure events in the CENIC network
•  Uses only commonly available data; no additional instrumentation is needed
•  An analysis of the network based on the reconstructed failure measurements
Reconstruction
What data are available to reconstruct a failure 4 years later?
◦ Syslog
  •  Describes interface state changes
◦ Router configuration files
  •  Map interfaces to links
◦ Operational announcements on a mailing list
None of these data sources were intended for failure reconstruction! (A parsing sketch follows below.)
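To make the reconstruction step concrete, here is a minimal sketch of how interface state changes in syslog could be paired into link failure events. This is an illustration only: the Cisco-style message format, the year, and the iface_to_link mapping (which the paper derives from router configuration files) are assumptions, not the authors' actual tooling.

import re
from datetime import datetime

# Assumed Cisco-style syslog line (illustrative, not the exact CENIC format):
#   Jan  3 12:04:56 core-rtr-1 %LINEPROTO-5-UPDOWN: Line protocol on
#   Interface GigabitEthernet1/0, changed state to down
MSG = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<router>\S+) .*UPDOWN: "
    r".*Interface (?P<iface>\S+), changed state to (?P<state>up|down)"
)

def reconstruct_failures(syslog_lines, iface_to_link, year=2006):
    """Pair each 'down' message with the next 'up' message on the same link."""
    pending = {}   # link -> time the link went down
    failures = []  # (link, down_time, up_time)
    for line in syslog_lines:
        m = MSG.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        # Router configuration files provide the (router, interface) -> link mapping.
        link = iface_to_link.get((m["router"], m["iface"]))
        if link is None:
            continue
        if m["state"] == "down":
            pending.setdefault(link, ts)
        elif link in pending:
            failures.append((link, pending.pop(link), ts))
    return failures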
Validation
•  Internal consistency
   Administrator announcements on the operational mailing list are used to validate the reconstructed event history.
•  External consistency
   The CAIDA Skitter project (now Ark) validates links reported as UP.
   The Route Views project validates links reported as DOWN.
Overview of Link Failures
(Figure: failure events plotted per link over the measurement period; vertical and horizontal bands are visible.)
•  Vertical banding
   V1: a network-wide IS-IS configuration change requiring a router restart
   V2: a network-wide software upgrade
   V3: a network-wide configuration change in preparation for IPv6
•  Horizontal banding
   H1: a series of failures on a link between a core router and a County Office of Education (hardware)
   H2: a single link that experienced over 33,000 short-duration failures (fiber cut)
(Figure slides: CDFs of individual failure events, various link hardware types, causes of failure, and failure events.)
Summary
•  Engineering for failure requires real data
   - Such data have historically been difficult to obtain
•  A methodology for performing historical failure analysis with low-quality data sources
•  Findings from the CENIC network:
   - Reliability of individual components
   - Causes of failures
   - Impact of failures
OUTLINE
•  California Fault Lines: Understanding the Causes and Impact of Network Failures.
   Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage
•  A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance.
   Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush
Key Questions
•  How can routing events cause degraded end-to-end path performance?
•  How do topological properties and routing policies affect performance degradation?
Approach
•  Study end-to-end performance under realistic topologies.
•  Investigate several metrics to characterize end-to-end loss, delay, and out-of-order packets.
•  Characterize the kinds of routing changes that impact end-to-end path performance.
•  Analyze the impact of topology, routing policies, MRAI timers, and iBGP configuration on end-to-end path performance.
Experiment Methodology
•  A multi-homed prefix
   - BGP Beacon prefix: 192.83.230.0/24
•  Controlled routing changes
   - Failover events: the Beacon changes from being connected to both providers to being connected to a single provider.
   - Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers.
(Diagram: the Beacon attached to ISP 1 and ISP 2; a failover event leaves it connected to a single provider, and a recovery event restores the second connection.)
Controlled Routing Changes
•  12 routing events every day
   - 8 Beacon events: failover and recovery events
   - 4 events to reset the Beacon connectivity
(Table: time schedule (GMT) for BGP Beacon routing transitions.)
Active Probing
•  Goal: capture the impact of routing changes on end-to-end performance.
•  Probes are sent from 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix).
•  Three probing methods:
   - Back-to-back traceroutes
   - Back-to-back pings
   - UDP probing (50 ms interval)
•  Data-plane performance metrics: packet loss, delay, out-of-order packets.
(Diagram: probe hosts A, B, and C reach the Beacon host across the Internet via ISP 1 and ISP 2. Table: which probing methods capture which metrics. A minimal UDP-probing sketch follows below.)
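To make the UDP probing concrete, here is a minimal sketch of a sequence-numbered, timestamped prober in the spirit of the measurement setup. The 50 ms interval comes from the slides; the destination host name, port, packet format, and probe count are illustrative assumptions, not the authors' actual tooling.

import socket
import struct
import time

BEACON = ("beacon.example.net", 33434)   # hypothetical destination, not the real Beacon host
INTERVAL = 0.05                          # 50 ms between probes, as in the slides

def send_probes(count=1200):
    """Send sequence-numbered, timestamped UDP probes toward the Beacon host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(count):
        # 4-byte sequence number + 8-byte send timestamp (assumed packet format)
        sock.sendto(struct.pack("!Id", seq, time.time()), BEACON)
        time.sleep(INTERVAL)

def receive_probes(port=33434, expected=1200):
    """Beacon-side receiver: record (sequence, send time, receive time) per probe."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    sock.settimeout(5.0)  # stop once probes cease arriving
    log = []
    try:
        while len(log) < expected:
            data, _ = sock.recvfrom(64)
            seq, sent = struct.unpack("!Id", data)
            log.append((seq, sent, time.time()))
    except socket.timeout:
        pass
    return log

The receiver's log is what the loss, delay, and reordering slides below operate on.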
Packet Loss
Loss burst: consecutive UDP probing packets lost during a routing change event.
(Figures: loss bursts during failover and recovery events. A loss-burst extraction sketch follows below.)
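A small sketch, my own illustration building on the hypothetical probe log above, of how missing sequence numbers can be grouped into loss bursts:

def loss_bursts(received_seqs, total_sent):
    """Group missing sequence numbers into bursts of consecutive losses."""
    got = set(received_seqs)
    bursts, current = [], 0
    for seq in range(total_sent):
        if seq in got:
            if current:
                bursts.append(current)
            current = 0
        else:
            current += 1
    if current:
        bursts.append(current)
    return bursts

# Example: probes 100-279 lost during a failover event -> one burst of length 180
# loss_bursts([s for s in range(1200) if not 100 <= s < 280], 1200) == [180]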
Packet Delay
Round-trip delays from the probe host to the Beacon host are used (one-way delays suffer from clock-skew problems between the two hosts).
(Figures: round-trip delays during failover and recovery events.)
Out-of-order Packets
•  Number of reorderings (number of packets arriving out of order)
•  Reordering offset
(Figures: packet reordering during failover and recovery events. A sketch of both metrics follows below.)
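A small sketch, again my own illustration assuming the probe log above, of the two metrics. The offset definition used here (how many positions late a reordered packet arrives relative to in-order delivery) is an illustrative choice, not necessarily the paper's exact definition.

def reordering_metrics(arrival_order):
    """arrival_order: probe sequence numbers in the order they were received."""
    count, offsets = 0, []
    highest = -1
    for pos, seq in enumerate(arrival_order):
        if seq < highest:
            # This packet arrived after a later-sent packet: it is out of order.
            count += 1
            # Offset: how many positions late it arrived relative to in-order delivery.
            offsets.append(pos - sorted(arrival_order[:pos + 1]).index(seq))
        else:
            highest = seq
    return count, offsets

# Example: packet 5 is overtaken by packets 6 and 7
# reordering_metrics([0, 1, 2, 3, 4, 6, 7, 5, 8]) == (1, [2])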
How Routing Failures Occur (Failover)?
Prefer-customer routing policy: routes received from a provider's customers are always preferred over those received from its peers.
(Diagram: the Beacon, AS 0, is a customer of Provider 1 and Provider 2, which are connected by a peer link; routers R1-R6 inside the providers, with labels 0, 10, and 20 next to the routers.)
Because routers propagate only their best route, the less-preferred path through the peer link may not be known everywhere when the customer route is withdrawn during a failover, leaving some routers temporarily with no path to the Beacon. A route-selection sketch follows below.
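To illustrate the prefer-customer policy, here is a sketch of BGP-style best-route selection; the relationship tags and local-preference values are conventional illustrations, not taken from the paper.

# Conventional local-preference assignment for the prefer-customer policy
# (the specific values are illustrative, not from the paper).
LOCAL_PREF = {"customer": 100, "peer": 90, "provider": 80}

def best_route(routes):
    """routes: dicts like {"as_path": [2, 0], "learned_from": "customer"}.
    Pick the highest local preference, breaking ties on shorter AS path."""
    return max(
        routes,
        key=lambda r: (LOCAL_PREF[r["learned_from"]], -len(r["as_path"])),
    )

# A router in Provider 2 with two ways to reach the Beacon (AS 0):
routes = [
    {"as_path": [0], "learned_from": "customer"},    # direct customer route
    {"as_path": [1, 0], "learned_from": "peer"},     # via the peer link to Provider 1
]
print(best_route(routes))  # the customer route wins, even though both are usable

When the customer route is withdrawn during a failover, only then does the peer-learned route become best, and it still has to propagate before every router regains a path.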
How Routing Failures Occur (Failover)? (contd.)
No-valley routing policy: peers do not transit traffic from one peer to another.
(Diagram: the Beacon, AS 0, is a customer of Provider 1 and Provider 2; Provider 3 peers with both providers; routers R1-R9, with labels 0, 10, and 20 next to the routers.)
Because Provider 3 will not carry traffic between its two peers, the path through Provider 3 cannot serve as a backup during the failover, again leaving some routers temporarily without any path to the Beacon. An export-rule sketch follows below.
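A sketch of the export rule behind the no-valley property; this is the standard customer/peer/provider export convention, offered as my own illustration rather than the paper's formulation.

def should_export(learned_from, export_to):
    """Standard no-valley export rule:
    - routes learned from a customer are exported to everyone;
    - routes learned from a peer or a provider are exported only to customers."""
    if learned_from == "customer":
        return True
    return export_to == "customer"

# Provider 3 learns the Beacon's route from its peer Provider 2:
print(should_export("peer", "peer"))      # False: never re-exported to another peer
print(should_export("peer", "customer"))  # True: its own customers may still use it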
How Routing Failures Occur? (Recovery)
iBGP constraint: a route received from an iBGP peer cannot be re-advertised to another iBGP peer.
(Diagram: Provider 1 contains routers R1, R2, and R3; Provider 2 contains R4; the Beacon is AS 0.)
1. The path (0) to R3 recovers.
2. R3 sends the path to R2.
3. R2 sends a withdrawal of its old path (2 0) to R1.
4. R3 sends the recovery path to R1.
5. R1 regains its connection to the Beacon.
Because R2 cannot re-advertise the path it learned from R3 over iBGP, R1 is left without any route between steps 3 and 4, causing transient loss during recovery. A sketch of this rule follows below.
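A tiny sketch, my own illustration, of the iBGP propagation rule that creates the gap between steps 3 and 4 (a plain iBGP full mesh with no route reflection is assumed):

def may_advertise_over_ibgp(route_learned_via):
    """Plain iBGP (no route reflection): only routes learned over eBGP
    may be passed on to other iBGP peers."""
    return route_learned_via == "ebgp"

# R2's situation during recovery:
print(may_advertise_over_ibgp("ebgp"))  # True: its old path via Provider 2 was advertisable
print(may_advertise_over_ibgp("ibgp"))  # False: the new path learned from R3 is not, so
                                        # R2 can only withdraw, leaving R1 with no route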
Summary
•  During failover and recovery events:
   •  Routing events impact packet loss significantly.
   •  Routing failures contribute significantly to end-to-end packet loss.
   •  Routing events can lead to long packet round-trip delays and reordering.
•  Routing policies and iBGP configuration play a major role in causing packet loss during routing events.
Discussion
•  How could we prevent packet loss during path exploration? Would storing an alternative path in each router be a good idea? What are the downsides?
•  How could we exploit these results to improve end-to-end performance?
•  How realistic is the topology considered in the second paper?
References
•  Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006.
•  Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006.
•  Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures. SIGCOMM 2010.
•  Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.