Network Measurements in Overlay Networks Richard Cameron Craddock

Network Measurements in
Overlay Networks
Richard Cameron Craddock
School of Electrical and Computer Engineering
Georgia Institute of Technology
Outline



Resilient Overlay Networks
Best-Path vs. Multi-Path Overlay Routing
Measuring the Effect of Internet Path Faults
on Reactive Routing
Resilient Overlay
Networks
D. Andersen, H. Balakrishnan, F. Kaashoek, and R.
Morris
Proc. 18th ACM SOSP
October 2001
ANDERSEN, D. G., BALAKRISHNAN, H., KAASHOEK, M. F., AND MORRIS, R. Resilient Overlay Networks. In Proc. 18th ACM
SOSP (Banff, Canada, Oct. 2001), pp. 131–145.
Resilient Overlay Networks

RONs seek to quickly detect and respond to network
failures




Network nodes participate in a limited size overlay
network
Overlay nodes cooperate with one another to forward data
on behalf of any other nodes in the RON
RON detects problems by aggressively probing the paths
connecting its nodes
RON nodes exchange information about the quality of
paths among themselves, and build forwarding tables
based on a variety of path metrics

Latency, Packet Loss, and Available Throughput
Resilient Overlay Networks Goals



Failure detection and recovery in less than 20
seconds
Tighter integration of routing and path
selection with the application
Expressive policy routing
Active Probing


RON probes every other node once every PROBE_INTERVAL plus a random jitter of 1/3 PROBE_INTERVAL
A probe not returned within PROBE_TIMEOUT is considered lost
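
The probing schedule on this slide can be sketched as follows. This is a minimal illustration, not RON's implementation: the numeric values of PROBE_INTERVAL and PROBE_TIMEOUT are placeholders chosen for the example, and the jitter is assumed to be drawn uniformly from [0, PROBE_INTERVAL / 3].

```python
import random

PROBE_INTERVAL = 12.0  # seconds; placeholder value assumed for illustration
PROBE_TIMEOUT = 3.0    # seconds; placeholder value assumed for illustration


def next_probe_delay(rng=random):
    """Return the wait before the next probe: base interval plus jitter.

    Assumption: jitter is uniform on [0, PROBE_INTERVAL / 3].
    """
    return PROBE_INTERVAL + rng.uniform(0.0, PROBE_INTERVAL / 3.0)


def probe_lost(sent_at, reply_at):
    """A probe counts as lost if no reply arrived within PROBE_TIMEOUT."""
    return reply_at is None or (reply_at - sent_at) > PROBE_TIMEOUT
```

The jitter keeps nodes from probing in lockstep, which would otherwise produce bursts of probe traffic.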
Link-State Dissemination



RON nodes disseminate their performance
metrics to the other nodes every
ROUTING_INTERVAL
This information is sent over the RON overlay
The only time that a RON node has
incomplete information about any other node
is when it is completely cut off from the
Overlay
Outage Detection




On the loss of a probe, several consecutive probes
spaced by PROBE_TIMEOUT are sent out
If OUTAGE_THRESH probes elicit no response the
path is considered “dead”
If even one probe gets a response then high
frequency probing is cancelled
Paths experiencing outages are rated on their packet
loss history
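
The outage rule above can be sketched as a small decision function. The slide does not say whether the initial lost probe counts toward OUTAGE_THRESH; this sketch assumes only the follow-up probes are counted, and the function name is hypothetical.

```python
def path_is_dead(follow_up_replies, outage_thresh):
    """Decide whether a path is dead after an initial probe loss.

    follow_up_replies: booleans, one per follow-up probe sent
    PROBE_TIMEOUT apart (True means the probe got a response).
    Assumption: only follow-up probes count toward outage_thresh.
    """
    unanswered = 0
    for got_reply in follow_up_replies:
        if got_reply:
            return False  # a single response cancels high-frequency probing
        unanswered += 1
        if unanswered >= outage_thresh:
            return True   # enough consecutive silent probes: path is dead
    return False          # not enough evidence yet
```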
Latency and Loss Rate

Latency is the round trip time calculated from the
probes




Latency = A * Latency + (1-A) * New Sample
A is chosen to be 0.9
Overall latency is the SUM of the individual virtual link
latencies
Loss Rate is the average of the last k = 100 probe
samples

If losses are assumed independent then the overall path delivery rate is the PRODUCT of the individual virtual link delivery rates (1 - loss rate), so the path loss rate is 1 minus that product
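
A minimal sketch of these estimators, using A = 0.9 and k = 100 from the slide. The class and function names are hypothetical, the first sample is assumed to initialize the moving average, and path loss is composed through per-link delivery probabilities (1 - loss rate), which is one reading of the independence assumption.

```python
from collections import deque

A = 0.9   # smoothing weight from the slide
K = 100   # loss window size from the slide


class LinkStats:
    """Per-virtual-link latency and loss estimates built from probe results."""

    def __init__(self):
        self.latency = None
        self.samples = deque(maxlen=K)  # True = probe lost

    def record(self, rtt=None):
        """Record one probe; rtt=None means the probe was lost."""
        lost = rtt is None
        self.samples.append(lost)
        if not lost:
            if self.latency is None:
                self.latency = rtt  # assumption: first sample seeds the average
            else:
                self.latency = A * self.latency + (1 - A) * rtt

    def loss_rate(self):
        """Average loss over the last K probes."""
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def path_latency(links):
    """Latency is additive across virtual links (each needs one sample)."""
    return sum(link.latency for link in links)


def path_loss_rate(links):
    """Compose loss under independence via per-link delivery probabilities."""
    delivered = 1.0
    for link in links:
        delivered *= 1.0 - link.loss_rate()
    return 1.0 - delivered
```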
Throughput
Throughput is calculated using (2):

score = \frac{\sqrt{1.5}}{rtt \cdot \sqrt{p}}    (2)

p is the one-way packet loss probability
Estimated as half of the calculated two-way packet loss probability
rtt is the end-to-end round trip time
Throughput cannot be aggregated across virtual links
In order to simplify the selection of throughput optimized paths only one
intermediate node is considered
An indirect path is only chosen if it improves throughput by 50%
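
The selection rule can be sketched as below, assuming equation (2) reads score = sqrt(1.5) / (rtt * sqrt(p)). The function names and the dictionary of candidate scores are hypothetical; the 50% threshold and the single-intermediate restriction come from the slide.

```python
import math


def throughput_score(rtt, two_way_loss):
    """Score a path from its round trip time and two-way loss probability.

    The one-way loss p is estimated as half the two-way loss, as on the slide.
    """
    p = two_way_loss / 2.0
    if rtt <= 0 or p <= 0:
        raise ValueError("rtt and loss must be positive for this score")
    return math.sqrt(1.5) / (rtt * math.sqrt(p))


def choose_path(direct_score, indirect_scores, min_gain=0.5):
    """Return the intermediate node to route through, or None for direct.

    indirect_scores maps one candidate intermediate node to the score of the
    end-to-end path through it. An indirect path wins only if it improves on
    the direct score by at least min_gain (50%).
    """
    best = max(indirect_scores, key=indirect_scores.get, default=None)
    if best is not None and indirect_scores[best] >= (1 + min_gain) * direct_score:
        return best
    return None
```

The threshold biases the router toward the direct path, so it does not switch paths for small or noisy gains.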
Experiment


The raw measurement data consists of probe packets
To probe, each RON node independently repeated the following steps:




Pick a random node j
Pick a probe-type from one of {direct, latency, loss} using
round-robin.
Send probe to j
Delay for a random interval between 1 and 2 seconds
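
The measurement loop above can be sketched as a generator that yields what to send and how long to wait. The names are hypothetical, the delay is assumed to be uniform on [1, 2] seconds, and actual packet sending is left out.

```python
import itertools
import random

PROBE_TYPES = ("direct", "latency", "loss")


def probe_schedule(self_node, nodes, steps, rng=random):
    """Yield (target, probe_type, delay_seconds) for each measurement step.

    Targets are picked at random, probe types rotate round-robin, and the
    delay is assumed to be uniform between 1 and 2 seconds.
    """
    types = itertools.cycle(PROBE_TYPES)
    others = [n for n in nodes if n != self_node]
    for _ in range(steps):
        target = rng.choice(others)
        yield target, next(types), rng.uniform(1.0, 2.0)
```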
Results

Two distinct datasets

RON1




64 hours between 3/21/2001 and 3/23/2001
12 nodes with 132 distinct paths
Traverses 36 different ASs and 74 distinct inter-AS
links
RON2



85 hours between 5/7/2001 and 5/11/2001
16 nodes with 240 distinct paths
Traverses 50 ASs and 118 different AS links
Results
Overcoming Path Outages


A RON win occurred when Internet loss was >= p%
and RON loss was < p%
There were 10 complete communication outages, and RON
routed around all of them
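
The win definition can be written as a predicate over paired loss measurements. This is a sketch with hypothetical names; each sample is assumed to be an (Internet loss %, RON loss %) pair taken over the same interval.

```python
def is_ron_win(internet_loss_pct, ron_loss_pct, p):
    """RON wins when the Internet path loses >= p% and RON loses < p%."""
    return internet_loss_pct >= p and ron_loss_pct < p


def count_wins(samples, p):
    """Count wins over (internet_loss_pct, ron_loss_pct) pairs."""
    return sum(is_ron_win(i, r, p) for i, r in samples)
```

With p = 100 the predicate picks out complete outages that RON routed around.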
Loss Rate



In RON1, RON improved the loss rate by more than 0.05 more than 5% of the time
RON can make loss rates worse too
In RON2, RON improved the loss rate by more than 0.04 more than 5% of the time
Handling Packet Floods



Three hosts connected in a
triangle
Indirect routing is possible
through the third node but
not preferable
Flood attack beginning at
5s


RON recovered in 13s
Non-RON doesn’t recover
Latency



RON reduces communication latency in many cases
11% saw improvements of 40 ms or more in RON1
8.2% saw improvements of 40 ms or more in RON2
TCP Throughput




RON’s throughput-optimizing router does not attempt to change paths unless it obtains a 50% improvement in throughput
5% of samples doubled their throughput
2% increased their throughput by more than 5 times
9 samples increased their throughput by a factor of 10
Conclusions





Resilient overlay networks can greatly improve the reliability
of the Internet
RON was able to overcome 100% (RON1) and 60% (RON2)
of the several hundred observed outages
RON takes 18 seconds on average to detect and recover from
a fault
RON can substantially improve loss rate, latency and TCP
throughput
Forwarding packets via at most one intermediate node is
sufficient for fault recovery and latency improvements
Best-Path vs. Multi-Path
Overlay Routing
D. Andersen, A. Snoeren, and H. Balakrishnan
IMC
Miami, FL, October 2003.
D. G. Andersen, A. C. Snoeren, and H. Balakrishnan, "Best-Path vs Multi-Path Overlay Routing," IMC 2003.
Best-Path vs. Multi-Path Overlay
Routing



Best-path and multi-path routing techniques
have been proposed to reduce packet loss
These techniques are compared in terms of
loss rate and latency reduction
This comparison is made in the context of an
overlay network
Best-Path vs. Multi-Path Overlay
Routing

Multi-Path Routing


Packets are duplicated and sent on different paths
through overlay
Reactive Routing


Overlay nodes constantly measure the paths
between themselves using probes
Packets are sent on either the direct path or
forwarded via a sequence of other overlay nodes
Routing Methods

Direct:
Single packet using the direct path
Loss:
Loss optimized reactive routing
Lat:
Latency optimized reactive routing
Routing Methods

Direct rand:
2-redundant multi-path routing
First packet is sent directly
Second packet is sent through a randomly chosen intermediate node
Lat Loss:
2-redundant multi-path routing with reactive routing
First packet is sent on the latency optimized path
Second packet is sent on the loss optimized path
Routing Methods

Direct direct:
2-redundant direct routing with back-to-back packets on the same path
DD 10 ms:
Direct direct with 10 ms delay between packets
DD 20 ms:
Direct direct with 20 ms delay between packets
Method

Nodes periodically initiate one or two request
packets to a target




Each request has a random 64-bit identifier which is
logged along with send and receive times
Nodes cycle through the different request types
Targets are chosen randomly
Nodes delay for a random period between 0.6 and
1.2 seconds
Base Network Statistics
Packet Loss Rate
Packet Loss Rate
Conditional Loss Probability and
Latency
Conclusion

There is loss and failure independence in the Internet
40% of observed losses were avoidable
The benefits of multi-path routing can be achieved with direct duplication
10 or 20 ms delay between packets
Reactive and redundant routing can work in concert to reduce loss
45% decrease in packet loss rate
Measuring the Effect of
Internet Path Faults on
Reactive Routing
Nick Feamster, David G. Andersen, Hari Balakrishnan, and M.
Frans Kaashoek
SigMetrics
San Diego, June 2003
FEAMSTER, N., ANDERSEN, D., BALAKRISHNAN, H., AND KAASHOEK, M. F. Measuring the effects of Internet path faults
on reactive routing. In Proc. Sigmetrics (San Diego, CA, June 2003).
Measuring the Effect of Internet Path
Faults on Reactive Routing



Where do failures appear?
How long do failures last?
How well do failures correlate with BGP
routing instability?
Data Collection

Based on the analysis of data collected for one year
on a test bed of 31 hosts



Geographically as well as topologically diverse test bed
Paths between these hosts traverse more than 50% of the
well-connected ASs on the internet
Data includes:



Active probes between hosts
Traceroutes
BGP messages collected at 8 locations
Active Probing

An active probe consists of a request packet from the initiator to a target
and reply packet from the target to the initiator




Each host independently initiates a probe to a random target and then
sleeps between 1 and 2 seconds



Each probe has a 32 bit ID that is logged along with send and receive times
A central monitoring machine aggregates logs
Post processing finds all probes received within 60 minutes of when they are
sent
Mean time between probes on a particular path is 30s
With a 95% probability each path is probed at least once every 80s
Failures are defined as 3 or more consecutive lost probes

Limits the time resolution of failure detection to a few minutes
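
The failure definition can be applied to a path's probe log as a scan for runs of lost probes. This is a sketch with hypothetical names; the log is assumed to be a list of booleans in send order for one path.

```python
def find_failures(outcomes, min_run=3):
    """Return (start, end) index pairs of runs of >= min_run lost probes.

    outcomes: booleans in send order for one path (True = reply received).
    Indices are inclusive positions in the outcomes list.
    """
    failures = []
    run_start = None
    for i, ok in enumerate(outcomes):
        if not ok:
            if run_start is None:
                run_start = i          # a run of losses begins here
        else:
            if run_start is not None and i - run_start >= min_run:
                failures.append((run_start, i - 1))
            run_start = None
    # Close a run that is still open at the end of the log.
    if run_start is not None and len(outcomes) - run_start >= min_run:
        failures.append((run_start, len(outcomes) - 1))
    return failures
```

Because a path is probed about every 30 s on average, three consecutive losses span on the order of a minute or more, which is why the resolution is limited to a few minutes.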
Loss-triggered traceroutes

Path failure indicated by the active prober initiates a
single traceroute



The failure of a traceroute could be due to either the
forward or reverse path


Traceroute is limited to 30 hops
The last reachable IP address is considered point of failure
One-way reachability from active probes ensures that the
traceroute measurement corresponds to failure on the
forward path
Measurement hosts periodically push traceroute logs
to the central monitoring machine
Network Depth Estimation


Want to determine if a failure occurs near an end
host or in the middle of the network
Assign an estimated network depth to each link
based on its connectivity to other network nodes




Links between routers and measurement nodes have a
network depth of 0
Any edge that connects a 0 depth router to other routers
has a depth of 1, and so on
Edges that can receive more than one value, get assigned
the smaller value
By computing the depth of all links, the depth at
which a traceroute fails can be estimated
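
One way to compute these depths is a breadth-first search started from all measurement hosts at once. This is a sketch under that assumption, with hypothetical names; a link's depth is taken as the smaller hop distance of its two endpoints, which gives 0 for host links, 1 for the next ring of links, and the smaller value when a link could receive two.

```python
from collections import deque


def link_depths(links, hosts):
    """Map each (u, v) link to its estimated network depth.

    links: list of (node, node) pairs from traceroute data.
    hosts: measurement hosts, which sit at hop distance 0.
    """
    adjacency = {}
    for u, v in links:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Multi-source BFS: hop distance from the nearest measurement host.
    dist = {h: 0 for h in hosts}
    queue = deque(hosts)
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)

    # Links not reachable from any host are left out.
    return {(u, v): min(dist[u], dist[v])
            for u, v in links if u in dist and v in dist}
```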
Inferring AS Topology

Inferring AS topology requires:



Mapping interfaces to routers, alias resolution
Assigning routers to ASs
Alias resolution




Based on Rocketfuel’s “Ally” technique
A pair of IP addresses is candidate for alias resolution if
they both have the same next or previous hop in a
traceroute
For each candidate pair the alias resolution test is
performed 100 times
If the test is positive 80% or more of the time, the two IP
addresses are assigned to the canonical ID of the router
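
The 80% rule and the grouping into canonical router IDs can be sketched as below. The alias test itself is not shown; its repeated results are assumed to arrive as a list of booleans. Using the smallest address string in each group as the canonical ID is an arbitrary choice made for this example.

```python
def are_aliases(test_results, min_fraction=0.8):
    """Accept a candidate pair if enough repeated alias tests were positive."""
    if not test_results:
        return False
    return sum(test_results) / len(test_results) >= min_fraction


def canonical_ids(addresses, alias_pairs):
    """Group aliased addresses and map each to one canonical router ID.

    Assumption: the canonical ID is the smallest address string in the group.
    """
    parent = {a: a for a in addresses}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in alias_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return {a: find(a) for a in addresses}
```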
Inferring AS Topology



Routers are assigned to ASs based on the AS’s address space
If a router has addresses from more than one AS, it is assigned
to the AS with the most addresses and is considered a
border router
Routers that cannot be identified in the above manner are
assigned by neighbor router votes



If the majority of the links from a router lead into one AS, we
assign the router to that AS
If the router has links to multiple ASs it is considered a border
router
Routers that cannot be assigned in the above manner are
assigned by hand using traceroute information
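
The first two assignment steps can be sketched as two small functions that each return an (AS, is_border_router) pair. The names and input shapes are hypothetical, and tie-breaking is not specified on the slide, so ties here fall to whichever AS was seen first.

```python
from collections import Counter


def assign_by_addresses(address_asns):
    """Assign a router from the ASs that own its interface addresses.

    address_asns: one AS number per interface address (None if unknown).
    Returns (asn, is_border); asn is None when no address could be mapped.
    """
    counts = Counter(a for a in address_asns if a is not None)
    if not counts:
        return None, False
    asn, _ = counts.most_common(1)[0]
    return asn, len(counts) > 1


def assign_by_neighbors(neighbor_asns):
    """Assign a router by majority vote of the ASs its links lead into.

    neighbor_asns: one known AS number per link from the router.
    Returns (asn, is_border); asn is None when no AS has a strict majority,
    which leaves the router for manual assignment.
    """
    counts = Counter(neighbor_asns)
    if not counts:
        return None, False
    asn, votes = counts.most_common(1)[0]
    is_border = len(counts) > 1
    if votes * 2 > len(neighbor_asns):
        return asn, is_border
    return None, is_border
```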
BGP Data Collection

8 nodes in the test bed collected BGP
messages using Zebra 0.92a


Configured to see only BGP messages that cause
a change in the border router’s choice of best
route
Monitors observe most BGP messages
relevant to routing stability
Failure Location
Failure Location
Failure Length
Failures after RON
Correlating Failures and BGP
Correlating Failures and BGP
Conclusions



Failures are more likely to appear within an AS than on the
boundary
70% of observed failures last less than 5 min, 90% last less
than 15 min
Failures near the core are more likely to coincide with BGP
messages



Failures typically precede BGP messages by 4 minutes
RON can typically route around 50% of path failures
20% of the failures masked by RON were preceded by at
least one BGP message

Suggesting that reactive routing can be improved using BGP
instability as an indicator of path failures.
Discussion

Can passive measurements be used for a
RON?



How do you guarantee that you have enough
data?
How do you handle old data?
How practical are RONs?

Are the performance gains worth the overhead?