Infrastructure-based Resilient Routing
Ben Y. Zhao, Ling Huang, Jeremy Stribling,
Anthony Joseph and John Kubiatowicz
University of California, Berkeley
ICSI Lunch Seminar, January 2004
Motivation

- Network connectivity is not reliable
  - Disconnections are frequent in the Internet (UMichTR98, IMC02)
    - 50% of backbone links have MTBF < 10 days
    - 20% of faults last longer than 10 minutes
  - IP-level repair is relatively slow
    - Wide-area: BGP ~ 3 minutes
    - Local-area: IS-IS ~ 5 seconds
- Next-generation wide-area network applications
  - Streaming media, VoIP, B2B transactions
  - Low tolerance for delay, jitter, and faults
The Challenge

- Routing failures are diverse
  - Many causes: misconfigurations, cut fiber, planned downtime, software bugs
  - Occur anywhere, with local or global impact
    - A single fiber cut can disconnect AS pairs
    - One event can lead to complex protocol interactions
- Isolating failures is difficult
  - End-user symptoms are often dynamic or intermittent
  - WAN measurement research is ongoing (Rocketfuel, etc.)
- Observations:
  - Fault detection requires multiple distributed vantage points
  - In-network decision making is necessary for timely responses
Talk Overview

- Motivation
- A structured overlay approach
- Mechanisms and policy
- Evaluation
- Some questions
An Infrastructure Approach

- Our goals
  - A resilient overlay that routes around failures
  - Respond in milliseconds (not seconds)
- Our approach (data & control plane)
  - Nodes are observation points (similar to Plato's NEWS service)
  - Nodes are also points of traffic redirection (forwarding path determination and data forwarding)
  - No edge-node involvement
    - Fast response time; security focused on the infrastructure
    - Fully transparent; no application awareness necessary
Why Structured Overlays

[Figure: fully connected mesh of overlay nodes between source S and destination D]

- Resilient Overlay Networks (MIT)
  - Fully connected mesh; each node has full knowledge of the network
  - Fast, independent calculation of routes
  - Nodes can construct any path: maximum flexibility
- Cost of flexibility
  - Protocol needs to choose the "right" route/nodes
  - Per-node O(n) state
  - Each node monitors n - 1 paths; O(n^2) total path monitoring is expensive (a rough numeric comparison follows below)
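To make the monitoring-cost contrast concrete, here is an illustrative back-of-the-envelope calculation (not from the talk): it compares the paths each node probes in a RON-style full mesh against a prefix-routing structured overlay assumed to keep roughly b * log_b(n) neighbors, with the digit base b = 16 chosen arbitrarily.

```python
# Illustrative comparison of per-node monitoring cost (assumed parameters).
import math

def mesh_paths_per_node(n: int) -> int:
    # Full mesh (RON-style): every node monitors a path to every other node.
    return n - 1

def structured_paths_per_node(n: int, b: int = 16) -> int:
    # Prefix-routing overlay: roughly one neighbor per digit value per prefix
    # level, i.e. about b * log_b(n) paths to probe (b = 16 is an assumption).
    return math.ceil(b * math.log(n, b))

for n in (100, 1000, 10000):
    print(f"n={n}: mesh={mesh_paths_per_node(n)}, structured={structured_paths_per_node(n)}")
```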
The Big Picture

[Figure: overlay nodes layered above the Internet; end hosts reach each other through the overlay]

- Locate a nearby overlay proxy
- Establish an overlay path to the destination host
- The overlay routes traffic resiliently
Traffic Tunneling

[Figure: legacy nodes A and B (IP addresses) register with nearby proxies; each proxy publishes its mapping into the structured peer-to-peer overlay with put(hash(A), P'(A)) and put(hash(B), P'(B)), and A's proxy resolves B with get(hash(B)) -> P'(B) before tunneling traffic]

- Store a mapping from each end host's IP address to its proxy's overlay ID (sketched below)
- Similar to the approach in Internet Indirection Infrastructure (I3)
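A minimal sketch of this registration and lookup, assuming a generic DHT-style put/get interface; the Dht object, the Proxy class, and the SHA-1 keying are illustrative stand-ins, not the system's actual API:

```python
# Hypothetical registration/lookup for traffic tunneling over a structured
# overlay. The dht interface (put/get) and Proxy class are illustrative.
import hashlib

def overlay_key(ip: str) -> bytes:
    # hash(B) in the slide: hash the legacy host's IP to get an overlay key.
    return hashlib.sha1(ip.encode()).digest()

class Proxy:
    def __init__(self, overlay_id: bytes, dht):
        self.overlay_id = overlay_id      # P'(A): this proxy's overlay ID
        self.dht = dht

    def register(self, legacy_ip: str) -> None:
        # A legacy node registers with a nearby proxy; the proxy publishes
        # the IP -> proxy-ID mapping into the overlay: put(hash(A), P'(A)).
        self.dht.put(overlay_key(legacy_ip), self.overlay_id)

    def lookup(self, dest_ip: str) -> bytes:
        # To reach legacy node B, fetch its proxy's ID: get(hash(B)) -> P'(B),
        # then tunnel traffic across the overlay toward that ID.
        return self.dht.get(overlay_key(dest_ip))
```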
Pros and Cons

- Leverage small neighbor sets
  - Fewer neighbor paths to monitor: O(n) -> O(log n)
  - Reduction in probing bandwidth
  - Faster fault detection
- Actively maintain static route redundancy
  - Manageable for a "small" number of paths
  - Redirect traffic immediately when a failure is detected; eliminates on-the-fly calculation of new routes
  - Restore redundancy in the background after a failure
- Fast fault detection + precomputed paths = more responsiveness
- Cons: the overlay imposes routing stretch (mostly < 2)
In-network Resiliency Details

- Active periodic probes for fault detection
  - Exponentially weighted moving average (EWMA) link quality estimation (sketched below)
    - Avoids route flapping due to short-term loss artifacts
    - Loss rate: L_n = (1 - α) · L_{n-1} + α · p
  - Simple approach taken; much ongoing research
    - Smart fault detection / propagation (Zhuang04)
    - Intelligent and cooperative path selection (Seshadri04)
- Maintaining backup paths
  - Create and store backup routes at node insertion
  - Query neighbors after failures to restore redundancy
    - Ask any neighbor at or above the routing level of the faulty node
    - e.g. ABCD sees ABDE failed, can ask any AB?? node for info
  - Simple policies to choose among redundant paths
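A minimal sketch of the probe-driven EWMA estimator above; the class name is illustrative, and treating link quality as 1 minus the smoothed loss rate is an assumption:

```python
# EWMA link-quality estimation driven by periodic probes (illustrative sketch).
class LinkMonitor:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha       # filter constant; 0.2-0.4 per the measurements
        self.loss_rate = 0.0     # L_n, smoothed loss estimate

    def on_probe_result(self, received: bool) -> None:
        # p = 1 if the probe was lost in this period, 0 if it was answered.
        p = 0.0 if received else 1.0
        # L_n = (1 - alpha) * L_{n-1} + alpha * p
        self.loss_rate = (1.0 - self.alpha) * self.loss_rate + self.alpha * p

    def quality(self) -> float:
        # Assumed definition: link quality = 1 - smoothed loss rate.
        return 1.0 - self.loss_rate
```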
First Reachable Link Selection (FRLS)

- Use link quality estimation to choose the shortest "usable" path (sketched below)
  - Use the shortest path whose quality exceeds a minimal threshold T
- Correlated failures
  - Reduce with intelligent topology construction
- Goal: leverage the available redundancy
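A minimal sketch of FRLS selection over precomputed routes, assuming each routing entry keeps its primary and backup routes ordered shortest-first and reuses the LinkMonitor quality estimate from the previous sketch; the Route type and the threshold value are illustrative:

```python
# FRLS: pick the first (shortest) route whose estimated quality clears the
# threshold T; if none qualifies, the deck's fallback is constrained multicast
# (duplicating traffic on several paths). Illustrative sketch.
from typing import Optional, Sequence

class Route:
    def __init__(self, next_hop: str, monitor: "LinkMonitor"):
        self.next_hop = next_hop
        self.monitor = monitor   # LinkMonitor from the EWMA sketch above

def select_route(routes: Sequence[Route], threshold: float = 0.7) -> Optional[Route]:
    # routes are ordered shortest-first (primary, then backups).
    for route in routes:
        if route.monitor.quality() > threshold:
            return route
    return None   # all paths below threshold -> fall back (e.g. constrained multicast)
```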
Evaluation

- Metrics for evaluation
  - How much routing resiliency can we exploit?
  - How fast can we adapt to faults (responsiveness)?
- Experimental platforms
  - Event-based simulations on transit-stub topologies
    - Data collected over multiple 5000-node topologies
  - PlanetLab measurements
    - Microbenchmarks on responsiveness
Exploiting Route Redundancy (Sim)

[Figure: % of all pairs reachable vs. proportion of IP links broken (0 to 0.2), comparing instantaneous IP routing with Tapestry/FRLS]

- Simulation of Tapestry, 2 backup paths per routing entry
- Transit-stub topology shown; results from TIER and AS graphs are similar
Responsiveness to Faults (PlanetLab)

[Figure: time to switch routes (ms) vs. link probe period (200-1200 ms), for filter constants α = 0.2 and α = 0.4; at a 300 ms probe period the switch time is roughly 660 ms]


- Two reasonable values for the filter constant α
- Response time scales linearly with the probe period (see the worked example below)
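One way to see the linear scaling from the EWMA above (the threshold and starting state here are hypothetical, chosen only for illustration): starting from L = 0, k consecutive lost probes give L_k = 1 - (1 - α)^k, so the smoothed loss rate first exceeds a threshold T after roughly k = ln(1 - T) / ln(1 - α) lost probes. With α = 0.2 and T = 0.5 that is about 3-4 probes, a fixed count independent of timing, so detection time (and hence route-switch time) grows in direct proportion to the probe period.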
Link Probing Bandwidth (PlanetLab)

[Figure: probing bandwidth per node (KB/s) vs. overlay size (1 to 1000 nodes, log scale), for probe periods (PR) of 300 ms and 600 ms]


- Bandwidth increases logarithmically with overlay size
- Medium-sized routing overlays incur low probing bandwidth
Conclusion

- Trading flexibility for scalability and responsiveness
  - Structured routing has low path maintenance costs
    - Allows "caching" of backup paths for quick failover
  - Can no longer construct arbitrary paths
    - But a simple policy exploits the available redundancy well
- Fast enough for most interactive applications
  - 300 ms beacon period -> response time < 700 ms
  - ~300 nodes, bandwidth cost = 7 KB/s
Ongoing Questions

- Is this the right approach?
- Is there a lower bound on desired responsiveness?
  - Is this responsive enough for VoIP?
  - If not, is multipath routing the solution?
- What about deployment issues?
  - How does inter-domain deployment happen?
  - A third-party approach? (Akamai for routing)
Related Work

- Redirection overlays
  - Detour (IEEE Micro 99)
  - Resilient Overlay Networks (SOSP 01)
  - Internet Indirection Infrastructure (SIGCOMM 02)
  - Secure Overlay Services (SIGCOMM 02)
- Topology estimation techniques
  - Adaptive probing (IPTPS 03)
  - Internet tomography (IMC 03)
  - Routing underlay (SIGCOMM 03)
- Many, many other structured peer-to-peer overlays
- Thanks to Dennis Geels / Sean Rhea for their work on BMark
Backup Slides
Another Perspective on Reachability

[Figure: proportion of all pairwise paths vs. proportion of IP links broken (0 to 0.2), divided into regions: paths where both IP and FRLS route successfully; paths where FRLS finds a path but short-term IP routing fails; paths that exist but neither IP nor FRLS can locate; and pairs where no failure-free path remains]
Constrained Multicast

[Figure: duplicate messages sent across multiple overlay hops converge on the destination, where duplicates are recognized by message ID and dropped]

- Used only when all paths are below the quality threshold
- Send duplicate messages on multiple paths
- Leverage route convergence (duplicate suppression sketched below)
  - Assign unique message IDs
  - Mark duplicates
  - Keep a moving window of IDs
  - Recognize and drop duplicates
- Limitations
  - Assumes loss is not from congestion
  - Ideal for local-area routing
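A minimal sketch of the duplicate suppression described above, assuming each message carries a unique ID and the receiver only needs to remember a bounded window of recent IDs; the window size and names are illustrative:

```python
# Duplicate suppression with a moving window of recently seen message IDs.
from collections import deque

class DuplicateFilter:
    def __init__(self, window_size: int = 1024):
        self.window = deque()            # IDs in arrival order, oldest first
        self.seen = set()                # same IDs, for O(1) membership tests
        self.window_size = window_size

    def accept(self, msg_id: int) -> bool:
        """Return True for the first copy of a message, False for duplicates."""
        if msg_id in self.seen:
            return False                 # duplicate arrived on another path: drop
        self.window.append(msg_id)
        self.seen.add(msg_id)
        if len(self.window) > self.window_size:
            self.seen.discard(self.window.popleft())   # slide the window
        return True
```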
Latency Overhead of Misrouting

[Figure: proportional latency overhead vs. latency from source to destination (20-180 ms), shown separately for hops 0 through 4]
Bandwidth Cost of Constrained Multicast

[Figure: proportional bandwidth overhead vs. latency from source to destination (20-160 ms), shown separately for hops 0 through 4]