Craig Labovitz, G. Robert Malan, Farnam Jahanian, "Internet Routing
Instability." IEEE/ACM Transactions on Networking, 6(5):515-528, 1998.
Craig Labovitz, G. Robert Malan, Farnam Jahanian, "Origins of
Internet Routing Instability." IEEE INFOCOM 1999.
Craig Labovitz, Abha Ahuja, Farnam Jahanian, "Experimental
Study of Internet Stability and Wide-Area Backbone Failures." FTCS 1999.
Internet Routing Instability
Three Papers Presented by Michael A. Smith
Background

Events
  - NSFNet backbone ended in April ‘95

Evident
  - Network degradation
  - bandwidth shortages
  - lack of router switching capacity

“Death of Internet is Imminent”
  - reported by popular press

Routing Instability (“route flaps”)
  - Informally defined as: “the rapid change of network reachability and
    topology information”

Spring 2006
Internet Routing Instability
1 of 50
The Internet Backbone

12 large ISPs, tier one

4000-6000 tier two providers

Large public exchange points are considered
the “core” of the Internet.

Backbone service providers must maintain a
complete map, or default-free routing table.

Divided into different regions of
administrative control called autonomous
systems (AS’s).

Most AS’s exchange routing information
through the border gateway protocol (BGP).
Routing Instability

Origins
  - Router configuration errors
  - Transient physical and data link problems
  - Software bugs

Effects
  - Poorer end-to-end network performance
  - Degradation of overall efficiency of the Internet infrastructure
Route Flaps

Result in a large number of routing
updates passed to core Internet
exchange point routers.

Network instability spreads from
router to router and propagates
throughout the network.

Effects on Internet infrastructure:
  - Increased packet loss
  - Delays in time for network convergence
  - Resource overhead (CPU, memory, etc.)
BGP
An incremental protocol

Does not periodically flood the intra-domain network with
topological information or link state entries
(as IGRP and OSPF do)

Sends update information only upon
changes in topology or policy

Uses TCP as underlying transport
mechanism (as opposed to implementing its own
reliability on top of a datagram service)

As a path vector routing protocol, it limits
the distribution of reachability information.
Routing on the Backbone

path - sequence of intermediate AS’s between source and
destination routers that form a directed route for packets
to travel

Router configuration files allow the stipulation of routing
policies which may:
  - specify the filtering of specific routes
  - modify path attributes before sharing

Policy decisions can be made based on:
  - announcement of routes from peers
  - attributes of announced routes (such as MED’s)

After each router makes a new local decision on the best
route to a destination, it sends that route to its peers.

As the route propagates, each AS appends its unique
number to the route’s ASPATH, which, in conjunction with
the prefix, provides a specific handle for transit.

The ASPATH mechanism allows a router to detect and
prevent routing loops.
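The ASPATH loop check can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not router code: the function name `receive_route`, the tuple representation of an ASPATH, and the AS numbers are all assumptions made for the example.

```python
from typing import Optional, Tuple

AsPath = Tuple[int, ...]


def receive_route(local_as: int, as_path: AsPath) -> Optional[AsPath]:
    """Return the ASPATH to pass on to peers, or None if the route loops.

    A route whose ASPATH already contains our own AS number has passed
    through us before, so accepting it would create a routing loop.
    """
    if local_as in as_path:
        return None  # loop detected: discard the route
    # Otherwise add our AS number before propagating the route further.
    return as_path + (local_as,)


# AS 3 hears a route that has traversed AS 1 and AS 2: accepted and extended.
accepted = receive_route(3, (1, 2))
# AS 2 hears a route that already contains AS 2: rejected.
rejected = receive_route(2, (1, 2, 3))
```

Because every AS on the way adds itself, the check needs no global knowledge: each router only inspects the path carried in the update.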
Routing Information in BGP

Two forms:

Announcements
  - Indicates that a router has either learned of a new network
    attachment or has made a policy decision to prefer a different
    route to a destination.

Withdrawals
  - Sent when a router decides that a network is no longer
    reachable
  - Paper distinguishes between:
      - Explicit – associated with actual withdrawal message
      - Implicit – existing route replaced by new route

A BGP update may contain multiple
announcements and withdrawals.

Ideally, routers should only generate routing
updates for relatively infrequent policy changes
and the addition of new physical networks.

It’s been found that BGP’s ASPATH
mechanism is not sufficient to ensure
network convergence.
Methodology of Studies

Geographically diverse exchange points.

Although the route servers do not forward network
traffic, the route servers do peer with over 90% of
the service providers at each exchange point.
Route Tracker Architecture
- Developed on Sun workstations
- Uses MRT and IPMA toolkits to analyze BGP updates
“Internet Routing Instability”

Monitored BGP updates generated by five service provider
backbone routers at the major U.S. public exchange points
over a period of nine months.

Paper distinguishes three types of updates:
  - forwarding instability – may reflect legitimate topological changes
    and affects the paths on which data will be forwarded
  - routing policy fluctuation – reflects changes in routing policy
    information that do not affect forwarding paths
  - pathological – updates are redundant BGP information that reflect
    neither routing nor forwarding instability

Instability is defined as:
  - an instance of either forwarding instability or policy fluctuation

Data reflects the stability of inter-domain Internet routing,
or changes in topology or policy among AS’s

“Intra-domain routing instability is not explicitly measured
and is only indirectly observed through BGP information
exchanged with a domain’s peer.”
Results of Study

The number of BGP updates exchanged per day in
the Internet core is one or more orders of magnitude
larger than expected.

Routing information is dominated by pathological, or
redundant updates, which may not reflect changes in
routing policy or topology.

Instability and redundant updates exhibit a specific
periodicity of 30 and 60 seconds.

Instability and redundant updates show a surprising
correlation to network usage and exhibit
corresponding daily and weekly cyclic trends.

Instability is not dominated by a small set of
autonomous systems or routes.
Results of Study (2)

Instability and redundant updates exhibit
both strong high and low frequency
components. Much of the high frequency
instability is pathological.

Discounting the contribution of redundant
updates, the majority (over 80%) of
Internet routes exhibits a high degree of
stability.

This work has led to specific architectural
and protocol changes in commercial
Internet routers through the collaboration
with vendors.
Methodology of Study (2)
12 GB of data starting in January ’96

Uses several tools from XYZ toolkit

Focuses on largest exchange, Mae-East

Data verification against BGP
backbone logs from a number of large
service providers
More Background

Problems of network topology fluctuation (nonconvergence):
  - packets get dropped
  - packets delivered out of order

Internet routers of the day were based on route caching
architecture.
  - Each interface card maintains a routing table cache of
    destination and next-hop lookups
  - If found, then switch on CPU-independent “fast-path.”
  - Sustained levels of instability increase the probability of a
    packet encountering a cache miss, which leads to:
      - increased load on CPU
      - increased switching latency
      - dropped or lost packets
      - queuing delay, preventing timely routing of Keep-Alive packets
  - It should be noted that new generations of routers that do
    not require caching and are able to maintain the full
    routing table in memory do not exhibit the same
    pathological loss under heavy routing updates.
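To make the cache-miss argument concrete, here is a toy model of a route-caching interface card. It is a sketch under simplifying assumptions, not a description of any vendor's hardware: the class name `InterfaceCard`, the exact-match lookup on a prefix string (real routers do longest-prefix matching), and the peer names are invented for the example.

```python
class InterfaceCard:
    """Toy route cache in front of a full routing table held by the CPU."""

    def __init__(self, routing_table):
        self.routing_table = routing_table  # full table (slow path, uses CPU)
        self.cache = {}                     # destination -> next hop (fast path)
        self.hits = 0
        self.misses = 0

    def forward(self, destination):
        """Look up the next hop, preferring the cache."""
        if destination in self.cache:
            self.hits += 1                  # fast path: no CPU involvement
            return self.cache[destination]
        self.misses += 1                    # slow path: CPU consults full table
        next_hop = self.routing_table.get(destination)
        if next_hop is not None:
            self.cache[destination] = next_hop
        return next_hop

    def route_update(self, destination, next_hop):
        """Apply a routing update; next_hop=None models a withdrawal."""
        if next_hop is None:
            self.routing_table.pop(destination, None)
        else:
            self.routing_table[destination] = next_hop
        # Any change invalidates the cached entry, forcing a miss next time.
        self.cache.pop(destination, None)


card = InterfaceCard({"10.0.0.0/8": "peer-A"})
for _ in range(3):
    card.forward("10.0.0.0/8")              # 1 miss, then 2 hits
card.route_update("10.0.0.0/8", "peer-B")   # a route flap
card.forward("10.0.0.0/8")                  # miss again because of the flap
```

In this model a stable route costs one miss in total, while a route that flaps costs one miss per flap, which is how sustained instability pushes traffic onto the CPU.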
Route Flap Storms

A failed router can instigate a “route flap
storm.”
  - This pathological oscillation causes overloaded routers
    to be marked as unreachable since the required interval
    of Keep-Alive transmissions is not met.
  - Peers of the failed router find alternative paths for
    destinations previously reachable and transmit updates.
  - After the failed router recovers, it will re-initiate BGP
    peering sessions with peers, transmit large state
    dumps, and cause more routers to fail.

“Route Flap Storms” in 1996 caused
extended outages for several million
network customers.

Newer generations of routers provide a
mechanism for giving BGP and Keep-Alive
messages higher priority.
Battling Routing Instability

Route Aggregation (Supernetting):
  - combines a number of smaller IP prefixes
    into a single, less specific route
    announcement.
  - reduces overall number of networks visible
    on the core Internet
  - fails in multi-homing (when end-sites have
    redundant connections to the Internet via
    multiple service providers).
      - In 1996, more than 25% (and growing) of prefixes
        were multi-homed and therefore non-aggregatable.

Deployment of route dampening
algorithms
  - “hold-down” updates that exceed certain
    parameters (e.g. quota of updates per hour)
  - can introduce artificial connectivity problems
    as “legitimate” announcements are delayed.
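Supernetting can be illustrated with Python's standard `ipaddress` module, whose `collapse_addresses` function merges contiguous networks. The helper name `aggregate` and the 10.1.x.0/24 customer blocks are made up for the example, and treating a multi-homed prefix as simply removed from the aggregate is a simplification.

```python
import ipaddress


def aggregate(prefixes):
    """Collapse contiguous prefixes into the fewest covering supernets."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


customer_blocks = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]

# Four contiguous /24s become one less specific announcement.
print(aggregate(customer_blocks))  # ['10.1.0.0/22']

# If 10.1.2.0/24 is multi-homed and has to stay visible as its own route,
# the remaining blocks no longer collapse into a single supernet.
remaining = [p for p in customer_blocks if p != "10.1.2.0/24"]
print(aggregate(remaining))  # ['10.1.0.0/23', '10.1.3.0/24']
```

One multi-homed customer turns a single announcement into three (two aggregates plus the multi-homed /24), which is why a growing multi-homed share limits what aggregation can achieve.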
Problems
The Internet continues to exhibit high
levels of routing instability despite the
increased emphasis on aggregation and
route dampening.

Internet topology is growing
less hierarchical with the addition of new
exchange points and peering
relationships.

The behavior and dynamics of Internet
routing stability had gone mostly without
formal study prior to the publication of
the paper. Little was known!
Observations

Disproportionalism:
  - 42,000 Internet prefixes
  - 1300 Autonomous Systems
  - 1500 Unique ASPATHs
  - 3-6 million routing updates per day
  - 125 updates per network per day
  - At times, 100 prefix announcements per sec.
  - Once exceeded 30 million updates in a day; the monitor crashed!
  - This is a problem for all but the most
    high-end of commercial routers, and
    even they exhibit problems.
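The figures on this slide can be cross-checked with simple arithmetic. The snippet below only uses the numbers quoted above; the function name `updates_per_prefix` is an assumption for the example, and reading "network" as "prefix" is an interpretation.

```python
PREFIXES = 42_000          # Internet prefixes quoted on the slide
SECONDS_PER_DAY = 86_400


def updates_per_prefix(updates_per_day, prefixes=PREFIXES):
    """Average number of routing updates per prefix per day."""
    return updates_per_day / prefixes


low = updates_per_prefix(3_000_000)
high = updates_per_prefix(6_000_000)
print(f"{low:.1f} to {high:.1f} updates per prefix per day")  # 71.4 to 142.9

# 125 updates per prefix corresponds to this many updates per day in total.
implied_total = 125 * PREFIXES              # 5,250,000

# A burst of 100 announcements per second, if sustained for a whole day.
sustained_burst = 100 * SECONDS_PER_DAY     # 8,640,000
```

The quoted 125 updates per network per day sits inside the 71 to 143 range implied by 3-6 million daily updates, so the slide's numbers are mutually consistent.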
Classification of BGP Updates
WADiff – A route is explicitly withdrawn as it
becomes unreachable and it is later replaced with an
alternative route to the same destination; forwarding
instability.

AADiff – A route is implicitly withdrawn and replaced
by an alternative route as the original route becomes
unreachable, or a preferred alternative path becomes
available; forwarding instability.

WADup – A route is explicitly withdrawn and then
re-announced as reachable. This may reflect a
transient topological failure (link or router), or it may
represent a pathological oscillation; forwarding
instability or pathological behavior (see next slide)

All considered to be instability
Classification of Pathological
Behavior (Redundant Updates)

AADup – A route is implicitly withdrawn and
replaced with a duplicate of the original
route (a router should only send an update
for a change in topology).

WWDup – The repeated transmission of BGP
withdrawals for a prefix that is currently
unreachable.

All considered to be pathological instability.

Pathological updates may have a minimal
impact on the performance of the Internet.
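The five categories can be expressed as a small state machine over the update stream of each (peer, prefix) pair. The sketch below is an illustration, not the authors' tooling: the function name `classify`, the tuple layout of an update, representing a route by its ASPATH string, and the extra labels `New` and `Withdraw` (for a first announcement and an ordinary withdrawal) are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# (peer, prefix, kind, route): kind is "A" (announce) or "W" (withdraw);
# route is the announced ASPATH as a string, or None for a withdrawal.
Update = Tuple[str, str, str, Optional[str]]


def classify(updates: List[Update]) -> List[str]:
    """Label each update with one of the slide's categories."""
    # Per (peer, prefix): last announced route, and whether it is reachable now.
    state: Dict[Tuple[str, str], Tuple[Optional[str], bool]] = {}
    labels: List[str] = []
    for peer, prefix, kind, route in updates:
        key = (peer, prefix)
        last_route, reachable = state.get(key, (None, False))
        if kind == "W":
            if reachable:
                labels.append("Withdraw")       # ordinary explicit withdrawal
                state[key] = (last_route, False)
            else:
                labels.append("WWDup")          # withdrawing an unreachable prefix
        elif last_route is None:
            labels.append("New")                # first announcement seen
            state[key] = (route, True)
        else:
            same = route == last_route
            if reachable:                       # announce follows announce
                labels.append("AADup" if same else "AADiff")
            else:                               # announce follows withdrawal
                labels.append("WADup" if same else "WADiff")
            state[key] = (route, True)
    return labels


example = [
    ("p1", "10.0.0.0/8", "A", "701 1239"),
    ("p1", "10.0.0.0/8", "A", "701 1239"),
    ("p1", "10.0.0.0/8", "A", "701 3561"),
    ("p1", "10.0.0.0/8", "W", None),
    ("p1", "10.0.0.0/8", "W", None),
    ("p1", "10.0.0.0/8", "A", "701 3561"),
]
print(classify(example))
# ['New', 'AADup', 'AADiff', 'Withdraw', 'WWDup', 'WADup']
```

A withdrawal from a peer that never announced the prefix is labelled WWDup here, because the prefix is already unreachable from that peer's point of view.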
Expected Instability

Problems affecting aggregation into
supernets:
  - Multi-homing
  - initial lack of hierarchical IP address space
    allocation
  - reluctance to renumber IP addresses
  - Result: Large number of globally visible
    addresses
  - Each globally visible address is reachable by
    one or more paths.
  - You would expect Internet instability to be
    proportional to the total number of available
    paths to all globally visible network
    addresses or aggregates
Mae-East Routing Updates
- Most WWDup withdrawals are transmitted by routers belonging to
  AS’s that never previously announced reachability for the
  withdrawn prefixes.
- On average, 500,000 to 6 million pathological withdrawals per day
Update Totals per ISP on a
Given Day
- Many of the exchange point routers withdraw an order of
  magnitude more routes than they announce during a given
  day.
- Provider I shows the disproportionate effect that a single
  service provider can have on the global routing mesh.
More Observations

Guess what:
  - There is a strong causal relationship between
    the manufacturer of router used by an ISP
    and the ISP’s exhibited level of pathological
    BGP behavior.

Routing updates have a regular,
specific periodicity, usually either 30
or 60 seconds.

The persistence of instability is the
duration of time that routing
information fluctuates before it
stabilizes.
Origins of Routing Pathologies

Some pathological withdrawals can be
attributed to implementation
decisions
  - time-space trade-off in not maintaining state
    of advertisements
  - stateless BGP = O(N*U) updates
  - Presentation of results led to a router
    vendor’s updating of software to maintain
    partial state

Stateless BGP contributes an
insignificant number of updates and
does not account for oscillating
behavior of WWDup and AADup
updates.
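The time-space trade-off can be shown by counting withdrawals with and without per-peer state. This is a sketch under stated assumptions: the slide does not define N and U, so the code reads N as the number of peers and U as the number of withdrawal events, and the function names and peer labels are invented.

```python
def withdrawals_sent(peers, announced_to, stateful):
    """Peers that receive a withdrawal for one prefix.

    A stateful router remembers which peers were sent the announcement and
    withdraws only from those. A stateless router keeps no such record and
    sends the withdrawal to every peer.
    """
    if stateful:
        return [p for p in peers if p in announced_to]
    return list(peers)


def total_withdrawals(peers, events, stateful):
    """Total withdrawals over a sequence of events.

    `events` holds one set per withdrawn prefix, naming the peers that had
    previously been sent an announcement for it.
    """
    return sum(len(withdrawals_sent(peers, announced, stateful))
               for announced in events)


peers = ["A", "B", "C", "D"]                 # N = 4 peers
events = [{"A"}, {"A", "B"}, set()]          # U = 3 withdrawal events

stateless_total = total_withdrawals(peers, events, stateful=False)  # N * U = 12
stateful_total = total_withdrawals(peers, events, stateful=True)    # 3
```

The nine extra withdrawals in the stateless case go to peers that never heard an announcement, which matches the description of WWDup updates elsewhere in the deck.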
Origins of Routing Pathologies (2)

Single-homed, stateless peer routers should
result in at most O(N) updates, but instead:
  - It seemed that each legitimate withdrawal induces
    some type of short-lived pathological network
    oscillation
  - Persistence of these updates is between 1 and 5
    minutes

Periodic routing instability may be caused
by:
  - inadvertent synchronization on update
    transmission
  - improper configuration of interaction between IGP
    and BGP (conversion is lossy)

Internet routing instability still remains
poorly understood
Forwarding Instability
- Instability Density
- Black squares are above a particular threshold (mean of
  detrended data) (345 updates in March, 770 in September)
Forwarding Instability (2)
- A week of raw forwarding
- Little instability over the weekend
Forwarding Instability (3)
- Time series analyses, FFT and MEM spectral estimation, validate results.
- Routing instability corresponds closely to trends in Internet bandwidth
  usage and packet loss (intuitively obvious?)
- Rigorous justification of network usage equating to routing instability is
  problematic due to the size and heterogeneity of the Internet.
Fine-grained Instability Stats.
No single AS consistently dominates
the instability statistics.

There is no correlation between
the size (# routes responsible for in
table) of an AS and its proportion of
the instability statistics.

A small set of paths or prefixes does
not dominate the instability
statistics; instability is evenly
distributed across routes
Fine-grained Instability Stats. (2)
- Internet routing tables are dominated by 6-8 ISPs
- Over the course of the month, their share of the default-free routing tables
  did not change significantly
Fine-grained Instability Stats. (3)
- Internet routing tables are dominated by 6-8 ISPs
- Over the course of the month, their share of the default-free routing tables
  did not change significantly
Fine-grained Instability Stats. (4)
- 80-100% of the daily instability is contributed by Prefix + AS pairs
  announced less than 50 times.
- (a) ISP A announced seven routes between 630 and 650 times with no
  withdrawals
Fine-grained Instability Stats. (5)
- 80-100% of the daily instability is contributed by Prefix + AS pairs
  announced less than 50 times.
- (c) ISP A announced seven routes between 630 and 650 times with no
  withdrawals
Fine-grained Instability Stats. (6)

(a) 20-90% of AADiff events are contributed by
routes that changed 10 times or less

No single route consistently dominates the instability
measured.

Some days, a single Prefix+AS pair contributes
substantially (40%) - accounts for lowest curve in (a)
(ISP A)

WADiff climbs to a plateau about 95% faster than
other three categories.

WADiff has fewest number of Prefix+AS pairs that
dominate their days.
  - Comforting, since categories probably best represent
    topological instability
  - Investigation on prefix alone provided similar results.
Temporal Properties of
Instability Statistics

Update frequency distributions for
instability events at Prefix+AS level
  - Update frequency is the inverse of the inter-arrival
    time between routing updates; higher
    frequency corresponds to a shorter inter-arrival
    time

Other work has been able to capture
the lower frequencies through both
routing table snapshots and end-to-end
techniques
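The relationship between inter-arrival time and update frequency, and the kind of binning used for the histograms, can be sketched as follows. The function names, the 30-second default bin width, and the sample timestamps are assumptions chosen for illustration, not data from the papers.

```python
from collections import Counter


def inter_arrival_times(timestamps):
    """Gaps in seconds between consecutive updates for one Prefix+AS pair."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def update_frequency(gap_seconds):
    """Frequency in updates per second: the inverse of the inter-arrival time."""
    return 1.0 / gap_seconds


def histogram(gaps, bin_seconds=30):
    """Count gaps per bin; each bin is labelled by its lower edge in seconds."""
    counts = Counter(int(gap // bin_seconds) * bin_seconds for gap in gaps)
    return dict(sorted(counts.items()))


# Hypothetical arrival times (seconds) of updates for one Prefix+AS pair.
arrivals = [0, 30, 60, 125, 185, 190]
gaps = inter_arrival_times(arrivals)   # [30, 30, 65, 60, 5]
bins = histogram(gaps)                 # {0: 1, 30: 2, 60: 2}
```

With real data, a pile-up in the 30-second and 60-second bins, rather than a smooth decay, would be the signature of the periodicity reported in the results.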
Temporal Properties of
Instability Statistics (2)
- Histogram distribution captured in 30 second and 1 minute bins
- You would expect a Poisson distribution reflecting exogenous events, such
  as power outages, fiber cuts, and other natural and human events.
- 30 second periodicity suggests widespread systematic influence in origin.
Temporal Properties of
Instability Statistics (3)
- Histogram distribution captured in 30 second and 1 minute bins
- You would expect a Poisson distribution reflecting exogenous events, such
  as power outages, fiber cuts, and other natural and human events.
- 30 second periodicity suggests widespread systematic influence in origin.
Conclusions
Routing instability can have a significant
deleterious impact on Internet infrastructure

Majority (99%) of routing information is
pathological and may not reflect real
network topological changes.

Instability is well distributed across AS’s and
prefix space.

Instability and redundant routing
information exhibit a strong periodicity (of
unknown origin).
Conclusions (2)

[Figure: proportion of Internet routes affected by routing updates]
Conclusions (3)

Current trends in the evolution of the
Internet may have a significant impact on
routing instability and the future
performance of the network.

25% of networks are multi-homed and the
growth rate is about linear

Proliferation of exchange points is leading to
a less hierarchical Internet.

This research helps characterize the effect
of added topological complexity since the
end of the NSFNet backbone.
“Origins of Internet Routing
Instability”

28 months gathering data from more than 40
commercial routers, switches, and Unix-based
PC routers

Also collected IBGP information at the state of
Michigan’s public Internet backbone, MichNet

Maintains that routing instability remains well
distributed across prefix and AS space but
that instability is not related to prefix length.

Since previous paper’s work, the volume of
inter-domain routing messages in the Internet
core has decreased by an order of magnitude.
Research Pays Off



Number of BGP updates almost doubled in 28 months
Number of announcements per day eventually
(finally) surpassed the number of withdrawals at
Mae-East.
On average, across the backbone, exchange point
routers generated only half as many
withdrawals as announcements
New Routing Update
Categories

We still have AADiff, AADup, and
WWDup, but we add:
  - Tup and Tdown – fluctuation in the
    reachability for a given prefix. An
    announced route is withdrawn and
    transitions down, or a currently
    unreachable prefix is announced as
    reachable and transitions up
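Tup and Tdown depend only on whether a prefix changes reachability, so they can be detected with a single set of currently reachable prefixes. The sketch below is illustrative: the function name `classify_transitions`, the (prefix, kind) tuple layout, and the catch-all label `other` (standing in for AADiff, AADup and WWDup) are assumptions.

```python
def classify_transitions(updates):
    """Label each (prefix, kind) update as Tup, Tdown, or other.

    kind is "A" for an announcement and "W" for a withdrawal.
    """
    reachable = set()
    labels = []
    for prefix, kind in updates:
        if kind == "A" and prefix not in reachable:
            reachable.add(prefix)
            labels.append("Tup")      # unreachable prefix announced: transition up
        elif kind == "W" and prefix in reachable:
            reachable.discard(prefix)
            labels.append("Tdown")    # announced route withdrawn: transition down
        else:
            labels.append("other")    # no change in reachability
    return labels


stream = [("10.0.0.0/8", "A"), ("10.0.0.0/8", "A"), ("10.0.0.0/8", "W"),
          ("10.0.0.0/8", "W"), ("10.0.0.0/8", "A")]
labels = classify_transitions(stream)
# ['Tup', 'other', 'Tdown', 'other', 'Tup']
```

For any single prefix the two labels alternate, so their counts can differ by at most one, which fits the observation that Tup is roughly equal to Tdown.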
Breakdown of BGP Updates



Tup roughly equal to Tdown, connection recovery (good!)
Fluctuation in prefix reachability accounts for over 40% of all
non-WWDup BGP traffic
After January ’98, AADup comprised the largest category of updates.
Analysis of AADiffs

90% of MED oscillations involve
only two large ISPs, product of
their specific routing policies.
Dynamically Mapped MED


AS2 always wants traffic flowing from AS3 to AS1 to take the shortest path
through its network, so instead of setting the MED value via static
configuration rules, AS2 dynamically maps the IGP distance between R5
and R3, and between R5 and R4 to the MED attribute value associated
with route advertisements from routers R3 and R4 to AS1.
AS2 influences AS1 who wants to reach Network A. AS1 will prefer the
route via R4.
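A small sketch shows why dynamically mapped MED values turn internal changes into external updates. The IGP distances below are hypothetical, the function names are invented, and choosing purely on lowest MED ignores the other steps of BGP route selection.

```python
def advertise_with_dynamic_med(igp_distance_to_r5):
    """AS2 side: copy each border router's IGP distance to R5 into the MED."""
    return [{"via": router, "med": distance}
            for router, distance in igp_distance_to_r5.items()]


def prefer_lowest_med(advertisements):
    """AS1 side: among routes to Network A, prefer the lowest MED."""
    return min(advertisements, key=lambda adv: adv["med"])["via"]


# Hypothetical IGP distances inside AS2: R4 is closer to R5 than R3 is.
before = advertise_with_dynamic_med({"R3": 20, "R4": 10})
choice_before = prefer_lowest_med(before)   # "R4", as on the slide

# An internal IGP change in AS2 alters the MEDs it advertises to AS1.
after = advertise_with_dynamic_med({"R3": 5, "R4": 10})
choice_after = prefer_lowest_med(after)     # "R3"
```

Every internal distance change in AS2 produces new announcements with different MED values, and AS1 may flip its preferred exit as a result; such updates are announcements that differ only in attributes, which would be counted as AADiff.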
More Results
Conclusions

Improvement
  - Routing update messages reduced by
    an order of magnitude
  - Suppressed pathological withdrawals
  - Instability is still well distributed
    across AS and prefix space
  - More bugs in router software led to
    anomalies
“Experimental Study of Internet Stability
and Wide-Area Backbone Failures”

Conclusions
  - Internet has proven remarkably
    robust.
  - A small number of routes contribute to
    overall unavailability.
  - 40% of routes exhibit multiple failures
  - Outages lasting longer than two hours
    usually represent long-term outages
    requiring significant engineering effort
    for repair
  - BGP failures must stem from
    non-hardware/software sources, probably
    TCP characteristics.