Global Routing Instabilities during Code Red II
and Nimda Worm Propagation
Preliminary Report
19 September 2001
James Cowie, Andy Ogielski, BJ Premore and Yougu Yuan
Renesys Corporation
Update Oct 2001: NANOG 23 presentation [pdf 750k]
Update Dec 2001: Extended technical report [pdf 1033k]
SUMMARY
As a part of an ongoing project to develop practical and scalable tools for analysis of very
large, high-dimensional Internet behavior datasets, researchers from Renesys Corporation
have been studying RIPE NCC's large repository of raw BGP message data.
In this online note, we summarize our preliminary analysis of the surprisingly strong
impact of the Internet propagation of Microsoft worms (such as Code Red and Nimda) on
the stability of the global routing system. The data exhibit strong correlations between
BGP message storms and worm propagation periods.
This note will continue to evolve as our analytical tools and techniques evolve, and will
reflect the results of ongoing experiments and the daily arrival of new BGP message
traffic.
INTRODUCTION
Many successful academic and commercial projects use direct traffic measurements (such
as ping, traceroute, and web page access data) to study the structure and dynamics of the
Internet. Such efforts are inherently limited by the locations of probe points required to
'cover' the Internet meaningfully. Compounding the problem, there are no effective
shortcuts - simply placing agents throughout the Internet's core, as done by several
commercial services, only builds up a picture of core-to-core traffic latencies and losses
that has no power to predict the true "Internet weather" that end users actually experience
at the network edge.
Studying global routing data provides one of the few alternatives to traffic-based analysis
of the Internet's dynamics. By the very nature of globally distributed BGP routing
processes, a listener at any well-connected point has the opportunity to obtain a very
accurate picture of the evolution of best routes to every prefix in the Internet, delayed
only by seconds to minutes. In particular, research has begun to focus on the dynamics of
streams of BGP route update messages as a tool to identify the emergence of long-lived
routing instability events.
DO MICROSOFT WORMS CAUSE GLOBAL ROUTING INSTABILITY?
On multiple occasions, we have detected hours-long periods of exponential growth and
decay in the route change rates, across all sampling points and most prefixes, indicating
significant widespread degradation in the end-to-end utility of the global Internet.
To our very great surprise, these events did not correlate with point failures in the core
Internet infrastructure, such as power outages in telco facilities or fiber cuts.
Instead, we have documented a compelling connection between global routing instability
and the propagation phase of Microsoft worms such as Code Red and Nimda. Contrary to
conventional wisdom, what were thought to be purely traffic-based denials of service in
fact are seen to generate widespread end-to-end routing instability originating at the
Internet's edge.
We speculate that, although most of the traffic in the Internet continued to flow normally
through the small fraction of links that make up the global backbones, most of the links
at the Internet edge had serious performance problems during the worms' probing and
propagation phases. A complete list of reasons still needs to be documented, but we
suspect i) congestion-induced failures of BGP sessions due to timeouts; ii) flow-diversity
induced failures of BGP sessions due to router CPU overloads; iii) proactive disconnection
of certain networks; and iv) failures of other equipment at the Internet edge such as DSL
routers and other devices.
METHODOLOGICAL BACKGROUND
When a BGP router's "best route" to a given network prefix has changed (for better or
worse), it sends out a BGP UPDATE message to each connected peer router. A single BGP
monitoring point that establishes peering connections with a large number of BGP routers
from well-connected organizations can therefore gather traffic whose analysis provides
a great deal of information about the way those organizations view the
Internet, and about the dynamics of how paths change over a wide range of timescales.
The RIPE routing information project maintains several such collection points across
Europe; they peer with many of the so-called global tier-1 providers, plus very many
smaller regional European networks.
Access to multiple BGP monitoring points provides additional opportunities to filter the
effects of infrastructure failures that are "close to" individual collection points, clearing
the way to unambiguously identify and study routing instability features that affect large
portions of the Internet simultaneously.
OVERVIEW OF CONCERNS
We are performing multiresolution analyses of tens of gigabytes of archived BGP
message data from the RIPE collection points, seeking to learn what we can about the
origins and mechanisms of global routing instability.
There are two predominant strategies for using routing statistics to measure global
Internet instability:
- Reachability; that is, measuring the number of prefixes that appear in a particular
  organization's routing tables at a given time.
- Rates of change; that is, measuring the number of prefix announcements and
  withdrawals in BGP UPDATE messages sent out by a particular organization per
  unit time.
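The two measures can be sketched on a toy update stream. This is a minimal illustration, not the tooling used for this study: the record layout `(unix_time, peer, kind, prefix)` and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical record layout: (unix_time, peer, kind, prefix),
# where kind is "A" (announcement) or "W" (withdrawal).
updates = [
    (0, "peer1", "A", "192.0.2.0/24"),
    (5, "peer1", "A", "198.51.100.0/24"),
    (40, "peer1", "W", "192.0.2.0/24"),
    (70, "peer1", "A", "192.0.2.0/24"),
    (75, "peer1", "W", "198.51.100.0/24"),
]

def reachability(updates, peer, at_time):
    """Number of prefixes the peer is still advertising at at_time."""
    table = set()
    for t, p, kind, prefix in sorted(updates):
        if p != peer or t > at_time:
            continue
        if kind == "A":
            table.add(prefix)
        else:
            table.discard(prefix)
    return len(table)

def change_rate(updates, peer, window=30):
    """Announcements plus withdrawals per window, keyed by window start."""
    counts = Counter()
    for t, p, kind, prefix in updates:
        if p == peer:
            counts[(t // window) * window] += 1
    return dict(counts)
```

Reachability replays the stream into a table and counts what is left; the rate of change only counts messages, so a prefix that is withdrawn and quickly re-announced raises the second measure without changing the first.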
BGP contains rate-limiting features (notably the minimum route advertisement interval
timer, conventionally 30 seconds) that prevent a BGP router from exchanging "too many"
messages about a given prefix with a given peer. As a result, one does not normally see
information about route changes to a given network prefix more frequently
than once every 30 seconds per peer. If we see large increases in the number of BGP
update messages, therefore, it is a clear sign that the number of distinct network
prefixes under discussion is rising.
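The counting argument can be checked on a toy stream from a single peer. The sketch below assumes an idealized peer that strictly respects a 30-second spacing per prefix; the function names and the `(unix_time, prefix)` layout are hypothetical.

```python
def respects_spacing(events, min_gap=30):
    """True if no prefix is mentioned twice within min_gap seconds."""
    last = {}
    for t, prefix in sorted(events):
        if prefix in last and t - last[prefix] < min_gap:
            return False
        last[prefix] = t
    return True

def window_stats(events, window=30):
    """events: iterable of (unix_time, prefix) from ONE peer.
    Returns {window_start: (update_count, distinct_prefix_count)}."""
    stats = {}
    for t, prefix in events:
        start = (t // window) * window
        count, seen = stats.get(start, (0, set()))
        seen.add(prefix)
        stats[start] = (count + 1, seen)
    return {s: (c, len(seen)) for s, (c, seen) in stats.items()}

events = [(0, "a"), (10, "b"), (30, "a"), (45, "c"), (59, "b")]
```

When the spacing holds, each prefix can appear at most once per 30-second window, so the update count in a window equals the number of distinct prefixes in it. A higher count then necessarily means more prefixes are involved.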
DISTINGUISHING FEATURES
The duration of these BGP message surges, and the nature of the growth (linear or
exponential, for example) are what distinguish truly global Internet instability from
simple background noise (which is pervasive).
Very short, high spikes in announcement rates are very common -- whenever a peer BGP
session undergoes a hard reset, for example, a full table dump will follow. More
surprisingly, our examination of the data indicates that failures in the core internet
infrastructure (fiber cuts, flooding, generator failures, building collapses, train wrecks)
tend to generate only short-term increases in the BGP prefix announcement rate, which
revert to the mean in a matter of seconds or minutes as the highly redundant core Internet
topology routes around the damage. Specific networks may remain unreachable until the
damage is repaired, but because content networks are so vastly outnumbered by access
networks, the "average" network prefix presumably adds very little in the way of
marginal utility to the "average" Internet user.
Of far greater concern is the appearance of sustained exponential rises in BGP message
rates that last for hours. That is what the worm-triggered traffic causes.
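One simple way to tell exponential from linear growth in a rate series is to compare how well a straight line fits the raw rates and their logarithms. The sketch below is an illustrative heuristic, not the method used in this note. It assumes strictly positive rates that are not all identical.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def growth_kind(times, rates):
    """'exponential' if log(rate) is closer to a straight line than rate itself."""
    linear_fit = r_squared(times, rates)
    log_fit = r_squared(times, [math.log(r) for r in rates])
    return "exponential" if log_fit > linear_fit else "linear"
```

On real data both fits will be noisy, so the comparison is only meaningful over a sustained ramp of many windows, which is exactly the regime of concern here.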
GLOBAL ROUTING STABILITY -- SUMMER OF 2001
The results described here are based on analysis of time-stamped BGP messages
collected at the RIPE NCC site rrc00 in Amsterdam, the Netherlands, for the period of
June - September, 2001. We also analyzed BGP traffic from other Internet exchanges that
host RIPE BGP collection sites, including LINX (London), SFINX (Paris), AMS-IX
(Amsterdam), CIXP (Geneva), and VIX (Vienna). The extended analysis results will be
presented in the coming updates to this note, and in separate publications.
The RIPE NCC collection facility is particularly interesting, as it collects BGP routing
updates from several large Internet providers, and thus provides a good and fairly
complete dynamic view - second by second - of the evolving state of global routing.
The following Autonomous Systems have BGP peering routers at the RIPE NCC
location:
AS286, KPNQwest Backbone
AS513, CERN
AS1103, SURFnet
AS2914, Verio
AS3257, Tiscali Global ASN
AS3549, Global Crossing (two separate peering sessions)
AS4608, Telstra Internet
AS4777, APNIC Pty Ltd - Tokyo
AS7018, AT&T
AS9177, Nextra (Schweiz)
AS13129, Global Access Telecommunications
TRENDS IN THE AGGREGATE RATE OF BGP ANNOUNCEMENTS
The wide graphic plot above shows the trends in the aggregated rate of BGP prefix
announcements received from all of the above peers at the RIPE NCC data collection
points from 1 June through 24 September 2001.
The X-axis is time, and the Y-axis is the log of the total number of network addresses
(prefixes) advertised in consecutive 30-second windows. In other words, each dot in the
plot represents a 30-second count of the number of prefixes advertised in BGP Update
messages received in Amsterdam.
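The construction of each dot can be sketched as follows. This is a minimal illustration that assumes the input has already been reduced to `(unix_time, n_prefixes)` pairs, one per UPDATE message; that layout and the function name are hypothetical.

```python
import math
from collections import Counter

def log_prefix_counts(announcements, window=30):
    """announcements: iterable of (unix_time, n_prefixes), one per UPDATE message.
    Returns a sorted list of (window_start, log10(total prefixes advertised))."""
    totals = Counter()
    for t, n in announcements:
        totals[int(t // window) * window] += n
    # Windows with no advertised prefixes are skipped, since log10(0) is undefined.
    return [(start, math.log10(total))
            for start, total in sorted(totals.items()) if total > 0]
```

On a log scale an exponential ramp appears as a straight line, which is why sustained storms stand out against the spiky background.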
What this plot shows:
These data provide a coarse indicator of Global BGP Instability, because they sum over
all of the advertisements of all of the prefixes by all of the autonomous systems that peer
with RIPE NCC. The timeseries represents a measure of the gross route change activity
for the entire Internet.
Any patterns and features observable at this high level might tell us something about the
global state of BGP routing, as it fluctuates from day to day. Surprisingly, there are
several strong patterns and features that emerge from the background of noise. For
example:
- One can observe strong weekly and daily trends in the rate of route
  advertisements, an effect which may be due to interactions with either the diurnal
  patterns of traffic, or the diurnal patterns of activity by network operators
  performing routine maintenance on BGP routers (interestingly, if the latter were the
  case, it would appear that network operators ramp up their work from Monday
  through Wednesday, and then steadily wind down towards the weekend).
- Two strange non-periodic features jump out when the baseline is examined:
  1. a BGP message storm an order of magnitude above the baseline on July 19th,
  2. a more rapidly rising and longer-lasting BGP message storm on September
     18th.
What this plot does not show:
The above aggregated timeseries does not serve as a measurement of "reachability" over
time --- this would be achieved by plotting the number of prefixes in the routing tables,
and watching for dips.
Because the data aggregate across all ASs and prefixes, there are few "interesting"
features when network infrastructure is broken at localized geographic points.
Interestingly, neither the Baltimore tunnel train wreck of 18 July nor the attacks of 11
September appear as features in this plot. These tragic events did not destabilize the
global Internet. In general, the high levels of routing activity following fiber cuts between
tier-1 and other major providers remain localized within the immediately affected
Autonomous Systems, and do not create message storms that are highly visible
worldwide.
But something else did create a long-lasting instability of BGP routes on July 19th and
September 18th. In particular, we are concerned with the two non-periodic features
visible on 19-20 July and 18-19 September. These two "storms" in BGP update rates
correlate with the propagation phases of the Microsoft worms known as Code Red 2 (in
July) and Nimda (in September).
BGP STORM #1: JULY 19, 2001 (CODE RED II)
On July 19th, we observed an exponentially growing eight-fold increase in the
advertisement rate, over a period of about eight hours (all times are in GMT; subtract 4
hours for EDT). This BGP surge faded over the same time scale as it arrived. When one
considers the conventional wisdom about BGP convergence times (seconds to minutes),
it is more than a little disturbing to see a fundamental quantity like BGP advertisement
rate exhibiting exponential growth for eight hours.
One initial guess was a delayed effect from the Baltimore train wreck, whose impact was
highly visible in the discussions on various mailing lists such as NANOG, as network
operators "tweaked" routing for the next day or so. But this does not appear to have
caused the BGP storm.
A zoom-in on the BGP message storm of July 19.
In order to gain a better understanding of the mechanism driving this BGP storm, we
conducted a finer analysis. We began by separating the contributions of individual BGP
peering sessions to the total BGP update message traffic, and by separately following the
time courses of BGP route announcement and withdrawal messages.
BGP prefix announcements in 60 min periods. Each row is one BGP peer in RIPE NCC.
Note a wave of announcements on July 19.
Consider the plot above, in red. In this plot, BGP prefix announcements are counted in 60
minute bins, and graphed along the Z-axis as impulses. The X-axis is time, from July 1st
through July 31st. The Y-axis (going into the screen) separates the contributions of the 13
individual peer ASs at rrc00 (each data row parallel to the X-axis represents
announcements from one BGP router peering with the RIPE NCC message collecting
router). On other days, it is common for individual peers to contribute "spikes" of high
advertisement volume to the mix, presumably reflecting BGP sessions closing and
opening close to the collection point. On July 19th, however, all peers experience a
"wave" of smoothly increasing traffic, sustained for many hours.
BGP prefix withdrawals in 60 min periods. Each row is one BGP peer in RIPE NCC.
Note a wave of withdrawals on July 19.
Similarly, the above plot (in blue) shows the hourly rate of BGP withdrawal messages
across all peers at rrc00. The prefix withdrawal count on the Z-axis gives an
indication of the number of network prefixes that are no longer reachable via the given
peer. The July 19th surge is the only occasion in July when all 13 peers register a
significant simultaneous surge in withdrawal rates, lasting for approximately eight hours.
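The distinction between an isolated per-peer spike and a wave seen by every peer can be sketched as a simple rule: flag only the hours in which all peers are well above their own typical level. The record layout, the function names and the threshold `factor=3` are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

def hourly_counts(records):
    """records: iterable of (unix_time, peer, n_prefixes).
    Returns {peer: {hour_index: total prefixes in that hour}}."""
    table = defaultdict(lambda: defaultdict(int))
    for t, peer, n in records:
        table[peer][int(t // 3600)] += n
    return table

def wave_hours(table, factor=3):
    """Hours in which EVERY peer exceeds factor times its own median hourly count."""
    hours = set().union(*(set(counts) for counts in table.values()))
    baseline = {peer: median(counts.get(h, 0) for h in hours)
                for peer, counts in table.items()}
    return sorted(h for h in hours
                  if all(counts.get(h, 0) > factor * baseline[peer]
                         for peer, counts in table.items()))
```

A session reset near the collection point raises one peer's row only and is ignored by this rule, whereas a surge like that of July 19th would be flagged for every hour in which all rows rise together.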
Further analysis of the BGP message traffic has since indicated that no specific
autonomous system or set of autonomous systems seems to be generating the traffic
surge, and that no specific IP prefix or set of prefixes was flapping significantly more
than before the onset of the surge.
Instead, the net effect was that routes to most of the 110,000 prefixes in the Internet were
changing a few more times than normal. In other words, the data reflect a broad-based
BGP storm with no single-point cause.
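A broad-based storm can be told apart from a few flapping prefixes by checking how concentrated the updates are. The sketch below is one hypothetical way to do this, not the analysis performed for this note: it computes the share of all updates that comes from the `k` most active prefixes, where `k=10` is an arbitrary choice.

```python
from collections import Counter

def top_share(prefixes, k=10):
    """Fraction of all updates attributable to the k most active prefixes.
    prefixes: iterable with one entry per prefix update observed."""
    counts = Counter(prefixes)
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total if total else 0.0
```

If the total volume rises sharply while this share stays roughly where it was before the storm, the extra updates are spread across the table rather than coming from a handful of prefixes. The same function applies unchanged to origin AS numbers.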
CORRELATION WITH CODE RED II ATTACK PHASE
The time course of the July 19 BGP storm suggests that it was triggered by the
sudden spread of a new variant of the Microsoft worm known as Code Red 2.
Our analysis has been materially aided by data collected by the network security
community and announced on the incidents.org, jammed.com and neohapsis.com mailing
lists. One can find there a higher than usual number of anecdotal reports of sudden
connectivity losses, ARP storms, and similar localized worm effects. In addition,
however, several alert people presented quantitative data supporting the assertion that a
new type of Code Red worm started a very rapid propagation phase at a time coinciding
with the onset of the BGP storm that we have shown above.
Ideal data on the worm-generated traffic storm would show the time series of worm
activity for a good statistical sample of networks of known size (i.e. prefix length), so that
the global activity levels could be inferred by extrapolation.
Here we plot the Code Red 2 propagation data collected independently on two class B
networks (i.e. each nominally containing 2^16 = 64k IP addresses) during the entire day
of July 19. The original data were collected by Ken Eichman and Dave Goldsmith,
respectively; their posts are summarized in the incidents.org mailing list
"Handler's Diaries" of July 20 and July 21.
The time series shown above (in red) plot the number of HTTP requests (or TCP SYN
packets) received in the two distinct /16 networks hour after hour. Note the virtually
identical time course of these attacks as seen from different networks. These plots give a
measure of the intensity of worm scanning traffic at all affected networks - that is,
knowing the probability of target IP address generation by the worm, plots like the one
above allow one to estimate the global level of worm-induced traffic. The analysis of this is
continuing as we receive more such data.
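The extrapolation step can be sketched under the simplest possible model, in which the worm draws each target uniformly at random from the whole IPv4 address space. That uniformity is an assumption made for this example; a worm that favors nearby addresses would need a different weighting. The function name is hypothetical.

```python
def estimate_global_probes(observed, prefix_len, address_bits=32):
    """Scale probes seen in one /prefix_len network up to the whole address space,
    ASSUMING targets are drawn uniformly at random from all addresses."""
    fraction_monitored = 2 ** (address_bits - prefix_len) / 2 ** address_bits
    return observed / fraction_monitored
```

Under this model a /16 covers 1/65,536 of the address space, so 1,000 probes observed there in an hour would correspond to about 65.5 million probes per hour worldwide.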
A thorough analysis of the host infection rate by Code Red 2 is available from CAIDA's
"Analysis of Code-Red", and in particular from David Moore's analysis "The Spread of the
Code-Red Worm (CRv2)".
NETWORK REACHABILITY FAILURES DURING CODE RED II
In the following figure we show that all prefixes were similarly affected by the
routing instability that accompanied the Code Red 2 worm attack. Only the classful networks
(/8, /16 and /24) are shown for clarity, but similar behavior has been seen for all
intermediate prefix lengths.
Rate of prefix withdrawals in 30-sec intervals for selected prefix lengths
Further analysis will be presented to demonstrate that no particular AS, or prefix, and no
particular set of ASs or prefixes, were to blame for this instability.
BGP STORM #2: SEPTEMBER 18-19, 2001 (NIMDA)
On Tuesday, September 18, simultaneous with the onset of the propagation phase of the
Nimda worm, we observed another BGP storm. This one came on faster, rode the trend
higher, and then, just as mysteriously, turned itself off, though much more slowly. Over a
period of roughly two hours, starting at about 13:00 GMT (9am EDT), rrc00 aggregate
BGP announcement rates exponentially ramped up by a factor of 25, from 400 per minute
to 10,000 per minute, with sustained "gusts" to more than 200,000 per minute. The
advertisement rate then decayed gradually over many days, reaching pre-Nimda levels by
September 24th.
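The growth figures quoted for the two storms can be turned into doubling times. The sketch below assumes the rate grew as a pure exponential over the whole ramp, which is an idealization of the observed curves.

```python
import math

def doubling_time(factor, duration):
    """Time for an exponentially growing rate to double, in the units of duration,
    given that it grew by `factor` over `duration`."""
    return duration * math.log(2) / math.log(factor)

# Nimda: 400 to 10,000 announcements per minute over roughly two hours.
nimda_minutes = doubling_time(10_000 / 400, 120)
# July 19: eight-fold increase over about eight hours.
code_red_minutes = doubling_time(8, 8 * 60)
```

Under that assumption the September ramp doubled roughly every 26 minutes, against about 160 minutes for the July event.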
A zoom-in on the BGP message storm of September 18 - 19.
The analysis of the BGP storm triggered by the Nimda worm followed a similar course
to the analysis of Code Red presented above. Analysis of finer details continues, but as in
July, there does not seem to be a single AS or prefix, or group of ASs or prefixes, whose
BGP advertisement rate increased disproportionately during the storm.
BGP prefix announcements in 15 min periods. Each row is one BGP peer in RIPE NCC.
Note steep onset of a wave of announcements on September 18.
In this plot, prefix announcements in 15 min periods are separated by contributing BGP
peers at RIPE NCC as in the July worm analysis. Note the heavy announcement surge
across all peers, with a long-tailed finish that continues past September 20.
BGP prefix withdrawals in 15 min periods. Each row is one BGP peer in RIPE NCC.
Note steep onset of a wave of withdrawals on September 18.
And as in July, this plot of 15-minute September withdrawal data shows strong
correlation of withdrawals among all peers in the September 18-19 event.
CORRELATION WITH THE NIMDA ATTACK
The steep exponential growth of the September 18 BGP storm is aligned with the
exponential spread of Nimda, the most virulent Microsoft worm seen to date. The Nimda
worm exhibits extremely high scan rates and multiple attack modes generating very heavy
traffic, and has been much more damaging than the July Code Red worm.
Preliminary analyses describing the Nimda spread and attack modes are available from
SecurityFocus and SANS Institute.
As with the Code Red 2 worm in July, the security mailing lists contain a huge number of
non-quantitative reports of a stunningly rapid increase of HTTP probing and of various
network slowdowns and connectivity failures, with scan activity jumping dramatically at
approximately 13:00 GMT. Edge network administrators were reporting "insane ARP
storms", router failures, congestion slowdowns and connectivity problems.
The SANS report shows the typical growth trend in the rate of HTTP scans; it correlates
precisely with the onset of global routing instability, as measured by increases in the BGP
prefix announcement rate.
The time series in the image above (source: SANS) illustrates the trend in the worm
scanning rate. Notice a faster onset than with the July 19 Code Red 2 attack. The data shown
here do not contain any information about the size of a network (or multiple networks)
where the measurements were obtained. As such data become available, the analysis of
the worm's target IP address generation algorithm will allow us to estimate the global
traffic intensity created by the worm spread. See also CAIDA's estimates of the total
number of Nimda infected hosts.
NETWORK REACHABILITY FAILURES DURING THE NIMDA ATTACK
In the figure below, we show again that all prefixes were similarly affected by the
routing instability that accompanied the Nimda worm attack. Only the classful networks (/8,
/16 and /24) are shown for clarity, but similar behavior is seen for all prefix lengths.
Rate of prefix withdrawals in 30-sec intervals for selected prefix lengths.
However, the very rapid onset of Nimda-induced routing instability, compared with
the slower onset of the Code Red induced instability on July 19, supports
several preliminary observations:
- The rate of withdrawals of smaller networks, such as /24, is the first to begin to
  rise rapidly.
- The log slope of the rate of withdrawals of smaller networks, such as /24, is
  higher.
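A comparison of onset times across prefix lengths can be sketched with a simple threshold rule: the onset is the first window whose withdrawal rate exceeds a multiple of the pre-storm mean. The function name, the ten-window baseline and the factor of five are illustrative assumptions, and the series must contain at least `baseline_windows` samples.

```python
def onset_index(rates, baseline_windows=10, factor=5):
    """Index of the first window whose rate exceeds factor times the mean of the
    first baseline_windows windows, or None if that never happens."""
    baseline = sum(rates[:baseline_windows]) / baseline_windows
    for i, rate in enumerate(rates[baseline_windows:], start=baseline_windows):
        if rate > factor * baseline:
            return i
    return None

# Toy withdrawal-rate series per prefix length (one value per 30-second window).
toy_rates = {
    24: [10] * 10 + [20, 60, 200, 600],
    8: [2] * 10 + [2, 3, 4, 12],
}
onsets = {length: onset_index(series) for length, series in toy_rates.items()}
```

In this toy data the /24 series crosses its threshold two windows before the /8 series, which is the ordering described in the first observation.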
At this time, we take this to indicate that the worm-induced routing instabilities begin to
propagate from the Internet edge, while the Internet core remains stable (to the extent that
the data show it). A preliminary comparative analysis of the BGP routing tables a few
hours before and after the onset of the Nimda storm shows that the increase in prefix
withdrawals during the period, while significant, did not result in long-lasting losses of
reachability to the edge networks. These were, by and large, transient failures.
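A before-and-after table comparison of this kind reduces to set operations on the prefixes present in each snapshot. The sketch below assumes each snapshot is simply a collection of prefix strings; the function name and the example prefixes are hypothetical.

```python
def table_diff(before, after):
    """Compare two routing table snapshots given as collections of prefixes."""
    before, after = set(before), set(after)
    return {
        "lost": before - after,      # reachable before, missing afterwards
        "gained": after - before,    # newly present afterwards
        "kept": before & after,      # present in both snapshots
    }

diff = table_diff(
    ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"],
    ["192.0.2.0/24", "203.0.113.0/24"],
)
lost_fraction = len(diff["lost"]) / 3
```

A small "lost" set alongside a large withdrawal count is the signature of transient failures: prefixes were withdrawn and then re-announced between the two snapshots.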
PRELIMINARY CONCLUSIONS
It would be premature to draw definitive conclusions at this point about the exact causal
relationship between worm propagation and global Internet routing instability. However,
the evidence we are accumulating points to at least a strong correlation.
The first hypothesis is that high rates of worm-related traffic are resulting in significant
traffic surges near the edges of the Internet, causing a large number of BGP sessions to
time out, close down, and reopen. The cause may be either congestion losses or
router CPU overload due to surges in the number of flows. One would expect
BGP messages to be very high priority traffic, and thus not subject to congestion-related
loss until the situation was dire indeed (such prioritization is a good thing, as shown in a
recent SIGCOMM paper), but it is less clear that network operators routinely enable this
kind of prioritization.
Another hypothesis is that explosive growth of worm traffic at the Internet's edge causes
a large number of network operators from corporations and small ISPs to independently
shut down, or reboot or attempt to reconfigure their border routers; and the total amount
of BGP message traffic grows exponentially with the number of edge domains feeling the
effects of the worm.
Overall, we tentatively conclude that the worm-induced routing instability results from
the combination of effects such as these two listed above, plus the failures of other
network components. We will revise these speculations as more data become available.
Copyright © 2001 Renesys Corporation.
Contact James Cowie cowie@renesys.com or Andy Ogielski ato@renesys.com for
further information.
Thanks to Henk Uijterwaal and his group for the RIPE RIS data, and to Tim Griffin,
Dave Donoho and others at the 2001 Leiden Workshop on Multiresolution Analysis of
Global Internet Measurements for fruitful discussions.
The motivation for this study arose from work partially supported by the Defense
Advanced Research Projects Agency (DARPA), under grant N66001-00-8065 from the
U.S. Department of Defense. Its contents are solely the responsibility of the authors and
do not necessarily represent the official views of the Department of Defense.
This note should be cited as follows:
Cowie, J., Ogielski, A., Premore, B., and Yuan, Y. (2001)
"Global Routing Instabilities during Code Red II and Nimda
Worm Propagation."
http://www.renesys.com/projects/bgp_instability, September 2001.