Global Routing Instabilities during Code Red II and Nimda Worm Propagation

Preliminary Report, 19 September 2001
James Cowie, Andy Ogielski, BJ Premore and Yougu Yuan
Renesys Corporation

Update Oct 2001: NANOG 23 presentation [pdf 750k]
Update Dec 2001: Extended technical report [pdf 1033k]

SUMMARY

As part of an ongoing project to develop practical and scalable tools for analysis of very large, high-dimensional Internet behavior datasets, researchers from Renesys Corporation have been studying RIPE NCC's large repository of raw BGP message data. In this online note, we summarize our preliminary analysis of the surprisingly strong impact of the Internet propagation of Microsoft worms (such as Code Red and Nimda) on the stability of the global routing system. The data exhibit strong correlations between BGP message storms and worm propagation periods.

This note will continue to evolve as our analytical tools and techniques evolve, and will reflect the results of ongoing experiments and the daily arrival of new BGP message traffic.

INTRODUCTION

Many successful academic and commercial projects use direct traffic measurements (such as ping, traceroute, and web page access data) to study the structure and dynamics of the Internet. Such efforts are inherently limited by the locations of the probe points required to 'cover' the Internet meaningfully. Compounding the problem, there are no effective shortcuts - simply placing agents throughout the Internet's core, as done by several commercial services, only builds up a picture of core-to-core traffic latencies and losses that has no power to predict the true "Internet weather" that end users actually experience at the network edge.

Studying global routing data provides one of the few alternatives to traffic-based analysis of the Internet's dynamics. By the very nature of globally distributed BGP routing processes, a listener at any well-connected point has the opportunity to obtain a very accurate picture of the evolution of best routes to every prefix in the Internet, delayed only by seconds to minutes. In particular, research has begun to focus on the dynamics of streams of BGP route update messages as a tool to identify the emergence of long-lived routing instability events.

DO MICROSOFT WORMS CAUSE GLOBAL ROUTING INSTABILITY?

On multiple occasions, we have detected hours-long periods of exponential growth and decay in the route change rates, across all sampling points and most prefixes, indicating significant widespread degradation in the end-to-end utility of the global Internet. To our very great surprise, these events did not correlate with point failures in the core Internet infrastructure, such as power outages in telco facilities or fiber cuts. Instead, we have documented a compelling connection between global routing instability and the propagation phase of Microsoft worms such as Code Red and Nimda. Contrary to conventional wisdom, what were thought to be purely traffic-based denials of service are in fact seen to generate widespread end-to-end routing instability originating at the Internet's edge.

We speculate that, although most of the traffic in the Internet continued to flow normally through the small fraction of links that make up the global backbones, most of the links at the Internet edge had serious performance problems during the worms' probing and propagation phases.
A complete list of reasons still needs to be documented, but we suspect: i) congestion-induced failures of BGP sessions due to timeouts; ii) flow-diversity-induced failures of BGP sessions due to router CPU overloads; iii) proactive disconnection of certain networks; and iv) failures of other equipment at the Internet edge, such as DSL routers and other devices.

METHODOLOGICAL BACKGROUND

When a BGP router's "best route" to a given network prefix has changed (for better or worse), it sends out a BGP UPDATE message to each connected peer router. By establishing BGP peering connections with a large number of BGP routers from well-connected organizations, analysis of traffic gathered at a single BGP monitoring point can provide a great deal of information about the way those organizations view the Internet, and about the dynamics of how paths change over a wide range of timescales. The RIPE routing information project maintains several such collection points across Europe; they peer with many of the so-called global tier-1 providers, plus very many smaller regional European networks. Access to multiple BGP monitoring points provides additional opportunities to filter out the effects of infrastructure failures that are "close to" individual collection points, clearing the way to unambiguously identify and study routing instability features that affect large portions of the Internet simultaneously.

OVERVIEW OF CONCERNS

We are performing multiresolution analyses of tens of gigabytes of archived BGP message data from the RIPE collection points, seeking to learn what we can about the origins and mechanisms of global routing instability. There are two predominant strategies for using routing statistics to measure global Internet instability:

Reachability; that is, measuring the number of prefixes that appear in a particular organization's routing tables at a given time.

Rates of change; that is, measuring the number of prefix announcements and withdrawals in BGP UPDATE messages sent out by a particular organization per unit time.

The BGP protocol contains dampening features that prevent a BGP router from exchanging "too many" messages about a given prefix with a given peer. As a result, one never sees information about route changes to a given network prefix more frequently than once every 30 seconds per peer. If we see large increases in the number of BGP update messages, therefore, it is an unambiguous sign that the diversity among the network prefixes under discussion is rising.

DISTINGUISHING FEATURES

The duration of these BGP message surges, and the nature of the growth (linear or exponential, for example), are what distinguish truly global Internet instability from simple background noise (which is pervasive). Very short, high spikes in announcement rates are very common -- whenever a peer BGP session undergoes a hard reset, for example, a full table dump will follow. More surprisingly, our examination of the data indicates that failures in the core Internet infrastructure (fiber cuts, flooding, generator failures, building collapses, train wrecks) tend to generate only short-term increases in the BGP prefix announcement rate, which revert to the mean in a matter of seconds or minutes as the highly redundant core Internet topology routes around the damage.
Specific networks may remain unreachable until the damage is repaired, but because content networks are so vastly outnumbered by access networks, the "average" network prefix presumably adds very little in the way of marginal utility to the "average" Internet user. Of far greater concern is the appearance of sustained exponential rises in BGP message rates that last for hours. That is what the worm-triggered traffic causes.

GLOBAL ROUTING STABILITY -- SUMMER OF 2001

The results described here are based on analysis of time-stamped BGP messages collected at the RIPE NCC site rrc00 in Amsterdam, the Netherlands, for the period of June - September 2001. We also analyzed BGP traffic from other Internet exchanges that host RIPE BGP collection sites, including LINX (London), SFINX (Paris), AMS-IX (Amsterdam), CIXP (Geneva), and VIX (Vienna). The extended analysis results will be presented in coming updates to this note, and in separate publications.

The RIPE NCC collection facility is particularly interesting, as it collects BGP routing updates from several large Internet providers, and thus provides a good and fairly complete dynamic view - second by second - of the evolving state of global routing. The following Autonomous Systems have BGP peering routers at the RIPE NCC location:

AS286, KPNQwest Backbone
AS513, CERN
AS1103, SURFnet
AS2914, Verio
AS3257, Tiscali Global ASN
AS3549, Global Crossing (two separate peering sessions)
AS4608, Telstra Internet
AS4777, APNIC Pty Ltd - Tokyo
AS7018, AT&T
AS9177, Nextra (Schweiz)
AS13129, Global Access Telecommunications

TRENDS IN THE AGGREGATE RATE OF BGP ANNOUNCEMENTS

The wide graphic plot above shows the trends in the aggregated rate of BGP prefix announcements received from all of the above peers at the RIPE NCC data collection points from 1 June through 24 September 2001. The X-axis is time, and the Y-axis is the log of the total number of network addresses (prefixes) advertised in consecutive 30-second windows. In other words, each dot in the plot represents a 30-second count of the number of prefixes advertised in BGP Update messages received in Amsterdam (a minimal sketch of this binning appears below).

What this plot shows: These data provide a coarse indicator of global BGP instability, because they sum over all of the advertisements of all of the prefixes by all of the autonomous systems that peer with RIPE NCC. The timeseries represents a measure of the gross route change activity for the entire Internet. Any patterns and features observable at this high level might tell us something about the global state of BGP routing, as it fluctuates from day to day.

Surprisingly, several strong patterns and features emerge from the background of noise. For example, one can observe strong weekly and daily trends in the rate of route advertisements, an effect which may be due to interactions with either the diurnal patterns of traffic, or the diurnal patterns of activity by network operators performing routine maintenance on BGP routers (interestingly, if the latter were the case, it looks like network operators ramp up their work from Monday through Wednesday, and then steadily taper off towards the weekend).

Two strange non-periodic features jump out when the baseline is examined:
1. a BGP message storm on July 19th, an order of magnitude above the baseline rate, and
2. a more rapidly rising and longer-lasting BGP message storm on September 18th.
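As an aside on methodology, the 30-second counts behind this plot can be reproduced with a few lines of code. The sketch below is our own illustration, not the toolchain that produced the figures; it assumes that the RIPE RIS update dumps have already been parsed by a separate step into (timestamp, number-of-prefixes-announced) pairs, and the record format and function names are ours.

    # Minimal sketch: bin BGP prefix announcement counts into 30-second windows,
    # as in the aggregate-rate plot described above. Input records are assumed
    # to be (unix_timestamp, prefixes_announced_in_message) pairs produced by
    # some external parser of the RIS rrc00 update dumps.

    from collections import Counter
    import math

    WINDOW = 30  # seconds, matching the plot's binning

    def bin_announcements(updates, window=WINDOW):
        """Return {window_start_timestamp: total prefixes announced}."""
        bins = Counter()
        for ts, n_prefixes in updates:
            bins[int(ts) // window * window] += n_prefixes
        return dict(sorted(bins.items()))

    if __name__ == "__main__":
        # Tiny synthetic example standing in for a parsed rrc00 update stream.
        sample = [(995500003, 12), (995500017, 4), (995500031, 250), (995500059, 7)]
        for start, count in bin_announcements(sample).items():
            # The report plots log(count); guard against empty bins if extending this.
            print(start, count, round(math.log10(count), 2))

Plotting the logarithm of each bin's count against the bin start time yields the kind of timeseries discussed here.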
What this plot does not show: The above aggregated timeseries does not serve as a measurement of "reachability" over time --- that would be achieved by plotting the number of prefixes in the routing tables, and watching for dips. Because the data aggregate across all ASes and prefixes, there are few "interesting" features when network infrastructure is broken at localized geographic points. Interestingly, neither the Baltimore tunnel train wreck of 18 July nor the attacks of 11 September appear as features in this plot. These tragic events did not destabilize the global Internet. In general, the high levels of routing activity following fiber cuts between tier-1 and other major providers remain localized within the immediately affected Autonomous Systems, and do not create message storms that are highly visible worldwide.

But something else did create a long-lasting instability of BGP routes on July 19th and September 18th. In particular, we are concerned with the two non-periodic features visible on 19-20 July and 18-19 September. These two "storms" in BGP update rates correlate with the propagation phases of the Microsoft worms known as Code Red 2 (in July) and Nimda (in September).

BGP STORM #1: JULY 19, 2001 (CODE RED II)

On July 19th, we observed an exponentially growing eight-fold increase in the advertisement rate, over a period of about eight hours (all times are in GMT; subtract 4 hours for EDT). This BGP surge faded over the same time scale as it arrived. When one considers the conventional wisdom about BGP convergence times (seconds to minutes), it is more than a little disturbing to see a fundamental quantity like the BGP advertisement rate exhibiting exponential growth for eight hours. One initial guess was a delayed effect from the Baltimore train wreck, whose impact was highly visible in the discussions on various mailing lists such as NANOG, as network operators "tweaked" routing for the next day or so. But this does not appear to have caused the BGP storm.

[Figure: A zoom-in on the BGP message storm of July 19.]

In order to gain a better understanding of the mechanism driving this BGP storm, we conducted a finer analysis. We began by separating the contributions of individual BGP peering sessions to the total BGP update message traffic, and by separately following the time courses of BGP route announcement and withdrawal messages.

[Figure: BGP prefix announcements in 60 min periods. Each row is one BGP peer in RIPE NCC. Note a wave of announcements on July 19.]

Consider the plot above, in red. In this plot, BGP prefix announcements are counted in 60-minute bins, and graphed along the Z-axis as impulses. The X-axis is time, from July 1st through July 31st. The Y-axis (going into the screen) separates the contributions of the 13 individual peer ASes at rrc00 (each data row parallel to the X-axis represents announcements from one BGP router peering with the RIPE NCC message-collecting router). On other days, it is common for individual peers to contribute "spikes" of high advertisement volume to the mix, presumably reflecting BGP sessions closing and opening close to the collection point. On July 19th, however, all peers experience a "wave" of smoothly increasing traffic, sustained for many hours.

[Figure: BGP prefix withdrawals in 60 min periods. Each row is one BGP peer in RIPE NCC. Note a wave of withdrawals on July 19.]
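The per-peer decomposition just described can be sketched in the same style as the earlier binning example. The snippet below is again our own illustration under an assumed record format of (timestamp, peer AS, announced-prefix count, withdrawn-prefix count); it separates announcements from withdrawals and accumulates each in 60-minute bins per peering session, which is the data behind the two surface plots.

    # Minimal sketch: per-peer 60-minute counts of announced and withdrawn
    # prefixes, assuming an already-parsed rrc00 update stream.

    from collections import defaultdict

    BIN = 3600  # seconds: the 60-minute bins used in the July surface plots

    def per_peer_rates(updates, bin_size=BIN):
        """Return {peer_as: {bin_start: (announced, withdrawn)}}."""
        announced = defaultdict(lambda: defaultdict(int))
        withdrawn = defaultdict(lambda: defaultdict(int))
        for ts, peer, n_ann, n_wd in updates:
            b = int(ts) // bin_size * bin_size
            announced[peer][b] += n_ann
            withdrawn[peer][b] += n_wd
        peers = set(announced) | set(withdrawn)
        return {peer: {b: (announced[peer][b], withdrawn[peer][b])
                       for b in sorted(announced[peer])}
                for peer in peers}

    if __name__ == "__main__":
        # Synthetic records standing in for a fragment of parsed rrc00 data:
        # (unix timestamp, peer AS, prefixes announced, prefixes withdrawn).
        sample = [(995526000, 286, 40, 2), (995526300, 7018, 15, 0),
                  (995527900, 286, 8, 30)]
        for peer, bins in sorted(per_peer_rates(sample).items()):
            print("AS%d" % peer, bins)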
Similarly, the withdrawal plot above (in blue) shows the hourly rate of BGP withdrawal messages across all peers at rrc00 --- the prefix withdrawal count on the Z-axis gives an indication of the number of network prefixes that are no longer reachable via the given peer. The July 19th surge is the only occasion in July when all 13 peers register a significant simultaneous surge in withdrawal rates, lasting for approximately eight hours.

Further analysis of the BGP message traffic has since indicated that no specific autonomous system or set of autonomous systems seems to be generating the traffic surge, and that no specific IP prefix or set of prefixes was flapping significantly more than before the onset of the surge. Instead, the net effect was that routes to most of the 110,000 prefixes in the Internet were changing a few more times than normal. In other words, the data reflect a broad-based BGP storm with no single-point cause.

CORRELATION WITH CODE RED II ATTACK PHASE

The time course of the July 19 BGP storm suggests that it was triggered by the sudden spread of a new variant of the Microsoft worm known as Code Red 2. Our analysis has been materially aided by data collected by the network security community and announced on the incidents.org, jammed.com and neohapsis.com mailing lists. One can find there a higher than usual number of anecdotal reports of sudden connectivity losses, ARP storms, and similar localized worm effects. In addition, however, several alert people presented quantitative data supporting the assertion that a new type of Code Red worm started a very rapid propagation phase at a time coinciding with the onset of the BGP storm that we have shown above.

Ideal data on the worm-generated traffic storm would show the time series of worm activity for a good statistical sample of networks of known size (i.e. prefix length), so that the global activity levels could be inferred by extrapolation. Here we plot the Code Red 2 propagation data collected independently on two class B networks (i.e. each nominally containing 2^16 = 64k IP addresses) during the entire day of July 19. The original data were collected by Ken Eichman and Dave Goldsmith, respectively; the relevant posts are here and here, and are summarized in the incidents.org mailing list "Handler's Diaries" of July 20 and July 21.

The time series shown above (in red) plot the number of HTTP requests (or TCP SYN packets) received in the two distinct /16 networks, hour after hour. Note the virtually identical time course of these attacks as seen from different networks. These plots give a measure of the intensity of worm scanning traffic at all affected networks - that is, knowing the probability of target IP address generation by the worm, plots like these allow us to estimate the global level of worm-induced traffic. This analysis is continuing as we receive more such data.

A thorough analysis of the host infection rate by Code Red 2 is available from the CAIDA Analysis of Code-Red, and in particular David Moore's analysis, The Spread of the Code-Red Worm (CRv2).

NETWORK REACHABILITY FAILURES DURING CODE RED II

In the following figure we show that all prefixes were similarly affected by the effects of the Code Red 2 worm attack on the stability of global routing. Only the classful networks (/8, /16 and /24) are shown for clarity, but similar behavior has been seen for all intermediate prefix lengths.
[Figure: Rate of prefix withdrawals in 30-sec intervals for selected prefix lengths.]

Further analysis will be presented to demonstrate that no particular AS or prefix, and no particular set of ASes or prefixes, was to blame for this instability.

BGP STORM #2: SEPTEMBER 18-19, 2001 (NIMDA)

On Tuesday, September 18, simultaneous with the onset of the propagation phase of the Nimda worm, we observed another BGP storm. This one came on faster and rode the trend higher, and then, just as mysteriously, turned itself off, though much more slowly. Over a period of roughly two hours, starting at about 13:00 GMT (9am EDT), rrc00 aggregate BGP announcement rates exponentially ramped up by a factor of 25, from 400 per minute to 10,000 per minute, with sustained "gusts" to more than 200,000 per minute. The advertisement rate then decayed gradually over many days, reaching pre-Nimda levels by September 24th.

[Figure: A zoom-in on the BGP message storm of September 18 - 19.]

The analysis of the BGP storm triggered by the Nimda worm followed a course similar to the analysis of Code Red presented above. Analysis of finer details continues, but as in July, there does not seem to be a single AS or prefix, or group of ASes or prefixes, whose BGP advertisement rate increased disproportionately during the storm.

[Figure: BGP prefix announcements in 15 min periods. Each row is one BGP peer in RIPE NCC. Note the steep onset of a wave of announcements on September 18.]

In this plot, prefix announcements in 15-minute periods are separated by contributing BGP peers at RIPE NCC, as in the July worm analysis. Note the heavy announcement surge across all peers, with a long-tailed finish that continues past September 20.

[Figure: BGP prefix withdrawals in 15 min periods. Each row is one BGP peer in RIPE NCC. Note the steep onset of a wave of withdrawals on September 18.]

And as in July, this plot of 15-minute September withdrawal data shows a strong correlation of withdrawals among all peers in the September 18-19 event.

CORRELATION WITH THE NIMDA ATTACK

The steep exponential growth of the September 18 BGP storm is aligned with the exponential spread of Nimda, the most virulent Microsoft worm seen to date. The Nimda worm exhibits extremely high scan rates and multiple attack modes generating very heavy traffic, and has been much more damaging than the July Code Red worm. Preliminary analyses describing the Nimda spread and attack modes are available from SecurityFocus and the SANS Institute.

As with the Code Red 2 worm in July, the security mailing lists contain a huge number of non-quantitative reports of a stunningly rapid increase in HTTP probing and of various network slowdowns and connectivity failures, with scan activity jumping dramatically at approximately 13:00 GMT. Edge network administrators were reporting "insane ARP storms", router failures, congestion slowdowns and connectivity problems. The SANS report shows the typical growth trend in the rate of HTTP scans; it correlates precisely with the onset of global routing instability, as measured by increases in the BGP prefix announcement rate.

The time series in the image above (source: SANS) illustrates the trend in the worm scanning rate. Notice the faster onset than with the July 19 Code Red 2 attack. The data shown here do not contain any information about the size of the network (or multiple networks) where the measurements were obtained. As such data become available, analysis of the worm's target IP address generation algorithm will allow us to estimate the global traffic intensity created by the worm spread.
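To make the extrapolation idea concrete, here is a deliberately simplified worked example of our own. It assumes uniformly random target selection over the full 32-bit IPv4 space, which neither Code Red II nor Nimda actually used (both preferred nearby address blocks), so the numbers are purely illustrative; a real estimate must plug in the worm's actual target-generation probabilities.

    # Illustrative only: if targets were chosen uniformly at random from the
    # 2^32 IPv4 addresses, a monitored /16 would observe a fraction
    # 2^16 / 2^32 of all scans, so the Internet-wide scan rate is the observed
    # rate scaled up by 2^16. Real worm scanning was not uniform.

    def global_scan_rate(observed_per_hour, monitored_prefix_len=16):
        """Scale scans seen inside one monitored prefix up to an Internet-wide
        estimate, assuming uniformly random target selection."""
        fraction_observed = 2 ** (32 - monitored_prefix_len) / 2 ** 32
        return observed_per_hour / fraction_observed

    if __name__ == "__main__":
        # Hypothetical figure: 10,000 worm probes per hour seen in one /16.
        print("%.3g probes/hour Internet-wide" % global_scan_rate(10000))
        # -> about 6.55e+08 under the uniform-scanning assumption.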
See also CAIDA's estimates of the total number of Nimda-infected hosts.

NETWORK REACHABILITY FAILURES DURING THE NIMDA ATTACK

In the figure below, we show again that all prefixes were similarly affected by the effects of the Nimda worm attack on global routing stability. Only the classful networks (/8, /16 and /24) are shown for clarity, but similar behavior is seen for all prefix lengths.

[Figure: Rate of prefix withdrawals in 30-sec intervals for selected prefix lengths.]

However, the very rapid onset of the Nimda-induced routing instability, as compared with the comparatively slower onset of the Code Red-induced instability on July 19, supports several preliminary observations:

The rate of withdrawals of smaller networks, such as /24, is the first to begin to rise rapidly.
The log slope of the rate of withdrawals of smaller networks, such as /24, is higher.

At this time, we take this to indicate that the worm-induced routing instabilities begin to propagate from the Internet edge, while the Internet core remains stable (to the extent that the data show it). A preliminary comparative analysis of the BGP routing tables a few hours before and after the onset of the Nimda storm shows that the increase in prefix withdrawals during the period, while significant, did not result in long-lasting losses of reachability to the edge networks. These were, by and large, transient failures.

PRELIMINARY CONCLUSIONS

It would be premature to draw definitive conclusions at this point about the exact causal relationship between worm propagation and global Internet routing instability. However, we are accumulating evidence of at least a strong correlation.

The first hypothesis is that high rates of worm-related traffic are producing significant traffic surges near the edges of the Internet, causing a large number of BGP sessions to time out, close down, and reopen. The reasons may be either congestion losses or router CPU overload due to surges in the number of flows. One would expect BGP messages to be very high-priority traffic, and thus not subject to congestion-related loss until the situation were dire indeed (such prioritization is a good thing, as shown in a recent SIGCOMM paper), but it is less clear that network operators routinely enable this kind of prioritization.

Another hypothesis is that the explosive growth of worm traffic at the Internet's edge causes a large number of network operators at corporations and small ISPs to independently shut down, reboot, or attempt to reconfigure their border routers; the total amount of BGP message traffic then grows exponentially with the number of edge domains feeling the effects of the worm.

Overall, we tentatively conclude that the worm-induced routing instability results from a combination of effects such as the two listed above, plus failures of other network components. We will revise these speculations as more data become available.

Copyright © 2001 Renesys Corporation. Contact James Cowie cowie@renesys.com or Andy Ogielski ato@renesys.com for further information.

Thanks to Henk Uijterwaal and his group for the RIPE RIS data, and to Tim Griffin, Dave Donoho and others at the 2001 Leiden Workshop on Multiresolution Analysis of Global Internet Measurements for fruitful discussions. The motivation for this study arose from work partially supported by the Defense Advanced Research Projects Agency (DARPA), under grant N66001-00-8065 from the U.S. Department of Defense.
Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Department of Defense.

This note should be cited as follows:
Cowie, J., Ogielski, A., Premore, B., and Yuan, Y. (2001) "Global Routing Instabilities during Code Red II and Nimda Worm Propagation." http://www.renesys.com/projects/bgp_instability, September 2001.