Characteristics of Internet Background Radiation

Authors: Ruoming Pang, Vinod Yegneswaran, Paul Barford, Vern Paxson, Larry Peterson
Appeared in IMC 2004, Taormina, Sicily, Italy, October 2004
Presenter: Charles Ahern

Introduction
- Older (mid-1990s) Internet traffic studies make no mention of an appreciable amount of ongoing nonproductive traffic
- Today this traffic, whether malicious or benign (misconfigurations), is prevalent
- The goal of this paper is to categorize this traffic and determine where it comes from and what it is doing

Outline
- The magnitude of the problem
- How to decide what traffic is “nonproductive”
- Determining the nature of the traffic
  - Filtering
  - Responding (to gain further insight)
- Brief Experiment Details
- Quantifying & Qualifying
- Weaknesses & Contributions

Magnitude
- The magnitude of nonproductive traffic on the Internet is not minor
- Example: traffic logs from Lawrence Berkeley Laboratory (LBL) for an arbitrary day show:
  - 138 different remote hosts each scanned 25,000 or more LBL addresses, for a total of over 8 million connection attempts
  - This is more than DOUBLE the site’s entire successfully established incoming connections, which were originated by 47,000 distinct remote hosts
- Given the traffic’s pervasive nature, the authors have termed it “Internet radiation”

Determining What is Unwanted
- Counting all unsuccessful connection attempts would give an inaccurate statistic
  - It would include transient failures
- Instead, measure traffic sent to hosts that don’t exist
  - Likely to eliminate most transient failures and yield unwanted activity
  - You can safely respond to this traffic

Taming the Large Traffic Volume
- Listening to traffic on thousands to millions of IP addresses… it MUST be handled efficiently
- Nearly 30,000 packets per second of background radiation on the Class A network they are monitoring
- Filtering schemes must be sound and effective

Filtering
- Source-Connection Filtering
  - Keep the first N connections initiated by each source
  - Disadvantages:
    - Inconsistent view of the network
    - The right value of N is attack- and service-dependent
- Source-Port Filtering
  - Keep the first N connections for each source/destination port pair
  - Allows a wider variety of activities
  - Still has the same downsides, though

Filtering
- Source-Payload Filtering
  - One instance of each type of activity per source
  - Good idea, but sometimes hard to implement
  - Hard to tell whether two activities are similar until several packets have been responded to
- Source-Destination Filtering (their choice)
  - Assume one source will try the same activities on every IP it tries to connect to

Filter Effectiveness

Responders
- Highly efficient responder network
- Found that most radiation is TCP SYN packets, which means they must respond to see what the source does next
- Approach to building responders was “data driven”: they determined which responders to build based on traffic volumes
  - Pick the most common form of traffic and build a responder for it
  - Once the traffic could be differentiated into specific types of activity, repeat with the next largest type of traffic

Responders Created
- HTTP (port 80)
- NetBIOS (port 137/139)
- CIFS/SMB (port 139/445)
- DCE/RPC (port 135/1025)
- Dameware (port 6129)
- MyDoom (port 3127)
- Beagle (port 2745)

Responders
- Responders need to stick to the protocol (“how” to say it)
- They also need to know “what” to say to keep communication going
- Differences between connections can be difficult to determine at the network or transport level, so application-level understanding is required
- Responses are developed manually; many are intricate and take research to determine their format

Brief Experiment Details
- Two separate network sites with two different systems: iSink and LBL Sink
- Each system performed the same responses but used different underlying mechanisms

iSink
- Class A network (2^24 addresses)
- And two /19 subnets (16k addresses) on two adjacent UW campus Class B networks
- One filter for each network
- Filtered requests are passed to the iSink
- Did both passive (no responders) and active measurements

iSink Setup

LBL Sink
- Two sets of 10 contiguous /24 subnets
- The first set is passive and unfiltered
- Active analysis is divided into two sets of 5 subnets and filtered
- All traffic is then tunneled to a Honeyd responder

LBL Setup

Summary of Data Collection

Quantifying
- Traffic rate breakdown by protocol (rate is the number of packets per destination IP per day)
- Traffic breakdown by number of sources

Qualifying
- Activities are ranked by number of source IPs, not by byte or packet volume
  - Their filtering algorithm is biased against a source IP that tries to reach many destinations
  - The number of source IPs reflects the popularity of the activity across the Internet
  - Single-source activities might be eccentric, while multi-source activity is more likely to be intentional

Qualifying
- To qualify activities, all connections between a source-destination pair on a given port are looked at
- Only common ports are considered
- What about uncommon ports???

Ports
- Background radiation traffic is highly concentrated on popular ports
- Example: on Mar 29, they saw 32,072 distinct source IPs at LBL, and only 0.5% of the source hosts contacted a port not among the “popular” ports they monitored
- Thus, by looking only at popular ports, most Internet radiation is monitored

Qualifying

Weaknesses
- IP addresses were heavily used in filtering and statistical analysis.
  Because DHCP servers can assign different IP addresses to the same host over time, this can flaw the data
- Many attacks must be known beforehand so that responders can be built for them
- A new worm might be propagating heavily during the short period of their tests, which would skew the typically observed numbers
- Heavier weight is put on “more popular” attacks because of IP filtering, even though “less popular” attacks may generate much more traffic

Contributions
- Were able to quantify how much typical Internet traffic is nonproductive
- Were able to qualify this nonproductive traffic into categories and show that much of it is malicious
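
The Source-Destination Filtering idea from the Filtering slides can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the filter keeps all traffic from a source to the first N distinct destination IPs that source contacts and drops the rest. The class name `SourceDestinationFilter`, the parameter `max_destinations`, and the sample addresses are all hypothetical.

```python
from collections import defaultdict


class SourceDestinationFilter:
    """Hypothetical sketch: per source, admit only the first N distinct destinations."""

    def __init__(self, max_destinations):
        self.max_destinations = max_destinations  # assumed tunable N
        self._admitted = defaultdict(set)  # source IP -> admitted destination IPs

    def accept(self, src, dst):
        """Return True if a packet from src to dst should reach the responders."""
        admitted = self._admitted[src]
        if dst in admitted:
            return True  # already-admitted pair: keep the whole conversation
        if len(admitted) < self.max_destinations:
            admitted.add(dst)  # new destination, still under the per-source cap
            return True
        return False  # this source has already used up its N destinations


# Made-up (source, destination) pairs: one scanner sweeping four addresses,
# plus a second source touching a single address.
packets = [
    ("198.51.100.7", "10.0.0.1"),
    ("198.51.100.7", "10.0.0.2"),
    ("198.51.100.7", "10.0.0.1"),
    ("198.51.100.7", "10.0.0.3"),
    ("198.51.100.7", "10.0.0.4"),
    ("203.0.113.9", "10.0.0.4"),
]

flt = SourceDestinationFilter(max_destinations=2)
kept = [pkt for pkt in packets if flt.accept(*pkt)]
```

With N = 2, the scanner's traffic to its third and fourth targets is dropped, while the second source is unaffected. A filter of this shape caps how much any single prolific source contributes to packet counts, which is consistent with the slides' choice to rank activities by number of source IPs.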