Correcting Congestion-Based Error in Network Telescope’s Observations of Worm Dynamics

Songjie Wei, Computer & Information Sciences Dept., University of Delaware
Jelena Mirkovic, Information Sciences Institute, University of Southern California

Proceedings of the ACM Internet Measurement Conference (IMC) 2008

Introduction
- Worms compromise hosts and interfere with network operation
  - Worms enlist bots, crash routers, congest links
- Network telescopes monitor Internet-wide security incidents
  - An allocated portion of IPv4 space, otherwise unused
  - Collecting data about worms, spoofed DDoS, etc.
- Network telescopes collect information about worm dynamics
  - CAIDA telescope, WAIL telescope
  - Code Red (2001), Slammer (2003), Witty worm (2004)

Motivation | How to Correct | Validation | Conclusions

Sample Telescope Observation of a Worm
- Slammer’s early progress [Moore03] according to the WAIL telescope (/8) at University of Wisconsin
- [Figure annotations: 80M x 404 B x 8 / 256 ≈ 1 Gbps (404 B: worm scan size; 256: telescope size); "Too slow"]

[Moore03] David Moore, Vern Paxson, Stefan Savage, Colleen Shannon, Stuart Staniford, and Nicholas Weaver. Inside the Slammer Worm. IEEE Security and Privacy, Jul/Aug 2003.

Why Do We Need Accurate Observations?
- Curiosity
  - We should know what is REALLY going on
- Sometimes the only way to test worm defenses is with a simulator or a model
  - Internet-scale defenses
  - Also new worm propagation approaches
- Simulators and models must be validated against some ground truth, usually a telescope’s observations
- Inaccurate observations lead to inaccurate simulators and models

Network Telescope’s Limitations
- Aggressive worms saturate Internet links and cause packet drops
  - Slammer (404-byte worm packet) infected 75,000+ hosts, with a peak rate of 26,000 scans per second per infectee
  - Witty infected 12,000 computers, generating 90 Gbps
- Telescopes may observe only some worm infectees and receive only a portion of the worm scans sent to them
  - Slow or short-lived infectees are less likely to be observed
  - Worm scans get dropped at crashed routers or on congested links (intermediate network links, the telescope’s access link)

Telescope’s Inference of Worm Dynamics
- Scaling up a telescope’s local observation by the telescope’s size to infer the global worm dynamics
  - For a /8 telescope, local_scans x 256 = global_scans, local_infected = global_infected?
- The traditional inference may be wrong, due to congestion and packet loss
  - Lower inferred number of infected hosts
  - Underestimated worm scanning rates (both the global rate and the individual infectee rate)

Our Research Objectives
- Measure the congestion severity based on the telescope’s observation
  - Estimate scan loss and arrival ratio for each infectee
- Infer the worm’s global dynamics
  - Correct the telescope’s local observation with consideration of network congestion and packet loss
  - Scale up the local worm observation correctly to reflect the global dynamics of worm propagation in the Internet

Related Work
- Worm simulation scale-down [Weaver04]
  - Not clear how you would use it to scale up
- Worm forensics [Kumar05, Hama06]
  - More accurate than our method but need worm source and human involvement
  - Complementary
- Correcting telescope bias [Zou03]
  - Bias occurs due to delay between a host’s infection and scans seen by the telescope
  - Minor for /8 telescopes

Assumptions about Worms and Internet
- Constant infectee scanning rate
  - Scanning rate is bounded by the infectee’s capability (CPU, memory, access bandwidth, OS, etc.)
- Packet loss mainly due to congestion
  - Many incidents and countermeasures cause packet loss
  - Congestion, routing failure, security countermeasures, etc.
- Hosts sharing routing paths have the same loss ratio
- No significant loss in early stage of worm spread
  - Congestion gradually builds up as more hosts are infected
  - Early stage is the time when no packets or only a few packets are dropped

Scan Arrival Ratio
- Inferring the constant scanning rate
  \[ S = \frac{\sum_{t=1}^{T} R(t)}{T} \]
  - S: constant sending rate
  - R(t): scans received
  - T: duration of early stage
- Scan arrival ratio: the percentage of worm scans sent by an infectee to the telescope that is successfully received
  \[ P(t) = \frac{R(t)}{S} \]
  - R(t): scans received by the telescope during time t
  - S: the infectee’s constant sending rate to the telescope

Inferring the End of the Early Stage
- Epidemic model of worm propagation
  - Cliff C. Zou, Lixin Gao, Weibo Gong, and Don Towsley. “Monitoring and Early Warning for Internet Worms,” ACM CCS, 2003.
- T is when the number of infected hosts departs from the exponential model
  - Due to congestion or end of slow start phase

End of Early Stage: Time T
- We measure the match between the number of infected hosts and the exponential model
  - R-squared: a statistical measure between 0 and 1 that shows the fit between a model’s predictions and the real values
  - 0 means totally different, and 1 means a perfect match
- We define T as the time when the R-squared value starts to decrease continuously from 1
  - Hosts infected before T are called early infectees
  - Underestimating T is OK, as long as we identify enough early infectees

Scan Arrival Ratio for Late Infectees
- Infectees observed after time T are identified as late infectees
  - We cannot infer their constant scanning rates directly because their scans are always observed with loss
- We differentiate between two cases
  - If we know the ratio of another infectee sharing the same inter-AS routing path, we use this known ratio
    - Infectees sharing a routing path suffer the same congestion
  - Otherwise we use
    the average of ratios of all known early infectees
    - Better than random guess
    - More accurate when congestion is closer to the telescope

Inferring Worm Dynamics
- Global number of infected hosts
  - Number of infected hosts in the Internet
  - Also hosts that are infected but not yet seen by the telescope due to congestion
- Worm’s Internet scanning rate
  - Number of worm scans sent into the Internet per second
- Infectees’ scanning rate distribution
  - Reflects the infectees’ aggressiveness
  - Limited by the infectee’s features and configurations

Global Number of Infected Hosts
- Ir(t): number of infectees by time t; ∆Ir(t) = Ir(t) - Ir(t-1)
- Io(t): number of infectees observed by time t; ∆Io(t) = Io(t) - Io(t-1)
- Uo(t): number of infectees not observed by time t
- Pagg(t): aggregated scan arrival ratio of infectees
- Smed(t-1): median scans sent to telescope per second per infectee

\[ \Delta I_o(t) = \bigl(\Delta I_r(t) + U_o(t-1)\bigr) \times \Bigl[1 - \bigl(1 - P_{agg}(t-1)\bigr)^{S_{med}(t-1)}\Bigr] \]
- The bracketed factor is the probability that an infectee is seen by the telescope during time t

\[ U_o(t) = \bigl(\Delta I_r(t) + U_o(t-1)\bigr) \times \bigl(1 - P_{agg}(t-1)\bigr)^{S_{med}(t-1)} \]
- The power term is the probability that an infectee is NOT seen by the telescope during time t

Inference Example

t   ∆Io   Pagg   Smed   Io   ∆Ir       Uo        Ir
0   0     1      0      0    0         0         0
1   1     0.8    5      1    1         0         1
2   3     0.5    5      4    3.00096   0.00096   4.00096
3   2     0.5    5      6    2.06      0.065     6.06
4   3     0.2    5      9    3.03      0.097     9.10
5   6     0.2    5      15   8.83      2.92      17.92

Variation of Scan Arrival to Telescope
[Figure: scans per second received from one infectee versus time (seconds, 1 to 40). Annotations: early stage (no congestion), congestion, constant sending rate to the telescope, time T.]
A sample observation of scans from an infectee to a /8 telescope (from our simulation of Witty worm)

Worm’s Internet Scanning Rate
- Ri(t): number of scans received by the telescope from infectee i during time t
- Pi(t): scan pass ratio of infectee i during time
  t

\[ \text{Scanning Rate} = \frac{I_r(t)}{I_o(t)} \times \sum_{i \in I_o(t)} \frac{R_i(t)}{P_i(t)} \times \frac{\text{IPv4}}{\text{telescope\_size}} \]
- Ir(t) / Io(t): counting those infectees that are not yet observed
- Ri(t) / Pi(t): each infectee’s sending rate
- IPv4 / telescope_size: scaling up local observation to global dynamics

Infectees’ Scanning Rate
- We use the maximum of an infectee’s inferred scanning rates as the infectee’s original scanning rate

\[ B_i = \max_{t > 0}\left(\frac{R_i(t)}{P_i(t)}\right) \times \frac{\text{IPv4}}{\text{telescope\_size}} \]
- Bi: scanning rate of infectee i
- Ri(t) / Pi(t): sending rate of infectee i at time t
- IPv4 / telescope_size: scaling up from the telescope’s observation to the whole Internet

Inferring Global Dynamics of Witty Worm
- Witty worm trace
  - From CAIDA telescope, a /8 block of IPv4
  - Scans begin on March 19, 2004 at 4:45am UTC
  - Original IP addresses
- Ground truth [Kumar05]
- Our experiment
  - Avoid infectees within NATs
  - Filter out infectees with fewer than 20 scans
  - First 75 minutes, 11,326 infectees, 45.5 M scans

[Kumar05] A. Kumar, V. Paxson and N. Weaver, “Exploiting Underlying Structure for Detailed Reconstruction of an Internet-Scale Event”, IMC 2005.

Inferring T - End of the Early Stage
[Figure: R-squared value and # of observed infectees]

Witty Worm’s Internet Scanning Rate
- Larger T tends to underestimate the packet loss and thus the Internet-wide worm propagation strength.

Infectees’ Bandwidth
- Two error sources:
  1. We use average scan arrival ratio for some late infectees
  2.
     We lack information about slow/short-life infectees

Conclusions and Future Work
- We investigate the congestion effect on network telescope’s observation of worm propagation
- We propose ways to estimate congestion packet loss, and to correct telescope’s observation
- We correct CAIDA telescope’s observation of Witty
- We plan to extend our work for other error sources
  - Influence of various telescope sizes on our analysis
  - NATs, filters, non-uniform scanning worms

Questions? Comments?
Jelena Mirkovic (sunshine@isi.edu)
Songjie Wei (weis@cis.udel.edu)

[PAWS] S. Wei and J. Mirkovic, “A Realistic Simulation of Internet-Scale Events,” Proceedings of the 2006 VALUETOOLS Conference, October 2006
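
Illustration: scan arrival ratio

As a minimal sketch of the "Scan Arrival Ratio" slide, the Python below averages one infectee's per-second scan counts over the early stage to estimate its constant sending rate S, then divides every count by S to obtain P(t). The function name, the list-based input, and the zero-based layout (index 0 holds t = 1) are assumptions made for illustration and do not come from the slides.

```python
def scan_arrival_ratio(received, early_stage_end):
    """Estimate S and P(t) for one infectee.

    received[k] is the number of scans the telescope received from the
    infectee during second t = k + 1 (illustrative layout, an assumption).
    early_stage_end is T, the duration of the early stage in seconds.
    """
    if early_stage_end <= 0 or early_stage_end > len(received):
        raise ValueError("T must lie within the observation window")
    # S: average of R(t) over t = 1..T, when loss is assumed negligible.
    sending_rate = sum(received[:early_stage_end]) / early_stage_end
    if sending_rate == 0:
        raise ValueError("no scans observed during the early stage")
    # P(t) = R(t) / S for every second in the window.
    ratios = [r / sending_rate for r in received]
    return sending_rate, ratios


if __name__ == "__main__":
    s, p = scan_arrival_ratio([100, 100, 100, 50, 25], early_stage_end=3)
    print(s, p)
```

With these made-up counts, the first three seconds define S and the last two seconds show the arrival ratio falling as congestion builds.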
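
Illustration: inferring the global number of infected hosts

The recurrence on the "Global Number of Infected Hosts" slide can be inverted step by step: given the newly observed infectees, divide by the probability of being seen to recover how many candidates (new infectees plus previously unobserved ones) must have existed. The sketch below does this under one explicit assumption that is not stated on the slides: when Pagg(t-1) equals 1 the miss probability is taken as 0, which is needed for the first step where Smed(0) = 0. Function and variable names are illustrative. Fed with the inputs of the "Inference Example" slide, it reproduces the Uo and Ir columns up to rounding.

```python
def infer_infected(delta_io, p_agg, s_med):
    """Return [(delta_ir, uo, ir), ...] for t = 0..len(delta_io)-1.

    delta_io[t]: infectees newly observed during step t
    p_agg[t]:    aggregated scan arrival ratio at step t
    s_med[t]:    median scans per second per infectee sent to the telescope
    """
    rows = [(0.0, 0.0, 0.0)]          # state at t = 0
    uo_prev, ir_prev = 0.0, 0.0
    for t in range(1, len(delta_io)):
        p, s = p_agg[t - 1], s_med[t - 1]
        # Assumption: no loss (p == 1) means every infectee is seen.
        miss = 0.0 if p >= 1.0 else (1.0 - p) ** s
        if miss >= 1.0:
            raise ValueError("telescope cannot see any infectee at this step")
        # delta_io = (delta_ir + uo_prev) * (1 - miss), solved for the sum.
        candidates = delta_io[t] / (1.0 - miss)
        delta_ir = candidates - uo_prev
        uo = candidates * miss          # still unobserved after step t
        ir = ir_prev + delta_ir
        rows.append((delta_ir, uo, ir))
        uo_prev, ir_prev = uo, ir
    return rows


if __name__ == "__main__":
    table = infer_infected(
        delta_io=[0, 1, 3, 2, 3, 6],
        p_agg=[1, 0.8, 0.5, 0.5, 0.2, 0.2],
        s_med=[0, 5, 5, 5, 5, 5],
    )
    for t, (d_ir, uo, ir) in enumerate(table):
        print(t, round(d_ir, 5), round(uo, 5), round(ir, 5))
```

The inferred total Ir(t) always equals Io(t) + Uo(t), which gives a quick sanity check on any row.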
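
Illustration: scaling local scans up to Internet-wide rates

The two scale-up formulas on the "Worm’s Internet Scanning Rate" and "Infectees’ Scanning Rate" slides can be sketched as follows. The helper names and the choice to describe the telescope by its prefix length are assumptions for illustration; for a /8 telescope the ratio of IPv4 size to telescope size is 2^8 = 256, matching the factor used on the slides. Arrival ratios are assumed to be strictly positive.

```python
def scale_factor(telescope_prefix_len):
    """IPv4 size divided by telescope size, e.g. 256 for a /8 telescope."""
    return 2 ** telescope_prefix_len


def internet_scanning_rate(received, ratios, inferred_infected,
                           observed_infected, telescope_prefix_len=8):
    """Worm's Internet-wide scans per second at one time step.

    received[i], ratios[i]: R_i(t) and P_i(t) for each observed infectee i
    inferred_infected, observed_infected: Ir(t) and Io(t)
    """
    # Sum of each observed infectee's loss-corrected rate to the telescope.
    sent_to_telescope = sum(r / p for r, p in zip(received, ratios))
    # Ir/Io accounts for infectees the telescope has not yet observed.
    unseen_correction = inferred_infected / observed_infected
    return unseen_correction * sent_to_telescope * scale_factor(telescope_prefix_len)


def infectee_scanning_rate(received_series, ratio_series, telescope_prefix_len=8):
    """B_i: the maximum loss-corrected rate of one infectee over time."""
    best = max(r / p for r, p in zip(received_series, ratio_series))
    return best * scale_factor(telescope_prefix_len)


if __name__ == "__main__":
    print(internet_scanning_rate([10, 5], [1.0, 0.5], 3, 2))
    print(infectee_scanning_rate([10, 5], [1.0, 0.25]))
```

Taking the maximum over time for B_i follows the slide's choice of the largest inferred rate as the infectee's original scanning rate.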