STATISTICAL and APPLIED MATHEMATICAL SCIENCES INSTITUTE

Long-Range Dependence in a Changing Internet Traffic Mix

Cheolwoo Park
SAMSI

Félix Hernández-Campos, Don Smith
Department of Computer Science, UNC-Chapel Hill

J. S. Marron
Department of Statistics and Operations Research, UNC-Chapel Hill

David Rolls
Department of Mathematics and Statistics, UNC-Wilmington

Measurements
Capture TCP/IP packet headers on Gigabit Ethernet link (inbound from Internet)
[Diagram: UNC connected to the Internet over a 1 Gbps Ethernet link; labels: ~35,000 Internet users, Monitor (tcpdump)]

Summary data
• Two-hour traces, 2nd week in April of 2002 and 2003
  – 5:00 AM, 10:00 AM, 3:00 PM, 9:30 PM on each of 7 days
  – 28 traces (56 hours) per year
• 2002 traces
  – ~5 billion packets, ~1.6 terabytes of network traffic
  – ~95% TCP packets, ~5% UDP packets
  – ~93% TCP bytes, ~7% UDP bytes
  – 10% max 2-hr. mean link utilization
  – 0.01%-0.16% packets dropped by monitor
• 2003 traces
  – ~10 billion packets, ~2.9 terabytes of network traffic
  – ~75% TCP packets, ~25% UDP packets
  – ~86% TCP bytes, ~14% UDP bytes
  – 18% max 2-hr. mean link utilization
  – 0 packets dropped by monitor

Hurst parameter (H) estimates and confidence intervals
• H estimated from wavelet analysis tools (“logscale diagrams” of D. Veitch)
• H estimates for 2003 packet counts were significantly lower than for 2002 (not true for byte counts).
• Several traces had H > 1 or very wide confidence intervals.
• H estimates were independent of time of day or day of week (both packets and bytes) in both years.

H not related to link utilization or active TCP connections

Extreme examples of H > 1 or wide confidence intervals

Trace                       Type      H and CI
Wednesday 10:00 AM, 2002    packets   0.84 [0.44, 1.24]
Wednesday 10:00 AM, 2002    bytes     0.65 [0.26, 1.05]
Wednesday 3:00 PM, 2002     bytes     1.23 [1.11, 1.34]
Monday 10:00 AM, 2003       packets   1.31 [1.13, 1.49]
Friday 3:00 PM, 2003        bytes     0.87 [0.46, 1.27]
Saturday 10:00 AM, 2003     packets   1.18 [0.98, 1.39]

Table I.
Traces with H estimates or CI ranges that represent extreme examples

Dependent SiZer analysis of wide CI example
• Test for statistically significant differences from a fractional Gaussian noise (FGN) process with parameters estimated from the data, H=0.8
• Top: local linear smoothing of data with different window widths
• Bottom: statistical inference on trends of smoothed curve at each window width

Dependent SiZer analysis of H > 1 example
• Analysis shows both non-linear trends and greater variability than an FGN process at many time scales

Logscale diagram of typical 2002 and 2003 traces
• Protocol-dependent analysis suggested by the increase in UDP
• Filtered traces to create new traces: TCP only and UDP only
• TCP is the dominant influence in all cases except 2003 packet counts, where UDP dominates.
• Sharp increase at middle scales shapes the H estimate (less slope, so lower H).

Same conclusion for all traces. Why?

The Blubster effect (2003’s hot new peer-to-peer file sharing application)
• Recall that UDP packets increased to 25% of 2003 packets (but only 14% of bytes).
• Analysis of UDP packets found 70% came from a single application (Blubster) in 2003 that was negligible in 2002.
• Second filtering: make Blubster-only and “Rest” (TCP + other UDP) traces.
• Blubster alone dominated the H estimate for packets, not bytes.

Why? Blubster’s packet traffic is periodic
• SiZer analysis of the Blubster trace, looking for structure beyond white noise
• Found high-frequency variability with periods in the 1-5 second range (caused by update and search queries among peers)
• These correspond to the time scales in the logscale diagram where UDP dominates the wavelet coefficients.

Results summary
• We presented results from a study of traffic on the UNC Internet link from two years, 2002 and 2003.
• A single application generating about 18% of packets and < 10% of bytes in traces can strongly influence the H estimate (in this case, because of periodic behavior).
• A significant number of traces produced H estimates > 1 or wide confidence intervals.
• Dependent SiZer is an effective tool for augmenting wavelet analysis and understanding structure in Internet data.
• H was not related to time-of-day, day-of-week, link utilization, or number of active TCP connections.
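
Illustration: how a periodic component can shape a logscale diagram

The sketch below is a rough illustration of the mechanism behind the Blubster result, not the analysis used in the study. It is a minimal example under stated assumptions: the data are synthetic (white noise plus one sinusoid, not a long-range dependent trace), the wavelet is the Haar wavelet, and the slope is fitted by unweighted least squares, whereas the logscale diagram tools cited in the slides use other wavelets and a weighted fit. The function names, the seed, the 24-sample period, and the octave range 4 to 8 are all hypothetical choices made for the example. It relies on the standard relation for stationary long-range dependent processes, H = (slope + 1) / 2, where the slope is that of log2 wavelet energy against octave.

```python
import math
import random


def haar_logscale(x, max_octave):
    """Return {octave j: log2 of the mean squared Haar detail coefficient}."""
    approx = list(x)
    energies = {}
    for j in range(1, max_octave + 1):
        pairs = len(approx) // 2
        # Orthonormal Haar step: details are scaled differences, approximations scaled sums.
        detail = [(approx[2 * k] - approx[2 * k + 1]) / math.sqrt(2) for k in range(pairs)]
        approx = [(approx[2 * k] + approx[2 * k + 1]) / math.sqrt(2) for k in range(pairs)]
        energies[j] = math.log2(sum(d * d for d in detail) / pairs)
    return energies


def hurst_from_slope(energies, j_lo, j_hi):
    """Fit an unweighted least-squares line over octaves j_lo..j_hi; return (slope + 1) / 2."""
    js = list(range(j_lo, j_hi + 1))
    ys = [energies[j] for j in js]
    j_mean = sum(js) / len(js)
    y_mean = sum(ys) / len(ys)
    num = sum((j - j_mean) * (y - y_mean) for j, y in zip(js, ys))
    den = sum((j - j_mean) ** 2 for j in js)
    return (num / den + 1) / 2


rng = random.Random(2003)  # arbitrary seed, chosen only for repeatability
n = 2 ** 14                # number of synthetic samples (hypothetical)
period = 24                # period of the synthetic component, in samples (hypothetical)

# Synthetic "traffic": white noise alone, and the same noise plus a unit-amplitude sinusoid.
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
mixed = [v + math.sin(2 * math.pi * t / period) for t, v in enumerate(noise)]

ld_noise = haar_logscale(noise, 8)
ld_mixed = haar_logscale(mixed, 8)

# Extra log2 energy contributed by the periodic component at each octave.
bump = {j: ld_mixed[j] - ld_noise[j] for j in ld_noise}
peak_octave = max(bump, key=bump.get)

# Fit from the octaves around the bump up to the coarsest octave computed.
h_noise = hurst_from_slope(ld_noise, 4, 8)
h_mixed = hurst_from_slope(ld_mixed, 4, 8)

print("octave with largest added energy:", peak_octave)
print("H estimate, noise only:", round(h_noise, 3))
print("H estimate, noise plus periodic component:", round(h_mixed, 3))
```

In this toy setting the sinusoid adds energy mainly at the octaves whose time scale is close to its period, so a line fitted from those octaves upward becomes less steep and the H estimate drops. That is the same qualitative effect the slides describe for the 2003 packet counts.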