Forecasting Network Performance
Les Cottrell, Grid Performance Workshop, Edinburgh, June 22-34, 2005
http://www.slac.stanford.edu/grp/scs/net/talk05/predictedinburgh05.ppt
Partially funded by DOE/MICS for Internet End-to-end Performance Monitoring (IEPM)

Outline
• Why do we want forecasting & anomaly detection?
• What are we using for the input data
  – And what are the problems
• How do we make forecasts, detect anomaly?
  – First approaches
  – The real world
• Results
• Conclusions & Futures
• Possible uses

Uses of Techniques
• Automated problem identification:
  – Alerts for network administrators, e.g.
    • Bandwidth changes in time-series, iperf, SNMP
  – Alerts for systems people
    • OS/Host metrics
  – Anomalies for security
• Forecasts (are a fallout of the techniques) for Grid Middleware, e.g. replica manager, data placement

Data

Using Active IEPM-BW measurements
• Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. HEP tiered model
• Makes regular measurements
  – Ping (RTT, connectivity), traceroute
  – pathchirp, ABwE (packet pair dispersion)
  – iperf (single & multi-stream), thrulay,
  – Bbftp (file transfer application)
    • Looking at GridFTP but complex requiring renewing certificates
• Lots of analysis and visualization
• Running at CERN, SLAC, FNAL, BNL, Caltech to about 40 remote sites
  – http://www.slac.stanford.edu/comp/net/iepmbw.slac.stanford.edu/slac_wan_bw_tests.html

ABwE/abing
• Uses packet pair dispersion of 20 packets to provide:
  – Capacity, X-traffic, available bandwidth
  – At 3 minute intervals
  – Very noisy time series data
[Figure labels: Moving averaged over 1 hour; Capacity]

Pathchirp/Rice/INCITE
• From PAM paper, pathchirp more accurate but
  – Ten times as long (10s vs 1s)
  – More network traffic (~factor of 10)
• Pathload factor of 10 again more
• IEPM-BW now supports both

BUT…
• Packet pair dispersion relies on accurate timing of inter packet separation
  – At > 1Gbps this is getting beyond resolution of Unix
    clocks
  – AND 10GE NICs are offloading function
    • Coalescing interrupts, Large Send & Receive Offload, TOE
• Need to work with TOE vendors
  – Turn off offload
  – Do timing in NICs

Thrulay
• Iperf has multi streams
• Thrulay more manageable & gives RTT
• They agree well
• Throughput ~ 1/avg(RTT)
[Figure: Iperf vs thrulay. Labels: Average RTT, Maximum RTT, Minimum RTT, RTT ms, Achievable throughput Mbits/s]

BUT…
• At 10Gbits/s on transatlantic path, slow start takes over 6 seconds
  – To get 90% of measurement in congestion avoidance need to measure for 1 minute (52.5 GBytes at 7Gbits/s, today’s typical performance)

Passive
• Use Netflow records at border
  – Per flow provide start/stop time, bytes/packets etc.
  – Collect records for several weeks
  – Divide by remote site, add parallel streams
  – Fold data onto one week, see bands at known capacities

Netflow 2/2
• Use existing traffic, no extra traffic
• Works on fast networks

Forecasting and Anomaly detection

Anomaly Detection
• Anomaly is when the actual value significantly differs from the expected value
  – So need forecasts to find anomalies
  – Focus has been on ABwE time-series measurements:
    • Packet pair dispersion on 20 packets
      – Send 20 packet pairs back to back and measure one-way packet separation at remote end
      – Minimum gives an indication of bottleneck capacity of link
    • Measurement each 3 minutes
    • Low network impact BUT very noisy so hard test case

Plateau, most intuitive
• Each observation:
  – If outside history buffer mean mh ± b*sh (sh = history buffer standard deviation) then add to trigger buffer
  – Else add to history, and remove oldest from trigger buffer
• When trigger buffer > t points then trigger issued
  – Check if (mh - mt) / mh > D (mt = trigger buffer mean) & trigger buffer filled to 90% in last T mins, then have trigger
  – Move trigger buffer to history buffer
• Parameters: history length = 1 day, t = trigger length = 3 hours, b = standard deviations = 2
[Figure labels: Observations, History mean, History mean – 2 * stdev, Trigger % full, Event (*)]

K-S
• For each observation: for the previous 100 observations with next 100
    observations
  – Compare the vertical difference in CDFs
  – How does it differ from random CDFs
  – Expressed as % difference

Compare K-S with Plateau
• Results between K-S & plateau very similar, using K-S coefficient threshold = 70%
• Current plateau only finds negative changes
  – Positive changes would also be useful, to see when condition returns to normal
• K-S implemented in C and executes faster than Plateau (in Perl), depends on parameters
• K-S more formalized
• Plateau and K-S work well for non seasonal observations (e.g. small changes day/night)

Seasons & false alerts
• Congestion on Monday following a quiet weekend causes a high forecast, gives an alert
• Also, a history buffer that is not one day long causes the history mean to be out of sync with observations

Diurnal Variation
People arriving at work between 19:00 & 22:00 PDT (7:00 & 10:00 PK time) cause sudden drop in dynamic capacity

Effect on events
• Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00am PK time)
• Causes more anomalous events around this time

Seasonal Changes
• Use Holt-Winters (H-W) technique:
  – Uses triple exponential weighted moving average
    • EWMA(i) = α * Obs(i) + (1-α) * EWMA(i-1)
  – Three terms, each with its own parameter (α, β, γ), that take into account local smoothing, long term seasonal smoothing, and trends

H-W Implementation
• Need regularly spaced data (else going back one season is difficult, and gets out of sync):
  – Interpolate data: select bin size
    • Average points in bin
    • If no points in first week bin then get data from future weeks
    • For following weeks, missing data bins filled from previous week
• Initial values for smoothing from NIST “Engineering Statistics Handbook”
• Choose parms by minimizing (1/N) Σ (Ft - yt)^2
  – Ft = forecast for time t as function of parameters, yt = observation at time t

H-W Implementation
• Three implementations evaluated (two new)
  – FNAL (Maxim Grigoriev)
    • Inspiration for evaluating this method
  – Part of RRD (Brutlag)
    • Limited control over
      what it produces and how it works
  – SLAC
    • Implemented NIST formulation, different formulation/parameter values from Brutlag/FNAL, also added minimizing the sum of squares to get parms

Results

Example
• Local smoothing 99% weight for last 24 hours
• Linear trend 50% last 24 hours
• Seasonal mainly from last week, but includes several weeks
• Within an 80 minute window, 80% points outside deviation envelope ≡ event
[Figure labels: 1 hr avg, Observations, Deviations, Forecast; Weekend, Weekdays]

Evaluation
• Created a library of time series for 100 days from June through Sep 2004 for 40 hosts
• Analyzed using Plateau and saved all events where trigger buffer filled (no filters on size of step)
  – 23 hosts had 120 candidate events
  – Event types: steps; diurnal changes; congestion from cron jobs, bandwidth tests, flash crowds
• Classify ~120 events as to whether interesting
  – Interesting = large, sharp drop in bandwidth, persisting for >> 3hrs

Results
• K-S shows similar results to Plateau
• Adjusting parameters to reduce false positives increases missed events
  – E.g. for plateau with trigger buffer = 3 hrs filled to 90% in < 220 minutes, history buffer = 1 day, effect of threshold D = (mh - mt)/mh:

Plateau (b=2)
D      False   Miss
10%    16%     8%
30%    2%      32%

[Figure: K-S with ± 100 observations]

Conclusions
• A few paths (10%) have strong seasonal effects
• Plateau & K-S work well if only weak seasonal effects
  – K-S detects both step downs & up, also gives accurate time estimate of event (good for correlations)
• H-W promising for seasonal effects, but
  – Is more complex, and requires more parameters which may not be easy to estimate
  – Requires regular data (interpolation step)
• CPU time can depend critically on parameters chosen, e.g. increasing K-S range from ±100 to say ±400 increases CPU time by factor 14
• H-W works, still need to quantify its effectiveness
• Looking at PCA to evaluate multiple metrics simultaneously (e.g.
  fwd & bwd traffic, RTT) AND multiple paths

Future Work
• Future Development in PCA
  – Enable looking at multiple measurements simultaneously
    • E.g. RTT, loss, capacity …; multiple routes
• Neural networks to interpolate heavyweight/infrequent measurements based on lightweight, more frequent ones
• Continue Netflow passive exploration

Some Uses:
• Detect anomalies reliably (few false positives, few misses):
  – Make extra measurements related to anomaly, e.g. ping, traceroute, performance history etc.
  – Notify people (e.g. via email)
• Forecast into future taking account of diurnal changes:
  – Make long-term (hours – days) integrated estimates of performance with probabilities
  – Use for data location selection

Apply forecasts to Router utilizations to find bottlenecks
• Get measurements from Internet2/ESnet/Geant SONAR project via NMWG web services
• Save as time series, forecast for each interface
• For given path and duration forecast most probable bottlenecks
• Use MPLS to apply QoS at bottlenecks (rather than for the entire path) for selected applications

More information
• SLAC Plateau implementation
  – www.acm.org/sigs/sigcomm/sigcomm2004/workshop_papers/nts26-logg1.pdf
• SLAC H-W implementation
  – www-iepm.slac.stanford.edu/monitoring/forecast/hw.html
• Eng. Statistics Handbook
  – http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc435.htm
• IEPM-BW Measurement Infrastructure
  – http://www-iepm.slac.stanford.edu/
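
The buffer logic on the "Plateau, most intuitive" slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SLAC Perl implementation: the function name `plateau_events`, the buffer lengths counted in observations, the warm-up rule and the default values are all hypothetical, and the "filled to 90% in last T mins" check is left out for brevity.

```python
from collections import deque
from statistics import mean, pstdev


def plateau_events(obs, history_len=480, trigger_len=60, b=2.0, D=0.3):
    """Return the indices at which a step-down event is flagged.

    history_len and trigger_len are counted in observations; the defaults
    are hypothetical (1 day and 3 hours of 3-minute samples).
    """
    history = deque(maxlen=history_len)   # recent "normal" observations
    trigger = deque()                     # recent out-of-range observations
    events = []
    for i, x in enumerate(obs):
        if len(history) < 2:              # assumed warm-up: need a mean and stdev
            history.append(x)
            continue
        mh, sh = mean(history), pstdev(history)
        if abs(x - mh) > b * sh:          # outside mh +/- b*sh
            trigger.append(x)
        else:
            history.append(x)
            if trigger:
                trigger.popleft()         # remove oldest from trigger buffer
        if len(trigger) >= trigger_len:   # trigger buffer is full
            mt = mean(trigger)
            if mh > 0 and (mh - mt) / mh > D:
                events.append(i)          # large relative drop: flag an event
            history.extend(trigger)       # move trigger buffer to history
            trigger.clear()
    return events
```

For example, a series that alternates between 100 and 102 and then drops to 50 is flagged once the trigger buffer fills; because only (mh - mt) / mh > D is tested, this sketch, like the plateau method described in the slides, only reports drops.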
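
The K-S comparison described on the "K-S" slide (previous 100 observations against the next 100, maximum vertical difference between CDFs) might look like the sketch below. The slides say the real implementation is in C; this Python version, the names `ks_statistic` and `ks_change_points`, and the reading of the "70%" threshold as a statistic above 0.7 are assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:                       # the CDFs only change at sample values
        fa = bisect_right(sa, x) / len(sa)  # fraction of a that is <= x
        fb = bisect_right(sb, x) / len(sb)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


def ks_change_points(obs, window=100, threshold=0.7):
    """Indices i where the window before i and the window starting at i differ.

    A hit means the K-S statistic of obs[i-window:i] against obs[i:i+window]
    exceeds threshold (assumed mapping of the slides' percentage threshold).
    """
    hits = []
    for i in range(window, len(obs) - window + 1):
        if ks_statistic(obs[i - window:i], obs[i:i + window]) > threshold:
            hits.append(i)
    return hits
```

Because the statistic is symmetric in its two samples, a step up is flagged just as a step down is, which matches the slides' remark that K-S detects changes in both directions. Each index costs a sort of both windows, so widening the window raises CPU time quickly.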
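
The Holt-Winters steps on the "Seasonal Changes" and "H-W Implementation" slides (three smoothed terms, parameters chosen by minimizing (1/N) Σ (Ft - yt)^2) can be illustrated as follows. This is a generic additive form with a naive initialization and a coarse grid search, not the NIST formulation used at SLAC nor the Brutlag/RRD one; the names `holt_winters_additive` and `fit_by_grid` and the grid values are hypothetical. The input is assumed to be already binned onto a regular grid.

```python
def holt_winters_additive(y, season_len, alpha, beta, gamma):
    """One-step-ahead forecasts for y[season_len:] and their mean squared error.

    alpha: local (level) smoothing, beta: seasonal smoothing, gamma: trend
    smoothing. Initial values come from the first season (an assumption).
    """
    level = sum(y[:season_len]) / season_len
    trend = 0.0
    season = [v - level for v in y[:season_len]]
    forecasts = []
    for t in range(season_len, len(y)):
        s = season[t % season_len]              # seasonal term from one season ago
        forecasts.append(level + trend + s)     # forecast made before seeing y[t]
        prev_level = level
        level = alpha * (y[t] - s) + (1 - alpha) * (prev_level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
        season[t % season_len] = beta * (y[t] - level) + (1 - beta) * s
    errors = [(f - v) ** 2 for f, v in zip(forecasts, y[season_len:])]
    return forecasts, sum(errors) / len(errors)


def fit_by_grid(y, season_len, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick (alpha, beta, gamma) from a coarse grid by minimizing the MSE."""
    best = None
    for a in grid:
        for b in grid:
            for g in grid:
                _, mse = holt_winters_additive(y, season_len, a, b, g)
                if best is None or mse < best[0]:
                    best = (mse, a, b, g)
    return best                                 # (mse, alpha, beta, gamma)
```

An event test in the spirit of the "Example" slide would then compare each observation with its forecast plus or minus a deviation envelope; that part is not shown here. A proper optimizer could replace the grid search once the series is long enough to make the search expensive.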