Forecasting Network Performance Grid Performance Workshop, Edinburgh, June 22-23, 2005 Les Cottrell

Forecasting Network Performance
Les Cottrell,
Grid Performance Workshop,
Edinburgh, June 22-23, 2005
http://www.slac.stanford.edu/grp/scs/net/talk05/predictedinburgh05.ppt
Partially funded by DOE/MICS for Internet End-to-end
Performance Monitoring (IEPM)
1
Outline
• Why do we want forecasting & anomaly
detection?
• What are we using for the input data
– And what are the problems
• How do we make forecasts, detect anomalies?
– First approaches
– The real world
• Results
• Conclusions & Futures
• Possible uses
2
Uses of Techniques
• Automated problem identification:
– Alerts for network administrators, e.g.
• Bandwidth changes in time-series, iperf, SNMP
– Alerts for systems people
• OS/Host metrics
– Anomalies for security
• Forecasts (are a fallout of the techniques) for
Grid Middleware, e.g. replica manager, data
placement
3
Data
4
Using Active IEPM-BW
measurements
• Focus on high performance for a few hosts needing to
send data to a small number of collaborator sites, e.g.
HEP tiered model
• Makes regular measurements
– Ping (RTT, connectivity), traceroute
– pathchirp, ABwE (packet pair dispersion)
– iperf (single & multi-stream), thrulay
– bbftp (file transfer application)
• Looking at GridFTP, but it is complex and requires renewing certificates
• Lots of analysis and visualization
• Running at CERN, SLAC, FNAL, BNL, Caltech to
about 40 remote sites
– http://www.slac.stanford.edu/comp/net/iepmbw.slac.stanford.edu/slac_wan_bw_tests.html
5
ABwE/abing
• Uses packet pair dispersion of 20 packets to provide:
– Capacity, X-traffic, available bandwidth
– At 3 minute intervals
– Very noisy time series data
[Figure: ABwE time series showing capacity, with a moving average over 1 hour]
6
Pathchirp/Rice/INCITE
• From PAM paper, pathchirp is more accurate than ABwE, but
– Takes ten times as long (10s vs 1s)
– Generates more network traffic (~factor of 10)
• Pathload is a factor of 10 more again
• IEPM-BW now supports both
7
BUT…
• Packet pair dispersion relies on accurate timing
of inter packet separation
– At > 1Gbps this is getting beyond the resolution of Unix clocks
– AND 10GE NICs are offloading functions
• Coalescing interrupts, Large Send & Receive Offload, TCP Offload Engine (TOE)
• Need to work with TOE vendors
– Turn off offload
– Do timing in NICs
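To give a feel for the timing problem, the short sketch below computes how long one packet occupies the wire at 1 and 10 Gbps. The 1500-byte packet size is an assumption chosen for illustration (it is not stated in the talk), and the function name is hypothetical.

```python
def packet_gap_us(packet_bytes: int, link_bps: float) -> float:
    """Time to serialize one packet onto a link, in microseconds."""
    return packet_bytes * 8 / link_bps * 1e6

# Assumed packet size: 1500 bytes (a common Ethernet MTU).
for gbps in (1, 10):
    gap = packet_gap_us(1500, gbps * 1e9)
    print(f"{gbps} Gbps: {gap:.1f} us between back-to-back packets")
```

At these speeds the spacing between back-to-back packets shrinks to microseconds, so a clock with microsecond granularity leaves little or no room to measure changes in the spacing.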
8
Thrulay
• Iperf has multi streams
• Thrulay more manageable & gives RTT
• They agree well
• Throughput ~ 1/avg(RTT)
[Figure: Iperf vs thrulay; achievable throughput (Mbits/s) with minimum, average and maximum RTT (ms)]
9
BUT…
• At 10Gbits/s on a transatlantic path, slow start takes over 6 seconds
– To get 90% of the measurement in congestion avoidance, need to measure for 1 minute (52.5 GBytes at 7Gbits/s, today's typical performance)
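The arithmetic behind this slide can be sketched as follows. It assumes slow start may take at most 10% of the test, and it treats the whole test as running at the quoted 7 Gbits/s, which gives an upper bound on the data moved. The function name is hypothetical.

```python
def measurement_plan(slow_start_s: float, slow_start_share: float, rate_bps: float):
    """Return (test duration in s, data moved in GBytes) for a throughput test."""
    # If slow start may be at most `slow_start_share` of the test,
    # the test must last slow_start_s / slow_start_share seconds.
    duration_s = slow_start_s / slow_start_share
    # Upper bound: assume the full rate is sustained for the whole test.
    gbytes = rate_bps * duration_s / 8 / 1e9
    return duration_s, gbytes

duration_s, gbytes = measurement_plan(6.0, 0.10, 7e9)
print(f"measure for {duration_s:.0f} s, moving up to {gbytes:.1f} GBytes")
```

With 6 seconds of slow start this gives a 60 second test and roughly 52.5 GBytes of traffic per measurement.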
10
Passive
• Use Netflow records at border
– Per flow provide start/stop time, bytes/packets etc.
– Collect records for several weeks
– Divide by remote site, add parallel streams
– Fold data onto one week, see bands at known capacities
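The steps above can be sketched in a few lines. This is a minimal illustration, not the IEPM code: the record layout `(site, start_s, stop_s, nbytes)` is hypothetical rather than a real Netflow export format, and treating flows to the same site that start in the same second as parallel streams of one transfer is an assumption.

```python
from collections import defaultdict

WEEK_S = 7 * 24 * 3600

def fold_onto_week(flows, bin_s=3600):
    """flows: iterable of (site, start_s, stop_s, nbytes) tuples.
    Returns {(site, bin within the week): [throughput in Mbits/s, ...]}."""
    # Add parallel streams: sum bytes of flows sharing site and start second.
    transfers = defaultdict(lambda: [0, 0])  # -> [total bytes, latest stop]
    for site, start, stop, nbytes in flows:
        entry = transfers[(site, start)]
        entry[0] += nbytes
        entry[1] = max(entry[1], stop)
    folded = defaultdict(list)
    for (site, start), (nbytes, stop) in transfers.items():
        if stop <= start:
            continue  # skip zero-length flows
        mbps = nbytes * 8 / (stop - start) / 1e6
        # Fold: keep only the offset within the week (epoch weeks start Thursday UTC).
        folded[(site, (start % WEEK_S) // bin_s)].append(mbps)
    return dict(folded)

flows = [
    ("siteA", 1000, 1100, 125_000_000),   # two parallel streams ...
    ("siteA", 1000, 1100, 125_000_000),   # ... of one transfer
    ("siteA", 1000 + WEEK_S, 1100 + WEEK_S, 250_000_000),  # one week later
]
result = fold_onto_week(flows)
print(result)
```

Both transfers land in the same weekly bin, which is what makes recurring bands at particular throughputs visible once several weeks are overlaid.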
11
Netflow 2/2
• Use existing traffic, no extra traffic
• Works on fast networks
12
Forecasting and
Anomaly detection
13
Anomaly Detection
• Anomaly is when the actual value significantly
differs from the expected value
– So need forecasts to find anomalies
– Focus has been on ABwE time-series
measurements:
• Packet pair dispersion on 20 packets
– Send 20 packet pairs back to back and measure one-way packet
separation at remote end
– Minimum gives an indication of bottleneck capacity of link
• One measurement every 3 minutes
• Low network impact BUT very noisy, so a hard test case
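The idea that the minimum packet separation indicates the bottleneck capacity can be sketched as below. This is only the textbook packet-pair relation (capacity = packet size / dispersion), not the ABwE implementation; the 1500-byte packet size and the sample separations are made up for illustration.

```python
def capacity_mbps(separations_s, packet_bytes=1500):
    """Estimate bottleneck capacity (Mbits/s) from packet-pair separations.

    The smallest separation is the one least inflated by cross-traffic,
    so it is used as the dispersion caused by the bottleneck link itself.
    """
    positive = [s for s in separations_s if s > 0]
    if not positive:
        raise ValueError("need at least one positive separation")
    return packet_bytes * 8 / min(positive) / 1e6

# Hypothetical separations in seconds, as measured at the remote end.
separations = [130e-6, 122e-6, 120e-6, 310e-6, 125e-6]
estimate = capacity_mbps(separations)
print(f"estimated capacity: {estimate:.0f} Mbits/s")
```

Larger separations in the same sample reflect queuing behind cross-traffic, which is one reason the resulting time series is noisy.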
14
Plateau, most intuitive
• Each observation:
– If outside history buffer mean ± b standard deviations (mh ± b*sh) then add to trigger buffer
– Else add to history buffer, and remove oldest from trigger buffer
• When trigger buffer > t points then trigger issued
– Check if (mh - mt) / mh > D (mt = trigger buffer mean) & trigger buffer 90% filled in last T mins, then have trigger
– Move trigger buffer to history buffer
• Parameters: history length = 1 day, t = trigger length = 3 hours, b = standard deviations = 2
[Figure: observations, history mean, history mean – 2 * stdev, trigger buffer % full, events marked *]
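A minimal sketch of the plateau logic described on this slide follows. It is not the SLAC Perl implementation. The class name and the `warmup` parameter are hypothetical, buffer lengths are given in points (480 and 60 points correspond to 1 day and 3 hours at 3 minute spacing), and the "90% filled in last T mins" timing check is omitted for brevity.

```python
from collections import deque
from statistics import mean, pstdev

class PlateauDetector:
    def __init__(self, history_len=480, trigger_len=60, b=2.0, D=0.3, warmup=20):
        self.history = deque(maxlen=history_len)  # recent "normal" points
        self.trigger = deque()                    # recent outlying points
        self.trigger_len = trigger_len
        self.b = b
        self.D = D
        self.warmup = warmup  # assumption: points collected before testing starts

    def add(self, value):
        """Feed one observation; return True when a drop larger than D is found."""
        if len(self.history) < self.warmup:
            self.history.append(value)
            return False
        mh = mean(self.history)
        sh = pstdev(self.history)
        if abs(value - mh) > self.b * sh:
            self.trigger.append(value)        # outside mh +/- b*sh
        else:
            self.history.append(value)
            if self.trigger:
                self.trigger.popleft()        # remove oldest trigger point
        if len(self.trigger) <= self.trigger_len:
            return False
        mt = mean(self.trigger)
        is_drop = mh > 0 and (mh - mt) / mh > self.D
        self.history.extend(self.trigger)     # trigger points become history
        self.trigger.clear()
        return is_drop

# Synthetic series: steady around 101, then a drop to 50.
series = [100.0, 102.0] * 20 + [50.0] * 10
detector = PlateauDetector(history_len=50, trigger_len=5)
events = [detector.add(v) for v in series]
print(events.index(True))
```

Because the relative change is computed as (mh - mt) / mh, this sketch only reports drops, in line with the plateau behaviour discussed in the talk.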
15
K-S (Kolmogorov-Smirnov)
• For each observation: compare the previous 100 observations with the next 100 observations
– Take the vertical difference between the two CDFs
– How does it differ from what random CDFs give?
– Expressed as % difference
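The sliding two-sample comparison can be sketched as below. This is the standard Kolmogorov-Smirnov statistic (the maximum vertical distance between two empirical CDFs), written in plain Python rather than the C implementation mentioned in the talk. Reporting the statistic as a percentage of its maximum value of 1 is an assumption about what "% difference" means; the function names are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(before, after):
    """Maximum vertical distance between the two empirical CDFs (0 to 1)."""
    a, b = sorted(before), sorted(after)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_scan(series, window=100):
    """For each point, compare the `window` points before with those after.
    Returns a list of (index, statistic as a percentage)."""
    return [
        (i, 100.0 * ks_statistic(series[i - window:i], series[i:i + window]))
        for i in range(window, len(series) - window + 1)
    ]

# Synthetic step from 10 down to 5 at index 20.
series = [10.0] * 20 + [5.0] * 20
scan = ks_scan(series, window=10)
best_index, best_percent = max(scan, key=lambda pair: pair[1])
print(best_index, best_percent)
```

The statistic peaks at the point where the two windows sit entirely on opposite sides of the step, and it responds to steps in either direction.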
16
Compare K-S with Plateau
• Results between K-S & plateau very similar,
using K-S coefficient threshold = 70%
• Current plateau only finds negative changes
– Would also be useful to see when conditions return to normal
• K-S implemented in C and executes faster than
Plateau (in Perl), depends on parameters
• K-S more formalized
• Plateau and K-S work well for non-seasonal observations (e.g. only small day/night changes)
17
Seasons & false alerts
• Congestion on Monday following a quiet weekend
causes a high forecast, gives an alert
• Also, a history buffer that is not a day long causes the history mean to be out of sync with observations
18
Diurnal Variation
People arriving at work between 19:00 & 22:00 PDT (7:00 & 10:00 PK time) cause a sudden drop in dynamic capacity
19
Effect on events
• Change in bandwidth (drops) between 19:00 &
22:00 Pacific Time (7:00-10:00am PK time)
• Causes more anomalous events around this
time
20
Seasonal Changes
• Use Holt-Winters (H-W) technique:
– Uses triple exponentially weighted moving average (EWMA)
• EWMA(i) = α * Obs(i) + (1-α) * EWMA(i-1)
– Three terms, each with its own parameter (α, β, γ), that take into account local smoothing, long term seasonal smoothing, and trends
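The single EWMA recurrence on this slide can be written out directly. The sketch below is a plain illustration of that one formula; seeding the average with the first observation is an assumption, and the function and parameter names are hypothetical.

```python
def ewma(observations, weight):
    """Exponentially weighted moving average.

    Each output is weight * observation + (1 - weight) * previous output.
    The first output is seeded with the first observation (an assumption).
    """
    smoothed = []
    for obs in observations:
        previous = smoothed[-1] if smoothed else obs
        smoothed.append(obs * weight + (1 - weight) * previous)
    return smoothed

print(ewma([10.0, 20.0, 20.0], 0.5))
```

A weight close to 1 follows the latest observations closely, while a weight close to 0 gives a smoother, slower-moving average. Holt-Winters applies this kind of update three times, to the level, the trend and the seasonal component.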
21
H-W Implementation
• Need regularly spaced data (else going back
one season is difficult, and gets out of sync):
– Interpolate data: select bin size
• Average points in bin
• If no points in first week bin then get data from future
weeks
• For following weeks, missing data bins filled from
previous week
• Initial values for smoothing from NIST
“Engineering Statistics Handbook”
• Choose parms by minimizing the mean squared error (1/N) Σ (Ft - yt)^2
– Ft = forecast for time t as function of parameters, yt = observation at time t
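The binning and gap-filling steps above can be sketched as follows. This is an illustration under stated assumptions, not the SLAC code: the function name is hypothetical, times are seconds from the start of the series, and a "season" is given as a number of bins (one week in the talk).

```python
def regularize(samples, bin_s, season_bins):
    """samples: (time_s, value) pairs. Returns one averaged value per bin."""
    n_bins = int(max(t for t, _ in samples) // bin_s) + 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, value in samples:
        i = int(t // bin_s)
        sums[i] += value
        counts[i] += 1
    # Average the points in each bin; None marks an empty bin.
    bins = [s / c if c else None for s, c in zip(sums, counts)]
    # First season: borrow from the same slot in the nearest future season.
    for i in range(min(season_bins, n_bins)):
        j = i
        while bins[i] is None and j + season_bins < n_bins:
            j += season_bins
            bins[i] = bins[j]
    # Later seasons: copy the same slot from one season earlier.
    for i in range(season_bins, n_bins):
        if bins[i] is None:
            bins[i] = bins[i - season_bins]
    return bins

# Toy data: 10 s bins, a "season" of 3 bins, several empty bins.
samples = [(0, 1), (5, 3), (25, 5), (30, 7), (45, 9), (80, 11)]
regular = regularize(samples, bin_s=10, season_bins=3)
print(regular)
```

A bin stays `None` only if the same slot is empty in every season, which a caller would need to handle separately.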
22
H-W Implementation
• Three implementations evaluated (two new)
– FNAL (Maxim Grigoriev)
• Inspiration for evaluating this method
– Part of RRD (Brutlag)
• Limited control over what it produces and how it works
– SLAC
• Implemented the NIST formulation (different formulation/parameter values from Brutlag/FNAL); also added minimization of the sum of squares to get parms
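As an illustration of fitting Holt-Winters parameters by least squares, the sketch below uses the multiplicative triple exponential smoothing equations as given in the NIST handbook (alpha for the level, gamma for the trend, beta for the seasonal index) and a coarse grid search. It is not the SLAC code: the initialization from the first two seasons, the grid values and the function names are all assumptions, and the series must be positive and at least two seasons long.

```python
from itertools import product

def holt_winters_mse(y, season, alpha, beta, gamma):
    """Mean squared one-step-ahead forecast error of multiplicative Holt-Winters."""
    first = sum(y[:season]) / season
    second = sum(y[season:2 * season]) / season
    level = first                              # simplified initial level
    trend = (second - first) / season          # simplified initial trend
    index = [v / first for v in y[:season]]    # initial seasonal indices
    squared_error = 0.0
    for t in range(season, len(y)):
        seasonal = index[t - season]
        forecast = (level + trend) * seasonal  # forecast for t made at t-1
        squared_error += (forecast - y[t]) ** 2
        new_level = alpha * y[t] / seasonal + (1 - alpha) * (level + trend)
        trend = gamma * (new_level - level) + (1 - gamma) * trend
        index.append(beta * y[t] / new_level + (1 - beta) * seasonal)
        level = new_level
    return squared_error / (len(y) - season)

def fit_parameters(y, season, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick (alpha, beta, gamma) from a coarse grid by minimizing the MSE."""
    return min(product(grid, repeat=3),
               key=lambda p: holt_winters_mse(y, season, *p))

# Perfectly periodic toy series with a season of 4 points.
y = [10.0, 20.0, 30.0, 40.0] * 6
print(holt_winters_mse(y, 4, 0.5, 0.5, 0.5))
print(fit_parameters(y, 4))
```

A grid search is the simplest way to minimize the error; a numerical optimizer would be the natural replacement when finer parameter values are needed.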
23
Results
24
Example
• Local smoothing 99% weight for last 24 hours
• Linear trend 50% last 24 hours
• Seasonal mainly from last week, but includes several weeks
• Within an 80 minute window, 80% points outside deviation envelope ≡ event
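The event rule on this slide (a given share of points in a window falling outside the deviation envelope) can be sketched as below. The function name is hypothetical, the window is expressed in points rather than minutes, and the envelope is assumed to be forecast ± deviation.

```python
def find_events(observations, forecasts, deviations, window, fraction=0.8):
    """Return indices where at least `fraction` of the last `window` points
    lie outside the envelope forecast +/- deviation."""
    outside = [
        abs(obs - fc) > dev
        for obs, fc, dev in zip(observations, forecasts, deviations)
    ]
    events = []
    for end in range(window, len(outside) + 1):
        if sum(outside[end - window:end]) >= fraction * window:
            events.append(end - 1)  # index of the last point in the window
    return events

# Toy series: the observations fall away from the forecast at index 10.
observations = [10.0] * 10 + [0.0] * 10
forecasts = [10.0] * 20
deviations = [1.0] * 20
events = find_events(observations, forecasts, deviations, window=5)
print(events[0])
```

Requiring most of a window to be outside the envelope, rather than a single point, is what keeps isolated noisy measurements from raising events.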
[Figure: observations, 1 hr avg, forecast and deviations, with weekend and weekdays marked]
25
Evaluation
• Created a library of time series for 100 days
from June through Sep 2004 for 40 hosts
• Analyzed using Plateau and saved all events
where trigger buffer filled (no filters on size of step)
– 23 hosts had 120 candidate events
– Event types: steps; diurnal changes; congestion
from cron jobs, bandwidth tests, flash crowds
• Classify ~120 events as to whether interesting
– Large, sharp drop in bandwidth, persist for >> 3hrs
26
Results
• K-S shows similar results to Plateau
• As parameters are adjusted to reduce false positives, missed events increase
– E.g. for plateau with trigger buffer = 3 hrs filled to 90% in < 220 minutes, history buffer = 1 day, effect of threshold D = (mh - mt)/mh:
D     False   Miss
10%   16%     8%
30%   2%      32%
[Labels: Plateau (b=2); K-S with ± 100 observations]
27
Conclusions
• A few paths (10%) have strong seasonal effects
• Plateau & K-S work well if only weak seasonal effects
– K-S detects both steps down & up, also gives an accurate time estimate of the event (good for correlations)
• H-W promising for seasonal effects, but
– Is more complex, and requires more parameters which may not be easy
to estimate
– Requires regular data (interpolation step)
• CPU time can depend critically on parameters chosen, e.g.
increasing K-S range from ±100 to say ±400 increases CPU
time by factor 14
• H-W works, still need to quantify its effectiveness
• Looking at PCA to evaluate multiple metrics simultaneously (e.g. fwd & bwd traffic, RTT) AND multiple paths
28
Future Work
• Future Development in PCA
– Enable looking at multiple measurements
simultaneously
• E.g. RTT, loss, capacity …; multiple routes
• Neural networks to interpolate heavyweight/infrequent measurements based on lightweight, more frequent ones
• Continue Netflow passive exploration
29
Some Uses:
• Detect anomalies reliably (few false positives,
few misses):
– Make extra measurements related to anomaly, e.g.
ping, traceroute, performance history etc.
– Notify people (e.g. via email)
• Forecast into the future, taking account of diurnal changes:
– Make long-term (hours – days) integrated estimates
of performance with probabilities
– Use for data location selection
30
Apply forecasts to
Router utilizations to
find bottlenecks
• Get measurements from Internet2/ESnet/Geant
SONAR project via NMWG web services
• Save as time series, forecast for each interface
• For given path and duration forecast most
probable bottlenecks
• Use MPLS to apply QoS at bottlenecks (rather
than for the entire path) for selected
applications
31
More information
• SLAC Plateau implementation
– www.acm.org/sigs/sigcomm/sigcomm2004/workshop_papers/nts26-logg1.pdf
• SLAC H-W implementation
– www-iepm.slac.stanford.edu/monitoring/forecast/hw.html
• Eng. Statistics Handbook
– http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc435.htm
• IEPM-BW Measurement Infrastructure
– http://www-iepm.slac.stanford.edu/
32