ESnet End-to-End Internet Monitoring
Les Cottrell and Warren Matthews, SLAC, and David Martin, HEPNRC
Presented at the ESSC Review Meeting, Berkeley, May 1998
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM)
3/4/98
z:\cottrell\escc\may98\esscmay98.ppt
1
Outline of Talk
• Why are we (ESnet/HENP community) measuring?
• What are we measuring & how?
• What do we see?
• What does it mean?
• Summary
  – Deployment/development, Internet performance, next steps
  – Collaborations
Why go to the effort?
• The Internet is woefully under-measured & under-instrumented
• The Internet is very diverse - no single path is typical
• Users need end-to-end measurements for:
  – realistic expectations and planning information
  – guidelines for setting and validating SLAs
  – information to help in identifying problems
  – help in deciding where to apply resources
• Complements ESnet utilization measurements
• Provides information for reporting problems to the NOC
Our Main Tool (PingER) is Ping Based
• “Universally available”, easy to understand
  – no software for clients to install
• Low network impact
• Provides useful real-world measures of response time, loss, reachability, and unpredictability
• Now monitoring from 14 sites in 8 countries, covering > 500 links (> 300 sites) in 22 countries
• Resources: 6 bps/link, ~ 600 kBytes/month/link
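The core measurement is just periodic ping plus parsing of its summary lines. A minimal, hypothetical sketch of that parsing step (not the actual PingER code; the regexes assume typical Unix ping summary output, which varies by platform):

```python
import re

def parse_ping_summary(output):
    """Pull packet-loss % and RTT min/avg/max (ms) out of the summary
    lines of a typical Unix ping run. A real collector would need
    per-OS patterns."""
    loss, rtts = None, {}
    m = re.search(r"([\d.]+)% packet loss", output)
    if m:
        loss = float(m.group(1))
    m = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", output)
    if m:
        rtts = dict(zip(("min", "avg", "max"), map(float, m.groups())))
    return loss, rtts

sample = ("11 packets transmitted, 10 received, 9.1% packet loss\n"
          "round-trip min/avg/max = 152.1/168.4/201.9 ms")
print(parse_ping_summary(sample))
```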
Measurement Architecture
[Diagram: monitoring sites ping remote sites and cache the results; data are archived at HEPNRC and at SLAC, analyzed at SLAC, and reports & data are served via HTTP on the WWW]
Ping Loss Quality
• Want a quick-to-grasp indicator of link quality
• Loss is the most sensitive indicator
  – Studies by IBM on the economic value of response time showed a threshold around 4-5 seconds where complaints increase.
  – Loss of a packet requires ~ 4 sec for the TCP retry timeout
  – For packet loss we use the following thresholds:
    • 0-1% = Good
    • 1-2.5% = Acceptable
    • 2.5-5% = Poor
    • 5-12% = Very Poor
    • > 12% = Bad (unusable for interactive work)
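The thresholds above translate directly into a classifier. A minimal sketch (band boundaries are assumed inclusive on the upper edge, which the slide leaves ambiguous):

```python
def loss_quality(loss_pct):
    """Map a packet-loss percentage to the quality bands listed above.
    Upper edge of each band assumed inclusive."""
    if loss_pct <= 1:
        return "Good"
    if loss_pct <= 2.5:
        return "Acceptable"
    if loss_pct <= 5:
        return "Poor"
    if loss_pct <= 12:
        return "Very Poor"
    return "Bad"  # unusable for interactive work

print(loss_quality(0.1))   # an ESnet-like link -> "Good"
```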
Quality Distributions from SLAC
• ESnet: median good quality
• Other groups: poor or very poor
Aggregation/Grouping
• Critical for 14 monitoring sites & > 500 links
• Group measurements by:
  – area (e.g. N. America W, N. America E, W. Europe, Japan, Asia, others), or by country or TLD
  – trans-oceanic links, intercontinental links, links crossing an IXP
  – ISP (ESnet, vBNS/I2, TEN-34 ...)
  – monitoring site
  – one site as seen from multiple sites
  – common interest/affiliation (XIWT, HENP, experiment ...)
• Beware: grouping reduces statistics; the choice of sites is critical
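Grouping like this is a simple aggregation over per-link records; medians are used because they resist the heavy tails typical of loss distributions. A hypothetical sketch (record fields and sample values are illustrative, not real data):

```python
from collections import defaultdict
from statistics import median

def group_median_loss(records, key):
    """Group per-link loss records by an attribute (area, ISP, TLD ...)
    and report the median loss per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["loss_pct"])
    return {k: median(v) for k, v in groups.items()}

# Illustrative records only
records = [
    {"site": "slac", "tld": "us", "loss_pct": 0.1},
    {"site": "fnal", "tld": "us", "loss_pct": 0.3},
    {"site": "cern", "tld": "ch", "loss_pct": 1.8},
]
print(group_median_loss(records, "tld"))
```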
Tabular Navigation Tool
• Rows: monitoring site; columns: remote site
• Select grouping, e.g. intercontinental, TLDs, site to site ...
• Select metric: response, loss, quiescence, reachability ...
• Select month (goes back to Jul-97)
• Cells colored by quality (median response time):
  – < 62.5 ms excellent (white)
  – < 125 ms good (green)
  – < 250 ms poor (yellow)
  – < 500 ms very poor (pink)
  – > 500 ms bad (red)
• Drill down:
  – on a site to show all sites monitoring it
  – on a value to see all links contributing
• MouseOver to see the number of links, the country, or the monitoring site
Drill down (all sites monitoring CERN)
• Also provides Excel for DIY analysis
• Sortable; select one of the groups
[Table: sites monitoring CERN include CMU, CNAF, RL, FNAL, SLAC, DESY, Carleton, RMKI, KEK]
Overall Improvements, Jan-95 to Nov-97
• For about 80 remote sites seen from SLAC
• Response time improved between 1 and 2.5% / month
• Loss: similar (closer to 2.5% / month)
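A steady monthly improvement compounds over the measurement period; the arithmetic can be sketched as below (the 34-month span and the 2.5%/month figure are taken from the slide; the compounding assumption is ours):

```python
def compounded_factor(monthly_pct, months):
    """Overall reduction factor from a steady monthly improvement of
    monthly_pct percent, compounded over `months` months."""
    return (1 - monthly_pct / 100) ** months

# Jan-95 to Nov-97 is roughly 34 months; at 2.5%/month the metric
# falls to about 42% of its starting value, i.e. a ~2.4x improvement.
print(round(compounded_factor(2.5, 34), 2))  # 0.42
```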
How does it look for ESnet Researchers
getting to US sites (280 links, 28 States)?
• Within ESnet excellent (median loss 0.1%)
• To vBNS sites very good (~ 2 * loss for ESnet)
• DOE funded Universities not on vBNS/ESnet
– acceptable to poor, getting better (factor of 2 in 6 months)
– lots of variability, e.g.:
  • BrownT, UMassT = unacceptable (>= 12%)
  • Pitt*, SC*, ColoState*, UNMT, UOregonT, Rochester*, UC*, OleMiss*, Harvard1q98, UWashingtonT = very poor (> 5%)
  • SyracuseT, PurdueT, Hawaii* = poor (>= 2.5%)
– * = no vBNS plans, T = vBNS date TBD, V = on vBNS
University access changes in last year
• A year ago we looked at Universities with large
DOE programs
• Identified ones with poor (>2.5%) or worse (>5%)
performance
– UOregonT, Harvard1q98, UWashingtonT = very poor
(>= 5%)
– JHUV, DukeV, UCSDV, UMDV, UMichT, UColoV,
UPennT, UMNV, UCIT, UWiscV = acceptable (>1%)/good
– *=no vBNS plans, T= vBNS date TBD, V=on vBNS
Canada
• 20 links, 9 remote sites, 7 monitoring sites
• Seems to depend most on the remote site
– UToronto bad to everyone
– Carleton, Laurentian, McGill poor
– Montreal, UVic acceptable/good
– TRIUMF good with ESnet, poor to CERN
Europe
• Divides into two groups:
  – TEN-34 backbone sites (de, uk, nl, ch, fr, it, at)
    • within Europe, good performance
    • from ESnet, good to acceptable, except nl, fr (Renater) & uk, which are bad
  – Others
    • within Europe, performance is poor
    • from ESnet, bad to es, il, hu, pl; acceptable for cz
Asia
• Israel bad
• KEK & Osaka good from US, very poor from
Canada
• Tokyo poor from US
• Japan-CERN/Italy acceptable, Japan-DESY bad
• FSU bad to Moscow, acceptable to Novosibirsk
• China is bad everywhere
Intercontinental Grouping (Loss)
Looks pretty bad for intercontinental use
Improving (about factor of 2 in last 6 months)
Summary 1/5
• Deployment Development
– ESnet/HENP/ICFA has 14 Collection sites in 8 countries
collecting data on > 500 links involving 22 countries
– HEPNRC archiving/analyzing, SLAC analyzing
– 600 kB/month/link, 6 bps/link, 0.25 FTE at the archive site, 1.5-2.5 FTE on analysis
– reports available worldwide for end-users to access, navigate, review & customize (via Excel) and see quality
– 4GBytes of data available to experts for analysis
– tools available for others to monitor, archive, analyze
• XIWT/IPWT chose & deployed PingER; ~ 10 collection sites are now monitoring 41 beacon sites
Summary 2/5
• Deployment Development
• Next Steps
– Improve tools:
  • Improve statistical robustness - Poisson sampling, medians
  • More groupings, beacon sites, matched pairs for comparison
  • More navigation features to drill down
  • Better/easier identification of common bottlenecks
  • Prediction (extrapolations, develop models, configure and validate with data)
– Pursuing deployment of dedicated PC based monitor platforms:
IETF Surveyor & NIMI/LBNL
• NIMIs up & running at PSC, LBNL, FNAL, SLAC, CERN
(CH), working with RAL (UK), KEK (JP), DESY (DE)
• Will provide throughput, traceroute & one way ping
measurements
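Poisson sampling, one of the statistical-robustness steps above, replaces fixed-interval probes with exponentially distributed gaps (as recommended by the IETF IPPM framework), so the sample schedule cannot lock onto periodic congestion. A minimal sketch:

```python
import random

def poisson_intervals(mean_s, n, seed=0):
    """Generate n exponentially distributed inter-probe gaps (seconds)
    with the given mean; successive probes at these gaps form a
    Poisson sample schedule."""
    rng = random.Random(seed)  # seeded here only for reproducibility
    return [rng.expovariate(1.0 / mean_s) for _ in range(n)]

gaps = poisson_intervals(30.0, 5)
print([round(g, 1) for g in gaps])
```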
Summary 3/5
• Deployment Development
• Next Steps
• Internet Performance (summary for our 500 links)
– Performance within ESnet is good
– Performance to vBNS good (median loss ~ 2* ESnet)
– Performance to non ESnet/vBNS sites is acceptable to
poor
– Intercontinental performance is very poor to bad
– Response time improving by 1-2% / month
– Packet loss between SLAC & other sites improving by ~ 3% / month since Jan-95
– Very dynamic
Summary 4/5
• Deployment Development
• Next Steps
• Internet Performance (continued):
– Links to sites outside N. America vary from good (KEK)
to bad
– Canada a mixed bag, depending on remote site it is
acceptable to bad
– TEN-34 backbone countries (except UK) good to acceptable
– Otherwise Europe poor to bad
– Asia (apart from some Japanese sites) is bad
– Rest of world generally poor to bad.
Summary 5/5
• Deployment Development
• Next Steps
• Internet Performance
• Lots of collaboration & sharing:
– SLAC & HEPNRC leading the effort on PingER
– 14 monitoring sites, ~ 400 remote sites
– Monitoring-site tools from CERN & CNAF/INFN, Oxford/TracePing
– MapPing/MapNet working with NLANR
– TRIUMF traceroute topology map
– NIMI/LBNL & Surveyor/IETF/IPPM
– Industry: XIWT/IPWT, also an SBIR from NetPredict on prediction
– Talks at IETF, XIWT, ICFA, ESSC, ESCC, Interface’98, CHEP ...
– Lots of support: DOE/MICS/ESSC/ESnet, ICFA, XIWT
More Information & extra info follows
• ICFA Monitoring WG home page (links to status report, meeting notes, how to access data, and code)
  – http://www.slac.stanford.edu/xorg/icfa/ntf/home.html
• WAN Monitoring at SLAC has lots of links
  – http://www.slac.stanford.edu/comp/net/wan-mon.html
• Tutorial on WAN Monitoring
  – http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
• PingER history tables
  – http://www.slac.stanford.edu/xorg/iepm/pinger/table.html
• NIMI
  – http://www.psc.edu/~mahdavi/nimi_paper/NIMI.html
Perception of Packet Loss
• Above 4-6% packet loss, video conferencing becomes irritating, and non-native language speakers become unable to communicate.
• The occurrence of long delays of 4 seconds or more at a frequency of 4-5% or more is also irritating for interactive activities such as telnet and X Windows.
• Above 10-12% packet loss there is an unacceptable level of back-to-back packet loss and extremely long timeouts, connections start to get broken, and video conferencing is unusable.
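One way to see why the 10-12% band is so painful: even if losses were independent (in practice they cluster, so real loss runs are longer), back-to-back loss becomes non-negligible at that rate, and each loss costs a ~4 s TCP retry. A small illustrative calculation:

```python
def independent_back_to_back(p):
    """Probability of losing two consecutive packets if losses were
    independent with per-packet loss probability p. Observed losses
    cluster, so real back-to-back rates are higher still."""
    return p * p

# At 12% loss, ~1 interaction in 70 hits two consecutive losses even
# under the optimistic independence assumption.
print(round(independent_back_to_back(0.12), 4))  # 0.0144
```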
180-Day Ping Performance, SLAC-CERN
Running 10-Week Averages
• Sorted on biggest change
• Standard deviation gives an idea of loading
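A running window average with its standard deviation can be sketched as below (window length and the interpretation of stdev as a loading indicator are from the slide; the helper itself is illustrative):

```python
from statistics import mean, stdev

def rolling_stats(weekly, window=10):
    """Running `window`-sample mean and standard deviation over a
    weekly series; the stdev gives a rough idea of how variable
    (loaded) a link is."""
    out = []
    for i in range(window, len(weekly) + 1):
        w = weekly[i - window:i]
        out.append((mean(w), stdev(w)))
    return out

# Illustrative weekly loss percentages
series = [0.5, 0.7, 0.4, 1.2, 0.9, 0.6, 0.8, 1.5, 1.1, 0.7, 0.9, 2.0]
print(rolling_stats(series))
```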
Quiescence
• Frequency of zero packet loss (for all time, not cut on prime time)
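Quiescence as defined above is just the fraction of measurement intervals with zero loss; a minimal sketch:

```python
def quiescence(loss_samples):
    """Fraction of measurement intervals with zero packet loss,
    computed over all samples (not restricted to prime time)."""
    if not loss_samples:
        return 0.0
    return sum(1 for s in loss_samples if s == 0) / len(loss_samples)

print(quiescence([0, 0, 1.2, 0, 3.4]))  # 0.6
```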
Response & Loss Improvements
• Improved between 1 and 2.5% / month
• Response & Loss similar improvements
Top Level Domain Grouping (Loss)
• Diagonals are within-TLD
• US good/acceptable for it, de, ch & cz
• Hungary is poor
• China unusable
• Canada poor to bad
• UK - US bad
US ESnet & vBNS
[Histograms of packet-loss frequency with cumulative %, for links seen from U.S. monitoring sites:
– ESnet: median 0.1%, 36 links, 17 unique remote sites, 6 monitoring sites
– vBNS: median 0.3%, 30 links, 18 unique remote sites, 4 monitoring sites
– .EDU, non ESnet/vBNS: median 1.5% (avg 3.2%), 54 links, 36 unique remote sites, 3 monitoring sites]
[Plots: loss and delay, Advanced to U Chicago and U Chicago to Advanced]
MapPing
• Java applet, based on MapNet from NLANR
  – Colors links by performance
  – Selection:
    • collection site
    • performance metric
    • month
    • zoom level
  – MouseOver gives coordinates
Traceroute Topology Tool
• Reverse traceroute servers
• Traceping
• Topology map (example from TRIUMF, showing KEK, FNAL, DESY, CERN)
  – Ellipses show nodes on the route
  – An open ellipse is the measurement node
  – A blue ellipse is not reachable
  – Keeps history