SPACE Weather School:
Basic theory & hands-on experience
Les Cottrell – SLAC
University of Helwan / Egypt, Sept 18 – Oct 3, 2010
Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP
http://www.slac.stanford.edu/grp/scs/net/talk10/diagnosis.pptx
Goal: provide a practical guide to debugging common problems
Why is diagnosis difficult yet important?
Local host
Ping, Traceroute, PingRoute
Looking at time series
Locating bottlenecks
Correlation of problems with routes
More tools and problems
Where is a node
Who do you tell, what do you say?
Case studies and More Information
Les Cottrell, SLAC Slide: 2
Internet's evolution as a composition of independently developed and deployed protocols, technologies, and core applications
Diversity, highly unpredictable, hard to find “invariants”
Rapid evolution & change, no equilibrium so far
Findings may be out of date
Measurement/diagnosis not high on vendors' list of priorities
Resources/skill focus on more interesting and profitable issues
Tools lacking or inadequate
Implementations are flaky & not fully tested with new releases
Distributed systems are very hard
"A distributed system is one in which I can't get my work done because a computer I've never heard of has failed." (Butler Lampson)
Network is deliberately transparent
The bottlenecks can be in any of the following components:
the applications
the OS
the disks, NICs, bus, memory, etc. on sender or receiver
the network switches and routers, and so on
Problems may not be logical
Most problems are operator errors, configurations, bugs
When building distributed systems, we often observe unexpectedly low performance
the reasons for which are usually not obvious
Just when you think you’ve cracked it, in steps security
Firewall, NAT boxes etc.
Block pings, traceroute looks like port scan, diagnostic tool ports are blocked …
ISPs worried about providing access to core, making results public, & privacy issues
Host “errors”
TCP buffers, heavy utilization …
Ethernet duplex and speed mismatch between your host and the network device
Misconfigured router/switches
Including routing errors, especially for backup paths
Bad equipment, wiring/fiber problem
Congestion
Command prompt, find out about network connection
ipconfig ?
ipconfig
Default gives IP address, gateway/1st router, subnet mask of all your network devices (Ethernet, wireless, Bluetooth …)
Make a note of the gateway
Icon at bottom right of screen
Allows asking of questions and tries to provide assistance
Go to Command prompt and type
ping ?
Callouts: -n specifies the number of pings; -l the size of the packet; the last argument is the target; each reply line shows the IP address of the target and the RTT
C:\Users\cottrell> ping -n 4 -l 32 mail.alex.edu.ca
Pinging mail.alex.edu.ca [67.215.65.132] with 32 bytes of data:
Reply from 67.215.65.132: bytes=32 time=80ms TTL=45
Reply from 67.215.65.132: bytes=32 time=85ms TTL=45
Reply from 67.215.65.132: bytes=32 time=83ms TTL=45
Reply from 67.215.65.132: bytes=32 time=90ms TTL=43
Ping statistics for 67.215.65.132:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 80ms, Maximum = 90ms, Average = 84ms
Try: ping -t. What use is ping -f?
C:\Users\cottrell> ping www.lbl.gov
Pinging www.lbl.gov [128.3.41.105] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Ping statistics for 128.3.41.105:
Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
Enable Telnet by following these steps:
Start=>Control Panel=>Programs And Features=>
Turn Windows features on or off=>
Check Telnet Client
Hit OK
Now try:
16cottrell@pinger:~> telnet www.lbl.gov 80
Blank screen: the web server is waiting to talk to you
Hit Ctrl-] and type exit
Compare with another port (non-existent application)
C:\Users\cottrell>telnet www.lbl.gov 1010
Connecting To www.lbl.gov...Could not open connection to the host, on port 1010:
Connect failed
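The telnet check above can also be scripted. The following is a minimal sketch, assuming only the Python standard library; the function name port_is_open and the 3 second default timeout are illustrative choices, not part of the original talk. Like telnet, it cannot tell a closed port from one blocked by a firewall.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, port_is_open("www.lbl.gov", 80) corresponds to the successful telnet to port 80 above, and port_is_open("www.lbl.gov", 1010) to the failed one.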
Applications such as telnet (23), ssh (22), www (80, 443), DNS (53) are assigned a “port” on the host
Sometimes written as for example www.slac.stanford.edu:80
See http://www.iana.org/assignments/port-numbers for what applications use which ports
Try:
1. ping localhost
2. ping mail.alex.edu.eg
3. ping sohag-univ.edu.eg
4. ping www.minia.edu.eg
5. ping www.alex.edu.eg
Find servers:
http://www.cogentco.com/us/network_lookingglass.php
http://www.ip.tiscali.net/lg/
http://stat.qwest.net/cgi-bin/jlg-new-asia.pl
http://www.slac.stanford.edu/comp/net/wanmon/viper/tulip_map.htm
[Figure: bar chart of minimum RTT (ms) versus longitude (degrees); Europe is labelled, a 300 ms level is marked, and one annotation reads 0.3*0.6c]
Each bar represents min RTT for 1 country
Satellite flies 24k miles high, RTT~400ms
Note cut off between satellite and terrestrial
Rough traceroute algorithm
ttl = 1;        # To 1st router
port = 33434;   # Starting UDP port
max = 30;       # Default maximum number of hops
while ttl <= max {
    send UDP packet to host:port with ttl
    get response
    if ICMP time exceeded: note round-trip time and router address
    else if ICMP UDP port unreachable: note round-trip time, target reached, stop after printing
    else (no response): print *
    print output
    ttl++; port++
}
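The loop above can be sketched in Python without raw sockets by passing in the probe as a function. This is an illustration of the control flow only, under stated assumptions: the names trace, fake_probe and Reply are hypothetical, the fake path is invented for the demonstration, and only one probe per hop is sent, whereas real traceroute tools usually send three.

```python
from typing import Callable, List, Optional, Tuple

# A probe reply is (kind, router_address, rtt_ms); None means the probe timed out.
Reply = Optional[Tuple[str, str, float]]


def trace(probe: Callable[[int, int], Reply],
          max_hops: int = 30, base_port: int = 33434) -> List[str]:
    """Step the TTL from 1 upwards and return one output line per hop."""
    lines = []
    port = base_port
    for ttl in range(1, max_hops + 1):
        reply = probe(ttl, port)
        if reply is None:
            lines.append(f"{ttl} *")          # no response before the timeout
        else:
            kind, addr, rtt = reply
            lines.append(f"{ttl} {addr} {rtt:.1f} ms")
            if kind == "port_unreachable":    # the target itself answered
                break
        port += 1                             # a new UDP port for each probe
    return lines


def fake_probe(ttl: int, port: int) -> Reply:
    """Hypothetical 4-hop path whose third router never answers."""
    hops = {1: ("time_exceeded", "10.13.11.1", 1.0),
            2: ("time_exceeded", "10.100.100.53", 1.0),
            4: ("port_unreachable", "193.227.16.29", 6.0)}
    return hops.get(ttl)


for line in trace(fake_probe):
    print(line)
```

A real probe function would send a UDP packet with the given TTL and wait for the ICMP reply, which normally needs administrator rights.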
C:\Users\cottrell> tracert
(with no arguments, tracert gets help)
Callouts: -h sets the max hops; the last argument is the target; each hop line shows 3 RTTs and the router IP address; * * * at hop 9 means no response
C:\Users\cottrell> tracert -h 30 mail.alex.edu.eg
Tracing route to mail.alex.edu.eg [193.227.16.29] over a maximum of 30 hops
1 1 ms 1 ms 1 ms 10.13.11.1
2 1 ms <1 ms 1 ms 10.100.100.53
3 1 ms <1 ms <1 ms 10.0.0.3
4 1 ms 1 ms 1 ms 81.21.100.177
5 53 ms 12 ms 1 ms 10.181.28.33
6 2 ms 24 ms 2 ms 172.18.28.117
7 5 ms 6 ms 6 ms 172.20.1.162
8 6 ms 6 ms 8 ms 172.19.8.106
9 * * *
10 6 ms 6 ms 6 ms mail.alex.edu.eg [193.227.16.29]
Try tracert www.lbl.gov
Why do the first hops take so long to reply?
Try tracert -d www.lbl.gov
N.b. first few addresses are 10.x.y.z
Typically these are private (not known to the global Internet) IP addresses that can be re-used at multiple sites
See http://en.wikipedia.org/wiki/Private_network
Ranges 10.0.0.0 – 10.255.255.255 (16M addresses, 24 bits)
172.16.0.0 – 172.31.255.255 (1M addresses, 20 bits)
192.168.0.0 – 192.168.255.255 (65K addresses, 16 bits)
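To check whether a hop in a traceroute falls in one of the private ranges above, a short sketch using Python's standard ipaddress module may help. The function name is_private_hop is an illustrative choice; the three prefixes are the ranges listed above written in CIDR form.

```python
import ipaddress

# The three private ranges listed above, in CIDR notation.
PRIVATE_NETS = [ipaddress.ip_network(n)
                for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_private_hop(addr: str) -> bool:
    """Return True if addr lies in one of the three private IPv4 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_NETS)


# Size of each range, matching the 16M / 1M / 65K figures above.
for net in PRIVATE_NETS:
    print(net, net.num_addresses)
```

Applied to the tracert example, 10.13.11.1 and 172.18.28.117 are private while 81.21.100.177 and 193.227.16.29 are not.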
Traceroute to remote host
Is the route direct, or does it go over congested commercial nets?
Reverse traceroute from remote host to you or a 3rd party
www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
www.tracert.com/
visualroute.visualware.com/ # requires Java
Visualroute servers in Europe
Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
[Screenshot of the traceroute server page; callouts: related info, security warning, traceroute output, your IP name, your IP address, field to enter IP address or name]
Some Linux versions have a bug that incorrectly identifies checksum errors on MPLS links. Make the packet length >= 140, else you get checksum errors (not a problem, just annoying), e.g. on Linux:
traceroute www.slac.stanford.edu 140
May help tell where losses start
Will need many pings if losses small
Routers may not respond
[Figure callouts: "Start of losses?", "But?", "Start of sustained losses"]
Run traceroute, then ping each router n times
helps identify where in route the problems start to occur
Routers may not respond to pings, or may treat pings directed at them differently from other packets
Get Matt’s TraceRoute (MTR) from www.bitwizard.nl/mtr/ or use pathping (built into Windows but inferior: slower, less info)
Tracing route to mail.alex.edu.eg [193.227.16.29] over max 30 hops:
0 CDIV-PC83982.win.slac.stanford.edu [10.13.250.215]
1 10.13.11.1
2 10.100.100.53
3 10.0.0.3
4 81.21.100.177
5 10.181.28.33
6 172.18.28.117
7 172.20.1.162
8 172.19.8.106
9 10.191.8.30
10 mail.alex.edu.eg [193.227.16.29]
Computing statistics for 250 seconds...
Source to Here This Node/Link
Hop RTT Lost/Sent = Pct Lost/Sent = Pct Address
0 CDIV-PC83982.win.slac.stanford.edu
[10.13.250.215]
0/ 100 = 0% |
1 1ms 0/ 100 = 0% 0/ 100 = 0% 10.13.11.1
0/ 100 = 0% |
2 1ms 0/ 100 = 0% 0/ 100 = 0% 10.100.100.53
0/ 100 = 0% |
3 0ms 0/ 100 = 0% 0/ 100 = 0% 10.0.0.3
0/ 100 = 0% |
4 2ms 0/ 100 = 0% 0/ 100 = 0% 81.21.100.177
13/ 100 = 13% |
5 --100/ 100 =100% 87/ 100 = 87% 10.181.28.33
0/ 100 = 0% |
6 --100/ 100 =100% 87/ 100 = 87% 172.18.28.117
0/ 100 = 0% |
7 --100/ 100 =100% 87/ 100 = 87% 172.20.1.162
0/ 100 = 0% |
8 --100/ 100 =100% 87/ 100 = 87% 172.19.8.106
0/ 100 = 0% |
9 --100/ 100 =100% 87/ 100 = 87% 10.191.8.30
0/ 100 = 0% |
10 10ms 13/ 100 = 13% 0/ 100 = 0% mail.alex.edu.eg [193.227.16.29]
Trace complete.
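When reading output like this, loss at an intermediate router only matters if it persists to the final hop, since routers may ignore or rate-limit pings addressed to them. The sketch below applies that rule to the "Source to Here" loss column; the function name first_sustained_loss is hypothetical and the rule is a heuristic, not how pathping itself computes anything.

```python
from typing import List, Optional


def first_sustained_loss(loss_pct: List[float]) -> Optional[int]:
    """Return the 1-based hop where loss starts and persists to the last hop.

    Returns None when the final hop shows no loss, because loss seen only at
    intermediate routers usually means they are ignoring the probes.
    """
    if not loss_pct or loss_pct[-1] <= 0:
        return None
    i = len(loss_pct) - 1
    # Walk back from the destination while the previous hop also shows loss.
    while i > 0 and loss_pct[i - 1] > 0:
        i -= 1
    return i + 1


# "Source to Here" loss percentages for hops 1 to 10 in the output above.
source_to_here = [0, 0, 0, 0, 100, 100, 100, 100, 100, 13]
print(first_sustained_loss(source_to_here))
```

For this output the candidate is hop 5, which agrees with the 13% loss pathping reports on the link between hops 4 and 5. Hops 5 to 9 show 100% only because those routers do not answer pings.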
Look at history plots (PingER, ISPs, own border router etc.), when did problem start, how big an effect is it?
Assumes you know “proximity” of paths for which there are archived active measurements to the path that you are interested in
Also that relevant measurements exist
www-iepm.slac.stanford.edu/pinger/
Collaboration between Internet2/ESnet/Geant to provide access to router measurements holds promise
Look for change in measured value
Note time
Correlate
Italy disconnected
Is the server application listening?
telnet www.slac.stanford.edu 80
Trying 134.79.18.188...
Connected to www.slac.stanford.edu.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
Try user application (mem to mem & disk to disk)
GridFTP, bbcp, bbftp …
Iperf or thrulay (also provides RTT) to test TCP or UDP throughput
dast.nlanr.net/Projects/Iperf/ , www.internet2.edu/~shalunov/thrulay/
NDT ( http://www.internet2.edu/performance/ndt/ )
What are the interface speeds? What is the bottleneck?
Is there a duplex mismatch? Are buffers set right (both ends)?
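Whether buffers are "set right" can be judged against the bandwidth-delay product: a TCP connection needs roughly bandwidth × RTT bytes of buffer at both ends to fill the path. The sketch below works this out; the 100 Mbit/s link speed and 84 ms RTT are illustrative assumptions, and 64 KiB is used as an example because it is about the most TCP can advertise without window scaling.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: buffer (bytes) needed to keep the path full."""
    return bandwidth_bits_per_s / 8.0 * (rtt_ms / 1000.0)


def window_limited_bits_per_s(window_bytes: float, rtt_ms: float) -> float:
    """Best-case throughput when at most one window is sent per round trip."""
    return window_bytes * 8.0 / (rtt_ms / 1000.0)


# Illustrative values: a 100 Mbit/s bottleneck and an 84 ms round-trip time.
needed = bdp_bytes(100e6, 84.0)
limit = window_limited_bits_per_s(64 * 1024, 84.0)
print(f"buffer needed: {needed / 1e6:.2f} MB")
print(f"64 KiB window gives at most {limit / 1e6:.1f} Mbit/s")
```

With these numbers about 1 MB of buffer is needed, and a 64 KiB window caps throughput at a few Mbit/s however fast the link is.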
Wireless
Avoid peer-to-peer/ad-hoc connections
Disable connecting to ad-hoc (set infrastructure only)
Disable bridging
How to do it varies by OS (XP, OSX, Linux)
Ad hoc can still interfere if on same channel
Tools to locate an access point (e.g. Yellow-Jacket)
See www2.slac.stanford.edu/comp/net/wireless/Wireless-Meeting-Handout.mht
NAT boxes may block or not support application
Private addresses:
10.0.0.0 - 10.255.255.255 a single class A net
172.16.0.0 - 172.31.255.255 16 contiguous class Bs
192.168.0.0 – 192.168.255.255 256 contiguous class Cs
Ping to localhost, ping to gateway & to remote host
Use IP address to avoid nameserver problems
Look for connectivity, loss & RTT
May need to run for a long time to see some pathologies (e.g. bursty loss due to DSL loss of sync)
Use telnet host port to see if ping blocked
Traceroute to remote host
Reverse traceroute from remote host to you
Ping routers along route (mtr helps)
Look at history plots (PingER), when did problem start, how big an effect is it?
Look at own connectivity: NDT ( netspeed.stanford.edu )
Beware: some of the following information is ephemeral; in general, use heuristics with Google
Google “Internet country codes” for TLDs
Host may not be in TLD country, especially developing regions often use proxies elsewhere
Location may be encoded in router name
ipls=Indianapolis, snv=Sunnyvale …
Name server lookup (nslookup & dig) to find hostname given IP address
47cottrell@netflow:~>nslookup 210.56.16.10
Server: localhost
Address: 127.0.0.1
Name: lhr.comsats.net.pk
Address: 210.56.16.10
Use a whois server (download www.gena01.com/win32whois/ )
www.networksolutions.com/cgi-bin/whois/whois (Americas & Africa)
www.ripe.net/cgi-bin/whois (Europe)
www.apnic.net/ (Asia)
May identify site name, address, contact, etc, not all domains are in databases (e.g. will not find comsats.net.pk)
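Spotting location codes in router names can be partly automated. The sketch below is only a heuristic under stated assumptions: the function name location_hints is hypothetical, and the lookup table holds just the two codes given above, so it would need extending by hand for any real use.

```python
import re
from typing import Dict, List

# Only the two codes mentioned above; any further entries must be added by hand.
SITE_CODES: Dict[str, str] = {"ipls": "Indianapolis", "snv": "Sunnyvale"}


def location_hints(hostname: str, codes: Dict[str, str] = SITE_CODES) -> List[str]:
    """Return the places whose codes appear as whole tokens in a router name."""
    tokens = re.split(r"[^a-z0-9]+", hostname.lower())
    return [codes[t] for t in tokens if t in codes]


print(location_hints("snv-slac.es.net"))
```

Codes are not standardized across ISPs, so a match is a hint to confirm with whois or the ISP, not a definite location.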
Find the Autonomous System (AS) administering
Form giving AS for domain name
http://www.fixedorbit.com/search.htm
Gives AS number, name, adjacent ASes, web page for AS
Given an AS find out more about it:
Use http://bgp.potaroo.net/cidr/ go to bottom and enter AS into form:
– Gives ISP name, web page, phone number, email, hours etc.
Review list of AS's ordered by Upstream AS Adjacency
www.telstra.net/ops/bgp/bgp-as-upsstm.txt
Tells what AS is upstream of an ISP
Visit site’s www server, often location in home page
May be able to get lat & long from database:
www.geoiptool.com/ or via: geotool.flagfox.net/
http://www.hostip.info/index.html
Networldmap determines geographical information by acquiring location information from willing participants.
http://www.ip2location.com/
But it is a subscriber service ($$$, but …), however it is probably best for developing regions
Quova has a large (2.4 Billion addresses) database of IP addresses to locations that they can provide access to for organizations, but must subscribe ($$$).
Triangulate pings from landmarks:
www.slac.stanford.edu/grp/scs/net/talk10/geolocation.pptx
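The idea behind triangulating from landmarks is that a minimum RTT puts an upper bound on how far away the target can be. The sketch below shows only that bound, not a full triangulation; the 2/3 c propagation speed in fibre is an assumed rule of thumb, and the landmark names and RTT values are hypothetical.

```python
from typing import Dict, Tuple

C_KM_PER_MS = 299.792458   # speed of light in vacuum, km per millisecond
FIBER_FACTOR = 2.0 / 3.0   # assumption: signals in fibre travel at about 2/3 c


def max_distance_km(rtt_ms: float, factor: float = FIBER_FACTOR) -> float:
    """Upper bound on the one-way distance implied by a round-trip time."""
    return (rtt_ms / 2.0) * C_KM_PER_MS * factor


def tightest_bound(min_rtts_ms: Dict[str, float]) -> Tuple[str, float]:
    """Pick the landmark with the smallest minimum RTT; return (name, bound in km)."""
    name = min(min_rtts_ms, key=min_rtts_ms.get)
    return name, max_distance_km(min_rtts_ms[name])


# Hypothetical minimum RTTs (ms) from three landmarks to the same target.
landmarks = {"landmark-a": 80.0, "landmark-b": 12.0, "landmark-c": 45.0}
name, km = tightest_bound(landmarks)
print(f"target is within about {km:.0f} km of {name}")
```

Real paths are rarely straight and routers add delay, so the target is usually much closer than the bound. Use the minimum of many pings to reduce the effect of queuing.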
Local network support people
Internet Service Provider (ISP) usually done by local networker
Usually will know immediate one, e.g. trouble@es.net
Use puck.nether.net/netops/nocs.cgi to find ISP
Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to find upstream ISPs
Well managed sites and ISPs maintain a list of email addresses such as abuse@ or postmaster@, that one can send email to, for example to complain about spam etc.
This follows an Internet recommendation ( RFC 2142 ).
Some less helpful sites do not provide such services, for more on these, see RFC-ignorant.org
Describe problem with details
What is affected?
Application, host OS ( uname -a ), NIC ( ifconfig, route )
How is it affected?
Non responsiveness, unable to contact remote host
Slow performance (see Brian’s talk), packet loss
When did it start?
Send ping output between hosts
Send traceroute forward & reverse – if possible
Maybe use -I (ICMP option)
NDT
Identify when it started
If complex think about creating web page with details
Top, vmstat, pingroute, pipechar, application output (GridFTP, iperf)…
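Gathering this evidence can be scripted so that every report carries the same details. The sketch below is one possible approach, not a tool from the talk: the function names and the count of 10 pings are illustrative choices, and it assumes ping and tracert/traceroute are on the PATH.

```python
import platform
import subprocess
from typing import List


def diagnostic_commands(host: str, system: str = "") -> List[List[str]]:
    """Return the commands whose output is worth attaching to a problem report."""
    system = system or platform.system()
    if system == "Windows":
        return [["ipconfig"],
                ["ping", "-n", "10", host],
                ["tracert", "-d", host]]
    return [["uname", "-a"],
            ["ping", "-c", "10", host],
            ["traceroute", "-n", host]]


def collect_report(host: str, timeout_s: float = 120.0) -> str:
    """Run each command and join the outputs into one text report."""
    sections = []
    for argv in diagnostic_commands(host):
        try:
            done = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=timeout_s)
            output = done.stdout + done.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            output = f"(could not run: {exc})"
        sections.append("$ " + " ".join(argv) + "\n" + output)
    return "\n".join(sections)
```

Calling collect_report("mail.alex.edu.eg") returns text that can be pasted into an email, together with the time the problem started.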
Tutorial on monitoring
www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
RFC 2151 on Internet tools
www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
Network monitoring tools
www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
www.caida.org/tools/taxonomy/
Network Performance Tools: an I2 Cookbook
e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
Case Studies:
confluence.slac.stanford.edu/display/IEPM/Problem+Cases
e2epi.internet2.edu/case-studies/
Usual Unix tools ( uname -a, top, vmstat, iostat ..)
Is the host overloaded? Do you have a gateway ( route ) and a name server ( nslookup )? Which interface are you using ( mii-tool (needs root), gives duplex & speed = common error source)?
Net: ifconfig -a (look at errors), netstat -a
Is server running (if you know port)?
> telnet localhost 2811
Trying 127.0.0.1
220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
^]
telnet> quit
Callouts: -c sets the repeat count; -s the packet size; the last argument is the remote host; time= gives the RTT; icmp_seq=1 is a missing seq # (lost packet); the final lines are the summary
syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
PING thumper.bellcore.com (128.96.41.1): 64 data bytes
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
--- thumper.bellcore.com ping statistics ---
6 packets transmitted, 5 packets received, 16% packet loss
round-trip min/avg/max = 482.1/880.5/1447.4 ms
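Spotting missing sequence numbers by eye gets tedious on long runs. The sketch below parses reply lines like those above and reports which probes were lost; the function name summarize_ping is hypothetical, and it assumes sequence numbers start at 0 as in this output (some ping versions start at 1).

```python
import re
from typing import List, Tuple

SAMPLE = """\
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
"""


def summarize_ping(output: str, sent: int) -> Tuple[List[int], float, List[float]]:
    """Return (missing sequence numbers, loss percent, RTTs in ms)."""
    replies = re.findall(r"icmp_seq=(\d+) .*?time=([\d.]+) ms", output)
    seen = {int(seq) for seq, _ in replies}
    rtts = [float(rtt) for _, rtt in replies]
    # Assumes probes are numbered 0 .. sent-1, as in the output above.
    missing = [s for s in range(sent) if s not in seen]
    loss = 100.0 * len(missing) / sent
    return missing, loss, rtts


missing, loss, rtts = summarize_ping(SAMPLE, sent=6)
print("missing:", missing, f"loss: {loss:.1f}%")
print(f"min/avg/max = {min(rtts)}/{sum(rtts) / len(rtts):.1f}/{max(rtts)} ms")
```

The loss works out to 1 in 6, which ping prints truncated as 16%.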
UDP/ICMP tool to show route packets take from local to remote host
Callouts: -q sets the probes per hop; -m the max hops (20); the last argument is the remote host; router names may indicate location; the long delay at hop 11 indicates a satellite link; * at hop 13 means no response (lost packet or router ignores the probe)
17cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
4 snv-slac.es.net (134.55.208.30) 1.377 ms
5 nyc-snv.es.net (134.55.205.22) 75.536 ms
6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
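Finding the long-delay link in a traceroute like this amounts to looking for the biggest RTT increase between consecutive responding hops. The sketch below does that for single-probe output (-q 1) as shown above; the function names are hypothetical and the sample is the last few hops of the example.

```python
import re
from typing import List, Optional, Tuple

SAMPLE = """\
8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
10 207.45.205.18 (207.45.205.18) 128.648 ms
11 210.56.31.94 (210.56.31.94) 762.150 ms
12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
13 *
14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
"""


def hop_rtts(output: str) -> List[Tuple[int, float]]:
    """Extract (hop number, RTT in ms) from single-probe lines; '*' hops are skipped."""
    hops = []
    for line in output.splitlines():
        m = re.match(r"\s*(\d+)\s.*\s([\d.]+) ms\s*$", line)
        if m:
            hops.append((int(m.group(1)), float(m.group(2))))
    return hops


def largest_jump(hops: List[Tuple[int, float]]) -> Optional[Tuple[int, int, float]]:
    """Return (from_hop, to_hop, RTT increase in ms) for the biggest increase."""
    best = None
    for (h1, r1), (h2, r2) in zip(hops, hops[1:]):
        delta = r2 - r1
        if best is None or delta > best[2]:
            best = (h1, h2, delta)
    return best


print(largest_jump(hop_rtts(SAMPLE)))
```

Here the biggest increase is between hops 10 and 11, consistent with the satellite callout. Single probes are noisy, so it is worth repeating the trace before drawing conclusions.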
Ping routers along the route, e.g. a tool to install that helps:
www.slac.stanford.edu/comp/net/fpingroute.pl (if fping is available)
15cottrell@noric04:~>fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each of the hops along the route it then uses fping to ping each node (in parallel) 'count' times. Output includes traceroute information, RTTs, losses for 100 and 'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl [Opts] host where host is the remote host's IP address or name e.g. www.slac.stanford.edu
Opts: [-c count default=10]
[-s size default=1400]
[-i initial default=1]
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
Ntop
Summarizes libpcap (sniffer) info
Internet2 Detective:
Tests connectivity to I2, bandwidth, multicast, IPv6
Can run as Java applet
http://detective.internet2.edu/
NLANR Internet Advisor
Ethereal, tcpdump, snoop for masochists
Passive tools:
Netflow for characterizing network, spotting abnormalities, e.g.
www.itec.oar.net/abilene-netflow
www.slac.stanford.edu/comp/net/slac-netflow/html/SLACnetflow.html
SNMP based tools