OWEN/NERO Bandwidth Audit

1
Going Fast(er) On Internet2
Campus Focused Workshop on
Advanced Networks, San Diego
4/12/2000
Joe St Sauver
(joe@oregon.uoregon.edu)
Computing Center
University of Oregon
Disclaimer
• What we’re going to tell you today is based
on our experiences working primarily with
Usenet News at the U of O; it may/may not
pertain to other applications elsewhere.
• We tend to look for simple, scalable,
workable solutions which we can roll out
now, e.g., overprovisioning rather than QoS
• We tend to be cheap, skeptical, and cynical
• We tend to be good at pushing things until
they break; it is an acquired/teachable skill.
2
3
A Sidenote About
This Presentation
• It is longer than it should be, but we’ll go
until we run out of time and then stop.
• Sorry it is so graphically boring. :-)
• It is outlined in tedious detail because that
way we won’t forget what we wanted to say,
and thus you won’t need to take notes.
• Hopefully, it will thus be able to be decoded
by someone stumbling upon it post hoc.
4
I. Introduction
Or, "Are You Really Sure
You Want to Go Fast(er)?"
Now That I'm On I2,
Everything Will Get Really
Fast… Right?
• It is a popular misconception that once your
campus gets connected to Internet2,
everything you do on the network will
suddenly, magically, and painlessly go
"really, really fast."
• The reality is that going even moderately
fast can take patience, detective work,
tinkering, and maybe even forklift upgrades.
5
6
Do You Really NEED or Even
WANT To Go Fast(er)?
• Going fast(er) can be a big pain. Huh? …
-- It will take a lot of work
-- It may cost you some money
-- It almost always requires the active
assistance of lots of folks
-- You may find yourself (in the final
analysis) only partially successful
-- Fast boxes are choice targets for crackers
-- Lots of happy people DON’T go fast
7
As-Is/Out-of-the-Box
Might Be Good Enough
• Unless you're running into a particular
problem (e.g., you HAVE to go fast(er)),
one perfectly okay decision might be to just
go however fast you happen to go and not
worry about anything beyond that.
• E.g., a Concorde may be very fast, but a
Concorde might not be the best way to get
to the corner store for a loaf of bread.
8
What Can I Get By Default?
Example: Oregon<--Oklahoma
• At UO, from a relatively vanilla W2K
workstation connected via fast ethernet, one
can ftp a binary file (hdstg2.img, 2135829
bytes) from the University of Oklahoma's
ftp archive (ftp.ou.edu /mirrors/linux/redhat/
redhat-6.2/i386/RedHat/base/) in 9.43 sec:
226 Kbyte/second (or 1.8 Mbit/second)
9
For Comparison, A Second
Local-Only Example...
• Retrieving that same file from a local ftp
mirror (ftp://limestone.uoregon.edu/.1/
redhat/redhat-6.2/i386/RedHat/base/) that
same workstation allowed me to get the file
in 0.32 seconds, which translates to:
6,653.67 Kbyte/sec (or 53.2Mbit/sec)
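The two quoted rates can be reproduced from the file size and the elapsed times. This is a minimal sketch that assumes decimal units (1 Kbyte = 1000 bytes); the function name is ours, not part of any ftp client. Because the elapsed times on the slides are rounded, the local-mirror figure comes out slightly different from the one the ftp client reported.

```python
def throughput(num_bytes, seconds):
    """Return (Kbyte/sec, Mbit/sec), assuming 1 Kbyte = 1000 bytes."""
    bytes_per_sec = num_bytes / seconds
    return bytes_per_sec / 1e3, bytes_per_sec * 8 / 1e6


FILE_SIZE = 2135829  # bytes, hdstg2.img as quoted on the slide

for label, secs in [("Oklahoma", 9.43), ("local mirror", 0.32)]:
    kbytes, mbits = throughput(FILE_SIZE, secs)
    print(f"{label}: {kbytes:,.0f} Kbyte/sec, {mbits:.1f} Mbit/sec")
```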
10
Thinking About Those
Examples A Little
• As always, closer will usually be faster
[mental note… value of replicated content]
• Quoted throughput should be considered
approximate (e.g., the times aren't exact).
• There are start up effects (which will tend to
pull the overall throughput down); e.g., if
the file was larger, we'd look/be "faster"
• Ten seconds or 1/3 of a second, either way
you won't have time to go get coffee
11
Make An Effort to Know
How Fast You HAVE to Go
• As you try to go fast(er), it will be important
for you to know how fast you HAVE to go.
• For example: "I need to be able to deliver
1.0Mbps sustained for MPEG1-quality
video" or "I need to be able to transfer
180GB of data per day on a routine basis."
• Get your requirement into Mbps format so
you can readily make comparisons
12
Converting Data Transfer
Requirements Into Mbps
• Example:
180 gigabytes/day ==
(180,000 megabytes)(8 bits per byte) / ((24 hrs/day)(60 mins/hr)(60 secs/min)) ==
roughly 17 megabits/sec 'round the clock
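The same conversion as a small helper, for plugging in your own numbers. It is a sketch that assumes decimal units (1 GB = 1000 MB) and a perfectly even load across the day; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def gb_per_day_to_mbps(gigabytes_per_day):
    """Convert a daily volume in gigabytes to a steady rate in Mbps."""
    megabits_per_day = gigabytes_per_day * 1000 * 8
    return megabits_per_day / SECONDS_PER_DAY


print(f"{gb_per_day_to_mbps(180):.1f} Mbps")  # the 180 GB/day example
```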
13
Be Sure To Remember...
• Very few data transfer requirements are
"uniformly distributed 'round the clock" --
plan for peaking loads
• Best case/theoretical requirements should
be considered a lower (not upper) bound
on bandwidth requirements.
• Plan for system/application downtime.
• What's the data transfer rate of growth?
14
It's Not The Volume, It's The
Time It Takes To Double...
• “It’s not the heat, it’s the humidity…”
• Example: Daily Usenet News volume (e.g.,
~200GB/day now, doubling every 6 mos.)
• Data from http://newsfeed.mesh.ad.jp/flow/
15
That Implies, For Example...
• Today: 200GB/day (e.g., 18.5 Mbps)
• 6/2001: 400GB/day (37 Mbps)
• 12/2001: 800GB/day (74 Mbps)
• 6/2002: 1.6TB/day (148 Mbps)
• 12/2002: 3.2TB/day (296 Mbps)
• … and of course, that’s assuming we don’t
see another upward inflection in the rate of
NNTP traffic growth (but trust me, we will).
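The projection is plain compound doubling. A minimal sketch follows, assuming a constant six-month doubling time and decimal units; the function and its parameters are ours, for illustration only.

```python
def project(start_gb_per_day, doubling_months, horizon_months):
    """Return (months_from_now, GB/day, Mbps) at each doubling step."""
    rows = []
    for months in range(0, horizon_months + 1, doubling_months):
        gb = start_gb_per_day * 2 ** (months / doubling_months)
        mbps = gb * 8000 / 86400  # GB/day -> megabits/sec
        rows.append((months, gb, mbps))
    return rows


for months, gb, mbps in project(200, 6, 24):
    print(f"+{months:2d} months: {gb:6,.0f} GB/day ({mbps:.1f} Mbps)")
```

Changing the doubling time from six months to, say, four shows how sensitive the two-year figure is to that one assumption.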
16
What does ftp.cdrom.com say?
• “Wcarchive is the biggest, fastest, busiest
public FTP archive in the world. * * *
Each month, more than 10 million people
visit wcarchive -- sending out to them more
than 30 terabytes of files (as of June, 1999),
with the only limit being the Internet
backbone(s).” See: ftp://ftp.cdrom.com/
archive-info/configuration
• 30 TB/mo = “only” a steady ~92.6Mbps
In Most Cases, The Only
Reason You Need to Go Fast
Will Be LOTS Of Data….
• By "LOTS" of data, you should be thinking
in terms of hundreds of gigabytes/day on a
routine/ongoing basis.
• Assuming even moderate data retention
times (e.g., a week), 100’s of GB/day
implies use of what would traditionally be
considered a large disk farm.
17
18
Again Looking At cdrom.com...
• In the “old days,” (two or three years ago?)
large capacity disk farms were physically
large, expensive and quite uncommon...
• For example, Cdrom.com is/was fielding a
1/2 terabyte of disk consisting of 18x18GB
plus 20x9.1GB
19
Terabyte of Data on The
Desktop, Anyone?
• Now there are 82GB Ultra ATA Maxtors
(and for only $300 or so!) and 180GB
Ultra160 Barracudas will be shipping soon
• A terabyte of data can now happily run from
an undergrad’s desktop PC...
20
The Good News?
• In spite of the cheap availability of large
disks, there are really very few applications
which NEED to go very fast (either for long
periods of time or on a frequently recurring
basis between any two particular points).
• That is, most large flows are non-recurring,
and not particularly time sensitive. An
example might be one scientist ftp'ing one
large data set from one colleague one time.
21
Got Non-Recurring, Non-Time-Sensitive Flows? Relax...
• If you are working with non-recurring,
non-time sensitive flows, you have a fair
amount of slack: even if you don’t succeed
in going fast, the transfer will still get done
eventually, one way or the other.
• Put plainly, “Sort of slow may still be fast
enough.”
22
The (Sort Of) "Bad" News...
• There are LOTS of folks who WANT to go
fast(er) (whether they NEED to or not)
• There are MANY applications that IN
AGGREGATE may need to deliver "lots" of
data (e.g., not a tremendous amount to any
one user, but some to LOTS of users)
• Most apps can't distinguish between
Internet2 and the commodity Internet.
Why Would A Broad Interest in
Going Fast Be (Sort of)
Bad News?
• Recall my earlier proposition that going
fast(er) is hard/expensive/requires help from
lots of people, and often only sorta works.
• It wouldn’t take a tremendous number of
people going really fast to flattop existing
Internet2 capacity.
• For now, it is still expensive to buy I2 size
pipes to the commodity Internet.
23
24
Abilene OC3 Cost vs.
Commodity Internet Costs
• Abilene (Internet2) OC3:  $110,000/year
CWIX OC3:                 $1,082,400/year
Sprint OC3:               $1,489,200/year
Genuity OC3:              $2,064,000/year
==> Commodity OC3's are expensive and it
doesn't take many people who're even doing
“just” 30 Mbps to fill an OC3.
(prices from http://www.boardwatch.com/
isp/bb/Backbone_Profiles.htm)
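To compare those prices on a per-megabit basis, and to see how few 30 Mbps flows fit in an OC3, a rough sketch follows. It assumes an OC3 delivers a nominal 155 Mbps and ignores framing overhead; the names are illustrative.

```python
OC3_MBPS = 155  # nominal OC3 rate; usable payload is somewhat less

annual_cost = {
    "Abilene (Internet2)": 110_000,
    "CWIX": 1_082_400,
    "Sprint": 1_489_200,
    "Genuity": 2_064_000,
}


def cost_per_mbps_month(annual_dollars, mbps=OC3_MBPS):
    """Dollars per Mbps per month for a fully used link."""
    return annual_dollars / 12 / mbps


for provider, dollars in annual_cost.items():
    print(f"{provider}: ${cost_per_mbps_month(dollars):,.0f}/Mbps/month")

flows_to_fill_oc3 = OC3_MBPS // 30  # whole 30 Mbps flows that fit
print(f"{flows_to_fill_oc3} flows of 30 Mbps fill an OC3")
```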
25
“I asked for a mission, and for
my sins they gave me one.”
• While you may be striving to build a
campus network enabling high throughput
to Internet2, beware: you are ALSO
building a network which will deliver high
throughput to the commodity Internet.
• If you encourage users to go fast to I2,
they will go fast everywhere (assuming
they go fast anywhere) because users
don’t know when they’re using Internet2.
26
Are We Racing To The
Precipice? Probably Not...
• Good news is (may be?) coming…
• Some vendors (e.g., Cogent
Communications) will soon be selling
100Mbps of commodity transit for
$3K/month, flat rate… if you're in one of
the “NFL cities” where they have a POP.
• Perversely, one of the things that determines
where carriers build out their POPs is the
existing/demonstrated bandwidth demand!
27
“I can’t get cheap commodity
transit where I’m located…”
• If you can’t get cheap commodity transit,
the only bandwidth provisioning solution
that financially scales to the high bandwidth
scenarios we’re all moving toward is to go
after settlement free peering with large
network service providers. Doing this
implies you need fiber to one or more
exchange points, and you need to be able to
convince providers of interest to peer…
28
Some University-Affiliated
Commodity Exchange Points
• Oregon IX (http://www.oregon-ix.net/)
• Hawaii IX (http://www.lava.net/hix/)
• SD-NAP (http://www.caida.org/projects/
sdnap/content/)
• BC IX (http://www.bcix.net/)
• Hong Kong IX (http://www.cuhk.hk/hkix/)
• and many more… see http://www.ep.net/
“What if those sort of strategies
aren’t right for us?”
• You have (or soon will have) problems
• You will spend your time making users go
slower, not helping them to go fast(er)
• Transparent web caching may help (some),
but watch out for witch hunt opportunities.
• Maybe try going after edge content delivery
networks (Akamai, iBeam, etc.)? Maybe
try bandwidth management appliances?
29
30
But...
• Users will go faster, even if you work hard
at trying to slow them down
• Transparent web caching may reduce your
traffic by a factor of two (but if your traffic
is doubling every 6 months, that implies
doing caching is only going to buy you 6
months worth of breathing room, and then
you’re back where you started from...)
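The "six months of breathing room" figure generalizes: a one-time reduction by some factor buys log2(factor) doubling periods. A small sketch follows, assuming steady exponential growth; the function name is ours.

```python
import math


def breathing_room_months(reduction_factor, doubling_months):
    """Months until traffic regrows to its level before the reduction."""
    return doubling_months * math.log2(reduction_factor)


print(breathing_room_months(2, 6))  # caching halves traffic
print(breathing_room_months(4, 6))  # even a 4x cut buys only two doublings
```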
31
But… But…
• Edge content delivery networks may help
with some specific content, but there’s still
a lot of other content that will NOT be
getting distributed via those ECDN’s.
• Bandwidth management appliances invite
user efforts to “beat the system” by
exploiting any weaknesses in your traffic
management model (just like in the bad old
mainframe chargeback days, ugh!)
32
On The Other Hand...
• Everybody may be talking about OC12’s,
OC48’s and OC192’s, but even a major NSP
like Above.Net still has a lot of OC3’s, fast
ethernet and DS3 class links...
• See Above.Net’s publicly available traffic
reports (http://west-boot.mfnx.net/traffic/)
• The lesson of Above.Net’s stats? OC3 class
traffic is still relatively rare/a big deal...
and not something to treat casually.
33
Free Advice (And You Know
What That’s Worth)
• Be sure you really need/want to go fast(er)
• Strive to understand your current traffic
requirements
• Never lose sight of the fact that going fast
on Internet2 will mean that you probably
need to go fast on the commodity Internet,
too
• Work to deploy scalable solutions
34
II. So Who’s Going
Fast On Internet2
Right Now?
“The All News Network,
All The Time.” [CNN motto]
35
Large TCP/IP Flows
• Our focus/interest is on large TCP/IP flows
which result in lots of bytes getting
transferred.
• We’re not worried about/interested in UDP
traffic; it will implode on its own. :-)
• We ignore brief one-off spikes associated
with demonstrations/stunts/denial of service
attacks/etc. -- long term real base load is of
the greatest interest to us.
We Don’t Have a Per
Application Breakdown
for Abilene, But….
• … Canarie DOES report the most common
applications (including reporting the most
popular applications for the three Canarie-Abilene peering points).
• See http://www.canet3.net/stats/reports.html
(the Abilene/CANet3 peering points are
labeled Abilene, AbileneNYC & SNNAP)
36
37
Making Traffic Statistics
Intuitively Meaningful
• While we could compare application traffic
in terms of Mbps or percentages or other
abstract units, it may help to characterize I2
traffic relative to a common traffic base we
all intuitively understand: WWW activity.
(excellent idea, CANet, bravo!)
• On the commodity Internet, we all know
that WWW traffic is the dominant protocol.
But what about on Internet2?
Most Popular TCP/IP Apps at
CANet/Abilene Peering Points,
Relative to HTTP as 1.0X for
the week ending 11/5/2000
• Abilene (Chicago):  NNTP 2.31X   FTP 1.59X
• Abilene (NYC):      NNTP 4.11X   FTP 1.35X
• SNNAP (Seattle):    NNTP 13.9X   FTP 1.23X
38
Most Popular TCP/IP Apps at
Selected CANet3 Sites,
11/05/2000, Relative to HTTP,
and As A % of Total Octets
• BCNet:  NNTP 49.4X (77.1%)   FTP 2.14X (3.3%)
• MRNet:  NNTP 90.7X (74.6%)   FTP 8.81X (7.2%)
• RISQ:   NNTP 31.1X (72.0%)   FTP 1.31X (3.0%)
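Since each application is reported both relative to HTTP and as a share of total octets, the two columns imply HTTP's own share. The sketch below assumes the figures pair up in the order listed on the slide (site, then NNTP and FTP ratios, then NNTP and FTP percentages); that pairing is our reading of the slide, and the function name is illustrative.

```python
# (site, NNTP ratio, NNTP %, FTP ratio, FTP %); pairing is an assumption.
sites = [
    ("BCNet", 49.4, 77.1, 2.14, 3.3),
    ("MRNet", 90.7, 74.6, 8.81, 7.2),
    ("RISQ", 31.1, 72.0, 1.31, 3.0),
]


def implied_http_percent(nntp_ratio, nntp_percent):
    """HTTP share of octets implied by NNTP's ratio and NNTP's share."""
    return nntp_percent / nntp_ratio


for name, nntp_x, nntp_pct, ftp_x, ftp_pct in sites:
    http_pct = implied_http_percent(nntp_x, nntp_pct)
    print(f"{name}: HTTP ~{http_pct:.1f}% of octets; "
          f"FTP implied {ftp_x * http_pct:.1f}% vs {ftp_pct}% reported")
```

Under that reading, HTTP is only a percent or two of the octets at each of these sites, and the FTP percentages agree with the FTP ratios to within rounding.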
39
40
==> Usenet News & FTP Are
The Dominant Applications
on I2 (Thank God…!)
• Usenet News (NNTP) is the dominant
TCP/IP application (which is good, since
most campuses centrally administer Usenet
news, and thus can manage it carefully)
• FTP is the second largest TCP/IP
application (which is also good since it is
typically non-time sensitive/non-recurring,
or is it non-recurring?)
41
Why Is Usenet News The Most
Successful Application on I2?
• News admins have been working hard at
making systems go fast for a long time now
• NNTP is architected to scale well
• News admins have a long history of
collaborating well with their peers. :-)
• Non-I2 News traffic quickly gateways onto
and off of I2 news servers at multiple points
• Performance matters (e.g., ‘Freenix effects’)
42
A Hypothesis About
Internet2 FTP Traffic Levels
• FTP, as the number two application on
Internet2, is also of interest to us. As we
began to think about it, we came up with a
hypothesis about what that FTP traffic
represented. All that FTP traffic *could* be
wild-haired misbuttoned boffins happily
transferring gigabytes and gigabytes worth
of spatial data on the mating habits of
Peruvian tree frogs... but we doubted it.
43
OR That FTP Traffic Could Be
Site-to-Site Mirroring Traffic
• Just beginning to think about this...
• Will we be able to differentiate mirroring
traffic from user traffic? Maybe, maybe not.
• Some observable flow characteristics:
-- both endpoints would be ftp servers (duh)
-- chronological patterns (e.g., assume
cron’d invocation of mirroring software)
• FTP log analysis from major FTP sites?
(particularly looking for ls -lR transfers…)
44
Interactive vs. Automated
FTP Traffic SubHypotheses
• SubHypothesis 1: web distribution of files
should have virtually replaced anonymous
ftp retrieval of files
• SubHypothesis 2: scp should be replacing
non-anonymous interactive ftp’ing
• SubHypothesis 3: cvsup should be replacing
traditional development tree mirroring
45
More SubHypotheses...
• SubHypothesis 4: to account for the volume
we’re talking about, there should be multithreaded mirroring tools in use (see, e.g.,
“Mirror Master” available from
ftp://sunsite.org.uk/packages/mirror/ )
• SubHypothesis 5: user-level semiautomated ftp tools may cloud the analysis
(e.g., http://www.ncftp.com/ncftp/); true
Windows-based mirroring software also
exists (e.g., http://www.netload.com.au/)
46
Do We Even Know What
Mirror’ers Are Doing?
• Smart mirroring tools should minimize
unnecessary transfers by only transferring
that which has “changed” -- but what’s a
change? Later mtime and different file size?
MD5 hash delta? ==> Varies by package.
• Field work opportunity for computer
anthropologists: go talk to the guys who run
the big ftp servers out there…
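The two families of "what's a change?" tests mentioned above can be sketched as follows. This is an illustration only, not how any particular mirroring package works; the function names and the way remote metadata is passed in are assumptions.

```python
import hashlib
import os
import tempfile


def changed_by_stat(local_path, remote_size, remote_mtime):
    """Cheap test: changed if the remote copy has a different size
    or a later modification time than the local copy."""
    st = os.stat(local_path)
    return st.st_size != remote_size or remote_mtime > st.st_mtime


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_by_hash(local_path, remote_md5):
    """Expensive test: compare content hashes (reads the whole file)."""
    return md5_of(local_path) != remote_md5


# Tiny demo on a temporary file (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
stat_says_changed = changed_by_stat(demo_path, remote_size=5, remote_mtime=0)
size_says_changed = changed_by_stat(demo_path, remote_size=6, remote_mtime=0)
hash_says_changed = changed_by_hash(demo_path, hashlib.md5(b"hello").hexdigest())
os.remove(demo_path)
print(stat_says_changed, size_says_changed, hash_says_changed)
```

The stat-based test needs only a directory listing (hence the interest in ls -lR transfers), while the hash-based test needs a digest from the remote side or a full transfer, which is why the choice affects traffic volume.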
47
III. Thinking About Your
Application and I2
Or, "What do you mean I can't
make a lemon chiffon cake out of a
package of venison T-bones?"
48
Not All Applications Are Well
Suited to Going Fast on I2
• We did an article for the UO Computing
Center newsletter describing what sort of
applications are well suited to Internet2; the
NLANR Application Support Team liked it
well enough that they now have a version of
it up at
http://dast.nlanr.net/Guides/
writingapps.html
49
Mentally Categorizing
Applications
• Applications where you can control WHO
you work with, WHERE they are working
from, WHAT they are doing and WHEN
they are doing it, tend to work best on I2
• Simplest example: getting one file to one
colleague one time via a passworded server
• Degenerate case: large video on demand
files on a generically accessible web server
50
[Figure: "idealized" model of I2 application (rarely an accurate model):
CONTROLLED SERVER WITH CONTENT/APP <--> Internet2 <--> SINGLE COLLABORATOR USING CONTENT/APP]
versus a more realistic model of I2 content/apps
[Figure: PUBLIC SERVER WITH CONTENT, with users:
-- USER #1 AT CAMPUS A VIA GIGAPOP ALPHA
-- USER #2 VIA THE COMMODITY INTERNET
-- USER #3 FROM A FOREIGN RESEARCH AND EDUCATION NETWORK
-- USER #4 FROM THE LOCAL CAMPUS
-- USER #5 FROM AN I2 CAMPUS (BUT OVER A DIALIN MODEM)
-- USER #6 ALSO FROM CAMPUS A VIA GIGAPOP ALPHA
-- USER #7 FROM CAMPUS B WITH ASYMMETRIC ROUTES
-- USER #N FROM ?]
51
Why Is The Worst Case
Scenario So Bad?
• The worst case scenario is problematic
because “tricks” you can try using to
optimize flows in the idealized case simply
don't work in less controlled scenarios -- specialized solutions that work for one user
don't scale to many users, and tricks that
work on the lossless I2 fall apart in the face
of the packet loss that's common on the
commodity Internet.
52
Other Problems With The Real
(vs. Idealized) Scenario
• You can’t (really) tell anything about the
potential throughput of a user by their
address (e.g., someone at an I2 campus
connected by an OC12 could still be coming
in over dialup -- no way for you to tell)
• You may get MULTIPLE users from the
same site at the same time, which means
that each will get at most 1/N of the
potential thruput that one might have gotten
53
Looking For Long Term and
Generalized Return on Effort
• The other factor is that when you are going
to tweak an application to improve its
throughput, you prefer an application that
will generalize and be of long term value -- fixing an application that will only be used
one time, or which is of interest to a very
limited audience (“stunt applications”),
reduces the payoff associated with the effort
you're putting in, and may defer other work.
54
Examples of Apps That Tend
to Work Well Over Internet2
• Usenet News
• Mirroring of FTP sites
• Web cache hierarchies
• MPEG1 IP multicast video
• Peer to peer networking (e.g., Napster) with
path preference (http://bestpath.iu.edu/)
But That's Not To Say That
Most Applications Can't Be
Made to Run Faster...
• … because they usually can.
55
56
IV. Gathering Baseline
Measurements
Or, "If only we'd known where we
were, we'd probably have had a lot
easier time going somewhere else."
57
Measuring Your Current
Throughput As A Baseline
• In some cases, the application you're using
may already report the throughput it is
getting (e.g., when you ftp a file, it provides
a report of bytes per second transfer speed
automatically).
• If your application is running on a dedicated
box, you can watch the throughput of that
interface directly or you may be able to use
SNMP to measure your throughput.
58
Example of Watching
Throughput under W2K...
• On W2K (or Windows NT) you can go to
Settings-->ControlPanel -->Administrative
Tools -->Performance and then click on the
"+" (Add Counters) to let you add "Network
Interface" "Bytes Sent/Sec" and "Bytes
Received/Sec" values derived from your
ethernet adapter.
• You can also look at those counters via
SNMP (Simple Network Mgmt Protocol).
59
Using SNMP...
• A variety of SNMP agents (such as SNMX)
are available which can allow you to
monitor network traffic by successively
polling SNMP counters
SNMX is available online at:
http://www.ddri.com/Products/
ace-snmx.html
60
Example SNMX Script
• #!/usr/local/bin/snmx
connect 128.223.abc.def
repeat
echo $ifInOctets.3 $ifOutOctets.3 | myprog
sleep 15
endrepeat
quit
• … where "myprog" computes and prints the
rate over time for those two SNMP counters
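A minimal sketch of the arithmetic a "myprog" would need, written in Python. How the samples reach it from SNMX is left out (that plumbing is an assumption we do not make here); the sketch shows only the rate computation, including tolerance for a single 32 bit counter rollover between polls. The readings are hypothetical.

```python
COUNTER_MODULUS = 2 ** 32  # ifInOctets/ifOutOctets are 32 bit counters


def octets_per_second(previous, current, interval_seconds):
    """Rate between two counter readings, tolerating one rollover."""
    delta = (current - previous) % COUNTER_MODULUS
    return delta / interval_seconds


# Hypothetical readings taken 15 seconds apart (matching "sleep 15");
# the counter wraps between the first and second reading.
readings = [4_294_000_000, 32_704, 15_032_704]
for before, after in zip(readings, readings[1:]):
    print(f"{octets_per_second(before, after, 15):,.0f} octets/sec")
```

If the counter wraps more than once between polls, the modulo trick silently under-reports, so the polling interval has to be short relative to the link speed.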
61
That Sort of Tool Generates at
Least Basic Throughput Info...
• Time      Input Bps   Output Bps
12:00:17  36606820    121320236
12:00:32  36870705    115150370
12:00:47  39005785    112971435
[etc.]
62
Why Not Just Use Something
Like HP OpenView?
• Match the tool to the task: simple tasks
should be handled with simple tools
• Users often won’t have a workstation to
dedicate to network monitoring tasks
• Simple tools are easier to explain to users
and easier for them to master
• It works well enough (even if it isn’t
perfect)
63
For Nicer (Graphical) Output,
Consider MRTG
• MRTG (Multi Router Traffic Grapher,
available from
http://ee-staff.ethz.ch/~oetiker/webtools/
mrtg/mrtg.html) makes nice graphs.
64
But MRTG Isn’t
Perfect, Either
• It is easy for MRTG configuration files to
end up out-of-date as interfaces get added or
deleted on routers, cables get moved around
on switches, etc.
• There’s also the problem that MRTG can
run into when centrally monitoring lots of
ports: it builds all of its graphs all of the
time, even if no one is looking at them
65
Yes, I Know About RRDtool
• RRDtool does indeed fix the problem of
trying to continually remake millions of
graphs that no one may ever look at,
however… RRDtool actually makes it hard
for those of us who like to build composite
web pages which monitor only one graph
from page X and another graph from page
Y, and a third graph from page Z (since
those graphs won’t pre-exist)
66
Anyhow, You Can't Always
Believe What You're Told...
• At higher speeds, older 32 bit SNMP
counters can roll over amazingly quickly:
(2^32 = 4,294,967,296 octets)(8 bits/octet) / (155,000,000 bits/second) ==
221.675 seconds (only 3.7 minutes) ==>
you need to be polling FREQUENTLY
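The same rollover arithmetic for a few other link speeds, as a sketch. The speeds are nominal line rates and assume the link is running completely full, which is the worst case; the function name is ours.

```python
def rollover_seconds(link_bps, counter_bits=32):
    """Seconds for an octet counter to wrap on a fully loaded link."""
    return (2 ** counter_bits) * 8 / link_bps


for label, bps in [("fast ethernet", 100e6), ("OC3", 155e6),
                   ("OC12", 622e6), ("gigabit ethernet", 1e9)]:
    print(f"{label}: {rollover_seconds(bps):.1f} seconds")
```

At gigabit speeds a five minute polling interval spans many wraps, which is why the 64 bit counters defined in RFC 2233 (counter_bits=64 in this sketch) matter on faster links.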
67
Example of An Incorrect Plot
Due to Counter Rollover
• Note the “picket fence” appearance and the
high average utilization rate (this plot was
done with five minute sampling intervals)
• None of this is new; see RFC 2233, section 3.1.6, for
a discussion of 32 bit counter problems.
68
And Then There Are Vendor-Specific Problems, Such As...
• Microsoft Knowledge Base article Q146004
(http://support.microsoft.com/support/kb/
articles/Q146/0/04.asp) confirms that
SNMP counters for a variety of variables
are broken when NT/W2K is running on
SMP (multiprocessor) machines. The
Knowledge Base article states that “This
will not be fixed.” Ugh.
69
Once You Know How Fast
You're Currently Going...
• Once you know how fast you're currently
going, you can then determine how much of
a change you'll need to make (if any).
• Let's assume you do still need to make some
changes...
70
Throughput Is Limited by the
"Tightest Pipe" in the Network
• Network traffic between any two points
may pass through many links, some large
and some small, some congested and some
almost completely unused.
• Possible network throughput is physically
bounded by the link in that chain which has
the lowest available capacity. Even big
pipes can still end up getting filled up!
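Put as arithmetic, the achievable throughput is at most the minimum of (capacity minus current usage) over the links in the path. A small sketch with entirely hypothetical link names and usage figures:

```python
def available(link):
    """Unused capacity of a link, in Mbps."""
    return link["capacity_mbps"] - link["used_mbps"]


# Hypothetical path; none of these figures are measurements.
path = [
    {"name": "desktop fast ethernet", "capacity_mbps": 100, "used_mbps": 5},
    {"name": "campus router uplink", "capacity_mbps": 100, "used_mbps": 80},
    {"name": "OC3 to gigapop", "capacity_mbps": 155, "used_mbps": 60},
    {"name": "remote campus link", "capacity_mbps": 100, "used_mbps": 30},
]

bottleneck = min(path, key=available)
print(f"Tightest pipe: {bottleneck['name']} "
      f"({available(bottleneck)} Mbps available)")
```

In this made-up path the tightest pipe is a 100 Mbps link that is already busy, not the link with the smallest nominal capacity.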
71
Examples of
Constraining Links...
• If you are dialing in, the obvious and clearly
pertinent constraining link is the speed of
your modem; nothing else you can try can
overcome the throughput limit of that link.
• If you are connecting from a shared (half
duplex) 10Mbps ethernet port, your
throughput will never be as potentially great
as that of someone who is on a switched
(full duplex) 100Mbps fast ethernet port.
72
But There Can Be More
Subtle Constraints...
• A prime suspect for the most common
campus-level choke point will be upstream
campus fast ethernet router interfaces which
may end up seeing aggregated traffic from
multiple downstream fast ethernet server
connections. While the clear solution is to
migrate those interfaces to gigabit, the
interfaces can be expensive (outright, and in
terms of using up scarce chassis slots)
73
And Router Horsepower...
• Another potential choke point can be the
CPU horsepower of your router and the
throughput of its backplane (and the
software features you burden it with, e.g.,
long ACLs, encryption, etc.).
74
In the case of Cisco boxes...
• The VIP’s installed on routers in your path
may hit you quicker than you might think.
VIP 2/40’s, for example, at 65K pps, may be
an issue at bandwidths under 300Mbps (in
plus out) depending on packet sizes. See the
discussion http://puck.nether.net/
lists/cisco-nsp/ entitled “RSP/VIP
performance question”
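The dependence on packet size is easy to see by converting a packets-per-second ceiling into bandwidth. In this sketch the 65K pps figure is the one quoted above, while the packet sizes are illustrative assumptions rather than measurements.

```python
def mbps_at(packets_per_second, packet_bytes):
    """Bandwidth carried at a given packet rate and packet size."""
    return packets_per_second * packet_bytes * 8 / 1e6


for size in (64, 576, 1500):  # illustrative packet sizes, in bytes
    print(f"{size:5d} byte packets: {mbps_at(65_000, size):6.1f} Mbps")
```

With small packets the pps ceiling is reached at a few tens of Mbps, long before the link itself is full.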
75
And Even If Your Own House
Is In Order (As It Surely Is)...
• Everything that can choke your throughput
locally can (and will) also be potentially an
issue for the OTHER end of the pipe (which
will be even harder to try to identify and get
fixed).
76
Identifying and Eliminating
Network Choke Points
• Users need to do some network detective
work so they can understand the network
topology that lies between them and their
collaborators.
• An excellent starting point for users is to
teach them to use the traceroute command.
77
What About Traceroutes In
The Other Direction?
• Reverse paths may be completely different
(e.g., routing may be/will often be
asymmetric); http://www.internet-2.org.il/
i2-asymmetry/index.html
• You need/want a traceroute gateway at each
site you work with so you can traceroute in
both the forward and the reverse direction
78
Some Internet2 Sites Already
Have Traceroute Gateways Up
• http://darkwing.uoregon.edu/~llynch/cgi-bin/trace.cgi
(UO via Abilene Denver or Abilene Sacramento)
• http://www.net.cmu.edu/cgi-bin/netops.cgi
(CMU via Abilene; can get a ping report, too)
• http://netview.cc.iastate.edu/cgi-bin/trace
(Iowa State via vBNS, includes a ping report, too)
• http://noc.net.umd.edu/cgi-bin/traceroute/trace
(Maryland via Abilene)
• Plus many more, but by no means all
Internet2 sites (unfortunately); see
http://www.traceroute.org/ for add’l sites.
79
What Can Traceroute
Tell Your Users?
• Are they even using Internet2?
Odd note: users may need help learning to
make inferences from traceroute output
(such as references to Abilene or the vBNS
or to their local Gigapop)…
“But it never said Internet2 on any of the
traceroute output…”
80
Traceroute also hints about
geography/capacity/technology
• Many link labels will mention locations
(e.g., kscy-dnvr for Kansas City-Denver)
• Links may have labels that allude to their
speed, e.g., "OC3" (155Mbps), "OC12"
(622Mbps), "FE" (fast ethernet, 100Mbps),
"GE" (gigabit ethernet, 1000Mbps), etc.
• Links may refer to "ATM" (asynchronous
transfer mode) or "POS" (packet over SONET)
81
Traceroute Will Also Help
Make Latencies Meaningful
• Part of moving toward going fast is
developing a sense of “normal” latency
values
• Users should learn that local links should
have very small times (just a few msec), and
remote links should run on the order of
25msec to LA, 75msec to NYC, or 220msec
or more to remote locations such as Tokyo
82
Be sure they know what to do
when the news isn’t good...
• Occasionally, if they traceroute to remote
destinations, they will see large round trip
times. This should not immediately make
them “freak out.”
• Large round trip times, particularly when
they only appear sporadically/during certain
times of the day, may be an indication that
there is a congested link in the path...
83
BUT Large RTT's
May Also Mean...
• … that they are simply going to a very
remote destination
• … that they are going via satellite rather
than via fiber
• … that ping traffic has been deprioritized by
a network device along the way (regular
TCP/IP traffic may be rolling along just
fine)
84
Link Capacity vs.
Available Link Capacity
• Once they have an idea of how they're
going to a particular site, their next goal
should be to see if there's available capacity
on the links between them and their remote
partner.
• In order to be able to do this, you will need
to know the speed of each link in the path
plus its usage (or try to infer link capacity
by watching for flat-topped usage graphs).
85
Looking Step by Step to See if
There's Capacity...
• In many cases, the only way to get true link
speeds is to talk to network engineers
responsible for those links (but in some
cases it may be viewed as impolite to ask
how big one's pipes are -- sort of like asking
how much money someone makes or how
much a person weighs)… or there may be
multiple or alternate paths that may make it
hard to get an applicable answer.
86
An Aside About Automated Per
Hop Throughput Estimators...
• There are some automated throughput
estimators such as pathchar (see:
http://www.caida.org/tools/utilities/
others/pathchar/) however we've had mixed
results from them…
87
Measuring End to End
Available Bandwidth
• Easiest solution may be to use ttcp
(ftp://ftp.arl.mil/pub/ttcp/), assuming you
can run a daemon on the remote end to
which you'd like to estimate throughput.
• See also netperf (http://www.netperf.org/)
• Problem: act of measuring changes that
which is being measured… e.g., ttcp or
netperf can/will fill up your pipes.
88
Network Usage Data
• So… when it comes to usage data, you're
basically hunting for MRTG (or
comparable) SNMP graphs for each link
between you and your remote site of
interest….
89
Campus Level Traffic
• For data about traffic on local (campus)
links, users should talk to campus network
administrators. Network administrators may
or may not have that data, and it may or
may not be publicly available to your
users for a variety of reasons.
90
For Gigapop-level
Usage Data...
• See: http://monon.uits.iupui.edu/abilene/
and then click on a core node, and then
click on "Connector Stats" for the node you
selected.
• For the Oregon Gigapop, for example, see
the Denver and Sacramento (soon to be
Sunnyvale) core nodes.
91
For I2 Backbone Usage Data...
• See the Abilene Weather Map that's at
http://hydra.uits.iu.edu/~abilene/traffic/
• For foreign peering networks, see:
http://monon.uits.iupui.edu/
abilene/peers.html
92
What About Remote Peer
Campus' MRTG pages?
• They may or may not be available; your
remote colleagues should check with the
network engineers at their site for
information. Again, this data may not be
available.
93
What If I'm Working With
MANY remote sites?
• Repeat the above process for all of them,
one at a time, and recognize that stuff is
constantly changing, and go crazy…
• OR assume that so long as traffic is going
via I2, it is probably flowing via an
uncongested link; the problem thus becomes
one of monitoring what exit traffic takes -- does it go via I2, or some other network?
94
One Approach to Monitoring
Traffic Exits…
• See my talk "Monitoring Traffic Exits In a
Multihomed I2 Environment"
http://www.ncne.nlanr.net/news/workshop/
2000/000515/Talks/sauver-jt05152000/
95
The Abilene Backbone Isn’t
Congested, True…
• The one chunk of the end-to-end network
path that probably won't be congested at all
is the Abilene backbone….
96
But ... Be Prepared for Some
Possible Indirect Routes...
• At least in the past, Abilene’s sparse number
of routing nodes and limited number of
peering points with other networks (e.g., the
old approach of hauling all foreign
connections to StarTap in Chicago, the
absence of a west coast Abilene-vBNS
interconnect, etc.) has meant that some
traffic was routed sub-optimally in terms of
its geographic route/latency.
97
For example, Oregon to China
...via the Midwest
• Traffic from Abilene to CERNet sites (such
as Peking University or Tsinghua
University) goes via StarTAP in Chicago,
which adds approximately 60 msec worth of
latency to packets from West Coast sites.
• Arguably, given that the total latency to
some overseas sites will be > 1000msec,
maybe we could ignore that extra 60msec...
98
An Example Where The I2
Topology IS Material...
• Oregon Abilene-connected schools going to
NM vBNS-connected schools via Chicago:
UNM ==> 99 msec
NMSU ==> 106 msec vs.
LANL ==> 48 msec (ESNet via Calren)
But DOE’s Albuquerque NM Operations
Office (www.doeal.gov) is 110 msec (via
ESNet Chicago!)
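Why those latency numbers matter: a single TCP flow can send at most one window of data per round trip, so a longer path lowers the throughput ceiling even when no link is congested. A rough sketch using the round trip times quoted above; the 65,535 byte window (the classic maximum without window scaling) is an assumption for illustration, not something measured at these sites.

```python
# Sketch: throughput ceiling for one TCP flow with a fixed window.
# ASSUMPTION: a 65,535-byte window (classic maximum without window
# scaling). The RTT figures are the ones quoted on the slide.

WINDOW_BYTES = 65535  # assumed window size, not from the slide


def max_throughput_mbps(window_bytes: int, rtt_msec: float) -> float:
    """Upper bound on throughput: one full window per round trip."""
    rtt_sec = rtt_msec / 1000.0
    return window_bytes * 8 / rtt_sec / 1e6


rtts_msec = {"UNM": 99, "NMSU": 106, "LANL": 48, "www.doeal.gov": 110}

for site, rtt in rtts_msec.items():
    ceiling = max_throughput_mbps(WINDOW_BYTES, rtt)
    print(f"{site:15s} {rtt:4d} msec -> {ceiling:6.2f} Mbps ceiling")
```

With the same window, the 48 msec path allows roughly twice the throughput of the ~100 msec paths, which is why the detour via Chicago is material here.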
99
A 2nd Example Where I2’s
Topology Works Against Itself
• UO to Portland State via the OWEN/NERO
statewide network: ~13.5 msec
• UO to OGI (also in the Portland Area, but
connecting via the gigapop in Seattle) --
travels down to Sacramento, then up to
Seattle, then back to Portland: ~33 msec
• Rhetorical-ish question: what are the “best”
path selection criteria for I2 schools with
multiple connectivity options?
100
Abilene Is Getting Better, But...
• Examples: now three peering points with
CANet3, International Transit Network, etc.
• BUT if Abilene won’t/can’t/prefers not to
fix its sparse number of routing nodes and
limited interconnections with some other
networks, the only viable solution (where it
is an issue) may be to obtain direct links to
networks where there are routing/latency
problems (where possible).
101
And About Those
Mission Networks...
• For some mission networks this simply
won’t be possible at all; see: www.es.net/
hypertext/ESNetUniversityPolicy.html
[Clearly the federal mission networks want
to be supportive of Internet2, and they want
to simplify their own lives, and they want to
avoid having people collect mission
network connectivity just for bragging
rights rather than for functional purposes.]
102
Let’s Come Back to the “Easy”
Part: The Campus...
• If your user’s local system isn't connected
via a fast ethernet (or a gig ethernet)
connection, get that connection upgraded.
Here at UO, there is a one time $150/port
charge to get fast ethernet service (where it
is available).
Pay the money and make that “last 100
meters” entirely a non-issue.
103
Possible Choke Point #1:
Campus Backbone
• If you’ve got fast ethernet to the desktop,
you should have gigabit ethernet at the
campus core. UO currently has a gigabit
core, but there are other campuses that may
still be running with a fast ethernet or FDDI
core.
• If the core of your campus backbone isn't
running gigabit ethernet at this point, it's
(past) time to begin planning to upgrade it.
104
And Dig Into What’s
On Those Routers
• Just because it has fast interfaces or gig
interfaces doesn’t mean that it will keep up
with the traffic being shoved at it.
• Are you monitoring router CPU loads?
• Do you know what sort of VIPs are between
you and the world?
105
Possible/Likely Choke Point
#2: Intrastate backhaul links
• If your Internet2 traffic is currently
backhauled to a regional gigapop over
intrastate DS3 speed links, that obviously is
going to limit your potential Internet2
throughput.
• Those sorts of potential choke points could
be upgraded to leased OC3’s (but fiber
based solutions would be more flexible)
106
Lighting Dark Fiber Becoming
Increasingly Affordable
• Traditional SONET-based solutions were
(and are) outrageous, but some optical
vendors are offering financially attractive
alternatives (e.g., see: http://www.luxn.net/)
• I have a fiber optic primer and tutorial
available that you’re welcome to check out;
see: http://cc.uoregon.edu/cnews/
summer2000/fiber.html
107
Possible/Likely Choke Point
#3: International Links
• International links are particularly
expensive, and hence tend NOT to be
overprovisioned.
• If I had to make a bet about where the
choke point would be for flows going to an
overseas destination, my money would
always be on the international link itself
• We (in the US) have little room to gripe,
however, since we aren't willing to help pay
Going Fast Isn’t Just A
Matter of Eliminating Network
Choke Points, However...
• You need to tackle the operating system, the
system hardware, and the application, too...
108
109
V. Operating System Issues
Or, "There are two major products
that come out of Berkeley: LSD and
UNIX. We don't believe this to be a
coincidence." Jeremy S. Anderson
Ugly Reality Number One:
Your User May Not Run
the OS You Prefer
• Prime example: the application I work with
most (NNTPRelay) is only available in a
production quality package for NT/W2K,
which means that I am unable to run a
flavor of Unix or OpenVMS for most of my
work. On the other hand, you may prefer
NT or W2K but have to run Unix (example:
NT and W2K still lack production IPv6...)
110
Basic HPC Mantra for OS
Tuning… Handle Bandwidth
Delay Product Issues
• Nutshell description of problem: you need
to be able to buffer the data being sent via
TCP/IP until it has been acknowledged as
having been successfully received at the
remote site. This requires large buffers for
high bandwidth flows to remote sites.
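The buffer size in question is the bandwidth delay product: the amount of data that can be "in flight" (sent but not yet acknowledged) on the path. A minimal sketch of the arithmetic; the example bandwidths and round trip times are illustrative assumptions, not measurements.

```python
# Sketch: bandwidth delay product (BDP) = bandwidth x round trip time.
# The example paths below are ASSUMED values for illustration only.

def bdp_bytes(bandwidth_mbps: float, rtt_msec: float) -> int:
    """Bytes that must be buffered to keep the path full."""
    bits_in_flight = bandwidth_mbps * 1e6 * (rtt_msec / 1000.0)
    return round(bits_in_flight / 8)


examples = [
    (10, 10),     # hypothetical: 10 Mbps campus path, 10 msec RTT
    (100, 70),    # hypothetical: fast ethernet host, cross-country RTT
    (1000, 70),   # hypothetical: gigabit host, cross-country RTT
]

for mbps, rtt in examples:
    print(f"{mbps:5d} Mbps x {rtt:3d} msec -> {bdp_bytes(mbps, rtt):>10,d} bytes")
```

Anything smaller than the BDP leaves the sender idle while it waits for acknowledgements, no matter how fast the links are.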
111
112
PSC OS Tuning Guide
• "Enabling High Performance Data Transfers
on Hosts" (http://www.psc.edu/networking/
perf_tune.html)
• Beginning to age, but still an excellent
resource
• BEWARE: Assumes small number of flows;
using large buffers can impact paged and
non-paged memory pool requirements
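For applications you control, larger buffers can also be requested per socket. A minimal sketch using the standard socket options SO_SNDBUF and SO_RCVBUF; the 875,000 byte figure and the flow count are assumed example values, the helper names are hypothetical, and the OS may silently clamp or adjust whatever you ask for.

```python
import socket


def make_tuned_socket(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request larger send/receive buffers.

    Buffers are requested before connect(), since the TCP window scale
    is negotiated during the handshake.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for option in (socket.SO_SNDBUF, socket.SO_RCVBUF):
        try:
            sock.setsockopt(socket.SOL_SOCKET, option, buffer_bytes)
        except OSError:
            # Some systems reject requests above their configured maximum;
            # in that case the default buffer size stays in effect.
            pass
    return sock


def total_buffer_bytes(flows: int, buffer_bytes: int) -> int:
    """Worst-case memory if every flow fills both of its buffers."""
    return flows * 2 * buffer_bytes


REQUESTED = 875_000  # ASSUMED example value, not a recommendation

sock = make_tuned_socket(REQUESTED)
granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {REQUESTED:,d}; granted send={granted_snd:,d} recv={granted_rcv:,d}")
sock.close()

# The BEWARE above in numbers: many flows times big buffers adds up fast.
print(f"1000 flows could pin {total_buffer_bytes(1000, REQUESTED):,d} bytes")
```

Always read the value back with getsockopt(); what was granted is what counts, and it is frequently not what was requested.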
113
Paged and Non-Paged Pools
• Another strange-but-true Microsoft
NT/W2K factoid: according to Microsoft
Knowledge Base article Q126402, Windows
NT and W2K have hard caps on the
maximum size of the paged and non-paged
pools. E.g., even by tweaking the registry,
you cannot exceed 300-340MB worth of
paged pool, or 256MB worth of non-paged
pool.
Linux is Not 100% Free
of Actual or Potential TCP/IP
Issues, However, Either...
• See, for example, “Linux 2.2.12 TCP
Performance Fix for Short Messages”
at the ICASE Coral site:
www.icase.edu/coral/LinuxTCP2.html
• Much worth learning about TCP/IP
idiosyncrasies from the Beowulf
community
114
115
Another Favorite
Recommendation: SACK
• Another popular recommendation is to
enable SACK (selective
acknowledgements); a SACK enabled
receiver is able to inform the sender about
all packets received so that the sender needs
to resend only the packets that have actually
been dropped.
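A toy model of the difference, assuming ten numbered segments of which two are lost. This is a deliberate simplification (real TCP recovery without SACK is more subtle than "resend everything past the first hole"), and the function name is hypothetical, but it shows why SACK helps when several packets are dropped from one window.

```python
# Sketch: which segments get resent with and without SACK.
# SIMPLIFIED MODEL: without SACK the sender only knows the cumulative
# ACK point and (in the worst case, e.g. after a timeout) resends
# everything from the first hole onward.

def segments_to_resend(sent, received, use_sack):
    """Return the segment numbers the sender retransmits."""
    received = set(received)
    missing = [seg for seg in sent if seg not in received]
    if not missing:
        return []
    if use_sack:
        # Receiver reported exactly what arrived; resend only the holes.
        return missing
    # Cumulative ACK only: everything from the first hole is suspect.
    first_hole = missing[0]
    return [seg for seg in sent if seg >= first_hole]


sent = list(range(1, 11))
received = {1, 2, 3, 5, 6, 8, 9, 10}   # segments 4 and 7 were dropped

print("with SACK:   ", segments_to_resend(sent, received, use_sack=True))
print("without SACK:", segments_to_resend(sent, received, use_sack=False))
```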
116
SACK May Be Inconsistent
With SYN Flood Protection...
• SACK and protection against SYN flooding
may not be simultaneously possible under
some OS's (see: http://www.microsoft.com/
TechNet/network/tcpip2k.asp for example)
• And note that many major sites (surprise,
surprise) don’t implement SACK (see:
http://www.aciri.org/tbit/nanog-tbit.pdf),
and only 6% of sites implement it correctly.
117
What If There's
Packet Loss?
• For a nice general treatment that users may
like, explaining what happens when they try
to go fast but hit packet loss, see:
"TCP Response Under Loss Conditions"
(http://www.academ.com/nanog/feb1997/
tcp-loss/index.html)
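For a back-of-the-envelope feel, one widely cited approximation (from Mathis et al., not from the talk linked above) bounds steady-state TCP throughput by MSS/RTT times C/sqrt(loss rate), with C around 1.22. A minimal sketch; the MSS, RTT, and loss rates used are assumed example values.

```python
import math

# Sketch: the Mathis et al. approximation for TCP throughput under
# random loss. This formula is NOT from the slides; it is a commonly
# used rule of thumb, and the inputs below are ASSUMED examples.

def mathis_throughput_mbps(mss_bytes: int, rtt_msec: float,
                           loss_rate: float,
                           c: float = math.sqrt(3 / 2)) -> float:
    """Approximate upper bound on throughput for one TCP flow."""
    rtt_sec = rtt_msec / 1000.0
    return (mss_bytes * 8 / rtt_sec) * (c / math.sqrt(loss_rate)) / 1e6


for loss in (0.01, 0.001, 0.0001):
    rate = mathis_throughput_mbps(mss_bytes=1460, rtt_msec=70, loss_rate=loss)
    print(f"loss {loss:.2%} at 70 msec RTT -> about {rate:6.2f} Mbps")
```

The takeaway for users: on a long path, even a fraction of a percent of loss keeps a single flow far below fast ethernet speeds.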
118
If You Want to Measure/
Monitor Packet Loss...
• AMP Active Measurement Program
(round trip)
http://amp.nlanr.net/active/
amp-uoregon/HPC/body.html
• Surveyor (one way)
http://www.advanced.org/surveyor/
Maybe We Don’t Need to
Worry About All This OS
Tuning Stuff??? Web100
• Goal is to AUTOMATICALLY tune Linux
hosts to achieve 100 Mbps class throughput
over Abilene and comparable networks.
• $2.9 million in funding from the NSF
• See: http://www.web100.org/
119
120
VI. System Hardware Issues
Or, "It is really hard to beat the price
performance of commodity PC
hardware these days."
If You Want to Go Fast,
Bottom Line, You Need
At Least Okay Hardware
• Relevant hardware components include:
-- motherboard
-- CPU
-- memory
-- Disk I/O
-- NIC
-- network switch
121
122
“I Need Okay Hardware” Does
Not Necessarily Translate to
“I’ve Got to Buy Traditional
Unix Workstations”
• You will have a very hard time beating the
price/performance ratio of commodity PC
workstations.
• The big question is “should I build from
scratch or should I buy a prebuilt system?”
123
Build or Buy?
• We assume you’re fussy about what you run
(or you’re cheap like us) and will roll your
own
• But beware: if you’re planning on building
and running NT/W2K, Microsoft certifies
ONLY complete systems, not components.
• Until recently, too, you couldn’t really buy a
good cheap server class motherboard
124
Motherboards
• Key? You want a motherboard with
66MHz 64 bit PCI slots
• See, for example: SuperMicro 370DE6
(dual FCPGA PIII, ServerWorks ServerSet
III HE-SL chipset, 133MHz front side bus,
up to 4GB registered ECC SDRAM, 2 64
bit 66MHz PCI slots, 4 64 bit 33MHz PCI
slots, Adaptec dual Ultra160 SCSI) ~ $650
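Why the PCI slot type is the key: the theoretical peak of a PCI bus is simply width times clock, and a gigabit NIC needs about 125 MB/sec in each direction. A rough sketch; these are nominal peaks on a shared bus (real throughput is lower), and the 32 bit 33MHz row is added only as an assumed point of comparison with ordinary desktop slots.

```python
# Sketch: theoretical peak PCI bandwidth = bus width x clock rate.
# Nominal figures only; a real, shared PCI bus delivers less.

def pci_peak_mbytes_per_sec(width_bits: int, clock_mhz: float) -> float:
    """Peak transfer rate in MB/sec for a PCI bus."""
    return width_bits * clock_mhz / 8


GIGABIT_MBYTES_PER_SEC = 1000 / 8  # 125 MB/sec, one direction

slots = [
    ("32 bit 33MHz (assumed desktop slot)", 32, 33),
    ("64 bit 33MHz", 64, 33),
    ("64 bit 66MHz", 64, 66),
]

for name, width, clock in slots:
    peak = pci_peak_mbytes_per_sec(width, clock)
    ratio = peak / GIGABIT_MBYTES_PER_SEC
    print(f"{name:38s} {peak:6.0f} MB/sec ({ratio:.1f}x one-way gigabit)")
```

A 32 bit 33MHz slot barely covers one direction of gigabit traffic on paper, before any disk controller shares the bus.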
125
Or Maybe...
• Tyan Thunder HEsl (S2567), with 64 bit
66MHz and 64 bit 33 MHz PCI slots, 2 PIII
processors, 2GB worth of DIMMs, dual
Ultra160 controllers, etc. (the Tyan web site
says “coming soon.”)
126
Network Interface Cards
• Don't expect to generally get 100 Mbps
from fast ethernet cards, nor 1000Mbps
from gigabit cards for a variety of reasons
(most notably because of small 1500 byte
MTUs, checksum-related overhead, and
non-zero-copy TCP/IP stacks)
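The framing part of that shortfall is easy to estimate. A sketch assuming IPv4 and TCP headers without options (20 bytes each) plus standard Ethernet per-frame overhead; checksum and copy costs are host CPU effects and are not modeled here. The 9000 byte row is an assumed jumbo frame size, included only for comparison.

```python
# Sketch: best-case TCP payload rate on Ethernet, framing overhead only.
# ASSUMPTIONS: IPv4 + TCP with no options; no host-side costs modeled.

ETH_HEADER = 14      # destination MAC, source MAC, ethertype
ETH_FCS = 4          # frame check sequence
ETH_PREAMBLE = 8     # preamble + start-of-frame delimiter
ETH_GAP = 12         # minimum inter-frame gap
IP_HEADER = 20       # IPv4, no options
TCP_HEADER = 20      # TCP, no options


def tcp_goodput_mbps(link_mbps: float, mtu: int) -> float:
    """Maximum TCP payload throughput for a given link speed and MTU."""
    payload = mtu - IP_HEADER - TCP_HEADER
    on_wire = mtu + ETH_HEADER + ETH_FCS + ETH_PREAMBLE + ETH_GAP
    return link_mbps * payload / on_wire


for link in (100, 1000):
    for mtu in (1500, 9000):   # 9000 = assumed jumbo frame size
        print(f"{link:5d} Mbps link, MTU {mtu:4d}: "
              f"{tcp_goodput_mbps(link, mtu):7.1f} Mbps of TCP payload at best")
```

Framing alone costs only about 5% at a 1500 byte MTU, so most of the gap between nominal and observed speeds comes from the per-packet work the host has to do.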
127
“Measured By Weight,
Not Volume”
• For typical “gigabit” cards, you may only
get 350Mbps to a little over 600Mbps…
• See: http://www.cs.duke.edu/
ari/trapeze/tcp-clarity.html: “typical TCP
socket implementations running over
typical gigabit LANS (e.g., a Gigabit
Ethernet using the standard 1500-byte
MTU) deliver about half a gigabit per
second.”
128
See also...
• http://www.lanquest.com/labs/reports/
gigabitethernet/pci/IntelNP1288a.html
• http://www.nwfusion.com/news/1999/
0705gigabit.html
• http://www.networkcomputing.com/916/
916r1side4.html
129
Beware of NIC Interrupt Load
• Many network cards generate a large
number of interrupts, which can really
hammer your system's CPU -- Intel appears
to be doing a good job at minimizing this
problem... However we still tend to use
Netgear GA620 gigabit NICs because they
are inexpensive (~$330) and work well
enough for our requirements.
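To see how large "a large number of interrupts" can get, count full-size frames per second at line rate. A rough sketch; the frames-per-interrupt values are hypothetical stand-ins for whatever interrupt coalescing a given card and driver actually do.

```python
# Sketch: interrupt rate implied by full-size frames at line rate.
# ASSUMPTION: 1500-byte MTU frames; 38 bytes of Ethernet overhead per
# frame on the wire (header, FCS, preamble, inter-frame gap).

WIRE_OVERHEAD_BYTES = 14 + 4 + 8 + 12


def frames_per_sec(link_mbps: float, mtu: int = 1500) -> float:
    """Frames per second when the link is saturated with full frames."""
    bits_per_frame = (mtu + WIRE_OVERHEAD_BYTES) * 8
    return link_mbps * 1e6 / bits_per_frame


def interrupts_per_sec(link_mbps: float, frames_per_interrupt: int) -> float:
    """Interrupt rate if the NIC batches this many frames per interrupt."""
    return frames_per_sec(link_mbps) / frames_per_interrupt


for batch in (1, 10, 50):   # hypothetical coalescing factors
    print(f"gigabit, {batch:2d} frame(s) per interrupt: "
          f"{interrupts_per_sec(1000, batch):9,.0f} interrupts/sec")
```

At one interrupt per frame, a saturated gigabit link means tens of thousands of interrupts per second in each direction, which is why cards that batch frames are easier on the CPU.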
130
CPU
• To go fast on the network, you really want
multiple fast CPUs or you are liable to see
CPU saturation from the NIC
• Some dual motherboards may have (or may
seem to have) stability issues under heavy load
• We do PIII’s; we’ve not been convinced that
Xeons (even with lots of cache) merit their
price premium (but we’d love to see
empirical benchmarks on this topic).
131
Network Switches
• We currently use 3Com gig ether switches
because some were generously donated
• We’re considering moving to HP4000M's
with 1000baseT gig-over-copper interfaces
because of their pricing; we know they have
limited backplane throughput (but that may
not be an issue for moderate port densities
and practically realized throughput levels)
132
Disk I/O
• News guys used to think: “for good
throughput, use lots of disks striped across
multiple controllers”…
• SCSI (in the fastest flavor then available)
was the customary prescription, but now
check out Promise & 3Ware for some
inexpensive IDE RAID possibilities
(www.promise.com and www.3ware.com)
133
Beware Filesystem Dynamics
• Filesystem dynamics can also impact disk
I/O throughput (e.g., inode insertion in UFS
becomes problematic when there are “lots”
of files in a single directory). Fast machines
should consider using alternative file
systems, such as either a cyclical file system
or perhaps XFS.
134
See...
• http://www.usenix.org/publications/library/
proceedings/lisa97/full_papers/14.fritchie/
14_html/main.html
• http://oss.sgi.com/projects/xfs/
135
Doing A Stripe of Lots of
Spindles: Sorta Old School...
• We’ve now come to realize that for really
high throughput, you simply can't touch
disk at all -- all the data has to be kept in
memory.
136
RAM Disks
• Dropping price of commodity PC RAM
makes RAM disks economically feasible for
the first time
• Popular PC motherboards can now
accommodate 2-4GB worth of RAM
• 512MB PC133 ECC Registered DRAM's
are down to $499/each now…
• Compare that to a Quantum 1.6GB solid
state drive at $14,499 or so...
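Putting those two prices on the same footing, using only the figures quoted above (the motherboard and the rest of the system are not included, so treat this as a rough comparison):

```python
# Sketch: cost per GB for commodity RAM vs. a solid state drive,
# using the prices quoted on the slide.

def cost_per_gb(price_dollars: float, capacity_gb: float) -> float:
    """Dollars per gigabyte of capacity."""
    return price_dollars / capacity_gb


ram = cost_per_gb(499, 0.5)       # 512MB PC133 ECC Registered module
ssd = cost_per_gb(14499, 1.6)     # Quantum 1.6GB solid state drive

print(f"commodity RAM:     ${ram:,.0f} per GB")
print(f"solid state drive: ${ssd:,.0f} per GB")
print(f"the drive costs roughly {ssd / ram:.1f}x as much per GB")
```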
137
W2K Ram Disk
• There is a limit to how big a "conventional"
ram disk can be in W2K because it is
normally carved out of paged/non-paged
pool space (which has a hard cap, etc., etc.).
• See http://www.jlajoie.com/ramdskNT/ for
information about a product that can use
excluded memory to create up to 2GB ram
disks in NT/W2K
138
Speaking of Memory...
• Traditional logic: more memory is always a
good thing -- "If you're swapping, add
memory"
• My app was swapping under NT/W2K,
so I tried adding memory only to find that
NT/W2K "wouldn't use it" -- no way to
explicitly set working set quotas under
NT/W2K as one can under OpenVMS.
139
Windows 2000 Memory Hell
• If you are planning to use W2K for
applications that have lots of large files
open, note that 1MB worth of paged pool
gets used up for each GB worth of files
which are open.
• Cf. the earlier discussion regarding hard limits
to paged and non-paged pool under W2K
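Combining those two figures gives a rough ceiling on how much file data can be open at once. A sketch assuming the whole paged pool were available for open files, which it is not (other kernel consumers share it), so the real limit is lower; the 50% row is a hypothetical illustration of that sharing.

```python
# Sketch: ceiling on open file data implied by the paged pool cap.
# From the slides: 1MB of paged pool per GB of open files, and a hard
# cap of roughly 300-340MB of paged pool.
# ASSUMPTION: the fraction of the pool available to open files.

POOL_MB_PER_OPEN_GB = 1.0


def max_open_file_gb(paged_pool_mb: float, usable_fraction: float = 1.0) -> float:
    """Largest total size of open files the paged pool could support."""
    return paged_pool_mb * usable_fraction / POOL_MB_PER_OPEN_GB


for pool_mb in (300, 340):
    for fraction in (1.0, 0.5):   # 0.5 = hypothetical share of the pool
        limit = max_open_file_gb(pool_mb, fraction)
        print(f"{pool_mb}MB pool, {fraction:.0%} usable -> about {limit:.0f}GB of open files")
```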
140
Couple of Nice
Additional Resources
• “TCP/IP and Network Performance Tuning”
http://sd.wareonearth.com/woe/Briefings/
tcptune/sld001.htm
• “Tuning Your TCP/IP Stack”
http://www.rvs.uni-hanover.de/people/
voeckler/tune/EN/tune.html
• “SQUID Frequently Asked Questions”
http://www.squid-cache.org/Doc/FAQ/
Many good practical OS-specific tips/quirks