Document

advertisement
Optimizing Network Performance
Alan Whinery
U. Hawaii ITS
April 7, 2010
IP, TCP, ICMP

When you transfer a file with HTTP or FTP





A TCP connection is set up between sender and
reciver
The sending computer hands the file to TCP, which
slices the file into pieces, called segments, which it
assigns numbers, called Sequence Numbers
TCP hands each piece to IP, which makes
datagrams
IP hands each piece to Ethernet driver, which
transmits frames
(continued >>> )
IP, TCP, ICMP

Ethernet carries the frame (through switches) to a
router, which:


takes the IP datagrams out of the Ethernet frames
decides where it should go next





Check cache OR queue for CPU
If it is not forwarded*, the router may send an ICMP message back
to the sender to tell it why
hands it to a different Ethernet driver
etc.
(...)
* reasons routers neglect to forward: no route, expired TTL, failed IP
checksum, Access-list drop, input-queue flushes, selective discard
IP, TCP, ICMP





The last router delivers the datagrams to the
receiving computer by sending them in frames
across the final link
the receiving computer extracts the datagrams from
the frames,
extracts the segments from the datagrams
sends a TCP acknowledgement for this segment's
Sequence Number back to the sender
good segments are handed to the application (i.e.
web browser) which will write them to a file on disk
elements on each end computer

Disk – data rate, errors

DMA – data rate, errors

Ethernet (link) driver – link neg., speed duplex, errors


Features: (Int. Coa., Chk. Off., Seg. Off.)

buffer sizes, frame size

FCS check
TCP (OS) – transport, error/congestion recovery

Features (Con. Av., Buffer sizes, SACK,ECN,TS)

parameters – MSS, buffer/window sizes

IP4 (OS) – MTU, TTL, Checksum

IP6 (OS) – MTU, Hop Limit

Cable or transmission space
Brain teaser

A packet capture near a major UHNet
ingress/egress point will observe IP datagrams
with Good CRCs carrying TCP with bad CRCs.

On the order of a dozen or so per hour

How can this be?


It's either an unimaginable coincidence, OR
The source host has bit errors between the calculation of
TCP checksum and that of IP checksum
elements on each switch (L2/bridge)

link negotiation/physical

input queue

output queue

vlan tagging/processing

FCS check

Spanning Tree (changes/port-change-blocking)
elements on each router

Everything the switch has, plus

route table/route cache



changing, possibly temporarily invalid
When cache changes, “process routing” adds
latency
ARP
TCP

Like pouring water from a bucket into
a two-liter soda bottle.




(important to take the cap off first) :^)
If you pour too fast, some water gets
lost
when loss occurs, you pour more
slowly
TCP continues re-trying until all of the
water is in the bottle
Round Trip Time

RTT, similar to the round trip time reported by
“ping”, is how long it takes a packet to traverse
the network from the sender to the receiver and
then back to the sender.
Bandwidth * Delay Product

BDP is the one-half RTT times the useful
“bottleneck” transmission rate (BW) of the
network path


It's actually BW * the one-way delay -- 0.5 * RTT is
an estimate of one-way delay
Equal to the amount of data that will be “in
flight” in a “full pipe” from the Sender to the
receiver when the earliest possible ACK is
received.
How TCP works

S = sender R = receiver

S & R set up a “connection”



S starts sending segments not larger than MSS
R starts acknowledging segments as they are
received in good condition.


S & R negotiate RWIN MSS, etc
Acknowledgments refer to last segment received,
not every single segment
S limits unacknowledged “data in flight” to R's
advertised RWIN
How TCP works

TCP performance on a connection is limited by
the following three numbers:

Sender's socket buffer (you can set this)


Congestion Window (calculated during transfer)



Must hold 2 * BDP of data to “fill pipe”
Sender's estimate of the available bandwidth
Scratchpad number kept by sender based on ACK/loss
history
Receiver's Receive Window (you can set this)

must equal ~ BDP to “fill pipe”

These can be specified with nuttcp and iperf

OS defaults can be specified in each OS
How TCP works

original TCP



was unable to deal with out-of-order segments
was forced to throw away received segments that
occurred after a lost segment
Modern TCP Has

SACK (selective acknowledgements)

Timestamps

Explicit Congestion Notification
TCP Congestion Avoidance

Early TCP performed poorly in the face of lost
packets, a problem which became more serious
as transfer rates increased


Although bit-rates went up, RTT remained the
same.
Many TCP variants have been customized for
large bandwidth-delay products

HSTCP, FAST TCP, BIC TCP, CUBIC TCP, H-TCP,
Compound TCP
Modern Ethernet drivers

Current Ethernet devices offer several
optimizations

TCP/IP checksum offloading


TCP segmentation offloading



NIC chipset does checksumming for TCP and Ipv4
OS sends large blocks of data to NIC, NIC chops it up
Implies TCP Checksum offloading
Interrupt Coalescing

After receiving an Ethernet frame, NIC waits for more
before raising interrupt to ICU
Modern Ethernet drivers

Optimizing the NIC's switch connection(s)

Teaming


Flow-control (PAUSE frames)




Combining more than one NIC into one “link”
Allowing the switch to pause the NIC's sending
I have not found an example of negative effects
Can band-aid problem NICs by smoothing rate and
preventing queue drops (and therefore keeping TCP from
seeing congestion)
VLANs


Very useful on some servers, as you can set up several
interfaces on one NIC
Although it is offered in some Windows drivers, I have
only made it work in Linux
Modern Ethernet drivers

Optimizing the driver's use of the bus/dma/etc.
Or Ethernet switch

Scatter-gather


Write-combining


Data transfer “coalescing”
Message Signaled interrupts


Multipart DMA transfers
PCI 2.2 and PCI-E messages that expand available
interrupts and relieve the need for interrupt connector
pins
Multiple receive queues (hardware steering)
Modern Ethernet drivers

Although there are gains to be had from
tweaking offloading and other opts



Always baseline a system with defaults before
changing things
Sometimes, disabling all offloading and coalescing
can stabilize performance (perhaps exposing a bug)
Segmentation offloading affects a machine's
perspective when packet capturing its own frames
on its own interface
ethtool

Linux utility for interacting with Ethernet drivers

Support and output format varies between drivers

Shows useful statistics

View or set features (offloading, coalescing, etc)

Set Ethernet driver ring buffer sizes

Blink LEDs for NIC identification

Show link condition, speed, duplex, etc.
ethtool


Linux utility for interacting with Ethernet drivers
root@bongo:~# ethtool eth0
Settings for eth0:
Supported ports: [ MII ]
Supported link modes:
10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: external
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Link detected: yes
ethtool

Linux utility for interacting with Ethernet drivers
root@bongo:~# ethtool -i eth0
driver: forcedeth
version: 0.61
firmware-version:
Bus-info: 0000:00:14.0
root@uhmanoa:/home/whinery# ethtool eth2
Settings for eth2:
Supported ports: [ ]
Supported link modes:
Supports auto-negotiation: No
Advertised link modes: Not reported
Advertised auto-negotiation: No
Speed: Unknown! (10000)
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Current message level: 0x00000004 (4)
Link detected: yes
modinfo

Extract status and documentation from Linux
modules (like Ethernet drivers)
root@bongo:~# modinfo forcedeth
filename:
/lib/modules/2.6.24-26-rt/kernel/drivers/net/forcedeth.ko
license:
GPL
description: Reverse Engineered nForce ethernet driver
author:
Manfred Spraul <manfred@colorfullife.com>
srcversion: 9A02DCF1CF871DD11BB129E
alias:
pci:v000010DEd00000AB3sv*sd*bc*sc*i*
(...)
depends:
vermagic:
2.6.24-26-rt SMP preempt mod_unload
parm:
max_interrupt_work:forcedeth maximum events handled per interrupt (int)
parm:
optimization_mode:In throughput mode (0), every tx & rx packet
will generate an interrupt. In CPU mode (1), interrupts are controlled by a timer. (int)
parm:
poll_interval:Interval determines how frequent timer interrupt is generated by
[(time_in_micro_secs * 100) / (2^10)]. Min is 0 and Max is 65535. (int)
parm:
msi:MSI interrupts are enabled by setting to 1 and disabled by setting to 0. (int)
parm:
msix:MSIX interrupts are enabled by setting to 1 and disabled by setting to 0. (int)
parm:
dma_64bit:High DMA is enabled by setting to 1 and disabled by setting to 0. (int)
NDT


Network Diagnostic Tool written by Rich Carlson
of US Dept. of Energy Argonne Lab/Internet2
Server written in C, primary client is a Java
Applet
NPAD (Network Path and
Application Diagnosis)



By Matt Mathis and John Heffner, Pittsburgh
Supercomputing Center
Allows for analysis of network loss, throughput
not for a target rate and RTT
Attempts to guide user to solution of network
problems
Iperf

Command-line throughput test server/client

Works on Linux/Windows/Mac OS X/ etc.

Originally developed by NLANR/DAST

Performs unicast TCP and UDP tests

Performs multicast UDP tests

Allows setting TCP parameters

Original development ended in 2002

Sourceforge fork project has produced mixed
results
Nuttcp

Command-line throughput test server/client

Runs on Linux, Windows, Mac OS X etc

By Bill Fink, Rob Scott

Does everything iperf does

Also third party testing

Bidirectional traceroutes

More extensive output
Nuttcp

nuttcp -T30 -i1 -vv 192.168.222.5


30 second TCP send from this host to target
nuttcp -T30 -i1 -vv 192.168.2.1 192.168.2.2

30 second TCP send from 2.1 to 2.2

This host is neither 2.1 nor 2.2

Each of the slaves must be running “nuttcp -S”
Nuttcp (or iperf) and periodic reports
C:\bin\nuttcp>nuttcp.exe -i1 -T10 128.171.6.156
22.1875 MB /
1.00 sec = 186.0967 Mbps
7.3125 MB /
1.00 sec =
61.3394 Mbps
14.0000 MB /
1.00 sec = 117.4402 Mbps
12.8125 MB /
1.00 sec = 107.4796 Mbps
7.1250 MB /
1.00 sec =
59.7715 Mbps
6.4375 MB /
1.00 sec =
53.9991 Mbps
10.7500 MB /
1.00 sec =
90.1771 Mbps
4.8750 MB /
1.00 sec =
40.8945 Mbps
9.5625 MB /
1.00 sec =
80.2164 Mbps
1.9375 MB /
1.00 sec =
16.2529 Mbps
97.0625 MB /

10.11 sec =
80.5500 Mbps 3 %TX 6 %RX
Seeing 10 1-second samples tells you more about a test than
one 10-second average
Testing notes

Neither iperf nor nuttcp uses TCP auto-tuning
Download