Communication Networks of Parallel & Distributed Systems:
Low Latency & High Bandwidth Come to Clusters & Grids
Dong Lu
Dept. of Computer Science
Northwestern University
June 18, 2002
http://www.cs.northwestern.edu/~donglu
Introduction
Communication networks play a vital
role in parallel & distributed systems.
Modern communication networks
support low latency & high bandwidth
communication services.
How are low latency & high bandwidth achieved?
DMA-based zero copy and OS bypass, which can give applications direct access to the network interface card (NIC).
Communication protocol processing is offloaded to a helper processor on the NIC or channel adapter. Often, TCP/IP is not used.
Switched networks, or hypercubes with high-speed routers that support cut-through routing.
TREND
These technologies are migrating from inside parallel systems to clusters.
Example: the InfiniBand Architecture.
Lower latency & higher bandwidth communication networks are becoming available to Grid computing.
Example: high-speed optical networking is becoming dominant in the Internet.
Outline of this talk
Communication networks in parallel systems.
Communication networks in clusters and System Area Networks.
New and improved communication network protocols & technologies in Grid computing.
Trend: low latency & high bandwidth come to clusters & the Grid.
Communication networks in parallel systems
IBM SP2
SGI Origin 2000
IBM SP2
Any-to-any packet-switched, multistage network. Excellent scalability.
The Micro Channel adapter has an onboard microprocessor that offloads some of the protocol processing.
The adapter can move messages to and from processor memory directly via direct memory access (DMA), thus supporting zero-copy message passing.
64-node IBM SP2 network topology (figure)
SGI Origin 2000
Origin is a distributed shared memory, cc-NUMA multiprocessor. cc-NUMA stands for cache-coherent non-uniform memory access.
Hypercube network connected by SPIDER routers, which support wormhole "cut-through" routing: a router starts forwarding a packet before it has received the whole packet, in contrast to "store and forward".
Low-latency remote memory access is supported, and the ratio of remote-memory to local-memory latency is very low.
128-processor SGI Origin 2000 network (figure)
Communication Networks in Clusters
Gigabit Ethernet
Myrinet
Virtual Interface Architecture
InfiniBand Architecture
Gigabit Ethernet
Can be switch based --- higher bandwidth and smaller collision domains.
Jumbo frames are supported, up to 9 KB (see the MTU sketch below).
Some NICs are programmable, with an onboard processor.
Zero-copy, OS-bypass message passing can be supported with a programmable NIC and DMA, which makes it a low-cost, low-latency, high-bandwidth architecture!
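Jumbo frames are enabled by raising the interface MTU. As a hedged illustration, the sketch below requests a 9000-byte MTU via the standard Linux SIOCSIFMTU ioctl; the interface name "eth0" and a jumbo-capable NIC/driver are my assumptions, not something the slide specifies.

```c
/* Sketch: request a 9000-byte MTU (jumbo frames) on "eth0".
 * Assumes Linux, a jumbo-capable NIC/driver, and root privileges. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket works for the ioctl */
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_mtu = 9000;                      /* jumbo-frame payload size */
    if (ioctl(s, SIOCSIFMTU, &ifr) < 0)
        perror("SIOCSIFMTU");
    close(s);
    return 0;
}
```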
High-performance Gigabit Ethernet architecture
EMP system, which appeared at HPDC 2001 (figure)
Myrinet
Developed from the technology of a parallel system --- the Intel Paragon.
The first commercial LAN technology able to provide zero-copy message passing and to offload protocol processing to the interface processor.
Switch based; cut-through routing is supported.
Myrinet
LANai is the host interface; it has a processor and a DMA engine onboard.
High bandwidth & low latency, but very expensive and not very stable.
Virtual Interface Architecture
Supports zero copy and OS bypass to provide low-latency, high-bandwidth communication service. Message send/receive operations and Remote DMA are supported.
To a user process, VIA provides direct access to the network interface in a fully protected fashion.
Remote DMA (figure)
Virtual Interface Architecture
Each process owns a VI, and each VI consists of one send queue and one receive queue.
Memory regions are registered before data transfer, and connections are set up by Open/Connect operations.
After the Open/Connect and memory registration, user data can be transferred without involving the operating system.
Memory protection is provided by a protection-tag mechanism. Protection tags are associated with VIs and memory regions.
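A minimal sketch of the structures this implies; the field names are illustrative assumptions, not the actual VIA/VIPL definitions. Each VI pairs a send queue and a receive queue of transfer descriptors and carries the protection tag that must match the registered memory region's tag.

```c
/* Illustrative VIA-style structures; names are assumptions, not VIPL. */
#include <stdint.h>

typedef struct descriptor {        /* one posted transfer                */
    void    *addr;                 /* address inside a registered region */
    uint32_t length;               /* bytes to send or receive           */
    uint32_t mem_handle;           /* handle from memory registration    */
    struct descriptor *next;       /* next descriptor in the work queue  */
} descriptor_t;

typedef struct virtual_interface { /* one VI, owned by one process       */
    descriptor_t *send_queue;      /* posted send descriptors            */
    descriptor_t *recv_queue;      /* posted receive descriptors         */
    uint32_t      protection_tag;  /* must match the memory region's tag */
} vi_t;
```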
InfiniBand Architecture
Encompasses a system-area network for connecting multiple independent processor and I/O platforms.
Defines the communication and management infrastructure supporting both I/O and inter-processor communication.
InfiniBand Architecture
Components: host channel adapters (HCAs), target channel adapters (TCAs), and fabric switches.
Channel adapters offload the protocol processing load from the CPU.
DMA/RDMA is supported.
Zero-copy data transfers without kernel involvement; hardware is used to provide highly reliable, fault-tolerant communication.
Communication Networks in the Grid
IPv6
High-performance TCP Reno
TCP tuning for distributed applications on the WAN
TCP Vegas vs. TCP Reno
Random Early Detection gateways
Aggressive TCP Reno: what I have done in the Linux kernel
IPv6
Expanded Addressing Capabilities: 128 bits vs. 32 bits in IPv4.
Flow Labeling Capability: good news for real-time and high-performance applications.
Header Format Simplification.
Improved Support for Extensions and Options.
Authentication and Privacy Capabilities.
IPv6 header

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |           Flow Label                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Payload Length        |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Source Address                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                       Destination Address                     +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
High-performance TCP Reno (RFC 1323)
TCP Extensions for High Performance.
TCP performance depends not upon the transfer rate itself, but rather upon the "bandwidth * delay product", which is growing quickly and is now much bigger than 64 KB.
The TCP header uses a 16-bit field to report the receive window size to the sender. Therefore, the largest window that can be used is 2^16 = 64 KB.
High-performance TCP Reno (RFC 1323)
A TCP option, "Window Scale", is adopted to allow windows larger than 2^16 bytes (see the sketch below).
However, a high transfer rate alone can threaten TCP reliability by violating the assumptions behind the TCP mechanisms for duplicate detection and sequencing. That is, any sequence number may eventually be reused, and errors may result from an accidental reuse of TCP sequence numbers in data segments.
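The window-scale option shifts the advertised 16-bit window left by a negotiated factor. A small sketch of the arithmetic (my illustration; the 4 MB target window is an assumed bandwidth * delay product):

```c
/* Sketch: pick the smallest RFC 1323 window-scale shift that covers a
 * target window. RFC 1323 caps the shift at 14. */
#include <stdio.h>

int main(void) {
    unsigned long target = 4UL * 1024 * 1024;   /* assumed BDP: 4 MB */
    int wscale = 0;

    while ((65535UL << wscale) < target && wscale < 14)
        wscale++;                               /* need a bigger shift */

    printf("window scale = %d, max window = %lu bytes\n",
           wscale, 65535UL << wscale);
    return 0;
}
```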
High-performance TCP Reno (RFC 1323)
The PAWS (Protect Against Wrapped Sequence numbers) mechanism is proposed to avoid this potential problem.
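A minimal sketch of the PAWS test, assuming the usual RFC 1323 formulation: a segment whose timestamp is older, in 32-bit serial-number arithmetic, than the most recent in-window timestamp is discarded.

```c
/* Sketch of the PAWS check: reject a segment whose timestamp value
 * (seg_tsval) is older than the last one accepted (ts_recent).
 * The signed 32-bit difference handles timestamp wraparound. */
#include <stdint.h>

static int paws_reject(uint32_t seg_tsval, uint32_t ts_recent) {
    return (int32_t)(seg_tsval - ts_recent) < 0;   /* 1 = drop segment */
}
```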
TCP tuning for distributed applications
The congestion window size is used by TCP to control how many packets should be sent into the network; the send & receive buffer sizes, together with the network congestion status, determine the congestion window size.
Many operating systems use a default TCP buffer size of either 24 or 32 KB (Linux's default is only 8 KB).
TCP tuning for distributed applications
Suppose the slowest hop from site A to site B is 100 Mbps (about 12 MB/s), and the typical latency across the US is about 25 ms. 12 MB/s * 25 ms = 300 KB. If the default 24 KB is used as the TCP buffer, then 24/300 = 8%: only a small portion of the bandwidth is used!
Buffer size = 2 * bandwidth * delay
or
Buffer size = bandwidth * RTT
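Applying that formula, a hedged sketch of how an application could size its socket buffers for the example path (100 Mbps, 25 ms one-way delay); the buffers typically must be set before the connection is established to take full effect.

```c
/* Sketch: size TCP buffers to 2 * bandwidth * delay for the example
 * path above (100 Mb/s, 25 ms one-way). Set before connecting. */
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    /* 2 * (100e6 bits/s / 8) * 0.025 s = 625,000 bytes */
    int buf = 2 * (100000000 / 8) / 1000 * 25;

    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
        perror("SO_RCVBUF");
    printf("requested %d-byte buffers\n", buf);
    return 0;
}
```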
TCP Vegas vs. TCP Reno
Researchers have shown that aggregate network traffic can be characterized as self-similar, or fractal, which is usually a bad property for the performance of the Internet. Several researchers claim that the primary source of this self-similarity is TCP Reno's "additive increase, multiplicative decrease" (AIMD) congestion-control mechanism.
Instead of reacting to congestion as TCP Reno does, TCP Vegas tries to avoid congestion.
TCP Vegas
Vegas has two threshold values, A and B; the default values are A=1, B=3. ESR is the expected sending rate and ASR is the actual sending rate.
Let diff = ESR - ASR.
If diff < A, increase the congestion window linearly during the next round-trip time.
If diff > B, decrease the window linearly during the next RTT.
Otherwise, don't change the congestion window size.
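A minimal sketch of that decision, using the slide's definitions; variable names are illustrative, and ESR/ASR are computed from the base (minimum) RTT and the latest RTT measurement respectively:

```c
/* Sketch of the Vegas adjustment above. cwnd is in segments; ESR uses
 * the minimum (base) RTT, ASR the most recent RTT measurement. */
static double vegas_adjust(double cwnd, double base_rtt,
                           double last_rtt, double A, double B) {
    double esr  = cwnd / base_rtt;   /* expected sending rate */
    double asr  = cwnd / last_rtt;   /* actual sending rate   */
    double diff = esr - asr;

    if (diff < A)
        return cwnd + 1.0;           /* linear increase over the next RTT */
    if (diff > B)
        return cwnd - 1.0;           /* linear decrease over the next RTT */
    return cwnd;                     /* A <= diff <= B: leave it alone    */
}
```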
TCP Vegas vs. TCP Reno
Some researchers show that, with proper values for A and B, Vegas behaves better than Reno in the Grid computing environment.
The problem with Vegas is that it has not been verified on a large-scale network, and the optimal values of A and B are not easy to determine.
Random Early Detection gateways
RED gateways maintain a weighted average of the queue length, minimum and maximum thresholds (REDmin, REDmax), and an early drop rate P. Packets are then queued as follows (sketched in code below):
If (queue length < REDmin), queue all packets.
If (REDmin < queue length < REDmax), drop packets with probability P.
If (queue length > REDmax), drop all packets.
RED can increase fairness and overall network performance, so it is widely deployed in routers around the world. Since the Grid is built on top of the Internet, the effect of RED routers should be considered.
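A hedged sketch of that decision rule exactly as stated on the slide (the original RED paper scales the drop probability between the two thresholds; the fixed rate P here follows the slide's simpler description):

```c
/* Sketch of the RED rule above: avg is the weighted average queue
 * length; returns 1 to enqueue the packet, 0 to drop it. */
#include <stdlib.h>

static int red_enqueue(double avg, double red_min, double red_max,
                       double p_drop) {
    if (avg < red_min)
        return 1;                                  /* queue all packets */
    if (avg > red_max)
        return 0;                                  /* drop all packets  */
    return ((double)rand() / RAND_MAX) >= p_drop;  /* drop w/ prob. P   */
}
```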
Aggressive TCP Reno: What I have done
Linux kernel modification of TCP congestion control.
Some studies have shown that TCP Reno congestion control is too conservative, and thus the bandwidth is not fully utilized.
So, make it more aggressive. How?
Aggressive TCP Reno: What I have done
The window size starts from more than one packet (for example, 20) and increases more quickly during slow start.
Congestion avoidance is done the same way as in standard Reno.
Whenever there is a packet loss, don't drop to one packet; instead, drop to 80% of the window size, and the new threshold becomes 90% of the current window size (see the sketch below).
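A minimal sketch of that loss response with illustrative names; the actual change lives inside the Linux kernel's TCP code, which is not reproduced here:

```c
/* Sketch of the aggressive loss response above: instead of collapsing
 * to one segment, back off to 80% of the window and set the slow-start
 * threshold to 90% of the current window. Names are illustrative. */
#define INITIAL_CWND 20u   /* slow start begins well above one packet */

static void on_loss(unsigned int *cwnd, unsigned int *ssthresh) {
    *ssthresh = (*cwnd * 9u) / 10u;   /* new threshold: 90% of cwnd */
    *cwnd     = (*cwnd * 8u) / 10u;   /* new window:    80% of cwnd */
}
```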
Aggressive TCP Reno: What I have done
Built into the Linux kernel.
(Figure: TCP Reno vs. aggressive TCP)
Some performance gains for Virtualized Audio

Frequency              Run time, modified kernel   Run time,
(computational load)   (aggressive TCP)            unmodified kernel
1000                   169.9, 166.5, 169.3         182.9, 185.3, 183.1
1600                   1035.5, 1032.0, 1035.8      1171.7, 1161.8, 1159.7
2000                   2187.9, 2134.8, 2130.2      2207.2, 2206.4, 2207.7
Aggressive TCP Reno
Through the kernel modification, some
performance gains are achieved without
modifying the application code.
But the results are still not very satisfactory
and not very stable. Why?
Aggressive TCP Reno
That can be due to three reasons.
First, virtualized audio is very computationally intensive, so enhancing the communication performance even more drastically will not change the overall performance much (most of the time was spent on computing).
Second, the bandwidth * delay product on the cluster is small, which implies that this technique may be more effective on the WAN (with a much bigger RTT).
Third, the effect of fast retransmit and fast recovery is not considered here, but it turns out to be important.
Conclusion
Some technologies used by parallel systems are moving into clusters, making low latency and high bandwidth available there.
With the development of the Internet, new & improved protocols are being proposed and tested to provide lower latency & higher bandwidth to Grid computing.
A new proposal for aggressive TCP has been implemented and tested, and some performance gains have been achieved. More work is needed to make it more effective and stable.
Questions?