Cluster network notes, Part 2 - Computer Science and Computer

Networks for Cluster Computing
NIC hardware used in clusters
Alternatives for the high-performance cluster network





Gigabit Ethernet
Myrinet
InfiniBand
Quadrics
and a few others ...
Ethernet, Fast Ethernet, Gigabit Ethernet (10, 100, 1000Mbps)

Carrier Sense Multiple Access with Collision Detection (CSMA-CD)

CSMA-CD protocol on shared links

To send: listen to wire; if not busy then send,

if busy then stay on wire listening until the wire becomes not
busy then send immediately

After send, listen long enough to be sure there was no collision,

if not, assume it was successful

If collision, wait a little while and then try to send the same frame again, up to 16
tries, then report a failure
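The send procedure above can be sketched in a few lines of Python. This is only an illustration: wire_busy, collided and wait are hypothetical callbacks standing in for the hardware, the attempt limit is passed in as a parameter, and the random binary exponential backoff is an assumption (the notes only say "wait a little while").

```python
import random

def csma_cd_send(wire_busy, collided, max_attempts, wait=lambda slots: None, rng=None):
    """Try to send one frame; return the attempt that succeeded, or None.

    wire_busy(): hypothetical callback, True while another station is sending.
    collided():  hypothetical callback, True if this attempt hit a collision.
    wait(slots): hypothetical callback that idles for `slots` slot times.
    """
    rng = rng or random.Random()
    for attempt in range(1, max_attempts + 1):
        while wire_busy():          # carrier sense: keep listening until idle
            pass
        if not collided():          # sent, listened, and saw no collision
            return attempt
        # Collision: back off for a random number of slot times, then retry.
        # Binary exponential backoff is assumed here.
        wait(rng.randrange(2 ** min(attempt, 10)))
    return None                     # give up and report a failure

# Example: the first two attempts collide, the third goes through.
outcomes = iter([True, True, False])
result = csma_cd_send(lambda: False, lambda: next(outcomes), max_attempts=5)
```

Every collision costs at least one extra pass through the loop, which is where the added latency comes from.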
Handling carrier sense and collisions takes time
This added time and latency is bad for messaging in a cluster
Three important features:
1) On a switch, the CSMA-CD protocol is only used on the link from the
node to the switch
(Diagram: nodes A, B, C, and D)
Moral: always use switches in clusters - all Gigabit Ethernet products are
switched
2) Gigabit Ethernet also has a "burst mode" (frame bursting) that bundles frames together before sending

this helps reduce the number of collisions, but hurts latency on clusters
3) Ethernets also use a spanning tree algorithm to find a path when there
is a loop in the topology -- can't take advantage of multiple paths
in switches, because the same path is always used from one node to
another
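To see why a loop gives no extra bandwidth, here is a small sketch of how a spanning tree disables a redundant link. It uses a plain breadth-first search from an assumed root switch; real Ethernet bridges run the IEEE 802.1D Spanning Tree Protocol, which elects the root and weighs link costs, so treat this as a simplification. The switch names and the triangle topology are made up for illustration.

```python
from collections import deque

def spanning_tree(links, root):
    """Return the set of links kept active by a breadth-first spanning tree."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    active, seen, queue = set(), {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors[node]):
            if nxt not in seen:             # first time we reach this switch
                seen.add(nxt)
                active.add(frozenset((node, nxt)))
                queue.append(nxt)
    return active

# Hypothetical topology: three switches wired in a triangle (one loop).
links = [("S1", "S2"), ("S2", "S3"), ("S1", "S3")]
active = spanning_tree(links, root="S1")
blocked = [link for link in links if frozenset(link) not in active]
print(blocked)   # the S2-S3 link carries no traffic
```

Traffic between S2 and S3 always goes through S1, even though a direct link exists.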
Advantages and disadvantages of Gigabit/Fast Ethernet
Advantages:

inexpensive

easily available, easy to install and use, uses well-understood
protocols and drivers
Disadvantage: performance is not as good as it could be
Several networks have been designed specifically for cluster computing.
We will talk about two of them: InfiniBand and Myrinet
First, a look at the storage and file system structure on Red Diamond:
Storage Area Network and IBRIX file system
InfiniBand is both a hardware and a software standard
What it replaces: SCSI disk subsystems, interconnected to
cluster nodes via a Fibre Channel network
>=1Gbps network, point-to-point links
… like the storage architecture in Red Diamond.
Idea - interface, where possible, directly to host memory,
same goal as with low-latency protocols - send data to a
node or a device with no OS interference and no copy
overhead
InfiniBand consists of
- Target Channel Adapters (to devices like RAID disk subsystems)
- Host Channel Adapters (to nodes, can sit on PCI bus, or
on a modified motherboard)
- point-to-point switches
Idea: replace both the I/O subsystem and the network subsystem
with InfiniBand
How clusters are being built now (draw diagram from Oracle)
Higher level cluster communication is either
- built over TCP/IP sockets, or
- a low-latency communication layer such as Sockets Direct Protocol (SDP)
Myrinet - developed in the early 1990s at Caltech for the purpose of
supporting message-passing multicomputers (i.e., clusters)

Widely used in clusters, especially those on the Top 500 list
Goal: replicate the distinctive characteristics of the interconnect of high-performance
message passing computers of the 1980’s:
1) high data rates
2) regular topologies and scalability - the cluster is easier to build
if the topology is regular, and it scales to 1000s of nodes
3) very low error rate - don't want to support resends or acks in the low
level protocol because these take too much time
4) cut through routing as opposed to store and forward, no checksum on each
node, goes out as soon as the header is decoded, blocks if line is busy
5) flow control on every communication link
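The difference between cut-through and store-and-forward (item 4) can be made concrete with a simple timing model. This is a sketch under stated assumptions: propagation delay and blocking are ignored, "hops" counts the links a packet crosses, and the packet and header sizes in the example are invented numbers, not Myrinet figures.

```python
def store_and_forward_time(packet_bits, rate_bps, hops):
    # Every switch receives the whole packet before sending it on,
    # so the full packet is serialized once per link.
    return hops * packet_bits / rate_bps

def cut_through_time(packet_bits, header_bits, rate_bps, hops):
    # Each switch forwards as soon as the header is decoded, so only the
    # header delay is paid at each of the (hops - 1) switches and the
    # packet body is serialized once.
    return (hops - 1) * header_bits / rate_bps + packet_bits / rate_bps

# Assumed example: 8000-bit packet, 64-bit header, 1 Gbps links, 3 links.
sf = store_and_forward_time(8000, 1e9, 3)
ct = cut_through_time(8000, 64, 1e9, 3)
print(f"store-and-forward: {sf * 1e6:.3f} usec, cut-through: {ct * 1e6:.3f} usec")
```

In this model the store-and-forward time grows with the whole packet per hop, while the cut-through time grows only with the header.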

point-to-point full duplex links that connect hosts and switches

flow control through the use of STOP and GO bits on every link
A buffer exists on the receiving side to hold bits. If the buffer
gets too full, a STOP message is sent. If the buffer gets too
empty, a GO message is sent.
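A sketch of the STOP/GO decision on the receiving side, assuming two thresholds (a high one for STOP and a low one for GO). The threshold values and the occupancy trace are hypothetical; they are only there to show the behavior.

```python
def stop_go_signals(fill_levels, stop_at, go_at):
    """Return the (step, signal) control messages a receiver would send.

    fill_levels: buffer occupancy seen at each step (hypothetical trace).
    stop_at, go_at: high and low thresholds, with go_at < stop_at.
    """
    signals, sender_stopped = [], False
    for step, level in enumerate(fill_levels):
        if not sender_stopped and level >= stop_at:
            signals.append((step, "STOP"))   # buffer too full
            sender_stopped = True
        elif sender_stopped and level <= go_at:
            signals.append((step, "GO"))     # buffer has drained
            sender_stopped = False
    return signals

trace = [10, 30, 50, 60, 45, 25, 15, 35, 55]
signals = stop_go_signals(trace, stop_at=50, go_at=20)
print(signals)
```

Note that the level keeps rising for a while after STOP is sent (step 3 in the trace), because bits already on the wire still arrive. The gap between the two thresholds keeps the link from toggling on every small change.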
(Try this on your own …)
The size of the buffer depends on the
transmission rate, the rate of the signal in the wire, and the
length of the wire, and can be calculated.
E.g., suppose the transmission rate is 1Gbps (it's actually higher
on the newest NICs). Suppose the cable is at most 25m long, and
suppose the signal travels in the wire at 200m/usec.
Then, the number of bits "on the wire" is
\[
\#\text{bits} = \frac{1 \times 10^{9}\ \text{bits/sec} \times 25\ \text{m}}{200\ \text{m} / 10^{-6}\ \text{sec}} = 10^{3} \times 0.125 = 125
\]
This is the number of bits that can fill a buffer on the receiving
side while a STOP message is being sent.
Times 2 for the bits that enter the wire while the STOP message is
on the wire, = 250 bits = 250/8 bytes = 31.25 bytes (about 32 bytes)
This matches Figure 3 in the paper, plus some extra bytes for
hysteresis.
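The same arithmetic as a short Python check, so that other rates and cable lengths can be tried. The function name is made up for this sketch; the numbers are the ones used above.

```python
def bits_on_wire(rate_bps, cable_m, signal_m_per_s):
    """Bits in flight on a cable: rate * (length / signal speed)."""
    return rate_bps * cable_m / signal_m_per_s

# 1 Gbps, 25 m cable, 200 m/usec = 200e6 m/sec
one_way_bits = bits_on_wire(1e9, 25, 200e6)
# Times 2: bits already on the wire, plus bits sent while STOP travels back.
slack_bytes = 2 * one_way_bits / 8
print(one_way_bits, slack_bytes)
```

Doubling the rate or the cable length doubles the slack that the buffer has to absorb.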
Other Myrinet features:

variable length header that holds the route, stripped at each switch and
routed

DMA engine and processor for implementing low-latency protocol
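The header handling in the first item can be sketched as follows. The assumption here is that the route is one byte per switch, each byte naming an output port; the switch table, port numbers and host names are invented for illustration and do not describe the real Myrinet packet format.

```python
def route_packet(switches, start, packet):
    """Follow a source route; return (final node, what is left of the packet).

    switches: hypothetical map of switch name -> {output port: next node}.
    packet:   bytes whose leading bytes are one output port per switch.
    """
    node = start
    while node in switches:
        port, packet = packet[0], packet[1:]   # strip the leading route byte
        node = switches[node][port]            # forward out of that port
    return node, packet

switches = {
    "sw1": {0: "hostA", 1: "sw2"},
    "sw2": {0: "hostB", 1: "hostC"},
}
dest, payload = route_packet(switches, "sw1", bytes([1, 1]) + b"hello")
print(dest, payload)
```

Because each switch only reads and removes the first byte, it never needs a routing table lookup for the destination, and the header shrinks as the packet moves along.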
Again, recall the network layers:
application
MPI, file system, database
transport - reliable end-to-end transfer of messages
TCP
network - point to point transfer of packets between networks
IP
datalink - point to point transfer of packets on the same network
VIA
CSMA-CD, Myrinet slack buffer
physical - point to point transfer of frames
encoding of bits, signaling
Myrinet uses a Clos network -- for more information see the talk
by Chuck Seitz at www.myri.com
Basic idea: multiple shortest paths, dynamic routing, there are still
some open research issues
Advantages of Myrinet: very high performance, high data rate,
low probability of blocking,
fault tolerant, supports GM, MPI over GM
Disadvantages of Myrinet - high cost, single supplier
Software Protocols for Cluster Networks
First, let’s recall the network layers:
application
MPI, file system, database
transport - reliable end-to-end transfer of messages
TCP – Transmission Control Protocol
network - point to point transfer of packets between networks
IP – Internet Protocol
datalink - point to point transfer of packets on the same network
VIA – Virtual Interface Architecture
CSMA-CD, Myrinet slack buffer
physical - point to point transfer of frames
encoding of bits, signaling
Virtual Interface Architecture (VIA)

The VIA standard was developed by software and network researchers in
academia, with some involvement from industry.

VIA is a software protocol specification that has come out of research
on low-latency protocols

There were several research projects in the 1990s, including:
Active Messages (NOW, Berkeley),
Fast Messages (Chien, Illinois),
Virtual Memory Mapped Communication (VMMC) (SHRIMP project, Princeton),
U-Net (von Eicken, Cornell),
BIP (Prylli, University of Lyon).
How VIA is different from TCP:

uses buffers in pinned memory on the send and receive side
o pinned memory is not allowed to be paged out to disk

uses scatter/gather lists in the specification of the data to be
sent, so that data is copied from a list of locations to the NIC
directly
(Figure: list entries L1 through L5 and the message buffer)
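A scatter/gather list can be pictured as a list of (offset, length) pairs into host memory. The sketch below is only a software picture of the idea: on real VIA hardware the NIC's DMA engine walks the list, whereas here a bytes join stands in for it. The function name and the memory layout are made up.

```python
def gather(memory, sg_list):
    """Assemble one message from (offset, length) segments of `memory`."""
    view = memoryview(memory)    # slicing a memoryview does not copy
    return b"".join(view[off:off + length] for off, length in sg_list)

# Hypothetical host memory holding three useful pieces and some filler.
memory = bytearray(b"headerXXXXbodyYYYYtail")
sg_list = [(0, 6), (10, 4), (18, 4)]
message = gather(memory, sg_list)
print(message)
```

The point of the list is that the program never has to copy the pieces into one contiguous buffer before sending.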

VIA allows for non-blocking sends and receives
Non-blocking send requires 2 steps:
1) 'post' a send, notify the NIC that data is available, the NIC gets the data,
2) some time later the program checks to see if the send is complete.
The user is responsible for not modifying the buffer before the send completes
Non-blocking receive also requires 2 steps:
1) post a receive to tell NIC where to put the data when the message arrives,
2) check later to see if the receive is complete

VIA also allows for polling for receives
o saves time when a program is doing nothing but waiting for a message
o but is costly when other activity has to happen at the same time

In VIA, if the receive is not posted when the message arrives, then it is
dropped by the NIC
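The post/check steps and the drop rule can be put together in a toy model. ToyNic and its method names are invented for this sketch and are not the VIA API; the model also assumes every message fits in the posted buffer.

```python
class ToyNic:
    """Toy model of a NIC receive queue; not the VIA API."""

    def __init__(self):
        self.posted = []      # receive buffers posted by the program
        self.completed = []   # (buffer, length) pairs the NIC has filled
        self.dropped = 0

    def post_recv(self, buffer):
        """Step 1: tell the NIC where to put the next message."""
        self.posted.append(buffer)

    def message_arrives(self, data):
        if not self.posted:
            self.dropped += 1            # no receive posted: message is dropped
            return
        buffer = self.posted.pop(0)
        buffer[:len(data)] = data        # written straight into user memory
        self.completed.append((buffer, len(data)))

    def poll_recv(self):
        """Step 2: non-blocking check; return (buffer, length) or None."""
        return self.completed.pop(0) if self.completed else None

nic = ToyNic()
nic.message_arrives(b"early")        # arrives before any receive is posted
buf = bytearray(16)
nic.post_recv(buf)
first_poll = nic.poll_recv()         # nothing has arrived yet
nic.message_arrives(b"hello")
done = nic.poll_recv()
```

After this runs, one message has been dropped, the first poll found nothing, and the second poll returns the posted buffer with "hello" in its first five bytes.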

VIA works best when specialized hardware is used
o the processing that is required when a message arrives that "pushes" it
into user memory can be done by hardware on the NIC
o hardware required is processing capability, memory, DMA engine

If specialized hardware is not available, then this can be
emulated by software on the host CPU,
o saves on the copy time, but
o work is similar to ordinary network protocols
o and the same overhead advantage is not obtained.
How VIA is the same as TCP

a connection is established
o although in VIA each end is aware of the message types that will be sent
or received

messaging is point-to-point, no easy way to do multicast or broadcast
Notice that VIA is considered too low a level for application programming
o The program has to allocate and manage real memory on the hosts.
Rather, cluster applications such as file systems or database systems can
be built using VIA
Hardware that supports VIA: Myrinet, InfiniBand
M-VIA runs over some Ethernet cards
VIA is important because it represents a standard protocol for low-latency
communication in clusters.
Myrinet is a product of a single company. Myrinet has its own low latency protocol.
InfiniBand was developed by the I/O industry with some involvement from
academic researchers. InfiniBand has also developed its own low latency protocols.