Networks for Cluster Computing

NIC hardware used in clusters

Alternatives for the high-performance cluster network:
- Gigabit Ethernet
- Myrinet
- InfiniBand
- Quadrics
- and a few others ...

Ethernet, Fast Ethernet, Gigabit Ethernet (10, 100, 1000 Mbps)

Carrier Sense Multiple Access with Collision Detection (CSMA-CD)

CSMA-CD protocol on shared links:
- To send: listen to the wire. If it is not busy, send. If it is busy, stay on the wire listening until it becomes not busy, then send immediately.
- After sending, listen long enough to be sure there was no collision. If there was none, assume the send was successful.
- If there was a collision, wait a little while (a random backoff) and then try to send the same frame again, up to 16 tries, then report a failure.

Handling the carrier sense and collisions takes time.
The added time and latency are bad for messaging in a cluster.

Three important features:
1) On a switch, the CSMA-CD protocol is only used on the link from the node to the switch.
   (diagram: nodes A B C D)
   Moral: always use switches in clusters - all Gigabit Ethernet products are switched.
2) Gigabit Ethernet also has a "bulk mode" that bundles frames together before sending. This helps reduce the number of collisions, but hurts latency on clusters.
3) Ethernets also use a spanning tree algorithm to find a path when there is a loop in the topology. They can't take advantage of multiple paths in switches, because the same path is always used from one node to another.

Advantages and disadvantages of Gigabit/Fast Ethernet
Advantages: inexpensive, easily available, easy to install and use, uses well-understood protocols and drivers
Disadvantage: performance is not as good as it could be

Several networks have been designed specifically for cluster computing.
We will talk about two of them: InfiniBand and Myrinet.

First, a look at the storage and file system structure on Red Diamond: Storage Area Network and IBRIX file system.

InfiniBand is both a hardware and a software standard.

What it replaces: SCSI disk subsystems, interconnected to cluster nodes via a Fibre Channel network (>= 1 Gbps, point-to-point links) ... like the storage architecture with Red Diamond.

Idea - interface, where possible, directly to host memory. This is the same goal as with low-latency protocols: send data to a node or a device with no OS interference and no copy overhead.

InfiniBand consists of
- Target Channel Adapters (to devices like RAID disk subsystems)
- Host Channel Adapters (to nodes, can sit on the PCI bus, or on a modified motherboard)
- point-to-point switches

Idea: replace both the I/O subsystem and the network subsystem with InfiniBand.

How clusters are being built now (draw diagram from Oracle)

Higher level cluster communication is built over either
- TCP/IP sockets, or
- a low-latency communication layer such as Sockets Direct Protocol (SDP)

Myrinet - developed in the early 1990's at Caltech for the purpose of supporting message-passing multicomputers (i.e., clusters)

Widely used in clusters, especially those on the Top 500.

Goal: replicate the distinctive characteristics of the interconnect of high-performance message-passing computers of the 1980's:
1) high data rates
2) regular topologies and scalability - the cluster is easier to build if the topology is regular, and it scales to 1000's of nodes
3) very low error rate - don't want to support resends or acks in the low-level protocol because these take too much time
4) cut-through routing as opposed to store and forward - no checksum on each node, the packet goes out as soon as the header is decoded, and blocks if the line is busy
5) flow control on every communication link

Point-to-point full duplex links connect hosts and switches.
Flow control is through the use of STOP and GO bits on every link.
A buffer exists on the
receiving side to hold bits. If the buffer gets too full, a STOP message is sent. If the buffer empties far enough, a GO message is sent.

(Try this on your own ...) The size of the buffer depends on the transmission rate, the rate of the signal in the wire, and the length of the wire, and can be calculated.

E.g., suppose the transmission rate is 1 Gbps (it's actually higher on the newest NICs). Suppose the cable is at most 25 m long, and suppose the signal travels in the wire at 200 m/usec. Then the number of bits "on the wire" is

  #bits = (1*10^9 bits/sec * 25 m) / (200 m / 10^(-6) sec) = 10^3 * 0.125 = 125 bits

This is the number of bits that can fill a buffer on the receiving side while a STOP message is being sent.
Times 2, for the bits that enter the wire while the STOP message is on the wire:

  2 * 125 bits = 250 bits = 250/8 bytes = 31.25 bytes (about 32 bytes)

This matches Figure 3 in the paper, plus some extra bytes for hysteresis.

Other Myrinet features:
- variable length header that holds the route, stripped at each switch and routed
- DMA engine and processor for implementing a low-latency protocol

Again, recall the network layers:
- application: MPI, file system, database
- transport - reliable end-to-end transfer of messages: TCP
- network - point to point transfer of packets between networks: IP
- datalink - point to point transfer of packets on the same network: VIA, CSMA-CD, Myrinet slack buffer
- physical - point to point transfer of frames: encoding of bits, signalling

Myrinet uses a Clos network. For more information see the talk by Chuck Seitz at www.myri.com
Basic idea: multiple shortest paths, dynamic routing. There are still some open research issues.

Advantages of Myrinet: very high performance, high data rate, low probability of blocking, fault tolerant, supports GM, MPI over GM
Disadvantages of Myrinet: high cost, single supplier

Software Protocols for Cluster Networks

First, let's recall the network layers:
- application: MPI, file system, database
- transport - reliable end-to-end transfer of messages: TCP (Transmission Control Protocol)
- network - point to point transfer of packets between networks: IP (Internet Protocol)
- datalink - point to point transfer of packets on the same network: VIA (Virtual Interface Architecture), CSMA-CD, Myrinet slack buffer
- physical - point to point transfer of frames: encoding of bits, signaling

Virtual Interface Architecture (VIA)

The VIA standard was developed by software and network researchers in academia, with some involvement from industry.
VIA is a software protocol specification that has come out of research on low-latency protocols.

There were several research projects from the 1990's, including:
- Active Messages (NOW, Berkeley)
- Fast Messages (Chien, Illinois)
- Virtual Memory Mapped Communication (VMMC) (Princeton SHRIMP project)
- U-Net (von Eicken, Cornell)
- BIP (Prylli, University of Lyon)

How VIA is different from TCP:
- uses buffers in pinned memory on the send and receive side
  o pinned memory is not allowed to be paged out to disk
- uses scatter/gather lists in the specification of the data to be sent, so that data is copied from a list of locations to the NIC directly
  (diagram: list entries L1 L2 L3 L4 L5 and the message buffer)
- VIA allows for non-blocking sends and receives

Non-blocking send requires 2 steps:
  1) 'post' a send: notify the NIC that data is available, and the NIC gets the data,
  2) some time later, the program checks to see if the send is complete.
The user is responsible for not modifying the buffer before the send completes.

Non-blocking receive also requires 2 steps:
  1) post a receive to tell the NIC where to put the data when the message arrives,
  2) check later to see if the receive is complete.

VIA also allows for polling for receives
  o saves time when a program is doing nothing but waiting for a message
  o but is costly when other activity has to happen at the same time

In VIA, if the receive is not posted when the message arrives, then the message is dropped by the NIC.

VIA works best when specialized hardware is used
  o the processing required when a message arrives, which "pushes" it into user memory, can be done by hardware on the NIC
  o the hardware required is processing capability, memory, and a DMA engine

If specialized hardware is not available, then this can be emulated by software on the host CPU. This saves on the copy time, but the work is similar to ordinary network protocols and the same overhead advantage is not obtained.

How VIA is the same as TCP:
- a connection is established
  o although, in VIA, each end is aware of the message types that will be sent or received
- messaging is point-to-point, with no easy way to do multicast or broadcast

Notice that VIA is considered too low a level for application programming
  o the program has to allocate and manage real memory on the hosts
Rather, cluster applications such as file systems or database systems can be built using VIA.

Hardware that supports VIA:
- Myrinet
- InfiniBand
- M-VIA runs over some Ethernet cards

VIA is important because it represents a standard protocol for low-latency communication in clusters.

Myrinet is a product of a single company. Myrinet has its own low-latency protocol.

InfiniBand was developed by the I/O industry with some involvement from academic researchers. InfiniBand has also developed its own low-latency protocols.
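
To make the VIA post-then-check pattern described earlier more concrete, here is a minimal Python sketch. It is a toy simulation only, not the VIA interface: the names ToyNic, Descriptor, post_send, post_recv and progress are hypothetical, bytearrays stand in for pinned buffers, and progress() stands in for work that the NIC hardware would do on its own. It shows a scatter/gather send, a receive posted in advance, polling for completion, and a message being dropped because no receive was posted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    """One posted send or receive: a scatter/gather list and a completion flag."""
    segments: list      # bytearrays standing in for pinned (non-pageable) buffers
    done: bool = False  # set by the toy NIC when the operation completes
    length: int = 0     # bytes actually sent or received


class ToyNic:
    """Hypothetical model of a VIA-style NIC (assumption, not a real API)."""

    def __init__(self):
        self.peer = None           # the NIC at the other end of the connection
        self.send_queue = deque()  # posted sends not yet transmitted
        self.recv_queue = deque()  # posted receives waiting for a message
        self.dropped = 0           # messages that arrived with no receive posted

    def post_send(self, segments):
        """Step 1 of a non-blocking send: tell the NIC the data is available."""
        desc = Descriptor(segments)
        self.send_queue.append(desc)
        return desc

    def post_recv(self, segments):
        """Step 1 of a non-blocking receive: tell the NIC where to put the data."""
        desc = Descriptor(segments)
        self.recv_queue.append(desc)
        return desc

    def progress(self):
        """Simulate the NIC transmitting one posted send, if there is one."""
        if not self.send_queue:
            return
        desc = self.send_queue.popleft()
        # Gather: read the message from the list of buffer locations.
        payload = b"".join(bytes(seg) for seg in desc.segments)
        self.peer._deliver(payload)
        desc.length = len(payload)
        desc.done = True  # simplification: the send completes even if dropped

    def _deliver(self, payload):
        """Simulate message arrival at the receiving NIC."""
        if not self.recv_queue:
            self.dropped += 1  # no receive posted: the message is dropped
            return
        desc = self.recv_queue.popleft()
        offset = 0
        # Scatter: fill the posted buffers in order.
        for seg in desc.segments:
            chunk = payload[offset:offset + len(seg)]
            seg[:len(chunk)] = chunk
            offset += len(chunk)
        desc.length = offset
        desc.done = True


def demo():
    a, b = ToyNic(), ToyNic()
    a.peer, b.peer = b, a

    # No receive is posted on b, so this message is dropped.
    a.post_send([bytearray(b"lost")])
    a.progress()

    # Post the receive first, then post the send.
    header, body = bytearray(4), bytearray(8)
    recv = b.post_recv([header, body])
    send = a.post_send([bytearray(b"HDR:"), bytearray(b"payload!")])
    before = (send.done, recv.done)  # nothing has completed yet

    # Step 2: check later for completion (here, by polling).
    while not send.done:
        a.progress()
    return before, bytes(header), bytes(body), recv.length, b.dropped


if __name__ == "__main__":
    print(demo())
```

In this sketch the sender must leave its buffers untouched between post_send and the moment send.done becomes true, which mirrors the rule about not modifying the buffer before the send completes. The polling loop is the cheap option when the program has nothing else to do; a real program with other work would check the flag between other tasks instead.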