Communication Networks of Parallel & Distributed Systems: Low Latency & High Bandwidth comes to clusters & grids
Dong Lu
Dept. of Computer Science, Northwestern University
June 18, 2002
http://www.cs.northwestern.edu/~donglu

Introduction
Communication networks play a vital role in parallel & distributed systems.
Modern communication networks support low latency & high bandwidth communication services.

How is low latency & high bandwidth achieved?
DMA-based zero copy and OS bypass, which give applications direct access to the Network Interface Card (NIC).
Communication protocol processing is offloaded to a helper processor on the NIC or channel adapter.
Often, TCP/IP is not used.
Switched networks or hypercubes with high speed routers that support cut-through routing.

TREND
These technologies are migrating from inside parallel systems to clusters. Example: InfiniBand Architecture.
Lower latency & higher bandwidth communication networks are becoming available to Grid computing. Example: high speed optical networking is becoming dominant in the Internet.

Outline of this talk
Communication networks in parallel systems.
Communication networks in Clusters and System Area Networks.
New and improved communication network protocols & technologies in Grid computing.
Trend: Low latency & High bandwidth comes to clusters & the Grid.

Communication networks in parallel systems
IBM SP2
SGI Origin 2000

IBM SP2
Any-to-any packet switched, multistage network.
Excellent scalability.
The Micro Channel adapter has an onboard microprocessor that offloads some of the protocol processing load.
The adapter can move messages to and from processor memory directly via direct memory access (DMA), and thus supports zero-copy message passing.

[Figure: 64-node IBM SP2 network topology]

SGI Origin 2000
Origin is a distributed shared memory, ccNUMA multiprocessor. ccNUMA stands for cache-coherent non-uniform memory access.
Hypercube network connected by SPIDER routers, which support wormhole "cut-through" routing: a router starts forwarding a packet before the whole packet has arrived, unlike "store and forward".
Low latency remote memory access is supported, and the ratio of remote to local memory latency is very low.

[Figure: 128-processor SGI Origin 2000 network]

Communication Networks in Clusters
Gigabit Ethernet
Myrinet
Virtual Interface Architecture
InfiniBand Architecture

Gigabit Ethernet
Can be switch based: higher bandwidth and a smaller collision domain.
Jumbo frames of up to 9K are supported.
Some NICs are programmable and have an onboard processor.
Zero-copy, OS-bypass message passing can be supported with a programmable NIC and DMA, which makes it a low-cost, low latency and high bandwidth architecture!

High performance GigaE Architecture
[Figure: EMP system, appeared in HPDC 2001]

Myrinet
Developed from the technology of a parallel system, the Intel Paragon.
The first commercial LAN technology able to provide zero-copy message passing and to offload protocol processing to the interface processor.
Switch based; cut-through routing is supported.

Myrinet
LANai is the host interface, which has a processor and a DMA engine onboard.
High bandwidth & low latency, but very expensive and not very stable.

[Figure: Myrinet]

Virtual Interface Architecture
Supports zero copy and OS bypass to provide a low latency, high bandwidth communication service.
Message send/receive operations and Remote DMA are supported.
To a user process, VIA provides direct access to the network interface in a fully protected fashion.

[Figure: Remote DMA]

Virtual Interface Architecture
Each process owns a VI, and each VI consists of a send queue and a receive queue.
Memory regions are registered before data transfer, during the Open/Connect operations.
After Open/Connect and memory registration, user data can be transferred without involving the operating system.
Memory protection is provided by a protection tag mechanism. Protection tags are associated with VIs and memory regions.

[Figure: Virtual Interface Architecture]

[Figure: InfiniBand Architecture]

InfiniBand Architecture
Encompasses a system-area network for connecting multiple independent processor and I/O platforms.
Defines the communication and management infrastructure supporting both I/O and inter-processor communications.

InfiniBand Architecture
Components: a host channel adapter (HCA), a target channel adapter (TCA) and a fabric switch.
Channel adapters offload protocol processing from the CPU.
DMA/RDMA is supported.
Zero-copy data transfers proceed without kernel involvement, and hardware provides highly reliable, fault-tolerant communication.

Communication Networks in the GRID
IPv6
High performance TCP Reno
TCP tuning for distributed applications on the WAN
TCP Vegas vs. TCP Reno
Random Early Detection gateways
Aggressive TCP Reno: What I have done on Linux kernel

IPv6
Expanded addressing capabilities: 128 bits vs. 32 bits in IPv4.
Flow labeling capability: good news for real time applications and high performance applications.
Header format simplification.
Improved support for extensions and options.
Authentication and privacy capabilities.

IPv6 header
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |              Flow Label               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Payload Length         |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Source Address (128 bits)                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Destination Address (128 bits)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

High performance TCP Reno (RFC1323)
TCP extensions for high performance.
TCP performance depends not upon the transfer rate itself, but rather upon the "bandwidth*delay product", which is growing quickly and is now much bigger than 65K.
The TCP header uses a 16 bit field to report the receive window size to the sender. Therefore, the largest window that can be used is 2**16 bytes = 64 KB (about 65K).

High performance TCP Reno (RFC1323)
A TCP option, "Window Scale", is adopted to allow windows larger than 2**16 bytes.
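As a rough illustration of the window arithmetic on these slides, the Python sketch below computes a bandwidth*delay product and the smallest Window Scale shift whose scaled 16 bit window covers it. The function names and the example path (1 Gbit/s, 50 ms round trip) are assumptions made for illustration and do not come from the talk; the upper bound of 14 on the shift is the limit set in RFC 1323.

```python
MAX_UNSCALED_WINDOW = 2**16 - 1   # largest value the 16 bit window field can hold
MAX_SHIFT = 14                    # upper bound on the scale shift in RFC 1323


def bdp_bytes(bandwidth_bits_per_sec: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to keep the path full for one round trip."""
    # bits/sec * ms / 1000 gives bits per round trip; / 8 converts to bytes.
    return bandwidth_bits_per_sec * rtt_ms // 8000


def window_scale_shift(needed_window_bytes: int) -> int:
    """Smallest shift such that (65535 << shift) covers the needed window."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= needed_window_bytes:
            return shift
    raise ValueError("window is larger than Window Scale can express")


if __name__ == "__main__":
    # Hypothetical path: 1 Gbit/s bottleneck with a 50 ms round trip time.
    needed = bdp_bytes(1_000_000_000, 50)
    print("bandwidth*delay product:", needed, "bytes")
    print("window scale shift needed:", window_scale_shift(needed))
```

A shift of 0 means the unscaled 16 bit window is already large enough; any path whose product exceeds 65,535 bytes needs the option.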
However, a high transfer rate alone can threaten TCP reliability by violating the assumptions behind the TCP mechanism for duplicate detection and sequencing. Any sequence number may eventually be reused, so errors may result from an accidental reuse of TCP sequence numbers in data segments.

High performance TCP Reno (RFC1323)
The PAWS (Protect Against Wrapped Sequence numbers) mechanism is proposed to avoid this potential problem.

TCP tuning for distributed applications
TCP uses the congestion window size to control how many packets are sent into the network; the send & receive buffer sizes, together with the network congestion status, determine the congestion window size.
Many operating systems use a default TCP buffer size of either 24 or 32 KB (Linux is only 8 KB).

TCP tuning for distributed applications
Suppose the slowest hop from site A to site B is 100 Mbps (about 12 MB/sec), and typical one-way latency across the US is about 25 ms. Then 12 MB/sec * 25 ms = 300 KB.
If the default 24 KB is used as the TCP buffer, then 24/300 = 8% of the one-way product, or 24/600 = 4% of the round-trip product. So only a small portion of the bandwidth is used!
Buffer size = 2 * bandwidth * delay
or
Buffer size = bandwidth * RTT

TCP Vegas vs. TCP Reno
Researchers have shown that aggregate network traffic can be characterized as self-similar or fractal, which is usually a bad property for Internet performance.
Several researchers claim that the primary source of self-similarity is TCP Reno, via its "additive increase, multiplicative decrease" (AIMD) congestion-control mechanism.
Instead of reacting to congestion as TCP Reno does, TCP Vegas tries to avoid congestion.

TCP Vegas
Vegas has two threshold values, A and B; the default values are A=1, B=3.
ESR is the expected sending rate and ASR is the actual sending rate. Let diff = ESR - ASR.
If diff < A, increase the congestion window linearly during the next round trip time (RTT).
If diff > B, decrease the window linearly during the next RTT.
Otherwise, don't change the congestion window size.

TCP Vegas vs. TCP Reno
Some researchers show that, with proper values for A and B, Vegas behaves better than Reno in the Grid computing environment.
The problem with Vegas is that it has not been verified on a large-scale network, and the optimal values of A and B are not easy to determine.

Random Early Detection gateways
RED gateways maintain a weighted average of the queue length, a minimum and maximum threshold (REDmin, REDmax), and an early drop rate P. Packets are then queued as follows:
If (average queue length < REDmin), queue all packets.
If (average queue length > REDmin, and average queue length < REDmax), drop packets with probability P.
If (average queue length > REDmax), drop all packets.
RED can increase fairness and overall network performance, so it is widely deployed in routers.
Since the GRID is built on the Internet, the effect of RED routers should be considered.

Aggressive TCP Reno: What I have done
Linux kernel modification of TCP congestion control.
Some studies have shown that TCP Reno congestion control is too conservative, so the bandwidth is not fully utilized.
So, make it more aggressive. How?

Aggressive TCP Reno: What I have done
The window size starts from more than one packet (for example 20) and increases more quickly during "slow start".
"Congestion avoidance" is the same as in standard Reno.
Whenever there is a packet loss, don't drop to one packet; instead, drop to 80% of the window size, and set the new threshold to 90% of the current window size.
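The window rules above can be sketched as a toy per-RTT model in Python. This is not the kernel patch from the talk, only an illustration under stated assumptions: the class and function names are hypothetical, the slow start multiplier of 3 and the initial threshold of 64 packets are invented for the example (the slide gives neither), and the Reno baseline is simplified to "reset to one packet and halve the threshold on loss".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CongestionWindow:
    """Toy per-RTT model of a TCP congestion window, measured in packets."""
    cwnd: float                    # current congestion window
    ssthresh: float                # slow start threshold
    growth: float                  # slow start multiplier per RTT
    keep_on_loss: Optional[float]  # fraction of cwnd kept after a loss; None = reset to 1
    thresh_on_loss: float          # fraction of cwnd used as the new threshold

    def on_rtt_without_loss(self) -> None:
        if self.cwnd < self.ssthresh:
            # Slow start: multiplicative growth, capped at the threshold.
            self.cwnd = min(self.cwnd * self.growth, self.ssthresh)
        else:
            # Congestion avoidance: one extra packet per round trip.
            self.cwnd += 1

    def on_loss(self) -> None:
        old = self.cwnd
        self.ssthresh = max(old * self.thresh_on_loss, 2)
        self.cwnd = 1 if self.keep_on_loss is None else max(old * self.keep_on_loss, 1)


def reno() -> CongestionWindow:
    # Simplified baseline: start at 1 packet, double per RTT, reset to 1 on loss.
    return CongestionWindow(cwnd=1, ssthresh=64, growth=2,
                            keep_on_loss=None, thresh_on_loss=0.5)


def aggressive() -> CongestionWindow:
    # From the slide: start at 20 packets, keep 80% on loss, threshold at 90%.
    # The multiplier 3 is an assumed stand-in for "increase more quickly".
    return CongestionWindow(cwnd=20, ssthresh=64, growth=3,
                            keep_on_loss=0.8, thresh_on_loss=0.9)


def trace(window: CongestionWindow, rtts: int, loss_at: int) -> List[float]:
    """Window size after each RTT, with a single loss at round `loss_at`."""
    history = []
    for rtt in range(rtts):
        if rtt == loss_at:
            window.on_loss()
        else:
            window.on_rtt_without_loss()
        history.append(round(window.cwnd, 1))
    return history


if __name__ == "__main__":
    print("reno:      ", trace(reno(), rtts=12, loss_at=8))
    print("aggressive:", trace(aggressive(), rtts=12, loss_at=8))
```

Because the new threshold (90%) sits above the reduced window (80%), the aggressive variant re-enters slow start briefly after a loss and then resumes congestion avoidance close to its previous rate.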
Aggressive TCP Reno: What I have done
Built into Linux kernel
[Figure labels: TCP Reno, Aggress TCP]

Some Performance gains for Virtualized Audio
Frequency (computational work load)   Run time on modified kernel (aggressive TCP)   Run time on unmodified kernel
1000                                  169.9   166.5   169.3                          182.9   185.3   183.1
1600                                  1035.5  1032.0  1035.8                         1171.7  1161.8  1159.7
2000                                  2187.9  2134.8  2130.2                         2207.2  2206.4  2207.7

Aggressive TCP Reno
Through the kernel modification, some performance gains are achieved without modifying the application code.
But the results are still not very satisfactory and not very stable. Why?

Aggressive TCP Reno
That can be due to three reasons.
First, virtualized audio is very computationally intensive, so even a much more drastic improvement in communication performance will not change the overall performance much (most of the time was spent on computing).
Second, the bandwidth*delay product on the cluster is small, which implies that this technique may be more effective on the WAN (with a much bigger RTT).
Third, the effect of fast retransmit and fast recovery is not considered here, but it turns out to be important.

Conclusion
Some technologies used by parallel systems are moving into clusters, making low latency and high bandwidth available.
With the development of the Internet, new & improved protocols are proposed and tested to provide lower latency & higher bandwidth to Grid computing.
A new proposal for aggressive TCP is implemented and tested, and some performance gains are achieved. More work is needed to make it more effective and stable.

Questions?