• Infiniband architecture
  – Specification: Infiniband Architecture Specification, Release 1.2 (Oct. 5, 2004), available from the Infiniband Trade Association (http://www.infinibandta.org)
• Potential improvements
• Infiniband architecture overview
  – Components:
    • Links
    • Channel adaptors
    • Switches
    • Routers
  – The specification allows an Infiniband wide area network, but Infiniband is mostly adopted as a system/storage area network.
  – Topology:
    • Irregular
    • Regular: fat tree
  – Link speed:
    • 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X)
• Layers: somewhat similar to TCP/IP
  – Physical layer
  – Link layer
    • Error detection (CRC checksum)
    • Flow control (credit based)
    • Switching, virtual lanes (VL)
    • Forwarding table computed by the subnet manager
      – Not adaptive
  – Network layer: routing across subnets
    • No use in the cluster environment
  – Transport layer
    • Reliable/unreliable, connection/datagram
  – Verbs: interface between adaptors and OS/users
• Packet format:
  – Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet
  – Global Route Header (GRH): 40 bytes. Used for routing between subnets
  – Base Transport Header (BTH): 12 bytes, for IBA transport
  – Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagram
  – Datagram Extended Transport Header (DETH): 8 bytes
  – RDMA Extended Transport Header (RETH): 16 bytes
  – Atomic, ACK, and Atomic ACK extended transport headers
  – Immediate Data extended transport header: 4 bytes, optimized for small packets
  – Invalidate extended transport header
  – Invariant CRC and variant CRC:
    • One CRC covers the fields that do not change in transit, the other covers the fields that do.
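The header sizes in the packet format list can be added up to estimate per-packet overhead. The following is a minimal sketch, not a packet builder: the header sizes come from the list above, while the CRC sizes (4-byte invariant CRC, 2-byte variant CRC) are taken from the IBA specification rather than from these slides, and the function names and the choice of an RDMA WRITE packet are illustrative assumptions.

```python
# Header sizes in bytes, as given in the packet format list above.
HEADER_BYTES = {
    "LRH": 8,     # Local Route Header
    "GRH": 40,    # Global Route Header (only when crossing subnets)
    "BTH": 12,    # Base Transport Header
    "RDETH": 4,   # Reliable Datagram ETH
    "DETH": 8,    # Datagram ETH
    "RETH": 16,   # RDMA ETH
    "ImmDt": 4,   # Immediate Data ETH
}

# Assumption: CRC sizes are not stated in the slides; these are the
# values defined by the IBA specification.
ICRC_BYTES = 4  # invariant CRC
VCRC_BYTES = 2  # variant CRC


def packet_overhead(headers):
    """Bytes of headers plus both CRCs for one packet."""
    return sum(HEADER_BYTES[h] for h in headers) + ICRC_BYTES + VCRC_BYTES


def efficiency(headers, payload_bytes):
    """Fraction of the bytes on the wire that are payload."""
    return payload_bytes / (payload_bytes + packet_overhead(headers))


# Hypothetical example: first packet of an RDMA WRITE, inside one subnet
# and across subnets (the latter adds the GRH).
rdma_write_local = ["LRH", "BTH", "RETH"]
rdma_write_global = ["LRH", "GRH", "BTH", "RETH"]

if __name__ == "__main__":
    for name, hdrs in [("local", rdma_write_local), ("global", rdma_write_global)]:
        print(name, packet_overhead(hdrs), round(efficiency(hdrs, 2048), 4))
```

Within a subnet the GRH is absent, which is why the overhead in a cluster setting is small even for modest payloads.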
• Local Route Header:
  – Switching based on the destination port address (LID)
  – Multipath switching by allocating multiple LIDs to one port
• GRH: same format as the IPv6 header (16-byte addresses)
• Base transport header:
• Verbs
  – OS/users access the adaptor through verbs
  – Communication mechanism: Queue Pair (QP)
    • Supports the four types of services, including reliable connection service
    • Each connection takes one QP on each end.
    • Each QP has a send queue and a receive queue.
    • Users post send requests to the send queue and receive requests to the receive queue.
    • Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY-BINDING
    • One receive operation (matching SEND)
• Queue Pair:
  – The completion status of an operation (send/receive) is stored in a completion queue.
  – Send and receive queues can be bound to different completion queues.
• Related system level verbs:
  – Open QP, create completion queue, open HCA, open protection domain, register memory, allocate memory window, etc.
• User level verbs:
  – Post send/receive request, poll for completion.
• To communicate:
  – Make system calls to set up everything (open QP, bind QP to port, bind completion queues, connect local QP to remote QP, register memory, etc.).
  – Post send/receive requests.
  – Check completion.
  – What if a packet arrives before a receive request is posted?
    • Not specified in the standard
    • The right response should be a ‘receiver not ready (RNR)’ error. The sender is back-pressured in this case.
• Infiniband has a perfect software interface (Chien'94 paper):
  – The network subsystem realizes all user level functionality.
  – User level access to the network interface: a few machine instructions accomplish the transmission task without involving the OS.
  – The network supports in-order delivery and fault tolerance.
  – Buffer management is pushed out to the user.
• SilverStorm 9024:
  – 24 ports 4X (10 Gbps) or 8 ports 12X (30 Gbps)
  – Switch type: cut-through
  – Switch latency: < 140 ns
  – Switch bandwidth: 480 Gbps
  – Forwarding table size: 48K
  – VL support: 8 + 1 management
• SilverStorm 9240:
  – 24 expansion slots, each expansion module with 12 4X ports or 4 12X ports (24 x 12 = 288, a 288 by 288 switch)
  – Switch type: cut-through
  – Switch latency: < 140 ns to < 420 ns
  – Switch bandwidth: 5.76 Tbps
  – Forwarding table size: 48K
  – VL support: 8 + 1 management
• Potential improvements on Infiniband using compiled communication
  – Improving the internal Infiniband fabric:
    • Offline routing for static patterns (a static SM for a reduced traffic pattern) can be beneficial for irregular networks.
    • Simplify the layer architecture with a direct link model (for known patterns): the header can be simplified, though this may not matter much (Infiniband layers are thin).
    • Simplify the protection mechanism.
    • Circuit-switched Infiniband.
    • A reliable communication protocol is still needed.
    • Potential benefits can be evaluated by simulation.
  – Improving the messaging software (software to hardware interface): no chance.
  – Improving the MPI implementation over Infiniband: similar to our current work on Ethernet
    • Message scheduling for collective/point-to-point communications based on the network topology.
    • Exploring NIC features (buffers in the NIC, multicast)
    • Reducing the number of instructions in a library routine makes sense. Compiled communication can be used to optimize the MPI library.
    • Compiled communication can help improve the library implementation (e.g. reducing the number of message copies, early request posting, using RDMA, etc.).
• One particular project:
  – Design algorithms for the Infiniband subnet manager
  – Improve routing performance for the Infiniband subnet manager (SM).
    • Objective: minimize the maximum channel load for a given traffic pattern
    • Optimize according to a given pattern: the traffic pattern in an application is usually not all-to-all
      – Default routing used in the IBA SM
    • For a sparse traffic pattern, the maximum channel load can usually be minimized using the minimum interference principle.
      – Need to extend minimum interference routing to load-balanced, deadlock-free routing.
      – The best way to realize the IBA SM is still unclear at this time; we can probably do something here.
    • Irregular network or fat tree network
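The objective above (minimize the maximum channel load for a given traffic pattern) can be illustrated on a toy network. This is a minimal sketch under stated assumptions, not the routing used by any subnet manager and not the project's algorithm: the two-leaf, two-spine fat tree, the host and switch names, the helper `path`, and the greedy rule in `greedy_route` (pick the candidate path whose most loaded link is least loaded) are all hypothetical, chosen only to show how channel load is counted and how avoiding interference lowers it.

```python
from collections import Counter


def max_channel_load(routes):
    """routes maps (src, dst) -> list of directed links (u, v).
    Returns the largest number of flows sharing any one link."""
    load = Counter(link for p in routes.values() for link in p)
    return max(load.values(), default=0)


def greedy_route(pattern, candidates):
    """For each flow in order, choose the candidate path whose most
    loaded link currently carries the fewest flows (ties: shorter path)."""
    load = Counter()
    chosen = {}
    for flow in pattern:
        best = min(
            candidates[flow],
            key=lambda p: (max((load[l] for l in p), default=0), len(p)),
        )
        chosen[flow] = best
        load.update(best)
    return chosen


def path(src, leaf_src, spine, leaf_dst, dst):
    """Directed links of a host -> leaf -> spine -> leaf -> host route."""
    hops = [src, leaf_src, spine, leaf_dst, dst]
    return list(zip(hops, hops[1:]))


# Hypothetical toy fat tree: leaves A and B, spines S1 and S2.
pattern = [("a1", "b1"), ("a2", "b2")]
candidates = {
    ("a1", "b1"): [path("a1", "A", s, "B", "b1") for s in ("S1", "S2")],
    ("a2", "b2"): [path("a2", "A", s, "B", "b2") for s in ("S1", "S2")],
}

# Pattern-oblivious choice: every flow goes through spine S1.
naive = {flow: candidates[flow][0] for flow in pattern}
# Pattern-aware greedy choice.
balanced = greedy_route(pattern, candidates)

if __name__ == "__main__":
    print("naive max load:", max_channel_load(naive))
    print("balanced max load:", max_channel_load(balanced))
```

In this toy case both flows share the A-to-S1 and S1-to-B links under the naive choice, while the greedy choice sends the second flow through S2. The sketch ignores deadlock freedom, which the text notes still has to be added to minimum interference routing.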