Interconnection Network Topology Design Trade-offs

Organizational Structure
- Processors
  - datapath + control logic
  - control logic determined by examining register transfers in the datapath
- Networks
  - links
  - switches
  - network interfaces

Link Design/Engineering Space
- Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces
- Narrow: control, data and timing multiplexed on wire
- Wide: control, data and timing on separate wires
- Short: single logical value at a time
- Long: stream of logical values at a time
- Synchronous: source & dest on same clock
- Asynchronous: source encodes clock in signal

Example: Cray MPPs
- T3D: short, wide, synchronous (300 MB/s)
  - 24 bits: 16 data, 4 control, 4 reverse-direction flow control
  - single 150 MHz clock (including processor)
  - flit = phit = 16 bits
  - two control bits identify flit type (idle and framing): no-info, routing tag, packet, end-of-packet
- T3E: long, wide, asynchronous (500 MB/s)
  - 14 bits, 375 MHz, LVDS
  - flit = 5 phits = 70 bits: 64 bits data + 6 control
  - switches operate at 75 MHz
  - framed into 1-word and 8-word read/write request packets

Switches
[Figure: switch organization. Input ports (receiver, input buffer) feed a cross-bar that drives the output ports (output buffer, transmitter); control logic handles routing and scheduling.]

Switch Components
- Output ports
  - transmitter (typically drives clock and data)
- Input ports
  - synchronizer aligns data signal with local clock domain
  - essentially FIFO buffer
- Crossbar
  - connects each input to any output
  - degree limited by area or pinout
- Buffering
- Control logic
  - complexity depends on routing logic and scheduling algorithm
  - determine output port for each incoming packet
  - arbitrate among inputs directed at same output

Interconnection Topologies
[Figure: classification tree of topologies]
- Regular
  - Static: one-dimensional, two-dimensional, three-dimensional, hypercube, ...
  - Dynamic: single-stage, multistage (one-sided, two-sided), crossbar, ...
- Irregular

Static Connection Topologies (d: node degree, D: diameter)
- Mesh and Torus
  - Illiac IV, MPP, DAP, CM-2, Paragon
  - k-dimensional mesh: $N = n^k$, $d = 2k$, $D = k(n-1)$
  - wraparound variation: Illiac IV
  - Torus: $n \times n$ binary torus, $d = 4$, $D = 2\lfloor n/2 \rfloor$
- Hypercubes
  - iPSC, nCube, CM-2
  - $N = 2^n$, $d = n$, $D = n$
  - poor scalability, difficulty in packaging higher-dimensional hypercubes

Dynamic Interconnection Networks
- Bus-based networks
- Crossbar networks
- Single-stage networks
  - Shuffle-exchange
  - N inputs and N outputs
  - Crossbar
  - Recirculating networks
- Multistage networks
  - more than one stage of switching elements
  - switching box: straight, exchange, upper broadcast, lower broadcast
  - network topology and control structure

Dynamic Interconnection Networks (continued)
- Two-sided MIN
  - connecting an arbitrary input to an arbitrary output
  - blocking, rearrangeable, nonblocking networks
- Blocking networks: Data manipulator, Omega, Flip, n-cube, Baseline
- Rearrangeable networks: Benes network
- Nonblocking networks: Clos, Crossbar

Interconnection Topologies
- Logical properties: distance, degree
- Physical properties: length, width
- Fully connected network
  - diameter = 1
  - degree = N
  - cost?
    - bus => $O(N)$, but BW is $O(1)$
    - crossbar => $O(N^2)$ for BW $O(N)$ (actually worse)
- VLSI technology determines switch degree

Linear Arrays and Rings
[Figure: linear array, torus, and torus arranged to use short wires]
- Linear array
  - Diameter? $N-1$
  - Average distance? about $N/3$
  - Bisection bandwidth? 1
  - Route A -> B given by relative address $R = B - A$
  - Space? $O(N)$
- Torus (or ring)?
- Examples: FDDI, SCI, FiberChannel Arbitrated Loop, KSR1

Multidimensional Meshes and Tori
[Figure: 2D grid and 3D cube]
- d-dimensional array
  - $N = k_{d-1} \times \dots \times k_0$ nodes
  - described by d-vector of coordinates $(i_{d-1}, \dots, i_0)$
- d-dimensional k-ary mesh: $N = k^d$
  - $k = \sqrt[d]{N}$
  - described by d-vector of radix-k coordinates
- d-dimensional k-ary torus (or k-ary d-cube)?

Properties
- Routing
  - relative distance: $R = (b_{d-1} - a_{d-1}, \dots, b_0 - a_0)$
  - traverse $r_i = b_i - a_i$ hops in each dimension
  - dimension-order routing
- Average distance? Wire length?
  - about $d \cdot k/3$ for a mesh
  - $dk/2$ for a cube with unidirectional links ($dk/4$ if the links are bidirectional)
- Degree?
- Bisection bandwidth? Partitioning?
  - $k^{d-1}$ bidirectional links
- Physical layout?
  - 2D in $O(N)$ space, short wires
  - higher dimension?

Real World 2D mesh
- 1824-node Paragon: 16 x 114 array
- a single cabinet: 16 x 4 array

Embeddings in two dimensions
[Figure: 6 x 3 x 2 array embedded in two dimensions]
- Embed multiple logical dimensions in one physical dimension using long wires

Trees
- Diameter and average distance logarithmic
  - k-ary tree, height $d = \log_k N$
  - address specified by d-vector of radix-k coordinates describing the path down from the root
- Fixed degree
- Route up to common ancestor and down
  - $R = B \oplus A$
  - let i be the position of the most significant 1 in R; route up i+1 levels
  - down in direction given by low i+1 bits of B
- H-tree space is $O(N)$ with $O(\sqrt{N})$ long wires
- Bisection BW?

Fat-Trees
- Fatter links (really more of them) as you go up, so bisection BW scales with N

Butterflies
[Figure: 16-node butterfly and its building block]
- Tree with lots of roots!
- $N \log N$ switches (actually $N/2 \times \log N$)
- Exactly one route from any source to any dest
- $R = A \oplus B$; at level i use the 'straight' edge if $r_i = 0$, otherwise the cross edge
- Bisection: $N/2$ vs $N^{(d-1)/d}$ (d-dimensional mesh) vs 1 (tree)

Benes network and Fat Tree
[Figure: 16-node Benes network (unidirectional) and 16-node 2-ary fat-tree (bidirectional)]
- Back-to-back butterfly can route all permutations off-line
- What if you just pick a random midpoint?

Hypercubes
[Figure: hypercubes of dimension 0-D through 5-D]
- Also called binary n-cubes. Number of nodes: $N = 2^n$
- $O(\log N)$ hops
- Good bisection BW
- Complexity: out degree is $n = \log N$
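
The routing rules stated on the mesh, butterfly, and hypercube slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, nodes are numbered so that coordinate $i_0$ is the least significant radix-k digit, dimension 0 is resolved first, and butterfly level i is assumed to examine bit i of R.

```python
def to_coords(node, k, d):
    """Radix-k coordinate vector [i_0, ..., i_{d-1}] of a node id (assumed numbering)."""
    coords = []
    for _ in range(d):
        coords.append(node % k)
        node //= k
    return coords


def mesh_route(a, b, k, d):
    """Dimension-order route in a k-ary d-dimensional mesh without wraparound.

    Returns the node ids visited from a to b, finishing each dimension
    before moving to the next one.
    """
    src = to_coords(a, k, d)
    dst = to_coords(b, k, d)
    path = [a]
    node = a
    for dim in range(d):
        r = dst[dim] - src[dim]          # relative distance r_i = b_i - a_i
        step = 1 if r > 0 else -1
        for _ in range(abs(r)):
            node += step * k ** dim      # one hop along this dimension
            path.append(node)
    return path


def butterfly_route(a, b, levels):
    """Edge taken at each level of a butterfly with 2**levels inputs.

    R = a xor b; bit i of R picks 'straight' (0) or 'cross' (1) at level i.
    The bit-to-level mapping is an assumption of this sketch.
    """
    r = a ^ b
    return ["cross" if (r >> i) & 1 else "straight" for i in range(levels)]


def hypercube_hops(a, b):
    """Hops between two hypercube nodes: the number of differing address bits."""
    return bin(a ^ b).count("1")
```

For example, `mesh_route(0, 15, 4, 2)` walks along dimension 0 and then dimension 1 of a 4 x 4 mesh, while `butterfly_route` always returns exactly `levels` steps even when the hypercube distance between the same two addresses is smaller.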

Relationship of Butterflies to Hypercubes
- Wiring is isomorphic
- Except that the butterfly always takes $\log N$ steps

Topology Summary

Topology       Degree          Diameter           Ave Dist               Bisection      D (D ave) @ P=1024
1D Array       2               $N-1$              $N/3$                  1              huge
1D Ring        2               $N/2$              $N/4$                  2
2D Mesh        4               $2(N^{1/2}-1)$     $\frac{2}{3}N^{1/2}$   $N^{1/2}$      62 (21)
2D Torus       4               $N^{1/2}$          $\frac{1}{2}N^{1/2}$   $2N^{1/2}$     32 (16)
k-ary n-cube   $2n$            $n(k-1)$           $n(k-1)/2$             $2k^{n-1}$     27 (13.5) @ n=3
Hypercube      $n = \log N$    $n$                $n/2$                  $N/2$          10 (5)

- All have some "bad permutations"
  - many popular permutations are very bad for meshes (transpose)
  - randomness in wiring or routing makes it hard to find a bad one!

Wire Efficient Communication Networks for Multicomputers
- What makes a network efficient?
  - Efficient use of the limiting resources
- Limiting factors
  - Previously, only switches and pins were considered the limiting factors
  - Wires are limiting factors because of power and delay as well as density
  - At the board level as well as at the chip level, the system interconnection is limited by wire density
  - Most of the power dissipated in the networks is $CV^2f$ power used to drive wires
  - Most of the delay is propagation delay over wires or RC delay in driving wires

In the 3D world
- For n nodes, bisection area is $O(n^{2/3})$
- For large n, bisection bandwidth is limited to $O(n^{2/3})$
  - Bill Dally, IEEE TPDS, [Dal90a]
- For fixed bisection bandwidth, low-dimensional k-ary d-cubes are better (otherwise higher is better)
  - i.e., a few short fat wires are better than many long thin wires
- What about many long fat wires?

The Design Objective of the Network
- To minimize latency and maximize throughput
- Latency $T(\lambda, L)$: the average time required to deliver a message
- Each node injects messages with average length $L$ into the network at an average rate of $\lambda$ bits per cycle.
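
Returning briefly to the topology summary table: its last column can be checked by plugging P = 1024 into the closed forms of the Diameter and Ave Dist columns. The sketch below does that. The function name `summary` is hypothetical, P is assumed to be both a perfect square and a power of two, and the radix of the 3-dimensional cube is rounded to the nearest integer because 1024 is not a perfect cube.

```python
import math


def summary(P=1024):
    """(diameter, average distance) per topology for P nodes.

    Uses the closed forms from the topology summary table. Assumes P is a
    perfect square and a power of two; the 3-cube radix k is rounded.
    """
    s = math.isqrt(P)           # side length of the 2D mesh / torus
    n = int(math.log2(P))       # hypercube dimension
    k = round(P ** (1 / 3))     # approximate radix of the k-ary 3-cube
    return {
        "1D array": (P - 1, P / 3),
        "1D ring": (P / 2, P / 4),
        "2D mesh": (2 * (s - 1), 2 * s / 3),
        "2D torus": (s, s / 2),
        "k-ary 3-cube": (3 * (k - 1), 3 * (k - 1) / 2),
        "hypercube": (n, n / 2),
    }


if __name__ == "__main__":
    for name, (diameter, ave) in summary().items():
        print(f"{name:13s} D = {diameter:7.1f}   D_ave = {ave:7.1f}")
```

For P = 1024 this gives 62 (about 21.3) for the 2D mesh, 32 (16) for the 2D torus, 27 (13.5) for the 3-cube with k = 10, and 10 (5) for the hypercube.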

Three independent variables: topology, routing, and flow control.

Topology
Indirect Networks (k-ary d-flies: radix k and dimension d)
- Number of processing nodes: $N = k^d$
- $B_I = N/2$, $BW_I = Nw/2$: high bisection width
- $d_{in} = d_{out} = k$, $\delta = 2k$: low degree
- $D = d + 1$: low diameter
[Figure: 2-ary 3-fly]

Wire Efficient Topology
Indirect Networks
- high bisection width, low degree, low diameter, long wires, symmetry
- the bisection width $B = N/2$ does not reflect the actual maximum wire density for this class of networks: a vertical partition (N wires) more accurately reflects the wiring problems
- wire area $O(N^2)$: plane mapping is expensive
- $N = k^d$. As one varies k and d with the number of processing nodes N and the bandwidth BW fixed:
  - the degree and diameter are directly controlled
  - the channel width remains fixed at $w = BW/B = 2BW/N$
  - B is independent of the choice of k and d
- disadvantage: it prevents the designer from trading off the bandwidth of a channel against the diameter of the network

Wire Efficient Topology
Direct Networks (k-ary d-cubes)
- $B_D = 2N/k$, $BW_D = 2Nw/k$
- $d_{in} = d_{out} = d$, $\delta = 2d$
- $D = dk/2$
Indirect networks, for comparison
- $B_I = N/2$, $BW_I = Nw/2$: high bisection width
- $d_{in} = d_{out} = k$, $\delta = 2k$: low degree
- $D = d + 1$: low diameter
For small d, direct networks have
- a low and controllable bisection width ($N = k^d$)
- low degree
- high diameter
- short wires ($d \le 3$)
- wiring complexity $O(N)$

How Many Dimensions?
- $d = 2$ or $d = 3$
  - Short wires, easy to build
  - Many hops, low bisection bandwidth
  - Requires traffic locality
- $d \ge 4$
  - Harder to build, more wires, longer average length
  - Fewer hops, better bisection bandwidth
  - Can handle non-local traffic
- k-ary d-cubes provide a consistent framework for comparison
  - $N = k^d$
  - scale dimension (d) or nodes per dimension (k)
  - assume cut-through

Traditional Scaling: Unloaded Latency(N)
[Figure: two plots of average latency T versus machine size N (0 to 10000), one for m = 40 and one for m = 140, with curves for d = 2, d = 3, d = 4, k = 2, and m/w]
- Assumes equal channel width ($w = 1$)
  - independent of node count or dimension
  - dominated by average distance
- Unit routing delay ($\Delta = 1$)

Real Machines
- Wide links, smaller routing delay
- Tremendous variation

Average Distance
[Figure: average distance versus dimension (0 to 25) for N = 256, 1024, 16384, and 1048576]
- ave dist = $d(k-1)/2$
- but equal channel width is not equal cost!
- Higher dimension => more channels
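
The trend in the last two plots can be reproduced numerically. The sketch below evaluates the slide's average distance $d(k-1)/2$ with $k = N^{1/d}$, and combines it with a simple cut-through latency estimate. The latency formula $T = \Delta \cdot \text{ave dist} + m/w$ is an assumption of this sketch, chosen to match the slide's unit routing delay and $w = 1$; it is not taken from the slides. The function names are hypothetical, and k is allowed to be non-integer so that any N and d can be compared.

```python
def avg_distance(N, d):
    """Average distance d*(k-1)/2 for a k-ary d-cube with N = k**d nodes.

    k is the real-valued d-th root of N, so it need not be an integer.
    """
    k = N ** (1.0 / d)
    return d * (k - 1) / 2


def unloaded_latency(N, d, m, w=1, delta=1):
    """Assumed cut-through estimate: per-hop routing delay times the average
    distance, plus the time m/w to stream an m-bit message over a w-bit channel."""
    return delta * avg_distance(N, d) + m / w


if __name__ == "__main__":
    for N in (256, 1024, 16384, 1048576):
        row = ", ".join(f"d={d}: {avg_distance(N, d):8.1f}" for d in (2, 3, 4))
        print(f"N={N:8d}  {row}")
```

For N = 1024 the average distance falls from 31 hops at d = 2 to 5 hops at d = 10 (k = 2), which is why latency at equal channel width looks better in higher dimensions. The comparison ignores the extra channels that the higher dimension requires.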