Hypercube

Fall 2010 Parallel Processing, Low-Diameter Architectures Slide 1

About This Presentation
This presentation was originally prepared by Behrooz Parhami for the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1).

Low-Diameter Architectures
Study the hypercube and related interconnection schemes:
• Prime example of low-diameter (logarithmic) networks
• Theoretical properties, realizability, and scalability
• Complete our view of the “sea of interconnection nets”

Hypercubes and Their Algorithms
Study the hypercube and its topological/algorithmic properties:
• Develop simple hypercube algorithms (more in Ch. 14)
• Learn about embeddings and their usefulness

Topics in This Lecture
13.1 Definition and Main Properties
13.2 Embeddings and Their Usefulness
13.3 Embedding of Arrays and Trees
14.5 Broadcasting on a Hypercube
14.6 Adaptive and Fault-Tolerant Routing
15 Inverting a Lower-Triangular Matrix

13.1 Definition and Main Properties
Intermediate architectures: logarithmic or sublogarithmic diameter
Begin studying networks that are intermediate between the diameter-1 complete network and the diameter-$p^{1/2}$ mesh

Diameters of n-node networks, from smallest to largest:
Sublogarithmic diameter
  1                      Complete network
  2                      PDN
  log n / log log n      Star, pancake
Logarithmic diameter
  log n                  Binary tree, hypercube
Superlogarithmic diameter
  $\sqrt{n}$             Torus
  n/2                    Ring
  n – 1                  Linear array

Hypercube and Its History
Binary tree has logarithmic diameter, but small bisection
Hypercube has a much larger bisection
Hypercube is a mesh with the maximum possible number of dimensions: $2 \times 2 \times 2 \times \cdots \times 2$, with $q = \log_2 p$ factors
We saw that increasing the number of dimensions made it harder to design and visualize algorithms for the mesh
Oddly, at the extreme of $\log_2 p$ dimensions, things become simple again!

Brief history of the hypercube (binary q-cube) architecture
• Concept developed: early 1960s [Squi63]
• Direct (single-stage) and indirect (multistage) versions: mid 1970s. Initial proposals [Peas77], [Sull77] included no hardware
• Caltech’s 64-node Cosmic Cube: early 1980s [Seit85]. Introduced an elegant solution to routing (wormhole switching)
• Several commercial machines: mid to late 1980s. Intel PSC (personal supercomputer), CM-2, nCUBE (Section 22.3)

Basic Definitions
Hypercube is the generic term; 3-cube, 4-cube, . . . , q-cube in specific cases
Parameters: $p = 2^q$ (nodes), $B = p/2 = 2^{q-1}$ (bisection width), $D = q = \log_2 p$ (diameter), $d = q = \log_2 p$ (node degree)
Fig. 13.1 The recursive structure of binary hypercubes.
(a) Binary 1-cube, built of two binary 0-cubes, labeled 0 and 1
(b) Binary 2-cube, built of two binary 1-cubes, labeled 0 and 1
(c) Binary 3-cube, built of two binary 2-cubes, labeled 0 and 1
(d) Binary 4-cube, built of two binary 3-cubes, labeled 0 and 1

The 64-Node Hypercube
Only sample wraparound links are shown to avoid clutter
Isomorphic to the 4 × 4 × 4 3D torus (each has 64 × 6/2 links)

Neighbors of a Node in a Hypercube
$x_{q-1}x_{q-2} \ldots x_2x_1x_0$              ID of node x
$x_{q-1}x_{q-2} \ldots x_2x_1\bar{x}_0$        dimension-0 neighbor; $N_0(x)$
$x_{q-1}x_{q-2} \ldots x_2\bar{x}_1x_0$        dimension-1 neighbor; $N_1(x)$
. . .
$\bar{x}_{q-1}x_{q-2} \ldots x_2x_1x_0$        dimension-(q – 1) neighbor; $N_{q-1}(x)$
[Figure: the q neighbors of node x, shown along Dim 0 through Dim 3 of a 4-cube.]
Nodes whose labels differ in k bits (at Hamming distance k) are connected by a shortest path of length k
Both node- and edge-symmetric
Strengths: symmetry, log diameter, and linear bisection width
Weakness: poor scalability

13.2 Embeddings and Their Usefulness
Fig. 13.2 Embedding a seven-node binary tree into 2D meshes of various sizes.
Expansion   Dilation   Congestion   Load factor
9/7         1          1            1
8/7         2          2            1
4/7         1          2            2
Expansion: ratio of the number of nodes (9/7, 8/7, and 4/7 here)
Dilation: longest path onto which an edge is mapped (routing slowdown)
Congestion: max number of edges mapped onto one edge (contention slowdown)
Load factor: max number of nodes mapped onto one node (processing slowdown)

13.3 Embedding of Arrays and Trees
Fig. 13.3 Hamiltonian cycle in the q-cube.
[Figure: the cycle visits (q – 1)-cube 0 in (q – 1)-bit Gray code order (0 000...000, 0 000...001, 0 000...011, . . . , 0 100...000), crosses to (q – 1)-cube 1, and returns in reverse Gray code order (1 100...000, . . . , 1 000...011, 1 000...001, 1 000...000). Labeled nodes: $x$, $N_k(x)$, $N_{q-1}(x)$, $N_{q-1}(N_k(x))$.]
Alternate inductive proof: Hamiltonicity of the q-cube is equivalent to the existence of a q-bit Gray code
Basis: a q-bit Gray code beginning with the all-0s codeword and ending with $10^{q-1}$ exists for q = 2: 00, 01, 11, 10

Mesh/Torus Embedding in a Hypercube
Fig. 13.5 The 4 × 4 mesh/torus is a subgraph of the 4-cube.
[Figure: columns 0 to 3 of the mesh, with links labeled Dim 0 to Dim 3.]
Is a mesh or torus a subgraph of the hypercube of the same size?
We prove this to be the case for a torus (and thus for a mesh)

Torus is a Subgraph of Same-Size Hypercube
A tool used in our proof: the product graph $G_1 \times G_2$
• Has $n_1 \times n_2$ nodes
• Each node is labeled by a pair of labels, one from each component graph
• Two nodes are connected if their labels agree in one component and the other components are connected in the corresponding component graph
Fig. 13.4 Examples of product graphs. [Figure: includes the 3-by-2 torus, with nodes 0a, 0b, 1a, 1b, 2a, 2b.]
The $2^a \times 2^b \times 2^c \times \ldots$ torus is the product of $2^a$-, $2^b$-, $2^c$-, . . . node rings
The (a + b + c + ... )-cube is the product of a-cube, b-cube, c-cube, . . .
The $2^q$-node ring is a subgraph of the q-cube
If a set of component graphs are subgraphs of another set, the product graphs will have the same relationship

Embedding Trees in the Hypercube
The $(2^q - 1)$-node complete binary tree is not a subgraph of the q-cube
Proof by contradiction based on the parity of node label weights (number of 1s in the labels): tree levels alternate between even and odd weights, so the leaves together with every second level above them hold more than $2^{q-1}$ nodes of one parity (5 of the 7 nodes for q = 3), whereas the q-cube has only $2^{q-1}$ nodes of each parity
The $2^q$-node double-rooted complete binary tree is a subgraph of the q-cube
[Figure: a double-rooted tree in (q – 1)-cube 0 and another in (q – 1)-cube 1 are joined through new roots; labeled nodes include $x$, $N_a(x)$, $N_b(x)$, $N_c(x)$, $N_b(N_c(x))$, $N_c(N_b(x))$, $N_c(N_a(x))$, $N_a(N_c(x))$.]
Fig. 13.6 The $2^q$-node double-rooted complete binary tree is a subgraph of the q-cube.
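The ring and torus embeddings claimed above can be checked numerically. The following is a minimal sketch, assuming the reflected binary Gray code is used to label ring positions and that a torus node (r, c) is labeled by concatenating the Gray codes of its row and column; the function names are illustrative and do not come from the textbook.

```python
def gray(i: int) -> int:
    """Return the i-th codeword of the reflected binary Gray code."""
    return i ^ (i >> 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which labels a and b differ."""
    return bin(a ^ b).count("1")


def ring_in_cube(q: int) -> bool:
    """Check that consecutive positions of the 2^q-node ring map to
    hypercube neighbors (labels at Hamming distance 1)."""
    n = 1 << q
    return all(hamming(gray(i), gray((i + 1) % n)) == 1 for i in range(n))


def torus_label(r: int, c: int, b: int) -> int:
    """Hypothetical labeling: row Gray code in the high bits,
    column Gray code in the low b bits."""
    return (gray(r) << b) | gray(c)


def torus_in_cube(a: int, b: int) -> bool:
    """Check that the 2^a x 2^b torus (a, b >= 1) maps one-to-one onto
    the (a + b)-cube with every torus link landing on a cube link."""
    rows, cols = 1 << a, 1 << b
    labels = set()
    for r in range(rows):
        for c in range(cols):
            x = torus_label(r, c, b)
            labels.add(x)
            right = torus_label(r, (c + 1) % cols, b)
            down = torus_label((r + 1) % rows, c, b)
            if hamming(x, right) != 1 or hamming(x, down) != 1:
                return False
    return len(labels) == rows * cols


if __name__ == "__main__":
    print([format(gray(i), "02b") for i in range(4)])
    print(ring_in_cube(4), torus_in_cube(2, 2))
```

The row and column codes occupy disjoint bit fields, which mirrors the product-graph argument: each ring is embedded in its own subcube, and the product of the embeddings is an embedding of the product.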
14.5 Broadcasting on a Hypercube
Flooding: applicable to any network with all-port communication
00000                                                                   Source node
00001, 00010, 00100, 01000, 10000                                       Neighbors of source
00011, 00101, 01001, 10001, 00110, 01010, 10010, 01100, 10100, 11000    Distance-2 nodes
00111, 01011, 10011, 01101, 10101, 11001, 01110, 10110, 11010, 11100    Distance-3 nodes
01111, 10111, 11011, 11101, 11110                                       Distance-4 nodes
11111                                                                   Distance-5 node
Binomial broadcast tree with single-port communication
Fig. 14.9 The binomial broadcast tree for a 5-cube.
[Figure: tree rooted at 00000, drawn against time steps 0 to 5; visible node labels include 10000, 01000, 00100, 11000, 01100, 10100, 11100, 00010, 00001.]

Hypercube Broadcasting Algorithms
Fig. 14.10 Three hypercube broadcasting schemes as performed on a 4-cube.
[Figure panels, each broadcasting a message split into pieces A, B, C, D: binomial-tree scheme (nonpipelined); pipelined binomial-tree scheme; Johnsson & Ho’s method. To avoid clutter, only A is shown in part of the figure.]

14.6 Adaptive and Fault-Tolerant Routing
There are up to q node-disjoint and edge-disjoint shortest paths between any node pair in a q-cube
Thus, one can route messages around congested or failed nodes/links
A useful notion for designing adaptive wormhole routing algorithms is that of virtual communication networks
Fig. 14.11 Partitioning a 3-cube into subnetworks for deadlock-free routing. [Figure panels: Subnetwork 0, Subnetwork 1.]
Each of the two subnetworks in Fig. 14.11 is acyclic
Hence, any routing scheme that begins by using links in subnet 0, at some point switches the path to subnet 1, and from then on remains in subnet 1, is deadlock-free

Robustness of the Hypercube
Rich connectivity provides many alternate paths for message routing
[Figure: a hypercube with source S, a destination, and three faulty nodes marked X.]
The node that is farthest from S is not its diametrically opposite node in the fault-free hypercube
The fault diameter of the q-cube is q + 1.

15 Other Hypercubic Architectures
Learn how the hypercube can be generalized or extended:
• Develop algorithms for our derived architectures
• Compare these architectures based on various criteria

Topics in This Chapter
15.1 Modified and Generalized Hypercubes
15.2 Butterfly and Permutation Networks
15.3 Plus-or-Minus-$2^i$ Network
15.4 The Cube-Connected Cycles Network
15.5 Shuffle and Shuffle-Exchange Networks
15.6 That’s Not All, Folks!

15.1 Modified and Generalized Hypercubes
Fig. 15.1 Deriving a twisted 3-cube by redirecting two links in a 4-cycle. [Figure panels: 3-cube and a 4-cycle in it; twisted 3-cube.]
Diameter is one less than that of the original hypercube

Folded Hypercubes
Fig. 15.2 Deriving a folded 3-cube by adding four diametral links. [Figure panels: a diametral path in the 3-cube; folded 3-cube.]
Diameter is about half that of the original hypercube ($\lceil q/2 \rceil$ versus $q$)
Fig. 15.3 Folded 3-cube viewed as 3-cube with a redundant dimension. [Figure panels: folded 3-cube with dim-0 links removed; rotate 180 degrees; after renaming, diametral links replace dim-0 links.]
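The effect of the diametral links on diameter can be checked by breadth-first search. The sketch below assumes the folded q-cube is the q-cube plus one extra link per node to the node with all q bits complemented; because both networks are node-symmetric, a search from node 0 is assumed to be enough. The helper names are illustrative only.

```python
from collections import deque


def hypercube_neighbors(x: int, q: int) -> list:
    """Neighbors of x in the q-cube: flip one bit per dimension."""
    return [x ^ (1 << k) for k in range(q)]


def folded_neighbors(x: int, q: int) -> list:
    """Neighbors of x in the folded q-cube: the q cube neighbors plus
    the diametrically opposite node (all q bits complemented)."""
    return hypercube_neighbors(x, q) + [x ^ ((1 << q) - 1)]


def diameter_from_zero(q: int, neighbors) -> int:
    """Largest BFS distance from node 0; equals the diameter when the
    network is node-symmetric."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in neighbors(u, q):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


if __name__ == "__main__":
    for q in range(2, 8):
        print(q,
              diameter_from_zero(q, hypercube_neighbors),
              diameter_from_zero(q, folded_neighbors))
```

For the 3-cube of Fig. 15.2 this reports diameter 3 for the plain cube and 2 for the folded cube.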
Generalized Hypercubes
A hypercube is a power or homogeneous product network
q-cube = $(K_2)^q$, the qth power of $K_2$
Generalized hypercube = qth power of $K_r$ (node labels are radix-r numbers)
Node x is connected to y iff x and y differ in one digit
Each node has r – 1 dimension-k links
Example: radix-4 generalized hypercube; node labels are radix-4 numbers $x_3x_2x_1x_0$
[Figure: dimension-0 links of the radix-4 generalized hypercube.]

Bijective Connection Graphs
Beginning with a c-node seed network, the network size is recursively doubled in each step by linking nodes in the two halves via an arbitrary one-to-one (bijective) mapping.
Number of nodes = $c\,2^q$
Hypercube is a special case, as are many hypercube variant networks (twisted, crossed, Möbius, . . . , cubes)
Special case of c = 1:
Diameter upper bound is q
Diameter lower bound is an open problem (it is better than (q + 1)/2)

15.2 Butterfly and Permutation Networks
Fig. 7.4 Butterfly and wrapped butterfly networks.
[Figure: 8-row butterfly with links labeled Dim 0, Dim 1, Dim 2. Butterfly: $2^q$ rows, q + 1 columns. Wrapped butterfly: $2^q$ rows, q columns (column 3 coincides with column 0).]

Structure of Butterfly Networks
Fig. 15.5 Butterfly network with permuted dimensions.
Switching these two row pairs converts this to the original butterfly network. Changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row xyz becomes row xzy)
[Figure: the 16-row butterfly network, with columns 0 to 4 and links labeled Dim 0 to Dim 3.]

Fat Trees
Fig. 15.6 Two representations of a fat tree.
[Figure: fat tree with leaf processors P0 to P8; annotation: “Skinny tree?”]
Fig. 15.7 Butterfly network redrawn as a fat tree.
[Figure panels: front view: binary tree; side view: inverted binary tree.]

Butterfly as Multistage Interconnection Network
Fig. 6.9 Example of a multistage memory access network
[Figure: p processors (0000 to 1111) connected through $\log_2 p$ columns of 2-by-2 switches to p memory banks (0000 to 1111).]
Fig. 15.8 Butterfly network used to connect modules that are on the same side
Generalization of the butterfly network
High-radix or m-ary butterfly, built of m × m switches
Has $m^q$ rows and q + 1 columns (q if wrapped)

Beneš Network
Fig. 15.9 Beneš network formed from two back-to-back butterflies.
[Figure: processors 000 to 111 connected through $2 \log_2 p - 1$ columns of 2-by-2 switches to memory banks 000 to 111.]
A $2^q$-row Beneš network:
Can route any $2^q \times 2^q$ permutation
It is “rearrangeable”

Routing Paths in a Beneš Network
To which memory modules can we connect proc 4 without rearranging the other paths?
What about proc 6?
Fig. 15.10 Another example of a Beneš network.
[Figure: $2^{q+1}$ inputs, $2^q$ rows, 2q + 1 columns, $2^{q+1}$ outputs.]

Another View of The CCC Network
Example of hierarchical substitution to derive a lower-cost network from a basis network
Fig. 15.14 Alternate derivation of CCC from a hypercube.
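As a concrete illustration of the substitution, the sketch below builds the cube-connected cycles network from a q-cube. It assumes the usual construction in which node x of the q-cube becomes a cycle of q nodes (x, j), and node (x, j) keeps only the dimension-j link of x; the function names are hypothetical.

```python
def ccc_neighbors(x: int, j: int, q: int) -> set:
    """Neighbors of CCC node (x, j): two links along cycle x and one
    cube link along dimension j."""
    return {
        (x, (j + 1) % q),     # next node on the same cycle
        (x, (j - 1) % q),     # previous node on the same cycle
        (x ^ (1 << j), j),    # dimension-j link inherited from the q-cube
    }


def build_ccc(q: int) -> dict:
    """Adjacency sets of the CCC derived from the q-cube (q >= 3)."""
    return {
        (x, j): ccc_neighbors(x, j, q)
        for x in range(1 << q)
        for j in range(q)
    }


if __name__ == "__main__":
    ccc = build_ccc(3)
    print(len(ccc), {len(nbrs) for nbrs in ccc.values()})
```

The node count grows from $2^q$ to $q\,2^q$, but the node degree drops from q to a constant 3, which is the cost reduction that the substitution is after.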
Replacing each node of a high-dimensional q-cube by a cycle of length q is how CCC was originally proposed

Emulation of a 6-Cube by a 64-Node CCC
With proper node mapping, dim-0 and dim-1 neighbors of each node will map onto the same cycle
It suffices to show how to communicate along the other dimensions of the 6-cube

Emulation of Hypercube Algorithms by CCC
Ascend, descend, and normal algorithms.
[Figure: hypercube dimension (0 to q – 1) plotted against algorithm steps for ascend, descend, and normal algorithms.]
Node address: cycle ID = x ($2^m$ bits), proc ID = y (m bits)
Node (x, j) is communicating along dimension j; after the next rotation, it will be linked to its dimension-(j + 1) neighbor.
Fig. 15.15 CCC emulating a normal hypercube algorithm.
[Figure: cycle x with nodes (x, j – 1), (x, j), (x, j + 1) linked along Dim j – 1, Dim j, Dim j + 1 to nodes on cycles $N_{j-1}(x)$, $N_j(x)$, $N_{j+1}(x)$.]

15.5 Shuffle and Shuffle-Exchange Networks
Fig. 15.16 Shuffle, exchange, and shuffle–exchange connectivities.
[Figure panels on nodes 000 to 111: shuffle; exchange; shuffle-exchange; unshuffle; alternate structure.]

Shuffle-Exchange Network
Fig. 15.17 Alternate views of an eight-node shuffle–exchange network.
[Figure: nodes 0 to 7 with links labeled S (shuffle) and SE (shuffle-exchange).]

Routing in Shuffle-Exchange Networks
In the $2^q$-node shuffle network, node $x = x_{q-1}x_{q-2} \ldots x_2x_1x_0$ is connected to $x_{q-2} \ldots x_2x_1x_0x_{q-1}$ (cyclic left-shift of x)
In the $2^q$-node shuffle-exchange network, node x is additionally connected to $x_{q-2} \ldots x_2x_1x_0\bar{x}_{q-1}$

Source                  01011011
Destination             11010110
Positions that differ   ^   ^^ ^

01011011  Shuffle to 10110110  Exchange to 10110111
10110111  Shuffle to 01101111
01101111  Shuffle to 11011110
11011110  Shuffle to 10111101
10111101  Shuffle to 01111011  Exchange to 01111010
01111010  Shuffle to 11110100  Exchange to 11110101
11110101  Shuffle to 11101011
11101011  Shuffle to 11010111  Exchange to 11010110

Diameter of Shuffle-Exchange Networks
For the $2^q$-node shuffle-exchange network: $D = q = \log_2 p$, $d = 4$
With shuffle and exchange links provided separately, as in Fig. 15.18, the diameter increases to 2q – 1 and node degree reduces to 3
Fig. 15.18 Eight-node network with separate shuffle and exchange links.
[Figure: nodes 0 to 7; shuffle links (S) solid, exchange links (E) dotted.]

Multistage Shuffle-Exchange Network
Fig. 15.19 Multistage shuffle–exchange network (omega network) is the same as butterfly network.
[Figure: four 8-row drawings, two with q + 1 columns and two with q columns.]

15.6 That’s Not All, Folks!
When q is a power of 2, the $2^q q$-node cube-connected cycles network derived from the q-cube, by replacing each node with a q-node cycle, is a subgraph of the $(q + \log_2 q)$-cube
CCC is a pruned hypercube
Other pruning strategies are possible, leading to interesting tradeoffs
Fig. 15.20 Example of a pruned hypercube.
• Even-dimension links are kept in the even subcube
• Odd-dimension links are kept in the odd subcube
• All dimension-0 links are kept
Parameters: $D = \log_2 p + 1$, $d = (\log_2 p + 1)/2$, $B = p/4$

Möbius Cubes
Dimension-i neighbor of $x = x_{q-1}x_{q-2} \ldots x_{i+1}x_i \ldots x_1x_0$ is:
$x_{q-1}x_{q-2} \ldots 0\,\bar{x}_i \ldots x_1x_0$ if $x_{i+1} = 0$ ($x_i$ complemented, as in q-cube)
$x_{q-1}x_{q-2} \ldots 1\,\bar{x}_i \ldots \bar{x}_1\bar{x}_0$ if $x_{i+1} = 1$ ($x_i$ and bits to its right complemented)
For dimension q – 1, since there is no $x_q$, the neighbor can be defined in two possible ways, leading to 0- and 1-Möbius cubes
A Möbius cube has a diameter of about 1/2 and an average internode distance of about 2/3 of that of a hypercube
Fig. 15.21 Two 8-node Möbius cubes. [Figure panels: 0-Möbius cube; 1-Möbius cube.]

Seas for Networks
A wide variety of direct interconnection networks have been proposed for, or used in, parallel computers
They differ in topological, performance, robustness, and realizability attributes.
Fig. 4.8 (expanded) The sea of direct interconnection networks.

Assignment 2-1
Choose one of the following two exercises:
• Compute a lower bound on the congestion when embedding a hypercube in a ring with the same number of nodes.
• Show that the i-cube contains a $2^j \times 2^k$ mesh, where i = j + k.
SJTU@Fall 2012 Parallel Processing, Low-Diameter Architectures Slide 41
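Returning to the routing example for shuffle-exchange networks (source 01011011, destination 11010110), the step-by-step trace can be reproduced with a short sketch. It assumes the simple fixed-length rule suggested by that trace: perform q shuffles, and after each one apply an exchange whenever the low-order bit disagrees with the corresponding destination bit, taken from the most significant end. The function names are illustrative, and the rule always uses q shuffles, so it is not claimed to give a shortest path.

```python
def shuffle(x: int, q: int) -> int:
    """Cyclic left shift of the q-bit label x."""
    mask = (1 << q) - 1
    return ((x << 1) & mask) | (x >> (q - 1))


def exchange(x: int) -> int:
    """Complement the least significant bit of x."""
    return x ^ 1


def route(src: int, dst: int, q: int) -> list:
    """Labels visited from src to dst using q shuffles, each followed
    by an exchange when the low-order bit must be corrected."""
    path = [src]
    x = src
    for step in range(q):
        x = shuffle(x, q)
        path.append(x)
        wanted = (dst >> (q - 1 - step)) & 1  # destination bit fixed at this step
        if (x & 1) != wanted:
            x = exchange(x)
            path.append(x)
    return path


if __name__ == "__main__":
    for label in route(0b01011011, 0b11010110, 8):
        print(format(label, "08b"))
```

Each exchange in the output corresponds to one of the four positions in which the source and destination labels differ.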