Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage

Ching-Tien Ho, S. Lennart Johnsson, Alan Edelman

TR-19-91, April 1991
Parallel Computing Research Group
Center for Research in Computing Technology
Harvard University, Cambridge, Massachusetts

To appear in Proceedings of the 6th Distributed Memory Computing Conference.

Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage

Ching-Tien Ho, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, ho@ibm.com
S. Lennart Johnsson, Harvard University and Thinking Machines Corp., Cambridge, MA, johnsson@think.com
Alan Edelman, Dept. of Mathematics, Univ. of California at Berkeley, Berkeley, CA 94720, edelman@math.berkeley.edu

Abstract

For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place, a processor must receive about $P^2/\sqrt{N}$ elements of each input operand, with operands of size $P \times P$ distributed evenly over $N$ processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to $P^2/(\sqrt{N}\log N)$ for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity $2P^3$, and parallel arithmetic complexity $2P^3/N$. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.

1 Introduction

The multiplication of matrices is an important operation in many computationally intensive scientific applications. Effective use of the communication bandwidth is critical for maximum performance. The communication needs are minimized by a good choice of address map, i.e., data placement, and by routing algorithms that minimize path lengths and congestion once an address map is given. We consider matrix multiplication on a Boolean n-cube. With sufficiently high data motion capability at each node, communication may be performed on all ports concurrently, and the full communications bandwidth of the network can be used.

Cannon [2] has given an algorithm for the multiplication of square matrices on two-dimensional meshes. Since a two-dimensional mesh is a subgraph of a Boolean cube [12], [5], it is possible to use Cannon's algorithm on a Boolean cube by emulating a mesh [6]. Dekel, Nassimi and Sahni [3] have described an algorithm (termed the DNS algorithm hereafter) for the multiplication of square matrices on a Boolean cube. Both the DNS and Cannon's algorithms assume that the number of matrix elements is equal to the number of processors. A generalization of Cannon's algorithm to matrices of arbitrary shapes and sizes is given in [7] and [8]. Cannon's algorithm may use up to four (unidirectional) communication channels per processor concurrently. The DNS algorithm only uses two (bidirectional) channels at a time. The algorithm presented below concurrently uses all $\log_2 N$ (bidirectional) channels of an $N$-processor Boolean cube, while preserving constant storage, i.e., $O(P^2/N)$ for $P \times P$ matrices.

The paper is organized as follows. In the next section, we introduce a few basic results regarding the specific communication operations used in the multiplication algorithms. In Section 3, we generalize the DNS algorithm to the multiplication of two $P \times P$ matrices on a Boolean n-cube, configured as a product cube of $\sqrt{N} \times \sqrt{N}$ processors, and show how the Boolean cube bandwidth can be fully utilized. We conclude in Section 4.

2 Preliminaries

In the following, $\log$ denotes $\log_2$. The bit-wise exclusive-or operation is denoted $\oplus$, and $Z_n = \{0, 1, \ldots, n-1\}$. Also, $\sigma^n$ denotes a string of $n$ instances of $\sigma$, where $\sigma$ is either 0 or 1. Let $n' = n/2$ throughout the paper. We consider matrix multiplication $C \leftarrow A \times B$ on a Boolean n-cube, where $A$, $B$ and $C$ are $P \times P$ matrices, $N = 2^n$, $n$ is even, and $P \ge n'\sqrt{N}$. For clarity, we assume $P$ is a multiple of $n'\sqrt{N}$. Note that this assumption is made only to simplify the complexity analysis. The element in row $i$ and column $j$ of matrix $A$ is $a(i,j)$, $i, j \in Z_P$; $b(i,j)$ and $c(i,j)$ are similarly defined.

Let $S(1,0) = (0)$ and $S(n,0) = S(n-1,0)\,|\,n-1\,|\,S(n-1,0)$ for $n > 1$, where "$|$" is the concatenation operator of two sequences. For instance, $S(3,0) = (0, 1, 0, 2, 0, 1, 0)$. $S(n,0)$ is the transition sequence of a binary-reflected n-bit Gray code [14]. If $S(n,0) = (x_1, x_2, \ldots, x_{2^n-1})$, then we can shift modulo $n$ to define

$$S(n,s) = ((x_1+s) \bmod n,\ (x_2+s) \bmod n,\ \ldots,\ (x_{2^n-1}+s) \bmod n), \quad 0 \le s < n.$$

For instance, $S(3,1) = (1, 2, 1, 0, 1, 2, 1)$. Let $\delta(t,n,s)$ be the $t$-th element of the sequence $S(n,s)$, $1 \le t \le 2^n - 1$.
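As a concrete illustration of these definitions, the following Python fragment is a minimal sketch (it is not part of the paper; the names S and delta merely mirror the notation above). It generates the transition sequences and checks the Gray-code property used later by the multiplication phase.

```python
def S(n, s=0):
    """Transition sequence S(n,s): the binary-reflected n-bit Gray code
    transition sequence S(n,0), with every dimension shifted by s modulo n."""
    seq = [0] if n == 1 else S(n - 1) + [n - 1] + S(n - 1)   # S(n,0) = S(n-1,0) | n-1 | S(n-1,0)
    return [(x + s) % n for x in seq]

def delta(t, n, s):
    """delta(t,n,s): the t-th element of S(n,s), 1 <= t <= 2**n - 1."""
    return S(n, s)[t - 1]

assert S(3, 0) == [0, 1, 0, 2, 0, 1, 0]        # examples from the text
assert S(3, 1) == [1, 2, 1, 0, 1, 2, 1]

# Complementing bits according to S(n,0) enumerates all n-bit integers exactly once.
g, seen = 0, {0}
for t in range(1, 2**3):
    g ^= 1 << delta(t, 3, 0)
    seen.add(g)
assert seen == set(range(2**3))
```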
The communication times are measured by the number of elements transferred in sequence. Concurrent communication on all ports of all processors is assumed possible. All communication links are bidirectional.

A particular communication pattern that is used for one phase of the matrix multiplication algorithm is bit-inversion [15]. A bit-inversion in an n-cube implies that processor $i$ sends its data to processor $\bar{i}$ for all $i$, where $\bar{i}$ is the bit-complement of $i$.

Lemma 1 [15] A tight bound for bit-inversion with $K$ elements per processor on an n-cube is $K$.

Proof: Each of the $K$ elements in each of the $N$ processors must traverse $n$ links, so the required bandwidth is $nNK$, while the available bandwidth is $nN$; this gives the lower bound $K$. An upper bound equal to the lower bound is given by the following algorithm. Divide the local data set into $n$ parts, and exchange part $i$, $0 \le i \le n-1$, according to the sequence of dimensions $i, (i+1) \bmod n, \ldots, (i+n-1) \bmod n$. All $n$ data sets can be exchanged concurrently without edge conflict.

In general, with bit-inversion on only a subset of the processor address bits of every processor, the communication requirements are reduced, but not the lower bound.

Lemma 2 [9] Any tight bound for a communication on a Boolean $\hat{n}$-cube is also a tight bound for the same communication performed in all disjoint $\hat{n}$-dimensional subcubes of an n-dimensional cube, $n > \hat{n}$, when the subcubes are identified by the same $\hat{n}$ dimensions.

The significance of this lemma is that even though only a fraction $\hat{n}/n$ of the total bandwidth of the n-cube is used, the communication time cannot be reduced when the communication in each subcube is optimal. Hence, in the case of the same bit-inversion performed on a subset of the bits of the address space, the tight lower bound is still $K$ for $K$ elements per processor. The bits subject to inversion define a subcube, and the bits not inverted define the disjoint instances of the subcubes in which the inversion is performed.
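The schedule in the proof of Lemma 1 can be checked by a small serial simulation. The sketch below is illustrative only (the function name and data layout are not from the paper): it moves the n parts of every processor's data along the rotated dimension sequences and verifies that no two parts leave a processor through the same port in the same step, and that every part reaches the bit-complement processor after n steps.

```python
def bit_inversion_schedule(n):
    """Simulate the n-part bit-inversion schedule of Lemma 1 on an n-cube."""
    N = 1 << n
    # location[(q, i)] = current processor of part i originating at processor q
    location = {(q, i): q for q in range(N) for i in range(n)}
    for t in range(n):
        used = set()                       # (processor, dimension) pairs used in this step
        for (q, i), r in list(location.items()):
            d = (i + t) % n                # part i moves along dimension (i + t) mod n
            assert (r, d) not in used      # no two parts leave r through the same port
            used.add((r, d))
            location[(q, i)] = r ^ (1 << d)
    # after n steps each part has crossed every dimension exactly once
    assert all(r == q ^ (N - 1) for (q, i), r in location.items())

bit_inversion_schedule(4)
```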
3 A block algorithm

In this section we first describe the DNS algorithm, which assumes $P \times P$ matrices distributed over $P^2$ processors. We then generalize it to the multiplication of $P \times P$ matrices distributed uniformly over an n-cube, configured as $\sqrt{N} \times \sqrt{N}$, where $P \ge \sqrt{N}$. Finally, we present an algorithm that uses all communication channels of an n-cube when $P \ge n'\sqrt{N}$. For notational convenience, we assume $P$ is a power of two and put $P = 2^p$.

3.1 The DNS algorithm

The DNS algorithm [3] assumes that $A$ and $B$ are $P \times P$ matrices and that the number of Boolean cube processors is $P^2$. The algorithm consists of two phases: alignment and multiplication.

Alignment:
$$a(i,j) \leftarrow a(i,\ i \oplus j), \quad \forall i, j \in Z_P,$$
$$b(i,j) \leftarrow b(i \oplus j,\ j), \quad \forall i, j \in Z_P.$$

Multiplication, step $t$, $0 \le t \le P-1$:
$$a(i,j) \leftarrow a(i,\ j \oplus 2^{\delta(t,p,0)}), \quad \text{if } t \ne 0, \quad \forall i, j \in Z_P,$$
$$b(i,j) \leftarrow b(i \oplus 2^{\delta(t,p,0)},\ j), \quad \text{if } t \ne 0, \quad \forall i, j \in Z_P,$$
$$c(i,j) \leftarrow a(i,j) \cdot b(i,j) + c(i,j), \quad \forall i, j \in Z_P.$$

With one element per processor and $(i,j)$ being a processor address, the column index of the element of $A$ equals the row index of the element of $B$ in every processor $(i,j)$ after the alignment phase, and after each step of the multiplication phase. Moreover, for any integer, complementing the bits of its binary encoding according to the transition sequence of a binary-reflected Gray code, such as the sequence $S(p,0)$ for a p-bit number, produces every integer that can be encoded in $p$ bits precisely once. Hence, during the course of the algorithm, processor $(i,j)$ receives all the elements of row $i$ of matrix $A$ and of column $j$ of matrix $B$, appropriately synchronized.

Replacing $\delta(t,p,0)$ by $\delta(t,p,s)$, $1 \le s \le p-1$, yields a matrix multiplication algorithm that for each $t$ performs an exchange in dimension $(\delta(t,p,0)+s) \bmod p$ instead of dimension $\delta(t,p,0)$. This observation is the basis for defining an algorithm that fully uses the communications bandwidth of the Boolean cube.
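The invariants claimed above are easy to check mechanically. The following sketch (illustrative, not the paper's implementation) tracks only the indices held by each processor $(i,j)$ under the alignment and the Gray-code exchanges, and verifies that the column index of the $A$ element always equals the row index of the $B$ element, and that each value in $Z_P$ occurs exactly once.

```python
p = 3
P = 1 << p

def S(n):                                    # transition sequence of Section 2
    return [0] if n == 1 else S(n - 1) + [n - 1] + S(n - 1)

seq = S(p)                                   # seq[t-1] = delta(t, p, 0)

for i in range(P):
    for j in range(P):
        acol = i ^ j                         # alignment: a(i,j) <- a(i, i XOR j)
        brow = i ^ j                         # alignment: b(i,j) <- b(i XOR j, j)
        seen = set()
        for t in range(P):
            if t > 0:                        # exchange along cube dimension delta(t, p, 0)
                acol ^= 1 << seq[t - 1]
                brow ^= 1 << seq[t - 1]
            assert acol == brow              # operands stay synchronized
            seen.add(acol)
        assert seen == set(range(P))         # row i of A meets column j of B completely
```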
3.2 Naive extension

Each processor holds a $2^{p-n'} \times 2^{p-n'}$ submatrix (consecutive assignment [6]). The data assignment is defined by the address map

$$(\underbrace{w^r_{p-1} w^r_{p-2} \cdots w^r_{p-n'}}_{rp^r}\ \underbrace{w^r_{p-n'-1} \cdots w^r_0}_{vp^r}\ \big|\ \underbrace{w^c_{p-1} w^c_{p-2} \cdots w^c_{p-n'}}_{rp^c}\ \underbrace{w^c_{p-n'-1} \cdots w^c_0}_{vp^c}).$$

All operands have corresponding address maps. Virtual processor address bits (labeled vp) define local storage addresses, whereas the real processor address bits (labeled rp) define different physical processors. The superscripts "r" and "c" denote "row" and "column", respectively. An exchange operation defined by an exclusive-or on virtual processor address bits reorders data in the local storage of all processors, but there is no exchange between real processors. An exclusive-or operation on bits in the real processor field implies an exchange of all data between pairs of processors; the local address map is preserved.

Lemma 3 The alignment on the bits in the virtual processor address field, and the steps of the multiplication phase corresponding to bits in this address field, define a complete matrix multiplication on blocks of size $2^{p-n'} \times 2^{p-n'}$.

Lemma 3 follows from the recursive nature of the binary-reflected Gray code. This block matrix multiplication can be replaced by any suitable matrix multiplication algorithm in each node, without affecting the part of the algorithm requiring interprocessor communication. For instance, a block, matrix-vector, or SAXPY [13] based algorithm may be used, depending on the architecture of each node.

Theorem 1 The multiplication of two square matrices of size $P \times P$ on an n-cube, $2p \ge n$, can be performed by applying the DNS algorithm [3] to the real processor address field, and by employing any suitable matrix multiplication algorithm for the local blocks of size $2^{p-n'} \times 2^{p-n'}$, assuming consecutive assignment of matrix elements to real processors.

The time complexity of the algorithm is:
1. Communication: Alignment: $n'\frac{P^2}{N}$. Multiplication: $(\sqrt{N}-1)\frac{P^2}{N}$.
2. Arithmetic: $\frac{2P^3}{N}$.

The alignments of the matrices $A$ and $B$ are assumed to take place concurrently in the above estimates. The arithmetic time is reduced in proportion to the number of processors, but the largest communication term only in proportion to the square root of the number of processors. The data motion for the matrix $A$ only uses one cube dimension per processor, and so does the data motion for $B$; a total of two cube dimensions are used by each processor in each step of the multiplication algorithm. The communications capability of Boolean cubes of many dimensions is poorly utilized.
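A serial emulation of this naive extension, with numpy standing in for the local block arithmetic, is sketched below. It is illustrative only (dictionaries indexed by processor coordinates replace real interprocessor communication, and all names are assumptions of this sketch): the DNS alignment and Gray-code exchanges are applied to the real processor field, as in Theorem 1, and the result is checked against a direct product.

```python
import numpy as np

def S(n):
    return [0] if n == 1 else S(n - 1) + [n - 1] + S(n - 1)

def block_dns_matmul(A, B, n):
    """Emulate the naive extension on 2**n processors in a 2**(n/2) x 2**(n/2) grid."""
    Q = 1 << (n // 2)                        # sqrt(N) processor rows and columns
    P = A.shape[0]
    bs = P // Q                              # local block size (consecutive assignment)
    blk = lambda M, k, l: M[k*bs:(k+1)*bs, l*bs:(l+1)*bs].copy()
    # alignment phase applied to the real processor address field
    a = {(k, l): blk(A, k, k ^ l) for k in range(Q) for l in range(Q)}
    b = {(k, l): blk(B, k ^ l, l) for k in range(Q) for l in range(Q)}
    c = {(k, l): np.zeros((bs, bs)) for k in range(Q) for l in range(Q)}
    seq = S(n // 2)
    for t in range(Q):                       # multiplication phase, sqrt(N) steps
        if t > 0:
            d = 1 << seq[t - 1]              # exchange along dimension delta(t, n', 0)
            a = {(k, l): a[(k, l ^ d)] for k in range(Q) for l in range(Q)}
            b = {(k, l): b[(k ^ d, l)] for k in range(Q) for l in range(Q)}
        for k in range(Q):
            for l in range(Q):
                c[(k, l)] += a[(k, l)] @ b[(k, l)]    # local block multiply-add
    C = np.zeros((P, P))
    for k in range(Q):
        for l in range(Q):
            C[k*bs:(k+1)*bs, l*bs:(l+1)*bs] = c[(k, l)]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
assert np.allclose(block_dns_matmul(A, B, 4), A @ B)    # n = 4, i.e., N = 16 processors
```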
3.3 A block algorithm using all cube dimensions

By partitioning the matrix $A$ into $1 \times n'$ blocks and the matrix $B$ into $n' \times 1$ blocks, the matrix multiplication is transformed into $n'$ rank-$\frac{P}{n'}$ updates. The idea in the algorithm below is to perform the communication for the different high-rank updates concurrently. That is, $n'$ communication channels per processor are used for each of $A$ and $B$, and the full communications bandwidth is used. We refer to the $P \times \frac{P}{n'}$ blocks of $A$ and the $\frac{P}{n'} \times P$ blocks of $B$ as big blocks, in order to distinguish this blocking from the blocks assigned to individual processors, the small blocks. The naive block algorithm, modified as described below, is used for the multiplication of each pair of big blocks.

3.3.1 Data allocation

Each big block is allocated to the processors with consecutive assignment [6] (as in the preceding section). The address map for $A$ is

$$(\underbrace{w^r_{p-1} \cdots w^r_{p-n'}}_{rp^r}\ \underbrace{w^r_{p-n'-1} \cdots w^r_0}_{vp0^r}\ \big|\ \underbrace{w^c_{p-1} \cdots w^c_{p-\gamma}}_{vp1^c}\ \underbrace{w^c_{p-\gamma-1} \cdots w^c_{p-\gamma-n'}}_{rp^c}\ \underbrace{w^c_{p-\gamma-n'-1} \cdots w^c_0}_{vp0^c}),$$

and for $B$ it is

$$(\underbrace{w^r_{p-1} \cdots w^r_{p-\gamma}}_{vp1^r}\ \underbrace{w^r_{p-\gamma-1} \cdots w^r_{p-\gamma-n'}}_{rp^r}\ \underbrace{w^r_{p-\gamma-n'-1} \cdots w^r_0}_{vp0^r}\ \big|\ \underbrace{w^c_{p-1} \cdots w^c_{p-n'}}_{rp^c}\ \underbrace{w^c_{p-n'-1} \cdots w^c_0}_{vp0^c}),$$

assuming that $n'$ is a power of two and $\gamma = \log n'$. This assumption is only made for notational convenience in the address map. The small blocks are defined by the fields labeled vp0. The small block size for $A$ is $\frac{P}{\sqrt{N}} \times \frac{P}{n'\sqrt{N}}$, and for $B$ it is $\frac{P}{n'\sqrt{N}} \times \frac{P}{\sqrt{N}}$. Big blocks are identified by the field labeled vp1; the concatenated rp and vp0 fields define the big blocks. Each such block is distributed uniformly over all $\sqrt{N} \times \sqrt{N}$ processors. By Lemma 3, the alignment and subsequent exchange and multiplication operations related to the vp0^c field of $A$ and the vp0^r field of $B$ define a block matrix multiplication local to every processor. Note that the lengths of the two fields vp0^c and vp0^r are the same. The exchange on the vp1 field is a local memory move. This exchange implies that in the next several steps a new pair of big blocks will be multiplied.

3.3.2 Alignment

The dimension of the least significant bit is zero. The number of 1-bits in the binary representation of $i$ is $\|i\|$, and $|S|$ denotes the cardinality of a set $S$. Let $D(i)$ be the ordered set of dimensions (in increasing order) for which the corresponding bits of the binary representation of $i$ are one. For example, $D((10110)) = \{1, 2, 4\}$. Clearly, $|D(i)| = \|i\|$.

The alignment is performed on the processor address fields alone, i.e., after the alignment, processor $(k,\ell)$ has column indices $(\underbrace{*}_{vp1^c}\ \underbrace{k \oplus \ell}_{rp^c}\ \underbrace{*}_{vp0^c})$ of the matrix $A$, and row indices $(\underbrace{*}_{vp1^r}\ \underbrace{k \oplus \ell}_{rp^r}\ \underbrace{*}_{vp0^r})$ of the matrix $B$, where $*$ denotes all numbers that can be represented by that bit-field. The matrices are properly aligned. For a processor in row $k$ and column $\ell$, the alignment of $A$ involves the set of cube dimensions $D(k\,|\,0^{n'})$ (the higher-order $n'$ cube dimensions are used for the encoding of rows) and the alignment of $B$ involves the set of cube dimensions $D(\ell)$. Clearly, $D(k\,|\,0^{n'}) \cap D(\ell) = \emptyset$.

Note that the set of dimensions involved in the alignment operation does not depend on the id of the big block. The number of processor dimensions involved in the alignment of $A$ is $|D(k\,|\,0^{n'})| = \|k\|$ for processor row $k$; the number of dimensions involved in the alignment of $B$ is $\|\ell\|$ for processor column $\ell$. The data volume that needs to be communicated per processor is $\frac{P^2}{N}$ for each of $A$ and $B$. The naive block algorithm does not fully use the communication bandwidth of the Boolean cube. We constrain the alignment of a row to be confined to its row subcube, and the alignment of a column to be confined to its column subcube. By Lemma 1 and Lemma 2, the minimum number of element transfers in sequence under this constraint is $\frac{P^2}{N}$ for each of $A$ and $B$. The alignment of each operand is thus sped up by a factor of $n'$ through concurrent communication within subcubes, compared to the algorithm in the preceding section.

Lemma 4 A lower bound for the alignment of $A$ and $B$ on a Boolean n-cube is $\frac{P^2}{N}$.

Proof: Consider the $\frac{N}{4}$ processors in rows $1*^{n'-1}$ and columns $1*^{n'-1}$ (addresses with the leading bit equal to one), i.e., the processors to which the lower right quarter submatrix of each operand is allocated. These $\frac{N}{4}$ processors form an $(n-2)$-dimensional subcube. Each processor in the subcube needs to exchange $\frac{P^2}{N}$ elements with the subcube storing the lower left quarter submatrix of $A$, and $\frac{P^2}{N}$ elements with the subcube storing the upper right quarter submatrix of $B$. The total number of elements that must be sent out of the subcube is $\frac{P^2}{2}$, while the total number of links that connect to processors outside the subcube is $2\frac{N}{4}$. Hence at least $\frac{P^2/2}{N/2} = \frac{P^2}{N}$ element transfers in sequence are required.

3.3.3 Multiplication

Lemma 5 [11] A lower bound for the data transfer time of the matrices $A$ and $B$ during the multiplication phase is $\frac{P^2}{n'N}(\sqrt{N}-1)$.

Proof: Every processor needs to receive $\frac{P^2}{N}$ elements of $A$ from each of $(\sqrt{N}-1)$ processors. The lower bound for this all-to-all broadcasting within row subcubes is $\frac{P^2}{n'N}(\sqrt{N}-1)$ [11]. But, since all row subcubes perform the same communication and are fully utilized for this lower bound, the subcube lower bound is also the total lower bound by Lemma 2. The bound for $B$ is derived similarly, and since the sets of dimensions used for the broadcasting of $A$ and $B$ are disjoint, the lemma follows.

For the multiplication phase, the binary-reflected Gray code exchange sequence accomplishes an all-to-all broadcasting [11] within columns for $B$, and within rows for $A$. Any sequence with this property, applied to both $A$ and $B$ in the same order, is acceptable. The exchange sequence $S(n',s)$, $1 \le s \le n'-1$, is as appropriate as $S(n',0)$. It follows that $n'$ pairs of blocks can be exchanged concurrently. The total data transfer time for the multiplication phase is $\frac{P^2}{n'N}(\sqrt{N}-1)$ for each of the matrices $A$ and $B$.

For the rank-$\frac{P}{n'}$ algorithm, big block column $m$ of $A$ multiplies big block row $m$ of $B$. All small blocks of big block $m$ of $A$ and $B$ are subject to the exchange sequence $S(n',m)$. Let $A(k,\ell,m)$ be the small block assigned to processor $(k,\ell)$ of big block $m$ of matrix $A$. The multiplication phase for big block column $m$ of $A$ and big block row $m$ of $B$ involves the data motion defined by

$$A(k,\ell,m) \leftarrow A(k,\ \ell \oplus 2^{\delta(t,n',m)},\ m), \quad \forall m \in Z_{n'},\ \forall k,\ell \in Z_{\sqrt{N}},\ \text{concurrently},$$
$$B(k,\ell,m) \leftarrow B(k \oplus 2^{\delta(t,n',m)},\ \ell,\ m), \quad \forall m \in Z_{n'},\ \forall k,\ell \in Z_{\sqrt{N}},\ \text{concurrently}.$$

The index for time, $t$, ranges from 1 to $\sqrt{N}-1$. Note that the communication for the exchanges of $A$ and $B$ can be performed concurrently. Moreover, since $\delta(t,n',m_1) \ne \delta(t,n',m_2)$ for $m_1 \ne m_2$ and all $t$, the communication can also be performed concurrently for all $m \in Z_{n'}$. In any communication step all cube dimensions are used.

Theorem 2 The data transfer time for the described algorithm with concurrent communication on all ports of a Boolean n-cube is $\frac{P^2}{N} + \frac{P^2}{n'N}(\sqrt{N}-1)$, which is optimal within a small constant factor, with the operands distributed uniformly over the processors configured as a product of two $n'$-cubes.
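The concurrency claim is easy to verify for a small case. The sketch below is illustrative (it only inspects the schedules, not the data): it checks that at every step the $n'$ shifted sequences prescribe pairwise distinct column-field dimensions for the big blocks of $A$, so that, together with the corresponding row-field dimensions used for $B$, all $n$ cube dimensions are busy.

```python
def S(n, s=0):
    seq = [0] if n == 1 else S(n - 1) + [n - 1] + S(n - 1)
    return [(x + s) % n for x in seq]

n1 = 3                                       # n' = n/2; the cube has n = 2*n1 dimensions
for t in range(1, 2**n1):                    # t = 1, ..., sqrt(N) - 1
    dims_A = {S(n1, m)[t - 1] for m in range(n1)}    # dimension used by big block m of A
    dims_B = {d + n1 for d in dims_A}                # row-field dimensions used for B
    assert len(dims_A) == n1                 # no two big blocks contend for a channel
    assert dims_A | dims_B == set(range(2 * n1))     # all n dimensions used every step
```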
4 Concluding Remarks

We have presented an algorithm for multiplying two $P \times P$ matrices on a Boolean n-cube, where $n$ is even and $P \ge n\sqrt{N}/2$. The algorithm has a parallel arithmetic complexity of $2P^3/N$, a communication complexity $< \frac{P^2}{N} + \frac{2P^2}{n\sqrt{N}}$, and the minimal storage requirement $O(P^2/N)$. The previous DNS algorithm, while having the same arithmetic complexity and minimal storage requirement, has a communication complexity a factor of $n/2$ higher than our algorithm. Our algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.

Note that if the storage is sufficiently large to allow the all-to-all broadcasting within rows and columns to be performed by spanning-tree algorithms, then $n'$ steps suffice, and the communications bandwidth can be fully utilized [11]. But the storage requirement per processor is then proportional to $P^2/\sqrt{N}$, i.e., a factor of $\sqrt{N}$ higher than for the algorithm presented here, and is unlikely to be useful in practice.

It should also be noted that the algorithm presented here can be generalized to non-square matrices distributed over an N-processor cube configured as $N_1 \times N_2$. The choice of $N_1 \times N_2$ depends on the aspect ratios of the two input matrices. See [10] for a detailed discussion.

It is also possible to generalize Cannon's algorithm such that the full communication bandwidth of the cube is used. The generalization can be made since there exist $n$ edge-disjoint Hamiltonian cycles in a 2n-cube [4], [1], [16]. The matrices are assigned to the processors by a two-level partitioning, as in the algorithm described here, and the storage per processor is the same as for our algorithm. However, the local control at each processor is more complicated, because the known method for constructing the $n$ edge-disjoint Hamiltonian cycles is quite complex (double recursion), and the path encoding is complicated.
References

[1] Jacques Aubert and Bernadette Schneider. Decomposition de la somme cartesienne d'un cycle et de l'union de deux cycles hamiltoniens en cycles hamiltoniens. Discrete Mathematics, 38:7-16, 1982.

[2] L.E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State Univ., 1969.

[3] Eliezer Dekel, David Nassimi, and Sartaj Sahni. Parallel matrix and graph algorithms. SIAM J. Computing, 10:657-673, 1981.

[4] Marsha Foregger. Hamiltonian decompositions of products of cycles. Discrete Mathematics, 24:251-260, 1978.

[5] Geoffrey C. Fox, S.W. Otto, and A.J.G. Hey. Matrix algorithms on a hypercube I: Matrix multiplication. Technical Report Caltech Concurrent Computation Project Memo 206, California Institute of Technology, Dept. of Theoretical Physics, October 1985.

[6] S. Lennart Johnsson. Communication efficient basic linear algebra computations on hypercube architectures. J. Parallel Distributed Computing, 4(2):133-172, April 1987.

[7] S. Lennart Johnsson. Data parallel programming and basic linear algebra subroutines. In John R. Rice, editor, Mathematical Aspects of Scientific Software, volume 14 of the IMA series, pages 183-196. Springer Verlag, 1987. Also available as YALEU/DCS/RR-584, September 1987, and Technical Report DP87-1 (TMC-64), Thinking Machines Corp., 1987.

[8] S. Lennart Johnsson and Ching-Tien Ho. Algorithms for multiplying matrices of arbitrary shapes using shared memory primitives on a Boolean cube. Technical Report YALEU/DCS/RR-569, Dept. of Computer Science, Yale University, October 1987.

[9] S. Lennart Johnsson and Ching-Tien Ho. Shuffle permutations on Boolean cubes. Technical Report YALEU/DCS/RR-653, Dept. of Computer Science, Yale University, October 1988.

[10] S. Lennart Johnsson and Ching-Tien Ho. Multiplication of arbitrarily shaped matrices using the full communications bandwidth on Boolean cubes. Technical Report YALEU/DCS/RR-721, Dept. of Computer Science, Yale University, July 1989.

[11] S. Lennart Johnsson and Ching-Tien Ho. Spanning graphs for optimum broadcasting and personalized communication in hypercubes. IEEE Trans. Computers, 38(9):1249-1268, September 1989.

[12] S. Lennart Johnsson and Peggy Li. Solution set for AMA/CS 146. Technical Report 5085:DF:83, California Institute of Technology, May 1983.

[13] C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh. Basic Linear Algebra Subprograms for Fortran usage. ACM TOMS, 5(3):308-323, September 1979.

[14] E.M. Reingold, J. Nievergelt, and N. Deo. Combinatorial Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1977.

[15] Quentin F. Stout and Bruce Wagar. Passing messages in link-bound hypercubes. In Michael T. Heath, editor, Hypercube Multiprocessors 1987. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1987.

[16] Alan Wagner. Personal communication, 1988.