Modeling Communication Cost for Matrix Multiply in a Tree Network Cluster

Hui Li: lihui@indiana.edu
Xiaoyang Chen: xc7@imail.iu.edu

Statement of Problem

Computers behave deterministically, and for most structured parallel applications it is possible to model performance once the problem model, the machine model, and the communication model are known. Matrix multiply is an important kernel in many scientific problems and one of the most thoroughly studied algorithms in high performance computing. To simplify the problem model, we study the multiplication of square, full (dense) matrices. For the machine model, we assume a node with a fast CPU and a large memory, and we characterize it simply by its time per floating point operation, Tflops. Further, we assume the matrix multiply runs on a tree network cluster with per-element communication time Tcomm, which is the common case in many dedicated clusters. The goal of this study is to model Tcomm/Tflops, the communication overhead per double precision floating point operation, for matrix multiply on a dedicated cluster. The main difficulty of the work is modeling the communication overhead of matrix multiply over the tree network topology.

Survey

Several parallel dense matrix multiplication algorithms of the form C = a*A*B + b*C have been proposed for a mesh of processes. These algorithms can be classified into three categories according to their communication patterns. Because no single algorithm achieves the best performance for all matrix sizes on arbitrary process grids, poly-algorithms have been proposed to address this dilemma [1]. In addition, there are parallel library packages that provide a coherent set of parallel services for linear algebra programs. One advantage of building such libraries is that they can fully exploit the parallelism of a specific problem and effectively utilize the underlying parallel system while providing a portable interface specification. A potential disadvantage is that hand-coded software, tuned to a specific processor, network topology, operating system, or compiler release, can outperform portable libraries, though at great development cost. Representative libraries include the Basic Linear Algebra Subprograms (BLAS), the Linear Algebra PACKage (LAPACK), and the Scalable LAPACK (ScaLAPACK).

Architecture Design

Figure 1 shows the workflow of the Fox algorithm on a 2×2 mesh of compute nodes. In a preparing stage, matrices A and B are partitioned into 2×2 blocks and the blocks are distributed over compute nodes (0,0), (0,1), (1,0), and (1,1). The computation then proceeds in two steps; in each step every node multiplies one block of A by one block of B (for example, node (0,0) computes A(0,0)*B(0,0) in step 0 and A(0,1)*B(1,0) in step 1) and accumulates the result into its block of matrix C.

Figure 1. Workflow of the Fox algorithm on a 2×2 mesh of nodes
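To make the block schedule of Figure 1 concrete, below is a minimal sequential sketch of the Fox algorithm on a 2×2 block grid, written in Python with NumPy purely for illustration (the project itself targets CBlas/JBlas and MPI). The function name fox_2x2 and the simulated broadcast and shift are our own simplifications, not part of any library used in the project.

```python
import numpy as np

def fox_2x2(A, B):
    """Multiply square matrices A and B using the Fox block schedule on a 2x2 grid.

    Each C[i][j] plays the role of the block owned by compute node (i, j);
    the row broadcast of A and the upward shift of B are simulated with
    plain array indexing instead of real messages.
    """
    n = A.shape[0]
    m = n // 2                                    # block dimension on a 2x2 grid

    def blk(M, i, j):                             # the (i, j) block of matrix M
        return M[i * m:(i + 1) * m, j * m:(j + 1) * m]

    C = [[np.zeros((m, m)) for _ in range(2)] for _ in range(2)]
    for step in range(2):                         # sqrt(P) = 2 steps on a 2x2 grid
        for i in range(2):
            k = (i + step) % 2                    # A(i, k) is broadcast along row i
            for j in range(2):
                # e.g. node (0,0): A(0,0)*B(0,0) at step 0, A(0,1)*B(1,0) at step 1
                C[i][j] += blk(A, i, k) @ blk(B, k, j)
    return np.block(C)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.random((4, 4)), rng.random((4, 4))
    assert np.allclose(fox_2x2(A, B), A @ B)      # matches the direct product
```

In a distributed run the row broadcast of A blocks and the column shift of B blocks become real messages; those transfers are what the communication terms of the timing model in the validation section account for.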
Implementation timeline

1. Survey: milestone 1
   a. Paper readings
   b. Proposal (Oct. 31)
2. Matrix multiply in the shared memory parallelism model
   a. Port CBlas and JBlas into the matrix multiply code
   b. Multi-threaded version
   c. Performance study
3. Matrix multiply in the distributed memory parallelism model
   a. Port CBlas and JBlas into OpenMPI, Twister, and DryadLINQ
   b. ScaLAPACK
   c. Performance analysis
4. Report

Validation methods

1. Performance analysis
   a. Absolute performance metric: Mflop/s (see Figure 2)
   b. Relative performance metric: parallel efficiency

Figure 2. Absolute performance of matrix multiply

2. Timing model
   a. A timing model to analyze Tcomm/Tflops, the communication overhead per double precision floating point operation
   b. Compare the predicted job turnaround time with the measured job turnaround time:

   T = \sqrt{N} \left( \sqrt{N} \left( T_{lat} + m^2 (T_{io} + T_{comm}) \right) + 2 m^3 T_{flops} \right)

   where N is the number of compute nodes and m is the block dimension.

References

[1] A poly-algorithm for parallel dense matrix multiplication on two-dimensional process grid topologies.
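To show how the timing model in the validation section can be exercised, the sketch below evaluates the predicted turnaround time T for assumed parameter values. Every numeric value here (node count, block size, latency, I/O, communication, and per-flop times) is a placeholder chosen for illustration, not a measurement from this project.

```python
import math

def predicted_turnaround(N, m, t_lat, t_io, t_comm, t_flops):
    """T = sqrt(N) * ( sqrt(N)*(Tlat + m^2*(Tio + Tcomm)) + 2*m^3*Tflops )."""
    steps = math.sqrt(N)                                    # sqrt(N) steps on the grid
    comm_per_step = steps * (t_lat + m ** 2 * (t_io + t_comm))
    comp_per_step = 2 * m ** 3 * t_flops
    return steps * (comm_per_step + comp_per_step)

if __name__ == "__main__":
    # Placeholder values, not measurements: 4 nodes, 1000x1000 blocks, times in seconds.
    T = predicted_turnaround(N=4, m=1000, t_lat=1e-4,
                             t_io=1e-8, t_comm=1e-8, t_flops=1e-9)
    print(f"predicted turnaround: {T:.3f} s")
```

Predictions of this kind would then be compared against the measured job turnaround time, as proposed in validation item 2b.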