MATRIX MULTIPLY WITH DRYAD
B649 Course Project

Introduction: Matrix Multiplication
• Fundamental kernel algorithm used by many applications
• Examples: graph theory, physics, electronics
• Scalability issues:
  • Run on a single machine:
    • Memory overhead grows as N^2
    • CPU overhead grows as N^3
  • Run on multiple machines:
    • Communication overhead grows as N^2
[Chart: memory overhead and CPU overhead versus matrix size N, log scale, N up to 1500]

Matrix Multiply Approaches (by programming model)
• Sequential
  • Algorithm: naïve approach, tiled matrix multiply, blas_dgemm
  • Customized libraries: vendor-supplied packages (e.g., Intel, AMD BLAS), ATLAS
  • User implementation: Fortran, C, C++, C#, Java
• Shared memory parallelism
  • Algorithm: Row Partition
  • Customized libraries: ATLAS
  • User implementation: multi-threads, TPL, PLINQ, OpenMP
• Distributed memory parallelism
  • Algorithm: Row Column Partition, Fox Algorithm
  • Customized libraries: ScaLAPACK
  • User implementation: OpenMPI, Twister, Dryad

Why DryadLINQ?
• Dryad is a general-purpose runtime that supports processing of data-intensive applications on Windows
• DryadLINQ is a high-level programming language and compiler for Dryad
• Applicability:
  • Dryad transparently deals with parallelism, scheduling, fault tolerance, messaging, and workload balancing
  • SQL-like interface on the .NET platform makes the code easy to write
• Performance:
  • Intelligent job execution engine with an optimized execution plan
  • Scales out to thousands of machines

Parallel Algorithms for Matrix Multiplication
• MM algorithms can work with matrices distributed on rectangular grids
• No single algorithm always achieves the best performance across different matrix and grid shapes
• MM algorithms can be classified by their communication primitives:
  • Row Partition
  • Row Column Partition
  • Fox Algorithm (BMR): broadcast, multiply, roll up

Row Partition
• Heavy communication overhead
• Large memory usage per node
• The full matrix B is copied to every node
• The matrix A row blocks are distributed, one to each node
• Pseudocode sample (see the C# compute sketch after the Row Column Partition slide):
    Partition matrix A by rows
    Broadcast matrix B
    Distribute the matrix A row blocks
    Compute the matrix C row blocks

Row Column Partition
• Heavy communication overhead
• Scheduling overhead for each iteration
• Moderate memory usage
[Diagram: matrix A partitioned into row blocks 1..m, matrix B into column blocks 1..n, matrix C into blocks (i,j); one iteration per column block, distributed across Node 1 .. Node n]
• Pseudocode sample:
    Partition matrix A by rows
    Partition matrix B by columns
    For each iteration i:
      broadcast matrix A row block i
      distribute the matrix B column blocks
      compute the matrix C row blocks
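The sketch below illustrates the compute step of the Row Partition approach in C#: each node holds one block of A's rows plus the full matrix B and produces the matching rows of C. It is a minimal sketch, not the project or DryadLINQ code; the class and method names (RowPartitionKernel, MultiplyRowBlock) are hypothetical.

    // Minimal Row Partition sketch (hypothetical names, not the course code):
    // each node receives a block of rows of A plus the full matrix B and
    // computes the matching rows of C with a plain triple loop.
    static class RowPartitionKernel
    {
        public static double[][] MultiplyRowBlock(double[][] aRowBlock, double[][] b)
        {
            int rows = aRowBlock.Length;   // rows of A assigned to this node
            int n    = b.Length;           // shared dimension
            int cols = b[0].Length;

            var cRowBlock = new double[rows][];
            for (int i = 0; i < rows; i++)
            {
                cRowBlock[i] = new double[cols];
                for (int k = 0; k < n; k++)        // i-k-j order keeps the B row access sequential
                {
                    double aik = aRowBlock[i][k];
                    for (int j = 0; j < cols; j++)
                        cRowBlock[i][j] += aik * b[k][j];
                }
            }
            return cRowBlock;
        }
    }

In the DryadLINQ version, a call like this would be the body each vertex runs over its partition of A's rows; the surrounding query handles distributing the row blocks and broadcasting B. A tuned kernel would replace the triple loop with tiling or a BLAS dgemm call, as noted in the approaches table.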
Fox Algorithm
[Diagram: Stage One and Stage Two of the Fox algorithm block broadcast and roll-up]
• Less communication overhead than the other approaches
• Scales well for large matrix sizes
• Pseudocode sample:
    Partition matrices A and B into blocks
    For each iteration i:
      1) broadcast matrix A block (i%N, i%N) to row i
      2) compute the matrix C blocks and add them to the previous result
      3) roll up the matrix B blocks

Performance Analysis on Fox Algorithm
• Cache issue:
  • Cache misses (size), pollution, conflicts
  • Tiled matrix multiply
• Memory issue:
  • Size (memory paging)
  • Bandwidth, latency
[Chart: relative parallel efficiency versus grain size per node, OpenMPI/Threads/Cblas on 16 nodes, series 1node_8cores and 16nodes_8cores, with the cache-size turning point marked]
• Absolute performance degrades as the problem size increases in both cases
• Single-node performance is worse than multi-node performance due to the memory issue
[Chart: MPI/Threads/Cblas performance (10^3 Mflops) with various problem sizes]

Multicore level parallelism
• To use every core on a compute node for a Dryad job, the task must be programmed with a multicore technology (e.g., Task Parallel Library (TPL), Thread, PLINQ)
• Each thread computes one row or several rows of matrix C, depending on the implementation
• With TPL or PLINQ, thread optimization is implicit and easier to use (see the TPL sketch after the backup slides)

Timeline for term-long project
• Stage One
  • Get familiar with the HPC cluster
  • Sequential MM with C#
  • Multithreaded MM with C#
  • Performance comparison of the above two approaches
• Stage Two
  • Get familiar with the DryadLINQ interface
  • Implement the Row Partition algorithm with DryadLINQ
  • Performance study
• Stage Three
  • Refine the experimental results
  • Report and presentation

Backup slides

Dryad Job Submission
[Diagram: DryadLINQ job submission from the client machine to the HPC cluster — .NET program, query expression, distributed query plan, vertex code, job manager (JM), Dryad execution over input/output tables, results returned via DryadTable/ToTable and foreach]
• Input: C# and LINQ data objects become DryadLINQ distributed data objects
• DryadLINQ translates LINQ programs into distributed Dryad computations: C# methods become code running on the vertices of a Dryad job
• Output: DryadLINQ distributed data objects become .NET objects

Dryad Job Execution Flow

Performance on one Node

Performance on Multiple Nodes

Analysis for three algorithms

Performance for three algorithms
• Tests done on 16 nodes of Tempest, using one core per node

Performance for Multithreaded MM
• Tests done on one node of Tempest, 24 cores
[Chart: speed-up versus scale of square matrix (2400 to 19200) for the TPL, Thread, and PLINQ implementations]
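For reference, the sketch below shows a minimal TPL version of the multithreaded MM measured above, assuming square jagged double[][] matrices: Parallel.For spreads the rows of C across the available cores, matching the per-row decomposition described on the multicore-parallelism slide. The class and method names (MultithreadedMM, MultiplyTpl) are illustrative, not from the project code.

    using System.Threading.Tasks;

    static class MultithreadedMM
    {
        // TPL sketch: Parallel.For partitions the rows of C across the cores;
        // each iteration computes one full row of C independently.
        public static double[][] MultiplyTpl(double[][] a, double[][] b)
        {
            int n = a.Length;              // assume n x n square matrices
            var c = new double[n][];

            Parallel.For(0, n, i =>
            {
                var row = new double[n];
                for (int k = 0; k < n; k++)
                {
                    double aik = a[i][k];
                    for (int j = 0; j < n; j++)
                        row[j] += aik * b[k][j];
                }
                c[i] = row;                // each index i is written by exactly one thread
            });
            return c;
        }
    }

A Thread-based variant would hand each worker an explicit contiguous block of rows, and a PLINQ variant would express the same row loop as an AsParallel query; the TPL form is the shortest because the scheduling is implicit, as the multicore-parallelism slide notes.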