MATRIX MULTIPLY WITH DRYAD
B649 Course Project

Introduction: Matrix Multiplication
• Fundamental kernel algorithm used by many applications
• Examples: graph theory, physics, electronics

Matrix Multiply Approaches
Programming Model              | Algorithm                                         | Customized Libraries                                       | User Implementation
Sequential                     | Naïve approach, tiled matrix multiply, BLAS dgemm | Vendor-supplied packages (e.g., Intel and AMD BLAS), ATLAS | Fortran, C, C++, C#, Java
Shared-memory parallelism      | Row partition                                     | ATLAS                                                      | Multithreading, TPL, PLINQ, OpenMP
Distributed-memory parallelism | Row/column partition, Fox algorithm               | ScaLAPACK                                                  | OpenMPI, Twister, Dryad, Hadoop

Single Node Test
Language | Bare metal      | VM node
Java     | 1 node, 8 cores | c1.xlarge
C        | 1 node, 8 cores | c1.xlarge

Bare metal on FutureGrid: cache = 8 × 8 MB = 64 MB. Working memory of matrix multiply for 8 threads on a 1500 × 1500 matrix: WM = 3 matrices × 1500 × 1500 elements × 8 bytes per double = 54 MB, which fits within the 64 MB aggregate cache.

VM node
Class     | Slots | Cores | Memory | Disk
c1.xlarge | 50    | 8     | 20000  | 20

Result of Total Time
[Charts: total time, speed-up, and parallel efficiency for C and Java, bare metal vs. VM node; the definitions of speed-up and parallel efficiency are sketched at the end of these slides.]

Pleasingly Parallel Programming Pattern
• A query is decomposed by DryadLINQ into subqueries, each of which calls a user-defined function wrapping an application program or legacy code.
• Issue: scheduling resources at the granularity of a node rather than a core leads to relatively low system utilization.
• Sample applications: 1. SAT problem 2. Parameter sweep 3. BLAST, SW-G bioinformatics
[Diagram: query → DryadLINQ subqueries → user-defined functions → application program / legacy code]

Hybrid Parallel Programming Pattern
• Solves the previous issue by using PLINQ, TPL, or thread-pool technologies inside each DryadLINQ subquery (a sketch of such a per-vertex function appears after the references below).
• Sample applications: 1. Matrix multiplication 2. GTM and MDS
[Diagram: query → DryadLINQ subquery → user-defined function parallelized with PLINQ / TPL]

Implementation and Performance
Hardware configuration

TEMPEST cluster
            | TEMPEST     | TEMPEST-CNXX
CPU         | Intel E7450 | Intel E7450
Cores       | 24          | 24
Memory      | 24.0 GB     | 50.0 GB
Memory/Core | 1 GB        | 2 GB

STORM cluster
            | STORM CN01–CN03 | STORM CN04, CN05 | STORM CN06, CN07
CPU         | AMD 2356        | AMD 8356         | Intel E7450
Cores       | 8               | 16               | 24
Memory      | 16 GB           | 16 GB            | 48 GB
Memory/Core | 2 GB            | 1 GB             | 2 GB

Software
1. DryadLINQ CTP version released in December 2010
2. Windows HPC Server 2008 R2 SP2
3. .NET 4.0, Visual Studio 2010

Matrix multiplication performance results
[Chart: 1 core on 16 nodes vs. 24 cores on 16 nodes]

Matrix Multiply with Different Runtimes
• Implemented with different runtimes: 1. Dryad 2. MPI 3. Twister 4. Hadoop
• Implemented the Fox algorithm
• Run on a 4 × 4 mesh of nodes in both Windows and HPC environments

Dryad and DryadLINQ
[Diagram: a DryadLINQ .NET program on the client machine builds a query expression (ToTable, foreach), which is compiled into a distributed query plan and vertex code; the job manager (JM) on the HPC cluster runs the Dryad execution over input tables and writes output tables, returned to the client as a DryadTable of .NET objects.]

References
• Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D. (2007). Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks.
• Yu, Y., Isard, M., Fetterly, D., Budiu, M., et al. (2008). DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language.
• Li, H., Ruan, Y., Zhou, Y., Qiu, J., Fox, G. (2011). Design Patterns for Scientific Applications in DryadLINQ CTP. DataCloud-SC11.
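The hybrid pattern can be made concrete with a small example. The following C# sketch is illustrative only, not the project's actual code: the class and method names (RowBlockMultiply, Multiply) are invented for this example, and the DryadLINQ query that would distribute the row blocks across nodes is omitted. It shows the kind of user-defined function a DryadLINQ vertex could invoke, multiplying a block of rows of A by B with TPL's Parallel.For so that one node uses all of its cores.

using System.Threading.Tasks;

// Illustrative per-vertex compute function for the hybrid pattern:
// Dryad/DryadLINQ distributes row blocks across nodes; inside one node,
// TPL's Parallel.For spreads the rows of the block across cores.
public static class RowBlockMultiply
{
    // aBlock: rowsInBlock x n, b: n x n, c (result): rowsInBlock x n, all flattened row-major.
    public static double[] Multiply(double[] aBlock, double[] b, int rowsInBlock, int n)
    {
        var c = new double[rowsInBlock * n];
        Parallel.For(0, rowsInBlock, i =>
        {
            for (int k = 0; k < n; k++)
            {
                double aik = aBlock[i * n + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];   // i-k-j order streams over contiguous rows of B and C
            }
        });
        return c;
    }
}

Each parallel iteration writes a disjoint row of the result, so no locking is needed; the i-k-j loop order keeps the inner loop on contiguous memory, which matters once the matrices no longer fit in cache.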
Comparison of Key Features: Dryad CTP vs. Hadoop 0.20.0

Execution model
• Dryad: DAG of data flowing between operations [1,2,3]
• Hadoop: Map, Shuffle, Merge, Reduce stages [16]
• Comment: a DAG is more flexible than MapReduce for expressing data-flow processing.

Programming interface
• Dryad: there is no public documentation of the raw Dryad API.
• Hadoop: Map and Reduce classes [16]
• Comment: the Map and Reduce interface does not natively support relational operations that have multiple heterogeneous input data sets.

Higher-level programming language
• Dryad: DryadLINQ, based on the LINQ model [2,3] with interface extensions for Dryad; the relational operators defined in LINQ can be used.
• Hadoop: Pig allows Hadoop developers to use relational queries as well, but it is found to be inefficient [9]; YSmart is another SQL-to-MapReduce translator that outperforms Pig.
• Comment: DryadLINQ lets developers 1) use the standard query operators defined in LINQ, such as Select, Join, and GroupBy, and 2) have query evaluation converted into a DAG. DryadLINQ outperforms Pig when processing relational datasets.

Performance Issues

Data movement and communication
• Dryad: provides three channel protocols: file (the default), TCP pipe, and shared-memory FIFO [1].
• Hadoop: uses HTTP to transfer data between Map tasks and Reduce tasks during shuffling [15].
• Comment: Dryad provides better data-transfer approaches than Hadoop. Note: RDMA is available in Windows 8.

Pipelining between jobs (iterative MapReduce)
• Dryad: chains the execution of multiple queries through late (lazy) evaluation, TCP pipes, and shared-memory FIFOs [2,3]; the pipeline is broken when queries are explicitly evaluated or results are materialized to disk.
• Hadoop: cannot pipeline the execution of jobs, because the output of each MapReduce job must be materialized to disk (HDFS) when the job finishes [6,7].

Backup Slides

Dryad Job Execution Flow
[Diagram: Dryad job execution flow]

Performance for Multithreaded MM
• Test done on one node of TEMPEST, 24 cores
• [Chart: speed-up (axis 0–25) vs. scale of square matrix (2400 to 19200) for the TPL, Thread, and PLINQ implementations; sketches of these variants follow below.]
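For the multithreaded comparison above, the TPL variant is essentially the Parallel.For sketch shown earlier. The following C# sketches outline the PLINQ and explicit-thread variants a row-partition implementation could use; they are illustrative, with invented names, and assume flattened row-major square matrices rather than the project's actual data layout.

using System;
using System.Linq;
using System.Threading;

public static class MultithreadedMM
{
    // PLINQ variant: each row index is a work item; ForAll runs them on the thread pool.
    public static void MultiplyPlinq(double[] a, double[] b, double[] c, int n)
    {
        Enumerable.Range(0, n).AsParallel().ForAll(i =>
        {
            for (int k = 0; k < n; k++)
            {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];
            }
        });
    }

    // Explicit-thread variant: statically row-partition the matrix across a fixed number of threads.
    public static void MultiplyThreads(double[] a, double[] b, double[] c, int n, int threadCount)
    {
        var threads = new Thread[threadCount];
        int rowsPerThread = (n + threadCount - 1) / threadCount;
        for (int t = 0; t < threadCount; t++)
        {
            int startRow = t * rowsPerThread;
            int endRow = Math.Min(startRow + rowsPerThread, n);
            threads[t] = new Thread(() =>
            {
                for (int i = startRow; i < endRow; i++)
                    for (int k = 0; k < n; k++)
                    {
                        double aik = a[i * n + k];
                        for (int j = 0; j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            });
            threads[t].Start();
        }
        foreach (var thread in threads) thread.Join();
    }
}

The static row partition gives each thread a contiguous block of rows, while PLINQ and TPL let the runtime balance rows across cores dynamically; all three write disjoint rows of C, so no synchronization is needed beyond the final join.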
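The speed-up and parallel efficiency reported throughout these slides follow the usual definitions: S(p) = T(1) / T(p) and E(p) = S(p) / p, where T(p) is the run time on p cores. A small illustrative helper (names invented for this sketch):

public static class ScalingMetrics
{
    // Speed-up S(p) = T(1) / T(p).
    public static double SpeedUp(double sequentialSeconds, double parallelSeconds)
    {
        return sequentialSeconds / parallelSeconds;
    }

    // Parallel efficiency E(p) = S(p) / p.
    public static double Efficiency(double sequentialSeconds, double parallelSeconds, int cores)
    {
        return SpeedUp(sequentialSeconds, parallelSeconds) / cores;
    }
}

For example, a hypothetical run taking 120 s sequentially and 6 s on 24 cores would give S = 20 and E ≈ 0.83, which is the shape of curve plotted in the charts above.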