Final Presentation

Matrix Multiply with Dryad
B649 Course Project Introduction
Matrix Multiplication
• A fundamental computational kernel used by many applications
• Examples: Graph Theory, Physics, Electronics
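
For reference, the product C = AB of two n × n matrices is defined entrywise as follows; the naïve algorithm therefore costs O(n³) floating-point operations:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}, \qquad 1 \le i, j \le n$$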
Matrix Multiply Approaches

| Programming Model | Algorithm | Customized Libraries | User Implementation |
|---|---|---|---|
| Sequential | Naïve approach, tiled matrix multiply, BLAS dgemm | Vendor-supplied packages (e.g., Intel and AMD BLAS), ATLAS | Fortran, C, C++, C#, Java |
| Shared-memory parallelism | Row partition | ATLAS | Multithreading, TPL, PLINQ, OpenMP |
| Distributed-memory parallelism | Row/column partition, Fox algorithm | ScaLAPACK | OpenMPI, Twister, Dryad, Hadoop |
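
As a point of reference for the sequential row of the table, a minimal C# sketch of the naïve triple loop and a cache-friendlier tiled variant (the tile size of 64 is an illustrative choice, not taken from the slides):

```csharp
// Naïve sequential matrix multiply: C = A * B for n x n matrices.
static double[,] MultiplyNaive(double[,] A, double[,] B, int n)
{
    var C = new double[n, n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            double sum = 0;
            for (int k = 0; k < n; k++)
                sum += A[i, k] * B[k, j];
            C[i, j] = sum;
        }
    return C;
}

// Tiled variant: iterate over blocks so each tile of A, B, and C stays
// in cache while it is in use. Assumes the tile size divides n.
static double[,] MultiplyTiled(double[,] A, double[,] B, int n, int tile = 64)
{
    var C = new double[n, n];
    for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
            for (int jj = 0; jj < n; jj += tile)
                for (int i = ii; i < ii + tile; i++)
                    for (int k = kk; k < kk + tile; k++)
                    {
                        double a = A[i, k];
                        for (int j = jj; j < jj + tile; j++)
                            C[i, j] += a * B[k, j];
                    }
    return C;
}
```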
Single Node Test

| Implementation | Bare Metal | VM Node |
|---|---|---|
| Java | 1 node, 8 cores | c1.xlarge |
| C | 1 node, 8 cores | c1.xlarge |
Bare metal in FutureGrid:
• Aggregate cache: 8 × 8 MB = 64 MB
• Working memory of matrix multiply for 8 threads on a 1500×1500 matrix:
  WM = 3 × 1500 × 1500 × 8 bytes = 54 MB, which fits within the 64 MB of aggregate cache
VM node:

| Class | Slots | Cores | Memory (MB) | Disk (GB) |
|---|---|---|---|---|
| c1.xlarge | 50 | 8 | 20000 | 20 |
Result of Total Time

[Charts: total running time of the C and Java implementations on bare metal and on VM nodes, together with the resulting speedup and parallel efficiency for each platform.]
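
The speedup and parallel efficiency plotted in these charts follow the standard definitions, where T(p) is the total time on p cores:

$$S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}$$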
Pleasingly Parallel Programming Pattern

Sample applications:
1. SAT problem
2. Parameter sweep
3. Blast, SW-G bio

Issue: scheduling resources at the granularity of a node rather than a core leads to relatively low system utilization.

[Figure: a DryadLINQ query fans out into one subquery per node; each subquery invokes a user-defined function that wraps an application program or legacy code. A minimal sketch of the pattern follows.]
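
A minimal sketch of the pleasingly parallel pattern, assuming the DryadLINQ CTP programming model in which a distributed PartitionedTable is transformed by standard LINQ operators; the namespace, table URIs, and the RunLegacyCode helper are hypothetical and exact names may differ by release:

```csharp
using System.Linq;
// Assumed DryadLINQ CTP namespace; exact names may differ by release.
using Microsoft.Research.DryadLinq;

class PleasinglyParallel
{
    static void Main()
    {
        // One record per independent work item (hypothetical input table).
        var inputs = PartitionedTable.Get<string>("dfs://inputs/parameter_sweep.pt");

        // Each subquery applies the user-defined function to one partition;
        // the items are independent, so no communication is needed.
        var results = inputs.Select(item => RunLegacyCode(item));

        // Materialize the distributed output (hypothetical output URI).
        results.ToPartitionedTable("dfs://outputs/results.pt");
    }

    // User-defined function wrapping an application program or legacy code,
    // e.g., launching a Blast or SW-G run for one parameter set.
    static string RunLegacyCode(string parameters)
    {
        return parameters + " -> done";
    }
}
```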
Hybrid Parallel Programming Pattern

Sample applications:
1. Matrix multiplication
2. GTM and MDS

This pattern solves the previous issue by using PLINQ, TPL, or thread-pool technologies inside each node: a DryadLINQ query is split into one subquery per node, and within each subquery the user-defined function uses PLINQ or TPL to occupy all cores (see the sketch below).

[Figure: Query → DryadLINQ subquery → PLINQ → user-defined function → TPL.]
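
A minimal sketch of the hybrid pattern under the same assumptions as above: DryadLINQ distributes one block-row of the result per node, and TPL's Parallel.For (a standard .NET 4.0 API) exploits the cores within each node. The BlockRow type, URIs, and partitioning scheme are hypothetical:

```csharp
using System.Linq;
using System.Threading.Tasks;
// Assumed DryadLINQ CTP namespace; exact names may differ by release.
using Microsoft.Research.DryadLinq;

class HybridMatrixMultiply
{
    // One work item per node: a block-row of A plus all of B (hypothetical layout).
    public class BlockRow
    {
        public double[][] ARows;   // rows of A assigned to this node
        public double[][] B;       // full B matrix (replicated)
    }

    static void Main()
    {
        var blocks = PartitionedTable.Get<BlockRow>("dfs://inputs/blockrows.pt");

        // Outer level: DryadLINQ schedules one subquery per node.
        var resultRows = blocks.Select(block => MultiplyBlock(block));

        resultRows.ToPartitionedTable("dfs://outputs/c_rows.pt");
    }

    // Inner level: TPL parallelizes the row loop across the cores of one node.
    static double[][] MultiplyBlock(BlockRow block)
    {
        int rows = block.ARows.Length, n = block.B.Length;
        var c = new double[rows][];
        Parallel.For(0, rows, i =>
        {
            c[i] = new double[n];
            for (int k = 0; k < n; k++)
            {
                double a = block.ARows[i][k];
                for (int j = 0; j < n; j++)
                    c[i][j] += a * block.B[k][j];
            }
        });
        return c;
    }
}
```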
Implementation and Performance

Hardware configuration:

TEMPEST cluster:

| | TEMPEST | TEMPEST-CNXX |
|---|---|---|
| CPU | Intel E7450 | Intel E7450 |
| Cores | 24 | 24 |
| Memory | 24.0 GB | 50.0 GB |
| Memory/Core | 1 GB | 2 GB |

STORM cluster:

| | STORM-CN01, CN02, CN03 | STORM-CN04, CN05 | STORM-CN06, CN07 |
|---|---|---|---|
| CPU | AMD 2356 | AMD 8356 | Intel E7450 |
| Cores | 8 | 16 | 24 |
| Memory | 16 GB | 16 GB | 48 GB |
| Memory/Core | 2 GB | 1 GB | 2 GB |

Software configuration:
1. DryadLINQ CTP (December 2010 release)
2. Windows HPC Server 2008 R2 SP2
3. .NET 4.0, Visual Studio 2010
Matrix Multiplication Performance Results

[Chart: 1 core on each of 16 nodes vs. 24 cores on each of 16 nodes.]
Matrix Multiply with Different Runtimes
• Implemented the Fox algorithm with four different runtimes:
  1. Dryad
  2. MPI
  3. Twister
  4. Hadoop
• Run on a 4×4 mesh of nodes, in both Windows HPC and Linux environments.
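
For context, a sequential C# simulation of the Fox (broadcast-multiply-roll) block schedule the slide refers to. In the real runs each mesh cell is a node and the B-block movement is real communication; this stand-alone sketch keeps the same schedule on one machine:

```csharp
using System;

class FoxAlgorithmDemo
{
    // Multiply two n x n matrices with Fox's algorithm simulated on a
    // q x q mesh (q must divide n); each mesh cell (i, j) owns one block.
    static double[,] FoxMultiply(double[,] A, double[,] B, int q)
    {
        int n = A.GetLength(0), bs = n / q;            // bs = block size
        var C = new double[n, n];
        for (int step = 0; step < q; step++)           // q broadcast/roll rounds
        {
            for (int i = 0; i < q; i++)
            {
                int k = (i + step) % q;                // A block broadcast in mesh row i
                for (int j = 0; j < q; j++)
                    // C(i,j) += A(i,k) * B(k,j): one block multiply per cell
                    for (int r = 0; r < bs; r++)
                        for (int c = 0; c < bs; c++)
                        {
                            double sum = 0;
                            for (int t = 0; t < bs; t++)
                                sum += A[i * bs + r, k * bs + t]
                                     * B[k * bs + t, j * bs + c];
                            C[i * bs + r, j * bs + c] += sum;
                        }
            }
            // In the distributed version, each cell would now roll its B block
            // one position up its mesh column; indexing B(k,j) directly plays
            // that role in this stand-alone simulation.
        }
        return C;
    }

    static void Main()
    {
        int n = 8, q = 4;                              // 4 x 4 mesh, as on the slide
        var A = new double[n, n];
        var B = new double[n, n];
        var rand = new Random(42);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                A[i, j] = rand.NextDouble();
                B[i, j] = rand.NextDouble();
            }
        var C = FoxMultiply(A, B, q);
        Console.WriteLine(C[0, 0]);                    // spot-check one entry
    }
}
```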
Dryad and DryadLINQ

[Figure: DryadLINQ execution flow. A .NET program on the client machine builds a query expression; invoking it (e.g., via ToTable or foreach) makes DryadLINQ compile the expression into a distributed query plan and vertex code. The job manager (JM) on the HPC cluster executes the plan as a Dryad job over the input tables, and the output tables flow back to the client as .NET objects / a DryadTable of results.]

References:
Isard, M., M. Budiu, Y. Yu, A. Birrell, and D. Fetterly (2007). Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks.
Yu, Y., M. Isard, D. Fetterly, and M. Budiu (2008). DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language.
Li, H., Y. Ruan, Y. Zhou, J. Qiu, and G. Fox (2011). Design Patterns for Scientific Applications in DryadLINQ CTP. DataCloud-SC11.
Key Features: Dryad CTP vs. Hadoop 0.20.0

Programming Interface

| Feature | Dryad CTP | Hadoop 0.20.0 | Comments |
|---|---|---|---|
| Execution model | DAG of data flowing between operations [1,2,3] | Map, Shuffle, Merge, Reduce stages [16] | A DAG is more flexible than MapReduce for expressing data-flow processing |
| Programming interface | 1) Based on the LINQ model [2,3], with interface extensions for Dryad; 2) can use the relational operators defined in LINQ | 1) Map and Reduce classes [16]; 2) no native support for relational operations over multiple heterogeneous input data sets | There is no public documentation of the raw Dryad API |
| Higher-level programming language | 1) DryadLINQ lets developers use the standard query operators defined in LINQ, such as Select, Join, and GroupBy; 2) query evaluation is converted into a DAG | Pig lets Hadoop developers use relational queries as well, but it has been found to be inefficient [9] | DryadLINQ outperforms Pig when processing relational datasets; YSmart is another SQL-to-MapReduce translator that outperforms Pig |
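
To illustrate the relational-operator rows above, a small LINQ-to-objects example using plain .NET 4.0 (not DryadLINQ-specific; the Customer/Order data is invented for illustration). In DryadLINQ, the same Join/GroupBy shape would be compiled into a DAG of vertices:

```csharp
using System;
using System.Linq;

class LinqRelationalDemo
{
    class Customer { public int Id; public string Name; }
    class Order { public int CustomerId; public double Amount; }

    static void Main()
    {
        var customers = new[] {
            new Customer { Id = 1, Name = "Ada" },
            new Customer { Id = 2, Name = "Grace" } };
        var orders = new[] {
            new Order { CustomerId = 1, Amount = 10.0 },
            new Order { CustomerId = 1, Amount = 5.0 },
            new Order { CustomerId = 2, Amount = 7.5 } };

        // Join two datasets, then group and aggregate: the kind of
        // relational query that DryadLINQ turns into a distributed plan.
        var totals =
            from o in orders
            join c in customers on o.CustomerId equals c.Id
            group o.Amount by c.Name into g
            select new { Name = g.Key, Total = g.Sum() };

        foreach (var t in totals)
            Console.WriteLine(t.Name + ": " + t.Total);
    }
}
```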
Performance Issues

| Feature | Dryad CTP | Hadoop 0.20.0 | Comments |
|---|---|---|---|
| Data movement / communication | DryadLINQ provides three channel protocols: file (the default), TCP pipe, and shared-memory FIFO [1] | Hadoop uses HTTP to transfer data between Map and Reduce tasks during shuffling [15] | Dryad provides better data-transfer mechanisms than Hadoop (note: RDMA is available in Windows 8) |
| Pipelining between jobs (iterative MapReduce) | Chains the execution of multiple queries via lazy evaluation, TCP pipes, and shared-memory FIFOs [2,3] | Hadoop cannot pipeline job execution, since it must materialize the output of each MapReduce job to disk (HDFS) when the job finishes [6,7] | In Dryad, pipelining is broken only when queries are explicitly evaluated or outputs are materialized to disk |
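
The pipelining row relies on LINQ's deferred (lazy) evaluation: chained queries are not executed until results are demanded, so a runtime can fuse them into one plan. A plain LINQ-to-objects illustration of the same semantics DryadLINQ builds on:

```csharp
using System;
using System.Linq;

class DeferredEvaluationDemo
{
    static void Main()
    {
        var numbers = Enumerable.Range(1, 10);

        // Nothing executes here: each operator only builds up the query.
        // A distributed LINQ provider can therefore fuse the whole chain
        // into a single DAG instead of materializing each stage.
        var query = numbers
            .Select(n => { Console.WriteLine("square " + n); return n * n; })
            .Where(sq => sq % 2 == 0);

        Console.WriteLine("Query built; nothing evaluated yet.");

        // Evaluation happens only now, when the results are demanded.
        foreach (var sq in query)
            Console.WriteLine("result " + sq);
    }
}
```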
Backup slides
Dryad Job Execution Flow
Performance for Multithreaded MM
• Test done on one node of TEMPEST, 24 cores

[Chart: speed-up (0 to 25) of the TPL, Thread, and PLINQ implementations as the scale of the square matrix grows from 2400 to 19200.]
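
A minimal sketch of how the three node-level variants in this chart differ, using standard .NET 4.0 APIs; the jagged-array matrix layout and row-chunking scheme are illustrative choices, not taken from the slides:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class MultithreadedMM
{
    // TPL: one parallel loop over the rows of C.
    public static void MultiplyTpl(double[][] A, double[][] B, double[][] C)
    {
        Parallel.For(0, A.Length, i => MultiplyRow(A, B, C, i));
    }

    // PLINQ: express the same row loop as a parallel query.
    public static void MultiplyPlinq(double[][] A, double[][] B, double[][] C)
    {
        Enumerable.Range(0, A.Length)
                  .AsParallel()
                  .ForAll(i => MultiplyRow(A, B, C, i));
    }

    // Raw threads: one thread per contiguous chunk of rows.
    public static void MultiplyThreads(double[][] A, double[][] B, double[][] C,
                                       int threadCount)
    {
        int chunk = (A.Length + threadCount - 1) / threadCount;
        var threads = Enumerable.Range(0, threadCount).Select(t => new Thread(() =>
        {
            for (int i = t * chunk; i < Math.Min(A.Length, (t + 1) * chunk); i++)
                MultiplyRow(A, B, C, i);
        })).ToArray();
        foreach (var th in threads) th.Start();
        foreach (var th in threads) th.Join();
    }

    // Shared inner kernel: compute one row of C = A * B.
    static void MultiplyRow(double[][] A, double[][] B, double[][] C, int i)
    {
        int n = B.Length;
        for (int k = 0; k < n; k++)
        {
            double a = A[i][k];
            for (int j = 0; j < C[i].Length; j++)
                C[i][j] += a * B[k][j];
        }
    }
}
```

All three variants share the same kernel, so the chart isolates the scheduling overhead of each technology rather than differences in the arithmetic.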