Mid-term Presentation

MATRIX MULTIPLY WITH DRYAD
B649 Course Project Introduction
Matrix Multiplication
• Fundamental kernel algorithm used by many applications
• Examples: Graph Theory, Physics, Electronics
Scalability Issues:
• Run on a single machine:
• Memory overhead increases as N^2
• CPU overhead increases as N^3
• Run on multiple machines:
• Communication overhead increases as N^2
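As a rough, standard accounting for the classical dense algorithm (an estimate, not a measurement from this project), for two N x N matrices:

```latex
% Classical dense multiply C = A * B with N x N matrices:
% storage for A, B, and C, the multiply-add count, and the data that must
% move when the matrices are spread over several machines.
\[
  \text{memory} \approx 3N^{2}\ \text{elements},\qquad
  \text{flops} \approx 2N^{3},\qquad
  \text{communication} = O(N^{2})\ \text{elements moved}.
\]
```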
[Chart: memory overhead (N^2) and CPU overhead (N^3) versus matrix size N, log scale from 100 to 1E+10, for N from 0 to 1500]
Matrix Multiply Approaches
Programming Model              | Algorithm                                          | Customized Libraries                                      | User Implementation
Sequential                     | Naïve approach, tiled matrix multiply, BLAS dgemm  | Vendor-supplied packages (e.g., Intel or AMD BLAS), ATLAS | Fortran, C, C++, C#, Java
Shared memory parallelism      | Row Partition                                      | ATLAS                                                     | Multi-threads, TPL, PLINQ, OpenMP
Distributed memory parallelism | Row Column Partition, Fox Algorithm                | ScaLAPACK                                                 | OpenMPI, Twister, Dryad
Why DryadLINQ?
• Dryad is a general-purpose runtime that supports processing of data-intensive applications on Windows
• DryadLINQ is a high-level programming language and compiler for Dryad
• Applicability:
• Dryad transparently deals with parallelism, scheduling, fault tolerance, messaging, and workload-balancing issues.
• SQL-like interface, based on the .NET platform, easy to write code.
• Performance:
• Intelligent job execution engine, optimized execution plan.
• Scales out to thousands of machines.
Parallel Algorithms for Matrix Multiplication
• MM algorithms can deal with matrices distributed on rectangular grids
• No single algorithm achieves the best performance for all matrix and grid shapes
• MM algorithms can be classified into categories according to their communication primitives
• Row Partition
• Row Column Partition
• Fox Algorithm (BMR) – broadcast, multiply, roll up
Row Partition
• Heavy communication overhead
• Large memory usage per node
• The full Matrix B is copied to every node
• The Matrix A row blocks are distributed to each node
Pseudocode sample:
Partition matrix A by rows
Broadcast matrix B to every node
Distribute the matrix A row blocks
Compute the matrix C row blocks
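A minimal C# sketch of the per-node work in this scheme, assuming dense double[,] matrices; the class and method names are illustrative, not the project's actual code:

```csharp
// Row partition: each node holds the full matrix B plus one row block of A,
// and computes the matching row block of C with a plain triple loop.
static class RowPartitionMM
{
    public static double[,] MultiplyRowBlock(double[,] aRowBlock, double[,] b)
    {
        int rows = aRowBlock.GetLength(0);   // rows of A assigned to this node
        int k    = aRowBlock.GetLength(1);   // shared dimension
        int cols = b.GetLength(1);           // columns of B (and of C)
        var cRowBlock = new double[rows, cols];

        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
            {
                double sum = 0.0;
                for (int l = 0; l < k; l++)
                    sum += aRowBlock[i, l] * b[l, j];
                cRowBlock[i, j] = sum;
            }
        return cRowBlock;
    }
}
```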
Row Column Partition
• Heavy communication overhead
• Scheduling overhead for each iteration
• Moderate memory usage
[Diagram: matrix A partitioned into row blocks 1..m, matrix B into column blocks 1..n; matrix C is an m x n grid of blocks (i, j)]
Pseudocode sample:
Partition matrix A by rows
Partition matrix B by columns
For each iteration i:
    broadcast matrix A row block i
    distribute the matrix B column blocks
    compute matrix C row block i
[Diagram: per-iteration mapping of the A, B, and C blocks onto Node 1..Node n]
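A sequential C# simulation of this pattern (illustrative only, assumed to sit in the same static class as the row-partition sketch above): the outer loop stands in for the iterations in which A row block i is broadcast, the inner loop for the nodes that each hold one column block of B.

```csharp
// Row-column partition, simulated sequentially: C block (i, j) is produced
// in iteration i by the node holding B column block j.
// MultiplyRowBlock is the dense block multiply from the row-partition sketch.
public static double[][][,] RowColumnMultiply(double[][,] aRowBlocks, double[][,] bColBlocks)
{
    int m = aRowBlocks.Length, n = bColBlocks.Length;
    var c = new double[m][][,];
    for (int i = 0; i < m; i++)          // iteration i: "broadcast" A row block i
    {
        c[i] = new double[n][,];
        for (int j = 0; j < n; j++)      // "node" j computes C block (i, j)
            c[i][j] = MultiplyRowBlock(aRowBlocks[i], bColBlocks[j]);
    }
    return c;
}
```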
Fox Algorithm
[Diagram: Fox algorithm, Stage One and Stage Two]
Fox algorithm
• Less communication overhead than the other approaches
• Scales well for large matrix sizes
Pseudocode sample:
Partition matrices A and B into an N x N grid of blocks
For each stage k:
    1) in each block row i, broadcast matrix A block (i, (i+k) % N) along the row
    2) multiply the local A and B blocks and add the result to the matrix C block
    3) roll up the matrix B blocks (shift them up by one within each column)
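A sequential C# simulation of the Fox (BMR) computation pattern, illustrative only and assumed to sit alongside the row-partition sketch above. At stage k, grid cell (i, j) multiplies A block (i, (i+k) % N), the block broadcast along row i, with B block ((i+k) % N, j), the block that has been rolled up k times in column j.

```csharp
// Fox (broadcast-multiply-roll-up), simulated sequentially on an N x N block grid.
// MultiplyRowBlock is the dense block multiply from the row-partition sketch.
public static void FoxMultiply(double[][][,] aBlk, double[][][,] bBlk,
                               double[][][,] cBlk, int gridN)
{
    for (int k = 0; k < gridN; k++)            // stage k: broadcast, multiply, roll up
        for (int i = 0; i < gridN; i++)
            for (int j = 0; j < gridN; j++)
            {
                int l = (i + k) % gridN;       // column of the broadcast A block,
                                               // row of the rolled-up B block
                AddInto(cBlk[i][j], MultiplyRowBlock(aBlk[i][l], bBlk[l][j]));
            }
}

// Accumulate one dense block into another (the "add to the previous result" step).
static void AddInto(double[,] acc, double[,] delta)
{
    for (int i = 0; i < acc.GetLength(0); i++)
        for (int j = 0; j < acc.GetLength(1); j++)
            acc[i, j] += delta[i, j];
}
```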
Performance Analysis on Fox algorithm
• Cache issues
• Cache misses (size), pollution, conflicts
• Tiled matrix multiply (see the sketch below)
• Memory issues
• Size (memory paging)
• Bandwidth, latency
[Chart: Relative Parallel Efficiency vs. Grain Size Per Node, OpenMPI/Threads/Cblas on 16 nodes; series 1node_8cores and 16nodes_8cores; the cache size turning point is visible]
• Absolute performance degrades as problem size increases in both cases
• Single-node performance is worse than multi-node performance due to the memory issue
[Chart: performance (10^3 MFlops) of MPI/Threads/Cblas with various problem sizes]
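A minimal C# sketch of the tiled (blocked) multiply mentioned above, illustrative only: working on tile-sized sub-problems keeps the active data in cache and moves the cache-size turning point to larger matrices. The default tile size of 64 is an assumption to be tuned per machine.

```csharp
using System;

// Tiled (blocked) matrix multiply for a single node.
static class TiledMM
{
    public static double[,] Multiply(double[,] a, double[,] b, int tile = 64)
    {
        int n = a.GetLength(0), k = a.GetLength(1), m = b.GetLength(1);
        var c = new double[n, m];
        for (int ii = 0; ii < n; ii += tile)
            for (int ll = 0; ll < k; ll += tile)
                for (int jj = 0; jj < m; jj += tile)
                    // multiply one tile of A by one tile of B and accumulate into C
                    for (int i = ii; i < Math.Min(ii + tile, n); i++)
                        for (int l = ll; l < Math.Min(ll + tile, k); l++)
                        {
                            double aVal = a[i, l];
                            for (int j = jj; j < Math.Min(jj + tile, m); j++)
                                c[i, j] += aVal * b[l, j];
                        }
        return c;
    }
}
```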
Multicore level parallelism
• To use every core of a compute node for a Dryad job, the task must be programmed with a multicore technology (e.g., the Task Parallel Library (TPL), threads, or PLINQ).
• Each thread computes one or several rows of matrix C, depending on the implementation (see the sketch below).
• With TPL or PLINQ the thread-level optimization is implicit, which makes them easier to use.
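A minimal sketch of that multicore-level parallelism with TPL (illustrative, not the project's code): each iteration of Parallel.For computes one row of C, so rows are the unit of per-thread work.

```csharp
using System.Threading.Tasks;

// Multithreaded matrix multiply: one Parallel.For iteration per row of C.
static class MultithreadedMM
{
    public static double[,] Multiply(double[,] a, double[,] b)
    {
        int n = a.GetLength(0), k = a.GetLength(1), m = b.GetLength(1);
        var c = new double[n, m];
        Parallel.For(0, n, i =>           // TPL schedules the rows across the cores
        {
            for (int j = 0; j < m; j++)
            {
                double sum = 0.0;
                for (int l = 0; l < k; l++)
                    sum += a[i, l] * b[l, j];
                c[i, j] = sum;            // each task writes a distinct row: no contention
            }
        });
        return c;
    }
}
```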
Timeline for the term-long project
• Stage One
• Become familiar with the HPC cluster
• Sequential MM with C#
• Multithreaded MM with C#
• Performance comparison of above two approaches
• Stage Two
• Become familiar with the DryadLINQ interface
• Implement Row Partition algorithm with DryadLINQ
• Performance study
• Stage Three
• Refine the experiment results
• Report and presentation
Backup slides
Dryad Job Submission
[Diagram: Dryad job submission. On the client machine, the .NET program builds a LINQ query expression; invoking it (ToTable / foreach) hands it to DryadLINQ, which produces a distributed query plan and vertex code for the job manager (JM) on the HPC cluster. Dryad executes the plan over the input tables and writes the output tables; results return to the client as .NET objects / a DryadTable.]
Input: C# and LINQ data objects → DryadLINQ distributed data objects.
DryadLINQ translates LINQ programs into distributed Dryad computations:
C# methods become code running on the vertices of a Dryad job.
Output: DryadLINQ distributed data objects → .NET objects.
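As a tiny illustration of that translation (using only standard LINQ operators; the collection and helper names are hypothetical stand-ins, not the DryadLINQ API): a query like the one below is what DryadLINQ compiles into a distributed plan, and the lambda passed to Select is the C# code that ends up running on each vertex.

```csharp
// Illustrative LINQ fragment (requires: using System.Linq;).
// With DryadLINQ, aRowBlocks would be a distributed table of A row blocks
// and the Select lambda would run on the Dryad vertices.
var cRowBlocks = aRowBlocks
    .Select(rowBlock => MultiplyRowBlock(rowBlock, b))   // per-vertex work
    .ToArray();                                          // materialize the output
```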
Dryad Job Execution Flow
Performance on One Node
Performance on Multiple Nodes
Analysis of the three algorithms
Performance of the three algorithms
• Test done on 16 nodes of Tempest, using one core per
node.
Performance for Multithreaded MM
• Test done on one node of Tempest, 24 cores
[Chart: speed-up (0 to 25) vs. scale of square matrix (2400 to 19200) for the TPL, Thread, and PLINQ implementations]