Dense Linear Algebra
[Soccer] is a very simple game. It’s just very hard to play it simple.
- Johan Cruyff
RvdG
PACC2011, Sept. 2011
Robert van de Geijn
Department of Computer Science
Institute for Computational Engineering and Sciences
The University of Texas at Austin
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
Moments of Inspiration
Birth of multi-threaded libflame
Aug. 2006 - an insight: libflame + algorithm-by-blocks + out-of-order scheduling (runtime) = multithreaded library
Sept. 2006 - working prototype (by G. Quintana)
Oct. 2006 - grant proposal (to NSF, later funded)
Jan. 2007 - paper submitted (to SPAA07, accepted)
April 2007 - released with libflame R1.0
Birth of multi-GPU libflame
Fall 2007 - runtime used to manage data and tasks on a single GPU. (UJI-Spain)
March 2008 - NVIDIA donates a 4-GPU Tesla S870 system
Two hours after unboxing the boards, multiple heuristics for the multiGPU runtime were implemented
Then the power cord fried…
Birth of MultiGPU libflame
[Photos: "After two hours" / "Shortly after"]
G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP'09.
What Supports our Productivity/Performance?
Deep understanding of the domain
Foundational computer science
Derivation of algorithms
Software implementation of hardware techniques
Blocking for performance
Abstraction
Separation of concerns
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
What is FLAME?
A notation for expressing linear algebra algorithms
A methodology for deriving such algorithms
A set of abstractions for representing such algorithms
In LaTeX, M-script, C, etc.
A modern library (libflame)
Alternative to BLAS, LAPACK, ScaLAPACK, and related efforts
Many new contributions to theory and practice of dense linear algebra
Also banded and Krylov subspace methods
A set of tools supporting the above
Mechanical derivation
Automatic generation of code
Design-by-Transformation (DxT)
Who is FLAME?
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
Deriving Algorithms to be Correct
Include all algorithms for a given operation:
Pick the right algorithm for the given architecture
Problem:
How to find the right algorithm
Solution:
Formal derivation (Hoare, Dijkstra, …):
Given operation, systematically derive family of algorithms for computing it.
Notation
The notation used to express an algorithm should reflect how the algorithm is naturally explained.
Example: The Cholesky Factorization
Lower triangular case:

  A = L * L^T,  with L lower triangular

Key in the solution of s.p.d. linear systems:

  A x = b
  (L L^T) x = b
  L y = b    (solve for y)
  L^T x = y  (solve for x)
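To make the two triangular solves concrete, here is a minimal plain-C sketch (an illustration with assumed names and column-major storage, not libflame code):

#include <math.h>

/* Solve A x = b given the Cholesky factor L of A = L * L^T.
   L is n x n, lower triangular, column-major, leading dimension n.
   Forward substitution L y = b, then back substitution L^T x = y;
   the solution overwrites b. (Sketch only.)                        */
void chol_solve( int n, const double *L, double *b )
{
    int i, j;

    for ( j = 0; j < n; j++ ) {          /* L y = b */
        b[ j ] /= L[ j + j*n ];
        for ( i = j+1; i < n; i++ )
            b[ i ] -= L[ i + j*n ] * b[ j ];
    }

    for ( j = n-1; j >= 0; j-- ) {       /* L^T x = y */
        b[ j ] /= L[ j + j*n ];
        for ( i = 0; i < j; i++ )
            b[ i ] -= L[ j + i*n ] * b[ j ];
    }
}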
Algorithm Loop: Repartition
The matrix, partitioned into quadrants

  / ATL   *  \
  \ ABL  ABR /

is repartitioned (an indexing operation) as

  / A00    *    *  \
  | a10^T  a11  *  |
  \ A20   a21  A22 /
Algorithm Loop: Update
The update (the real computation):

  a11 := sqrt( a11 )
  a21 := a21 / a11
  A22 := A22 - a21 * a21^T
Algorithm Loop: Merging
After the update, the 3x3 partitioning is merged back (an indexing operation) into the quadrants

  / ATL   *  \
  \ ABL  ABR /

with ATL grown by one row and one column.
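Putting repartition, update, and merge together: a minimal plain-C sketch of the unblocked algorithm this loop describes (column-major storage and the function name are assumptions, not libflame code):

#include <math.h>

/* Unblocked right-looking Cholesky: overwrite the lower triangle of A
   (n x n, column-major, leading dimension ld) with L. Sketch only.   */
void chol_unb( int n, double *A, int ld )
{
    int i, j, k;

    for ( k = 0; k < n; k++ ) {
        A[ k + k*ld ] = sqrt( A[ k + k*ld ] );       /* a11 := sqrt( a11 )     */

        for ( i = k+1; i < n; i++ )                  /* a21 := a21 / a11       */
            A[ i + k*ld ] /= A[ k + k*ld ];

        for ( j = k+1; j < n; j++ )                  /* A22 := A22 - a21*a21^T */
            for ( i = j; i < n; i++ )
                A[ i + j*ld ] -= A[ i + k*ld ] * A[ j + k*ld ];
    }
}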
Worksheet for Cholesky Factorization
Mechanical Derivation of Algorithms
Mechanical development from the mathematical specification

  A = L * L^T

via a mechanical procedure.

Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Ph.D. Dissertation. UT-Austin. August 2006.
Is Formal Derivation Practical?
libflame :
128+ matrix operations
1389+ implementations of algorithms
Test suite created in 2011
126,756 tests executed
Only 3 minor bugs in library… (now fixed)
Impact on (Single) GPU Computing
CUBLAS 2009 (it has been optimized since)
Impact on (Single) GPU Computing
CUBLAS 2009
Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation. Univ. Jaume I. May 2011.
Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37. Universidad Jaume I, updated May 21, 2009.
A Sampling of Functionality

Level-3 BLAS
Cholesky
LU with partial pivoting
LU with incremental pivoting
QR (UT)
LQ (UT)
SPD/HPD inversion
Triangular inversion
Triangular Sylvester
Triangular Lyapunov
Up-and-downdate (UT)
SVD
EVD

[Table: each operation's support status (y / soon / next week / N.A.) in three columns: Classic FLAME, SuperMatrix MultiThreaded/MultiGPU, and lapack2flame]
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
Representing Algorithms in Code
Code should closely resemble how an algorithm is presented so that no bugs can be introduced when translating an algorithm to code.
Representing Algorithms in Code: Spark + APIs
http://www.cs.utexas.edu/users/flame/Spark/
C,
F77,
Matlab,
LabView,
LaTeX
FLAME/C API
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

  FLA_Repart_2x2_to_3x3(
      ATL, /**/ ATR,      &A00,  /**/ &a01,     &A02,
    /* ************* */   /* ************************** */
                          &a10t, /**/ &alpha11, &a12t,
      ABL, /**/ ABR,      &A20,  /**/ &a21,     &A22,
      1, 1, FLA_BR );
  /*--------------------------------------*/
  FLA_Sqrt( alpha11 );
  FLA_Inv_scal( alpha11, a21 );
  FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
           FLA_MINUS_ONE, a21, A22 );
  /*--------------------------------------*/
  FLA_Cont_with_3x3_to_2x2(
      &ATL, /**/ &ATR,      A00,  a01,     /**/ A02,
                            a10t, alpha11, /**/ a12t,
    /* ************** */   /* ************************ */
      &ABL, /**/ &ABR,      A20,  a21,     /**/ A22,
      FLA_TL );
}
For now, libflame employs external BLAS: GotoBLAS, MKL, ACML, CUBLAS
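For context, a hypothetical driver around the loop above might look as follows (a sketch; FLA_Init, FLA_Obj_create, FLA_Obj_free, and FLA_Finalize are libflame calls, but n and the matrix contents are assumptions here - consult the libflame reference for exact usage):

FLA_Obj A;

FLA_Init();
FLA_Obj_create( FLA_DOUBLE, n, n, 0, 0, &A );  /* default (column-major) storage */
/* ... fill A with an s.p.d. matrix ...                                          */
/* ... run the Cholesky loop shown above on A ...                                */
FLA_Obj_free( &A );
FLA_Finalize();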
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
Multicore/MultiGPU - Issues
Manage computation
Assignment of tasks to cores and/or GPUs
Granularity is important
Manage memory
Manage data transfer between the "host" memory and the caches of the cores, or between the host and the GPUs' local memories
Granularity is important
Keep the data in the local memory as long as possible
Where have we seen this before?
Computer architecture late 1960s:
Superscalar units proposed
Unit of data: floating point number
Unit of computation: floating point operation
Examine dependencies
Execute out-of-order, prefetch, cache data, etc., to keep computational units busy
Extract parallelism from sequential instruction stream
R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units." IBM J. of R&D, (1967)
Basis for exploitation of ILP on current superscalar processors!
Of Blocks and Tasks
Dense matrix computation
Unit of data: block in matrix
Unit of computation (task): operation with blocks
Dependency: input/output of operation with blocks
Instruction stream: sequential libflame code
Generates DAG
Runtime system schedules tasks
Goal: minimize data transfer and maximize utilization
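As a concrete (hypothetical) illustration of these units - the structures below are an invention for this sketch, not SuperMatrix's actual data structures - a task records which blocks it reads and writes, and task B depends on task A whenever B reads or writes a block that A writes:

typedef struct task {
    const char  *name;         /* e.g. "CHOL", "TRSM", "SYRK", "GEMM"  */
    void        *in[ 3 ];      /* blocks read  (unit of data: a block) */
    int          n_in;
    void        *out;          /* block written                        */
    struct task *succ[ 16 ];   /* outgoing DAG edges                   */
    int          n_succ;
} task_t;

/* True if task b must wait for task a: b reads or writes a's output. */
int depends_on( const task_t *b, const task_t *a )
{
    if ( b->out == a->out ) return 1;
    for ( int i = 0; i < b->n_in; i++ )
        if ( b->in[ i ] == a->out ) return 1;
    return 0;
}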
Review: Blocked Algorithms
Cholesky factorization, blocked:

  A11 := Chol( A11 )            ( A11 = L11 * L11^T )
  A21 := A21 * inv( L11^T )     ( = L21 )
  A22 := A22 - L21 * L21^T

[Figure: the 1st, 2nd, and 3rd iterations sweep down the diagonal]
Blocked Algorithms
Cholesky factorization
A = L * L^T
APIs + Tools:

FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLA_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
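For comparison, the same blocked right-looking loop written directly in C with CBLAS calls - a sketch under assumed naming and column-major storage, not part of libflame; chol_unb is the unblocked sketch from earlier:

#include <cblas.h>

void chol_unb( int n, double *A, int ld );    /* unblocked sketch from earlier */

/* Blocked right-looking Cholesky: A is n x n, column-major,
   leading dimension n; blocksize nb. Sketch only.            */
void chol_blk( int n, double *A, int nb )
{
    for ( int k = 0; k < n; k += nb ) {
        int     b   = ( n - k < nb ? n - k : nb );
        double *A11 = &A[  k    +  k    * n ];
        double *A21 = &A[ (k+b) +  k    * n ];
        double *A22 = &A[ (k+b) + (k+b) * n ];

        chol_unb( b, A11, n );                        /* A11 := Chol( A11 )       */

        if ( n - k - b > 0 ) {
            cblas_dtrsm( CblasColMajor, CblasRight,   /* A21 := A21 * inv(L11^T)  */
                         CblasLower, CblasTrans, CblasNonUnit,
                         n-k-b, b, 1.0, A11, n, A21, n );
            cblas_dsyrk( CblasColMajor, CblasLower,   /* A22 := A22 - A21 * A21^T */
                         CblasNoTrans, n-k-b, b,
                         -1.0, A21, n, 1.0, A22, n );
        }
    }
}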
Simple Parallelization: Blocked Algorithms
Link with multi-threaded BLAS
  A11 := Chol( A11 )            ( A11 = L11 * L11^T )
  A21 := A21 * inv( L11^T )     ( = L21 )
  A22 := A22 - L21 * L21^T
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLA_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
Blocked Algorithms
There is more parallelism!
Inside the same iteration
In different iterations
[Figure: overlapping operations from the 1st and 2nd iterations]
Coding Algorithm-by-Blocks
Algorithm-by-blocks:

  A11 := Chol( A11 )            ( A11 = L11 * L11^T )
  A21 := A21 * inv( L11^T )     ( = L21 )
  A22 := A22 - L21 * L21^T
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
Generating a DAG
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}

[Figure: the loop above generates a DAG of tasks 1-10]
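For intuition about where those ten tasks come from, here is a sketch (an illustration, not libflame code) of a sequential triple loop that enumerates the tasks of Cholesky on an nb x nb grid of blocks; nb = 3 yields exactly ten tasks, matching the DAG above:

#include <stdio.h>

void enumerate_chol_tasks( int nb )
{
    int k, i, j, t = 1;

    for ( k = 0; k < nb; k++ ) {
        printf( "%2d: CHOL A(%d,%d)\n", t++, k, k );
        for ( i = k+1; i < nb; i++ )
            printf( "%2d: TRSM A(%d,%d) uses A(%d,%d)\n", t++, i, k, k, k );
        for ( j = k+1; j < nb; j++ )
            for ( i = j; i < nb; i++ )
                printf( "%2d: %s A(%d,%d) uses A(%d,%d), A(%d,%d)\n", t++,
                        ( i == j ? "SYRK" : "GEMM" ), i, j, i, k, j, k );
    }
}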
Managing Tasks and Blocks
Separation of concerns:
Sequential libflame routine generates the DAG
Runtime system (SuperMatrix) manages and schedules the DAG
As one moves from one architecture to another, only the runtime system needs to be updated
Multicore
Out-of-core
Single GPU
MultiGPU
Distributed Runtime
…
Runtime system - SuperMatrix
[Figure: SuperMatrix plus a scheduling heuristic maps the DAG of tasks 1-10 onto the cores of a multicore]
Runtime system for GPU - SuperMatrix

[Figure: SuperMatrix maps the DAG of tasks 1-10 onto CPU + GPU (the accelerator) and manages the data transfer]
Runtime system for MultiGPU - SuperMatrix
[Figure: SuperMatrix maps the DAG of tasks 1-10 onto CPU + multiple GPUs (multi-accelerator) and manages the data transfer]
MultiGPU
How do we program these?
[Figure: host CPU(s) connected through the PCI-e bus / interconnect to GPUs #0-#3]
MultiGPU: a User’s View
FLA_Obj A;
// Initialize conventional matrix: buffer, m, rs, cs
// Obtain storage blocksize, # of threads: b, n_threads
FLA_Init();
FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
FLASH_Queue_set_num_threads( n_threads );
FLASH_Queue_enable_gpu();
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Obj_free( &A );
FLA_Finalize();
MultiGPU: Under the Cover
Naïve approach (implementation "someone else's problem"):
Before execution, transfer data to device
Call CUBLAS operations
Upon completion, retrieve results back to host
Poor data locality!

[Figure: host CPU(s) and GPUs #0-#3 connected by the PCI-e bus]
MultiGPU: Under the Cover
How do we program these? View the system as a…
Shared-memory multiprocessor +
Distributed Shared Memory (DSM) architecture

[Figure: host CPU(s) and GPUs #0-#3 connected by the PCI-e bus]
MultiGPU: Under the Cover
View the system as a shared-memory multiprocessor (a multi-core processor with hardware coherence): each GPU #i plays the role of a processor Pi with its own cache.

[Figure: processors P0-P3 with caches 0-3 mapped onto GPUs #0-#3 behind the PCI-e bus]
MultiGPU: Under the Cover
Software Distributed-Shared Memory (DSM)
Software: flexibility vs. efficiency
Underlying distributed memory hidden
Reduce memory transfers using write-back, write-invalidate, …
Well-known approach, not too efficient as a middleware for general apps.
Regularity of dense linear algebra makes a difference!
MultiGPU: Under the Cover
Reduce the number of data transfers:
The runtime handles device memory as a software cache:
Operate at the block level
Software flexibility
Write-back
Write-invalidate

[Figure: SuperMatrix schedules the DAG of tasks 1-10 across host CPU(s) and GPUs #0-#3]
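A minimal sketch of the software-cache idea (hypothetical structures and names; SuperMatrix's actual implementation differs): each device keeps a table of resident blocks, reads transfer only on a miss, writes set a dirty bit (write-back), and a write on one device invalidates stale copies on the others (write-invalidate).

#define MAX_BLOCKS 1024

typedef struct {
    void *host_block;    /* which block this entry caches          */
    void *dev_copy;      /* device-side copy, NULL if not resident */
    int   dirty;         /* device copy newer than host copy?      */
} entry_t;

typedef struct {
    entry_t entry[ MAX_BLOCKS ];
    int     n;
} dev_cache_t;

/* Before device d reads a block: return the cached copy on a hit;
   on a miss the runtime would allocate and copy host -> device.  */
void *cache_read( dev_cache_t *d, void *blk )
{
    for ( int i = 0; i < d->n; i++ )
        if ( d->entry[ i ].host_block == blk )
            return d->entry[ i ].dev_copy;   /* hit: no PCI-e transfer */
    return NULL;                             /* miss: transfer needed  */
}

/* After device d writes a block: mark its copy dirty (write-back) and
   invalidate any copies held by the other devices (write-invalidate). */
void cache_write( dev_cache_t cache[], int ndev, int d, void *blk )
{
    for ( int g = 0; g < ndev; g++ )
        for ( int i = 0; i < cache[ g ].n; i++ )
            if ( cache[ g ].entry[ i ].host_block == blk ) {
                if ( g == d ) cache[ g ].entry[ i ].dirty    = 1;
                else          cache[ g ].entry[ i ].dev_copy = NULL;
            }
}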
MultiGPU: Under the Cover
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
SuperMatrix:
• Factor A11 on host
Multi-GPU: Under the Cover
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11 , A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
SuperMatrix:
• Transfer A11 from the host to the appropriate devices before using it in subsequent computations (write-update)
Multi-GPU: Under the Cover
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11 , A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
SuperMatrix:
• Cache A11 in the receiving device(s) in case it is needed in subsequent computations
Multi-GPU: Under the Cover
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
SuperMatrix:
• Send blocks to devices
• Perform Trsm on blocks of A21 (hopefully using the cached A11)
• Keep the updated A21 in the device till needed by other GPU(s) (write-back)
Multi-GPU: Under the Cover
FLA_Part_2x2 (…);
while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
FLA_Repart_2x2_to_3x3 (…);
/*--------------------------------------*/
FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR,FLA_NO_TRANSPOSE,
FLA_MINUS_ONE, A21, FLA_ONE, A22 );
/*--------------------------------------*/
FLA_Cont_with_3x3_to_2x2 (…);
}
SuperMatrix:
• Send blocks to devices
• Perform Syrk/Gemm on blocks of A22 (hopefully using cached blocks of A21)
• Keep the updated A22 in the device till needed by other GPU(s) (write-back)
C = C + A * B^T on S1070 (Tesla x 4)
Cholesky on S1070 (Tesla x 4)
Sampling of LAPACK functionality on S2050 (Fermi x 4)
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
libflame for Cluster (+ Accelerators)
PLAPACK
Distributed memory (MPI)
Inspired FLAME
Recently modified so that each node can have a GPU
Keep data in GPU memory as much as possible
Elemental (Jack Poulson)
Distributed memory (MPI)
Inspired by FLAME/PLAPACK
Can use GPU at each node/core
libflame + SuperMatrix
Runtime schedules tasks and data transfer
Appropriate for small clusters
PLAPACK + GPU accelerators
Each node:
Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla)
Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." WEHA 2010.
Targeting Clusters with GPUs
SuperMatrix Distributed Runtime
Each node:
Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi)
Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.
Elemental
Cholesky Factorization
Elemental vs. ScaLAPACK
Cholesky on 8192 cores of a BlueGene/P

Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem).

Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS. Submitted.
Single-Chip Cloud Computer
Intel SCC research processor
48-core concept vehicle
Created for many-core software research
Custom communication library (RCCE)
SCC Results
- 48 Pentium cores
- MPI replaced by RCCE

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience. To appear.
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
Related work
Data-flow parallelism, dynamic scheduling, runtime
Cilk
OpenMP (task queues)
StarSs (SMPSs)
StarPU
Threading Building Blocks (TBB)
…
What we have is very specific to dense linear algebra
Dense Linear Algebra Libraries
Target Platform                      | LAPACK Project    | FLAME Project
Sequential                           | LAPACK            | libflame
Sequential + multithreaded BLAS      | LAPACK            | libflame
Multicore/multithreaded              | PLASMA            | libflame + SuperMatrix
Multicore + out-of-order scheduling  | PLASMA + Quark    | libflame + SuperMatrix
CPU + single GPU                     | MAGMA             | libflame + SuperMatrix
Multicore + multiGPU                 | DAGuE?            | libflame + SuperMatrix
Distributed memory                   | ScaLAPACK         | libflame + SuperMatrix, PLAPACK, Elemental
Distributed memory + GPU             | DAGuE? ScaLAPACK? | libflame + SuperMatrix, PLAPACK, Elemental
Out-of-Core                          | ?                 | libflame + SuperMatrix
Comparison with Quark
Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar, 2010.
Outline
Motivation
What is FLAME?
Deriving algorithms to be correct
Representing algorithms in code
Of blocked algorithms and algorithms-by-blocks
Runtime support for multicore, GPU, and multiGPU
Extensions to distributed memory platforms
Related work
Conclusion
Conclusions
Programmability is the key to harnessing parallel computation
One code, many target platforms
Formal derivation provides confidence in code
If there is a problem, it is not in the library!
Separation of concerns
Library developer derives algorithms and codes them
Execution of routines generates DAG
Parallelism, temporal locality, and spatial locality are captured in the DAG
Runtime system uses appropriate heuristics to schedule
Identify units of data and units of computation
Write a sequential program that generates a DAG
Hand DAG to runtime for scheduling
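The whole recipe in pseudo-C (a sketch; block_t and the runtime_* interface are hypothetical names, not libflame's API):

typedef struct block { double *data; } block_t;   /* unit of data: a block */

/* Hypothetical runtime interface (illustration only). */
void runtime_submit( const char *kernel, block_t *out,
                     block_t *in1, block_t *in2 )  /* record a DAG node    */
{ (void)kernel; (void)out; (void)in1; (void)in2; }
void runtime_wait_all( void ) { /* runtime schedules and executes the DAG */ }

/* Sequential code that generates the DAG for Cholesky on an nb x nb
   grid of blocks, where A[ i*nb + j ] is block (i,j).                */
void chol_by_blocks( int nb, block_t **A )
{
    for ( int k = 0; k < nb; k++ ) {
        runtime_submit( "CHOL", A[ k*nb+k ], A[ k*nb+k ], NULL );
        for ( int i = k+1; i < nb; i++ )
            runtime_submit( "TRSM", A[ i*nb+k ], A[ k*nb+k ], NULL );
        for ( int j = k+1; j < nb; j++ )
            for ( int i = j; i < nb; i++ )
                runtime_submit( i == j ? "SYRK" : "GEMM",
                                A[ i*nb+j ], A[ i*nb+k ], A[ j*nb+k ] );
    }
    runtime_wait_all();
}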
The Future
Currently: Library is an instantiation in code
Future
Create repository of algorithms, expert knowledge about algorithms, and knowledge about a target architecture
Mechanically generate a library for a target architecture, exactly as an expert would
Design-by-Transformation (DxT)
Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58. 2011.
Availability
Everything that has been discussed is available under the LGPL or BSD license
libflame + SuperMatrix
http://www.cs.utexas.edu/users/flame/
Elemental
http://code.google.com/p/elemental/