PaRSEC: Parallel Runtime Scheduling and Execution Controller
Jack Dongarra, George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault
Also thanks to: Julien Herrmann, Julien Langou, Bradley R. Lowery, Yves Robert
Motivation
• Today software developers face systems with
  • ~1 TFLOP of compute power per node
  • 32+ cores, 100+ hardware threads
  • Highly heterogeneous architectures (cores + specialized cores + accelerators/coprocessors)
  • Deep memory hierarchies
  • Distributed systems
  • Fast evolution
• Mainstream programming paradigms introduce systemic noise, load imbalance, overheads (< 70% of peak on dense linear algebra, DLA)
Tianhe-2, China, June '14: 34 PetaFLOPS
• Peak performance of 54.9 PFLOPS
• 16,000 nodes contain 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators totaling 3,120,000 cores
• 162 cabinets in 720 m2 footprint
• Total 1.404 PB memory (88 GB per node)
• Each Xeon Phi board utilizes 57 cores for aggregate 1.003 TFLOPS at 1.1 GHz clock
• Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches)
• 12.4 PB parallel storage system
• 17.6 MW power consumption under load; 24 MW including (water) cooling
• 4096 SPARC V9 based Galaxy FT-1500 processors in front-end system
Task-based programming
• Focus on data dependencies, data flows, and tasks
• Don't develop for an architecture but for a portability layer
• Let the runtime deal with the hardware characteristics
  • But provide as much user control as possible
• StarSS, StarPU, Swift, ParalleX, Quark, Kaapi, DuctTeip, ..., and PaRSEC
[Figure: software stack. App on top of the Runtime (Data Distrib., Memory Manager, Sched., Comm, Heterogeneity Manager) on top of the Hardware.]
The PaRSEC framework
[Figure: layered view of the framework with three layers: Domain Specific Extensions, Parallel Runtime, Hardware. Other labels: Dense LA, Sparse LA, Chemistry, …; Compact Representation PTG; Dynamic / Prototyping Interface - DTD; Power User; Scheduling; Data; Data Movement; Coherence; Tasks; Specialized Kernels; Cores; Memory Hierarchies; Accelerators.]
PaRSEC toolchain
[Figure: toolchain diagram. Labels: Programmer; Application code & Codelets; Data distribution; Dataflow representation; Serial Code; DAGuE compiler; Dataflow compiler; Parallel tasks stubs; System compiler; Runtime; Domain Specific Extensions; Additional libraries (MPI, pthreads, CUDA, PLASMA, MAGMA); Supercomputer; DAGuE Toolchain; PaRSEC Toolchain.]
Input Format – Quark/StarPU/MORSE
for (k = 0; k < A.mt; k++) {
    /* Factor the diagonal tile */
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT );
    /* Eliminate the tiles below the diagonal */
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                             A[m][k], INOUT | LOCALITY,
                             T[m][k], OUTPUT );
    }
    /* Update the trailing tiles */
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                             T[k][k], INPUT,
                             A[k][n], INOUT );
        for (m = k+1; m < A.mt; m++)
            Insert_Task( ztsmqr, A[k][n], INOUT,
                                 A[m][n], INOUT | LOCALITY,
                                 A[m][k], INPUT,
                                 T[m][k], INPUT );
    }
}
• Sequential C code
• Annotated through some specific syntax
  • Insert_Task
  • INOUT, OUTPUT, INPUT
  • REGION_L, REGION_U, REGION_D, …
  • LOCALITY
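How a runtime can turn this sequential insertion order into a task graph can be sketched in a few lines. The following is a minimal illustration in Python, not the Quark, StarPU or MORSE implementation: `TaskGraph`, `insert_task` and `tile_qr` are hypothetical names, tiles are tracked as a whole (the REGION_* and LOCALITY hints are ignored), and dependencies are inferred only from the INPUT / OUTPUT / INOUT access modes.

```python
from collections import defaultdict

INPUT, OUTPUT, INOUT = "INPUT", "OUTPUT", "INOUT"


class TaskGraph:
    """Toy task-insertion front end (hypothetical): records tasks and edges."""

    def __init__(self):
        self.tasks = []                    # task names, in insertion order
        self.edges = set()                 # (predecessor, successor) pairs
        self.last_writer = {}              # tile -> last task that wrote it
        self.readers = defaultdict(list)   # tile -> readers since last write

    def insert_task(self, name, *accesses):
        """Each access is a (tile, mode) pair; edges follow from the modes."""
        self.tasks.append(name)
        for tile, mode in accesses:
            writer = self.last_writer.get(tile)
            if writer is not None:
                # read-after-write or write-after-write dependency
                self.edges.add((writer, name))
            if mode == INPUT:
                self.readers[tile].append(name)
            else:
                # write-after-read: wait for readers of the previous version
                for reader in self.readers[tile]:
                    self.edges.add((reader, name))
                self.readers[tile] = []
                self.last_writer[tile] = name


def tile_qr(mt, nt):
    """Replay the sequential tile QR loop nest on an mt x nt tile matrix."""
    g = TaskGraph()
    for k in range(min(mt, nt)):
        g.insert_task(f"GEQRT({k})",
                      (("A", k, k), INOUT), (("T", k, k), OUTPUT))
        for m in range(k + 1, mt):
            g.insert_task(f"TSQRT({k},{m})",
                          (("A", k, k), INOUT), (("A", m, k), INOUT),
                          (("T", m, k), OUTPUT))
        for n in range(k + 1, nt):
            g.insert_task(f"UNMQR({k},{n})",
                          (("A", k, k), INPUT), (("T", k, k), INPUT),
                          (("A", k, n), INOUT))
            for m in range(k + 1, mt):
                g.insert_task(f"TSMQR({k},{m},{n})",
                              (("A", k, n), INOUT), (("A", m, n), INOUT),
                              (("A", m, k), INPUT), (("T", m, k), INPUT))
    return g
```

Because whole tiles are tracked, this sketch serializes UNMQR behind the TSQRT tasks that update the same diagonal tile; the region flags in the slide code exist precisely to avoid that kind of false dependency.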
Example: QR Factorization (DLA)
[Figure: tile QR factorization; kernels GEQRT, TSQRT, UNMQR, TSMQR.]
Dataflow Analysis
• Data flow analysis
  • Example on task DGEQRT of QR
• Polyhedral analysis through Omega Test
• Compute algebraic expressions for:
  • Source and destination tasks
  • Necessary conditions for that data flow to exist

FOR k = 0 .. SIZE - 1
    A[k][k], T[k][k] <- GEQRT( A[k][k] )
    FOR m = k+1 .. SIZE - 1
        A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
    FOR n = k+1 .. SIZE - 1
        A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
        FOR m = k+1 .. SIZE - 1
            A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

[Figure: incoming and outgoing data of GEQRT. Labels: MEM, k = 0, k = SIZE-1, UPPER, LOWER, n = k+1, m = k+1.]
Intermediate Representation: Job Data Flow

GEQRT(k)

/* Execution space */
k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )

/* Locality */
: A(k, k)

RW    A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
        -> (k < NT-1)  ? A UNMQR(k, k+1 .. NT-1)   [type = LOWER]
        -> (k < MT-1)  ? A1 TSQRT(k, k+1)          [type = UPPER]
        -> (k == MT-1) ? A(k, k)                   [type = UPPER]
WRITE T <- T(k, k)
        -> T(k, k)
        -> (k < NT-1)  ? T UNMQR(k, k+1 .. NT-1)

/* Priority */
; (NT-k)*(NT-k)*(NT-k)

BODY [GPU, CPU, MIC]
    zgeqrt( A, T )
END
Control flow is eliminated, therefore maximum parallelism is possible
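The guarded flows of GEQRT(k) can be evaluated for any k without unrolling the loop nest, which is what makes the representation compact. The sketch below mirrors the guards of the JDF above in Python; `geqrt_flows` and the dictionary keys are hypothetical names chosen for illustration, and the strings only echo the JDF notation.

```python
def geqrt_flows(k, mt, nt):
    """Evaluate the flows of GEQRT(k) on an mt x nt tile matrix.

    Hypothetical helper: it transcribes the guards of the GEQRT JDF entry
    and returns human-readable descriptions of each active flow.
    """
    assert 0 <= k < min(mt, nt), "k is outside the execution space"

    # RW A: a single incoming flow, selected by the (k == 0) guard
    a_in = f"A({k},{k})" if k == 0 else f"A1 TSMQR({k-1},{k},{k})"

    # RW A: outgoing flows, each one active only if its guard holds
    a_out = []
    if k < nt - 1:
        a_out += [f"A UNMQR({k},{n}) [LOWER]" for n in range(k + 1, nt)]
    if k < mt - 1:
        a_out.append(f"A1 TSQRT({k},{k+1}) [UPPER]")
    if k == mt - 1:
        a_out.append(f"A({k},{k}) [UPPER]")

    # WRITE T: always written back, and broadcast to the UNMQR row if any
    t_out = [f"T({k},{k})"]
    if k < nt - 1:
        t_out += [f"T UNMQR({k},{n})" for n in range(k + 1, nt)]

    # Priority expression: earlier panels get a higher priority
    priority = (nt - k) ** 3

    return {"A_in": a_in, "A_out": a_out, "T_out": t_out,
            "priority": priority}
```

For example, `geqrt_flows(0, 3, 3)` reads its tile from memory and feeds two UNMQR tasks and one TSQRT task, while `geqrt_flows(2, 3, 3)` reads from TSMQR(1,2,2) and only writes back to memory.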
Data/Task Distribution
• Flexible data distribution
  • Decoupled from the algorithm
  • Expressed as a user-defined function
  • Only limitation: must evaluate uniformly across all nodes
• Common distributions provided in DSEs
  • 1D cyclic, 2D cyclic, etc.
  • Symbol Matrix for sparse direct solvers
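A data distribution expressed as a user-defined function can be as small as the sketch below. It is an illustration only, not the PaRSEC API: `rank_of_1d_cyclic`, `rank_of_2d_cyclic` and `tiles_per_rank` are hypothetical names, and the row-major numbering of ranks on the process grid is an assumption. Each function depends only on its arguments, so every node computes the same owner for a given tile.

```python
from collections import Counter


def rank_of_1d_cyclic(m, n, nprocs):
    """1D cyclic: tile rows are dealt round-robin to the processes."""
    return m % nprocs


def rank_of_2d_cyclic(m, n, p, q):
    """2D cyclic on a p x q process grid.

    Assumption: ranks are numbered row by row on the grid.
    """
    return (m % p) * q + (n % q)


def tiles_per_rank(mt, nt, p, q):
    """Count how many tiles of an mt x nt tile matrix each rank owns."""
    return Counter(rank_of_2d_cyclic(m, n, p, q)
                   for m in range(mt) for n in range(nt))
```

Swapping one function for the other changes where tasks run (through the locality clause of each task) without touching the algorithm, which is the decoupling the slide refers to.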
PaRSEC Runtime
[Figure: execution timeline on Node 0 and Node 1, each with Thread 0, Thread 1 and a Comm. Thread. Tasks named Ta(i) and Tb(i,j) are laid out along the threads, with slots labeled S, N, D and A between them.]
• Each computation thread alternates between executing a task and scheduling tasks
• Computation threads are bound to cores
• Communication threads (one per node) transfer task completion notifications, and data
• Communication threads can be bound or not
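The alternation between executing a task and scheduling its successors can be sketched with dependency counters and a ready queue. This is a single-threaded simulation for illustration, not the PaRSEC scheduler: `run` is a hypothetical function, the task names are placeholders, and there is no communication thread.

```python
from collections import deque


def run(tasks, deps):
    """Simulate one computation thread.

    tasks: list of task names.
    deps:  dict mapping a task to the set of tasks it waits for.
    Returns the order in which tasks were executed.
    """
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    successors = {t: [] for t in tasks}
    for task, preds in deps.items():
        for pred in preds:
            successors[pred].append(task)

    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)                 # "execute" the task body
        for succ in successors[task]:      # "schedule": release successors
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle: some tasks never became ready")
    return order
```

In a multi-threaded runtime each thread would run this loop against shared (or per-thread) ready queues, and a remote successor would be released by a notification sent through the communication thread instead of a local counter decrement.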
Strong Scaling
[Figure: DGEQRF performance strong scaling on Cray XT5 (Kraken), N = M = 41,472 (≈ 270x270 double / core). Y axis: performance (TFlop/s), 0 to 25. X axis: number of cores, 768 to 23,868. Series: PaRSEC DPLASMA, LibSCI Scalapack.]
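The "≈ 270x270 double / core" annotation is consistent with dividing the matrix evenly over the largest core count on the axis. A short check, assuming (this is not stated on the slide) that the annotation refers to the 23,868-core run; `side_per_core` is a hypothetical helper:

```python
import math


def side_per_core(m, n, cores):
    """Side of the square block of matrix entries each core would hold
    if an m x n matrix were split evenly over `cores` cores."""
    return math.sqrt(m * n / cores)


# Assumption: the slide annotation refers to the largest run on the x axis.
side = side_per_core(41472, 41472, 23868)
```

The result is a little over 268, that is roughly 270x270 doubles per core, which shows how little work each core has left at the right end of the strong-scaling curve.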
PaRSEC Runtime: Accelerators
BODY [GPU, CPU, MIC]
    zgeqrt( A, T )
END
When tasks that can run on an accelerator are scheduled
• A computation thread takes control of a free accelerator
• Schedules tasks and data movements on the accelerator
• Until no more tasks can run on the accelerator
The engine takes care of the data consistency
• Multiple copies (with versioning) of each "tile" co-exist, on different resources
• Data movement between devices is implicit
[Figure: execution timeline on Node 0 with Thread 0, Thread 1, a Comm. Thread and Accelerator 0 (IN, Comp., OUT). One thread acts as Acc. Client while tasks Ta(i) and Tb(i,j) run, with slots labeled S, N and D between them.]
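Versioned copies with implicit data movement can be illustrated as follows. This is a minimal sketch under stated assumptions, not the PaRSEC data engine: the `Tile` class, its `acquire` method and the device names are hypothetical, and a "transfer" is only recorded in a log.

```python
class Tile:
    """A tile with one versioned copy per device (hypothetical model)."""

    def __init__(self, name, home="cpu"):
        self.name = name
        self.version = 0            # newest version of the tile
        self.copies = {home: 0}     # device -> version of its copy
        self.transfers = []         # log of (source, destination, version)

    def acquire(self, device, write=False):
        """Give `device` an up-to-date copy, moving data only if needed."""
        if self.copies.get(device) != self.version:
            # Implicit movement: fetch from any device holding the newest copy
            owner = next(d for d, v in self.copies.items()
                         if v == self.version)
            self.transfers.append((owner, device, self.version))
            self.copies[device] = self.version
        if write:
            # The writer now holds the only copy of the new version;
            # copies elsewhere stay around but are stale.
            self.version += 1
            self.copies[device] = self.version
        return self.version
```

A task body never asks for a transfer explicitly; the stale copy on another device is simply refreshed the next time a task there acquires the tile.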
Multi GPU – single node
Multi GPU – distributed
[Figure: performance (GFlop/s), 0 to 1400, vs. matrix size (N), 10k to 50k. Series: C1060x1, C1060x2, C1060x3, C1060x4.]
Scalability
• Single node
  • 4 x Tesla (C1060)
  • 16 cores (AMD Opteron)
• Keeneland
  • 64 nodes
  • 3 * M2090
  • 16 cores
Example 1: Hierarchical QR
• A single QR step = nullify all tiles below the current diagonal tile
• Choosing what tile to "kill" with what other tile defines the duration of the step
• This coupling defines a Tree
• Choosing how to compose trees depends on the shape of the matrix, on the cost of each kernel operation, on the platform characteristics
[Figure: A Binomial Tree]
[Figure: A Flat Tree]
[Figure: Composing Two Binomial Trees]
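The difference between a flat and a binomial tree can be made concrete by listing who kills whom in each round. The sketch below is an illustration under simplifying assumptions: `flat_tree` and `binomial_tree` are hypothetical helpers, every elimination is taken to cost one round, and eliminations on disjoint tile pairs are taken to run concurrently (real kernels have different costs and can pipeline).

```python
def flat_tree(p):
    """Tile 0 kills tiles 1 .. p-1 one after the other.

    Returns a list of rounds; each round is a list of (victim, killer).
    """
    return [[(i, 0)] for i in range(1, p)]


def binomial_tree(p):
    """Pair up surviving tiles at doubling strides; tile 0 survives."""
    rounds = []
    stride = 1
    while stride < p:
        rounds.append([(i + stride, i)
                       for i in range(0, p - stride, 2 * stride)])
        stride *= 2
    return rounds


def eliminations(rounds):
    """Total number of tiles killed over all rounds."""
    return sum(len(r) for r in rounds)
```

Both trees kill the same p - 1 tiles, but for p = 8 the flat tree needs 7 rounds on this simplified cost model while the binomial tree needs 3; which one wins in practice depends on the kernel costs and the platform, as the slide notes.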
Example 1: Hierarchical QR
Sequential Algorithm
• Depends on arbitrary functions killer(i, k) and elim(i, j, k)
JDF Representation
• qrtree (passed as arbitrary structure to the JDF object) implements elim / killer as a set of convenient functions

zunmqr(k, i, n)

/* Execution space */
k = 0 .. minMN-1
i = 0 .. qrtree.getnbgeqrf( k ) - 1
n = k+1 .. NT-1
m     = qrtree.getm(k, i)
nextm = qrtree.nextpiv(k, m, MT)

: A(m, n)

READ A <- A zgeqrt(k, i)                             [type = LOWER_TILE]
READ T <- T zgeqrt(k, i)                             [type = LITTLE_T]
RW   C <- ( 0 == k ) ? A(m, n)
       <- ( k > 0 )  ? A2 zttmqr(k-1, m, n)
       -> ( k == MT-1 ) ? A(m, n)
       -> ( (k < MT-1) & (nextm != MT) ) ? A1 zttmqr(k, nextm, n)
       -> ( (k < MT-1) & (nextm == MT) ) ? A2 zttmqr(k, m, n)
Hierarchical QR
• How to compose trees to get the best pipeline?
  • Flat, Binary, Fibonacci, Greedy, …
• Study on critical path lengths
  • Square -> Tall and Skinny
• Surprisingly, flat trees are better for communications on square cases:
  • Fewer communications
  • Good pipeline
[Figure: Solving Linear Least Square Problem (DGEQRF). 60-node, 480-core, 2.27GHz Intel Xeon Nehalem, IB 20G System. Theoretical Peak: 4358.4 GFlop/s. Y axis: performance (GFlop/s), 0 to 3000. X axis: matrix size M, 0 to 300,000 (N = 4,480). Series: Hierarchical QR; 1D 2-level tree binary/flat [SLDH10]; DPLASMA DGEQRF; LibSCI Scalapack.]
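The theoretical peak quoted for this system follows from the core count and clock rate. The check below assumes 4 double-precision floating point operations per cycle per core, a figure that is not on the slide but is the usual one for this processor generation and reproduces the quoted number; `theoretical_peak_gflops` is a hypothetical helper.

```python
def theoretical_peak_gflops(cores, ghz, flops_per_cycle=4):
    """Peak = cores x clock (GHz) x floating point operations per cycle.

    Assumption: 4 double-precision flops per cycle per core.
    """
    return cores * ghz * flops_per_cycle


peak = theoretical_peak_gflops(480, 2.27)
```

With 480 cores at 2.27 GHz this gives the 4358.4 GFlop/s stated in the figure, so the curves can be read directly as fractions of that peak.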
Hierarchical QR
[Figure: Solving Linear Least Square Problem (DGEQRF). 60-node, 480-core, 2.27GHz Intel Xeon Nehalem, IB 20G System. Theoretical Peak: 4358.4 GFlop/s. Y axis: performance (GFlop/s), 0 to 3500. X axis: matrix size N, 0 to 70,000 (M = 67,200). Series: Hierarchical QR; DPLASMA DGEQRF; 1D 2-level tree binary/flat [SLDH10]; LibSCI Scalapack.]
Example 2: Hybrid LU-QR
• Factorization A=LU
  • where L is unit lower triangular, U upper triangular
  • (2/3)n^3 floating point operations
• Factorization A=QR
  • where Q is orthogonal, and R upper triangular
  • (4/3)n^3 floating point operations
• LUPP: Partial Pivoting involves many communications in the critical path
• Without Partial Pivoting: low numerical stability
Example 2: LU "Incremental" Pivoting
Example 2: QR
Example 2: LU/QR Hybrid Algorithm
Example 2: LU/QR Hybrid Algorithm
selector(k,m,n)
[...]
do_lu  = lu_tab[k]
did_lu = (k == 0) ? -1 : lu_tab[k-1]
q      = (n-k)%param_q
[...]
CTL ctl <- (q == 0) ? ctl setchoice(k, p, hmax)
        <- (q != 0) ? ctl setchoice_update(k, p, q)
RW  A   <- ((k == n) && (k == m)) ? A zlufacto(k, 0)
        <- ((k == n) && (k != m) && diagdom)  ? B copypanel(k, m)
        <- ((k == n) && (k != m) && !diagdom) ? A copypanel(k, m)
        <- ((k != n) && (k == 0)) ? A(m, n)
        <- ((k != n) && (k != 0) && (did_lu == 1)) ? C zgemm(k-1, m, n)
        <- ((k != n) && (k != 0) && (did_lu != 1)) ? A2 zttmqr(k-1, m, n)
        /* LU */
        -> ( (do_lu == 1) && (k == n) && (k == m) ) ? A zgetrf(k)
        -> ( (do_lu == 1) && (k == n) && (k != m) ) ? C ztrsm_l(k, m)
        -> ( (do_lu == 1) && (k != n) && (k != m) && (!diagdom) ) ? C zgemm(k, m, n)
        /* QR */
        -> ( (do_lu != 1) && (k == n) && (type != 0) ) ? A  zgeqrt(k, i)
        -> ( (do_lu != 1) && (k == n) && (type == 0) ) ? A2 zttqrt(k, m)
        -> ( (do_lu != 1) && (k != n) && (type != 0) ) ? C  zunmqr(k, i, n)
        -> ( (do_lu != 1) && (k != n) && (type == 0) ) ? A2 zttmqr(k, m, n)
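The output guards of selector(k, m, n) route each tile either to an LU kernel or to a QR kernel depending on the choice recorded in lu_tab[k]. The Python sketch below only transcribes those guards for illustration: `selector_target` is a hypothetical function, and `type_` and `i` are passed in as plain parameters because their definitions sit in the elided `[...]` parts of the JDF.

```python
def selector_target(k, m, n, do_lu, diagdom, type_, i=0):
    """Return the task that receives tile A from selector(k, m, n).

    Mirrors the '->' guards of the selector task; returns None when no
    guard shown on the slide fires for the given indices.
    """
    if do_lu == 1:
        # LU branch
        if k == n and k == m:
            return f"A zgetrf({k})"
        if k == n and k != m:
            return f"C ztrsm_l({k},{m})"
        if k != n and k != m and not diagdom:
            return f"C zgemm({k},{m},{n})"
        return None
    # QR branch: panel tiles (k == n) are factored, the others are updated
    if k == n:
        if type_ != 0:
            return f"A zgeqrt({k},{i})"
        return f"A2 zttqrt({k},{m})"
    if type_ != 0:
        return f"C zunmqr({k},{i},{n})"
    return f"A2 zttmqr({k},{m},{n})"
```

Because do_lu is read from lu_tab[k] at run time, the same parameterized task graph covers both factorizations and the choice is made per step k.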
Hybrid LU/QR Performance
Conclusion
• Programming made easy(ier)
  • Portability: inherently take advantage of all hardware capabilities
  • Efficiency: deliver the best performance on several families of algorithms
• Build a scientific enabler allowing different communities to focus on different problems
  • Application developers on their algorithms
  • Language specialists on Domain Specific Languages
  • System developers on system issues
  • Compilers on whatever they can
[Figure: layered view of the framework with three layers: Domain Specific Extensions, Parallel Runtime, Hardware. Other labels: Dense LA, Sparse LA, Chemistry, …; Compact Representation - PTG; Dynamic Discovered Representation - DTG; Hardcore; Scheduling; Data; Data Movement; Coherence; Tasks; Specialized Kernels; Cores; Memory Hierarchies; Accelerators.]