PaRSEC: Parallel Runtime Scheduling and Execution Controller

Jack Dongarra, George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault
Also thanks to: Julien Herrmann, Julien Langou, Bradley R. Lowery, Yves Robert

Motivation
• Today software developers face systems with:
  • ~1 TFLOP of compute power per node
  • 32+ cores, 100+ hardware threads
  • Highly heterogeneous architectures (cores + specialized cores + accelerators/coprocessors)
  • Deep memory hierarchies
  • Distributed systems
  • Fast evolution
• Mainstream programming paradigms introduce systemic noise, load imbalance, and overheads (< 70% of peak on dense linear algebra)

Tianhe-2 (China, June '14): 34 PetaFLOPS
• Peak performance of 54.9 PFLOPS
• 16,000 nodes containing 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators, totaling 3,120,000 cores
• 162 cabinets in a 720 m² footprint
• 1.404 PB total memory (88 GB per node)
• Each Xeon Phi board uses 57 cores for an aggregate 1.003 TFLOPS at a 1.1 GHz clock
• Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches)
• 12.4 PB parallel storage system
• 17.6 MW power consumption under load; 24 MW including (water) cooling
• 4,096 SPARC V9 based Galaxy FT-1500 processors in the front-end system

Task-based programming
[Figure: an application sits on top of a runtime that provides data distribution, memory management, scheduling, communication, and heterogeneity management over the hardware]
• Focus on data dependencies, data flows, and tasks
• Don't develop for an architecture but for a portability layer
• Let the runtime deal with the hardware characteristics
• But provide as much user control as possible
• StarSS, StarPU, Swift, Parallex, Quark, Kaapi, DuctTeip, ..., and PaRSEC

The PaRSEC framework
[Figure: domain-specific extensions (dense LA compact representation (PTG), sparse LA, chemistry, dynamic/prototyping interface (DTD)) feed a parallel runtime (scheduling, tasks, data movement, coherence) running on the hardware (cores, memory hierarchies, accelerators); power users can plug in specialized kernels]

PaRSEC toolchain
[Figure: the programmer provides the data distribution, the application code & codelets, and a dataflow representation; the dataflow (DAGuE) compiler turns serial code into parallel task stubs, which the system compiler links with the runtime (MPI, pthreads, CUDA) and additional libraries (PLASMA, MAGMA) to run on the supercomputer]

Input Format – Quark/StarPU/MORSE

for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt,
                 A[k][k], INOUT,
                 T[k][k], OUTPUT );
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt,
                     A[k][k], INOUT | REGION_D | REGION_U,
                     A[m][k], INOUT | LOCALITY,
                     T[m][k], OUTPUT );
    }
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr,
                     A[k][k], INPUT | REGION_L,
                     T[k][k], INPUT,
                     A[k][n], INOUT );
        for (m = k+1; m < A.mt; m++)
            Insert_Task( ztsmqr,
                         A[k][n], INOUT,
                         A[m][n], INOUT | LOCALITY,
                         A[m][k], INPUT,
                         T[m][k], INPUT );
    }
}

• Sequential C code, annotated through some specific syntax:
  • Insert_Task
  • Access modes: INPUT, INOUT, OUTPUT
  • Regions: REGION_L, REGION_U, REGION_D, …
  • LOCALITY
• Example: QR factorization (dense linear algebra), built from the kernels GEQRT, TSQRT, UNMQR, and TSMQR

Dataflow Analysis
• Data flow analysis of QR; example on task DGEQRT
[Figure: incoming and outgoing data of DGEQRT in the DAG, shown for k = 0 and k = SIZE-1]

FOR k = 0 .. SIZE-1
    A[k][k], T[k][k] <- GEQRT( A[k][k] )
    FOR m = k+1 .. SIZE-1                                          /* UPPER */
        A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
    FOR n = k+1 .. SIZE-1
        A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )          /* LOWER */
        FOR m = k+1 .. SIZE-1
            A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

• Polyhedral analysis through the Omega test
• Computes algebraic expressions for:
  • The source and destination tasks
  • The necessary conditions for each data flow to exist

Intermediate Representation: Job Data Flow

GEQRT(k)
  /* Execution space */
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  /* Locality */
  : A(k, k)

  RW A    <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
          -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)   [type = LOWER]
          -> (k < MT-1) ? A1 TSQRT(k, k+1)          [type = UPPER]
          -> (k == MT-1) ? A(k, k)                  [type = UPPER]
  WRITE T <- T(k, k)
          -> T(k, k)
          -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)
  /* Priority */
  ; (NT-k)*(NT-k)*(NT-k)

BODY [GPU, CPU, MIC]
  zgeqrt( A, T )
END

• Control flow is eliminated, therefore maximum parallelism is possible

Data/Task Distribution
• Flexible data distribution
• Decoupled from the algorithm
• Expressed as a user-defined function
• Only limitation: it must evaluate uniformly across all nodes
• Common distributions are provided in the DSEs:
  • 1D cyclic, 2D cyclic, etc.
  • Symbol matrix for sparse direct solvers
[Figure: the toolchain diagram again, highlighting the data distribution as a programmer-provided input]
PaRSEC Runtime
[Figure: execution trace on two nodes, each with two computation threads plus a communication thread; tasks Ta(i) and Tb(i,j) interleave with scheduling (S) and data movement (D, A, N) events]
• Each computation thread alternates between executing a task and scheduling tasks
• Computation threads are bound to cores
• Communication threads (one per node) transfer task completion notifications and data
• Communication threads can be bound or not

Strong Scaling
[Figure: DGEQRF strong scaling on a Cray XT5 (Kraken), N = M = 41,472 (≈ 270x270 doubles per core); PaRSEC/DPLASMA stays ahead of LibSCI ScaLAPACK from 768 up to 23,868 cores, on a 0–25 TFlop/s scale]

PaRSEC Runtime: Accelerators

BODY [GPU, CPU, MIC]
  zgeqrt( A, T )
END

• When tasks that can run on an accelerator are scheduled:
  • A computation thread takes control of a free accelerator
  • It schedules tasks and data movements on the accelerator
  • Until no more tasks can run on that accelerator
• The engine takes care of data consistency:
  • Multiple copies (with versioning) of each "tile" co-exist on different resources
  • Data movement between devices is implicit
[Figure: trace where a computation thread becomes the accelerator client, driving IN/OUT transfers and computation on accelerator 0]

Multi-GPU
[Figure, single node: scalability with 1 to 4 Tesla C1060 GPUs and 16 AMD Opteron cores, matrix sizes 10k to 50k, on a 0–1,400 GFlop/s scale]
[Figure, distributed: Keeneland, 64 nodes, 3 M2090 GPUs and 16 cores per node]

Example 1: Hierarchical QR
• A single QR step = nullify all tiles below the current diagonal tile
• Choosing which tile to "kill" with which other tile defines the duration of the step
• This coupling defines a tree
• Choosing how to compose trees depends on the shape of the matrix, on the cost of each kernel operation, and on the platform characteristics
[Figures: a binomial tree; a flat tree]

Example 1: Hierarchical QR (cont.)
• A single QR step = nullify all tiles
below the current diagonal tile; as before, the tile coupling defines a tree, and the tree composition determines the duration of each step
[Figure: composing two binomial trees]

Example 1: Hierarchical QR – Sequential Algorithm → JDF Representation
• qrtree (passed as an arbitrary structure to the JDF object) implements elim / killer as a set of convenience functions
• The JDF depends on the arbitrary functions killer(i, k) and elim(i, j, k)

zunmqr(k, i, n)
  /* Execution space */
  k = 0 .. minMN-1
  i = 0 .. qrtree.getnbgeqrf( k ) - 1
  n = k+1 .. NT-1
  m = qrtree.getm(k, i)
  nextm = qrtree.nextpiv(k, m, MT)
  /* Locality */
  : A(m, n)

  READ A <- A zgeqrt(k, i)   [type = LOWER_TILE]
  READ T <- T zgeqrt(k, i)   [type = LITTLE_T]
  RW   C <- ( k == 0 ) ? A(m, n)
         <- ( k > 0 ) ? A2 zttmqr(k-1, m, n)
         -> ( k == MT-1 ) ? A(m, n)
         -> ( (k < MT-1) && (nextm != MT) ) ? A1 zttmqr(k, nextm, n)
         -> ( (k < MT-1) && (nextm == MT) ) ? A2 zttmqr(k, m, n)

Hierarchical QR
• How to compose trees to get the best pipeline?
  • Flat, binary, Fibonacci, greedy, …
• Study on critical path lengths
• Square → tall and skinny
• Surprisingly, flat trees are better for communications in the square cases:
  • Fewer communications
  • Good pipeline
[Figure: solving the linear least squares problem (DGEQRF) on a 60-node, 480-core, 2.27 GHz Intel Xeon Nehalem system with IB 20G (theoretical peak 4358.4 GFlop/s); matrix size M up to 300,000 with N = 4,480; curves for hierarchical QR, the 1D 2-level binary/flat tree [SLDH10], DPLASMA DGEQRF, and LibSCI ScaLAPACK, on a 0–3,000 GFlop/s scale]
[Figure: same system and problem, now with matrix size N up to 70,000 and M = 67,200; curves for hierarchical QR, DPLASMA DGEQRF, the 1D 2-level binary/flat tree [SLDH10], and LibSCI ScaLAPACK, on a 0–3,500 GFlop/s scale]

Example 2: Hybrid LU-QR
• Factorization A = LU
  • where L is unit lower triangular and U is upper triangular
  • ≈ (2/3)·n³ floating point operations
• Factorization A = QR
  • where Q is orthogonal and R is upper triangular
  • ≈ (4/3)·n³ floating point operations
• LU with partial pivoting (LUPP) involves many communications in the critical path
• Without partial pivoting: low numerical stability

Example 2: LU "Incremental" Pivoting
Example 2: QR
Example 2: LU/QR Hybrid Algorithm
[Figure-only slides illustrating the respective elimination patterns]

Example 2: LU/QR Hybrid Algorithm (JDF)

selector(k, m, n)
  [...]
  do_lu  = lu_tab[k]
  did_lu = (k == 0) ? -1 : lu_tab[k-1]
  q = (n - k) % param_q
  [...]

  CTL ctl <- (q == 0) ? ctl setchoice(k, p, hmax)
          <- (q != 0) ? ctl setchoice_update(k, p, q)

  RW A <- ( (k == n) && (k == m) ) ? A zlufacto(k, 0)
       <- ( (k == n) && (k != m) &&  diagdom ) ? B copypanel(k, m)
       <- ( (k == n) && (k != m) && !diagdom ) ? A copypanel(k, m)
       <- ( (k != n) && (k == 0) ) ? A(m, n)
       <- ( (k != n) && (k != 0) && (did_lu == 1) ) ? C  zgemm(k-1, m, n)
       <- ( (k != n) && (k != 0) && (did_lu != 1) ) ? A2 zttmqr(k-1, m, n)
       /* LU */
       -> ( (do_lu == 1) && (k == n) && (k == m) ) ? A zgetrf(k)
       -> ( (do_lu == 1) && (k == n) && (k != m) ) ? C ztrsm_l(k, m)
       -> ( (do_lu == 1) && (k != n) && (k != m) && !diagdom ) ? C zgemm(k, m, n)
       /* QR */
       -> ( (do_lu != 1) && (k == n) && (type != 0) ) ? A  zgeqrt(k, i)
       -> ( (do_lu != 1) && (k == n) && (type == 0) ) ? A2 zttqrt(k, m)
       -> ( (do_lu != 1) && (k != n) && (type != 0) ) ? C  zunmqr(k, i, n)
       -> ( (do_lu != 1) && (k != n) && (type == 0) ) ? A2 zttmqr(k, m, n)
Hybrid LU/QR Performance
[Figure-only slide: performance of the hybrid algorithm]

Conclusion
• Programming made easy(ier), allowing different communities to focus on different problems:
  • Application developers on their algorithms
  • Language specialists on domain-specific languages
  • System developers on system issues
  • Compilers on whatever they can
• Build a scientific enabler
  • Portability: inherently take advantage of all hardware capabilities
  • Efficiency: deliver the best performance on several families of algorithms
[Figure: the PaRSEC framework diagram again — domain-specific extensions (dense LA compact representation (PTG), sparse LA dynamically discovered representation (DTD), chemistry) over the parallel runtime (scheduling, tasks, data movement, coherence, specialized kernels) over the hardware (cores, memory hierarchies, accelerators)]