Advanced Scientific Computing on Intel Xeon Phi
Alexander Heinecke, Technische Universität München, Chair of Scientific Computing
PARCO'13, Garching (Germany), Sep 13th 2013

Acknowledgment
For input and fruitful discussions I want to thank:
• Intel DE: Michael Klemm, Hans Pabst, Christopher Dahnken
• Intel Labs US/IN, PCL: Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Pradeep Dubey, Victor Lee, Christopher Hughes
• Intel US: Mike Julier, Susan Meredith, Catherine Djunaedi, Michael Hebenstreit
• Intel RU: Dmitry Budnikov, Maxim Shevtsov, Alexander Batushin
• Intel IL: Ayal Zaks, Arik Narkis, Arnon Peleg
And a very special thank you to Joe Curley and the KNC team for providing the SE10P cards for my test cluster!

Outline
• Introduction to Intel Xeon Phi
• Codes that are not yet fully optimized for Xeon Phi:
  – Molecular Dynamics
  – Earthquake Simulation - SeisSol, SuperMUC bid workload
• Hybrid HPL
• Data Mining with Sparse Grids
• Conclusions

The Intel Xeon Phi SE10P (5110P)
• 61 (60) cores with 4 threads each at 1.1 (1.053) GHz
• 32 KB L1$ and 512 KB L2$ per core
• the L2$ are kept coherent through a fast on-chip ring bus network
• 512-bit wide vector instructions, 2-cycle throughput
• 1074 (1010) GFLOPS peak DP, 2148 (2020) GFLOPS peak SP
• 8 GB GDDR5, 512-bit interface, 352 (330) GB/s
• PCIe 2.0 host interface

The Intel Xeon Phi Usage Model
[Figure: spectrum of usage models from CPU-centric to MIC-centric, showing where Main(), Compute() and Comm() run: CPU hosted, offload to MIC, symmetric, offload to CPU (reverse offload), MIC hosted]
• scalar: CPU hosted
• highly parallel: offload to MIC
• offload through MPI: symmetric
• scalar phases: reverse offload (offload to CPU)
• highly parallel: MIC hosted

Infiniband Connectivity of Xeon Phi
[Charts: achieved bandwidth (MByte/s) over message size (up to 4 MB) for intra-node transfers (host ↔ mic0, host ↔ mic1, mic0 ↔ mic1) and inter-node transfers (host1 ↔ host0-mic0, host0-mic0 ↔ host1-mic0, host1-mic0 ↔ host0-mic0), in both directions]
⇒ The Xeon E5 has only 16 PCIe read buffers but 64 PCIe write buffers!
⇒ Start-up latency is 9 µs with 3 hops (host ↔ host: 1.5 µs, 1 hop)!
⇒ Using the hosts as proxies, as in GPUdirect, yields up to 6 GB/s!
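The bandwidth plots above come from simple point-to-point message exchanges between host and coprocessor MPI ranks. A minimal ping-pong sketch of that kind of measurement (rank placement, message sizes and repetition counts are illustrative; this is not the exact benchmark behind the plots):

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int reps = 50;
        for (int bytes = 1024; bytes <= 4 * 1024 * 1024; bytes *= 2) {
            std::vector<char> buf(bytes, 0);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; ++i) {
                if (rank == 0) {        // e.g. the rank running on the host
                    MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) { // e.g. the rank placed on mic0
                    MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = MPI_Wtime() - t0;
            if (rank == 0)  // two messages of 'bytes' bytes per repetition
                std::printf("%9d bytes: %8.1f MByte/s\n", bytes, 2.0 * reps * bytes / t / 1.0e6);
        }
        MPI_Finalize();
        return 0;
    }

Launching the two ranks on the host and on the coprocessor (symmetric mode) or routing them through a host proxy reproduces the different curves of the plots.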
Linked Cell Particle Simulations with ls1 mardyn∗
• with increasing vector length, the net vector utilization goes down (holes in the registers)
• the Xeon E5 has too little L1 bandwidth (32 byte read, 16 byte write per cycle)
• gather/scatter can be used to achieve 100% vector register utilization
[Figure: linked-cell cutoff radius r_cutoff; vectorized distance computation d_a_1 … d_a_4 of particle a against particles 1-4 (x, y, z components), followed by the Lennard-Jones 12-6 force computation f_x, f_y, f_z for all pairs with d_a_i < r_cutoff]
⇒ Porting ls1 mardyn to MIC hosted mode is ongoing research.
∗ Eckhardt et al.: 591 TFLOPS Multi-Trillion Particles Simulation on SuperMUC, ISC'13

Single Card Performance Xeon Phi∗
• compute the distances d_a_m0 … d_a_m7 for a full vector of particle pairs
• execute the force computation on the lanes within the cutoff
[Chart: relative performance for 2-center, 3-center and 4-center Lennard-Jones sites, comparing one SuperMUC box (16 SNB-EP cores) with a Xeon Phi 5110P running 120 MPI ranks, classic approach vs. gather/scatter vectorization]
∗ Eckhardt et al.: Vectorization of Multi-Center, Highly-Parallel Rigid-Body Molecular Dynamics Simulations, Poster SC'13, accepted

SeisSol∗
SeisSol is a software package for the simulation of seismic wave phenomena on unstructured grids, based on the discontinuous Galerkin method combined with ADER time discretization. One key routine is the ADER time integration, which operates on the following matrix patterns:
[Figure: sparsity patterns of the small stiffness and flux matrices used in the ADER time integration]
⇒ For the compute kernels we are currently at 1.3X comparing 1 Xeon Phi to 1 dual-socket SNB-EP!
∗ Breuer et al.: Accelerating SeisSol by Generating Vectorcode for Sparse Matrix Operators, PARCO'13, accepted
∗ Heinecke et al.: Optimized Kernels for large scale earthquake simulations with SeisSol, an unstructured ADER-DG code, Poster SC'13, accepted
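The kernels behind this comparison are generated small matrix multiplications for the fixed sparsity patterns shown above. A rough, generic sketch of such a dense-times-sparse product (the CSC layout, sizes and function name are assumptions of this sketch, not SeisSol's actual generated code, which unrolls the loops for the known pattern):

    #include <vector>

    // Column-compressed (CSC) storage of a small, constant operator matrix.
    struct CscMatrix {
        int rows, cols;
        std::vector<int>    colPtr;   // size cols + 1
        std::vector<int>    rowIdx;   // row index of each non-zero
        std::vector<double> val;      // non-zero values
    };

    // C(:,j) += A(:,k) * B(k,j) for every non-zero B(k,j).
    // A is n x B.rows, C is n x B.cols, both column-major with leading dimension n.
    // The innermost loop runs over the contiguous dense dimension and vectorizes;
    // a generated kernel additionally unrolls the two outer loops for the known
    // sparsity pattern at code-generation time.
    void dense_times_sparse(int n, const double* A, const CscMatrix& B, double* C) {
        for (int j = 0; j < B.cols; ++j) {
            for (int nz = B.colPtr[j]; nz < B.colPtr[j + 1]; ++nz) {
                const int k = B.rowIdx[nz];
                const double b = B.val[nz];
                for (int i = 0; i < n; ++i)
                    C[j * n + i] += A[k * n + i] * b;
            }
        }
    }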
The HPL Benchmark∗
• the matrix is stored in a block-cyclic layout
• the matrix is decomposed into P process rows and Q process columns
[Figure: panel decomposition with the current panel L_i, the finished parts of L and U, U_i and the trailing matrix A_i]
• the update A_i = A_i − L_i U_i is the most critical component
• the DGEMM fraction grows with increasing system size
⇒ solve the largest possible problem for best performance
⇒ keep the MIC busy all the time (doing DGEMM)
∗ Heinecke et al.: Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel(R) Xeon Phi(TM) Coprocessor, IPDPS'13

Designing Offload-DGEMM with Work-Stealing
[Figure: C is tiled into C00 … C33 with corresponding A tiles A0 … A3 and B tiles B0 … B3; tiles are processed partly on the Knights Corner card and partly on the CPU]
• compute time T_c = 2MNK / Peak; transfer time T_t = 8MN / PCIe; T_t < T_c requires K > 4 · Peak / PCIe ≈ 1000
• C, A and B are split into tiles
• the MIC starts to process tiles at C(0,0): DGEMM(A0,B0,C00), DGEMM(A0,B1,C01), DGEMM(A0,B2,C02), …
• the CPU starts to process tiles from the opposite end, at C(X,X)
• dynamic load balancing assigns work to MIC or CPU

Hybrid HPL using Pipelined Lookahead (hybrid, load balanced via work-stealing)
[Figure: timeline of one iteration: panel factorization and panel broadcast on the CPU, CPU DGEMM on columns 0–K and K–N, DMA communication via PCIe, Knights Corner DGEMM on columns K–N, and U broadcast, row swapping and DTRSM split into the pieces 0–I, I–J and J–N]
• the panel was already swapped in the last iteration
• swap, U broadcast and DTRSM are performed per B tile
• the panel free-up on the CPU is merged with swap, U broadcast and DTRSM
• afterwards the pipelined execution of swap, U broadcast, DTRSM and the trailing-update DGEMM starts
⇒ The Xeon Phi is executing DGEMM basically all the time!
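The trailing-update DGEMM is distributed between coprocessor and host with the work-stealing scheme from the offload-DGEMM slide above. A minimal sketch of that dynamic tile assignment (process_tile_on_mic / process_tile_on_cpu are hypothetical placeholders for the offloaded and host-side DGEMM calls; the real implementation also has to stream the tiles over PCIe, which is omitted here):

    #include <atomic>
    #include <thread>

    void process_tile_on_mic(int tile);  // hypothetical: offloaded DGEMM on one C tile
    void process_tile_on_cpu(int tile);  // hypothetical: host DGEMM on one C tile

    // Both workers pull the next unprocessed C tile from a shared atomic counter,
    // so the faster device automatically ends up with more tiles (work-stealing).
    void hybrid_tile_dgemm(int numTiles) {
        std::atomic<int> next(0);

        std::thread micWorker([&] {       // drives the coprocessor
            for (int t = next.fetch_add(1); t < numTiles; t = next.fetch_add(1))
                process_tile_on_mic(t);
        });

        // the host consumes tiles from the same pool
        for (int t = next.fetch_add(1); t < numTiles; t = next.fetch_add(1))
            process_tile_on_cpu(t);

        micWorker.join();
    }

In the actual HPL code the MIC walks through the tiles starting at C(0,0) while the CPU starts at the opposite corner; the shared counter above captures only the load-balancing idea.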
Lookahead Comparison
[Charts: runtime breakdown into DGEMM on Knights Corner, exposed panel factorization (CPU), exposed DTRSM (CPU) and exposed U broadcast plus row swapping (CPU), over the course of the factorization (trailing matrix size from 82800 down to 1200), for the standard lookahead and the pipelined lookahead]
Comparison of the lookahead variants on a four-node cluster with 2 Xeon Phi and 64 GB RAM per node; the local matrix size is 82.8K, P = Q = 2.

Performance Results
System                       N      P    Q    TFLOPS   Eff. (%)
CPU only, 64GB               84K    1    1      0.29     86.4
CPU only, 64GB               168K   2    2      1.10     82.8
no pipeline, 1 card, 64GB    84K    1    1      0.99     71.0
pipeline, 1 card, 64GB       84K    1    1      1.12     79.8
no pipeline, 1 card, 64GB    168K   2    2      3.88     69.1
pipeline, 1 card, 64GB       168K   2    2      4.36     77.6
no pipeline, 1 card, 64GB    825K   10   10     95.2     67.7
pipeline, 1 card, 64GB       825K   10   10    107.0     76.1
no pipeline, 2 cards, 64GB   84K    1    1      1.66     68.2
pipeline, 2 cards, 64GB      84K    1    1      1.87     76.6
no pipeline, 2 cards, 64GB   166K   2    2      6.36     65.0
pipeline, 2 cards, 64GB      166K   2    2      7.15     73.1
no pipeline, 2 cards, 64GB   822K   10   10    156.5     64.0
pipeline, 2 cards, 64GB      822K   10   10    175.8     71.9
pipeline, 1 card, 128GB      242K   2    2      4.42     79.6

Sparse Grids in a nutshell∗
• curse of dimensionality: O(N^d) grid points
• therefore: sparse grids
  – reduce the cost to O(N · log(N)^(d−1))
  – similar accuracy, if the problem is sufficiently smooth
• basic idea: hierarchical basis in 1d (here: piecewise linear)
[Figure: 1d hierarchical (piecewise linear) basis functions on levels l = 1, 2, 3 and the corresponding grid points x_{l,i}]
∗ Heinecke et al.: Data Mining on Vast Datasets as a Cluster System Benchmark, PMBS'13, submitted

Data Mining with SG - In a nutshell
d-dimensional instances and ansatz functions f_N(x):
    S = {(x_m, y_m) ∈ R^d × K}_{m=1,…,M},    f_N(x) = ∑_{j=1}^{N} u_j φ_j(x)
Problem to be solved:
    f_N = arg min_{f_N ∈ V_N}  (1/M) ∑_{m=1}^{M} (y_m − f_N(x_m))² + λ ‖∇f_N‖²_{L²}
This results in the SLE:
    ((1/M) B Bᵀ + λ C) u = (1/M) B y,    here C = I and b_ij = φ_i(x_j)
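The SLE above is never assembled explicitly; it can be solved with a CG iteration that only needs the two operator products Bᵀu and Br. A matrix-free sketch of that structure (apply_Bt and apply_B are hypothetical placeholders for the operators whose implementation is discussed on the following slides):

    #include <vector>
    #include <cmath>

    using Vec = std::vector<double>;

    Vec apply_Bt(const Vec& u);            // hypothetical: evaluate f_N at all M data points
    Vec apply_B (const Vec& r);            // hypothetical: accumulate M residuals onto the N grid points

    // v -> ((1/M) B Bᵀ + λ I) v, i.e. the system matrix applied without assembling it
    Vec apply_system(const Vec& u, double lambda, int M) {
        Vec tmp = apply_B(apply_Bt(u));
        for (std::size_t j = 0; j < u.size(); ++j)
            tmp[j] = tmp[j] / M + lambda * u[j];
        return tmp;
    }

    double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    Vec solve_cg(const Vec& y, double lambda, int M, int N, int maxIter, double tol) {
        Vec u(N, 0.0);
        Vec b = apply_B(y);                       // right-hand side (1/M) B y
        for (double& v : b) v /= M;
        Vec r = b, p = r;                         // u = 0  =>  initial residual equals b
        double rs = dot(r, r);
        for (int it = 0; it < maxIter && std::sqrt(rs) > tol; ++it) {
            Vec Ap = apply_system(p, lambda, M);
            double alpha = rs / dot(p, Ap);
            for (int j = 0; j < N; ++j) { u[j] += alpha * p[j]; r[j] -= alpha * Ap[j]; }
            double rsNew = dot(r, r);
            for (int j = 0; j < N; ++j) p[j] = r[j] + (rsNew / rs) * p[j];
            rs = rsNew;
        }
        return u;
    }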
Implementation of Bᵀ and B
• Bᵀ: parallelization over the dataset instances
• B: parallelization over the grid points
[Figure: for both operators the grid data (level, index, surplus α) is streamed against the dataset; the partial results are summed into y (for Bᵀ) and onto the grid (for B)]

Non-uniform Basis Functions - the Bᵀ Operator
for all x_m ∈ S with |S| = M, y_m ← 0 do
    for all g ∈ grid with s ← α_g do
        for k = 1 to d do
            if g_{l_k} = 1 then
                s ← s · 1
            else if g_{i_k} = 1 then
                s ← s · max(2 − g_{l_k} · x_{m_k}; 0)
            else if g_{i_k} = 2^{l_k} − 1 then
                s ← s · max(g_{l_k} · x_{m_k} − g_{i_k} + 1; 0)
            else
                s ← s · max(1 − |g_{l_k} · x_{m_k} − g_{i_k}|; 0)
            end if
        end for
        y_m ← y_m + s
    end for
end for

Extending Bᵀ and B to Clusters
[Figure: the dataset is partitioned across the processes P1 … P4; Bᵀ requires no communication, while the partial results of B are combined with an MPI_Allreduce]

Performance Model for MPI Allreduce
Total learning time on p nodes:
    T_learn(p, N) = T_seq(N) + T_B(p, N) + T_Bᵀ(p, N) + T_comm(p, N),
with
    T_seq = T_total,singlenode − T_B,singlenode − T_Bᵀ,singlenode,
    T_B(p) = T_B,singlenode / p  and  T_Bᵀ(p) = T_Bᵀ,singlenode / p.
The communication time can be estimated by
    T_comm(p, N) = ∑_{r ∈ Refinements} T_commCG(p, N),
employing the OSU MVAPICH benchmark suite:
    T_commCG(p, N) = microbenchmark(N, p) · (#iterations + 1).

Performance Model Evaluation
[Chart: GFLOPS and parallel efficiency over #nodes (up to 32) for asynchronous Allreduce, Allgatherv, the performance model and the optimum]
• test dataset DR5: 400K instances, 5 dimensions
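A small sketch of how the model above can be evaluated, assuming the single-node timings, the per-refinement CG iteration counts and the Allreduce microbenchmark results are available as inputs (all names are illustrative):

    #include <cstddef>
    #include <vector>

    struct SingleNodeTimings {
        double total, tB, tBt;                      // seconds on one node
    };

    // measured MPI_Allreduce time for a message of N doubles on p nodes (hypothetical lookup
    // into the OSU microbenchmark results)
    double allreduce_microbenchmark(std::size_t N, int p);

    double t_learn(int p, std::size_t N, const SingleNodeTimings& s,
                   const std::vector<int>& cgIterationsPerRefinement) {
        double tSeq  = s.total - s.tB - s.tBt;      // non-parallelized remainder T_seq
        double tPar  = (s.tB + s.tBt) / p;          // B and Bᵀ scale with the node count
        double tComm = 0.0;                         // one Allreduce per CG iteration (+1) per refinement
        for (int iters : cgIterationsPerRefinement)
            tComm += allreduce_microbenchmark(N, p) * (iters + 1);
        return tSeq + tPar + tComm;
    }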
Xeon Phi Offload vs. Xeon Phi Native
[Charts: GFLOPS over #nodes (1–256) for the DR5 dataset, 400K instances, 5 dimensions (upper) and the Friedman1 dataset, 10M instances per node, 10 dimensions (lower), comparing Beacon with 1/2/4 5110Ps in native and offload mode, SuperMUC, CoolMAC with 2 W8000s and Todi with 1 K20X]
• we are running out of parallel tasks
• offloading is a limitation in the case of small grids

Using the MIC native implementation on a cluster
[Chart: GFLOPS on 1–16 Beacon nodes for CPU only, 1 5110P native, 1 5110P offload, symmetric (CPU+MIC) and hybrid (CPU+MIC)]
• test dataset DR5: 400K instances, 5 dimensions
• symmetric mode: 2 ranks per Xeon Phi, 1 per Xeon E5

Conclusions
Data parallelism in x86 delivers performance comparable to GPUs, but at higher programming productivity!
Key findings:
• we are flexible to use different programming models
• we can simply re-compile and the code runs directly on Xeon Phi
• intrinsics can be transferred easily
• optimizations for Xeon Phi are optimizations for the Xeon E5, too!
• tuning for the Intel MIC architecture does not differ from tuning for a standard CPU, which is required anyway when using GPUs (since the CPU's FLOPS should not be neglected)
• but: running 60+ MPI ranks on a Xeon Phi seems to be suboptimal
• but: the ccNUMA-like L2 cache needs to be respected

BACKUP

STREAM on Xeon Phi
[Chart: copy, scale, add and triad bandwidth in GByte/s for 1 to 240 threads]

DGEMM on Xeon Phi
[Chart: GFLOPS over N (1K–28K) for the outer-product kernel on KNC with A and B in KNC-friendly format, the outer-product DGEMM on KNC including the formatting overhead, both with K = 300, and square DGEMM on SNB-EP using Intel MKL 11.0; peak efficiencies of about 89.8% and 89.4% (with formatting overhead) on KNC and 90.0% on SNB-EP]
Heinecke et al.: Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel(R) Xeon Phi(TM) Coprocessor, IPDPS'13

DGETRF on Xeon Phi
[Chart: GFLOPS over matrix rank N (1K–30K) for HPL with static look-ahead on KNC, HPL with dynamic scheduling on KNC, native DGEMM on KNC and SMP Linpack on SNB-EP using MKL 11.0; peak efficiencies of about 89.8%, 78.8% and 83.0% are marked]
Heinecke et al.: IPDPS'13
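For reference, the STREAM numbers at the top of this backup section measure simple streaming loops; a minimal OpenMP sketch of the triad kernel (array size and the single timed run are illustrative simplifications, a real benchmark repeats the kernel after a warm-up and reports the best run):

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 64 * 1000 * 1000;          // large enough to exceed the caches
        std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
        const double s = 3.0;

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];                      // triad: 2 reads + 1 write per element
        double t = omp_get_wtime() - t0;

        // 3 doubles moved per element -> bandwidth estimate
        std::printf("triad: %.1f GByte/s\n", 3.0 * sizeof(double) * n / t / 1.0e9);
        return 0;
    }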
The Intel Xeon Phi Usage Model - Offloading
• low level: SCIF - a low-level, MPI-like interface (DMA windows, memory fences)
• mid level: COI - OpenCL-like, with handles, buffers, etc.
• high level: #pragma annotations for driving the offload to the MIC
Example:
    #pragma offload target(mic:d) nocopy(ptrData) \
            in(ptrCoef : length(size) alloc_if(0) free_if(0) align(64)) \
            out(ptrCoef[start:mychunk] : into(ptrCoef[start:mychunk]) alloc_if(0) free_if(0) align(64)) \
            in(size)
    { .. }

Offload-DGEMM with one Xeon Phi
[Chart: GFLOPS and efficiency of the offload DGEMM with one Xeon Phi for N from 1.2K to 79.2K]

Offload-DGEMM with two Xeon Phis
[Chart: GFLOPS and efficiency of the offload DGEMM with two Xeon Phis for N from 1.2K to 79.2K]

Naive Hybrid HPL
[Figure: timeline of one iteration: panel factorization and panel broadcast on the CPU, DMA communication via PCIe, and DGEMM, U broadcast and swap + DTRSM on Knights Corner]
• very simple
• least change to HPL, plug and play
• no lookahead, which is inefficient even on the CPU
• performance is limited by the swapping and the panel factorization running on the CPU
• as the MIC is 3-4× faster than the CPU, MIC idle time is more expensive
⇒ Keeping the Xeon Phi as busy as possible is the major goal!

Hybrid HPL using Standard Lookahead (hybrid, load balanced via work-stealing)
[Figure: timeline of one iteration: panel factorization, panel broadcast, CPU DGEMM on columns 0–K and K–N, U broadcast and swap + DTRSM on the CPU, DMA communication via PCIe, Knights Corner DGEMM on columns K–N]
• few changes to HPL
• DGEMM is split into two parts: panel free-up and trailing update
• the trailing-update DGEMM is load-balanced via work-stealing between MIC and CPU
• the panel factorization is entirely hidden
⇒ U broadcast, swapping of the trailing matrix and DTRSM are still exposed!
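The annotated-offload example shown with the usage-model slide above relies on device buffers that persist across offloads via alloc_if/free_if. A small, self-contained sketch of that pattern with the Intel compiler's offload pragmas (variable names, sizes and the summation kernel are illustrative; this is not the code used in the presented benchmarks):

    #include <cstdio>

    // function compiled for the coprocessor as well
    __attribute__((target(mic))) double sum_up(const double* p, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += p[i];
        return s;
    }

    int main() {
        const int n = 1000000;
        double* data = new double[n];
        for (int i = 0; i < n; ++i) data[i] = 1.0;

        // allocate on the card and copy once, keep the buffer alive (free_if(0))
        #pragma offload_transfer target(mic:0) in(data : length(n) alloc_if(1) free_if(0))

        double result = 0.0;
        for (int iter = 0; iter < 10; ++iter) {
            // reuse the persistent buffer, no re-transfer (nocopy + alloc_if(0))
            #pragma offload target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(0))
            {
                result = sum_up(data, n);   // scalar 'result' is copied back automatically
            }
        }

        // release the device buffer with a final, empty offload
        #pragma offload target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(1))
        { }

        std::printf("%f\n", result);
        delete[] data;
        return 0;
    }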
Beacon: A Path to Energy-Efficient HPC
R. Glenn Brook, Director, Application Acceleration Center of Excellence, National Institute for Computational Sciences, glenn-brook@tennessee.edu

Green500 Hardware
Appro Xtreme-X Supercomputer powered by ACE - Beacon Green500 cluster:
Nodes                                        36 compute
CPU model                                    Intel Xeon E5-2670
CPUs per node                                2x 8-core, 2.6 GHz
RAM per node                                 256 GB
SSD per node                                 960 GB (RAID 0)
Intel Xeon Phi 5110P coprocessors per node   4
Cores per Intel Xeon Phi 5110P coprocessor   60 @ 1.053 GHz
RAM per Intel Xeon Phi 5110P coprocessor     8 GB GDDR5

Assembled Team
• Intel: Mikhail Smelyanskiy - software lead; Karthikeyan Vaidyanathan - MIC HPL; Ian Steiner - host HPL optimization; many, many others: Joe Curley, Jim Jeffers, Pradeep Dubey, Susan Meredith, Rajesh Agny, Russ Fromkin, and many dedicated and passionate team members
• Technische Universität München: Alexander Heinecke - MIC HPL
• Appro: John Lee - hardware lead; David Parks - HPL and system software; Edgardo Evangelista, Danny Tran, others - system deployment and support
• NICS: Glenn Brook - project lead; Ryan Braby, Troy Baer - system support

Power Consumption
[Chart: power consumption during the Green500 HPL run; labels not recoverable from the extracted slide]

classical, recursive function evaluation
[Figure: tableau of the hierarchical subspaces over the levels l1 and l2 that is traversed by the recursive evaluation]

Algorithm: Learning data sets with adaptive sparse grids
 1: D := data set
 2: G := start grid
 3: u := 0
 4: while TRUE do
 5:     trainGrid(G, u, D)
 6:     acc = getAccuracy(G, u, D)
 7:     if acc ≥ acc_needed then
 8:         return u, G
 9:     end if
10:     refineGrid(G, u)
11: end while

First results (modlinear) - Baseline: classic algorithms
[Chart: speed-up over the classic algorithm on the DR5 data set (370K instances) for X5650 SSE, 6128HE SSE, i7-2600 SSE, i7-2600 AVX, 1x GTX470, 2x GTX470 and hybrid X5650/GTX470; roughly 2.6x to 5.3x]

First results (regular) - Baseline: classic algorithms
[Chart: speed-up over the classic algorithm on the DR5 data set (370K instances), regular grids, for the same platforms; roughly 4x to 51x]

GFLOPS Scaling on SuperMUC
[Chart: achieved GFLOPS over 16 to 32768 cores for the modlinear and the linear-boundary case, each compared against ideal scaling]
• modlinear: spatially adaptive grids (5K-21K grid points in 7 refinements), modlinear basis, MSE 4.8·10⁻⁴, runtime on 16384 cores: 4.4 s
• linear boundary: regular grid (100K grid points), regular linear basis, MSE 5.4·10⁻⁴, runtime on 32768 cores: 3.6 s
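The adaptive learning loop from the algorithm slide above, written out as a compact sketch (Grid, Dataset, trainGrid, getAccuracy and refineGrid are hypothetical placeholders for the actual types and routines; accNeeded is the user-chosen target accuracy):

    #include <vector>

    struct Grid;      // hypothetical sparse grid type
    struct Dataset;   // hypothetical data set type
    using Vec = std::vector<double>;

    void   trainGrid(Grid& G, Vec& u, const Dataset& D);            // solve the SLE for u
    double getAccuracy(const Grid& G, const Vec& u, const Dataset& D);
    void   refineGrid(Grid& G, const Vec& u);                       // adaptive refinement

    void learn(Grid& G, Vec& u, const Dataset& D, double accNeeded) {
        for (;;) {
            trainGrid(G, u, D);                 // train on the current grid
            if (getAccuracy(G, u, D) >= accNeeded)
                return;                         // u and G are the result
            refineGrid(G, u);                   // add grid points where needed (criterion omitted)
        }
    }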
Data Mining with Sparse Grids @SC12