Achievements and challenges running GPU-accelerated Quantum ESPRESSO on heterogeneous clusters

Filippo Spiga (1,2) <fs395@cam.ac.uk>
1 HPCS, University of Cambridge
2 Quantum ESPRESSO Foundation

«What I cannot compute, I do not understand.» (adapted from Richard P. Feynman)

What is Quantum ESPRESSO?
• QUANTUM ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves
• "ESPRESSO" stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization
• QUANTUM ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide
• QUANTUM ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development

What can Quantum ESPRESSO do?
• ground-state calculations
  – Kohn-Sham orbitals and energies, total energies and atomic forces
  – finite as well as infinite systems
  – any crystal structure or supercell
  – insulators and metals (different schemes of BZ integration)
  – structural optimization (many minimization schemes available)
  – transition states and minimum-energy paths (via NEB or string dynamics)
• electronic polarization via Berry's phase
  – finite electric fields via saw-tooth potential or electric enthalpy
• norm-conserving as well as ultra-soft and PAW pseudopotentials
• many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
• scalar-relativistic as well as fully relativistic (spin-orbit) calculations
• magnetic systems, including non-collinear magnetism
• Wannier interpolations
• ab-initio molecular dynamics
  – Car-Parrinello (many ensembles and flavors)
  – Born-Oppenheimer (many ensembles and flavors)
  – QM-MM (interface with LAMMPS)
• linear response and vibrational dynamics
  – phonon dispersions, real-space interatomic force constants
  – electron-phonon interactions and superconductivity
  – effective charges and dielectric tensors
  – third-order anharmonicities and phonon lifetimes
  – infrared and (off-resonance) Raman cross sections
  – thermal properties via the quasi-harmonic approximation
• electronic excited states
  – TDDFT for very large systems (both real-time and "turbo-Lanczos")
  – MBPT for very large systems (GW, BSE)
... plus several post-processing tools!

Quantum ESPRESSO in numbers
• 350,000+ lines of FORTRAN/C code
• 46 registered developers
• 1600+ registered users
• 5700+ downloads of the latest 5.x.x version
• 2 web sites (QUANTUM-ESPRESSO.ORG & QE-FORGE.ORG)
• 1 official user mailing list, 1 official developer mailing list
• 24 international schools and training courses (1000+ participants)

PWscf in a nutshell: program flow
[Figure: PWscf SCF program flow; the computational hot spots are marked as 3D-FFT + GEMM + LAPACK, 3D-FFT, and 3D-FFT + GEMM.]

Spoiler!
• Only PWscf has been ported to GPU
• Serial performance (full socket vs full socket + GPU): 3x ~ 4x
• Parallel performance (best MPI+OpenMP vs ... + GPU): 2x ~ 3x
• Designed to run best at a low number of nodes (parallel efficiency is not high)
• spin magnetization and non-collinear magnetism not ported yet (working on it)
• I/O set low on purpose
• NVIDIA Kepler GPUs not yet exploited at their best (working on it)

Achievement: smart and selective BLAS
phiGEMM: CPU+GPU GEMM operations
• a drop-in library won't work as expected; control is needed
• overcomes the limit of the GPU memory
• flexible interface (C on the HOST, C on the DEVICE)
• dynamic workload adjustment (SPLIT), heuristic
• call-by-call profiling capabilities
[Figure: the GEMM is split so that A1 × B runs on the GPU and A2 × B runs on the CPU, producing C1 and C2; the H2D/D2H copies and the load unbalance determine the split.]

Challenge: rectangular GEMM
bad shape, poor performance
Issues:
• A and B can be larger than the GPU memory
• A and B are "badly" rectangular (one dominant dimension), a common case due to the data distribution
Solutions (~ +15% performance):
• tiling approach
  – tiles not too big, not too small
  – GEMM computation must exceed the copies (H2D, D2H), especially for small tiles
• handling the "SPECIAL-K" case
  – adding beta × C is done once
  – alpha × Ai × Bi is accumulated multiple times
Optimizations included in phiGEMM (version > 1.9)
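To make the "SPECIAL-K" idea concrete, here is a minimal cuBLAS sketch of the k-tiled accumulation. It is not the phiGEMM source: the function and parameter names (gemm_special_k, k_tile) are illustrative, and A, B and C are assumed to be already resident on the device, whereas the real library also streams tiles from host memory and splits part of the work (the A2 × B portion of the figure above) onto the host BLAS.

/*
 * SPECIAL-K sketch: C = beta*C + alpha * sum_i A_i*B_i, with A_i and B_i
 * tiles along the dominant k dimension.  beta*C is applied only with the
 * first tile; every later tile accumulates with beta = 1.
 * Column-major storage, no transposes, data assumed device resident.
 */
#include <cublas_v2.h>

static void gemm_special_k(cublasHandle_t handle,
                           int m, int n, int k, int k_tile,
                           double alpha, const double *dA, int lda,
                           const double *dB, int ldb,
                           double beta, double *dC, int ldc)
{
    for (int k0 = 0; k0 < k; k0 += k_tile) {
        int kt = (k - k0 < k_tile) ? (k - k0) : k_tile;
        double b = (k0 == 0) ? beta : 1.0;        /* beta*C only once */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, kt,
                    &alpha,
                    dA + (size_t)k0 * lda, lda,   /* A_i: columns k0 .. k0+kt-1 */
                    dB + k0,               ldb,   /* B_i: rows    k0 .. k0+kt-1 */
                    &b,
                    dC, ldc);
    }
}

Because C is touched once per tile but never copied back in between, the tile size only has to be large enough for the GEMM to exceed the H2D cost of each A_i and B_i, which is exactly the "not too big, not too small" trade-off mentioned above.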
Challenge: parallel 3D-FFT
• the 3D-FFT burns up to 40%~45% of the total SCF run-time
• ~90% of the 3D-FFTs in PWscf are inside vloc_psi (the "wave" grid)
• each 3D-FFT is "small": less than 300^3 COMPLEX DP
• the 3D-FFT grid need not be a cube
• in serial a 3D-FFT is called as it is; in parallel a 3D-FFT = Σ of 1D-FFTs
• in serial the data layout is straightforward; in parallel it is not
• MPI communication becomes the big issue for many-node problems
• on the GPU the FFT is mainly memory bound, hence the 3D-FFTs are grouped and batched

Challenge: FFT data layout
it is all about sticks & planes
There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4Ecut). A single 3D-FFT is divided into independent 1D-FFTs.
[Figure: sticks-and-planes decomposition over 4 PEs for the Ec and 4Ec grids: ~Nx·Ny/5 FFTs along z, a parallel transpose exchanging ~Nx·Ny·Nz/(5·Np) data per PE, ~Nx·Nz/2 FFTs along y, then Ny·Nz FFTs along x.]
Data are not contiguous and not "trivially" distributed across processors. Zeros are not transformed; the different cutoffs preserve accuracy.

Challenge: parallel 3D-FFT
Optimization #1
• CUDA-enabled MPI for P2P (within the socket)
• overlap FFT computation with MPI communication
• MPI communication >>> FFT computation for many nodes
[Figure: timeline of the Sync, MemCpy H2D, MPI, and MemCpy D2H phases.]

Challenge: parallel 3D-FFT
Optimization #2
Observation: limited overlap of the D2H copy due to MPI communication
• pinned memory needed (!!!)
• stream the D2H copy to hide the CPU copy and the FFT computation
Optimization #3
Observation: MPI "packets" become small for many nodes
• re-order data before communication
• batch the MPI_Alltoallv communications
Optimization #4
Idea: reduce the data transmitted (risky...)
• perform FFTs and GEMMs in DP, truncate the data to SP before communication
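The overlap pattern behind Optimizations #1-#3 can be sketched as a double-buffered pipeline: while one chunk of sticks is transformed on the GPU and copied to pinned host memory, the previous chunk is exchanged with a batched MPI_Alltoallv. This is only an illustrative sketch, not the QE-GPU source: the function and buffer names, the use of two streams, and the count/displacement arrays are assumptions, and both the loading of each chunk onto the device and the exact per-chunk receive layout are omitted.

/*
 * Sketch: batched 1D-FFTs along z on the GPU overlapped with the
 * all-to-all transpose of the previous chunk.
 * Assumes cufftSetStream(plan_z[s], stream[s]) has been called and that
 * h_buf[] is pinned (cudaMallocHost) so the D2H copies are asynchronous.
 */
#include <cufft.h>
#include <cuda_runtime.h>
#include <mpi.h>

void fftz_and_transpose(cufftHandle plan_z[2], cudaStream_t stream[2],
                        cufftDoubleComplex *d_buf[2],
                        cufftDoubleComplex *h_buf[2],     /* pinned host */
                        cufftDoubleComplex *h_recv,
                        size_t chunk_elems, int nchunks,
                        const int *sendcnt, const int *sdispl,
                        const int *recvcnt, const int *rdispl,
                        MPI_Comm comm)
{
    for (int c = 0; c < nchunks; ++c) {
        int s = c % 2;                 /* double buffering across 2 streams */
        /* ... load chunk c of the local sticks into d_buf[s] ... */
        cufftExecZ2Z(plan_z[s], d_buf[s], d_buf[s], CUFFT_FORWARD);
        cudaMemcpyAsync(h_buf[s], d_buf[s],
                        chunk_elems * sizeof(cufftDoubleComplex),
                        cudaMemcpyDeviceToHost, stream[s]);

        if (c > 0) {   /* exchange chunk c-1 while chunk c runs on the GPU */
            int p = (c - 1) % 2;
            cudaStreamSynchronize(stream[p]);
            MPI_Alltoallv(h_buf[p], sendcnt, sdispl, MPI_C_DOUBLE_COMPLEX,
                          h_recv + (size_t)(c - 1) * chunk_elems,
                          recvcnt, rdispl, MPI_C_DOUBLE_COMPLEX, comm);
        }
    }
    if (nchunks > 0) {                 /* drain the last chunk */
        cudaStreamSynchronize(stream[(nchunks - 1) % 2]);
        MPI_Alltoallv(h_buf[(nchunks - 1) % 2], sendcnt, sdispl,
                      MPI_C_DOUBLE_COMPLEX,
                      h_recv + (size_t)(nchunks - 1) * chunk_elems,
                      recvcnt, rdispl, MPI_C_DOUBLE_COMPLEX, comm);
    }
}

With a CUDA-aware MPI (Optimization #1) the pinned host staging can be dropped and the device buffers passed to MPI_Alltoallv directly, at least for the intra-socket P2P case.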
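Optimization #4 only needs a pair of conversion kernels around the exchange. The kernels below are an illustrative sketch under that assumption, not the actual QE-GPU code: the double-precision buffer is truncated to single precision before the all-to-all and promoted back afterwards, so the network moves half the bytes.

/*
 * Optimization #4 sketch: FFTs and GEMMs stay in DP, only the data that
 * crosses the network is truncated to SP (and promoted back on receipt).
 */
#include <cuda_runtime.h>
#include <cuComplex.h>

__global__ void truncate_to_sp(const cuDoubleComplex *in, cuFloatComplex *out,
                               size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_cuFloatComplex((float)cuCreal(in[i]),
                                     (float)cuCimag(in[i]));
}

__global__ void promote_to_dp(const cuFloatComplex *in, cuDoubleComplex *out,
                              size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_cuDoubleComplex((double)cuCrealf(in[i]),
                                      (double)cuCimagf(in[i]));
}

The accuracy impact of the truncation is exactly what makes the slide label this optimization "risky".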
Achievements: parallel 3D-FFT
miniDFT 1.6 (k-point calculations, ultra-soft pseudo-potentials)
• Optimization #1: +37% improvement in communication
• Optimization #2: [figure: timelines without and with proper stream management]
• Optimization #3: +10% improvement in communication
• Optimization #4: +52% (!!!) improvement in communication (SP vs DP)
Lower gain in PWscf!

Challenge: parallel 3D-FFT
[Figure: (1) all data of all FFTs copied back to host memory vs (2) data reordering before GPU-GPU communication. Image courtesy of D. Stoic.]

Challenge: H*psi
compute/update H * psi:
• compute the kinetic and non-local terms (in G space): complexity Ni × (N × Ng + Ng × N × Np)
• loop over (non-converged) bands:
  – FFT(psi) to R space: complexity Ni × Nb × FFT(Nr)
  – compute V × psi: complexity Ni × Nb × Nr
  – FFT(V × psi) back to G space: complexity Ni × Nb × FFT(Nr)
• compute Vexx: complexity Ni × Nc × Nq × Nb × (5 × Nr + 2 × FFT(Nr))
where N = 2 × Nb (Nb = number of valence bands), Ng = number of G vectors, Ni = number of Davidson iterations, Np = number of PP projectors, Nr = size of the 3D FFT grid, Nq = number of q-points (may be different from Nk)

Challenge: H*psi
the non-converged electronic bands dilemma
The number of FFTs is not predictable across the SCF iterations.

Challenge: parallel 3D-FFT
the orthogonal approach
[Figure: PSI is gathered ("MPI_Allgatherv") onto multiple local grids, transformed G→R with local (CU)FFTs, multiplied locally, transformed back R→G, and HPSI is then redistributed ("MPI_Allscatterv"); overlapping is possible!]
Considerations:
• memory on the GPU: ATLAS K40 (12 GByte)
• (still) too much communication: GPU Direct capability needed
• the number of 3D-FFTs is not predictable in advance: not ready for production yet
• benefit also for CPU-only runs!
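A heavily simplified sketch of this orthogonal scheme follows, under several assumptions: each rank owns a few whole bands, the gathered G-space data has already been expanded onto the dense grid, and both the V(r) × psi kernel and the final redistribution (the "MPI_Allscatterv" step of the figure) are left out. All names are illustrative; this is not production code.

/*
 * Orthogonal-approach sketch: gather whole bands, run complete *local*
 * 3D-FFTs with cuFFT, apply V(r) locally, transform back, then (not
 * shown) redistribute H*psi.
 */
#include <cufft.h>
#include <cuda_runtime.h>
#include <mpi.h>

void hpsi_local_grids(int nr1, int nr2, int nr3,         /* dense FFT grid   */
                      int nbnd_loc,                       /* bands owned here */
                      const cufftDoubleComplex *psi_dist, int my_count,
                      const int *counts, const int *displs,
                      cufftDoubleComplex *h_bands,        /* gathered, dense  */
                      cufftDoubleComplex *d_grid,         /* device work grid */
                      MPI_Comm comm)
{
    size_t grid = (size_t)nr1 * nr2 * nr3;

    /* 1. one collective brings together the distributed psi of my bands */
    MPI_Allgatherv(psi_dist, my_count, MPI_C_DOUBLE_COMPLEX,
                   h_bands, counts, displs, MPI_C_DOUBLE_COMPLEX, comm);

    cufftHandle plan;
    cufftPlan3d(&plan, nr1, nr2, nr3, CUFFT_Z2Z);

    for (int ib = 0; ib < nbnd_loc; ++ib) {
        cudaMemcpy(d_grid, h_bands + (size_t)ib * grid,
                   grid * sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice);

        cufftExecZ2Z(plan, d_grid, d_grid, CUFFT_INVERSE);   /* G -> R, local */
        /* ... multiply by V(r) on the device ... */
        cufftExecZ2Z(plan, d_grid, d_grid, CUFFT_FORWARD);   /* R -> G, local */

        /* ... copy back and queue for the redistribution of HPSI ... */
    }
    cufftDestroy(plan);
}

The appeal is that every 3D-FFT is a single library call with no mid-transform transpose; the price is the dense per-band grid in GPU memory and the large gather/scatter collectives, which is exactly the trade-off listed in the considerations above.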
Challenge: eigen-solvers
which library?
• LAPACK → MAGMA (ICL, University of Tennessee)
  – hybridization approach (CPU + GPU), dynamic scheduling based on DLA (QUARK)
  – single and multi-GPU, not memory-distributed (yet)
  – some (inevitable) numerical "discrepancies"
• ScaLAPACK → ELPA → ELPA + GPU (RZG + NVIDIA)
  – ELPA (Eigenvalue SoLvers for Petaflop Applications) improves on ScaLAPACK
  – ELPA-GPU proof-of-concept based on CUDA FORTRAN
  – effective results below expectation
• Lanczos diagonalization with a tridiagonal QR algorithm (Penn State)
  – simple (too simple?) and designed to be GPU friendly
  – takes advantage of GPU Direct
  – experimental, needs testing and validation

HPC Machines
WILKES (HPCS) [DELL]
• 128 nodes, dual-socket
• dual 6-core Intel Ivy Bridge
• dual NVIDIA K20c per node
• dual Mellanox Connect-IB FDR
• #2 Green500 Nov 2013 (~3632 MFlops/W)
TITAN (ORNL) [CRAY]
• 18688 nodes, single-socket
• single 16-core AMD Opteron
• one NVIDIA K20x per node
• Gemini interconnect
• #2 Top500 Jun 2013 (~17.59 PFlops Rmax)

Achievement: Save Power
serial multi-threaded, single GPU, NVIDIA Fermi generation
[Figure: power and time-to-solution for three test cases on a C2050 (Shilu-3, Water-on-Calcite, AUSURF112 with k-points), comparing 6 OMP, 6 MPI, and 6 OMP + 1 GPU runs: energy savings of 54%-58% and speed-ups of 3.1x-3.67x. Tests run early 2012 @ ICHEC.]

Achievement: improved time-to-solution
[Figure: speed-ups of roughly 2.4x to 3.5x for the serial tests (run on the SBN machine) and roughly 2.1x to 2.5x for the parallel tests (run on Wilkes).]

Challenge: running on CRAY XK7
Key differences...
• AMD Bulldozer architecture, 2 cores share the same FPU pipeline: aprun -j 1
• NUMA locality matters a lot, for both CPU-only and CPU+GPU runs: aprun -cc numa_node
• GPU Direct over RDMA is not supported (yet?): the GPU-GPU 3D-FFT path does not work
• "unfriendly" scheduling policy: the input has to be really big
Performance below expectation (<2x)
Tricks: many-pw.x, __USE_3D_FFT

Challenge: educate users
• the performance-portability myth
• "configure, compile, run" just like the CPU version
• all dependencies (MAGMA, phiGEMM) are compiled by QE-GPU
• no more than 2 MPI processes per GPU
  – Hyper-Q does not work automatically; an additional running daemon is needed
• forget about 1:1 output comparison
• QE-GPU can run on every GPU, but some GPUs are better than others...

Lessons learnt
being "heterogeneous" today and tomorrow
• the GPU does not really improve code scalability, only time-to-solution
• re-think data distribution for massively parallel architectures
• deal with uncontrolled "numerical fluctuations" (the GPU magnifies them)
• the "data movement" constraint will soon disappear (new Intel Xeon Phi Knights Landing and NVIDIA Project Denver expected by 2015)
• look for true alternatives and new algorithms
  – not easy: extensive validation _plus_ module dependencies
• performance is a function of human effort
• follow the mantra «Do what you are good at.»

THANK YOU FOR YOUR ATTENTION!

Links:
• http://hpc.cam.ac.uk
• http://www.quantum-espresso.org/
• http://foundation.quantum-espresso.org/
• http://qe-forge.org/gf/project/q-e/
• http://qe-forge.org/gf/project/q-e-gpu/