Scaling Applications to a Thousand GPUs and Beyond
Alan Gray, Kevin Stratford (EPCC, The University of Edinburgh)
Alistair Hart (Cray Exascale Research Initiative Europe)
Roberto Ansaloni (Cray Italy)

Introduction
• Graphics Processing Unit (GPU) accelerated architectures are proliferating
  - power-performance advantages over traditional CPU-based systems
  - but notoriously difficult to fully exploit for real applications, especially when scaling up to many nodes
• This talk describes work performed to enable efficient utilization of massively parallel GPU-accelerated systems
  - Ludwig: lattice Boltzmann fluid dynamics, C with MPI for communication and CUDA for GPU acceleration
  - Himeno benchmark: Fortran with the new coarray feature for communication and OpenACC for GPU acceleration

Ludwig
• Lattice Boltzmann method: a popular approach for solving the Navier-Stokes equations, which govern fluid flow
  - suitable for parallel implementation
• Ludwig: a versatile LB package capable of simulating the hydrodynamics of complex fluids
  - e.g. mixtures, surfactants, liquid crystals, particle suspensions
  - cutting-edge research in condensed matter physics, including the search for new materials
• The original C/MPI Ludwig is capable of exploiting large-scale traditional CPU-based machines
  - good parallel scaling up to many thousands of cores
  - can simulate large, complex systems

Ludwig Lattice Boltzmann Model
• Fluid represented as "particles" of fluid density, moving and colliding on a lattice
• 3D lattice, distributed between MPI tasks
  - each sub-lattice has inner and halo cells
  - [Figure: lattice decomposition; image source: J.-C. Desplat et al., Comp. Phys. Comm. 134(3), pp. 273-290 (2001)]
• At each lattice site, the fluid density is represented by a distribution data structure
  - a separate component for each velocity direction on the lattice
  - [Figure: distribution data structure; image source: E. Davidson, Message-passing for Lattice Boltzmann, EPCC MSc Dissertation (2008)]

Ludwig Algorithm
• Iterative updates to the distribution function f:
  f_i(r + c_i Δt, t + Δt) = f_i(r, t) + C_i[f(r, t)]
  (c_i: discrete lattice velocities; C_i: collision operator)
• This corresponds to two stages:
  - RHS, the collision stage: particles interact (collide)
    - updates the distribution locally at each lattice site
    - computationally dominated by matrix-vector operations
    - dominates the simulation; compute and memory-bandwidth intensive
  - LHS, the propagation stage: particles move according to their velocity
    - updates the distribution based on values at neighbouring lattice sites
    - memory-bandwidth bound
• For a parallel implementation, a communication stage is also required before propagation
  - a halo swap of f (using MPI)

Single GPU Implementation
• All new GPU functionality implemented in an additive manner
  - GPU acceleration optionally invoked at compile time
• GPU kernels and data-management facilities implemented using CUDA
  - each CUDA thread assigned to a single lattice site
• Wrapper routines developed to specify the decomposition, invoke kernels and manage data
  - with interfaces similar to the original routines
• Important to offload all computational components in the timestep, not just the dominant collision stage
  - in order to keep data resident on the GPU
• Work was needed to allow use of encapsulated data in kernels

Single GPU Optimisation
• Matrix-vector operations:
  - use of a temporary scalar allows on-chip caching of intermediate summation values
  - reordering of f allows coalesced memory accesses: consecutive threads read consecutive memory addresses
  (a minimal CUDA sketch of these two optimisations follows)
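To make the two optimisations above concrete, here is a minimal CUDA sketch of a per-site matrix-vector kernel. It is not the Ludwig source: NVEL, f, m, nsites and collide_on_gpu are illustrative names. It shows one thread per lattice site, a temporary scalar that keeps the partial sum in a register, and a "structure of arrays" ordering of f so that consecutive threads access consecutive addresses.

```c
/* Hypothetical sketch of the collision-stage optimisations (not Ludwig code).
 * f is stored with all sites for velocity 0 first, then velocity 1, etc.,
 * so f[q * nsites + site] is contiguous across neighbouring threads.        */
#include <cuda_runtime.h>

#define NVEL 19                        /* D3Q19 velocity set                 */

__constant__ double m[NVEL][NVEL];     /* per-site matrix (illustrative)     */

__global__ void collide(double *f, int nsites)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread/site    */
    if (site >= nsites) return;

    double fnew[NVEL];

    for (int p = 0; p < NVEL; p++) {
        double sum = 0.0;              /* temporary scalar: the running sum
                                          stays on chip, not in global f     */
        for (int q = 0; q < NVEL; q++) {
            sum += m[p][q] * f[q * nsites + site];     /* coalesced read     */
        }
        fnew[p] = sum;
    }
    for (int p = 0; p < NVEL; p++) {
        f[p * nsites + site] = fnew[p];                /* coalesced write    */
    }
}

/* Host-side launch wrapper: one thread per lattice site. */
void collide_on_gpu(double *f_d, int nsites)
{
    int threads = 128;
    int blocks  = (nsites + threads - 1) / threads;
    collide<<<blocks, threads>>>(f_d, nsites);
}
```

The real collision stage also involves equilibrium distributions and relaxation; the points illustrated here are only the register-resident accumulator and the coalesced access pattern.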
Multi-GPU Implementation
• Major redevelopment of the communication phase was required to achieve good inter-GPU communication performance and allow scaling to many GPUs
• During each timestep, distribution halo cells (six 2D planes) must be transferred between MPI processes
• The original code performs halo swaps "in place"
  - uses MPI datatypes to specify the planes of the sub-lattice to be sent/received in each direction
• The GPU version needs halos transferred between GPUs, which must be staged through the host CPUs
  - explicit buffering must be performed in the application: no MPI datatype functionality is available on the GPU
• For each halo plane:
  - the buffer is explicitly packed on the GPU (CUDA kernels)
  - the buffer is copied from the GPU to the host CPU (CUDA memory copies)
  - buffers are exchanged between hosts (MPI)
  - the buffer is copied from the host CPU to the GPU (CUDA memory copies)
  - the buffer is unpacked on the GPU (CUDA kernels)
  (a sketch of this staging pattern is given after the Ludwig summary below)
• Communication volume is reduced by a factor of 4 by sending only the velocity components propagating outward from the local domain
  - on the GPU this filtering is coded explicitly, since MPI datatype functionality is unavailable
  - special care is needed for corner sites, which propagate in more than one direction
• Asynchronous CUDA functionality (streams) is used to overlap the different communication operations where possible

Cray XE6 and XK6
• Cray XE6: "traditional" supercomputer
  - each compute node contains 2 AMD Interlagos CPUs
• Cray XK6: GPU-accelerated version
  - each compute node contains 1 CPU + 1 NVIDIA Tesla X2090 GPU

Cray XK6 Compute Node
• Host CPU: AMD Series 6200 (Interlagos)
• GPU: NVIDIA Tesla X2090, attached via PCIe Gen2
• Host memory: 16 or 32 GB of 1600 MHz DDR3
• GPU memory: 6 GB GDDR5
• Gemini high-speed interconnect
• Upgradeable to future GPUs

Performance Results
• The performance of the new GPU adaptation has been measured on the Titan prototype (Cray / Oak Ridge National Laboratory)
  - Cray XK6: ~1000 compute nodes == ~1000 NVIDIA Tesla (Fermi) X2090 GPUs (the number of nodes changed regularly due to hardware testing)
  - nodes connected via the Cray Gemini interconnect
  - 1 MPI task used per node (1 per GPU)
• and compared to the original CPU version run on HECToR (University of Edinburgh)
  - Cray XE6: 2816 compute nodes == 5632 AMD Interlagos 16-core CPUs == 90,112 cores
  - nodes connected via the Cray Gemini interconnect
  - 32 MPI tasks used per node (1 per CPU core)
  - the CPU version is highly optimised, including full utilization of the SIMD vector units

• Theoretical peak capabilities:
                                  AMD 16-core CPU (6276)   NVIDIA X2090 GPU
  Double precision performance    147 Gflop/s              665 Gflop/s
  Memory bandwidth                36.5 GB/s                177 GB/s
• We are comparing 2 CPUs (an XE6 node) with 1 GPU (an XK6 node)
• Note that the GPU version does not do any significant computation on the host CPU

Ludwig Performance Results
[Figure: Ludwig performance versus number of nodes, comparing the XE6 (Interlagos CPUs) with the XK6 (NVIDIA Fermi X2090 GPUs)]

Ludwig Performance Results
[Figure: Ludwig performance versus number of nodes for a range of lattice sizes, XE6 vs. XK6]

Ludwig Summary
• The Ludwig LB code has been successfully adapted to use a large number of GPUs in parallel
• Both the single-GPU and the multi-GPU optimisations are important in harnessing the available performance capability
• The work has resulted in a software package able to scale excellently on traditional or GPU-accelerated systems
  - the GPU version has the advantage and scales excellently to 936 GPUs
• The performance advantage of the GPU version is 1.5x to 2x when comparing node for node, i.e. 1 GPU compared with 2 CPUs (all cores utilised)
• Future work will include GPU enablement of advanced functionality
  - particle suspensions and liquid crystals are of particular interest, e.g. for new fast-switching LCD displays
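As a concrete recap of the staged halo exchange described in the multi-GPU slides, here is a minimal sketch. It is not the Ludwig source: the names pack_x, unpack_x and halo_swap_x are hypothetical, only one face is shown, and a single scalar per site stands in for the full (filtered) distribution. It walks through the five steps: pack on the GPU, copy to the host, exchange with MPI, copy back, unpack on the GPU.

```c
/* Hypothetical sketch of the staged halo swap (not Ludwig code): one face,
 * one scalar per lattice site, blocking copies. Owned cells are 1..NX etc.,
 * with a halo of width 1, so the field is (NX+2) x (NY+2) x (NZ+2).         */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 64
#define NY 64
#define NZ 64
#define IDX(i, j, k) ((((i) * (NY + 2)) + (j)) * (NZ + 2) + (k))

__global__ void pack_x(const double *field, double *buf, int i)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;
    if (j < NY + 2 && k < NZ + 2) buf[j * (NZ + 2) + k] = field[IDX(i, j, k)];
}

__global__ void unpack_x(double *field, const double *buf, int i)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;
    if (j < NY + 2 && k < NZ + 2) field[IDX(i, j, k)] = buf[j * (NZ + 2) + k];
}

/* Fill this rank's left (i = 0) halo plane with the rightmost owned plane
 * (i = NX) of the left neighbour; a mirror-image call fills i = NX + 1.     */
void halo_swap_x(double *field_d, int rank_left, int rank_right, MPI_Comm comm)
{
    const int    nplane = (NY + 2) * (NZ + 2);
    const size_t bytes  = (size_t)nplane * sizeof(double);
    static double *send_d, *recv_d, *send_h, *recv_h;
    if (send_d == NULL) {                          /* allocate buffers once  */
        cudaMalloc((void **)&send_d, bytes);
        cudaMalloc((void **)&recv_d, bytes);
        cudaMallocHost((void **)&send_h, bytes);   /* pinned host memory     */
        cudaMallocHost((void **)&recv_h, bytes);
    }
    dim3 threads(16, 16);
    dim3 blocks((NY + 2 + 15) / 16, (NZ + 2 + 15) / 16);

    /* 1. pack the outgoing plane on the GPU                                 */
    pack_x<<<blocks, threads>>>(field_d, send_d, NX);
    /* 2. stage the buffer through the host CPU                              */
    cudaMemcpy(send_h, send_d, bytes, cudaMemcpyDeviceToHost);
    /* 3. exchange buffers between hosts with MPI                            */
    MPI_Sendrecv(send_h, nplane, MPI_DOUBLE, rank_right, 0,
                 recv_h, nplane, MPI_DOUBLE, rank_left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* 4. copy the received buffer back to the GPU                           */
    cudaMemcpy(recv_d, recv_h, bytes, cudaMemcpyHostToDevice);
    /* 5. unpack it into the halo plane on the GPU                           */
    unpack_x<<<blocks, threads>>>(field_d, recv_d, 0);
    cudaDeviceSynchronize();
}
```

In the actual implementation the packing also filters out velocity components that do not propagate outward, all six faces (plus the corner sites) are handled, and the copies and kernels are issued on separate CUDA streams with non-blocking MPI so the steps can overlap; the blocking version above only illustrates the data path.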
GPU Directives
• Language extensions, e.g. CUDA or OpenCL, allow programmers to interface with the GPU
  - this gives control to the programmer, but is often tricky and time consuming, and results in complex/non-portable code
• An alternative approach is to allow the compiler to automatically accelerate code sections on the GPU (including the decomposition, data transfers, etc.)
• There must be a mechanism to provide the compiler with hints regarding which sections to accelerate, the parallelism, data usage, etc.
• Directives provide this mechanism
  - special syntax which is understood by accelerator compilers and ignored (treated as code comments) by non-accelerator compilers
  - the same source code can be compiled for a CPU/GPU combination or for the CPU only
  - c.f. OpenMP

OpenACC
• OpenACC standard announced in November 2011
  - by CAPS, Cray, NVIDIA and PGI
• Applies to NVIDIA GPUs only (so far)
• www.openacc-standard.org

OpenACC Example
[The worked code examples from the original slides are not reproduced here; a minimal sketch in the same spirit is given at the end of this deck]

Himeno Benchmark
• Case study using OpenACC in a parallel performance benchmark on a supercomputer containing GPU accelerators
• The Himeno benchmark, written in Fortran, implements a domain-decomposed three-dimensional Poisson solver with a 19-point Laplacian stencil
• We used a version of Himeno updated to include Fortran 2008 features, in particular Fortran coarrays (CAF), to exchange data between "images" on different processors
  - R. Himeno, The Himeno benchmark, http://accc.riken.jp/HPC_e/himenobmt_e.html

Asynchronous Implementation
• The async clause allows overlapping of computation and communication (see the sketch at the end of this deck)

Himeno Performance Results
[Figure: Himeno performance results]

Himeno Summary
• The OpenACC directive-based GPU programming model offers productivity advantages over CUDA/OpenCL
• We used OpenACC to port the Himeno benchmark to the GPU architecture
  - using Fortran coarrays to allow scaling to many nodes
• Node for node, the OpenACC version of the code was just under twice as fast on the Cray XK6 compared to the optimised (hybrid MPI/OpenMP) version of the code running on a comparable CPU-based system
• The OpenACC async clause gave a measurable performance boost of between 5% and 10% in the Himeno code
  - Himeno is a relatively naive benchmark; in more realistic applications there is greater scope for overlapping data transfers

References
• A. Gray, K. Stratford, A. Hart, A. Richardson, Lattice Boltzmann for massively-parallel GPU and SIMD architectures, submitted to the Supercomputing 2012 conference.
• A. Hart, R. Ansaloni, A. Gray, Porting and scaling OpenACC applications on massively-parallel, GPU-accelerated supercomputers, to appear in a European Physical Journal Special Topics issue.
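Since the code from the original "OpenACC Example" slides is not reproduced above, the following is a minimal sketch of the directive and async ideas discussed in the Himeno section. It is written in C for consistency with the earlier sketches (the deck's Himeno port is Fortran), uses a simple 7-point Jacobi update rather than Himeno's 19-point stencil, and all names (p, pnew, N) are illustrative.

```c
/* Hypothetical OpenACC sketch (not the Himeno source): a stencil update kept
 * resident on the GPU, launched asynchronously so the host is free to drive
 * halo communication while the device computes.                             */
#include <stdio.h>

#define N 128

static float p[N][N][N], pnew[N][N][N];

int main(void)
{
    /* simple initial condition: unit value on the domain boundary           */
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                if (i == 0 || j == 0 || k == 0 ||
                    i == N - 1 || j == N - 1 || k == N - 1)
                    p[k][j][i] = 1.0f;

    /* keep both arrays resident on the GPU for the whole run                */
    #pragma acc data copy(p) create(pnew)
    {
        for (int iter = 0; iter < 100; ++iter) {

            /* interior update launched asynchronously on queue 1:
               control returns to the host immediately                       */
            #pragma acc parallel loop collapse(3) present(p, pnew) async(1)
            for (int k = 1; k < N - 1; ++k)
                for (int j = 1; j < N - 1; ++j)
                    for (int i = 1; i < N - 1; ++i)
                        pnew[k][j][i] = (p[k][j][i-1] + p[k][j][i+1] +
                                         p[k][j-1][i] + p[k][j+1][i] +
                                         p[k-1][j][i] + p[k+1][j][i]) / 6.0f;

            /* the host is free here, e.g. to post the coarray or MPI halo
               exchanges that overlap with the device computation            */

            #pragma acc wait(1)     /* synchronise before reusing the data   */

            #pragma acc parallel loop collapse(3) present(p, pnew)
            for (int k = 1; k < N - 1; ++k)
                for (int j = 1; j < N - 1; ++j)
                    for (int i = 1; i < N - 1; ++i)
                        p[k][j][i] = pnew[k][j][i];
        }
    }
    printf("centre value after 100 iterations: %f\n", p[N/2][N/2][N/2]);
    return 0;
}
```

Built with an OpenACC compiler (for example the Cray or NVIDIA/PGI compilers), the same source also compiles unchanged with a non-accelerator compiler, which simply ignores the pragmas; this is the portability argument made on the GPU Directives slide.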