Scaling Applications to a Thousand GPUs and Beyond
Alan Gray, Kevin Stratford (EPCC, The University of Edinburgh)
Alistair Hart (Cray Exascale Research Initiative Europe)
Roberto Ansaloni (Cray Italy)

Introduction
• Graphics Processing Unit (GPU) accelerated architectures are proliferating
  - power-performance advantages over traditional CPU-based systems
  - but notoriously difficult to fully exploit for real applications, especially when scaling up to many nodes
• This talk describes work performed to enable efficient utilization of massively parallel GPU-accelerated systems
  - Ludwig: lattice Boltzmann fluid dynamics, C with MPI for communication and CUDA for GPU acceleration
  - Himeno benchmark: Fortran with the new coarray feature for communication and OpenACC for GPU acceleration

Ludwig
• Lattice Boltzmann method: a popular approach for solving the Navier-Stokes equations, which govern fluid flow
  - suitable for parallel implementation
• Ludwig: a versatile LB package capable of simulating the hydrodynamics of complex fluids
  - e.g. mixtures, surfactants, liquid crystals, particle suspensions
  - cutting-edge research in condensed matter physics, including the search for new materials
• The original C/MPI Ludwig is capable of exploiting large-scale traditional CPU-based machines
  - good parallel scaling up to many thousands of cores
  - can simulate large, complex systems

Ludwig Lattice Boltzmann Model
• Fluid represented as "particles" of fluid density, moving and colliding on a lattice
• 3D lattice, distributed between MPI tasks
  - each sub-lattice has inner and halo cells
  - [Figure: lattice decomposition; image source: J.-C. Desplat et al., Comp. Phys. Comm. 134(3), pp. 273-290 (2001)]
• At each lattice site, the fluid density is represented by a distribution data structure
  - a separate component for each velocity direction on the lattice
  - [Figure: distribution data structure; image source: E. Davidson, Message-passing for Lattice Boltzmann, EPCC MSc Dissertation (2008)]

Ludwig Algorithm
• Iterative updates to the distribution function f:
  f_i(r + c_i Δt, t + Δt) = f_i(r, t) + C_i[f(r, t)]
  (c_i: discrete lattice velocities; C_i: collision operator)
• This corresponds to two stages:
  - RHS, the collision stage: particles interact (collide)
    - updates the distribution locally at each lattice site
    - computationally dominated by matrix-vector operations
    - dominates the simulation; compute and memory-bandwidth intensive
  - LHS, the propagation stage: particles move according to their velocity
    - updates the distribution based on values at neighbouring lattice sites
    - memory-bandwidth bound
• For a parallel implementation, a communication stage is also required before propagation
  - a halo swap of f (using MPI)

Single GPU Implementation
• All new GPU functionality implemented in an additive manner
  - GPU acceleration optionally invoked at compile time
• GPU kernels and data-management facilities implemented using CUDA
  - each CUDA thread assigned to a single lattice site
• Wrapper routines developed to specify the decomposition, invoke kernels and manage data
  - with interfaces similar to the original routines
• Important to offload all computational components in the timestep, not just the dominant collision stage
  - in order to keep data resident on the GPU
• Work was needed to allow use of encapsulated data in kernels

Single GPU Optimisation
• Matrix-vector operations:
  - use of a temporary scalar allows on-chip caching of intermediate summation values
  - reordering of f allows coalesced memory accesses: consecutive threads read consecutive memory addresses
  (a minimal CUDA sketch of these two optimisations follows)
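To make the two optimisations above concrete, here is a minimal CUDA sketch of a per-site matrix-vector kernel. It is not the Ludwig source: NVEL, f, m, nsites and collide_on_gpu are illustrative names. It shows one thread per lattice site, a temporary scalar that keeps the partial sum in a register, and a "structure of arrays" ordering of f so that consecutive threads access consecutive addresses.

```c
/* Hypothetical sketch of the collision-stage optimisations (not Ludwig code).
 * f is stored with all sites for velocity 0 first, then velocity 1, etc.,
 * so f[q * nsites + site] is contiguous across neighbouring threads.        */
#include <cuda_runtime.h>

#define NVEL 19                        /* D3Q19 velocity set                 */

__constant__ double m[NVEL][NVEL];     /* per-site matrix (illustrative)     */

__global__ void collide(double *f, int nsites)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread/site    */
    if (site >= nsites) return;

    double fnew[NVEL];

    for (int p = 0; p < NVEL; p++) {
        double sum = 0.0;              /* temporary scalar: the running sum
                                          stays on chip, not in global f     */
        for (int q = 0; q < NVEL; q++) {
            sum += m[p][q] * f[q * nsites + site];     /* coalesced read     */
        }
        fnew[p] = sum;
    }
    for (int p = 0; p < NVEL; p++) {
        f[p * nsites + site] = fnew[p];                /* coalesced write    */
    }
}

/* Host-side launch wrapper: one thread per lattice site. */
void collide_on_gpu(double *f_d, int nsites)
{
    int threads = 128;
    int blocks  = (nsites + threads - 1) / threads;
    collide<<<blocks, threads>>>(f_d, nsites);
}
```

The real collision stage also involves equilibrium distributions and relaxation; the points illustrated here are only the register-resident accumulator and the coalesced access pattern.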
Multi-GPU Implementation
• Major redevelopment of the communication phase was required to achieve good inter-GPU communication performance and allow scaling to many GPUs
• During each timestep, distribution halo cells (six 2D planes) must be transferred between MPI processes
• The original code performs halo swaps "in place"
  - uses MPI datatypes to specify the planes of the sub-lattice to be sent/received in each direction
• The GPU version needs halos transferred between GPUs, which must be staged through the host CPUs
  - explicit buffering must be performed in the application: no MPI datatype functionality is available on the GPU
• For each halo plane:
  - the buffer is explicitly packed on the GPU (CUDA kernels)
  - the buffer is copied from the GPU to the host CPU (CUDA memory copies)
  - buffers are exchanged between hosts (MPI)
  - the buffer is copied from the host CPU to the GPU (CUDA memory copies)
  - the buffer is unpacked on the GPU (CUDA kernels)
  (a sketch of this staging pattern is given after the Ludwig summary below)
• Communication volume is reduced by a factor of 4 by sending only the velocity components propagating outward from the local domain
  - on the GPU this filtering is coded explicitly, since MPI datatype functionality is unavailable
  - special care is needed for corner sites, which propagate in more than one direction
• Asynchronous CUDA functionality (streams) is used to overlap the different communication operations where possible

Cray XE6 and XK6
• Cray XE6: "traditional" supercomputer
  - each compute node contains 2 AMD Interlagos CPUs
• Cray XK6: GPU-accelerated version
  - each compute node contains 1 CPU + 1 NVIDIA Tesla X2090 GPU

Cray XK6 Compute Node
• Host CPU: AMD Series 6200 (Interlagos)
• GPU: NVIDIA Tesla X2090, attached via PCIe Gen2
• Host memory: 16 or 32 GB of 1600 MHz DDR3
• GPU memory: 6 GB GDDR5
• Gemini high-speed interconnect
• Upgradeable to future GPUs

Performance Results
• The performance of the new GPU adaptation has been measured on the Titan prototype (Cray / Oak Ridge National Laboratory)
  - Cray XK6: ~1000 compute nodes == ~1000 NVIDIA Tesla (Fermi) X2090 GPUs (the number of nodes changed regularly due to hardware testing)
  - nodes connected via the Cray Gemini interconnect
  - 1 MPI task used per node (1 per GPU)
• and compared to the original CPU version run on HECToR (University of Edinburgh)
  - Cray XE6: 2816 compute nodes == 5632 AMD Interlagos 16-core CPUs == 90,112 cores
  - nodes connected via the Cray Gemini interconnect
  - 32 MPI tasks used per node (1 per CPU core)
  - the CPU version is highly optimised, including full utilization of the SIMD vector units

• Theoretical peak capabilities:
                                  AMD 16-core CPU (6276)   NVIDIA X2090 GPU
  Double precision performance    147 Gflop/s              665 Gflop/s
  Memory bandwidth                36.5 GB/s                177 GB/s
• We are comparing 2 CPUs (an XE6 node) with 1 GPU (an XK6 node)
• Note that the GPU version does not do any significant computation on the host CPU

Ludwig Performance Results
[Figure: Ludwig performance versus number of nodes, comparing the XE6 (Interlagos CPUs) with the XK6 (NVIDIA Fermi X2090 GPUs)]

Ludwig Performance Results
[Figure: Ludwig performance versus number of nodes for a range of lattice sizes, XE6 vs. XK6]

Ludwig Summary
• The Ludwig LB code has been successfully adapted to use a large number of GPUs in parallel
• Both the single-GPU and the multi-GPU optimisations are important in harnessing the available performance capability
• The work has resulted in a software package able to scale excellently on traditional or GPU-accelerated systems
  - the GPU version has the advantage and scales excellently to 936 GPUs
• The performance advantage of the GPU version is 1.5x to 2x when comparing node for node, i.e. 1 GPU compared with 2 CPUs (all cores utilised)
• Future work will include GPU enablement of advanced functionality
  - particle suspensions and liquid crystals are of particular interest, e.g. for new fast-switching LCD displays
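As a concrete recap of the staged halo exchange described in the multi-GPU slides, here is a minimal sketch. It is not the Ludwig source: the names pack_x, unpack_x and halo_swap_x are hypothetical, only one face is shown, and a single scalar per site stands in for the full (filtered) distribution. It walks through the five steps: pack on the GPU, copy to the host, exchange with MPI, copy back, unpack on the GPU.

```c
/* Hypothetical sketch of the staged halo swap (not Ludwig code): one face,
 * one scalar per lattice site, blocking copies. Owned cells are 1..NX etc.,
 * with a halo of width 1, so the field is (NX+2) x (NY+2) x (NZ+2).         */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 64
#define NY 64
#define NZ 64
#define IDX(i, j, k) ((((i) * (NY + 2)) + (j)) * (NZ + 2) + (k))

__global__ void pack_x(const double *field, double *buf, int i)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;
    if (j < NY + 2 && k < NZ + 2) buf[j * (NZ + 2) + k] = field[IDX(i, j, k)];
}

__global__ void unpack_x(double *field, const double *buf, int i)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;
    if (j < NY + 2 && k < NZ + 2) field[IDX(i, j, k)] = buf[j * (NZ + 2) + k];
}

/* Fill this rank's left (i = 0) halo plane with the rightmost owned plane
 * (i = NX) of the left neighbour; a mirror-image call fills i = NX + 1.     */
void halo_swap_x(double *field_d, int rank_left, int rank_right, MPI_Comm comm)
{
    const int    nplane = (NY + 2) * (NZ + 2);
    const size_t bytes  = (size_t)nplane * sizeof(double);
    static double *send_d, *recv_d, *send_h, *recv_h;
    if (send_d == NULL) {                          /* allocate buffers once  */
        cudaMalloc((void **)&send_d, bytes);
        cudaMalloc((void **)&recv_d, bytes);
        cudaMallocHost((void **)&send_h, bytes);   /* pinned host memory     */
        cudaMallocHost((void **)&recv_h, bytes);
    }
    dim3 threads(16, 16);
    dim3 blocks((NY + 2 + 15) / 16, (NZ + 2 + 15) / 16);

    /* 1. pack the outgoing plane on the GPU                                 */
    pack_x<<<blocks, threads>>>(field_d, send_d, NX);
    /* 2. stage the buffer through the host CPU                              */
    cudaMemcpy(send_h, send_d, bytes, cudaMemcpyDeviceToHost);
    /* 3. exchange buffers between hosts with MPI                            */
    MPI_Sendrecv(send_h, nplane, MPI_DOUBLE, rank_right, 0,
                 recv_h, nplane, MPI_DOUBLE, rank_left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* 4. copy the received buffer back to the GPU                           */
    cudaMemcpy(recv_d, recv_h, bytes, cudaMemcpyHostToDevice);
    /* 5. unpack it into the halo plane on the GPU                           */
    unpack_x<<<blocks, threads>>>(field_d, recv_d, 0);
    cudaDeviceSynchronize();
}
```

In the actual implementation the packing also filters out velocity components that do not propagate outward, all six faces (plus the corner sites) are handled, and the copies and kernels are issued on separate CUDA streams with non-blocking MPI so the steps can overlap; the blocking version above only illustrates the data path.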
GPU Directives
• Language extensions, e.g. CUDA or OpenCL, allow programmers to interface with the GPU
  - this gives control to the programmer, but is often tricky and time consuming, and results in complex/non-portable code
• An alternative approach is to allow the compiler to automatically accelerate code sections on the GPU (including the decomposition, data transfers, etc.)
• There must be a mechanism to provide the compiler with hints regarding which sections to accelerate, the parallelism, data usage, etc.
• Directives provide this mechanism
  - special syntax which is understood by accelerator compilers and ignored (treated as code comments) by non-accelerator compilers
  - the same source code can be compiled for a CPU/GPU combination or for the CPU only
  - c.f. OpenMP

OpenACC
• OpenACC standard announced in November 2011
  - by CAPS, Cray, NVIDIA and PGI
• Applies to NVIDIA GPUs only (so far)
• www.openacc-standard.org

OpenACC Example
[The worked code examples from the original slides are not reproduced here; a minimal sketch in the same spirit is given at the end of this deck]

Himeno Benchmark
• Case study using OpenACC in a parallel performance benchmark on a supercomputer containing GPU accelerators
• The Himeno benchmark, written in Fortran, implements a domain-decomposed three-dimensional Poisson solver with a 19-point Laplacian stencil
• We used a version of Himeno updated to include Fortran 2008 features, in particular Fortran coarrays (CAF), to exchange data between "images" on different processors
  - R. Himeno, The Himeno benchmark, http://accc.riken.jp/HPC_e/himenobmt_e.html

Asynchronous Implementation
• The async clause allows overlapping of computation and communication (see the sketch at the end of this deck)

Himeno Performance Results
[Figure: Himeno performance results]

Himeno Summary
• The OpenACC directive-based GPU programming model offers productivity advantages over CUDA/OpenCL
• We used OpenACC to port the Himeno benchmark to the GPU architecture
  - using Fortran coarrays to allow scaling to many nodes
• Node for node, the OpenACC version of the code was just under twice as fast on the Cray XK6 compared to the optimised (hybrid MPI/OpenMP) version of the code running on a comparable CPU-based system
• The OpenACC async clause gave a measurable performance boost of between 5% and 10% in the Himeno code
  - Himeno is a relatively naive benchmark; in more realistic applications there is greater scope for overlapping data transfers

References
• A. Gray, K. Stratford, A. Hart, A. Richardson, Lattice Boltzmann for massively-parallel GPU and SIMD architectures, submitted to the Supercomputing 2012 conference.
• A. Hart, R. Ansaloni, A. Gray, Porting and scaling OpenACC applications on massively-parallel, GPU-accelerated supercomputers, to appear in a European Physical Journal Special Topics issue.
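Since the code from the original "OpenACC Example" slides is not reproduced above, the following is a minimal sketch of the directive and async ideas discussed in the Himeno section. It is written in C for consistency with the earlier sketches (the deck's Himeno port is Fortran), uses a simple 7-point Jacobi update rather than Himeno's 19-point stencil, and all names (p, pnew, N) are illustrative.

```c
/* Hypothetical OpenACC sketch (not the Himeno source): a stencil update kept
 * resident on the GPU, launched asynchronously so the host is free to drive
 * halo communication while the device computes.                             */
#include <stdio.h>

#define N 128

static float p[N][N][N], pnew[N][N][N];

int main(void)
{
    /* simple initial condition: unit value on the domain boundary           */
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                if (i == 0 || j == 0 || k == 0 ||
                    i == N - 1 || j == N - 1 || k == N - 1)
                    p[k][j][i] = 1.0f;

    /* keep both arrays resident on the GPU for the whole run                */
    #pragma acc data copy(p) create(pnew)
    {
        for (int iter = 0; iter < 100; ++iter) {

            /* interior update launched asynchronously on queue 1:
               control returns to the host immediately                       */
            #pragma acc parallel loop collapse(3) present(p, pnew) async(1)
            for (int k = 1; k < N - 1; ++k)
                for (int j = 1; j < N - 1; ++j)
                    for (int i = 1; i < N - 1; ++i)
                        pnew[k][j][i] = (p[k][j][i-1] + p[k][j][i+1] +
                                         p[k][j-1][i] + p[k][j+1][i] +
                                         p[k-1][j][i] + p[k+1][j][i]) / 6.0f;

            /* the host is free here, e.g. to post the coarray or MPI halo
               exchanges that overlap with the device computation            */

            #pragma acc wait(1)     /* synchronise before reusing the data   */

            #pragma acc parallel loop collapse(3) present(p, pnew)
            for (int k = 1; k < N - 1; ++k)
                for (int j = 1; j < N - 1; ++j)
                    for (int i = 1; i < N - 1; ++i)
                        p[k][j][i] = pnew[k][j][i];
        }
    }
    printf("centre value after 100 iterations: %f\n", p[N/2][N/2][N/2]);
    return 0;
}
```

Built with an OpenACC compiler (for example the Cray or NVIDIA/PGI compilers), the same source also compiles unchanged with a non-accelerator compiler, which simply ignores the pragmas; this is the portability argument made on the GPU Directives slide.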