Scaling Applications to a Thousand GPUs and Beyond

Scaling Applications to a Thousand GPUs and Beyond
Alan Gray, Kevin Stratford (EPCC, The University of Edinburgh)
Alistair Hart (Cray Exascale Research Initiative Europe)
Roberto Ansaloni (Cray Italy)
Introduction
•  Graphics Processing Unit (GPU) accelerated architectures are proliferating
  - power-performance advantages over traditional CPU based systems
  - but notoriously difficult to fully exploit for real applications, especially when scaling up to many nodes
•  This talk describes work performed to enable efficient utilization of massively parallel GPU accelerated systems
  - Ludwig lattice Boltzmann fluid dynamics: C with MPI for communication and CUDA for GPU acceleration
  - Himeno benchmark: Fortran with the new co-array feature for communication and OpenACC for GPU acceleration
Ludwig
•  Lattice Boltzmann method: popular approach for solving the Navier-Stokes equations which govern fluid flow
  - Suitable for parallel implementations
•  Ludwig: versatile LB package capable of simulating hydrodynamics of complex fluids
  - e.g. mixtures, surfactants, liquid crystals, particle suspensions
  - cutting-edge research into condensed matter physics, including the search for new materials
•  Original C/MPI Ludwig capable of exploiting large-scale traditional CPU based machines
  - good parallel scaling up to many thousands of cores
  - can simulate large, complex systems
Ludwig Lattice Boltzmann Model
•  Fluid represented as "particles" of fluid density, moving and colliding on a lattice
•  3D lattice, distributed between MPI tasks
  - Each sub-lattice has inner and halo cells
  [Image source: J.-C. Desplat et al., Comp. Phys. Comm. 134(3), pp. 273-290 (2001)]
•  At each lattice site, fluid density represented by a distribution data structure
  - separate component for each velocity direction on the lattice
  [Image source: E. Davidson, Message-passing for Lattice Boltzmann, EPCC MSc Dissertation (2008)]
Ludwig Algorithm
•  Iterative updates to the distribution function f (see the equation sketch below). This corresponds to 2 stages:
  - RHS: collision stage: particles interact (collide)
    - Update distribution local to each lattice site
    - Computationally dominated by matrix-vector operations
    - Dominates simulation runtime; compute and memory bandwidth intensive
  - LHS: propagation stage: particles move according to their velocity
    - update distribution based on values at neighbouring lattice sites
    - Memory bandwidth bound
•  For parallel implementation, a communication stage is also required before propagation
  - halo swap of f (using MPI)
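As a sketch, the update referred to above has the standard lattice Boltzmann form, written here in the usual multi-relaxation-time notation with collision matrix L_{ij} and equilibrium distribution f^{eq}:

    f_i(\mathbf{r} + \mathbf{c}_i \Delta t,\; t + \Delta t) \;=\; f_i(\mathbf{r}, t) \;-\; \sum_j L_{ij}\,\bigl(f_j(\mathbf{r}, t) - f_j^{eq}(\mathbf{r}, t)\bigr)

The right-hand side is the local collision stage at each site (a matrix-vector operation over the velocity components) and the left-hand side is the propagation of each component to the neighbouring site along its velocity direction c_i.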
Single GPU Implementation
•  All new GPU functionality implemented in an additive manner
  - GPU acceleration optionally invoked at compile time
•  GPU kernels and data management facilities implemented using CUDA
  - Each CUDA thread assigned to a single lattice site (see the sketch after this slide)
•  Wrapper routines developed to specify decomposition, invoke kernels and manage data
  - With interfaces similar to the original routines
•  Important to offload all computational components in the timestep, not just the dominant collision stage
  - In order to be able to keep data resident on the GPU
•  Work was needed to allow use of encapsulated data in kernels
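A minimal sketch of the one-thread-per-lattice-site mapping and the host-side wrapper pattern. The names (collide_kernel, NVEL, omega) are illustrative, and a simple BGK-style relaxation stands in for Ludwig's actual collision operator:

#include <cuda_runtime.h>

#define NVEL 19   /* number of discrete velocities per site (D3Q19), assumed */

/* One CUDA thread per lattice site: a local BGK-style relaxation standing in
 * for the collision stage (the real Ludwig collision is more involved). */
__global__ void collide_kernel(double *f, const double *feq,
                               double omega, int nsites)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= nsites) return;

    for (int p = 0; p < NVEL; p++) {
        int idx = p * nsites + site;             /* site index fastest-varying */
        f[idx] += omega * (feq[idx] - f[idx]);   /* relax toward equilibrium   */
    }
}

/* Host wrapper: mirrors the "wrapper routines" that specify the decomposition
 * and invoke the kernel while data stays resident on the GPU. */
void collide_on_gpu(double *d_f, const double *d_feq, double omega, int nsites)
{
    int threads = 128;
    int blocks = (nsites + threads - 1) / threads;
    collide_kernel<<<blocks, threads>>>(d_f, d_feq, omega, nsites);
    cudaDeviceSynchronize();
}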
Single GPU Optimisation
•  Matrix-vector operations:
  - Use of a temporary scalar allows on-chip caching of intermediate summation values
•  Reordering of f allows coalesced memory accesses
  - Consecutive threads reading consecutive memory addresses (see the sketch after this slide)
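A sketch of both optimisations together, under the same assumptions as the previous sketch: the distribution is stored with the site index fastest-varying (structure-of-arrays), so consecutive threads touch consecutive addresses, and the matrix-vector product accumulates into a register-held scalar:

#define NVEL 19   /* D3Q19, assumed */

/* Matrix-vector step of the collision, one thread per site.
 * Layout f[p * nsites + site]: consecutive threads (consecutive 'site')
 * read and write consecutive addresses, so accesses coalesce. */
__global__ void collision_mv(const double *f, double *fnew,
                             const double *M,   /* NVEL x NVEL collision matrix */
                             int nsites)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= nsites) return;

    for (int i = 0; i < NVEL; i++) {
        double sum = 0.0;               /* temporary scalar kept in a register */
        for (int j = 0; j < NVEL; j++) {
            sum += M[i * NVEL + j] * f[j * nsites + site];
        }
        fnew[i * nsites + site] = sum;  /* single coalesced write per component */
    }
}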
Single GPU Optimisation
Multi-GPU Implementation
•  Major redevelopment of the communication phase was required to achieve good inter-GPU communication performance
  - and allow scaling to many GPUs
•  During each timestep, distribution halo cells (six 2D planes) must be transferred between MPI processes
•  Original code performs halo swaps "in-place"
  - Uses MPI datatypes to specify the planes of the sub-lattice to be sent/received in each direction
•  GPU version needs halos transferred between GPUs
  - must be staged through host CPUs
  - Explicit buffering must be performed in the application
  - No MPI datatype functionality available on the GPU
Multi-GPU Implementation
•  For each halo plane (sketched below):
  - buffer explicitly packed on GPU (CUDA kernels)
  - buffer copied from GPU to host CPU (CUDA memory copies)
  - buffers exchanged between hosts (MPI)
  - buffer copied from host CPU to GPU (CUDA memory copies)
  - buffer unpacked on GPU (CUDA kernels)
•  Communications reduced by a factor of 4 by only sending velocity components propagating outward from a local domain
  - On the GPU this filtering is coded explicitly, since MPI datatype functionality is unavailable
  - Special care needed for corner sites which propagate in more than one direction
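A minimal sketch of one plane of the staged exchange. The kernel and buffer names (pack_halo, unpack_halo, d_sendbuf, ...) are illustrative; the real code repeats this for all six planes and only packs the outward-propagating velocity components:

#include <mpi.h>
#include <cuda_runtime.h>

/* Gather the halo sites listed in 'index' into a contiguous send buffer. */
__global__ void pack_halo(const double *f, double *sendbuf,
                          const int *index, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) sendbuf[i] = f[index[i]];
}

/* Scatter a received contiguous buffer back into the halo sites. */
__global__ void unpack_halo(double *f, const double *recvbuf,
                            const int *index, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) f[index[i]] = recvbuf[i];
}

/* One direction of the halo swap, staged through the host CPU. */
void halo_swap_plane(double *d_f, const int *d_send_index, const int *d_recv_index,
                     double *d_sendbuf, double *d_recvbuf,
                     double *h_sendbuf, double *h_recvbuf,
                     int count, int rank_up, int rank_down, MPI_Comm comm)
{
    int threads = 128, blocks = (count + threads - 1) / threads;

    pack_halo<<<blocks, threads>>>(d_f, d_sendbuf, d_send_index, count);  /* 1. pack on GPU    */
    cudaMemcpy(h_sendbuf, d_sendbuf, count * sizeof(double),
               cudaMemcpyDeviceToHost);                                   /* 2. GPU -> host    */

    MPI_Sendrecv(h_sendbuf, count, MPI_DOUBLE, rank_up,   0,              /* 3. host <-> host  */
                 h_recvbuf, count, MPI_DOUBLE, rank_down, 0,
                 comm, MPI_STATUS_IGNORE);

    cudaMemcpy(d_recvbuf, h_recvbuf, count * sizeof(double),
               cudaMemcpyHostToDevice);                                   /* 4. host -> GPU    */
    unpack_halo<<<blocks, threads>>>(d_f, d_recvbuf, d_recv_index, count);/* 5. unpack on GPU  */
    cudaDeviceSynchronize();
}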
•  Asynchronous CUDA functionality (streams) used to overlap different communication operations where possible (see the sketch below)
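A sketch of how streams can overlap the per-plane operations, reusing the pack_halo/unpack_halo kernels from the previous sketch. It assumes pinned host buffers (cudaMallocHost) and per-plane device buffers; all names are illustrative:

/* Overlap the per-plane packing and device<->host copies by giving each halo
 * plane its own CUDA stream; MPI exchange proceeds as each plane is staged. */
void halo_swap_async(double *d_f,
                     double *d_sendbuf[6], double *d_recvbuf[6],
                     double *h_sendbuf[6], double *h_recvbuf[6],   /* pinned host memory */
                     const int *d_index[6], const int count[6],
                     const int neigh_up[6], const int neigh_down[6],
                     MPI_Comm comm)
{
    cudaStream_t stream[6];
    int threads = 128;

    for (int p = 0; p < 6; p++) {
        cudaStreamCreate(&stream[p]);
        int blocks = (count[p] + threads - 1) / threads;
        pack_halo<<<blocks, threads, 0, stream[p]>>>(d_f, d_sendbuf[p], d_index[p], count[p]);
        cudaMemcpyAsync(h_sendbuf[p], d_sendbuf[p], count[p] * sizeof(double),
                        cudaMemcpyDeviceToHost, stream[p]);
    }

    for (int p = 0; p < 6; p++) {
        cudaStreamSynchronize(stream[p]);        /* plane p now staged on the host;  */
                                                 /* other planes' copies still run    */
        MPI_Sendrecv(h_sendbuf[p], count[p], MPI_DOUBLE, neigh_up[p],   p,
                     h_recvbuf[p], count[p], MPI_DOUBLE, neigh_down[p], p,
                     comm, MPI_STATUS_IGNORE);
        cudaMemcpyAsync(d_recvbuf[p], h_recvbuf[p], count[p] * sizeof(double),
                        cudaMemcpyHostToDevice, stream[p]);
        int blocks = (count[p] + threads - 1) / threads;
        unpack_halo<<<blocks, threads, 0, stream[p]>>>(d_f, d_recvbuf[p], d_index[p], count[p]);
    }

    for (int p = 0; p < 6; p++) {
        cudaStreamSynchronize(stream[p]);
        cudaStreamDestroy(stream[p]);
    }
}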
•  Cray XE6: "traditional" supercomputer
  - Each compute node contains 2 AMD Interlagos CPUs
•  Cray XK6: GPU accelerated version
  - Each node contains 1 CPU + 1 NVIDIA X2090 GPU
Cray XK6 Compute Node
•  XK6 compute node characteristics:
  - Host processor: AMD Series 6200 (Interlagos)
  - Accelerator: NVIDIA Tesla X2090
  - Host memory: 16 or 32 GB 1600 MHz DDR3
  - NVIDIA Tesla X2090 memory: 6 GB GDDR5
  - Gemini high-speed interconnect
  - Upgradeable to future GPUs
[Node diagram: GPU attached to the host over PCIe Gen2; host connected to the Gemini interconnect (X/Y/Z torus links) via HyperTransport 3]
Performance Results
•  The performance of the new GPU adaptation has been measured on:
  - Titan prototype (Cray / Oak Ridge National Laboratory)
  - Cray XK6
  - ~1000 compute nodes == ~1000 NVIDIA Tesla (Fermi) X2090 GPUs
    (number of nodes regularly changed due to hardware testing)
  - nodes connected via the Cray Gemini interconnect
  - Used 1 MPI task per node (1 per GPU)
•  and compared to the original CPU version run on:
  - HECToR (University of Edinburgh)
  - Cray XE6
  - 2816 compute nodes == 5632 AMD Interlagos 16-core CPUs == 90,112 cores
  - nodes connected via the Cray Gemini interconnect
  - Used 32 MPI tasks per node (1 per CPU core)
  - CPU version highly optimised, including full utilization of SIMD vector units
•  Theoretical peak capabilities:
  - AMD 16-core CPU (6276): 147 Gflop/s double precision, 36.5 GB/s memory bandwidth
  - NVIDIA X2090 GPU: 665 Gflop/s double precision, 177 GB/s memory bandwidth
•  We are comparing 2 CPUs (XE6 node) with 1 GPU (XK6 node)
  - noting that the GPU version does not do any significant computation on the host CPU
Ludwig Performance Results
[Chart: Ludwig binary fluid weak scaling. Performance (arbitrary units) versus number of nodes for the XE6 (AMD Interlagos CPUs) and XK6 (NVIDIA Tesla X2090 GPUs) versions.]
Ludwig Performance Results
[Chart: Ludwig binary fluid strong scaling. Performance (arbitrary units) versus number of nodes for lattice sizes 192³, 384³, 768³ and 1536³, on both the XE6 (CPU) and XK6 (GPU) systems.]
Ludwig Summary
•  The Ludwig LB code has been successfully adapted to use a large number of GPUs in parallel.
•  Both single-GPU and multi-GPU optimizations are important in harnessing the available performance capability.
•  The work has resulted in a software package able to scale excellently on traditional or GPU accelerated systems.
  - GPU version has the advantage and scales excellently to 936 GPUs
•  Performance advantage of the GPU version is 1.5x to 2x when comparing node for node
  - i.e. 1 GPU compared with 2 CPUs (all cores utilised)
•  Future work will include GPU enablement of advanced functionality.
  - Particle suspensions and liquid crystals of particular interest for new fast-switching LCD displays
GPU Directives
•  Language extensions, e.g. CUDA or OpenCL, allow programmers to interface with the GPU
  - This gives control to the programmer, but is often tricky and time consuming, and results in complex/non-portable code
•  An alternative approach is to allow the compiler to automatically accelerate code sections on the GPU (including decomposition, data transfer, etc.)
•  There must be a mechanism to provide the compiler with hints regarding which sections to accelerate, parallelism, data usage, etc.
•  Directives provide this mechanism (see the sketch after this slide)
  - Special syntax which is understood by accelerator compilers and ignored (treated as code comments) by non-accelerator compilers
  - The same source code can be compiled for a CPU/GPU combination or CPU only
  - c.f. OpenMP
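A minimal illustration of the directive approach, shown here in C (the deck's Himeno code is Fortran, but the idea is identical); the function name and arrays are illustrative:

/* Vector addition: the pragma asks an OpenACC compiler to offload the loop,
 * generate the GPU kernel, and move a, b and c to and from the device.
 * Compiled without accelerator support, the pragma is ignored and the
 * same loop simply runs on the CPU. */
void vec_add(int n, const double *a, const double *b, double *c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}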
OpenACC
•  OpenACC standard announced in November 2011
  - By CAPS, Cray, NVIDIA and PGI
•  Applies only to NVIDIA GPUs (so far)
•  www.openacc-standard.org
OpenACC Example
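A sketch of the kind of OpenACC-accelerated loop nest such an example typically shows, with an explicit data region so the arrays stay resident on the GPU across iterations; the function and array names are illustrative, not the code from the original slides:

#include <math.h>

/* Jacobi-style sweep: u and unew remain on the GPU for all iterations
 * thanks to the enclosing data region; only scalars cross the PCIe bus. */
void jacobi(int n, int iters, double *restrict u, double *restrict unew)
{
    #pragma acc data copy(u[0:n*n]) create(unew[0:n*n])
    for (int it = 0; it < iters; it++) {
        double err = 0.0;

        #pragma acc parallel loop collapse(2) reduction(max:err)
        for (int i = 1; i < n - 1; i++) {
            for (int j = 1; j < n - 1; j++) {
                unew[i*n + j] = 0.25 * (u[(i-1)*n + j] + u[(i+1)*n + j]
                                      + u[i*n + j - 1] + u[i*n + j + 1]);
                err = fmax(err, fabs(unew[i*n + j] - u[i*n + j]));
            }
        }
        /* err would drive a convergence test in real code */

        #pragma acc parallel loop collapse(2)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                u[i*n + j] = unew[i*n + j];
    }
}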
Himeno Benchmark
•  Case study using OpenACC in a parallel performance benchmark on a supercomputer containing GPU accelerators
•  The Himeno benchmark, written in Fortran, implements a domain-decomposed three-dimensional Poisson solver with a 19-point Laplacian stencil (see the sketch below)
•  We used a version of Himeno updated to include Fortran 2008 features, in particular Fortran coarrays (CAF), to exchange data between "images" on different processors
R. Himeno, The Himeno benchmark, http://accc.riken.jp/HPC_e/himenobmt_e.html
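For orientation, a C sketch of a generic 19-point stencil update of the kind Himeno performs: the centre point, its 6 face neighbours and its 12 edge neighbours (no corners). The real benchmark is Fortran and carries per-point coefficient arrays and a boundary mask; the names and weights here are illustrative:

#define IDX(i, j, k, nj, nk) (((i) * (nj) + (j)) * (nk) + (k))

/* One sweep of a generic 19-point Laplacian-style stencil, offloaded with OpenACC. */
void stencil19(int ni, int nj, int nk,
               const double *restrict p, double *restrict pnew,
               double w0, double wface, double wedge)
{
    #pragma acc parallel loop collapse(3) copyin(p[0:ni*nj*nk]) copy(pnew[0:ni*nj*nk])
    for (int i = 1; i < ni - 1; i++)
      for (int j = 1; j < nj - 1; j++)
        for (int k = 1; k < nk - 1; k++) {
            double face =                                   /* 6 face neighbours  */
                p[IDX(i-1,j,k,nj,nk)] + p[IDX(i+1,j,k,nj,nk)] +
                p[IDX(i,j-1,k,nj,nk)] + p[IDX(i,j+1,k,nj,nk)] +
                p[IDX(i,j,k-1,nj,nk)] + p[IDX(i,j,k+1,nj,nk)];
            double edge =                                   /* 12 edge neighbours */
                p[IDX(i-1,j-1,k,nj,nk)] + p[IDX(i-1,j+1,k,nj,nk)] +
                p[IDX(i+1,j-1,k,nj,nk)] + p[IDX(i+1,j+1,k,nj,nk)] +
                p[IDX(i-1,j,k-1,nj,nk)] + p[IDX(i-1,j,k+1,nj,nk)] +
                p[IDX(i+1,j,k-1,nj,nk)] + p[IDX(i+1,j,k+1,nj,nk)] +
                p[IDX(i,j-1,k-1,nj,nk)] + p[IDX(i,j-1,k+1,nj,nk)] +
                p[IDX(i,j+1,k-1,nj,nk)] + p[IDX(i,j+1,k+1,nj,nk)];
            pnew[IDX(i,j,k,nj,nk)] = w0 * p[IDX(i,j,k,nj,nk)]
                                   + wface * face + wedge * edge;
        }
}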
Asynchronous Implementation
•  The async clause allows overlapping of computation and communication (see the sketch below)
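A sketch of the pattern in C with hypothetical names: the interior update is launched without blocking (async), the boundary planes needed by neighbours are brought back to the host on a second queue for the coarray/MPI exchange, and the host only waits where it must. It assumes u and unew are already present on the device from an enclosing data region:

/* Overlap: interior update runs in async queue 1 while boundary planes are
 * copied to the host in queue 2 and exchanged with neighbouring images/ranks. */
void step_async(int n, double *restrict u, double *restrict unew)
{
    #pragma acc parallel loop async(1) present(u[0:n*n], unew[0:n*n])
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            unew[i*n + j] = 0.25 * (u[(i-1)*n + j] + u[(i+1)*n + j]
                                  + u[i*n + j - 1] + u[i*n + j + 1]);

    #pragma acc update host(u[0:n]) async(2)          /* low boundary plane  */
    #pragma acc update host(u[(n-1)*n:n]) async(2)    /* high boundary plane */
    #pragma acc wait(2)
    /* ... halo exchange with neighbours (coarray put/get or MPI) goes here ... */

    #pragma acc wait(1)   /* interior update finished; unew is now safe to use */
}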
Himeno Performance Results
Himeno Summary
•  The OpenACC directive-based GPU programming model offers productivity advantages over CUDA/OpenCL.
•  We used OpenACC to port the Himeno benchmark to the GPU architecture
  - Using Fortran coarrays to allow scaling to many nodes
•  Node for node, the OpenACC version of the code was just under twice as fast on the Cray XK6 compared to the optimized (hybrid MPI/OpenMP) version of the code running on a comparable CPU-based system.
•  The OpenACC async clause gave a measurable performance boost of between 5% and 10% in the Himeno code
  - a relatively naive benchmark; more realistic codes offer greater scope for overlapping data transfers
•  A. Gray, K. Stratford, A. Hart, A. Richardson, Lattice Boltzmann for massively-parallel GPU and SIMD architectures, submitted to the Supercomputing 2012 conference.
•  A. Hart, R. Ansaloni, A. Gray, Porting and scaling OpenACC applications on massively-parallel, GPU-accelerated supercomputers, to appear in a European Physical Journal Special Topics issue.