
Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures

E. Calore, S. F. Schifano, R. Tripiccione
Enrico Calore, INFN Ferrara, Italy

Perspectives of GPU Computing in Physics and Astrophysics
September 17, 2014, Rome, Italy


Outline

1 Introduction
    Hardware trends
    Software tools
2 The Lattice Boltzmann Method at a glance
    The D2Q37 model
    Propagate
    Boundary Conditions
    Collide
3 Implementation details
4 Results and Conclusion


GPU and MIC performance is growing

[Figure: peak-performance trends of GPUs, MICs and CPUs over time. Courtesy of Dr. Karl Rupp, Technische Universität Wien.]


Accelerator use in HPC is growing

[Figure: accelerator architectures in the Top500 supercomputers.]


Software tools

OpenCL (Open Computing Language):
- the same code can be run on CPUs, GPUs, MICs, etc.
- functions to be offloaded to the accelerator have to be explicitly programmed (as in CUDA)
- data movements between host and accelerator have to be explicitly programmed (as in CUDA)
- NVIDIA does not actively support it anymore

OpenACC (Open Accelerators):
- the same code (will probably) run on CPUs, GPUs, MICs, etc.
- functions to be offloaded are "annotated" with #pragma directives
- data movements between host and accelerator can be managed automatically or manually
- support is still limited, but seems to be growing quickly


The D2Q37 Lattice Boltzmann Model

The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods:
- it simulates a synthetic dynamics described by the discrete Boltzmann equation, instead of the Navier-Stokes equations
- a set of virtual particles called populations is arranged at the edges of a discrete, regular grid
- the populations interact by propagation and collision and reproduce, after appropriate averaging, the dynamics of fluids

D2Q37 is a 2D model with 37 velocity components (populations):
- suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion (1) effects
- correct treatment of the Navier-Stokes, heat transport and perfect-gas (P = ρT) equations

(1) chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products
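In an LBM the macroscopic fields are recovered as moments of the populations stored at each site; a minimal hedged sketch for a 37-population cell (NPOP, f, cx, density and momentum_x are illustrative names, not identifiers from the actual D2Q37 code):

```c
/* Hypothetical sketch: macroscopic fields as moments of the 37
 * populations stored at one lattice site.  All names here are
 * illustrative, not taken from the real D2Q37 implementation. */
enum { NPOP = 37 };

/* Density: zeroth moment (plain sum) of the populations. */
double density(const double f[NPOP]) {
    double rho = 0.0;
    for (int p = 0; p < NPOP; p++)
        rho += f[p];
    return rho;
}

/* x-momentum: first moment against the x-components cx[] of the
 * discrete velocity set. */
double momentum_x(const double f[NPOP], const double cx[NPOP]) {
    double jx = 0.0;
    for (int p = 0; p < NPOP; p++)
        jx += cx[p] * f[p];
    return jx;
}
```

Temperature follows analogously from a second moment, and the perfect-gas relation P = ρT then links these fields.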
Computational Scheme of LBM

    foreach time-step
        foreach lattice-point
            propagate();
        endfor
        foreach lattice-point
            collide();
        endfor
    endfor

Embarrassing parallelism: all sites can be processed in parallel, applying propagate and collide in sequence.

Challenge: design an efficient implementation able to exploit a large fraction of the available peak performance.


D2Q37: propagation scheme

- performs accesses to neighbour cells at distance 1, 2 and 3
- generates memory accesses with sparse addressing patterns


D2Q37: boundary conditions

After propagation, boundary conditions are enforced at the top and bottom edges of the lattice (a 2D lattice with periodic boundaries along the X direction):
- at the top and bottom, some values are adjusted at sites y = 0 ... 2 and y = Ny − 3 ... Ny − 1
- e.g. the vertical velocity is set to zero

At the left and right edges we apply periodic boundary conditions.
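The periodic boundary along X amounts to an index wrap; a minimal sketch of the underlying arithmetic (LX and periodic_x are illustrative names; a real implementation may equally well keep halo columns refreshed once per time step):

```c
/* Hypothetical sketch of a periodic boundary along the X direction:
 * any x coordinate, including negative ones, is wrapped back into
 * [0, LX).  LX and periodic_x are illustrative names. */
enum { LX = 1920 };

int periodic_x(int x) {
    /* The extra "+ LX" makes the result correct for negative x,
     * since C's % can return negative values. */
    return ((x % LX) + LX) % LX;
}
```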
D2Q37 collision

- collision is computed at each lattice cell, after the computation of the boundary conditions
- computationally intensive: for the D2Q37 model it requires ≈ 7500 double-precision floating-point operations
- completely local: the arithmetic operations require only the populations associated to the site

The propagate and collide kernels are kept separate: after propagate but before collide we may need to perform collective operations (e.g. the divergence of the velocity field) if we include combustion effects.


Grid and Memory Layout

One-dimensional array of NTHREADS threads, each thread processing one lattice site.

    Ly = α × Nwi, α ∈ N;    (Ly × Lx) / Nwi = Nwg

Data stored as Structure-of-Arrays (SoA).


OpenCL Propagate device function

__kernel void prop( __global const data_t *prv, __global data_t *nxt ) {
    int ix,       // Work-item index along the X dimension.
        iy,       // Work-item index along the Y dimension.
        site_i;   // Index of current site.

    // Set the work-item indices (Y is used as the fastest dimension).
    ix = (int) get_global_id(1);
    iy = (int) get_global_id(0);
    site_i = (HX+3+ix) * NY + (HY+iy);

    nxt[          site_i] = prv[          site_i - 3*NY + 1];
    nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
    nxt[2*NX*NY + site_i] = prv[2*NX*NY + site_i - 3*NY - 1];
    nxt[3*NX*NY + site_i] = prv[3*NX*NY + site_i - 2*NY + 2];
    nxt[4*NX*NY + site_i] = prv[4*NX*NY + site_i - 2*NY + 1];
    nxt[5*NX*NY + site_i] = prv[5*NX*NY + site_i - 2*NY    ];
    nxt[6*NX*NY + site_i] = prv[6*NX*NY + site_i - 2*NY - 1];
    ...


OpenACC Propagate function

inline void propagate( const __restrict data_t * const prv,
                             __restrict data_t * const nxt ) {
    int ix, iy, site_i;
    #pragma acc kernels present(prv) present(nxt)
    #pragma acc loop independent gang
    for ( ix = HX; ix < (HX+SIZEX); ix++ ) {
        #pragma acc loop independent vector(BLKSIZE)
        for ( iy = HY; iy < (HY+SIZEY); iy++ ) {
            site_i = (ix*NY) + iy;
            nxt[          site_i] = prv[          site_i - 3*NY + 1];
            nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
            nxt[2*NX*NY + site_i] = prv[2*NX*NY + site_i - 3*NY - 1];
            nxt[3*NX*NY + site_i] = prv[3*NX*NY + site_i - 2*NY + 2];
            nxt[4*NX*NY + site_i] = prv[4*NX*NY + site_i - 2*NY + 1];
            nxt[5*NX*NY + site_i] = prv[5*NX*NY + site_i - 2*NY    ];
            nxt[6*NX*NY + site_i] = prv[6*NX*NY + site_i - 2*NY - 1];
            ...
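Both propagate versions above address population p of site (ix, iy) with the Structure-of-Arrays offset p*NX*NY + ix*NY + iy, so that consecutive work-items along Y touch consecutive memory locations. A small self-contained sketch of that index computation (the tiny NX, NY values and the soa_index helper are illustrative):

```c
/* Hypothetical sketch of the SoA indexing used by the propagate
 * kernels: each population plane of NX*NY sites is contiguous,
 * and Y is the fastest-varying dimension. */
enum { NX = 8, NY = 16 };   /* illustrative toy lattice */

int soa_index(int p, int ix, int iy) {
    return p * NX * NY + ix * NY + iy;
}
```

With this layout the stencil offsets in the kernels (multiples of NY for X neighbours, ±1 and ±2 for Y neighbours) are the same for every population plane.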
Hardware used: the Eurora prototype

Eurora (Eurotech and Cineca):
- hot-water cooling system
- delivers 3,209 MFLOPS per Watt of sustained performance
- 1st in the Green500 list of June 2013

Computing nodes: 64
Processor types: Intel Xeon E5-2658 @ 2.10 GHz, Intel Xeon E5-2687W @ 3.10 GHz
Accelerator types: Intel Xeon Phi 5120D (MIC), NVIDIA Tesla K20x (GPU)


OpenCL work-group size selection for Propagate (Xeon Phi)

[Figure: performance of propagate as a function of the number of work-items Nwi per work-group and of the number of work-groups Nwg.]


OpenCL work-group size selection for Collide (Xeon Phi)

[Figure: performance of collide as a function of the number of work-items Nwi per work-group and of the number of work-groups Nwg.]


Run time on 2 x GPU (NVIDIA K20s)

[Figure: run time per iteration (msec) of Propagate, BC and Collide for the CUDA, OpenCL and OpenACC implementations.]


Run time on 2 x MIC (Intel Xeon Phi)

[Figure: run time per iteration (msec) of Propagate, BC and Collide for the C and OpenCL implementations.]


Propagate and Collide run times (1920 × 2048 lattice)

[Figure: run time per iteration (msec) of Propagate on MIC, GPU and CPUs for the C, optimized C, CUDA, OpenCL and OpenACC implementations.]

[Figure: run time per iteration (msec) of Collide on MIC, GPU and CPUs for the C, optimized C, CUDA, OpenCL and OpenACC implementations.]
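The per-iteration run times in the plots can be obtained with a plain wall-clock harness; a hedged sketch (elapsed_msec is an illustrative helper, not part of the benchmark code; on accelerators the device must be synchronized before taking the second timestamp):

```c
#include <time.h>

/* Hypothetical sketch: wall-clock milliseconds between two
 * CLOCK_MONOTONIC timestamps, used to report msec per iteration. */
double elapsed_msec(struct timespec start, struct timespec end) {
    return (end.tv_sec  - start.tv_sec)  * 1e3
         + (end.tv_nsec - start.tv_nsec) * 1e-6;
}
```

Bracketing N iterations of a kernel with clock_gettime(CLOCK_MONOTONIC, &t) and dividing the elapsed time by N gives per-iteration figures of the kind plotted above.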
Scalability on Eurora nodes (OpenCL code)

[Figure: weak-scaling regime, lattice size 256 × 8192 × No_devices; strong-scaling regime, lattice size 1024 × 8192.]


Simulation of the Rayleigh-Taylor (RT) Instability

Instability at the interface of two fluids of different densities, triggered by gravity: a cold, dense fluid lying over a less dense, warmer fluid triggers an instability that mixes the two fluid regions (until equilibrium is reached).


Conclusions

1 We have presented an OpenCL and an OpenACC implementation of a fluid-dynamics simulation based on Lattice Boltzmann methods.
2 Code portability: they have been successfully ported to and run on several computing architectures, including CPU, GPU and MIC systems.
3 Performance portability: results are at a similar level to those of codes written in more "native" programming frameworks, such as CUDA or C.
4 OpenCL: easily portable across several architectures while preserving performance; but not all vendors are today committed to supporting this standard.
5 OpenACC: easily usable with little coding effort; but compilers are not yet available for all architectures.


Acknowledgments

Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi (University of Tor Vergata and INFN Roma, Italy); Andrea Scagliarini (University of Barcelona, Spain); Filippo Mantovani (BSC, Spain); Enrico Calore, Sebastiano Fabio Schifano, Raffaele Tripiccione (University and INFN of Ferrara, Italy); Federico Toschi (Eindhoven University of Technology, The Netherlands, and CNR-IAC, Roma, Italy).

This work has been performed in the framework of the INFN COKA and SUMA projects. We would like to thank the CINECA (Italy) and JSC (Germany) institutes for access to their systems.
Thanks for your attention