ACCELERATED COMPOSITE DISTRIBUTION FUNCTION METHODS FOR COMPUTATIONAL FLUID DYNAMICS USING GPU
Prof. Matthew Smith, Yen-Chih Chen
Mechanical Engineering, NCKU

MOTIVATION
• The Finite Volume Method (FVM) has become a popular method for the solution of Computational Fluid Dynamics (CFD) problems.
• The key to the Finite Volume Method is the computation of fluxes across cell surfaces in the evolution of an integral form of the conservation equations:

  \frac{\partial}{\partial t} \int_V U \, dV + \oint_S \mathbf{F} \cdot d\mathbf{S} = 0

  where F are the fluxes across the surface S of our control volume and U are our conserved quantities.
• By introducing the average value of U over the cell volume,

  \bar{U}_i = \frac{1}{V_i} \int_{V_i} U \, dV,

  we can reformulate the governing equations as an evolution equation for the cell averages driven by the surface fluxes.
• For a simple 1D problem on a regularly spaced grid, we might discretize this in the explicit form:

  U_i^{n+1} = U_i^n - \frac{\Delta t}{\Delta x} \left( F_{i+1/2} - F_{i-1/2} \right)

• The fundamental problem is the computation of the fluxes across each cell surface, in this case F_{i-1/2} and F_{i+1/2}.
• There are a large number of approaches available for the computation of these fluxes:
  • Exact or approximate Riemann solvers,
  • Integral balance techniques,
  • Algebraic techniques, etc.
• Regardless of the method, there are several challenges to the use of the FVM in real-life problems:
  • The scheme must possess a suitably small amount of numerical (artificial) dissipation such that the physical dissipation present (if any) is correctly captured.
  • The solution itself must lend itself to the application – in many real-life multi-scale applications, the scale of the problem is very large. Hence (i) the dissipative qualities of the solver must be understood, and (ii) the computational cost cannot be so great as to prohibit practical application.

PARALLEL AND GPU COMPUTING
• Due to these restrictions, many real-life applications require parallel computing to obtain a solution in a realistic amount of time.
• Nowadays, there are a large number of architecture options available for researchers and engineers who need to perform parallel computation, together with the interconnects for connecting these devices.
• My current research employs all of these devices and implementations (e.g. OpenMPI, OpenMP).
• Today's talk will be (mostly) relevant to the GPU aspect.
• The GPU device does provide an attractive architecture for the parallel computation of FVM CFD applications.
• Its (now semi-)unique architecture does, however, place additional "restraints" on our Finite Volume Method approach – I won't cover these restraints here in the hope that someone with more time will talk about them.
• We will revisit this idea, however, in upcoming slides.

VECTOR SPLIT FVM COMPUTATION
• One particular approach which is very well suited to GPU computation is the vector splitting of the fluxes at cell surfaces, for example:

  F_{i+1/2} = F^{+}(U_i) + F^{-}(U_{i+1})

• The high degree of locality of these schemes results in an increased capacity to take advantage of vectorization – this is good for every parallel architecture (a sketch of such an update follows this list).
• A very large family of these methods exists:
  • Mach number split solvers (e.g. van Leer, (S)HLL, AUSM to name a few),
  • Kinetic theory based splitting (EFM / KFVS, EIM, TDEFM, UEFM to name a few).
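To make the locality argument concrete, here is a minimal sketch of the explicit split-flux update above as a CUDA kernel, launched with one thread per cell. It is illustrative only: a simple Rusanov/Lax-Friedrichs-type splitting, F± = ½(F ± αU), stands in for the kinetic-theory splittings named above, and the struct, function and kernel names are placeholders rather than anything from the production code behind this talk.

```cuda
// Sketch: 1D explicit split-flux FVM update U_i^{n+1} = U_i^n - dt/dx*(F_{i+1/2} - F_{i-1/2}),
// with F_{i+1/2} = F+(U_i) + F-(U_{i+1}).  Each cell only touches itself and its neighbours.
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

#define GAMMA 1.4

struct State { double rho, mom, E; };            // conserved quantities U = [rho, rho*u, E]

__device__ State physFlux(const State &U)        // 1D Euler flux F(U)
{
    double u = U.mom / U.rho;
    double p = (GAMMA - 1.0) * (U.E - 0.5 * U.rho * u * u);
    return { U.mom, U.mom * u + p, (U.E + p) * u };
}

// Rusanov/Lax-Friedrichs-type splitting: F+/- = 0.5*(F(U) +/- a*U), a = |u| + c of this state.
__device__ State splitFlux(const State &U, double sign)
{
    double u = U.mom / U.rho;
    double p = (GAMMA - 1.0) * (U.E - 0.5 * U.rho * u * u);
    double a = fabs(u) + sqrt(GAMMA * p / U.rho);
    State F = physFlux(U);
    return { 0.5 * (F.rho + sign * a * U.rho),
             0.5 * (F.mom + sign * a * U.mom),
             0.5 * (F.E   + sign * a * U.E) };
}

__global__ void explicitUpdate(const State *Uold, State *Unew, int N, double dt, double dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= N - 1) return;             // boundary cells handled elsewhere

    // Surface fluxes built only from the two cells touching each surface -> high locality.
    State FLp = splitFlux(Uold[i - 1], +1.0), FLm = splitFlux(Uold[i],     -1.0);   // at i-1/2
    State FRp = splitFlux(Uold[i],     +1.0), FRm = splitFlux(Uold[i + 1], -1.0);   // at i+1/2

    double c = dt / dx;
    Unew[i].rho = Uold[i].rho - c * ((FRp.rho + FRm.rho) - (FLp.rho + FLm.rho));
    Unew[i].mom = Uold[i].mom - c * ((FRp.mom + FRm.mom) - (FLp.mom + FLm.mom));
    Unew[i].E   = Uold[i].E   - c * ((FRp.E   + FRm.E)   - (FLp.E   + FLm.E));
}

int main()
{
    const int N = 1000;
    std::vector<State> h(N, State{1.0, 0.0, 2.5});          // uniform initial state
    State *dU0, *dU1;
    cudaMalloc((void**)&dU0, N * sizeof(State));
    cudaMalloc((void**)&dU1, N * sizeof(State));
    cudaMemcpy(dU0, h.data(), N * sizeof(State), cudaMemcpyHostToDevice);

    explicitUpdate<<<(N + 127) / 128, 128>>>(dU0, dU1, N, 1.0e-6, 1.0e-3);
    cudaDeviceSynchronize();                                 // boundary cells left untouched here

    cudaFree(dU0); cudaFree(dU1);
    return 0;
}
```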
VECTOR SPLIT FVM COMPUTATION
• The flux across a surface can be computed by taking moments of the molecular velocity probability distribution function f:

  F = \rho \int_{-\infty}^{\infty} v' \, Q \, f(v') \, dv'

  where Q is the vector of particulate conserved quantities:

  Q = \left[ 1, \; v, \; \tfrac{1}{2} v^2 + E_{in} \right]^T

• In 1980, Dale Pullin introduced the Equilibrium Flux Method (EFM), which evaluates this integral over each half of velocity space using the equilibrium (Maxwell-Boltzmann) distribution f_0:

  F^{+} = \rho \int_{0}^{\infty} v' \, Q \, f_0(v') \, dv', \qquad
  F^{-} = \rho \int_{-\infty}^{0} v' \, Q \, f_0(v') \, dv'

• The EFM method represents the flux of conserved quantity carried by particles travelling in molecular free flight from source cells L and R in the limit of zero Δt (Smith, JCP, 2008).
• The resulting fluxes can be written in closed form as split forward and reverse contributions, F = F^{+}(U_L) + F^{-}(U_R).
• There are two major challenges associated with such an approach – albeit specific to EFM:
  • The free molecular flight assumption prohibits the interaction of particles en route, and hence the formation of any intermediate states. Hence, as a 1st order flux, the method is too dissipative.
  • The fluxes are computationally expensive – the closed-form expressions contain several costly function evaluations (exponentials and error functions).
• So how might we deal with these issues to create a solver which will operate efficiently on the GPU device?

QDS
• The QDS (Quiet Direct Simulation) method, once known as the Quiet Direct Simulation Monte Carlo (QDSMC), attempts to avoid one of these problems.
• The flux integrals are replaced by numerically computed fluxes using Gaussian quadrature:

  F^{+}_{QDS} = \int_{0}^{\infty} v' \, Q \, f(v') \, dv' \approx \sum_{j=1}^{N} w_j \, v_j \, Q_j \, H(v_j)

  F^{-}_{QDS} = \int_{-\infty}^{0} v' \, Q \, f(v') \, dv' \approx \sum_{j=1}^{N} w_j \, v_j \, Q_j \, H(-v_j)

  where H is the Heaviside step function and (v_j, w_j) are the quadrature abscissae and weights.
• This approach is commonplace in many kinetic-theory based solvers, and avoids the complications associated with analytical completion of the integral.
• Together with Fang-An Kuo (who will speak later in the week) and his supervisor (Prof. J.-S. Wu), we extended the method to higher order accuracy and applied it to multiple-GPU computation.

NUMERICAL DISSIPATION
• However, despite the success of the QDS method, there were still several issues:
  • A finite number of discrete "ordinates" (velocities) resulted in an error in the thermal (diffusive) flux, causing problems in regions where this is important.
  • The basic numerical dissipation present in EFM, while modified as a result of the velocity discretization, is still very much present.
• A good starting point might be to quantify the numerical dissipation present in the EFM method.
• A flux commonly used in FVM CFD for over 50 years is the "Rusanov flux":

  F_{i+1/2} = \frac{1}{2} \left( F_L + F_R \right) - \frac{\alpha}{2} \left( U_R - U_L \right)

  where α (in this case) is a characteristic speed associated with the system.
• One attractive feature of this form is that – through direct discretization of the governing equations, re-arrangement and substitution – we can show that we are effectively solving the governing equations with an added diffusion term whose coefficient scales with αΔx/2 (a code sketch of this flux follows).
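As a concrete reference point, here is a minimal sketch of the Rusanov surface flux in the form above. It is illustrative only: the Euler-flux helper and the choice of α as the larger of the two local wave speeds |u| + c are my assumptions, not taken from the slides.

```cuda
// Sketch: Rusanov flux F_{i+1/2} = 0.5*(F_L + F_R) - 0.5*alpha*(U_R - U_L).
// The second term is the numerical dissipation controlled by the characteristic speed alpha.
#include <cuda_runtime.h>
#include <math.h>
#include <cstdio>

#define GAMMA 1.4

struct State { double rho, mom, E; };            // also used to hold a flux triple below

__host__ __device__ State eulerFlux(const State &U)
{
    double u = U.mom / U.rho;
    double p = (GAMMA - 1.0) * (U.E - 0.5 * U.rho * u * u);
    return { U.mom, U.mom * u + p, (U.E + p) * u };
}

__host__ __device__ State rusanovFlux(const State &UL, const State &UR)
{
    State FL = eulerFlux(UL), FR = eulerFlux(UR);

    // alpha: largest local wave speed |u| + c of the two states (standard Rusanov choice).
    double uL = UL.mom / UL.rho, uR = UR.mom / UR.rho;
    double pL = (GAMMA - 1.0) * (UL.E - 0.5 * UL.rho * uL * uL);
    double pR = (GAMMA - 1.0) * (UR.E - 0.5 * UR.rho * uR * uR);
    double alpha = fmax(fabs(uL) + sqrt(GAMMA * pL / UL.rho),
                        fabs(uR) + sqrt(GAMMA * pR / UR.rho));

    // Central flux plus dissipation ~ alpha*(U_R - U_L); in the modified equation this
    // acts like a diffusion term with coefficient on the order of alpha*dx/2.
    return { 0.5 * (FL.rho + FR.rho) - 0.5 * alpha * (UR.rho - UL.rho),
             0.5 * (FL.mom + FR.mom) - 0.5 * alpha * (UR.mom - UL.mom),
             0.5 * (FL.E   + FR.E)   - 0.5 * alpha * (UR.E   - UL.E) };
}

int main()
{
    State UL = {1.0,   0.0, 2.5 };                // left state  (rho, rho*u, E)
    State UR = {0.125, 0.0, 0.25};                // right state (Sod-like jump)
    State F  = rusanovFlux(UL, UR);
    printf("F = [%g, %g, %g]\n", F.rho, F.mom, F.E);
    return 0;
}
```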
NUMERICAL DISSIPATION
• So, one (naïve) strategy might be to re-arrange the summed EFM forward and reverse fluxes into the same form,

  F_{EFM} = \frac{1}{2} \left( F_L + F_R \right) - \frac{\alpha}{2} \left( U_R - U_L \right),

  and solve for our characteristic speed / diffusion coefficient α.
• Since these diffusion coefficients are well known for a large variety of Finite Volume flux solvers, this would allow us to:
  • Estimate – comparatively – just how numerically dissipative EFM is.
  • Perhaps find a way to reduce the inherent numerical dissipation through manipulation of this term.
• For the mass conservation equation, by assuming that (i) the speed ratio and (ii) the bulk velocity do not change significantly between L and R, we can solve for the equivalent characteristic speed and its associated diffusion coefficient.
• For the momentum conservation equation, using similar assumptions, we obtain the corresponding coefficient.
• We'll save the energy coefficient derivation for another time.
• We've determined that the numerical dissipation of EFM is:
  • Closely coupled with Δx,
  • A strong function of the Mach number.
• We can use this result as a benchmark for the next step.
• Instead of using a finite set of discrete velocities for the approximation of the integral itself, we can use a finite set of continuous distribution functions which approximate our original distribution function, and add the resulting discrete fluxes.

UEFM AND TEFM
• The concept is rather simple: replace our original continuous distribution function with a series of simplified probability distribution functions,

  f(v) \approx \sum_{i=1}^{N} w_i \, f(a_i)

  where w_i is a weighting function and f(a_i) is a (simple) probability distribution function with characteristic velocity a_i.
• By using a simpler distribution function, we can avoid the complicated functions associated with traditional kinetic-theory based schemes.
• This research will focus on two possible options:
  • A sum of uniform distributions (UEFM), and
  • A sum of linear (triangular) distributions (TEFM).
• The fluxes are the combined result of the weighted fluxes resulting from the half-range integrals over each sub-distribution.
• However, we need to ensure that the correct values of a are used, otherwise the effective distribution may contain incorrect amounts of energy or momentum.
• Rule No. 1: Equivalency of Mass – the resulting equivalent distribution function must be a valid distribution:

  \int \sum_{i=1}^{N} w_i \, f(a_i) \, dv = 1

• Rule No. 2: Equivalency of Momentum – the resulting distribution must contain the same momentum:

  \int v \sum_{i=1}^{N} w_i \, f(a_i) \, dv = \bar{v}

• Rule No. 3: Equivalency of Energy – the resulting distribution must contain the same thermal energy.
• These rules help to guide us in our selection of w (weights) and a (characteristic thermal velocities).
• The resulting flux expressions are easily obtained, though different reference frames must be used:
  • TEFM – Triangular Distribution Equilibrium Flux Method: integral in the laboratory reference frame.
  • UEFM – Uniform Distribution Equilibrium Flux Method: integral in the moving gas reference frame.
• The resulting flux expressions are as we expected – free of the complex exp() and erf() functions. The 1D forward fluxes take the form (shown here for the uniform distribution; a code sketch of this evaluation follows below):

  F_i^{+} = w_i \begin{bmatrix} \dfrac{(a_i + \bar{v})^2}{4 a_i} \\[2mm] \dfrac{(a_i + \bar{v})^3}{6 a_i} \\[2mm] \dfrac{(a_i + \bar{v})^2 \left( 4 E_{in} + (a_i + \bar{v})^2 \right)}{16 a_i} \end{bmatrix}

• Simulation results from both methods show that the previous thermal diffusion problem associated with discrete velocity methods has disappeared.
• The average performance increase for a dual-distribution approximation to the equilibrium distribution function is approx. 8% (on the CPU) for both methods.
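Below is a minimal sketch of how the forward flux above might be evaluated as a weighted sum over uniform sub-distributions. It assumes each sub-distribution is 1/(2a_i) on [v̄ - a_i, v̄ + a_i]; the subsonic branch reproduces the expression above, the integration limits are clamped so fully supersonic sub-distributions are also handled, and the weights/abscissae in the usage example are placeholders rather than the values obtained from the equivalency rules.

```cuda
// Sketch: UEFM-style forward flux as a weighted sum over uniform sub-distributions.
#include <cuda_runtime.h>
#include <math.h>
#include <cstdio>

struct Flux { double mass, mom, energy; };

__host__ __device__ Flux uefmForwardFlux(const double *w, const double *a, int N,
                                         double rho, double vbar, double Ein)
{
    Flux F = {0.0, 0.0, 0.0};
    for (int i = 0; i < N; ++i) {
        // Only the part of the sub-distribution with v' > 0 crosses the surface forwards.
        double hi  = fmax(0.0, vbar + a[i]);
        double lo  = fmax(0.0, vbar - a[i]);
        double inv = 1.0 / (2.0 * a[i]);               // uniform PDF height

        double m  = 0.5         * inv * (hi * hi - lo * lo);               // int v'   f_i dv'
        double mo = (1.0 / 3.0) * inv * (hi * hi * hi - lo * lo * lo);     // int v'^2 f_i dv'
        double en = 0.125 * inv * (hi * hi * hi * hi - lo * lo * lo * lo)  // int v'(v'^2/2) f_i dv'
                  + Ein * m;                                               // + E_in * (mass part)

        F.mass   += w[i] * rho * m;                    // rho: equilibrium density of the source cell
        F.mom    += w[i] * rho * mo;
        F.energy += w[i] * rho * en;                   // Ein: specific internal energy
    }
    return F;
}

int main()
{
    // Placeholder dual-distribution parameters (NOT the equivalency-rule values):
    const int N = 2;
    double w[N] = {0.5, 0.5};
    double a[N] = {200.0, 600.0};                      // characteristic thermal speeds [m/s]
    Flux F = uefmForwardFlux(w, a, N, 1.0, 100.0, 2.0e5);
    printf("F+ = [%g, %g, %g]\n", F.mass, F.mom, F.energy);
    return 0;
}
```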
UEFM AND TEFM
• Ratio of overall computation time, relative to EFM, for a two-step configuration (lower is faster):

            Test A    Test B    Test C    Average
  EFM       0.9935    1.0091    0.9974    1.0
  UEFM      0.9054    0.9199    0.9221    0.9158
  TEFM      0.9285    0.9245    0.9248    0.9261

  Average percentage increase in performance: 8.42% for the UEFM solver, 7.39% for the TEFM solver.
• Despite the meagre performance increase on the CPU, one might expect a larger increase in performance when executed on the GPU.
• 1D shock tube results – UEFM:
  • The kink present in the QDS result has disappeared (as per our goal).
  • However, there is additional numerical dissipation present, especially at the contact surface.
• 1D shock tube results – TEFM:
  • Also no kink or other rarefaction effects present.
  • It is clear that the triangular distribution approximation produces a scheme with numerical dissipation qualities closer to EFM.
• What are the diffusion coefficients for these schemes?

DISSIPATION OF UEFM AND TEFM
• The same approach used for EFM can be applied here – we just need to re-arrange the summed forward and reverse fluxes and solve for the characteristic speed.
• For UEFM (mass flux), the resulting coefficient shows that the UEFM method is guaranteed to be more diffusive than the EFM method, especially in high speed regions!
• For TEFM (mass flux), the presence of a negative M³ term means that (i) we will have lower numerical dissipation, and (ii) this scheme could potentially become unstable at higher Mach numbers.
• Without any modification, the UEFM solver has a lower dissipation than EFM.
• This is more than likely due to the reduced tail of the velocity probability distribution function.
• But do we have stability at higher Mach numbers?
• We can test this out with a high speed shock-bubble interaction.
• We can employ a higher order scheme through expansion of each flux component.
• To maintain stability, we have used the MC flux limiter.
• Resolution of fine structures is possible at relatively modest resolutions (here, number of cells = 0.8 million).
• To perform this simulation using QDS, we would need an even number of discrete velocities representing the distribution function.
• This results in too much numerical dissipation – hence, the QDS approach (and indeed, EFM) cannot produce this result.
• This is what we call a convenient fluke.

GPU PERFORMANCE (STRUCTURED)
• Over the last few years, we've spent most of our efforts making CFD methods work quickly on Cartesian grids.
• Performance: Kepler GPU vs. Intel CPU (single core, no vectorization):

  Device    Test 1 (s)    Test 2 (s)    Test 3 (s)    Average time (s)    Average speed-up
  CPU       75.41         75.41         75.38         75.4                1.0x
  GPU        0.26          0.26          0.26          0.26               290x

• PS: This code has also been developed to make efficient use of AVX intrinsic functions, which help level the playing field – today I talk about the GPU.

APPLICATION
• Unfortunately, real-life industrial work is often strongly multi-scale and requires the use of unstructured computational grids.
• Sample application: 3D transient MOCVD reactor simulation.
• Due to:
  • The large volume of work resulting from industrial collaboration, and
  • The extensive time required to create the mesh,
  GPU devices are used to create the unstructured grids.

MESH CREATION
[Flowchart: mesh creation pipeline – loops over the STL faces and refinement passes launch the Calc_Neighbour, Calc_Intercept and Split_Cell GPU kernels (128 threads per block, (N + TPB – 1) / TPB blocks), with an OpenGL mesh inspection program running on the CPU.]
• GPU acceleration is required since many STL files have > 1M faces.
• We created an OpenGL Mesh Inspection program to ensure the grid is suitable.
• The program can generate:
  • Unstructured triangular grids,
  • Unstructured rectangular grids,
  • Structured grids.
• Typical times: CPU time ~ 3 hours, GPU time ~ 15 mins (the kernel launch configuration is sketched below).
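For reference, a minimal sketch of the launch configuration noted in the flowchart: 128 threads per block and (N + TPB - 1) / TPB blocks, so that every one of the N faces receives a thread. The Calc_Intercept kernel body and its signature here are placeholders; only the launch arithmetic is taken from the slides.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in kernel: one thread per STL face (signature and body are illustrative only).
__global__ void Calc_Intercept(const float *faceData, int nFaces, int *flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nFaces) return;                  // guard: the last block may be partially full
    // ... intersection test for face i would go here ...
    flags[i] = 0;
}

int main()
{
    const int N      = 1 << 20;               // e.g. ~1M STL faces
    const int TPB    = 128;                   // threads per block, as noted on the flowchart
    const int blocks = (N + TPB - 1) / TPB;   // ceiling division so all N faces are covered

    float *d_faces = nullptr;
    int   *d_flags = nullptr;
    cudaMalloc((void**)&d_faces, N * sizeof(float));
    cudaMalloc((void**)&d_flags, N * sizeof(int));

    Calc_Intercept<<<blocks, TPB>>>(d_faces, N, d_flags);
    cudaDeviceSynchronize();
    printf("Launched %d blocks of %d threads for %d faces\n", blocks, TPB, N);

    cudaFree(d_faces);
    cudaFree(d_flags);
    return 0;
}
```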
MESH CREATION
• This work focuses on the use of unstructured, adaptive rectangular cells for use with direction decoupled (volume-to-volume) fluxes in addition to traditional fluxes.
• We use rectangular cells since this simplifies the mathematics used to generate the flux expressions.
• Hanging nodes are supported.
[Figure: 2D representation of a possible cell and neighbour arrangement.]

SOLVER
• The solver is a true direction solver – this enables the solver to better handle flows unaligned with the computational grid.
• Current solver options:
  • TDEFM (slow),
  • QDS (Quiet Direct Simulation) (N² V-QDS),
  • UEFM (Uniform Equilibrium Flux Method),
  • TEFM (Triangular Equilibrium Flux Method)*,
  • Modified SHLL (Split Harten-Lax-van Leer).
[Figure: 2D representation of a possible cell and neighbour arrangement.]
[Flowchart: solver main loop – each time step launches the Split_Flux() true direction flux kernel, the Exchange_Fluxes() flux exchange kernel and the State_Calc() primitives and chemistry kernel on the GPU (128 threads per block, (N + TPB – 1) / TPB blocks), with an optional rendering/display pass for visualization managed from the CPU.]

PARALLEL PERFORMANCE (UNSTRUCTURED)
• Performance on various single GPU devices (1.7 million cells, effective resolution = 131 million cells, 100,000 time steps):

  Device                                     Time              Speed-up
  Intel Xeon E5-2670 (1 core, SSE, -O3)      ~ 84 hours        1x
  Nvidia Tesla C2075 (Intel E5-2670 host)    1 hour 19 mins    ~ 63x
  Nvidia GTX-Titan (AMD x4 640 host)         24 mins           ~ 210x
  Nvidia GTX-690 (Intel i3 host)             40 mins           ~ 126x

• We see a significant performance drop as a result of the change from structured to unstructured grids.
• There are two reasons for this:
  • Load imbalance between blocks (10-20%), and
  • Poorly structured memory access (80-90%).

CONCLUSIONS
• Many challenges remain for the future of unstructured grid CFD on multiple GPU devices.
• Despite the challenges, we have still created a solver which improves upon the accuracy of previous implementations while still being fast enough to apply to practical problems.
• Currently, the libraries for this solver are written to support OpenCL, CUDA and efficient vectorization on newer Xeon cores and the Intel Xeon Phi device (results not discussed here).
• Future work (with regard to the GPU device) will lie in the optimization of memory access for unstructured grids and load balancing across multiple devices.

I'd like to thank the following companies for their valuable support.