Stan Posey, NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com

NVIDIA HPC Technology and CAE Strategy
- Technology Strategy: development of professional GPUs as co-processing accelerators for x86 CPUs
- Strategic Alliances: business and technical collaboration with ISVs, industry customers, and research organizations
- Applications Engineering: technical collaboration with ISVs (ANSYS, etc.) for development of GPU-accelerated solvers
- Software Development: NVIDIA linear solver toolkit (implicit iterative solvers), CUDA libraries, GPU compilers
- GPU System Integration: HP, Dell, IBM, Cray, SGI, Fujitsu, and others; Kepler K20-based systems available since 2012

GPU Product Summary for CAE Applications
- NVIDIA Kepler family GPUs for CAE simulations: K20 (5 GB), K20X (6 GB), K40 (12 GB), K6000 (12 GB)

CAE Workstations Now Configure with 2 GPUs
- NVIDIA MAXIMUS: visual computing and parallel computing in one workstation
- Intelligent GPU job allocation; unified driver for Quadro + Tesla
- ANSYS certifications covering CAD operations, pre-processing, post-processing, FEA, CFD, and CEM
- Offered by HP, Dell, Xenon, and others; now with Kepler-based GPUs; available since November 2011

NVIDIA GPUs Accelerate CAE at Any Scale
- Same GPU technology from MAXIMUS workstations to TITAN at ORNL
- TITAN: 20+ PetaFlops, 18,688 NVIDIA Tesla K20X GPUs, #2 at Top500.org
- Key application: S3D for turbulent combustion (how to efficiently burn next-generation diesel and bio fuels)

NVIDIA Use of CAE in Product Engineering
- ANSYS Icepak: active and passive cooling of IC packages
- ANSYS Mechanical: large-deflection bending of PCBs
- ANSYS Mechanical: comfort and fit of 3D emitter glasses
- ANSYS Mechanical: shock and vibration of solder-ball assemblies

CAE Trends and GPU Acceleration Benefits
- Higher fidelity (better models): GPUs permit higher fidelity within existing (CPU-only) job times
- Parameter sensitivities (more models): GPUs increase throughput over existing (CPU-only) job capacity, and at lower cost
- Advanced techniques: GPUs make practical high-order methods, time-dependent rather than static analyses, 3D solid finite elements rather than 2D shells, etc.
- Larger ISV software budgets: GPUs extract more use from the existing ISV software investment

Progress Summary for GPU-Parallel CAE
- Strong GPU investments by commercial CAE vendors (ISVs): GPU adoption led by implicit FEA and CEM, followed by CFD; recent CFD breakthroughs in linear solvers (AMG) and preconditioners
- GPUs are now production HPC for leading CAE end-user sites, led by the automotive, electronics, and aerospace industries
- GPUs are contributing to fast growth in emerging CAE applications: new developments in particle-based CFD (LBM, SPH, DEM, etc.)
- Rapid growth for a range of CEM applications and GPU adoption

GPU Progress for Commercial CAE Software (status: available today, or in product/research evaluation)
- Structural mechanics: available today in ANSYS Mechanical, Abaqus/Standard, MSC Nastran, Marc, AFEA, NX Nastran, HyperWorks OptiStruct, PAM-CRASH implicit, and LS-DYNA implicit; explicit codes (LS-DYNA, Abaqus/Explicit, RADIOSS, PAM-CRASH) in evaluation
- Fluid dynamics: available today in ANSYS CFD (FLUENT), Moldflow, Culises (OpenFOAM), Particleworks, SpeedIT (OpenFOAM), and AcuSolve; Abaqus/CFD, LS-DYNA CFD, CFD-ACE+, CFD++, FloEFD, STAR-CCM+, and XFlow in evaluation
- Electromagnetics: available today in EMPro, CST MWS, XFdtd, SEMCAD X, FEKO, and Nexxim; JMAG, HFSS, and Xpatch in evaluation

Additional Commercial GPU Developments (ISV, domain, location: primary applications)
- FluiDyna, CFD, Germany: Culises for OpenFOAM; LBultra
- Vratis, CFD, Poland: Speed-IT for OpenFOAM; ARAEL
- Prometech, CFD, Japan: Particleworks
- Turbostream, CFD, England, UK: Turbostream
- IMPETUS, explicit FEA, Sweden: AFEA
- AVL, CFD, Austria: FIRE
- CoreTech, CFD (molding), Taiwan: Moldex3D
- Intes, implicit FEA, Germany: PERMAS
- Next Limit, CFD, Spain: XFlow
- CPFD, CFD, USA: BARRACUDA
- Convergent/IDAJ, CFD, USA: Converge CFD
- SCSK, implicit FEA, Japan: ADVENTURECluster
- CDH, implicit FEA, Germany: AMLS; FastFRS
- FunctionBay, multibody dynamics, S. Korea: RecurDyn
- Cradle Software, CFD, Japan: SC/Tetra; scSTREAM

Status Summary of ISVs and GPU Acceleration
- Every primary ISV has products available on GPUs or an ongoing evaluation
- The 4 largest ISVs (ANSYS, SIMULIA, MSC Software, Altair) all have products based on GPUs, some at the 3rd generation
- 4 of the top 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, and MSC Nastran; LS-DYNA is implicit only
- Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream
- Availability of commercial CEM software is expanding with ECAE growth: CST, Remcom, Agilent, and EMSS are on 3rd-generation GPU releases; JSOL is to release JMAG, and ANSYS is to release HFSS

CAE Software Focus on Sparse Solvers
- CAE application flow: read input and set up the matrix on the CPU; perform the implicit sparse matrix operations on the GPU; compute the global solution and write output on the CPU
- The implicit sparse matrix operations account for 40% to 75% of profile time but only a small percentage of the lines of code
- GPU porting approaches: hand-coded CUDA parallelism, GPU libraries such as CUBLAS, and OpenACC directives (OpenACC is being investigated for moving more tasks onto the GPU)

GPU Approach of Direct Solvers for Implicit CSM
- Most time is consumed in dense matrix operations such as Cholesky factorization, the Schur complement, and others
- The multi-frontal method decomposes the global stiffness matrix into a tree of dense matrix fronts
- Most CSM implementations send the dense operations to the GPU while keeping the assembly-tree traversal on the CPU

GPU Approach of Direct Solvers for Implicit CSM (continued)
- Typical implicit CSM deployment of multi-frontal sparse direct solvers (schematic: the stiffness matrix factorized by the direct solver)
- Large dense matrix fronts are factored on the GPU
- Lower threshold: fronts too small to overcome PCIe data transfer costs stay on the CPU cores; small dense matrix fronts are factored in parallel on the CPU, where more cores means higher performance
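The front-size threshold described above amounts to a simple dispatch rule per front. The sketch below is illustrative only; the threshold value and the Front type are assumptions, not code from any of the solvers named in this deck.

```python
# Illustrative sketch of the multi-frontal CPU/GPU dispatch heuristic described above.
# The threshold and Front type are hypothetical; real solvers tune this per GPU,
# front shape, and PCIe bandwidth.

from dataclasses import dataclass

@dataclass
class Front:
    name: str
    rows: int  # order of the dense frontal matrix

GPU_THRESHOLD_ROWS = 2000  # assumed cutoff: below this, PCIe transfer outweighs GPU gain

def assign_front(front: Front) -> str:
    """Large fronts go to the GPU; small fronts stay on the CPU cores."""
    # Dense factorization cost grows roughly O(rows^3) while the transfer cost grows
    # roughly O(rows^2), so sufficiently large fronts amortize the PCIe copy.
    return "GPU" if front.rows >= GPU_THRESHOLD_ROWS else "CPU"

if __name__ == "__main__":
    fronts = [Front("leaf-1", 350), Front("leaf-2", 1200), Front("root", 42000)]
    for f in fronts:
        print(f"{f.name}: factor on {assign_front(f)}")
```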
CAE Priority for ISV Software on GPUs (applications grouped into priority tiers #1 through #4)
- Explicit structural mechanics: LSTC / LS-DYNA, SIMULIA / Abaqus/Explicit, Altair / RADIOSS, ESI / PAM-CRASH
- Fluid dynamics: ANSYS / ANSYS Fluent, OpenFOAM (various ISVs), CD-adapco / STAR-CCM+, Autodesk Simulation CFD, ESI / CFD-ACE+, SIMULIA / Abaqus/CFD
- Implicit structural mechanics: ANSYS / ANSYS Mechanical, SIMULIA / Abaqus/Standard, MSC Software / MSC Nastran, MSC Software / Marc, LSTC / LS-DYNA implicit, Altair / RADIOSS Bulk, Siemens / NX Nastran, Autodesk / Mechanical
- Additional applications: ANSYS / ANSYS Mechanical, Altair / RADIOSS, Altair / AcuSolve (CFD), Autodesk / Moldflow

Basics of GPU Computing for ISV Software
- ISV software use of GPU acceleration is user-transparent: jobs launch and complete without additional user steps; the user simply informs the ISV application (via GUI or command) that a GPU exists
- Schematic of an x86 CPU (with DDR memory and cache) attached to a GPU accelerator (with GDDR memory) through the I/O hub over PCI-Express
- Execution flow: 1. the ISV job is launched on the CPU; 2. solver operations are sent to the GPU; 3. the GPU sends results back to the CPU; 4. the ISV job completes on the CPU
- The CPU begins and ends the job; the GPU manages the heavy computations

Computational Fluid Dynamics: ANSYS Fluent

ANSYS and NVIDIA Collaboration Roadmap
- Release 13.0 (Dec 2010): ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers; ANSYS EM: ANSYS Nexxim
- Release 14.0 (Dec 2011): ANSYS Mechanical: + Distributed ANSYS, + multi-node support; ANSYS Fluent: radiation heat transfer (beta); ANSYS EM: ANSYS Nexxim
- Release 14.5 (Nov 2012): ANSYS Mechanical: + multi-GPU support, + hybrid PCG, + Kepler GPU support; ANSYS Fluent: + radiation heat transfer, + GPU AMG solver (beta, single GPU); ANSYS EM: ANSYS Nexxim
- Release 15.0 (Dec 2013): ANSYS Mechanical: + CUDA 5 Kepler tuning; ANSYS Fluent: + multi-GPU AMG solver, + CUDA 5 Kepler tuning; ANSYS EM: ANSYS Nexxim, ANSYS HFSS (transient)

ANSYS 15.0 HPC License Scheme for GPUs
- Treats each GPU socket as a CPU core, which significantly increases the simulation productivity of your HPC licenses
- Needs only 1 HPC task to enable a GPU
- All ANSYS HPC products unlock GPUs in 15.0, including the HPC, HPC Pack, HPC Workgroup, and HPC Enterprise products

ANSYS Fluent Profile for Coupled PBNS Solver
- Each non-linear iteration assembles the linear system of equations (about 35% of runtime) and then solves the linear system Ax = b (about 65% of runtime), repeating until convergence
- The linear solve is the part to accelerate first

Overview of AmgX Linear Solver Library
- Two forms of AMG: classical AMG (as in HYPRE), with strong convergence for scalar systems, and unsmoothed aggregation AMG, with lower setup times and support for block systems
- Krylov methods: GMRES, CG, BiCGStab, with preconditioned and flexible variants
- Classic iterative methods: block-Jacobi, Gauss-Seidel, Chebyshev, ILU0, ILU1, with multi-colored versions for fine-grained parallelism
- Flexible configuration: all methods usable as solvers, preconditioners, or smoothers, with nesting
- Designed for non-linear problems: allows a frequently changing matrix, with parallel and efficient setup

AmgX Developed for Ease of Use
- No CUDA experience is necessary to use the library
- Small, focused C API that links with C, C++, or Fortran
- Reads common matrix formats (CSR, COO, MM)
- Single-GPU and multi-GPU; interoperates easily with MPI, OpenMP, and hybrid parallel applications
- Tuned for K20 and K40; supports Fermi and newer GPUs
- Single and double precision; supported on Linux and Win64

How to Enable NVIDIA GPUs in ANSYS Fluent
- Launch command (Linux form shown): fluent 3ddp -g -ssh -t2 -gpgpu=1 -i journal.jou
- Cluster specification: nprocs = total number of Fluent processes, M = number of machines, ngpgpus = number of GPUs per machine
- Requirement 1: nprocs mod M = 0, so that each machine runs the same number of solver processes
- Requirement 2: (nprocs / M) mod ngpgpus = 0, so that the number of processes per machine is an integer multiple of the number of GPUs
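The two divisibility requirements can be checked before a job is submitted. The helper below is a minimal sketch, not an ANSYS-provided tool; the 4 GB per million cells figure it uses for coupled PBNS memory sizing is the guidance quoted in the considerations that follow.

```python
# Minimal pre-flight check for the Fluent GPU launch requirements described above.
# Illustrative helper only, not part of ANSYS Fluent.

def check_fluent_gpu_layout(nprocs: int, machines: int, gpus_per_machine: int) -> bool:
    """Return True if the process/GPU layout satisfies both requirements."""
    if nprocs % machines != 0:                    # Requirement 1: equal processes per machine
        return False
    per_machine = nprocs // machines
    return per_machine % gpus_per_machine == 0    # Requirement 2: processes a multiple of GPUs

def coupled_pbns_gpu_memory_gb(million_cells: float, gb_per_million_cells: float = 4.0) -> float:
    """Rough GPU memory estimate for the coupled PBNS solver (assumed ~4 GB per 1M cells)."""
    return million_cells * gb_per_million_cells

if __name__ == "__main__":
    # 32 Fluent processes on 2 machines with 4 GPUs each: 16 per machine, 16 % 4 == 0
    print(check_fluent_gpu_layout(nprocs=32, machines=2, gpus_per_machine=4))   # True
    print(check_fluent_gpu_layout(nprocs=36, machines=2, gpus_per_machine=4))   # False (18 % 4 != 0)
    # A 14M-cell coupled PBNS case would need roughly 56 GB of GPU memory across its GPUs
    print(coupled_pbns_gpu_memory_gb(14))
```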
Considerations for ANSYS Fluent on GPUs
- GPUs accelerate the AMG solver of the CFD analysis; fine meshes and low-dissipation problems have a high AMG fraction, and the coupled solution scheme spends about 65% of its time in AMG on average
- In many cases the pressure-based coupled solver offers faster convergence than the segregated solver, although this is problem-dependent
- The system matrix must fit in GPU memory: for coupled PBNS, each 1 million cells needs about 4 GB of GPU memory, so high-memory GPUs such as the Tesla K40 or Quadro K6000 are ideal
- Better performance is obtained at lower CPU core counts; a ratio of 4 CPU cores to 1 GPU is recommended

ANSYS Fluent GPU Performance for Large Cases (ANSYS Fluent 15.0, results by NVIDIA, Dec 2013)
- Truck body model, external aerodynamics, steady k-epsilon turbulence, double-precision solver
- CPU: Intel Xeon E5-2667, 12 cores per node; GPU: Tesla K40, 4 per node; reported times are seconds per iteration
- 14-million-cell case: 36 CPU cores take 13 s per iteration; 36 CPU cores + 12 GPUs take 9.5 s (1.4x)
- 111-million-cell case: 144 CPU cores take 36 s per iteration; 144 CPU cores + 48 GPUs take 18 s (2x)

ANSYS Fluent GPU Performance for Large Cases (continued)
- Same 111-million-cell truck body model and hardware; about 80% of the solution time is in the AMG solver
- AMG solver time per iteration: 29 s on 144 CPU cores (AMG) versus 11 s on 48 GPUs (AmgX), a 2.7x solver speedup
- Fluent solution time per iteration: 36 s on 144 CPU cores versus 18 s on 144 CPU cores + 48 GPUs, a 2x overall speedup
- AmgX is a linear solver toolkit from NVIDIA, used by ANSYS (an Amdahl's-law estimate that reproduces the 2x figure from the 2.7x solver speedup is sketched after the productivity study below)

ANSYS Fluent GPU Study on Productivity Gains (ANSYS Fluent 15.0 Preview 3, results by NVIDIA, Sep 2013)
- Truck body model, 14 million mixed cells, steady k-epsilon turbulence, coupled PBNS, double precision; total solution times; all results fully converged
- CPU solver: AMG F-cycle; GPU solver: FGMRES with an AMG preconditioner
- Same solution times for 64 cores (4 nodes x 2 CPUs) and for 32 cores + 8 GPUs (2 nodes x 2 CPUs, 4 GPUs per node): both deliver about 16 jobs per day
- The GPU configuration frees up 32 CPU cores and their HPC licenses for additional jobs, for an approximate 56% increase in overall productivity at a 25% increase in cost
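The overall 2x figure above follows from Amdahl's law applied to the reported solver fraction, and the 56% productivity figure follows from simple throughput accounting. The numbers below come from the two preceding studies; the assumption that the freed 32 cores sustain roughly 9 additional jobs per day is inferred from the stated 56% gain rather than reported directly.

```python
# Worked arithmetic for the two preceding Fluent studies (illustrative, not ANSYS tooling).

def amdahl_speedup(accelerated_fraction: float, fraction_speedup: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / fraction_speedup)

# About 80% of solution time is in the AMG solver, and the solver goes from 29 s to 11 s:
print(round(amdahl_speedup(0.80, 29.0 / 11.0), 1))   # ~2.0x, matching the reported result

# Productivity study: 64 cores give ~16 jobs/day; 32 cores + 8 GPUs also give ~16 jobs/day,
# and the freed 32 cores are assumed to add ~9 more jobs/day (inferred, see lead-in).
baseline_jobs = 16
gpu_config_jobs = 16 + 9
print(round(100 * (gpu_config_jobs / baseline_jobs - 1)))   # ~56% productivity increase
```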
Computational Fluid Dynamics: OpenFOAM

NVIDIA Development Strategy for OpenFOAM
- Provide technical support for commercial GPU solver developments: the FluiDyna Culises library through NVIDIA collaboration on AMG, and the Vratis Speed-IT library with development of a CUSP-based AMG
- Invest in alliances (but not development) with key OpenFOAM organizations: ESI and the OpenCFD Foundation (H. Weller, M. Salari); Wikki and the OpenFOAM-extend community (H. Jasak); IDAJ Japan and ICON UK, which support both OpenFOAM and OpenFOAM-extend
- Conduct performance studies and customer benchmark evaluations in collaboration with developers, customers, and OEMs (Dell, SGI, HP, etc.)

Culises: CFD Solver Library for OpenFOAM (www.fluidyna.de)
- FluiDyna is a TU Munich spin-off founded in 2006; Culises provides a GPU linear solver library for OpenFOAM
- Easy-to-use AMG-PCG solver: 1. download and license from http://www.FluiDyna.de; 2. automatic installation with a FluiDyna-provided script; 3. activate Culises and GPUs with two edits to the OpenFOAM control file (CPU-only versus CPU+GPU config file)
- Multi-GPU ready; contact FluiDyna for license details

OpenFOAM Speedups Based on CFD Application (www.fluidyna.de)
- GPU speedups for different industry cases versus CPU-only OpenFOAM, across a range of model sizes and solver schemes (Krylov, AMG-PCG, etc.)
- Automotive 1.6x, multiphase 1.9x, thermal 3.0x, pharmaceutical CFD 2.2x, and process CFD 4.7x (a mix of job and solver speedups)

FluiDyna Culises: CFD Solver for OpenFOAM
- "Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems", Dr. Bjoern Landmann, FluiDyna: developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0293-GTC2012-Culises-Hybrid-GPU.pdf
- DrivAer: joint car body shape by BMW and Audi: http://www.aer.mw.tum.de/en/research-groups/automotive/drivaer
- Up to 36M cells (mixed type); GAMG on the CPU versus AMG-PCG on the GPU; solver speedup of about 7x for 2 CPUs + 4 GPUs
- 9M cells, 2 CPUs + 1 GPU: 2.5x solver speedup, 1.36x job speedup; 18M cells, 2 CPUs + 2 GPUs: 4.2x solver, 1.52x job; 36M cells, 2 CPUs + 4 GPUs: 6.9x solver, 1.67x job

Computational Structural Mechanics: ANSYS Mechanical

CSM Model Feature Recommendations for GPUs
- The model should be at least 500 KDOF, and more is better, to ensure enough computational work to justify use of a GPU
- Models with solid finite elements speed up more than shell models; 2D shell elements generally do not provide enough computational work
- Direct solvers need moderate GPU memory and heavy system memory: system memory must hold the entire system matrix (in-core), while GPU memory only needs to hold a single matrix front
- Iterative solvers need large GPU memory and moderate system memory: GPU memory must hold the entire system matrix (in-core)
- (A quick GPU-suitability screen based on these guidelines is sketched after the benchmark results below.)

ANSYS and NVIDIA Collaboration Roadmap (releases 13.0 through 15.0; identical to the roadmap shown in the ANSYS Fluent section above)

ANSYS Mechanical 15.0 on Tesla GPUs (V14sp-5 model)
- Turbine geometry, 2,100,000 DOF, SOLID187 elements, static nonlinear analysis, Distributed ANSYS 15.0, direct sparse solver
- Simulation productivity with an HPC license (jobs/day): 2 CPU cores = 93; 2 CPU cores + Tesla K20 = 324 (3.5x); 2 CPU cores + Tesla K40 = 363 (3.9x)
- Simulation productivity with an HPC Pack (jobs/day): 8 CPU cores = 275; 7 CPU cores + Tesla K20 = 576 (2.1x); 7 CPU cores + Tesla K40 = about 600 (2.2x)
- Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPUs; Tesla K20 and Tesla K40 GPUs with boost clocks

ANSYS Mechanical 15.0 on Tesla K40 (V14sp-6 model)
- 4,900,000 DOF, static nonlinear analysis, Distributed ANSYS 15.0, direct sparse solver
- Simulation productivity with an HPC license (jobs/day): 2 CPU cores = 59; 2 CPU cores + Tesla K40 = 172 (2.9x)
- Simulation productivity with an HPC Pack (jobs/day): 8 CPU cores = 180; 7 CPU cores + Tesla K40 = 315 (1.8x)
- Distributed ANSYS Mechanical 15.0 with an Intel Xeon E5-2697 v2 2.7 GHz CPU and a Tesla K40 GPU with boost clocks
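The model-feature guidance above can be collected into a simple screening helper. This is an illustrative sketch of the rules of thumb from the recommendations slide, not a tool from ANSYS or NVIDIA; the function names and the memory figures in the example are assumptions.

```python
# Rough GPU-suitability screen for an implicit CSM model, following the guidance above.
# Illustrative only: the thresholds come from the slide, everything else is assumed.

def gpu_worthwhile(dofs: int, mostly_solid_elements: bool) -> bool:
    """At least ~500 KDOF and solid (not shell) elements are recommended for GPU benefit."""
    return dofs >= 500_000 and mostly_solid_elements

def direct_solver_fits(front_gb: float, matrix_gb: float,
                       gpu_mem_gb: float, system_mem_gb: float) -> bool:
    """Direct solver: whole matrix in system memory (in-core), one front in GPU memory."""
    return matrix_gb <= system_mem_gb and front_gb <= gpu_mem_gb

def iterative_solver_fits(matrix_gb: float, gpu_mem_gb: float) -> bool:
    """Iterative (PCG-type) solver: the whole matrix should fit in GPU memory."""
    return matrix_gb <= gpu_mem_gb

if __name__ == "__main__":
    # Hypothetical 2.1M-DOF solid model on a 12 GB Tesla K40 with 128 GB of system memory
    print(gpu_worthwhile(2_100_000, mostly_solid_elements=True))          # True
    print(direct_solver_fits(front_gb=6, matrix_gb=60,
                             gpu_mem_gb=12, system_mem_gb=128))           # True
    print(iterative_solver_fits(matrix_gb=20, gpu_mem_gb=12))             # False
```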
Computational Structural Mechanics: Abaqus/Standard

SIMULIA and Abaqus GPU Release Progression
- Abaqus 6.11 (June 2011): the direct sparse solver is accelerated on the GPU; single-GPU support; Fermi GPUs (Tesla 20-series, Quadro 6000)
- Abaqus 6.12 (June 2012): multiple GPUs per node and multi-node DMP clusters; flexibility to run jobs on specific GPUs; Fermi GPUs, plus Kepler via a hotfix since November 2012
- Abaqus 6.13 (June 2013): unsymmetric sparse solver on the GPU; official Kepler support (Tesla K20/K20X)

Rolls-Royce: Abaqus 3.5x Speedup with 5M DOF on a Single Server
- 4.71M DOF (equations), about 77 TFLOPs; nonlinear static analysis (6 steps); direct sparse solver, 100 GB memory
- Speedups relative to 8 cores include 2.11x for 8 cores + 1 GPU, 2.42x for 8 cores + 2 GPUs, and 3.5x for 16 cores + 2 GPUs (configurations tested: 8c, 8c+1g, 8c+2g, 16c, 16c+2g)
- Server with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2

Rolls-Royce: Abaqus Speedups on an HPC Cluster
- Same 4.71M DOF nonlinear static model (6 steps), direct sparse solver, 100 GB memory, on 2 to 4 servers
- Adding 2 Tesla K20X GPUs per server gives roughly 1.8x to 2.2x over the CPU-only runs at 24, 36, and 48 cores (2, 3, and 4 servers)
- Servers with 2x E5-2670 2.6 GHz CPUs, 128 GB memory, 2x Tesla K20X, Linux RHEL 6.2, Abaqus/Standard 6.12-2

Computational Structural Mechanics: MSC Nastran

MSC Nastran Release 2013 for GPUs
- The MSC Nastran direct equation solver is GPU accelerated: sparse direct factorization with no limit on model size; real, complex, symmetric, and unsymmetric matrices
- Impacts several solution sequences: high impact for SOL101 and SOL108, mid for SOL103, low for SOL111 and SOL400
- Supports multiple GPUs, on both Linux and Windows; supported NVIDIA GPUs include the Tesla 20-series, Tesla K20/K20X, and Quadro 6000

MSC Nastran 2013 and GPU Performance
- SMP + GPU acceleration of SOL101 and SOL103; server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory
- SOL101 (2.4M rows, 42K front): about 2.8x on 4 cores and about 6x on 4 cores + 1 GPU, relative to serial
- SOL103 (2.6M rows, 18K front): about 1.9x on 4 cores and about 2.7x on 4 cores + 1 GPU, relative to serial
- The Lanczos solver (SOL103) combines sparse matrix factorization, iteration on a block of vectors (solve), and orthogonalization of vectors

MSC Nastran 2013 and NVH Simulation on GPUs
- Coupled structural-acoustic simulation with SOL108 for a European automotive OEM: 710K nodes, 3.83M elements, 100 frequency increments (FREQ1), direct sparse solver
- Elapsed-time speedups relative to serial: 2.7x for 1 core + 1 GPU, 4.8x for 4 cores (SMP), 5.2x for 4 cores + 1 GPU, 5.5x for 8 cores (DMP=2), and 11.1x for 8 cores + 2 GPUs (DMP=2)
- Server node: Sandy Bridge 2.6 GHz, 2x 8 cores, 2x Tesla K20X GPUs, 128 GB memory

Computational Structural Mechanics: Altair OptiStruct

GPU Performance of the OptiStruct PCG Solver
- Problem: the hood of a car with pressure loads, displacements, and stresses; 2.2 million degrees of freedom, 62 million non-zeros; 380,000 shells + 13,000 solids + 1,100 RBE3 elements; 5,300 iterations
- Platform: NVIDIA PSG cluster, 2 nodes, each with Intel Westmere 2x6-core X5670 at 2.93 GHz and dual NVIDIA M2090 GPUs (CUDA 3.2), Linux RHEL 5.4 with Intel MPI 4.0
- 2 GPUs on 1 node: SMP 6-core = 1106 s; SMP 6-core + 1 GPU = 254 s (4.3x); hybrid 2 MPI x 6 SMP = 572 s; hybrid 2 MPI x 6 SMP + 2 GPUs = 143 s (7.5x)
- 4 GPUs on 2 nodes: hybrid 4 MPI x 6 SMP = 306 s; hybrid 4 MPI x 6 SMP + 4 GPUs = 85 s (13x relative to the single-node SMP 6-core run)

Summary of GPU Progress for CAE
- GPUs provide significant speedups for solver-intensive simulations
- Improved product quality through higher-fidelity modeling
- Shorter product engineering cycles through faster simulation turnaround
- Simulations recently considered impractical are now possible: in FEA, larger DOF counts, more complex material behavior, and FSI; in CFD, unsteady RANS and LES simulations practical in cost and time
- Effective parameter optimization from a large increase in the number of jobs

Stan Posey, NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com