Experiences with NVIDIA GPUs
Abel compute cluster at UiO 2014-2015
Ole W. Saastad, PhD
UiO/USIT/UAV/ITF/FI
Feb. 2015

Preface

The introduction of Graphic Processing Units (GPUs) into high performance computing has introduced computing elements with very high compute capability as well as high complexity. The entry level has until recently been quite high. This has changed in the last few years with the introduction of new software that tries to provide easy access. However, the threshold is still quite high. The motivation has been the very high compute capacity. A typical node in the Abel cluster provides 320 Gflops/s of performance when running the top500 test. The same test using a less powerful node but with two GPUs clocks in at 1800 Gflops/s. This represents a more than 5-fold increase in performance at a low, but not insignificant, cost. The top500 test solves a linear problem that requires a lot of data to be moved from the host to the GPU memory and back; other tasks involving less data transfer and more computation will show far higher gains in performance. As a reference, the theoretical performance claimed for the NVIDIA K20X GPU is 3.95 / 1.31 Tflops/s for single and double precision floating point numbers.

This document contains two parts. First is a review of the relevant applications for NOTUR. Second is a performance evaluation of a range of relevant applications and GPU use. Performance is reported in two different ways: in many cases wall clock time, which is the total time from launch to completion, and in some cases a performance measurement of the kind jobs per unit of time. The first is of the category lower is better while the latter is higher is better. Please examine the graphs and notice which measurement type is used.

Introduction

The modern (2013) GPU (NVIDIA Kepler II, K20X) units are very complex chips containing 7.1 billion transistors using 28 nm silicon production technology.
The claimed theoretical performance seems rather impressive, with close to 4 Tflops/s single precision (32 bit reals) and 1.3 Tflops/s in (64 bit) double precision. However, GPUs have always shown very impressive performance numbers which have been quite hard to attain for applications or even for benchmarks. Below we'll try to explain how the GPU processor works and why optimal utilization is quite hard.

The instruction set is quite comprehensive and contains a rich set of instructions for flow control, conversion, integers and floating point numbers. The floating point instruction set is shown in figure 2 and the 64 bit full precision instructions in table 1. Note the lack of trigonometric and log/exp instructions in full precision; they are implemented neither in single nor in double precision, only in reduced precision. Not needed for gaming, hence not implemented. These instructions are not present in the SSE units either, only in the less used x87 unit. However, fused multiply add is included, which is still lacking on Intel Sandy Bridge. It is also interesting to note the reciprocal instruction; reciprocals are much faster than full division.

Figure 2: Floating point instructions supported by NVIDIA K20 processor.

This instruction set is normally not exposed to the user nor is it used directly by common users. The language Parallel Thread Execution (PTX) is a pseudo-assembly language used in Nvidia's CUDA programming environment. The nvcc compiler translates code written in CUDA, a C-like language, into PTX, and the graphics driver contains a compiler which translates the PTX into binary code which can be run on the processing cores.

        add  sub  mul  mad  fma  div  rcp  sqrt  abs  neg  min  max
GPU      x    x    x    x    x    x    x    x     x    x    x    x
CPU      x    x    x    x         x    x    x     x    x    x    x

Table 1: Implemented 64 bit instructions with full precision in the different processor types (the gap at fma reflects the missing fused multiply add on Sandy Bridge).

From the table above we see that all the common FP instructions needed for HPC are implemented.
The log, exp and trigonometric functions are implemented in software, just as they are on the Xeon.

The NVIDIA Tesla architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a host program invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

Figure 3: High level overview of the K20 GPU chip showing the 15 SMX units, L2 cache and interfaces.

A multiprocessor consists of multiple Scalar Processor (SP) cores, a multithreaded instruction unit, and on-chip shared memory. The multiprocessor creates, manages, and executes concurrent threads in hardware with zero scheduling overhead. It implements single-instruction barrier synchronization. Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing, for example, a low granularity decomposition of problems by assigning one thread to each data element (such as a pixel in an image, a voxel in a volume, or a cell in a grid-based computation). To manage hundreds of threads running several different programs, the multiprocessor employs an architecture NVIDIA calls SIMT (single-instruction, multiple-thread). The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state. The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of parallel threads called warps. (This term originates from weaving, the first parallel thread technology.) Individual threads composing a SIMT warp start together at the same program address but are otherwise free to branch and execute independently.
The GPU contains a large number of stream processors. These are scheduled more or less individually by the chip's scheduler. Hence the GPU cannot be considered a kind of vector processor but rather a large pool of small processor cores. However, the cores operate on so-called warps (see below for more information), which resemble a SIMD or vector concept. All of this shows that one needs very powerful tools to convert C or Fortran code to this kind of processor machinery.

Figure 4: Detailed view of one of the 15 SMX processors. There is only one double precision unit for every three processor cores, while there is one single precision unit in each processor core. Hence the 1/3 double precision performance.

To harvest the very high performance one needs to program all these processor units to operate coherently. Take a look at the execution model [1] (extract given here):

The execution model is very unusual, especially compared to typical "what one thread should do" machine code.

• A "thread" means one single execution of a kernel. This is mostly conceptual, since the hardware operates on warps of threads.
• A "warp" is a group of 32 threads that all take the same branches. A warp is really a SIMD group: a bunch of floats sharing one instruction decoder. The hardware does a good job with predication, so warps aren't "in your face" like with SSE.
• A "block" is a group of a few hundred threads that have access to the same "__shared__" memory. The block size is specified in software, but limited by hardware to 512 or 1024 threads maximum. More threads per block is generally better, with serious slowdowns for less than 100 or so threads per block. The PTX manual calls blocks "CTAs" (Cooperative Thread Arrays).
• The entire kernel consists of a set of blocks of threads.

The memory model is also highly segmented and specialized, unlike the flat memory of modern CPUs.

• "registers" are unique to each thread.
Early 8000-series cards had 8192 registers available; GTX 200 series had 16K registers; and the newer Fermi GTX 400s have 32K registers. Registers are divided up among threads, so the fewer registers each thread uses, the more threads the machine can keep in flight, hiding latency.
• "shared" memory is declared with __shared__, and can be read or written by all the threads in a block. This is handy for neighborhood communication where global memory would be too slow. There is at least 16KB of shared memory available per thread block; Fermi cards can expose up to 48KB with special config options. Use __syncthreads() to synchronize shared writes across a whole thread block.
• "global" memory is the central gigabyte or so of GPU RAM. All cudaMemcpy calls go to global memory, which is considered "slow" at only 100 GB/second! Older hardware had very strict rules on "global memory coalescing", but luckily newer (Fermi-era) hardware just prefers locality, if you can manage it.
• "param" is the PTX abstraction around the parameter-passing protocol. They reserve the right to change this, as hardware and software change.
• constants and compiled program code are stored in their own read-only memories.
• "local" memory is unique to each thread, but paradoxically slower than shared memory. Don't use it!

[1] http://www.cs.uaf.edu/2011/spring/cs641/lecture/03_03_CUDA_PTX.html

Very few programmers get exposed to this level of detail, but it serves to explain why it is very hard to compile and generate code that gets scheduled effectively on this kind of processor.

Figure 5: Kepler memory hierarchy.

The more adventurous embark on CUDA, a C-like language, and there is a large community of CUDA users. However, this is out of scope for most scientific users who do not have programming as their field of study. This report will not cover CUDA programming as such.
There is a lot of material, text books and tutorials, about CUDA available.

PART I

How to access the GPU power for the scientific user

As we have seen above, the GPUs are highly complex and very hard to program. We need to take on some powerful tools to hide the complexity. Three common user friendly approaches can be taken. The concepts of CUDA and OpenCL programming are omitted because they are not regarded as user friendly nor easy to program. For those who do take on this challenge the gains can be very high. An introduction to CUDA and OpenCL is not within the scope of this work.

a) Precompiled applications

Several precompiled applications are distributed with GPU support. These applications generally just connect to the GPU and harvest the power without much extra setup for the user. All the details are hidden behind the scenes. If your favorite application is in this category, consider yourself lucky. Some examples include NAMD, BLAST, ADF and a few others. Examples of usage will be shown later in this report. Some have only been evaluated for performance and are not discussed in part I.

b) GPU enabled libraries

Libraries exist that provide routines for the most common functions. These libraries provide functions that will execute on the GPU and give easy access to the power of the GPUs. The parameter lists etc. are generally the same as in the normal versions of the functions, but the actual names of the functions generally differ. There is ongoing development with vendors like NVIDIA to make the libraries more accessible, with Fortran wrappers that generally lower the threshold for new users.

c) Compiler with accelerator support

Some compilers support the OpenACC initiative, which has the goal of making GPU support through source code directives as easy as multithreading is with OpenMP. The OpenACC directives are comparable to OpenMP in ease of use and complexity.
This initiative has attracted a lot of attention as it can provide easy access to GPU power for the old dusty legacy decks. Much serial Fortran code still exists which can be persuaded into GPU acceleration by means of some simple compiler directives. This exercise has been done before using OpenMP, which many programmers are familiar with.

Precompiled applications

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems that comes with support for multicore, MPI and GPUs. Running this application using GPU support is relatively simple. It is only a question of setting up the right environment, in this case the path to the library files. The CUDA library file comes with the application, as the one coming with the CUDA distribution tends to be updated more frequently than NAMD. In addition, the NAMD launch command needs to be altered, and one needs to select one or two GPUs. This must be done in sync with the requests for resources in the queue system (slurm run script) file. An example of a run script for NAMD using GPUs could look like:

#!/bin/bash
#
#SBATCH --job-name=namd
#SBATCH --account=<your account>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=3900M
#SBATCH --exclusive
#SBATCH --partition=gpu --gres=gpu:2
#SBATCH --time=1:00:0

# Set env :
. /cluster/bin/jobsetup
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/work/namd/NAMD_2.9_Linux-x86_64-multicore-CUDA
BIN=~/work/namd/NAMD_2.9_Linux-x86_64-multicore-CUDA/

# Start your work
INP=ubq-nve.conf
$BIN/charmrun ++local $BIN/namd2 +p $SLURM_TASKS_PER_NODE +idlepoll \
    +devices $CUDA_VISIBLE_DEVICES $INP

As can be seen from the script it is quite easy to set up the NAMD environment to use the GPUs. We are running only on one node, but using both GPUs.
From the NAMD output we see that it is using what we have requested:

Info: NAMD 2.9 Linux-x86_64-multicore-CUDA
Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
Info: Built Mon Apr 30 14:02:11 CDT 2012 by jim on naiad.ks.uiuc.edu
Info: Running on 8 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.0219169 s
Pe 2 physical rank 2 binding to CUDA device 0 on compute-19-8.local: 'Tesla K20Xm'  Mem: 5759MB  Rev: 3.5
Pe 6 physical rank 6 will use CUDA device of pe 4
Pe 3 physical rank 3 will use CUDA device of pe 2
Pe 7 physical rank 7 will use CUDA device of pe 4
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 1 physical rank 1 will use CUDA device of pe 2
Pe 4 physical rank 4 binding to CUDA device 1 on compute-19-8.local: 'Tesla K20Xm'  Mem: 5759MB  Rev: 3.5
Pe 0 physical rank 0 will use CUDA device of pe 2
Info: 11.707 MB of memory in use based on /proc/self/stat

Both GPUs are in use, and NAMD also uses all 8 cores in this node. It is generally a good idea to request the complete node by using the keyword exclusive in the run script.

Figure 6: NAMD is under rapid development and is one of the most used applications in molecular dynamics, with strong ongoing work to exploit accelerators, both GPUs and MICs.

AMBER is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. It is a rather large project with a range of tools. It supports GPU acceleration. Please see the web page http://ambermd.org/ for detailed information. A performance evaluation is shown later in this document. Amber 14 requires a license even for academic use.

Figure 7: Visualisation of the structure of a molecule calculated with molecular dynamics using Amber.
LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. LAMMPS has potentials for solid-state materials (metals, semiconductors) and soft matter (biomolecules, polymers) and coarse-grained or mesoscopic systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality. LAMMPS is distributed as an open source code under the terms of the GPL. LAMMPS is distributed by Sandia National Laboratories, a US Department of Energy laboratory. The main authors of LAMMPS are listed on the LAMMPS web page along with contact info and other contributors. Funding for LAMMPS development has come primarily from DOE (OASCR, OBER, ASCI, LDRD, Genomes-to-Life).

Figure 8: Example of LAMMPS simulations. Illustration taken from the LAMMPS web page.

See the LAMMPS web page for more information: http://lammps.sandia.gov/

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers. GROMACS supports all the usual algorithms you expect from a modern molecular dynamics implementation. GROMACS is Free Software, available under the GNU Lesser General Public License.

Starting with version 4.6, GROMACS includes a brand-new, native GPU acceleration developed in Stockholm. This replaces the previous GPU code and comes with a number of important features. The new GPU code is fast, and we mean it. Rather than speaking about relative speed, or speedup for a few special cases, this code is typically much faster (3-5x) even when compared to GROMACS running on all cores of a typical desktop. If you put two GPUs in a high-end cluster node, this too will result in a significant acceleration.

Figure 9: Gromacs GPU acceleration goal.

Commercial applications

There are some commercial applications that are developed to take advantage of the GPUs.

ACEMD

One such application is ACEMD. This is a molecular dynamics application developed to run only on GPUs. It installs and runs with no problems. Claimed performance is very high. So far only the bundled examples have been run; these seem to perform very well. The input is compatible with AMBER and NAMD input. Unfortunately the license cost of 1.5 kEUR per limits the adoption.

ADF

The Amsterdam Density Functional (ADF) application currently supports GPUs. Initial tests show a very stable application that runs well with GPU acceleration. However, at the time of writing the performance is not really up to expectations. While it shows significant speedup for some tests, it is by no means a game changer.
Figure 10: Visualisation of a molecular motor simulated by ADF calculations.

VASP

There are reports that this application has some support for GPU acceleration. These preliminary reports need to be followed up with the developers in Vienna. At the time of writing, October 2014, no such information has been published. Whether UiO will be a beta test site is unknown, but we have applied. VASP is one of the major users of NOTUR computation resources.

Figure 11: Vienna Ab-initio Simulation Package, VASP.

Gaussian

Gaussian is working with the Portland compiler group to develop an accelerated version. This cooperation was announced in 2011, see:
http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&releasejsp=release_157&prid=792463

From the press release, dated the 29th of August 2011: "NVIDIA today announced plans with Gaussian, Inc., and The Portland Group (PGI) to develop a future GPU-accelerated release of Gaussian, the world's leading software application for quantum chemistry."

The status of this work and a scheduled release date are not yet announced. When Gaussian Inc was asked about progress, their reply on November 1st 2013 was not very optimistic: "While I am aware that progress has been made on a GPU-enabled version of Gaussian, I am not certain what type of details I am permitted to provide about the progress."

The compiler team from the Portland Group could inform us that the Gaussian code has been compiled with OpenACC directives in order to utilize the NVIDIA GPUs. Gaussian Inc has been very concerned about the portability of their source code, so nothing apart from directives, which are source code comments anyway, was allowed. The Portland Group is very eager to release to the public any news about PGI accelerated Gaussian.

Figure 12: Molecules calculated by Gaussian.
GPU enabled libraries

NVIDIA provides a range of highly optimized CUDA library functions that can be called from Fortran or C. The Fortran functions are available through a wrapper which hides all the boring details. The wrapper routines are already compiled for Abel and located together with the CUDA 64 bit libraries. The CUDA documentation is available in both pdf and html format. It is located with the CUDA software under doc (on Abel look under $CUDA/doc, after the CUDA module has been loaded). Currently there are a few libraries implemented.

• CUBLAS - The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA™ runtime. It allows the user to access the computational resources of an NVIDIA Graphics Processing Unit (GPU), but does not auto-parallelize across multiple GPUs.
• CUFFT - The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets. It is one of the most important and widely used numerical algorithms in computational physics and general signal processing. The CUFFT library provides a simple interface for computing parallel FFTs on an NVIDIA GPU, which allows users to quickly leverage the floating-point power and parallelism of the GPU in a highly optimized and tested FFT library.
• CURAND - The CURAND library provides facilities that focus on the simple and efficient generation of high-quality pseudorandom and quasirandom numbers. A pseudorandom sequence of numbers satisfies most of the statistical properties of a truly random sequence but is generated by a deterministic algorithm. A quasirandom sequence of n-dimensional points is generated by a deterministic algorithm designed to fill an n-dimensional space evenly.
• CUSPARSE - The CUSPARSE library contains a set of basic linear algebra subroutines used for handling sparse matrices.
It is implemented on top of the NVIDIA® CUDA™ runtime (which is part of the CUDA Toolkit) and is designed to be called from C and C++.
• THRUST - Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C. Thrust provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be composed together to implement complex algorithms with concise, readable source code. By describing your computation in terms of these high-level abstractions you provide Thrust with the freedom to select the most efficient implementation automatically. As a result, Thrust can be utilized in rapid prototyping of CUDA applications, where programmer productivity matters most, as well as in production, where robustness and absolute performance are crucial.

A detailed overview of the routines and functions supported is found in the documentation. All of the most common functions are implemented: BLAS levels 1, 2 and 3; 1d, 2d and 3d FFTs; and a comprehensive array of other functions and routines.

The CUDA libraries are written in C and expect memory allocation to be handled by the programmer. As C functions can easily be called from Fortran this represents no real problem, just some more programming code. There is a wrapper developed for BLAS which comes in two versions. One hides all allocation etc. so that there are virtually no changes to the Fortran code, and the other requires just the allocation steps. The latter has a much lighter calling sequence and yields higher performance, while the first makes testing very easy. Unfortunately these wrappers are at the time of writing only developed for the BLAS library. Hence it is somewhat harder to start using the FFT, random and sparse libraries from Fortran.
To do development with these libraries one needs to work on one of the nodes which host the GPU cards. To reserve such a node issue the following command:

qlogin --account=<your project> --nodes=1 --ntasks-per-node=8 \
    --mem-per-cpu=3900M --exclusive --partition=gpu --gres=gpu:2 --time=1:00:0

This will reserve a complete node with all CPUs and both GPUs and ensure that you are the only user with legitimate access to this node. Issue the normal commands for jobs, like sourcing /cluster/bin/jobsetup, and load the modules needed. For this kind of work the cuda/5.0 module is appropriate.

Using the CUDA BLAS libraries is remarkably simple; the CUDA environment sets up most of the detailed paths etc. In your Fortran code replace the call to sgemm with the following line:

call cublas_sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)

The parameter list is identical; the only change is to prefix the sgemm name with cublas_. The same is true for double, complex and double complex data types (s, d, c and z). To compile there are a few more libraries to include; add the following on the link line:

-L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas

On Abel the /usr/local/cuda directory is a symbolic link to the common file structure where most of the module based software is located. It is only present on the nodes that host the GPUs. An example of compilation may look like:

gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90

The file fortran_thunking.o contains wrappers that hide a lot of code that is a bit more complicated. There is also a file called fortran.o which exposes a lot more of the details to the Fortran programmer; refer to the documentation for more information. The example uses gfortran, but the library functions are callable from the most common compilers.
Quite remarkable performance can be obtained using the library functions:

[olews@compute-19-4 Fortran-BLAS]$ ./sgemmdriver.x
 Total footprint A, B & C : 5538 MiB
 CUDA sgemm start
 CUDA sgemm end       15.0307150 secs   1416.83215 Gflops/s
 CPU/MKL sgemm start
 cpu time spent in threaded  146.434982
 CPU dgemm end       146.434982 secs    145.429733 Gflops/s
 Speedup  9.53373528
 diff sum(c)_gpu sum(c)_cpu: 0.0000000000E+00
[olews@compute-19-4 Fortran-BLAS]$

A speedup of more than nine times compared to MKL using all cores in one socket. The nodes on Abel have two GPU cards and two sockets in each node, hence twice these numbers can be expected.

Compiler support for accelerators, OpenACC

The open accelerator initiative aims to make use of accelerator processors like GPUs as easy as OpenMP has made use of multiprocessor systems today. OpenACC follows the same path as OpenMP by means of compiler directives in the source code. The goal is to keep OpenMP and OpenACC similar and develop OpenACC into a framework that is comparable to OpenMP. Noticing the huge success of OpenMP, the stakeholders in OpenACC hope to build on it. Currently only the Portland Group's compiler has any noticeable traction. Their implementation, however, performs remarkably well, as will be demonstrated below. The GNU gcc compiler will gain OpenACC support during 2015, in gcc 5. Oak Ridge National Laboratory is supporting this work [2]. OpenACC is now released in version 2.0a; the web site http://www.openacc.org provides more information. At SC13 it was clear that this is the most feasible way to enable production code for GPU acceleration. Both OpenCL and CUDA are too complex and time consuming when it comes to scientific programs. To start using the GPU, or accelerator as it is called by Portland, one only needs to enter compiler directives into the source code, just as with OpenMP.
A simple example might look like:

!$acc region
do i = 1,n
   r(i) = a(i) * 2.0
enddo
!$acc end region

The compiler will interpret the directives and try to build code that the GPU can process, hopefully with considerably higher performance. OpenACC has a very simple syntax, but mastering the data placement and workload management is another matter. Having said that, the following example might illustrate the potential of a simple approach. Please consider the general matrix matrix multiplication routine ?gemm in its reference implementation form. The core block of the calculation looks like this with the accelerator directives entered:

!$acc region
      DO 90 J = 1,N
          IF (BETA.EQ.ZERO) THEN
              DO 50 I = 1,M
                  C(I,J) = ZERO
   50         CONTINUE
          ELSE IF (BETA.NE.ONE) THEN
              DO 60 I = 1,M
                  C(I,J) = BETA*C(I,J)
   60         CONTINUE
          END IF
          DO 80 L = 1,K
              IF (B(L,J).NE.ZERO) THEN
                  TEMP = ALPHA*B(L,J)
                  DO 70 I = 1,M
                      C(I,J) = C(I,J) + TEMP*A(I,L)
   70             CONTINUE
              END IF
   80     CONTINUE
   90 CONTINUE
!$acc end region

2: https://www.olcf.ornl.gov/2013/11/14/olcf-lends-expertise-for-introducing-gpu-accelerator-programming-to-popular-linux-gcc-compiler/

This is all there is to it in the simplest approach. Compiling is also relatively easy; using the Portland Group's pgfortran the command line looks like:

pgfortran -O3 -ta=nvidia,kepler,time -Minfo=acc -o dgemm_acc.x dgemm_acc.f

Another example is a Jacobi relaxation calculation:

!$acc kernels
do j=1,m-2
   do i=1,n-2
      Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
                                   A(i ,j-1) + A(i ,j+1) )
      error = max( error, abs(Anew(i,j)-A(i,j)) )
   end do
end do
!$acc end kernels

which unfortunately does not yield the desired performance. It turns out that data placement forces unnecessary copying between host memory and device memory. Just by including a data copy statement the performance will increase dramatically, in this case a multifold increase. For details see the performance evaluation section.

!$acc data copy(A, Anew)
do while ( error .gt. tol .and.
          iter .lt. iter_max )
   error=0.0_fp_kind

Here the system will overlap the copying of data to the GPU device memory with the execution of the while loop. This kind of overlap increases the efficiency and yields the very high performance we are after.

!$acc kernels
do j=1,m-2
   do i=1,n-2
      Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
                                   A(i ,j-1) + A(i ,j+1) )
      error = max( error, abs(Anew(i,j)-A(i,j)) )
   end do
end do
!$acc end kernels

Next is to evaluate the performance of the OpenACC initiative. This will be done in a later section of this document. However, it is worth noting the well known bandwidth limitation between the GPU device memory and the host's memory. The connection between the two banks of memory is via the PCIe bus with its limitations. Typical numbers lie in the 6 GiB/s range using pinned host memory, but only about 2 GiB/s using pageable host memory, while intra GPU device memory bandwidth clocks in at about 170 GiB/s. Keeping the data in the GPU device memory for the entire time computation is in progress is a key to success. The OpenACC specification contains a range of directives to accommodate this. In figures 13 and 14 the OpenACC quick guide is reproduced. There are many tutorials on the net to help users master OpenACC programming. OpenACC and OpenMP are like chess: very easy to learn, but mastering them takes time and effort.

Figure 13: OpenACC quick guide page 1.

Figure 14: OpenACC quick guide page 2.

The use of OpenACC is currently the only quick and easy way to start harvesting GPU power for legacy code. As there are no changes to the source code apart from compiler directives, which are comments anyway, this is also a portable way. With support from more and more compiler vendors, including gcc, this way forward looks quite promising.
PART II

Performance evaluation

The very high theoretical performance claimed in the table reproduced in figure 1 makes the GPUs very attractive. However, this performance is never seen in practice, not even for benchmarks written in CUDA or OpenCL. Even so, speedups of several times, not just percent, are commonplace. The following section gives an overview of what can realistically be expected, with the exception of marketing numbers quoted for some applications. A detailed description of the compute nodes is given in appendix tables 8 and 9.

GPU enabled application, NAMD

The current versions (2.9 and 2.10) (X.Y_Linux-x86_64-multicore-CUDA) of NAMD are used for testing. This version is threaded for multicore processors and has built-in support for GPUs (NVIDIA CUDA). For input the apoa1 benchmark has been selected as the test case. The GPU version is run on nodes with the GPUs installed; these dual socket nodes have a somewhat less performant CPU (Intel quad-core E5-2609 at 2.4 GHz). The multicore tests are run using the multicore version of NAMD on Abel's common compute nodes, dual socket (Intel octa-core E5-2670 at 2.6 GHz). These compute nodes were chosen because they are the most representative for comparing real application runs on Abel.

First benchmark, apoa1

The results are shown in figures 15 and 16, where performance is reported as wall run time. The red columns show the run times obtained by the GPU accelerated nodes while the blue show the standard compute nodes. Remember that the CPUs in the two types of nodes are different: the accelerated nodes were designed to employ lower cost processors, as the computation should be performed by the GPUs.

Figure 15: NAMD apoa1 benchmark, single node performance (wall time [secs] versus #CPUs/#GPUs: 8/2, 4/1, 16/0, 8/0, 4/0), GPU accelerated node compared to a standard compute node.
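When reading the comparisons that follow it helps to keep the raw CPU difference between the two node types in mind. A crude normalization, using only the core counts and clock frequencies from the node descriptions (ignoring memory speed and turbo), looks like this:

```python
# Crude per-node CPU capability ratio between the two Abel node types.
# Accelerated node: 2 sockets x 4 cores at 2.4 GHz (E5-2609).
# Compute node:     2 sockets x 8 cores at 2.6 GHz (E5-2670).
acc_ghz = 2 * 4 * 2.4     # aggregate core-GHz of the accelerated node
std_ghz = 2 * 8 * 2.6     # aggregate core-GHz of the standard node
ratio = std_ghz / acc_ghz
print(round(ratio, 2))    # the standard node starts with roughly twice the CPU power
```

So whenever an accelerated node merely matches a standard compute node in these graphs, the GPUs are already contributing roughly a factor of two.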
Note that the CPUs in the compute nodes are faster than the ones in the GPU nodes. Figure 16 shows the difference in performance of a single Abel node when using all the compute power the node can muster. This includes 16 cores on the compute node and two GPUs on the accelerated node. There is a more than 3 fold increase in speed, which shows the kind of potential that can be unleashed by effective usage of the graphics processing unit.

Figure 16: NAMD apoa1 benchmark, single node performance exploiting all compute power within the node (wall time [secs], compute node vs. GPU node).

Second benchmark, DL 1.4M Her1-Her1 membrane

This benchmark originates from Daresbury Laboratory and is part of their benchmark suite for procurement. The results are shown below in figure 17; a speedup of about a factor of two.

Figure 17: Single node performance for the DL NAMD benchmark (1.4M Her1-Her1 membrane, wall time [secs]). One compute node using all 16 cores vs. one GPU accelerated node using both GPUs and all 8 cores.

NAMD 2.10 supports a combination of multinode runs using IB verbs, multicore and CUDA. This opens up a large number of combinations using different sets of ranks, threads and GPUs. The newer version NAMD 2.10 works well with multiple nodes, multiple cores and GPUs.

Figure 18: Scaling of NAMD 2.10-cuda using multiple nodes and multiple GPUs (speedup [times faster] versus #CPUs/#GPUs: 2/2, 4/2, 8/4, 16/4, 16/8, 32/8, 24/12, 16/16, 32/16, 64/16). Performance seems to depend on both the number of cores and the number of GPUs; a good load balance seems to have been implemented. However, too many cores might lead to a decrease in performance.
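The #CPUs/#GPUs labels in figure 18 are picked from a much larger space of legal launch configurations. As a purely illustrative sketch (the helper below and its constraints are not taken from NAMD), such a list can be generated from the node geometry, 8 cores and 2 GPUs per node:

```python
# Enumerate (total CPUs, total GPUs) launch combinations for a small cluster
# of accelerated nodes with 8 cores and 2 GPUs each. Illustrative only.
def launch_configs(nodes, cores_per_node=8, gpus_per_node=2):
    configs = set()
    for n in range(1, nodes + 1):             # number of nodes actually used
        for cpus in (1, 2, 4, 8):             # cores used per node
            for gpus in range(gpus_per_node + 1):  # GPUs used per node
                configs.add((n * cpus, n * gpus))
    return sorted(configs)

# Two nodes already allow configurations such as 16/4 from figure 18:
print((16, 4) in launch_configs(2))
```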
Figure 19: NAMD 2.10 GPU performance, 1.4M Her1-Her1 membrane. Scaling and performance comparison (wall time [seconds] for 1, 2, 4 and 8 nodes) using standard compute nodes and accelerated nodes with two GPUs per node.

GPU enabled application, LAMMPS

The current version (lammps-16Aug2013) of LAMMPS is used for testing. This application is an MPI application with support for GPUs (NVIDIA CUDA). For input the 3d Lennard-Jones melt benchmark has been selected as the test case. The GPU version is run on nodes with the GPUs installed; these dual socket nodes have a somewhat less performant CPU (Intel quad-core E5-2609 at 2.4 GHz). The multicore tests are run on Abel's common compute nodes, dual socket (Intel octa-core E5-2670 at 2.6 GHz). These compute nodes were chosen because they are the most representative for comparing real application runs on Abel.

Figure 20: LAMMPS benchmark (3d Lennard-Jones melt), single node performance (run time [secs], standard compute node vs. accelerated node). MPI run over all processors, with CUDA in addition for acceleration. For the CUDA version the MPI/CUDA combination is used; single precision was used on the GPU.

Figure 20 shows a 3 fold speedup for the Lennard-Jones melt benchmark. Other benchmarks will show a different speedup, as different algorithms are offloaded to the GPU. The GPU/CUDA implementation supports a range of precisions: the calculations can be performed on the GPUs in single precision, double precision or a mixture. This also has an impact on the recorded speed.

GPU enabled application, Beagle

The current version (Beagle 1.0 release 1141) of Beagle is used for testing.
This application is a high-performance library that performs the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages. It can make use of highly parallel processors such as those in graphics cards (GPUs) found in many PCs. It is used in conjunction with the BEAST software (BEAST v1.7.5 is used). The GPU version is run on nodes with the GPUs installed; these dual socket nodes have a somewhat less performant CPU (Intel quad-core E5-2609 at 2.4 GHz). The CPU version uses Abel's standard compute nodes, dual socket (Intel octa-core E5-2670 at 2.6 GHz). These compute nodes were chosen because they are the most representative for comparing real application runs on Abel.

Figure 21: Beast/Beagle benchmark2 performance (run times [secs], accelerated node vs. standard node), aggregated performance for two concurrently running jobs. Run times for each were added together.

GPU enabled application, ADF

The current version of ADF supports GPUs. The results below show timings for two types of nodes. The standard compute node has twice as many cores, running at a higher frequency. The actual GPU speedup is the difference in run time between the accelerated node using the GPUs and a run using only its CPUs. The possible speedup depends on the input and which density functionals the input calls for. Some examples have been tried; not all show any advantage of the GPUs, but some show a significant speedup.

Figure 22: Results from GPU enabled ADF (ADF 2014.01 CUDA version, Al10O15 input; wall time [sec] for a standard node, the accelerated node without GPUs and the accelerated node with 2 GPUs). The effect of GPUs in a node with less CPU power than the standard compute node: the accelerated node shows a 30% reduction in wall time.
Figure 23: ADF scaling (benchmark Caffeine, wall time [secs]) using compute nodes and accelerated nodes. This is an example where the accelerators do not yield enough performance to outperform the more performant cores in the standard compute nodes. Closer study of the timing of individual routines shows speedups of a magnitude or more; the potential is clearly there.

GPU enabled application, Amber

The current version of the molecular dynamics package Amber, version 14, supports GPUs for a selection of functions. Please review the Amber 14 documentation for more details, or http://ambermd.org/gpus/ for up to date information. This Amber version supports the combination of CUDA and MPI, which opens up a large range of processor placements and job distributions. The Amber package comes with a set of benchmarks that can be used to evaluate the performance.

Figure 24: Amber 14 benchmark performance on a single node, using up to 2 GPUs and all cores. Best speedups using GPUs compared to CPUs only are reported for the benchmarks JAC_PRODUCTION_NVE and NPT (23,558 atoms, PME, 2fs/4fs), FACTOR_IX_PRODUCTION_NVE and NPT (90,906 atoms, PME), CELLULOSE_PRODUCTION_NVE and NPT (408,609 atoms, PME) and TRPCAGE_PRODUCTION (304 atoms, GB).

Quite remarkable speedups can be obtained, see figure 24 above and figures 25 and 26 below. However, scaling over several nodes using MPI does not seem very good. In addition, it seems that there should be only one MPI rank per GPU to yield the best performance. A hybrid model with load balancing would be best, but this is not yet supported. The data suggest that the benefit of using GPUs for this application is substantial.
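The observation that one MPI rank per GPU works best can be expressed as a simple placement rule. The sketch below is a hypothetical helper, not Amber code: it maps node-local ranks onto GPU device ids round-robin and flags oversubscription, the situation the benchmarks above suggest should be avoided.

```python
# Hypothetical rank-to-GPU placement helper; illustrative, not taken from Amber.
def assign_gpu(rank, gpus_per_node, ranks_per_node):
    """Round-robin a node-local MPI rank onto a GPU device id."""
    local_rank = rank % ranks_per_node
    return local_rank % gpus_per_node

def oversubscribed(gpus_per_node, ranks_per_node):
    """More than one rank per GPU; the Amber 14 data suggest this costs performance."""
    return ranks_per_node > gpus_per_node

# 2 ranks on a node with 2 GPUs: one rank per device, the preferred layout.
print([assign_gpu(r, 2, 2) for r in range(2)])   # [0, 1]
print(oversubscribed(2, 2))                      # False
print(oversubscribed(2, 8))                      # True: 4 ranks share each GPU
```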
To this must be added (taken from the Amber 14 manual, page 313): “One of the newer features of PMEMD is the ability to use NVIDIA GPUs to accelerate both explicit solvent PME and implicit solvent GB simulations. This work is by Ross Walker at the San Diego Supercomputer Center and Scott Le Grand at Amazon Web Services, in collaboration with NVIDIA.”

As is often the case, the involvement of NVIDIA is needed to get acceptable performance. Help from NVIDIA is appreciated, but it is not a sustainable situation.

Figure 25: Amber 14 CUDA version running the benchmark Myoglobin using an implicit solver (run times [seconds/ns]), on two nodes using both MPI and CUDA, for #CPUs:#GPUs combinations 16:0, 2:2, 4:2, 4:4, 8:4 and 16:4.

Figure 26: Amber 14 CUDA version running the benchmark cellulose production using an explicit solver (run times [seconds/ns]), on two nodes using both MPI and CUDA, for #CPUs:#GPUs combinations 16:0, 2:2, 4:2, 8:2, 4:4, 8:4 and 16:4.

GPU enabled application, cp2k

The current version of the molecular dynamics package cp2k has some support for NVIDIA CUDA acceleration. It compiles and builds using a combination of Intel MKL and NVIDIA libraries. The build process is, as always, a complex task, but well documented. There is a battery of versions to choose from: shared memory OpenMP versions, serial, CUDA enabled, and MPI versions, both hybrid and CUDA enabled. The combination of CUDA and MKL should be quite a powerful one. The cp2k CPU performance is good, but the extra performance gained by using the NVIDIA cards is not yet up to scale. A blog entry from NVIDIA suggests a modest 2.5x gain in performance: http://blogs.nvidia.com/blog/2012/08/23/unleash-legacy-mpi-codes-with-keplers-hyper-q/.
However, this is done by using MPI and launching many threads onto the NVIDIA processor, one for each MPI rank. This works quite well, as we have seen from the tests using NAMD. For the more common approach using a single CPU and a GPU, the gain versus a single CPU is quite modest. Tests on Abel show very little gain when using the GPUs. MPS 3 works well for a single GPU on a node, but when scheduling 4 MPI ranks onto a GPU the difference between MPS and no MPS is not obvious. That MPS only supports a single GPU per node is really a showstopper for systems with two GPUs per node. For runs using a single CPU and a single GPU there is some performance gain, see the figure below.

Figure 27: cp2k single core serial performance (wall time [secs]), one core versus one core using GPU acceleration, on the H2O-64 benchmark. Only a slight increase in performance.

3 https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

However, very few runs with cp2k would be done in this way. The interesting results are the multi core runs using either shared memory with OpenMP or multi node with MPI. The figure below shows what to expect when comparing a compute node with 16 2.6 GHz cores against an accelerated node with 8 2.4 GHz cores and GPUs. It is evident that the GPU utilization is too small to make up for the fewer cores in the accelerated nodes.

Figure 28: cp2k performance on two nodes with MPI and GPUs (benchmark H2O-128, all cores and all GPUs; wall time [seconds]), comparing compute nodes with accelerated nodes.
The more cores in the compute node clearly outperform the fewer cores (16 vs 8) in the accelerated node, even if some work is done by the GPUs. Clearly some more work remains to be done with this application before it can make full use of the NVIDIA GPUs. It spends a significant part of its time in MKL, so any transfer of MKL run time to the GPU would be beneficial.

GPU enabled application, Gromacs

The current version of the molecular dynamics package Gromacs has support for NVIDIA CUDA acceleration. Gromacs is well documented and uses cmake to build. It supports a range of different schemes for parallel runs, multicore, MPI and combinations thereof, in addition to CUDA support.

Figure 29: Gromacs performance running the rnase_dodec benchmark (wall time [secs] for 1, 2, 4 and 8 cores, compute node vs. accelerated node). This version of Gromacs was built with GNU Fortran and uses fftw3. The current dataset could not be run on 16 cores.

The figure above shows the performance gain possible with a GPU accelerated node. Run times are too short to make any sense of runs on multiple nodes. Just looking at the numbers one might expect a twofold speedup using GPUs, which is significant, and likely to increase as more functions in Gromacs are ported to GPUs.

GPU enabled application, Gaussian

The current version of Gaussian does not yet support GPUs. There is no given date for the release of a GPU enabled Gaussian as of August 2014 (email from Gaussian Inc, August 14th 2014): “Also, I have been advised by my technical support that a GPU-enabled version of Gaussian requires changes to compilers as well as changes to Gaussian, therefore, unfortunately, we still cannot provide a time frame as to when such a port would be available.
However, while again, we cannot make any guarantees, we hope to include some level of GPU support in the next major release of Gaussian (although the release date of the next major version has not yet been determined). I am very sorry that I do not have more exact information to provide to you, but we will be sure to contact you when a GPU-enabled version of Gaussian becomes available.”

The results below are taken from presentations given by NVIDIA/PGI/Gaussian in the spring of 2014. The setup is dramatically different from the Abel setup: that system has two sockets with 10 cores each and no less than 6 x NVIDIA Tesla K40 GPUs.

Figure 30: Example of speedup obtained using 6 x Tesla K40 GPUs running Direct SCF in Gaussian.

The speedups shown are not for production code, and as they employ 6 GPUs more powerful than those currently installed in Abel, the numbers are just an illustration; marketing numbers.

Figure 31: Example of speedup using 6 x Tesla K40 GPUs running a Coupled Cluster calculation.

If the speedups shown in figure 31 are in any way indicative of future development, Gaussian might become a major user of GPUs. However, as of the start of 2015 no GPU enabled Gaussian version is available.

GPU enabled application, VASP

The current version of VASP does not yet support GPUs. The results below are taken from presentations given by NVIDIA/PGI in the spring of 2014. There is no given date for the release of a GPU enabled VASP; virtually no information is available. The setup is dramatically different from the Abel setup: that system has two sockets with 10 cores each and no less than 6 x NVIDIA Tesla K40 GPUs.

Figure 32: Example of speedup using 6 x Tesla K40 GPUs running a VASP calculation.

The performance gains shown indicate that GPUs will provide a significant speedup. We just have to wait for the production release and do an evaluation then.
GPU enabled application, NCBI BLASTp

The NCBI version of the BLAST package provides a GPU enabled version of BLASTp. This should provide accelerated protein search. Performance information provided on the web suggests speedups of several times. More information on BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi

Local testing with a set of inputs and databases does show significant speedups, but not at the scale needed for widespread adoption. The figure below shows that the GPU only yields a marginal speedup, and that it is not enough to outperform the normal compute nodes. 4 cores and one GPU (half of an accelerated node) are compared against half of a standard compute node, which equals 8 cores.

Figure 33: NCBI BLASTp performance (wall time [sec]): accelerated node using one GPU (blue) and standard compute node (red), for 2, 4, 8 and 16 cores and for 4 cores plus 1 GPU. Benchmark is the swissprot db with a bacillus query.

GPU enabled application, Bowtie

The well known bioinformatics application Bowtie has been ported to take advantage of NVIDIA GPUs. From the web page (http://nvlabs.github.io/nvbio/nvbowtie_page.html): “nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner”. Local tests have been performed and show speedups comparable to the reported ones. nvBowtie utilises only one GPU, hence the reported results compare one GPU and one socket on the accelerated node versus one socket (8 threads) on the standard compute node.

Figure 34: Performance of nvBowtie aligning mouse genome sequences (wall time [sec]): accelerated node with one GPU vs. standard compute node with one socket (8 threads).
As shown in the figure there is a speedup of about 4.5 using the GPU compared to one processor in a standard compute node. As Bowtie scales quite well, this number drops to 2.7 if all cores in the standard compute node are used. However, the accelerated node has two GPUs, so it is fair to compare against only one processor in the standard node. From a volume production standpoint the accelerated node outperforms the standard node by a factor of about 4.5.

GPU enabled application, BWA / BarraCUDA

The well known bioinformatics application BWA has a GPU enabled implementation called BarraCUDA. From the web page (ref: http://seqbarracuda.sourceforge.net/index.html): “Started in 2009, the aim of the BarraCUDA project is to develop a sequence mapping software that utilizes the massive parallelism of graphics processing units (GPUs) to accelerate the inexact alignment of short sequence reads to a particular location on a reference genome. Being based on BWA (http://bio-bwa.sf.net) from the Sanger Institute, BarraCUDA delivers a high level of alignment fidelity and is comparable to other mainstream alignment programs. It can perform gapped alignment with gap extensions, in order to minimise the number of false variant calls in re-sequencing studies.“

Some local tests have been run and the performance recorded and evaluated. Unfortunately there was no major speedup with the inputs tested. Good GPU enabled applications are hard to come by.

Figure 35: BWA versus BarraCUDA wall times [secs] for the mouse genome; all 16 cores used in the standard node, both GPUs used in the accelerated node. We see that it is hard, even with two GPUs, to beat 16 threads in two sockets.

Another example is drawn from scientific production on the Abel system.
This is an example of real scientific production runs. The reference is about 300 Mbytes while the query is 1.7 Gbytes. A lot of these are processed on Abel on a routine basis. Figure 37 shows the wall times recorded to run the alignment using different nodes. It is clear that for this type of job the speedup is quite large: the GPU node performs the alignment over 5 times faster than a standard compute node.

Figure 36: Barracuda and BWA performance (wall time [seconds]) recorded for a set of alignments performed in scientific production, for 1 and 2 GPUs and for 1, 2, 4, 8 and 16 cores on standard nodes. BWA scales well up to 8 cores.

Figure 37: BWA versus Barracuda for production research data, using all resources in one node (16 cores / 2 GPUs). A speedup of about 5 is recorded for this input.

GPU enabled application, cuBLASTP

At the GPU conference in March 2014 (http://on-demand-gtc.gputechconf.com/) a number of application cases were presented. Below is a facsimile (edited) of a poster showing the speedup of this application. Speedups from 3 to 8x are presented, some measured against a single core, others against all cores in a socket.

Figure 38: A poster presented at the GPU conference in March 2014 showing speedup for the cuBLASTP application. The histograms have been magnified to show the speedup.

This kind of speedup looks nice, but widespread adoption of GPU accelerated cuBLASTP has yet to be seen. This software is licensed and not open source; a Google search points to a senior license manager.
GPU enabled application, AUTODOCK

At the GPU conference in March 2014 (http://on-demand-gtc.gputechconf.com/) a number of application cases were presented. Below is a facsimile (edited) of a poster showing the speedup of this application. Quite good speedups are reported for Autodock, up to over 40 with some selected datasets.

Figure 39: A poster presented at the GPU conference in March 2014 showing speedup for the Autodock application. The histograms have been magnified to show the speedup.

GPU enabled libraries

As we have learned on page 20 about the libraries that provide functions and routines that execute the computation on the GPU, this section shows what can be expected in realistic benchmarks.

BLAS level 3 matrix multiplication

The first example shows BLAS level 3 matrix-matrix operation performance. The most common is general matrix matrix multiplication, [s,d,c,z]gemm, where s, d, c and z represent the precision as is standard for this kind of routine (s-single, d-double, c-complex, z-double complex). The benchmark is run on two kinds of nodes just like the NAMD example above: one accelerated node with less performant CPUs and a compute node with high performance processors; see the text on page 27 for details, or the appendix for the full details about the hardware and software. Fully optimized libraries have been used in all cases: for the compute nodes the multithreaded version of Intel MKL, and on the GPU nodes the NVIDIA CUBLAS library. However, as there is one GPU per socket, the tests on the compute node only use one socket, i.e. 8 cores. The combined numbers when using both GPUs and both sockets are assumed to be close to twice the recorded numbers. The reason for only using one GPU is that there is presently no load balancing by which a single call to a routine can be split between two GPUs.
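The "total matrix footprint" figures used on the x-axes of the following graphs are consistent with three square matrices of order N resident at once (A, B and C), which suggests the footprints were computed as 3·N²·w bytes, with w = 8 bytes for dgemm and w = 4 bytes for sgemm. A small check of that assumption:

```python
# Check: total footprint of a gemm call = 3 matrices x N^2 elements x w bytes.
def footprint_mib(n, word_bytes):
    return 3 * n * n * word_bytes / 2**20

# dgemm (8-byte reals) for N = 2000, 4000, 8000, 12000, 15000,
# truncated to whole MiB as in the figures:
print([int(footprint_mib(n, 8)) for n in (2000, 4000, 8000, 12000, 15000)])
# -> [91, 366, 1464, 3295, 5149], matching the dgemm footprint series
# sgemm (4-byte reals): N = 2000 gives the first sgemm point
print(int(footprint_mib(2000, 4)))   # 45, matching the 45 MiB entry
```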
The test is set up to compare the compute nodes versus the accelerated nodes in Abel. Hence we use the best performing hardware and libraries for the tests.

Figure 40: BLAS level 3 dgemm, double precision (64 bits), performance [Gflops/s] using one GPU (CUDA BLAS) in an accelerated node vs. one socket (MKL BLAS) in a compute node, for total matrix footprints of 91, 366, 1464, 3295 and 5149 MiB. One socket equals 8 cores.

The results show significant speedups for both single precision and double precision. As expected, the double precision performance is significantly lower than the single precision. The demonstrated speedup using the accelerator versus a standard Intel Sandy Bridge processor is about 2.5 for double precision and 3.4 for the single precision case.

Figure 41: BLAS level 3 sgemm, single precision (32 bits), performance [Gflops/s] using one GPU (CUDA BLAS) in an accelerated node vs. one socket (MKL BLAS) in a compute node, for total matrix footprints of 45, 183, 732, 1647, 2575, 3708, 4577 and 5538 MiB. One socket equals 8 cores.

In quantitative numbers this can be compared to the theoretical numbers found in table 1. Table 2 shows the BLAS matrix matrix multiplication performance in relation to the theoretical performance.

                                           Double precision   Single precision
    Theoretical performance [Gflops/s]     1310               3950
    BLAS performance [Gflops/s]            482                1300
    Efficiency [fraction of theoretical]   36.8 %             32.9 %

Table 2: Efficiency and utilization of the compute power, obtained numbers vs. theoretical performance.

As a note it should be mentioned that the size of the problems that can be attacked by this approach is limited to the matrix sizes that can be accommodated by the GPU card's memory, in the present case 6 GiB.
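The efficiency figures in table 2 are simply measured performance over theoretical peak; recomputing them is a quick sanity check, using the table values and the K20X peaks quoted in the preface:

```python
# Efficiency = measured BLAS Gflops/s divided by theoretical peak, as in table 2.
peak = {"double": 1310.0, "single": 3950.0}   # K20X theoretical [Gflops/s]
blas = {"double": 482.0, "single": 1300.0}    # measured cublas gemm [Gflops/s]

for prec in ("double", "single"):
    eff = 100.0 * blas[prec] / peak[prec]
    print(f"{prec}: {eff:.1f} %")
# double: 36.8 %   single: 32.9 %
```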
Linear algebra and the top500 test

A nice example of what is possible is High Performance Linpack (HPL), the famous top500 test. A version has been developed that supports GPUs with load balancing between CPUs and GPUs. It uses the Intel MKL BLAS libraries and CUDA BLAS libraries to squeeze all possible performance out of the compute hardware installed in a node. Only three linear algebra functions are used: dgemm, dtrsm and the LU factorization. Not only does the implementation provide GPU support for several GPUs in each node, it is still the well known MPI application, so one can use as many resources as one might like. There are examples of large clusters on the top500 list that use this approach on a large scale.

Figure 42: HPL implementation showing overlap of computation using GPU and CPU.

For a single node using all 8 processor cores (two Intel E5-2609 quad core 2.4 GHz processors) and the two installed K20 GPUs, the following performance was recorded:

    Params     size    block  nxm   time     Tflops   %peak
    WR12C2L8   85000   1280   1x2   222.07   1.844    66.5
    WR12C2L8   85000   1280   1x2   222.28   1.842    66.4
    WR12C2L8   85000   1280   1x2   222.31   1.842    66.4
    WR12C2L8   85000   1024   1x2   222.87   1.837    66.2
    WR12C2L8   85000   1280   1x2   222.84   1.837    66.2
    WR12C2L8   85000   1024   1x2   222.92   1.837    66.2
    WR12C2L8   85000   1280   1x2   223.03   1.836    66.2

Table 3: Recorded performance of a single node running HPL using all 8 cores and both GPUs.

Compare this 1.8 Tflops/s with the performance obtained using all cores on a more powerful compute node, shown below, which clocks in at only 0.32 Tflops/s:

    Params     size    block  nxm   time      Tflops   %peak
    WR11R2R4   87500   180    4x4   1404.46   0.318    95.6
    WR11R2R4   87500   200    4x4   1408.52   0.317    95.3

Table 4: Recorded performance of a single compute node running HPL using all 16 cores. This node has more powerful CPUs than the GPU node.
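The Tflops column in table 3 can be reproduced from the standard HPL operation count, 2/3·N³ + 2·N² floating point operations for a problem of size N, divided by the wall time. The peak used for the %peak column appears to be the node aggregate: two K20X at 1.31 Tflops/s double precision plus 8 Sandy Bridge cores at 8 flops/cycle and 2.4 GHz (the peak decomposition is our reading of the numbers, not stated in the HPL output). A check of the first table row:

```python
# Reproduce the HPL Tflops and %peak figures of table 3 from N and wall time.
def hpl_tflops(n, seconds):
    flop = (2.0 / 3.0) * n**3 + 2.0 * n**2   # standard HPL operation count
    return flop / seconds / 1e12

tflops = hpl_tflops(85000, 222.07)
print(round(tflops, 3))                      # 1.844, as in the first table row

node_peak = 2 * 1.31 + 8 * 2.4 * 8 / 1000.0  # 2 GPUs + 8 CPU cores [Tflops/s]
print(round(100.0 * tflops / node_peak, 1))  # ~66.5, matching the %peak column
```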
Figure 43: HPL / top500 benchmark performance, double precision HPL on a single node (performance [Tflops/s]), using all resources in the two types of nodes: accelerated node using both GPUs and all CPUs versus one compute node using all CPUs. The CPUs on the compute node are more powerful than the CPUs in the accelerated node.

Figure 43 shows the very high HPL performance provided by the GPUs, a more than 5 fold increase in performance. The HPL example clearly shows what kind of performance can be obtained using the right software. The time and effort that has gone into the development of this HPL implementation is not small, and very few applications can be optimized in this way. HPL being an MPI application, it can run over several nodes just like the ordinary HPL application.

Figure 44: HPL / top500 benchmark, double precision, MPI over IB with GPU accelerated compute nodes (performance [Tflops/s] for 1, 2, 3, 4, 6, 8 and 14 nodes). 14 nodes are less than half a rack (36 nodes). One rack would yield about 50 Tflops/s (36/14*20); assuming perfect scaling we would only need 20 racks to build a Petaflops cluster, about the size of Abel in rack space.

Matlab with GPU enabled libraries

Matlab comes with a range of functions that are GPU enabled. This can provide very easy access to the accelerator performance.

Figure 45: Matlab with a range of GPU enabled functions.

More information can be found at http://www.mathworks.se/discovery/matlab-gpu.html or http://www.mathworks.se/products/parallel-computing/

OpenACC compiler support

For detailed information about OpenACC please see http://www.openacc-standard.org/. Currently (Aug 2014) only Portland and CAPS support OpenACC (the Cray compiler supports Cray systems).
There are a few initiatives to include support for OpenACC in the open source compilers. GCC will most probably have support some time during 2015, see below.

Figure 46: Roadmap for gcc with respect to OpenACC support, taken from http://techenablement.com.

There are also examples of open source compilers supporting OpenACC; these are an alternative while waiting for gcc 5.0:

Figure 47: An example of an open source compiler supporting OpenACC.

Performance is presented from runs of the NASA NPB benchmarks.

Figure 48: Performance of the open source compiler OpenUH with OpenACC support.

Linear algebra using Portland

For the linear algebra example using matrix matrix multiplication, the reference code in plain Fortran is used. The comparison is between the Fortran code compiled with full compiler optimization using a single core and the same code with accelerator directives. The test is to see what kind of speedup can be obtained using compiler options on single threaded reference code. The same code could be treated with OpenMP directives and the speedup recorded, but that is another study. The results from runs using dgemm and sgemm are shown in tables 5 and 6. The performance is quite low, which is often the case when testing plain Fortran code; it is only when using highly optimized libraries that a large fraction of the theoretical performance is achieved. However, it clearly shows the huge benefit of accelerated code. Speedups recorded range from over 6 for double precision to over 8 for single precision.

    Footprint [MiB]   PGI f77 [Gflops/s]   PGI accelerated [Gflops/s]   Speedup
    91                2.51                 2.33                         0.928
    366               2.17                 9.48                         4.37
    1464              2.21                 14.0                         6.33
    3295              2.20                 14.1                         6.41
    5149              1.79                 11.9                         6.65

Table 5: dgemm (double precision) performance using plain f77 code versus plain f77 code with OpenACC directives inserted. A single core on the GPU node was used, and a single GPU.
Footprint [MiB]   PGI f77 [Gflops/s]   PGI accelerated [Gflops/s]   Speedup
  45              3.34                  2.28                        0.682
 183              3.35                 11.6                         3.46
 732              3.29                 24.0                         7.29
1647              3.23                 26.9                         8.33
2575              4.38                 19.9                         4.54
3708              4.36                 22.0                         5.05
4577              4.35                 27.9                         6.41
5538              4.36                 22.0                         5.05

Table 6: sgemm (single precision) performance using plain f77 code versus plain f77 code with OpenACC directives inserted. A single core on the GPU node was used, and a single GPU.

Figures 49 and 50 show this in another form, where the huge improvement in performance from just inserting compiler directives is evident.

Figure 49: Performance [Gflops/s] of plain Fortran 77 code with OpenACC directives vs. plain f77 code, double precision dgemm reference implementation, as a function of total matrix footprint [MiB].

Figure 50: Performance [Gflops/s] of plain Fortran 77 code with OpenACC directives vs. plain f77 code, single precision sgemm reference implementation, as a function of total matrix footprint [MiB].

Jacobi relaxation calculation

The code contains loops like:

    !$omp do reduction( max:error )
    !$acc kernels
          do j=1,m-2
            do i=1,n-2
              Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
                                           A(i ,j-1) + A(i ,j+1) )
              error = max( error, abs(Anew(i,j)-A(i,j)) )
            end do
          end do
    !$acc end kernels
    !$omp end do

This is the core loop; directives for both OpenMP and OpenACC are shown. The difference between step1 and step2 is the inclusion of a directive that copies data to the GPU device memory once, outside the iteration loop:

    !$acc data copy(A, Anew)
          do while ( error .gt. tol .and. iter .lt. iter_max )
            error=0.0_fp_kind

The data copy region encloses the whole while loop, so the arrays stay resident in device memory between iterations, yielding far better performance.
The complete code can be found on the net as exercises for OpenACC (openacc-examples): http://www.openacc.org/Sample_Codes

Size mesh         OpenMP 4 cores   OpenMP 8 cores   OpenACC step1   OpenACC step2
 8192 x  8192      44.8             41.1             725.2           13.0
16384 x 16384     178.1            163.6            2918.5           53.4

Table 7: Comparing OpenMP and OpenACC directives. Numbers are wall time in seconds.

Appendix

Node configuration

Compute node:
  Vendor:        Megware, Myriquid / Supermicro
  Mainboard:     Supermicro X9DRT
  Processor:     2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 8 cores
  L2 cache:      8-way set-associative, 2048 kB, write back
  L3 cache:      20-way set-associative, 20480 kB, write back
  Memory:        8 x Samsung DDR3 registered 8192 MB, 1600 MHz
  InfiniBand:    Mellanox ConnectX-3 FDR
  OS:            CentOS release 6.4 (Final), later upgraded to 6.6
  Compilers:     gcc 4.8.0, 4.8.2, 4.9.2 / pgi 13.3 and 13.9
  MPI:           OpenMPI 1.6.4, 1.8, 1.8.4
  Math library:  Intel MKL 2013.3, 2015.1

Table 8: Node configuration, standard Abel compute node.

Accelerated node:
  Vendor:           Megware, Myriquid / Supermicro
  Mainboard:        Supermicro 9DRG-HF
  Processor:        2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz, 4 cores
  L2 cache:         8-way set-associative, 1024 kB, write back
  L3 cache:         20-way set-associative, 10240 kB, write back
  Memory:           8 x Samsung DDR3 registered 8192 MB, 1066 MHz
  InfiniBand:       Mellanox ConnectX-3 FDR
  GPU accelerator:  2 x NVIDIA Tesla K20Xm, compute capability 3.5, 6 GiB memory,
                    2688 SP cores, 896 DP cores
  OS:               CentOS release 6.4 (Final), later upgraded to 6.6
  Compilers:        gcc 4.8.0, 4.8.2, 4.9.2 / pgi 13.3 and 13.9
  Math library:     Intel MKL 2013.3, 2015.1
  MPI:              OpenMPI 1.6.4, 1.8, 1.8.4
  CUDA:             5.0, 5.5, 6.0, 6.5 and 7.0

Table 9: Node configuration, accelerated Abel compute node.
References

Bioinformatics and Life Sciences

NVIDIA web site about applications:
http://www.nvidia.co.uk/object/gpu-computing-applications-uk.html
http://www.nvidia.co.uk/object/bio_info_life_sciences_uk.html

Porting of VASP to support GPUs:
http://www.ncbi.nlm.nih.gov/pubmed/22903247

ACEMD:
http://www.acellera.com/products/acemd/

HYDRO:
http://www.prace-ri.eu/IMG/pdf/porting_and_optimizing_hydro_to_new_platforms.pdf

MARK:
http://warnercnr.colostate.edu/~gwhite/mark/mark.htm

Notes on optimization:
http://software.intel.com/en-us/articles/step-by-step-optimizing-with-intel-c-compiler

-O1/-Os
This option enables optimizations for speed and disables some optimizations that increase code size and affect speed. To limit code size, it enables global optimization, which includes dataflow analysis, code motion, strength reduction and test replacement, split-lifetime analysis, and instruction scheduling. It also disables inlining of some intrinsics. If -O1 is specified, -Os is enabled by default. At -O1, compiler auto-vectorization is disabled. If your application is sensitive to code size, you may choose the -O1 option.

-O2
This option enables optimizations for speed and is the generally recommended optimization level. Compiler vectorization is enabled at -O2 and higher levels. With this option the compiler performs basic loop optimizations, inlining of intrinsics, intra-file interprocedural optimization, and the most common compiler optimization techniques.
-O3
Performs the -O2 optimizations and enables more aggressive loop transformations such as fusion, block-unroll-and-jam, and collapsing of IF statements. The -O3 optimizations may not give higher performance unless loop and memory access transformations take place, and in some cases may even slow down code compared to -O2. The -O3 option is recommended for applications that have loops that heavily use floating-point calculations and process large data sets.

Notes on MKL:
http://software.intel.com/en-us/articles/intel-mkl-automatic-offload-enabled-functions-for-intel-xeon-phi-coprocessors