Experiences with NVIDIA GPUs
Abel compute cluster at UiO 2014-2015
Ole W. Saastad, PhD
UiO/USIT/UAV/ITF/FI
Feb. 2015
Preface
The introduction of Graphic Processing Units (GPUs) into high performance computing has
introduced computing elements with very high compute capability as well as high complexity.
The entry barrier has until recently been quite high. This has changed over the last few years with
the introduction of new software that tries to provide easier access. However, the threshold is still considerable.
The motivation has been the very high compute capacity. A typical node in the Abel cluster
provides 320 Gflops/s of performance when running the top500 test. The same test using a less
powerful node, but with two GPUs, clocks in at 1800 Gflops/s. This represents a more than 5 fold
increase in performance at a low, but not insignificant, cost. The top500 test solves a linear problem
that requires a lot of data to be moved from the host to the GPU memory and back; other tasks
involving less data transfer and more computation will show far higher gains in performance. As a
reference, the theoretical performance claimed for the NVIDIA K20X GPU is 3.95 / 1.31 Tflops/s
for single and double precision floating point numbers.
This document contains two parts. The first is a review of the applications relevant for NOTUR.
The second is a performance evaluation of a range of relevant applications and GPU usage.
Performance is reported in two different ways: in many cases as wall clock time, which is the total
time from launch to completion, and in some cases as throughput, i.e. jobs per unit of time.
The first is of the category lower is better, while the latter is higher is better. Please
examine the graphs and note which measurement type is used.
Experiences with NVIDIA / CUDA
Introduction
The modern (2013) GPU (NVIDIA Kepler II, K20X) is a very complex chip containing 7.1
billion transistors built with 28 nm silicon production technology.
The claimed theoretical performance seems rather impressive, with close to 4 Tflops/s in single
precision (32 bit reals) and 1.3 Tflops/s in double precision (64 bit). However, GPUs have always
shown very impressive performance numbers which have been quite hard to attain for applications or
even for benchmarks. Below we'll try to explain how the GPU processor works and why optimal
utilization is quite hard.
The instruction set is quite comprehensive and contains a rich set of instructions for flow
control, conversion, and integer and floating point arithmetic. The floating point instructions are shown
in figure 2 and the 64 bit full precision ones in table 1. Note the lack of trigonometric and log/exp
instructions in full precision: they are available neither in single nor in double precision, only in
reduced precision. They are not needed for gaming, hence not implemented. These instructions are also not present in the
SSE units on the CPU, only in the less used x87 unit. However, fused multiply add is included, which is still
lacking on Intel Sandy Bridge. It is also interesting to note the reciprocal instruction; reciprocals are
much faster than a full division.
Figure 2: Floating point instructions supported by NVIDIA K20 processor.
This instruction set is normally not exposed to, nor directly used by, ordinary users. The
language Parallel Thread Execution (PTX) is a pseudo-assembly language used in Nvidia's CUDA
programming environment. The nvcc compiler translates code written in CUDA, a C-like language,
into PTX, and the graphics driver contains a compiler which translates the PTX into binary code
which can be run on the processing cores.
        add  sub  mul  mad  fma  div  rcp  sqrt  abs  neg  min  max
GPU      x    x    x    x    x    x    x    x     x    x    x    x
CPU      x    x    x    x         x    x    x     x    x    x    x
Table 1: Implemented 64 bit instructions with full precision in the different processor types.
From the table above we see that all the common floating point instructions needed for HPC are implemented.
The log, exp and trigonometric functions are implemented in software, just as they are on the Xeon.
The NVIDIA Tesla architecture is built around a scalable array of multithreaded Streaming
Multiprocessors (SMs). When a host program invokes a kernel grid, the blocks of the grid are
enumerated and distributed to multiprocessors with available execution capacity. The threads of a
thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks
are launched on the vacated multiprocessors.
Figure 3: High level overview of the K20 GPU chip showing the 15 SMX units, L2 cache and interfaces.
A multiprocessor consists of multiple Scalar Processor (SP) cores, a multithreaded instruction unit,
and on-chip shared memory. The multiprocessor creates, manages, and executes concurrent threads
in hardware with zero scheduling overhead. It implements single-instruction barrier
synchronization. Fast barrier synchronization together with lightweight thread creation and
zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing, for example,
a low granularity decomposition of problems by assigning one thread to each data element (such as
a pixel in an image, a voxel in a volume, or a cell in a grid-based computation). To manage hundreds
of threads running several different programs, the multiprocessor employs an architecture NVIDIA
calls SIMT (single-instruction, multiple-thread).
The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes
independently with its own instruction address and register state. The multiprocessor SIMT unit
creates, manages, schedules, and executes threads in groups of parallel threads called warps. (This
term originates from weaving, the first parallel thread technology.) Individual threads composing a
SIMT warp start together at the same program address but are otherwise free to branch and execute
independently.
The GPU contains a large number of stream processors. These are scheduled more or less
individually by the chip's scheduler. Hence the GPU cannot be considered a kind of vector processor, but
rather a large pool of small processor cores. However, the cores operate on so called warps (see below for
more information), which is akin to a SIMD or vector concept. All of this shows that one needs very
powerful tools to convert C or Fortran code to this kind of processor machinery.
Figure 4: Detailed view of one of the 15 SMX processors. There is only one double
precision unit for every three processor cores, while there is one single precision unit in
each processor core. Hence the 1/3 dp performance.
To harvest the very high performance one needs to program all these processor units to operate
coherently. Take a look at the execution model1 (extract given here):
The execution model is very unusual, especially compared to typical "what one thread should do"
machine code.
• A "thread" means one single execution of a kernel. This is mostly conceptual, since the
hardware operates on warps of threads.
• A "warp" is a group of 32 threads that all take the same branches. A warp is really a SIMD
group: a bunch of floats sharing one instruction decoder. The hardware does a good job
with predication, so warps aren't "in your face" like with SSE.
• A "block" is a group of a few hundred threads that have access to the same "__shared__"
memory. The block size is specified in software, but limited by hardware to 512 or 1024
threads maximum. More threads per block is generally better, with serious slowdowns for
less than 100 or so threads per block. The PTX manual calls blocks "CTAs" (Cooperative
Thread Arrays).
• The entire kernel consists of a set of blocks of threads. The memory model is also highly
segmented and specialized, unlike the flat memory of modern CPUs.
• "registers" are unique to each thread. Early 8000-series cards had 8192 registers available;
the GTX 200 series had 16K registers; and the newer Fermi GTX 400s have 32K registers.
Registers are divided up among threads, so the fewer registers each thread uses, the more
threads the machine can keep in flight, hiding latency.
• "shared" memory is declared with __shared__, and can be read or written by all the threads
in a block. This is handy for neighborhood communication where global memory would be
too slow. There is at least 16KB of shared memory available per thread block; Fermi cards
can expose up to 48KB with special config options. Use "__syncthreads()" to synchronize
shared writes across a whole thread block.
• "global" memory is the central gigabyte or so of GPU RAM. All cudaMemcpy calls go to
global memory, which is considered "slow" at only 100GB/second! Older hardware had
very strict rules on "global memory coalescing", but luckily newer (Fermi-era) hardware
just prefers locality, if you can manage it.
• "param" is the PTX abstraction around the parameter-passing protocol. They reserve the
right to change this, as hardware and software changes.
• constants and compiled program code are stored in their own read-only memories.
• "local" memory is unique to each thread, but paradoxically slower than shared memory.
Don't use it!
1 http://www.cs.uaf.edu/2011/spring/cs641/lecture/03_03_CUDA_PTX.html
Very few programmers get exposed to this kind of detailed level of programming, but it serves to
explain why it is very hard to compile and generate code that gets scheduled effectively on
this kind of processor.
Figure 5: Kepler memory hierarchy.
The more adventurous embark on CUDA, a C-like language, and there is a large
community of CUDA users. However, this is out of scope for most scientific users who do not have
programming as their field of study. This report will not cover CUDA programming as such. There
is a lot of material, text books and tutorials about CUDA available.
PART I
How to access the GPU power for the scientific user
As we have seen above, GPUs are highly complex and very hard to program. We need to take on
some powerful tools to hide the complexity.
Three common user friendly ways can be taken. CUDA and OpenCL programming
is omitted because it is regarded as neither user friendly nor easy to program. For those who do take on
this challenge the gains can be very high. An introduction to CUDA and OpenCL is not within the
scope of this work.
a) Precompiled applications
Several applications are distributed precompiled with GPU support. These
applications generally just connect to the GPU and harvest the power without much extra setup for
the user. All the details are hidden behind the scenes. If your favorite application is in this category,
consider yourself lucky.
Some examples include NAMD, BLAST, ADF and a few others. Examples of usage will be shown
later in this report. Some have only been evaluated for performance and not discussed in part I.
b) GPU enabled libraries
Libraries exist that provide routines for the most common functions. These libraries provide
functions that will execute on the GPU and give easy access to the power of the GPUs. The
parameter lists etc. are generally the same as in the normal versions of the functions, but the actual names
of the functions generally differ. There is ongoing development with vendors like NVIDIA to make them more
accessible; Fortran wrappers generally lower the threshold for new users.
c) Compiler with Accelerator support
Some compilers support the OpenACC initiative, whose goal is to make GPU support as easy
as multithreading is with OpenMP, by means of source code directives. The OpenACC
directives are comparable to OpenMP in ease of use and complexity. This initiative has attracted a
lot of attention as it can provide easy access to GPU power for the old dusty legacy decks. Much
serial Fortran code still exists which can be persuaded into GPU acceleration by means of some
simple compiler directives. This exercise has been done before using OpenMP, which many
programmers are familiar with.
Precompiled applications
NAMD is a parallel molecular dynamics code designed for high-performance simulation of
large biomolecular systems that comes with support for multicore, MPI and GPUs.
Running this application using GPU support is relatively simple. It is only a question of setting up
the right environment, in this case the path to the library files. The CUDA library file comes with
the application, as the one coming with the CUDA distribution tends to be updated more frequently
than NAMD. In addition, the starting sequence of NAMD needs to be altered to launch NAMD;
one also needs to select one or two GPUs. This must be done in sync with the requests for resources
in the queue system (slurm runscript) file.
An example of a run script for NAMD using GPUs could look like:

#!/bin/bash
#
#SBATCH --job-name=namd
#SBATCH --account=<your account>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=3900M
#SBATCH --exclusive
#SBATCH --partition=gpu --gres=gpu:2
#SBATCH --time=1:00:0

# Set env :
. /cluster/bin/jobsetup
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/work/namd/NAMD_2.9_Linux-x86_64-multicore-CUDA
BIN=~/work/namd/NAMD_2.9_Linux-x86_64-multicore-CUDA/

# Start your work
INP=ubq-nve.conf
$BIN/charmrun ++local $BIN/namd2 +idlepoll +p $SLURM_TASKS_PER_NODE \
    +devices $CUDA_VISIBLE_DEVICES $INP
As can be seen from the script it is quite easy to set up the NAMD environment to use the GPUs.
We are running only on one node, but using both GPUs.
From the NAMD output we see that it is using what we have requested:

Info: Based on Charm++/Converse 60400 for multicore-linux64-iccstatic
Info: Built Mon Apr 30 14:02:11 CDT 2012 by jim on naiad.ks.uiuc.edu
Info: 1 NAMD 2.9 Linux-x86_64-multicore-CUDA 8 compute-19-8.local olews
Info: Running on 8 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.0219169 s
Pe 2 physical rank 2 binding to CUDA device 0 on compute-19-8.local: 'Tesla K20Xm' Mem: 5759MB Rev: 3.5
Pe 6 physical rank 6 will use CUDA device of pe 4
Pe 3 physical rank 3 will use CUDA device of pe 2
Pe 7 physical rank 7 will use CUDA device of pe 4
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 1 physical rank 1 will use CUDA device of pe 2
Pe 4 physical rank 4 binding to CUDA device 1 on compute-19-8.local: 'Tesla K20Xm' Mem: 5759MB Rev: 3.5
Pe 0 physical rank 0 will use CUDA device of pe 2
Info: 11.707 MB of memory in use based on /proc/self/stat
Both GPUs are in use, and NAMD also uses all 8 cores in this node. It is generally a good idea to
request the complete node by using the keyword exclusive in the run script.
Figure 6: NAMD is under rapid ongoing development and is one of the most used applications in molecular
dynamics. There is strong ongoing development to exploit the accelerators, both GPUs and MICs.
AMBER is a parallel molecular dynamics code designed for high-performance simulation of large
biomolecular systems. It is a rather large project with a range of tools. It supports GPU acceleration.
Please see the web page http://ambermd.org/ for detailed information. A performance evaluation is
shown later in this document. Amber 14 requires a license even for academic use.
Figure 7: Visualisation of the structure of a molecule calculated with molecular dynamics using
Amber.
LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/
Molecular Massively Parallel Simulator. LAMMPS has potentials for solid-state materials (metals,
semiconductors) and soft matter (biomolecules, polymers) and coarse-grained or mesoscopic
systems. It can be used to model atoms or, more generically, as a parallel particle simulator at the
atomic, meso, or continuum scale.
LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial
decomposition of the simulation domain. The code is designed to be easy to modify or extend with
new functionality.
LAMMPS is distributed as an open source code under the terms of the GPL.
LAMMPS is distributed by Sandia National Laboratories, a US Department of Energy laboratory.
The main authors of LAMMPS are listed on the LAMMPS web page along with contact info and other
contributors. Funding for LAMMPS development has come primarily from DOE (OASCR, OBER,
ASCI, LDRD, Genomes-to-Life) and is acknowledged there.
Figure 8: Example of LAMMPS simulations. Illustration taken from LAMMPS web page.
See LAMMPS web page for more information : http://lammps.sandia.gov/
GROMACS, is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian
equations of motion for systems with hundreds to millions of particles.
It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a
lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the
nonbonded interactions (that usually dominate simulations) many groups are also using it for
research on non-biological systems, e.g. polymers.
GROMACS supports all the usual algorithms you expect from a modern molecular dynamics
implementation.
GROMACS is Free Software, available under the GNU Lesser General Public License.
Starting with version 4.6, GROMACS includes a brand-new, native GPU acceleration developed in
Stockholm. This replaces the previous GPU code and comes with a number of important features.
The new GPU code is fast, and we mean it. Rather than speaking about relative speed, or speedup
for a few special cases, this code is typically much faster (3-5x) even when compared to
GROMACS running on all cores of a typical desktop. If you put two GPUs in a high-end cluster
node, this too will result in a significant acceleration.
Figure 9: Gromacs GPU acceleration goal.
Commercial applications
There are some commercial applications that are developed to take advantage of the GPUs.
ACEMD
One such application is ACEMD. This is a molecular dynamics application developed to run only on
GPUs. It installs and runs with no problems. Performance is claimed to be very high. So far only the
examples have been run, and these seem to perform very well. The input is compatible with
AMBER and NAMD input.
Unfortunately the license cost of 1.5 kEUR per limits the adoption.
ADF
The Amsterdam Density Functional (ADF) application currently supports GPUs.
Initial tests show a very stable application that runs well with GPU acceleration. However, at the
time of writing the performance is not really up to expectations. While it shows significant speedup
for some tests, it is by no means a game changer.
Figure 10: Visualisation of a molecular motor simulated by
ADF calculations.
VASP
There are reports that this application has some support for GPU acceleration. These preliminary
reports need to be followed up with the developers in Vienna. At the time of writing, October 2014,
no such information has been published. Whether UiO will be a beta test site is unknown, but we have
applied.
VASP is one of the major consumers of NOTUR computing resources.
Figure 11: Vienna Ab-initio Simulation Package, VASP.
Gaussian
Gaussian is working with the Portland compiler group to develop an accelerated version. This
cooperation was announced in 2011, see: http://pressroom.nvidia.com/easyir/customrel.do?
easyirid=A0D622CE9F579F09&releasejsp=release_157&prid=792463
From the press release, dated the 29th of August 2011:
“NVIDIA today announced plans with Gaussian, Inc., and The Portland Group (PGI) to develop a
future GPU-accelerated release of Gaussian, the world's leading software application for quantum
chemistry.”
The status of this work and a scheduled release date have not yet been announced. When Gaussian Inc were
asked about progress, their reply on November 1st 2013 was not very optimistic:
“While I am aware that progress has been made on a GPU-enabled version of Gaussian, I am not
certain what type of details I am permitted to provide about the progress. “
The compiler team from the Portland Group could inform us that the Gaussian code has been compiled
with OpenACC directives in order to utilize the NVIDIA GPUs. Gaussian Inc has been very
concerned about the portability of their source code, so nothing apart from directives, which are
source code comments anyway, was allowed. The Portland Group is very eager to release to the public
any news about a PGI accelerated Gaussian.
Figure 12: Molecules calculated by Gaussian.
GPU enabled libraries
CUDA Libraries: NVIDIA provides a range of highly optimized functions that can be called from
Fortran or C. The Fortran functions are available through a wrapper which hides all the boring
details. The wrapper routines are already compiled for Abel and located together with the CUDA
64 bit libraries. The CUDA documentation is available in both pdf and html format. It is located with
the CUDA software under doc (on Abel look under $CUDA/doc, after the CUDA module has been
loaded).
Currently there are a few libraries implemented.
• CUBLAS - The CUBLAS library is an implementation of BLAS (Basic Linear Algebra
Subprograms) on top of the NVIDIA® CUDA™ runtime. It allows the user to access the
computational resources of an NVIDIA Graphics Processing Unit (GPU), but does not auto-parallelize
across multiple GPUs.
• CUFFT - The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier
transforms of complex or real-valued data sets. It is one of the most important and widely used
numerical algorithms in computational physics and general signal processing. The CUFFT library
provides a simple interface for computing parallel FFTs on an NVIDIA GPU, which allows users to
quickly leverage the floating-point power and parallelism of the GPU in a highly optimized and
tested FFT library.
• CURAND - The CURAND library provides facilities that focus on the simple and efficient
generation of high-quality pseudorandom and quasirandom numbers. A pseudorandom sequence of
numbers satisfies most of the statistical properties of a truly random sequence but is generated by a
deterministic algorithm. A quasirandom sequence of n-dimensional points is generated by a
deterministic algorithm designed to fill an n-dimensional space evenly.
• CUSPARSE - The CUSPARSE library contains a set of basic linear algebra subroutines used for
handling sparse matrices. It is implemented on top of the NVIDIA® CUDA™ runtime (which is part
of the CUDA Toolkit) and is designed to be called from C and C++.
• THRUST - Thrust is a C++ template library for CUDA based on the Standard Template Library
(STL). Thrust allows you to implement high performance parallel applications with minimal
programming effort through a high-level interface that is fully interoperable with CUDA C. Thrust
provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be
composed together to implement complex algorithms with concise, readable source code. By
describing your computation in terms of these high-level abstractions you provide Thrust with the
freedom to select the most efficient implementation automatically. As a result, Thrust can be utilized
in rapid prototyping of CUDA applications, where programmer productivity matters most, as well as
in production, where robustness and absolute performance are crucial.
A detailed overview of the routines and functions supported is found in the documentation. All of
the most common functions are implemented: BLAS levels 1, 2 and 3, 1d, 2d and 3d FFTs and a
comprehensive array of other functions and routines.
The CUDA libraries have been written in C and expect the memory allocation to be handled by the
programmer. As C functions can easily be called from Fortran this represents no real problem, just
some more programming code. There is a wrapper developed for BLAS which comes in two
versions. One hides all allocation etc. so that there are virtually no changes to the Fortran code, and
another requires just the allocation steps. The latter has a much lighter calling sequence and
yields higher performance, while the first makes testing very easy. Unfortunately these wrappers are at
the time of writing only developed for the BLAS library. Hence it is somewhat harder to start using the
FFT, random number and sparse libraries from Fortran.
To do development with these libraries one needs to work on one of the nodes that host the
GPU cards. To reserve such a node issue the following command:

qlogin --account=<your project> --nodes=1 --ntasks-per-node=8 \
    --mem-per-cpu=3900M --exclusive --partition=gpu --gres=gpu:2 --time=1:00:0

This will reserve a complete node with all CPUs and both GPUs and ensure that you are the only
user with legitimate access to this node. Issue the normal commands for jobs, like sourcing
/cluster/bin/jobsetup, and load the modules needed. For this kind of work the cuda/5.0 module is
appropriate.
Using the CUDA BLAS libraries is remarkably simple; the CUDA environment sets up most of the
detailed paths etc.
In your Fortran code replace the call to sgemm with the following line:

call cublas_sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)

The parameter list is identical; the only change is to prefix the sgemm name with cublas_. The same
is true for the double, complex and double complex data types (s, d, c and z).
To compile, there are a few more libraries to include; add the following to the link line:

-L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas
On Abel the /usr/local/cuda directory is a symbolic link to the common file structure where most of
the module based software is located. It is only present on the nodes that host the GPUs.
An example of compilation may look like:

gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 \
    /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90
The file fortran_thunking.o contains wrappers that hide a lot of juicy code that is a bit more
complicated. There is also a file called fortran.o which exposes a lot more details to the Fortran
programmer; refer to the documentation for more information.
The example uses gfortran, but the library functions are callable from most common compilers.
Quite remarkable performance can be obtained using the library functions:

[olews@compute-19-4 Fortran-BLAS]$ ./sgemmdriver.x
 Total footprint A,B & C       5538 MiB
 CUDA sgemm start
 CUDA sgemm end       15.0307150 secs    1416.83215 Gflops/s
 CPU/MKL sgemm start
 cpu time spent in threaded    146.434982
 CPU dgemm end        146.434982 secs    145.429733 Gflops/s
 Speedup              9.53373528
 diff sum(c)_gpu sum(c)_cpu: 0.0000000000E+00
[olews@compute-19-4 Fortran-BLAS]$
A speedup of about 9.5 times compared to MKL using all cores in one socket. The nodes on Abel have
two GPU cards and two sockets in each node, hence twice these numbers can be expected.
Compiler support for accelerator, OpenACC
The open accelerator initiative aim to make use of accelerator processors like the GPUs as easy as
OpenMP has provided easy access to multiprocessor systems today. OpenACC follows the same
path as OpenMP by means of compiler directives in the source code. Goal is to make OpenMP and
OpenACC similar and develop OpenACC into a framework that is comparable to OpenMP.
Noticing the huge success of OpenMP the stakeholders in OpenACC hope to build on its success.
Currently only the Portland group's compiler have any noticeable traction. Their implementation,
however perform remarkably well which will be demonstrated below. The GNU compilers with gcc
will support OpenACC during 2015, gcc 5. Oak Ridge National laboratory is supporting this work2.
OpenACC is now released in version 2.0a, the web site : http://www.openacc.org will provide more
informasjon. At SC13 it was clear that this is the most feasible way to enable production code for
GPU acceleration. Both OpenCL and CUDA is too complex and time consuming when it comes to
scientific programs.
To start using the GPU, or accelerator as it is called by Portland, it is only required to enter compiler
directives into the source code, just as with OpenMP. A simple example might look like:

!$acc region
do i = 1,n
   r(i) = a(i) * 2.0
enddo
!$acc end region
The compiler will interpret the directives and try to build code that the GPU can process, hopefully with
considerably higher performance. OpenACC has a very simple syntax, but mastering the data
placement and workload management is another matter. Having said that, the following example
might illustrate the potential of a simple approach. Please consider the general matrix matrix
multiplication routine ?gemm in its reference implementation form.
The core block of the calculation looks like this with the accelerator directives entered:

!$acc region
      DO 90 J = 1,N
          IF (BETA.EQ.ZERO) THEN
              DO 50 I = 1,M
                  C(I,J) = ZERO
   50         CONTINUE
          ELSE IF (BETA.NE.ONE) THEN
              DO 60 I = 1,M
                  C(I,J) = BETA*C(I,J)
   60         CONTINUE
          END IF
          DO 80 L = 1,K
              IF (B(L,J).NE.ZERO) THEN
                  TEMP = ALPHA*B(L,J)
                  DO 70 I = 1,M
                      C(I,J) = C(I,J) + TEMP*A(I,L)
   70             CONTINUE
              END IF
   80     CONTINUE
   90 CONTINUE
!$acc end region

2: https://www.olcf.ornl.gov/2013/11/14/olcf-lends-expertise-for-introducing-gpu-accelerator-programming-to-popular-linux-gcc-compiler/
This is all there is to it in the simplest approach. Compiling is also relatively easy; using the
Portland Group's pgfortran the command line looks like:

pgfortran -O3 -ta=nvidia,kepler,time -Minfo=acc -o dgemm_acc.x dgemm_acc.f
Another example is a Jacobi relaxation calculation:

!$acc kernels
do j=1,m-2
   do i=1,n-2
      Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
                                   A(i ,j-1) + A(i ,j+1) )
      error = max( error, abs(Anew(i,j)-A(i,j)) )
   end do
end do
!$acc end kernels
which unfortunately does not yield the desired performance. It turns out that the data placement forces
unnecessary copying between host memory and device memory. Just by including a data copy
statement the performance increases dramatically, in this case a multifold increase. For details
see the performance evaluation section.
!$acc data copy(A, Anew)
do while ( error .gt. tol .and. iter .lt. iter_max )
error=0.0_fp_kind
Here the system will overlap the copying of data to the GPU device memory with the execution of
the while loop. This kind of overlap will increase the efficiency and yield the very high performance
we are after.
!$acc kernels
do j=1,m-2
do i=1,n-2
Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
A(i ,j-1) + A(i ,j+1) )
error = max( error, abs(Anew(i,j)-A(i,j)) )
end do
end do
!$acc end kernels
Next we evaluate the performance of the OpenACC initiative. This will be done in a later section
of this document. However, it is worth noting that it is well known that the bandwidth limitation
between the GPU device memory and the host's memory is a bottleneck. The connection between the
two banks of memory is via the PCIe bus with its limitations. Typical numbers lie in the 6 GiB/s
range using pinned host memory, but only about 2 GiB/s using pageable host memory, while intra
GPU device memory bandwidth clocks in at about 170 GiB/s. Keeping the data in the GPU device
memory while computation is in progress is a key to success. OpenACC
contains a range of directives to accommodate this. In figures 13 and 14 the OpenACC quick guide
is reproduced. There are many tutorials on the net to help users master OpenACC programming.
OpenACC and OpenMP are like chess: very easy to learn, but mastering them takes time and effort.
Figure 13: OpenACC quick guide page 1.
Figure 14: OpenACC quick guide page 2.
The use of OpenACC is currently the only quick and easy way to start harvesting GPU power for legacy code. As there are no changes to the source code apart from compiler directives, which are comments anyway, this is also a portable approach. With support from more and more compiler vendors, and gcc support on the way, this way forward looks quite promising.
PART II
Performance evaluation
The very high theoretical performance claimed in the table reproduced in figure 1 makes the GPUs very attractive. However, this performance is never seen in practice, not even for benchmarks written in CUDA or OpenCL. Even so, significant speedups, times faster rather than just percent faster, are commonplace. The following section gives an overview of what can realistically be expected, with the exception of the marketing numbers quoted for some applications.
A detailed description of the compute nodes is given in appendix tables 8 and 9.
GPU enabled application, NAMD
The current versions (2.9 and 2.10) (X.Y_Linux-x86_64-multicore-CUDA) of NAMD are used for testing. This version is threaded for multicore processors and has built-in support for GPUs (NVIDIA CUDA). For input the apoa1 benchmark has been selected as the test case.
The GPU version is run on nodes with the GPUs installed; these dual socket nodes have a somewhat less performant CPU (Intel quad core E5-2609 at 2.4 GHz). The multicore tests are run using the multicore version of NAMD on Abel's common compute nodes, dual socket (Intel octa core E5-2670 at 2.6 GHz). These compute nodes were chosen because they are the most representative for comparing real application runs on Abel.
First benchmark apoa1
The results are shown in figure 15 and figure 16, where the performance is reported as wall clock run time. The red columns show the run times obtained by the GPU accelerated nodes while the blue ones are for standard compute nodes. Remember that the CPUs in the two types of nodes are different; the accelerated nodes were designed with lower cost processors, as the computation should be performed by the GPUs.
[Figure: bar chart, NAMD apoa1 benchmark, GPU / CPU. Wall time [secs] versus #CPUs / #GPUs for 8/2, 4/1, 16/0, 8/0 and 4/0.]
Figure 15: NAMD single node performance, GPU accelerated node compared to a standard compute node. Note that the CPUs in the compute node are faster than the ones in the GPU node.
Figure 16 shows the difference in performance of a single Abel node when using all the compute power the node can muster. This includes 16 cores on the compute node and two GPUs on the accelerated node. There is more than a 3 fold increase in speed, which shows the kind of potential that can be unleashed by effective usage of the graphics processing unit.
[Figure: bar chart, NAMD apoa1 benchmark, single node performance. Wall time [secs] for a compute node versus a GPU node.]
Figure 16: Single node performance exploiting all compute power within the node.
Second benchmark DL 1.4M Her1-Her1 membrane
This is a benchmark originating from Daresbury laboratory and is part of their benchmark suite for
procurement.
The results are shown below in figure 17. A speedup of about a factor of two.
[Figure: bar chart, DL benchmark single node performance, 1.4M Her1-Her1 membrane. Wall time [secs] for a compute node versus an accelerated node.]
Figure 17: Single node performance for the DL NAMD benchmark. One
compute node using all 16 cores vs. one GPU accelerated node using both
GPUs and all 8 cores.
NAMD 2.10 supports a combination of multinode runs using IB verbs, multicore and CUDA. This opens up a large number of combinations using different sets of ranks, threads and GPUs. The newer version NAMD 2.10 works well with multiple nodes, multiple cores and GPUs.
[Figure: bar chart, NAMD 2.10 GPU enabled performance, 1.4M Her1-Her1 membrane benchmark. Speedup [times faster] versus #CPUs / #GPUs from 2/2 up to 64/16.]
Figure 18: Scaling of NAMD 2-10-cuda using multiple nodes and multiple GPUs. Performance seems to
depend on both the number of cores and the number of GPUs. A good load balance seems to have been
implemented. However, too many cores might lead to a decrease in performance.
[Figure: line chart, NAMD 2.10 GPU performance, 1.4M Her1-Her1 membrane. Wall time [seconds] versus number of nodes (1, 2, 4, 8) for standard nodes and accelerated nodes.]
Figure 19: Scaling and performance comparison using standard compute nodes and
accelerated nodes using two GPUs per node.
GPU enabled application, LAMMPS
The current version (lammps-16Aug2013) of LAMMPS is used for testing. This application is an MPI application with support for GPUs (NVIDIA CUDA). For input the 3d Lennard-Jones melt benchmark has been selected as the test case.
The GPU version is run on nodes with the GPUs installed; these dual socket nodes have a somewhat less performant CPU (Intel quad core E5-2609 at 2.4 GHz). The multicore tests are run using LAMMPS on Abel's common compute nodes, dual socket (Intel octa core E5-2670 at 2.6 GHz). These compute nodes were chosen because they are the most representative for comparing real application runs on Abel.
[Figure: bar chart, Lennard-Jones potential benchmark, single node performance. Run time [secs] for a standard compute node versus an accelerated node.]
Figure 20: LAMMPS benchmark (3d LJ potential melt), single node performance. MPI run over all processors, CUDA in addition for acceleration. For the CUDA version the MPI/CUDA combination is used. Single precision was used on the GPU.
Figure 20 shows a 3 fold speedup for the Lennard-Jones potential melt benchmark. Other benchmarks will show different speedups as different algorithms are offloaded to the GPU.
The GPU/CUDA implementation supports a range of precisions; the calculations can be performed on the GPUs in single precision, double precision or a mixture. This also has an impact on the recorded speed.
GPU enabled application, Beagle
The current version (Beagle 1.0 release 1141) of Beagle is used for testing. This application is a high-performance library that performs the core calculations at the heart of most Bayesian and maximum likelihood phylogenetics packages. It can make use of highly parallel processors such as those in graphics cards (GPUs) found in many PCs. It is used in conjunction with the BEAST software (BEAST v1.7.5 is used).
The GPU version is run on nodes with the GPUs installed; these dual socket nodes have a somewhat less performant CPU (Intel quad core E5-2609 at 2.4 GHz). The CPU version uses Abel's standard compute nodes, dual socket (Intel octa core E5-2670 at 2.6 GHz). These compute nodes were chosen because they are the most representative for comparing real application runs on Abel.
[Figure: bar chart, Beagle CPU vs. GPU, benchmark2. Run times [secs] for an accelerated node versus a standard node.]
Figure 21: Beast/Beagle benchmark2 performance, aggregated performance for
two concurrently running jobs. Run times for each were added together.
GPU enabled application, ADF
The current version of ADF supports GPUs. The results below show timings with two types of nodes; the standard compute node has twice as many cores running at a higher frequency. The actual GPU speedup is the difference in run time between the accelerated node using the GPUs and a run using only the CPUs.
The possible speedup depends on the input and which density functionals the input calls for. Some examples have been tried; not all show any advantage of the GPUs, but some show a significant speedup.
[Figure: bar chart, GPU acceleration with ADF, ADF 2014.01 CUDA version, Al10O15 input. Wall time [sec] for a standard node, the accelerated node without GPUs and the accelerated node with 2 GPUs.]
Figure 22: Results from GPU enabled ADF. The effect of GPUs in a node with less CPU power than the standard compute node. The accelerated node shows a 30% reduction in wall time.
[Figure: bar chart, ADF benchmark, caffeine. Wall time [secs] for compute nodes versus accelerated nodes.]
Figure 23: ADF scaling using compute nodes and accelerated nodes. This is an example where accelerators do not yield enough performance to outperform the more performant cores in the standard compute nodes.
Closer study of the timing of individual routines shows speedups of an order of magnitude or more. The potential is clearly there.
GPU enabled application, Amber
The current version of the molecular dynamics package Amber, version 14, supports GPUs for a selection of functions. Please review the Amber 14 documentation and http://ambermd.org/gpus/ for up to date information.
This Amber version supports the combination of CUDA and MPI, which opens up a large range of processor placements and job distributions. The Amber package comes with a set of benchmarks that can be used to evaluate the performance.
[Figure: bar chart, Amber 14 benchmarks, single node performance. Speedup using GPUs (0 to 20) for JAC_PRODUCTION_NVE and _NPT (23,558 atoms, PME), FACTOR_IX_PRODUCTION_NVE and _NPT (90,906 atoms, PME), CELLULOSE_PRODUCTION_NVE (408,609 atoms, PME) and TRPCAGE_PRODUCTION (304 atoms, GB).]
Figure 24: Amber 14 performance on a single node, using up to 2 GPUs and all cores. Best speedups using
GPUs compared to CPUs only are reported.
Quite remarkable speedups can be obtained, see figure 24 above and figures 25 and 26 below. However, scaling over several nodes using MPI does not seem very good. In addition, it seems that there should be only one MPI rank per GPU to yield the best performance. A hybrid model with load balancing would be best, but this is not yet supported.
Data suggest that the benefit of using GPUs for this application is considerable. To this must be added (taken from the Amber 14 manual, page 313):
"One of the newer features of PMEMD is the ability to use NVIDIA GPUs to accelerate both explicit solvent PME and implicit solvent GB simulations. This work is by Ross Walker at the San Diego Supercomputer Center and Scott Le Grand at Amazon Web Services, in collaboration with NVIDIA."
As is often the case, involvement of NVIDIA is needed to get acceptable performance. While help from NVIDIA is appreciated, it is not a sustainable situation.
[Figure: bar chart, Amber 14 CUDA performance, 2 nodes using CUDA and MPI. Run times [seconds/ns] versus #CPUs : #GPUs (16:0, 2:2, 4:2, 4:4, 8:4, 16:4).]
Figure 25: Amber 14 CUDA version running the Myoglobin benchmark using an implicit solvent, with two servers using both MPI and CUDA.
[Figure: bar chart, Amber 14 CUDA performance, 2 nodes using CUDA and MPI. Run time [seconds/ns] versus #CPUs : #GPUs (16:0, 2:2, 4:2, 8:2, 4:4, 8:4, 16:4).]
Figure 26: Amber 14 CUDA version running the cellulose production benchmark using an explicit solvent, with two servers using both MPI and CUDA.
GPU enabled application, cp2k
The current version of the molecular dynamics package cp2k has some support for NVIDIA CUDA acceleration. It compiles and builds using a combination of Intel MKL and NVIDIA libraries. The build process is as always a complex task, but well documented. There is a battery of versions to choose from: shared memory OpenMP versions, serial, CUDA enabled, and MPI versions both hybrid and CUDA enabled.
This combination of CUDA and MKL should be quite a powerful one. The cp2k CPU performance is good, but the extra performance gained by using the NVIDIA cards is not yet up to scale. A blog entry from NVIDIA suggests a modest 2.5x gain in performance:
http://blogs.nvidia.com/blog/2012/08/23/unleash-legacy-mpi-codes-with-keplers-hyper-q/.
However, this is done by using MPI and launching many threads onto the NVIDIA processor, one for each MPI rank. This works quite well, as we have seen from the tests using NAMD. For a more common approach using a single CPU and a GPU, the gain versus a single CPU is quite modest.
Tests using Abel show very little gain when using the GPUs. The MPS 3 works well for a single GPU on a node, but when scheduling 4 MPI ranks onto a GPU the difference between MPS and no MPS is not obvious. That MPS only supports a single GPU per node is really a showstopper for systems with two GPUs per node.
For runs using a single CPU and a single GPU there is some performance gain, see the figure below.
[Figure: bar chart, CP2K performance, single core serial program with one GPU, H2O-64 benchmark. Wall time [secs] for the accelerated node CPU only versus CPU + GPU.]
Figure 27: Single core serial performance, one core versus one core using GPU acceleration. Only
slight increase in performance. This is the H2O 64 benchmark.
3: https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
However, very few runs with cp2k would be done in this way. The interesting results are the multicore runs using either shared memory with OpenMP or multiple nodes with MPI. The figure below shows what to expect when comparing a compute node with 16 cores at 2.6 GHz with an accelerated node with 8 cores at 2.4 GHz plus GPUs. It is evident that the GPU utilization is too small to make up for the fewer cores in the accelerated nodes.
[Figure: bar chart, cp2k performance, two nodes with MPI and GPUs, benchmark H2O-128, all cores and all GPUs. Wall time [seconds] for a compute node versus an accelerated node.]
Figure 28: Comparing compute nodes with accelerated nodes. The more numerous cores in the compute node clearly outperform the fewer cores (16 vs 8) in the accelerated node even if some work is done by the GPUs.
Clearly some more work remains to be done with this application before it can make full use of the NVIDIA GPUs. It spends a significant part of its time in MKL, so any transfer of MKL run time to the GPU would be beneficial.
GPU enabled application, Gromacs
The current version of the molecular dynamics package Gromacs has support for NVIDIA CUDA acceleration. Gromacs is well documented and uses cmake to build. It supports a range of different schemes for parallel runs: multicore, MPI and combinations thereof, in addition to CUDA support.
[Figure: bar chart, Gromacs GPU performance, benchmark rnase_dodec. Wall time [secs] versus number of cores (1, 2, 4, 8) for a compute node and an accelerated node.]
Figure 29: Gromacs performance running the rnase_dodec benchmark. This version of Gromacs was built with GNU gfortran and uses fftw3. The current dataset could not be run on 16 cores.
The figure above shows the performance gain possible with a GPU accelerated node. Run times are too short to make any sense of runs on multiple nodes. Just looking at the numbers one might expect a twofold speedup using GPUs, which is a significant speedup, and one likely to increase as more functions in Gromacs are ported to GPUs.
GPU enabled application, Gaussian
The current version of Gaussian does not yet support GPUs. There is no given date for release of
the GPU enabled Gaussian as of August 2014 (email from Gaussian Inc, August 14th 2014):
“Also, I have been advised by my technical support that a GPU-enabled version of Gaussian
requires changes to compilers as well as changes to Gaussian, therefore, unfortunately, we still
cannot provide a time frame as to when such a port would be available. However, while again, we
cannot make any guarantees, we hope to include some level of GPU support in the next major
release of Gaussian (although the release date of the next major version has not yet been
determined). I am very sorry that I do not have more exact information to provide to you, but we
will be sure to contact you when a GPU-enabled version of Gaussian becomes available.”
The results below are taken from presentations given by NVIDIA/PGI/Gaussian in the spring of 2014. The setup is dramatically different from the Abel setup; this system has two sockets with 10 cores each and no less than 6 NVIDIA Tesla K40 GPUs.
Figure 30: Example of speedup obtained using 6 x Tesla K40 GPUs running Direct SCF in Gaussian.
The speedups shown are not for production code, and as they employ 6 GPUs more powerful than those currently installed in Abel, the numbers are just an illustration: marketing numbers.
Figure 31: Example of speedup using 6 x Tesla K40 GPUs running Coupled Cluster calculation.
If the speedups shown in figure 31 are in any way indicative of future development, Gaussian might become a major user of GPUs.
However, as of the start of 2015 no GPU enabled Gaussian version is available.
GPU enabled application, VASP
The current version of VASP does not yet support GPUs. The results below are taken from presentations given by NVIDIA/PGI in the spring of 2014.
There is no given date for the release of a GPU enabled VASP, and virtually no information is available. The setup is dramatically different from the Abel setup; this system has two sockets with 10 cores each and no less than 6 NVIDIA Tesla K40 GPUs.
Figure 32: Example of speedup using 6 x Tesla K40 GPUs running VASP calculation.
The performance gains obtained show that GPUs will provide a significant speedup. We just have to
wait for the production release and do an evaluation then.
GPU enabled application, NCBI BLASTp
The NCBI version of the BLAST package provides a GPU enabled version of BLASTp. This should provide accelerated protein searches. Performance information provided on the web suggests speedups of several times. More information on BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Local testing with a set of inputs and databases does show significant speedups, but not at the scale needed for widespread adoption. The figure below shows that the GPU yields only a marginal speedup, and that it is not enough to outperform the normal compute nodes. 4 cores and one GPU represent half of a standard compute node, which equals 8 cores.
[Figure: bar chart, BLASTP performance, accelerated node versus standard node. Wall time [sec] for 2, 4, 4 + 1 GPU and 8 cores on the accelerated node, and 4, 8 and 16 cores on the standard node.]
Figure 33: NCBI BLASTP performance, accelerated node using one GPU (blue) and standard compute node (red). Benchmark is the swissprot db and a bacillus query.
GPU enabled application, Bowtie
The well known bioinformatics application Bowtie has been ported to take advantage of NVIDIA GPUs. From the web page (http://nvlabs.github.io/nvbio/nvbowtie_page.html): "nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner".
Local tests have been performed and show speedups comparable to the reported ones. nvBowtie utilises only one GPU, hence the reported results compare one GPU and one socket on the accelerated node versus one socket (8 threads) on the standard compute node.
[Figure: bar chart, Bowtie2, mouse genome. Wall time [sec] for one GPU/one socket on the accelerated node versus one socket (8 threads) on a standard compute node.]
Figure 34: Performance of nvBowtie aligning mouse sequences.
As shown in the figure there is a speedup of about 4.5 using the GPU compared to one processor in a standard compute node. As Bowtie scales quite well, this number drops to 2.7 if all cores in the standard compute node are used. However, the accelerated node has two GPUs, so it is fair to use only one processor in the standard node.
From a volume production standpoint the accelerated node outperforms the standard node by a factor of about 4.5.
GPU enabled application, BWA / BarraCUDA
The well known bioinformatics application BWA has a GPU-enabled implementation called
BarraCUDA.
From the web page (ref: http://seqbarracuda.sourceforge.net/index.html):
“Started in 2009, the aim of the BarraCUDA project is to develop a sequence mapping software that
utilizes the massive parallelism of graphics processing units (GPUs) to accelerate the inexact
alignment of short sequence reads to a particular location on a reference genome.
Being based on BWA (http://bio-bwa.sf.net) from the Sanger Institute, BarraCUDA delivers a high
level of alignment fidelity and is comparable to other mainstream alignment programs. It can
perform gapped alignment with gap extensions, in order to minimise the number of false variant
calls in re-sequencing studies.“
Some local tests have been run and the performance recorded and evaluated. Unfortunately there was no major speedup with the inputs tested. Good GPU enabled applications are hard to come by.
[Figure: bar chart, BWA / BarraCUDA, mouse genome, complete node: all cores versus both GPUs. Wall time [secs] for the standard node versus the accelerated node.]
Figure 35: BWA versus BarraCUDA wall times, all 16 cores in the standard node used, both GPUs in the accelerated node. We see that it is hard even with two GPUs to beat 16 threads in two sockets.
Another example is drawn from scientific production on the Abel system. This is an example of real scientific production runs; the reference is about 300 MBytes while the query is 1.7 GBytes. A lot of these are processed on Abel on a routine basis. Figure 37 shows the wall times recorded to run the alignment using different nodes. It is clear that for this type of job the speedup is quite large; the GPU node performs the alignment over 5 times faster than a standard compute node.
[Figure: bar chart, Barracuda vs BWA, real scientific data from research production. Wall time [seconds] for 1 and 2 GPUs versus 1, 2, 4, 8 and 16 cores on a standard node.]
Figure 36: Barracuda and BWA performance recorded for a set of alignments performed in scientific production. BWA scales well up to 8 cores.
[Figure: bar chart, BWA vs Barracuda, all resources in one node, 16 cores / 2 GPUs. Wall time [seconds] for a compute node versus an accelerated node.]
Figure 37: BWA versus Barracuda for production research data. A speedup of about 5 is recorded for
this input.
GPU enabled application, cuBLASTP
At the GPU conference in March 2014 (http://on-demand-gtc.gputechconf.com/) a number of application cases were presented. Below is a facsimile (edited) of a poster showing the speedup of this application. Speedups from 3 to 8x are presented, some measured against a single core, others against all cores in a socket.
Figure 38: A poster presented at the GPU conference in March 2014 showing speedup for the cuBLASTP
application. The histograms have been magnified to show the speedup.
These kinds of speedups look nice, but widespread adoption of GPU accelerated cuBLASTP has yet to be seen.
This software is licensed and not open source; a Google search points to a senior license manager.
GPU enabled application, AUTODOCK
At the GPU conference in March 2014 (http://on-demand-gtc.gputechconf.com/) a number of application cases were presented. Below is a facsimile (edited) of a poster showing the speedup of this application. Quite good speedups are reported for Autodock, up to over 40 with some selected datasets.
Figure 39: A poster presented at the GPU conference in March 2014 showing speedup for the AUTODOCK application. The histograms have been magnified to show the speedup.
GPU enabled libraries
As we learned on page 20, libraries provide functions and routines that will execute the computation on the GPU; this section shows what can be expected in realistic benchmarks.
BLAS level 3 matrix multiplication
The first example shows BLAS level 3 matrix-matrix operation performance. The most common operation is general matrix matrix multiplication, [s,d,c,z]gemm, where s, d, c and z represent the precision (s single, d double, c complex, z double complex).
The benchmark is run on two kinds of nodes, just like the NAMD example above: one accelerated node with less performant CPUs and a compute node with high performance processors; see the text on page 27 for details or the appendix for the full details about the hardware and software.
Fully optimized libraries have been used in all cases; for the compute nodes the multithreaded version of Intel MKL has been used, while on the GPU nodes the NVIDIA CUBLAS library has been used. However, as there is one GPU per socket, the tests on the compute node will only use one socket, i.e. 8 cores. The combined numbers from using both GPUs and both sockets are assumed to be close to twice the recorded numbers. The reason for only using one GPU is that presently there is no load balancing by which a single call to a routine can be split between two GPUs.
The test is set up to compare the compute nodes versus the accelerated nodes in Abel. Hence we use the best performing hardware and libraries for the tests.
[Figure: bar chart, BLAS level 3 performance, dgemm, double precision (64 bits). Performance [Gflops/s] of CUDA BLAS versus MKL BLAS for total matrix footprints of 91, 366, 1464, 3295 and 5149 MiB.]
Figure 40: BLAS level 3 dgemm performance using one GPU in an accelerated node vs. one socket in a compute node. One socket equals 8 cores.
The results show significant speedups for both single precision and double precision. As expected the double precision performance is significantly lower than the single precision performance. The demonstrated speedup of the accelerator versus a standard Intel Sandy Bridge socket is about 2.5 for double precision and 3.4 for single precision.
[Figure: bar chart, BLAS level 3 performance, sgemm, single precision (32 bits). Performance [Gflops/s] of CUDA BLAS versus MKL BLAS for total matrix footprints of 45, 183, 732, 1647, 2575, 3708, 4577 and 5538 MiB.]
Figure 41: BLAS level 3 sgemm performance using one GPU in an accelerated node vs. one socket in a compute node. One socket equals 8 cores.
In quantitative numbers this can be compared to the theoretical numbers found in table 1. Table 2 shows the BLAS matrix matrix multiplication performance in relation to the theoretical performance.

                                      Double precision   Single precision
Theoretical performance [Gflops/s]    1310               3950
BLAS performance [Gflops/s]           482                1300
Efficiency [fraction of theoretical]  36.8 %             32.9 %

Table 2: Efficiency and utilization of the compute power, obtained numbers vs. theoretical performance.
As a note it should be mentioned that the size of the problems that can be attacked by this approach is limited to the matrix sizes that can be accommodated in the GPU card's memory, in the present case 6 GiB.
Linear algebra and the top 500 test
A nice example of what is possible is High Performance Linpack (HPL), the famous top500 test. A version has been developed that supports GPUs with load balancing between CPUs and GPUs. It uses the Intel MKL BLAS libraries and CUDA BLAS libraries to squeeze all possible performance out of the compute hardware installed in a node. Only three linear algebra functions are used: dgemm, dtrsm and LU factorization. Not only does the implementation provide support for several GPUs in each node, it is still the well known MPI application, so one can use as many resources as one might like. There are examples of large clusters on the top500 list that use this approach on a large scale.
Figure 42: HPL implementation showing overlap of computation using GPU and CPU.
For a single node using all 8 processor cores (Intel E5-2609 quad core, 2.4 GHz) and the two installed K20 GPUs the following performance was recorded:
Params      size   block  nxm  time     Tflops  %peak
WR12C2L8    85000  1280   1x2  222.07   1.844   66.5
WR12C2L8    85000  1280   1x2  222.28   1.842   66.4
WR12C2L8    85000  1280   1x2  222.31   1.842   66.4
WR12C2L8    85000  1024   1x2  222.87   1.837   66.2
WR12C2L8    85000  1280   1x2  222.84   1.837   66.2
WR12C2L8    85000  1024   1x2  222.92   1.837   66.2
WR12C2L8    85000  1280   1x2  223.03   1.836   66.2

Table 3: Recorded performance of a single node running HPL using all 8 cores and both GPUs.
Compare this 1.8 Tflops/s with the performance obtained using all cores on a more powerful compute node, shown below, which clocks in at only 0.32 Tflops/s:
Params      size   block  nxm  time     Tflops  %peak
WR11R2R4    87500  180    4x4  1404.46  0.318   95.6
WR11R2R4    87500  200    4x4  1408.52  0.317   95.3

Table 4: Recorded performance of a single compute node running HPL using all 16 cores. This node has more powerful CPUs than the GPU node.
[Figure: bar chart, HPL / top500 benchmark performance, double precision HPL on a single node. Performance [Tflops/s]: about 1.8 for the accelerated node versus about 0.3 for the standard node.]
Figure 43: Using all resources in the two types of nodes. Accelerated node using
both GPUs and all CPUs versus one compute node using all CPUs. The CPUs
on the compute node are more powerful than the CPUs in the accelerated node.
Figure 43 shows the very high HPL performance provided by the GPUs, a more than 5 fold increase. The HPL example clearly shows what kind of performance can be obtained using the right software. The time and effort that have gone into the development of this HPL implementation are not small, and very few applications can be optimized in this way.
HPL being an MPI application, it can run over several nodes just like the ordinary HPL application.
[Figure: bar chart, HPL / top500 performance, double precision MPI over a cluster of GPU nodes. Performance [Tflops/s] versus number of nodes (1, 2, 3, 4, 6, 8, 14), reaching about 20 Tflops/s at 14 nodes.]
Figure 44: HPL benchmark, MPI over IB with GPU accelerated compute nodes.
14 nodes are less than half a rack (36 nodes). One rack would yield about 50 Tflops/s (36/14 * 20). Assuming perfect scaling we would need only 20 racks to build a Petaflops cluster, about the size of Abel in rack space.
Matlab with GPU enabled libraries
Matlab comes with a range of functions that are GPU enabled. This can provide very easy access to the accelerator performance.
Figure 45: Matlab with a range of GPU enabled functions.
More information can be found at : http://www.mathworks.se/discovery/matlab-gpu.html or
http://www.mathworks.se/products/parallel-computing/
OpenACC compiler support
For detailed information about OpenACC please see http://www.openacc-standard.org/.
Currently (August 2014) only Portland and CAPS support OpenACC (the Cray compiler supports Cray systems). There are a few initiatives to include support for OpenACC in the open source compilers; GCC will most probably gain support some time during 2015, see below.
Figure 46: Roadmap for gcc with respect to OpenACC support, taken from : http://techenablement.com.
An example of an open source compiler supporting OpenACC; this is an alternative while waiting for gcc 5.0:
Figure 47: An example of an open source compiler supporting OpenACC.
Performance results are presented from runs of the NASA NPB benchmarks.
Figure 48: Performance of the open source compiler OpenUH with OpenACC support.
Linear algebra using Portland
For the linear algebra example, the reference matrix-matrix multiplication code in plain Fortran is used. The comparison is between this Fortran code compiled with full compiler optimization on a single core and the same code with OpenACC directives inserted. The test shows what speedup can be obtained using compiler directives on single-threaded reference code. The same code could also be given OpenMP directives and the speedup recorded, but that is another study.
The results from runs using dgemm and sgemm are shown in Tables 5 and 6. The absolute performance is quite low, which is often the case when testing plain Fortran code; only highly optimized libraries reach a large fraction of the theoretical peak. Nevertheless, the results clearly show the large benefit of accelerated code: the speedups recorded range from over 6 for double precision to over 8 for single precision.
Footprint [MiB]   PGI f77 [Gflops/s]   PGI accelerated [Gflops/s]   Speedup
     91                 2.51                    2.33                 0.928
    366                 2.17                    9.48                 4.37
   1464                 2.21                   14.0                  6.33
   3295                 2.20                   14.1                  6.41
   5149                 1.79                   11.9                  6.65
Table 5: dgemm (double precision) performance using plain f77 code versus plain f77 code with OpenACC
directives inserted. A single core on the GPU node was used and a single GPU.
Footprint [MiB]   PGI f77 [Gflops/s]   PGI accelerated [Gflops/s]   Speedup
     45                 3.34                    2.28                 0.682
    183                 3.35                   11.6                  3.46
    732                 3.29                   24.0                  7.29
   1647                 3.23                   26.9                  8.33
   2575                 4.38                   19.9                  4.54
   3708                 4.36                   22.0                  5.05
   4577                 4.35                   27.9                  6.41
   5538                 4.36                   22.0                  5.05
Table 6: sgemm (single precision) performance using plain f77 code versus plain f77 code with OpenACC
directives inserted. A single core on the GPU node was used and a single GPU.
Figures 49 and 50 show this in another form, where the huge improvement in performance obtained by just inserting compiler directives is evident.
[Plot: Accelerated F77 code vs plain F77, Portland accelerator directives, dgemm. Performance in Gflops/s for PGI Accel and plain PGI versus total matrix footprint (91 to 5149 MiB); data as in Table 5.]
Figure 49: Performance of plain fortran77 code with OpenACC directives vs.
plain f77 code using double precision dgemm reference implementation.
[Plot: Accelerated F77 code vs plain F77, Portland accelerator directives, sgemm. Performance in Gflops/s for PGI Accel and plain PGI versus total matrix footprint (45 to 5538 MiB); data as in Table 6.]
Figure 50: Performance of plain fortran77 code with OpenACC directives vs.
plain f77 code using single precision sgemm reference implementation.
Jacobi relaxation calculation
The code contains loops like :
!$omp do reduction( max:error )
!$acc kernels
do j=1,m-2
   do i=1,n-2
      Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
                                   A(i ,j-1) + A(i ,j+1) )
      error = max( error, abs(Anew(i,j)-A(i,j)) )
   end do
end do
!$acc end kernels
!$omp end do
This is the core loop; directives for both OpenMP and OpenACC are shown. The difference between step 1 and step 2 is the inclusion of statements for copying data to the GPU device memory:
!$acc data copy(A, Anew)
do while ( error .gt. tol .and. iter .lt. iter_max )
   error=0.0_fp_kind
The data copy directive encloses the whole while loop, so the arrays are transferred to the device only once instead of at every iteration, which yields far better performance.
The complete code can be found on the net4 as exercises for OpenACC, openacc-examples.
4
http://www.openacc.org/Sample_Codes
Size mesh        OpenMP 4 cores   OpenMP 8 cores   OpenACC step 1   OpenACC step 2
 8192 x  8192         44.8             41.1             725.2            13.0
16384 x 16384        178.1            163.6            2918.5            53.4
Table 7: Comparing OpenMP and OpenACC directives. Numbers are wall time in seconds.
Appendix
Node configuration
Compute node     Specification
Vendor           Megware, Myriquid / Supermicro
Mainboard        Supermicro X9DRT
Processor        2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 8 cores
L2 cache         8-way set-associative, 2048 kB, write back
L3 cache         20-way set-associative, 20480 kB, write back
Memory           8 x Samsung DDR3 Registered 8192 MB, 1600 MHz
InfiniBand       Mellanox ConnectX-3 FDR
OS               CentOS release 6.4 (Final), later upgraded to 6.6
Compilers        gcc 4.8.0, 4.8.2, 4.9.2 / pgi 13.3 and 13.9
MPI              Openmpi 1.6.4, 1.8, 1.8.4
Math library     Intel MKL 2013.3, 2015.1
Table 8: Node configuration, standard Abel compute node.
Accelerated node   Specification
Vendor             Megware, Myriquid / Supermicro
Mainboard          Supermicro 9DRG-HF
Processor          2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz, 4 cores
L2 cache           8-way set-associative, 1024 kB, write back
L3 cache           20-way set-associative, 10240 kB, write back
Memory             8 x Samsung DDR3 Registered 8192 MB, 1066 MHz
InfiniBand         Mellanox ConnectX-3 FDR
GPU accelerator    2 x NVIDIA Tesla K20Xm, compute capability 3.5, 6 GiB mem., 2688 SP cores, 896 DP cores
OS                 CentOS release 6.4 (Final), later upgraded to 6.6
Compilers          gcc 4.8.0, 4.8.2, 4.9.2 / pgi 13.3 and 13.9
Math library       Intel MKL 2013.3, 2015.1
MPI                Openmpi 1.6.4, 1.8, 1.8.4
CUDA               5.0, 5.5, 6.0, 6.5 and 7.0
Table 9: Node configuration, accelerated Abel compute node.
References
Bioinformatics and Life Sciences
NVIDIA web site about applications:
http://www.nvidia.co.uk/object/gpu-computing-applications-uk.html
http://www.nvidia.co.uk/object/bio_info_life_sciences_uk.html
Porting of VASP to support GPUs:
http://www.ncbi.nlm.nih.gov/pubmed/22903247
ACEMD
http://www.acellera.com/products/acemd/
HYDRO
http://www.prace-ri.eu/IMG/pdf/porting_and_optimizing_hydro_to_new_platforms.pdf
MARK
http://warnercnr.colostate.edu/~gwhite/mark/mark.htm
Notes on optimization :
http://software.intel.com/en-us/articles/step-by-step-optimizing-with-intel-c-compiler
-O1/-Os
This option enables optimizations for speed and disables some optimizations that increase code size and affect speed. To limit code size, it enables global optimization, which includes data-flow analysis, code motion, strength reduction and test replacement, split-lifetime analysis, and instruction scheduling. It also disables inlining of some intrinsics. If -O1 is specified, -Os is enabled by default. At -O1, compiler auto-vectorization is disabled. If your application is sensitive to code size, you may choose the -O1 option.
-O2
This option enables optimizations for speed and is the generally recommended optimization level. Compiler vectorization is enabled at -O2 and higher levels. With this option, the compiler performs some basic loop optimizations, inlining of intrinsics, intra-file interprocedural optimization, and the most common compiler optimizations.
-O3
Performs the -O2 optimizations and enables more aggressive loop transformations such as fusion, block-unroll-and-jam, and collapsing of IF statements. The -O3 optimizations may not give higher performance unless loop and memory-access transformations take place, and in some cases may slow down code compared to -O2. The -O3 option is recommended for applications that have loops making heavy use of floating-point calculations and processing large data sets.
Notes on MKL :
http://software.intel.com/en-us/articles/intel-mkl-automatic-offload-enabled-functions-for-intel-xeon-phi-coprocessors