Memorandum on Graphic Processing Unit (GPU)
performance
Ole W. Saastad
USIT, University of Oslo
Version: 0.7
Date: 24 Aug 2009
Introduction
This memorandum gives results and experience on the usage and feasibility of using Graphics Processors (GPUs) as found on modern graphics adapters. Very high performance figures have been circulating in the media and on the web sites of various chipset manufacturers. In order to assess the performance and also gain experience with the usage of such hardware, this study was undertaken. One major issue is the software stack providing access from user applications in high level languages like C or Fortran. Two software stacks, from Nvidia and ATI, have been tested.
All measurements are presented as measured in the lab and have been verified by two, three and in some cases more runs. However, I can give no guarantee that another tester would see the same numbers. My numbers represent what a user might experience when setting up the system.
Hardware and Software
Hardware comprises a custom built workstation using a quad core AMD Phenom processor, PC8500 DDR2 memory, a WD Raptor 10k SATA disk, an Mtron 32 GB Solid State Disk and an XFX GeForce 8800GTS graphics card, an ATI Radeon HD 3870 or an ATI FireStream 9170.
Software was RedHat Enterprise Linux 5 with the associated compilers, gcc and gfortran, in addition to driver software for the graphics cards. In addition to this, a software stack which provides access from Fortran programs was also included.
The graphics card driver software was installed as suggested. This software downloads and compiles kernel modules for the current kernel automatically; this option was selected and the correct kernel modules were built and installed. A reboot is recommended, as the /dev files failed to appear before a reboot.
The software stacks were installed as suggested by the installer scripts.
The current Nvidia cards only support the single precision floating point data type (32 bit floating point, approx. 8 decimal digits). The ATI cards support double precision (64 bit floating point, approx. 16 decimal digits); see the appendix for more information.
The Portland compiler suite used for the tests of compiler support is PGI 9.0-3.
Additional information about the hardware and software is found in the appendices.
Set up and implementation
NVIDIA graphic card GeForce 8800 GTS
After installing the Software Development Kit the testing of the GPU can start. The toolkit provides a variety of projects, each one testing a different aspect of GPU usage for computation. These projects are written and implemented in the C language. A list of these is shown in table 1.
alignedTypes, asyncAPI, bandwidthTest, binomialOptions, bitonic, BlackScholes, boxFilter, clock, convolutionFFT2D, convolutionSeparable, convolutionTexture, cppIntegration, deviceQuery, dwtHaar1D, dxtc, eigenvalues, fastWalshTransform, fluidsGL, histogram64, histogram256, imageDenoising, lineOfSight, Mandelbrot, marchingCubes, matrixMul, matrixMulDrv, MersenneTwister, MonteCarlo, MonteCarloMultiGPU, multiGPU, nbody, oceanFFT, particles, postProcessGL, reduction, scalarProd, scan, scanLargeArray, simpleAtomics, simpleCUBLAS, simpleCUFFT, simpleGL, simpleStreams, simpleTemplates, simpleTexture, simpleTextureDrv, SobelFilter, template, transpose
Table 1: GPU software example projects
Some of these projects were tested to see how the software stack could be used. However, no real benchmarking effort was put into these preliminary tests. As linear algebra is one of the most used building blocks in HPC, the focus was turned to this. The project “simpleCUBLAS” was investigated and the code instrumented with some timers in order to assess the speedup when using the GPU compared to a simple C code implementation. The compilation of this kind of code is somewhat special, as the software toolkit provides a wrapper to the normal gcc compiler called nvcc. This C compiler is to be used to compile GPU enabled C programs. It is located at bin/nvcc. The installation guide suggests adding this directory to your path and letting ld.config look for libraries at /usr/local/cuda/lib.
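For example, assuming the default install location under /usr/local/cuda (adjust to the actual path, or alternatively add the library directory to /etc/ld.so.conf and run ldconfig):

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib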
Problem size limitation
The nature of the software stack is such that the problem is moved into the memory of the graphics card. The card used for these tests is equipped with 512 MB of DDR3 memory. The biggest problems that can hence be attacked are limited to 512 MB in total. When running problems like GEMM (general matrix multiply), which needs to hold three matrices A, B and C in memory, the largest problem that can be run is N slightly larger than 6500 (NxNx4x3 = 484 MB). This is a small problem compared to the problems that can be attacked by the CPU, which has far more memory to play with.
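To spell out the arithmetic: three single precision NxN matrices occupy 3 x N x N x 4 bytes, so 512 MB allows N up to roughly sqrt(512 x 1024 x 1024 / 12) ≈ 6700. In practice somewhat less is usable, presumably because the CUDA runtime itself also claims some device memory.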
ATI graphic card Radeon HD 3870 and Firestream 9170
The software that needs to be installed is a driver for the graphics adapter and a software stack for computation, CAL and Brook; both are development environments. In addition there is a dependency on X11. Brook makes it relatively simple to access the hardware, but programming the kernels is still a challenge, with a large number of threads to be managed.
Unlimited problem size
With ACML on top of CAL, problems of any size can be attacked, as ACML provides the possibility to run out-of-core (core here is the GPU local memory, ½ GB in this study). This is a major improvement over earlier math library implementations.
Lack of compiler support
At present only one Fortran compiler, namely the Portland suite of compilers, has support for accelerators.
User Experience
NVIDIA GeForce 8800 GTS
simpleCUBLAS
The simpleCUBLAS project was tested and used as an example. The makefile contains most of what is needed. However, it ran into problems when linking due to, not uncommonly, missing libraries. Sgemm is Single Precision General Matrix Multiply, C = alpha A*B + beta C, where A, B and C are matrices and alpha and beta are scalars; N is typically the matrix size.
olews@styren ~/work/NVIDIA_CUDA_SDK/projects/simpleCUBLAS $ make
simpleCUBLAS.c: In function 'main':
simpleCUBLAS.c:89: warning: function declaration isn't a prototype
simpleCUBLAS.c:89: warning: nested extern declaration of 'seconds'
/usr/bin/ld: cannot find -lglut
collect2: ld returned 1 exit status
make: *** [../../bin/linux/release/simpleCUBLAS] Error 1
olews@styren ~/work/NVIDIA_CUDA_SDK/projects/simpleCUBLAS $
This is easy to overcome by issuing a manual link line, which also includes the seconds.c file containing a timer function.
$ nvcc -o simpleCUBLAS.x -L/usr/local/cuda/lib -lcublas seconds.c simpleCUBLAS.c
This produces an executable. The problem size is set within the C code, and by increasing it until run times were long enough for reasonable timing, useful runs could be made:
olews@styren ~/work/NVIDIA_CUDA_SDK/projects/simpleCUBLAS $ ./simpleCUBLAS.x
Simple sgemm finished 153.630000 secs.
GPU sgemm start
Simple sgemm finished 0.190000 secs.
Test PASSED
Press ENTER to exit...
A significant speedup was recorded when comparing with the simple C reference code. The simple C reference implementation looks like this:
{
    int i;
    int j;
    int k;

    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            float prod = 0;
            for (k = 0; k < n; ++k) {
                prod += A[k * n + i] * B[j * n + k];
            }
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}
This is hence not the best reference. A better reference would be an sgemm from one of the highly optimized BLAS libraries like the AMD Core Math Library (ACML) or the Goto library from the University of Texas.
The code to call the GPU software stack is relatively simple:
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
In addition there are a few statements to copy and retrieve the vectors to and from the GPU memory. It is quite simple to use and there seem to be no serious obstacles.
/* Allocate device memory for the matrices */
status = cublasAlloc(n2, sizeof(d_A[0]), (void**)&d_A);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf (stderr, "!!!! device memory allocation error (A)\n");
return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_B[0]), (void**)&d_B);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf (stderr, "!!!! device memory allocation error (B)\n");
return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_C[0]), (void**)&d_C);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf (stderr, "!!!! device memory allocation error (C)\n");
return EXIT_FAILURE;
}
/* Initialize the device matrices with the host matrices */
status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf (stderr, "!!!! device access error (write A)\n");
return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf (stderr, "!!!! device access error (write B)\n");
return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf (stderr, "!!!! device access error (write C)\n");
return EXIT_FAILURE;
}
ConvolutionFFT2D
The C code example shows that it is slightly more complex to use the CUDA FFT than a library like FFTW or ACML. The lines to make a plan, copy data to device memory and perform the FFT calculation are given below:
CUDA_SAFE_CALL( cudaMallocArray(&a_Kernel, &float2tex, KERNEL_W, KERNEL_H) );
CUDA_SAFE_CALL( cudaMallocArray(&a_Data, &float2tex, DATA_W, DATA_H) );
CUDA_SAFE_CALL( cudaMalloc((void **)&d_PaddedKernel, FFT_SIZE) );
CUDA_SAFE_CALL( cudaMalloc((void **)&d_PaddedData, FFT_SIZE) );
CUFFT_SAFE_CALL( cufftPlan2d(&FFTplan, FFT_H, FFT_W, CUFFT_C2C) );
CUDA_SAFE_CALL( cudaMemset(d_PaddedKernel, 0, FFT_SIZE) );
CUDA_SAFE_CALL( cudaMemset(d_PaddedData, 0, FFT_SIZE) );
CUDA_SAFE_CALL( cudaMemcpyToArray(a_Kernel, 0, 0, h_Kernel, KERNEL_SIZE, cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL( cudaMemcpyToArray(a_Data, 0, 0, h_Data, DATA_SIZE, cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL( cudaBindTextureToArray(texKernel, a_Kernel) );
CUDA_SAFE_CALL( cudaBindTextureToArray(texData, a_Data) );
CUFFT_SAFE_CALL( cufftExecC2C(FFTplan, (cufftComplex *)d_PaddedKernel, (cufftComplex *)d_PaddedKernel, CUFFT_FORWARD) );
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );
CUFFT_SAFE_CALL( cufftExecC2C(FFTplan, (cufftComplex *)d_PaddedData, (cufftComplex *)d_PaddedData, CUFFT_FORWARD) );
CUFFT_SAFE_CALL( cufftExecC2C(FFTplan, (cufftComplex *)d_PaddedData, (cufftComplex *)d_PaddedData, CUFFT_INVERSE) );
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutStopTimer(hTimer) );
CUDA_SAFE_CALL( cudaMemcpy(h_ResultGPU, d_PaddedData, FFT_SIZE, cudaMemcpyDeviceToHost) );
CUDA_SAFE_CALL( cudaUnbindTexture(texData) );
CUDA_SAFE_CALL( cudaUnbindTexture(texKernel) );
CUFFT_SAFE_CALL( cufftDestroy(FFTplan) );
CUDA_SAFE_CALL( cudaFree(d_PaddedData) );
CUDA_SAFE_CALL( cudaFree(d_PaddedKernel) );
CUDA_SAFE_CALL( cudaFreeArray(a_Data) );
CUDA_SAFE_CALL( cudaFreeArray(a_Kernel) );
It is interesting to note that there are no special or fancy data types. The code is slightly more complex, but still within the reach of a normal scientific C programmer.
Fortran_Cuda_Blas
The Nvidia CUDA website also contains software packages with a Fortran interface. Using this interface makes it extremely simple to use the GPU. The interface is simple to install and uses the CUDA software libraries already installed. Originally it tries to use g95, but this is no longer shipped with RH5. I changed to gfortran, which is f90 compatible and works fine with the CUDA software stack. Several other Fortran compilers are supported, including Intel.
The make process is simple to follow and performs the following steps:
olews@styren ~/work/Fortran_Cuda_Blas $ make sgemm_speed_cublas
gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
gfortran -o sgemm_speed_cublas -O3 -DCUBLAS sgemm_speed.f90 fortran.o -L/usr/local/cuda/lib -lcublas -lcudart -L/usit/platon/gvd-u1/olews/lib64/ -lacml
olews@styren ~/work/Fortran_Cuda_Blas $
In this example the ACML BLAS library was linked in. The Fortran program can be compiled to call an external BLAS sgemm function or to use a GPU based sgemm function. It is worth noting that there is no usage of the nvcc compiler wrapper, only gcc and gfortran.
The Fortran program to test the sgemm is remarkably simple :
!
! Simple Fortran90 program that multiplies 2 square matrices calling Sgemm
!   C = alpha A*B + beta C
!
program matrix_multiply
  implicit none
  ! Define the floating point kind to be single precision
  integer, parameter :: fp_kind = kind(0.0)
  real (fp_kind), dimension(:,:), allocatable :: A, B, C
  real :: time_start, time_end
  real (fp_kind) :: alpha=1._fp_kind, beta=1._fp_kind, c_right
  integer :: i, j, m1, m2

  do m1=128,(40*128),128
     allocate(A(m1,m1))
     allocate(B(m1,m1))
     allocate(C(m1,m1))

     ! Initialize the matrices A, B and C
     A=1._fp_kind
     B=2._fp_kind
     C=3._fp_kind

     ! With the prescribed inputs, each element of the C matrix should be equal to c_right
     c_right = 2._fp_kind*m1+3._fp_kind

     ! Compute the matrix product
     call cpu_time(time_start)
     call cublas_SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
!    call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
     call cpu_time(time_end)

     ! Print timing information
     print "(i5,1x,i4,a,1x,f8.4,2x,a,f12.4)", m1, &
          (m1*m1*4)/(1024*1024), " MB  time =", time_end-time_start, &
          " GFLOPS=", 1.e-9*2._fp_kind*m1*m1*m1/(time_end-time_start)

     ! Check the result
     do j=1,m1
        do i=1,m1
           if ( abs(c(i,j)-c_right) .gt. 1.d-8 ) then
              print *, "sgemm failed", i, j, abs(c(i,j)-c_right)
              exit
           end if
        end do
     end do

     deallocate(A,B,C)
  end do
end program matrix_multiply
All the work associated with moving data from main memory to the device memory on the GPU card is hidden away in the Fortran wrapper library, leaving a very simple and elegant Fortran interface. In order to make calling from Fortran simple, all arrays start at index 1 and are stored column major, as is done in Fortran.
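To illustrate what such a wrapper does, a minimal sketch is given below. This is not the fortran.c shipped by Nvidia; it assumes the legacy CUBLAS C API (cublasAlloc, cublasSetMatrix, cublasSgemm, cublasGetMatrix, cublasFree), gfortran's underscore name mangling, non-transposed matrices and Fortran's column major layout, and it skips all error handling.

/* Sketch of a thunking wrapper: the Fortran call cublas_SGEMM(...) lands
   here, and all device memory handling is done on behalf of the caller. */
#include <cublas.h>

void cublas_sgemm_(char *transa, char *transb, int *m, int *n, int *k,
                   float *alpha, float *A, int *lda, float *B, int *ldb,
                   float *beta, float *C, int *ldc)
{
    float *d_A, *d_B, *d_C;

    /* Allocate device memory for the three matrices */
    cublasAlloc((*lda) * (*k), sizeof(float), (void **)&d_A);
    cublasAlloc((*ldb) * (*n), sizeof(float), (void **)&d_B);
    cublasAlloc((*ldc) * (*n), sizeof(float), (void **)&d_C);

    /* Copy the host matrices to device memory (column major) */
    cublasSetMatrix(*m, *k, sizeof(float), A, *lda, d_A, *lda);
    cublasSetMatrix(*k, *n, sizeof(float), B, *ldb, d_B, *ldb);
    cublasSetMatrix(*m, *n, sizeof(float), C, *ldc, d_C, *ldc);

    /* Run the GEMM on the GPU */
    cublasSgemm(*transa, *transb, *m, *n, *k, *alpha,
                d_A, *lda, d_B, *ldb, *beta, d_C, *ldc);

    /* Copy the result back and release device memory */
    cublasGetMatrix(*m, *n, sizeof(float), d_C, *ldc, C, *ldc);
    cublasFree(d_A);
    cublasFree(d_B);
    cublasFree(d_C);
}

Every call hence pays for three host-to-device copies and one device-to-host copy in addition to the GEMM itself, which is part of the reason why the measured speedup grows with the matrix size.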
Portland compiler C and Fortran support
The Portland group (www.pgroup.com) has developed a set of compilers that can generate
code for the NVIDIA GPU chipset. From PGI's web site:
“PGI is introducing the Accelerator Programming Model for Fortran and C with PGI Release
9.0. The Accelerator Programming Model uses directives and compiler analysis to compile
natural Fortran and C for the GPU; this often allows you to maintain a single source version,
since ignoring the directives will compile the same program for the X64 CPU. Note the model
is called the Accelerator Programming Model, not the GPU Programming Model; the model is
designed to be forward-looking as well, to accomodate other accelerators that may come in the
future, preserving your software development investment.“
This is done in OpenMP style with compiler directives. The compiler then tries to generate code that can be submitted to the CUDA software framework provided by NVIDIA. The user level experience is remarkably simple; a few compiler directives are all that is needed to start using the GPU as an accelerator available within Fortran and C programs. Compiler directives like
!$acc region
<your code>
!$acc end region
are all that is needed in the Fortran code to activate the compiler machinery that tries to generate kernels for the GPU. The same style as for OpenMP directives is adopted.
The compiler support is a major step forward as it allows legacy codes to be recompiled to take advantage of the new GPUs.
An example of Fortran code is given below :
!$acc region
do i = 1,n
r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
enddo
!$acc end region
This is all there is to it, just two simple compiler directives as comments. The compiler will invoke its analysis machinery when the code is compiled with the correct options:
pgfortran -o f2.exe f2.f90 -ta=nvidia,cc11 -Minfo=accel
This will generate a kernel that can be copied to the GPU and used to calculate the problem. There is nothing special to be done to run the program; just launch it as normal. The runtime libraries will schedule the work and pick the GPU.
ATI Radeon HD 3870
There are a number of test and tutorial applications that come with CAL (Compute Abstraction Layer), and some of these can be useful for comparing performance. The software development kit also contains Brook (Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language; the general computational model is referred to as streaming).
Demo Applications
A range of demo applications comes with CAL, as shown in table 2.
double_matmult, hellocal, lu_decomposition, memexport_matmult, memimport_matmult, outofcoreMMM, simple_matmul
Table 2: CAL GPU software example projects, applications
Demo Run time utilities
cachespeed, domain, exportspeed, importspeed, inputspeed, integer_alu, memtiming, nonblocking_map, outputspeed, perf_counters, throughput
Table 3: CAL GPU software example projects, run time utilities
Compiling the tutorials and demos is easy; a simple make command produces a binary that can be run and the results recorded. The only disadvantage is that all this CAL and GPU software relies on local access to X11, which means that you cannot run remotely (this is a showstopper and has been conveyed to AMD/ATI; newer versions will not have this limitation).
Results
NVIDIA GeForce
Bandwidth
One major issue associated with the application of the GPU is the memory bandwidth at which data can be moved from main memory to and from device memory on the GPU device. The GPU can only operate on data in device memory. The CUDA toolkit contains benchmark project code that measures several aspects of memory bandwidth. Table 4 shows at which rates data can be moved from main memory to device memory and within the GPU device. The bandwidth for moving data between the two types of memory is relatively low compared to the processing power of the GPU. Pinning of memory helps in getting higher bandwidth, most notably for copying data from device memory to main memory.
Transfer type       Bandwidth [GB/s]    Bandwidth [GB/s]
                    Pageable memory     Pinned memory
Host to Device      2.291               2.604
Device to Host      1.857               3.265
Device to Device    50.992              51.012
Table 4: Bandwidth measurements to and from the Graphics Card memory.
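The difference between the two columns comes from how the host buffer is allocated. A minimal sketch is given below, assuming the CUDA runtime API where cudaMallocHost returns page-locked (pinned) host memory; the copy from the pinned buffer can be done with DMA directly, which is what gives the higher numbers in the pinned memory column.

/* Sketch: copying a buffer to the device from pageable vs. pinned host memory */
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 32 << 20;                     /* 32 MB test buffer         */
    float *h_pageable = (float *)malloc(bytes);  /* ordinary, pageable memory */
    float *h_pinned, *d_buf;

    cudaMallocHost((void **)&h_pinned, bytes);   /* page-locked host memory   */
    cudaMalloc((void **)&d_buf, bytes);          /* device memory             */

    /* Same call, different source buffers; the pinned copy is the faster one */
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}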
ConvolutionFFT2D
This sample demonstrates how 2D convolutions with very large kernel sizes can be efficiently
implemented using FFT transformations.
Table 5 shows a very high speedup when using the GPU.
Size N / conv. size    Run time CPU [secs]   Run time GPU [secs]   Speedup
1000 x 1000 / 7 x 7    0.3                   0.03                  10
3000 x 3000 / 7 x 7    2.70                  0.43                  6.3
Table 5: Run time for Convolution using 2D FFT.
Financial MonteCarlo
This project uses Monte Carlo simulation to price stock options using a large number of statistical draws. The problem is embarrassingly parallel and hence well suited to the very high parallelism of the GPU with its large number of processing elements. The recorded speedup is so high that it could be thought of as unrealistic.
Nvidia describes the project as:
“The pricing of options has been a very important problem encountered in financial engineering since
the advent of organized option trading in 1973. As more computation has been applied to finance-related
problems, finding efficient implementations of option pricing models on modern architectures has
become more important. This white paper describes an implementation of the Monte Carlo approach to
option pricing in CUDA. For complete implementation details, please see the “MonteCarlo” example in
the NVIDIA CUDA SDK.”
Run time CPU [sec]   Run time GPU [sec]   Speedup
771.8                0.26                 2965
Table 6: Financial MonteCarlo simulation. Measurements of run time in seconds.
simpleCUBLAS
This example and benchmark, implemented in C, uses a direct interface to the CUDA libraries. The results obtained are measured in seconds of run time for each called sgemm routine. The reference is a simple plain C language implementation of sgemm. The speedups are unrealistic as the simple C implementation is unrealistically time consuming.
Size N   Total Size MB (NxNx4x3)   Simple/CPU [sec]   GPU [sec]   Speedup
500      3                         1.3                0.01        120
1000     11.4                      16.2               0.02        810
2000     46                        153.8              0.19        809
4000     183                       1556               1.0         1556
5000     286                       3210.0             2.55        1259
6200     484                       8345.1             4.75        1757
Table 7: simpleCUBLAS performance as measured in run time in seconds.
Fortran_Cuda_Blas
The Fortran CUDA BLAS benchmark calls the Fortran BLAS wrapper library and is remarkably simple to use from Fortran. The results below are compared to the AMD Core Math Library (ACML). The measured speedup of over 8x is impressive compared to one of the best performing BLAS libraries running on an x86-64 CPU at 2.3 GHz.
Size N   Total Size MB (NxNx4x3)   ACML/CPU [Gflops/s]   GPU [Gflops/s]   Speedup
512      3                         14.9                  38.36            2.6
1024     12                        14.9                  79.55            5.3
2048     48                        15.8                  110.9            7.0
4096     192                       15.9                  118.3            7.5
5120     300                       15.9                  120.6            7.6
6016     414                       15.9                  135.9            8.6
6272     450                       15.9                  136.2            8.6
Table 8: Fortran CUDA BLAS performance measured in Gflops/s.
Figure 1: SGEMM - CPU using AMD Core Math Lib. versus GPU. (Chart: performance in Gflops/s versus matrix size N, for CPU/ACML and GPU.)
Fortran Callable BLAS test
How simple can it be, seen from the user perspective? I have written a small Fortran test program to show how simply the GPU can be accessed and how easy it is to gain access to its power.
      program gemmtest
      parameter(N=6300)
      real a, b, c, alpha, beta
      dimension a(N,N), b(N,N), c(N,N)
      real time_start, time_end, speedup
      integer i,j

!     Init matrices
      do i=1,N
         do j=1,N
            a(j,i)=rand(0)
            b(j,i)=rand(0)
         enddo
      enddo
      alpha = 1.0
      beta  = 1.0
      write(*,*) "Total footprint A,B & C", N*N*4*3/(1024*1024), " MB"

      c=0
      call cpu_time(time_start)
      write(*,*) "CUDA sgemm start"
      call cublas_SGEMM('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ", c(1,1), " c(N,N) ", c(N,N)
      write(*,*) "CUDA sgemm end", time_end-time_start, " secs"
      speedup=time_end-time_start

      c=0
      call cpu_time(time_start)
      write(*,*) "CPU sgemm"
      call sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ", c(1,1), " c(N,N) ", c(N,N)
      write(*,*) "CPU sgemm end", time_end-time_start, " secs"
      speedup=(time_end-time_start)/speedup
      write(*,*) "Speedup ", speedup
      end
Compilation is also very simple:
gfortran -o sgemm-test.x -O2 fortran.o -L/usr/local/cuda/lib/ -lcublas -L${HOME}/lib64 -lgoto sgemmtest.f90
The Fortran wrapper fortran.o is a piece of C code that interfaces between the Fortran code and the CUDA libraries. This file is needed, but can be placed anywhere with system wide access. The normal BLAS library, in this case the Goto library, is also linked in to be used for reference. In this test both alpha and beta have been set to 1.0, giving somewhat more work than just AxB: C = alpha*A*B + beta*C, a fused multiply add per element.
Performance is very good for this test, as shown in the table below. Outperforming the ACML (equal performance was measured for the Goto library) by a factor of close to 6 is very good. Doing the arithmetic one arrives at 15.6 Gflops/s for the CPU and 91.8 Gflops/s for the GPU when solving the N=6300 problem (assuming a flops count of 2xNxNxN).
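Spelling out that arithmetic with the N=6300 run times from table 9: 2 x 6300^3 ≈ 5.0 x 10^11 floating point operations, giving 5.0 x 10^11 / 32.0 s ≈ 15.6 Gflops/s on the CPU and 5.0 x 10^11 / 5.45 s ≈ 91.8 Gflops/s on the GPU.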
Size N   Total Size MB (NxNx4x3)   GPU [Secs]   ACML/CPU [Secs]   Speedup
500      2                         0.564        0.0179            0.031
1000     11                        0.574        0.135             0.24
2000     45                        0.721        1.03              1.4
4000     183                       1.51         8.11              5.4
5000     286                       3.00         16.0              5.3
6000     411                       4.68         27.4              5.8
6300     454                       5.45         32.0              5.9
Table 9: Fortran CUDA BLAS single precision performance measured in run times in seconds.
Within the limitation of the single precision data type, the complex data type is also possible. The number of floating point operations per addition, and to a larger extent per multiplication, is larger per byte of data than for the real data type. This should yield even better utilization of the GPU based BLAS functions.
Size N   Total Size MB (NxNx4x3x2)   ACML/CPU [Secs]   GPU [Secs]   Speedup
500      5                           0.0670            0.558        0.12
1000     22                          0.524             0.632        0.83
2000     91                          4.14              1.02         4.0
3000     205                         13.9              2.51         5.5
3500     280                         22.0              3.33         6.6
4000     366                         32.9              4.25         7.7
4400     443                         43.8              5.41         8.1
Table 10: Fortran CUDA BLAS single precision complex performance measured in run times in seconds.
Figure 2: GEMM - CPU using AMD Core Math Lib. versus GPU. Speedup using single precision data type and complex data types. (Chart: GPU speedup over CPU versus matrix size N.)
Multi thread CPU
The installed CPU has four cores and can consequently run 4 threads in parallel. This should yield close to 4 times the performance when using a multi threaded linear algebra library like the Goto library compiled for the multi threaded AMD processors. For these tests the BLAS level 3 matrix multiply routine GEMM has been used.
Size N   ACML/CPU 2 threads speedup   ACML/CPU 4 threads speedup   GPU speedup
500      1.91                         2.38                         0.14
1000     1.94                         3.71                         0.92
2000     1.96                         3.87                         4.29
3000     1.96                         3.86                         5.56
4400     1.96                         3.90                         8.03
Table 11: Speedup relative to one CPU core for 2 and 4 CPU threads (ACML) and for the GPU. Single precision complex data type (CGEMM).
Figure 3: Speedup using multiple CPU threads and GPU for single precision complex data type. BLAS level 3 CGEMM. (Chart: speedup relative to 1 core versus matrix size, for 1, 2 and 4 threads and the GPU.)
Size N   ACML/CPU 2 threads speedup   ACML/CPU 4 threads speedup   GPU speedup
500      0.95                         1.89                         0.04
1000     1.93                         2.65                         0.26
2000     1.96                         3.87                         1.59
4000     1.97                         3.90                         5.72
6000     1.96                         3.89                         5.97
Table 12: Speedup relative to one CPU core for 2 and 4 CPU threads (ACML) and for the GPU. Single precision data type (SGEMM).
Figure 4: Speedup using multiple CPU threads and GPU for single precision data type. BLAS level 3 SGEMM. (Chart: speedup relative to 1 core versus matrix size, for 1, 2 and 4 threads and the GPU.)
Figure 5: Speedup GPU versus CPU using threads on all cores, for this quad core AMD processor 4 threads. BLAS level 3 CGEMM matrix multiplication with complex single precision. (Chart: GPU speedup over the 4-thread CPU versus matrix size.)
Figure 6: Speedup GPU versus CPU using threads on all cores, for this quad core AMD processor 4 threads. BLAS level 3 SGEMM matrix multiplication with single precision. (Chart: GPU speedup over the 4-thread CPU versus matrix size.)
Portland Compiler testing using Fortran code only
The Portland compiler uses directives to invoke machinery that tries to generate kernels that will run on the GPU. A simple example is given below; this example is used for testing and benchmarking purposes.
program main
  use accel_lib
  integer :: n                              ! size of the vector
  real, dimension(:), allocatable :: a      ! the vector
  real, dimension(:), allocatable :: r      ! the results
  real, dimension(:), allocatable :: e      ! expected results
  integer :: i
  integer :: c0, c1, c2, c3, cgpu, chost
  character(10) :: arg1

  if( iargc() .gt. 0 )then
     call getarg( 1, arg1 )
     read(arg1,'(i10)') n
  else
     n = 100000
  endif
  if( n .le. 0 ) n = 100000

  allocate(a(n))
  allocate(r(n))
  allocate(e(n))
  do i = 1,n
     a(i) = i*2.0
  enddo

  !call acc_init( acc_device_nvidia )
  call system_clock( count=c1 )

!$acc region
  do i = 1,n
     r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
  enddo
!$acc end region

  call system_clock( count=c2 )
  cgpu = c2 - c1
  do i = 1,n
     e(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
  enddo
  call system_clock( count=c3 )
  chost = c3 - c2

  ! check the results
  do i = 1,n
     if( abs(r(i) - e(i)) .gt. .00001 )then
        print *, i, r(i), e(i)
     endif
  enddo
  print *, n, ' iterations completed'
  print *, cgpu, ' microseconds on GPU'
  print *, chost, ' microseconds on host'
end program main
This code is compiled with a simple statement:
pgfortran -o f2.exe -O3 f2.f90 -ta=nvidia,cc11 -Minfo=acc
It is run by launching the program in the normal way. The timings are printed and hence show the relative performance of GPU and CPU.
Problem size [k]   Speedup GPU vs. CPU
100                0.12
1000               1.35
10000              5.21
20000              5.83
40000              6.27
50000              6.29
Table 13: Speedup compiler generated kernel running on GPU vs. f90 code running on CPU.
Figure 7: Speedup compiler generated kernel running on GPU vs. f90 code running on CPU. (Chart: speedup versus problem size in k elements.)
The observed speedup is quite modest for this simple test. More computationally demanding codes might yield higher speedups.
An interesting test is to see if the f90 code can utilize all 4 cores on the processor. The f90 code was instrumented with OpenMP directives and 1 through 4 cores were used. Using all 4 cores in this simple calculation shows that 4 cores can keep up with the GPU.
Number of cores   Speedup GPU vs. CPU
1                 6.31
2                 3.15
3                 2.19
4                 1.62
Table 14: Speedup GPU vs. multiple cores on the CPU. Fortran 90 code compiled with OpenMP directives. Size is 50000k, which is the largest possible, ref. the table above.
Figure 8: Speedup GPU vs. multiple cores on the CPU. Fortran 90 code compiled with OpenMP directives. Size is 50000k, which is the largest possible. (Chart: GPU speedup versus number of cores used in OpenMP.)
It is apparent that more complicated and more computationally demanding kernels must be given to the GPU in order to get a scaling that pays off the extra work and power consumption involved with using GPU cards in compute nodes. The examples above demonstrate that not all problems are suitable for GPU acceleration.
A problem involving more computation that can be run entirely on the GPU might show significant speedup compared to the less encouraging examples above. An example of such a computational problem is a smooth routine given below:
subroutine smooth( a, b, w0, w1, w2, n, m, niters )
  real, dimension(n,m) :: a, b
  real :: w0, w1, w2
  integer :: n, m, niters
  integer :: i, j, iter
!$acc region
  do iter = 1,niters
     do i = 2,n-1
        do j = 2,m-1
           a(i,j) = w0 * b(i,j) + &
                    w1*(b(i-1,j)+b(i,j-1)+b(i+1,j)+b(i,j+1)) + &
                    w2*(b(i-1,j-1)+b(i-1,j+1)+b(i+1,j-1)+b(i+1,j+1))
        enddo
     enddo
     do i = 2,n-1
        do j = 2,m-1
           b(i,j) = a(i,j)
        enddo
     enddo
  enddo
!$acc end region
end subroutine smooth
Most of the computation can be performed on the GPU card, which should yield significant performance. The compiler manages to translate this computational problem into a GPU kernel that can run efficiently on the GPU. See the analysis output from the compiler given below:
smooth:
59, Generating copyout(a(2:n-1,2:m-1))
Generating copyin(b(1:n,1:m))
Generating copyout(b(2:n-1,2:m-1))
60, Loop carried dependence due to exposed use of b(1:n,1:m) prevents parallelization
Parallelization would require privatization of array a(i2+2,2:m-1)
Sequential loop scheduled on host
61, Loop is parallelizable
62, Loop is parallelizable
Accelerator kernel generated
61, !$acc do parallel, vector(16)
62, !$acc do parallel, vector(16)
Cached references to size [18x18] block of 'b'
68, Loop is parallelizable
69, Loop is parallelizable
Accelerator kernel generated
68, !$acc do parallel, vector(16)
69, !$acc do parallel, vector(16)
For this example the performance is closer to the expectations of what a GPU can do. The recorded speedups in table 15 are such that the extra effort of using GPUs pays off. The code running on the CPU is compiled with full optimization (-O3). Speedups of 100x or more have been reported, but those are for hand written GPU kernels and not compiler translated Fortran code.
Problem size [MB]   Run time CPU [seconds]   Run time GPU [seconds]   Speedup GPU vs. CPU
8                   7.91                     0.31                     26
32                  33.3                     1.06                     31
72                  91.4                     2.53                     36
128                 178                      4.19                     43
200                 347                      7.26                     48
288                 538                      9.84                     55
392                 839                      14.23                    59
462                 1120.7                   16.12                    70
Table 15: Execution times and speedup of the “smooth” subroutine using a compiler generated GPU kernel running on the GPU versus f90 code running on a single core on the CPU. The f90 code is compiled with full optimization -O3.
Figure 9: Speedup running the smooth filter on the GPU vs. a single core on the CPU. The graphics card has only 512 MB of memory. (Chart: speedup versus problem size in MB.)
Using all four cores in the quad core processor by means of OpenMP support in the compiler, a more realistic speedup can be obtained. Most codes that are suited for GPU acceleration are probably also suitable for OpenMP parallelization.
Table 16 and figure 10 show the obtained results for this test. The speedup recorded for this test is promising and well worth investigating further.
Problem size [MB]   Run time CPU 4 cores using OpenMP [seconds]   Run time GPU [seconds]   Speedup GPU vs. CPU
8                   2.06                                          0.31                     7
32                  8.80                                          1.06                     8
72                  22.4                                          2.52                     9
128                 47.2                                          4.19                     11
200                 93.4                                          7.25                     13
288                 179                                           9.84                     18
392                 375                                           14.2                     26
462                 509                                           16.1                     32
Table 16: Execution times and speedup of the “smooth” subroutine using a compiler generated GPU kernel running on the GPU versus f90 code running on four cores on the CPU. The f90 code is compiled with full optimization -O3 and OpenMP instrumented do loops.
Figure 10: Speedup running the smooth filter on the GPU vs. four cores on the processor. The graphics card has only 512 MB of memory. The f90 code was parallelized using OpenMP directives on the do loops. (Chart: speedup versus problem size in MB.)
ATI Radeon
Bandwidth
One major issue associated with the application of the GPU is the memory bandwidth at which data can be moved from main memory to and from device memory on the GPU device. The GPU can only operate on data in device memory. The CAL toolkit contains benchmarks that measure several aspects of memory bandwidth. Tables 17 to 19 show at which rates data can be moved from main memory to device memory and within the GPU device. The bandwidth for moving data between the two types of memory is relatively low compared to the processing power of the GPU.
Cachespeed
Cachespeed is a simple copy kernel to test total cache speed. This program can use GPU and main memory as source and destination. This makes it possible to measure memory bandwidth between the main system cache and the GPU device cache.
Number of I/Os   GPU => CPU bandwidth [GB/sec]   CPU => GPU bandwidth [GB/sec]   GPU local bandwidth [GB/sec]
2                3.85                            62.33                           55.60
4                7.70                            104.63                          119.27
6                11.55                           156.25                          176.89
8                15.40                           135.48                          236.74
10               19.26                           151.80                          279.02
12               23.12                           165.44                          292.06
14               26.98                           174.16                          301.31
16               30.89                           182.75                          307.88
Table 17: Bandwidth measurements using the GPU cachespeed utility.
Inputspeed
Inputspeed is a simple utility to measure import speed. It measures how fast memory can be copied from main memory to GPU memory.
Number of I/Os   GPU input bandwidth [GB/sec]
2                33.97
3                30.05
4                30.64
5                31.50
6                29.30
7                28.19
8                28.41
9                29.30
10               28.10
11               31.36
12               31.89
13               32.97
14               32.75
15               32.92
16               33.24
17               33.37
Table 18: Bandwidth measurements using the GPU inputspeed utility.
Outputspeed
Outputspeed is a simple utility to measure export speed. It measures how fast memory can be copied from GPU memory to main memory.
Number of I/Os   GPU output bandwidth [GB/sec]
1                15.02
2                15.32
3                18.90
4                23.67
5                25.37
6                23.21
7                19.81
8                17.66
Table 19: Bandwidth measurements using the GPU outputspeed utility.
Fortran Callable BLAS test
How simple can it be, seen from the user perspective? I have written a small Fortran test program to show how simply the GPU can be accessed and how easy it is to gain access to its power.
      program gemmtest
      parameter(N=10000)
      real a, b, c, alpha, beta
      dimension a(N,N), b(N,N), c(N,N)
      real time_start, time_end, speedup
      integer i,j

!     Init matrices
      do i=1,N
         do j=1,N
            a(j,i)=rand(0)
            b(j,i)=rand(0)
         enddo
      enddo
      alpha = 1.0
      beta  = 1.0
      write(*,*) "Total footprint A,B & C", N*N*4*3/(1024*1024), " MB"

      c=0
      call cpu_time(time_start)
      write(*,*) "ATI sgemm start"
!     CALBLAS low level entry; here M=N=K=N and all leading dimensions are N
      call SGEMM_CAL_F(.FALSE., .FALSE., N, N, N, a, N, b, N, c, N, alpha, beta, 0)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ", c(1,1), " c(N,N) ", c(N,N)
      write(*,*) "ATI sgemm end", time_end-time_start, " secs"
      speedup=time_end-time_start

      c=0
      call cpu_time(time_start)
      write(*,*) "CPU sgemm"
      call sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ", c(1,1), " c(N,N) ", c(N,N)
      write(*,*) "CPU sgemm end", time_end-time_start, " secs"
      speedup=(time_end-time_start)/speedup
      write(*,*) "Speedup ", speedup
      end
Compilation is also very simple :
gfortran -o sgemm-test.x -O3 -fopenmp -L/usr/lib64/acml/gfortran64/lib/ -lCALBLAS
-lacml -L/usr/local/amdcal/lib64/ -lamdcalcl -lamdcalrt sgemm-test.f90
The Fortran interface is very simple; it is merely a call to the familiar ACML (AMD Core Math Library). The rest is done by the library and the device drivers. There is a special lower level interface that bypasses the acml layer; this is done by calling SGEMM_CAL_F instead of the acml entry sgemm. This is done to be 100% sure that the gemm is run on the GPU and not the CPU. Very little performance difference was observed, hence showing that there is no need to bypass the high level acml.
The normal BLAS libraries, in this case the Goto library and acml (in a version for the CPU only), are also linked in to be used for reference. In this test both alpha and beta have been set to 1.0, giving somewhat more work than just AxB: C = alpha*A*B + beta*C, a fused multiply add per element.
Performance is very good for this test, as shown in the table below. Outperforming the CPU based computation by a factor of 7-10 is very good. Doing the arithmetic one arrives at 15.9 Gflops/s for the CPU and 166.8 Gflops/s for the GPU when solving the N=16000 out-of-core problem (assuming a flops count of 2xNxNxN).
By using ACML as the interface there is no need to recompile old programs; just a relink is needed for statically linked programs and just a replacement of the library for dynamically linked programs.
Size N   Total Size MB (NxNx4x3)   ACML/GPU [Secs]   ACML/CPU [Secs]   Speedup
500      2                         0.104             0.0180            0.17
1000     11                        0.183             0.227             0.23
2000     45                        0.457             1.06              2.32
3000     102                       0.821             4.37              5.33
4000     183                       1.32              8.39              6.35
5000     286                       2.36              18.3              7.74
6000     411                       3.61              28.4              7.85
7000     560                       5.31              44.6              8.41
8000     732                       7.25              66.6              9.17
9000     926                       14.4              94.8              6.56
10000    1144                      17.8              130.0             7.30
11000    1384                      23.0              169.0             7.35
12000    1647                      24.4              224.0             9.17
13000    1934                      32.1              285.0             8.85
14000    2243                      37.4              343.0             9.17
16000    2929                      49.1              514.0             10.4
Table 20: Fortran ACML-CPU / ACML-GPU BLAS single precision performance measured in run times in seconds.
Figure 11: CPU and GPU performance, single precision GEMM. (Chart: performance in Gflops/s versus matrix size N.)
Fortran Callable BLAS test in Double precision
The ATI Radeon HD 3870 can also do double precision floating point (see the appendix for more information about the double precision data type) using ACML. Table 21 shows the measured performance.
Size N   Total Size MB (NxNx8x3)   ACML/GPU [Secs]   ACML/CPU [Secs]   Speedup
500      5                         0.0370            0.0355            0.959
1000     22                        0.150             0.266             1.77
2000     91                        0.553             2.08              3.76
3000     205                       1.17              6.92              5.91
4000     366                       2.40              16.2              6.75
5000     572                       7.50              31.8              4.24
6000     823                       10.6              55.2              5.21
7000     1121                      14.5              87.2              6.01
8000     1464                      19.3              130               6.73
9000     1853                      35.7              188               5.27
10000    2288                      44.1              253               5.76
11000    2769                      53.1              338               6.37
Table 21: Fortran ACML-CPU / ACML-GPU BLAS double precision performance measured in run times in seconds.
Size N   Total Size [MB] (NxNx4x3)   CPU single [Gflops/s]   GPU single [Gflops/s]   Total Size [MB] (NxNx8x3)   CPU double [Gflops/s]   GPU double [Gflops/s]
500      2                           13.9                    2.40                    5                           7.04                    6.76
1000     11                          8.81                    10.9                    22                          7.51                    13.3
2000     45                          15.1                    35.0                    91                          7.69                    28.9
3000     102                         12.4                    65.8                    205                         7.80                    46.0
4000     183                         15.3                    97.0                    366                         7.90                    53.4
5000     286                         13.7                    105.9                   572                         7.86                    33.3
6000     411                         15.2                    119.7                   823                         7.83                    40.7
7000     560                         15.4                    129.2                   1121                        7.87                    47.3
8000     732                         15.4                    141.2                   1464                        7.88                    53.2
9000     926                         15.4                    101.3                   1853                        7.76                    40.8
10000    1144                        15.4                    112.4                   2288                        7.91                    45.4
11000    1384                        15.4                    115.7                   2769                        7.88                    50.1
12000    1647                        15.4                    141.6                   3296                        Not possible            Not possible
13000    1934                        15.4                    136.9                   3868                        Not possible            Not possible
14000    2243                        16.0                    146.7                   4486                        Not possible            Not possible
16000    2929                        15.9                    166.8                   5859                        Not possible            Not possible
Table 22: ATI Radeon 3870 Fortran ACML-GPU BLAS performance in Gflops/s.
Figure 12: CPU and GPU performance, double precision GEMM. (Chart: performance in Gflops/s versus matrix size N.)
ATI FireStream 9170
Fortran Callable BLAS test
The Fortran test program employed earlier to show how simply the GPU can be accessed is used to test the 9170 card. Details about the Fortran program are found in the section above.
Compilation is again very simple :
gfortran -o sgemm-test.x -O3 -fopenmp -L/usr/lib64/acml/gfortran64/lib/ -lCALBLAS
-lacml -L/usr/local/amdcal/lib64/ -lamdcalcl -lamdcalrt sgemm-test.f90
The Fortran interface is very simple; it is merely a call to the familiar ACML (AMD Core Math Library). The rest is done by the library and the device drivers. There is a special lower level interface that bypasses the acml layer; this is done by calling SGEMM_CAL_F instead of the acml entry sgemm. This is done to be 100% sure that the gemm is run on the GPU and not the CPU. Very little performance difference was observed, hence showing that there is no need to bypass the high level acml.
The normal BLAS libraries, in this case the Goto library and acml (in a version for the CPU only), are also linked in to be used for reference. In this test both alpha and beta have been set to 1.0, giving somewhat more work than just AxB: C = alpha*A*B + beta*C, a fused multiply add per element.
Performance is very good for this test, as shown in the table below. Outperforming the CPU based computation by a factor of 7-10 is very good.
By using ACML as the interface there is no need to recompile old programs; just a relink is needed for statically linked programs and just a replacement of the library for dynamically linked programs.
Size N   Total Size [MB] (NxNx4x3)   CPU single [Gflops/s]   GPU single [Gflops/s]   Total Size [MB] (NxNx8x3)   CPU double [Gflops/s]   GPU double [Gflops/s]
500      2                           14.71                   11.91                   5                           7.14                    6.76
1000     11                          14.82                   22.23                   22                          7.41                    10.87
2000     45                          15.28                   54.25                   91                          7.54                    28.89
3000     102                         15.37                   69.51                   205                         7.56                    40.56
4000     183                         15.39                   100.41                  366                         7.59                    60.16
5000     286                         15.43                   126.99                  572                         7.59                    55.68
6000     411                         15.42                   122.33                  823                         7.56                    67.11
7000     560                         15.38                   140.16                  1121                        7.59                    68.96
8000     732                         15.43                   166.53                  1464                        7.59                    83.90
9000     926                         15.42                   152.12                  1853                        7.60                    66.97
10000    1144                        15.42                   175.91                  2288                        7.58                    68.77
11000    1384                        15.43                   176.21                  2769                        7.59                    90.62
12000    1647                        15.41                   248.76                  3296                        7.61                    92.28
13000    1934                        15.43                   189.04                  3868                        Not possible            Not possible
14000    2243                        15.40                   202.22                  4486                        Not possible            Not possible
15000    2575                        15.37                   211.89                  5150                        Not possible            Not possible
16000    2929                        15.47                   216.82                  5859                        Not possible            Not possible
17000    3307                        15.47                   167.76                  6615                        Not possible            Not possible
Table 23: ATI Firestream 9170 Fortran ACML-GPU BLAS performance (single and double precision) in Gflops/s, compared to the Goto BLAS library for the single thread CPU performance.
Figure 13: ATI Firestream 9170, GEMM performance single and double precision. (Chart: performance in Gflops/s versus matrix size N.)
Quad core processors
Typical processors today have 4 cores and BLAS libraries can utilize all these cores in a single function call. It is fair to compare the GPU performance with 4 threads running on a single processor. Doing so, the GPU is only about twice as fast as a quad core processor.
Figure 14: Performance comparison, Firestream 9170 GPU vs. 4 threads on the quad core processor. Single precision data type. (Chart: SGEMM performance in Gflops/s versus matrix size N.)
Figure 15: Performance comparison, Firestream 9170 GPU vs. 4 threads on the quad core processor. Double precision data type. (Chart: DGEMM performance in Gflops/s versus matrix size N.)
Comparing NVIDIA and ATI
The ATI cards Radeon HD 3870 and Firestream 9170 have support for double precision floating point numbers. This is a substantial improvement over earlier cards and GPU chips. Both the NVIDIA and the ATI cards have been installed in the same machine and run the same Fortran 90 test program. The matrix multiplication problem (SGEMM) is used for comparison, as this is one of the most common linear algebra problems to be solved. By assuming a flops count of 2xNxNxN for the SGEMM operation we can also calculate some performance numbers.
For double precision the ATI cards stand alone, see tables 21 and 22.
Size [N]   Total Size [MB]   NVIDIA [Gflops/s]     ATI 3870 [Gflops/s]   Firestream 9170 [Gflops/s]
500        2                 0.442                 2.40                  11.9
1000       11                3.66                  10.9                  22.2
2000       45                22.2                  35.0                  54.3
3000       102               n.d.                  65.8                  69.5
4000       183               84.8                  97.0                  100.4
5000       286               83.3                  105.9                 127.0
6000       411               92.3                  119.7                 122.3
6300       454               91.8                  n.d.                  n.d.
7000       560               Not possible to run   129.2                 140.2
8000       732               Not possible to run   141.2                 166.5
9000       926               Not possible to run   101.3                 152.1
10000      1144              Not possible to run   112.4                 175.9
11000      1384              Not possible to run   115.7                 176.2
12000      1647              Not possible to run   141.6                 248.8
13000      1934              Not possible to run   136.9                 189.0
14000      2243              Not possible to run   146.7                 202.2
16000      2929              Not possible to run   166.8                 216.8
Table 24: Performance comparison of the NVIDIA and ATI cards when running SGEMM. ATI/ACML can run out-of-core and hence solve problems of any size.
Figure 16: Radeon HD3870 and Firestream 9170 compared to GTS8800, single precision GEMM. The ACML/ATI combination can run problems out-of-core, i.e. larger than the memory on the GPU device (ranging from 512 MB to 2 GB for the tested cards). (Chart: performance in Gflops/s versus matrix size N for GTS8800, HD3870 and FS9170.)
Discussion
The Graphical Processors available today are formidable engines. The possible performance is
staggering. However, before we jump to the conclusion that this is the final solution to all
computational problems a few remarks must be made.
Memory size limitation
First of all there is a limitation on the size of the problems that can be attacked. The problem must be copied to the device memory on the graphics card, and these cards have memory in the sub GB range. The card used for these tests had ½ GB, implying that only problems with a footprint of less than ½ GB can be attacked. Even if close to 10x BLAS level 3 (sgemm) performance is attractive, it can only be delivered for smaller matrices. The Firestream 9170 has 2 GB.
However, the ACML library has the possibility to run out-of-core (core in this context is the GPU device memory). In this way very large problems can be attacked. Table 24 shows that problem sizes that were out of reach for the CUDA/NVIDIA system could be handled by the ATI/ACML system with ease.
The memory limitation hence only applies to CUDA/NVIDIA.
Nvidia has Single Precision data type only
The only supported data types are integer and single precision floating point. However, single precision complex is supported.
ATI has Double Precision data type support
The ATI Radeon HD 3870 and Firestream 9170 have support for double precision data types. At the time of writing complex data types are not supported by the software (see the appendix for more information about the double precision data type).
Bandwidth limitation
The bandwidth at which data can be copied from main memory to device memory and vice versa is limited. Measurements show bandwidths in the few GB/s range (2-3 GB/s for Nvidia, somewhat more for ATI). This is far too low to allow for memory bandwidth intensive calculations. The GPU is hence best suited for tasks that require a lot of calculations per byte of data, or tasks that reduce the amount of data, typically reductions. However, the out-of-core calculations using ACML and the HD3870/FS9170 seem to contradict this.
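A rough estimate from the numbers above illustrates the point: with host-to-device bandwidth of about 2.5 GB/s and GPU GEMM rates of roughly 100-200 Gflops/s, a kernel needs on the order of 40-80 floating point operations per byte moved before the transfers stop being the bottleneck. GEMM delivers this easily, since it performs about 2xNxNxN operations on 12xNxN bytes of single precision data, i.e. roughly N/6 operations per byte, which is why even the out-of-core GEMM performs well.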
Physical size and connection
The graphics cards with the powerful GPUs are connected to the motherboard via a PCI Express x16 connector; in addition an extra power connector is needed. This is commonly found on gaming/home/office machines or workstations. However, it is very rare on servers. In addition, the servers used for computation are commonly of the 1U type, and as the graphics card occupies two PCI slots in physical dimensions it cannot fit in a normal 1U node. It can obviously not fit in a blade server.
This constrains the usage of GPUs for calculation severely, making them only a marginal option today. In the future tighter integration might change this.
Software limitations
The current software stack relies on X11 to run and also requires full control of the X server. This makes it very hard to run from a remote console. A normal ssh login with X11 forwarding enabled does not work. This is a showstopper for use in compute nodes today. All references to graphics drivers or X11 should be avoided for an accelerator card.
Lack of thread hot and thread safe support
Currently the GPU can only run one thread at a time. There are 4 cores in a normal AMD processor and each of these can run a computationally demanding job. Each of these 4 might want to access the GPU. For an MPI job there would be 4 or even 8 (dual socket node) concurrently running processes on each node, and each of these might want to access the GPU, probably concurrently. Either there should be one GPU per CPU core, or one GPU should be capable of handling multiple threads concurrently. The X2 cards contain two GPUs and this is a start; an X4 card might be well suited to a single socket quad core processor node. A dual socket node might need two X4 cards or even a card that could accommodate 8 GPUs.
Limited Compiler support
At present only one of the commonly used Fortran compilers has support for accelerators of this kind, namely the Portland set of C and Fortran compilers. On the other hand, the ease of use of the compiler directives with Portland is valuable. The company states that they want it to be very similar to OpenMP directives. So far it looks like they are on the right track.
Better integration of the GPU and the CPU, like sharing the main memory, will allow a far simpler implementation of compiler support. Today the overhead of copying data from main memory to GPU memory, and the associated administration needed to launch programs on the GPU, limits what can be gained. The integration of the x87 math co-processor is an example to follow.
Appendix
Component      Hardware
Motherboard    ASUS M3A32-MVP Deluxe.
Processor      AMD Phenom 9600 2.3 GHz.
Memory         Kingston HyperX PC8500, total 4 GB.
Graphic Card   XFX GeForce 8800GTS, 512 MB DDR3.
Graphic Card   Asus Radeon HD 3870, 512 MB GDDR4.
Graphic Card   ATI Firestream 9170, 2 GB GDDR4.
Power Supply   Chill CP-520A4 520 W.
Hard disk      Western Digital Raptor X 150 GB SATA 10k.
Hard disk      MTRON SSD 7000 3.5” 32 GB.
Table 25: Hardware
Software                                  Details
Operating system                          RedHat EL 5.
C compiler                                gcc version 4.1.2 20070626.
C compiler                                Portland pgcc 9.0-3.
Fortran Compiler                          gfortran version 4.1.2 20070626.
Fortran Compiler                          Portland pgfortran 9.0-3.
GPU Driver (8800GTS)                      NVIDIA-Linux-x86_64-169.09-pkg2.
GPU Driver (8800GTS) (Compiler tests)     NVIDIA-Linux-x86_64-185.18.31-pkg2.
GPU Software (8800GTS)                    CUDA 1.1 Support (169.09).
GPU Software (8800GTS) (Compiler tests)   CUDA 2.2 Support (190.18).
GPU Driver (3870)                         Ati-driver 8-5-x86.x86_64.
GPU Software (3870)                       amdcal-1.01.1_beta.x86_64.
GPU Software (3870)                       amdbrook-1.01.0_beta.x86_64.
Math library                              acmlg0.3
GPU Driver (9170)                         Ati-driver 8.531.
GPU Software (9170)                       amdstream-cal-1.2.0_beta-1.
GPU Software (9170)                       amdstream-brook-1.2.0_beta-1.
Table 26: Installed Software
Feature                  Nvidia GTS8800         ATI Radeon HD3870   ATI Firestream 9170
Stream Processors        128                    320                 320
GPU                      n.d.                   RV 670              RV 670
Transistors              754 Million            666 Million         666 Million
Fabrication Process      65 nm                  55 nm               55 nm
Die Size                 n.d.                   192 mm²             192 mm²
Core Clock               650 MHz                775 MHz             800 MHz
Memory Clock             1000 MHz               2.25 GHz            800 MHz
Processing (Math) Rate   624 Gigaflops/s        497 Gigaflops/s     500 Gflops/s
Memory Type              DDR3                   GDDR4               DDR4
Memory Interface         256-bit                256-bit             256-bit
System Support Bus       PCIe 2.0 x16           PCIe 2.0 x16        PCIe 2.0 x16
Power consumption        146 W (manufacturer)   217 W (measured)    150 W (manufacturer)
Table 27: Specs for Nvidia GTS 8800, ATI Radeon HD 3870 and ATI Firestream 9170.
ATI statement about double precision
Cited from the ATI web site FAQ (assuming the same applies to the HD3870 and not only the 9170):
“How does AMD's stream computing address the IEEE754 standard for double precision floating point
computation?
The IEEE754 standard defines formats for representing single and double-precision floating point
numbers as well as some special cases like denorms, infinities and NaNs. It also defines four rounding
modes and methods for exception handling.
When we were preparing to launch our stream computing initiative in 2006, a series of customer
interviews was conducted to get input on requirements relative to this standard. They learned that as
long as we handled the special cases according to the most common usage, complete IEEE754
compliance wasn't required. AMD's FireStream 9170 implementation should handle a large majority of
customers' requirements.
In the AMD FireStream 9170:
• Infinities and NaNs are handled as defined by the IEEE754 standard.
• Rounding is handled using the "round to nearest" mode, which is the mode generally used in most applications.
• Denormal numbers are flushed to zero. This is a common optimization in implementations where full-speed hardware support is not available, and is adequate for most applications.“
Wikipedia about the R600 series of GPUs
The new unified shader functionality is based upon a Very long instruction word (VLIW) architecture in
which the core executes operations in parallel.
The R600 uses 64 superscalar unified shader clusters, each consisting of 5 stream processing units for a
total of 320 stream processing units. The RV610 and RV630 variants have some of the shaders removed
from the array, containing a total of 40 (5x8) and 120 (5x24) stream processors, respectively.
Each of the first 4 stream processing units is able to retire a finished single precision floating point MAD
(or ADD or MUL) instruction per clock, dot product (dp, and special cased by combining ALUs), and
integer ADD. The fifth unit is more complex and can additionally handle special transcendental
functions such as sine and cosine. Each of the 64 shader clusters can execute 6 instructions per clock
cycle (peak), consisting of 5 shading instructions plus 1 branch.