Memorandum on Graphic Processing Unit (GPU) performance

Ole W. Saastad
USIT, University of Oslo
Version: 0.7
Date: 24 Aug 2009


Introduction

This memorandum gives results and experience on the usage and feasibility of using graphics processors (GPUs) as found on modern graphics adapters. Very high performance figures have been circulating in the media and in material from the various chipset manufacturers. This study was undertaken in order to assess the performance and to gain experience with the usage of such hardware. One major issue is the software stack providing access from user applications written in high level languages like C or Fortran. Two software stacks, from NVIDIA and ATI, have been tested.

All measurements are presented as measured in the lab and have been verified by two, three and in some cases more runs. However, I can give no guarantee that another tester would see the same numbers. My numbers represent what a user might experience when setting up the system.

Hardware and Software

The hardware comprises a custom built workstation using a quad core AMD Phenom processor, PC8500 DDR2 memory, a WD Raptor 10k SATA disk, an Mtron 32 GB solid state disk and an XFX GeForce 8800 GTS graphics card, ATI Radeon HD 3870 or ATI FireStream 9170.

The software was RedHat Enterprise Linux 5 with the associated compilers, gcc and gfortran, in addition to driver software for the graphics cards. In addition, a software stack that provides access from Fortran programs was also included.

The graphics card driver software was installed as suggested. This software downloads and compiles kernel modules for the current kernel automatically; this was selected and the correct kernel modules were built and installed. A reboot is recommended, as the /dev files did not appear until after a reboot. The software stacks were installed as suggested by the installer scripts.

The current NVIDIA cards only support single precision floating point data types (32 bit floating point, approx. 8 decimal digits). The ATI cards support double precision (64 bit floating point, approx. 16 decimal digits), see the appendix for more information. The Portland compiler suite used for the tests of compiler support is PGI 9.0-3.

Additional information about the hardware and software is found in the appendices.

Set up and implementation

NVIDIA graphics card GeForce 8800 GTS

After installing the Software Development Kit the testing of the GPU can start. The toolkit provides a variety of projects, each one testing a different aspect of GPU usage for computation. These projects are written and implemented in the C language. A list of them is shown in table 1.

alignedTypes, asyncAPI, bandwidthTest, binomialOptions, bitonic, BlackScholes, boxFilter, clock,
convolutionFFT2D, convolutionSeparable, convolutionTexture, cppIntegration, deviceQuery, dwtHaar1D,
dxtc, eigenvalues, fastWalshTransform, fluidsGL, histogram256, histogram64, imageDenoising,
lineOfSight, Mandelbrot, marchingCubes, matrixMul, matrixMulDrv, MersenneTwister, MonteCarlo,
MonteCarloMultiGPU, multiGPU, nbody, oceanFFT, particles, postProcessGL, reduction, scalarProd,
scan, scanLargeArray, simpleAtomics, simpleCUBLAS, simpleCUFFT, simpleGL, simpleStreams,
simpleTemplates, simpleTexture, simpleTextureDrv, SobelFilter, template, transpose

Table 1: GPU software example projects.

Some of these projects were tested to see how the software stack could be used. However, no real benchmarking effort was put into these preliminary tests.
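Several of the projects, for instance deviceQuery and bandwidthTest, simply interrogate the hardware. A minimal sketch of the kind of query the deviceQuery sample performs, using only CUDA runtime calls that were available in CUDA 1.1, could look like this (an illustration, not the SDK's own code):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        int dev;

        /* Ask the runtime how many CUDA capable devices are present. */
        cudaGetDeviceCount(&count);
        for (dev = 0; dev < count; dev++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            /* clockRate is reported in kHz, totalGlobalMem in bytes. */
            printf("Device %d: %s, %lu MB global memory, compute capability %d.%d, %.0f MHz\n",
                   dev, prop.name, (unsigned long)(prop.totalGlobalMem >> 20),
                   prop.major, prop.minor, prop.clockRate / 1000.0);
        }
        return 0;
    }

Compiled with nvcc, or with gcc given the CUDA include and library paths, this prints one line per installed card.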
As linear algebra is one of the most used classes of packages in HPC, the focus was turned to this. The project "simpleCUBLAS" was investigated and the code instrumented with some timers in order to assess the speedup when using the GPU compared to a simple C code implementation. The compilation of this kind of code is somewhat special, as the software toolkit provides a wrapper to the normal gcc compiler called nvcc. This C compiler is to be used in order to compile GPU enabled C programs. It is located at bin/nvcc. The installation guide suggests adding this directory to your path, in addition to letting the dynamic linker (ld.so.conf) look for libraries in /usr/local/cuda/lib.

Problem size limitation

The nature of the software stack is such that the problem is moved into the memory of the graphics card. The card used for these tests is equipped with 512 MB of DDR3 memory. The biggest problem that can be attacked is hence limited to 512 MB in total. When running problems like GEMM (general matrix multiply), which needs to hold three matrices A, B and C in memory, the largest problem that can be run is slightly larger than N=6500 (NxNx4x3 = 484 MB). This is a small problem compared to problems that can be attacked by the CPU, which has far more memory to play with.

ATI graphics cards Radeon HD 3870 and FireStream 9170

The installation covers a driver for the graphics adapter and a software stack for computation, CAL and Brook; both are development environments. In addition there is a dependency on X11. Brook makes it relatively simple to access the hardware, but programming the kernels is still a challenge, with a large number of threads to be managed.

Unlimited problem size

With ACML on top of CAL, problems of any size can be attacked, as ACML provides the possibility to run out-of-core (core here is the GPU local memory, ½ GB in this study). This is a major improvement over earlier math library implementations.

Lack of compiler support

At present only one Fortran compiler has support for accelerators, namely the Portland suite of compilers.

User Experience

NVIDIA GeForce 8800 GTS

simpleCUBLAS

The simpleCUBLAS project was tested and used as an example. The makefile contains most of what is needed. However, it ran into problems when linking due to missing libraries, which is not uncommon. Sgemm is Single precision GEneral Matrix Multiply, C = alpha A*B + beta C, where A, B and C are matrices, alpha and beta are scalars, and N is typically the matrix size.

    olews@styren ~/work/NVIDIA_CUDA_SDK/projects/simpleCUBLAS $ make
    simpleCUBLAS.c: In function 'main':
    simpleCUBLAS.c:89: warning: function declaration isn't a prototype
    simpleCUBLAS.c:89: warning: nested extern declaration of 'seconds'
    /usr/bin/ld: cannot find -lglut
    collect2: ld returned 1 exit status
    make: *** [../../bin/linux/release/simpleCUBLAS] Error 1
    olews@styren ~/work/NVIDIA_CUDA_SDK/projects/simpleCUBLAS $

This is easy to overcome by issuing a manual link line, which also includes the seconds file containing a timer function (a sketch of such a timer is shown after the run output below):

    $ nvcc -o simpleCUBLAS.x -L/usr/local/cuda/lib -lcublas seconds.c simpleCUBLAS.c

This produces an executable. The problem size is set within the C code, and by increasing it the run times become long enough to perform reasonable timing:

    olews@styren ~/work/NVIDIA_CUDA_SDK/projects/simpleCUBLAS $ ./simpleCUBLAS.x
    Simple sgemm finished 153.630000 secs.
    GPU sgemm start
    Simple sgemm finished 0.190000 secs.
    Test PASSED
    Press ENTER to exit...
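The timings printed above come from the seconds() helper referenced in the link line. A typical implementation of such a wall clock timer (this is a sketch, not the SDK's actual seconds.c) is just a thin wrapper around gettimeofday:

    #include <sys/time.h>

    /* Return wall clock time in seconds, with microsecond resolution. */
    double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    }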
A significant speedup was recorded when comparing against the simple C reference code. The simple C reference implementation looks like this:

    {
        int i;
        int j;
        int k;

        for (i = 0; i < n; ++i) {
            for (j = 0; j < n; ++j) {
                float prod = 0;
                for (k = 0; k < n; ++k) {
                    prod += A[k * n + i] * B[j * n + k];
                }
                C[j * n + i] = alpha * prod + beta * C[j * n + i];
            }
        }
    }

This is hence not the best reference. A better reference would be an sgemm from one of the highly optimized BLAS libraries like the AMD Core Math Library (ACML) or the Goto library from the University of Texas.

The code to call the GPU software stack is relatively simple:

    cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

In addition there are a few statements to copy and retrieve the vectors to and from the GPU memory. It is quite simple to use and there seem to be no serious obstacles.

    /* Allocate device memory for the matrices */
    status = cublasAlloc(n2, sizeof(d_A[0]), (void**)&d_A);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device memory allocation error (A)\n");
        return EXIT_FAILURE;
    }
    status = cublasAlloc(n2, sizeof(d_B[0]), (void**)&d_B);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device memory allocation error (B)\n");
        return EXIT_FAILURE;
    }
    status = cublasAlloc(n2, sizeof(d_C[0]), (void**)&d_C);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device memory allocation error (C)\n");
        return EXIT_FAILURE;
    }

    /* Initialize the device matrices with the host matrices */
    status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (write A)\n");
        return EXIT_FAILURE;
    }
    status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (write B)\n");
        return EXIT_FAILURE;
    }
    status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (write C)\n");
        return EXIT_FAILURE;
    }
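The remaining steps, the GEMM call itself, copying the result back to the host and releasing device memory, are equally plain. The following is a sketch based on the CUBLAS 1.x calls used above, not a verbatim quote from simpleCUBLAS.c:

    /* Run the GEMM on the device */
    cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

    /* Copy the result matrix back from device memory to the host */
    status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! device access error (read C)\n");
        return EXIT_FAILURE;
    }

    /* Release the device memory */
    cublasFree(d_A);
    cublasFree(d_B);
    cublasFree(d_C);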
ConvolutionFFT2D

The C code example shows that it is slightly more complex to use the CUDA FFT than a library like FFTW or ACML. The lines to make a plan, copy data to device memory and perform the FFT calculation are given below:

    CUDA_SAFE_CALL( cudaMallocArray(&a_Kernel, &float2tex, KERNEL_W, KERNEL_H) );
    CUDA_SAFE_CALL( cudaMallocArray(&a_Data, &float2tex, DATA_W, DATA_H) );
    CUDA_SAFE_CALL( cudaMalloc((void **)&d_PaddedKernel, FFT_SIZE) );
    CUDA_SAFE_CALL( cudaMalloc((void **)&d_PaddedData, FFT_SIZE) );
    CUFFT_SAFE_CALL( cufftPlan2d(&FFTplan, FFT_H, FFT_W, CUFFT_C2C) );

    CUDA_SAFE_CALL( cudaMemset(d_PaddedKernel, 0, FFT_SIZE) );
    CUDA_SAFE_CALL( cudaMemset(d_PaddedData, 0, FFT_SIZE) );
    CUDA_SAFE_CALL( cudaMemcpyToArray(a_Kernel, 0, 0, h_Kernel, KERNEL_SIZE,
                                      cudaMemcpyHostToDevice) );
    CUDA_SAFE_CALL( cudaMemcpyToArray(a_Data, 0, 0, h_Data, DATA_SIZE,
                                      cudaMemcpyHostToDevice) );
    CUDA_SAFE_CALL( cudaBindTextureToArray(texKernel, a_Kernel) );
    CUDA_SAFE_CALL( cudaBindTextureToArray(texData, a_Data) );

    CUFFT_SAFE_CALL( cufftExecC2C(FFTplan, (cufftComplex *)d_PaddedKernel,
                                  (cufftComplex *)d_PaddedKernel, CUFFT_FORWARD) );

    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );
    CUFFT_SAFE_CALL( cufftExecC2C(FFTplan, (cufftComplex *)d_PaddedData,
                                  (cufftComplex *)d_PaddedData, CUFFT_FORWARD) );
    CUFFT_SAFE_CALL( cufftExecC2C(FFTplan, (cufftComplex *)d_PaddedData,
                                  (cufftComplex *)d_PaddedData, CUFFT_INVERSE) );
    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    CUT_SAFE_CALL( cutStopTimer(hTimer) );

    CUDA_SAFE_CALL( cudaMemcpy(h_ResultGPU, d_PaddedData, FFT_SIZE,
                               cudaMemcpyDeviceToHost) );

    CUDA_SAFE_CALL( cudaUnbindTexture(texData) );
    CUDA_SAFE_CALL( cudaUnbindTexture(texKernel) );
    CUFFT_SAFE_CALL( cufftDestroy(FFTplan) );
    CUDA_SAFE_CALL( cudaFree(d_PaddedData) );
    CUDA_SAFE_CALL( cudaFree(d_PaddedKernel) );
    CUDA_SAFE_CALL( cudaFreeArray(a_Data) );
    CUDA_SAFE_CALL( cudaFreeArray(a_Kernel) );

It is interesting to note that there are no special or fancy data types. This is slightly more complex, but still within the reach of a normal scientific C programmer.
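For comparison, the equivalent forward 2-D single precision transform with FFTW needs only a handful of calls. This is a sketch using the standard FFTW3 single precision interface (link with -lfftw3f); FFT_W and FFT_H are reused from the CUDA example above:

    #include <fftw3.h>

    /* Plan and execute an in-place complex 2-D transform of size FFT_H x FFT_W. */
    fftwf_complex *data = fftwf_malloc(sizeof(fftwf_complex) * FFT_H * FFT_W);
    fftwf_plan     plan = fftwf_plan_dft_2d(FFT_H, FFT_W, data, data,
                                            FFTW_FORWARD, FFTW_ESTIMATE);
    /* ... fill data ... */
    fftwf_execute(plan);          /* forward transform, done in place */
    fftwf_destroy_plan(plan);
    fftwf_free(data);

The extra work in the CUDA version is essentially the explicit movement of data to and from device memory and the texture binding, not the FFT interface itself.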
Fortran_Cuda_Blas

The NVIDIA CUDA website also contains software packages with a Fortran interface. Using this interface makes it extremely simple to use the GPU. The interface is simple to install and uses the CUDA software libraries already installed. Originally it tries to use g95, but this is no longer shipped with RHEL 5. I changed to gfortran, which is f90 compatible and works fine with the CUDA software stack. Several other Fortran compilers are supported, including Intel. The make process is simple to follow and performs the following steps:

    olews@styren ~/work/Fortran_Cuda_Blas $ make sgemm_speed_cublas
    gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
    gfortran -o sgemm_speed_cublas -O3 -DCUBLAS sgemm_speed.f90 fortran.o \
        -L/usr/local/cuda/lib -lcublas -lcudart -L/usit/platon/gvd-u1/olews/lib64/ -lacml
    olews@styren ~/work/Fortran_Cuda_Blas $

In this example the ACML BLAS library was linked in. The Fortran program can be compiled to call an external BLAS sgemm function or to use a GPU based sgemm function. It is worth noting that there is no usage of the nvcc compiler wrapper, only gcc and gfortran. The Fortran program to test the sgemm is remarkably simple:

    !
    ! Simple Fortran90 program that multiplies 2 square matrices calling Sgemm
    ! C = alpha A*B + beta C
    !
    program matrix_multiply
      implicit none
      ! Define the floating point kind to be single precision
      integer, parameter :: fp_kind = kind(0.0)

      real (fp_kind), dimension(:,:), allocatable :: A, B, C
      real           :: time_start, time_end
      real (fp_kind) :: alpha=1._fp_kind, beta=1._fp_kind, c_right
      integer        :: i, j, m1, m2

      do m1=128,(40*128),128
         allocate(A(m1,m1))
         allocate(B(m1,m1))
         allocate(C(m1,m1))

         ! Initialize the matrices A, B and C
         A=1._fp_kind
         B=2._fp_kind
         C=3._fp_kind
         ! With the prescribed inputs, each element of the C matrix should be equal to c_right
         c_right = 2._fp_kind*m1+3._fp_kind

         ! Compute the matrix product
         call cpu_time(time_start)
         call cublas_SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
         ! call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
         call cpu_time(time_end)

         ! Print timing information
         print "(i5,1x,i4,a,1x,f8.4,2x,a,f12.4)", m1, &
            (m1*m1*4)/(1024*1024), " MB time =",time_end-time_start, &
            " GFLOPS=",1.e-9*2._fp_kind*m1*m1*m1/(time_end-time_start)

         ! Check the result
         do j=1,m1
            do i=1,m1
               if ( abs(c(i,j)- c_right ) .gt. 1.d-8 ) then
                  print *, "sgemm failed", i,j, abs(c(i,j)- c_right )
                  exit
               end if
            end do
         end do

         deallocate(A,B,C)
      end do

    end program matrix_multiply

All the work associated with moving data from main memory to the device memory on the GPU card is hidden away in the Fortran wrapper library, leaving a very simple and elegant Fortran interface. In order to make calling from Fortran simple, all arrays start at index 1 and are stored column major, as is done in Fortran.

Portland compiler C and Fortran support

The Portland Group (www.pgroup.com) has developed a set of compilers that can generate code for the NVIDIA GPU chipset. From PGI's web site:

"PGI is introducing the Accelerator Programming Model for Fortran and C with PGI Release 9.0. The Accelerator Programming Model uses directives and compiler analysis to compile natural Fortran and C for the GPU; this often allows you to maintain a single source version, since ignoring the directives will compile the same program for the X64 CPU. Note the model is called the Accelerator Programming Model, not the GPU Programming Model; the model is designed to be forward-looking as well, to accommodate other accelerators that may come in the future, preserving your software development investment."

This is done in OpenMP style with compiler directives. The compiler then tries to generate code that can be submitted to the CUDA software framework provided by NVIDIA. The user level experience is remarkably simple: a few compiler directives are all that is needed to start using the GPU as an accelerator available within Fortran and C programs. Compiler directives like

    !$acc region
    <your code>
    !$acc end region

are all that is needed in the Fortran code to activate the compiler machinery that tries to generate kernels for the GPU. The same style as for OpenMP directives is adopted. The compiler support is a major step forward, as it allows legacy codes to be recompiled to take advantage of the new GPUs. An example of Fortran code is given below:

    !$acc region
       do i = 1,n
          r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
       enddo
    !$acc end region

This is all there is to it, just two simple compiler directives as comments. The compiler will invoke the analysis machinery when compiled with the correct options:

    pgfortran -o f2.exe f2.f90 -ta=nvidia,cc11 -Minfo=accel

This will generate a kernel that can be copied to the GPU and used to calculate the problem. There is nothing special to be done to run the program; just launch it as normal. The libraries will schedule the work and pick the GPU.
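The C flavour of the model works the same way, with pragmas instead of directive comments. The sketch below is my reading of the PGI 9.0 Accelerator model for C (the exact pragma spelling should be checked against the PGI documentation); n, a and r are the same hypothetical loop bounds and arrays as in the Fortran example:

    /* Accelerated region in C, compiled with e.g.
       pgcc -o c2.exe c2.c -ta=nvidia,cc11 -Minfo=accel          */
    #pragma acc region
    {
        int i;
        for (i = 0; i < n; i++)
            r[i] = sinf(a[i]) * sinf(a[i]) + cosf(a[i]) * cosf(a[i]);
    }

As in the Fortran case, removing or ignoring the pragma leaves an ordinary loop that compiles for the x86-64 CPU.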
ATI Radeon HD 3870

There are a number of test and tutorial applications for CAL (Compute Abstraction Layer), and some of these can be useful to compare performance. The software development kit also contains Brook. Brook is an extension of standard ANSI C designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language; the general computational model is referred to as streaming.

Demo Applications

A range of demo applications comes with CAL, as shown in tables 2 and 3.

double_matmult, hellocal, lu_decomposition, memexport_matmult, memimport_matmult, outofcoreMMM, simple_matmul

Table 2: CAL GPU software example projects, demo applications.

cachespeed, domain, exportspeed, importspeed, inputspeed, integer_alu, memtiming, nonblocking_map, outputspeed, perf_counters, throughput

Table 3: CAL GPU software example projects, run time utilities.

Compiling the tutorials and demos is easy; a simple make command produces a binary that can be run and the results recorded. The only disadvantage is that all this CAL and GPU software relies on local access to X11, which means that you cannot run remotely (this is a showstopper and has been conveyed to AMD/ATI; newer versions will not have this limitation).

Results

NVIDIA GeForce

Bandwidth

One major issue associated with the application of the GPU is the memory bandwidth at which data can be moved between main memory and device memory on the GPU device. The GPU can only operate on data in device memory. The CUDA toolkit contains benchmark project code that measures several aspects of memory bandwidth. Table 4 shows the rates at which data can be moved from main memory to device memory and within the GPU device. The bandwidth for moving data between the two types of memory is relatively low compared to the processing power of the GPU. Pinning of memory helps in getting higher bandwidth, most notably for copying data from device memory to main memory.

Transfer type       Bandwidth [GB/s], pageable memory   Bandwidth [GB/s], pinned memory
Host to Device      2.291                               2.604
Device to Host      1.857                               3.265
Device to Device    50.992                              51.012

Table 4: Bandwidth measurements to and from the graphics card memory.

ConvolutionFFT2D

This sample demonstrates how 2D convolutions with very large kernel sizes can be efficiently implemented using FFT transformations. Table 5 shows a very high speedup using the GPU.

Size N / conv. size      Run time CPU [secs]   Run time GPU [secs]   Speedup
1000 x 1000 / 7 x 7      0.3                   0.03                  10
3000 x 3000 / 7 x 7      2.70                  0.43                  6.3

Table 5: Run times for convolution using 2D FFT.

Financial MonteCarlo

This project uses Monte Carlo simulation to find stock option prices using a large number of statistical draws. The problem is embarrassingly parallel and hence well suited to the very high parallelism of the GPU with its large number of processing elements; a sketch of the type of computation is given after the quote below. The recorded speedup is so high that it could be thought of as unrealistic. NVIDIA describes the project as:

"The pricing of options has been a very important problem encountered in financial engineering since the advent of organized option trading in 1973. As more computation has been applied to finance-related problems, finding efficient implementations of option pricing models on modern architectures has become more important. This white paper describes an implementation of the Monte Carlo approach to option pricing in CUDA. For complete implementation details, please see the "MonteCarlo" example in the NVIDIA CUDA SDK."
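The core of such a pricer is easy to sketch in plain C. The code below is only an illustration of the embarrassingly parallel computation being accelerated (one independent price path per loop iteration); it is not the SDK's implementation, and the Black-Scholes parameters S0, K, r, sigma and T are hypothetical:

    #include <math.h>
    #include <stdlib.h>

    /* Single-threaded Monte Carlo estimate of a European call option price.
       Every path is independent of the others, which is why the problem
       maps so well onto the many processing elements of a GPU.            */
    double mc_call_price(double S0, double K, double r, double sigma,
                         double T, long paths)
    {
        double sum = 0.0;
        long   i;

        for (i = 0; i < paths; i++) {
            /* Box-Muller transform: two uniform draws give one standard normal. */
            double u1 = 1.0 - drand48();
            double u2 = drand48();
            double z  = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
            /* Terminal asset price under geometric Brownian motion. */
            double ST = S0 * exp((r - 0.5 * sigma * sigma) * T + sigma * sqrt(T) * z);

            sum += (ST > K) ? (ST - K) : 0.0;     /* call payoff */
        }
        return exp(-r * T) * sum / (double)paths; /* discounted mean payoff */
    }

On the GPU the loop is simply split over thousands of threads, each computing a batch of paths, followed by a reduction of the partial sums.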
Run time CPU [sec]   Run time GPU [sec]   Speedup
771.8                0.26                 2965

Table 6: Financial MonteCarlo simulation, run times measured in seconds.

simpleCUBLAS

This example and benchmark, implemented in C, uses a direct interface to the CUDA libraries. The results are measured as run time in seconds for each call to the sgemm routine. The reference is a simple plain C implementation of sgemm. The speedups are unrealistic, as the simple C implementation is unrealistically time consuming.

Size N   Total size MB (NxNx4x3)   Simple/CPU [sec]   GPU [sec]   Speedup
500      3                         1.3                0.01        120
1000     11.4                      16.2               0.02        810
2000     46                        153.8              0.19        809
4000     183                       1556               1.0         1556
5000     286                       3210.0             2.55        1259
6200     484                       8345.1             4.75        1757

Table 7: simpleCUBLAS performance, measured as run time in seconds.

Fortran_Cuda_Blas

The Fortran CUDA BLAS benchmark calls the Fortran BLAS wrapper library, which is remarkably simple to use from Fortran. The results below are compared to the AMD Core Math Library (ACML). The measured speedup of over 8x is impressive compared to the best performing BLAS libraries running on an x86-64 CPU at 2.3 GHz.

Size N   Total size MB (NxNx4x3)   ACML/CPU [Gflops/s]   GPU [Gflops/s]   Speedup
512      3                         14.9                  38.36            2.6
1024     12                        14.9                  79.55            5.3
2048     48                        15.8                  110.9            7.0
4096     192                       15.9                  118.3            7.5
5120     300                       15.9                  120.6            7.6
6016     414                       15.9                  135.9            8.6
6272     450                       15.9                  136.2            8.6

Table 8: Fortran CUDA BLAS performance measured in Gflops/s.

Figure 1: SGEMM, CPU using AMD Core Math Library versus GPU [chart: performance in Gflops/s versus matrix size N; series: CPU/ACML, GPU].

Fortran callable BLAS test

How simple can it be, seen from the user perspective? I have written a small Fortran test program to show how easily the power of the GPU can be accessed:

    program gemmtest
      parameter(N=6300)
      real a, b, c, alpha, beta
      dimension a(N,N), b(N,N), c(N,N)
      real time_start, time_end, speedup
      integer i,j

    ! Init matrices
      do i=1,N
         do j=1,N
            a(j,i)=rand(0)
            b(j,i)=rand(0)
         enddo
      enddo
      alpha = 1.0
      beta  = 1.0

      write(*,*)"Total footprint A,B & C",N*N*4*3/(1024*1024)," MB"

      c=0
      call cpu_time(time_start)
      write(*,*)"CUDA sgemm start"
      call cublas_SGEMM('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ",c(1,1)," c(N,N) ", c(N,N)
      write(*,*)"CUDA sgemm end",time_end-time_start," secs"
      speedup=time_end-time_start

      c=0
      call cpu_time(time_start)
      write(*,*)"CPU sgemm"
      call sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ",c(1,1)," c(N,N) ", c(N,N)
      write(*,*)"CPU sgemm end",time_end-time_start," secs"
      speedup=(time_end-time_start)/speedup
      write(*,*)"Speedup ",speedup
    end

Compilation is also very simple:

    gfortran -o sgemm-test.x -O2 fortran.o -L/usr/local/cuda/lib/ -lcublas \
        -L${HOME}/lib64 -lgoto sgemmtest.f90

The Fortran code wrapper fortran.o is a piece of C code that interfaces between the Fortran code and the CUDA libraries. This file is needed, but can be placed anywhere with system wide access. The normal BLAS library, in this case the Goto library, is also linked in to be used for reference.
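The fortran.c wrapper is essentially a set of thunking functions. The sketch below shows roughly what such a wrapper does for sgemm; it is not NVIDIA's actual fortran.c. It assumes gfortran's default name mangling (a trailing underscore and all arguments passed by reference) and transa = transb = 'n' as in the test program, and error checking is left out for brevity:

    #include <cublas.h>

    /* Called from Fortran as: call cublas_SGEMM('n','n',m,n,k,alpha,A,lda,B,ldb,beta,C,ldc) */
    void cublas_sgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const float *alpha, const float *A, const int *lda,
                       const float *B, const int *ldb,
                       const float *beta, float *C, const int *ldc)
    {
        float *dA, *dB, *dC;

        /* Allocate device memory and copy the column-major host matrices over. */
        cublasAlloc((*lda) * (*k), sizeof(float), (void **)&dA);
        cublasAlloc((*ldb) * (*n), sizeof(float), (void **)&dB);
        cublasAlloc((*ldc) * (*n), sizeof(float), (void **)&dC);
        cublasSetMatrix(*m, *k, sizeof(float), A, *lda, dA, *lda);
        cublasSetMatrix(*k, *n, sizeof(float), B, *ldb, dB, *ldb);
        cublasSetMatrix(*m, *n, sizeof(float), C, *ldc, dC, *ldc);

        /* Run the GEMM on the GPU and copy the result back to the host. */
        cublasSgemm(*transa, *transb, *m, *n, *k, *alpha,
                    dA, *lda, dB, *ldb, *beta, dC, *ldc);
        cublasGetMatrix(*m, *n, sizeof(float), dC, *ldc, C, *ldc);

        cublasFree(dA);
        cublasFree(dB);
        cublasFree(dC);
    }

This also makes clear where the overhead for small matrices comes from: every call pays for three host-to-device copies and one device-to-host copy.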
In this test both alpha and beta have been set to 1.0, giving some more work than just AxB: C = alpha*A x B + beta*C, which is known as fused multiply add. Performance is very good for this test, as shown in the table below. Outperforming ACML (equal performance was measured for the Goto library) by a factor of close to 6 is very good. Doing the arithmetic one arrives at 15.6 Gflops/s for the CPU and 91.8 Gflops/s for the GPU when solving the N=6300 problem (assuming a flops count of 2xNxNxN).

Size N   Total size MB (NxNx4x3)   GPU [secs]   ACML/CPU [secs]   Speedup
500      2                         0.564        0.0179            0.031
1000     11                        0.574        0.135             0.24
2000     45                        0.721        1.03              1.4
4000     183                       1.51         8.11              5.4
5000     286                       3.00         16.0              5.3
6000     411                       4.68         27.4              5.8
6300     454                       5.45         32.0              5.9

Table 9: Fortran CUDA BLAS single precision performance, measured as run times in seconds.

Within the limitation of the single precision data type, the complex data type is also possible. The number of floating point operations per addition, and to a larger extent per multiplication, is larger per byte of data than for the real data type. This should yield even better utilization of the GPU based BLAS functions.

Size N   Total size MB (NxNx4x3x2)   ACML/CPU [secs]   GPU [secs]   Speedup
500      5                           0.0670            0.558        0.12
1000     22                          0.524             0.632        0.83
2000     91                          4.14              1.02         4.0
3000     205                         13.9              2.51         5.5
3500     280                         22.0              3.33         6.6
4000     366                         32.9              4.25         7.7
4400     443                         43.8              5.41         8.1

Table 10: Fortran CUDA BLAS single precision complex performance, measured as run times in seconds.

Figure 2: GEMM, CPU using AMD Core Math Library versus GPU; speedup using single precision and complex data types [chart: GPU speedup (times faster) versus matrix size N; series: single, complex].

Multi threaded CPU

The installed CPU has four cores and can consequently run 4 threads in parallel. This should yield close to 4 times the performance when using a multi-threaded linear algebra library like the Goto library compiled for the multi-threaded AMD processors. For these tests the BLAS level 3 matrix multiply routine GEMM has been used.

Size N   ACML/CPU 2 threads speedup   ACML/CPU 4 threads speedup   GPU speedup
500      1.91                         2.38                         0.14
1000     1.94                         3.71                         0.92
2000     1.96                         3.87                         4.29
3000     1.96                         3.86                         5.56
4400     1.96                         3.90                         8.03

Table 11: Speedup relative to a single CPU thread for single precision complex GEMM (CGEMM), using 2 and 4 CPU threads and the GPU.

Figure 3: Speedup using multiple CPU threads and the GPU for the single precision complex data type, BLAS level 3 CGEMM [chart: speedup from 1 core versus matrix size; series: 1, 2, 4 threads and GPU].

Size N   ACML/CPU 2 threads speedup   ACML/CPU 4 threads speedup   GPU speedup
500      0.95                         1.89                         0.04
1000     1.93                         2.65                         0.26
2000     1.96                         3.87                         1.59
4000     1.97                         3.90                         5.72
6000     1.96                         3.89                         5.97

Table 12: Speedup relative to a single CPU thread for single precision GEMM (SGEMM), using 2 and 4 CPU threads and the GPU.

Figure 4: Speedup using multiple CPU threads and the GPU for the single precision data type, BLAS level 3 SGEMM [chart: speedup from 1 core versus matrix size; series: 1, 2, 4 threads and GPU].

Figure 5: Speedup of the GPU versus the CPU using threads on all cores (4 threads for this quad core AMD processor), BLAS level 3 CGEMM matrix multiplication with complex single precision [chart: speedup versus matrix size].
Figure 6: Speedup of the GPU versus the CPU using threads on all cores (4 threads for this quad core AMD processor), BLAS level 3 SGEMM matrix multiplication with single precision [chart: speedup versus matrix size].

Portland compiler testing using Fortran code only

The Portland compiler uses directives to launch machinery that tries to generate kernels that will run on the GPU. A simple example is given below, and this example is used for testing and benchmarking purposes.

    program main
      use accel_lib
      integer :: n                              ! size of the vector
      real, dimension(:), allocatable :: a      ! the vector
      real, dimension(:), allocatable :: r      ! the results
      real, dimension(:), allocatable :: e      ! expected results
      integer :: i
      integer :: c0, c1, c2, c3, cgpu, chost
      character(10) :: arg1

      if( iargc() .gt. 0 )then
         call getarg( 1, arg1 )
         read(arg1,'(i10)') n
      else
         n = 100000
      endif
      if( n .le. 0 ) n = 100000

      allocate(a(n))
      allocate(r(n))
      allocate(e(n))
      do i = 1,n
         a(i) = i*2.0
      enddo

      !call acc_init( acc_device_nvidia )
      call system_clock( count=c1 )

    !$acc region
      do i = 1,n
         r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
      enddo
    !$acc end region

      call system_clock( count=c2 )
      cgpu = c2 - c1

      do i = 1,n
         e(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
      enddo
      call system_clock( count=c3 )
      chost = c3 - c2

      ! check the results
      do i = 1,n
         if( abs(r(i) - e(i)) .gt. .00001 )then
            print *, i, r(i), e(i)
         endif
      enddo
      print *, n, ' iterations completed'
      print *, cgpu, ' microseconds on GPU'
      print *, chost, ' microseconds on host'
    end program main

This code is compiled with a simple statement:

    pgfortran -o f2.exe -O3 f2.f90 -ta=nvidia,cc11 -Minfo=acc

and is run by launching the program in the normal way. The timings are printed and hence show the performance of GPU and CPU.

Problem size [k]   Speedup GPU vs. CPU
100                0.12
1000               1.35
10000              5.21
20000              5.83
40000              6.27
50000              6.29

Table 13: Speedup of the compiler generated kernel running on the GPU vs. f90 code running on the CPU.

Figure 7: Compiler accelerator test, kernel running on GPU vs. f90 code on CPU [chart: speedup GPU vs. CPU versus problem size (k)].

The observed speedup is quite modest for this simple test. More computationally demanding codes might yield higher speedup. An interesting test is to see if the f90 code can utilize all 4 cores of the processor. The f90 code was instrumented with OpenMP directives and 1 through 4 cores were used (a C sketch of such a threaded loop is shown after the figure below). Using all 4 cores in this simple calculation shows that 4 cores can keep up with the GPU.

Number of cores   Speedup GPU vs. CPU
1                 6.31
2                 3.15
3                 2.19
4                 1.62

Table 14: Speedup GPU vs. multiple cores on the CPU. Fortran 90 code compiled with OpenMP directives. The size is 50000k, which is the largest possible, ref. the table above.

Figure 8: Speedup GPU vs. multiple cores on the CPU. Fortran 90 code compiled with OpenMP directives. The size is 50000k, which is the largest possible [chart: speedup versus number of cores used in OpenMP].
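For reference, the OpenMP instrumentation of the comparison loop amounts to a single directive. The author's version is in Fortran; a C rendering of the same idea (with the same hypothetical names n, a and e) is:

    /* CPU comparison loop, threaded with OpenMP. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        e[i] = sinf(a[i]) * sinf(a[i]) + cosf(a[i]) * cosf(a[i]);

Such code is compiled with an OpenMP flag (gcc -fopenmp, pgcc -mp) and the thread count is selected at run time through the standard OMP_NUM_THREADS environment variable.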
It is apparent that more complicated and more computationally demanding kernels must be given to the GPU in order to get a scaling that pays off the extra work and power consumption involved with using GPU cards in compute nodes. The examples above demonstrate that not all problems are suitable for GPU acceleration. A problem involving more computation, which can be run entirely on the GPU, might show significant speedup compared to the less encouraging examples above. An example of such a computational problem is the smooth routine given below:

    subroutine smooth( a, b, w0, w1, w2, n, m, niters )
      real, dimension(n,m) :: a, b
      real :: w0, w1, w2
      integer :: n, m, niters
      integer :: i, j, iter

    !$acc region
      do iter = 1,niters
         do i = 2,n-1
            do j = 2,m-1
               a(i,j) = w0 * b(i,j) + &
                        w1*(b(i-1,j)+b(i,j-1)+b(i+1,j)+b(i,j+1)) + &
                        w2*(b(i-1,j-1)+b(i-1,j+1)+b(i+1,j-1)+b(i+1,j+1))
            enddo
         enddo
         do i = 2,n-1
            do j = 2,m-1
               b(i,j) = a(i,j)
            enddo
         enddo
      enddo
    !$acc end region
    end subroutine smooth

Most of the computation can be performed on the GPU card, which should yield significant performance. The compiler manages to translate this computational problem into a GPU kernel that runs efficiently on the GPU. See the analysis output from the compiler given below:

    smooth:
         59, Generating copyout(a(2:n-1,2:m-1))
             Generating copyin(b(1:n,1:m))
             Generating copyout(b(2:n-1,2:m-1))
         60, Loop carried dependence due to exposed use of b(1:n,1:m) prevents parallelization
             Parallelization would require privatization of array a(i2+2,2:m-1)
             Sequential loop scheduled on host
         61, Loop is parallelizable
         62, Loop is parallelizable
             Accelerator kernel generated
             61, !$acc do parallel, vector(16)
             62, !$acc do parallel, vector(16)
                 Cached references to size [18x18] block of 'b'
         68, Loop is parallelizable
         69, Loop is parallelizable
             Accelerator kernel generated
             68, !$acc do parallel, vector(16)
             69, !$acc do parallel, vector(16)

For this example the performance is closer to the expectations of what a GPU can do. The recorded speedups in table 15 are such that the extra effort of using GPUs pays off. The code running on the CPU is compiled with full optimization (-O3). Speedups of 100x or more have been reported, but those are for hand written GPU kernels, not compiler translated Fortran code.

Problem size [MB]   Run time CPU [seconds]   Run time GPU [seconds]   Speedup GPU vs. CPU
8                   7.91                     0.31                     26
32                  33.3                     1.06                     31
72                  91.4                     2.53                     36
128                 178                      4.19                     43
200                 347                      7.26                     48
288                 538                      9.84                     55
392                 839                      14.23                    59
462                 1120.7                   16.12                    70

Table 15: Execution times and speedup of the "smooth" subroutine using a compiler generated GPU kernel running on the GPU versus f90 code running on a single core of the CPU. The f90 code is compiled with full optimization (-O3).

Figure 9: Speedup running the smooth filter on the GPU vs. a single core on the CPU. The graphics card has only 512 MB of memory [chart: speedup versus problem size in MB].

Using all four cores of the quad core processor by means of OpenMP support in the compiler, a more realistic speedup figure is obtained. Most codes that are suited for GPU acceleration are probably also suitable for OpenMP parallelization. Table 16 and figure 10 show the results obtained for this test. The speedup recorded for this test is promising and well worth investigating further.
Problem size [MB]   Run time CPU, 4 cores OpenMP [seconds]   Run time GPU [seconds]   Speedup GPU vs. CPU
8                   2.06                                     0.31                     7
32                  8.80                                     1.06                     8
72                  22.4                                     2.52                     9
128                 47.2                                     4.19                     11
200                 93.4                                     7.25                     13
288                 179                                      9.84                     18
392                 375                                      14.2                     26
462                 509                                      16.1                     32

Table 16: Execution times and speedup of the "smooth" subroutine using a compiler generated GPU kernel running on the GPU versus f90 code running on four cores of the CPU. The f90 code is compiled with full optimization (-O3) and OpenMP instrumented do loops.

Figure 10: Speedup running the smooth filter on the GPU vs. four cores on the processor. The graphics card has only 512 MB of memory. The f90 code was parallelized using OpenMP directives on the do loops [chart: speedup versus problem size in MB].

ATI Radeon

Bandwidth

One major issue associated with the application of the GPU is the memory bandwidth at which data can be moved between main memory and device memory on the GPU device. The GPU can only operate on data in device memory. The CAL toolkit contains benchmarks that measure several aspects of memory bandwidth. Tables 17 to 19 show the rates at which data can be moved from main memory to device memory and within the GPU device. The bandwidth for moving data between the two types of memory is relatively low compared to the processing power of the GPU.

Cachespeed

Cachespeed is a simple copy kernel to test total cache speed. This program can use GPU and main memory as source and destination, which makes it possible to measure memory bandwidth between the main system cache and the GPU device cache.

Number of I/Os   GPU => CPU bandwidth [GB/sec]   CPU => GPU bandwidth [GB/sec]   GPU local bandwidth [GB/sec]
2                3.85                            62.33                           55.60
4                7.70                            104.63                          119.27
6                11.55                           156.25                          176.89
8                15.40                           135.48                          236.74
10               19.26                           151.80                          279.02
12               23.12                           165.44                          292.06
14               26.98                           174.16                          301.31
16               30.89                           182.75                          307.88

Table 17: Bandwidth measurements using the GPU cachespeed utility.

Inputspeed

Inputspeed is a simple utility to measure import speed. It measures how fast memory can be copied from main memory to GPU memory.

Number of I/Os   GPU input bandwidth [GB/sec]
2                33.97
3                30.05
4                30.64
5                31.50
6                29.30
7                28.19
8                28.41
9                29.30
10               28.10
11               31.36
12               31.89
13               32.97
14               32.75
15               32.92
16               33.24
17               33.37

Table 18: Bandwidth measurements using the GPU inputspeed utility.

Outputspeed

Outputspeed is a simple utility to measure export speed. It measures how fast memory can be copied from GPU memory to main memory.

Number of I/Os   GPU output bandwidth [GB/sec]
1                15.02
2                15.32
3                18.90
4                23.67
5                25.37
6                23.21
7                19.81
8                17.66

Table 19: Bandwidth measurements using the GPU outputspeed utility.

Fortran callable BLAS test

As for the NVIDIA card, a small Fortran test program shows how simply the power of the GPU can be accessed from the user perspective:
    program gemmtest
      parameter(N=10000)
      real a, b, c, alpha, beta
      dimension a(N,N), b(N,N), c(N,N)
      real time_start, time_end, speedup
      integer i,j

    ! Init matrices
      do i=1,N
         do j=1,N
            a(j,i)=rand(0)
            b(j,i)=rand(0)
         enddo
      enddo
      alpha = 1.0
      beta  = 1.0

      write(*,*)"Total footprint A,B & C",N*N*4*3/(1024*1024)," MB"

      c=0
      call cpu_time(time_start)
      write(*,*)"ATI sgemm start"
      call SGEMM_CAL_F(.FALSE.,.FALSE., M, N, K, A,LDA,B,LDB,C,LDC,ALPHA, BETA, 0)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ",c(1,1)," c(N,N) ", c(N,N)
      write(*,*)"CUDA sgemm end",time_end-time_start," secs"
      speedup=time_end-time_start

      c=0
      call cpu_time(time_start)
      write(*,*)"CPU sgemm"
      call sgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
      call cpu_time(time_end)
      write(*,*) "c(1,1) ",c(1,1)," c(N,N) ", c(N,N)
      write(*,*)"CPU sgemm end",time_end-time_start," secs"
      speedup=(time_end-time_start)/speedup
      write(*,*)"Speedup ",speedup
    end

Compilation is also very simple:

    gfortran -o sgemm-test.x -O3 -fopenmp -L/usr/lib64/acml/gfortran64/lib/ -lCALBLAS -lacml \
        -L/usr/local/amdcal/lib64/ -lamdcalcl -lamdcalrt sgemm-test.f90

The Fortran interface is very simple; it is merely a call to the familiar ACML (AMD Core Math Library). The rest is done by the library and the device drivers. There is a special lower level interface that bypasses the ACML layer; this is done by calling SGEMM_CAL_F instead of the ACML entry sgemm. This was done to be 100% sure that the GEMM is run on the GPU and not the CPU. Very little performance difference was observed, showing that there is no need to bypass the high level ACML interface. The normal BLAS libraries, in this case the Goto library and ACML (in a CPU-only version), are also linked in to be used for reference.

In this test both alpha and beta have been set to 1.0, giving some more work than just AxB: C = alpha*A x B + beta*C, which is known as fused multiply add. Performance is very good for this test, as shown in the table below. Outperforming the CPU based computation by a factor of 7-10 is very good. Doing the arithmetic one arrives at 15.9 Gflops/s for the CPU and 166.8 Gflops/s for the GPU when solving the N=16000 out-of-core problem (assuming a flops count of 2xNxNxN).

By using ACML as the interface there is no need to recompile old programs; a relink is all that is needed for statically linked programs, and just a replacement of the library for dynamically linked programs.

Size N   Total size MB (NxNx4x3)   ACML/GPU [secs]   ACML/CPU [secs]   Speedup
500      2                         0.104             0.0180            0.17
1000     11                        0.183             0.227             0.23
2000     45                        0.457             1.06              2.32
3000     102                       0.821             4.37              5.33
4000     183                       1.32              8.39              6.35
5000     286                       2.36              18.3              7.74
6000     411                       3.61              28.4              7.85
7000     560                       5.31              44.6              8.41
8000     732                       7.25              66.6              9.17
9000     926                       14.4              94.8              6.56
10000    1144                      17.8              130.0             7.30
11000    1384                      24.4              224.0             9.17
12000    1647                      23.0              169.0             7.35
13000    1934                      32.1              285.0             8.85
14000    2243                      37.4              343.0             9.17
16000    2929                      49.1              514.0             10.4

Table 20: Fortran ACML-CPU / ACML-GPU BLAS single precision performance, measured as run times in seconds.

Figure 11: CPU and GPU performance, single precision GEMM [chart: performance in Gflops/s versus matrix size N; series: CPU single, GPU single].

Fortran callable BLAS test in double precision

The ATI Radeon HD 3870 can also do double precision floating point (see the appendix for more information about the double precision data type) using ACML. Table 21 shows the measured performance.
Size N   Total size MB (NxNx8x3)   ACML/GPU [secs]   ACML/CPU [secs]   Speedup
500      5                         0.0370            0.0355            0.959
1000     22                        0.150             0.266             1.77
2000     91                        0.553             2.08              3.76
3000     205                       1.17              6.92              5.91
4000     366                       2.40              16.2              6.75
5000     572                       7.50              31.8              4.24
6000     823                       10.6              55.2              5.21
7000     1121                      14.5              87.2              6.01
8000     1464                      19.3              130               6.73
9000     1853                      35.7              188               5.27
10000    2288                      44.1              253               5.76
11000    2769                      53.1              338               6.37

Table 21: Fortran ACML-CPU / ACML-GPU BLAS double precision performance, measured as run times in seconds.

Size N   Size MB (NxNx4x3)   CPU single [Gflops/s]   GPU single [Gflops/s]   Size MB (NxNx8x3)   CPU double [Gflops/s]   GPU double [Gflops/s]
500      2                   13.9                    2.40                    5                   7.04                    6.76
1000     11                  8.81                    10.9                    22                  7.51                    13.3
2000     45                  15.1                    35.0                    91                  7.69                    28.9
3000     102                 12.4                    65.8                    205                 7.80                    46.0
4000     183                 15.3                    97.0                    366                 7.90                    53.4
5000     286                 13.7                    105.9                   572                 7.86                    33.3
6000     411                 15.2                    119.7                   823                 7.83                    40.7
7000     560                 15.4                    129.2                   1121                7.87                    47.3
8000     732                 15.4                    141.2                   1464                7.88                    53.2
9000     926                 15.4                    101.3                   1853                7.76                    40.8
10000    1144                15.4                    112.4                   2288                7.91                    45.4
11000    1384                15.4                    115.7                   2769                7.88                    50.1
12000    1647                15.4                    141.6                   3296                Not possible            Not possible
13000    1934                15.4                    136.9                   3868                Not possible            Not possible
14000    2243                16.0                    146.7                   4486                Not possible            Not possible
16000    2929                15.9                    166.8                   5859                Not possible            Not possible

Table 22: ATI Radeon 3870 Fortran ACML-GPU BLAS performance in Gflops/s.

Figure 12: CPU and GPU performance, double precision GEMM [chart: performance in Gflops/s versus matrix size N; series: CPU double, GPU double].

ATI FireStream 9170

Fortran callable BLAS test

The Fortran test program employed earlier to show how simply the GPU can be accessed is also used to test the 9170 card. Details about the Fortran program are found in the section above. Compilation is again very simple:

    gfortran -o sgemm-test.x -O3 -fopenmp -L/usr/lib64/acml/gfortran64/lib/ -lCALBLAS -lacml \
        -L/usr/local/amdcal/lib64/ -lamdcalcl -lamdcalrt sgemm-test.f90

The Fortran interface is merely a call to the familiar ACML (AMD Core Math Library); the rest is done by the library and the device drivers. As before, the lower level SGEMM_CAL_F entry can be called instead of the ACML entry sgemm to be 100% sure that the GEMM is run on the GPU and not the CPU; very little performance difference was observed, showing that there is no need to bypass the high level ACML interface. The normal BLAS libraries, the Goto library and ACML (in a CPU-only version), are linked in for reference.

In this test both alpha and beta have again been set to 1.0, giving some more work than just AxB: C = alpha*A x B + beta*C, known as fused multiply add. Performance is very good for this test, as shown in the table below; outperforming the CPU based computation by a factor of 7-10 is very good. By using ACML as the interface there is no need to recompile old programs; a relink is all that is needed for statically linked programs, and just a replacement of the library for dynamically linked programs.
Size N   Size MB (NxNx4x3)   CPU single [Gflops/s]   GPU single [Gflops/s]   Size MB (NxNx8x3)   CPU double [Gflops/s]   GPU double [Gflops/s]
500      2                   14.70757                11.91                   5                   7.14                    6.76
1000     11                  14.81688                22.23                   22                  7.41                    10.87
2000     45                  15.28408                54.25                   91                  7.54                    28.89
3000     102                 15.36944                69.51                   205                 7.56                    40.56
4000     183                 15.38695                100.41                  366                 7.59                    60.16
5000     286                 15.42587                126.99                  572                 7.59                    55.68
6000     411                 15.42376                122.33                  823                 7.56                    67.11
7000     560                 15.37592                140.16                  1121                7.59                    68.96
8000     732                 15.42798                166.53                  1464                7.59                    83.90
9000     926                 15.4195                 152.12                  1853                7.60                    66.97
10000    1144                15.41815                175.91                  2288                7.58                    68.77
11000    1384                15.43494                176.21                  2769                7.59                    90.62
12000    1647                15.40842                248.76                  3296                7.61                    92.28
13000    1934                15.43158                189.04                  3868                Not possible            Not possible
14000    2243                15.40059                202.22                  4486                Not possible            Not possible
15000    2575                15.3710                 211.89                  5150                Not possible            Not possible
16000    2929                15.4658                 216.82                  5859                Not possible            Not possible
17000    3307                15.4658                 167.76                  6615                Not possible            Not possible

Table 23: ATI FireStream 9170 Fortran ACML-GPU BLAS performance (single and double precision) in Gflops/s, compared to the Goto BLAS library for single thread CPU performance.

Figure 13: ATI FireStream 9170, GEMM performance for single and double precision [chart: performance in Gflops/s versus matrix size N; series: GPU single, GPU double].

Quad core processors

Typical processors today have 4 cores, and BLAS libraries can utilize all these cores in a single function call. It is fair to compare the GPU performance with 4 threads run on a single processor. Doing so, the GPU is only about twice as fast as a quad core processor.

Figure 14: Performance comparison, FireStream 9170 GPU vs. 4 threads on a quad core processor, single precision GEMM [chart: performance in Gflops/s versus matrix size N; series: CPU 4 threads, GPU].

Figure 15: Performance comparison, FireStream 9170 GPU vs. 4 threads on a quad core processor, double precision GEMM [chart: performance in Gflops/s versus matrix size N; series: CPU 4 threads, GPU].

Comparing NVIDIA and ATI

The ATI cards Radeon HD 3870 and FireStream 9170 have support for double precision floating point numbers. This is a substantial improvement over earlier cards and GPU chips. Both the NVIDIA and the ATI cards have been installed in the same machine and run the same Fortran 90 test program. The matrix multiplication problem (SGEMM) is used for comparison, as this is one of the most common linear algebra operations. By assuming a flops count of 2xNxNxN for the SGEMM operation we can also calculate performance numbers. For double precision the ATI cards stand alone, see tables 21 and 22.

Size N   Total size [MB]   NVIDIA [Gflops/s]   ATI 3870 [Gflops/s]   FireStream 9170 [Gflops/s]
500      2                 0.442               2.40                  11.9
1000     11                3.66                10.9                  22.2
2000     45                22.2                35.0                  54.3
3000     102               n.d.                65.8                  69.5
4000     183               84.8                97.0                  100.4
5000     286               83.3                105.9                 127.0
6000     411               92.3                119.7                 122.3
6300     454               91.8                n.d.                  n.d.
7000     560               Not possible to run 129.2                 140.2
8000     732               Not possible to run 141.2                 166.5
9000     926               Not possible to run 101.3                 152.1
10000    1144              Not possible to run 112.4                 175.9
11000    1384              Not possible to run 115.7                 176.2
12000    1647              Not possible to run 141.6                 248.8
13000    1934              Not possible to run 136.9                 189.0
14000    2243              Not possible to run 146.7                 202.2
16000    2929              Not possible to run 166.8                 216.8

Table 24: Performance comparison of the NVIDIA and ATI cards when running SGEMM. ATI/ACML can run out-of-core and hence solve problems of any size.

Figure 16: Radeon HD 3870 and FireStream 9170 compared to the GeForce 8800 GTS, single precision GEMM [chart: performance in Gflops/s versus matrix size N; series: GTS8800, HD3870, FS9170]. The ACML/ATI combination can run problems out-of-core, i.e. larger than the memory on the GPU device (ranging from 512 MB to 2 GB for the tested cards).

Discussion

The graphical processors available today are formidable engines and the possible performance is staggering. However, before we jump to the conclusion that this is the final solution to all computational problems, a few remarks must be made.

Memory size limitation

First of all there is a limitation on the problems that can be attacked. The problem must be copied to the device memory on the graphics card, and these cards have memory in the sub GB range. The NVIDIA card used for these tests had ½ GB, implying that only problems with a footprint of less than ½ GB can be attacked; the FireStream 9170 has 2 GB. Even if close to 10x BLAS level 3 (sgemm) performance is attractive, it is only delivered for the smaller matrices.

However, the ACML library has the possibility to run out-of-core (core in this context is the GPU device memory). In this way very large problems can be attacked. The tables show that problem sizes that were out of reach for the CUDA/NVIDIA system could be handled by the ATI/ACML system with ease. The memory limitation hence only applies to CUDA/NVIDIA.

NVIDIA has single precision data types only

The only supported data types are integer and single precision floating point. However, single precision complex is supported.

ATI has double precision data type support

The ATI Radeon HD 3870 and FireStream 9170 have support for double precision data types. At the time of writing, complex data types are not supported by the software (see the appendix for more information about the double precision data type).

Bandwidth limitation

The bandwidth at which data can be copied from main memory to device memory and vice versa is limited. Measurements show bandwidths in the few GB/s range (2-3 GB/s for NVIDIA, somewhat more for ATI). This is far too low to allow for memory bandwidth intensive calculations. The GPU is hence best suited for tasks that require a lot of calculations per byte of data, or tasks that reduce the amount of data, typically reductions. However, the out-of-core calculations using ACML and the HD3870/FS9170 seem to contradict this.

Physical size and connection

The graphics cards with the powerful GPUs are connected to the motherboard via a PCI Express x16 connector, and in addition an extra power connector is needed. This is commonly found on gaming/home/office machines or workstations, but it is very rare on servers.
In addition, the servers used for computation are commonly of the 1U type, and as the graphics card occupies two PCI slots in physical dimensions it cannot fit in a normal 1U node. It can obviously not fit in a blade server. This constrains the usage of GPUs for calculation severely, making it only a marginal option today. In the future tighter integration might change this.

Software limitations

The current software stack relies on X11 to run and also requires full control of the X server. This makes it very hard to run from a remote console. A normal ssh login with X11 forwarding enabled does not work. This is a showstopper for use in compute nodes today. All references to graphics drivers or X11 should be avoided for an accelerator card.

Lack of thread hot and thread safe support

Currently the GPU can only run one thread at a time. There are 4 cores in a normal AMD processor and each of these can run a computationally demanding job, and each of these 4 might want to access the GPU. For an MPI job there would be 4 or even 8 (dual socket node) concurrently running processes on each node, and each of these might want to access the GPU, probably concurrently. Either there should be one GPU per CPU core, or one GPU should be capable of handling multiple threads concurrently. The X2 cards contain two GPUs and this is a start; an X4 card might be well suited to a single socket quad core processor node. A dual socket node might need two X4 cards, or even a card that could accommodate 8 GPUs.

Limited compiler support

At present there is only support for accelerators of this kind in one of the commonly used Fortran compilers, namely the Portland set of C and Fortran compilers. On the other hand, the ease of use of the compiler directives with Portland is valuable. The company states that they want it to be very similar to OpenMP directives, and so far it looks like they are on the right track.

Better integration of the GPU and the CPU, such as sharing the main memory, would allow a far simpler implementation of compiler support. Today there is the overhead of copying data from main memory to GPU memory and the associated administration needed to launch programs on the GPU. The integration of the x87 math co-processor is an example to follow.

Appendix

Component        Hardware
Motherboard      ASUS M3A32-MVP Deluxe
Processor        AMD Phenom 9600 2.3 GHz
Memory           Kingston HyperX PC8500, total 4 GB
Graphics card    XFX GeForce 8800 GTS, 512 MB DDR3
Graphics card    ASUS Radeon HD 3870, 512 MB GDDR4
Graphics card    ATI FireStream 9170, 2 GB GDDR4
Power supply     Chill CP-520A4 520 W
Hard disk        Western Digital Raptor X 150 GB SATA 10k
Hard disk        MTRON SSD 7000 3.5" 32 GB

Table 25: Hardware.

Software                                   Details
Operating system                           RedHat EL 5
C compiler                                 gcc version 4.1.2 20070626
C compiler                                 Portland pgcc 9.0-3
Fortran compiler                           gfortran version 4.1.2 20070626
Fortran compiler                           Portland pgfortran 9.0-3
GPU driver (8800GTS)                       NVIDIA-Linux-x86_64-169.09-pkg2
GPU driver (8800GTS) (compiler tests)      NVIDIA-Linux-x86_64-185.18.31-pkg2
GPU software (8800GTS)                     CUDA 1.1 support (169.09)
GPU software (8800GTS) (compiler tests)    CUDA 2.2 support (190.18)
GPU driver (3870)                          ati-driver 8-5-x86.x86_64
GPU software (3870)                        amdcal-1.01.1_beta.x86_64
GPU software (3870)                        amdbrook-1.01.0_beta.x86_64
Math library                               acmlg0.3
GPU driver (9170)                          ati-driver 8.531
GPU software (9170)                        amdstream-cal-1.2.0_beta-1
GPU software (9170)                        amdstream-brook-1.2.0_beta-1
Table 26: Installed software.

Feature                  NVIDIA GTS 8800                          ATI Radeon HD 3870      ATI FireStream 9170
Stream processors        128                                      320                     320
GPU                      -                                        RV 670                  RV 670
Transistors              754 million                              666 million             666 million
Fabrication process      65 nm                                    55 nm                   55 nm
Die size                 n.d.                                     192 mm²                 192 mm²
Core clock               650 MHz                                  775 MHz                 800 MHz
Memory clock             1000 MHz                                 2.25 GHz                800 MHz
Processing rate (math)   624 Gflops/s                             497 Gflops/s            500 Gflops/s
Memory type              DDR3                                     GDDR4                   DDR4
Memory interface         256-bit                                  256-bit                 256-bit
System bus               PCIe 2.0 x16                             PCIe 2.0 x16            PCIe 2.0 x16
Power consumption        146 W (manufacturer)                     217 W (measured)        150 W (manufacturer)

Table 27: Specifications for the NVIDIA GTS 8800, ATI Radeon HD 3870 and ATI FireStream 9170.

ATI statement about double precision

Cited from the ATI web site FAQ (it is assumed that the same applies to the HD 3870, not only the 9170):

"How does AMD's stream computing address the IEEE754 standard for double precision floating point computation?

The IEEE754 standard defines formats for representing single and double-precision floating point numbers as well as some special cases like denorms, infinities and NaNs. It also defines four rounding modes and methods for exception handling. When we were preparing to launch our stream computing initiative in 2006, a series of customer interviews was conducted to get input on requirements relative to this standard. They learned that as long as we handled the special cases according to the most common usage, complete IEEE754 compliance wasn't required. AMD's FireStream 9170 implementation should handle a large majority of customers' requirements. In the AMD FireStream 9170:

• Infinities and NaNs are handled as defined by the IEEE754 standard.
• Rounding is handled using the "round to nearest" mode, which is the mode generally used in most applications.
• Denormal numbers are flushed to zero. This is a common optimization in implementations where full-speed hardware support is not available, and is adequate for most applications."

Wikipedia about the RV600 series of GPUs

The new unified shader functionality is based upon a Very Long Instruction Word (VLIW) architecture in which the core executes operations in parallel. The R600 uses 64 superscalar unified shader clusters, each consisting of 5 stream processing units, for a total of 320 stream processing units. The RV610 and RV630 variants have some of the shaders removed from the array, containing a total of 40 (5x8) and 120 (5x24) stream processors, respectively. Each of the first 4 stream processing units is able to retire a finished single precision floating point MAD (or ADD or MUL) instruction per clock, as well as dot products (dp, special cased by combining ALUs) and integer ADD. The fifth unit is more complex and can additionally handle special transcendental functions such as sine and cosine. Each of the 64 shader clusters can execute 6 instructions per clock cycle (peak), consisting of 5 shading instructions plus 1 branch.