1 Content

1 Content
2 Summary
3 Introduction to CUDA
 3.1 Scalability (1)
 3.2 Kernels (1)
 3.3 Thread-Hierarchy (1)
 3.4 Memory-Hierarchy (1)
 3.5 GPU as Coprocessor
4 Matrix Multiplication
 4.1 Mathematics
 4.2 Column-major order
5 Implementations
 5.1 Global memory
  5.1.1 Calculating memory address
  5.1.2 Kernel function
 5.2 Shared memory
  5.2.1 Dynamic shared memory
  5.2.2 Calculating memory address
  5.2.3 Kernel function
 5.3 Serial C++ code
6 Time measurements
 6.1 High performance counter
 6.2 Time for matrices up to 300x300 elements
 6.3 Conclusions
7 Legal
8 Download
9 References

Created by Christian Arrer.

2 Summary

This document is a summary of the first 25 pages of the CUDA programmer's guide written by NVIDIA Corporation. It is more or less an attempt to put the important points together in a few lines, so that one can go directly on to an implementation of matrix multiplication and of various algorithms for Fourier transformation. An adapted version of the shared memory matrix multiplication and of the global memory matrix multiplication is implemented, but not described in detail in this document. In a measurement section the runtimes in seconds of CUDA, of the MATLAB built-in multiplication and of serial C++ code on the CPU are compared. The results are acquired in good faith for scientific interest only and are not intended for any competitive purpose. NSIGHT is not used here.

These implementations of matrix multiplication do not force the matrices to be multiples of the CUDA block dimensions. Arbitrary matrix sizes are allowed, which the implementation in the CUDA programming guide prohibits. Matrices could of course be padded with zeros to exactly fit the CUDA block chessboard; in that case it should be checked whether extracting the result matrix from zero-padded sources is more efficient than out-of-bounds checking in the kernel function.

3 Introduction to CUDA

This is a short overview of the most important CUDA concepts for getting started.

3.1 Scalability (1)

CUDA provides a framework that allows the programmer to write code that dynamically adapts to the number of available execution cores. This is achieved by two levels of parallelism: on the one hand there is local parallelism through threads, on the other hand those threads are organized in blocks, which can be scheduled onto the physically available multiprocessors either serially or in parallel. NVIDIA hardware and firmware handle the parallelism according to the hardware's capabilities. An NVIDIA chip with 10 streaming multiprocessors is able to run 10 blocks at once, while an NVIDIA chip with 4 streaming multiprocessors will only execute 4 blocks at the same time.

3.2 Kernels (1)

Kernels are the basic elements which are executed in parallel. In C++ a CUDA kernel is defined with the __global__ keyword. Inside the kernel the built-in threadIdx structure is accessible. Kernels are launched in one or multiple threads.

3.3 Thread-Hierarchy (1)

Figure 1: Hierarchy

Threads: Threads are the smallest unit of the hierarchy. Threads live inside blocks. The thread's index is always accessible via the threadIdx structure inside it. The maximum number of threads per block is currently 1024.

Blocks: Blocks are arranged inside a grid. The number of blocks can exceed the number of streaming multiprocessors on the hardware. The block's index is always accessible via the blockIdx structure. Grids are only two-dimensional.

A one-, two- or three-dimensional structure can be used, depending on the number of dimensions of the underlying calculation matrices. Again it is important to mention that blocks must allow arbitrary serial or parallel execution; they are the ones enabling scalability on the currently installed hardware. Threads use barrier synchronization when they read with low latency from the current streaming multiprocessor's shared memory: __syncthreads() acts as a barrier that every thread of the block must reach before any thread continues. All built-in variables are compiled in Table 1.

Table 1: Important variables and language constructs
blockIdx.x                             X-block-index inside a grid.
blockIdx.y                             Y-block-index inside a grid.
blockIdx.z                             Z-block-index inside a grid.
blockDim.x                             Number of thread elements per block in x.
blockDim.y                             Number of thread elements per block in y.
blockDim.z                             Number of thread elements per block in z.
threadIdx.x                            X-thread-index inside a block.
threadIdx.y                            Y-thread-index inside a block.
threadIdx.z                            Z-thread-index inside a block.
<<<dim3, dim3, dynamicSharedSize>>>    Invocation of the threads (kernel launch configuration).
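As a minimal sketch of how these language constructs fit together (this example is not taken from the programming guide; the names addKernel, devA, devB, devC and n are hypothetical), a one-dimensional kernel and its launch could look like this:

__global__ void addKernel(const float* a, const float* b, float* c, unsigned int n)
{
    // build a global element index from the block and thread indexes of Table 1
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) // guard threads that fall outside the data
    {
        c[i] = a[i] + b[i];
    }
}

// host code (inside some function), assuming devA, devB and devC already point to device memory:
// each block holds 256 threads; (n + 255) / 256 blocks cover all n elements
addKernel<<<(n + 255) / 256, 256>>>(devA, devB, devC, n);

The third launch parameter of Table 1 (the dynamic shared memory size) is omitted here and defaults to zero.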
3.4 Memory-Hierarchy (1)

There are five different types of memory, see Table 2:

Table 2: Memory location types
each thread's private local memory    Only the thread itself can access this space.
each block's shared memory            All threads in a block can read and write to this location.
global memory                         Every thread has access to this storage.
constant memory                       Read-only to all threads, for special access patterns.
texture memory                        Read-only to all threads, for special access patterns.

3.5 GPU as Coprocessor

The GPU is more or less an auxiliary unit for the main program launched on the CPU. Data is transferred from the host memory (the system's DRAM) to the device memory (the NVIDIA accelerator's DRAM) and back again. The major compute-capability number of the graphics card corresponds to the compute architecture (Table 3).

Table 3: Compute architecture
Major device number    Architecture
3                      Kepler
2                      Fermi
1                      Tesla

4 Matrix Multiplication

4.1 Mathematics

When two matrices are multiplied, the result is again a matrix. If the matrix product is written as A * B = C, then the number of A's columns needs to be equal to B's number of rows. General matrices A and B then look like Equation 1.

Equation 1
$A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & \cdots & b_{1r} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mr} \end{pmatrix}$

A is an n * m matrix, that means A has n rows and m columns. B is an m * r matrix, so B has m rows and r columns. m is the dimension common to both matrices. The result matrix C becomes an n * r matrix. Each element of C is defined as the scalar product of a row vector of matrix A and a column vector of B (Equation 2).

Equation 2
$c_{ik} = \sum_{j=1}^{m} a_{ij} \, b_{jk}, \qquad i = 1, \dots, n, \quad k = 1, \dots, r$

4.2 Column-major order

Inside MATLAB two-dimensional matrices are stored in column-major order, which means the matrix column vectors are arranged one after another into a float or double array. In column-major order, the matrix A from Equation 1 becomes the array in Equation 3.

Equation 3
$A \rightarrow (a_{11}, a_{21}, \dots, a_{n1}, a_{12}, a_{22}, \dots, a_{n2}, \dots, a_{1m}, \dots, a_{nm})$

If the zero-based row index and column index of a matrix element are available, the linear index is easily calculated with Equation 4:

Equation 4
$lin = col \cdot numRows + row$

5 Implementations

NVIDIA CUDA distinguishes five types of memory. For matrix multiplication, two of them are used in two different implementations: one time the GPU's global memory is utilized, the other time its shared memory.

5.1 Global memory

Basically the matrices A, B and C are loaded into the device's global memory using the cudaMemcpy API function.
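A hedged host-side sketch of this transfer (error checking omitted; the caMat structure with the fields numRows, numCols, elemHost and elemDevice is the one used in the listings below):

// host code: allocate device storage for matrix A and copy its elements from host DRAM
size_t bytesA = A.numRows * A.numCols * sizeof(float);
cudaMalloc((void**)&A.elemDevice, bytesA);
cudaMemcpy(A.elemDevice, A.elemHost, bytesA, cudaMemcpyHostToDevice);
// B is treated the same way; C is allocated and cleared, because the kernels accumulate into it
size_t bytesC = C.numRows * C.numCols * sizeof(float);
cudaMalloc((void**)&C.elemDevice, bytesC);
cudaMemset(C.elemDevice, 0, bytesC);
// ... kernel launch ...
// copy the result matrix C back to the host
cudaMemcpy(C.elemHost, C.elemDevice, bytesC, cudaMemcpyDeviceToHost);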
The CUDA API can be called from standard ".cpp" files inside Visual C++, but CUDA kernels and __device__ functions can only be defined and called in ".cu" files, which are compiled by the NVIDIA C compiler nvcc. As mentioned in the introduction, threads are grouped into blocks and blocks are grouped into a grid. The maximum number of threads in a block is limited by the hardware; if the matrix becomes bigger than the maximum number of threads inside a block, it is necessary to create more blocks inside a two-dimensional grid. Blocks can have three dimensions, but until now grids are two-dimensional only. For this implementation a single for loop inside each thread is sufficient to do the standard scalar product's mathematics. The main challenge inside a thread is to compute the linear address of each element. As mentioned before, the matrices are stored in column-major order.

5.1.1 Calculating memory address

Figure 2: Matrix sketch

Each thread has to be aware of its location in the result matrix. A thread can compute its row and column index with the following formula (Equation 5), as the data structures gridDim, blockDim, blockIdx and threadIdx are available inside any kernel function. NVIDIA and C++ start counting from 0, which is consistent: there is a thread and a block with the index (0, 0, 0) and (0, 0) respectively. Compare Figure 2.

Equation 5
idxC.row = blockIdx.x * blockDim.x + threadIdx.x
idxC.col = blockIdx.y * blockDim.y + threadIdx.y

To write in column-major order, the row and column indexes need to be converted into a column-major linear index. The calculation in the next formula (Equation 6) is only valid if the thread is not outside C's boundaries, see Figure 2.

Equation 6
idxC.lin = idxC.col * C.numRows + idxC.row

The linear index of the source matrix A can be computed as in Equation 7:

Equation 7
idxA.lin = inner * A.numRows + idxC.row

The variable "inner" in Equation 7 is the iteration counter of the standard scalar product for loop. The linear index for the source matrix B is calculated as follows (Equation 8):

Equation 8
idxB.lin = idxC.col * B.numRows + inner

5.1.2 Kernel function

The CUDA kernel function is listed below in Code 1.

void __global__ mmKernel(caMat A, caMat B, caMat C)
{
    caMatIdx idxC;
    idxC.row = blockIdx.x * blockDim.x + threadIdx.x;
    idxC.col = blockIdx.y * blockDim.y + threadIdx.y;
    if(idxC.col < C.numCols && idxC.row < C.numRows)
    {
        // check that this thread is not out of bounds of C
        idxC.lin = idxC.col * C.numRows + idxC.row;
        for(unsigned int inner = 0; inner < A.numCols; inner++)
        {
            // loop over the inner matrix dimension
            C.elemDevice[idxC.lin] = C.elemDevice[idxC.lin]
                + (A.elemDevice[idxC.row + (inner * A.numRows)]
                * B.elemDevice[(idxC.col * B.numRows) + inner]);
            // all matrices are stored in column-major order
        }
    }
}
Code 1: global memory kernel

5.2 Shared memory

The matrix multiplication is divided into blocks to fit into the device's resources. Shared memory is a fast memory available to all threads inside one block. Loading blocks into shared memory before starting the calculation increases performance because it saves global memory bandwidth. Therefore the threads of each block need to be synchronized to wait until all data is loaded. All threads must also synchronize after the partial result calculation, before the next matrix block is loaded into shared memory. With the shared memory implementation the data is loaded as usual into global memory and then transferred from global to shared memory.

5.2.1 Dynamic shared memory

CUDA allows a single dynamic shared memory array per block. To obtain multiple shared memory spaces, pointers to given offsets of this single shared memory array can be used to divide it logically into several variables.
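A short device-side sketch of this pointer-offset technique (only a sketch; shrSize is assumed to be the number of elements reserved for the first logical array, as in the kernel listed later):

extern __shared__ float shared[]; // the single dynamic shared memory array of the block
float* shrA = &shared[0];         // first logical part, e.g. a tile of matrix A
float* shrB = &shared[shrSize];   // second logical part starts shrSize elements later, e.g. a tile of matrix B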
The third parameter of the kernel launch is the total size in bytes the dynamic shared memory will use. The shared memory needs to be declared exactly as extern __shared__ <data_type> <custom_name>[]; nvcc wants to see the brackets "[]" at the end, and using the dereferencing operator "*" instead will give errors. The size in bytes is calculated with the sizeof(<data_type>) operator and the number of elements stored in the dynamic shared memory. The kernel launch looks like this (Code 2):

sharedSize = threads.x * threads.x;
sharedBytes = sharedSize * sizeof(float) * 2;
mmKernelShared<<<blocks, threads, sharedBytes>>>(A, B, C, sharedSize);
Code 2: kernel launch with dynamic shared memory

5.2.2 Calculating memory address

Figure 3: Shared memory matrix sketch

For the computation of each result block, blocks of A and B are loaded into shared memory. So there are two nested for loops: the outer for loop iteratively loads the source blocks for each result block into shared memory, and the inner for loop does the mathematics of the partial standard scalar product inside each thread. Compare Figure 3.

Like in the previous example, some address calculation needs to be done. For the left and right source matrices the linear indexes can be computed from their row and column index, the current block iteration and the knowledge of the input's size (Equation 9); the roles of the row and column terms are exactly mirrored for A and B. Triple checking the equations is recommended, because things can easily get confusing here. Kernels can only be debugged with the NSIGHT plug-in.

Equation 9
idxA.row = idxC.row
idxA.col = curBlk * blockDim.y + threadIdx.y
idxA.lin = idxA.col * A.numRows + idxA.row
idxB.row = curBlk * blockDim.x + threadIdx.x
idxB.col = idxC.col
idxB.lin = idxB.col * B.numRows + idxB.row

The calculation of the indexes for the result matrix C stays the same as in the global memory implementation. The column and row index of matrix A and matrix B can be out of bounds. If this is the case, the thread does not need to load its shared memory element. A thread is only truly idle if it is out of bounds in matrix C's columns and rows; otherwise it may still have to load a shared memory element. Therefore, out-of-bounds conditions in A and B need to be checked separately. If the thread is out of bounds in one of the source matrices, it loads a 0 into shared memory. The number of blocks in the outer loop is given by Equation 10.

Equation 10
numBlk = floor(A.numCols / blockDim.x) + 1

For the use of shared memory, the blocks need to be squares. After loading into shared memory inside one outer for loop iteration, the threads need to be synchronized. The inner for loop calculates a part of the scalar product. When this loop starts, all data needs to be present inside the shared memory; the threads are synchronized to ensure that every thread has loaded its data before the scalar product loop starts. Inside the scalar product loop the kernel must not access global memory; this would compromise the bandwidth advantage of the shared memory usage. To accumulate the partial value of the scalar product, a local kernel variable is used.

If the matrix dimensions are not multiples of the block size, the partial scalar product loop can get out of bounds in the right outer perimeter of matrix A or, respectively, in the lower perimeter of matrix B. To prevent garbage memory accesses, the for loop needs an additional break condition (Equation 11).

Equation 11
(curBlk * blockDim.y + j) < A.numCols        equivalently        (curBlk * blockDim.x + j) < B.numRows

Both statements in Equation 11 are equivalent, because the multiplication requires the inner dimensions to agree and the blocks are square. The linear index into the shared memory tiles is easily calculated (Equation 12); the shared memory elements are multiplied and added to the local partial scalar product variable.

Equation 12
shrIdx = blockDim.x * threadIdx.y + threadIdx.x
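To connect these index formulas with the kernel launch from Code 2, here is a hedged host-side sketch of a launch configuration for arbitrary matrix sizes (only a sketch; the name blockEdge is hypothetical, and the blocks are square as required above):

// host code: square blocks of blockEdge x blockEdge threads
unsigned int blockEdge = 16;
dim3 threads(blockEdge, blockEdge);
// ceiling division so that the grid covers all rows (x) and columns (y) of C,
// even when the matrix dimensions are not multiples of the block size
dim3 blocks((C.numRows + blockEdge - 1) / blockEdge,
            (C.numCols + blockEdge - 1) / blockEdge);
unsigned int sharedSize  = blockEdge * blockEdge;          // elements per shared memory tile
unsigned int sharedBytes = 2 * sharedSize * sizeof(float); // two tiles, one for A and one for B
mmKernelShared<<<blocks, threads, sharedBytes>>>(A, B, C, sharedSize);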
5.2.3 Kernel function

The CUDA kernel function is listed below in Code 3.

void __global__ mmKernelShared(caMat A, caMat B, caMat C, unsigned int shrSize)
{
    caMatIdx idxA; // linear, row and column indexes in global NVIDIA memory
    caMatIdx idxB;
    caMatIdx idxC;
    extern __shared__ float shared[]; // needed for dynamically allocated shared memory
    float* shrA = NULL;
    float* shrB = NULL;
    float Cval;
    unsigned int numBlk = 0; // the number of shared memory blocks
    idxC.row = blockDim.x * blockIdx.x + threadIdx.x;
    idxC.col = blockDim.y * blockIdx.y + threadIdx.y;
    idxC.lin = idxC.col * C.numRows + idxC.row;
    numBlk = A.numCols / (unsigned int)blockDim.x;
    numBlk = numBlk + 1;
    for(unsigned int curBlk = 0; curBlk < numBlk; curBlk++)
    {
        // loop over the shared memory blocks
        Cval = 0;
        shrA = &shared[0];
        shrB = &shared[shrSize];
        idxA.row = idxC.row;
        idxA.col = curBlk * blockDim.y + threadIdx.y;
        idxB.row = curBlk * blockDim.x + threadIdx.x;
        idxB.col = idxC.col;
        if(idxA.row < A.numRows && idxA.col < A.numCols)
        {
            // check out of bounds, otherwise the thread does not need to load anything
            idxA.lin = idxA.col * A.numRows + idxA.row;
            shrA[blockDim.x * threadIdx.y + threadIdx.x] = A.elemDevice[idxA.lin]; // load to shared, read from global
        }
        else
        {
            shrA[blockDim.x * threadIdx.y + threadIdx.x] = 0;
        }
        if(idxB.row < B.numRows && idxB.col < B.numCols)
        {
            // check out of bounds, otherwise the thread does not need to load anything
            idxB.lin = idxB.col * B.numRows + idxB.row;
            shrB[blockDim.x * threadIdx.y + threadIdx.x] = B.elemDevice[idxB.lin]; // load to shared, read from global
        }
        else
        {
            shrB[blockDim.x * threadIdx.y + threadIdx.x] = 0;
        }
        __syncthreads(); // the previous copy operation needs to finish before we go on
        if(idxC.col < C.numCols && idxC.row < C.numRows)
        {
            for(unsigned int j = 0; j < blockDim.x && ((curBlk * blockDim.y + j) < A.numCols); j++)
            {
                // do not loop out of bounds, ((curBlk * blockDim.x + j) < B.numRows) is equivalent
                Cval = Cval + shrA[j * blockDim.x + threadIdx.x] * shrB[threadIdx.y * blockDim.x + j];
            }
        }
        __syncthreads(); // wait until the previous operation is finished
        if(idxC.col < C.numCols && idxC.row < C.numRows)
            C.elemDevice[idxC.lin] = C.elemDevice[idxC.lin] + Cval; // write to global, but never inside the scalar product loop
    }
}
Code 3: shared memory kernel

5.3 Serial C++ code

The serial C++ code used for comparison is as follows (Code 4).

for(idxC.col = 0; idxC.col < this->C.numCols; idxC.col++)
{
    for(idxC.row = 0; idxC.row < this->C.numRows; idxC.row++)
    {
        for(unsigned int inner = 0; inner < this->A.numCols; inner++)
        {
            idxA.lin = inner * A.numRows + idxC.row;
            idxB.lin = idxC.col * B.numRows + inner;
            this->C.elemHost[idxC.lin] = this->C.elemHost[idxC.lin]
                + this->A.elemHost[idxA.lin] * this->B.elemHost[idxB.lin];
        }
        idxC.lin++; // idxC.lin is assumed to start at 0 and simply follows the column-major traversal of C
    }
}
Code 4: serial C++ code

The serial code does not take advantage of CPU threads, multiple cores or hyper-threading. Three nested loops are used. The code is highly inefficient.

6 Time measurements

6.1 High performance counter

The high-precision event timer is available on most PC machines. On multi-core processors it is somewhat difficult to use: before taking the frequency of the counter, the start counts and the stop counts, it is necessary to set the process affinity to a single CPU core, otherwise the results will not be reliable.
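A hedged sketch of this measurement procedure with the Windows high performance counter (error handling omitted; the variable names are hypothetical):

#include <windows.h>

// host code: pin the process to one CPU core so the counter values stay comparable
SetProcessAffinityMask(GetCurrentProcess(), 1);

LARGE_INTEGER freq, tStart, tStop;
QueryPerformanceFrequency(&freq); // counts per second
QueryPerformanceCounter(&tStart); // start counts
// ... matrix multiplication under test ...
QueryPerformanceCounter(&tStop);  // stop counts
double seconds = (double)(tStop.QuadPart - tStart.QuadPart) / (double)freq.QuadPart;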
6.2 Time for matrices up to 300x300 elements

Diagram 1: time measurement

The serial C++ code grows steeply (the naive algorithm scales cubically with the matrix dimension) and performs by far the worst. The two CUDA implementations with shared memory and global memory are nearly indistinguishable. This diagram (Diagram 1) is not really usable for analyzing CUDA behavior. The outliers could be removed by averaging over several series instead of a single run.

6.3 Conclusions

Larger matrix sizes should be tested to compare the development of shared memory against global memory with increasing matrix size. The serial C++ code should be greatly improved; the current implementation is probably highly inefficient. Serial C++ on the CPU should be able to beat MATLAB's virtual machine code.

7 Legal

1. MATLAB is a trademark of MathWorks.
2. NVIDIA, CUDA, GPGPU, NSIGHT and GPU are brands of NVIDIA Corporation.
3. Visual C++ is a trademark of Microsoft.

8 Download

http://ch.arrer.net/external/gpgpu/matmult
http://ch.arrer.net/external/gpgpu/matmult.zip
http://ch.arrer.net/external/gpgpu/matmult.docx
http://ch.arrer.net/external/gpgpu/matmult.pdf

9 References

1. NVIDIA Corporation: CUDA C Programming Guide. [Online] http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf