SC12 The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, USA
Workshop 119: An Educator's Toolbox for CUDA

Performance of CUDA Programs
Vector Addition and Matrix Addition/Multiplication
B. Wilkinson, Nov 9, 2012

Preliminaries

The purpose of this session is to explore the performance of vector/matrix addition and multiplication. A directory called Matrix is provided containing the files you need for this session. Refer to the separate notes about access to the remote GPU servers and about using Linux commands and editors.

Timing execution. The programs here use CUDA events to time program execution.

Provided files. Each provided guest account has a directory called Matrix with the following contents:

    File            Description
    VectorAdd.cu    Vector addition CUDA program
    MatrixAdd.cu    Matrix addition CUDA program
    MatrixMult.cu   Matrix multiplication CUDA program
    Coalescing.cu   Memory coalescing CUDA program
    Makefile        Makefile with rules for compiling each program

Vector addition uses one-dimensional structures whereas the matrix operations use two-dimensional structures. The programs prompt for keyboard input so that the user can alter the number of threads in a GPU block and the number of GPU blocks used in the computation, within the limits of the GPU. The matrix programs also allow the matrix size to be altered from the keyboard, and they perform the computation on the host CPU as well, for comparison.

Compiling. A simple makefile is provided to compile the programs individually.
The listing of the makefile is given below:

    NVCC = /usr/local/cuda/bin/nvcc
    CUDAPATH = /usr/local/cuda
    NVCCFLAGS = -I$(CUDAPATH)/include
    LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

    VectorAdd: VectorAdd.cu
    	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o VectorAdd VectorAdd.cu

    MatrixAdd: MatrixAdd.cu
    	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o MatrixAdd MatrixAdd.cu

    MatrixMult: MatrixMult.cu
    	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o MatrixMult MatrixMult.cu

    Coalescing: Coalescing.cu
    	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o Coalescing Coalescing.cu

Type make VectorAdd to compile the program VectorAdd.cu, and similarly for the other programs.

Part 1 Vector Addition

A prewritten CUDA program, VectorAdd.cu, is provided that adds two vectors (one-dimensional arrays), instrumented with CUDA events to measure the time of execution. In this task, you will compile VectorAdd.cu. The code is given below:

    #include <stdio.h>
    #include <cuda.h>
    #include <stdlib.h>

    #define N 4096                         // size of array

    __global__ void vectorAdd(int *a, int *b, int *c) {
    	int tid = blockIdx.x * blockDim.x + threadIdx.x;
    	if (tid < N) {
    		c[tid] = a[tid] + b[tid];
    	}
    }

    int main(int argc, char *argv[]) {
    	int T = 10, B = 1;                 // threads per block and blocks per grid
    	int a[N], b[N], c[N];              // vectors, statically declared
    	int *dev_a, *dev_b, *dev_c;

    	printf("Size of array = %d\n", N);
    	do {
    		printf("Enter number of threads per block: ");
    		scanf("%d", &T);
    		printf("\nEnter number of blocks per grid: ");
    		scanf("%d", &B);
    		if (T * B < N) printf("Error T x B < N, try again\n");
    	} while (T * B < N);

    	cudaEvent_t start, stop;           // using cuda events to measure time
    	float elapsed_time_ms;

    	cudaMalloc((void**)&dev_a, N * sizeof(int));
    	cudaMalloc((void**)&dev_b, N * sizeof(int));
    	cudaMalloc((void**)&dev_c, N * sizeof(int));

    	for (int i = 0; i < N; i++) {      // load arrays with some numbers
    		a[i] = i;
    		b[i] = i * 1;
    	}

    	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    	cudaMemcpy(dev_c, c, N * sizeof(int), cudaMemcpyHostToDevice);
    	cudaEventCreate(&start);           // instrument code to measure start time
    	cudaEventCreate(&stop);
    	cudaEventRecord(start, 0);

    	vectorAdd<<<B,T>>>(dev_a, dev_b, dev_c);

    	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    	cudaEventRecord(stop, 0);          // instrument code to measure end time
    	cudaEventSynchronize(stop);
    	cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    	for (int i = 0; i < N; i++) {
    		printf("%d+%d=%d\n", a[i], b[i], c[i]);
    	}
    	printf("Time to calculate results: %f ms.\n", elapsed_time_ms);

    	cudaFree(dev_a);
    	cudaFree(dev_b);
    	cudaFree(dev_c);

    	return 0;
    }

Compiling. Type make VectorAdd to compile the program VectorAdd.cu.

Executing the program. Type ./VectorAdd to execute the compiled program. (Note that you have to specify the current directory, ./) The program will prompt you for the number of threads per block and the number of blocks per grid. The total number of threads in the grid must be equal to or greater than the number of elements in the arrays (4096), or you will be prompted to try again.

Experiment with different CUDA grid/block structures. Experiment with different numbers of blocks and threads per block at the keyboard.1 To use different array sizes, modify the value of N in VectorAdd.cu and recompile. Accommodating command-line input for N would require the arrays to be declared dynamically, which is done in the next program to add matrices. Also, each experiment with different values requires you to re-execute the program, re-launching the kernel each time. The next program includes code that lets you experiment with different values without leaving the program. Finally, we are interested in how much faster the GPU code executes compared to doing the computation on the host CPU alone. The next code therefore also performs the computation on the host, and the GPU and CPU results are compared to ensure both produce the same answers (then assumed correct!).
1 Note compute capability 2.x has a limit of 1024 threads per block and compute capability 1.x has a limit of 512 threads per block.

Part 2 Matrix Addition

In this task, you will compile and execute a simple prewritten CUDA program that adds two matrices, called MatrixAdd.cu. The code is given below:

    // Matrix addition program MatrixAdd.cu

    #include <stdio.h>
    #include <cuda.h>
    #include <stdlib.h>

    __global__ void gpu_matrixadd(int *a, int *b, int *c, int N) {
    	int col = threadIdx.x + blockDim.x * blockIdx.x;
    	int row = threadIdx.y + blockDim.y * blockIdx.y;
    	int index = row * N + col;
    	if (col < N && row < N) c[index] = a[index] + b[index];
    }

    void cpu_matrixadd(int *a, int *b, int *c, int N) {
    	int index;
    	for (int col = 0; col < N; col++)
    		for (int row = 0; row < N; row++) {
    			index = row * N + col;
    			c[index] = a[index] + b[index];
    		}
    }

    int main(int argc, char *argv[]) {
    	char key;
    	int i, j;                    // loop counters
    	int Grid_Dim = 1;            // grid dimension, x and y, square
    	int Block_Dim = 1;           // block dimension, x and y, square
    	int N = 10;                  // size of array in each dimension
    	int *a, *b, *c, *d;
    	int *dev_a, *dev_b, *dev_c;
    	int size;                    // number of bytes in arrays

    	cudaEvent_t start, stop;     // using cuda events to measure time,
    	float elapsed_time_ms;       // which is applicable for asynchronous code also

    /* -------------------- ENTER INPUT PARAMETERS AND DATA ----------------------*/

    	do {  // loop to repeat complete program

    		printf("Device characteristics -- some limitations (compute capability 2.x)\n");
    		printf("  Maximum size of x- and y- dimension of block if square = 32\n");
    		printf("  Maximum size of each dimension of grid = 65535\n");
    		printf("Enter size of array in one dimension (square array), currently %d\n", N);
    		scanf("%d", &N);

    		do {
    			printf("\nEnter number of threads per block in x/y dimensions, currently %d: ", Block_Dim);
    			scanf("%d", &Block_Dim);
    			printf("\nEnter number of blocks per grid in x/y dimensions, currently %d: ", Grid_Dim);
    			scanf("%d", &Grid_Dim);
    			if (Block_Dim > 32) printf("Error, too many threads in each dimension of block, try again\n");
    			if ((Grid_Dim * Block_Dim) < N)
    				printf("Error, number of threads in x/y dimensions less than number of array elements, try again\n");
    			else printf("Number of threads not used = %d\n",
    				((Grid_Dim * Block_Dim) - N) * ((Grid_Dim * Block_Dim) - N));
    		} while ((Block_Dim > 32) || ((Grid_Dim * Block_Dim) < N));

    		dim3 Grid(Grid_Dim, Grid_Dim);    // Grid structure
    		dim3 Block(Block_Dim, Block_Dim); // Block structure

    		size = N * N * sizeof(int);  // number of bytes in total in arrays

    		a = (int*) malloc(size);     // dynamically allocated memory for arrays on host
    		b = (int*) malloc(size);
    		c = (int*) malloc(size);     // results from GPU
    		d = (int*) malloc(size);     // results from CPU

    		for (i = 0; i < N; i++)      // load arrays with some numbers
    			for (j = 0; j < N; j++) {
    				a[i * N + j] = i;
    				b[i * N + j] = i;
    			}

    /* ------------- COMPUTATION DONE ON GPU ----------------------------*/

    		cudaMalloc((void**)&dev_a, size);  // allocate memory on device
    		cudaMalloc((void**)&dev_b, size);
    		cudaMalloc((void**)&dev_c, size);

    		cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    		cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    		cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

    		cudaEventCreate(&start);     // instrument code to measure start time
    		cudaEventCreate(&stop);
    		cudaEventRecord(start, 0);
    		cudaEventSynchronize(start); // not needed

    		gpu_matrixadd<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

    		cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    		cudaEventRecord(stop, 0);    // instrument code to measure end time
    		cudaEventSynchronize(stop);
    		cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    		printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);  // exec. time

    /* ------------- COMPUTATION DONE ON HOST CPU ----------------------------*/

    		cudaEventRecord(start, 0);   // use same timing events
    		cudaEventSynchronize(start); // not needed

    		cpu_matrixadd(a, b, d, N);   // do calculation on host

    		cudaEventRecord(stop, 0);    // instrument code to measure end time
    		cudaEventSynchronize(stop);
    		cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    		printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);  // exec. time

    /* ------------------- check device creates correct results -----------------*/

    		for (i = 0; i < N*N; i++) {
    			if (c[i] != d[i]) {
    				printf("ERROR in results, CPU and GPU create different answers\n");
    				break;
    			}
    		}

    		printf("\nEnter c to repeat, return to terminate\n");
    		scanf("%c", &key);
    		scanf("%c", &key);

    	} while (key == 'c');  // loop of complete program

    /* -------------- clean up ---------------------------------------*/

    	free(a);
    	free(b);
    	free(c);
    	free(d);

    	cudaFree(dev_a);
    	cudaFree(dev_b);
    	cudaFree(dev_c);

    	cudaEventDestroy(start);
    	cudaEventDestroy(stop);
    	return 0;
    }

Compiling and executing. Compile the program with the makefile (make MatrixAdd) and execute the program (./MatrixAdd). Confirm that the program functions correctly.

Experiment with different CUDA grid/block structures. Experiment with different numbers of blocks and threads per block. Try larger array sizes. Are there any differences in execution time from the first kernel launch to subsequent ones with the same input parameters?

Part 3 Matrix Multiplication

Modify the matrix addition code to multiply two matrices and save it as MatrixMult.cu. The sequential version to multiply two matrices is given below:

    void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    	int row, col, k, sum;
    	for (row = 0; row < N; row++)        // row of a
    		for (col = 0; col < N; col++) {  // column of b
    			sum = 0;
    			for (k = 0; k < N; k++)
    				sum += cpu_a[row * N + k] * cpu_b[k * N + col];
    			cpu_c[row * N + col] = sum;
    		}
    }

The matrix multiply code is actually given as part of the memory coalescing experiment in the final section.
Compiling and executing the code. Compile the program with the makefile (make MatrixMult) and execute it (./MatrixMult). Experiment.

Part 4 Memory Coalescing

A program is provided, Coalescing.cu, to explore different memory access patterns. Each GPU thread in the program simply loads its thread ID into a location in a two-dimensional array in global memory, so that we can identify which thread accessed each location. The accesses are repeated T times and the time is recorded. The number of iterations is specified through keyboard input, as is the number of grid blocks. The size of the blocks is fixed at 16 x 16 (256 threads), within the 512-thread block limit of compute capability 1.x.2 The initial kernel for Coalescing.cu is given below:

    __global__ void gpu_Comput (int *h, int N, int T) {
    	// Array loaded with global thread ID that accesses that location
    	int col = threadIdx.x + blockDim.x * blockIdx.x;
    	int row = threadIdx.y + blockDim.y * blockIdx.y;
    	int threadID = col + row * N;
    	int index = row + col * N;   // sequentially down each row
    	for (int t = 0; t < T; t++)  // loop to repeat to reduce other time effects
    		h[index] = threadID;     // load array with flattened global thread ID
    }

The main program calls this kernel once, copies the generated array back to the host, and prints out the contents of the array (every N/8 numbers). A sample output from the program is given below:

Figure 1 Output for Coalescing.cu

2 The maximum for compute capability 2.x (coit-grid06.uncc.edu and coit-grid07.uncc.edu) is 32 x 32, 1024 threads.

The complete program listing of Coalescing.cu is given below:

    // To measure effects of memory coalescing. Coalescing.cu. B. Wilkinson Jan 30, 2011

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda.h>

    #define BlockSize 16  // Size of blocks, 16 x 16 threads, fixed

    __global__ void gpu_Comput (int *h, int N, int T) {
    	// Array loaded with global thread ID that accesses that location
    	int col = threadIdx.x + blockDim.x * blockIdx.x;
    	int row = threadIdx.y + blockDim.y * blockIdx.y;
    	int threadID = col + row * N;
    	int index = row + col * N;   // sequentially down each row
    	for (int t = 0; t < T; t++)  // loop to repeat to reduce other time effects
    		h[index] = threadID;     // load array with flattened global thread ID
    }

    void printArray(int *h, int N) {
    	printf("Results of computation, every N/8 numbers, eight numbers\n");
    	for (int row = 0; row < N; row += N/8) {
    		for (int col = 0; col < N; col += N/8)
    			printf("%6d ", h[col + row * N]);
    		printf("\n");
    	}
    }

    int main(int argc, char *argv[]) {
    	int T = 100;     // number of iterations, entered at keyboard
    	int B = 1;       // number of blocks, entered at keyboard
    	char key;
    	int *h, *dev_h;  // ptr to array holding numbers on host and device

    	cudaEvent_t start, stop;  // cuda events to measure time
    	float elapsed_time_ms1;
    	cudaEventCreate(&start);
    	cudaEventCreate(&stop);

    /* ------------------------- Keyboard input ----------------------------------*/

    	do {  // loop to repeat complete program

    		printf("Grid Structure 2-D grid, 2-D blocks\n");
    		printf("Blocks fixed at 16 x 16 threads, 256 threads, for compute cap. 1.x\n");
    		printf("Enter number of blocks in grid, each dimension, currently %d\n", B);
    		scanf("%d", &B);
    		printf("Enter number of iterations, currently %d\n", T);
    		scanf("%d", &T);

    		int N = B * BlockSize;  // size of data array, given input data
    		printf("Array size (and total grid-block size) %d x %d\n", N, N);

    		dim3 Block(BlockSize, BlockSize);  // Block structure, fixed
    		dim3 Grid(B, B);                   // Grid structure, B x B

    /* ------------------------- Allocate memory ---------------------------------*/

    		int size = N * N * sizeof(int);    // number of bytes in total in array
    		h = (int*) malloc(size);           // array on host
    		cudaMalloc((void**)&dev_h, size);  // allocate device memory

    /* ------------------------- GPU computation ---------------------------------*/

    		cudaEventRecord(start, 0);

    		gpu_Comput<<< Grid, Block >>>(dev_h, N, T);

    		cudaEventRecord(stop, 0);    // instrument code to measure end time
    		cudaEventSynchronize(stop);  // wait for all work done by threads
    		cudaEventElapsedTime(&elapsed_time_ms1, start, stop);

    		cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);  // get results to check
    		printArray(h, N);
    		printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

    /* ---------- INSERT CALL TO SECOND KERNEL HERE ------------------------------*/

    /* ------------------------- Repeat program input ----------------------------*/

    		printf("\nEnter c to repeat, return to terminate\n");
    		scanf("%c", &key);
    		scanf("%c", &key);

    	} while (key == 'c');  // loop of complete program

    /* -------------- clean up ---------------------------------------*/

    	free(h);
    	cudaFree(dev_h);
    	cudaEventDestroy(start);
    	cudaEventDestroy(stop);
    	return 0;
    }

Compile and execute Coalescing.cu. Compile and execute Coalescing.cu on the assigned GPU server and confirm that the program functions correctly for different input values. (Note: the program does not have error checking on input.)

Effects of memory coalescing. The kernel could have loaded the array sequentially across columns rather than down rows.
Here we will explore what happens if we do that. Create a copy of Coalescing.cu called Coalescing_sol.cu (cp Coalescing.cu Coalescing_sol.cu). Write a second kernel, similar to the existing kernel, that makes the memory accesses with threadID sequentially along columns rather than down rows, and insert this kernel code into Coalescing_sol.cu. Call your kernel from the main program before the section calling the original kernel, and include a call to print out the contents of the array. Compute and display the ratio of the original kernel execution time to your kernel execution time (the speedup of using your kernel), to produce output such as shown in Figure 2. Modify the makefile to compile the code and compile. Investigate what speedup, if any, is achieved with your kernel, and why.

Figure 2 Sample output for Coalescing.cu with two kernels and speedup calculation

Using shared memory. Using shared memory requires substantial code modifications. For those who have reached this point, make these modifications and experiment.