Exercises and Solutions for “Programming Massively Parallel Processors: A Hands-on Approach”, Second Edition
© John A. Stratton and Izzat El Hajj, 2010-2013

Chapter 3

3.1. A matrix addition takes two input matrices B and C and produces one output matrix A. Each element of the output matrix A is the sum of the corresponding elements of the input matrices B and C, that is, A[i][j] = B[i][j] + C[i][j]. For simplicity, we will only handle square matrices whose elements are single-precision floating-point numbers. Write a matrix addition kernel and the host stub function that can be called with four parameters: pointer to the output matrix, pointer to the first input matrix, pointer to the second input matrix, and the number of elements in each dimension. Use the following instructions:

a) Write the host stub function by allocating memory for the input and output matrices, transferring input data to the device, launching the kernel, transferring the output data to the host, and freeing the device memory for the input and output data. Leave the execution configuration parameters open for this step.

void matrixAdd(float *h_A, float *h_B, float *h_C, int matrixLen)
{
    float *d_A, *d_B, *d_C;
    int size = matrixLen*matrixLen*sizeof(float);

    // Allocate device memory for the output and the two inputs
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Initialize GPU memory contents with 0 or host data
    cudaMemset(d_A, 0, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, size, cudaMemcpyHostToDevice);

    // Execute the GPU kernel (execution configuration left open for this step)
    dim3 threads( );
    dim3 blocks( );
    matrixAddKernel<<<blocks, threads>>>(d_A, d_B, d_C, matrixLen);

    // Copy results back to host
    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);

    // Free GPU memory and return
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

b) Write a kernel that has each thread producing one output matrix element. Fill in the execution configuration parameters for the design.

__global__ void matrixAddKernel(float *A, float *B, float *C, int matrixLen)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = threadIdx.y + blockIdx.y*blockDim.y;
    if (i < matrixLen && j < matrixLen)
        A[i*matrixLen + j] = B[i*matrixLen + j] + C[i*matrixLen + j];
}

// Use 16x16 thread blocks.
// (a + (b-1)) / b == ceil(a/b) in integer arithmetic
dim3 threads(16, 16);
dim3 blocks( (matrixLen+15)/16, (matrixLen+15)/16 );

c) Write a kernel that has each thread producing one output matrix row. Fill in the execution configuration parameters for the design.

__global__ void matrixAddKernel(float *A, float *B, float *C, int matrixLen)
{
    int j;
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < matrixLen) {
        for (j = 0; j < matrixLen; j++)
            A[i*matrixLen + j] = B[i*matrixLen + j] + C[i*matrixLen + j];
    }
}

// Use 16 threads per block.
// (a + (b-1)) / b == ceil(a/b) in integer arithmetic
dim3 threads(16);
dim3 blocks( (matrixLen+15)/16 );

d) Write a kernel that has each thread producing one output matrix column. Fill in the execution configuration parameters for the design.

__global__ void matrixAddKernel(float *A, float *B, float *C, int matrixLen)
{
    int i;
    int j = threadIdx.x + blockIdx.x*blockDim.x;
    if (j < matrixLen) {
        for (i = 0; i < matrixLen; i++)
            A[i*matrixLen + j] = B[i*matrixLen + j] + C[i*matrixLen + j];
    }
}

// Use 16 threads per block.
// (a + (b-1)) / b == ceil(a/b) in integer arithmetic
dim3 threads(16);
dim3 blocks( (matrixLen+15)/16 );
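For completeness, a minimal host-side driver showing how the stub from part (a) might be invoked. This sketch is not part of the exercise: main, the forward declaration, the matrix size N, and the fill values are illustrative assumptions.

#include <stdlib.h>

void matrixAdd(float *h_A, float *h_B, float *h_C, int matrixLen);  // stub from part (a)

int main(void)
{
    const int N = 1024;                       // assumed matrix dimension
    size_t size = N * N * sizeof(float);
    float *h_A = (float*)malloc(size);        // output: A = B + C
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Fill the inputs with arbitrary values
    for (int k = 0; k < N * N; k++) { h_B[k] = 1.0f; h_C[k] = 2.0f; }

    matrixAdd(h_A, h_B, h_C, N);              // every element of h_A is now 3.0f

    free(h_A); free(h_B); free(h_C);
    return 0;
}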
e) Analyze the pros and cons of each preceding kernel design.

The kernel producing one output element per thread has the most parallelism. As will be described in a later chapter, the kernel computing one output column per thread will have better memory system performance than the kernel computing one output row per thread, because consecutive threads access consecutive elements of the same row (coalesced accesses), and it may even be comparable to or better than the one-element-per-thread kernel for datasets large enough to make the difference in the amount of parallelism irrelevant.

3.2. A matrix-vector multiplication takes an input matrix B and a vector C and produces one output vector A. Each element of the output vector A is the dot product of one row of the input matrix B and the vector C, that is, A[i] = ∑_j B[i][j] * C[j]. For simplicity, we will only handle square matrices whose elements are single-precision floating-point numbers. Write a matrix-vector multiplication kernel and the host stub function that can be called with four parameters: pointer to the output vector, pointer to the input matrix, pointer to the input vector, and the number of elements in each dimension.

__global__ void matrixVectorMulKernel(float *A, float *B, float *C, int vectorLen)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    float sum = 0.0f;
    if (i < vectorLen) {
        for (int j = 0; j < vectorLen; j++)
            sum += B[i*vectorLen + j] * C[j];
        A[i] = sum;
    }
}

void matrixVectorMul(float *h_A, float *h_B, float *h_C, int vectorLen)
{
    float *d_A, *d_B, *d_C;
    int matrixSize = vectorLen*vectorLen*sizeof(float);
    int vectorSize = vectorLen*sizeof(float);

    cudaMalloc((void**)&d_A, vectorSize);
    cudaMalloc((void**)&d_B, matrixSize);
    cudaMalloc((void**)&d_C, vectorSize);

    // Initialize GPU memory contents with 0 or host data
    cudaMemset(d_A, 0, vectorSize);
    cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, vectorSize, cudaMemcpyHostToDevice);

    // Execute the GPU kernel
    dim3 threads(128);
    dim3 blocks((vectorLen+(128-1))/128);
    matrixVectorMulKernel<<<blocks, threads>>>(d_A, d_B, d_C, vectorLen);

    // Copy results back to host
    cudaMemcpy(h_A, d_A, vectorSize, cudaMemcpyDeviceToHost);

    // Free GPU memory and return
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

3.3. A new summer intern was frustrated with CUDA. He has been complaining that CUDA is very tedious: he had to declare many functions that he plans to execute on both the host and the device twice, once as a host function and once as a device function. What is your response?

The CUDA language has the specifiers __host__ and __device__ for host and device functions, and they are not mutually exclusive. A function that needs to run on both the host and the device can be annotated with both specifiers, so a single definition can be called from either context (a minimal sketch follows exercise 3.4 below).

3.4. Complete Parts 1 and 2 of the function in Figure 3.5.

Part 1:

// Allocate device memory for A, B, and C
cudaMalloc((void**) &A_d, size);
cudaMalloc((void**) &B_d, size);
cudaMalloc((void**) &C_d, size);

// Copy A and B to device memory
cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

Part 2:

const unsigned int BLOCK_SIZE = 512;
const unsigned int numBlocks = (n - 1)/BLOCK_SIZE + 1;
dim3 gridDim(numBlocks, 1, 1), blockDim(BLOCK_SIZE, 1, 1);

vecAddKernel<<< gridDim, blockDim >>>(A_d, B_d, C_d, n);
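As mentioned in the answer to exercise 3.3, a single definition can carry both specifiers. A minimal sketch of the idea; the function scale and the kernel that calls it are illustrative assumptions, not code from the book:

// Compiled for both the host and the device, so it can be called from either.
__host__ __device__ float scale(float x, float factor)
{
    return x * factor;
}

__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale(data[i], factor);   // device-side call
}

// Host code can call the very same function directly:
//   float y = scale(3.0f, 2.0f);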
3.5. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to a data index?

(A) i = threadIdx.x + threadIdx.y
(B) i = blockIdx.x + threadIdx.x
(C) i = blockIdx.x*blockDim.x + threadIdx.x
(D) i = blockIdx.x*threadIdx.x

The answer is C. blockIdx.x*blockDim.x + threadIdx.x generates a “global index”, in which each thread of each block has a unique identifier ranging from 0 up to the total number of threads in the X dimension minus one.

3.6. We want to use each thread to calculate two (adjacent) elements of a vector addition. Assume that a variable i should be the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?

(A) i = blockIdx.x*blockDim.x + threadIdx.x + 2
(B) i = blockIdx.x*threadIdx.x*2
(C) i = (blockIdx.x*blockDim.x + threadIdx.x)*2
(D) i = blockIdx.x*threadIdx.x*2 + threadIdx.x

The answer is C. If each thread computes two adjacent elements, thread 0 operates on elements 0 and 1, thread 1 operates on elements 2 and 3, and so on. Therefore, each thread’s first element index is its own global index times two, i.e. i = (blockIdx.x*blockDim.x + threadIdx.x)*2 (a kernel using this mapping is sketched at the end of these solutions).

3.7. For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?

(A) 2000
(B) 2024
(C) 2048
(D) 2096

The answer is C. The 2000 elements require ceil(2000/512) = 4 blocks to compute them, so the total number of threads is 4*512 = 2048. The last 48 threads are created but have no work assigned to them.
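To make the index mapping from exercise 3.6 and the grid sizing from exercise 3.7 concrete, here is a sketch of a vector-addition kernel in which each thread handles two adjacent elements. The kernel name and the launch parameters are illustrative assumptions; the device pointers A_d, B_d, and C_d follow the naming of Figure 3.5.

// Each thread adds two adjacent elements; i is the index of its first element
// (the mapping from answer C of exercise 3.6).
__global__ void vecAddTwoPerThread(float *A, float *B, float *C, int n)
{
    int i = (blockIdx.x*blockDim.x + threadIdx.x)*2;
    if (i < n)     C[i]     = A[i]     + B[i];
    if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];
}

// Each thread covers two elements, so only ceil(n/2) threads are needed.
// For n = 2000 and 512-thread blocks that is ceil(1000/512) = 2 blocks (1024
// threads), whereas the one-element-per-thread kernel of exercise 3.7 needs
// ceil(2000/512) = 4 blocks, i.e. 2048 threads.
dim3 threads(512);
dim3 blocks( ((n + 1)/2 + 511)/512 );
vecAddTwoPerThread<<<blocks, threads>>>(A_d, B_d, C_d, n);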