Exercises and Solutions for
“Programming Massively Parallel Processors:
A Hands-on Approach”
Second Edition
© John A. Stratton and Izzat El Hajj, 2010-2013
Chapter 3
3.1. A matrix addition takes two input matrices B and C and produces one output matrix
A. Each element of the output matrix A is the sum of the corresponding elements of the
input matrices B and C, that is, A[i][j] = B[i][j] + C[i][j]. For simplicity, we will only
handle square matrices whose elements are single-precision floating-point numbers.
Write a matrix addition kernel and the host stub function that can be called with four
parameters: pointer to the output matrix, pointer to the first input matrix, pointer to the
second input matrix, and the number of elements in each dimension. Use the following
instructions:
a) Write the host stub function by allocating memory for the input and output matrices,
transferring the input data to the device, launching the kernel, transferring the output
data back to the host, and freeing the device memory for the input and output data.
Leave the execution configuration parameters open for this step.
void matrixAdd(float *h_A, float *h_B, float *h_C, int matrixLen) {
    float *d_A, *d_B, *d_C;
    int size = matrixLen*matrixLen*sizeof(float);

    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Initialize GPU memory contents with 0 or host data
    cudaMemset(d_A, 0, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, size, cudaMemcpyHostToDevice);

    // Execute the GPU kernel
    dim3 threads( /* left open for this step */ );
    dim3 blocks( /* left open for this step */ );
    matrixAddKernel<<<blocks, threads>>>(d_A, d_B, d_C, matrixLen);

    // Copy results back to host
    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);

    // Free GPU memory and return
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
b) Write a kernel that has each thread producing one output matrix element. Fill in the
execution configuration parameters for the design.
__global__ void
matrixAddKernel(float *A, float *B, float *C, int matrixLen) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = threadIdx.y + blockIdx.y*blockDim.y;
    if (i < matrixLen && j < matrixLen)
        A[i*matrixLen + j] = B[i*matrixLen + j] + C[i*matrixLen + j];
}

// Use 16x16 thread blocks.
// (a + (b-1)) / b == ceil(a/b) in integer arithmetic
dim3 threads(16, 16);
dim3 blocks( (matrixLen+15)/16, (matrixLen+15)/16 );
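For illustration only, a minimal host program that exercises the stub is sketched below. It
assumes the matrixAdd stub from part (a) is in the same file with the part (b) execution
configuration filled in, and the dimension N is chosen arbitrarily.

#include <cstdlib>

int main() {
    const int N = 1024;                           // matrix dimension (assumed for illustration)
    size_t bytes = (size_t)N * N * sizeof(float);
    float *h_A = (float*)malloc(bytes);           // output matrix
    float *h_B = (float*)malloc(bytes);           // first input matrix
    float *h_C = (float*)malloc(bytes);           // second input matrix
    for (int k = 0; k < N * N; k++) { h_B[k] = 1.0f; h_C[k] = 2.0f; }

    matrixAdd(h_A, h_B, h_C, N);                  // every element of h_A should now be 3.0f

    free(h_A); free(h_B); free(h_C);
    return 0;
}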
c) Write a kernel that has each thread producing one output matrix row. Fill in the
execution configuration parameters for the design.

__global__ void
matrixAddKernel(float *A, float *B, float *C, int matrixLen) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < matrixLen) {
        for (int j = 0; j < matrixLen; j++)
            A[i*matrixLen + j] = B[i*matrixLen + j] + C[i*matrixLen + j];
    }
}

// Use 16 threads per block.
// (a + (b-1)) / b == ceil(a/b) in integer arithmetic
dim3 threads(16);
dim3 blocks( (matrixLen+15)/16 );
d) Write a kernel that has each thread producing one output matrix column. Fill in the
execution configuration parameters for the design.

__global__ void
matrixAddKernel(float *A, float *B, float *C, int matrixLen) {
    int j = threadIdx.x + blockIdx.x*blockDim.x;
    if (j < matrixLen) {
        for (int i = 0; i < matrixLen; i++)
            A[i*matrixLen + j] = B[i*matrixLen + j] + C[i*matrixLen + j];
    }
}

// Use 16 threads per block.
// (a + (b-1)) / b == ceil(a/b) in integer arithmetic
dim3 threads(16);
dim3 blocks( (matrixLen+15)/16 );
e) Analyze the pros and cons of each preceding kernel design.

The kernel producing one output element per thread has the most parallelism. As will be
described in a later chapter, memory access patterns also matter: because the matrices are
stored in row-major order, the kernel computing one output column per thread has adjacent
threads accessing adjacent memory locations, so it will have better memory system performance
than the kernel computing one output row per thread, and may even be comparable to or better
than the first kernel for datasets large enough to make the difference in the amount of
parallelism irrelevant.
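To make the memory argument concrete, the comment-only sketch below (not part of the
original solutions) traces which addresses adjacent threads touch during one loop iteration,
assuming row-major storage with N = matrixLen.

// One-row-per-thread kernel (c): at loop iteration j, thread i reads B[i*N + j];
// adjacent threads i and i+1 touch addresses N floats apart -> poorly coalesced.
// One-column-per-thread kernel (d): at loop iteration i, thread j reads B[i*N + j];
// adjacent threads j and j+1 touch consecutive addresses -> well coalesced.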
3.2. A matrix-vector multiplication takes an input matrix B and a vector C and produces
one output vector A. Each element of the output vector A is the dot product of one row of
the input matrix B and C, that is, A[i] = ∑_j B[i][j] * C[j]. For simplicity, we will only
handle square matrices whose elements are single-precision floating-point numbers.
Write a matrix-vector multiplication kernel and the host stub function that can be called
with four parameters: pointer to the output vector, pointer to the input matrix, pointer to
the input vector, and the number of elements in each dimension.
__global__ void
matVecMulKernel(float *A, float *B, float *C, int vectorLen) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    float sum = 0.0f;
    if (i < vectorLen) {
        for (int j = 0; j < vectorLen; j++)
            sum += B[i*vectorLen + j] * C[j];
        A[i] = sum;
    }
}
void matrixVectorMul(float *h_A, float *h_B, float *h_C, int vectorLen) {
    float *d_A, *d_B, *d_C;
    int matrixSize = vectorLen*vectorLen*sizeof(float);
    int vectorSize = vectorLen*sizeof(float);

    cudaMalloc((void**)&d_A, vectorSize);
    cudaMalloc((void**)&d_B, matrixSize);
    cudaMalloc((void**)&d_C, vectorSize);

    // Initialize GPU memory contents with 0 or host data
    cudaMemset(d_A, 0, vectorSize);
    cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, vectorSize, cudaMemcpyHostToDevice);

    // Execute the GPU kernel
    dim3 threads(128);
    dim3 blocks((vectorLen+(128-1))/128);
    matVecMulKernel<<<blocks, threads>>>(d_A, d_B, d_C, vectorLen);

    // Copy results back to host
    cudaMemcpy(h_A, d_A, vectorSize, cudaMemcpyDeviceToHost);

    // Free GPU memory and return
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
3.3. A new summer intern was frustrated with CUDA. He has been complaining that
CUDA is very tedious: he had to declare many functions that he plans to execute on both
the host and the device twice, once as a host function and once as a device function. What
is your response?
The CUDA language has the specifiers __host__ and __device__ for host and device functions,
respectively, and the two are not mutually exclusive. A function that needs to run on both the
host and the device can be annotated with both specifiers, allowing a single definition to be
called from either context.
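A minimal sketch of this idea (the function and kernel names here are made up for illustration):

// One definition, callable from host code and from device code.
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clampKernel(float *data, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);   // device-side call
}

// On the host, the very same definition can be called directly:
//   float y = clampf(2.5f, 0.0f, 1.0f);   // y == 1.0f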
3.4. Complete Parts 1 and 2 of the function in Figure 3.5.
Part 1:
// Allocate device memory for A, B, and C
cudaMalloc((void**) &A_d, size);
cudaMalloc((void**) &B_d, size);
cudaMalloc((void**) &C_d, size);
// Copy A and B to device memory
cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
Part 2:
const unsigned int BLOCK_SIZE = 512;
const unsigned int numBlocks = (n - 1)/BLOCK_SIZE + 1;
dim3 gridDim(numBlocks, 1, 1), blockDim(BLOCK_SIZE, 1, 1);
vecAddKernel<<< gridDim, blockDim >>> (A_d, B_d, C_d, n);
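For reference, the kernel launched above (vecAddKernel from Figure 3.5) follows the usual
one-thread-per-element pattern. The sketch below is a typical version and may differ in minor
details from the figure in the text.

__global__ void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                     // guard against the extra threads in the last block
        C[i] = A[i] + B[i];
}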
3.5. If we need to use each thread to calculate one output element of a vector addition, what
would be the expression for mapping the thread/block indices to the data index?
(A) i = threadIdx.x + threadIdx.y
(B) i = blockIdx.x + threadIdx.x
(C) i = blockIdx.x*blockDim.x + threadIdx.x
(D) i = blockIdx.x*threadIdx.x
The answer is C. blockIdx.x*blockDim.x + threadIdx.x generates a “global index”: each thread
in the grid has a unique identifier ranging from 0 up to one less than the total number of
threads in the x dimension.
3.6. We want to use each thread to calculate two (adjacent) elements of a vector addition.
Assume that a variable i should be the index for the first element to be processed by a
thread. What would be the expression for mapping the thread/block indices to the data index?
(A) i = blockIdx.x*blockDim.x + threadIdx.x + 2
(B) i = blockIdx.x*threadIdx.x*2
(C) i = (blockIdx.x*blockDim.x + threadIdx.x)*2
(D) i = blockIdx.x*threadIdx.x*2 + threadIdx.x
The answer is C. If each thread computes on two elements, thread 0 will operate on elements 0
and 1, thread 1 will operate on elements 2 and 3, and so on. Therefore, each thread’s first
element is equal to its own index times two, or i = (blockIdx.x*blockDim.x + threadIdx.x)*2.
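A minimal kernel sketch (not part of the original solutions; the name vecAddTwoPerThread is
made up) showing this mapping in use; the separate guard on the second element matters when
n is odd:

__global__ void vecAddTwoPerThread(float *A, float *B, float *C, int n) {
    int i = (blockIdx.x*blockDim.x + threadIdx.x) * 2;   // first of the two adjacent elements
    if (i < n)     C[i]     = A[i]     + B[i];
    if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];
}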
3.7. For a vector addition, assume that the vector length is 2000, each thread calculates one
output element, and the thread block size is 512 threads. How many threads will be in the
grid?
(A) 2000
(B) 2024
(C) 2048
(D) 2096
The answer is C. The 2000 elements require ceil(2000/512) = 4 blocks to compute them. The
total number of threads will be 4*512 = 2048. The last 48 threads will be created, but will have
no work assigned to them.
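The same ceiling arithmetic written out as host code (illustrative only):

int n = 2000, blockSize = 512;
int numBlocks = (n + blockSize - 1) / blockSize;   // (2000 + 511) / 512 = 4
int totalThreads = numBlocks * blockSize;          // 4 * 512 = 2048, answer (C)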