SC12 The International Conference for High Performance Computing, Networking, Storage and
Analysis, Salt Lake City, Utah, USA
Workshop 119: An Educator's Toolbox for CUDA
Performance of CUDA Programs
Vector Addition and Matrix Addition/Multiplication
B. Wilkinson, Nov 9, 2012
Preliminaries
The purpose of this session is to explore the performance of vector/matrix addition and multiplication.
A directory called Matrix is provided containing the files you need for this session. Refer to separate
notes about access to the remote GPU servers, and using Linux commands and editors.
Timing Execution. The programs here use CUDA events to time program execution.
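The general pattern of the event-based timing (a minimal sketch; the same calls appear in the full listings below) is:

    cudaEvent_t start, stop;                      // event handles
    float elapsed_time_ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);                    // record start time on stream 0
    /* ... work to be timed, e.g. a kernel launch ... */
    cudaEventRecord(stop, 0);                     // record stop time
    cudaEventSynchronize(stop);                   // wait until the stop event has occurred
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);   // elapsed time in ms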
Provided files. Each provided guest account has the following:

Directory    Contents          Description
Matrix       VectorAdd.cu      Vector addition CUDA program
             MatrixAdd.cu      Matrix addition CUDA program
             MatrixMult.cu     Matrix multiplication CUDA program
             Coalescing.cu     Memory coalescing CUDA program
             Makefile          Makefile that includes rules for compiling each program
Vector addition uses one-dimensional structures whereas the matrix operations use two-dimensional
structures. The programs prompt for keyboard input so that the user can alter the number of threads
in a GPU block and the number of GPU blocks used in the computation, within the limits of the GPU.
The matrix programs also allow the matrix size to be altered from the keyboard, and they perform the
computation on the host CPU as well, for comparison.
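The block and grid limits of the assigned GPU can be queried at run time with cudaGetDeviceProperties
(a stand-alone sketch, not part of the provided files):

    #include <stdio.h>
    #include <cuda.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);        // properties of device 0
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max block dims: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dims:  %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }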
Compiling. A simple makefile is provided to compile the programs individually. The listing of the
makefile is given below:
NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm
VectorAdd: VectorAdd.cu
$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o VectorAdd VectorAdd.cu
MatrixAdd: MatrixAdd.cu
$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o MatrixAdd MatrixAdd.cu
MatrixMult: MatrixMult.cu
$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o MatrixMult MatrixMult.cu
Coalescing: Coalescing.cu
$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o Coalescing Coalescing.cu
Type make VectorAdd to compile the program VectorAdd.cu and similarly for other programs.
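For example, make VectorAdd runs the VectorAdd rule above, which expands to:

    /usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda -lcudart -lm -o VectorAdd VectorAdd.cu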
Part 1 Vector Addition
A prewritten CUDA program, VectorAdd.cu, is provided that adds two vectors (one-dimensional
arrays), instrumented with CUDA events to measure the time of execution. In this task, you will
compile VectorAdd.cu. The code is given below:
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

#define N 4096                                    // size of array

__global__ void vectorAdd(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;                            // threads per block and blocks per grid
    int a[N], b[N], c[N];                         // vectors, statically declared
    int *dev_a, *dev_b, *dev_c;

    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d", &T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d", &B);
        if (T * B < N) printf("Error T x B < N, try again\n");
    } while (T * B < N);

    cudaEvent_t start, stop;                      // using cuda events to measure time
    float elapsed_time_ms;

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {                 // load arrays with some numbers
        a[i] = i;
        b[i] = i * 1;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, N * sizeof(int), cudaMemcpyHostToDevice);

    cudaEventCreate(&start);                      // instrument code to measure start time
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    vectorAdd<<<B, T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);                     // instrument code to measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    for (int i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }
    printf("Time to calculate results: %f ms.\n", elapsed_time_ms);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
Compiling. Type make VectorAdd to compile the program VectorAdd.cu.
Executing Program. Type ./VectorAdd to execute the compiled program. (Note you have to specify the
current directory, ./) The program will prompt you for the number of threads per block and the number
of blocks per grid. The total number of threads in the grid must be equal to or greater than the number
of elements in the arrays (4096) or you will be prompted to try again.
Experiment with different CUDA grid/block structures. Experiment with different numbers of
blocks and threads/block at the keyboard. Using a different array size requires you to alter N in
VectorAdd.cu and recompile; modify the code to use a different value for N and recompile.1
To accommodate command-line input for N, the arrays must be declared dynamically, as is done in the
next program to add matrices; a minimal sketch of the change is shown below. Also, each experiment
with different values requires you to re-execute the program, re-launching the kernel each time; the
next program includes code that allows you to experiment with different values without leaving the
program. Finally, we are interested in how much faster the GPU code executes compared to doing the
computation on the host CPU, so the next code also includes the computation done on the host for
comparison purposes. The results from the GPU and from the CPU alone are compared to ensure both
produce the same answers (assumed then correct!).
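A minimal sketch of the dynamic-allocation change (illustrative only, assuming N is taken from the
command line; this code is not in the provided files):

    int N = 4096;                                 // default size of array
    if (argc > 1) N = atoi(argv[1]);              // optionally take N from the command line
    int *a = (int*) malloc(N * sizeof(int));      // vectors now allocated dynamically
    int *b = (int*) malloc(N * sizeof(int));
    int *c = (int*) malloc(N * sizeof(int));

(atoi and malloc come from stdlib.h, which VectorAdd.cu already includes; the corresponding free()
calls would go at the end of main.)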
1 Note compute capability 2.x has a limit of 1024 threads per block and compute capability 1.x has a limit of 512 threads
per block.
Part 2 Matrix Addition
In this task, you will compile and execute a simple prewritten CUDA program that adds two matrices,
called MatrixAdd.cu. The code is given below:
// Matrix addition program MatrixAdd.cu
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
__global__ void gpu_matrixadd(int *a, int *b, int *c, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int index = row * N + col;
    if (col < N && row < N)
        c[index] = a[index] + b[index];
}
void cpu_matrixadd(int *a, int *b, int *c, int N) {
    int index;
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++) {
            index = row * N + col;
            c[index] = a[index] + b[index];
        }
}
int main(int argc, char *argv[])
{
    char key;
    int i, j;                                     // loop counters
    int Grid_Dim = 1;                             // Grid dimension, x and y, square
    int Block_Dim = 1;                            // Block dimension, x and y, square
    int N = 10;                                   // size of array in each dimension
    int *a, *b, *c, *d;
    int *dev_a, *dev_b, *dev_c;
    int size;                                     // number of bytes in arrays

    cudaEvent_t start, stop;                      // using cuda events to measure time,
    float elapsed_time_ms;                        // which is applicable for asynchronous code also
/* --------------------ENTER INPUT PARAMETERS AND DATA -----------------------*/
    do {                                          // loop to repeat complete program

        printf("Device characteristics -- some limitations (compute capability 2.x)\n");
        printf("  Maximum size of x- and y- dimension of block if square = 32\n");
        printf("  Maximum size of each dimension of grid = 65535\n");

        printf("Enter size of array in one dimension (square array), currently %d\n", N);
        scanf("%d", &N);

        do {
            printf("\nEnter number of threads per block in x/y dimensions, currently %d: ", Block_Dim);
            scanf("%d", &Block_Dim);
            printf("\nEnter number of blocks per grid in x/y dimensions, currently %d: ", Grid_Dim);
            scanf("%d", &Grid_Dim);
            if (Block_Dim > 32) printf("Error, too many threads in each dimension of block, try again\n");
            if ((Grid_Dim * Block_Dim) < N) printf("Error, number of threads in x/y dimensions less than number of array elements, try again\n");
            else printf("Number of threads not used = %d\n", ((Grid_Dim * Block_Dim) - N) * ((Grid_Dim * Block_Dim) - N));
        } while ((Block_Dim > 32) || ((Grid_Dim * Block_Dim) < N));

        dim3 Grid(Grid_Dim, Grid_Dim);            // Grid structure
        dim3 Block(Block_Dim, Block_Dim);         // Block structure

        size = N * N * sizeof(int);               // number of bytes in total in arrays
        a = (int*) malloc(size);                  // dynamically allocated memory for arrays on host
        b = (int*) malloc(size);
        c = (int*) malloc(size);                  // results from GPU
        d = (int*) malloc(size);                  // results from CPU

        for (i = 0; i < N; i++)                   // load arrays with some numbers
            for (j = 0; j < N; j++) {
                a[i * N + j] = i;
                b[i * N + j] = i;
            }
/* ------------- COMPUTATION DONE ON GPU ----------------------------*/
        cudaMalloc((void**)&dev_a, size);         // allocate memory on device
        cudaMalloc((void**)&dev_b, size);
        cudaMalloc((void**)&dev_c, size);

        cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

        cudaEventCreate(&start);                  // instrument code to measure start time
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cudaEventSynchronize(start);              // not needed

        gpu_matrixadd<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);

        cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

        cudaEventRecord(stop, 0);                 // instrument code to measure end time
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsed_time_ms, start, stop);
        printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);   // exec. time
/* ------------- COMPUTATION DONE ON HOST CPU ----------------------------*/
        cudaEventRecord(start, 0);                // use same timing
        cudaEventSynchronize(start);              // not needed

        cpu_matrixadd(a, b, d, N);                // do calculation on host

        cudaEventRecord(stop, 0);                 // instrument code to measure end time
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsed_time_ms, start, stop);
        printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);   // exec. time
/* ------------------- check device creates correct results -----------------*/
        for (i = 0; i < N*N; i++) {
            if (c[i] != d[i]) {
                printf("ERROR in results, CPU and GPU create different answers\n");
                break;                            // stop at first difference
            }
        }
printf("\nEnter c to repeat, return to terminate\n");
scanf("%c",&key);
scanf("%c",&key);
} while (key == 'c'); // loop of complete program
/* --------------  clean up  ---------------------------------------*/
    free(a);
    free(b);
    free(c);
    free(d);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
return 0;
}
Compiling and executing. Compile the program with the makefile (make MatrixAdd) and execute the
program (./MatrixAdd). Confirm the program functions correctly.
Experiment with different CUDA grid/block structures. Experiment with different numbers of
blocks and threads/block. Try larger array sizes. Are there any differences in execution time from first
kernel launch to subsequent ones with the same input parameters?
Part 3 Matrix Multiplication
Modify the vector addition code to multiply two matrices and save as MatrixMult.cu. The sequential
version to multiply two matrices is given below:
void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    int row, col, k, sum;
    for (row = 0; row < N; row++)                 // row of a
        for (col = 0; col < N; col++) {           // column of b
            sum = 0;
            for (k = 0; k < N; k++)
                sum += cpu_a[row * N + k] * cpu_b[k * N + col];
            cpu_c[row * N + col] = sum;
        }
}
The matrix multiply code is actually given as part of the memory coalescing experiment in the final
section.
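If you want to check the structure of your kernel, one possible GPU version is sketched below (an
illustrative sketch following the pattern of gpu_matrixadd from Part 2, not necessarily the listing
referred to above):

    __global__ void gpu_matrixmult(int *a, int *b, int *c, int N) {
        int col = threadIdx.x + blockDim.x * blockIdx.x;   // column of c computed by this thread
        int row = threadIdx.y + blockDim.y * blockIdx.y;   // row of c computed by this thread
        if (col < N && row < N) {
            int sum = 0;
            for (int k = 0; k < N; k++)                    // dot product of row of a, column of b
                sum += a[row * N + k] * b[k * N + col];
            c[row * N + col] = sum;
        }
    }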
Compiling and executing the code. Use the makefile to compile the program (make MatrixMult; the
provided makefile already includes this rule). Execute the program and experiment.
Part 4 Memory Coalescing
A program is provided, Coalescing.cu, to explore different memory access patterns. Each GPU thread
in the program simply loads its thread ID into a location in a two dimensional array in global memory
so that we can identify which thread accessed each location. The actions are repeated T times and the
time recorded. The number of iterations is specified through keyboard input as is the number of grid
blocks. The size of the blocks is fixed as 16 x 16, the maximum for compute capability 1.x.2
The initial kernel for Coalescing.cu is given below:
__global__ void gpu_Comput(int *h, int N, int T) {
    // Array loaded with global thread ID that accesses that location
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int threadID = col + row * N;                 // flattened global thread ID
    int index = row + col * N;                    // sequentially down each row

    for (int t = 0; t < T; t++)                   // loop to repeat, to reduce other time effects
        h[index] = threadID;                      // load array with flattened global thread ID
}
The main program calls this kernel once, copies the generated array back to the host, and prints out the
contents of the array (every N/8 numbers). A sample output from the program is given below:
Figure 1 Output for Coalescing.cu
2 The maximum for compute capability 2.x (coit-grid06.uncc.edu and coit-grid07.uncc.edu) is 32 x 32 (1024 threads).
The complete program listing of Coalescing.cu is given below:
// To measure effects of memory coalescing. Coalescing.cu. B. Wilkinson Jan 30, 2011
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda.h>

#define BlockSize 16                              // Size of blocks, 16 x 16 threads, fixed
__global__ void gpu_Comput(int *h, int N, int T) {
    // Array loaded with global thread ID that accesses that location
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int threadID = col + row * N;                 // flattened global thread ID
    int index = row + col * N;                    // sequentially down each row

    for (int t = 0; t < T; t++)                   // loop to repeat, to reduce other time effects
        h[index] = threadID;                      // load array with flattened global thread ID
}
void printArray(int *h, int N) {
    printf("Results of computation, every N/8 numbers, eight numbers\n");
    for (int row = 0; row < N; row += N/8) {
        for (int col = 0; col < N; col += N/8)
            printf("%6d ", h[col + row * N]);
        printf("\n");
    }
}
int main(int argc, char *argv[])
{
    int T = 100;                                  // number of iterations, entered at keyboard
    int B = 1;                                    // number of blocks, entered at keyboard
    char key;
    int *h, *dev_h;                               // ptr to array holding numbers on host and device

    cudaEvent_t start, stop;                      // cuda events to measure time
    float elapsed_time_ms1;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
/* ------------------------- Keyboard input -----------------------------------*/
    do {                                          // loop to repeat complete program
        printf("Grid Structure 2-D grid, 2-D blocks\n");
        printf("Blocks fixed at 16 x 16 threads, 256 threads (max 512 for compute cap. 1.x)\n");
        printf("Enter number of blocks in grid, each dimension, currently %d\n", B);
        scanf("%d", &B);
        printf("Enter number of iterations, currently %d\n", T);
        scanf("%d", &T);

        int N = B * BlockSize;                    // size of data array, given input data
        printf("Array size (and total grid-block size) %d x %d\n", N, N);

        dim3 Block(BlockSize, BlockSize);         // Block structure, 32 x 32 max
        dim3 Grid(B, B);                          // Grid structure, B x B
/* ------------------------- Allocate Memory-----------------------------------*/
        int size = N * N * sizeof(int);           // number of bytes in total in array
        h = (int*) malloc(size);                  // array on host
        cudaMalloc((void**)&dev_h, size);         // allocate device memory
/* ------------------------- GPU Computation -----------------------------------*/
        cudaEventRecord(start, 0);                // instrument code to measure start time
        gpu_Comput<<<Grid, Block>>>(dev_h, N, T);
        cudaEventRecord(stop, 0);                 // instrument code to measure end time
        cudaEventSynchronize(stop);               // wait for all work done by threads
        cudaEventElapsedTime(&elapsed_time_ms1, start, stop);

        cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);   // get results to check
        printArray(h, N);
        printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

/* ---------- INSERT CALL TO SECOND KERNEL HERE -------------------------------------*/
/* -------------------------REPEAT PROGRAM INPUT-----------------------------------*/
        printf("\nEnter c to repeat, return to terminate\n");
        scanf("%c", &key);                        // consume newline left by previous scanf
        scanf("%c", &key);                        // read the actual key
    } while (key == 'c');                         // loop of complete program
/* --------------  clean up  ---------------------------------------*/
    free(h);
    cudaFree(dev_h);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
Compile and execute Coalescing.cu. Compile and execute Coalescing.cu on the assigned GPU
server and confirm the program functions correctly for different input values. (Note: The program does
not have error checking on input.)
Effects of memory coalescing. The kernel could have loaded the array sequentially across columns
rather than down rows; here we will explore what happens if we do that. Create a copy of
Coalescing.cu called Coalescing_sol.cu (cp Coalescing.cu Coalescing_sol.cu).
Write a second kernel, similar to the existing kernel, that makes the memory accesses with
threadID sequentially along columns rather than down rows, and insert this kernel code into
Coalescing_sol.cu (one possible shape is sketched after Figure 2). Call your kernel from the main
program before the section calling the original kernel, and include a call to print out the contents
of the array. Compute and display the ratio of the original kernel execution time to your kernel
execution time (the speedup of using your kernel), to produce output such as shown in Figure 2.
Modify the makefile to compile the code, and compile. Investigate what speedup, if any, is achieved
with your kernel, and why.
Figure 2 Sample output for Coalescing.cu with two kernels and speedup calculation
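One possible shape for the second kernel (an illustrative sketch, not the required solution; the
name gpu_Comput2 is made up) changes only the index calculation so that adjacent threads in a warp
(adjacent col values) write to adjacent memory locations:

    __global__ void gpu_Comput2(int *h, int N, int T) {
        int col = threadIdx.x + blockDim.x * blockIdx.x;
        int row = threadIdx.y + blockDim.y * blockIdx.y;
        int threadID = col + row * N;             // flattened global thread ID
        int index = col + row * N;                // adjacent threads access adjacent locations

        for (int t = 0; t < T; t++)
            h[index] = threadID;
    }

The speedup can then be computed in main(), assuming your kernel is timed into a second variable
(elapsed_time_ms2 is an illustrative name), for example:

    printf("Speedup = %f\n", elapsed_time_ms1 / elapsed_time_ms2);

A makefile rule for Coalescing_sol.cu would mirror the existing rules, with Coalescing_sol in place
of Coalescing.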
Using shared memory. Using shared memory requires substantial code modifications. For those who
have reached this point, make these modifications and experiment; a sketch of one possible direction
is given below.
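As one illustrative direction (a sketch under stated assumptions, not the required solution), a
tiled matrix multiplication kernel stages TILE x TILE sub-blocks of a and b in __shared__ memory so
that each global-memory value is read once per tile rather than once per thread. It assumes N is a
multiple of TILE and that the kernel is launched with TILE x TILE blocks on an (N/TILE) x (N/TILE)
grid:

    #define TILE 16                               // tile width, matching the 16 x 16 block size

    __global__ void gpu_matrixmult_shared(int *a, int *b, int *c, int N) {
        __shared__ int as[TILE][TILE];            // tile of a staged in shared memory
        __shared__ int bs[TILE][TILE];            // tile of b staged in shared memory
        int col = threadIdx.x + TILE * blockIdx.x;
        int row = threadIdx.y + TILE * blockIdx.y;
        int sum = 0;
        for (int m = 0; m < N / TILE; m++) {      // march tiles across a and down b
            as[threadIdx.y][threadIdx.x] = a[row * N + (m * TILE + threadIdx.x)];
            bs[threadIdx.y][threadIdx.x] = b[(m * TILE + threadIdx.y) * N + col];
            __syncthreads();                      // wait until both tiles are loaded
            for (int k = 0; k < TILE; k++)        // partial dot product from this tile
                sum += as[threadIdx.y][k] * bs[k][threadIdx.x];
            __syncthreads();                      // wait before tiles are overwritten
        }
        c[row * N + col] = sum;
    }

Compare its execution time against your un-tiled kernel for larger N to see the effect.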