CIS 565 Fall 2011
Qing Sun
 Memory Management
 Kernels
 Matrix multiplication
Managing Memory
 CPU and GPU have separate memory spaces
 Host (CPU) code manages device (GPU) memory
 Allocate/free
 Copy data to and from device
 Applies to global device memory (DRAM)
GPU Memory Allocation / Release
 cudaMalloc(void** pointer, size_t nbytes)
 cudaMemset(void* pointer, int value, size_t count)
 cudaFree(void* pointer)
int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc((void**) &d_a, nbytes);
cudaMemset(d_a, 0, nbytes);
cudaFree(d_a);
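Each of the calls above returns a cudaError_t that the slide example ignores. A minimal sketch of the same allocate / clear / free sequence with the status checked. The CHECK macro is a hypothetical helper written for this example, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): print the error and exit
// if a runtime call does not return cudaSuccess.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err));                    \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main()
{
    const int n = 1024;
    const size_t nbytes = n * sizeof(int);
    int *d_a = NULL;

    CHECK(cudaMalloc((void**)&d_a, nbytes));  // allocate n ints in global device memory
    CHECK(cudaMemset(d_a, 0, nbytes));        // write 0 into every byte of the allocation
    CHECK(cudaFree(d_a));                     // release the allocation

    printf("allocated, cleared and freed %zu bytes\n", nbytes);
    return 0;
}
```

Note that cudaMemset fills bytes, not ints, so it is mainly useful for zeroing a buffer.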
Data Copies
 cudaMemcpy(void* dst, const void* src, size_t nbytes,
cudaMemcpyKind direction);
 direction specifies locations (host or device) of src and dst
 Blocks CPU thread: returns after the copy is complete
 Doesn’t start copying until previous CUDA calls complete
 enum cudaMemcpyKind
 cudaMemcpyHostToDevice
 cudaMemcpyDeviceToHost
 cudaMemcpyDeviceToDevice
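A small round trip can make the three directions concrete. The sketch below copies a host array to the device, copies it between two device buffers, and reads it back. Buffer names and sizes are chosen only for illustration, and error checking is left out to keep it short:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 8;
    const size_t nbytes = n * sizeof(int);
    int h_in[n], h_out[n];
    for (int i = 0; i < n; i++) {
        h_in[i] = i * i;   // known input pattern
        h_out[i] = -1;     // sentinel, overwritten by the final copy
    }

    int *d_a = NULL, *d_b = NULL;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMalloc((void**)&d_b, nbytes);

    cudaMemcpy(d_a, h_in, nbytes, cudaMemcpyHostToDevice);    // host -> device
    cudaMemcpy(d_b, d_a, nbytes, cudaMemcpyDeviceToDevice);   // device -> device
    cudaMemcpy(h_out, d_b, nbytes, cudaMemcpyDeviceToHost);   // device -> host

    // The data should come back unchanged.
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != h_in[i]) mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_a);
    cudaFree(d_b);
    return mismatches != 0;
}
```

Note the argument order: destination first, then source, as in memcpy.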
Executing Code on the GPU
 Kernels are C functions with some restrictions
 Cannot access host memory
 Must have void return type
 No variable number of arguments
 Not recursive
 No static variables
 Function arguments automatically copied from host to device
Function Qualifiers
 Kernels designated by function qualifier:
 __global__
Function called from host and executed on device
Must return void
 Other CUDA function qualifiers
 __device__
Function called from device and run on device
Cannot be called from host code
 __host__
 Function called from host and executed on host (default)
 __host__ and __device__ qualifiers can be combined to
generate both CPU and GPU code
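A minimal sketch of the three qualifiers working together. The function names (square, square_plus_one, fill) are made up for this example, and the launch syntax it uses is the one described under Launching Kernels:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compiled for both CPU and GPU: callable from host code and from device code.
__host__ __device__ int square(int x) { return x * x; }

// Device-only helper: callable from kernels and other device functions.
__device__ int square_plus_one(int x) { return square(x) + 1; }

// Kernel: called from the host, runs on the device, returns void.
__global__ void fill(int *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = square_plus_one(i);
}

int main()
{
    const int n = 16;
    int h_out[n];
    int *d_out = NULL;

    cudaMalloc((void**)&d_out, n * sizeof(int));
    fill<<<1, n>>>(d_out, n);   // one block of n threads
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // square() runs on the CPU here, so the same source is checked on both sides.
    int errors = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != square(i) + 1) errors++;
    printf("errors: %d\n", errors);
    return errors != 0;
}
```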
Launching Kernels
 Modified C function call syntax:
 kernel<<<dim3 dG, dim3 dB>>> (…)
 Execution Configuration (“<<<>>>”)
 dG – dimension and size of grid in blocks:
Two-dimensional: x and y
Blocks launched in the grid: dG.x * dG.y
 dB – dimension and size of blocks in threads:
Three-dimensional: x, y and z
Threads per block: dB.x * dB.y * dB.z
 Unspecified dim3 fields initialize to 1
Execution Configuration Examples
dim3 grid, block;
grid.x = 2; grid.y = 4;
block.x = 8; block.y = 16;
kernel<<<grid, block>>> (…);
dim3 grid(2,4), block(8,16);
kernel<<<grid, block>>> (…);
kernel<<<32, 512>>> (…);
CUDA Built-in Device Variables
 All __global__ and __device__ functions have
access to these automatically defined variables
 dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
 dim3 blockDim;
Dimensions of the block in threads
 dim3 blockIdx;
Block index within the grid
 dim3 threadIdx;
Thread index within the block
Unique Thread IDs
 Built-in variables are used to determine unique thread IDs
 Map from local thread ID (threadIdx) to a global ID
which can be used as an array index
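The mapping above extends naturally to two dimensions. In the sketch below, each thread computes a global (row, col) pair from the built-in variables and stores its own linear ID at that position. The kernel name, the array size, and the 8 x 4 block shape are assumptions picked for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes its global linear ID into the element it owns.
__global__ void write_ids(int *d_out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y index
    if (col < width && row < height)                   // skip threads past the edges
        d_out[row * width + col] = row * width + col;
}

int main()
{
    const int width = 20, height = 12;   // deliberately not multiples of the block shape
    const int n = width * height;
    int h_out[n];
    int *d_out = NULL;

    cudaMalloc((void**)&d_out, n * sizeof(int));

    dim3 block(8, 4);
    dim3 grid((width + block.x - 1) / block.x,     // round up so the grid covers every column
              (height + block.y - 1) / block.y);   // and every row
    write_ids<<<grid, block>>>(d_out, width, height);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Every element should hold its own index if the IDs are unique and complete.
    int errors = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != i) errors++;
    printf("errors: %d\n", errors);
    return errors != 0;
}
```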
Increment Array Example
 CPU program
void inc_cpu(int *a, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}
int main()
{
    // setup of a and N not shown
    inc_cpu(a, N);
    return 0;
}
 CUDA program
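__global__ void inc_gpu(int *d_a, int N)
{
    // Global index of this thread across the whole 1D grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)   // the last block may have more threads than remaining elements
        d_a[idx] = d_a[idx] + 1;
}
int main()
{
    // setup of d_a, N and blocksize not shown
    dim3 dimBlock(blocksize);
    dim3 dimGrid((N + blocksize - 1) / blocksize);   // same as ceil(N / (float)blocksize)
    inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);
    return 0;
}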
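The slide leaves out the host-side setup. A complete version might look like the following sketch, where N = 1000 and blocksize = 256 are arbitrary values chosen so that the last block is only partly used:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void inc_gpu(int *d_a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        d_a[idx] = d_a[idx] + 1;
}

int main()
{
    const int N = 1000;          // assumed size, not a multiple of blocksize
    const int blocksize = 256;   // assumed threads per block
    const size_t nbytes = N * sizeof(int);

    // Host input: a[i] = i
    int *h_a = (int*)malloc(nbytes);
    for (int i = 0; i < N; i++) h_a[i] = i;

    int *d_a = NULL;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize);
    dim3 dimGrid((N + blocksize - 1) / blocksize);   // 4 blocks = 1024 threads for 1000 elements
    inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < N; i++)
        if (h_a[i] != i + 1) errors++;
    printf("errors: %d\n", errors);

    cudaFree(d_a);
    free(h_a);
    return errors != 0;
}
```

Without the `idx < N` guard, the 24 extra threads in the last block would write past the end of the array.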
Host Synchronization
 All kernel launches are asynchronous
 Control returns to CPU immediately
 Kernel executes after all previous CUDA calls have completed
 cudaMemcpy() is synchronous
 Copy starts after all previous CUDA calls have completed
 Control returns to CPU after copy completes
 cudaThreadSynchronize() (deprecated since CUDA 4.0 in favor of cudaDeviceSynchronize())
 Blocks until all previous CUDA calls complete
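Because a launch returns before the kernel has run, errors can surface at two different points. The sketch below checks both. It uses cudaDeviceSynchronize(), the newer name for the same blocking call, and the scale kernel is a made-up example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Made-up example kernel: multiply every element by s.
__global__ void scale(float *d_x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_x[i] *= s;
}

int main()
{
    const int n = 1 << 20;
    float *d_x = NULL;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    // Control returns to the CPU right after the launch is queued.
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);

    // Errors in the launch itself (for example an invalid configuration).
    cudaError_t launchErr = cudaGetLastError();

    // Block until the kernel has finished; reports errors raised during execution.
    cudaError_t syncErr = cudaDeviceSynchronize();

    printf("launch:    %s\n", cudaGetErrorString(launchErr));
    printf("execution: %s\n", cudaGetErrorString(syncErr));

    cudaFree(d_x);
    return (launchErr != cudaSuccess || syncErr != cudaSuccess);
}
```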
Device Synchronization
 void __syncthreads();
 Synchronizes all threads in a block
 Generates barrier synchronization instruction
 No thread can pass this barrier until all threads in the block reach it
 Used to avoid RAW/WAR/WAW hazards when accessing shared memory
 Allowed in conditional code only if the conditional is uniform
across the entire thread block
idx = blockDim.x * blockIdx.x + threadIdx.x;
if (blockIdx.x == blockToReverse) {
    sharedData[blockDim.x - (threadIdx.x + 1)] = a[idx];
    __syncthreads();   // every write to sharedData must finish before any read
    a[idx] = sharedData[threadIdx.x];
}
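The fragment above can be wrapped into a complete program. This is a sketch under stated assumptions: the kernel name, BLOCK_SIZE, and the host-side test data are invented for the example, and the array length is an exact multiple of the block size so no bounds check is needed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64   // assumed block size; must match the launch configuration

// Reverses, in place, the chunk of a[] that belongs to one chosen block.
__global__ void reverse_one_block(int *a, int blockToReverse)
{
    __shared__ int sharedData[BLOCK_SIZE];
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    // blockIdx.x is identical for all threads of a block, so the branch is
    // uniform and __syncthreads() inside it is allowed.
    if (blockIdx.x == blockToReverse) {
        sharedData[blockDim.x - (threadIdx.x + 1)] = a[idx];
        __syncthreads();   // all writes to sharedData finish before any read
        a[idx] = sharedData[threadIdx.x];
    }
}

int main()
{
    const int numBlocks = 4, n = numBlocks * BLOCK_SIZE, target = 2;
    int h_a[n];
    for (int i = 0; i < n; i++) h_a[i] = i;

    int *d_a = NULL;
    cudaMalloc((void**)&d_a, n * sizeof(int));
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    reverse_one_block<<<numBlocks, BLOCK_SIZE>>>(d_a, target);
    cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    // Only the target block's chunk should be reversed; the rest is untouched.
    int errors = 0;
    for (int i = 0; i < n; i++) {
        int block = i / BLOCK_SIZE, t = i % BLOCK_SIZE;
        int expected = (block == target)
                     ? target * BLOCK_SIZE + (BLOCK_SIZE - 1 - t)
                     : i;
        if (h_a[i] != expected) errors++;
    }
    printf("errors: %d\n", errors);
    return errors != 0;
}
```

A condition such as `threadIdx.x < 32` would not be uniform across the block, and placing __syncthreads() under it could hang the kernel.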
Matrix Multiplication
 A simple matrix multiplication example that illustrates
the basic features of memory and thread management
in CUDA programs
 Local, register usage
 Thread ID usage
 Memory data transfer API between host and device
 Leave shared memory usage until later
Matrix Multiplication
 Each matrix is WIDTH * WIDTH
 Data parallel
CPU Implementation
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < width; j++) {
            float sum = 0;
            for (int k = 0; k < width; k++) {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
    }
}
CUDA Skeleton
int main(void) {
    // 1. Allocate and initialize the matrices M, N, P
    //    I/O to read the input matrices M and N
    // 2. M * N on the device
    MatrixMulOnDevice(M, N, P, WIDTH);
    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    return 0;
}
Step1: Data Transfer
void MatrixMulOnDevice(float* M, float* N, float* P,
                       int width) {
    int size = width * width * sizeof (float);
    float *d_M, *d_N, *d_P;
    // 1. Load M and N to device memory
    cudaMalloc ((void**) &d_M, size);
    cudaMemcpy (d_M, M, size, cudaMemcpyHostToDevice);
    cudaMalloc ((void**) &d_N, size);
    cudaMemcpy (d_N, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc ((void**) &d_P, size);
    // Kernel invocation code
    // Read P from the device
    cudaMemcpy (P, d_P, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree (d_M); cudaFree (d_N); cudaFree (d_P);
}
Step2: Implement Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N,
                                float* d_P, int width) {
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Pvalue stores the d_P element that is computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < width; k++) {
        float a = d_M[ty * width + k];
        float b = d_N[k * width + tx];
        Pvalue += a * b;
    }
    // Write the matrix to device memory; each thread writes one element
    d_P[ty * width + tx] = Pvalue;
}
Step3: Invoke Kernel
void MatrixMulOnDevice(float* M, float* N, float* P,
                       int width) {
    int size = width * width * sizeof (float);
    float *d_M, *d_N, *d_P;
    cudaMalloc ((void**) &d_M, size);
    cudaMemcpy (d_M, M, size, cudaMemcpyHostToDevice);
    cudaMalloc ((void**) &d_N, size);
    cudaMemcpy (d_N, N, size, cudaMemcpyHostToDevice);
    cudaMalloc ((void**) &d_P, size);
    // Set up the execution configuration: one block of width x width threads
    dim3 dimGrid(1, 1);
    dim3 dimBlock(width, width);
    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, width);
    cudaMemcpy (P, d_P, size, cudaMemcpyDeviceToHost);
    cudaFree (d_M); cudaFree (d_N); cudaFree (d_P);
}
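Putting the three steps together, one possible end-to-end program is sketched below. The WIDTH value, the test data, and the comparison against the CPU version are assumptions added for illustration. The kernel follows the single-block scheme of the slides, so WIDTH * WIDTH must not exceed the per-block thread limit:

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define WIDTH 16   // assumed size: 16 * 16 = 256 threads in the single block

__global__ void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int width)
{
    int tx = threadIdx.x, ty = threadIdx.y;   // column and row of the output element
    float Pvalue = 0;
    for (int k = 0; k < width; k++)
        Pvalue += d_M[ty * width + k] * d_N[k * width + tx];
    d_P[ty * width + tx] = Pvalue;
}

// CPU reference used only to check the GPU result.
void MatrixMulOnHost(const float *M, const float *N, float *P, int width)
{
    for (int i = 0; i < width; i++)
        for (int j = 0; j < width; j++) {
            float sum = 0;
            for (int k = 0; k < width; k++)
                sum += M[i * width + k] * N[k * width + j];
            P[i * width + j] = sum;
        }
}

void MatrixMulOnDevice(const float *M, const float *N, float *P, int width)
{
    size_t size = (size_t)width * width * sizeof(float);
    float *d_M = NULL, *d_N = NULL, *d_P = NULL;

    cudaMalloc((void**)&d_M, size);
    cudaMalloc((void**)&d_N, size);
    cudaMalloc((void**)&d_P, size);
    cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);

    dim3 dimGrid(1, 1);            // a single block
    dim3 dimBlock(width, width);   // one thread per element of P
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, width);

    cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);   // waits for the kernel
    cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
}

int main(void)
{
    static float M[WIDTH * WIDTH], N[WIDTH * WIDTH];
    static float P[WIDTH * WIDTH], ref[WIDTH * WIDTH];
    for (int i = 0; i < WIDTH * WIDTH; i++) {
        M[i] = (float)(i % 7);   // small integer test values
        N[i] = (float)(i % 5);
    }

    MatrixMulOnDevice(M, N, P, WIDTH);
    MatrixMulOnHost(M, N, ref, WIDTH);

    float maxDiff = 0;
    for (int i = 0; i < WIDTH * WIDTH; i++)
        maxDiff = fmaxf(maxDiff, fabsf(P[i] - ref[i]));
    printf("max abs difference vs CPU: %g\n", maxDiff);
    return maxDiff > 1e-3f;
}
```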
Simple Matrix Multiplication
[Figure: Grid 1 containing Block 1, with thread (2, 2) labeled]
 One block of threads computes matrix d_P
 Each thread computes one element of d_P
 Each thread
 Loads a row of matrix d_M
 Loads a column of matrix d_N
 Performs one multiplication and one addition for each pair of d_M and d_N elements
 Compute to off-chip memory access ratio is close to 1:1 (not very high)
 Size of matrix is limited by the number of threads allowed in a thread block
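The size limit can be read from the device at run time. The sketch below queries device 0 (an assumption; a machine may have several devices) and derives the largest WIDTH for which a WIDTH x WIDTH block is legal:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Largest w such that w * w threads still fit in one block.
    int maxWidth = 0;
    while ((maxWidth + 1) * (maxWidth + 1) <= prop.maxThreadsPerBlock)
        maxWidth++;

    // The block must also respect the per-dimension limits in x and y.
    if (maxWidth > prop.maxThreadsDim[0]) maxWidth = prop.maxThreadsDim[0];
    if (maxWidth > prop.maxThreadsDim[1]) maxWidth = prop.maxThreadsDim[1];

    printf("%s: up to %d threads per block, so WIDTH <= %d with a single block\n",
           prop.name, prop.maxThreadsPerBlock, maxWidth);
    return 0;
}
```

The 1:1 ratio follows from the kernel's inner loop: each iteration issues two global memory loads (one from d_M, one from d_N) and performs two floating-point operations (one multiply, one add).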