CIS 565, Fall 2011
Qing Sun
sunqing@seas.upenn.edu

Outline
- Memory Management
- Kernels
- Matrix Multiplication

Managing Memory
- CPU and GPU have separate memory spaces
- Host (CPU) code manages device (GPU) memory
  - Allocate / free
  - Copy data to and from the device
- Applies to global device memory (DRAM)

GPU Memory Allocation / Release
- cudaMalloc(void** pointer, size_t nbytes)
- cudaMemset(void* pointer, int value, size_t count)
- cudaFree(void* pointer)

    int n = 1024;
    int nbytes = n * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);

Data Copies
- cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);
  - direction specifies the locations (host or device) of src and dst
  - Blocks the CPU thread: returns after the copy is complete
  - Doesn't start copying until previous CUDA calls complete
- enum cudaMemcpyKind
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice

Executing Code on the GPU
- Kernels are C functions with some restrictions
  - Cannot access host memory
  - Must have void return type
  - No variable number of arguments
  - Not recursive
  - No static variables
- Function arguments are automatically copied from host to device

Function Qualifiers
- Kernels are designated by the function qualifier __global__
  - Function is called from the host and executed on the device
  - Must return void
- Other CUDA function qualifiers
  - __device__: called from the device and run on the device; cannot be called from host code
  - __host__: called from the host and executed on the host (the default)
- __host__ and __device__ qualifiers can be combined to generate both CPU and GPU code

Launching Kernels
- Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(...)
- Execution configuration ("<<< >>>")
  - dG: dimension and size of the grid in blocks
    - Two-dimensional: x and y
    - Blocks launched in the grid: dG.x * dG.y
  - dB: dimension and size of each block in threads
    - Three-dimensional: x, y, and z
    - Threads per block: dB.x * dB.y * dB.z
  - Unspecified dim3 fields initialize to 1

Execution Configuration Examples

    dim3 grid, block;
    grid.x = 2;  grid.y = 4;
    block.x = 8; block.y = 16;
    kernel<<<grid, block>>>(...);

    dim3 grid(2, 4), block(8, 16);
    kernel<<<grid, block>>>(...);

    kernel<<<32, 512>>>(...);

CUDA Built-in Device Variables
- All __global__ and __device__ functions have access to these automatically defined variables
  - dim3 gridDim;   dimensions of the grid in blocks (at most 2D)
  - dim3 blockDim;  dimensions of the block in threads
  - dim3 blockIdx;  block index within the grid
  - dim3 threadIdx; thread index within the block

Unique Thread IDs
- Built-in variables are used to determine unique thread IDs
- They map the local thread ID (threadIdx) to a global ID that can be used as an array index

Increment Array Example

CPU program:

    void inc_cpu(int *a, int N)
    {
        int idx;
        for (idx = 0; idx < N; idx++)
            a[idx] = a[idx] + 1;
    }

    void main()
    {
        ...
        inc_cpu(a, N);
    }

CUDA program:

    __global__ void inc_gpu(int *d_a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            d_a[idx] = d_a[idx] + 1;
    }

    void main()
    {
        ...
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);
    }

Host Synchronization
- All kernel launches are asynchronous
  - Control returns to the CPU immediately
  - The kernel executes after all previous CUDA calls have completed
- cudaMemcpy() is synchronous
  - The copy starts after all previous CUDA calls have completed
  - Control returns to the CPU after the copy completes
- cudaThreadSynchronize()
  - Blocks until all previous CUDA calls complete
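To make the increment array example concrete, here is a minimal host-driver sketch that fills in the setup elided above: it allocates and initializes the array on the host, copies it to the device, launches inc_gpu, and copies the result back. The values chosen for N and blocksize and the zero-initialized input are assumptions for illustration; note that no explicit cudaThreadSynchronize() is needed because the device-to-host cudaMemcpy waits for the kernel to finish.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    __global__ void inc_gpu(int *d_a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            d_a[idx] = d_a[idx] + 1;
    }

    int main(void)
    {
        int N = 1024;            /* assumed problem size */
        int blocksize = 256;     /* assumed threads per block */
        int nbytes = N * sizeof(int);

        /* Host array, initialized to zero */
        int *a = (int*)malloc(nbytes);
        for (int i = 0; i < N; i++) a[i] = 0;

        /* Allocate device memory and copy the input to the device */
        int *d_a = 0;
        cudaMalloc((void**)&d_a, nbytes);
        cudaMemcpy(d_a, a, nbytes, cudaMemcpyHostToDevice);

        /* Kernel launch: control returns to the CPU immediately */
        dim3 dimBlock(blocksize);
        dim3 dimGrid((int)ceil(N / (float)blocksize));
        inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);

        /* cudaMemcpy waits for the kernel to complete before copying back */
        cudaMemcpy(a, d_a, nbytes, cudaMemcpyDeviceToHost);
        printf("a[0] = %d (expected 1)\n", a[0]);

        cudaFree(d_a);
        free(a);
        return 0;
    }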
Device Synchronization
- void __syncthreads();
  - Synchronizes all threads in a block
  - Generates a barrier synchronization instruction
  - No thread can pass this barrier until all threads in the block reach it
- Used to avoid RAW / WAR / WAW hazards when accessing shared memory
- Allowed in conditional code only if the conditional is uniform across the entire thread block

    idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (blockIdx.x == blockToReverse)
    {
        sharedData[blockDim.x - (threadIdx.x + 1)] = a[idx];
        __syncthreads();
        a[idx] = sharedData[threadIdx.x];
    }

Matrix Multiplication
- A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  - Local variable and register usage
  - Thread ID usage
  - Memory data transfer API between host and device
  - Shared memory usage is left for later

Matrix Multiplication
- P = M * N
- Each matrix is WIDTH x WIDTH
- Data parallel

CPU Implementation

    void MatrixMulOnHost(float* M, float* N, float* P, int width)
    {
        for (int i = 0; i < width; i++)
            for (int j = 0; j < width; j++)
            {
                float sum = 0;
                for (int k = 0; k < width; k++)
                {
                    float a = M[i * width + k];
                    float b = N[k * width + j];
                    sum += a * b;
                }
                P[i * width + j] = sum;
            }
    }

CUDA Skeleton

    int main(void)
    {
        // 1. Allocate and initialize the matrices M, N, P
        //    I/O to read the input matrices M and N

        // 2. M * N on the device
        MatrixMulOnDevice(M, N, P, WIDTH);

        // 3. I/O to write the output matrix P
        //    Free matrices M, N, P

        return 0;
    }

Step 1: Data Transfer

    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        // 1. Load M and N to device memory
        cudaMalloc((void**)&d_M, size);
        cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_N, size);
        cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc((void**)&d_P, size);

        // 2. Kernel invocation code (shown in Step 3)

        // 3. Read P from the device
        cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(d_M);
        cudaFree(d_N);
        cudaFree(d_P);
    }

Step 2: Implement the Kernel

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)
    {
        // 2D thread ID
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Pvalue stores the d_P element that is computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < width; k++)
        {
            float a = d_M[ty * width + k];
            float b = d_N[k * width + tx];
            Pvalue += a * b;
        }

        // Write the result to device memory; each thread writes one element
        d_P[ty * width + tx] = Pvalue;
    }

Step 3: Invoke the Kernel

    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        cudaMalloc((void**)&d_M, size);
        cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_N, size);
        cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_P, size);

        // Set up the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(width, width);

        // Launch the device computation threads
        MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, width);

        cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);

        cudaFree(d_M);
        cudaFree(d_N);
        cudaFree(d_P);
    }

Simple Matrix Multiplication
[Figure: a grid with a single block; thread (2, 2) computes one element of Pd from a row of Md and a column of Nd]
- One block of threads computes the matrix d_P
  - Each thread computes one element of d_P
- Each thread
  - Loads a row of matrix d_M
  - Loads a column of matrix d_N
  - Performs one multiplication and one addition for each pair of d_M and d_N elements
  - Compute to off-chip memory access ratio is close to 1:1 (not very high)
- Size of the matrix is limited by the number of threads allowed in a thread block
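The CUDA Skeleton slide leaves the host-side allocation, initialization, and I/O as comments. The following is a minimal sketch of how the pieces might be tied together, assuming the MatrixMulOnHost, MatrixMulKernel, and MatrixMulOnDevice functions shown above. The WIDTH value and the random initialization (used here instead of file I/O) are assumptions for illustration; the CPU result is used to check the GPU result.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define WIDTH 16   /* assumed; WIDTH * WIDTH must not exceed the threads-per-block limit */

    /* MatrixMulOnHost, MatrixMulKernel, and MatrixMulOnDevice as defined above */

    int main(void)
    {
        int size = WIDTH * WIDTH * sizeof(float);
        float *M     = (float*)malloc(size);
        float *N     = (float*)malloc(size);
        float *P     = (float*)malloc(size);   /* GPU result */
        float *P_ref = (float*)malloc(size);   /* CPU reference */

        /* 1. Initialize the input matrices (random values in place of file I/O) */
        for (int i = 0; i < WIDTH * WIDTH; i++)
        {
            M[i] = rand() / (float)RAND_MAX;
            N[i] = rand() / (float)RAND_MAX;
        }

        /* 2. M * N on the device and on the host */
        MatrixMulOnDevice(M, N, P, WIDTH);
        MatrixMulOnHost(M, N, P_ref, WIDTH);

        /* 3. Compare the two results instead of writing P to a file */
        float maxErr = 0;
        for (int i = 0; i < WIDTH * WIDTH; i++)
        {
            float err = fabsf(P[i] - P_ref[i]);
            if (err > maxErr) maxErr = err;
        }
        printf("max |GPU - CPU| = %g\n", maxErr);

        free(M); free(N); free(P); free(P_ref);
        return 0;
    }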