CIS 565, Fall 2011
Qing Sun
sunqing@seas.upenn.edu

Outline
- Memory Management
- Kernels
- Matrix Multiplication

Managing Memory
- CPU and GPU have separate memory spaces
- Host (CPU) code manages device (GPU) memory
  - Allocate / free
  - Copy data to and from the device
- Applies to global device memory (DRAM)

GPU Memory Allocation / Release
- cudaMalloc(void** pointer, size_t nbytes)
- cudaMemset(void* pointer, int value, size_t count)
- cudaFree(void* pointer)

    int n = 1024;
    int nbytes = n * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);

Data Copies
- cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);
  - direction specifies the locations (host or device) of src and dst
  - Blocks the CPU thread: returns after the copy is complete
  - Doesn't start copying until previous CUDA calls complete
- enum cudaMemcpyKind
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice

Executing Code on the GPU
- Kernels are C functions with some restrictions
  - Cannot access host memory
  - Must have void return type
  - No variable number of arguments
  - Not recursive
  - No static variables
- Function arguments are automatically copied from host to device

Function Qualifiers
- Kernels are designated by the function qualifier __global__
  - Function is called from the host and executed on the device
  - Must return void
- Other CUDA function qualifiers
  - __device__: called from the device and run on the device; cannot be called from host code
  - __host__: called from the host and executed on the host (the default)
- __host__ and __device__ qualifiers can be combined to generate both CPU and GPU code

Launching Kernels
- Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(...)
- Execution configuration ("<<< >>>")
  - dG: dimension and size of the grid in blocks
    - Two-dimensional: x and y
    - Blocks launched in the grid: dG.x * dG.y
  - dB: dimension and size of each block in threads
    - Three-dimensional: x, y, and z
    - Threads per block: dB.x * dB.y * dB.z
  - Unspecified dim3 fields initialize to 1

Execution Configuration Examples

    dim3 grid, block;
    grid.x = 2;  grid.y = 4;
    block.x = 8; block.y = 16;
    kernel<<<grid, block>>>(...);

    dim3 grid(2, 4), block(8, 16);
    kernel<<<grid, block>>>(...);

    kernel<<<32, 512>>>(...);

CUDA Built-in Device Variables
- All __global__ and __device__ functions have access to these automatically defined variables
  - dim3 gridDim;   dimensions of the grid in blocks (at most 2D)
  - dim3 blockDim;  dimensions of the block in threads
  - dim3 blockIdx;  block index within the grid
  - dim3 threadIdx; thread index within the block

Unique Thread IDs
- Built-in variables are used to determine unique thread IDs
- They map the local thread ID (threadIdx) to a global ID that can be used as an array index

Increment Array Example

CPU program:

    void inc_cpu(int *a, int N)
    {
        int idx;
        for (idx = 0; idx < N; idx++)
            a[idx] = a[idx] + 1;
    }

    void main()
    {
        ...
        inc_cpu(a, N);
    }

CUDA program:

    __global__ void inc_gpu(int *d_a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            d_a[idx] = d_a[idx] + 1;
    }

    void main()
    {
        ...
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);
    }

Host Synchronization
- All kernel launches are asynchronous
  - Control returns to the CPU immediately
  - The kernel executes after all previous CUDA calls have completed
- cudaMemcpy() is synchronous
  - The copy starts after all previous CUDA calls have completed
  - Control returns to the CPU after the copy completes
- cudaThreadSynchronize()
  - Blocks until all previous CUDA calls complete
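To make the increment array example concrete, here is a minimal host-driver sketch that fills in the setup elided above: it allocates and initializes the array on the host, copies it to the device, launches inc_gpu, and copies the result back. The values chosen for N and blocksize and the zero-initialized input are assumptions for illustration; note that no explicit cudaThreadSynchronize() is needed because the device-to-host cudaMemcpy waits for the kernel to finish.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    __global__ void inc_gpu(int *d_a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            d_a[idx] = d_a[idx] + 1;
    }

    int main(void)
    {
        int N = 1024;            /* assumed problem size */
        int blocksize = 256;     /* assumed threads per block */
        int nbytes = N * sizeof(int);

        /* Host array, initialized to zero */
        int *a = (int*)malloc(nbytes);
        for (int i = 0; i < N; i++) a[i] = 0;

        /* Allocate device memory and copy the input to the device */
        int *d_a = 0;
        cudaMalloc((void**)&d_a, nbytes);
        cudaMemcpy(d_a, a, nbytes, cudaMemcpyHostToDevice);

        /* Kernel launch: control returns to the CPU immediately */
        dim3 dimBlock(blocksize);
        dim3 dimGrid((int)ceil(N / (float)blocksize));
        inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);

        /* cudaMemcpy waits for the kernel to complete before copying back */
        cudaMemcpy(a, d_a, nbytes, cudaMemcpyDeviceToHost);
        printf("a[0] = %d (expected 1)\n", a[0]);

        cudaFree(d_a);
        free(a);
        return 0;
    }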
Device Synchronization
- void __syncthreads();
  - Synchronizes all threads in a block
  - Generates a barrier synchronization instruction
  - No thread can pass this barrier until all threads in the block reach it
- Used to avoid RAW / WAR / WAW hazards when accessing shared memory
- Allowed in conditional code only if the conditional is uniform across the entire thread block

    idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (blockIdx.x == blockToReverse)
    {
        sharedData[blockDim.x - (threadIdx.x + 1)] = a[idx];
        __syncthreads();
        a[idx] = sharedData[threadIdx.x];
    }

Matrix Multiplication
- A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  - Local variable and register usage
  - Thread ID usage
  - Memory data transfer API between host and device
  - Shared memory usage is left for later

Matrix Multiplication
- P = M * N
- Each matrix is WIDTH x WIDTH
- Data parallel

CPU Implementation

    void MatrixMulOnHost(float* M, float* N, float* P, int width)
    {
        for (int i = 0; i < width; i++)
            for (int j = 0; j < width; j++)
            {
                float sum = 0;
                for (int k = 0; k < width; k++)
                {
                    float a = M[i * width + k];
                    float b = N[k * width + j];
                    sum += a * b;
                }
                P[i * width + j] = sum;
            }
    }

CUDA Skeleton

    int main(void)
    {
        // 1. Allocate and initialize the matrices M, N, P
        //    I/O to read the input matrices M and N

        // 2. M * N on the device
        MatrixMulOnDevice(M, N, P, WIDTH);

        // 3. I/O to write the output matrix P
        //    Free matrices M, N, P

        return 0;
    }

Step 1: Data Transfer

    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        // 1. Load M and N to device memory
        cudaMalloc((void**)&d_M, size);
        cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_N, size);
        cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc((void**)&d_P, size);

        // 2. Kernel invocation code (shown in Step 3)

        // 3. Read P from the device
        cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(d_M);
        cudaFree(d_N);
        cudaFree(d_P);
    }

Step 2: Implement the Kernel

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)
    {
        // 2D thread ID
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Pvalue stores the d_P element that is computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < width; k++)
        {
            float a = d_M[ty * width + k];
            float b = d_N[k * width + tx];
            Pvalue += a * b;
        }

        // Write the result to device memory; each thread writes one element
        d_P[ty * width + tx] = Pvalue;
    }

Step 3: Invoke the Kernel

    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        cudaMalloc((void**)&d_M, size);
        cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_N, size);
        cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_P, size);

        // Set up the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(width, width);

        // Launch the device computation threads
        MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, width);

        cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);

        cudaFree(d_M);
        cudaFree(d_N);
        cudaFree(d_P);
    }

Simple Matrix Multiplication
[Figure: a grid with a single block; thread (2, 2) computes one element of Pd from a row of Md and a column of Nd]
- One block of threads computes the matrix d_P
  - Each thread computes one element of d_P
- Each thread
  - Loads a row of matrix d_M
  - Loads a column of matrix d_N
  - Performs one multiplication and one addition for each pair of d_M and d_N elements
  - Compute to off-chip memory access ratio is close to 1:1 (not very high)
- Size of the matrix is limited by the number of threads allowed in a thread block
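The CUDA Skeleton slide leaves the host-side allocation, initialization, and I/O as comments. The following is a minimal sketch of how the pieces might be tied together, assuming the MatrixMulOnHost, MatrixMulKernel, and MatrixMulOnDevice functions shown above. The WIDTH value and the random initialization (used here instead of file I/O) are assumptions for illustration; the CPU result is used to check the GPU result.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define WIDTH 16   /* assumed; WIDTH * WIDTH must not exceed the threads-per-block limit */

    /* MatrixMulOnHost, MatrixMulKernel, and MatrixMulOnDevice as defined above */

    int main(void)
    {
        int size = WIDTH * WIDTH * sizeof(float);
        float *M     = (float*)malloc(size);
        float *N     = (float*)malloc(size);
        float *P     = (float*)malloc(size);   /* GPU result */
        float *P_ref = (float*)malloc(size);   /* CPU reference */

        /* 1. Initialize the input matrices (random values in place of file I/O) */
        for (int i = 0; i < WIDTH * WIDTH; i++)
        {
            M[i] = rand() / (float)RAND_MAX;
            N[i] = rand() / (float)RAND_MAX;
        }

        /* 2. M * N on the device and on the host */
        MatrixMulOnDevice(M, N, P, WIDTH);
        MatrixMulOnHost(M, N, P_ref, WIDTH);

        /* 3. Compare the two results instead of writing P to a file */
        float maxErr = 0;
        for (int i = 0; i < WIDTH * WIDTH; i++)
        {
            float err = fabsf(P[i] - P_ref[i]);
            if (err > maxErr) maxErr = err;
        }
        printf("max |GPU - CPU| = %g\n", maxErr);

        free(M); free(N); free(P); free(P_ref);
        return 0;
    }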