ECE408 / CS483 Applied Parallel Programming
Lecture 3: Introduction to CUDA C (Part 2)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012
ECE408/CS483, University of Illinois, Urbana-Champaign

Update – TA Office Hours
• TA office hours will be held at the following times in CSL 369:
  – Wednesday 2:00 p.m. - 3:00 p.m.
  – Friday 11:00 a.m. - 12:00 p.m.
• More TA office hours will be added on an as-needed basis

Tentative Plan for Labs
• Lab 0: Account Setup
  – Released: Thu, Aug 30 (Week 1)
  – Due: N/A
• Lab 1: Vector Add
  – Released: Tue, Sep 4 (Week 2)
  – Due: Tue, Sep 11 (Week 3)
• Lab 2: Basic Matrix Multiplication
  – Released: Tue, Sep 11 (Week 3)
  – Due: Tue, Sep 18 (Week 4)
• Lab 3: Tiled Matrix Multiplication
  – Released: Tue, Sep 18 (Week 4)
  – Due: Tue, Sep 25 (Week 5)
• Lab 4: Tiled Convolution
  – Released: Tue, Sep 25 (Week 5)
  – Due: Tue, Oct 9 (Week 7)
• Lab 5: Reduction & Prefix Scan
  – Released: Tue, Oct 9 (Week 7)
  – Due: Tue, Oct 23 (Week 9)
• Lab 6: Histogramming
  – Released: Tue, Oct 23 (Week 9)
  – Due: Tue, Nov 6 (Week 11)
• Project:
  – Release & kickoff lecture: Tue, Oct 23 (Week 9)
  – Proposals for non-competition projects due: TBD
  – Progress report: TBD (if any)
  – Presentations: TBD

Objective
• To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism
  – Hierarchical thread organization
  – Main interfaces for launching parallel execution
  – Thread-index-to-data-index mapping
CUDA C / OpenCL – Execution Model
• Integrated host + device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

  Serial Code (host)
  Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host)
  Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);

Heterogeneous Computing: vecAdd CUDA Host Code

  void vecAdd(float* h_A, float* h_B, float* h_C, int n)
  {
      int size = n * sizeof(float);
      float *d_A, *d_B, *d_C;
      …
      // Part 1: Allocate device memory for A, B, and C;
      //         copy A and B to device memory
      // Part 2: Kernel launch code – have the device perform
      //         the actual vector addition
      // Part 3: Copy C from device memory;
      //         free device vectors
  }

Parts 1 and 3 run on the host (CPU), moving data between host memory and device memory; Part 2 runs on the device (GPU).

Partial Overview of CUDA Memories
• Device code can:
  – R/W per-thread registers
  – R/W all-shared global memory
• Host code can:
  – Transfer data to/from per-grid global memory
(Figure: a device grid with Block (0, 0) and Block (1, 0); each thread has its own registers, and all threads share global memory, which the host reads and writes.)
We will cover more later.
CUDA Device Memory Management API Functions
• cudaMalloc()
  – Allocates an object in device global memory
  – Two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object in bytes
• cudaFree()
  – Frees an object from device global memory
  – One parameter:
    • Pointer to the freed object

Host-Device Data Transfer API Functions
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  – Note: cudaMemcpy() blocks the host until the copy is complete; cudaMemcpyAsync() is the asynchronous variant

  void vecAdd(float* h_A, float* h_B, float* h_C, int n)
  {
      int size = n * sizeof(float);
      float *d_A, *d_B, *d_C;

      // 1. Allocate device memory for A and B; transfer A and B to device memory
      cudaMalloc((void **) &d_A, size);
      cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
      cudaMalloc((void **) &d_B, size);
      cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
      // Allocate device memory for C
      cudaMalloc((void **) &d_C, size);

      // 2. Kernel invocation code – to be shown later
      …

      // 3. Transfer C from device to host
      cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
      // Free device memory for A, B, C
      cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  }

Example: Vector Addition Kernel

Device Code:
  // Compute vector sum C = A + B
  // Each thread performs one pair-wise addition
  __global__
  void vecAddKernel(float* A, float* B, float* C, int n)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      if (i < n) C[i] = A[i] + B[i];
  }

Host Code:
  void vecAdd(float* h_A, float* h_B, float* h_C, int n)
  {
      // d_A, d_B, d_C allocations and copies omitted
      // Run ceil(n/256.0) blocks of 256 threads each
      vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
  }

More on Kernel Launch

Host Code:
  void vecAdd(float* A, float* B, float* C, int n)
  {
      // A_d, B_d, C_d allocations and copies omitted
      // Run ceil(n/256) blocks of 256 threads each
      dim3 DimGrid(n/256, 1, 1);
      if (n % 256) DimGrid.x++;   // makes sure there are enough threads
                                  // to cover all elements
      dim3 DimBlock(256, 1, 1);
      vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
  }

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
Kernel Execution in a Nutshell

  __host__
  void vecAdd(float *A_d, float *B_d, float *C_d, int n)
  {
      dim3 DimGrid(ceil(n/256.0), 1, 1);
      dim3 DimBlock(256, 1, 1);
      vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
  }

  __global__
  void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) C_d[i] = A_d[i] + B_d[i];
  }

(Figure: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share device RAM.)

More on CUDA Function Declarations

                                  Executed on the:   Only callable from the:
  __device__ float DeviceFunc()   device             device
  __global__ void  KernelFunc()   device             host
  __host__   float HostFunc()     host               host

• __global__ defines a kernel function
• Each "__" consists of two underscore characters
• A kernel function must return void
• __device__ and __host__ can be used together

LET'S TAKE A LOOK AT A REAL PIECE OF CUDA CODE

QUESTIONS?