Basic CUDA Programming
Shin-Kai Chen, skchen@twins.ee.nctu.edu.tw
VLSI Signal Processing Laboratory
Department of Electronics Engineering
National Chiao Tung University

What will you learn in this lab?
• Concept of a multicore accelerator
• Multithreaded/multicore programming
• Memory optimization

Slides
• Mostly from Prof. Wen-Mei Hwu of UIUC
  – http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

CUDA – Hardware? Software?
• [Figure: CUDA thread organization – the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid holds blocks such as Block (0, 0) … Block (1, 1), and each block holds threads indexed (x, y, z). Courtesy: NVIDIA, Figure 3.2, "An Example of CUDA Thread Organization."]

Host-Device Architecture
• CPU (host)
• GPU w/ local DRAM (device)

G80 CUDA mode – A Device Example
• [Figure: G80 block diagram – host, input assembler, and thread execution manager feed an array of processors, each with a parallel data cache and texture unit, connected through load/store units to global memory.]

Functional Units in G80
• Streaming Multiprocessor (SM)
  – 1 instruction decoder ( 1 instruction / 4 cycles )
  – 8 streaming processors (SP)
  – Shared memory
• [Figure: two SMs (SM 0, SM 1), each with an MT issue unit, SPs running threads t0, t1, t2, …, tm from its assigned blocks, and its own shared memory.]

Setup CUDA for Windows

CUDA Environment Setup
• Get a GPU that supports CUDA
  – http://www.nvidia.com/object/cuda_learn_products.html
• Download CUDA
  – http://www.nvidia.com/object/cuda_get.html
  – CUDA driver
  – CUDA toolkit
  – CUDA SDK (optional)
• Install CUDA
• Test CUDA
  – Device Query

Setup CUDA for Visual Studio
• From scratch
  – http://forums.nvidia.com/index.php?showtopic=30273
• CUDA VS Wizard
  – http://sourceforge.net/projects/cudavswizard/
• Modified from existing project

Lab1: First CUDA Program

CUDA Computing Model
• [Figure: execution flow alternating between host and device – serial code on the host, a memory transfer to the device, a kernel launch running parallel code, a memory transfer back, then more serial code, repeated for each kernel.]

Data Manipulation between Host and Device
• cudaError_t cudaMalloc( void** devPtr, size_t count )
  – Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr
• cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
  – Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
  – kind indicates the type of memory transfer
    • cudaMemcpyHostToHost
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
• cudaError_t cudaFree( void* devPtr )
  – Frees the memory space pointed to by devPtr

Example
• Functionality:
  – Given an integer array A holding 8192 elements
  – For each element in A, calculate A[i] * 256 (the operator is garbled in the source; multiplication is assumed) and leave the result in B[i]
• Host wrapper:

  float GPU_kernel( int *B, int *A ) {
    // Create pointers for memory space on device
    int *dA, *dB;

    // Allocate memory space on device
    cudaMalloc( (void**) &dA, sizeof(int) * SIZE );
    cudaMalloc( (void**) &dB, sizeof(int) * SIZE );

    // Copy data to be calculated
    cudaMemcpy( dA, A, sizeof(int) * SIZE, cudaMemcpyHostToDevice );
    cudaMemcpy( dB, B, sizeof(int) * SIZE, cudaMemcpyHostToDevice );

    // Launch kernel
    cuda_kernel<<<1, 1>>>( dB, dA );

    // Copy output back
    cudaMemcpy( B, dB, sizeof(int) * SIZE, cudaMemcpyDeviceToHost );

    // Free memory spaces on device
    cudaFree( dA );
    cudaFree( dB );
  }
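The example launches cuda_kernel<<<1, 1>>>( dB, dA ) but does not show the kernel body; the parallel version appears later in Lab2. As a point of reference, a minimal single-thread sketch of that kernel might look like this (assuming SIZE is the 8192-element count, presumably defined in parameter.h, and that the per-element operation is the multiplication assumed above):

  __global__ void cuda_kernel( int *B, int *A )
  {
      // A single thread walks the whole array; no parallelism yet.
      for ( int i = 0 ; i < SIZE ; i++ ) {
          B[i] = A[i] * 256;   // operator assumed; it is garbled in the source slide
      }
  }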
Now, go and finish your first CUDA program !!!
• Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
• Open the project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
  – main.cu: random input generation, output validation, result reporting
  – device.cu: launches the GPU kernel, GPU kernel code
  – parameter.h
• Fill in the appropriate APIs
  – GPU_kernel() in device.cu

Lab2: Make the Parallel Code Faster

Parallel Processing in CUDA
• Parallel code can be partitioned into blocks and threads
  – cuda_kernel<<<nBlk, nTid>>>(…)
• Multiple tasks will be initialized, each with a different block id and thread id
• The tasks are dynamically scheduled
  – Tasks within the same block will be scheduled on the same streaming multiprocessor
• Each task takes care of a single data partition according to its block id and thread id

Locate Data Partition by Built-in Variables
• Built-in variables
  – gridDim: x, y
  – blockIdx: x, y
  – blockDim: x, y, z
  – threadIdx: x, y, z
• [Figure: the grid/block/thread organization diagram again – Kernel 1 on Grid 1, Kernel 2 on Grid 2, threads indexed (x, y, z) inside Block (1, 1). Courtesy: NVIDIA.]

Data Partition for Previous Example
• When processing 64 integer data with cuda_kernel<<<2, 2>>>(…):
  – TASK 0: blockIdx.x = 0, threadIdx.x = 0
  – TASK 1: blockIdx.x = 0, threadIdx.x = 1
  – TASK 2: blockIdx.x = 1, threadIdx.x = 0
  – TASK 3: blockIdx.x = 1, threadIdx.x = 1
• Each task locates its own partition (of size length, starting at head) from its ids:

  int total_task = gridDim.x * blockDim.x;
  int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
  int length     = SIZE / total_task;
  int head       = task_sn * length;

Processing Single Data Partition

  __global__ void cuda_kernel( int *B, int *A )
  {
      int total_task = gridDim.x * blockDim.x;
      int task_sn    = blockDim.x * blockIdx.x + threadIdx.x;
      int length     = SIZE / total_task;
      int head       = task_sn * length;

      for ( int i = head ; i < head + length ; i++ ) {
          B[i] = A[i] * 256;   // operator assumed; it is garbled in the source slide
      }
      return;
  }

Parallelize Your Program !!!
• Partition the kernel into threads
  – Increase nTid from 1 to 512
  – Keep nBlk = 1
• Group threads into blocks
  – Adjust nBlk and see if it helps
• Maintain the total number of threads below 512, e.g. nBlk * nTid < 512

Lab3: Resolve Memory Contention

Parallel Memory Architecture
• Memory is divided into banks to achieve high bandwidth
• Each bank can service one address per cycle
• Successive 32-bit words are assigned to successive banks
• [Figure: sixteen banks, BANK0 through BANK15.]

Lab2 Review
• When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), the blocked partition from Lab2 hits the same bank in every iteration:
  – Iteration 1: THREAD0 reads A[ 0], THREAD1 reads A[16], THREAD2 reads A[32], THREAD3 reads A[48] – CONFLICT!!!! (all four indices map to the same one of BANK0–BANK15)
  – Iteration 2: THREAD0 reads A[ 1], THREAD1 reads A[17], THREAD2 reads A[33], THREAD3 reads A[49] – CONFLICT!!!!
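A quick host-side check of why the blocked scheme conflicts on every iteration, using the numbers from the slide (64 integers, 4 threads, 16 banks); the macro names here are made up for illustration:

  #include <stdio.h>

  #define NUM_BANKS   16
  #define NUM_THREADS 4
  #define LENGTH      ( 64 / NUM_THREADS )   // 16 integers per thread in the blocked scheme

  int main( void )
  {
      // The bank of the 32-bit word at index i is i % NUM_BANKS.
      for ( int iter = 0 ; iter < 2 ; iter++ ) {
          for ( int t = 0 ; t < NUM_THREADS ; t++ ) {
              int index = t * LENGTH + iter;   // blocked access: thread t reads A[t*16 + iter]
              printf( "iteration %d, thread %d -> A[%2d] -> bank %d\n",
                      iter + 1, t, index, index % NUM_BANKS );
          }
      }
      return 0;   // every iteration reports the same bank for all four threads
  }

Because t * 16 is a multiple of the 16 banks, the bank index reduces to iter % 16 for every thread, so all four threads collide.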
How about Interleave Accessing?
• When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), interleaved accesses fall in different banks:
  – Iteration 1: THREAD0 reads A[ 0], THREAD1 reads A[ 1], THREAD2 reads A[ 2], THREAD3 reads A[ 3] – NO CONFLICT (consecutive words sit in consecutive banks of BANK0–BANK15)
  – Iteration 2: THREAD0 reads A[ 4], THREAD1 reads A[ 5], THREAD2 reads A[ 6], THREAD3 reads A[ 7] – NO CONFLICT

Implementation of Interleave Accessing
• cuda_kernel<<<1, 4>>>(…)
• Each task starts at its own head and strides through the array by stripe:
  – head = task_sn
  – stripe = total_task

Improve Your Program !!!
• Modify the original kernel code in an interleaving manner (see the reference sketch at the end of these slides)
  – cuda_kernel() in device.cu
• Adjust nBlk and nTid as in Lab2 and examine the effect
  – Maintain the total number of threads below 512, e.g. nBlk * nTid < 512

Thank You
• http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
• Final project issue
• Group issue
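Reference: a possible interleaved version of the Lab2 kernel, following the head = task_sn / stripe = total_task idea above. This is a sketch only, not the official lab solution; SIZE and the multiplication are the same assumptions as before.

  __global__ void cuda_kernel( int *B, int *A )
  {
      int total_task = gridDim.x * blockDim.x;
      int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;

      // Interleaved access: start at head = task_sn and stride by total_task,
      // so in each iteration the threads touch consecutive words and hence different banks.
      for ( int i = task_sn ; i < SIZE ; i += total_task ) {
          B[i] = A[i] * 256;   // operator assumed; it is garbled in the source slide
      }
      return;
  }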