IEEE Boston Continuing Education Program
Ken Domino, Domem Technologies
May 9, 2011

Announcements
Course website updates:
- Syllabus - http://domemtech.com/ieee-pp/Syllabus.docx
- Lecture1 - http://domemtech.com/ieee-pp/Lecture1.pptx
- Lecture2 - http://domemtech.com/ieee-pp/Lecture2.pptx
- References - http://domemtech.com/ieee-pp/References.docx
The Ocelot April 5 download is not working.

PRAM
- Parallel Random Access Machine (PRAM): an idealized SIMD parallel computing model.
- Unlimited number of RAMs, called Processing Units (PUs).
- The RAMs execute the same instructions, synchronously.
- Shared Memory is unlimited and is accessed in one unit of time.
- Shared Memory access is one of CREW, CRCW, or EREW.
- Communication between RAMs is only through Shared Memory.

PRAM pseudo code
Parallel for loop (aka "data-level parallelism"):
for Pi, 1 ≤ i ≤ n, in parallel do
   …
end

Synchronization
A simple example from C: what happens if we have two threads competing for the same resources (char_in/char_out)?
- If two threads execute this code serially, no problem!
- If two threads execute this code in parallel, we can sometimes get a problem: char_in of T2 overwrites char_in of T1!

Synchronization
Synchronization forces thread serialization, e.g., so that concurrent access does not cause problems.
Two types:
- Mutual exclusion, using a "mutex" semaphore = a lock.
- Cooperation: wait on an object until all other threads are ready, using wait() + notify(); barrier synchronization.

Deadlock
Deadlock can arise from mutual exclusion on two or more resources: each thread holds one resource while waiting for a resource held by another.

PRAM Synchronization
"Stay idle" - wait until the other processors complete; "cooperative" synchronization.

CUDA
- "Compute Unified Device Architecture"
- Developed by NVIDIA, introduced November 2006.
- Based on C, later extended to work with C++.
- CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization.
http://www.nvidia.com/object/IO_37226.html
http://www.gpgpu.org/oldsite/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf
Nickolls, J., Buck, I., Garland, M. and Skadron, K. Scalable parallel programming with CUDA. Queue, 6 (2), 40-53.

GPU: a coprocessor to the CPU

NVIDIA GPU Architecture
- Multiprocessor (MP) = texture/processor cluster (TPC)
- Dynamic random-access memory (DRAM), aka "global memory"
- Raster operation processor (ROP)
- L2 - level-2 memory cache

NVIDIA GPU Architecture
- Streaming Multiprocessor (SM)
- Streaming processor (SP)
- Streaming multiprocessor control (SMC)
- Texture processing unit (TPU)
- Con Cache - "constant" memory
- Sh. Memory - "shared" memory
- Multithreaded instruction fetch and issue unit (MTIFI)
1st generation: G80, 2006. 3rd generation: Fermi (GTX 570), 2010.

Single-instruction, multiple-thread
- "SIMT": SIMT = SIMD + SPMD (single program, multiple data). Multiple threads.
- Sort of "single instruction" - except that each instruction is executed in multiple independent parallel threads.
- Instruction set architecture: a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Single-instruction, multiple-thread
- The Streaming Multiprocessor is a hardware multithreaded unit.
- Threads are executed in groups of 32 parallel threads called warps.
- Each thread has its own set of registers.
- The individual threads composing a warp are of the same program and start together at the same program address, but they are otherwise free to branch and execute independently (see the divergence sketch below).
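A minimal sketch (not from the slides) of the kind of data-dependent branch that makes threads within one warp diverge; the kernel name and values here are made up for illustration.

#include <stdio.h>

// Hypothetical kernel: even- and odd-numbered threads take different branches,
// so within each 32-thread warp the two paths are executed one after the other.
__global__ void diverge(int * out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = tid * 2;    // path taken by even-numbered threads
    else
        out[tid] = tid + 100;  // path taken by odd-numbered threads
}

int main()
{
    int * d;
    cudaMalloc(&d, 32 * sizeof(int));
    diverge<<<1, 32>>>(d);     // launch exactly one warp's worth of threads
    cudaThreadSynchronize();
    int h[32];
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("h[0] = %d, h[1] = %d\n", h[0], h[1]);  // prints 0 and 101
    return 0;
}

The result is still correct; divergence only costs throughput, because the warp runs each branch path serially, as the next slide describes.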
Single-instruction, multiple-thread
- The instruction executed is the same for every thread in a warp.
- If the threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken.

Single-instruction, multiple-thread
Warps are serialized if there is:
- divergence in instructions (i.e., a conditional branch instruction), or
- write access to the same memory.

Warp Scheduling
- SM hardware implements near-zero-overhead warp scheduling.
- Warps whose next instruction has its operands ready for consumption can be executed.
- Eligible warps are selected for execution by priority.
- All threads in a warp execute the same instruction.
- 4 clock cycles are needed to dispatch the instruction for all threads (G80).

Cooperative Thread Array (CTA)
- An abstraction for synchronizing threads.
- AKA a thread block; blocks are organized into a grid.
- CTAs are mapped onto warps.

Cooperative Thread Array (CTA)
- Each thread has a unique integer thread ID (TID).
- Threads of a CTA share data in global or shared memory.
- Threads synchronize with the barrier instruction.
- CTA thread programs use their TIDs to select work and index shared data arrays.

Cooperative Thread Array (CTA)
- The programmer declares a 1D, 2D, or 3D grid shape, with dimensions given in threads.
- The TID is a 1D, 2D, or 3D index.
- There are restrictions on grid sizes.

Kernel
Every thread in a grid executes the same body of instructions, called a kernel. In CUDA, it's just a function.

CUDA Kernels
Kernels are declared with __global__ void. Parameters are the same for all threads.

__global__ void fun(float * d, int size)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x
            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
            + blockDim.x * gridDim.x * threadIdx.y;
    if (idx < 0) return;
    if (idx >= size) return;
    d[idx] = idx * 10.0 / 0.1;
}

CUDA Kernels
Kernels are called via "chevron syntax":
Func<<< Dg, Db, Ns, S >>>(parameters)
- Dg is of type dim3 and specifies the dimension and size of the grid.
- Db is of type dim3 and specifies the dimension and size of the block.
- Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block.
- S is of type cudaStream_t and specifies the associated stream.
Kernels have a void return type; results must be returned through pointer parameters.
Example: Foo<<<1, 100>>>(1, 2, i);

Memory
CTAs have various types of memory: global, shared, constant, texture, and registers. Threads can access host memory, too.

CUDA Memory
- Data types (int, long, float, double, etc.) are the same as on the host.
- Shared memory is shared between the threads in a block.
- Global memory is shared by all threads in all blocks.
- Constant memory is shared by all threads in all blocks, but it cannot be changed by kernels (so access is faster).
- Host memory (of the CPU) can be accessed by all threads in all blocks.

Shared Memory
__shared__ declares a variable that:
- resides in the shared memory space of a thread block,
- has the lifetime of the block,
- is accessible only from the threads within the block.
Example: extern __shared__ float shared[]; (its size can instead be given on the kernel call—see the sketch below!)
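A minimal sketch (not from the slides) of dynamically sized shared memory: the extern __shared__ array has no size in the kernel, and the size in bytes is supplied as the third chevron argument (Ns) at launch. The kernel name and sizes are hypothetical.

#include <stdio.h>

// Hypothetical kernel: reverse one block's worth of data using shared memory.
__global__ void reverse(int * d, int n)
{
    extern __shared__ int buf[];   // size supplied at launch via Ns
    int t = threadIdx.x;
    buf[t] = d[t];                 // stage the block's data in shared memory
    __syncthreads();               // barrier: wait until all threads have written
    d[t] = buf[n - 1 - t];         // write the data back reversed
}

int main()
{
    const int n = 64;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;
    int * d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    reverse<<<1, n, n * sizeof(int)>>>(d, n);   // Ns = n * sizeof(int) bytes per block
    cudaThreadSynchronize();
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d (expect %d)\n", h[0], n - 1);
    return 0;
}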
Global Memory
__device__ declares a variable that:
- resides in the global memory space,
- has the lifetime of the application,
- is accessible from all the threads within the grid, and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
Global memory can also be allocated dynamically through cudaMalloc().
Examples:
extern __device__ int data[100];
cudaMalloc(&d, 100*sizeof(int));

Basic host function calls
- Global memory allocation via cudaMalloc().
- Copying memory between host and GPU via cudaMemcpy().
- Kernels are called via chevron syntax.

Counting 6's
We have an array of integers, h[0..size-1], and want to count the number of 6's that appear in the array. How do we do this in CUDA?

Counting 6's
Divide the array into blocks of blocksize threads. For each block, sum the number of times 6 appears. Return the sum for each block.

#include <stdio.h>
#include <stdlib.h>

__global__ void c6(int * d_in, int * d_out, int size)
{
    int sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        int val = d_in[i + blockIdx.x * blockDim.x];
        if (val == 6) sum++;
    }
    d_out[blockIdx.x] = sum;
}

Counting 6's
In the main program, call the kernel with the correct dimensions of the block. Note: size % blocksize == 0 is assumed. How would we extend this for an arbitrary array size?

int main()
{
    int size = 300;
    int * h = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; ++i) h[i] = i % 10;
    int * d_in;
    int * d_out;
    int bsize = 100;
    int blocks = size / bsize;        // size is assumed to be a multiple of bsize
    int threads_per_block = bsize;
    int rv1 = cudaMalloc(&d_in, size*sizeof(int));
    int rv2 = cudaMalloc(&d_out, blocks*sizeof(int));
    int rv3 = cudaMemcpy(d_in, h, size*sizeof(int), cudaMemcpyHostToDevice);
    c6<<<blocks, threads_per_block>>>(d_in, d_out, size);
    cudaThreadSynchronize();
    int rv4 = cudaGetLastError();
    int * r = (int*)malloc(blocks * sizeof(int));
    int rv5 = cudaMemcpy(r, d_out, blocks*sizeof(int), cudaMemcpyDeviceToHost);
    int sum = 0;
    for (int i = 0; i < blocks; ++i) sum += r[i];
    printf("Result = %d\n", sum);
    return 0;
}

Developing CUDA programs
Install the CUDA SDK (drivers, Toolkit, examples).
- Windows, Linux, Mac: use Version 4.0, release candidate 2. (The older 3.2 release does not work with VS2010 easily! You can install both VS2010 and VS2008, but you will have to manage the paths.)
- http://developer.nvidia.com/cuda-toolkit-40
- Install the toolkit, the tools SDK, and the example code.
- For the drivers, you must have an NVIDIA GPU card.
- Recommendation: the CUDA examples use definitions in a common library; do not make your own code depend on that library.

Developing CUDA programs
Emulation:
- Do not install the CUDA drivers (the installation will fail).
- Windows and Mac only.
- Install VirtualBox. Create a 40GB virtual drive. Install Ubuntu from an ISO image on VirtualBox.
- Install Ocelot (http://code.google.com/p/gpuocelot/downloads/list).
- Install the various dependencies (sudo apt-get install xxxx, for g++, boost, etc.).
- Note: there is a problem with the current release of Ocelot—I emailed Gregory.Diamos@gatech.edu to resolve the build issue.

Developing CUDA programs
Windows:
- Install VS2010 C++ Express (http://www.microsoft.com/visualstudio/enus/products/2010-editions/visual-cpp-express).
- Test the installation with a "Hello World" .cpp example (see the sketch below).
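A minimal "Hello World" .cpp of the kind that could be used to verify the VS2010 installation before adding any CUDA code; the file name hello.cpp is just a suggestion.

// hello.cpp - plain C++ program to check that the compiler and IDE are set up.
#include <cstdio>

int main()
{
    std::printf("Hello, World!\n");
    return 0;
}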
Developing CUDA programs
Windows:
- Create an empty C++ console project.
- Create the hw.cu "hello world" program in the source directory.
- Project -> Custom Build Rules, check the box for CUDA 4.0 targets.
- Add hw.cu to your empty project.
Note: the ".cu" suffix stands for "CUDA source code". You can put CUDA syntax into .cpp files, but the build environment won't know what to compile it with (cl/g++ vs. nvcc).

Developing CUDA programs
hw.cu:

#include <stdio.h>

__global__ void fun(int * mem)
{
    *mem = 1;
}

int main()
{
    int h = 0;
    int * d;
    cudaMalloc(&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
    fun<<<1,1>>>(d);
    cudaThreadSynchronize();
    int rv = cudaGetLastError();
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result = %d\n", h);
    return 0;
}

Developing CUDA programs
Compile, link, and run. (The Version 4.0 installation adjusts all environment variables.)

NVCC
- nvcc (the NVIDIA CUDA compiler) is a driver program for the compiler phases.
- Use the -keep option to see the intermediate files. (You will need to add "." to the include directories when compiling.)

NVCC
- Compiles a ".cu" file into a ".cu.cpp" file.
- Two types of targets: virtual and real, represented in PTX assembly code and "cubin" binary code, respectively.

PTXAS
Compiles PTX assembly code into machine code, placed in an ELF module.

# cat hw.sm_10.cubin | od -t x1 | head
(The output is a hex dump of the ELF module; it begins with the ELF magic bytes 7f 45 4c 46.)

Disassembly of the machine code can be done using cuobjdump or my own utility nvdis (http://forums.nvidia.com/index.php?showtopic=183438).

PTX, the GPU assembly code
- PTX = "Parallel Thread Execution".
- The target for PTX is an abstract GPU machine.
- Contains operations for load, store, register declarations, add, sub, mul, etc.

.version 1.4
.target sm_10, map_f64_to_f32
// compiled with …/be.exe
// nvopencc 4.0 built on 2011-03-24

.entry _Z3funPi (
    .param .u32 __cudaparm__Z3funPi_mem)
{
    .reg .u32 %r<4>;
    .loc 16 4 0
$LDWbegin__Z3funPi:
    .loc 16 6 0
    mov.s32       %r1, 1;
    ld.param.u32  %r2, [__cudaparm__Z3funPi_mem];
    st.global.s32 [%r2+0], %r1;
    .loc 16 7 0
    exit;
$LDWend__Z3funPi:
} // _Z3funPi

CUDA GPU targets
- Virtual – the PTX code is embedded in the executable as a string, then compiled at run time, "just-in-time".
- Real – the PTX code is compiled ahead of time into machine code for a specific GPU and embedded in the target executable.

Next time
For next week, we will go into more detail: the CUDA runtime API; writing efficient CUDA code; and a look at some important examples.