Introduction to GPU Programming

Volodymyr (Vlad) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)

Tutorial Goals
• Become familiar with NVIDIA GPU architecture
• Become familiar with the NVIDIA GPU application development flow
• Be able to write and run a simple NVIDIA GPU application in CUDA

Tutorial Outline
• Introduction (15 minutes)
  – Why use Graphics Processing Units (GPUs) for general-purpose computing
  – Modern GPU architecture
  – GPU programming: libraries, CUDA C, OpenCL, PGI x64+GPU
• Hands-on: getting started with the NCSA GPU cluster (15 minutes)
  – Cluster architecture overview
  – How to login and check out a node
  – How to compile and run an existing application
• Hands-on: anatomy of a GPU application (25 minutes)
  – Host side
  – Device side
  – NVIDIA CUDA programming model
• Hands-on: porting the matrix multiplier to the GPU (25 minutes)
• More on CUDA programming (40 minutes)

Introduction
• Why use Graphics Processing Units (GPUs) for general-purpose computing
• Modern GPU architecture
  – NVIDIA
• GPU programming
  – Libraries
  – CUDA C
  – OpenCL
  – PGI x64+GPU

Why GPUs? Raw Performance Trends
[Figure: single-precision GFLOP/s, 2002–2008. NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) climb past 1000 GFLOP/s, while a 3 GHz quad-core Intel Xeon stays far below. Graph is courtesy of NVIDIA]

Why GPUs? Memory Bandwidth Trends
[Figure: memory bandwidth in GByte/s, 2002–2008. NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) climb from roughly 10 to 160 GByte/s, far ahead of contemporary CPUs. Graph is courtesy of NVIDIA]

GPU vs. CPU Silicon Use
[Figure: on a GPU most of the die is devoted to arithmetic units; on a CPU most of it goes to cache and control logic. Graph is courtesy of NVIDIA]

NVIDIA GPU Architecture
• A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of
  – 8 Scalar Processor (SP) cores
  – 2 special function units for transcendentals
  – A multithreaded instruction unit
  – On-chip shared memory
• GDDR3 SDRAM
• PCIe interface
[Figure is courtesy of NVIDIA]

NVIDIA GeForce 9400M G GPU
• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  – 54 GFLOPS in single precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  – 21 GB/s bandwidth
[Block diagram: one TPC with geometry controller and SMC; two SMs, each with I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory; texture units with a texture L1; a 128-bit interconnect to two L2/ROP/DRAM partitions]

NVIDIA Tesla C1060 GPU
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  – 1 TFLOPS SP
  – 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  – 102 GB/s bandwidth
[Block diagram: TPCs 1–10, each with a geometry controller, SMC, three SMs (I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, shared memory), and texture units with a texture L1; a 512-bit memory interconnect to eight L2/ROP/DRAM partitions]
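The per-device numbers quoted above (number of multiprocessors, clock rate, memory size) can be read back at run time through the CUDA runtime API used throughout this tutorial. A minimal sketch — not part of the tutorial sources, and the file name is illustrative:

  #include <stdio.h>

  int main(void)
  {
      int count;
      cudaGetDeviceCount(&count);
      for (int dev = 0; dev < count; dev++) {
          cudaDeviceProp props;
          cudaGetDeviceProperties(&props, dev);
          // clockRate is reported in kHz, totalGlobalMem in bytes
          printf("device %d: %s, %d multiprocessors @ %.2f GHz, %.0f MB global memory\n",
                 dev, props.name, props.multiProcessorCount,
                 props.clockRate / 1.0e6, props.totalGlobalMem / (1024.0*1024.0));
      }
      return 0;
  }

Compiled with, e.g., nvcc query.cu -o query, this prints one line per CUDA device; the deviceQuery tool run later in the tutorial reports the same properties in more detail.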
NVIDIA Tesla S1070 Computing Server
• 4 T10 GPUs
[Diagram: four Tesla GPUs, each with 4 GB of GDDR3 SDRAM; pairs of GPUs share a PCIe x16 host connection through an NVIDIA switch; the 1U unit integrates system monitoring, thermal management, and a power supply. Graph is courtesy of NVIDIA]

GPU Use/Programming
• GPU libraries
  – NVIDIA's CUDA BLAS and FFT libraries (see the CUBLAS sketch below)
  – Many 3rd party libraries
• Low-abstraction, lightweight GPU programming toolkits
  – CUDA C
  – OpenCL
• High-abstraction, compiler-based tools
  – PGI x64+GPU
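As an example of the library route: the dense matrix multiply ported by hand later in this tutorial can also be done with a single CUDA BLAS call. A sketch, assuming the CUBLAS interface shipped with CUDA at the time (cublas.h, link with -lcublas); mmult_cublas is an illustrative name, not from the tutorial sources:

  #include <cublas.h>

  // a = b * c for N x N column-major matrices, computed on the GPU by CUBLAS
  void mmult_cublas(float *a, float *b, float *c, int N)
  {
      float *devA, *devB, *devC;

      cublasInit();
      cublasAlloc(N*N, sizeof(float), (void**)&devA);
      cublasAlloc(N*N, sizeof(float), (void**)&devB);
      cublasAlloc(N*N, sizeof(float), (void**)&devC);

      cublasSetMatrix(N, N, sizeof(float), b, N, devB, N);
      cublasSetMatrix(N, N, sizeof(float), c, N, devC, N);

      // a = 1.0 * b * c + 0.0 * a, no transposition
      cublasSgemm('n', 'n', N, N, N, 1.0f, devB, N, devC, N, 0.0f, devA, N);

      cublasGetMatrix(N, N, sizeof(float), devA, N, a, N);

      cublasFree(devA); cublasFree(devB); cublasFree(devC);
      cublasShutdown();
  }

The library's SGEMM is heavily tuned for the hardware, so it is a useful performance baseline for the hand-written kernels that follow.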
Getting Started with the NCSA GPU Cluster
• Cluster architecture overview
• How to login and check out a node
• How to compile and run an existing application

NCSA AC GPU Cluster
[Photo of the AC cluster]

GPU Cluster Architecture
• Servers: 32
  – CPU cores: 128
• Accelerator units: 32
  – GPUs: 128
[Diagram: ac (head node) in front of compute nodes ac01 through ac32]

GPU Cluster Node Architecture
• HP xw9400 workstation
  – AMD Opteron 2216, 2.4 GHz, dual socket dual core
  – 8 GB DDR2
  – InfiniBand QDR
• S1070 1U GPU Computing Server
  – 1.3 GHz Tesla T10 processors
  – 4x4 GB GDDR3 SDRAM
[Diagram: the workstation connects to the fabric over QDR InfiniBand and to the Tesla S1070 over two PCIe x16 links; behind the S1070's PCIe interfaces sit four T10 processors, each with its own DRAM]

Accessing the GPU Cluster
• Use a Secure Shell (SSH) client to access AC
  – `ssh USER@ac.ncsa.uiuc.edu` (User: gpu001 - gpu200; Password: CHiPS-09)
  – You will see something like this printed out:

  See machine details and a technical report at:
    http://www.ncsa.uiuc.edu/Projects/GPUcluster/

  Machine Description and HOW TO USE. See: /usr/local/share/ac.readme
  CUDA wrapper readme: /usr/local/share/cuda_wrapper.readme

  *IMPORTANT* If you are using multiple GPU devices per host, be sure to
  understand how the cuda_wrapper changes this system!!
  ...
  July 03, 2009
  Nvidia compute exclusive mode made default. If this breaks your
  application, "touch ~/FORCE_NORMAL" to create an override for all your jobs.

  Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu

  [gpuXYZ@ac ~]$ _

Installing Tutorial Examples
• Run this sequence to retrieve and install the tutorial examples:
  cd
  cp /tmp/chips_tutorial.tgz .
  tar -xvzf chips_tutorial.tgz
  cd chips_tutorial
  ls
  src1 src2 src3

Accessing the GPU Cluster
[Diagram: laptops 1–30 connect to ac (head node), behind which sit compute nodes ac01–ac32; "You are here" marks the head node]

Requesting a Cluster Node for Interactive Use
• Run `qstat` to see what other users do, just for the fun of it
• Run `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive use
  – You will see something like this printed out:

  qsub: waiting for job 64424.acm to start
  qsub: job 64424.acm ready
  [gpuXYZ@acAB ~]$ _

Requesting a Cluster Node
[Diagram: the same cluster picture; "You are here" now marks a compute node]

Checking GPU Characteristics
• Run `deviceQuery`

  CUDA Device Query (Runtime API) version (CUDART static linking)
  There is 1 device supporting CUDA

  Device 0: "Tesla C1060"
    CUDA Capability Major revision number:    1
    CUDA Capability Minor revision number:    3
    Total amount of global memory:            4294705152 bytes
    Number of multiprocessors:                30
    Number of cores:                          240
    Total amount of constant memory:          65536 bytes
    Total amount of shared memory per block:  16384 bytes
    ...
    Clock rate:                               1.30 GHz
    Compute mode:                             Exclusive (only one host thread
                                              at a time can use this device)

Compiling and Running an Existing Application
• cd chips_tutorial/src1
  – vecadd.c – reference C implementation
  – vecadd.cu – CUDA implementation
• Compile & run the CPU version
  gcc vecadd.c -o vecadd_cpu
  ./vecadd_cpu
  Running CPU vecAdd for 16384 elements, C[0]=2147483648.00 ...
• Compile & run the GPU version
  nvcc vecadd.cu -o vecadd_gpu
  ./vecadd_gpu
  Running GPU vecAdd for 16384 elements, C[0]=2147483648.00 ...

nvcc
• Any source file containing CUDA C language extensions must be compiled with nvcc
• nvcc is a compiler driver that invokes many other tools to accomplish the job
• Basic nvcc usage
  – nvcc <filename>.cu [-o <executable>]
    • Builds release mode
  – nvcc -deviceemu <filename>.cu
    • Builds device emulation mode (all code runs on the CPU)
  – The -g flag builds debug mode for the gdb debugger

Anatomy of a GPU Application
• Host side
• Device side
• CUDA programming model

CPU-Only Version

  void vecAdd(int N, float* A, float* B, float* C)   // computational kernel
  {
      for (int i = 0; i < N; i++)
          C[i] = A[i] + B[i];
  }

  int main(int argc, char **argv)
  {
      int N = 16384;                                  // default vector size

      float *A = (float*)malloc(N * sizeof(float));   // memory allocation
      float *B = (float*)malloc(N * sizeof(float));
      float *C = (float*)malloc(N * sizeof(float));

      vecAdd(N, A, B, C);                             // call compute kernel

      free(A); free(B); free(C);                      // memory de-allocation
  }

Adding GPU Support

  int main(int argc, char **argv)
  {
      int N = 16384;   // default vector size

      float *A = (float*)malloc(N * sizeof(float));
      float *B = (float*)malloc(N * sizeof(float));
      float *C = (float*)malloc(N * sizeof(float));

      // memory allocation on the GPU card
      float *devPtrA, *devPtrB, *devPtrC;
      cudaMalloc((void**)&devPtrA, N * sizeof(float));
      cudaMalloc((void**)&devPtrB, N * sizeof(float));
      cudaMalloc((void**)&devPtrC, N * sizeof(float));

      // copy data from the CPU (host) memory to the GPU (device) memory
      cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

      // kernel invocation
      vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

      // copy results from the device memory to the host memory
      cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

      // device memory de-allocation
      cudaFree(devPtrA);
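Note that the host code above ignores the status codes that every CUDA runtime call returns. The error-handling functions covered near the end of this tutorial (cudaGetLastError, cudaGetErrorString) can be wrapped once in a small helper; a sketch, where CUDA_CHECK is an illustrative name, not something from the tutorial sources:

  #include <stdio.h>
  #include <stdlib.h>

  // Wrap any runtime call: CUDA_CHECK(cudaMalloc((void**)&p, bytes));
  #define CUDA_CHECK(call)                                           \
      do {                                                           \
          cudaError_t err = (call);                                  \
          if (err != cudaSuccess) {                                  \
              fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                      __FILE__, __LINE__, cudaGetErrorString(err));  \
              exit(EXIT_FAILURE);                                    \
          }                                                          \
      } while (0)

A kernel launch itself returns no status, so launches are still checked by calling cudaGetLastError() afterwards, as shown later in the matrix-multiply host code.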
      cudaFree(devPtrB);
      cudaFree(devPtrC);

      free(A); free(B); free(C);
  }

GPU Kernel
• CPU version

  void vecAdd(int N, float* A, float* B, float* C)
  {
      for (int i = 0; i < N; i++)
          C[i] = A[i] + B[i];
  }

• GPU version

  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      C[i] = A[i] + B[i];
  }

CUDA Programming Model
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
• Threads are arranged as a grid of thread blocks
  – Threads within a block have access to a segment of shared memory
[Diagram: a grid of thread blocks 0 through N-1, each block with its own shared memory]

Kernel Invocation Syntax

  vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);

• The <<<...>>> execution configuration gives the grid and thread block dimensionality: here 32 thread blocks of 512 threads each
• Inside the kernel,
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  combines the block ID within the grid (blockIdx.x), the number of threads per block (blockDim.x), and the thread ID within the block (threadIdx.x) into a unique global index

Mapping Threads to the Hardware
• Blocks of threads are transparently assigned to SMs
  – A block of threads executes on one SM and does not migrate
  – Several blocks can reside concurrently on one SM
• Each block can execute in any order relative to other blocks
[Diagram: the same 8-block kernel grid runs on a 2-SM device as four waves of two blocks, and on a 4-SM device as two waves of four blocks]
[Slide is courtesy of NVIDIA]

CUDA Programming Model
• A kernel is executed as a grid of thread blocks
  – A grid of blocks can be 1- or 2-dimensional
  – Thread blocks can be 1-, 2-, or 3-dimensional
• Different kernels can have different grid/block configurations
• Threads from the same block have access to a shared memory and their execution can be synchronized
[Diagram: the host launches Kernel 1 on Grid 1, a 3x2 arrangement of blocks, and Kernel 2 on Grid 2; each block, e.g. Block (1,1), is itself a 4x2x2 array of threads]
[Slide is courtesy of NVIDIA]

GPU Memory Hierarchy
[Diagram: the host CPU and its DRAM connect through the chipset to the device; each GPU multiprocessor has registers, shared memory, and constant/texture caches; device DRAM holds local, global, constant, and texture memory]

  Memory    Location  Cached  Access  Scope                   Lifetime
  Register  On-chip   N/A     R/W     One thread              Thread
  Local     Off-chip  No      R/W     One thread              Thread
  Shared    On-chip   N/A     R/W     All threads in a block  Block
  Global    Off-chip  No      R/W     All threads + host      Application
  Constant  Off-chip  Yes     R       All threads + host      Application
  Texture   Off-chip  Yes     R       All threads + host      Application
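The vecAdd fragments from the preceding slides assemble into a single file. A sketch of a complete vecadd.cu — the initialization values are placeholders rather than the tutorial's, N is passed to the kernel, and a bounds guard is added so the code also works when N is not a multiple of the block size:

  #include <stdio.h>
  #include <stdlib.h>

  __global__ void vecAdd(int N, float* A, float* B, float* C)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N)                      // guard: the last block may be partial
          C[i] = A[i] + B[i];
  }

  int main(int argc, char **argv)
  {
      int N = 16384;
      float *A = (float*)malloc(N * sizeof(float));
      float *B = (float*)malloc(N * sizeof(float));
      float *C = (float*)malloc(N * sizeof(float));
      for (int i = 0; i < N; i++) { A[i] = 1.0f; B[i] = 2.0f; }  // placeholder data

      float *devPtrA, *devPtrB, *devPtrC;
      cudaMalloc((void**)&devPtrA, N * sizeof(float));
      cudaMalloc((void**)&devPtrB, N * sizeof(float));
      cudaMalloc((void**)&devPtrC, N * sizeof(float));
      cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

      // round the grid size up so that all N elements are covered
      vecAdd<<<(N + 511)/512, 512>>>(N, devPtrA, devPtrB, devPtrC);

      cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);
      printf("Running GPU vecAdd for %d elements, C[0]=%.2f\n", N, C[0]);

      cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC);
      free(A); free(B); free(C);
      return 0;
  }

With the tutorial's N = 16384 and 512 threads per block the guard never triggers (16384/512 = 32 exactly), but it makes the program safe for arbitrary N.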
Porting the Matrix Multiplier to the GPU
• cd ../chips_tutorial/src2
• Compile & run the CPU version
  icc -O3 mmult.c -o mmult
  ./mmult
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  ...
  msec = 2215478  GFLOPS = 0.969

Reference C implementation (mmult.c):

  // a = b * c
  void mmult(float *a, float *b, float *c, int N)
  {
      for (int j = 0; j < N; j++)
          for (int k = 0; k < N; k++)
              for (int i = 0; i < N; i++)
                  a[i+j*N] += b[i+k*N] * c[k+j*N];
  }

  void minit(float *a, float *b, float *c, int N)
  {
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++) {
              a[i+N*j] = 0.0f;
              b[i+N*j] = 1.0f;
              c[i+N*j] = 1.0f;
          }
  }

  void mprint(float *a, int N, int M)
  {
      int i, j;
      for (j = 0; j < M; j++) {
          for (i = 0; i < M; i++)
              printf("%.2f ", a[i+N*j]);
          printf("...\n");
      }
      printf("...\n");
  }

  int main(int argc, char* argv[])
  {
      int N = 1024;
      struct timeval t1, t2;
      long msec1, msec2;
      float flop, mflop, gflop;

      float *a = (float *)malloc(N*N*sizeof(float));
      float *b = (float *)malloc(N*N*sizeof(float));
      float *c = (float *)malloc(N*N*sizeof(float));

      minit(a, b, c, N);

      gettimeofday(&t1, NULL);
      mmult(a, b, c, N);   // a = b * c
      gettimeofday(&t2, NULL);

      mprint(a, N, 5);

      msec1 = t1.tv_sec * 1000000 + t1.tv_usec;
      msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
      msec2 -= msec1;
      flop = N*N*N*2.0f;
      mflop = flop / msec2;
      gflop = mflop / 1000.0f;
      printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);

      free(a); free(b); free(c);
  }

Matrix Representation in Memory

  for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j)
          for (k = 0; k < n; ++k)
              a[i+n*j] += b[i+n*k] * c[k+n*j];

• Matrices are stored in column-major order
• For reference, the jki-ordered version runs at 1.7 GFLOPS on a 3 GHz Intel Xeon (single core)

Map this code:

  for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j)
          for (k = 0; k < n; ++k)
              a[i+n*j] += b[i+n*k] * c[k+n*j];

• A = B * C; for example, with a 3x2 matrix B and a 2x3 matrix C,
  a1,2 = b1,1*c1,2 + b1,2*c2,2

into this (logical) architecture: a grid of one-dimensional thread blocks, in which each thread computes the global index
  blockIdx.x * blockDim.x + threadIdx.x
[Diagram: three blocks of six threads; blockIdx.x runs 0..2, threadIdx.x runs 0..5, and the global indices run 0..17]

Grid and thread block configuration:

  dim3 grid(1024/32, 1024);   // 32x1024 grid of thread blocks
  dim3 threads(32);           // blocks of 32x1x1 threads

• 32 blocks of 32 threads each cover the i dimension (rows); 1024 block rows cover the j dimension (one per column)
• Each thread computes one element of a:

  {
      int i = blockIdx.x*32 + threadIdx.x;
      int j = blockIdx.y;
      float sum = 0.0f;
      for (int k = 0; k < n; k++)
          sum += b[i+n*k] * c[k+n*j];
      a[i+n*j] = sum;
  }

Kernel
• Original CPU kernel

  void mmult(float *a, float *b, float *c, int N)
  {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                  a[i+j*N] += b[i+k*N] * c[k+j*N];
  }

• GPU kernel

  __global__ void mmult(float *a, float *b, float *c, int N)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y;
      float sum = 0.0f;
      for (int k = 0; k < N; k++)
          sum += b[i+N*k] * c[k+N*j];
      a[i+N*j] = sum;
  }

  dim3 dimGrid(32, 1024);
  dim3 dimBlock(32);
  mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);
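This kernel fetches every b and c operand from global memory. Shared memory, introduced in the programming-model slides above, is the standard next step: stage tiles of b and c on chip and reuse them. A sketch of a 16x16-tiled variant — not part of the tutorial sources, and assuming N is a multiple of the tile size (1024 is):

  #define TILE 16

  __global__ void mmult_tiled(float *a, float *b, float *c, int N)
  {
      __shared__ float bs[TILE][TILE];   // tile of b, staged in shared memory
      __shared__ float cs[TILE][TILE];   // tile of c, staged in shared memory

      int i = blockIdx.x * TILE + threadIdx.x;   // row (column-major layout)
      int j = blockIdx.y * TILE + threadIdx.y;   // column
      float sum = 0.0f;

      for (int t = 0; t < N; t += TILE) {
          // each thread loads one element of each tile (coalesced in threadIdx.x)
          bs[threadIdx.y][threadIdx.x] = b[i + N*(t + threadIdx.y)];
          cs[threadIdx.y][threadIdx.x] = c[(t + threadIdx.x) + N*j];
          __syncthreads();               // tiles must be complete before use

          for (int k = 0; k < TILE; k++)
              sum += bs[k][threadIdx.x] * cs[threadIdx.y][k];
          __syncthreads();               // done with the tiles before reloading
      }
      a[i + N*j] = sum;
  }

It would be launched with dim3 dimGrid(N/TILE, N/TILE); dim3 dimBlock(TILE, TILE); each block then reads each needed element of b and c from global memory once per tile instead of once per thread.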
The complete host-side code for the GPU port:

  int main(int argc, char* argv[])
  {
      int N = 1024;
      struct timeval t1, t2;
      long msec1, msec2;
      float flop, mflop, gflop;

      float *a = (float *)malloc(N*N*sizeof(float));
      float *b = (float *)malloc(N*N*sizeof(float));
      float *c = (float *)malloc(N*N*sizeof(float));

      minit(a, b, c, N);

      // allocate device memory
      float *devPtrA, *devPtrB, *devPtrC;
      cudaMalloc((void**)&devPtrA, N*N*sizeof(float));
      cudaMalloc((void**)&devPtrB, N*N*sizeof(float));
      cudaMalloc((void**)&devPtrC, N*N*sizeof(float));

      // copy input arrays to the device memory
      cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);

      gettimeofday(&t1, NULL);
      msec1 = t1.tv_sec * 1000000 + t1.tv_usec;

      // define grid and thread block sizes
      dim3 dimGrid(32, 1024);
      dim3 dimBlock(32);

      // launch GPU kernel
      mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

      // check for errors
      cudaError_t err = cudaGetLastError();
      if (cudaSuccess != err) {
          fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
          exit(EXIT_FAILURE);
      }

      // wait until the GPU kernel is done
      cudaThreadSynchronize();

      gettimeofday(&t2, NULL);
      msec2 = t2.tv_sec * 1000000 + t2.tv_usec;

      // copy results to the host
      cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost);

      mprint(a, N, 5);

      // free device memory
      cudaFree(devPtrA);
      cudaFree(devPtrB);
      cudaFree(devPtrC);
      free(a); free(b); free(c);

      msec2 -= msec1;
      flop = N*N*N*2.0f;
      mflop = flop / msec2;
      gflop = mflop / 1000.0f;
      printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
  }

Porting the Matrix Multiplier to the GPU
• Compile & run the GPU version
  nvcc mmult.cu -o mmult
  ./mmult
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  1024.00 1024.00 1024.00 1024.00 1024.00 ...
  ...
  msec = 91363  GFLOPS = 23.505
• This is roughly a 24x speedup over the single-core CPU run; the naive kernel streams its operands from off-chip memory with little reuse, so it is bound by memory bandwidth rather than by the card's 1 TFLOPS peak (the shared-memory tiling sketched earlier is the usual remedy)
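An alternative to the gettimeofday()/cudaThreadSynchronize() timing above is the runtime's event API, which records timestamps on the GPU itself. A sketch, reusing the kernel and device pointers from the host code above:

  // timing a kernel launch with CUDA events
  cudaEvent_t start, stop;
  float elapsed_ms;

  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);                       // 0 = default stream
  mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);                      // wait for the kernel to finish

  cudaEventElapsedTime(&elapsed_ms, start, stop);  // result in milliseconds
  printf("kernel time = %.3f ms\n", elapsed_ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);

Because the events are recorded in the GPU's command stream, this measures only the kernel, without the host-side launch overhead captured by wall-clock timers.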
More on CUDA Programming
• Language extensions
  – Function type qualifiers
  – Variable type qualifiers
  – Execution configuration
  – Built-in variables
• Common runtime components
  – Built-in vector types
• Device runtime components
  – Intrinsic functions
  – Synchronization and memory fencing functions
  – Atomic functions
• Host runtime components (runtime API only)
  – Device management
  – Memory management
  – Error handling
  – Debugging in the device emulation mode
• Exercise

Function Type Qualifiers

                                  Executed on the:  Only callable from the:
  __device__ float DeviceFunc()   device            device
  __global__ void  KernelFunc()   device            host
  __host__   float HostFunc()     host              host

• __device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, and cannot have a variable number of arguments
• __device__ functions cannot have their address taken
• The __host__ and __device__ qualifiers can be used together, in which case the function is compiled for both
• The __global__ and __host__ qualifiers cannot be used together
• A __global__ function must have void return type, its execution configuration must be specified, and the call is asynchronous

Variable Type Qualifiers

                                             Memory    Scope  Lifetime
  __device__ int GlobalVar;                  global    grid   application
  __device__ __shared__ int SharedVar;       shared    block  block
  __device__ __constant__ int ConstantVar;   constant  grid   application

• volatile may qualify global or shared variables (volatile int GlobalVar; volatile int SharedVar;)
• __shared__ and __constant__ variables have implied static storage
• __device__, __shared__ and __constant__ variables cannot be defined using the extern keyword
• __device__ and __constant__ variables are only allowed at file scope
• __constant__ variables cannot be assigned to from the device; they are initialized from the host only
• __shared__ variables cannot have an initialization as part of their declaration

Execution Configuration
A function declared as
  __global__ void kernel(float* param);
must be called like this:
  kernel<<<Dg, Db, Ns, S>>>(param);
where
• Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched;
• Db (type dim3) specifies the dimension and size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block;
• optional Ns (type size_t) specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated memory;
• optional S (type cudaStream_t) specifies the stream associated with this kernel call.

Built-in Variables

  variable   type   description
  gridDim    dim3   dimensions of the grid
  blockIdx   uint3  block index within the grid
  blockDim   dim3   dimensions of the block
  threadIdx  uint3  thread index within the block
  warpSize   int    warp size in threads

• It is not allowed to take the address of any of the built-in variables
• It is not allowed to assign values to any of the built-in variables

Built-in Vector Types
Vector types derived from the basic integer and float types:
• char1, char2, char3, char4
• uchar1, uchar2, uchar3, uchar4
• short1, short2, short3, short4
• ushort1, ushort2, ushort3, ushort4
• int1, int2, int3, int4
• uint1, uint2, uint3 (dim3 is based on uint3), uint4
• long1, long2, long3, long4
• ulong1, ulong2, ulong3, ulong4
• longlong1, longlong2
• float1, float2, float3, float4
• double1, double2

They are all structures, like this:
  typedef struct { float x, y, z, w; } float4;
They all come with a constructor function of the form make_<type name>, e.g.,
  int2 make_int2(int x, int y);

Intrinsic Functions
• Supported on the device only
• Start with __, as in __sinf(x)
• May end with a rounding-mode suffix:
  _rn (round-to-nearest-even)
  _rz (round-towards-zero)
  _ru (round-up)
  _rd (round-down)
  as in __fadd_rn(x, y);
• There are mathematical (__log10f(x)), type conversion (__int2float_rn(x)), type casting (__int_as_float(x)), and bit manipulation (__ffs(x)) functions

Synchronization and Memory Fencing Functions

  function                    description
  void __threadfence()        waits until all global and shared memory accesses
                              made by the calling thread become visible to all
                              threads in the device for global memory accesses,
                              and to all threads in the thread block for shared
                              memory accesses
  void __threadfence_block()  waits until all global and shared memory accesses
                              made by the calling thread become visible to all
                              threads in the thread block
  void __syncthreads()        waits until all threads in the thread block have
                              reached this point and all global and shared
                              memory accesses made by these threads become
                              visible to all threads in the block

Atomic Functions

  function                              description
  atomicAdd()                           new = old + val
  atomicSub()                           new = old - val
  atomicExch()                          new = val
  atomicMin()                           new = min(old, val)
  atomicMax()                           new = max(old, val)
  atomicInc()                           new = ((old >= val) ? 0 : (old+1))
  atomicDec()                           new = (((old == 0) | (old > val)) ? val : (old-1))
  atomicCAS()                           new = (old == compare ? val : old)
  atomicAnd(), atomicOr(), atomicXor()  new = (old & val), (old | val), (old ^ val)

An atomic function performs a read-modify-write atomic operation on one 32-bit or one 64-bit word residing in global or shared memory. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.
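A concrete use of atomicAdd() is a histogram, where many threads may increment the same bin at once and plain read-modify-write updates would lose counts. A sketch (the names are illustrative, not from the tutorial sources; atomicAdd() on global memory requires compute capability 1.1 or higher, which the Tesla T10 at compute 1.3 satisfies):

  // hist has 256 bins and must be zeroed (e.g., with cudaMemset) before launch
  __global__ void histogram(unsigned char *data, int n, unsigned int *hist)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          atomicAdd(&hist[data[i]], 1u);   // safe concurrent increment
  }

  // launch sketch:
  //   histogram<<<(n + 255)/256, 256>>>(devData, n, devHist);

Atomics serialize colliding updates, so they cost performance when many threads hit the same address; they guarantee correctness, not speed.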
CUDA APIs
• A higher-level API, the CUDA runtime API:

  myKernel<<<Dg, Db>>>((unsigned char*)devPtr, width, height, pitch);

• A low-level API, the CUDA driver API:

  cuModuleLoad(&module, binfile);
  cuModuleGetFunction(&func, module, "mmkernel");
  ...
  cuParamSetv(func, 0, &args, 48);
  cuParamSetSize(func, 48);
  cuFuncSetBlockShape(func, ts[0], ts[1], 1);
  cuLaunchGrid(func, gs[0], gs[1]);

Device Management

  function                   description
  cudaGetDeviceCount()       returns the number of compute-capable devices
  cudaGetDeviceProperties()  returns information on the compute device
  cudaSetDevice()            sets the device to be used for GPU execution
  cudaGetDevice()            returns the device currently being used
  cudaChooseDevice()         selects the device that best matches given criteria

Device Management Example

  void cudaDeviceInit()
  {
      int devCount, device;

      cudaGetDeviceCount(&devCount);
      if (devCount == 0) {
          printf("No CUDA capable devices detected.\n");
          exit(EXIT_FAILURE);
      }

      for (device = 0; device < devCount; device++) {
          cudaDeviceProp props;
          cudaGetDeviceProperties(&props, device);
          // if a device of compute capability >= 1.3 is found, use it
          if (props.major > 1 || (props.major == 1 && props.minor >= 3))
              break;
      }

      if (device == devCount) {
          printf("No device above 1.2 compute capability detected.\n");
          exit(EXIT_FAILURE);
      }
      else
          cudaSetDevice(device);
  }

Memory Management

  function           description
  cudaMalloc()       allocates memory on the GPU
  cudaMallocPitch()  allocates memory on the GPU for 2D arrays; may pad the
                     allocated memory to ensure alignment requirements
  cudaFree()         frees the memory allocated on the GPU
  cudaMallocArray()  allocates an array on the GPU
  cudaFreeArray()    frees an array allocated on the GPU
  cudaMallocHost()   allocates page-locked memory on the host
  cudaFreeHost()     frees page-locked memory on the host

More on Memory Alignment

  cudaMalloc(&dev_a, m*n*sizeof(float));

• Columns are stored one after another with no padding; in general, matrix columns are not aligned at 64-bit boundaries
[Diagram: a 3x3 matrix laid out column by column, a1,1 a2,1 a3,1 a1,2 ..., with no padding between columns]

  cudaMallocPitch(&dev_a, &n, n*sizeof(float), m);

• Each column is padded so that it starts at a 64-bit boundary
• n returns the allocated (aligned) size, in bytes, of the first dimension (the pitch), given the requested sizes of the two dimensions
[Diagram: the same matrix with padding at the end of each column so every column starts aligned]

Memory Management Example

  // host code (width and height are assumed to be known)
  cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
  myKernel<<<100, 192>>>(devPtr, pitch);

  // device code
  __global__ void myKernel(float* devPtr, int pitch)
  {
      for (int r = 0; r < height; ++r) {
          float* row = (float*)((char*)devPtr + r * pitch);
          for (int c = 0; c < width; ++c) {
              float element = row[c];
          }
      }
  }

Memory Management (continued)

  function                  description
  cudaMemset()              initializes or sets GPU memory to a value
  cudaMemcpy()              copies data between the host and the device
  cudaMemcpyToArray()       copies data to a CUDA array
  cudaMemcpyFromArray()     copies data from a CUDA array
  cudaMemcpyArrayToArray()  copies data between CUDA arrays
  cudaMemcpyToSymbol()      copies data to a device symbol (global or constant variable)
  cudaMemcpyFromSymbol()    copies data from a device symbol
  cudaGetSymbolAddress()    finds the address associated with a CUDA symbol
  cudaGetSymbolSize()       finds the size of the object associated with a CUDA symbol
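The symbol-copy functions are how __constant__ variables (which, as noted earlier, cannot be assigned from the device) get their values from the host. A small sketch — the names coeff and NCOEFF are illustrative, not from the tutorial sources:

  #define NCOEFF 16
  __constant__ float coeff[NCOEFF];   // lives in cached constant memory

  void upload_coefficients(const float *host_coeff)
  {
      // copies NCOEFF floats from host memory into the constant-memory symbol;
      // offset 0 and cudaMemcpyHostToDevice are the default arguments
      cudaMemcpyToSymbol(coeff, host_coeff, NCOEFF * sizeof(float));
  }

After the copy, any kernel can read coeff[] directly; since all threads of a warp reading the same constant address are serviced by a broadcast from the constant cache, this is a good home for small, read-only parameter tables.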
Error Handling
All CUDA runtime API functions return an error code. The runtime maintains an error variable for each host thread that is overwritten by the error code every time an error occurs.

  function              description
  cudaGetLastError()    returns the error variable and resets it to cudaSuccess
  cudaGetErrorString()  returns the message string for an error code

  cudaError_t err = cudaGetLastError();
  if (cudaSuccess != err) {
      fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
      exit(EXIT_FAILURE);
  }

Exercise
• Port the Mandelbrot set fractal renderer to CUDA
  – Source is in ~/chips_tutorial/src3
    • fractal.c – reference C implementation
    • Makefile – make file
    • fractal.cu.reference – CUDA implementation for reference

Reference C Implementation

  void makefractal_cpu(unsigned char *image, int width, int height,
                       double xupper, double xlower, double yupper, double ylower)
  {
      int x, y;
      double xinc = (xupper - xlower) / width;
      double yinc = (yupper - ylower) / height;

      for (y = 0; y < height; y++)
          for (x = 0; x < width; x++)
              image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
  }

CUDA Kernel Implementation
(The image is the kernel's only argument; the pixel coordinates come from the block indices and the image size from the grid dimensions, matching the host-side launch below.)

  __global__ void makefractal_gpu(unsigned char *image)
  {
      int x = blockIdx.x;
      int y = blockIdx.y;
      int width = gridDim.x;
      int height = gridDim.y;
      double xupper = -0.74624, xlower = -0.74758;
      double yupper =  0.10779, ylower =  0.10671;
      double xinc = (xupper - xlower) / width;
      double yinc = (yupper - ylower) / height;

      image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
  }

Reference C Implementation

  inline unsigned char iter(double a, double b)
  {
      unsigned char i = 0;
      double c_x = 0, c_y = 0;
      double c_x_tmp, c_y_tmp;
      double D = 4.0;

      while ((c_x*c_x + c_y*c_y < D) && (i++ < 255)) {
          c_x_tmp = c_x * c_x - c_y * c_y;
          c_y_tmp = 2 * c_y * c_x;
          c_x = a + c_x_tmp;
          c_y = b + c_y_tmp;
      }
      return i;
  }

The Mandelbrot set is generated by iterating the complex function z² + c, where c is a constant:
  z1 = (z0)² + c
  z2 = (z1)² + c
  z3 = (z2)² + c
and so forth. The sequence z0, z1, z2, ... is called the orbit of z0 under iteration of z² + c. We stop iterating when the orbit starts to diverge, or when a maximum number of iterations is reached.

CUDA Kernel Implementation

  inline __device__ unsigned char iter(double a, double b)
  {
      unsigned char i = 0;
      double c_x = 0, c_y = 0;
      double c_x_tmp, c_y_tmp;
      double D = 4.0;

      while ((c_x*c_x + c_y*c_y < D) && (i++ < 255)) {
          c_x_tmp = c_x * c_x - c_y * c_y;
          c_y_tmp = 2 * c_y * c_x;
          c_x = a + c_x_tmp;
          c_y = b + c_y_tmp;
      }
      return i;
  }

Host Code

  int width = 1024;
  int height = 768;
  unsigned char *image = NULL;
  unsigned char *devImage;

  image = (unsigned char*)malloc(width*height*sizeof(unsigned char));
  cudaMalloc((void**)&devImage, width*height*sizeof(unsigned char));

  dim3 dimGrid(width, height);
  dim3 dimBlock(1);
  makefractal_gpu<<<dimGrid, dimBlock>>>(devImage);

  cudaMemcpy(image, devImage, width*height*sizeof(unsigned char),
             cudaMemcpyDeviceToHost);

  free(image);
  cudaFree(devImage);

A Few Examples
• xupper = -0.74624, xlower = -0.74758, yupper = 0.10779, ylower = 0.10671
  – CPU time: 2.27 sec
  – GPU time: 0.29 sec
• xupper = -0.754534912109, xlower = -0.757077407837, yupper = 0.060144042969, ylower = 0.057710774740
  – CPU time: 1.5 sec
  – GPU time: 0.25 sec
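Note that the host code above launches one thread per block (dim3 dimBlock(1)), so 31 of the 32 threads in every warp sit idle. A sketch of an alternative mapping with 64-thread blocks — makefractal_gpu2 and its parameters are illustrative, not from the tutorial sources, and it reuses the __device__ iter() defined above:

  __global__ void makefractal_gpu2(unsigned char *image, int width, int height,
                                   double xlower, double xupper,
                                   double ylower, double yupper)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y;
      if (x < width) {                 // guard: width may not divide evenly
          double xinc = (xupper - xlower) / width;
          double yinc = (yupper - ylower) / height;
          image[y*width + x] = iter(xlower + x*xinc, ylower + y*yinc);
      }
  }

  // launch: one row of 64-thread blocks per image row
  //   dim3 dimBlock(64);
  //   dim3 dimGrid((width + 63)/64, height);
  //   makefractal_gpu2<<<dimGrid, dimBlock>>>(devImage, width, height,
  //                                           xlower, xupper, ylower, yupper);

Passing the view window as arguments also removes the constants hard-coded in the kernel, so one compiled kernel serves all of the example regions above.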