Intermediate GPGPU Programming in CUDA
Supada Laosooksathit

NVIDIA Hardware Architecture
[Figure: host memory and the GPU device]

Recall
• 5 steps for CUDA programming
– Initialize the device
– Allocate device memory
– Copy data to device memory
– Execute the kernel
– Copy data back from device memory

Initialize Device Calls
• To select the device associated with the host thread
– cudaSetDevice(device)
– Must be called before any __global__ function is launched; otherwise device 0 is selected automatically.
• To get the number of devices
– cudaGetDeviceCount(&devicecount)
• To retrieve a device's properties
– cudaGetDeviceProperties(&deviceProp, device)

Hello World Example
• Allocate host and device memory
• Host code
• Kernel code

To Try CUDA Programming
• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories
– NVIDIA_GPU_Computing_SDK/shared/
– NVIDIA_GPU_Computing_SDK/C/common/
• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

Demo
• Hello World
– Print out block and thread IDs
• Vector Add
– C = A + B

NVIDIA Hardware Architecture
[Figure: streaming multiprocessor (SM)]

Specifications of a Device

Specification        Compute Capability 1.3   Compute Capability 2.0
Warp size            32                       32
Max threads/block    512                      1024
Max blocks/grid      65535                    65535
Shared memory        16 KB/SM                 48 KB/SM

• For more details
– deviceQuery in the CUDA SDK
– Appendix F of the Programming Guide 4.0

Demo
• deviceQuery
– Shows the hardware specifications in detail

Memory Optimizations
• Reduce the time of memory transfers between host and device
– Use asynchronous memory transfers (CUDA streams)
– Use zero copy
• Reduce the number of transactions between on-chip and off-chip memory
– Memory coalescing
• Avoid bank conflicts in shared memory

Reduce Time of Host-Device Memory Transfer
• Regular memory transfer (synchronous)

Reduce Time of Host-Device Memory Transfer
• CUDA streams
– Allow overlap between kernel execution and memory copies

CUDA Streams Example
(a sketch follows the GPU timer slides below)

GPU Timers
• CUDA events
– An API in the CUDA runtime
– Timestamps are recorded on the GPU clock
– Accurate for timing kernel executions
• CUDA timer calls
– Libraries implemented in the CUDA SDK

CUDA Events Example
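A minimal sketch of the event-timing pattern, assuming a kernel work and its launch configuration already exist; the names work, grid, block, and args are illustrative placeholders, not the original slide's code:

// Timing a kernel with CUDA events (illustrative fragment).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // timestamp on the default stream
work<<<grid, block>>>(args);            // the kernel being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block until stop has been recorded

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Because both timestamps are taken on the device, this avoids the skew a host timer would see from asynchronous kernel launches.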
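Returning to the CUDA Streams Example above: a minimal sketch of splitting an array across two streams so that one stream's copies can overlap the other's kernel execution. The array size, the chunking, and the scale kernel are illustrative assumptions, not the original slide's code. Note that cudaMemcpyAsync only overlaps when the host buffer is page-locked, hence cudaMallocHost.

// Overlapping host-device copies with kernel execution via streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)        // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20, nStreams = 2, chunk = N / nStreams;
    float *h, *d;
    cudaMallocHost((void **)&h, N * sizeof(float)); // page-locked host memory
    cudaMalloc((void **)&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel on it, and
    // copies it back; operations in different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                  // wait for all streams
    printf("h[0] = %f\n", h[0]);              // expect 2.0

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}

How much actually overlaps depends on the device's copy engines (the deviceOverlap and asyncEngineCount properties reported by deviceQuery).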
Demo
• simpleStreams

Reduce Time of Host-Device Memory Transfer
• Zero copy
– Allows device code to access page-locked host memory directly through a device pointer
– Page-locked host memory is allocated with cudaHostAlloc() (using the cudaHostAllocMapped flag); cudaHostGetDevicePointer() returns the matching device pointer
(a sketch appears at the end of these notes)

Demo
• Zero copy

Reduce Number of On-chip and Off-chip Memory Transactions
• Threads in a warp access global memory together
• Memory coalescing
– Transfer a whole batch of words in a single transaction

Memory Coalescing
• Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)

Memory Coalescing
• Memory addresses are aligned in the same segment but the accesses are not sequential

Memory Coalescing
• Memory addresses are not aligned in the same segment

Shared Memory
• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
• Helps with utilizing memory coalescing: data can be staged in shared memory, then accessed in any order
• Bank conflicts may occur
– Two or more threads in a warp access the same bank
– In compute capability 1.x, there is no broadcast
– In compute capability 2.x, the same word is broadcast to all threads that request it

Bank Conflicts
[Figure: two thread-to-bank mappings of threads 0-3 onto banks 0-3, one with no bank conflict and one with a 2-way bank conflict]

Matrix Multiplication Example
• Reduce accesses to global memory by tiling through shared memory
– A is read (B.width/BLOCK_SIZE) times from global memory
– B is read (A.height/BLOCK_SIZE) times from global memory
(a sketch of the tiling scheme appears at the end of these notes)

Demo
• Matrix Multiplication
– With and without shared memory
– Different block sizes

Control Flow
• if, switch, do, for, while
• Branch divergence in a warp
– Threads in a warp take different execution paths
– The different execution paths are serialized
– This increases the number of instructions issued in that warp

Branch Divergence

Summary
• 5 steps for CUDA programming
• NVIDIA hardware architecture
– Memory hierarchy: global memory, shared memory, register file
– Specifications of a device: block, warp, thread, SM

Summary
• Memory optimization
– Reduce host-device memory transfer overhead with CUDA streams and zero copy
– Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
– Try to avoid bank conflicts in shared memory
• Control flow
– Try to avoid branch divergence within a warp

References
• http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
• http://www.developer.nvidia.com/cuda-toolkit
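Returning to the zero-copy slides above: a minimal sketch in which the kernel reads and writes mapped host memory in place, with no cudaMemcpy at all. The kernel, array size, and names are illustrative assumptions, not the original demo's code.

// Zero copy: device code accesses page-locked host memory directly.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleInPlace(float *p, int n) // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main()
{
    const int N = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);    // enable mapped host memory

    float *h = 0, *dptr = 0;
    cudaHostAlloc((void **)&h, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaHostGetDevicePointer((void **)&dptr, h, 0); // device view of h
    scaleInPlace<<<(N + 255) / 256, 256>>>(dptr, N);
    cudaDeviceSynchronize();                  // results now visible on host

    printf("h[0] = %f\n", h[0]);              // expect 2.0
    cudaFreeHost(h);
    return 0;
}

cudaSetDeviceFlags(cudaDeviceMapHost) must run before the CUDA context is created; whether zero copy beats explicit copies depends on the access pattern and the hardware.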
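Finally, returning to the Matrix Multiplication Example: a minimal sketch of the shared-memory tiling in the spirit of the CUDA Programming Guide's version, assuming square n x n matrices with n a multiple of BLOCK_SIZE; all names are illustrative.

// Tiled matrix multiplication C = A * B using shared memory.
#define BLOCK_SIZE 16

__global__ void matMulShared(const float *A, const float *B,
                             float *C, int n)
{
    // Each block stages one BLOCK_SIZE x BLOCK_SIZE tile of A and of B.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // March across the tiles; each tile element is loaded from global
    // memory once and then reused BLOCK_SIZE times from shared memory,
    // which is where the reduced read counts come from.
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                      // tile fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // done with this tile
    }
    C[row * n + col] = sum;
}

A plausible launch, assuming device arrays dA, dB, dC already filled:

dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE);
matMulShared<<<grid, block>>>(dA, dB, dC, n);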