Intermediate GPGPU Programming in CUDA
Supada Laosooksathit

NVIDIA Hardware Architecture
[Figure: host memory and the GPU device]

Recall
• 5 steps for CUDA programming
– Initialize the device
– Allocate device memory
– Copy data to device memory
– Execute the kernel
– Copy data back from device memory

Initialize Device Calls
• To select the device associated with the host thread
– cudaSetDevice(device)
– Must be called before any __global__ function is launched; otherwise device 0 is selected automatically.
• To get the number of devices
– cudaGetDeviceCount(&devicecount)
• To retrieve a device's properties
– cudaGetDeviceProperties(&deviceProp, device)

Hello World Example
• Allocate host and device memory
• Host code
• Kernel code

To Try CUDA Programming
• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories
– NVIDIA_GPU_Computing_SDK/shared/
– NVIDIA_GPU_Computing_SDK/C/common/
• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

Demo
• Hello World
– Print out block and thread IDs
• Vector Add
– C = A + B

NVIDIA Hardware Architecture
[Figure: streaming multiprocessor (SM)]

Specifications of a Device

Specification        Compute Capability 1.3   Compute Capability 2.0
Warp size            32                       32
Max threads/block    512                      1024
Max blocks/grid      65535                    65535
Shared memory        16 KB/SM                 48 KB/SM

• For more details
– deviceQuery in the CUDA SDK
– Appendix F of the Programming Guide 4.0

Demo
• deviceQuery
– Shows the hardware specifications in detail

Memory Optimizations
• Reduce the time of memory transfers between host and device
– Use asynchronous memory transfers (CUDA streams)
– Use zero copy
• Reduce the number of transactions between on-chip and off-chip memory
– Memory coalescing
• Avoid bank conflicts in shared memory

Reduce Time of Host-Device Memory Transfer
• Regular memory transfer (synchronous)

Reduce Time of Host-Device Memory Transfer
• CUDA streams
– Allow overlap between kernel execution and memory copies

CUDA Streams Example
(a sketch follows the GPU timer slides below)

GPU Timers
• CUDA events
– An API in the CUDA runtime
– Timestamps are recorded on the GPU clock
– Accurate for timing kernel executions
• CUDA timer calls
– Libraries implemented in the CUDA SDK

CUDA Events Example
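A minimal sketch of the event-timing pattern, assuming a kernel work and its launch configuration already exist; the names work, grid, block, and args are illustrative placeholders, not the original slide's code:

// Timing a kernel with CUDA events (illustrative fragment).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // timestamp on the default stream
work<<<grid, block>>>(args);            // the kernel being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block until stop has been recorded

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Because both timestamps are taken on the device, this avoids the skew a host timer would see from asynchronous kernel launches.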
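Returning to the CUDA Streams Example above: a minimal sketch of splitting an array across two streams so that one stream's copies can overlap the other's kernel execution. The array size, the chunking, and the scale kernel are illustrative assumptions, not the original slide's code. Note that cudaMemcpyAsync only overlaps when the host buffer is page-locked, hence cudaMallocHost.

// Overlapping host-device copies with kernel execution via streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)        // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20, nStreams = 2, chunk = N / nStreams;
    float *h, *d;
    cudaMallocHost((void **)&h, N * sizeof(float)); // page-locked host memory
    cudaMalloc((void **)&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel on it, and
    // copies it back; operations in different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                  // wait for all streams
    printf("h[0] = %f\n", h[0]);              // expect 2.0

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}

How much actually overlaps depends on the device's copy engines (the deviceOverlap and asyncEngineCount properties reported by deviceQuery).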
Demo
• simpleStreams

Reduce Time of Host-Device Memory Transfer
• Zero copy
– Allows device code to access page-locked host memory directly through a device pointer
– Page-locked host memory is allocated with cudaHostAlloc() (using the cudaHostAllocMapped flag); cudaHostGetDevicePointer() returns the matching device pointer
(a sketch appears at the end of these notes)

Demo
• Zero copy

Reduce Number of On-chip and Off-chip Memory Transactions
• Threads in a warp access global memory together
• Memory coalescing
– Transfer a whole batch of words in a single transaction

Memory Coalescing
• Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)

Memory Coalescing
• Memory addresses are aligned in the same segment but the accesses are not sequential

Memory Coalescing
• Memory addresses are not aligned in the same segment

Shared Memory
• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
• Helps with utilizing memory coalescing: data can be staged in shared memory, then accessed in any order
• Bank conflicts may occur
– Two or more threads in a warp access the same bank
– In compute capability 1.x, there is no broadcast
– In compute capability 2.x, the same word is broadcast to all threads that request it

Bank Conflicts
[Figure: two thread-to-bank mappings of threads 0-3 onto banks 0-3, one with no bank conflict and one with a 2-way bank conflict]

Matrix Multiplication Example
• Reduce accesses to global memory by tiling through shared memory
– A is read (B.width/BLOCK_SIZE) times from global memory
– B is read (A.height/BLOCK_SIZE) times from global memory
(a sketch of the tiling scheme appears at the end of these notes)

Demo
• Matrix Multiplication
– With and without shared memory
– Different block sizes

Control Flow
• if, switch, do, for, while
• Branch divergence in a warp
– Threads in a warp take different execution paths
– The different execution paths are serialized
– This increases the number of instructions issued in that warp

Branch Divergence

Summary
• 5 steps for CUDA programming
• NVIDIA hardware architecture
– Memory hierarchy: global memory, shared memory, register file
– Specifications of a device: block, warp, thread, SM

Summary
• Memory optimization
– Reduce host-device memory transfer overhead with CUDA streams and zero copy
– Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
– Try to avoid bank conflicts in shared memory
• Control flow
– Try to avoid branch divergence within a warp

References
• http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
• http://www.developer.nvidia.com/cuda-toolkit
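Returning to the zero-copy slides above: a minimal sketch in which the kernel reads and writes mapped host memory in place, with no cudaMemcpy at all. The kernel, array size, and names are illustrative assumptions, not the original demo's code.

// Zero copy: device code accesses page-locked host memory directly.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleInPlace(float *p, int n) // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main()
{
    const int N = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);    // enable mapped host memory

    float *h = 0, *dptr = 0;
    cudaHostAlloc((void **)&h, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaHostGetDevicePointer((void **)&dptr, h, 0); // device view of h
    scaleInPlace<<<(N + 255) / 256, 256>>>(dptr, N);
    cudaDeviceSynchronize();                  // results now visible on host

    printf("h[0] = %f\n", h[0]);              // expect 2.0
    cudaFreeHost(h);
    return 0;
}

cudaSetDeviceFlags(cudaDeviceMapHost) must run before the CUDA context is created; whether zero copy beats explicit copies depends on the access pattern and the hardware.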
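Finally, returning to the Matrix Multiplication Example: a minimal sketch of the shared-memory tiling in the spirit of the CUDA Programming Guide's version, assuming square n x n matrices with n a multiple of BLOCK_SIZE; all names are illustrative.

// Tiled matrix multiplication C = A * B using shared memory.
#define BLOCK_SIZE 16

__global__ void matMulShared(const float *A, const float *B,
                             float *C, int n)
{
    // Each block stages one BLOCK_SIZE x BLOCK_SIZE tile of A and of B.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // March across the tiles; each tile element is loaded from global
    // memory once and then reused BLOCK_SIZE times from shared memory,
    // which is where the reduced read counts come from.
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                      // tile fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // done with this tile
    }
    C[row * n + col] = sum;
}

A plausible launch, assuming device arrays dA, dB, dC already filled:

dim3 block(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE);
matMulShared<<<grid, block>>>(dA, dB, dC, n);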