GPU Parallel Execution Model / Architecture
Sarah Tariq, NVIDIA Developer Technology Group

GPGPU Revolutionizes Computing
Latency processor (CPU) + throughput processor (GPU)

Low Latency or High Throughput?
CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution.
GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.

Low Latency or High Throughput?
The CPU architecture must minimize latency within each thread.
The GPU architecture hides latency with computation from other thread warps.
[Diagram: a GPU streaming multiprocessor (high-throughput processor) keeps many warps in flight. While one warp (W1) is processing, others (W2, W3, W4) are waiting for data or ready to be processed, so computation continues. A CPU core (low-latency processor) runs a few threads (T1-T4) and pays a context switch to move between them.]

Processing Flow (over the PCIe bus)
1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute, caching data on chip for performance.
3. Copy results from GPU memory to CPU memory.
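To make this three-step flow concrete, here is a minimal CUDA C sketch. It is not from the slides: the kernel name `scale`, the array size, and the launch configuration are illustrative, and error checking is omitted for brevity.

    // Hypothetical illustration of the copy-execute-copy pattern described above.
    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void scale(float *data, float factor, int n)   // runs on the GPU (device)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_data = (float*)malloc(bytes);                // CPU (host) memory
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        float *d_data;
        cudaMalloc(&d_data, bytes);                           // GPU (device) memory

        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // 1. copy input over PCIe
        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);            // 2. execute the GPU program
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 3. copy results back

        cudaFree(d_data);
        free(h_data);
        return 0;
    }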
GPU ARCHITECTURE

GPU Architecture: Two Main Components
Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 6 GB
- Bandwidth currently up to 150 GB/s for Quadro and Tesla products
- ECC on/off option for Quadro and Tesla products
Streaming Multiprocessors (SMs)
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches
[Diagram: chip layout with host interface, GigaThread engine, L2 cache, and DRAM interfaces surrounding the SMs.]

GPU Architecture - Fermi: Streaming Multiprocessor (SM)
- 32 CUDA cores per SM: 32 fp32 ops/clock, 16 fp64 ops/clock, 32 int32 ops/clock
- 2 warp schedulers, up to 1536 threads concurrently
- 4 special-function units
- 64 KB shared memory + L1 cache
- 32 K 32-bit registers
[Diagram: Fermi SM - instruction cache, two schedulers with dispatch units, register file, 32 CUDA cores, 16 load/store units, 4 special-function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache.]

GPU Architecture - Fermi: CUDA Core
- Floating-point and integer units
- IEEE 754-2008 floating-point standard
- Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move/compare unit
- Branch unit
[Diagram: a CUDA core - dispatch port, operand collector, FP unit, INT unit, result queue.]

Fermi vs. Kepler
[Diagram: Fermi SM next to Kepler SM - the Kepler SM has four warp schedulers with two dispatch units each, a 65,536 x 32-bit register file, and many more CUDA cores plus load/store and special-function units, with 64 KB shared memory / L1 cache.]

CUDA PROGRAMMING MODEL

OpenACC and CUDA
OpenACC on NVIDIA GPUs compiles to target the CUDA platform.
CUDA is a parallel computing platform and programming model invented by NVIDIA.

Anatomy of a CUDA Application
- Serial code executes in a Host (CPU) thread.
- Parallel code executes in many Device (GPU) threads across multiple processing elements.

CUDA Kernels
- The parallel portion of an application executes as a kernel; the entire GPU executes the kernel with many threads.
- CUDA threads are lightweight, switch quickly, and thousands execute simultaneously.
- The CPU (Host) executes functions; the GPU (Device) executes kernels.

CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads in parallel.
All threads execute the same code but can take different paths.
Each thread has an ID, used to select input/output data and to make control decisions:

    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;

CUDA Kernels: Subdivide into Blocks
- Threads are grouped into blocks.
- Blocks are grouped into a grid.
- A kernel is executed as a grid of blocks of threads.

Kernel Execution
- Each thread is executed by a CUDA core.
- Each thread block is executed by one SM and does not migrate; several concurrent blocks can reside on one SM, depending on the blocks' memory requirements and the SM's memory resources.
- Each kernel grid is executed on one CUDA-enabled device; multiple kernels can execute on a device at one time.
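As a sketch of how a thread finds its place in the grid (names and sizes here are illustrative, not from the slides), each thread combines its block ID and thread ID into a global index and handles one element:

    // Each thread handles one element, selected by its global thread ID.
    __global__ void add_one(const float *input, float *output, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index within the grid
        if (idx < n)                                      // guard: the last block may be partially full
            output[idx] = input[idx] + 1.0f;
    }

    // Host side: launch a grid of blocks of threads,
    // e.g. 256 threads per block and enough blocks to cover n elements:
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;
    //   add_one<<<blocks, threads>>>(d_in, d_out, n);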
Thread blocks allow cooperation
Threads may need to cooperate:
- Cooperatively load/store blocks of memory that they all use
- Share results with each other, or cooperate to produce a single result
- Synchronize with each other

Thread blocks allow scalability
Blocks can execute in any order, concurrently or sequentially.
This independence between blocks gives scalability: a kernel scales across any number of SMs.
[Diagram: the same 8-block kernel launch runs unchanged on a device with 2 SMs (blocks 0-7 distributed across SM 0 and SM 1) and on a device with 4 SMs (blocks 0-7 distributed across SM 0-SM 3).]

Warps
- Blocks are divided into 32-thread units called warps. The warp size is implementation specific and can change in the future.
- The SM creates, manages, schedules, and executes threads at warp granularity.
- Each warp consists of 32 threads with contiguous thread IDs.
- All threads in a warp execute the same instruction. If the threads of a warp diverge, the warp serially executes each branch path taken.
- When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into as few transactions as possible.

Memory hierarchy
- Thread: registers, local memory
- Block of threads: shared memory
- All blocks: global memory

OpenACC execution model
The OpenACC execution model has three levels: gang, worker, and vector.
This is intended to map onto an architecture that is a collection of processing elements (PEs), where each PE is multithreaded and each thread can execute vector instructions.
For GPUs, one possible mapping is gang = block, worker = warp, vector = threads within a warp. The actual mapping depends on what the compiler thinks is best for the problem.

Mapping OpenACC to CUDA threads and blocks

    #pragma acc kernels
    for( int i = 0; i < n; ++i )
        y[i] += a*x[i];

With no clauses, the compiler chooses the decomposition, e.g. 16 blocks of 256 threads each.

    #pragma acc kernels loop gang(100) vector(128)
    for( int i = 0; i < n; ++i )
        y[i] += a*x[i];

100 thread blocks, each with 128 threads; each thread executes one iteration of the loop (using kernels).

    #pragma acc parallel num_gangs(100) vector_length(128)
    {
        #pragma acc loop gang vector
        for( int i = 0; i < n; ++i )
            y[i] += a*x[i];
    }

100 thread blocks, each with 128 threads; each thread executes one iteration of the loop (using parallel).

Mapping OpenACC to CUDA threads and blocks

    #pragma acc parallel num_gangs(100)
    {
        for( int i = 0; i < n; ++i )
            y[i] += a*x[i];
    }

100 thread blocks, each with apparently 1 thread; each thread redundantly executes the whole loop.

    #pragma acc parallel num_gangs(100)
    {
        #pragma acc loop gang
        for( int i = 0; i < n; ++i )
            y[i] += a*x[i];
    }

The compiler can notice that only gangs are being created, so it might decide to create threads instead, say 2 thread blocks of 50 threads.

Mapping OpenACC to CUDA threads and blocks

    #pragma acc kernels loop gang(100) vector(128)
    for( int i = 0; i < n; ++i )
        y[i] += a*x[i];

100 thread blocks, each with 128 threads; each thread executes one iteration of the loop (using kernels).

    #pragma acc kernels loop gang(50) vector(128)
    for( int i = 0; i < n; ++i )
        y[i] += a*x[i];

50 thread blocks, each with 128 threads; each thread does two elements' worth of work.
Doing multiple iterations per thread can improve performance by amortizing the cost of setup, as in the sketch below.
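One plausible way to picture the gang(50) vector(128) mapping is the CUDA sketch below; the code the compiler actually generates will differ, and the kernel name and the assumption n = 12800 are made up for illustration.

    // Hypothetical picture of gang(50) vector(128) for n = 12800:
    // 50 blocks x 128 threads = 6400 threads, so each thread covers 2 elements.
    __global__ void saxpy_two_per_thread(int n, float a, const float *x, float *y)
    {
        int stride = gridDim.x * blockDim.x;               // 6400 threads in flight
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] += a * x[i];                              // each thread does n/6400 = 2 iterations
    }
    // launched as: saxpy_two_per_thread<<<50, 128>>>(n, a, d_x, d_y);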
Mapping multi-dimensional blocks and grids to OpenACC
A nested for loop generates multidimensional blocks and grids:

    #pragma acc kernels loop gang(100), vector(16)
    for( ... )
        #pragma acc loop gang(200), vector(32)
        for( ... )

The outer loop gives 100 blocks in the row/Y direction, with blocks 16 threads tall; the inner loop gives 200 blocks in the column/X direction, with blocks 32 threads wide.

EXERCISE 3

Applying this knowledge to Jacobi
Let's start by running the current code with -ta=nvidia,time:

    total: 13.874673 s
    Accelerator Kernel Timing data
    /usr/users/6/stariq/openacc/openacc-workshop/solutions/002-laplace2D-data/laplace2d.c
      main
        68: region entered 1000 times                  <- memcpy loop, taking 4.9 s of the 13.9 s total
            time(us): total=4903207 init=82 region=4903125 kernels=4852949 data=0
                      w/o init: total=4903125 max=5109 min=4813 avg=4903
            71: kernel launched 1000 times
                grid: [256x256]  block: [16x16]
                time(us): total=4852949 max=5004 min=4769 avg=4852
        56: region entered 1000 times                  <- main computation loop, taking 8.7 s of the 13.9 s total
            time(us): total=8701161 init=57 region=8701104 kernels=8365523 data=0
                      w/o init: total=8701104 max=8942 min=8638 avg=8701
            59: kernel launched 1000 times
                grid: [256x256]  block: [16x16]        <- suboptimal grid and block dimensions
                time(us): total=8222457 max=8310 min=8212 avg=8222
            63: kernel launched 1000 times
                grid: [1]  block: [256]
                time(us): total=143066 max=210 min=141 avg=143
        50: region entered 1 time                      <- enclosing while-loop data region: 13.7 s, nearly the entire execution time
            time(us): total=13874525 init=162566 region=13711959 data=64170
                      w/o init: total=13711959 max=13711959 min=13711959 avg=13711959

Exercise 3
Task: use this knowledge of the GPU architecture to improve performance by specifying gang and vector clauses.
- Start from the given laplace2d.c or laplace2d.f90 (your choice) in the 003-laplace2d-loop directory.
- Add gang and vector clauses to the inner loop, and experiment with different values:
      #pragma acc loop gang(G) vector(V)
Q: What speedup can you get? Versus 1 CPU core? Versus 6 CPU cores?
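For reference, the exercise code can be built and profiled with something like the commands below. This assumes the PGI compiler used in this workshop; the exact flags and compiler names on your system may differ.

    pgcc  -acc -ta=nvidia,time -Minfo=accel laplace2d.c   -o laplace2d
    pgf90 -acc -ta=nvidia,time -Minfo=accel laplace2d.f90 -o laplace2d
    ./laplace2d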
Exercise 3 Solution: OpenACC C

    #pragma acc data copy(A), create(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        #pragma acc kernels loop
        for( int j = 1; j < n-1; j++ )
        {
            #pragma acc loop gang(16) vector(32)
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] +
                                      A[j-1][i] + A[j+1][i] );
                error = max( error, abs(Anew[j][i] - A[j][i]) );
            }
        }

        #pragma acc kernels loop
        for( int j = 1; j < n-1; j++ )
        {
            #pragma acc loop gang(16) vector(32)
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }

The gang(16) vector(32) clauses leave the compiler to choose the Y dimension for grids and blocks; grids are 16 blocks wide and blocks are 32 threads wide.

Exercise 3 Solution: OpenACC Fortran

    !$acc data copy(A) create(Anew)
    do while ( err > tol .and. iter < iter_max )
      err = 0._fp_kind

      !$acc kernels
      do j = 1, m
        !$acc loop gang(16), vector(32)
        do i = 1, n
          Anew(i,j) = .25_fp_kind * ( A(i+1, j  ) + A(i-1, j  ) + &
                                      A(i  , j-1) + A(i  , j+1) )
          err = max( err, Anew(i,j) - A(i,j) )
        end do
      end do
      !$acc end kernels

      !$acc kernels loop
      do j = 1, m-2
        !$acc loop gang(16), vector(32)
        do i = 1, n-2
          A(i,j) = Anew(i,j)
        end do
      end do
      !$acc end kernels

      iter = iter + 1
    end do

As in the C version, the compiler is left to choose the Y dimension for grids and blocks; grids are 16 blocks wide and blocks are 32 threads wide.

Exercise 3: Performance
CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz
GPU: NVIDIA Tesla M2070

                               Execution Time (s)    Speedup
    CPU, 1 OpenMP thread            69.80              --
    CPU, 2 OpenMP threads           44.76             1.56x
    CPU, 4 OpenMP threads           39.59             1.76x
    CPU, 6 OpenMP threads           39.71             1.76x   (speedup vs. 1 CPU core)
    OpenACC GPU                     10.98             3.62x   (speedup vs. 6 CPU cores)

Note: the same code runs in 7.58 s on an NVIDIA Tesla M2090 GPU.

Run with -ta=nvidia,time:

    total: 11.135176 s
    /usr/users/6/stariq/openacc/openacc-workshop/solutions/003-laplace2D-loop/laplace2d.c
      main
        56: region entered 1000 times
            time(us): total=5568043 init=68 region=5567975 kernels=5223007 data=0
                      w/o init: total=5567975 max=6040 min=5464 avg=5567    <- performance improved from 8.7 s to 5.5 s
            60: kernel launched 1000 times
                grid: [16x512]  block: [32x8]    <- grid changed from [256x256] to [16x512], block from [16x16] to [32x8]
                time(us): total=5197462 max=5275 min=5131 avg=5197
            64: kernel launched 1000 times
                grid: [1]  block: [256]
                time(us): total=25545 max=119 min=24 avg=25

Explanation
- Setting the block width to 32: 32 is the warp size, so all the threads in a warp access contiguous elements.
- Setting the grid width to 16: this lets the compiler launch 8x fewer blocks, so each thread works on 8 output elements. As discussed earlier, this helps amortize the cost of setup for simple kernels.

Questions?