GPU COMPUTING
An Introduction
August 10, 2021
Luis Altenkort, Marius Neumann, Marcel Rodekamp, Christian Schmidt, Xin Wu
THE COMPETENCE NETWORK FOR HIGH-PERFORMANCE COMPUTING IN NRW

WHAT ARE GPUS AND WHY DO WE NEED THEM?
– A graphics processing unit (GPU) is a processor featuring a highly parallel structure, making it efficient at processing large blocks of data.
– Large data sets can be worked on by multiple cores. Two methods are commonly used:
  – multi-CPU (~ 10-100 cores)
  – GPU (~ 5000 cores)
– Where are GPUs used?
  – video/graphics applications
  – linear algebra
  – machine learning
  – general-purpose computing on GPUs (GPGPU)

CPU SCHEMATICS
[Figure: CPU schematic – four cores, each with ALUs, a control unit and an L1 cache; per-core L2 cache; shared L3 cache; off-chip DRAM]
– O(10) arithmetic logic units (ALUs) with a broad instruction set
– On chip:
  – sophisticated control units
  – local per-core L1 cache
  – per-core L2 cache
  – L3 cache
– Off chip: DRAM
– optimized for low latency
– versatile usage

GPU SCHEMATICS
[Figure: GPU schematic – many rows of ALUs, each row with a basic control unit and an L1 cache; per-block shared L2 cache; global off-chip DRAM]
– O(10^3) ALUs with a limited instruction set
– On chip:
  – basic control units
  – local per-thread L1 cache
  – per-block L2 cache
– Off chip: DRAM
– optimized for high throughput
– limited usage

HARDWARE SPECIFICATION
                        CPU                    GPU
  company               Intel                  Nvidia
  processor             Xeon Silver 4114       Tesla V100
  launch (year)         2017                   2017
  clock freq. (GHz)     2.20 - 3.00            1.4
  no. of cores          10                     5120
  cache (MB)            13.75                  6.0
  memory size (GB)      404                    32
  TDP (W)               85                     300
  SP peak (TFLOP/s)     > 0.7                  > 15.7
  bandwidth (GB/s)      4                      300

CODE STRUCTURE: CPU CODE
[Figure: the program code, including its computation-intensive functions, runs entirely on the (multi-)CPU]

CODE STRUCTURE: GPU CODE
[Figure: the program code runs on the (multi-)CPU; computation-intensive functions are offloaded to the GPU as kernels]

THE KERNEL
– The kernel spawns threads, blocks and grids:
  – Thread: smallest executing unit – an ALU with a limited instruction set, but many are available
  – Block: collection of threads – threads in a block share memory
  – Grid: collection of blocks – blocks do not share memory
[Figure: a 2D block of threads with indices (0,0) ... (2,3), and a 2D grid of blocks B(0,0) ... B(2,3)]

AN EXAMPLE: SAXPY
– SAXPY: Single-precision A times X Plus Y
– Compute z = a·x + y for vectors x, y of length d, i.e. zi = a·xi + yi for i = 1, ..., d
– Sequential:
  1. Compute a·x1 + y1
  2. Compute a·x2 + y2
  ...
  d. Compute a·xd + yd
– Parallel:
  1. Compute a·xi + yi for all i ∈ [1, d] at once
– With fewer ALUs than vector elements, the work proceeds in chunks:
  1. Compute a·x1 + y1 ... a·x4 + y4
  2. Compute a·x5 + y5 ... a·x8 + y8
  3. ...

TIMING
– What speedups do we expect for CPU and GPU?
– CPU:
  – negligible overhead
  – linear scaling
– GPU:
  – high overhead
  – faster than the CPU once the overhead is small compared to the problem size
[Figure: schematic log-log plot of SAXPY timings, time vs. problem size, for CPU and GPU]
TIMING
– Why is the GPU sometimes slower?
  – A GPU core is less powerful than a CPU core.
  – Data copies are required:
    1. Copy the data to the GPU via PCIe
    2. Execute the device program (the kernel)
    3. Copy the results back via PCIe
  – Arithmetic intensity: SAXPY performs too few FLOPs per byte of data moved ⇒ memory bound
[Figure: the vector elements x1 ... xd travel from CPU DRAM to GPU global DRAM via PCIe, are processed by the GPU's ALUs, and the result is copied back]

SEVERAL WAYS TO SAXPY ON GPUS
– Several languages and libraries we are going to look at:
  – cuBLAS
  – OpenACC
  – OpenMP
  – Thrust
  – CUDA C++
  – CUDA Fortran
  – CUDA Python
  – OpenCL
  – SYCL
  – oneAPI
  – Magma
  – Kokkos
  – Julia + CUDA.jl