GPU COMPUTING
An Introduction
August 10, 2021
Luis Altenkort, Marius Neumann, Marcel Rodekamp, Christian Schmidt, Xin Wu
THE COMPETENCE NETWORK FOR HIGH-PERFORMANCE COMPUTING IN NRW.
WHAT ARE GPUS AND WHY DO WE NEED THEM?
– A graphics processing unit (GPU) is a processor with a highly parallel structure, making it efficient at processing large blocks of data.
– Large data sets can be worked on by many cores. Two approaches are common:
– multi-CPU (∼ 10 − 100 cores)
– GPUs (∼ 5000 cores)
– Where are GPUs used?
– video/graphics applications
– linear algebra
– machine learning
– general-purpose GPU computing (GPGPU)
CPU SCHEMATICS
[Figure: CPU schematic — a few ALUs, each pair with a control unit and L1 cache, per-core L2 caches, a shared L3 cache, and off-chip DRAM]
– O(10) Arithmetic Logic Units (ALUs) with a broad instruction set
– On chip:
– sophisticated control units
– local per-core L1 cache
– per-core L2 cache
– shared L3 cache
– Off-chip DRAM
– optimized for low latency
– versatile usage
GPU SCHEMATICS
[Figure: GPU schematic — many simple control units with L1 caches, each feeding a wide row of ALUs; a per-block shared L2 cache; off-chip global DRAM]
– O(10³) ALUs with a limited instruction set
– On chip:
– basic control units
– local per-thread L1 cache
– per-block L2 cache
– Off-chip DRAM
– optimized for high throughput
– limited usage
HARDWARE SPECIFICATION
                      CPU                 GPU
company               Intel               Nvidia
processor             Xeon Silver 4114    Tesla V100
launch (year)         2017                2017
clock freq. (GHz)     2.20 – 3.00         1.4
no. of cores          10                  5120
cache (MB)            13.75               6.0
memory size (GB)      404                 32
TDP (W)               85                  300
SP peak (TFLOP/s)     > 0.7               > 15.7
bandwidth (GB/s)      4                   300
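The single-precision peak in the table follows from cores × clock × 2, since a fused multiply–add (FMA) counts as two floating-point operations per cycle. A quick sanity check with the V100 numbers (the ∼1.53 GHz boost clock is an assumption; the table lists the 1.4 GHz base clock):

```python
# SP peak = cores * clock * 2 flops (FMA = one multiply + one add per cycle)
cores = 5120
boost_clock = 1.53e9             # Hz; assumed boost clock (table shows 1.4 GHz base)
peak = cores * boost_clock * 2 / 1e12
print(round(peak, 1))            # 15.7 TFLOP/s, matching the table's "> 15.7"
```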
CODE STRUCTURE: CPU CODE
[Figure: the entire program, including the computation-intensive functions, runs on the (multi-)CPU and its cache/DRAM hierarchy]
CODE STRUCTURE: GPU CODE
[Figure: the main program runs on the (multi-)CPU; the computation-intensive functions are offloaded as a kernel to the GPU, which executes them on its many ALUs with per-block shared L2 cache and global DRAM]
THE KERNEL
– The kernel spawns threads, organized into blocks and grids
– Thread: smallest executing unit
– an ALU with a limited instruction set, but many are available
– Block: collection of threads
– threads within a block share memory
– Grid: collection of blocks
– blocks do not share memory
[Figure: threads (0,0) … (2,3) arranged in a 2D block; blocks B0,0 … B2,3 arranged in a 2D grid]
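The thread/block/grid hierarchy above assigns every thread a unique global index. A minimal sketch (plain Python emulating CUDA's blockIdx/threadIdx scheme; the 3×4 block and grid sizes are taken from the figure, the function name is ours):

```python
def global_thread_id(block_idx, thread_idx, block_dim, grid_dim_x):
    """Flatten a (blockIdx, threadIdx) pair into one global thread number."""
    bx, by = block_idx
    tx, ty = thread_idx
    bdx, bdy = block_dim
    block_linear = by * grid_dim_x + bx    # linearize the block within the grid
    thread_linear = ty * bdx + tx          # linearize the thread within its block
    return block_linear * (bdx * bdy) + thread_linear

# Every (block, thread) pair in a 3x4 grid of 3x4 blocks gets a distinct id:
ids = {global_thread_id((bx, by), (tx, ty), (3, 4), 3)
       for by in range(4) for bx in range(3)
       for ty in range(4) for tx in range(3)}
print(len(ids))   # 144 threads in total, all distinct
```

On the device this index is what each thread uses to pick "its" array element.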
AN EXAMPLE: SAXPY
– SAXPY: Single-precision A times X Plus Y
– Compute a · x⃗ + y⃗
[Figure: element-wise picture — a · x⃗ + y⃗ = (a · x1 + y1, a · x2 + y2, …, a · xd + yd)ᵀ]
AN EXAMPLE: SAXPY
[Figure: rows a · xi + yi highlighted one after the other, one per step]
– SAXPY: Single-precision A times X Plus Y
– Compute a · x⃗ + y⃗
– Sequential:
1. Compute a · x1 + y1
2. Compute a · x2 + y2
···
d. Compute a · xd + yd
AN EXAMPLE: SAXPY
[Figure: all rows a · xi + yi computed simultaneously in step 1]
– SAXPY: Single-precision A times X Plus Y
– Compute a · x⃗ + y⃗
– Sequential
– Parallel:
1. Compute a · xi + yi for all i ∈ [1, d] at once
AN EXAMPLE: SAXPY
– SAXPY: Single-precision A times X Plus Y
– Parallel, with fewer ALUs than vector entries (here four), the work proceeds in waves:
1. Compute a · x1 + y1, …, a · x4 + y4 at once
2. Compute a · x5 + y5, …, a · x8 + y8 at once
3. · · ·
[Figure: rows processed in groups of four, one group per step]
TIMING
– What speedups do we expect for CPU and GPU?
– CPU
– negligible overhead
– linear scaling with problem size
– GPU
– high overhead
– faster than the CPU once the overhead is small compared to the problem size
[Figure: schematic picture of SAXPY timings (log–log, time vs. problem size): the CPU curve grows linearly, while the GPU curve stays flat at small sizes (overhead-dominated) and overtakes the CPU at large ones]
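The crossover in the plot can be captured by a toy cost model: a fixed GPU launch/copy overhead plus a much higher GPU throughput (all numbers below are hypothetical, in arbitrary units):

```python
# Toy timing model: t_cpu = n / r_cpu,  t_gpu = overhead + n / r_gpu
def t_cpu(n, r_cpu=1.0):
    return n / r_cpu

def t_gpu(n, r_gpu=20.0, overhead=100.0):
    return overhead + n / r_gpu

# Crossover: overhead + n/r_gpu = n/r_cpu  =>  n* = overhead / (1/r_cpu - 1/r_gpu)
n_star = 100.0 / (1.0 - 1.0 / 20.0)
print(round(n_star, 1))         # ~105.3: below this size the CPU wins
print(t_cpu(10) < t_gpu(10))    # True: small problem, overhead dominates
print(t_cpu(1e4) > t_gpu(1e4))  # True: large problem, the GPU wins
```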
TIMING
– Why is the GPU sometimes slower?
– A GPU core is less performant than a CPU core
– Data copies between host and device are required
– Arithmetic intensity: SAXPY performs too few floating-point operations per byte of data moved
⇒ memory bound
[Figure: offloading workflow — 1. copy x1, x2, …, xd from CPU DRAM to GPU DRAM via PCI; 2. execute the device program on the GPU's ALUs; 3. copy the result back via PCI]
SEVERAL WAYS TO SAXPY ON GPUS
– Several approaches we are going to look at:
– cuBLAS
– OpenACC
– OpenMP
– Thrust
– CUDA C++
– CUDA Fortran
– CUDA Python
– OpenCL
– SYCL
– oneAPI
– MAGMA
– Kokkos
– Julia + CUDA.jl