Accelerators into Production
Hype or reality - or both?
Ole W. Saastad, Dr.Scient
UiO/USIT/UAV/ITF/FI
NEIC 2015 Conference, May 5-8, 2015
Accelerators are not new
Top 500 supercomputers
• Of the top 500 systems, 53 now use accelerators
• 4 of the top 10 use accelerators
• Ranking is by HPL (Linpack) performance
Accelerators in the Abel cluster
• NVIDIA Tesla K20x, Kepler GK110 arch
– 16 nodes with two each
– 32 GPUs in total
• Intel Xeon Phi, 5110P
– 4 nodes with two each
– 8 MIC systems in total
GPU performance K20Xm
[Chart: DGEMM performance GPU vs. CPU, Tesla K20X vs. Intel SB; Gflops/s vs. total matrix footprint (MiB): 91, 366, 1464, 3295, 5149; series: CUDA BLAS, MKL BLAS]
Double precision (64 bit): about 500 Gflops/s
SGEMM performance GPU vs. CPU
[Chart: Tesla K20X vs. Intel SB; Gflops/s vs. total matrix footprint (MiB): 45, 183, 732, 1647, 2575, 3708, 4577, 5538; series: CUDA BLAS, MKL BLAS]
Single precision (32 bit): about 1300 Gflops/s
Exploiting the GPUs for production
• Pre compiled applications
– NAMD, MrBayes, Beagle, LAMMPS etc
• CUDA libraries
– BLAS, Sparse matrices, FFT
• Compiler supporting accelerator directives
– PGI supports accelerator directives
– GCC from version 5.0
NAMD 2.9 and 2.10
• GPU enabled
• Easy to run:
  charmrun namd2 +idlepoll +p 2 ++local +devices 0,1 input.inp
[Chart: NAMD apoa1 benchmark, single node performance; standard compute node vs. GPU node]
Speedup: 122/39 = 3.1x
LAMMPS
• GPU enabled
• Easy to run:
  mpirun lmp_cuda.double.x -sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000 < in.lj.gpu
[Chart: Lennard-Jones potential benchmark, single node performance; standard compute node vs. accelerated node]
Speedup: 720/250 = 2.9x
Statistical package R
[Chart: Tesla K20X vs. Sandy Bridge (one core), CPU vs. GPU timings; tests: QR decomposition, correlation coefficients cor(X), linear regression, distance calculation, matrix multiplication X * Y, matrix cross product X' * X or X * X']
CUDA libraries – easy access
• Precompiled, just linking (only C language)
– BLAS
– Sparse
– FFT (even fftw interface)
– Random
CUDA libraries
From Fortran 90: call cublas_dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
Same syntax as standard dgemm
Compile and link: gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90
The interfaces hide the CUDA syntax.
Where NVIDIA provides interfaces, this is remarkably simple.
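A minimal driver sketch around the call above (assuming the thunking wrapper from fortran_thunking.o shown in the compile line; the matrix size and initialization are illustrative):

program dgemmdriver
  implicit none
  integer, parameter :: n = 4000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: alpha, beta

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0
  alpha = 1.0d0
  beta = 0.0d0

  ! The thunking wrapper copies a, b and c to the GPU, runs DGEMM there
  ! and copies the result back - no CUDA code in the user program.
  call cublas_dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)

  print *, 'c(1,1) =', c(1,1)
end program dgemmdriver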
CUDA libraries performance
Performance in Gflops/s, one SB processor (8 cores) versus one GPU:

N       Footprint [MiB]   CUDA BLAS   MKL BLAS
2000    91                3.29        171.61
4000    366               24.94       172.91
8000    1464              159.70      189.87
12000   3295              345.55      193.55
15000   5149              482.15      193.52

[Chart: DGEMM performance GPU vs. CPU, Tesla K20X vs. Intel SB; Gflops/s vs. total matrix footprint (MiB)]
Speedup: 482/193 = 2.5x
OpenACC – very easy to get started
Open accelerator initiative info
• www.openacc-standard.org
• www.pgroup.com
• en.wikipedia.org/wiki/OpenACC
• developer.nvidia.com/openacc
Open ACCelerator initiative
Directives inserted into your old code, very much like OpenMP.
Nothing could be simpler?
Compiler supporting OpenACC
• Portland (PGI), pgcc, pgfortran, pgf90
– Installed on Abel
• CAPS HMPP
– Not installed on Abel
– Commercial, rather expensive
• GCC
– Limited support in version 5.1.0
Compiler supporting OpenACC - Fortran code:

      SUBROUTINE DGEMM_acc
!$acc region
      DO 90 J = 1,N
          IF (BETA.EQ.ZERO) THEN
              DO 50 I = 1,M
                  C(I,J) = ZERO
   50         CONTINUE
          .........
   90 CONTINUE
!$acc end region

Compile and link: pgfortran -o dgemmtest.x -ta=nvidia,kepler dgemm.f dgemmtest.f90
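For a self-contained flavour of the same approach, a minimal sketch of an accelerated loop nest (!$acc kernels is the standard OpenACC counterpart of the older PGI !$acc region shown above; the matrix size is illustrative):

program acc_mxm
  implicit none
  integer, parameter :: n = 2000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0

  ! The compiler generates GPU code for the loop nest and handles the data movement.
!$acc kernels
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do
!$acc end kernels

  print *, 'c(1,1) =', c(1,1)
end program acc_mxm

Built the same way, e.g. pgfortran -acc -ta=nvidia,kepler -o acc_mxm.x acc_mxm.f90.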
Running accelerated code - dgemm F77 reference code
Performance in Gflops/s, accelerated F77 code vs. plain F77:

N       Footprint [MiB]   PGI Accel   PGI
2000    91                2.33        2.51
4000    366               9.48        2.17
8000    1464              14.03       2.21
12000   3295              14.09       2.20
15000   5149              11.85       1.79

[Chart: Accelerated F77 code vs. plain F77, Portland accelerator directives, dgemm; Gflops/s vs. total matrix footprint (MiB)]
Double precision: speedup 6.6x
[Chart: Accelerated F77 code vs. plain F77, Portland accelerator directives, sgemm; Gflops/s vs. total matrix footprint (MiB): 45, 183, 732, 1647, 2575, 3708, 4577, 5538]
Single precision: speedup 6.2x
CUDA language - NVIDIA
• CUDA stands for «Compute Unified Device Architecture». CUDA is a parallel computing architecture and C-based programming language for general purpose computing on NVIDIA GPUs
• Programming from scratch, special syntax for GPU
• Works only with NVIDIA
• Steep learning curve !
OpenCL - Open Compute Language
OpenCL language
• Open Compute Language
– Support for a range of processors incl x86-64
• An open standard supported by multiple vendors
• Complexity comparable to CUDA
• Performance comparable to CUDA
Intel Xeon Phi – MIC architecture
Outstanding performance
Theoretical performance:
Clock frequency 1.05 GHz
60 cores (x60)
8-wide double precision vector unit (x8)
FMA instruction (x2)
1.05*60*8*2 = 1008 Gflops/s
1 Tflops/s on a single PCIe card
MIC architecture, Knights Corner
• 60 physical cores, x86-64 in order execution
• 240 hardware threads
• 512 bits wide vector unit (8x64 bit or 16x32 bit floats)
• 8 GiB GDDR5 main memory in 4 banks
• Cache coherent memory (directory based, tag directories)
• Limited hardware prefetch
• Software prefetch important
MIC architecture, Knights Landing
Information circulating in the press:
• 60 (+ ?) physical cores, x86-64 binary compatible
• 240 hardware threads
• 512 bits wide vector unit (8x64 bit or 16x32 bit floats)
• 384 GiB DDR4 main memory in 6 banks
• 16-24 (??) GiB MCDRAM on chip
• Cache Coherent memory (directory based, TD)
Simple to program – X86-64 arch.
8 x double vector unit and FMA
Vector and FMA for M x M
Matrix multiplication
Typical line to compute A = B * C + D, for example this update loop:

do i = iminloc, imaxloc
   uold(i,j,in) = u(i,in) + (flux(i2,in) - flux(i1,in))*dtdx
end do

Easy to map to FMA and vector, since the lanes are independent:

A1 = B1 * C1 + D1
A2 = B2 * C2 + D2
A3 = B3 * C3 + D3
...
A8 = B8 * C8 + D8

All this in one instruction, VFMADDPD!
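The matrix multiplication inner loop follows the same pattern; a sketch (loop bounds are illustrative) of the line the compiler can turn into packed FMAs:

do j = 1, n
   do k = 1, n
      do i = 1, n
         ! Eight consecutive values of i map onto one VFMADDPD
         c(i,j) = a(i,k)*b(k,j) + c(i,j)
      end do
   end do
end do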
Benchmarks – offload Matmul MKL
MKL dgemm automatic offload, two SB processors and one Phi card.
[Chart: Gflops/s vs. percentage of work offloaded to the MIC (auto, 0, 50, 80, 90, 100); matrix footprints of 2288, 20599 and 57220 MiB]
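No source changes are needed for automatic offload; it is switched on from the environment around an ordinary MKL call (a sketch - the exact control variables depend on the MKL version):

! Run the unchanged host binary with automatic offload enabled, e.g.:
!   MKL_MIC_ENABLE=1 MKL_MIC_WORKDIVISION=0.8 ./dgemmtest.x
! The source stays a plain MKL call:
call dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)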
Benchmark – user fortran code
[Chart: MxM offloading, Fortran 90 code, double precision (64 bit); Gflops/s for host processors vs. co-processor at matrix memory footprints of 2288, 5149, 5859 and 6614 MiB]
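The user-code variant pushes a kernel to the card with Intel's offload directives; a minimal sketch (array names and clauses are illustrative, not the benchmark source):

! Offload the loop nest to the first MIC card; a and b are copied in,
! c is copied both ways since it is accumulated into.
!dir$ offload target(mic:0) in(a, b) inout(c)
!$omp parallel do private(i, k)
do j = 1, n
   do k = 1, n
      do i = 1, n
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
   end do
end do
!$omp end parallel do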
Easy to program, hard to exploit
• Same source code – no changes, same compiler
• 60 physical cores – one vector unit per core
• 240 hardware threads – at least 120 are needed for floating point work
• 8/16-wide vector unit – try to fill it all the time
• Fused multiply-add instruction – when can you use it?
• Cache Coherent memory – nice but has a cost
• OpenMP – threads – cc-memory
• MPI – uses shared memory communication
Easy to program - native
• Compile using Intel compilers
– icc -mmic -openmp
– ifort -mmic -openmp
– Other flags are like for Sandy Bridge
• Compile on the host node and launch on the MIC node
• Don't be fooled, it is not easy to get full performance!
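A minimal native sketch (hypothetical file name hello_mic.f90) that just reports the thread count:

program hello_mic
  use omp_lib                      ! gives omp_get_num_threads()
  implicit none
!$omp parallel
!$omp master
  print *, 'Running with', omp_get_num_threads(), 'threads'
!$omp end master
!$omp end parallel
end program hello_mic

Built as on the slide, ifort -mmic -openmp hello_mic.f90, the binary runs only on the coprocessor; copy it to the card and start it over ssh to the mic interface.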
Cost of accelerators
Pricing comparison: extra cost for the accelerator.
[Chart: relative node cost (0-200 %): standard node, standard + 2 x NVIDIA, standard + 2 x Phi]
So what happens now?
• NVIDIA / AMD GPUs or Intel MIC ?
• Software is a driver
• Applications another
• Compatibility with legacy programs
• Make it simple and you'll win !
• Price – gain versus cost
Accelerators - hype or production?