Accelerators into Production
Hype or reality - or both?
Ole W. Saastad, Dr.Scient
UiO/USIT/UAV/ITF/FI
NEIC 2015 Conference, May 5-8, 2015
Accelerators are not new
Top 500 supercomputers
• Of the top 500 systems, 53 now use accelerators
• 4 of the top 10 use accelerators
• Ranking is by HPL (Linpack) performance
Accelerators in the Abel cluster
• NVIDIA Tesla K20x, Kepler GK110 arch
– 16 nodes with two each
– 32 GPUs in total
• Intel Xeon Phi, 5110P
– 4 nodes with two each
– 8 MIC systems in total
GPU performance K20Xm
[Chart: DGEMM performance GPU vs. CPU, Tesla K20X vs. Intel SB; Gflops/s vs. total matrix footprint (MiB): 91, 366, 1464, 3295, 5149; series: CUDA BLAS, MKL BLAS]
Double precision (64 bit): about 500 Gflops/s
SGEMM performance GPU vs. CPU
[Chart: Tesla K20X vs. Intel SB; Gflops/s vs. total matrix footprint (MiB): 45, 183, 732, 1647, 2575, 3708, 4577, 5538; series: CUDA BLAS, MKL BLAS]
Single precision (32 bit): about 1300 Gflops/s
Exploiting the GPUs for production
• Pre compiled applications
– NAMD, MrBayes, Beagle, LAMMPS etc
• CUDA libraries
– BLAS, Sparse matrices, FFT
• Compiler supporting accelerator directives
– PGI supports accelerator directives
– GCC from version 5.0
NAMD 2.9 and 2.10
• GPU enabled
• Easy to run:
  charmrun namd2 +idlepoll +p 2 ++local +devices 0,1 input.inp
[Chart: NAMD apoa1 benchmark, single node performance; standard compute node vs. GPU node]
Speedup: 122/39 = 3.1x
LAMMPS
• GPU enabled
• Easy to run:
  mpirun lmp_cuda.double.x -sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000 < in.lj.gpu
[Chart: Lennard-Jones potential benchmark, single node performance; standard compute node vs. accelerated node]
Speedup: 720/250 = 2.9x
Statistical package R
[Chart: Tesla K20X vs. Sandy Bridge (one core), CPU vs. GPU timings; tests: QR decomposition, correlation coefficients cor(X), linear regression, distance calculation, matrix multiplication X * Y, matrix cross product X' * X or X * X']
CUDA libraries – easy access
• Precompiled, just linking (only C language)
– BLAS
– Sparse
– FFT (even fftw interface)
– Random
CUDA libraries
From Fortran 90: call cublas_dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
Same syntax as standard dgemm
Compile and link: gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90
The interfaces hide the CUDA syntax.
Where NVIDIA provides interfaces, this is remarkably simple.
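A minimal driver sketch around the call above (assuming the thunking wrapper from fortran_thunking.o shown in the compile line; the matrix size and initialization are illustrative):

program dgemmdriver
  implicit none
  integer, parameter :: n = 4000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: alpha, beta

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0
  alpha = 1.0d0
  beta = 0.0d0

  ! The thunking wrapper copies a, b and c to the GPU, runs DGEMM there
  ! and copies the result back - no CUDA code in the user program.
  call cublas_dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)

  print *, 'c(1,1) =', c(1,1)
end program dgemmdriver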
CUDA libraries performance
Performance in Gflops/s, one SB processor (8 cores) versus one GPU:

N       Footprint [MiB]   CUDA BLAS   MKL BLAS
2000    91                3.29        171.61
4000    366               24.94       172.91
8000    1464              159.70      189.87
12000   3295              345.55      193.55
15000   5149              482.15      193.52

[Chart: DGEMM performance GPU vs. CPU, Tesla K20X vs. Intel SB; Gflops/s vs. total matrix footprint (MiB)]
Speedup: 482/193 = 2.5x
OpenACC – very easy to get started
Open accelerator initiative info
• www.openacc-standard.org
• www.pgroup.com
• en.wikipedia.org/wiki/OpenACC
• developer.nvidia.com/openacc
Open ACCelerator initiative
Directives inserted into your old code, very much like OpenMP.
Nothing could be simpler?
Compiler supporting OpenACC
• Portland (PGI), pgcc, pgfortran, pgf90
– Installed on Abel
• CAPS HMPP
– Not installed on Abel
– Commercial, rather expensive
• GCC
– Limited support in version 5.1.0
Compiler supporting OpenACC - Fortran code:

      SUBROUTINE DGEMM_acc
!$acc region
      DO 90 J = 1,N
          IF (BETA.EQ.ZERO) THEN
              DO 50 I = 1,M
                  C(I,J) = ZERO
   50         CONTINUE
          .........
   90 CONTINUE
!$acc end region

Compile and link: pgfortran -o dgemmtest.x -ta=nvidia,kepler dgemm.f dgemmtest.f90
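For a self-contained flavour of the same approach, a minimal sketch of an accelerated loop nest (!$acc kernels is the standard OpenACC counterpart of the older PGI !$acc region shown above; the matrix size is illustrative):

program acc_mxm
  implicit none
  integer, parameter :: n = 2000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0

  ! The compiler generates GPU code for the loop nest and handles the data movement.
!$acc kernels
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do
!$acc end kernels

  print *, 'c(1,1) =', c(1,1)
end program acc_mxm

Built the same way, e.g. pgfortran -acc -ta=nvidia,kepler -o acc_mxm.x acc_mxm.f90.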
Running accelerated code - dgemm F77 reference code
Performance in Gflops/s, accelerated F77 code vs. plain F77:

N       Footprint [MiB]   PGI Accel   PGI
2000    91                2.33        2.51
4000    366               9.48        2.17
8000    1464              14.03       2.21
12000   3295              14.09       2.20
15000   5149              11.85       1.79

[Chart: Accelerated F77 code vs. plain F77, Portland accelerator directives, dgemm; Gflops/s vs. total matrix footprint (MiB)]
Double precision: speedup 6.6x
[Chart: Accelerated F77 code vs. plain F77, Portland accelerator directives, sgemm; Gflops/s vs. total matrix footprint (MiB): 45, 183, 732, 1647, 2575, 3708, 4577, 5538]
Single precision: speedup 6.2x
CUDA language - NVIDIA
• CUDA stands for «Compute Unified Device Architecture». CUDA is a parallel computing architecture and C-based programming language for general purpose computing on NVIDIA GPUs
• Programming from scratch, special syntax for GPU
• Works only with NVIDIA
• Steep learning curve !
OpenCL - Open Compute Language
OpenCL language
• Open Compute Language
– Support for a range of processors incl x86-64
• An open standard supported by multiple vendors
• Complexity comparable to CUDA
• Performance comparable to CUDA
Intel Xeon Phi – MIC architecture
Outstanding performance
Theoretical performance:
Clock frequency 1.05 GHz
60 cores (x60)
8-wide double precision vector unit (x8)
FMA instruction (x2)
1.05*60*8*2 = 1008 Gflops/s
1 Tflops/s on a single PCIe card
MIC architecture, Knights Corner
• 60 physical cores, x86-64 in order execution
• 240 hardware threads
• 512 bits wide vector unit (8x64 bit or 16x32 bit floats)
• 8 GiB GDDR5 main memory in 4 banks
• Cache coherent memory (directory based, tag directories)
• Limited hardware prefetch
• Software prefetch important
MIC architecture, Knights Landing
Information circulating in the press:
• 60 (+ ?) physical cores, x86-64 binary compatible
• 240 hardware threads
• 512 bits wide vector unit (8x64 bit or 16x32 bit floats)
• 384 GiB DDR4 main memory in 6 banks
• 16-24 (??) GiB MCDRAM on chip
• Cache Coherent memory (directory based, TD)
Simple to program – X86-64 arch.
8 x double vector unit and FMA
Vector and FMA for M x M
Matrix multiplication
Typical line to compute A = B * C + D, for example this update loop:

do i = iminloc, imaxloc
   uold(i,j,in) = u(i,in) + (flux(i2,in) - flux(i1,in))*dtdx
end do

Easy to map to FMA and vector, since the lanes are independent:

A1 = B1 * C1 + D1
A2 = B2 * C2 + D2
A3 = B3 * C3 + D3
...
A8 = B8 * C8 + D8

All this in one instruction, VFMADDPD!
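The matrix multiplication inner loop follows the same pattern; a sketch (loop bounds are illustrative) of the line the compiler can turn into packed FMAs:

do j = 1, n
   do k = 1, n
      do i = 1, n
         ! Eight consecutive values of i map onto one VFMADDPD
         c(i,j) = a(i,k)*b(k,j) + c(i,j)
      end do
   end do
end do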
Benchmarks – offload Matmul MKL
MKL dgemm automatic offload, two SB processors and one Phi card.
[Chart: Gflops/s vs. percentage of work offloaded to the MIC (auto, 0, 50, 80, 90, 100); matrix footprints of 2288, 20599 and 57220 MiB]
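No source changes are needed for automatic offload; it is switched on from the environment around an ordinary MKL call (a sketch - the exact control variables depend on the MKL version):

! Run the unchanged host binary with automatic offload enabled, e.g.:
!   MKL_MIC_ENABLE=1 MKL_MIC_WORKDIVISION=0.8 ./dgemmtest.x
! The source stays a plain MKL call:
call dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)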
Benchmark – user fortran code
[Chart: MxM offloading, Fortran 90 code, double precision (64 bit); Gflops/s for host processors vs. co-processor at matrix memory footprints of 2288, 5149, 5859 and 6614 MiB]
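The user-code variant pushes a kernel to the card with Intel's offload directives; a minimal sketch (array names and clauses are illustrative, not the benchmark source):

! Offload the loop nest to the first MIC card; a and b are copied in,
! c is copied both ways since it is accumulated into.
!dir$ offload target(mic:0) in(a, b) inout(c)
!$omp parallel do private(i, k)
do j = 1, n
   do k = 1, n
      do i = 1, n
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
   end do
end do
!$omp end parallel do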
Easy to program, hard to exploit
• Same source code – no changes, same compiler
• 60 physical cores – one vector unit per core
• 240 hardware threads – at least 120 are needed for floating point work
• 8/16-wide vector unit – try to fill it all the time
• Fused multiply-add instruction – when can you use it?
• Cache Coherent memory – nice but has a cost
• OpenMP – threads – cc-memory
• MPI – uses shared memory communication
Easy to program - native
• Compile using Intel compilers
– icc -mmic -openmp
– ifort -mmic -openmp
– Other flags are like for Sandy Bridge
• Compile on the host node and launch on the MIC node
• Don't be fooled, it is not easy to get full performance!
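A minimal native sketch (hypothetical file name hello_mic.f90) that just reports the thread count:

program hello_mic
  use omp_lib                      ! gives omp_get_num_threads()
  implicit none
!$omp parallel
!$omp master
  print *, 'Running with', omp_get_num_threads(), 'threads'
!$omp end master
!$omp end parallel
end program hello_mic

Built as on the slide, ifort -mmic -openmp hello_mic.f90, the binary runs only on the coprocessor; copy it to the card and start it over ssh to the mic interface.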
Cost of accelerators
Pricing comparison: extra cost for the accelerator.
[Chart: relative node cost (0-200 %): standard node, standard + 2 x NVIDIA, standard + 2 x Phi]
So what happens now?
• NVIDIA / AMD GPUs or Intel MIC ?
• Software is a driver
• Applications another
• Compatibility with legacy programs
• Make it simple and you'll win !
• Price – gain versus cost
Accelerators - hype or production?