
Accelerators into Production

Hype or reality? Or both?

Ole W. Saastad, Dr.Scient

UiO/USIT/UAV/ITF/FI

NEIC 2015 Conference, May 5-8, 2015

Accelerators are not new


Top 500 supercomputers

• Of the top 500 systems, 53 now use accelerators

• 4 of the top 10 use accelerators

• HPL (Linpack) performance is the ranking metric


Accelerators in the Abel cluster

• NVIDIA Tesla K20x, Kepler GK110 arch

– 16 nodes with two each

– 32 GPUs in total

• Intel Xeon Phi, 5110P

– 4 nodes with two each

– 8 MIC systems in total


GPU performance K20Xm

[Chart: DGEMM performance, GPU vs. CPU – Tesla K20X (CUDA BLAS) vs. Intel SB (MKL BLAS), Gflops/s against total matrix footprint of 91 to 5149 MiB]

Double precision (64-bit): about 500 Gflops/s

SGEMM performance, GPU vs. CPU

[Chart: Tesla K20X (CUDA BLAS) vs. Intel SB (MKL BLAS), Gflops/s against total matrix footprint of 45 to 5538 MiB]

Single precision (32-bit): about 1300 Gflops/s

Exploiting the GPUs for production

• Precompiled applications

– NAMD, MrBayes, Beagle, LAMMPS etc

• CUDA libraries

– BLAS, Sparse matrices, FFT

• Compiler supporting accelerator directives

– PGI supports accelerator directives

– GCC in version 5.0


NAMD 2.9 and 2.10

• GPU enabled

• Easy to run:
  charmrun namd2 +idlepoll +p 2 ++local +devices 0,1 input.inp

NAMD apoa1 benchmark, single-node performance

[Chart: standard compute node vs. GPU node]

Speedup: 122/39 = 3.1x

LAMMPS

• GPU enabled

• Easy to run:
  mpirun lmp_cuda.double.x -sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000 in.lj.gpu

Lennard-Jones potential benchmark, single-node performance

[Chart: standard compute node vs. accelerated node]

Speedup: 720/250 = 2.9x

Statistical package R

Tesla K20X vs. Sandy Bridge (one core)

[Chart: CPU vs. GPU for the tests: QR decomposition, correlation coefficients cor(X), linear regression, distance calculation, matrix multiplication X * Y, and matrix cross product X' * X or X * X']

CUDA libraries – easy access

• Precompiled, just linking (only C language)

– BLAS

– Sparse

– FFT (even fftw interface)

– Random


CUDA libraries

From Fortran 90:

  call cublas_dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)

Same syntax as the standard dgemm.

Compile and link:

  gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90

The interfaces hide the CUDA syntax. Where interfaces are provided by NVIDIA, this is remarkably simple.
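As an illustration, a minimal sketch of what such a dgemmdriver.f90 might look like (an assumed example, not the actual driver used in the tests; cublas_init and cublas_shutdown come from NVIDIA's thunking wrapper):

  ! Minimal sketch of a driver for the thunking cuBLAS interface above.
  ! Assumed example: the matrix size and data are illustrative only.
  program dgemmdriver
    implicit none
    integer, parameter :: n = 2000
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    real(8) :: alpha, beta
    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)
    call random_number(b)
    c = 0.0d0
    alpha = 1.0d0
    beta = 0.0d0
    call cublas_init                    ! set up the cuBLAS context
    ! Same calling sequence as the standard BLAS dgemm
    call cublas_dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)
    call cublas_shutdown
    print *, 'c(1,1) =', c(1,1)
  end program dgemmdriver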


CUDA libraries performance

Performance in Gflops/s, one SB processor (8 cores) versus one GPU:

  N            2000     4000     8000      12000     15000
  Footprint    91 MiB   366 MiB  1464 MiB  3295 MiB  5149 MiB
  CUDA BLAS    3.29     24.94    159.7     345.55    482.15
  MKL BLAS     171.61   172.91   189.87    193.55    193.52

[Chart: DGEMM performance, GPU vs. CPU – Tesla K20X (CUDA BLAS) vs. Intel SB (MKL BLAS), against total matrix footprint in MiB]

Speedup: 482/193 = 2.5x

Open ACC – very easy to get started


Open accelerator initiative info

• www.openacc-standard.org

• www.pgroup.com

• en.wikipedia.org/wiki/OpenACC

• developer.nvidia.com/openacc


Open ACCelerator initiative

Directives are inserted into your old code, very much like OpenMP.

Nothing could be simpler?

Compiler supporting OpenACC

• Portland (PGI), pgcc, pgfortran, pgf90

– Installed on Abel

• CAPS HMPP

– Not installed on Abel

– Commercial, rather expensive

• GCC

– Limited support in version 5.1.0


Compiler supporting OpenACC – Fortran code:

SUBROUTINE DGEMM_acc

!$acc region

   DO 90 J = 1,N

     IF (BETA.EQ.ZERO) THEN

      DO 50 I = 1,M

        C(I,J) = ZERO

50    CONTINUE

   .........   

90 CONTINUE

!$acc end region  

Compile and link:
  pgfortran -o dgemm-test.x -ta=nvidia,kepler dgemm.f dgemmtest.f90
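For comparison, a minimal self-contained sketch (an assumed example, not the dgemm code above) using the standard !$acc kernels construct, which plays the same role as the PGI-specific !$acc region; it compiles along the same lines, e.g. with pgfortran -acc -ta=nvidia,kepler:

  ! Minimal OpenACC sketch; the loop and data are illustrative only.
  program acc_axpy
    implicit none
    integer, parameter :: n = 1000000
    integer :: i
    real(8), allocatable :: x(:), y(:)
    real(8) :: a
    allocate(x(n), y(n))
    a = 2.0d0
    x = 1.0d0
    y = 0.5d0
  !$acc kernels
    do i = 1, n
       y(i) = a * x(i) + y(i)
    end do
  !$acc end kernels
    print *, 'y(1) =', y(1), ' (expected 2.5)'
  end program acc_axpy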


Running accelerated code: dgemm F77 reference code

Performance in Gflops/s, accelerated F77 code vs. plain F77 (Portland accelerator directives):

  N            2000    4000    8000      12000     15000
  Footprint    91 MiB  366 MiB 1464 MiB  3295 MiB  5149 MiB
  PGI Accel    2.33    9.48    14.03     14.09     11.85
  PGI          2.51    2.17    2.21      2.20      1.79

[Chart: dgemm – accelerated F77 code vs. plain F77, total matrix footprint 91 to 5149 MiB]

Double precision: speedup 6.6x

[Chart: sgemm – accelerated F77 code vs. plain F77, total matrix footprint 45 to 5538 MiB]

Single precision: speedup 6.2x

CUDA language - NVIDIA

• CUDA stands for "Compute Unified Device Architecture". CUDA is a parallel computing architecture and a C-based programming language for general-purpose computing on NVIDIA GPUs.

• Programming from scratch, special syntax for GPU

• Works only with NVIDIA

• Steep learning curve !


OpenCL - Open Computing Language


OpenCL language

• Open Computing Language

– Support for a range of processors incl x86-64

• An open standard supported by multiple vendors

• Complexity comparable to CUDA

• Performance comparable to CUDA


Intel Xeon Phi – MIC architecture


Outstanding performance

Theoretical performance:

Clock frequency 1.05 GHz

60 cores (x60)

8-wide double precision vector unit (x8)

FMA instruction (x2)

1.05*60*8*2 = 1008 Gflops/s

1 Tflops/s on a single PCIe card


MIC architecture, Knights Corner

• 60 physical cores, x86-64, in-order execution

• 240 hardware threads

• 512 bits wide vector unit (8x64 bit or 16x32 bit floats)

• 8 GiB GDDR5 main memory in 4 banks

• Cache Coherent memory (directory based, TD)

• Limited hardware prefetch

• Software prefetch important


MIC architecture, Knights Landing

Information circulating in the press:

• 60 (+ ?) physical cores, x86-64 binary compatible

• 240 hardware threads

• 512 bits wide vector unit (8x64 bit or 16x32 bit floats)

• 384 GiB DDR4 main memory in 6 banks

• 16-24 (??) GiB MCDRAM on chip

• Cache Coherent memory (directory based, TD)


Simple to program – X86-64 arch.


8 x double vector unit and FMA


Vector and FMA for M x M matrix multiplication

Typical line to compute:

  do i = iminloc, imaxloc
     uold(i,j,in) = u(i,in) + (flux(i-2,in) - flux(i-1,in)) * dtdx
  end do

Easy to map to FMA and vector, since:

  A1 = B1 * C1 + D1
  A2 = B2 * C2 + D2
  A3 = B3 * C3 + D3
  ...
  A8 = B8 * C8 + D8

All this in one instruction: VFMADDPD!


Benchmarks – offload, Matmul MKL

MKL dgemm automatic offload: two SB processors, one Phi card

[Chart: Gflops/s vs. percentage of work offloaded to the MIC (auto, 0, 50, 80, 90, 100 %), for total matrix footprints of 2288, 20599 and 57220 MiB]
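As a sketch of how automatic offload is typically used (assumed usage: no source changes, offload enabled through the environment, e.g. MKL_MIC_ENABLE=1, with MKL_MIC_WORKDIVISION steering how much work goes to the card):

  ! Ordinary host-side MKL code; with automatic offload enabled in the
  ! environment, MKL may run part of this dgemm on the Phi card.
  ! Assumed example; link against MKL (e.g. ifort -mkl).
  program mkl_auto_offload
    implicit none
    integer, parameter :: n = 8000
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)
    call random_number(b)
    c = 0.0d0
    ! Plain dgemm call; the offload decision is made by MKL at run time
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    print *, 'c(1,1) =', c(1,1)
  end program mkl_auto_offload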


Benchmark – user Fortran code

MxM offloading: Fortran 90 code, double precision (64-bit)

[Chart: host processors vs. co-processor, for matrix memory footprints of 2288, 5149, 5859 and 6614 MiB]
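A hedged sketch of what explicit offload of such a Fortran matrix multiply can look like with Intel's offload directives (assumed example; the directive clauses, matrix size and loop order are illustrative, not the benchmark code itself):

  ! Offload a triple loop to the coprocessor and run it with OpenMP there.
  ! Assumed example, compiled with the Intel compiler on the host.
  program mxm_offload
    implicit none
    integer, parameter :: n = 2000
    real(8) :: a(n,n), b(n,n), c(n,n)
    integer :: i, j, k
    call random_number(a)
    call random_number(b)
    c = 0.0d0
  !dir$ offload begin target(mic:0) in(a, b) out(c)
  !$omp parallel do private(i, k)
    do j = 1, n
       do k = 1, n
          do i = 1, n
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do
  !$omp end parallel do
  !dir$ end offload
    print *, 'c(1,1) =', c(1,1)
  end program mxm_offload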


Easy to program, hard to exploit

• Same source code – no changes, same compiler

• 60 physical cores – one vector unit per core

• 240 hardware threads – at least 120 are needed for floating-point work

• 8/16-element wide vector unit – try to fill it all the time

• Fused multiply-add instruction – when can you use it?

• Cache Coherent memory – nice but has a cost

• OpenMP – threads – cc-memory

• MPI – uses shared memory communication


Easy to program - native

• Compile using Intel compilers

– icc -mmic -openmp

– ifort -mmic -openmp

– Other flags are as for Sandy Bridge (see the sketch below)

• Compile on the host node and launch on the MIC node

• Don't be fooled, not easy to get full performance !
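A minimal sketch of the kind of code that runs natively (assumed example; the same source compiles for the host with ifort -openmp and for the MIC with ifort -mmic -openmp):

  ! Simple OpenMP reduction; identical source for host and native MIC runs.
  program omp_native
    use omp_lib
    implicit none
    integer :: i, nthreads
    real(8) :: s
    s = 0.0d0
    nthreads = 0
  !$omp parallel
  !$omp single
    nthreads = omp_get_num_threads()
  !$omp end single
  !$omp do reduction(+:s)
    do i = 1, 100000000
       s = s + 1.0d0 / real(i, 8)
    end do
  !$omp end do
  !$omp end parallel
    print *, 'threads =', nthreads, '  sum =', s
  end program omp_native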


Cost of accelerators

Pricing comparison – extra cost for the accelerator

[Chart: relative node price (standard node = 100 %) for a standard node, standard + 2 x NVIDIA, and standard + 2 x Phi]

So what happens now?

• NVIDIA / AMD GPUs or Intel MIC ?

• Software is a driver

• Applications are another

• Compatibility with legacy programs

• Make it simple and you'll win!

• Price – gain versus cost


Accelerators: hype or production?

