
A Performance Study of Solving a Large Dense Matrix for Radiation Heat Transfer

Blaise DeCotes¹, Zhiang Hu², Shiquan Su³, Eduardo D'Azevedo³, Kwai Wong¹

¹University of Tennessee, Knoxville, ²Chinese University of Hong Kong, ³Oak Ridge National Laboratory

OVERVIEW

Solve a large-scale radiation heat transfer problem in parallel

Showcase a parallel finite-to-finite area view factor code

Evaluate the performance of solving a large dense matrix on Keeneland using MAGMA and ScaLAPACK


PARALLEL VIEW3D

Modify View3d, an open-source serial code

Implement pre-detection and elimination of potential obstructions

Use finite-to-finite and differential-to-finite view factor calculations

Order of computation improved from O(N²*X) to O(N²*X/M), where N = # of surfaces, X = # of obstructing surfaces, and M = # of processors (see the sketch below)
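The parallelization pattern behind the O(N²*X/M) cost is roughly the following minimal sketch; this is not the actual parallel View3d code, and compute_viewfactor and the row-striped work distribution are illustrative assumptions:

    /* Minimal sketch: distribute the N x N view factor pairs over M MPI ranks.
     * compute_viewfactor() stands in for the View3d routine that also performs
     * the obstruction tests against the X potentially obstructing surfaces. */
    #include <mpi.h>
    #include <stddef.h>

    double compute_viewfactor(int i, int j);   /* hypothetical placeholder */

    void viewfactors_parallel(int n, double *F /* n*n entries, zero-initialized */)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank evaluates a strided subset of the rows: about N/M rows each. */
        for (int i = rank; i < n; i += nprocs)
            for (int j = 0; j < n; j++)
                if (i != j)
                    F[(size_t)i * n + j] = compute_viewfactor(i, j);

        /* Combine the disjoint pieces so every rank holds the full matrix. */
        MPI_Allreduce(MPI_IN_PLACE, F, n * n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }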

Figure: Parallel view factor calculations on Kraken. Time for the view factor computation plus the potential obstruction computation versus number of processors, for the 18,944-surface CUBI, 20,000-surface L-Shape, and 4-CUBI test cases.

SCALAPACK WITH CUBLAS

Produce LU factors and pivot vector compatible with ScaLAPACK PZGETRF (double complex version only)

Uses a left-looking out-of-core algorithm to solve problems larger than the available device memory on the GPU

Uses ScaLAPACK PZGETRF for factorization of the narrow column panel and CUBLAS v2 for computation on the GPU

Algorithm assumes access to one GPU per MPI task; the amount of device memory used is set by the user

Data on the GPU must be transferred to buffers on the CPU to be communicated by MPI (see the sketch below)
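A minimal sketch of this host-staged communication, assuming a double precision panel and a simple broadcast; the actual out-of-core driver's buffer management and routine names are not shown on the poster:

    /* Stage a device-resident panel through a pinned host buffer for MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void bcast_panel_from_gpu(double *d_panel, size_t count, int root, MPI_Comm comm)
    {
        int rank;
        double *h_buf = NULL;
        MPI_Comm_rank(comm, &rank);
        cudaMallocHost((void **)&h_buf, count * sizeof(double));  /* pinned host buffer */

        if (rank == root)   /* device -> host copy on the owner of the panel */
            cudaMemcpy(h_buf, d_panel, count * sizeof(double), cudaMemcpyDeviceToHost);

        MPI_Bcast(h_buf, (int)count, MPI_DOUBLE, root, comm);     /* MPI sees only host memory */

        /* host -> device copy so every rank ends with the panel on its GPU */
        cudaMemcpy(d_panel, h_buf, count * sizeof(double), cudaMemcpyHostToDevice);
        cudaFreeHost(h_buf);
    }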

EXTENSIONS MADE TO SCALAPACK

Tests on Keeneland were successful at using multiple GPUs across multiple nodes

Code extended to multiple GPUs by allocating a cublasHandle_t based on the MPI_Comm_rank of the process modulo the number of CUDA-enabled devices present (sketched after this list)

Extend to support double precision real arithmetic (the original out-of-core code was double complex only)

Rewrite portions of the source code to translate between the data types (illustrated after this list):

ScaLAPACK PZxxxx calls changed to PDxxxx

cuDoubleComplex changed to double

Function call changed from:

PDGETRF( M, N, MEM( IPA ), 1, 1, DESCA, MEM( IPPIV ), INFO )

to:

PDGETRF_OOC2( M, N, MEM( IPA ), 1, 1, DESCA, MEM( IPPIV ), MEMSIZE, INFO )
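A minimal sketch of the multi-GPU handle allocation described above; the function name is illustrative and error checking is omitted:

    /* Bind each MPI task to a GPU chosen as rank modulo the device count,
     * then create the CUBLAS v2 handle that task will use. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    cublasHandle_t create_handle_for_rank(void)
    {
        int rank = 0, ndev = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);

        cudaSetDevice(rank % ndev);    /* e.g. 3 Tesla M2090 GPUs per Keeneland node */

        cublasHandle_t handle;
        cublasCreate(&handle);         /* one handle per MPI task */
        return handle;
    }

On the CUBLAS side, the double precision extension amounts to swapping the complex routines and types for their real counterparts; these particular GEMM update calls are an assumed illustration, not code taken from the poster:

    #include <cublas_v2.h>
    #include <cuComplex.h>

    /* Double complex update (original PZGETRF-compatible path). */
    void update_z(cublasHandle_t h, int m, int n, int k,
                  const cuDoubleComplex *dA, int lda,
                  const cuDoubleComplex *dB, int ldb,
                  cuDoubleComplex *dC, int ldc)
    {
        const cuDoubleComplex alpha = make_cuDoubleComplex(-1.0, 0.0);
        const cuDoubleComplex beta  = make_cuDoubleComplex( 1.0, 0.0);
        cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    }

    /* Double precision real update (PDGETRF path): cuDoubleComplex -> double. */
    void update_d(cublasHandle_t h, int m, int n, int k,
                  const double *dA, int lda,
                  const double *dB, int ldb,
                  double *dC, int ldc)
    {
        const double alpha = -1.0, beta = 1.0;
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    }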

PERFORMANCE -- SCALAPACK

Experiments were performed on Keeneland; each node has 12 CPU cores at 2.8 GHz and 3 Tesla M2090 GPUs

All performance measures are in Gflop/s
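Assuming the standard LU operation count is used to compute these rates (a common convention, not stated explicitly on the poster), the reported numbers correspond to

    \mathrm{Gflop/s} \;\approx\; \frac{(2/3)\,N^{3}}{t_{\mathrm{factor}} \times 10^{9}}

where t_factor is the factorization time in seconds.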

Table 1: Comparing in-node and multi-node with the same number of CPUs across Keeneland node configurations (all tests with 3 GPUs); performance in Gflop/s

Problem Size | 1 node, 3 ppn | 3 nodes, 1 ppn | 1 node, 12 ppn | 3 nodes, 4 ppn
15000        | 101           | 121            | 171            | 214
25000        | 159           | 182            | 263            | 355
35000        | 194           | 213            | 308            | 450

Table 2: Comparing in-node and multi-node with 3 times the number of CPUs and GPUs, problem size N = 25000; performance in Gflop/s

Single node          | Three nodes
1 CPU, 1 GPU:  63    | 3 CPU, 3 GPU: 166
3 CPU, 1 GPU: 110    | 9 CPU, 3 GPU: 315
3 CPU, 3 GPU: 159    | 9 CPU, 9 GPU: 340

Figure: ScaLAPACK performance (Gflop/s) versus problem size N (15000, 25000, 35000), comparing 3 CPU, 3 GPU within one node and 12 CPU, 3 GPU across 3 nodes.

Table 3: Scaling problem size with 7 nodes (21 CPU, 21 GPU); performance in Gflop/s

Problem Size | Performance    Problem Size | Performance
 5000        |    30          55000        | 1,054
15000        |   351          65000        | 1,252
25000        |   602          75000        | 1,310
35000        |   809          85000        | 1,372
45000        |   975          90000        | 1,526

Table 4: Varying the process grid (P x Q) on single and multi-node runs; performance in Gflop/s

P  | Q | # Nodes | Performance
1  | 3 | 1       | 160
3  | 4 | 1       | 263
12 | 7 | 7       | 455
3  | 7 | 7       | 602

PERFORMANCE -- MAGMA

Table 5: Comparing single-GPU MAGMA to multi-GPU static and dynamic scheduling, problem size N = 25000

Single GPU          | 311 Gflop/s
Multi-GPU, static   | 823 Gflop/s
Single GPU, dynamic |
Multi-GPU, dynamic  |

CONCLUSIONS

Utilizing 3 CPUs and 3 GPUs spread across 3 nodes yielded higher performance than placing them all on a single node

Results show that moving from 3 CPU, 1 GPU on a single node to 9 CPU, 3 GPU on three nodes increases performance by a factor of three (see the check below)

Scaling tests reveal increasing performance from the GPUs as the problem size grows

This shows that at smaller problem sizes the GPUs are left idle for extended periods of time

On one node, a 4:1 CPU:GPU ratio maximized performance, while on seven nodes a 1:1 CPU:GPU ratio maximized performance

At larger problem sizes on many nodes, the O(N³) GEMM operations on the GPUs dominate, so fewer CPUs are needed to factor the panels quickly enough

MAGMA's single-GPU and multi-GPU implementations are very efficient at keeping data in GPU memory, minimizing the overhead of data transfer
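As a quick check of the factor-of-three conclusion, assuming it refers to the 3 CPU, 1 GPU and 9 CPU, 3 GPU entries of Table 2:

    \frac{315\ \mathrm{Gflop/s}}{110\ \mathrm{Gflop/s}} \;\approx\; 2.9 \;\approx\; 3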

REFERENCES

E. D'Azevedo and J. C. Hill, "Parallel LU Factorization on GPU Cluster," Innovative Computing Laboratory, University of Tennessee, Knoxville. http://icl.cs.utk.edu

CONTACT INFO

Kwai Wong kwong@utk.edu

Eduardo D’Azevedo e6d@ornl.gov

M. Blaise DeCotes bdecoate@utk.edu
