
A Performance Study of Solving a Large Dense Matrix for Radiation Heat Transfer

Blaise DeCotes¹, Zhiang Hu², Shiquan Su³, Eduardo D'Azevedo³, Kwai Wong¹

¹University of Tennessee, Knoxville, ²Chinese University of Hong Kong, ³Oak Ridge National Laboratory

OVERVIEW

Solve a large-scale radiation heat transfer problem in parallel

Showcase a parallel finite-to-finite area view factor code

Evaluate the performance of solving a large dense matrix on Keeneland using MAGMA and ScaLAPACK


PARALLEL VIEW3D

Modify View3d, an open-source serial code

Implement pre-detection and elimination of potential obstructions

Use finite-to-finite and differential-to-finite view factor calculations

Order of computation improved from O(N²*X) to O(N²*X/M), where N = # of surfaces, X = # of obstructing surfaces, and M = # of processors (see the sketch below)
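The parallelization pattern behind the O(N²*X/M) cost is roughly the following minimal sketch; this is not the actual parallel View3d code, and compute_viewfactor and the row-striped work distribution are illustrative assumptions:

    /* Minimal sketch: distribute the N x N view factor pairs over M MPI ranks.
     * compute_viewfactor() stands in for the View3d routine that also performs
     * the obstruction tests against the X potentially obstructing surfaces. */
    #include <mpi.h>
    #include <stddef.h>

    double compute_viewfactor(int i, int j);   /* hypothetical placeholder */

    void viewfactors_parallel(int n, double *F /* n*n entries, zero-initialized */)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank evaluates a strided subset of the rows: about N/M rows each. */
        for (int i = rank; i < n; i += nprocs)
            for (int j = 0; j < n; j++)
                if (i != j)
                    F[(size_t)i * n + j] = compute_viewfactor(i, j);

        /* Combine the disjoint pieces so every rank holds the full matrix. */
        MPI_Allreduce(MPI_IN_PLACE, F, n * n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }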

Figure: Parallel view factor calculations on Kraken. Time for the view factor computation plus the potential obstruction computation versus number of processors, for the 18,944-surface CUBI, 20,000-surface L-Shape, and 4-CUBI test cases.

SCALAPACK WITH CUBLAS

Produce LU factors and pivot vector compatible with ScaLAPACK PZGETRF (double complex version only)

Uses a left-looking out-of-core algorithm to solve problems larger than the available device memory on the GPU

Uses ScaLAPACK PZGETRF for factorization of the narrow column panel and CUBLAS v2 for computation on the GPU

Algorithm assumes access to one GPU per MPI task; the amount of device memory used is set by the user

Data on the GPU must be transferred to buffers on the CPU to be communicated by MPI (see the sketch below)
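A minimal sketch of this host-staged communication, assuming a double precision panel and a simple broadcast; the actual out-of-core driver's buffer management and routine names are not shown on the poster:

    /* Stage a device-resident panel through a pinned host buffer for MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void bcast_panel_from_gpu(double *d_panel, size_t count, int root, MPI_Comm comm)
    {
        int rank;
        double *h_buf = NULL;
        MPI_Comm_rank(comm, &rank);
        cudaMallocHost((void **)&h_buf, count * sizeof(double));  /* pinned host buffer */

        if (rank == root)   /* device -> host copy on the owner of the panel */
            cudaMemcpy(h_buf, d_panel, count * sizeof(double), cudaMemcpyDeviceToHost);

        MPI_Bcast(h_buf, (int)count, MPI_DOUBLE, root, comm);     /* MPI sees only host memory */

        /* host -> device copy so every rank ends with the panel on its GPU */
        cudaMemcpy(d_panel, h_buf, count * sizeof(double), cudaMemcpyHostToDevice);
        cudaFreeHost(h_buf);
    }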

EXTENSIONS MADE TO SCALAPACK

Tests on Keeneland were successful at using multiple GPUs across multiple nodes

Code extended to multiple GPUs by allocating a cublasHandle_t based on the MPI_Comm_rank of the process modulo the number of CUDA-enabled devices present (sketched after this list)

Extend to support double precision real arithmetic (the original out-of-core code was double complex only)

Rewrite portions of the source code to translate between the data types (illustrated after this list):

ScaLAPACK PZxxxx calls changed to PDxxxx

cuDoubleComplex changed to double

Function call changed from:

PDGETRF( M, N, MEM( IPA ), 1, 1, DESCA, MEM( IPPIV ), INFO )

to:

PDGETRF_OOC2( M, N, MEM( IPA ), 1, 1, DESCA, MEM( IPPIV ), MEMSIZE, INFO )
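A minimal sketch of the multi-GPU handle allocation described above; the function name is illustrative and error checking is omitted:

    /* Bind each MPI task to a GPU chosen as rank modulo the device count,
     * then create the CUBLAS v2 handle that task will use. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    cublasHandle_t create_handle_for_rank(void)
    {
        int rank = 0, ndev = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);

        cudaSetDevice(rank % ndev);    /* e.g. 3 Tesla M2090 GPUs per Keeneland node */

        cublasHandle_t handle;
        cublasCreate(&handle);         /* one handle per MPI task */
        return handle;
    }

On the CUBLAS side, the double precision extension amounts to swapping the complex routines and types for their real counterparts; these particular GEMM update calls are an assumed illustration, not code taken from the poster:

    #include <cublas_v2.h>
    #include <cuComplex.h>

    /* Double complex update (original PZGETRF-compatible path). */
    void update_z(cublasHandle_t h, int m, int n, int k,
                  const cuDoubleComplex *dA, int lda,
                  const cuDoubleComplex *dB, int ldb,
                  cuDoubleComplex *dC, int ldc)
    {
        const cuDoubleComplex alpha = make_cuDoubleComplex(-1.0, 0.0);
        const cuDoubleComplex beta  = make_cuDoubleComplex( 1.0, 0.0);
        cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    }

    /* Double precision real update (PDGETRF path): cuDoubleComplex -> double. */
    void update_d(cublasHandle_t h, int m, int n, int k,
                  const double *dA, int lda,
                  const double *dB, int ldb,
                  double *dC, int ldc)
    {
        const double alpha = -1.0, beta = 1.0;
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    }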

PERFORMANCE -- SCALAPACK

Experiments were performed on Keeneland; each node has 12 CPU cores at 2.8 GHz and 3 Tesla M2090 GPUs

All performance measures are in Gflop/s
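Assuming the standard LU operation count is used to compute these rates (a common convention, not stated explicitly on the poster), the reported numbers correspond to

    \mathrm{Gflop/s} \;\approx\; \frac{(2/3)\,N^{3}}{t_{\mathrm{factor}} \times 10^{9}}

where t_factor is the factorization time in seconds.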

Table 1: Comparing in-node and multi-node with the same number of CPUs across Keeneland node configurations (all tests with 3 GPUs); performance in Gflop/s

Problem Size | 1 node, 3 ppn | 3 nodes, 1 ppn | 1 node, 12 ppn | 3 nodes, 4 ppn
15000        | 101           | 121            | 171            | 214
25000        | 159           | 182            | 263            | 355
35000        | 194           | 213            | 308            | 450

Table 2: Comparing in-node and multi-node with 3 times the number of CPUs and GPUs, problem size N = 25000; performance in Gflop/s

Single node          | Three nodes
1 CPU, 1 GPU:  63    | 3 CPU, 3 GPU: 166
3 CPU, 1 GPU: 110    | 9 CPU, 3 GPU: 315
3 CPU, 3 GPU: 159    | 9 CPU, 9 GPU: 340

Figure: ScaLAPACK performance (Gflop/s) versus problem size N (15000, 25000, 35000), comparing 3 CPU, 3 GPU within one node and 12 CPU, 3 GPU across 3 nodes.

Table 3: Scaling problem size with 7 nodes (21 CPU, 21 GPU); performance in Gflop/s

Problem Size | Performance    Problem Size | Performance
 5000        |    30          55000        | 1,054
15000        |   351          65000        | 1,252
25000        |   602          75000        | 1,310
35000        |   809          85000        | 1,372
45000        |   975          90000        | 1,526

Table 4: Varying the process grid (P x Q) on single and multi-node runs; performance in Gflop/s

P  | Q | # Nodes | Performance
1  | 3 | 1       | 160
3  | 4 | 1       | 263
12 | 7 | 7       | 455
3  | 7 | 7       | 602

PERFORMANCE -- MAGMA

Table 5: Comparing single-GPU MAGMA to multi-GPU static and dynamic scheduling, problem size N = 25000

Single GPU          | 311 Gflop/s
Multi-GPU, static   | 823 Gflop/s
Single GPU, dynamic |
Multi-GPU, dynamic  |

CONCLUSIONS

Utilizing 3 CPUs and 3 GPUs spread across 3 nodes yielded higher performance than placing them all on a single node

Results show that moving from 3 CPU, 1 GPU on a single node to 9 CPU, 3 GPU on three nodes increases performance by a factor of three (see the check below)

Scaling tests reveal increasing performance from the GPUs as the problem size grows

This shows that at smaller problem sizes the GPUs are left idle for extended periods of time

On one node, a 4:1 CPU:GPU ratio maximized performance, while on seven nodes a 1:1 CPU:GPU ratio maximized performance

At larger problem sizes on many nodes, the O(N³) GEMM operations on the GPUs dominate, so fewer CPUs are needed to factor the panels quickly enough

MAGMA's single-GPU and multi-GPU implementations are very efficient at keeping data in GPU memory, minimizing the overhead of data transfer
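As a quick check of the factor-of-three conclusion, assuming it refers to the 3 CPU, 1 GPU and 9 CPU, 3 GPU entries of Table 2:

    \frac{315\ \mathrm{Gflop/s}}{110\ \mathrm{Gflop/s}} \;\approx\; 2.9 \;\approx\; 3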

REFERENCES

E. D'Azevedo and J. C. Hill, "Parallel LU Factorization on GPU Cluster," Innovative Computing Laboratory, University of Tennessee, Knoxville. http://icl.cs.utk.edu

CONTACT INFO

Kwai Wong kwong@utk.edu

Eduardo D’Azevedo e6d@ornl.gov

M. Blaise DeCotes bdecoate@utk.edu
